Random Forests for Survival, Regression,
and Classification

A Parallel Package for a General Implementation of
Breiman's Random Forests

Theory and Specifications

Udaya Kogalur & Hemant Ishwaran

Introduction

Random Forests for Survival, Regression, and Classification (RF-SRC) is an ensemble tree method for the analysis of data sets using a variety of models. As is well known, constructing ensembles from base learners such as trees can significantly improve learning performance. It was shown by [Breiman, 2001] that ensemble learning can be further improved by injecting randomization into the base learning process --- a method called Random Forests. RF-SRC extends Breiman's Random Forests method and provides a unified treatment of the methodology for models including right censored survival (single and multiple event competing risk), multivariate regression or classification, and mixed outcome (more than one continuous, discrete, and/or categorical outcome). When one continuous or categorical outcome is present, the model reduces to univariate regression or classification respectively. When no outcome is present, the model implements unsupervised learning. The [Five Models] figure summarizes the situation.

Five Models Implemented in RF-SRC

RF-SRC introduces new split rules for each model and gives the user the ability to define and code custom split rules. Deterministic or random splitting is available for all models. Variable predictiveness can be assessed using variable importance measures [Ishwaran et al., 2007] for single as well as grouped variables. Variable selection is implemented using minimal depth [Ishwaran et al., 2010]. Missing data (for x-variables and y-outcomes) can be imputed on both training and test data. The package supports shared memory parallel processing on desktop computers via OpenMP. The package also supports hybrid parallel processing via OpenMP & Open MPI on clusters that implement distributed memory. Execution time decreases almost linearly with the number of cores available.

An overview of this document follows: In the [package overview] we describe the Random Forests algorithm, highlighting the recursion at its core. The formulas, split rules, terminal node estimators, and prediction errors used in the five models are described briefly. In the section on [Variable Selection] the primary approaches for variable predictiveness, variable selection, and identifying interactions between variables are discussed. In the section on [Imputation] the approach to handling missingness is discussed. In the section on [Prediction] a brief discussion of training, restoration, test data, and pruning is given. In the section on [Hybrid Parallel Processing] we demonstrate that the package is capable of non-trivial, massively scalable hybrid parallel computation, and provide details of how this can be achieved. In the [Theory and Specifications] section we provide the mathematics for the split rules, terminal node estimators, and prediction error for each family.

Package Overview

Building a Random Forests model involves growing a binary tree [Breiman, 2001] using user supplied training data and parameters. As shown in the [Five Models] figure, data types must be real valued, discrete, or categorical. The response can be right-censored time and censoring information, or any combination of real, discrete or categorical information. The response can also be absent entirely. The resulting forest contains many useful values which can be directly extracted by the user and parsed using additional functions. The basic algorithm for growing a forest is provided in the [Recursive Algorithm]. The process iterates over ntree, the number of trees that are to be grown. In practice, the iteration is parallelized and trees are grown concurrently, not iteratively; this is explained in more detail in the section on [Hybrid Parallel Processing]. The recursive nature of the algorithm is reflected in the repeated calls to split a node until conditions determine that a node is terminal. Another key aspect of the algorithm is the injection of randomization during model creation, which reduces variance: bootstrapping [Breiman, 1996] the data at the root node is one source of randomization, and random feature selection, controlled by the parameter mtry, is another.

In the [Recursive Algorithm], N is defined as the number of records in the data set, and P as the number of x-variables in the data set. The parameter mtry satisfies 1 <= mtry <= P. At each node, the algorithm selects mtry random x-variables according to a probability vector xvar.wt. The resulting subset of x-variables is examined for the best split according to a splitrule. The parameter nsplit allows one to specify the number of random split points at which an x-variable is tested. The depth of trees can be controlled using the parameters nodesize and nodedepth. The parameter nodesize ensures that the average terminal node size across the forest will be approximately nodesize. The parameter nodedepth forces the termination of splitting when the depth of a node reaches the value specified. Node depth is zero-based from the root node onward. Reasonable models can be formed with the judicious selection of mtry, nsplit, nodesize, and nodedepth without exhaustive and deterministic splitting.

// The result of this algorithm is a forest of binary trees, each of
// which partitions the covariate space into hyper-cubes, yields 
// ensemble statistics for each individual based on terminal node
// membership across the forest, and allows model recovery for
// prediction, proximity, importance and much more.

function GROW(

             $data$,  // Data containing y-outcomes and x-variables.
             $N$,     // Number of records in data.
             $P$,     // Number of x-variables in data.
             $ntree$, // Number of trees in the forest.
             $splitrule$, // Split rule and formula. 
             $mtry$,      // Size of the random subset of $\{1,...,P\}$.
             $xvar.wt$,   // Probabilities of selecting an x-var as an $mtry$ candidate. 
             $nodesize$,  // Average node size over the forest.
             $nodedepth$, // Maximum node depth.
             $nsplit$,    // Number of random split points for each $mtry$ candidate. 
         )

for ($i = 1$ to $ntree$)
{
    Bootstrap the $data$, creating an in-bag, and an out-of-bag
    subset.  Make the root node, and tag all individuals as
    members of this node.

    recurse until node fails to split
    {
        if ((size of node $\ge 2 \times nodesize$) and
            (depth of node $< nodedepth$) and (node is impure))
        {
            Select $mtry$ covariates according to weight vector $xvar.wt$.

            for each $mtry$ covariate
            {
                Conduct deterministic or random splitting according to $nsplit$.
                {
                    Use the specified splitrule to determine the
                    split statistic.  Save the covariate and split
                    point when the split statistic is better than
                    those preceding it.
                }
            }
            if a best split exists
            {
                Split the node at the best split point into a left
                and right daughter.  Tag all individuals according
                to their daughter membership.
            }
        }            
        if the node was split 
        {
            recurse using the left daughter
            recurse using the right daughter                     
        }     
    }
    Save terminal node information.
    for each out-of-bag individual
    {
        Update the ensemble using the terminal node statistics.
    }
    Update performance measures, proximity and importance statistics.
}
Normalize out-of-bag ensemble outputs.  Save forest topology, and
terminal node membership to allow prediction.


Recursive Grow Algorithm
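The grow parameters discussed above map directly onto arguments of rfsrc(). A minimal sketch using the veteran data (introduced in later sections) follows; the particular values are illustrative only, not recommendations.


# Illustrative grow call exercising the parameters of the recursive algorithm.
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran,
               ntree = 500,     # number of trees in the forest
               mtry = 3,        # random subset of x-variables tried at each node
               nsplit = 10,     # random split points per mtry candidate
               nodesize = 6,    # average terminal node size over the forest
               nodedepth = 5)   # maximum depth of any tree
print(v.obj)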

A simulated classification training data set is shown on the left of the figure labeled [Tree Decision Boundary]. The data set has two real valued features, $x_1$ and $x_2$. The response is a class label that can take on four values. The class sizes are equal. Each class covers a circular area. The circles touch at a single point: the origin. At this point, four data points with all four class labels exist. On the right of the figure is a tree produced from the training data with associated split points labeled. The tree was grown using bootstrapping [Breiman, 1996]. Bootstrapping partitions the data into two well-defined subsets. The bootstrap cases, some of which are duplicates, form the in-bag (IB) subset. The remaining individuals define the out-of-bag (OOB) subset for the tree. See [Breiman, 1996] for a detailed description of these concepts. Only the in-bag cases are used to train the model and to define terminal node statistics. In this model, the terminal node statistic is the resulting frequency distribution of the class labels in a terminal node. This distribution is shown under each terminal node in the right of the figure. The decision boundary on $x_1$ and $x_2$ formed by the tree is superimposed on the data space along with the six terminal node labels.

Tree Decision Boundary

The OOB subset for a tree does not play a role in determining its topology. Each individual in the OOB subset for a tree is dropped down the tree passively and assigned a unique terminal node membership and terminal node statistic. An OOB ensemble statistic for each individual is formed by combining the terminal node statistics from all trees in which the individual is an OOB member. The class with the maximum frequency in the OOB ensemble statistic serves as the predicted class label for the member. In the figure labeled [Forest Decision Boundary] a single test data point centered at the origin is sent into the previously grown forest for prediction. Each tree provides a unique terminal node membership and statistic for this test individual. The resulting forest ensemble statistic yields the class with the maximum frequency as the predicted class label for this test data point. We note that the probabilities of all classes are roughly equal. This is because the training data set has one case in each class at the origin. The maximum class frequency for this example is yellow, though we can ascribe this to Monte Carlo effects. When the x-variable space is densely covered with test data points, the predicted values for the data space reveal the forest decision boundary. The boundary is also shown in the lower half of the figure.
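The OOB ensemble described above is directly available from a fitted object. A small sketch for a classification forest follows; it assumes, as in current versions of the package, that the OOB class frequencies and predicted labels are stored in the components predicted.oob and class.oob.


# OOB ensemble class frequencies and predicted labels for a classification forest.
iris.obj <- rfsrc(Species ~ ., data = iris)
# OOB ensemble statistic (class frequencies) for the first few cases.
print(head(iris.obj$predicted.oob))
# Class with the maximum OOB frequency, i.e. the OOB predicted label.
print(head(iris.obj$class.oob))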

Forest Decision Boundary

Splitting and Node Size

It is important to understand how splitting and node size affect the topology of a tree. To aid in this, some notation is introduced first. Assume $h$ is the node of a tree, and that the task is to split $h$ into two daughter nodes. Assume there are n cases in the node $h$. Some of these may be replicates if bootstrapping is in effect. At the root node, n = N. Let there be P x-variables in the data set. These can be real valued, discrete, or categorical. Let ${\bf X}_i$ be the P-dimensional feature for a case $i$ such that ${\bf X}_i = (X_{i1}, \ldots, X_{i {\tt P}})$. Let $x$ be the observed value for the $p^{th}$ feature in the data set, where $p \in \{1,\ldots, {\tt P}\}$.

Real versus categorical splitting

If $x$ is continuous or discrete, a proposed split of $h$ on $x$ is always of the form $x \le c$ and $x > c$. Such a split defines left and right daughter membership according to an individual's value $X_{ip}$. A deterministic split on $x$ requires considering all unique values of $x$ in the node $h$.

If $x$ is categorical, denoting the split is more complex. For computational coherence, categorical x-variables (and y-outcomes for that matter) are treated as factors with an immutable 1:1 mapping of categories to integers. Thus, an x-variable with $f$ categories (or levels) is uniquely mapped to the $f$ integers $\{1, \ldots, f\}$. A split on a factor with $f$ levels in the node $h$ requires dividing the levels into two complementary subsets. Left and right daughter membership is determined according to an individual's level $X_{ip}$ and to which of the two complementary subsets it belongs. A deterministic split on a factor requires considering all possible complementary pairings of levels. For example, a factor with six levels in a node has \[ C(6, 1) + C(6, 2) + C(6, 3)/2 \] possible splits. These possibilities can be considered to be composed of three groups. The first group has one level going left and five levels going right; there are $C(6, 1)$ such splits. The second group has two levels going left and four levels going right; there are $C(6, 2)$ such splits. The third group has three levels going left and three levels going right; because the two subsets are complementary, there are $C(6, 3)/2$ distinct such splits. These three groups comprise the cardinal group count. In general, there are $floor(f/2)$ cardinal groups. The total number of possible splits (complementary pairings) is $2^{f-1} - 1$.
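The counts above are easy to verify directly in R; the short sketch below enumerates the cardinal groups for a factor with f levels and checks the total against $2^{f-1} - 1$.


# Number of distinct complementary pairings for a factor with f levels.
factor.splits <- function(f) {
  k <- 1:floor(f / 2)
  counts <- choose(f, k)
  # When f is even, the f/2 versus f/2 group double counts each pairing.
  if (f %% 2 == 0) counts[length(counts)] <- counts[length(counts)] / 2
  sum(counts)
}
print(factor.splits(6))   # 6 + 15 + 10 = 31
print(2^(6 - 1) - 1)      # agrees: 31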

Deterministic splitting on factors becomes problematic because the number of complementary pairings grows exponentially with the number of levels. There is a practical limit on the number of levels a factor can have within our implementation. It is imposed by UINT_MAX in limits.h on the operating system in question. A typical value for this is $2^{32} - 1$ on most machines. Overrides are in place to avoid deterministic splitting on large factors. The nature of the override depends on whether the user has specified splitrule = "random" and/or nsplit > 0. In general, we try to limit the number of split attempts to be no greater than the node size. By definition, this is always the case when considering continuous x-variables. That is, if a node has $n'$ bootstrap cases, there will be at most $n'$ values on which to split. To encourage good splits, the algorithm attempts to emulate the same behaviour on factors. In some cases this will result in deterministic splitting. In others, it will involve sampling from the $2^{f-1} - 1$ complementary pairings. Whether the x-variable is a factor or continuous, the number of split points sampled never exceeds the node size, regardless of the value of nsplit.

Formulas for splits on continuous x-variables are more easily conveyed to the reader because left and right daughter splits depend on a simple inequality. For this reason, the split rules described in the [Theory and Specifications] section are presented in terms of discrete and continuous x-variables. However, the methodology applies equally to factor x-variables. In addition, for clarity, the formulas described are for splits at the root node, where $n = N$.

Deterministic versus random splitting

In contrast to deterministic splitting, a random version of any split rule may be invoked using the parameter nsplit. When nsplit = 0, deterministic splitting occurs and all possible split points for each of the mtry variables are considered. If nsplit is set to a non-zero positive integer, then a maximum of nsplit split points are randomly chosen for each of the mtry variables within a node. The split rule is applied to the randomly chosen split points, and the node is split on that x-variable and split value yielding the best split statistic as measured by the split rule. Pure random splitting can be invoked by setting splitrule="random". In this case, for each node, a variable is randomly selected and the node is split using one randomly chosen split point. See [Cutler and Zhao, 2001] and [Lin and Jeon, 2006] for more details.

Trees tend to favor splits on continuous variables [Loh and Shih, 1997], so it is good practice to use the nsplit option when the data contains a mix of continuous and discrete variables. Using a reasonably small value mitigates this bias and need not compromise accuracy. The value of nsplit also has a significant impact on the time taken to grow a forest. When deterministic splitting is in effect, i.e. nsplit = 0, iterating over every split point can be CPU intensive. When nsplit > 0, or when pure random splitting is in effect, CPU times are drastically reduced.
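The effect of nsplit on run time can be observed directly. The rough sketch below uses the airquality regression example from later sections; absolute timings depend on the machine and are illustrative only.


# Deterministic splitting: every unique split point is evaluated.
print(system.time(rfsrc(Ozone ~ ., data = airquality, nsplit = 0)))
# Random splitting: at most 10 candidate split points per mtry variable.
print(system.time(rfsrc(Ozone ~ ., data = airquality, nsplit = 10)))
# Pure random splitting: one random variable and split point per node.
print(system.time(rfsrc(Ozone ~ ., data = airquality, splitrule = "random")))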

Node depth and node size

Node size and node depth are the most explicit means of controlling the growth of a tree. The parameter nodedepth is defined as the number of connections between the node and the root node. It is zero-based, thus defining the root node as having depth zero. Setting nodedepth limits the maximum depth to which a tree can be grown. The default behaviour is to ignore this parameter and not limit the depth of a tree. In contrast, nodesize is always respected and is crucial to achieving a good model. When nodesize is set too large, the forest is under-grown and model accuracy is compromised. When it is set too small, the potential to split on noisy x-variables increases, the forest is over-grown, and model accuracy can actually deteriorate past a certain threshold value of nodesize. We define nodesize as the average number of bootstrap cases that will be found in a terminal node in a tree in the forest. Splitting can terminate well before this condition is reached if node purity is detected before attempting to split a node. From the [Recursive Algorithm] it can be seen that three conditions are necessary for splitting a node:
  1. The current node depth must be less than the maximum node depth allowed.
  2. The current node size must be at least 2 times the node size specified.
  3. The current node must be impure.
Condition (2) assures us that a split only occurs on a node containing at least 2 times the specified nodesize, so that the average size of the resulting pair of daughters is at least nodesize. If one daughter is smaller than nodesize, the other daughter will be larger than nodesize, and such terminal nodes average to nodesize over the forest.
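One quick way to see these controls in action is to compare terminal node counts under different settings. The sketch below assumes the fitted object exposes the per-tree terminal node counts in the component leaf.count.


# Shallow trees: small depth limit and large node size.
shallow <- rfsrc(Ozone ~ ., data = airquality, nodedepth = 3, nodesize = 20)
# Deeper trees: unlimited depth and small node size.
deep <- rfsrc(Ozone ~ ., data = airquality, nodesize = 2)
# Average number of terminal nodes per tree for each model.
print(mean(shallow$leaf.count))
print(mean(deep$leaf.count))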

Model Overview

This section is intended as an overview to orient the reader to more high-level package details concerning the [Five Models] produced by the package. Each model requires a slightly different formula specification, and uses model-specific split rules. This results in model-specific terminal node statistics, ensembles, and a model-specific prediction error algorithm. For completeness and rigour, the mathematical details of all models are given in the [Theory and Specifications] section. Survival, Competing Risk, Regression, Classification, Multivariate and Unsupervised split rules are described, along with the terminal node estimators, and the prediction error for these families. In addition, a short code example for each model is given.

In the [Formula Specification] table, example grow calls for the five models are listed. The data sets used in the two Survival function calls are available in randomForestSRC. The data sets used in the remaining function calls are available from the base R installation. Survival settings require a time and censoring variable which need to be identified in the formula as the response using the standard Surv formula specification. Regression and Classification settings emulate the formula specifications of the randomForest package. In the Multivariate and Unsupervised case, there are variants for the formula, depending on how many y-outcomes are to be specified as the response. These variants are given in more detail in this section.

Family           Example Grow Call with Formula Specification
Survival         rfsrc(Surv(time, status) ~ ., data = veteran)
Competing Risk   rfsrc(Surv(time, status) ~ ., data = wihs)
Regression       rfsrc(Ozone ~ ., data = airquality)
Classification   rfsrc(Species ~ ., data = iris)
Multivariate     rfsrc(Multivar(mpg, cyl) ~ ., data = mtcars)
Unsupervised     rfsrc(Unsupervised() ~ ., data = mtcars)

In the [Split Rules] table, the rule in italics denotes the default split rule for each model. The default split rule is applied when the user does not specify a split rule. The package uses the data set and formula specification to determine the model. Survival and Competing Risk both have two split rules. Regression has three flavours of split rules based on mean-squared error. Classification also has three flavours of split rules based on the Gini index. The Multivariate and Unsupervised split rules are a composite rule based on Regression and Classification. Each component of the composite is normalized so that the magnitude of any one y-outcome does not influence the statistic. See the [Theory and Specifications] section for more details on the mathematics of the split for each family.

Family           Split Rules
Survival         log-rank \eqref{eqn:survival.logrank}
                 log-rank score \eqref{eqn:survival.logrank.score}
Competing Risk   modified weighted log-rank \eqref{eqn:cr.modified.logrank}
                 log-rank \eqref{eqn:cr.logrank}
Regression       mean-squared error weighted \eqref{eqn:regression.weighted}
                 mean-squared error unweighted \eqref{eqn:regression.unweighted}
                 mean-squared error heavy weighted \eqref{eqn:regression.heavy.weighted}
Classification   Gini index weighted \eqref{eqn:classification.weighted}
                 Gini index unweighted \eqref{eqn:classification.unweighted}
                 Gini index heavy weighted \eqref{eqn:classification.heavy.weighted}
Multivariate     Composite mean-squared error
                 Composite Gini index
Unsupervised     Composite mean-squared error
                 Composite Gini index
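A non-default rule is requested through the splitrule argument. The sketch below assumes the rule names "logrankscore" and "random" accepted by the package for survival data.


# Survival forest grown with the log-rank score rule instead of the default log-rank rule.
data(veteran, package = "randomForestSRC")
v.score <- rfsrc(Surv(time, status) ~ ., data = veteran, splitrule = "logrankscore")
# Pure random splitting on the same data.
v.random <- rfsrc(Surv(time, status) ~ ., data = veteran, splitrule = "random")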

All models also allow the user to define a custom split rule statistic. Some basic C-programming skills are required. Examples for all the families reside in the C source code directory of the package in the file splitCustom.c. Note that recompiling and re-installing the package is necessary after modifying the source code.

In the table [Terminal Node Statistics] (TNS), the terminal node estimators produced by the five models are summarized. For Survival, the TNS is the Kaplan-Meier estimator at the time points of interest specified by the user, or as determined by the package if not specified. Competing Risk has two TNS's: the cause-specific cumulative hazard estimate, and the cause-specific cumulative incidence function. Regression and Classification TNS's are the mean and class proportions respectively. For a Multivariate model, there are TNS's for each response, whether it is real valued, discrete or categorical. The Unsupervised model has no TNS, as the analysis is responseless. See the [Theory and Specifications] section for more details on the mathematics of the TNS for each family.

Family           Terminal Node Statistics
Survival         Kaplan-Meier \eqref{eqn:kaplan.meier}
Competing Risk   cause-specific cumulative hazard \eqref{eqn:cause.specific.chf}
                 cause-specific cumulative incidence \eqref{eqn:cause.specific.cif}
Regression       mean \eqref{eqn:mean}
Classification   class proportions \eqref{eqn:class.proportions}
Multivariate     mean per response \eqref{eqn:mean}
                 class proportions per response \eqref{eqn:class.proportions}
Unsupervised     none

In the table [Prediction Error], the error rate calculation for the five models is summarized. For Survival, it is based on Harrell's concordance index, using the cumulative hazard estimate as the values for comparison. For Competing Risk, Harrell's concordance index uses the cause-specific cumulative hazard estimate. For Regression, performance is based on mean-squared error. For Classification, performance is based on the conditional and overall misclassification rates. For the Multivariate and Unsupervised case, there is no prediction error implemented. See the [Theory and Specifications] section for more details on the mathematics of the prediction error for each model.

Family           Prediction Error
Survival         Harrell's C-index using the cumulative hazard \eqref{eqn:survival.mortality} and \eqref{eqn:concordance}
Competing Risk   Harrell's C-index using the cause-specific cumulative hazard \eqref{eqn:cause.specific.mortality} and \eqref{eqn:concordance}
Regression       mean-squared error \eqref{eqn:mean.squared.error}
Classification   conditional and overall misclassification rate \eqref{eqn:misclassification.rate} and \eqref{eqn:conditional.misclassification.rate}
Multivariate     none
Unsupervised     none
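The error rates summarized above are returned in the err.rate component of the fitted object, computed cumulatively over the trees. A brief sketch for classification and regression follows.


# Classification: overall and per-class (conditional) OOB misclassification rates.
iris.obj <- rfsrc(Species ~ ., data = iris)
# err.rate accumulates over trees; the last entry is the final OOB estimate.
print(tail(iris.obj$err.rate, 1))
# Regression: OOB mean-squared error.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(tail(airq.obj$err.rate, 1))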

We conclude this section by giving examples of a mixed multivariate model and an unsupervised model. The data set mtcars, available in the base distribution of R, is used. The usual outcome for the data set is mpg (miles per gallon). First, we modify the data set to force carb (number of carburetors) and cyl (number of cylinders) to be categorical variables. We then produce a mixed model in which all three of these variables are considered outcomes. The code is given below:


# Multivariate mixed outcome: Motor Trend Car Road Tests.  Miles per
# gallon is the usual response, but the data set is modified to
# introduce categorical and ordinal variables.
mtcars.mod <- mtcars
mtcars.mod$cyl <- factor(mtcars.mod$cyl)
mtcars.mod$carb <- factor(mtcars.mod$carb, ordered=TRUE)
mtcars.mix <- rfsrc(cbind(carb, mpg, cyl) ~., data = mtcars.mod)

We also show various ways to implement an unsupervised model in the code below. It is possible to choose one or more pseudo-responses at each split via the formula specification.


# Motor Trend Car Road Tests.  Miles per gallon is the usual response,
# but it is ignored below and the data set is taken to be responseless.
mtcars.unspv <- rfsrc(Unsupervised() ~., data = mtcars)
# The above is the equivalent of:
mtcars.unspv <- rfsrc(Unsupervised(1) ~., data = mtcars)
# Alternatively and equivalently:
mtcars.unspv <- rfsrc(data = mtcars)
# Choose two pseudo responses:
mtcars.unspv <- rfsrc(Unsupervised(2) ~., data = mtcars)

Variable Selection

The package has two primary approaches for variable selection and identifying interactions between variables. The function max.subtree() extracts maximal subtree information from a forest [Ishwaran et al., 2010], [Ishwaran et al., 2011]. The function vimp() extracts variable importance information from a forest [Ishwaran et al., 2007].

Minimal Depth and Maximal Subtrees

Minimal depth is a dimensionless order statistic that measures the predictiveness of a variable in a tree. It can be used to select variables in high-dimensional problems. It assesses the predictiveness of a variable by a depth calculation relative to the root node of a tree. A maximal subtree for an x-variable $x$ is the largest subtree whose root node splits on $x$ (i.e., all of the parent nodes of $x$'s maximal subtree split on variables other than $x$). There can be more than one maximal subtree for a variable. A maximal subtree will not exist if there are no splits on the variable. The shortest distance from the root of the tree to the root of the closest maximal subtree of $x$ is the minimal depth of $x$. A smaller value corresponds to a more predictive variable. The mean of the minimal depth distribution is used as the threshold value for deciding whether an x-variable's minimal depth value is small enough for the x-variable to be classified as strong. An important property of minimal depth is that it does not depend on a comparison of error rates, as variable importance does (described later); it depends only on the topology of the forest. See [Ishwaran et al., 2010] and [Ishwaran et al., 2011] for more details. The maximal subtrees for $x_1$ in the tree shown in the figure labeled [Tree Decision Boundary] are highlighted in blue in the figure labeled [Maximal Subtrees for $x_1$]. There are exactly two of them. The minimal depth for $x_1$ is equal to one.

Maximal Subtrees for $x_1$

Examples of minimal depth and maximal subtrees follow:


# Survival Analysis with first and second order depths for all variables.
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ . , data = veteran)
v.max <- max.subtree(v.obj)

# First and second order depths.
print(round(v.max$order, 3))

# The minimal depth is the first order depth.
print(round(v.max$order[, 1], 3))

# Strong variables have minimal depth less than or equal to the threshold.
print(v.max$threshold)

# This corresponds to the top set of variables above the threshold.
print(v.max$topvars)

Variable Importance

The function vimp() extracts variable importance (VIMP) information [Ishwaran et al., 2007] from a forest for a single x-variable or group of x-variables. By default, VIMP is calculated for the training data, but the user can specify new test data for the VIMP calculation using the parameter newdata.

Action on the training data is as follows: The parameter importance allows four distinct ways to calculate VIMP. The default value of importance="permute" returns the Breiman-Cutler permutation VIMP as described in [Breiman, 2001]. For each tree, the prediction error on the OOB data is recorded. Then for a given x-variable $x$, OOB cases are randomly permuted in $x$ and the prediction error is recorded. The VIMP for $x$ for a tree is defined as the difference between the perturbed and unperturbed error rate for that tree. This is then averaged over all trees. If importance="random" is used, then OOB cases are assigned a daughter node randomly whenever a split on $x$ is encountered in the tree. Again, the VIMP for $x$ is defined as the difference between the perturbed and unperturbed error rate averaged over all trees. If importance="permute.ensemble" or importance="random.ensemble" is used, then VIMP is calculated by comparing the error rate for the perturbed OOB forest ensemble to the unperturbed OOB forest ensemble. Unlike the Breiman-Cutler measure, these last two VIMP ensembles do not measure the tree average effect of perturbing $x$, but rather the overall forest effect. See [Ishwaran et al., 2008] for further details.

Joint VIMP is requested using joint=TRUE. The joint VIMP is the importance for a group of variables when the group is perturbed simultaneously.

VIMP is completely dependent on prediction error. Although tempting, it is incorrect to assume VIMP measures the change in prediction error for a forest grown with and without a variable. For example, if two variables are highly correlated and both predictive, each can have large VIMP values. Removing one variable and regrowing the forest may affect the VIMP for the other variable (its value might get larger), but prediction error will likely remain unchanged. VIMP measures the change in prediction error on a fresh test case if $x$ were not available, given that the original forest was grown using $x$. In practice, this often equals change in prediction error for a forest grown with and without $x$, but conceptually the two quantities are different. Examples follow:


# Classification example showcasing different vimp measures.
iris.obj <- rfsrc(Species ~ ., data = iris)

# Breiman-Cutler permutation vimp.
print(vimp(iris.obj)$importance)

# Breiman-Cutler random daughter vimp.
print(vimp(iris.obj, importance = "random")$importance)

# Breiman-Cutler joint permutation vimp.
print(vimp(iris.obj, joint = TRUE)$importance)

# Breiman-Cutler paired vimp.
print(vimp(iris.obj, c("Petal.Length", "Petal.Width"), joint=TRUE)$importance)
print(vimp(iris.obj, c("Sepal.Length", "Petal.Width"), joint=TRUE)$importance)

Imputation

To handle data sets with missing data, the package provides two interfaces to impute missing values. If a predictive model is desired, then the user can create a forest with a call to rfsrc(). If all that is desired is a data set containing imputed values, without the need for a model, then a call to impute() will suffice. If no formula is specified, unsupervised splitting is implemented. The number of impute iterations is specified via the parameter nimpute. The number of impute iterations corresponds to the number of forests grown. When a predictive model is desired, only the final forest is retained. Note that two impute iterations were found to be the minimum necessary to produce reasonable imputed values.

On the first impute iteration, imputation is conducted in each internal node as the tree is grown. This in-node imputation is used to determine left and right daughter membership during the grow process. Note that imputed values are never used in split statistics on the first impute iteration. On the second and subsequent impute iterations, in-node imputation is absent. Instead, the forest is grown as if there were no missing data, with the missingness holes in the data plugged by the imputed values determined by the previous impute iteration. Imputation on the second and subsequent iterations only occurs in the terminal nodes, after the forest has been grown. The default setting for rfsrc() is nimpute = 1. The default setting for impute() is nimpute = 2.

Once a model has been created using rfsrc(), test data with or without missingness can be used with predict.rfsrc(). Values in the test data set are imputed based on the non-missing values in the training data set. Internally, both rfsrc() and impute.rfsrc() use variations of the missing data algorithm in [Ishwaran et al., 2008]. Setting na.action = "na.impute" imputes missing data (both x-variables and y-outcomes), while na.action = "na.omit" eliminates records with missingness and then grows the forest. Examples follow:


# Examples of survival imputation.

# Create a model via imputation.
data(pbc, package = "randomForestSRC")
pbc.d <- rfsrc(Surv(days, status) ~ ., data = pbc)

# Impute, using unsupervised splitting.
pbc2.d <- impute.rfsrc(data = pbc, nodesize = 1, nsplit = 10)
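Missing values in test data are handled the same way at prediction time, using the non-missing training information. A short sketch under the na.action = "na.impute" setting follows.


# Grow a model on data with missingness, imputing while training.
data(pbc, package = "randomForestSRC")
pbc.trn <- rfsrc(Surv(days, status) ~ ., data = pbc, na.action = "na.impute")

# Predict on data containing missing values; the test values are imputed
# from the non-missing training information.
pbc.prd <- predict(pbc.trn, newdata = pbc, na.action = "na.impute")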

Prediction

Predicted values are obtained by dropping test data down the forest after the model has been created. The overall error rate is returned if the test data contains a y-outcome. If no test data is provided, then the original training data is used and the code reverts to restoring the forest, its associated terminal node statistics, and ensembles. This is useful in that it gives the user the ability to extract outputs from the forest that were not asked for during model creation. The figure [Forest Decision Boundary] is an example of how prediction can be used to view the forest decision boundary imposed on the x-variables. Examples of training, testing, and restoration follow:

Training/testing scenario:


data(veteran, package = "randomForestSRC")
# Sample the data and create a training subset.
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
# Train the model.
veteran.grow <- rfsrc(Surv(time, status) ~ ., veteran[train, ], ntree = 100)
# Test the model.
veteran.pred <- predict(veteran.grow, veteran[-train , ])
# Compare the results.
print(veteran.grow)
print(veteran.pred)

Restoration scenario:


# Train the model.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
# Restore the model.
print(predict(airq.obj))
# Note that the restored results are identical to the original grow results.
print(airq.obj)
# Retrieve an output that was not asked for in the original call.
prox <- predict(airq.obj, proximity = TRUE)$proximity

outcome = "test"

If outcome = "test", the training data is not used to calculate the terminal node statistics. Instead, the TNS are recalculated using the y-outcomes from the test set. This yields a modified predictor in which the invariant topology of the forest is based on the training data, but the ensembles and predicted values are based on the test data. The terminal node statistics, ensembles, and error rates are all calculated by bootstrapping the test data and using OOB individuals to ensure unbiased estimates.

Consider the forest depicted previously in [Forest Decision Boundary] and the associated forest ensemble for the single test point at the origin. The original training data and ensemble are shown in the top left and top right of [outcome = "test"]. New test data, formed by reducing the radii of the blue and yellow classes, is shown in the bottom left of the figure. The origin now contains only two data points, namely one red and one green point. Using the new test data to populate the terminal nodes results in an ensemble that is missing the blue and yellow classes. This ensemble is shown in the bottom right of the figure.

outcome = "test"

The function for creating the spheres is given in the section [Supplementary Code], and the higher level code used to produce the [outcome = "test"] figure follows:


## Four circular regions with equal class counts, touching at the origin.
## We suppress the X3 dimension in order to get a circle from a sphere.
circles <- get.spheres()[, c(1, 2, 4)]

## Grow the forest.
grow.result <- rfsrc(outcome ~ ., data = circles, importance = "none")

## Get a new test data set by restricting yellow and blue at the origin.
test.circles <- get.spheres(class.radius = c(1.0, 0.75, 1.0, 0.75))[, c(1, 2, 4)]

## Predict using the training data for terminal node statistics.
outcome.train <- predict(grow.result, newdata=test.circles)

## The origin (first row) will have roughly equal classes.
print(outcome.train$predicted[1, ])

## Predict using the test data for terminal node statistics.
outcome.test <- predict(grow.result, newdata = test.circles, outcome = "test")

## The origin (first row) will only have the red and green classes.
print(outcome.test$predicted[1, ])

Pruning

There are a number of ways to limit the growth of trees in a model. The most obvious are the parameters nodedepth and nodesize discussed in the section [Node depth and node size]. Decreasing nodedepth or increasing nodesize will result in shallower trees. After a forest has been created, the user is able to prune trees back using the parameter ptn.count. This parameter represents the pseudo-terminal node count. It allows us to specify the desired number of pseudo-terminal nodes after a tree has been grown. This is not useful for Random Forests analysis per se, but it can be useful in other applications. Trees are flexible and adaptive nonparametric estimators and, as such, they represent ideal weak learners for implementing gradient boosting [Friedman, 2001]. Boosted trees for regression and classification, where exactly J terminal nodes (for any integer value of J) are needed, are easily accomplished using the parameter ptn.count. This particular application of the randomForestSRC package is incorporated into the boostmtree package on CRAN [Ishwaran and Kogalur, 2016]. Pruning is accomplished by deleting terminal nodes from the maximum depth back toward the root until the desired pseudo-terminal node count is achieved. Daughter nodes are deleted in pairs, so a parent node becomes a pseudo-terminal node after its daughter terminal nodes have been deleted at the current maximum depth.
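The text above treats ptn.count as a post-grow parameter applied when the forest is revisited. The sketch below is a hypothetical illustration that assumes ptn.count is passed to predict() during restoration; consult the package documentation for the exact entry point.


# Grow a regression forest, then restore it pruned back to roughly eight
# pseudo-terminal nodes per tree (ptn.count usage assumed, as described above).
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
airq.ptn <- predict(airq.obj, ptn.count = 8)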

Hybrid Parallel Processing

This package is capable of non-trivial parallel processing with massive scalability. Growing a forest has a recursive component, as was shown in the [Recursive Algorithm], but it is also an iterative process that is repeated ntree times. This is depicted in the figure [Iterative Process]. The entry point is a user-defined R script that initiates the grow call rfsrc(). This function pre-processes the data set and calls a grow function in the package's C library, which grows the forest. The C function grows a single tree, adds the resulting tree-specific ensemble(s) to the forest ensemble(s), and continues this process for ntree iterations. The C function then returns the forest ensemble(s) to the R function, which post-processes the results. It is advantageous to parallelize this iterative process; such situations are often called "pleasingly parallel". The default pre-compiled binaries available on CRAN should execute in parallel. If not, the package can be recompiled to enable parallel execution, provided the host architecture, operating system, and installed compilers (OpenMP and Open MPI) support it.

Forest Growth as an Iterative Process

A graphic showing one mode of parallel execution (OpenMP) on a typical desktop/laptop computer is shown in the figure [Shared Memory Parallel Processing]. In this example, the hardware on the left of the figure is a MacBook Pro (2015 model) with two quad-core CPUs and 16GB of RAM. This hardware allows eight threads to execute simultaneously. The memory is shared among all cores, and this configuration is typical of symmetric multiprocessing (SMP), where two or more identical processors connect to a shared memory bank. Most reasonably configured multi-core notebooks and desktops are ideal hosts for randomForestSRC and OpenMP parallel execution. The right side of [Shared Memory Parallel Processing] shows the software implementation of the OpenMP model. The R code reads the data set, pre-processes it, calls the C code, and requests a forest of ntree trees. When OpenMP parallel execution is in force, the C code grows trees simultaneously rather than iteratively. Each core is tasked with growing a tree. Once a core completes a tree, the forest ensemble values are incremented with the contribution of that tree, and the core is re-tasked with growing another tree. This continues until the entire forest is grown. Control is then returned to the R code, which post-processes the results. On the hardware shown in the figure, it is possible to grow eight trees at a time. Under ideal conditions, the elapsed time to grow a forest should decrease by a factor of close to eight.

OpenMP: Shared Memory Parallel Processing

In order to give the user some flexibility, core usage can be controlled via R options. First, the user needs to determine the number of cores on the machine. This can be done within an R session by loading the parallel package and issuing the command detectCores(). It is possible to set the number of cores accessed during OpenMP parallel execution at the start of every R session by issuing the command options(rf.cores = x), where x is the number of cores. If x is a negative number, the package will access the maximum number of cores on the machine. The options command can also be placed in the user's .Rprofile file for convenience. Alternatively, one can initialize the environment variable RF_CORES in the user's command shell environment.
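These controls are ordinary R options and shell environment settings; for example:


# Detect the number of cores, then allow the package to use all of them.
library(parallel)
print(detectCores())
options(rf.cores = detectCores())
# Equivalently, a negative value requests the maximum number of cores.
options(rf.cores = -1)
# Or, outside of R, in the command shell:  export RF_CORES=8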

Gains in performance on a desktop (or a single SMP node) are limited by the number of cores available on the hardware. The OpenMP parallel model shown in [Shared Memory Parallel Processing] that relies on an underlying iterative process was chosen for its suitability in accommodating shared memory systems and its ease of implementation. Model creation is computationally intense and the OpenMP parallel processing model allows us to take full advantage of the hardware available.

Message Passing Interface (MPI), an alternative model to OpenMP, is suitable for distributed memory systems, in which the number of cores available can be much greater than that of a desktop computer. It is possible to run randomForestSRC on appropriately designed and configured clusters, allowing for massive scalability. An ideal cluster for the package is made up of tightly connected SMP nodes. Each SMP node has its own memory, which it does not share with other nodes. Nodes, however, do communicate with each other using MPI.

Hybrid: MPI/OpenMP Parallel Processing

In the figure [MPI/OpenMP Parallel Processing], a generic cluster with four SMP nodes is depicted. Each node can be considered similar to a desktop computer. However, in a cluster setting, each node is tightly supervised and controlled by a top-level layer of cluster software that handles job scheduling and resource management. On the left of the figure, four nodes are connected via a network that facilitates MPI communications across nodes. The right of the figure depicts the software implementation of hybrid parallel processing using the randomForestSRC package. A primary/replica framework is chosen whereby a primary node on the cluster is tasked with growing the forest. The primary node divides the forest into sub-forests. If the size of the forest is ntree, then each replica in this example is tasked with growing a sub-forest of size ntree/4. Hybrid parallel processing depends on the Rmpi package. This package must be available; if it is not, it must be compiled and installed from the source code available on CRAN. Rmpi is a mature package that provides flexible low-level functionality. Other explicitly parallel general-use packages on CRAN, such as snow, snowfall, and foreach, provide higher-level, user-friendly functions that, while potentially useful, were not considered as flexible and customizable in a cluster environment.

In the hybrid example that follows, Rmpi is loaded on the primary node, and MPI processes are then spawned on each replica. Aside from the fact that the replicas can communicate with the primary using MPI, each replica acts like an SMP node in [Shared Memory Parallel Processing]. A replica initiates concurrent OpenMP threads across its cores and grows its sub-forest using shared memory OpenMP parallel processing. It is possible to grow many more trees concurrently on a cluster, because the forest is spread among many more cores than are available on a single node or desktop computer. In [MPI/OpenMP Parallel Processing], the forest uses 32 cores concurrently. In an ideal environment, the elapsed time to grow the original forest will be reduced by a factor of 32. After all replicas have completed growing their sub-forests, they return their data to the primary process via MPI messages, and the replicas terminate. On a single SMP node (equivalent to a desktop computer), a thread on a core starts, grows a tree, terminates, and repeats this process until all trees have been grown. A core is either waiting for a task, working on a task, or has completed a task; a task, in this case, is the growing of a tree. On a cluster, a replica node starts, grows a sub-forest, and terminates, repeating this process until all sub-forests requested by the primary node have been grown. A replica is either waiting for a sub-forest, working on a sub-forest, or has completed a sub-forest.
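A heavily simplified sketch of the primary/replica idea using Rmpi is given below. It is not the harness from the [Supplementary Code] section: rather than merging the sub-forests, it simply collects each replica's OOB predictions on the broadcast training data and averages them, which is enough to show the division of labour.


# Primary node: spawn four replicas, one per SMP node.
library(Rmpi)
library(randomForestSRC)
mpi.spawn.Rslaves(nslaves = 4)

# Ship the training data, the sub-forest size, and the grow function to the replicas.
data(veteran, package = "randomForestSRC")
sub.ntree <- 1024 / 4
grow.sub <- function() {
  library(randomForestSRC)
  # Each replica grows its sub-forest with OpenMP threads on its own cores.
  obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = sub.ntree)
  obj$predicted.oob
}
mpi.bcast.Robj2slave(veteran)
mpi.bcast.Robj2slave(sub.ntree)
mpi.bcast.Robj2slave(grow.sub)

# Replicas grow their sub-forests and return their OOB mortality predictions.
preds <- mpi.remote.exec(grow.sub(), simplify = FALSE)

# Primary node: combine the replicas' predictions and shut the replicas down.
combined <- Reduce(`+`, preds) / length(preds)
mpi.close.Rslaves()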

As proof of concept, and for benchmarking purposes, we ran the package on two clusters available to us. The first cluster was the Lerner Research Institute High Performance Cluster (LRI-HPC) at the Cleveland Clinic. This system runs the Simple Linux Utility for Resource Management (SLURM) supervisor software. It is a Linux-based cluster consisting of twenty nodes with ten cores per node. Each node has 64GB of shared memory. The second cluster was the [Pegasus 2] installation at the High Performance Computing Group at the Center for Computational Science (HPC CCS) at the University of Miami. This system has 10,000 cores spread across over 600 nodes. The system also runs a flavour of Linux, but uses Platform LSF as the job scheduler. The different job schedulers for the two clusters required minor customization of the calling shell scripts. For our benchmarks, we analyzed a simulated survival data set with n=3000, p=30, and ntree=1024, with the default value of importance="permute". We ran the package in serial mode and scaled up to 1024 cores across 64 nodes. The results are shown in the figure [Benchmark of Hybrid Parallel Processing]. In serial mode, growing 1024 trees iteratively (using one core) took about 330 minutes. The red points all reflect OpenMP parallel processing on one node; these results would be similar to those of a desktop computer. Pegasus 2 has 16 cores per node. When all 16 cores on one node were used, elapsed time decreased to about 25 minutes, a decrease by a factor of about 13 from the serial time. The theoretical maximum decrease would be closer to a factor of 16. The results would be similar to those of running the analysis on a desktop computer with 16 cores. Next, hybrid parallel processing using the primary/replica framework was implemented. Two replica nodes were tasked, then three, and so on up to 64 nodes. At 64 nodes, total core usage was 1024, and the cluster was growing all 1024 trees simultaneously. The data points reflecting hybrid parallel processing are in blue. Linear decreases in elapsed time were observed, indicating that the package is capable of massive scalability. The primary and replica code harnesses are given in the section [Supplementary Code].

Benchmark of Hybrid Parallel Processing

Theory and Specifications

Survival

There are two split rules that can be used to grow survival trees. In this model, the response or y-outcome associated with individual \(i\) is a pair of values specifying a non-negative survival time and censoring information. Denote the response for \(i\) by \(Y_i = (T_i, \delta_i)\). The individual is said to be right censored at time \(T_i\) if \(\delta_i = 0\) and is said to have died at time \(T_i\) if \(\delta_i = 1\). In the case of death, \(T_i\) will be referred to as an event time, and the death as an event. An individual \(i\) who is right censored at \(T_i\) simply means that the individual is known to have been alive at \(T_i\), but the exact time of death is unknown.

Log-rank splitting

The log-rank test for splitting survival trees is a well established concept [Segal, 1988]. It has been shown to be robust in both proportional and non-proportional hazard settings [Leblanc and Crowley, 1993]. Note that a proposed split of \(h\) on a real valued \(x\) is of the form \(x \le c\) and \(x > c\). Such a split defines left and right daughter membership. The split that maximizes survival differences between the two daughter nodes is the best split. Let \(t_1 < t_2 <\cdots < t_m\) be the distinct times of death in the parent node \(h\), and let \(d_{k,l}\) and \(Y_{k,l}\) equal the number of deaths and individuals at risk respectively at time \(t_k\) in the left daughter node for \(k \in \{1, \ldots, m\}\). Similarly let \(d_{k,r}\) and \(Y_{k,r}\) refer to the right daughter node. Note that \(Y_{k,s}\) is the number of individuals in daughter \(s \in \{l, r\}\) who are alive at time \(t_k\), or who have an event (death) at time \(t_k\). More precisely, \begin{equation*} Y_{k,l}=\#\{i : T_i\ge t_k,\;x_i\le c\},\hskip10pt Y_{k,r}=\#\{i : T_i\ge t_k,\;x_i>c\}, \end{equation*} where \(x_i\) is the value of \(x\) for individual \(i=1, \ldots, n\). Define \(Y_k=Y_{k,l}+Y_{k,r}\) and \(d_k=d_{k,l} + d_{k,r}\). Let \(n_s\) be the total number of observations in daughter \(s\). Thus, \(n=n_l+n_r\), where \(n_l=\#\{i: x_i\le c\}\) and \(n_r=\#\{i: x_i> c\}\). The log-rank test for a split at the value \(c\) for an x-variable \(x\) is \begin{equation} L(x,c)=\frac{{\displaystyle\sum_{k=1}^m\left(d_{k,l}-Y_{k,l}\frac{d_k}{Y_k}\right)}} {\sqrt{{\displaystyle\sum_{k=1}^m\frac{Y_{k,l}}{Y_k}\left(1-\frac{Y_{k,l}}{Y_k}\right) \left(\frac{Y_k-d_k}{Y_k-1}\right)d_k}}}. \label{eqn:survival.logrank} \end{equation} The value \(|L(x,c)|\) is a measure of node separation. Larger values of \(|L(x,c)|\) imply a greater difference between the two groups and hence a better split. In particular, the best split at node \(h\) is determined by finding the x-variable \(x^*\) and split value \(c^*\) such that \(|L(x^*,c^*)|\ge |L(x,c)|\) for all \(x\) and \(c\).
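For concreteness, the statistic \eqref{eqn:survival.logrank} can be transcribed directly into R. The sketch below is a naive implementation for a single candidate split, intended only to make the notation concrete; it is not the package's C implementation.


# Log-rank split statistic L(x, c) for survival data (time, status) and a
# candidate split of the x-variable at the value c.
logrank.split <- function(time, status, x, c) {
  dtimes <- sort(unique(time[status == 1]))       # distinct death times t_k
  num <- 0
  den <- 0
  for (tk in dtimes) {
    Yl <- sum(time >= tk & x <= c)                # at risk, left daughter
    Yr <- sum(time >= tk & x >  c)                # at risk, right daughter
    dl <- sum(time == tk & status == 1 & x <= c)  # deaths, left daughter
    dr <- sum(time == tk & status == 1 & x >  c)  # deaths, right daughter
    Yk <- Yl + Yr
    dk <- dl + dr
    num <- num + (dl - Yl * dk / Yk)
    if (Yk > 1) {
      den <- den + (Yl / Yk) * (1 - Yl / Yk) * ((Yk - dk) / (Yk - 1)) * dk
    }
  }
  num / sqrt(den)
}

# Example: split the veteran data on the Karnofsky score at 60.
data(veteran, package = "randomForestSRC")
print(logrank.split(veteran$time, veteran$status, veteran$karno, c = 60))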

Log-rank score splitting

The log-rank score test for splitting survival trees is described in [Hothorn and Lausen, 2003]. In this rule, assume the x-variable \(x\) has been ordered so that \(x_1 \le x_2 \le \cdots \le x_n\). Now, compute the ``ranks'' for each survival time \(T_j\) where \(j \in \{1, \ldots, n\}\). Thus, \begin{equation*} a_j=\delta_j-\sum_{k=1}^{\Gamma_j}\frac{\delta_k}{n-\Gamma_k+1} , \end{equation*} where \(\Gamma_k=\#\{t:T_t\le T_k\}\). Note that \(\Gamma_k\) is the index of the order for \(T_k\). The log-rank score test is defined as \begin{equation} S(x,c)=\frac{{\displaystyle\sum_{x_j\le c} a_j-n_l\overline{a}}} {\sqrt{{\displaystyle n_l\left(1-\frac{n_l}{n}\right)s_a^2}}} , \label{eqn:survival.logrank.score} \end{equation} where \(\overline{a}\) and \(s_a^2\) are the sample mean and sample variance of \(\{a_j: j=1, \ldots, n\}\). Log-rank score splitting defines the measure of node separation by \(|S(x,c)|\). Maximizing this value over \(x\) and \(c\) yields the best split.
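The log-rank score statistic \eqref{eqn:survival.logrank.score} can be transcribed in the same spirit; again this is a naive illustrative sketch, not the package's implementation.


# Log-rank score split statistic S(x, c).
logrank.score.split <- function(time, status, x, c) {
  n <- length(time)
  ord <- order(time)                         # work in time order
  time <- time[ord]; status <- status[ord]; x <- x[ord]
  Gamma <- rank(time, ties.method = "max")   # Gamma_k = #{t : T_t <= T_k}
  a <- numeric(n)
  for (j in 1:n) {
    idx <- 1:Gamma[j]
    a[j] <- status[j] - sum(status[idx] / (n - Gamma[idx] + 1))
  }
  nl <- sum(x <= c)
  (sum(a[x <= c]) - nl * mean(a)) / sqrt(nl * (1 - nl / n) * var(a))
}

# Example: the veteran data split on the Karnofsky score at 60.
data(veteran, package = "randomForestSRC")
print(logrank.score.split(veteran$time, veteran$status, veteran$karno, c = 60))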

Terminal node estimators

The survival estimate associated with a terminal node is provided by the Kaplan-Meier (KM) estimator [Kaplan and Meier, 1958]. Again let \(t_1 < t_2 <\cdots < t_m\) be the distinct death times in the terminal node \(h\), and let \(d_k\) and \(Y_k\) equal the number of deaths and individuals at risk respectively at time \(t_k\) in \(h\). Then the Kaplan-Meier estimator is \begin{equation} \hat{S}(t) = \prod_{t_k \le t}{\left( 1 - \frac{d_k}{Y_k} \right)} . \label{eqn:kaplan.meier} \end{equation} The cumulative hazard estimate (CHE) associated with a terminal node is provided by the Nelson-Aalen estimator and is calculated as \begin{equation} \hat{H}(t) = \sum_{t_k \le t}{\frac{d_k}{Y_k}} . \label{eqn:chf.terminal} \end{equation} If there are H terminal nodes in a tree, there are H such estimates. Let individual \(i\) have feature \({\bf X}_i\) and reside in terminal node \(h\). The CHE terminal node statistic will be denoted by \begin{equation*} \hat{H}(t| {\bf X}_i) = \hat{H}_h(t), \text{ if } \; {\bf X}_i \in h . \end{equation*} To produce the OOB ensemble for individual \(i\), the CHE terminal node statistic is averaged over all \(\tt ntree\) trees. Let \(\hat{H}_b (t|{\bf X})\) denote the CHE estimate for tree \(b \in \{1,\ldots,\tt ntree\}\). Define \(I_{i,b}=1\) if individual \(i\) is an OOB individual for \(b\), otherwise set \(I_{i,b}=0\). The OOB ensemble CHE for individual \(i\) is \begin{equation*} \hat{H}_e^{*}(t|{\bf X}_i) = \frac{\sum_{b=1}^{\tt ntree}{I_{i,b} \hat{H}_b (t|{\bf X}_i)}}{\sum_{b=1}^{\tt ntree} I_{i,b}} . \end{equation*} Note that the OOB ensemble is obtained by averaging over only those bootstraps in which individual \(i\) is OOB.

An example of a survival forest follows:


# Veteran's Administration Lung Cancer Trial. Randomized 
# trial of two treatment regimens for lung cancer.
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)
## Plot the error.
plot(v.obj)
## Plot the survival estimates.
plot.survival(v.obj)

Prediction error

Prediction error for survival models is measured by \(1-C\), where \(C\) is Harrell's concordance index [Harrell et al., 1982]. Prediction error is between 0 and 1, and measures how well the predictor correctly ranks two random individuals in terms of survival. Unlike other measures of survival performance, Harrell's C-index does not depend on choosing a fixed time for evaluation of the model and specifically takes into account censoring of individuals [May et al., 2004]. The method has become quite popular in the literature as a means for assessing prediction performance in survival analysis settings. See [Kattan et al., 1988] and references therein.

To compute the concordance index, it is necessary to define what constitutes a worse predicted outcome. Let \(t_1^*,\ldots,t_M^*\) denote all unique times in the data. Define the predicted outcome for individual \(i\) to be \begin{equation} \mathcal{M}_i = \sum_{k=1}^{M}\hat{H}_e^*(t_k^*|{\bf X}_i). \label{eqn:survival.mortality} \end{equation} Individual \(i\) is said to have a worse predicted outcome than individual \(j\) if \(\mathcal{M}_i > \mathcal{M}_j\). The concordance error rate is computed as follows (a sketch implementing these steps is given after the list):

  1. Form all possible pairs of observations over all the data.
  2. Omit those pairs where the shorter event time is censored. Also, omit pairs \((i, j)\) if \(T_i=T_j\) unless \((\delta_i=1, \delta_j=0)\) or \((\delta_i=0, \delta_j=1)\) or \((\delta_i=1, \delta_j=1)\). The last restriction only allows ties in event times if at least one of the observations is a death. Let the resulting pairs be denoted by \(\mathbb{S}\). Let \(permissible = |\mathbb{S}|\).
  3. If \(T_i \ne T_j\), count 1 for each \(s \in \mathbb{S}\) in which the shorter time had the worse predicted outcome.
  4. If \(T_i \ne T_j\), count 0.5 for each \(s \in \mathbb{S}\) in which \(\mathcal{M}_i = \mathcal{M}_j\).
  5. If \(T_i = T_j\), count 1 for each \(s \in \mathbb{S}\) in which \(\mathcal{M}_i = \mathcal{M}_j\).
  6. If \(T_i = T_j\), count 0.5 for each \(s \in \mathbb{S}\) in which \(\mathcal{M}_i \ne \mathcal{M}_j\).
  7. Let \(concordance\) denote the resulting count over all permissible pairs. Define the concordance index \(C\) as \begin{equation} C=\dfrac{concordance}{permissible}. \label{eqn:concordance} \end{equation}
  8. The error rate is \(Err=1-C\). Note that \(0 \le Err \le 1\) and that \(Err=0.5\) corresponds to a procedure doing no better than random guessing, whereas \(Err=0\) indicates perfect prediction.
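A naive transcription of these steps follows; it is an O(n^2) illustration of the error rate \eqref{eqn:concordance}, given event times, censoring indicators, and predicted outcomes (larger values are worse), and is not the package's implementation.


# Concordance error rate 1 - C following the steps above.
concordance.error <- function(T, delta, M) {
  concordance <- 0
  permissible <- 0
  n <- length(T)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    # Step 2: omit pairs whose shorter time is censored, and tied times with no death.
    if (T[i] < T[j] && delta[i] == 0) next
    if (T[j] < T[i] && delta[j] == 0) next
    if (T[i] == T[j] && delta[i] == 0 && delta[j] == 0) next
    permissible <- permissible + 1
    if (T[i] != T[j]) {
      # Steps 3 and 4: distinct event times.
      shorter <- if (T[i] < T[j]) i else j
      longer  <- if (T[i] < T[j]) j else i
      if (M[shorter] > M[longer]) concordance <- concordance + 1
      else if (M[shorter] == M[longer]) concordance <- concordance + 0.5
    } else {
      # Steps 5 and 6: tied event times.
      if (M[i] == M[j]) concordance <- concordance + 1
      else concordance <- concordance + 0.5
    }
  }
  1 - concordance / permissible
}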

Competing Risk

Competing risks occur in survival analyses when the occurrence of one event precludes the occurrence of the remaining set of possible events. Individuals subject to competing risks are observed from study entry until the occurrence of the event of interest or of a competing event, although often the individual is right censored before any event can be observed.

Formally, let \(T_i^o\) be the event time for the \(i^{th}\) subject, \(i=1,\ldots,N\), and let \(\delta_i^o\) be the event type, \(\delta_i^o\in\{1,\ldots,J\}\), where \(J \ge 1\). Let \(C_i^o\) denote the censoring time for individual \(i\) such that the actual time of event \(T_i^o\) is unobserved and one only observes \(T_i=\min(T_i^o,C_i^o)\) and the event indicator \(\delta_i=\delta_i^o I(T_i^o\le C_i^o)\). When \(\delta_i=0\), the individual is said to be censored at \(T_i\), otherwise if \(\delta_i=j>0\), the individual is said to have an event of type \(j\) at time \(T_i\). Denote the observed response for \(i\) as \(Y_i = (T_i,\delta_i)\). The cause-specific hazard function (cs-H) for event \(j\) given a \(P\)-dimensional feature \({\bf X}_i\) is \begin{equation} \alpha_j(t|{\bf X}_i) = \lim_{\Delta t\rightarrow 0} \frac{\mathbb{P}\{t\le T^o \le t+\Delta t, \delta^o=j | T^o\ge t,{\bf X}_i\}}{\Delta t} = \frac{f_j(t|{\bf X}_i)}{S(t|{\bf X}_i)} . \label{eqn:csh} \end{equation} Here \(S(t|{\bf X}_i)=\mathbb{P}\{T^o\ge t|{\bf X}_i\}\) is the event-free survival probability function given \({\bf X}_i\). The cs-H describes the instantaneous risk of event \(j\) for subjects that currently are event-free. The probability of an event is determined using the cause-specific cumulative incidence function (cs-CIF), defined as the probability of experiencing an event of type \(j\) by time \(t\); i.e. \(F_j(t|{\bf X}_i) = \mathbb{P}\{T^o \le t, \delta^o=j|{\bf X}_i\}\). The cs-CIF and cs-H are related according to \begin{equation} F_j(t|{\bf X}_i) =\int_0^t S(s-|{\bf X}_i)\alpha_j(s|{\bf X}_i)\, \mathrm d s = \int_0^t \exp\left(-\int_0^s \sum_{l=1}^J \alpha_l(u|{\bf X}_i) \mathrm d u\right) \alpha_j(s|{\bf X}_i)\, \mathrm d s . \label{eqn:cif} \end{equation} See [Gray (1988)] for an analysis of the cs-CIF.

With these two equations, it is possible to describe the two split rules that can be used to grow competing risk trees. Again, for notational convenience these rules are described for the root node using the entire learning data, and for splits on real valued x-variables. However, the algorithm extends to any tree node, to bootstrap data, and to splits on categorical x-variables.

As before, let \((T_i,\delta_i)_{1\le i \le n}\) denote the pairs of survival times and event indicators, and let \(t_1 < t_2 < \cdots < t_m\) denote the distinct event times. Consider a proposed split of the form \(x \le c\) and \(x > c\) for a continuous x-variable \(x\). Such a split forms two daughter nodes containing two new sets of competing risk data. The subscripts \(l\) and \(r\) refer to the left and right daughter node, respectively; denote by \(\alpha_{jl}(t)\) and \(\alpha_{jr}(t)\) the cause-\(j\) specific hazard rates, given by \eqref{eqn:csh}, in the left and the right daughter node. Similarly define \(F_{jl}(t)\) and \(F_{jr}(t)\) to be the cs-CIF, given by \eqref{eqn:cif}, for the left and the right daughter node.

The number of individuals at risk at time \(t\) in the left and right daughter nodes are \(Y_l(t)\) and \(Y_r(t)\), where \begin{equation*} Y_l(t)=\sum_{i=1}^n I(T_i\ge t,x_i\le c),\hskip10pt Y_r(t)=\sum_{i=1}^n I(T_i\ge t,x_i> c), \end{equation*} and \(x_i\) is the value the x-variable assumes for individual \(i=1,\ldots,n\). The number of type \(j\) events at time \(t\) for the left and right daughters is \begin{equation*} d_{j,l}(t)=\sum_{i=1}^n I(T_i=t, \delta_i=j, x_i\le c),\hskip10pt d_{j,r}(t)=\sum_{i=1}^n I(T_i=t, \delta_i=j, x_i> c). \end{equation*} The total number of individuals at risk at time \(t\), and the total number of type \(j\) events at time \(t\), in the parent node are given by \begin{eqnarray} Y(t)=Y_l(t)+Y_r(t), \label{eqn:at.risk} \\ d_j(t)=d_{j,l}(t)+d_{j,r}(t). \label{eqn:event.count} \end{eqnarray} Define also \(t_m,t_{m_l},t_{m_r}\) to be the largest time on study in the parent node and the two daughters, respectively.

Log-rank splitting

The first competing risk split rule is the log-rank test. This is a test of the null hypothesis \(H_0:\alpha_{jl}(t)=\alpha_{jr}(t)\) for all \(t\le\tau\). The test is based on the weighted difference of the cause-specific Nelson-Aalen estimates in the two daughter nodes. Specifically, for a split at the value \(c\) for variable \(x\), the split rule is \begin{equation*} L^{\mathrm{LR}}_j(x,c)=\frac 1{\hat\sigma^{\mathrm{LR}}_{j}(x,c)}\sum_{k=1}^m W_j(t_k)\left(d_{j,l}(t_k) -\frac{d_j(t_k)Y_l(t_k)}{Y(t_k)}\right) , \end{equation*} where the variance estimate is given by \begin{equation*} (\hat\sigma^{\mathrm{LR}}_{j}(x,c))^{2}= \sum_{k=1}^m W_j(t_k)^2 d_j(t_k) \frac{Y_l(t_k)}{Y(t_k)} \left(1-\frac{Y_l(t_k)}{Y(t_k)}\right) \left(\frac{Y(t_k)-d_j(t_k)}{Y(t_k)-1}\right) . \end{equation*} Time-dependent weights \(W_j(t)>0\) can be used to make the test more sensitive to early or late differences between the cause-specific hazards. The choice \(W_j(t)=1\) corresponds to the standard log-rank test, which has optimal power for detecting alternatives where the cause-specific hazards are proportional. In practice, \(W_j(t) = W_j\); that is, the weights are constant and not time-dependent. Furthermore, the cause-specific split rules across the event types are combined to arrive at the composite log-rank competing risk split rule that is implemented in the package and given by \begin{equation} L^{\mathrm{LR}}(x,c) = \frac{\sum_{j=1}^J (\hat\sigma^{\mathrm{LR}}_j(x,c))^2 L_j^{\mathrm{LR}}(x,c)}{\sqrt{\sum_{j=1}^J(\hat\sigma^{\mathrm{LR}}_j(x,c))^2}} . \label{eqn:cr.logrank} \end{equation} The best split is found by maximizing \(|L^{\mathrm{LR}}(x,c)|\) over \(x\) and \(c\).

Modified log-rank splitting

The composite log-rank split rule \eqref{eqn:cr.logrank} is useful if the main purpose is to detect variables that affect the cause-\(j\) specific hazard. It may not be optimal if the purpose is also prediction of cumulative event probabilities. In this case better results may be obtained with split rules that select variables based on their direct effect on the cumulative incidence. For this reason the second split rule is modeled after Gray's test [Gray (1988)], which tests the null hypothesis \(H_0:F_{jl}(t)=F_{jr}(t)\) for all \(t\le \tau\). See [Ishwaran et al. (2014)] for additional details on using the modified risk set defined by \begin{equation*} Y^*_j(t) = \sum_{i=1}^n I\Bigl(T_i\ge t \cup \left(T_i< t \cap \delta_i\ne j \cap C_i^o> t\right)\Bigr) . \end{equation*} This equals the number of individuals who have not had an event prior to \(t\), in addition to those individuals who have experienced an event \(j' \ne j\) prior to \(t\) but who are not yet censored. The split rule based on the score statistic that uses the modified risk sets is denoted \(L^{\mathrm{G}}_j(x,c)\) and is obtained by substituting \(Y^*_j\) for \(Y\), and the corresponding left daughter quantity \(Y^*_{j,l}\) for \(Y_l\), in \eqref{eqn:cr.logrank}. Note that if the censoring time is not known for those cases that have an event before the end of follow-up, the largest observed time is used, and the statistic \(L^{\mathrm{G}}_j(x,c)\) remains a good (and computationally efficient) approximation of Gray's test statistic; see [Fine and Gray (1999), Section 3.2].

Thus the composite modified log-rank competing risk split rule is given by \begin{equation} L^{\mathrm{G}}(x,c) = \frac{\sum_{j=1}^J (\hat\sigma^{\mathrm{G}}_j(x,c))^2 L_j^{\mathrm{G}}(x,c)}{\sqrt{\sum_{j=1}^J(\hat\sigma^{\mathrm{G}}_j(x,c))^2}} . \label{eqn:cr.modified.logrank} \end{equation} The best split is found by maximizing \(|L^{\mathrm{G}}(x,c)|\) over \(x\) and \(c\).
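
Both competing risk split rules are selected through the splitrule argument of rfsrc(). The argument values below ("logrankCR" for the modified, Gray-type rule and "logrank" for the cause-specific log-rank rule) follow recent package documentation and may be version-dependent; treat this as a usage sketch rather than a definitive interface description.

# Competing risk split rule selection (argument values assumed from
# recent package documentation; possibly version-dependent).
library(randomForestSRC)
data(wihs, package = "randomForestSRC")
## Modified log-rank (Gray-type) splitting.
wihs.gray <- rfsrc(Surv(time, status) ~ ., data = wihs,
                   splitrule = "logrankCR", ntree = 100)
## Cause-specific log-rank splitting.
wihs.lr <- rfsrc(Surv(time, status) ~ ., data = wihs,
                 splitrule = "logrank", ntree = 100)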

Terminal node estimators

Let \(t_1 < t_2 <\cdots < t_m\) be the distinct non-censored event times in the terminal node \(h\), and let \(d_j(t_k)\) equal the number of events of type \(j\) at time \(t_k\). Let \(Y(t_k)\) be the number of bootstrap cases at risk at time \(t_k\). Both of these quantities are defined in \eqref{eqn:event.count} and \eqref{eqn:at.risk}. Let individual \(i\) have feature \({\bf X}_i\) and reside in terminal node \(h\). Denote the number of events of type \(j\) in \([0, t]\) as \begin{equation*} N_{j}(t| {\bf X}_i)=\sum_{t_{k} \le t} d_j(t_k | {\bf X}_i) . \end{equation*} The cause-specific hazard for individual \(i\) in node \(h\) and event \(j\) is defined by \(\alpha_j(t|{\bf X}_i)\) and given by \eqref{eqn:csh}. The cause-specific cumulative hazard function (cs-CHF) is given by \begin{equation*} H_j (t|{\bf X}_i) = \int_0^t \alpha_j(s | {\bf X}_i) ds = \int_0^t \frac{dN_j(s | {\bf X}_i)}{Y(s | {\bf X}_i)} . \end{equation*} The Nelson-Aalen estimate for the cs-CHF is given by \begin{equation} \hat{H}_j(t |{\bf X}_i) = \sum_{t_{k} < t}\hat{\alpha}_j(t_k | {\bf X}_i) = \sum_{t_{k} < t} \frac{d_j(t_k | {\bf X}_i)}{Y(t_k | {\bf X}_i)} . \label{eqn:cause.specific.chf} \end{equation} The cs-CIF is given by \eqref{eqn:cif} and can be written as \begin{equation*} F_j(t|{\bf X}_i) =\int_0^t S(s-|{\bf X}_i)\alpha_j(s|{\bf X}_i)\, \mathrm d s =\int_0^t S(s-|{\bf X}_i) \frac{dN_j(s | {\bf X}_i)}{Y(s | {\bf X}_i)} . \end{equation*} We have adopted the estimator for \(F_j(t)\) given in [Gray (1988), Equation 2.3]. Namely, \begin{equation} \hat{F}_j(t|{\bf X}_i) =\sum_{t_{k} < t} \hat{S}(t_{k-1} | {\bf X}_i) \frac{d_j(t_k | {\bf X}_i)}{Y(t_k | {\bf X}_i)}, \label{eqn:cause.specific.cif} \end{equation} where \(\hat{S}(t)\) is given by \eqref{eqn:kaplan.meier}. The predicted value for event \(j\) is the integrated cs-CIF \begin{equation*} \mathcal{M}_j = \int_{0}^{\tau} F_j(s) ds . \end{equation*} Here \(\tau\) is a fixed time point; we use \(\tau = t_m\), the largest observed event time. The value \(\mathcal{M}_j\), called the cause-\(j\) mortality, can be interpreted as the expected number of life years lost due to cause \(j\) and represents a measure of mortality. It is estimated by \begin{equation} \hat{\mathcal{M}}_j = \int_{0}^{\tau} \hat{F}_j(s) ds = \sum_{t_{k+1} \le \tau}\hat{F}_j(t_k)[t_{k+1} - t_k] . \label{eqn:cause.specific.mortality} \end{equation}
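
The cause-\(j\) mortality \eqref{eqn:cause.specific.mortality} is simply a left Riemann sum of the estimated cs-CIF over the event-time grid. A minimal sketch with hypothetical inputs:

# tk:  distinct event times t_1 < ... < t_m (tau taken as the last time)
# cif: estimated cs-CIF F-hat_j evaluated at each t_k
cause.j.mortality <- function(tk, cif) {
  sum(cif[-length(cif)] * diff(tk))   # sum of F-hat_j(t_k) * (t_{k+1} - t_k)
}
cause.j.mortality(tk = c(1, 2, 4, 7), cif = c(0.10, 0.25, 0.40, 0.55))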

Prediction error

The cause-specific error rate for each event \(j\) is estimated using the concordance index protocol leading to equation \eqref{eqn:concordance}. Let \(\hat{\mathcal{M}}_{i,j}\) denote the cause-\(j\) mortality for case \(i\). We say that case \(i\) has a higher risk of event \(j\) than case \(i'\) if \(\hat{\mathcal{M}}_{i,j} > \hat{\mathcal{M}}_{i',j}\). Applying equation \eqref{eqn:concordance} results in a concordance index \(C_j\) and an error rate \(Err_j = 1 - C_j\). An example of a competing risk call follows:


# Women's Interagency HIV Study (WIHS). Competing risk
# data set involving AIDS in women.
library(randomForestSRC)
data(wihs, package = "randomForestSRC")
wihs.obj <- rfsrc(Surv(time, status) ~ ., data = wihs, nsplit = 3, ntree = 100)
## Plot the ensemble cs-CIF and cs-CHF.
plot.competing.risk(wihs.obj)

Regression

There are three split rules that can be used to grow regression trees. In this model, the y-outcome associated with an individual \(i\) is a single real value. Denote the response for \(i\) by \(Y_i \in \mathbb{R}\).

Weighted variance splitting

Suppose the proposed split for the root node is of the form \(x \le c\) and \(x>c\) for a continuous x-variable \(x\), and a split value of \(c\). The split rule is \begin{equation} \theta(x,c) = \frac{n_l}{n} \times \frac{1}{n_l} \sum_{x_i \le c} (Y_i - \bar{Y}_l)^2 + \frac{n_r}{n} \times \frac{1}{n_r} \sum_{x_i > c} (Y_i - \bar{Y}_r)^2 , \label{eqn:regression.weighted} \end{equation} where the subscripts \(l\) and \(r\) indicate left and right daughter node membership respectively. Then \begin{equation*} \bar{Y}_l = \frac{1}{n_l} \sum_{x_i \le c} Y_i , \hskip10pt \bar{Y}_r = \frac{1}{n_r} \sum_{x_i > c} Y_i \end{equation*} are the means within each of the daughters, and \(n_l\) and \(n_r\) are the number of cases in the left and right daughter respectively such that \((n = n_l + n_r).\) The goal is to find \(x\) and \(c\) to minimize \(\theta(x,c)\).

Unweighted variance splitting

In this rule, it is necessary to minimize \begin{equation} \theta(x,c) = \frac{1}{n_l} \sum_{x_i \le c} (Y_i - \bar{Y}_l)^2 + \frac{1}{n_r} \sum_{x_i > c} (Y_i - \bar{Y}_r)^2 . \label{eqn:regression.unweighted} \end{equation}

Heavy weighted variance splitting

In this rule, it is necessary to minimize \begin{equation} \theta(x,c) = \left( \frac{n_l}{n} \right)^2 \times \frac{1}{n_l} \sum_{x_i \le c} (Y_i - \bar{Y}_l)^2 + \left( \frac{n_r}{n} \right)^2 \times \frac{1}{n_r} \sum_{x_i > c} (Y_i - \bar{Y}_r)^2 . \label{eqn:regression.heavy.weighted} \end{equation}
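
The three statistics differ only in how the daughter sums of squares are weighted. The following sketch evaluates all three for a candidate split \(x \le c\), written directly from \eqref{eqn:regression.weighted}, \eqref{eqn:regression.unweighted}, and \eqref{eqn:regression.heavy.weighted}; it is illustrative only and not the package's split routine.

# Evaluate the weighted, unweighted, and heavy weighted variance split
# statistics for a candidate split point c on x-variable x with outcome y.
regression.split.stats <- function(x, y, c) {
  l <- x <= c
  n <- length(y); nl <- sum(l); nr <- n - nl
  ssl <- sum((y[l]  - mean(y[l]))^2)     # left daughter sum of squares
  ssr <- sum((y[!l] - mean(y[!l]))^2)    # right daughter sum of squares
  c(weighted   = ssl / n + ssr / n,                        # (n_l/n)(1/n_l) SS_l + ...
    unweighted = ssl / nl + ssr / nr,                      # (1/n_l) SS_l + ...
    heavy      = (nl / n) * ssl / n + (nr / n) * ssr / n)  # (n_l/n)^2 (1/n_l) SS_l + ...
}
set.seed(1)
x <- runif(100); y <- 2 * x + rnorm(100)
regression.split.stats(x, y, c = 0.5)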

Terminal node estimators

The predicted value for a terminal node is the mean value of the cases within that node. Let individual \(i\) have feature \({\bf X}_i\) and reside in terminal node \(h\), which contains \(n_h\) bootstrap cases. The terminal node predicted value for node \(h\) is given by \begin{equation} \hat{f}_h = \frac{1}{n_h} \sum_{{\bf X}_i \in h} Y_i . \label{eqn:mean} \end{equation} To produce the OOB predicted value for individual \(i\), this quantity is averaged over all \(\tt ntree\) trees. Let \(\hat{f}_b({\bf X}_i)\) denote the predicted value for tree \(b \in \{1,\ldots,\tt ntree\}\). As before, \(I_{i,b}=1\) if individual \(i\) is OOB for tree \(b\), otherwise \(I_{i,b}=0\). The OOB predicted value for individual \(i\) is \begin{equation*} \hat{f}_e^{*}({\bf X}_i) = \frac{\sum_{b=1}^{\tt ntree}{I_{i,b} \hat{f}_b ({\bf X}_i)}}{\sum_{b=1}^{\tt ntree} I_{i,b}}. \end{equation*}
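
The OOB ensemble used here (and in the survival and classification families) is a masked average of the per-tree predictions. A minimal sketch with hypothetical inputs, not the package's internal data structures:

# pred:  n x ntree matrix of per-tree predictions f-hat_b(X_i)
# inbag: n x ntree matrix of bootstrap counts; 0 means case i is OOB in tree b
oob.ensemble <- function(pred, inbag) {
  oob <- (inbag == 0)                     # indicator I_{i,b}
  rowSums(pred * oob) / rowSums(oob)      # OOB predicted value for each case
}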

Prediction error

The estimated prediction error is given by \begin{equation} Err = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{f}_e^{*}({\bf X}_i))^2 . \label{eqn:mean.squared.error} \end{equation} An example of a regression forest follows:


# New York air quality measurements. Mean ozone in parts per billion.
library(randomForestSRC)
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.impute")
# Plot the marginal effect of the x-variables.
plot.variable(airq.obj)

Classification

There are three split rules that can be used to grow classification trees. In this model, the y-outcome associated with an individual \(i\) is a single categorical value. Let \({\bf p} = (p_1, \ldots, p_J)\) denote the class proportions for classes \(1\) through \(J\) of the y-outcome in the node.

Weighted Gini index splitting

Suppose the proposed split for the root node is of the form \(x \le c\) and \(x>c\) for a continuous x-variable \(x\), and a split value of \(c\). The impurity of the node is defined as \begin{equation*} \phi({\bf p}) = \sum_{j = 1}^{J} p_j(1 - p_j) = 1 - \sum_{j = 1}^{J} p_j^2 . \end{equation*} The Gini index for a split \(c\) on \(x\) is \begin{equation*} \theta(x,c) = \frac{n_l}{n} \phi({\bf p}_l) + \frac{n_r}{n} \phi({\bf p}_r) , \end{equation*} where, as before, the subscripts \(l\) and \(r\) indicate left and right daughter node membership respectively, and \(n_l\) and \(n_r\) are the number of cases in the daughters such that \((n = n_l + n_r).\) The goal is to find \(x\) and \(c\) to minimize \begin{equation} \theta(x,c) = \frac{n_l}{n} \left( 1 - \sum_{j = 1}^{J} \left( \frac{n_{j,l}}{n_l} \right)^2 \right) + \frac{n_r}{n} \left( 1 - \sum_{j = 1}^{J} \left( \frac{n_{j,r}}{n_r} \right)^2 \right) , \end{equation} where \(n_{j,l}\) and \(n_{j,r}\) are the number of cases of class \(j\) in the left and right daughter, respectively, such that \((n_j = n_{j,l} + n_{j,r}).\) This is equivalent to maximizing \begin{equation} \theta^*(x,c) = \frac{1}{n} \sum_{j = 1}^{J} \frac{n_{j,l}^2}{n_l} + \frac{1}{n} \sum_{j = 1}^{J} \frac{(n_j - n_{j,l})^2}{n - n_l} . \label{eqn:classification.weighted} \end{equation}

Unweighted Gini index splitting

Unweighted Gini index splitting corresponds to \begin{equation*} \theta(x,c) = \phi({\bf p}_l) + \phi({\bf p}_r) . \end{equation*} The best split is found by minimizing \begin{equation*} \theta(x,c) = \left( 1 - \sum_{j = 1}^{J} \left( \frac{n_{j,l}}{n_l} \right)^2 \right) + \left( 1 - \sum_{j = 1}^{J} \left( \frac{n_{j,r}}{n_r} \right)^2 \right) . \end{equation*} This is equivalent to maximizing \begin{equation} \theta^*(x, c) = \sum_{j = 1}^{J} \left( \frac{n_{j,l}}{n_l} \right)^2 + \sum_{j = 1}^{J} \left( \frac{n_j - n_{j,l}}{n - n_l} \right)^2 . \label{eqn:classification.unweighted} \end{equation}

Heavy weighted Gini index splitting

Heavy weighted Gini index splitting corresponds to \begin{equation*} \theta(x,c) = \frac{n_l^2}{n^2} \phi({\bf p}_l) + \frac{n_r^2}{n^2} \phi({\bf p}_r) . \end{equation*} The best split is found by minimizing \begin{equation*} \theta(x,c) = \frac{n_l^2}{n^2} \left( 1 - \sum_{j = 1}^{J} \left( \frac{n_{j,l}}{n_l} \right)^2 \right) + \frac{n_r^2}{n^2} \left( 1 - \sum_{j = 1}^{J} \left( \frac{n_{j,r}}{n_r} \right)^2 \right) . \end{equation*} This is equivalent to maximizing \begin{equation} \theta^*(x, c) = \frac{1}{n^2} \sum_{j = 1}^{J} n_{j,l}^2 + \frac{1}{n^2} \sum_{j = 1}^{J} (n_j - n_{j,l})^2 - \frac{n_l^2}{n^2} - \frac{n_r^2}{n^2} + 2 . \label{eqn:classification.heavy.weighted} \end{equation} Note that the above is non-negative due to the addition of the constant last term.
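
Analogous to the regression case, the following sketch evaluates the weighted, unweighted, and heavy weighted Gini statistics (in their minimized \(\theta\) form rather than the equivalent maximized \(\theta^*\) form) for a candidate split \(x \le c\); it is illustrative only and not the package's split routine.

# Evaluate the three Gini split statistics for a candidate split point c
# on x-variable x with a factor outcome y.
gini.split.stats <- function(x, y, c) {
  l <- x <= c
  n <- length(y); nl <- sum(l); nr <- n - nl
  gl <- 1 - sum((table(y[l])  / nl)^2)   # left daughter impurity phi(p_l)
  gr <- 1 - sum((table(y[!l]) / nr)^2)   # right daughter impurity phi(p_r)
  c(weighted   = (nl / n) * gl + (nr / n) * gr,
    unweighted = gl + gr,
    heavy      = (nl / n)^2 * gl + (nr / n)^2 * gr)
}
gini.split.stats(iris$Petal.Length, iris$Species, c = 2.5)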

Terminal node estimators

The predicted values for a terminal node \(h\) are the class proportions in the node. Let individual \(i\) have feature \({\bf X}_i\) and reside in terminal node \(h\), which contains \(n_h\) bootstrap cases, of which \(n_j\) are of class \(j\). The proportion for the \(j^{th}\) class is given by \begin{equation} \hat{p}_{j,h} = \frac{1}{n_h} \sum_{{\bf X}_i \in h} I \{Y_i = j \} . \label{eqn:class.proportions} \end{equation} To produce the OOB predicted value for individual \(i\), this is averaged over all \(\tt ntree\) trees. Let \(\hat{p}_{j,b}({\bf X}_i)\) denote the predicted value for the \(j^{th}\) proportion for tree \(b \in \{1,\ldots,\tt ntree\}\). As before, \(I_{i,b}=1\) if individual \(i\) is OOB for tree \(b\), otherwise \(I_{i,b}=0\). The OOB predicted value for the \(j^{th}\) proportion for individual \(i\) is \begin{equation*} \hat{p}_{e,j}^{*}({\bf X}_i) = \frac{\sum_{b=1}^{\tt ntree}{I_{i,b} \hat{p}_{j,b} ({\bf X}_i)}}{\sum_{b=1}^{\tt ntree} I_{i,b}} . \end{equation*}

Prediction error

Consider \(\hat{\bf p}_{e}^{*} ({\bf X}_i) = (\hat{p}_{e,1}^{*}({\bf X}_i), \ldots, \hat{p}_{e,J}^{*}({\bf X}_i))\). Let \(\hat{Y}_i\) denote the predicted class for individual \(i\), namely the class \(j\) attaining the maximum OOB proportion \begin{equation*} \hat{p}_{e,max}^{*}({\bf X}_i) = \max_{1 \le j \le J} (\hat{p}_{e,j}^{*}({\bf X}_i)) , \end{equation*} and let \(N_j^{*}\) equal the number of individuals with class equal to \(j\) in the data set. The estimated conditional misclassification rate for class \(j\) is given by \begin{equation} Err_j = \frac{1}{N_j^{*}} \sum_{i:Y_i = j} I\{Y_i \ne \hat{Y}_i\} . \label{eqn:conditional.misclassification.rate} \end{equation} The estimated overall misclassification rate is given by \begin{equation} Err = \frac{1}{N} \sum_{i=1}^{N} I\{Y_i \ne \hat{Y}_i\} . \label{eqn:misclassification.rate} \end{equation} An example of a classification forest follows:


# Edgar Anderson's iris data with morphologic variation of three
# related species.
library(randomForestSRC)
iris.obj <- rfsrc(Species ~ ., data = iris, ntree = 100)
## Plot the results.
plot(iris.obj)
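
The misclassification rates above can also be computed directly from the OOB class proportions. A minimal sketch with hypothetical inputs (not the package's output structures):

# prob: N x J matrix of OOB class proportions (columns named by class)
# y:    factor of true class labels
misclassification <- function(prob, y) {
  yhat <- factor(colnames(prob)[max.col(prob)], levels = levels(y))
  list(overall     = mean(yhat != y),             # Err
       conditional = tapply(yhat != y, y, mean))  # Err_j, one per class
}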

Multivariate

In the multivariate and mixed model, the y-outcomes associated with an individual \(i\) are denoted by \({\bf Y}_i = (Y_{i,1}, \ldots, Y_{i,r})\), where \(Y_{i,j}\) is real or categorical for each \(j \in \{1, \ldots, r \}\). When \(Y_{i,j}\) is real for all \(j \in \{1, \ldots, r \}\) the model is a multivariate regression forest. In this case the split rule statistic is a composite of \(r\) split rule statistics based on [Weighted Variance Splitting]. The outcomes in the node are normalized to ensure that no particular \(j\)-specific statistic overwhelms the composite statistic. When \(Y_{i,j}\) is categorical for all \(j \in \{1, \ldots, r \}\) the model is a multivariate classification forest. In this case the split rule statistic is a composite of \(r\) split rule statistics based on [Weighted Gini Index Splitting]. No normalization is necessary since each \(j\)-specific statistic is an impurity measure that is invariant with respect to scale. In the general case where \(Y_{i,j}\) is real or categorical for each \(j \in \{1, \ldots, r \}\) the model is a multivariate mixed forest. In this case the split rule statistic is a composite of \(r\) split rule statistics based on [Weighted Variance Splitting] and [Weighted Gini Index Splitting]. Only those \(j\)-specific statistics representing real outcomes are normalized.
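
An example of a multivariate mixed forest call follows; the Multivar() formula interface is taken from the package documentation, and the specific recoding of mtcars is illustrative only.

# Multivariate mixed forest: one continuous and one categorical outcome.
library(randomForestSRC)
mtcars.mixed <- mtcars
mtcars.mixed$cyl <- factor(mtcars.mixed$cyl)   # treat cyl as categorical
mv.obj <- rfsrc(Multivar(mpg, cyl) ~ ., data = mtcars.mixed, ntree = 100)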

Terminal node estimators

There are \(r\) sets of terminal node estimators, one for each y-outcome, specific to the real or categorical cases previously described.

Prediction error

There is no prediction error implemented in this scenario.

Unsupervised

Unsupervised models are appropriate in settings where there is no y-outcome. In the unsupervised split rule, a subset of the x-variables (the size of which is specified by the formula) is chosen as the pseudo y-outcomes at each node, and the multivariate split rule in the [Multivariate] section is applied. Thus, each \(\tt mtry\) attempt results in a multivariate mixed split rule. Note importantly that all terminal node statistics and error rates are turned off under unsupervised splitting.
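
An example of an unsupervised call follows; the Unsupervised() formula interface is taken from the package documentation and should be treated as a usage sketch.

# Unsupervised forest: no y-outcome; all columns are treated as x-variables.
library(randomForestSRC)
us.obj <- rfsrc(Unsupervised() ~ ., data = mtcars, ntree = 100)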

Terminal node estimators

There are no terminal node estimators implemented in this scenario.

Prediction error

There is no prediction error implemented in this scenario.

Supplementary Code

Generalized code for simulating the spheres used in [Tree Decision Boundary] and elsewhere:

Generalized code for simulating the survival data set used in [Benchmark of Hybrid Parallel Processing]:

The primary harness used in [Benchmark of Hybrid Parallel Processing]:

The replica harness used in [Benchmark of Hybrid Parallel Processing]:

References