ModelArg (RF-SRC Spark Target 2.7.0 API)

java.lang.Object
- com.kogalur.randomforest.ModelArg

```
public class ModelArg
extends Object
```
Class containing the user-defined model arguments that produce the RandomForestModel object. Parameters are of two types: those that define how the forest is to be trained, and those that define the requested ensemble outputs.

Author:

Udaya Kogalur

Constructor Summary

Constructors
Constructor and Description
`ModelArg(String formula, org.apache.spark.sql.Dataset dataset)` Sets default values for the training parameters and ensemble outputs for the forest, given the formula and dataset.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`int`	`get_blockSize()` Returns the value for the block size associated with the reporting of the error rate.
`String`	`get_bootstrap()` Returns the type of bootstrap used in the model.
`double[]`	`get_caseWeight()` Returns the case weight vector.
`int`	`get_eventCount()` Returns the number of events in the data set, when survival or competing risk forests is in force.
`double[]`	`get_eventWeight()` Returns the event weight vector, when survival or competing risk forests is in force.
`String`	`get_family()` Returns the family of analysis intended by the `ModelArg` instance.
`int`	`get_htry()` Returns the maximum hypercube dimension to be considered in Greedy Splitting.
`int`	`get_mtry()` Returns the value for the number of x-variables to be randomly selected as candidates for splitting a node.
`int`	`get_nImpute()` Returns the number of iterations used by the missing data algorithm.
`int`	`get_nodeDepth()` Returns the maximum depth to which a tree should be grown.
`int`	`get_nodeSize()` Returns the desired value for the average number of unique cases in a terminal node.
`int`	`get_nSize()` Returns the number of records or rows (n) in the data set.
`int`	`get_nSplit()` Returns the parameter specifying deterministic versus non-deterministic splitting.
`int`	`get_ntree()` Returns the number of trees in the forest.
`int`	`get_rfCores()` Returns the number of cores to be used by the algorithm when OpenMP parallel processing is in force.
`int[][]`	`get_sample()` Returns the 2-D matrix explicitly specifying the bootstrap sample.
`int`	`get_sampleSize()` Returns the size of sample used in generating the bootstrap.
`String`	`get_sampleType()` Returns the type of sampling used in generating the bootstrap.
`int`	`get_seed()` Returns the seed for the random number generator used by the algorithm.
`String`	`get_splitRule()` Returns the split rule used in generating the model.
`double[]`	`get_timeInterest()` Returns the time interest vector used in the model, when survival or competing risk forests is in force.
`int`	`get_timeInterestSize()` Returns the size of the time interest vector used in the model, when survival or competing risk forests is in force.
`int`	`get_trace()` Returns the trace parameter indicating the specified update interval in seconds.
`double[][]`	`get_xData()` Returns a 2-D matrix of values representing the x-values.
`int[]`	`get_xLevel()` Returns a vector of length xSize (`get_xSize()`) containing the number of levels found in each x-variable.
`int`	`get_xSize()` Returns the number of x-variables in the data set.
`double[]`	`get_xStatisticalWeight()` Returns the x-variable statistical weight vector.
`char[]`	`get_xType()` Returns a vector of length xSize (`get_xSize()`) containing the x-variable types.
`double[]`	`get_xWeight()` Returns the x-variable weight vector.
`double[][]`	`get_yData()` Returns a 2-D matrix of values representing the y-values.
`int[]`	`get_yLevel()` Returns a vector of length ySize (`get_ySize()`) containing the number of levels found in each y-variable.
`int`	`get_ySize()` Returns the number of y-variables in the data set.
`int`	`get_ytry()` Returns the value for the number of y-variables to be randomly selected as pseudo-responses when unsupervised forests is in force.
`char[]`	`get_yType()` Returns a vector of length ySize (`get_ySize()`) containing the y-variable types.
`double[]`	`get_yWeight()` Returns the y-variable weight vector.
`String`	`getEnsembleArg(String key)` Returns the current value for the specified ensemble argument.
`void`	`set_blockSize()` Sets the default value for the block size associated with the reporting of the error rate.
`void`	`set_blockSize(int blockSize)` Sets the specified value for the block size associated with the reporting of the error rate.
`void`	`set_bootstrap()` Sets the bootstrap related parameters in the model.
`void`	`set_bootstrap(int ntree)` Sets the bootstrap related parameters in the model.
`void`	`set_bootstrap(int ntree, int[][] sample)` Sets the bootstrap related parameters in the model.
`void`	`set_bootstrap(int ntree, String bootstrap, String sampleType, int sampleSize, int[][] sample, double[] caseWeight)` Sets the bootstrap related parameters in the model.
`void`	`set_eventWeight()` Sets the event weight vector, when survival or competing risk forests is in force, to uniform weights.
`void`	`set_eventWeight(double[] weight)` Sets the event weight vector, when survival or competing risk forests is in force.
`void`	`set_htry(int htry)` Sets the maximum hypercube dimension to be considered in Greedy Splitting.
`void`	`set_mtry()` Sets the default value for the number of x-variables to be randomly selected as candidates for splitting a node.
`void`	`set_mtry(int mtry)` Sets the number of x-variables to be randomly selected as candidates for splitting a node.
`void`	`set_nImpute(int nImpute)` Sets the number of iterations for the missing data algorithm.
`void`	`set_nodeDepth(int nodeDepth)` Sets the maximum depth to which a tree should be grown.
`void`	`set_nodeSize()` Sets the default value for the average number of unique cases in a terminal node.
`void`	`set_nodeSize(int nodeSize)` Sets the desired average number of unique cases in a terminal node.
`void`	`set_nSplit()` Sets the default value for the parameter specifying deterministic versus non-deterministic splitting.
`void`	`set_nSplit(int nSplit)` Sets the parameter specifying deterministic versus non-deterministic splitting.
`void`	`set_rfCores(int rfCores)` Sets the number of cores to be used by the algorithm when OpenMP parallel processing is in force.
`void`	`set_seed(int seed)` Sets the seed for the random number generator used by the algorithm.
`void`	`set_splitRule()` Sets the default split rule, based on the data set and formula.
`void`	`set_splitRule(String splitRule)` Sets the split rule to be used in generating the model.
`void`	`set_timeInterest()` Sets the time interest vector to the default value, when survival or competing risk forests is in force.
`void`	`set_timeInterest(double[] timeInterest)` Sets the time interest vector, when survival or competing risk forests is in force.
`void`	`set_trace(int trace)` Sets the trace parameter indicating the specified update interval in seconds.
`void`	`set_xStatisticalWeight()` Sets the x-variable statistical weight vector to uniform weights.
`void`	`set_xStatisticalWeight(double[] weight)` Sets the x-variable statistical weight vector.
`void`	`set_xWeight()` Sets the x-variable weight vector to uniform weights.
`void`	`set_xWeight(double[] weight)` Sets the x-variable weight vector.
`void`	`set_ytry(int ytry)` Sets the number of randomly selected pseudo-responses when unsupervised forests is in force.
`void`	`set_yWeight()` Sets the y-variable weight vector to uniform weights.
`void`	`set_yWeight(double[] weight)` Sets the y-variable weight vector.
`void`	`setEnsembleArg()` Sets default values for the ensemble outputs resulting from the model.
`void`	`setEnsembleArg(String key, String value)` Sets the ensemble outputs desired from the model.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

ModelArg

public ModelArg(String formula,
                org.apache.spark.sql.Dataset dataset)

Sets default values for the training parameters and ensemble outputs for the forest, given the formula and dataset. The default values are specific to the family (RF-S, RF-R, RF-C, RF-R+, RF-C+, RF-M+, RF-U), and can be customized using the methods provided by this class.

Examples of formulae for various families follow:

Description	Example Formula	Data Set
survival or competing risk	Surv(time, status) ~ .	Veteran's Administration Lung Cancer Trial
regression	Ozone ~.	New York Air Quality Measurements
classification	Species ~.	Edgar Anderson's Iris Data
multivariate regression	Multivar(mpg, cyl) ~ .	Motor Trend Car Road Tests
multivariate regression	Multivariate(mpg, wt) ~ hp + drat	Motor Trend Car Road Tests
unsupervised	Unsupervised() ~.	Veteran's Administration Lung Cancer Trial

An example found in the test classes follows:


```
// `spark` is an existing SparkSession.
Dataset irisDF = spark
    .read()
    .option("header", "true")
    .option("inferSchema", "true")
    .format("csv")
    .load("./test-classes/data/iris.csv");

ModelArg modelArg = new ModelArg("Species ~ .", irisDF);
```

An overview of all the data sets used above can be found here.

Parameters:

formula - Specification of the y-variables and x-variables that are to be used in the model. These refer to the column names in the Spark DataFrame. The y-variables and x-variables are separated by the tilde (~) character. A period (.) indicates that the complement of the y-variables is to be used as the x-variables. See the example above for more information.

dataset - A Spark Dataset.
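The formula convention described for the `formula` parameter can be illustrated with a small stand-alone sketch. This is not the library's parser: `FormulaSketch` and its methods are hypothetical names, and the handling of wrapper forms such as `Surv(...)` is an assumption based only on the example formulae listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the RF-SRC API. */
final class FormulaSketch {

    /** Returns the y-variable names found on the left of the tilde. */
    static List<String> yVariables(String formula) {
        int tilde = formula.indexOf('~');
        if (tilde < 0) throw new IllegalArgumentException("formula must contain '~'");
        String lhs = formula.substring(0, tilde).trim();
        int open = lhs.indexOf('(');
        int close = lhs.lastIndexOf(')');
        // Assumption: wrapper forms such as Surv(time, status) list the names in parentheses.
        String names = (open >= 0 && close > open) ? lhs.substring(open + 1, close) : lhs;
        return splitNames(names, ",");
    }

    /** Returns the x-variable names; a period means "all columns that are not y-variables". */
    static List<String> xVariables(String formula, List<String> columns) {
        int tilde = formula.indexOf('~');
        if (tilde < 0) throw new IllegalArgumentException("formula must contain '~'");
        String rhs = formula.substring(tilde + 1).trim();
        if (rhs.equals(".")) {
            List<String> x = new ArrayList<>(columns);
            x.removeAll(yVariables(formula));
            return x;
        }
        return splitNames(rhs, "\\+");
    }

    private static List<String> splitNames(String text, String separatorRegex) {
        List<String> out = new ArrayList<>();
        for (String token : text.split(separatorRegex)) {
            String name = token.trim();
            if (!name.isEmpty()) out.add(name);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("mpg", "cyl", "hp", "drat", "wt");
        System.out.println(yVariables("Multivariate(mpg, wt) ~ hp + drat"));
        System.out.println(xVariables("Multivariate(mpg, wt) ~ .", columns));
    }
}
```

The column list passed to `xVariables` stands in for the column names of the Spark Dataset.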

Method Detail

get_family

public String get_family()

Returns the family of analysis intended by the ModelArg instance. It is one of the following:

  
  

Family	Description
RF-S	survival or competing risk
RF-R	regression
RF-C	classification
RF-R+	multivariate regression
RF-C+	multivariate classification
RF-M+	multivariate mixed
RF-U	unsupervised

Returns:

The family of analysis.

get_nSize
```
public int get_nSize()
```
Returns the number of records or rows (n) in the data set.

Returns:

The number of records or rows (n) in the data set.

get_ySize
```
public int get_ySize()
```
Returns the number of y-variables in the data set. This will be one (1) if the analysis is univariate, two (2) if the analysis is survival related, zero (0) if the analysis is unsupervised, and greater than one if the analysis is multivariate.

Returns:

The number of y-variables in the data set.

get_xSize
```
public int get_xSize()
```
Returns the number of x-variables in the data set. This will always be greater than zero.

Returns:

The number of x-variables in the data set.

get_yType

public char[] get_yType()

Returns a vector of length ySize (get_ySize()) containing the y-variable types. The vector will be null if the family is unsupervised. The types are as follows:

  
 
  

Description	Value
time	T
censoring	S
boolean	B
real	R
ordinal	O
categorical	C
integer	I

Returns:

The y-variable types.
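The type codes in the table above can be decoded with a simple lookup. The following is a minimal sketch with a hypothetical class name; the `{'T', 'S'}` array is only an assumed stand-in for what `get_yType()` might return for a survival formula.

```java
/** Hypothetical helper, not part of the RF-SRC API. */
final class TypeCodeSketch {

    /** Maps a single-character type code to the description given in the table. */
    static String describe(char code) {
        switch (code) {
            case 'T': return "time";
            case 'S': return "censoring";
            case 'B': return "boolean";
            case 'R': return "real";
            case 'O': return "ordinal";
            case 'C': return "categorical";
            case 'I': return "integer";
            default: throw new IllegalArgumentException("Unknown type code: " + code);
        }
    }

    public static void main(String[] args) {
        char[] yType = {'T', 'S'};  // assumed example values, not actual library output
        for (char code : yType) {
            System.out.println(code + " = " + describe(code));
        }
    }
}
```

The same lookup applies to the x-variable types, which use the subset B, R, O, C, and I.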

get_xType

public char[] get_xType()

Returns a vector of length xSize (get_xSize()) containing the x-variable types. The types are as follows:

 
 
  

Description	Value
boolean	B
real	R
ordinal	O
categorical	C
integer	I

Returns:

The x-variable types.

get_yLevel
```
public int[] get_yLevel()
```
Returns a vector of length ySize (get_ySize()) containing the number of levels found in each y-variable. The elements of this vector will be non-zero for ordinal and categorical variables. All other elements assume the value of zero (0).

Returns:

The number of levels found in each y-variable

get_xLevel
```
public int[] get_xLevel()
```
Returns a vector of length xSize (get_xSize()) containing the number of levels found in each x-variable. The elements of this vector will be non-zero for ordinal and categorical variables. All other elements assume the value of zero (0).

Returns:

The number of levels found in each x-variable.

get_yData
```
public double[][] get_yData()
```
Returns a 2-D matrix of values representing the y-values. The dimensions of this matrix are [ySize] x [nSize]. Note that the boolean, categorical, and ordinal variable types will have been mapped to real values.

Returns:

The 2-D matrix of y-values.

get_xData
```
public double[][] get_xData()
```
Returns a 2-D matrix of values representing the x-values. The dimensions of this matrix are [xSize] x [nSize]. Note that the boolean, categorical, and ordinal variable types will have been mapped to real values.

Returns:

The 2-D matrix of x-values.
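Because the data matrices are laid out as [variables] x [cases], reading all values of a single record means walking down the first index. A minimal sketch of that access pattern follows; the class name and the literal matrix are hypothetical stand-ins for the result of `get_xData()`.

```java
/** Hypothetical helper, not part of the RF-SRC API. */
final class LayoutSketch {

    /** Collects the values of one case (record) across all variables. */
    static double[] caseRow(double[][] data, int caseIndex) {
        double[] row = new double[data.length];
        for (int v = 0; v < data.length; v++) {
            row[v] = data[v][caseIndex];  // data[v] holds variable v for every case
        }
        return row;
    }

    public static void main(String[] args) {
        // Two variables (xSize = 2) observed on three cases (nSize = 3).
        double[][] xData = {
            {1.0, 2.0, 3.0},
            {4.0, 5.0, 6.0}
        };
        double[] second = caseRow(xData, 1);
        System.out.println(second[0] + ", " + second[1]);
    }
}
```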

set_bootstrap

public void set_bootstrap(int ntree,
                          String bootstrap,
                          String sampleType,
                          int sampleSize,
                          int[][] sample,
                          double[] caseWeight)

Sets the bootstrap related parameters in the model.

Note that the parameters are interdependent. An explanation of the hierarchy, the interdependency, and the default values is below. Only specific combinations are valid. Additional overloads, such as set_bootstrap() and set_bootstrap(int), are provided to aid the user. When explicitly specifying the sample to be used in the bootstrap algorithm (bootstrap = user), the element sample[i][j] represents the number of times case j (in the data set) appears in the bootstrap sample for tree i. Thus a value of zero (0) implies that case j is out-of-bag in tree i. Ensure that the sample is coherent across the forest: for each tree i, the entries sample[i][j] should sum over j to sampleSize.

  
 
  
  
 


                                                                                       >0
                                                                swr                  /------ sampleSize, 
                                                              /------ sampleSize? --/        caseWeight (may be null)     
                                                             /                      \
                                      auto                  /                        \------ sampleSize = n,
                >0                  /------ sampleType? ---/                           =0    caseWeight (may be null) 
               ------ bootstrap? --/                       \                         
              /                    \                        \                           1 <= sampleSize <= n
   ntree?  --/                      \                        \                        /----- sampleSize,
             \                       \                        \------ sampleSize? -- /       caseWeight (may be null)
              \------ WARNING         \                         swor                 \
                =0                     \                                              \----- sampleSize = n * (e-1)/e,
                                        \                                               =0   caseWeight (may be null)     
                                         \
                                          \                     !null
                                           \                  /------ sampleType = swr, 
                                            ------ sample? --/        sampleSize (determined by sample), 
                                             user            \        ntree (determined by sample)
                                                              \
                                                               \------ ERROR 
                                                                 null
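The sampleSize branches of the diagram above can be read as a small resolution function. The sketch below is an interpretation, not the library's code: the class name is hypothetical, rounding n * (e-1)/e to the nearest integer is an assumption, and so is falling back to the default for values the diagram does not cover.

```java
/** Hypothetical helper, not part of the RF-SRC API. */
final class BootstrapDefaultsSketch {

    /** Resolves the effective sample size when bootstrap = auto. */
    static int resolveSampleSize(String sampleType, int requested, int n) {
        if (sampleType.equals("swr")) {
            // Sampling with replacement: any positive size is accepted, otherwise n.
            return requested > 0 ? requested : n;
        }
        if (sampleType.equals("swor")) {
            // Sampling without replacement: the size cannot exceed n.
            if (requested >= 1 && requested <= n) {
                return requested;
            }
            // Diagram default of n * (e - 1) / e; the rounding rule is assumed.
            return (int) Math.round(n * (Math.E - 1.0) / Math.E);
        }
        throw new IllegalArgumentException("sampleType must be swr or swor");
    }

    public static void main(String[] args) {
        System.out.println(resolveSampleSize("swr", 0, 150));
        System.out.println(resolveSampleSize("swor", 0, 1000));
    }
}
```

The swor default corresponds to roughly 63.2% of the cases, the expected fraction of distinct cases in a bootstrap sample of size n drawn with replacement.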

Finally, note that when bootstrap = auto is in force, it is also possible to use case weights in conjunction with sampleType.

Parameter	Default Value	Possible Values
ntree	1000	> 0
bootstrap	auto	auto, user
sampleType	swr	swr (sampling with replacement), swor (sampling without replacement)
sampleSize	n	> 0
sample	null	null, 2-D matrix of dimension [ntree] x [nSize]

Parameters:

ntree - Number of trees in the forest.

bootstrap - Type of bootstrap used in the model.

sampleType - Type of sampling used in generating the bootstrap.

sampleSize - Size of sample used in generating the bootstrap.

sample - 2-D matrix explicitly specifying the bootstrap sample.

caseWeight - The case weight vector. This vector must be of length nSize (get_nSize()). This is a vector of non-negative weights where, after normalizing, weight[k] is the probability of selecting case k as a candidate when bootstrap = auto. The default is to use uniform weights for selection. It is generally better to use real weights rather than integers. With larger values of nSize, the slightly different sampling algorithms deployed in the two scenarios can result in dramatically different execution times.
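When supplying an explicit sample, it can help to build and check the matrix before handing it to the model. The sketch below assumes the [ntree] x [nSize] layout given in the table, so the counts for tree i are the entries of sample[i]. The class and method names are hypothetical, and `java.util.Random` is used purely for illustration.

```java
import java.util.Random;

/** Hypothetical helper, not part of the RF-SRC API. */
final class UserSampleSketch {

    /** Draws sampleSize cases with replacement for each tree and counts them. */
    static int[][] drawWithReplacement(int ntree, int nSize, int sampleSize, long seed) {
        Random rng = new Random(seed);
        int[][] sample = new int[ntree][nSize];
        for (int i = 0; i < ntree; i++) {
            for (int draw = 0; draw < sampleSize; draw++) {
                sample[i][rng.nextInt(nSize)]++;  // case j appears once more in tree i
            }
        }
        return sample;
    }

    /** Checks that every tree's counts are non-negative and sum to sampleSize. */
    static boolean isCoherent(int[][] sample, int sampleSize) {
        for (int[] tree : sample) {
            int total = 0;
            for (int count : tree) {
                if (count < 0) return false;
                total += count;
            }
            if (total != sampleSize) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int ntree = 5;
        int nSize = 10;
        int[][] sample = drawWithReplacement(ntree, nSize, nSize, 42L);
        System.out.println("coherent: " + isCoherent(sample, nSize));
        // With the library on the classpath, the matrix could then be passed as:
        // modelArg.set_bootstrap(ntree, sample);
    }
}
```

A zero entry in sample[i][j] marks case j as out-of-bag for tree i.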

set_bootstrap
```
public void set_bootstrap(int ntree)
```
Sets the bootstrap related parameters in the model. Use default values for all unspecified parameters.

Parameters:

ntree - Number of trees in the random forest.

See Also:

set_bootstrap(int, String, String, int, int[][], double[])

set_bootstrap
```
public void set_bootstrap(int ntree,
                          int[][] sample)
```
Sets the bootstrap related parameters in the model. Use default values for all unspecified parameters.

Parameters:

ntree - Number of trees in the random forest.

sample - 2-D matrix explicitly specifying the bootstrap sample.

See Also:

set_bootstrap(int, String, String, int, int[][], double[])

set_bootstrap
```
public void set_bootstrap()
```
Sets the bootstrap related parameters in the model. Use default values for all parameters.

See Also:

set_bootstrap(int, String, String, int, int[][], double[])

get_bootstrap
```
public String get_bootstrap()
```
Returns the type of bootstrap used in the model.

Returns:

The type of bootstrap used in the model.

See Also:

set_bootstrap(int, String, String, int, int[][], double[])

get_sampleType
```
public String get_sampleType()
```
Returns the type of sampling used in generating the bootstrap.

Returns:

The type of sampling used in generating the bootstrap.

See Also:

set_bootstrap(int, String, String, int, int[][], double[])

get_sampleSize
```
public int get_sampleSize()
```
Returns the size of sample used in generating the bootstrap.

Returns:

The size of sample used in generating the bootstrap.

See Also:

set_bootstrap(int, String, String, int, int[][], double[])

get_sample
```
public int[][] get_sample()
```
Returns the 2-D matrix explicitly specifying the bootstrap sample.

Returns:

The 2-D matrix explicitly specifying the bootstrap sample.

See Also:

set_bootstrap(int, String, String, int, int[][], double[])

get_ntree
```
public int get_ntree()
```
Returns the number of trees in the forest.

Returns:

The number of trees in the forest.

See Also:

set_bootstrap(int, String, String, int, int[][], double[])

get_blockSize
```
public int get_blockSize()
```
Returns the value for the block size associated with the reporting of the error rate.

Returns:

The value for the block size associated with the reporting of the error rate.

See Also:

set_blockSize(int)

set_blockSize
```
public void set_blockSize()
```
Sets the default value for the block size associated with the reporting of the error rate. This value defaults to ntree.

See Also:

set_blockSize(int)

set_blockSize
```
public void set_blockSize(int blockSize)
```
Sets the specified value for the block size associated with the reporting of the error rate. If the value is out of range, the default value will be applied.

See Also:

set_blockSize()

set_mtry

public void set_mtry(int mtry)

Sets the number of x-variables to be randomly selected as candidates for splitting a node.

Parameters:

mtry - The number of x-variables to be randomly selected as candidates for splitting a node. This number must be such that 1 ≤ mtry ≤ xSize. If the value is out of range, the default value will be applied:

  
 
  
Family	Default Value
RF-R, RF-R+	xSize/3
all others	sqrt(xSize)

The default is to use uniform weights for selection, though this can be changed with set_xWeight(double[]).
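The default rule in the table can be sketched as follows. The source does not state how non-integer values are rounded, so rounding up (and never going below one) is an assumption here. The class and method names are hypothetical.

```java
/** Hypothetical helper, not part of the RF-SRC API. */
final class MtryDefaultSketch {

    /** Default mtry per the table: xSize/3 for regression families, sqrt(xSize) otherwise. */
    static int defaultMtry(String family, int xSize) {
        boolean regression = family.equals("RF-R") || family.equals("RF-R+");
        double raw = regression ? xSize / 3.0 : Math.sqrt(xSize);
        return Math.max(1, (int) Math.ceil(raw));  // rounding up is an assumption
    }

    /** Applies the documented range check 1 <= mtry <= xSize, else falls back to the default. */
    static int effectiveMtry(int requested, String family, int xSize) {
        boolean inRange = requested >= 1 && requested <= xSize;
        return inRange ? requested : defaultMtry(family, xSize);
    }

    public static void main(String[] args) {
        System.out.println(defaultMtry("RF-R", 9));
        System.out.println(effectiveMtry(0, "RF-C", 4));
    }
}
```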







set_mtry
public void set_mtry()
Sets the default value for the number of x-variables to be randomly selected as candidates for splitting a node.

See Also:
set_mtry(int)








get_mtry
public int get_mtry()
Returns the value for the number of x-variables to be randomly selected as candidates for splitting a node.

Returns:
The value for the number of x-variables to be randomly selected as candidates for splitting a node.
See Also:
set_mtry(int)








set_htry
public void set_htry(int htry)
Sets the maximum hypercube dimension to be considered in Greedy Splitting.

Parameters:
htry - The maximum hypercube dimension to be considered in Greedy Splitting.
 If htry = 0, Standard Splitting is in effect.  If htry > 0, Greedy Splitting is in effect.
 If the value is out of range, the default value of zero will be applied.

htry	Protocol
0	standard splitting
htry > 0	greedy splitting
  
 









get_htry
public int get_htry()
Returns the maximum hypercube dimension to be considered in Greedy Splitting.

Returns:
The maximum hypercube dimension to be considered in Greedy Splitting.
See Also:
set_htry(int)








set_nImpute
public void set_nImpute(int nImpute)
Sets the number of iterations for the missing data algorithm.

Parameters:
nImpute - The number of iterations for the missing data algorithm.
 The default value is one (1).
 Performance measures such as out-of-bag (OOB) error rates tend
 to become optimistic if nImpute > 1.








get_nImpute
public int get_nImpute()
Returns the number of iterations used by the missing data algorithm.

Returns:
The number of iterations used by the missing data algorithm.
See Also:
set_nImpute(int)








get_caseWeight
public double[] get_caseWeight()
Returns the case weight vector.

Returns:
The case weight vector.
See Also:
set_bootstrap(int, String, String, int, int[][], double[])








set_ytry
public void set_ytry(int ytry)
Sets the number of randomly selected pseudo-responses when
 unsupervised forests is in force.

Parameters:
ytry - The number of randomly selected pseudo-responses when
 unsupervised forests is in force. The default value is one
 (1).  This means that at every node, and at every split attempt, one y-variable will be
 selected from the (xSize - 1) remaining x-variables when calculating the split statistic.
 This number must be such that 1 ≤ ytry ≤ (xSize - 1).








get_ytry
public int get_ytry()
Returns the value for the number of y-variables to be randomly selected as pseudo-responses when unsupervised forests is in force.

Returns:
The value for the number of y-variables to be randomly selected as pseudo-responses when unsupervised forests is in force.
See Also:
set_ytry(int)








set_yWeight
public void set_yWeight(double[] weight)
Sets the y-variable weight vector.

Parameters:
weight - The y-variable weight vector.  This vector must
 be of length ySize (get_ySize()).  This is a vector of
 non-negative weights.  The vector has two purposes.  Purpose 1:
 After normalizing, weight[k] is the probability of selecting
 y-variable k to include in the multivariate split statistic.
 This is useful in big-r situations when ySize is very large and
 the user desires to restrict the split statistic calculation to
 ytry (get_ytry()) y-variables instead of all ySize
 y-variables.  Purpose 2: All y-variables with weight zero
 define a special feature matrix.  This feature matrix is
 presented to the user when custom splitting is in effect.  All
 y-variables with non-zero weight arrive in the custom split
 rule as usual.  In both uses, the default is to use uniform
 weights.  For Purpose 1, it is generally better to use real
 weights rather than integers.  With larger values of ySize, the
 slightly different sampling algorithms deployed in the two
 scenarios can result in dramatically different execution times.
 For Purpose 2, the only value that matters is the presence of
 zero.
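For Purpose 2, only the position of the zeros matters. A minimal sketch of how a weight vector partitions the y-variables is shown below; the class and method names are hypothetical, and the actual routing to the custom split rule happens inside the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the RF-SRC API. */
final class YWeightSketch {

    /** Indices of y-variables with weight zero (the special feature matrix). */
    static List<Integer> featureIndices(double[] weight) {
        List<Integer> out = new ArrayList<>();
        for (int k = 0; k < weight.length; k++) {
            if (weight[k] == 0.0) out.add(k);
        }
        return out;
    }

    /** Indices of y-variables with non-zero weight (passed to the split rule as usual). */
    static List<Integer> responseIndices(double[] weight) {
        List<Integer> out = new ArrayList<>();
        for (int k = 0; k < weight.length; k++) {
            if (weight[k] > 0.0) out.add(k);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] weight = {1.0, 0.0, 2.5};
        System.out.println("feature matrix: " + featureIndices(weight));
        System.out.println("responses: " + responseIndices(weight));
    }
}
```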








set_yWeight
public void set_yWeight()
Sets the y-variable weight vector to uniform weights.

See Also:
set_yWeight(double[])








get_yWeight
public double[] get_yWeight()
Returns the y-variable weight vector.

Returns:
The y-variable weight vector.
See Also:
set_yWeight(double[])








set_xWeight
public void set_xWeight(double[] weight)
Sets the x-variable weight vector.

Parameters:
weight - The x-variable weight vector.  This vector 
 must be of length xSize (get_xSize()).  This is a vector of non-negative
 weights where, after normalizing, weight[k] is the
 probability of selecting x-variable k as a candidate for splitting a node.
 The default is to use uniform weights for selection.  It is generally better to use real
 weights rather than integers.  With larger values of xSize, the
 slightly different sampling algorithms deployed in the two
 scenarios can result in dramatically different execution times.
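The normalization mentioned above divides each weight by the total, so that the weights become selection probabilities. A minimal sketch follows, with a hypothetical class name; the library performs its own normalization internally.

```java
/** Hypothetical helper, not part of the RF-SRC API. */
final class WeightNormalizeSketch {

    /** Converts non-negative weights into probabilities that sum to one. */
    static double[] normalize(double[] weight) {
        double total = 0.0;
        for (double w : weight) {
            if (w < 0.0) throw new IllegalArgumentException("weights must be non-negative");
            total += w;
        }
        if (total == 0.0) throw new IllegalArgumentException("at least one weight must be positive");
        double[] probability = new double[weight.length];
        for (int k = 0; k < weight.length; k++) {
            probability[k] = weight[k] / total;
        }
        return probability;
    }

    public static void main(String[] args) {
        // The third x-variable is twice as likely to be a split candidate as the others.
        double[] p = normalize(new double[]{1.0, 1.0, 2.0});
        System.out.println(p[0] + " " + p[1] + " " + p[2]);
    }
}
```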








set_xWeight
public void set_xWeight()
Sets the x-variable weight vector to uniform weights.

See Also:
set_xWeight(double[])








get_xWeight
public double[] get_xWeight()
Returns the x-variable weight vector.

Returns:
The x-variable weight vector.
See Also:
set_xWeight(double[])








set_xStatisticalWeight
public void set_xStatisticalWeight(double[] weight)
Sets the x-variable statistical weight vector.

Parameters:
weight - The x-variable statistical weight vector.  This vector  must be of
 length xSize (get_xSize()). This is a vector of
 non-negative weights where, after normalizing, weight[k] is the
 multiplier by which the split statistic for an x-variable is
 adjusted.  A large value encourages the node to split on the
 x-variable. The default is to use uniform weights so that all
 x-variables are treated equally.








set_xStatisticalWeight
public void set_xStatisticalWeight()
Sets the x-variable statistical weight vector to uniform weights.

See Also:
set_xStatisticalWeight(double[])








get_xStatisticalWeight
public double[] get_xStatisticalWeight()
Returns the x-variable statistical weight vector.

Returns:
The x-variable statistical weight vector.
See Also:
set_xStatisticalWeight(double[])








set_eventWeight
public void set_eventWeight(double[] weight)
Sets the event weight vector, when survival or competing risk
 forests is in force.

Parameters:
weight - The event weight vector, when survival or competing risk forests is in force.  This vector must be the same length as the
 number of events in the data set (get_eventCount()). 
 This is a vector of non-negative weights,
 where, after normalizing, weight[k] is the multiplier by which
 the component of the split statistic related to event[k] is adjusted.
 The default is to use a composite splitting rule which is an average over all event types (a democratic approach).
 To single out an event type, set all weights other than the one you are interested in to zero (0).
 Finally, note that regardless of how the weight vector is specified, the returned forest object always
 provides estimates for all event types.
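Singling out one event type amounts to a weight vector with a single non-zero entry. The sketch below builds such a vector; the class name is hypothetical, and treating `eventIndex` as a zero-based position in the weight vector is an assumption about how event types map to indices.

```java
/** Hypothetical helper, not part of the RF-SRC API. */
final class EventWeightSketch {

    /** Builds a weight vector that is zero everywhere except at eventIndex. */
    static double[] singleOut(int eventCount, int eventIndex) {
        if (eventIndex < 0 || eventIndex >= eventCount) {
            throw new IllegalArgumentException("eventIndex out of range");
        }
        double[] weight = new double[eventCount];  // Java initializes entries to 0.0
        weight[eventIndex] = 1.0;
        return weight;
    }

    public static void main(String[] args) {
        double[] weight = singleOut(2, 0);
        System.out.println(weight[0] + " " + weight[1]);
        // With the library on the classpath, the vector could then be passed as:
        // modelArg.set_eventWeight(weight);
    }
}
```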








set_eventWeight
public void set_eventWeight()
Sets the event weight vector, when survival or competing risk
 forests is in force, to uniform weights.

See Also:
set_eventWeight(double[])








get_eventWeight
public double[] get_eventWeight()
Returns the event weight vector, when survival or competing risk
 forests is in force.

Returns:
The event weight vector, when survival or competing risk
 forests is in force.
See Also:
set_eventWeight(double[])








get_eventCount
public int get_eventCount()
Returns the number of events in the data set, when survival or competing risk forests is in force.

Returns:
The number of events in the data set, when survival or competing risk forests is in force.








set_timeInterest
public void set_timeInterest(double[] timeInterest)
Sets the time interest vector, when survival or competing risk
 forests is in force.

Parameters:
timeInterest - The time interest vector, when survival or competing risk
 forests is in force.  This is a vector of real values to be
 used to constrain the ensemble calculations. Using time points
 at which events do not occur does not result in information
 gain.  The default action is to use all observed event times in
 the data set.
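The default described above, all observed event times, can be sketched as the sorted distinct times of the non-censored cases. The class name is hypothetical, and the convention that a status greater than zero marks an event (with zero meaning censored) is an assumption not stated on this page.

```java
import java.util.TreeSet;

/** Hypothetical helper, not part of the RF-SRC API. */
final class TimeInterestSketch {

    /** Returns the sorted, distinct times at which an event was observed. */
    static double[] observedEventTimes(double[] time, double[] status) {
        TreeSet<Double> unique = new TreeSet<>();
        for (int i = 0; i < time.length; i++) {
            if (status[i] > 0) {       // assumed: zero means censored
                unique.add(time[i]);
            }
        }
        double[] out = new double[unique.size()];
        int k = 0;
        for (double t : unique) {
            out[k++] = t;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] time = {5.0, 3.0, 9.0, 5.0};
        double[] status = {1.0, 2.0, 0.0, 1.0};
        double[] interest = observedEventTimes(time, status);
        System.out.println(interest.length + " time points of interest");
    }
}
```

A coarser grid can be obtained by thinning this vector before passing it to set_timeInterest(double[]).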








set_timeInterest
public void set_timeInterest()
Sets the time interest vector to the default value, when survival or competing risk
 forests is in force.

See Also:
set_timeInterest(double[])








get_timeInterest
public double[] get_timeInterest()
Returns the time interest vector used in the model, when survival or competing risk
 forests is in force.

Returns:
The time interest vector used in the model, when survival or competing risk
 forests is in force.
See Also:
set_timeInterest(double[])








get_timeInterestSize
public int get_timeInterestSize()
Returns the size of the time interest vector used in the model, when survival or competing risk
 forests is in force.

Returns:
The size of the time interest vector used in the model, when survival or competing risk
 forests is in force.
See Also:
get_timeInterest()








set_splitRule
public void set_splitRule(String splitRule)
Sets the split rule to be used in generating the model.

Parameters:
splitRule - The split rule to be used in generating the
 model.  The split rules available are detailed below. The rule
 in bold denotes the default split rule for each family. The
 default split rule is applied when the user does not specify a
 split rule. Survival and Competing Risk both have two split
 rules. Regression has three flavours of split rules based on
 mean-squared error. Classification has three flavours of split
 rules based on the Gini index, and one additional rule for
 ordinal outcomes. The Multivariate and Unsupervised split rules
 are a composite rule based on Regression and
 Classification. Each component of the composite is normalized
 so that the magnitude of any one y-variable does not influence
 the statistic. All families also allow the user to define a
 custom split rule statistic. Some basic C-programming skills
 are required. Examples for all the families reside in the C
 source code directory of the package in the file
 src/main/c/splitCustom.c. Note that recompiling
 and re-installing the package is necessary after modifying the
 source code.
 
   
 
  
Family	Split Rule Description	Value
survival	log-rank	logrank
survival	log-rank score	logrankscore
competing risk	log-rank modified weighted	logrankCR
competing risk	log-rank	logrankACR
regression	mean-squared error weighted	mse
regression	mean-squared error unweighted	mse.unwt
regression	mean-squared error heavy weighted	mse.hvwt
classification	Gini index weighted	gini
classification	Gini index unweighted	gini.unwt
classification	Gini index heavy weighted	gini.hvwt
classification	Ranked Probability Score	rps
multivariate regression	Composite mean-squared error	mv.mse
multivariate classification	Composite Gini index	mv.gini
multivariate mixed	Composite Gini and MSE	mv.mix
unsupervised	pseudo-response adaptive	unsupv
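
The split rule values listed for each family can be checked before they are passed to set_splitRule(String). The following is a minimal sketch, not part of the RF-SRC API: the class SplitRuleTable and its method isValid are hypothetical, and the family labels are copied from the table text, which may differ from the strings returned by get_family().

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the RF-SRC API.
class SplitRuleTable {
    // Family label -> split rule values, transcribed from the documented table.
    static final Map<String, List<String>> RULES = Map.of(
        "survival", List.of("logrank", "logrankscore"),
        "competing risk", List.of("logrankCR", "logrankACR"),
        "regression", List.of("mse", "mse.unwt", "mse.hvwt"),
        "classification", List.of("gini", "gini.unwt", "gini.hvwt", "rps"),
        "multivariate regression", List.of("mv.mse"),
        "multivariate classification", List.of("mv.gini"),
        "multivariate mixed", List.of("mv.mix"),
        "unsupervised", List.of("unsupv"));

    // True when the split rule value is listed for the given family label.
    static boolean isValid(String family, String splitRule) {
        List<String> rules = RULES.get(family);
        return rules != null && rules.contains(splitRule);
    }

    public static void main(String[] args) {
        System.out.println(isValid("survival", "logrankscore")); // true
        System.out.println(isValid("regression", "gini"));       // false
    }
}
```

A value that passes this check would then be handed to a ModelArg instance through set_splitRule(String).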
  
  

  








set_splitRule
public void set_splitRule()
Sets the default split rule, based on the data set and formula.

See Also:
set_splitRule(String)








get_splitRule
public String get_splitRule()
Returns the split rule used in generating the model.

Returns:
The split rule used in generating the model.
See Also:
set_splitRule(String)








set_nSplit
public void set_nSplit(int nSplit)
Sets the parameter specifying deterministic versus non-deterministic splitting.

Parameters:
nSplit - The parameter specifying deterministic versus non-deterministic splitting.  The parameter must be 
 a non-negative integer value.  When zero (0), deterministic
 splitting for an x-variable is in force.  When non-zero, a
 maximum of nSplit points are randomly chosen among the
 possible split points for an x-variable. This can
 significantly decrease computation time over deterministic splitting.  The
 default value for this parameter varies with the split rule:  When pure
 random splitting is in force, the default and only value for this parameter is one (1).
 When any other split rule is in force, the default value is
 zero (0).
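
The difference between deterministic and non-deterministic splitting can be illustrated with a small sketch. It assumes that the possible split points for one x-variable are already available as a list; the class SplitPointSampler and its method candidates are hypothetical and do not reflect the library's internal code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of the nSplit semantics, not the library's code.
class SplitPointSampler {
    // nSplit == 0: every possible split point is a candidate (deterministic).
    // nSplit > 0: at most nSplit split points are chosen at random.
    static List<Double> candidates(List<Double> splitPoints, int nSplit, Random rng) {
        if (nSplit < 0) {
            throw new IllegalArgumentException("nSplit must be non-negative");
        }
        if (nSplit == 0 || nSplit >= splitPoints.size()) {
            return new ArrayList<>(splitPoints);
        }
        List<Double> shuffled = new ArrayList<>(splitPoints);
        Collections.shuffle(shuffled, rng);
        return new ArrayList<>(shuffled.subList(0, nSplit));
    }

    public static void main(String[] args) {
        List<Double> points = List.of(0.5, 1.5, 2.5, 3.5, 4.5);
        Random rng = new Random(7L);
        System.out.println(candidates(points, 0, rng).size()); // 5
        System.out.println(candidates(points, 2, rng).size()); // 2
    }
}
```

With fewer candidates per x-variable, fewer split statistics have to be evaluated at each node, which is where the reduction in computation time comes from.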








set_nSplit
public void set_nSplit()
Sets the default value for the parameter specifying deterministic versus non-deterministic splitting.

See Also:
set_nSplit(int)








get_nSplit
public int get_nSplit()
Returns the parameter specifying deterministic versus non-deterministic splitting.

Returns:
The parameter specifying deterministic versus non-deterministic splitting.
See Also:
set_nSplit(int)








set_nodeSize
public void set_nodeSize(int nodeSize)
Sets the desired average number of unique cases in a terminal
 node.

Parameters:
nodeSize - The desired average number of unique cases in a terminal
 node. The parameter ensures that the average node size across
 the forest will be at least nodeSize.  Some nodes will be
 smaller than this value and some will be larger.  The default
 value for this parameter varies with the family, though it is
 recommended to experiment with different values.

   
 
  
Family	Default Node Size
survival	3
competing risk	6
regression	5
multivariate regression	5
classification	1
multivariate classification	1
multivariate mixed	3
unsupervised	3
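
The family defaults above amount to a simple lookup. The sketch below transcribes them into a map; the class NodeSizeDefaults and its method defaultFor are hypothetical, and the family labels are taken from the table text rather than from the values returned by get_family().

```java
import java.util.Map;

// Hypothetical lookup of the documented defaults, not part of the RF-SRC API.
class NodeSizeDefaults {
    static final Map<String, Integer> DEFAULTS = Map.of(
        "survival", 3,
        "competing risk", 6,
        "regression", 5,
        "multivariate regression", 5,
        "classification", 1,
        "multivariate classification", 1,
        "multivariate mixed", 3,
        "unsupervised", 3);

    // Returns the documented default node size for a family label.
    static int defaultFor(String family) {
        Integer size = DEFAULTS.get(family);
        if (size == null) {
            throw new IllegalArgumentException("Unknown family: " + family);
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(defaultFor("competing risk")); // 6
    }
}
```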
  
  

  








set_nodeSize
public void set_nodeSize()
Sets the default value for the average number of unique cases in a terminal node.

See Also:
set_nodeSize(int)








get_nodeSize
public int get_nodeSize()
Returns the desired value for the average number of unique cases in a terminal node.

Returns:
The desired value for the average number of unique cases in a terminal node.
See Also:
set_nodeSize(int)








set_nodeDepth
public void set_nodeDepth(int nodeDepth)
Sets the maximum depth to which a tree should be grown.

Parameters:
nodeDepth - The maximum depth to which a tree should be grown. The
 default behaviour is that this parameter is ignored. Not
 setting this parameter or setting this parameter to a negative
 value will ensure that this parameter is ignored.








get_nodeDepth
public int get_nodeDepth()
Returns the maximum depth to which a tree should be grown.

Returns:
The maximum depth to which a tree should be grown.








set_seed
public void set_seed(int seed)
Sets the seed for the random number generator used by the
 algorithm.  This must be a negative number.  The seed is an
 important parameter if repeatability of the generated model is
 required.  Generally speaking, growing a model using the same
 data set, the same model parameters, and the same seed will
 result in identical models. When large amounts of missing data
 are involved, there can be slight variations due to Monte Carlo
 effects. If the parameter is not set by the user, it can always
 be recovered with get_seed().
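
The two points made above, that the seed must be negative and that a fixed seed gives repeatable results, can be sketched as follows. This is an illustration only: java.util.Random stands in for the algorithm's own generator, and the class SeedCheck with its methods requireNegative and draws is hypothetical.

```java
import java.util.Random;

// Hypothetical illustration; java.util.Random is a stand-in generator.
class SeedCheck {
    // Rejects seeds that are not negative, mirroring the documented constraint.
    static int requireNegative(int seed) {
        if (seed >= 0) {
            throw new IllegalArgumentException("seed must be negative: " + seed);
        }
        return seed;
    }

    // Draws n pseudo-random numbers from a generator seeded with the given value.
    static double[] draws(int seed, int n) {
        Random rng = new Random(requireNegative(seed));
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextDouble();
        }
        return out;
    }

    public static void main(String[] args) {
        double[] first = draws(-42, 3);
        double[] second = draws(-42, 3);
        // The same seed reproduces the same sequence.
        System.out.println(java.util.Arrays.equals(first, second)); // true
    }
}
```

In practice, recording the value returned by get_seed() after a run and passing it to set_seed(int) on a later run is what makes the model reproducible.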







get_seed
public int get_seed()
Returns the seed for the random number generator used by the
 algorithm.

Returns:
The seed for the random number generator used by the algorithm.
See Also:
set_seed(int)








set_trace
public void set_trace(int trace)
Sets the trace parameter indicating the specified update
 interval in seconds.

Parameters:
trace - The trace parameter indicating the specified update
 interval in seconds. During extended execution times,
 the approximate time to complete the execution is output to a
 trace file in the user's HOME directory.  The format and
 location of the trace file can be controlled by modifying
 src/main/resources/spark/log.properties.  A value
 of zero (0) turns off the trace.








get_trace
public int get_trace()
Returns the trace parameter indicating the specified update interval in seconds.

Returns:
The trace parameter indicating the specified update interval in seconds.








set_rfCores
public void set_rfCores(int rfCores)
Sets the number of cores to be used by the algorithm when OpenMP
 parallel processing is in force.

Parameters:
rfCores - The number of cores to be used by the algorithm when OpenMP
 parallel processing is in force. The default behaviour is to
 use all cores available.  This is achieved by setting the
 parameter to a negative value.  The result is that each core
 will be independently tasked with growing a tree.  Significant
 savings in elapsed computation times can be achieved.
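
The convention that a negative value means "use all available cores" can be sketched as below. This is an assumption-laden illustration: the class CoreResolver and its method resolve are hypothetical, and Runtime.availableProcessors() is used as a stand-in for whatever core count the OpenMP runtime reports.

```java
// Hypothetical illustration of the rfCores convention, not the library's code.
class CoreResolver {
    // Negative values request all available cores. Other values are passed
    // through unchanged; the documentation does not describe the zero case.
    static int resolve(int rfCores) {
        if (rfCores < 0) {
            return Runtime.getRuntime().availableProcessors();
        }
        return rfCores;
    }

    public static void main(String[] args) {
        System.out.println("Requested -1, using " + resolve(-1) + " cores");
        System.out.println("Requested 4, using " + resolve(4) + " cores");
    }
}
```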








get_rfCores
public int get_rfCores()
Returns the number of cores to be used by the algorithm when OpenMP
 parallel processing is in force.

Returns:
The number of cores to be used by the algorithm when OpenMP parallel processing is in force.
See Also:
set_rfCores(int)








setEnsembleArg
public void setEnsembleArg(String key,
                           String value)
Sets the ensemble outputs desired from the model.  These
 settings are in the form of <key, value> pairs, where the
 key is the name of the ensemble, and the value is the specific
 option for that ensemble.
  The default option for each key is in bold. 

   
 
  
Ensemble Key	Possible Values
weight	Grow or Restore Only: no, inbag, oob; Predict Only: no, yes
proximity	Grow or Restore Only: no, inbag, oob; Predict Only: no, yes
membership	no, yes
importance	no, permute, random, permute.ensemble, random.ensemble
varUsed	no, every.tree, sum.tree
splitDepth	no, every.tree, sum.tree
errorType	For RF-C, RF-C+ Families Only: misclass, brier, g.mean, no; For RF-R, RF-R+ Families Only: mse, no; For RF-M+ Family Only: default, no; For RF-S Family Only: c-index, no; For RF-U Family Only: no
predictionType	For RF-C, RF-C+ Families Only: max.vote, rfq; For RF-R, RF-R+ Families Only: mean; For RF-M+ Family Only: default; For RF-S Family Only: default; For RF-U Family Only: no

  

 
  

Parameters:
key - The name of the ensemble output.
value - The specific value for the ensemble output.
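
The <key, value> convention can be illustrated with a small stand-alone sketch. It covers only the keys whose values do not depend on the family or on the grow, restore, or predict mode. The class EnsembleArgSketch is hypothetical and is not the library's implementation; in particular, rejecting unknown keys with an exception is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the <key, value> store, not the library's code.
class EnsembleArgSketch {
    // Allowed values for the mode-independent keys, transcribed from the table.
    static final Map<String, Set<String>> ALLOWED = Map.of(
        "membership", Set.of("no", "yes"),
        "importance", Set.of("no", "permute", "random",
                             "permute.ensemble", "random.ensemble"),
        "varUsed", Set.of("no", "every.tree", "sum.tree"),
        "splitDepth", Set.of("no", "every.tree", "sum.tree"));

    private final Map<String, String> args = new LinkedHashMap<>();

    // Stores the value after checking it against the allowed set for the key.
    void set(String key, String value) {
        Set<String> allowed = ALLOWED.get(key);
        if (allowed == null || !allowed.contains(value)) {
            throw new IllegalArgumentException(
                "Unsupported ensemble argument: <" + key + ", " + value + ">");
        }
        args.put(key, value);
    }

    // Returns the stored value, or null when the key has not been set.
    String get(String key) {
        return args.get(key);
    }

    public static void main(String[] args) {
        EnsembleArgSketch ensemble = new EnsembleArgSketch();
        ensemble.set("importance", "permute");
        System.out.println(ensemble.get("importance")); // permute
    }
}
```

Against the documented API, the corresponding calls on a ModelArg instance would be setEnsembleArg("importance", "permute") followed by getEnsembleArg("importance").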








setEnsembleArg
public void setEnsembleArg()
Sets default values for the ensemble outputs resulting from the model.

See Also:
setEnsembleArg(String, String)








getEnsembleArg
public String getEnsembleArg(String key)
Returns the current value for the specified ensemble argument.

Parameters:
key - The name of the ensemble output.
See Also:
setEnsembleArg(String, String)

