public class ModelArg extends Object
RandomForestModel
object. Parameters are of two types: those that
define how the forest is to be trained; those that define the
requested ensemble outputs.

Constructor and Description 

ModelArg(String formula,
org.apache.spark.sql.Dataset dataset)
Sets default values for the training parameters and ensemble
outputs for the forest, given the formula and dataset.

Modifier and Type  Method and Description 

int 
get_blockSize()
Returns the value for the block size associated with the
reporting of the error rate.

String 
get_bootstrap()
Returns the type of bootstrap used in the model.

double[] 
get_caseWeight()
Returns the case weight vector.

int 
get_eventCount()
Returns the number of events in the data set, when survival or competing risk forests is in force.

double[] 
get_eventWeight()
Returns the event weight vector, when survival or competing risk
forests is in force.

String 
get_family()
Returns the family of analysis intended by the
ModelArg instance.

int 
get_htry()
Returns the maximum hypercube dimension to be considered in Greedy Splitting.

int 
get_mtry()
Returns the value for the number of xvariables to be randomly selected as candidates for splitting a node.

int 
get_nImpute()
Returns the number of iterations used by the missing data algorithm.

int 
get_nodeDepth()
Returns the maximum depth to which a tree should be grown.

int 
get_nodeSize()
Returns the desired value for the average number of unique cases in a terminal node.

int 
get_nSize()
Returns the number of records or rows (n) in the data set.

int 
get_nSplit()
Returns the parameter specifying deterministic versus nondeterministic splitting.

int 
get_ntree()
Returns the number of trees in the forest.

int 
get_rfCores()
Returns the number of cores to be used by the algorithm when OpenMP
parallel processing is in force.

int[][] 
get_sample()
Returns the 2D matrix explicitly specifying the bootstrap sample.

int 
get_sampleSize()
Returns the size of sample used in generating the bootstrap.

String 
get_sampleType()
Returns the type of sampling used in generating the bootstrap.

int 
get_seed()
Returns the seed for the random number generator used by the
algorithm.

String 
get_splitRule()
Returns the split rule used in generating the model.

double[] 
get_timeInterest()
Returns the time interest vector used in the model, when survival or competing risk
forests is in force.

int 
get_timeInterestSize()
Returns the size of the time interest vector used in the model, when survival or competing risk
forests is in force.

int 
get_trace()
Returns the trace parameter indicating the specified update interval in seconds.

double[][] 
get_xData()
Returns a 2D matrix of values representing the xvalues.

int[] 
get_xLevel()
Returns a vector of length xSize (get_xSize()) containing the number of levels found in each
xvariable.

int 
get_xSize()
Returns the number of xvariables in the data set.

double[] 
get_xStatisticalWeight()
Returns the xvariable statistical weight vector.

char[] 
get_xType()
Returns a vector of length xSize (get_xSize()) containing the xvariable types.

double[] 
get_xWeight()
Returns the xvariable weight vector.

double[][] 
get_yData()
Returns a 2D matrix of values representing the yvalues.

int[] 
get_yLevel()
Returns a vector of length ySize (get_ySize()) containing the number of levels found in each
yvariable.

int 
get_ySize()
Returns the number of yvariables in the data set.

int 
get_ytry()
Returns the value for the number of yvariables to be randomly selected as pseudoresponses when unsupervised forests is in force.

char[] 
get_yType()
Returns a vector of length ySize (get_ySize()) containing the yvariable types.

double[] 
get_yWeight()
Returns the yvariable weight vector.

String 
getEnsembleArg(String key)
Returns the current value for the specified ensemble argument.

void 
set_blockSize()
Sets the default value for the block size associated with the
reporting of the error rate.

void 
set_blockSize(int blockSize)
Sets the specified value for the block size associated with the
reporting of the error rate.

void 
set_bootstrap()
Sets the bootstrap related parameters in the model.

void 
set_bootstrap(int ntree)
Sets the bootstrap related parameters in the model.

void 
set_bootstrap(int ntree,
int[][] sample)
Sets the bootstrap related parameters in the model.

void 
set_bootstrap(int ntree,
String bootstrap,
String sampleType,
int sampleSize,
int[][] sample,
double[] caseWeight)
Sets the bootstrap related parameters in the model.

void 
set_eventWeight()
Sets the event weight vector, when survival or competing risk
forests is in force, to uniform weights.

void 
set_eventWeight(double[] weight)
Sets the event weight vector, when survival or competing risk
forests is in force.

void 
set_htry(int htry)
Sets the maximum hypercube dimension to be considered in Greedy Splitting.

void 
set_mtry()
Sets the default value for the number of xvariables to be randomly selected as candidates for splitting a node.

void 
set_mtry(int mtry)
Sets the number of xvariables to be randomly selected as candidates for splitting a node.

void 
set_nImpute(int nImpute)
Sets the number of iterations for the missing data algorithm.

void 
set_nodeDepth(int nodeDepth)
Sets the maximum depth to which a tree should be grown.

void 
set_nodeSize()
Sets the default value for the average number of unique cases in a terminal node.

void 
set_nodeSize(int nodeSize)
Sets the desired average number of unique cases in a terminal
node.

void 
set_nSplit()
Sets the default value for the parameter specifying deterministic versus nondeterministic splitting.

void 
set_nSplit(int nSplit)
Sets the parameter specifying deterministic versus nondeterministic splitting.

void 
set_rfCores(int rfCores)
Sets the number of cores to be used by the algorithm when OpenMP
parallel processing is in force.

void 
set_seed(int seed)
Sets the seed for the random number generator used by the
algorithm.

void 
set_splitRule()
Sets the default split rule, based on the data set and formula.

void 
set_splitRule(String splitRule)
Sets the split rule to be used in generating the model.

void 
set_timeInterest()
Sets the time interest vector, when survival or competing risk
forests is in force to the default value.

void 
set_timeInterest(double[] timeInterest)
Sets the time interest vector, when survival or competing risk
forests is in force.

void 
set_trace(int trace)
Sets the trace parameter indicating the specified update
interval in seconds.

void 
set_xStatisticalWeight()
Sets the xvariable statistical weight vector to uniform weights.

void 
set_xStatisticalWeight(double[] weight)
Sets the xvariable statistical weight vector.

void 
set_xWeight()
Sets the xvariable weight vector to uniform weights.

void 
set_xWeight(double[] weight)
Sets the xvariable weight vector.

void 
set_ytry(int ytry)
Sets the number of randomly selected pseudoresponses when
unsupervised forests is in force.

void 
set_yWeight()
Sets the yvariable weight vector to uniform weights.

void 
set_yWeight(double[] weight)
Sets the yvariable weight vector.

void 
setEnsembleArg()
Sets default values for the ensemble outputs resulting from the model.

void 
setEnsembleArg(String key,
String value)
Sets the ensemble outputs desired from the model.

public ModelArg(String formula, org.apache.spark.sql.Dataset dataset)
Examples of formulae for various families follow:
Description  Example Formula  Data Set 

survival or competing risk  Surv(time, status) ~ .  Veteran's Administration Lung Cancer Trial 
regression  Ozone ~ .  New York Air Quality Measurements 
classification  Species ~ .  Edgar Anderson's Iris Data 
multivariate regression  Multivar(mpg, cyl) ~ .  Motor Trend Car Road Tests 
multivariate regression  Multivariate(mpg, wt) ~ hp + drat  Motor Trend Car Road Tests 
unsupervised  Unsupervised() ~ .  Veteran's Administration Lung Cancer Trial 
An example found in the test classes follows:
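// Read the iris data into a Spark Dataset, inferring column types from the CSV.
Dataset irisDF = spark
    .read()
    .option("header", "true")
    .option("inferSchema", "true")
    .format("csv")
    .load("./testclasses/data/iris.csv");
// Classification: Species is the yvariable; all other columns are xvariables.
ModelArg modelArg = new ModelArg("Species ~ .", irisDF);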
An overview of all the data sets used above can be found here.
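To make the formula convention concrete, here is a minimal sketch of how a formula string could be split around the tilde and how a period on the right-hand side could be resolved to the complement of the yvariables. This is not the package's parser: the class FormulaSketch and its methods are hypothetical, and the sketch assumes plain column names on the left-hand side (no Surv(...), Multivar(...) or Unsupervised() wrappers).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the package.
final class FormulaSketch {

    // Splits "y ~ x1 + x2" into its left and right sides around the tilde.
    static String[] sides(String formula) {
        String[] parts = formula.split("~", -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected exactly one '~': " + formula);
        }
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    // Resolves the xvariable names. A period means every column that is
    // not a yvariable; otherwise the terms are separated by '+'.
    static List<String> xVariables(String rhs, List<String> columns, List<String> yVariables) {
        List<String> result = new ArrayList<>();
        if (rhs.equals(".")) {
            for (String column : columns) {
                if (!yVariables.contains(column)) {
                    result.add(column);
                }
            }
        } else {
            for (String term : rhs.split("\\+")) {
                result.add(term.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("Sepal.Length", "Sepal.Width", "Species");
        String[] s = sides("Species ~ .");
        System.out.println(xVariables(s[1], columns, Arrays.asList(s[0])));
    }
}
```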
 Parameters:
formula
 Specification of the yvariables and xvariables
that are to be used in the model. These refer to the
column names in the Spark DataFrame. The yvariables
and xvariables are separated by the tilde (~)
character. A period (.) indicates that the complement
of the yvariables is to be used as the xvariables. See
the example above for more information.
dataset
 A Spark Dataset.

get_family
public String get_family()
Returns the family of analysis intended by the ModelArg
instance. It is of the following form:
Family  Description
RFS  survival or competing risk
RFR  regression
RFC  classification
RFR+  multivariate regression
RFC+  multivariate classification
RFM+  multivariate mixed
RFU  unsupervised
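
get_nSize
public int get_nSize()
Returns the number of records or rows (n) in the data set.

get_ySize
public int get_ySize()
Returns the number of yvariables in the data set.

get_xSize
public int get_xSize()
Returns the number of xvariables in the data set.

get_yType
public char[] get_yType()
Returns a vector of length ySize (get_ySize()) containing the yvariable types.
The vector will be null if the family is unsupervised.
The types are as follows:
Description  Value
time  T
censoring  S
boolean  B
real  R
ordinal  O
categorical  C
integer  I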

get_xType
public char[] get_xType()
Returns a vector of length xSize (get_xSize()) containing the xvariable types.
The types are as follows:
Description  Value
boolean  B
real  R
ordinal  O
categorical  C
integer  I

get_yLevel
public int[] get_yLevel()
Returns a vector of length ySize (get_ySize()) containing the number of levels found in each
yvariable. The elements of this vector will be nonzero for ordinal and categorical
variables. All other elements assume the value of zero (0).

get_xLevel
public int[] get_xLevel()
Returns a vector of length xSize (get_xSize()) containing the number of levels found in each
xvariable. The elements of this vector will be nonzero for ordinal and categorical
variables. All other elements assume the value of zero (0).

get_yData
public double[][] get_yData()
Returns a 2D matrix of values representing the yvalues.

get_xData
public double[][] get_xData()
Returns a 2D matrix of values representing the xvalues.

set_bootstrap
public void set_bootstrap(int ntree, String bootstrap, String sampleType, int sampleSize, int[][] sample, double[] caseWeight)
Sets the bootstrap related parameters in the model.
Note that the parameters are interdependent.
An explanation of the hierarchy, the interdependency, and
default values is below. Only specific combinations are valid.
We provide two other methods to set the bootstrap related
parameters, set_bootstrap() and set_bootstrap(int),
to aid the user. When explicitly specifying the sample to
be used in the bootstrap algorithm (bootstrap = user), the
element sample[i][j] represents the number of times case j (in
the data set) appears in the bootstrap sample for tree i. Thus
a value of zero (0) implies that case j is out-of-bag in tree
i. Ensure that the sample over the forest is coherent: for each
tree i, the counts sample[i][j] summed over all cases j should
equal sampleSize.
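The coherence requirement can be checked before the matrix is handed to the model. The following is a minimal sketch, assuming that sample[i][j] is indexed by tree i and case j as described above and that coherence means the counts for every tree total sampleSize. The class BootstrapSampleCheck and its methods are hypothetical and not part of the package.

```java
// Hypothetical helper, for illustration only; not part of the package.
final class BootstrapSampleCheck {

    // Returns true when every tree's counts are nonnegative and total sampleSize.
    static boolean isCoherent(int[][] sample, int sampleSize) {
        for (int[] tree : sample) {
            int total = 0;
            for (int count : tree) {
                if (count < 0) {
                    return false;
                }
                total += count;
            }
            if (total != sampleSize) {
                return false;
            }
        }
        return true;
    }

    // Counts the cases that never appear in the sample for tree i.
    static int outOfBagCount(int[][] sample, int i) {
        int oob = 0;
        for (int count : sample[i]) {
            if (count == 0) {
                oob++;
            }
        }
        return oob;
    }

    public static void main(String[] args) {
        int[][] sample = { { 2, 0, 1 }, { 1, 1, 1 } };
        System.out.println(isCoherent(sample, 3));
        System.out.println(outOfBagCount(sample, 0));
    }
}
```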
Parameter  Default Value  Possible Values
ntree  1000  > 0
bootstrap  auto  auto, user
sampleType  swr  swr (sampling with replacement), swor (sampling without replacement)
sampleSize  n  > 0
sample  null  null, 2D matrix of dimension [ntree] x [nSize]
ntree?
  = 0: WARNING
  > 0: bootstrap?
    auto: sampleType?
      swr: sampleSize?
        > 0: sampleSize, caseWeight (may be null)
        = 0: sampleSize = n, caseWeight (may be null)
      swor: sampleSize?
        1 <= sampleSize <= n: sampleSize, caseWeight (may be null)
        = 0: sampleSize = n * (e - 1)/e, caseWeight (may be null)
    user: sample?
      !null: sampleType = swr, sampleSize (determined by sample), ntree (determined by sample)
      null: ERROR
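For the bootstrap = auto branch, the choice of sampleSize can be sketched as below. This is an illustration under stated assumptions, not the package's code: the class SampleSizeSketch is hypothetical, the swor default n * (e - 1)/e is rounded to the nearest integer (the rounding rule is an assumption), and values that the hierarchy does not cover are rejected with an exception.

```java
// Hypothetical helper, for illustration only; not part of the package.
final class SampleSizeSketch {

    // Resolves the effective sample size for bootstrap = auto.
    static int resolve(String sampleType, int sampleSize, int n) {
        if (sampleType.equals("swr")) {
            if (sampleSize > 0) {
                return sampleSize;
            }
            if (sampleSize == 0) {
                return n;                      // default for swr
            }
        } else if (sampleType.equals("swor")) {
            if (sampleSize >= 1 && sampleSize <= n) {
                return sampleSize;
            }
            if (sampleSize == 0) {
                // default for swor: n * (e - 1) / e, roughly 0.632 * n
                return (int) Math.round(n * (Math.E - 1.0) / Math.E);
            }
        }
        throw new IllegalArgumentException(
            "unsupported combination: " + sampleType + ", " + sampleSize);
    }

    public static void main(String[] args) {
        System.out.println(resolve("swr", 0, 150));
        System.out.println(resolve("swor", 0, 1000));
    }
}
```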
Finally, note that when bootstrap = auto is in force, it is also
possible to use case weights in conjunction with sampleType.
 Parameters:
ntree
 Number of trees in the forest.
bootstrap
 Type of bootstrap used in the model.
sampleType
 Type of sampling used in generating the bootstrap.
sampleSize
 Size of sample used in generating the bootstrap.
sample
 2D matrix explicitly specifying the bootstrap sample.
caseWeight
 The case weight vector. This vector
must be of length nSize (get_nSize()). This is a vector of nonnegative
weights where, after normalizing, weight[k] is the
probability of selecting case k as a candidate when bootstrap = auto.
The default is to use uniform weights for selection. It is generally better to use real
weights rather than integers. With larger values of nSize, the
slightly different sampling algorithms deployed in the two
scenarios can result in dramatically different execution times.

set_bootstrap
public void set_bootstrap(int ntree)
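Sets the bootstrap related parameters in the model.
 Parameters:
ntree
 Number of trees in the random forest.
 See Also:
set_bootstrap(int, String, String, int, int[][], double[])

set_bootstrap
public void set_bootstrap(int ntree, int[][] sample)
Sets the bootstrap related parameters in the model.
 Parameters:
ntree
 Number of trees in the random forest.
sample
 2D matrix explicitly specifying the bootstrap sample.
 See Also:
set_bootstrap(int, String, String, int, int[][], double[])

set_bootstrap
public void set_bootstrap()
Sets the bootstrap related parameters in the model.

get_bootstrap
public String get_bootstrap()
Returns the type of bootstrap used in the model.
 See Also:
set_bootstrap(int, String, String, int, int[][], double[])

get_sampleType
public String get_sampleType()
Returns the type of sampling used in generating the bootstrap.
 See Also:
set_bootstrap(int, String, String, int, int[][], double[])

get_sampleSize
public int get_sampleSize()
Returns the size of sample used in generating the bootstrap.
 See Also:
set_bootstrap(int, String, String, int, int[][], double[])

get_sample
public int[][] get_sample()
Returns the 2D matrix explicitly specifying the bootstrap sample.
 See Also:
set_bootstrap(int, String, String, int, int[][], double[])

get_ntree
public int get_ntree()
Returns the number of trees in the forest.
 See Also:
set_bootstrap(int, String, String, int, int[][], double[])

get_blockSize
public int get_blockSize()
Returns the value for the block size associated with the
reporting of the error rate.
 See Also:
set_blockSize(int)

set_blockSize
public void set_blockSize()
Sets the default value for the block size associated with the
reporting of the error rate.
 See Also:
set_blockSize(int)

set_blockSize
public void set_blockSize(int blockSize)
Sets the specified value for the block size associated with the
reporting of the error rate.
 See Also:
set_blockSize()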

set_mtry
public void set_mtry(int mtry)
Sets the number of xvariables to be randomly selected as candidates for splitting a node.
 Parameters:
mtry
 The number of xvariables to be randomly selected as candidates for splitting a node.
This number must be such that 1 ≤ mtry ≤ xSize.
If the value is out of range, the default value will be applied:
Family  Default Value
RFR, RFR+  xSize/3
all others  sqrt(xSize)
The default is to use uniform weights for selection, though this can be changed with set_xWeight(double[]).
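The fallback rule for mtry can be sketched as follows. This is an illustration only: the class MtryDefaultSketch is hypothetical, and rounding the default up to the next integer (with a floor of one) is an assumption, since the rounding rule is not stated here.

```java
// Hypothetical helper, for illustration only; not part of the package.
final class MtryDefaultSketch {

    // Default mtry: xSize/3 for RFR and RFR+, sqrt(xSize) for all others.
    // Rounding up is an assumption of this sketch.
    static int defaultMtry(String family, int xSize) {
        boolean regression = family.equals("RFR") || family.equals("RFR+");
        double raw = regression ? xSize / 3.0 : Math.sqrt(xSize);
        return Math.max(1, (int) Math.ceil(raw));
    }

    // Keeps the requested value when 1 <= mtry <= xSize, else applies the default.
    static int effectiveMtry(int requested, String family, int xSize) {
        if (requested >= 1 && requested <= xSize) {
            return requested;
        }
        return defaultMtry(family, xSize);
    }

    public static void main(String[] args) {
        System.out.println(effectiveMtry(0, "RFC", 16));
        System.out.println(effectiveMtry(0, "RFR", 10));
    }
}
```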

set_mtry
public void set_mtry()
Sets the default value for the number of xvariables to be randomly selected as candidates for splitting a node.
 See Also:
set_mtry(int)

get_mtry
public int get_mtry()
Returns the value for the number of xvariables to be randomly selected as candidates for splitting a node.
 Returns:
 The value for the number of xvariables to be randomly selected as candidates for splitting a node.
 See Also:
set_mtry(int)

set_htry
public void set_htry(int htry)
Sets the maximum hypercube dimension to be considered in Greedy Splitting.
 Parameters:
htry
 The maximum hypercube dimension to be considered in Greedy Splitting.
If htry = 0, Standard Splitting is in effect. If htry > 0, Greedy Splitting is in effect.
If the value is out of range, the default value of zero will be applied:
htry  Protocol
0  standard splitting
htry > 0  greedy splitting

get_htry
public int get_htry()
Returns the maximum hypercube dimension to be considered in Greedy Splitting.
 Returns:
 The maximum hypercube dimension to be considered in Greedy Splitting.
 See Also:
set_htry(int)

set_nImpute
public void set_nImpute(int nImpute)
Sets the number of iterations for the missing data algorithm.
 Parameters:
nImpute
 The number of iterations for the missing data algorithm.
The default value is one (1).
Performance measures such as out-of-bag (OOB) error rates tend
to become optimistic if nImpute > 1.

get_nImpute
public int get_nImpute()
Returns the number of iterations used by the missing data algorithm.
 Returns:
 The number of iterations used by the missing data algorithm.
 See Also:
set_nImpute(int)

get_caseWeight
public double[] get_caseWeight()
Returns the case weight vector.
 Returns:
 The case weight vector.
 See Also:
set_bootstrap(int, String, String, int, int[][], double[])

set_ytry
public void set_ytry(int ytry)
Sets the number of randomly selected pseudoresponses when
unsupervised forests is in force.
 Parameters:
ytry
 The number of randomly selected pseudoresponses when
unsupervised forests is in force. The default value is one
(1). This means at every node, and every split attempt, one yvariable will be
selected from the (xSize - 1) remaining xvariables when calculating the split statistic.
This number must be such that 1 ≤ ytry ≤ (xSize - 1).

get_ytry
public int get_ytry()
Returns the value for the number of yvariables to be randomly selected as pseudoresponses when unsupervised forests is in force.
 Returns:
 The value for the number of yvariables to be randomly selected as pseudoresponses when unsupervised forests is in force.
 See Also:
set_ytry(int)

set_yWeight
public void set_yWeight(double[] weight)
Sets the yvariable weight vector.
 Parameters:
weight
 The yvariable weight vector. This vector must
be of length ySize (get_ySize()). This is a vector of
nonnegative weights. The vector has two purposes. Purpose 1:
After normalizing, weight[k] is the probability of selecting
yvariable k to include in the multivariate split statistic.
This is useful in bigr situations when ySize is very large and
the user desires to restrict the split statistic calculation to
ytry (get_ytry()) yvariables instead of all ySize
yvariables. Purpose 2: All yvariables with weight zero
define a special feature matrix. This feature matrix is
presented to the user when custom splitting is in effect. All
yvariables with nonzero weight arrive in the custom split
rule as usual. In both uses, the default is to use uniform
weights. For Purpose 1, it is generally better to use real
weights rather than integers. With larger values of ySize, the
slightly different sampling algorithms deployed in the two
scenarios can result in dramatically different execution times.
For Purpose 2, the only value that matters is the presence of
zero.
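Both purposes of the weight vector can be illustrated with a short sketch: normalizing nonnegative weights into selection probabilities (Purpose 1) and listing the indices with weight zero (Purpose 2). The class WeightSketch and its methods are hypothetical; how the package itself normalizes or samples is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the package.
final class WeightSketch {

    // Purpose 1: scales nonnegative weights so that they sum to one.
    static double[] normalize(double[] weight) {
        double sum = 0.0;
        for (double w : weight) {
            if (w < 0.0) {
                throw new IllegalArgumentException("weights must be nonnegative");
            }
            sum += w;
        }
        if (sum == 0.0) {
            throw new IllegalArgumentException("at least one weight must be positive");
        }
        double[] probability = new double[weight.length];
        for (int k = 0; k < weight.length; k++) {
            probability[k] = weight[k] / sum;
        }
        return probability;
    }

    // Purpose 2: indices of the variables whose weight is exactly zero.
    static List<Integer> zeroWeightIndices(double[] weight) {
        List<Integer> indices = new ArrayList<>();
        for (int k = 0; k < weight.length; k++) {
            if (weight[k] == 0.0) {
                indices.add(k);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        System.out.println(normalize(new double[] { 1.0, 1.0, 2.0 })[2]);
        System.out.println(zeroWeightIndices(new double[] { 0.0, 2.0, 0.0 }));
    }
}
```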

set_yWeight
public void set_yWeight()
Sets the yvariable weight vector to uniform weights.
 See Also:
set_yWeight(double[])

get_yWeight
public double[] get_yWeight()
Returns the yvariable weight vector.
 Returns:
 The yvariable weight vector.
 See Also:
set_yWeight(double[])

set_xWeight
public void set_xWeight(double[] weight)
Sets the xvariable weight vector.
 Parameters:
weight
 The xvariable weight vector. This vector
must be of length xSize (get_xSize()). This is a vector of nonnegative
weights where, after normalizing, weight[k] is the
probability of selecting xvariable k as a candidate for splitting a node.
The default is to use uniform weights for selection. It is generally better to use real
weights rather than integers. With larger values of xSize, the
slightly different sampling algorithms deployed in the two
scenarios can result in dramatically different execution times.

set_xWeight
public void set_xWeight()
Sets the xvariable weight vector to uniform weights.
 See Also:
set_xWeight(double[])

get_xWeight
public double[] get_xWeight()
Returns the xvariable weight vector.
 Returns:
 The xvariable weight vector.
 See Also:
set_xWeight(double[])

set_xStatisticalWeight
public void set_xStatisticalWeight(double[] weight)
Sets the xvariable statistical weight vector.
 Parameters:
weight
 The xvariable statistical weight vector. This vector must be of
length xSize (get_xSize()). This is a vector of
nonnegative weights where, after normalizing, weight[k] is the
multiplier by which the split statistic for an xvariable is
adjusted. A large value encourages the node to split on the
xvariable. The default is to use uniform weights so that all
xvariables are treated equally.

set_xStatisticalWeight
public void set_xStatisticalWeight()
Sets the xvariable statistical weight vector to uniform weights.
 See Also:
set_xStatisticalWeight(double[])

get_xStatisticalWeight
public double[] get_xStatisticalWeight()
Returns the xvariable statistical weight vector.
 Returns:
 The xvariable statistical weight vector.
 See Also:
set_xStatisticalWeight(double[])

set_eventWeight
public void set_eventWeight(double[] weight)
Sets the event weight vector, when survival or competing risk
forests is in force.
 Parameters:
weight
 The event weight vector, when survival or competing risk forests is in force. This vector must be the same length as the
number of events in the data set (get_eventCount()).
This is a vector of nonnegative weights,
where, after normalizing, weight[k] is the multiplier by which
the component of the split statistic related to event[k] is adjusted.
The default is to use a composite splitting rule which is an average over all event types (a democratic approach).
To single out an event type, set all weights other than the one you are interested in to zero (0).
Finally, note that regardless of how the weight vector is specified, the returned forest object always
provides estimates for all event types.
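As an illustration of the weighting described above, the sketch below builds a weight vector that singles out one event type and combines per-event components of a split statistic as a normalized weighted sum. The class EventWeightSketch is hypothetical, and the exact way the package combines the components is an assumption of this sketch.

```java
// Hypothetical helper, for illustration only; not part of the package.
final class EventWeightSketch {

    // Weight vector with zeros everywhere except the event type of interest.
    static double[] singleOut(int eventCount, int eventIndex) {
        double[] weight = new double[eventCount];
        weight[eventIndex] = 1.0;
        return weight;
    }

    // Normalized weighted sum of the per-event components (assumed form).
    static double composite(double[] component, double[] weight) {
        double weighted = 0.0;
        double total = 0.0;
        for (int k = 0; k < component.length; k++) {
            weighted += weight[k] * component[k];
            total += weight[k];
        }
        if (total == 0.0) {
            throw new IllegalArgumentException("at least one weight must be positive");
        }
        return weighted / total;
    }

    public static void main(String[] args) {
        double[] component = { 2.0, 4.0 };
        System.out.println(composite(component, new double[] { 1.0, 1.0 }));
        System.out.println(composite(component, singleOut(2, 1)));
    }
}
```

With uniform weights the result is the plain average over event types; with a single nonzero weight only that event type contributes.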

set_eventWeight
public void set_eventWeight()
Sets the event weight vector, when survival or competing risk
forests is in force, to uniform weights.
 See Also:
set_eventWeight(double[])

get_eventWeight
public double[] get_eventWeight()
Returns the event weight vector, when survival or competing risk
forests is in force.
 Returns:
 The event weight vector, when survival or competing risk
forests is in force.
 See Also:
set_eventWeight(double[])

get_eventCount
public int get_eventCount()
Returns the number of events in the data set, when survival or competing risk forests is in force.
 Returns:
 The number of events in the data set, when survival or competing risk forests is in force.

set_timeInterest
public void set_timeInterest(double[] timeInterest)
Sets the time interest vector, when survival or competing risk
forests is in force.
 Parameters:
timeInterest
 The time interest vector, when survival or competing risk
forests is in force. This is a vector of real values to be
used to constrain the ensemble calculations. Using time points
at which events do not occur does not result in information
gain. The default action is to use all observed event times in
the data set.
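The default described above, all observed event times, can be sketched as follows. The sketch assumes that a status of zero marks a censored case and any positive status marks an event, and it returns the distinct event times in ascending order; both the ordering and the class TimeInterestSketch are assumptions for illustration.

```java
import java.util.TreeSet;

// Hypothetical helper, for illustration only; not part of the package.
final class TimeInterestSketch {

    // Collects the distinct times at which an event (status > 0) was observed.
    static double[] observedEventTimes(double[] time, int[] status) {
        TreeSet<Double> distinct = new TreeSet<>();
        for (int i = 0; i < time.length; i++) {
            if (status[i] > 0) {
                distinct.add(time[i]);
            }
        }
        double[] result = new double[distinct.size()];
        int k = 0;
        for (double t : distinct) {
            result[k++] = t;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] time = { 5.0, 3.0, 5.0, 8.0 };
        int[] status = { 1, 0, 1, 2 };
        System.out.println(java.util.Arrays.toString(observedEventTimes(time, status)));
    }
}
```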

set_timeInterest
public void set_timeInterest()
Sets the time interest vector, when survival or competing risk
forests is in force, to the default value.
 See Also:
set_timeInterest(double[])

get_timeInterest
public double[] get_timeInterest()
Returns the time interest vector used in the model, when survival or competing risk
forests is in force.
 Returns:
 The time interest vector used in the model, when survival or competing risk
forests is in force.
 See Also:
set_timeInterest(double[])

get_timeInterestSize
public int get_timeInterestSize()
Returns the size of the time interest vector used in the model, when survival or competing risk
forests is in force.
 Returns:
 The size of the time interest vector used in the model, when survival or competing risk
forests is in force.
 See Also:
get_timeInterest()

set_splitRule
public void set_splitRule(String splitRule)
Sets the split rule to be used in generating the model.
 Parameters:
splitRule
 The split rule to be used in generating the
model. The split rules available are detailed below. The rule
in bold denotes the default split rule for each family. The
default split rule is applied when the user does not specify a
split rule. Survival and Competing Risk both have two split
rules. Regression has three flavours of split rules based on
meansquared error. Classification has three flavours of split
rules based on the Gini index, and one additional rule for
ordinal outcomes. The Multivariate and Unsupervised split rules
are a composite rule based on Regression and
Classification. Each component of the composite is normalized
so that the magnitude of any one yvariable does not influence
the statistic. All families also allow the user to define a
custom split rule statistic. Some basic C programming skills
are required. Examples for all the families reside in the C
source code directory of the package in the file
src/main/c/splitCustom.c. Note that recompiling
and reinstalling the package is necessary after modifying the
source code.
Family  Split Rule Description  Value
survival  logrank  logrank
survival  logrank score  logrankscore
competing risk  logrank modified weighted  logrankCR
competing risk  logrank  logrankACR
regression  meansquared error weighted  mse
regression  meansquared error unweighted  mse.unwt
regression  meansquared error heavy weighted  mse.hvwt
classification  Gini index weighted  gini
classification  Gini index unweighted  gini.unwt
classification  Gini index heavy weighted  gini.hvwt
classification  Ranked Probability Score  rps
multivariate regression  Composite meansquared error  mv.mse
multivariate classification  Composite Gini index  mv.gini
multivariate mixed  Composite Gini and MSE  mv.mix
unsupervised  pseudoresponse adaptive  unsupv

set_splitRule
public void set_splitRule()
Sets the default split rule, based on the data set and formula.
 See Also:
set_splitRule(String)

get_splitRule
public String get_splitRule()
Returns the split rule used in generating the model.
 Returns:
 The split rule used in generating the model.
 See Also:
set_splitRule(String)

set_nSplit
public void set_nSplit(int nSplit)
Sets the parameter specifying deterministic versus nondeterministic splitting.
 Parameters:
nSplit
 The parameter specifying deterministic versus nondeterministic splitting. The parameter must be
a nonnegative integer value. When zero (0), deterministic
splitting for an xvariable is in force. When nonzero, a
maximum of nSplit points are randomly chosen among the
possible split points for an xvariable. This can
significantly decrease computation time over deterministic splitting. The
default value for this parameter varies with the split rule: When pure
random splitting is in force, the default and only value for this parameter is one (1).
When any other split rule is in force, the default value is
zero (0).
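The difference between the two modes can be sketched as follows: with nSplit equal to zero every candidate split point is kept, and with a positive nSplit at most that many are drawn at random. The class NSplitSketch is hypothetical, and drawing via a shuffle is just one way to pick the points; it is not claimed to be what the package does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, for illustration only; not part of the package.
final class NSplitSketch {

    // Returns all candidates when nSplit is 0, else at most nSplit random ones.
    static List<Double> candidateSplits(List<Double> all, int nSplit, Random rng) {
        if (nSplit < 0) {
            throw new IllegalArgumentException("nSplit must be nonnegative");
        }
        if (nSplit == 0 || nSplit >= all.size()) {
            return new ArrayList<>(all);
        }
        List<Double> shuffled = new ArrayList<>(all);
        Collections.shuffle(shuffled, rng);
        return new ArrayList<>(shuffled.subList(0, nSplit));
    }

    public static void main(String[] args) {
        List<Double> points = Arrays.asList(1.0, 2.0, 3.0, 4.0);
        System.out.println(candidateSplits(points, 0, new Random(1)).size());
        System.out.println(candidateSplits(points, 2, new Random(1)).size());
    }
}
```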

set_nSplit
public void set_nSplit()
Sets the default value for the parameter specifying deterministic versus nondeterministic splitting.
 See Also:
set_nSplit(int)

get_nSplit
public int get_nSplit()
Returns the parameter specifying deterministic versus nondeterministic splitting.
 Returns:
 The parameter specifying deterministic versus nondeterministic splitting.
 See Also:
set_nSplit(int)

set_nodeSize
public void set_nodeSize(int nodeSize)
Sets the desired average number of unique cases in a terminal
node.
 Parameters:
nodeSize
 The desired average number of unique cases in a terminal
node. The parameter ensures that the average nodesize across
the forest will be at least nodeSize. Some nodes will be
smaller than this value and some will be larger. The default
value for this parameter varies with the family, though it is
recommended to experiment with different values.
Family  Default Node Size
survival  3
competing risk  6
regression  5
multivariate regression  5
classification  1
multivariate classification  1
multivariate mixed  3
unsupervised  3

set_nodeSize
public void set_nodeSize()
Sets the default value for the average number of unique cases in a terminal node.
 See Also:
set_nodeSize(int)

get_nodeSize
public int get_nodeSize()
Returns the desired value for the average number of unique cases in a terminal node.
 Returns:
 The desired value for the average number of unique cases in a terminal node.
 See Also:
set_nodeSize(int)

set_nodeDepth
public void set_nodeDepth(int nodeDepth)
Sets the maximum depth to which a tree should be grown.
 Parameters:
nodeDepth
 The maximum depth to which a tree should be grown. The
default behaviour is that this parameter is ignored. Not
setting this parameter or setting this parameter to a negative
value will ensure that this parameter is ignored.

get_nodeDepth
public int get_nodeDepth()
Returns the maximum depth to which a tree should be grown.
 Returns:
 The maximum depth to which a tree should be grown.

set_seed
public void set_seed(int seed)
Sets the seed for the random number generator used by the
algorithm. This must be a negative number. The seed is a very
important parameter if repeatability of the model generated is
required. Generally speaking, growing a model using the same
data set, the same model parameters, and the same seed will
result in identical models. When large amounts of missing data
are involved, there can be slight variations due to Monte Carlo
effects. If the parameter is not set by the user, it can always
be recovered with get_seed().
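The role of the seed in repeatability can be illustrated by analogy with java.util.Random. The sketch below is not the generator used by the algorithm; the class SeedSketch is hypothetical and only shows that an identical seed reproduces an identical sequence of draws, while a different seed generally does not.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, for illustration only; not part of the package.
final class SeedSketch {

    // Draws n case indices in [0, bound) from a generator started at the given seed.
    static int[] draw(long seed, int n, int bound) {
        Random rng = new Random(seed);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = rng.nextInt(bound);
        }
        return result;
    }

    public static void main(String[] args) {
        // The same (negative) seed yields the same draws on every run.
        System.out.println(Arrays.equals(draw(-42L, 5, 100), draw(-42L, 5, 100)));
    }
}
```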

get_seed
public int get_seed()
Returns the seed for the random number generator used by the
algorithm.
 Returns:
 The seed for the random number generator used by the algorithm.
 See Also:
set_seed(int)

set_trace
public void set_trace(int trace)
Sets the trace parameter indicating the specified update
interval in seconds.
 Parameters:
trace
 The trace parameter indicating the specified update
interval in seconds. During extended execution times,
the approximate time to complete the execution is output to a
trace file in the user's HOME directory. The format and
location of the trace file can be controlled by modifying
src/main/resources/spark/log.properties. A value
of zero (0) turns off the trace.

get_trace
public int get_trace()
Returns the trace parameter indicating the specified update interval in seconds.
 Returns:
 The trace parameter indicating the specified update interval in seconds.

set_rfCores
public void set_rfCores(int rfCores)
Sets the number of cores to be used by the algorithm when OpenMP
parallel processing is in force.
 Parameters:
rfCores
 The number of cores to be used by the algorithm when OpenMP
parallel processing is in force. The default behaviour is to
use all cores available. This is achieved by setting the
parameter to a negative value. The result is that each core
will be independently tasked with growing a tree. Significant
savings in elapsed computation times can be achieved.

get_rfCores
public int get_rfCores()
Returns the number of cores to be used by the algorithm when OpenMP
parallel processing is in force.
 Returns:
 The number of cores to be used by the algorithm when OpenMP parallel processing is in force.
 See Also:
set_rfCores(int)

setEnsembleArg
public void setEnsembleArg(String key,
String value)
Sets the ensemble outputs desired from the model. These
settings are in the form of <key, value> pairs, where the
key is the name of the ensemble, and the value is the specific
option for that ensemble.
The default option for each key is in bold.
Ensemble Key  Possible Values
weight  (Grow or Restore Only) no, inbag, oob
weight  (Predict Only) no, yes
proximity  (Grow or Restore Only) no, inbag, oob
proximity  (Predict Only) no, yes
membership  no, yes
importance  no, permute, random, permute.ensemble, random.ensemble
varUsed  no, every.tree, sum.tree
splitDepth  no, every.tree, sum.tree
errorType  (RFC, RFC+ Families Only) misclass, brier, g.mean, no
errorType  (RFR, RFR+ Families Only) mse, no
errorType  (RFM+ Family Only) default, no
errorType  (RFS Family Only) cindex, no
errorType  (RFU Family Only) no
predictionType  (RFC, RFC+ Families Only) max.vote, rfq
predictionType  (RFR, RFR+ Families Only) mean
predictionType  (RFM+ Family Only) default
predictionType  (RFS Family Only) default
predictionType  (RFU Family Only) no
 Parameters:
key
 The name of the ensemble output.
value
 The specific value for the ensemble output.
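A caller may want to check a <key, value> pair against the table before passing it on. The following minimal sketch covers only the four keys whose options do not depend on family or mode (membership, importance, varUsed, splitDepth). The class EnsembleArgSketch is hypothetical; how the package itself reacts to an invalid pair is not stated here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of the package.
final class EnsembleArgSketch {

    // Allowed values for the keys that do not vary by family or mode.
    private static final Map<String, List<String>> ALLOWED = new HashMap<>();

    static {
        ALLOWED.put("membership", Arrays.asList("no", "yes"));
        ALLOWED.put("importance", Arrays.asList(
            "no", "permute", "random", "permute.ensemble", "random.ensemble"));
        ALLOWED.put("varUsed", Arrays.asList("no", "every.tree", "sum.tree"));
        ALLOWED.put("splitDepth", Arrays.asList("no", "every.tree", "sum.tree"));
    }

    // Returns true when the key is known to this sketch and the value is listed for it.
    static boolean isValid(String key, String value) {
        List<String> values = ALLOWED.get(key);
        return values != null && values.contains(value);
    }

    public static void main(String[] args) {
        System.out.println(isValid("importance", "permute"));
        System.out.println(isValid("importance", "yes"));
    }
}
```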

setEnsembleArg
public void setEnsembleArg()
Sets default values for the ensemble outputs resulting from the model.
 See Also:
setEnsembleArg(String, String).

getEnsembleArg
public String getEnsembleArg(String key)
Returns the current value for the specified ensemble argument.
 Parameters:
key
 The name of the ensemble output.
 See Also:
setEnsembleArg(String, String).
Copyright © 2018. All rights reserved.