public class ModelArg extends Object
RandomForestModel
object. Parameters are of two types: those that
define how the forest is to be trained; those that define the
requested ensemble outputs.
Constructor and Description |
---|
ModelArg(String formula,
org.apache.spark.sql.Dataset dataset)
Sets default values for the training parameters and ensemble
outputs for the forest, given the formula and dataset.
|
Modifier and Type | Method and Description |
---|---|
int |
get_blockSize()
Returns the value for the block size associated with the
reporting of the error rate.
|
String |
get_bootstrap()
Returns the type of bootstrap used in the model.
|
double[] |
get_caseWeight()
Returns the case weight vector.
|
int |
get_eventCount()
Returns the number of events in the data set, when survival or competing risk forests is in force.
|
double[] |
get_eventWeight()
Returns the event weight vector, when survival or competing risk
forests is in force.
|
String |
get_family()
Returns the family of analysis intended by the
ModelArg instance. |
int |
get_htry()
Returns the maximum hypercube dimension to be considered in Greedy Splitting.
|
int |
get_mtry()
Returns the value for the number of x-variables to be randomly selected as candidates for splitting a node.
|
int |
get_nImpute()
Returns the number of iterations used by the missing data algorithm.
|
int |
get_nodeDepth()
Returns the maximum depth to which a tree should be grown.
|
int |
get_nodeSize()
Returns the desired value for the average number of unique cases in a terminal node.
|
int |
get_nSize()
Returns the number of records or rows (n) in the data set.
|
int |
get_nSplit()
Returns the parameter specifying deterministic versus non-deterministic splitting.
|
int |
get_ntree()
Returns the number of trees in the forest.
|
int |
get_rfCores()
Returns the number of cores to be used by the algorithm when OpenMP
parallel processing is in force.
|
int[][] |
get_sample()
Returns the 2-D matrix explicitly specifying the bootstrap sample.
|
int |
get_sampleSize()
Returns the size of sample used in generating the bootstrap.
|
String |
get_sampleType()
Returns the type of sampling used in generating the bootstrap.
|
int |
get_seed()
Returns the seed for the random number generator used by the
algorithm.
|
String |
get_splitRule()
Returns the split rule used in generating the model.
|
double[] |
get_timeInterest()
Returns the time interest vector used in the model, when survival or competing risk
forests is in force.
|
int |
get_timeInterestSize()
Returns the size of the time interest vector used in the model, when survival or competing risk
forests is in force.
|
int |
get_trace()
Returns the trace parameter indicating the specified update interval in seconds.
|
double[][] |
get_xData()
Returns a 2-D matrix of values representing the x-values.
|
int[] |
get_xLevel()
Returns a vector of length xSize (get_xSize()) containing the number of levels found in each
x-variable. |
int |
get_xSize()
Returns the number of x-variables in the data set.
|
double[] |
get_xStatisticalWeight()
Returns the x-variable statistical weight vector.
|
char[] |
get_xType()
Returns a vector of length xSize (
get_xSize() ) containing the x-variable types. |
double[] |
get_xWeight()
Returns the x-variable weight vector.
|
double[][] |
get_yData()
Returns a 2-D matrix of values representing the y-values.
|
int[] |
get_yLevel()
Returns a vector of length ySize (get_ySize()) containing the number of levels found in each
y-variable. |
int |
get_ySize()
Returns the number of y-variables in the data set.
|
int |
get_ytry()
Returns the value for the number of y-variables to be randomly selected as pseudo-responses when unsupervised forests is in force.
|
char[] |
get_yType()
Returns a vector of length ySize (get_ySize()) containing the y-variable types. |
double[] |
get_yWeight()
Returns the y-variable weight vector.
|
String |
getEnsembleArg(String key)
Returns the current value for the specified ensemble argument.
|
void |
set_blockSize()
Sets the default value for the block size associated with the
reporting of the error rate.
|
void |
set_blockSize(int blockSize)
Sets the specified value for the block size associated with the
reporting of the error rate.
|
void |
set_bootstrap()
Sets the bootstrap related parameters in the model.
|
void |
set_bootstrap(int ntree)
Sets the bootstrap related parameters in the model.
|
void |
set_bootstrap(int ntree,
int[][] sample)
Sets the bootstrap related parameters in the model.
|
void |
set_bootstrap(int ntree,
String bootstrap,
String sampleType,
int sampleSize,
int[][] sample,
double[] caseWeight)
Sets the bootstrap related parameters in the model.
|
void |
set_eventWeight()
Sets the event weight vector, when survival or competing risk
forests is in force, to uniform weights.
|
void |
set_eventWeight(double[] weight)
Sets the event weight vector, when survival or competing risk
forests is in force.
|
void |
set_htry(int htry)
Sets the maximum hypercube dimension to be considered in Greedy Splitting.
|
void |
set_mtry()
Sets the default value for the number of x-variables to be randomly selected as candidates for splitting a node.
|
void |
set_mtry(int mtry)
Sets the number of x-variables to be randomly selected as candidates for splitting a node.
|
void |
set_nImpute(int nImpute)
Sets the number of iterations for the missing data algorithm.
|
void |
set_nodeDepth(int nodeDepth)
Sets the maximum depth to which a tree should be grown.
|
void |
set_nodeSize()
Sets the default value for the average number of unique cases in a terminal node.
|
void |
set_nodeSize(int nodeSize)
Sets the desired average number of unique cases in a terminal
node.
|
void |
set_nSplit()
Sets the default value for the parameter specifying deterministic versus non-deterministic splitting.
|
void |
set_nSplit(int nSplit)
Sets the parameter specifying deterministic versus non-deterministic splitting.
|
void |
set_rfCores(int rfCores)
Sets the number of cores to be used by the algorithm when OpenMP
parallel processing is in force.
|
void |
set_seed(int seed)
Sets the seed for the random number generator used by the
algorithm.
|
void |
set_splitRule()
Sets the default split rule, based on the data set and formula.
|
void |
set_splitRule(String splitRule)
Sets the split rule to be used in generating the model.
|
void |
set_timeInterest()
Sets the time interest vector to the default value, when survival or competing risk
forests is in force.
|
void |
set_timeInterest(double[] timeInterest)
Sets the time interest vector, when survival or competing risk
forests is in force.
|
void |
set_trace(int trace)
Sets the trace parameter indicating the specified update
interval in seconds.
|
void |
set_xStatisticalWeight()
Sets the x-variable statistical weight vector to uniform weights.
|
void |
set_xStatisticalWeight(double[] weight)
Sets the x-variable statistical weight vector.
|
void |
set_xWeight()
Sets the x-variable weight vector to uniform weights.
|
void |
set_xWeight(double[] weight)
Sets the x-variable weight vector.
|
void |
set_ytry(int ytry)
Sets the number of randomly selected pseudo-responses when
unsupervised forests is in force.
|
void |
set_yWeight()
Sets the y-variable weight vector to uniform weights.
|
void |
set_yWeight(double[] weight)
Sets the y-variable weight vector.
|
void |
setEnsembleArg()
Sets default values for the ensemble outputs resulting from the model.
|
void |
setEnsembleArg(String key,
String value)
Sets the ensemble outputs desired from the model.
|
public ModelArg(String formula, org.apache.spark.sql.Dataset dataset)
Examples of formulae for various families follow:
Description | Example Formula | Data Set |
---|---|---|
survival or competing risk | Surv(time, status) ~ . | Veteran's Administration Lung Cancer Trial |
regression | Ozone ~. | New York Air Quality Measurements |
classification | Species ~. | Edgar Anderson's Iris Data |
multivariate regression | Multivar(mpg, cyl) ~ . | Motor Trend Car Road Tests |
multivariate regression | Multivariate(mpg, wt) ~ hp + drat | Motor Trend Car Road Tests |
unsupervised | Unsupervised() ~. | Veteran's Administration Lung Cancer Trial |
An example found in the test classes follows:
Dataset irisDF = spark
.read()
.option("header", "true")
.option("inferSchema", "true")
.format("csv")
.load("./test-classes/data/iris.csv");
ModelArg modelArg = new ModelArg("Species ~ .", irisDF);
An overview of all the data sets used above can be found here.
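As a rough illustration of how a formula string relates to the column names of a data set, the sketch below splits a formula at the tilde and expands a lone period into the complement of the y-variables. `FormulaSketch` and its methods are hypothetical helpers written for this example; they are not part of the package, and the parsing that the package actually performs may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the package): splits a formula into variable names. */
final class FormulaSketch {

    /** Names left of the tilde; unwraps forms such as Surv(time, status). */
    static List<String> yVariables(String formula) {
        String lhs = formula.split("~", 2)[0].trim();
        int open = lhs.indexOf('(');
        int close = lhs.lastIndexOf(')');
        String inner = (open >= 0 && close > open) ? lhs.substring(open + 1, close) : lhs;
        List<String> names = new ArrayList<>();
        for (String s : inner.split(",")) {
            if (!s.trim().isEmpty()) {
                names.add(s.trim());
            }
        }
        return names;
    }

    /** Names right of the tilde; a period means every column that is not a y-variable. */
    static List<String> xVariables(String formula, List<String> columns) {
        String rhs = formula.split("~", 2)[1].trim();
        List<String> y = yVariables(formula);
        List<String> names = new ArrayList<>();
        if (rhs.equals(".")) {
            for (String c : columns) {
                if (!y.contains(c)) {
                    names.add(c);
                }
            }
        } else {
            for (String s : rhs.split("\\+")) {
                names.add(s.trim());
            }
        }
        return names;
    }
}
```

With an assumed column list of mpg, cyl, hp, drat and wt, the formula `Multivar(mpg, cyl) ~ .` would yield mpg and cyl as y-variables and the remaining three columns as x-variables.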
formula
- Specification of the y-variables and x-variables
that are to be used in the model. These refer to the
column names in the Spark Dataframe. The y-variables
and x-variables are separated by the tilde (~)
character. A period (.) indicates that the complement
of the y-variables is to be used as the x-variables. See
the example above for more information.
dataset
- A Spark Dataset.
public String get_family()
Returns the family of analysis intended by the ModelArg
instance. It is of the following form:
Family | Description |
---|---|
RF-S | survival or competing risk |
RF-R | regression |
RF-C | classification |
RF-R+ | multivariate regression |
RF-C+ | multivariate classification |
RF-M+ | multivariate mixed |
RF-U | unsupervised |
public int get_nSize()
public int get_ySize()
public int get_xSize()
public char[] get_yType()
Returns a vector of length ySize (get_ySize()) containing the y-variable types. The vector will be null if the family is unsupervised.
The types are as follows:
Description | Value |
---|---|
time | T |
censoring | S |
boolean | B |
real | R |
ordinal | O |
categorical | C |
integer | I |
public char[] get_xType()
Returns a vector of length xSize (get_xSize()) containing the x-variable types.
The types are as follows:
Description | Value |
---|---|
boolean | B |
real | R |
ordinal | O |
categorical | C |
integer | I |
public int[] get_yLevel()
Returns a vector of length ySize (get_ySize()) containing the number of levels found in each
y-variable. The elements of this vector will be non-zero for ordinal and categorical
variables. All other elements assume the value of zero (0).
public int[] get_xLevel()
Returns a vector of length xSize (get_xSize()) containing the number of levels found in each
x-variable. The elements of this vector will be non-zero for ordinal and categorical
variables. All other elements assume the value of zero (0).
public double[][] get_yData()
public double[][] get_xData()
public void set_bootstrap(int ntree, String bootstrap, String sampleType, int sampleSize, int[][] sample, double[] caseWeight)
Note that the parameters are interdependent.
An explanation of the hierarchy, the interdependency, and
default values is below. Only specific combinations are valid.
We provide two other methods to set the bootstrap related
parameters, set_bootstrap()
and set_bootstrap(int),
to aid the user. When explicitly specifying the sample to
be used in the bootstrap algorithm (bootstrap = user), the
element sample[i][j] represents the number of times case j (in
the data set) appears in the bootstrap sample for tree i. Thus
a value of zero (0) implies that case j is out-of-bag in tree
i. Ensure that the sample over the forest is coherent: for each
tree i, the sum of sample[i][j] over all cases j should equal sampleSize.
Parameter | Default Value | Possible Values |
---|---|---|
ntree | 1000 | > 0 |
bootstrap | auto | auto, user |
sampleType | swr | swr (sampling with replacement), swor (sampling without replacement) |
sampleSize | n | > 0 |
sample | null | null, 2-D matrix of dimension [ntree] x [nSize] |
>0
swr /------ sampleSize,
/------ sampleSize? --/ caseWeight (may be null)
/ \
auto / \------ sampleSize = n,
>0 /------ sampleType? ---/ =0 caseWeight (may be null)
------ bootstrap? --/ \
/ \ \ 1 <= sampleSize <= n
ntree? --/ \ \ /----- sampleSize,
\ \ \------ sampleSize? -- / caseWeight (may be null)
\------ WARNING \ swor \
=0 \ \----- sampleSize = n * (e-1)/e,
\ =0 caseWeight (may be null)
\
\ !null
\ /------ sampleType = swr,
------ sample? --/ sampleSize (determined by sample),
user \ ntree (determined by sample)
\
\------ ERROR
null
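To make the user-specified branch of the diagram more concrete, the following sketch builds a sample matrix of dimension [ntree] x [nSize] by drawing with replacement, and then checks it for coherence. It reads the coherence requirement as "the counts for each tree i add up to sampleSize", which follows from the definition of sample[i][j]. `BootstrapSampleSketch` is a hypothetical helper, not part of the package, and the sampling done internally when bootstrap = auto may differ.

```java
import java.util.Random;

/** Hypothetical helper (not part of the package): builds a user-specified bootstrap sample. */
final class BootstrapSampleSketch {

    /** For each tree, draws sampleSize cases with replacement and records per-case counts. */
    static int[][] drawWithReplacement(int ntree, int nSize, int sampleSize, long seed) {
        Random rng = new Random(seed);
        int[][] sample = new int[ntree][nSize];
        for (int i = 0; i < ntree; i++) {
            for (int draw = 0; draw < sampleSize; draw++) {
                // Case j = rng.nextInt(nSize) appears one more time in tree i.
                sample[i][rng.nextInt(nSize)]++;
            }
        }
        return sample;
    }

    /** Returns true when the counts of every tree add up to sampleSize. */
    static boolean isCoherent(int[][] sample, int sampleSize) {
        for (int[] tree : sample) {
            int total = 0;
            for (int count : tree) {
                total += count;
            }
            if (total != sampleSize) {
                return false;
            }
        }
        return true;
    }
}
```

A matrix produced this way has zeros for the out-of-bag cases of each tree and could be handed to set_bootstrap(int ntree, int[][] sample) after passing the coherence check.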
Finally, note that when bootstrap = auto is in force, it is also
possible to use case weights in conjunction with sampleType.
ntree
- Number of trees in the forest.
bootstrap
- Type of bootstrap used in the model.
sampleType
- Type of sampling used in generating the bootstrap.
sampleSize
- Size of sample used in generating the bootstrap.
sample
- 2-D matrix explicitly specifying the bootstrap sample.
caseWeight
- The case weight vector. This vector
must be of length nSize (get_nSize()). This is a vector of non-negative
weights where, after normalizing, weight[k] is the
probability of selecting case k as a candidate when bootstrap = auto.
The default is to use uniform weights for selection. It is generally better to use real
weights rather than integers. With larger values of nSize, the
slightly different sampling algorithms deployed in the two
scenarios can result in dramatically different execution times.
public void set_bootstrap(int ntree)
ntree
- Number of trees in the random forest.
set_bootstrap(int, String, String, int, int[][], double[])
public void set_bootstrap(int ntree, int[][] sample)
ntree
- Number of trees in the random forest.
sample
- 2-D matrix explicitly specifying the bootstrap sample.
set_bootstrap(int, String, String, int, int[][], double[])
public void set_bootstrap()
public String get_bootstrap()
set_bootstrap(int, String, String, int, int[][], double[])
public String get_sampleType()
set_bootstrap(int, String, String, int, int[][], double[])
public int get_sampleSize()
set_bootstrap(int, String, String, int, int[][], double[])
public int[][] get_sample()
set_bootstrap(int, String, String, int, int[][], double[])
public int get_ntree()
set_bootstrap(int, String, String, int, int[][], double[])
public int get_blockSize()
set_blockSize(int)
public void set_blockSize()
set_blockSize(int)
public void set_blockSize(int blockSize)
set_blockSize()
public void set_mtry(int mtry)
mtry
- The number of x-variables to be randomly selected as candidates for splitting a node.
This number must be such that 1 ≤ mtry ≤ xSize.
If the value is out of range, the default value will be applied:
Family | Default Value |
---|---|
RF-R, RF-R+ | xSize/3 |
all others | sqrt(xSize) |
The default is to use uniform weights for selection, though this can be changed with set_xWeight(double[]).
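The range check and the family-dependent default for mtry can be sketched as follows. The documentation does not say how xSize/3 and sqrt(xSize) are rounded, so rounding up and clamping to [1, xSize] are assumptions of this example. `MtryDefaultSketch` is a hypothetical helper, not part of the package.

```java
/** Hypothetical helper (not part of the package): resolves mtry against its documented range. */
final class MtryDefaultSketch {

    /** Default per family: xSize/3 for RF-R and RF-R+, sqrt(xSize) for all others. */
    static int defaultMtry(String family, int xSize) {
        boolean regression = family.equals("RF-R") || family.equals("RF-R+");
        double raw = regression ? xSize / 3.0 : Math.sqrt(xSize);
        // Assumption: round up, then clamp so that 1 <= mtry <= xSize.
        int mtry = (int) Math.ceil(raw);
        return Math.max(1, Math.min(mtry, xSize));
    }

    /** Keeps a requested value that is in range; otherwise falls back to the default. */
    static int resolveMtry(String family, int xSize, int requested) {
        boolean inRange = requested >= 1 && requested <= xSize;
        return inRange ? requested : defaultMtry(family, xSize);
    }
}
```

Under these assumptions a regression forest with nine x-variables would default to mtry = 3, and a classification forest with sixteen x-variables to mtry = 4.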
public void set_mtry()
set_mtry(int)
public int get_mtry()
set_mtry(int)
public void set_htry(int htry)
htry
- The maximum hypercube dimension to be considered in Greedy Splitting.
If htry = 0, Standard Splitting is in effect. If htry > 0, Greedy Splitting is in effect.
If the value is out of range, the default value of zero will be applied:
htry | Protocol |
---|---|
0 | standard splitting |
htry > 0 | greedy splitting |
public int get_htry()
set_htry(int)
public void set_nImpute(int nImpute)
nImpute
- The number of iterations for the missing data algorithm.
The default value is one (1).
Performance measures such as out-of-bag (OOB) error rates tend
to become optimistic if nImpute > 1.
public int get_nImpute()
set_nImpute(int)
public double[] get_caseWeight()
set_bootstrap(int, String, String, int, int[][], double[])
public void set_ytry(int ytry)
ytry
- The number of randomly selected pseudo-responses when
unsupervised forests is in force. The default value is one
(1). This means at every node, and every split attempt, one y-variable will be
selected from the (xSize - 1) remaining x-variables when calculating the split statistic.
This number must be such that 1 ≤ ytry ≤ (xSize - 1).
public int get_ytry()
set_ytry(int)
public void set_yWeight(double[] weight)
weight
- The y-variable weight vector. This vector must
be of length ySize (get_ySize()). This is a vector of
non-negative weights. The vector has two purposes. Purpose 1:
After normalizing, weight[k] is the probability of selecting
y-variable k to include in the multivariate split statistic.
This is useful in big-r situations when ySize is very large and
the user desires to restrict the split statistic calculation to
ytry (get_ytry()) y-variables instead of all ySize
y-variables. Purpose 2: All y-variables with weight zero
define a special feature matrix. This feature matrix is
presented to the user when custom splitting is in effect. All
y-variables with non-zero weight arrive in the custom split
rule as usual. In both uses, the default is to use uniform
weights. For Purpose 1, it is generally better to use real
weights rather than integers. With larger values of ySize, the
slightly different sampling algorithms deployed in the two
scenarios can result in dramatically different execution times.
For Purpose 2, the only value that matters is the presence of
zero.
public void set_yWeight()
set_yWeight(double[])
public double[] get_yWeight()
set_yWeight(double[])
public void set_xWeight(double[] weight)
weight
- The x-variable weight vector. This vector
must be of length xSize (get_xSize()). This is a vector of non-negative
weights where, after normalizing, weight[k] is the
probability of selecting x-variable k as a candidate for splitting a node.
The default is to use uniform weights for selection. It is generally better to use real
weights rather than integers. With larger values of xSize, the
slightly different sampling algorithms deployed in the two
scenarios can result in dramatically different execution times.
public void set_xWeight()
set_xWeight(double[])
public double[] get_xWeight()
set_xWeight(double[])
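The phrase "after normalizing" in the description of set_xWeight(double[]) can be read as dividing each weight by the sum of all weights, so that the result is a vector of selection probabilities. The sketch below shows that reading. `WeightNormalizationSketch` is a hypothetical helper, not part of the package, and the rejection of an all-zero vector is an assumption of this example.

```java
/** Hypothetical helper (not part of the package): turns non-negative weights into probabilities. */
final class WeightNormalizationSketch {

    /** Returns weight[k] / sum(weight) for every k. */
    static double[] normalize(double[] weight) {
        double sum = 0.0;
        for (double w : weight) {
            if (w < 0.0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            sum += w;
        }
        if (sum <= 0.0) {
            // Assumption: a vector of zeros cannot be normalized and is rejected here.
            throw new IllegalArgumentException("at least one weight must be positive");
        }
        double[] probability = new double[weight.length];
        for (int k = 0; k < weight.length; k++) {
            probability[k] = weight[k] / sum;
        }
        return probability;
    }
}
```

For example, the weights {1.0, 1.0, 2.0} normalize to {0.25, 0.25, 0.5}, so the third x-variable would be twice as likely as either of the others to be drawn as a split candidate.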
public void set_xStatisticalWeight(double[] weight)
weight
- The x-variable statistical weight vector. This vector must be of
length xSize (get_xSize()). This is a vector of
non-negative weights where, after normalizing, weight[k] is the
multiplier by which the split statistic for an x-variable is
adjusted. A large value encourages the node to split on the
x-variable. The default is to use uniform weights so that all
x-variables are treated equally.
public void set_xStatisticalWeight()
set_xStatisticalWeight(double[])
public double[] get_xStatisticalWeight()
set_xStatisticalWeight(double[])
public void set_eventWeight(double[] weight)
weight
- The event weight vector, when survival or competing risk forests is in force. This vector must be the same length as the
number of events in the data set (get_eventCount()).
This is a vector of non-negative weights,
where, after normalizing, weight[k] is the multiplier by which
the component of the split statistic related to event[k] is adjusted.
The default is to use a composite splitting rule which is an average over all event types (a democratic approach).
To single out an event type, set all weights other than the one you are interested in to zero (0).
Finally, note that regardless of how the weight vector is specified, the returned forest object always
provides estimates for all event types.
public void set_eventWeight()
set_eventWeight(double[])
public double[] get_eventWeight()
set_eventWeight(double[])
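Singling out one event type, as described for set_eventWeight(double[]), amounts to a weight vector that is zero everywhere except at the event of interest. The sketch below builds such a vector. `EventWeightSketch` is a hypothetical helper, not part of the package, and the zero-based position of an event type within the vector is an assumption of this example.

```java
/** Hypothetical helper (not part of the package): builds an event weight vector. */
final class EventWeightSketch {

    /** Returns a vector of length eventCount with weight 1.0 at eventIndex and 0.0 elsewhere. */
    static double[] singleOut(int eventCount, int eventIndex) {
        if (eventIndex < 0 || eventIndex >= eventCount) {
            throw new IllegalArgumentException("eventIndex must lie in [0, eventCount)");
        }
        double[] weight = new double[eventCount]; // Java initializes every element to 0.0
        weight[eventIndex] = 1.0;
        return weight;
    }
}
```

The length argument would typically come from get_eventCount(), and the resulting vector could then be passed to set_eventWeight(double[]).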
public int get_eventCount()
public void set_timeInterest(double[] timeInterest)
timeInterest
- The time interest vector, when survival or competing risk
forests is in force. This is a vector of real values to be
used to constrain the ensemble calculations. Using time points
at which events do not occur does not result in information
gain. The default action is to use all observed event times in
the data set.
public void set_timeInterest()
set_timeInterest(double[])
public double[] get_timeInterest()
set_timeInterest(double[])
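The default described for set_timeInterest(double[]), "all observed event times in the data set", can be pictured with the sketch below, which collects the distinct times at which an event occurred. `TimeInterestSketch` is a hypothetical helper, not part of the package. Two assumptions are made: a status of zero marks a censored case, and the result is sorted and free of duplicates. Whether the package orders or deduplicates its default vector in the same way is not stated in the documentation.

```java
import java.util.TreeSet;

/** Hypothetical helper (not part of the package): collects observed event times. */
final class TimeInterestSketch {

    /** Returns the sorted, distinct times of all cases whose status is non-zero. */
    static double[] observedEventTimes(double[] time, int[] status) {
        if (time.length != status.length) {
            throw new IllegalArgumentException("time and status must have the same length");
        }
        TreeSet<Double> unique = new TreeSet<>();
        for (int i = 0; i < time.length; i++) {
            // Assumption: status 0 means censored; any other value is an event type.
            if (status[i] != 0) {
                unique.add(time[i]);
            }
        }
        double[] result = new double[unique.size()];
        int k = 0;
        for (double t : unique) {
            result[k++] = t;
        }
        return result;
    }
}
```

A user-supplied vector that is restricted to a subset of these times constrains the ensemble calculations, whereas adding time points at which no event occurred would not add information.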
public int get_timeInterestSize()
get_timeInterest()
public void set_splitRule(String splitRule)
splitRule
- The split rule to be used in generating the
model. The split rules available are detailed below. The rule
in bold denotes the default split rule for each family. The
default split rule is applied when the user does not specify a
split rule. Survival and Competing Risk both have two split
rules. Regression has three flavours of split rules based on
mean-squared error. Classification has three flavours of split
rules based on the Gini index, and one additional rule for
ordinal outcomes. The Multivariate and Unsupervised split rules
are a composite rule based on Regression and
Classification. Each component of the composite is normalized
so that the magnitude of any one y-variable does not influence
the statistic. All families also allow the user to define a
custom split rule statistic. Some basic C-programming skills
are required. Examples for all the families reside in the C
source code directory of the package in the file
src/main/c/splitCustom.c. Note that recompiling
and re-installing the package is necessary after modifying the
source code.
Family | Split Rule Description | Value |
---|---|---|
survival | log-rank | logrank |
survival | log-rank score | logrankscore |
competing risk | log-rank modified weighted | logrankCR |
competing risk | log-rank | logrankACR |
regression | mean-squared error weighted | mse |
regression | mean-squared error unweighted | mse.unwt |
regression | mean-squared error heavy weighted | mse.hvwt |
classification | Gini index weighted | gini |
classification | Gini index unweighted | gini.unwt |
classification | Gini index heavy weighted | gini.hvwt |
classification | Ranked Probability Score | rps |
multivariate regression | Composite mean-squared error | mv.mse |
multivariate classification | Composite Gini index | mv.gini |
multivariate mixed | Composite Gini and MSE | mv.mix |
unsupervised | pseudo-response adaptive | unsupv |
public void set_splitRule()
set_splitRule(String)
public String get_splitRule()
set_splitRule(String)
public void set_nSplit(int nSplit)
nSplit
- The parameter specifying deterministic versus non-deterministic splitting. The parameter must be
a non-negative integer value. When zero (0), deterministic
splitting for an x-variable is in force. When non-zero, a
maximum of nSplit points are randomly chosen among the
possible split points for an x-variable. This can
significantly decrease computation time over deterministic splitting. The
default value for this parameter varies with the split rule: When pure
random splitting is in force, the default and only value for this parameter is one (1).
When any other split rule is in force, the default value is
zero (0).
public void set_nSplit()
set_nSplit(int)
public int get_nSplit()
set_nSplit(int)
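The difference between deterministic and non-deterministic splitting, as described for set_nSplit(int), can be sketched for a single x-variable as follows: with nSplit equal to zero every possible split point is considered, and otherwise at most nSplit points are chosen at random. `NSplitSketch` is a hypothetical helper, not part of the package, and drawing the points by shuffling is an assumption of this example rather than the method the native code uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of the package): selects candidate split points. */
final class NSplitSketch {

    /** Returns all points when nSplit is 0, otherwise at most nSplit randomly chosen points. */
    static List<Double> candidateSplits(List<Double> possible, int nSplit, long seed) {
        if (nSplit < 0) {
            throw new IllegalArgumentException("nSplit must be non-negative");
        }
        if (nSplit == 0 || nSplit >= possible.size()) {
            return new ArrayList<>(possible); // deterministic: every point is a candidate
        }
        List<Double> shuffled = new ArrayList<>(possible);
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<>(shuffled.subList(0, nSplit));
    }
}
```

Evaluating the split statistic on only nSplit candidates, instead of on every possible point, is what reduces the computation time mentioned above.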
public void set_nodeSize(int nodeSize)
nodeSize
- The desired average number of unique cases in a terminal
node. The parameter ensures that the average nodesize across
the forest will be at least nodeSize. Some nodes will be
smaller than this value and some will be larger. The default
value for this parameter varies with the family, though it is
recommended to experiment with different values.
Family | Default Node Size |
---|---|
survival | 3 |
competing risk | 6 |
regression | 5 |
multivariate regression | 5 |
classification | 1 |
multivariate classification | 1 |
multivariate mixed | 3 |
unsupervised | 3 |
public void set_nodeSize()
set_nodeSize(int)
public int get_nodeSize()
set_nodeSize(int)
public void set_nodeDepth(int nodeDepth)
nodeDepth
- The maximum depth to which a tree should be grown. The
default behaviour is that this parameter is ignored. Not
setting this parameter or setting this parameter to a negative
value will ensure that this parameter is ignored.
public int get_nodeDepth()
public void set_seed(int seed)
get_seed()
public int get_seed()
set_seed(int)
public void set_trace(int trace)
trace
- The trace parameter indicating the specified update
interval in seconds. During extended execution times,
the approximate time to complete the execution is output to a
trace file in the user's HOME directory. The format and
location of the trace file can be controlled by modifying
src/main/resources/spark/log.properties. A value
of zero (0) turns off the trace.
public int get_trace()
public void set_rfCores(int rfCores)
rfCores
- The number of cores to be used by the algorithm when OpenMP
parallel processing is in force. The default behaviour is to
use all cores available. This is achieved by setting the
parameter to a negative value. The result is that each core
will be independently tasked with growing a tree. Significant
savings in elapsed computation times can be achieved.
public int get_rfCores()
set_rfCores(int)
public void setEnsembleArg(String key, String value)
The default option for each key is in bold.
Ensemble Key | Possible Values |
---|---|
weight | Grow or Restore Only: no, inbag, oob; Predict Only: no, yes |
proximity | Grow or Restore Only: no, inbag, oob; Predict Only: no, yes |
membership | no, yes |
importance | no, permute, random, permute.ensemble, random.ensemble |
varUsed | no, every.tree, sum.tree |
splitDepth | no, every.tree, sum.tree |
errorType | For RF-C, RF-C+ Families Only: misclass, brier, g.mean, no; For RF-R, RF-R+ Families Only: mse, no; For RF-M+ Family Only: default, no; For RF-S Family Only: c-index, no; For RF-U Family Only: no |
predictionType | For RF-C, RF-C+ Families Only: max.vote, rfq; For RF-R, RF-R+ Families Only: mean; For RF-M+ Family Only: default; For RF-S Family Only: default; For RF-U Family Only: no |
key
- The name of the ensemble output.
value
- The specific value for the ensemble output.
public void setEnsembleArg()
setEnsembleArg(String, String).
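Because the keys and values of setEnsembleArg(String, String) are plain strings, it can help to check a pair against the documented possibilities before passing it on. The sketch below does this for the four keys whose values do not depend on the family or on the mode (grow, restore or predict). `EnsembleArgSketch` is a hypothetical helper, not part of the package, and how the package itself reacts to an unknown key or value is not covered here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of the package): checks ensemble key/value pairs. */
final class EnsembleArgSketch {

    // Only the family-independent keys from the table of possible values are listed.
    private static final Map<String, List<String>> ALLOWED = new HashMap<>();

    static {
        ALLOWED.put("membership", Arrays.asList("no", "yes"));
        ALLOWED.put("importance",
                Arrays.asList("no", "permute", "random", "permute.ensemble", "random.ensemble"));
        ALLOWED.put("varUsed", Arrays.asList("no", "every.tree", "sum.tree"));
        ALLOWED.put("splitDepth", Arrays.asList("no", "every.tree", "sum.tree"));
    }

    /** Returns true when the key is known to this sketch and the value is one of its options. */
    static boolean isValid(String key, String value) {
        List<String> values = ALLOWED.get(key);
        return values != null && values.contains(value);
    }
}
```

A caller might guard a request such as setEnsembleArg("importance", "permute") with isValid("importance", "permute") so that a typing error in either string is caught early.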
public String getEnsembleArg(String key)
key
- The name of the ensemble output.
setEnsembleArg(String, String).
Copyright © 2018. All rights reserved.