Building RF-SRC for the R Environment and Apache Spark

Udaya Kogalur & Hemant Ishwaran

The R Package

Source Code and Bug Reporting

Regular stable releases of this package are available on CRAN here and on the master branch on our GitHub repository. Interim, sometimes unstable, development builds with bug fixes and/or additional functionality are available on the develop branch of our GitHub repository.

Bugs may be reported via GitHub here. Please provide the following information with any report:

1. sessionInfo()

2. A minimal reproducible example consisting of the following items:

• a minimal dataset, necessary to reproduce the error

• the minimal runnable code necessary to reproduce the error, which can be run on the given dataset

• information on the packages used, the R version, and the system the code is run on

• in the case of random processes, a seed (set by set.seed()) for reproducibility
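The items above can be assembled into a single script. The following is a minimal sketch rather than an official template: the dataset (a subset of the built-in mtcars data), the seed value, and the rfsrc() call are illustrative assumptions, and the model fit is skipped when randomForestSRC is not installed.

```r
# Sketch of a bug-report script; dataset, seed, and model are illustrative.
set.seed(2024)                                # seed for reproducibility

# A minimal dataset: a small subset of the built-in mtcars data.
dat <- mtcars[1:20, c("mpg", "wt", "hp")]

# The minimal runnable code that triggers the problem (assumed call).
if (requireNamespace("randomForestSRC", quietly = TRUE)) {
  fit <- randomForestSRC::rfsrc(mpg ~ wt + hp, data = dat, ntree = 25)
  print(fit)
}

# R version, platform, and loaded packages.
info <- sessionInfo()
print(info)
```

Pasting the script together with its full console output into the report covers the dataset, the code, the environment, and the seed in one place.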

Creating and Installing the randomForestSRC R Package

To create the R package using the GitHub repository, you will need an installation of R (> v3.0) that is capable of compiling source code packages containing C code. This means that the appropriate C compilers need to be in place and accessible by the R packaging and installation engine. Detailed descriptions of how this is achieved are available on a number of sites online and will not be reproduced here. You will also need Apache Ant (v1.10) and Java JDK (v1.8). Once the R package development environment is in place, it is possible to build our package natively on your platform using the following steps:

From the top-level directory (the directory containing build.xml), the command

ant

will give you several options. The command

ant source-cran

will create the R source code package directory-tree ./target/cran/randomForestSRC/. To install randomForestSRC in your default library, change to the directory ./target/cran/ and type

R CMD INSTALL --preclean --clean randomForestSRC

This will install an OpenMP parallel version of the package if the host system is capable of supporting this mode of execution.

Please note that on some platforms, even though an OpenMP-capable C compiler may have been installed, the R packaging and installation engine does not pick up the appropriate compiler. For example, on macOS the default compiler is Clang, which is not OpenMP capable out of the box. You will need to install an OpenMP version of it, or install GCC using Homebrew or another package manager. Most importantly, you will also need to direct the R packaging and installation engine to the OpenMP-capable compiler. This is done by creating a .R directory in your HOME directory, and creating a Makevars file in that directory containing the appropriate compiler instructions. As an example, on macOS Sierra (v10.12) our installation has the following as its Makevars file:


F77 = gfortran-7
FC  = gfortran-7
CC  = gcc-7
CXX = g++-7
CFLAGS = -I/usr/local/Cellar/gcc/7.2.0/include
LDFLAGS = -L/usr/local/Cellar/gcc/7.2.0/lib/gcc/7


OpenMP Parallel Processing – Setting the Number of CPUs

There are several ways to control the number of CPU cores that the package accesses during OpenMP parallel execution. First, you will need to determine the number of cores on your local machine. Do this by starting an R session and issuing the command detectCores(). You will require the parallel package for this.

Then you can do the following:

At the start of every R session, you can set the number of cores accessed during OpenMP parallel execution by issuing the command options(rf.cores = x), where x is the number of cores. If x is a negative number, the package will access the maximum number of cores on your machine. The options command can also be placed in the user's .Rprofile file for convenience. Alternatively, you can initialize the environment variable RF_CORES in your shell environment.

If left unspecified, rf.cores defaults to -1 (-1L), which uses all available cores, with a minimum of two.
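A minimal sketch of the steps above, using only the base parallel package. The guard against NA is an addition made here: detectCores() can return NA when the core count cannot be determined.

```r
library(parallel)

n_cores <- detectCores()            # number of cores reported by the system
if (is.na(n_cores)) n_cores <- 1L   # fall back if the count is unknown

options(rf.cores = n_cores)         # cores used during OpenMP execution
getOption("rf.cores")               # confirm the value now in effect
```

If the same lines are placed in .Rprofile, the library(parallel) call (or a parallel:: prefix) is needed, because parallel is not attached by default at startup.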

R-side Parallel Processing – Setting the Number of CPUs

The package also implements R-side parallel processing via the parallel package contained in the base R distribution. However, the parallel package must be explicitly loaded to take advantage of this functionality. When this is the case, the R function lapply() is replaced with the parallel version mclapply(). You can set the number of cores accessed by mclapply() by issuing the command

options(mc.cores = x)

where x is the number of cores. The options command can also be placed in the user's .Rprofile file for convenience. Alternatively, you can initialize the environment variable MC_CORES in your shell environment. See the help files in parallel for more information.

The default value for mclapply() on non-Windows systems is two (2L) cores. On Windows systems, the default value is one (1L) core.
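To illustrate how mc.cores affects mclapply() in general, here is a small sketch that is independent of randomForestSRC. The worker count and the toy computation are arbitrary choices for illustration; they do not show how the package uses mclapply() internally.

```r
library(parallel)

# mclapply() relies on forking, which is unavailable on Windows,
# so only one core can be used there.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L
options(mc.cores = n_workers)

# mclapply() picks up its default core count from the mc.cores option.
squares <- mclapply(1:4, function(i) i^2)
unlist(squares)
```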

Example: Setting the Number of CPUs

As an example, issuing the following options command uses all available cores for both OpenMP and R-side processing:

options(rf.cores = parallel::detectCores(), mc.cores = parallel::detectCores())

As stated above, this options command can be placed in the user's .Rprofile file.

Cautionary Note on Parallel Execution

1. Once the package has been compiled with OpenMP enabled, trees will be grown in parallel using the rf.cores option. Independently of this, we also utilize mclapply() to parallelize loops in R-side pre-processing and post-processing of the forest. This is always available and independent of whether the user chooses to compile the package with the OpenMP option enabled.

2. It is important to NOT write programs that fork R processes containing OpenMP threads. That is, one should not use mclapply() around the functions rfsrc(), predict.rfsrc(), vimp.rfsrc(), var.select.rfsrc(), find.interaction.rfsrc() and partial.rfsrc(). In such a scenario, correct program execution is not guaranteed.
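One way to respect point 2 is to keep the outer loop sequential and leave any parallelism to the package itself. The sketch below is illustrative only: fit_sequentially() is a hypothetical helper, not part of the package, and the dataset and rfsrc() call are assumptions chosen for brevity.

```r
# Hypothetical helper: apply a fitting function to each dataset in turn.
# Plain lapply() does not fork, so no R process is forked around the fit.
fit_sequentially <- function(datasets, fit_fun) {
  lapply(datasets, fit_fun)
}

# Illustrative data: mtcars split by number of cylinders.
groups <- split(mtcars, mtcars$cyl)

# Fit one forest per group, skipping the fit if the package is not installed.
if (requireNamespace("randomForestSRC", quietly = TRUE)) {
  forests <- fit_sequentially(groups, function(d) {
    randomForestSRC::rfsrc(mpg ~ wt + hp, data = d, ntree = 25)
  })
}
```

Replacing lapply() with mclapply() in the helper would produce exactly the forked pattern the note warns against.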

The Apache Spark Package

This effort is VERY PRELIMINARY and not ready for release, beta, or even alpha. It is provided as a status update of our efforts and nothing more.

Creating and Installing the randomForestSRC Spark Package

To create the Apache Spark package using the GitHub repository, you will need the following tools: Apache Ant (v1.10), Java JDK (v1.8), Scala (v2.12), and Apache Maven (v3.5). You must also have Apache Spark (v2.1) installed.

From the top-level directory (the directory containing build.xml), the command

ant

will give you several options. The command

ant source-spark

will create the Spark source code package directory-tree ./target/spark/. To compile the source code package, type

ant build-spark

This will create the Spark target package directory-tree ./target/spark/target/. A sample helloRandomForestSRC program can be executed by changing to the directory ./target/spark/target/ and typing

./hello.sh or ./hello.cmd according to your operating system. The source code for the example is located in the GitHub repository [here]. It does little more than start a Spark session, grow a forest, and stop the Spark session. Raw, unformatted ensemble information is written to a log file rfsrc-x.log in the user's HOME directory, though it is not yet available for examination in any coherent way.

Java API Specification for randomForestSRC

The Java API Specification for randomForestSRC is available [here]. It is purely skeletal at this point, but will be fleshed out in more detail in the near future.