Type: | Package |
Title: | Innovative Complex Split Procedures in Random Forests Through Candidate Split Sampling |
Version: | 0.6.0 |
Date: | 2025-05-03 |
Maintainer: | Roman Hornung <hornung@ibe.med.uni-muenchen.de> |
Description: | Implementation of three methods based on the diversity forest (DF) algorithm (Hornung, 2022, <doi:10.1007/s42979-021-00920-1>), a split-finding approach that enables complex split procedures in random forests. The package includes: 1. Interaction forests (IFs) (Hornung & Boulesteix, 2022, <doi:10.1016/j.csda.2022.107460>): Model quantitative and qualitative interaction effects using bivariable splitting. They come with the Effect Importance Measure (EIM), which can be used to identify variable pairs that have well-interpretable quantitative and qualitative interaction effects with high predictive relevance. 2. Two random forest-based variable importance measures (VIMs) for multi-class outcomes: the class-focused VIM, which ranks covariates by their ability to distinguish individual outcome classes from the others, and the discriminatory VIM, which measures overall covariate influence irrespective of class-specific relevance. 3. The basic form of diversity forests that uses conventional univariable, binary splitting (Hornung, 2022). Except for the multi-class VIMs, all methods support categorical, metric, and survival outcomes. The package includes visualization tools for interpreting the identified covariate effects. Built as a fork of the 'ranger' R package (main author: Marvin N. Wright), which implements random forests using an efficient C++ implementation. |
SystemRequirements: | C++17 |
Encoding: | UTF-8 |
License: | GPL-3 |
Imports: | Rcpp (≥ 0.11.2), Matrix, ggplot2, ggpubr, scales, nnet, sgeostat, rms, MapGAM, gam, rlang, grDevices, RColorBrewer, RcppEigen, survival, patchwork |
LinkingTo: | Rcpp, RcppEigen |
Depends: | R (≥ 3.5) |
Suggests: | testthat, BOLTSSIRR |
Additional_repositories: | https://romanhornung.github.io/drat |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | yes |
Packaged: | 2025-05-05 16:41:49 UTC; hornung |
Author: | Roman Hornung [aut, cre], Marvin N. Wright [ctb, cph] |
Repository: | CRAN |
Date/Publication: | 2025-05-05 17:00:03 UTC |
Diversity Forests
Description
The diversity forest algorithm is a split-finding approach that enables complex split procedures to be realized in random forest variants. This is achieved by drastically reducing the number of candidate splits evaluated at each node. It also avoids the variable selection bias seen in conventional random forests, where variables with many possible splits are selected too frequently for splitting (Strobl et al., 2007). For details, see Hornung (2022).
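To make the split-sampling idea concrete, the following toy R sketch (a hypothetical helper, not the package's internal C++ implementation) draws nsplits random candidate (variable, split point) pairs at a node and keeps the one with the smallest impurity:

sample_best_split <- function(X, y, nsplits = 30) {
  # Gini impurity of a set of class labels:
  gini <- function(cls) 1 - sum((table(cls) / length(cls))^2)
  candidates <- lapply(seq_len(nsplits), function(i) {
    j <- sample(ncol(X), 1)                  # draw a variable at random
    s <- runif(1, min(X[, j]), max(X[, j]))  # draw a split point at random
    left <- X[, j] <= s
    # Weighted Gini impurity of the two child nodes:
    imp <- mean(left) * gini(y[left]) + mean(!left) * gini(y[!left])
    list(variable = colnames(X)[j], split = s, impurity = imp)
  })
  candidates[[which.min(sapply(candidates, function(z) z$impurity))]]
}
sample_best_split(iris[, 1:4], iris$Species)

Because only nsplits candidate splits are evaluated, rather than all possible splits of a set of candidate variables, variables with many possible splits receive no systematic advantage in the split selection.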
Details
This package currently features three methodologies that use the diversity forest algorithm:
- the basic form of diversity forests that uses univariable, binary splitting, which is also used in conventional random forests;
- interaction forests (IFs) (Hornung & Boulesteix, 2022), which use bivariable splitting to model quantitative and qualitative interaction effects. IFs feature the Effect Importance Measure (EIM), which ranks the variable pairs with respect to the predictive importance of their quantitative and qualitative interaction effects. The individual variables can be ranked as well using EIM. For details, see Hornung & Boulesteix (2022);
- two variable importance measures (VIMs) designed for random forests for multi-class outcomes: the class-focused VIM and the discriminatory VIM. The class-focused VIM ranks variables based on their ability to distinguish individual outcome classes from all others, whereas the discriminatory VIM ranks variables based on their overall influence, irrespective of class-relatedness.
Diversity forests with univariable, binary splitting can be constructed using the function divfor, interaction forests using the function interactionfor, and the class-focused and discriminatory VIMs can be computed using the function multifor. The former two methods support categorical, metric, and survival outcomes.
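As a minimal sketch of the three entry points (num.trees is kept unrealistically small here only to keep the runtime short; see the function-specific examples for realistic settings):

library("diversityForest")
model_df <- divfor(Species ~ ., data = iris, num.trees = 20)          # basic diversity forest
model_if <- interactionfor(Species ~ ., data = iris, num.trees = 20)  # interaction forest with EIM values
model_mc <- multifor(Species ~ ., data = iris, num.trees = 20)        # class-focused and discriminatory VIMs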
This package is a fork of the R package 'ranger', which implements random forests using an efficient C++ implementation. The documentation is in large part taken from 'ranger'; note that some parts of it may not apply to (the current version of) the 'diversityForest' package.
Details on further functionalities of the code that are not presented in the help pages of 'diversityForest' can be found in the help pages of 'ranger', version 0.11.0, on which 'diversityForest' is based. The code in the example sections can be used as a template for all basic application scenarios with respect to classification, regression and survival prediction.
References
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Strobl, C., Boulesteix, A.-L., Zeileis, A., Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8:25, <doi:10.1186/1471-2105-8-25>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Data on automatic analysis of cardiotocograms
Description
This data set contains measurements from 2126 fetal cardiotocograms (CTGs).
The CTGs were automatically processed and the respective diagnostic features measured.
The CTGs were also classified by three expert obstetricians and a consensus classification label
assigned to each of them. This description is taken from the UC Irvine Machine
Learning Repository, from which this data set was downloaded. The outcome CLASS
is categorical with ten classes that correspond to different fetal heart rate patterns.
See the 'Details' section below for further information.
Format
A data frame with 2126 observations, 24 covariates and one 10-class outcome variable
Details
The variables are as follows:
- b. numeric. Start instant
- e. numeric. End instant
- LBE. numeric. Fetal heart rate (FHR) baseline value assessed by medical expert (beats per minute)
- LB. numeric. FHR baseline value assessed by SisPorto (beats per minute)
- AC. numeric. Number of accelerations per second
- FM. numeric. Number of fetal movements per second
- UC. numeric. Number of uterine contractions per second
- DL. numeric. Number of light decelerations per second
- DS. numeric. Number of severe decelerations per second
- DP. numeric. Number of prolonged decelerations per second
- ASTV. numeric. Percentage of time with abnormal short term variability
- MSTV. numeric. Mean value of short term variability
- ALTV. numeric. Percentage of time with abnormal long term variability
- MLTV. numeric. Mean value of long term variability
- Width. numeric. Width of FHR histogram
- Min. numeric. Minimum value of FHR histogram
- Max. numeric. Maximum value of FHR histogram
- Nmax. numeric. Number of histogram peaks
- Nzeros. numeric. Number of histogram zeros
- Mode. numeric. Mode of the histogram
- Mean. numeric. Mean of the histogram
- Median. numeric. Median of the histogram
- Variance. numeric. Variance of the histogram
- Tendency. factor. Histogram tendency (-1 for left asymmetric; 0 for symmetric; 1 for right asymmetric)
- CLASS. factor. FHR pattern class
The classes of the outcome CLASS are as follows:
- A. Calm sleep
- B. REM sleep
- C. Calm vigilance
- D. Active vigilance
- SH. Shift pattern (A or SUSP with shifts)
- AD. Accelerative/decelerative pattern (stress situation)
- DE. Decelerative pattern (vagal stimulation)
- LD. Largely decelerative pattern
- FS. Flat-sinusoidal pattern (pathological state)
- SUSP. Suspect pattern
This is a pre-processed version of the "Cardiotocography" data set published in the UC Irvine Machine Learning Repository. The raw data contained the four additional variables Date, FileName, SegFile, and NSP, which were removed in this version of the data. Moreover, the variable DR, representing the number of repetitive decelerations per second, was removed as well because it was 0 for all observations.
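For illustration, the pre-processing described above could be reproduced along the following lines, assuming the raw UCI table has been read into a data frame raw_ctg (a hypothetical name):

# Remove the four additional variables:
raw_ctg[, c("Date", "FileName", "SegFile", "NSP")] <- NULL
# DR is 0 for all observations and is therefore removed as well:
stopifnot(all(raw_ctg$DR == 0))
raw_ctg$DR <- NULL
# Encode the two categorical variables as factors:
raw_ctg$Tendency <- factor(raw_ctg$Tendency)
raw_ctg$CLASS <- factor(raw_ctg$CLASS)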
Source
UC Irvine Machine Learning Repository, link: https://archive.ics.uci.edu/dataset/193/cardiotocography/ (Accessed: 29/08/2024)
References
Ayres-de Campos, D., Bernardes, J., Garrido, A., Marques-de-Sá, J., Pereira-Leite, L. (2000). SisPorto 2.0: a program for automated analysis of cardiotocograms. J Matern Fetal Med. 9(5):311-318, <doi:10.1002/1520-6661(200009/10)9:5<311::AID-MFM12>3.0.CO;2-9>.
Dua, D., Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/.
Examples
# Load data:
data(ctg)
# Numbers of observations per outcome class:
table(ctg$CLASS)
# Dimension of data:
dim(ctg)
# First rows of data:
head(ctg)
Construct a basic diversity forest prediction rule that uses univariable, binary splitting.
Description
Implements the most basic form of diversity forests that uses univariable, binary splitting. Currently, categorical, metric, and survival outcomes are supported.
Usage
divfor(
formula = NULL,
data = NULL,
num.trees = 500,
mtry = NULL,
importance = "none",
write.forest = TRUE,
probability = FALSE,
min.node.size = NULL,
max.depth = NULL,
replace = TRUE,
sample.fraction = ifelse(replace, 1, 0.632),
case.weights = NULL,
class.weights = NULL,
splitrule = NULL,
num.random.splits = 1,
alpha = 0.5,
minprop = 0.1,
split.select.weights = NULL,
always.split.variables = NULL,
respect.unordered.factors = NULL,
scale.permutation.importance = FALSE,
keep.inbag = FALSE,
inbag = NULL,
holdout = FALSE,
quantreg = FALSE,
oob.error = TRUE,
num.threads = NULL,
save.memory = FALSE,
verbose = TRUE,
seed = NULL,
dependent.variable.name = NULL,
status.variable.name = NULL,
classification = NULL,
nsplits = 30,
proptry = 1
)
Arguments
formula |
Object of class formula or character describing the model to fit. |
data |
Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL). |
num.trees |
Number of trees. Default is 500. |
mtry |
Artefact from 'ranger'. NOT needed for diversity forests. |
importance |
Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see the 'splitrule' argument) for survival. |
write.forest |
Save the forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction intended. |
probability |
Grow a probability forest as in Malley et al. (2012). NOTE: Not yet implemented for diversity forests! |
min.node.size |
Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 5 for probability. |
max.depth |
Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree). |
replace |
Sample with replacement. |
sample.fraction |
Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement. For classification, this can be a vector of class-specific values. |
case.weights |
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees. |
class.weights |
Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes. |
splitrule |
Splitting rule. For classification and probability estimation "gini" or "extratrees" with default "gini". For regression "variance", "extratrees" or "maxstat" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank". NOTE: For diversity forests currently only the default splitting rules are supported. |
num.random.splits |
Artefact from 'ranger'. NOT needed for diversity forests. |
alpha |
For "maxstat" splitrule: Significance threshold to allow splitting. NOT needed for diversity forests. |
minprop |
For "maxstat" splitrule: Lower quantile of covariate distribution to be considered for splitting. NOT needed for diversity forests. |
split.select.weights |
Numeric vector with weights between 0 and 1, representing the probability to select variables for splitting. Alternatively, a list of size num.trees, containing split select weight vectors for each tree can be used. |
always.split.variables |
Currently not usable. Character vector with variable names to be always selected. |
respect.unordered.factors |
Handling of unordered factor covariates. One of 'ignore' and 'order' (the option 'partition' possible in 'ranger' is not (yet) possible with diversity forests). Default is 'ignore'. Alternatively TRUE (='order') or FALSE (='ignore') can be used. |
scale.permutation.importance |
Scale permutation importance by standard error as in Breiman (2001). Only applicable if permutation variable importance mode is selected. |
keep.inbag |
Save how often observations are in-bag in each tree. |
inbag |
Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling. |
holdout |
Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error. |
quantreg |
Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set keep.inbag = TRUE to prepare out-of-bag quantile prediction. |
oob.error |
Compute OOB prediction error. Set to FALSE to save computation time, e.g. for large survival forests. |
num.threads |
Number of threads. Default is number of CPUs available. |
save.memory |
Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing, use only if you encounter memory problems. NOT needed for diversity forests. |
verbose |
Show computation status and estimated runtime. |
seed |
Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed. |
dependent.variable.name |
Name of outcome variable, needed if no formula given. For survival forests this is the time variable. |
status.variable.name |
Name of status variable, only applicable to survival data and needed if no formula given. Use 1 for event and 0 for censoring. |
classification |
Only needed if data is a matrix. Set to TRUE to grow a classification forest. |
nsplits |
Number of candidate splits to sample for each split. Default is 30. |
proptry |
Parameter that restricts the number of candidate splits considered for small nodes. If nsplits is larger than proptry times the number of possible splits at a node, the number of candidate splits is reduced to the latter value. Default is 1 (no restriction). See Hornung (2022) for details. |
Value
Object of class divfor
with elements
forest |
Saved forest (if write.forest is set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R. |
predictions |
Predicted classes/values, based on out-of-bag samples (classification and regression only). |
variable.importance |
Variable importance for each independent variable. |
prediction.error |
Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for probability estimation the Brier score, for regression the mean squared error and for survival one minus Harrell's C-index. |
r.squared |
R squared. Also called explained variance or coefficient of determination (regression only). Computed on out-of-bag data. |
confusion.matrix |
Contingency table for classes and predictions based on out-of-bag samples (classification only). |
unique.death.times |
Unique death times (survival only). |
chf |
Estimated cumulative hazard function for each sample (survival only). |
survival |
Estimated survival function for each sample (survival only). |
call |
Function call. |
num.trees |
Number of trees. |
num.independent.variables |
Number of independent variables. |
min.node.size |
Value of minimal node size used. |
treetype |
Type of forest/tree: classification, regression or survival. |
importance.mode |
Importance mode used. |
num.samples |
Number of samples. |
splitrule |
Splitting rule. |
replace |
Sample with replacement. |
nsplits |
Value of |
proptry |
Value of |
Author(s)
Roman Hornung, Marvin N. Wright
References
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, <doi:10.1023/A:1010933404324>.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, <doi:10.3414/ME00-01-0052>.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research 7:983-999.
See Also
predict.divfor, tunedivfor
Examples
## Not run:
## Load package:
library("diversityForest")
## Set seed to obtain reproducible results:
set.seed(1234)
## Diversity forest with default settings (NOT recommended)
# Classification:
divfor(Species ~ ., data = iris, num.trees = 20)
# Regression:
iris2 <- iris; iris2$Species <- NULL; iris2$Y <- rnorm(nrow(iris2))
divfor(Y ~ ., data = iris2, num.trees = 20)
# Survival:
library("survival")
divfor(Surv(time, status) ~ ., data = veteran, num.trees = 20, respect.unordered.factors = "order")
# NOTE: num.trees = 20 is specified too small for practical
# purposes - the prediction performance of the resulting
# forest will be suboptimal!!
# In practice, num.trees = 500 (default value) or a
# larger number should be used.
## Diversity forest with specified values for nsplits and proptry (NOT recommended)
divfor(Species ~ ., data = iris, nsplits = 10, proptry = 0.4, num.trees = 20)
# NOTE again: num.trees = 20 is specified too small for practical purposes.
## Applying diversity forest after optimizing the values of nsplits and proptry (recommended)
tuneres <- tunedivfor(formula = Species ~ ., data = iris, num.trees.pre = 20)
# NOTE: num.trees.pre = 20 is specified too small for practical
# purposes - the out-of-bag error estimates of the forests
# constructed during optimization will be much too variable!!
# In practice, num.trees.pre = 500 (default value) or a
# larger number should be used.
divfor(Species ~ ., data = iris, nsplits = tuneres$nsplitsopt,
proptry = tuneres$proptryopt, num.trees = 20)
# NOTE again: num.trees = 20 is specified too small for practical purposes.
## Prediction
train.idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]
tuneres <- tunedivfor(formula = Species ~ ., data = iris.train, num.trees.pre = 20)
# NOTE again: num.trees.pre = 20 is specified too small for practical purposes.
rg.iris <- divfor(Species ~ ., data = iris.train, nsplits = tuneres$nsplitsopt,
proptry = tuneres$proptryopt, num.trees = 20)
# NOTE again: num.trees = 20 is specified too small for practical purposes.
pred.iris <- predict(rg.iris, data = iris.test)
table(iris.test$Species, pred.iris$predictions)
## Variable importance
rg.iris <- divfor(Species ~ ., data = iris, importance = "permutation", num.trees = 20)
# NOTE again: num.trees = 20 is specified too small for practical purposes.
rg.iris$variable.importance
## End(Not run)
Data on human activity recognition using smartphones
Description
This data set contains sensor data from 30 volunteers aged 19-48 years, performing
six activities while wearing Samsung Galaxy S II smartphones on their waists.
The sensors recorded 3-axial linear acceleration and angular velocity at 50Hz.
The experiments were video-recorded to label the data manually. The outcome Activity is categorical with six classes that differentiate the six activities.
This is an updated version of the Human Activity Recognition Using Smartphones
data set published in the UC Irvine Machine Learning Repository. This updated
version published on OpenML includes both raw sensor signals and updated
activity labels, with aggregated measurements for each individual and activity.
Format
A data frame with 180 observations (one per combination of the 30 volunteers and the 6 activities), 66 covariates and one 6-class outcome variable
Details
The classes of the outcome Activity are as follows: LAYING, SITTING, STANDING, WALKING, WALKING_DOWNSTAIRS, WALKING_UPSTAIRS.
The OpenML data set contained one additional variable Person, which was removed because it has too many categories to be used as a covariate in prediction.
Source
Updated version: OpenML: data.name: Smartphone-Based_Recognition_of_Human_Activities, data.id: 4153, link: https://www.openml.org/d/4153/ (Accessed: 29/08/2024)
Original version: UC Irvine Machine Learning Repository, link: https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones/ (Accessed: 29/08/2024)
References
Reyes-Ortiz, J.-L., Oneto, L., Samà, A., Parra, X., Anguita, D. (2016). Transition-aware human activity recognition using smartphones. Neurocomputing, 171:754-767, <doi:10.1016/j.neucom.2015.07.085>.
Vanschoren, J., van Rijn, J. N., Bischl, B., Torgo, L. (2013). OpenML: networked science in machine learning. SIGKDD Explorations 15(2):49-60, <doi:10.1145/2641190.2641198>.
Dua, D., Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/.
Examples
# Load data:
data(hars)
# Numbers of observations per outcome class:
table(hars$Activity)
# Dimension of data:
dim(hars)
# First rows of (subset) data:
head(hars[,1:5])
Diversity Forest variable importance
Description
Extract variable importance of a divfor object.
Usage
## S3 method for class 'divfor'
importance(x, ...)
Arguments
x |
Object of class divfor. |
... |
Further arguments passed to or from other methods. |
Value
Variable importance measures.
Author(s)
Marvin N. Wright
See Also
divfor
Construct an interaction forest prediction rule and calculate EIM values as described in Hornung & Boulesteix (2022).
Description
Implements interaction forests as described in Hornung & Boulesteix (2022). Currently, categorical, metric, and survival outcomes are supported. Interaction forests feature the effect importance measure (EIM), which can be used to rank the covariate pairs with respect to the impact of their interaction effects on prediction. This makes it possible to identify relevant interaction effects. Interaction forests focus on well-interpretable interaction effects. See the 'Details' section below for more information. In addition, we strongly recommend consulting Section C of Supplementary Material 1 of Hornung & Boulesteix (2022), which uses detailed examples of interaction forest analyses with code to illustrate how interaction forests can be used in applications.
Usage
interactionfor(
formula = NULL,
data = NULL,
importance = "both",
num.trees = NULL,
simplify.large.n = TRUE,
num.trees.eim.large.n = NULL,
write.forest = TRUE,
probability = FALSE,
min.node.size = NULL,
max.depth = NULL,
replace = FALSE,
sample.fraction = ifelse(replace, 1, 0.7),
case.weights = NULL,
class.weights = NULL,
splitrule = NULL,
always.split.variables = NULL,
keep.inbag = FALSE,
inbag = NULL,
holdout = FALSE,
quantreg = FALSE,
oob.error = TRUE,
num.threads = NULL,
verbose = TRUE,
seed = NULL,
dependent.variable.name = NULL,
status.variable.name = NULL,
npairs = NULL,
classification = NULL
)
Arguments
formula |
Object of class formula or character describing the model to fit. |
data |
Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL). |
importance |
Effect importance mode. One of the following: "both" (the default), "qualitative", "quantitative", "mainonly", "none". See the 'Details' section below for explanation. |
num.trees |
Number of trees. The default number is 20000 if EIM values should be computed and 2000 otherwise. Note that if the data set has more than 1000 observations and EIM values are computed, a second forest with restricted tree depth is grown for the EIM calculation; its number of trees is controlled by num.trees.eim.large.n (see the 'Details' section). |
simplify.large.n |
Should restricted tree depths be used, when calculating EIM values for large data sets? See the 'Details' section below for more information. Default is |
num.trees.eim.large.n |
Number of trees in the forest used for calculating the EIM values for large data sets. If NULL (the default), this number is chosen internally. |
write.forest |
Save the forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction intended. |
probability |
Grow a probability forest as in Malley et al. (2012). |
min.node.size |
Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 5 for probability. |
max.depth |
Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree). |
replace |
Sample with replacement. Default is |
sample.fraction |
Fraction of observations to sample. Default is 1 for sampling with replacement and 0.7 for sampling without replacement. For classification, this can be a vector of class-specific values. |
case.weights |
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees. |
class.weights |
Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes. |
splitrule |
Splitting rule. For classification and probability estimation "gini" or "extratrees" with default "gini". For regression "variance", "extratrees" or "maxstat" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank". NOTE: For interaction forests currently only the default splitting rules are supported. |
always.split.variables |
Currently not usable. Character vector with variable names to be always selected. |
keep.inbag |
Save how often observations are in-bag in each tree. |
inbag |
Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling. |
holdout |
Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error. NOTE: Currently not usable for interaction forests. |
quantreg |
Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set keep.inbag = TRUE to prepare out-of-bag quantile prediction. |
oob.error |
Compute OOB prediction error. Set to FALSE to save computation time, e.g. for large survival forests. |
num.threads |
Number of threads. Default is number of CPUs available. |
verbose |
Show computation status and estimated runtime. |
seed |
Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed. |
dependent.variable.name |
Name of outcome variable, needed if no formula given. For survival forests this is the time variable. |
status.variable.name |
Name of status variable, only applicable to survival data and needed if no formula given. Use 1 for event and 0 for censoring. |
npairs |
Number of variable pairs to sample for each split. Default is the square root of the number of independent variables divided by 2 (this number is rounded up). |
classification |
Only needed if data is a matrix. Set to TRUE to grow a classification forest. |
Details
The effect importance measure (EIM) of interaction forests distinguishes quantitative and qualitative interaction effects (Peto, 1982).
This is a common distinction as these two types of interaction effects are interpreted in different ways (see below).
For both of these types, EIM values for each variable pair are obtained: the quantitative and qualitative EIM values.
Interaction forests target easily interpretable types of interaction effects. These can be communicated clearly using statements
of the following kind: "The strength of the positive (negative) effect of variable A on the outcome depends on the level of variable B"
for quantitative interactions, and "for observations with small values of variable B, the effect of variable A is positive (negative),
but for observations with large values of B, the effect of A is negative (positive)" for qualitative interactions.
In addition to calculating EIM values for variable pairs, importance values for the individual variables are calculated as well, the univariable
EIM values. These measure the variable importance as in the case of classical variable importance measures of random forests.
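As a toy illustration of these two interaction types (simulated data, not code from the package):

set.seed(1)
n <- 1000
A <- rnorm(n)
B <- rbinom(n, 1, 0.5)
# Quantitative interaction: the effect of A is positive for both levels of B,
# but stronger for B = 1:
y_quant <- A * (1 + 2 * B) + rnorm(n)
# Qualitative interaction: the effect of A is positive for B = 0,
# but negative for B = 1:
y_qual <- A * (1 - 2 * B) + rnorm(n)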
The effect importance mode can be set via the importance argument: "qualitative": calculate only qualitative EIM values; "quantitative": calculate only quantitative EIM values; "both" (the default): calculate qualitative and quantitative EIM values; "mainonly": calculate only univariable EIM values.
The top variable pairs with the largest quantitative and qualitative EIM values likely have quantitative and qualitative interactions, respectively, which have a considerable impact on prediction. The top variables with the largest univariable EIM values likely have a considerable impact on prediction. Note that it is currently not possible to test the EIM values for statistical significance using the interaction forests algorithm itself. However, the p-values shown in the plots obtained with plotEffects (which are obtained using bivariable models) can be adjusted for multiple testing using the Bonferroni procedure to obtain practical p-values. See the end of the 'Details' section of plotEffects for explanation and guidance.
If the number of variables is larger than 100, not all possible variable pairs are considered, but, using a screening procedure, the
5000 variable pairs with the strongest indications of interaction effects are pre-selected.
NOTE: To make interpretations, it is crucial to investigate (visually) the forms the interaction effects of variable pairs with large quantitative and qualitative EIM values take. This can be done using the plot function plot.interactionfor (first overview) and plotEffects.
NOTE ALSO: As described in Hornung & Boulesteix (2022), in the case of data with larger numbers of variables (larger than 100, but more seriously for high-dimensional data), the univariable EIM values can be biased. Therefore, it is strongly recommended to interpret the univariable EIM values with caution if the data are high-dimensional. If it is of interest to measure the univariable importance of the variables for high-dimensional data, an additional conventional random forest (e.g., using the ranger package) should be constructed, and the variable importance measure values of this random forest should be used for ranking the univariable effects.
For large data sets with many observations the calculation of the EIM values can become very costly when fully grown trees are used. Therefore, when calculating EIM values for data sets with more than 1000 observations we use the following maximum tree depths by default (argument: simplify.large.n = TRUE):
- if n ≤ 1000: use fully grown trees
- if 1000 < n ≤ 2000: use tree depth 10
- if 2000 < n ≤ 5000: use tree depth 7
- if n > 5000: use tree depth 5
Extensive analyses in Hornung & Boulesteix (2022) suggest that by restricting the tree depth in this way, the EIM values that would result when using fully grown trees are approximated well. However, the prediction performance suffers when using restricted trees. Therefore, we restrict the tree depth only when calculating the EIM values (if n > 1000), but construct a second interaction forest with unrestricted tree depth, which is then used for prediction purposes.
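Written out as a small helper (a hypothetical function for illustration; the package applies this schedule internally when simplify.large.n = TRUE), the depth schedule reads:

eim_max_depth <- function(n) {
  if (n <= 1000) 0         # 0 = unlimited depth, that is, fully grown trees
  else if (n <= 2000) 10
  else if (n <= 5000) 7
  else 5
}
eim_max_depth(3000)  # returns 7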
Value
Object of class interactionfor
with elements
predictions |
Predicted classes/values, based on out-of-bag samples (classification and regression only). |
num.trees |
Number of trees. |
num.independent.variables |
Number of independent variables. |
unique.death.times |
Unique death times (survival only). |
min.node.size |
Value of minimal node size used. |
npairs |
Number of variable pairs sampled for each split. |
eim.univ.sorted |
Univariable EIM values sorted in decreasing order. |
eim.univ |
Univariable EIM values. |
eim.qual.sorted |
Qualitative EIM values sorted in decreasing order. |
eim.qual |
Qualitative EIM values. |
eim.quant.sorted |
Quantitative EIM values sorted in decreasing order. |
eim.quant |
Quantitative EIM values. These values are labeled analogously to those in eim.quant.sorted. |
prediction.error |
Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for probability estimation the Brier score, for regression the mean squared error and for survival one minus Harrell's C-index. This is 'NA' for data sets with more than 100 covariate variables, because for such data sets we pre-select the 5000 variable pairs with strongest indications of interaction effects. This pre-selection cannot be taken into account in the out-of-bag error estimation, which is why the out-of-bag error estimates would be (much) too optimistic for data sets with more than 100 covariate variables. |
forest |
Saved forest (if write.forest is set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R. |
confusion.matrix |
Contingency table for classes and predictions based on out-of-bag samples (classification only). |
chf |
Estimated cumulative hazard function for each sample (survival only). |
survival |
Estimated survival function for each sample (survival only). |
splitrule |
Splitting rule. |
treetype |
Type of forest/tree: classification, regression or survival. |
r.squared |
R squared. Also called explained variance or coefficient of determination (regression only). Computed on out-of-bag data. |
call |
Function call. |
importance.mode |
Importance mode used. |
num.samples |
Number of samples. |
replace |
Sample with replacement. |
eim.quant.rawlists |
List containing the four vectors of un-adjusted 'raw' quantitative EIM values
and the four vectors of adjusted EIM values. These are usually not required by the user. |
promispairs |
List giving the indices of the variables in the pre-selected variable pairs. If the number of variables is at most 100, all variable pairs are considered. |
plotres |
List of objects needed by the plot functions. |
Author(s)
Roman Hornung, Marvin N. Wright
References
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Peto, R. (1982). Statistical aspects of cancer trials. In: K. E. Halnam (Ed.), Treatment of Cancer. Chapman & Hall: London.
Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, <doi:10.1023/A:1010933404324>.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, <doi:10.3414/ME00-01-0052>.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research 7:983-999.
See Also
predict.divfor, plot.interactionfor, plotEffects
Examples
## Not run:
## Load package:
library("diversityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Construct interaction forests and calculate EIM values:
# Binary outcome:
data(zoo)
modelcat <- interactionfor(dependent.variable.name = "type", data = zoo,
num.trees = 20)
# Metric outcome:
data(stock)
modelcont <- interactionfor(dependent.variable.name = "company10", data = stock,
num.trees = 20)
# Survival outcome:
library("survival")
mgus2$id <- NULL # 'mgus2' data set is contained in the 'survival' package
# categorical variables need to be of factor format - important!!
mgus2$sex <- factor(mgus2$sex)
mgus2$pstat <- factor(mgus2$pstat)
# Remove the second time variable 'ptime':
mgus2$ptime <- NULL
# Remove missing values:
mgus2 <- mgus2[complete.cases(mgus2),]
# Take subset to make the calculations less computationally
# expensive for the example (in actual applications, we would of course
# use the whole data set):
mgus2sub <- mgus2[sample(1:nrow(mgus2), size=500),]
# Apply 'interactionfor':
modelsurv <- interactionfor(formula = Surv(futime, death) ~ ., data=mgus2sub, num.trees=20)
# NOTE: num.trees = 20 (in the above) would be much too small for practical
# purposes. This small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000 if EIM values are calculated
# and num.trees = 2000 otherwise.
## Inspect the rankings of the variables and variable pairs with respect to
## the univariable, quantitative, and qualitative EIM values:
# Univariable EIM values:
modelcat$eim.univ.sorted
# Pairs with top quantitative EIM values:
modelcat$eim.quant.sorted[1:5]
# Pairs with top qualitative EIM values:
modelcat$eim.qual.sorted[1:5]
## Investigate visually the forms of the interaction effects of the variable pairs with
## largest quantitative and qualitative EIM values:
plot(modelcat)
plotEffects(modelcat, type="quant") # type="quant" is default.
plotEffects(modelcat, type="qual")
## Prediction:
# Separate 'zoo' data set randomly in training
# and test data:
data(zoo)
train.idx <- sample(nrow(zoo), 2/3 * nrow(zoo))
zoo.train <- zoo[train.idx, ]
zoo.test <- zoo[-train.idx, ]
# Construct interaction forest on training data:
# NOTE again: num.trees = 20 is specified too small for practical purposes.
modelcattrain <- interactionfor(dependent.variable.name = "type", data = zoo.train,
importance = "none", num.trees = 20)
# NOTE: Because we are only interested in prediction here, we do not
# calculate EIM values (by setting importance = "none"), because this
# speeds up calculations.
# Predict class values of the test data:
pred.zoo <- predict(modelcattrain, data = zoo.test)
# Compare predicted and true class values of the test data:
table(zoo.test$type, pred.zoo$predictions)
## End(Not run)
Construct a random forest prediction rule and calculate class-focused and discriminatory variable importance scores.
Description
Constructs a random forest for multi-class outcomes and calculates the class-focused variable importance measure (VIM) and the discriminatory VIM.
The class-focused VIM ranks the covariates with respect to their ability to distinguish individual outcome classes from all others, which can be important in multi-class prediction tasks (see "Details" below). The discriminatory VIM, in contrast, similarly to conventional VIMs, measures the overall influence of covariates on classification performance, regardless of their relevance to individual classes.
Usage
multifor(
formula = NULL,
data = NULL,
num.trees = ifelse(nrow(data) <= 5000, 5000, 1000),
importance = "both",
write.forest = TRUE,
probability = TRUE,
min.node.size = NULL,
max.depth = NULL,
replace = FALSE,
sample.fraction = ifelse(replace, 1, 0.7),
case.weights = NULL,
keep.inbag = FALSE,
inbag = NULL,
holdout = FALSE,
oob.error = TRUE,
num.threads = NULL,
verbose = TRUE,
seed = NULL,
dependent.variable.name = NULL,
mtry = NULL,
npervar = 5
)
Arguments
formula |
Object of class formula or character describing the model to fit. |
data |
Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL). |
num.trees |
Number of trees. Default is 5000 for datasets with a maximum of 5000 observations and 1000 for datasets with more than 5000 observations. |
importance |
Variable importance mode, one of the following: "both" (the default), "class-focused", "discriminatory", "none". If "class-focused", class-focused VIM values are computed, if "discriminatory", discriminatory VIM values are computed, and if "both", both class-focused and discriminatory VIM values are computed. See the 'Details' section below for details. |
write.forest |
Save the forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction intended. |
probability |
Grow a probability forest as in Malley et al. (2012). Using this option (default is TRUE), class probabilities are predicted rather than class labels. |
min.node.size |
Minimal node size. Default 5 for probability and 1 for classification. |
max.depth |
Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree). |
replace |
Sample with replacement. Default is |
sample.fraction |
Fraction of observations to sample. Default is 1 for sampling with replacement and 0.7 for sampling without replacement. This can be a vector of class-specific values. |
case.weights |
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees. |
keep.inbag |
Save how often observations are in-bag in each tree. |
inbag |
Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling. |
holdout |
Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error. |
oob.error |
Compute OOB prediction error. Default is TRUE. |
num.threads |
Number of threads. Default is number of CPUs available. |
verbose |
Show computation status and estimated runtime. |
seed |
Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed. |
dependent.variable.name |
Name of outcome variable, needed if no formula given. |
mtry |
Number of candidate variables to sample for each split. Default is the (rounded down) square root of the number of variables. |
npervar |
Number of splits to sample per candidate variable. Default is 5. |
Details
Covariates targeted by the class-focused VIM, that is, covariates that specifically help distinguish individual outcome classes from the others, are hereafter referred to as "class-related covariates". The primary motivation for identifying class-related covariates is frequently the interpretation of covariate effects.
Potential example applications include cancer subtyping (identifying biomarkers predictive of specific subtypes, e.g., luminal A, HER2-positive, rather than broad groups, e.g., hormone-driven vs. non-hormone-driven cancers), voting studies (covariates specifically associated with support for individual parties rather than general ideological orientation), and forensic science (detecting covariates specific to crime types like burglary or cybercrime rather than broadly violent vs. non-violent offenses).
In contrast to the class-focused VIM, conventional VIMs, such as the permutation VIM or the Gini importance, and the discriminatory VIM measure the overall influence of variables regardless of their class-relatedness. Therefore, these measures rank not only class-related variables high, but also variables that only discriminate well between groups of classes. This is problematic if only class-related variables are to be identified.
NOTE: To learn about the shapes of the influences of the variables with the largest class-focused VIM values on the multi-class outcome, it is crucial to apply the plot.multifor function to the multifor object. Two further related plot functions are plotMcl and plotVar.
NOTE ALSO: This methodology is based on work currently under peer review. A reference will be added once the corresponding paper is published.
The class-focused VIM requires that all variables are ordered. For this reason, before constructing the random forest, the categories of unordered categorical variables are ordered using an approach by Coppersmith et al. (1999), which ensures that close categories feature similar outcome class distributions. This approach is also used in the ranger R package when using the option respect.unordered.factors="order".
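As a simplified sketch of this ordering idea (not the exact internal algorithm), the categories of a nominal covariate can be ordered via the first principal component of the per-category class-proportion matrix, so that categories with similar outcome class distributions become adjacent:

order_categories <- function(x, y) {
  tab <- prop.table(table(x, y), margin = 1)  # class proportions per category
  pc1 <- prcomp(as.matrix(tab))$x[, 1]        # first principal component scores
  names(sort(pc1))                            # category labels in the new order
}
# Example with an artificially discretised covariate and a 3-class outcome:
x <- cut(iris$Sepal.Length, breaks = 4, labels = paste0("cat", 1:4))
order_categories(x, iris$Species)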
Value
Object of class multifor
with elements
predictions |
Predicted classes (for probability = FALSE) or predicted class probabilities (for probability = TRUE), based on out-of-bag samples. |
num.trees |
Number of trees. |
num.independent.variables |
Number of independent variables. |
min.node.size |
Value of minimal node size used. |
mtry |
Number of candidate variables sampled for each split. |
class_foc_vim |
Class-focused VIM values. Only computed for independent variables that feature at least as many unique values as the outcome variable has classes. For other variables, the entries in the vector class_foc_vim are NA. |
discr_vim |
Discriminatory VIM values for all independent variables. |
prediction.error |
Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples and for probability estimation the Brier score. |
confusion.matrix |
Contingency table for classes and predictions based on out-of-bag samples (classification only). |
forest |
Saved forest (if write.forest is set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R. |
treetype |
Type of forest/tree: classification or probability. |
call |
Function call. |
importance.mode |
Importance mode used. |
num.samples |
Number of samples. |
replace |
Sample with replacement. |
plotres |
List of objects needed by the plot functions. |
Author(s)
Roman Hornung, Marvin N. Wright
References
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, <doi:10.1023/A:1010933404324>.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, <doi:10.3414/ME00-01-0052>.
Coppersmith, D., Hong, S. J., Hosking, J. R. (1999). Partitioning nominal attributes in decision trees. Data Mining and Knowledge Discovery 3:197-217, <doi:10.1023/A:1009869804967>.
See Also
plot.multifor, plotMcl, plotVar
Examples
## Not run:
## Load package:
library("diversityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Load the "ctg" data set:
data(ctg)
## Construct a random forest:
model <- multifor(dependent.variable.name = "CLASS", data = ctg,
num.trees = 20)
# NOTE: num.trees = 20 (in the above) would be much too small for practical
# purposes. This small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 5000 for datasets with a maximum of
# 5000 observations and num.trees = 1000 for datasets larger than that.
## The out-of-bag estimated Brier score (note that by default
## 'probability = TRUE' is used in 'multifor'):
model$prediction.error
## Inspect the class-focused and the discriminatory VIM values:
model$class_foc_vim
# --> Note that there are no class-focused VIM values for some of the variables.
# These are those for which there are fewer unique values than outcome classes.
# See the "Details" section above.
model$discr_vim
## Inspect the 5 variables with the largest class-focused VIM values and the
## 5 variables with the largest discriminatory VIM values:
sort(model$class_foc_vim, decreasing = TRUE)[1:5]
sort(model$discr_vim, decreasing = TRUE)[1:5]
## Instead of passing the name of the outcome variable through the
## 'dependent.variable.name' argument as above, the formula interface can also
## be used. Below, we fit a random forest with only the first five variables
## from the 'ctg' data set:
model <- multifor(CLASS ~ b + e + LBE + LB + AC, data=ctg, num.trees = 20)
## As expected, the out-of-bag estimated prediction error is much larger
## for this model:
model$prediction.error
## NOTE: Visual exploration of the results of the class-focused VIM analysis
## is crucial.
## Therefore, in practice the next step would be to apply the
## 'plot.multifor' function to the object 'model'.
# plot(model)
## Prediction:
# Separate 'ctg' data set randomly in training
# and test data:
data(ctg)
train.idx <- sample(nrow(ctg), 2/3 * nrow(ctg))
ctg.train <- ctg[train.idx, ]
ctg.test <- ctg[-train.idx, ]
# Construct random forest on training data:
# NOTE again: num.trees = 20 is specified too small for practical purposes.
model_train <- multifor(dependent.variable.name = "CLASS", data = ctg.train,
importance = "none", probability = FALSE,
num.trees = 20)
# NOTE: Because we are only interested in prediction here, we do not
# calculate VIM values (by setting importance = "none"), because this
# speeds up calculations.
# NOTE also: Because we are interested in class label prediction here
# rather than class probability prediction we specified 'probability = FALSE'
# above.
# Predict class values of the test data:
pred.ctg <- predict(model_train, data = ctg.test)
# Compare predicted and true class values of the test data:
table(ctg.test$CLASS, pred.ctg$predictions)
## Repeat the analysis for class probability prediction
## (default 'probability = TRUE'):
model_train <- multifor(dependent.variable.name = "CLASS", data = ctg.train,
importance = "none", num.trees = 20)
# Predict class probabilities in the test data:
pred.ctg <- predict(model_train, data = ctg.test)
# The predictions are now a matrix of class probabilities:
head(pred.ctg$predictions)
# Obtain class predictions by choosing the classes with the maximum predicted
# probabilities (the function 'which.is.max' chooses one class randomly if
# there are several classes with maximum probability):
library("nnet")
classes <- levels(ctg.train$CLASS)
pred_classes <- factor(classes[apply(pred.ctg$predictions, 1, which.is.max)],
levels=classes)
# Compare predicted and true class values of the test data:
table(ctg.test$CLASS, pred_classes)
## End(Not run)
Plot method for interactionfor objects
Description
Plot function for interactionfor objects that makes it possible to obtain a first overview of the result of the interaction forest analysis. This function visualises the distributions of the EIM values and the estimated forms of the bivariable influences of the variable pairs with the largest quantitative and qualitative EIM values. Further visual exploration of the result of the interaction forest analysis should be conducted using plotEffects.
Usage
## S3 method for class 'interactionfor'
plot(x, numpairsquant = 2, numpairsqual = 2, ...)
Arguments
x |
Object of class interactionfor. |
numpairsquant |
The number of pairs with largest quantitative EIM values to plot. Default is 2. |
numpairsqual |
The number of pairs with largest qualitative EIM values to plot. Default is 2. |
... |
Further arguments passed to or from other methods. |
Details
For details on the plots of the estimated forms of the bivariable influences of the variable pairs, see plotEffects.
NOTE: The p-values shown in the plots are generally much too optimistic and MUST NOT be reported as the result of a statistical test for significance of interaction. To obtain adjusted p-values that would correspond to valid tests, it would be possible to multiply these p-values by the number of possible variable pairs, which would correspond to Bonferroni-adjusted p-values. See the 'Details' section of plotEffects for further explanation and guidance. Note, however, that these Bonferroni-adjusted p-values should be interpreted with caution because, stemming from bivariable models, these p-values do not take the multivariable nature of the data into account.
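As a sketch (a hypothetical helper, not a function of the package), the adjustment amounts to the following, where with p covariates there are choose(p, 2) possible variable pairs:

bonferroni_pair_pvalue <- function(pval, p_covariates) {
  pmin(1, pval * choose(p_covariates, 2))  # cap adjusted p-values at 1
}
bonferroni_pair_pvalue(0.0001, p_covariates = 20)  # 0.0001 * 190 = 0.019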
NOTE ALSO: As described in Hornung & Boulesteix (2022), in the case of data with larger numbers of variables (larger than 100, but more seriously for high-dimensional data), the univariable EIM values can be biased. Therefore, it is strongly recommended to interpret the univariable EIM values with caution if the data are high-dimensional. If it is of interest to measure the univariable importance of the variables for high-dimensional data, an additional conventional random forest (e.g., using the ranger package) should be constructed, and the variable importance measure values of this random forest should be used for ranking the univariable effects.
Value
A ggplot2 plot.
Author(s)
Roman Hornung
References
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
See Also
interactionfor, plotEffects
Examples
## Not run:
## Load package:
library("diversityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Construct interaction forest and calculate EIM values:
data(stock)
model <- interactionfor(dependent.variable.name = "company10", data = stock,
num.trees = 20)
# NOTE: num.trees = 20 (in the above) would be much too small for practical
# purposes. This small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000 if EIM values are calculated
# and num.trees = 2000 otherwise.
## When using the plot() function without further specifications,
## by default the estimated bivariable influences of the two pairs with largest quantitative
## and qualitative EIM values are shown:
plot(model)
# It is, however, also possible to change the numbers of
# pairs with largest quantitative and qualitative EIM values
# to be shown:
plot(model, numpairsquant = 4, numpairsqual = 3)
## End(Not run)
Plot method for multifor objects
Description
Plot function for multifor objects that makes it possible to obtain a first overview of the result of the class-focused VIM analysis. This function visualises the distribution of the class-focused VIM values together with that of the corresponding discriminatory VIM values and the estimated dependency structures of the multi-class outcome on the variables with the largest class-focused VIM values. These estimated dependency structures are visualised using kernel density estimate-based plots and/or boxplots.
Usage
## S3 method for class 'multifor'
plot(x, plot_type = c("both", "density", "boxplot")[1], num_best = 5, ...)
Arguments
x |
Object of class multifor. |
plot_type |
Plot type, one of the following: "both" (the default), "density", "boxplot". If "density", kernel density estimate-based plots are produced, if "boxplot", boxplot plots are produced, and if "both", both kernel density estimate-based plots and boxplot plots are produced. See the 'Details' section of plotMcl for more information. |
num_best |
The number of variables with largest class-focused VIM values to plot. Default is 5. |
... |
Further arguments passed to or from other methods. |
Details
In the plot showing the distribution of the class-focused VIM values along with that of the discriminatory VIM values, the discriminatory VIM values are normalized to make them comparable to the class-focused VIM values. This is achieved by dividing the discriminatory VIM values by their mean and multiplying them by the mean of the class-focused VIM values.
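Expressed as a short sketch using the vectors returned by multifor (that NA entries are handled exactly this way internally is an assumption):

# 'model' is a multifor object; class_foc_vim can contain NA for variables
# with fewer unique values than outcome classes, hence na.rm = TRUE:
discr_scaled <- model$discr_vim / mean(model$discr_vim, na.rm = TRUE) *
  mean(model$class_foc_vim, na.rm = TRUE)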
For details on the plots of the estimated dependency structures of the multi-class outcome on the variables, see plotMcl. The latter function makes it possible to visualise these estimated dependency structures for arbitrary variables in the data.
Value
A ggplot2 plot.
Author(s)
Roman Hornung
References
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
See Also
multifor, plotMcl
Examples
## Not run:
## Load package:
library("diversityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Construct random forest and calculate class-focused and discriminatory VIM values:
data(hars)
model <- multifor(dependent.variable.name = "Activity", data = hars,
num.trees = 100, probability=TRUE)
# NOTE: num.trees = 100 (in the above) would likely be too small for practical
# purposes. This small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 5000 for datasets with a maximum of
# 5000 observations and num.trees = 1000 for datasets larger than that.
## By default the estimated class-specific distributions of the num_best=5
## variables with the largest class-focused VIM values are plotted:
plot(model)
## Consider only the 2 variables with the largest class-focused VIM values:
plot(model, num_best = 2)
## Show only the density plots or only the boxplots:
plot(model, plot_type = "density", num_best = 2)
plot(model, plot_type = "boxplot", num_best = 2)
## Show only the plot of the distributions of the class-focused and
## discriminatory VIM values:
plot(model, num_best = 0)
## End(Not run)
Interaction forest plots: exploring interaction forest results through visualisation
Description
This function makes it possible to visualise the (estimated) bivariable influences of pairs of variables (with large quantitative and qualitative EIM values) on the outcome. This step is crucial, because to interpret interaction effects between variable pairs with large quantitative and qualitative EIM values, it is necessary to learn about the forms these interaction effects take.
Usage
plotEffects(
intobj,
type = "quant",
numpairs = 5,
indpairs = NULL,
pairs = NULL,
allwith = NULL,
pvalues = TRUE,
twoplots = TRUE,
addtitles = TRUE,
plotit = TRUE
)
Arguments
intobj |
Object of class interactionfor. |
type |
This can be either "quant" or "qual" and determines whether the plotted pairs are sorted according to either the quantitative or qualitative EIM values in decreasing order. Default is "quant". |
numpairs |
The number of pairs to plot (default: 5). This is overwritten by indpairs and pairs. |
indpairs |
Optional. The indices of the pairs in the sorted lists of quantitative (type = "quant") or qualitative (type = "qual") EIM values to plot. This is overwritten by pairs. |
pairs |
This can be used to specify the pairs to plot. It is an optional list of character string vectors, where each of these vectors has length two. Each list element corresponds to one
pair, where the first character string gives the name of the first member of the respective pair to plot and the second character string gives the name of the second member.
This argument overwrites numpairs and indpairs. |
allwith |
This is an optional character string that can be set to the name of one of the variables. If provided, only variable pairs that
feature the variable specified by this argument are considered. |
pvalues |
Set to TRUE (the default) to add p-values from tests for interaction effect to the plots. See the 'Details' section below for how these are obtained. |
twoplots |
Set to TRUE (the default) to display two plots per page; if FALSE, each plot is shown on a separate page. |
addtitles |
Set to TRUE (the default) to add automatically generated titles to the plots. |
plotit |
This states whether the plots are actually plotted or merely returned invisibly as a list of ggplot2 objects. Default is TRUE. |
Details
For each considered pair, the bivariable influence of both pair members on the outcome,
estimated using a flexible two-dimensional function, is shown. Such visualisations make it possible to learn
about the forms of the interaction effects between variable pairs with large EIM values. Moreover,
these visualisations reveal (pathological) cases in which variable pairs do not show
indications of interaction effects despite featuring large EIM values.
For binary outcomes the probabilities for the second class are estimated, for categorical outcomes with more than two classes
the probabilities for the largest class (i.e., the class with the most observations) are estimated (using the function plotPair, a different
class can be selected instead), for metric outcomes the means of the outcome are estimated, and for survival outcomes
the log hazard ratios compared to the median effect are estimated.
The kinds of estimates shown also differ according to whether both pair members are metric, only one of the two
members is metric and the other one categorical, or both pair members are categorical (a minimal sketch of the first case is given after this list):
- If both pair members are metric and the outcome is categorical or metric, we use two-dimensional LOESS regression, where in the case of categorical outcomes, to obtain probability estimates for the first class (or largest class for multi-class outcomes), we use the value '1' for the first class (largest class for multi-class outcomes) and the value '0' for the second class (all other classes for multi-class outcomes).
- If both pair members are metric and the outcome is survival, we use a Cox proportional hazards additive model with a two-dimensional LOESS smooth (gamcox function from the 'MapGAM' package (version 1.2-5)); in the rare cases in which the latter fails, we use classical Cox regression with an interaction term between the two covariates.
- If one pair member is metric and the other one categorical and the outcome is categorical or metric, we use LOESS regression between the outcome (coded as '0' and '1' in the case of categorical outcomes as described above) and the values of the metric variable separately for each category of the categorical variable. In the rare cases in which the LOESS regression fails we use classical linear regression.
- If one pair member is metric and the other one categorical and the outcome is survival, we use Cox regression with a linear tail-restricted cubic spline with four knots (univariable LOESS regression for survival outcomes does not seem to be available yet in R) separately for each category of the categorical variable. In cases in which the fitting of this spline regression fails we use classical Cox regression.
- If both pair members are categorical and the outcome is categorical or metric, we simply calculate the mean of the outcome (coded as '0' and '1' in the case of categorical outcomes as described above) for each possible combination of the categories of the two variables.
- If both pair members are categorical and the outcome is survival, we use classical Cox regression with an interaction term between the two variables (there is no need for any flexible modelling in this setting, because the Cox model with two categorical variables plus interaction term is saturated).
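For the first case above, the smoothing step can be sketched as follows (illustrative code only, not the package's internal implementation; the data frame 'df' with metric covariates 'x1' and 'x2' and a binary factor outcome 'y' is an assumption):
## Code the two outcome classes as '1' (first class) and '0' (second class):
df$y01 <- as.numeric(df$y == levels(df$y)[1])
## Two-dimensional LOESS regression of the class indicator on both covariates:
fit <- loess(y01 ~ x1 + x2, data = df)
## The fitted values estimate the probability of the first class:
probs <- predict(fit, newdata = df)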
As described above (function argument: pvalues), there is an option to add p-values from tests for interaction effect
to the plots. If at least one of the variables in the considered variable pair is categorical and features more
than two categories, there is more than one interaction term in the regression approaches used for testing,
because the categorical variables are dummy-coded. Therefore, in these cases we obtain a p-value for each interaction term.
To obtain a single p-value for the test for interaction, we adjust these multiple p-values using the Holm-Bonferroni
procedure and take the minimum of the adjusted p-values.
NOTE: These p-values are generally much too optimistic, in particular for small data sets and
large numbers of variables. The reason for this overoptimism is that these p-values are not adjusted
for the fact that we already used the data to find the variable pairs with strongest indications of
interaction effects. This is similar to a multiple testing problem.
Therefore, these p-values should only be seen as a rough guide to be interpreted very cautiously
and MUST NOT be reported as the results of a statistical test for significance
of interaction. To obtain adjusted p-values that would correspond to
valid tests, it would be possible to multiply these p-values by the number of possible pairs,
which would correspond to Bonferroni-adjusted p-values. For example, assume that we have 30
covariates. In that case the number of possible pairs would be 'choose(30, 2) = 435',
which is why we would need to multiply each p-value by 435
to obtain an adjusted p-value (or keep the original p-values and divide the significance
level 0.05 by 435). Note, however, that Bonferroni-adjusted p-values deliver quite conservative results,
that is, weaker effects might not be detected using these p-values, while, however, effects for
which these p-values are small (< 0.05
) are most likely relevant.
Note further that these Bonferroni-adjusted p-values should be interpreted
with caution because, stemming from bivariable models, these p-values do not take the
multivariable nature of the data into account.
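This adjustment can be sketched as follows (illustrative code; 'pvals' is an assumed vector of the p-values shown in the plots and 'p' the assumed number of covariates):
n_pairs <- choose(p, 2)                 # e.g., choose(30, 2) = 435
pvals_bonf <- pmin(pvals * n_pairs, 1)  # Bonferroni adjustment, capped at 1
## Equivalently, keep the raw p-values and use 0.05 / n_pairs as the
## significance level.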
Value
A list of ggplot2 plots returned invisibly.
Author(s)
Roman Hornung
References
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
See Also
Examples
## Not run:
## Load package:
library("diversityForest")
## Set seed to make results reproducible:
set.seed(1234)
## Construct interaction forest and calculate EIM values:
data(stock)
model <- interactionfor(dependent.variable.name = "company10", data = stock,
num.trees = 20)
# NOTE: num.trees = 20 (in the above) would be much too small for practical
# purposes. This small number of trees was simply used to keep the
# runtime of the example short.
# The default number of trees is num.trees = 20000 if EIM values are calculated
# and num.trees = 2000 otherwise.
## Obtain a first overview by applying the plot() function for
## interactionfor objects:
plot(model)
## Several possible application cases of the plotEffects() function:
# Visualise the estimated bivariable influences of the five variable pairs with the
# largest quantitative EIM values:
plotEffects(model) # type="quant" is default.
# Visualise the estimated bivariable influences of the five pairs with the
# largest qualitative EIM values:
plotEffects(model, type="qual")
# Visualise the estimated bivariable influences of all (eight) pairs that involve
# the variable "company7" sorted in decreasing order according to the
# qualitative EIM values:
plotEffects(model, allwith="company7", type="qual", numpairs=8)
# Visualise the estimated bivariable influences of the pairs with third and fifth
# largest qualitative EIM values:
plotEffects(model, type="qual", indpairs=c(3,5))
# Visualise the estimated bivariable influences of the pairs ("company3", "company5") and
# ("company1", "company9"):
plotEffects(model, pairs=list(c("company3", "company5"), c("company1", "company9")))
## Saving of plots generated with the plotEffects() function (e.g., for use in publications):
# Apply plotEffects() to obtain plots for the five variable pairs
# with the largest qualitative EIM values and store these plots in
# an object 'ps':
ps <- plotEffects(model, type="qual", pvalues=FALSE, twoplots=FALSE, addtitles=FALSE, plotit=FALSE)
# pvalues = FALSE states that no p-values should be shown in the plots,
# because these might not be desired in plots meant for publication.
# twoplots = FALSE ensures that we get one plot for each page instead of two plots per page.
# addtitles = FALSE removes the automatically generated titles, because these are likely
# not desired in publications.
# plotit = FALSE ensures that the plots are not displayed, but only returned (invisibly)
# by plotEffects().
# Save the plot with second largest qualitative EIM value:
p1 <- ps[[2]]
# Add title:
library("ggpubr")
p1 <- annotate_figure(p1, top = text_grob("My descriptive plot title 1", face = "bold", size = 14))
p1
# Save as PDF:
# library("ggplot2")
# ggsave(file="mypathtofolder/FigureXY1.pdf", width=14, height=6)
# Save the plot with fifth largest qualitative EIM value:
p2 <- ps[[5]]
# Add title:
p2 <- annotate_figure(p2, top = text_grob("My descriptive plot title 2", face = "bold", size = 14))
p2
# Save as PDF:
# ggsave(file="mypathtofolder/FigureXY2.pdf", width=14, height=6)
# Combine both of the above plots:
p <- ggarrange(p1, p2, nrow = 2)
p
# Save the combined plot:
# ggsave(file="mypathtofolder/FigureXYcombined.pdf", width=14, height=11)
# NOTE: Using plotEffects() it is not possible to change the plots
# themselves (by e.g., increasing the label sizes or changing the
# axes ranges). However, the function plotPair() can be used to change
# the plots themselves.
## End(Not run)
Plots of the (estimated) within-class distributions of variables
Description
This function can be used to visualise the (estimated) distributions of one or several variables for each of the classes of the outcome. This makes it possible to study how exactly variables of interest are associated with the outcome, which is crucial for interpretation. Two types of visualisations are available: density plots and boxplots. See the 'Details' section below for further explanation.
Usage
plotMcl(
data,
yvarname,
varnames,
plot_type = c("both", "density", "boxplot")[1],
addtitles = TRUE,
plotit = TRUE
)
Arguments
data |
Data frame containing the variables. |
yvarname |
Name of outcome variable. |
varnames |
Names of the variables for which plots should be created. |
plot_type |
Plot type, one of the following: "both" (the default), "density", "boxplot". If "density", kernel density estimate-based plots are produced, if "boxplot", boxplots are produced, and if "both", both kinds of plots are produced. |
addtitles |
Set to TRUE (the default) to add titles stating the names of the respective variables to the plots. |
plotit |
This states whether the plots are actually plotted or merely returned invisibly. Default is TRUE. |
Details
For the "density"
plots, kernel density estimates (obtained using the
density()
function from base R) of the within-class distributions are
plotted in the same plot using different colors and, depending on the number
of classes, different line types. To account for the different number of
observations per class, each density is multiplied by the proportion of
observations from that class. The resulting scaled densities can be interpreted
in terms of the local density of the observations from each class relative to
those from the other classes. For example, if a scaled density has the largest
value in a particular region, this can be interpreted as the respective class
being the most frequent in that region. Another example: If the scaled density
of class "A" is twice as large as the scaled density of class "B" in a particular
region, this can be interpreted to mean that there are twice as many observations
of class "A" as of class "B" in that region.
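This scaling can be sketched as follows (illustrative code only, not the package's internals; 'x' is an assumed numeric variable and 'y' the assumed factor-valued outcome):
dens_scaled <- lapply(levels(y), function(cl) {
  d <- density(x[y == cl])      # within-class kernel density estimate
  d$y <- d$y * mean(y == cl)    # multiply by the proportion of this class
  d
})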
In the "density"
plots, only classes represented by at least two
observations are considered. If the number of classes is greater than 7,
the different classes are distinguished using both colors and line styles.
To indicate the absolute numbers of observations in the different regions,
the locations of the observations from the different classes are visualized
using a rug plot on the x-axis, using the same colors and line types as for
the density plots. If the number of observations is greater than 1,000, a
random subset of 1,000 observations is shown in the rug plot instead of all
observations for visual clarity.
The "boxplot"
plots show the (estimated) within-class distributions
side by side using boxplots. All classes are considered, even those represented
by only a single observation. For the plot_type="both"
option, which
displays both "density"
and "boxplot"
plots, the boxplots are
displayed using the same colors and (if applicable) line styles as the kernel
density estimates, for clarity. Boxplots of classes for which no kernel density
estimates were obtained (i.e., those of the classes represented by single
observations) are shown in grey.
Note that plots are only generated for those variables in varnames
that have at least as many unique values as there are outcome classes. For
categorical variables, the category labels are printed on the x- or y-axis
of the "density"
or "boxplot"
plots, respectively. The rug plots
of the "density"
plots are produced only for numeric variables.
Value
A list returned invisibly. The list has length equal to the number of elements in varnames.
Each element corresponds to one variable and contains a list of ggplot2
plots structured as in the output of plotVar.
Author(s)
Roman Hornung
References
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
See Also
Examples
## Not run:
## Load package:
library("diversityForest")
## Plot "density" and "boxplot" plots (default: plot_type = "both") for the
## first three variables in the "hars" dataset:
data(hars)
plotMcl(data = hars, yvarname = "Activity", varnames = c("tBodyAcc.mean...X",
"tBodyAcc.mean...Y",
"tBodyAcc.mean...Z"))
## Plot only the "density" plots for these variables:
plotMcl(data = hars, yvarname = "Activity",
varnames = c("tBodyAcc.mean...X", "tBodyAcc.mean...Y",
"tBodyAcc.mean...Z"), plot_type = "density")
## Plot the "density" plots for these variables, but without titles of the
## plots:
plotMcl(data = hars, yvarname = "Activity", varnames =
c("tBodyAcc.mean...X", "tBodyAcc.mean...Y", "tBodyAcc.mean...Z"),
plot_type = "density", addtitles = FALSE)
## Make density plots for these variables, but only save them in a list "ps"
## without plotting them ("plotit = FALSE"):
ps <- plotMcl(data = hars, yvarname = "Activity", varnames =
c("tBodyAcc.mean...X", "tBodyAcc.mean...Y",
"tBodyAcc.mean...Z"), plot_type = "density",
addtitles = FALSE, plotit = FALSE)
## The plots can be manipulated later by using ggplot2 functionalities:
library("ggplot2")
p1 <- ps[[1]]$dens_pl + ggtitle("First variable in the dataset") +
labs(x="Variable values", y="my scaled density")
p2 <- ps[[3]]$dens_pl + ggtitle("Third variable in the dataset") +
labs(x="Variable values", y="my scaled density")
## Combine both of the above plots:
library("ggpubr")
p <- ggarrange(p1, p2, ncol = 2)
p
## # Save as PDF:
## ggsave(file="mypathtofolder/FigureXY1.pdf", width=14, height=6)
## End(Not run)
Plot of the (estimated) simultaneous influence of two variables
Description
This function can be used to visualise the (estimated) bivariable influence of a single specific pair of variables on the outcome. The estimation
and plotting are performed in the same way as in plotEffects. However, plotPair does not require an interactionfor object
and can thus also be used without a constructed interaction forest.
Usage
plotPair(
pair,
yvarname,
statusvarname = NULL,
data,
levelsorder1 = NULL,
levelsorder2 = NULL,
categprob = NULL,
pvalue = TRUE,
returnseparate = FALSE,
intobj = NULL
)
Arguments
pair |
Character string vector of length two, where the first character string gives the name of the first member of the respective pair to plot and the second character string gives the name of the second member.
Note that the order of the two pair members in |
yvarname |
Name of outcome variable. |
statusvarname |
Name of status variable, only applicable to survival data. |
data |
Data frame containing the variables. |
levelsorder1 |
Optional. The order that the categories of the first variable should have in the plot (if it is categorical). Character string vector, where the i-th entry contains the name of the category that should take the i-th place in the ordering of the categories of the first variable. |
levelsorder2 |
Optional. The order that the categories of the second variable should have in the plot (if it is categorical). Character string vector specified in a way
analogous to levelsorder1. |
categprob |
Optional. Only relevant for categorical outcomes with more than two classes.
Name of the class for which probabilities should be estimated. As described in the 'Details' section of plotEffects, by default the probabilities for the largest class are estimated. |
pvalue |
Set to TRUE (the default) to add a p-value from the test for interaction effect to the plot. See the 'Details' section of plotEffects for how this is obtained. |
returnseparate |
Set to TRUE to return the two ggplot2 plots separately rather than combined into a single plot; this makes it possible to manipulate each plot individually. Default is FALSE. |
intobj |
Optional. Object of class interactionfor. |
Details
See the 'Details' section of plotEffects.
Value
A ggplot2 plot.
Author(s)
Roman Hornung
References
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
See Also
plotEffects
, plot.interactionfor
Examples
## Not run:
## Load package:
library("diversityForest")
## Visualise the estimated bivariable influence of 'toothed' and 'feathers' on
## the probability of type="mammal":
data(zoo)
plotPair(pair = c("toothed", "feathers"), yvarname="type", data = zoo)
## Visualise the estimated bivariable influence of 'creat' and 'hgb' on
## survival (more precisely, on the log hazards ratio compared to the
## median effect):
library("survival")
mgus2compl <- mgus2[complete.cases(mgus2),]
plotPair(pair=c("creat", "hgb"), yvarname="futime", statusvarname = "death", data=mgus2compl)
# Problem: The outliers in the left plot make it difficult to see what is going
# on in the region with creat values smaller than about two even though the
# majority of values lie there.
# --> Solution: We re-run the above line setting returnseparate = TRUE, because
# this allows us to obtain the two ggplot plots separately; these can then be
# manipulated to change the x-axis range in order to remove the outliers:
ps <- plotPair(pair=c("creat", "hgb"), yvarname="futime", statusvarname = "death",
data=mgus2compl, returnseparate = TRUE)
# Change the x-axis range:
library("ggplot2")
ps[[1]] + xlim(c(0.5,2))
# Save the plot:
# ggsave(file="mypathtofolder/FigureXY1.pdf", width=7, height=6)
# We can, for example, also change the label sizes of the second plot:
# With original label sizes:
ps[[2]]
# With larger label sizes:
ps[[2]] + theme(axis.title=element_text(size=15))
# Save the plot:
# library("ggplot2")
# ggsave(file="mypathtofolder/FigureXY2.pdf", width=7, height=6)
## End(Not run)
Plot of the (estimated) dependency structure of a variable x on a categorical variable y
Description
This function can be used to visualise the (estimated) distributions of a variable x
for each of the categories of a categorical variable y.
This makes it possible to study the dependency structure of y on x.
Two types of visualisations are available: density plots and boxplots.
Usage
plotVar(
x,
y,
plot_type = c("both", "density", "boxplot")[1],
x_label = "",
y_label = "",
plot_title = "",
plotit = TRUE
)
Arguments
x |
Metric variable or ordered categorical variable that has at least as many unique values as y has categories. |
y |
Factor variable with at least three categories. |
plot_type |
Plot type, one of the following: "both" (the default), "density", "boxplot". If "density", a kernel density estimate-based plot is produced, if "boxplot", a boxplot is produced, and if "both", both plots are produced. |
x_label |
Optional. The label of the x-axis. |
y_label |
Optional. The label (heading) of the legend that differentiates the categories of |
plot_title |
Optional. The title of the plot. |
plotit |
This states whether the plots are actually plotted or merely returned invisibly. Default is TRUE. |
Details
See the 'Details' section of plotMcl.
Value
A list returned invisibly containing:
- only the element dens_pl if plot_type = "density";
- only the element boxplot_pl if plot_type = "boxplot";
- the elements dens_pl, boxplot_pl, and combined_pl if plot_type = "both".
All returned plots are ggplot2 objects, with combined_pl being a patchwork object.
Author(s)
Roman Hornung
References
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
See Also
Examples
## Not run:
## Load package:
library("diversityForest")
## Load the "ctg" data set:
data(ctg)
## Set seed to make results reproducible (this is necessary because
## the rug plot produced by 'plotVar' does not show all observations, but
## only a random subset of 1000 observations):
set.seed(1234)
## Using a "density" plot and a "boxplot", visualise the (estimated)
## distributions of the variable "Mean" for each of the categories of the
## variable "Tendency":
plotVar(x = ctg$Mean, y = ctg$Tendency)
## Re-create this plot with labels:
plotVar(x = ctg$Mean, y = ctg$Tendency, x_label = "Mean of the histogram ('Mean')",
y_label = "Histogram tendency ('Tendency')",
plot_title = "Relationship between 'Mean' and 'Tendency'")
## Re-create this plot, but only show the "density" plot:
plotVar(x = ctg$Mean, y = ctg$Tendency, plot_type = "density",
x_label = "Mean of the histogram ('Mean')",
y_label = "Histogram tendency ('Tendency')",
plot_title = "Relationship between 'Mean' and 'Tendency'")
## Use ggplot2 and RColorBrewer functionalities to change the line colors and
## the labels of the categories of "Tendency":
library("ggplot2")
library("RColorBrewer")
p <- plotVar(x = ctg$Mean, y = ctg$Tendency, plot_type = "density",
x_label = "Mean of the histogram ('Mean')",
y_label = "Histogram tendency ('Tendency')",
plot_title = "Relationship between 'Mean' and 'Tendency'",
plotit = FALSE)$dens_pl +
scale_color_manual(values = brewer.pal(n = 3, name = "Set2"),
labels = c("left asymmetric", "symmetric",
"right asymmetric")) +
scale_linetype_manual(values = rep(1, 3),
labels = c("left asymmetric", "symmetric",
"right asymmetric"))
p
## # Save as PDF:
## ggsave(file="mypathtofolder/FigureXY1.pdf", width=10, height=7)
## Further customizations:
# Create plot without plotting it:
plotobj <- plotVar(x = ctg$Mean, y = ctg$Tendency,
x_label = "Mean of the histogram ('Mean')",
y_label = "Histogram tendency ('Tendency')",
plotit = FALSE)
# Customize the density plot:
dens_pl <- plotobj$dens_pl + theme(legend.position = "inside",
legend.position.inside = c(0.25, 0.9),
legend.title = element_text(size = 16),
legend.text = element_text(size = 12),
axis.title = element_text(size=16),
axis.text = element_text(size=12)) +
ylab("(Scaled) density")
# Customize the boxplot:
boxplot_pl <- plotobj$boxplot_pl +
theme(axis.text.x = element_text(color = "transparent"),
axis.ticks.x = element_line(color = "transparent"),
axis.title = element_text(size=16),
axis.text = element_text(size=12))
# Create a title with increased font size:
library("grid")
title_grob <- textGrob(
"Title of the combined plot",
gp = gpar(fontsize = 18)
)
# Arrange plots with title:
library("gridExtra")
p <- arrangeGrob(
dens_pl, boxplot_pl,
top = title_grob,
nrow = 1
)
p
## # Save as PDF:
## ggsave(file="mypathtofolder/FigureXY2.pdf", p, width=16, height=7)
## End(Not run)
Diversity Forest prediction
Description
Prediction with new data and a saved forest from divfor.
Usage
## S3 method for class 'divfor'
predict(
object,
data = NULL,
predict.all = FALSE,
num.trees = object$num.trees,
type = "response",
se.method = "infjack",
quantiles = c(0.1, 0.5, 0.9),
seed = NULL,
num.threads = NULL,
verbose = TRUE,
...
)
Arguments
object |
divfor object. |
data |
New test data of class data.frame or gwaa.data (GenABEL). |
predict.all |
Return individual predictions for each tree instead of aggregated predictions for all trees. Return a matrix (sample x tree) for classification and regression, a 3d array for probability estimation (sample x class x tree) and survival (sample x time x tree). |
num.trees |
Number of trees used for prediction. The first num.trees trees in the forest are used. |
type |
Type of prediction. One of 'response', 'se', 'terminalNodes', 'quantiles' with default 'response'. See below for details. |
se.method |
Method to compute standard errors. One of 'jack', 'infjack' with default 'infjack'. Only applicable if type = 'se'. See below for details. |
quantiles |
Vector of quantiles for quantile prediction. Set type = 'quantiles' to use. |
seed |
Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed. |
num.threads |
Number of threads. Default is number of CPUs available. |
verbose |
Verbose output on or off. |
... |
further arguments passed to or from other methods. |
Details
This package is a fork of the R package 'ranger' that implements random forests using an
efficient C++ implementation. More precisely, 'diversityForest' was written by modifying
the code of 'ranger', version 0.11.0. Therefore, details on further functionalities
of the code that are not presented in the help pages of 'diversityForest' are found
in the help pages of 'ranger' (version 0.11.0). The code in the example sections of divfor
and tunedivfor can be used as a template for all common application scenarios with respect to classification,
regression and survival prediction using univariable, binary splitting. Some function
arguments adopted from the 'ranger' package may not be usable with diversity forests
(for the current package version).
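A minimal usage sketch (the names 'model' and 'newdata' are assumptions; 'model' is a fitted divfor object and 'newdata' a data.frame with the same covariates as the training data):
pred <- predict(model, data = newdata)
pred$predictions   # predicted classes/values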
Value
Object of class divfor.prediction
with elements
predictions | Predicted classes/values (only for classification and regression) |
unique.death.times | Unique death times (only for survival). |
chf | Estimated cumulative hazard function for each sample (only for survival). |
survival | Estimated survival function for each sample (only for survival). |
num.trees | Number of trees. |
num.independent.variables | Number of independent variables. |
treetype | Type of forest/tree. Classification, regression or survival. |
num.samples | Number of samples. |
Author(s)
Marvin N. Wright
References
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Wager, S., Hastie T., & Efron, B. (2014). Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research 15:1625-1651.
Meinshausen (2006). Quantile Regression Forests. Journal of Machine Learning Research 7:983-999.
See Also
Interaction Forest prediction
Description
Prediction with new data and a saved interaction forest from interactionfor.
Usage
## S3 method for class 'interactionfor'
predict(
object,
data = NULL,
predict.all = FALSE,
num.trees = object$num.trees,
type = "response",
se.method = "infjack",
quantiles = c(0.1, 0.5, 0.9),
seed = NULL,
num.threads = NULL,
verbose = TRUE,
...
)
Arguments
object |
interactionfor object. |
data |
New test data of class data.frame or gwaa.data (GenABEL). |
predict.all |
Return individual predictions for each tree instead of aggregated predictions for all trees. Return a matrix (sample x tree) for classification and regression, a 3d array for probability estimation (sample x class x tree) and survival (sample x time x tree). |
num.trees |
Number of trees used for prediction. The first num.trees trees in the forest are used. |
type |
Type of prediction. One of 'response', 'se', 'terminalNodes', 'quantiles' with default 'response'. See below for details. |
se.method |
Method to compute standard errors. One of 'jack', 'infjack' with default 'infjack'. Only applicable if type = 'se'. See below for details. |
quantiles |
Vector of quantiles for quantile prediction. Set type = 'quantiles' to use. |
seed |
Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed. |
num.threads |
Number of threads. Default is number of CPUs available. |
verbose |
Verbose output on or off. |
... |
further arguments passed to or from other methods. |
Details
Note that this package is a fork of the R package 'ranger' that implements random forests using an efficient C++ implementation. The documentation is in large parts taken from 'ranger', where some parts of the documentation may not apply to (the current version of) the 'diversityForest' package. Details on further functionalities of the code that are not presented in the help pages of 'diversityForest' are found in the help pages of 'ranger' (version 0.11.0).
Value
Object of class interaction.prediction
with elements
predictions | Predicted classes/values (only for classification and regression) |
unique.death.times | Unique death times (only for survival). |
chf | Estimated cumulative hazard function for each sample (only for survival). |
survival | Estimated survival function for each sample (only for survival). |
num.trees | Number of trees. |
num.independent.variables | Number of independent variables. |
treetype | Type of forest/tree. Classification, regression or survival. |
num.samples | Number of samples. |
Author(s)
Marvin N. Wright, Roman Hornung
References
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Wager, S., Hastie T., & Efron, B. (2014). Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research 15:1625-1651.
Meinshausen (2006). Quantile Regression Forests. Journal of Machine Learning Research 7:983-999.
See Also
Random forest prediction using a saved forest from multifor
Description
Prediction with new data and a saved forest from multifor.
Usage
## S3 method for class 'multifor'
predict(
object,
data = NULL,
predict.all = FALSE,
num.trees = object$num.trees,
type = "response",
seed = NULL,
num.threads = NULL,
verbose = TRUE,
...
)
Arguments
object |
multifor object. |
data |
New test data of class data.frame. |
predict.all |
Return individual predictions for each tree instead of aggregated predictions for all trees. Return a matrix (sample x tree) for classification, a 3d array for probability estimation (sample x class x tree). |
num.trees |
Number of trees used for prediction. The first num.trees trees in the forest are used. |
type |
Type of prediction. If "response" (default), the predicted classes (classification) or predicted probabilities (probability estimation) are returned. If "terminalNodes", the IDs of the terminal node in each tree for each observation in the given dataset are returned. |
seed |
Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed. |
num.threads |
Number of threads. Default is number of CPUs available. |
verbose |
Verbose output on or off. |
... |
further arguments passed to or from other methods. |
Details
This package is a fork of the R package 'ranger' that implements random forests using an efficient C++ implementation. More precisely, 'diversityForest' was written by modifying the code of 'ranger', version 0.11.0.
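A minimal usage sketch (the names 'model' and 'newdata' are assumptions; 'model' is a fitted multifor object and 'newdata' a data.frame with the same covariates as the training data):
pred <- predict(model, data = newdata)
pred$predictions    # predicted classes or probabilities
## IDs of the terminal nodes per tree:
nodes <- predict(model, data = newdata, type = "terminalNodes")$predictions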
Value
Object of class multifor.prediction
with elements
predictions | Predicted classes/values (only for classification and regression) |
num.trees | Number of trees. |
num.independent.variables | Number of independent variables. |
num.samples | Number of samples. |
treetype | Type of forest/tree. Classification or probability. |
Author(s)
Marvin N. Wright
References
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Wager, S., Hastie T., & Efron, B. (2014). Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research 15:1625-1651.
Meinshausen (2006). Quantile Regression Forests. Journal of Machine Learning Research 7:983-999.
See Also
Diversity Forest predictions
Description
Diversity Forest predictions
Usage
## S3 method for class 'divfor'
predictions(x, ...)
Arguments
x |
divfor object. |
... |
Further arguments passed to or from other methods. |
Value
Predictions: Classes for Classification forests, Numerical values for Regression forests and the estimated survival functions for all individuals for Survival forests.
Author(s)
Marvin N. Wright
See Also
Diversity Forest predictions
Description
Diversity Forest predictions
Usage
## S3 method for class 'divfor.prediction'
predictions(x, ...)
Arguments
x |
divfor.prediction object. |
... |
Further arguments passed to or from other methods. |
Value
Predictions: Classes for Classification forests, Numerical values for Regression forests and the estimated survival functions for all individuals for Survival forests.
Author(s)
Marvin N. Wright
See Also
Data on stock prices of aerospace companies
Description
This data set contains 950 daily stock prices from January 1988 through October 1991, for ten aerospace companies.
The names of the companies are anonymised, and the stock prices of one of these companies (company10) are
used as the outcome variable. Thus, for this data set, both the outcome and the covariates are metric.
Format
A data frame with 950 observations, nine covariates and one metric outcome variable
Details
The variables are as follows: covariates: company1, ..., company9; outcome variable: company10.
Source
OpenML: data.name: stock, data.id: 223, link: https://www.openml.org/d/223/
References
Vanschoren, J., van Rijn, J. N., Bischl, B., Torgo, L. (2013). OpenML: networked science in machine learning. SIGKDD Explorations 15(2):49-60, <doi:10.1145/2641190.2641198>.
Examples
## Load data:
data(stock)
## Dimension of data:
dim(stock)
## First rows of data:
head(stock)
Optimization of the values of the tuning parameters nsplits and proptry
Description
First, grids of possible values may be provided for both nsplits and proptry;
default grids are used if none are provided. Second, a forest is constructed for
each pairwise combination of values from these two grids. Third, the pair of
nsplits and proptry values associated with the smallest out-of-bag prediction
error is used as the optimized pair of parameter values. If several pairs of
parameter values are associated with the same smallest out-of-bag prediction
error, the pair with the smallest (parameter) values is used.
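The selection rule can be sketched as follows (illustrative code, not the package's internals; 'tunegrid' is assumed to contain the parameter pairs ordered by increasing values and 'ooberrs' the corresponding out-of-bag prediction errors):
best <- which(ooberrs == min(ooberrs))  # all pairs with the smallest OOB error
opt  <- tunegrid[best[1], ]             # with ties, the smallest parameter values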
Usage
tunedivfor(
formula = NULL,
data = NULL,
nsplitsgrid = c(2, 5, 10, 30, 50, 100, 200),
proptrygrid = c(0.05, 1),
num.trees.pre = 500
)
Arguments
formula |
Object of class formula or character describing the model to fit. |
data |
Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL). |
nsplitsgrid |
Grid of values to consider for nsplits. The default grid is c(2, 5, 10, 30, 50, 100, 200). |
proptrygrid |
Grid of values to consider for proptry. The default grid is c(0.05, 1). |
num.trees.pre |
Number of trees used for each forest constructed during tuning parameter optimization. Default is 500. |
Value
List with elements
nsplitsopt |
Optimized value of nsplits. |
proptryopt |
Optimized value of proptry. |
tunegrid |
Two-dimensional matrix containing all pairs of values considered for nsplits and proptry. |
ooberrs |
The out-of-bag prediction errors obtained for each pair of values considered for nsplits and proptry. |
Author(s)
Roman Hornung
References
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
See Also
Examples
## Load package:
library("diversityForest")
## Set seed to obtain reproducible results:
set.seed(1234)
## Tuning parameter optimization for the iris data set:
tuneres <- tunedivfor(formula = Species ~ ., data = iris, num.trees.pre = 20)
# NOTE: num.trees.pre = 20 is much too small for practical
# purposes - the out-of-bag error estimates of the forests
# constructed during optimization will be much too variable!
# In practice, num.trees.pre = 500 (the default value) or a
# larger number should be used.
tuneres
tuneres$nsplitsopt
tuneres$proptryopt
tuneres$tunegrid
tuneres$ooberrs
Data on biological species
Description
This data set describes 101 different biological species using 16 simple attributes,
where 15 of these are binary and one is metric (the number of legs). The outcome
"mammal vs. other" (type) is binary.
Format
A data frame with 101 observations, 16 covariates and one binary outcome variable
Details
The variables are as follows:
- hair. factor. Presence of hairs (true = yes; false = no)
- feathers. factor. Presence of feathers (true = yes; false = no)
- eggs. factor. Does the species lay eggs? (true = yes; false = no)
- milk. factor. Does the species give milk? (true = yes; false = no)
- airborne. factor. Does the species fly? (true = yes; false = no)
- aquatic. factor. Does the species live in the water? (true = yes; false = no)
- predator. factor. Is the species a predator? (true = yes; false = no)
- toothed. factor. Presence of teeth (true = yes; false = no)
- backbone. factor. Presence of backbone (true = yes; false = no)
- breathes. factor. Does the species breathe with lungs? (true = yes; false = no)
- venomous. factor. Is the species venomous? (true = yes; false = no)
- fins. factor. Presence of fins (true = yes; false = no)
- legs. metric. Number of legs
- tail. factor. Presence of tail (true = yes; false = no)
- domestic. factor. Is the species domestic? (true = yes; false = no)
- catsize. factor. Is the species large? (true = yes; false = no)
- type. factor. Binary outcome variable - type of species ('mammal' vs. 'other')
The original OpenML data set contains an additional variable animal, which provided the
names of all species; this variable has been removed in this version of the data set.
Source
OpenML: data.name: zoo, data.id: 965, link: https://www.openml.org/d/965/ (Accessed: 29/08/2024)
References
Vanschoren, J., van Rijn, J. N., Bischl, B., Torgo, L. (2013). OpenML: networked science in machine learning. SIGKDD Explorations 15(2):49-60, <doi:10.1145/2641190.2641198>.
Dua, D., Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/.
Examples
## Load data:
data(zoo)
## Numbers of observations in the two classes:
table(zoo$type)
## Dimension of data:
dim(zoo)
## First rows of data:
head(zoo)