Title: | Explainable Ensemble Trees |
Version: | 0.2.0 |
Description: | The Explainable Ensemble Trees 'e2tree' approach has been proposed by Aria et al. (2024) <doi:10.1007/s00180-022-01312-6>. It aims to explain and interpret decision tree ensemble models using a single tree-like structure. 'e2tree' is a new way of explaining an ensemble tree trained through 'randomForest' or 'xgboost' packages. |
License: | MIT + file LICENSE |
URL: | https://github.com/massimoaria/e2tree |
BugReports: | https://github.com/massimoaria/e2tree/issues |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | dplyr, doParallel, parallel, foreach, future.apply, ggplot2, Matrix, partitions, purrr, tidyr, ranger, randomForest, rpart.plot, Rcpp, RSpectra, ape |
LinkingTo: | Rcpp |
Suggests: | testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
NeedsCompilation: | yes |
Packaged: | 2025-07-16 10:34:25 UTC; massimoaria |
Author: | Massimo Aria |
Maintainer: | Massimo Aria <aria@unina.it> |
Repository: | CRAN |
Date/Publication: | 2025-07-16 12:00:02 UTC |
Dissimilarity matrix
Description
The function createDisMatrix creates a dissimilarity matrix among observations from an ensemble tree.
Usage
createDisMatrix(
ensemble,
data,
label,
parallel = list(active = FALSE, no_cores = 1),
verbose = FALSE
)
Arguments
ensemble |
is an ensemble tree object |
data |
is a data frame containing the variables in the model. It is the data frame used for ensemble learning. |
label |
is a character. It indicates the response label. |
parallel |
A list with two elements: |
verbose |
Logical. If TRUE, the function prints progress messages and other information during execution. If FALSE (the default), messages are suppressed. |
Details
An ensemble
is a trained object of one of these classes trained
for classification or regression task:
-
randomForest
-
ranger
Value
A dissimilarity matrix. This is a dissimilarity matrix measuring the discordance between two observations concerning a given random forest model.
Examples
## Classification
data("iris")
# Create training and validation set:
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[,5]
response_validation <- validation[,5]
# Perform training:
## "randomForest" package
ensemble <- randomForest::randomForest(Species ~ ., data=training,
importance=TRUE, proximity=TRUE)
## "ranger" package
ensemble <- ranger::ranger(Species ~ ., data = iris,
num.trees = 1000, importance = 'impurity')
D <- createDisMatrix(ensemble, data=training,
label = "Species",
parallel = list(active=FALSE, no_cores = 1))
## Regression
data("mtcars")
# Create training and validation set:
smp_size <- floor(0.75 * nrow(mtcars))
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
training <- mtcars[train_ind, ]
validation <- mtcars[-train_ind, ]
response_training <- training[,1]
response_validation <- validation[,1]
# Perform training
## "randomForest" package
ensemble = randomForest::randomForest(mpg ~ ., data=training, ntree=1000,
importance=TRUE, proximity=TRUE)
## "ranger" package
ensemble <- ranger::ranger(formula = mpg ~ ., data = training,
num.trees = 1000, importance = "permutation")
D = createDisMatrix(ensemble, data=training,
label = "mpg",
parallel = list(active=FALSE, no_cores = 1))
Explainable Ensemble Tree
Description
It creates an explainable tree for Random Forest. Explainable Ensemble Trees (E2Tree) aimed to generate a “new tree” that can explain and represent the relational structure between the response variable and the predictors. This lead to providing a tree structure similar to those obtained for a decision tree exploiting the advantages of a dendrogram-like output.
Usage
e2tree(
formula,
data,
D,
ensemble,
setting = list(impTotal = 0.1, maxDec = 0.01, n = 2, level = 5)
)
Arguments
formula |
is a formula describing the model to be fitted, with a response but no interaction terms. | ||||||||||||
data |
a data frame containing the variables in the model. It is a data frame in which to interpret the variables named in the formula. | ||||||||||||
D |
is the dissimilarity matrix. This is a dissimilarity matrix measuring the discordance between two observations concerning a given classifier of a random forest model. The dissimilarity matrix is obtained with the createDisMatrix function. | ||||||||||||
ensemble |
is an ensemble tree object (for the moment ensemble works only with random forest objects) | ||||||||||||
setting |
is a list containing the set of stopping rules for the tree building procedure.
Default is |
Value
A e2tree object, which is a list with the following components:
tree | A data frame representing the main structure of the tree aimed at explaining and graphically representing the relationships and interactions between the variables used to perform an ensemble method. | |
call | The matched call | |
terms | A list of terms and attributes | |
control | A list containing the set of stopping rules for the tree building procedure | |
varimp | A list containing a table and a plot for the variable importance. Variable importance refers to a quantitative measure that assesses the contribution of individual variables within a predictive model towards accurate predictions. It quantifies the influence or impact that each variable has on the model's overall performance. Variable importance provides insights into the relative significance of different variables in explaining the observed outcomes and aids in understanding the underlying relationships and dynamics within the model |
Examples
## Classification:
data(iris)
# Create training and validation set:
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[,5]
response_validation <- validation[,5]
# Perform training:
## "randomForest" package
ensemble <- randomForest::randomForest(Species ~ ., data=training,
importance=TRUE, proximity=TRUE)
## "ranger" package
ensemble <- ranger::ranger(Species ~ ., data = iris,
num.trees = 1000, importance = 'impurity')
D <- createDisMatrix(ensemble, data=training, label = "Species",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=0.01, n=2, level=5)
tree <- e2tree(Species ~ ., training, D, ensemble, setting)
## Regression
data("mtcars")
# Create training and validation set:
smp_size <- floor(0.75 * nrow(mtcars))
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
training <- mtcars[train_ind, ]
validation <- mtcars[-train_ind, ]
response_training <- training[,1]
response_validation <- validation[,1]
# Perform training
## "randomForest" package
ensemble = randomForest::randomForest(mpg ~ ., data=training, ntree=1000,
importance=TRUE, proximity=TRUE)
## "ranger" package
ensemble <- ranger::ranger(formula = mpg ~ ., data = training,
num.trees = 1000, importance = "permutation")
D = createDisMatrix(ensemble, data=training, label = "mpg",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=(1*10^-6), n=2, level=5)
tree <- e2tree(mpg ~ ., training, D, ensemble, setting)
Comparison of Heatmaps and Mantel Test
Description
This function processes heatmaps for visual comparison and performs the Mantel test between a proximity matrix derived from Random Forest outputs and a matrix estimated by E2Tree. Heatmaps are generated for both matrices. The Mantel test quantifies the correlation between the matrices, offering a statistical measure of similarity.
Usage
eComparison(data, fit, D, graph = TRUE)
Arguments
data |
a data frame containing the variables in the model. It is the data frame used for ensemble learning. |
fit |
is e2tree object. |
D |
is the dissimilarity matrix. This is a dissimilarity matrix measuring the discordance between two observations concerning a given classifier of a random forest model. The dissimilarity matrix is obtained with the createDisMatrix function. |
graph |
A logical value (default: TRUE). If TRUE, heatmaps of both matrices are generated and displayed. |
Value
A list containing three elements:
-
RF HeatMap
: A heatmap plot of the Random Forest-derived proximity matrix. -
E2Tree HeatMap
: A heatmap plot of the E2Tree-estimated matrix. -
Mantel Test
: Results of the Mantel test, including the correlation coefficient and significance level.
Examples
## Classification:
data(iris)
# Create training and validation set:
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[,5]
response_validation <- validation[,5]
# Perform training:
ensemble <- randomForest::randomForest(Species ~ ., data=training,
importance=TRUE, proximity=TRUE)
D <- createDisMatrix(ensemble, data=training, label = "Species",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=0.01, n=2, level=5)
tree <- e2tree(Species ~ ., training, D, ensemble, setting)
eComparison(training, tree, D)
## Regression
data("mtcars")
# Create training and validation set:
smp_size <- floor(0.75 * nrow(mtcars))
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
training <- mtcars[train_ind, ]
validation <- mtcars[-train_ind, ]
response_training <- training[,1]
response_validation <- validation[,1]
# Perform training
ensemble = randomForest::randomForest(mpg ~ ., data=training, ntree=1000,
importance=TRUE, proximity=TRUE)
D = createDisMatrix(ensemble, data=training, label = "mpg",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=(1*10^-6), n=2, level=5)
tree <- e2tree(mpg ~ ., training, D, ensemble, setting)
eComparison(training, tree, D)
Predict responses through an explainable RF
Description
It predicts classification and regression tree responses
Usage
ePredTree(fit, data, target = "1")
Arguments
fit |
is a e2tree object |
data |
is a data frame |
target |
is the target value of response in the classification case |
Value
an object.
Examples
## Classification:
data(iris)
# Create training and validation set:
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[,5]
response_validation <- validation[,5]
# Perform training:
ensemble <- randomForest::randomForest(Species ~ ., data=training,
importance=TRUE, proximity=TRUE)
D <- createDisMatrix(ensemble, data=training, label = "Species",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=0.01, n=2, level=5)
tree <- e2tree(Species ~ ., training, D, ensemble, setting)
ePredTree(tree, validation, target="1")
## Regression
data("mtcars")
# Create training and validation set:
smp_size <- floor(0.75 * nrow(mtcars))
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
training <- mtcars[train_ind, ]
validation <- mtcars[-train_ind, ]
response_training <- training[,1]
response_validation <- validation[,1]
# Perform training
ensemble = randomForest::randomForest(mpg ~ ., data=training, ntree=1000,
importance=TRUE, proximity=TRUE)
D = createDisMatrix(ensemble, data=training, label = "mpg",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=(1*10^-6), n=2, level=5)
tree <- e2tree(mpg ~ ., training, D, ensemble, setting)
ePredTree(tree, validation)
Roc curve
Description
Computes and plots the Receiver Operating Characteristic (ROC) curve for a binary classification model, along with the Area Under the Curve (AUC). The ROC curve is a graphical representation of a classifier’s performance across all classification thresholds.
Usage
roc(response, scores, target = "1")
Arguments
response |
is the response variable vector |
scores |
is the probability vector of the prediction |
target |
is the target response class |
Value
an object.
Examples
## Classification:
data(iris)
# Create training and validation set:
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[,5]
response_validation <- validation[,5]
# Perform training:
ensemble <- randomForest::randomForest(Species ~ ., data=training,
importance=TRUE, proximity=TRUE)
D <- createDisMatrix(ensemble, data=training, label = "Species",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=0.01, n=2, level=5)
tree <- e2tree(Species ~ ., training, D, ensemble, setting)
pr <- ePredTree(tree, validation, target="setosa")
roc(response_training, scores = pr$score, target = "setosa")
Convert e2tree into an rpart object
Description
It converts an e2tree output into an rpart object.
Usage
rpart2Tree(fit, ensemble)
Arguments
fit |
is e2tree object. |
ensemble |
is an ensemble tree object (for the moment ensemble works only with random forest objects). |
Value
An rpart object. It contains the following components:
frame | The data frame includes a singular row for each node present in the tree. The row.names within the frame are assigned as unique node numbers, following a binary ordering system indexed by the depth of the nodes. The columns of the frame consist of the following components: (var) this variable denotes the names of the variables employed in the split at each node. In the case of leaf nodes, the level "leaf" is used to indicate their status as terminal nodes; (n) the variable 'n' represents the number of observations that reach a particular node; (wt) 'wt' signifies the sum of case weights associated with the observations reaching a given node; (dev) the deviance of the node, which serves as a measure of the node's impurity or lack of fit; (yval) the fitted value of the response variable at the node; (splits) this two-column matrix presents the labels for the left and right splits associated with each node; (complexity) the complexity parameter indicates the threshold value at which the split is likely to collapse; (ncompete) 'ncompete' denotes the number of competitor splits recorded for a node; (nsurrogate) the variable 'nsurrogate' represents the number of surrogate splits recorded for a node | |
where | An integer vector that matches the length of observations in the root node. The vector contains the row numbers in the frame that correspond to the leaf nodes where each observation is assigned | |
call | The matched call | |
terms | A list of terms and attributes | |
control | A list containing the set of stopping rules for the tree building procedure | |
functions | The summary, print, and text functions are utilized for the specific method required | |
variable.importance | Variable importance refers to a quantitative measure that assesses the contribution of individual variables within a predictive model towards accurate predictions. It quantifies the influence or impact that each variable has on the model's overall performance. Variable importance provides insights into the relative significance of different variables in explaining the observed outcomes and aids in understanding the underlying relationships and dynamics within the model |
Examples
## Classification:
data(iris)
# Create training and validation set:
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[,5]
response_validation <- validation[,5]
# Perform training:
## "randomForest" package
ensemble <- randomForest::randomForest(Species ~ ., data=training,
importance=TRUE, proximity=TRUE)
## "ranger" package
ensemble <- ranger::ranger(Species ~ ., data = iris,
num.trees = 1000, importance = 'impurity')
D <- createDisMatrix(ensemble, data=training, label = "Species",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=0.01, n=2, level=5)
tree <- e2tree(Species ~ ., training, D, ensemble, setting)
# Convert e2tree into an rpart object:
expl_plot <- rpart2Tree(tree, ensemble)
# Plot using rpart.plot package:
rpart.plot::rpart.plot(expl_plot)
## Regression
data("mtcars")
# Create training and validation set:
smp_size <- floor(0.75 * nrow(mtcars))
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
training <- mtcars[train_ind, ]
validation <- mtcars[-train_ind, ]
response_training <- training[,1]
response_validation <- validation[,1]
# Perform training
## "randomForest" package
ensemble = randomForest::randomForest(mpg ~ ., data=training, ntree=1000,
importance=TRUE, proximity=TRUE)
## "ranger" package
ensemble <- ranger::ranger(formula = mpg ~ ., data = training,
num.trees = 1000, importance = "permutation")
D = createDisMatrix(ensemble, data=training, label = "mpg",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=(1*10^-6), n=2, level=5)
tree <- e2tree(mpg ~ ., training, D, ensemble, setting)
# Convert e2tree into an rpart object:
expl_plot <- rpart2Tree(tree, ensemble)
# Plot using rpart.plot package:
rpart.plot::rpart.plot(expl_plot)
Variable Importance
Description
It calculate variable importance of an explainable tree
Usage
vimp(fit, data, type = "classification")
Arguments
fit |
is a e2tree object |
data |
is a data frame in which to interpret the variables named in the formula. |
type |
Specify the type. The default is ‘classification’, otherwise ‘regression’. |
Value
a data frame containing variable importance metrics.
Examples
## Classification:
data(iris)
# Create training and validation set:
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
training <- iris[train_ind, ]
validation <- iris[-train_ind, ]
response_training <- training[,5]
response_validation <- validation[,5]
# Perform training:
ensemble <- randomForest::randomForest(Species ~ ., data=training,
importance=TRUE, proximity=TRUE)
D <- createDisMatrix(ensemble, data=training, label = "Species",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=0.01, n=2, level=5)
tree <- e2tree(Species ~ ., training, D, ensemble, setting)
vimp(tree, training, type = "classification")
## Regression
data("mtcars")
# Create training and validation set:
smp_size <- floor(0.75 * nrow(mtcars))
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
training <- mtcars[train_ind, ]
validation <- mtcars[-train_ind, ]
response_training <- training[,1]
response_validation <- validation[,1]
# Perform training
ensemble = randomForest::randomForest(mpg ~ ., data=training, ntree=1000,
importance=TRUE, proximity=TRUE)
D = createDisMatrix(ensemble, data=training, label = "mpg",
parallel = list(active=FALSE, no_cores = 1))
setting=list(impTotal=0.1, maxDec=(1*10^-6), n=2, level=5)
tree <- e2tree(mpg ~ ., training, D, ensemble, setting)
vimp(tree, training, type = "regression")