Title: | Longitudinal Regression Trees and Forests |
Version: | 0.2.0 |
Description: | Builds regression trees and random forests for longitudinal or functional data using a spline projection method. Implements and extends the work of Yu and Lambert (1999) <doi:10.1080/10618600.1999.10474847>. This method allows trees and forests to be built while considering either level and shape or only shape of response trajectories. |
Depends: | R (≥ 3.5.0), rpart, nlme, splines |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
Imports: | mosaic, ggplot2, treeClust, mclust |
RoxygenNote: | 6.1.1 |
Suggests: | R.rsp, knitr, rmarkdown, testthat |
VignetteBuilder: | R.rsp |
BugReports: | https://github.com/anna-neufeld/splinetree/issues |
URL: | https://github.com/anna-neufeld/splinetree |
NeedsCompilation: | no |
Packaged: | 2019-07-16 17:59:24 UTC; annaneufeld |
Author: | Anna Neufeld [aut, cre], Brianna Heggeseth [aut, ths] |
Maintainer: | Anna Neufeld <aneufeld@uw.edu> |
Repository: | CRAN |
Date/Publication: | 2019-07-18 06:36:41 UTC |
Compute the average tree size in a forest
Description
Returns the average number of terminal nodes for trees in a forest
Usage
avSize(forest)
Arguments
forest |
A model returned by splineForest() |
Value
The average number of terminal nodes in forest
Examples
avSize(forest)
Flattens predictor variable data into one row per person
Description
Assumes that splitting explanatory variables do not vary with time. Spline Tree is not meant to handle time-varying covariates.
Usage
flatten_predictors(idvar, data)
Arguments
idvar |
The string name of the ID variable (used to group observations) |
data |
The full dataset to be flattened (long form) |
Value
A wide format dataset with spline coefficients as the responses.
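As a rough illustration of what "flattening" means here, the base R sketch below collapses a toy long-format data frame to one row per ID, relying on the stated assumption that predictors do not vary with time. The toy data and the helper name flatten_sketch are hypothetical; this is not the package's implementation, and it does not attach spline coefficients.

```r
# Toy long-format data: three observations per person, constant predictors.
long <- data.frame(
  ID  = rep(c(1, 2), each = 3),
  AGE = rep(c(20, 25, 30), times = 2),
  BMI = c(22, 23, 25, 27, 27, 29),
  SEX = rep(c(1, 2), each = 3)
)

# Hypothetical helper: keep the first row seen for each ID.
# Time-varying columns (AGE, BMI) simply retain their first recorded value.
flatten_sketch <- function(idvar, data) {
  data[!duplicated(data[[idvar]]), , drop = FALSE]
}

flat <- flatten_sketch("ID", long)
nrow(flat)  # one row per person
```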
Sample forest used in vignettes
Description
Sample forest used in vignettes
Usage
forest
Format
An object of class list of length 15.
Get the basis matrix to be used for this spline tree
Description
Using the user-specified parameters or the default parameters, computes the basis matrix that will be used for building the tree.
Usage
getBasisMat(yvar, tvar, idvar, data, knots = NULL, df, degree, intercept,
gridPoints, nGrid = 7)
Arguments
yvar |
Name of response variable (string) |
tvar |
Name of time variable (string) |
idvar |
Name of ID variable (string) |
data |
Full dataset |
knots |
Knots argument specified by user. Specifies location of INTERNAL knots. |
df |
Degrees of freedom argument specified by user |
degree |
The degree of the spline polynomial |
intercept |
Whether or not to use an intercept |
gridPoints |
Optional. A vector of numbers that will be used as the grid on which to evaluate the projection sum of squares. Should fall roughly within the range of the time variable. |
nGrid |
Number of grid points to evaluate split function at. |
Value
The basis matrix to be used for the tree building process
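To make the idea of a basis matrix evaluated on a grid concrete, here is a minimal sketch using splines::bs() directly. It assumes grid points at equally spaced quantiles of the time variable (including the minimum and maximum) and a linear basis with one internal knot at the median; the toy time values are invented, and getBasisMat() may choose its grid and knots differently.

```r
library(splines)

set.seed(1)
tvals <- runif(200, min = 18, max = 50)   # toy time variable

# Place nGrid grid points at equally spaced quantiles of the time variable.
nGrid <- 7
grid <- quantile(tvals, probs = seq(0, 1, length.out = nGrid), names = FALSE)

# Evaluate a linear B-spline basis with one internal knot on that grid.
basis <- bs(grid, knots = median(tvals), degree = 1,
            Boundary.knots = range(tvals), intercept = TRUE)

dim(basis)  # nGrid rows, one column per basis function
```

With intercept = TRUE the basis functions sum to one at every grid point, which is what allows the coefficients to capture level as well as shape.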
Retrieve the subset of the data found at a given terminal node
Description
Given a terminal node number, this function returns the data belonging to this terminal node. If the dataType argument is 'all', returns all rows of data from the original dataset that fall in this node. Otherwise, the flattened data that belongs to this node is returned (one row of data per ID, original responses replaced by spline coefficients).
Usage
getNodeData(tree, node, dataType = "all")
Arguments
tree |
a model returned from splineTree() |
node |
The number of the node to retrieve data from. Must be the number of a valid terminal node. Node numbers can be seen using stPrint(tree) or treeSummary(tree). |
dataType |
If "all", the data returned is from the original dataset (one row per individual observation with original response values). If "flat", the data returned is the flattened data (one row per person/unit), with individual spline coefficients instead of response values. |
Value
A dataframe which holds all the data that falls into this node of the tree.
Examples
## Not run:
split_formula <- BMI ~ HISP + WHITE + BLACK + SEX +
Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, 'ID', nlsySample, degree=1,
df=3, intercept=TRUE, cp=0.006, minNodeSize=20)
## End(Not run)
node6data <- getNodeData(tree, 6, dataType = 'all')
plot(BMI~AGE, data=node6data)
Sample importance used in vignettes
Description
Sample importance used in vignettes
Usage
importance
Format
An object of class list of length 3.
Get spline coefficients for a single person
Description
Get spline coefficients for a single person
Usage
individual_spline(person, idvar, yvar, tvar, data, boundaryKnots,
innerKnots, degree, intercept)
Arguments
person |
ID of this person |
idvar |
name of the id variable (string) |
yvar |
the name of the response variable |
tvar |
name of time variable (string) |
data |
full dataset |
boundaryKnots |
the boundary knots for the bspline |
innerKnots |
the inner knots for the bspline |
degree |
the degree of the bspline |
intercept |
whether or not to include an intercept |
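The sketch below shows one plausible way to obtain spline coefficients for a single person: project that person's responses onto a B-spline basis by least squares. The toy data, the knot locations, and the use of lm() are assumptions made for illustration, not a description of the internals of individual_spline().

```r
library(splines)

# Toy trajectory for one person.
person <- data.frame(AGE = c(20, 24, 28, 33, 38, 45),
                     BMI = c(21.5, 22.0, 23.1, 24.8, 25.2, 26.9))

# Basis evaluated at this person's observation times.
X <- bs(person$AGE, knots = 30, degree = 1,
        Boundary.knots = c(18, 50), intercept = TRUE)

# Least-squares projection of the responses onto the basis.
# The "- 1" drops lm's own intercept because the basis already includes one.
coefs <- unname(coef(lm(person$BMI ~ X - 1)))

length(coefs)  # one coefficient per basis function
```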
Baseline socioeconomic information and BMI of 1000 individuals.
Description
A dataset containing the body mass index (BMI) and baseline socioeconomic information of 1000 individuals from the National Longitudinal Survey of Youth 1979 (NLSY), a freely available longitudinal dataset. The 1000 individuals were drawn randomly from among all NLSY respondents with at least 10 non-missing height/weight responses spread out over at least 20 years. This dataset is used in the package vignettes and code examples. Only a small subset of the variables available from the NLSY are included here. See https://www.bls.gov/nls/nlsy79.htm for more information.
Usage
nlsySample
Format
A data frame with 16126 rows and 34 columns.
- ID
Unique identifier for each NLSY respondent
- SEX
Respondent's sex. 1 denotes male, 2 denotes female.
- AGE
Respondent's age
- BLACK
Indicator for whether or not respondent identified as Black
- BMI
Respondent's body mass index, calculated from reported height and weight
- HGC_FATHER
Highest grade completed by respondent's father
- HGC_MOTHER
Highest grade completed by respondent's mother
- HISP
Indicator for whether or not respondent identified as Hispanic
- Num_sibs
Number of siblings of respondent
- WHITE
Indicator for whether or not respondent identified as white.
- HGC
Highest grade completed by respondent
- Age_first_smoke
Age that respondent reported first using tobacco. If they reported never using tobacco, recorded as 100.
- Age_first_alc
Age that respondent reported first drinking alcohol. If they reported never drinking alcohol, recorded as 100.
- RACE
Race, as recorded by NLSY. 1 denotes Hispanic, 2 denotes Black, 3 denotes White.
Source
https://www.bls.gov/nls/nlsy79.htm
Plots the trajectories of each terminal node side by side.
Description
Corresponds to plotting only the second panel of stPlot(). If model$intercept==FALSE, estimated intercepts are added to each trajectory so that the trajectories are plotted at the level of reasonable response values.
Usage
nodePlot(model, colors = NULL)
Arguments
model |
A model returned from splineTree() |
colors |
A list of colors to use. By default, uses colors drawn from a rainbow. |
Create a barplot of relative variable importance scores.
Description
Given a named vector of variable importance measures, this function makes a barplot of the relative importances. The importances are scaled to sum to 1. An appropriate input is one column of the output from varImpY() or varImpCoeff().
Usage
plotImp(importance_vector, ...)
Arguments
importance_vector |
a named vector where the names are the variables and the vector stores the importances. |
... |
additional arguments to plot, such as "main", "cex", etc. |
Examples
imp <- varImpCoeff(forest)[,3]
plotImp(imp, main="Standardized Variable Importance")
Plot the predicted trajectory for a single node
Description
Creates a simple plot of the predicted trajectory at a given node. Option to include the data that falls in the node on the same plot.
Usage
plotNode(tree, node, includeData = FALSE, estimateIntercept = TRUE)
Arguments
tree |
A model returned from splineTree() |
node |
A node number. Must be a valid terminal node for the given spline tree. To view valid terminal node numbers, use stPrint() or treeSummary(). |
includeData |
Would you like to see the data from the node plotted along with the predicted trajectory? |
estimateIntercept |
If the tree was built without an intercept, should the average starting response of all the individuals in the node be added to the trajectory to give the plot interpretable values? Or should the shape of the trajectory be plotted without any regard to the intercept? |
Examples
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, idvar = "ID",
data = nlsySample, degree = 1, df = 3,
intercept = TRUE, cp = 0.005)
plotNode(tree, 6, includeData=TRUE)
Predict spline coefficients for a testset using a spline tree
Description
Returns a matrix of spline coefficients for each observation in the testset. If no testset is provided, returns predicted coefficients for the individuals in the training set; in this case, the columns of the returned predictions correspond to the rows of the flattened training dataset (found in tree$parms$flat_data).
Usage
predictCoeffs(tree, testset = tree$parms$flat_data)
Arguments
tree |
A model created with splineTree() |
testset |
The dataset to predict coefficients for. Default is the flattened dataset used to make the tree. |
Value
A matrix of spline coefficients. The dimension of the matrix is the degrees of freedom of the spline by the number of units in the test set. The ith column of the matrix holds the predicted coefficients for the ith row in the testset.
Examples
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, idvar = "ID",
data = nlsySample, degree = 1, df = 3,
intercept = TRUE, cp = 0.005)
preds <- predictCoeffs(tree)
Predict spline coefficients for a testset using a splineforest.
Description
Uses the forest to predict spline coefficients. Returns a matrix of predicted spline coefficients where the columns of the returned matrix correspond to rows of the testdata. The number of rows of the returned matrix is equal to the degrees of freedom of the forest. If no testdata is provided, forest$flat_data is used. When testdata is not provided, predictions will be made according to one of three methods. The "method" parameter must be either "oob", "itb", or "all". This parameter specifies which trees are used in making a prediction for a certain datapoint. This parameter is not relevant when predicting for a testset that is distinct from the training set.
Usage
predictCoeffsForest(forest, method = "oob", testdata = NULL)
Arguments
forest |
A model returned from splineForest() |
method |
A string; either "oob", "itb", or "all". If "oob" (the default), predictions for a given data point are made only using trees for which this data point was "out of the bag" (not in the random subsample). If "itb", predictions for a given data point are made using only the trees for which this datapoint was "in the bag" (in the random subsample). If "all", all trees are used for every datapoint. |
testdata |
The test data to make predictions for. If this is provided, then all trees are used for all datapoints. |
Value
A matrix of predicted spline coefficients. The dimensions are forest$df x nrow(testdata). Each column of the matrix corresponds to a row of the testdata.
Examples
trainingSetPreds <- predictCoeffsForest(forest)
newData <- data.frame("WHITE" = 0, "BLACK"=1, "HISP"=0, "Num_sibs"=3,
"HGC_MOTHER"=12, "HGC_FATHER"=12, "SEX"=1)
predictCoeffsForest(forest, testdata = newData)
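The difference between the "oob", "itb", and "all" methods can be sketched with a toy forest in base R. Everything below is hypothetical: the prediction matrix, the in_bag indicator, and the helper average_over are invented for illustration, and each tree predicts a single number here rather than a coefficient vector.

```r
# Toy setup: 4 trees, 5 training units, one predicted value per tree and unit.
set.seed(42)
n_tree <- 4
n_unit <- 5
preds <- matrix(rnorm(n_tree * n_unit), nrow = n_tree)

# in_bag[t, i] is TRUE when unit i was in the subsample used to grow tree t.
in_bag <- matrix(c(TRUE,  TRUE,  FALSE, TRUE,  FALSE,
                   FALSE, TRUE,  TRUE,  FALSE, TRUE,
                   TRUE,  FALSE, TRUE,  TRUE,  FALSE,
                   TRUE,  TRUE,  FALSE, FALSE, TRUE),
                 nrow = n_tree, byrow = TRUE)

# Average the per-tree predictions for each unit over the selected trees.
average_over <- function(use) {
  sapply(seq_len(n_unit), function(i) mean(preds[use[, i], i]))
}

oob <- average_over(!in_bag)   # only trees that never saw unit i
itb <- average_over(in_bag)    # only trees that did see unit i
all_trees <- colMeans(preds)   # every tree
```

Out-of-bag predictions are the natural choice for honest evaluation on the training set, since no unit is predicted by a tree that was fit to it.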
Predictions from a spline tree
Description
Returns a vector of predicted responses for the testData. If testData is omitted, returns predictions for the training data. This function is most meaningful if model$intercept==TRUE.
Usage
predictY(model, testData = NULL)
Arguments
model |
A model created with splineTree() |
testData |
The data to return predictions for. If omitted, uses the training data. |
Value
A vector of predictions with elements corresponding to the rows of the test data.
Examples
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, idvar = "ID",
data = nlsySample, degree = 1, df = 3,
intercept = TRUE, cp = 0.005)
plot(predictY(tree), tree$parms$data[[tree$parms$yvar]])
Predict responses for a testset using a splineforest.
Description
Uses the forest to make predictions of responses for individuals. This method should only be used on forests where forest$intercept==TRUE. If the testdata parameter is NULL, makes predictions for each row of the training data. In this case, the method parameter (which should be set to "oob", "itb", or "all") determines the method used for prediction. If the testdata parameter is not NULL, the method parameter is ignored and all trees are used for the prediction of every datapoint.
Usage
predictYForest(forest, method = "oob", testdata = NULL)
Arguments
forest |
A model returned from splineForest() |
method |
A string. Must be either "oob", "itb", or "all". Only relevant when testdata is NULL. The default value is "oob". If "oob", predictions for a given data point are made only using trees for which this data point was "out of the bag" (not in the random subsample). If "itb", predictions for a given data point are made using only the trees for which this datapoint was in the bag (in the random subsample). If "all", all trees are used for every datapoint. |
testdata |
The test data to make predictions for. If this is provided, then all trees are used for all datapoints. |
Value
A vector of predicted responses. The indices of the vector correspond to rows of the testdata.
Examples
trainingSetPreds <- predictYForest(forest)
newData <- data.frame("AGE"=21, "WHITE" = 0, "BLACK"=1, "HISP"=0,
"Num_sibs"=3, "HGC_MOTHER"=12, "HGC_FATHER"=12, "SEX"=1)
predictYForest(forest, testdata = newData)
Predict responses for the training data
Description
predictY(model) and predict_y_training(model) return identical results, because when no test data is provided to predictY(), the default is to use the training set. This is a slightly faster version that can be used when you know that you wish to predict on the training data. It is faster because it takes advantage of the relationship between model$parms$flat_data and model$parms$data.
Usage
predict_y_training(model)
Arguments
model |
a model created with splineTree() |
Value
A vector of predicted responses where each element in the vector corresponds to a row in model$parms$data.
Computes percent of variation in projected response explained by a splinetree.
Description
Computes an R^2 measure for a splinetree based on the projected sum of squared errors. Returns 1-SSE/SST. SSE is the sum of projection squared errors between individual smoothed trajectories and predicted smoothed trajectories evaluated on a fixed grid. SST is the sum of projection squared errors between individual smoothed trajectories and the overall population mean trajectory, evaluated on the same fixed grid. If model$intercept==TRUE, then there is the option to ignore the intercept coefficient when computing this metric. When the intercept is ignored, the metric captures how well the model explains variation in shape, and ignores any variation in intercept explained by the model.
Usage
projectedR2(model, includeIntercept = FALSE)
Arguments
model |
a model created with splineTree() |
includeIntercept |
If FALSE and if the model was built with an intercept, the projected squared errors are computed while ignoring the intercept. If the model was built without an intercept, this parameter does not do anything. |
Value
The percentage of variation in projected trajectory explained by the model. Computed as 1-SSE/SST. See description.
Examples
r2 <- projectedR2(tree)
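The 1-SSE/SST computation described above can be sketched with toy coefficients. This is an assumption-laden illustration rather than the package's code: the basis, the simulated coefficients, the two-group "tree", and the helper proj_ss are all hypothetical.

```r
library(splines)

# Basis evaluated on a fixed grid of 7 time points.
grid <- seq(20, 50, length.out = 7)
B <- unclass(bs(grid, degree = 1, knots = 35,
                Boundary.knots = c(20, 50), intercept = TRUE))

# Toy coefficients: one row per individual, one column per basis function.
set.seed(7)
group <- rep(c(1, 2), each = 10)
true_coefs <- cbind(22 + 3 * group, 25 + 2 * group, 27 + 4 * group) +
  matrix(rnorm(60, sd = 0.5), ncol = 3)

# Predicted coefficients: the mean of each individual's group (a two-leaf "tree").
group_means <- apply(true_coefs, 2, function(col) ave(col, group))
overall_mean <- matrix(colMeans(true_coefs), nrow = 20, ncol = 3, byrow = TRUE)

# Projection squared error: squared trajectory differences on the grid.
proj_ss <- function(a, b) sum(((a - b) %*% t(B))^2)

r2_sketch <- 1 - proj_ss(true_coefs, group_means) / proj_ss(true_coefs, overall_mean)
r2_sketch
```

To ignore the intercept in such a sketch, one would drop the intercept column from both the coefficient matrices and the basis before computing the sums of squares.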
Computes a level-based or shape-based evaluation metric for a splineforest.
Description
Computes an R-squared-like evaluation metric for a spline forest. The goal is to see how well the predicted spline coefficients for each individual match the spline coefficients obtained when fitting a spline only to this individual's data (we call these coefficients the true coefficients). Computes 1-SSE/SST, where SSE is the total sum of squared projection errors of the true coefficients compared to the predicted coefficients, and SST is the total sum of squared projection errors of the true coefficients compared to the population mean coefficients. If this is an intercept forest, there is the option to compute these sums of squares either with the intercept included or with the intercept ignored to isolate the shape.
Usage
projectedR2Forest(forest, method = "oob", removeIntercept = TRUE)
Arguments
forest |
The output of a call to splineForest() |
method |
How would you like to compute this metric? The choices are "oob", "itb", or "all". "oob" means that predictions for a datapoint can only be made using trees for which that datapoint was "out of the bag" (not in the random subsample). "all" means that all trees are used in the prediction for every datapoint. "itb" means that predictions for a datapoint are made using only the trees for which this datapoint was IN the random subsample. |
removeIntercept |
If true, the projection sum of squared error is computed while ignoring the intercept coefficient. This will help capture the tree's performance at clustering based on shape, not based on level. This parameter is only meaningful if this forest was built using an intercept. |
Value
Returns 1-SSE/SST, where SSE is the total sum of squared projection errors of the true coefficients compared to the predicted coefficients, and SST is the total sum of squared projection errors of the true coefficients compared to the population mean coefficients.
Examples
projectedR2Forest(forest, method="all", removeIntercept=TRUE)
projectedR2Forest(forest, method="all", removeIntercept=FALSE)
Prune each tree in forest using a given complexity parameter.
Description
Prunes each tree in the list forest$Trees according to the provided complexity parameter. Returns a new forest.
Usage
pruneForest(forest, cp)
Arguments
forest |
A model returned by splineForest() |
cp |
The complexity parameter that will be used to prune each tree (see rpart package documentation for detailed description of complexity parameter) |
Value
A new spline forest model (named list) where each tree has been pruned to the desired level.
Examples
print(avSize(forest))
print(avSize(pruneForest(forest, cp=0.007)))
print(avSize(pruneForest(forest, cp=0.01)))
Calculates coordinates for tree plot
Description
Figures out the coordinates on the tree plot for each of the mini trajectory plots. Modified from code from the longRPart package.
Usage
rpartco(tree, parms = paste(".rpart.parms", dev.cur(), sep = "."))
Arguments
tree |
a SplineTree object |
parms |
a string |
Create a faceted spaghetti plot of a splinetree model
Description
Uses ggplot to create a paneled spaghetti plot of the data, where each panel corresponds to a terminal node in the tree. Allows users to visualize homogeneity of trajectories within the terminal nodes of the tree while also looking at the trajectories of different nodes side by side.
Usage
spaghettiPlot(model, colors = NULL)
Arguments
model |
a model returned from splineTree() |
colors |
optional argument specifying colors to be used for each panel. |
Examples
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),]
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, idvar = "ID",
data = nlsySubset, degree = 1, df = 3,
intercept = TRUE, cp = 0.005)
spaghettiPlot(tree)
Build a spline random forest.
Description
Builds an ensemble of regression trees for longitudinal or functional data using the spline projection method. The resulting model contains a list of spline trees along with some additional information. All parameters are used in the same way that they are used in the splineTree() function. The additional parameter ntree specifies how many trees should be in the ensemble, and prob controls the probability of selecting a given variable for split consideration at a node. This method may take several minutes to run; saving the forest after building it is recommended.
Usage
splineForest(splitFormula, tformula, idvar, data, knots = NULL,
df = NULL, degree = 3, intercept = FALSE, nGrid = 7,
gridPoints = NULL, ntree = 50, prob = 0.3, cp = 0.001,
minNodeSize = 1, bootstrap = FALSE)
Arguments
splitFormula |
Formula specifying the longitudinal response variable and the time-constant variables that will be used for splitting in the tree. |
tformula |
Formula specifying the longitudinal response variable and the variable that acts as the time variable. |
idvar |
The name of the variable that serves as the ID variable for grouping observations. Must be a string. |
data |
dataframe in long format that contains all variables specified in the formulas. |
knots |
Specified locations for internal knots in the spline basis. Defaults to NULL, which corresponds to no internal knots. |
df |
Degrees of freedom of the spline basis. If this is specified but the knots parameter is NULL, then the appropriate number of internal knots will be added at quantiles of the training data. If both df and knots are unspecified, the spline basis will have no internal knots. |
degree |
Specifies degree of spline basis used in the tree. |
intercept |
Specifies whether or not the splitting process will consider the intercept coefficient of the spline projections. Defaults to FALSE, which means that the tree will split based on trajectory shape, ignoring response level. |
nGrid |
Number of grid points to evaluate projection sum of squares at. If gridPoints is not supplied, then this is the number of grid points that will be automatically placed at quantiles of the time variable. The default is 7. |
gridPoints |
Optional. A vector of numbers that will be used as the grid on which to evaluate the projection sum of squares. Should fall roughly within the range of the time variable. |
ntree |
Number of trees in the forest. |
prob |
Probability of selecting a variable to be included as a candidate for each split. |
cp |
Complexity parameter passed to the rpart building process. Defaults to 0.001. |
minNodeSize |
Minimum number of observational units that can be in a terminal node. Controls tree size and helps avoid overfitting. Defaults to 1. |
bootstrap |
Boolean specifying whether bootstrap sampling should be used when choosing data to use for each tree. When set to FALSE (the default), sampling without replacement is used and 63.5% of the data is used for each tree. When set to TRUE, a bootstrap sample is used for each tree. |
Details
The ensemble method is highly similar to the random forest methodology of Breiman (2001). Each tree in the ensemble is fit to a random sample of 63.5% of the data, and the subset of variables considered at each node is determined by a random process. The prob parameter specifies the probability that a given variable will be selected at a certain node. Because the method is based on probability, the number of variables considered for splitting is not the same at each node (unlike in the randomForest package). Note that if prob is small and the number of variables in the splitFormula is also small, there is a high probability that no variables will be considered for splitting at a certain node, which is problematic. The fewer total variables there are, the larger prob should be to ensure good results.
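The warning about small prob can be made concrete with a short calculation. The sketch assumes each of the p splitting variables is selected independently with probability prob at a node; that independence is an assumption for illustration, and p_none is a hypothetical helper.

```r
# Chance that no variable is a split candidate at a node, assuming each of
# p variables is selected independently with probability prob.
p_none <- function(prob, p) (1 - prob)^p

p_none(0.3, 7)   # seven splitting variables: about 0.082
p_none(0.3, 2)   # only two splitting variables: 0.49
```

Under this assumption, with the default prob of 0.3 and only two splitting variables, roughly half of all nodes would have no candidate variable, which is why a larger prob is advisable when few variables are available.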
Value
A spline forest model, which is a named list with 15 components. The list stores a list of trees (in model$Trees), along with information about the spline basis used (model$intercept, model$innerKnots, model$boundaryKnots, etc.), and information about which datapoints were used to build each tree (model$oob_indices and model$index). Note that each element in model$Trees is an rpart object but it is not the same as a model returned from splineTree() because it does not store all relevant information in model$parms.
Examples
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),]
splitForm <-~HISP+WHITE+BLACK+HGC_MOTHER+HGC_FATHER+SEX+Num_sibs
sampleForest <- splineForest(splitForm, BMI~AGE, 'ID', nlsySubset, degree=1, cp=0.005, ntree=10)
Build a splinetree model.
Description
Builds a regression tree for longitudinal or functional data using the spline projection method. The underlying tree building process uses the rpart package, and the resulting spline tree is an rpart object with additional stored information. The parameters df, knots, degree, intercept allow for flexibility in customizing the spline basis used for projection. The parameters nGrid and gridPoints allow for flexibility in the grid on which the projection sum of squares is evaluated. The parameters minNodeSize and cp allow for flexibility in controlling the size of the final tree.
Usage
splineTree(splitFormula, tformula, idvar, data, knots = NULL,
df = NULL, degree = 3, intercept = FALSE, nGrid = 7,
gridPoints = NULL, minNodeSize = 10, cp = 0.01)
Arguments
splitFormula |
Formula specifying the longitudinal response variable and the time-constant variables that will be used for splitting in the tree. |
tformula |
Formula specifying the longitudinal response variable and the variable that acts as the time variable. |
idvar |
The name of the variable that serves as the ID variable for grouping observations. Must be a string. |
data |
dataframe in long format that contains all variables specified in the formulas. |
knots |
Specified locations for internal knots in the spline basis. Defaults to NULL, which corresponds to no internal knots. |
df |
Degrees of freedom of the spline basis. If this is specified but the knots parameter is NULL, then the appropriate number of internal knots will be added at quantiles of the training data. If both df and knots are unspecified, the spline basis will have no internal knots. If knots is specified, this parameter will be ignored. |
degree |
Specifies degree of spline basis used for projection. |
intercept |
Specifies whether or not the set of basis functions will include the intercept function. Defaults to FALSE, which means that the tree will split based on trajectory shape, ignoring response level. |
nGrid |
Number of grid points to evaluate projection sum of squares at. If gridPoints is not supplied, this argument will be used and the appropriate number of grid points will be placed at equally spaced quantiles of the time variable. The default is 7. |
gridPoints |
Optional. A vector of numbers that will be used as the grid on which to evaluate the projection sum of squares. Should fall roughly within the range of the time variable. |
minNodeSize |
Minimum number of observational units that can be in a terminal node. Controls tree size and helps avoid overfitting. Defaults to 10. |
cp |
Complexity parameter passed to the rpart building process. Controls tree size. Defaults to the rpart default of 0.01. |
Value
An rpart object with additional splinetree-specific information stored in model$parms. The important attributes of the rpart object include model$frame, model$where, and model$cptable. model$frame holds information about each node in the tree. The ith entry in model$where tells us which row of model$frame describes the node that the ith individual in the flattened dataset falls into. model$parms$flat_data holds the flattened dataset that was used to build the tree. model$cptable displays the complexity parameters that would be needed to prune the tree to various desired sizes. Apart from holding the flattened dataset, model$parms holds the boundary knots and the internal knots of the spline basis used to build the tree. These are sometimes important to recover later.
Examples
nlsySample_subset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 500),]
splitForm <- ~HISP+WHITE+BLACK+HGC_MOTHER+HGC_FATHER+SEX+Num_sibs
tree1 <- splineTree(splitForm, BMI~AGE, 'ID', nlsySample_subset, degree=3, intercept=TRUE, cp=0.005)
stPrint(tree1)
stPlot(tree1)
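The relationship between df, degree, intercept, and the number of internal knots can be checked with splines::bs() directly. This sketch assumes the package's basis follows the same convention as bs() (internal knots = df - degree - intercept, placed at quantiles); the toy age values are invented.

```r
library(splines)

age <- seq(18, 50, length.out = 100)   # toy time variable

# df = 3 with a linear spline and an intercept leaves room for
# 3 - 1 - 1 = 1 internal knot, which bs() places at the median.
b <- bs(age, df = 3, degree = 1, intercept = TRUE)

ncol(b)             # number of basis functions, equal to df
attr(b, "knots")    # location of the single internal knot
```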
Creates a tree plot of a spline tree.
Description
Creates a tree plot of a spline tree. This corresponds to plotting only the first panel of stPlot(). Code for this function was borrowed from the longRPart package on github.
Usage
splineTreePlot(model, colors = NULL)
Arguments
model |
a model returned from splineTree() |
colors |
a list of colors that will be used for the terminal nodes (if NULL, will use a rainbow) |
Custom rpart eval function.
Description
The eval function is required for custom rpart functionality. The split criterion is the total sum of squared errors of the projected or smoothed outcome values around their mean. Note that this is the node purity measure introduced by Yu and Lambert, 1999. The calling of this function is always handled internally by rpart; the user will never directly call this function.
Usage
spline_eval(y, wt = NULL, parms = NULL)
Arguments
y |
the responses at this node, which will be estimated spline coefficients for individuals in the node. |
wt |
Used to weight observations differently. Required by rpart, but not supported by splinetree, so its value will always be NULL. |
parms |
rpart's custom split functionality allows optional parameters to be passed through the splitting functions. In the splinetree package, the parms parameter is used to hold a list of length 1 or 2 containing either just a spline basis matrix (for a tree), or a spline basis matrix and the probability that a variable will be selected at a split (for a random forest). |
Value
A description for the node. This description includes the label, which is the mean response at the node, and the deviance, which in this case is the total projected sum of squares.
Custom rpart init function
Description
The init function is required for custom rpart functionality. This function initializes every node. The init function is responsible for defining the summary function that will be used by rpart's summary function if you call summary() on this tree object. The init function also passes forward its arguments and tells rpart the dimension of the response variable. This function is called internally by rpart; the details are not important for the end user.
Usage
spline_init(y, offset = NULL, parms = NULL, wt = NULL)
Arguments
y |
Response data, which will be estimated spline coefficients |
offset |
Required by rpart, but never used by splinetree, so its value will always be NULL |
parms |
rpart's custom split functionality allows optional parameters to be passed through the splitting functions. In the splinetree package, the parms parameter is used to hold a list of length 1 or 2 containing a spline basis matrix and the probability that a variable will be selected at a split. The probability is only used in splineforests. For splinetrees, only the basis matrix is needed. |
wt |
Used to weight observations differently. Required by rpart, but not supported by splinetree, so its value will always be NULL. |
Value
A list of information for this node that is used internally by rpart.
Custom rpart split function.
Description
The split function is required for the custom rpart functionality. This function is called once per covariate per node during the tree construction, and is responsible for choosing the covariate and threshold for the best split point. This implements the split function suggested by Yu and Lambert. When the covariate is categorical, this code uses a shortcut for computational efficiency. Instead of trying every possible combination of categories as a potential split point, the categories are ordered using the first principal component of the average spline coefficient vector.
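The categorical shortcut can be illustrated in base R. The sketch below orders the categories of a toy covariate by the first principal component of their average coefficient vectors; the simulated data are hypothetical, and the use of prcomp() is an assumption rather than a description of the code inside spline_split().

```r
# Toy node: 12 individuals, 3 spline coefficients each, categorical covariate x.
set.seed(3)
x <- factor(rep(c("a", "b", "c", "d"), each = 3))
shift <- c(a = 2, b = -1, c = 0.5, d = -2)
y <- matrix(rnorm(36, sd = 0.2), ncol = 3) + shift[as.character(x)]

# Average coefficient vector per category (one row per category).
cat_means <- apply(y, 2, function(col) tapply(col, x, mean))

# Score each category on the first principal component and order by it.
pc1 <- prcomp(cat_means)$x[, 1]
ordering <- names(sort(pc1))
ordering
```

Once the k categories are ordered, only the k-1 splits between adjacent categories need to be evaluated, exactly as for a continuous covariate. The sign of a principal component is arbitrary, so the ordering may come out reversed; this does not change the set of candidate splits.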
Usage
spline_split(y, wt, x, parms = NULL, continuous)
Arguments
y |
The responses at this node |
wt |
Used to weight observations differently. Required by rpart, but not supported by splinetree, so its value will always be NULL. |
x |
The data for a particular covariate |
parms |
rpart's custom split functionality allows optional parameters to be passed through the splitting functions. In the splinetree package, the parms parameter is used to hold a list of length 1 or 2 containing either just a spline basis matrix (for a tree), or a spline basis matrix and the probability that a variable will be selected at a split (for a random forest). |
continuous |
Value is handled internally by rpart - tells us if this covariate is continuous (TRUE) or categorical (FALSE). |
Value
A list with two components, goodness and direction, describing the goodness of fit and direction for each possible split for this covariate.
The goodness component holds the utility of the split (projected sum of squares) for each possible split.
If the continuous parameter is TRUE, goodness and direction each have length n-1, where n is the length of x.
The ith value of goodness describes the utility of splitting observations 1 to i from i + 1 to n.
The values of direction will be -1 and +1, where -1 suggests that values with y < cutpoint be sent to the left side of the tree, and a value of +1 that values with y < cutpoint be sent to the right. This is not really an important choice; it only matters for tree reading conventions.
If the continuous parameter is FALSE, then the predictor variable x is categorical with k classes and there are potentially almost 2^k different ways to split the node.
When invoking custom split functions, rpart assumes that a reasonable approximation can be computed by first ordering the groups by their first principal component of the average y vector and then using the usual splitting rule on this ordered variable.
In this case, the direction vector has k values giving the ordering of the groups, and the goodness vector has k-1 values giving the utility of the splits.
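To make the shape of the return value in the continuous case concrete, here is a minimal sketch that scores every cut by the reduction in sum of squares of a matrix response. It deliberately uses the plain sum of squares of the coefficient rows, not the projected sum of squares that the package computes with the basis matrix, and the names ss and sketch_split_continuous are hypothetical.

```r
# Sum of squared deviations of each row from the column means.
ss <- function(m) sum(sweep(m, 2, colMeans(m))^2)

# Hypothetical split evaluator for a continuous covariate x and a
# matrix response y (one row per individual).
sketch_split_continuous <- function(y, x) {
  ord <- order(x)
  y <- y[ord, , drop = FALSE]
  n <- nrow(y)
  total <- ss(y)
  goodness <- vapply(seq_len(n - 1), function(i) {
    left <- y[1:i, , drop = FALSE]
    right <- y[(i + 1):n, , drop = FALSE]
    total - ss(left) - ss(right)  # reduction in sum of squares for this cut
  }, numeric(1))
  list(goodness = goodness, direction = rep(-1, n - 1))
}

# Two clearly separated groups of five individuals each.
y <- rbind(matrix(0, 5, 2), matrix(10, 5, 2))
x <- 1:10
res <- sketch_split_continuous(y, x)
which.max(res$goodness)
```

Both components have length n - 1, one entry per candidate cut between consecutive ordered observations, and the largest goodness falls at the cut that separates the two groups.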
Custom rpart split function for spline random forests
Description
Wrapper for split function required for the random forest functionality. This function is called once per covariate at each potential split. Implements the random selection of variables; each variable is randomly selected to be included or excluded.
Usage
splineforest_split(y, wt, x, parms = NULL, continuous)
Arguments
y |
the responses at this node |
wt |
the weight of the responses |
x |
the X data for this covariate |
parms |
the basis matrix for the spline and the proportion of variables randomly sampled (diceProb) |
continuous |
value is handled internally by rpart - tells us if this covariate is continuous or categorical (factor). |
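The random inclusion or exclusion of a covariate can be sketched as a thin wrapper around any split evaluator. In the sketch below, sketch_forest_split, toy_split and dice_prob are hypothetical names, and returning zero goodness for an excluded covariate is an assumption about how exclusion could be signalled, not a statement about the package internals.

```r
# Toy univariate split evaluator: reduction in sum of squares for each cut.
toy_split <- function(y, x) {
  n <- length(y)
  total <- sum((y - mean(y))^2)
  goodness <- vapply(seq_len(n - 1), function(i) {
    left <- y[1:i]
    right <- y[(i + 1):n]
    total - sum((left - mean(left))^2) - sum((right - mean(right))^2)
  }, numeric(1))
  list(goodness = goodness, direction = rep(-1, n - 1))
}

# Hypothetical wrapper: with probability dice_prob the covariate is
# evaluated; otherwise every cut gets zero utility so it is never chosen.
sketch_forest_split <- function(y, x, dice_prob, base_split) {
  if (runif(1) < dice_prob) {
    base_split(y, x)
  } else {
    n <- length(x)
    list(goodness = rep(0, n - 1), direction = rep(-1, n - 1))
  }
}

y <- c(1, 2, 1, 9, 10, 11)
x <- 1:6
excluded <- sketch_forest_split(y, x, dice_prob = 0, base_split = toy_split)
included <- sketch_forest_split(y, x, dice_prob = 1, base_split = toy_split)
```

Because the draw happens on every call, a covariate that is skipped at one node can still be selected at another node of the same tree.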
Plots a splinetree.
Description
Creates a two paneled plot of a splinetree that shows both the tree and the trajectories side by side. Note that this function has trouble when the plot window is not wide enough. If nothing shows up in RStudio, try increasing the size of the plot window and trying again. For a tree without an intercept, intercepts are estimated after-the-fact for each node using the average starting value in the data so that the plotted trajectories have reasonable response values.
Usage
stPlot(model, colors = NULL)
Arguments
model |
A model returned from splineTree() |
colors |
A list of colors that will be used for the trajectories (if NULL, colors are selected automatically from the rainbow color scheme). |
Examples
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, idvar = "ID",
data = nlsySample, degree = 1, df = 3,
intercept = TRUE, cp = 0.005)
stPlot(tree, colors = c("red", "orange", "green", "blue", "cyan", "magenta"))
Print a spline tree in the style of print.rpart
Description
The printout provides numbered labels for the terminal nodes, a description of the split at each node, the number of observations found at each node, and the predicted spline coefficients for each node. This code is primarily taken from rpart base code for print.rpart. It has been modified to ensure that the full vector of coefficients is printed for each node.
Usage
stPrint(t, cp, digits = getOption("digits"))
Arguments
t |
A model returned by splineTree() |
cp |
Optional. If provided, a pruned version of the tree will be printed. The tree will be pruned using the provided cp as the complexity parameter. |
digits |
Specifies how many digits of each coefficient should be printed |
Examples
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, idvar = "ID",
data = nlsySample, degree = 1, df = 3,
intercept = TRUE, cp = 0.005)
stPrint(tree)
Prints a summary of a terminal node in a tree
Description
If no argument is provided for the parameter node, summaries are printed for every terminal node. Otherwise, the summary of just the requested node is printed.
Usage
terminalNodeSummary(tree, node = NULL)
Arguments
tree |
A model returned by splineTree(). |
node |
The number of the node that you want summarized. To see which nodes correspond to which numbers, see stPrint(tree) or treeSummary(tree). If this parameter is provided, must correspond to a valid terminal node in the tree. |
Examples
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, idvar = "ID",
data = nlsySample, degree = 1, df = 3,
intercept = TRUE, cp = 0.005)
terminalNodeSummary(tree)
Sample tree used in examples
Description
Sample tree used in examples
Usage
tree
Format
An object of class rpart of length 14.
Given a list of node numbers, returns the depth at which these appear in the tree.
Description
Used in printing and plotting. Source: rpart
Usage
tree.depth(nodes)
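The relationship between node numbers and depth can be sketched as follows, assuming rpart's node-numbering convention in which the root is node 1 and node k has children 2k and 2k + 1. The function name sketch_depth is hypothetical, and the sketch reports depth relative to the root, which may differ from how the package function normalises its output.

```r
# Depth of each node under the root = 1, children = 2k and 2k + 1 numbering.
# The small offset guards against floating-point error in log2().
sketch_depth <- function(nodes) {
  floor(log2(nodes) + 1e-7)
}

sketch_depth(c(1, 2, 3, 4, 7, 14))
```

Under this numbering, all nodes from 2^d to 2^(d + 1) - 1 sit at depth d, which is why a base-2 logarithm recovers the depth.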
Returns a measure of how similar the two trees are.
Description
Computes the Adjusted Rand Index of the clusterings of the population created by the two trees. In the case of correlated covariates, two trees that split on entirely different variables may actually describe similar partitions of the population. This metric allows us to detect when two trees are partitioning the population similarly. A value close to 1 indicates a similar clustering.
Usage
treeSimilarity(tree1, tree2)
Arguments
tree1 |
a model returned from splineTree() |
tree2 |
a model returned from splineTree() |
Value
The Adjusted Rand Index of the clusterings created by the two trees.
See Also
mclust::adjustedRandIndex
Examples
splitForm <- ~SEX+Num_sibs+HGC_MOTHER+HGC_FATHER
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),]
tree1 <- splineTree(splitForm, BMI~AGE, "ID", nlsySubset, degree=1, df=2, intercept=FALSE, cp=0.005)
tree2 <- splineTree(splitForm, BMI~AGE, "ID", nlsySubset, degree=1, df=3, intercept=TRUE, cp=0.005)
treeSimilarity(tree1, tree2)
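For readers who want to see what the Adjusted Rand Index measures, the following base R sketch computes it from two vectors of cluster labels (for example, the terminal node each individual falls into under each tree). The function name adjusted_rand and the label vectors are made up for illustration; the package itself relies on mclust::adjustedRandIndex.

```r
# Adjusted Rand Index from the contingency table of two labelings.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  pairs_both <- sum(choose(tab, 2))         # pairs together in both labelings
  pairs_a <- sum(choose(rowSums(tab), 2))   # pairs together in labeling a
  pairs_b <- sum(choose(colSums(tab), 2))   # pairs together in labeling b
  expected <- pairs_a * pairs_b / choose(n, 2)
  (pairs_both - expected) / ((pairs_a + pairs_b) / 2 - expected)
}

# Same partition under different node labels.
leaves1 <- c(2, 2, 2, 3, 3, 3)
leaves2 <- c(5, 5, 5, 4, 4, 4)
same <- adjusted_rand(leaves1, leaves2)

# Partitions that cut across each other.
crossed <- adjusted_rand(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

The index depends only on which individuals are grouped together, not on the node numbers or on the variables used to form the groups, which is what makes it suitable for comparing trees built on correlated covariates.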
Returns number of terminal nodes in a tree.
Description
Returns number of terminal nodes in a tree.
Usage
treeSize(model)
Arguments
model |
A model returned by splineTree(). Also works on any rpart object. |
Value
The number of terminal nodes in the tree
Examples
## Not run:
split_formula <- ~ HISP + WHITE + BLACK + SEX + HGC_FATHER + HGC_MOTHER + Num_sibs
tree <- splineTree(split_formula, BMI~AGE, 'ID', nlsySample, degree=1,
df=3, intercept=TRUE, cp=0.006, minNodeSize=20)
## End(Not run)
treeSize(tree)
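Since the function also works on any rpart object, the idea can be illustrated on an ordinary rpart fit. The sketch below counts rows of the tree frame marked as leaves; it is an assumption about how such a count can be obtained, not a copy of the package code, and it uses the built-in mtcars data purely as a stand-in.

```r
library(rpart)

# Ordinary regression tree on a built-in dataset.
fit <- rpart(mpg ~ ., data = mtcars)

# Terminal nodes are the frame rows whose split variable is "<leaf>".
n_leaves <- sum(fit$frame$var == "<leaf>")
n_leaves
```

The same number can be cross-checked with length(unique(fit$where)), since fit$where records the leaf that each observation lands in.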
Returns the tree frame.
Description
Provides a similar output to model$frame, but with the redundant information of yval and yval2 removed. Also omits the deviance, the complexity, and the weight. Useful for viewing node numbers and for extracting coefficients for a given node.
Usage
treeSummary(model)
Arguments
model |
A model built with splineTree() |
Value
A dataframe. The number of rows is the same as the number of nodes in the tree. The row names display the node labels of each node. The "var" attribute either displays the split variable selected at each node, or <leaf> if this node is a terminal node. The "n" attribute displays the number of individuals in the node. The "dev" attribute reports the projected sum of squares at this node; terminal nodes have the smallest values for "dev" because this is what the tree building process is supposed to minimize. The "coeffs" attribute displays the coefficients predicted for each node.
Examples
nlsySubset <- nlsySample[nlsySample$ID %in% sample(unique(nlsySample$ID), 400),]
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, idvar = "ID",
data = nlsySubset, degree = 1, df = 3,
intercept = TRUE, cp = 0.005)
treeSummary(tree)
Random Forest Variable Importance based on spline coefficients
Description
Returns the random forest variable importance based on the permutation accuracy measure, which is calculated as the difference in mean squared error between predictions made on the original data and predictions made after randomly permuting the values of a variable.
Usage
varImpCoeff(forest, removeIntercept = TRUE, method = "oob")
Arguments
forest |
a random forest, generated from splineForest() |
removeIntercept |
a boolean value, TRUE if you want to exclude the intercept in the calculations, FALSE otherwise. |
method |
the method to be used. This must be one of "oob" (out of bag), "all", or "itb" (in the bag). |
Value
a matrix of variable importance metrics.
Examples
importanceMatrix <- varImpCoeff(forest, removeIntercept=TRUE)
Random Forest Variable Importance based on Y
Description
Returns the random forest variable importance based on the permutation accuracy measure, which is calculated as the difference in mean squared error between predictions made on the original data and predictions made after randomly permuting the values of a variable.
Usage
varImpY(forest, method = "oob")
Arguments
forest |
a random forest, generated from splineForest() |
method |
the method to be used. This must be one of "oob" (out of bag), "all", or "itb" (in the bag). |
Details
The "method" parameter deals with the way in which forest performance should be measured. Since variable importance is based on a change in performance, the "method" parameter is necessary for a variable importance measure. The choices are "oob" (out of bag), "all", or "itb" (in the bag).
Value
A matrix storing variable importance metrics. The rows correspond to split variables. The columns are different methods of measuring importance. The first column is the absolute importance (mean difference in performance between permuted and unpermuted datasets). The second column measures the mean percent difference in performance. The third column standardizes the differences by dividing them by their standard deviation.
Examples
importanceMatrix <- varImpY(forest, method="oob")
plotImp(importanceMatrix[,3])
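The permutation idea behind the importance matrix can be sketched without a forest at all. The example below uses a linear model on the built-in mtcars data as a stand-in predictor, so the names fit, mse, n_perm and importance, and the exact scaling of the percent column, are assumptions for illustration only; the package applies the same idea to spline forests with its own "oob", "itb" and "all" logic.

```r
set.seed(42)

# Stand-in model; the package would use a spline forest here.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
mse <- function(data) mean((data$mpg - predict(fit, newdata = data))^2)
base_mse <- mse(mtcars)

vars <- c("wt", "hp", "qsec")
n_perm <- 50  # number of random permutations per variable

importance <- t(sapply(vars, function(v) {
  diffs <- replicate(n_perm, {
    permuted <- mtcars
    permuted[[v]] <- sample(permuted[[v]])  # break the link between v and mpg
    mse(permuted) - base_mse
  })
  c(absolute = mean(diffs),
    percent = 100 * mean(diffs) / base_mse,
    standardized = mean(diffs) / sd(diffs))
}))
importance
```

Rows correspond to variables and columns to the three ways of expressing the change in error, mirroring the layout described in the Value section.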
Computes percent of variation in response explained by spline tree.
Description
Computes the percentage of variation in response explained by the spline tree. This metric is only meaningful if model$intercept==TRUE. If the tree includes an intercept, the measure will be between 0 and 1.
Usage
yR2(model)
Arguments
model |
a model created with splineTree() |
Value
An R^2 goodness measure, computed as 1-SSE/SST, where SSE is the sum of squared errors between predicted responses and true responses, and SST is the sum of squared errors of the true responses around the population mean. Note that if the tree passed in was built without an intercept, this function will return NULL.
Examples
split_formula <- ~HISP + WHITE + BLACK + SEX + Num_sibs + HGC_FATHER + HGC_MOTHER
tree <- splineTree(split_formula, BMI~AGE, idvar = "ID",
data = nlsySample, degree = 1, df = 3,
intercept = TRUE, cp = 0.005)
yR2(tree)
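The 1-SSE/SST calculation itself is short enough to write out. The sketch below uses made-up response values and a hypothetical helper name r_squared; it only shows the arithmetic, not how the package obtains predicted responses from a tree.

```r
# R^2 as one minus the ratio of residual to total sum of squares.
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

actual <- c(20, 22, 25, 30)
perfect <- r_squared(actual, predicted = actual)
mean_only <- r_squared(actual, predicted = rep(mean(actual), 4))
```

A perfect fit gives 1, and predicting the overall mean for everyone gives 0, which is why the measure is only interpretable when the predictions carry level information (an intercept).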
Computes a level-based evaluation metric for a splineforest that was built WITH an intercept.
Description
Computes the R-squared metric for a spline forest. The goal is to see how well the predicted response values match the actual response values. Note that this function should only be used on forests where the intercept parameter is TRUE. The metric is a simple 1-SSE/SST calculation.
Usage
yR2Forest(forest, method = "oob")
Arguments
forest |
The output from a call to splineForest() |
method |
How would you like to compute this metric? The choices are "oob", "itb", or "all". "oob" means that predictions for a datapoint can only be made using trees for which that datapoint was "out of the bag" (not in the random subsample). "all" means that all trees are used in the prediction for every datapoint. "itb" means that predictions for a datapoint are made using only the trees for which this datapoint was IN the random subsample. |
Value
Returns 1-SSE/SST, where SSE is the total sum of squared errors of the true responses and predicted responses, and SST is the total sum of squared errors of the responses around their mean. If this forest was not built with an intercept, returns NULL.
Examples
yR2Forest(forest, method="all")
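The difference between the three methods comes down to which trees contribute to each datapoint's prediction. The sketch below assumes two hypothetical inputs, a matrix preds of per-tree predictions and a logical matrix in_bag marking subsample membership; these names and the helper aggregate_preds are illustrative and do not reflect how a splineForest() object stores this information.

```r
# preds[i, t]  : prediction of tree t for datapoint i (hypothetical input)
# in_bag[i, t] : TRUE when datapoint i was in tree t's random subsample
aggregate_preds <- function(preds, in_bag, method = c("oob", "itb", "all")) {
  method <- match.arg(method)
  use <- switch(method,
                oob = !in_bag,
                itb = in_bag,
                all = matrix(TRUE, nrow(preds), ncol(preds)))
  # Average each row over the trees that are allowed to vote.
  rowSums(preds * use) / rowSums(use)
}

preds <- matrix(c(1, 2, 3,
                  4, 5, 6), nrow = 2, byrow = TRUE)
in_bag <- matrix(c(TRUE, FALSE, FALSE,
                   TRUE, TRUE,  FALSE), nrow = 2, byrow = TRUE)

oob_pred <- aggregate_preds(preds, in_bag, "oob")
itb_pred <- aggregate_preds(preds, in_bag, "itb")
all_pred <- aggregate_preds(preds, in_bag, "all")
```

In this sketch a datapoint with no eligible trees under the chosen method would get NaN, so a real implementation would need to handle that case explicitly. The aggregated predictions would then feed into the 1-SSE/SST calculation.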