Type: | Package |
Title: | Linear Model with Tree-Based Lasso Regularization for Rare Features |
Version: | 0.1.2 |
Description: | Implementation of an alternating direction method of multipliers algorithm for fitting a linear model with tree-based lasso regularization, which is proposed in Algorithm 1 of Yan and Bien (2020) <doi:10.1080/01621459.2020.1796677>. The package allows efficient model fitting on the entire 2-dimensional regularization path for large datasets. The complete set of functions also makes the entire process of tuning regularization parameters and visualizing results hassle-free. |
Depends: | R (≥ 3.2.1) |
Imports: | Matrix, glmnet, Rcpp |
Suggests: | knitr, dendextend, rmarkdown |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
VignetteBuilder: | knitr |
RoxygenNote: | 6.1.0 |
LinkingTo: | Rcpp, RcppArmadillo |
URL: | https://github.com/yanxht/rare |
BugReports: | https://github.com/yanxht/rare/issues |
NeedsCompilation: | yes |
Packaged: | 2025-06-15 22:55:55 UTC; xy257 |
Author: | Xiaohan Yan [aut, cre], Jacob Bien [aut] |
Maintainer: | Xiaohan Yan <xy257@cornell.edu> |
Repository: | CRAN |
Date/Publication: | 2025-06-16 05:10:02 UTC |
Model path for tree-based lasso framework for selecting rare features
Description
The package fits the linear model with tree-based lasso regularization proposed in Yan and Bien (2018) using an alternating direction method of multipliers (ADMM) algorithm, given in Algorithm 1 of the same paper. The package also provides tools for tuning the regularization parameters, making predictions from the fitted model, and visualizing recovered groups of covariates in a dendrogram.
Details
Its main functions are rarefit, rarefit.cv, rarefit.predict, group.recover and group.plot.
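A minimal sketch of the end-to-end workflow on the bundled TripAdvisor data (fitting on all 500 reviews for brevity; see the Examples sections below for proper train/test splits):
library(rare)
# Fit the model over the (lambda, alpha) regularization grid
fit <- rarefit(y = data.rating, X = data.dtm, hc = data.hc, nlam = 20, nalpha = 10)
# Choose (lambda, alpha) by cross validation
fit.cv <- rarefit.cv(fit, y = data.rating, X = data.dtm)
# Predict, recover groups and visualize at the selected parameters
pred <- rarefit.predict(fit, fit.cv, data.dtm)
ib <- fit.cv$ibest
beta.best  <- fit$beta[[ib[2]]][, ib[1]]
gamma.best <- fit$gamma[[ib[2]]][, ib[1]]   # works if ib[2] > 1
groups <- group.recover(gamma.best, fit$A)
group.plot(beta.best, gamma.best, fit$A, data.hc)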
Author(s)
Xiaohan Yan xy257@cornell.edu, Jacob Bien
References
Yan, X. and Bien, J. (2018) Rare Feature Selection in High Dimensions, https://arxiv.org/abs/1803.06675.
Document-term matrix for adjectives in TripAdvisor hotel reviews
Description
A 500-by-200 document-term matrix for 200 adjectives appearing in 500 TripAdvisor reviews. The document-term matrix is in sparse format.
Usage
data.dtm
Format
An object of class dgCMatrix with 500 rows and 200 columns.
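A quick way to inspect the matrix (a sketch):
library(rare)
dim(data.dtm)              # 500 200
class(data.dtm)            # "dgCMatrix" (sparse, from package Matrix)
Matrix::nnzero(data.dtm)   # number of non-zero counts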
See Also
data.rating for the review ratings and data.hc for the hierarchical clustering tree of the adjectives.
Hierarchical clustering tree for adjectives in TripAdvisor data set
Description
An hclust tree for the 200 adjectives appearing in the TripAdvisor reviews. The tree was generated from 100-dimensional word embeddings for the adjectives, pre-trained by GloVe (Pennington et al., 2014) on the Gigaword5 and Wikipedia2014 corpora.
Usage
data.hc
Format
An object of class hclust
of length 7.
Source
Embeddings available at http://nlp.stanford.edu/data/glove.6B.zip
References
Pennington, J., Socher, R., and Manning, C. D. (2014). Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
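A sketch for viewing the clustering structure:
library(rare)
plot(data.hc, labels = FALSE, main = "GloVe-based clustering of 200 adjectives")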
TripAdvisor hotel review ratings
Description
A length-500 vector of TripAdvisor review ratings on a scale of 1 to 5.
Usage
data.rating
Format
An object of class integer of length 500.
Source
TripAdvisor Data Set used in https://www.cs.virginia.edu/~hw5x/paper/rp166f-wang.pdf
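A sketch of the rating distribution:
library(rare)
table(data.rating)   # counts of reviews at each rating from 1 to 5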
Find all descendant leaves of a node in an hclust tree
Description
The function recursively finds all leaves that are descendants of a node in an hclust tree.
Usage
find.leaves(ind, merge)
Arguments
ind |
Index of the tree node. Following the hclust convention, a negative value -j denotes the jth leaf, and a positive value k denotes the interior node formed at the kth merge step in constructing the tree. |
merge |
A (p-1)-by-2 matrix, as in the merge component of an hclust object with p leaves, encoding the order of merges used to construct the tree. |
Value
Returns a vector of indices of descendant leaves in the leaf set {1, ..., p}. Unlike the notation used in ind, positive integers denote leaves here.
Examples
## Not run:
hc <- hclust(dist(USArrests), "ave")
# Descendant leaves of the 10th leaf (should be itself)
find.leaves(-10, hc$merge)
# Descendant leaves of the 10th interior node
find.leaves(10, hc$merge)
# Descendant leaves of the root (should be all leaves)
ind_root <- nrow(hc$merge)
all.equal(find.leaves(ind_root, hc$merge), hc$order)
## End(Not run)
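The sign convention for ind mirrors that of the merge matrix in an hclust object; this sketch makes the convention visible:
hc <- hclust(dist(USArrests), "ave")
head(hc$merge)   # negative entries denote leaves; positive entries denote earlier merge steps
# e.g., row k of hc$merge lists the two children joined at interior node k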
Visualize groups by coloring branches and leaves of an hclust tree
Description
The function plots an hclust tree with branches and leaves colored based on group membership. The groups span the covariate indices {1, ..., nvars}. Covariates from the same group share an equal coefficient (beta), and sibling groups have different coefficients. The function determines the groups based on the sparsity in gamma. In an hclust tree with beta[i] on the ith leaf, the branch and leaf are colored blue, red or gray according to whether beta[i] is positive, negative or zero, respectively. The larger the magnitude of beta[i], the darker the color. Branches and leaves from the same group therefore share the same color.
Usage
group.plot(beta, gamma, A, hc, nbreaks = 20)
Arguments
beta |
Length-nvars vector of fitted covariate coefficients. |
gamma |
Length-nnodes vector of latent variables corresponding to the tree nodes. |
A |
nvars-by-nnodes binary matrix encoding ancestor-descendant relationships between leaves and nodes in the tree. |
hc |
An hclust tree of nvars leaves, where each leaf corresponds to a covariate. |
nbreaks |
Number of breaks used when binning the magnitudes of beta for coloring (default is 20). |
Examples
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc, lam.min.ratio = 1e-6,
nlam = 20, nalpha = 10, rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Cross validation
ourfit.cv <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Extract beta and gamma at the CV-selected (lambda, alpha)
ibest.lambda <- ourfit.cv$ibest[1]
ibest.alpha <- ourfit.cv$ibest[2]
beta.opt <- ourfit$beta[[ibest.alpha]][, ibest.lambda]
gamma.opt <- ourfit$gamma[[ibest.alpha]][, ibest.lambda] # works if ibest.alpha > 1
# Visualize the groups at optimal beta and gamma
group.plot(beta.opt, gamma.opt, ourfit$A, data.hc)
## End(Not run)
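Since the colors encode the sign and magnitude of beta, a quick tabulation of the fitted coefficients from the example above (a sketch) shows how many leaves will be colored red, gray or blue:
table(sign(beta.opt))   # -1: negative (red), 0: zero (gray), 1: positive (blue)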
Recover aggregated groups of leaf indices
Description
The function finds aggregated groups of leaf indices by traversing the non-zero gamma elements and collecting the descendant leaves at each such element. In our problem, gamma are latent variables corresponding to tree nodes. The traversal is in post-order, i.e., a node is visited after its descendants.
Usage
group.recover(gamma, A, postorder = seq(ncol(A)))
Arguments
gamma |
Length-nnodes vector of latent variables corresponding to the tree nodes. |
A |
nvars-by-nnodes binary matrix encoding ancestor-descendant relationships between leaves and nodes in the tree. |
postorder |
Length-nnodes vector giving a post-order traversal of the tree nodes. The default seq(ncol(A)) is a valid post-order for matrices generated by tree.matrix, whose leaf columns come before the interior-node columns that follow the hclust merge order. |
Value
Returns a list of recovered groups of leaf indices.
Examples
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc, lam.min.ratio = 1e-6,
nlam = 20, nalpha = 10, rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Cross validation
ourfit.cv <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Group recovered at optimal beta and gamma
ibest.lambda <- ourfit.cv$ibest[1]
ibest.alpha <- ourfit.cv$ibest[2]
gamma.opt <- ourfit$gamma[[ibest.alpha]][, ibest.lambda] # works if ibest.alpha > 1
groups.opt <- group.recover(gamma.opt, ourfit$A)
## End(Not run)
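Conceptually, the recovery step can be written as the following sketch, a plain-R illustration of the post-order traversal described above (not the package's actual implementation; it assumes A has leaves in rows, as generated by tree.matrix):
group.recover.sketch <- function(gamma, A, postorder = seq(ncol(A))) {
  groups <- list()
  remaining <- seq(nrow(A))                 # leaves not yet assigned to a group
  for (j in postorder) {
    if (gamma[j] != 0) {                    # only non-zero gamma elements form groups
      desc <- which(A[, j] == 1)            # descendant leaves of node j
      grp <- intersect(desc, remaining)     # deeper (already visited) nodes keep their leaves
      if (length(grp) > 0) {
        groups[[length(groups) + 1]] <- grp
        remaining <- setdiff(remaining, grp)
      }
    }
  }
  groups
}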
Fit the rare feature selection model
Description
Fit the rare feature selection model proposed in Yan and Bien (2018):
min_{\beta, \gamma} 0.5 * ||y - X\beta - \beta_0 1_n||_2^2 + \lambda * (\alpha * ||\gamma_{-root}||_1 + (1 - \alpha) * ||\beta||_1)
using an alternating direction method of multipliers (ADMM) algorithm
described in Algorithm 1 of the same paper.
The regularization path is computed over a two-dimensional grid of regularization parameters: lambda and alpha. Of the two, lambda controls the overall amount of regularization, and alpha controls the tradeoff between sparsity and fusion of \beta (a larger alpha induces more fusion in \beta).
Usage
rarefit(y, X, A = NULL, Q = NULL, hc, intercept = T, lambda = NULL,
alpha = NULL, nlam = 50, lam.min.ratio = 1e-04, nalpha = 10,
rho = 0.01, eps1 = 1e-06, eps2 = 1e-05, maxite = 1e+06)
Arguments
y |
Length-nobs response vector. |
X |
nobs-by-nvars input matrix; each row is an observation and each column a covariate. |
A |
nvars-by-nnodes binary matrix encoding ancestor-descendant relationships between leaves and nodes in the tree. If NULL (the default), it is generated from hc (see tree.matrix). |
Q |
Matrix with columns forming an orthonormal basis for the null space of the linear constraint beta = A %*% gamma used by the ADMM algorithm. If NULL (the default), it is computed from A. |
hc |
An hclust tree of nvars leaves, where each leaf corresponds to a covariate. |
intercept |
Whether an intercept should be fitted (default = TRUE) or set to zero (FALSE). |
lambda |
A user-supplied lambda sequence. The typical usage is to leave it NULL, in which case the function computes its own sequence of nlam values. |
alpha |
A user-supplied alpha sequence. If NULL (the default), a sequence of nalpha values equally spaced between 0 and 1 is used. |
nlam |
Number of lambda values (default is 50). |
lam.min.ratio |
Smallest value for lambda, as a fraction of the largest value in the lambda sequence (default is 1e-4). |
nalpha |
Number of alpha values (default is 10). |
rho |
Penalty parameter for the quadratic penalty in the ADMM algorithm. The default value is 1e-2. |
eps1 |
Convergence threshold in terms of the absolute tolerance level for the ADMM algorithm. The default value is 1e-6. |
eps2 |
Convergence threshold in terms of the relative tolerance level for the ADMM algorithm. The default value is 1e-5. |
maxite |
Maximum number of passes over the data for each (lambda, alpha) pair. The default value is 1e6. |
Details
The function splits the model fitting path by alpha. At each alpha value, the model is fit on the entire lambda sequence with warm starts. We recommend including an intercept (by setting intercept = TRUE) unless the input data have been centered.
Value
Returns the regression coefficients beta and gamma and the intercept beta0. We use a matrix-nested-within-list structure to store the coefficients: each list item corresponds to an alpha value; the matrix (or vector) in that list item stores coefficients at the various lambda values by columns (or entries).
beta0 |
Length-nalpha list; each item is a length-nlam vector of fitted intercepts along the lambda sequence. |
beta |
Length-nalpha list; each item is an nvars-by-nlam matrix whose columns are fitted beta vectors along the lambda sequence. |
gamma |
Length-nalpha list; each item is an nnodes-by-nlam matrix whose columns are fitted gamma vectors along the lambda sequence. |
lambda |
Sequence of lambda values used in the model fit. |
alpha |
Sequence of alpha values used in the model fit. |
A |
Binary matrix encoding ancestor-descendant relationships between leaves and nodes in the tree. |
Q |
Matrix with columns forming an orthonormal basis for the null space of the linear constraint beta = A %*% gamma. |
intercept |
Whether an intercept is included in the model fit. |
References
Yan, X. and Bien, J. (2018) Rare Feature Selection in High Dimensions, https://arxiv.org/abs/1803.06675.
See Also
rarefit.cv for cross validation; rarefit.predict for making predictions.
Examples
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc, lam.min.ratio = 1e-6,
nlam = 20, nalpha = 10, rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
## End(Not run)
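Given the matrix-nested-within-list structure of the returned coefficients, values at the jth alpha and ith lambda can be extracted as follows (a sketch continuing the example above; the grid indices are hypothetical):
j <- 3; i <- 10                      # hypothetical positions in the (lambda, alpha) grid
beta.ji  <- ourfit$beta[[j]][, i]    # length-nvars coefficient vector
beta0.ji <- ourfit$beta0[[j]][i]     # scalar intercept
gamma.ji <- ourfit$gamma[[j]][, i]   # length-nnodes latent variables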
Perform K-fold cross validation
Description
The function performs K-fold cross validation (CV) to choose the pair of (lambda, alpha) values at which the model performs best, according to the chosen error metric: mean squared error or mean absolute error.
Usage
rarefit.cv(fitObj, y, X, errtype = "mean-squared-error", nfolds = 5, ...)
Arguments
fitObj |
Output of rarefit. |
y |
Response variable. |
X |
nobs-by-nvars input matrix, as used in rarefit. |
errtype |
Type of error metric used in cross validation. Available choices are "mean-squared-error" (default) and "mean-absolute-error". |
nfolds |
Number of folds (default is 5). |
... |
Other arguments that can be passed to rarefit. |
Value
folds |
A length-nfolds list; the kth item gives the indices of the observations in the kth fold. |
errs |
An nlam-by-nalpha-by-nfolds array of CV errors; errs[i, j, k] is the error at the ith lambda and jth alpha value on the kth fold. |
m |
An nlam-by-nalpha matrix of CV errors averaged across folds. |
se |
An nlam-by-nalpha matrix of standard errors of the CV errors across folds. |
ibest |
Indices of the pair of (lambda, alpha) values minimizing the mean CV error. |
lambda.best |
Value of lambda in the best pair. |
alpha.best |
Value of alpha in the best pair. |
See Also
rarefit for model fitting; rarefit.predict for making predictions.
Examples
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc, lam.min.ratio = 1e-6,
nlam = 20, nalpha = 10, rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Cross validation
ourfit.cv <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
## End(Not run)
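The returned CV surface can be inspected directly (a sketch continuing the example above):
ib <- ourfit.cv$ibest                            # (lambda index, alpha index) of the best pair
ourfit.cv$lambda.best                            # equals ourfit$lambda[ib[1]]
ourfit.cv$alpha.best                             # equals ourfit$alpha[ib[2]]
ourfit.cv$m[ib[1], ib[2]] == min(ourfit.cv$m)    # the best pair minimizes the mean CV error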
Make predictions from a rarefit object and a rarefit.cv object
Description
The function makes predictions using a rarefit object at the optimal (lambda, alpha) pair chosen by rarefit.cv.
Usage
rarefit.predict(fitObj, cvObj, newx)
Arguments
fitObj |
Output of rarefit. |
cvObj |
Output of rarefit.cv. |
newx |
Matrix of new values for x at which predictions are made. |
Value
Returns a sequence of predictions.
See Also
rarefit for model fitting; rarefit.cv for cross validation.
Examples
## Not run:
# See vignette for more details.
set.seed(100)
ts <- sample(1:length(data.rating), 400) # Train set indices
# Fit the model on train set
ourfit <- rarefit(y = data.rating[ts], X = data.dtm[ts, ], hc = data.hc, lam.min.ratio = 1e-6,
nlam = 20, nalpha = 10, rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Cross validation
ourfit.cv <- rarefit.cv(ourfit, y = data.rating[ts], X = data.dtm[ts, ],
rho = 0.01, eps1 = 1e-5, eps2 = 1e-5, maxite = 1e4)
# Prediction on test set
pred <- rarefit.predict(ourfit, ourfit.cv, data.dtm[-ts, ])
pred.error <- mean((pred - data.rating[-ts])^2)
## End(Not run)
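Prediction takes the usual linear-model form, newx %*% beta + beta0, at the CV-selected coefficients. A manual sketch continuing the example above (not necessarily the package's exact computation):
ib <- ourfit.cv$ibest
beta.best  <- ourfit$beta[[ib[2]]][, ib[1]]
beta0.best <- ourfit$beta0[[ib[2]]][ib[1]]
pred.manual <- as.vector(data.dtm[-ts, ] %*% beta.best + beta0.best)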
Generate matrix A encoding ancestor-descendant relationships in an hclust tree
Description
The function generates the binary matrix A defined in Yan and Bien (2018). The matrix encodes ancestor-descendant relationships between leaves and tree nodes in an hclust tree.
Usage
tree.matrix(hc)
Arguments
hc |
An hclust tree of nvars leaves, where each leaf corresponds to a covariate. |
Value
Returns an nvars-by-nnodes binary matrix A, where nvars is the number of leaves (we associate each covariate with a leaf) and nnodes is the number of tree nodes (including both leaves and interior nodes). For an hclust tree, nnodes = 2*nvars-1. A[i,j] is 1 if the ith leaf is a descendant of the jth node in the tree, and 0 otherwise. By default, the first nvars columns correspond to leaves and the remaining nvars-1 columns correspond to interior nodes. A is in sparse matrix format (inheriting from class sparseMatrix in package Matrix).
References
Yan, X. and Bien, J. (2018) Rare Feature Selection in High Dimensions, https://arxiv.org/abs/1803.06675.
See Also
find.leaves for finding descendant leaves of a node.
Examples
## Not run:
# For a perfect binary tree of depth 2 below
#
# 3
# /\
# 1 2
# /\ /\
# -1 -2 -3 -4
#
# A can be expressed as the following:
A_true <- cbind(diag(4),
as.matrix(c(1, 1, 0, 0)),
as.matrix(c(0, 0, 1, 1)),
as.matrix(c(1, 1, 1, 1)))
# Now use tree.matrix to generate A
tree0 <- list()
tree0$merge <- matrix(c(-1, -2, -3, -4, 1, 2),
ncol = 2, byrow = TRUE)
tree0$labels <- c("leaf1", "leaf2", "leaf3", "leaf4")
A <- tree.matrix(tree0)
all(A_true == as.matrix(A))
# Another example
hc <- hclust(dist(USArrests), "ave")
A <- tree.matrix(hc)
## End(Not run)
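Two structural properties of A that follow from the description above can be checked directly (a sketch):
hc <- hclust(dist(USArrests), "ave")
A  <- tree.matrix(hc)
nvars  <- nrow(A); nnodes <- ncol(A)
nnodes == 2 * nvars - 1                          # an hclust tree has 2*nvars-1 nodes
all(as.matrix(A[, seq(nvars)]) == diag(nvars))   # the first nvars columns index the leaves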