Type: | Package |
Title: | ROSE Random Forests for Robust Semiparametric Efficient Estimation |
Version: | 0.1.0 |
Maintainer: | Elliot H. Young <ey244@cam.ac.uk> |
Description: | ROSE (RObust Semiparametric Efficient) random forests for robust semiparametric efficient estimation in partially parametric models (containing generalised partially linear models). Details can be found in the paper by Young and Shah (2024) <doi:10.48550/arXiv.2410.03471>. |
License: | GPL-3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.3 |
Imports: | caret (≥ 6.0.93), glmnet (≥ 4.1.6), keras, mgcv, mlr (≥ 2.19.1), ParamHelpers, ranger (≥ 0.14.1), grf, rpart, stats, tuneRanger (≥ 0.5), xgboost |
Depends: | R (≥ 4.2.0) |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
NeedsCompilation: | no |
Packaged: | 2024-10-07 23:51:35 UTC; elliotyoung |
Author: | Elliot H. Young [aut, cre], Rajen D. Shah [aut] |
Repository: | CRAN |
Date/Publication: | 2024-10-11 08:00:02 UTC |
Print for a rose random forest fitted object
Description
Prints a summary of a roseRF object fitted by the functions roseRF_... in roseRF.
Usage
## S3 method for class 'roseforest'
print(x, ...)
Arguments
x |
a fitted |
... |
additional arguments |
Value
Prints output for a roseRF object.
ROSE random forest estimator for the generalised partially linear model
Description
Estimates the parameter of interest \theta_0 in the generalised partially linear model
g(\mathbb{E}[Y|X,Z]) = X\theta_0 + f_0(Z),
for some (strictly increasing, differentiable) link function g, which can be re-expressed in terms of the ‘nuisance functions’ (\mathbb{E}[X|Z], \mathbb{E}[g(\mathbb{E}[Y|X,Z])|Z]) as
g(\mathbb{E}[Y|X,Z])-\mathbb{E}[g(\mathbb{E}[Y|X,Z])|Z] = (X-\mathbb{E}[X|Z])\theta_0.
Usage
roseRF_gplm(
y_on_xz_formula,
y_on_xz_learner,
y_on_xz_pars = list(),
Gy_on_z_formula,
Gy_on_z_learner,
Gy_on_z_pars = list(),
x_formula,
x_learner,
x_pars = list(),
M1_formula = x_formula,
M1_learner = x_learner,
M1_pars = x_pars,
M2_formula = NA,
M2_learner = NA,
M2_pars = list(),
M3_formula = NA,
M3_learner = NA,
M3_pars = list(),
M4_formula = NA,
M4_learner = NA,
M4_pars = list(),
M5_formula = NA,
M5_learner = NA,
M5_pars = list(),
link = "identity",
data,
K = 5,
S = 1,
max.depth = 10,
num.trees = 500,
min.node.size = max(10, ceiling(0.01 * (K - 1)/K * nrow(data))),
replace = TRUE,
sample.fraction = 0.8
)
Arguments
y_on_xz_formula |
a two-sided formula object describing the model for |
y_on_xz_learner |
a string specifying the regression method to fit the regression as given by |
y_on_xz_pars |
a list containing hyperparameters for the |
Gy_on_z_formula |
a two-sided formula object describing the model for |
Gy_on_z_learner |
a string specifying the regression method to fit the regression as given by |
Gy_on_z_pars |
a list containing hyperparameters for the |
x_formula |
a two-sided formula object describing the model for |
x_learner |
a string specifying the regression method to fit the regression of |
x_pars |
a list containing hyperparameters for the |
M1_formula |
a two-sided formula object for the model |
M1_learner |
a string specifying the regression method for |
M1_pars |
a list containing hyperparameters for the |
M2_formula |
a two-sided formula object for the model |
M2_learner |
a string specifying the regression method for |
M2_pars |
a list containing hyperparameters for the |
M3_formula |
a two-sided formula object for the model |
M3_learner |
a string specifying the regression method for |
M3_pars |
a list containing hyperparameters for the |
M4_formula |
a two-sided formula object for the model |
M4_learner |
a string specifying the regression method for |
M4_pars |
a list containing hyperparameters for the |
M5_formula |
a two-sided formula object for the model |
M5_learner |
a string specifying the regression method for |
M5_pars |
a list containing hyperparameters for the |
link |
link function ( |
data |
a data frame containing the variables for the partially linear model. |
K |
the number of folds used for |
S |
the number of repeats to mitigate the randomness in the estimator on the sample splits used for |
max.depth |
Maximum depth parameter used for ROSE random forests. Default is 10. |
num.trees |
Number of trees used for a single ROSE random forest. Default is 500. |
min.node.size |
Minimum node size of a leaf in each tree. Default is |
replace |
Whether sampling for a single random tree is performed with (bootstrap) or without replacement. Default is |
sample.fraction |
Proportion of data used for each random tree. Default is 0.8. |
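The default for min.node.size in the Usage block depends on both the sample size and the number of folds K. The short sketch below simply evaluates that expression for two sample sizes; the helper name default_min_node_size is hypothetical and not part of the package.

```r
# Hypothetical helper reproducing the default written in the Usage block:
#   max(10, ceiling(0.01 * (K - 1)/K * nrow(data)))
# i.e. 1% of the (K - 1)/K training share of the data, but never below 10.
default_min_node_size <- function(n, K = 5) {
  max(10, ceiling(0.01 * (K - 1) / K * n))
}

small <- default_min_node_size(1000)  # 0.01 * 0.8 * 1000 = 8, so the floor of 10 applies
large <- default_min_node_size(5000)  # 0.01 * 0.8 * 5000 = 40
print(c(small = small, large = large))
```

For small data sets the lower bound of 10 is what matters; the 1% rule only takes over once the training share of the data exceeds 1000 observations.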
Details
The estimator of the parameter of interest \theta_0 solves (in \theta) the estimating equation
\sum_{i}\psi(Y_i,X_i,Z_i; \theta,\hat{\eta}(Z),\hat{w}(Z)) = 0,
\psi(Y,X,Z;\theta,\eta_0,w) := \sum_{j=1}^J w_j(Z) \big( M_j(X) - \mathbb{E}[M_j(X)|Z] \big) g'(\mu(X,Z;\theta,\eta_0)) \big(Y-\mu(X,Z;\theta,\eta_0)\big) ,
\mu(X,Z;\theta,\eta_0) := g^{-1}\big(\mathbb{E}[g(\mathbb{E}[Y|X,Z])|Z] + (X-\mathbb{E}[X|Z])\theta\big),
\eta_0 := \big(\mathbb{E}[g(\mathbb{E}[Y|X,Z])|Z=\cdot], \mathbb{E}[X|Z=\cdot]\big),
where M_1(X),\ldots,M_J(X) denote user-chosen functions of X and w(Z)=\big(w_1(Z),\ldots,w_J(Z)\big) denotes weights estimated via ROSE random forests. The default takes J=1 and M_1(X)=X; if taking J\geq 2, we recommend care in checking the applicability and appropriateness of any additional user-chosen regression tasks.
The parameter of interest \theta_0 is estimated using a DML2 / K-fold cross-fitting framework, to allow for arbitrary (faster than n^{1/4}-consistent) learners for \hat{\eta}, i.e. solving the estimating equation
\sum_{k \in [K]}\sum_{i \in I_k}\psi(Y_i,X_i,Z_i; \theta,\hat{\eta}^{(k)}(Z),\hat{w}^{(k)}(Z)) = 0,
where I_1,\ldots,I_K denotes a partition of the index set for the datapoints (Y_i,X_i,Z_i), \hat{\eta}^{(k)} denotes an estimator for \eta_0 trained on the data indexed by I_k^c, and \hat{w}^{(k)} denotes a ROSE random forest (again trained on the data indexed by I_k^c).
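To make the estimating equation in these Details concrete, the following minimal sketch solves it for a log link in the simplest case J=1, M_1(X)=X and w_1\equiv 1. It is an illustration under stated assumptions, not the package's implementation: the data are simulated, the response is Poisson, the nuisance functions are taken as known (oracle) rather than estimated, and stats::uniroot is used as the root finder.

```r
set.seed(1)
n      <- 2000
theta0 <- 0.5
f0     <- function(z) sin(2 * pi * z)

# Simulated data (an assumption of this sketch): log link, Poisson response
z <- runif(n)
x <- z + rnorm(n, sd = 0.5)                       # so E[X | Z] = Z
y <- rpois(n, lambda = exp(x * theta0 + f0(z)))   # g(E[Y | X, Z]) = X theta0 + f0(Z)

# Oracle nuisance functions, known only because the data are simulated
ex_z  <- z                      # E[X | Z]
egy_z <- z * theta0 + f0(z)     # E[ g(E[Y | X, Z]) | Z ]

g_inv   <- exp                  # inverse of the log link
g_prime <- function(m) 1 / m    # derivative of the log link

# Summed score with J = 1, M_1(X) = X and w_1 = 1
score <- function(theta) {
  mu <- g_inv(egy_z + (x - ex_z) * theta)
  sum((x - ex_z) * g_prime(mu) * (y - mu))
}

# For the log link the summed score is decreasing in theta, so a bracketing
# root finder is enough
theta_hat <- uniroot(score, interval = c(-5, 5))$root
print(theta_hat)
```

In practice the two nuisance functions would be replaced by cross-fitted regression estimates and the constant weight by the forest-based weights described above.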
Value
A list containing:
theta |
The estimator of \theta_0. |
stderror |
Huber robust estimate of the standard error of the \theta_0-estimator. |
coefficients |
Table of \theta_0 coefficient estimator, standard error, z-value and p-value. |
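The entries of the coefficients table are related in the usual way for a Wald-type test. The sketch below rebuilds such a table from a point estimate and a standard error; the numbers are hypothetical, and it is an assumption of this example (not confirmed by the text above) that the reported p-value is two-sided for the null hypothesis \theta_0 = 0.

```r
# Hypothetical values standing in for the theta and stderror list entries
theta    <- 0.5
stderror <- 0.2

z_value <- theta / stderror               # Wald statistic for H0: theta_0 = 0
p_value <- 2 * pnorm(-abs(z_value))       # two-sided normal p-value

coef_table <- cbind(
  Estimate     = theta,
  `Std. Error` = stderror,
  `z value`    = z_value,
  `Pr(>|z|)`   = p_value
)
print(coef_table)
```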
ROSE random forest estimator for the partially linear instrumental variable model
Description
ROSE random forest estimator for the partially linear instrumental variable model
Usage
roseRF_pliv(
y_formula,
y_learner,
y_pars = list(),
x_formula,
x_learner,
x_pars = list(),
IV1_formula = NA,
IV1_learner = NA,
IV1_pars = list(),
IV2_formula = NA,
IV2_learner = NA,
IV2_pars = list(),
IV3_formula = NA,
IV3_learner = NA,
IV3_pars = list(),
IV4_formula = NA,
IV4_learner = NA,
IV4_pars = list(),
IV5_formula = NA,
IV5_learner = NA,
IV5_pars = list(),
data,
K = 5,
S = 1,
max.depth = 10,
num.trees = 500,
min.node.size = max(10, ceiling(0.01 * (K - 1)/K * nrow(data))),
replace = TRUE,
sample.fraction = 0.8
)
Arguments
y_formula |
a two-sided formula object describing the regression model for |
y_learner |
a string specifying the regression method to fit the regression of |
y_pars |
a list containing hyperparameters for the |
x_formula |
a two-sided formula object describing the regression model for |
x_learner |
a string specifying the regression method to fit the regression of |
x_pars |
a list containing hyperparameters for the |
IV1_formula |
a two-sided formula object for the model |
IV1_learner |
a string specifying the regression method for |
IV1_pars |
a list containing hyperparameters for the |
IV2_formula |
a two-sided formula object for the model |
IV2_learner |
a string specifying the regression method for |
IV2_pars |
a list containing hyperparameters for the |
IV3_formula |
a two-sided formula object for the model |
IV3_learner |
a string specifying the regression method for |
IV3_pars |
a list containing hyperparameters for the |
IV4_formula |
a two-sided formula object for the model |
IV4_learner |
a string specifying the regression method for |
IV4_pars |
a list containing hyperparameters for the |
IV5_formula |
a two-sided formula object for the model |
IV5_learner |
a string specifying the regression method for |
IV5_pars |
a list containing hyperparameters for the |
data |
a data frame containing the variables for the partially linear model. |
K |
the number of folds used for |
S |
the number of repeats to mitigate the randomness in the estimator on the sample splits used for |
max.depth |
Maximum depth parameter used for ROSE random forests. Default is 10. |
num.trees |
Number of trees used for a single ROSE random forest. Default is 500. |
min.node.size |
Minimum node size of a leaf in each tree. Default is |
replace |
Whether sampling for a single random tree is performed with (bootstrap) or without replacement. Default is |
sample.fraction |
Proportion of data used for each random tree. Default is 0.8. |
Value
A list containing:
theta |
The estimator of \theta_0. |
stderror |
Huber robust estimate of the standard error of the \theta_0-estimator. |
coefficients |
Table of \theta_0 coefficient estimator, standard error, z-value and p-value. |
ROSE random forest estimator for the partially linear model
Description
Estimates the parameter of interest \theta_0 in the partially linear model
\mathbb{E}[Y|X,Z] = X\theta_0 + f_0(Z),
which can be re-expressed in terms of the ‘nuisance functions’ (\mathbb{E}[Y|Z], \mathbb{E}[X|Z]) as
\mathbb{E}[Y|X,Z]-\mathbb{E}[Y|Z] = (X-\mathbb{E}[X|Z])\theta_0.
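The step from the model to its re-expressed form can be spelled out as follows. This short derivation uses only the model equation and the tower property of conditional expectation; it is included as a reading aid.

```latex
\begin{align*}
\mathbb{E}[Y|X,Z] &= X\theta_0 + f_0(Z)
  && \text{(model)} \\
\mathbb{E}[Y|Z] &= \mathbb{E}\big[\mathbb{E}[Y|X,Z]\,\big|\,Z\big]
   = \mathbb{E}[X|Z]\theta_0 + f_0(Z)
  && \text{(condition on $Z$ only)} \\
\mathbb{E}[Y|X,Z] - \mathbb{E}[Y|Z] &= (X - \mathbb{E}[X|Z])\theta_0
  && \text{(subtract; $f_0(Z)$ cancels)}
\end{align*}
```

The unknown function f_0 drops out, which is why only the two regressions on Z are needed as nuisance functions.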
Usage
roseRF_plm(
y_formula,
y_learner,
y_pars = list(),
x_formula,
x_learner,
x_pars = list(),
M1_formula = x_formula,
M1_learner = x_learner,
M1_pars = x_pars,
M2_formula = NA,
M2_learner = NA,
M2_pars = list(),
M3_formula = NA,
M3_learner = NA,
M3_pars = list(),
M4_formula = NA,
M4_learner = NA,
M4_pars = list(),
M5_formula = NA,
M5_learner = NA,
M5_pars = list(),
data,
K = 5,
S = 1,
max.depth = 10,
num.trees = 500,
min.node.size = max(10, ceiling(0.01 * (K - 1)/K * nrow(data))),
replace = TRUE,
sample.fraction = 0.8
)
Arguments
y_formula |
a two-sided formula object describing the model for |
y_learner |
a string specifying the regression method to fit the regression of |
y_pars |
a list containing hyperparameters for the |
x_formula |
a two-sided formula object describing the model for |
x_learner |
a string specifying the regression method to fit the regression of |
x_pars |
a list containing hyperparameters for the |
M1_formula |
a two-sided formula object for the model |
M1_learner |
a string specifying the regression method for |
M1_pars |
a list containing hyperparameters for the |
M2_formula |
a two-sided formula object for the model |
M2_learner |
a string specifying the regression method for |
M2_pars |
a list containing hyperparameters for the |
M3_formula |
a two-sided formula object for the model |
M3_learner |
a string specifying the regression method for |
M3_pars |
a list containing hyperparameters for the |
M4_formula |
a two-sided formula object for the model |
M4_learner |
a string specifying the regression method for |
M4_pars |
a list containing hyperparameters for the |
M5_formula |
a two-sided formula object for the model |
M5_learner |
a string specifying the regression method for |
M5_pars |
a list containing hyperparameters for the |
data |
a data frame containing the variables for the partially linear model. |
K |
the number of folds used for |
S |
the number of repeats to mitigate the randomness in the estimator on the sample splits used for |
max.depth |
Maximum depth parameter used for ROSE random forests. Default is 10. |
num.trees |
Number of trees used for a single ROSE random forest. Default is 500. |
min.node.size |
Minimum node size of a leaf in each tree. Default is |
replace |
Whether sampling for a single random tree is performed with (bootstrap) or without replacement. Default is |
sample.fraction |
Proportion of data used for each random tree. Default is 0.8. |
Details
The estimator of the parameter of interest \theta_0 solves (in \theta) the estimating equation
\sum_{i}\psi(Y_i,X_i,Z_i; \theta,\hat{\eta}(Z),\hat{w}(Z)) = 0,
\psi(Y,X,Z;\theta,\eta_0,w) := \sum_{j=1}^J w_j(Z) \big( M_j(X) - \mathbb{E}[M_j(X)|Z]\big) \Big( \big(Y-\mathbb{E}[Y|Z]\big)-\big(X-\mathbb{E}[X|Z]\big)\theta \Big),
\eta_0 := \big(\mathbb{E}[Y|Z=\cdot], \mathbb{E}[X|Z=\cdot]\big),
where M_1(X),\ldots,M_J(X) denote user-chosen functions of X and w(Z)=\big(w_1(Z),\ldots,w_J(Z)\big) denotes weights estimated via ROSE random forests. The default takes J=1 and M_1(X)=X; if taking J\geq 2, we recommend care in checking the applicability and appropriateness of any additional user-chosen regression tasks.
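For J=1 and M_1(X)=X the estimating equation above is linear in \theta, so it has a closed-form solution: a weighted ratio of residual products. The sketch below illustrates this on simulated heteroscedastic data. Two assumptions are made purely for illustration: the nuisance functions are oracle quantities, and the weights are set to the inverse conditional noise variance as a stand-in for weights that the package would estimate with a ROSE random forest.

```r
set.seed(1)
n      <- 2000
theta0 <- 1

z     <- runif(n)
x     <- z + rnorm(n)                    # E[X | Z] = Z
sigma <- 0.2 + 2 * z                     # noise sd varies with Z
y     <- x * theta0 + sin(2 * pi * z) + rnorm(n, sd = sigma)

# Oracle residuals (known only because the data are simulated)
rx <- x - z                                  # X - E[X | Z]
ry <- y - (z * theta0 + sin(2 * pi * z))     # Y - E[Y | Z]

# Closed-form root of sum_i w_i * rx_i * (ry_i - rx_i * theta) = 0
weighted_theta <- function(w, rx, ry) sum(w * rx * ry) / sum(w * rx^2)

w       <- 1 / sigma^2                   # stand-in weights, NOT a ROSE forest
theta_w <- weighted_theta(w, rx, ry)             # weighted estimate
theta_u <- weighted_theta(rep(1, n), rx, ry)     # unweighted, w = 1

# The weighted estimate solves the estimating equation up to rounding error
score_at_root <- sum(w * rx * (ry - rx * theta_w))
print(c(weighted = theta_w, unweighted = theta_u, score = score_at_root))
```

Down-weighting the noisier observations is what gives the weighted estimator its potential efficiency gain over the unweighted one when the noise level depends on Z.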
The parameter of interest \theta_0 is estimated using a DML2 / K-fold cross-fitting framework, to allow for arbitrary (faster than n^{1/4}-consistent) learners for \hat{\eta}, i.e. solving the estimating equation
\sum_{k \in [K]}\sum_{i \in I_k}\psi(Y_i,X_i,Z_i; \theta,\hat{\eta}^{(k)}(Z),\hat{w}^{(k)}(Z)) = 0,
where I_1,\ldots,I_K denotes a partition of the index set for the datapoints (Y_i,X_i,Z_i), \hat{\eta}^{(k)} denotes an estimator for \eta_0 trained on the data indexed by I_k^c, and \hat{w}^{(k)} denotes a ROSE random forest (again trained on the data indexed by I_k^c).
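The cross-fitting scheme just described can be sketched in a few lines of base R. This is a simplified illustration, not the package's code: the nuisance regressions are fitted with lm on a polynomial basis in Z as a stand-in for whichever learners are chosen, the weights are fixed at 1, and the data are simulated.

```r
set.seed(1)
n <- 1000
K <- 5

z   <- runif(n)
x   <- sin(2 * pi * z) + rnorm(n)
y   <- 1.5 * x + cos(2 * pi * z) + rnorm(n)   # theta0 = 1.5
dat <- data.frame(y = y, x = x, z = z)

# Random partition I_1, ..., I_K of the indices
fold <- sample(rep(seq_len(K), length.out = n))

ry <- numeric(n)   # out-of-fold residuals Y - E[Y | Z]
rx <- numeric(n)   # out-of-fold residuals X - E[X | Z]

for (k in seq_len(K)) {
  train <- dat[fold != k, ]   # data indexed by I_k^c
  test  <- dat[fold == k, ]   # data indexed by I_k

  # Stand-in learners: polynomial regression on Z, fitted on I_k^c only
  fit_y <- lm(y ~ poly(z, 5), data = train)
  fit_x <- lm(x ~ poly(z, 5), data = train)

  ry[fold == k] <- test$y - predict(fit_y, newdata = test)
  rx[fold == k] <- test$x - predict(fit_x, newdata = test)
}

# DML2: pool the scores over all folds and solve one equation for theta
theta_hat <- sum(rx * ry) / sum(rx^2)
print(theta_hat)
```

The key design point is that each observation's nuisance predictions come from models that never saw that observation, and that a single pooled equation is solved rather than averaging K per-fold estimates.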
Value
A list containing:
theta |
The estimator of \theta_0. |
stderror |
Huber robust estimate of the standard error of the \theta_0-estimator. |
coefficients |
Table of \theta_0 coefficient estimator, standard error, z-value and p-value. |
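A Huber robust standard error is commonly computed with a sandwich formula built from the individual score contributions. The sketch below shows that formula for the partially linear model with unit weights. It is one standard reading of "Huber robust" and is an assumption here; the package's exact computation is not stated in this documentation. The residuals are simulated directly for brevity.

```r
set.seed(1)
n  <- 1000
rx <- rnorm(n)             # stands in for X - E[X | Z]
ry <- 2 * rx + rnorm(n)    # stands in for Y - E[Y | Z], with theta0 = 2

theta_hat <- sum(rx * ry) / sum(rx^2)

# Per-observation score contributions evaluated at theta_hat
psi <- rx * (ry - rx * theta_hat)

# Sandwich: "meat" is the sum of squared scores, "bread" is minus the
# derivative of the summed score with respect to theta
meat  <- sum(psi^2)
bread <- sum(rx^2)
stderror <- sqrt(meat) / bread

print(c(theta = theta_hat, stderror = stderror))
```

Unlike the classical formula, this estimate does not assume a constant noise variance, which matters when the noise level changes with Z.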
Summary for a rose random forest fitted object
Description
Prints a roseRF object fitted by the functions roseRF_... in roseRF.
Usage
## S3 method for class 'roseforest'
summary(object, ...)
Arguments
object |
a fitted |
... |
additional arguments |
Value
Prints summary output for a roseRF object.
Unweighted (baseline) estimator for the generalised partially linear model
Description
Estimates the parameter of interest \theta_0 in the generalised partially linear regression model
g(\mathbb{E}[Y|X,Z]) = X\theta_0 + f_0(Z),
as in roseRF_gplm but without any weights, i.e. J=1, M_1(X)=X and w_1\equiv 1.
Usage
unweighted_gplm(
y_on_xz_formula,
y_on_xz_learner,
y_on_xz_pars = list(),
Gy_on_z_formula,
Gy_on_z_learner,
Gy_on_z_pars = list(),
x_formula,
x_learner,
x_pars = list(),
link = "identity",
data,
K = 5,
S = 1
)
Arguments
y_on_xz_formula |
a two-sided formula object describing the regression model for |
y_on_xz_learner |
a string specifying the regression method to fit the regression as given by |
y_on_xz_pars |
a list containing hyperparameters for the |
Gy_on_z_formula |
a two-sided formula object describing the regression model for |
Gy_on_z_learner |
a string specifying the regression method to fit the regression as given by |
Gy_on_z_pars |
a list containing hyperparameters for the |
x_formula |
a two-sided formula object describing the regression model for |
x_learner |
a string specifying the regression method to fit the regression of |
x_pars |
a list containing hyperparameters for the |
link |
link function ( |
data |
a data frame containing the variables for the partially linear model. |
K |
the number of folds used for |
S |
the number of repeats to mitigate the randomness in the estimator on the sample splits used for |
Value
A list containing:
theta |
The estimator of \theta_0. |
stderror |
Huber robust estimate of the standard error of the \theta_0-estimator. |
coefficients |
Table of \theta_0 coefficient estimator, standard error, z-value and p-value. |
Unweighted (baseline) estimator for the partially linear model
Description
Estimates the parameter of interest \theta_0 in the partially linear regression model
\mathbb{E}[Y|X,Z] = X\theta_0 + f_0(Z),
as in roseRF_plm but without any weights, i.e. J=1, M_1(X)=X and w_1\equiv 1.
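With J=1, M_1(X)=X and w_1\equiv 1, the point estimate coincides with a least-squares regression through the origin of the Y-residuals on the X-residuals. The sketch below checks this numerically with stats::lm. The residuals are formed from oracle conditional means of simulated data, which is an assumption made only to keep the example short.

```r
set.seed(1)
n <- 500
z <- runif(n)
x <- z^2 + rnorm(n)                 # E[X | Z] = Z^2
y <- 2 * x + exp(z) + rnorm(n)      # theta0 = 2, f0(Z) = exp(Z)

# Oracle residuals (known only because the data are simulated)
rx <- x - z^2                       # X - E[X | Z]
ry <- y - (2 * z^2 + exp(z))        # Y - E[Y | Z]

# Root of sum_i rx_i * (ry_i - rx_i * theta) = 0
theta_unweighted <- sum(rx * ry) / sum(rx^2)

# Same number from a no-intercept least-squares fit of ry on rx
theta_ols <- unname(coef(lm(ry ~ 0 + rx)))

print(c(estimating_equation = theta_unweighted, ols = theta_ols))
```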
Usage
unweighted_plm(
y_formula,
y_learner,
y_pars = list(),
x_formula,
x_learner,
x_pars = list(),
data,
K = 5,
S = 1
)
Arguments
y_formula |
a two-sided formula object describing the regression model for |
y_learner |
a string specifying the regression method to fit the regression of |
y_pars |
a list containing hyperparameters for the |
x_formula |
a two-sided formula object describing the regression model for |
x_learner |
a string specifying the regression method to fit the regression of |
x_pars |
a list containing hyperparameters for the |
data |
a data frame containing the variables for the partially linear model. |
K |
the number of folds used for |
S |
the number of repeats to mitigate the randomness in the estimator on the sample splits used for |
Value
A list containing:
theta |
The estimator of \theta_0. |
stderror |
Huber robust estimate of the standard error of the \theta_0-estimator. |
coefficients |
Table of \theta_0 coefficient estimator, standard error, z-value and p-value. |