Title: | Penalized Linear Mixed Models for Correlated Data |
Version: | 4.2.1 |
Date: | 2025-03-03 |
Description: | Fits penalized linear mixed models that correct for unobserved confounding factors. 'plmmr' infers and corrects for the presence of unobserved confounding effects such as population stratification and environmental heterogeneity. It then fits a linear model via penalized maximum likelihood. Originally designed for the multivariate analysis of single nucleotide polymorphisms (SNPs) measured in a genome-wide association study (GWAS), 'plmmr' eliminates the need for subpopulation-specific analyses and post-analysis p-value adjustments. Functions for the appropriate processing of 'PLINK' files are also supplied. For examples, see the package homepage. https://pbreheny.github.io/plmmr/. |
License: | GPL-3 |
URL: | https://pbreheny.github.io/plmmr/, https://github.com/pbreheny/plmmr/ |
Depends: | bigalgebra, bigmemory, R (≥ 4.4.0) |
Imports: | biglasso (≥ 1.6.0), data.table, glmnet, Matrix, ncvreg, parallel, utils |
Suggests: | bigsnpr, bigstatsr, graphics, grDevices, knitr, MASS, rmarkdown, tinytest |
LinkingTo: | BH, bigmemory, Rcpp, RcppArmadillo (≥ 0.8.600) |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | yes |
Packaged: | 2025-03-03 17:46:44 UTC; pbreheny |
Author: | Tabitha K. Peter |
Maintainer: | Patrick J. Breheny <patrick-breheny@uiowa.edu> |
Repository: | CRAN |
Date/Publication: | 2025-03-03 18:10:10 UTC |
plmmr: Penalized Linear Mixed Models for Correlated Data
Description
Fits penalized linear mixed models that correct for unobserved confounding factors. 'plmmr' infers and corrects for the presence of unobserved confounding effects such as population stratification and environmental heterogeneity. It then fits a linear model via penalized maximum likelihood. Originally designed for the multivariate analysis of single nucleotide polymorphisms (SNPs) measured in a genome-wide association study (GWAS), 'plmmr' eliminates the need for subpopulation-specific analyses and post-analysis p-value adjustments. Functions for the appropriate processing of 'PLINK' files are also supplied. For examples, see the package homepage. https://pbreheny.github.io/plmmr/.
Author(s)
Maintainer: Patrick J. Breheny patrick-breheny@uiowa.edu (ORCID)
Authors:
Tabitha K. Peter tabitha-peter@uiowa.edu (ORCID)
Anna C. Reisetter anna-reisetter@uiowa.edu (ORCID)
Yujing Lu
See Also
Useful links:
helper function to implement MCP penalty The helper functions to implement each penalty.
Description
helper function to implement MCP penalty The helper functions to implement each penalty.
Usage
MCP(z, l1, l2, gamma, v)
Arguments
z |
a vector representing the solution over active set at each feature |
l1 |
upper bound (on beta) |
l2 |
lower bound (on beta) |
gamma |
The tuning parameter of the MCP penalty |
v |
the 'xtx' term |
Value
numeric vector of the MCP-penalized coefficient estimates within the given bounds
helper function to implement SCAD penalty
Description
helper function to implement SCAD penalty
Usage
SCAD(z, l1, l2, gamma, v)
Arguments
z |
solution over active set at each feature |
l1 |
upper bound |
l2 |
lower bound |
gamma |
The tuning parameter of the SCAD penalty |
v |
the 'xtx' term |
Value
numeric vector of the SCAD-penalized coefficient estimates within the given bounds
A helper function to add predictors to a filebacked matrix of data
Description
A helper function to add predictors to a filebacked matrix of data
Usage
add_predictors(obj, add_predictor, id_var, rds_dir, quiet)
Arguments
obj |
A |
add_predictor |
Optional: add additional covariates/predictors/features from an external file (i.e., not a PLINK file). |
id_var |
String specifying which column of the PLINK |
rds_dir |
The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to |
quiet |
Logical: should messages be printed to the console? Defaults to FALSE (which leaves the print messages on...) |
Value
A list of 2 components:
'obj' - a
bigSNP
object with an added element representing the matrix that includes the additional predictors as the first few columns'non_gen' - an integer vector that ranges from 1 to the number of added predictors. Example: if 2 predictors are added, unpen= 1:2
Admix: Semi-simulated SNP data
Description
A dataset containing the 100 SNPs, a demographic variable representing race, and a simulated outcome
Usage
admix
Format
A list with 3 components
- X
SNP matrix (197 observations of 100 SNPs)
- y
vector of simulated (continuous) outcomes
- race
vector with racial group categorization: # 0 = African, 1 = African American, 2 = European, 3 = Japanese
Source
https://hastie.su.domains/CASI/
A helper function to support process_plink()
Description
A helper function to support process_plink()
Usage
align_ids(id_var, quiet, add_predictor, og_ids)
Arguments
id_var |
String specifying the variable name of the ID column |
quiet |
Logical: should a message be printed? |
add_predictor |
External data to include in design matrix. This is the add_predictors... arg in |
og_ids |
Character vector with the PLINK ids (FID or IID) from the original data (i.e., the data before any subsetting from handling missing phenotypes) |
Value
A matrix with the same dimensions as add_predictor
a version of cbind() for file-backed matrices
Description
a version of cbind() for file-backed matrices
Usage
big_cbind(A, B, C, quiet)
Arguments
A |
in-memory data |
B |
file-backed data |
C |
file-backed placeholder for combined data |
quiet |
Logical |
Value
C, filled in with all column values of A and B combined
check_for_file_extension: a function to make our package 'smart' enough to handle .rds file extensions
Description
check_for_file_extension: a function to make our package 'smart' enough to handle .rds file extensions
Usage
check_for_file_extension(path)
Arguments
path |
A string specifying a file path that ends in a file name, e.g. "~/dir/my_file.rds" |
Value
a string with a filepath without an extension, e.g. "~/dir/my_file"
Coef method for "cv_plmm" class
Description
Coef method for "cv_plmm" class
Usage
## S3 method for class 'cv_plmm'
coef(object, lambda, which = object$min, ...)
Arguments
object |
An object of class "cv_plmm." |
lambda |
A numeric vector of lambda values. |
which |
Vector of lambda indices for which coefficients to return. Defaults to lambda index with minimum CVE. |
... |
Additional arguments (not used). |
Value
Returns a named numeric vector. Values are the coefficients of the
model at the specified value of either lambda
or which
. Names are the
values of lambda
.
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design, return_fit = TRUE)
head(coef(cv_fit))
Coef method for "plmm" class
Description
Coef method for "plmm" class
Usage
## S3 method for class 'plmm'
coef(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...)
Arguments
object |
An object of class "plmm." |
lambda |
A numeric vector of lambda values. |
which |
Vector of lambda indices for which coefficients to return. |
drop |
Logical. |
... |
Additional arguments. |
Value
Either a numeric matrix (if model was fit on data stored in memory)
or a sparse matrix (if model was fit on data stored filebacked). Rownames are
feature names, columns are values of lambda
.
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
coef(fit)[1:10, 41:45]
a function to create the estimated variance matrix from a PLMM fit
Description
a function to create the estimated variance matrix from a PLMM fit
Usage
construct_variance(fit, K = NULL, eta = NULL)
Arguments
fit |
An object returned by |
K |
An optional matrix |
eta |
An optional numeric value between 0 and 1; if |
Value
Sigma_hat, a matrix representing the estimated variance
A helper function to count constant features
Description
A helper function to count constant features
Usage
count_constant_features(fbm, outfile, quiet)
Arguments
fbm |
A filebacked |
outfile |
String specifying name of log file |
quiet |
Logical: should a message be printed to the console |
Value
ns A numeric vector with the indices of the non-singular columns of the matrix associated with counts
A helper function to count the number of cores available on the current machine
Description
A helper function to count the number of cores available on the current machine
Usage
count_cores()
Value
A number of cores to use; if parallel
is installed, this will be parallel::detectCores()
. Otherwise, this returns a 1.
a function to create a design for PLMM modeling
Description
a function to create a design for PLMM modeling
Usage
create_design(data_file = NULL, rds_dir = NULL, X = NULL, y = NULL, ...)
Arguments
data_file |
For filebacked data (data from |
rds_dir |
For filebacked data, this is the filepath to the directory/folder where you want the design to be saved.
Note: do not include/append the name you want for the to-be-created file – the name is the argument |
X |
For in-memory data (data in a matrix or data frame), this is the design matrix. Defaults to NULL (this argument does not apply for filebacked data). |
y |
For in-memory data, this is the numeric vector representing the outcome. Defaults to NULL (this argument does not apply for filebacked data). Note: it is the responsibility of the user to ensure that the rows in X and the corresponding elements of y have the same row order, i.e., observations must be in the same order in both the design matrix and in the outcome vector. |
... |
Additional arguments to pass to |
Details
This function is a wrapper for the other create_design...()
inner functions; all arguments
included here are passed along to the create_design...()
inner function that
matches the type of the data being supplied. Note which arguments are optional
and which ones are not.
Additional arguments for all filebacked data:
-
new_file User-specified filename (without .bk/.rds extension) for the to-be-created .rds/.bk files. Must be different from any existing .rds/.bk files in the same folder.
-
feature_id Optional: A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row/sample/participant/, etc.). No duplicates allowed. - for PLINK data: a string specifying an ID column of the PLINK
.fam
file. Options are "IID" (default) and "FID" - for all other filebacked data: a character vector of unique identifiers (IDs) for each row of the feature data (i.e., the data processed withprocess_delim()
) - if left NULL (default), X is assumed to have the same row-order as add_outcome. Note: if this assumption is made in error, calculations downstream will be incorrect. Pay close attention here. -
add_outcome A data frame or matrix with two columns: and ID column and a column with the outcome value (to be used as 'y' in the final design). IDs must be characters, outcome must be numeric.
-
outcome_id A string specifying the name of the ID column in 'add_outcome'
-
outcome_col A string specifying the name of the phenotype column in 'add_outcome'
-
na_outcome_vals Optional: a vector of numeric values used to code NA values in the outcome. Defaults to
c(-9, NA_integer)
(the -9 matches PLINK conventions). -
overwrite Optional: logical - should existing .rds files be overwritten? Defaults to FALSE.
-
logfile Optional: name of the '.log' file to be written – Note: do not append a
.log
to the filename; this is done automatically. -
quiet Optional: logical - should messages to be printed to the console be silenced? Defaults to FALSE
Additional arguments specific to PLINK data:
-
add_predictor Optional (for PLINK data only): a matrix or data frame to be used for adding additional unpenalized covariates/predictors/features from an external file (i.e., not a PLINK file). This matrix must have one column that is an ID column; all other columns aside the ID will be used as covariates in the design matrix. Columns must be named.
-
predictor_id Optional (for PLINK data only): A string specifying the name of the column in 'add_predictor' with sample IDs. Required if 'add_predictor' is supplied. The names will be used to subset and align this external covariate with the supplied PLINK data.
Additional arguments specific to delimited file data:
-
unpen Optional: an character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, your delimited file must have column names.
Additional arguments for in-memory data:
-
unpen Optional: an character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names.
Value
A filepath to an object of class plmm_design
, which is a named list with the design matrix,
outcome, penalty factor vector, and other details needed for fitting a model. This list is stored as an .rds
file for filebacked data, so in the filebacked case a string with the path to that file is returned. For in-memory data,
the list itself is returned.
Examples
## Example 1: matrix data in-memory ##
admix_design <- create_design(X = admix$X, y = admix$y, unpen = "Snp1")
## Example 2: delimited data ##
# process delimited data
temp_dir <- tempdir()
colon_dat <- process_delim(data_file = "colon2.txt",
data_dir = find_example_data(parent = TRUE), overwrite = TRUE,
rds_dir = temp_dir, rds_prefix = "processed_colon2", sep = "\t", header = TRUE)
# prepare outcome data
colon_outcome <- read.delim(find_example_data(path = "colon2_outcome.txt"))
# create a design
colon_design <- create_design(data_file = colon_dat, rds_dir = temp_dir, new_file = "std_colon2",
add_outcome = colon_outcome, outcome_id = "ID", outcome_col = "y", unpen = "sex",
overwrite = TRUE, logfile = "test.log")
# look at the results
colon_rds <- readRDS(colon_design)
str(colon_rds)
## Example 3: PLINK data ##
# process PLINK data
temp_dir <- tempdir()
unzip_example_data(outdir = temp_dir)
plink_data <- process_plink(data_dir = temp_dir,
data_prefix = "penncath_lite",
rds_dir = temp_dir,
rds_prefix = "imputed_penncath_lite",
# imputing the mode to address missing values
impute_method = "mode",
# overwrite existing files in temp_dir
# (you can turn this feature off if you need to)
overwrite = TRUE,
# turning off parallelization - leaving this on causes problems knitting this vignette
parallel = FALSE)
# get outcome data
penncath_pheno <- read.csv(find_example_data(path = 'penncath_clinical.csv'))
outcome <- data.frame(FamID = as.character(penncath_pheno$FamID),
CAD = penncath_pheno$CAD)
unpen_predictors <- data.frame(FamID = as.character(penncath_pheno$FamID),
sex = penncath_pheno$sex,
age = penncath_pheno$age)
# create design where sex and age are always included in the model
pen_design <- create_design(data_file = plink_data,
feature_id = "FID",
rds_dir = temp_dir,
new_file = "std_penncath_lite",
add_outcome = outcome,
outcome_id = "FamID",
outcome_col = "CAD",
add_predictor = unpen_predictors,
predictor_id = "FamID",
logfile = "design",
# again, overwrite if needed; use with caution
overwrite = TRUE)
# examine the design - notice the components of this object
pen_design_rds <- readRDS(pen_design)
A function to create a design matrix, outcome, and penalty factor to be passed to a model fitting function
Description
A function to create a design matrix, outcome, and penalty factor to be passed to a model fitting function
Usage
create_design_filebacked(
data_file,
rds_dir,
obj,
new_file,
feature_id = NULL,
add_outcome,
outcome_id,
outcome_col,
na_outcome_vals = c(-9, NA_integer_),
add_predictor = NULL,
predictor_id = NULL,
unpen = NULL,
logfile = NULL,
overwrite = FALSE,
quiet = FALSE
)
Arguments
data_file |
A filepath to rds file of processed data (data from |
rds_dir |
The path to the directory in which you want to create the new '.rds' and '.bk' files. |
obj |
The RDS object read in by |
new_file |
User-specified filename (without .bk/.rds extension) for the to-be-created .rds/.bk files. Must be different from any existing .rds/.bk files in the same folder. |
feature_id |
A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row/sample/participant/, etc.). No duplicates allowed.
- for PLINK data: a string specifying an ID column of the PLINK |
add_outcome |
A data frame or matrix with two columns: and ID column and a column with the outcome value (to be used as 'y' in the final design). IDs must be characters, outcome must be numeric. |
outcome_id |
A string specifying the name of the ID column in 'add_outcome' |
outcome_col |
A string specifying the name of the phenotype column in 'add_outcome' |
na_outcome_vals |
A vector of numeric values used to code NA values in the outcome. Defaults to |
add_predictor |
Optional (for PLINK data only): a matrix or data frame to be used for adding additional unpenalized covariates/predictors/features from an external file (i.e., not a PLINK file). This matrix must have one column that is an ID column; all other columns aside the ID will be used as covariates in the design matrix. Columns must be named. |
predictor_id |
Optional (for PLINK data only): A string specifying the name of the column in 'add_predictor' with sample IDs. Required if 'add_predictor' is supplied. The names will be used to subset and align this external covariate with the supplied PLINK data. |
unpen |
Optional (for delimited file data only): an optional character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names. |
logfile |
Optional: name of the '.log' file to be written – Note: do not append a |
overwrite |
Logical: should existing .rds files be overwritten? Defaults to FALSE. |
quiet |
Logical: should messages to be printed to the console be silenced? Defaults to FALSE |
Value
A filepath to the created .rds file containing all the information for model fitting, including a standardized X and model design information
A function to create a design with an in-memory X matrix
Description
A function to create a design with an in-memory X matrix
Usage
create_design_in_memory(X, y, unpen = NULL)
Arguments
X |
A numeric matrix in which rows correspond to observations (e.g., samples) and columns correspond to features. |
y |
A numeric vector representing the outcome for the model. Note: it is the responsibility of the user to ensure that the outcome_col and X have the same row order! |
unpen |
An optional character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names. |
Value
A list with elements including a standardized X and model design information
create_log_file
Description
create_log_file
Usage
create_log(outfile, ...)
Arguments
outfile |
String specifying the name of the to-be-created file, without extension |
... |
Not used |
Value
Nothing is returned, intead a text file with the suffix .log is created.
Cross-validation for plmm
Description
Performs k-fold cross validation for lasso-, MCP-, or SCAD-penalized
linear mixed models over a grid of values for the regularization parameter lambda
.
Usage
cv_plmm(
design,
y = NULL,
K = NULL,
diag_K = NULL,
eta_star = NULL,
penalty = "lasso",
type = "blup",
gamma,
alpha = 1,
lambda_min,
nlambda = 100,
lambda,
eps = 1e-04,
max_iter = 10000,
warn = TRUE,
init = NULL,
cluster,
nfolds = 5,
seed,
fold = NULL,
trace = FALSE,
save_rds = NULL,
return_fit = TRUE,
...
)
Arguments
design |
The first argument must be one of three things:
(1) |
y |
Optional: In the case where |
K |
Similarity matrix used to rotate the data. This should either be (1) a known matrix that reflects the covariance of y, (2) an estimate (Default is |
diag_K |
Logical: should K be a diagonal matrix? This would reflect observations that are unrelated, or that can be treated as unrelated. Defaults to FALSE. Note: plmm() does not check to see if a matrix is diagonal. If you want to use a diagonal K matrix, you must set diag_K = TRUE. |
eta_star |
Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1. |
penalty |
The penalty to be applied to the model. Either "lasso" (the default), "SCAD", or "MCP". |
type |
A character argument indicating what should be returned from predict.plmm(). If type == 'lp', predictions are based on the linear predictor, X beta. If type == 'blup', predictions are based on the sum of the linear predictor and the estimated random effect (BLUP). Defaults to 'blup', as this has shown to be a superior prediction method in many applications. |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. alpha=1 is equivalent to MCP/SCAD penalty, while alpha=0 would be equivalent to ridge regression. However, alpha=0 is not supported; alpha may be arbitrarily small, but not exactly 0. |
lambda_min |
The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. |
nlambda |
Length of the sequence of lambda. Default is 100. |
lambda |
A user-specified sequence of lambda values. By default, a sequence of values of length nlambda is computed, equally spaced on the log scale. |
eps |
Convergence threshold. The algorithm iterates until the RMSD for the change in linear predictors for each coefficient is less than eps. Default is |
max_iter |
Maximum number of iterations (total across entire path). Default is 10000. |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
init |
Initial values for coefficients. Default is 0 for all columns of X. |
cluster |
Option for in-memory data only: cv_plmm() can be run in parallel across a cluster using the parallel package. The cluster must be set up in advance using parallel::makeCluster(). The cluster must then be passed to cv_plmm(). Note: this option is not yet implemented for filebacked data. |
nfolds |
The number of cross-validation folds. Default is 5. |
seed |
You may set the seed of the random number generator in order to obtain reproducible results. |
fold |
Which fold each observation belongs to. By default, the observations are randomly assigned. |
trace |
If set to TRUE, inform the user of progress by announcing the beginning of each CV fold. Default is FALSE. |
save_rds |
Optional: if a filepath and name without the '.rds' suffix is specified (e.g., |
return_fit |
Optional: a logical value indicating whether the fitted model should be returned as a |
... |
Additional arguments to |
Value
a list with 12 items:
type: the type of prediction used ('lp' or 'blup')
cve: numeric vector with the cross validation error (CVE) at each value of
lambda
cvse: numeric vector with the estimated standard error associated with each value of for
cve
fold: numeric
n
length vector of integers indicating the fold to which each observation was assignedlambda: numeric vector of
lambda
valuesfit: the overall fit of the object, including all predictors; this is a list as returned by
plmm()
min: The index corresponding to the value of
lambda
that minimizescve
lambda_min: The
lambda
value at whichcve
is minmizedmin1se: The index corresponding to the value of
lambda
within standard error of that which minimizescve
lambda1se: largest value of lambda such that error is within 1 standard error of the minimum.
null.dev: numeric value representing the deviance for the intercept-only model. If you have supplied your own
lambda
sequence, this quantity may not be meaningful.estimated_Sigma: an n x n matrix representing the estimated covariance matrix.
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design)
print(summary(cv_fit))
plot(cv_fit)
# Note: for examples with filebacked data, see the filebacking vignette
# https://pbreheny.github.io/plmmr/articles/filebacking.html
Cross-validation internal function for cv_plmm
Description
Internal function for cv_plmm which calls plmm on a fold subset of the original data.
Usage
cvf(i, fold, type, cv_args, ...)
Arguments
i |
Fold number to be excluded from fit. |
fold |
n-length vector of fold-assignments. |
type |
A character argument indicating what should be returned from predict.plmm. If |
cv_args |
List of additional arguments to be passed to plmm. |
... |
Optional arguments to |
Value
a list with three elements:
a numeric vector with the loss at each value of lambda
a numeric value indicating the number of lambda values used
a numeric value with the predicted outcome (y hat) values at each lambda
A function to take the eigendecomposition of K Note: This is faster than taking SVD of X when p >> n
Description
A function to take the eigendecomposition of K Note: This is faster than taking SVD of X when p >> n
Usage
eigen_K(std_X, fbm_flag)
Arguments
std_X |
The standardized design matrix, stored as big.matrix object. |
fbm_flag |
Logical: is std_X an FBM obejct? Passed from |
Value
A list with the eigenvectors and eigenvalues of K
Estimate eta (to be used in rotating the data)
This function is called internally by plmm()
Description
Estimate eta (to be used in rotating the data)
This function is called internally by plmm()
Usage
estimate_eta(n, s, U, y, eta_star)
Arguments
n |
The number of observations |
s |
The singular values of K, the realized relationship matrix |
U |
The left-singular vectors of the standardized design matrix |
y |
Continuous outcome vector. |
Value
a numeric value with the estimated value of eta, the variance parameter
Functions to convert between FBM and big.matrix type objects
Description
Functions to convert between FBM and big.matrix type objects
Usage
fbm2bm(fbm, desc = FALSE)
Arguments
fbm |
An FBM object; see |
desc |
Logical: is the descriptor file desired (as opposed to the filebacked big matrix)? Defaults to FALSE. |
Value
a big.matrix
- see bigmemory::filebacked.big.matrix()
for details
A function to help with accessing example PLINK files
Description
A function to help with accessing example PLINK files
Usage
find_example_data(path, parent = FALSE)
Arguments
path |
Argument (string) specifying a path (filename) for an external data file in |
parent |
If |
Value
If path=NULL
, a character vector of file names is returned. If path is given, then a character string
with the full file path
Examples
find_example_data(parent = TRUE)
Read in processed data
This function is intended to be called after either process_plink()
or process_delim()
has been called once.
Description
Read in processed data
This function is intended to be called after either process_plink()
or process_delim()
has been called once.
Usage
get_data(path, returnX = FALSE, trace = TRUE)
Arguments
path |
The file path to the RDS object containing the processed data. Do not add the '.rds' extension to the path. |
returnX |
Logical: Should the design matrix be returned as a numeric matrix that will be stored in memory. By default, this will be FALSE. |
trace |
Logical: Should trace messages be shown? Default is TRUE. |
Value
A list with these components:
std_X, the column-standardized design matrix as either (1) a numeric matrix or (2) a filebacked matrix (FBM). See
bigstatsr::FBM()
andbigsnpr::bigSnp-class
documentation for details.(if PLINK data) fam, a data frame containing the pedigree information (like a .fam file in PLINK)
(if PLINK data) map, a data frame containing the feature information (like a .bim file in PLINK)
ns: A vector indicating the which columns of X contain nonsingular features (i.e., features with variance != 0.
center: A vector of values for centering each column in X
scale: A vector of values for scaling each column in X
a function to return the computer's host name
Description
a function to return the computer's host name
Usage
get_hostname()
Value
String with hostname of current machine
A function to impute SNP data
Description
A function to impute SNP data
Usage
impute_snp_data(
obj,
X,
impute,
impute_method,
parallel,
outfile,
quiet,
seed = as.numeric(Sys.Date()),
...
)
Arguments
obj |
a |
X |
A matrix of genotype data as returned by |
impute |
Logical: should data be imputed? Default to TRUE. |
impute_method |
If 'impute' = TRUE, this argument will specify the kind of imputation desired. Options are:
|
parallel |
Logical: should the computations within this function be run in parallel? Defaults to TRUE. See |
outfile |
Optional: the name (character string) of the prefix of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile. |
quiet |
Logical: should messages be printed to the console? Defaults to TRUE |
seed |
Numeric value to be passed as the seed for |
... |
Optional: additional arguments to |
Value
Nothing is returned, but the obj$genotypes
is overwritten with the imputed version of the data
A function to align genotype and phenotype data
Description
A function to align genotype and phenotype data
Usage
index_samples(
obj,
rds_dir,
indiv_id,
add_outcome,
outcome_id,
outcome_col,
na_outcome_vals,
outfile,
quiet
)
Arguments
obj |
An object created by |
rds_dir |
The path to the directory in which you want to create the new '.rds' and '.bk' files. |
indiv_id |
A character string indicating the ID column name in the 'fam' element of the genotype data list. Defaults to 'sample.ID', equivalent to 'IID' in PLINK. The other option is 'family.ID', equivalent to 'FID' in PLINK. |
add_outcome |
A data frame with at least two columns: and ID column and a phenotype column |
outcome_id |
A string specifying the name of the ID column in |
outcome_col |
A string specifying the name of the phenotype column in |
na_outcome_vals |
A vector of numeric values used to code NA values in the outcome. Defaults to |
outfile |
A string with the name of the filepath for the log file |
quiet |
Logical: should messages be printed to the console? Defaults to FALSE (which leaves the print messages on... |
Value
a list with two items:
a data.table with rows corresponding to the samples for which both genotype and phenotype are available.
a numeric vector with indices indicating which samples were 'complete' (i.e., which samples from add_outcome had corresponding data in the PLINK files)
Generate nicely formatted lambda vec
Description
Generate nicely formatted lambda vec
Usage
lam_names(l)
Arguments
l |
Vector of lambda values. |
Value
A character vector of formatted lambda value names
helper function to implement lasso penalty
Description
helper function to implement lasso penalty
Usage
lasso(z, l1, l2, v)
Arguments
z |
solution over active set at each feature |
l1 |
upper bound |
l2 |
lower bound |
v |
the 'xtx' term |
Value
numeric vector of the lasso-penalized coefficient estimates within the given bounds
Evaluate the negative log-likelihood of an intercept-only Gaussian plmm model
Description
This function allows you to evaluate the negative log-likelihood of a linear mixed model under the assumption of a null model in order to estimate the variance parameter, eta.
Usage
log_lik(eta, n, s, U, y, rot_y = NULL)
Arguments
eta |
The proportion of variance in the outcome that is attributable to causal SNP effects. In other words, signal-to-noise ratio. |
n |
The number of observations |
s |
The singular values of K, the realized relationship matrix |
U |
The left-singular vectors of the standardized design matrix |
y |
Continuous outcome vector. |
rot_y |
Optional: if y has already been rotated, then this can be supplied. |
Value
the value of the log-likelihood of the PLMM, evaluated with the supplied parameters
A helper function to label and summarize the contents of a bigSNP
Description
A helper function to label and summarize the contents of a bigSNP
Usage
name_and_count_bigsnp(obj, id_var, quiet, outfile)
Arguments
obj |
a |
id_var |
String specifying which column of the PLINK |
quiet |
Logical: should messages be printed to the console? Defaults to TRUE |
outfile |
The string with the name of the .log file |
Value
a list with components:
counts: column-wise summary of the minor allele counts in 'genotypes'
obj: a modified
bigSNP
list with additional componentsX: the
obj$genotypes
as its own FBMpos: the
obj$map$physical.pos
vector
Fit a linear mixed model via penalized maximum likelihood.
Description
Fit a linear mixed model via penalized maximum likelihood.
Usage
plmm(
design,
y = NULL,
K = NULL,
diag_K = NULL,
eta_star = NULL,
penalty = "lasso",
init = NULL,
gamma,
alpha = 1,
lambda_min,
nlambda = 100,
lambda,
eps = 1e-04,
max_iter = 10000,
warn = TRUE,
trace = FALSE,
save_rds = NULL,
return_fit = TRUE,
...
)
Arguments
design |
The first argument must be one of three things:
(1) |
y |
Optional: In the case where |
K |
Similarity matrix used to rotate the data. This should either be:
(1) a known matrix that reflects the covariance of y,
(2) an estimate (Default is |
diag_K |
Logical: should K be a diagonal matrix? This would reflect observations that are unrelated, or that can be treated as unrelated. Defaults to FALSE. Note: plmm() does not check to see if a matrix is diagonal. If you want to use a diagonal K matrix, you must set diag_K = TRUE. |
eta_star |
Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1. |
penalty |
The penalty to be applied to the model. Either "lasso" (the default), "SCAD", or "MCP". |
init |
Initial values for coefficients. Default is 0 for all columns of X. |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. alpha=1 is equivalent to MCP/SCAD penalty, while alpha=0 would be equivalent to ridge regression. However, alpha=0 is not supported; alpha may be arbitrarily small, but not exactly 0. |
lambda_min |
The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. |
nlambda |
Length of the sequence of lambda. Default is 100. |
lambda |
A user-specified sequence of lambda values. By default, a sequence of values of length nlambda is computed, equally spaced on the log scale. |
eps |
Convergence threshold. The algorithm iterates until the RMSD for the change in linear predictors for each coefficient is less than eps. Default is |
max_iter |
Maximum number of iterations (total across entire path). Default is 10000. |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
trace |
If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process. Default is FALSE. |
save_rds |
Optional: if a filepath and name without the '.rds' suffix is specified (e.g., |
return_fit |
Optional: a logical value indicating whether the fitted model should be returned as a |
... |
Additional optional arguments to |
Value
A list which includes 19 items:
beta_vals: The matrix of estimated coefficients. Rows are predictors (with the first row being the intercept), and columns are values of
lambda
.std_Xbeta: A matrix of the linear predictors on the scale of the standardized design matrix. Rows are predictors, columns are values of
lambda
. Note: std_Xbeta will not include rows for the intercept or for constant features.std_X_details: A list with 9 items: - center: The center values used to center the columns of the design matrix - scale: The scaling values used to scale the columns of the design matrix - ns: An integer vector of the nonsingular columns of the original data - unpen: An integer vector of indices of the unpenalized features, if any were specified in the design - unpen_colnames: A charater vector of the column names of ay unpenalized features. - X_colnames: A character vector with the column names of all features in the original design matrix - X_rownames: A character vector with the row names of all features in the original design matrix; if none were provided, these are named 'row1', 'row2', etc. - std_X_colnames: A subset of X_colnames representing only nonsignular columns (i.e., the columns indexed by 'ns') - std_X_rownames: A subset of X_rownames representing rows that passed QC filtering & and are represented in both the genotype and phenotype data sets (this only applies to PLINK data)
std_X: If design matrix is filebacked, the descriptor for the filebacked data is returned using
bigmemory::describe()
. If the the data were stored in-memory, nothing is returned (std_X is NULL).y: The outcome vector used in model fitting.
p: The total number of columns in the design matrix (including any singular columns, excluding the intercept).
plink_flag: Logical - did the data come from PLINK files?
lambda: A numeric vector of the tuning parameter values used in model fitting.
eta: A double between 0 and 1 representing the estimated proportion of the variance in the outcome attributable to population/correlation structure
penalty: A character string indicating the penalty with which the model was fit (e.g., 'MCP')
gamma: A numeric value indicating the tuning parameter used for the SCAD or lasso penalties was used. Not relevant for lasso models.
alpha: A numeric value indicating the elastic net tuning parameter.
loss: A vector with the numeric values of the loss at each value of
lambda
(calculated on the ~rotated~ scale)penalty_factor: A vector of indicators corresponding to each predictor, where 1 = predictor was penalized.
ns_idx: An integer vector with the indices of predictors which were non-singular features (i.e., features which had variation), where feature 1 is the intercept.
iter: An integer vector with the number of iterations needed in model fitting for each value of
lambda
converged: A vector of logical values indicating whether the model fitting converged at each value of
lambda
K: a list with 2 elements,
s
andU
—s: a vector of the eigenvalues of the relatedness matrix K (note: K is the kinship matrix for genetic/genomic data; see the article on notation for details)
U: a matrix of the eigenvectors of the relatedness matrix
Examples
# using admix data
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
s <- summary(fit, idx = 50)
print(s)
plot(fit)
# Note: for examples with large data that are too big to fit in memory,
# see the article "PLINK files/file-backed matrices" on our website
# https://pbreheny.github.io/plmmr/articles/filebacking.html
plmm_checks
Description
plmm_checks
Usage
plmm_checks(
design,
K = NULL,
diag_K = NULL,
eta_star = NULL,
penalty = "lasso",
init = NULL,
gamma,
alpha = 1,
trace = FALSE,
save_rds = NULL,
return_fit = TRUE,
...
)
Arguments
design |
The design object, as created by |
K |
Similarity matrix used to rotate the data. This should either be (1) a known matrix that reflects the covariance of y, (2) an estimate (Default is |
diag_K |
Logical: should K be a diagonal matrix? This would reflect observations that are unrelated, or that can be treated as unrelated. Defaults to FALSE. Note: plmm() does not check to see if a matrix is diagonal. If you want to use a diagonal K matrix, you must set diag_K = TRUE. |
eta_star |
Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1. |
penalty |
The penalty to be applied to the model. Either "MCP" (the default), "SCAD", or "lasso". |
init |
Initial values for coefficients. Default is 0 for all columns of X. |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. alpha=1 is equivalent to MCP/SCAD penalty, while alpha=0 would be equivalent to ridge regression. However, alpha=0 is not supported; alpha may be arbitrarily small, but not exactly 0. |
trace |
If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process. Default is FALSE. |
save_rds |
Optional: if a filepath and name is specified (e.g., |
return_fit |
Optional: a logical value indicating whether the fitted model should be returned as a |
... |
Additional arguments to |
Value
A list of parameters to pass on to model fitting. The list includes the standardized design matrix, the outcome, and meta-data
PLMM fit: a function that fits a PLMM using the values returned by plmm_prep()
Description
PLMM fit: a function that fits a PLMM using the values returned by plmm_prep()
Usage
plmm_fit(
prep,
y,
std_X_details,
eta_star,
penalty_factor,
fbm_flag,
penalty,
gamma = 3,
alpha = 1,
lambda_min,
nlambda = 100,
lambda,
eps = 1e-04,
max_iter = 10000,
init = NULL,
warn = TRUE,
...
)
Arguments
prep |
A list as returned from |
y |
The original (not centered) outcome vector. Need this for intercept estimate |
std_X_details |
A list with components 'center' (values used to center X), 'scale' (values used to scale X), and 'ns' (indices for nonsignular columns of X) |
eta_star |
The ratio of variances (passed from plmm()) |
penalty_factor |
A multiplicative factor for the penalty applied to each coefficient. If supplied, penalty_factor must be a numeric vector of length equal to the number of columns of X. The purpose of penalty_factor is to apply differential penalization if some coefficients are thought to be more likely than others to be in the model. In particular, penalty_factor can be 0, in which case the coefficient is always in the model without shrinkage. |
fbm_flag |
Logical: is std_X an FBM object? Passed from |
penalty |
The penalty to be applied to the model. Either "MCP" (the default), "SCAD", or "lasso". |
gamma |
The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. alpha=1 is equivalent to MCP/SCAD penalty, while alpha=0 would be equivalent to ridge regression. However, alpha=0 is not supported; alpha may be arbitrarily small, but not exactly 0. |
lambda_min |
The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. |
nlambda |
Length of the sequence of lambda. Default is 100. |
lambda |
A user-specified sequence of lambda values. By default, a sequence of values of length nlambda is computed, equally spaced on the log scale. |
eps |
Convergence threshold. The algorithm iterates until the RMSD for the change in linear predictors for each coefficient is less than eps. Default is |
max_iter |
Maximum number of iterations (total across entire path). Default is 10000. |
init |
Initial values for coefficients. Default is 0 for all columns of X. |
warn |
Return warning messages for failures to converge and model saturation? Default is TRUE. |
... |
Additional arguments that can be passed to |
PLMM format: a function to format the output of a model constructed with plmm_fit
Description
PLMM format: a function to format the output of a model constructed with plmm_fit
Usage
plmm_format(fit, p, std_X_details, fbm_flag, plink_flag)
Arguments
fit |
A list of parameters describing the output of a model constructed with |
p |
The number of features in the original data (including constant features) |
std_X_details |
A list with 3 items:
* 'center': the centering values for the columns of |
fbm_flag |
Logical: is the corresponding design matrix filebacked? Passed from |
plink_flag |
Logical: did these data come from PLINK files?
Note: This flag matters because of how non-genomic features
are handled for PLINK files – in data from PLINK files,
unpenalized columns are not counted in the |
Value
A list with the components:
-
beta_vals
: the matrix of estimated coefficients on the original scale. Rows are predictors, columns are values oflambda
-
lambda
: a numeric vector of the lasso tuning parameter values used in model fitting. -
eta
: a number (double) between 0 and 1 representing the estimated proportion of the variance in the outcome attributable to population/correlation structure. -
s
: a vectof of the eigenvalues of relatedness matrixK
; seerelatedness_mat()
for details. -
U
: a matrix of the eigenvalues of relatedness matrixK
-
rot_y
: the vector of outcome values on the rotated scale. This is the scale on which the model was fit. -
linear_predictors
: the matrix resulting from the product ofstdrot_X
and the estimated coefficients on the ~rotated~ scale. -
penalty
: character string indicating the penalty with which the model was fit (e.g., 'MCP') -
gamma
: numeric value indicating the tuning parameter used for the SCAD or lasso penalties was used. Not relevant for lasso models. -
alpha
: numeric value indicating the elastic net tuning parameter. -
loss
: vector with the numeric values of the loss at each value oflambda
(calculated on the ~rotated~ scale) -
penalty_factor
: vector of indicators corresponding to each predictor, where 1 = predictor was penalized. -
ns_idx
: vector with the indices of predictors which were nonsingular features (i.e., had variation). -
iter
: numeric vector with the number of iterations needed in model fitting for each value oflambda
-
converged
: vector of logical values indicating whether the model fitting converged at each value oflambda
Loss method for "plmm" class
Description
Loss method for "plmm" class
Usage
plmm_loss(y, yhat)
Arguments
y |
Observed outcomes (response) vector |
yhat |
Predicted outcomes (response) vector |
Value
A numeric vector of the squared-error loss values for the given observed and predicted outcomes
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design, K = relatedness_mat(admix$X))
yhat <- predict(object = fit, newX = admix$X, type = 'lp', lambda = 0.05)
head(plmm_loss(yhat = yhat, y = admix$y))
PLMM prep: a function to run checks, SVD, and rotation prior to fitting a PLMM model
This is an internal function for cv_plmm
Description
PLMM prep: a function to run checks, SVD, and rotation prior to fitting a PLMM model
This is an internal function for cv_plmm
Usage
plmm_prep(
std_X,
std_X_n,
std_X_p,
n,
p,
centered_y,
K = NULL,
diag_K = NULL,
eta_star = NULL,
fbm_flag,
trace = NULL,
...
)
Arguments
std_X |
Column standardized design matrix. May include clinical covariates and other non-SNP data. |
std_X_n |
The number of observations in std_X (integer) |
std_X_p |
The number of features in std_X (integer) |
n |
The number of instances in the original design matrix X. This should not be altered by standardization. |
p |
The number of features in the original design matrix X, including constant features |
centered_y |
Continuous outcome vector, centered. |
K |
Similarity matrix used to rotate the data. This should either be a known matrix that reflects the covariance of y, or an estimate (Default is |
diag_K |
Logical: should K be a diagonal matrix? This would reflect observations that are unrelated, or that can be treated as unrelated. Passed from |
eta_star |
Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1. |
fbm_flag |
Logical: is std_X an FBM type object? This is set internally by |
trace |
If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process. Default is FALSE. |
... |
Not used yet |
Value
List with these components:
centered_y: The vector of centered outcomes
std_X: standardized design matrix
K: a list with 2 elements. (1) s: vector with the eigenvalues of K, and (2) U: the eigenvectors of K (same as left singular values of X).
eta: the numeric value of the estimated eta parameter
trace: logical.
Plot method for cv_plmm class
Description
Plot method for cv_plmm class
Usage
## S3 method for class 'cv_plmm'
plot(
x,
log.l = TRUE,
type = c("cve", "rsq", "scale", "snr", "all"),
selected = TRUE,
vertical.line = TRUE,
col = "red",
...
)
Arguments
x |
An object of class cv_plmm |
log.l |
Logical to indicate the plot should be returned on the natural log scale. Defaults to |
type |
Type of plot to return. Defaults to "cve." |
selected |
Logical to indicate which variables should be plotted. Defaults to TRUE. |
vertical.line |
Logical to indicate whether vertical line should be plotted at the minimum/maximum value. Defaults to TRUE. |
col |
Color for vertical line, if plotted. Defaults to "red." |
... |
Additional arguments. |
Value
Nothing is returned; instead, a plot is drawn representing the relationship between the tuning parameter 'lambda' value (x-axis) and the cross validation error (y-axis).
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
cvfit <- cv_plmm(design = admix_design)
plot(cvfit)
Plot method for plmm class
Description
Plot method for plmm class
Usage
## S3 method for class 'plmm'
plot(x, alpha = 1, log.l = FALSE, shade = TRUE, col, ...)
Arguments
x |
An object of class |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. |
log.l |
Logical to indicate the plot should be returned on the natural log scale. Defaults to |
shade |
Logical to indicate whether a local nonconvex region should be shaded. Defaults to TRUE. |
col |
Vector of colors for coefficient lines. |
... |
Additional arguments. |
Value
Nothing is returned; instead, a plot of the coefficient paths is drawn at each value of lambda (one 'path' for each coefficient).
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
plot(fit)
plot(fit, log.l = TRUE)
Predict method for plmm class
Description
Predict method for plmm class
Usage
## S3 method for class 'plmm'
predict(
object,
newX,
type = c("blup", "coefficients", "vars", "nvars", "lp"),
X = NULL,
lambda,
idx = 1:length(object$lambda),
...
)
Arguments
object |
An object of class |
newX |
Matrix of values at which predictions are to be made (not used for
|
type |
A character argument indicating what type of prediction should be returned. Options are "lp," "coefficients," "vars," "nvars," and "blup." See details. |
X |
Optional: if |
lambda |
A numeric vector of regularization parameter |
idx |
Vector of indices of the penalty parameter |
... |
Additional optional arguments |
Details
Define beta-hat as the coefficients estimated at the value of lambda that minimizes cross-validation error (CVE). Then options for type
are as follows:
'response' (default): uses the product of newX and beta-hat to predict new values of the outcome. This does not incorporate the correlation structure of the data. For the stats folks out there, this is simply the linear predictor.
'blup' (acronym for Best Linear Unbiased Predictor): adds to the 'response' a value that represents the esetimated random effect. This addition is a way of incorporating the estimated correlation structure of data into our prediction of the outcome.
'coefficients': returns the estimated beta-hat
'vars': returns the indices of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.
'nvars': returns the number of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.
Value
Depends on the type
- see Details
Examples
set.seed(123)
train_idx <- sample(1:nrow(admix$X), 100)
# Note: ^ shuffling is important here! Keeps test and train groups comparable.
train <- list(X = admix$X[train_idx,], y = admix$y[train_idx])
train_design <- create_design(X = train$X, y = train$y)
test <- list(X = admix$X[-train_idx,], y = admix$y[-train_idx])
fit <- plmm(design = train_design)
# make predictions for all lambda values
pred1 <- predict(object = fit, newX = test$X, type = "lp")
pred2 <- predict(object = fit, newX = test$X, type = "blup", X = train$X)
# look at mean squared prediction error
mspe <- apply(pred1, 2, function(c){crossprod(test$y - c)/length(c)})
min(mspe)
mspe_blup <- apply(pred2, 2, function(c){crossprod(test$y - c)/length(c)})
min(mspe_blup) # BLUP is better
# compare the MSPE of our model to a null model, for reference
# null model = intercept only -> y_hat is always mean(y)
crossprod(mean(test$y) - test$y)/length(test$y)
Predict method to use in cross-validation (within cvf
)
Description
Predict method to use in cross-validation (within cvf
)
Usage
predict_within_cv(
fit,
testX,
type,
fbm = FALSE,
Sigma_11 = NULL,
Sigma_21 = NULL
)
Arguments
fit |
A list with the components returned by |
testX |
A design matrix used for computing predicted values (i.e, the test data). |
type |
A character argument indicating what type of prediction should be returned. Passed from |
fbm |
Logical: is trainX an FBM object? If so, this function expects that testX is also an FBM. The two X matrices must be stored the same way. |
Sigma_11 |
Variance-covariance matrix of the training data. Extracted from |
Sigma_21 |
Covariance matrix between the training and the testing data. Extracted from |
Details
Define beta-hat as the coefficients estimated at the value of lambda that minimizes cross-validation error (CVE). Then options for type
are as follows:
'lp' (default): uses the linear predictor (i.e., product of test data and estimated coefficients) to predict test values of the outcome. Note that this approach does not incorporate the correlation structure of the data.
'blup' (acronym for Best Linear Unbiased Predictor): adds to the 'lp' a value that represents the estimated random effect. This addition is a way of incorporating the estimated correlation structure of data into our prediction of the outcome.
Note: the main difference between this function and the predict.plmm()
method is that
here in CV, the standardized testing data (std_test_X), Sigma_11, and Sigma_21 are calculated in cvf()
instead of the function defined here.
Value
A numeric vector of predicted values
a function to format the time
Description
a function to format the time
Usage
pretty_time()
Value
A string with the formatted current date and time
Print method for summary.cv_plmm objects
Description
Print method for summary.cv_plmm objects
Usage
## S3 method for class 'summary.cv_plmm'
print(x, digits, ...)
Arguments
x |
An object of class |
digits |
The number of digits to use in formatting output |
... |
Not used |
Value
Nothing is returned; instead, a message is printed to the console summarizing the results of the cross-validated model fit.
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design)
print(summary(cv_fit))
A function to print the summary of a plmm
model
Description
A function to print the summary of a plmm
model
Usage
## S3 method for class 'summary.plmm'
print(x, ...)
Arguments
x |
A |
... |
Not used |
Value
Nothing is returned; instead, a message is printed to the console summarizing the results of the model fit.
Examples
lam <- rev(seq(0.01, 1, length.out=20)) |> round(2) # for sake of example
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design, lambda = lam)
fit2 <- plmm(design = admix_design, penalty = "SCAD", lambda = lam)
print(summary(fit, idx = 18))
print(summary(fit2, idx = 18))
A function to read in large data files as an FBM
Description
A function to read in large data files as an FBM
Usage
process_delim(
data_dir,
data_file,
feature_id,
rds_dir = data_dir,
rds_prefix,
logfile = NULL,
overwrite = FALSE,
quiet = FALSE,
...
)
Arguments
data_dir |
The directory to the file. |
data_file |
The file to be read in, without the filepath. This should be a file of numeric values.
Example: use |
feature_id |
A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row/sample/participant/, etc.). No duplicates allowed. |
rds_dir |
The directory where the user wants to create the '.rds' and '.bk' files
Defaults to |
rds_prefix |
String specifying the user's preferred filename for the to-be-created .rds file (will be create insie |
logfile |
Optional: the name (character string) of the prefix of the logfile to be written. Defaults to 'process_delim', i.e. you will get 'process_delim.log' as the outfile. |
overwrite |
Optional: the name (character string) of the prefix of the logfile to be written.
Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile.
Note: If there are multiple |
quiet |
Logical: should the messages printed to the console be silenced? Defaults to FALSE. |
... |
Optional: other arguments to be passed to |
Value
The file path to the newly created '.rds' file
Examples
temp_dir <- tempdir()
colon_dat <- process_delim(data_file = "colon2.txt",
data_dir = find_example_data(parent = TRUE), overwrite = TRUE,
rds_dir = temp_dir, rds_prefix = "processed_colon2", sep = "\t", header = TRUE)
colon2 <- readRDS(colon_dat)
str(colon2)
Preprocess PLINK files using the bigsnpr
package
Description
Preprocess PLINK files using the bigsnpr
package
Usage
process_plink(
data_dir,
data_prefix,
rds_dir = data_dir,
rds_prefix,
logfile = NULL,
impute = TRUE,
impute_method = "mode",
id_var = "IID",
parallel = TRUE,
quiet = FALSE,
overwrite = FALSE,
...
)
Arguments
data_dir |
The path to the bed/bim/fam data files, without a trailing "/" (e.g., use |
data_prefix |
The prefix (as a character string) of the bed/fam data files (e.g., |
rds_dir |
The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to |
rds_prefix |
String specifying the user's preferred filename for the to-be-created .rds file (will be create insie |
logfile |
Optional: the name (character string) of the prefix of the logfile to be written in 'rds_dir'. Default to NULL (no log file written). Note: if you supply a file path in this argument, it will error out with a "file not found" error. Only supply the string; e.g., if you want my_log.log, supply 'my_log', the my_log.log file will appear in rds_dir. |
impute |
Logical: should data be imputed? Default to TRUE. |
impute_method |
If 'impute' = TRUE, this argument will specify the kind of imputation desired. Options are:
* mode (default): Imputes the most frequent call. See |
id_var |
String specifying which column of the PLINK |
parallel |
Logical: should the computations within this function be run in parallel? Defaults to TRUE. See |
quiet |
Logical: should messages to be printed to the console be silenced? Defaults to FALSE |
overwrite |
Logical: if existing |
... |
Optional: additional arguments to |
Details
Three files are created in the location specified by rds_dir
:
'rds_prefix.rds': This is a list with three items: (1)
X
: the filebackedbigmemory::big.matrix
object pointing to the imputed genotype data. This matrix has type 'double', which is important for downstream operations increate_design()
(2)map
: a data.frame with the PLINK 'bim' data (i.e., the variant information) (3)fam
: a data.frame with the PLINK 'fam' data (i.e., the pedigree information)'prefix.bk': This is the backingfile that stores the numeric data of the genotype matrix
'rds_prefix.desc'" This is the description file, as needed by the
Note that process_plink()
need only be run once for a given set of PLINK
files; in subsequent data analysis/scripts, get_data()
will access the '.rds' file.
For an example, see vignette on processing PLINK files
Value
The filepath to the '.rds' object created; see details for explanation.
A function to read in a large file as a numeric file-backed matrix (FBM
)
Note: this function is a wrapper for bigstatsr::big_read()
Description
A function to read in a large file as a numeric file-backed matrix (FBM
)
Note: this function is a wrapper for bigstatsr::big_read()
Usage
read_data_files(
data_file,
data_dir,
rds_dir,
rds_prefix,
outfile,
overwrite,
quiet,
...
)
Arguments
data_file |
The name of the file to read, not including its directory. Directory should be specified in |
data_dir |
The path to the directory where 'file' is |
rds_dir |
The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to |
rds_prefix |
String specifying the user's preferred filename for the to-be-created .rds/.bk files (will be create insie |
outfile |
Optional: the name (character string) of the prefix of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile. |
overwrite |
Logical: if existing |
quiet |
Logical: should messages be printed to the console? Defaults to TRUE |
... |
Optional: other arguments to be passed to |
Value
'.rds', '.bk', and '.desc' files are created in data_dir
, and obj
(a filebacked bigmemory big.matrix
object) is returned. See bigmemory
documentation for more info on the big.matrix
class.
A function to read in PLINK files using bigsnpr
methods
Description
A function to read in PLINK files using bigsnpr
methods
Usage
read_plink_files(
data_dir,
data_prefix,
rds_dir,
outfile,
parallel,
overwrite,
quiet
)
Arguments
data_dir |
The path to the bed/bim/fam data files, without a trailing "/" (e.g., use |
data_prefix |
The prefix (as a character string) of the bed/fam data files (e.g., |
rds_dir |
The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to |
outfile |
Optional: the name (character string) of the prefix of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile. |
parallel |
Logical: should the computations within this function be run in parallel? Defaults to TRUE. See |
overwrite |
Logical: if existing |
quiet |
Logical: should messages be printed to the console? Defaults to TRUE |
Value
'.rds' and '.bk' files are created in data_dir
, and obj
(a bigSNP
object) is returned. See bigsnpr
documentation for more info on the bigSNP
class.
Calculate a relatedness matrix
Description
Given a matrix of genotypes, this function estimates the genetic relatedness matrix (GRM, also known as the RRM, see Hayes et al. 2009, doi:10.1017/S0016672308009981) among the subjects: XX'/p, where X is standardized.
Usage
relatedness_mat(X, std = TRUE, fbm = FALSE, ns = NULL, ...)
Arguments
X |
An n x p numeric matrix of genotypes (from fully-imputed data). Note: This matrix should not include non-genetic features. |
std |
Logical: should X be standardized? If you set this to FALSE (which can only be done if data are stored in memory), you should have a good reason for doing so, as standardization is a best practice. |
fbm |
Logical: is X stored as an FBM? Defaults to FALSE |
ns |
Optional vector of values indicating the indices of nonsingular features |
... |
Other optional arguments to |
Value
An n x n numeric matrix capturing the genomic relatedness of the
samples represented in X
. In our notation, we call this matrix K for 'kinship';
this is also known as the GRM or RRM.
Examples
RRM <- relatedness_mat(X = admix$X)
RRM[1:5, 1:5]
A function to rotate filebacked data
Description
A function to rotate filebacked data
Usage
rotate_filebacked(prep, ...)
Value
a list with 4 items:
stdrot_X:
X
on the rotated and re-standardized scalerot_y:
y
on the rotated scale (a numeric vector)stdrot_X_center: numeric vector of values used to center
rot_X
stdrot_X_scale: numeric vector of values used to scale
rot_X
Compute sequence of lambda values
Description
This function allows you compute a sequence of lambda values for plmm models.
Usage
setup_lambda(
X,
y,
alpha,
lambda_min,
nlambda,
penalty_factor,
intercept = TRUE
)
Arguments
X |
Rotated and standardized design matrix which includes the intercept column if present. May include clinical covariates and other non-SNP data. This can be either a 'matrix' or 'FBM' object. |
y |
Continuous outcome vector. |
alpha |
Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. alpha=1 is equivalent to MCP/SCAD penalty, while alpha=0 would be equivalent to ridge regression. However, alpha=0 is not supported; alpha may be arbitrarily small, but not exactly 0. |
lambda_min |
The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. A value of lambda_min = 0 is not supported. |
nlambda |
The desired number of lambda values in the sequence to be generated. |
penalty_factor |
A multiplicative factor for the penalty applied to each coefficient. If supplied, penalty_factor must be a numeric vector of length equal to the number of columns of X. The purpose of penalty_factor is to apply differential penalization if some coefficients are thought to be more likely than others to be in the model. In particular, penalty_factor can be 0, in which case the coefficient is always in the model without shrinkage. |
intercept |
Logical: does X contain an intercept column? Defaults to TRUE. |
Value
a numeric vector of lambda values, equally spaced on the log scale
A helper function to standardize a filebacked matrix
Description
A helper function to standardize a filebacked matrix
Usage
standardize_filebacked(
X,
new_file,
rds_dir,
non_gen,
complete_outcome,
id_var,
outfile,
quiet,
overwrite
)
Arguments
X |
A list that includes:
(1) subset_X: a |
new_file |
The new_file (as a character string) of the bed/fam data files (e.g., |
rds_dir |
The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to |
outfile |
Optional: the name (character string) of the new_file of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile. |
quiet |
Logical: should messages be printed to the console? Defaults to FALSE (which leaves the print messages on...) |
overwrite |
Logical: if existing |
Value
A list with a new component of obj
called 'std_X' - this is an FBM with column-standardized data.
List also includes several other indices/meta-data on the standardized matrix
A helper function to standardize matrices
Description
A helper function to standardize matrices
Usage
standardize_in_memory(X)
Arguments
X |
a matrix |
Details
This function is adapted from https://github.com/pbreheny/ncvreg/blob/master/R/std.R
NOTE: this function returns a matrix in memory. For standardizing filebacked
data, use big_std()
– see src/big_standardize.cpp
Value
a list with the standardized matrix, vectors with the centering/scaling values, and a vector with the indices of nonsingular columns
A helper function to subset big.matrix
objects
Description
A helper function to subset big.matrix
objects
Usage
subset_filebacked(X, new_file, complete_samples, ns, rds_dir, outfile, quiet)
Arguments
X |
A filebacked |
new_file |
Optional user-specified new_file for the to-be-created .rds/.bk files. |
complete_samples |
Numeric vector with indicesmarking the rows of the original data which have a non-missing entry in the 6th column of the |
ns |
Numeric vector with the indices of the non-singular columns
This vector is created in |
rds_dir |
The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to |
outfile |
Optional: the name (character string) of the new_file of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile. |
quiet |
Logical: should messages be printed to the console? Defaults to FALSE (which leaves the print messages on...) |
Value
A list with two components. First, a big.matrix
object, 'subset_X', representing a design matrix wherein:
rows are subset according to user's specification in
handle_missing_phen
columns are subset so that no constant features remain – this is important for standardization downstream The list also includes the integer vector 'ns' which marks which columns of the original matrix were 'non-singular' (i.e. not constant features). The 'ns' index plays an important role in
plmm_format()
anduntransform()
(both helper functions in model fitting)
A summary function for cv_plmm objects
Description
A summary function for cv_plmm objects
Usage
## S3 method for class 'cv_plmm'
summary(object, lambda = "min", ...)
Arguments
object |
A |
lambda |
The regularization parameter value at which inference should be reported. Can choose a numeric value, 'min', or '1se'. Defaults to 'min.' |
... |
Not used |
Value
The return value is an object with S3 class summary.cv_plmm
. The class has its own print method and contains the following list elements:
-
lambda_min
: The lambda value at the minimum cross validation error -
lambda.1se
: The maximum lambda value within 1 standard error of the minimum cross validation error -
penalty
: The penalty applied to the fitted model -
nvars
: The number of non-zero coefficients at the selected lambda value -
cve
: The cross validation error at all folds -
min
: The minimum cross validation error -
fit
: Theplmm
fit used in the cross validation
if return_bias_details = TRUE
, two more items are returned:
-
bias
: The mean bias of the cross validation -
loss
: The loss at each value oflambda
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design)
summary(cv_fit)
A summary method for the plmm objects
Description
A summary method for the plmm objects
Usage
## S3 method for class 'plmm'
summary(object, lambda, idx, eps = 1e-05, ...)
Arguments
object |
An object of class |
lambda |
The regularization parameter value at which inference should be reported. |
idx |
Alternatively, |
eps |
If lambda is given, eps is the tolerance for difference between the given lambda value and a lambda value from the object. Defaults to 0.0001 (1e-5) |
... |
Not used |
Value
The return value is an object with S3 class summary.plmm
. The class has its own print method and contains the following list elements:
-
penalty
: The penalty used byplmm
(e.g. SCAD, MCP, lasso) -
n
: Number of instances/observations -
std_X_n
: the number of observations in the standardized data; the only time this would differ from 'n' is if data are from PLINK and the external data does not include all the same samples -
p
: Number of regression coefficients (not including the intercept) -
converged
: Logical indicator for whether the model converged -
lambda
: Thelambda
value at which inference is being reported -
lambda_char
: A formatted character string indicating the lambda value -
nvars
: The number of nonzero coefficients (again, not including the intercept) at that value oflambda
-
nonzero
: The column names indicating the nonzero coefficients in the model at the specified value oflambda
Examples
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
summary(fit, idx = 97)
Untransform coefficient values back to the original scale
Description
This function unwinds the initial standardization of the data to obtain coefficient values on their original scale. It is called by plmm_format().
Usage
untransform(
std_scale_beta,
p,
std_X_details,
fbm_flag,
plink_flag,
use_names = TRUE
)
Arguments
std_scale_beta |
The estimated coefficients on the standardized scale |
p |
The number of columns in the original design matrix |
std_X_details |
A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements 'scale', 'center', and 'ns' |
fbm_flag |
Logical: is the corresponding design matrix filebacked? |
plink_flag |
Logical: did these data come from PLINK files?
Note: This flag matters because of how non-genomic features
are handled for PLINK files – in data from PLINK files,
unpenalized columns are not counted in the |
use_names |
Logical: should names be added? Defaults to TRUE. Set to FALSE inside of |
Value
a matrix of estimated coeffcients, 'beta_vals', that is on the scale of the original data.
Untransform coefficient values back to the original scale for file-backed data
Description
This function unwinds the initial standardization of the data to obtain coefficient values on their original scale. It is called by plmm_format().
Usage
untransform_delim(
std_scale_beta,
p,
std_X_details,
plink_flag,
use_names = TRUE
)
Arguments
std_scale_beta |
The estimated coefficients on the standardized scale |
p |
The number of columns in the original design matrix |
std_X_details |
A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements 'scale', 'center', and 'ns' |
use_names |
Logical: should names be added? Defaults to TRUE. Set to FALSE inside of |
Value
a matrix of estimated coeffcients, 'beta_vals', that is on the scale of the original data.
Untransform coefficient values back to the original scale In memory
Description
This function unwinds the initial standardization of the data to obtain coefficient values on their original scale. It is called by plmm_format().
Usage
untransform_in_memory(std_scale_beta, p, std_X_details, use_names = TRUE)
Arguments
std_scale_beta |
The estimated coefficients on the standardized scale |
p |
The number of columns in the original design matrix |
std_X_details |
A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements 'scale', 'center', and 'ns' |
use_names |
Logical: should names be added? Defaults to TRUE. Set to FALSE inside of |
Value
a matrix of estimated coeffcients, 'beta_vals', that is on the scale of the original data.
Untransform coefficient values back to the original scale for file-backed data
Description
This function unwinds the initial standardization of the data to obtain coefficient values on their original scale. It is called by plmm_format().
Usage
untransform_plink(
std_scale_beta,
p,
std_X_details,
plink_flag,
use_names = TRUE
)
Arguments
std_scale_beta |
The estimated coefficients on the standardized scale |
p |
The number of columns in the original design matrix |
std_X_details |
A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements 'scale', 'center', and 'ns' |
use_names |
Logical: should names be added? Defaults to TRUE. Set to FALSE inside of |
Value
a matrix of estimated coeffcients, 'beta_vals', that is on the scale of the original data.
For Linux/Unix and MacOS only, here is a companion function to unzip the .gz files that ship with the plmmr
package
Description
For Linux/Unix and MacOS only, here is a companion function to unzip the .gz files that ship with the plmmr
package
Usage
unzip_example_data(outdir)
Arguments
outdir |
The file path to the directory to which the .gz files should be written |
Details
For an example of this function, look at vignette('plink_files', package = "plmmr")
.
Note again: this function will not work on Windows systems - only for Linux/Unix and MacOS.
Value
Nothing is returned; the PLINK files that ship with the plmmr
package are stored in the directory specified by 'outdir'