Help for package plmmr

Title:

Penalized Linear Mixed Models for Correlated Data

Version:

4.2.1

Date:

2025-03-03

Description:

Fits penalized linear mixed models that correct for unobserved confounding factors. 'plmmr' infers and corrects for the presence of unobserved confounding effects such as population stratification and environmental heterogeneity. It then fits a linear model via penalized maximum likelihood. Originally designed for the multivariate analysis of single nucleotide polymorphisms (SNPs) measured in a genome-wide association study (GWAS), 'plmmr' eliminates the need for subpopulation-specific analyses and post-analysis p-value adjustments. Functions for the appropriate processing of 'PLINK' files are also supplied. For examples, see the package homepage. https://pbreheny.github.io/plmmr/.

License:

GPL-3

URL:

https://pbreheny.github.io/plmmr/, https://github.com/pbreheny/plmmr/

Depends:

bigalgebra, bigmemory, R (≥ 4.4.0)

Imports:

biglasso (≥ 1.6.0), data.table, glmnet, Matrix, ncvreg, parallel, utils

Suggests:

bigsnpr, bigstatsr, graphics, grDevices, knitr, MASS, rmarkdown, tinytest

LinkingTo:

BH, bigmemory, Rcpp, RcppArmadillo (≥ 0.8.600)

VignetteBuilder:

knitr

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.3.2

NeedsCompilation:

yes

Packaged:

2025-03-03 17:46:44 UTC; pbreheny

Author:

Tabitha K. Peter

[aut], Anna C. Reisetter

[aut], Patrick J. Breheny

[aut, cre], Yujing Lu [aut]

Maintainer:

Patrick J. Breheny <patrick-breheny@uiowa.edu>

Repository:

CRAN

Date/Publication:

2025-03-03 18:10:10 UTC

plmmr: Penalized Linear Mixed Models for Correlated Data

Description

Author(s)

Maintainer: Patrick J. Breheny patrick-breheny@uiowa.edu (ORCID)

Authors:

Tabitha K. Peter tabitha-peter@uiowa.edu (ORCID)
Anna C. Reisetter anna-reisetter@uiowa.edu (ORCID)
Yujing Lu

helper function to implement MCP penalty The helper functions to implement each penalty.

Description

helper function to implement MCP penalty The helper functions to implement each penalty.

Usage

MCP(z, l1, l2, gamma, v)

Arguments

z

a vector representing the solution over active set at each feature

l1

upper bound (on beta)

l2

lower bound (on beta)

gamma

The tuning parameter of the MCP penalty

v

the 'xtx' term

Value

numeric vector of the MCP-penalized coefficient estimates within the given bounds

helper function to implement SCAD penalty

Description

helper function to implement SCAD penalty

Usage

SCAD(z, l1, l2, gamma, v)

Arguments

z

solution over active set at each feature

l1

upper bound

l2

lower bound

gamma

The tuning parameter of the SCAD penalty

v

the 'xtx' term

Value

numeric vector of the SCAD-penalized coefficient estimates within the given bounds

A helper function to add predictors to a filebacked matrix of data

Description

A helper function to add predictors to a filebacked matrix of data

Usage

add_predictors(obj, add_predictor, id_var, rds_dir, quiet)

Arguments

obj

A bigSNP object

add_predictor

Optional: add additional covariates/predictors/features from an external file (i.e., not a PLINK file).

id_var

String specifying which column of the PLINK .fam file has the unique sample identifiers.

rds_dir

The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to data_dir(from process_plink() call)

quiet

Logical: should messages be printed to the console? Defaults to FALSE (which leaves the print messages on...)

Value

A list of 2 components:

'obj' - a bigSNP object with an added element representing the matrix that includes the additional predictors as the first few columns
'non_gen' - an integer vector that ranges from 1 to the number of added predictors. Example: if 2 predictors are added, unpen= 1:2

Admix: Semi-simulated SNP data

Description

A dataset containing the 100 SNPs, a demographic variable representing race, and a simulated outcome

Usage

admix

Format

A list with 3 components

X: SNP matrix (197 observations of 100 SNPs)
y: vector of simulated (continuous) outcomes
race: vector with racial group categorization: # 0 = African, 1 = African American, 2 = European, 3 = Japanese

Source

https://hastie.su.domains/CASI/

A helper function to support `process_plink()`

Description

A helper function to support process_plink()

Usage

align_ids(id_var, quiet, add_predictor, og_ids)

Arguments

id_var

String specifying the variable name of the ID column

quiet

Logical: should a message be printed?

add_predictor

External data to include in design matrix. This is the add_predictors... arg in process_plink()

og_ids

Character vector with the PLINK ids (FID or IID) from the original data (i.e., the data before any subsetting from handling missing phenotypes)

Value

A matrix with the same dimensions as add_predictor

a version of cbind() for file-backed matrices

Description

a version of cbind() for file-backed matrices

Usage

big_cbind(A, B, C, quiet)

Arguments

A

in-memory data

B

file-backed data

C

file-backed placeholder for combined data

quiet

Logical

Value

C, filled in with all column values of A and B combined

check_for_file_extension: a function to make our package 'smart' enough to handle .rds file extensions

Description

check_for_file_extension: a function to make our package 'smart' enough to handle .rds file extensions

Usage

check_for_file_extension(path)

Arguments

path

A string specifying a file path that ends in a file name, e.g. "~/dir/my_file.rds"

Value

a string with a filepath without an extension, e.g. "~/dir/my_file"

Coef method for "cv_plmm" class

Description

Coef method for "cv_plmm" class

Usage

## S3 method for class 'cv_plmm'
coef(object, lambda, which = object$min, ...)

Arguments

object

An object of class "cv_plmm."

lambda

A numeric vector of lambda values.

which

Vector of lambda indices for which coefficients to return. Defaults to lambda index with minimum CVE.

...

Additional arguments (not used).

Value

Returns a named numeric vector. Values are the coefficients of the model at the specified value of either lambda or which. Names are the values of lambda.

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design, return_fit = TRUE)
head(coef(cv_fit))

Coef method for "plmm" class

Description

Coef method for "plmm" class

Usage

## S3 method for class 'plmm'
coef(object, lambda, which = 1:length(object$lambda), drop = TRUE, ...)

Arguments

object

An object of class "plmm."

lambda

A numeric vector of lambda values.

which

Vector of lambda indices for which coefficients to return.

drop

Logical.

...

Additional arguments.

Value

Either a numeric matrix (if model was fit on data stored in memory) or a sparse matrix (if model was fit on data stored filebacked). Rownames are feature names, columns are values of lambda.

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
coef(fit)[1:10, 41:45]

a function to create the estimated variance matrix from a PLMM fit

Description

a function to create the estimated variance matrix from a PLMM fit

Usage

construct_variance(fit, K = NULL, eta = NULL)

Arguments

fit

An object returned by plmm()

K

An optional matrix

eta

An optional numeric value between 0 and 1; if fit is not supplied, then this option must be specified.

Value

Sigma_hat, a matrix representing the estimated variance

A helper function to count constant features

Description

A helper function to count constant features

Usage

count_constant_features(fbm, outfile, quiet)

Arguments

fbm

A filebacked big.matrix

outfile

String specifying name of log file

quiet

Logical: should a message be printed to the console

Value

ns A numeric vector with the indices of the non-singular columns of the matrix associated with counts

A helper function to count the number of cores available on the current machine

Description

A helper function to count the number of cores available on the current machine

Usage

count_cores()

Value

A number of cores to use; if parallel is installed, this will be parallel::detectCores(). Otherwise, this returns a 1.

a function to create a design for PLMM modeling

Description

a function to create a design for PLMM modeling

Usage

create_design(data_file = NULL, rds_dir = NULL, X = NULL, y = NULL, ...)

Arguments

data_file

For filebacked data (data from process_plink() or process_delim()), this is the filepath to the processed data. Defaults to NULL (this argument does not apply for in-memory data).

rds_dir

For filebacked data, this is the filepath to the directory/folder where you want the design to be saved. Note: do not include/append the name you want for the to-be-created file – the name is the argument new_file, passed to create_design_filebacked(). Defaults to NULL (this argument does not apply for in-memory data).

X

For in-memory data (data in a matrix or data frame), this is the design matrix. Defaults to NULL (this argument does not apply for filebacked data).

y

For in-memory data, this is the numeric vector representing the outcome. Defaults to NULL (this argument does not apply for filebacked data). Note: it is the responsibility of the user to ensure that the rows in X and the corresponding elements of y have the same row order, i.e., observations must be in the same order in both the design matrix and in the outcome vector.

...

Additional arguments to pass to create_design_filebacked() or create_design_in_memory(). See the documentation for those helper functions for details.

Details

This function is a wrapper for the other create_design...() inner functions; all arguments included here are passed along to the create_design...() inner function that matches the type of the data being supplied. Note which arguments are optional and which ones are not.

Additional arguments for all filebacked data:

new_file User-specified filename (without .bk/.rds extension) for the to-be-created .rds/.bk files. Must be different from any existing .rds/.bk files in the same folder.
feature_id Optional: A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row/sample/participant/, etc.). No duplicates allowed. - for PLINK data: a string specifying an ID column of the PLINK .fam file. Options are "IID" (default) and "FID" - for all other filebacked data: a character vector of unique identifiers (IDs) for each row of the feature data (i.e., the data processed with process_delim()) - if left NULL (default), X is assumed to have the same row-order as add_outcome. Note: if this assumption is made in error, calculations downstream will be incorrect. Pay close attention here.
add_outcome A data frame or matrix with two columns: and ID column and a column with the outcome value (to be used as 'y' in the final design). IDs must be characters, outcome must be numeric.
outcome_id A string specifying the name of the ID column in 'add_outcome'
outcome_col A string specifying the name of the phenotype column in 'add_outcome'
na_outcome_vals Optional: a vector of numeric values used to code NA values in the outcome. Defaults to c(-9, NA_integer) (the -9 matches PLINK conventions).
overwrite Optional: logical - should existing .rds files be overwritten? Defaults to FALSE.
logfile Optional: name of the '.log' file to be written – Note: do not append a .log to the filename; this is done automatically.
quiet Optional: logical - should messages to be printed to the console be silenced? Defaults to FALSE

Additional arguments specific to PLINK data:

add_predictor Optional (for PLINK data only): a matrix or data frame to be used for adding additional unpenalized covariates/predictors/features from an external file (i.e., not a PLINK file). This matrix must have one column that is an ID column; all other columns aside the ID will be used as covariates in the design matrix. Columns must be named.
predictor_id Optional (for PLINK data only): A string specifying the name of the column in 'add_predictor' with sample IDs. Required if 'add_predictor' is supplied. The names will be used to subset and align this external covariate with the supplied PLINK data.

Additional arguments specific to delimited file data:

unpen Optional: an character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, your delimited file must have column names.

Additional arguments for in-memory data:

unpen Optional: an character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names.

Value

A filepath to an object of class plmm_design, which is a named list with the design matrix, outcome, penalty factor vector, and other details needed for fitting a model. This list is stored as an .rds file for filebacked data, so in the filebacked case a string with the path to that file is returned. For in-memory data, the list itself is returned.

Examples


## Example 1: matrix data in-memory ##
admix_design <- create_design(X = admix$X, y = admix$y, unpen = "Snp1")

## Example 2: delimited data ##
# process delimited data
temp_dir <- tempdir()
colon_dat <- process_delim(data_file = "colon2.txt",
 data_dir = find_example_data(parent = TRUE), overwrite = TRUE,
 rds_dir = temp_dir, rds_prefix = "processed_colon2", sep = "\t", header = TRUE)

# prepare outcome data
colon_outcome <- read.delim(find_example_data(path = "colon2_outcome.txt"))

# create a design
colon_design <- create_design(data_file = colon_dat, rds_dir = temp_dir, new_file = "std_colon2",
add_outcome = colon_outcome, outcome_id = "ID", outcome_col = "y", unpen = "sex",
overwrite = TRUE, logfile = "test.log")

# look at the results
colon_rds <- readRDS(colon_design)
str(colon_rds)

## Example 3: PLINK data ##

# process PLINK data
temp_dir <- tempdir()
unzip_example_data(outdir = temp_dir)

plink_data <- process_plink(data_dir = temp_dir,
  data_prefix = "penncath_lite",
  rds_dir = temp_dir,
  rds_prefix = "imputed_penncath_lite",
  # imputing the mode to address missing values
  impute_method = "mode",
  # overwrite existing files in temp_dir
  # (you can turn this feature off if you need to)
  overwrite = TRUE,
  # turning off parallelization - leaving this on causes problems knitting this vignette
  parallel = FALSE)

# get outcome data
penncath_pheno <- read.csv(find_example_data(path = 'penncath_clinical.csv'))

outcome <- data.frame(FamID = as.character(penncath_pheno$FamID),
                  CAD = penncath_pheno$CAD)

unpen_predictors <- data.frame(FamID = as.character(penncath_pheno$FamID),
                               sex = penncath_pheno$sex,
                               age = penncath_pheno$age)


# create design where sex and age are always included in the model
pen_design <- create_design(data_file = plink_data,
  feature_id = "FID",
  rds_dir = temp_dir,
  new_file = "std_penncath_lite",
  add_outcome = outcome,
  outcome_id = "FamID",
  outcome_col = "CAD",
  add_predictor = unpen_predictors,
  predictor_id = "FamID",
  logfile = "design",
  # again, overwrite if needed; use with caution
  overwrite = TRUE)

# examine the design - notice the components of this object
pen_design_rds <- readRDS(pen_design)

A function to create a design matrix, outcome, and penalty factor to be passed to a model fitting function

Description

A function to create a design matrix, outcome, and penalty factor to be passed to a model fitting function

Usage

create_design_filebacked(
  data_file,
  rds_dir,
  obj,
  new_file,
  feature_id = NULL,
  add_outcome,
  outcome_id,
  outcome_col,
  na_outcome_vals = c(-9, NA_integer_),
  add_predictor = NULL,
  predictor_id = NULL,
  unpen = NULL,
  logfile = NULL,
  overwrite = FALSE,
  quiet = FALSE
)

Arguments

data_file

A filepath to rds file of processed data (data from process_plink() or process_delim())

rds_dir

The path to the directory in which you want to create the new '.rds' and '.bk' files.

obj

The RDS object read in by create_design()

new_file

User-specified filename (without .bk/.rds extension) for the to-be-created .rds/.bk files. Must be different from any existing .rds/.bk files in the same folder.

feature_id

A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row/sample/participant/, etc.). No duplicates allowed. - for PLINK data: a string specifying an ID column of the PLINK .fam file. Options are "IID" (default) and "FID" - for all other filebacked data: a character vector of unique identifiers (IDs) for each row of the feature data (i.e., the data processed with process_delim()) - if left NULL (default), X is assumed to have the same row-order as add_outcome. Note: if this assumption is made in error, calculations downstream will be incorrect. Pay close attention here.

add_outcome

A data frame or matrix with two columns: and ID column and a column with the outcome value (to be used as 'y' in the final design). IDs must be characters, outcome must be numeric.

outcome_id

A string specifying the name of the ID column in 'add_outcome'

outcome_col

A string specifying the name of the phenotype column in 'add_outcome'

na_outcome_vals

A vector of numeric values used to code NA values in the outcome. Defaults to c(-9, NA_integer) (the -9 matches PLINK conventions).

add_predictor

Optional (for PLINK data only): a matrix or data frame to be used for adding additional unpenalized covariates/predictors/features from an external file (i.e., not a PLINK file). This matrix must have one column that is an ID column; all other columns aside the ID will be used as covariates in the design matrix. Columns must be named.

predictor_id

Optional (for PLINK data only): A string specifying the name of the column in 'add_predictor' with sample IDs. Required if 'add_predictor' is supplied. The names will be used to subset and align this external covariate with the supplied PLINK data.

unpen

Optional (for delimited file data only): an optional character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names.

logfile

Optional: name of the '.log' file to be written – Note: do not append a .log to the filename; this is done automatically.

overwrite

Logical: should existing .rds files be overwritten? Defaults to FALSE.

quiet

Logical: should messages to be printed to the console be silenced? Defaults to FALSE

Value

A filepath to the created .rds file containing all the information for model fitting, including a standardized X and model design information

A function to create a design with an in-memory X matrix

Description

A function to create a design with an in-memory X matrix

Usage

create_design_in_memory(X, y, unpen = NULL)

Arguments

X

A numeric matrix in which rows correspond to observations (e.g., samples) and columns correspond to features.

y

A numeric vector representing the outcome for the model. Note: it is the responsibility of the user to ensure that the outcome_col and X have the same row order!

unpen

An optional character vector with the names of columns to mark as unpenalized (i.e., these features would always be included in a model). Note: if you choose to use this option, X must have column names.

Value

A list with elements including a standardized X and model design information

create_log_file

Description

create_log_file

Usage

create_log(outfile, ...)

Arguments

outfile

String specifying the name of the to-be-created file, without extension

...

Not used

Value

Nothing is returned, intead a text file with the suffix .log is created.

Cross-validation for plmm

Description

Performs k-fold cross validation for lasso-, MCP-, or SCAD-penalized linear mixed models over a grid of values for the regularization parameter lambda.

Usage

cv_plmm(
  design,
  y = NULL,
  K = NULL,
  diag_K = NULL,
  eta_star = NULL,
  penalty = "lasso",
  type = "blup",
  gamma,
  alpha = 1,
  lambda_min,
  nlambda = 100,
  lambda,
  eps = 1e-04,
  max_iter = 10000,
  warn = TRUE,
  init = NULL,
  cluster,
  nfolds = 5,
  seed,
  fold = NULL,
  trace = FALSE,
  save_rds = NULL,
  return_fit = TRUE,
  ...
)

Arguments

design

The first argument must be one of three things: (1) plmm_design object (as created by create_design()) (2) a string with the file path to a design object (the file path must end in '.rds') (3) a matrix or data.frame object representing the design matrix of interest

y

Optional: In the case where design is a matrix or data.frame, the user must also supply a numeric outcome vector as the y argument. In this case, design and y will be passed internally to create_design(X = design, y = y).

K

Similarity matrix used to rotate the data. This should either be (1) a known matrix that reflects the covariance of y, (2) an estimate (Default is \frac{1}{p}(XX^T)), or (3) a list with components 'd' and 'u', as returned by choose_k().

diag_K

Logical: should K be a diagonal matrix? This would reflect observations that are unrelated, or that can be treated as unrelated. Defaults to FALSE. Note: plmm() does not check to see if a matrix is diagonal. If you want to use a diagonal K matrix, you must set diag_K = TRUE.

eta_star

Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1.

penalty

The penalty to be applied to the model. Either "lasso" (the default), "SCAD", or "MCP".

type

A character argument indicating what should be returned from predict.plmm(). If type == 'lp', predictions are based on the linear predictor, X beta. If type == 'blup', predictions are based on the sum of the linear predictor and the estimated random effect (BLUP). Defaults to 'blup', as this has shown to be a superior prediction method in many applications.

gamma

The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD.

alpha

Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. alpha=1 is equivalent to MCP/SCAD penalty, while alpha=0 would be equivalent to ridge regression. However, alpha=0 is not supported; alpha may be arbitrarily small, but not exactly 0.

lambda_min

The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise.

nlambda

Length of the sequence of lambda. Default is 100.

lambda

A user-specified sequence of lambda values. By default, a sequence of values of length nlambda is computed, equally spaced on the log scale.

eps

Convergence threshold. The algorithm iterates until the RMSD for the change in linear predictors for each coefficient is less than eps. Default is 1e-4.

max_iter

Maximum number of iterations (total across entire path). Default is 10000.

warn

Return warning messages for failures to converge and model saturation? Default is TRUE.

init

Initial values for coefficients. Default is 0 for all columns of X.

cluster

Option for in-memory data only: cv_plmm() can be run in parallel across a cluster using the parallel package. The cluster must be set up in advance using parallel::makeCluster(). The cluster must then be passed to cv_plmm(). Note: this option is not yet implemented for filebacked data.

nfolds

The number of cross-validation folds. Default is 5.

seed

You may set the seed of the random number generator in order to obtain reproducible results.

fold

Which fold each observation belongs to. By default, the observations are randomly assigned.

trace

If set to TRUE, inform the user of progress by announcing the beginning of each CV fold. Default is FALSE.

save_rds

Optional: if a filepath and name without the '.rds' suffix is specified (e.g., save_rds = "~/dir/my_results"), then the model results are saved to the provided location (e.g., "~/dir/my_results.rds"). Defaults to NULL, which does not save the result. Note: Along with the model results, two '.rds' files ('loss' and 'yhat') will be created in the same directory as 'save_rds'. These files contain the loss and predicted outcome values in each fold; both files will be updated during after prediction within each fold.

return_fit

Optional: a logical value indicating whether the fitted model should be returned as a plmm object in the current (assumed interactive) session. Defaults to TRUE.

...

Additional arguments to plmm_fit

Value

a list with 12 items:

type: the type of prediction used ('lp' or 'blup')
cve: numeric vector with the cross validation error (CVE) at each value of lambda
cvse: numeric vector with the estimated standard error associated with each value of for cve
fold: numeric n length vector of integers indicating the fold to which each observation was assigned
lambda: numeric vector of lambda values
fit: the overall fit of the object, including all predictors; this is a list as returned by plmm()
min: The index corresponding to the value of lambda that minimizes cve
lambda_min: The lambda value at which cve is minmized
min1se: The index corresponding to the value of lambda within standard error of that which minimizes cve
lambda1se: largest value of lambda such that error is within 1 standard error of the minimum.
null.dev: numeric value representing the deviance for the intercept-only model. If you have supplied your own lambda sequence, this quantity may not be meaningful.
estimated_Sigma: an n x n matrix representing the estimated covariance matrix.

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design)
print(summary(cv_fit))
plot(cv_fit)

# Note: for examples with filebacked data, see the filebacking vignette
# https://pbreheny.github.io/plmmr/articles/filebacking.html

Cross-validation internal function for cv_plmm

Description

Internal function for cv_plmm which calls plmm on a fold subset of the original data.

Usage

cvf(i, fold, type, cv_args, ...)

Arguments

i

Fold number to be excluded from fit.

fold

n-length vector of fold-assignments.

type

A character argument indicating what should be returned from predict.plmm. If type == 'lp' predictions are based on the linear predictor, $X beta$ . If type == 'individual' predictions are based on the linear predictor plus the estimated random effect (BLUP).

cv_args

List of additional arguments to be passed to plmm.

...

Optional arguments to predict_within_cv

Value

a list with three elements:

a numeric vector with the loss at each value of lambda
a numeric value indicating the number of lambda values used
a numeric value with the predicted outcome (y hat) values at each lambda

A function to take the eigendecomposition of K Note: This is faster than taking SVD of X when p >> n

Description

A function to take the eigendecomposition of K Note: This is faster than taking SVD of X when p >> n

Usage

eigen_K(std_X, fbm_flag)

Arguments

std_X

The standardized design matrix, stored as big.matrix object.

fbm_flag

Logical: is std_X an FBM obejct? Passed from plmm().

Value

A list with the eigenvectors and eigenvalues of K

Estimate eta (to be used in rotating the data) This function is called internally by `plmm()`

Description

Estimate eta (to be used in rotating the data) This function is called internally by plmm()

Usage

estimate_eta(n, s, U, y, eta_star)

Arguments

n

The number of observations

s

The singular values of K, the realized relationship matrix

U

The left-singular vectors of the standardized design matrix

y

Continuous outcome vector.

Value

a numeric value with the estimated value of eta, the variance parameter

Functions to convert between FBM and big.matrix type objects

Description

Functions to convert between FBM and big.matrix type objects

Usage

fbm2bm(fbm, desc = FALSE)

Arguments

fbm

An FBM object; see bigstatsr::FBM() for details

desc

Logical: is the descriptor file desired (as opposed to the filebacked big matrix)? Defaults to FALSE.

Value

a big.matrix - see bigmemory::filebacked.big.matrix() for details

A function to help with accessing example PLINK files

Description

A function to help with accessing example PLINK files

Usage

find_example_data(path, parent = FALSE)

Arguments

path

Argument (string) specifying a path (filename) for an external data file in extdata/

parent

If path=TRUE and the user wants the name of the parent directory where that file is located, set parent=TRUE. Defaults to FALSE.

Value

If path=NULL, a character vector of file names is returned. If path is given, then a character string with the full file path

Examples

find_example_data(parent = TRUE)

Read in processed data This function is intended to be called after either `process_plink()` or `process_delim()` has been called once.

Description

Read in processed data This function is intended to be called after either process_plink() or process_delim() has been called once.

Usage

get_data(path, returnX = FALSE, trace = TRUE)

Arguments

path

The file path to the RDS object containing the processed data. Do not add the '.rds' extension to the path.

returnX

Logical: Should the design matrix be returned as a numeric matrix that will be stored in memory. By default, this will be FALSE.

trace

Logical: Should trace messages be shown? Default is TRUE.

Value

A list with these components:

std_X, the column-standardized design matrix as either (1) a numeric matrix or (2) a filebacked matrix (FBM). See bigstatsr::FBM() and bigsnpr::bigSnp-class documentation for details.
(if PLINK data) fam, a data frame containing the pedigree information (like a .fam file in PLINK)
(if PLINK data) map, a data frame containing the feature information (like a .bim file in PLINK)
ns: A vector indicating the which columns of X contain nonsingular features (i.e., features with variance != 0.
center: A vector of values for centering each column in X
scale: A vector of values for scaling each column in X

a function to return the computer's host name

Description

a function to return the computer's host name

Usage

get_hostname()

Value

String with hostname of current machine

A function to impute SNP data

Description

A function to impute SNP data

Usage

impute_snp_data(
  obj,
  X,
  impute,
  impute_method,
  parallel,
  outfile,
  quiet,
  seed = as.numeric(Sys.Date()),
  ...
)

Arguments

obj

a bigSNP object (as created by read_plink_files())

X

A matrix of genotype data as returned by name_and_count_bigsnp

impute

Logical: should data be imputed? Default to TRUE.

impute_method

If 'impute' = TRUE, this argument will specify the kind of imputation desired. Options are:

mode (default): Imputes the most frequent call. See bigsnpr::snp_fastImputeSimple() for details.
random: Imputes sampling according to allele frequencies.
mean0: Imputes the rounded mean.
mean2: Imputes the mean rounded to 2 decimal places.
xgboost: Imputes using an algorithm based on local XGBoost models. See bigsnpr::snp_fastImpute() for details. Note: this can take several minutes, even for a relatively small data set.

parallel

Logical: should the computations within this function be run in parallel? Defaults to TRUE. See count_cores() and ?bigparallelr::assert_cores for more details. In particular, the user should be aware that too much parallelization can make computations slower.

outfile

Optional: the name (character string) of the prefix of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile.

quiet

Logical: should messages be printed to the console? Defaults to TRUE

seed

Numeric value to be passed as the seed for impute_method = 'xgboost'. Defaults to as.numeric(Sys.Date())

...

Optional: additional arguments to bigsnpr::snp_fastImpute() (relevant only if impute_method = "xgboost")

Value

Nothing is returned, but the obj$genotypes is overwritten with the imputed version of the data

A function to align genotype and phenotype data

Description

A function to align genotype and phenotype data

Usage

index_samples(
  obj,
  rds_dir,
  indiv_id,
  add_outcome,
  outcome_id,
  outcome_col,
  na_outcome_vals,
  outfile,
  quiet
)

Arguments

obj

An object created by process_plink()

rds_dir

The path to the directory in which you want to create the new '.rds' and '.bk' files.

indiv_id

A character string indicating the ID column name in the 'fam' element of the genotype data list. Defaults to 'sample.ID', equivalent to 'IID' in PLINK. The other option is 'family.ID', equivalent to 'FID' in PLINK.

add_outcome

A data frame with at least two columns: and ID column and a phenotype column

outcome_id

A string specifying the name of the ID column in pheno

outcome_col

A string specifying the name of the phenotype column in pheno. This column will be used as the default y argument to 'plmm()'.

na_outcome_vals

A vector of numeric values used to code NA values in the outcome. Defaults to c(-9, NA_integer) (the -9 matches PLINK conventions).

outfile

A string with the name of the filepath for the log file

quiet

Logical: should messages be printed to the console? Defaults to FALSE (which leaves the print messages on...

Value

a list with two items:

a data.table with rows corresponding to the samples for which both genotype and phenotype are available.
a numeric vector with indices indicating which samples were 'complete' (i.e., which samples from add_outcome had corresponding data in the PLINK files)

Generate nicely formatted lambda vec

Description

Generate nicely formatted lambda vec

Usage

lam_names(l)

Arguments

l

Vector of lambda values.

Value

A character vector of formatted lambda value names

helper function to implement lasso penalty

Description

helper function to implement lasso penalty

Usage

lasso(z, l1, l2, v)

Arguments

z

solution over active set at each feature

l1

upper bound

l2

lower bound

v

the 'xtx' term

Value

numeric vector of the lasso-penalized coefficient estimates within the given bounds

Evaluate the negative log-likelihood of an intercept-only Gaussian plmm model

Description

This function allows you to evaluate the negative log-likelihood of a linear mixed model under the assumption of a null model in order to estimate the variance parameter, eta.

Usage

log_lik(eta, n, s, U, y, rot_y = NULL)

Arguments

eta

The proportion of variance in the outcome that is attributable to causal SNP effects. In other words, signal-to-noise ratio.

n

The number of observations

s

The singular values of K, the realized relationship matrix

U

The left-singular vectors of the standardized design matrix

y

Continuous outcome vector.

rot_y

Optional: if y has already been rotated, then this can be supplied.

Value

the value of the log-likelihood of the PLMM, evaluated with the supplied parameters

A helper function to label and summarize the contents of a `bigSNP`

Description

A helper function to label and summarize the contents of a bigSNP

Usage

name_and_count_bigsnp(obj, id_var, quiet, outfile)

Arguments

obj

a bigSNP object, possibly subset by add_external_phenotype()

id_var

String specifying which column of the PLINK .fam file has the unique sample identifiers. Options are "IID" (default) and "FID".

quiet

Logical: should messages be printed to the console? Defaults to TRUE

outfile

The string with the name of the .log file

Value

a list with components:

counts: column-wise summary of the minor allele counts in 'genotypes'
obj: a modified bigSNP list with additional components
X: the obj$genotypes as its own FBM
pos: the obj$map$physical.pos vector

Fit a linear mixed model via penalized maximum likelihood.

Description

Fit a linear mixed model via penalized maximum likelihood.

Usage

plmm(
  design,
  y = NULL,
  K = NULL,
  diag_K = NULL,
  eta_star = NULL,
  penalty = "lasso",
  init = NULL,
  gamma,
  alpha = 1,
  lambda_min,
  nlambda = 100,
  lambda,
  eps = 1e-04,
  max_iter = 10000,
  warn = TRUE,
  trace = FALSE,
  save_rds = NULL,
  return_fit = TRUE,
  ...
)

Arguments

design

y

K

Similarity matrix used to rotate the data. This should either be: (1) a known matrix that reflects the covariance of y, (2) an estimate (Default is \frac{1}{p}(XX^T)), or (3) a list with components 'd' and 'U', as returned by a previous plmm() model fit on the same data.

diag_K

eta_star

Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1.

penalty

The penalty to be applied to the model. Either "lasso" (the default), "SCAD", or "MCP".

init

Initial values for coefficients. Default is 0 for all columns of X.

gamma

The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD.

alpha

lambda_min

The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise.

nlambda

Length of the sequence of lambda. Default is 100.

lambda

A user-specified sequence of lambda values. By default, a sequence of values of length nlambda is computed, equally spaced on the log scale.

eps

Convergence threshold. The algorithm iterates until the RMSD for the change in linear predictors for each coefficient is less than eps. Default is 1e-4.

max_iter

Maximum number of iterations (total across entire path). Default is 10000.

warn

Return warning messages for failures to converge and model saturation? Default is TRUE.

trace

If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process. Default is FALSE.

save_rds

Optional: if a filepath and name without the '.rds' suffix is specified (e.g., save_rds = "~/dir/my_results"), then the model results are saved to the provided location (e.g., "~/dir/my_results.rds"). Accompanying the RDS file is a log file for documentation, e.g., "~/dir/my_results.log". Defaults to NULL, which does not save any RDS or log files.

return_fit

Optional: a logical value indicating whether the fitted model should be returned as a plmm object in the current (assumed interactive) session. Defaults to TRUE.

...

Additional optional arguments to plmm_checks()

Value

A list which includes 19 items:

beta_vals: The matrix of estimated coefficients. Rows are predictors (with the first row being the intercept), and columns are values of lambda.
std_Xbeta: A matrix of the linear predictors on the scale of the standardized design matrix. Rows are predictors, columns are values of lambda. Note: std_Xbeta will not include rows for the intercept or for constant features.
std_X_details: A list with 9 items: - center: The center values used to center the columns of the design matrix - scale: The scaling values used to scale the columns of the design matrix - ns: An integer vector of the nonsingular columns of the original data - unpen: An integer vector of indices of the unpenalized features, if any were specified in the design - unpen_colnames: A charater vector of the column names of ay unpenalized features. - X_colnames: A character vector with the column names of all features in the original design matrix - X_rownames: A character vector with the row names of all features in the original design matrix; if none were provided, these are named 'row1', 'row2', etc. - std_X_colnames: A subset of X_colnames representing only nonsignular columns (i.e., the columns indexed by 'ns') - std_X_rownames: A subset of X_rownames representing rows that passed QC filtering & and are represented in both the genotype and phenotype data sets (this only applies to PLINK data)
std_X: If design matrix is filebacked, the descriptor for the filebacked data is returned using bigmemory::describe(). If the the data were stored in-memory, nothing is returned (std_X is NULL).
y: The outcome vector used in model fitting.
p: The total number of columns in the design matrix (including any singular columns, excluding the intercept).
plink_flag: Logical - did the data come from PLINK files?
lambda: A numeric vector of the tuning parameter values used in model fitting.
eta: A double between 0 and 1 representing the estimated proportion of the variance in the outcome attributable to population/correlation structure
penalty: A character string indicating the penalty with which the model was fit (e.g., 'MCP')
gamma: A numeric value indicating the tuning parameter used for the SCAD or lasso penalties was used. Not relevant for lasso models.
alpha: A numeric value indicating the elastic net tuning parameter.
loss: A vector with the numeric values of the loss at each value of lambda (calculated on the ~rotated~ scale)
penalty_factor: A vector of indicators corresponding to each predictor, where 1 = predictor was penalized.
ns_idx: An integer vector with the indices of predictors which were non-singular features (i.e., features which had variation), where feature 1 is the intercept.
iter: An integer vector with the number of iterations needed in model fitting for each value of lambda
converged: A vector of logical values indicating whether the model fitting converged at each value of lambda
K: a list with 2 elements, s and U —
- s: a vector of the eigenvalues of the relatedness matrix K (note: K is the kinship matrix for genetic/genomic data; see the article on notation for details)
- U: a matrix of the eigenvectors of the relatedness matrix

Examples

# using admix data
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
s <- summary(fit, idx = 50)
print(s)
plot(fit)

# Note: for examples with large data that are too big to fit in memory,
# see the article "PLINK files/file-backed matrices" on our website
# https://pbreheny.github.io/plmmr/articles/filebacking.html

plmm_checks

Description

plmm_checks

Usage

plmm_checks(
  design,
  K = NULL,
  diag_K = NULL,
  eta_star = NULL,
  penalty = "lasso",
  init = NULL,
  gamma,
  alpha = 1,
  trace = FALSE,
  save_rds = NULL,
  return_fit = TRUE,
  ...
)

Arguments

design

The design object, as created by create_design()

K

diag_K

eta_star

Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1.

penalty

The penalty to be applied to the model. Either "MCP" (the default), "SCAD", or "lasso".

init

Initial values for coefficients. Default is 0 for all columns of X.

gamma

The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD.

alpha

trace

If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process. Default is FALSE.

save_rds

Optional: if a filepath and name is specified (e.g., save_rds = "~/dir/my_results.rds"), then the model results are saved to the provided location. Defaults to NULL, which does not save the result.

return_fit

Optional: a logical value indicating whether the fitted model should be returned as a plmm object in the current (assumed interactive) session. Defaults to TRUE.

...

Additional arguments to get_data()

Value

A list of parameters to pass on to model fitting. The list includes the standardized design matrix, the outcome, and meta-data

PLMM fit: a function that fits a PLMM using the values returned by plmm_prep()

Description

PLMM fit: a function that fits a PLMM using the values returned by plmm_prep()

Usage

plmm_fit(
  prep,
  y,
  std_X_details,
  eta_star,
  penalty_factor,
  fbm_flag,
  penalty,
  gamma = 3,
  alpha = 1,
  lambda_min,
  nlambda = 100,
  lambda,
  eps = 1e-04,
  max_iter = 10000,
  init = NULL,
  warn = TRUE,
  ...
)

Arguments

prep

A list as returned from plmm_prep

y

The original (not centered) outcome vector. Need this for intercept estimate

std_X_details

A list with components 'center' (values used to center X), 'scale' (values used to scale X), and 'ns' (indices for nonsignular columns of X)

eta_star

The ratio of variances (passed from plmm())

penalty_factor

A multiplicative factor for the penalty applied to each coefficient. If supplied, penalty_factor must be a numeric vector of length equal to the number of columns of X. The purpose of penalty_factor is to apply differential penalization if some coefficients are thought to be more likely than others to be in the model. In particular, penalty_factor can be 0, in which case the coefficient is always in the model without shrinkage.

fbm_flag

Logical: is std_X an FBM object? Passed from plmm().

penalty

The penalty to be applied to the model. Either "MCP" (the default), "SCAD", or "lasso".

gamma

The tuning parameter of the MCP/SCAD penalty (see details). Default is 3 for MCP and 3.7 for SCAD.

alpha

lambda_min

The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise.

nlambda

Length of the sequence of lambda. Default is 100.

lambda

A user-specified sequence of lambda values. By default, a sequence of values of length nlambda is computed, equally spaced on the log scale.

eps

Convergence threshold. The algorithm iterates until the RMSD for the change in linear predictors for each coefficient is less than eps. Default is 1e-4.

max_iter

Maximum number of iterations (total across entire path). Default is 10000.

init

Initial values for coefficients. Default is 0 for all columns of X.

warn

Return warning messages for failures to converge and model saturation? Default is TRUE.

...

Additional arguments that can be passed to biglasso::biglasso_simple_path()

PLMM format: a function to format the output of a model constructed with `plmm_fit`

Description

PLMM format: a function to format the output of a model constructed with plmm_fit

Usage

plmm_format(fit, p, std_X_details, fbm_flag, plink_flag)

Arguments

fit

A list of parameters describing the output of a model constructed with plmm_fit

p

The number of features in the original data (including constant features)

std_X_details

A list with 3 items: * 'center': the centering values for the columns of X * 'scale': the scaling values for the non-singular columns of X * 'ns': indicesof nonsingular columns in std_X

fbm_flag

Logical: is the corresponding design matrix filebacked? Passed from plmm().

plink_flag

Logical: did these data come from PLINK files? Note: This flag matters because of how non-genomic features are handled for PLINK files – in data from PLINK files, unpenalized columns are not counted in the p argument. For delimited files, p does include unpenalized columns. This difference has implications for how the untransform() function determines the appropriate dimensions for the estimated coefficient matrix it returns.

Value

A list with the components:

beta_vals: the matrix of estimated coefficients on the original scale. Rows are predictors, columns are values of lambda
lambda: a numeric vector of the lasso tuning parameter values used in model fitting.
eta: a number (double) between 0 and 1 representing the estimated proportion of the variance in the outcome attributable to population/correlation structure.
s: a vectof of the eigenvalues of relatedness matrix K; see relatedness_mat() for details.
U: a matrix of the eigenvalues of relatedness matrix K
rot_y: the vector of outcome values on the rotated scale. This is the scale on which the model was fit.
linear_predictors: the matrix resulting from the product of stdrot_X and the estimated coefficients on the ~rotated~ scale.
penalty: character string indicating the penalty with which the model was fit (e.g., 'MCP')
gamma: numeric value indicating the tuning parameter used for the SCAD or lasso penalties was used. Not relevant for lasso models.
alpha: numeric value indicating the elastic net tuning parameter.
loss: vector with the numeric values of the loss at each value of lambda (calculated on the ~rotated~ scale)
penalty_factor: vector of indicators corresponding to each predictor, where 1 = predictor was penalized.
ns_idx: vector with the indices of predictors which were nonsingular features (i.e., had variation).
iter: numeric vector with the number of iterations needed in model fitting for each value of lambda
converged: vector of logical values indicating whether the model fitting converged at each value of lambda

Loss method for "plmm" class

Description

Loss method for "plmm" class

Usage

plmm_loss(y, yhat)

Arguments

y

Observed outcomes (response) vector

yhat

Predicted outcomes (response) vector

Value

A numeric vector of the squared-error loss values for the given observed and predicted outcomes

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design, K = relatedness_mat(admix$X))
yhat <- predict(object = fit, newX = admix$X, type = 'lp', lambda = 0.05)
head(plmm_loss(yhat = yhat, y = admix$y))

PLMM prep: a function to run checks, SVD, and rotation prior to fitting a PLMM model This is an internal function for `cv_plmm`

Description

PLMM prep: a function to run checks, SVD, and rotation prior to fitting a PLMM model This is an internal function for cv_plmm

Usage

plmm_prep(
  std_X,
  std_X_n,
  std_X_p,
  n,
  p,
  centered_y,
  K = NULL,
  diag_K = NULL,
  eta_star = NULL,
  fbm_flag,
  trace = NULL,
  ...
)

Arguments

std_X

Column standardized design matrix. May include clinical covariates and other non-SNP data.

std_X_n

The number of observations in std_X (integer)

std_X_p

The number of features in std_X (integer)

n

The number of instances in the original design matrix X. This should not be altered by standardization.

p

The number of features in the original design matrix X, including constant features

centered_y

Continuous outcome vector, centered.

K

Similarity matrix used to rotate the data. This should either be a known matrix that reflects the covariance of y, or an estimate (Default is \frac{1}{p}(XX^T), where X is standardized). This can also be a list, with components d and u (as returned by choose_k)

diag_K

Logical: should K be a diagonal matrix? This would reflect observations that are unrelated, or that can be treated as unrelated. Passed from plmm().

eta_star

Optional argument to input a specific eta term rather than estimate it from the data. If K is a known covariance matrix that is full rank, this should be 1.

fbm_flag

Logical: is std_X an FBM type object? This is set internally by plmm().

trace

If set to TRUE, inform the user of progress by announcing the beginning of each step of the modeling process. Default is FALSE.

...

Not used yet

Value

List with these components:

centered_y: The vector of centered outcomes
std_X: standardized design matrix
K: a list with 2 elements. (1) s: vector with the eigenvalues of K, and (2) U: the eigenvectors of K (same as left singular values of X).
eta: the numeric value of the estimated eta parameter
trace: logical.

Plot method for cv_plmm class

Description

Plot method for cv_plmm class

Usage

## S3 method for class 'cv_plmm'
plot(
  x,
  log.l = TRUE,
  type = c("cve", "rsq", "scale", "snr", "all"),
  selected = TRUE,
  vertical.line = TRUE,
  col = "red",
  ...
)

Arguments

x

An object of class cv_plmm

log.l

Logical to indicate the plot should be returned on the natural log scale. Defaults to log.l = FALSE.

type

Type of plot to return. Defaults to "cve."

selected

Logical to indicate which variables should be plotted. Defaults to TRUE.

vertical.line

Logical to indicate whether vertical line should be plotted at the minimum/maximum value. Defaults to TRUE.

col

Color for vertical line, if plotted. Defaults to "red."

...

Additional arguments.

Value

Nothing is returned; instead, a plot is drawn representing the relationship between the tuning parameter 'lambda' value (x-axis) and the cross validation error (y-axis).

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
cvfit <- cv_plmm(design = admix_design)
plot(cvfit)

Plot method for plmm class

Description

Plot method for plmm class

Usage

## S3 method for class 'plmm'
plot(x, alpha = 1, log.l = FALSE, shade = TRUE, col, ...)

Arguments

x

An object of class plmm

alpha

Tuning parameter for the Mnet estimator which controls the relative contributions from the MCP/SCAD penalty and the ridge, or L2 penalty. alpha=1 is equivalent to MCP/SCAD penalty, while alpha=0 would be equivalent to ridge regression. However, alpha=0 is not supported; alpha may be arbitrarily small, but not exactly 0.

log.l

Logical to indicate the plot should be returned on the natural log scale. Defaults to log.l = FALSE.

shade

Logical to indicate whether a local nonconvex region should be shaded. Defaults to TRUE.

col

Vector of colors for coefficient lines.

...

Additional arguments.

Value

Nothing is returned; instead, a plot of the coefficient paths is drawn at each value of lambda (one 'path' for each coefficient).

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
plot(fit)
plot(fit, log.l = TRUE)

Predict method for plmm class

Description

Predict method for plmm class

Usage

## S3 method for class 'plmm'
predict(
  object,
  newX,
  type = c("blup", "coefficients", "vars", "nvars", "lp"),
  X = NULL,
  lambda,
  idx = 1:length(object$lambda),
  ...
)

Arguments

object

An object of class plmm.

newX

Matrix of values at which predictions are to be made (not used for type="coefficients" or for some of the type settings in predict). This can be either a FBM object or a 'matrix' object. Note: Columns of this argument must be named!

type

A character argument indicating what type of prediction should be returned. Options are "lp," "coefficients," "vars," "nvars," and "blup." See details.

X

Optional: if type = 'blup' and the model was fit in-memory, the design matrix used to fit the model represented in object must be supplied. When supplied, this design matrix will be standardized using the center/scale values in object$std_X_details, so please do not standardize this matrix before supplying here. Note: If the model was fit file-backed, then the filepath to the .bk file with this standardized design matrix is returned as 'std_X' in the fit supplied to 'object'.

lambda

A numeric vector of regularization parameter lambda values at which predictions are requested.

idx

Vector of indices of the penalty parameter lambda at which predictions are required. By default, all indices are returned.

...

Additional optional arguments

Details

Define beta-hat as the coefficients estimated at the value of lambda that minimizes cross-validation error (CVE). Then options for type are as follows:

'response' (default): uses the product of newX and beta-hat to predict new values of the outcome. This does not incorporate the correlation structure of the data. For the stats folks out there, this is simply the linear predictor.
'blup' (acronym for Best Linear Unbiased Predictor): adds to the 'response' a value that represents the esetimated random effect. This addition is a way of incorporating the estimated correlation structure of data into our prediction of the outcome.
'coefficients': returns the estimated beta-hat
'vars': returns the indices of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.
'nvars': returns the number of variables (e.g., SNPs) with nonzero coefficients at each value of lambda. EXCLUDES intercept.

Value

Depends on the type - see Details

Examples

set.seed(123)
train_idx <- sample(1:nrow(admix$X), 100)
# Note: ^ shuffling is important here! Keeps test and train groups comparable.
train <- list(X = admix$X[train_idx,], y = admix$y[train_idx])
train_design <- create_design(X = train$X, y = train$y)

test <- list(X = admix$X[-train_idx,], y = admix$y[-train_idx])
fit <- plmm(design = train_design)

# make predictions for all lambda values
 pred1 <- predict(object = fit, newX = test$X, type = "lp")
 pred2 <- predict(object = fit, newX = test$X, type = "blup", X = train$X)

# look at mean squared prediction error
mspe <- apply(pred1, 2, function(c){crossprod(test$y - c)/length(c)})
min(mspe)

mspe_blup <- apply(pred2, 2, function(c){crossprod(test$y - c)/length(c)})
min(mspe_blup) # BLUP is better

# compare the MSPE of our model to a null model, for reference
# null model = intercept only -> y_hat is always mean(y)
crossprod(mean(test$y) - test$y)/length(test$y)

Predict method to use in cross-validation (within `cvf`)

Description

Predict method to use in cross-validation (within cvf)

Usage

predict_within_cv(
  fit,
  testX,
  type,
  fbm = FALSE,
  Sigma_11 = NULL,
  Sigma_21 = NULL
)

Arguments

fit

A list with the components returned by plmm_fit.

testX

A design matrix used for computing predicted values (i.e, the test data).

type

A character argument indicating what type of prediction should be returned. Passed from cvf(), Options are "lp," "coefficients," "vars," "nvars," and "blup." See details.

fbm

Logical: is trainX an FBM object? If so, this function expects that testX is also an FBM. The two X matrices must be stored the same way.

Sigma_11

Variance-covariance matrix of the training data. Extracted from estimated_Sigma that is generated using all observations. Required if type == 'blup'.

Sigma_21

Covariance matrix between the training and the testing data. Extracted from estimated_Sigma that is generated using all observations. Required if type == 'blup'.

Details

Define beta-hat as the coefficients estimated at the value of lambda that minimizes cross-validation error (CVE). Then options for type are as follows:

'lp' (default): uses the linear predictor (i.e., product of test data and estimated coefficients) to predict test values of the outcome. Note that this approach does not incorporate the correlation structure of the data.
'blup' (acronym for Best Linear Unbiased Predictor): adds to the 'lp' a value that represents the estimated random effect. This addition is a way of incorporating the estimated correlation structure of data into our prediction of the outcome.

Note: the main difference between this function and the predict.plmm() method is that here in CV, the standardized testing data (std_test_X), Sigma_11, and Sigma_21 are calculated in cvf() instead of the function defined here.

Value

A numeric vector of predicted values

a function to format the time

Description

a function to format the time

Usage

pretty_time()

Value

A string with the formatted current date and time

Print method for summary.cv_plmm objects

Description

Print method for summary.cv_plmm objects

Usage

## S3 method for class 'summary.cv_plmm'
print(x, digits, ...)

Arguments

x

An object of class summary.cv_plmm

digits

The number of digits to use in formatting output

...

Not used

Value

Nothing is returned; instead, a message is printed to the console summarizing the results of the cross-validated model fit.

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design)
print(summary(cv_fit))

A function to print the summary of a `plmm` model

Description

A function to print the summary of a plmm model

Usage

## S3 method for class 'summary.plmm'
print(x, ...)

Arguments

x

A summary.plmm object

...

Not used

Value

Nothing is returned; instead, a message is printed to the console summarizing the results of the model fit.

Examples

lam <- rev(seq(0.01, 1, length.out=20)) |> round(2) # for sake of example
admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design, lambda = lam)
fit2 <- plmm(design = admix_design, penalty = "SCAD", lambda = lam)
print(summary(fit, idx = 18))
print(summary(fit2, idx = 18))

A function to read in large data files as an FBM

Description

A function to read in large data files as an FBM

Usage

process_delim(
  data_dir,
  data_file,
  feature_id,
  rds_dir = data_dir,
  rds_prefix,
  logfile = NULL,
  overwrite = FALSE,
  quiet = FALSE,
  ...
)

Arguments

data_dir

The directory to the file.

data_file

The file to be read in, without the filepath. This should be a file of numeric values. Example: use data_file = "myfile.txt", not data_file = "~/mydirectory/myfile.txt" Note: if your file has headers/column names, set 'header = TRUE' – this will be passed into bigmemory::read.big.matrix().

feature_id

A string specifying the column in the data X (the feature data) with the row IDs (e.g., identifiers for each row/sample/participant/, etc.). No duplicates allowed.

rds_dir

The directory where the user wants to create the '.rds' and '.bk' files Defaults to data_dir

rds_prefix

String specifying the user's preferred filename for the to-be-created .rds file (will be create insie rds_dir folder) Note: 'rds_prefix' cannot be the same as 'data_prefix'

logfile

Optional: the name (character string) of the prefix of the logfile to be written. Defaults to 'process_delim', i.e. you will get 'process_delim.log' as the outfile.

overwrite

Optional: the name (character string) of the prefix of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile. Note: If there are multiple .rds files with names that start with "std_prefix_...", this will error out. To protect users from accidentally deleting files with saved results, only one .rds file can be removed with this option.

quiet

Logical: should the messages printed to the console be silenced? Defaults to FALSE.

...

Optional: other arguments to be passed to bigmemory::read.big.matrix(). Note: 'sep' is an option to pass here, as is 'header'.

Value

The file path to the newly created '.rds' file

Examples

temp_dir <- tempdir()
colon_dat <- process_delim(data_file = "colon2.txt",
 data_dir = find_example_data(parent = TRUE), overwrite = TRUE,
 rds_dir = temp_dir, rds_prefix = "processed_colon2", sep = "\t", header = TRUE)

colon2 <- readRDS(colon_dat)
str(colon2)

Preprocess PLINK files using the `bigsnpr` package

Description

Preprocess PLINK files using the bigsnpr package

Usage

process_plink(
  data_dir,
  data_prefix,
  rds_dir = data_dir,
  rds_prefix,
  logfile = NULL,
  impute = TRUE,
  impute_method = "mode",
  id_var = "IID",
  parallel = TRUE,
  quiet = FALSE,
  overwrite = FALSE,
  ...
)

Arguments

data_dir

The path to the bed/bim/fam data files, without a trailing "/" (e.g., use data_dir = '~/my_dir', not data_dir = '~/my_dir/')

data_prefix

The prefix (as a character string) of the bed/fam data files (e.g., data_prefix = 'mydata')

rds_dir

The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to data_dir

rds_prefix

String specifying the user's preferred filename for the to-be-created .rds file (will be create insie rds_dir folder) Note: 'rds_prefix' cannot be the same as 'data_prefix'

logfile

Optional: the name (character string) of the prefix of the logfile to be written in 'rds_dir'. Default to NULL (no log file written). Note: if you supply a file path in this argument, it will error out with a "file not found" error. Only supply the string; e.g., if you want my_log.log, supply 'my_log', the my_log.log file will appear in rds_dir.

impute

Logical: should data be imputed? Default to TRUE.

impute_method

If 'impute' = TRUE, this argument will specify the kind of imputation desired. Options are: * mode (default): Imputes the most frequent call. See bigsnpr::snp_fastImputeSimple() for details. * random: Imputes sampling according to allele frequencies. * mean0: Imputes the rounded mean. * mean2: Imputes the mean rounded to 2 decimal places. * xgboost: Imputes using an algorithm based on local XGBoost models. See bigsnpr::snp_fastImpute() for details. Note: this can take several minutes, even for a relatively small data set.

id_var

String specifying which column of the PLINK .fam file has the unique sample identifiers. Options are "IID" (default) and "FID"

parallel

quiet

Logical: should messages to be printed to the console be silenced? Defaults to FALSE

overwrite

Logical: if existing .bk/.rds files exist for the specified directory/prefix, should these be overwritten? Defaults to FALSE. Set to TRUE if you want to change the imputation method you're using, etc. Note: If there are multiple .rds files with names that start with "std_prefix_...", this will error out. To protect users from accidentally deleting files with saved results, only one .rds file can be removed with this option.

...

Optional: additional arguments to bigsnpr::snp_fastImpute() (relevant only if impute_method = "xgboost")

Details

Three files are created in the location specified by rds_dir:

'rds_prefix.rds': This is a list with three items: (1) X: the filebacked bigmemory::big.matrix object pointing to the imputed genotype data. This matrix has type 'double', which is important for downstream operations in create_design() (2) map: a data.frame with the PLINK 'bim' data (i.e., the variant information) (3) fam: a data.frame with the PLINK 'fam' data (i.e., the pedigree information)
'prefix.bk': This is the backingfile that stores the numeric data of the genotype matrix
'rds_prefix.desc'" This is the description file, as needed by the

Note that process_plink() need only be run once for a given set of PLINK files; in subsequent data analysis/scripts, get_data() will access the '.rds' file.

For an example, see vignette on processing PLINK files

Value

The filepath to the '.rds' object created; see details for explanation.

A function to read in a large file as a numeric file-backed matrix (`FBM`) Note: this function is a wrapper for `bigstatsr::big_read()`

Description

A function to read in a large file as a numeric file-backed matrix (FBM) Note: this function is a wrapper for bigstatsr::big_read()

Usage

read_data_files(
  data_file,
  data_dir,
  rds_dir,
  rds_prefix,
  outfile,
  overwrite,
  quiet,
  ...
)

Arguments

data_file

The name of the file to read, not including its directory. Directory should be specified in data_dir

data_dir

The path to the directory where 'file' is

rds_dir

The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to data_dir

rds_prefix

String specifying the user's preferred filename for the to-be-created .rds/.bk files (will be create insie rds_dir folder) Note: 'rds_prefix' cannot be the same as 'data_file'

outfile

Optional: the name (character string) of the prefix of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile.

overwrite

quiet

Logical: should messages be printed to the console? Defaults to TRUE

...

Optional: other arguments to be passed to bigmemory::read.big.matrix(). Note: 'sep' is an option to pass here.

Value

'.rds', '.bk', and '.desc' files are created in data_dir, and obj (a filebacked ⁠bigmemory big.matrix⁠ object) is returned. See bigmemory documentation for more info on the big.matrix class.

A function to read in PLINK files using `bigsnpr` methods

Description

A function to read in PLINK files using bigsnpr methods

Usage

read_plink_files(
  data_dir,
  data_prefix,
  rds_dir,
  outfile,
  parallel,
  overwrite,
  quiet
)

Arguments

data_dir

The path to the bed/bim/fam data files, without a trailing "/" (e.g., use data_dir = '~/my_dir', not data_dir = '~/my_dir/')

data_prefix

The prefix (as a character string) of the bed/fam data files (e.g., prefix = 'mydata')

rds_dir

The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to data_dir

outfile

Optional: the name (character string) of the prefix of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile.

parallel

overwrite

quiet

Logical: should messages be printed to the console? Defaults to TRUE

Value

'.rds' and '.bk' files are created in data_dir, and obj (a bigSNP object) is returned. See bigsnpr documentation for more info on the bigSNP class.

Calculate a relatedness matrix

Description

Given a matrix of genotypes, this function estimates the genetic relatedness matrix (GRM, also known as the RRM, see Hayes et al. 2009, doi:10.1017/S0016672308009981) among the subjects: XX'/p, where X is standardized.

Usage

relatedness_mat(X, std = TRUE, fbm = FALSE, ns = NULL, ...)

Arguments

X

An n x p numeric matrix of genotypes (from fully-imputed data). Note: This matrix should not include non-genetic features.

std

Logical: should X be standardized? If you set this to FALSE (which can only be done if data are stored in memory), you should have a good reason for doing so, as standardization is a best practice.

fbm

Logical: is X stored as an FBM? Defaults to FALSE

ns

Optional vector of values indicating the indices of nonsingular features

...

Other optional arguments to bigstatsr::bigapply() (like ncores = ...)

Value

An n x n numeric matrix capturing the genomic relatedness of the samples represented in X. In our notation, we call this matrix K for 'kinship'; this is also known as the GRM or RRM.

Examples

RRM <- relatedness_mat(X = admix$X)
RRM[1:5, 1:5]

A function to rotate filebacked data

Description

A function to rotate filebacked data

Usage

rotate_filebacked(prep, ...)

Value

a list with 4 items:

stdrot_X: X on the rotated and re-standardized scale
rot_y: y on the rotated scale (a numeric vector)
stdrot_X_center: numeric vector of values used to center rot_X
stdrot_X_scale: numeric vector of values used to scale rot_X

Compute sequence of lambda values

Description

This function allows you compute a sequence of lambda values for plmm models.

Usage

setup_lambda(
  X,
  y,
  alpha,
  lambda_min,
  nlambda,
  penalty_factor,
  intercept = TRUE
)

Arguments

X

Rotated and standardized design matrix which includes the intercept column if present. May include clinical covariates and other non-SNP data. This can be either a 'matrix' or 'FBM' object.

y

Continuous outcome vector.

alpha

lambda_min

The smallest value for lambda, as a fraction of lambda.max. Default is .001 if the number of observations is larger than the number of covariates and .05 otherwise. A value of lambda_min = 0 is not supported.

nlambda

The desired number of lambda values in the sequence to be generated.

penalty_factor

intercept

Logical: does X contain an intercept column? Defaults to TRUE.

Value

a numeric vector of lambda values, equally spaced on the log scale

A helper function to standardize a filebacked matrix

Description

A helper function to standardize a filebacked matrix

Usage

standardize_filebacked(
  X,
  new_file,
  rds_dir,
  non_gen,
  complete_outcome,
  id_var,
  outfile,
  quiet,
  overwrite
)

Arguments

X

A list that includes: (1) subset_X: a big.matrix object that has been subset &/or had any additional predictors appended as columns (2) ns: a numeric vector indicating the indices of nonsingular columns in subset_X

new_file

The new_file (as a character string) of the bed/fam data files (e.g., new_file = 'mydata')

rds_dir

The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to data_dir

outfile

Optional: the name (character string) of the new_file of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile.

quiet

Logical: should messages be printed to the console? Defaults to FALSE (which leaves the print messages on...)

overwrite

Logical: if existing .bk/.rds files exist for the specified directory/new_file, should these be overwritten?

Value

A list with a new component of obj called 'std_X' - this is an FBM with column-standardized data. List also includes several other indices/meta-data on the standardized matrix

A helper function to standardize matrices

Description

A helper function to standardize matrices

Usage

standardize_in_memory(X)

Arguments

X

a matrix

Details

This function is adapted from https://github.com/pbreheny/ncvreg/blob/master/R/std.R NOTE: this function returns a matrix in memory. For standardizing filebacked data, use big_std() – see src/big_standardize.cpp

Value

a list with the standardized matrix, vectors with the centering/scaling values, and a vector with the indices of nonsingular columns

A helper function to subset `big.matrix` objects

Description

A helper function to subset big.matrix objects

Usage

subset_filebacked(X, new_file, complete_samples, ns, rds_dir, outfile, quiet)

Arguments

X

A filebacked big.matrix with the to-be-standardized design matrix

new_file

Optional user-specified new_file for the to-be-created .rds/.bk files.

complete_samples

Numeric vector with indicesmarking the rows of the original data which have a non-missing entry in the 6th column of the .fam file

ns

Numeric vector with the indices of the non-singular columns This vector is created in handle_missingness()

rds_dir

The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to data_dir

outfile

Optional: the name (character string) of the new_file of the logfile to be written. Defaults to 'process_plink', i.e. you will get 'process_plink.log' as the outfile.

quiet

Logical: should messages be printed to the console? Defaults to FALSE (which leaves the print messages on...)

Value

A list with two components. First, a big.matrix object, 'subset_X', representing a design matrix wherein:

rows are subset according to user's specification in handle_missing_phen
columns are subset so that no constant features remain – this is important for standardization downstream The list also includes the integer vector 'ns' which marks which columns of the original matrix were 'non-singular' (i.e. not constant features). The 'ns' index plays an important role in plmm_format() and untransform() (both helper functions in model fitting)

A summary function for cv_plmm objects

Description

A summary function for cv_plmm objects

Usage

## S3 method for class 'cv_plmm'
summary(object, lambda = "min", ...)

Arguments

object

A cv_plmm object

lambda

The regularization parameter value at which inference should be reported. Can choose a numeric value, 'min', or '1se'. Defaults to 'min.'

...

Not used

Value

The return value is an object with S3 class summary.cv_plmm. The class has its own print method and contains the following list elements:

lambda_min: The lambda value at the minimum cross validation error
lambda.1se: The maximum lambda value within 1 standard error of the minimum cross validation error
penalty: The penalty applied to the fitted model
nvars: The number of non-zero coefficients at the selected lambda value
cve: The cross validation error at all folds
min: The minimum cross validation error
fit: The plmm fit used in the cross validation

if return_bias_details = TRUE, two more items are returned:

bias: The mean bias of the cross validation
loss: The loss at each value of lambda

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
cv_fit <- cv_plmm(design = admix_design)
summary(cv_fit)

A summary method for the plmm objects

Description

A summary method for the plmm objects

Usage

## S3 method for class 'plmm'
summary(object, lambda, idx, eps = 1e-05, ...)

Arguments

object

An object of class plmm

lambda

The regularization parameter value at which inference should be reported.

idx

Alternatively, lambda may be specified by an index; idx=10 means: report inference for the 10th value of lambda along the regularization path. If both lambda and idx are specified, lambda takes precedence.

eps

If lambda is given, eps is the tolerance for difference between the given lambda value and a lambda value from the object. Defaults to 0.0001 (1e-5)

...

Not used

Value

The return value is an object with S3 class summary.plmm. The class has its own print method and contains the following list elements:

penalty: The penalty used by plmm (e.g. SCAD, MCP, lasso)
n: Number of instances/observations
std_X_n: the number of observations in the standardized data; the only time this would differ from 'n' is if data are from PLINK and the external data does not include all the same samples
p: Number of regression coefficients (not including the intercept)
converged: Logical indicator for whether the model converged
lambda: The lambda value at which inference is being reported
lambda_char: A formatted character string indicating the lambda value
nvars: The number of nonzero coefficients (again, not including the intercept) at that value of lambda
nonzero: The column names indicating the nonzero coefficients in the model at the specified value of lambda

Examples

admix_design <- create_design(X = admix$X, y = admix$y)
fit <- plmm(design = admix_design)
summary(fit, idx = 97)

Untransform coefficient values back to the original scale

Description

This function unwinds the initial standardization of the data to obtain coefficient values on their original scale. It is called by plmm_format().

Usage

untransform(
  std_scale_beta,
  p,
  std_X_details,
  fbm_flag,
  plink_flag,
  use_names = TRUE
)

Arguments

std_scale_beta

The estimated coefficients on the standardized scale

p

The number of columns in the original design matrix

std_X_details

A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements 'scale', 'center', and 'ns'

fbm_flag

Logical: is the corresponding design matrix filebacked?

plink_flag

use_names

Logical: should names be added? Defaults to TRUE. Set to FALSE inside of cvf() helper, as 'ns' will vary within CV folds.

Value

a matrix of estimated coeffcients, 'beta_vals', that is on the scale of the original data.

Untransform coefficient values back to the original scale for file-backed data

Description

This function unwinds the initial standardization of the data to obtain coefficient values on their original scale. It is called by plmm_format().

Usage

untransform_delim(
  std_scale_beta,
  p,
  std_X_details,
  plink_flag,
  use_names = TRUE
)

Arguments

std_scale_beta

The estimated coefficients on the standardized scale

p

The number of columns in the original design matrix

std_X_details

A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements 'scale', 'center', and 'ns'

use_names

Logical: should names be added? Defaults to TRUE. Set to FALSE inside of cvf() helper, as 'ns' will vary within CV folds.

Value

a matrix of estimated coeffcients, 'beta_vals', that is on the scale of the original data.

Untransform coefficient values back to the original scale In memory

Description

This function unwinds the initial standardization of the data to obtain coefficient values on their original scale. It is called by plmm_format().

Usage

untransform_in_memory(std_scale_beta, p, std_X_details, use_names = TRUE)

Arguments

std_scale_beta

The estimated coefficients on the standardized scale

p

The number of columns in the original design matrix

std_X_details

A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements 'scale', 'center', and 'ns'

use_names

Logical: should names be added? Defaults to TRUE. Set to FALSE inside of cvf() helper, as 'ns' will vary within CV folds.

Value

a matrix of estimated coeffcients, 'beta_vals', that is on the scale of the original data.

Untransform coefficient values back to the original scale for file-backed data

Description

This function unwinds the initial standardization of the data to obtain coefficient values on their original scale. It is called by plmm_format().

Usage

untransform_plink(
  std_scale_beta,
  p,
  std_X_details,
  plink_flag,
  use_names = TRUE
)

Arguments

std_scale_beta

The estimated coefficients on the standardized scale

p

The number of columns in the original design matrix

std_X_details

A list with 3 elements describing the standardized design matrix BEFORE rotation; this should have elements 'scale', 'center', and 'ns'

use_names

Logical: should names be added? Defaults to TRUE. Set to FALSE inside of cvf() helper, as 'ns' will vary within CV folds.

Value

a matrix of estimated coeffcients, 'beta_vals', that is on the scale of the original data.

For Linux/Unix and MacOS only, here is a companion function to unzip the .gz files that ship with the `plmmr` package

Description

For Linux/Unix and MacOS only, here is a companion function to unzip the .gz files that ship with the plmmr package

Usage

unzip_example_data(outdir)

Arguments

outdir

The file path to the directory to which the .gz files should be written

Details

For an example of this function, look at vignette('plink_files', package = "plmmr"). Note again: this function will not work on Windows systems - only for Linux/Unix and MacOS.

Value

Nothing is returned; the PLINK files that ship with the plmmr package are stored in the directory specified by 'outdir'

plmmr: Penalized Linear Mixed Models for Correlated Data

Description

Author(s)

See Also

helper function to implement MCP penalty The helper functions to implement each penalty.

Description

Usage

Arguments

Value

helper function to implement SCAD penalty

Description

Usage

Arguments

Value

A helper function to add predictors to a filebacked matrix of data

Description

Usage

Arguments

Value

Admix: Semi-simulated SNP data

Description

Usage

Format

Source

A helper function to support process_plink()

Description

Usage

Arguments

Value

a version of cbind() for file-backed matrices

Description

Usage

Arguments

Value

check_for_file_extension: a function to make our package 'smart' enough to handle .rds file extensions

Description

Usage

Arguments

Value

Coef method for "cv_plmm" class

Description

Usage

Arguments

Value

Examples

Coef method for "plmm" class

Description

Usage

Arguments

Value

Examples

a function to create the estimated variance matrix from a PLMM fit

Description

Usage

Arguments

Value

A helper function to count constant features

Description

Usage

Arguments

Value

A helper function to count the number of cores available on the current machine

Description

Usage

Value

a function to create a design for PLMM modeling

Description

Usage

Arguments

Details

Value

Examples

A function to create a design matrix, outcome, and penalty factor to be passed to a model fitting function

Description

Usage

Arguments

Value

A function to create a design with an in-memory X matrix

Description

Usage

A helper function to support `process_plink()`

Estimate eta (to be used in rotating the data) This function is called internally by `plmm()`

Read in processed data This function is intended to be called after either `process_plink()` or `process_delim()` has been called once.