Type: Package
Title: Simulation of High-Dimensional Data and Parallelized Repeated Penalized Regression
Version: 1.3.6
Author: Levi Waldron <lwaldron.research@gmail.com>
Maintainer: Levi Waldron <lwaldron.research@gmail.com>
Depends: R (≥ 2.10.0)
Imports: methods, parallel, penalized, MASS
Suggests: survivalROC, survival, rmarkdown, knitr
Description: Simulation of continuous, correlated high-dimensional data with time to event or binary response, and parallelized functions for Lasso, Ridge, and Elastic Net penalized regression with repeated starts and two-dimensional tuning of the Elastic Net.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
LazyLoad: yes
VignetteBuilder: knitr
URL: https://waldronlab.io/pensim/
BugReports: https://github.com/waldronlab/pensim/issues
NeedsCompilation: no
Repository: CRAN
Packaged: 2022-12-06 12:10:19 UTC; levi
Date/Publication: 2022-12-09 00:10:02 UTC

Functions and data for simulation of high-dimensional data and parallelized repeated penalized regression

Description

Simulation of continuous, correlated high-dimensional data with time-to-event or binary response, and parallelized functions for Lasso, Ridge, and Elastic Net penalized regression model training and validation by split-sample or nested cross-validation. See the help page for opt.nested.crossval() for the most extensive usage examples.

Details

Package: pensim
Type: Package
License: GPL (>=2)
LazyLoad: yes

Model training and validation by Lasso, Ridge, and Elastic Net penalized regression. This package also contains a function for simulation of correlated high-dimensional data with binary or time-to-event response.

Author(s)

Levi Waldron

Maintainer: Levi Waldron <lwaldron.research@gmail.com>

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)

See Also

penalized-package

Examples

set.seed(9)
## 
## create some data,  with one of a group of five correlated variables
## having an association with the binary outcome:
## 
x <- create.data(
  nvars = c(10, 3),
  cors = c(0, 0.8),
  associations = c(0, 2),
  firstonly = c(TRUE, TRUE),
  nsamples = 50,
  response = "binary",
  logisticintercept = 0.5
)
x$summary
##
##predictor data frame and binary response vector
##
pen.data <- x$data[, -match("outcome", colnames(x$data))]
response <- x$data[, match("outcome", colnames(x$data))]
## lasso regression.  Note that epsilon=1e-2 is passed onto optL1,  and
## reduces the precision of the tuning compared to the default 1e-10.
output <-
  opt1D(
    nsim = 1,
    nprocessors = 1,
    penalized = pen.data,
    response = response,
    epsilon = 1e-2
  )
cc <-
  output[which.max(output[, "cvl"]), -1:-3]  ##non-zero b.* are true positives

Lung adenocarcinoma microarray expression data of Beer et al. (2002)

Description

Lung adenocarcinomas were profiled by Beer et al. (2002) using Affymetrix hu6800 microarrays. The data here were normalized from raw .CEL files by RMAExpress (v0.3). The expression matrix contains expression data for 86 patients with 7,129 probe sets.

Usage

data(beer.exprs)

Format

A data frame with 7129 probe sets (rows) for 86 patients (columns)

Source

Beer DG, Kardia SL, Huang C, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 2002, 8:816-824.

References

Irizarry, R.A., et al. (2003) Summaries of Affymetrix GeneChip probe level data, Nucl. Acids Res., 31, e15+-e15+.

Examples

data(beer.exprs)
mysd <- apply(beer.exprs, 1, sd)
beer.subset <- as.matrix(beer.exprs[rank(-mysd) <= 100, ])  
heatmap(beer.subset)

Survival data for Beer et al. (2002) lung adenocarcinoma study

Description

Overall survival time for 86 lung adenocarcinoma patients, with 62 of the 86 events being censored.

Usage

data(beer.survival)

Format

A data frame with 86 observations on the following 2 variables.

status

a numeric vector

os

a numeric vector

Source

Beer DG, Kardia SL, Huang C, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S: Gene-expression pr ofiles predict survival of patients with lung adenocarcinoma. Nat Med 2002, 8:816-824.

Examples

data(beer.survival)
library(survival)
surv.obj <- with(beer.survival, Surv(os, status))
surv.obj.rev <- with(beer.survival, Surv(os, 1-status))
survfit(surv.obj.rev~1)  #reverse KM estimate of follow-up time (months)
(my.survfit <- survfit(surv.obj~1))  ##KM estimate of survival
plot(my.survfit, xlab="Time (months)", 
     ylab="KM estimate of overall survival")
legend("bottomright", lty=c(1, 2), pch=-1,
       legend=c("KM estimate", "95 percent confidence interval"))

simulate correlated predictors with time-to-event or binary outcome

Description

This function creates multiple groups of predictor variables which may be correlated within each group, and binary or survival time (without censoring) response according to specified weights of the predictors.

Usage

create.data(nvars = c(100, 100, 100, 100, 600),
            cors = c(0.8, 0, 0.8, 0, 0),
            associations = c(0.5, 0.5, 0.3, 0.3, 0),
            firstonly = c(TRUE, FALSE, TRUE, FALSE, FALSE),
            nsamples = 100,
            censoring = "none",
            labelswapprob = 0,
            response = "timetoevent",
            basehaz = 0.2,
            logisticintercept = 0)

Arguments

nvars

integer vector giving the number of variables of each variable type. The number of variable types is equal to the length of this vector.

cors

integer vector of the same length as nvars, giving the population pairwise Pearson correlation within each group.

associations

integer vector of the same length as nvars, giving the associations of each type with outcome

firstonly

logical vector of the same length as nvars, specifying whether only the first variable of each type is associated with outcome (TRUE) or all variables of that type (FALSE)

nsamples

an integer giving the number of observations

censoring

"none" for no censoring, or a vector of length two c(a,b) for uniform U(a,b) censoring.

labelswapprob

This provides an option to add uncertainty to binary outcomes by randomly switching labels with probability labelswapprob. The probability of a label being swapped is independent for each observation. The value is ignored if response is "timetoevent"

response

either "timetoevent" or "binary"

basehaz

baseline hazard, used for "timetoevent"

logisticintercept

intercept which is added to X%*%Beta for "binary"

Details

This function simulates "predictor" variables in one or more groups, which are standard normally distributed. The user can specify the population correlation within each variable group, the association of each variable group to outcome, and whether the first or all variables of that type should be associated with outcome. The simulated response variable can be time to event with an exponential distribution, or binary survival with a logistic distribution.

Value

Returns a list with items:

summary

a summary of the variable types produced

associations

weights of each variable in computing the outcome

covariance

covariance matrix used for generating potentially correlated random predictors

data

dataframe containing the predictors and response. Response is the last column for binary outcome ("outcome"), and the last two columns for timetoevent outcome ("time" and "cens")

Note

Depends on the MASS package for correlated random number generation

Author(s)

Levi Waldron et al.

References

Waldron L., Pintilie M., Tsao M.-S., Shepherd F. A., Huttenhower C.*, and Jurisica I.* Optimized application of penalized regression methods to diverse genomic data. (2010). Under review. (*equal contribution)

Examples

##binary outcome example
set.seed(9)
x <-
  create.data(
    nvars = c(15, 5),
    cors = c(0, 0.8),
    associations = c(0, 2),
    firstonly = c(TRUE, TRUE),
    nsamples = 50,
    response = "binary",
    logisticintercept = 0.5
  )
summary(x)
x$summary
model <- glm(outcome ~ .,  data = x$data, family = binomial)
summary(model)
dat <- t(as.matrix(x$data[, -match("outcome", colnames(x$data))]))
heatmap(dat, ColSideColors = ifelse(x$data$outcome == 0, "black", "white"))

##censored survival outcome example:
set.seed(1)
x <- create.data(
  nvars = c(15, 5),
  cors = c(0, 0.8),
  associations = c(0, 2),
  firstonly = c(TRUE, TRUE),
  nsamples = 50,
  censoring = c(2, 10),
  response = "timetoevent"
)
sum(x$data$cens == 0) / nrow(x$data)  #34 percent censoring

library(survival)
surv.obj <- Surv(x$data$time, x$data$cens)
plot(survfit(surv.obj ~ 1), ylab = "Survival probability", xlab = "time")

Parallelized calculation of cross-validated risk score predictions from L1/L2/Elastic Net penalized regression.

Description

calculates risk score predictions by a nested cross-validation, using the optL1 and optL2 functions of the penalized R package for regression. In the outer level of cross-validation, samples are split into training and test samples. Model parameters are tuned by cross-validation within training samples only.

By setting nprocessors > 1, the outer cross-validation is split between multiple processors.

The functions support z-score scaling of training data, and application of these scaling and shifting coefficients to the test data. It also supports repeated tuning of the penalty parameters and selection of the model with greatest cross-validated likelihood.

Usage

opt.nested.crossval(outerfold=10, nprocessors=1, cl=NULL, ...)

Arguments

outerfold

number of folds in outer cross-validation (the level used for validation)

nprocessors

An integer number of processors to use. If specified in opt.nested.crossval, iterations of the outer cross-validation are sent to different processors. If specified in opt.splitval, repeated starts for the penalty tuning are sent to different processors.

cl

Optional cluster object created with the makeCluster() function of the parallel package. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK"). Setting this parameter can enable parallelization in more diverse scenarios than multi-core desktops; see the documentation for the parallel package. Note that if cl is user-defined, this function will not automatically run parallel::stopCluster() to shut down the cluster.

...

optFUN (either "opt1D" or "opt2D"), scaling (TRUE to z-score training data then apply the same shift and scale factors to test data, FALSE for no scaling) are passed onto the opt.splitval function. Additional arguments are required, to be passed to the optL1 or optL2 function of the penalized R package. See those help pages, and it may be desirable to test these arguments directly on optL1 or optL2 before using this more CPU-consuming and complex function.

Details

This function calculates cross-validated risk score predictions, tuning a penalized regression model using the optL1 or optL2 functions of the penalized R package, for each iteration of the cross-validation. Tuning is done by cross-validation in the training samples only. Test samples are scaled using the shift and scale factors determined from the training samples. parameter. If nprocessors > 1, it uses the SNOW package for parallelization, dividing the iterations of the outer cross-validation among the specified number of processors.

Some arguments MUST be passed (through the ... arguments) but which are documented for the functions in which they are used. These include, from the opt.splitval function:

optFUN="opt1D" for Lasso or Ridge regression, or "opt2D" for Elastic Net. See the help pages for opt1D and opt2D for additional arguments associated with these functions.

scaling=TRUE to scale each feature (column) of the training sample to z-scores. These same scaling and shifting factors are applied to the test data. If FALSE, no scaling is done. Note that only data in the penalized argument are scaled, not the optional unpenalized argument (see documentation for opt1D, opt2D, or cvl from the penalized package for descriptions of the penalized and unpenalized arguments). Alternatively, the standardize=TRUE argument to the penalized package functions can be used to do scaling internally.

nsim=50 this number specifies the number of times to repeat tuning of the penalty parameters on different data foldings for the cross-validation.

setpen="L1" or "L2" : if optFUN="opt1D", this sets regression type to LASSO or Ridge, respectively. See ?opt1D.

L1range, L2range, dofirst, L1gridsize, L2gridsize: options for Elastic Net regression if optFUN="opt2D". See ?opt2D.

Value

Returns a vector of cross-validated continuous risk score predictions.

Note

Depends on the R packages: penalized, parallel

Author(s)

Levi Waldron et al.

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)

See Also

opt.splitval

Examples

data(beer.exprs)
data(beer.survival)

##select just 100 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100), ]

gene.quant <- apply(beer.exprs.sample, 1, quantile, probs = 0.75)
dat.filt <- beer.exprs.sample[gene.quant > log2(100), ]
gene.iqr <- apply(dat.filt, 1, IQR)
dat.filt <- as.matrix(dat.filt[gene.iqr > 0.5, ])
dat.filt <- t(dat.filt)
dat.filt <- data.frame(dat.filt)

library(survival)
surv.obj <- Surv(beer.survival$os, beer.survival$status)

## First, test the regression arguments using functions from
## the penalized package.  I use maxlambda1=5 here to ensure at least
## one non-zero coefficient.
testfit <- penalized::optL1(
  response = surv.obj,
  maxlambda1 = 3,
  penalized = dat.filt,
  fold = 2,
  positive = FALSE,
  standardize = TRUE,
  trace = TRUE
)

## Now pass these arguments to opt.nested.splitval() for cross-validated
## calculation and assessment of risk scores, with the additional
## arguments:
##    outerfold and nprocessors (?opt.nested.crossval)
##    optFUN and scaling (?opt.splitval)
##    setpen and nsim (?opt1D)

## Ideally nsim would be 50, and outerfold and fold would be 10, but the
## values below speed computation 200x compared to these recommended
## values.  Note that here we are using the standardize=TRUE argument of
## optL1 rather than the scaling=TRUE argument of opt.splitval.  These
## two approaches to scaling are roughly equivalent, but the scaling
## approaches are not the same (scaling=TRUE does z-score,
## standardize=TRUE scales to unit central L2 norm), and results will
## not be identical.  Also, using standardize=TRUE scales variables but
## provides coeffients for the original scale, whereas using
## scaling=TRUE scales variables in the training set then applies the
## same scales to the test set.
set.seed(1)
## In this example I use two processors:
preds <-
  pensim::opt.nested.crossval(
    outerfold = 2,
    nprocessors = 1,
    #opt.nested.crossval arguments
    optFUN = "opt1D",
    scaling = FALSE,
    #opt.splitval arguments
    setpen = "L1",
    nsim = 1,
    #opt1D arguments
    response = surv.obj,
    #rest are penalized::optl1 arguments
    penalized = dat.filt,
    fold = 2,
    maxlambda1 = 5,
    positive = FALSE,
    standardize = TRUE,
    trace = FALSE
  )

## We probably also want the coefficients from the model fit on all the
## data, for future use:
beer.coefs <- pensim::opt1D(
  setpen = "L1",
  nsim = 1,
  maxlambda1 = 5,
  response = surv.obj,
  penalized = dat.filt,
  fold = 2,
  positive = FALSE,
  standardize = TRUE,
  trace = FALSE
)

## We can also include unpenalized covariates, if desired.
## Note that when keeping only one variable for a penalized or
## unpenalized covariate, indexing a dataframe like [1] instead of doing
## [,1] preserves the variable name.  With [,1] the variable name gets
## converted to "".

beer.coefs <- pensim::opt1D(
  setpen = "L1",
  nsim = 1,
  maxlambda1 = 5,
  response = surv.obj,
  penalized = dat.filt[-1],
  # This is equivalent to dat.filt[, -1]
  unpenalized = dat.filt[1],
  fold = 2,
  positive = FALSE,
  standardize = TRUE,
  trace = FALSE
)
## (note the non-zero first coefficient this time, due to it being unpenalized).

## Summarization and plotting.
preds.dichot <- preds > median(preds)

coxfit.continuous <- coxph(surv.obj ~ preds)
coxfit.dichot <- coxph(surv.obj ~ preds.dichot)
summary(coxfit.continuous)
summary(coxfit.dichot)

nobs <- length(preds)
cutoff <- 12
if (requireNamespace("survivalROC", quietly = TRUE)) {
preds.roc <-
  survivalROC::survivalROC(
    Stime = beer.survival$os,
    status = beer.survival$status,
    marker = preds,
    predict.time = cutoff,
    span = 0.25 * nobs ^ (-0.20)
  )
 plot(
  preds.roc$FP,
  preds.roc$TP,
  type = "l",
  xlim = c(0, 1),
  ylim = c(0, 1),
  lty = 2,
  xlab = paste("FP", "\n", "AUC = ", round(preds.roc$AUC, 3)),
  ylab = "TP",
  main = "LASSO predictions\n ROC curve at 12 months"
 )
 abline(0, 1)
 }

Parallelized calculation of split training/test set predictions from L1/L2/Elastic Net penalized regression.

Description

uses a single training/test split to train a penalized regression model in the training samples, then use the model to calculate values of the linear risk score in the test samples. This function is used by opt.nested.crossval, but can also be used on its own.

This function support z-score scaling of training data, and application of these scaling and shifting coefficients to the test data. It also supports repeated tuning of the penalty parameters and selection of the model with greatest cross-validated likelihood.

Usage

opt.splitval(optFUN="opt1D",testset="equal",scaling=TRUE,...)

Arguments

optFUN

"opt1D" for Lasso or Ridge regression, "opt2D" for Elastic Net. See the help pages for these functions for additional arguments.

testset

For the opt.splitval function ONLY. "equal" for randomly assigned equal training and test sets, or an integer vector defining the positions of the test samples in the response, penalized, and unpenalized arguments which are passed to the optL1, optL2, or cvl functions of the penalized R package.

scaling

If TRUE, each feature (column) of the training samples (in matrix/dataframe specified by the penalized argument) are scaled to z-scores, then these scaling and shifting factors are applied to the test data. If FALSE, no scaling is done.

...

Additional arguments are required, to be passed to the optL1 or optL2 function of the penalized R package. See those help pages, and it may be desirable to test these arguments directly on optL1 or optL2 before using this more CPU-consuming and complex function.

Details

This function does split sample model training and testing for a single split of the data, using the optL1 or optL2 functions of the penalized R package, for each iteration of the cross-validation. Scaling of the test samples is done independently, using scale factors determined from the training samples. Repeated starts of model training can be parallelized as documented in the opt1D and opt2D functions. This function is used for nested cross-validation by the opt.nested.crossval function.

Value

Returns a vector of cross-validated continuous risk score predictions.

Note

Depends on the R packages: penalized, parallel, rlecuyer

Author(s)

Levi Waldron et al.

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)

See Also

opt1D, opt2D, opt.nested.crossval

Examples

data(beer.exprs)
data(beer.survival)

## select just 250 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 250), ]

gene.quant <- apply(beer.exprs.sample, 1, quantile, probs = 0.75)
dat.filt <- beer.exprs.sample[gene.quant > log2(100),]
gene.iqr <- apply(dat.filt, 1, IQR)
dat.filt <- as.matrix(dat.filt[gene.iqr > 0.5,])
dat.filt <- t(dat.filt)

library(survival)
surv.obj <- Surv(beer.survival$os, beer.survival$status)

## Single split training/test evaluation.  Ideally nsim would be 50 and
## fold=10, but this requires 100x more resources.
set.seed(1)
preds50 <- opt.splitval(
  optFUN = "opt1D",
  scaling = TRUE,
  testset = "equal",
  setpen = "L1",
  nsim = 1,
  nprocessors = 1,
  response = surv.obj,
  penalized = dat.filt,
  fold = 5,
  positive = FALSE,
  standardize = FALSE,
  trace = FALSE
)

preds50.dichot <- preds50 > median(preds50)

surv.obj.50 <-
  surv.obj[match(names(preds50), rownames(beer.survival))]
coxfit50.continuous <- coxph(surv.obj.50 ~ preds50)
coxfit50.dichot <- coxph(surv.obj.50 ~ preds50.dichot)
summary(coxfit50.continuous)
summary(coxfit50.dichot)

Parallelized repeated tuning of Lasso or Ridge penalty parameter

Description

This function is a wrapper to the optL1 and optL2 functions of the penalized R package, useful for parallelized repeated tuning of the penalty parameters.

Usage

opt1D(nsim = 50, nprocessors = 1, setpen = "L1", cl = NULL, ...)

Arguments

nsim

Number of times to repeat the simulation (around 50 is suggested)

nprocessors

An integer number of processors to use.

setpen

Either "L1" (Lasso) or "L2" (Ridge) penalty

cl

Optional cluster object created with the makeCluster() function of the parallel package. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK"). Setting this parameter can enable parallelization in more diverse scenarios than multi-core desktops; see the documentation for the parallel package. Note that if cl is user-defined, this function will not automatically run parallel::stopCluster() to shut down the cluster.

...

arguments passed on to optL1 or optL2 function of the penalized R package

Details

This function sets up a SNOW (Simple Network of Workstations) "sock" cluster to parallelize the task of repeated tunings the L1 or L2 penalty parameter. Tuning of the penalty parameters is done by the optL1 or optL2 functions of the penalized R package.

Value

Returns a matrix with the following columns:

L1 (or L2)

optimized value of the penalty parameter

cvl

optimized cross-validated likelihood

coef_1, coef_2, ..., coef_n

argmax coefficients for the model with this value of the tuning parameter

The matrix contains one row for each repeat of the regression.

Note

Depends on the R packages: penalized, parallel, rlecuyer

Author(s)

Levi Waldron et al.

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)

See Also

optL1, optL2

Examples

data(beer.exprs)
data(beer.survival)

##select just 100 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100),]

gene.quant <- apply(beer.exprs.sample, 1, quantile, probs = 0.75)
dat.filt <- beer.exprs.sample[gene.quant > log2(100),]
gene.iqr <- apply(dat.filt, 1, IQR)
dat.filt <- as.matrix(dat.filt[gene.iqr > 0.5,])
dat.filt <- t(dat.filt)

##define training and test sets
set.seed(1)
trainingset <- sample(rownames(dat.filt), round(nrow(dat.filt) / 2))
testset <-
  rownames(dat.filt)[!rownames(dat.filt) %in% trainingset]

dat.training <- data.frame(dat.filt[trainingset, ])
pheno.training <- beer.survival[trainingset, ]

library(survival)
surv.training <- Surv(pheno.training$os, pheno.training$status)

dat.test <- data.frame(dat.filt[testset, ])
all.equal(colnames(dat.training), colnames(dat.test))
pheno.test <- beer.survival[testset, ]
surv.test <- Surv(pheno.test$os, pheno.test$status)

##ideally nsim should be on the order of 50,  but this slows computation
##50x without parallelization.
set.seed(1)
output <-
  pensim::opt1D(
    nsim = 1,
    nprocessors = 1,
    setpen = "L2",
    response = surv.training,
    penalized = dat.training,
    fold = 3,
    positive = FALSE,
    standardize = TRUE,
    minlambda2 = 1,
    maxlambda2 = 100
  )

cc <- output[which.max(output[, "cvl"]),-(1:2)]  #coefficients
sum(abs(cc) > 0)  #count non-zero coefficients

preds.training <- as.matrix(dat.training) %*% cc
preds.training.median <- median(preds.training)
preds.training.dichot <-
  ifelse(preds.training > preds.training.median, "high risk", "low risk")
preds.training.dichot <-
  factor(preds.training.dichot[, 1], levels = c("low risk", "high risk"))
preds.test <- as.matrix(dat.test) %*% cc
preds.test.dichot <-
  ifelse(preds.test > preds.training.median, "high risk", "low risk")
preds.test.dichot <-
  factor(preds.test.dichot[, 1], levels = c("low risk", "high risk"))

coxphfit.training <- coxph(surv.training ~ preds.training.dichot)
survfit.training <- survfit(surv.training ~ preds.training.dichot)
summary(coxphfit.training)
coxphfit.test <- coxph(surv.test ~ preds.test.dichot)
survfit.test <- survfit(surv.test ~ preds.test.dichot)
summary(coxphfit.test)

(p.training <-
    signif(summary(coxphfit.training)$logtest[3], 2))  #likelihood ratio test
(hr.training <- signif(summary(coxphfit.training)$conf.int[1], 2))
(hr.lower.training <- summary(coxphfit.training)$conf.int[3])
(hr.upper.training <- summary(coxphfit.training)$conf.int[4])
par(mfrow = c(1, 2))
plot(
  survfit.training,
  col = c("black", "red"),
  conf.int = FALSE,
  xlab = "Months",
  main = "TRAINING",
  ylab = "Overall survival"
)
xmax <- par("usr")[2] - 50
text(
  x = xmax,
  y = 0.4,
  lab = paste("HR=", hr.training),
  pos = 2
)
text(
  x = xmax,
  y = 0.3,
  lab = paste("p=", p.training, "", sep = ""),
  pos = 2
)
tmp <- summary(preds.training.dichot)
text(
  x = c(xmax, xmax),
  y = c(0.2, 0.1),
  lab = paste(tmp, names(tmp)),
  col = 1:2,
  pos = 2
)
(p.test <-
    signif(summary(coxphfit.test)$logtest[3], 2))  #likelihood ratio test
(hr.test <- signif(summary(coxphfit.test)$conf.int[1], 2))
(hr.lower.test <- summary(coxphfit.test)$conf.int[3])
(hr.upper.test <- summary(coxphfit.test)$conf.int[4])
plot(
  survfit.test,
  col = c("black", "red"),
  conf.int = FALSE,
  xlab = "Months",
  main = "TEST"
)
text(
  x = xmax,
  y = 0.4,
  lab = paste("HR=", hr.test),
  pos = 2
)
text(
  x = xmax,
  y = 0.3,
  lab = paste("p=", p.test, "", sep = ""),
  pos = 2
)
tmp <- summary(preds.test.dichot)
text(
  x = c(xmax, xmax),
  y = c(0.2, 0.1),
  lab = paste(tmp, names(tmp)),
  col = 1:2,
  pos = 2
)

Parallelized, two-dimensional tuning of Elastic Net L1/L2 penalties

Description

This function implements parallelized two-dimensional optimization of Elastic Net penalty parameters. This is accomplished by scanning a regular grid of L1/L2 penalties, then using the top five CVL penalty combinations from this grid as starting points for the convex optimization problem.

Usage

opt2D(nsim,
      L1range = c(0.001, 100),
      L2range = c(0.001, 100),
      dofirst = "both",
      nprocessors = 1,
      L1gridsize = 10, L2gridsize = 10,
      cl = NULL,
      ...)

Arguments

nsim

Number of times to repeat the simulation (around 50 is suggested)

L1range

numeric vector of length two, giving minimum and maximum constraints on the L1 penalty

L2range

numeric vector of length two, giving minimum and maximum constraints on the L2 penalty

dofirst

"L1" to optimize L1 followed by L2, "L2" to optimize L2 followed by L1, or "both" to optimize both simultaneously in a two-dimensional optimization.

nprocessors

An integer number of processors to use.

L1gridsize

Number of values of the L1 penalty in the regular grid of L1/L2 penalties

L2gridsize

Number of values of the L2 penalty in the regular grid of L1/L2 penalties

cl

Optional cluster object created with the makeCluster() function of the parallel package. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK"). Setting this parameter can enable parallelization in more diverse scenarios than multi-core desktops; see the documentation for the parallel package. Note that if cl is user-defined, this function will not automatically run parallel::stopCluster() to shut down the cluster.

...

arguments passed on to optL1 and optL2 (dofirst="L1" or "L2"), or cvl (dofirst="both") functions of the penalized R package

Details

This function sets up a SNOW (Simple Network of Workstations) "sock" cluster to parallelize the task of repeated tunings the Elastic Net penalty parameters. Three methods are implemented, as described by Waldron et al. (2011): lambda1 followed by lambda2 (lambda1-lambda2), lambda2 followed by lambda1 (lambda2-lambda1), and lambda1 with lambda2 simultaneously (lambda1+lambda2). Tuning of the penalty parameters is done by the optL1 or optL2 functions of the penalized R package.

Value

Returns a matrix with the following columns:

L1

optimized value of the L1 penalty parameter

L2

optimized value of the L2 penalty parameter

cvl

optimized cross-validated likelihood

convergence

0 if the optimization converged, non-zero otherwise (see stats:optim for details)

fncalls

number of calls to cvl function during optimization

coef_1, coef_2, ..., coef_n

argmax coefficients for the model with this value of the tuning parameter

The matrix contains one row for each repeat of the regression.

Note

Depends on the R packages: penalized, parallel, rlecuyer

Author(s)

Levi Waldron et al.

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)

See Also

optL1, optL2, cvl

Examples

data(beer.exprs)
data(beer.survival)

## Select just 100 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100),]

## Apply an unreasonably strict gene filter here to speed computation
## time for the Elastic Net example.
gene.quant <- apply(beer.exprs.sample, 1, quantile, probs = 0.75)
dat.filt <- beer.exprs.sample[gene.quant > log2(150),]
gene.iqr <- apply(dat.filt, 1, IQR)
dat.filt <- as.matrix(dat.filt[gene.iqr > 1,])
dat.filt <- t(dat.filt)

## Define training and test sets
set.seed(9)
trainingset <- sample(rownames(dat.filt), round(nrow(dat.filt) / 2))
testset <-
  rownames(dat.filt)[!rownames(dat.filt) %in% trainingset]

dat.training <- data.frame(dat.filt[trainingset,])
pheno.training <- beer.survival[trainingset,]

library(survival)
surv.training <- Surv(pheno.training$os, pheno.training$status)

dat.test <- data.frame(dat.filt[testset,])
all.equal(colnames(dat.training), colnames(dat.test))
pheno.test <- beer.survival[testset,]
surv.test <- Surv(pheno.test$os, pheno.test$status)

set.seed(1)
##ideally set nsim=50, fold=10, but this takes 100x longer.
system.time(
  output <- opt2D(
    nsim = 1,
    L1range = c(0.1, 1),
    L2range = c(20, 1000),
    dofirst = "both",
    nprocessors = 1,
    response = surv.training,
    penalized = dat.training,
    fold = 5,
    positive = FALSE,
    standardize = TRUE
  )
)

cc <- output[which.max(output[, "cvl"]),-1:-5]
output[which.max(output[, "cvl"]), 1:5]  #small L1, large L2
sum(abs(cc) > 0)  #number of non-zero coefficients

preds.training <- as.matrix(dat.training) %*% cc
preds.training.median <- median(preds.training)
preds.training.dichot <-
  ifelse(preds.training > preds.training.median, "high risk", "low risk")
preds.training.dichot <-
  factor(preds.training.dichot[, 1], levels = c("low risk", "high risk"))
preds.test <- as.matrix(dat.test) %*% cc
preds.test.dichot <-
  ifelse(preds.test > preds.training.median, "high risk", "low risk")
preds.test.dichot <-
  factor(preds.test.dichot[, 1], levels = c("low risk", "high risk"))

coxphfit.training <- coxph(surv.training ~ preds.training.dichot)
survfit.training <- survfit(surv.training ~ preds.training.dichot)
summary(coxphfit.training)
coxphfit.test <- coxph(surv.test ~ preds.test.dichot)
survfit.test <- survfit(surv.test ~ preds.test.dichot)
summary(coxphfit.test)

(p.training <-
    signif(summary(coxphfit.training)$logtest[3], 2))  #likelihood ratio test
(hr.training <- signif(summary(coxphfit.training)$conf.int[1], 2))
(hr.lower.training <- summary(coxphfit.training)$conf.int[3])
(hr.upper.training <- summary(coxphfit.training)$conf.int[4])
par(mfrow = c(1, 2))
plot(
  survfit.training,
  col = c("black", "red"),
  conf.int = FALSE,
  xlab = "Months",
  main = "TRAINING",
  ylab = "Overall survival"
)
xmax <- par("usr")[2] - 50
text(
  x = xmax,
  y = 0.4,
  lab = paste("HR=", hr.training),
  pos = 2
)
text(
  x = xmax,
  y = 0.3,
  lab = paste("p=", p.training, "", sep = ""),
  pos = 2
)
tmp <- summary(preds.training.dichot)
text(
  x = xmax,
  y = c(0.2, 0.1),
  lab = paste(tmp, names(tmp)),
  col = 1:2,
  pos = 2
)
## Now the test set.
## in the test set,  HR=1.7 is not significant - not surprising with the
## overly strict non-specific pre-filter (IQR>1,  75th percentile > log2(150)
(p.test <-
    signif(summary(coxphfit.test)$logtest[3], 2))  #likelihood ratio test
(hr.test <- signif(summary(coxphfit.test)$conf.int[1], 2))
(hr.lower.test <- summary(coxphfit.test)$conf.int[3])
(hr.upper.test <- summary(coxphfit.test)$conf.int[4])
plot(
  survfit.test,
  col = c("black",  "red"),
  conf.int = FALSE,
  xlab = "Months",
  main = "TEST"
)
text(
  x = xmax,
  y = 0.4,
  lab = paste("HR=", hr.test),
  pos = 2
)
text(
  x = xmax,
  y = 0.3,
  lab = paste("p=", p.test, "", sep = ""),
  pos = 2
)
tmp <- summary(preds.test.dichot)
text(
  x = xmax,
  y = c(0.2, 0.1),
  lab = paste(tmp, names(tmp)),
  col = 1:2,
  pos = 2
)

Function calculate cross-validated likelihood on a regular grid of L1/L2 penalties

Description

This function generates a grid of values of L1/L2 penalties, then calculated cross-validated likelihood at each point on the grid. The grid can be regular (linear progression of the penalty values), or polynomial (finer grid for small penalty values, and coarser grid for larger penalty values).

Usage

scan.l1l2(L1range = c(0.1, 100.1),
          L2range = c(0.1, 100.1),
          L1.ngrid = 50,
          L2.ngrid = 50,
          nprocessors = 1,
          polydegree = 1,
          cl = NULL,
          ...)

Arguments

L1range

numeric vector of length two, giving minimum and maximum constraints on the L1 penalty

L2range

numeric vector of length two, giving minimum and maximum constraints on the L2 penalty

L1.ngrid

Number of values of the L1 penalty in the regular grid of L1/L2 penalties

L2.ngrid

Number of values of the L2 penalty in the regular grid of L1/L2 penalties

nprocessors

An integer number of processors to use.

polydegree

power of the polynomial on which the L1/L2 penalty values are fit. ie if polydegree=2, penalty values could be y=x^2, x=1,2,3,..., so y=1,4,9,...

cl

Optional cluster object created with the makeCluster() function of the parallel package. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK"). Setting this parameter can enable parallelization in more diverse scenarios than multi-core desktops; see the documentation for the parallel package. Note that if cl is user-defined, this function will not automatically run parallel::stopCluster() to shut down the cluster.

...

arguments passed on to cvl function of the penalized R package

Details

This function sets up a SNOW (Simple Network of Workstations) "sock" cluster to parallelize the task of scanning a grid of penalty values to search for suitable starting values for two-dimensional optimization of the Elastic Net.

Value

cvl

matrix of cvl values along the grid

L1range

range of L1 penalties to scan

L2range

range of L2 penalties to scan

xlab

A text string indicating the range of L1 penalties

ylab

A text string giving the range of L2 penalties

zlab

A text string giving the range of cvl values

note

A note to the user that rows of cvl correspond to values of lambda1, columns to lambda2

Note

Depends on the R packages: penalized, parallel, rlecuyer

Author(s)

Levi Waldron et al.

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)

See Also

cvl

Examples

data(beer.exprs)
data(beer.survival)

##select just 250 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 250), ]

gene.quant <- apply(beer.exprs.sample, 1, quantile, probs = 0.75)
dat.filt <- beer.exprs.sample[gene.quant > log2(150), ]
gene.iqr <- apply(dat.filt, 1, IQR)
dat.filt <- as.matrix(dat.filt[gene.iqr > 1, ])
dat.filt <- t(dat.filt)

## Define training and test sets
set.seed(9)
trainingset <- sample(rownames(dat.filt), round(nrow(dat.filt) / 2))
testset <- rownames(dat.filt)[!rownames(dat.filt) %in% trainingset]

dat.training <- data.frame(dat.filt[trainingset, ])
pheno.training <- beer.survival[trainingset, ]

library(survival)
surv.training <- Surv(pheno.training$os, pheno.training$status)

dat.test <- data.frame(dat.filt[testset, ])
all.equal(colnames(dat.training), colnames(dat.test))
pheno.test <- beer.survival[testset, ]
surv.test <- Surv(pheno.test$os, pheno.test$status)

set.seed(9)
system.time(
  output <- scan.l1l2(
    L1range = c(0.2, 3.2),
    L2range = c(2, 30),
    L1.ngrid = 10,
    L2.ngrid = 10,
    polydegree = 1,
    nprocessors = 1,
    response = surv.training,
    penalized = dat.training,
    fold = 4,
    positive = FALSE,
    standardize = TRUE
  )
)

##Note that the cvl surface is not smooth because a different folding of
##the data was used for each cvl calculation
image(
  x = seq(output$L1range[1], output$L1range[2], length.out = nrow(output$cvl)),
  y = seq(output$L2range[1], output$L2range[2], length.out = ncol(output$cvl)),
  z = output$cvl,
  xlab = "lambda1",
  ylab = "lambda2",
  main = "red is higher cross-validated likelihood"
)