Title: | Sample Size and Power Calculations for Case-Control Studies |
Version: | 2.0.2 |
Date: | 2023-08-21 |
Author: | Mitchell H. Gail |
Description: | To determine sample size or power for case-control studies to be analyzed using logistic regression. |
Maintainer: | William Wheeler <WheelerB@imsweb.com> |
Depends: | mvtnorm |
License: | GPL-2 |
NeedsCompilation: | no |
Packaged: | 2023-08-21 14:43:02 UTC; wheelerwi |
Repository: | CRAN |
Date/Publication: | 2023-08-21 15:20:02 UTC |
Sample size and power calculations for case-control studies
Description
This package can be used to calculate the sample size required for a
case-control study to have sufficient power, or to calculate the power of a
case-control study for a given sample size.
To calculate the sample size, one needs to specify the significance level \alpha, the power \gamma, and the
hypothesized non-null \theta. Here \theta is either a log odds ratio for an exposure main effect or an interaction effect on the logistic scale.
Choosing \theta requires subject matter knowledge to understand how strong the association needs to be to have practical importance.
Sample size varies inversely with \theta^{2} and is thus highly dependent on \theta.
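The inverse-square dependence on \theta can be illustrated with a small sketch. It assumes that every other quantity entering the sample size (significance level, power and the variance terms) is held fixed, which is only an approximation because the variance under the alternative can itself change with \theta. The helper name relative_sample_size is hypothetical and not part of the package.

```r
# Relative sample size when the hypothesized log odds ratio changes from
# theta_ref to theta_new, assuming all other quantities are held fixed.
relative_sample_size <- function(theta_new, theta_ref) {
  (theta_ref / theta_new)^2
}

# Halving the log odds ratio multiplies the required sample size by four.
relative_sample_size(theta_new = log(1.25), theta_ref = 2 * log(1.25))
```

A modest change in the assumed effect size therefore has a large effect on the study size, which is why the choice of \theta deserves care.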
Details
The main functions in the package are for different types of exposure variables, where the
exposure variable is the variable of interest in a hypothesis test.
The functions sampleSize_binary and power_binary can be used for a binary exposure variable (X = 0 or 1),
while the functions sampleSize_ordinal and power_ordinal are more general and can be used for
an ordinal exposure variable (X takes the values 0, 1, ..., k).
sampleSize_continuous and power_continuous are useful for a continuous exposure variable, and
sampleSize_data and power_data can be used when pilot data are available that define
the distribution of the exposure and other confounding variables. Each function returns the
sample sizes or power for a Wald-type test and a score test. When there are no adjustments for confounders, the user can
specify a general distribution for the exposure variable. With confounders, either pilot data or a function to
generate random samples from the multivariate distribution of the confounders and exposure variable must
be given.
If the parameter of interest, \theta, is one dimensional, then the test statistic is often asymptotically equivalent
to a test of the form
T > Z_{1-\alpha}\sigma_{0}n^{-\frac{1}{2}} or T > Z_{1-\alpha}\sigma_{\theta}n^{-\frac{1}{2}},
where Z_{1-\alpha} is the 1-\alpha quantile of a standard normal distribution, n is the total sample size (cases plus controls),
and n^{\frac{1}{2}}T is normally distributed with mean 0 and null variance \sigma_{0}^{2}.
Depending on which critical value, Z_{1-\alpha}\sigma_{0}n^{-\frac{1}{2}} or Z_{1-\alpha}\sigma_{\theta}n^{-\frac{1}{2}},
of the test was used, the formulas for sample size are obtained by inverting the equations for power:
n_{1} = (Z_{\gamma}\sigma_{\theta} + Z_{1-\alpha}\sigma_{0})^{2}/\theta^{2}
or
n_{2} = (Z_{\gamma} + Z_{1-\alpha})^{2}\sigma_{\theta}^{2}/\theta^{2}.
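The two formulas can be checked numerically with a short sketch. It assumes that, under the alternative, n^{\frac{1}{2}}(T - \theta) is approximately normal with mean 0 and variance \sigma_{\theta}^{2}, which is the reading under which the formulas invert the power equations. The function names and all numeric inputs below are hypothetical and chosen only for illustration; they are not output of the package.

```r
# Sample size formulas n_1 and n_2 from the text.
# alpha: one-sided significance level; gamma: desired power;
# sigma0, sigma_theta: standard deviations of sqrt(n) * T under the null and
# under the alternative (illustrative inputs only).
n1_formula <- function(theta, sigma0, sigma_theta, alpha, gamma) {
  (qnorm(gamma) * sigma_theta + qnorm(1 - alpha) * sigma0)^2 / theta^2
}
n2_formula <- function(theta, sigma_theta, alpha, gamma) {
  (qnorm(gamma) + qnorm(1 - alpha))^2 * sigma_theta^2 / theta^2
}

# Power equations that the formulas invert, assuming sqrt(n) * (T - theta)
# is approximately N(0, sigma_theta^2) under the alternative.
power1 <- function(n, theta, sigma0, sigma_theta, alpha) {
  pnorm((sqrt(n) * theta - qnorm(1 - alpha) * sigma0) / sigma_theta)
}
power2 <- function(n, theta, sigma_theta, alpha) {
  pnorm(sqrt(n) * theta / sigma_theta - qnorm(1 - alpha))
}

# Hypothetical inputs.
theta       <- 0.3
sigma0      <- 5.0
sigma_theta <- 5.2
alpha       <- 0.025
gamma       <- 0.9

n1 <- n1_formula(theta, sigma0, sigma_theta, alpha, gamma)
n2 <- n2_formula(theta, sigma_theta, alpha, gamma)

# Plugging each sample size back into its power equation recovers gamma.
c(power1(n1, theta, sigma0, sigma_theta, alpha),
  power2(n2, theta, sigma_theta, alpha))
```

When \sigma_{0} equals \sigma_{\theta} the two formulas coincide, so the difference between n_{1} and n_{2} reflects only how far the null and non-null variances are apart.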
Author(s)
Mitchell H. Gail <gailm@mail.nih.gov>
References
Gail, M.H. and Haneuse, S. Power and sample size for case-control studies. In Handbook of Statistical Methods for Case-Control Studies. Editors: Ornulf Borgan, Norman Breslow, Nilanjan Chatterjee, Mitchell Gail, Alastair Scott, Christopher Wild. Chapman and Hall/CRC, Taylor and Francis Group, New York, 2018, pages 163-187.
Gail, M.H. and Haneuse, S. Power and sample size for multivariate logistic modeling of unmatched case-control studies. Statistical Methods in Medical Research. 2019;28(3):822-834, https://doi.org/10.1177/0962280217737157
List to describe the covariate and exposure data
Description
The list to describe the covariate and exposure data for the data option.
Format
The format is: List of 7
- file
Data file containing the confounders and exposure variables. No default.
- exposure
Name or column number in file for the exposure variable. This can also be a vector giving the columns to form an interaction variable (see details). No default.
- covars
Character vector of variable names or numeric vector of column numbers in file that will be confounders. These variables must be numeric. The length and order of the logOR argument must match the length and order of c(covars, exposure). The default is NULL.
- header
0 or 1 to indicate whether file has the first row as variable names. The default is determined from the first line of the file.
- delimiter
The delimiter in file. The default is determined from the first two lines of the file.
- in.miss
Vector of character strings to define the missing values. This option corresponds to the option na.strings in read.table(). The default is "NA".
- subsetData
List of sublists to subset the data. Each sublist should contain the names "var", "operator" and "value" corresponding to a variable name, operator and values of the variable. Multiple sublists are logically connected by the AND operator. For example,
subsetData=list(list(var="GENDER", operator="==", value="MALE"))
will only include subjects with the string "MALE" for the GENDER variable.
subsetData=list(list(var="AGE", operator=">", value=50),
list(var="STUDY", operator="%in%", value=c("A", "B", "C")))
will include subjects with AGE > 50 AND in STUDY A, B or C. The default is NULL.
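As an illustration, a list with the seven documented names could be assembled as follows. This is a sketch only: the pilot file, its column names (AGE, BMI, STUDY, EXPOSURE) and the log-odds ratios are made up for the example.

```r
# Write a small, made-up pilot data set to a temporary file.
set.seed(1)
pilot <- data.frame(
  AGE      = round(runif(20, 40, 70)),
  BMI      = round(rnorm(20, 27, 4), 1),
  STUDY    = sample(c("A", "B", "D"), 20, replace = TRUE),
  EXPOSURE = rbinom(20, 1, 0.4)
)
pilot_file <- tempfile(fileext = ".txt")
write.table(pilot, pilot_file, sep = "\t", quote = FALSE, row.names = FALSE)

# List describing the file, using the seven documented names.
data_list <- list(
  file       = pilot_file,
  exposure   = "EXPOSURE",
  covars     = c("AGE", "BMI"),
  header     = 1,
  delimiter  = "\t",
  in.miss    = "NA",
  subsetData = list(
    list(var = "AGE", operator = ">", value = 50),
    list(var = "STUDY", operator = "%in%", value = c("A", "B"))
  )
)

# logOR must follow the order c(covars, exposure): AGE, BMI, then EXPOSURE.
logOR <- c(0.02, 0.05, 0.3)
```

Such a list would then be supplied through the data argument of the power and sample size functions.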
Details
In this list, file and exposure must be specified. If exposure is a vector
of column names or column numbers, then an exposure variable will be created by multiplying the columns
defined in the vector to form the interaction variable. Thus, the columns must be numeric variables.
In this case, the length and order of logOR must match the length and order of
c(covars, <new interaction variable>).
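A minimal sketch of how such an interaction variable arises from a row-wise product of numeric columns is shown below. The data and the column names (AGE, GENE, SMOK) are hypothetical, and the code only mimics the described behaviour in base R; it is not the package's internal code.

```r
# Made-up pilot data with one confounder and two numeric columns to interact.
pilot <- data.frame(
  AGE  = c(45, 52, 61, 58),
  GENE = c(0, 1, 1, 0),
  SMOK = c(1, 1, 0, 0)
)
exposure_cols <- c("GENE", "SMOK")

# Row-wise product of the exposure columns gives the interaction variable.
interaction <- Reduce(`*`, pilot[exposure_cols])

# Design columns in the order c(covars, <new interaction variable>).
design <- cbind(pilot["AGE"], interaction = interaction)
design$interaction
```

Here logOR would need two entries: one for AGE followed by one for the interaction term.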
Power for a binary exposure
Description
Calculates the power of a case-control study with a binary exposure variable.
Usage
power_binary(prev, logOR, probXeq1=NULL, distF=NULL, data=NULL,
size.2sided=0.05, sampleSize=1000, cc.ratio=0.5, interval=c(-100, 100), tol=0.0001)
Arguments
prev |
Number between 0 and 1 giving the prevalence of disease. No default. |
logOR |
Vector of ordered log-odds ratios for the confounders and exposure.
The last log-odds ratio in the vector is for the exposure. If the
option |
probXeq1 |
NULL or a number between 0 and 1 giving the probability that the exposure
variable is 1. If set to NULL, the |
distF |
NULL, a function or a character string giving the function to generate random
vectors from the distribution of the confounders and exposure. The order of the returned
vector must match the order of |
data |
NULL, matrix, data frame or a list of type |
size.2sided |
Number between 0 and 1 giving the size of the 2-sided hypothesis test. The default is 0.05. |
sampleSize |
Sample size of the study. The default is 1000. |
cc.ratio |
Number between 0 and 1 for the proportion of cases in the case-control sample. The default is 0.5. |
interval |
Two element vector giving the interval to search for the estimated intercept parameter. The default is c(-100, 100). |
tol |
Positive value giving the stopping tolerance for the root finding method to estimate the intercept parameter. The default is 0.0001. |
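The interval and tol arguments refer to a root-finding step for the intercept parameter. The package's internal procedure is not described here, so the following is only an assumed illustration of a typical calibration: finding the intercept of a logistic model with a binary exposure such that the marginal disease probability equals prev. The helper names expit and prevalence_gap are hypothetical.

```r
# Logistic (inverse logit) function.
expit <- function(x) 1 / (1 + exp(-x))

# Marginal disease probability minus the target prevalence, for a binary
# exposure with P(X = 1) = p1 and log odds ratio logOR.
prevalence_gap <- function(intercept, prev, logOR, p1) {
  (1 - p1) * expit(intercept) + p1 * expit(intercept + logOR) - prev
}

# Search the interval for the intercept that reproduces the prevalence.
fit <- uniroot(prevalence_gap, interval = c(-100, 100), tol = 0.0001,
               prev = 0.01, logOR = 0.3, p1 = 0.2)
fit$root
```

A wide default interval such as c(-100, 100) makes it very likely that the root is bracketed, while tol controls how precisely the intercept is located.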
Details
If there are no confounders (length(logOR) = 1), then either probXeq1 or data must
be specified, where probXeq1 takes precedence. If there are confounders (length(logOR) > 1), then
either data or distF must be specified, where data takes precedence.
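The precedence rules in this section can be written out as a small decision function. The helper input_source is hypothetical and not part of the package; it only encodes the rules as stated here.

```r
# Hypothetical helper (not part of the package) that reports which input
# would be used under the precedence rules for a binary exposure.
input_source <- function(logOR, probXeq1 = NULL, distF = NULL, data = NULL) {
  if (length(logOR) == 1) {
    # No confounders: probXeq1 is used before data.
    if (!is.null(probXeq1)) return("probXeq1")
    if (!is.null(data)) return("data")
    stop("with no confounders, probXeq1 or data must be specified")
  }
  # Confounders present: data is used before distF.
  if (!is.null(data)) return("data")
  if (!is.null(distF)) return("distF")
  stop("with confounders, data or distF must be specified")
}

input_source(0.3, probXeq1 = 0.2, data = matrix(rbinom(50, 1, 0.4)))
input_source(c(0.1, 0.2), distF = function(n) cbind(rnorm(n), rbinom(n, 1, 0.4)))
```

The first call reports probXeq1 even though data is also given, and the second reports distF because no data were supplied.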
Value
A list containing four powers, where two of them are for a Wald test and two for a score test.
The two powers for each test correspond to the equations for n_{1} and n_{2}.
See Also
power_continuous, power_ordinal, power_data
Examples
prev <- 0.01
logOR <- 0.3
# No confounders, Prob(X=1)=0.2
power_binary(prev, logOR, probXeq1=0.2)
# Generate data for a N(0,1) confounder and binary exposure
data <- cbind(rnorm(1000), rbinom(1000, 1, 0.4))
beta <- c(0.1, 0.2)
power_binary(prev, beta, data=data)
# Define a function to generate random vectors for two confounders and the binary exposure
f <- function(n) {cbind(rnorm(n), rbinom(n, 3, 0.5), rbinom(n, 1, 0.3))}
logOR <- c(0.2, 0.3, 0.25)
power_binary(prev, logOR, distF=f)
Power for a continuous exposure
Description
Calculates the power of a case-control study with a continuous exposure variable.
Usage
power_continuous(prev, logOR, distF=NULL, distF.support=c(-Inf, Inf),
data=NULL, size.2sided=0.05, sampleSize=1000, cc.ratio=0.5, interval=c(-100, 100),
tol=0.0001, distF.var=NULL)
Arguments
prev |
Number between 0 and 1 giving the prevalence of disease. No default. |
logOR |
Vector of ordered log-odds ratios for the confounders and exposure.
The last log-odds ratio in the vector is for the exposure. If the
option |
distF |
NULL, a function or a character string giving the pdf of the exposure variable for the case
of no confounders, or giving the function to generate random vectors from the
distribution formed by the confounders and exposure.
For the case of no confounders, examples are |
distF.support |
Two element vector giving the domain of |
data |
NULL, matrix, data frame or a list of type |
size.2sided |
Number between 0 and 1 giving the size of the 2-sided hypothesis test. The default is 0.05. |
sampleSize |
Sample size of the study. The default is 1000. |
cc.ratio |
Number between 0 and 1 for the proportion of cases in the case-control sample. The default is 0.5. |
interval |
Two element vector giving the interval to search for the estimated intercept parameter. The default is c(-100, 100). |
tol |
Positive value giving the stopping tolerance for the root finding method to estimate the intercept parameter. The default is 0.0001. |
distF.var |
The variance of the exposure variable for the case of no confounders. This option is for efficiency purposes. If not specified, the variance will be estimated by either the empirical variance of a random sample from the distribution of the exposure or by numerical integration. The default is NULL. |
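To illustrate what estimating the variance by numerical integration involves, the sketch below computes the variance of a beta(0.3, 3) exposure on the support (0, 1) with base R's integrate() and compares it with the closed-form value. This is an assumed illustration, not the package's internal code; the names pdf_x, support, var_numeric and var_exact are hypothetical.

```r
# Density of the exposure and its support (beta(0.3, 3) on [0, 1]).
pdf_x   <- function(x) dbeta(x, shape1 = 0.3, shape2 = 3)
support <- c(0, 1)

# First and second moments by numerical integration over the support.
m1 <- integrate(function(x) x   * pdf_x(x), support[1], support[2])$value
m2 <- integrate(function(x) x^2 * pdf_x(x), support[1], support[2])$value
var_numeric <- m2 - m1^2

# Closed-form variance of a beta(a, b) distribution for comparison.
a <- 0.3
b <- 3
var_exact <- a * b / ((a + b)^2 * (a + b + 1))

c(numeric = var_numeric, exact = var_exact)
```

When the variance is known in closed form, as it is here, passing it through distF.var avoids repeating this computation.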
Details
The data option takes precedence over the other options. If data is not specified,
then the distribution of the exposure will be N(0,1) or MVN(0, 1) depending on whether there
are confounders.
Value
A list containing four powers, where two of them are for a Wald test and two for a score test.
The two powers for each test correspond to the equations for n_{1} and n_{2}.
See Also
power_binary, power_ordinal, power_data
Examples
prev <- 0.01
logOR <- 0.3
# No confounders, exposure assumed to be N(0,1)
power_continuous(prev, logOR)
# Two confounders and exposure assumed to be MVN(0,1)
beta <- c(0.1, 0.1, logOR)
power_continuous(prev, beta)
# No confounders, exposure is beta(0.3, 3)
power_continuous(prev, logOR, distF="dbeta(m, shape1=0.3, shape2=3)",
distF.support=c(0, 1))
Power using pilot data
Description
Calculates the power of a case-control study with pilot data
Usage
power_data(prev, logOR, data, size.2sided=0.05, sampleSize=1000, cc.ratio=0.5,
interval=c(-100, 100), tol=0.0001)
Arguments
prev |
Number between 0 and 1 giving the prevalence of disease. No default. |
logOR |
Vector of ordered log-odds ratios for the confounders and exposure.
The last log-odds ratio in the vector is for the exposure. If the
option |
data |
Matrix, data frame or a list of type |
size.2sided |
Number between 0 and 1 giving the size of the 2-sided hypothesis test. The default is 0.05. |
sampleSize |
Sample size of the study (see details). The default is 1000. |
cc.ratio |
Number between 0 and 1 for the proportion of cases in the case-control sample. The default is 0.5. |
interval |
Two element vector giving the interval to search for the estimated intercept parameter. The default is c(-100, 100). |
tol |
Positive value giving the stopping tolerance for the root finding method to estimate the intercept parameter. The default is 0.0001. |
Details
The option sampleSize is not necessarily nrow(data). The input data can be a
small sample of pilot data that would be representative of the actual study data.
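The distinction between the planned study size and the size of the pilot data can be made concrete with a little arithmetic. In this sketch the variable names are hypothetical, and rounding the number of cases is an assumption made only so that the counts are whole numbers.

```r
sample_size <- 1000   # total number of subjects (cases plus controls)
cc_ratio    <- 0.5    # proportion of cases in the case-control sample

# Rounding is an assumption for designs where the product is not an integer.
n_cases    <- round(sample_size * cc_ratio)
n_controls <- sample_size - n_cases

# A pilot data set can be much smaller than the planned study.
pilot <- matrix(rnorm(100, mean = 1.5), ncol = 1)
c(pilot_rows = nrow(pilot), cases = n_cases, controls = n_controls)
```

The pilot rows only inform the covariate and exposure distribution, while the power refers to the planned 500 cases and 500 controls.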
Value
A list containing four powers, where two of them are for a Wald test and two for a score test.
The two powers for each test correspond to the equations for n_{1} and n_{2}.
See Also
power_binary, power_ordinal, power_continuous
Examples
prev <- 0.01
logOR <- 0.3
data <- matrix(rnorm(100, mean=1.5), ncol=1)
# Assuming exposure is N(1.5, 1)
power_data(prev, logOR, data)
Power for an ordinal exposure
Description
Calculates the power of a case-control study with an ordinal exposure variable.
Usage
power_ordinal(prev, logOR, probX=NULL, distF=NULL, data=NULL,
size.2sided=0.05, sampleSize=1000, cc.ratio=0.5, interval=c(-100, 100), tol=0.0001)
Arguments
prev |
Number between 0 and 1 giving the prevalence of disease. No default. |
logOR |
Vector of ordered log-odds ratios per category increase for the confounders and exposure.
The last log-odds ratio in the vector is for the exposure.
If the exposures are coded 0, 1, ..., k (k+1 categories), then the |
probX |
NULL or a vector that sums to 1 giving the probability that the exposure
variable is equal to i, i = 0, 1, ..., k.
If set to NULL, the |
distF |
NULL, a function or a character string giving the function to generate random
vectors from the distribution of the confounders and exposure. The order of the returned
vector must match the order of |
data |
NULL, matrix, data frame or a list of type |
size.2sided |
Number between 0 and 1 giving the size of the 2-sided hypothesis test. The default is 0.05. |
sampleSize |
Sample size of the study. The default is 1000. |
cc.ratio |
Number between 0 and 1 for the proportion of cases in the case-control sample. The default is 0.5. |
interval |
Two element vector giving the interval to search for the estimated intercept parameter. The default is c(-100, 100). |
tol |
Positive value giving the stopping tolerance for the root finding method to estimate the intercept parameter. The default is 0.0001. |
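A probX vector for an exposure coded 0, 1, ..., k has one entry per category. The sketch below builds one from binomial probabilities, which is only one convenient and hypothetical choice, and shows how the binary case appears as a vector of length two.

```r
# Ordinal exposure coded 0, 1, ..., k with k = 4 (5 categories).
k <- 4
probX <- dbinom(0:k, size = k, prob = 0.5)   # P(X = i), i = 0, ..., k

# Basic checks: one probability per category, summing to 1.
stopifnot(length(probX) == k + 1, isTRUE(all.equal(sum(probX), 1)))

# With k = 1 the vector reduces to the binary case: c(P(X = 0), P(X = 1)).
probX_binary <- c(0.8, 0.2)
probXeq1 <- probX_binary[2]
probX
```

The first element always corresponds to the category coded 0, so probX = c(0.8, 0.2) describes the same exposure as probXeq1 = 0.2 for a binary variable.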
Details
If there are no confounders (length(logOR) = 1), then either probX or data must
be specified, where probX takes precedence. If there are confounders (length(logOR) > 1), then
either data or distF must be specified, where data takes precedence.
Value
A list containing four powers, where two of them are for a Wald test and two for a score test.
The two powers for each test correspond to the equations for n_{1} and n_{2}.
See Also
power_continuous, power_binary, power_data
Examples
prev <- 0.01
logOR <- 0.3
# No confounders, Prob(X=1)=0.2
power_ordinal(prev, logOR, probX=c(0.8, 0.2))
# Generate data for a N(0,1) confounder and ordinal exposure with 3 levels
data <- cbind(rnorm(1000), rbinom(1000, 2, 0.5))
beta <- c(0.1, 0.2)
power_ordinal(prev, beta, data=data)
# Define a function to generate random vectors for two confounders and an ordinal
# exposure with 5 levels
f <- function(n) {cbind(rnorm(n), rbinom(n, 1, 0.5), rbinom(n, 4, 0.5))}
beta <- c(0.2, 0.3, 0.25)
power_ordinal(prev, beta, distF=f)
Sample size for a binary exposure
Description
Calculates the required sample size for a case-control study with a binary exposure variable.
Usage
sampleSize_binary(prev, logOR, probXeq1=NULL, distF=NULL, data=NULL,
size.2sided=0.05, power=0.9, cc.ratio=0.5, interval=c(-100, 100), tol=0.0001,
n.samples=10000)
Arguments
prev |
Number between 0 and 1 giving the prevalence of disease. No default. |
logOR |
Vector of ordered log-odds ratios for the confounders and exposure.
The last log-odds ratio in the vector is for the exposure. If the
option |
probXeq1 |
NULL or a number between 0 and 1 giving the probability that the exposure
variable is 1. If set to NULL, the |
distF |
NULL, a function or a character string giving the function to generate random
vectors from the distribution of the confounders and exposure. The order of the returned
vector must match the order of |
data |
NULL, matrix, data frame or a list of type |
size.2sided |
Number between 0 and 1 giving the size of the 2-sided hypothesis test. The default is 0.05. |
power |
Number between 0 and 1 for the desired power of the test. The default is 0.9. |
cc.ratio |
Number between 0 and 1 for the proportion of cases in the case-control sample. The default is 0.5. |
interval |
Two element vector giving the interval to search for the estimated intercept parameter. The default is c(-100, 100). |
tol |
Positive value giving the stopping tolerance for the root finding method to estimate the intercept parameter. The default is 0.0001. |
n.samples |
Integer giving the number of random vectors to generate when the option |
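As a rough illustration of drawing random vectors from a generating function, the sketch below calls the generator used in this section's examples with n.samples = 10000 and checks that its columns line up with logOR. It assumes that n.samples sets how many vectors are requested from such a generator; the checks are illustrative and not part of the package.

```r
# Generator for two confounders and a binary exposure (exposure last),
# taken from the examples in this section.
f <- function(n) {cbind(rnorm(n), rbinom(n, 3, 0.5), rbinom(n, 1, 0.3))}
logOR <- c(0.2, 0.3, 0.25)

# Draw n.samples random vectors; each row is one simulated subject.
n.samples <- 10000
set.seed(123)
draws <- f(n.samples)

# One column per log-odds ratio, in the same order as logOR.
stopifnot(ncol(draws) == length(logOR), nrow(draws) == n.samples)

# Empirical exposure frequency, close to the generating probability 0.3.
mean(draws[, 3])
```

A larger number of draws reduces the Monte Carlo error in quantities estimated from the sample, at the cost of more computation.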
Details
If there are no confounders (length(logOR) = 1), then either probXeq1 or data must
be specified, where probXeq1 takes precedence. If there are confounders (length(logOR) > 1), then
either data or distF must be specified, where data takes precedence.
Value
A list containing four sample sizes, where two of them are for a Wald test and two for a score test.
The two sample sizes for each test correspond to the equations for n_{1} and n_{2}.
See Also
sampleSize_continuous, sampleSize_ordinal, sampleSize_data
Examples
prev <- 0.01
logOR <- 0.3
# No confounders, Prob(X=1)=0.2
sampleSize_binary(prev, logOR, probXeq1=0.2)
# Generate data for a N(0,1) confounder and binary exposure
data <- cbind(rnorm(1000), rbinom(1000, 1, 0.4))
beta <- c(0.1, 0.2)
sampleSize_binary(prev, beta, data=data)
# Define a function to generate random vectors for two confounders and the binary exposure
f <- function(n) {cbind(rnorm(n), rbinom(n, 3, 0.5), rbinom(n, 1, 0.3))}
logOR <- c(0.2, 0.3, 0.25)
sampleSize_binary(prev, logOR, distF=f)
Sample size for a continuous exposure
Description
Calculates the required sample size for a case-control study with a continuous exposure variable.
Usage
sampleSize_continuous(prev, logOR, distF=NULL, distF.support=c(-Inf, Inf),
data=NULL, size.2sided=0.05, power=0.9, cc.ratio=0.5, interval=c(-100, 100),
tol=0.0001, n.samples=10000, distF.var=NULL)
Arguments
prev |
Number between 0 and 1 giving the prevalence of disease. No default. |
logOR |
Vector of ordered log-odds ratios for the confounders and exposure.
The last log-odds ratio in the vector is for the exposure. If the
option |
distF |
NULL, a function or a character string giving the pdf of the exposure variable for the case
of no confounders, or giving the function to generate random vectors from the
distribution formed by the confounders and exposure.
For the case of no confounders, examples are |
distF.support |
Two element vector giving the domain of |
data |
NULL, matrix, data frame or a list of type |
size.2sided |
Number between 0 and 1 giving the size of the 2-sided hypothesis test. The default is 0.05. |
power |
Number between 0 and 1 for the desired power of the test. The default is 0.9. |
cc.ratio |
Number between 0 and 1 for the proportion of cases in the case-control sample. The default is 0.5. |
interval |
Two element vector giving the interval to search for the estimated intercept parameter. The default is c(-100, 100). |
tol |
Positive value giving the stopping tolerance for the root finding method to estimate the intercept parameter. The default is 0.0001. |
n.samples |
Integer giving the number of random vectors to generate when the option |
distF.var |
The variance of the exposure variable for the case of no confounders. This option is for efficiency purposes. If not specified, the variance will be estimated by either the empirical variance of a random sample from the distribution of the exposure or by numerical integration. The default is NULL. |
Details
The data option takes precedence over the other options. If data is not specified,
then the distribution of the exposure will be N(0,1) or MVN(0, 1) depending on whether there
are confounders.
Value
A list containing four sample sizes, where two of them are for a Wald test and two for a score test.
The two sample sizes for each test correspond to the equations for n_{1} and n_{2}.
See Also
sampleSize_binary, sampleSize_ordinal, sampleSize_data
Examples
prev <- 0.01
logOR <- 0.3
# No confounders, exposure assumed to be N(0,1)
sampleSize_continuous(prev, logOR)
# Two confounders and exposure assumed to be MVN(0,1)
beta <- c(0.1, 0.1, logOR)
sampleSize_continuous(prev, beta)
# No confounders, exposure is beta(0.3, 3)
sampleSize_continuous(prev, logOR, distF="dbeta(m, shape1=0.3, shape2=3)",
distF.support=c(0, 1))
Sample size using pilot data
Description
Calculates the required sample size of a case-control study with pilot data
Usage
sampleSize_data(prev, logOR, data, size.2sided=0.05, power=0.9, cc.ratio=0.5,
interval=c(-100, 100), tol=0.0001)
Arguments
prev |
Number between 0 and 1 giving the prevalence of disease. No default. |
logOR |
Vector of ordered log-odds ratios for the confounders and exposure.
The last log-odds ratio in the vector is for the exposure. If the
option |
data |
Matrix, data frame or a list of type |
size.2sided |
Number between 0 and 1 giving the size of the 2-sided hypothesis test. The default is 0.05. |
power |
Number between 0 and 1 for the desired power of the test. The default is 0.9. |
cc.ratio |
Number between 0 and 1 for the proportion of cases in the case-control sample. The default is 0.5. |
interval |
Two element vector giving the interval to search for the estimated intercept parameter. The default is c(-100, 100). |
tol |
Positive value giving the stopping tolerance for the root finding method to estimate the intercept parameter. The default is 0.0001. |
Value
A list containing four sample sizes, where two of them are for a Wald test and two for a score test.
The two sample sizes for each test correspond to the equations for n_{1} and n_{2}.
See Also
sampleSize_binary, sampleSize_ordinal, sampleSize_continuous
Examples
prev <- 0.01
logOR <- 0.3
data <- matrix(rnorm(100, mean=1.5), ncol=1)
# Assuming exposure is N(1.5, 1)
sampleSize_data(prev, logOR, data)
Sample size for an ordinal exposure
Description
Calculates the required sample size for a case-control study with an ordinal exposure variable.
Usage
sampleSize_ordinal(prev, logOR, probX=NULL, distF=NULL, data=NULL,
size.2sided=0.05, power=0.9, cc.ratio=0.5, interval=c(-100, 100), tol=0.0001,
n.samples=10000)
Arguments
prev |
Number between 0 and 1 giving the prevalence of disease. No default. |
logOR |
Vector of ordered log-odds ratios per category increase for the confounders and exposure.
The last log-odds ratio in the vector is for the exposure.
If the exposures are coded 0, 1, ..., k (k+1 categories), then the |
probX |
NULL or a vector that sums to 1 giving the probability that the exposure
variable is equal to i, i = 0, 1, ..., k.
If set to NULL, the |
distF |
NULL, a function or a character string giving the function to generate random
vectors from the distribution of the confounders and exposure. The order of the returned
vector must match the order of |
data |
NULL, matrix, data frame or a list of type |
size.2sided |
Number between 0 and 1 giving the size of the 2-sided hypothesis test. The default is 0.05. |
power |
Number between 0 and 1 for the desired power of the test. The default is 0.9. |
cc.ratio |
Number between 0 and 1 for the proportion of cases in the case-control sample. The default is 0.5. |
interval |
Two element vector giving the interval to search for the estimated intercept parameter. The default is c(-100, 100). |
tol |
Positive value giving the stopping tolerance for the root finding method to estimate the intercept parameter. The default is 0.0001. |
n.samples |
Integer giving the number of random vectors to generate when the option |
Details
If there are no confounders (length(logOR) = 1), then either probX or data must
be specified, where probX takes precedence. If there are confounders (length(logOR) > 1), then
either data or distF must be specified, where data takes precedence.
Value
A list containing four sample sizes, where two of them are for a Wald test and two for a score test.
The two sample sizes for each test correspond to the equations for n_{1} and n_{2}.
See Also
sampleSize_continuous, sampleSize_binary, sampleSize_data
Examples
prev <- 0.01
logOR <- 0.3
# No confounders, Prob(X=1)=0.2
sampleSize_ordinal(prev, logOR, probX=c(0.8, 0.2))
# Generate data for a N(0,1) confounder and ordinal exposure with 3 levels
data <- cbind(rnorm(1000), rbinom(1000, 2, 0.5))
beta <- c(0.1, 0.2)
sampleSize_ordinal(prev, beta, data=data)
# Define a function to generate random vectors for two confounders and an ordinal
# exposure with 5 levels
f <- function(n) {cbind(rnorm(n), rbinom(n, 1, 0.5), rbinom(n, 4, 0.5))}
beta <- c(0.2, 0.3, 0.25)
sampleSize_ordinal(prev, beta, distF=f)