Title: | Bootstrap with Survey Data |
Version: | 0.0.3 |
Description: | Implements different kinds of bootstraps to estimate sampling variation from survey data with complex designs. Includes the rescaled bootstrap described in Rust and Rao (1996) <doi:10.1177/096228029600500305> and Rao and Wu (1988) <doi:10.1080/01621459.1988.10478591>. |
License: | MIT + file LICENSE |
LinkingTo: | Rcpp, RcppArmadillo |
Imports: | Rcpp, stringr, dplyr, plyr, purrr, functional |
Depends: | R (≥ 2.10) |
Suggests: | knitr, testthat |
RoxygenNote: | 7.2.2 |
Encoding: | UTF-8 |
LazyData: | true |
NeedsCompilation: | yes |
Packaged: | 2023-01-09 05:42:24 UTC; dennis |
Author: | Dennis M. Feehan [aut, cre], Matthew J. Salganik [ths] |
Maintainer: | Dennis M. Feehan <feehan@berkeley.edu> |
Repository: | CRAN |
Date/Publication: | 2023-01-09 12:10:02 UTC |
The MU284 Population dataset
Description
A dataset containing information about Sweden's 284 municipalities. This dataset comes from Model-Assisted Survey Sampling by Sarndal, Swensson, and Wretman (2003, ISBN:0387406204). The columns are:
Format
A data frame with 284 rows and 11 columns:
- LABEL
ID
- P85
Population in 1985
- P75
Population in 1975
- RMT85
Municipal tax revenue in 1985
- CS82
Number of Conservative seats in municipal council
- SS82
Number of Social-Democratic seats in municipal council
- S82
Total number of seats in municipal council
- ME84
Number of municipal employees
- REV84
Real estate values according to 1984 assessment
- REG
Geographic location indicator
- CL
Cluster indicator (neighboring municipalities are clustered together)
Source
'Model Assisted Survey Sampling' by Sarndal, Swensson, and Wretman (2003, ISBN:0387406204)
Benchmarks for unit tests
Description
Benchmark results to use in unit tests; these are based on MU284.complex.surveys.
Format
A list with 10 data frames, each with 15 rows and 11 columns:
- mean.TS82.hat, ..., sd.R.RMT85.P85.hat
summaries for each estimand
Simulated sample surveys drawn from the MU284 Population using a complex design
Description
A list with 10 sample surveys with sample size 15 drawn from the MU284 dataset using a complex sampling design.
Format
A list with 10 data frames, each with 15 rows and 11 columns:
- LABEL, ..., CL
Same as MU284 dataset
- sample_weight
The sampling weight for the row
Details
The sampling design comes from Ex. 4.3.2 (pg 142-3) of 'Model Assisted Survey Sampling' by Sarndal, Swensson, and Wretman (2003, ISBN:0387406204).
The design is a two-stage sample:
stage I: the primary sampling units (PSUs) are the standard clusters from MU284; we take a simple random sample without replacement of
n_I = 5
out ofN_I = 50
of thesestage II: within each sampled PSU, we take a simple random sample without replacement of
n_i = 3
out ofN_i
municipalities
MU284.estimator.fn
Description
Produce estimates from a simulated sample survey of the MU284 population. Used in package tests and examples.
Usage
MU284.estimator.fn(survey.data, weights)
Arguments
survey.data |
the survey dataset |
weights |
a vector with the survey weights |
Value
a data.frame with one row and two columns:
-
TS82.hat
- the estimated total ofS82
-
R.RMT85.P85.hat
- the estimated ratio ofRMT85
/P85
MU284.estimator.summary.fn
Description
Summarize results from MU284.estimator.fn()
applied to many surveys.
(This is a dummy function, used for tests)
Usage
MU284.estimator.summary.fn(res)
Arguments
res |
a dataframe whose rows are the results of calling |
Value
the same dataframe
Simulated sample surveys drawn from the MU284 Population
Description
A list with 10 sample surveys with sample size 15 drawn from the MU284 dataset using simple random sampling with replacment.
Format
A list with 10 data frames, each with 15 rows and 11 columns:
- LABEL, ..., CL
Same as MU284 dataset
- sample_weight
The sampling weight for the row
bootstrap.estimates
Description
Use a given bootstrap method to estimate sampling uncertainty from a given estimator.
Usage
bootstrap.estimates(
survey.data,
survey.design,
bootstrap.fn,
estimator.fn,
num.reps,
weights = NULL,
...,
summary.fn = NULL,
verbose = TRUE,
parallel = FALSE,
paropts = NULL
)
Arguments
survey.data |
The dataset to use |
survey.design |
A formula describing the design of the survey (see Details below) |
bootstrap.fn |
Name of the method to be used to take bootstrap resamples |
estimator.fn |
The name of a function which, given a dataset like
|
num.reps |
The number of bootstrap replication samples to draw |
weights |
Weights to use in estimation (or |
... |
additional arguments which will be passed on to |
summary.fn |
(Optional) Name of a function which, given the set of estimates
produced by |
verbose |
If |
parallel |
If |
paropts |
If not |
Details
The formula describing the survey design should have the form
~ psu_v1 + psu_v2 + ... + strata(strata_v1 + strata_v2 + ...)
,
where psu_v1, ...
are the variables identifying primary sampling units (PSUs)
and strata_v1, ...
identifies the strata
Value
If summary.fn
is not specified, then return the list of estimates
produced by estimator.fn
; if summary.fn
is specified, then return
its output
Examples
# example using a simple random sample
survey <- MU284.surveys[[1]]
estimator <- function(survey.data, weights) {
plyr::summarise(survey.data,
T82.hat = sum(S82 * weights))
}
ex.mu284 <- bootstrap.estimates(
survey.design = ~1,
num.reps = 10,
estimator.fn = estimator,
weights='sample_weight',
bootstrap.fn = 'srs.bootstrap.sample',
survey.data=survey)
## Not run:
idu.est <- bootstrap.estimates(
## this describes the sampling design of the
## survey; here, the PSUs are given by the
## variable cluster, and the strata are given
## by the variable region
survey.design = ~ cluster + strata(region),
## the number of bootstrap resamples to obtain
num.reps=1000,
## this is the name of the function
## we want to use to produce an estimate
## from each bootstrapped dataset
estimator.fn="our.estimator",
## these are the sampling weights
weights="indweight",
## this is the name of the type of bootstrap
## we wish to use
bootstrap.fn="rescaled.bootstrap.sample",
## our dataset
survey.data=example.survey,
## other parameters we need to pass
## to the estimator function
d.hat.vals=d.hat,
total.popn.size=tot.pop.size,
y.vals="clients",
missing="complete.obs")
## End(Not run)
Get a dataset from a chain
Description
Take the data for each member of the given chain and assemble it together in a dataset.
Usage
chain.data(chain)
Arguments
chain |
The chain to build a dataset from |
Value
A dataset comprised of all of the chain's members' data put together. The order of the rows in the dataset is not specified.
Get the size of a chain
Description
Count the total number of respondents in the chain and return it
Usage
chain.size(chain)
Arguments
chain |
The chain object |
Value
The number of respondents involved in the chain
Get all of the values of the given variable found among members of a chain
Description
Get all of the values of the given variable found among members of a chain
Usage
chain.vals(chain, qoi.var = "uid")
Arguments
chain |
The chain to get values from |
qoi.var |
The name of the variable to get from each member of the chain |
Value
A vector with all of the values of qoi.var
found in this chain. (Currently, the order of the values
in the vector is not guaranteed.)
Estimate degree distributions by trait
Description
Break down RDS degree distributions by trait, and return an object which has the degrees for each trait as well as functions to draw degrees from each trait.
Usage
estimate.degree.distns(survey.data, d.hat.vals, traits, keep.vars = NULL)
Arguments
survey.data |
The respondent info |
d.hat.vals |
The variable that contains the degrees for each respondent |
traits |
A vector of the names of the columns
of |
keep.vars |
Additional vars to return along with degrees |
Details
One of the items returned as a result is a function,
draw.degrees.fn
, which takes one argument,
traits
. This is a vector of traits and,
for each entry in this vector, draw.degress.fn
returns a draw from the empirical distribution of
degrees among respondents with that trait. So,
draw.degrees.fn(c("0.0", "0.1", "0.1")
would
return a degree drawn uniformly at random from among
the observed degrees of respondents with trait "0.0"
and then two degrees from respondents with trait "0.1"
Value
An object with
-
distns
a list with one entry per trait value; each -
draw.degrees.fn
a function which gets called with one -
keep.vars
the name of the other vars that are kept (if any)
Construct a mixing model from GoC/RDS data
Description
Given a dataset with the respondents and a dataset on the parents (in many cases the same individuals), and a set of relevant traits, estimate mixing parameters and return a markov model.
Usage
estimate.mixing(survey.data, parent.data, traits)
Arguments
survey.data |
The respondent info |
parent.data |
The parent info |
traits |
The names of the traits to build the model on |
Value
A list with entries:
-
mixing.df
the data used to estimate the mixing function -
choose.next.state.fn
a function which can be passed a vector of states and will return a draw of a subsequent state for each entry in the vector -
mixing.df
a dataframe (long-form) representation of the transition counts used to estimate the transition probabilities -
states
a list with an entry for each state. within each state's entry are-
trans.probs
a vector of estimated transition probabilities -
trans.fn
a function which, when called, randomly chooses a next state with probabilities given by the transition probs.
-
Grab a function based on its name
Description
Helper to grab a function that is passed in as an argument
Usage
get.fn(fn, env = parent.frame())
Arguments
fn |
The function to search for |
env |
The environment to start searching in |
Details
This is based on Hadley Wickham's response to an SO post: https://stackoverflow.com/questions/14183766/match-fun-provide-error-with-functions-defined-inside-functions with some minor modifications
Value
fn
, if fn
is already a function; otherwise, the first function found
in env or one of its parents whose name is fn
Get a variable from a dataframe or vector
Description
This function was written because a few of the estimator functions
need to use weights, and there are several cases to handle:
the user could pass in a column name, a vector of weights, or
nothing (in which case, the weights should default to 1 for each
row in the dataset). For the special case of getting weights, look
at the curried function get.weights()
Usage
get.var(survey.data, var, default = NA)
Arguments
survey.data |
The survey dataset |
var |
Either |
default |
The default value to fill in if the variable is not found |
Value
A vector of values whose length is the same as the
number of rows in survey.data
; if var
is NULL
, this has
the default values
Get the weights column from a dataframe
Description
This is the same as get.var()
with the default value set to 1
instead of NA
Usage
get.weights(...)
Arguments
... |
(this is a function curried from |
Determine whether or not one id is a parent of another
Description
This function allows us to determine which ids are directly descended from which other ones. It is the only part of the code that relies on the ID format used by the Curitiba study (see Details); by modifying this function, it should be possible to adapt this code to another study.
Usage
is.child.ct(id, seed.id)
Arguments
id |
the id of the potential child |
seed.id |
the id of the potential parent |
Details
See:
Salganik, M. J., Fazito, D., Bertoni, N., Abdo, A. H., Mello, M. B., & Bastos, F. I. (2011). Assessing network scale-up estimates for groups most at risk of HIV/AIDS: evidence from a multiple-method study of heavy drug users in Curitiba, Brazil. American journal of epidemiology, 174(10), 1190-1196.
Value
TRUE if id
is the direct descendant of seed.id
and FALSE otherwise
Build an RDS seed's chain from the dataset
Description
Note that this assumes that the chain is a tree (no loops)
Usage
make.chain(seed.id, survey.data, is.child.fn = is.child.ct)
Arguments
seed.id |
The id of the seed whose chain we wish to build from the dataset |
survey.data |
The dataset |
is.child.fn |
A function which takes two ids as arguments;
it is expected to return |
Value
info
Get the height (maximum depth) of a chain
Description
Get the height (maximum depth) of a chain
Usage
## S3 method for class 'depth'
max(chain)
Arguments
chain |
The chain object |
Value
The maximum depth of the chain
Run a markov model
Description
Run a given markov model for n time steps, starting at a specified state.
Usage
mc.sim(mm, start, n)
Arguments
mm |
The markov model object returned by |
start |
The name of the state to start in |
n |
The number of time-steps to run through |
Details
This uses the markov model produced by estimate.mixing()
Value
A vector with the state visited at each time step. the first entry has the starting state
Parse a formula that describes the design of a survey
Description
Parse a formula of the form
~ psu_v1 + psu_v2 + ... + strata(strata_v1 + strata_v2 + ...)
into a PSU formula and a strata formula.
Usage
parse_design(formula)
Arguments
formula |
a formula describing the sample design (see Description of |
Value
a list with entries psu.formula
and strata.formula
Draw RDS bootstrap resamples for one chain
Description
This function uses the algorithm described in the supporting online material for Weir et al 2012 (see Details) to take bootstrap resamples of one chain from an RDS dataset.
Usage
rds.boot.draw.chain(chain, mm, dd, parent.trait, idvar = "uid")
Arguments
chain |
The chain to draw resamples for |
mm |
The mixing model to use |
dd |
The degree distns to use |
parent.trait |
A vector whose length is the number of bootstrap reps we want |
idvar |
The name of the variable used to label the columns of the output (presumably some id identifying the row in the original dataset they come from) |
Details
See
Weir, Sharon S., et al. "A comparison of respondent-driven and venue-based sampling of female sex workers in Liuzhou, China." Sexually transmitted infections 88.Suppl 2 (2012): i95-i101.
Value
A list of dataframes with one entry for each respondent in the chain. each dataframe has one row for each bootstrap replicate. so if we take 10 bootstrap resamples of a chain of length 50, there will be 50 entries in the list that is returned. each entry will be a dataframe with 10 rows.
Draw RDS bootstrap resamples
Description
Draw bootstrap resamples for an RDS dataset, using
the algorithm described in the supporting online
material of Weir et al 2012 (see rds.boot.draw.chain()
).
Usage
rds.chain.boot.draws(chains, mm, dd, num.reps, keep.vars = NULL)
Arguments
chains |
A list whose entries are the chains we want to resample |
mm |
The mixing model |
dd |
The degree distributions |
num.reps |
The number of bootstrap resamples we want |
keep.vars |
If not |
Value
A list of length num.reps
; each entry in
the list has one bootstrap-resampled dataset
Draw RDS bootstrap resamples using the algorithm in Salganik 2006 (see Details below)
Description
This algorithm picks a respondent from the survey to be a seed uniformly at random. it then generates a bootstrap draw by simulating the markov process forward for n steps, where n is the size of the draw required.
If you wish the bootstrap dataset to end up with
variables from the original dataset other than the
traits and degree, then you must specify this when
you construct dd
using the
'estimate.degree.distns
function.
Usage
rds.mc.boot.draws(chains, mm, dd, num.reps)
Arguments
chains |
A list with the chains constructed from the survey
using |
mm |
The mixing model |
dd |
The degree distributions |
num.reps |
The number of bootstrap resamples we want |
Details
See:
Salganik, Matthew J. "Variance estimation, design effects, and sample size calculations for respondent-driven sampling." Journal of Urban Health 83.1 (2006): 98-112.
Value
A list of length num.reps
; each entry in
the list has one bootstrap-resampled dataset
rescaled.bootstrap.sample
Description
Given a survey dataset and a description of the survey design (ie, which combination of variables determines primary sampling units, and which combination of variables determines strata), take a bunch of bootstrap samples for the rescaled bootstrap estimator (see Details).
Usage
rescaled.bootstrap.sample(
survey.data,
survey.design,
parallel = FALSE,
paropts = NULL,
num.reps = 1
)
Arguments
survey.data |
The dataset to use |
survey.design |
A formula describing the design of the survey (see Details) |
parallel |
If |
paropts |
An optional list of arguments passed on to |
num.reps |
The number of bootstrap replication samples to draw |
Details
survey.design
is a formula of the form
weight ~ psu_vars + strata(strata_vars)
where:
-
weight
is the variable with the survey weights -
psu_vars
has the formpsu_v1 + psu_v2 + ...
, where primary sampling units (PSUs) are determined bypsu_v1
, etc -
strata_vars
has the formstrata_v1 + strata_v2 + ...
, which determine strata
Note that we assume that the formula uniquely specifies PSUs. This will always be true if the PSUs were selected without replacement. If they were selected with replacement, then it will be necessary to make each realization of a given PSU in the sample a unique id. The code below assumes that all observations within each PSU (as identified by the design formula) are from the same draw of the PSU.
The rescaled bootstrap technique works by adjusting the estimation weights based on the number of times each row is included in the resamples. If a row is never selected, it is still included in the returned results, but its weight will be set to 0. It is therefore important to use estimators that make use of the estimation weights on the resampled datasets.
We always take m_i = n_i - 1, according to the advice presented in Rao and Wu (1988) and Rust and Rao (1996).
(This is a C++ version; a previous version, written in pure R,
is called rescaled.bootstrap.sample.pureR()
)
References:
Rust, Keith F., and J. N. K. Rao. "Variance estimation for complex surveys using replication techniques." Statistical methods in medical research 5.3 (1996): 283-310.
Rao, Jon NK, and C. F. J. Wu. "Resampling inference with complex survey data." Journal of the American Statistical Association 83.401 (1988): 231-241.
Value
A list with num.reps
entries. Each entry is a dataset which
has at least the variables index
(the row index of the original
dataset that was resampled) and weight.scale
(the factor by which to multiply the sampling weights
in the original dataset).
Examples
survey <- MU284.complex.surveys[[1]]
boot_surveys <- rescaled.bootstrap.sample(survey.data = survey,
survey.design = ~ CL,
num.reps = 2)
rescaled.bootstrap.sample.pureR
Description
(this is the pure R version; it has been supplanted by
rescaled.bootstrap.sample
, which is partially written in C++)
Usage
rescaled.bootstrap.sample.pureR(
survey.data,
survey.design,
parallel = FALSE,
paropts = NULL,
num.reps = 1
)
Arguments
survey.data |
the dataset to use |
survey.design |
a formula describing the design of the survey (see below - TODO) |
parallel |
if TRUE, use parallelization (via |
paropts |
an optional list of arguments passed on to |
num.reps |
the number of bootstrap replication samples to draw |
Details
given a survey dataset and a description of the survey design (ie, which combination of vars determines primary sampling units, and which combination of vars determines strata), take a bunch of bootstrap samples for the rescaled bootstrap estimator (see, eg, Rust and Rao 1996).
Note that we assume that the formula uniquely specifies PSUs. This will always be true if the PSUs were selected without replacement. If they were selected with replacement, then it will be necessary to make each realization of a given PSU in the sample a unique id. Bottom line: the code below assumes that all observations within each PSU (as identified by the design formula) are from the same draw of the PSU.
The rescaled bootstrap technique works by adjusting the estimation weights based on the number of times each row is included in the resamples. If a row is never selected, it is still included in the returned results, but its weight will be set to 0. It is therefore important to use estimators that make use of the estimation weights on the resampled datasets.
We always take m_i = n_i - 1, according to the advice presented in Rao and Wu (1988) and Rust and Rao (1996).
survey.design
is a formula of the form
weight ~ psu_vars + strata(strata_vars),
where weight is the variable with the survey weights and psu
is the variable denoting the primary sampling unit
Value
a list with num.reps
entries. each entry is a dataset which
has at least the variables index
(the row index of the original
dataset that was resampled) and weight.scale
(the factor by which to multiply the sampling weights
in the original dataset).
rescaled.bootstrap.weights
Description
This function creates a dataset with rescaled bootstrap weights;
it can be a helpful alternative to bootstrap.estimates
in some situations
Usage
rescaled.bootstrap.weights(
survey.data,
survey.design,
num.reps,
weights = NULL,
idvar,
verbose = TRUE,
parallel = FALSE,
paropts = NULL
)
Arguments
survey.data |
The dataset to use |
survey.design |
A formula describing the design of the survey
(see Details in |
num.reps |
the number of bootstrap replication samples to draw |
weights |
weights to use in estimation (or NULL, if none) |
idvar |
the name of the column in |
verbose |
if TRUE, produce lots of feedback about what is going on |
parallel |
if TRUE, use the plyr library's .parallel argument to produce bootstrap resamples and estimates in parallel |
paropts |
if not NULL, additional arguments to pass along to the parallelization routine |
Details
The formula describing the survey design should have the form
~ psu_v1 + psu_v2 + ... + strata(strata_v1 + strata_v2 + ...)
,
where psu_v1, ...
are the variables identifying primary sampling units (PSUs)
and strata_v1, ...
identify the strata
Value
if no summary.fn is specified, then return the list of estimates produced by estimator.fn; if summary.fn is specified, then return its output
Examples
survey <- MU284.complex.surveys[[1]]
rescaled.bootstrap.weights(survey.data = survey,
survey.design = ~ CL,
weights='sample_weight',
idvar='LABEL',
num.reps = 2)
## Not run:
bootweights <- rescaled.bootstrap.weights(
# formula describing survey design:
# psu and strata
survey.design = ~ psu +
stratum(stratum_analysis),
num.reps=10000,
# column with respondent ids
idvar='caseid',
# column with sampling weight
weights='wwgt',
# survey dataset
survey.data=mw.ego)
## End(Not run)
srs.bootstrap.sample
Description
Given a survey dataset and a description of the survey design (ie, which combination of vars determines primary sampling units, and which combination of vars determines strata), take a bunch of bootstrap samples under a simple random sampling (with repetition) scheme
Usage
srs.bootstrap.sample(
survey.data,
num.reps = 1,
parallel = FALSE,
paropts = NULL,
...
)
Arguments
survey.data |
The dataset to use |
num.reps |
The number of bootstrap replication samples to draw |
parallel |
If |
paropts |
An optional list of arguments passed on to |
... |
Ignored, but useful because it allows params like |
Value
A list with num.reps
entries. Each entry is a dataset which has
at least the variables index
(the row index of the original dataset that
was resampled) and weight.scale
(the factor by which to multiply the
sampling weights in the original dataset).
Examples
survey <- MU284.surveys[[1]]
boot_surveys <- srs.bootstrap.sample(survey, num.reps = 2)
Survey bootstrap variance estimators
Description
surveybootstrap
has methods for analyzing data that were collected
using network reporting techniques. It includes estimators appropriate for
the simple bootstrap and the rescaled bootstrap.
Take a set of traits and turn into a string
Description
This is a helper function that is useful when we wish to make several traits into one variable
Usage
traits.to.string(data, traits, na.action = "drop", sep = ".")
Arguments
data |
The respondent info |
traits |
The names of the traits to build the model on |
na.action |
Defaults to 'drop' (meaning all rows of data
with any missingness on the traits are dropped). Anything else
means |
sep |
The separator character used to combine values |
Value
A list whose entries are
-
used.idx
, which indicates which rows from the original dataset were used (may not be all of them if there is missingness); and -
traits
, which has the string version of the traits
Unparse a collapsed trait string
Description
For a few of the RDS-related functions, it is useful to combine several traits into one variable as a string; for example, "male" and "young" might become "male.young". this function takes a string with combined traits and explodes it back into several variables.
Usage
unparse.trait(trait.string, names, sep = "\\.")
Arguments
trait.string |
A vector whose values are collapsed traits |
names |
A vector with the names of each trait (in order) |
sep |
The character used to separate the traits in their collapsed string representation |
Value
A dataframe whose rows correspond to the entries in
trait.string
, with one column per trait
Only prints things out in verbose mode
Description
Only prints things out in verbose mode
Usage
vcat(verbose = TRUE, ...)
Arguments
verbose |
If |
... |
Arguments to pass to cat if verbose is |
Compute the weighted mean
Description
Given a vector of values and a vector of weights, compute the weighted mean
Usage
weighted.mean(x, w, na.rm = FALSE)
Arguments
x |
The vector of values |
w |
The vector of weights |
na.rm |
if |
Value
The weighted mean