Title: |
Simulate Disease Outbreak Line List and Contacts Data |
Version: |
0.5.0 |
Description: |
Tools to simulate realistic raw case data for an epidemic in
the form of line lists and contacts using a branching process.
Simulated outbreaks are parameterised with epidemiological parameters
and can have age-structured populations, age-stratified
hospitalisation and death risk and time-varying case fatality risk. |
License: |
MIT + file LICENSE |
URL: |
https://github.com/epiverse-trace/simulist,
https://epiverse-trace.github.io/simulist/ |
BugReports: |
https://github.com/epiverse-trace/simulist/issues |
Depends: |
R (≥ 4.1.0) |
Imports: |
checkmate, english, epiparameter (≥ 0.4.0), randomNames,
rlang, stats |
Suggests: |
dplyr, epicontacts (≥ 1.1.3), ggplot2, incidence2 (≥
2.3.0), knitr, rmarkdown, spelling, testthat (≥ 3.0.0), tidyr |
VignetteBuilder: |
knitr |
Config/Needs/website: |
epiverse-trace/epiversetheme |
Config/testthat/edition: |
3 |
Encoding: |
UTF-8 |
Language: |
en-GB |
RoxygenNote: |
7.3.2 |
NeedsCompilation: |
no |
Packaged: |
2025-03-25 15:46:56 UTC; hugo |
Author: |
Joshua W. Lambert
[aut, cre, cph],
Carmen Tamayo
[aut],
Hugo Gruson [ctb,
rev],
Pratik R. Gupte
[ctb, rev],
Adam Kucharski
[rev],
Chris Hartgerink
[rev],
Sebastian Funk
[ctb],
London School of Hygiene and Tropical Medicine, LSHTM
[cph] |
Maintainer: |
Joshua W. Lambert <joshua.lambert@lshtm.ac.uk> |
Repository: |
CRAN |
Date/Publication: |
2025-03-25 19:50:02 UTC |
simulist: Simulate Disease Outbreak Line List and Contacts Data
Description
Tools to simulate realistic raw case data for an epidemic in the form of line lists and contacts using a branching process. Simulated outbreaks are parameterised with epidemiological parameters and can have age-structured populations, age-stratified hospitalisation and death risk and time-varying case fatality risk.
Author(s)
Maintainer: Joshua W. Lambert joshua.lambert@lshtm.ac.uk (ORCID) [copyright holder]
Authors:
Other contributors:
See Also
Useful links:
Add line list event dates and case information as columns to infectious
history <data.frame>
Description
These .add_*()
functions add columns to the <data.frame>
output by .sim_network_bp()
. The <data.frame>
supplied to .data
will
have a different number of columns depending on which function is being
called (i.e. the <data.frame>
supplied to .add_hospitalisation()
will
have more columns than the <data.frame>
supplied to .add_date_contact()
as former function is called later in the simulation).
The event date could be first contact, last contact or other.
Usage
.add_date_contact(
.data,
first_contact_distribution,
last_contact_distribution,
outbreak_start_date
)
.add_hospitalisation(.data, onset_to_hosp, hosp_risk)
.add_outcome(
.data,
onset_to_death,
onset_to_recovery,
hosp_death_risk,
non_hosp_death_risk,
config
)
.add_names(.data, anonymise = FALSE)
.add_ct(.data, distribution)
.add_reporting_delay(.data, reporting_delay)
Arguments
.data |
A <data.frame> containing the infectious history from a
branching process simulation (.sim_network_bp() ).
|
first_contact_distribution , last_contact_distribution |
A function to
generate the time for the first or last contact between the infector
and infectee (exposure window). See create_config() .
|
outbreak_start_date |
A date for the start of the outbreak.
|
onset_to_hosp |
A function or an <epiparameter> object for the
onset-to-hospitalisation delay distribution. onset_to_hosp can also be
set to NULL to not simulate hospitalisation (admission) dates.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-hospitalisation delay. The
function must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 1.5 and sdlog = 0.5 .
If onset_to_hosp is set to NULL then hosp_risk and hosp_death_risk
will be automatically set to NULL if not manually specified.
|
hosp_risk |
Either a single numeric for the hospitalisation risk of
everyone in the population, or a <data.frame> with age specific
hospitalisation risks. Default is 20% hospitalisation (0.2 ) for the entire
population. If the onset_to_hosp argument is set to NULL this argument
will automatically be set to NULL if not specified or can be manually
set to NULL . See details and examples for more information.
|
onset_to_death |
A function or an <epiparameter> object for the
onset-to-death delay distribution. onset_to_death can also be set to
NULL to not simulate dates for individuals that died.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-death delay. The function must
have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2.5 and sdlog = 0.5 .
If onset_to_death is set to NULL then non_hosp_death_risk and
hosp_death_risk will be automatically set to NULL if not manually
specified.
For hospitalised cases, the function ensures the onset-to-death time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-death time (from onset_to_death ) cannot be sampled that is
greater than a onset-to-hospitalisation time (from onset_to_hosp ) then
the function will error. Due to this conditional sampling, the
onset-to-death times in the line list may not resemble the distributional
form input into the function.
|
onset_to_recovery |
A function or an <epiparameter> object for the
onset-to-recovery delay distribution. onset_to_recovery can also be NULL
to not simulate dates for individuals that recovered.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-recovery delay. The function
must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is NULL so by default cases that recover get an NA in the
$date_outcome line list column.
For hospitalised cases, the function ensures the onset-to-recovery time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-recovery time (from onset_to_recovery ) cannot be sampled
that is greater than a onset-to-hospitalisation time (from onset_to_hosp )
then the function will error. Due to this conditional sampling, the
onset-to-recovery times in the line list may not resemble the distributional
form input into the function.
|
hosp_death_risk |
Either a single numeric for the death risk for
hospitalised individuals across the population, or a <data.frame> with age
specific hospitalised death risks Default is 50% death risk in hospitals
(0.5 ) for the entire population. If the onset_to_death argument is set
to NULL this argument will automatically be set to NULL if not specified
or can be manually set to NULL . See details and examples for more
information. The hosp_death_risk can vary through time if specified in
the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
non_hosp_death_risk |
Either a single numeric for the death risk for
outside of hospitals across the population, or a <data.frame> with age
specific death risks outside of hospitals. Default is 5% death risk outside
of hospitals (0.05 ) for the entire population. If the onset_to_death
argument is set to NULL this argument will automatically be set to NULL
if not specified or can be manually set to NULL . See details and examples
for more information. The non_hosp_death_risk can vary through time if
specified in the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
config |
A list of settings to adjust the randomly sampled delays and
Ct values. See create_config() for more information.
|
anonymise |
A logical boolean for whether case names should be
anonymised. Default is FALSE .
|
reporting_delay |
A function for the reporting delay distribution or
NULL . The (random) number generating function creates delays between the
time of symptom onset ($date_onset ) and the case being reported
($date_reporting ).
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the reporting delay. The function must have
a single argument.
The default is NULL so by default there is no reporting delay, and the
$date_reporting line list column is identical to the $date_onset column.
|
Value
A <data.frame>
with one more column than input into .data
.
Unless the column heading is already present in which the data is
overwritten.
Introduce user-specified proportion of custom missing values into a
<data.frame>
Description
Introduce user-specified proportion of custom missing values into a
<data.frame>
Usage
.add_missing(linelist, .args)
Arguments
Value
A line list <data.frame>
Anonymise names
Description
A simple algorithm to replace names with an alphanumeric string with an
fixed number of characters (i.e. nchar()
) specified by string_len
.
Usage
.anonymise(x, string_len = 10)
Arguments
x |
A vector of character strings.
|
string_len |
A single numeric specifying the number of alphanumeric
characters to use for each anonymising character string.
Default is 10 .
|
Value
A vector of character
strings of equal length to the input.
Check if <data.frame>
defining either age-stratified hospitalisation or
death risk, or defining age structure of population is correct
Description
Check if <data.frame>
defining either age-stratified hospitalisation or
death risk, or defining age structure of population is correct
Usage
.check_df(x, df_type = c("risk", "age"), age_range = NULL)
Arguments
x |
A <data.frame> .
|
df_type |
A character string, either "risk" or "age" to specify
which input <data.frame> is being checked.
|
age_range |
A numeric vector of length 2. Only required when
df_type = risk , NULL by default.
|
Value
A <data.frame>
, also called for error side-effects when input is
invalid.
Check if R object is line list from sim_linelist()
Description
Check if R object is line list from sim_linelist()
Usage
.check_linelist(linelist)
Arguments
Details
This is a check that the object supplied to linelist
is from the
sim_linelist()
or sim_outbreak()
functions, it is not related to
the class of the object, in other words, it does not check the object is
of class <linelist>
.
Value
Invisibly return the linelist
<data.frame>
. The function is
called for its side-effects, which will error if the input is invalid.
Description
Check if arguments input to simulation function are valid
Usage
.check_sim_input(
sim_type = c("linelist", "contacts", "outbreak"),
contact_distribution,
infectious_period,
prob_infection,
outbreak_start_date,
outbreak_size,
onset_to_hosp = NULL,
onset_to_death = NULL,
onset_to_recovery = NULL,
anonymise = NULL,
case_type_probs = NULL,
contact_tracing_status_probs = NULL,
hosp_risk = NULL,
hosp_death_risk = NULL,
non_hosp_death_risk = NULL,
population_age = NULL
)
Arguments
sim_type |
A character string specifying which simulation function
this function is being called within.
|
contact_distribution |
A function or an <epiparameter> object to
generate the number of contacts per infection.
The function can be defined or anonymous. The function must have a single
argument in the form of an integer vector with elements representing the
number of contacts, and return a numeric vector where each element
corresponds to the probability of observing the number of contacts in the
vector passed to the function. The index of the numeric vector returned
is offset by one to the corresponding probability of observing the number
of contacts, i.e. the first element of the output vector is the probability
of observing zero contacts, the second element is the probability of
observing one contact, etc.
An <epiparameter> can be provided. This will be converted into a
probability mass function internally.
The default is an anonymous function with a Poisson probability mass function
(dpois() ) with a mean (\lambda ) of 2 contacts per infection.
|
infectious_period |
A function or an <epiparameter> object for the
infectious period. This defines the duration from becoming infectious to
no longer infectious. In the simulation, individuals are assumed to
become infectious immediately after being infected (the latency period is
assumed to be zero). The time intervals between an infected individual and
their contacts are assumed to be uniformly distributed within the
infectious period. Infectious periods must be strictly positive.
The function can be defined or anonymous. The function must return a vector
of randomly generated real numbers representing sampled infectious periods.
The function must have a single argument, the number of random infectious
periods to generate.
An <epiparameter> can be provided. This will be converted into random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2 and sdlog = 0.5 .
|
prob_infection |
A single numeric for the probability of a secondary
contact being infected by an infected primary contact.
|
outbreak_start_date |
A date for the start of the outbreak.
|
outbreak_size |
A numeric vector of length 2 defining the minimum and
the maximum number of infected individuals for the simulated outbreak.
Default is c(10, 1e4) , so the minimum outbreak size is 10 infected
individuals, and the maximum outbreak size is 10,000 infected individuals.
Either number can be changed to increase or decrease the maximum or minimum
outbreak size to allow simulating larger or smaller outbreaks. If the
minimum outbreak size cannot be reached after running the simulation for
many iterations (internally) then the function errors, whereas if the
maximum outbreak size is exceeded the function returns the data early and a
warning stating how many cases and contacts are returned.
|
onset_to_hosp |
A function or an <epiparameter> object for the
onset-to-hospitalisation delay distribution. onset_to_hosp can also be
set to NULL to not simulate hospitalisation (admission) dates.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-hospitalisation delay. The
function must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 1.5 and sdlog = 0.5 .
If onset_to_hosp is set to NULL then hosp_risk and hosp_death_risk
will be automatically set to NULL if not manually specified.
|
onset_to_death |
A function or an <epiparameter> object for the
onset-to-death delay distribution. onset_to_death can also be set to
NULL to not simulate dates for individuals that died.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-death delay. The function must
have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2.5 and sdlog = 0.5 .
If onset_to_death is set to NULL then non_hosp_death_risk and
hosp_death_risk will be automatically set to NULL if not manually
specified.
For hospitalised cases, the function ensures the onset-to-death time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-death time (from onset_to_death ) cannot be sampled that is
greater than a onset-to-hospitalisation time (from onset_to_hosp ) then
the function will error. Due to this conditional sampling, the
onset-to-death times in the line list may not resemble the distributional
form input into the function.
|
onset_to_recovery |
A function or an <epiparameter> object for the
onset-to-recovery delay distribution. onset_to_recovery can also be NULL
to not simulate dates for individuals that recovered.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-recovery delay. The function
must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is NULL so by default cases that recover get an NA in the
$date_outcome line list column.
For hospitalised cases, the function ensures the onset-to-recovery time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-recovery time (from onset_to_recovery ) cannot be sampled
that is greater than a onset-to-hospitalisation time (from onset_to_hosp )
then the function will error. Due to this conditional sampling, the
onset-to-recovery times in the line list may not resemble the distributional
form input into the function.
|
anonymise |
A logical boolean for whether case names should be
anonymised. Default is FALSE .
|
case_type_probs |
A named numeric vector with the probability of
each case type. The names of the vector must be "suspected" , "probable" ,
"confirmed" . Values of each case type must sum to one.
|
contact_tracing_status_probs |
A named numeric vector with the
probability of each contact tracing status. The names of the vector must
be "under_followup" , "lost_to_followup" , "unknown" . Values of each
contact tracing status must sum to one.
|
hosp_risk |
Either a single numeric for the hospitalisation risk of
everyone in the population, or a <data.frame> with age specific
hospitalisation risks. Default is 20% hospitalisation (0.2 ) for the entire
population. If the onset_to_hosp argument is set to NULL this argument
will automatically be set to NULL if not specified or can be manually
set to NULL . See details and examples for more information.
|
hosp_death_risk |
Either a single numeric for the death risk for
hospitalised individuals across the population, or a <data.frame> with age
specific hospitalised death risks Default is 50% death risk in hospitals
(0.5 ) for the entire population. If the onset_to_death argument is set
to NULL this argument will automatically be set to NULL if not specified
or can be manually set to NULL . See details and examples for more
information. The hosp_death_risk can vary through time if specified in
the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
non_hosp_death_risk |
Either a single numeric for the death risk for
outside of hospitals across the population, or a <data.frame> with age
specific death risks outside of hospitals. Default is 5% death risk outside
of hospitals (0.05 ) for the entire population. If the onset_to_death
argument is set to NULL this argument will automatically be set to NULL
if not specified or can be manually set to NULL . See details and examples
for more information. The non_hosp_death_risk can vary through time if
specified in the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
population_age |
Either a numeric vector with two elements or a
<data.frame> with age structure in the population. Use a numeric vector
to specific the age range of the population, the first element is the lower
bound for the age range, and and the second is the upper bound for the age
range (both inclusive, i.e. [lower, upper]). The <data.frame> with
age groups and the proportion of the population in that group. See details
and examples for more information.
|
Details
Arguments that are used by all simulation functions are required
and not given a default value, for other arguments that are not inputs in
all simulation functions a default of NULL
is used.
Defaults mentioned in argument documentation is the default for the exported
simulation function and not the default in this checking function. In this
function there is either no default or NULL
.
Value
Invisibly return the sim_type
character
string. The function is
called for its side-effects, which will error if the input is invalid.
Description
There are two types of cross-checking:
If the onset-to-event distribution is specified but the corresponding risk
is not specified (i.e. NULL
) the function will error (stop()
).
If the onset-to-event distribution is not specified (i.e. NULL
) but the
corresponding risk is specified the function will throw a warning
(warning()
).
The difference in condition handling is because in the case that the
onset-to-event is NULL
the simulation can safely ignore the corresponding
risk, while throwing a warning, as there are no events. In other words, if
the onset-to-hospitalisation is not specified, no infected individuals will
go to hospital and the date_admission
column in the line list will all be
NA
s. However, if the onset-to-event is specified and the corresponding
risk is NULL
then the proportion of individuals infected that are
hospitalised or die cannot be calculated and therefore the simulation
cannot run. It is in this latter case that the cross-checking throws an
error.
Usage
.cross_check_sim_input(
onset_to_hosp,
onset_to_death,
hosp_risk,
hosp_death_risk,
non_hosp_death_risk
)
Arguments
onset_to_hosp |
A function or an <epiparameter> object for the
onset-to-hospitalisation delay distribution. onset_to_hosp can also be
set to NULL to not simulate hospitalisation (admission) dates.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-hospitalisation delay. The
function must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 1.5 and sdlog = 0.5 .
If onset_to_hosp is set to NULL then hosp_risk and hosp_death_risk
will be automatically set to NULL if not manually specified.
|
onset_to_death |
A function or an <epiparameter> object for the
onset-to-death delay distribution. onset_to_death can also be set to
NULL to not simulate dates for individuals that died.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-death delay. The function must
have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2.5 and sdlog = 0.5 .
If onset_to_death is set to NULL then non_hosp_death_risk and
hosp_death_risk will be automatically set to NULL if not manually
specified.
For hospitalised cases, the function ensures the onset-to-death time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-death time (from onset_to_death ) cannot be sampled that is
greater than a onset-to-hospitalisation time (from onset_to_hosp ) then
the function will error. Due to this conditional sampling, the
onset-to-death times in the line list may not resemble the distributional
form input into the function.
|
hosp_risk |
Either a single numeric for the hospitalisation risk of
everyone in the population, or a <data.frame> with age specific
hospitalisation risks. Default is 20% hospitalisation (0.2 ) for the entire
population. If the onset_to_hosp argument is set to NULL this argument
will automatically be set to NULL if not specified or can be manually
set to NULL . See details and examples for more information.
|
hosp_death_risk |
Either a single numeric for the death risk for
hospitalised individuals across the population, or a <data.frame> with age
specific hospitalised death risks Default is 50% death risk in hospitals
(0.5 ) for the entire population. If the onset_to_death argument is set
to NULL this argument will automatically be set to NULL if not specified
or can be manually set to NULL . See details and examples for more
information. The hosp_death_risk can vary through time if specified in
the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
non_hosp_death_risk |
Either a single numeric for the death risk for
outside of hospitals across the population, or a <data.frame> with age
specific death risks outside of hospitals. Default is 5% death risk outside
of hospitals (0.05 ) for the entire population. If the onset_to_death
argument is set to NULL this argument will automatically be set to NULL
if not specified or can be manually set to NULL . See details and examples
for more information. The non_hosp_death_risk can vary through time if
specified in the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
Value
Invisibly return the onset_to_hosp
argument. The function is
called for its side-effects, which will error or warn if the input is
invalid.
Description
Sample names for specified sexes by sampling with replacement to avoid
exhausting number of name when sample.with.replacement = FALSE
. The
duplicated names during sampling need to be removed to ensure each
individual has a unique name. In order to have enough unique names, more
names than required are sampled from randomNames::randomNames()
, and the
level of oversampling is determined by the buffer_factor
argument. A
buffer_factor
too high and the more names are sampled which takes longer,
a buffer_factor
too low and not enough unique names are sampled and
the .sample_names()
function will need to loop until it has enough
unique names.
Usage
.sample_names(.data, buffer_factor = 1.5)
Arguments
.data |
A <data.frame> containing the infectious history from a
branching process simulation (.sim_network_bp() ).
|
buffer_factor |
A single numeric determining the level of
oversampling (or buffer) when creating a vector of unique names from
randomNames::randomNames() .
|
Value
A character
vector.
Sample the onset-to-outcome time conditional that the outcome is after
a hospitalisation event
Description
The outcome of a case, either died or recovered, can have a time of event,
but this must be after the hospitalisation time, if a case has been admitted
to hospital. This function samples either the onset-to-death or
onset-to-recovery time conditional on it being greater than a
onset-to-hospitalisation time for a given case, if the case was admitted
to hospital. It does this by resampling onset-to-outcome (death or recovery)
times if they are less than the onset-to-hospitalisation time (from
.add_hospitalisation()
).
Usage
.sample_outcome_time(.data, onset_to_outcome, idx)
Arguments
.data |
A <data.frame> containing the infectious history from a
branching process simulation (.sim_network_bp() ).
|
onset_to_outcome |
A function for either the onset-to-death or
onset-to-recovery delay distribution. onset_to_outcome can also be set to
NULL to not simulate dates for individuals that died or recovered. See
sim_linelist() documentation for more information.
|
idx |
Either the infected_lgl_idx or died_lgl_idx from
.add_outcomes() to define who to sample recovery or death times for,
respectively.
|
Value
A <data.frame>
with a potentially modified $outcome_time
column.
Internal simulation function called by the exported simulation functions
within simulist
Description
This internal function simulates a line list, and
when sim_type
is "contacts"
or "outbreak"
a contacts table as well.
Usage
.sim_internal(
sim_type = c("linelist", "contacts", "outbreak"),
contact_distribution,
infectious_period,
prob_infection,
onset_to_hosp = NULL,
onset_to_death = NULL,
onset_to_recovery = NULL,
reporting_delay = NULL,
hosp_risk = NULL,
hosp_death_risk = NULL,
non_hosp_death_risk = NULL,
outbreak_start_date,
anonymise = NULL,
outbreak_size,
population_age,
case_type_probs = NULL,
contact_tracing_status_probs = NULL,
config
)
Arguments
sim_type |
A character string specifying which simulation function
this function is being called within.
|
contact_distribution |
A function or an <epiparameter> object to
generate the number of contacts per infection.
The function can be defined or anonymous. The function must have a single
argument in the form of an integer vector with elements representing the
number of contacts, and return a numeric vector where each element
corresponds to the probability of observing the number of contacts in the
vector passed to the function. The index of the numeric vector returned
is offset by one to the corresponding probability of observing the number
of contacts, i.e. the first element of the output vector is the probability
of observing zero contacts, the second element is the probability of
observing one contact, etc.
An <epiparameter> can be provided. This will be converted into a
probability mass function internally.
The default is an anonymous function with a Poisson probability mass function
(dpois() ) with a mean (\lambda ) of 2 contacts per infection.
|
infectious_period |
A function or an <epiparameter> object for the
infectious period. This defines the duration from becoming infectious to
no longer infectious. In the simulation, individuals are assumed to
become infectious immediately after being infected (the latency period is
assumed to be zero). The time intervals between an infected individual and
their contacts are assumed to be uniformly distributed within the
infectious period. Infectious periods must be strictly positive.
The function can be defined or anonymous. The function must return a vector
of randomly generated real numbers representing sampled infectious periods.
The function must have a single argument, the number of random infectious
periods to generate.
An <epiparameter> can be provided. This will be converted into random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2 and sdlog = 0.5 .
|
prob_infection |
A single numeric for the probability of a secondary
contact being infected by an infected primary contact.
|
onset_to_hosp |
A function or an <epiparameter> object for the
onset-to-hospitalisation delay distribution. onset_to_hosp can also be
set to NULL to not simulate hospitalisation (admission) dates.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-hospitalisation delay. The
function must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 1.5 and sdlog = 0.5 .
If onset_to_hosp is set to NULL then hosp_risk and hosp_death_risk
will be automatically set to NULL if not manually specified.
|
onset_to_death |
A function or an <epiparameter> object for the
onset-to-death delay distribution. onset_to_death can also be set to
NULL to not simulate dates for individuals that died.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-death delay. The function must
have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2.5 and sdlog = 0.5 .
If onset_to_death is set to NULL then non_hosp_death_risk and
hosp_death_risk will be automatically set to NULL if not manually
specified.
For hospitalised cases, the function ensures the onset-to-death time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-death time (from onset_to_death ) cannot be sampled that is
greater than a onset-to-hospitalisation time (from onset_to_hosp ) then
the function will error. Due to this conditional sampling, the
onset-to-death times in the line list may not resemble the distributional
form input into the function.
|
onset_to_recovery |
A function or an <epiparameter> object for the
onset-to-recovery delay distribution. onset_to_recovery can also be NULL
to not simulate dates for individuals that recovered.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-recovery delay. The function
must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is NULL so by default cases that recover get an NA in the
$date_outcome line list column.
For hospitalised cases, the function ensures the onset-to-recovery time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-recovery time (from onset_to_recovery ) cannot be sampled
that is greater than a onset-to-hospitalisation time (from onset_to_hosp )
then the function will error. Due to this conditional sampling, the
onset-to-recovery times in the line list may not resemble the distributional
form input into the function.
|
reporting_delay |
A function for the reporting delay distribution or
NULL . The (random) number generating function creates delays between the
time of symptom onset ($date_onset ) and the case being reported
($date_reporting ).
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the reporting delay. The function must have
a single argument.
The default is NULL so by default there is no reporting delay, and the
$date_reporting line list column is identical to the $date_onset column.
|
hosp_risk |
Either a single numeric for the hospitalisation risk of
everyone in the population, or a <data.frame> with age specific
hospitalisation risks. Default is 20% hospitalisation (0.2 ) for the entire
population. If the onset_to_hosp argument is set to NULL this argument
will automatically be set to NULL if not specified or can be manually
set to NULL . See details and examples for more information.
|
hosp_death_risk |
Either a single numeric for the death risk for
hospitalised individuals across the population, or a <data.frame> with age
specific hospitalised death risks Default is 50% death risk in hospitals
(0.5 ) for the entire population. If the onset_to_death argument is set
to NULL this argument will automatically be set to NULL if not specified
or can be manually set to NULL . See details and examples for more
information. The hosp_death_risk can vary through time if specified in
the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
non_hosp_death_risk |
Either a single numeric for the death risk for
outside of hospitals across the population, or a <data.frame> with age
specific death risks outside of hospitals. Default is 5% death risk outside
of hospitals (0.05 ) for the entire population. If the onset_to_death
argument is set to NULL this argument will automatically be set to NULL
if not specified or can be manually set to NULL . See details and examples
for more information. The non_hosp_death_risk can vary through time if
specified in the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
outbreak_start_date |
A date for the start of the outbreak.
|
anonymise |
A logical boolean for whether case names should be
anonymised. Default is FALSE .
|
outbreak_size |
A numeric vector of length 2 defining the minimum and
the maximum number of infected individuals for the simulated outbreak.
Default is c(10, 1e4) , so the minimum outbreak size is 10 infected
individuals, and the maximum outbreak size is 10,000 infected individuals.
Either number can be changed to increase or decrease the maximum or minimum
outbreak size to allow simulating larger or smaller outbreaks. If the
minimum outbreak size cannot be reached after running the simulation for
many iterations (internally) then the function errors, whereas if the
maximum outbreak size is exceeded the function returns the data early and a
warning stating how many cases and contacts are returned.
|
population_age |
Either a numeric vector with two elements or a
<data.frame> with age structure in the population. Use a numeric vector
to specific the age range of the population, the first element is the lower
bound for the age range, and and the second is the upper bound for the age
range (both inclusive, i.e. [lower, upper]). The <data.frame> with
age groups and the proportion of the population in that group. See details
and examples for more information.
|
case_type_probs |
A named numeric vector with the probability of
each case type. The names of the vector must be "suspected" , "probable" ,
"confirmed" . Values of each case type must sum to one.
|
contact_tracing_status_probs |
A named numeric vector with the
probability of each contact tracing status. The names of the vector must
be "under_followup" , "lost_to_followup" , "unknown" . Values of each
contact tracing status must sum to one.
|
config |
A list of settings to adjust the randomly sampled delays and
Ct values. See create_config() for more information.
|
Value
A <data.frame>
if sim_type
is "linelist"
or "contacts"
, or a
list of two <data.frame>
s if sim_type
is "outbreak"
.
Simulate a random network branching process model with a probability of
infection for each contact
Description
Simulate a branching process on a infinite network where the contact
distribution provides a function to sample the number of contacts of each
individual in the simulation. Each contact is then infected with the
probability of infection. The time between each contact is assumed to be
evenly distributed across the infectious period of the infected individual,
and is independent of whether the contact becomes infected.
Usage
.sim_network_bp(
contact_distribution,
infectious_period,
prob_infection,
max_outbreak_size,
config
)
Arguments
contact_distribution |
A function or an <epiparameter> object to
generate the number of contacts per infection.
The function can be defined or anonymous. The function must have a single
argument in the form of an integer vector with elements representing the
number of contacts, and return a numeric vector where each element
corresponds to the probability of observing the number of contacts in the
vector passed to the function. The index of the numeric vector returned
is offset by one to the corresponding probability of observing the number
of contacts, i.e. the first element of the output vector is the probability
of observing zero contacts, the second element is the probability of
observing one contact, etc.
An <epiparameter> can be provided. This will be converted into a
probability mass function internally.
The default is an anonymous function with a Poisson probability mass function
(dpois() ) with a mean (\lambda ) of 2 contacts per infection.
|
infectious_period |
A function or an <epiparameter> object for the
infectious period. This defines the duration from becoming infectious to
no longer infectious. In the simulation, individuals are assumed to
become infectious immediately after being infected (the latency period is
assumed to be zero). The time intervals between an infected individual and
their contacts are assumed to be uniformly distributed within the
infectious period. Infectious periods must be strictly positive.
The function can be defined or anonymous. The function must return a vector
of randomly generated real numbers representing sampled infectious periods.
The function must have a single argument, the number of random infectious
periods to generate.
An <epiparameter> can be provided. This will be converted into random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2 and sdlog = 0.5 .
|
prob_infection |
A single numeric for the probability of a secondary
contact being infected by an infected primary contact.
|
config |
A list of settings to adjust the randomly sampled delays and
Ct values. See create_config() for more information.
|
Details
The contact distribution sampled takes the network effect
q(n) \sim (n + 1)p(n + 1)
where p(n)
is the probability
density function of a distribution, e.g., Poisson or Negative binomial.
That is to say, the probability of having choosing a contact at random by
following up a contact chooses individuals with a probability proportional
to their number of contacts. The plus one is because one of the contacts
was "used" to infect the person.
Value
A <data.frame>
with the contact and transmission chain data.
Convert <epiparameter>
or NULL
to function
Description
An extension to as.function()
, particularly the epiparameter
as.function
S3 method, with the added feature that NULL
s are converted
into functions that generate a vector of NA
s to behave equivalently to
the generator functions output from
as.function(..., func_type = "generate")
.
If a function is already passed to as_function
it will be returned
unchanged.
There is also input checking to error if input is not an <epiparameter>
,
function
(closure), or for onset-to-event distributions NULL
.
Usage
as_function(x)
Arguments
x |
A named list containing either <epiparameter> , function or
NULL . Named list elements are: "contact_distribution" ,
"infectious_period" , "onset_to_hosp" , "onset_to_death" ,
"onset_to_recovery".
|
Value
A list of function
s.
Coerce and store <data.frame>
subclass to <data.frame>
and
restore <data.frame>
subclass to <data.frame>
from attribute.
Description
Coerce and store <data.frame>
subclass to <data.frame>
and
restore <data.frame>
subclass to <data.frame>
from attribute.
Usage
.as_df(x)
.restore_df_subclass(x)
Arguments
Value
A <data.frame>
or subclass of <data.frame>
.
Create a list of configuration settings for some details of sim_linelist()
Description
Create a list of configuration settings for some details of sim_linelist()
Usage
create_config(...)
Arguments
... |
<dynamic-dots > Named elements to replace
default settings. Only if names match exactly are elements replaced,
otherwise the function errors.
Accepted arguments and their defaults are:
last_contact_distribution A function to generate the
time for last contact. Default parameterisation is a Poisson
distribution with a \lambda of 3.
first_contact_distribution A function to generate the time
for the first contact. Default parameterisation is a Poisson
distribution with a \lambda of 3.
ct_distribution A function to generate Ct values for each
confirmed case. Default parameterisation is a Normal distribution with
a mean (\mu ) of 25 and a standard deviation (\sigma ) of 2.
network A character string, either "adjusted" (default) or
"unadjusted" .
time_varying_death_risk By default is NULL , but can also accept
a function with two arguments, risk and time , to apply a
time varying death risk of hospitalised and non-hospitalised cases in
the outbreak simulation. See
vignette("time-varying-cfr", package = "simulist") .
|
Details
The config
argument in sim_linelist()
controls the small details
around time windows around infections (time of first contact and last
contact with infector), and the distribution of the Cycle threshold (Ct)
value from a Real-time PCR or quantitative PCR (qPCR) for confirmed
cases, the network effect in the simulation, and if there is a time-varying
death risk.
These parameters do not warrant their own arguments in
sim_linelist()
as they rarely need to be changed from their default
setting. Therefore it is not worth increasing the number of sim_linelist()
arguments to accommodate these and the config
argument keeps the function
signature simpler and more readable.
The last_contact_distribution
and first_contact_distribution
can accept
any function that generates positive integers (e.g. discrete probability
distribution, rpois()
or rgeom()
). The ct_distribution
can accept
any function that generates real numbers (e.g. continuous or discrete
probability distribution, rnorm()
, rlnorm()
).
The network
option controls whether to sample contacts from a adjusted or
unadjusted contact distribution. Adjusted (default) sampling uses
q(n) \sim (n + 1)p(n + 1)
where p(n)
is the probability
density function of a distribution, e.g., Poisson or Negative binomial.
Unadjusted (network = "unadjusted"
) instead samples contacts directly from
a probability distribution p(n)
.
Value
A list of settings for sim_linelist()
Examples
# example with default configuration
create_config()
# example with customised Ct distribution
create_config(
ct_distribution = function(x) rlnorm(n = x, meanlog = 2, sdlog = 1)
)
Create messy line list data
Description
Take line list output from sim_linelist()
and replace elements of
the <data.frame>
with missing values (e.g. NA
), introduce spelling
mistakes and inconsistencies, as well as coerce date types.
Usage
messy_linelist(linelist, ...)
Arguments
linelist |
Line list <data.frame> output from sim_linelist() .
|
... |
<dynamic-dots > Named elements to replace
default settings. Only if names match exactly are elements replaced,
otherwise the function errors.
Accepted arguments and their defaults are:
prop_missing A numeric between 0 and 1 for the proportion of
missing values introduced. Default is 0.1 (10%).
missing_value A single atomic R object used to represent missing
values. Default is NA .
prop_spelling_mistakes A numeric between 0 and 1 used to
specify the proportion of spelling mistakes in character columns.
Default is 0.1 (10%).
inconsistent_sex A logical boolean to specify whether the
$sex column uses "m" and "f" , or inconsistently uses "m" , "f" ,
"M" , "F" , "male" , "female" , "Male" or "Female" . Default
is TRUE so sexes are sampled from the options.
sex_as_numeric A logical boolean used to specify whether
the values in the $sex column should be encoded as numeric
values (0 and 1 ). Default is FALSE . sex_as_numeric cannot be
TRUE if inconsistent_sex = TRUE .
numeric_as_char A logical boolean used to specify whether
numeric columns should be coerced to character . Default is TRUE .
date_as_char A logical boolean used to specify whether Date
columns should be coerced to character . Default is TRUE .
inconsistent_dates A logical boolean used to specify whether
the values in Date columns are inconsistently formatted (e.g.
"%Y-%m-%d" , "%Y/%m/%d" , "%d-%m-%Y" , or "%d %B %Y" ).
Default is FALSE .
prop_int_as_word A numeric between 0 and 1 for the proportion
of elements in integer columns should that are coerced to words (see
english::words() ). Default is 0.5 (50%).
prop_duplicate_row A numeric between 0 and 1 for the
proportion of rows to duplicate. Default is 0.01 (1%). If
prop_duplicate_row > 0 then it is guaranteed that at least one row will
be duplicated.
inconsistent_id A logical boolean used to specify whether the
$id column has inconsistent formatting by appending random prefixes and
suffixes to a random sample (~10%) of IDs. Default is FALSE , so IDs
are numbers (numeric , characters or words depending on
prop_int_as_word and numeric_as_char ).
|
Details
By default messy_linelist()
:
Introduces 10% of values missing, i.e. converts to NA
.
Introduces spelling mistakes in 10% of character
columns.
Introduce inconsistency in the reporting of $sex
.
Converts numeric
columns (double
& integer
) to character
.
Converts Date
columns to character
.
Converts 50% of integer
s to (English) words.
Duplicates 1% of rows.
Setting missing_value
to something other than NA
will likely cause
type coercion in the line list <data.frame>
columns, most likely to
character
.
When setting sex_as_numeric
to TRUE
, male is set to 0
and female
to 1
. Only one of inconsistent_sex
or sex_as_numeric
can be TRUE
,
otherwise the function will error.
If numeric_as_char = TRUE
and sex_as_numeric = TRUE
then the sex encoded
as 0 or 1 is converted to character
. If prop_spelling_mistake
> 0 and
numeric_as_char = TRUE
the columns that are converted from numeric
to
character
do not have spelling mistakes introduced, because they are
numeric characters stored as character strings. If
prop_spelling_mistake
> 0 and date_as_char = TRUE
spelling mistakes are
not introduced into dates.
The Date
columns can be converted into an inconsistent format by
setting inconsistent_dates = TRUE
and it requires date_as_char = TRUE
,
if the latter is FALSE
the function will error.
If numeric_as_char = FALSE
and prop_int_as_word
> 0 then the integer
columns are converted to character
string (either character
numbers or
words) but the other numeric
columns are not coerced. Spelling mistakes
are not introduced into integers converted to words when
prop_spelling_mistakes
> 0 and prop_int_as_word
> 0.
Rows are duplicated after other messy modifications so the duplicated row
contains identical messy elements.
Value
A messy line list <data.frame>
.
Examples
linelist <- sim_linelist()
messy_linelist <- messy_linelist(linelist)
# increasing proportion of missingness to 30% with a missing value of -99
messy_linelist <- messy_linelist(
linelist,
prop_missing = 0.3,
missing_value = -99
)
# increasing proportion of spelling mistakes to 50%
messy_linelist <- messy_linelist(linelist, prop_spelling_mistakes = 0.5)
# encode `$sex` as `numeric`
messy_linelist <- messy_linelist(
linelist,
sex_as_numeric = TRUE,
inconsistent_sex = FALSE
)
# inconsistently formatted dates
messy_linelist <- messy_linelist(linelist, inconsistent_dates = TRUE)
Description
Simulate contacts for an infectious disease outbreak
Usage
sim_contacts(
contact_distribution = function(x) stats::dpois(x = x, lambda = 2),
infectious_period = function(x) stats::rlnorm(n = x, meanlog = 2, sdlog = 0.5),
prob_infection = 0.5,
outbreak_start_date = as.Date("2023-01-01"),
anonymise = FALSE,
outbreak_size = c(10, 10000),
population_age = c(1, 90),
contact_tracing_status_probs = c(under_followup = 0.7, lost_to_followup = 0.2, unknown
= 0.1),
config = create_config()
)
Arguments
contact_distribution |
A function or an <epiparameter> object to
generate the number of contacts per infection.
The function can be defined or anonymous. The function must have a single
argument in the form of an integer vector with elements representing the
number of contacts, and return a numeric vector where each element
corresponds to the probability of observing the number of contacts in the
vector passed to the function. The index of the numeric vector returned
is offset by one to the corresponding probability of observing the number
of contacts, i.e. the first element of the output vector is the probability
of observing zero contacts, the second element is the probability of
observing one contact, etc.
An <epiparameter> can be provided. This will be converted into a
probability mass function internally.
The default is an anonymous function with a Poisson probability mass function
(dpois() ) with a mean (\lambda ) of 2 contacts per infection.
|
infectious_period |
A function or an <epiparameter> object for the
infectious period. This defines the duration from becoming infectious to
no longer infectious. In the simulation, individuals are assumed to
become infectious immediately after being infected (the latency period is
assumed to be zero). The time intervals between an infected individual and
their contacts are assumed to be uniformly distributed within the
infectious period. Infectious periods must be strictly positive.
The function can be defined or anonymous. The function must return a vector
of randomly generated real numbers representing sampled infectious periods.
The function must have a single argument, the number of random infectious
periods to generate.
An <epiparameter> can be provided. This will be converted into random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2 and sdlog = 0.5 .
|
prob_infection |
A single numeric for the probability of a secondary
contact being infected by an infected primary contact.
|
outbreak_start_date |
A date for the start of the outbreak.
|
anonymise |
A logical boolean for whether case names should be
anonymised. Default is FALSE .
|
outbreak_size |
A numeric vector of length 2 defining the minimum and
the maximum number of infected individuals for the simulated outbreak.
Default is c(10, 1e4) , so the minimum outbreak size is 10 infected
individuals, and the maximum outbreak size is 10,000 infected individuals.
Either number can be changed to increase or decrease the maximum or minimum
outbreak size to allow simulating larger or smaller outbreaks. If the
minimum outbreak size cannot be reached after running the simulation for
many iterations (internally) then the function errors, whereas if the
maximum outbreak size is exceeded the function returns the data early and a
warning stating how many cases and contacts are returned.
|
population_age |
Either a numeric vector with two elements or a
<data.frame> with age structure in the population. Use a numeric vector
to specific the age range of the population, the first element is the lower
bound for the age range, and and the second is the upper bound for the age
range (both inclusive, i.e. [lower, upper]). The <data.frame> with
age groups and the proportion of the population in that group. See details
and examples for more information.
|
contact_tracing_status_probs |
A named numeric vector with the
probability of each contact tracing status. The names of the vector must
be "under_followup" , "lost_to_followup" , "unknown" . Values of each
contact tracing status must sum to one.
|
config |
A list of settings to adjust the randomly sampled delays and
Ct values. See create_config() for more information.
|
Value
A contacts <data.frame>
Author(s)
Joshua W. Lambert, Carmen Tamayo
Examples
# quickly simulate contact tracing data using the function defaults
contacts <- sim_contacts()
head(contacts)
# to simulate more realistic contact tracing data load epiparameters from
# {epiparameter}
library(epiparameter)
contact_distribution <- epiparameter(
disease = "COVID-19",
epi_name = "contact distribution",
prob_distribution = create_prob_distribution(
prob_distribution = "pois",
prob_distribution_params = c(mean = 2)
)
)
infectious_period <- epiparameter(
disease = "COVID-19",
epi_name = "infectious period",
prob_distribution = create_prob_distribution(
prob_distribution = "gamma",
prob_distribution_params = c(shape = 1, scale = 1)
)
)
contacts <- sim_contacts(
contact_distribution = contact_distribution,
infectious_period = infectious_period,
prob_infection = 0.5
)
Simulate a line list
Description
The line list is simulated using a branching process and
parameterised with epidemiological parameters.
Usage
sim_linelist(
contact_distribution = function(x) stats::dpois(x = x, lambda = 2),
infectious_period = function(x) stats::rlnorm(n = x, meanlog = 2, sdlog = 0.5),
prob_infection = 0.5,
onset_to_hosp = function(x) stats::rlnorm(n = x, meanlog = 1.5, sdlog = 0.5),
onset_to_death = function(x) stats::rlnorm(n = x, meanlog = 2.5, sdlog = 0.5),
onset_to_recovery = NULL,
reporting_delay = NULL,
hosp_risk = 0.2,
hosp_death_risk = 0.5,
non_hosp_death_risk = 0.05,
outbreak_start_date = as.Date("2023-01-01"),
anonymise = FALSE,
outbreak_size = c(10, 10000),
population_age = c(1, 90),
case_type_probs = c(suspected = 0.2, probable = 0.3, confirmed = 0.5),
config = create_config()
)
Arguments
contact_distribution |
A function or an <epiparameter> object to
generate the number of contacts per infection.
The function can be defined or anonymous. The function must have a single
argument in the form of an integer vector with elements representing the
number of contacts, and return a numeric vector where each element
corresponds to the probability of observing the number of contacts in the
vector passed to the function. The index of the numeric vector returned
is offset by one to the corresponding probability of observing the number
of contacts, i.e. the first element of the output vector is the probability
of observing zero contacts, the second element is the probability of
observing one contact, etc.
An <epiparameter> can be provided. This will be converted into a
probability mass function internally.
The default is an anonymous function with a Poisson probability mass function
(dpois() ) with a mean (\lambda ) of 2 contacts per infection.
|
infectious_period |
A function or an <epiparameter> object for the
infectious period. This defines the duration from becoming infectious to
no longer infectious. In the simulation, individuals are assumed to
become infectious immediately after being infected (the latency period is
assumed to be zero). The time intervals between an infected individual and
their contacts are assumed to be uniformly distributed within the
infectious period. Infectious periods must be strictly positive.
The function can be defined or anonymous. The function must return a vector
of randomly generated real numbers representing sampled infectious periods.
The function must have a single argument, the number of random infectious
periods to generate.
An <epiparameter> can be provided. This will be converted into random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2 and sdlog = 0.5 .
|
prob_infection |
A single numeric for the probability of a secondary
contact being infected by an infected primary contact.
|
onset_to_hosp |
A function or an <epiparameter> object for the
onset-to-hospitalisation delay distribution. onset_to_hosp can also be
set to NULL to not simulate hospitalisation (admission) dates.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-hospitalisation delay. The
function must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 1.5 and sdlog = 0.5 .
If onset_to_hosp is set to NULL then hosp_risk and hosp_death_risk
will be automatically set to NULL if not manually specified.
|
onset_to_death |
A function or an <epiparameter> object for the
onset-to-death delay distribution. onset_to_death can also be set to
NULL to not simulate dates for individuals that died.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-death delay. The function must
have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2.5 and sdlog = 0.5 .
If onset_to_death is set to NULL then non_hosp_death_risk and
hosp_death_risk will be automatically set to NULL if not manually
specified.
For hospitalised cases, the function ensures the onset-to-death time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-death time (from onset_to_death ) cannot be sampled that is
greater than a onset-to-hospitalisation time (from onset_to_hosp ) then
the function will error. Due to this conditional sampling, the
onset-to-death times in the line list may not resemble the distributional
form input into the function.
|
onset_to_recovery |
A function or an <epiparameter> object for the
onset-to-recovery delay distribution. onset_to_recovery can also be NULL
to not simulate dates for individuals that recovered.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-recovery delay. The function
must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is NULL so by default cases that recover get an NA in the
$date_outcome line list column.
For hospitalised cases, the function ensures the onset-to-recovery time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-recovery time (from onset_to_recovery ) cannot be sampled
that is greater than a onset-to-hospitalisation time (from onset_to_hosp )
then the function will error. Due to this conditional sampling, the
onset-to-recovery times in the line list may not resemble the distributional
form input into the function.
|
reporting_delay |
A function for the reporting delay distribution or
NULL . The (random) number generating function creates delays between the
time of symptom onset ($date_onset ) and the case being reported
($date_reporting ).
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the reporting delay. The function must have
a single argument.
The default is NULL so by default there is no reporting delay, and the
$date_reporting line list column is identical to the $date_onset column.
|
hosp_risk |
Either a single numeric for the hospitalisation risk of
everyone in the population, or a <data.frame> with age specific
hospitalisation risks. Default is 20% hospitalisation (0.2 ) for the entire
population. If the onset_to_hosp argument is set to NULL this argument
will automatically be set to NULL if not specified or can be manually
set to NULL . See details and examples for more information.
|
hosp_death_risk |
Either a single numeric for the death risk for
hospitalised individuals across the population, or a <data.frame> with age
specific hospitalised death risks Default is 50% death risk in hospitals
(0.5 ) for the entire population. If the onset_to_death argument is set
to NULL this argument will automatically be set to NULL if not specified
or can be manually set to NULL . See details and examples for more
information. The hosp_death_risk can vary through time if specified in
the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
non_hosp_death_risk |
Either a single numeric for the death risk for
outside of hospitals across the population, or a <data.frame> with age
specific death risks outside of hospitals. Default is 5% death risk outside
of hospitals (0.05 ) for the entire population. If the onset_to_death
argument is set to NULL this argument will automatically be set to NULL
if not specified or can be manually set to NULL . See details and examples
for more information. The non_hosp_death_risk can vary through time if
specified in the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
outbreak_start_date |
A date for the start of the outbreak.
|
anonymise |
A logical boolean for whether case names should be
anonymised. Default is FALSE .
|
outbreak_size |
A numeric vector of length 2 defining the minimum and
the maximum number of infected individuals for the simulated outbreak.
Default is c(10, 1e4) , so the minimum outbreak size is 10 infected
individuals, and the maximum outbreak size is 10,000 infected individuals.
Either number can be changed to increase or decrease the maximum or minimum
outbreak size to allow simulating larger or smaller outbreaks. If the
minimum outbreak size cannot be reached after running the simulation for
many iterations (internally) then the function errors, whereas if the
maximum outbreak size is exceeded the function returns the data early and a
warning stating how many cases and contacts are returned.
|
population_age |
Either a numeric vector with two elements or a
<data.frame> with age structure in the population. Use a numeric vector
to specific the age range of the population, the first element is the lower
bound for the age range, and and the second is the upper bound for the age
range (both inclusive, i.e. [lower, upper]). The <data.frame> with
age groups and the proportion of the population in that group. See details
and examples for more information.
|
case_type_probs |
A named numeric vector with the probability of
each case type. The names of the vector must be "suspected" , "probable" ,
"confirmed" . Values of each case type must sum to one.
|
config |
A list of settings to adjust the randomly sampled delays and
Ct values. See create_config() for more information.
|
Details
For age-stratified hospitalised and death risks a <data.frame>
will need to be passed to the hosp_risk
and/or hosp_death_risk
arguments. This <data.frame>
should have two columns:
-
age_limit
: a column with one numeric
per cell for the lower bound
(minimum) age of the age group (inclusive).
-
risk
: a column with one numeric
per cell for the proportion
(or probability) of hospitalisation for that age group. Should be between
0 and 1.
For an age-structured population, a <data.frame>
with two columns:
-
age_limit
: a column with one numeric
per cell for the lower bound
(minimum) age of the age group (inclusive), except the last element which is
the upper bound (maximum) of the population.
-
proportion
: a column with the proportion of the population that are in
that age group. Proportions must sum to one.
Value
A line list <data.frame>
Author(s)
Joshua W. Lambert, Carmen Tamayo
Examples
# quickly simulate a line list using the function defaults
linelist <- sim_linelist()
head(linelist)
# to simulate a more realistic line list load epiparameters from
# {epiparameter}
library(epiparameter)
contact_distribution <- epiparameter(
disease = "COVID-19",
epi_name = "contact distribution",
prob_distribution = create_prob_distribution(
prob_distribution = "pois",
prob_distribution_params = c(mean = 2)
)
)
infectious_period <- epiparameter(
disease = "COVID-19",
epi_name = "infectious period",
prob_distribution = create_prob_distribution(
prob_distribution = "gamma",
prob_distribution_params = c(shape = 1, scale = 1)
)
)
onset_to_hosp <- epiparameter(
disease = "COVID-19",
epi_name = "onset to hospitalisation",
prob_distribution = create_prob_distribution(
prob_distribution = "lnorm",
prob_distribution_params = c(meanlog = 1, sdlog = 0.5)
)
)
# get onset to death from {epiparameter} database
onset_to_death <- epiparameter_db(
disease = "COVID-19",
epi_name = "onset to death",
single_epiparameter = TRUE
)
# example with single hospitalisation risk for entire population
linelist <- sim_linelist(
contact_distribution = contact_distribution,
infectious_period = infectious_period,
prob_infection = 0.5,
onset_to_hosp = onset_to_hosp,
onset_to_death = onset_to_death,
hosp_risk = 0.5
)
head(linelist)
# example with age-stratified hospitalisation risk
# 20% for over 80s
# 10% for under 5s
# 5% for the rest
age_dep_hosp_risk <- data.frame(
age_limit = c(1, 5, 80),
risk = c(0.1, 0.05, 0.2)
)
linelist <- sim_linelist(
contact_distribution = contact_distribution,
infectious_period = infectious_period,
prob_infection = 0.5,
onset_to_hosp = onset_to_hosp,
onset_to_death = onset_to_death,
hosp_risk = age_dep_hosp_risk
)
head(linelist)
Simulate a line list and a contacts table
Description
The line list and contacts are simulated using a branching
process and parameterised with epidemiological parameters.
Usage
sim_outbreak(
contact_distribution = function(x) stats::dpois(x = x, lambda = 2),
infectious_period = function(x) stats::rlnorm(n = x, meanlog = 2, sdlog = 0.5),
prob_infection = 0.5,
onset_to_hosp = function(x) stats::rlnorm(n = x, meanlog = 1.5, sdlog = 0.5),
onset_to_death = function(x) stats::rlnorm(n = x, meanlog = 2.5, sdlog = 0.5),
onset_to_recovery = NULL,
reporting_delay = NULL,
hosp_risk = 0.2,
hosp_death_risk = 0.5,
non_hosp_death_risk = 0.05,
outbreak_start_date = as.Date("2023-01-01"),
anonymise = FALSE,
outbreak_size = c(10, 10000),
population_age = c(1, 90),
case_type_probs = c(suspected = 0.2, probable = 0.3, confirmed = 0.5),
contact_tracing_status_probs = c(under_followup = 0.7, lost_to_followup = 0.2, unknown
= 0.1),
config = create_config()
)
Arguments
contact_distribution |
A function or an <epiparameter> object to
generate the number of contacts per infection.
The function can be defined or anonymous. The function must have a single
argument in the form of an integer vector with elements representing the
number of contacts, and return a numeric vector where each element
corresponds to the probability of observing the number of contacts in the
vector passed to the function. The index of the numeric vector returned
is offset by one to the corresponding probability of observing the number
of contacts, i.e. the first element of the output vector is the probability
of observing zero contacts, the second element is the probability of
observing one contact, etc.
An <epiparameter> can be provided. This will be converted into a
probability mass function internally.
The default is an anonymous function with a Poisson probability mass function
(dpois() ) with a mean (\lambda ) of 2 contacts per infection.
|
infectious_period |
A function or an <epiparameter> object for the
infectious period. This defines the duration from becoming infectious to
no longer infectious. In the simulation, individuals are assumed to
become infectious immediately after being infected (the latency period is
assumed to be zero). The time intervals between an infected individual and
their contacts are assumed to be uniformly distributed within the
infectious period. Infectious periods must be strictly positive.
The function can be defined or anonymous. The function must return a vector
of randomly generated real numbers representing sampled infectious periods.
The function must have a single argument, the number of random infectious
periods to generate.
An <epiparameter> can be provided. This will be converted into random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2 and sdlog = 0.5 .
|
prob_infection |
A single numeric for the probability of a secondary
contact being infected by an infected primary contact.
|
onset_to_hosp |
A function or an <epiparameter> object for the
onset-to-hospitalisation delay distribution. onset_to_hosp can also be
set to NULL to not simulate hospitalisation (admission) dates.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-hospitalisation delay. The
function must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 1.5 and sdlog = 0.5 .
If onset_to_hosp is set to NULL then hosp_risk and hosp_death_risk
will be automatically set to NULL if not manually specified.
|
onset_to_death |
A function or an <epiparameter> object for the
onset-to-death delay distribution. onset_to_death can also be set to
NULL to not simulate dates for individuals that died.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-death delay. The function must
have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is an anonymous function with a lognormal distribution random
number generator (rlnorm() ) with meanlog = 2.5 and sdlog = 0.5 .
If onset_to_death is set to NULL then non_hosp_death_risk and
hosp_death_risk will be automatically set to NULL if not manually
specified.
For hospitalised cases, the function ensures the onset-to-death time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-death time (from onset_to_death ) cannot be sampled that is
greater than a onset-to-hospitalisation time (from onset_to_hosp ) then
the function will error. Due to this conditional sampling, the
onset-to-death times in the line list may not resemble the distributional
form input into the function.
|
onset_to_recovery |
A function or an <epiparameter> object for the
onset-to-recovery delay distribution. onset_to_recovery can also be NULL
to not simulate dates for individuals that recovered.
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the onset-to-recovery delay. The function
must have a single argument.
An <epiparameter> can be provided. This will be converted into a random
number generator internally.
The default is NULL so by default cases that recover get an NA in the
$date_outcome line list column.
For hospitalised cases, the function ensures the onset-to-recovery time is
greater than the onset-to-hospitalisation time. After many (1000) attempts,
if an onset-to-recovery time (from onset_to_recovery ) cannot be sampled
that is greater than a onset-to-hospitalisation time (from onset_to_hosp )
then the function will error. Due to this conditional sampling, the
onset-to-recovery times in the line list may not resemble the distributional
form input into the function.
|
reporting_delay |
A function for the reporting delay distribution or
NULL . The (random) number generating function creates delays between the
time of symptom onset ($date_onset ) and the case being reported
($date_reporting ).
The function can be defined or anonymous. The function must return a vector
of numeric s for the length of the reporting delay. The function must have
a single argument.
The default is NULL so by default there is no reporting delay, and the
$date_reporting line list column is identical to the $date_onset column.
|
hosp_risk |
Either a single numeric for the hospitalisation risk of
everyone in the population, or a <data.frame> with age specific
hospitalisation risks. Default is 20% hospitalisation (0.2 ) for the entire
population. If the onset_to_hosp argument is set to NULL this argument
will automatically be set to NULL if not specified or can be manually
set to NULL . See details and examples for more information.
|
hosp_death_risk |
Either a single numeric for the death risk for
hospitalised individuals across the population, or a <data.frame> with age
specific hospitalised death risks Default is 50% death risk in hospitals
(0.5 ) for the entire population. If the onset_to_death argument is set
to NULL this argument will automatically be set to NULL if not specified
or can be manually set to NULL . See details and examples for more
information. The hosp_death_risk can vary through time if specified in
the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
non_hosp_death_risk |
Either a single numeric for the death risk for
outside of hospitals across the population, or a <data.frame> with age
specific death risks outside of hospitals. Default is 5% death risk outside
of hospitals (0.05 ) for the entire population. If the onset_to_death
argument is set to NULL this argument will automatically be set to NULL
if not specified or can be manually set to NULL . See details and examples
for more information. The non_hosp_death_risk can vary through time if
specified in the time_varying_death_risk element of config , see
vignette("time-varying-cfr", package = "simulist") for more information.
|
outbreak_start_date |
A date for the start of the outbreak.
|
anonymise |
A logical boolean for whether case names should be
anonymised. Default is FALSE .
|
outbreak_size |
A numeric vector of length 2 defining the minimum and
the maximum number of infected individuals for the simulated outbreak.
Default is c(10, 1e4) , so the minimum outbreak size is 10 infected
individuals, and the maximum outbreak size is 10,000 infected individuals.
Either number can be changed to increase or decrease the maximum or minimum
outbreak size to allow simulating larger or smaller outbreaks. If the
minimum outbreak size cannot be reached after running the simulation for
many iterations (internally) then the function errors, whereas if the
maximum outbreak size is exceeded the function returns the data early and a
warning stating how many cases and contacts are returned.
|
population_age |
Either a numeric vector with two elements or a
<data.frame> with age structure in the population. Use a numeric vector
to specific the age range of the population, the first element is the lower
bound for the age range, and and the second is the upper bound for the age
range (both inclusive, i.e. [lower, upper]). The <data.frame> with
age groups and the proportion of the population in that group. See details
and examples for more information.
|
case_type_probs |
A named numeric vector with the probability of
each case type. The names of the vector must be "suspected" , "probable" ,
"confirmed" . Values of each case type must sum to one.
|
contact_tracing_status_probs |
A named numeric vector with the
probability of each contact tracing status. The names of the vector must
be "under_followup" , "lost_to_followup" , "unknown" . Values of each
contact tracing status must sum to one.
|
config |
A list of settings to adjust the randomly sampled delays and
Ct values. See create_config() for more information.
|
Details
For age-stratified hospitalised and death risks a <data.frame>
will need to be passed to the hosp_risk
and/or hosp_death_risk
arguments. This <data.frame>
should have two columns:
-
age_limit
: a column with one numeric
per cell for the lower bound
(minimum) age of the age group (inclusive).
-
risk
: a column with one numeric
per cell for the proportion
(or probability) of hospitalisation for that age group. Should be between
0 and 1.
For an age-structured population, a <data.frame>
with two columns:
-
age_limit
: a column with one numeric
per cell for the lower bound
(minimum) age of the age group (inclusive), except the last element which is
the upper bound (maximum) of the population.
-
proportion
: a column with the proportion of the population that are in
that age group. Proportions must sum to one.
Value
A list with two elements:
A line list <data.frame>
A contacts <data.frame>
Author(s)
Joshua W. Lambert
Examples
# quickly simulate an outbreak using the function defaults
outbreak <- sim_outbreak()
head(outbreak$linelist)
head(outbreak$contacts)
# to simulate a more realistic outbreak load epiparameters from
# {epiparameter}
library(epiparameter)
contact_distribution <- epiparameter(
disease = "COVID-19",
epi_name = "contact distribution",
prob_distribution = create_prob_distribution(
prob_distribution = "pois",
prob_distribution_params = c(mean = 2)
)
)
infectious_period <- epiparameter(
disease = "COVID-19",
epi_name = "infectious period",
prob_distribution = create_prob_distribution(
prob_distribution = "gamma",
prob_distribution_params = c(shape = 1, scale = 1)
)
)
onset_to_hosp <- epiparameter(
disease = "COVID-19",
epi_name = "onset to hospitalisation",
prob_distribution = create_prob_distribution(
prob_distribution = "lnorm",
prob_distribution_params = c(meanlog = 1, sdlog = 0.5)
)
)
# get onset to death from {epiparameter} database
onset_to_death <- epiparameter_db(
disease = "COVID-19",
epi_name = "onset to death",
single_epiparameter = TRUE
)
outbreak <- sim_outbreak(
contact_distribution = contact_distribution,
infectious_period = infectious_period,
prob_infection = 0.5,
onset_to_hosp = onset_to_hosp,
onset_to_death = onset_to_death
)
Adjust or subset a line list to account for right truncation
Description
Adjust or subset the line list <data.frame>
by removing cases that
have not been reported by the truncation time and setting hospitalisation
admission or outcome dates that are after the truncation point to NA
.
This is to replicate real-time outbreak data where recent cases or outcomes
are not yet observed or reported (right truncation). It implies an assumption
that symptom onsets are reported with a delay but hospitalisations are
reported instantly.
Usage
truncate_linelist(
linelist,
truncation_day = 14,
unit = c("days", "weeks", "months", "years"),
direction = c("backwards", "forwards")
)
Arguments
linelist |
Line list <data.frame> output from sim_linelist() .
|
truncation_day |
A single numeric specifying the number of
days (default), weeks, months or years before the end of the outbreak
(default) or since the start of the outbreak (see direction argument)
to truncate the line list at. By default it is 14 days before the end
of the outbreak.
Alternatively, truncation_day can accept a <Date> and this is
used as the truncation_day and the unit and direction is ignored.
|
unit |
A character string, either "days" (default),
"weeks" , "months" , or "years" , specifying the units of the
truncation_day argument.
Years are assumed to be 365.25 days and months are assumed to be 365.25 / 12
days (same as lubridate).
|
direction |
A character string, either "backwards" (default) or
"forwards" . direction = backwards defines the truncation_day as
the time before the end of the outbreak. direction = forwards defines
the truncation_day as the time since the start of the outbreak.
|
Details
The day on which the line list is truncated is the same for
all individuals in the line list, and is specified by the
truncation_day
and unit
arguments.
Value
A line list <data.frame>
.
Examples
set.seed(1)
linelist <- sim_linelist()
linelist_trunc <- truncate_linelist(linelist)
# set truncation point 3 weeks before the end of outbreak
linelist_trunc <- truncate_linelist(
linelist,
truncation_day = 3,
unit = "weeks"
)
# set truncation point to 2 months since the start of outbreak
linelist_trunc <- truncate_linelist(
linelist,
truncation_day = 2,
unit = "months",
direction = "forwards"
)
# set truncation point to 2023-03-01
linelist_trunc <- truncate_linelist(
linelist,
truncation_day = as.Date("2023-03-01")
)