Type: | Package |
Title: | Space-Time Anomaly Detection using Scan Statistics |
Description: | Detection of anomalous space-time clusters using the scan statistics methodology. Focuses on prospective surveillance of data streams, scanning for clusters with ongoing anomalies. Hypothesis testing is made possible by Monte Carlo simulation. Allévius (2018) <doi:10.21105/joss.00515>. |
Version: | 1.1.1 |
Date: | 2023-01-25 |
Encoding: | UTF-8 |
License: | GPL (≥ 3) |
URL: | https://github.com/promerpr/scanstatistics |
BugReports: | https://github.com/promerpr/scanstatistics/issues |
Depends: | R (≥ 3.4) |
Imports: | dplyr, ismev, magrittr, plyr, Rcpp, stats, sets, tibble, tidyr |
Suggests: | purrr, doParallel, foreach, ggplot2, knitr, MASS, pscl, reshape2, rmarkdown, sp, testthat, gamlss.dist |
VignetteBuilder: | knitr |
RoxygenNote: | 7.2.1 |
ByteCompile: | true |
SystemRequirements: | C++11 |
LinkingTo: | Rcpp, RcppArmadillo |
LazyData: | true |
NeedsCompilation: | yes |
Packaged: | 2023-01-25 19:57:56 UTC; promerpr |
Author: | Benjamin Allévius [aut], Paul Romer Present [ctb, cre] |
Maintainer: | Paul Romer Present <paul.romerpresent@fastmail.fm> |
Repository: | CRAN |
Date/Publication: | 2023-01-26 12:40:02 UTC |
Longitude and latitude of New Mexico county seats.
Description
A dataset containing the longitude and latitude of the county seats of New Mexico, except for Cibola county.
Usage
NM_geo
Format
A data frame with 33 rows and 7 variables:
- county
Factor; the counties of New Mexico (no spaces).
- seat
Character; the name of the county seat, i.e. the administrative center or seat of government.
- area(km2)
Numeric; the area in square kilometers of each county.
- seat_long
Numeric; the longitude of the county seat.
- seat_lat
Numeric; the latitude of the county seat.
- center_long
Numeric; the longitude of the geographical center of the county.
- center_lat
Numeric; the latitude of the geographical center of the county.
Source
https://en.wikipedia.org/wiki/List_of_counties_in_New_Mexico
Data to plot the counties of New Mexico.
Description
Map data for New Mexico. Was created using ggplot2::map_data
.
Usage
NM_map
Format
A data frame with 867 rows and 7 variables:
- long
Numeric; longitude of county polygon corner.
- lat
Numeric; latitude of county polygon corner.
- group
Numeric; grouping by county.
- order
Numeric; order of the polygon corners.
- region
Character; region is "new mexico" for all rows.
- subregion
Character; the county name (with spaces).
- county
Factor; the county name (no spaces).
Population and brain cancer cases in New Mexico counties during 1973–1991.
Description
A dataset containing the population count and number of brain cancer cases in the counties of New Mexico during the years 1973–1991. The population numbers are interpolations from the censuses conducted in 1973, 1982, and 1991. Interpolations were done using a quadratic function of time. Thus the year-to-year changes are overly smooth but match the census numbers in the three years mentioned.
Usage
NM_popcas
Format
A data frame with 608 rows and 4 variables:
- year
Integer; the year the cases were recorded.
- county
Character; the name of the county (no spaces).
- population
Integer; the population in that county and year.
- count
Integer; the number of brain cancer cases in that county and year.
Set of increasing sets from left to right of input vector.
Description
Returns a set (list) of the increasing sets (integer vectors) of the input
vector v
, in the sense that the first set contains the first element
of v
, the second set the first and second elements of v
, and so
on.
Usage
closest_subsets(v)
Arguments
v |
An integer vector. Meant to represent the |
Value
A list of the same length as the input. The first element of the list is v[1], the second is sort(v[1:2]), the third sort(v[1:3]), and so on.
Find the connected sets for a location and its k
nearest neighbors.
Description
Returns a set
of set
s, each set of the latter type containing
the location itself and zero or more of its neighbors, if they are connected.
Usage
connected_neighbors(neighbors, adjacency_matrix)
Arguments
neighbors |
A vector of neighbors to a location, the first element of the vector being the specific location, and the other elements its other nearest neighbors. Locations should be encoded as integers. |
adjacency_matrix |
A boolean matrix, with element |
Value
Returns a set
of set
s, each set of the latter type
containing the location itself and zero or more of its neighbors, if they
are connected.
Return those elements in the second set which are connected to those in the first.
Description
Return those elements in the second set Z_1
which are connected to
those in the first set Z_0
, according to the adjacency matrix.
Usage
connected_to(Z_0, Z_1, adjacency_matrix)
Arguments
Z_0 |
A set of locations, given as integers. |
Z_1 |
A set of locations, given as integers. |
adjacency_matrix |
A boolean matrix, with element |
Value
A set, possibly empty, containing those locations in Z_1
that are connected to any of the locations in Z_0
.
Get the k nearest neighbors for each location, given its coordinates.
Description
Get the k nearest neighbors for each location, including the location itself.
This function calls dist
, so the options for the
distance measure used is the same as for that one. Distances are calculated
between rows.
Usage
coords_to_knn(x, k = min(10, nrow(x)), method = "euclidean", p = 2)
Arguments
x |
a numeric matrix, data frame or |
k |
The number of nearest neighbors, counting the location itself. |
method |
the distance measure to be used. This must be one of
|
p |
The power of the Minkowski distance. |
Value
An integer matrix of the k
nearest neighbors for each location.
Each row corresponds to a location, with the first element of each row
being the location itself. Locations are encoded as integers.
Examples
x <- matrix(c(0, 0,
1, 0,
2, 1,
0, 4,
1, 3),
ncol = 2, byrow = TRUE)
plot(x)
coords_to_knn(x)
Convert a long data frame to a wide matrix.
Description
Convert a long data frame to a wide matrix, with time along the row dimension and locations along the column dimension. Values in the matrix could be e.g. the observed counts or the population.
Usage
df_to_matrix(df, time_col = 1, location_col = 2, value_col = 3)
Arguments
df |
A data frame with at least 3 columns. |
time_col |
Integer or string that specifies the time column. |
location_col |
Integer or string that specifies the location column. |
value_col |
Integer or string that specifies the value column. |
Value
A matrix with time on rows and locations on columns.
Given a distance matrix, find the k
nearest neighbors.
Description
Given a distance matrix, calculate the k
nearest neighbors of each
location, including the location itself. The matrix should contain only zeros
on the diagonal, and all other elements should be positive.
Usage
dist_to_knn(x, k = min(10, nrow(x)))
Arguments
x |
A (square) distance matrix. Elements should be non-negative and the diagonal zeros, but this is not checked. |
k |
The number of nearest neighbors, counting the location itself. |
Value
A matrix of integers, row i
containing the k
nearest
neighbors of location i
, including itself.
Examples
x <- matrix(c(0, 0,
1, 0,
2, 1,
0, 4,
1, 3),
ncol = 2, byrow = TRUE)
d <- dist(x, diag = TRUE, upper = TRUE)
dist_to_knn(d, k = 3)
Estimate baselines based on observed counts.
Description
Estimate the baselines (expected values) for the supplied counts.
Usage
estimate_baselines(counts, population = NULL)
Arguments
counts |
A matrix of observed counts. Rows indicate time (ordered from most recent) and columns indicate locations. |
population |
A matrix or vector of populations for each location
(optional). If a matrix, should be of the same dimensions as
|
Value
A matrix of baselines of the same dimensions as counts
.
Estimate variances based on observed counts.
Description
Estimate the variances for the supplied counts. It is assumed that variances are constant over time for each location.
Usage
estimate_variances(
counts,
baselines = NULL,
population = NULL,
constant_dim = 1
)
Arguments
counts |
A matrix of observed counts. Rows indicate time (ordered from most recent) and columns indicate locations. |
baselines |
A matrix of the same dimensions as |
population |
A matrix or vector of populations for each location
(optional). If a matrix, should be of the same dimensions as
|
constant_dim |
An integer. If equal to 1, variances are assumed to be constant over time but different between locations. If equal to 2, variances are assumed to vary over time but at each time point be equal for all locations. |
Value
A matrix of variances of the same dimensions as counts
.
Estimate the parameters of a ZIP distribution.
Description
Heuristically estimate the ZIP distribution Poisson mean parameters and the structural zero probabilities for each location and time point. Assumes the structural zero probability is constant over time for each location.
Usage
estimate_zip_params(counts, population = NULL, min_p = 0.001, min_mu = 0.3)
Arguments
counts |
A matrix or vector of observed counts. Rows indicate time (ordered from most recent) and columns indicate locations. If a vector, the elements are assumed to be the counts for each location. |
population |
A matrix or vector of populations for each location
(optional). If a matrix, should be of the same dimensions as
|
min_p |
The minimum value you think possible for the structural zero probability. |
min_mu |
The mimum value you think possible for the Poisson mean parameter of the ZIP distribution (before adjusting for population size). |
Value
A list with two elements:
- baselines
A matrix of the same dimensions as
counts
. Ifcounts
was a vector, a matrix with 1 row will be returned.- probs
A matrix of the same dimensions as
counts
. Ifcounts
was a vector, a matrix with 1 row will be returned.
Computes the flexibly shaped zones as in Tango (2005).
Description
Given a matrix of k
nearest neighbors and an adjacency matrix
for the locations involved, produces the set of flexibly shaped zones
as a list of integer vectors. The locations in these zones are all connected,
in the sense that any location in the zone can be reached from another by
traveling through adjacent locations within the zone.
Usage
flexible_zones(k_nearest, adjacency_matrix)
Arguments
k_nearest |
An integer matrix of the |
adjacency_matrix |
A boolean matrix, with element |
Value
A list of integer vectors.
References
Tango, T. & Takahashi, K. (2005), A flexibly shaped spatial scan statistic for detecting clusters, International Journal of Health Geographics 4(1).
Examples
A <- matrix(c(0,1,0,0,0,0,
1,0,1,0,0,0,
0,1,0,0,0,0,
0,0,0,0,1,0,
0,0,0,1,0,0,
0,0,0,0,0,0),
nrow = 6, byrow = TRUE) == 1
nn <- matrix(as.integer(c(1,2,3,4,5,6,
2,1,3,4,5,6,
3,2,1,4,5,6,
4,5,1,6,3,2,
5,4,6,1,3,2,
6,5,4,1,3,2)),
nrow = 6, byrow = TRUE)
flexible_zones(nn, A)
Flip a matrix upside down
Description
Flip a matrix upside down
Usage
flipud(x)
Arguments
x |
A matrix |
Value
A matrix, x
with rows reversed.
Get indices of zero elements in a vector.
Description
Get indices of zero elements in a vector.
Usage
get_zero_indices(v)
Arguments
v |
An integer vector. |
Value
A vector with the indices of elements equal to zero in v
.
Indices start at zero.
Extract a zone from the set of all zones.
Description
Extract zone number n
from the set of all zones.
Usage
get_zone(n, zones)
Arguments
n |
An integer; the number of the zone you wish to retrieve. |
zones |
A list of integer vectors, representing the set of all zones. |
Value
An integer vector.
Examples
zones <- list(1L, 2L, 3L, 1:2, c(1L, 3L), c(2L, 3L))
get_zone(4, zones)
Calculate the Gumbel p
-value for a scan statistic.
Description
Given an observed scan statistic \lambda^*
and a vector of replicate
scan statistics \lambda_i
, i=1,\ldots,R
, fit a Gumbel
distribution to the replicates and calculate a p
-value for the observed
statistic based on the fitted distribution.
\frac{1 + \sum_{i=1}^R \mathrm{I}(\lambda_i > \lambda^*)}{1 + R}
The function is vectorized, so multiple p
-values can be calculated if
several scan statistics (e.g. statistics from secondary clusters) are
supplied.
Usage
gumbel_pvalue(observed, replicates, method = "ML", ...)
Arguments
observed |
A scalar containing the observed value of the scan statistic, or a vector of observed values from secondary clusters. |
replicates |
A vector of Monte Carlo replicates of the scan statistic. |
method |
Either "ML", for maximum likelihood, or "MoM", for method of moments. |
... |
Additional arguments passed to |
Value
The p
-value or p
-values corresponding to the observed
scan statistic(s).
Is the relative error between two numbers is less than the given tolerance?
Description
Given two consecutive numbers in a sequence, return TRUE
if the
relative change is positive but less than the given tolerance.
Usage
has_converged(current, previous, tol = 0.01)
Arguments
current |
A scalar; the most recent value of the sequence. |
previous |
A scalar; the second most recent value of the sequence, or a reference value. |
tol |
The tolerance, a positive scalar near zero. |
Return a set of the location and its neighbors if they are connected, else return the empty set.
Description
If the location and its neighbors, not including itself, are connected, then return the set containing the location and its neighbors; otherwise, return the empty set
Usage
if_connected(distinct_neighbors, location, adjacency_matrix)
Arguments
distinct_neighbors |
A |
location |
A location, preferably given as an integer. |
adjacency_matrix |
A boolean matrix, with element |
Value
A set
of the given location and the neighbors if they are
connected, else returns the empty set.
Returns TRUE if the neighboring locations are connected to the given location, FALSE if not.
Description
Returns TRUE if the neighboring locations are connected to the given location, FALSE if not.
Usage
is_connected(neighbor_locations, location, adjacency_matrix)
Arguments
neighbor_locations |
A |
location |
A location, preferably given as an integer. |
adjacency_matrix |
A boolean matrix, with element |
Value
Boolean: is the neighbors connected to the given location?
Find the increasing subsets of k
nearest neighbors for all locations.
Description
Returns the set of increasing nearest neighbor sets for all locations, as
a list of integer vectors. That is, for each location the list returned
contains one vector containing the location itself, another containing the
location and its nearest neighbor, and so on, up to the vector containing the
location and its k-1
nearest neighbors.
Usage
knn_zones(k_nearest)
Arguments
k_nearest |
An integer matrix of with |
Value
A list of integer vectors.
Examples
nn <- matrix(c(1L, 2L, 4L, 3L, 5L,
2L, 1L, 3L, 4L, 5L,
3L, 2L, 4L, 1L, 5L,
4L, 1L, 2L, 3L, 5L,
5L, 3L, 4L, 2L, 1L),
ncol = 5, byrow = TRUE)
knn_zones(nn[, 1:3])
Convert a matrix to a data frame.
Description
Convert a matrix to a data frame with columns time, location, and one more containing the elements of the input matrix.
Usage
matrix_to_df(mat, name, locations = NULL, times = NULL)
Arguments
mat |
A matrix. |
name |
The name of the third column in the output matrix. |
locations |
If not |
times |
If not |
Value
A matrix with columns time, location, name
, where name
is specified in the input.
Calculate the Monte Carlo p
-value for a scan statistic.
Description
Given an observed scan statistic \lambda^*
and a vector of replicate
scan statistics \lambda_i
, i=1,\ldots,R
, calculate the Monte
Carlo p
-value as
\frac{1 + \sum_{i=1}^R \mathrm{I}(\lambda_i > \lambda^*)}{1 + R}
The function is vectorized, so multiple p
-values can be calculated if
several scan statistics (e.g. statistics from secondary clusters) are
supplied.
Usage
mc_pvalue(observed, replicates)
Arguments
observed |
A scalar containing the observed value of the scan statistic, or a vector of observed values from secondary clusters. |
replicates |
A vector of Monte Carlo replicates of the scan statistic. |
Value
The p
-value or p
-values corresponding to the observed
scan statistic(s).
Permute the entries of the matrix, preserving row and column marginals.
Description
Permute the entries of the matrix, preserving row and column marginals.
Usage
permute_matrix(A)
Arguments
A |
An integer matrix. |
Value
An integer matrix.
Creates a set of all non-empty subsets of the integers from 1 to n
.
Description
Creates a list of all 2^(n-1)
non-empty subsets of the integers from 1
to n
.
Usage
powerset_zones(n)
Arguments
n |
An integer larger than 0. |
Value
A list of integer vectors.
Print a scanstatistic object.
Description
Prints a scanstatistic object and returns it invisibly.
Usage
## S3 method for class 'scanstatistic'
print(x, ...)
Arguments
x |
A an object of class |
... |
Further arguments passed to or from other methods. |
Value
x, invisibly
Run a scan statistic analysis.
Description
Run a scan statistic analysis with the given scan statistic and arguments.
Usage
run_scan(scanstat, args, gumbel = FALSE)
Arguments
scanstat |
A scan statistic function. |
args |
A named list of arguments to be passed to |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
Value
A list with components
- observed
The table of observed statistics.
- simulated
The table of simulated statistics.
- MC_pvalue
The Monte Carlo P-value of the scan statistic.
- Gumbel_pvalue
The Gumbel P-value of the scan statistic.
Calculate the negative binomial bayesian scan statistic..
Description
Calculate the "Bayesian Spatial Scan Statistic" by Neill et al. (2006), adapted to a spatio-temporal setting. The scan statistic assumes that, given the relative risk, the data follows a Poisson distribution. The relative risk is in turn assigned a Gamma distribution prior, yielding a negative binomial marginal distribution for the counts under the null hypothesis. Under the alternative hypothesis, the
Usage
scan_bayes_negbin(
counts,
zones,
baselines = NULL,
population = NULL,
outbreak_prob = 0.05,
alpha_null = 1,
beta_null = 1,
alpha_alt = alpha_null,
beta_alt = beta_null,
inc_values = seq(1, 3, by = 0.1),
inc_probs = 1
)
Arguments
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
baselines |
Optional. A matrix of the same dimensions as |
population |
Optional. A matrix or vector of populations for each
location. Not needed if |
outbreak_prob |
A scalar; the probability of an outbreak (at any time, any place). Defaults to 0.05. |
alpha_null |
A scalar; the shape parameter for the gamma distribution under the null hypothesis of no anomaly. Defaults to 1. |
beta_null |
A scalar; the scale parameter for the gamma distribution under the null hypothesis of no anomaly. Defaults to 1. |
alpha_alt |
A scalar; the shape parameter for the gamma distribution
under the alternative hypothesis of an anomaly. Defaults to the same value
as |
beta_alt |
A scalar; the scale parameter for the gamma distribution
under the alternative hypothesis of an anomaly. Defaults to the same value
as |
inc_values |
A vector of possible values for the increase in the mean (and variance) of an anomalous count. Defaults to evenly spaced values between 1 and 3, with a difference of 0.1 between consecutive values. |
inc_probs |
A vector of the prior probabilities of each value in
|
Value
A list which, in addition to the information about the type of scan
statistic, has the following components: priors
(list),
posteriors
(list), MLC
(list) and marginal_data_prob
(scalar). The list MLC
has elements
- zone
The number of the spatial zone of the most likely cluster (MLC).
- duration
The most likely event duration.
- log_posterior
The posterior log probability that an event is ongoing in the MLC.
- log_bayes_factor
The logarithm of the Bayes factor for the MLC.
- posterior
The posterior probability that an event is ongoing in the MLC.
- locations
The locations involved in the MLC.
The list priors
has elements
- null_prior
The prior probability of no anomaly.
- alt_prior
The prior probability of an anomaly.
- inc_prior
A vectorof prior probabilities of each value in the argument
inc_values
.- window_prior
The prior probability of an outbreak in any of the space-time windows.
The list posteriors
has elements
- null_posterior
The posterior probability of no anomaly.
- alt_posterior
The posterior probability of an anomaly.
- inc_posterior
A data frame with columns
inc_values
andinc_posterior
.- window_posteriors
A data frame with columns
zone
,duration
,log_posterior
andlog_bayes_factor
, each row corresponding to a space-time window.- space_time_posteriors
A matrix with the posterior anomaly probability of each location-time combination.
- location_posteriors
A vector with the posterior probability of an anomaly at each location.
References
Neill, D. B., Moore, A. W., Cooper, G. F. (2006). A Bayesian Spatial Scan Statistic. Advances in Neural Information Processing Systems 18.
Examples
set.seed(1)
# Create location coordinates, calculate nearest neighbors, and create zones
n_locs <- 50
max_duration <- 5
n_total <- n_locs * max_duration
geo <- matrix(rnorm(n_locs * 2), n_locs, 2)
knn_mat <- coords_to_knn(geo, 15)
zones <- knn_zones(knn_mat)
# Simulate data
baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs)
counts <- matrix(rpois(n_total, as.vector(baselines)), max_duration, n_locs)
# Inject outbreak/event/anomaly
ob_dur <- 3
ob_cols <- zones[[10]]
ob_rows <- max_duration + 1 - seq_len(ob_dur)
counts[ob_rows, ob_cols] <- matrix(
rpois(ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols]),
length(ob_rows), length(ob_cols))
res <- scan_bayes_negbin(counts = counts,
zones = zones,
baselines = baselines)
Calculate the "Bayesian Spatial Scan Statistic" by Neill et al. (2006).
Description
Calculate the "Bayesian Spatial Scan Statistic" by Neill et al. (2006), adapted to a spatio-temporal setting. The scan statistic assumes that, given the relative risk, the data follows a Poisson distribution. The relative risk is in turn assigned a Gamma distribution prior, yielding a negative binomial marginal distribution for the counts.
Usage
scan_bayes_negbin_cpp(
counts,
baselines,
zones,
zone_lengths,
outbreak_prob,
alpha_null,
beta_null,
alpha_alt,
beta_alt,
inc_values,
inc_probs
)
Arguments
counts |
An integer matrix (most recent timepoint in first row). |
baselines |
A matrix with positive entries (most recent timepoint in first row). |
zones |
An integer vector (all zones concatenated; locations indexed from 0 and up). |
zone_lengths |
An integer vector. |
outbreak_prob |
A scalar; the probability of an outbreak (at any time, any place). |
alpha_null |
A scalar; the shape parameter for the gamma distribution under the null hypothesis of no anomaly. |
beta_null |
A scalar; the scale parameter for the gamma distribution under the null hypothesis of no anomaly. |
alpha_alt |
A scalar; the shape parameter for the gamma distribution under the alternative hypothesis of an anomaly. |
beta_alt |
A scalar; the scale parameter for the gamma distribution under the alternative hypothesis of an anomaly. |
inc_values |
A vector of possible values for the increase in the mean (and variance) of an anomalous count. |
inc_probs |
A vector of the prior probabilities of each value in
|
Value
A list with elements priors
(list), posteriors
(list),
and marginal_data_prob
(scalar). The list priors
has
elements
- null_prior
The prior probability of no anomaly.
- alt_prior
The prior probability of an anomaly.
- inc_prior
A vector (matrix with 1 row) of prior probabilities of each value in the argument
m_values
.- window_prior
The prior probability of an outbreak in any of the space-time windows.
The list posteriors
has elements
- null_posterior
The posterior probability of no anomaly.
- alt_posterior
The posterior probability of an anomaly.
- inc_posterior
A data frame with columns
inc_values
andinc_posterior
.- window_posteriors
A data frame with columns
zone
,duration
,log_posterior
andlog_bayes_factor
, each row corresponding to a space-time window.- space_time_posteriors
A matrix with the posterior anomaly probability of each location-time combination.
- location_posteriors
A vector (matrix with 1 row) with the posterior probability of an anomaly at each location.
Calculate the expectation-based negative binomial scan statistic.
Description
Calculate the expectation-based negative binomial scan statistic devised by Tango et al. (2011).
Usage
scan_eb_negbin(
counts,
zones,
baselines = NULL,
thetas = 1,
type = c("hotspot", "emerging"),
n_mcsim = 0,
gumbel = FALSE,
max_only = FALSE
)
Arguments
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
baselines |
Optional. A matrix of the same dimensions as |
thetas |
Optional. A matrix of the same dimensions as |
type |
A string, either "hotspot" or "emerging". If "hotspot", the relative risk is assumed to be fixed over time. If "emerging", the relative risk is assumed to increase with the duration of the outbreak. |
n_mcsim |
A non-negative integer; the number of replicate scan
statistics to generate in order to calculate a |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
Value
A list which, in addition to the information about the type of scan statistic, has the following components:
- MLC
A list containing the number of the zone of the most likely cluster (MLC), the locations in that zone, the duration of the MLC, and the calculated score. In order, the elements of this list are named
zone_number, locations, duration, score
.- observed
A data frame containing, for each combination of zone and duration investigated, the zone number, duration, and score. The table is sorted by score with the top-scoring location on top. If
max_only = TRUE
, only contains a single row corresponding to the MLC.- replicates
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
- MC_pvalue
The Monte Carlo
P
-value.- Gumbel_pvalue
A
P
-value obtained by fitting a Gumbel distribution to the replicate scan statistics.- n_zones
The number of zones scanned.
- n_locations
The number of locations.
- max_duration
The maximum duration considered.
- n_mcsim
The number of Monte Carlo replicates made.
References
Tango, T., Takahashi, K. & Kohriyama, K. (2011), A space-time scan statistic for detecting emerging outbreaks, Biometrics 67(1), 106–115.
Examples
set.seed(1)
# Create location coordinates, calculate nearest neighbors, and create zones
n_locs <- 50
max_duration <- 5
n_total <- n_locs * max_duration
geo <- matrix(rnorm(n_locs * 2), n_locs, 2)
knn_mat <- coords_to_knn(geo, 15)
zones <- knn_zones(knn_mat)
# Simulate data
baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs)
thetas <- matrix(runif(n_total, 0.05, 3), max_duration, n_locs)
counts <- matrix(rnbinom(n_total, mu = baselines, size = thetas),
max_duration, n_locs)
# Inject outbreak/event/anomaly
ob_dur <- 3
ob_cols <- zones[[10]]
ob_rows <- max_duration + 1 - seq_len(ob_dur)
counts[ob_rows, ob_cols] <- matrix(
rnbinom(ob_dur * length(ob_cols),
mu = 2 * baselines[ob_rows, ob_cols],
size = thetas[ob_rows, ob_cols]),
length(ob_rows), length(ob_cols))
res <- scan_eb_negbin(counts = counts,
zones = zones,
baselines = baselines,
thetas = thetas,
type = "hotspot",
n_mcsim = 99,
max_only = FALSE)
Calculate the expectation-based negative binomial scan statistic.
Description
Calculate the expectation-based negative binomial scan statistic and Monte Carlo replicates.
Usage
scan_eb_negbin_cpp(
counts,
baselines,
overdisp,
zones,
zone_lengths,
store_everything,
num_mcsim,
score_hotspot
)
Arguments
counts |
Integer matrix (most recent timepoint in first row) |
baselines |
Matrix (most recent timepoint in first row) |
overdisp |
Matrix (most recent timepoint in first row) |
zones |
Integer vector (all zones concatenated; locations indexed from 0 and up) |
zone_lengths |
Integer vector |
store_everything |
Boolean |
num_mcsim |
Integer |
score_hotspot |
Boolean |
Value
A list with elements observed
and simulated
, each
being a data frame with columns:
- zone
The top-scoring zone (spatial component of MLC).
- duration
The corresponding duration (time-length of MLC).
- score
The value of the loglihood ratio statistic (the scan statistic).
- relrisk
The estimated relative risk.
- n_iter
The number of iterations performed by the EM algorithm.
Calculate the expectation-based Poisson scan statistic.
Description
Calculate the expectation-based Poisson scan statistic devised by Neill et al. (2005).
Usage
scan_eb_poisson(
counts,
zones,
baselines = NULL,
population = NULL,
n_mcsim = 0,
gumbel = FALSE,
max_only = FALSE
)
Arguments
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
baselines |
Optional. A matrix of the same dimensions as |
population |
Optional. A matrix or vector of populations for each
location. Not needed if |
n_mcsim |
A non-negative integer; the number of replicate scan
statistics to generate in order to calculate a |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
Value
A list which, in addition to the information about the type of scan statistic, has the following components:
- MLC
A list containing the number of the zone of the most likely cluster (MLC), the locations in that zone, the duration of the MLC, the calculated score, and the relative risk. In order, the elements of this list are named
zone_number, locations, duration, score, relative_risk
.- observed
A data frame containing, for each combination of zone and duration investigated, the zone number, duration, score, relative risk. The table is sorted by score with the top-scoring location on top. If
max_only = TRUE
, only contains a single row corresponding to the MLC.- replicates
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
- MC_pvalue
The Monte Carlo
P
-value.- Gumbel_pvalue
A
P
-value obtained by fitting a Gumbel distribution to the replicate scan statistics.- n_zones
The number of zones scanned.
- n_locations
The number of locations.
- max_duration
The maximum duration considered.
- n_mcsim
The number of Monte Carlo replicates made.
References
Neill, D. B., Moore, A. W., Sabhnani, M. and Daniel, K. (2005). Detection of emerging space-time clusters. Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining - KDD ’05, 218.
Examples
set.seed(1)
# Create location coordinates, calculate nearest neighbors, and create zones
n_locs <- 50
max_duration <- 5
n_total <- n_locs * max_duration
geo <- matrix(rnorm(n_locs * 2), n_locs, 2)
knn_mat <- coords_to_knn(geo, 15)
zones <- knn_zones(knn_mat)
# Simulate data
baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs)
counts <- matrix(rpois(n_total, as.vector(baselines)), max_duration, n_locs)
# Inject outbreak/event/anomaly
ob_dur <- 3
ob_cols <- zones[[10]]
ob_rows <- max_duration + 1 - seq_len(ob_dur)
counts[ob_rows, ob_cols] <- matrix(
rpois(ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols]),
length(ob_rows), length(ob_cols))
res <- scan_eb_poisson(counts = counts,
zones = zones,
baselines = baselines,
n_mcsim = 99,
max_only = FALSE)
Calculate the expecation-based Poisson scan statistic.
Description
Calculate the expectation-based Poisson scan statistic and Monte Carlo replicates.
Usage
scan_eb_poisson_cpp(
counts,
baselines,
zones,
zone_lengths,
store_everything,
num_mcsim
)
Arguments
counts |
An integer matrix (most recent timepoint in first row). |
baselines |
A matrix with positive entries (most recent timepoint in first row). |
zones |
An integer vector (all zones concatenated; locations indexed from 0 and up). |
zone_lengths |
An integer vector. |
store_everything |
A boolean. |
num_mcsim |
An integer. |
Value
A list with elements observed
and simulated
, each
being a data frame with columns:
- zone
The top-scoring zone (spatial component of MLC).
- duration
The corresponding duration (time-length of MLC).
- score
The value of the loglihood ratio statistic (the scan statistic).
- relrisk_in
The estimated relative risk inside.
- relrisk_in
The estimated relative risk outside.
Calculate the expectation-based ZIP scan statistic.
Description
Calculates the expectation-based scan statistic. See details below.
Usage
scan_eb_zip(
counts,
zones,
baselines = NULL,
probs = NULL,
population = NULL,
n_mcsim = 0,
gumbel = FALSE,
max_only = FALSE,
rel_tol = 0.001
)
Arguments
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
baselines |
Optional. A matrix of the same dimensions as |
probs |
Optional. A matrix of the same dimensions as |
population |
Optional. A matrix or vector of populations for each
location. Only needed if |
n_mcsim |
A non-negative integer; the number of replicate scan
statistics to generate in order to calculate a |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
rel_tol |
A positive scalar. If the relative change in the incomplete information likelihood is less than this value, then the EM algorithm is deemed to have converged. |
Details
For the expectation-based zero-inflated Poisson scan statistic
(Allévius & Höhle 2017), the null hypothesis of no anomaly holds that
the count observed at each location i
and duration t
(the
number of time periods before present) has a zero-inflated Poisson
distribution with expected value parameter \mu_{it}
and structural
zero probability p_{it}
:
H_0 : Y_{it} \sim \textrm{ZIP}(\mu_{it}, p_{it}).
This holds for all locations i = 1, \ldots, m
and all durations
t = 1, \ldots,T
, with T
being the maximum duration considered.
Under the alternative hypothesis, there is a space-time window W
consisting of a spatial zone Z \subset \{1, \ldots, m\}
and a time
window D \subseteq \{1, \ldots, T\}
such that the counts in that
window have their Poisson expected value parameters inflated by a factor
q_W > 1
compared to the null hypothesis:
H_1 : Y_{it} \sim \textrm{ZIP}(q_W \mu_{it}, p_{it}), ~~(i,t) \in W.
For locations and durations outside of this window, counts are assumed to
be distributed as under the null hypothesis. The sets Z
considered
are those specified in the argument zones
, while the maximum
duration T
is taken as the maximum value in the column
duration
of the input table
.
For each space-time window W
considered, (the log of) a likelihood
ratio is computed using the distributions under the alternative and null
hypotheses, and the expectation-based Poisson scan statistic is calculated
as the maximum of these quantities over all space-time windows. The
expectation-maximization (EM) algorithm is used to obtain maximum
likelihood estimates.
Value
A list which, in addition to the information about the type of scan statistic, has the following components:
- MLC
A list containing the number of the zone of the most likely cluster (MLC), the locations in that zone, the duration of the MLC, the calculated score, the relative risk, and the number of iterations until convergence for the EM algorithm. In order, the elements of this list are named
zone_number, locations, duration, score, relative_risk, n_iter
.- observed
A data frame containing, for each combination of zone and duration investigated, the zone number, duration, score, relative risk, number of EM iterations. The table is sorted by score with the top-scoring location on top. If
max_only = TRUE
, only contains a single row corresponding to the MLC.- replicates
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
- MC_pvalue
The Monte Carlo
P
-value.- Gumbel_pvalue
A
P
-value obtained by fitting a Gumbel distribution to the replicate scan statistics.- n_zones
The number of zones scanned.
- n_locations
The number of locations.
- max_duration
The maximum duration considered.
- n_mcsim
The number of Monte Carlo replicates made.
References
Allévius, B. and Höhle, M, An expectation-based space-time scan statistic for ZIP-distributed data, (Technical report), Link to PDF.
Examples
if (require("gamlss.dist")) {
set.seed(1)
# Create location coordinates, calculate nearest neighbors, and create zones
n_locs <- 50
max_duration <- 5
n_total <- n_locs * max_duration
geo <- matrix(rnorm(n_locs * 2), n_locs, 2)
knn_mat <- coords_to_knn(geo, 15)
zones <- knn_zones(knn_mat)
# Simulate data
baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs)
probs <- matrix(runif(n_total) / 4, max_duration, n_locs)
counts <- matrix(gamlss.dist::rZIP(n_total, baselines, probs),
max_duration, n_locs)
# Inject outbreak/event/anomaly
ob_dur <- 3
ob_cols <- zones[[10]]
ob_rows <- max_duration + 1 - seq_len(ob_dur)
counts[ob_rows, ob_cols] <- gamlss.dist::rZIP(
ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols],
probs[ob_rows, ob_cols])
res <- scan_eb_zip(counts = counts,
zones = zones,
baselines = baselines,
probs = probs,
n_mcsim = 9,
max_only = FALSE,
rel_tol = 1e-3)
}
Calculate the highest-value EB ZIP loglihood ratio statistic.
Description
Calculate the expectation-based ZIP loglihood ratio statistic for each zone and duration, but only keep the zone and duration with the highest value (the MLC). The estimate of the relative risk is also calculated, along with the number of iterations the EM algorithm performed.
Usage
scan_eb_zip_cpp(
counts,
baselines,
probs,
zones,
zone_lengths,
rel_tol,
store_everything,
num_mcsim
)
Arguments
counts |
matrix (most recent timepoint in first row) |
baselines |
matrix (most recent timepoint in first row) |
probs |
matrix (most recent timepoint in first row) |
zones |
integer vector (all zones concatenated; locations indexed from 0 and up) |
zone_lengths |
integer vector |
rel_tol |
double |
store_everything |
boolean |
num_mcsim |
int |
Value
A list with elements observed
and simulated
, each
being a data frame with columns:
- zone
The top-scoring zone (spatial component of MLC).
- duration
The corresponding duration (time-length of MLC).
- score
The value of the loglihood ratio statistic (the scan statistic).
- relrisk
The estimated relative risk.
- n_iter
The number of iterations performed by the EM algorithm.
Calculate the space-time permutation scan statistic.
Description
Calculate the space-time permutation scan statistic (Kulldorff 2005) and Monte Carloo replicates.
Usage
scan_pb_perm_cpp(
counts,
baselines,
zones,
zone_lengths,
store_everything,
num_mcsim
)
Arguments
counts |
An integer matrix (most recent timepoint in first row). |
baselines |
A matrix with positive entries (most recent timepoint in first row). |
zones |
An integer vector (all zones concatenated; locations indexed from 0 and up) |
zone_lengths |
An integer vector. |
store_everything |
A boolean. |
num_mcsim |
An integer. |
Value
A list with elements observed
and simulated
, each
being a data frame with columns:
- zone
The top-scoring zone (spatial component of MLC).
- duration
The corresponding duration (time-length of MLC).
- score
The value of the loglihood ratio statistic (the scan statistic).
- relrisk_in
The estimated relative risk inside.
- relrisk_in
The estimated relative risk outside.
Calculate the population-based Poisson scan statistic.
Description
Calculate the population-based Poisson scan statistic devised by Kulldorff (1997, 2001).
Usage
scan_pb_poisson(
counts,
zones,
population = NULL,
n_mcsim = 0,
gumbel = FALSE,
max_only = FALSE
)
Arguments
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
population |
Optional. A matrix or vector of populations for each
location and time point. Only needed if |
n_mcsim |
A non-negative integer; the number of replicate scan statistics to generate in order to calculate a P-value. |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
Value
A list which, in addition to the information about the type of scan statistic, has the following components:
- MLC
A list containing the number of the zone of the most likely cluster (MLC), the locations in that zone, the duration of the MLC, the calculated score, and the relative risk inside and outside the cluster. In order, the elements of this list are named
zone_number, locations, duration, score, relrisk_in, relrisk_out
.- observed
A data frame containing, for each combination of zone and duration investigated, the zone number, duration, score, relative risks. The table is sorted by score with the top-scoring location on top. If
max_only = TRUE
, only contains a single row corresponding to the MLC.- replicates
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
- MC_pvalue
The Monte Carlo
P
-value.- Gumbel_pvalue
A
P
-value obtained by fitting a Gumbel distribution to the replicate scan statistics.- n_zones
The number of zones scanned.
- n_locations
The number of locations.
- max_duration
The maximum duration considered.
- n_mcsim
The number of Monte Carlo replicates made.
References
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics - Theory and Methods, 26, 1481–1496.
Kulldorff, M. (2001). Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society, Series A (Statistics in Society), 164, 61–72.
Examples
set.seed(1)
# Create location coordinates, calculate nearest neighbors, and create zones
n_locs <- 50
max_duration <- 5
n_total <- n_locs * max_duration
geo <- matrix(rnorm(n_locs * 2), n_locs, 2)
knn_mat <- coords_to_knn(geo, 15)
zones <- knn_zones(knn_mat)
# Simulate data
population <- matrix(rnorm(n_total, 100, 10), max_duration, n_locs)
counts <- matrix(rpois(n_total, as.vector(population) / 20),
max_duration, n_locs)
# Inject outbreak/event/anomaly
ob_dur <- 3
ob_cols <- zones[[10]]
ob_rows <- max_duration + 1 - seq_len(ob_dur)
counts[ob_rows, ob_cols] <- matrix(
rpois(ob_dur * length(ob_cols), 2 * population[ob_rows, ob_cols] / 20),
length(ob_rows), length(ob_cols))
res <- scan_pb_poisson(counts = counts,
zones = zones,
population = population,
n_mcsim = 99,
max_only = FALSE)
Calculate the population-based Poisson scan statistic.
Description
Calculate the population-based Poisson scan statistic and Monte Carlo replicates.
Usage
scan_pb_poisson_cpp(
counts,
baselines,
zones,
zone_lengths,
store_everything,
num_mcsim
)
Arguments
counts |
integer matrix (most recent timepoint in first row) |
baselines |
matrix (most recent timepoint in first row) |
zones |
integer vector (all zones concatenated; locations indexed from 0 and up) |
zone_lengths |
integer vector |
store_everything |
boolean |
num_mcsim |
int |
Value
A list with elements observed
and simulated
, each
being a data frame with columns:
- zone
The top-scoring zone (spatial component of MLC).
- duration
The corresponding duration (time-length of MLC).
- score
The value of the loglihood ratio statistic (the scan statistic).
- relrisk_in
The estimated relative risk inside.
- relrisk_in
The estimated relative risk outside.
Calculate the space-time permutation scan statistic.
Description
Calculate the space-time permutation scan statistic devised by Kulldorff (2005).
Usage
scan_permutation(
counts,
zones,
population = NULL,
n_mcsim = 0,
gumbel = FALSE,
max_only = FALSE
)
Arguments
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
population |
Optional. A matrix or vector of populations for each
location and time point. Only needed if |
n_mcsim |
A non-negative integer; the number of replicate scan statistics to generate in order to calculate a P-value. |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
Value
A list which, in addition to the information about the type of scan statistic, has the following components:
- MLC
A list containing the number of the zone of the most likely cluster (MLC), the locations in that zone, the duration of the MLC, the calculated score, and the relative risk inside and outside the cluster. In order, the elements of this list are named
zone_number, locations, duration, score, relrisk_in, relrisk_out
.- observed
A data frame containing, for each combination of zone and duration investigated, the zone number, duration, score, relative risks. The table is sorted by score with the top-scoring location on top. If
max_only = TRUE
, only contains a single row corresponding to the MLC.- replicates
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
- MC_pvalue
The Monte Carlo
P
-value.- Gumbel_pvalue
A
P
-value obtained by fitting a Gumbel distribution to the replicate scan statistics.- n_zones
The number of zones scanned.
- n_locations
The number of locations.
- max_duration
The maximum duration considered.
- n_mcsim
The number of Monte Carlo replicates made.
References
Kulldorff, M., Heffernan, R., Hartman, J., Assunção, R. M., Mostashari, F. (2005). A space-time permutation scan statistic for disease outbreak detection. PLoS Medicine, 2(3), 0216-0224.
Examples
set.seed(1)
# Create location coordinates, calculate nearest neighbors, and create zones
n_locs <- 50
max_duration <- 5
n_total <- n_locs * max_duration
geo <- matrix(rnorm(n_locs * 2), n_locs, 2)
knn_mat <- coords_to_knn(geo, 15)
zones <- knn_zones(knn_mat)
# Simulate data
population <- matrix(rnorm(n_total, 100, 10), max_duration, n_locs)
counts <- matrix(rpois(n_total, as.vector(population) / 20),
max_duration, n_locs)
# Inject outbreak/event/anomaly
ob_dur <- 3
ob_cols <- zones[[10]]
ob_rows <- max_duration + 1 - seq_len(ob_dur)
counts[ob_rows, ob_cols] <- matrix(
rpois(ob_dur * length(ob_cols), 2 * population[ob_rows, ob_cols] / 20),
length(ob_rows), length(ob_cols))
res <- scan_permutation(counts = counts,
zones = zones,
population = population,
n_mcsim = 99,
max_only = FALSE)
scanstatistics: Space-time anomaly detection using scan statistics.
Description
The scanstatistics package provides two categories of important functions: data preparation functions, and the scan statistics themselves.
Data preparation functions
These functions prepare your data for use. In particular, it helps you define the zones which will be considered by the scan statistics.
Scan statistics
These are the functions used for space-time anomaly detection. Scan statistic
functions for univariate space-time data have a name that begins with
scan_
and functions for multivariate space-time data have a name that
begins with mscan_
.
Score each location over zones and duration.
Description
For each location, compute the average of the statistic calculated for each space-time window that the location is included in, i.e. average the statistic over both zones and the maximum duration.
Usage
score_locations(x, zones)
Arguments
x |
An object of class |
zones |
A list of integer vectors. |
Value
A data.table
with the following columns:
- location
The locations (as integers).
- total_score
For each location, the sum of all window statistics that the location appears in.
- n_zones
The number of spatial zones that the location appears in.
- score
The total score divided by the number of zones and the maximum duration.
- relative_score
The score divided by the maximum score.
Examples
# Simple example
set.seed(1)
table <- data.frame(zone = 1:5, duration = 1, score = 5:1)
zones <- list(1:2, 1:3, 2:5, 4:5, c(1, 5))
x <- list(observed = table, n_locations = 5, max_duration = 1, n_zones = 5)
score_locations(x, zones)
Get the top (non-overlappig) clusters.
Description
Get the top k
space-time clusters according to the statistic calculated
for each cluster (the maximum being the scan statistic). The default is to
return the spatially non-overlapping clusters, i.e. those that do not have
any locations in common.
Usage
top_clusters(
x,
zones,
k = 5,
overlapping = FALSE,
gumbel = FALSE,
alpha = NULL,
...
)
Arguments
x |
An object of class scanstatistics. |
zones |
A list of integer vectors. |
k |
An integer, the number of clusters to return. |
overlapping |
Logical; should the top clusters be allowed to overlap in
the spatial dimension? The default is |
gumbel |
Logical; should a Gumbel P-value be calculated? The default is
|
alpha |
A significance level, which if not |
... |
Parameters passed to |
Value
A data frame with at most k
rows, with columns
zone, duration, score
and possibly MC_pvalue, Gumbel_pvalue
and critical_value
.
Examples
set.seed(1)
counts <- matrix(rpois(15, 3), 3, 5)
zones <- list(1:2, 1:3, 2:5, c(1, 3), 4:5, c(1, 5))
scanres <- scan_permutation(counts, zones, n_mcsim = 5)
top_clusters(scanres, zones, k = 4, overlapping = FALSE)