Version: | 1.0 |
Date: | 2025-04-06 |
Title: | Survival Prediction Ensemble Classification Tool |
Author: | Stephen Abrams [aut, cre] |
Maintainer: | Stephen Abrams <stephen.abrams@gmail.com> |
Depends: | R (≥ 4.0), futile.logger, dplyr |
Imports: | doParallel, ggplot2, survminer, riskRegression, caret, caretEnsemble, survival, rlang |
Description: | A tool for survival analysis using a discrete time approach with ensemble binary classification. 'spect' provides a simple interface consistent with commonly used R data analysis packages, such as 'caret'; a variety of parameter options to help facilitate search automation; a high degree of transparency to the end-user (all intermediate data sets and parameters are made available for further analysis); and useful, out-of-the-box visualizations of model performance. Methods for transforming survival data into discrete-time are adapted from the 'autosurv' package by Suresh et al., (2022) <doi:10.1186/s12874-022-01679-6>. |
License: | GPL-3 |
URL: | https://github.com/dawdawdo/spect |
BugReports: | https://github.com/dawdawdo/spect/issues |
RoxygenNote: | 7.3.2 |
Suggests: | knitr, rmarkdown, randomForest, kernlab, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
NeedsCompilation: | no |
Packaged: | 2025-04-07 00:26:07 UTC; steph |
Repository: | CRAN |
Date/Publication: | 2025-04-08 09:00:02 UTC |
Generates person-period data for any data set, given the bounds defined by the training set.
Description
Generates person-period data for any data set, given the bounds defined by the training set.
Usage
create_person_period_data(individual_data, bounds)
Arguments
individual_data |
A survival data set. |
bounds |
Output from the 'generate_bounds' function of this package. |
Value
A data set consisting of the original 'individual_data' repeated once for each interval defined by the 'bounds' parameter. Each row will be labeled with an id and an interval. The output of this function can be passed to either 'create_training_data' or 'spect_predict' to generate modeling data or predictions, respectively.
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu
See Also
[generate_bounds()], [spect_predict()], [create_training_data()]
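The expansion described under Value can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper `expand_person_period`, the example data, and the `interval`/`lower`/`upper` column layout of `bounds` are all assumptions made for illustration.

```r
# Hypothetical inputs; the column names are assumptions, not the package's format.
individuals <- data.frame(income = c(40000, 65000),
                          surv_time = c(3, 7),
                          event = c(1, 0))
bounds <- data.frame(interval = 1:3,
                     lower = c(0, 2, 5),
                     upper = c(2, 5, 8))

# Hypothetical helper: repeat every individual once per interval.
expand_person_period <- function(individual_data, bounds) {
  individual_data$id <- seq_len(nrow(individual_data))
  # by = NULL makes merge() return the Cartesian product of the two data frames.
  out <- merge(individual_data, bounds, by = NULL)
  out <- out[order(out$id, out$interval), ]
  rownames(out) <- NULL
  out
}

pp <- expand_person_period(individuals, bounds)
print(pp)
```

Each individual appears once per interval, labeled with an id and an interval, which is the shape the Value section describes.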
Generates a survival data set for synthetic streaming service subscription data. The survival event in this case is a cancellation of the subscription. It is given as a function of household income and average number of hours watched in the prior month. Users can adjust the level of censoring and variance in the data with the supplied parameters or simply call with no parameters for a default distribution of data.
Description
Generates a survival data set for synthetic streaming service subscription data. The survival event in this case is a cancellation of the subscription. It is given as a function of household income and average number of hours watched in the prior month. Users can adjust the level of censoring and variance in the data with the supplied parameters or simply call with no parameters for a default distribution of data.
Usage
create_synthetic_data(
sample_size = 250,
minimum_income = 5000,
median_income = 50000,
income_variance = 10000,
min_watchhours = 0,
max_watchhours = 6,
censor_percentage = 0,
min_censor_amount = 0,
max_censor_amount = 0,
study_time_in_months = 48,
perturbation_shift = 0
)
Arguments
sample_size |
optional - size of the sample population to generate |
minimum_income |
optional - minimum household income used to generate the distribution |
median_income |
optional - median household income used to generate the distribution |
income_variance |
optional - variance to use when generating the household income distribution |
min_watchhours |
optional - minimum average number of hours watched used to generate the distribution |
max_watchhours |
optional - maximum average number of hours watched used to generate the distribution |
censor_percentage |
optional - percentage of population to artificially censor |
min_censor_amount |
optional - minimum number of months of censoring to apply to the censored population |
max_censor_amount |
optional - maximum number of months of censoring to apply to the censored population |
study_time_in_months |
optional - observation horizon in months |
perturbation_shift |
optional - defines a boundary for the amount to randomly perturb the formulaic result. Zero for no perturbation |
Value
A survival data set suitable for modeling using spect_train.
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu
Examples
data <- create_synthetic_data()
Generates modeling data from a person-period data set.
Description
Generates modeling data from a person-period data set.
Usage
create_training_data(person_period_data, time_col, event_col, cens)
Arguments
person_period_data |
A discrete-time data set. Generally, this will be output from the 'create_person_period_data' function. |
time_col |
A string specifying the name of the column which contains the survival time. |
event_col |
A string specifying the name of the column which contains the event indicator. |
cens |
Specifies how to apply censored data. Valid values are "same" - considers censorship to occur in the same interval as the survival time, "prev" - considers censorship to occur in the prior interval, and "half" - considers censorship to occur in the same interval as survival time if the individual survived for at least half of that interval. |
Value
A discrete-time data set suitable for training using any binary classifier.
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu
See Also
[create_person_period_data()]
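The three 'cens' options can be read as a rule for choosing the interval in which a censored individual is treated as censored. The base R sketch below shows one interpretation of that rule; the helper `censoring_interval` and the layout of `bounds` are hypothetical, and the package may resolve boundary cases differently.

```r
# Hypothetical interval bounds; intervals are treated as (lower, upper].
bounds <- data.frame(interval = 1:3,
                     lower = c(0, 2, 5),
                     upper = c(2, 5, 8))

# Hypothetical helper: index of the interval in which censoring is considered to occur.
censoring_interval <- function(time, bounds, cens = c("same", "prev", "half")) {
  cens <- match.arg(cens)
  k <- which(time > bounds$lower & time <= bounds$upper)
  stopifnot(length(k) == 1)
  # Did the individual survive at least half of interval k?
  survived_half <- (time - bounds$lower[k]) >= (bounds$upper[k] - bounds$lower[k]) / 2
  switch(cens,
         same = k,
         prev = k - 1L,
         half = if (survived_half) k else k - 1L)
}

# An individual censored at time 3 falls in interval 2, which spans (2, 5].
censoring_interval(3, bounds, "same")
censoring_interval(3, bounds, "prev")
censoring_interval(3, bounds, "half")
```

With "half", a censoring time of 3 covers less than half of interval 2, so this sketch assigns the individual to interval 1; a censoring time of 4 would stay in interval 2.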
Generates evaluation metrics, including time-dependent TPR and FPR as well as AUC.
Description
Generates evaluation metrics, including time-dependent TPR and FPR as well as AUC.
Usage
evaluate_model(train_result, prediction_times, plot_roc = TRUE)
Arguments
train_result |
return data object from spect_train |
prediction_times |
a vector of times to use for generating TPR and FPR data |
plot_roc |
optional indicator to display the time-dependent ROC curves. The TPR and FPR data will be returned regardless of the value of this parameter. |
Value
Evaluation metrics. Also plots the time-dependent ROC curves when 'plot_roc' is TRUE.
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu
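For intuition about time-dependent TPR and FPR, here is a simplified base R sketch using a cumulative/dynamic definition: at time t, cases are individuals with an observed event by t and controls are individuals still event-free after t. The helper `time_dependent_rates` and the example vectors are hypothetical. The sketch ignores individuals censored before t, so it is not a substitute for the package's own calculation.

```r
# Hypothetical helper: TPR and FPR at a single time point and risk threshold.
time_dependent_rates <- function(risk, time, event, t, threshold) {
  case <- event == 1 & time <= t   # event observed by time t
  control <- time > t              # still event-free after time t
  predicted_positive <- risk >= threshold
  c(tpr = sum(predicted_positive & case) / sum(case),
    fpr = sum(predicted_positive & control) / sum(control))
}

# Made-up predicted risks and outcomes for five individuals.
risk  <- c(0.9, 0.8, 0.3, 0.2, 0.6)
time  <- c(2, 4, 9, 10, 8)
event <- c(1, 1, 0, 1, 1)

rates <- time_dependent_rates(risk, time, event, t = 5, threshold = 0.5)
print(rates)
```

Sweeping `threshold` over a grid at a fixed `t` traces out one ROC curve, and repeating that for each entry of 'prediction_times' gives one curve per time.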
Generates the intervals based on the survival times in the supplied data set using the quantile function.
Description
Generates the intervals based on the survival times in the supplied data set using the quantile function.
Usage
generate_bounds(
train_data,
time_col,
event_col,
suggested_intervals,
obs_window
)
Arguments
train_data |
A survival data set containing at least three columns - one which matches the string in the 'time_col' parameter, one which matches the string in the 'event_col' parameter, and at least one covariate column for modeling. |
time_col |
The name of the column in 'train_data' containing survival time |
event_col |
The name of the column in 'train_data' containing the event indicator. Values in this column must be either zero (0) or one (1) |
suggested_intervals |
The number of intervals to create. If the number of events in the data is less than 'suggested_intervals', it is ignored. |
obs_window |
An artificial censoring time. Any observations in 'train_data' beyond this time will be administratively censored. |
Value
A list of upper and lower bounds for each generated interval.
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu
See Also
[create_person_period_data()]
Examples
df <- data.frame(a=c(1,2,3,4,5,6), surv_time=c(1,4,5,6,8,9), event=c(1,1,1,1,0,1))
bounds <- generate_bounds(df, time_col="surv_time", event_col="event",
suggested_intervals=3, obs_window=8)
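The idea of quantile-based intervals can be sketched in base R without the package. The choices below are assumptions for illustration only: using only event times at or before 'obs_window', starting the first interval at zero, and ending the last interval at 'obs_window'. The actual output of 'generate_bounds' may differ.

```r
# Same toy data as the example above.
surv_time <- c(1, 4, 5, 6, 8, 9)
event     <- c(1, 1, 1, 1, 0, 1)
obs_window <- 8
suggested_intervals <- 3

# Assumption: cut points come from quantiles of the observed event times.
event_times <- surv_time[event == 1 & surv_time <= obs_window]
probs <- seq(0, 1, length.out = suggested_intervals + 1)
cuts <- unique(unname(quantile(event_times, probs = probs)))

# Assumption: intervals span from time zero to the observation window.
cuts[1] <- 0
cuts[length(cuts)] <- obs_window

sketch_bounds <- data.frame(interval = seq_len(length(cuts) - 1),
                            lower = head(cuts, -1),
                            upper = tail(cuts, -1))
print(sketch_bounds)
```

Quantile cut points place roughly equal numbers of events in each interval, which helps keep the per-interval classification problem from becoming too sparse.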
Plots a series of population Kaplan-Meier curves for different thresholds for both the test predictions and the ground truth
Description
Plots a series of population Kaplan-Meier curves for different thresholds for both the test predictions and the ground truth
Usage
plot_km(train_result, prediction_threshold_search_granularity = 0.05)
Arguments
train_result |
return data object from 'spect_train' |
prediction_threshold_search_granularity |
optional number between zero and one which defines the granularity of searching for cumulative probability thresholds. For instance, a value of 0.05 will search 19 thresholds (0.05, 0.10, ..., 0.95) |
Value
Data used to produce the KM curve and the passed granularity parameter. Also plots the KM curves.
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu
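The threshold grid implied by the granularity parameter can be reproduced in a few lines of base R. The variable names are illustrative, and the sketch only covers how the grid is counted, not how 'plot_km' uses it.

```r
granularity <- 0.05

# Multiples of the granularity strictly between zero and one.
n_thresholds <- round(1 / granularity) - 1
thresholds <- seq_len(n_thresholds) * granularity

length(thresholds)
range(thresholds)
```

Building the grid from integer multiples avoids the accumulated floating point error that can appear when repeatedly adding 0.05.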
Plots a sample of individual survival curves from the test data set.
Description
Plots a sample of individual survival curves from the test data set.
Usage
plot_survival_curve(train_result, individual_id, curve_type = "both")
Arguments
train_result |
return data object from 'spect_train' |
individual_id |
identifier of the individual to plot |
curve_type |
optional specification of the type of curve. Available options are "conditional", which plots the conditional probability of surviving each interval given that the individual survived to the start of that interval, "absolute" which plots the unconditional probability of surviving each interval, and "both", the default value, which plots both curves on the same chart. |
Value
None - plots the number of requested samples
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu
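The relationship between the "conditional" and "absolute" curves can be shown with a short base R sketch. It assumes that the absolute curve is the running product of the conditional per-interval survival probabilities, which is the standard discrete-time identity. The numbers are made up and the variable names are illustrative.

```r
# Made-up conditional probabilities of surviving each of four intervals,
# given survival to the start of that interval.
conditional_survival <- c(0.95, 0.90, 0.80, 0.85)

# Probability of surviving through the end of each interval from time zero.
absolute_survival <- cumprod(conditional_survival)

data.frame(interval = seq_along(conditional_survival),
           conditional = conditional_survival,
           absolute = absolute_survival)
```

The absolute curve can never increase, while the conditional curve can move in either direction from one interval to the next.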
Simple visualization of synthetic subscription data.
Description
Simple visualization of synthetic subscription data.
Usage
plot_synthetic_data(data)
Arguments
data |
a data object generated by create_synthetic_data |
Value
None - prints synthetic data generated by create_synthetic_data
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu
Examples
data <- create_synthetic_data()
plot_synthetic_data(data)
Generates predictions for each individual at each interval defined by the 'train_result' parameter. The interval-level predictions can be combined to generate survival curves for an individual.
Description
Generates predictions for each individual at each interval defined by the 'train_result' parameter. The interval-level predictions can be combined to generate survival curves for an individual.
Usage
spect_predict(train_result, new_data)
Arguments
train_result |
return data object from spect_train |
new_data |
new data set with the same covariates as the training data set |
Value
predictions by the trained model on a new data set
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu
Generates a trained caret model using the given primary binary classification. Optionally generates a stacked ensemble model if a list of base learners is supplied.
Description
Generates a trained caret model using the given primary binary classification. Optionally generates a stacked ensemble model if a list of base learners is supplied.
Usage
spect_train(
test_prop = 0.2,
censor_type = "half",
bin_slices = 10,
method = "repeatedcv",
resampling_number = 10,
kfold_repeats = 3,
model_algorithm,
base_learner_list = list(),
metric = "Kappa",
rng_seed = 42,
use_parallel = TRUE,
cores = 0,
modeling_data,
event_indicator_var,
survival_time_var,
obs_window
)
Arguments
test_prop |
optional proportion of the data set to reserve for testing |
censor_type |
optional method used to determine censorship in a given bin - may be "half", "prev" or "same". see createDiscreteDat for usage. |
bin_slices |
optional number of intervals to use for predictions. |
method |
optional caret parameter |
resampling_number |
optional for repeated cv |
kfold_repeats |
optional number of repeats for repeated k-fold cross-validation |
model_algorithm |
primary classification algorithm. Trains a stack-ensemble model if 'base_learner_list' is supplied, otherwise trains a simple classifier model. |
base_learner_list |
optional list of base learner algorithms |
metric |
optional metric for model calibration |
rng_seed |
optional random number generation seed for reproducibility |
use_parallel |
optionally make use of the caret multicore training cluster |
cores |
optional number of cores for multicore training. If zero, spect will attempt to make a good choice. Note: only relevant if 'use_parallel' is set to TRUE, otherwise this parameter is ignored. |
modeling_data |
This data set must have one column for time and one column for the event indicator. The remaining columns are treated as covariates for modeling. |
event_indicator_var |
The name of the column containing the event indicator (values in this column must be zero or one). |
survival_time_var |
The name of the column containing the time variable |
obs_window |
The last time to use for generating person-period data. Any event occurring after this time will be administratively censored. In general, choosing a time at or near the end of the max observed time will include most events. |
Value
A list containing all intermediate data sets created by 'spect_train', a trained caret model object, the following parameters passed to 'spect_train': 'obs_window', 'survival_time_var', 'event_indicator_var', 'base_learner_list', 'bin_slices', and the bounds of each interval generated by the training data set.
Author(s)
Stephen Abrams, stephen.abrams@louisville.edu