Type: | Package |
Title: | Automated Covariate Selection Using HDPS Algorithm |
Version: | 1.0.0 |
Author: | Dennis Robert <dennis.robert.nm@gmail.com> |
Maintainer: | Dennis Robert <dennis.robert.nm@gmail.com> |
Description: | Contains functions to implement automated covariate selection using methods described in the high-dimensional propensity score (HDPS) algorithm by Schneeweiss et al. Covariate adjustment in real-world observational data (RWD) is important for estimating adjusted outcomes, and it can be done using methods such as, but not limited to, propensity score matching, propensity score weighting and regression analysis. While these methods strive to statistically adjust for confounding, the major challenge is selecting the potential covariates that can bias the outcome comparison estimates in observational RWD. This is where automated covariate selection is useful. The functions in this package implement the three major steps of automated covariate selection as described by Schneeweiss et al. These three functions, in the order of the steps required to execute automated covariate selection, are get_candidate_covariates(), get_recurrence_covariates() and get_prioritised_covariates(). In addition to these functions, sample real-world data from publicly available de-identified medical claims data is included for running examples and for further exploration. The algorithm is described in the original article by Schneeweiss et al. (2009) <doi:10.1097/EDE.0b013e3181a663cc>. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
URL: | https://github.com/technOslerphile/autoCovariateSelection |
BugReports: | https://github.com/technOslerphile/autoCovariateSelection/issues |
Imports: | purrr, data.table |
Depends: | dplyr, R (≥ 2.10) |
RoxygenNote: | 7.1.1 |
Suggests: | testthat |
NeedsCompilation: | no |
Packaged: | 2020-12-11 11:14:45 UTC; root |
Repository: | CRAN |
Date/Publication: | 2020-12-14 09:50:11 UTC |
Generate candidate empirical baseline covariates based on prevalence in the baseline period
Description
The get_candidate_covariates function generates the list of candidate empirical covariates based on their prevalence
within each domain (dimension). This is the first step in the automated covariate selection process. See the 'Automated Covariate Selection'
section below for more details regarding the overall process.
Usage
get_candidate_covariates(
df,
domainVarname,
eventCodeVarname,
patientIdVarname,
patientIdVector,
n = 200,
min_num_patients = 100
)
Arguments
df |
The input |
domainVarname |
The variable(field) name which contains the domain of the covariate in the |
eventCodeVarname |
The variable name which contains the covariate codes (eg:- CCS, ICD9) in the |
patientIdVarname |
The variable name which contains the patient identifier in the |
patientIdVector |
The 1-D vector with all the patient identifiers. The length of this vector should be equal to
the number of distinct patients in the |
n |
The maximum number of empirical candidate baseline covariates that should be returned within each domain. By default, n is 200 |
min_num_patients |
Minimum number of patients in whom a covariate must occur for it to be eligible for selection.
To be considered for selection, a covariate should have occurred for a minimum |
Details
The theoretical details of the high-dimensional propensity score (HDPS) algorithm are given in the publication listed below in the References
section. get_candidate_covariates implements what is described in the 'Identify candidate empirical covariates' section
of the article.
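The idea of prevalence-based candidate selection can be sketched in base R on toy data. This is only an illustration under stated assumptions, not the package's implementation: the toy values, the plain descending-prevalence ranking and the tie-breaking by code are all assumptions made for this example.

```r
# Hypothetical long-format claims data: one row per recorded event
claims <- data.frame(
  person_id  = c(1, 1, 2, 2, 3, 3, 4, 4, 4),
  domain     = c("dx", "rx", "dx", "dx", "dx", "rx", "dx", "px", "rx"),
  event_code = c("250", "A1", "250", "401", "250", "A1", "401", "99213", "B2"),
  stringsAsFactors = FALSE
)
n_patients <- length(unique(claims$person_id))

# Count distinct patients per (domain, code) so repeated events are not double counted
distinct_rows <- unique(claims[, c("person_id", "domain", "event_code")])
counts <- aggregate(person_id ~ domain + event_code, data = distinct_rows, FUN = length)
names(counts)[3] <- "n_with_code"
counts$prevalence <- counts$n_with_code / n_patients

# Illustrative thresholds (far smaller than the defaults n = 200, min_num_patients = 100)
n <- 1
min_num_patients <- 2

# Keep codes seen in enough patients, then take the top n per domain by prevalence
eligible <- counts[counts$n_with_code >= min_num_patients, ]
top <- do.call(rbind, lapply(split(eligible, eligible$domain), function(d) {
  d <- d[order(-d$prevalence, d$event_code), ]
  head(d, n)
}))

# Prefix the code with its domain, mirroring names such as 'dx_19900'
top$covariate <- paste(top$domain, top$event_code, sep = "_")
top$covariate
```

With these toy values, the 'px' domain contributes no candidate because its only code occurs in a single patient, which is below the illustrative min_num_patients threshold.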
Value
A named list containing three R objects:
- covars
A 1-D vector containing the names of the selected baseline covariates from each domain. For each domain in the df, the number of covars is less than or equal to n.
- covars_data
The data.frame filtered out of df with only the selected covars. The values of the eventCodeVarname field are prefixed with the corresponding domain name. For example, if the event code is 19900 and the domain is 'dx', then the covariate name will be 'dx_19900'.
- patientIds
The list of patient ids present in the original input df. This is exactly the same as the input patientIdVector.
Automated Covariate Selection
The three steps in automated covariate selection are listed below with the functions implementing each step:
- Identify candidate empirical covariates: get_candidate_covariates
- Assess recurrence: get_recurrence_covariates
- Prioritize covariates: get_prioritised_covariates
Author(s)
Dennis Robert dennis.robert.nm@gmail.com
References
Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522. doi:10.1097/EDE.0b013e3181a663cc
Examples
library("autoCovariateSelection")
data(rwd)
head(rwd, 3)
# select distinct elements that are unique for each patient: treatment and outcome
basetable <- rwd %>% select(person_id, treatment, outcome_date) %>% distinct()
head(basetable, 3)
patientIds <- basetable$person_id
step1 <- get_candidate_covariates(df = rwd, domainVarname = "domain",
                                  eventCodeVarname = "event_code", patientIdVarname = "person_id",
                                  patientIdVector = patientIds, n = 100, min_num_patients = 10)
out1 <- step1$covars_data # this will be the input to the get_recurrence_covariates() function
Generate the prioritised covariates from the global list of binary recurrence covariates using multiplicative bias ranking
Description
The get_prioritised_covariates function ranks the binary recurrence covariates by their multiplicative bias
and returns the top k of them as the prioritised covariates. This is the third and final step in the automated covariate selection process.
The previous step of assessing recurrence and generating the binary recurrence covariates is done
using the get_recurrence_covariates function.
See the 'Automated Covariate Selection' section below for more details regarding the overall process.
Usage
get_prioritised_covariates(
df,
patientIdVarname,
exposureVector,
outcomeVector,
patientIdVector,
k = 500
)
Arguments
df |
The input |
patientIdVarname |
The variable name which contains the patient identifier in the |
exposureVector |
The 1-D exposure (treatment/intervention) vector. The length of this vector should be equal to that of
|
outcomeVector |
The 1-D outcome vector indicating whether the patient experienced the outcome of interest (value = 1) or not (value = 0).
The length of this vector should be equal to that of |
patientIdVector |
The 1-D vector with all the patient identifiers. This should contain all the patient IDs in the original two
cohorts with its length and order equal to and resonating with that of |
k |
The maximum number of prioritised covariates that should be returned by the function. By default, this is 500 as described in the original paper |
Details
Covariates across data dimensions (domains) should be prioritised by their potential for controlling confounding that is not conditional
on exposure and other covariates. This means that the association of each covariate with the outcome (relative risk) should be taken into
consideration when quantifying the 'potential' for confounding. The relative risk weighted by the ratio of the prevalence of the covariate between the
two exposure groups is known as the multiplicative bias. The alternative would be to use the absolute risk, which would be the more
straightforward way to quantify the potential for confounding. However, that method would invariably down-weight the association between the
covariate and the outcome if the outcome prevalence is small and the exposure prevalence is high, which is common in comparative
effectiveness research using real-world data in retrospective cohort studies. The multiplicative bias term balances this and generates a quantity for each
covariate that reflects its confounding potential. The covariates are ranked by multiplicative bias and the top k are chosen;
k is 500 by default, as described in the original paper. For further theoretical details of the
algorithm, please refer to the original article listed below in the References section. get_prioritised_covariates is the function
implementing what is described in the 'Prioritise Covariates' section of the article.
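As a rough illustration of the ranking quantity, the sketch below computes a multiplicative bias term for a single binary covariate using the Bross-type formula from the referenced article, (P_C1 * (RR_CD - 1) + 1) / (P_C0 * (RR_CD - 1) + 1), where P_C1 and P_C0 are the covariate prevalences in the exposed and unexposed groups and RR_CD is the covariate-outcome relative risk (inverted when it is below 1). The function name multiplicative_bias_sketch and the toy vectors are hypothetical, and the package's internal computation may differ in detail.

```r
# Hypothetical helper: multiplicative bias for one binary covariate
multiplicative_bias_sketch <- function(covariate, exposure, outcome) {
  p_c1 <- mean(covariate[exposure == 1])  # covariate prevalence among the exposed
  p_c0 <- mean(covariate[exposure == 0])  # covariate prevalence among the unexposed
  # Covariate-outcome relative risk: outcome risk with vs. without the covariate
  rr_cd <- mean(outcome[covariate == 1]) / mean(outcome[covariate == 0])
  if (rr_cd < 1) rr_cd <- 1 / rr_cd       # use the inverse when the RR is protective
  (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)
}

# Toy data for eight patients (values invented for illustration)
exposure  <- c(1, 1, 1, 1, 0, 0, 0, 0)
covariate <- c(1, 1, 1, 0, 1, 0, 0, 0)
outcome   <- c(1, 1, 0, 0, 1, 0, 0, 1)

bias <- multiplicative_bias_sketch(covariate, exposure, outcome)
ranking_value <- abs(log(bias))  # covariates would be ranked on this scale
```

Here P_C1 = 0.75, P_C0 = 0.25 and RR_CD = 3, so the bias term is 2.5 / 1.5. The sketch does not guard against a zero outcome risk in either covariate group, which a real implementation would need to handle.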
Value
A named list containing two R objects:
- autoselected_covariate_df
A data.frame in wide format containing the auto-selected prioritised covariates and their values (1 or 0) for each patient.
- multiplicative_bias
The absolute log of the multiplicative bias term for each of the auto-selected prioritised covariates.
Automated Covariate Selection
The three steps in automated covariate selection are listed below with the functions implementing each step:
- Identify candidate empirical covariates: get_candidate_covariates
- Assess recurrence: get_recurrence_covariates
- Prioritize covariates: get_prioritised_covariates
Author(s)
Dennis Robert dennis.robert.nm@gmail.com
References
Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522. doi:10.1097/EDE.0b013e3181a663cc
Examples
library("autoCovariateSelection")
data(rwd)
head(rwd, 3)
basetable <- rwd %>% select(person_id, treatment, outcome_date) %>% distinct()
head(basetable, 3)
patientIds <- basetable$person_id
step1 <- get_candidate_covariates(df = rwd, domainVarname = "domain",
                                  eventCodeVarname = "event_code", patientIdVarname = "person_id",
                                  patientIdVector = patientIds, n = 100, min_num_patients = 10)
out1 <- step1$covars_data
all.equal(patientIds, step1$patientIds) # should be TRUE
step2 <- get_recurrence_covariates(df = out1,
                                   patientIdVarname = "person_id", eventCodeVarname = "event_code",
                                   patientIdVector = patientIds)
out2 <- step2$recurrence_data
out3 <- get_prioritised_covariates(df = out2,
                                   patientIdVarname = "person_id", exposureVector = basetable$treatment,
                                   outcomeVector = ifelse(is.na(basetable$outcome_date), 0, 1),
                                   patientIdVector = patientIds, k = 10)
Generate the binary recurrence covariates for the identified candidate empirical covariates
Description
The get_recurrence_covariates function assesses the recurrence of each of the identified candidate empirical covariates
based on their frequency of occurrence for each patient in the baseline period and generates three binary recurrence covariates
for each of the identified candidate empirical covariates. This is the second step in the automated covariate selection process.
The first step of identifying empirical candidate covariates is done via the get_candidate_covariates function.
See the 'Automated Covariate Selection' section below for more details regarding the overall process.
Usage
get_recurrence_covariates(
df,
patientIdVarname,
eventCodeVarname,
patientIdVector
)
Arguments
df |
The input |
patientIdVarname |
The variable name which contains the patient identifier in the |
eventCodeVarname |
The variable name which contains the covariate codes (eg:- CCS, ICD9) in the |
patientIdVector |
The 1-D vector with all the patient identifiers. This should contain all the patient IDs in the original two
cohorts. This vector can simply be the |
Details
The recurrence covariates are generated based on the frequency (counts) of occurrence of each empirical candidate covariate
generated by the get_candidate_covariates function. This is done by looking at the baseline period of each patient and
assessing whether the covariate occurred once, sporadically or frequently. That is, a maximum of three recurrence covariates
is created and returned for each candidate covariate.
- once
Indicates whether the covariate occurred at least once for the patient.
- sporadic
Indicates whether the covariate occurred at least the median number of times (median of the non-zero occurrences of the candidate covariate) for the patient.
- frequent
Indicates whether the covariate occurred at least the upper-quartile number of times (75th percentile of the non-zero occurrences of the candidate covariate) for the patient.
Note that if two or all three of the binary recurrence covariates are identical for a candidate covariate, only the distinct recurrence covariates
are returned. For example, if once == sporadic == frequent for the candidate covariate (median and upper quartile are both 1), then only the 'once' recurrence covariate is
returned. If once != sporadic == frequent, then 'once' and 'sporadic' are returned. If once == sporadic != frequent, then 'once'
and 'frequent' are returned. If none of the three recurrence covariates are identical, then all three are returned.
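The thresholds and the de-duplication rule described above can be sketched for a single candidate covariate as follows. This is a minimal base R illustration, not the package's code: the per-patient counts are invented, and the use of quantile() with its default type is an assumption about how the upper quartile is computed.

```r
# Hypothetical number of baseline occurrences of one candidate covariate, per patient
counts <- c(p1 = 0, p2 = 1, p3 = 1, p4 = 1, p5 = 2)

# Thresholds are computed over patients with at least one occurrence
nonzero <- counts[counts > 0]
med <- median(nonzero)
q3  <- quantile(nonzero, 0.75, names = FALSE)

# Three binary recurrence indicators
rec <- data.frame(
  once     = as.integer(counts >= 1),
  sporadic = as.integer(counts >= med),
  frequent = as.integer(counts >= q3)
)

# Keep only the distinct indicator columns, preferring the earlier name on ties
rec_distinct <- rec[, !duplicated(as.list(rec)), drop = FALSE]
names(rec_distinct)
```

In this toy case the median of the non-zero counts is 1, so 'sporadic' is identical to 'once' and is dropped, while 'frequent' flags only the patient with two occurrences. This corresponds to the once == sporadic != frequent case in the text.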
The theoretical details of the implemented algorithm are given in the publication listed below in the References
section. get_recurrence_covariates implements what is described in the 'Assess Recurrence' section
of the article.
Value
A named list containing two R objects:
- recurrence_data
A data.frame containing all the binary recurrence covariates for all the patients in wide format. This means that this data.frame
will have a number of rows equal to the number of distinct patients and a number of columns equal to the number of binary recurrence covariates plus 1 (for the patient Id variable). Each binary recurrence covariate is prefixed with 'rec_' to indicate that it is a recurrence covariate and suffixed with '_once', '_sporadic' or '_frequent'. See the Details section above.
- patientIds
The list of patient ids present in the original input df. This is exactly the same as the input patientIdVector.
Automated Covariate Selection
The three steps in automated covariate selection are listed below with the functions implementing each step:
- Identify candidate empirical covariates: get_candidate_covariates
- Assess recurrence: get_recurrence_covariates
- Prioritize covariates: get_prioritised_covariates
Author(s)
Dennis Robert dennis.robert.nm@gmail.com
References
Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522. doi:10.1097/EDE.0b013e3181a663cc
Examples
library("autoCovariateSelection")
data(rwd)
head(rwd, 3)
basetable <- rwd %>% select(person_id, treatment, outcome_date) %>% distinct()
head(basetable, 3)
patientIds <- basetable$person_id
step1 <- get_candidate_covariates(df = rwd, domainVarname = "domain",
                                  eventCodeVarname = "event_code", patientIdVarname = "person_id",
                                  patientIdVector = patientIds, n = 100, min_num_patients = 10)
out1 <- step1$covars_data
all.equal(patientIds, step1$patientIds) # should return TRUE
step2 <- get_recurrence_covariates(df = out1, patientIdVarname = "person_id",
                                   eventCodeVarname = "event_code", patientIdVector = patientIds)
out2 <- step2$recurrence_data
Compute relative risk for each of the covariates with respect to outcomes occurred
Description
The get_relative_risk function is a helper function used within the get_prioritised_covariates function.
This function computes the prevalence in the exposed and that in the unexposed and simply returns the relative risk for all the covariates in the
input data.frame.
Usage
get_relative_risk(df, outcomeVec)
Arguments
df |
The input |
outcomeVec |
The 1-D outcome vector indicating whether the patient experienced the outcome of interest (value = 1) or not (value = 0).
The length of this vector should be equal to the number of rows of |
Value
A 1-D vector containing relative risk of the association between the covariate (confounder) and the outcome. Thus, the length of this vector will be equal to the number of covariates.
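A column-wise relative risk of this kind can be sketched in a few lines of base R. The function relative_risk_sketch and the toy data are hypothetical and only illustrate the shape of the input and output; the actual helper may handle edge cases (such as a zero risk in one group) differently.

```r
# Hypothetical helper: outcome risk with the covariate divided by outcome risk without it
relative_risk_sketch <- function(df, outcomeVec) {
  vapply(df, function(x) {
    mean(outcomeVec[x == 1]) / mean(outcomeVec[x == 0])
  }, numeric(1))
}

# Toy binary covariates for eight patients (values invented for illustration)
covs <- data.frame(
  rec_dx_250_once = c(1, 1, 1, 0, 1, 0, 0, 0),
  rec_rx_A1_once  = c(1, 0, 1, 0, 1, 0, 1, 0)
)
outcome <- c(1, 1, 0, 0, 1, 0, 0, 1)

rr <- relative_risk_sketch(covs, outcome)
rr
```

The result is a named numeric vector with one entry per covariate column, matching the description of the return value above.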
Author(s)
Dennis Robert dennis.robert.nm@gmail.com
Sample Data for autoCovariateSelection
Description
This dataset contains Medicare claims data for a small sample of 1000 patients from the publicly available CMS Medicare De-SynPUF data. It contains data from three domains: diagnosis, procedures and medications. The diagnosis codes are ICD9 codes, procedures are CPT4/HCPCS codes and medications are NDC codes.
Usage
rwd
Format
A data frame with 69333 rows and 9 variables:
- person_id
Patient identifier
- index_date
Date of first exposure. For one patient, there will only be one index_date
- event_date
Date on which event_code occurred for the patient
- event_code
The medical coding of the event. These are ICD9, CPT4, HCPCS or NDC codes depending on the domain
- event_concept_id
Another identifier for the event_code. This is irrelevant for this package and you can ignore it
- domain
The domain to which the event_code belongs. The three unique values are dx (for diagnosis), px (for procedure) and rx (for medication)
- treatment
Binary indicator of treatment allocation based on exposure. 1 indicates the primary cohort and 0 the control/comparator cohort
- outcome_date
Date on which the outcome occurred. NA indicates no outcome occurred. In this sample data, the outcome is death
- last_enrollment_date
Last enrolled date of the patient. This field is irrelevant for this package and you can ignore it
...