Title: | Clean and Standardize Epidemiological Data |
Version: | 1.1.1 |
Description: | Cleaning and standardizing tabular data package, tailored specifically for curating epidemiological data. It streamlines various data cleaning tasks that are typically expected when working with datasets in epidemiology. It returns the processed data in the same format, and generates a comprehensive report detailing the outcomes of each cleaning task. |
License: | MIT + file LICENSE |
URL: | https://epiverse-trace.github.io/cleanepi/, https://github.com/epiverse-trace/cleanepi |
BugReports: | https://github.com/epiverse-trace/cleanepi/issues |
Depends: | R (≥ 4.0.0) |
Imports: | checkmate, cli, dplyr, janitor, linelist (≥ 1.0.0), lubridate, magrittr, matchmaker, numberize, readr, rlang, tibble |
Suggests: | htmlwidgets, kableExtra, knitr, lintr, markdown, naniar, reactable, rmarkdown, spelling, systemfonts, testthat (≥ 3.0.0) |
VignetteBuilder: | knitr |
Config/Needs/website: | epiverse-trace/epiversetheme |
Config/potools/style: | explicit |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
Language: | en-US |
LazyData: | true |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-07-16 10:25:36 UTC; karimmane |
Author: | Karim Mané |
Maintainer: | Karim Mané <karim.mane@lshtm.ac.uk> |
Repository: | CRAN |
Date/Publication: | 2025-07-16 10:40:02 UTC |
cleanepi: Clean and Standardize Epidemiological Data
Description
Cleaning and standardizing tabular data package, tailored specifically for curating epidemiological data. It streamlines various data cleaning tasks that are typically expected when working with datasets in epidemiology. It returns the processed data in the same format, and generates a comprehensive report detailing the outcomes of each cleaning task.
Author(s)
Maintainer: Karim Mané karim.mane@lshtm.ac.uk (ORCID)
Authors:
Abdoelnaser Degoot abdoelnaser-mahmood.degoot@lshtm.ac.uk (ORCID)
Bankolé Ahadzie Bankole.Ahadzie@lshtm.ac.uk
Nuredin Mohammed Nuredin.Mohammed@lshtm.ac.uk
Bubacarr Bah Bubacarr.Bah1@lshtm.ac.uk (ORCID)
Other contributors:
Thibaut Jombart thibautjombart@gmail.com (Thibaut contributed to the development of date_guess().) [contributor]
Hugo Gruson hugo@data.org (ORCID) [contributor, reviewer]
Pratik R. Gupte pratik.gupte@lshtm.ac.uk (ORCID) [reviewer]
James M. Azam james.azam@lshtm.ac.uk (ORCID) [reviewer]
Joshua W. Lambert joshua.lambert@lshtm.ac.uk (ORCID) [reviewer, contributor]
Chris Hartgerink chris@data.org (ORCID) [reviewer]
Andree Valle-Campos avallecam@gmail.com (ORCID) [reviewer, contributor]
London School of Hygiene and Tropical Medicine, LSHTM (00a0jsq62) [copyright holder]
data.org [funder]
See Also
Useful links:
Report bugs at https://github.com/epiverse-trace/cleanepi/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
Add an element to the data dictionary
Description
Add an element to the data dictionary
Usage
add_to_dictionary(dictionary, option, value, grp, order = NULL)
Arguments
dictionary |
A
|
option |
A |
value |
A |
grp |
A |
order |
A |
Value
A <data.frame>. This is the new data dictionary with
an additional line that contains the details about the new options.
Examples
test <- add_to_dictionary(
dictionary = readRDS(
system.file("extdata", "test_dict.RDS", package = "cleanepi")
),
option = "ml",
value = "male",
grp = "gender",
order = NULL
)
Add an element to the report object
Description
Add an element to the report object
Usage
add_to_report(x, key, value)
Arguments
x |
A |
key |
A |
value |
The object to add to the report object |
Value
The input <data.frame> or <linelist> with an
additional element to the report.
Examples
# scan through the data
scan_res <- scan_data(
data = readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
)
# Perform data cleaning
cleaned_data <- clean_data(
data = readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
),
to_numeric = list(target_columns = "sex", lang = "en"),
dictionary = NULL
)
# add the data scanning result to the report
cleaned_data <- add_to_report(
x = cleaned_data,
key = "scanning_result",
value = scan_res
)
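The clean_data() help page states that the report is reachable through attr(cleaned_data, "report"). The following base R sketch illustrates how an element could be stored in such an attribute. The helper attach_report_element() is hypothetical and is not the package's implementation.

```r
# Hypothetical helper (not part of {cleanepi}): store a named element in a
# "report" attribute attached to a data frame.
attach_report_element <- function(x, key, value) {
  report <- attr(x, "report")
  if (is.null(report)) {
    report <- list()
  }
  report[[key]] <- value
  attr(x, "report") <- report
  x
}

df <- data.frame(id = 1:3, sex = c("m", "f", "m"))
df <- attach_report_element(df, key = "scanning_result", value = "scan output")

# the report is a named list carried alongside the data
names(attr(df, "report"))
```

The data itself is left unchanged; only its attributes grow.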
Checks whether the order in a sequence of date events is chronological.
Description
Checks whether a date sequence in a vector of specified columns is in chronological order or not.
Usage
check_date_sequence(data, target_columns)
Arguments
data |
The input |
target_columns |
A |
Value
The input dataset. When found, the incorrect date sequences will be
stored in the report and can be accessed using the print_report()
function as shown in the example below.
Examples
# import the data
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
# standardize the date values
data <- data %>%
standardize_dates(
target_columns = c("date_first_pcr_positive_test", "date.of.admission"),
error_tolerance = 0.4,
format = NULL,
timeframe = NULL
)
# check whether all admission dates come after the test dates
good_date_sequence <- check_date_sequence(
data = data,
target_columns = c("date_first_pcr_positive_test", "date.of.admission")
)
# display rows where admission dates do not come after the test dates
print_report(
data = good_date_sequence,
what = "incorrect_date_sequence"
)
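To make the idea of a chronological check concrete, here is a minimal base R sketch on toy data. The helper row_is_chronological() and the column names are hypothetical, and treating equal dates as correctly ordered is an assumption of this sketch rather than documented package behaviour.

```r
# Toy data: each row lists two events in the expected chronological order.
events <- data.frame(
  date_test = as.Date(c("2020-12-01", "2021-01-28", "2021-02-20")),
  date_admission = as.Date(c("2020-12-03", "2021-01-20", "2021-02-20"))
)

# Hypothetical helper: TRUE when the dates in a row never go backwards.
# Equal dates are treated as ordered (assumption of this sketch).
row_is_chronological <- function(data, target_columns) {
  vapply(
    seq_len(nrow(data)),
    function(i) {
      dates <- as.numeric(unlist(data[i, target_columns]))
      !is.unsorted(dates, na.rm = TRUE)
    },
    logical(1)
  )
}

in_order <- row_is_chronological(events, c("date_test", "date_admission"))

# rows that would be flagged as incorrect date sequences
events[!in_order, ]
```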
Check whether the subject IDs comply with the expected format. When incorrect IDs are found, the function sends a warning and the user can call the correct_subject_ids() function to correct them.
Description
Check whether the subject IDs comply with the expected format. When incorrect IDs are found, the function sends a warning and the user can call the correct_subject_ids() function to correct them.
Usage
check_subject_ids(
data,
target_columns,
prefix = NULL,
suffix = NULL,
range = NULL,
nchar = NULL
)
Arguments
data |
The input |
target_columns |
A |
prefix |
A |
suffix |
A |
range |
A |
nchar |
An |
Value
The input dataset, with a warning if incorrect subject IDs were found.
Examples
data <- readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
)
# make first and last subject IDs the same
data$study_id[10] <- data$study_id[1]
# set subject ID number 9 to NA
data$study_id[9] <- NA
# detect the incorrect subject ids, i.e. IDs that fail one or more of the
# following criteria:
# - starts with 'PS',
# - ends with 'P2',
# - has a number between 1 and 100,
# - contains 7 characters.
dat <- check_subject_ids(
data = data,
target_columns = "study_id",
prefix = "PS",
suffix = "P2",
range = c(1, 100),
nchar = 7
)
# display rows with invalid subject ids
print_report(dat, "incorrect_subject_id")
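The four criteria used in the example (prefix, suffix, numeric range, and number of characters) can be pictured with a small base R sketch. The helper id_is_valid() and its id_length argument are hypothetical, and the assumption that the numeric part sits between the prefix and the suffix is ours; the package may parse IDs differently.

```r
# Hypothetical helper (illustrative only): TRUE for IDs that meet every rule.
id_is_valid <- function(ids, prefix, suffix, range, id_length) {
  ids <- as.character(ids)
  ok_affixes <- !is.na(ids) & startsWith(ids, prefix) & endsWith(ids, suffix)
  ok_length <- !is.na(ids) & nchar(ids) == id_length
  # assumption: the numeric part sits between the prefix and the suffix
  middle <- substr(ids, nchar(prefix) + 1, nchar(ids) - nchar(suffix))
  is_number <- !is.na(middle) & grepl("^[0-9]+$", middle)
  number <- rep(NA_integer_, length(ids))
  number[is_number] <- as.integer(middle[is_number])
  ok_range <- is_number & number >= range[1] & number <= range[2]
  ok_affixes & ok_length & ok_range
}

ids <- c("PS001P2", "PS002P2", "P0005P2", "PB500P2", "PS004P2-1", NA)
valid <- id_is_valid(
  ids,
  prefix = "PS",
  suffix = "P2",
  range = c(1, 100),
  id_length = 7
)

# IDs that break at least one rule (missing IDs included)
ids[!valid]
```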
Checks the uniqueness in values of the sample IDs column
Description
Checks the uniqueness in values of the sample IDs column
Usage
check_subject_ids_oness(data, id_col_name)
Arguments
data |
The input |
id_col_name |
A |
Value
The input <data.frame> with an extra element in its
attributes when there are missing or duplicated IDs.
Clean and standardize data
Description
Cleans up messy data frames by performing several operations. These include among others: cleaning of column names, detecting and removing duplicates, empty records and columns, constant columns, replacing missing values by NA, converting character columns into dates when they contain a certain number of date values, detecting subject IDs with wrong formats, etc.
Usage
clean_data(data, ...)
Arguments
data |
The input |
... |
A
|
Value
The cleaned input data according to the user-specified parameters.
This is associated with a data cleaning report that can be accessed using
attr(cleaned_data, "report").
Examples
# Parameters for column names standardization: rename all column names if
# applicable
standardize_column_names <- list(keep = NULL, rename = NULL)
# parameters to remove constant columns, empty rows and columns: remove rows
# and columns with 100% constant data
remove_constants <- list(cutoff = 1)
# Parameters for substituting missing values ("-99") with NA
replace_missing_values <- list(target_columns = NULL, na_strings = "-99")
# Parameters for duplicates removal across all columns
remove_duplicates <- list(target_columns = NULL)
# Parameters for the conversion of Date columns into "%Y-%m-%d" format
standardize_dates <- list(
target_columns = NULL,
error_tolerance = 0.4,
format = NULL,
timeframe = as.Date(c("1973-05-29", "2023-05-29")),
orders = list(
world_named_months = c("Ybd", "dby"),
world_digit_months = c("dmy", "Ymd"),
US_formats = c("Omdy", "YOmd")
)
)
# Parameters to check whether the subject IDs comply with the expected format
standardize_subject_ids <- list(
target_columns = "study_id",
prefix = "PS",
suffix = "P2",
range = c(1, 100),
nchar = 7
)
# convert the 'sex' column into numeric
to_numeric <- list(target_columns = "sex", lang = "en")
# the dictionary-based cleaning will not be performed here
dictionary <- NULL
# no need to check if the sequence of date events is correct
check_date_sequence <- NULL
# perform the data cleaning
cleaned_data <- clean_data(
data = readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
),
standardize_column_names = standardize_column_names,
remove_constants = remove_constants,
replace_missing_values = replace_missing_values,
remove_duplicates = remove_duplicates,
standardize_dates = standardize_dates,
standardize_subject_ids = standardize_subject_ids,
to_numeric = to_numeric,
dictionary = NULL,
check_date_sequence = NULL
)
Perform dictionary-based cleaning
Description
Perform dictionary-based cleaning
Usage
clean_using_dictionary(data, dictionary)
Arguments
data |
The input |
dictionary |
A
|
Value
A <data.frame> or <linelist> where the target options
have been replaced with their corresponding values in the columns
specified in the data dictionary.
Examples
data <- readRDS(
system.file("extdata", "messy_data.RDS", package = "cleanepi")
)
dictionary <- readRDS(
system.file("extdata", "test_dict.RDS", package = "cleanepi")
)
# adding an option that is not defined in the dictionary to the 'gender'
# column
data$gender[2] <- "homme"
cleaned_df <- clean_using_dictionary(
data = data,
dictionary = dictionary
)
# print the report
print_report(cleaned_df, "misspelled_values")
Common strings representing missing values
Description
This vector contains common values of NA (missing) and is intended for
use within the {cleanepi} function replace_missing_values().
The current list of strings used can be found by printing out
common_na_strings. It serves as a helpful tool to explore your data
for possible missing values. However, I strongly caution against using
this to replace NA values without meticulously examining the
incidence for each case. Please note that common_na_strings utilizes
\\ around the "?", ".", and "*" characters to prevent their wildcard
interpretation in regular expressions.
Usage
common_na_strings
Format
A vector of 35 character strings.
Source
This vector is a combination of naniar::common_na_strings
(https://github.com/njtierney/naniar/) and other strings found in the
literature.
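The escaping mentioned in the description matters because "?", "." and "*" are regular-expression metacharacters. The short base R sketch below, on made-up values, shows the difference between an unescaped and an escaped dot. It only illustrates the regex behaviour and does not reproduce how the package matches strings.

```r
values <- c("unknown", "?", ".", "12", "*")

# unescaped: "." is a regular-expression wildcard and matches any character
unescaped <- grepl(".", values)

# escaped: "\\." matches a literal dot only
escaped <- grepl("\\.", values)

data.frame(values, unescaped, escaped)
```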
Build the report for the detected misspelled values during dictionary-based data cleaning operation
Description
Build the report for the detected misspelled values during dictionary-based data cleaning operation
Usage
construct_misspelled_report(misspelled_options, data)
Arguments
misspelled_options |
A |
data |
The input |
Value
A <data.frame> with the details about where in the input data the
misspelled values were found.
Convert numeric to date
Description
Convert numeric to date
Usage
convert_numeric_to_date(data, target_columns, ref_date, forward = TRUE)
Arguments
data |
The input |
target_columns |
A |
ref_date |
A |
forward |
A |
Value
A <data.frame> or <linelist> where the columns of
interest are updated.
Examples
data <- readRDS(system.file("extdata", "test_df1.RDS", package = "cleanepi"))
data <- convert_numeric_to_date(
data = data,
target_columns = "recruited_on_day",
ref_date = as.Date("2022-10-13"),
forward = TRUE
)
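As a rough illustration of the conversion, the base R sketch below turns day counts into dates relative to a reference date. The helper days_to_date() is hypothetical, and the reading of forward (TRUE adds the counts to the reference date, FALSE subtracts them) is an assumption, since the argument description is not available here.

```r
# Hypothetical helper: convert day counts into dates around a reference date.
# Assumption: forward = TRUE means the counts run after the reference date.
days_to_date <- function(days, ref_date, forward = TRUE) {
  if (forward) {
    ref_date + days
  } else {
    ref_date - days
  }
}

recruited_on_day <- c(0, 3, 10)
ref_date <- as.Date("2022-10-13")

forward_dates <- days_to_date(recruited_on_day, ref_date, forward = TRUE)
backward_dates <- days_to_date(recruited_on_day, ref_date, forward = FALSE)
```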
Convert columns into numeric
Description
When this function is invoked without specifying the column names to be
converted, the target columns are the ones returned by the scan_data()
function. Furthermore, it identifies columns where the proportion of numeric
values is at least twice the percentage of character values and performs the
conversion in them. Internally, the function calls the main function
from the numberize package.
Usage
convert_to_numeric(data, target_columns = NULL, lang = c("en", "fr", "es"))
Arguments
data |
The input |
target_columns |
A |
lang |
A |
Value
A <data.frame> or <linelist> wherein all the specified
or detected columns have been transformed into numeric format after the
conversion process.
Examples
data <- readRDS(
system.file("extdata", "messy_data.RDS", package = "cleanepi")
)
# convert the 'age' column into numeric
dat <- convert_to_numeric(
data = data,
target_columns = "age",
lang = "en"
)
# print the report from this operation
print_report(data = dat, "converted_into_numeric")
Correct misspelled values by using approximate string matching techniques to compare them against the expected values.
Description
Correct misspelled values by using approximate string matching techniques to compare them against the expected values.
Usage
correct_misspelled_values(
data,
target_columns,
wordlist,
max_distance = 1,
confirm = rlang::is_interactive(),
...
)
Arguments
data |
The input |
target_columns |
A |
wordlist |
A |
max_distance |
An |
confirm |
A |
... |
Details
When used interactively (see interactive()) the user is presented a menu
to ensure that the words detected using approximate string matching are not
false positives, and the user can decide whether to proceed with the
spelling corrections. In non-interactive sessions all misspelled values are
replaced by their closest values within the provided vector of expected
values.
If multiple words supplied in the wordlist equally match a word in the
data and confirm is TRUE, the user is presented a menu to choose the
replacement word. If it is not used interactively, multiple equal matches
throw a warning.
Value
The corrected input data according to the user-specified wordlist.
Examples
df <- data.frame(
case_type = c("confirmed", "confermed", "probable", "susspected"),
outcome = c("died", "recoverd", "did", "recovered")
)
df
correct_misspelled_values(
data = df,
target_columns = c("case_type", "outcome"),
wordlist = c("confirmed", "probable", "suspected", "died", "recovered"),
confirm = FALSE
)
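Approximate string matching of this kind can be sketched with utils::adist(), which computes edit distances. The helper fuzzy_correct() below is hypothetical and is not the package's algorithm; in particular, leaving tied matches untouched is a simplification of this sketch.

```r
# Hypothetical helper: replace each value by the closest word in `wordlist`
# when the edit distance is between 1 and `max_distance`.
# Tied best matches are left unchanged (simplification of this sketch).
fuzzy_correct <- function(x, wordlist, max_distance = 1) {
  distances <- utils::adist(x, wordlist)
  vapply(
    seq_along(x),
    function(i) {
      best <- which(distances[i, ] == min(distances[i, ]))
      unique_close_match <- length(best) == 1 &&
        distances[i, best] > 0 &&
        distances[i, best] <= max_distance
      if (unique_close_match) wordlist[best] else x[i]
    },
    character(1)
  )
}

case_type <- c("confirmed", "confermed", "probable", "susspected")
corrected <- fuzzy_correct(
  case_type,
  wordlist = c("confirmed", "probable", "suspected")
)
corrected
```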
Correct the wrong subject IDs based on the user-provided values.
Description
After detecting incorrect subject IDs from the check_subject_ids()
function, use this function to provide the correct IDs and perform the
substitution.
Usage
correct_subject_ids(data, target_columns, correction_table)
Arguments
data |
The input |
target_columns |
A |
correction_table |
A
|
Value
The input dataset where all subject ids comply with the expected format.
Examples
data <- readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
)
# detect the incorrect subject ids, i.e. IDs that fail one or more of the
# following criteria:
# - starts with 'PS',
# - ends with 'P2',
# - has a number between 1 and 100,
# - contains 7 characters.
dat <- check_subject_ids(
data = data,
target_columns = "study_id",
prefix = "PS",
suffix = "P2",
range = c(1, 100),
nchar = 7
)
# display rows with invalid subject ids
print_report(dat, "incorrect_subject_id")
# generate the correction table
correction_table <- data.frame(
from = c("P0005P2", "PB500P2", "PS004P2-1"),
to = c("PB005P2", "PB050P2", "PS004P2")
)
# perform the correction
dat <- correct_subject_ids(
data = dat,
target_columns = "study_id",
correction_table = correction_table
)
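The substitution driven by a from/to correction table can be pictured with base R's match(). This is a sketch on a plain character vector, not the package's code; values absent from the table are left as they are.

```r
ids <- c("PS001P2", "P0005P2", "PB500P2", "PS004P2-1")

correction_table <- data.frame(
  from = c("P0005P2", "PB500P2", "PS004P2-1"),
  to = c("PB005P2", "PB050P2", "PS004P2"),
  stringsAsFactors = FALSE
)

# position of each ID in the `from` column (NA when no correction is listed)
idx <- match(ids, correction_table$from)
to_fix <- !is.na(idx)

# replace only the IDs that have an entry in the correction table
ids[to_fix] <- correction_table$to[idx[to_fix]]
ids
```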
Convert and update date values
Description
Convert and update date values
Usage
date_check_outsiders(data, timeframe, new_dates, cols)
Arguments
data |
The input |
timeframe |
A |
new_dates |
A |
cols |
A |
Value
A <list> of 2 data frames: the updated input data (if some
columns were converted to Date) and a data frame of date values that are
not within the specified timeframe.
Check date time frame
Description
Check date time frame
Usage
date_check_timeframe(first_date, last_date)
Arguments
first_date |
A |
last_date |
A |
Value
A <list> with the first and last dates.
Choose the first non-missing date from a data frame of dates
Description
Choose the first non-missing date from a data frame of dates
Usage
date_choose_first_good(date_a_frame, column_name)
Arguments
date_a_frame |
A |
column_name |
A |
Value
The chosen first <Date> value. When there are other possible
values for a given date, this will be registered in the report object.
Convert characters to dates
Description
Convert characters to dates
Usage
date_convert(data, cols, error_tolerance, timeframe = NULL, orders)
Arguments
data |
The input |
cols |
A |
error_tolerance |
A |
timeframe |
A |
orders |
A list( quarter_partial_dates = c("Y", "Ym", "Yq"), world_digit_months = c("Yq", "ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm", "dmY", "mdY", "mYd", "dYm"), world_named_months = c("dby", "dyb", "bdy", "byd", "ybd", "ydb", "dbY", "dYb", "bdY", "bYd", "Ybd", "Ydb"), us_format = c("Omdy", "YOmd") ) |
Value
A <list> with the following two elements: a data frame where
the specified columns have been converted into <Date> values, and a
boolean that tells whether numeric values that can also be of type
<Date> are found in the specified columns.
Detect complex date format
Description
Detect complex date format
Usage
date_detect_complex_format(x)
Arguments
x |
A |
Value
A <character> with the inferred format.
Detect the appropriate abbreviation for day or month value
Description
Detect the appropriate abbreviation for day or month value
Usage
date_detect_day_or_month(x)
Arguments
x |
A |
Value
A <character> with the abbreviation used to designate the
written day or month.
Detect a date format with only 1 separator
Description
Detect a date format with only 1 separator
Usage
date_detect_format(x)
Arguments
x |
A |
Value
A <character> with the identified format.
Detect the special character that is the separator in the date values
Description
Detect the special character that is the separator in the date values
Usage
date_detect_separator(x)
Arguments
x |
A |
Value
A <character> with the detected separator.
A <vector> of the identified special characters.
Get format from a simple Date value
Description
Get format from a simple Date value
Usage
date_detect_simple_format(x)
Arguments
x |
A |
Value
A <character> with the abbreviation that corresponds to the
Date value.
Infer date format from a vector of characters
Description
Infer date format from a vector of characters
Usage
date_get_format(x)
Arguments
x |
A |
Value
A <character> with the inferred date format.
Split a string based on a pattern and return the first element of the resulting vector.
Description
Split a string based on a pattern and return the first element of the resulting vector.
Usage
date_get_part1(x, sep)
Arguments
x |
A |
sep |
A |
Value
A <character> with the first element of the vector returned
by the strsplit() function.
Get part2 of date value
Description
Get part2 of date value
Usage
date_get_part2(x, sep)
Arguments
x |
A |
sep |
A |
Value
A <character> with the second element of the vector returned
by the strsplit() function.
Get part3 of date value
Description
Get part3 of date value
Usage
date_get_part3(x, sep)
Arguments
x |
|
sep |
A |
Value
A <character> with the third element of the vector returned
by the strsplit()
function.
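The three date_get_part*() helpers are described as returning elements of the vector produced by strsplit(). A minimal base R sketch of that splitting step follows; the date string, the separator, and the use of fixed = TRUE are choices made for this illustration.

```r
x <- "2021-03-15"

# split on the separator; fixed = TRUE treats "-" as a literal string
parts <- strsplit(x, split = "-", fixed = TRUE)[[1]]

part1 <- parts[1] # "2021"
part2 <- parts[2] # "03"
part3 <- parts[3] # "15"
```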
Try to guess dates from characters
Description
Note that THIS FEATURE IS STILL EXPERIMENTAL: we strongly recommend checking
a few converted dates manually. This function tries to extract dates from a
character vector or a factor. It treats each entry independently, using
regular expressions to detect if a date is present and its format; if
successful, it converts that entry to a standard Date with the Ymd format
(e.g. 2018-01-21). Entries which cannot be processed result in NA. An
error threshold can be used to define the maximum number of resulting NA
(i.e. entries without an identified date) that can be tolerated. If this
threshold is exceeded, the original vector is returned.
Usage
date_guess(x, column_name, quiet = TRUE, orders = NULL)
Arguments
x |
A |
column_name |
A |
quiet |
A |
orders |
A list( quarter_partial_dates = c("Y", "Ym", "Yq"), world_digit_months = c("Yq", "ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm", "dmY", "mdY", "mYd", "dYm"), world_named_months = c("dby", "dyb", "bdy", "byd", "ybd", "ydb", "dbY", "dYb", "bdY", "bYd", "Ybd", "Ydb"), us_format = c("Omdy", "YOmd") ) |
Value
A <list> of the following three elements: a vector of the newly
reformatted dates, a data frame with the date values that were converted
based on more than one format, and a Boolean that specifies whether
ambiguous values were found or not. If all values comply with only one
format, the second element will be NULL.
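The error threshold described above can be illustrated with a base R sketch. The helper guess_with_tolerance() is hypothetical, it handles a single known format only, and it expresses the threshold as a proportion of failed conversions (in the spirit of the error_tolerance = 0.4 values used in the examples of this manual), which is an assumption.

```r
# Hypothetical helper: convert to Date, but fall back to the original vector
# when too many entries fail to parse.
guess_with_tolerance <- function(x, format, error_tolerance = 0.4) {
  converted <- as.Date(x, format = format)
  # entries that were not missing but could not be converted
  failed <- is.na(converted) & !is.na(x)
  if (mean(failed) > error_tolerance) {
    return(x)
  }
  converted
}

x <- c("2021-01-05", "05/01/2021", "unknown", "2021-03-10")

# half of the entries fail: above 0.4, so the input is returned unchanged
strict <- guess_with_tolerance(x, format = "%Y-%m-%d", error_tolerance = 0.4)

# with a tolerance of 0.5 the conversion is kept and failures become NA
lenient <- guess_with_tolerance(x, format = "%Y-%m-%d", error_tolerance = 0.5)
```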
Guess if a character vector contains Date values, and convert them to date
Description
Guess if a character vector contains Date values, and convert them to date
Usage
date_guess_convert(data, error_tolerance, timeframe, orders)
Arguments
data |
A |
error_tolerance |
A |
timeframe |
A |
orders |
A list( quarter_partial_dates = c("Y", "Ym", "Yq"), world_digit_months = c("Yq", "ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm", "dmY", "mdY", "mYd", "dYm"), world_named_months = c("dby", "dyb", "bdy", "byd", "ybd", "ydb", "dbY", "dYb", "bdY", "bYd", "Ybd", "Ydb"), us_format = c("Omdy", "YOmd") ) |
Value
A <list> with the following two elements: the input data
frame where the character columns with date values have been converted
into <Date>, and a vector of column names where there are numeric
values that can also be of type Date.
Extract date from a character vector
Description
This function tries converting a single character string into a well-formatted date, but still returning a character. If it can't convert it, it returns NA.
Usage
date_i_guess_and_convert(x)
Arguments
x |
A |
Value
If the format cannot be resolved, the function returns NA; if
a matching format is found, it returns the <vector> of the
converted values.
Build the auto-detected format
Description
Put together the different date format characters that were identified from the target date column.
Usage
date_make_format(f1, f2, f3)
Arguments
f1 |
A |
f2 |
A |
f3 |
A |
Value
A <character> that represents the inferred format from the
provided elements. It returns <NULL> when the format was not
resolved.
Check whether the number of provided formats matches the number of target columns to be standardized.
Description
Check whether the number of provided formats matches the number of target columns to be standardized.
Usage
date_match_format_and_column(target_columns, format)
Arguments
target_columns |
A |
format |
A |
Value
A <vector> of characters with the validated formats.
Process date variable
Description
Process date variable
Usage
date_process(x)
Arguments
x |
A |
Value
The input value converted into <Date> or <character>.
Find the dates that lubridate couldn't parse
Description
Find the dates that lubridate couldn't parse
Usage
date_rescue_lubridate_failures(date_a_frame, original_dates, column_name)
Arguments
date_a_frame |
A |
original_dates |
A |
column_name |
A |
Value
A <list> with the following two elements: the input data
frame where the values that do not match the proposed formats have been
converted into Date, and a boolean that informs about the presence of
ambiguous values or not.
Trim dates outside of the defined timeframe
Description
Trim dates outside of the defined timeframe
Usage
date_trim_outliers(new_dates, dmin, dmax, cols, original_dates)
Arguments
new_dates |
A |
dmin |
A |
dmax |
A |
cols |
A |
original_dates |
A |
Value
A <list> of 2 elements: the updated input vector where date
values that are out of the specified timeframe are replaced by NA,
and a vector of the out of timeframe values.
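A base R sketch of the trimming step is given below. The helper trim_dates() and the names of the returned list elements are hypothetical; it mirrors the described return value (updated vector plus out-of-timeframe values) without claiming to match the package's internals.

```r
# Hypothetical helper: set dates outside [dmin, dmax] to NA and keep a record
# of the values that were removed.
trim_dates <- function(new_dates, dmin, dmax) {
  outside <- !is.na(new_dates) & (new_dates < dmin | new_dates > dmax)
  out_of_range <- new_dates[outside]
  new_dates[outside] <- NA
  list(trimmed = new_dates, outsiders = out_of_range)
}

dates <- as.Date(c("1970-01-01", "2000-06-15", "2030-12-31", NA))
res <- trim_dates(
  dates,
  dmin = as.Date("1973-05-29"),
  dmax = as.Date("2023-05-29")
)

res$trimmed
res$outsiders
```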
Detect misspelled options in columns to be cleaned
Description
Detect misspelled options in columns to be cleaned
Usage
detect_misspelled_options(data, dictionary)
Arguments
data |
The input |
dictionary |
A
|
Value
A <list> with the indexes of the misspelled values in every
column that needs to be cleaned.
Detect the numeric columns that appear as characters due to the presence of some character values in the column.
Description
Detect the numeric columns that appear as characters due to the presence of some character values in the column.
Usage
detect_to_numeric_columns(scan_res, data)
Arguments
scan_res |
A |
data |
The input |
Value
A <vector> of column names to be converted into numeric.
Make data dictionary for 1 field
Description
Make data dictionary for 1 field
Usage
dictionary_make_metadata(x, field_column)
Arguments
x |
A |
field_column |
A |
Value
A <data.frame> with the dictionary in the format that is
accepted by the matchmaker package.
Identify and return duplicated rows in a data frame or linelist.
Description
Identify and return duplicated rows in a data frame or linelist.
Usage
find_duplicates(data, target_columns = NULL)
Arguments
data |
The input |
target_columns |
A |
Value
A <data.frame>
or <linelist>
of all duplicated rows
with following 2 additional columns:
- row_id
The indices of the duplicated rows from the input data. Users can choose from these indices, which row they consider as redundant in each group of duplicates.
- group_id
a unique identifier associated to each group of duplicates.
Examples
data <- readRDS(
system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)
# find duplicates across the following columns: "dt_onset", "dt_report",
# "sex", and "outcome"
dups <- find_duplicates(
data = data,
target_columns = c("dt_onset", "dt_report", "sex", "outcome")
)
# print the detected duplicates
print_report(dups, "found_duplicates")
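The row_id and group_id columns described in the Value section can be reproduced on toy data with base R. This is a sketch of the idea only; the data, the column subset, and the way group identifiers are numbered are assumptions of this illustration.

```r
linelist_df <- data.frame(
  dt_onset = c("2020-01-01", "2020-01-01", "2020-02-10",
               "2020-01-01", "2020-02-10", "2020-03-01"),
  sex = c("f", "f", "m", "f", "m", "m")
)
target_columns <- c("dt_onset", "sex")
subset_df <- linelist_df[, target_columns, drop = FALSE]

# flag every member of a duplicated group, not only the later occurrences
is_dup <- duplicated(subset_df) | duplicated(subset_df, fromLast = TRUE)

dups <- linelist_df[is_dup, , drop = FALSE]
dups$row_id <- which(is_dup)

# one key per row, then number the groups in order of first appearance
keys <- do.call(paste, c(subset_df[is_dup, , drop = FALSE], sep = "\r"))
dups$group_id <- match(keys, unique(keys))
dups
```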
Transform scanning result format into user-chosen format
Description
Transform scanning result format into user-chosen format
Usage
get_appropriate_format(counts, valid_count, format)
Arguments
counts |
A numeric vector with the counts of the different data types |
valid_count |
A numeric with the number of non-missing values in the target column. |
format |
A character with the user-specified format |
Set and return clean_data() default parameters
Description
When the clean_data() function is called without any argument, the
default values provided to the function's arguments will be applied on the
input data. By default, operations that require the target columns to be
specified by the user will not be performed. The default cleaning operations
include: i) standardizing column names, ii) detecting and removing
duplicates, and iii) removing constant data.
Usage
get_default_params()
Value
A <list> of the default cleaning parameters.
Examples
default_params <- get_default_params()
Get the names of the columns from which duplicates will be found
Description
Get the names of the columns from which duplicates will be found
Usage
get_target_column_names(data, target_columns, cols)
Arguments
data |
The input |
target_columns |
A |
cols |
A |
Value
A <vector> with the target column names or indexes.
Check order of a sequence of date-events
Description
Check order of a sequence of date-events
Usage
is_date_sequence_ordered(x)
Arguments
x |
A |
Value
TRUE if elements of the vector are ordered, FALSE otherwise.
Make column names unique when duplicated column names are found after the transformation
Description
Make column names unique when duplicated column names are found after the transformation
Usage
make_unique_column_names(after, kept, before, rename)
Arguments
after |
A |
kept |
A |
before |
A |
rename |
A |
Value
An adjusted <vector> if there were duplicated names introduced
due to the transformation.
Update clean_data() default argument values with the user-provided values.
Description
Update clean_data() default argument values with the user-provided values.
Usage
modify_default_params(defaults, params, strict = TRUE)
Arguments
defaults |
A |
params |
A |
strict |
A |
Value
The updated <list> of parameters that will be used to perform
the data cleaning.
Detects whether a string contains only numbers or not.
Description
Detects whether a string contains only numbers or not.
Usage
numbers_only(x)
Arguments
x |
A |
Value
TRUE if the string only contains numbers, FALSE
otherwise
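A digits-only test of this kind is commonly written with an anchored regular expression. The helper digits_only() below is a hypothetical base R sketch; whether the package also accepts signs or decimal points is not stated here, so this version accepts digits 0-9 only.

```r
# Hypothetical helper: TRUE when a string is made of digits only.
digits_only <- function(x) {
  grepl("^[0-9]+$", x)
}

checks <- digits_only(c("123", "12a", "", "4.5"))
checks
```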
Remove constant data.
Description
This function is called at each iteration of the constant data removal process until no constant data remains.
Usage
perform_remove_constants(data, cutoff)
Arguments
data |
The input |
cutoff |
A |
Value
A <list> with the input dataset where all empty rows and
columns as well as constant columns have been removed.
Print the detected misspelled values
Description
Print the detected misspelled values
Usage
print_misspelled_values(data, misspelled_options)
Arguments
data |
The input |
misspelled_options |
A |
Value
Prints out the misspelled values from the column of interest
Generate report from data cleaning operations
Description
Generate report from data cleaning operations
Usage
print_report(
data,
what = NULL,
print = FALSE,
report_title = "{cleanepi} data cleaning report",
output_file_name = NULL,
format = "html"
)
Arguments
data |
A |
what |
A
|
print |
A |
report_title |
A |
output_file_name |
A |
format |
A |
Value
A <character> containing the name and path of the saved
report.
Examples
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
test_dictionary <- readRDS(
system.file("extdata", "test_dictionary.RDS", package = "cleanepi")
)
# scan through the data
scan_res <- scan_data(data)
# Perform data cleaning
cleaned_data <- data %>%
standardize_column_names(keep = NULL, rename = c("DOB" = "dateOfBirth")) %>%
replace_missing_values(target_columns = NULL, na_strings = "-99") %>%
remove_constants(cutoff = 1.0) %>%
remove_duplicates(target_columns = NULL) %>%
standardize_dates(
target_columns = NULL,
error_tolerance = 0.4,
format = NULL,
timeframe = as.Date(c("1973-05-29", "2023-05-29"))
) %>%
check_subject_ids(
target_columns = "study_id",
prefix = "PS",
suffix = "P2",
range = c(1L, 100L),
nchar = 7L
) %>%
convert_to_numeric(target_columns = "sex", lang = "en") %>%
clean_using_dictionary(dictionary = test_dictionary)
# add the data scanning result to the report
cleaned_data <- add_to_report(
x = cleaned_data,
key = "scanning_result",
value = scan_res
)
# save a report in the current directory using the previously-created objects
print_report(
data = cleaned_data,
report_title = "{cleanepi} data cleaning report",
output_file_name = NULL,
format = "html",
print = TRUE
)
Remove constant data, including empty rows, empty columns, and columns with constant values.
Description
The function iteratively removes constant data until none remain. It records details of the removed constant data as a data frame within the report object.
Usage
remove_constants(data, cutoff = 1)
Arguments
data |
The input |
cutoff |
A |
Value
The input dataset where the constant data is filtered out based on specified cut-off.
Examples
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
# introduce an empty column
data$empty_column <- NA
# inject some missing values across some columns
data$study_id[3] <- NA_character_
data$date.of.admission[3] <- NA_character_
data$date.of.admission[4] <- NA_character_
data$dateOfBirth[3] <- NA_character_
data$dateOfBirth[4] <- NA_character_
data$dateOfBirth[5] <- NA_character_
# with cutoff = 1, rows 3, 4, and 5 are not removed
cleaned_df <- remove_constants(
data = data,
cutoff = 1
)
# drop rows or columns with a percentage of constant values
# equal to or more than 50%
cleaned_df <- remove_constants(
data = cleaned_df,
cutoff = 0.5
)
# drop rows or columns with a percentage of constant values
# equal to or more than 25%
cleaned_df <- remove_constants(
data = cleaned_df,
cutoff = 0.25
)
# drop rows or columns with a percentage of constant values
# equal to or more than 15%
cleaned_df <- remove_constants(
data = cleaned_df,
cutoff = 0.15
)
# check the report to see what has happened
print_report(cleaned_df, "constant_data")
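The iterative behaviour mentioned in the description can be illustrated with a small base R sketch. It is only an approximation under stated assumptions: `drop_constants()` and the `toy` data frame are hypothetical, the `cutoff` argument is ignored (only fully empty rows and columns are dropped), and the code is not the implementation used by remove_constants().

```r
# Hypothetical helper, not part of cleanepi: drop empty rows, empty columns,
# and constant columns repeatedly until a pass changes nothing.
drop_constants <- function(df) {
  repeat {
    before <- dim(df)
    # drop rows where every value is NA
    df <- df[rowSums(!is.na(df)) > 0, , drop = FALSE]
    # drop columns where every value is NA
    df <- df[, colSums(!is.na(df)) > 0, drop = FALSE]
    # drop columns holding a single distinct non-missing value
    is_constant <- vapply(
      df,
      function(col) length(unique(col[!is.na(col)])) == 1L,
      logical(1)
    )
    df <- df[, !is_constant, drop = FALSE]
    # stop once a full pass leaves the dimensions unchanged
    if (identical(dim(df), before)) break
  }
  df
}

# toy data (assumption): the third row only carries a value in 'site'
toy <- data.frame(
  id = c("a", "b", NA),
  site = c("x", "x", "x"),
  empty = NA,
  score = c(1, 2, NA)
)
cleaned_toy <- drop_constants(toy)
```

In this toy case the first pass removes the `empty` and `site` columns, which leaves the third row empty. That row is only removed during the second pass, which is why a single pass would not be enough.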
Remove duplicates
Description
When removing duplicates, users can specify a set of columns to consider with
the target_columns argument.
Usage
remove_duplicates(data, target_columns = NULL)
Arguments
data |
The input |
target_columns |
A |
Details
Caveat: In many epidemiological datasets, multiple rows may share the same value in one or more columns without being true duplicates. For example, several individuals might have the same symptom onset date and admission date. Be cautious when using this function—especially when applying it to a single target column—to avoid incorrect identification or removal of valid entries.
Value
The input data <data.frame> or <linelist> without the
duplicated rows identified from all or the specified columns.
Examples
data <- readRDS(
system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)
no_dups <- remove_duplicates(
data = data,
target_columns = "linelist_tags"
)
# print the removed duplicates
print_report(no_dups, "removed_duplicates")
# print the detected duplicates
print_report(no_dups, "found_duplicates")
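The caveat in the details section can be illustrated with base R's duplicated(). This is a sketch on a made-up data frame (`toy` and its column names are assumptions) and does not call remove_duplicates() itself. It only shows why restricting the comparison to a single column can flag valid entries.

```r
# Toy line list (assumption): P1 and P2 are distinct patients who share both
# dates, while the last row is a true repeat of P3.
toy <- data.frame(
  id = c("P1", "P2", "P3", "P3"),
  date_onset = c("2024-01-02", "2024-01-02", "2024-01-05", "2024-01-05"),
  date_admission = c("2024-01-04", "2024-01-04", "2024-01-06", "2024-01-06")
)
# comparing all columns flags only the repeated P3 row
dup_all <- duplicated(toy)
# comparing a single column also flags P2, which is a valid, distinct entry
dup_onset <- duplicated(toy["date_onset"])
```

Here `dup_all` marks only the fourth row, whereas `dup_onset` marks the second and fourth rows.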
Replace missing values with NA
Description
Replace missing values with NA
Usage
replace_missing_values(
data,
target_columns = NULL,
na_strings = cleanepi::common_na_strings
)
Arguments
data |
The input |
target_columns |
A |
na_strings |
A |
Value
The input data where missing values are replaced by NA.
Examples
data <- readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
)
# replace all occurrences of '-99' with NA
cleaned_data <- replace_missing_values(
data = data,
target_columns = NULL,
na_strings = "-99"
)
# print the names of the columns where the replacement occurred
print_report(cleaned_data, "missing_values_replaced_at")
Detect and replace values with NA from a vector
Description
Detect and replace values with NA from a vector
Usage
replace_with_na(x, na_strings)
Arguments
x |
A |
na_strings |
A |
Value
A <vector> where the specified values were replaced with NA if found.
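No example is given for this function, so the following base R sketch shows the kind of replacement described above. `to_na()` is a hypothetical stand-in written for illustration. It is not the package's code and may differ from replace_with_na() in how it handles types and edge cases.

```r
# Hypothetical helper (not part of cleanepi): set every element of 'x' that
# matches one of 'na_strings' to NA and leave the other elements untouched.
to_na <- function(x, na_strings) {
  x[x %in% na_strings] <- NA
  x
}

res <- to_na(c("12", "-99", "unknown", "7"), na_strings = c("-99", "unknown"))
```

The second and third elements of `res` become NA, while "12" and "7" are kept.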
Get column names
Description
When performing several data cleaning operations using the clean_data()
function, the input column names might be altered by the column name
cleaning step. As a consequence, some cleaning operations will fail due to
the column name mismatch. This function is provided to anticipate this
scenario, providing continuity between the cleaning operations.
Usage
retrieve_column_names(data, target_columns)
Arguments
data |
The input |
target_columns |
A |
Value
A <vector> of column names to be used for the target cleaning operations.
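The idea of keeping the cleaning operations in sync with renamed columns can be sketched in base R. Everything here is an assumption made for illustration: the `name_map` table, its `original_name` and `new_name` columns, and the `translate_names()` helper are hypothetical and do not describe how retrieve_column_names() is implemented.

```r
# Hypothetical mapping between the original and the cleaned column names.
name_map <- data.frame(
  original_name = c("date.of.admission", "dateOfBirth"),
  new_name = c("date_of_admission", "date_of_birth")
)

# Hypothetical helper: translate the requested names when they were renamed,
# and keep them unchanged otherwise.
translate_names <- function(target_columns, name_map) {
  idx <- match(target_columns, name_map$original_name)
  ifelse(is.na(idx), target_columns, name_map$new_name[idx])
}

targets <- translate_names(c("dateOfBirth", "sex"), name_map)
```

With this mapping, "dateOfBirth" is translated to "date_of_birth" and "sex", which was never renamed, is returned as is.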
Scan through a data frame and return the proportion of missing, numeric, Date, character, and logical values.
Description
The function checks for the existence of character columns in the data. When found, it reports back the proportion of the data types mentioned above in those columns. See the details section to know more about how it works.
Usage
scan_data(data, format = "proportion")
Arguments
data |
The input |
format |
A |
Details
How does it work?
The <character> columns are identified first. If no <character> columns are found, the function returns a message.
For each <character> column, the function counts:
- The number of missing values (NA).
- The number of numeric values. A process is initiated to detect valid dates among these numeric values using the lubridate::as_date() and date_guess() functions. If valid dates are found, a warning is triggered to alert about ambiguous numeric values potentially representing dates. Note: A date is considered valid if it falls within the range from today's date to 50 years in the past.
- The number of <Date> values detected from non-numeric data using the date_guess() function. The total date count includes dates from both numeric and non-numeric values. Due to this overlap, the sum of counts across rows in the scanning result may exceed 1.
- The count of <logical> values.
Remaining values are categorized as <character>.
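The counting steps above can be approximated in base R to show why the values for one column may add up to more than 1. This is a simplified sketch under several assumptions: `scan_column()` is hypothetical, numeric values are read as days since 1970-01-01, text dates are only recognised in the "%Y-%m-%d" format, and the reference date is passed explicitly so that the result is reproducible. The actual function relies on lubridate::as_date() and date_guess() and will recognise more cases.

```r
# Hypothetical helper (not part of cleanepi): proportion of each data type in
# a character vector, with numeric values that look like dates counted twice.
scan_column <- function(x, today = Sys.Date()) {
  is_missing <- is.na(x)
  num <- suppressWarnings(as.numeric(x))
  is_numeric <- !is_missing & !is.na(num)
  # numeric values read as days since 1970-01-01 (assumption) and kept as
  # dates only if they fall between 50 years ago and 'today'
  lower <- seq(today, by = "-50 years", length.out = 2)[2]
  num_date <- as.Date(num, origin = "1970-01-01")
  is_num_date <- is_numeric & !is.na(num_date) &
    num_date >= lower & num_date <= today
  # non-numeric values that parse as ISO dates (assumption: one format only)
  is_text_date <- !is_missing & !is_numeric &
    !is.na(as.Date(x, format = "%Y-%m-%d"))
  is_logical <- x %in% c("TRUE", "FALSE", "true", "false")
  # everything left over is treated as plain character data
  is_character <- !(is_missing | is_numeric | is_text_date | is_logical)
  c(
    missing = mean(is_missing),
    numeric = mean(is_numeric),
    date = mean(is_num_date | is_text_date),
    character = mean(is_character),
    logical = mean(is_logical)
  )
}

x <- c("2024-03-01", "19800", "42", "TRUE", "fever", NA)
res <- scan_column(x, today = as.Date("2025-01-01"))
```

In this example "19800" is counted both as a numeric value and as a date, because 19800 days after 1970-01-01 falls in 2024, so the five proportions sum to more than 1. The value "42" is numeric only, since the date it would represent is more than 50 years before the reference date.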
Value
A <data.frame> if the input data contains columns of type character. It
invisibly returns NA otherwise. The returned data frame will have the same
number of rows as the number of character columns, and six columns
representing their column names, proportion of missing, numeric, date,
character, and logical values.
Examples
# scan through a data frame of character columns only
scan_result <- scan_data(
data = readRDS(
system.file("extdata", "messy_data.RDS", package = "cleanepi")
)
)
# scan through a data frame with two character columns
scan_result <- scan_data(
data = readRDS(system.file("extdata", "test_linelist.RDS",
package = "cleanepi"))
)
# scan through a data frame with no character columns
data(iris)
iris[["fct"]] <- as.factor(sample(c("gray", "orange"), nrow(iris),
replace = TRUE))
iris[["lgl"]] <- sample(c(TRUE, FALSE), nrow(iris), replace = TRUE)
iris[["date"]] <- as.Date(seq.Date(from = as.Date("2024-01-01"),
to = as.Date("2024-08-30"),
length.out = nrow(iris)))
iris[["posit_ct"]] <- as.POSIXct(iris[["date"]])
scan_result <- scan_data(data = iris)
Scan through a character column
Description
Scan through a character column
Usage
scan_in_character(x, x_name, format)
Arguments
x |
The input |
x_name |
The name of the corresponding column |
format |
A |
Value
A <vector>
of <numeric>
with the proportion of the
different types of data that were detected within the input vector.
Standardize column names of a data frame or line list
Description
All column names will be reformatted to snake_case. When the conversion to
snake_case does not work as expected, use the keep and/or rename arguments
to reformat the column names properly.
Usage
standardize_column_names(data, keep = NULL, rename = NULL)
Arguments
data |
The input |
keep |
A |
rename |
A named |
Value
A <data.frame> or <linelist> with easy-to-work-with column names.
Examples
# do not rename 'date.of.admission'
cleaned_data <- standardize_column_names(
data = readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
),
keep = "date.of.admission"
)
# do not rename 'date.of.admission', but rename 'dateOfBirth' and 'sex' to
# 'DOB' and 'gender' respectively
cleaned_data <- standardize_column_names(
data = readRDS(
system.file("extdata", "test_df.RDS", package = "cleanepi")
),
keep = "date.of.admission",
rename = c(DOB = "dateOfBirth", gender = "sex")
)
# print the report
print_report(
data = cleaned_data,
what = "colnames"
)
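To make the effect of the snake_case conversion and of the keep argument more concrete, here is a rough base R approximation. `to_snake()` is a hypothetical helper based on regular expressions. The package's actual conversion may treat digits, accents, and duplicated names differently, so this sketch only conveys the general idea.

```r
# Hypothetical helper (not part of cleanepi): approximate snake_case
# conversion, leaving the names listed in 'keep' untouched.
to_snake <- function(x, keep = NULL) {
  out <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # split camelCase words
  out <- gsub("[^A-Za-z0-9]+", "_", out)          # unify the separators
  out <- gsub("^_+|_+$", "", out)                 # trim stray underscores
  out <- tolower(out)
  out[x %in% keep] <- x[x %in% keep]
  out
}

old_names <- c("study_id", "date.of.admission", "dateOfBirth", "sex")
new_names <- to_snake(old_names, keep = "date.of.admission")
```

With these inputs only "dateOfBirth" changes (to "date_of_birth"): "date.of.admission" is protected by `keep`, and the other two names are already in snake_case.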
Standardize date variables
Description
When the format of the values in a column and/or the target columns are not
defined, we strongly recommend checking a few converted dates manually to
make sure that the dates extracted from a character vector or a factor are
correct.
Usage
standardize_dates(
data,
target_columns = NULL,
format = NULL,
timeframe = NULL,
error_tolerance = 0.4,
orders = list(world_named_months = c("Ybd", "dby"), world_digit_months = c("dmy",
"Ymd"), US_formats = c("Omdy", "YOmd"))
)
Arguments
data |
The input |
target_columns |
A |
format |
A |
timeframe |
A |
error_tolerance |
A |
orders |
A list( quarter_partial_dates = c("Y", "Ym", "Yq"), world_digit_months = c("Yq", "ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm", "dmY", "mdY", "mYd", "dYm"), world_named_months = c("dby", "dyb", "bdy", "byd", "ybd", "ydb", "dbY", "dYb", "bdY", "bYd", "Ybd", "Ydb"), us_format = c("Omdy", "YOmd") ) |
Details
Check for the presence of date values that could have multiple formats
from the $multi_format_dates element of the report.
Converting ambiguous character strings to dates is difficult for many reasons:
- dates may not use the standard Ymd format
- within the same variable, dates may follow different formats
- dates may be mixed with things that are not dates
- the behavior of as.Date in the presence of non-date is hard to predict, sometimes returning NA, sometimes issuing an error.
This function tries to address all the above issues. Dates with the following format should be automatically detected, irrespective of separators (e.g. "-", " ", "/") and surrounding text:
"19 09 2018"
"2018 09 19"
"19 Sep 2018"
"2018 Sep 19"
"Sep 19 2018"
How it works
This function relies heavily on lubridate::parse_date_time(), which is an
extremely flexible date parser that works well for consistent date formats,
but can quickly become unwieldy and may produce spurious results.
standardize_dates() will use a list of formats in the orders argument to
run parse_date_time() with each format vector separately and take the first
correctly parsed date from all the trials.
With the default orders shown above, the dates 03 Jan 2018, 07/03/1982, and
08/20/85 are correctly interpreted as 2018-01-03, 1982-03-07, and 1985-08-20.
The examples section will show how you can manipulate the orders to be
customized for your situation.
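The "first correctly parsed date wins" strategy can be mimicked in base R. The sketch below is an analogue, not the package's implementation: `parse_first_match()` and `format_groups` are hypothetical, it uses as.Date() with strptime-style formats instead of lubridate::parse_date_time() orders, and it sticks to numeric formats with four-digit years so that the result does not depend on the locale.

```r
# Hypothetical groups of strptime formats, tried in the order listed.
format_groups <- list(
  iso = "%Y-%m-%d",
  world_digit_months = "%d/%m/%Y",
  us_formats = "%m/%d/%Y"
)

# Hypothetical helper: try each format in turn and keep, for every value, the
# first successful conversion.
parse_first_match <- function(x, format_groups) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in unlist(format_groups, use.names = FALSE)) {
    todo <- is.na(out)
    if (!any(todo)) break
    # only values that are still unparsed are attempted with this format
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

parsed <- parse_first_match(
  c("2018-01-03", "07/03/1982", "08/20/1985"),
  format_groups
)
```

The value "07/03/1982" is valid under both the day-first and the month-first formats. It is read as 1982-03-07 only because the day-first format is tried earlier, which is the kind of ambiguity the multi-format report element is meant to surface. The value "08/20/1985" fails the day-first format (there is no month 20) and is picked up by the month-first one.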
Value
The input dataset where the date columns have been standardized. The date values that are out of the specified timeframe will be reported in the report. Similarly, date values that comply with multiple formats will also be featured in the report object.
Examples
x <- c("03 Jan 2018", "07/03/1982", "08/20/85")
# The below will coerce values where the month is written in letters only
# into Date.
as.Date(lubridate::parse_date_time(x, orders = c("Ybd", "dby")))
# coerce values where the month is written in letters or numbers into Date.
as.Date(lubridate::parse_date_time(x, orders = c("dmy", "Ymd")))
# How to use standardize_dates()
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
# convert values in the 'date.of.admission' column into "%Y-%m-%d"
# format
dat <- standardize_dates(
data = data,
target_columns = "date.of.admission",
format = NULL,
timeframe = as.Date(c("2021-01-01", "2021-12-01")),
error_tolerance = 0.4,
orders = list(
world_named_months = c("Ybd", "dby"),
world_digit_months = c("dmy", "Ymd"),
US_formats = c("Omdy", "YOmd")
)
)
# print the report
print_report(dat, "date_standardization")
Calculate time span between dates
Description
Calculate time span between dates
Usage
timespan(
data,
target_column = NULL,
end_date = Sys.Date(),
span_unit = c("years", "months", "weeks", "days"),
span_column_name = "span",
span_remainder_unit = NULL
)
Arguments
data |
The input |
target_column |
A |
end_date |
The end date. It can be either a |
span_unit |
A |
span_column_name |
A |
span_remainder_unit |
A |
Value
The input <data.frame> with one or two additional columns:
- span, or any other name chosen by the user. This will contain the calculated time span in the desired units.
- "*_remainder": a column with the number of the remaining days or weeks or months depending on the value of the 'span_remainder_unit' parameter. The star represents here the value of the 'span_column_name' argument.
Examples
# In the example below, this function is used to calculate patients' ages
# from their dates of birth
# import the data, replace missing values with NA and convert date into ISO
# format
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
data <- data %>%
replace_missing_values(target_columns = "dateOfBirth",
na_strings = "-99") %>%
standardize_dates(target_columns = "dateOfBirth",
error_tolerance = 0.0)
# calculate the age in 'years' and return the remainder in 'months'
age <- timespan(
data = data,
target_column = "dateOfBirth",
end_date = Sys.Date(),
span_unit = "years",
span_column_name = "age_in_years",
span_remainder_unit = "months"
)
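The relationship between the span column and its remainder column can be worked through by hand in base R. This sketch is an assumption about the arithmetic only: `span_years_months()` is a hypothetical helper, the dates are made up, and timespan() itself may compute the values differently.

```r
# Hypothetical helper (not part of cleanepi): whole years between two dates
# and the remaining whole months.
span_years_months <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  # whole months elapsed; one month is subtracted when the day of the month
  # of 'start' has not been reached yet in the month of 'end'
  months <- (e$year - s$year) * 12 + (e$mon - s$mon) - (e$mday < s$mday)
  data.frame(span = months %/% 12, span_remainder = months %% 12)
}

age <- span_years_months(as.Date("1990-05-20"), as.Date("2024-03-15"))
```

For these two dates, 405 whole months have elapsed, which gives a span of 33 years and a remainder of 9 months.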
Flag the messages to be translated using the potools package
Description
Flag the messages to be translated using the potools package
Usage
tr_(...)
Arguments
... |
A character string. This represents the message to be translated |
Details
This function was copied from the "Translation for package developers"
vignette of the potools package.
Value
The input object
Unnest an element of the data cleaning report
Description
Unnest an element of the data cleaning report
Usage
unnest_report(report, what, ...)
Arguments
report |
An object of type |
what |
A |
... |
Any other extra argument |
Value
The input object where the specified element has been unnested and removed.