Type: | Package |
Title: | Clean Data Frames |
Version: | 0.2.1 |
Maintainer: | Harrison Tietze <Harrison4192@gmail.com> |
Description: | Provides a friendly interface for modifying data frames with a sequence of piped commands built upon the 'tidyverse' (Wickham et al., 2019) <doi:10.21105/joss.01686>. The majority of commands wrap 'dplyr' mutate statements in a convenient way to concisely solve common issues that arise when tidying small to medium data sets. Includes smart defaults and allows flexible selection of columns via 'tidyselect'. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
Imports: | dplyr, stringr, tidyselect, purrr, janitor, rlang, lubridate, magrittr, tibble, rstudioapi, forcats, bit64, rio, readr, vroom, fs, rlist, fastDummies |
RoxygenNote: | 7.3.1 |
Suggests: | knitr, rmarkdown, badger, readxl |
VignetteBuilder: | knitr |
URL: | https://harrison4192.github.io/framecleaner/, https://github.com/Harrison4192/framecleaner |
BugReports: | https://github.com/Harrison4192/framecleaner/issues |
NeedsCompilation: | no |
Packaged: | 2024-03-15 06:58:32 UTC; 81804 |
Author: | Harrison Tietze [aut, cre] |
Repository: | CRAN |
Date/Publication: | 2024-03-15 07:20:02 UTC |
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
as_integer16_or_64
Description
Coerces to integer. If a value is too large for an integer, coerces to a 64-bit integer instead.
Usage
as_integer16_or_64(x)
Arguments
x |
integerish vec |
Value
int or int64
auto setwd
Description
Call from a saved R script. Automatically sets your working directory to the directory that you saved the current R script in. Takes no arguments.
Usage
auto_setwd()
Value
No return value.
Clean Data Frame
Description
Uses the functions of framecleaner and other operations to apply cleaning operations to a data frame
Usage
clean_frame(.data)
Arguments
.data |
a data frame |
Details
Functions applied in clean_frame:
rename_with(.fn = enc2utf8)
clean_names(case = "all_caps", ascii = FALSE)
Value
data frame
Examples
iris %>%
clean_frame()
create dummies
Description
Adapted from the dummy_cols function of the fastDummies package. Adds the option to truncate the dummy column names and to specify dummy columns using tidyselect.
Usage
create_dummies(
.data,
...,
append_col_name = TRUE,
max_levels = 10L,
remove_first_dummy = FALSE,
remove_most_frequent_dummy = FALSE,
clean_names = TRUE,
ignore_na = FALSE,
split = NULL,
remove_selected_columns = TRUE
)
Arguments
.data |
data frame |
... |
tidyselect columns. default selection is all character or factor variables |
append_col_name |
logical, default TRUE. Appends original column name to dummy col name |
max_levels |
uses |
remove_first_dummy |
logical, default FALSE. |
remove_most_frequent_dummy |
logical, default FALSE |
clean_names |
logical, default TRUE. apply |
ignore_na |
logical, default FALSE |
split |
NULL |
remove_selected_columns |
logical, default TRUE |
Details
See the fastDummies package for documentation on the original function.
Value
data frame
Examples
iris %>%
create_dummies(Species, append_col_name = FALSE) %>%
tibble::as_tibble()
create flag
Description
create flag
Usage
create_flag(.data, col, flag, full_name = FALSE, drop = FALSE)
Arguments
.data |
data frame |
col |
column |
flag |
column entry |
full_name |
Logical, default FALSE. If TRUE, the new column name is the original name + flag; otherwise it is just the flag. |
drop |
Logical, default FALSE. If TRUE, drop the original column. |
Value
data frame
Examples
iris %>%
create_flag(
col = Species,
flag = "versicolor",
drop = TRUE) %>%
head()
date_yh
Description
creates a semesterly date vector from a date vector
Usage
date_yh(x)
Arguments
x |
a date |
Value
date vector
Examples
seq.Date(lubridate::ymd(20200101), lubridate::ymd(20220101), length.out = 10) -> d1
d1 %>%
tibble::enframe() %>%
dplyr::mutate(YH = date_yh(value))
date_ym
Description
creates a monthly date vector from a date vector
Usage
date_ym(x)
Arguments
x |
a date |
Value
date vector
Examples
seq.Date(lubridate::ymd(20200101), lubridate::ymd(20220101), length.out = 10) -> d1
d1 %>%
tibble::enframe() %>%
dplyr::mutate(YM = date_ym(value))
date_yq
Description
creates a quarterly date vector from a date vector
Usage
date_yq(x)
Arguments
x |
a date |
Value
date vector
Examples
seq.Date(lubridate::ymd(20200101), lubridate::ymd(20220101), length.out = 10) -> d1
d1 %>%
tibble::enframe() %>%
dplyr::mutate(YQ = date_yq(value))
Fill NAs
Description
Use tidyselect to fill NA values.
The default behavior is to fill all integer or double columns with 0, preserving their types.
Usage
fill_na(.data, ..., fill = 0L, missing_type = c("all", "NA", "NaN", "Inf"))
Arguments
.data |
data frame |
... |
tidyselect specification. Default selection: none |
fill |
value to fill missings |
missing_type |
character vector. Choose which type of missing value to fill. Default is all types. Choose from "all", "NA", "NaN", "Inf" |
Value
data frame
Examples
tibble::tibble(x = c(NA, 1L, 2L, NA, NaN, 5L, Inf)) -> tbl
tbl %>%
fill_na()
tbl %>%
fill_na(fill = 1L, missing_type = "Inf")
tbl %>%
fill_na(missing_type = "NaN")
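The missing_type options can be illustrated on a single numeric vector with base R. This is a minimal sketch: fill_missing_sketch is a hypothetical helper, not part of framecleaner, and it only approximates the behavior described above.

```r
# Hypothetical helper (not part of framecleaner): fills the chosen kind of
# missing value in a numeric vector.
fill_missing_sketch <- function(x, fill = 0,
                                missing_type = c("all", "NA", "NaN", "Inf")) {
  missing_type <- match.arg(missing_type)
  target <- switch(
    missing_type,
    "all" = is.na(x) | is.infinite(x),  # is.na() is TRUE for both NA and NaN
    "NA"  = is.na(x) & !is.nan(x),      # plain NA only
    "NaN" = is.nan(x),
    "Inf" = is.infinite(x)
  )
  x[target] <- fill
  x
}

x <- c(NA, 1, 2, NA, NaN, 5, Inf)
fill_missing_sketch(x)                                  # fills NA, NaN and Inf
fill_missing_sketch(x, fill = 1, missing_type = "Inf")  # fills Inf only
```

The point worth noticing is that is.na() alone cannot separate NA from NaN, so the "NA" branch has to exclude NaN explicitly.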
filter for
Description
Keeps all rows for every value of a column (for example, an ID) that meets a specific condition at least once.
Usage
filter_for(.data, what, where)
Arguments
.data |
data frame |
what |
unquoted column or vector of unquoted columns. |
where |
a logical condition used for filter |
Value
data frame
Examples
# An example using some time series data
tibble::tibble( CLIENT_ID = c("A1001", "B1001", "C1001",
"A1001", "B1001", "C1001", "A1001", "B1001", "C1001"),
YEAR = c(2019L, 2019L, 2019L, 2020L, 2020L, 2020L, 2021L, 2021L, 2021L),
SALES = c(3124, 56424, 3214132, 65534, 2342, 6566, 87654, 2332, 6565)
) %>%
dplyr::arrange(CLIENT_ID, YEAR) -> sales_data
sales_data
# filter for Clients that had sales greater than 4000 in the year 2019.
# this way we can see how the same clients sales looked in subsequent years
sales_data %>%
filter_for(what = CLIENT_ID, where = YEAR == 2019 & SALES > 4000L)
# filter for clients whose sales were less than 4000 in the year 2021
sales_data %>%
filter_for(what = CLIENT_ID, where = YEAR == 2021 & SALES < 4000L)
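The idea behind filter_for can be written as two steps in base R. This is a sketch of the concept under the description above, not the package's implementation; hit, keep_ids and result are names chosen for illustration.

```r
# Same sales data as in the example above, rebuilt as a plain data frame.
sales_data <- data.frame(
  CLIENT_ID = rep(c("A1001", "B1001", "C1001"), times = 3),
  YEAR = rep(c(2019L, 2020L, 2021L), each = 3),
  SALES = c(3124, 56424, 3214132, 65534, 2342, 6566, 87654, 2332, 6565)
)

# Step 1: find the clients that satisfy the condition in at least one row.
hit <- sales_data$YEAR == 2019 & sales_data$SALES > 4000
keep_ids <- unique(sales_data$CLIENT_ID[hit])

# Step 2: keep every row belonging to those clients, whatever the year.
result <- sales_data[sales_data$CLIENT_ID %in% keep_ids, ]
result
```

A plain filter on the condition would return only the 2019 rows; the second step is what brings back the other years for the same clients.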
filter out missings
Description
More complex wrapper around dplyr::filter(!is.na()) to remove NA rows using tidyselect. If any specified column contains an NA, the whole row is removed. Reports the number of rows removed containing NaN, NA, Inf, in that order.
For example, if one row contains Inf in one column and NA in another, the removed row will be counted in the NA tally.
Usage
filter_missing(.data, ..., remove_inf = TRUE)
## S3 method for class 'data.frame'
filter_missing(.data, ..., remove_inf = TRUE, condition = c("any", "all"))
Arguments
.data |
dataframe |
... |
tidyselect. default selection is all columns |
remove_inf |
logical. default is to also remove Inf values |
condition |
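defaults to "any", in which case a row is removed if any selected column is missing. With "all", a row is removed only if every selected column is missing |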
Details
S3 method, can also be used on vectors
Value
data frame
Examples
tibble::tibble(x = c(NA, 1L, 2L, NA, NaN, 5L, Inf),
y = c(1L, NA, 2L, NA, Inf, 5L, Inf)) -> tbl1
tbl1
# remove any row with a missing or Inf
tbl1 %>%
filter_missing()
# remove any row with NA or NaN in the x column
tbl1 %>%
filter_missing(x, remove_inf = FALSE)
# only remove rows where every entry is NA, NaN, or Inf
tbl1 %>%
filter_missing(condition = "all")
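The reporting order described above (NaN, then NA, then Inf) can be sketched in base R. This is an illustration of the tallying rule only, assuming each removed row is counted once under the first category that matches; the variable names are hypothetical.

```r
tbl1 <- data.frame(
  x = c(NA, 1, 2, NA, NaN, 5, Inf),
  y = c(1, NA, 2, NA, Inf, 5, Inf)
)

# Row-wise checks across all columns.
has_nan <- apply(tbl1, 1, function(r) any(is.nan(r)))
has_na  <- apply(tbl1, 1, function(r) any(is.na(r) & !is.nan(r)))
has_inf <- apply(tbl1, 1, function(r) any(is.infinite(r)))

# Each row is assigned to the first category that matches, in the order
# NaN, NA, Inf; rows matching none of them are kept.
tally <- ifelse(has_nan, "NaN",
                ifelse(has_na, "NA",
                       ifelse(has_inf, "Inf", "kept")))
table(tally)

kept <- tbl1[tally == "kept", ]
kept
```

Row 5 holds NaN in x and Inf in y, so under this rule it lands in the NaN tally rather than the Inf tally.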
import directory
Description
import directory
Usage
import_dir(
dir,
...,
method = c("rio", "vroom", "vroom_jp", "read_csv"),
return_type = c("df", "list")
)
Arguments
dir |
dir path |
... |
arguments passed to import method |
method |
import method, chosen from the methods of import_tibble |
return_type |
The default ("df") binds the data frames together and removes duplicates. This is only recommended for a folder of files with the same data format. Otherwise, specify "list" to return a list of data frames |
Value
data frame
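The two return types can be sketched with base R for the CSV-only case. import_dir_sketch is a hypothetical stand-in, not the package function: it assumes every file is a CSV with identical columns and uses read.csv in place of the import methods listed above.

```r
# Hypothetical base R analogue of import_dir(); reads CSV files only.
import_dir_sketch <- function(dir, return_type = c("df", "list")) {
  return_type <- match.arg(return_type)
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  frames <- lapply(files, read.csv, stringsAsFactors = FALSE)
  if (return_type == "list") {
    return(frames)
  }
  # Bind rows (assumes identical columns) and drop duplicated rows.
  unique(do.call(rbind, frames))
}

# Demo with two small files that share one row.
demo_dir <- file.path(tempdir(), "import_dir_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, v = c("a", "b")),
          file.path(demo_dir, "f1.csv"), row.names = FALSE)
write.csv(data.frame(id = 2:3, v = c("b", "c")),
          file.path(demo_dir, "f2.csv"), row.names = FALSE)

combined <- import_dir_sketch(demo_dir)
as_list <- import_dir_sketch(demo_dir, return_type = "list")
```

When the files do not share a format, the row bind fails or produces a misleading table, which is why the list return type exists.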
import tibble
Description
Wrapper around multiple file readers. The default is rio's import, set to return a tibble. Also available are vroom and vroom_jp for Japanese characters.
Usage
import_tibble(
path,
...,
method = c("rio", "vroom", "vroom_jp", "read_csv", "read_excel")
)
Arguments
path |
filepath |
... |
other arguments |
method |
method of import. default is rio |
Details
Supports multiple types of importing through the method argument.
Value
a tibble
is integerish character
Description
companion to rlang::is_integerish that returns true only for integerish character vectors.
Usage
is_integerish_character(x)
Arguments
x |
a vector |
Value
a logical
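One way to picture the check is a regular expression test on the non-missing elements. This is a sketch only: is_integerish_chr_sketch is a hypothetical helper, and the pattern (optional minus sign followed by digits) is an assumption about what counts as "integerish", not the package's rule.

```r
# Hypothetical helper (not part of framecleaner).
is_integerish_chr_sketch <- function(x) {
  if (!is.character(x)) {
    return(FALSE)  # only character vectors qualify
  }
  vals <- x[!is.na(x)]
  # Assumed rule: every non-missing element is an optional "-" plus digits.
  length(vals) > 0 && all(grepl("^-?[0-9]+$", vals))
}

is_integerish_chr_sketch(c("1", "2", "10"))
is_integerish_chr_sketch(c("1.5", "2"))
is_integerish_chr_sketch(1:3)
```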
is integery
Description
practical extension of rlang::is_integerish that supports integer64, integers in character strings, and doesn't coerce factors
Usage
is_integery(x)
Arguments
x |
a vector |
Value
logical
is missing or inf
Description
is missing or inf
Usage
is_missing_or_inf(x)
Arguments
x |
Value
logical
is_probability
Description
is_probability
Usage
is_probability(x)
Arguments
x |
numeric vector |
Value
logical
Make NAs
Description
Set elements to NA values using tidyselect specification. Don't use this function on columns of different modes at once. Defaults to choosing all character columns.
Usage
## S3 method for class 'data.frame'
make_na(.data, ..., vec = c("-", "", " ", "null", "NA", "NA_"))
make_na(.data, ..., vec = c("-", "", " ", "null", "NA", "NA_"))
Arguments
.data |
data frame |
... |
tidyselect. Default selection: all chr cols |
vec |
vector of possible elements to replace with NA |
Value
data frame
Examples
# easily set NA values. blank space and empty space are default options
tibble::tibble(x = c("a", "b", "", "d", " ", "", "e")) %>%
make_na()
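For a single character vector, the replacement amounts to a membership test against vec. The base R lines below are a sketch of that idea, not the package's code.

```r
# Default set of strings treated as missing, copied from the usage above.
vec <- c("-", "", " ", "null", "NA", "NA_")

x <- c("a", "b", "", "d", " ", "", "e")
x[x %in% vec] <- NA  # exact matches only; "a " or "Null" would be left alone
x
# "a" "b" NA "d" NA NA "e"
```

Because the match is exact, applying the same vec to columns of different modes would compare numbers against strings, which is presumably why the description warns against mixing modes.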
pad auto
Description
Automatically pads the elements of a column to the width of the widest element. Useful when an integer code with leading zeros is read in as an integer and needs to be fixed.
Usage
pad_auto(mdb, ..., side = "left", pad = "0")
Arguments
mdb |
data frame |
... |
tidyselect specification |
side |
str_pad side |
pad |
str_pad pad |
Value
data frame
Examples
# good for putting leading 0's
tibble::tibble(x = 1:10) %>%
pad_auto(x)
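The automatic width can be sketched in base R: measure the widest element, then left-pad everything else up to it. pad_auto_sketch is a hypothetical helper that handles left padding with a single pad character only; the package itself delegates to str_pad.

```r
# Hypothetical helper (not part of framecleaner); `pad` must be one character.
pad_auto_sketch <- function(x, pad = "0") {
  x <- as.character(x)
  width <- max(nchar(x))                    # width of the widest element
  paste0(strrep(pad, width - nchar(x)), x)  # left padding only
}

pad_auto_sketch(1:10)
pad_auto_sketch(c(2, 13, 86, 302))
```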
pad column
Description
wrapper around mutate and str_pad
Usage
pad_col(mdb, ..., width, pad = "0", side = "left")
Arguments
mdb |
data frame |
... |
tidyselect |
width |
str_pad width |
pad |
str_pad pad |
side |
str_pad side |
Value
data frame
Examples
# manually pad with 0's (or other value)
# use case over pad_auto(): the desired width is greater than the widest element
tibble::tibble(
ID = c(2, 13, 86, 302)
) %>%
pad_col(ID, width = 4)
recode_chr
Description
recode_chr
Usage
recode_chr(df, col, old_names, new_name, regex = FALSE, negate = FALSE)
Arguments
df |
data frame |
col |
unquoted col |
old_names |
character vector or regular expression |
new_name |
atomic chr string |
regex |
Logical, default F. Specify elements for old_names using a regex? |
negate |
logical, default F. If negating the regex, set to T |
Value
df
Examples
# Use a negative regex to rename all species other than "virginica" to "none"
iris %>%
recode_chr(
col = Species,
old_names = "vir",
new_name = "none",
regex = TRUE,
negate = TRUE) %>%
dplyr::count(Species)
# Specify old names using a regex
iris %>%
recode_chr(
col = Species,
old_names = "set|vir",
new_name = "other",
regex = TRUE) %>%
dplyr::count(Species)
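How regex and negate interact can be sketched on a plain vector in base R. recode_sketch is a hypothetical helper, not the package function; as a simplification it lets negate flip exact matches too, whereas the argument description above ties negate to the regex case.

```r
# Hypothetical helper (not part of framecleaner).
recode_sketch <- function(x, old_names, new_name, regex = FALSE, negate = FALSE) {
  x <- as.character(x)
  # Either a regex match or an exact match against the listed names.
  hit <- if (regex) grepl(old_names, x) else x %in% old_names
  if (negate) {
    hit <- !hit  # recode everything that did NOT match
  }
  x[hit] <- new_name
  x
}

species <- recode_sketch(iris$Species, old_names = "vir", new_name = "none",
                         regex = TRUE, negate = TRUE)
table(species)
```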
Relocate All
Description
Arranges columns alphabetically and then by type. The user can supply a tidyselect argument to specify columns that should come first.
Usage
relocate_all(.data, ..., regex = NULL)
Arguments
.data |
data frame |
... |
a tidyselect specification |
regex |
a regular expression to match columns that will be put at the front of the df |
Value
data frame
Examples
iris %>%
head %>%
relocate_all(matches("Petal"))
remove infs
Description
remove infs
Usage
remove_infs(x)
Arguments
x |
vec |
Value
vec
remove nans
Description
remove nans
Usage
remove_nans(x)
Arguments
x |
vec |
Value
vec
remove nas
Description
remove nas
Usage
remove_nas(x)
Arguments
x |
vec |
Value
vec
Remove Whitespace
Description
Remove whitespace from columns using a tidyselect specification.
Usage
remove_whitespace(.data, ...)
Arguments
.data |
data frame |
... |
tidyselect specification (default selection: all character columns) |
Value
data frame
Examples
tibble::tibble(a = c(" a ", "b ", " c")) -> t1
t1
t1 %>%
remove_whitespace()
select_otherwise
Description
flexible select operator that powers the tidy consultant universe. Used to set sensible defaults and flexibly return the chosen columns. A developer focused function, but may be useful in interactive programming due to the ability to return different types.
Usage
select_otherwise(
.data,
...,
otherwise = NULL,
col = NULL,
return_type = c("names", "index", "df")
)
Arguments
.data |
dataframe |
... |
tidyselect. columns to choose |
otherwise |
tidyselect. default columns to choose if ... is not specified |
col |
tidyselect. column to choose regardless of ... or otherwise specifications |
return_type |
choose to return column index, names, or df. defaults to index |
Value
integer vector by default. possibly data frame or character vector
Examples
iris %>%
select_otherwise(where(is.double), return_type = "index")
set character
Description
set character
Usage
set_chr(.data, ...)
Arguments
.data |
dataframe |
... |
tidyselect. Default selection: none |
Value
dataframe
Examples
iris %>%
tibble::as_tibble() %>%
set_chr(tidyselect::everything())
set date
Description
set dates manually or automatically
Usage
set_date(.data, ..., date_fn = lubridate::ymd)
Arguments
.data |
dataframe |
... |
tidyselect |
date_fn |
a function to convert to a date object |
Details
Note: can be called without any ... arguments, in which case it automatically determines which character columns are actually dates and then sets them. It checks for the date format specified in date_fn and also ymd_hms.
In auto-detect mode, it sets ymd_hms output to ymd dates instead of datetimes with hms. This is because of the common occurrence of trying to extract a ymd date from an Excel workbook and having it come with an extra 00:00:00. If you need a datetime, manually supply the appropriate lubridate function.
Auto mode is experimental. A commonly detected error is a long character string of integers being interpreted as a date.
Value
tibble
Examples
tibble::tibble(date_col1 = c("20190101", "20170205"),
date_col2 = c("20201015", "20180909"),
not_date_col = c("a345", "b040")) -> t1
t1
t1 %>%
set_date()
t1 %>%
set_date(date_col1)
set double
Description
set double
Usage
set_dbl(.data, ...)
## S3 method for class 'character'
set_dbl(.data, ...)
## S3 method for class 'factor'
set_dbl(.data, ...)
## S3 method for class 'Date'
set_dbl(.data, ...)
## S3 method for class 'numeric'
set_dbl(.data, ...)
## S3 method for class 'integer64'
set_dbl(.data, ...)
## S3 method for class 'data.frame'
set_dbl(.data, ...)
Arguments
.data |
dataframe |
... |
tidyselect. Default selection: none |
Value
tibble
Examples
date_col <- c(lubridate::ymd(20180101), lubridate::ymd(20210420))
tibble::tibble(int = c(1L, 2L),
fct = factor(c(10, 11)),
date = date_col,
chr = c("a2.1", "rtg50.5")) -> t1
t1
t1 %>%
set_dbl(tidyselect::everything())
# s3 method works for vectors individually
# custom date coercion to represent date as a number. For lubridate's coercion method, use set_int
date_col %>%
set_dbl
set factor
Description
Allows manually setting the first level of the factor, for consistency with yardstick, which automatically considers the first level as the "positive class" when evaluating classification.
Usage
set_fct(
.data,
...,
first_level = NULL,
order_fct = FALSE,
labels = NULL,
max_levels = Inf
)
## S3 method for class 'data.frame'
set_fct(.data, ..., first_level = NULL, order_fct = FALSE, max_levels = Inf)
## Default S3 method:
set_fct(.data, ..., first_level = NULL, order_fct = FALSE, max_levels = Inf)
Arguments
.data |
dataframe |
... |
tidyselect (default selection: all character columns) |
first_level |
character string to set the first level of the factor |
order_fct |
logical. ordered factor? |
labels |
chr vector of labels, length equal to factor levels |
max_levels |
integer. uses |
Value
tibble
Examples
## simply set the first level of a factor
iris$Species %>% levels
iris %>%
set_fct(Species, first_level = "virginica") %>%
dplyr::pull(Species) %>%
levels()
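For comparison, base R's relevel() has the same effect on a single factor: the chosen level moves to the front and the remaining levels keep their order. This is offered as an analogue, not as a description of how set_fct works internally.

```r
# Move "virginica" to the first position of the Species factor.
releveled <- relevel(iris$Species, ref = "virginica")
lv <- levels(releveled)
lv
# "virginica" "setosa" "versicolor"
```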
set integer
Description
set integer
Usage
set_int(.data, ...)
## S3 method for class 'data.frame'
set_int(.data, ...)
## S3 method for class 'grouped_df'
set_int(.data, ...)
Arguments
.data |
dataframe |
... |
tidyselect. Default selection: integerish doubles or integerish characters |
Value
tibble
Examples
int_vec <- c("1", "2", "10")
tibble::tibble(
chr_int = int_vec,
dbl_int = c(1.0, 5.0, 20.0),
chr_int64 = c("1033493932", "4432500065", "30303022192"),
string_int = c("SALES2020", "SALES2021", "SALES2022")) -> tbl
# automatically coerce integerish cols in a tibble
tbl
# integerish doubles or chars will be detected for coercion automatically
tbl %>%
set_int()
# string_int requires parsing, so it must be specified directly for coercion
tbl %>%
set_int(matches("str|chr"))
# s3 method works for vectors as well
int_vec
int_vec %>%
set_int()
set logical
Description
note: for non-binary data, all values other than the true_level will be set to false
Usage
## S3 method for class 'data.frame'
set_lgl(.data, ..., true_level = 1L)
set_lgl(.data, ..., true_level = 1L)
## Default S3 method:
set_lgl(.data, ...)
## S3 method for class 'numeric'
set_lgl(.data, ..., true_level = 1L)
## S3 method for class 'character'
set_lgl(.data, ..., true_level = c("T", "TRUE"))
Arguments
.data |
dataframe |
... |
tidyselect. Default selection: none |
true_level |
specify the value to set as TRUE. Default value is 1 for seamless conversion between logicals and integers. Can be given as a vector of values. |
Value
dataframe
Examples
# convert a 1/0 vector back into T/F
tibble::tibble(x = c(1, 0, 0, 1, 0, 1)) %>%
set_lgl(x)
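The role of true_level can be sketched with a membership test in base R. to_lgl_sketch is a hypothetical helper; unlike the package methods it turns NA into FALSE, so treat it as an illustration of the rule rather than a replacement.

```r
# Hypothetical helper (not part of framecleaner): TRUE where the value is one
# of the true levels, FALSE everywhere else (including NA in this sketch).
to_lgl_sketch <- function(x, true_level = 1L) {
  x %in% true_level
}

to_lgl_sketch(c(1, 0, 0, 1, 0, 1))
to_lgl_sketch(c("T", "F", "TRUE", "yes"), true_level = c("T", "TRUE"))
to_lgl_sketch(c(1, 2, 3))  # non-binary: everything except 1 becomes FALSE
```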
set type groups
Description
set type groups
Usage
set_type_groups(.data, ..., setter)
Arguments
.data |
dataframe |
... |
tidyselect. default selection is integerish doubles or integerish characters |
setter |
which setter function to use |
Value
tibble