Title: | Summarise Continuous, Date and Categorical Variables, Check for Duplicates and Missing Data |
Version: | 0.1 |
Description: | Explore continuous, date and categorical variables. 'sumvar' aims to bring the ease and simplicity of the "sum" and "tab" functions from 'stata'. |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | dplyr, ggplot2, lubridate, magrittr, patchwork, purrr, rlang, scales, stats, tibble, tidyr, utils |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
URL: | https://github.com/alstockdale/sumvar, https://alstockdale.github.io/sumvar/ |
BugReports: | https://github.com/alstockdale/sumvar/issues |
License: | MIT + file LICENSE |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-06-11 17:14:46 UTC; al_st |
Author: | Alexander Stockdale [aut, cre] |
Maintainer: | Alexander Stockdale <a.stockdale@liverpool.ac.uk> |
Repository: | CRAN |
Date/Publication: | 2025-06-13 20:00:02 UTC |
sumvar: Summarise Continuous and Categorical Variables in R
Description
The sumvar package explores continuous and categorical variables. sumvar brings the ease and simplicity of the "sum" and "tab" functions from Stata to R.
To explore a continuous variable, use
dist_sum()
. You can stratify by a grouping variable:df %>% dist_sum(var, group)
To explore dates, use
dist_date()
; usage is the same asdist_sum()
.To summarise a single categorical variable use
tab1()
, e.g.df %>% tab1(var)
. For a two-way table, usetab()
, e.g.df %>% tab(var1, var2)
. Both include options for frequentist hypothesis tests.Explore duplicates and missing values with with
dup()
.
All functions are tidyverse/dplyr-friendly and accept the %>%
pipe, outputting results as a tibble. You can save outputs for further manipulation, e.g. summary <- df %>% dist_sum(var)
.
Author(s)
Maintainer: Alexander Stockdale a.stockdale@liverpool.ac.uk
See Also
Useful links:
Report bugs at https://github.com/alstockdale/sumvar/issues
Pipe operator
Description
See magrittr::%>%
for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs)
.
Summarize and visualize a date variable
Description
Summarises the minimum, maximum, median, and interquartile range of a date variable, optionally stratified by a grouping variable. Produces a histogram and (optionally) a density plot.
Usage
dist_date(data, var, by = NULL)
Arguments
data |
A data frame or tibble. |
var |
The date variable to summarise. |
by |
Optional grouping variable. |
Value
A tibble with summary statistics for the date variable.
See Also
dist_sum
for continuous variables.
Examples
# Example ungrouped
df <- tibble::tibble(
dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE)
)
dist_date(df, dt)
# Example grouped
df2 <- tibble::tibble(
dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE),
grp = sample(1:2, 100, TRUE)
)
dist_date(df2, dt, grp)
# Note this function accepts a pipe from dplyr eg. df %>% dist_date(date_var, group_var)
Explore a continuous variable.
Description
Summarises the median, interquartile range, mean, standard deviation, confidence intervals of the mean and produces a density plot, stratified by a second grouping variable.
Provides frequentist hypothesis tests for comparison between groups: T test and Wilcoxon rank sum for 2 groups, Anova and Kruskall wallis test for 3 or more groups.
The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.
Usage
dist_sum(data, var, by = NULL)
Arguments
data |
The data frame or tibble |
var |
The variable you would like to summarise |
by |
The grouping variable |
Value
A tibble with a summary of the variable frequency (n), number of missing observations (n_miss), median, interquartile range, mean, SD, 95% confidence intervals of the mean (using the Z distribution), and density plots.
Shows the T test (p_ttest) and Wilcoxon rank sum (p_wilcox) hypothesis tests when there are two groups And an Anova test (p_anova) and Kruskal-Wallis test (p_kruskal) when there are three or more groups.
Examples
example_data <- dplyr::tibble(id = 1:100, age = rnorm(100, mean = 30, sd = 10),
group = sample(c("a", "b", "c", "d"),
size = 100, replace = TRUE))
dist_sum(example_data, age, group)
example_data <- dplyr::tibble(id = 1:100, age = rnorm(100, mean = 30, sd = 10),
sex = sample(c("male", "female"),
size = 100, replace = TRUE))
dist_sum(example_data, age, sex)
summary <- dist_sum(example_data, age, sex) # Save summary statistics as a tibble.
Explore duplicate and missing data
Description
Provides an integer value for the number of duplicates found within a variable The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.
eg. example_data %>% dup(variable)
Usage
dup(data, var = NULL)
Arguments
data |
The data frame or tibble |
var |
The variable to assess |
Value
A tibble with the number and percentage of duplicate values found, and the number of missing values (NA), together with percentages.
Examples
example_data <- dplyr::tibble(id = 1:200, age = round(rnorm(200, mean = 30, sd = 50), digits=0))
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
dup(example_data, age)
# It is also possible to pass a whole database to dup and it will explore all variables.
example_data <- dplyr::tibble(age = round(rnorm(200, mean = 30, sd = 50), digits=0),
sex = sample(c("Male", "Female"), 200, TRUE),
favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE))
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
example_data$sex[sample(1:200, size = 32)] <- NA # Replace 32 values with missing.
dup(example_data)
Create a cross-tabulation of two categorial variables
Description
Creates a "n x n" cross-tabulation of two categorical variables, with row percentages. Includes options for adding frequentist hypothesis testing.
The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.
eg. example_data %>% tab(variable1, variable2)
Usage
tab(data, variable1, variable2, test = "none")
Arguments
data |
The data frame or tibble |
variable1 |
The first categorical variable |
variable2 |
The second categorical variable |
test |
Optional frequentist hypothesis test, use test=exact for Fisher's exact or test=chi for Chi squared |
Value
A tibble with a cross-tabulation of frequencies and row percentages
Examples
example_data <- dplyr::tibble(id = 1:100, group1 = sample(c("a", "b", "c", "d"),
size = 100, replace = TRUE),
group2= sample(c("male", "female"),
size = 100, replace = TRUE))
example_data$group1[sample(1:100, size = 10)] <- NA # Replace 10 with missing
tab(example_data, group1, group2)
summary <- tab(example_data, group1, group2) # Save summary statistics as a tibble.
Summarise a categorial variable
Description
Summarises frequencies and percentages for a categorical variable.
The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble. eg. example_data %>% tab1(variable)
Usage
tab1(data, variable, dp = 1)
Arguments
data |
The data frame or tibble |
variable |
The categorical variable you would like to summarise |
dp |
The number of decimal places for percentages (default=2) |
Value
A tibble with frequencies and percentages
Examples
example_data <- dplyr::tibble(id = 1:100, group = sample(c("a", "b", "c", "d"),
size = 100, replace = TRUE))
example_data$group[sample(1:100, size = 10)] <- NA # Replace 10 with missing
tab1(example_data, group)
summary <- tab1(example_data, group) # Save summary statistics as a tibble.