Title: Summarise Continuous, Date and Categorical Variables, Check for Duplicates and Missing Data
Version: 0.1
Description: Explore continuous, date and categorical variables. 'sumvar' aims to bring the ease and simplicity of the "sum" and "tab" functions from 'Stata' to R.
Encoding: UTF-8
RoxygenNote: 7.3.2
Imports: dplyr, ggplot2, lubridate, magrittr, patchwork, purrr, rlang, scales, stats, tibble, tidyr, utils
Suggests: knitr, rmarkdown, testthat (>= 3.0.0)
Config/testthat/edition: 3
URL: https://github.com/alstockdale/sumvar, https://alstockdale.github.io/sumvar/
BugReports: https://github.com/alstockdale/sumvar/issues
License: MIT + file LICENSE
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2025-06-11 17:14:46 UTC; al_st
Author: Alexander Stockdale [aut, cre]
Maintainer: Alexander Stockdale <a.stockdale@liverpool.ac.uk>
Repository: CRAN
Date/Publication: 2025-06-13 20:00:02 UTC

sumvar: Summarise Continuous and Categorical Variables in R

Description

The sumvar package explores continuous and categorical variables. sumvar brings the ease and simplicity of the "sum" and "tab" functions from Stata to R.

All functions are tidyverse/dplyr-friendly and accept the %>% pipe, outputting results as a tibble. You can save outputs for further manipulation, e.g. summary <- df %>% dist_sum(var).

Author(s)

Maintainer: Alexander Stockdale a.stockdale@liverpool.ac.uk

See Also

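Useful links:

https://github.com/alstockdale/sumvar
https://alstockdale.github.io/sumvar/
Report bugs at https://github.com/alstockdale/sumvar/issues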


Pipe operator

Description

See magrittr::%>% for details.

Usage

lhs %>% rhs

Arguments

lhs

A value or the magrittr placeholder.

rhs

A function call using the magrittr semantics.

Value

The result of calling rhs(lhs).


Summarise and visualise a date variable

Description

Summarises the minimum, maximum, median, and interquartile range of a date variable, optionally stratified by a grouping variable. Produces a histogram and (optionally) a density plot.

Usage

dist_date(data, var, by = NULL)

Arguments

data

A data frame or tibble.

var

The date variable to summarise.

by

Optional grouping variable.

Value

A tibble with summary statistics for the date variable.

See Also

dist_sum for continuous variables.

Examples

# Example ungrouped
df <- tibble::tibble(
  dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE)
)
dist_date(df, dt)

# Example grouped
df2 <- tibble::tibble(
  dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE),
  grp = sample(1:2, 100, TRUE)
)
dist_date(df2, dt, grp)
# Note: this function accepts piped input, e.g. df %>% dist_date(date_var, group_var)
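The statistics listed in the Description can be reproduced by hand to check the output. The following is a minimal base-R sketch, not the package's implementation: the helper name summarise_dates and its column names are hypothetical, and the quartile method (the default of stats::quantile, rounded to whole days) is an assumption.

```r
# Sketch only: summarise_dates() is a hypothetical helper, not part of sumvar.
set.seed(1)
dt  <- as.Date("2020-01-01") + sample(0:1000, 100, TRUE)
grp <- sample(1:2, 100, TRUE)

summarise_dates <- function(x) {
  x <- x[!is.na(x)]
  # Compute quartiles on the underlying day counts, then convert back to Date.
  q <- stats::quantile(as.numeric(x), probs = c(0.25, 0.5, 0.75), names = FALSE)
  q <- as.Date(round(q), origin = "1970-01-01")
  data.frame(n = length(x), min = min(x), q25 = q[1],
             median = q[2], q75 = q[3], max = max(x))
}

# Ungrouped summary
date_summary <- summarise_dates(dt)
print(date_summary)

# Grouped summary: one row per level of grp
grouped_summary <- do.call(rbind, lapply(split(dt, grp), summarise_dates))
print(grouped_summary)
```

Working on the numeric day counts avoids relying on how quantile() treats Date objects in a given R version.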

Explore a continuous variable

Description

Summarises the median, interquartile range, mean, standard deviation and confidence interval of the mean, and produces a density plot, optionally stratified by a grouping variable.

Provides frequentist hypothesis tests for comparison between groups: a t-test and Wilcoxon rank-sum test for two groups, and ANOVA and a Kruskal-Wallis test for three or more groups.

The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.

Usage

dist_sum(data, var, by = NULL)

Arguments

data

The data frame or tibble

var

The variable you would like to summarise

by

Optional grouping variable

Value

A tibble with a summary of the variable frequency (n), number of missing observations (n_miss), median, interquartile range, mean, SD, 95% confidence intervals of the mean (using the Z distribution), and density plots.

Shows the t-test (p_ttest) and Wilcoxon rank-sum (p_wilcox) hypothesis tests when there are two groups, and an ANOVA (p_anova) and Kruskal-Wallis test (p_kruskal) when there are three or more groups.
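The choice of test by number of groups can be sketched with the stats package. This is an illustration under stated assumptions rather than the package's code: compare_groups is a hypothetical helper, and the Welch variant of the t-test (the t.test default) is assumed because the documentation does not say which variant is used.

```r
# Sketch only: compare_groups() is a hypothetical helper, not part of sumvar.
set.seed(42)
age   <- rnorm(100, mean = 30, sd = 10)
sex   <- sample(c("male", "female"), 100, replace = TRUE)
group <- sample(c("a", "b", "c", "d"), 100, replace = TRUE)

compare_groups <- function(x, g) {
  g <- factor(g)
  if (nlevels(g) == 2) {
    # Two groups: t-test (Welch by default, an assumption) and Wilcoxon rank-sum
    c(p_ttest  = stats::t.test(x ~ g)$p.value,
      p_wilcox = stats::wilcox.test(x ~ g)$p.value)
  } else if (nlevels(g) >= 3) {
    # Three or more groups: one-way ANOVA and Kruskal-Wallis
    c(p_anova   = summary(stats::aov(x ~ g))[[1]][["Pr(>F)"]][1],
      p_kruskal = stats::kruskal.test(x ~ g)$p.value)
  } else {
    stop("At least two groups are required for a comparison.")
  }
}

p_two  <- compare_groups(age, sex)    # named vector: p_ttest, p_wilcox
p_four <- compare_groups(age, group)  # named vector: p_anova, p_kruskal
print(p_two)
print(p_four)
```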

Examples

example_data <- dplyr::tibble(id = 1:100, age = rnorm(100, mean = 30, sd = 10),
                              group = sample(c("a", "b", "c", "d"),
                              size = 100, replace = TRUE))
dist_sum(example_data, age, group)
example_data <- dplyr::tibble(id = 1:100, age = rnorm(100, mean = 30, sd = 10),
                             sex = sample(c("male", "female"),
                             size = 100, replace = TRUE))
dist_sum(example_data, age, sex)
summary <- dist_sum(example_data, age, sex) # Save summary statistics as a tibble.
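The Value section states that the 95% confidence interval of the mean uses the Z distribution. A minimal base-R sketch of that calculation follows; the variable names are illustrative, and dropping missing values before computing the statistics is an assumption.

```r
# Sketch only: a Z-based 95% confidence interval of the mean in base R.
set.seed(42)
age <- rnorm(100, mean = 30, sd = 10)

n <- sum(!is.na(age))              # non-missing observations
m <- mean(age, na.rm = TRUE)
s <- stats::sd(age, na.rm = TRUE)
z <- stats::qnorm(0.975)           # two-sided 95% critical value, about 1.96

se <- s / sqrt(n)                  # standard error of the mean
ci <- c(lower = m - z * se, upper = m + z * se)

# Median and interquartile range for comparison
quartiles <- stats::quantile(age, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

print(ci)
print(quartiles)
```

With small samples a t-based interval would be wider than this Z-based one; the two converge as n grows.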

Explore duplicate and missing data

Description

Provides an integer count of the duplicates found within a variable. The function accepts input from a dplyr pipe "%>%" and outputs the results as a tibble.

e.g. example_data %>% dup(variable)

Usage

dup(data, var = NULL)

Arguments

data

The data frame or tibble

var

The variable to assess

Value

A tibble with the number and percentage of duplicate values found, and the number of missing values (NA), together with percentages.

Examples

example_data <- dplyr::tibble(id = 1:200, age = round(rnorm(200, mean = 30, sd = 50), digits=0))
example_data$age[sample(1:200, size = 15)] <- NA  # Replace 15 values with missing.
dup(example_data, age)
# You can also pass a whole data frame to dup, and it will explore all variables.
example_data <- dplyr::tibble(age = round(rnorm(200, mean = 30, sd = 50), digits=0),
                              sex = sample(c("Male", "Female"), 200, TRUE),
                              favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE))
example_data$age[sample(1:200, size = 15)] <- NA  # Replace 15 values with missing.
example_data$sex[sample(1:200, size = 32)] <- NA  # Replace 32 values with missing.
dup(example_data)
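The counts described under Value can be checked by hand with base R. The sketch below is not the package's implementation: count_dups is a hypothetical helper, and it assumes that a duplicate is any non-missing value repeating an earlier one, and that percentages use all rows as the denominator.

```r
# Sketch only: count_dups() is a hypothetical helper, not part of sumvar.
count_dups <- function(x) {
  n           <- length(x)
  n_miss      <- sum(is.na(x))
  non_missing <- x[!is.na(x)]
  # duplicated() flags every repeat of a value after its first occurrence
  n_dup <- sum(duplicated(non_missing))
  data.frame(n = n,
             n_dup = n_dup,   pct_dup  = round(100 * n_dup / n, 1),
             n_miss = n_miss, pct_miss = round(100 * n_miss / n, 1))
}

# Single variable: 2 repeats once, 3 repeats twice, and two values are missing
x <- c(1, 2, 2, 3, 3, 3, NA, NA)
single <- count_dups(x)
print(single)

# Whole data frame: one row of results per column
df <- data.frame(id = 1:8, x = x)
all_vars <- do.call(rbind, lapply(df, count_dups))
print(all_vars)
```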

Create a cross-tabulation of two categorical variables

Description

Creates an "n x n" cross-tabulation of two categorical variables, with row percentages. Includes options for adding frequentist hypothesis testing.

The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.

e.g. example_data %>% tab(variable1, variable2)

Usage

tab(data, variable1, variable2, test = "none")

Arguments

data

The data frame or tibble

variable1

The first categorical variable

variable2

The second categorical variable

test

Optional frequentist hypothesis test: use test = "exact" for Fisher's exact test or test = "chi" for the chi-squared test (default "none")

Value

A tibble with a cross-tabulation of frequencies and row percentages

Examples

example_data <- dplyr::tibble(id = 1:100, group1 = sample(c("a", "b", "c", "d"),
                                                  size = 100, replace = TRUE),
                                                  group2= sample(c("male", "female"),
                                                  size = 100, replace = TRUE))
example_data$group1[sample(1:100, size = 10)] <- NA  # Replace 10 with missing
tab(example_data, group1, group2)
summary <- tab(example_data, group1, group2) # Save summary statistics as a tibble.
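For comparison, the same kind of table can be built with base R and the stats package. This is a sketch under stated assumptions, not the package's code: the variable names are illustrative, and missing values are left out of the table here, which may differ from how tab handles them.

```r
# Sketch only: cross-tabulation with row percentages and both tests in base R.
set.seed(7)
group1 <- sample(c("a", "b", "c", "d"), 100, replace = TRUE)
group2 <- sample(c("male", "female"), 100, replace = TRUE)

counts  <- table(group1, group2)                        # frequencies
row_pct <- round(100 * prop.table(counts, margin = 1), 1)  # each row sums to ~100

p_chi   <- stats::chisq.test(counts)$p.value    # chi-squared test
p_exact <- stats::fisher.test(counts)$p.value   # Fisher's exact test

print(counts)
print(row_pct)
print(c(p_chi = p_chi, p_exact = p_exact))
```

Fisher's exact test is usually preferred when expected cell counts are small; the chi-squared test is cheaper to compute for large tables.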

Summarise a categorical variable

Description

Summarises frequencies and percentages for a categorical variable.

The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble, e.g. example_data %>% tab1(variable)

Usage

tab1(data, variable, dp = 1)

Arguments

data

The data frame or tibble

variable

The categorical variable you would like to summarise

dp

The number of decimal places for percentages (default = 1)

Value

A tibble with frequencies and percentages

Examples

example_data <- dplyr::tibble(id = 1:100, group = sample(c("a", "b", "c", "d"),
                                                  size = 100, replace = TRUE))
example_data$group[sample(1:100, size = 10)] <- NA  # Replace 10 with missing
tab1(example_data, group)
summary <- tab1(example_data, group) # Save summary statistics as a tibble.
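A one-way frequency table with rounded percentages can be sketched in base R as follows. The helper freq_table is hypothetical, and counting missing values as their own category in the denominator is an assumption, not documented behaviour of tab1.

```r
# Sketch only: freq_table() is a hypothetical helper, not part of sumvar.
freq_table <- function(x, dp = 1) {
  # useNA = "ifany" adds a row for missing values when there are any
  counts <- table(x, useNA = "ifany")
  data.frame(value   = names(counts),
             n       = as.integer(counts),
             percent = round(100 * as.integer(counts) / length(x), dp))
}

x <- c("a", "a", "b", NA)
freq <- freq_table(x, dp = 1)
print(freq)
```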