Title: | Compare Data Frames |
Version: | 0.3.0 |
Description: | A toolset for interactively exploring the differences between two data frames. |
License: | MIT + file LICENSE |
Suggests: | testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.3 |
Imports: | rlang (≥ 1.1.0), cli, dplyr (≥ 1.1.0), glue, tidyselect (≥ 1.2.0), vctrs (≥ 0.6.4), tibble, pillar, purrr, collapse (≥ 2.0.9), data.table |
URL: | https://eutwt.github.io/versus/, https://github.com/eutwt/versus |
BugReports: | https://github.com/eutwt/versus/issues |
Depends: | R (≥ 4.1.0) |
LazyData: | true |
Config/Needs/website: | rmarkdown |
NeedsCompilation: | yes |
Packaged: | 2024-01-12 00:13:02 UTC; mbp |
Author: | Ryan Dickerson [aut, cre, cph] |
Maintainer: | Ryan Dickerson <fresh.tent5866@fastmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-01-12 00:30:02 UTC |
versus: Compare Data Frames
Description
Compare two tables
Author(s)
Maintainer: Ryan Dickerson fresh.tent5866@fastmail.com [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/eutwt/versus/issues
Compare two data frames
Description
compare() creates a representation of the differences between two tables, along with a shallow copy of the tables. This output is used as the comparison argument when exploring the differences further with other versus functions, e.g. slice_*() and weave_*().
Usage
compare(table_a, table_b, by, allow_both_NA = TRUE, coerce = TRUE)
Arguments
table_a |
A data frame |
table_b |
A data frame |
by |
< |
allow_both_NA |
Logical. If |
coerce |
Logical. If |
Value
compare()
A list of data frames having the following elements:
- tables
A data frame with one row per input table showing the number of rows and columns in each.
- by
A data frame with one row per by column showing the class of the column in each of the input tables.
- intersection
A data frame with one row per column common to table_a and table_b, with the following columns: "n_diffs", the number of values which differ between the two tables; "class_a"/"class_b", the class of the column in each table; and "value_diffs", a (nested) data frame showing the values in each table which are unequal, along with the by columns.
- unmatched_cols
A data frame with one row per column which is in one input table but not the other, with columns "table" (which table the column appears in), "column" (the name of the column), and "class" (the class of the column).
- unmatched_rows
A data frame which, for each row present in one input table but not the other, contains the column "table" showing which table the row appears in and the by columns for that row.
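The elements listed above can be inspected directly on the returned list. The following is a minimal sketch using the example data shipped with the package; it assumes only the element names documented in this section.

```r
library(versus)

comp <- compare(example_df_a, example_df_b, by = car)

# Names of the elements held in the comparison
names(comp)

# Per-column summary of differences for columns shared by both tables
comp$intersection

# Columns present in only one of the input tables
comp$unmatched_cols

# Rows (identified by their `by` values) present in only one input table
comp$unmatched_rows
```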
data.table inputs
If the input is a data.table, you may want compare() to make a deep copy instead of a shallow copy so that future changes to the table don't affect the comparison. To achieve this, you can set options(versus.copy_data_table = TRUE).
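As a rough sketch of how this option might be used, the example below converts the example data to data.table objects (an assumption made for illustration), sets the option before calling compare(), and restores the previous setting afterwards. The by-reference modification uses data.table::set().

```r
library(versus)

# Ask compare() to deep-copy data.table inputs; keep the old setting to restore later
old_opts <- options(versus.copy_data_table = TRUE)

dt_a <- data.table::as.data.table(example_df_a)
dt_b <- data.table::as.data.table(example_df_b)
comp <- compare(dt_a, dt_b, by = car)

# Modify dt_a by reference after the comparison has been created
data.table::set(dt_a, j = "mpg", value = 0)

# Per the note above, the comparison should be unaffected by that change
comp |> slice_diffs("a", mpg)

# Restore the previous option value
options(old_opts)
```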
Examples
compare(example_df_a, example_df_b, by = car)
Modified version of datasets::mtcars - version a
Description
A version of mtcars with some values altered and some rows/columns removed. Not for informational purposes, used only to demonstrate the comparison of two slightly different data frames. Since some values were altered at random, the values do not necessarily reflect the true original values. The variables are as follows:
Usage
example_df_a
Format
A data frame with 9 rows and 9 variables:
- car
The rowname in the corresponding datasets::mtcars row
- mpg
Miles/(US) gallon
- cyl
Number of cylinders
- disp
Displacement (cu.in.)
- hp
Gross horsepower
- drat
Rear axle ratio
- wt
Weight (1000 lbs)
- vs
Engine (0 = V-shaped, 1 = straight)
- am
Transmission (0 = automatic, 1 = manual)
Source
Sourced from the CRAN datasets package, with modified values. Originally from Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
Modified version of datasets::mtcars - version b
Description
A version of mtcars with some values altered and some rows/columns removed. Not for informational purposes, used only to demonstrate the comparison of two slightly different data frames. Since some values were altered at random, the values do not necessarily reflect the true original values. The variables are as follows:
Usage
example_df_b
Format
A data frame with 9 rows and 9 variables:
- car
The rowname in the corresponding datasets::mtcars row
- wt
Weight (1000 lbs)
- mpg
Miles/(US) gallon
- hp
Gross horsepower
- cyl
Number of cylinders
- disp
Displacement (cu.in.)
- carb
Number of carburetors
- drat
Rear axle ratio
- vs
Engine (0 = V-shaped, 1 = straight)
Source
Sourced from the CRAN datasets package, with modified values. Originally from Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
Get rows with differing values
Description
Get rows with differing values
Usage
slice_diffs(comparison, table, column = everything())
Arguments
comparison |
The output of |
table |
One of |
column |
< |
Value
The input table is filtered to the rows for which comparison shows differing values for one of the columns selected by column.
Examples
comp <- compare(example_df_a, example_df_b, by = car)
comp |> slice_diffs("a", mpg)
comp |> slice_diffs("b", mpg)
comp |> slice_diffs("a", c(mpg, disp))
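To illustrate how the filtered rows relate to the differences recorded in the comparison, the sketch below checks the rows returned by slice_diffs() against the by values reported by value_diffs() for the same column. It assumes, as described in the compare() documentation, that the value differences carry the by column (here car).

```r
library(versus)

comp <- compare(example_df_a, example_df_b, by = car)

# Rows of table a whose mpg value differs from table b
rows_a <- slice_diffs(comp, "a", mpg)

# The differing mpg values themselves, with the `by` column alongside
diffs <- value_diffs(comp, mpg)

# Both results should refer to the same set of cars
setequal(rows_a$car, diffs$car)
```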
Get rows in only one table
Description
Get rows in only one table
Usage
slice_unmatched(comparison, table)
slice_unmatched_both(comparison)
Arguments
comparison |
The output of |
table |
One of |
Value
slice_unmatched() |
The table identified by |
slice_unmatched_both() |
The output of |
Examples
comp <- compare(example_df_a, example_df_b, by = car)
comp |> slice_unmatched("a")
comp |> slice_unmatched("b")
# slice_unmatched(comp, "a") output is the same as
example_df_a |> dplyr::anti_join(example_df_b, by = comp$by$column)
comp |> slice_unmatched_both()
Get the differing values from a comparison
Description
Get the differing values from a comparison
Usage
value_diffs(comparison, column)
value_diffs_stacked(comparison, column = everything())
Arguments
comparison |
The output of |
column |
< |
Value
value_diffs() |
A data frame with one row for each element
of |
value_diffs_stacked() , value_diffs_all() |
A data frame containing
the |
Examples
comp <- compare(example_df_a, example_df_b, by = car)
value_diffs(comp, disp)
value_diffs_stacked(comp, c(disp, mpg))
Argument type: tidy-select
Description
This page describes the <tidy-select> argument modifier, which indicates that the argument uses tidy selection, a sub-type of tidy evaluation. If you've never heard of tidy evaluation before, start with the practical introduction in https://r4ds.hadley.nz/functions.html#data-frame-functions, then read more about the underlying theory in https://rlang.r-lib.org/reference/topic-data-mask.html.
Overview of selection features
tidyselect implements a DSL for selecting variables. It provides helpers for selecting variables:
- var1:var10: variables lying between var1 on the left and var10 on the right.
- starts_with("a"): names that start with "a".
- ends_with("z"): names that end with "z".
- contains("b"): names that contain "b".
- matches("x.y"): names that match regular expression x.y.
- num_range(x, 1:4): names following the pattern x1, x2, ..., x4.
- all_of(vars)/any_of(vars): matches names stored in the character vector vars. all_of(vars) will error if the variables aren't present; any_of(vars) will match just the variables that exist.
- everything(): all variables.
- last_col(): furthest column on the right.
- where(is.numeric): all variables where is.numeric() returns TRUE.
As well as operators for combining those selections:
- !selection: only variables that don't match selection.
- selection1 & selection2: only variables included in both selection1 and selection2.
- selection1 | selection2: all variables that match either selection1 or selection2.
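A few of the helpers and operators above can be tried out as follows. This is only an illustrative sketch: it uses dplyr::select() on datasets::mtcars rather than a versus function, and the particular selections are chosen for demonstration.

```r
library(dplyr)

# Range of adjacent columns, from mpg on the left to disp on the right
mtcars |> select(mpg:disp) |> names()

# Name-based helpers
mtcars |> select(starts_with("d")) |> names()
mtcars |> select(contains("ar")) |> names()

# Combining selections: numeric columns whose names do not end in "t"
mtcars |> select(where(is.numeric) & !ends_with("t")) |> names()

# Columns starting with "c", or the right-most column
mtcars |> select(starts_with("c") | last_col()) |> names()
```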
Key techniques
- If you want the user to supply a tidyselect specification in a function argument, you need to tunnel the selection through the function argument. This is done by embracing the function argument {{ }}, e.g. unnest(df, {{ vars }}).
- If you have a character vector of column names, use all_of() or any_of(), depending on whether or not you want unknown variable names to cause an error, e.g. unnest(df, all_of(vars)), unnest(df, !any_of(vars)).
- To suppress R CMD check NOTEs about unknown variables use "var" instead of var:
# has NOTE
df %>% select(x, y, z)
# no NOTE
df %>% select("x", "y", "z")
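The first two techniques can be sketched as below. The wrapper pick_cols() is a hypothetical helper written for this illustration, and dplyr::select() on datasets::mtcars stands in for any function with a tidy-select argument.

```r
library(dplyr)

# Hypothetical wrapper: `cols` is tunnelled through to select() by embracing it
pick_cols <- function(df, cols) {
  select(df, {{ cols }})
}

# The caller can use any tidyselect specification
pick_cols(mtcars, c(mpg, disp)) |> names()

# Character vector of names, one of which does not exist in mtcars
vars <- c("mpg", "disp", "not_a_column")

# any_of() keeps only the names that exist
mtcars |> select(any_of(vars)) |> names()

# all_of(vars) would raise an error here because "not_a_column" is missing
```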
Get differences in context
Description
Get differences in context
Usage
weave_diffs_long(comparison, column = everything())
weave_diffs_wide(comparison, column = everything())
Arguments
comparison |
The output of |
column |
< |
Value
weave_diffs_wide() |
The input |
weave_diffs_long() |
Input tables are filtered to rows where
differing values exist for one of the columns selected by |
Examples
comp <- compare(example_df_a, example_df_b, by = car)
comp |> weave_diffs_wide(disp)
comp |> weave_diffs_wide(c(mpg, disp))
comp |> weave_diffs_long(disp)
comp |> weave_diffs_long(c(mpg, disp))