Title: | Track your Data Pipelines |
Version: | 0.4.6 |
Description: | Track and document 'dplyr' data pipelines. As you filter, mutate, and join your way through a data set, 'dtrackr' seamlessly keeps track of your data flow and makes publication ready documentation of a data pipeline simple. |
License: | MIT + file LICENSE |
Language: | en-GB |
Imports: | dplyr (≥ 1.1.0), glue, htmltools, magrittr, rlang, rsvg, stringr, tibble, tidyr, utils, V8, fs, purrr, base64enc, pdftools, png, lifecycle |
Suggests: | spelling, here, knitr, rmarkdown, tidyselect, devtools, testthat (≥ 2.1.0), rstudioapi, survival, ggplot2, covr |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2.9003 |
Depends: | R (≥ 2.10) |
URL: | https://terminological.github.io/dtrackr/index.html, https://github.com/terminological/dtrackr |
BugReports: | https://github.com/terminological/dtrackr/issues |
NeedsCompilation: | no |
Packaged: | 2024-10-21 08:33:55 UTC; vp22681 |
Author: | Robert Challen |
Maintainer: | Robert Challen <rob.challen@bristol.ac.uk> |
Repository: | CRAN |
Date/Publication: | 2024-10-21 09:20:02 UTC |
dtrackr: Track your Data Pipelines
Description
Track and document 'dplyr' data pipelines. As you filter, mutate, and join your way through a data set, 'dtrackr' seamlessly keeps track of your data flow and makes publication ready documentation of a data pipeline simple.
Author(s)
Maintainer: Robert Challen rob.challen@bristol.ac.uk (ORCID)
See Also
Useful links:
Report bugs at https://github.com/terminological/dtrackr/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
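As a minimal illustration of the magrittr semantics described above, the sketch below checks that a piped expression gives the same result as the equivalent nested call. The input vector is an arbitrary example value.

```r
library(magrittr)

# `lhs %>% rhs` passes lhs as the first argument of the rhs call
piped  = c(1, 4, 9) %>% sqrt() %>% sum()
nested = sum(sqrt(c(1, 4, 9)))

# both forms evaluate the same computation
identical(piped, nested)
```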
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), and dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as their dplyr counterparts. mutate / select / rename operations generally add nothing to the provenance of the data, so by default they are left out of the dtrackr history. This can be overridden by supplying a .messages or .headline value, in which case they behave just like a comment().
Usage
## S3 method for class 'trackr_df'
add_count(x, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
See Also
dplyr::add_count()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# add_count
# adding in a count or tally column as a new column
iris %>%
track() %>%
add_count(Species, name="new_count_total",
.messages="{.new_cols}",
# .messages="{.cols}",
.headline="New columns from add_count:") %>%
history()
# add_tally
iris %>%
track() %>%
group_by(Species) %>%
dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
.messages="{.new_cols}",
.headline="New columns from add_tally:") %>%
history()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), and dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as their dplyr counterparts. mutate / select / rename operations generally add nothing to the provenance of the data, so by default they are left out of the dtrackr history. This can be overridden by supplying a .messages or .headline value, in which case they behave just like a comment().
Usage
add_tally(x, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
See Also
dplyr::add_tally()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# add_count
# adding in a count or tally column as a new column
iris %>%
track() %>%
add_count(Species, name="new_count_total",
.messages="{.new_cols}",
# .messages="{.cols}",
.headline="New columns from add_count:") %>%
history()
# add_tally
iris %>%
track() %>%
group_by(Species) %>%
dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
.messages="{.new_cols}",
.headline="New columns from add_tally:") %>%
history()
Anti join
Description
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::anti_join() for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
anti_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} not matched"),
.headline = "Semi join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::anti_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Anti join
join = lhs %>% anti_join(rhs, by="name") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), and dplyr::arrange() for more details on the underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as their dplyr counterparts. mutate / select / rename operations generally add nothing to the provenance of the data, so by default they are left out of the dtrackr history. This can be overridden by supplying a .messages or .headline value, in which case they behave just like a comment().
Usage
## S3 method for class 'trackr_df'
arrange(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
See Also
dplyr::arrange()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# arrange
# In this case we sort the data descending and show the first value
# is the same as the maximum value.
iris %>%
track() %>%
arrange(
desc(Petal.Width),
.messages="{.count} items, columns: {.cols}",
.headline="Reordered dataframe:") %>%
history()
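The default behaviour described for these modifying operations can be checked by comparing two pipelines, one silent and one with a message. This is a sketch that assumes the nodes element of the history graph holds one row per recorded stage; the message text is an arbitrary example.

```r
library(dplyr)
library(dtrackr)

# Default: no .messages or .headline, so arrange() should add no stage
silent = iris %>% track() %>% arrange(desc(Petal.Width))

# With a message the same step is recorded, much like a comment()
recorded = iris %>% track() %>%
  arrange(desc(Petal.Width), .messages = "sorted by petal width")

# Compare the number of nodes in each history graph
nrow(history(silent)$nodes)
nrow(history(recorded)$nodes)
```

The data itself is sorted identically in both cases; only the recorded history differs.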
Set operations
Description
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects each function performs exactly the same operation as its dplyr equivalent. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
Usage
bind_cols(
...,
.messages = "{.count.out} in combined set",
.headline = "Bind columns"
)
Arguments
... |
a collection of tracked data frames to combine
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
dplyr::bind_cols()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}; in all other respects each function performs exactly the same operation as its dplyr equivalent. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
Usage
bind_rows(..., .messages = "{.count.out} in union", .headline = "Union")
Arguments
... |
a collection of tracked data frames to combine
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
dplyr::bind_rows()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Start capturing exclusions on a tracked dataframe.
Description
Start capturing exclusions on a tracked dataframe.
Usage
capture_exclusions(.data, .capture = TRUE)
Arguments
.data |
a tracked dataframe |
.capture |
Should we capture exclusions (things removed from the data
set). This is useful for debugging data issues but comes at a significant
cost. Defaults to the value of |
Value
the .data dataframe with the exclusions flag set (or cleared if .capture=FALSE).
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% capture_exclusions()
tmp %>% filter(Species!="versicolor") %>% history()
Add a generic comment to the dtrackr history graph
Description
A comment can be any kind of note and is added once for every current grouping, as defined by the .messages field. It can be made context specific by including variables such as {.count} and {.total} in .messages, which refer to the grouped and ungrouped counts at this current stage of the pipeline respectively. It can also pull in any global variable.
Usage
comment(
.data,
.messages = .defaultMessage(),
.headline = .defaultHeadline(),
.type = "info",
.asOffshoot = (.type == "exclusion"),
.tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
.messages |
a character vector of glue specifications. A glue specification can refer to any grouping variables of .data, or any variables defined in the calling environment, the {.total} of all rows, the {.count} variable which is the count in each group and {.strata} a description of the group |
.headline |
a glue specification which can refer to grouping variables
of .data, or any variables defined in the calling environment, or the
{.total} variable (which is |
.type |
one of "info","...,"exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE, unless .type is "exclusion"). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the same .data dataframe with the history graph updated with the comment
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% comment("hello {.total} rows") %>% history()
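The difference between the grouped and ungrouped counts is easiest to see on a grouped dataframe. The following sketch combines {.count}, {.total} and {.strata} in one message; the wording of the message is an arbitrary example.

```r
library(dplyr)
library(dtrackr)

# {.count} is evaluated per group, {.total} over the whole data set,
# and {.strata} describes the current group
graph = iris %>%
  track() %>%
  group_by(Species) %>%
  comment("{.count} of {.total} rows in {.strata}") %>%
  history()

print(graph)
```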
Add a subgroup count to the dtrackr history graph
Description
A frequent use case for more detailed description is to have a subgroup count within a flowchart. This works best for factor subgroup columns but other data will be converted to a factor automatically. The count of the items in each subgroup is added as a new stage in the flowchart.
Usage
count_subgroup(
.data,
.subgroup,
...,
.messages = .defaultCountSubgroup(),
.headline = .defaultHeadline(),
.type = "info",
.asOffshoot = FALSE,
.tag = NULL,
.maxsubgroups = .defaultMaxSupportedGroupings()
)
Arguments
.data |
a dataframe which may be grouped |
.subgroup |
a column with a small number of levels (e.g. a factor) |
... |
passed to |
.messages |
a character vector of glue specifications. A glue specification can refer to anything from the calling environment, {.subgroup} for the subgroup column name and {.name} for the subgroup column value, {.count} for the subgroup column count, {.subtotal} for the current stratification grouping count and {.total} for the whole dataset count |
.headline |
a glue specification which can refer to grouping variables of .data, {.subtotal} for the current grouping count, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want to use the summary data from this step in the future then give it a name with .tag. |
.maxsubgroups |
the maximum number of discrete values allowed in
.subgroup is configurable with
|
Value
the same .data dataframe with the history graph updated with a subgroup count as a new stage
Examples
library(dplyr)
library(dtrackr)
survival::cgd %>% track() %>% group_by(treat) %>%
count_subgroup(center) %>% history()
Distinct values of data
Description
Distinct acts in the same way as dplyr::distinct(). Prior to the operation the size of the group is calculated as {.count.in}, and after the operation the output size as {.count.out}. The group {.strata} is also available (if grouped) for reporting. See dplyr::distinct().
Usage
## S3 method for class 'trackr_df'
distinct(
.data,
...,
.messages = "removing {.count.in-.count.out} duplicates",
.headline = .defaultHeadline(),
.tag = NULL
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe with distinct values and history graph updated.
See Also
dplyr::distinct()
Examples
library(dplyr)
library(dtrackr)
tmp = bind_rows(iris %>% track(), iris %>% track() %>% filter(Petal.Length > 5))
tmp %>% group_by(Species) %>% distinct() %>% history()
Convert Graphviz dot content to a SVG
Description
Convert a Graphviz dot digraph as a string to SVG as a string
Usage
dot2svg(dot)
Arguments
dot |
a |
Value
the SVG as a string
Examples
dot2svg("digraph { A->B }")
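Since the return value is documented as the SVG as a string, it can be inspected or written to disk with base R. A small sketch follows; the temporary file path is only an illustration.

```r
library(dtrackr)

svg = dot2svg("digraph { A->B }")

# the result is character data holding SVG markup
is.character(svg)
substr(svg, 1, 60)

# write it to a temporary file, e.g. for viewing in a browser
path = tempfile(fileext = ".svg")
writeLines(svg, path)
```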
Exclude all items matching one or more criteria
Description
Apply a set of filters and summarise the actions of the filter to the dtrackr history graph. Because of the ... filter specification, all parameters MUST BE NAMED. The filters work in a combinatorial manner, i.e. the results EXCLUDE ALL rows that match any of the criteria. If na.rm = TRUE they also remove anything that cannot be evaluated by any criteria.
Usage
exclude_all(
.data,
...,
.headline = .defaultHeadline(),
na.rm = FALSE,
.type = "exclusion",
.asOffshoot = TRUE,
.stage = (if (is.null(.tag)) "" else .tag),
.tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against; items that match any of the predicates will be excluded. The RHS is a glue specification, defining the message to be entered in the history graph for each predicate. This can refer to grouping variables, variables from the environment, {.excluded}, {.matched} or {.missing} (excluded = matched+missing), and {.count} and {.total} (group and overall counts respectively), e.g. "excluding {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default FALSE) if the filter cannot be evaluated for a row count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = TRUE). |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the filtered .data dataframe with the history graph updated with the summary of excluded items as a new offshoot stage
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% capture_exclusions() %>% exclude_all(
Petal.Length > 5 ~ "{.excluded} long ones",
Petal.Length < 2 ~ "{.excluded} short ones"
) %>% history()
# simultaneous evaluation of criteria:
data.frame(a = 1:10) %>%
track() %>%
exclude_all(
# These two criteria identify the same value and one item is excluded
a > 9 ~ "{.excluded} value > 9",
a == max(a) ~ "{.excluded} max value",
) %>%
status() %>%
history()
# the behaviour is equivalent to the inverse of dplyr's filter function:
data.frame(a=1:10) %>%
dplyr::filter(a <= 9, a != max(a)) %>%
nrow()
# step-wise evaluation of criteria results in a different output
data.frame(a = 1:10) %>%
track() %>%
# Performing the same exclusion sequentially results in 2 items
# being excluded as the criteria no longer identify the same
# item.
exclude_all(a > 9 ~ "{.excluded} value > 9") %>%
exclude_all(a == max(a) ~ "{.excluded} max value") %>%
status() %>%
history()
# the behaviour is equivalent to the inverse of dplyr's filter function:
data.frame(a=1:10) %>%
dplyr::filter(a <= 9) %>%
dplyr::filter(a != max(a)) %>%
nrow()
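The effect of na.rm is not covered by the examples above. The sketch below uses a small made-up dataframe with one missing value, and relies only on the documented behaviour that rows whose predicate cannot be evaluated count as missing.

```r
library(dplyr)
library(dtrackr)

# illustrative data: `a > 5` cannot be evaluated for the NA row
df = data.frame(a = c(1, NA, 10))

# na.rm = FALSE (the default): the unevaluable row is not excluded
kept = df %>% track() %>%
  exclude_all(a > 5 ~ "{.matched} matched, {.missing} missing")

# na.rm = TRUE: the unevaluable row counts as missing and is excluded
dropped = df %>% track() %>%
  exclude_all(a > 5 ~ "{.matched} matched, {.missing} missing", na.rm = TRUE)

nrow(kept)
nrow(dropped)
```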
Get the dtrackr excluded data record
Description
Get the dtrackr excluded data record
Usage
excluded(.data, simplify = TRUE)
Arguments
.data |
a dataframe which may be grouped |
simplify |
return a single summary dataframe of all exclusions. |
Value
a new dataframe of the excluded data up to this point in the workflow. This dataframe is by default flattened, but if simplify=FALSE it has a nested structure containing records excluded at each part of the pipeline.
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% capture_exclusions()
tmp %>% exclude_all(
Petal.Length > 5.8 ~ "{.excluded} long ones",
Petal.Length < 1.3 ~ "{.excluded} short ones",
.stage = "petal length exclusion"
) %>% excluded()
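To compare the flattened and nested forms of the exclusion record, the same tracked pipeline can be queried twice. This is a sketch; the exact column layout of either result is not assumed here, so str() is used to inspect it.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>% track() %>% capture_exclusions() %>%
  exclude_all(
    Petal.Length > 5.8 ~ "{.excluded} long ones",
    Petal.Length < 1.3 ~ "{.excluded} short ones",
    .stage = "petal length exclusion"
  )

# default: one flattened dataframe of everything excluded so far
flat = tracked %>% excluded()

# simplify = FALSE: nested structure, records held per pipeline step
nested = tracked %>% excluded(simplify = FALSE)

str(flat, max.level = 1)
str(nested, max.level = 1)
```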
Filtering data
Description
Filter acts in the same way as in dplyr, where predicates which evaluate to TRUE act to select items to include, and items for which the predicate cannot be evaluated are excluded. For tracking, the size of each group is calculated prior to the filter operation as {.count.in}, and the output size of each group after the operation as {.count.out}. The grouping {.strata} is also available (if grouped) for reporting. See dplyr::filter().
Usage
## S3 method for class 'trackr_df'
filter(
.data,
...,
.messages = "excluded {.excluded} items",
.headline = .defaultHeadline(),
.type = "exclusion",
.asOffshoot = (.type == "exclusion"),
.stage = (if (is.null(.tag)) "" else .tag),
.tag = NULL
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.type |
the format type of the action typically an exclusion |
.asOffshoot |
if the type is exclusion, |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then
give it a name with |
Value
the filtered .data dataframe with history graph updated
See Also
dplyr::filter()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% group_by(Species)
tmp %>% filter(Petal.Length > 5) %>% history()
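The handling of predicates that cannot be evaluated can be seen with a small made-up dataframe containing a missing value. This sketch also overrides the default message to report {.count.in} and {.count.out}.

```r
library(dplyr)
library(dtrackr)

# illustrative data with one missing value
df = data.frame(a = c(1, NA, 10))

# `a > 5` is TRUE only for the last row; the NA row cannot be
# evaluated and is dropped along with the FALSE row
out = df %>% track() %>%
  filter(a > 5, .messages = "{.count.in} in, {.count.out} out")

nrow(out)
out %>% history()
```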
Flowchart output
Description
Generate a flowchart of the history of the dataframe(s), with all the tracked data pipeline as stages in the flowchart. Multiple dataframes can be plotted together in which case an attempt is made to determine which parts are common.
Usage
flowchart(
.data,
filename = NULL,
size = std_size$full,
maxWidth = size$width,
maxHeight = size$height,
formats = c("dot", "png", "pdf", "svg"),
defaultToHTML = TRUE,
landscape = size$rot != 0,
...
)
Arguments
.data |
the tracked dataframe(s) either as a single dataframe or as a list of dataframes. |
filename |
a file name which will be where the formatted flowcharts are
saved. If no extension is specified the output formats are determined by
the |
size |
a named list with 3 elements, length and width in inches and rotation. A predefined set of standard sizes are available in the std_size object. |
maxWidth |
a width (on the paper) in inches if |
maxHeight |
a height (on the paper) in inches if |
formats |
some of |
defaultToHTML |
if the correct output format is not easy to determine
from the context, default providing |
landscape |
rotate the output by 270 degrees into a landscape format.
|
... |
other parameters passed onto either |
Value
the nature of the flowchart output depends on the context in which the function is called. It will be some form of browsable HTML output if called from an interactive session, or a PNG/PDF link if called in knitr when knitting LaTeX or Word type outputs. If a file name is specified the output will also be saved at the given location.
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% comment(.tag = "step1") %>% filter(Species!="versicolor")
tmp %>% group_by(Species) %>% comment(.tag="step2") %>% flowchart()
Full join
Description
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::full_join() for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
full_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
"{.count.out} in linked set"),
.headline = "Full join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::full_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Full join
join = lhs %>% full_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
Stratifying your analysis
Description
Grouping a data set acts in the normal way. When tracking a dataframe, a group_by() operation will sometimes create a lot of groups. This happens, for example, if you are doing a group_by(), summarise() step that is aggregating data on a fine scale, e.g. by day in a time-series. This is generally a terrible idea when tracking a dataframe, as the resulting flowchart will have a great many branches and be illegible. dtrackr will detect this issue and pause tracking the dataframe with a warning. It is up to the user to resume() tracking when the large number of groups has been resolved, e.g. using dplyr::ungroup(). This limit is configurable with options("dtrackr.max_supported_groupings"=XX). The default is 16. See dplyr::group_by().
Usage
## S3 method for class 'trackr_df'
group_by(
.data,
...,
.messages = "stratify by {.cols}",
.headline = NULL,
.tag = NULL,
.maxgroups = .defaultMaxSupportedGroupings()
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
In
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.cols} which is the columns that are being grouped by. |
.headline |
a headline glue spec. The glue code can use any global variable, or {.cols}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
.maxgroups |
the maximum number of subgroups allowed before the tracking is paused. |
Value
the .data but grouped.
See Also
dplyr::group_by()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% group_by(Species, .messages="stratify by {.cols}")
tmp %>% comment("{.strata}") %>% history()
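The pause and resume behaviour described above might look like the following in practice. This is a sketch under stated assumptions: the dataframe and its id column are invented for illustration, the 100 groups exceed the documented default limit of 16, and the exact warning text is not shown.

```r
library(dplyr)
library(dtrackr)

# hypothetical fine-grained data: 100 distinct ids, well above the limit
df = data.frame(id = 1:100, value = rnorm(100))

summary_df = df %>%
  track() %>%
  group_by(id) %>%                    # expected to warn and pause tracking
  summarise(value = mean(value)) %>%
  ungroup() %>%
  resume() %>%                        # restart tracking once groups are resolved
  comment("{.count} ids summarised")

summary_df %>% history()
```

If the fine-grained grouping is intentional, raising the limit through the dtrackr.max_supported_groupings option or the .maxgroups argument is the alternative, at the cost of a much wider flowchart.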
Group-wise modification of data and complex operations
Description
Group modifying a data set acts in the normal way. The internal mechanics of the modify function are opaque to the history. This means these can be used to wrap any unsupported operation without losing the history (e.g. df %>% track() %>% group_modify(function(d,...) { d %>% unsupported_operation() })). Prior to the operation the size of the group is calculated as {.count.in}, and after the operation the output size as {.count.out}. The group {.strata} is also available (if grouped) for reporting. See dplyr::group_modify().
Usage
## S3 method for class 'trackr_df'
group_modify(
.data,
...,
.messages = NULL,
.headline = .defaultHeadline(),
.type = "modify",
.tag = NULL
)
Arguments
.data |
A grouped tibble |
... |
Additional arguments passed on to dplyr::group_modify(). |
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata}, {.count.in}, and {.count.out} |
.type |
default "modify": used to define formatting |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the transformed .data dataframe with the history graph updated.
See Also
dplyr::group_modify()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% group_by(Species)
tmp %>% group_modify(
function(d,g,...) { return(tibble::tibble(x=runif(10))) },
.messages="{.count.in} in, {.count.out} out"
) %>% history()
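As a sketch of wrapping an operation that dtrackr does not track itself, the example below uses a hypothetical helper, widest_petals() (not part of dtrackr or dplyr), inside group_modify(). Only the group sizes before and after the call are recorded in the history.

```r
library(dplyr)
library(dtrackr)

# Hypothetical stand-in for an untracked operation:
# keep the three rows with the widest petals in each group.
widest_petals = function(d, ...) {
  d[order(d$Petal.Width, decreasing = TRUE)[1:3], ]
}

top3 = iris %>%
  track() %>%
  group_by(Species) %>%
  group_modify(
    function(d, g, ...) widest_petals(d),
    .messages = "{.count.in} in, {.count.out} out"
  )

top3 %>% history()
nrow(top3)
```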
Get the dtrackr history graph
Description
This provides the raw history graph and is not really intended for mainstream use. The internal structure of the graph is explained below. print and plot S3 methods exist for the dtrackr history graph.
Usage
history(.data)
Arguments
.data |
a dataframe which may be grouped |
Value
the history graph. This is a list, of class trackr_graph, containing the following named items:
excluded - the data items that have been excluded thus far, as a nested dataframe
tags - a dataframe of tag-value pairs containing the summary of the data at named points in the data flow (see tagged())
nodes - a dataframe of the nodes of the flow chart
edges - an edge list (as a dataframe) of the relationships between the nodes in the flow chart
head - the current most recent nodes added into the graph, as a dataframe
The format of this data may grow over time but these fields are unlikely to be changed.
Examples
library(dplyr)
library(dtrackr)
graph = iris %>% track() %>% comment("A comment") %>% history()
print(graph)
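The named items listed under Value can be inspected directly. The following is a minimal sketch that relies only on the item names documented above; the columns inside each dataframe are not specified here and may differ between versions.

```r
library(dplyr)
library(dtrackr)

graph = iris %>% track() %>% comment("A comment") %>% history()

# The graph is a list of class trackr_graph
class(graph)
names(graph)

# Nodes and edges of the flow chart, and the most recently added nodes
graph$nodes
graph$edges
graph$head
```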
Include any items matching a criteria
Description
Apply a set of inclusion criteria and record the actions of the filter to the dtrackr history graph. Because of the ... filter specification, all parameters MUST BE NAMED. This function is the opposite of exclude_all(), and the filtering criteria work to identify rows to include, i.e. the results include anything that matches any of the criteria. If na.rm=TRUE they also keep anything that cannot be evaluated by the criteria.
Usage
include_any(
.data,
...,
.headline = .defaultHeadline(),
na.rm = TRUE,
.type = "inclusion",
.asOffshoot = FALSE,
.tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against; items that match at least one of the predicates will be included. The RHS is a glue specification defining the message to be entered in the history graph for each predicate matched. This can refer to grouping variables, variables from the environment, {.included}, {.matched} or {.missing} (included = matched+missing), and {.count} and {.total} (group and overall counts respectively), e.g. "including {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default TRUE) if the filter cannot be evaluated for a row count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "inclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the filtered .data dataframe with the history graph updated with the summary of included items as a new stage
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% group_by(Species) %>% include_any(
Petal.Length > 5 ~ "{.included} long ones",
Petal.Length < 2 ~ "{.included} short ones"
) %>% history()
# simultaneous evaluation of criteria:
data.frame(a = 1:10) %>%
track() %>%
include_any(
# These two criteria identify the same value and one item is excluded
a > 1 ~ "{.included} value > 1",
a != min(a) ~ "{.included} everything but the smallest value",
) %>%
status() %>%
history()
# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
dplyr::filter(a > 1, a != min(a)) %>%
nrow()
# step-wise evaluation of criteria results in a different output
data.frame(a = 1:10) %>%
track() %>%
# Performing the same exclusion sequentially results in 2 items
# being excluded as the criteria no longer identify the same
# item.
include_any(a > 1 ~ "{.included} value > 1") %>%
include_any(a != min(a) ~ "{.included} everything but the smallest value") %>%
status() %>%
history()
# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
dplyr::filter(a > 1) %>%
dplyr::filter(a != min(a)) %>%
nrow()
Inner joins
Description
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::inner_join() for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
inner_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
"{.count.out} in linked set"),
.headline = "Inner join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to dplyr::inner_join(). |
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::inner_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Inner join
join = lhs %>% inner_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
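The glue variables available to .messages and .headline can be used to customise how the join is reported. The sketch below reuses the starwars data from the example above; the wording of the messages is illustrative only.

```r
library(dplyr)
library(dtrackr)

people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name, films) %>% tidyr::unnest(cols = c(films))

# Report the input and output sizes and the joining columns in our own words
join = people %>%
  track() %>%
  inner_join(
    films %>% track(),
    by = "name",
    .messages = "{.count.lhs} people and {.count.rhs} film credits give {.count.out} rows",
    .headline = "Linked on {.keys}"
  )

join %>% history()
nrow(join)
```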
Set operations
Description
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}. In all other respects each performs exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
Usage
## S3 method for class 'trackr_df'
intersect(
x,
y,
...,
.messages = "{.count.out} in intersection",
.headline = "Intersection"
)
Arguments
x , y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
generics::intersect()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Left join
Description
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::left_join() for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
left_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
"{.count.out} in linked set"),
.headline = "Left join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to dplyr::left_join(). |
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::left_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Left join
join = lhs %>% left_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
Usage
## S3 method for class 'trackr_df'
mutate(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<data-masking> Name-value pairs. The name gives the name of the column in the output.
Named arguments passed on to dplyr::mutate(). |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
See Also
dplyr::mutate()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# mutate
# In this example we compare the column names of the input and the
# output to identify the new columns created by the mutate operation as
# the `.new_cols` variable
iris %>%
track() %>%
mutate(extra_col = NA_real_,
.messages="{.new_cols}",
.headline="Extra columns from mutate:") %>%
history()
Nest join
Description
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::nest_join() for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
nest_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} matched"),
.headline = "Nest join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to dplyr::nest_join(). |
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::nest_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Nest join
join = lhs %>% nest_join(rhs, by="name") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
Usage
p_add_count(x, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
<data-masking> Variables to group by. |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
See Also
dplyr::add_count()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# add_count
# adding in a count or tally column as a new column
iris %>%
track() %>%
add_count(Species, name="new_count_total",
.messages="{.new_cols}",
# .messages="{.cols}",
.headline="New columns from add_count:") %>%
history()
# add_tally
iris %>%
track() %>%
group_by(Species) %>%
dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
.messages="{.new_cols}",
.headline="New columns from add_tally:") %>%
history()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
Usage
p_add_tally(x, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
Arguments passed on to dplyr::add_tally(). |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
See Also
dplyr::add_tally()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# add_count
# adding in a count or tally column as a new column
iris %>%
track() %>%
add_count(Species, name="new_count_total",
.messages="{.new_cols}",
# .messages="{.cols}",
.headline="New columns from add_count:") %>%
history()
# add_tally
iris %>%
track() %>%
group_by(Species) %>%
dtrackr::add_tally(wt=Petal.Length, name="new_tally_total",
.messages="{.new_cols}",
.headline="New columns from add_tally:") %>%
history()
Anti join
Description
Mutating joins behave as dplyr joins, except that the history graphs of the two sides of the join are merged, resulting in a tracked dataframe with the history of both input dataframes. See dplyr::anti_join() for more details on the underlying functions.
Usage
p_anti_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} not matched"),
.headline = "Semi join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to dplyr::anti_join(). |
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::anti_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Anti join
join = lhs %>% anti_join(rhs, by="name") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(), dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details on underlying functions. dtrackr provides equivalent functions for mutating, selecting and renaming a data set which act in the same way as dplyr. mutate / select / rename generally don't add anything in terms of provenance of data so the default behaviour is to leave these out of the dtrackr history. This can be overridden with the .messages or .headline values, in which case they behave just like a comment().
Usage
p_arrange(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Named arguments passed on to dplyr::arrange(). |
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function, but with the history graph updated with a new stage if the .messages or .headline parameter is not empty.
See Also
dplyr::arrange()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# arrange
# In this case we sort the data descending and show the first value
# is the same as the maximum value.
iris %>%
track() %>%
arrange(
desc(Petal.Width),
.messages="{.count} items, columns: {.cols}",
.headline="Reordered dataframe:") %>%
history()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}. In all other respects each performs exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
Usage
p_bind_cols(
...,
.messages = "{.count.out} in combined set",
.headline = "Bind columns"
)
Arguments
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
dplyr::bind_cols()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history of 2 (or more) dataframes and combine the rows (or columns). The total number of resulting rows is calculated as {.count.out}. In all other respects each performs exactly the same operation as the equivalent dplyr operation. See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(), dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
Usage
p_bind_rows(..., .messages = "{.count.out} in union", .headline = "Union")
Arguments
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
dplyr::bind_rows()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Start capturing exclusions on a tracked dataframe.
Description
Start capturing exclusions on a tracked dataframe.
Usage
p_capture_exclusions(.data, .capture = TRUE)
Arguments
.data |
a tracked dataframe |
.capture |
Should we capture exclusions (things removed from the data
set). This is useful for debugging data issues but comes at a significant
cost. Defaults to the value of |
Value
the .data dataframe with the exclusions flag set (or cleared if .capture=FALSE).
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% capture_exclusions()
tmp %>% filter(Species!="versicolor") %>% history()
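Once capturing is switched on, the excluded rows should be retrievable from the history graph. The sketch below assumes only what the history() page documents, namely that the graph has an item named excluded holding the excluded data as a nested dataframe; its exact columns are not specified here.

```r
library(dplyr)
library(dtrackr)

kept = iris %>%
  track() %>%
  capture_exclusions() %>%
  filter(Species != "versicolor")

# The rows removed by the filter are stored alongside the history
graph = kept %>% history()
graph$excluded
nrow(kept)
```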
Clear the dtrackr history graph
Description
This is unlikely to be needed directly and is mostly an internal function.
Usage
p_clear(.data)
Arguments
.data |
a dataframe which may be grouped |
Value
the .data dataframe with the history graph removed
Examples
library(dplyr)
library(dtrackr)
mtcars %>% track() %>% comment("A comment") %>% p_clear() %>% history()
Add a generic comment to the dtrackr history graph
Description
A comment can be any kind of note and is added once for every current grouping, as defined by the .messages field. It can be made context specific by including variables such as {.count} and {.total} in .messages, which refer to the grouped and ungrouped counts at this current stage of the pipeline respectively. It can also pull in any global variable.
Usage
p_comment(
.data,
.messages = .defaultMessage(),
.headline = .defaultHeadline(),
.type = "info",
.asOffshoot = (.type == "exclusion"),
.tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
.messages |
a character vector of glue specifications. A glue specification can refer to any grouping variables of .data, or any variables defined in the calling environment, the {.total} of all rows, the {.count} variable which is the count in each group and {.strata} a description of the group |
.headline |
a glue specification which can refer to grouping variables
of .data, or any variables defined in the calling environment, or the
{.total} variable (which is |
.type |
one of "info", ..., "exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (by default TRUE only when .type is "exclusion", otherwise FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the same .data dataframe with the history graph updated with the comment
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% comment("hello {.total} rows") %>% history()
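On a grouped dataframe the comment is evaluated once per group. As a minimal sketch using only the glue variables documented under .messages ({.strata}, {.count} and {.total}):

```r
library(dplyr)
library(dtrackr)

by_species = iris %>%
  track() %>%
  group_by(Species) %>%
  comment("{.strata}: {.count} of {.total} rows")

# One message per Species group is added to the history
by_species %>% history()
```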
Copy the dtrackr history graph from one dataframe to another
Description
Copy the dtrackr history graph from one dataframe to another
Usage
p_copy(.data, from)
Arguments
.data |
a dataframe which may be grouped |
from |
the dataframe to copy the history graph from |
Value
the .data dataframe with the history graph of "from"
Examples
mtcars %>% p_copy(iris %>% comment("A comment")) %>% history()
Simple count_if dplyr summary function
Description
Simple count_if dplyr summary function
Usage
p_count_if(..., na.rm = TRUE)
Arguments
... |
expression to be evaluated |
na.rm |
ignore NA values? |
Value
a count of the number of times the expression evaluated to true, in the current context
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% dplyr::group_by(Species)
tmp %>% dplyr::summarise(long_ones = p_count_if(Petal.Length > 4))
Add a subgroup count to the dtrackr history graph
Description
A frequent use case for more detailed description is to have a subgroup count within a flowchart. This works best for factor subgroup columns but other data will be converted to a factor automatically. The count of the items in each subgroup is added as a new stage in the flowchart.
Usage
p_count_subgroup(
.data,
.subgroup,
...,
.messages = .defaultCountSubgroup(),
.headline = .defaultHeadline(),
.type = "info",
.asOffshoot = FALSE,
.tag = NULL,
.maxsubgroups = .defaultMaxSupportedGroupings()
)
Arguments
.data |
a dataframe which may be grouped |
.subgroup |
a column with a small number of levels (e.g. a factor) |
... |
passed to |
.messages |
a character vector of glue specifications. A glue specification can refer to anything from the calling environment, {.subgroup} for the subgroup column name and {.name} for the subgroup column value, {.count} for the subgroup column count, {.subtotal} for the current stratification grouping count and {.total} for the whole dataset count |
.headline |
a glue specification which can refer to grouping variables of .data, {.subtotal} for the current grouping count, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want to use the summary data from this step in the future then give it a name with .tag. |
.maxsubgroups |
the maximum number of discrete values allowed in
.subgroup is configurable with
|
Value
the same .data dataframe with the history graph updated with a subgroup count as a new stage
Examples
library(dplyr)
library(dtrackr)
survival::cgd %>% track() %>% group_by(treat) %>%
count_subgroup(center) %>% history()
Distinct values of data
Description
Distinct acts in the same way as dplyr::distinct(). Prior to the operation
the size of each group is calculated as {.count.in}, and after the operation the
output size as {.count.out}. The group {.strata} is also available (if
grouped) for reporting. See dplyr::distinct().
Usage
p_distinct(
.data,
...,
.messages = "removing {.count.in-.count.out} duplicates",
.headline = .defaultHeadline(),
.tag = NULL
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe with distinct values and history graph updated.
See Also
dplyr::distinct()
Examples
library(dplyr)
library(dtrackr)
tmp = bind_rows(iris %>% track(), iris %>% track() %>% filter(Petal.Length > 5))
tmp %>% group_by(Species) %>% distinct() %>% history()
Exclude all items matching one or more criteria
Description
Apply a set of filters and summarise the actions of the filter to the dtrackr
history graph. Because of the ... filter specification, all parameters MUST BE
NAMED. The filters work in a combinatorial manner, i.e. the results EXCLUDE ALL
rows that match any of the criteria. If na.rm = TRUE they also remove
anything that cannot be evaluated by any criteria.
Usage
p_exclude_all(
.data,
...,
.headline = .defaultHeadline(),
na.rm = FALSE,
.type = "exclusion",
.asOffshoot = TRUE,
.stage = (if (is.null(.tag)) "" else .tag),
.tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against; items that match any of the predicates will be excluded. The RHS is a glue specification, defining the message to be entered in the history graph for each predicate. This can refer to grouping variables, variables from the environment, {.excluded}, {.matched} or {.missing} (excluded = matched+missing), and {.count} and {.total} (group and overall counts respectively), e.g. "excluding {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default FALSE) if the filter cannot be evaluated for a row, count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = TRUE). |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the filtered .data dataframe with the history graph updated with the summary of excluded items as a new offshoot stage
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% capture_exclusions() %>% exclude_all(
Petal.Length > 5 ~ "{.excluded} long ones",
Petal.Length < 2 ~ "{.excluded} short ones"
) %>% history()
# simultaneous evaluation of criteria:
data.frame(a = 1:10) %>%
track() %>%
exclude_all(
# These two criteria identify the same value and one item is excluded
a > 9 ~ "{.excluded} value > 9",
a == max(a) ~ "{.excluded} max value",
) %>%
status() %>%
history()
# the behaviour is equivalent to the inverse of dplyr's filter function:
data.frame(a=1:10) %>%
dplyr::filter(a <= 9, a != max(a)) %>%
nrow()
# step-wise evaluation of criteria results in a different output
data.frame(a = 1:10) %>%
track() %>%
# Performing the same exclusion sequentially results in 2 items
# being excluded as the criteria no longer identify the same
# item.
exclude_all(a > 9 ~ "{.excluded} value > 9") %>%
exclude_all(a == max(a) ~ "{.excluded} max value") %>%
status() %>%
history()
# the behaviour is equivalent to the inverse of dplyr's filter function:
data.frame(a=1:10) %>%
dplyr::filter(a <= 9) %>%
dplyr::filter(a != max(a)) %>%
nrow()
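The difference between simultaneous and step-wise evaluation in the examples above can be sketched in base R without any tracking. This is only an illustration of the row arithmetic, under the assumption that a row is dropped when it matches any predicate; the variable names are hypothetical.

```r
a = 1:10

# Simultaneous: both predicates are evaluated against the original data,
# so the value 10 matches both and is only dropped once.
drop_simultaneous = (a > 9) | (a == max(a))
kept_simultaneous = a[!drop_simultaneous]

# Step-wise: the second predicate is evaluated against what survived the
# first step, so max() now refers to 9 and a second item is dropped.
step1 = a[!(a > 9)]
kept_stepwise = step1[!(step1 == max(step1))]

length(kept_simultaneous)  # 9
length(kept_stepwise)      # 8
```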
Get the dtrackr excluded data record
Description
Get the dtrackr excluded data record
Usage
p_excluded(.data, simplify = TRUE)
Arguments
.data |
a dataframe which may be grouped |
simplify |
return a single summary dataframe of all exclusions. |
Value
a new dataframe of the excluded data up to this point in the workflow. This dataframe is by default flattened, but if simplify = FALSE
it has a nested structure containing records excluded at each part of the pipeline.
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% capture_exclusions()
tmp %>% exclude_all(
Petal.Length > 5.8 ~ "{.excluded} long ones",
Petal.Length < 1.3 ~ "{.excluded} short ones",
.stage = "petal length exclusion"
) %>% excluded()
Filtering data
Description
Filter acts in the same way as in dplyr, where predicates which evaluate to
TRUE select items to include, and items for which the predicate cannot
be evaluated are excluded. For tracking, the size of each group is calculated
prior to the filter operation as {.count.in}, and the output size of each
group after the operation as {.count.out}. The grouping {.strata} is also
available (if grouped) for reporting. See dplyr::filter().
Usage
p_filter(
.data,
...,
.messages = "excluded {.excluded} items",
.headline = .defaultHeadline(),
.type = "exclusion",
.asOffshoot = (.type == "exclusion"),
.stage = (if (is.null(.tag)) "" else .tag),
.tag = NULL
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
<
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.type |
the format type of the action typically an exclusion |
.asOffshoot |
if the type is exclusion, |
.stage |
a name for this step in the pathway |
.tag |
if you want the summary data from this step in the future then
give it a name with .tag. |
Value
the filtered .data dataframe with history graph updated
See Also
dplyr::filter()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% group_by(Species)
tmp %>% filter(Petal.Length > 5) %>% history()
Flowchart output
Description
Generate a flowchart of the history of the dataframe(s), with all the tracked data pipeline as stages in the flowchart. Multiple dataframes can be plotted together in which case an attempt is made to determine which parts are common.
Usage
p_flowchart(
.data,
filename = NULL,
size = std_size$full,
maxWidth = size$width,
maxHeight = size$height,
formats = c("dot", "png", "pdf", "svg"),
defaultToHTML = TRUE,
landscape = size$rot != 0,
...
)
Arguments
.data |
the tracked dataframe(s) either as a single dataframe or as a list of dataframes. |
filename |
a file name which will be where the formatted flowcharts are
saved. If no extension is specified the output formats are determined by
the |
size |
a named list with 3 elements: width and height in inches, and rotation (rot). A predefined set of standard sizes is available in the std_size object. |
maxWidth |
a width (on the paper) in inches if |
maxHeight |
a height (on the paper) in inches if |
formats |
some of |
defaultToHTML |
if the correct output format is not easy to determine
from the context, default providing |
landscape |
rotate the output by 270 degrees into a landscape format.
|
... |
other parameters passed onto either |
Value
the nature of the flowchart output depends on the context in which
the function is called. It will be some form of browsable HTML output if
called from an interactive session, or a PNG/PDF link if called in knitr when
knitting LaTeX or Word type outputs. If a file name is specified the output
will also be saved at the given location.
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% comment(.tag = "step1") %>% filter(Species!="versicolor")
tmp %>% group_by(Species) %>% comment(.tag="step2") %>% flowchart()
Full join
Description
Mutating joins behave as dplyr joins, except the history graphs of the two
sides of the join are merged, resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::full_join() for more details
on the underlying functions.
Usage
p_full_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
"{.count.out} in linked set"),
.headline = "Full join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::full_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Full join
join = lhs %>% full_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
Get the dtrackr history graph
Description
This provides the raw history graph and is not really intended for mainstream use. The internal structure of the graph is explained below. print and plot S3 methods exist for the dtrackr history graph.
Usage
p_get(.data)
Arguments
.data |
a dataframe which may be grouped |
Value
the history graph. This is a list, of class trackr_graph, containing the following named items:
excluded - the data items that have been excluded thus far as a nested dataframe
tags - a dataframe of tag-value pairs containing the summary of the data at named points in the data flow (see tagged())
nodes - a dataframe of the nodes of the flow chart
edges - an edge list (as a dataframe) of the relationships between the nodes in the flow chart
head - the current most recent nodes added into the graph as a dataframe.
The format of this data may grow over time but these fields are unlikely to be changed.
Examples
library(dplyr)
library(dtrackr)
graph = iris %>% track() %>% comment("A comment") %>% history()
print(graph)
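The named items listed in the Value section can be inspected directly on the returned list. The following is a minimal sketch that relies only on the item names documented above; the exact columns of each dataframe are not specified here and may differ between versions.

```r
library(dplyr)
library(dtrackr)

graph = iris %>% track() %>% comment("A comment") %>% history()

# The graph is a list of class trackr_graph
class(graph)
names(graph)

# Nodes and edges of the flowchart, as dataframes
graph$nodes
graph$edges

# The most recently added nodes
graph$head
```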
DOT output
Description
(advanced usage) outputs a dtrackr history graph as a DOT string for rendering with Graphviz
Usage
p_get_as_dot(.data, fill = "lightgrey", fontsize = "8", colour = "black", ...)
Arguments
.data |
the tracked dataframe |
fill |
the default node fill colour |
fontsize |
the default font size |
colour |
the default font colour |
... |
not used |
Value
a representation of the history graph in Graphviz dot format.
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% comment(.tag = "step1") %>% filter(Species!="versicolor")
dot = tmp %>% group_by(Species) %>% comment(.tag="step2") %>% p_get_as_dot()
cat(dot)
Stratifying your analysis
Description
Grouping a data set acts in the normal way. When tracking a dataframe
sometimes a group_by() operation will create a lot of groups. This happens
for example if you are doing a group_by(), summarise() step that is
aggregating data on a fine scale, e.g. by day in a time-series. This is
generally a terrible idea when tracking a dataframe as the resulting
flowchart will have very many branches and be illegible. dtrackr will detect
this issue and pause tracking the dataframe with a warning. It is up to the
user to resume() tracking when the large number of groups has been resolved,
e.g. using dplyr::ungroup(). This limit is configurable with
options("dtrackr.max_supported_groupings"=XX). The default is 16. See
dplyr::group_by().
Usage
p_group_by(
.data,
...,
.messages = "stratify by {.cols}",
.headline = NULL,
.tag = NULL,
.maxgroups = .defaultMaxSupportedGroupings()
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
In
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.cols} which is the columns that are being grouped by. |
.headline |
a headline glue spec. The glue code can use any global variable, or {.cols}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
.maxgroups |
the maximum number of subgroups allowed before the tracking is paused. |
Value
the .data but grouped.
See Also
dplyr::group_by()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% group_by(Species, .messages="stratify by {.cols}")
tmp %>% comment("{.strata}") %>% history()
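The grouping limit mentioned in the description is an ordinary R option, so it can be set and restored with base R. The sketch below uses the option name given above; the value 32 is an arbitrary illustrative choice, and the variable names are hypothetical.

```r
# Raise the grouping limit for this session (32 is an arbitrary example value)
old = options("dtrackr.max_supported_groupings" = 32)

# Read the value back to confirm what is currently set
current_limit = getOption("dtrackr.max_supported_groupings")
print(current_limit)

# Restore whatever was set before
options(old)
```

The limit can alternatively be supplied per call through the .maxgroups parameter shown in the usage above.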
Group-wise modification of data and complex operations
Description
Group modifying a data set acts in the normal way. The internal mechanics of
the modify function are opaque to the history. This means these can be used
to wrap any unsupported operation without losing the history (e.g. df %>% track() %>% group_modify(function(d,...) { d %>% unsupported_operation() })).
Prior to the operation the size of the group is calculated as {.count.in},
and after the operation the output size as {.count.out}. The group {.strata}
is also available (if grouped) for reporting. See dplyr::group_modify().
Usage
p_group_modify(
.data,
...,
.messages = NULL,
.headline = .defaultHeadline(),
.type = "modify",
.tag = NULL
)
Arguments
.data |
A grouped tibble |
... |
Additional arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.headline |
a headline glue spec. The glue code can use any global variable, or {.strata},{.count.in},and {.count.out} |
.type |
default "modify": used to define formatting |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the transformed .data dataframe with the history graph updated.
See Also
dplyr::group_modify()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% group_by(Species)
tmp %>% group_modify(
function(d,g,...) { return(tibble::tibble(x=runif(10))) },
.messages="{.count.in} in, {.count.out} out"
) %>% history()
Include any items matching a criteria
Description
Apply a set of inclusion criteria and record the actions of the
filter to the dtrackr history graph. Because of the ... filter specification,
all parameters MUST BE NAMED. This function is the opposite of
exclude_all() and the filtering criteria work to identify rows to
include, i.e. the results include anything that matches any of the criteria. If
na.rm=TRUE they also keep anything that cannot be evaluated by the criteria.
Usage
p_include_any(
.data,
...,
.headline = .defaultHeadline(),
na.rm = TRUE,
.type = "inclusion",
.asOffshoot = FALSE,
.tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
... |
a dplyr filter specification as a set of formulae where the LHS are predicates to test the data set against; items that match at least one of the predicates will be included. The RHS is a glue specification, defining the message to be entered in the history graph for each predicate matched. This can refer to grouping variables, variables from the environment, {.included}, {.matched} or {.missing} (included = matched+missing), and {.count} and {.total} (group and overall counts respectively), e.g. "including {.matched} items and {.missing} with missing values". |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
na.rm |
(default TRUE) if the filter cannot be evaluated for a row, count that row as missing and either exclude it (TRUE) or don't exclude it (FALSE) |
.type |
default "inclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the filtered .data dataframe with the history graph updated with the summary of included items as a new stage
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% group_by(Species) %>% include_any(
Petal.Length > 5 ~ "{.included} long ones",
Petal.Length < 2 ~ "{.included} short ones"
) %>% history()
# simultaneous evaluation of criteria:
data.frame(a = 1:10) %>%
track() %>%
include_any(
# These two criteria identify the same value and one item is excluded
a > 1 ~ "{.included} value > 1",
a != min(a) ~ "{.included} everything but the smallest value",
) %>%
status() %>%
history()
# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
dplyr::filter(a > 1 | a != min(a)) %>%
nrow()
# step-wise evaluation of criteria results in a different output
data.frame(a = 1:10) %>%
track() %>%
# Performing the same exclusion sequentially results in 2 items
# being excluded as the criteria no longer identify the same
# item.
include_any(a > 1 ~ "{.included} value > 1") %>%
include_any(a != min(a) ~ "{.included} everything but the smallest value") %>%
status() %>%
history()
# the behaviour is equivalent to dplyr's filter function:
data.frame(a=1:10) %>%
dplyr::filter(a > 1) %>%
dplyr::filter(a != min(a)) %>%
nrow()
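Because the criteria are combined as "match any", the result can differ from requiring every criterion to hold. The base R sketch below uses two hypothetical, non-overlapping predicates to make that difference visible; it illustrates the logic only and does not involve any tracking.

```r
a = 1:10

is_small = a < 3
is_large = a > 8

# Include a row if it matches at least one criterion
included_any = a[is_small | is_large]

# Requiring all criteria at once keeps nothing here, as no value is both
included_all = a[is_small & is_large]

included_any          # 1 2 9 10
length(included_all)  # 0
```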
Inner joins
Description
Mutating joins behave as dplyr joins, except the history graphs of the two
sides of the join are merged, resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::inner_join() for more details
on the underlying functions.
Usage
p_inner_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
"{.count.out} in linked set"),
.headline = "Inner join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::inner_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Inner join
join = lhs %>% inner_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number
of resulting rows is calculated as {.count.out}. In other respects they perform
exactly the same operation as the equivalent dplyr operation. See
dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the
underlying function details.
Usage
p_intersect(
x,
y,
...,
.messages = "{.count.out} in intersection",
.headline = "Intersection"
)
Arguments
x , y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
generics::intersect()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
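To make the expected contents of each result in the examples above easier to follow, the same set logic can be sketched with base R character vectors. The labels below are hypothetical stand-ins for groups of rows, not the starwars data itself.

```r
lhs = c("Human", "Droid")
rhs = c("Human", "Gungan")

# Keeping duplicates, as bind_rows() and union_all() do for rows
all_rows = c(lhs, rhs)

# Removing duplicates
either = union(lhs, rhs)

# Present on the left but not on the right
left_only = setdiff(lhs, rhs)

# Present on both sides
both = intersect(lhs, rhs)

length(all_rows)  # 4
either            # "Human" "Droid" "Gungan"
left_only         # "Droid"
both              # "Human"
```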
Left join
Description
Mutating joins behave as dplyr joins, except the history graphs of the two
sides of the join are merged, resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::left_join() for more details
on the underlying functions.
Usage
p_left_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
"{.count.out} in linked set"),
.headline = "Left join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::left_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Left join
join = lhs %>% left_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to leave these out of the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
Usage
p_mutate(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
See Also
dplyr::mutate()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# mutate
# In this example we compare the column names of the input and the
# output to identify the new columns created by the mutate operation as
# the `.new_cols` variable
iris %>%
track() %>%
mutate(extra_col = NA_real_,
.messages="{.new_cols}",
.headline="Extra columns from mutate:") %>%
history()
Nest join
Description
Mutating joins behave as dplyr joins, except the history graphs of the two
sides of the join are merged, resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::nest_join() for more details
on the underlying functions.
Usage
p_nest_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS", "{.count.out} matched"),
.headline = "Nest join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::nest_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Nest join
join = lhs %>% nest_join(rhs, by="name") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
Pause tracking the data frame.
Description
Pausing tracking of a data frame may be required if an operation is about to
be performed that creates a lot of groupings or that you otherwise don't
want to pollute the history graph (e.g. maybe selecting something using
an anti-join). Once paused the history is not updated until resume() is
called, or when the data frame is ungrouped (if auto is enabled).
Usage
p_pause(.data, auto = FALSE)
Arguments
.data |
a tracked dataframe |
auto |
if |
Value
the .data dataframe with history graph tracking paused
Examples
iris %>% track() %>% pause() %>% history()
Reshaping data using tidyr::pivot_longer
Description
A drop-in replacement for tidyr::pivot_longer() which optionally takes a
message and headline to store in the history graph.
Usage
p_pivot_longer(data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the result of the tidyr::pivot_longer() function but with the history graph
updated.
See Also
tidyr::pivot_longer()
Reshaping data using tidyr::pivot_wider
Description
A drop-in replacement for tidyr::pivot_wider() which optionally takes a
message and headline to store in the history graph.
Usage
p_pivot_wider(data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the data dataframe result of the tidyr::pivot_wider() function but with
the history graph updated with a message if requested.
See Also
tidyr::pivot_wider()
Summarise a data set
Description
Summarising a data set acts in the normal dplyr manner to collapse groups
to individual rows. Any columns resulting from the summary can be added to
the history graph. In the history this also joins any stratified branches and
allows you to generate some summary statistics about the un-grouped data. See
dplyr::summarise().
Usage
p_reframe(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
See Also
dplyr::reframe()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% group_by(Species) %>% track()
tmp %>% reframe(tibble(
  param = c("mean","min","max"),
  value = c(mean(Petal.Length), min(Petal.Length), max(Petal.Length))
), .messages="length {param}: {value}") %>% history()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(),
dplyr::rename_with(), dplyr::arrange() for more details on underlying
functions. dtrackr provides equivalent functions for mutating, selecting and
renaming a data set which act in the same way as dplyr. mutate / select /
rename generally don't add anything in terms of provenance of data so the
default behaviour is to miss these out of the dtrackr history. This can be
overridden with the .messages, or .headline values in which case they behave
just like a comment().
Usage
p_relocate(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function,
but with the history graph updated with a new stage if the .messages or
.headline parameter is not empty.
See Also
dplyr::relocate()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# relocate, this shows how the columns can be reordered
iris %>%
  track() %>%
  group_by(Species) %>%
  relocate(
    tidyselect::starts_with("Sepal"),
    .after=Species,
    .messages="{.cols}",
    .headline="Order of columns from relocate:") %>%
  history()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(),
dplyr::rename_with(), dplyr::arrange() for more details on underlying
functions. dtrackr provides equivalent functions for mutating, selecting and
renaming a data set which act in the same way as dplyr. mutate / select /
rename generally don't add anything in terms of provenance of data so the
default behaviour is to miss these out of the dtrackr history. This can be
overridden with the .messages, or .headline values in which case they behave
just like a comment().
Usage
p_rename(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function,
but with the history graph updated with a new stage if the .messages or
.headline parameter is not empty.
See Also
dplyr::rename()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# rename can show us which columns are new and which have been
# removed (with .dropped_cols)
iris %>%
  track() %>%
  group_by(Species) %>%
  rename(
    Stamen.Width = Sepal.Width,
    Stamen.Length = Sepal.Length,
    .messages=c("added {.new_cols}","dropped {.dropped_cols}"),
    .headline="Renamed columns:") %>%
  history()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(),
dplyr::rename_with(), dplyr::arrange() for more details on underlying
functions. dtrackr provides equivalent functions for mutating, selecting and
renaming a data set which act in the same way as dplyr. mutate / select /
rename generally don't add anything in terms of provenance of data so the
default behaviour is to miss these out of the dtrackr history. This can be
overridden with the .messages, or .headline values in which case they behave
just like a comment().
Usage
p_rename_with(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function,
but with the history graph updated with a new stage if the .messages or
.headline parameter is not empty.
See Also
dplyr::rename_with()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# rename can show us which columns are new and which have been
# removed (with .dropped_cols)
iris %>%
  track() %>%
  group_by(Species) %>%
  rename(
    Stamen.Width = Sepal.Width,
    Stamen.Length = Sepal.Length,
    .messages=c("added {.new_cols}","dropped {.dropped_cols}"),
    .headline="Renamed columns:") %>%
  history()
Resume tracking the data frame.
Description
This may reset the grouping of the tracked data if the grouping structure
has changed since the data frame was paused. If you try to resume tracking a
data frame with too many groups (as defined by
options("dtrackr.max_supported_groupings"=XX)) then the resume will fail and
the data frame will still be paused. This can be overridden by specifying a
value for the .maxgroups parameter.
Usage
p_resume(.data, ...)
Arguments
.data |
a tracked dataframe |
... |
Named arguments passed on to
|
Value
the .data data frame with history graph tracking resumed
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% pause() %>% resume() %>% history()
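A slightly fuller sketch of the pause and resume cycle follows. It is an illustration under the assumption that operations applied while tracking is paused add nothing to the history, so only the comment made after resume() should appear as a new stage; the message text is illustrative.

```r
library(dplyr)
library(dtrackr)

out = iris %>%
  track() %>%
  pause() %>%
  # this filter runs while tracking is paused
  filter(Species != "setosa") %>%
  resume() %>%
  # tracking is active again, so this comment is recorded
  comment("{.count} rows after resuming")

out %>% history()
nrow(out)
```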
Right join
Description
Mutating joins behave as dplyr joins, except that the history graphs of the
two sides of the join are merged, resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::right_join() for more details
on the underlying functions.
Usage
p_right_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
    "{.count.out} in linked set"),
  .headline = "Right join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::right_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Right join
join = lhs %>% right_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
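The glue variables listed under Arguments can be combined into custom messages. The sketch below is illustrative only: the two small tibbles and the message wording are made up, while {.keys}, {.count.lhs}, {.count.rhs} and {.count.out} are the variables described above.

```r
library(dplyr)
library(dtrackr)

# Two small illustrative tables sharing an id column
lhs = tibble(id = 1:3, x = c("a", "b", "c")) %>% track()
rhs = tibble(id = 2:5, y = c("B", "C", "D", "E")) %>% track()

# A right join keeps every row of rhs; the sizes of both inputs and of
# the result are reported through the glue variables.
join = lhs %>% right_join(
  rhs,
  by = "id",
  .headline = "Right join by {.keys}",
  .messages = "{.count.lhs} LHS + {.count.rhs} RHS gives {.count.out} rows"
)

join %>% history()
nrow(join)
```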
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(), dplyr::rename(),
dplyr::rename_with(), dplyr::arrange() for more details on underlying
functions. dtrackr provides equivalent functions for mutating, selecting and
renaming a data set which act in the same way as dplyr. mutate / select /
rename generally don't add anything in terms of provenance of data so the
default behaviour is to miss these out of the dtrackr history. This can be
overridden with the .messages, or .headline values in which case they behave
just like a comment().
Usage
p_select(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent function,
but with the history graph updated with a new stage if the .messages or
.headline parameter is not empty.
See Also
dplyr::select()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# select
# The output of the select verb (here using tidyselect syntax) can be captured
# and here all column names are being reported with the .cols variable.
iris %>%
  track() %>%
  group_by(Species) %>%
  select(
    tidyselect::starts_with("Sepal"),
    .messages="{.cols}",
    .headline="Output columns from select:") %>%
  history()
Semi join
Description
Mutating joins behave as dplyr joins, except that the history graphs of the
two sides of the join are merged, resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::semi_join() for more details
on the underlying functions.
Usage
p_semi_join(
  x,
  y,
  ...,
  .messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
    "{.count.out} in intersection"),
  .headline = "Semi join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::semi_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Semi join
join = lhs %>% semi_join(rhs, by="name") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
Set the dtrackr history graph
Description
This is unlikely to be useful to an end user and is called automatically by many of the other functions here. It is provided on the off chance you need to copy history metadata from one dataframe to another.
Usage
p_set(.data, .graph)
Arguments
.data |
a dataframe which may be grouped |
.graph |
a history graph list (consisting of nodes, edges, and head) see examples |
Value
the .data dataframe with the history graph metadata set to the provided value
Examples
library(dplyr)
library(dtrackr)
mtcars %>% p_set(iris %>% comment("A comment") %>% p_get()) %>% history()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number
of resulting rows is calculated as {.count.out}; in other respects each
performs exactly the same operation as the equivalent dplyr operation. See
dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
Usage
p_setdiff(
  x,
  y,
  ...,
  .messages = "{.count.out} items in difference",
  .headline = "Difference"
)
Arguments
x , y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
dplyr::setdiff()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
  species == "Human" ~ "{.included} humans",
  species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Slice operations
Description
Slice operations behave as in dplyr, except the history graph of the tracked
dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more
details on the underlying functions.
Usage
p_slice(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice()
Examples
library(dplyr)
library(dtrackr)
# an arbitrary 50 items from the iris dataframe are selected. The
# history is tracked
iris %>% track() %>% slice(51:100) %>% history()
Slice operations
Description
Slice operations behave as in dplyr, except the history graph of the tracked
dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more
details on the underlying functions.
Usage
p_slice_head(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_head()
Examples
library(dplyr)
library(dtrackr)
# the first 50% of the data frame is taken and the history tracked
iris %>% track() %>% group_by(Species) %>%
  slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
    .headline="First {sprintf('%1.0f',prop*100)}%") %>%
  history()
# The last 100 items:
iris %>% track() %>% group_by(Species) %>%
  slice_tail(n=100,.messages="{.count.out} / {.count.in}",
    .headline="Last 100") %>%
  history()
Slice operations
Description
Slice operations behave as in dplyr, except the history graph of the tracked
dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more
details on the underlying functions.
Usage
p_slice_max(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_max()
Examples
library(dplyr)
library(dtrackr)
# Subset the data by the maximum of a given value
iris %>% track() %>% group_by(Species) %>%
  slice_max(prop=0.5, order_by = Sepal.Width,
    .messages="{.count.out} / {.count.in} = {prop} (with ties)",
    .headline="Widest 50% Sepals") %>%
  history()
# The narrowest 25% of the iris data set by group can be calculated in the
# slice_min() function. Recording this is a matter of tracking and
# using glue specs.
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_min(prop=0.25, order_by = Sepal.Width,
    .messages="{.count.out} / {.count.in} (with ties)",
    .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
  history()
Slice operations
Description
Slice operations behave as in dplyr, except the history graph of the tracked
dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more
details on the underlying functions.
Usage
p_slice_min(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_min()
Examples
library(dplyr)
library(dtrackr)
# Subset the data by the maximum of a given value
iris %>% track() %>% group_by(Species) %>%
  slice_max(prop=0.5, order_by = Sepal.Width,
    .messages="{.count.out} / {.count.in} = {prop} (with ties)",
    .headline="Widest 50% Sepals") %>%
  history()
# The narrowest 25% of the iris data set by group can be calculated in the
# slice_min() function. Recording this is a matter of tracking and
# using glue specs.
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_min(prop=0.25, order_by = Sepal.Width,
    .messages="{.count.out} / {.count.in} (with ties)",
    .headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
  history()
Slice operations
Description
Slice operations behave as in dplyr, except the history graph of the tracked
dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more
details on the underlying functions.
Usage
p_slice_sample(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_sample()
Examples
library(dplyr)
library(dtrackr)
# In this example the iris dataframe is resampled 100 times with replacement
# within each group.
iris %>%
  track() %>%
  group_by(Species) %>%
  slice_sample(n=100, replace=TRUE,
    .messages="{.count.out} / {.count.in} = {n}",
    .headline="100 {Species}") %>%
  history()
Slice operations
Description
Slice operations behave as in dplyr, except the history graph of the tracked
dataframe can be updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample() for more
details on the underlying functions.
Usage
p_slice_tail(
  .data,
  ...,
  .messages = c("{.count.in} before", "{.count.out} after"),
  .headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods.
|
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_tail()
Examples
library(dplyr)
library(dtrackr)
# the first 50% of the data frame is taken and the history tracked
iris %>% track() %>% group_by(Species) %>%
  slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
    .headline="First {sprintf('%1.0f',prop*100)}%") %>%
  history()
# The last 100 items:
iris %>% track() %>% group_by(Species) %>%
  slice_tail(n=100,.messages="{.count.out} / {.count.in}",
    .headline="Last 100") %>%
  history()
Add a summary to the dtrackr history graph
Description
In the middle of a pipeline you may wish to document something about the data
that is more complex than the simple counts. status is essentially a dplyr
summarisation step which is connected to a glue specification output, that is
recorded in the data frame history. This means you can do an arbitrary interim
summarisation and put the result into the flowchart without disrupting the
pipeline flow.
Usage
p_status(
  .data,
  ...,
  .messages = .defaultMessage(),
  .headline = .defaultHeadline(),
  .type = "info",
  .asOffshoot = FALSE,
  .tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
... |
any normal dplyr::summarise specification, e.g. |
.messages |
a character vector of glue specifications. A glue specification can refer to the summary outputs, any grouping variables of .data, the {.strata}, or any variables defined in the calling environment |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Details
Because they are passed through the ... parameter, the summary specification parameters MUST BE NAMED.
Value
the same .data dataframe with the history metadata updated with the status inserted as a new stage
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% group_by(Species)
tmp %>% status(
  long = p_count_if(Petal.Length>5),
  short = p_count_if(Petal.Length<2),
  .messages="{Species}: {long} long ones & {short} short ones"
) %>% history()
Summarise a data set
Description
Summarising a data set acts in the normal dplyr manner to collapse groups
to individual rows. Any columns resulting from the summary can be added to
the history graph. In the history this also joins any stratified branches and
allows you to generate some summary statistics about the un-grouped data. See
dplyr::summarise().
Usage
p_summarise(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
See Also
dplyr::summarise()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% group_by(Species) %>% track()
tmp %>% summarise(avg = mean(Petal.Length), .messages="{avg} length") %>% history()
Retrieve tagged data in the history graph
Description
Any counts at the individual stages that were stored with a .tag option in a
pipeline step can be recovered here. The idea is to provide a quick way to
access a single value for the counts or other details tagged in a pipeline,
in a format that can be reported in the text of a document (e.g. for a
results section). The consort statement vignette has more examples of use.
Usage
p_tagged(.data, .tag = NULL, .strata = NULL, .glue = NULL, ...)
Arguments
.data |
the tracked dataframe. |
.tag |
(optional) the tag to retrieve. |
.strata |
(optional) filter the tagged data by the strata. set to "" to filter just the top level ungrouped data. |
.glue |
(optional) a glue specification which will be applied to the tagged content to generate a |
... |
(optional) any other named parameters will be passed to |
Value
various things depending on what is requested.
By default a tibble with a .tag
column and all associated summary values in a nested .content
column.
If a .strata
column is specified the results are filtered to just those that match a given .strata
grouping (i.e. this will be the grouping label on the flowchart). Ungrouped content will have an empty "" as .strata.
If .tag
is specified the result will be for a single tag and .content
will be automatically un-nested to give a single un-nested dataframe of the content captured at the .tag
tagged step.
This could be single or multiple rows depending on whether the original data was grouped at the point of tagging.
If both the .tag
and .glue
are specified, a .label
column will be computed from .glue
and the tagged content. If the result of this is a single row then just the string value of .label
is returned.
If just the .glue
is specified, an un-nested dataframe with .tag
,.strata
and .label
columns is returned, with a label for each tag in each stratum.
If this seems complex then the best thing is to experiment until you get the output you want, leaving any .glue
options until you think you know what you are doing. It made sense at the time.
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% comment(.tag = "step1")
tmp = tmp %>% filter(Species!="versicolor") %>% group_by(Species)
tmp %>% comment(.tag="step2") %>% tagged(.glue = "{.count}/{.total}")
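As a further illustration of the return types described above, the following minimal sketch requests a single tag together with a glue specification. Going by the Value section, ungrouped data should give a single row of tagged content and therefore a single label string rather than a data frame. The names tracked and label are illustrative only.

```r
library(dplyr)
library(dtrackr)

# Tag a stage of an ungrouped pipeline so its counts can be retrieved later
tracked <- iris %>% track() %>% comment(.tag = "step1")

# Request one tag plus a glue spec. With ungrouped data there is a single
# row of tagged content, so a single label is expected back.
label <- tracked %>% tagged(.tag = "step1", .glue = "{.count}/{.total}")
print(label)
```

A label of this kind could then be dropped into inline text of a report, which is the use case the description has in mind.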
Start tracking the dtrackr history graph
Description
Start tracking the dtrackr history graph
Usage
p_track(
.data,
.messages = .defaultMessage(),
.headline = .defaultHeadline(),
.tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
.messages |
a character vector of glue specifications. A glue
specification can refer to any grouping variables of .data, or any
variables defined in the calling environment, the {.total} variable which
is the count of all rows, the {.count} variable which is the count of
rows in the current group and the {.strata} which describes the current
group. Defaults to the value of |
.headline |
a glue specification which can refer to grouping variables
of .data, or any variables defined in the calling environment, or the
{.total} variable which is |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe with additional history graph metadata, to allow tracking.
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% history()
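The glue variables listed for .messages can be combined in a custom first stage. The sketch below assumes a grouped input so that {.count} and {.strata} differ per group; the message text itself is only an example.

```r
library(dplyr)
library(dtrackr)

# Group first so that {.count} and {.strata} vary by group;
# {.total} is the count of all rows
tracked <- iris %>%
  group_by(Species) %>%
  track(.messages = "{.strata}: {.count} of {.total} rows")

# Inspect the first stage of the history graph
tracked %>% history()
```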
dplyr modifying operations
Description
See dplyr::mutate()
, dplyr::add_count()
, dplyr::add_tally()
,
dplyr::transmute()
, dplyr::select()
, dplyr::relocate()
,
dplyr::rename()
, dplyr::rename_with()
, dplyr::arrange()
for more details
on underlying functions. dtrackr
provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr
. mutate
/ select
/ rename
generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr
history. This can be overridden with the .messages
, or
.headline
values in which case they behave just like a comment()
.
Usage
p_transmute(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Name-value pairs passed on to dplyr::transmute(); see its documentation for the accepted values.
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data
dataframe after being modified by the dplyr
equivalent
function, but with the history graph updated with a new stage if the
.messages
or .headline
parameter is not empty.
See Also
dplyr::transmute()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# In this example we compare the column names of the input and the
# output to identify the new columns created by the transmute operation as
# the `.new_cols` variable
# Here we do the same for a transmute()
iris %>%
track() %>%
group_by(Species, .add=TRUE) %>%
transmute(
sepal.w = Sepal.Width-1,
sepal.l = Sepal.Length+1,
.messages="{.new_cols}",
.headline="New columns from transmute:") %>%
history()
Remove a stratification from a data set
Description
Un-grouping a data set logically combines the different arms. In the history
this joins any stratified branches and acts as a specific type of status()
,
allowing you to generate some summary statistics about the un-grouped data.
See dplyr::ungroup()
.
Usage
p_ungroup(
x,
...,
.messages = .defaultMessage(),
.headline = .defaultHeadline(),
.tag = NULL
)
Arguments
x |
A |
... |
variables to remove from the grouping. |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count}. The default is "total {.count} items" |
.headline |
a headline glue spec. The glue code can use {.count} and {.strata}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe but ungrouped with the history graph updated showing the ungroup operation as a new stage.
See Also
dplyr::ungroup()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% group_by(Species) %>% comment("A test")
tmp %>% ungroup(.messages="{.count} items in combined") %>% history()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is available as {.count.out}; in other respects each performs exactly the same
operation as the equivalent dplyr
operation. See dplyr::bind_rows()
,
dplyr::bind_cols()
, dplyr::intersect()
, dplyr::union()
,
dplyr::setdiff()
, or dplyr::union_all()
for the
underlying function details.
Usage
p_union(
x,
y,
...,
.messages = "{.count.out} unique items in union",
.headline = "Distinct union"
)
Arguments
x , y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
generics::union()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is available as {.count.out}; in other respects each performs exactly the same
operation as the equivalent dplyr
operation. See dplyr::bind_rows()
,
dplyr::bind_cols()
, dplyr::intersect()
, dplyr::union()
,
dplyr::setdiff()
, or dplyr::union_all()
for the
underlying function details.
Usage
p_union_all(
x,
y,
...,
.messages = "{.count.out} items in union",
.headline = "Union"
)
Arguments
x , y |
Pair of compatible data frames. A pair of data frames is compatible if they have the same column names (possibly in different orders) and compatible types. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
dplyr::union_all()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Remove tracking from the dataframe
Description
Remove tracking from the dataframe
Usage
p_untrack(.data)
Arguments
.data |
a tracked dataframe |
Value
the .data dataframe with history graph metadata removed.
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% untrack() %>% class()
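To make the effect of untrack() visible, the sketch below compares the class of a data frame before and after tracking is removed. It assumes that tracked data frames carry the trackr_df class, which is the class named in the S3 method signatures elsewhere in this manual.

```r
library(dplyr)
library(dtrackr)

tracked <- iris %>% track()
plain <- tracked %>% untrack()

# The tracked version is expected to carry the trackr_df class,
# the untracked version is not
inherits(tracked, "trackr_df")
inherits(plain, "trackr_df")
```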
Pause tracking the data frame.
Description
Pausing tracking of a data frame may be required if an operation is about to
be performed that creates a lot of groupings, or that you otherwise don't
want polluting the history graph (e.g. maybe selecting something using
an anti-join). Once paused the history is not updated until a resume()
is
called, or when the data frame is ungrouped (if auto
is enabled).
Usage
pause(.data, auto = FALSE)
Arguments
.data |
a tracked dataframe |
auto |
if |
Value
the .data dataframe with history graph tracking paused
Examples
iris %>% track() %>% pause() %>% history()
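A slightly longer sketch of the pause and resume cycle described above: a filter is applied while tracking is paused, so on the description's account it should leave no stage in the history, and tracking then continues from the comment added after resume(). The variable names are illustrative.

```r
library(dplyr)
library(dtrackr)

tracked <- iris %>% track()

# Operations performed while paused still change the data,
# but are not expected to be recorded in the history graph
paused <- tracked %>%
  pause() %>%
  filter(Species != "setosa")

# After resuming, new stages are recorded again
resumed <- paused %>%
  resume() %>%
  comment("{.count} rows after resuming")

resumed %>% history()
```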
Reshaping data using tidyr::pivot_longer
Description
A drop in replacement for tidyr::pivot_longer()
which optionally takes a
message and headline to store in the history graph.
Usage
## S3 method for class 'trackr_df'
pivot_longer(data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the result of tidyr::pivot_longer
but with the history graph
updated.
See Also
tidyr::pivot_longer()
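This section has no example, so the following is a minimal sketch of how the tracked method might be used. It assumes tidyr is attached so that the pivot_longer generic is available; the column names measure and value and the message text are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(dtrackr)

# Reshape the four measurement columns into long format and record
# the step in the history graph with a headline and a message
long <- iris %>%
  track() %>%
  pivot_longer(
    cols = -Species,
    names_to = "measure",
    values_to = "value",
    .messages = "reshaped to long format",
    .headline = "Pivot longer"
  )

long %>% history()
```

Without .messages or .headline the reshaping would be performed but, as the defaults are empty, no stage would be expected in the history.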
Reshaping data using tidyr::pivot_wider
Description
A drop in replacement for tidyr::pivot_wider()
which optionally takes a
message and headline to store in the history graph.
Usage
## S3 method for class 'trackr_df'
pivot_wider(data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the dataframe result of the tidyr::pivot_wider
function but with
the history graph updated with the .messages
if requested.
See Also
tidyr::pivot_wider()
Plots a history graph as html
Description
Plots a history graph as html
Usage
## S3 method for class 'trackr_graph'
plot(x, fill = "lightgrey", fontsize = "8", colour = "black", ...)
Arguments
x |
a dtrackr history graph (e.g. output from history()) |
fill |
the default node fill colour |
fontsize |
the default font size |
colour |
the default font colour |
... |
not used |
Value
HTML displayed
Examples
library(dplyr)
library(dtrackr)
iris %>% comment("hello {.total} rows") %>% history() %>% plot()
Print a history graph to the console
Description
Print a history graph to the console
Usage
## S3 method for class 'trackr_graph'
print(x, ...)
Arguments
x |
a dtrackr history graph (e.g. output from history()) |
... |
not used |
Value
nothing
Examples
library(dplyr)
library(dtrackr)
iris %>% comment("hello {.total} rows") %>% history() %>% print()
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- dplyr
Summarise a data set
Description
Summarising a data set acts in the normal dplyr
manner to collapse groups
to individual rows. Any columns resulting from the summary can be added to
the history graph. In the history this also joins any stratified branches and
allows you to generate some summary statistics about the un-grouped data. See
dplyr::summarise()
.
Usage
## S3 method for class 'trackr_df'
reframe(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Expressions passed on to dplyr::reframe(); see its documentation for the accepted values.
|
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
See Also
dplyr::reframe()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% group_by(Species) %>% track()
tmp %>% reframe(tibble(
param = c("mean","min","max"),
value = c(mean(Petal.Length), min(Petal.Length), max(Petal.Length))
), .messages="length {param}: {value}") %>% history()
dplyr modifying operations
Description
See dplyr::mutate()
, dplyr::add_count()
, dplyr::add_tally()
,
dplyr::transmute()
, dplyr::select()
, dplyr::relocate()
,
dplyr::rename()
, dplyr::rename_with()
, dplyr::arrange()
for more details
on underlying functions. dtrackr
provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr
. mutate
/ select
/ rename
generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr
history. This can be overridden with the .messages
, or
.headline
values in which case they behave just like a comment()
.
Usage
## S3 method for class 'trackr_df'
relocate(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Arguments passed on to dplyr::relocate(); see its documentation for the accepted values.
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data
dataframe after being modified by the dplyr
equivalent
function, but with the history graph updated with a new stage if the
.messages
or .headline
parameter is not empty.
See Also
dplyr::relocate()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# relocate, this shows how the columns can be reordered
iris %>%
track() %>%
group_by(Species) %>%
relocate(
tidyselect::starts_with("Sepal"),
.after=Species,
.messages="{.cols}",
.headline="Order of columns from relocate:") %>%
history()
dplyr modifying operations
Description
See dplyr::mutate()
, dplyr::add_count()
, dplyr::add_tally()
,
dplyr::transmute()
, dplyr::select()
, dplyr::relocate()
,
dplyr::rename()
, dplyr::rename_with()
, dplyr::arrange()
for more details
on underlying functions. dtrackr
provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr
. mutate
/ select
/ rename
generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr
history. This can be overridden with the .messages
, or
.headline
values in which case they behave just like a comment()
.
Usage
## S3 method for class 'trackr_df'
rename(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Arguments passed on to dplyr::rename(); see its documentation for the accepted values.
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data
dataframe after being modified by the dplyr
equivalent
function, but with the history graph updated with a new stage if the
.messages
or .headline
parameter is not empty.
See Also
dplyr::rename()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# rename can show us which columns are new and which have been
# removed (with .dropped_cols)
iris %>%
track() %>%
group_by(Species) %>%
rename(
Stamen.Width = Sepal.Width,
Stamen.Length = Sepal.Length,
.messages=c("added {.new_cols}","dropped {.dropped_cols}"),
.headline="Renamed columns:") %>%
history()
dplyr modifying operations
Description
See dplyr::mutate()
, dplyr::add_count()
, dplyr::add_tally()
,
dplyr::transmute()
, dplyr::select()
, dplyr::relocate()
,
dplyr::rename()
, dplyr::rename_with()
, dplyr::arrange()
for more details
on underlying functions. dtrackr
provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr
. mutate
/ select
/ rename
generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr
history. This can be overridden with the .messages
, or
.headline
values in which case they behave just like a comment()
.
Usage
## S3 method for class 'trackr_df'
rename_with(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Arguments passed on to dplyr::rename_with(); see its documentation for the accepted values.
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data
dataframe after being modified by the dplyr
equivalent
function, but with the history graph updated with a new stage if the
.messages
or .headline
parameter is not empty.
See Also
dplyr::rename_with()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# rename can show us which columns are new and which have been
# removed (with .dropped_cols)
iris %>%
track() %>%
group_by(Species) %>%
rename(
Stamen.Width = Sepal.Width,
Stamen.Length = Sepal.Length,
.messages=c("added {.new_cols}","dropped {.dropped_cols}"),
.headline="Renamed columns:") %>%
history()
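The example above uses rename(). A comparable sketch for rename_with() itself follows, assuming the renaming function is forwarded to dplyr::rename_with() through the ... argument as the usage suggests. The choice of toupper and the message text are illustrative.

```r
library(dplyr)
library(dtrackr)

# Upper-case every column name and report the change in the history
renamed <- iris %>%
  track() %>%
  rename_with(
    toupper,
    .messages = c("added {.new_cols}", "dropped {.dropped_cols}"),
    .headline = "Upper-cased columns:"
  )

renamed %>% history()
colnames(renamed)
```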
Resume tracking the data frame.
Description
This may reset the grouping of the tracked data if the grouping structure
has changed since the data frame was paused. If you try and resume tracking a
data frame with too many groups (as defined by options("dtrackr.max_supported_groupings"=XX)
)
then the resume will fail and the data frame will still be paused. This can
be overridden by specifying a value for the .maxgroups
parameter.
Usage
resume(.data, ...)
Arguments
.data |
a tracked dataframe |
... |
Named arguments passed on to
|
Value
the .data data frame with history graph tracking resumed
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% pause() %>% resume() %>% history()
Right join
Description
Mutating joins behave as dplyr
joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::right_join()
for more details
on the underlying functions.
Usage
## S3 method for class 'trackr_df'
right_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
"{.count.out} in linked set"),
.headline = "Right join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::right_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Right join
join = lhs %>% right_join(rhs, by="name", multiple = "all") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
Save DOT content to a file
Description
Convert a digraph in dot format to SVG and save it to a range of output file types
Usage
save_dot(
dot,
filename,
size = std_size$half,
maxWidth = size$width,
maxHeight = size$height,
formats = c("dot", "png", "pdf", "svg"),
landscape = size$rot != 0,
...
)
Arguments
dot |
a |
filename |
the full path of the file name (minus extension for multiple formats) |
size |
a named list with 3 elements: width and height in inches, and rot, the rotation. A predefined set of standard sizes is available in the std_size object. |
maxWidth |
a width (on the paper) in inches if |
maxHeight |
a height (on the paper) in inches if |
formats |
some of |
landscape |
rotate the output by 270 degrees into a landscape format.
|
... |
ignored |
Value
a list with items paths
with the absolute paths of the saved files
as a named list, and svg
as the SVG string of the rendered dot file.
Examples
save_dot("digraph {A->B}",tempfile())
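The formats argument can be used to restrict the files that are written. The sketch below asks for only the dot and svg outputs and then inspects the returned list, which according to the Value section holds the saved paths and the SVG string. The variable name out is illustrative.

```r
library(dtrackr)

# Write only the dot and svg versions of a small graph to a temporary
# location (the file extension is added per format)
out <- save_dot(
  "digraph {A->B}",
  tempfile(),
  formats = c("dot", "svg")
)

# The result is described as a list with `paths` and `svg` items
names(out)
out$paths
```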
dplyr modifying operations
Description
See dplyr::mutate()
, dplyr::add_count()
, dplyr::add_tally()
,
dplyr::transmute()
, dplyr::select()
, dplyr::relocate()
,
dplyr::rename()
, dplyr::rename_with()
, dplyr::arrange()
for more details
on underlying functions. dtrackr
provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr
. mutate
/ select
/ rename
generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr
history. This can be overridden with the .messages
, or
.headline
values in which case they behave just like a comment()
.
Usage
## S3 method for class 'trackr_df'
select(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Arguments passed on to dplyr::select(); see its documentation for the accepted values.
|
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data
dataframe after being modified by the dplyr
equivalent
function, but with the history graph updated with a new stage if the
.messages
or .headline
parameter is not empty.
See Also
dplyr::select()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# select
# The output of the select verb (here using tidyselect syntax) can be captured
# and here all column names are being reported with the .cols variable.
iris %>%
track() %>%
group_by(Species) %>%
select(
tidyselect::starts_with("Sepal"),
.messages="{.cols}",
.headline="Output columns from select:") %>%
history()
Semi join
Description
Mutating joins behave as dplyr
joins, except the history graph of the two
sides of the joins is merged resulting in a tracked dataframe with the
history of both input dataframes. See dplyr::semi_join()
for more details
on the underlying functions.
Usage
## S3 method for class 'trackr_df'
semi_join(
x,
y,
...,
.messages = c("{.count.lhs} on LHS", "{.count.rhs} on RHS",
"{.count.out} in intersection"),
.headline = "Semi join by {.keys}"
)
Arguments
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
Other parameters passed onto methods.
Named arguments passed on to
|
.messages |
a set of glue specs. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
.headline |
a glue spec. The glue code can use any global variable, {.keys} for the joining columns, {.count.lhs}, {.count.rhs}, {.count.out} for the input and output dataframes sizes respectively |
Value
the join of the two dataframes with the history graph updated.
See Also
dplyr::semi_join()
Examples
library(dplyr)
library(dtrackr)
# Joins across data sets
# example data uses the dplyr starwars data
people = starwars %>% select(-films, -vehicles, -starships)
films = starwars %>% select(name,films) %>% tidyr::unnest(cols = c(films))
lhs = people %>% track() %>% comment("People df {.total}")
rhs = films %>% track() %>% comment("Films df {.total}") %>%
comment("a test comment")
# Semi join
join = lhs %>% semi_join(rhs, by="name") %>% comment("joined {.total}")
# See what the history of the graph is:
join %>% history() %>% print()
nrow(join)
# Display the tracked graph (not run in examples)
# join %>% flowchart()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is available as {.count.out}; in other respects each performs exactly the same
operation as the equivalent dplyr
operation. See dplyr::bind_rows()
,
dplyr::bind_cols()
, dplyr::intersect()
, dplyr::union()
,
dplyr::setdiff()
, or dplyr::union_all()
for the
underlying function details.
Usage
## S3 method for class 'trackr_df'
setdiff(
x,
y,
...,
.messages = "{.count.out} items in difference",
.headline = "Difference"
)
Arguments
x , y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
dplyr::setdiff()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
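A smaller sketch of the same idea, using two made-up tibbles rather than the starwars data, may make the role of {.count.out} easier to see. The tibbles, their id column and the message wording are illustrative assumptions; only setdiff(), track(), history() and the .messages / .headline parameters come from the documentation above.

```r
library(dplyr)
library(dtrackr)

# Two small tracked tibbles that overlap on ids 4 and 5 (illustrative data)
a = tibble(id = 1:5) %>% track()
b = tibble(id = 4:8) %>% track()

# Rows of a that are not in b; the glue spec reports the output row count
only_a = setdiff(
  a, b,
  .messages = "{.count.out} ids found only in a",
  .headline = "Only in a"
)

# The merged history of both inputs plus the new set operation stage
only_a %>% history()
nrow(only_a)
```

The result holds ids 1, 2 and 3, so the recorded message should report the same count as nrow().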
Slice operations
Description
Slice operations behave as in dplyr, except that for a tracked dataframe the
history graph is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample()
for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
slice(
.data,
...,
.messages = c("{.count.in} before", "{.count.out} after"),
.headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice()
Examples
library(dplyr)
library(dtrackr)
# an arbitrary 50 items (rows 51 to 100) from the iris dataframe are
# selected. The history is tracked
iris %>% track() %>% slice(51:100) %>% history()
Slice operations
Description
Slice operations behave as in dplyr, except that for a tracked dataframe the
history graph is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample()
for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
slice_head(
.data,
...,
.messages = c("{.count.in} before", "{.count.out} after"),
.headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_head()
Examples
library(dplyr)
library(dtrackr)
# the first 50% of each group is taken and the history tracked
iris %>% track() %>% group_by(Species) %>%
slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
.headline="First {sprintf('%1.0f',prop*100)}%") %>%
history()
# The last 100 items:
iris %>% track() %>% group_by(Species) %>%
slice_tail(n=100,.messages="{.count.out} / {.count.in}",
.headline="Last 100") %>%
history()
Slice operations
Description
Slice operations behave as in dplyr, except that for a tracked dataframe the
history graph is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample()
for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
slice_max(
.data,
...,
.messages = c("{.count.in} before", "{.count.out} after"),
.headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_max()
Examples
library(dplyr)
library(dtrackr)
# Subset the data by the maximum of a given value
iris %>% track() %>% group_by(Species) %>%
slice_max(prop=0.5, order_by = Sepal.Width,
.messages="{.count.out} / {.count.in} = {prop} (with ties)",
.headline="Widest 50% Sepals") %>%
history()
# The narrowest 25% of the iris data set by group can be calculated in the
# slice_min() function. Recording this is a matter of tracking and
# using glue specs.
iris %>%
track() %>%
group_by(Species) %>%
slice_min(prop=0.25, order_by = Sepal.Width,
.messages="{.count.out} / {.count.in} (with ties)",
.headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
history()
Slice operations
Description
Slice operations behave as in dplyr, except that for a tracked dataframe the
history graph is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample()
for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
slice_min(
.data,
...,
.messages = c("{.count.in} before", "{.count.out} after"),
.headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_min()
Examples
library(dplyr)
library(dtrackr)
# Subset the data by the maximum of a given value
iris %>% track() %>% group_by(Species) %>%
slice_max(prop=0.5, order_by = Sepal.Width,
.messages="{.count.out} / {.count.in} = {prop} (with ties)",
.headline="Widest 50% Sepals") %>%
history()
# The narrowest 25% of the iris data set by group can be calculated in the
# slice_min() function. Recording this is a matter of tracking and
# using glue specs.
iris %>%
track() %>%
group_by(Species) %>%
slice_min(prop=0.25, order_by = Sepal.Width,
.messages="{.count.out} / {.count.in} (with ties)",
.headline="narrowest {sprintf('%1.0f',prop*100)}% {Species}") %>%
history()
Slice operations
Description
Slice operations behave as in dplyr, except that for a tracked dataframe the
history graph is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample()
for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
slice_sample(
.data,
...,
.messages = c("{.count.in} before", "{.count.out} after"),
.headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_sample()
Examples
library(dplyr)
library(dtrackr)
# In this example the iris dataframe is resampled with replacement,
# drawing 100 rows within each group.
iris %>%
track() %>%
group_by(Species) %>%
slice_sample(n=100, replace=TRUE,
.messages="{.count.out} / {.count.in} = {n}",
.headline="100 {Species}") %>%
history()
Slice operations
Description
Slice operations behave as in dplyr, except that for a tracked dataframe the
history graph is updated with the before and after sizes of the dataframe.
See dplyr::slice(), dplyr::slice_head(), dplyr::slice_tail(),
dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_sample()
for more details on the underlying functions.
Usage
## S3 method for class 'trackr_df'
slice_tail(
.data,
...,
.messages = c("{.count.in} before", "{.count.out} after"),
.headline = "slice data"
)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For slice(): integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.messages |
a set of glue specs. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively and {.excluded} for the difference |
.headline |
a glue spec. The glue code can use any global variable, {.count.in}, {.count.out} for the input and output dataframes sizes respectively. |
Value
the sliced dataframe with the history graph updated.
See Also
dplyr::slice_tail()
Examples
library(dplyr)
library(dtrackr)
# the first 50% of each group is taken and the history tracked
iris %>% track() %>% group_by(Species) %>%
slice_head(prop=0.5,.messages="{.count.out} / {.count.in}",
.headline="First {sprintf('%1.0f',prop*100)}%") %>%
history()
# The last 100 items:
iris %>% track() %>% group_by(Species) %>%
slice_tail(n=100,.messages="{.count.out} / {.count.in}",
.headline="Last 100") %>%
history()
Add a summary to the dtrackr history graph
Description
In the middle of a pipeline you may wish to document something about the data
that is more complex than the simple counts. status is essentially a dplyr
summarisation step connected to a glue specification, whose output is
recorded in the data frame history. This means you can do an
arbitrary interim summarisation and put the result into the flowchart without
disrupting the pipeline flow.
Usage
status(
.data,
...,
.messages = .defaultMessage(),
.headline = .defaultHeadline(),
.type = "info",
.asOffshoot = FALSE,
.tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
... |
any normal dplyr::summarise specification, e.g. |
.messages |
a character vector of glue specifications. A glue specification can refer to the summary outputs, any grouping variables of .data, the {.strata}, or any variables defined in the calling environment |
.headline |
a glue specification which can refer to grouping variables of .data, or any variables defined in the calling environment |
.type |
one of "info","exclusion": used to define formatting |
.asOffshoot |
do you want this comment to be an offshoot of the main flow (default = FALSE). |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Details
Because the summary specification is passed through ..., the .messages, .headline and other parameters MUST BE NAMED.
Value
the same .data dataframe with the history metadata updated with the status inserted as a new stage
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% group_by(Species)
tmp %>% status(
long = p_count_if(Petal.Length>5),
short = p_count_if(Petal.Length<2),
.messages="{Species}: {long} long ones & {short} short ones"
) %>% history()
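As a further sketch of the naming requirement described in Details, the call below summarises an ungrouped dataframe. The summary names mean_len and max_len and the message wording are illustrative assumptions; the point is that each summary expression and each dot-prefixed parameter is passed by name.

```r
library(dplyr)
library(dtrackr)

# Every summary expression is named, and .messages is passed by name,
# so nothing is swallowed positionally by the ... argument
out = iris %>%
  track() %>%
  status(
    mean_len = mean(Petal.Length),
    max_len = max(Petal.Length),
    .messages = "mean {sprintf('%1.2f', mean_len)}, max {max_len}"
  )

# The summary appears as a stage in the history ...
out %>% history()

# ... but the data itself passes through unchanged
nrow(out)
```

Because status() only annotates the history, the pipeline can carry on with the full dataframe afterwards.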
Standard paper sizes
Description
A list of standard paper sizes for outputting flowcharts or other dot
graphs. These include width and height dimensions in inches and can be
used as one way to specify the output size of a dot graph, including
flowcharts (see the size parameter of flowchart()).
Usage
std_size
Format
An object of class list of length 12.
Details
The sizes available are A4, A5, full (fits a portrait A4 with margins),
half (half an A4 with margins), third, two_third, quarter, sixth (all with
reference to an A4 page with margins). There are 2 landscape sizes,
A4_landscape and full_landscape, which fit an A4 page with or without margins.
There are also 2 slide dimensions, to fit with standard presentation software
dimensions.
This is just a convenience. Similar effects can be achieved by providing width
and height parameters to flowchart() directly.
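A short sketch of inspecting the list before choosing a size. It relies only on what is stated above (a list of length 12 that includes an A4 entry). The internal layout of each entry is not spelled out here, so str() is used to look at it rather than assuming field names. The flowchart() call is left commented out because it renders a graph.

```r
library(dplyr)
library(dtrackr)

# The names of the available sizes, and how many there are
names(std_size)
length(std_size)

# Inspect one entry to see how its dimensions are stored
str(std_size$A4)

# not run - pass one of the entries as the size parameter of flowchart():
# iris %>% track() %>% flowchart(size = std_size$half)
```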
Summarise a data set
Description
Summarising a data set acts in the normal dplyr manner to collapse groups
to individual rows. Any columns resulting from the summary can be added to
the history graph. In the history this also joins any stratified branches and
allows you to generate some summary statistics about the un-grouped data. See
dplyr::summarise().
Usage
## S3 method for class 'trackr_df'
summarise(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
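... |
Name-value pairs of summary functions. The name will be the name of the variable in the result. The value can be: a vector of length 1, e.g. min(x), n(), or sum(is.na(y)); or a data frame, to add multiple columns from a single expression. |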
.messages |
a set of glue specs. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.headline |
a headline glue spec. The glue code can use any summary variable defined in the ... parameter, or any global variable, or {.strata} |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe summarised with the history graph updated showing the summarise operation as a new stage
See Also
dplyr::summarise()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% group_by(Species) %>% track()
tmp %>% summarise(avg = mean(Petal.Length), .messages="{avg} length") %>% history()
Retrieve tagged data in the history graph
Description
Any counts at the individual stages that were stored with a .tag option in a
pipeline step can be recovered here. The idea is to provide a quick way to turn
the counts or other details tagged in a pipeline into a single value that can be
reported in the text of a document (e.g. in a results section). The consort
statement vignette has further examples of use.
Usage
tagged(.data, .tag = NULL, .strata = NULL, .glue = NULL, ...)
Arguments
.data |
the tracked dataframe. |
.tag |
(optional) the tag to retrieve. |
.strata |
(optional) filter the tagged data by the strata. set to "" to filter just the top level ungrouped data. |
.glue |
(optional) a glue specification which will be applied to the tagged content to generate a .label column |
... |
(optional) any other named parameters will be passed to |
Value
Various things depending on what is requested.
By default, a tibble with a .tag column and all associated summary values in a nested .content column.
If .strata is specified, the results are filtered to just those that match the given .strata grouping (i.e. this will be the grouping label on the flowchart). Ungrouped content will have an empty string "" as .strata.
If .tag is specified, the result will be for a single tag and .content will be automatically un-nested to give a single un-nested dataframe of the content captured at the .tag tagged step.
This could be single or multiple rows depending on whether the original data was grouped at the point of tagging.
If both .tag and .glue are specified, a .label column will be computed from .glue and the tagged content. If the result of this is a single row then just the string value of .label is returned.
If just .glue is specified, the result is an un-nested dataframe with .tag, .strata and .label columns, with a label for each tag in each strata.
If this seems complex then the best thing is to experiment until you get the output you want, leaving any .glue
options until you think you know what you are doing. It made sense at the time.
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% track() %>% comment(.tag = "step1")
tmp = tmp %>% filter(Species!="versicolor") %>% group_by(Species)
tmp %>% comment(.tag="step2") %>% tagged(.glue = "{.count}/{.total}")
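The case where both .tag and .glue are given can be sketched as follows. The tag name "start" and the glue wording are illustrative assumptions. Going by the Value section above, a single tagged row on ungrouped data should come back as a plain string rather than a dataframe, which is what makes it convenient for inline reporting.

```r
library(dplyr)
library(dtrackr)

# Tag the initial, ungrouped state of the pipeline
tmp = iris %>% track() %>% comment(.tag = "start")

# Ask for one tag and format it with a glue spec
label = tmp %>% tagged(.tag = "start", .glue = "{.count} of {.total} rows")
label

# Without .glue the same tag gives the un-nested content as a dataframe
tmp %>% tagged(.tag = "start")
```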
Start tracking the dtrackr history graph
Description
Start tracking the dtrackr history graph
Usage
track(
.data,
.messages = .defaultMessage(),
.headline = .defaultHeadline(),
.tag = NULL
)
Arguments
.data |
a dataframe which may be grouped |
.messages |
a character vector of glue specifications. A glue
specification can refer to any grouping variables of .data, or any
variables defined in the calling environment, the {.total} variable which
is the count of all rows, the {.count} variable which is the count of
rows in the current group and the {.strata} which describes the current
group. Defaults to the value of |
.headline |
a glue specification which can refer to grouping variables
of .data, or any variables defined in the calling environment, or the
{.total} variable which is |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe with additional history graph metadata, to allow tracking.
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% history()
dplyr modifying operations
Description
See dplyr::mutate(), dplyr::add_count(), dplyr::add_tally(),
dplyr::transmute(), dplyr::select(), dplyr::relocate(),
dplyr::rename(), dplyr::rename_with(), dplyr::arrange() for more details
on underlying functions. dtrackr provides equivalent functions for
mutating, selecting and renaming a data set which act in the same way as
dplyr. mutate / select / rename generally don't add anything in terms
of provenance of data so the default behaviour is to miss these out of the
dtrackr history. This can be overridden with the .messages or
.headline values, in which case they behave just like a comment().
Usage
## S3 method for class 'trackr_df'
transmute(.data, ..., .messages = "", .headline = "", .tag = NULL)
Arguments
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
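... |
Name-value pairs. The name gives the name of the column in the output. The value can be: a vector of length 1, which will be recycled to the correct length; a vector the same length as the current group (or the whole data frame if ungrouped); NULL, to remove the column; or a data frame or tibble, to create multiple columns in the output. |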
.messages |
a set of glue specs. The glue code can use any global variable, grouping variable, {.new_cols} or {.dropped_cols} for changes to columns, {.cols} for the output column names, or {.strata}. Defaults to nothing. |
.headline |
a headline glue spec. The glue code can use any global variable, grouping variable, {.new_cols}, {.dropped_cols}, {.cols} or {.strata}. Defaults to nothing. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe after being modified by the dplyr equivalent
function, but with the history graph updated with a new stage if the
.messages or .headline parameter is not empty.
See Also
dplyr::transmute()
Examples
library(dplyr)
library(dtrackr)
# mutate and other functions are unitary operations that generally change
# the structure but not size of a dataframe. In dtrackr these are ignored
# by default but we can change that so that their behaviour is obvious.
# In this example we compare the column names of the input and the
# output to identify the new columns created by the transmute operation as
# the `.new_cols` variable
iris %>%
track() %>%
group_by(Species, .add=TRUE) %>%
transmute(
sepal.w = Sepal.Width-1,
sepal.l = Sepal.Length+1,
.messages="{.new_cols}",
.headline="New columns from transmute:") %>%
history()
Remove a stratification from a data set
Description
Un-grouping a data set logically combines the different arms. In the history
this joins any stratified branches and acts as a specific type of status(),
allowing you to generate some summary statistics about the un-grouped data.
See dplyr::ungroup().
Usage
## S3 method for class 'trackr_df'
ungroup(
x,
...,
.messages = .defaultMessage(),
.headline = .defaultHeadline(),
.tag = NULL
)
Arguments
x |
A |
... |
variables to remove from the grouping. |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count}. The default is "total {.count} items" |
.headline |
a headline glue spec. The glue code can use {.count} and {.strata}. |
.tag |
if you want the summary data from this step in the future then give it a name with .tag. |
Value
the .data dataframe but ungrouped with the history graph updated showing the ungroup operation as a new stage.
See Also
dplyr::ungroup()
Examples
library(dplyr)
library(dtrackr)
tmp = iris %>% group_by(Species) %>% comment("A test")
tmp %>% ungroup(.messages="{.count} items in combined") %>% history()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is calculated as {.count.out}; otherwise each function performs
exactly the same operation as the equivalent dplyr operation.
See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
Usage
## S3 method for class 'trackr_df'
union(
x,
y,
...,
.messages = "{.count.out} unique items in union",
.headline = "Distinct union"
)
Arguments
x , y |
Vectors to combine. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
generics::union()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Set operations
Description
These perform set operations on tracked dataframes. They merge the history
of 2 (or more) dataframes and combine the rows (or columns). The total number of
resulting rows is calculated as {.count.out}; otherwise each function performs
exactly the same operation as the equivalent dplyr operation.
See dplyr::bind_rows(), dplyr::bind_cols(), dplyr::intersect(), dplyr::union(),
dplyr::setdiff(), or dplyr::union_all() for the underlying function details.
Usage
## S3 method for class 'trackr_df'
union_all(
x,
y,
...,
.messages = "{.count.out} items in union",
.headline = "Union"
)
Arguments
x , y |
Pair of compatible data frames. A pair of data frames is compatible if they have the same column names (possibly in different orders) and compatible types. |
... |
a collection of tracked data frames to combine |
.messages |
a set of glue specs. The glue code can use any global variable, or {.count.out} |
.headline |
a glue spec. The glue code can use any global variable, or {.count.out} |
Value
the dplyr output with the history graph updated.
See Also
dplyr::union_all()
Examples
library(dplyr)
library(dtrackr)
# Set operations
people = starwars %>% select(-films, -vehicles, -starships)
chrs = people %>% track("start")
lhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Droid" ~ "{.included} droids"
)
# these are different subsets of the same data
rhs = chrs %>% include_any(
species == "Human" ~ "{.included} humans",
species == "Gungan" ~ "{.included} gungans"
) %>% comment("{.count} gungans & humans")
# Unions
set = bind_rows(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union(lhs,rhs) %>% comment("{.count} human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = union_all(lhs,rhs) %>% comment("{.count} 2*human,droids and gungans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
# Intersections and differences
set = setdiff(lhs,rhs) %>% comment("{.count} droids")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
set = intersect(lhs,rhs) %>% comment("{.count} humans")
# display the history of the result:
set %>% history()
nrow(set)
# not run - display the flowchart:
# set %>% flowchart()
Remove tracking from the dataframe
Description
Remove tracking from the dataframe
Usage
untrack(.data)
Arguments
.data |
a tracked dataframe |
Value
the .data dataframe with history graph metadata removed.
Examples
library(dplyr)
library(dtrackr)
iris %>% track() %>% untrack() %>% class()
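To make the effect a little more explicit than printing the class vector, the sketch below checks for the trackr_df class named in the S3 method signatures elsewhere in this manual. Treating the absence of that class as the sign that tracking has been removed is an assumption based on those signatures.

```r
library(dplyr)
library(dtrackr)

tracked = iris %>% track()
plain = tracked %>% untrack()

# The tracked dataframe carries the trackr_df class ...
inherits(tracked, "trackr_df")

# ... and after untrack() it no longer does, while the data is unchanged
inherits(plain, "trackr_df")
nrow(plain)
```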