Title: | Enhanced 'mutate' |
Version: | 0.2.0 |
Description: | Provides 'Apache Spark' style window aggregation for R dataframes and remote 'dbplyr' tables via 'mutate' in 'dplyr' flavour. |
Imports: | dplyr (≥ 1.1.0), tidyr (≥ 1.3.0), checkmate (≥ 2.1.0), rlang (≥ 1.0.6), slider (≥ 0.2.2), magrittr (≥ 1.5), furrr (≥ 0.3.0), dbplyr (≥ 2.3.1), |
Suggests: | lubridate, stringr, testthat, RSQLite, tibble, |
URL: | https://github.com/talegari/tidier |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.1 |
NeedsCompilation: | no |
Packaged: | 2023-09-11 11:36:46 UTC; s0k06e8 |
Author: | Srikanth Komala Sheshachala [aut, cre] |
Maintainer: | Srikanth Komala Sheshachala <sri.teach@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2023-09-11 18:10:02 UTC |
Drop-in replacement for mutate
Description
Provides supercharged version of mutate
with group_by
, order_by
and aggregation over arbitrary window frame
around a row for dataframes and lazy (remote) tbl
s of class tbl_lazy
.
Usage
mutate(x, ..., .by, .order_by, .frame, .index, .complete = FALSE)
Arguments
x |
( |
... |
expressions to be passed to |
.by |
(expression, optional: Yes) Columns to group by |
.order_by |
(expression, optional: Yes) Columns to order by |
.frame |
(vector, optional: Yes) Vector of length 2 indicating the
number of rows to consider before and after the current row. When argument
|
.index |
(expression, optional: Yes, default: NULL) index column. This is supported when input is a dataframe only. |
.complete |
(flag, default: FALSE) This will be passed to
|
Details
A window function returns a value for every input row of a dataframe
or lazy_tbl
based on a group of rows (frame) in the neighborhood of the
input row. This function implements computation over groups (partition_by
in SQL) in a predefined order (order_by
in SQL) across a neighborhood of
rows (frame) defined by a (up, down) where
up/down are number of rows before and after the corresponding row
up/down are interval objects (ex:
c(days(2), days(1))
). Interval objects are currently supported for dataframe only. (nottbl_lazy
)
This implementation is inspired by spark's window API.
Implementation Details:
For dataframe input:
Iteration per row over the window is implemented using the versatile
slider
.Application of a window aggregation can be optionally run in parallel over multiple groups (see argument
.by
) by setting a future parallel backend. This is implemented using furrr package.function subsumes regular usecases of
mutate
For tbl_lazy
input:
Uses
dbplyr::window_order
anddbplyr::window_frame
to translate topartition_by
and window frame specification.
Value
data.frame
or tbl_lazy
See Also
mutate_
Examples
library("magrittr")
# example 1 (simple case with dataframe)
# Using iris dataset,
# compute cumulative mean of column `Sepal.Length`
# ordered by `Petal.Width` and `Sepal.Width` columns
# grouped by `Petal.Length` column
iris %>%
mutate(sl_mean = mean(Sepal.Length),
.order_by = c(Petal.Width, Sepal.Width),
.by = Petal.Length,
.frame = c(Inf, 0),
) %>%
dplyr::slice_min(n = 3, Petal.Width, by = Species)
# example 2 (detailed case with dataframe)
# Using a sample airquality dataset,
# compute mean temp over last seven days in the same month for every row
set.seed(101)
airquality %>%
# create date column
dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>%
# create gaps by removing some days
dplyr::slice_sample(prop = 0.8) %>%
dplyr::arrange(date_col) %>%
# compute mean temperature over last seven days in the same month
tidier::mutate(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
.order_by = Day,
.by = Month,
.frame = c(lubridate::days(7), # 7 days before current row
lubridate::days(-1) # do not include current row
),
.index = date_col
)
# example 3
airquality %>%
# create date column as character
dplyr::mutate(date_col =
as.character(lubridate::make_date(1973, Month, Day))
) %>%
tibble::as_tibble() %>%
# as `tbl_lazy`
dbplyr::memdb_frame() %>%
mutate(avg_temp = mean(Temp),
.by = Month,
.order_by = date_col,
.frame = c(3, 3)
) %>%
dplyr::collect() %>%
dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
Drop-in replacement for mutate
Description
Provides supercharged version of mutate
with group_by
, order_by
and aggregation over arbitrary window frame
around a row for dataframes and lazy (remote) tbl
s of class tbl_lazy
.
Usage
mutate_(
x,
...,
.by,
.order_by,
.frame,
.index,
.desc = FALSE,
.complete = FALSE
)
Arguments
x |
( |
... |
expressions to be passed to |
.by |
(character vector, optional: Yes) Columns to group by |
.order_by |
(string, optional: Yes) Columns to order by |
.frame |
(vector, optional: Yes) Vector of length 2 indicating the
number of rows to consider before and after the current row. When argument
|
.index |
(string, optional: Yes, default: NULL) index column. This is supported when input is a dataframe only. |
.desc |
(flag, default: FALSE) Whether to order in descending order |
.complete |
(flag, default: FALSE) This will be passed to
|
Details
A window function returns a value for every input row of a dataframe
or lazy_tbl
based on a group of rows (frame) in the neighborhood of the
input row. This function implements computation over groups (partition_by
in SQL) in a predefined order (order_by
in SQL) across a neighborhood of
rows (frame) defined by a (up, down) where
up/down are number of rows before and after the corresponding row
up/down are interval objects (ex:
c(days(2), days(1))
). Interval objects are currently supported for dataframe only. (nottbl_lazy
)
This implementation is inspired by spark's window API.
Implementation Details:
For dataframe input:
Iteration per row over the window is implemented using the versatile
slider
.Application of a window aggregation can be optionally run in parallel over multiple groups (see argument
.by
) by setting a future parallel backend. This is implemented using furrr package.function subsumes regular usecases of
mutate
For tbl_lazy
input:
Uses
dbplyr::window_order
anddbplyr::window_frame
to translate topartition_by
and window frame specification.
Value
data.frame
or tbl_lazy
See Also
mutate
Examples
library("magrittr")
# example 1 (simple case with dataframe)
# Using iris dataset,
# compute cumulative mean of column `Sepal.Length`
# ordered by `Petal.Width` and `Sepal.Width` columns
# grouped by `Petal.Length` column
iris %>%
tidier::mutate_(sl_mean = mean(Sepal.Length),
.order_by = c("Petal.Width", "Sepal.Width"),
.by = "Petal.Length",
.frame = c(Inf, 0),
) %>%
dplyr::slice_min(n = 3, Petal.Width, by = Species)
# example 2 (detailed case with dataframe)
# Using a sample airquality dataset,
# compute mean temp over last seven days in the same month for every row
set.seed(101)
airquality %>%
# create date column
dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) %>%
# create gaps by removing some days
dplyr::slice_sample(prop = 0.8) %>%
dplyr::arrange(date_col) %>%
# compute mean temperature over last seven days in the same month
tidier::mutate_(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
.order_by = "Day",
.by = "Month",
.frame = c(lubridate::days(7), # 7 days before current row
lubridate::days(-1) # do not include current row
),
.index = "date_col"
)
# example 3
airquality %>%
# create date column as character
dplyr::mutate(date_col =
as.character(lubridate::make_date(1973, Month, Day))
) %>%
tibble::as_tibble() %>%
# as `tbl_lazy`
dbplyr::memdb_frame() %>%
mutate_(avg_temp = mean(Temp),
.by = "Month",
.order_by = "date_col",
.frame = c(3, 3)
) %>%
dplyr::collect() %>%
dplyr::select(Ozone, Solar.R, Wind, Temp, Month, Day, date_col, avg_temp)
Remove non-list columns when same are present in a list column
Description
Remove non-list columns when same are present in a list column
Usage
remove_common_nested_columns(df, list_column)
Arguments
df |
input dataframe |
list_column |
Name or expr of the column which is a list of named lists |
Value
dataframe