Title: | Panel Data Wrangling Tools |
Version: | 1.2.13 |
BugReports: | https://github.com/JSzitas/panelWranglR/issues |
Description: | Leading/lagging a panel, creating dummy variables, taking panel differences, looking for panel autocorrelations, and more. Implemented via a 'data.table' back end. |
License: | GPL-3 |
Depends: | R (≥ 3.2.0) |
Suggests: | testthat (≥ 2.1.0) |
Encoding: | UTF-8 |
LazyData: | true |
URL: | https://github.com/JSzitas/panelWranglR |
RoxygenNote: | 6.1.1 |
Imports: | data.table, Hmisc, caret |
NeedsCompilation: | no |
Packaged: | 2019-09-28 18:02:59 UTC; juraj |
Author: | Juraj Szitás [aut, cre] |
Maintainer: | Juraj Szitás <szitas.juraj13@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2019-10-03 08:30:02 UTC |
Wrapper for findCorrelation
Description
A helper function for panel_correl.
Usage
corr_finder(df, corr_cutoff)
Arguments
df |
The dataframe to use. |
corr_cutoff |
The correlation cutoff to pass to caret's findCorrelation. |
Examples
X_1 <- rnorm(1000)
X_2 <- rnorm(1000) + 0.6 * X_1
X_3 <- rnorm(1000) - 0.4 * X_1
data_fm <- do.call( cbind, list( X_1,
X_2,
X_3 ))
corr_finder( df = data_fm,
corr_cutoff = 0.3 )
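To illustrate what a correlation cutoff means in practice, the base R sketch below lists the column pairs whose absolute pairwise correlation exceeds a threshold. It is only an illustration under stated assumptions, not how corr_finder works internally, and the helper name `pairs_above_cutoff` is hypothetical.

```r
# Hypothetical helper, base R only: list column pairs whose absolute
# correlation exceeds `cutoff`. This is not the package's implementation.
pairs_above_cutoff <- function(m, cutoff) {
  cm <- abs(cor(m))                      # absolute correlation matrix
  cm[lower.tri(cm, diag = TRUE)] <- 0    # keep each pair once, drop the diagonal
  idx <- which(cm > cutoff, arr.ind = TRUE)
  data.frame(col_a   = unname(idx[, "row"]),
             col_b   = unname(idx[, "col"]),
             abs_cor = cm[idx])
}

set.seed(1)
X_1 <- rnorm(1000)
X_2 <- rnorm(1000) + 0.6 * X_1           # correlated with X_1
X_3 <- rnorm(1000)                       # unrelated to X_1 and X_2
high_pairs <- pairs_above_cutoff(cbind(X_1, X_2, X_3), cutoff = 0.3)
print(high_pairs)
```

Only the pair of columns 1 and 2 is built to be correlated here, so it is the only pair one would expect to pass a cutoff of 0.3.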
Collect a panel, from wide to long
Description
Transforms cross sectional/time dummies to unified variables
Usage
panel_collect(data, cross.section = NULL, cross.section.columns = NULL,
time.variable = NULL, time.variable.columns = NULL)
Arguments
data |
The panel to transform |
cross.section |
The name of the transformed cross sectional variable, supplied as a character string. |
cross.section.columns |
The names of the columns indicating cross sections to collect. |
time.variable |
The name of the transformed time variable, supplied as a character string. |
time.variable.columns |
The names of the columns indicating time variables to collect. |
Details
For time variables named like "Time_Var_i" with arbitrary i, the program checks that all time variables follow this naming convention and then strips the convention from the names.
Value
A collected data.table, with new columns constructed by collecting from the wide format.
Examples
x_1 <- rnorm( 10 )
cross_levels <- c( "AT", "DE" )
# 5 time periods for each of the 2 cross sections
time <- rep( 1:5, 2 )
# one cross section label per time period, so all columns have 10 rows
geo <- rep( cross_levels, each = 5 )
example_data <- data.frame( geo = geo,
time = time,
x_1 = x_1 )
# generate dummies using panel_dummify()
test_dummies <- panel_dummify( data = example_data,
cross.section = "geo",
time.variable = "time")
panel_collect( data = test_dummies,
cross.section = "geo",
cross.section.columns = c( "AT", "DE"))
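The idea of collecting dummies back into a single variable can be sketched in base R as follows. This is a minimal illustration, not the package's code: the helper `collect_dummies` is hypothetical, and the prefix handling at the end only assumes names of the form "Time_Var_i" as described in Details.

```r
# Hypothetical helper, base R only: fold 0/1 dummy columns back into one
# variable by picking, per row, the name of the column that equals 1.
collect_dummies <- function(df, dummy_cols) {
  m <- as.matrix(df[, dummy_cols, drop = FALSE])
  stopifnot(all(rowSums(m) == 1))        # exactly one active dummy per row
  dummy_cols[max.col(m, ties.method = "first")]
}

wide <- data.frame(AT  = c(1, 1, 0, 0),
                   DE  = c(0, 0, 1, 1),
                   x_1 = c(0.1, 0.2, 0.3, 0.4))
wide$geo <- collect_dummies(wide, c("AT", "DE"))
long <- wide[, c("geo", "x_1")]          # drop the dummy columns
print(long)

# Stripping an assumed "Time_Var_" prefix from dummy column names.
time_names <- c("Time_Var_1", "Time_Var_2")
stripped <- sub("^Time_Var_", "", time_names)
```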
Panel linear combinations
Description
A function to find highly correlated variables in a panel of data, both by cross sections and by time dummies.
Usage
panel_correl(data, cross.section = NULL, time.variable = NULL,
corr.threshold = 0.7, autocorr.threshold = 0.5,
cross.threshold = 0.7, select.cross.sections = NULL,
select.time.periods = NULL)
Arguments
data |
The data to use, a data.frame or a data.table. |
cross.section |
The name of the cross sectional variable. |
time.variable |
The name of the time variable. |
corr.threshold |
The correlation threshold for finding significant correlations in the base specification, disregarding time or cross sectional dependencies. |
autocorr.threshold |
The correlation threshold for autocorrelation (splitting the pooled panel into cross sections). |
cross.threshold |
The correlation threshold for finding significant correlations in the cross sections. |
select.cross.sections |
An optional subset of cross sectional units. |
select.time.periods |
An optional subset of time periods. |
Examples
x_1 <- rnorm( 100 )
x_2 <- rnorm( 100 ) + 0.5 * x_1
cross_levels <- c( "AT", "DE")
time <- seq(1:50)
time <- rep(time, 2)
geo_list <- list()
for(i in 1:length(cross_levels))
{ geo <- rep( cross_levels[i], 50 )
geo_list[[i]] <- geo }
geo <- unlist(geo_list)
geo <- as.data.frame(geo)
example_data <- do.call ( cbind, list( time, x_1, x_2))
example_data <- as.data.frame(example_data)
example_data <- cbind( geo,
example_data)
names(example_data) <- c("geo", "time", "x_1",
"x_2")
panel_correl( data = example_data,
cross.section = "geo",
time.variable = "time",
corr.threshold = 0.2,
autocorr.threshold = 0.5,
cross.threshold = 0.1)
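The three thresholds refer to three different views of the same panel. The base R sketch below computes one quantity of each kind (pooled correlation, correlation within each cross section, and lag-1 autocorrelation within each cross section). It assumes one row per unit and period and is only meant to show what is being compared against the thresholds, not how panel_correl computes it.

```r
# Base R sketch under an assumed layout: one row per unit and period.
set.seed(42)
panel <- data.frame(
  geo  = rep(c("AT", "DE"), each = 50),
  time = rep(1:50, times = 2),
  x_1  = rnorm(100)
)
panel$x_2 <- panel$x_1 + rnorm(100, sd = 0.1)   # x_2 tracks x_1 closely

# Pooled correlation, ignoring the panel structure.
pooled_cor <- cor(panel$x_1, panel$x_2)

# Correlation between x_1 and x_2 within each cross section.
by_unit_cor <- sapply(split(panel, panel$geo),
                      function(d) cor(d$x_1, d$x_2))

# Lag-1 autocorrelation of x_1 within each cross section,
# with the rows of each unit sorted by time first.
lag1_autocor <- sapply(split(panel, panel$geo), function(d) {
  d <- d[order(d$time), ]
  cor(d$x_1[-1], d$x_1[-nrow(d)])
})
```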
Tidy panel differencing
Description
Efficient, tidy panel differencing
Usage
panel_diff(data, cross.section, time.variable = NULL, diff.order = 1,
lags = 1, variables.selected = NULL, keep.original = FALSE)
Arguments
data |
The data input, anything coercible to a data.table. |
cross.section |
The cross section argument, see examples. |
time.variable |
The variable to indicate time in your panel. Defaults to NULL, though it is recommended to have a time variable. |
diff.order |
The number of applications of the difference operator to use in panel differencing. Defaults to 1. |
lags |
The number of lags to use for differences. Defaults to 1. |
variables.selected |
The variables to difference. Defaults to NULL, which differences ALL variables. |
keep.original |
Whether to keep the original undifferenced data, defaults to FALSE. |
Details
Works on a full data.table backend for maximum speed wherever possible.
Value
The differenced data.table which contains either only the differenced variables, or also the original variables.
Examples
X <- matrix(rnorm(4000),800,5)
tim <- seq(1:400)
geo_AT <- rep(c("AT"), length = 400)
geo_NO <- rep(c("NO"), length = 400)
both_vec_1 <- cbind(tim,geo_NO)
both_vec_2 <- cbind(tim,geo_AT)
both <- rbind(both_vec_1,both_vec_2)
# rbind() keeps the column names of its first argument, so the cross section column is named "geo_NO"
X <- cbind(both,X)
panel_diff(data = X,
cross.section = "geo_NO",
time.variable = "tim",
diff.order = 1,
lags = 1,
variables.selected = c("V3","V4"),
keep.original = TRUE)
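What differencing within cross sections amounts to can be sketched in base R. The helper `group_diff` below is hypothetical and does not use the package's data.table back end. It assumes the rows of each cross section are already sorted by time and that `lags` and `diff.order` carry the same meaning as the `lag` and `differences` arguments of base `diff()`.

```r
# Hypothetical helper, base R only: difference `x` within each group,
# padding the start of every group with NA so the length is preserved.
group_diff <- function(x, group, lags = 1, diff.order = 1) {
  ave(x, group, FUN = function(v) {
    pad <- min(length(v), lags * diff.order)   # observations lost to differencing
    c(rep(NA, pad), diff(v, lag = lags, differences = diff.order))
  })
}

panel <- data.frame(geo = rep(c("AT", "NO"), each = 4),
                    tim = rep(1:4, times = 2),
                    x   = c(1, 3, 6, 10, 2, 2, 5, 9))
panel$x_diff  <- group_diff(panel$x, panel$geo)                  # first difference
panel$x_diff2 <- group_diff(panel$x, panel$geo, diff.order = 2)  # second difference
print(panel)
```

Because the differences are taken group by group, the first observation of "NO" is NA instead of being differenced against the last observation of "AT".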
Tidy time/variable dummies for panel data
Description
A simple function to dummify cross sections or time variables in panel data.
Usage
panel_dummify(data, cross.section = NULL, time.variable = NULL)
Arguments
data |
The panel to dummify |
cross.section |
The cross section variable in the panel. Defaults to NULL. |
time.variable |
The variable to indicate time in your panel. Defaults to NULL. |
Details
The encoding is binary (0/1). Whether this is more appropriate than using a factor variable is up to the user.
Value
A new data.table, with the original variables to dummify removed, and new dummy columns included.
Examples
x_1 <- rnorm( 10 )
cross_levels <- c( "AT", "DE" )
# 5 time periods for each of the 2 cross sections
time <- rep( 1:5, 2 )
# one cross section label per time period, so all columns have 10 rows
geo <- rep( cross_levels, each = 5 )
example_data <- data.frame( geo = geo,
time = time,
x_1 = x_1 )
test_dummies <- panel_dummify( data = example_data,
cross.section = "geo",
time.variable = "time")
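The binary encoding described in Details can be sketched in base R. The helper `make_dummies` is hypothetical, and the column naming (one column per distinct value, named after that value) is an assumption made for this illustration, not a statement about the output of panel_dummify.

```r
# Hypothetical helper, base R only: one 0/1 column per distinct value of `x`.
make_dummies <- function(x) {
  x  <- as.character(x)
  lv <- sort(unique(x))
  out <- outer(x, lv, "==") * 1L         # logical matrix turned into 0/1 integers
  colnames(out) <- lv
  as.data.frame(out)
}

panel <- data.frame(geo = rep(c("AT", "DE"), each = 3),
                    x_1 = 1:6)
dummies <- make_dummies(panel$geo)
# Replace the original variable with its dummy columns.
result <- cbind(panel["x_1"], dummies)
print(result)
```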
Tidy panel lagging
Description
Efficient, tidy panel lagging
Usage
panel_lag(data, cross.section, time.variable = NULL, lags = 1,
variables.selected = NULL, keep.original = TRUE)
Arguments
data |
The data input, anything coercible to a data.table. |
cross.section |
The cross section argument, see examples. |
time.variable |
The variable to indicate time in your panel. Defaults to NULL, though it is recommended to have a time variable. |
lags |
The lags to use in panel lagging. |
variables.selected |
The variables to lag. Defaults to NULL, which lags ALL variables. |
keep.original |
Whether to keep the original unlagged data, defaults to TRUE. |
Details
Works on a full data.table backend for maximum speed wherever possible.
Value
The lagged data.table which contains either only the lagged variables, or also the original variables.
Examples
X <- matrix(rnorm(4000),800,5)
tim <- seq(1:400)
geo_AT <- rep(c("AT"), length = 400)
geo_NO <- rep(c("NO"), length = 400)
both_vec_1 <- cbind(tim,geo_NO)
both_vec_2 <- cbind(tim,geo_AT)
both <- rbind(both_vec_1,both_vec_2)
# rbind() keeps the column names of its first argument, so the cross section column is named "geo_NO"
X <- cbind(both,X)
panel_lag(data = X,
cross.section = "geo_NO",
time.variable = "tim",
lags = 5,
variables.selected = c("V5","tim", "V7"),
keep.original = TRUE)
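Lagging within cross sections can be sketched in base R as below. The helper `group_lag` is hypothetical, handles a single lag `k`, and assumes the rows of each cross section are already sorted by time. It does not reproduce the package's data.table implementation or its output column names.

```r
# Hypothetical helper, base R only: shift `x` back by `k` periods within
# each group, filling the first `k` positions of every group with NA.
group_lag <- function(x, group, k = 1) {
  ave(x, group, FUN = function(v) {
    n <- length(v)
    if (k >= n) rep(NA, n) else c(rep(NA, k), v[seq_len(n - k)])
  })
}

panel <- data.frame(geo = rep(c("AT", "NO"), each = 4),
                    tim = rep(1:4, times = 2),
                    x   = c(1, 2, 3, 4, 10, 20, 30, 40))
panel$x_lag2 <- group_lag(panel$x, panel$geo, k = 2)
print(panel)
```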
Tidy panel leading
Description
Efficient, tidy panel leading
Usage
panel_lead(data, cross.section, time.variable = NULL, leads = 1,
variables.selected = NULL, keep.original = TRUE)
Arguments
data |
The data input, anything coercible to a data.table. |
cross.section |
The cross section argument, see examples. |
time.variable |
The variable to indicate time in your panel. Defaults to NULL, though it is recommended to have a time variable. |
leads |
The leads to use in panel leading. |
variables.selected |
The variables to lead. Defaults to NULL, which leads ALL variables. |
keep.original |
Whether to keep the original data without leads, defaults to TRUE. |
Details
Works on a full data.table backend for maximum speed wherever possible.
Value
The led data.table, which contains either only the led variables or also the original variables.
Examples
X <- matrix(rnorm(4000),800,5)
tim <- seq(1:400)
geo_AT <- rep(c("AT"), length = 400)
geo_NO <- rep(c("NO"), length = 400)
both_vec_1 <- cbind(tim,geo_NO)
both_vec_2 <- cbind(tim,geo_AT)
both <- rbind(both_vec_1,both_vec_2)
# rbind() keeps the column names of its first argument, so the cross section column is named "geo_NO"
X <- cbind(both,X)
panel_lead(data = X,
cross.section = "geo_NO",
time.variable = "tim",
leads = 5,
variables.selected = c("V5","tim", "V7"),
keep.original = TRUE)
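A lead is the mirror image of a lag: values are pulled forward in time, so the missing values appear at the end of each cross section. The base R sketch below uses a hypothetical helper `group_lead` for a single lead `k`, assumes rows sorted by time within each cross section, and is not the package's implementation.

```r
# Hypothetical helper, base R only: shift `x` forward by `k` periods within
# each group, filling the last `k` positions of every group with NA.
group_lead <- function(x, group, k = 1) {
  ave(x, group, FUN = function(v) {
    n <- length(v)
    if (k >= n) rep(NA, n) else c(v[(k + 1):n], rep(NA, k))
  })
}

panel <- data.frame(geo = rep(c("AT", "NO"), each = 4),
                    tim = rep(1:4, times = 2),
                    x   = c(1, 2, 3, 4, 10, 20, 30, 40))
panel$x_lead1 <- group_lead(panel$x, panel$geo, k = 1)
print(panel)
```

The last observation of "AT" is NA and is not filled from the first observation of "NO", which is the point of leading by cross section.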