Help for package panelWranglR

Title:

Panel Data Wrangling Tools

Version:

1.2.13

BugReports:

https://github.com/JSzitas/panelWranglR/issues

Description:

Leading/lagging a panel, creating dummy variables, taking panel differences, looking for panel autocorrelations, and more. Implemented via a 'data.table' back end.

License:

GPL-3

Depends:

R (≥ 3.2.0)

Suggests:

testthat (≥ 2.1.0)

Encoding:

UTF-8

LazyData:

true

URL:

https://github.com/JSzitas/panelWranglR

RoxygenNote:

6.1.1

Imports:

data.table, Hmisc, caret

NeedsCompilation:

Packaged:

2019-09-28 18:02:59 UTC; juraj

Author:

Juraj Szitás [aut, cre]

Maintainer:

Juraj Szitás <szitas.juraj13@gmail.com>

Repository:

CRAN

Date/Publication:

2019-10-03 08:30:02 UTC

Wrapper for find correlations

Description

Just a helper function for correl_panel.

Usage

corr_finder(df, corr_cutoff)

Arguments

df

The dataframe to use.

corr_cutoff

The correlation cutoff to pass to findCorrelations

Examples


X_1 <- rnorm(1000)
X_2 <- rnorm(1000) + 0.6 * X_1
X_3 <- rnorm(1000) - 0.4 * X_1

data_fm <- do.call( cbind, list( X_1,
                                 X_2,
                                 X_3 ))

corr_finder( df = data_fm,
             corr_cutoff = 0.3 )

Collect a panel, from wide to long

Description

Transforms cross sectional/time dummies to unified variables

Usage

panel_collect(data, cross.section = NULL, cross.section.columns = NULL,
  time.variable = NULL, time.variable.columns = NULL)

Arguments

data

The panel to transform

cross.section

The name of the transformed cross sectional variable supply as chracter.

cross.section.columns

The names of the columns indicating cross sections to collect.

time.variable

The name of the transformed time variable supply as character.

time.variable.columns

The names of the columns indicating time variables to collect.

Details

For time variables named like "Time_Var_i" with arbitrary i, the program will check that all time variables are named using this convention, and strip this convention

Value

A collected data.table, with new columns constructed by collecting from the wide format.

Examples


x_1 <- rnorm( 10 )
cross_levels <- c( "AT", "DE" )
time <- seq(1:5)
time <- rep(time, 2)
geo_list <- list()
for(i in 1:length(cross_levels))
{
  geo <- rep( cross_levels[i],
                100 )
                  geo_list[[i]] <- geo
                  }
                  geo <- unlist(geo_list)
                  geo <- as.data.frame(geo)

 example_data <- cbind( time,
                       x_1 )
 example_data <- as.data.frame(example_data)

 example_data <- cbind( geo,
                       example_data)
 names(example_data) <- c("geo", "time", "x_1")

# generate dummies using panel_dummify()
 test_dummies <- panel_dummify( data = example_data,
                                cross.section = "geo",
                                time.variable = "time")
panel_collect( data = test_dummies,
               cross.section = "geo",
               cross.section.columns = c( "AT", "DE"))

Panel linear combinations

Description

A function to find highly correlated variables in a panel of data, both by cross sections and by time dummies.

Usage

panel_correl(data, cross.section = NULL, time.variable = NULL,
  corr.threshold = 0.7, autocorr.threshold = 0.5,
  cross.threshold = 0.7, select.cross.sections = NULL,
  select.time.periods = NULL)

Arguments

data

The data to use, a data.frame or a data.table.

cross.section

The name of the cross sectional variable.

time.variable

The name of the time variable.

corr.threshold

The correlation threshold for finding significant correlations in the base specification, disregarding time or cross sectional dependencies.

autocorr.threshold

The correlation threshold for autocorrelation (splitting the pooled panel into cross sections).

cross.threshold

The correlation threshold for finding significant correlations in the cross sections.

select.cross.sections

An optional subset of cross sectional units.

select.time.periods

An optional subset of time periods

Examples


   x_1 <- rnorm( 100 )
   x_2 <- rnorm( 100 ) + 0.5 * x_1
   cross_levels <- c( "AT", "DE")
   time <- seq(1:50)
   time <- rep(time, 2)
   geo_list <- list()
   for(i in 1:length(cross_levels))
   {  geo <- rep( cross_levels[i], 50 )
      geo_list[[i]] <- geo }
   geo <- unlist(geo_list)
   geo <- as.data.frame(geo)

   example_data <-  do.call ( cbind, list( time, x_1, x_2))
   example_data <- as.data.frame(example_data)
   example_data <- cbind( geo,
                         example_data)

                         names(example_data) <- c("geo", "time", "x_1",
                                                 "x_2")

   panel_correl( data = example_data,
                 cross.section = "geo",
                 time.variable = "time",
                 corr.threshold = 0.2,
                 autocorr.threshold = 0.5,
                 cross.threshold = 0.1)

Tidy panel differencing

Description

Efficient, tidy panel differencing

Usage

panel_diff(data, cross.section, time.variable = NULL, diff.order = 1,
  lags = 1, variables.selected = NULL, keep.original = FALSE)

Arguments

data

The data input, anything coercible to a data.table.

cross.section

The cross section argument, see examples.

time.variable

The variable to indicate time in your panel. Defaults to NULL, though it is recommended to have a time variable.

diff.order

The number of applications of the difference operator to use in panel differencing. Defaults to 1.

lags

The number of lags to use for differences. Defaults to 1.

variables.selected

A variable selection for variables to difference, defaults to NULL and differences ALL variables.

keep.original

Whether to keep the original undifferenced data, defaults to FALSE.

Details

Works on a full data.table backend for maximum speed wherever possible.

Value

The differenced data.table which contains either only the differenced variables, or also the original variables.

Examples


X <- matrix(rnorm(4000),800,5)
tim <- seq(1:400)
geo_AT <- rep(c("AT"), length = 400)
geo_NO <- rep(c("NO"), length = 400)
both_vec_1 <- cbind(tim,geo_NO)
both_vec_2 <- cbind(tim,geo_AT)
both <- rbind(both_vec_1,both_vec_2)
names(both[,"geo_NO"]) <- "geo"
X <- cbind(both,X)

panel_diff(data = X,
           cross.section = "geo_NO",
           time.variable = "tim",
           diff.order = 1,
           lags = 1,
           variables.selected = c("V3","V4"),
           keep.original = TRUE)

Tidy time/variable dummies for panel data

Description

A simple function to dummify cross sections or time variables in panel data.

Usage

panel_dummify(data, cross.section = NULL, time.variable = NULL)

Arguments

data

The panel to dummify

cross.section

The cross section variable in the panel. Defaults to NULL.

time.variable

The variable to indicate time in your panel. Defaults to NULL.

Details

The encoding is binary, whether this is more appropriate than using a factor variable is up to the user.

Value

A new data.table, with the original variables to dummify removed, and new dummy columns included.

Examples


x_1 <- rnorm( 10 )
cross_levels <- c( "AT", "DE" )
time <- seq(1:5)
time <- rep(time, 2)
geo_list <- list()
for(i in 1:length(cross_levels))
{
  geo <- rep( cross_levels[i],
                100 )
                  geo_list[[i]] <- geo
                  }
                  geo <- unlist(geo_list)
                  geo <- as.data.frame(geo)

 example_data <- cbind( time,
                        x_1 )
 example_data <- as.data.frame(example_data)

 example_data <- cbind( geo,
                        example_data)
 names(example_data) <- c("geo", "time", "x_1")

 test_dummies <- panel_dummify( data = example_data,
                                cross.section = "geo",
                                time.variable = "time")

Tidy panel lagging

Description

Efficient, tidy panel lagging

Usage

panel_lag(data, cross.section, time.variable = NULL, lags = 1,
  variables.selected = NULL, keep.original = TRUE)

Arguments

data

The data input, anything coercible to a data.table.

cross.section

The cross section argument, see examples.

time.variable

The variable to indicate time in your panel. Defaults to NULL, though it is recommended to have a time variable.

lags

The lags to use in panel lagging.

variables.selected

A variable selection for variables to lag, defaults to NULL and lags ALL variables.

keep.original

Whether to keep the original unlagged data, defaults to TRUE.

Details

Works on a full data.table backend for maximum speed wherever possible.

Value

The lagged data.table which contains either only the lagged variables, or also the original variables.

Examples



X <- matrix(rnorm(4000),800,5)
tim <- seq(1:400)
geo_AT <- rep(c("AT"), length = 400)
geo_NO <- rep(c("NO"), length = 400)
both_vec_1 <- cbind(tim,geo_NO)
both_vec_2 <- cbind(tim,geo_AT)
both <- rbind(both_vec_1,both_vec_2)
names(both[,"geo_NO"]) <- "geo"
X <- cbind(both,X)

panel_lag(data = X,
          cross.section = "geo_NO",
          time.variable = "tim",
          lags = 5,
          variables.selected = c("V5","tim", "V7"),
          keep.original = TRUE)

Tidy panel leading

Description

Efficient, tidy panel leading

Usage

panel_lead(data, cross.section, time.variable = NULL, leads = 1,
  variables.selected = NULL, keep.original = TRUE)

Arguments

data

The data input, anything coercible to a data.table.

cross.section

The cross section argument, see examples.

time.variable

The variable to indicate time in your panel. Defaults to NULL, though it is recommended to have a time variable.

leads

The leads to use in panel leading.

variables.selected

A variable selection for variables to lead, defaults to NULL and leads ALL variables.

keep.original

Whether to keep the original unleadged data, defaults to TRUE.

Details

Works on a full data.table backend for maximum speed wherever possible.

Value

The leading data.table which contains either only the leading variables, or also the original variables.

Examples



X <- matrix(rnorm(4000),800,5)
tim <- seq(1:400)
geo_AT <- rep(c("AT"), length = 400)
geo_NO <- rep(c("NO"), length = 400)
both_vec_1 <- cbind(tim,geo_NO)
both_vec_2 <- cbind(tim,geo_AT)
both <- rbind(both_vec_1,both_vec_2)
names(both[,"geo_NO"]) <- "geo"
X <- cbind(both,X)

panel_lead(data = X,
          cross.section = "geo_NO",
          time.variable = "tim",
          leads = 5,
          variables.selected = c("V5","tim", "V7"),
          keep.original = TRUE)