Title: | Streamlined Data Processing Tools for Genomic Selection |
Version: | 0.1.0 |
Description: | A toolkit for genomic selection in animal breeding with emphasis on multi-breed and multi-trait nested grouping operations. Streamlines iterative analysis workflows when working with 'ASReml-R' package. Includes utility functions for phenotypic data processing commonly used by animal breeders. |
License: | MIT + file LICENSE |
URL: | https://tony2015116.github.io/mintyr/ |
BugReports: | https://github.com/tony2015116/mintyr/issues |
Depends: | R (≥ 3.5.0) |
Imports: | arrow, data.table, dplyr, parallel, purrr, readxl, rlang, rsample, rstatix, stats, tibble, utils |
Suggests: | knitr, rmarkdown, testthat, tidyr, tools |
VignetteBuilder: | knitr |
Config/fusen/version: | 0.6.0 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.1 |
NeedsCompilation: | no |
Packaged: | 2024-12-12 01:59:42 UTC; Dell |
Author: | Guo Meng [aut, cre], Guo Meng [cph] |
Maintainer: | Guo Meng <tony2015116@163.com> |
Repository: | CRAN |
Date/Publication: | 2024-12-13 09:10:06 UTC |
Column to Pair Nested Transformation
Description
A sophisticated data transformation tool for generating column pair combinations and creating nested data structures with advanced configuration options.
Usage
c2p_nest(data, cols2bind, by = NULL, pairs_n = 2, sep = "-", nest_type = "dt")
Arguments
data |
Input
|
cols2bind |
Column specification for pair generation
|
by |
Optional grouping specification
|
pairs_n |
|
sep |
|
nest_type |
Output nesting format
|
Details
Advanced Transformation Mechanism:
Input validation and preprocessing
Dynamic column combination generation
Flexible pair transformation
Nested data structure creation
Transformation Process:
Validate input parameters and column specifications
Convert numeric indices to column names if necessary
Generate column combinations
Create subset data tables
Merge and nest transformed data
Column Specification:
Supports both column names and numeric indices
Numeric indices must be within valid range (1 to ncol)
Column names must exist in the dataset
Flexible specification for both cols2bind and by parameters
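The combination step described above can be sketched in base R with utils::combn(). This is a minimal illustration under stated assumptions, not the package's implementation: the helper name make_pairs and the shape of its return value (a named list of column subsets) are hypothetical.

```r
# Hypothetical helper (not part of mintyr): build pair labels and
# per-pair column subsets with utils::combn().
make_pairs <- function(data, cols, pairs_n = 2, sep = "-") {
  # Convert numeric indices to column names if necessary
  if (is.numeric(cols)) cols <- names(data)[cols]
  stopifnot(all(cols %in% names(data)))
  # One character vector of column names per combination
  combos <- utils::combn(cols, pairs_n, simplify = FALSE)
  subsets <- lapply(combos, function(cc) data[, cc, drop = FALSE])
  names(subsets) <- vapply(combos, paste, character(1), collapse = sep)
  subsets
}

pairs <- make_pairs(iris, 1:3, pairs_n = 2, sep = "&")
names(pairs)
```

Three columns taken two at a time give three combinations; the count grows as choose(length(cols), pairs_n), which is why larger combination sizes cost more.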
Value
data table
containing nested transformation results
Includes
pairs
column identifying column combinationsContains
data
column storing nested data structuresSupports optional grouping variables
Note
Key Operation Constraints:
Requires non-empty input data
Column specifications must be valid (either names or indices)
Supports flexible combination strategies
Computational complexity increases with combination size
See Also
utils::combn(): Combination generation
Examples
# Example data preparation: Define column names for combination
col_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
# Example 1: Basic column-to-pairs nesting with custom separator
c2p_nest(
iris, # Input iris dataset
cols2bind = col_names, # Columns to be combined as pairs
pairs_n = 2, # Create pairs of 2 columns
sep = "&" # Custom separator for pair names
)
# Returns a nested data.table where:
# - pairs: combined column names (e.g., "Sepal.Length&Sepal.Width")
# - data: list column containing data.tables with value1, value2 columns
# Example 2: Column-to-pairs nesting with numeric indices and grouping
c2p_nest(
iris, # Input iris dataset
cols2bind = 1:3, # First 3 columns to be combined
pairs_n = 2, # Create pairs of 2 columns
by = 5 # Group by 5th column (Species)
)
# Returns a nested data.table where:
# - pairs: combined column names
# - Species: grouping variable
# - data: list column containing data.tables grouped by Species
Convert Nested Columns Between data.frame and data.table
Description
The convert_nest function transforms a data.frame or data.table by converting nested columns to either data.frame or data.table format while preserving the original data structure.
Usage
convert_nest(data, to = c("df", "dt"), nest_cols = NULL)
Arguments
data |
A |
to |
A |
nest_cols |
A |
Details
Advanced Nested Column Conversion Features:
Intelligent automatic detection of nested columns
Comprehensive conversion of entire data structure
Selective conversion of specified nested columns
Non-destructive transformation with data copying
Input Validation and Error Handling:
Validates existence of specified nested columns
Verifies that specified columns are actually list columns
Provides informative error messages for invalid inputs
Ensures data integrity through comprehensive checks
Conversion Strategies:
Nested column identification based on is.list() detection
Preservation of original data integrity
Flexible handling of mixed data structures
Consistent type conversion across nested elements
Nested Column Handling:
Supports conversion of list columns
Handles data.table, data.frame, and generic list inputs
Maintains original column structure and order
Prevents in-place modification of source data
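The is.list()-based detection and non-destructive conversion described above can be sketched in base R. This is an illustrative sketch only; it converts nested elements with as.data.frame() and does not reproduce the package's data.table handling.

```r
# Build a small data.frame with one list column (base R only).
nested <- data.frame(Species = levels(iris$Species))
nested$data <- split(iris[, 1:4], iris$Species)

# Identify nested columns with is.list(). A data.frame is itself a list,
# so each column is tested separately.
is_nested <- vapply(nested, is.list, logical(1))
nest_cols <- names(nested)[is_nested]

# Convert every element of the nested columns on a copy,
# leaving the original object untouched.
converted <- nested
for (col in nest_cols) {
  converted[[col]] <- lapply(converted[[col]], as.data.frame)
}
nest_cols
```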
Value
A transformed data.frame or data.table with nested columns converted to the specified format.
Note
Conversion Characteristics:
Non-destructive transformation of nested columns
Supports flexible input and output formats
Intelligent type detection and conversion
Minimal performance overhead
Error Conditions:
Throws error if specified columns don't exist in the input data
Throws error if specified columns are not list columns
Provides clear error messages for troubleshooting
Validates input parameters before processing
Examples
# Example 1: Create nested data structures
# Create single nested column
df_nest1 <- iris |>
dplyr::group_nest(Species) # Group and nest by Species
# Create multiple nested columns
df_nest2 <- iris |>
dplyr::group_nest(Species) |> # Group and nest by Species
dplyr::mutate(
data2 = purrr::map( # Create second nested column
data,
dplyr::mutate,
c = 2
)
)
# Example 2: Convert nested structures
# Convert data frame to data table
convert_nest(
df_nest1, # Input nested data frame
to = "dt" # Convert to data.table
)
# Convert specific nested columns
convert_nest(
df_nest2, # Input nested data frame
to = "dt", # Convert to data.table
nest_cols = "data" # Only convert 'data' column
)
# Example 3: Convert data table to data frame
dt_nest <- mintyr::w2l_nest(
data = iris, # Input dataset
cols2l = 1:2 # Columns to nest
)
convert_nest(
dt_nest, # Input nested data table
to = "df" # Convert to data frame
)
Export List with Advanced Directory Management
Description
The export_list function exports a list of data.frame, data.table, or compatible data structures with sophisticated directory handling, flexible naming, and multiple file format support.
Usage
export_list(split_dt, export_path = tempdir(), file_type = "txt")
Arguments
split_dt |
A |
export_path |
Base directory path for file export. Defaults to a temporary directory
created by |
file_type |
File export format, either |
Details
Comprehensive List Export Features:
Advanced nested directory structure support based on list element names
Intelligent handling of unnamed list elements
Automatic conversion to data.table for consistent export
Hierarchical directory creation with nested path names
Multi-format file export with intelligent separator selection
Robust error handling and input validation
File Export Capabilities:
Supports "txt" (tab-separated) and "csv" formats
Intelligent file naming based on list element names
Handles complex nested directory structures
Efficient file writing using data.table::fwrite()
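The name-driven directory layout and separator selection described above can be sketched as follows. This is a base R illustration under stated assumptions: the helper write_list, the "item_" fallback naming for unnamed elements, and the use of utils::write.table() in place of data.table::fwrite() are all choices made for the example, not the package's behaviour.

```r
# Hypothetical helper (not part of mintyr): write each list element to
# <export_path>/<name>.<file_type>, creating nested directories as needed.
write_list <- function(x, export_path, file_type = "txt") {
  sep <- if (file_type == "csv") "," else "\t"
  nms <- names(x)
  if (is.null(nms)) nms <- rep("", length(x))
  # Assumed fallback: positional names for unnamed elements
  nms[nms == ""] <- paste0("item_", which(nms == ""))
  for (i in seq_along(x)) {
    target <- file.path(export_path, paste0(nms[i], ".", file_type))
    dir.create(dirname(target), recursive = TRUE, showWarnings = FALSE)
    utils::write.table(x[[i]], target, sep = sep,
                       row.names = FALSE, quote = FALSE)
  }
  length(x)
}

out_dir <- file.path(tempdir(), "export_sketch")
parts <- split(iris[, 1:2], iris$Species)
names(parts) <- paste0("iris/", names(parts))  # "/" in a name becomes a subdirectory
n_written <- write_list(parts, out_dir)
exported <- list.files(out_dir, recursive = TRUE)
unlink(out_dir, recursive = TRUE)              # clean up
```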
Value
An integer representing the total number of files exported successfully.
Note
Key Capabilities:
Flexible list naming and directory management
Comprehensive support for data.frame and data.table inputs
Intelligent default naming for unnamed elements
High-performance file writing mechanism
Examples
# Example: Export split data to files
# Step 1: Create split data structure
dt_split <- w2l_split(
data = iris, # Input iris dataset
cols2l = 1:2, # Columns to be split
by = "Species" # Grouping variable
)
# Step 2: Export split data to files
export_list(
split_dt = dt_split # Input list of data.tables
)
# Returns the number of files created
# Files are saved in tempdir() with .txt extension
# Check exported files
list.files(
path = tempdir(), # Default export directory
pattern = "txt", # File type pattern to search
recursive = TRUE # Search in subdirectories
)
# Clean up exported files
files <- list.files(
path = tempdir(), # Default export directory
pattern = "txt", # File type pattern to search
recursive = TRUE, # Search in subdirectories
full.names = TRUE # Return full file paths
)
file.remove(files) # Remove all exported files
Export Nested Data with Advanced Grouping and Flexible Handling
Description
The export_nest function exports nested data from a data.frame or data.table with sophisticated grouping capabilities, supporting multiple nested column types and flexible file export options.
Usage
export_nest(
nest_dt,
group_cols = NULL,
nest_col = NULL,
export_path = tempdir(),
file_type = "txt"
)
Arguments
nest_dt |
A |
group_cols |
Optional character vector specifying grouping columns.
If |
nest_col |
Optional character string indicating the nested column to export.
If |
export_path |
Base directory path for file export. Defaults to a temporary directory
created by |
file_type |
File export format, either |
Details
Comprehensive Nested Data Export Features:
Automatic detection and handling of different nested column types
Flexible grouping strategies with intelligent column selection
Hierarchical directory structure generation based on grouping columns
Support for mixed nested column types (data.frame, data.table, list)
Multi-threaded file writing for enhanced performance
Informative messaging and warning system
Nested Column Detection Hierarchy:
Prioritizes data.frame/data.table nested columns
Falls back to regular list columns if no data.frame columns exist
Grouping Column Selection Strategy:
When group_cols is NULL, uses all non-nested columns
Provides warnings about unused non-nested columns
Validates provided group columns
File Export Characteristics:
Supports "txt" (tab-separated) and "csv" formats
Uses multi-threading via parallel::detectCores()
Creates nested directory structure based on grouping variables
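The detection hierarchy and default grouping rule described above can be sketched in base R. The helper pick_nest_col is hypothetical, and picking the first matching column is an assumption made for the example.

```r
# Hypothetical helper (not part of mintyr): choose the nested column,
# preferring columns whose elements are all data frames.
pick_nest_col <- function(dt) {
  is_list_col <- vapply(dt, is.list, logical(1))
  is_df_col <- vapply(dt, function(col) {
    is.list(col) && length(col) > 0 &&
      all(vapply(col, is.data.frame, logical(1)))
  }, logical(1))
  if (any(is_df_col)) {
    names(dt)[is_df_col][1]
  } else if (any(is_list_col)) {
    names(dt)[is_list_col][1]   # fall back to a regular list column
  } else {
    stop("no nested column found")
  }
}

nested <- data.frame(Species = levels(iris$Species))
nested$meta <- list(1:2, "a", TRUE)                  # plain list column
nested$data <- split(iris[, 1:2], iris$Species)      # data.frame list column

chosen <- pick_nest_col(nested)
fallback <- pick_nest_col(nested[c("Species", "meta")])

# When group_cols is NULL: use all non-nested columns
group_cols <- names(nested)[!vapply(nested, is.list, logical(1))]
```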
Value
An integer representing the total number of files exported successfully.
Note
Key Capabilities:
Handles complex nested data structures
Performs type conversion for nested content
Utilizes multi-threaded file export for optimal performance
Provides comprehensive column selection feedback
Examples
# Example 1: Basic nested data export workflow
# Step 1: Create nested data structure
dt_nest <- w2l_nest(
data = iris, # Input iris dataset
cols2l = 1:2, # Columns to be nested
by = "Species" # Grouping variable
)
# Step 2: Export nested data to files
export_nest(
nest_dt = dt_nest, # Input nested data.table
nest_col = "data", # Column containing nested data
group_cols = c("name", "Species") # Columns to create directory structure
)
# Returns the number of files created
# Creates directory structure: tempdir()/name/Species/data.txt
# Check exported files
list.files(
path = tempdir(), # Default export directory
pattern = "txt", # File type pattern to search
recursive = TRUE # Search in subdirectories
)
# Returns list of created files and their paths
# Clean up exported files
files <- list.files(
path = tempdir(), # Default export directory
pattern = "txt", # File type pattern to search
recursive = TRUE, # Search in subdirectories
full.names = TRUE # Return full file paths
)
file.remove(files) # Remove all exported files
fire
Description
Feeding behavior dataset from Fire system
Usage
fire
Format
A data frame with 9794 rows and 10 variables:
Location
integer Feeding station identification number
Tag
integer Animal electronic tag number
Date
character Date of feeding visit
Entry
character Time when animal entered feeding station
Exit
character Time when animal left feeding station
Ent Wt
double Feed weight at entry (kg)
Ext Wt
double Feed weight at exit (kg)
Consumed
double Amount of feed consumed (kg)
Weight
double Animal body weight (kg)
Topup Amount
double Amount of feed added to bin (kg)
Update Fire Dataset with Current Date
Description
The fires function creates a copy of the fire dataset and adjusts the dates to align with the current date while maintaining the original date patterns.
Usage
fires()
Details
The function performs the following operations:
Creates a copy of the fire dataset from the mintyr package
Calculates the number of days between the last recorded date and the previous day
Shifts all dates forward by the calculated number of days
Converts the updated dates back to character format
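The date-shifting steps above can be sketched on toy data. This is an illustration under stated assumptions: the ISO date format and the fixed reference date are chosen for reproducibility, and the actual format of the Date column in fire is not assumed here.

```r
# Toy dates standing in for the Date column (ISO format assumed).
dates <- c("2024-01-01", "2024-01-02", "2024-01-05")
today <- as.Date("2024-12-13")   # fixed for reproducibility; Sys.Date() in practice

d <- as.Date(dates)
# Days between the last recorded date and the previous day
offset <- as.integer((today - 1) - max(d))
# Shift all dates forward, then convert back to character
shifted <- as.character(d + offset)
shifted
```

The gaps between dates are unchanged, so the last record always lands on the day before the reference date.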
Value
A data.table with updated dates, shifted to the current date
Note
Requires the data.table and mintyr packages
Uses the current system date as a reference for date shifting
Maintains the original structure of the date column
Examples
head(fires())
Format Numeric Columns with Specified Digits
Description
The format_digits function formats numeric columns in a data frame or data table by rounding numbers to a specified number of decimal places and converting them to character strings. It can optionally format the numbers as percentages.
Usage
format_digits(data, cols = NULL, digits = 2, percentage = FALSE)
Arguments
data |
A |
cols |
An optional numeric or character vector specifying the columns to format. If |
digits |
A non-negative integer specifying the number of decimal places to use. Defaults to |
percentage |
A logical value indicating whether to format the numbers as percentages. If |
Details
The function performs the following steps:
Validates the input parameters, ensuring that data is a data.frame or data.table, cols (if provided) are valid column names or indices, and digits is a non-negative integer.
Converts data to a data.table if it is not already one.
Creates a formatting function based on the digits and percentage parameters:
If percentage = FALSE, numbers are rounded to digits decimal places.
If percentage = TRUE, numbers are multiplied by 100, rounded to digits decimal places, and a percent sign (%) is appended.
Applies the formatting function to the specified columns:
If cols is NULL, the function formats all numeric columns in data.
If cols is specified, only those columns are formatted.
Returns a new data.table with the formatted columns.
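The formatting-function step above can be sketched with a small closure. This is a base R illustration, not the package's code: make_formatter is a hypothetical name, and sprintf() is used for rounding and padding, so results at exact rounding ties may differ from an implementation based on round().

```r
# Hypothetical helper (not part of mintyr): return a function that turns
# numbers into fixed-digit strings, optionally as percentages.
make_formatter <- function(digits = 2, percentage = FALSE) {
  fmt <- paste0("%.", digits, "f")
  if (percentage) {
    function(x) paste0(sprintf(fmt, x * 100), "%")
  } else {
    function(x) sprintf(fmt, x)
  }
}

df <- data.frame(
  a = c(0.1234, 0.5678),
  b = c(0.2345, 0.6789),
  c = c("text1", "text2")
)

# cols = NULL analogue: format every numeric column on a copy
num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
out <- df
out[num_cols] <- lapply(out[num_cols], make_formatter(digits = 2))

# Percentage formatting of a single column
pct <- make_formatter(digits = 1, percentage = TRUE)(df$a)
```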
Value
A data.table with the specified numeric columns formatted as character strings with the specified number of decimal places. If percentage = TRUE, the numbers are shown as percentages.
Note
The input data must be a data.frame or data.table.
If cols is specified, it must be a vector of valid column names or indices present in data.
The digits parameter must be a single non-negative integer.
The original data is not modified; a modified copy is returned.
Examples
# Example: Number formatting demonstrations
# Setup test data
dt <- data.table::data.table(
a = c(0.1234, 0.5678), # Numeric column 1
b = c(0.2345, 0.6789), # Numeric column 2
c = c("text1", "text2") # Text column
)
# Example 1: Format all numeric columns
format_digits(
dt, # Input data table
digits = 2 # Round to 2 decimal places
)
# Example 2: Format specific column as percentage
format_digits(
dt, # Input data table
cols = c("a"), # Only format column 'a'
digits = 2, # Round to 2 decimal places
percentage = TRUE # Convert to percentage
)
Extract Filenames from File Paths
Description
The get_filename function extracts filenames from file paths with options to remove file extensions and/or directory paths.
Usage
get_filename(paths, rm_extension = TRUE, rm_path = TRUE)
Arguments
paths |
A |
rm_extension |
A
|
rm_path |
A
|
Details
The function performs the following operations:
Validates input paths
Handles empty input vectors
Optionally removes directory paths using basename
Optionally removes file extensions using regex substitution
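The two optional steps above can be sketched with basename() and sub(). The helper strip_name and the specific regular expression are assumptions for illustration; the package's own pattern may treat multi-part extensions differently.

```r
# Hypothetical helper (not part of mintyr).
strip_name <- function(paths, rm_extension = TRUE, rm_path = TRUE) {
  if (rm_path) paths <- basename(paths)
  # Remove only the final ".ext" piece of each path
  if (rm_extension) paths <- sub("\\.[^./]+$", "", paths)
  paths
}

both <- strip_name(c("/data/raw/a.xlsx", "b/c.tar.gz"))
keep_path <- strip_name("/data/raw/a.xlsx", rm_path = FALSE)
keep_ext <- strip_name("/data/raw/a.xlsx", rm_extension = FALSE)
```

With this pattern only the last extension is dropped, so "c.tar.gz" becomes "c.tar".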
Value
A character vector of processed filenames with applied transformations.
Note
If both rm_extension and rm_path are FALSE, a warning is issued and the original paths are returned
Supports multiple file paths in the input vector
See Also
base::basename() for basic filename extraction
Examples
# Example: File path processing demonstrations
# Setup test files
xlsx_files <- mintyr_example(
mintyr_examples("xlsx_test") # Get example Excel files
)
# Example 1: Extract filenames without extensions
get_filename(
xlsx_files, # Input file paths
rm_extension = TRUE, # Remove file extensions
rm_path = TRUE # Remove directory paths
)
# Example 2: Keep file extensions
get_filename(
xlsx_files, # Input file paths
rm_extension = FALSE, # Keep file extensions
rm_path = TRUE # Remove directory paths
)
# Example 3: Keep full paths without extensions
get_filename(
xlsx_files, # Input file paths
rm_extension = TRUE, # Remove file extensions
rm_path = FALSE # Keep directory paths
)
Extract Specific Segments from File Paths
Description
The get_path_segment function extracts specific segments from file paths provided as character strings. Segments can be extracted from either the beginning or the end of the path, depending on the value of n.
Usage
get_path_segment(paths, n = 1)
Arguments
paths |
A 'character vector' containing file system paths
|
n |
Numeric index for segment selection
|
Details
Sophisticated Path Segment Extraction Mechanism:
Comprehensive input validation
Path normalization and preprocessing
Robust cross-platform path segmentation
Flexible indexing with forward and backward navigation
Intelligent segment retrieval
Graceful handling of edge cases
Indexing Behavior:
Positive n: Forward indexing from path start
n = 1: First segment
n = 2: Second segment
Negative n: Reverse indexing from path end
n = -1: Last segment
n = -2: Second-to-last segment
Range extraction: Supports c(start, end) index specification
Path Parsing Characteristics:
Standardizes path separators to '/'
Removes drive letters (e.g., 'C:')
Ignores consecutive '/' delimiters
Removes leading and trailing separators
Returns NA_character_ for non-existent segments
Supports complex path structures
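The single-index behaviour and parsing rules above can be sketched in base R. The helper path_segment is hypothetical and covers only a single positive or negative n; range extraction with c(start, end) is deliberately left out.

```r
# Hypothetical helper (not part of mintyr): single-index segment lookup.
path_segment <- function(paths, n = 1) {
  stopifnot(length(n) == 1, n != 0)
  vapply(paths, function(p) {
    p <- gsub("\\\\", "/", p)         # standardize separators to "/"
    p <- sub("^[A-Za-z]:", "", p)     # drop a drive letter such as "C:"
    parts <- strsplit(p, "/+")[[1]]   # consecutive "/" act as one delimiter
    parts <- parts[nzchar(parts)]     # drop the empty leading piece
    # Negative n counts back from the last segment
    idx <- if (n > 0) n else length(parts) + n + 1
    if (idx < 1 || idx > length(parts)) NA_character_ else parts[idx]
  }, character(1), USE.NAMES = FALSE)
}

paths <- c("C:/home/user/documents", "/var/log/system", "/usr/local/bin")
first <- path_segment(paths, 1)
second_last <- path_segment(paths, -2)
missing_seg <- path_segment("/a/b", 5)
```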
Value
'character vector' with extracted path segments
Matching segments for valid indices
NA_character_ for segments beyond path length
Note
Critical Operational Constraints:
Requires non-empty 'paths' input
n must be a non-zero numeric value
Supports cross-platform path representations
Minimal computational overhead
Preserves path segment order
See Also
tools::file_path_sans_ext(): File extension manipulation
Examples
# Example: Path segment extraction demonstrations
# Setup test paths
paths <- c(
"C:/home/user/documents", # Windows style path
"/var/log/system", # Unix system path
"/usr/local/bin" # Unix binary path
)
# Example 1: Extract first segment
get_path_segment(
paths, # Input paths
1 # Get first segment
)
# Returns: c("home", "var", "usr")
# Example 2: Extract second-to-last segment
get_path_segment(
paths, # Input paths
-2 # Get second-to-last segment
)
# Returns: c("user", "log", "local")
# Example 3: Extract from first to last segment
get_path_segment(
paths, # Input paths
c(1,-1) # Range from first to last
)
# Returns full paths without drive letters
# Example 4: Extract first three segments
get_path_segment(
paths, # Input paths
c(1,3) # Range from first to third
)
# Returns: c("home/user/documents", "var/log/system", "usr/local/bin")
# Example 5: Extract last two segments (reverse order)
get_path_segment(
paths, # Input paths
c(-1,-2) # Range from last to second-to-last
)
# Returns: c("documents/user", "system/log", "bin/local")
# Example 6: Extract first two segments
get_path_segment(
paths, # Input paths
c(1,2) # Range from first to second
)
# Returns: c("home/user", "var/log", "usr/local")
Flexible CSV/TXT File Import with Multiple Backend Support
Description
A comprehensive CSV or TXT file import function offering advanced reading capabilities through data.table and arrow packages with intelligent data combination strategies.
Usage
import_csv(
file,
package = "data.table",
rbind = TRUE,
rbind_label = "_file",
...
)
Arguments
file |
A |
package |
A
|
rbind |
A
|
rbind_label |
A
|
... |
Additional arguments passed to backend-specific reading functions
(e.g., |
Details
The function provides a unified interface for reading CSV or TXT files using either the data.table or the arrow package. When reading multiple files, it can either combine them into a single data object or return them as a list. File source tracking is supported through the rbind_label parameter.
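The combine-or-list behaviour and source tracking described above can be sketched in base R. This is an illustration only: utils::read.csv() and rbind() stand in for the data.table and arrow backends, and the temporary file names are made up for the example.

```r
# Write two small CSV files to a temporary directory for the demonstration.
csv_dir <- file.path(tempdir(), "csv_sketch")
dir.create(csv_dir, showWarnings = FALSE)
files <- file.path(csv_dir, c("batch1.csv", "batch2.csv"))
utils::write.csv(head(iris, 3), files[1], row.names = FALSE)
utils::write.csv(tail(iris, 2), files[2], row.names = FALSE)

# Read one file and record its source (file name without extension)
# in a tracking column named like the default rbind_label.
read_one <- function(f) {
  d <- utils::read.csv(f)
  d[["_file"]] <- sub("\\.[^.]+$", "", basename(f))
  d
}

# rbind = FALSE analogue: a named list, one element per file
parts <- lapply(files, read_one)
names(parts) <- sub("\\.[^.]+$", "", basename(files))

# rbind = TRUE analogue: one combined object (columns assumed compatible)
combined <- do.call(rbind, parts)

unlink(csv_dir, recursive = TRUE)   # clean up
```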
Value
Depends on the rbind parameter:
If rbind = TRUE: A single data object (from chosen package) containing all imported data
If rbind = FALSE: A named list of data objects with names derived from input file names (without extensions)
Note
Critical Import Considerations:
Requires all specified files to be accessible CSV/TXT files
Supports flexible backend selection
rbind = TRUE assumes compatible data structures
Missing columns are automatically aligned
File extensions are automatically removed in tracking columns
See Also
data.table::fread() for data.table backend
arrow::read_csv_arrow() for arrow backend
data.table::rbindlist() for data combination
Examples
# Example: CSV file import demonstrations
# Setup test files
csv_files <- mintyr_example(
mintyr_examples("csv_test") # Get example CSV files
)
# Example 1: Import and combine CSV files using data.table
import_csv(
csv_files, # Input CSV file paths
package = "data.table", # Use data.table for reading
rbind = TRUE, # Combine all files into one data.table
rbind_label = "_file" # Column name for file source
)
# Example 2: Import files separately using arrow
import_csv(
csv_files, # Input CSV file paths
package = "arrow", # Use arrow for reading
rbind = FALSE # Keep files as separate data objects
)
Import Data from XLSX Files with Advanced Handling
Description
A robust and flexible function for importing data from one or multiple XLSX files, offering comprehensive options for sheet selection, data combination, and source tracking.
Usage
import_xlsx(file, rbind = TRUE, sheet = NULL, ...)
Arguments
file |
A |
rbind |
A
|
sheet |
A
|
... |
Additional arguments passed to |
Details
The function provides a comprehensive solution for importing Excel data with the following features:
Supports multiple files and sheets
Automatic source tracking for files and sheets
Flexible combining options
Handles missing columns across sheets when combining
Preserves original data types through readxl
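The handling of missing columns when combining sheets can be sketched in base R. This is an illustrative stand-in rather than the package's code: bind_fill is a hypothetical helper, and the two small data frames play the role of sheets that readxl would return.

```r
# Hypothetical helper (not part of mintyr): row-bind data frames whose
# columns differ, filling the gaps with NA.
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  aligned <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d[all_cols]   # same column order everywhere
  })
  do.call(rbind, aligned)
}

# Stand-ins for two sheets with partly different columns
sheet1 <- data.frame(id = 1:2, weight = c(80.5, 91.2))
sheet2 <- data.frame(id = 3L, intake = 2.4)

combined <- bind_fill(list(sheet1, sheet2))
combined
```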
Value
Depends on the rbind parameter:
If rbind = TRUE: A single data.table with additional tracking columns:
excel_name: Source file name (without extension)
sheet_name: Source sheet name
If rbind = FALSE: A named list of data.tables with format "filename_sheetname"
Note
Critical Import Considerations:
Requires all specified files to be accessible Excel files
Sheet indices must be valid across input files
rbind = TRUE assumes compatible data structures
Missing columns are automatically filled with NA
File extensions are automatically removed in tracking columns
See Also
readxl::read_excel() for underlying Excel reading
data.table::rbindlist() for data combination
Examples
# Example: Excel file import demonstrations
# Setup test files
xlsx_files <- mintyr_example(
mintyr_examples("xlsx_test") # Get example Excel files
)
# Example 1: Import and combine all sheets from all files
import_xlsx(
xlsx_files, # Input Excel file paths
rbind = TRUE # Combine all sheets into one data.table
)
# Example 2: Import specific sheets separately
import_xlsx(
xlsx_files, # Input Excel file paths
rbind = FALSE, # Keep sheets as separate data.tables
sheet = 2 # Only import the second sheet
)
Get path to mintyr examples
Description
mintyr comes bundled with a number of sample files in its inst/extdata directory. Use mintyr_example() to retrieve the full file path to a specific example file.
Usage
mintyr_example(path = NULL)
Arguments
path |
Name of the example file to locate. If NULL or missing, returns the directory path containing the examples. |
Value
Character string containing the full path to the requested example file.
See Also
mintyr_examples() to list all available example files
Examples
# Get path to an example file
mintyr_example("csv_test1.csv")
List all available example files in mintyr package
Description
mintyr comes bundled with a number of sample files in its inst/extdata directory. This function lists all available example files, optionally filtered by a pattern.
Usage
mintyr_examples(pattern = NULL)
Arguments
pattern |
A regular expression to filter filenames. If |
Value
A character vector containing the names of example files. If no files match the pattern or if the example directory is empty, returns a zero-length character vector.
See Also
mintyr_example() to get the full path of a specific example file
Examples
# List all example files
mintyr_examples()
nedap
Description
Dairy cow feeding behavior dataset
Usage
nedap
Format
A data frame with 31863 rows and 9 variables:
animal_number
integer Animal identification number
lifenumber
logical Life number of the animal
responder
integer Responder identification number
location
integer Feeding station location
visit_time
double Time of feeding visit
duration
integer Duration of feeding visit (minutes)
state
integer Status code
weight
integer Body weight (kg)
feed_intake
integer Feed intake amount (kg)
Update Nedap Dataset with Current Date
Description
The nedaps function creates a copy of the Nedap dataset and adjusts the visit times to align with the current date while maintaining the original time patterns.
Usage
nedaps()
Details
The function performs the following operations:
Creates a copy of the Nedap dataset from the mintyr package
Calculates the number of days between the last recorded visit and the previous day
Shifts all visit times forward by the calculated number of days
Preserves the original time patterns of the visits
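The steps above can be sketched for date-time values, where the time of day has to survive the shift. This is an illustration under stated assumptions: the toy timestamps, the UTC time zone, and the fixed reference date are choices made for the example.

```r
# Toy visit times standing in for visit_time (UTC assumed, so no DST jumps).
visits <- as.POSIXct(c("2024-03-01 06:15:00", "2024-03-02 18:40:00"), tz = "UTC")
today <- as.Date("2024-12-13")   # fixed for reproducibility; Sys.Date() in practice

# Whole days between the last recorded visit and the previous day
offset_days <- as.integer((today - 1) - as.Date(max(visits), tz = "UTC"))

# Adding whole days (in seconds) keeps each visit's time of day
shifted <- visits + offset_days * 86400
format(shifted, "%Y-%m-%d %H:%M:%S")
```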
Value
A data.table with updated visit times, shifted to the current date
Note
Requires the data.table and mintyr packages
Uses the current system date as a reference for date shifting
Maintains the original time of day for each visit
Examples
head(nedaps())
Apply Cross-Validation to Nested Data
Description
The nest_cv function applies cross-validation splits to nested data frames or data tables within a data table. It uses the rsample package's vfold_cv function to create cross-validation splits for predictive modeling and analysis on nested datasets.
Usage
nest_cv(
nest_dt,
v = 10,
repeats = 1,
strata = NULL,
breaks = 4,
pool = 0.1,
...
)
Arguments
nest_dt |
A
|
v |
The number of partitions of the data set. |
repeats |
The number of times to repeat the V-fold partitioning. |
strata |
A variable in |
breaks |
A single number giving the number of bins desired to stratify a numeric stratification variable. |
pool |
A proportion of data used to determine if a particular group is too small and should be pooled into another group. We do not recommend decreasing this argument below its default of 0.1 because of the dangers of stratifying groups that are too small. |
... |
These dots are for future extensions and must be empty. |
Details
The function performs the following steps:
Checks if the input nest_dt is non-empty and contains at least one nested column of data.frames or data.tables.
Identifies the nested columns and non-nested columns within nest_dt.
Applies rsample::vfold_cv to each nested data frame in the specified nested column(s), creating the cross-validation splits.
Expands the cross-validation splits and associates them with the non-nested columns.
Extracts the training and validation data for each split and adds them to the output data table.
If the strata parameter is provided, stratified sampling is performed during the cross-validation. Additional arguments can be passed to rsample::vfold_cv via ....
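The idea of applying V-fold splits to each nested dataset and extracting train and validate sets can be sketched without rsample. This is a base R stand-in, not rsample::vfold_cv and not the package's implementation; vfold_sketch and its list-based return value are hypothetical.

```r
# Hypothetical helper: v-fold train/validate splits for one data frame.
vfold_sketch <- function(d, v = 10) {
  # Random fold labels of near-equal size
  fold <- sample(rep(seq_len(v), length.out = nrow(d)))
  lapply(seq_len(v), function(k) {
    list(
      id = paste0("Fold", k),
      train = d[fold != k, , drop = FALSE],
      validate = d[fold == k, , drop = FALSE]
    )
  })
}

set.seed(1)
nested <- split(iris[, 1:2], iris$Species)   # stand-in for a nested column
cv <- lapply(nested, vfold_sketch, v = 2)    # one set of folds per nested dataset

first_fold <- cv$setosa[[1]]
c(nrow(first_fold$train), nrow(first_fold$validate))
```

Each row of a nested dataset appears in exactly one validation set, and the matching training set holds the remaining rows.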
Value
A data.table containing the cross-validation splits for each nested dataset. It includes:
Original non-nested columns from nest_dt.
splits: The cross-validation split objects returned by rsample::vfold_cv.
train: The training data for each split.
validate: The validation data for each split.
Note
The nest_dt must contain at least one nested column of data.frames or data.tables.
The function converts nest_dt to a data.table internally to ensure efficient data manipulation.
The strata parameter should be a column name present in the nested data frames.
If strata is specified, ensure that the specified column exists in all nested data frames.
The breaks and pool parameters are used when strata is a numeric variable and control how stratification is handled.
Additional arguments passed through ... are forwarded to rsample::vfold_cv.
See Also
rsample::vfold_cv(): Underlying cross-validation function
rsample::training(): Extract training set
rsample::testing(): Extract test set
Examples
# Example: Cross-validation for nested data.table demonstrations
# Setup test data
dt_nest <- w2l_nest(
data = iris, # Input dataset
cols2l = 1:2 # Nest first 2 columns
)
# Example 1: Basic 2-fold cross-validation
nest_cv(
nest_dt = dt_nest, # Input nested data.table
v = 2 # Number of folds (2-fold CV)
)
# Example 2: Repeated 2-fold cross-validation
nest_cv(
nest_dt = dt_nest, # Input nested data.table
v = 2, # Number of folds (2-fold CV)
repeats = 2 # Number of repetitions
)
Row to Pair Nested Transformation
Description
A sophisticated data transformation tool for performing row pair conversion and creating nested data structures with advanced configuration options.
Usage
r2p_nest(data, rows2bind, by, nest_type = "dt")
Arguments
data |
Input
|
rows2bind |
Row binding specification
|
by |
Grouping specification for nested pairing
|
nest_type |
Output nesting format
|
Details
Advanced Transformation Mechanism:
Input validation and preprocessing
Dynamic column identification
Flexible row pairing across specified columns
Nested data structure generation
Transformation Process:
Validate input parameters and column specifications
Convert numeric indices to column names if necessary
Reshape data from wide to long format
Perform column-wise nested transformation
Generate final nested structure
Column Specification:
Supports both column names and numeric indices
Numeric indices must be within valid range (1 to ncol)
Column names must exist in the dataset
Flexible specification for both rows2bind and by parameters
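The wide-to-long reshaping and per-variable nesting steps can be sketched in base R. This sketch covers only those two steps and is an assumption about their general shape; it does not reproduce the row pairing itself or the package's data.table-based code.

```r
by_cols <- c("hp", "drat", "wt")

# Wide to long: one row per (observation, variable) pair
long <- data.frame(
  cyl = rep(mtcars$cyl, times = length(by_cols)),
  name = rep(by_cols, each = nrow(mtcars)),
  value = unlist(mtcars[by_cols], use.names = FALSE)
)

# Nest: one data frame per source variable
nested <- split(long[c("cyl", "value")], long$name)
names(nested)
```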
Value
A data table containing nested transformation results:
Includes a name column identifying source columns
Contains a data column storing nested data structures
Note
Key Operation Constraints:
Requires non-empty input data
Column specifications must be valid (either names or indices)
By parameter must specify at least one column
Low computational overhead
See Also
data.table::melt(): Long format conversion
data.table::dcast(): Wide format conversion
base::rbind(): Row binding utility
c2p_nest(): Column to pair nested transformation
Examples
# Example 1: Row-to-pairs nesting with column names
r2p_nest(
mtcars, # Input mtcars dataset
rows2bind = "cyl", # Column to be used as row values
by = c("hp", "drat", "wt") # Columns to be transformed into pairs
)
# Returns a nested data.table where:
# - name: variable names (hp, drat, wt)
# - data: list column containing data.tables with rows grouped by cyl values
# Example 2: Row-to-pairs nesting with numeric indices
r2p_nest(
mtcars, # Input mtcars dataset
rows2bind = 2, # Use 2nd column (cyl) as row values
by = 4:6 # Use columns 4-6 (hp, drat, wt) for pairs
)
# Returns a nested data.table where:
# - name: variable names from columns 4-6
# - data: list column containing data.tables with rows grouped by cyl values
Cross-Validation Split Generator
Description
A robust cross-validation splitting utility for multiple datasets with advanced stratification and configuration options.
Usage
split_cv(
split_dt,
v = 10,
repeats = 1,
strata = NULL,
breaks = 4,
pool = 0.1,
...
)
Arguments
split_dt |
|
v |
The number of partitions of the data set. |
repeats |
The number of times to repeat the V-fold partitioning. |
strata |
A variable in |
breaks |
A single number giving the number of bins desired to stratify a numeric stratification variable. |
pool |
A proportion of data used to determine if a particular group is too small and should be pooled into another group. We do not recommend decreasing this argument below its default of 0.1 because of the dangers of stratifying groups that are too small. |
... |
These dots are for future extensions and must be empty. |
Details
Advanced Cross-Validation Mechanism:
Input dataset validation
Stratified or unstratified sampling
Flexible fold generation
Train-validate set creation
Sampling Strategies:
Supports multiple dataset processing
Handles stratified and unstratified sampling
Generates reproducible cross-validation splits
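The idea behind repeated V-fold partitioning can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation (which builds on rsample::vfold_cv()): make_folds() and its "RepeatX_FoldY" element names are hypothetical, and the sketch covers only the unstratified case.

```r
# Hypothetical helper (not part of mintyr or rsample):
# unstratified, optionally repeated V-fold partitioning of a data frame.
make_folds <- function(data, v = 10, repeats = 1) {
  n <- nrow(data)
  out <- list()
  for (r in seq_len(repeats)) {
    # Randomly assign each row to one of v folds of near-equal size
    fold_id <- sample(rep(seq_len(v), length.out = n))
    for (k in seq_len(v)) {
      key <- sprintf("Repeat%d_Fold%d", r, k)
      out[[key]] <- list(
        train    = data[fold_id != k, , drop = FALSE],  # all other folds
        validate = data[fold_id == k, , drop = FALSE]   # held-out fold
      )
    }
  }
  out
}

set.seed(42)  # fix the seed so the same splits are drawn on every run
folds <- make_folds(iris, v = 3, repeats = 2)
length(folds)  # one element per repeat-fold combination
```

Within one repeat, every row lands in exactly one validation set, so the train and validate parts of any fold together cover the full dataset.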
Value
A list of data.table objects containing:
- splits: Cross-validation split objects
- train: Training dataset subsets
- validate: Validation dataset subsets
Note
Important Constraints:
Requires non-empty input datasets
All datasets must be data.frame or data.table
Strata column must exist if specified
Computational resources impact large dataset processing
See Also
- rsample::vfold_cv(): Core cross-validation function
Examples
# Prepare example data: Convert first 3 columns of iris dataset to long format and split
dt_split <- w2l_split(data = iris, cols2l = 1:3)
# dt_split is now a list containing 3 data tables for Sepal.Length, Sepal.Width, and Petal.Length
# Example 1: Single cross-validation (no repeats)
split_cv(
split_dt = dt_split, # Input list of split data
v = 3, # Set 3-fold cross-validation
repeats = 1 # Perform cross-validation once (no repeats)
)
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data
# Example 2: Repeated cross-validation
split_cv(
split_dt = dt_split, # Input list of split data
v = 3, # Set 3-fold cross-validation
repeats = 2 # Perform cross-validation twice
)
# Returns a list where each element contains:
# - splits: rsample split objects
# - id: repeat numbers (Repeat1, Repeat2)
# - id2: fold numbers (Fold1, Fold2, Fold3)
# - train: training set data
# - validate: validation set data
Select Top Percentage of Data and Statistical Summarization
Description
The top_perc function selects the top percentage of data based on a specified trait and computes summary statistics.
It allows for grouping by additional columns and offers flexibility in the type of statistics calculated.
The function can also retain the selected data if needed.
Usage
top_perc(data, perc, trait, by = NULL, type = "mean_sd", keep_data = FALSE)
Arguments
data |
A
|
perc |
Numeric vector of percentages for data selection
|
trait |
Character string specifying the 'selection column'
|
by |
Optional character vector for 'grouping columns'
|
type |
Statistical summary type
|
keep_data |
Logical flag for data retention
|
Value
A list or data frame:
If keep_data is FALSE, a data frame with summary statistics.
If keep_data is TRUE, a list where each element is a list containing summary statistics (stat) and the selected top data (data).
Note
The perc parameter accepts values between -1 and 1. Positive values select the top percentage, while negative values select the bottom percentage.
The function performs initial checks to ensure required arguments are provided and valid.
Grouping by additional columns (by) is optional and allows for more granular analysis.
The type parameter specifies the type of summary statistics to compute, with "mean_sd" as the default.
If keep_data is set to TRUE, the function will return both the summary statistics and the selected top data for each percentage.
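The sign convention for perc can be illustrated with a short base R sketch. select_fraction() is a hypothetical helper, not the package's code. The rounding rule and the tie handling (ties at the cut-off are broken by row order) are assumptions, and the real function may differ, for example by keeping tied rows.

```r
# Hypothetical helper (not part of mintyr): pick the top (perc > 0) or
# bottom (perc < 0) fraction of rows by a trait, then summarise the trait.
select_fraction <- function(data, trait, perc) {
  stopifnot(perc >= -1, perc <= 1, perc != 0)
  k <- round(abs(perc) * nrow(data))             # assumed rounding rule
  # Sort descending for the top fraction, ascending for the bottom one
  ord <- order(data[[trait]], decreasing = perc > 0)
  picked <- data[head(ord, k), , drop = FALSE]
  list(
    stat = data.frame(
      n    = nrow(picked),
      mean = mean(picked[[trait]]),
      sd   = sd(picked[[trait]])
    ),
    data = picked
  )
}

top10    <- select_fraction(iris, "Petal.Width", 0.1)   # top 10% of rows
bottom20 <- select_fraction(iris, "Petal.Width", -0.2)  # bottom 20% of rows
top10$stat
```

The returned list mirrors the stat and data elements described for keep_data = TRUE.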
See Also
- rstatix::get_summary_stats(): Statistical summary computation
- dplyr::top_frac(): Percentage-based data selection
Examples
# Example 1: Basic usage with single trait
# This example selects the top 10% of observations based on Petal.Width
# keep_data=TRUE returns both summary statistics and the filtered data
top_perc(iris,
perc = 0.1, # Select top 10%
trait = c("Petal.Width"), # Column to analyze
keep_data = TRUE) # Return both stats and filtered data
# Example 2: Using grouping with 'by' parameter
# This example performs the same analysis but separately for each Species
# With the default keep_data = FALSE, returns a data frame of summary statistics for each group
top_perc(iris,
perc = 0.1, # Select top 10%
trait = c("Petal.Width"), # Column to analyze
by = "Species") # Group by Species
# Example 3: Complex example with multiple percentages and grouping variables
# Reshape data from wide to long format for Sepal.Length and Sepal.Width
iris |>
tidyr::pivot_longer(1:2,
names_to = "names",
values_to = "values") |>
mintyr::top_perc(
perc = c(0.1, -0.2),
trait = "values",
by = c("Species", "names"),
type = "mean_sd")
Reshape Wide Data to Long Format and Nest by Specified Columns
Description
The w2l_nest function reshapes wide-format data into long-format and nests it by specified columns.
It handles both data.frame and data.table objects and provides options for grouping and nesting the data.
Usage
w2l_nest(data, cols2l = NULL, by = NULL, nest_type = "dt")
Arguments
data |
|
cols2l |
|
by |
|
nest_type |
|
Details
The function melts the specified wide columns into long format and nests the resulting data by the name column and any additional grouping variables specified in by. The nested data can be in the form of data.table or data.frame objects, controlled by the nest_type parameter.
Both cols2l and by parameters accept either column indices or column names, providing flexible ways to specify the columns for transformation and grouping.
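The melt-then-nest pattern described above can be sketched directly with data.table. This is a minimal illustration rather than the function's actual source. The name column follows the description above, but the value column name is an assumption.

```r
library(data.table)

dt <- as.data.table(iris)

# Melt columns 1:4 into long format; "name" holds the original column names.
# The "value" column name is an assumption made for this sketch.
long <- melt(
  dt,
  id.vars       = "Species",
  measure.vars  = 1:4,
  variable.name = "name",
  value.name    = "value"
)

# Nest: one row per (name, Species) pair; the remaining columns go into
# a list column called "data"
nested <- long[, .(data = list(.SD)), by = .(name, Species)]

nested            # name, Species, and a list column of data.tables
nested$data[[1]]  # the nested table for the first (name, Species) pair
```

Dropping Species from by in the last step would give one nested table per name only, which corresponds to the by = NULL case.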
Value
A data.table with nested data in long format, grouped by specified columns if provided. Each row contains a nested data.table or data.frame under the column data, depending on nest_type.
If by is NULL, returns a data.table nested by name.
If by is specified, returns a data.table nested by name and the grouping variables.
Note
Both cols2l and by parameters can be specified using either numeric indices or character column names.
When using numeric indices, they must be valid column positions in the data (1 to ncol(data)).
When using character names, all specified columns must exist in the data.
The function converts data.frame to data.table if necessary.
The nest_type parameter controls whether nested data are data.table ("dt") or data.frame ("df") objects.
If nest_type is not "dt" or "df", the function will stop with an error.
See Also
Related functions and packages:
- tidytable::nest_by(): Nest data.tables by group
Examples
# Example: Wide to long format nesting demonstrations
# Example 1: Basic nesting by group
w2l_nest(
data = iris, # Input dataset
by = "Species" # Group by Species column
)
# Example 2: Nest specific columns with numeric indices
w2l_nest(
data = iris, # Input dataset
cols2l = 1:4, # Select first 4 columns to nest
by = "Species" # Group by Species column
)
# Example 3: Nest specific columns with column names
w2l_nest(
data = iris, # Input dataset
cols2l = c("Sepal.Length", # Select columns by name
"Sepal.Width",
"Petal.Length"),
by = 5 # Group by column index 5 (Species)
)
# Returns similar structure to Example 2
Reshape Wide Data to Long Format and Split into List
Description
The w2l_split function reshapes wide-format data into long-format and splits it into a list by variable names and optional grouping columns. It handles both data.frame and data.table objects.
Usage
w2l_split(data, cols2l = NULL, by = NULL, split_type = "dt", sep = "_")
Arguments
data |
|
cols2l |
|
by |
|
split_type |
|
sep |
|
Details
The function melts the specified wide columns into long format and splits the resulting data into a list based on the variable names and any additional grouping variables specified in by. The split data can be in the form of data.table or data.frame objects, controlled by the split_type parameter.
Both cols2l and by parameters accept either column indices or column names, providing flexible ways to specify the columns for transformation and splitting.
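A base R sketch of the same melt-then-split idea is shown below. It is only an approximation for illustration: the value and name column names and the exact form of the list element names are assumptions, and the package itself works on data.table objects.

```r
# Base R sketch: long-format conversion followed by a split into a list.
wide_cols <- c("Sepal.Length", "Sepal.Width")

# stack() returns columns "values" and "ind" (the source column name)
long <- stack(iris[wide_cols])
names(long) <- c("value", "name")

# Carry the grouping column along, repeated once per stacked column
long$Species <- rep(iris$Species, times = length(wide_cols))

# One key per (name, Species) pair, joined with "_" like the default sep
key <- interaction(long$name, long$Species, sep = "_")
pieces <- split(long, key)

names(pieces)        # one element per variable-by-group combination
lapply(pieces, head) # first rows of each split
```

Splitting on long$name alone would give one list element per variable, matching the by = NULL case.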
Value
A list of data.table or data.frame objects (depending on split_type), split by variable names and optional grouping columns.
If by is NULL, returns a list split by variable names only.
If by is specified, returns a list split by both variable names and grouping variables.
Note
Both cols2l and by parameters can be specified using either numeric indices or character column names.
When using numeric indices, they must be valid column positions in the data (1 to ncol(data)).
When using character names, all specified columns must exist in the data.
The function converts data.frame to data.table if necessary.
The split_type parameter controls whether split data are data.table ("dt") or data.frame ("df") objects.
If split_type is not "dt" or "df", the function will stop with an error.
See Also
Related functions and packages:
- tidytable::group_split(): Split data frame by groups
Examples
# Example: Wide to long format splitting demonstrations
# Example 1: Basic splitting by Species
w2l_split(
data = iris, # Input dataset
by = "Species" # Split by Species column
) |>
lapply(head) # Show first 6 rows of each split
# Example 2: Split specific columns using numeric indices
w2l_split(
data = iris, # Input dataset
cols2l = 1:3, # Select first 3 columns to split
by = 5 # Split by column index 5 (Species)
) |>
lapply(head) # Show first 6 rows of each split
# Example 3: Split specific columns using column names
list_res <- w2l_split(
data = iris, # Input dataset
cols2l = c("Sepal.Length", # Select columns by name
"Sepal.Width"),
by = "Species" # Split by Species column
)
lapply(list_res, head) # Show first 6 rows of each split
# Returns similar structure to Example 2