Title: | Balancing Multiclass Datasets for Classification Tasks |
Version: | 0.2.0 |
Maintainer: | Keenan Ganz <ganzkeenan1@gmail.com> |
Description: | Imbalanced training datasets impede many popular classifiers. To balance training data, a combination of oversampling minority classes and undersampling majority classes is useful. This package implements the SCUT (SMOTE and Cluster-based Undersampling Technique) algorithm as described in Agrawal et. al. (2015) <doi:10.5220/0005595502260234>. Their paper uses model-based clustering and synthetic oversampling to balance multiclass training datasets, although other resampling methods are provided in this package. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.3 |
Imports: | smotefamily, parallel, mclust |
Depends: | R (≥ 2.10) |
URL: | https://github.com/s-kganz/scutr |
BugReports: | https://github.com/s-kganz/scutr/issues |
Suggests: | testthat (≥ 2.0.0) |
Config/testthat/edition: | 2 |
NeedsCompilation: | no |
Packaged: | 2023-11-17 22:42:02 UTC; rsgal |
Author: | Keenan Ganz [aut, cre] |
Repository: | CRAN |
Date/Publication: | 2023-11-17 23:10:02 UTC |
SMOTE and cluster-based undersampling technique.
Description
This function balances multiclass training datasets. In a dataframe with n
classes and m
rows, the resulting dataframe will have m / n
rows per class. SCUT_parallel()
distributes each over/undersampling task across multiple cores. Speedup usually occurs only if there are many classes using one of the slower resampling techniques (e.g. undersample_mclust()
). Note that SCUT_parallel()
will always run on one core on Windows.
Usage
SCUT(
data,
cls_col,
oversample = oversample_smote,
undersample = undersample_mclust,
osamp_opts = list(),
usamp_opts = list()
)
SCUT_parallel(
data,
cls_col,
ncores = detectCores()%/%2,
oversample = oversample_smote,
undersample = undersample_mclust,
osamp_opts = list(),
usamp_opts = list()
)
Arguments
data |
Numeric data frame. |
cls_col |
The column in |
oversample |
Oversampling method. Must be a function with the signature |
undersample |
Undersampling method. Must be a function with the signature |
osamp_opts |
List of options passed to the oversampling function. |
usamp_opts |
List of options passed to the undersampling function. |
ncores |
Number of cores to use with |
Details
Custom functions can be used to perform under/oversampling (see the required signature below). Parameters represented by ...
should be passsed via osamp_opts
or usamp_opts
as a list.
Value
A dataframe with equal class distribution.
References
Agrawal A, Viktor HL, Paquet E (2015). 'SCUT: Multi-class imbalanced data classification using SMOTE and cluster-based undersampling.' In 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), volume 01, 226-234.
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002). 'SMOTE: Synthetic Minority Over-sampling Technique.' Journal of Artificial Intelligence Research, 16, 321-357. ISSN 1076-9757, doi:10.1613/jair.953, https://www.jair.org/index.php/jair/article/view/10302.
Examples
ret <- SCUT(iris, "Species", undersample = undersample_hclust,
usamp_opts = list(dist_calc="manhattan"))
ret2 <- SCUT(chickwts, "feed", undersample = undersample_kmeans)
table(ret$Species)
table(ret2$feed)
# SCUT_parallel fires a warning if ncores > 1 on Windows and will run on
# one core only.
ret <- SCUT_parallel(wine, "type", ncores = 1, undersample = undersample_kmeans)
table(ret$type)
An imbalanced dataset with a minor class centered around the origin with a majority class surrounding the center.
Description
An imbalanced dataset with a minor class centered around the origin with a majority class surrounding the center.
Usage
bullseye
Format
a data.frame with 1000 rows and 3 columns.
Source
https://gist.github.com/s-kganz/c2534666e369f8e19491bb29d53c619d
An imbalanced dataset with randomly placed normal distributions around the origin. The nth class has n * 10 observations.
Description
An imbalanced dataset with randomly placed normal distributions around the origin. The nth class has n * 10 observations.
Usage
imbalance
Format
a data.frame with 2100 rows and 11 columns
Source
https://gist.github.com/s-kganz/d08473f9492d48ea0e56c3c8a3fe1a74
Oversample a dataset by SMOTE.
Description
Oversample a dataset by SMOTE.
Usage
oversample_smote(data, cls, cls_col, m, k = NA)
Arguments
data |
Dataset to be oversampled. |
cls |
Class to be oversampled. |
cls_col |
Column containing class information. |
m |
Desired number of samples in the oversampled data. |
k |
Number of neighbors used in |
Value
The oversampled dataset.
Examples
table(iris$Species)
smoted <- oversample_smote(iris, "setosa", "Species", 100)
nrow(smoted)
Randomly resample a dataset.
Description
This function is used to resample a dataset by randomly removing or duplicating rows. It is usable for both oversampling and undersampling.
Usage
resample_random(data, cls, cls_col, m)
Arguments
data |
Dataframe to be resampled. |
cls |
Class that should be randomly resampled. |
cls_col |
Column containing class information. |
m |
Desired number of samples. |
Value
Resampled dataframe containing only cls
.
Examples
set.seed(1234)
only2 <- resample_random(wine, 2, "type", 15)
Stratified index sample of different values in a vector.
Description
Stratified index sample of different values in a vector.
Usage
sample_classes(vec, tot_sample)
Arguments
vec |
Vector of values to sample from. |
tot_sample |
Total number of samples. |
Value
A vector of indices that can be used to select a balanced population of values from vec
.
Examples
vec <- sample(1:5, 30, replace = TRUE)
table(vec)
sample_ind <- sample_classes(vec, 15)
table(vec[sample_ind])
Undersample a dataset by hierarchical clustering.
Description
Undersample a dataset by hierarchical clustering.
Usage
undersample_hclust(data, cls, cls_col, m, k = 5, h = NA, ...)
Arguments
data |
Dataset to be undersampled. |
cls |
Majority class that will be undersampled. |
cls_col |
Column in data containing class memberships. |
m |
Number of samples in undersampled dataset. |
k |
Number of clusters to derive from clustering. |
h |
Height at which to cut the clustering tree. |
... |
Additional arguments passed to |
Value
Undersampled dataframe containing only cls
.
Examples
table(iris$Species)
undersamp <- undersample_hclust(iris, "setosa", "Species", 15)
nrow(undersamp)
Undersample a dataset by kmeans clustering.
Description
Undersample a dataset by kmeans clustering.
Usage
undersample_kmeans(data, cls, cls_col, m, k = 5, ...)
Arguments
data |
Dataset to be undersampled. |
cls |
Class to be undersampled. |
cls_col |
Column containing class information. |
m |
Number of samples in undersampled dataset. |
k |
Number of centers in clustering. |
... |
Additional arguments passed to |
Value
The undersampled dataframe containing only instances of cls
.
Examples
table(iris$Species)
undersamp <- undersample_kmeans(iris, "setosa", "Species", 15)
nrow(undersamp)
Undersample a dataset by expectation-maximization clustering
Description
Undersample a dataset by expectation-maximization clustering
Usage
undersample_mclust(data, cls, cls_col, m, ...)
Arguments
data |
Data to be undersampled. |
cls |
Class to be undersampled. |
cls_col |
Class column. |
m |
Number of samples in undersampled dataset. |
... |
Additional arguments passed to |
Value
The undersampled dataframe containing only instance of cls
.
Examples
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)
undersamp <- undersample_mclust(setosa, "setosa", "Species", 15)
nrow(undersamp)
Undersample a dataset by iteratively removing the observation with the lowest total distance to its neighbors of the same class.
Description
Undersample a dataset by iteratively removing the observation with the lowest total distance to its neighbors of the same class.
Usage
undersample_mindist(data, cls, cls_col, m, ...)
Arguments
data |
Dataset to undersample. Aside from |
cls |
Class to be undersampled. |
cls_col |
Column containing class information. |
m |
Desired number of observations after undersampling. |
... |
Additional arguments passed to |
Value
An undersampled dataframe.
Examples
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)
undersamp <- undersample_mindist(setosa, "setosa", "Species", 50)
nrow(undersamp)
Undersample a dataset by removing Tomek links.
Description
A Tomek link is a minority instance and majority instance that are each other's nearest neighbor. This function removes sufficient Tomek links that are an instance of cls to yield m instances of cls. If desired, samples are randomly discarded to yield m rows if insufficient Tomek links are in the data.
Usage
undersample_tomek(data, cls, cls_col, m, tomek = "minor", force_m = TRUE, ...)
Arguments
data |
Dataset to be undersampled. |
cls |
Majority class to be undersampled. |
cls_col |
Column in data containing class memberships. |
m |
Desired number of samples in undersampled dataset. |
tomek |
Definition used to determine if a point is considered a minority in the Tomek link definition.
|
force_m |
If |
... |
Additional arguments passed to |
Value
Undersampled dataframe containing only cls
.
Examples
table(iris$Species)
undersamp <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = TRUE)
nrow(undersamp)
undersamp2 <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = FALSE)
nrow(undersamp2)
Validate a dataset for resampling.
Description
This functions checks that the given column is present in the data and that all columns besides the class column are numeric.
Usage
validate_dataset(data, cls_col)
Arguments
data |
Dataframe to validate. |
cls_col |
Column with class information. |
Value
NA
Type and chemical analysis of three different kinds of wine.
Description
Type and chemical analysis of three different kinds of wine.
Usage
wine
Format
a data.frame with 178 rows and 14 columns