Title: | User-Friendly R Package for Supervised Machine Learning Pipelines |
Version: | 1.6.1 |
Date: | 2023-08-21 |
Description: | An interface to build machine learning models for classification and regression problems. 'mikropml' implements the ML pipeline described by Topçuoğlu et al. (2020) <doi:10.1128/mBio.00434-20> with reasonable default options for data preprocessing, hyperparameter tuning, cross-validation, testing, model evaluation, and interpretation steps. See the website https://www.schlosslab.org/mikropml/ for more information, documentation, and examples. |
License: | MIT + file LICENSE |
URL: | https://www.schlosslab.org/mikropml/, https://github.com/SchlossLab/mikropml |
BugReports: | https://github.com/SchlossLab/mikropml/issues |
Depends: | R (≥ 4.1.0) |
Imports: | caret, dplyr, e1071, glmnet, kernlab, MLmetrics, randomForest, rlang, rpart, stats, utils, xgboost |
Suggests: | assertthat, doFuture, forcats, foreach, future, future.apply, furrr, ggplot2, knitr, progress, progressr, purrr, rmarkdown, rsample, testthat, tidyr |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.3 |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2023-08-21 14:16:32 UTC; sovacool |
Author: | Begüm Topçuoğlu |
Maintainer: | Kelly Sovacool <sovacool@umich.edu> |
Repository: | CRAN |
Date/Publication: | 2023-08-21 15:10:05 UTC |
mikropml: User-Friendly R Package for Robust Machine Learning Pipelines
Description
mikropml implements supervised machine learning pipelines using regression, support vector machines, decision trees, random forest, or gradient-boosted trees. The main functions are preprocess_data() to process your data prior to running machine learning, and run_ml() to run machine learning.
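A minimal sketch of the typical workflow (the dataset, outcome column, and seed below follow the package's built-in otu_mini_bin examples and are illustrative, not prescriptive):
library(mikropml)
preproc <- preprocess_data(otu_mini_bin, outcome_colname = "dx")
results <- run_ml(preproc$dat_transformed, "glmnet", outcome_colname = "dx", seed = 2019)
results$performance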
Authors
Begüm D. Topçuoğlu (ORCID)
Zena Lapp (ORCID)
Kelly L. Sovacool (ORCID)
Evan Snitkin (ORCID)
Jenna Wiens (ORCID)
Patrick D. Schloss (ORCID)
See vignettes
Author(s)
Maintainer: Kelly Sovacool sovacool@umich.edu (ORCID)
Authors:
Begüm Topçuoğlu topcuoglu.begum@gmail.com (ORCID)
Zena Lapp zenalapp@umich.edu (ORCID)
Evan Snitkin (ORCID)
Jenna Wiens (ORCID)
Patrick Schloss pschloss@umich.edu (ORCID)
Other contributors:
Nick Lesniak nlesniak@umich.edu (ORCID) [contributor]
Courtney Armour armourc@umich.edu (ORCID) [contributor]
Sarah Lucas salucas@umich.edu (ORCID) [contributor]
See Also
Useful links:
Report bugs at https://github.com/SchlossLab/mikropml/issues
Throw error if required packages are not installed.
Description
Reports which packages need to be installed and the parent function name (see https://stackoverflow.com/questions/15595478/how-to-get-the-name-of-the-calling-function-inside-the-called-routine). This is only intended to be used inside a function; it will throw an error otherwise.
Usage
abort_packages_not_installed(...)
Arguments
... |
names of packages to check |
Author(s)
Kelly Sovacool sovacool@umich.edu
Examples
## Not run:
abort_packages_not_installed("base")
abort_packages_not_installed("not-a-package-name", "caret", "dplyr", "non_package")
## End(Not run)
Calculate a bootstrap confidence interval for the performance on a single train/test split
Description
Uses rsample::bootstraps(), rsample::int_pctl(), and furrr::future_map().
Usage
bootstrap_performance(
ml_result,
outcome_colname,
bootstrap_times = 10000,
alpha = 0.05
)
Arguments
ml_result |
result returned from a single |
outcome_colname |
Column name as a string of the outcome variable
(default |
bootstrap_times |
the number of bootstraps to create (default: |
alpha |
the alpha level for the confidence interval (default |
Value
a data frame with an estimate (.estimate), lower bound (.lower), and upper bound (.upper) for each performance metric (term).
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
bootstrap_performance(otu_mini_bin_results_glmnet, "dx",
bootstrap_times = 10, alpha = 0.10
)
## Not run:
outcome_colname <- "dx"
run_ml(otu_mini_bin, "rf", outcome_colname = "dx") %>%
bootstrap_performance(outcome_colname,
bootstrap_times = 10000,
alpha = 0.05
)
## End(Not run)
Calculate balanced precision given actual and baseline precision
Description
Implements Equation 1 from Wu et al. 2021 doi:10.1016/j.ajhg.2021.08.012. It is the same as Equation 7 if AUPRC (aka prAUC) is used in place of precision.
Usage
calc_balanced_precision(precision, prior)
Arguments
precision |
actual precision of the model. |
prior |
baseline precision, aka frequency of positives. Can be calculated with calc_baseline_precision |
Value
the expected precision if the data were balanced
Author(s)
Kelly Sovacool sovacool@umich.edu
Examples
prior <- calc_baseline_precision(otu_mini_bin,
outcome_colname = "dx",
pos_outcome = "cancer"
)
calc_balanced_precision(otu_mini_bin_results_rf$performance$Precision, prior)
otu_mini_bin_results_rf$performance %>%
dplyr::mutate(
balanced_precision = calc_balanced_precision(Precision, prior),
aubprc = calc_balanced_precision(prAUC, prior)
) %>%
dplyr::select(AUC, Precision, balanced_precision, aubprc)
# cumulative performance for a single model
sensspec_1 <- calc_model_sensspec(
otu_mini_bin_results_glmnet$trained_model,
otu_mini_bin_results_glmnet$test_data,
"dx"
)
head(sensspec_1)
prior <- calc_baseline_precision(otu_mini_bin,
outcome_colname = "dx",
pos_outcome = "cancer"
)
sensspec_1 %>%
dplyr::mutate(balanced_precision = calc_balanced_precision(precision, prior)) %>%
dplyr::rename(recall = sensitivity) %>%
calc_mean_perf(group_var = recall, sum_var = balanced_precision) %>%
plot_mean_prc(ycol = mean_balanced_precision)
Calculate the fraction of positives, i.e. baseline precision for a PRC curve
Description
Calculate the fraction of positives, i.e. baseline precision for a PRC curve
Usage
calc_baseline_precision(dataset, outcome_colname = NULL, pos_outcome = NULL)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
outcome_colname |
Column name as a string of the outcome variable
(default |
pos_outcome |
the positive outcome from |
Value
the baseline precision based on the fraction of positives
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
# calculate the baseline precision
data.frame(y = c("a", "b", "a", "b")) %>%
calc_baseline_precision(
outcome_colname = "y",
pos_outcome = "a"
)
calc_baseline_precision(otu_mini_bin,
outcome_colname = "dx",
pos_outcome = "cancer"
)
# if you're not sure which outcome was used as the 'positive' outcome during
# model training, you can access it from the trained model and pass it along:
calc_baseline_precision(otu_mini_bin,
outcome_colname = "dx",
pos_outcome = otu_mini_bin_results_glmnet$trained_model$levels[1]
)
Generic function to calculate mean performance curves for multiple models
Description
Used by calc_mean_roc() and calc_mean_prc().
Usage
calc_mean_perf(sensspec_dat, group_var = specificity, sum_var = sensitivity)
Arguments
sensspec_dat |
data frame created by concatenating results of
|
group_var |
variable to group by (e.g. specificity or recall). |
sum_var |
variable to summarize (e.g. sensitivity or precision). |
Value
data frame with mean & standard deviation of sum_var summarized over group_var
Author(s)
Courtney Armour
Kelly Sovacool
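A brief sketch of calling this generic directly (assumes sensspec_dat has been built by concatenating calc_model_sensspec() results, as in the examples of the next section):
## Not run:
calc_mean_perf(sensspec_dat, group_var = specificity, sum_var = sensitivity)
## End(Not run)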
Calculate and summarize performance for ROC and PRC plots
Description
Use these functions to calculate cumulative sensitivity, specificity, recall, etc. on single models, concatenate the results together from multiple models, and compute mean ROC and PRC. You can then plot mean ROC and PRC curves to visualize the results. Note: These functions assume a binary outcome.
Usage
calc_model_sensspec(trained_model, test_data, outcome_colname = NULL)
calc_mean_roc(sensspec_dat)
calc_mean_prc(sensspec_dat)
Arguments
trained_model |
Trained model from |
test_data |
Held out test data: dataframe of outcome and features. |
outcome_colname |
Column name as a string of the outcome variable
(default |
sensspec_dat |
data frame created by concatenating results of
|
Value
data frame with summarized performance
Functions
- calc_model_sensspec(): Get sensitivity, specificity, and precision for a model.
- calc_mean_roc(): Calculate mean sensitivity over specificity for multiple models
- calc_mean_prc(): Calculate mean precision over recall for multiple models
Author(s)
Courtney Armour
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
library(dplyr)
# get cumulative performance for a single model
sensspec_1 <- calc_model_sensspec(
otu_mini_bin_results_glmnet$trained_model,
otu_mini_bin_results_glmnet$test_data,
"dx"
)
head(sensspec_1)
# get performance for multiple models
get_sensspec_seed <- function(seed) {
ml_result <- run_ml(otu_mini_bin, "glmnet", seed = seed)
sensspec <- calc_model_sensspec(
ml_result$trained_model,
ml_result$test_data,
"dx"
) %>%
dplyr::mutate(seed = seed)
return(sensspec)
}
sensspec_dat <- purrr::map_dfr(seq(100, 102), get_sensspec_seed)
# calculate mean sensitivity over specificity
roc_dat <- calc_mean_roc(sensspec_dat)
head(roc_dat)
# calculate mean precision over recall
prc_dat <- calc_mean_prc(sensspec_dat)
head(prc_dat)
# plot ROC & PRC
roc_dat %>% plot_mean_roc()
baseline_prec <- calc_baseline_precision(otu_mini_bin, "dx", "cancer")
prc_dat %>%
plot_mean_prc(baseline_precision = baseline_prec)
# balanced precision
prior <- calc_baseline_precision(otu_mini_bin,
outcome_colname = "dx",
pos_outcome = "cancer"
)
bprc_dat <- sensspec_dat %>%
dplyr::mutate(balanced_precision = calc_balanced_precision(precision, prior)) %>%
dplyr::rename(recall = sensitivity) %>%
calc_mean_perf(group_var = recall, sum_var = balanced_precision)
bprc_dat %>% plot_mean_prc(ycol = mean_balanced_precision) + ylab("Mean Bal. Precision")
## End(Not run)
Calculate performance for a single split from rsample::bootstraps()
Description
Used by bootstrap_performance().
Usage
calc_perf_bootstrap_split(
test_data_split,
trained_model,
outcome_colname,
perf_metric_function,
perf_metric_name,
class_probs,
method,
seed
)
Arguments
test_data_split |
a single bootstrap of the test set from |
trained_model |
Trained model from |
outcome_colname |
Column name as a string of the outcome variable
(default |
perf_metric_function |
Function to calculate the performance metric to
be used for cross-validation and test performance. Some functions are
provided by caret (see |
perf_metric_name |
The column name from the output of the function
provided to perf_metric_function that is to be used as the performance metric.
Defaults: binary classification = |
class_probs |
Whether to use class probabilities (TRUE for categorical outcomes, FALSE for numeric outcomes). |
method |
ML method.
Options:
|
seed |
Random seed (default: |
Value
a long data frame of performance metrics for rsample::int_pctl()
Author(s)
Kelly Sovacool, sovacool@umich.edu
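A hedged sketch of how this internal helper might be called on a single bootstrap split, roughly as bootstrap_performance() uses it; the example objects and argument values here are illustrative:
## Not run:
split <- rsample::bootstraps(otu_mini_bin_results_glmnet$test_data, times = 1)$splits[[1]]
calc_perf_bootstrap_split(split,
  otu_mini_bin_results_glmnet$trained_model,
  "dx",
  caret::multiClassSummary,
  "AUC",
  class_probs = TRUE,
  method = "glmnet",
  seed = 2019
)
## End(Not run)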
Get performance metrics for test data
Description
Get performance metrics for test data
Usage
calc_perf_metrics(
test_data,
trained_model,
outcome_colname,
perf_metric_function,
class_probs
)
Arguments
test_data |
Held out test data: dataframe of outcome and features. |
trained_model |
Trained model from |
outcome_colname |
Column name as a string of the outcome variable
(default |
perf_metric_function |
Function to calculate the performance metric to
be used for cross-validation and test performance. Some functions are
provided by caret (see |
class_probs |
Whether to use class probabilities (TRUE for categorical outcomes, FALSE for numeric outcomes). |
Value
Dataframe of performance metrics.
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
results <- run_ml(otu_small, "glmnet", kfold = 2, cv_times = 2)
calc_perf_metrics(results$test_data,
results$trained_model,
"dx",
multiClassSummary,
class_probs = TRUE
)
## End(Not run)
Calculate the p-value for a permutation test
Description
Compute a Monte Carlo p-value with correction, based on the formula from page 158 of 'Bootstrap Methods and Their Application' by Davison & Hinkley (1997).
Usage
calc_pvalue(vctr, test_stat)
Arguments
vctr |
vector of statistics |
test_stat |
the test statistic |
Value
the number of observations in vctr that are greater than test_stat, divided by the number of observations in vctr
Author(s)
Kelly Sovacool sovacool@umich.edu
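An illustrative sketch of the inputs this internal helper expects (the values below are made up):
## Not run:
perm_stats <- rnorm(1000) # vector of statistics from permutations
test_stat <- 1.5          # observed test statistic
calc_pvalue(perm_stats, test_stat)
## End(Not run)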
Change columns to numeric if possible
Description
Change columns to numeric if possible
Usage
change_to_num(features)
Arguments
features |
dataframe of features for machine learning |
Value
dataframe with numeric columns where possible
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
class(change_to_num(data.frame(val = c("1", "2", "3")))[[1]])
## End(Not run)
Check all params that don't return a value
Description
Check all params that don't return a value
Usage
check_all(
dataset,
method,
permute,
kfold,
training_frac,
perf_metric_function,
perf_metric_name,
groups,
group_partitions,
corr_thresh,
seed,
hyperparameters
)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
method |
ML method.
Options:
|
kfold |
Fold number for k-fold cross-validation (default: |
training_frac |
Fraction of data for training set (default: |
perf_metric_function |
Function to calculate the performance metric to
be used for cross-validation and test performance. Some functions are
provided by caret (see |
perf_metric_name |
The column name from the output of the function
provided to perf_metric_function that is to be used as the performance metric.
Defaults: binary classification = |
groups |
Vector of groups to keep together when splitting the data into
train and test sets. If the number of groups in the training set is larger
than |
group_partitions |
Specify how to assign |
corr_thresh |
For feature importance, group correlations
above or equal to |
seed |
Random seed (default: |
hyperparameters |
Dataframe of hyperparameters
(default |
Author(s)
Kelly Sovacool, sovacool@umich.edu
Check if any features are categorical
Description
Check if any features are categorical
Usage
check_cat_feats(feats)
Arguments
feats |
features |
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
check_cat_feats(otu_mini_bin)
## End(Not run)
check that corr_thresh is either NULL or a number between 0 and 1
Description
check that corr_thresh is either NULL or a number between 0 and 1
Usage
check_corr_thresh(corr_thresh)
Arguments
corr_thresh |
correlation threshold |
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
check_corr_thresh(1)
check_corr_thresh(0.8)
check_corr_thresh(2019)
check_corr_thresh(NULL)
## End(Not run)
Check that the dataset is not empty and has more than 1 column.
Description
Errors if there are no rows or fewer than 2 columns.
Usage
check_dataset(dataset)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
check_dataset(otu_small)
## End(Not run)
Check features
Description
Check features
Usage
check_features(features, check_missing = TRUE)
Arguments
features |
features for machine learning |
check_missing |
check whether the features have missing data (default: TRUE) |
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
check_features(otu_mini_bin[, 2:11])
## End(Not run)
Check the validity of the group_partitions list
Description
Check the validity of the group_partitions list
Usage
check_group_partitions(dataset, groups, group_partitions)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
groups |
Vector of groups to keep together when splitting the data into
train and test sets. If the number of groups in the training set is larger
than |
group_partitions |
Specify how to assign |
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
check_group_partitions(
otu_mini_bin,
sample(LETTERS[1:8],
size = nrow(otu_mini_bin),
replace = TRUE
),
list(train = c("A", "B"), test = c("C", "D"))
)
## End(Not run)
Check grouping vector
Description
Check grouping vector
Usage
check_groups(dataset, groups, kfold)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
groups |
Vector of groups to keep together when splitting the data into
train and test sets. If the number of groups in the training set is larger
than |
kfold |
Fold number for k-fold cross-validation (default: |
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
check_groups(mikropml::otu_mini_bin,
sample(LETTERS, nrow(mikropml::otu_mini_bin), replace = TRUE),
kfold = 2
)
## End(Not run)
Check that kfold is an integer of reasonable size
Description
Check that kfold is an integer of reasonable size
Usage
check_kfold(kfold, dataset)
Arguments
kfold |
Fold number for k-fold cross-validation (default: |
dataset |
Data frame with an outcome variable and other columns as features. |
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
check_kfold(5, otu_small)
## End(Not run)
Check if the method is supported. If not, throws error.
Description
Check if the method is supported. If not, throws error.
Usage
check_method(method, hyperparameters)
Arguments
method |
ML method.
Options:
|
hyperparameters |
Dataframe of hyperparameters
(default |
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
check_method("rf")
## End(Not run)
Check ntree
Description
Check ntree
Usage
check_ntree(ntree)
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
check_ntree(NULL)
## End(Not run)
Check that outcome column exists. Pick outcome column if not specified.
Description
Check that outcome column exists. Pick outcome column if not specified.
Usage
check_outcome_column(
dataset,
outcome_colname,
check_values = TRUE,
show_message = TRUE
)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
outcome_colname |
Column name as a string of the outcome variable
(default |
check_values |
whether to check the outcome values or just get the column (default: TRUE) |
show_message |
whether to show which column is being used as the output column (default: TRUE) |
Value
outcome colname
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
check_outcome_column(otu_small, NULL)
check_outcome_column(otu_small, "dx")
## End(Not run)
Check that the outcome variable is valid. Pick outcome value if necessary.
Description
Check that the outcome variable is valid. Pick outcome value if necessary.
Usage
check_outcome_value(dataset, outcome_colname)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
outcome_colname |
Column name as a string of the outcome variable
(default |
Value
outcome value
Author(s)
Zena Lapp, zenalapp@umich.edu
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
check_outcome_value(otu_small, "dx", "cancer")
## End(Not run)
Check whether package(s) are installed
Description
Check whether package(s) are installed
Usage
check_packages_installed(...)
Arguments
... |
names of packages to check |
Value
named vector with the status of each package; installed (TRUE) or not (FALSE)
Author(s)
Kelly Sovacool sovacool@umich.edu
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
check_packages_installed("base")
check_packages_installed("not-a-package-name")
all(check_packages_installed("parallel", "doFuture"))
## End(Not run)
Check perf_metric_function is NULL or a function
Description
Check perf_metric_function is NULL or a function
Usage
check_perf_metric_function(perf_metric_function)
Arguments
perf_metric_function |
performance metric function |
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
check_perf_metric_function(NULL)
## End(Not run)
Check perf_metric_name is NULL or a function
Description
Check perf_metric_name is NULL or a function
Usage
check_perf_metric_name(perf_metric_name)
Arguments
perf_metric_name |
performance metric function |
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
check_perf_metric_name(NULL)
## End(Not run)
Check that permute is a logical
Description
Check that permute is a logical
Usage
check_permute(permute)
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
do_permute <- TRUE
check_permute(do_permute)
## End(Not run)
Check remove_var
Description
Check remove_var
Usage
check_remove_var(remove_var)
Arguments
remove_var |
Whether to remove variables with near-zero variance
( |
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
check_remove_var(NULL)
## End(Not run)
check that the seed is either NA or a number
Description
check that the seed is either NA or a number
Usage
check_seed(seed)
Arguments
seed |
random seed |
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
check_seed(2019)
check_seed(NULL)
## End(Not run)
Check that the training fraction is between 0 and 1
Description
Check that the training fraction is between 0 and 1
Usage
check_training_frac(frac)
Arguments
frac |
fraction (numeric) |
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
check_training_frac(0.8)
## End(Not run)
Check the validity of the training indices
Description
Check the validity of the training indices
Usage
check_training_indices(training_inds, dataset)
Arguments
training_inds |
vector of integers corresponding to samples for the training set |
dataset |
data frame containing the entire dataset |
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
training_indices <- otu_small %>%
nrow() %>%
sample(., size = 160)
check_training_indices(training_indices, otu_small)
## End(Not run)
Cluster a matrix of correlated features
Description
Cluster a matrix of correlated features
Usage
cluster_corr_mat(bin_corr_mat, hclust_method = "single", cut_height = 0)
Arguments
bin_corr_mat |
a binary correlation matrix created by |
hclust_method |
the |
cut_height |
the cut height ( |
Value
a named vector from stats::cutree(). Each element is a cluster and the name is a feature in that cluster.
Author(s)
Kelly Sovacool, sovacool@umich.edu
Pat Schloss, pschloss@umich.edu
Examples
## Not run:
corr_mat <- matrix(
data = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1),
nrow = 4,
dimnames = list(
c("a", "b", "c", "d"),
c("a", "b", "c", "d")
)
)
corr_mat
cluster_corr_mat(corr_mat)
## End(Not run)
Collapse correlated features
Description
Collapse correlated features
Usage
collapse_correlated_features(features, group_neg_corr = TRUE, progbar = NULL)
Arguments
features |
dataframe of features for machine learning |
group_neg_corr |
Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)). |
progbar |
optional progress bar (default: |
Value
features where perfectly correlated ones are collapsed
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
collapse_correlated_features(mikropml::otu_small[, 2:ncol(otu_small)])
## End(Not run)
Combine hyperparameter performance metrics for multiple train/test splits
Description
Combine hyperparameter performance metrics for multiple train/test splits generated by, for instance, looping in R or using a snakemake workflow on a high-performance computer.
Usage
combine_hp_performance(trained_model_lst)
Arguments
trained_model_lst |
List of trained models. |
Value
Named list:
- dat: Dataframe of performance metric for each group of hyperparameters
- params: Hyperparameters tuned.
- Metric: Performance metric used.
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
results <- lapply(seq(100, 102), function(seed) {
run_ml(otu_small, "glmnet", seed = seed, cv_times = 2, kfold = 2)
})
models <- lapply(results, function(x) x$trained_model)
combine_hp_performance(models)
## End(Not run)
Perform permutation tests to compare the performance metric across all pairs of a group variable.
Description
A wrapper for permute_p_value().
Usage
compare_models(merged_data, metric, group_name, nperm = 10000)
Arguments
merged_data |
the concatenated performance data from |
metric |
metric to compare, must be numeric |
group_name |
column with group variables to compare |
nperm |
number of permutations, default=10000 |
Value
a table of p-values for all pairs of group variable
Author(s)
Courtney R Armour, armourc@umich.edu
Examples
df <- dplyr::tibble(
model = c("rf", "rf", "glmnet", "glmnet", "svmRadial", "svmRadial"),
AUC = c(.2, 0.3, 0.8, 0.9, 0.85, 0.95)
)
set.seed(123)
compare_models(df, "AUC", "model", nperm = 10)
Split into train and test set while splitting by groups.
When group_partitions is NULL, all samples from each group will go into either the training set or the testing set. Otherwise, the groups will be split according to group_partitions.
Description
Split into train and test set while splitting by groups.
When group_partitions is NULL, all samples from each group will go into either the training set or the testing set. Otherwise, the groups will be split according to group_partitions.
Usage
create_grouped_data_partition(
groups,
group_partitions = NULL,
training_frac = 0.8
)
Arguments
groups |
Vector of groups to keep together when splitting the data into
train and test sets. If the number of groups in the training set is larger
than |
group_partitions |
Specify how to assign |
training_frac |
Fraction of data for training set (default: |
Value
vector of row indices for the training set
Author(s)
Zena Lapp, zenalapp@umich.edu
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
groups <- c("A", "B", "A", "B", "C", "C", "A", "A", "D")
set.seed(0)
create_grouped_data_partition(groups, training_frac = 0.8)
groups <- rep.int(c("A", "B", "C"), 3)
create_grouped_data_partition(groups,
group_partitions = list(train = c("A"), test = c("A", "B", "C"))
)
## End(Not run)
Splitting into folds for cross-validation when using groups
Description
Like createMultiFolds but still splitting by groups using groupKFold. Code modified from createMultiFolds.
Usage
create_grouped_k_multifolds(groups, kfold = 10, cv_times = 5)
Arguments
groups |
equivalent to y in caret::createMultiFolds |
kfold |
equivalent to k in caret::createMultiFolds |
cv_times |
equivalent to cv_times in caret::createMultiFolds |
Value
indices of folds for CV
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
set.seed(0)
groups <- c("A", "B", "A", "B", "C", "C", "A", "A", "D")
folds <- create_grouped_k_multifolds(groups, kfold = 2, cv_times = 2)
## End(Not run)
Define cross-validation scheme and training parameters
Description
Define cross-validation scheme and training parameters
Usage
define_cv(
train_data,
outcome_colname,
hyperparams_list,
perf_metric_function,
class_probs,
kfold = 5,
cv_times = 100,
groups = NULL,
group_partitions = NULL
)
Arguments
train_data |
Dataframe for training model. |
outcome_colname |
Column name as a string of the outcome variable
(default |
hyperparams_list |
Named list of lists of hyperparameters. |
perf_metric_function |
Function to calculate the performance metric to
be used for cross-validation and test performance. Some functions are
provided by caret (see |
class_probs |
Whether to use class probabilities (TRUE for categorical outcomes, FALSE for numeric outcomes). |
kfold |
Fold number for k-fold cross-validation (default: |
cv_times |
Number of cross-validation partitions to create (default: |
groups |
Vector of groups to keep together when splitting the data into
train and test sets. If the number of groups in the training set is larger
than |
group_partitions |
Specify how to assign |
Value
Caret object for trainControl that controls cross-validation
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Kelly Sovacool, sovacool@umich.edu
Examples
training_inds <- get_partition_indices(otu_small %>% dplyr::pull("dx"),
training_frac = 0.8,
groups = NULL
)
train_data <- otu_small[training_inds, ]
test_data <- otu_small[-training_inds, ]
cv <- define_cv(train_data,
outcome_colname = "dx",
hyperparams_list = get_hyperparams_list(otu_small, "glmnet"),
perf_metric_function = caret::multiClassSummary,
class_probs = TRUE,
kfold = 5
)
Get permuted performance metric difference for a single feature (or group of features)
Description
Requires the future.apply package.
Usage
find_permuted_perf_metric(
test_data,
trained_model,
outcome_colname,
perf_metric_function,
perf_metric_name,
class_probs,
feat,
test_perf_value,
nperms = 100,
alpha = 0.05,
progbar = NULL
)
Arguments
test_data |
Held out test data: dataframe of outcome and features. |
trained_model |
Trained model from |
outcome_colname |
Column name as a string of the outcome variable
(default |
perf_metric_function |
Function to calculate the performance metric to
be used for cross-validation and test performance. Some functions are
provided by caret (see |
perf_metric_name |
The column name from the output of the function
provided to perf_metric_function that is to be used as the performance metric.
Defaults: binary classification = |
class_probs |
Whether to use class probabilities (TRUE for categorical outcomes, FALSE for numeric outcomes). |
feat |
feature or group of correlated features to permute. |
test_perf_value |
value of the true performance metric on the held-out test data. |
nperms |
number of permutations to perform (default: |
alpha |
alpha level for the confidence interval
(default: |
progbar |
optional progress bar (default: |
Value
vector of mean permuted performance and mean difference between test and permuted performance (test minus permuted performance)
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Zena Lapp, zenalapp@umich.edu
Kelly Sovacool, sovacool@umich.edu
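A hedged sketch of a single call, roughly as get_feature_importance() uses this helper internally; the feature name, metric, and other argument values are illustrative:
## Not run:
find_permuted_perf_metric(
  otu_mini_bin_results_glmnet$test_data,
  otu_mini_bin_results_glmnet$trained_model,
  "dx",
  caret::multiClassSummary, "AUC",
  class_probs = TRUE,
  feat = "Otu00001",
  test_perf_value = otu_mini_bin_results_glmnet$performance$AUC,
  nperms = 10
)
## End(Not run)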
Flatten correlation matrix to pairs
Description
Flatten correlation matrix to pairs
Usage
flatten_corr_mat(cormat)
Arguments
cormat |
correlation matrix computed with stats::cor |
Value
flattened correlation matrix (pairs of features and their correlation)
Author(s)
Zena Lapp, zenalapp@umich.edu
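A small sketch of the expected input (flatten_corr_mat() is an internal helper; the toy data here are illustrative):
## Not run:
feats <- data.frame(a = 1:5, b = c(2, 3, 4, 5, 7), c = c(5, 4, 3, 2, 1))
cormat <- stats::cor(feats, method = "spearman")
flatten_corr_mat(cormat)
## End(Not run)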
Identify correlated features as a binary matrix
Description
Identify correlated features as a binary matrix
Usage
get_binary_corr_mat(
features,
corr_thresh = 1,
group_neg_corr = TRUE,
corr_method = "spearman"
)
Arguments
features |
a dataframe with each column as a feature for ML |
corr_thresh |
For feature importance, group correlations
above or equal to |
group_neg_corr |
Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)). |
corr_method |
correlation method. Options are the same as those supported by |
Value
A binary matrix of correlated features
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
features <- data.frame(
a = 1:3, b = 2:4, c = c(1, 0, 1),
d = (5:7), e = c(5, 1, 4)
)
get_binary_corr_mat(features)
## End(Not run)
Get dummyvars dataframe (i.e. design matrix)
Description
Get dummyvars dataframe (i.e. design matrix)
Usage
get_caret_dummyvars_df(features, full_rank = FALSE, progbar = NULL)
Arguments
features |
dataframe of features for machine learning |
full_rank |
whether matrix should be full rank or not (see caret::dummyVars) |
progbar |
optional progress bar (default: |
Value
design matrix
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
df <- data.frame(
outcome = c("normal", "normal", "cancer"),
var1 = 1:3,
var2 = c("a", "b", "c"),
var3 = c("no", "yes", "no"),
var4 = c(0, 1, 0)
)
get_caret_dummyvars_df(df, TRUE)
## End(Not run)
Get preprocessed dataframe for continuous variables
Description
Get preprocessed dataframe for continuous variables
Usage
get_caret_processed_df(features, method)
Arguments
features |
Dataframe of features for machine learning |
method |
Methods to preprocess the data, described in
|
Value
Named list:
- processed: Dataframe of processed features.
- removed: Names of any features removed during preprocessing.
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
get_caret_processed_df(mikropml::otu_small[, 2:ncol(otu_small)], c("center", "scale"))
Identify correlated features
Description
Identify correlated features
Usage
get_corr_feats(
features,
corr_thresh = 1,
group_neg_corr = TRUE,
corr_method = "spearman"
)
Arguments
features |
a dataframe with each column as a feature for ML |
corr_thresh |
For feature importance, group correlations
above or equal to |
group_neg_corr |
Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)). |
corr_method |
correlation method. Options are the same as those supported by |
Value
Dataframe of correlated features where the columns are feature1, feature2, and the correlation between those two features (anything exceeding corr_thresh).
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Zena Lapp, zenalapp@umich.edu
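A brief sketch using toy data (mirrors the get_binary_corr_mat() example above; not run):
## Not run:
features <- data.frame(
  a = 1:3, b = 2:4, c = c(1, 0, 1),
  d = 5:7, e = c(5, 1, 4)
)
get_corr_feats(features, corr_thresh = 1)
## End(Not run)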
Calculate the difference in the mean of the metric for two groups
Description
Calculate the difference in the mean of the metric for two groups
Usage
get_difference(sub_data, group_name, metric)
Arguments
sub_data |
subset of the merged performance data frame for two groups |
group_name |
name of column with group variable |
metric |
metric to compare |
Value
numeric difference in the average metric between the two groups
Author(s)
Courtney Armour, armourc@umich.edu
Examples
## Not run:
df <- dplyr::tibble(
condition = c("a", "a", "b", "b"),
AUC = c(.2, 0.3, 0.8, 0.9)
)
get_difference(df, "condition", "AUC")
## End(Not run)
Get feature importance using the permutation method
Description
Calculates feature importance using a trained model and test data. Requires the future.apply package.
Usage
get_feature_importance(
trained_model,
test_data,
outcome_colname,
perf_metric_function,
perf_metric_name,
class_probs,
method,
seed = NA,
corr_thresh = 1,
groups = NULL,
nperms = 100,
corr_method = "spearman"
)
Arguments
trained_model |
Trained model from |
test_data |
Held out test data: dataframe of outcome and features. |
outcome_colname |
Column name as a string of the outcome variable
(default |
perf_metric_function |
Function to calculate the performance metric to
be used for cross-validation and test performance. Some functions are
provided by caret (see |
perf_metric_name |
The column name from the output of the function
provided to perf_metric_function that is to be used as the performance metric.
Defaults: binary classification = |
class_probs |
Whether to use class probabilities (TRUE for categorical outcomes, FALSE for numeric outcomes). |
method |
ML method.
Options:
|
seed |
Random seed (default: |
corr_thresh |
For feature importance, group correlations
above or equal to |
groups |
Vector of feature names to group together during permutation.
Each element should be a string with feature names separated by a pipe
character ( |
nperms |
number of permutations to perform (default: |
corr_method |
correlation method. Options are the same as those supported by |
Details
For permutation tests, the p-value is the number of permutation statistics that are greater than the test statistic, divided by the number of permutations. In our case, the permutation statistic is the model performance (e.g. AUROC) after randomizing the order of observations for one feature, and the test statistic is the actual performance on the test data. By default we perform 100 permutations per feature; increasing this will increase the precision of estimating the null distribution, but also increases runtime. The p-value represents the probability of obtaining the actual performance in the event that the null hypothesis is true, where the null hypothesis is that the feature is not important for model performance.
We strongly recommend providing multiple cores to speed up computation time. See our vignette on parallel processing for more details.
Value
Data frame with performance metrics for when each feature (or group of correlated features; feat) is permuted (perf_metric), differences between the actual test performance metric and the permuted performance metric (perf_metric_diff; test minus permuted performance), and the p-value (pvalue: the probability of obtaining the actual performance value under the null hypothesis). Features with a larger perf_metric_diff are more important. The performance metric name (perf_metric_name) and seed (seed) are also returned.
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Zena Lapp, zenalapp@umich.edu
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
# If you called `run_ml()` with `feature_importance = FALSE` (the default),
# you can use `get_feature_importance()` later as long as you have the
# trained model and test data.
results <- run_ml(otu_small, "glmnet", kfold = 2, cv_times = 2)
names(results$trained_model$trainingData)[1] <- "dx"
feat_imp <- get_feature_importance(results$trained_model,
results$trained_model$trainingData,
results$test_data,
"dx",
multiClassSummary,
"AUC",
class_probs = TRUE,
method = "glmnet"
)
# We strongly recommend providing multiple cores to speed up computation time.
# Do this before calling `get_feature_importance()`.
doFuture::registerDoFuture()
future::plan(future::multicore, workers = 2)
# Optionally, you can group features together with a custom grouping
feat_imp <- get_feature_importance(results$trained_model,
results$trained_model$trainingData,
results$test_data,
"dx",
multiClassSummary,
"AUC",
class_probs = TRUE,
method = "glmnet",
groups = c(
"Otu00007", "Otu00008", "Otu00009", "Otu00011", "Otu00012",
"Otu00015", "Otu00016", "Otu00018", "Otu00019", "Otu00020", "Otu00022",
"Otu00023", "Otu00025", "Otu00028", "Otu00029", "Otu00030", "Otu00035",
"Otu00036", "Otu00037", "Otu00038", "Otu00039", "Otu00040", "Otu00047",
"Otu00050", "Otu00052", "Otu00054", "Otu00055", "Otu00056", "Otu00060",
"Otu00003|Otu00002|Otu00005|Otu00024|Otu00032|Otu00041|Otu00053",
"Otu00014|Otu00021|Otu00017|Otu00031|Otu00057",
"Otu00013|Otu00006", "Otu00026|Otu00001|Otu00034|Otu00048",
"Otu00033|Otu00010",
"Otu00042|Otu00004", "Otu00043|Otu00027|Otu00049", "Otu00051|Otu00045",
"Otu00058|Otu00044", "Otu00059|Otu00046"
)
)
# the function can show a progress bar if you have the `progressr` package installed.
## optionally, specify the progress bar format:
progressr::handlers(progressr::handler_progress(
format = ":message :bar :percent | elapsed: :elapsed | eta: :eta",
clear = FALSE,
show_after = 0
))
## tell progressr to always report progress
progressr::handlers(global = TRUE)
## run the function and watch the live progress updates
feat_imp <- get_feature_importance(results$trained_model,
results$trained_model$trainingData,
results$test_data,
"dx",
multiClassSummary,
"AUC",
class_probs = TRUE,
method = "glmnet"
)
# You can specify any correlation method supported by `stats::cor`:
feat_imp <- get_feature_importance(results$trained_model,
results$trained_model$trainingData,
results$test_data,
"dx",
multiClassSummary,
"AUC",
class_probs = TRUE,
method = "glmnet",
corr_method = "pearson"
)
## End(Not run)
Assign features to groups
Description
Assign features to groups
Usage
get_groups_from_clusters(cluster_ids)
Arguments
cluster_ids |
named vector created by |
Value
a vector where each element is a group of correlated features separated by pipes (|)
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
corr_mat <- matrix(
data = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1),
nrow = 4,
dimnames = list(
c("a", "b", "c", "d"),
c("a", "b", "c", "d")
)
)
corr_mat
get_groups_from_clusters(cluster_corr_mat(corr_mat))
## End(Not run)
Get hyperparameter performance metrics
Description
Get hyperparameter performance metrics
Usage
get_hp_performance(trained_model)
Arguments
trained_model |
trained model (e.g. from |
Value
Named list:
- dat: Dataframe of performance metric for each group of hyperparameters.
- params: Hyperparameters tuned.
- metric: Performance metric used.
Author(s)
Zena Lapp, zenalapp@umich.edu
Kelly Sovacool sovacool@umich.edu
Examples
get_hp_performance(otu_mini_bin_results_glmnet$trained_model)
Split hyperparameters dataframe into named lists for each parameter
Description
Using get_hyperparams_list is preferred over this function.
Usage
get_hyperparams_from_df(hyperparams_df, ml_method)
Arguments
hyperparams_df |
dataframe of hyperparameters with columns |
ml_method |
machine learning method |
Value
named list of lists of hyperparameters
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
hparams_df <- dplyr::tibble(
param = c("alpha", "lambda", "lambda"),
value = c(1, 0, 1),
method = rep("glmnet", 3)
)
get_hyperparams_from_df(hparams_df, "glmnet")
## End(Not run)
Set hyperparameters based on ML method and dataset characteristics
Description
For more details see the vignette on hyperparameter tuning.
Usage
get_hyperparams_list(dataset, method)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
method |
ML method.
Options:
|
Value
Named list of hyperparameters.
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
get_hyperparams_list(otu_mini_bin, "rf")
get_hyperparams_list(otu_small, "rf")
get_hyperparams_list(otu_mini_bin, "rpart2")
get_hyperparams_list(otu_small, "rpart2")
Get outcome type.
Description
If the outcome is numeric, the type is continuous. Otherwise, the outcome type is binary if there are only two outcomes or multiclass if there are more than two outcomes.
Usage
get_outcome_type(outcomes_vec)
Arguments
outcomes_vec |
Vector of outcomes. |
Value
Outcome type (continuous, binary, or multiclass).
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
get_outcome_type(c(1, 2, 1))
get_outcome_type(c("a", "b", "b"))
get_outcome_type(c("a", "b", "c"))
Select indices to partition the data into training & testing sets.
Description
Use this function to get the row indices for the training set.
Usage
get_partition_indices(
outcomes,
training_frac = 0.8,
groups = NULL,
group_partitions = NULL
)
Arguments
outcomes |
vector of outcomes |
training_frac |
Fraction of data for training set (default: |
groups |
Vector of groups to keep together when splitting the data into
train and test sets. If the number of groups in the training set is larger
than |
group_partitions |
Specify how to assign |
Details
If groups is NULL, uses createDataPartition. Otherwise, uses create_grouped_data_partition().
Set the seed prior to calling this function if you would like your data partitions to be reproducible (recommended).
Value
Vector of row indices for the training set.
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
training_inds <- get_partition_indices(otu_mini_bin$dx)
train_data <- otu_mini_bin[training_inds, ]
test_data <- otu_mini_bin[-training_inds, ]
Get default performance metric function
Description
Get default performance metric function
Usage
get_perf_metric_fn(outcome_type)
Arguments
outcome_type |
Type of outcome (one of: |
Value
Performance metric function.
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
get_perf_metric_fn("continuous")
get_perf_metric_fn("binary")
get_perf_metric_fn("multiclass")
Get default performance metric name
Description
Get default performance metric name for cross-validation.
Usage
get_perf_metric_name(outcome_type)
Arguments
outcome_type |
Type of outcome (one of: |
Value
Performance metric name.
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
get_perf_metric_name("continuous")
get_perf_metric_name("binary")
get_perf_metric_name("multiclass")
Get model performance metrics as a one-row tibble
Description
Get model performance metrics as a one-row tibble
Usage
get_performance_tbl(
trained_model,
test_data,
outcome_colname,
perf_metric_function,
perf_metric_name,
class_probs,
method,
seed = NA
)
Arguments
trained_model |
Trained model from |
test_data |
Held out test data: dataframe of outcome and features. |
outcome_colname |
Column name as a string of the outcome variable
(default |
perf_metric_function |
Function to calculate the performance metric to
be used for cross-validation and test performance. Some functions are
provided by caret (see |
perf_metric_name |
The column name from the output of the function
provided to perf_metric_function that is to be used as the performance metric.
Defaults: binary classification = |
class_probs |
Whether to use class probabilities (TRUE for categorical outcomes, FALSE for numeric outcomes). |
method |
ML method.
Options:
|
seed |
Random seed (default: |
Value
A one-row tibble with a column for the cross-validation performance, columns for each of the performance metrics for the test data, plus the method, and seed.
Author(s)
Kelly Sovacool, sovacool@umich.edu
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
results <- run_ml(otu_small, "glmnet", kfold = 2, cv_times = 2)
names(results$trained_model$trainingData)[1] <- "dx"
get_performance_tbl(results$trained_model, results$test_data,
"dx",
multiClassSummary, "AUC",
class_probs = TRUE,
method = "glmnet"
)
## End(Not run)
Get seeds for caret::trainControl()
Description
Adapted from this Stack Overflow post and the trainControl documentation.
Usage
get_seeds_trainControl(hyperparams_list, kfold, cv_times, ncol_train)
Arguments
hyperparams_list |
Named list of lists of hyperparameters. |
kfold |
Fold number for k-fold cross-validation (default: |
cv_times |
Number of cross-validation partitions to create (default: |
ncol_train |
number of columns in training data |
Value
seeds for caret::trainControl()
Author(s)
Kelly Sovacool, sovacool@umich.edu
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
get_seeds_trainControl(
get_hyperparams_list(otu_small, "glmnet"),
5, 100, 60
)
## End(Not run)
Generate the tuning grid for tuning hyperparameters
Description
Generate the tuning grid for tuning hyperparameters
Usage
get_tuning_grid(hyperparams_list, method)
Arguments
hyperparams_list |
Named list of lists of hyperparameters. |
method |
ML method.
Options:
|
Value
The tuning grid.
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Kelly Sovacool, sovacool@umich.edu
Examples
ml_method <- "glmnet"
hparams_list <- get_hyperparams_list(otu_small, ml_method)
get_tuning_grid(hparams_list, ml_method)
Group correlated features
Description
Group correlated features
Usage
group_correlated_features(
features,
corr_thresh = 1,
group_neg_corr = TRUE,
corr_method = "spearman"
)
Arguments
features |
a dataframe with each column as a feature for ML |
corr_thresh |
For feature importance, group correlations
above or equal to |
group_neg_corr |
Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)). |
corr_method |
correlation method. Options are the same as those supported by |
Value
vector where each element is a group of correlated features separated by pipes (|)
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
features <- data.frame(
a = 1:3, b = 2:4, c = c(1, 0, 1),
d = (5:7), e = c(5, 1, 4), f = c(-1, 0, -1)
)
group_correlated_features(features)
Check whether a numeric vector contains whole numbers.
Description
Because is.integer
checks for the class, not whether the number is an
integer in the mathematical sense.
This code was copy-pasted from the is.integer
docs.
Usage
is_whole_number(x, tol = .Machine$double.eps^0.5)
Arguments
x |
numeric vector |
tol |
tolerance (default: |
Value
logical vector
Examples
## Not run:
is_whole_number(c(1, 2, 3))
is.integer(c(1, 2, 3))
is_whole_number(c(1.0, 2.0, 3.0))
is_whole_number(1.2)
## End(Not run)
Whether groups can be kept together in partitions during cross-validation
Description
Whether groups can be kept together in partitions during cross-validation
Usage
keep_groups_in_cv_partitions(groups, group_partitions, kfold)
Arguments
groups |
Vector of groups to keep together when splitting the data into
train and test sets. If the number of groups in the training set is larger
than |
group_partitions |
Specify how to assign |
kfold |
Fold number for k-fold cross-validation (default: |
Value
TRUE if possible, FALSE otherwise
Author(s)
Kelly Sovacool, sovacool@umich.edu
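A hedged sketch of a direct call (this is an internal check used by the pipeline; the grouping vector below is made up):
## Not run:
groups <- sample(LETTERS[1:8], size = 100, replace = TRUE)
keep_groups_in_cv_partitions(groups, group_partitions = NULL, kfold = 5)
## End(Not run)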
Get the lower and upper bounds for an empirical confidence interval
Description
Get the lower and upper bounds for an empirical confidence interval
Usage
lower_bound(x, alpha)
upper_bound(x, alpha)
Arguments
x |
vector of test statistics, such as from permutation tests or bootstraps |
alpha |
alpha level for the confidence interval
(default: |
Value
the value of the lower or upper bound for the confidence interval
Functions
-
lower_bound()
: Get the lower bound for an empirical confidence interval -
upper_bound()
: Get the upper bound for an empirical confidence interval
Examples
## Not run:
x <- 1:10000
lower_bound(x, 0.05)
upper_bound(x, 0.05)
## End(Not run)
Mutate all columns with utils::type.convert().
Description
Turns factors into characters and numerics where possible.
Usage
mutate_all_types(dat)
Arguments
dat |
data.frame to convert |
Value
data.frame with no factors
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
dat <- data.frame(
c1 = as.factor(c("a", "b", "c")),
c2 = as.factor(1:3)
)
class(dat$c1)
class(dat$c2)
dat <- mutate_all_types(dat)
class(dat$c1)
class(dat$c2)
## End(Not run)
Mini OTU abundance dataset - preprocessed
Description
This is the result of running preprocess_data("otu_mini_bin")
Usage
otu_data_preproc
Format
An object of class list
of length 3.
Mini OTU abundance dataset
Description
A dataset containing relative abundances of OTUs for human stool samples with a binary outcome, dx. This is a subset of otu_small.
Usage
otu_mini_bin
Format
A data frame.
The dx column is the diagnosis: healthy or cancerous (colorectal). All other columns are OTU relative abundances.
Results from running the pipeline with L2 logistic regression on otu_mini_bin with feature importance and grouping
Description
Results from running the pipeline with L2 logistic regression on otu_mini_bin with feature importance and grouping
Usage
otu_mini_bin_results_glmnet
Format
An object of class list
of length 4.
Results from running the pipeline with random forest on otu_mini_bin
Description
Results from running the pipeline with random forest on otu_mini_bin
Usage
otu_mini_bin_results_rf
Format
An object of class list
of length 4.
Results from running the pipeline with rpart2 on otu_mini_bin
Description
Results from running the pipeline with rpart2 on otu_mini_bin
Usage
otu_mini_bin_results_rpart2
Format
An object of class list
of length 4.
Results from running the pipeline with svmRadial on otu_mini_bin
Description
Results from running the pipeline with svmRadial on otu_mini_bin
Usage
otu_mini_bin_results_svmRadial
Format
An object of class list
of length 4.
Results from running the pipeline with xgbTree on otu_mini_bin
Description
Results from running the pipeline with xgbTree on otu_mini_bin
Usage
otu_mini_bin_results_xgbTree
Format
An object of class list
of length 4.
Results from running the pipeline with glmnet on otu_mini_bin with Otu00001 as the outcome
Description
Results from running the pipeline with glmnet on otu_mini_bin with Otu00001 as the outcome
Usage
otu_mini_cont_results_glmnet
Format
An object of class list
of length 4.
Results from running the pipeline with glmnet on otu_mini_bin with Otu00001 as the outcome column, using a custom train control scheme that does not perform cross-validation
Description
Results from running the pipeline with glmnet on otu_mini_bin with Otu00001 as the outcome column, using a custom train control scheme that does not perform cross-validation
Usage
otu_mini_cont_results_nocv
Format
An object of class list
of length 4.
Cross validation on train_data_mini with grouped features.
Description
Cross validation on train_data_mini with grouped features.
Usage
otu_mini_cv
Format
An object of class list
of length 27.
Mini OTU abundance dataset with 3 categorical variables
Description
A dataset containing relative abundances of OTUs for human stool samples
Usage
otu_mini_multi
Format
A data frame.
The dx column is the colorectal cancer diagnosis: adenoma, carcinoma, normal. All other columns are OTU relative abundances.
Groups for otu_mini_multi
Description
Groups for otu_mini_multi
Usage
otu_mini_multi_group
Format
An object of class character
of length 490.
Results from running the pipeline with glmnet on otu_mini_multi for multiclass outcomes
Description
Results from running the pipeline with glmnet on otu_mini_multi for multiclass outcomes
Usage
otu_mini_multi_results_glmnet
Format
An object of class list
of length 4.
Small OTU abundance dataset
Description
A dataset containing relative abundances of 60 OTUs for 60 human stool samples. This is a subset of the data provided in extdata/otu_large.csv, which was used in Topçuoğlu et al. 2020.
Usage
otu_small
Format
A data frame with 60 rows and 61 variables.
The dx column is the diagnosis: healthy or cancerous (colorectal). All other columns are OTU relative abundances.
Update progress if the progress bar is not NULL.
Description
This allows for flexible code that only initializes a progress bar if the progressr package is installed.
Usage
pbtick(pb, message = NULL)
Arguments
pb |
a progress bar created with |
message |
optional message to report (default: |
Author(s)
Kelly Sovacool sovacool@umich.edu
Examples
## Not run:
f <- function() {
if (isTRUE(check_packages_installed("progressr"))) {
pb <- progressr::progressor(steps = 5, message = "looping")
} else {
pb <- NULL
}
for (i in 1:5) {
pbtick(pb)
Sys.sleep(0.5)
}
}
progressr::with_progress(
f()
)
## End(Not run)
Calculate a permuted p-value comparing two models
Description
Calculate a permuted p-value comparing two models
Usage
permute_p_value(
merged_data,
metric,
group_name,
group_1,
group_2,
nperm = 10000
)
Arguments
merged_data |
the concatenated performance data from |
metric |
metric to compare, must be numeric |
group_name |
column with group variables to compare |
group_1 |
name of one group to compare |
group_2 |
name of other group to compare |
nperm |
number of permutations, default=10000 |
Value
numeric p-value comparing two models
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Courtney R Armour, armourc@umich.edu
Examples
df <- dplyr::tibble(
model = c("rf", "rf", "glmnet", "glmnet", "svmRadial", "svmRadial"),
AUC = c(.2, 0.3, 0.8, 0.9, 0.85, 0.95)
)
set.seed(123)
permute_p_value(df, "AUC", "model", "rf", "glmnet", nperm = 100)
Plot hyperparameter performance metrics
Description
Plot hyperparameter performance metrics
Usage
plot_hp_performance(dat, param_col, metric_col)
Arguments
dat |
dataframe of hyperparameters and performance metric (e.g. from |
param_col |
hyperparameter to be plotted. must be a column in |
metric_col |
performance metric. must be a column in |
Value
ggplot of hyperparameter performance.
Author(s)
Zena Lapp, zenalapp@umich.edu
Kelly Sovacool sovacool@umich.edu
Examples
# plot for a single `run_ml()` call
hp_metrics <- get_hp_performance(otu_mini_bin_results_glmnet$trained_model)
hp_metrics
plot_hp_performance(hp_metrics$dat, lambda, AUC)
## Not run:
# plot for multiple `run_ml()` calls
results <- lapply(seq(100, 102), function(seed) {
run_ml(otu_small, "glmnet", seed = seed)
})
models <- lapply(results, function(x) x$trained_model)
hp_metrics <- combine_hp_performance(models)
plot_hp_performance(hp_metrics$dat, lambda, AUC)
## End(Not run)
Plot ROC and PRC curves
Description
Plot ROC and PRC curves
Usage
plot_mean_roc(dat, ribbon_fill = "#C6DBEF", line_color = "#08306B")
plot_mean_prc(
dat,
baseline_precision = NULL,
ycol = mean_precision,
ribbon_fill = "#C7E9C0",
line_color = "#00441B"
)
Arguments
dat |
sensitivity, specificity, and precision data calculated by |
ribbon_fill |
ribbon fill color (default: "#D9D9D9") |
line_color |
line color (default: "#000000") |
baseline_precision |
baseline precision from |
ycol |
column for the y axis (Default: |
Functions
- plot_mean_roc(): Plot mean sensitivity over specificity
- plot_mean_prc(): Plot mean precision over recall
Author(s)
Courtney Armour
Kelly Sovacool sovacool@umich.edu
Examples
## Not run:
library(dplyr)
# get performance for multiple models
get_sensspec_seed <- function(seed) {
ml_result <- run_ml(otu_mini_bin, "glmnet", seed = seed)
sensspec <- calc_model_sensspec(
ml_result$trained_model,
ml_result$test_data,
"dx"
) %>%
mutate(seed = seed)
return(sensspec)
}
sensspec_dat <- purrr::map_dfr(seq(100, 102), get_sensspec_seed)
# plot ROC & PRC
sensspec_dat %>%
calc_mean_roc() %>%
plot_mean_roc()
baseline_prec <- calc_baseline_precision(otu_mini_bin, "dx", "cancer")
sensspec_dat %>%
calc_mean_prc() %>%
plot_mean_prc(baseline_precision = baseline_prec)
## End(Not run)
Plot performance metrics for multiple ML runs with different parameters
Description
ggplot2 is required to use this function.
Usage
plot_model_performance(performance_df)
Arguments
performance_df |
dataframe of performance results from multiple calls to |
Value
A ggplot2 plot of performance.
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
# call `run_ml()` multiple times with different seeds
results_lst <- lapply(seq(100, 104), function(seed) {
run_ml(otu_small, "glmnet", seed = seed)
})
# extract and combine the performance results
perf_df <- lapply(results_lst, function(result) {
result[["performance"]]
}) %>%
dplyr::bind_rows()
# plot the performance results
p <- plot_model_performance(perf_df)
# call `run_ml()` with different ML methods
param_grid <- expand.grid(
seeds = seq(100, 104),
methods = c("glmnet", "rf")
)
results_mtx <- mapply(
function(seed, method) {
run_ml(otu_mini_bin, method, seed = seed, kfold = 2)
},
param_grid$seeds, param_grid$methods
)
# extract and combine the performance results
perf_df2 <- dplyr::bind_rows(results_mtx["performance", ])
# plot the performance results
p <- plot_model_performance(perf_df2)
# you can continue adding layers to customize the plot
p +
theme_classic() +
scale_color_brewer(palette = "Dark2") +
coord_flip()
## End(Not run)
Preprocess data prior to running machine learning
Description
Function to preprocess your data for input into run_ml().
Usage
preprocess_data(
dataset,
outcome_colname,
method = c("center", "scale"),
remove_var = "nzv",
collapse_corr_feats = TRUE,
to_numeric = TRUE,
group_neg_corr = TRUE,
prefilter_threshold = 1
)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
outcome_colname |
Column name as a string of the outcome variable
(default NULL; the first column will be chosen automatically). |
method |
Methods to preprocess the data, described in caret::preProcess() (default: c("center", "scale")). |
remove_var |
Whether to remove variables with near-zero variance
('nzv'; default), zero variance ('zv'), or none (NULL). |
collapse_corr_feats |
Whether to keep only one of perfectly correlated features. |
to_numeric |
Whether to change features to numeric where possible. |
group_neg_corr |
Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)). |
prefilter_threshold |
Remove features which only have non-zero & non-NA values in N rows or fewer (default: 1). Set this to -1 to keep all columns at this step. This step will also be skipped if to_numeric is set to FALSE. |
Value
Named list including:
- dat_transformed: Preprocessed data.
- grp_feats: If features were grouped together, a named list of the features corresponding to each group.
- removed_feats: Any features that were removed during preprocessing (e.g. because there was zero variance or near-zero variance for those features).
If the progressr
package is installed, a progress bar with time elapsed
and estimated time to completion can be displayed.
More details
See the preprocessing vignette for more details.
Note that if any values in outcome_colname contain spaces, they will be converted to underscores for compatibility with caret.
Author(s)
Zena Lapp, zenalapp@umich.edu
Kelly Sovacool, sovacool@umich.edu
Examples
preprocess_data(mikropml::otu_small, "dx")
# the function can show a progress bar if you have the progressr package installed
## optionally, specify the progress bar format
progressr::handlers(progressr::handler_progress(
format = ":message :bar :percent | elapsed: :elapsed | eta: :eta",
clear = FALSE,
show_after = 0
))
## tell progressor to always report progress
## Not run:
progressr::handlers(global = TRUE)
## run the function and watch the live progress updates
dat_preproc <- preprocess_data(mikropml::otu_small, "dx")
## End(Not run)
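Beyond the call above, a minimal sketch of feeding the preprocessed output into run_ml() (the variable names here are illustrative, not part of the package):
## Not run:
dat_preproc <- preprocess_data(mikropml::otu_small, "dx")
# the transformed features plus the outcome column, ready for run_ml()
dat_ready <- dat_preproc$dat_transformed
# any features dropped during preprocessing
dat_preproc$removed_feats
results <- run_ml(dat_ready, "glmnet", outcome_colname = "dx", seed = 2019)
## End(Not run)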
Process categorical features
Description
Process categorical features
Usage
process_cat_feats(features, progbar = NULL)
Arguments
features |
dataframe of features for machine learning |
progbar |
optional progress bar (default: NULL) |
Value
list of two dataframes: categorical (processed) and continuous features (unprocessed)
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
process_cat_feats(mikropml::otu_small[, 2:ncol(otu_small)])
## End(Not run)
Preprocess continuous features
Description
Preprocess continuous features
Usage
process_cont_feats(features, method)
Arguments
features |
Dataframe of features for machine learning |
method |
Methods to preprocess the data, described in caret::preProcess() (default: c("center", "scale")). |
Value
dataframe of preprocessed features
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
process_cont_feats(mikropml::otu_small[, 2:ncol(otu_small)], c("center", "scale"))
## End(Not run)
Process features with no variation
Description
Process features with no variation
Usage
process_novar_feats(features, progbar = NULL)
Arguments
features |
dataframe of features for machine learning |
progbar |
optional progress bar (default: NULL) |
Value
list of two dataframes: features with variability (unprocessed) and without (processed)
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
process_novar_feats(mikropml::otu_small[, 2:ncol(otu_small)])
## End(Not run)
Call sort()
with method = 'radix'
Description
The base sort() function uses a different method depending on your locale. However, the order from the radix method is always stable.
Usage
radix_sort(...)
Arguments
... |
All arguments forwarded to sort(). |
Details
See https://stackoverflow.com/questions/42272119/r-cmd-check-fails-devtoolstest-works-fine.
stringr::str_sort() solves this problem by giving its locale parameter a default value, but I don't want to add that as another dependency.
Value
Whatever you passed in, now in a stable sorted order regardless of your locale.
Author(s)
Kelly Sovacool sovacool@umich.edu
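No example ships with this helper; the following is a minimal sketch, assuming access to the internal function (e.g. via mikropml:::), since it is not part of the exported API:
## Not run:
# sort with a locale-independent, stable order
mikropml:::radix_sort(c("b", "A", "a", "C"))
# additional arguments are forwarded to sort()
mikropml:::radix_sort(c(10, 2, 33), decreasing = TRUE)
## End(Not run)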
Randomize feature order to eliminate any position-dependent effects
Description
Randomize feature order to eliminate any position-dependent effects
Usage
randomize_feature_order(dataset, outcome_colname)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
outcome_colname |
Column name as a string of the outcome variable
(default NULL; the first column will be chosen automatically). |
Value
Dataset with feature order randomized.
Author(s)
Nick Lesniak, nlesniak@umich.edu
Kelly Sovacool, sovacool@umich.edu
Examples
dat <- data.frame(
outcome = c("1", "2", "3"),
a = 4:6, b = 7:9, c = 10:12, d = 13:15
)
randomize_feature_order(dat, "outcome")
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- caret: contr.ltfr
- dplyr
- rlang
Remove columns appearing in only threshold
row(s) or fewer.
Description
Removes columns which only have non-zero & non-NA values in threshold
row(s) or fewer.
Usage
remove_singleton_columns(dat, threshold = 1)
Arguments
dat |
dataframe |
threshold |
Number of rows. If a column only has non-zero & non-NA values
in threshold row(s) or fewer, it will be removed (default: 1). |
Value
dataframe without singleton columns
Author(s)
Kelly Sovacool, sovacool@umich.edu
Courtney Armour
Examples
remove_singleton_columns(data.frame(a = 1:3, b = c(0, 1, 0), c = 4:6))
remove_singleton_columns(data.frame(a = 1:3, b = c(0, 1, 0), c = 4:6), threshold = 0)
remove_singleton_columns(data.frame(a = 1:3, b = c(0, 1, NA), c = 4:6))
remove_singleton_columns(data.frame(a = 1:3, b = c(1, 1, 1), c = 4:6))
Replace spaces in all elements of a character vector with underscores
Description
Replace spaces in all elements of a character vector with underscores
Usage
replace_spaces(x, new_char = "_")
Arguments
x |
a character vector |
new_char |
the character to replace spaces (default: "_") |
Value
character vector with all spaces replaced with new_char
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
dat <- data.frame(
dx = c("outcome 1", "outcome 2", "outcome 1"),
a = 1:3, b = c(5, 7, 1)
)
dat$dx <- replace_spaces(dat$dx)
dat
Remove missing outcome values
Description
Remove missing outcome values
Usage
rm_missing_outcome(dataset, outcome_colname)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
outcome_colname |
Column name as a string of the outcome variable
(default NULL; the first column will be chosen automatically). |
Value
dataset with no missing outcomes
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
rm_missing_outcome(mikropml::otu_mini_bin, "dx")
test_df <- mikropml::otu_mini_bin
test_df[1:100, "dx"] <- NA
rm_missing_outcome(test_df, "dx")
## End(Not run)
Run the machine learning pipeline
Description
This function splits the data set into a train & test set,
trains machine learning (ML) models using k-fold cross-validation,
evaluates the best model on the held-out test set,
and optionally calculates feature importance using the framework
outlined in Topçuoğlu et al. 2020 (doi:10.1128/mBio.00434-20).
Required inputs are a data frame (must contain an outcome variable and all
other columns as features) and the ML method.
See vignette('introduction')
for more details.
Usage
run_ml(
dataset,
method,
outcome_colname = NULL,
hyperparameters = NULL,
find_feature_importance = FALSE,
calculate_performance = TRUE,
kfold = 5,
cv_times = 100,
cross_val = NULL,
training_frac = 0.8,
perf_metric_function = NULL,
perf_metric_name = NULL,
groups = NULL,
group_partitions = NULL,
corr_thresh = 1,
seed = NA,
...
)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
method |
ML method.
Options: "glmnet", "rf", "rpart2", "svmRadial", "xgbTree". |
outcome_colname |
Column name as a string of the outcome variable
(default NULL; the first column will be chosen automatically). |
hyperparameters |
Dataframe of hyperparameters
(default NULL; sensible defaults will be chosen automatically). |
find_feature_importance |
Run permutation importance (default: FALSE). |
calculate_performance |
Whether to calculate performance metrics (default: TRUE). |
kfold |
Fold number for k-fold cross-validation (default: 5). |
cv_times |
Number of cross-validation partitions to create (default: 100). |
cross_val |
a custom cross-validation scheme from caret::trainControl() (default: NULL). |
training_frac |
Fraction of data for training set (default: 0.8). |
perf_metric_function |
Function to calculate the performance metric to
be used for cross-validation and test performance. Some functions are
provided by caret (see caret::defaultSummary()). |
perf_metric_name |
The column name from the output of the function
provided to perf_metric_function that is to be used as the performance metric.
Defaults: binary classification = "AUC", multi-class classification = "logLoss", regression = "RMSE". |
groups |
Vector of groups to keep together when splitting the data into
train and test sets. If the number of groups in the training set is larger
than kfold, the groups will also be kept together for cross-validation. |
group_partitions |
Specify how to assign groups to the training and testing partitions (default: NULL). |
corr_thresh |
For feature importance, group correlations
above or equal to corr_thresh (range 0 to 1; default: 1). |
seed |
Random seed (default: NA). Your results will only be reproducible if you set a seed. |
... |
All additional arguments are passed on to caret::train(), such as case weights via the weights argument or ntree for rf models. |
Value
Named list with results:
- trained_model: Output of caret::train(), including the best model.
- test_data: Part of the data that was used for testing.
- performance: Data frame of performance metrics. The first column is the cross-validation performance metric, and the last two columns are the ML method used and the seed (if one was set), respectively. All other columns are performance metrics calculated on the test data. This contains only one row, so you can easily combine performance data frames from multiple calls to run_ml() (see vignette("parallel")).
- feature_importance: If feature importances were calculated, a data frame where each row is a feature or correlated group. The columns are the performance metric of the permuted data, the difference between the true performance metric and the performance metric of the permuted data (true - permuted), the feature name, the ML method, the performance metric name, and the seed (if provided). For AUC and RMSE, the higher perf_metric_diff is, the more important that feature is for predicting the outcome. For log loss, the lower perf_metric_diff is, the more important that feature is for predicting the outcome.
More details
For more details, please see the vignettes.
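As a quick orientation to the components listed above, a minimal sketch using the bundled otu_mini_bin_results_glmnet object as a stand-in for any run_ml() result:
## Not run:
result <- otu_mini_bin_results_glmnet
# the caret::train() object, including the best model
result$trained_model
# held-out test data
head(result$test_data)
# one-row data frame of performance metrics
result$performance
# feature importances (NULL unless find_feature_importance = TRUE was used)
result$feature_importance
## End(Not run)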
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Zena Lapp, zenalapp@umich.edu
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
# regression
run_ml(otu_small, "glmnet",
seed = 2019
)
# random forest w/ feature importance
run_ml(otu_small, "rf",
outcome_colname = "dx",
find_feature_importance = TRUE
)
# custom cross validation & hyperparameters
run_ml(otu_mini_bin[, 2:11],
"glmnet",
outcome_colname = "Otu00001",
seed = 2019,
hyperparameters = list(lambda = c(1e-04), alpha = 0),
cross_val = caret::trainControl(method = "none"),
calculate_performance = FALSE
)
## End(Not run)
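The groups argument expects one group label per row of the dataset; a hedged sketch with a made-up grouping vector:
## Not run:
# keep hypothetical sample groups together when splitting into train & test sets
grps <- sample(LETTERS[1:8], nrow(otu_mini_bin), replace = TRUE)
run_ml(otu_mini_bin, "glmnet", groups = grps, seed = 2019)
## End(Not run)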
Use future apply if available
Description
Use future apply if available
Usage
select_apply(fun = "apply")
Arguments
fun |
apply function to use (apply, lapply, sapply, etc.) |
Value
output of apply function
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
select_apply(fun = "sapply")
## End(Not run)
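To make the behavior concrete, a minimal sketch, assuming (as the name and usage suggest) that the helper returns the chosen apply-style function, preferring the future.apply variant when that package is available:
## Not run:
# returns future.apply::future_sapply() when available, otherwise base sapply()
sapply_fun <- select_apply(fun = "sapply")
sapply_fun(1:3, function(x) x * 2)
## End(Not run)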
Set hyperparameters for regression models for use with glmnet
Description
Alpha is set to 0
for ridge (L2). An alpha of 1
would make it lasso (L1).
Usage
set_hparams_glmnet()
Value
default lambda & alpha values
Author(s)
Zena Lapp, zenalapp@umich.edu
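A minimal sketch of inspecting the defaults, assuming (per the Value section) that a named list with lambda and alpha elements is returned:
## Not run:
hparams <- set_hparams_glmnet()
hparams$alpha  # 0, i.e. ridge (L2) regularization
hparams$lambda # candidate lambda values for tuning
## End(Not run)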
Set hyperparameters for random forest models
Description
Set hyperparameters for random forest models
Usage
set_hparams_rf(n_features)
Arguments
n_features |
number of features in the dataset |
Value
named list of hyperparameters
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
set_hparams_rf(16)
set_hparams_rf(2000)
set_hparams_rf(1)
## End(Not run)
Set hyperparameters for decision tree models
Description
Set hyperparameters for decision tree models
Usage
set_hparams_rpart2(n_samples)
Arguments
n_samples |
number of samples in the dataset |
Value
named list of hyperparameters
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
set_hparams_rpart2(100)
set_hparams_rpart2(20)
## End(Not run)
Set hyperparameters for SVM with radial kernel
Description
Set hyperparameters for SVM with radial kernel
Usage
set_hparams_svmRadial()
Value
named list of hyperparameters
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
set_hparams_svmRadial()
## End(Not run)
Set hyperparameters for xgboost models
Description
Set hyperparameters for xgboost models
Usage
set_hparams_xgbTree(n_samples)
Arguments
n_samples |
number of samples in the dataset |
Value
named list of hyperparameters
Author(s)
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
set_hparams_xgbTree(100)
## End(Not run)
Get plot layers shared by plot_mean_roc
and plot_mean_prc
Description
Get plot layers shared by plot_mean_roc
and plot_mean_prc
Usage
shared_ggprotos(ribbon_fill = "#D9D9D9", line_color = "#000000")
Arguments
ribbon_fill |
ribbon fill color (default: "#D9D9D9") |
line_color |
line color (default: "#000000") |
Value
list of ggproto objects to add to a ggplot
Author(s)
Kelly Sovacool sovacool@umich.edu
Shuffle the rows in a column
Description
Shuffle the rows in a column
Usage
shuffle_group(dat, col_name)
Arguments
dat |
a data frame containing col_name |
col_name |
column name to shuffle |
Value
dat with the rows of col_name shuffled
Author(s)
Courtney R Armour, armourc@umich.edu
Examples
## Not run:
set.seed(123)
df <- dplyr::tibble(
condition = c("a", "a", "b", "b"),
AUC = c(.2, 0.3, 0.8, 0.9)
)
shuffle_group(df, "condition")
## End(Not run)
Split dataset into outcome and features
Description
Split dataset into outcome and features
Usage
split_outcome_features(dataset, outcome_colname)
Arguments
dataset |
Data frame with an outcome variable and other columns as features. |
outcome_colname |
Column name as a string of the outcome variable
(default NULL; the first column will be chosen automatically). |
Value
list of length two: outcome, features (as dataframes)
Examples
## Not run:
split_outcome_features(mikropml::otu_mini_bin, "dx")
## End(Not run)
Tidy the performance dataframe
Description
Used by plot_model_performance().
Usage
tidy_perf_data(performance_df)
Arguments
performance_df |
dataframe of performance results from multiple calls to run_ml() |
Value
Tidy dataframe with model performance metrics.
Author(s)
Begüm Topçuoğlu, topcuoglu.begum@gmail.com
Kelly Sovacool, sovacool@umich.edu
Examples
## Not run:
# call `run_ml()` multiple times with different seeds
results_lst <- lapply(seq(100, 104), function(seed) {
run_ml(otu_small, "glmnet", seed = seed)
})
# extract and combine the performance results
perf_df <- lapply(results_lst, function(result) {
result[["performance"]]
}) %>%
dplyr::bind_rows()
# make it pretty!
tidy_perf_data(perf_df)
## End(Not run)
Train model using caret::train().
Description
Train model using caret::train().
Usage
train_model(
train_data,
outcome_colname,
method,
cv,
perf_metric_name,
tune_grid,
...
)
Arguments
train_data |
Training data. Expected to be a subset of the full dataset. |
outcome_colname |
Column name as a string of the outcome variable
(default NULL; the first column will be chosen automatically). |
method |
ML method.
Options: "glmnet", "rf", "rpart2", "svmRadial", "xgbTree". |
cv |
Cross-validation caret scheme from define_cv(). |
perf_metric_name |
The column name from the output of the function
provided to perf_metric_function that is to be used as the performance metric.
Defaults: binary classification = "AUC", multi-class classification = "logLoss", regression = "RMSE". |
tune_grid |
Tuning grid from get_tuning_grid(). |
... |
All additional arguments are passed on to caret::train(). |
Value
Trained model from caret::train().
Author(s)
Zena Lapp, zenalapp@umich.edu
Examples
## Not run:
training_data <- otu_mini_bin_results_glmnet$trained_model$trainingData %>%
dplyr::rename(dx = .outcome)
method <- "rf"
hyperparameters <- get_hyperparams_list(otu_mini_bin, method)
cross_val <- define_cv(training_data,
"dx",
hyperparameters,
perf_metric_function = caret::multiClassSummary,
class_probs = TRUE,
cv_times = 2
)
tune_grid <- get_tuning_grid(hyperparameters, method)
rf_model <- train_model(
training_data,
"dx",
method,
cross_val,
"AUC",
tune_grid,
ntree = 1000
)
rf_model$results %>% dplyr::select(mtry, AUC, prAUC)
## End(Not run)