Title: | A Machine-Learning Based Tool to Automate the Identification of Biological Database IDs |
Version: | 1.0.4 |
Description: | The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. The inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed 'MantaID', a data-driven, machine-learning based approach that automates identifying IDs on a large scale. The 'MantaID' model's prediction accuracy was proven to be 99%, and it correctly and effectively predicted 100,000 ID entries within two minutes. 'MantaID' supports the discovery and exploitation of ID patterns from large quantities of databases. (e.g., up to 542 biological databases). An easy-to-use freely available open-source software R package, a user-friendly web application, and API were also developed for 'MantaID' to improve applicability. To our knowledge, 'MantaID' is the first tool that enables an automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and can therefore be used as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Depends: | R (≥ 4.4.0),biomaRt,caret,keras,mlr3tuning,mlr3 |
LazyData: | true |
URL: | https://molaison.github.io/MantaID/ |
Imports: | ggplot2,data.table,magrittr,stringr,tibble,tidyr,tidyselect,ggcorrplot,reshape2,scutr,paradox,RColorBrewer,purrr,dplyr |
Suggests: | mlr3hyperband,mlr3learners,ranger,rpart,xgboost |
NeedsCompilation: | no |
Packaged: | 2024-09-09 07:07:26 UTC; pc |
Author: | Zhengpeng Zeng |
Maintainer: | Zhengpeng Zeng <molaison@foxmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-09-09 07:50:02 UTC |
ID example dataset.
Description
ID example dataset.
Usage
Example
Format
A tibble with 5000 rows and 2 variables.
- ID
A identifier character.
- class
The database the ID belongs to.
A wrapper function that executes MantaID workflow.
Description
A wrapper function that executes MantaID workflow.
Usage
mi(
cores = NULL,
levels = c("*", 0:9, letters, LETTERS, "_", ".", "-", " ", "/", "\\", ":"),
ratio = 0.3,
para_blc = FALSE,
model_path = NULL,
batch_size = 128,
epochs = 64,
validation_split = 0.3,
graph_path = NULL
)
Arguments
cores |
The number of cores used when balancing data. |
levels |
The vector that includes all the single characters occurred in IDs. |
ratio |
The ratio of the test set. |
para_blc |
A logical value whether using parallel computing when balancing data. |
model_path |
The path to save models. |
batch_size |
The batch size of deep learning model fitting. |
epochs |
The epochs of deep learning model fitting. |
validation_split |
The validation ratio of deep learning model fitting. |
graph_path |
The path to save graphs. |
Value
The list of models and graphs.
Data balance. Most classes adopt random undersampling, while a few classes adopt smote method to oversample to obtain relatively balanced data.
Description
Data balance. Most classes adopt random undersampling, while a few classes adopt smote method to oversample to obtain relatively balanced data.
Usage
mi_balance_data(data, ratio = 0.3, parallel = FALSE)
Arguments
data |
A data frame. Except for class column, all are numeric types. |
ratio |
Numeric between 0 and 1. The percent of the test set split from data. |
parallel |
Logical. |
Value
A list contains a train set and a test set.
Reshape data and delete meaningless rows.
Description
Reshape data and delete meaningless rows.
Usage
mi_clean_data(data, cols = everything(), placeholder = c("-"))
Arguments
data |
A dataframe or tibble or data.table or matrix. Names of the column will be regard as the class of ID included in column. |
cols |
Character vectors. Columns of |
placeholder |
Character vectors. IDs included in |
Value
A tibble with two columns("ID" and "class")
Examples
data <- tibble::tibble(
"class1" = c("A", "B", "C", "D"),
"class2" = c("E", "F", "G", "H"),
"class3" = c("L", "M", "-", "O")
)
mi_clean_data(data)
ID-related datasets in biomart.
Description
ID-related datasets in biomart.
Usage
mi_data_attributes
Format
A dataframe with 65 variables and 3 variables.
- name
The name of dataset.
- description
Description of dataset.
- page
collection of attributes.
Processed ID data.
Description
Processed ID data.
Usage
mi_data_procID
Format
A tibble dataframe with 5000 rows and 21 variables.
- pos1 to pos20
Splited ID.
- class
The databases that ID belongs to.
ID dataset for testing.
Description
ID dataset for testing.
Usage
mi_data_rawID
Format
A tibble with 5000 rows and 2 variables.
- ID
A identifier character.
- class
The database the ID belongs to.
Performing feature selection in a automatic way based on correlation and feature importance.
Description
Performing feature selection in a automatic way based on correlation and feature importance.
Usage
mi_filter_feat(data, cor_thresh = 0.7, imp_thresh = 0.99, union = FALSE)
Arguments
data |
The data frame returned by |
cor_thresh |
The threshold set for Pearson correlation. If correlation value is over this threshold, the two features will be viewed as redundant and one of them will be removed. |
imp_thresh |
The threshold set for feature importance. The last several features with the lowest importance will be removed if remained importance lower than |
union |
The method for combining the decisions of correlation method and importance method. If |
Value
The names of the features that should be removed.
Get ID data from the Biomart
database using attributes
.
Description
Get ID data from the Biomart
database using attributes
.
Usage
mi_get_ID(
attributes,
biomart = "genes",
dataset = "hsapiens_gene_ensembl",
mirror = "asia"
)
Arguments
attributes |
A dataframe.The information we want to retrieve.Use |
biomart |
BioMart database name you want to connect to. Use |
dataset |
Datasets of the selected BioMart database. |
mirror |
Specify an Ensembl mirror to connect to. |
Value
A tibble
dataframe.
Get ID attributes from the Biomart
database.
Description
Get ID attributes from the Biomart
database.
Usage
mi_get_ID_attr(
biomart = "genes",
dataset = "hsapiens_gene_ensembl",
mirror = "asia"
)
Arguments
biomart |
BioMart database name you want to connect to.Use |
dataset |
Datasets of the selected BioMart database. |
mirror |
Specify an Ensembl mirror to connect to. |
Value
A dataframe.
Compute the confusion matrix for the predicted result.
Description
Compute the confusion matrix for the predicted result.
Usage
mi_get_confusion(result_list, ifnet = FALSE)
Arguments
result_list |
A list returned from model training functions. |
ifnet |
Logical.Whether the data is obtained by a deep learning model. |
Value
A confusionMatrix
object.
Plot the bar plot for feature importance.
Description
Plot the bar plot for feature importance.
Usage
mi_get_importance(data)
Arguments
data |
A table. |
Value
A bar plot.
Observe the distribution of the false response of the test set.
Description
Observe the distribution of the false response of the test set.
Usage
mi_get_miss(predict)
Arguments
predict |
An R6 class |
Value
A tibble data frame that records the number of wrong predictions for each category ID
Get max length of ID data.
Description
Get max length of ID data.
Usage
mi_get_padlen(data)
Arguments
data |
A dataframe. |
Value
An int object.
Examples
data(mi_data_rawID)
mi_get_padlen(mi_data_rawID)
Plot correlation heatmap.
Description
Plot correlation heatmap.
Usage
mi_plot_cor(data, cls = "class")
Arguments
data |
Data frame including IDs' position features. |
cls |
The name of the class column. |
Value
A heatmap.
Examples
data(mi_data_procID)
data_num <- mi_to_numer(mi_data_procID)
mi_plot_cor(data_num)
Plot heatmap for result confusion matrix.
Description
Plot heatmap for result confusion matrix.
Usage
mi_plot_heatmap(table, name = NULL, filepath = NULL)
Arguments
table |
A table. |
name |
Model names. |
filepath |
File path the plot to save. Default NULL. |
Value
A ggplot
object.
Predict new data with a trained learner.
Description
Predict new data with a trained learner.
Usage
mi_predict_new(data, result, ifnet = F)
Arguments
data |
A dataframe. |
result |
The result object from a previous training. |
ifnet |
A boolean indicating if a neural network is used for prediction. |
Value
A data frame that contains features and 'predict' class.
Compare classification models with small samples.
Description
Compare classification models with small samples.
Usage
mi_run_bmr(data, row_num = 1000, resamplings = rsmps("cv", folds = 10))
Arguments
data |
A tibble.All are numeric except the first column is a factor. |
row_num |
The number of samples used. |
resamplings |
R6/Resampling.Resampling method. |
Value
A list of R6 class of benchmark results and scores of test set. examples data(mi_data_procID) mi_run_bmr(mi_data_procID)
Cut the string of ID column character by character and divide it into multiple columns.
Description
Cut the string of ID column character by character and divide it into multiple columns.
Usage
mi_split_col(data, cores = NULL, pad_len = 10)
Arguments
data |
Dataframe(tibble) to be split. |
cores |
Int.The num of cores to allocate for computing. |
pad_len |
The length of longest id, i.e. the maxlength. |
Value
A tibble with pad_len+1 column.
Split the string into individual characters and complete the character vector to the maximum length.
Description
Split the string into individual characters and complete the character vector to the maximum length.
Usage
mi_split_str(str, pad_len)
Arguments
str |
The string to be splited. |
pad_len |
The length of longest ID, i.e. the maxlength. |
Value
Splited character vector.
Examples
string_test <- "Good Job"
length <- 15
mi_split_str(string_test, length)
Convert data to numeric, and for the ID column convert with fixed levels.
Description
Convert data to numeric, and for the ID column convert with fixed levels.
Usage
mi_to_numer(
data,
levels = c("*", 0:9, letters, LETTERS, "_", ".", "-", " ", "/", "\\", ":")
)
Arguments
data |
A tibble with n position column(pos1,pos2,...) and class column. |
levels |
Characters accommodated in IDs. |
Value
A numeric data frame with numerical or factor type columns.
Examples
data(mi_data_procID)
mi_to_numer(mi_data_procID)
Train a three layers neural network model.
Description
Train a three layers neural network model.
Usage
mi_train_BP(
train,
test,
cls = "class",
path2save = NULL,
batch_size = 128,
epochs = 64,
validation_split = 0.3,
verbose = 0
)
Arguments
train |
A dataframe with the |
test |
A dataframe with the |
cls |
A character.The name of the label column. |
path2save |
The folder path to store the model and train history. |
batch_size |
Integer or NULL. The number of samples per gradient update. |
epochs |
The number of epochs to train the model. |
validation_split |
Float between 0 and 1. Fraction of the training data to be used as validation data. |
verbose |
The verbosity mode. |
Value
A list
object containing the prediction confusion matrix, the model
object, and the mapping of predicted numbers to classes.
Random Forest Model Training.
Description
Random Forest Model Training.
Usage
mi_train_rg(train, test, measure = msr("classif.acc"), instance = NULL)
Arguments
train |
A dataframe. |
test |
A dataframe. |
measure |
Model evaluation method. |
instance |
A tuner. |
Value
A list of learner for predicting and predicted result of test set.
Classification tree model training.
Description
Classification tree model training.
Usage
mi_train_rp(train, test, measure = msr("classif.acc"), instance = NULL)
Arguments
train |
A dataframe. |
test |
A dataframe. |
measure |
Model evaluation method.Use |
instance |
A tuner. |
Value
A list of learner for predicting and predicted result of test set.
Xgboost model training
Description
Xgboost model training
Usage
mi_train_xgb(train, test, measure = msr("classif.acc"), instance = NULL)
Arguments
train |
A dataframe. |
test |
A dataframe. |
measure |
Model evaluation method. |
instance |
A tuner. |
Value
A list of learner for predicting and predicted result of test set.
Tune the Random Forest model by hyperband.
Description
Tune the Random Forest model by hyperband.
Usage
mi_tune_rg(
data,
resampling = rsmp("cv", folds = 5),
measure = msr("classif.acc"),
eta = 3
)
Arguments
data |
A tibble.All are numeric except the first column is a factor. |
resampling |
R6/Resampling. |
measure |
Model evaluation method.Use |
eta |
The percent parameter configurations discarded. |
Value
A list of tuning instance and stage plot.
Tune the Decision Tree model by hyperband.
Description
Tune the Decision Tree model by hyperband.
Usage
mi_tune_rp(
data,
resampling = rsmp("bootstrap", ratio = 0.8, repeats = 5),
measure = msr("classif.acc"),
eta = 3
)
Arguments
data |
A tibble.All are numeric except the first column is a factor. |
resampling |
R6/Resampling. |
measure |
Model evaluation method.Use |
eta |
The percent parameter configurations discarded. |
Value
A list of tuning instance and stage plot.
Tune the Xgboost model by hyperband.
Description
Tune the Xgboost model by hyperband.
Usage
mi_tune_xgb(
data,
resampling = rsmp("cv", folds = 5),
measure = msr("classif.acc"),
eta = 3
)
Arguments
data |
A tibble.All are numeric except the first column is a factor. |
resampling |
R6/Resampling. |
measure |
Model evaluation method.Use |
eta |
The percent parameter configurations discarded. |
Value
A list of tuning instance and stage plot.
Predict with four models and unify results by the sub-model's specificity score to the four possible classes.
Description
Predict with four models and unify results by the sub-model's specificity score to the four possible classes.
Usage
mi_unify_mod(
data,
col_id,
result_rg,
result_rp,
result_xgb,
result_BP,
c_value = 0.75,
pad_len = 30
)
Arguments
data |
A dataframe contains the ID column. |
col_id |
The name of ID column. |
result_rg |
The result from the Random Forest model. |
result_rp |
The result from the Decision Tree model. |
result_xgb |
The result from the XGBoost model. |
result_BP |
The result from the Backpropagation Neural Network model. |
c_value |
A numeric value used in the final prediction calculation. |
pad_len |
The length to pad the ID characters to. |
Value
A dataframe.