Title: | Meta Clustering with Similarity Network Fusion |
Version: | 2.1.2 |
Description: | Framework to facilitate patient subtyping with similarity network fusion and meta clustering. The similarity network fusion (SNF) algorithm was introduced by Wang et al. (2014) in <doi:10.1038/nmeth.2810>. SNF is a data integration approach that can transform high-dimensional and diverse data types into a single similarity network suitable for clustering with minimal loss of information from each initial data source. The meta clustering approach was introduced by Caruana et al. (2006) in <doi:10.1109/ICDM.2006.103>. Meta clustering involves generating a wide range of cluster solutions by adjusting clustering hyperparameters, then clustering the solutions themselves into a manageable number of qualitatively similar solutions, and finally characterizing representative solutions to find ones that are best for the user's specific context. This package provides a framework to easily transform multi-modal data into a wide range of similarity network fusion-derived cluster solutions as well as to visualize, characterize, and validate those solutions. Core package functionality includes easy customization of distance metrics, clustering algorithms, and SNF hyperparameters to generate diverse clustering solutions; calculation and plotting of associations between features, between patients, and between cluster solutions; and standard cluster validation approaches including resampled measures of cluster stability, standard metrics of cluster quality, and label propagation to evaluate generalizability in unseen data. Associated vignettes guide the user through using the package to identify patient subtypes while adhering to best practices for unsupervised learning. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | cli, cluster, data.table, digest, dplyr, ggplot2, grDevices, MASS, mclust, methods, progressr, purrr, RColorBrewer, rlang, SNFtool, stats, tibble, tidyr, utils |
Suggests: | circlize, ComplexHeatmap, InteractiveComplexHeatmap, clv, future, future.apply, knitr, rmarkdown, testthat (≥ 3.0.0), ggalluvial, lifecycle, dbscan |
Config/testthat/edition: | 3 |
Depends: | R (≥ 4.1.0) |
LazyData: | true |
VignetteBuilder: | knitr |
URL: | https://branchlab.github.io/metasnf/, https://github.com/BRANCHlab/metasnf/ |
BugReports: | https://github.com/BRANCHlab/metasnf/issues |
NeedsCompilation: | no |
Packaged: | 2025-04-28 14:53:31 UTC; prashanth |
Author: | Prashanth S Velayudhan [aut, cre], Xiaoqiao Xu [aut], Prajkta Kallurkar [aut], Ana Patricia Balbon [aut], Maria T Secara [aut], Adam Taback [aut], Denise Sabac [aut], Nicholas Chan [aut], Shihao Ma [aut], Bo Wang [aut], Daniel Felsky [aut], Stephanie H Ameis [aut], Brian Cox [aut], Colin Hawco [aut], Lauren Erdman [aut], Anne L Wheeler [aut, ths] |
Maintainer: | Prashanth S Velayudhan <psvelayu@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-04-28 18:20:02 UTC |
metasnf: Meta Clustering with Similarity Network Fusion
Description
Framework to facilitate patient subtyping with similarity network fusion and meta clustering. The similarity network fusion (SNF) algorithm was introduced by Wang et al. (2014) in doi:10.1038/nmeth.2810. SNF is a data integration approach that can transform high-dimensional and diverse data types into a single similarity network suitable for clustering with minimal loss of information from each initial data source. The meta clustering approach was introduced by Caruana et al. (2006) in doi:10.1109/ICDM.2006.103. Meta clustering involves generating a wide range of cluster solutions by adjusting clustering hyperparameters, then clustering the solutions themselves into a manageable number of qualitatively similar solutions, and finally characterizing representative solutions to find ones that are best for the user's specific context. This package provides a framework to easily transform multi-modal data into a wide range of similarity network fusion-derived cluster solutions as well as to visualize, characterize, and validate those solutions. Core package functionality includes easy customization of distance metrics, clustering algorithms, and SNF hyperparameters to generate diverse clustering solutions; calculation and plotting of associations between features, between patients, and between cluster solutions; and standard cluster validation approaches including resampled measures of cluster stability, standard metrics of cluster quality, and label propagation to evaluate generalizability in unseen data. Associated vignettes guide the user through using the package to identify patient subtypes while adhering to best practices for unsupervised learning.
Author(s)
Maintainer: Prashanth S Velayudhan psvelayu@gmail.com
Authors:
Xiaoqiao Xu
Prajkta Kallurkar
Ana Patricia Balbon
Maria T Secara
Adam Taback
Denise Sabac
Nicholas Chan
Shihao Ma
Bo Wang
Daniel Felsky
Stephanie H Ameis
Brian Cox
Colin Hawco
Lauren Erdman
Anne L Wheeler anne.wheeler@sickkids.ca [thesis advisor]
See Also
Useful links:
Report bugs at https://github.com/BRANCHlab/metasnf/issues
Mock ABCD anxiety data
Description
A randomly shuffled and anonymized copy of anxiety data from the NIMH Data
Archive. The original file used was pdem02.txt. The file was pre-processed
by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function
get_cbcl_anxiety.
Usage
abcd_anxiety
Format
abcd_anxiety
A data frame with 275 rows and 2 columns:
- patient
The unique identifier of the ABCD dataset
- cbcl_anxiety_r
Ordinal value of impairment on CBCL anxiety, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock ABCD "colour" data
Description
A randomly shuffled and anonymized copy of depression data from the NIMH
Data Archive. The original file used was pdem02.txt. The file was
pre-processed by the abcdutils package
(https://github.com/BRANCHlab/abcdutils) function get_cbcl_depress.
The data was transformed into categorical colour values to demonstrate
the Chi-squared test capabilities of extend_solutions.
Usage
abcd_colour
Format
abcd_colour
A data frame with 275 rows and 2 columns:
- patient
The unique identifier of the ABCD dataset
- colour
Categorical transformation of cbcl_depress.
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock ABCD cortical surface area data
Description
A randomly shuffled and anonymized copy of cortical surface area data from the NIMH Data
Archive. The original file used was mrisdp10201.txt. The file was pre-processed
by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function
get_cort_t.
Usage
abcd_cort_sa
Format
abcd_cort_sa
A data frame with 188 rows and 152 columns:
- patient
The unique identifier of the ABCD dataset
- ...
Cortical surface areas of various ROIs (mm^2, I think)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock ABCD cortical thickness data
Description
A randomly shuffled and anonymized copy of cortical thickness data from the NIMH Data
Archive. The original file used was mrisdp10201.txt. The file was pre-processed
by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function
get_cort_t.
Usage
abcd_cort_t
Format
abcd_cort_t
A data frame with 188 rows and 152 columns:
- patient
The unique identifier of the ABCD dataset
- ...
Cortical thicknesses of various ROIs (mm, I think)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock ABCD depression data
Description
A randomly shuffled and anonymized copy of depression data from the NIMH
Data Archive. The original file used was pdem02.txt. The file was
pre-processed by the abcdutils package
(https://github.com/BRANCHlab/abcdutils) function get_cbcl_depress.
Usage
abcd_depress
Format
abcd_depress
A data frame with 275 rows and 2 columns:
- patient
The unique identifier of the ABCD dataset
- cbcl_depress_r
Ordinal value of impairment on CBCL depression, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock ABCD income data
Description
Like abcd_income, but with no NAs in patient column
Usage
abcd_h_income
Format
abcd_h_income
A data frame with 300 rows and 2 columns:
- patient
The unique identifier of the ABCD dataset
- household_income
Household income in 3 category levels (low = 1, medium = 2, high = 3)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock ABCD income data
Description
A randomly shuffled and anonymized copy of income data from the NIMH Data
Archive. The original file used was pdem02.txt. The file was pre-processed
by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function
get_income.
Usage
abcd_income
Format
abcd_income
A data frame with 300 rows and 2 columns:
- patient
The unique identifier of the ABCD dataset
- household_income
Household income in 3 category levels (low = 1, medium = 2, high = 3)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock ABCD pubertal status data
Description
A randomly shuffled and anonymized copy of pubertal status data from the NIMH Data
Archive. The original files used were abcd_ssphp01.txt and abcd_ssphy01.txt. The files were pre-processed
by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function
get_pubertal_status.
Usage
abcd_pubertal
Format
abcd_pubertal
A data frame with 275 rows and 2 columns:
- patient
The unique identifier of the ABCD dataset
- pubertal_status
Average reported pubertal status between child and parent (1-5 categorical scale)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock ABCD subcortical volumes data
Description
A randomly shuffled and anonymized copy of subcortical volume data from the NIMH Data
Archive. The original file used was smrip10201.txt. The file was pre-processed
by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function
get_subc_v.
Usage
abcd_subc_v
Format
abcd_subc_v
A data frame with 174 rows and 31 columns:
- patient
The unique identifier of the ABCD dataset
- ...
Subcortical volumes of various ROIs (mm^3, I think)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Add columns to a data frame
Description
Add new columns to a data frame by specifying their names and a value to initialize them with.
Usage
add_columns(df, cols, value = NA)
Arguments
df |
The data frame to extend. |
cols |
The vector containing new column names. |
value |
The values stored in the newly added columns. NA by default. |
Value
A data frame containing the same columns as the df argument as well as
the new columns specified in the cols argument.
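The behaviour described above can be approximated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: the name add_columns_sketch and the example data frame are hypothetical and exist only for illustration.

```r
# Base-R sketch of the documented behaviour (not the metasnf source code).
# `add_columns_sketch` is a hypothetical name used only for this example.
add_columns_sketch <- function(df, cols, value = NA) {
    # Assigning to several new column names at once creates each column
    # and fills every row with `value`.
    df[cols] <- value
    df
}

scores <- data.frame(patient = c("a", "b", "c"), score = c(1.5, 2.0, 3.2))
extended <- add_columns_sketch(scores, c("cluster", "note"))
colnames(extended)
```

The original columns are kept in place and the new ones are appended on the right, each initialized to NA unless another value is supplied.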
Add rows to a settings_df
Description
Add rows to a settings_df
Usage
add_settings_df_rows(
sdf,
n_solutions = 0,
min_removed_inputs = 0,
max_removed_inputs = sum(startsWith(colnames(sdf), "inc_")) - 1,
dropout_dist = "exponential",
min_alpha = NULL,
max_alpha = NULL,
min_k = NULL,
max_k = NULL,
min_t = NULL,
max_t = NULL,
alpha_values = NULL,
k_values = NULL,
t_values = NULL,
possible_snf_schemes = c(1, 2, 3),
clustering_algorithms = NULL,
continuous_distances = NULL,
discrete_distances = NULL,
ordinal_distances = NULL,
categorical_distances = NULL,
mixed_distances = NULL,
dfl = NULL,
snf_input_weights = NULL,
snf_domain_weights = NULL,
retry_limit = 10,
allow_duplicates = FALSE
)
Arguments
sdf |
The existing settings data frame |
n_solutions |
Number of rows to generate for the settings data frame. |
min_removed_inputs |
The smallest number of input data frames that may be randomly removed. By default, 0. |
max_removed_inputs |
The largest number of input data frames that may be randomly removed. By default, this is 1 less than all the provided input data frames in the data list. |
dropout_dist |
Parameter controlling how the random removal of input data frames should occur. Can be "none" (no input data frames are randomly removed), "uniform" (uniformly sample between min_removed_inputs and max_removed_inputs to determine number of input data frames to remove), or "exponential" (pick number of input data frames to remove by sampling from min_removed_inputs to max_removed_inputs with an exponential distribution; the default). |
min_alpha |
The minimum value that the alpha hyperparameter can have.
The randomly assigned value of alpha for each row is obtained by uniformly
sampling numbers between |
max_alpha |
The maximum value that the alpha hyperparameter can have.
See |
min_k |
The minimum value that the k hyperparameter can have.
The randomly assigned value of k for each row is obtained by uniformly
sampling numbers between |
max_k |
The maximum value that the k hyperparameter can have.
See |
min_t |
The minimum value that the t hyperparameter can have.
The randomly assigned value of t for each row is obtained by uniformly
sampling numbers between |
max_t |
The maximum value that the t hyperparameter can have.
See |
alpha_values |
A number or numeric vector of a set of possible values
that alpha can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
k_values |
A number or numeric vector of a set of possible values
that k can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
t_values |
A number or numeric vector of a set of possible values
that t can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
possible_snf_schemes |
A vector containing the possible snf_schemes to uniformly randomly select from. By default, the vector contains all 3 possible schemes: c(1, 2, 3). 1 corresponds to the "individual" scheme, 2 corresponds to the "domain" scheme, and 3 corresponds to the "two-step" scheme. |
clustering_algorithms |
A list of clustering algorithms to uniformly randomly pick from when clustering. When not specified, randomly select between spectral clustering using the eigen-gap heuristic and spectral clustering using the rotation cost heuristic. See ?clust_fns_list for more details on running custom clustering algorithms. |
continuous_distances |
A vector of continuous distance metrics to use when a custom dist_fns_list is provided. |
discrete_distances |
A vector of discrete distance metrics to use when a custom dist_fns_list is provided. |
ordinal_distances |
A vector of ordinal distance metrics to use when a custom dist_fns_list is provided. |
categorical_distances |
A vector of categorical distance metrics to use when a custom dist_fns_list is provided. |
mixed_distances |
A vector of mixed distance metrics to use when a custom dist_fns_list is provided. |
dfl |
List containing distance metrics to vary over. See ?generate_dist_fns_list. |
snf_input_weights |
Nested list containing weights for when SNF is used to merge individual input measures (see ?generate_snf_weights) |
snf_domain_weights |
Nested list containing weights for when SNF is used to merge domains (see ?generate_snf_weights) |
retry_limit |
The maximum number of attempts to generate a novel row.
This function does not return matrices with identical rows. As the range of
requested possible settings tightens and the number of requested rows
increases, the risk of randomly generating a row that already exists
increases. If a new random row has matched an existing row |
allow_duplicates |
If TRUE, enables creation of a settings data frame with duplicate non-feature weighting related hyperparameters. This function should only be used when paired with a custom weights matrix that has non-duplicate rows. |
Value
A settings data frame
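The arguments above describe two mutually exclusive ways of drawing a hyperparameter: uniformly from a min/max range, or uniformly from an explicit vector of values. The following is a rough sketch of that idea only. The helper draw_hyperparameter is hypothetical, and the integer step between the bounds is an assumption made for the example; the package may use a different grid.

```r
# Hypothetical helper illustrating range-based versus set-based sampling.
# Not part of metasnf; the integer grid between the bounds is an assumption.
draw_hyperparameter <- function(n, min_value = NULL, max_value = NULL,
                                values = NULL) {
    range_given <- !is.null(min_value) || !is.null(max_value)
    if (range_given && !is.null(values)) {
        stop("Provide either a min/max range or a vector of values, not both.")
    }
    if (is.null(values)) {
        if (is.null(min_value) || is.null(max_value)) {
            stop("Both bounds are needed when no vector of values is given.")
        }
        values <- seq(min_value, max_value)
    }
    # Sampling indices avoids sample()'s special handling of a single number,
    # where sample(5, 1) would draw from 1:5 instead of returning 5.
    values[sample.int(length(values), n, replace = TRUE)]
}

set.seed(42)
k_from_range <- draw_hyperparameter(5, min_value = 10, max_value = 30)
k_from_set <- draw_hyperparameter(5, values = c(10, 20, 40))
```

Rejecting the combination of a range and a value vector mirrors the "cannot be used in conjunction" wording in the argument descriptions.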
Heatmap of pairwise adjusted Rand indices between solutions
Description
Defunct function to create an ARI heatmap. Please use meta_cluster_heatmap() instead.
Usage
adjusted_rand_index_heatmap(
aris,
order = NULL,
cluster_rows = FALSE,
cluster_columns = FALSE,
log_graph = FALSE,
scale_diag = "none",
min_colour = "#282828",
max_colour = "firebrick2",
col = circlize::colorRamp2(c(min(aris), max(aris)), c(min_colour, max_colour)),
...
)
Arguments
aris |
Matrix of adjusted rand indices from |
order |
Numeric vector containing row order of the heatmap. |
cluster_rows |
Whether rows should be clustered. |
cluster_columns |
Whether columns should be clustered. |
log_graph |
If TRUE, log transforms the graph. |
scale_diag |
Method of rescaling matrix diagonals. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0). |
min_colour |
Colour used for the lowest value in the heatmap. |
max_colour |
Colour used for the highest value in the heatmap. |
col |
Colour ramp to use for the heatmap. |
... |
Additional parameters passed to |
Value
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the pairwise adjusted Rand indices (similarities) between the cluster solutions of the provided solutions data frame.
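The three scale_diag options can be illustrated on a small matrix. This is a base-R sketch of the rescaling as described in the argument table, not the function's actual code; rescale_diagonal and the toy matrix are hypothetical.

```r
# Sketch of the "none", "mean", and "zero" diagonal rescaling options.
# `rescale_diagonal` is a hypothetical name, not a metasnf function.
rescale_diagonal <- function(m, scale_diag = c("none", "mean", "zero")) {
    scale_diag <- match.arg(scale_diag)
    if (scale_diag == "mean") {
        # Replace the diagonal with the average of all off-diagonal entries.
        off_diagonal <- m[row(m) != col(m)]
        diag(m) <- mean(off_diagonal)
    } else if (scale_diag == "zero") {
        diag(m) <- 0
    }
    m
}

# Toy symmetric matrix standing in for pairwise adjusted Rand indices.
aris <- matrix(
    c(1.0, 0.2, 0.4,
      0.2, 1.0, 0.6,
      0.4, 0.6, 1.0),
    nrow = 3,
    byrow = TRUE
)
mean_scaled <- rescale_diagonal(aris, "mean")
zero_scaled <- rescale_diagonal(aris, "zero")
```

Because every solution has an index of 1 with itself, rescaling the diagonal keeps those entries from dominating the colour range of the heatmap.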
Mock age data
Description
Mock age data
Usage
age_df
Format
age_df
A data frame with 200 rows and 2 columns:
- patient_id
Random three-digit number uniquely identifying the patient
- age
Mock age feature
Source
This data came from the SNFtool package, with slight modifications.
Alluvial plot of patients across cluster counts and important features
Description
This function creates an alluvial plot that shows how observations in a similarity matrix could have been clustered over a set of clustering functions.
Usage
alluvial_cluster_plot(
cluster_sequence,
similarity_matrix,
dl = NULL,
data = NULL,
key_outcome,
key_label = key_outcome,
extra_outcomes = NULL,
title = NULL
)
Arguments
cluster_sequence |
A list of clustering algorithms. |
similarity_matrix |
A similarity matrix. |
dl |
A data list. |
data |
A data frame that contains any features to include in the plot. |
key_outcome |
The name of the feature that determines how each patient stream is coloured in the alluvial plot. |
key_label |
Name of key outcome to be used for the plot legend. |
extra_outcomes |
Names of additional features to add to the plot. |
title |
Title of the plot. |
Value
An alluvial plot (class "gg" and "ggplot") showing the distribution of a feature across cluster solutions with varying numbers of clusters.
Examples
input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)
sc <- snf_config(input_dl, n_solutions = 1)
sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)
sim_mats <- sim_mats_list(sol_df)
clust_fn_sequence <- list(spectral_two, spectral_four)
alluvial_cluster_plot(
    cluster_sequence = clust_fn_sequence,
    similarity_matrix = sim_mats[[1]],
    dl = input_dl,
    key_outcome = "gender",
    key_label = "Gender",
    extra_outcomes = "diagnosis",
    title = "Gender Across Cluster Counts"
)
Mock ABCD anxiety data
Description
Like the mock data frame "abcd_anxiety", but with "unique_id" as the "uid".
Usage
anxiety
Format
anxiety
A data frame with 275 rows and 2 columns:
- unique_id
The unique identifier of the ABCD dataset
- cbcl_anxiety_r
Ordinal value of impairment on CBCL anxiety, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Sort data frames in a data list by their unique ID values
Description
Sort data frames in a data list by their unique ID values
Usage
arrange_dll(dll)
Arguments
dll |
A data list-like |
Value
The data list-like object with all data frames sorted by uid.
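As a rough illustration of what sorting by unique ID involves, the sketch below applies the same idea to a plain list of data frames in base R. The list `dfs`, the helper `sort_by_uid`, and the column name `uid` are assumptions made for this example; a real data list carries additional structure that the sketch ignores.

```r
# Hypothetical stand-in for a data list: a plain list of data frames
# that each have a "uid" column.
dfs <- list(
  income = data.frame(uid = c("c", "a", "b"), income = c(3, 1, 2)),
  anxiety = data.frame(uid = c("b", "c", "a"), anxiety = c(0, 2, 1))
)

# Sort every data frame in the list by its uid column.
sort_by_uid <- function(df_list, uid_col = "uid") {
  lapply(df_list, function(df) {
    sorted <- df[order(df[[uid_col]]), , drop = FALSE]
    rownames(sorted) <- NULL
    sorted
  })
}

sorted_dfs <- sort_by_uid(dfs)
```

After sorting, row i of every data frame refers to the same observation, which is what makes row-wise operations across data frames safe.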
Coerce a data_list class object into a data.frame class object
Description
Horizontally joins data frames within a data list into a single data frame, using the uid attribute as the joining key.
Usage
## S3 method for class 'data_list'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
Arguments
x |
A |
row.names |
Additional parameter passed to |
optional |
Additional parameter passed to |
... |
Additional parameter passed to |
Value
dl_df A data.frame class object with all the features and observations of dl.
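The horizontal join described above can be sketched in base R as a repeated merge on the shared ID column. This is only an illustration on a plain list of data frames; the object `dfs` and the column name `uid` are assumptions, and the method itself may be implemented differently.

```r
# Hypothetical stand-in for the data frames stored in a data list.
dfs <- list(
  data.frame(uid = c("a", "b", "c"), income = c(1, 2, 3)),
  data.frame(uid = c("a", "b", "c"), anxiety = c(1, 0, 2))
)

# Join all data frames side by side, matching rows on "uid".
joined <- Reduce(function(x, y) merge(x, y, by = "uid"), dfs)
```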
Coerce an ext_solutions_df class object into a data.frame class object
Description
Coerce an ext_solutions_df class object into a data.frame class object
Usage
## S3 method for class 'ext_solutions_df'
as.data.frame(
x,
row.names = NULL,
optional = FALSE,
keep_attributes = FALSE,
...
)
Arguments
x |
A |
row.names |
Additional parameter passed to |
optional |
Additional parameter passed to |
keep_attributes |
If TRUE, resulting data frame includes settings data frame and weights matrix. |
... |
Additional parameter passed to |
Value
A data.frame class object with all the columns of x and its contained solutions data frame.
Coerce a settings_df class object into a data.frame class object
Description
Coerce a settings_df class object into a data.frame class object
Usage
## S3 method for class 'settings_df'
as.data.frame(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A data.frame class object with all the columns of x.
Coerce an snf_config class object into a data.frame class object
Description
Coerce an snf_config class object into a data.frame class object
Usage
## S3 method for class 'snf_config'
as.data.frame(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A data.frame class object with all the columns of x and its contained solutions data frame.
Coerce a solutions_df class object into a data.frame class object
Description
Coerce a solutions_df class object into a data.frame class object
Usage
## S3 method for class 'solutions_df'
as.data.frame(
x,
row.names = NULL,
optional = FALSE,
keep_attributes = FALSE,
...
)
Arguments
x |
A |
row.names |
Additional parameter passed to |
optional |
Additional parameter passed to |
keep_attributes |
If TRUE, resulting data frame includes settings data frame and weights matrix. |
... |
Additional parameter passed to |
Value
A data.frame class object with all the columns of x and its contained solutions data frame.
Coerce a t_ext_solutions_df class object into a data.frame class object
Description
Coerce a t_ext_solutions_df class object into a data.frame class object
Usage
## S3 method for class 't_ext_solutions_df'
as.data.frame(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A data.frame class object with all the columns of x and its contained solutions data frame.
Coerce a t_solutions_df class object into a data.frame class object
Description
Coerce a t_solutions_df class object into a data.frame class object
Usage
## S3 method for class 't_solutions_df'
as.data.frame(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A data.frame class object with all the columns of x and its contained solutions data frame.
Coerce a weights_matrix class object into a data.frame class object
Description
Coerce a weights_matrix class object into a data.frame class object
Usage
## S3 method for class 'weights_matrix'
as.data.frame(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A data.frame class object with all the columns of x.
Coerce a clust_fns_list class object into a list class object
Description
Coerce a clust_fns_list class object into a list class object
Usage
## S3 method for class 'clust_fns_list'
as.list(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A list class object with all the functions of x.
Coerce a data_list class object into a list class object
Description
Coerce a data_list class object into a list class object
Usage
## S3 method for class 'data_list'
as.list(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A list class object with all the objects of x.
Coerce a dist_fns_list class object into a list class object
Description
Coerce a dist_fns_list class object into a list class object
Usage
## S3 method for class 'dist_fns_list'
as.list(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A list class object with all the functions of x.
Coerce a sim_mats_list class object into a list class object
Description
Coerce a sim_mats_list class object into a list class object
Usage
## S3 method for class 'sim_mats_list'
as.list(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A list class object with all the similarity matrices of x.
Coerce an snf_config class object into a list class object
Description
Coerce an snf_config class object into a list class object
Usage
## S3 method for class 'snf_config'
as.list(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A list class object with all the components of x.
Coerce an ari_matrix class object into a matrix class object
Description
Coerce an ari_matrix class object into a matrix class object
Usage
## S3 method for class 'ari_matrix'
as.matrix(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A matrix and array class object.
Coerce a weights_matrix class object into a matrix class object
Description
Coerce a weights_matrix class object into a matrix class object
Usage
## S3 method for class 'weights_matrix'
as.matrix(x, ...)
Arguments
x |
A |
... |
Additional parameter passed to |
Value
A matrix and array class object.
Convert an object to an ARI matrix
Description
This function coerces non-ari_matrix class objects into ari_matrix class objects.
Usage
as_ari_matrix(x)
Arguments
x |
The object to convert into an ARI matrix. |
Value
An ari_matrix class object.
Convert an object to a data list
Description
This function coerces non-data_list class objects into data_list class objects.
Usage
as_data_list(x)
Arguments
x |
The object to convert into a data list. |
Value
A data_list class object.
Convert an object to a settings data frame
Description
This function coerces non-settings_df class objects into settings_df class objects.
Usage
as_settings_df(x)
Arguments
x |
The object to convert into a settings data frame. |
Value
A settings_df class object.
Convert an object to a similarity matrix list
Description
This function converts non-sim_mats_list class objects into sim_mats_list class objects.
Usage
as_sim_mats_list(x)
Arguments
x |
The object to convert into a |
Value
A sim_mats_list class object.
Convert an object to a snf config
Description
This function coerces non-snf_config class objects into snf_config class objects.
Usage
as_snf_config(x)
Arguments
x |
The object to convert into a snf config. |
Value
An snf_config class object.
Convert an object to a weights matrix
Description
This function converts non-weights_matrix class objects into weights_matrix class objects.
Usage
as_weights_matrix(x)
Arguments
x |
The object to convert into a weights matrix. |
Value
A weights_matrix class object.
Collapse a data frame and/or a data list into a single data frame
Description
Collapse a data frame and/or a data list into a single data frame
Usage
assemble_data(data, dl)
Arguments
data |
A data frame. |
dl |
A nested list of input data from |
Value
A class "data.frame" object containing all the features of the provided data frame and/or data list.
Heatmap of pairwise associations between features
Description
Heatmap of pairwise associations between features
Usage
assoc_pval_heatmap(
correlation_matrix,
scale_diag = "max",
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_names = TRUE,
show_column_names = TRUE,
show_heatmap_legend = FALSE,
confounders = NULL,
out_of_models = NULL,
annotation_colours = NULL,
labels_colour = NULL,
split_by_domain = FALSE,
dl = NULL,
significance_stars = TRUE,
slice_font_size = 8,
...
)
Arguments
correlation_matrix |
Matrix containing all pairwise association p-values. The recommended way to obtain this matrix is through the calc_assoc_pval_matrix function. |
scale_diag |
Parameter that controls how the diagonals of the correlation_matrix are adjusted in the heatmap. For best viewing, this is set to "max", which will match the diagonals to whichever pairwise association has the highest p-value. |
cluster_rows |
Parameter for ComplexHeatmap::Heatmap. Will be ignored if split_by_domain is also provided. |
cluster_columns |
Parameter for ComplexHeatmap::Heatmap. Will be ignored if split_by_domain is also provided. |
show_row_names |
Parameter for ComplexHeatmap::Heatmap. |
show_column_names |
Parameter for ComplexHeatmap::Heatmap. |
show_heatmap_legend |
Parameter for ComplexHeatmap::Heatmap. |
confounders |
A named list where the elements are columns in the correlation_matrix and the names are the corresponding display names. |
out_of_models |
Like confounders, but a named list of out-of-model measures (which are also present as columns in the correlation_matrix). |
annotation_colours |
Named list of heatmap annotations and their colours. |
labels_colour |
Vector of colours to use for the columns and rows of the heatmap. |
split_by_domain |
Visually slice the heatmap based on feature domains. |
dl |
A nested list of input data from |
significance_stars |
If TRUE (default), plots significance stars on heatmap cells. |
slice_font_size |
Font size for domain separating labels. |
... |
Additional parameters passed into ComplexHeatmap::Heatmap. |
Value
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the pairwise associations between features from the provided correlation_matrix.
Examples
data_list <- data_list(
list(income, "household_income", "demographics", "ordinal"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
list(fav_colour, "favourite_colour", "demographics", "categorical"),
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
assoc_pval_matrix <- calc_assoc_pval_matrix(data_list)
ap_heatmap <- assoc_pval_heatmap(assoc_pval_matrix)
Automatically plot features across clusters
Description
Given a single row of a solutions data frame and data provided through a data list, this function will return a series of bar and/or jitter plots based on feature types.
Usage
auto_plot(
sol_df_row = NULL,
dl = NULL,
cluster_df = NULL,
return_plots = TRUE,
save = NULL,
jitter_width = 6,
jitter_height = 6,
bar_width = 6,
bar_height = 6,
verbose = FALSE
)
Arguments
sol_df_row |
A single row of a solutions data frame. |
dl |
A data list containing data to plot. |
cluster_df |
Directly provide a cluster_df rather than a solutions matrix. Useful if plotting data from label propagated results. |
return_plots |
If |
save |
If a string is provided, plots will be saved and this string will be used to prefix plot names. |
jitter_width |
Width of jitter plots if save is specified. |
jitter_height |
Height of jitter plots if save is specified. |
bar_width |
Width of bar plots if save is specified. |
bar_height |
Height of bar plots if save is specified. |
verbose |
If TRUE, output progress to console. |
Value
By default, returns a list of plots (class "gg", "ggplot") with one plot for every feature in the provided data list and/or target list. If return_plots is FALSE, will instead return a single "data.frame" object containing every provided feature for every observation in long format.
Bar plot separating a feature by cluster
Description
Bar plot separating a feature by cluster
Usage
bar_plot(df, feature)
Arguments
df |
A data.frame containing a cluster column and the feature to plot. |
feature |
The feature to plot. |
Value
A bar plot (class "gg", "ggplot") showing the distribution of a feature across clusters.
Generate closure function to run batch_snf in an apply-friendly format
Description
Generate closure function to run batch_snf in an apply-friendly format
Usage
batch_row_closure(
dl,
dfl,
cfl,
sdf,
wm,
similarity_matrix_dir,
return_sim_mats,
prog
)
Arguments
dl |
A nested list of input data from |
dfl |
An optional nested list containing which distance metric function should be used for the various feature types (continuous, discrete, ordinal, categorical, and mixed). See ?dist_fns_list for details on how to build this. |
cfl |
List of custom clustering algorithms to apply to the final fused network. See ?clust_fns_list. |
sdf |
Matrix indicating parameters to iterate SNF through. |
wm |
A matrix containing feature weights to use during distance matrix calculation. See ?weights_matrix for details on how to build this. |
similarity_matrix_dir |
If specified, this directory will be used to save all generated similarity matrices. |
return_sim_mats |
If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE. |
prog |
Progressr function to update parallel processing progress |
Value
A "function" class object used to run batch_snf in lapply-form for parallel processing.
Run variations of SNF
Description
This is the core function of the metasnf package. Using the information stored in a settings_df (see ?settings_df) and a data list (see ?data_list), run repeated complete SNF pipelines to generate a broad space of post-SNF cluster solutions.
Usage
batch_snf(dl, sc, processes = 1, return_sim_mats = FALSE, sim_mats_dir = NULL)
Arguments
dl |
A nested list of input data from |
sc |
An |
processes |
Specify number of processes used to complete SNF iterations
|
return_sim_mats |
If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE. |
sim_mats_dir |
If specified, this directory will be used to save all generated similarity matrices. |
Value
By default, returns a solutions data frame (class "data.frame"): a data frame containing one row for every row of the provided settings data frame, all the original columns of that settings data frame, and new columns containing the assigned cluster of each observation from the cluster solution derived by that row's settings. If return_sim_mats is TRUE, the function will instead return a list containing the solutions data frame as well as a list of the final similarity matrices (class "matrix") generated by SNF for each row of the settings data frame. If suppress_clustering is TRUE, the solutions data frame will not be returned in the output.
Examples
input_dl <- data_list(
list(gender_df, "gender", "demographics", "categorical"),
list(diagnosis_df, "diagnosis", "clinical", "categorical"),
uid = "patient_id"
)
sc <- snf_config(input_dl, n_solutions = 3)
# A solutions data frame without similarity matrices:
sol_df <- batch_snf(input_dl, sc)
# A solutions data frame with similarity matrices:
sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)
sim_mats_list(sol_df)
Run SNF clustering pipeline on a list of subsampled data lists
Description
Run SNF clustering pipeline on a list of subsampled data lists
Usage
batch_snf_subsamples(
dl_subsamples,
sc,
processes = 1,
return_sim_mats = FALSE,
sim_mats_dir = NULL
)
Arguments
dl_subsamples |
A list of subsampled data lists. This object is
generated by the function |
sc |
An |
processes |
Specify number of processes used to complete SNF iterations
|
return_sim_mats |
If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE. |
sim_mats_dir |
If specified, this directory will be used to save all generated similarity matrices. |
Value
By default, returns a one-element list: cluster_solutions, which is itself a list of cluster solution data frames corresponding to each of the provided data list subsamples. Setting the parameters return_sim_mats and return_solutions to TRUE will turn the result of the function into a three-element list containing the corresponding solutions data frames and final fused similarity matrices of those cluster solutions, should you require these objects for your own stability calculations.
Examples
my_dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
my_dl_subsamples <- subsample_dl(
my_dl,
n_subsamples = 20,
subsample_fraction = 0.85
)
batch_subsample_results <- batch_snf_subsamples(
my_dl_subsamples,
sc
)
Cached example extended solutions data frame
Description
An extended solutions data frame used as a cached example in the "a_complete_example.Rmd" vignette.
Usage
cache_a_complete_example_ext_sol_df
Format
cache_a_complete_example_ext_sol_df
Contains 20 cluster solutions, 87 observations, and p-values for 336 features.
Source
This data came from the metasnf package.
Cached example extended solutions data frame
Description
An extended solutions data frame used as a cached example in the "a_complete_example.Rmd" vignette.
Usage
cache_a_complete_example_lp_ext_sol_df
Format
cache_a_complete_example_lp_ext_sol_df
Contains 5 cluster solutions, 74 observations, and p-values for 2 features.
Source
This data comes from the metasnf package.
Cached example solutions data frame
Description
A solutions data frame used as a cached example in the "a_complete_example.Rmd" vignette.
Usage
cache_a_complete_example_sol_df
Format
cache_a_complete_example_sol_df
A solutions data frame with 20 cluster solutions and 87 observations.
Source
This data came from the metasnf package.
Construct an ARI matrix storing inter-solution similarities
Description
This function constructs an ari_matrix class object from a solutions_df class object. The ARI matrix stores pairwise adjusted Rand indices for all cluster solutions as well as a numeric order for the solutions data frame based on the hierarchical clustering of the ARI matrix.
Usage
calc_aris(
sol_df,
processes = 1,
verbose = FALSE,
dist_method = "euclidean",
hclust_method = "complete"
)
Arguments
sol_df |
Solutions data frame containing cluster solutions to calculate pairwise ARIs for. |
processes |
Specify number of processes used to complete calculations
|
verbose |
If TRUE, output progress to console. |
dist_method |
Distance method to use when calculating the sorting order of the matrix. Argument is directly passed into stats::dist. Options include "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski". |
hclust_method |
Agglomerative method to use when calculating sorting
order by |
Value
om_aris ARIs between clustering solutions of a solutions data frame
Examples
dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
sc <- snf_config(dl, n_solutions = 3)
sol_df <- batch_snf(dl, sc)
calc_aris(sol_df)
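To make the two ingredients of an ARI matrix concrete, the following base R sketch computes pairwise adjusted Rand indices for a few toy cluster solutions and then derives a display order from hierarchical clustering, mirroring the dist_method and hclust_method defaults shown above. The helper `adjusted_rand_index` and the toy `solutions` matrix are assumptions for illustration only; the package's own computation may differ in its details.

```r
# Adjusted Rand index between two label vectors, computed from their
# contingency table (hypothetical helper, not part of the package API).
adjusted_rand_index <- function(a, b) {
  tab <- unclass(table(a, b))
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(n, 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

# Toy cluster solutions: one row per solution, one column per observation.
solutions <- rbind(
  s1 = c(1, 1, 1, 2, 2, 2),
  s2 = c(2, 2, 2, 1, 1, 1), # same partition as s1 with swapped labels
  s3 = c(1, 2, 1, 2, 1, 2)
)

# Fill the pairwise ARI matrix.
n_sol <- nrow(solutions)
aris <- matrix(
  NA_real_, n_sol, n_sol,
  dimnames = list(rownames(solutions), rownames(solutions))
)
for (i in seq_len(n_sol)) {
  for (j in seq_len(n_sol)) {
    aris[i, j] <- adjusted_rand_index(solutions[i, ], solutions[j, ])
  }
}

# Order solutions by hierarchical clustering of the ARI matrix rows.
solution_order <- stats::hclust(
  stats::dist(aris, method = "euclidean"),
  method = "complete"
)$order
```

Because the ARI is invariant to label permutations, s1 and s2 score as identical even though their cluster numbers are swapped.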
Calculate p-values based on feature vectors and their types
Description
Calculate p-values based on feature vectors and their types
Usage
calc_assoc_pval(var1, var2, type1, type2, cat_test = "chi_squared")
Arguments
var1 |
A single vector containing a feature. |
var2 |
A single vector containing a feature. |
type1 |
The type of var1 (continuous, discrete, ordinal, categorical). |
type2 |
The type of var2 (continuous, discrete, ordinal, categorical). |
cat_test |
String indicating which statistical test will be used to associate cluster with a categorical feature. Options are "chi_squared" for the Chi-squared test and "fisher_exact" for Fisher's exact test. |
Value
pval A p-value from a statistical test based on the provided types. Currently, this will either be the F-test p-value from a linear model if at least one feature is non-categorical, or the p-value of the test selected by cat_test (chi-squared by default) if both features are categorical.
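The branching described in the Value section can be sketched in base R as follows. The function `assoc_pval_sketch` is a hypothetical stand-in written for this illustration, and the choice of which feature acts as the response in the linear model is an assumption; it is not the package's implementation.

```r
# Hypothetical re-creation of the test-selection logic described above.
assoc_pval_sketch <- function(var1, var2, type1, type2,
                              cat_test = "chi_squared") {
  if (type1 == "categorical" && type2 == "categorical") {
    tab <- table(var1, var2)
    if (cat_test == "fisher_exact") {
      return(stats::fisher.test(tab)$p.value)
    }
    return(stats::chisq.test(tab)$p.value)
  }
  # At least one feature is non-categorical: fit a linear model with the
  # non-categorical feature as the response (an assumption of this sketch).
  if (type1 == "categorical") {
    model <- stats::lm(var2 ~ factor(var1))
  } else if (type2 == "categorical") {
    model <- stats::lm(var1 ~ factor(var2))
  } else {
    model <- stats::lm(var1 ~ var2)
  }
  f <- summary(model)$fstatistic
  stats::pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
}

set.seed(1)
group <- rep(c("x", "y"), each = 20)
colour <- rep(c("red", "green"), times = 20)
score <- c(stats::rnorm(20, mean = 0), stats::rnorm(20, mean = 3))

p_mixed <- assoc_pval_sketch(score, group, "continuous", "categorical")
p_cat <- assoc_pval_sketch(group, colour, "categorical", "categorical")
```

In this toy data, `score` differs strongly between groups, while `group` and `colour` are perfectly balanced against each other.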
Calculate p-values for all pairwise associations of features in a data list
Description
Calculate p-values for all pairwise associations of features in a data list
Usage
calc_assoc_pval_matrix(dl, verbose = FALSE, cat_test = "chi_squared")
Arguments
dl |
A nested list of input data from |
verbose |
If TRUE, output progress to the console. |
cat_test |
String indicating which statistical test will be used to associate cluster with a categorical feature. Options are "chi_squared" for the Chi-squared test and "fisher_exact" for Fisher's exact test. |
Value
A "matrix" class object containing pairwise association p-values between the features in the provided data list.
Examples
data_list <- data_list(
list(income, "household_income", "demographics", "ordinal"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
assoc_pval_matrix <- calc_assoc_pval_matrix(data_list)
Calculate feature NMIs for a data list and a solutions data frame
Description
Normalized mutual information scores can be used to indirectly measure how important a feature may have been in producing a cluster solution. This function will calculate the normalized mutual information between cluster solutions in a solutions data frame as well as cluster solutions created by including only a single feature from a provided data list, but otherwise using all the same hyperparameters as specified in the original SNF config. Note that NMIs can be calculated between two cluster solutions regardless of what features were actually used to create those cluster solutions. For example, a feature that was not involved in producing a particular cluster solution may still have a high NMI with that cluster solution (typically because it was highly correlated with a different feature that was used).
Usage
calc_nmis(
dl,
sol_df,
transpose = TRUE,
ignore_inclusions = TRUE,
processes = 1
)
Arguments
dl |
A nested list of input data from |
sol_df |
Result of |
transpose |
If TRUE, will transpose the output data frame. |
ignore_inclusions |
If TRUE, will ignore the inclusion columns in the solutions data frame and calculate NMIs for all features. If FALSE, will give NAs for features that were dropped on a given settings_df row. |
processes |
Specify number of processes used to complete SNF iterations
|
Value
A "data.frame" class object containing one row for every feature in the provided data list and one column for every solution in the provided solutions data frame. Populated values show the calculated NMI score for each feature-solution combination.
Examples
input_dl <- data_list(
list(gender_df, "gender", "demographics", "categorical"),
list(diagnosis_df, "diagnosis", "clinical", "categorical"),
uid = "patient_id"
)
sc <- snf_config(input_dl, n_solutions = 2)
sol_df <- batch_snf(input_dl, sc)
calc_nmis(input_dl, sol_df)
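For readers unfamiliar with the metric, the sketch below computes a normalized mutual information score between two label vectors in base R. The helpers `entropy` and `nmi_sketch` are hypothetical, and normalizing by the geometric mean of the two entropies is an assumption of this sketch; other normalizations exist and the package may use a different one.

```r
# Shannon entropy of a probability vector (zero cells are ignored).
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# NMI between two label vectors, normalized by the geometric mean of
# their entropies (an assumed convention for this illustration).
nmi_sketch <- function(a, b) {
  joint <- table(a, b) / length(a)
  h_a <- entropy(rowSums(joint))
  h_b <- entropy(colSums(joint))
  mutual_info <- h_a + h_b - entropy(as.vector(joint))
  mutual_info / sqrt(h_a * h_b)
}

full_solution <- c(1, 1, 2, 2)
same_partition <- c(2, 2, 1, 1)  # identical grouping, swapped labels
unrelated <- c(1, 2, 1, 2)       # independent of full_solution

nmi_same <- nmi_sketch(full_solution, same_partition)
nmi_unrelated <- nmi_sketch(full_solution, unrelated)
```

A score near 1 means the single-feature solution reproduces the full solution's grouping; a score near 0 means the two groupings carry no information about each other.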
Calculate co-clustering data
Description
Calculate co-clustering data
Usage
calculate_coclustering(subsample_solutions, sol_df, verbose = FALSE)
Arguments
subsample_solutions |
A list containing cluster solutions from
distinct subsamples of the data. This object is generated by the function
|
sol_df |
A solutions data frame. This object is generated by the
function |
verbose |
If TRUE, output time remaining estimates to console. |
Value
A list containing the following components:
cocluster_dfs: A list of data frames, one per cluster solution, that shows the number of times that every pair of observations in the original cluster solution occurred in the same subsample, the number of times that every pair clustered together in a subsample, and the corresponding fraction of times that every pair clustered together in a subsample.
cocluster_ss_mats: The number of times every pair of observations occurred in the same subsample, formatted as a pairwise matrix.
cocluster_sc_mats: The number of times every pair of observations occurred in the same cluster, formatted as a pairwise matrix.
cocluster_cf_mats: The fraction of times every pair of observations occurred in the same cluster, formatted as a pairwise matrix.
cocluster_summary: Specifically among pairs of observations that clustered together in the original full cluster solution, what fraction of those pairs remained clustered together throughout the subsample solutions. This information is formatted as a data frame with one row per cluster solution.
Examples
my_dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
sol_df <- batch_snf(my_dl, sc)
my_dl_subsamples <- subsample_dl(
my_dl,
n_subsamples = 20,
subsample_fraction = 0.85
)
batch_subsample_results <- batch_snf_subsamples(
my_dl_subsamples,
sc
)
coclustering_results <- calculate_coclustering(
batch_subsample_results,
sol_df,
verbose = TRUE
)
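The three pairwise matrices listed in the Value section (same-subsample counts, same-cluster counts, and their ratio) can be illustrated with a small base R sketch. The toy `subsample_solutions` below is a hypothetical list of named label vectors, not the structure returned by batch_snf_subsamples.

```r
uids <- c("a", "b", "c", "d")

# Hypothetical subsample solutions: cluster labels named by observation,
# each covering only the observations drawn in that subsample.
subsample_solutions <- list(
  c(a = 1, b = 1, c = 2),
  c(a = 1, b = 2, d = 2),
  c(b = 1, c = 1, d = 1)
)

same_subsample <- matrix(0, 4, 4, dimnames = list(uids, uids))
same_cluster <- same_subsample

for (sol in subsample_solutions) {
  present <- names(sol)
  # Every pair of observations in this subsample co-occurred once more.
  same_subsample[present, present] <- same_subsample[present, present] + 1
  # Pairs sharing a label in this subsample clustered together once more.
  together <- outer(sol, sol, "==")
  same_cluster[present, present] <- same_cluster[present, present] + together
}

# Fraction of shared subsamples in which each pair clustered together.
cocluster_fraction <- same_cluster / same_subsample
```

Pairs that never appear in the same subsample would yield 0/0 here; in this toy example every pair co-occurs at least once, so the ratio is always defined.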
Mock diagnosis data
Description
This is the same data as diagnosis_df, with renamed features and columns.
Usage
cancer_diagnosis_df
Format
cancer_diagnosis_df
A data frame with 200 rows and 2 columns:
- patient_id
Random three-digit number uniquely identifying the patient
- diagnosis
Mock cancer diagnosis feature (1, 2, or 3)
Source
This data came from the SNFtool package, with slight modifications.
Helper function for generating categorical colour palette
Description
Helper function for generating categorical colour palette
Usage
cat_colours(vector, palette)
Arguments
vector |
Vector of categorical data to generate palette for. |
palette |
Which RColorBrewer palette should be used. |
Value
A named list of colours where the names correspond to the unique values of vector and the values correspond to their colours.
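The shape of the returned object can be illustrated without RColorBrewer. The sketch below swaps in a base R palette from grDevices::hcl.colors, so the colours themselves will differ from what the package produces; the function name `cat_colours_sketch` and the default palette "Set 2" are assumptions for this example.

```r
# Build a named list mapping each unique value to a colour.
# Uses a base R HCL palette as a stand-in for an RColorBrewer palette.
cat_colours_sketch <- function(vector, palette = "Set 2") {
  values <- sort(unique(as.character(vector)))
  colours <- grDevices::hcl.colors(length(values), palette = palette)
  stats::setNames(as.list(colours), values)
}

colour_map <- cat_colours_sketch(c("b", "a", "c", "a"))
```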
Place significance stars on ComplexHeatmap cells
Description
This is an internal function meant to be used by the assoc_pval_heatmap function.
Usage
cell_significance_fn(data)
Arguments
data |
The matrix containing the cells to base the significance stars on. |
Value
cell_fn Another function that is well-formatted for usage as the cell_fun argument in ComplexHeatmap::Heatmap.
Convert character-type columns of a data frame to factor-type
Description
Convert character-type columns of a data frame to factor-type
Usage
char_to_fac(df)
Arguments
df |
A data frame. |
Value
The data frame with factor columns instead of char columns.
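A minimal base R sketch of this kind of conversion is shown below. The name `char_to_fac_sketch` is hypothetical and the package's internal implementation may differ.

```r
# Convert every character column of a data frame to a factor,
# leaving all other columns untouched.
char_to_fac_sketch <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) factor(col) else col
  })
  df
}

example_df <- data.frame(
  id = c("a", "b"),
  value = c(1.5, 2.5),
  stringsAsFactors = FALSE
)
converted <- char_to_fac_sketch(example_df)
```

Assigning into `df[]` keeps the result a data frame with its original column names and row names.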
Check if functions in a clustering functions list-like have valid arguments
Description
Check if functions in a clustering functions list-like have valid arguments
Usage
check_cfll_fn_args(cfll)
Value
Doesn't return any value. Raises error if the functions in cfll don't have valid arguments.
Check if items of a clustering functions list-like object are functions
Description
Check if items of a clustering functions list-like object are functions
Usage
check_cfll_fns(cfll)
Arguments
cfll |
A clust_fns_list-like |
Value
Doesn't return any value. Raises error if the items of cfll are not functions.
Check if clustering functions list-like object has named algorithms
Description
Check if clustering functions list-like object has named algorithms
Usage
check_cfll_named(cfll)
Arguments
cfll |
A clust_fns_list-like |
Value
Doesn't return any value. Raises error if there are unnamed clustering functions in cfll.
Check if names in a clustering functions list-like object are unique
Description
Check if names in a clustering functions list-like object are unique
Usage
check_cfll_unique_names(cfll)
Arguments
cfll |
A clust_fns_list-like |
Value
Doesn't return any value. Raises error if the names in cfll aren't unique.
Check if settings_df exceeds bounds of clust_fns_list
Description
Check if settings_df exceeds bounds of clust_fns_list
Usage
check_compatible_sdf_cfl(sdf, cfl)
Arguments
sdf |
A |
cfl |
A |
Value
Doesn't return any value. Raises error if sdf calls for a clustering function outside the range of cfl.
Check if settings_df exceeds bounds of dist_fns_list
Description
Check if settings_df exceeds bounds of dist_fns_list
Usage
check_compatible_sdf_dfl(sdf, dfl)
Arguments
sdf |
A |
dfl |
A |
Value
Doesn't return any value. Raises error if sdf calls for a distance function outside the range of dfl.
Check if settings_df and weights_matrix have same number of rows
Description
Check if settings_df and weights_matrix have same number of rows
Usage
check_compatible_sdf_wm(sdf, wm)
Arguments
sdf |
A |
wm |
A |
Value
Doesn't return any value. Raises error if sdf and wm don't have the same number of rows.
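Validators of this kind follow a common pattern: return nothing on success and stop with an informative message on failure. The sketch below shows that pattern for the row-count check in base R; `check_same_rows` is a hypothetical name and the error wording is invented for the example.

```r
# Raise an error if two tabular objects differ in their number of rows.
check_same_rows <- function(sdf, wm) {
  if (nrow(sdf) != nrow(wm)) {
    stop(
      "Settings data frame has ", nrow(sdf),
      " rows but weights matrix has ", nrow(wm), " rows.",
      call. = FALSE
    )
  }
  invisible(NULL)
}

sdf <- data.frame(alpha = c(0.3, 0.5, 0.8))
wm_ok <- matrix(1, nrow = 3, ncol = 2)
wm_bad <- matrix(1, nrow = 2, ncol = 2)

ok_result <- check_same_rows(sdf, wm_ok)
bad_result <- tryCatch(
  check_same_rows(sdf, wm_bad),
  error = function(e) conditionMessage(e)
)
```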
Helper function to stop annotation building when no data was provided
Description
Helper function to stop annotation building when no data was provided
Usage
check_dataless_annotations(annotation_requests, data)
Arguments
annotation_requests |
A list of requested annotations |
data |
A data frame with data to build annotations |
Value
Does not return any value. This function just raises an error when annotations are requested without any provided data for a heatmap.
Check if functions in a distance metrics list-like have valid arguments
Description
Check if functions in a distance metrics list-like have valid arguments
Usage
check_dfll_fn_args(dfll)
Arguments
dfll |
A distance metrics list-like list object to be validated. |
Value
Doesn't return any value. Raises error if the functions in dfll don't have valid arguments.
Check if functions in a distance metrics list-like have names
Description
Check if functions in a distance metrics list-like have names
Usage
check_dfll_fn_names(dfll)
Arguments
dfll |
A distance metrics list-like list object to be validated. |
Value
Doesn't return any value. Raises error if the functions in dfll don't have names.
Check if items of a distance metrics list-like object have valid names
Description
Check if items of a distance metrics list-like object have valid names
Usage
check_dfll_item_names(dfll)
Arguments
dfll |
A distance metrics list-like list object to be validated. |
Value
Doesn't return any value. Raises error if the items of dfll don't have valid formatted names.
Check if items of a distance metrics list-like object are functions
Description
Check if items of a distance metrics list-like object are functions
Usage
check_dfll_subitems_are_fns(dfll)
Arguments
dfll |
A distance metrics list-like list object to be validated. |
Value
Doesn't return any value. Raises error if the items of dfll are not functions.
Check if names in a distance metrics list-like object are unique
Description
Check if names in a distance metrics list-like object are unique
Usage
check_dfll_unique_names(dfll)
Arguments
dfll |
A distance metrics list-like list object to be validated. |
Value
Doesn't return any value. Raises error if the items of dfll aren't unique across layer 1 or within each item of layer 2.
Check if data list contains any duplicate names
Description
Check if data list contains any duplicate names
Usage
check_dll_duplicate_components(dll)
Arguments
dll |
A data list-like |
Value
Doesn't return any value. Raises error if there are components with duplicate names in a generated data list.
Check if data list contains any duplicate features
Description
Check if data list contains any duplicate features
Usage
check_dll_duplicate_features(dll)
Arguments
dll |
A data list-like |
Value
Doesn't return any value. Raises error if there are features with duplicate names in a generated data list.
Error if empty input provided during data list initialization
Description
Error if empty input provided during data list initialization
Usage
check_dll_empty_input(data_list_input)
Arguments
data_list_input |
Input data provided for data list initialization. |
Value
Raises an error if data_list_input has 0 length.
Error if data list-like list doesn't have only 4-item nested lists
Description
Error if data list-like list doesn't have only 4-item nested lists
Usage
check_dll_four_subitems(dll)
Arguments
dll |
A data list-like |
Value
Raises error if dll doesn't have only 4-item nested lists
Error if data list-like structure isn't a list
Description
Error if data list-like structure isn't a list
Usage
check_dll_inherits_list(dll)
Arguments
dll |
A data list-like |
Value
Raises error if data list-like structure isn't a list
Check if UID columns in a nested list have valid structure for a data list
Description
Check if UID columns in a nested list have valid structure for a data list
Usage
check_dll_subitem_classes(dll)
Arguments
dll |
A data list-like |
Value
Raises an error if the UID columns do not have a valid structure.
Check valid item names for a data list-like list
Description
Error if data list-like structure doesn't have nested names of "data", "name", "domain", and "type".
Usage
check_dll_subitem_names(dll)
Arguments
dll |
A data list-like |
Value
Raises error if the nested lists of dll aren't named "data", "name", "domain", and "type".
Error if data list-like structure has invalid feature types
Description
Error if data list-like structure has invalid feature types
Usage
check_dll_types(dll)
Arguments
dll |
A data list-like |
Value
Raises an error if the loaded types are not among continuous, discrete, ordinal, categorical, or mixed.
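The check described above amounts to comparing the loaded type labels against the five recognized values. A minimal base R sketch follows; the vector loaded_types is a hypothetical stand-in for the types pulled from a data list-like object, and this is not the package's own implementation.

```r
# The five feature types named in the documentation above
valid_types <- c("continuous", "discrete", "ordinal", "categorical", "mixed")

# Hypothetical type labels collected from the components of a data list-like
loaded_types <- c("continuous", "ordinal", "numeric")

# Any label outside the recognized set would trigger the error
invalid <- setdiff(loaded_types, valid_types)
invalid
```

Here "numeric" is flagged because it is not one of the recognized types.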
Check if UID columns in a nested list have valid structure for a data list
Description
Check if UID columns in a nested list have valid structure for a data list
Usage
check_dll_uid(dll)
Arguments
dll |
A data list-like |
Value
Raises an error if the UID columns do not have a valid structure.
Check for ComplexHeatmap and circlize dependencies
Description
Check for ComplexHeatmap and circlize dependencies
Usage
check_hm_dependencies()
Value
Does not return any value. This function just checks that the ComplexHeatmap and circlize packages are installed.
Check if settings data frame-like object has valid column names
Description
Check if settings data frame-like object has valid column names
Usage
check_sdfl_colnames(sdfl)
Arguments
sdfl |
A settings data frame-like matrix object to be validated. |
Value
Doesn't return any value. Raises error if the column names of sdfl are not valid for a settings data frame.
Check if settings data frame inherits class data.frame
Description
Check if settings data frame inherits class data.frame
Usage
check_sdfl_is_df(sdfl)
Arguments
sdfl |
A settings data frame-like matrix object to be validated. |
Value
Doesn't return any value. Raises error if sdfl does not inherit class data.frame.
Check if settings data frame is numeric
Description
Check if settings data frame is numeric
Usage
check_sdfl_numeric(sdfl)
Arguments
sdfl |
A settings data frame-like matrix object to be validated. |
Value
Doesn't return any value. Raises error if sdfl is not numeric.
Check validity of similarity matrices
Description
Check to see if similarity matrices in a list have the following properties:
- The maximum value in the entire matrix is 0.5
- Every value in the diagonal is 0.5
Usage
check_similarity_matrices(similarity_matrices)
Arguments
similarity_matrices |
A list of similarity matrices |
Value
valid_matrices Boolean indicating if properties are met by all similarity matrices
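The two properties listed in the description can be tested in a few lines of base R. The sketch below is illustrative only: the helper name is_valid_similarity_matrix and the numeric tolerance are assumptions, not the package's implementation.

```r
# Hypothetical helper (not part of metasnf): TRUE if the matrix has a
# maximum value of 0.5 and a value of 0.5 in every diagonal cell.
is_valid_similarity_matrix <- function(m, tol = 1e-8) {
    max_ok <- abs(max(m) - 0.5) < tol
    diag_ok <- all(abs(diag(m) - 0.5) < tol)
    max_ok && diag_ok
}

good <- matrix(c(0.5, 0.1, 0.1, 0.5), nrow = 2)
bad <- matrix(c(0.5, 0.7, 0.7, 0.5), nrow = 2)

# A list passes only if every matrix in it passes
valid_matrices <- all(
    vapply(list(good, bad), is_valid_similarity_matrix, logical(1))
)
```

A tolerance is used rather than exact equality because the matrices hold floating point values.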
Check if max K exceeds the number of observations
Description
Check if max K exceeds the number of observations
Usage
check_valid_k(sdf, dl)
Arguments
sdf |
A |
dl |
A nested list of input data from |
Value
Doesn't return any value. Raises error if max K exceeds the number of observations.
Check if SNF config has valid structure
Description
Check if SNF config has valid structure
Usage
check_valid_sc(sc)
Arguments
sc |
An |
Value
Doesn't return any value. Raises error if sc is not an snf_config class object.
Chi-squared test p-value (generic)
Description
Return p-value for chi-squared test for any two features
Usage
chi_squared_pval(cat_var1, cat_var2)
Arguments
cat_var1 |
A categorical feature. |
cat_var2 |
A categorical feature. |
Value
pval A p-value (class "numeric").
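For orientation, a chi-squared p-value for two categorical features can be obtained in base R by cross-tabulating them and calling stats::chisq.test. This is a sketch of the general technique with made-up data; whether the package function uses the same defaults (for example, continuity correction) is not stated here.

```r
# Two categorical features observed on the same eight observations
cat_var1 <- c("a", "a", "a", "a", "b", "b", "b", "b")
cat_var2 <- c("x", "x", "x", "y", "y", "y", "y", "x")

# Cross-tabulate the features, then run Pearson's chi-squared test.
# Warnings about small expected counts are suppressed for this toy example.
contingency <- table(cat_var1, cat_var2)
pval <- suppressWarnings(stats::chisq.test(contingency)$p.value)
pval
```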
Built-in clustering algorithms
Description
These functions can be used when building a metasnf clustering functions list. Each function converts a similarity matrix (matrix class object) to a cluster solution (numeric vector). Note that these functions (or custom clustering functions) cannot accept number of clusters as a parameter; this value must be built into the function itself if necessary.
Usage
spectral_eigen(similarity_matrix)
spectral_rot(similarity_matrix)
spectral_eigen_classic(similarity_matrix)
spectral_rot_classic(similarity_matrix)
spectral_two(similarity_matrix)
spectral_three(similarity_matrix)
spectral_four(similarity_matrix)
spectral_five(similarity_matrix)
spectral_six(similarity_matrix)
spectral_seven(similarity_matrix)
spectral_eight(similarity_matrix)
spectral_nine(similarity_matrix)
spectral_ten(similarity_matrix)
Arguments
similarity_matrix |
A similarity matrix. |
Details
spectral_eigen: Spectral clustering where the number of clusters is based on the eigen-gap heuristic
spectral_rot: Spectral clustering where the number of clusters is based on the rotation-cost heuristic
spectral_(C): Spectral clustering for a C-cluster solution.
Value
solution_data A vector of cluster assignments
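To illustrate the constraint that the number of clusters must be built into the function, here is a sketch of a custom function with the same shape (similarity matrix in, numeric vector out). It uses hierarchical clustering from base R rather than spectral clustering; the name hclust_three is hypothetical and the function is not provided by the package.

```r
# Hypothetical custom clustering function: average-linkage hierarchical
# clustering cut at exactly three clusters. The cluster count is hard-coded
# because the function only receives the similarity matrix.
hclust_three <- function(similarity_matrix) {
    # Convert similarities to dissimilarities
    dissimilarity <- stats::as.dist(
        max(similarity_matrix) - similarity_matrix
    )
    tree <- stats::hclust(dissimilarity, method = "average")
    solution <- stats::cutree(tree, k = 3)
    as.numeric(solution)
}

# Toy similarity matrix with three obvious groups of two observations
sim <- matrix(0.01, nrow = 6, ncol = 6)
sim[1:2, 1:2] <- 0.4
sim[3:4, 3:4] <- 0.4
sim[5:6, 5:6] <- 0.4
diag(sim) <- 0.5
solution <- hclust_three(sim)
```

A variant for a different cluster count would be a separate function with a different hard-coded k, mirroring how spectral_two through spectral_ten are provided.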
Build a clustering algorithms list
Description
This function can be used to specify custom clustering algorithms to apply to the final similarity matrices produced by each run of the batch_snf function.
Usage
clust_fns_list(clust_fns = NULL, use_default_clust_fns = FALSE)
Arguments
clust_fns |
A list of named clustering functions |
use_default_clust_fns |
If TRUE, prepend the base clustering algorithms (spectral_eigen and spectral_rot, which apply spectral clustering and use the eigen-gap and rotation cost heuristics respectively for determining the number of clusters in the graph) to clust_fns. |
Value
A list of clustering algorithm functions that can be passed into the batch_snf and generate_settings_list functions.
Examples
# Using just the base clustering algorithms --------------------------------
# This will just contain spectral_eigen and spectral_rot
cfl <- clust_fns_list(use_default_clust_fns = TRUE)
# Adding algorithms provided by the package --------------------------------
# This will contain the base clustering algorithms (spectral_eigen,
# spectral_rot) as well as two pre-defined spectral clustering functions
# that force the number of clusters to be two or five
cfl <- clust_fns_list(
    clust_fns = list(
        "two_cluster_spectral" = spectral_two,
        "five_cluster_spectral" = spectral_five
    ),
    use_default_clust_fns = TRUE
)
# Adding your own algorithms -----------------------------------------------
# This will contain the base and user-provided clustering algorithms
my_clustering_algorithm <- function(similarity_matrix) {
    # your code that converts similarity matrix to clusters here...
}
cfl <- clust_fns_list(
    clust_fns = list(
        "my_clustering_algorithm" = my_clustering_algorithm
    ),
    use_default_clust_fns = TRUE
)
# Suppress the base algorithms ---------------------------------------------
# This will contain only user-provided clustering algorithms
cfl <- clust_fns_list(
    clust_fns = list(
        "two_cluster_spectral" = spectral_two,
        "five_cluster_spectral" = spectral_five
    )
)
Density plot of co-clustering stability across subsampled data
Description
This function creates a density plot that shows, for all pairs of observations that originally clustered together, the distribution of the fraction of subsamples in which those pairs clustered together.
Usage
cocluster_density(cocluster_df)
Arguments
cocluster_df |
A data frame containing co-clustering data for a single
cluster solution. This object is generated by the |
Value
Density plot (class "gg", "ggplot") of the distribution of co-clustering across pairs and subsamples of the data.
Examples
my_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)
sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
sol_df <- batch_snf(my_dl, sc)
my_dl_subsamples <- subsample_dl(
    my_dl,
    n_subsamples = 20,
    subsample_fraction = 0.85
)
batch_subsample_results <- batch_snf_subsamples(
    my_dl_subsamples,
    sc
)
coclustering_results <- calculate_coclustering(
    batch_subsample_results,
    sol_df,
    verbose = TRUE
)
cocluster_dfs <- coclustering_results$"cocluster_dfs"
cocluster_density(cocluster_dfs[[1]])
Heatmap of observation co-clustering across resampled data
Description
Create a heatmap that shows the distribution of observation co-clustering across resampled data.
Usage
cocluster_heatmap(
cocluster_df,
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_names = FALSE,
show_column_names = FALSE,
dl = NULL,
data = NULL,
left_bar = NULL,
right_bar = NULL,
top_bar = NULL,
bottom_bar = NULL,
left_hm = NULL,
right_hm = NULL,
top_hm = NULL,
bottom_hm = NULL,
annotation_colours = NULL,
min_colour = NULL,
max_colour = NULL,
...
)
Arguments
cocluster_df |
A data frame containing co-clustering data for a single
cluster solution. This object is generated by the |
cluster_rows |
Argument passed to |
cluster_columns |
Argument passed to |
show_row_names |
Argument passed to |
show_column_names |
Argument passed to |
dl |
See ?similarity_matrix_heatmap. |
data |
See ?similarity_matrix_heatmap. |
left_bar |
See ?similarity_matrix_heatmap. |
right_bar |
See ?similarity_matrix_heatmap. |
top_bar |
See ?similarity_matrix_heatmap. |
bottom_bar |
See ?similarity_matrix_heatmap. |
left_hm |
See ?similarity_matrix_heatmap. |
right_hm |
See ?similarity_matrix_heatmap. |
top_hm |
See ?similarity_matrix_heatmap. |
bottom_hm |
See ?similarity_matrix_heatmap. |
annotation_colours |
See ?similarity_matrix_heatmap. |
min_colour |
See ?similarity_matrix_heatmap. |
max_colour |
See ?similarity_matrix_heatmap. |
... |
Arguments passed to |
Value
Heatmap (class "Heatmap" from ComplexHeatmap) object showing the distribution of observation co-clustering across resampled data.
Examples
my_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)
sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
sol_df <- batch_snf(my_dl, sc)
my_dl_subsamples <- subsample_dl(
    my_dl,
    n_subsamples = 20,
    subsample_fraction = 0.85
)
batch_subsample_results <- batch_snf_subsamples(
    my_dl_subsamples,
    sc
)
coclustering_results <- calculate_coclustering(
    batch_subsample_results,
    sol_df,
    verbose = TRUE
)
cocluster_dfs <- coclustering_results$"cocluster_dfs"
cocluster_heatmap(
    cocluster_dfs[[1]],
    dl = my_dl,
    top_hm = list(
        "Income" = "household_income",
        "Pubertal Status" = "pubertal_status"
    ),
    annotation_colours = list(
        "Pubertal Status" = colour_scale(
            c(1, 4),
            min_colour = "black",
            max_colour = "purple"
        ),
        "Income" = colour_scale(
            c(0, 4),
            min_colour = "black",
            max_colour = "red"
        )
    )
)
Co-clustering coverage check
Description
Check whether every pair of observations in the co-clustering data appeared together in at least one subsample.
Usage
coclustering_coverage_check(cocluster_df, action = "warn")
Arguments
cocluster_df |
data frame containing co-clustering data. |
action |
Control if parent function should warn or stop. |
Value
This function does not return any value. It checks a cocluster_df
for complete coverage (all pairs occur in the same solution at least once).
Will raise a warning or error if coverage is incomplete depending on the
value of the action parameter.
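The idea of complete coverage can be illustrated in base R: for every pair of observations, count the subsamples that contain both members, and flag pairs with a count of zero. The representation of subsamples as a list of UID vectors is an assumption made for this sketch and is not the structure of a cocluster_df.

```r
# Hypothetical input: the observations included in each of two subsamples
subsamples <- list(
    c("a", "b", "c"),
    c("b", "c", "d")
)
uids <- sort(unique(unlist(subsamples)))

# All unordered pairs of observations (one pair per column)
pairs <- utils::combn(uids, 2)

# For each pair, count the subsamples that contain both members
n_shared <- apply(pairs, 2, function(pair) {
    sum(vapply(subsamples, function(s) all(pair %in% s), logical(1)))
})
names(n_shared) <- apply(pairs, 2, paste, collapse = "-")

# Pairs that were never sampled together have no co-clustering information
uncovered <- names(n_shared)[n_shared == 0]
uncovered
```

In this toy case observations "a" and "d" never share a subsample, so coverage is incomplete; drawing more subsamples or a larger subsample fraction makes such gaps less likely.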
Convert a data list into a data frame
Description
Defunct function for converting a data list into a data frame. Please use as.data.frame() instead.
Usage
collapse_dl(data_list)
Arguments
data_list |
A nested list of input data from |
Value
A "data.frame"-formatted version of the provided data list.
Return a colour ramp for a given vector
Description
Given a numeric vector and min and max colour values, return a colour ramp that assigns a colour to each element in the vector. This function is a wrapper for circlize::colorRamp2.
Usage
colour_scale(data, min_colour, max_colour)
Arguments
data |
Vector of numeric values. |
min_colour |
Minimum colour value. |
max_colour |
Maximum colour value. |
Value
A "function" class object that can build a circlize-style colour ramp.
Convert unique identifiers of data list to "uid"
Description
Column name "uid" is reserved for the unique identifier of observations. This function ensures all data frames have their UID set as "uid".
Usage
convert_uids(dll, uid)
Arguments
dll |
A data list-like |
uid |
(string) The name of the UID column currently used in the data. |
Value
The provided nested list with "uid" as UID.
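The renaming step can be sketched with plain data frames in base R. The helper rename_uid is hypothetical, and the example operates on a simple list of data frames rather than a full data list-like object.

```r
# Two toy data frames that share the identifier column "patient_id"
dfs <- list(
    data.frame(patient_id = c("1", "2"), var1 = c(0.04, 0.1)),
    data.frame(patient_id = c("1", "2"), var2 = c(30, 2))
)

# Hypothetical helper: rename the identifier column to the reserved name "uid"
rename_uid <- function(df, uid) {
    colnames(df)[colnames(df) == uid] <- "uid"
    df
}

dfs <- lapply(dfs, rename_uid, uid = "patient_id")
```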
Mock ABCD cortical surface area data
Description
Like the mock data frame "abcd_cort_sa", but with "unique_id" as the "uid".
Usage
cort_sa
Format
cort_sa
A data frame with 188 rows and 152 columns:
- unique_id
The unique identifier of the ABCD dataset
- ...
Cortical surface areas of various ROIs (units presumed to be mm^2)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock ABCD cortical thickness data
Description
Like the mock data frame "abcd_cort_t", but with "unique_id" as the "uid".
Usage
cort_t
Format
cort_t
A data frame with 188 rows and 152 columns:
- unique_id
The unique identifier of the ABCD dataset
- ...
Cortical thicknesses of various ROIs (units presumed to be mm)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Build a data_list class object
Description
data_list() constructs a data list object which inherits from classes data_list and list. This object is the primary way in which features to be used along the metasnf clustering pipeline are stored. The data list is fundamentally a 2-level nested list object where each inner list contains a data frame and associated metadata for that data frame. The metadata includes the name of the data frame, the 'domain' of that data frame (the broader source of information that the input data frame is capturing, determined by the user's domain knowledge), and the type of feature stored in the data frame (continuous, discrete, ordinal, categorical, or mixed).
Usage
data_list(..., uid)
Arguments
... |
Any number of lists formatted as (df, "df_name", "df_domain", "df_type") and/or any number of lists of lists formatted as (df, "df_name", "df_domain", "df_type"). |
uid |
(character) The name of the UID column currently used in the data frames. |
Examples
heart_rate_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var1 = c(0.04, 0.1, 0.3),
    var2 = c(30, 2, 0.3)
)
personality_test_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var3 = c(900, 1990, 373),
    var4 = c(509, 2209, 83)
)
survey_response_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var5 = c(1, 3, 3),
    var6 = c(2, 3, 3)
)
city_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var7 = c("toronto", "montreal", "vancouver")
)
# Explicit loading (name each nested list element):
dl <- data_list(
    list(
        data = heart_rate_df,
        name = "heart_rate",
        domain = "clinical",
        type = "continuous"
    ),
    list(
        data = personality_test_df,
        name = "personality_test",
        domain = "surveys",
        type = "continuous"
    ),
    list(
        data = survey_response_df,
        name = "survey_response",
        domain = "surveys",
        type = "ordinal"
    ),
    list(
        data = city_df,
        name = "city",
        domain = "location",
        type = "categorical"
    ),
    uid = "patient_id"
)
# Compact loading
dl <- data_list(
    list(heart_rate_df, "heart_rate", "clinical", "continuous"),
    list(personality_test_df, "personality_test", "surveys", "continuous"),
    list(survey_response_df, "survey_response", "surveys", "ordinal"),
    list(city_df, "city", "location", "categorical"),
    uid = "patient_id"
)
# Printing data list summaries
summary(dl)
# Alternative loading: providing a single list of lists
list_of_lists <- list(
    list(heart_rate_df, "data1", "domain1", "continuous"),
    list(personality_test_df, "data2", "domain2", "continuous")
)
dl <- data_list(
    list_of_lists,
    uid = "patient_id"
)
Mock ABCD depression data
Description
Like the mock data frame "abcd_depress", but with "unique_id" as the "uid".
Usage
depress
Format
depress
A data frame with 275 rows and 2 columns:
- unique_id
The unique identifier of the ABCD dataset
- cbcl_depress_r
Ordinal value of impairment on CBCL depression, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Mock diagnosis data
Description
This is the same data as cancer_diagnosis_df, with renamed features and columns.
Usage
diagnosis_df
Format
diagnosis_df
A data frame with 200 rows and 2 columns:
- patient_id
Random three-digit number uniquely identifying the patient
- diagnosis
Mock diagnosis feature
Source
This data came from the SNFtool package, with slight modifications.
Internal function for estimate_nclust_given_graph
Description
Internal function taken from SNFtool to use for estimating the number of clusters.
Usage
discretisation(eigenvectors)
Arguments
eigenvectors |
Matrix of eigenvectors. |
Value
"Matrix" class object, intermediate product in spectral clustering.
Internal function for estimate_nclust_given_graph
Description
Internal function taken from SNFtool to use for estimating the number of clusters.
Usage
discretisation_evec_data(eigenvector)
Arguments
eigenvector |
Matrix of eigenvectors |
Value
"Matrix" class object in which the provided eigenvectors have been discretized to values of 0 or 1.
Built-in distance functions
Description
These functions can be used when building a metasnf distance functions list. Each function converts a data frame into a distance matrix.
Usage
euclidean_distance(df, weights_row)
gower_distance(df, weights_row)
sn_euclidean_distance(df, weights_row)
siw_euclidean_distance(df, weights_row)
sew_euclidean_distance(df, weights_row)
hamming_distance(df, weights_row)
Arguments
df |
Data frame containing at least 1 data column |
weights_row |
Single-row data frame where the column names contain the column names in df and the row contains the corresponding weights. |
Details
Functions that work for numeric data:
euclidean_distance: typical Euclidean distance
sn_euclidean_distance: Data frame is first standardized and normalized before typical Euclidean distance is applied
siw_euclidean_distance: Squared (including weights) Euclidean distance, where the weights are also squared
sew_euclidean_distance: Squared (excluding weights) Euclidean distance, where the weights are not also squared
Functions that work for binary data:
hamming_distance: typical Hamming distance
Functions that work for any type of data:
gower_distance: Gower distance (cluster::daisy)
Value
A matrix class object containing pairwise distances.
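The difference between the two squared, weighted variants described in the details is easiest to see with arithmetic on a single pair of observations. The base R sketch below reflects one reading of those descriptions (weights applied to the differences before squaring versus after); it uses made-up data and is not the package's code.

```r
# Two observations (rows) measured on two features
df <- data.frame(var1 = c(0, 3), var2 = c(0, 4))
w <- c(var1 = 2, var2 = 1)  # hypothetical feature weights

x <- as.matrix(df)
delta <- x[1, ] - x[2, ]  # feature-wise differences: -3 and -4

# Typical (unweighted) Euclidean distance: sqrt(9 + 16)
plain_euclidean <- sqrt(sum(delta^2))

# Weights applied before squaring, so the weights are squared too:
# (2 * 3)^2 + (1 * 4)^2
squared_incl_weights <- sum((w * delta)^2)

# Weights applied after squaring, so the weights are not squared:
# 2 * 3^2 + 1 * 4^2
squared_excl_weights <- sum(w * delta^2)
```

With these values the three quantities are 5, 52, and 34, which shows how strongly squaring the weights amplifies the influence of a heavily weighted feature.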
Build a distance metrics list
Description
The distance metrics list object (inherits classes dist_fns_list and list) is a list that stores R functions which can convert a data frame of features into a matrix of pairwise distances. The list is a nested one, where the first layer of the list can hold up to 5 items (one for each of the metasnf recognized feature types, continuous, discrete, ordinal, categorical, and mixed), and the second layer can hold an arbitrary number of distance functions for each of those types.
Usage
dist_fns_list(
cnt_dist_fns = NULL,
dsc_dist_fns = NULL,
ord_dist_fns = NULL,
cat_dist_fns = NULL,
mix_dist_fns = NULL,
automatic_standard_normalize = FALSE,
use_default_dist_fns = FALSE
)
Arguments
cnt_dist_fns |
A named list of continuous distance metric functions. |
dsc_dist_fns |
A named list of discrete distance metric functions. |
ord_dist_fns |
A named list of ordinal distance metric functions. |
cat_dist_fns |
A named list of categorical distance metric functions. |
mix_dist_fns |
A named list of mixed distance metric functions. |
automatic_standard_normalize |
If TRUE, will automatically use standard normalization prior to calculation of any numeric distances. This parameter overrides all other distance functions list-related parameters. |
use_default_dist_fns |
If TRUE, prepend the base distance metrics (Euclidean distance for continuous, discrete, and ordinal data and Gower distance for categorical and mixed data) to the resulting distance metrics list. |
Details
Call ?distance_metrics to see all distance metric functions provided in metasnf.
Value
A distance metrics list object.
Examples
# Using just the base distance metrics ------------------------------------
dfl <- dist_fns_list(use_default_dist_fns = TRUE)
# Adding your own metrics --------------------------------------------------
# This will contain only the user-provided distance function:
cubed_euclidean <- function(df, weights_row) {
    # (your code that converts a data frame to a distance matrix here...)
    weights <- diag(weights_row, nrow = length(weights_row))
    weighted_df <- as.matrix(df) %*% weights
    distance_matrix <- weighted_df |>
        stats::dist(method = "euclidean") |>
        as.matrix()
    distance_matrix <- distance_matrix^3
    return(distance_matrix)
}
dfl <- dist_fns_list(
    cnt_dist_fns = list(
        "my_cubed_euclidean" = cubed_euclidean
    )
)
# Using default base metrics------------------------------------------------
# Call ?distance_metrics to see all distance metric functions provided in
# metasnf. The code below will contain a mix of user-provided and built-in
# distance metric functions.
dfl <- dist_fns_list(
    cnt_dist_fns = list(
        "my_distance_metric" = cubed_euclidean
    ),
    dsc_dist_fns = list(
        "my_distance_metric" = cubed_euclidean
    ),
    ord_dist_fns = list(
        "my_distance_metric" = cubed_euclidean
    ),
    cat_dist_fns = list(
        "my_distance_metric" = gower_distance
    ),
    mix_dist_fns = list(
        "my_distance_metric" = gower_distance
    ),
    use_default_dist_fns = TRUE
)
Variable-level summary of a data list
Description
Defunct function to summarize a data list. Please use summary() with argument scope = "feature" instead.
Usage
dl_variable_summary(dl)
Arguments
dl |
A nested list of input data from |
Value
variable_level_summary A data frame containing the name, type, and domain of every variable in a data list.
Apply-like function for data list objects
Description
This function enables manipulating a data_list class object with lapply syntax without removing that object's data_list class attribute. The function will only preserve this attribute if the result of the apply call has a valid data list structure.
Usage
dlapply(X, FUN, ...)
Arguments
X |
A |
FUN |
The function to be applied to each data list component. |
... |
Optional arguments to |
Value
If FUN applied to each component of X yields a valid data list, a data list. Otherwise, a list.
Examples
# Convert all UID values to lowercase
dl <- data_list(
    list(abcd_income, "income", "demographics", "discrete"),
    list(abcd_colour, "colour", "likes", "categorical"),
    uid = "patient"
)
dl_lower <- dlapply(
    dl,
    function(x) {
        x$"data"$"uid" <- tolower(x$"data"$"uid")
        return(x)
    }
)
Make the "uid" column the first column of each data frame in a data list
Description
Make the "uid" column the first column of each data frame in a data list
Usage
dll_uid_first_col(dll)
Arguments
dll |
A data list-like |
Value
The object with "uid" positioned as the first column of each data frame.
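For a single data frame, moving "uid" to the front is a one-line column reordering in base R. This sketch uses a toy data frame and is not the package's implementation.

```r
# Toy data frame where "uid" is not yet the first column
df <- data.frame(var1 = c(1, 2), uid = c("a", "b"), var2 = c(3, 4))

# Put "uid" first and keep the remaining columns in their original order
df <- df[, c("uid", setdiff(colnames(df), "uid")), drop = FALSE]
colnames(df)
```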
Pull domains from a data list
Description
Pull domains from a data list
Usage
domains(dl)
Arguments
dl |
A nested list of input data from |
Value
A character vector of domains.
Function to extend dplyr to extended solutions data frame objects
Description
Function to extend dplyr to extended solutions data frame objects
Usage
dplyr_row_slice.ext_solutions_df(data, i, ...)
Arguments
data |
An extended solutions data frame. |
i |
A vector of row indices. |
... |
Additional arguments. |
Value
Row sliced object with appropriately preserved attributes.
Function to extend dplyr to solutions data frame objects
Description
Function to extend dplyr to solutions data frame objects
Usage
dplyr_row_slice.solutions_df(data, i, ...)
Arguments
data |
A solutions data frame. |
i |
A vector of row indices. |
... |
Additional arguments. |
Value
Row sliced object with appropriately preserved attributes.
Helper function to remove columns from a data frame
Description
Helper function to remove columns from a data frame
Usage
drop_cols(x, cols)
Arguments
x |
A data frame |
cols |
Vector of column names to be removed |
Value
x without columns in cols
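An equivalent operation in base R is shown below as a sketch with a made-up data frame; the package helper may differ in details such as how missing column names are handled.

```r
x <- data.frame(uid = c("a", "b"), age = c(10, 11), income = c(1, 2))
cols <- c("age", "income")

# drop = FALSE keeps the result a data frame even if one column remains
x_dropped <- x[, !(colnames(x) %in% cols), drop = FALSE]
```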
Execute inclusion
Description
Given a data list and a settings data frame row, returns a data list of selected inputs.
Usage
drop_inputs(sdf_row, dl)
Arguments
sdf_row |
Row of a settings data frame. |
dl |
A nested list of input data from |
Value
A data list (class "list") in which any component with a corresponding 0 value in the provided settings data frame row has been removed.
Ensure the data item of each component is a data.frame class object
Description
Ensure the data item of each component is a data.frame class object
Usage
ensure_dll_df(dll)
Arguments
dll |
A data list-like |
Value
The provided dll with the data item of each component as a data frame.
Manhattan plot of feature-cluster association p-values
Description
Manhattan plot of feature-cluster association p-values
Usage
esm_manhattan_plot(
esm,
neg_log_pval_thresh = 5,
threshold = NULL,
point_size = 5,
jitter_width = 0.1,
jitter_height = 0.1,
text_size = 15,
plot_title = NULL,
hide_x_labels = FALSE,
bonferroni_line = FALSE
)
Arguments
esm |
Extended solutions data frame storing associations between features
and cluster assignments. See |
neg_log_pval_thresh |
Threshold for negative log p-values. |
threshold |
P-value threshold to plot dashed line at. |
point_size |
Size of points in the plot. |
jitter_width |
Width of jitter. |
jitter_height |
Height of jitter. |
text_size |
Size of text in the plot. |
plot_title |
Title of the plot. |
hide_x_labels |
If TRUE, hides x-axis labels. |
bonferroni_line |
If TRUE, plots a dashed black line at the Bonferroni-corrected equivalent of the p-value threshold. |
Value
A Manhattan plot (class "gg", "ggplot") showing the association p-values of features against each solution in the provided solutions data frame.
Examples
esm_manhattan_plot(mock_ext_solutions_df)
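As a worked example of the Bonferroni-corrected line, the arithmetic below computes where the two dashed lines would sit. It assumes that the y-axis is -log10(p) and that the correction divides the threshold by the number of tests; both the axis scale and the value of n_tests are assumptions for illustration.

```r
threshold <- 0.05  # uncorrected p-value threshold
n_tests <- 20      # hypothetical number of features tested

# Heights of the dashed lines on a -log10(p) axis
uncorrected_height <- -log10(threshold)
bonferroni_height <- -log10(threshold / n_tests)
```

The corrected line always sits higher on the plot, since a smaller p-value is required to cross it.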
Estimate number of clusters for a similarity matrix
Description
Calculate eigengap and rotation-cost estimates of the number of clusters to use when clustering a similarity matrix. This function was adapted from SNFtool::estimateNumberOfClustersGivenGraph, but scales up the Laplacian operator prior to eigenvalue calculations to minimize the risk of floating point-related errors.
Usage
estimate_nclust_given_graph(W, NUMC = 2:10)
Arguments
W |
Similarity matrix to calculate number of clusters for. |
NUMC |
Range of cluster counts to consider when picking the best number of clusters. |
Value
A list containing the top two eigengap and rotation-cost estimates for the number of clusters in a given similarity matrix.
Examples
input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)
sc <- snf_config(input_dl, n_solutions = 1)
sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)
sim_mat <- sim_mats_list(sol_df)[[1]]
estimate_nclust_given_graph(sim_mat)
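The eigengap heuristic itself can be sketched in base R: build a normalized graph Laplacian, sort its eigenvalues, and pick the candidate k followed by the largest jump. The function below is a simplified illustration under those assumptions. It returns only a single eigengap estimate, omits the rotation-cost estimate and the Laplacian scaling mentioned above, and the name eigengap_nclust is hypothetical.

```r
# Simplified eigengap heuristic (illustrative; not the package's code)
eigengap_nclust <- function(W, NUMC = 2:10) {
    degree <- rowSums(W)
    d_inv_sqrt <- diag(1 / sqrt(degree))
    # Symmetric normalized Laplacian: I - D^(-1/2) W D^(-1/2)
    laplacian <- diag(nrow(W)) - d_inv_sqrt %*% W %*% d_inv_sqrt
    # Eigenvalues in ascending order
    evals <- sort(
        eigen(laplacian, symmetric = TRUE, only.values = TRUE)$values
    )
    # Gap that follows the k-th smallest eigenvalue, for each candidate k
    gaps <- evals[NUMC + 1] - evals[NUMC]
    NUMC[which.max(gaps)]
}

# Toy similarity matrix: three blocks of four mutually similar observations
W <- matrix(0.01, nrow = 12, ncol = 12)
for (block in list(1:4, 5:8, 9:12)) {
    W[block, block] <- 1
}
n_clusters <- eigengap_nclust(W)
```

Note that the matrix needs more rows than max(NUMC), since the gap after the largest candidate k requires one further eigenvalue.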
Modification of SNFtool mock data frame "Data1"
Description
Modification of SNFtool mock data frame "Data1"
Usage
expression_df
Format
expression_df
A data frame with 200 rows and 3 columns:
- gene_1_expression
Mock gene expression feature
- gene_2_expression
Mock gene expression feature
- patient_id
Random three-digit number uniquely identifying the patient
Source
This data came from the SNFtool package, with slight modifications.
Constructor for ext_solutions_df
class object
Description
The extended solutions data frame is a column-extended variation of the
solutions data frame. It contains association p-values relating cluster
membership to feature distribution for all solutions in a solutions data
frame and all features in a provided data list (or data lists). If a
target data list was used during the call to extend_solutions, the
extended solutions data frame will also have columns "min_pval",
"mean_pval", and "max_pval" summarizing the p-values of just those features
that were a part of the target list.
Usage
ext_solutions_df(ext_sol_dfl, sol_df, fts, target_dl)
Arguments
ext_sol_dfl |
An extended solutions data frame-like object. |
sol_df |
Result of |
fts |
A vector of all features that have association p-values stored in the resulting extended solutions data frame. |
target_dl |
A data list with features to calculate p-values for. Features in the target list will be included during p-value summary measure calculations. |
Value
An ext_solutions_df class object.
Extend a solutions data frame to include outcome evaluations
Description
Extend a solutions data frame to include outcome evaluations
Usage
extend_solutions(
sol_df,
target_dl = NULL,
dl = NULL,
cat_test = "chi_squared",
min_pval = 1e-10,
processes = 1,
verbose = FALSE
)
Arguments
sol_df |
Result of |
target_dl |
A data list with features to calculate p-values for. Features in the target list will be included during p-value summary measure calculations. |
dl |
A data list with features to calculate p-values for, but that should not be incorporated into p-value summary measure columns (i.e., min/mean/max p-value columns). |
cat_test |
String indicating which statistical test will be used to associate cluster with a categorical feature. Options are "chi_squared" for the Chi-squared test and "fisher_exact" for Fisher's exact test. |
min_pval |
If assigned a value, any p-value less than this will be replaced with this value. |
processes |
The number of processes to use for parallelization. Progress is only reported for sequential processing (processes = 1). |
verbose |
If TRUE, output progress to console. |
Value
An extended solutions data frame (ext_sol_df class object) that contains p-value columns for each outcome in the provided data lists.
Examples
## Not run:
input_dl <- data_list(
list(gender_df, "gender", "demographics", "categorical"),
list(diagnosis_df, "diagnosis", "clinical", "categorical"),
uid = "patient_id"
)
sc <- snf_config(input_dl, n_solutions = 2)
sol_df <- batch_snf(input_dl, sc)
ext_sol_df <- extend_solutions(sol_df, input_dl)
## End(Not run)
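The role of the cat_test and min_pval arguments can be sketched in base R without the package. The snippet below is an illustration under assumptions, not the internals of extend_solutions: the toy data and all variable names are made up, and stats::chisq.test stands in for the "chi_squared" option.

```r
# Toy cluster labels and a categorical feature that differs strongly by cluster
clusters <- rep(c(1, 2), each = 50)
feature <- c(rep("a", 45), rep("b", 5), rep("a", 5), rep("b", 45))

# Chi-squared association p-value between cluster membership and the feature
raw_pval <- stats::chisq.test(table(clusters, feature))$p.value

# Any p-value below the floor is replaced with the floor itself
min_pval <- 1e-10
floored_pval <- max(raw_pval, min_pval)
floored_pval
```

Flooring keeps extremely small p-values from dominating summary columns and negative-log plots.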
Mock ABCD "colour" data
Description
Like the mock data frame "abcd_colour", but with "unique_id" as the "uid".
Usage
fav_colour
Format
fav_colour
A data frame with 275 rows and 2 columns:
- unique_id
The unique identifier of the ABCD dataset
- colour
Categorical transformation of cbcl_depress.
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Return character vector of features stored in an object
Description
Return character vector of features stored in an object
Usage
features(x)
Arguments
x |
The object to pull features from. |
Value
A character vector of features in x.
Fisher exact test p-value
Description
Return p-value for Fisher exact test for any two features
Usage
fisher_exact_pval(cat_var1, cat_var2)
Arguments
cat_var1 |
A categorical feature. |
cat_var2 |
A categorical feature. |
Value
pval A p-value (class "numeric").
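A minimal sketch of how such a p-value can be obtained in base R is shown below. The helper name `fisher_pval_sketch` is hypothetical, and cross-tabulating the two features before calling stats::fisher.test is an assumption rather than the package's confirmed implementation.

```r
# Hypothetical helper: Fisher's exact test on the cross-tabulation of two
# categorical vectors of equal length.
fisher_pval_sketch <- function(cat_var1, cat_var2) {
    stats::fisher.test(table(cat_var1, cat_var2))$p.value
}

# Toy 2 x 2 example: counts are 3/1 in one group and 1/3 in the other
truth <- c(rep("milk", 4), rep("tea", 4))
guess <- c("milk", "milk", "milk", "tea", "milk", "tea", "tea", "tea")
fisher_p <- fisher_pval_sketch(truth, guess)
fisher_p
```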
Mock gender data
Description
Mock gender data
Usage
gender_df
Format
gender_df
A data frame with 200 rows and 2 columns:
- patient_id
Random three-digit number uniquely identifying the patient
- gender_df
Mock gender feature
Source
This data came from the SNFtool package, with slight modifications.
Generate annotations list
Description
Intermediate function that takes in formatted lists of features and the annotations they should be viewed through and returns annotation objects usable by ComplexHeatmap::Heatmap.
Usage
generate_annotations_list(
df,
left_bar = NULL,
right_bar = NULL,
top_bar = NULL,
bottom_bar = NULL,
left_hm = NULL,
right_hm = NULL,
top_hm = NULL,
bottom_hm = NULL,
show_legend = TRUE,
annotation_colours = NULL
)
Arguments
df |
Data frame containing all the data specified in the remaining arguments. |
left_bar |
Named list of strings, where the strings are features in df that should be used for a barplot annotation on the left of the plot and the names are the names that will be used to caption the plots and their legends. |
right_bar |
See left_bar. |
top_bar |
See left_bar. |
bottom_bar |
See left_bar. |
left_hm |
Like left_bar, but with a heatmap annotation instead of a barplot annotation. |
right_hm |
See left_hm. |
top_hm |
See left_hm. |
bottom_hm |
See left_hm. |
show_legend |
Add legends to the annotations. |
annotation_colours |
Named list of heatmap annotations and their colours. |
Value
annotations_list A named list of all the annotations.
Generate a clustering algorithms list
Description
Deprecated function for building a clustering algorithms list. Please use
clust_fns_list() (or better yet, snf_config()) instead.
Usage
generate_clust_algs_list(..., disable_base = FALSE)
Arguments
... |
An arbitrary number of named clustering functions |
disable_base |
If TRUE, do not prepend the base clustering algorithms (spectral_eigen and spectral_rot, which apply spectral clustering and use the eigen-gap and rotation cost heuristics respectively for determining the number of clusters in the graph). |
Value
A list of clustering algorithm functions that can be passed into the batch_snf and generate_settings_list functions.
Generate a list of distance metrics
Description
Deprecated function for building a distance metrics list. Please use
dist_fns_list() (or better yet, snf_config()) instead.
Usage
generate_distance_metrics_list(
continuous_distances = NULL,
discrete_distances = NULL,
ordinal_distances = NULL,
categorical_distances = NULL,
mixed_distances = NULL,
keep_defaults = TRUE
)
Arguments
continuous_distances |
A named list of distance metric functions |
discrete_distances |
A named list of distance metric functions |
ordinal_distances |
A named list of distance metric functions |
categorical_distances |
A named list of distance metric functions |
mixed_distances |
A named list of distance metric functions |
keep_defaults |
If TRUE (default), prepend the base distance metrics (euclidean and standard normalized euclidean) |
Value
A nested and named list of distance metrics functions.
Build a settings data frame
Description
Deprecated function for building a settings matrix. Please use
settings_df() instead.
Usage
generate_settings_matrix(...)
Arguments
... |
Arguments used to generate a settings matrix. |
Value
Raises a deprecated error.
Extract cluster membership information from one solutions data frame row
Description
Deprecated function for extracting cluster solutions from a
solutions data frame. Please use t() instead.
This function takes in a single row of a solutions data frame and returns a
data frame containing the cluster assignments for each uid. It is
similar to get_clusters(), which takes one solutions data frame row and
returns a vector of cluster assignments, and get_cluster_solutions(),
which takes a solutions data frame with any number of rows and returns a
data frame indicating the cluster assignments for each of those rows.
Usage
get_cluster_df(sol_df_row)
Arguments
sol_df_row |
One row from a solutions data frame. |
Value
cluster_df data frame of cluster and uid.
Extract cluster membership information from a sol_df
Description
Deprecated function for extracting cluster solutions from a
solutions data frame. Please use t() instead.
This function takes in a solutions data frame and returns a data frame containing
the cluster assignments for each uid. It is similar to
get_clusters(), which takes one solutions data frame row and returns a vector
of cluster assignments, and get_cluster_df(), which takes a solutions
data frame with only one row and returns a data frame with two columns: "cluster"
and "uid" (the UID of the observation).
Usage
get_cluster_solutions(sol_df)
Arguments
sol_df |
A sol_df. |
Value
A "data.frame" object where each row is an observation and each column (apart from the uid column) indicates the cluster that observation was assigned to for the corresponding solutions data frame row.
Extract cluster membership vector from one solutions data frame row
Description
Deprecated function for extracting cluster solutions from a
solutions data frame. Please use t() instead.
This function takes in a single row of a solutions data frame and returns a
vector containing the cluster assignments for each observation. It is
similar to get_cluster_df(), which takes a solutions data frame with only one
row and returns a data frame with two columns: "cluster" and "uid"
(the UID of the observation), and get_cluster_solutions(), which takes a
solutions data frame with any number of rows and returns a data frame indicating
the cluster assignments for each of those rows.
Usage
get_clusters(sol_df_row)
Arguments
sol_df_row |
One row from a solutions data frame. |
Value
clusters Vector of assigned clusters.
Pull complete-data UIDs from a list of data frames
Description
This function identifies all observations within a list of data frames that
have no missing data across all data frames. This function is useful when
constructing data lists of distinct feature sets from the same sample of
observations. As data_list() strips away observations with any missing
data, distinct sets of observations may be generated by building a data
list from the same group of observations over different sets of features.
Reducing the pool of observations to only those with complete UIDs first
will avoid downstream generation of data lists of differing sizes.
Usage
get_complete_uids(list_of_dfs, uid)
Arguments
list_of_dfs |
List of data frames. |
uid |
Name of column across data frames containing UIDs |
Value
A character vector of the UIDs of observations that have complete data across the provided list of data frames.
Examples
complete_uids <- get_complete_uids(
list(income, pubertal, anxiety, depress),
uid = "unique_id"
)
income <- income[income$"unique_id" %in% complete_uids, ]
pubertal <- pubertal[pubertal$"unique_id" %in% complete_uids, ]
anxiety <- anxiety[anxiety$"unique_id" %in% complete_uids, ]
depress <- depress[depress$"unique_id" %in% complete_uids, ]
input_dl <- data_list(
list(income, "income", "demographics", "ordinal"),
list(pubertal, "pubertal", "demographics", "continuous"),
uid = "unique_id"
)
target_dl <- data_list(
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
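The idea behind get_complete_uids can be sketched in base R as an intersection of complete-case UIDs. This is an illustration only: `complete_uids_sketch` and the toy data frames are hypothetical, and the package's own handling of UID types or prefixes may differ.

```r
# Hypothetical helper: keep UIDs whose rows have no missing values in any
# of the supplied data frames.
complete_uids_sketch <- function(list_of_dfs, uid) {
    per_df <- lapply(list_of_dfs, function(df) {
        as.character(df[stats::complete.cases(df), uid])
    })
    # A UID must be complete (and present) in every data frame
    Reduce(intersect, per_df)
}

df_a <- data.frame(unique_id = c("u1", "u2", "u3"), x = c(1, NA, 3))
df_b <- data.frame(unique_id = c("u1", "u2", "u3"), y = c(NA, 2, 3))
kept_uids <- complete_uids_sketch(list(df_a, df_b), uid = "unique_id")
kept_uids
```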
Calculate distance matrices
Description
Given a data frame of numerical features, return a euclidean distance matrix.
Usage
get_dist_matrix(
df,
input_type,
cnt_dist_fn,
dsc_dist_fn,
ord_dist_fn,
cat_dist_fn,
mix_dist_fn,
weights_row
)
Arguments
df |
Raw data frame with subject IDs in column "uid" |
input_type |
Either "numeric" (resulting in euclidean distances), "categorical" (resulting in binary distances), or "mixed" (resulting in gower distances) |
cnt_dist_fn |
distance metric function for continuous data |
dsc_dist_fn |
distance metric function for discrete data |
ord_dist_fn |
distance metric function for ordinal data |
cat_dist_fn |
distance metric function for categorical data |
mix_dist_fn |
distance metric function for mixed data |
weights_row |
Single-row data frame where the column names contain the column names in df and the row contains the corresponding weights. |
Value
dist_matrix Matrix of inter-observation distances.
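For the simplest case described above (numeric features, Euclidean distance), a base R sketch could look like the following. The helper name `dist_matrix_sketch` is hypothetical, and the sketch ignores the input_type dispatch, the custom distance functions, and the feature weights that the real function accepts.

```r
# Hypothetical helper: Euclidean distance matrix for a data frame whose
# observation IDs are stored in a column named "uid".
dist_matrix_sketch <- function(df) {
    feats <- df[, colnames(df) != "uid", drop = FALSE]
    m <- as.matrix(stats::dist(feats, method = "euclidean"))
    dimnames(m) <- list(df$uid, df$uid)
    m
}

toy_df <- data.frame(uid = c("a", "b"), x = c(0, 3), y = c(0, 4))
toy_dist <- dist_matrix_sketch(toy_df)
toy_dist
```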
Extract UIDs from a data list
Description
Deprecated function for extracting UIDs from a data list.
Please use
uids()
instead.
Usage
get_dl_uids(dl, prefix = FALSE)
Arguments
dl |
A nested list of input data from |
prefix |
If TRUE, preserves the "uid_" prefix added to UIDs when creating a data list. |
Value
A character vector of the UID labels contained in a data list.
Return the row or column ordering present in a heatmap
Description
Return the row or column ordering present in a heatmap
Usage
get_heatmap_order(heatmap, type = "rows")
Arguments
heatmap |
A heatmap object to collect ordering from. |
type |
The type of ordering to return. Either "rows" or "columns". |
Value
A numeric vector of the ordering used within the provided ComplexHeatmap "Heatmap" object.
Return the hierarchical clustering order of a matrix
Description
Return the hierarchical clustering order of a matrix
Usage
get_matrix_order(matrix, dist_method = "euclidean", hclust_method = "complete")
Arguments
matrix |
Matrix to cluster. |
dist_method |
Distance method to use when calculating the sorting order of the matrix. Argument is directly passed into stats::dist. Options include "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski". |
hclust_method |
Agglomerative method to use when calculating sorting
order by |
Value
A numeric vector of the ordering derived by the specified hierarchical clustering method applied to the provided matrix.
Examples
dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
sc <- snf_config(
dl = dl,
n_solutions = 20,
min_k = 20,
max_k = 50
)
sol_df <- batch_snf(dl, sc)
ext_sol_df <- extend_solutions(
sol_df,
dl = dl,
min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
)
# Calculate pairwise similarities between cluster solutions
sol_aris <- calc_aris(sol_df)
# Extract hierarchical clustering order of the cluster solutions
meta_cluster_order <- get_matrix_order(sol_aris)
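Since dist_method is documented as being passed to stats::dist, the ordering step can be approximated in base R as below. This is a sketch under the assumption that the ordering is the `order` element of a stats::hclust fit; `matrix_order_sketch` is a hypothetical name.

```r
# Hypothetical helper: hierarchical clustering order of the rows of a matrix
matrix_order_sketch <- function(m,
                                dist_method = "euclidean",
                                hclust_method = "complete") {
    d <- stats::dist(m, method = dist_method)
    stats::hclust(d, method = hclust_method)$order
}

# Rows 1 and 3 are near each other, as are rows 2 and 4
m <- rbind(c(0, 0), c(10, 10), c(0.1, 0), c(10, 10.1))
ord <- matrix_order_sketch(m)
ord
```

Rows that merge early in the dendrogram end up next to each other in the returned ordering, which is what makes the reordered ARI heatmap show block structure.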
Get mean p-value
Description
Given a solutions data frame row containing evaluated p-values, returns the mean p-value.
Usage
get_mean_pval(sol_df_row)
Arguments
sol_df_row |
row of sol_df object |
Value
mean_pval mean p-value
Get minimum p-value
Description
Given a solutions data frame row containing evaluated p-values, returns the minimum p-value.
Usage
get_min_pval(sol_df_row)
Arguments
sol_df_row |
row of sol_df object |
Value
min_pval minimum p-value
Get p-values from an extended solutions data frame
Description
This function can be used to neatly format the p-values associated with an extended solutions data frame. It can also calculate the negative logs of those p-values to make it easier to interpret large-scale differences.
Usage
get_pvals(ext_sol_df, negative_log = FALSE, keep_summaries = TRUE)
Arguments
ext_sol_df |
The output of |
negative_log |
If TRUE, will replace p-values with negative log p-values. |
keep_summaries |
If FALSE, will remove the mean, min, and max p-value. |
Value
A "data.frame" class object containing only the p-value related columns of the provided ext_sol_df.
Extract representative solutions from a matrix of ARIs
Description
Following clustering with batch_snf, a matrix of pairwise ARIs that show
how related each cluster solution is to each other can be generated by the
calc_aris function. Partitioning of the ARI matrix can be done by
visual inspection of meta_cluster_heatmap() results or by
shiny_annotator. Given the indices of meta cluster boundaries, this
function will return a single representative solution from each meta cluster
based on maximum average ARI to all other solutions within that meta
cluster.
Usage
get_representative_solutions(aris, sol_df, filter_fn = NULL)
Arguments
aris |
Matrix of adjusted rand indices from |
sol_df |
Output of |
filter_fn |
Optional function to filter the meta-cluster by prior to
maximum average ARI determination. This can be useful if you are explicitly
trying to select a solution that meets a certain condition, such as only
picking from the 4 cluster solutions within a meta cluster. An example
valid function could be |
Value
The provided solutions data frame reduced to just one row per meta cluster defined by the split vector.
Examples
dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
sc <- snf_config(
dl = dl,
n_solutions = 20,
min_k = 20,
max_k = 50
)
sol_df <- batch_snf(dl, sc)
ext_sol_df <- extend_solutions(
sol_df,
dl = dl,
min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
)
# Calculate pairwise similarities between cluster solutions
sol_aris <- calc_aris(sol_df)
# Extract hierarchical clustering order of the cluster solutions
meta_cluster_order <- get_matrix_order(sol_aris)
# Identify meta cluster boundaries with shiny app or trial and error
# ari_hm <- meta_cluster_heatmap(sol_aris, order = meta_cluster_order)
# shiny_annotator(ari_hm)
# Result of meta cluster examination
split_vec <- c(2, 5, 12, 17)
ext_sol_df <- label_meta_clusters(ext_sol_df, split_vec, meta_cluster_order)
# Extracting representative solutions from each defined meta cluster
rep_solutions <- get_representative_solutions(sol_aris, ext_sol_df)
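The selection rule (maximum average ARI to the other solutions in the same meta cluster) can be sketched on a plain matrix. The helper `pick_representatives_sketch` and the toy ARI values are hypothetical, and excluding each solution's ARI with itself from the average is an assumption.

```r
# Hypothetical helper: for each meta cluster label, return the index of the
# solution with the highest mean ARI to the other members of that cluster.
pick_representatives_sketch <- function(aris, labels) {
    sapply(split(seq_along(labels), labels), function(idx) {
        if (length(idx) == 1) {
            return(idx)
        }
        sub <- aris[idx, idx, drop = FALSE]
        diag(sub) <- NA  # ignore self-similarity
        idx[which.max(rowMeans(sub, na.rm = TRUE))]
    })
}

# Toy ARI matrix for four solutions; the first three form meta cluster "A"
aris <- diag(4)
aris[1, 2] <- aris[2, 1] <- 0.9
aris[1, 3] <- aris[3, 1] <- 0.5
aris[2, 3] <- aris[3, 2] <- 0.8
labels <- c("A", "A", "A", "B")
reps <- pick_representatives_sketch(aris, labels)
reps
```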
Helper function to drop columns from a data frame by grepl search
Description
Helper function to drop columns from a data frame by grepl search
Usage
gexclude(x, pattern)
Arguments
x |
Data frame to drop columns from. |
pattern |
Pattern used to match columns to drop. |
Value
x without columns matching pattern.
Helper function to pick columns from a data frame by grepl
search
Description
Helper function to pick columns from a data frame by grepl
search
Usage
gselect(x, pattern)
Arguments
x |
Data frame to select columns from. |
pattern |
Pattern used to match columns to select. |
Value
x with only columns matching pattern.
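Both grepl-based helpers (gexclude and gselect) amount to logical column indexing. A base R sketch with hypothetical function names and a made-up data frame:

```r
# Hypothetical equivalents of the two helpers described above
gselect_sketch <- function(x, pattern) {
    x[, grepl(pattern, colnames(x)), drop = FALSE]
}
gexclude_sketch <- function(x, pattern) {
    x[, !grepl(pattern, colnames(x)), drop = FALSE]
}

toy <- data.frame(
    uid = 1:2,
    a_pval = c(0.1, 0.2),
    b_pval = c(0.3, 0.4),
    cluster = c(1, 2)
)
pval_cols <- colnames(gselect_sketch(toy, "_pval$"))
other_cols <- colnames(gexclude_sketch(toy, "_pval$"))
```

Using `drop = FALSE` keeps the result a data frame even when only one column matches.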
Mock ABCD income data
Description
Like the mock data frame "abcd_h_income", but with "unique_id" as the "uid".
Usage
income
Format
income
A data frame with 300 rows and 2 columns:
- unique_id
The unique identifier of the ABCD dataset
- household_income
Household income in 3 category levels (low = 1, medium = 2, high = 3)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Test if the object is a data list
Description
Given an object, returns TRUE if that object inherits from the data_list class.
Usage
is_data_list(x)
Arguments
x |
An object. |
Value
TRUE if the object inherits from the data_list class.
Jitter plot separating a feature by cluster
Description
Jitter plot separating a feature by cluster
Usage
jitter_plot(df, feature)
Arguments
df |
A data.frame containing cluster column and the feature to plot. |
feature |
The feature to plot. |
Value
A jitter+violin plot (class "gg", "ggplot") showing the distribution of a feature across clusters.
Assign meta cluster labels to rows of a solutions data frame or extended solutions data frame
Description
Given a solutions data frame or extended solutions data frame class object and a numeric vector indicating which rows correspond to which meta clusters, assigns meta clustering information to the "meta_clusters" attribute of the data frame.
Usage
label_meta_clusters(sol_df, split_vector, order = NULL)
Arguments
sol_df |
A solutions data frame or extended solutions data frame to assign meta clusters to. |
split_vector |
A numeric vector indicating which rows of sol_df should be the split points for meta cluster labeling. |
order |
An optional numeric vector indicating how the solutions data
frame should be reordered prior to meta cluster labeling. This vector can
be obtained by running |
Value
A solutions data frame with a populated "meta_clusters" attribute.
Examples
dl <- data_list(
list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
set.seed(42)
my_sc <- snf_config(
dl = dl,
n_solutions = 20,
min_k = 20,
max_k = 50
)
sol_df <- batch_snf(dl, my_sc)
sol_df
sol_aris <- calc_aris(sol_df)
meta_cluster_order <- get_matrix_order(sol_aris)
# `split_vec` found by iteratively plotting ari_hm or by ?shiny_annotator()
split_vec <- c(6, 10, 16)
ari_hm <- meta_cluster_heatmap(
sol_aris,
order = meta_cluster_order,
split_vector = split_vec
)
mc_sol_df <- label_meta_clusters(
sol_df,
order = meta_cluster_order,
split_vector = split_vec
)
mc_sol_df
Label propagation
Description
Given a full fused network (one containing both pre-clustered observations and to-be-clustered observations) and the clusters of the pre-clustered observations, return a label propagated list of clusters for all observations. This function is derived from SNFtool::groupPredict. Modifications are made to take a full fused network as input, rather than taking input data frames and running SNF internally. This ensures that alternative approaches to data normalization and distance matrix calculations can be chosen by the user.
Usage
label_prop(full_fused_network, clusters)
Arguments
full_fused_network |
A network made by running SNF on training and test observations together. |
clusters |
A vector of assigned clusters for training observations in matching order as they appear in full_fused_network. |
Value
A list of cluster labels for all observations.
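A generic label propagation loop on a fused network can be sketched in base R as follows. This is not the package's or SNFtool's code: `label_prop_sketch` is a hypothetical helper, it assumes the labelled (training) observations occupy the first rows of the network, and it returns a plain vector rather than a list.

```r
# Hypothetical helper: propagate known cluster labels through a similarity
# network by repeatedly averaging neighbours' label distributions.
label_prop_sketch <- function(W, clusters, n_iter = 1000) {
    n <- nrow(W)
    n_lab <- length(clusters)          # labelled observations come first
    lvls <- sort(unique(clusters))
    # One-hot label matrix; unlabelled rows start at zero
    Y0 <- matrix(0, n, length(lvls))
    Y0[cbind(seq_len(n_lab), match(clusters, lvls))] <- 1
    P <- W / rowSums(W)                # row-normalized transition matrix
    Y <- Y0
    for (i in seq_len(n_iter)) {
        Y <- P %*% Y
        # Clamp the labelled rows back to their known labels
        Y[seq_len(n_lab), ] <- Y0[seq_len(n_lab), ]
    }
    lvls[max.col(Y, ties.method = "first")]
}

# Observations 1-4 are labelled; 5 resembles 1 and 2, 6 resembles 3 and 4
W <- matrix(0.01, 6, 6)
diag(W) <- 1
W[5, 1:2] <- W[1:2, 5] <- 0.9
W[6, 3:4] <- W[3:4, 6] <- 0.9
propagated <- label_prop_sketch(W, clusters = c(1, 1, 2, 2), n_iter = 100)
propagated
```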
Label propagate cluster solutions to non-clustered observations
Description
Given a solutions data frame containing clustered observations and a data list containing those clustered observations as well as additional to-be-clustered observations, this function will re-run SNF to generate a similarity matrix of all observations and use the label propagation algorithm to assign predicted clusters to the non-clustered observations.
Usage
label_propagate(partial_sol_df, full_dl, verbose = FALSE)
Arguments
partial_sol_df |
A solutions data frame derived from the training set. |
full_dl |
A data list containing observations from both the training and testing sets. |
verbose |
If TRUE, output progress to console. |
Value
A data frame with one row per observation containing a column for UIDs, a column for whether the observation was in the train (original) or test (held out) set, and one column per row of the solutions data frame indicating the original and propagated clusters.
Examples
# Function to identify observations with complete data
uids_with_complete_obs <- get_complete_uids(
list(subc_v, income, pubertal, anxiety, depress),
uid = "unique_id"
)
# Data frame assigning 80% of observations to train and 20% to test
train_test_split <- train_test_assign(
train_frac = 0.8,
uids = uids_with_complete_obs
)
# Pulling the training and testing observations specifically
train_obs <- train_test_split$"train"
test_obs <- train_test_split$"test"
# Partition a training set
train_subc_v <- subc_v[subc_v$"unique_id" %in% train_obs, ]
train_income <- income[income$"unique_id" %in% train_obs, ]
train_pubertal <- pubertal[pubertal$"unique_id" %in% train_obs, ]
train_anxiety <- anxiety[anxiety$"unique_id" %in% train_obs, ]
train_depress <- depress[depress$"unique_id" %in% train_obs, ]
# Partition a test set
test_subc_v <- subc_v[subc_v$"unique_id" %in% test_obs, ]
test_income <- income[income$"unique_id" %in% test_obs, ]
test_pubertal <- pubertal[pubertal$"unique_id" %in% test_obs, ]
test_anxiety <- anxiety[anxiety$"unique_id" %in% test_obs, ]
test_depress <- depress[depress$"unique_id" %in% test_obs, ]
# Find cluster solutions in the training set
train_dl <- data_list(
list(train_subc_v, "subc_v", "neuroimaging", "continuous"),
list(train_income, "household_income", "demographics", "continuous"),
list(train_pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
# We'll pick a solution that has good separation over our target features
train_target_dl <- data_list(
list(train_anxiety, "anxiety", "behaviour", "ordinal"),
list(train_depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
sc <- snf_config(
train_dl,
n_solutions = 5,
min_k = 10,
max_k = 30
)
train_sol_df <- batch_snf(
train_dl,
sc,
return_sim_mats = TRUE
)
ext_sol_df <- extend_solutions(
train_sol_df,
train_target_dl
)
# Determining solution with the lowest minimum p-value
lowest_min_pval <- min(ext_sol_df$"min_pval")
# Take the first matching row in case of ties
top_idx <- which(ext_sol_df$"min_pval" == lowest_min_pval)[1]
top_row <- ext_sol_df[top_idx, ]
# Propagate that solution to the observations in the test set
# data list below has both training and testing observations
full_dl <- data_list(
list(subc_v, "subc_v", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
# Use the solutions data frame from the training observations and the data list
# from the training and testing observations to propagate labels to the test observations
propagated_labels <- label_propagate(top_row, full_dl)
propagated_labels_all <- label_propagate(ext_sol_df, full_dl)
head(propagated_labels_all)
tail(propagated_labels_all)
Convert a vector of partition indices into meta cluster labels
Description
Convert a vector of partition indices into meta cluster labels
Usage
label_splits(split_vector, nrow)
Arguments
split_vector |
A vector of partition indices. |
nrow |
The number of rows in the data being partitioned. |
Value
A character vector that expands the split_vector into an nrow-length sequence of ascending letters of the alphabet. If the split vector is c(3, 6) and the number of rows is 8, the result will be a vector of two "A"s (up to the first index, 3), three "B"s (up to the second index, 6), and three "C"s (up to and including the last index, 8).
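The worked example in the Value description can be reproduced with a one-line base R sketch. The helper name `label_splits_sketch` is hypothetical and the use of findInterval is an assumption about one way to get the described behaviour, not the package's code.

```r
# Hypothetical helper: rows before the first split index get "A", rows from
# the first split index up to (but excluding) the second get "B", and so on.
label_splits_sketch <- function(split_vector, nrow) {
    LETTERS[findInterval(seq_len(nrow), split_vector) + 1]
}

split_labels <- label_splits_sketch(c(3, 6), nrow = 8)
split_labels
```

Because the sketch indexes into `LETTERS`, it only covers up to 26 meta clusters.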
Linearly correct data list by features with unwanted signal
Description
Given a data list to correct and another data list of categorical features to linearly adjust for, corrects the first data list based on the residuals of the linear model relating the numeric features in the first data list to the unwanted signal features in the second data list.
Usage
linear_adjust(dl, unwanted_signal_list, sig_digs = NULL)
Arguments
dl |
A nested list of input data from |
unwanted_signal_list |
A data list of categorical features that should have their mean differences removed in the first data list. |
sig_digs |
Number of significant digits to round the residuals to. |
Value
A data list ("list") in which each data component has been converted to contain residuals off of the linear model built against the features in the unwanted_signal_list.
Examples
has_tutor <- sample(c(1, 0), size = 9, replace = TRUE)
math_score <- 70 + 30 * has_tutor + rnorm(9, mean = 0, sd = 5)
math_df <- data.frame(uid = paste0("id_", 1:9), math = math_score)
tutor_df <- data.frame(uid = paste0("id_", 1:9), tutor = has_tutor)
dl <- data_list(
list(math_df, "math_score", "school", "continuous"),
uid = "uid"
)
adjustment_dl <- data_list(
list(tutor_df, "tutoring", "school", "categorical"),
uid = "uid"
)
adjusted_dl <- linear_adjust(dl, adjustment_dl)
adjusted_dl[[1]]$"data"$"math"
# Equivalent to:
as.numeric(resid(lm(math_score ~ has_tutor)))
Linear model p-value (generic)
Description
Return p-value of F-test for a linear model of any two features
Usage
linear_model_pval(predictor, response)
Arguments
predictor |
A categorical or numeric feature. |
response |
A numeric feature. |
Value
pval A p-value (class "numeric").
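The overall F-test p-value of a linear model is not stored directly in a base R model summary, so it has to be computed from the F statistic. A minimal sketch, with the hypothetical helper name `lm_pval_sketch` and simulated data:

```r
# Hypothetical helper: p-value of the overall F-test for response ~ predictor
lm_pval_sketch <- function(predictor, response) {
    fstat <- summary(stats::lm(response ~ predictor))$fstatistic
    stats::pf(
        fstat[["value"]],
        fstat[["numdf"]],
        fstat[["dendf"]],
        lower.tail = FALSE
    )
}

set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
lm_p <- lm_pval_sketch(x, y)
lm_p
```

With a single predictor, this matches the p-value reported by `stats::anova()` for the same model.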
Manhattan plot of feature-meta cluster association p-values
Description
Given a data frame of representative meta cluster solutions (see
get_representative_solutions()), returns a Manhattan plot for showing
feature separation across all features in provided data/target lists.
Usage
mc_manhattan_plot(
ext_sol_df,
dl = NULL,
target_dl = NULL,
variable_order = NULL,
neg_log_pval_thresh = 5,
threshold = NULL,
point_size = 5,
text_size = 20,
plot_title = NULL,
xints = NULL,
hide_x_labels = FALSE,
domain_colours = NULL
)
Arguments
ext_sol_df |
A sol_df that contains "_pval"
columns containing the values to be plotted. This object is the output of
|
dl |
List of data frames containing data information. |
target_dl |
List of data frames containing target information. |
variable_order |
Order of features to be displayed in the plot. |
neg_log_pval_thresh |
Threshold for negative log p-values. |
threshold |
p-value threshold to plot horizontal dashed line at. |
point_size |
Size of points in the plot. |
text_size |
Size of text in the plot. |
plot_title |
Title of the plot. |
xints |
Either "outcomes" or a vector of numeric values to plot vertical lines at. |
hide_x_labels |
If TRUE, hides x-axis labels. |
domain_colours |
Named vector of colours for domains. |
Value
A Manhattan plot (class "gg", "ggplot") showing the association p-values of features against each solution in the provided solutions data frame, stratified by meta cluster label.
Examples
dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
sc <- snf_config(
dl = dl,
n_solutions = 20,
min_k = 20,
max_k = 50
)
sol_df <- batch_snf(dl, sc)
ext_sol_df <- extend_solutions(
sol_df,
dl = dl,
min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
)
# Calculate pairwise similarities between cluster solutions
sol_aris <- calc_aris(sol_df)
# Extract hierarchical clustering order of the cluster solutions
meta_cluster_order <- get_matrix_order(sol_aris)
# Identify meta cluster boundaries with shiny app or trial and error
# ari_hm <- meta_cluster_heatmap(sol_aris, order = meta_cluster_order)
# shiny_annotator(ari_hm)
# Result of meta cluster examination
split_vec <- c(2, 5, 12, 17)
ext_sol_df <- label_meta_clusters(ext_sol_df, split_vec, meta_cluster_order)
# Extracting representative solutions from each defined meta cluster
rep_solutions <- get_representative_solutions(sol_aris, ext_sol_df)
mc_manhattan <- mc_manhattan_plot(
rep_solutions,
dl = dl,
point_size = 3,
text_size = 12,
plot_title = "Feature-Meta Cluster Associations",
threshold = 0.05,
neg_log_pval_thresh = 5
)
mc_manhattan
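The role of `split_vec` in the example above can be pictured with a short base R sketch. It assumes that each value in the split vector marks the first position of a new meta cluster within the ordered solutions; that boundary convention and the letter labels are assumptions made for illustration, not confirmed package behaviour.

```r
n_solutions <- 20
split_vec <- c(2, 5, 12, 17)

# Assumption: each value in split_vec is the first position of a new group
# within the ordered cluster solutions
group_index <- findInterval(seq_len(n_solutions), split_vec) + 1

# Hypothetical letter labels, one per contiguous block of solutions
mc_labels <- LETTERS[group_index]
table(mc_labels)
```

Under this convention, four boundaries partition the 20 ordered solutions into five contiguous blocks.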
Merge clust_fns_list
objects
Description
Merge clust_fns_list
objects
Usage
## S3 method for class 'clust_fns_list'
merge(x, y, ...)
Arguments
x |
The first |
y |
The second |
... |
Additional arguments (not used). |
Value
A new clust_fns_list
object containing the merged clustering
functions.
Merge observations between two compatible data lists
Description
Join two data lists with the same components (data frames) but separate observations. To instead merge two data lists that have the same observations but different components, simply use c().
Usage
## S3 method for class 'data_list'
merge(x, y, ...)
Arguments
x |
The first data list to merge. |
y |
The second data list to merge. |
... |
Additional arguments passed into merge function. |
Value
A data list ("list"-class object) containing the observations of both provided data lists.
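The distinction between merging observations and combining components can be sketched with plain lists of data frames. This is a simplification: real data lists carry additional metadata per component, and the objects below are hypothetical stand-ins rather than `data_list` objects.

```r
# Two stand-ins with the same components but different observations
x <- list(
  income = data.frame(uid = c("a", "b"), income = c(1, 2)),
  anxiety = data.frame(uid = c("a", "b"), anxiety = c(0, 1))
)
y <- list(
  income = data.frame(uid = c("c", "d"), income = c(3, 4)),
  anxiety = data.frame(uid = c("c", "d"), anxiety = c(1, 1))
)

# Observation-wise merge: stack matching components row by row
merged_obs <- Map(rbind, x, y)

# Component-wise combination: same observations, one more component
z <- list(depress = data.frame(uid = c("a", "b"), depress = c(0, 0)))
merged_comp <- c(x, z)

sapply(merged_obs, nrow)
names(merged_comp)
```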
Merge dist_fns_list
objects
Description
Merge dist_fns_list
objects
Usage
## S3 method for class 'dist_fns_list'
merge(x, y, ...)
Arguments
x |
The first |
y |
The second |
... |
Additional arguments (not used). |
Value
A new dist_fns_list
object containing the merged distance
functions.
Merge ext_solutions_df
objects
Description
Merge ext_solutions_df
objects
Usage
## S3 method for class 'ext_solutions_df'
merge(x, y, ...)
Arguments
x |
The first |
y |
The second |
... |
Additional arguments (not used). |
Value
Error message indicating that the merge function is not applicable
to ext_solutions_df
objects.
Merge settings_df
objects
Description
Merge settings_df
objects
Usage
## S3 method for class 'settings_df'
merge(x, y, ...)
Arguments
x |
The first |
y |
The second |
... |
Additional arguments (not used). |
Value
Error message indicating that the merge function is not applicable
to settings_df
objects.
Merge sim_mats_list
objects
Description
Merge sim_mats_list
objects
Usage
## S3 method for class 'sim_mats_list'
merge(x, y, ...)
Arguments
x |
The first |
y |
The second |
... |
Additional arguments (not used). |
Value
A merged sim_mats_list
object containing the similarity matrices
from both input objects.
Merge method for SNF config objects
Description
Merge method for SNF config objects
Usage
## S3 method for class 'snf_config'
merge(x, y, reset_indices = TRUE, ...)
Arguments
x |
SNF config to merge. |
y |
SNF config to merge. |
reset_indices |
If TRUE (default), re-labels the "solutions" indices in the config from 1 to the number of defined settings. |
... |
Additional arguments passed into merge function. |
Value
An SNF config combining the rows of both prior configurations.
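The effect of `reset_indices` can be pictured with two toy settings tables. The column names below are hypothetical and the tables are ordinary data frames, so this is only a sketch of row-binding followed by re-labelling, not the method's actual code.

```r
# Two toy settings tables standing in for the settings of two SNF configs
settings_x <- data.frame(solution = 1:2, alpha = c(0.3, 0.5), k = c(20, 30))
settings_y <- data.frame(solution = 1:3, alpha = c(0.4, 0.6, 0.8), k = c(25, 35, 45))

# Combine the rows of both configurations
merged <- rbind(settings_x, settings_y)

# Analogue of reset_indices = TRUE: re-label solutions from 1 to the number of rows
merged$solution <- seq_len(nrow(merged))
merged
```

Without the re-labelling step, the combined table would contain duplicated solution indices (1, 2, 1, 2, 3).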
Merge solutions_df
objects
Description
Merge solutions_df
objects
Usage
## S3 method for class 'solutions_df'
merge(x, y, ...)
Arguments
x |
The first |
y |
The second |
... |
Additional arguments (not used). |
Value
Error message indicating that the merge function is not applicable
to solutions_df
objects.
Merge t_ext_solutions_df
objects
Description
Merge t_ext_solutions_df
objects
Usage
## S3 method for class 't_ext_solutions_df'
merge(x, y, ...)
Arguments
x |
The first |
y |
The second |
... |
Additional arguments (not used). |
Value
Error message indicating that the merge function is not applicable
to t_ext_solutions_df
objects.
Merge t_solutions_df
objects
Description
Merge t_solutions_df
objects
Usage
## S3 method for class 't_solutions_df'
merge(x, y, ...)
Arguments
x |
The first |
y |
The second |
... |
Additional arguments (not used). |
Value
Error message indicating that the merge function is not applicable
to t_solutions_df
objects.
Merge weights_matrix
objects
Description
Merge weights_matrix
objects
Usage
## S3 method for class 'weights_matrix'
merge(x, y, ...)
Arguments
x |
The first |
y |
The second |
... |
Additional arguments (not used). |
Value
Error message indicating that the merge function is not applicable
to weights_matrix
objects.
Merge list of data frames into a single data frame
Description
This helper function combines all data frames in a single-level list into a single data frame.
Usage
merge_df_list(df_list, join = "inner", uid = "uid", no_na = FALSE)
Arguments
df_list |
list of data frames. |
join |
String indicating if join should be "inner" or "full". |
uid |
Column name to join on. Default is "uid". |
no_na |
Whether to remove NA values from the merged data frame. |
Value
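A single data frame produced by joining all data frames in the list on the uid column (inner join by default, full join if join = "full").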
Examples
merge_df_list(list(income, pubertal), uid = "unique_id")
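The difference between the "inner" and "full" join options can be sketched with base R's `merge()` and `Reduce()`. This is an illustrative equivalent under the assumption that every data frame shares a `uid` column; it is not the helper's own implementation.

```r
df_list <- list(
  data.frame(uid = c("a", "b", "c"), income = c(1, 2, 3)),
  data.frame(uid = c("b", "c", "d"), pubertal = c(2, 3, 4))
)

# Inner join: keep only UIDs present in every data frame
inner <- Reduce(function(a, b) merge(a, b, by = "uid"), df_list)

# Full join: keep every UID, filling gaps with NA
full <- Reduce(function(a, b) merge(a, b, by = "uid", all = TRUE), df_list)

inner
full
```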
Helper function for raising alerts
Description
Helper function for raising alerts
Usage
metasnf_alert(..., env = 1)
Arguments
... |
Arbitrary number of strings to be pasted together into alert message. |
env |
Environment to evaluate expressions in. |
Value
Returns no value. Raises an alert through cli::cli_alert_info.
Helper function for defunct function errors
Description
Helper function for defunct function errors
Usage
metasnf_defunct(version, alternative, env = 1)
Arguments
version |
Version of |
alternative |
Recommended alternative approach. |
env |
Environment to evaluate expressions in. |
Value
Returns no value. Raises an error through cli::cli_abort.
Helper function for deprecated function warnings
Description
Helper function for deprecated function warnings
Usage
metasnf_deprecated(version, alternative, env = 1)
Arguments
version |
Version of |
alternative |
Recommended alternative approach. |
env |
Environment to evaluate expressions in. |
Value
Returns no value. Raises a warning through cli::cli_warn.
Helper function for raising errors
Description
Helper function for raising errors
Usage
metasnf_error(..., env = 1)
Arguments
... |
Arbitrary number of strings to be pasted together into error message. |
env |
Environment to evaluate expressions in. |
Value
Returns no value. Raises an error through cli::cli_abort.
Helper function for raising warnings
Description
Helper function for raising warnings
Usage
metasnf_warning(..., env = 1)
Arguments
... |
Arbitrary number of strings to be pasted together into warning message. |
env |
Environment to evaluate expressions in. |
Value
Returns no value. Raises a warning through cli::cli_warn.
Modification of SNFtool mock data frame "Data2"
Description
Modification of SNFtool mock data frame "Data2"
Usage
methylation_df
Format
methylation_df
A data frame with 200 rows and 3 columns:
- gene_1_expression
Mock gene methylation feature
- gene_2_expression
Mock gene methylation feature
- patient_id
Random three-digit number uniquely identifying the patient
Source
This data came from the SNFtool package, with slight modifications.
Mock example of an ari_matrix
metasnf object
Description
An ari_matrix
class object containing adjusted Rand indices (ARIs) between 20 cluster solutions.
Used as an example of an ari_matrix
metasnf object.
Usage
mock_ari_matrix
Format
mock_ari_matrix
A 20 by 20 ARI matrix.
Source
This data comes from the metasnf package.
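For readers unfamiliar with the adjusted Rand index, the sketch below computes pairwise ARIs between a few toy cluster solutions in base R. The function name `adjusted_rand_index` and the toy solutions are hypothetical, and degenerate cases (for example, every observation in a single cluster) are not handled.

```r
# Adjusted Rand index between two label vectors of equal length
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))
  expected <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Three toy cluster solutions over six observations
solutions <- list(
  s1 = c(1, 1, 1, 2, 2, 2),
  s2 = c(2, 2, 2, 1, 1, 1),  # same partition as s1 with swapped labels
  s3 = c(1, 1, 2, 2, 3, 3)
)

# Symmetric matrix of pairwise ARIs
idx <- seq_along(solutions)
ari_mat <- outer(
  idx, idx,
  Vectorize(function(i, j) adjusted_rand_index(solutions[[i]], solutions[[j]]))
)
dimnames(ari_mat) <- list(names(solutions), names(solutions))
round(ari_mat, 3)
```

The ARI depends only on the partition, not on the label values, which is why `s1` and `s2` are treated as identical solutions.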
Mock example of a clust_fns_list
metasnf object
Description
Mock example of a clust_fns_list
metasnf object
Usage
mock_clust_fns_list
Format
mock_clust_fns_list
A clust_fns_list
object containing two clustering functions covering the 2 and 5 cluster solution versions of spectral clustering.
Extracted from mock_snf_config
.
Source
This data comes from the metasnf package.
Mock example of a data_list
metasnf object
Description
Mock example of a data_list
metasnf object
Usage
mock_data_list
Format
mock_data_list
A data list containing 4 data frames with 100 observations each:
- subcortical volume (30 features)
- cortical surface area (151 features)
- household income (1 feature)
- pubertal status (1 feature)
Used as an example of a data_list
metasnf object.
Source
This data comes from the metasnf package.
Mock example of a dist_fns_list
metasnf object
Description
Mock example of a dist_fns_list
metasnf object
Usage
mock_dist_fns_list
Format
mock_dist_fns_list
A dist_fns_list
object containing a variety of distance metrics.
Extracted from mock_snf_config
.
Source
This data comes from the metasnf package.
Mock example of an ext_solutions_df
metasnf object
Description
An ext_solutions_df
class object generated by extending the mock_rep_solutions_df
object against mock_data_list
as the target data list.
Usage
mock_ext_solutions_df
Format
mock_ext_solutions_df
Contains 20 cluster solutions.
Source
This data comes from the metasnf package.
Mock example of an mc_solutions_df
metasnf object
Description
Mock example of an mc_solutions_df
metasnf object
Usage
mock_mc_solutions_df
Format
mock_mc_solutions_df
A meta cluster labeled solutions data frame derived from mock_solutions_df
.
Contains 20 cluster solutions.
Source
This data comes from the metasnf package.
Mock example of a rep_solutions_df
metasnf object
Description
A solutions_df
class object derived by filtering the mock_mc_solutions_df
to its representative solutions.
Usage
mock_rep_solutions_df
Format
mock_rep_solutions_df
Contains 4 cluster solutions.
Source
This data comes from the metasnf package.
Mock example of a settings_df
metasnf object
Description
Mock example of a settings_df
metasnf object
Usage
mock_settings_df
Format
mock_settings_df
Settings for 20 cluster solutions.
Source
This data comes from the metasnf package.
Mock example of a snf_config
metasnf object
Description
Mock example of a snf_config
metasnf object
Usage
mock_snf_config
Format
mock_snf_config
An SNF config containing hyperparameters and functions defined for generating 20 cluster solutions from a data list.
The config has been specified to:
- limit the k
hyperparameter to 40
- make use of uniformly distributed random weights
- randomly select among spectral clustering variants in which the number of clusters is 2, 5, chosen by the eigen-gap heuristic, or chosen by the rotation cost heuristic
- use Gower distance for categorical and mixed data, Euclidean distance for ordinal data, and randomly select from Euclidean distance or standard/normalized Euclidean distance for continuous and discrete data
The config was built using the mock_data_list
loaded into the namespace after calling library("metasnf")
.
Used as an example of an snf_config
metasnf object.
Source
This data comes from the metasnf package.
Mock example of a solutions_df
metasnf object
Description
Mock example of a solutions_df
metasnf object
Usage
mock_solutions_df
Format
mock_solutions_df
A solutions data frame containing 20 cluster solutions generated from mock_snf_config
and mock_data_list
.
Used as an example of a solutions_df
metasnf object.
Source
This data comes from the metasnf package.
Mock example of a t_solutions_df
metasnf object
Description
Mock example of a t_solutions_df
metasnf object
Usage
mock_t_solutions_df
Format
mock_t_solutions_df
A transposed solutions data frame containing 20 cluster solutions generated from mock_solutions_df
.
Used as an example of a t_solutions_df
metasnf object.
Source
This data comes from the metasnf package.
Mock example of a weights_matrix
metasnf object
Description
Mock example of a weights_matrix
metasnf object
Usage
mock_weights_matrix
Format
mock_weights_matrix
A weights_matrix
class object containing 20 sets of weights for 183 features.
Source
This data comes from the metasnf package.
Extract number of features stored in an object
Description
Extract number of features stored in an object
Usage
n_features(x)
Arguments
x |
The object to extract number of features from. |
Value
The number of features in x.
Extract number of observations stored in an object
Description
Extract number of observations stored in an object
Usage
n_observations(x)
Arguments
x |
The object to extract number of observations from. |
Value
The number of observations in x.
Constructor for ari_matrix
class object
Description
Constructor for ari_matrix
class object
Usage
new_ari_matrix(aml, dist_method, hclust_method)
Arguments
aml |
An ari_matrix-like matrix object to be validated. |
Value
An ari_matrix
object.
Constructor for clust_fns_list
class object
Description
Constructor for clust_fns_list
class object
Usage
new_clust_fns_list(cfll)
Arguments
cfll |
A clust_fns_list-like |
Value
A clust_fns_list
class object.
Constructor for data_list
class object
Description
Constructor for data_list
class object
Usage
new_data_list(dll)
Arguments
dll |
A data list-like |
Value
A data_list
object, which is a nested list with class data_list
.
Constructor for dist_fns_list
class object
Description
Constructor for dist_fns_list
class object
Usage
new_dist_fns_list(dfll)
Arguments
dfll |
A distance metrics list-like list object to be validated. |
Value
A dist_fns_list
object.
Constructor for ext_solutions_df
class object
Description
Constructor for ext_solutions_df
class object
Usage
new_ext_solutions_df(ext_sol_dfl)
Arguments
ext_sol_dfl |
An extended solutions data frame-like object. |
Value
An ext_solutions_df
object, which is a data frame with class
ext_solutions_df
.
Constructor for settings_df
class object
Description
Constructor for settings_df
class object
Usage
new_settings_df(sdfl)
Arguments
sdfl |
A settings data frame-like matrix object to be validated. |
Value
A settings_df
object.
Constructor for similarity_matrix_list
class object
Description
Constructor for similarity_matrix_list
class object
Usage
new_sim_mats_list(smll)
Arguments
smll |
A similarity matrix list-like object. |
Value
A similarity_matrix_list
class object.
Constructor for snf_config
class object
Description
Constructor for snf_config
class object
Usage
new_snf_config(scl)
Arguments
scl |
An SNF config-like |
Value
An snf_config
object.
Constructor for solutions_df
class object
Description
Constructor for solutions_df
class object
Usage
new_solutions_df(sol_dfl)
Arguments
sol_dfl |
A solutions data frame-like object to be validated and converted into a solutions data frame. |
Value
A solutions_df
class object.
Constructor for weights_matrix
class object
Description
Constructor for weights_matrix
class object
Usage
new_weights_matrix(wml)
Arguments
wml |
A weights_matrix-like matrix object to be validated. |
Value
A weights_matrix
object.
Helper function for creating the message about hidden features/observations/solutions
Description
Helper function for creating the message about hidden features/observations/solutions
Usage
not_shown_message(
hidden_solutions = NULL,
hidden_observations = NULL,
hidden_features = NULL
)
Arguments
hidden_solutions |
Number of hidden solutions. |
hidden_observations |
Number of hidden observations. |
hidden_features |
Number of hidden features. |
Value
If all arguments are NULL or 0, returns NULL. Otherwise, returns a neatly formatted string indicating how many solutions, observations, and/or features were not shown.
Convert columns of a data frame to numeric type (if possible)
Description
Converts all columns in a data frame that can be converted to numeric type to numeric type.
Usage
numcol_to_numeric(df)
Arguments
df |
A data frame. |
Value
The data frame with all coercible columns converted to type numeric.
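One way to picture "convert if possible" is the base R sketch below. The function name `to_numeric_if_possible` is hypothetical, and the rule used here (convert a column only when coercion introduces no new missing values) is an assumption about what "can be converted" means, not the package's documented criterion.

```r
# Hypothetical helper: convert a column only when coercion loses nothing
to_numeric_if_possible <- function(df) {
  df[] <- lapply(df, function(col) {
    converted <- suppressWarnings(as.numeric(as.character(col)))
    # Keep the original column if coercion introduced any new NA
    if (all(is.na(converted) == is.na(col))) converted else col
  })
  df
}

df <- data.frame(
  id = c("a", "b", "c"),
  score = c("1.5", "2", NA),
  stringsAsFactors = FALSE
)
str(to_numeric_if_possible(df))
```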
Ordinal regression p-value
Description
Returns the overall p-value of an ordinal regression on a categorical predictor and response vectors.
Usage
ord_reg_pval(predictor, response)
Arguments
predictor |
A categorical or numeric feature. |
response |
A numeric feature. |
Value
pval A p-value (class "numeric").
Parallel processing form of batch_snf
Description
Parallel processing form of batch_snf
Usage
parallel_batch_snf(
dl,
dfl,
cfl,
sdf,
wm,
similarity_matrix_dir,
return_sim_mats,
processes
)
Arguments
dl |
A data list. |
dfl |
An optional nested list containing which distance metric function should be used for the various feature types (continuous, discrete, ordinal, categorical, and mixed). See ?dist_fns_list for details on how to build this. |
cfl |
List of custom clustering algorithms to apply to the final fused network. See ?clust_fns_list. |
sdf |
matrix indicating parameters to iterate SNF through. |
wm |
A matrix containing feature weights to use during distance matrix calculation. See ?weights_matrix for details on how to build this. |
similarity_matrix_dir |
If specified, this directory will be used to save all generated similarity matrices. |
return_sim_mats |
If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE. |
processes |
Number of parallel processes used when executing SNF. |
Value
The same values as batch_snf() (see ?batch_snf).
Helper function to pick columns from a data frame
Description
Helper function to pick columns from a data frame
Usage
pick_cols(x, cols)
Arguments
x |
A data frame |
cols |
Vector of column names to be picked |
Value
x with only columns in cols
Helper function to pluralize a string
Description
Helper function to pluralize a string
Usage
pl(x)
Arguments
x |
A vector of length 1 or greater. |
Value
A string "s" if the length of x is greater than 1, otherwise an empty string.
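The behaviour described above is simple enough to sketch in a few lines. The name `plural_suffix` is a hypothetical stand-in used to avoid implying this is the package's own source.

```r
# Hypothetical stand-in for the helper described above
plural_suffix <- function(x) {
  if (length(x) > 1) "s" else ""
}

features <- c("income", "anxiety")
msg <- paste0(length(features), " feature", plural_suffix(features))
msg
```

A helper like this keeps console messages grammatical whether one or several items are being reported.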
Heatmap of pairwise adjusted Rand indices between solutions
Description
Heatmap of pairwise adjusted Rand indices between solutions
Usage
## S3 method for class 'ari_matrix'
plot(
x,
order = NULL,
cluster_rows = FALSE,
cluster_columns = FALSE,
log_graph = FALSE,
scale_diag = "none",
min_colour = "#282828",
max_colour = "firebrick2",
col = circlize::colorRamp2(c(min(x), max(x)), c(min_colour, max_colour)),
...
)
meta_cluster_heatmap(
x,
order = NULL,
cluster_rows = FALSE,
cluster_columns = FALSE,
log_graph = FALSE,
scale_diag = "none",
min_colour = "#282828",
max_colour = "firebrick2",
col = circlize::colorRamp2(c(min(x), max(x)), c(min_colour, max_colour)),
...
)
Arguments
x |
Matrix of adjusted rand indices from |
order |
Numeric vector containing row order of the heatmap. |
cluster_rows |
Whether rows should be clustered. |
cluster_columns |
Whether columns should be clustered. |
log_graph |
If TRUE, log transforms the graph. |
scale_diag |
Method of rescaling matrix diagonals. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0). |
min_colour |
Colour used for the lowest value in the heatmap. |
max_colour |
Colour used for the highest value in the heatmap. |
col |
Colour ramp to use for the heatmap. |
... |
Additional parameters passed to |
Value
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the pairwise adjusted Rand indices (similarities) between the cluster solutions of the provided solutions data frame.
Examples
dl <- data_list(
list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
set.seed(42)
my_sc <- snf_config(
dl = dl,
n_solutions = 20,
min_k = 20,
max_k = 50
)
sol_df <- batch_snf(dl, my_sc)
sol_df
sol_aris <- calc_aris(sol_df)
meta_cluster_order <- get_matrix_order(sol_aris)
# `split_vec` found by iteratively plotting ari_hm or by ?shiny_annotator()
split_vec <- c(6, 10, 16)
ari_hm <- plot(
sol_aris,
order = meta_cluster_order,
split_vector = split_vec
)
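The `scale_diag` options described in the arguments can be illustrated on a small similarity matrix. The function `rescale_diag` below is a hypothetical base R sketch of the three documented behaviours ("none", "mean", "zero"), not the plotting method's internal code.

```r
# Hypothetical helper mirroring the documented scale_diag options
rescale_diag <- function(m, scale_diag = c("none", "mean", "zero")) {
  scale_diag <- match.arg(scale_diag)
  off_diag <- m[row(m) != col(m)]
  if (scale_diag == "mean") {
    # Replace diagonals with the average of the off-diagonal values
    diag(m) <- mean(off_diag)
  } else if (scale_diag == "zero") {
    diag(m) <- 0
  }
  m
}

m <- matrix(
  c(1.0, 0.2, 0.4,
    0.2, 1.0, 0.6,
    0.4, 0.6, 1.0),
  nrow = 3, byrow = TRUE
)
rescale_diag(m, "mean")
```

Since every solution has an ARI of 1 with itself, rescaling the diagonal can keep those values from dominating the colour range of the heatmap.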
Plot of feature values in a data list
Description
This plot, built on ComplexHeatmap::Heatmap()
, visualizes the feature
values in a data list as a continuous heatmap with observations along the
columns and features along the rows.
Usage
## S3 method for class 'data_list'
plot(
x,
y = NULL,
cluster_rows = TRUE,
cluster_columns = TRUE,
heatmap_legend_param = NULL,
row_title = "Observation",
column_title = "Feature",
show_row_names = FALSE,
...
)
Arguments
x |
A |
y |
Optional argument to |
cluster_rows |
Logical indicating whether to cluster the rows (observations). |
cluster_columns |
Logical indicating whether to cluster the columns (features). |
heatmap_legend_param |
A list of parameters for the heatmap legend. |
row_title |
Title for the rows (observations). |
column_title |
Title for the columns (features). |
show_row_names |
Logical indicating whether to show row names. |
... |
Additional arguments passed to |
Value
A heatmap visualization of feature values.
Plot of cluster assignments in an extended solutions data frame
Description
This plot, built on ComplexHeatmap::Heatmap(), visualizes the cluster assignments in an extended solutions data frame as a categorical heatmap with observations along the columns and cluster solutions along the rows.
Usage
## S3 method for class 'ext_solutions_df'
plot(
x,
y = NULL,
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_names = TRUE,
show_column_names = TRUE,
heatmap_legend_param = NULL,
row_title = "Solution",
column_title = "Observation",
...
)
## S3 method for class 't_ext_solutions_df'
plot(x, ...)
Arguments
x |
An |
y |
Optional argument to |
cluster_rows |
If the value is a logical, it controls whether to make cluster on rows. The value can also be a |
cluster_columns |
Whether make cluster on columns? Same settings as |
show_row_names |
Whether to show row names. |
show_column_names |
Whether to show column names. |
heatmap_legend_param |
A list contains parameters for the heatmap legends. See |
row_title |
Title on the row. |
column_title |
Title on the column. |
... |
Additional arguments passed to |
Value
A ComplexHeatmap::Heatmap()
object visualization of cluster
assignments.
Heatmap for visualizing an SNF config
Description
Create a heatmap where each row corresponds to a different set of hyperparameters in an SNF config object. Numeric parameters are scaled (normalized) and non-numeric parameters are added as heatmap annotations. Rows can be reordered to match prior meta clustering results.
Usage
## S3 method for class 'snf_config'
plot(
x,
order = NULL,
hide_fixed = FALSE,
show_column_names = TRUE,
show_row_names = TRUE,
rect_gp = grid::gpar(col = "black"),
colour_breaks = c(0, 1),
colours = c("black", "darkseagreen"),
column_split_vector = NULL,
row_split_vector = NULL,
column_split = NULL,
row_split = NULL,
column_title = NULL,
include_weights = TRUE,
include_settings = TRUE,
...
)
config_heatmap(
x,
order = NULL,
hide_fixed = FALSE,
show_column_names = TRUE,
show_row_names = TRUE,
rect_gp = grid::gpar(col = "black"),
colour_breaks = c(0, 1),
colours = c("black", "darkseagreen"),
column_split_vector = NULL,
row_split_vector = NULL,
column_split = NULL,
row_split = NULL,
column_title = NULL,
include_weights = TRUE,
include_settings = TRUE,
...
)
## S3 method for class 'settings_df'
plot(
x,
order = NULL,
hide_fixed = FALSE,
show_column_names = TRUE,
show_row_names = TRUE,
rect_gp = grid::gpar(col = "black"),
colour_breaks = c(0, 1),
colours = c("black", "darkseagreen"),
column_split_vector = NULL,
row_split_vector = NULL,
column_split = NULL,
row_split = NULL,
column_title = NULL,
include_weights = TRUE,
include_settings = TRUE,
...
)
## S3 method for class 'weights_matrix'
plot(
x,
order = NULL,
hide_fixed = FALSE,
show_column_names = TRUE,
show_row_names = TRUE,
rect_gp = grid::gpar(col = "black"),
colour_breaks = c(0, 1),
colours = c("black", "darkseagreen"),
column_split_vector = NULL,
row_split_vector = NULL,
column_split = NULL,
row_split = NULL,
column_title = NULL,
include_weights = TRUE,
include_settings = TRUE,
...
)
Arguments
x |
An |
order |
Numeric vector indicating row ordering of SNF config. |
hide_fixed |
Whether fixed parameters should be removed. |
show_column_names |
Whether to show column names. |
show_row_names |
Whether to show row names. |
rect_gp |
Graphic parameters for drawing rectangles (for heatmap body). The value should be specified by |
colour_breaks |
Numeric vector of breaks for the legend. |
colours |
Vector of colours to use for the heatmap. Should match the length of colour_breaks. |
column_split_vector |
Vector of indices to split columns by. |
row_split_vector |
Vector of indices to split rows by. |
column_split |
Split on columns. For heatmap splitting, please refer to https://jokergoo.github.io/ComplexHeatmap-reference/book/a-single-heatmap.html#heatmap-split . |
row_split |
Same as |
column_title |
Title on the column. |
include_weights |
If TRUE, includes feature weights of the weights matrix into the config heatmap. |
include_settings |
If TRUE, includes columns from the settings data frame into the config heatmap. |
... |
Additional parameters passed to |
Value
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the scaled values of the provided SNF config.
Examples
dl <- data_list(
list(income, "household_income", "demographics", "ordinal"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
list(fav_colour, "favourite_colour", "demographics", "categorical"),
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
sc <- snf_config(
dl,
n_solutions = 10,
dropout_dist = "uniform"
)
plot(sc)
Plot of cluster assignments in a solutions data frame
Description
This plot, built on ComplexHeatmap::Heatmap(), visualizes the cluster assignments in a solutions data frame as a categorical heatmap with observations along the columns and cluster solutions along the rows.
Usage
## S3 method for class 'solutions_df'
plot(
x,
y = NULL,
cluster_rows = FALSE,
cluster_columns = TRUE,
heatmap_legend_param = NULL,
row_title = "Solution",
column_title = "Observation",
...
)
## S3 method for class 't_solutions_df'
plot(x, ...)
Arguments
x |
A |
y |
Optional argument to |
cluster_rows |
If the value is a logical, it controls whether to make cluster on rows. The value can also be a |
cluster_columns |
Whether make cluster on columns? Same settings as |
heatmap_legend_param |
A list contains parameters for the heatmap legends. See |
row_title |
Title on the row. |
column_title |
Title on the column. |
... |
Additional arguments passed to |
Value
A ComplexHeatmap::Heatmap()
object visualization of cluster
assignments.
Add "uid_" prefix to all UID values in uid column
Description
Add "uid_" prefix to all UID values in uid column
Usage
prefix_dll_uid(dll)
Arguments
dll |
A data list-like |
Value
A data list with UIDs prefixed with the string "uid_".
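The prefixing step can be sketched on a minimal nested list. The structure below (components holding a `data` element with a `uid` column) is an assumption made for illustration; the `name` field and values are hypothetical.

```r
# Minimal stand-in for a data list-like object (structure is an assumption)
dll <- list(
  list(data = data.frame(uid = c("1", "2"), income = c(10, 20)), name = "income"),
  list(data = data.frame(uid = c("1", "2"), anxiety = c(0, 1)), name = "anxiety")
)

# Add the "uid_" prefix to every UID value in every component
prefixed <- lapply(dll, function(component) {
  component$data$uid <- paste0("uid_", component$data$uid)
  component
})

prefixed[[1]]$data$uid
```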
Print method for class ari_matrix
Description
Custom formatted print for ARI matrix class objects.
Usage
## S3 method for class 'ari_matrix'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class clust_fns_list
Description
Custom formatted print for clustering functions list objects that outputs information about the contained clustering functions to the console.
Usage
## S3 method for class 'clust_fns_list'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class data_list
Description
Custom formatted print for data list objects that outputs information about the contained observations and components to the console.
Usage
## S3 method for class 'data_list'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class dist_fns_list
Description
Custom formatted print for distance metrics list objects that outputs information about the contained distance metrics to the console.
Usage
## S3 method for class 'dist_fns_list'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class ext_solutions_df
Description
Custom formatted print for extended solutions data frame class objects.
Usage
## S3 method for class 'ext_solutions_df'
print(x, n = NULL, ...)
Arguments
x |
A |
n |
Number of rows to print, passed into |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class settings_df
Description
Custom formatted print for settings data frame that outputs information about SNF hyperparameters to the console.
Usage
## S3 method for class 'settings_df'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class sim_mats_list
Description
Custom formatted print for similarity matrix list
Usage
## S3 method for class 'sim_mats_list'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class snf_config
Description
Custom formatted print for SNF config
Usage
## S3 method for class 'snf_config'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class solutions_df
Description
Custom formatted print for solutions data frame class objects.
Usage
## S3 method for class 'solutions_df'
print(x, n = NULL, tips = TRUE, ...)
Arguments
x |
A |
n |
Number of rows to print, passed into |
tips |
If TRUE, include lines on how to print more rows / transposed. |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class t_ext_solutions_df
Description
Custom formatted print for transposed extended solutions data frame class objects.
Usage
## S3 method for class 't_ext_solutions_df'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class t_solutions_df
Description
Custom formatted print for transposed solutions data frame class objects.
Usage
## S3 method for class 't_solutions_df'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Print method for class weights_matrix
Description
Custom formatted print for weights matrices that outputs information about feature weights functions to the console.
Usage
## S3 method for class 'weights_matrix'
print(x, ...)
Arguments
x |
A |
... |
Other arguments passed to |
Value
Function prints to console but does not return any value.
Helper function for outputting tip on changing rows printed
Description
Helper function for outputting tip on changing rows printed
Usage
print_with_n_message()
Value
Outputs a message describing how to use print with n to change the number of displayed rows.
Helper function for transposing solutions_df message
Description
Helper function for transposing solutions_df message
Usage
print_with_t_message()
Value
Outputs a message describing how to print the transposed solutions data frame.
Mock ABCD pubertal status data
Description
Like the mock data frame "abcd_pubertal", but with "unique_id" as the "uid".
Usage
pubertal
Format
pubertal
A data frame with 275 rows and 2 columns:
- unique_id
The unique identifier of the ABCD dataset
- pubertal_status
Average reported pubertal status between child and parent (1-5 categorical scale)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Heatmap of p-values
Description
Heatmap of p-values
Usage
pval_heatmap(
ext_sol_df,
order = NULL,
cluster_columns = TRUE,
cluster_rows = FALSE,
show_row_names = FALSE,
show_column_names = TRUE,
min_colour = "red2",
max_colour = "white",
legend_breaks = c(0, 1),
col = circlize::colorRamp2(legend_breaks, c(min_colour, max_colour)),
heatmap_legend_param = list(color_bar = "continuous", title = "p-value", at = c(0, 1)),
rect_gp = grid::gpar(col = "black"),
column_split_vector = NULL,
row_split_vector = NULL,
column_split = NULL,
row_split = NULL,
...
)
Arguments
ext_sol_df |
An ext_solutions_df class object (produced from
the function |
order |
Numeric vector containing row order of the heatmap. |
cluster_columns |
Whether columns should be sorted by hierarchical clustering. |
cluster_rows |
Whether rows should be sorted by hierarchical clustering. |
show_row_names |
Whether row names should be shown. |
show_column_names |
Whether column names should be shown. |
min_colour |
Colour used for the lowest value in the heatmap. |
max_colour |
Colour used for the highest value in the heatmap. |
legend_breaks |
Numeric vector of breaks for the legend. |
col |
Colour function for |
heatmap_legend_param |
Legend function for |
rect_gp |
Cell border function for |
column_split_vector |
Vector of indices to split columns by. |
row_split_vector |
Vector of indices to split rows by. |
column_split |
Standard parameter of |
row_split |
Standard parameter of |
... |
Additional parameters passed to |
Value
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the provided p-values.
Examples
dl <- data_list(
list(income, "household_income", "demographics", "ordinal"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
list(fav_colour, "favourite_colour", "demographics", "categorical"),
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
sc <- snf_config(
dl,
n_solutions = 4,
dropout_dist = "uniform",
max_k = 50
)
sol_df <- batch_snf(dl, sc)
ext_sol_df <- extend_solutions(sol_df, dl)
pval_heatmap(ext_sol_df)
Quality metrics
Description
These functions calculate conventional metrics of cluster solution quality.
Usage
calculate_silhouettes(sol_df)
calculate_dunn_indices(sol_df)
calculate_db_indices(sol_df)
Arguments
sol_df |
A |
Details
calculate_silhouettes: A wrapper for cluster::silhouette that calculates silhouette scores for all cluster solutions in a provided solutions data frame. Silhouette values range from -1 to +1 and indicate an overall ratio of how close together observations within a cluster are to how far apart observations across clusters are. You can learn more about interpreting the results of this function by calling ?cluster::silhouette.
calculate_dunn_indices: A wrapper for clv::clv.Dunn that calculates Dunn indices for all cluster solutions in a provided solutions data frame. Dunn indices, like silhouette scores, reflect similarity within clusters and separation across clusters. You can learn more about interpreting the results of this function by calling ?clv::clv.Dunn.
calculate_db_indices: A wrapper for clv::clv.Davies.Bouldin that calculates Davies-Bouldin indices for all cluster solutions in a provided solutions data frame. These values also reflect similarity within clusters and separation across clusters, but lower Davies-Bouldin values indicate better-separated clusters. You can learn more about interpreting the results of this function by calling ?clv::clv.Davies.Bouldin.
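To make the idea of "similarity within clusters and separation across clusters" concrete, here is a minimal base R sketch of a Dunn-style index on toy data. It assumes one common definition (smallest between-cluster distance divided by largest within-cluster distance); `dunn_sketch` is a hypothetical helper written for illustration and is not the computation performed by clv::clv.Dunn, which supports several distance definitions.

```r
# Toy data: two well-separated groups of points in 2D
set.seed(1)
x <- rbind(
    matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
    matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
clusters <- rep(c(1, 2), each = 10)

# Hypothetical helper (assumed definition):
# smallest between-cluster distance / largest within-cluster distance
dunn_sketch <- function(dist_mat, clusters) {
    same_cluster <- outer(clusters, clusters, "==")
    off_diag <- row(dist_mat) != col(dist_mat)
    min_between <- min(dist_mat[!same_cluster])
    max_within <- max(dist_mat[same_cluster & off_diag])
    min_between / max_within
}

d <- as.matrix(dist(x))
dunn_good <- dunn_sketch(d, clusters)
# A deliberately poor labelling that mixes the two groups
dunn_bad <- dunn_sketch(d, rep(c(1, 2), times = 10))
```

Under this definition, the labelling that matches the true groups yields a much larger value than the mixed labelling, which is the sense in which higher Dunn indices indicate better solutions.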
Value
A list of silhouette class objects, a vector of Dunn indices, or a vector of Davies-Bouldin indices depending on which function was used.
Examples
## Not run:
input_dl <- data_list(
list(gender_df, "gender", "demographics", "categorical"),
list(diagnosis_df, "diagnosis", "clinical", "categorical"),
uid = "patient_id"
)
sc <- snf_config(input_dl, n_solutions = 5)
sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)
# calculate Davies-Bouldin indices
davies_bouldin_indices <- calculate_db_indices(sol_df)
# calculate Dunn indices
dunn_indices <- calculate_dunn_indices(sol_df)
# calculate silhouette scores
silhouette_scores <- calculate_silhouettes(sol_df)
## End(Not run)
Generate random removal sequence
Description
Helper function to contribute to rows within the settings data frame. Number of columns removed follows a uniform or exponential probability distribution.
Usage
random_removal(
columns,
min_removed_inputs,
max_removed_inputs,
dropout_dist = "exponential"
)
Arguments
columns |
Columns of the settings_df that are passed in |
min_removed_inputs |
The smallest number of input data frames that may be randomly removed. |
max_removed_inputs |
The largest number of input data frames that may be randomly removed. |
dropout_dist |
Indication of how input data frames should be dropped. Can be "none" (no dropout), "uniform" (uniformly draw number between min and max removed inputs), or "exponential" (like uniform, but using an exponential distribution; default). |
Value
Settings data frame row containing inclusion information.
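The dropout idea can be sketched in base R as follows. This is only an illustration under stated assumptions: `n_removed_sketch` is a hypothetical helper, the input names are made up, and the exponential weighting used here (probability proportional to exp(-n)) is an assumed form, not necessarily the distribution the package uses.

```r
# Hypothetical helper: choose how many input data frames to drop
n_removed_sketch <- function(min_removed, max_removed,
                             dropout_dist = "exponential") {
    if (dropout_dist == "none") {
        return(0)
    }
    candidates <- seq(min_removed, max_removed)
    probs <- switch(
        dropout_dist,
        uniform = rep(1, length(candidates)),
        # Assumption: probability decays exponentially with number removed
        exponential = exp(-candidates),
        stop("Unknown dropout_dist: ", dropout_dist)
    )
    # Index into candidates so a single candidate is returned unchanged
    candidates[sample.int(length(candidates), size = 1, prob = probs)]
}

set.seed(42)
input_names <- c("input_a", "input_b", "input_c", "input_d")  # made-up names
n_removed <- n_removed_sketch(0, length(input_names) - 1, "uniform")

# Build 0/1 inclusion flags, one per input data frame
inclusion <- rep(1, length(input_names))
names(inclusion) <- paste0("inc_", input_names)
inclusion[sample.int(length(input_names), n_removed)] <- 0
```

The resulting named vector mirrors the "inc_" columns of a settings data frame row: 1 keeps an input data frame for that run and 0 drops it.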
Row-binding of solutions data frame class objects
Description
Row-binding of solutions data frame class objects
Usage
## S3 method for class 'ext_solutions_df'
rbind(..., reset_indices = FALSE)
Arguments
... |
An arbitrary number of |
reset_indices |
If TRUE, re-labels the "solutions" indices in the solutions data frame from 1 to the number of defined settings. |
Value
An ext_solutions_df class object.
Row-binding of solutions data frame class objects
Description
Row-binding of solutions data frame class objects
Usage
## S3 method for class 'solutions_df'
rbind(..., reset_indices = FALSE)
Arguments
... |
An arbitrary number of |
reset_indices |
If TRUE, re-labels the "solutions" indices in the solutions data frame from 1 to the number of defined settings. |
Value
A solutions_df class object.
Row-binding of t_solutions_df class objects
Description
Vertically stack two or more t_solutions_df class objects.
Usage
## S3 method for class 't_solutions_df'
rbind(...)
Arguments
... |
An arbitrary number of |
Value
A t_solutions_df class object.
Row-bind weights matrices
Description
Vertically stack two or more weights_matrix class objects.
Usage
## S3 method for class 'weights_matrix'
rbind(...)
Arguments
... |
An arbitrary number of |
Value
A weights_matrix class object.
Remove observations with incomplete data from a data list-like list object
Description
Helper function during data_list class initialization. First applies stats::na.omit() to the data frames named "data" within a nested list. Then removes any observations that are not present across all data frames.
Usage
remove_dll_incomplete(dll)
Arguments
dll |
A data list-like |
Value
The provided dll with missing observations removed.
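The two steps described above can be sketched in base R. This is a minimal illustration, assuming each list element holds a data frame named "data" with an identifier column called "uid"; the toy data and the column name are assumptions made for the example.

```r
# Toy data list-like object: each element has a data frame named "data"
dll <- list(
    list(data = data.frame(uid = c("a", "b", "c"), x = c(1, NA, 3)),
         name = "x_df"),
    list(data = data.frame(uid = c("a", "c", "d"), y = c(5, 6, 7)),
         name = "y_df")
)

# Step 1: drop rows containing missing values in each data frame
dll <- lapply(dll, function(el) {
    el$data <- stats::na.omit(el$data)
    el
})

# Step 2: keep only observations present in every data frame
common_uids <- Reduce(intersect, lapply(dll, function(el) el$data$uid))
dll <- lapply(dll, function(el) {
    el$data <- el$data[el$data$uid %in% common_uids, , drop = FALSE]
    el
})
```

Here observation "b" is dropped because of its missing value and "d" is dropped because it appears in only one data frame, leaving "a" and "c" in both.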
Rename features in a data list
Description
Rename features in a data list
Usage
rename_dl(dl, name_mapping)
Arguments
dl |
A nested list of input data from |
name_mapping |
A named vector where the values are the features to be renamed and the names are the new names for those features. |
Value
A data list ("list"-class object) with adjusted feature names.
Examples
dl <- data_list(
list(pubertal, "pubertal_status", "demographics", "continuous"),
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
summary(dl, "feature")
name_changes <- c(
"anxiety_score" = "cbcl_anxiety_r",
"depression_score" = "cbcl_depress_r"
)
dl <- rename_dl(dl, name_changes)
summary(dl, "feature")
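The direction of name_mapping (names are the new names, values are the existing features) is easy to reverse by accident. The following base R sketch applies the same convention to a plain data frame; it illustrates the mapping only and is not how rename_dl() is implemented.

```r
# Toy data frame using the feature names from the example above
df <- data.frame(
    unique_id = c("a", "b"),
    cbcl_anxiety_r = c(1, 2),
    cbcl_depress_r = c(0, 1)
)

# Names are the NEW names; values are the EXISTING feature names
name_changes <- c(
    "anxiety_score" = "cbcl_anxiety_r",
    "depression_score" = "cbcl_depress_r"
)

# Locate each existing feature and overwrite its column name
idx <- match(name_changes, colnames(df))
colnames(df)[idx] <- names(name_changes)
```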
Reorder the uids in a data list
Description
Reorder the uids in a data list
Usage
reorder_dl_uids(dl, ordered_uids)
Arguments
dl |
A nested list of input data from |
ordered_uids |
A vector of the uid values in the data list in the desired order of the sorted data list. |
Value
A data list ("list"-class object) with reordered observations.
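As a rough illustration of what reordering by uid means, the base R sketch below reorders the rows of a single data frame with match(). The data frame and its "uid" column are toy assumptions; reorder_dl_uids() itself operates on a whole data list.

```r
df <- data.frame(
    uid = c("uid_1", "uid_2", "uid_3"),
    value = c(10, 20, 30)
)
ordered_uids <- c("uid_3", "uid_1", "uid_2")

# match() gives the row position of each requested uid, in the desired order
reordered <- df[match(ordered_uids, df$uid), ]
rownames(reordered) <- NULL
```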
Helper resampling function found in ?sample
Description
Like sample, but when given a single value x, returns back that single value instead of a random value from 1 to x.
Usage
resample(x, ...)
Arguments
x |
Vector or single value to sample from |
... |
Remaining arguments for base::sample function |
Value
Numeric vector result of running base::sample.
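The behaviour described here follows the pattern given in the examples of ?sample. The sketch below reproduces that pattern under the name `resample_sketch` (a name chosen for this illustration) to show why it differs from calling sample() directly on a length-one vector.

```r
# Pattern from the examples in ?sample: index into x rather than sampling x,
# so a single value is treated as a vector of length one
resample_sketch <- function(x, ...) x[sample.int(length(x), ...)]

set.seed(1)
single <- resample_sketch(10, 1)         # the only possible result is 10
multi <- resample_sketch(c(3, 7, 9), 2)  # two values drawn from the vector
```

By contrast, sample(10, 1) draws one value from 1 to 10, which is rarely what is intended when the vector of candidates happens to have length one.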
Run SNF
Description
Helper function for running a single SNF config pipeline.
Usage
run_snf(i, dl, sc, return_sim_mats, sim_mats_dir, p)
Arguments
i |
Row of settings_df and weights_matrix within SNF config to use. |
dl |
A nested list of input data from |
sc |
An |
return_sim_mats |
If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE. |
sim_mats_dir |
If specified, this directory will be used to save all generated similarity matrices. |
Value
A list containing a cluster solution (numeric vector) and the fused network used to create that cluster solution. The fused network is NULL if return_sim_mats is FALSE.
Save a heatmap object to a file
Description
Save a heatmap object to a file
Usage
save_heatmap(heatmap, path, width = 480, height = 480, res = 100)
Arguments
heatmap |
The heatmap object to save. |
path |
The path to save the heatmap to. |
width |
The width of the heatmap. |
height |
The height of the heatmap. |
res |
The resolution of the heatmap. |
Value
Does not return any value. Saves heatmap to file.
Adjust the diagonals of a matrix
Description
Adjust the diagonals of a matrix to reduce contrast with off-diagonals during plotting.
Usage
scale_diagonals(matrix, method = "mean")
Arguments
matrix |
Matrix to rescale. |
method |
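Method of rescaling. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0). |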
Value
A "matrix" class object with rescaled diagonals.
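A minimal base R sketch of diagonal rescaling follows. The three method names are taken from the scale_diag argument documented for similarity_matrix_heatmap(); `scale_diag_sketch` is a hypothetical helper for illustration, not the package's implementation.

```r
# Hypothetical helper illustrating the "none", "mean", and "zero" options
scale_diag_sketch <- function(m, method = "mean") {
    if (method == "mean") {
        # Replace the diagonal with the average of the off-diagonal values
        diag(m) <- mean(m[row(m) != col(m)])
    } else if (method == "zero") {
        diag(m) <- 0
    } else if (method != "none") {
        stop("Unknown method: ", method)
    }
    m
}

sim <- matrix(
    c(1.0, 0.2, 0.4,
      0.2, 1.0, 0.6,
      0.4, 0.6, 1.0),
    nrow = 3
)
scaled_mean <- scale_diag_sketch(sim, "mean")
scaled_zero <- scale_diag_sketch(sim, "zero")
```

Self-similarities on the diagonal are typically much larger than any off-diagonal value, so replacing them keeps the colour scale of a heatmap focused on the between-observation similarities.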
Build a settings data frame
Description
The settings_df is a data frame whose rows completely specify the hyperparameters and decisions required to transform individual input data frames (found in a data list, see ?data_list) into a single similarity matrix through SNF. The format of the settings data frame is as follows:
A column named "solution": This column is used to keep track of the rows and should have integer values only.
A column named "alpha": This column contains the value of the alpha hyperparameter that will be used on that run of the SNF pipeline.
A column named "k": Like above, but for the K (nearest neighbours) hyperparameter.
A column named "t": Like above, but for the t (number of iterations) hyperparameter.
A column named "snf_scheme": Which of 3 pre-defined schemes will be used to integrate the data frames of the data list into a final fused network. The purpose of varying these schemes is primarily to increase the diversity of the generated cluster solutions.
A value of 1 corresponds to the "individual" scheme, in which all data frames are directly merged by SNF into the final fused network. This scheme corresponds to the approach shown in the original SNF paper.
A value of 2 corresponds to the "two-step" scheme, in which all data frames within a domain are first merged into a domain-specific fused network. Next, domain-specific networks are fused once more by SNF into the final fused network. This scheme is useful for fairly re-weighting SNF pipelines with unequal numbers of data frames across domains.
A value of 3 corresponds to the "domain" scheme, in which all data frames within a domain are first concatenated into a single domain-specific data frame before being merged by SNF into the final fused network. This approach serves as an alternative way to re-weight SNF pipelines with unequal numbers of data frames across domains. You can learn more about this parameter here: https://branchlab.github.io/metasnf/articles/snf_schemes.html.
A column named "clust_alg": Specification of which clustering algorithm will be applied to the final similarity matrix. By default, this column can take on the integer values 1 or 2, which correspond to spectral clustering where the number of clusters is determined by the eigen-gap or rotation cost heuristic respectively. You can learn more about this parameter here: https://branchlab.github.io/metasnf/articles/clustering_algorithms.html.
A column named "cnt_dist": Specification of which distance metric will be used for data frames of purely continuous data. You can learn about this metric and its defaults here: https://branchlab.github.io/metasnf/articles/distance_metrics.html
A column named "dsc_dist": Like above, but for discrete data frames.
A column named "ord_dist": Like above, but for ordinal data frames.
A column named "cat_dist": Like above, but for categorical data frames.
A column named "mix_dist": Like above, but for mixed-type (e.g., both categorical and discrete) data frames.
One column for every input data frame in the corresponding data list which can either have the value of 0 or 1. The name of the column should be formatted as "inc_[]" where the square brackets are replaced with the name (as found in dl_summary(dl)$"name") of each data frame. When 0, that data frame will be excluded from that run of the SNF pipeline. When 1, that data frame will be included.
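To visualize the column layout described above, the sketch below builds a plain data frame by hand with the same column names. All values are made up for illustration and the two "inc_" columns assume a data list with inputs named "anxiety" and "depressed"; in practice the settings data frame is generated by settings_df() or snf_config() rather than written manually.

```r
# Hand-built stand-in for a two-row settings data frame (toy values)
toy_settings <- data.frame(
    solution = 1:2,
    alpha = c(0.5, 0.7),
    k = c(20, 35),
    t = c(20, 20),
    snf_scheme = c(1, 2),
    clust_alg = c(1, 2),
    cnt_dist = c(1, 1),
    dsc_dist = c(1, 1),
    ord_dist = c(1, 1),
    cat_dist = c(1, 1),
    mix_dist = c(1, 1),
    inc_anxiety = c(1, 0),
    inc_depressed = c(1, 1)
)

# Which inputs does the second row include?
inc_cols <- grep("^inc_", colnames(toy_settings), value = TRUE)
row2_flags <- unlist(toy_settings[2, inc_cols])
included_row2 <- sub("^inc_", "", inc_cols[row2_flags == 1])
```

In this toy example the second row excludes the "anxiety" input, so only "depressed" would be passed through SNF for that solution.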
Usage
settings_df(
dl,
n_solutions = 0,
min_removed_inputs = 0,
max_removed_inputs = length(dl) - 1,
dropout_dist = "exponential",
min_alpha = NULL,
max_alpha = NULL,
min_k = NULL,
max_k = NULL,
min_t = NULL,
max_t = NULL,
alpha_values = NULL,
k_values = NULL,
t_values = NULL,
possible_snf_schemes = c(1, 2, 3),
clustering_algorithms = NULL,
continuous_distances = NULL,
discrete_distances = NULL,
ordinal_distances = NULL,
categorical_distances = NULL,
mixed_distances = NULL,
dfl = NULL,
snf_input_weights = NULL,
snf_domain_weights = NULL,
retry_limit = 10,
allow_duplicates = FALSE
)
Arguments
dl |
A nested list of input data from |
n_solutions |
Number of rows to generate for the settings data frame. |
min_removed_inputs |
The smallest number of input data frames that may be randomly removed. By default, 0. |
max_removed_inputs |
The largest number of input data frames that may be randomly removed. By default, this is 1 less than all the provided input data frames in the data list. |
dropout_dist |
Parameter controlling how the random removal of input data frames should occur. Can be "none" (no input data frames are randomly removed), "uniform" (uniformly sample between min_removed_inputs and max_removed_inputs to determine number of input data frames to remove), or "exponential" (pick number of input data frames to remove by sampling from min_removed_inputs to max_removed_inputs with an exponential distribution; the default). |
min_alpha |
The minimum value that the alpha hyperparameter can have.
Random assigned value of alpha for each row will be obtained by uniformly
sampling numbers between |
max_alpha |
The maximum value that the alpha hyperparameter can have.
See |
min_k |
The minimum value that the k hyperparameter can have.
Random assigned value of k for each row will be obtained by uniformly
sampling numbers between |
max_k |
The maximum value that the k hyperparameter can have.
See |
min_t |
The minimum value that the t hyperparameter can have.
Random assigned value of t for each row will be obtained by uniformly
sampling numbers between |
max_t |
The maximum value that the t hyperparameter can have.
See |
alpha_values |
A number or numeric vector of a set of possible values
that alpha can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
k_values |
A number or numeric vector of a set of possible values
that k can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
t_values |
A number or numeric vector of a set of possible values
that t can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
possible_snf_schemes |
A vector containing the possible snf_schemes to uniformly randomly select from. By default, the vector contains all 3 possible schemes: c(1, 2, 3). 1 corresponds to the "individual" scheme, 2 corresponds to the "domain" scheme, and 3 corresponds to the "two-step" scheme. |
clustering_algorithms |
A list of clustering algorithms to uniformly randomly pick from when clustering. When not specified, randomly select between spectral clustering using the eigen-gap heuristic and spectral clustering using the rotation cost heuristic. See ?clust_fns_list for more details on running custom clustering algorithms. |
continuous_distances |
A vector of continuous distance metrics to use when a custom dist_fns_list is provided. |
discrete_distances |
A vector of discrete distance metrics to use when a custom dist_fns_list is provided. |
ordinal_distances |
A vector of ordinal distance metrics to use when a custom dist_fns_list is provided. |
categorical_distances |
A vector of categorical distance metrics to use when a custom dist_fns_list is provided. |
mixed_distances |
A vector of mixed distance metrics to use when a custom dist_fns_list is provided. |
dfl |
List containing distance metrics to vary over. See ?generate_dist_fns_list. |
snf_input_weights |
Nested list containing weights for when SNF is used to merge individual input measures (see ?generate_snf_weights) |
snf_domain_weights |
Nested list containing weights for when SNF is used to merge domains (see ?generate_snf_weights) |
retry_limit |
The maximum number of attempts to generate a novel row.
This function does not return matrices with identical rows. As the range of
requested possible settings tightens and the number of requested rows
increases, the risk of randomly generating a row that already exists
increases. If a new random row has matched an existing row |
allow_duplicates |
If TRUE, enables creation of a settings data frame with duplicate non-feature weighting related hyperparameters. This function should only be used when paired with a custom weights matrix that has non-duplicate rows. |
Value
A settings data frame
Launch a shiny app to identify meta cluster boundaries
Description
This function calls the htShiny() function from the package
InteractiveComplexHeatmap to assist users in identifying the indices of the
boundaries between meta clusters in a meta cluster heatmap. By providing a
heatmap of inter-solution similarities (obtained through
meta_cluster_heatmap()), users can click on positions within the heatmap
that appear to meaningfully separate major sets of similar cluster
solutions by visual inspection. The corresponding indices of the clicked
positions are printed to the console and also shown within the app. This
function can only run from an interactive session of R.
Usage
shiny_annotator(ari_heatmap)
Arguments
ari_heatmap |
Heatmap of ARIs to divide into meta clusters. |
Value
Does not return any value. Launches interactive shiny applet.
Examples
dl <- data_list(
list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
set.seed(42)
my_sc <- snf_config(
dl = dl,
n_solutions = 20,
min_k = 20,
max_k = 50
)
sol_df <- batch_snf(dl, my_sc)
sol_aris <- calc_aris(sol_df)
meta_cluster_order <- get_matrix_order(sol_aris)
ari_hm <- meta_cluster_heatmap(sol_aris, order = meta_cluster_order)
# Click on meta cluster boundaries to obtain `split_vec` values
shiny_annotator(ari_hm)
split_vec <- c(6, 10, 16)
ari_hm <- meta_cluster_heatmap(
sol_aris,
order = meta_cluster_order,
split_vector = split_vec
)
Create or extract a sim_mats_list class object
Description
Create or extract a sim_mats_list class object
Usage
sim_mats_list(x)
Arguments
x |
The object to create or extract a |
Value
A sim_mats_list class object.
Plot heatmap of similarity matrix
Description
Plot heatmap of similarity matrix
Usage
similarity_matrix_heatmap(
similarity_matrix,
order = NULL,
cluster_solution = NULL,
scale_diag = "mean",
log_graph = TRUE,
cluster_rows = FALSE,
cluster_columns = FALSE,
show_row_names = FALSE,
show_column_names = FALSE,
data = NULL,
left_bar = NULL,
right_bar = NULL,
top_bar = NULL,
bottom_bar = NULL,
left_hm = NULL,
right_hm = NULL,
top_hm = NULL,
bottom_hm = NULL,
annotation_colours = NULL,
min_colour = NULL,
max_colour = NULL,
split_vector = NULL,
row_split = NULL,
column_split = NULL,
...
)
Arguments
similarity_matrix |
A similarity matrix |
order |
Vector of numbers to reorder the similarity matrix (and data if provided). Overwrites ordering specified by cluster_solution param. |
cluster_solution |
Row of a solutions data frame or column of a transposed solutions data frame. |
scale_diag |
Method of rescaling matrix diagonals. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0). |
log_graph |
If TRUE, log transforms the graph. |
cluster_rows |
Parameter for ComplexHeatmap::Heatmap. |
cluster_columns |
Parameter for ComplexHeatmap::Heatmap. |
show_row_names |
Parameter for ComplexHeatmap::Heatmap. |
show_column_names |
Parameter for ComplexHeatmap::Heatmap. |
data |
A data frame containing elements requested for annotation. |
left_bar |
Named list of strings, where the strings are features in data that should be used for a barplot annotation on the left of the plot and the names are the names that will be used to caption the plots and their legends. |
right_bar |
See left_bar. |
top_bar |
See left_bar. |
bottom_bar |
See left_bar. |
left_hm |
Like left_bar, but with a heatmap annotation instead of a barplot annotation. |
right_hm |
See left_hm. |
top_hm |
See left_hm. |
bottom_hm |
See left_hm. |
annotation_colours |
Named list of heatmap annotations and their colours. |
min_colour |
Colour used for the lowest value in the heatmap. |
max_colour |
Colour used for the highest value in the heatmap. |
split_vector |
A vector of partition indices. |
row_split |
Standard parameter of |
column_split |
Standard parameter of |
... |
Additional parameters passed into ComplexHeatmap::Heatmap. |
Value
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the similarities between observations in the provided matrix.
Examples
my_dl <- data_list(
list(
data = expression_df,
name = "expression_data",
domain = "gene_expression",
type = "continuous"
),
list(
data = methylation_df,
name = "methylation_data",
domain = "gene_methylation",
type = "continuous"
),
uid = "patient_id"
)
sc <- snf_config(my_dl, n_solutions = 10)
sol_df <- batch_snf(my_dl, sc, return_sim_mats = TRUE)
sim_mats <- sim_mats_list(sol_df)
similarity_matrix_heatmap(
sim_mats[[1]],
cluster_solution = sol_df[1, ]
)
Generate a complete path and filename to store a similarity matrix
Description
Generate a complete path and filename to store a similarity matrix
Usage
similarity_matrix_path(similarity_matrix_dir, i)
Arguments
similarity_matrix_dir |
Directory to store similarity matrices. |
i |
Corresponding solution. |
Value
Complete path and filename to store a similarity matrix.
Squared (including weights) Euclidean distance
Description
Squared (including weights) Euclidean distance
Usage
siw_euclidean_distance(df, weights_row)
Arguments
df |
data frame containing at least 1 data column. |
weights_row |
Single-row data frame where the column names contain the column names in df and the row contains the corresponding weights. |
Value
A distance matrix.
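One plausible reading of "squared (including weights)" is that each feature is multiplied by its weight first and the resulting Euclidean distance is then squared, so the weights are squared along with the differences. The base R sketch below illustrates that reading only; it is an assumption made for this example and should not be taken as the exact computation in siw_euclidean_distance().

```r
# Toy data: two observations, two features
df <- data.frame(a = c(0, 3), b = c(0, 4))
# Single-row data frame of weights with matching column names
weights_row <- data.frame(a = 1, b = 2)

# Assumed interpretation: apply weights to features, then square the distance
w <- unlist(weights_row[1, colnames(df)])
weighted <- sweep(as.matrix(df), 2, w, "*")
dist_mat <- as.matrix(dist(weighted))^2

# Distance between observations 1 and 2: (1 * 3)^2 + (2 * 4)^2
d12 <- dist_mat[1, 2]
```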
Define configuration for generating a set of SNF-based cluster solutions
Description
snf_config() constructs an SNF config object which inherits from classes snf_config and list. This object is used to store all settings required to transform data stored in a data_list class object into a space of cluster solutions by SNF. The SNF config object contains the following components:
1. A settings data frame (inherits from settings_df and data.frame). Data frame that stores SNF-specific hyperparameters and information about feature selection and weighting, SNF schemes, clustering algorithms, and distance metrics. Each row of the settings data frame corresponds to a distinct cluster solution.
2. A clustering algorithms list (inherits from clust_fns_list and list), which stores all clustering algorithms that the settings data frame can point to.
3. A distance metrics list (inherits from dist_metrics_list and list), which stores all distance metrics that the settings data frame can point to.
4. A weights matrix (inherits from weights_matrix, matrix, and array), which stores the feature weights to use prior to distance calculations. Each column of the weights matrix corresponds to a different feature in the data list and each row corresponds to a different row in the settings data frame.
Usage
snf_config(
dl = NULL,
sdf = NULL,
dfl = NULL,
cfl = NULL,
wm = NULL,
n_solutions = 0,
min_removed_inputs = 0,
max_removed_inputs = length(dl) - 1,
dropout_dist = "exponential",
min_alpha = NULL,
max_alpha = NULL,
min_k = NULL,
max_k = NULL,
min_t = NULL,
max_t = NULL,
alpha_values = NULL,
k_values = NULL,
t_values = NULL,
possible_snf_schemes = c(1, 2, 3),
clustering_algorithms = NULL,
continuous_distances = NULL,
discrete_distances = NULL,
ordinal_distances = NULL,
categorical_distances = NULL,
mixed_distances = NULL,
snf_input_weights = NULL,
snf_domain_weights = NULL,
retry_limit = 10,
cnt_dist_fns = NULL,
dsc_dist_fns = NULL,
ord_dist_fns = NULL,
cat_dist_fns = NULL,
mix_dist_fns = NULL,
automatic_standard_normalize = FALSE,
use_default_dist_fns = FALSE,
clust_fns = NULL,
use_default_clust_fns = FALSE,
weights_fill = "ones"
)
Arguments
dl |
A nested list of input data from |
sdf |
A |
dfl |
A |
cfl |
A |
wm |
A |
n_solutions |
Number of rows to generate for the settings data frame. |
min_removed_inputs |
The smallest number of input data frames that may be randomly removed. By default, 0. |
max_removed_inputs |
The largest number of input data frames that may be randomly removed. By default, this is 1 less than all the provided input data frames in the data list. |
dropout_dist |
Parameter controlling how the random removal of input data frames should occur. Can be "none" (no input data frames are randomly removed), "uniform" (uniformly sample between min_removed_inputs and max_removed_inputs to determine number of input data frames to remove), or "exponential" (pick number of input data frames to remove by sampling from min_removed_inputs to max_removed_inputs with an exponential distribution; the default). |
min_alpha |
The minimum value that the alpha hyperparameter can have.
Random assigned value of alpha for each row will be obtained by uniformly
sampling numbers between |
max_alpha |
The maximum value that the alpha hyperparameter can have.
See |
min_k |
The minimum value that the k hyperparameter can have.
Random assigned value of k for each row will be obtained by uniformly
sampling numbers between |
max_k |
The maximum value that the k hyperparameter can have.
See |
min_t |
The minimum value that the t hyperparameter can have.
Random assigned value of t for each row will be obtained by uniformly
sampling numbers between |
max_t |
The maximum value that the t hyperparameter can have.
See |
alpha_values |
A number or numeric vector of a set of possible values
that alpha can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
k_values |
A number or numeric vector of a set of possible values
that k can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
t_values |
A number or numeric vector of a set of possible values
that t can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
possible_snf_schemes |
A vector containing the possible snf_schemes to uniformly randomly select from. By default, the vector contains all 3 possible schemes: c(1, 2, 3). 1 corresponds to the "individual" scheme, 2 corresponds to the "domain" scheme, and 3 corresponds to the "two-step" scheme. |
clustering_algorithms |
A list of clustering algorithms to uniformly randomly pick from when clustering. When not specified, randomly select between spectral clustering using the eigen-gap heuristic and spectral clustering using the rotation cost heuristic. See ?clust_fns_list for more details on running custom clustering algorithms. |
continuous_distances |
A vector of continuous distance metrics to use when a custom dist_fns_list is provided. |
discrete_distances |
A vector of discrete distance metrics to use when a custom dist_fns_list is provided. |
ordinal_distances |
A vector of ordinal distance metrics to use when a custom dist_fns_list is provided. |
categorical_distances |
A vector of categorical distance metrics to use when a custom dist_fns_list is provided. |
mixed_distances |
A vector of mixed distance metrics to use when a custom dist_fns_list is provided. |
snf_input_weights |
Nested list containing weights for when SNF is used to merge individual input measures (see ?generate_snf_weights) |
snf_domain_weights |
Nested list containing weights for when SNF is used to merge domains (see ?generate_snf_weights) |
retry_limit |
The maximum number of attempts to generate a novel row.
This function does not return matrices with identical rows. As the range of
requested possible settings tightens and the number of requested rows
increases, the risk of randomly generating a row that already exists
increases. If a new random row has matched an existing row |
cnt_dist_fns |
A named list of continuous distance metric functions. |
dsc_dist_fns |
A named list of discrete distance metric functions. |
ord_dist_fns |
A named list of ordinal distance metric functions. |
cat_dist_fns |
A named list of categorical distance metric functions. |
mix_dist_fns |
A named list of mixed distance metric functions. |
automatic_standard_normalize |
If TRUE, will automatically use standard normalization prior to calculation of any numeric distances. This parameter overrides all other distance functions list-related parameters. |
use_default_dist_fns |
If TRUE, prepend the base distance metrics (Euclidean distance for continuous, discrete, and ordinal data and Gower distance for categorical and mixed data) to the resulting distance metrics list. |
clust_fns |
A list of named clustering functions |
use_default_clust_fns |
If TRUE, prepend the base clustering algorithms (spectral_eigen and spectral_rot, which apply spectral clustering and use the eigen-gap and rotation cost heuristics respectively for determining the number of clusters in the graph) to clust_fns. |
weights_fill |
String indicating how to populate the generated rows. Can be "ones" (default; fill matrix with 1), "uniform" (fill matrix with uniformly distributed random values), or "exponential" (fill matrix with exponentially distributed random values). |
Value
An snf_config
class object.
Examples
# Simple random config for 5 cluster solutions
input_dl <- data_list(
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
my_sc <- snf_config(
dl = input_dl,
n_solutions = 5
)
# Specifying the possible range of K
my_sc <- snf_config(
dl = input_dl,
n_solutions = 5,
min_k = 20,
max_k = 40
)
# Random feature weights drawn from a uniform distribution
my_sc <- snf_config(
dl = input_dl,
n_solutions = 5,
min_k = 20,
max_k = 40,
weights_fill = "uniform"
)
# Specifying custom pre-built clustering and distance functions
# - Random alternation between 2-cluster and 5-cluster solutions
# - When continuous or discrete data frames are being processed,
# randomly alternate between standardized/normalized Euclidean
# distance and regular Euclidean distance
my_sc <- snf_config(
dl = input_dl,
n_solutions = 5,
min_k = 20,
max_k = 40,
weights_fill = "uniform",
clust_fns = list(
"two_cluster_spectral" = spectral_two,
"five_cluster_spectral" = spectral_five
),
cnt_dist_fns = list(
"euclidean" = euclidean_distance,
"std_nrm_euc" = sn_euclidean_distance
),
dsc_dist_fns = list(
"euclidean" = euclidean_distance,
"std_nrm_euc" = sn_euclidean_distance
)
)
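The retry_limit argument above describes a rejection loop: draw a random settings row, discard it if it duplicates an existing row, and give up after too many failed attempts. The base R sketch below illustrates that idea only. The helpers draw_row() and generate_unique_rows() are hypothetical and not part of the package, and both the per-row reset of the attempt counter and the error raised once the limit is exceeded are assumptions.

```r
set.seed(7)

# Hypothetical generator: only 3 x 2 = 6 distinct rows are possible
draw_row <- function() {
    c(k = sample(20:22, 1), t = sample(c(10, 20), 1))
}

# Keep drawing until n_rows distinct rows exist, or stop after
# retry_limit consecutive duplicates (assumed behaviour)
generate_unique_rows <- function(n_rows, retry_limit = 10) {
    rows <- list()
    attempts <- 0
    while (length(rows) < n_rows) {
        candidate <- draw_row()
        is_dup <- any(vapply(rows, identical, logical(1), candidate))
        if (is_dup) {
            attempts <- attempts + 1
            if (attempts > retry_limit) {
                stop("Retry limit reached before finding a novel row.")
            }
        } else {
            rows[[length(rows) + 1]] <- candidate
            attempts <- 0
        }
    }
    do.call(rbind, rows)
}

generate_unique_rows(4, retry_limit = 100)
```

Requesting more rows than the setting ranges allow (here, more than 6) can never succeed, which is the situation a retry limit guards against.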
SNF schemes
Description
These functions manage the way in which input data frames are passed into SNF to yield a final fused network.
Usage
two_step_merge(
dl,
k = 20,
alpha = 0.5,
t = 20,
cnt_dist_fn,
dsc_dist_fn,
ord_dist_fn,
cat_dist_fn,
mix_dist_fn,
weights_row
)
domain_merge(
dl,
cnt_dist_fn,
dsc_dist_fn,
ord_dist_fn,
cat_dist_fn,
mix_dist_fn,
weights_row,
k,
alpha,
t
)
individual(
dl,
k = 20,
alpha = 0.5,
t = 20,
cnt_dist_fn,
dsc_dist_fn,
ord_dist_fn,
cat_dist_fn,
mix_dist_fn,
weights_row
)
Arguments
dl |
A nested list of input data from |
k |
k hyperparameter. |
alpha |
alpha/eta/sigma hyperparameter. |
t |
SNF number of iterations hyperparameter. |
cnt_dist_fn |
distance metric function for continuous data. |
dsc_dist_fn |
distance metric function for discrete data. |
ord_dist_fn |
distance metric function for ordinal data. |
cat_dist_fn |
distance metric function for categorical data. |
mix_dist_fn |
distance metric function for mixed data. |
weights_row |
data frame row containing feature weights. |
Details
individual: The "vanilla" scheme. Converts each input data frame into a distance matrix separately, then fuses them into the final network with a single call to SNF.
domain_merge: Given a data list, returns a new data list where all data objects of a particular domain have been concatenated.
two_step_merge: Converts individual data frames into individual similarity matrices, fuses those into one network per domain, then fuses the per-domain networks into one final fused network.
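The three schemes differ only in the order of concatenation and fusion. The base R sketch below traces that data flow on toy matrices. It is not the package's implementation: to_sim() and toy_fuse() are hypothetical stand-ins, and plain element-wise averaging is used in place of the real SNF fusion step.

```r
set.seed(1)

# Toy inputs: three data frames (as matrices) over the same 6 observations
dfs <- list(
    a = matrix(rnorm(12), nrow = 6),
    b = matrix(rnorm(12), nrow = 6),
    c = matrix(rnorm(6), nrow = 6)
)
domains <- c(a = "imaging", b = "imaging", c = "demographics")
by_domain <- split(names(dfs), domains[names(dfs)])

# Stand-ins (assumptions): a simple distance-to-similarity conversion and
# element-wise averaging instead of SNF
to_sim <- function(x) exp(-as.matrix(dist(x)))
toy_fuse <- function(sims) Reduce(`+`, sims) / length(sims)

# individual: one similarity matrix per data frame, then a single fusion
individual_net <- toy_fuse(lapply(dfs, to_sim))

# domain merge: concatenate data frames within each domain, then fuse
merged <- lapply(by_domain, function(nms) do.call(cbind, unname(dfs[nms])))
domain_net <- toy_fuse(lapply(merged, to_sim))

# two-step: fuse within each domain, then fuse the per-domain networks
per_domain <- lapply(
    by_domain,
    function(nms) toy_fuse(lapply(dfs[nms], to_sim))
)
two_step_net <- toy_fuse(per_domain)
```

With unequal numbers of data frames per domain, the individual and two-step networks differ, because the two-step flow gives each domain equal say in the final fusion.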
Helper function for using the correct SNF scheme
Description
Helper function for using the correct SNF scheme
Usage
snf_step(
dl,
scheme,
k = 20,
alpha = 0.5,
t = 20,
cnt_dist_fn,
dsc_dist_fn,
ord_dist_fn,
cat_dist_fn,
mix_dist_fn,
weights_row
)
Arguments
dl |
A nested list of input data from |
scheme |
Which SNF system to use to achieve the final fused network. |
k |
k hyperparameter. |
alpha |
alpha/eta/sigma hyperparameter. |
t |
SNF number of iterations hyperparameter. |
cnt_dist_fn |
distance metric function for continuous data. |
dsc_dist_fn |
distance metric function for discrete data. |
ord_dist_fn |
distance metric function for ordinal data. |
cat_dist_fn |
distance metric function for categorical data. |
mix_dist_fn |
distance metric function for mixed data. |
weights_row |
data frame row containing feature weights. |
Value
A fused similarity network (matrix).
Helper function for organizing solutions df-like column order
Description
Reorders columns of a solutions data frame to "solution", "nclust", "mc", then all other column names.
Usage
sol_df_col_order(x)
Arguments
x |
Object with columns "solution", "nclust", and "mc". |
Value
x with column names reordered.
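As a rough illustration of the reordering described here, the base R sketch below moves "solution", "nclust", and "mc" to the front of a toy data frame. The data frame and the uid_* column names are made up, and this is not necessarily how the package performs the reordering.

```r
# Toy solutions-like data frame with its columns out of order
x <- data.frame(uid_1 = 1, mc = NA, uid_2 = 2, nclust = 2, solution = 1)

# Put the three leading columns first, then everything else in its
# existing order
first_cols <- c("solution", "nclust", "mc")
x <- x[, c(first_cols, setdiff(names(x), first_cols)), drop = FALSE]
names(x)
```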
Constructor for solutions_df
class object
Description
Constructor for solutions_df
class object
Usage
solutions_df(sol_dfl, smll, sc, dl)
Arguments
sol_dfl |
A solutions data frame-like object to be validated and converted into a solutions data frame. |
smll |
A similarity matrix list-like object to be validated and used to construct a solutions data frame. |
sc |
An |
dl |
An |
Value
A solutions data frame (solutions_df
class object).
Helper function to determine which row and columns to split on
Description
Helper function to determine which row and columns to split on
Usage
split_parser(
row_split_vector = NULL,
column_split_vector = NULL,
row_split = NULL,
column_split = NULL,
n_rows,
n_columns
)
Arguments
row_split_vector |
A vector of row indices to split on. |
column_split_vector |
A vector of column indices to split on. |
row_split |
Standard parameter of |
column_split |
Standard parameter of |
n_rows |
The number of rows in the data. |
n_columns |
The number of columns in the data. |
Value
"list"-class object containing row_split and column_split character vectors to pass into ComplexHeatmap::Heatmap.
Structure of an ari_matrix
object
Description
Structure of an ari_matrix
object
Usage
## S3 method for class 'ari_matrix'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of a clust_fns_list
object
Description
Structure of a clust_fns_list
object
Usage
## S3 method for class 'clust_fns_list'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of a data_list
object
Description
Structure of a data_list
object
Usage
## S3 method for class 'data_list'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of a dist_fns_list
object
Description
Structure of a dist_fns_list
object
Usage
## S3 method for class 'dist_fns_list'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of an ext_solutions_df
object
Description
Structure of an ext_solutions_df
object
Usage
## S3 method for class 'ext_solutions_df'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of a settings_df
object
Description
Structure of a settings_df
object
Usage
## S3 method for class 'settings_df'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of a sim_mats_list
object
Description
Structure of a sim_mats_list
object
Usage
## S3 method for class 'sim_mats_list'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of an snf_config
object
Description
Structure of an snf_config
object
Usage
## S3 method for class 'snf_config'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of a solutions_df
object
Description
Structure of a solutions_df
object
Usage
## S3 method for class 'solutions_df'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of a t_ext_solutions_df
object
Description
Structure of a t_ext_solutions_df
object
Usage
## S3 method for class 't_ext_solutions_df'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of a t_solutions_df
object
Description
Structure of a t_solutions_df
object
Usage
## S3 method for class 't_solutions_df'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Structure of a weights_matrix
object
Description
Structure of a weights_matrix
object
Usage
## S3 method for class 'weights_matrix'
str(object, ...)
Arguments
object |
A |
... |
Additional arguments (not used). |
Value
Does not return an object; outputs object structure to console.
Mock ABCD subcortical volumes data
Description
Like the mock data frame "abcd_subc_v", but with "unique_id" as the "uid".
Usage
subc_v
Format
subc_v
A data frame with 174 rows and 31 columns:
- unique_id
The unique identifier of the ABCD dataset
- ...
Subcortical volumes of various ROIs (units are believed to be mm^3)
Source
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Create subsamples of a data list
Description
Given a data list, return a list of smaller data lists that are generated
through random sampling (without replacement). The results of this function
can be passed into batch_snf_subsamples()
to obtain a list of resampled
solutions data frames.
Usage
subsample_dl(
dl,
n_subsamples,
subsample_fraction = NULL,
n_observations = NULL
)
Arguments
dl |
A nested list of input data from |
n_subsamples |
Number of subsamples to create. |
subsample_fraction |
Fraction (0 to 1) of patients to include per subsample. |
n_observations |
Number of patients to include per subsample. |
Value
A "list" class object containing n_subsamples
number of
data lists. Each of those data lists contains a random subsample_fraction
fraction of the observations of the provided data list.
Examples
my_dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
my_dl_subsamples <- subsample_dl(
my_dl,
n_subsamples = 20,
subsample_fraction = 0.85
)
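The core of the subsampling step is drawing observations without replacement. The base R sketch below does this on a made-up vector of UIDs rather than a real data list. Rounding the subsample size with round() is an assumption, not a documented detail of subsample_dl().

```r
set.seed(42)

uids <- paste0("uid_", 1:100)  # hypothetical observation IDs
n_subsamples <- 20
subsample_fraction <- 0.85

# Number of observations kept per subsample (rounding is an assumption)
n_keep <- round(length(uids) * subsample_fraction)

# Each subsample is an independent draw without replacement
subsamples <- lapply(
    seq_len(n_subsamples),
    function(i) sample(uids, size = n_keep, replace = FALSE)
)
lengths(subsamples)
```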
Calculate pairwise adjusted Rand indices across subsamples of data
Description
Given a list of subsampled solutions data frames from
batch_snf_subsamples(), this function calculates the adjusted Rand
indices across all the subsamples of each solution. ARI calculation between
two subsamples only factors in observations that were present in both
subsamples.
Usage
subsample_pairwise_aris(subsample_solutions, verbose = FALSE)
Arguments
subsample_solutions |
A list of solutions data frames from
subsamples of the data. This object is generated by the function
|
verbose |
If TRUE, output progress to console. |
Value
A two-item list: "raw_aris", a list of inter-subsample pairwise ARI matrices (one for each full cluster solution) and "ari_summary", a data frame containing the mean and SD of the inter-subsample ARIs for each original cluster solution.
Examples
my_dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
my_dl_subsamples <- subsample_dl(
my_dl,
n_subsamples = 20,
subsample_fraction = 0.85
)
batch_subsample_results <- batch_snf_subsamples(
my_dl_subsamples,
sc
)
pairwise_aris <- subsample_pairwise_aris(
batch_subsample_results,
verbose = TRUE
)
# Visualize ARIs
ComplexHeatmap::Heatmap(
pairwise_aris$"raw_aris"[[1]],
heatmap_legend_param = list(
color_bar = "continuous",
title = "Inter-Subsample\nARI",
at = c(0, 0.5, 1)
),
show_column_names = FALSE,
show_row_names = FALSE
)
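To make the "observations present in both subsamples" rule concrete, the base R sketch below computes the adjusted Rand index from a contingency table and applies it to the shared observations of two toy subsamples. The ari() helper and the cluster labels are illustrative only; the package may compute the index differently (mclust, which provides adjustedRandIndex(), is among its imports).

```r
# Adjusted Rand index from the contingency table of two label vectors
ari <- function(x, y) {
    tab <- table(x, y)
    comb2 <- function(n) n * (n - 1) / 2
    sum_ij <- sum(comb2(tab))
    sum_a <- sum(comb2(rowSums(tab)))
    sum_b <- sum(comb2(colSums(tab)))
    expected <- sum_a * sum_b / comb2(length(x))
    max_index <- (sum_a + sum_b) / 2
    (sum_ij - expected) / (max_index - expected)
}

# Toy cluster assignments from two subsamples, named by UID
sub1 <- c(u1 = 1, u2 = 1, u3 = 2, u4 = 2, u5 = 1)
sub2 <- c(u2 = 2, u3 = 1, u4 = 1, u5 = 2, u6 = 1)

# Only observations found in both subsamples enter the calculation
shared <- intersect(names(sub1), names(sub2))
ari(sub1[shared], sub2[shared])
```

Here the two subsamples group the shared observations identically, just with swapped labels, so the index reflects perfect agreement.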
Summarize a clust_fns_list object
Description
Summarize a clust_fns_list object
Usage
summarize_clust_fns_list(cfl)
Arguments
cfl |
A |
Value
A "data.frame" class object (summary_df) containing the name and index
of each clustering algorithm in the provided clust_fns_list.
Summarize a distance functions list
Description
Summarize a distance functions list
Usage
summarize_dfl(dfl)
Arguments
dfl |
A dist_fns_list. |
Value
"data.frame"-class object summarizing items in a distance metrics list.
Summarize a data list
Description
Defunct function for summarizing a data list. Please use summary() instead.
Usage
summarize_dl(data_list, scope = "component")
Arguments
data_list |
A nested list of input data from |
scope |
The level of detail for the summary. Options are:
|
Value
data.frame class object summarizing all components (or features if scope == "feature").
Summarize p-value columns of an extended solutions data frame
Description
Summarize p-value columns of an extended solutions data frame
Usage
summarize_pvals(ext_sol_df)
Arguments
ext_sol_df |
Result of |
Value
The provided extended solutions data frame with added columns for the minimum, mean, and maximum p-value in each row.
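The row-wise summary can be pictured with a toy data frame. In the base R sketch below, the "_pval" suffix used to find p-value columns and the names of the three added columns are assumptions made for illustration, not the package's documented naming.

```r
# Toy extended solutions-like data frame
ext_sol <- data.frame(
    solution = 1:3,
    nclust = c(2, 3, 2),
    age_pval = c(0.01, 0.20, 0.50),
    income_pval = c(0.03, 0.04, 0.90)
)

# Locate the p-value columns before adding any summary columns
pval_cols <- grep("_pval$", names(ext_sol), value = TRUE)
pvals <- as.matrix(ext_sol[, pval_cols, drop = FALSE])

# One summary value per row (per cluster solution)
ext_sol$min_pval <- apply(pvals, 1, min)
ext_sol$mean_pval <- rowMeans(pvals)
ext_sol$max_pval <- apply(pvals, 1, max)
ext_sol
```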
Summary method for class ari_matrix
Description
Provides a summary of the ari_matrix
class object, including the
distribution of the adjusted Rand index (ARI) values and the number of
solutions.
Usage
## S3 method for class 'ari_matrix'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
A named list containing the number of solutions and the distribution of ARI values.
Summary method for class clust_fns_list
Description
This summary function simply returns to the console the number of functions
contained in the clust_fns_list
object.
Usage
## S3 method for class 'clust_fns_list'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
Returns no value. Outputs a message to the console.
Summary method for class data_list
Description
Returns a data list summary (data.frame
class object) containing
information on components, features, variable types, domains, and component
dimensions.
Usage
## S3 method for class 'data_list'
summary(object, scope = "component", ...)
Arguments
object |
A |
scope |
The level of detail for the summary. By default, this is set to "component", which returns a summary of the data list at the component level. Can also be set to "feature", resulting in a summary at the feature level. |
... |
Other arguments passed to |
Value
A data.frame
class object. If scope
is "component", each row
shows the name, variable type, domain, and dimensions of each component. If
scope
is "feature", each row shows the name, variable type, and domain of
each feature.
Summary method for class dist_fns_list
Description
This summary function simply returns to the console the number of functions
contained in the dist_fns_list
object.
Usage
## S3 method for class 'dist_fns_list'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
Returns no value. Outputs a message to the console.
Summary method for class ext_solutions_df
Description
This summary function provides a summary of the ext_solutions_df
class
object, including the number of solutions, the distribution of the number of
clusters, the number of features, the number of observations, and the
distribution of p-values.
Usage
## S3 method for class 'ext_solutions_df'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
A named list containing the number of solutions, the distribution of the number of clusters, the number of features, the number of observations, and the distribution of p-values.
Summary method for class settings_df
Description
This summary function provides a summary of the settings_df
class
object, including the number of settings, the distribution of alpha values,
the distribution of k values, and the distribution of clustering functions.
Usage
## S3 method for class 'settings_df'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
A named list containing summary information of the settings data frame.
Summary method for class sim_mats_list
Description
This summary function simply returns to the console the number of similarity
matrices contained in the sim_mats_list
object.
Usage
## S3 method for class 'sim_mats_list'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
Returns no value. Outputs a message to the console.
Summary method for class snf_config
Description
This summary function provides a summary of the snf_config
class object,
including the settings data frame, clustering functions list, distance
functions list, and weights matrix.
Usage
## S3 method for class 'snf_config'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
A named list containing the summaries of objects within the config.
Summary method for class solutions_df
Description
This summary function provides a summary of the solutions_df
class
object, including the number of solutions, the distribution of the number of
clusters, and the number of observations.
Usage
## S3 method for class 'solutions_df'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
A named list containing the number of solutions, the distribution of the number of clusters, and the number of observations.
Summary method for class t_ext_solutions_df
Description
This summary function provides a summary of the t_ext_solutions_df
class
object, including the number of solutions, the distribution of the number of
clusters, the number of features, the number of observations, and the
distribution of p-values.
Usage
## S3 method for class 't_ext_solutions_df'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
A named list containing the number of solutions, the distribution of the number of clusters, the number of features, the number of observations, and the distribution of p-values.
Summary method for class t_solutions_df
Description
This summary function provides a summary of the t_solutions_df
class
object, including the number of solutions, the distribution of the number of
clusters, the number of features, the number of observations, and the
distribution of p-values.
Usage
## S3 method for class 't_solutions_df'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
A named list containing the number of solutions, the distribution of the number of clusters, the number of features, the number of observations, and the distribution of p-values.
Summary method for class weights_matrix
Description
This summary function provides a summary of the weights_matrix
class
object, including the minimum, maximum, mean, and standard deviation of the
feature weights.
Usage
## S3 method for class 'weights_matrix'
summary(object, ...)
Arguments
object |
A |
... |
Other arguments passed to |
Value
A named list containing the summary statistics of the weights matrix, the number of solutions, and the number of features.
Pull features used to calculate summary p-values from an object
Description
Pull features used to calculate summary p-values from an object
Usage
summary_features(x)
Arguments
x |
The object to extract summary features from. |
Value
A character vector of summary features.
Training and testing split
Description
Given a vector of UIDs and a training fraction, returns a list indicating which observations should be in the training set and which should be in the testing set. Assignment depends on whether the absolute value of each UID's Jenkins one_at_a_time hash exceeds the maximum possible value (2147483647) multiplied by the training fraction.
Usage
train_test_assign(train_frac, uids, seed = 42)
Arguments
train_frac |
The fraction (0 to 1) of observations for training |
uids |
A character vector of UIDs to be distributed into training and test sets. |
seed |
Seed used for Jenkins's one_at_a_time hash function. |
Value
A named list containing the training and testing UIDs.
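A hash-threshold split is deterministic: the same UID always lands in the same set regardless of input order. The base R sketch below illustrates the idea with a simple polynomial rolling hash as a stand-in for the Jenkins one_at_a_time hash. toy_hash() and toy_train_test_assign() are hypothetical helpers, and assigning hashes below the threshold to the training set is an assumption about direction.

```r
# Hypothetical stand-in hash: a polynomial rolling hash over the UID's
# code points, kept within [0, 2147483646] by the modulus
toy_hash <- function(uid, seed = 42) {
    modulus <- 2147483647
    h <- seed %% modulus
    for (code in utf8ToInt(uid)) {
        h <- (h * 31 + code) %% modulus
    }
    h
}

# Compare each UID's hash against max_hash * train_frac
toy_train_test_assign <- function(train_frac, uids, seed = 42) {
    max_hash <- 2147483647
    hashes <- vapply(
        uids, toy_hash, numeric(1),
        seed = seed, USE.NAMES = FALSE
    )
    in_train <- hashes < max_hash * train_frac
    list(train = uids[in_train], test = uids[!in_train])
}

uids <- paste0("uid_", 1:200)
split_1 <- toy_train_test_assign(0.8, uids)
split_2 <- toy_train_test_assign(0.8, rev(uids))

# Same membership even though the input order changed
setequal(split_1$train, split_2$train)
```

Because assignment depends only on the UID and the seed, adding new observations later does not move existing ones between sets.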
Pull UIDs from an object
Description
Pull UIDs from an object
Usage
uids(x)
Arguments
x |
The object to extract UIDs from. |
Value
A character vector of UIDs.
Validator for ari_matrix
class object
Description
Validator for ari_matrix
class object
Usage
validate_ari_matrix(aml)
Arguments
aml |
An ari_matrix-like matrix object to be validated. |
Value
If aml has a valid structure for an ari_matrix
class
object, returns the input unchanged. Otherwise, raises an error.
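The validators on this page share one contract: return the input unchanged when it is well formed, and raise an error otherwise. The base R sketch below shows that pattern for a matrix-like input. toy_validate_ari_matrix() is a hypothetical helper, and the specific checks (numeric, square, symmetric) are assumptions rather than the package's actual validation rules.

```r
# Return the input unchanged if it passes every check, else raise an error
toy_validate_ari_matrix <- function(aml) {
    if (!is.matrix(aml) || !is.numeric(aml)) {
        stop("Expected a numeric matrix.")
    }
    if (nrow(aml) != ncol(aml)) {
        stop("Expected a square matrix.")
    }
    if (!isSymmetric(unname(aml))) {
        stop("Expected a symmetric matrix.")
    }
    aml
}

good <- matrix(c(1, 0.4, 0.4, 1), nrow = 2)
validated <- toy_validate_ari_matrix(good)
identical(validated, good)
```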
Validator for clust_fns_list
class object
Description
Validator for clust_fns_list
class object
Usage
validate_clust_fns_list(cfll)
Arguments
cfll |
A clust_fns_list-like |
Value
If cfll has a valid structure for a clust_fns_list
class object,
returns the input unchanged. Otherwise, raises an error.
Validator for data_list class object
Description
Validator for data_list class object
Usage
validate_data_list(dll)
Arguments
dll |
A data list-like |
Value
If dll has a valid structure for a data_list
class object,
returns the input unchanged. Otherwise, raises an error.
Validator for dist_fns_list class object
Description
Validator for dist_fns_list class object
Usage
validate_dist_fns_list(dfll)
Arguments
dfll |
A distance metrics list-like list object to be validated. |
Value
If dfll has a valid structure for a dist_fns_list
class
object, returns the input unchanged. Otherwise, raises an error.
Validator for ext_solutions_df
class object
Description
Validator for ext_solutions_df
class object
Usage
validate_ext_solutions_df(ext_sol_dfl)
Arguments
ext_sol_dfl |
An extended solutions data frame-like object. |
Value
If ext_sol_dfl has a valid structure for an object of class ext_solutions_df, returns ext_sol_dfl. Otherwise, raises an error.
Validator for settings_df
class object
Description
Validator for settings_df
class object
Usage
validate_settings_df(sdfl)
Arguments
sdfl |
A settings data frame-like matrix object to be validated. |
Value
If sdfl has a valid structure for a settings_df
class object,
returns the input unchanged. Otherwise, raises an error.
Validator for similarity_matrix_list
class object
Description
Validator for similarity_matrix_list
class object
Usage
validate_sim_mats_list(smll)
Arguments
smll |
A similarity matrix list-like object. |
Value
If smll has a valid structure for class similarity_matrix_list,
returns smll. Otherwise, raises an error.
Validator for snf_config class object
Description
Validator for snf_config class object
Usage
validate_snf_config(scl)
Arguments
scl |
An SNF config-like |
Value
If scl has a valid structure for an snf_config
class object,
returns the input unchanged. Otherwise, raises an error.
Validator for solutions_df
class object
Description
Validator for solutions_df
class object
Usage
validate_solutions_df(sol_dfl)
Arguments
sol_dfl |
A solutions data frame-like object to be validated and converted into a solutions data frame. |
Value
If sol_dfl has a valid structure for a solutions_df
class object,
returns the input unchanged. Otherwise, raises an error.
Validator for weights_matrix
class object
Description
Validator for weights_matrix
class object
Usage
validate_weights_matrix(wml)
Arguments
wml |
A weights_matrix-like matrix object to be validated. |
Value
If wml has a valid structure for a weights_matrix
class
object, returns the input unchanged. Otherwise, raises an error.
Manhattan plot of feature-feature association p-values
Description
Manhattan plot of feature-feature association p-values
Usage
var_manhattan_plot(
dl,
key_var,
neg_log_pval_thresh = 5,
threshold = NULL,
point_size = 5,
text_size = 20,
plot_title = NULL,
hide_x_labels = FALSE,
bonferroni_line = FALSE
)
Arguments
dl |
List of data frames containing data information. |
key_var |
Feature for which the association p-values of all other features are plotted. |
neg_log_pval_thresh |
Threshold for negative log p-values. |
threshold |
p-value threshold to plot dashed line at. |
point_size |
Size of points in the plot. |
text_size |
Size of text in the plot. |
plot_title |
Title of the plot. |
hide_x_labels |
If TRUE, hides x-axis labels. |
bonferroni_line |
If TRUE, plots a dashed black line at the Bonferroni-corrected equivalent of the p-value threshold. |
Value
A Manhattan plot (class "gg", "ggplot") showing the association p-values of features against one key feature in a data list.
Examples
dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
var_manhattan <- var_manhattan_plot(
dl,
key_var = "household_income",
plot_title = "Correlation of Features with Household Income",
text_size = 16,
neg_log_pval_thresh = 3,
threshold = 0.05
)
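For the bonferroni_line argument, the arithmetic behind a Bonferroni-corrected threshold is simple division by the number of tests. The base R sketch below computes where the two dashed lines would sit, assuming the y-axis is -log10(p) and that the number of tests equals the number of features compared against key_var. Both points are assumptions, and n_tests is a made-up value.

```r
threshold <- 0.05
n_tests <- 40  # hypothetical number of features tested against key_var

# Bonferroni correction: divide the nominal threshold by the test count
bonferroni_threshold <- threshold / n_tests

# Heights of the two lines on a -log10(p) axis (assumed scale)
line_heights <- -log10(
    c(nominal = threshold, bonferroni = bonferroni_threshold)
)
line_heights
```

The corrected line always sits above the nominal one, so fewer features clear it.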
Generate a matrix to store feature weights
Description
Function for building a weights matrix independently of an SNF config. The
weights matrix contains one row corresponding to each row of the settings
data frame in an SNF config (one row for each resulting cluster solution)
and one column for each feature in the data list used for clustering. Values
of the weights matrix are passed to distance metrics functions during the
conversion of input data frames to distance matrices. Typically, there is no
need to use this function directly. Instead, users should provide weights
matrix-building parameters to the snf_config()
function.
Usage
weights_matrix(dl = NULL, n_solutions = 1, weights_fill = "ones")
Arguments
dl |
A nested list of input data from |
n_solutions |
Number of rows to generate the template weights matrix for. |
weights_fill |
String indicating how to populate the generated rows. Can be "ones" (default; fill matrix with 1), "uniform" (fill matrix with uniformly distributed random values), or "exponential" (fill matrix with exponentially distributed random values). |
Value
A properly formatted matrix (wm) containing one column for each feature that requires a weight and one row for each requested solution (n_solutions rows).
Examples
input_dl <- data_list(
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
sc <- snf_config(input_dl, n_solutions = 5)
wm <- weights_matrix(input_dl, n_solutions = 5, weights_fill = "uniform")
# updating an SNF config in parts
sc$"weights_matrix" <- wm
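The three weights_fill options can be pictured with a small base R stand-in. toy_weights_matrix() below is a hypothetical helper that works on a plain vector of feature names, and the distribution parameters (uniform on 0 to 1, exponential with rate 1) are assumptions rather than documented package defaults.

```r
# Build an n_solutions x n_features matrix filled according to weights_fill
toy_weights_matrix <- function(features, n_solutions = 1,
                               weights_fill = "ones") {
    n <- n_solutions * length(features)
    values <- switch(
        weights_fill,
        ones = rep(1, n),
        uniform = stats::runif(n),
        exponential = stats::rexp(n),
        stop("Unknown weights_fill: ", weights_fill)
    )
    matrix(values, nrow = n_solutions, dimnames = list(NULL, features))
}

set.seed(3)
features <- c("volume", "income", "pubertal_status")  # made-up names
wm_ones <- toy_weights_matrix(features, n_solutions = 2)
wm_unif <- toy_weights_matrix(features, n_solutions = 2, weights_fill = "uniform")
wm_exp <- toy_weights_matrix(features, n_solutions = 2, weights_fill = "exponential")
```

Each row then supplies the feature weights for one cluster solution.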