Help for package metasnf

Title:

Meta Clustering with Similarity Network Fusion

Version:

2.1.2

Description:

Framework to facilitate patient subtyping with similarity network fusion and meta clustering. The similarity network fusion (SNF) algorithm was introduced by Wang et al. (2014) in <doi:10.1038/nmeth.2810>. SNF is a data integration approach that can transform high-dimensional and diverse data types into a single similarity network suitable for clustering with minimal loss of information from each initial data source. The meta clustering approach was introduced by Caruana et al. (2006) in <doi:10.1109/ICDM.2006.103>. Meta clustering involves generating a wide range of cluster solutions by adjusting clustering hyperparameters, then clustering the solutions themselves into a manageable number of qualitatively similar solutions, and finally characterizing representative solutions to find ones that are best for the user's specific context. This package provides a framework to easily transform multi-modal data into a wide range of similarity network fusion-derived cluster solutions as well as to visualize, characterize, and validate those solutions. Core package functionality includes easy customization of distance metrics, clustering algorithms, and SNF hyperparameters to generate diverse clustering solutions; calculation and plotting of associations between features, between patients, and between cluster solutions; and standard cluster validation approaches including resampled measures of cluster stability, standard metrics of cluster quality, and label propagation to evaluate generalizability in unseen data. Associated vignettes guide the user through using the package to identify patient subtypes while adhering to best practices for unsupervised learning.

License:

GPL (≥ 3)

Encoding:

UTF-8

RoxygenNote:

7.3.2

Imports:

cli, cluster, data.table, digest, dplyr, ggplot2, grDevices, MASS, mclust, methods, progressr, purrr, RColorBrewer, rlang, SNFtool, stats, tibble, tidyr, utils

Suggests:

circlize, ComplexHeatmap, InteractiveComplexHeatmap, clv, future, future.apply, knitr, rmarkdown, testthat (≥ 3.0.0), ggalluvial, lifecycle, dbscan

Config/testthat/edition:

Depends:

R (≥ 4.1.0)

LazyData:

true

VignetteBuilder:

knitr

URL:

https://branchlab.github.io/metasnf/, https://github.com/BRANCHlab/metasnf/

BugReports:

https://github.com/BRANCHlab/metasnf/issues

NeedsCompilation:

Packaged:

2025-04-28 14:53:31 UTC; prashanth

Author:

Prashanth S Velayudhan [aut, cre], Xiaoqiao Xu [aut], Prajkta Kallurkar [aut], Ana Patricia Balbon [aut], Maria T Secara [aut], Adam Taback [aut], Denise Sabac [aut], Nicholas Chan [aut], Shihao Ma [aut], Bo Wang [aut], Daniel Felsky [aut], Stephanie H Ameis [aut], Brian Cox [aut], Colin Hawco [aut], Lauren Erdman [aut], Anne L Wheeler [aut, ths]

Maintainer:

Prashanth S Velayudhan <psvelayu@gmail.com>

Repository:

CRAN

Date/Publication:

2025-04-28 18:20:02 UTC

metasnf: Meta Clustering with Similarity Network Fusion

Description

Framework to facilitate patient subtyping with similarity network fusion and meta clustering. The similarity network fusion (SNF) algorithm was introduced by Wang et al. (2014) in doi:10.1038/nmeth.2810. SNF is a data integration approach that can transform high-dimensional and diverse data types into a single similarity network suitable for clustering with minimal loss of information from each initial data source. The meta clustering approach was introduced by Caruana et al. (2006) in doi:10.1109/ICDM.2006.103. Meta clustering involves generating a wide range of cluster solutions by adjusting clustering hyperparameters, then clustering the solutions themselves into a manageable number of qualitatively similar solutions, and finally characterizing representative solutions to find ones that are best for the user's specific context. This package provides a framework to easily transform multi-modal data into a wide range of similarity network fusion-derived cluster solutions as well as to visualize, characterize, and validate those solutions. Core package functionality includes easy customization of distance metrics, clustering algorithms, and SNF hyperparameters to generate diverse clustering solutions; calculation and plotting of associations between features, between patients, and between cluster solutions; and standard cluster validation approaches including resampled measures of cluster stability, standard metrics of cluster quality, and label propagation to evaluate generalizability in unseen data. Associated vignettes guide the user through using the package to identify patient subtypes while adhering to best practices for unsupervised learning.

Author(s)

Maintainer: Prashanth S Velayudhan psvelayu@gmail.com

Authors:

Xiaoqiao Xu
Prajkta Kallurkar
Ana Patricia Balbon
Maria T Secara
Adam Taback
Denise Sabac
Nicholas Chan
Shihao Ma
Bo Wang
Daniel Felsky
Stephanie H Ameis
Brian Cox
Colin Hawco
Lauren Erdman
Anne L Wheeler anne.wheeler@sickkids.ca [thesis advisor]

Mock ABCD anxiety data

Description

A randomly shuffled and anonymized copy of anxiety data from the NIMH Data archive. The original file used was pdem02.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_cbcl_anxiety.

Usage

abcd_anxiety

Format

`abcd_anxiety`

A data frame with 275 rows and 2 columns:

patient: The unique identifier of the ABCD dataset
cbcl_anxiety_r: Ordinal value of impairment on CBCL anxiety, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.

Mock ABCD "colour" data

Description

A randomly shuffled and anonymized copy of depression data from the NIMH Data archive. The original file used was pdem02.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_cbcl_depress. The data was transformed into categorical colour values to demonstrate the Chi-squared test capabilities of extend_solutions.

Usage

abcd_colour

Format

`abcd_colour`

A data frame with 275 rows and 2 columns:

patient: The unique identifier of the ABCD dataset
colour: Categorical transformation of cbcl_depress.

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Mock ABCD cortical surface area data

Description

A randomly shuffled and anonymized copy of cortical surface area data from the NIMH Data archive. The original file used was mrisdp10201.txt The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_cort_t.

Usage

abcd_cort_sa

Format

`abcd_cort_sa`

A data frame with 188 rows and 152 columns:

patient: The unique identifier of the ABCD dataset
...: Cortical surface areas of various ROIs (mm^2, I think)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Mock ABCD cortical thickness data

Description

A randomly shuffled and anonymized copy of cortical thickness data from the NIMH Data archive. The original file used was mrisdp10201.txt The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_cort_t.

Usage

abcd_cort_t

Format

`abcd_cort_t`

A data frame with 188 rows and 152 columns:

patient: The unique identifier of the ABCD dataset
...: Cortical thicknesses of various ROIs (mm^3, I think)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Mock ABCD depression data

Description

Usage

abcd_depress

Format

`abcd_depress`

A data frame with 275 rows and 2 columns:

patient: The unique identifier of the ABCD dataset
cbcl_depress_r: Ordinal value of impairment on CBCL anxiety, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Mock ABCD income data

Description

Like abcd_income, but with no NAs in patient column

Usage

abcd_h_income

Format

`abcd_income`

A data frame with 300 rows and 2 columns:

patient: The unique identifier of the ABCD dataset
household_income: Household income in 3 category levels (low = 1, medium = 2, high = 3)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Mock ABCD income data

Description

A randomly shuffled and anonymized copy of income data from the NIMH Data archive. The original file used was pdem02.txt The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_income.

Usage

abcd_income

Format

`abcd_income`

A data frame with 300 rows and 2 columns:

patient: The unique identifier of the ABCD dataset
household_income: Household income in 3 category levels (low = 1, medium = 2, high = 3)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Mock ABCD pubertal status data

Description

A randomly shuffled and anonymized copy of pubertal status data from the NIMH Data archive. The original files used were abcd_ssphp01.txt and abcd_ssphy01.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_pubertal_status.

Usage

abcd_pubertal

Format

`abcd_pubertal`

A data frame with 275 rows and 2 columns:

patient: The unique identifier of the ABCD dataset
pubertal_status: Average reported pubertal status between child and parent (1-5 categorical scale)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Mock ABCD subcortical volumes data

Description

A randomly shuffled and anonymized copy of subcortical volume data from the NIMH Data archive. The original file used was smrip10201.txt The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_subc_v.

Usage

abcd_subc_v

Format

`abcd_subc_v`

A data frame with 174 rows and 31 columns:

patient: The unique identifier of the ABCD dataset
...: Subcortical volumes of various ROIs (mm^3, I think)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Add columns to a data frame

Description

Add new columns to a data frame by specifying their names and a value to initialize them with.

Usage

add_columns(df, cols, value = NA)

Arguments

df

The data frame to extend.

cols

The vector containing new column names.

value

The values stored in the newly added columns. NA by default.

Value

A data frame containing with the same columns as the df argument as well as the new columns specified in the cols argument.

Add rows to a settings_df

Description

Add rows to a settings_df

Usage

add_settings_df_rows(
  sdf,
  n_solutions = 0,
  min_removed_inputs = 0,
  max_removed_inputs = sum(startsWith(colnames(sdf), "inc_")) - 1,
  dropout_dist = "exponential",
  min_alpha = NULL,
  max_alpha = NULL,
  min_k = NULL,
  max_k = NULL,
  min_t = NULL,
  max_t = NULL,
  alpha_values = NULL,
  k_values = NULL,
  t_values = NULL,
  possible_snf_schemes = c(1, 2, 3),
  clustering_algorithms = NULL,
  continuous_distances = NULL,
  discrete_distances = NULL,
  ordinal_distances = NULL,
  categorical_distances = NULL,
  mixed_distances = NULL,
  dfl = NULL,
  snf_input_weights = NULL,
  snf_domain_weights = NULL,
  retry_limit = 10,
  allow_duplicates = FALSE
)

Arguments

sdf

The existing settings data frame

n_solutions

Number of rows to generate for the settings data frame.

min_removed_inputs

The smallest number of input data frames that may be randomly removed. By default, 0.

max_removed_inputs

The largest number of input data frames that may be randomly removed. By default, this is 1 less than all the provided input data frames in the data list.

dropout_dist

Parameter controlling how the random removal of input data frames should occur. Can be "none" (no input data frames are randomly removed), "uniform" (uniformly sample between min_removed_inputs and max_removed_inputs to determine number of input data frames to remove), or "exponential" (pick number of input data frames to remove by sampling from min_removed_inputs to max_removed_inputs with an exponential distribution; the default).

min_alpha

The minimum value that the alpha hyperparameter can have. Random assigned value of alpha for each row will be obtained by uniformly sampling numbers between min_alpha and max_alpha at intervals of 0.1. Cannot be used in conjunction with the alpha_values parameter.

max_alpha

The maximum value that the alpha hyperparameter can have. See min_alpha parameter. Cannot be used in conjunction with the alpha_values parameter.

min_k

The minimum value that the k hyperparameter can have. Random assigned value of k for each row will be obtained by uniformly sampling numbers between min_k and max_k at intervals of 1. Cannot be used in conjunction with the k_values parameter.

max_k

The maximum value that the k hyperparameter can have. See min_k parameter. Cannot be used in conjunction with the k_values parameter.

min_t

The minimum value that the t hyperparameter can have. Random assigned value of t for each row will be obtained by uniformly sampling numbers between min_t and max_t at intervals of 1. Cannot be used in conjunction with the t_values parameter.

max_t

The maximum value that the t hyperparameter can have. See min_t parameter. Cannot be used in conjunction with the t_values parameter.

alpha_values

A number or numeric vector of a set of possible values that alpha can take on. Value will be obtained by uniformly sampling the vector. Cannot be used in conjunction with the min_alpha or max_alpha parameters.

k_values

A number or numeric vector of a set of possible values that k can take on. Value will be obtained by uniformly sampling the vector. Cannot be used in conjunction with the min_k or max_k parameters.

t_values

A number or numeric vector of a set of possible values that t can take on. Value will be obtained by uniformly sampling the vector. Cannot be used in conjunction with the min_t or max_t parameters.

possible_snf_schemes

A vector containing the possible snf_schemes to uniformly randomly select from. By default, the vector contains all 3 possible schemes: c(1, 2, 3). 1 corresponds to the "individual" scheme, 2 corresponds to the "domain" scheme, and 3 corresponds to the "two-step" scheme.

clustering_algorithms

A list of clustering algorithms to uniformly randomly pick from when clustering. When not specified, randomly select between spectral clustering using the eigen-gap heuristic and spectral clustering using the rotation cost heuristic. See ?clust_fns_list for more details on running custom clustering algorithms.

continuous_distances

A vector of continuous distance metrics to use when a custom dist_fns_list is provided.

discrete_distances

A vector of categorical distance metrics to use when a custom dist_fns_list is provided.

ordinal_distances

A vector of categorical distance metrics to use when a custom dist_fns_list is provided.

categorical_distances

A vector of categorical distance metrics to use when a custom dist_fns_list is provided.

mixed_distances

A vector of mixed distance metrics to use when a custom dist_fns_list is provided.

dfl

List containing distance metrics to vary over. See ?generate_dist_fns_list.

snf_input_weights

Nested list containing weights for when SNF is used to merge individual input measures (see ?generate_snf_weights)

snf_domain_weights

Nested list containing weights for when SNF is used to merge domains (see ?generate_snf_weights)

retry_limit

The maximum number of attempts to generate a novel row. This function does not return matrices with identical rows. As the range of requested possible settings tightens and the number of requested rows increases, the risk of randomly generating a row that already exists increases. If a new random row has matched an existing row retry_limit number of times, the function will terminate.

allow_duplicates

If TRUE, enables creation of a settings data frame with duplicate non-feature weighting related hyperparameters. This function should only be used when paired with a custom weights matrix that has non-duplicate rows.

Value

A settings data frame

Heatmap of pairwise adjusted rand indices between solutions

Description

Defunct function to create an ARI heatmap. Please use meta_cluster_heatmap() instead.

Usage

adjusted_rand_index_heatmap(
  aris,
  order = NULL,
  cluster_rows = FALSE,
  cluster_columns = FALSE,
  log_graph = FALSE,
  scale_diag = "none",
  min_colour = "#282828",
  max_colour = "firebrick2",
  col = circlize::colorRamp2(c(min(aris), max(aris)), c(min_colour, max_colour)),
  ...
)

Arguments

aris

Matrix of adjusted rand indices from calc_aris()

order

Numeric vector containing row order of the heatmap.

cluster_rows

Whether rows should be clustered.

cluster_columns

Whether columns should be clustered.

log_graph

If TRUE, log transforms the graph.

scale_diag

Method of rescaling matrix diagonals. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0).

min_colour

Colour used for the lowest value in the heatmap.

max_colour

Colour used for the highest value in the heatmap.

col

Colour ramp to use for the heatmap.

...

Additional parameters passed to similarity_matrix_heatmap(), the function that this function wraps.

Value

Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the pairwise adjusted Rand indices (similarities) between the cluster solutions of the provided solutions data frame.

Mock age data

Description

Mock age data

Usage

age_df

Format

`age_df`

A data frame with 200 rows and 2 columns:

patient_id: Random three-digit number uniquely identifying the patient
age: Mock age feature

Source

This data came from the SNFtool package, with slight modifications.

Alluvial plot of patients across cluster counts and important features

Description

This function creates an alluvial plot that shows how observations in a similarity matrix could have been clustered over a set of clustering functions.

Usage

alluvial_cluster_plot(
  cluster_sequence,
  similarity_matrix,
  dl = NULL,
  data = NULL,
  key_outcome,
  key_label = key_outcome,
  extra_outcomes = NULL,
  title = NULL
)

Arguments

cluster_sequence

A list of clustering algorithms.

similarity_matrix

A similarity matrix.

dl

A data list.

data

A data frame that contains any features to include in the plot.

key_outcome

The name of the feature that determines how each patient stream is coloured in the alluvial plot.

key_label

Name of key outcome to be used for the plot legend.

extra_outcomes

Names of additional features to add to the plot.

title

Title of the plot.

Value

An alluvial plot (class "gg" and "ggplot") showing distribution of a feature across varying number cluster solutions.

Examples

input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)

sc <- snf_config(input_dl, n_solutions = 1)

sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)

sim_mats <- sim_mats_list(sol_df)

clust_fn_sequence <- list(spectral_two, spectral_four)

alluvial_cluster_plot(
    cluster_sequence = clust_fn_sequence,
    similarity_matrix = sim_mats[[1]],
    dl = input_dl,
    key_outcome = "gender",
    key_label = "Gender",
    extra_outcomes = "diagnosis",
    title = "Gender Across Cluster Counts"
)

Mock ABCD anxiety data

Description

Like the mock data frame "abcd_colour", but with "unique_id" as the "uid".

Usage

anxiety

Format

`anxiety`

A data frame with 275 rows and 2 columns:

unique_id: The unique identifier of the ABCD dataset
cbcl_anxiety_r: Ordinal value of impairment on CBCL anxiety, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Sort data frames in a data list by their unique ID values

Description

Sort data frames in a data list by their unique ID values

Usage

arrange_dll(dll)

Arguments

dll

A data list-like list class object.

Value

The data list-like object with all data frames sorted by uid.

Coerce a `data_list` class object into a `data.frame` class object

Description

Horizontally joins data frames within a data list into a single data frame, using the uid attribute as the joining key.

Usage

## S3 method for class 'data_list'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)

Arguments

x

A data_list class object.

row.names

Additional parameter passed to as.data.frame().

optional

Additional parameter passed to as.data.frame().

...

Additional parameter passed to as.data.frame().

Value

dl_df A data.frame class object with all the features and observations of dl.

Coerce a `ext_solutions_df` class object into a `data.frame` class object

Description

Coerce a ext_solutions_df class object into a data.frame class object

Usage

## S3 method for class 'ext_solutions_df'
as.data.frame(
  x,
  row.names = NULL,
  optional = FALSE,
  keep_attributes = FALSE,
  ...
)

Arguments

x

A ext_solutions_df class object.

row.names

Additional parameter passed to as.data.frame().

optional

Additional parameter passed to as.data.frame().

keep_attributes

If TRUE, resulting data frame includes settings data frame and weights matrix.

...

Additional parameter passed to as.data.frame().

Value

A data.frame class object with all the columns of x and its contained solutions data frame.

Coerce a `settings_df` class object into a `data.frame` class object

Description

Coerce a settings_df class object into a data.frame class object

Usage

## S3 method for class 'settings_df'
as.data.frame(x, ...)

Arguments

x

A settings_df class object.

...

Additional parameter passed to as.data.frame().

Value

A data.frame class object with all the columns of x and its contained solutions data frame.

Coerce a `settings_df` class object into a `data.frame` class object

Description

Coerce a settings_df class object into a data.frame class object

Usage

## S3 method for class 'snf_config'
as.data.frame(x, ...)

Arguments

x

A settings_df class object.

...

Additional parameter passed to as.data.frame().

Value

A data.frame class object with all the columns of x and its contained solutions data frame.

Coerce a `solutions_df` class object into a `data.frame` class object

Description

Coerce a solutions_df class object into a data.frame class object

Usage

## S3 method for class 'solutions_df'
as.data.frame(
  x,
  row.names = NULL,
  optional = FALSE,
  keep_attributes = FALSE,
  ...
)

Arguments

x

A solutions_df class object.

row.names

Additional parameter passed to as.data.frame().

optional

Additional parameter passed to as.data.frame().

keep_attributes

If TRUE, resulting data frame includes settings data frame and weights matrix.

...

Additional parameter passed to as.data.frame().

Value

A data.frame class object with all the columns of x and its contained solutions data frame.

Coerce a `t_ext_solutions_df` class object into a `data.frame` class object

Description

Coerce a t_ext_solutions_df class object into a data.frame class object

Usage

## S3 method for class 't_ext_solutions_df'
as.data.frame(x, ...)

Arguments

x

A t_ext_solutions_df class object.

...

Additional parameter passed to as.data.frame().

Value

A data.frame class object with all the columns of x and its contained solutions data frame.

Coerce a `t_solutions_df` class object into a `data.frame` class object

Description

Coerce a t_solutions_df class object into a data.frame class object

Usage

## S3 method for class 't_solutions_df'
as.data.frame(x, ...)

Arguments

x

A t_solutions_df class object.

...

Additional parameter passed to as.data.frame().

Value

A data.frame class object with all the columns of x and its contained solutions data frame.

Coerce a `weights_matrix` class object into a `data.frame` class object

Description

Coerce a weights_matrix class object into a data.frame class object

Usage

## S3 method for class 'weights_matrix'
as.data.frame(x, ...)

Arguments

x

A weights_matrix class object.

...

Additional parameter passed to as.data.frame().

Value

A data.frame class object with all the columns of x and its contained solutions data frame.

Coerce a `clust_fns_list` class object into a `list` class object

Description

Coerce a clust_fns_list class object into a list class object

Usage

## S3 method for class 'clust_fns_list'
as.list(x, ...)

Arguments

x

A clust_fns_list class object.

...

Additional parameter passed to as.list().

Value

A list class object with all the functions of x.

Coerce a `data_list` class object into a `list` class object

Description

Coerce a data_list class object into a list class object

Usage

## S3 method for class 'data_list'
as.list(x, ...)

Arguments

x

A data_list class object.

...

Additional parameter passed to as.list().

Value

A list class object with all the objects of x.

Coerce a `dist_fns_list` class object into a `list` class object

Description

Coerce a dist_fns_list class object into a list class object

Usage

## S3 method for class 'dist_fns_list'
as.list(x, ...)

Arguments

x

A dist_fns_list class object.

...

Additional parameter passed to as.list().

Value

A list class object with all the functions of x.

Coerce a `sim_mats_list` class object into a `list` class object

Description

Coerce a sim_mats_list class object into a list class object

Usage

## S3 method for class 'sim_mats_list'
as.list(x, ...)

Arguments

x

A sim_mats_list class object.

...

Additional parameter passed to as.list().

Value

A list class object with all the functions of x.

Coerce a `snf_config` class object into a `list` class object

Description

Coerce a snf_config class object into a list class object

Usage

## S3 method for class 'snf_config'
as.list(x, ...)

Arguments

x

A snf_config class object.

...

Additional parameter passed to as.list().

Value

A list class object with all the functions of x.

Coerce a `ari_matrix` class object into a `matrix` class object

Description

Coerce a ari_matrix class object into a matrix class object

Usage

## S3 method for class 'ari_matrix'
as.matrix(x, ...)

Arguments

x

A ari_matrix class object.

...

Additional parameter passed to as.matrix().

Value

A matrix and array class object.

Coerce a `weights_matrix` class object into a `matrix` class object

Description

Coerce a weights_matrix class object into a matrix class object

Usage

## S3 method for class 'weights_matrix'
as.matrix(x, ...)

Arguments

x

A weights_matrix class object.

...

Additional parameter passed to as.matrix().

Value

A matrix and array class object.

Convert an object to an ARI matrix

Description

This function coerces non-ari_matrix class objects into ari_matrix class objects.

Usage

as_ari_matrix(x)

Arguments

x

The object to convert into a weights matrix.

Value

An ari_matrix class object.

Convert an object to a data list

Description

This function coerces non-data_list class objects into data_list class objects.

Usage

as_data_list(x)

Arguments

x

The object to convert into a data list.

Value

A data_list class object.

Convert an object to a settings data frame

Description

This function coerces non-settings_df class objects into settings_df class objects.

Usage

as_settings_df(x)

Arguments

x

The object to convert into a data list.

Value

A settings_df class object.

Convert an object to a similarity matrix list

Description

This function converts non-sim_mats_list class objects into sim_mats_list class objects.

Usage

as_sim_mats_list(x)

Arguments

x

The object to convert into a sim_mats_list. Must be a list of square matrices with identical column and row names.

Value

A sim_mats_list class object.

Convert an object to a snf config

Description

This function coerces non-snf_config class objects into snf_config class objects.

Usage

as_snf_config(x)

Arguments

x

The object to convert into a snf config.

Value

A snf_config class object.

Convert an object to a weights matrix

Description

This function converts non-weights_matrix objects into weights_matrix class objects.

Usage

as_weights_matrix(x)

Arguments

x

The object to convert into a data list.

Value

A weights_matrix class object.

Collapse a data frame and/or a data list into a single data frame

Description

Collapse a data frame and/or a data list into a single data frame

Usage

assemble_data(data, dl)

Arguments

data

A data frame.

dl

A nested list of input data from data_list().

Value

A class "data.frame" object containing all the features of the provided data frame and/or data list.

Heatmap of pairwise associations between features

Description

Heatmap of pairwise associations between features

Usage

assoc_pval_heatmap(
  correlation_matrix,
  scale_diag = "max",
  cluster_rows = TRUE,
  cluster_columns = TRUE,
  show_row_names = TRUE,
  show_column_names = TRUE,
  show_heatmap_legend = FALSE,
  confounders = NULL,
  out_of_models = NULL,
  annotation_colours = NULL,
  labels_colour = NULL,
  split_by_domain = FALSE,
  dl = NULL,
  significance_stars = TRUE,
  slice_font_size = 8,
  ...
)

Arguments

correlation_matrix

Matrix containing all pairwise association p-values. The recommended way to obtain this matrix is through the calc_assoc_pval function.

scale_diag

Parameter that controls how the diagonals of the correlation_matrix are adjusted in the heatmap. For best viewing, this is set to "max", which will match the diagonals to whichever pairwise association has the highest p-value.

cluster_rows

Parameter for ComplexHeatmap::Heatmap. Will be ignored if split_by_domain is also provided.

cluster_columns

Parameter for ComplexHeatmap::Heatmap. Will be ignored if split_by_domain is also provided.

show_row_names

Parameter for ComplexHeatmap::Heatmap.

show_column_names

Parameter for ComplexHeatmap::Heatmap.

show_heatmap_legend

Parameter for ComplexHeatmap::Heatmap.

confounders

A named list where the elements are columns in the correlation_matrix and the names are the corresponding display names.

out_of_models

Like confounders, but a named list of out of model measures (who are also present as columns in the correlation_matrix).

annotation_colours

Named list of heatmap annotations and their colours.

labels_colour

Vector of colours to use for the columns and rows of the heatmap.

split_by_domain

Visually slice the heatmap based on feature domains.

dl

A nested list of input data from data_list().

significance_stars

If TRUE (default), plots significance stars on heatmap cells

slice_font_size

Font size for domain separating labels.

...

Additional parameters passed into ComplexHeatmap::Heatmap.

Value

Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the pairwise associations between features from the provided correlation_matrix.

Examples


data_list <- data_list(
    list(income, "household_income", "demographics", "ordinal"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(fav_colour, "favourite_colour", "demographics", "categorical"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

assoc_pval_matrix <- calc_assoc_pval_matrix(data_list)
ap_heatmap <- assoc_pval_heatmap(assoc_pval_matrix)

Automatically plot features across clusters

Description

Given a single row of a solutions data frame and data provided through a data list, this function will return a series of bar and/or jitter plots based on feature types.

Usage

auto_plot(
  sol_df_row = NULL,
  dl = NULL,
  cluster_df = NULL,
  return_plots = TRUE,
  save = NULL,
  jitter_width = 6,
  jitter_height = 6,
  bar_width = 6,
  bar_height = 6,
  verbose = FALSE
)

Arguments

sol_df_row

A single row of a solutions data frame.

dl

A data list containing data to plot.

cluster_df

Directly provide a cluster_df rather than a solutions matrix. Useful if plotting data from label propagated results.

return_plots

If TRUE, the function will return a list of plots. If FALSE, the function will instead return the full data frame used for plotting.

save

If a string is provided, plots will be saved and this string will be used to prefix plot names.

jitter_width

Width of jitter plots if save is specified.

jitter_height

Height of jitter plots if save is specified.

bar_width

Width of bar plots if save is specified.

bar_height

Height of bar plots if save is specified.

verbose

If TRUE, output progress to console.

Value

By default, returns a list of plots (class "gg", "ggplot") with one plot for every feature in the provided data list and/or target list. If return_plots is FALSE, will instead return a single "data.frame" object containing every provided feature for every observation in long format.

Bar plot separating a feature by cluster

Description

Bar plot separating a feature by cluster

Usage

bar_plot(df, feature)

Arguments

df

A data.frame containing cluster column and the feature to plot.

feature

The feature to plot.

Value

A bar plot (class "gg", "ggplot") showing the distribution of a feature across clusters.

Generate closure function to run batch_snf in an apply-friendly format

Description

Generate closure function to run batch_snf in an apply-friendly format

Usage

batch_row_closure(
  dl,
  dfl,
  cfl,
  sdf,
  wm,
  similarity_matrix_dir,
  return_sim_mats,
  prog
)

Arguments

dl

A nested list of input data from data_list().

dfl

An optional nested list containing which distance metric function should be used for the various feature types (continuous, discrete, ordinal, categorical, and mixed). See ?dist_fns_list for details on how to build this.

cfl

List of custom clustering algorithms to apply to the final fused network. See ?clust_fns_list.

sdf

matrix indicating parameters to iterate SNF through.

wm

A matrix containing feature weights to use during distance matrix calculation. See ?weights_matrix for details on how to build this.

similarity_matrix_dir

If specified, this directory will be used to save all generated similarity matrices.

return_sim_mats

If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE.

prog

Progressr function to update parallel processing progress

Value

A "function" class object used to run batch_snf in lapply-form for parallel processing.

Run variations of SNF

Description

This is the core function of the metasnf package. Using the information stored in a settings_df (see ?settings_df) and a data list (see ?data_list), run repeated complete SNF pipelines to generate a broad space of post-SNF cluster solutions.

Usage

batch_snf(dl, sc, processes = 1, return_sim_mats = FALSE, sim_mats_dir = NULL)

Arguments

dl

A nested list of input data from data_list().

sc

An snf_config class object which stores all sets of hyperparameters used to transform data in dl into a cluster solutions. See ?settings_df or https://branchlab.github.io/metasnf/articles/settings_df.html for more details.

processes

Specify number of processes used to complete SNF iterations

1 (default) Sequential processing: function will iterate through the settings_df one row at a time with a for loop. This option will not make use of multiple CPU cores, but will show a progress bar.
2 or higher: Parallel processing will use the future.apply::future_apply to distribute the SNF iterations across the specified number of CPU cores. If higher than the number of available cores, a warning will be raised and the maximum number of cores will be used.
max: All available cores will be used.

return_sim_mats

If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE.

sim_mats_dir

If specified, this directory will be used to save all generated similarity matrices.

Value

By default, returns a solutions data frame (class "data.frame"), a a data frame containing one row for every row of the provided settings matrix, all the original columns of that settings data frame, and new columns containing the assigned cluster of each observation from the cluster solution derived by that row's settings. If return_sim_mats is TRUE, the function will instead return a list containing the solutions data frame as well as a list of the final similarity matrices (class "matrix") generated by SNF for each row of the settings data frame. If suppress_clustering is TRUE, the solutions data frame will not be returned in the output.

Examples

input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)

sc <- snf_config(input_dl, n_solutions = 3)

# A solutions data frame without similarity matrices:
sol_df <- batch_snf(input_dl, sc)

# A solutions data frame with similarity matrices:
sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)
sim_mats_list(sol_df)

Run SNF clustering pipeline on a list of subsampled data lists

Description

Run SNF clustering pipeline on a list of subsampled data lists

Usage

batch_snf_subsamples(
  dl_subsamples,
  sc,
  processes = 1,
  return_sim_mats = FALSE,
  sim_mats_dir = NULL
)

Arguments

dl_subsamples

A list of subsampled data lists. This object is generated by the function batch_snf_subsamples().

sc

processes

Specify number of processes used to complete SNF iterations

1 (default) Sequential processing: function will iterate through the settings_df one row at a time with a for loop. This option will not make use of multiple CPU cores, but will show a progress bar.
2 or higher: Parallel processing will use the future.apply::future_apply to distribute the SNF iterations across the specified number of CPU cores. If higher than the number of available cores, a warning will be raised and the maximum number of cores will be used.
max: All available cores will be used.

return_sim_mats

If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE.

sim_mats_dir

If specified, this directory will be used to save all generated similarity matrices.

Value

By default, returns a one-element list: cluster_solutions, which is itself a list of cluster solution data frames corresponding to each of the provided data list subsamples. Setting the parameters return_sim_mats and return_solutions to TRUE will turn the result of the function to a three-element list containing the corresponding solutions data frames and final fused similarity matrices of those cluster solutions, should you require these objects for your own stability calculations.

Examples


my_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)

my_dl_subsamples <- subsample_dl(
    my_dl,
    n_subsamples = 20,
    subsample_fraction = 0.85
)

batch_subsample_results <- batch_snf_subsamples(
    my_dl_subsamples,
    sc
)

Cached example extended solutions data frame

Description

An extended solutions data frame used as a cached example in the "a_complete_example.Rmd" vignette.

Usage

cache_a_complete_example_ext_sol_df

Format

`cache_a_complete_example_ext_sol_df`

Contains 20 cluster solutions, 87 observations, and p-values for 336 features.

Source

This data came from the metasnf package.

Cached example extended solutions data frame

Description

An extended solutions data frame used as a cached example in the "a_complete_example.Rmd" vignette.

Usage

cache_a_complete_example_lp_ext_sol_df

Format

`cache_a_complete_example_lp_ext_sol_df`

Contains 5 cluster solutions, 74 observations, and p-values for 2 features.

Source

This data comes from the metasnf package.

Cached example solutions data frame

Description

An solutions data frame used as a cached example in the "a_complete_example.Rmd" vignette.

Usage

cache_a_complete_example_sol_df

Format

`cache_a_complete_example_sol_df`

A solutions data frame with 20 cluster solutions and 87 observations.

Source

This data came from the metasnf package.

Construct an ARI matrix storing inter-solution similarities

Description

This function constructs an ari_matrix class object from a solutions_df class object. The ARI matrix stores pairwise adjusted Rand indices for all cluster solutions as well as a numeric order for the solutions data frame based on the hierarchical clustering of the ARI matrix.

Usage

calc_aris(
  sol_df,
  processes = 1,
  verbose = FALSE,
  dist_method = "euclidean",
  hclust_method = "complete"
)

Arguments

sol_df

Solutions data frame containing cluster solutions to calculate pairwise ARIs for.

processes

Specify number of processes used to complete calculations

1 (default) Sequential processing
2 or higher: Parallel processing will use the future.apply::future_apply to distribute the calculations across the specified number of CPU cores. If higher than the number of available cores, a warning will be raised and the maximum number of cores will be used.
max: All available cores will be used. Note that no progress indicator is available during multi-core processing.

verbose

If TRUE, output progress to console.

dist_method

Distance method to use when calculating sorting order to of the matrix. Argument is directly passed into stats::dist. Options include "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski".

hclust_method

Agglomerative method to use when calculating sorting order by stats::hclust. Options include "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid".

Value

om_aris ARIs between clustering solutions of an solutions data frame

Examples

dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

sc <- snf_config(dl, n_solutions = 3)
sol_df <- batch_snf(dl, sc)
calc_aris(sol_df)

Calculate p-values based on feature vectors and their types

Description

Calculate p-values based on feature vectors and their types

Usage

calc_assoc_pval(var1, var2, type1, type2, cat_test = "chi_squared")

Arguments

var1

A single vector containing a feature.

var2

A single vector containing a feature.

type1

The type of var1 (continuous, discrete, ordinal, categorical).

type2

The type of var2 (continuous, discrete, ordinal, categorical).

cat_test

String indicating which statistical test will be used to associate cluster with a categorical feature. Options are "chi_squared" for the Chi-squared test and "fisher_exact" for Fisher's exact test.

Value

pval A p-value from a statistical test based on the provided types. Currently, this will either be the F-test p-value from a linear model if at least one feature is non-categorical, or the chi-squared test p-value if both features are categorical.

Calculate p-values for all pairwise associations of features in a data list

Description

Calculate p-values for all pairwise associations of features in a data list

Usage

calc_assoc_pval_matrix(dl, verbose = FALSE, cat_test = "chi_squared")

Arguments

dl

A nested list of input data from data_list().

verbose

If TRUE, output progress to the console.

cat_test

String indicating which statistical test will be used to associate cluster with a categorical feature. Options are "chi_squared" for the Chi-squared test and "fisher_exact" for Fisher's exact test.

Value

A "matrix" class object containing pairwise association p-values between the features in the provided data list.

Examples

data_list <- data_list(
    list(income, "household_income", "demographics", "ordinal"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

assoc_pval_matrix <- calc_assoc_pval_matrix(data_list)

Calculate feature NMIs for a data list and a solutions data frame

Description

Normalized mutual information scores can be used to indirectly measure how important a feature may have been in producing a cluster solution. This function will calculate the normalized mutual information between cluster solutions in a solutions data frame as well as cluster solutions created by including only a single feature from a provided data list, but otherwise using all the same hyperparameters as specified in the original SNF config. Note that NMIs can be calculated between two cluster solutions regardless of what features were actually used to create those cluster solutions. For example, a feature that was not involved in producing a particular cluster solution may still have a high NMI with that cluster solution (typically because it was highly correlated with a different feature that was used).

Usage

calc_nmis(
  dl,
  sol_df,
  transpose = TRUE,
  ignore_inclusions = TRUE,
  processes = 1
)

Arguments

dl

A nested list of input data from data_list().

sol_df

Result of batch_snf storing cluster solutions and the settings that were used to generate them. Use the same value as was used in the original call to batch_snf().

transpose

If TRUE, will transpose the output data frame.

ignore_inclusions

If TRUE, will ignore the inclusion columns in the solutions data frame and calculate NMIs for all features. If FALSE, will give NAs for features that were dropped on a given settings_df row.

processes

Specify number of processes used to complete SNF iterations

1 (default) Sequential processing: function will iterate through the settings_df one row at a time with a for loop. This option will not make use of multiple CPU cores, but will show a progress bar.
2 or higher: Parallel processing will use the future.apply::future_apply to distribute the SNF iterations across the specified number of CPU cores. If higher than the number of available cores, a warning will be raised and the maximum number of cores will be used.
max: All available cores will be used.

Value

A "data.frame" class object containing one row for every feature in the provided data list and one column for every solution in the provided solutions data frame. Populated values show the calculated NMI score for each feature-solution combination.

Examples

input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)

sc <- snf_config(input_dl, n_solutions = 2)

sol_df <- batch_snf(input_dl, sc)

calc_nmis(input_dl, sol_df)

Calculate co-clustering data

Description

Calculate co-clustering data

Usage

calculate_coclustering(subsample_solutions, sol_df, verbose = FALSE)

Arguments

subsample_solutions

A list of containing cluster solutions from distinct subsamples of the data. This object is generated by the function batch_snf_subsamples(). These solutions should correspond to the ones in the solutions data frame.

sol_df

A solutions data frame. This object is generated by the function batch_snf(). The solutions in the solutions data frame should correspond to those in the subsample solutions.

verbose

If TRUE, output time remaining estimates to console.

Value

A list containing the following components:

cocluster_dfs: A list of data frames, one per cluster solution, that shows the number of times that every pair of observations in the original cluster solution occurred in the same subsample, the number of times that every pair clustered together in a subsample, and the corresponding fraction of times that every pair clustered together in a subsample.
cocluster_ss_mats: The number of times every pair of observations occurred in the same subsample, formatted as a pairwise matrix.
cocluster_sc_mats: The number of times every pair of observations occurred in the same cluster, formatted as a pairwise matrix.
cocluster_cf_mats: The fraction of times every pair of observations occurred in the same cluster, formatted as a pairwise matrix.
cocluster_summary: Specifically among pairs of observations that clustered together in the original full cluster solution, what fraction of those pairs remained clustered together throughout the subsample solutions. This information is formatted as a data frame with one row per cluster solution.

Examples


    my_dl <- data_list(
        list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
        list(income, "household_income", "demographics", "continuous"),
        list(pubertal, "pubertal_status", "demographics", "continuous"),
        uid = "unique_id"
    )
    
    sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
    
    sol_df <- batch_snf(my_dl, sc)
    
    my_dl_subsamples <- subsample_dl(
        my_dl,
        n_subsamples = 20,
        subsample_fraction = 0.85
    )
    
    batch_subsample_results <- batch_snf_subsamples(
        my_dl_subsamples,
        sc
    )
    
    coclustering_results <- calculate_coclustering(
        batch_subsample_results,
        sol_df,
        verbose = TRUE
    )

Mock diagnosis data

Description

This is the same data as diagnosis_df, with renamed features and columns.

Usage

cancer_diagnosis_df

Format

`cancer_diagnosis_df`

A data frame with 200 rows and 2 columns:

patient_id: Random three-digit number uniquely identifying the patient
diagnosis: Mock cancer diagnosis feature (1, 2, or 3)

Source

This data came from the SNFtool package, with slight modifications.

Helper function for generating categorical colour palette

Description

Helper function for generating categorical colour palette

Usage

cat_colours(vector, palette)

Arguments

vector

Vector of categorical data to generate palette for.

palette

Which RColorBrewer palette should be used.

Value

A named list of colours where the names correspond to the unique values of vector and the values correspond to their colours.

Place significance stars on ComplexHeatmap cells

Description

This is an internal function meant to be used to by the assoc_pval_heatmap function.

Usage

cell_significance_fn(data)

Arguments

data

The matrix containing the cells to base the significance stars on.

Value

cell_fn Another function that is well-formatted for usage as the cell_fun argument in ComplexHeatmap::Heatmap.

Convert character-type columns of a data frame to factor-type

Description

Convert character-type columns of a data frame to factor-type

Usage

char_to_fac(df)

Arguments

df

A data frame.

Value

The data frame with factor columns instead of char columns.

Check if functions in a distance metrics list-like have valid arguments

Description

Check if functions in a distance metrics list-like have valid arguments

Usage

check_cfll_fn_args(cfll)

Value

Doesn't return any value. Raises error if the functions in dfll don't have valid arguments.

Check if items of a clustering functions list-like object are functions

Description

Check if items of a clustering functions list-like object are functions

Usage

check_cfll_fns(cfll)

Arguments

cfll

A clust_fns_list-like list class object.

Value

Doesn't return any value. Raises error if the items of cfll are not functions.

Check if clustering functions list-like object has named algorithms

Description

Check if clustering functions list-like object has named algorithms

Usage

check_cfll_named(cfll)

Arguments

cfll

A clust_fns_list-like list class object.

Value

Doesn't return any value. Raises error if there are unnamed clustering functions in cfll.

Check if names in a clustering functions list-like object are unique

Description

Check if names in a clustering functions list-like object are unique

Usage

check_cfll_unique_names(cfll)

Arguments

cfll

A clust_fns_list-like list class object.

Value

Doesn't return any value. Raises error if the names in cfll aren't unique.

Check if settings_df exceeds bounds of clust_fns_list

Description

Check if settings_df exceeds bounds of clust_fns_list

Usage

check_compatible_sdf_cfl(sdf, cfl)

Arguments

sdf

A settings_df class object.

cfl

A clust_fns_list class object.

Value

Doesn't return any value. Raises error if sdf calls for a clustering function outside the range of cfl.

Check if settings_df exceeds bounds of dist_fns_list

Description

Check if settings_df exceeds bounds of dist_fns_list

Usage

check_compatible_sdf_dfl(sdf, dfl)

Arguments

sdf

A settings_df class object.

dfl

A dist_fns_list class object.

Value

Doesn't return any value. Raises error if sdf calls for a distance function outside the range of dfl.

Check if settings_df and weights_matrix have same number of rows

Description

Check if settings_df and weights_matrix have same number of rows

Usage

check_compatible_sdf_wm(sdf, wm)

Arguments

sdf

A settings_df class object.

wm

A weights_matrix class object.

Value

Doesn't return any value. Raises error if sdf and wm don't have the same number of rows.

Helper function to stop annotation building when no data was provided

Description

Helper function to stop annotation building when no data was provided

Usage

check_dataless_annotations(annotation_requests, data)

Arguments

annotation_requests

A list of requested annotations

data

A data frame with data to build annotations

Value

Does not return any value. This function just raises an error when annotations are requested without any provided data for a heatmap.

Check if functions in a distance metrics list-like have valid arguments

Description

Check if functions in a distance metrics list-like have valid arguments

Usage

check_dfll_fn_args(dfll)

Arguments

dfll

A distance metrics list-like list object to be validated.

Value

Doesn't return any value. Raises error if the functions in dfll don't have valid arguments.

Check if functions in a distance metrics list-like have names

Description

Check if functions in a distance metrics list-like have names

Usage

check_dfll_fn_names(dfll)

Arguments

dfll

A distance metrics list-like list object to be validated.

Value

Doesn't return any value. Raises error if the functions in dfll don't have names.

Check if items of a distance metrics list-like object have valid names

Description

Check if items of a distance metrics list-like object have valid names

Usage

check_dfll_item_names(dfll)

Arguments

dfll

A distance metrics list-like list object to be validated.

Value

Doesn't return any value. Raises error if the items of dfll don't have valid formatted names.

Check if items of a distance metrics list-like object are functions

Description

Check if items of a distance metrics list-like object are functions

Usage

check_dfll_subitems_are_fns(dfll)

Arguments

dfll

A distance metrics list-like list object to be validated.

Value

Doesn't return any value. Raises error if the items of dfll are not functions.

Check if names in a distance metrics list-like object are unique

Description

Check if names in a distance metrics list-like object are unique

Usage

check_dfll_unique_names(dfll)

Arguments

dfll

A distance metrics list-like list object to be validated.

Value

Doesn't return any value. Raises error if the items of dfll aren't unique across layer 1 or within each item of layer 2.

Check if data list contains any duplicate names

Description

Check if data list contains any duplicate names

Usage

check_dll_duplicate_components(dll)

Arguments

dll

A data list-like list class object.

Value

Doesn't return any value. Raises error if there are features with duplicate names in a generated data list.

Check if data list contains any duplicate features

Description

Check if data list contains any duplicate features

Usage

check_dll_duplicate_features(dll)

Arguments

dll

A data list-like list class object.

Value

Doesn't return any value. Raises error if there are features with duplicate names in a generated data list.

Error if empty input provided during data list initialization

Description

Error if empty input provided during data list initialization

Usage

check_dll_empty_input(data_list_input)

Arguments

data_list_input

Input data provided for data list initialization.

Value

Raises an error if data_list_input has 0 length.

Error if data list-like list doesn't have only 4-item nested lists

Description

Error if data list-like list doesn't have only 4-item nested lists

Usage

check_dll_four_subitems(dll)

Arguments

dll

A data list-like list class object.

Value

Raises error if dll doesn't have only 4-item nested lists

Error if data list-like structure isn't a list

Description

Error if data list-like structure isn't a list

Usage

check_dll_inherits_list(dll)

Arguments

dll

A data list-like list class object.

Value

Raises error if data list-like structure isn't a list

Check if UID columns in a nested list have valid structure for a data list

Description

Check if UID columns in a nested list have valid structure for a data list

Usage

check_dll_subitem_classes(dll)

Arguments

dll

A data list-like list class object.

Value

Raises an error if the UID columns do not have a valid structure.

Check valid item names for a data list-like list

Description

Error if data list-like structure doesn't have nested names of "data", "name", "domain", and "type".

Usage

check_dll_subitem_names(dll)

Arguments

dll

A data list-like list class object.

Value

Raises error if dll doesn't have only 4-item nested lists

Error if data list-like structure has invalid feature types

Description

Error if data list-like structure has invalid feature types

Usage

check_dll_types(dll)

Arguments

dll

A data list-like list class object.

Value

Raises an error if the loaded types are not among continuous, discrete, ordinal, categorical, or mixed.

Check if UID columns in a nested list have valid structure for a data list

Description

Check if UID columns in a nested list have valid structure for a data list

Usage

check_dll_uid(dll)

Arguments

dll

A data list-like list class object.

Value

Raises an error if the UID columns do not have a valid structure.

Check for ComplexHeatmap and circlize dependencies

Description

Check for ComplexHeatmap and circlize dependencies

Usage

check_hm_dependencies()

Value

Does not return any value. This function just checks that the ComplexHeatmap and circlize packages are installed.

Check if settings data frame inherits class `data.frame`

Description

Check if settings data frame inherits class data.frame

Usage

check_sdfl_colnames(sdfl)

Arguments

sdfl

A settings data frame-like matrix object to be validated.

Value

Doesn't return any value. Raises error if there are features with duplicate names in a generated data list.

Check if settings data frame inherits class `data.frame`

Description

Check if settings data frame inherits class data.frame

Usage

check_sdfl_is_df(sdfl)

Arguments

sdfl

A settings data frame-like matrix object to be validated.

Value

Doesn't return any value. Raises error if there are features with duplicate names in a generated data list.

Check if settings data frame is numeric

Description

Check if settings data frame is numeric

Usage

check_sdfl_numeric(sdfl)

Arguments

sdfl

A settings data frame-like matrix object to be validated.

Value

Doesn't return any value. Raises error if there are features with duplicate names in a generated data list.

Check validity of similarity matrices

Description

Check to see if similarity matrices in a list have the following properties:

The maximum value in the entire matrix is 0.5
Every value in the diagonal is 0.5

Usage

check_similarity_matrices(similarity_matrices)

Arguments

similarity_matrices

A list of similarity matrices

Value

valid_matrices Boolean indicating if properties are met by all similarity matrices

Check if max K exceeds the number of observations

Description

Check if max K exceeds the number of observations

Usage

check_valid_k(sdf, dl)

Arguments

sdf

A settings_df class object.

dl

A nested list of input data from data_list().

Value

Doesn't return any value. Raises error if max K exceeds the number of observations.

Check if SNF config has valid structure

Description

Check if SNF config has valid structure

Usage

check_valid_sc(sc)

Arguments

sc

An snf_config class object.

Value

Doesn't return any value. Raises error if snf_config is not an snf_config class object.

Chi-squared test p-value (generic)

Description

Return p-value for chi-squared test for any two features

Usage

chi_squared_pval(cat_var1, cat_var2)

Arguments

cat_var1

A categorical feature.

cat_var2

A categorical feature.

Value

pval A p-value (class "numeric").

Built-in clustering algorithms

Description

These functions can be used when building a metasnf clustering functions list. Each function converts a similarity matrix (matrix class object) to a cluster solution (numeric vector). Note that these functions (or custom clustering functions) cannot accept number of clusters as a parameter; this value must be built into the function itself if necessary.

Usage

spectral_eigen(similarity_matrix)

spectral_rot(similarity_matrix)

spectral_eigen_classic(similarity_matrix)

spectral_rot_classic(similarity_matrix)

spectral_two(similarity_matrix)

spectral_three(similarity_matrix)

spectral_four(similarity_matrix)

spectral_five(similarity_matrix)

spectral_six(similarity_matrix)

spectral_seven(similarity_matrix)

spectral_eight(similarity_matrix)

spectral_nine(similarity_matrix)

spectral_ten(similarity_matrix)

Arguments

similarity_matrix

A similarity matrix.

Details

spectral_eigen: Spectral clustering where the number of clusters is based on the eigen-gap heuristic
spectral_rot: Spectral clustering where the number of clusters is based on the rotation-cost heuristic
spectral_(C): Spectral clustering for a C-cluster solution.

Value

solution_data A vector of cluster assignments

Build a clustering algorithms list

Description

This function can be used to specify custom clustering algorithms to apply to the final similarity matrices produced by each run of the batch_snf function.

Usage

clust_fns_list(clust_fns = NULL, use_default_clust_fns = FALSE)

Arguments

clust_fns

A list of named clustering functions

use_default_clust_fns

If TRUE, prepend the base clustering algorithms (spectral_eigen and spectral_rot, which apply spectral clustering and use the eigen-gap and rotation cost heuristics respectively for determining the number of clusters in the graph) to clust_fns.

Value

A list of clustering algorithm functions that can be passed into the batch_snf and generate_settings_list functions.

Examples

# Using just the base clustering algorithms --------------------------------
# This will just contain spectral_eigen and spectral_rot
cfl <- clust_fns_list(use_default_clust_fns = TRUE)

# Adding algorithms provided by the package --------------------------------
# This will contain the base clustering algorithms (spectral_eigen,
#  spectral_rot) as well as two pre-defined spectral clustering functions
#  that force the number of clusters to be two or five
cfl <- clust_fns_list(
     clust_fns = list(
        "two_cluster_spectral" = spectral_two,
        "five_cluster_spectral" = spectral_five
    )
)

# Adding your own algorithms -----------------------------------------------
# This will contain the base and user-provided clustering algorithms
my_clustering_algorithm <- function(similarity_matrix) {
    # your code that converts similarity matrix to clusters here...
}

# Suppress the base algorithms----------------------------------------------
# This will contain only user-provided clustering algorithms
cfl <- clust_fns_list(
    clust_fns = list(
        "two_cluster_spectral" = spectral_two,
        "five_cluster_spectral" = spectral_five
    )
)

Density plot of co-clustering stability across subsampled data

Description

This function creates a density plot that shows, for all pairs of observations that originally clustered together, the distribution of the the fractions that those pairs clustered together across subsampled data.

Usage

cocluster_density(cocluster_df)

Arguments

cocluster_df

A data frame containing co-clustering data for a single cluster solution. This object is generated by the calculate_coclustering function.

Value

Density plot (class "gg", "ggplot") of the distribution of co-clustering across pairs and subsamples of the data.

Examples


my_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)

sol_df <- batch_snf(my_dl, sc)

my_dl_subsamples <- subsample_dl(
    my_dl,
    n_subsamples = 20,
    subsample_fraction = 0.85
)

batch_subsample_results <- batch_snf_subsamples(
    my_dl_subsamples,
    sc
)

coclustering_results <- calculate_coclustering(
    batch_subsample_results,
    sol_df,
    verbose = TRUE
)

cocluster_dfs <- coclustering_results$"cocluster_dfs"

cocluster_density(cocluster_dfs[[1]])

Heatmap of observation co-clustering across resampled data

Description

Create a heatmap that shows the distribution of observation co-clustering across resampled data.

Usage

cocluster_heatmap(
  cocluster_df,
  cluster_rows = TRUE,
  cluster_columns = TRUE,
  show_row_names = FALSE,
  show_column_names = FALSE,
  dl = NULL,
  data = NULL,
  left_bar = NULL,
  right_bar = NULL,
  top_bar = NULL,
  bottom_bar = NULL,
  left_hm = NULL,
  right_hm = NULL,
  top_hm = NULL,
  bottom_hm = NULL,
  annotation_colours = NULL,
  min_colour = NULL,
  max_colour = NULL,
  ...
)

Arguments

cocluster_df

A data frame containing co-clustering data for a single cluster solution. This object is generated by the calculate_coclustering function.

cluster_rows

Argument passed to ComplexHeatmap::Heatmap().

cluster_columns

Argument passed to ComplexHeatmap::Heatmap().

show_row_names

Argument passed to ComplexHeatmap::Heatmap().

show_column_names

Argument passed to ComplexHeatmap::Heatmap().

dl

See ?similarity_matrix_heatmap.

data

See ?similarity_matrix_heatmap.

left_bar

See ?similarity_matrix_heatmap.

right_bar

See ?similarity_matrix_heatmap.

top_bar

See ?similarity_matrix_heatmap.

bottom_bar

See ?similarity_matrix_heatmap.

left_hm

See ?similarity_matrix_heatmap.

right_hm

See ?similarity_matrix_heatmap.

top_hm

See ?similarity_matrix_heatmap.

bottom_hm

See ?similarity_matrix_heatmap.

annotation_colours

See ?similarity_matrix_heatmap.

min_colour

See ?similarity_matrix_heatmap.

max_colour

See ?similarity_matrix_heatmap.

...

Arguments passed to ComplexHeatmap::Heatmap().

Value

Heatmap (class "Heatmap" from ComplexHeatmap) object showing the distribution of observation co-clustering across resampled data.

Examples


    my_dl <- data_list(
        list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
        list(income, "household_income", "demographics", "continuous"),
        list(pubertal, "pubertal_status", "demographics", "continuous"),
        uid = "unique_id"
    )
    
    sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
    
    sol_df <- batch_snf(my_dl, sc)
    
    my_dl_subsamples <- subsample_dl(
        my_dl,
        n_subsamples = 20,
        subsample_fraction = 0.85
    )
    
    batch_subsample_results <- batch_snf_subsamples(
        my_dl_subsamples,
        sc
    )
    
    coclustering_results <- calculate_coclustering(
        batch_subsample_results, 
        sol_df,
        verbose = TRUE
    )
    
    cocluster_dfs <- coclustering_results$"cocluster_dfs"
    
    cocluster_heatmap(
        cocluster_dfs[[1]],
        dl = my_dl,
        top_hm = list(
            "Income" = "household_income",
            "Pubertal Status" = "pubertal_status"
        ),
        annotation_colours = list(
            "Pubertal Status" = colour_scale(
                c(1, 4),
                min_colour = "black",
                max_colour = "purple"
            ),
            "Income" = colour_scale(
                c(0, 4),
                min_colour = "black",
                max_colour = "red"
            )
        )
    )

Co-clustering coverage check

Description

Check if co-clustered data has at least one subsample in which every pair of observations were a part of simultaneously.

Usage

coclustering_coverage_check(cocluster_df, action = "warn")

Arguments

cocluster_df

data frame containing co-clustering data.

action

Control if parent function should warn or stop.

Value

This function does not return any value. It checks a cocluster_df for complete coverage (all pairs occur in the same solution at least once). Will raise a warning or error if coverage is incomplete depending on the value of the action parameter.

Convert a data list into a data frame

Description

Defunct function for converting a data list into a data frame. Please use as.data.frame() instead.

Usage

collapse_dl(data_list)

Arguments

data_list

A nested list of input data from generate_data_list().

Value

A "data.frame"-formatted version of the provided data list.

Return a colour ramp for a given vector

Description

Given a numeric vector and min and max colour values, return a colour ramp that assigns a colour to each element in the vector. This function is a wrapper for circlize::colorRamp2.'

Usage

colour_scale(data, min_colour, max_colour)

Arguments

data

Vector of numeric values.

min_colour

Minimum colour value.

max_colour

Maximum colour value.

Value

A "function" class object that can build a circlize-style colour ramp.

Convert unique identifiers of data list to "uid"

Description

Column name "uid" is reserved for the unique identifier of observations. This function ensures all data frames have their UID set as "uid".

Usage

convert_uids(dll, uid)

Arguments

dll

A data list-like list class object.

uid

(string) the name of the uid column currently used data

Value

The provided nested list with "uid" as UID.

Mock ABCD cortical surface area data

Description

Like the mock data frame "abcd_cort_sa", but with "unique_id" as the "uid".

Usage

cort_sa

Format

`cort_sa`

A data frame with 188 rows and 152 columns:

unique_id: The unique identifier of the ABCD dataset
...: Cortical surface areas of various ROIs (mm^2, I think)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Mock ABCD cortical thickness data

Description

Like the mock data frame "abcd_cort_t", but with "unique_id" as the "uid".

Usage

cort_t

Format

`cort_t`

A data frame with 188 rows and 152 columns:

unique_id: The unique identifier of the ABCD dataset
...: Cortical thicknesses of various ROIs (mm^3, I think)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Build a `data_list` class object

Description

data_list() constructs a data list object which inherits from classes data_list and list. This object is the primary way in which features to be used along the metasnf clustering pipeline are stored. The data list is fundamentally a 2-level nested list object where each inner list contains a data frame and associated metadata for that data frame. The metadata includes the name of the data frame, the 'domain' of that data frame (the broader source of information that the input data frame is capturing, determined by user's domain knowledge), and the type of feature stored in the data frame (continuous, discrete, ordinal, categorical, or mixed).

Usage

data_list(..., uid)

Arguments

...

Any number of lists formatted as (df, "df_name", "df_domain", "df_type") and/or any number of lists of lists formatted as (df, "df_name", "df_domain", "df_type").

uid

(character) the name of the uid column currently used data. data frame.

Examples

heart_rate_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var1 = c(0.04, 0.1, 0.3),
    var2 = c(30, 2, 0.3)
)

personality_test_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var3 = c(900, 1990, 373),
    var4 = c(509, 2209, 83)
)

survey_response_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var5 = c(1, 3, 3),
    var6 = c(2, 3, 3)
)

city_df <- data.frame(
    patient_id = c("1", "2", "3"),
    var7 = c("toronto", "montreal", "vancouver")
)

# Explicitly (Name each nested list element):
dl <- data_list(
    list(
        data = heart_rate_df,
        name = "heart_rate",
        domain = "clinical",
        type = "continuous"
    ),
    list(
        data = personality_test_df,
        name = "personality_test",
        domain = "surveys",
        type = "continuous"
    ),
    list(
        data = survey_response_df,
        name = "survey_response",
        domain = "surveys",
        type = "ordinal"
    ),
    list(
        data = city_df,
        name = "city",
        domain = "location",
        type = "categorical"
    ),
    uid = "patient_id"
)

# Compact loading
dl <- data_list(
    list(heart_rate_df, "heart_rate", "clinical", "continuous"),
    list(personality_test_df, "personality_test", "surveys", "continuous"),
    list(survey_response_df, "survey_response", "surveys", "ordinal"),
    list(city_df, "city", "location", "categorical"),
    uid = "patient_id"
)

# Printing data list summaries
summary(dl)

# Alternative loading: providing a single list of lists
list_of_lists <- list(
    list(heart_rate_df, "data1", "domain1", "continuous"),
    list(personality_test_df, "data2", "domain2", "continuous")
)

dl <- data_list(
    list_of_lists,
    uid = "patient_id"
)

Mock ABCD depression data

Description

Like the mock data frame "abcd_depress", but with "unique_id" as the "uid".

Usage

depress

Format

`depress`

A data frame with 275 rows and 2 columns:

unique_id: The unique identifier of the ABCD dataset
cbcl_depress_r: Ordinal value of impairment on CBCL anxiety, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Mock diagnosis data

Description

This is the same data as cancer_diagnosis_df, with renamed features and columns.

Usage

diagnosis_df

Format

`diagnosis_df`

A data frame with 200 rows and 2 columns:

patient_id: Random three-digit number uniquely identifying the patient
diagnosis: Mock diagnosis feature

Source

This data came from the SNFtool package, with slight modifications.

Internal function for `estimate_nclust_given_graph`

Description

Internal function taken from SNFtool to use for number of cluster estimation.

Usage

discretisation(eigenvectors)

Arguments

eigenvectors

Matrix of eigenvectors.

Value

"Matrix" class object, intermediate product in spectral clustering.

Internal function for `estimate_nclust_given_graph`

Description

Internal function taken from SNFtool to use for number of cluster estimation.

Usage

discretisation_evec_data(eigenvector)

Arguments

eigenvector

Matrix of eigenvectors

Value

"Matrix" class object of discretized provided eigenvector to values 0 or 1.

Built-in distance functions

Description

These functions can be used when building a metasnf distance functions list. Each function converts a data frame into to a distance matrix.

Usage

euclidean_distance(df, weights_row)

gower_distance(df, weights_row)

sn_euclidean_distance(df, weights_row)

sew_euclidean_distance(df, weights_row)

hamming_distance(df, weights_row)

Arguments

df

Data frame containing at least 1 data column

weights_row

Single-row data frame where the column names contain the column names in df and the row contains the corresponding weights_row.

Details

Functions that work for numeric data:

euclidean_distance: typical Euclidean distance
sn_euclidean_distance: Data frame is first standardized and normalized before typical Euclidean distance is applied
siw_euclidean_distance: Squared (including weights) Euclidean distance, where the weights are also squared
sew_euclidean_distance: Squared (excluding weights) Euclidean distance, where the weights are not also squared

Functions that work for binary data:

hamming_distance: typical Hamming distance

Functions that work for any type of data:

gower_distance: Gower distance (cluster::daisy)

Value

A matrix class object containing pairwise distances.

Build a distance metrics list

Description

The distance metrics list object (inherits classes dist_fns_list and list) is a list that stores R functions which can convert a data frame of features into a matrix of pairwise distances. The list is a nested one, where the first layer of the list can hold up to 5 items (one for each of the metasnf recognized feature types, continuous, discrete, ordinal, categorical, and mixed), and the second layer can hold an arbitrary number of distance functions for each of those types.

Usage

dist_fns_list(
  cnt_dist_fns = NULL,
  dsc_dist_fns = NULL,
  ord_dist_fns = NULL,
  cat_dist_fns = NULL,
  mix_dist_fns = NULL,
  automatic_standard_normalize = FALSE,
  use_default_dist_fns = FALSE
)

Arguments

cnt_dist_fns

A named list of continuous distance metric functions.

dsc_dist_fns

A named list of discrete distance metric functions.

ord_dist_fns

A named list of ordinal distance metric functions.

cat_dist_fns

A named list of categorical distance metric functions.

mix_dist_fns

A named list of mixed distance metric functions.

automatic_standard_normalize

If TRUE, will automatically use standard normalization prior to calculation of any numeric distances. This parameter overrides all other distance functions list-related parameters.

use_default_dist_fns

If TRUE, prepend the base distance metrics (euclidean distance for continuous, discrete, and ordinal data and gower distance for categorical and mixed data) to the resulting distance metrics list.

Details

Call ?distance_metrics to see all distance metric functions provided in metasnf.

Value

A distance metrics list object.

Examples

# Using just the base distance metrics  ------------------------------------
dist_fns_list <- dist_fns_list()

# Adding your own metrics --------------------------------------------------
# This will contain only the and user-provided distance function:
cubed_euclidean <- function(df, weights_row) {
    # (your code that converts a data frame to a distance metric here...)
    weights <- diag(weights_row, nrow = length(weights_row))
    weighted_df <- as.matrix(df) %*% weights
    distance_matrix <- weighted_df |>
        stats::dist(method = "euclidean") |>
        as.matrix()
    distance_matrix <- distance_matrix^3
    return(distance_matrix)
}

dist_fns_list <- dist_fns_list(
    cnt_dist_fns = list(
         "my_cubed_euclidean" = cubed_euclidean
    )
)

# Using default base metrics------------------------------------------------
# Call ?distance_metrics to see all distance metric functions provided in
# metasnf. The code below will contain a mix of user-provided and built-in
# distance metric functions.
dist_fns_list <- dist_fns_list(
    cnt_dist_fns = list(
         "my_distance_metric" = cubed_euclidean
    ),
    dsc_dist_fns = list(
         "my_distance_metric" = cubed_euclidean
    ),
    ord_dist_fns = list(
         "my_distance_metric" = cubed_euclidean
    ),
    cat_dist_fns = list(
         "my_distance_metric" = gower_distance
    ),
    mix_dist_fns = list(
         "my_distance_metric" = gower_distance
    ),
    use_default_dist_fns = TRUE
)

Variable-level summary of a data list

Description

Defunct function to summarize a data list. Please use summary() with argument scope = "feature" instead.

Usage

dl_variable_summary(dl)

Arguments

dl

A nested list of input data from data_list().

Value

variable_level_summary A data frame containing the name, type, and domain of every variable in a data list.

Apply-like function for data list objects

Description

This function enables manipulating a data_list class object with lapply syntax without removing that object's data_list class attribute. The function will only preserve this attribute if the result of the apply call has a valid data list structure.

Usage

dlapply(X, FUN, ...)

Arguments

X

A data_list class object.

FUN

The function to be applied to each data list component.

...

Optional arguments to FUN.

Value

If FUN applied to each component of X yields a valid data list, a data list. Otherwise, a list.

Examples

# Convert all UID values to lowercase
dl <- data_list(
    list(abcd_income, "income", "demographics", "discrete"),
    list(abcd_colour, "colour", "likes", "categorical"),
    uid = "patient"
)

dl_lower <- dlapply(
    dl,
    function(x) {
        x$"data"$"uid" <- tolower(x$"data"$"uid")
        return(x)
    }
)

Make the uid UID columns of a data list first

Description

Make the uid UID columns of a data list first

Usage

dll_uid_first_col(dll)

Arguments

dll

A data list-like list class object.

Value

The object with "uid" positioned as the first of each data frame column.

Pull domains from a data list

Description

Pull domains from a data list

Usage

domains(dl)

Arguments

dl

A nested list of input data from data_list().

Value

A character vector of domains.

Function to extend dplyr to extended solutions data frame objects

Description

Function to extend dplyr to extended solutions data frame objects

Usage

dplyr_row_slice.ext_solutions_df(data, i, ...)

Arguments

data

An extended solutions data frame.

i

A vector of row indices.

...

Additional arguments.

Value

Row sliced object with appropriately preserved attributes.

Function to extend dplyr to solutions data frame objects

Description

Function to extend dplyr to solutions data frame objects

Usage

dplyr_row_slice.solutions_df(data, i, ...)

Arguments

data

A solutions data frame.

i

A vector of row indices.

...

Additional arguments.

Value

Row sliced object with appropriately preserved attributes.

Helper function to remove columns from a data frame

Description

Helper function to remove columns from a data frame

Usage

drop_cols(x, cols)

Arguments

x

A data frame

cols

Vector of column names to be removed

Value

x without columns in cols

Execute inclusion

Description

Given a data list and a settings data frame row, returns a data list of selected inputs.

Usage

drop_inputs(sdf_row, dl)

Arguments

sdf_row

Row of a settings data frame.

dl

A nested list of input data from data_list().

Value

A data list (class "list") in which any component with a corresponding 0 value in the provided settings data frame row has been removed.

Ensure the data item of each component is a `data.frame` class object

Description

Ensure the data item of each component is a data.frame class object

Usage

ensure_dll_df(dll)

Arguments

dll

A data list-like list class object.

Value

The provided dll with the data item of each component as a data frame.

Manhattan plot of feature-cluster association p-values

Description

Manhattan plot of feature-cluster association p-values

Usage

esm_manhattan_plot(
  esm,
  neg_log_pval_thresh = 5,
  threshold = NULL,
  point_size = 5,
  jitter_width = 0.1,
  jitter_height = 0.1,
  text_size = 15,
  plot_title = NULL,
  hide_x_labels = FALSE,
  bonferroni_line = FALSE
)

Arguments

esm

Extended solutions data frame storing associations between features and cluster assignments. See ?extend_solutions.

neg_log_pval_thresh

Threshold for negative log p-values.

threshold

P-value threshold to plot dashed line at.

point_size

Size of points in the plot.

jitter_width

Width of jitter.

jitter_height

Height of jitter.

text_size

Size of text in the plot.

plot_title

Title of the plot.

hide_x_labels

If TRUE, hides x-axis labels.

bonferroni_line

If TRUE, plots a dashed black line at the Bonferroni-corrected equivalent of the p-value threshold.

Value

A Manhattan plot (class "gg", "ggplot") showing the association p-values of features against each solution in the provided solutions data frame.

Examples

esm_manhattan_plot(mock_ext_solutions_df)

Estimate number of clusters for a similarity matrix

Description

Calculate eigengap and rotation-cost estimates of the number of clusters to use when clustering a similarity matrix. This function was adapted from SNFtool::estimateClustersGivenGraph, but scales up the Laplacian operator prior to eigenvalue calculations to minimize the risk of floating point-related errors.

Usage

estimate_nclust_given_graph(W, NUMC = 2:10)

Arguments

W

Similarity matrix to calculate number of clusters for.

NUMC

Range of cluster counts to consider among when picking best number of clusters.

Value

A list containing the top two eigengap and rotation-cost estimates for the number of clusters in a given similarity matrix.

Examples

input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)

sc <- snf_config(input_dl, n_solutions = 1)
sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)
sim_mat <- sim_mats_list(sol_df)[[1]]
estimate_nclust_given_graph(sim_mat)

Modification of SNFtool mock data frame "Data1"

Description

Modification of SNFtool mock data frame "Data1"

Usage

expression_df

Format

`expression_df`

A data frame with 200 rows and 3 columns:

gene_1_expression: Mock gene expression feature
gene_2_expression: Mock gene expression feature
patient_id: Random three-digit number uniquely identifying the patient

Source

This data came from the SNFtool package, with slight modifications.

Constructor for `ext_solutions_df` class object

Description

The extended solutions data frame is a column-extended variation of the solutions data frame. It contains association p-values relating cluster membership to feature distribution for all solutions in a solutions data frame and all features in a provided data list (or data lists). If a target data list was used during the call to extend_solutions, the extended solutions data frame will also have columns "min_pval", "mean_pval", and "max_pval" summarizing the p-values of just those features that were a part of the target list.

Usage

ext_solutions_df(ext_sol_dfl, sol_df, fts, target_dl)

Arguments

ext_sol_dfl

An extended solutions data frame-like object.

sol_df

Result of batch_snf storing cluster solutions and the settings that were used to generate them.

fts

A vector of all features that have association p-values stored in the resulting extended solutions data frame.

target_dl

A data list with features to calculate p-values for. Features in the target list will be included during p-value summary measure calculations.

Value

An ext_solutions_df class object.

Extend a solutions data frame to include outcome evaluations

Description

Extend a solutions data frame to include outcome evaluations

Usage

extend_solutions(
  sol_df,
  target_dl = NULL,
  dl = NULL,
  cat_test = "chi_squared",
  min_pval = 1e-10,
  processes = 1,
  verbose = FALSE
)

Arguments

sol_df

Result of batch_snf storing cluster solutions and the settings that were used to generate them.

target_dl

A data list with features to calculate p-values for. Features in the target list will be included during p-value summary measure calculations.

dl

A data list with features to calculate p-values for, but that should not be incorporated into p-value summary measure columns (i.e., min/mean/max p-value columns).

cat_test

String indicating which statistical test will be used to associate cluster with a categorical feature. Options are "chi_squared" for the Chi-squared test and "fisher_exact" for Fisher's exact test.

min_pval

If assigned a value, any p-value less than this will be replaced with this value.

processes

The number of processes to use for parallelization. Progress is only reported for sequential processing (processes = 1).

verbose

If TRUE, output progress to console.

Value

An extended solutions data frame (ext_sol_df class object) that contains p-value columns for each outcome in the provided data lists

Examples

## Not run: 
    input_dl <- data_list(
        list(gender_df, "gender", "demographics", "categorical"),
        list(diagnosis_df, "diagnosis", "clinical", "categorical"),
        uid = "patient_id"
    )
    
    sc <- snf_config(input_dl, n_solutions = 2)
    
    sol_df <- batch_snf(input_dl, sc)
    
    ext_sol_df <- extend_solutions(sol_df, input_dl)

## End(Not run)

Mock ABCD "colour" data

Description

Like the mock data frame "abcd_colour", but with "unique_id" as the "uid".

Usage

fav_colour

Format

`fav_colour`

A data frame with 275 rows and 2 columns:

unique_id: The unique identifier of the ABCD dataset
colour: Categorical transformation of cbcl_depress.

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Return character vector of features stored in an object

Description

Return character vector of features stored in an object

Usage

features(x)

Arguments

x

The object to pull features from.

Value

A character vector of features in x.

Fisher exact test p-value

Description

Return p-value for Fisher exact test for any two features

Usage

fisher_exact_pval(cat_var1, cat_var2)

Arguments

cat_var1

A categorical feature.

cat_var2

A categorical feature.

Value

pval A p-value (class "numeric").

Mock gender data

Description

Mock gender data

Usage

gender_df

Format

`gender_df`

A data frame with 200 rows and 2 columns:

patient_id: Random three-digit number uniquely identifying the patient
gender_df: Mock gene methylation feature

Source

This data came from the SNFtool package, with slight modifications.

Generate annotations list

Description

Intermediate function that takes in formatted lists of features and the annotations they should be viewed through and returns annotation objects usable by ComplexHeatmap::Heatmap.

Usage

generate_annotations_list(
  df,
  left_bar = NULL,
  right_bar = NULL,
  top_bar = NULL,
  bottom_bar = NULL,
  left_hm = NULL,
  right_hm = NULL,
  top_hm = NULL,
  bottom_hm = NULL,
  show_legend = TRUE,
  annotation_colours = NULL
)

Arguments

df

data frame containing all the data that is specified in the remaining arguments.

left_bar

Named list of strings, where the strings are features in df that should be used for a barplot annotation on the left of the plot and the names are the names that will be used to caption the plots and their legends.

right_bar

See left_bar.

top_bar

See left_bar.

bottom_bar

See left_bar.

left_hm

Like left_bar, but with a heatmap annotation instead of a barplot annotation.

right_hm

See left_hm.

top_hm

See left_hm.

bottom_hm

See left_hm.

show_legend

Add legends to the annotations.

annotation_colours

Named list of heatmap annotations and their colours.

Value

annotations_list A named list of all the annotations.

Generate a clustering algorithms list

Description

Deprecated function for building a clustering algorithms list. Please use clust_fns_list() (or better yet, snf_config()) instead.

Usage

generate_clust_algs_list(..., disable_base = FALSE)

Arguments

...

An arbitrary number of named clustering functions

disable_base

If TRUE, do not prepend the base clustering algorithms (spectral_eigen and spectral_rot, which apply spectral clustering and use the eigen-gap and rotation cost heuristics respectively for determining the number of clusters in the graph.

Value

A list of clustering algorithm functions that can be passed into the batch_snf and generate_settings_list functions.

Generate a list of distance metrics

Description

Deprecated function for building a distance metrics list. Please use dist_fns_list() (or better yet, snf_config()) instead.

Usage

generate_distance_metrics_list(
  continuous_distances = NULL,
  discrete_distances = NULL,
  ordinal_distances = NULL,
  categorical_distances = NULL,
  mixed_distances = NULL,
  keep_defaults = TRUE
)

Arguments

continuous_distances

A named list of distance metric functions

discrete_distances

A named list of distance metric functions

ordinal_distances

A named list of distance metric functions

categorical_distances

A named list of distance metric functions

mixed_distances

A named list of distance metric functions

keep_defaults

If TRUE (default), prepend the base distance metrics (euclidean and standard normalized euclidean)

Value

A nested and named list of distance metrics functions.

Build a settings data frame

Description

Deprecated function for building a settings matrix. Please use settings_df() instead.

Usage

generate_settings_matrix(...)

Arguments

...

Arguments used to generate a settings matrix.

Value

Raises a deprecated error.

Extract cluster membership information from one solutions data frame row

Description

Deprecated function for building extracting cluster solutions from a solutions data frame. Please use t() instead.

This function takes in a single row of a solutions data frame and returns a data frame containing the cluster assignments for each uid. It is similar to get_clusters(), which takes one solutions data frame row and returns a vector of cluster assignments' and get_cluster_solutions(), which takes a solutions data frame with any number of rows and returns a data frame indicating the cluster assignments for each of those rows.

Usage

get_cluster_df(sol_df_row)

Arguments

sol_df_row

One row from a solutions data frame.

Value

cluster_df data frame of cluster and uid.

Extract cluster membership information from a sol_df

Description

Deprecated function for building extracting cluster solutions from a solutions data frame. Please use t() instead.

This function takes in a solutions data frame and returns a data frame containing the cluster assignments for each uid. It is similar to 'get_clusters(), which takes one solutions data frame row and returns a vector of cluster assignments' and get_cluster_df(), which takes a solutions matrix with only one row and returns a data frame with two columns: "cluster" and "uid" (the UID of the observation).

Usage

get_cluster_solutions(sol_df)

Arguments

sol_df

A sol_df.

Value

A "data.frame" object where each row is an observation and each column (apart from the uid column) indicates the cluster that observation as assigned to for the corresponding solutions data frame row.

Extract cluster membership vector from one solutions data frame row

Description

Deprecated function for building extracting cluster solutions from a solutions data frame. Please use t() instead.

This function takes in a single row of a solutions data frame and returns a vector containing the cluster assignments for each observation. It is similar to get_cluster_df(), which takes a solutions data frame with only one row and returns a data frame with two columns: "cluster" and "uid" '(the UID of the observation) and get_cluster_solutions(), which takes a solutions data frame with any number of rows and returns a data frame indicating the cluster assignments for each of those rows.

Usage

get_clusters(sol_df_row)

Arguments

sol_df_row

Output matrix row.

Value

clusters Vector of assigned clusters.

Pull complete-data UIDs from a list of data frames

Description

This function identifies all observations within a list of data frames that have no missing data across all data frames. This function is useful when constructing data lists of distinct feature sets from the same sample of observations. As data_list() strips away observations with any missing data, distinct sets of observations may be generated by building a data list from the same group of observations over different sets of features. Reducing the pool of observations to only those with complete UIDs first will avoid downstream generation of data lists of differing sizes.

Usage

get_complete_uids(list_of_dfs, uid)

Arguments

list_of_dfs

List of data frames.

uid

Name of column across data frames containing UIDs

Value

A character vector of the UIDs of observations that have complete data across the provided list of data frames.

Examples

complete_uids <- get_complete_uids(
    list(income, pubertal, anxiety, depress),
    uid = "unique_id"
)

income <- income[income$"unique_id" %in% complete_uids, ]
pubertal <- pubertal[pubertal$"unique_id" %in% complete_uids, ]
anxiety <- anxiety[anxiety$"unique_id" %in% complete_uids, ]
depress <- depress[depress$"unique_id" %in% complete_uids, ]

input_dl <- data_list(
    list(income, "income", "demographics", "ordinal"),
    list(pubertal, "pubertal", "demographics", "continuous"),
    uid = "unique_id"
)

target_dl <- data_list(
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

Calculate distance matrices

Description

Given a data frame of numerical features, return a euclidean distance matrix.

Usage

get_dist_matrix(
  df,
  input_type,
  cnt_dist_fn,
  dsc_dist_fn,
  ord_dist_fn,
  cat_dist_fn,
  mix_dist_fn,
  weights_row
)

Arguments

df

Raw data frame with subject IDs in column "uid"

input_type

Either "numeric" (resulting in euclidean distances), "categorical" (resulting in binary distances), or "mixed" (resulting in gower distances)

cnt_dist_fn

distance metric function for continuous data

dsc_dist_fn

distance metric function for discrete data

ord_dist_fn

distance metric function for ordinal data

cat_dist_fn

distance metric function for categorical data

mix_dist_fn

distance metric function for mixed data

weights_row

Single-row data frame where the column names contain the column names in df and the row contains the corresponding weights_row.

Value

dist_matrix Matrix of inter-observation distances.

Extract UIDs from a data list

Description

Deprecated function for extracting UIDs from a data list. Please use uids() instead.

Usage

get_dl_uids(dl, prefix = FALSE)

Arguments

dl

A nested list of input data from data_list().

prefix

If TRUE, preserves the "uid_" prefix added to UIDs when creating a data list.

Value

A character vector of the UID labels contained in a data list.

Return the row or column ordering present in a heatmap

Description

Return the row or column ordering present in a heatmap

Usage

get_heatmap_order(heatmap, type = "rows")

Arguments

heatmap

A heatmap object to collect ordering from.

type

The type of ordering to return. Either "rows" or "columns".

Value

A numeric vector of the ordering used within the provided ComplexHeatmap "Heatmap" object.

Return the hierarchical clustering order of a matrix

Description

Return the hierarchical clustering order of a matrix

Usage

get_matrix_order(matrix, dist_method = "euclidean", hclust_method = "complete")

Arguments

matrix

Matrix to cluster.

dist_method

hclust_method

Agglomerative method to use when calculating sorting order by stats::hclust. Options include "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid".

Value

A numeric vector of the ordering derived by the specified hierarchical clustering method applied to the provided matrix.

Examples


dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

sc <- snf_config(
    dl = dl,
    n_solutions = 20,
    min_k = 20,
    max_k = 50
)

sol_df <- batch_snf(dl, sc)

ext_sol_df <- extend_solutions(
    sol_df,
    dl = dl,
    min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
)

# Calculate pairwise similarities between cluster solutions
sol_aris <- calc_aris(sol_df)

# Extract hierarchical clustering order of the cluster solutions
meta_cluster_order <- get_matrix_order(sol_aris)

Get mean p-value

Description

Given an solutions data frame row containing evaluated p-values, returns mean.

Usage

get_mean_pval(sol_df_row)

Arguments

sol_df_row

row of sol_df object

Value

mean_pval mean p-value

Get minimum p-value

Description

Given an solutions data frame row containing evaluated p-values, returns min.

Usage

get_min_pval(sol_df_row)

Arguments

sol_df_row

row of sol_df object

Value

min_pval minimum p-value

Get p-values from an extended solutions data frame

Description

This function can be used to neatly format the p-values associated with an extended solutions data frame. It can also calculate the negative logs of those p-values to make it easier to interpret large-scale differences.

Usage

get_pvals(ext_sol_df, negative_log = FALSE, keep_summaries = TRUE)

Arguments

ext_sol_df

The output of extend_solutions. A data frame that contains at least one p-value column ending in "_pval".

negative_log

If TRUE, will replace p-values with negative log p-values.

keep_summaries

If FALSE, will remove the mean, min, and max p-value.

Value

A "data.frame" class object Of only the p-value related columns of the provided ext_sol_df.

Extract representative solutions from a matrix of ARIs

Description

Following clustering with batch_snf, a matrix of pairwise ARIs that show how related each cluster solution is to each other can be generated by the calc_aris function. Partitioning of the ARI matrix can be done by visual inspection of meta_cluster_heatmap() results or by shiny_annotator. Given the indices of meta cluster boundaries, this function will return a single representative solution from each meta cluster based on maximum average ARI to all other solutions within that meta cluster.

Usage

get_representative_solutions(aris, sol_df, filter_fn = NULL)

Arguments

aris

Matrix of adjusted rand indices from calc_aris()

sol_df

Output of batch_snf containing cluster solutions.

filter_fn

Optional function to filter the meta-cluster by prior to maximum average ARI determination. This can be useful if you are explicitly trying to select a solution that meets a certain condition, such as only picking from the 4 cluster solutions within a meta cluster. An example valid function could be fn <- function(x) x[x$"nclust" == 4, ].

Value

The provided solutions data frame reduced to just one row per meta cluster defined by the split vector.

Examples


    dl <- data_list(
        list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
        list(income, "household_income", "demographics", "continuous"),
        list(pubertal, "pubertal_status", "demographics", "continuous"),
        list(anxiety, "anxiety", "behaviour", "ordinal"),
        list(depress, "depressed", "behaviour", "ordinal"),
        uid = "unique_id"
    )
    
    sc <- snf_config(
        dl = dl,
        n_solutions = 20,
        min_k = 20,
        max_k = 50
    )
    
    sol_df <- batch_snf(dl, sc)
    
    ext_sol_df <- extend_solutions(
        sol_df,
        dl = dl,
        min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
    )
    
    # Calculate pairwise similarities between cluster solutions
    sol_aris <- calc_aris(sol_df)
    
    # Extract hierarchical clustering order of the cluster solutions
    meta_cluster_order <- get_matrix_order(sol_aris)
    
    # Identify meta cluster boundaries with shiny app or trial and error
    # ari_hm <- meta_cluster_heatmap(sol_aris, order = meta_cluster_order)
    # shiny_annotator(ari_hm)
    
    # Result of meta cluster examination
    split_vec <- c(2, 5, 12, 17)
    
    ext_sol_df <- label_meta_clusters(ext_sol_df, split_vec, meta_cluster_order)
    
    # Extracting representative solutions from each defined meta cluster
    rep_solutions <- get_representative_solutions(sol_aris, ext_sol_df)

Helper function to drop columns from a data frame by grepl search

Description

Helper function to drop columns from a data frame by grepl search

Usage

gexclude(x, pattern)

Arguments

x

Data frame to drop columns from.

pattern

Pattern used to match columns to drop.

Value

x without columns matching pattern.

Helper function to pick columns from a data frame by `grepl` search

Description

Helper function to pick columns from a data frame by grepl search

Usage

gselect(x, pattern)

Arguments

x

Data frame to select columns from.

pattern

Pattern used to match columns to select.

Value

x with only columns matching pattern.

Mock ABCD income data

Description

Like the mock data frame "abcd_h_income", but with "unique_id" as the "uid".

Like the mock data frame "abcd_cort_sa", but with "unique_id" as the "uid".

Usage

income

income

Format

`income`

A data frame with 300 rows and 2 columns:

unique_id: The unique identifier of the ABCD dataset
household_income: Household income in 3 category levels (low = 1, medium = 2, high = 3)

`income`

A data frame with 300 rows and 2 columns:

unique_id: The unique identifier of the ABCD dataset
household_income: Household income in 3 category levels (low = 1, medium = 2, high = 3)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Test if the object is a data list

Description

Given an object, returns TRUE if that object inherits from the data_list class.

Usage

is_data_list(x)

Arguments

x

An object.

Value

TRUE if the object inherits from the data_list class.

Jitter plot separating a feature by cluster

Description

Jitter plot separating a feature by cluster

Usage

jitter_plot(df, feature)

Arguments

df

A data.frame containing cluster column and the feature to plot.

feature

The feature to plot.

Value

A jitter+violin plot (class "gg", "ggplot") showing the distribution of a feature across clusters.

Assign meta cluster labels to rows of a solutions data frame or extended solutions data frame

Description

Given a solutions data frame or extended solutions data frame class object and a numeric vector indicating which rows correspond to which meta clusters, assigns meta clustering information to the "meta_clusters" attribute of the data frame.

Usage

label_meta_clusters(sol_df, split_vector, order = NULL)

Arguments

sol_df

A solutions data frame or extended solutions data frame to assign meta clusters to.

split_vector

A numeric vector indicating which rows of sol_df should be the split points for meta cluster labeling.

order

An optional numeric vector indicating how the solutions data frame should be reordered prior to meta cluster labeling. This vector can be obtained by running get_matrix_order() on an ARI matrix, which itself can be obtained by calling calc_aris() on a solutions data frame.

Value

A solutions data frame with a populated "meta_clusters" attribute.

Examples


    dl <- data_list(
        list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
        list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
        list(income, "household_income", "demographics", "continuous"),
        list(pubertal, "pubertal_status", "demographics", "continuous"),
        uid = "unique_id"
    )
    
    set.seed(42)
    my_sc <- snf_config(
        dl = dl,
        n_solutions = 20,
        min_k = 20,
        max_k = 50
    )
    
    sol_df <- batch_snf(dl, my_sc)
    
    sol_df
    
    sol_aris <- calc_aris(sol_df)
    
    meta_cluster_order <- get_matrix_order(sol_aris)
    
    # `split_vec` found by iteratively plotting ari_hm or by ?shiny_annotator()
    split_vec <- c(6, 10, 16)
    ari_hm <- meta_cluster_heatmap(
        sol_aris,
        order = meta_cluster_order,
        split_vector = split_vec
    )
    
    mc_sol_df <- label_meta_clusters(
        sol_df,
        order = meta_cluster_order,
        split_vector = split_vec
    )
    
    mc_sol_df

Label propagation

Description

Given a full fused network (one containing both pre-clustered observations and to-be-clustered observations) and the clusters of the pre-clustered observations, return a label propagated list of clusters for all observations. This function is derived from SNFtool::groupPredict. Modifications are made to take a full fused network as input, rather than taking input data frames and running SNF internally. This ensures that alternative approaches to data normalization and distance matrix calculations can be chosen by the user.

Usage

label_prop(full_fused_network, clusters)

Arguments

full_fused_network

A network made by running SNF on training and test observations together.

clusters

A vector of assigned clusters for training observations in matching order as they appear in full_fused_network.

Value

A list of cluster labels for all observations.

Label propagate cluster solutions to non-clustered observations

Description

Given a solutions data frame containing clustered observations and a data list containing those clustered observations as well as additional to-be-clustered observations, this function will re-run SNF to generate a similarity matrix of all observations and use the label propagation algorithm to assigned predicted clusters to the non-clustered observations.

Usage

label_propagate(partial_sol_df, full_dl, verbose = FALSE)

Arguments

partial_sol_df

A solutions data frame derived from the training set.

full_dl

A data list containing observations from both the training and testing sets.

verbose

If TRUE, output progress to console.

Value

A data frame with one row per observation containing a column for UIDs, a column for whether the observation was in the train (original) or test (held out) set, and one column per row of the solutions data frame indicating the original and propagated clusters.

Examples


# Function to identify observations with complete data
uids_with_complete_obs <- get_complete_uids(
    list(subc_v, income, pubertal, anxiety, depress),
    uid = "unique_id"
)

# Dataframe assigning 80% of observations to train and 20% to test
train_test_split <- train_test_assign(
    train_frac = 0.8,
    uids = uids_with_complete_obs
)

# Pulling the training and testing observations specifically
train_obs <- train_test_split$"train"
test_obs <- train_test_split$"test"

# Partition a training set
train_subc_v <- subc_v[subc_v$"unique_id" %in% train_obs, ]
train_income <- income[income$"unique_id" %in% train_obs, ]
train_pubertal <- pubertal[pubertal$"unique_id" %in% train_obs, ]
train_anxiety <- anxiety[anxiety$"unique_id" %in% train_obs, ]
train_depress <- depress[depress$"unique_id" %in% train_obs, ]

# Partition a test set
test_subc_v <- subc_v[subc_v$"unique_id" %in% test_obs, ]
test_income <- income[income$"unique_id" %in% test_obs, ]
test_pubertal <- pubertal[pubertal$"unique_id" %in% test_obs, ]
test_anxiety <- anxiety[anxiety$"unique_id" %in% test_obs, ]
test_depress <- depress[depress$"unique_id" %in% test_obs, ]

# Find cluster solutions in the training set
train_dl <- data_list(
    list(train_subc_v, "subc_v", "neuroimaging", "continuous"),
    list(train_income, "household_income", "demographics", "continuous"),
    list(train_pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

# We'll pick a solution that has good separation over our target features
train_target_dl <- data_list(
    list(train_anxiety, "anxiety", "behaviour", "ordinal"),
    list(train_depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

sc <- snf_config(
    train_dl,
    n_solutions = 5,
    min_k = 10,
    max_k = 30
)

train_sol_df <- batch_snf(
    train_dl,
    sc,
    return_sim_mats = TRUE
)

ext_sol_df <- extend_solutions(
    train_sol_df,
    train_target_dl
)

# Determining solution with the lowest minimum p-value
lowest_min_pval <- min(ext_sol_df$"min_pval")
which(ext_sol_df$"min_pval" == lowest_min_pval)
top_row <- ext_sol_df[1, ]

# Propagate that solution to the observations in the test set
# data list below has both training and testing observations
full_dl <- data_list(
    list(subc_v, "subc_v", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

# Use the solutions data frame from the training observations and the data list
# from the training and testing observations to propagate labels to the test observations
propagated_labels <- label_propagate(top_row, full_dl)

propagated_labels_all <- label_propagate(ext_sol_df, full_dl)

head(propagated_labels_all)
tail(propagated_labels_all)

Convert a vector of partition indices into meta cluster labels

Description

Convert a vector of partition indices into meta cluster labels

Usage

label_splits(split_vector, nrow)

Arguments

split_vector

A vector of partition indices.

nrow

The number of rows in the data being partitioned.

Value

A character vector that expands the split_vector into an nrow-length sequence of ascending letters of the alphabet. If the split vector is c(3, 6) and the number of rows is 8, the result will be a vector of two "A"s (up to the first index, 3), three "B"s (up to the second index, 6), and three "C"s (up to and including the last index, 8).

Linearly correct data list by features with unwanted signal

Description

Given a data list to correct and another data list of categorical features to linearly adjust for, corrects the first data list based on the residuals of the linear model relating the numeric features in the first data list to the unwanted signal features in the second data list.

Usage

linear_adjust(dl, unwanted_signal_list, sig_digs = NULL)

Arguments

dl

A nested list of input data from data_list().

unwanted_signal_list

A data list of categorical features that should have their mean differences removed in the first data list.

sig_digs

Number of significant digits to round the residuals to.

Value

A data list ("list") in which each data component has been converted to contain residuals off of the linear model built against the features in the unwanted_signal_list.

Examples

has_tutor <- sample(c(1, 0), size = 9, replace = TRUE)
math_score <- 70 + 30 * has_tutor + rnorm(9, mean = 0, sd = 5)

math_df <- data.frame(uid = paste0("id_", 1:9), math = math_score)
tutor_df <- data.frame(uid = paste0("id_", 1:9), tutor = has_tutor)

dl <- data_list(
    list(math_df, "math_score", "school", "continuous"),
    uid = "uid"
)

adjustment_dl <- data_list(
    list(tutor_df, "tutoring", "school", "categorical"),
    uid = "uid"
)

adjusted_dl <- linear_adjust(dl, adjustment_dl)

adjusted_dl[[1]]$"data"$"math"

# Equivalent to:
as.numeric(resid(lm(math_score ~ has_tutor)))

Linear model p-value (generic)

Description

Return p-value of F-test for a linear model of any two features

Usage

linear_model_pval(predictor, response)

Arguments

predictor

A categorical or numeric feature.

response

A numeric feature.

Value

pval A p-value (class "numeric").

Manhattan plot of feature-meta cluster association p-values

Description

Given a data frame of representative meta cluster solutions (see get_representative_solutions(), returns a Manhattan plot for showing feature separation across all features in provided data/target lists.

Usage

mc_manhattan_plot(
  ext_sol_df,
  dl = NULL,
  target_dl = NULL,
  variable_order = NULL,
  neg_log_pval_thresh = 5,
  threshold = NULL,
  point_size = 5,
  text_size = 20,
  plot_title = NULL,
  xints = NULL,
  hide_x_labels = FALSE,
  domain_colours = NULL
)

Arguments

ext_sol_df

A sol_df that contains "_pval" columns containing the values to be plotted. This object is the output of extend_solutions().

dl

List of data frames containing data information.

target_dl

List of data frames containing target information.

variable_order

Order of features to be displayed in the plot.

neg_log_pval_thresh

Threshold for negative log p-values.

threshold

p-value threshold to plot horizontal dashed line at.

point_size

Size of points in the plot.

text_size

Size of text in the plot.

plot_title

Title of the plot.

xints

Either "outcomes" or a vector of numeric values to plot vertical lines at.

hide_x_labels

If TRUE, hides x-axis labels.

domain_colours

Named vector of colours for domains.

Value

A Manhattan plot (class "gg", "ggplot") showing the association p-values of features against each solution in the provided solutions data frame, stratified by meta cluster label.

Examples


    dl <- data_list(
        list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
        list(income, "household_income", "demographics", "continuous"),
        list(pubertal, "pubertal_status", "demographics", "continuous"),
        list(anxiety, "anxiety", "behaviour", "ordinal"),
        list(depress, "depressed", "behaviour", "ordinal"),
        uid = "unique_id"
    )
    
    sc <- snf_config(
        dl = dl,
        n_solutions = 20,
        min_k = 20,
        max_k = 50
    )
    
    sol_df <- batch_snf(dl, sc)
    
    ext_sol_df <- extend_solutions(
        sol_df,
        dl = dl,
        min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
    )
    
    # Calculate pairwise similarities between cluster solutions
    sol_aris <- calc_aris(sol_df)
    
    # Extract hierarchical clustering order of the cluster solutions
    meta_cluster_order <- get_matrix_order(sol_aris)
    
    # Identify meta cluster boundaries with shiny app or trial and error
    # ari_hm <- meta_cluster_heatmap(sol_aris, order = meta_cluster_order)
    # shiny_annotator(ari_hm)
    
    # Result of meta cluster examination
    split_vec <- c(2, 5, 12, 17)
    
    ext_sol_df <- label_meta_clusters(ext_sol_df, split_vec, meta_cluster_order)
    
    # Extracting representative solutions from each defined meta cluster
    rep_solutions <- get_representative_solutions(sol_aris, ext_sol_df)
    
    mc_manhattan <- mc_manhattan_plot(
        rep_solutions,
        dl = dl,
        point_size = 3,
        text_size = 12,
        plot_title = "Feature-Meta Cluster Associations",
        threshold = 0.05,
        neg_log_pval_thresh = 5
    )
    mc_manhattan

Merge `clust_fns_list` objects

Description

Merge clust_fns_list objects

Usage

## S3 method for class 'clust_fns_list'
merge(x, y, ...)

Arguments

x

The first clust_fns_list object to merge.

y

The second clust_fns_list object to merge.

...

Additional arguments (not used).

Value

A new clust_fns_list object containing the merged clustering functions.

Merge observations between two compatible data lists

Description

Join two data lists with the same components (data frames) but separate observations. To instead merge two data lists that have the same observations but different components, simply use c().

Usage

## S3 method for class 'data_list'
merge(x, y, ...)

Arguments

x

The first data list to merge.

y

The second data list to merge.

...

Additional arguments passed into merge function.

Value

A data list ("list"-class object) containing the observations of both provided data lists.

Merge `dist_fns_list` objects

Description

Merge dist_fns_list objects

Usage

## S3 method for class 'dist_fns_list'
merge(x, y, ...)

Arguments

x

The first clust_fns_list object to merge.

y

The second clust_fns_list object to merge.

...

Additional arguments (not used).

Value

A new clust_fns_list object containing the merged clustering functions.

Merge `ext_solutions_df` objects

Description

Merge ext_solutions_df objects

Usage

## S3 method for class 'ext_solutions_df'
merge(x, y, ...)

Arguments

x

The first ext_solutions_df object to merge.

y

The second ext_solutions_df object to merge.

...

Additional arguments (not used).

Value

Error message indicating that the merge function is not applicable to ext_solutions_df objects.

Merge `settings_df` objects

Description

Merge settings_df objects

Usage

## S3 method for class 'settings_df'
merge(x, y, ...)

Arguments

x

The first settings_df object to merge.

y

The second settings_df object to merge.

...

Additional arguments (not used).

Value

Error message indicating that the merge function is not applicable to settings_df objects.

Merge `sim_mats_list` objects

Description

Merge sim_mats_list objects

Usage

## S3 method for class 'sim_mats_list'
merge(x, y, ...)

Arguments

x

The first sim_mats_list object to merge.

y

The second sim_mats_list object to merge.

...

Additional arguments (not used).

Value

A merged sim_mats_list object containing the similarity matrices from both input objects.

Merge method for SNF config objects

Description

Merge method for SNF config objects

Usage

## S3 method for class 'snf_config'
merge(x, y, reset_indices = TRUE, ...)

Arguments

x

SNF config to merge.

y

SNF config to merge.

reset_indices

If TRUE (default), re-labels the "solutions" indices in the config from 1 to the number of defined settings.

...

Additional arguments passed into merge function.

Value

An SNF config combining the rows of both prior configurations.

Merge `solutions_df` objects

Description

Merge solutions_df objects

Usage

## S3 method for class 'solutions_df'
merge(x, y, ...)

Arguments

x

The first solutions_df object to merge.

y

The second solutions_df object to merge.

...

Additional arguments (not used).

Value

Error message indicating that the merge function is not applicable to solutions_df objects.

Merge `t_ext_solutions_df` objects

Description

Merge t_ext_solutions_df objects

Usage

## S3 method for class 't_ext_solutions_df'
merge(x, y, ...)

Arguments

x

The first t_ext_solutions_df object to merge.

y

The second t_ext_solutions_df object to merge.

...

Additional arguments (not used).

Value

Error message indicating that the merge function is not applicable to t_ext_solutions_df objects.

Merge `t_solutions_df` objects

Description

Merge t_solutions_df objects

Usage

## S3 method for class 't_solutions_df'
merge(x, y, ...)

Arguments

x

The first t_solutions_df object to merge.

y

The second t_solutions_df object to merge.

...

Additional arguments (not used).

Value

Error message indicating that the merge function is not applicable to t_solutions_df objects.

Merge `weights_matrix` objects

Description

Merge weights_matrix objects

Usage

## S3 method for class 'weights_matrix'
merge(x, y, ...)

Arguments

x

The first weights_matrix object to merge.

y

The second weights_matrix object to merge.

...

Additional arguments (not used).

Value

Error message indicating that the merge function is not applicable to weights_matrix objects.

Merge list of data frames into a single data frame

Description

This helper function combines all data frames in a single-level list into a single data frame.

Usage

merge_df_list(df_list, join = "inner", uid = "uid", no_na = FALSE)

Arguments

df_list

list of data frames.

join

String indicating if join should be "inner" or "full".

uid

Column name to join on. Default is "uid".

no_na

Whether to remove NA values from the merged data frame.

Value

Inner join of all data frames in list.

Examples

merge_df_list(list(income, pubertal), uid = "unique_id")

Helper function for raising alerts

Description

Helper function for raising alerts

Usage

metasnf_alert(..., env = 1)

Arguments

...

Arbitrary number of strings to be pasted together into alert message.

env

Environment to evaluate expressions in.

Value

Returns no value. Raises an alert through cli::cli_alert_info.

Helper function for defunct function errors

Description

Helper function for defunct function errors

Usage

metasnf_defunct(version, alternative, env = 1)

Arguments

version

Version of metasnf in which function has been made defunct.

alternative

Recommended alternative approach.

env

Environment to evaluate expressions in.

Value

Returns no value. Raises an error through cli::cli_abort.

Helper function for deprecated function warnings

Description

Helper function for deprecated function warnings

Usage

metasnf_deprecated(version, alternative, env = 1)

Arguments

version

Version of metasnf in which function has been deprecated.

alternative

Recommended alternative approach.

env

Environment to evaluate expressions in.

Value

Returns no value. Raises a warning through cli::cli_warn.

Helper function for raising errors

Description

Helper function for raising errors

Usage

metasnf_error(..., env = 1)

Arguments

...

Arbitrary number of strings to be pasted together into error message.

env

Value

Returns no value. Raises an error through cli::cli_abort.

Helper function for raising warnings

Description

Helper function for raising warnings

Usage

metasnf_warning(..., env = 1)

Arguments

...

Arbitrary number of strings to be pasted together into warning message.

env

Environment to evaluate expressions in.

Value

Returns no value. Raises a warning through cli::cli_warn

Modification of SNFtool mock data frame "Data2"

Description

Modification of SNFtool mock data frame "Data2"

Usage

methylation_df

Format

`methylation_df`

A data frame with 200 rows and 3 columns:

gene_1_expression: Mock gene methylation feature
gene_2_expression: Mock gene methylation feature
patient_id: Random three-digit number uniquely identifying the patient

Source

This data came from the SNFtool package, with slight modifications.

Mock example of an `ari_matrix` metasnf object

Description

An ari_matrix class object containing adjusted Rand indices (ARIs) between 20 cluster solutions. Used as an example of an ari_matrix metasnf object.

Usage

mock_ari_matrix

Format

`mock_ari_matrix`

A 20 by 20 ARI matrix.

Source

This data comes from the metasnf package.

Mock example of a `clust_fns_list` metasnf object

Description

Mock example of a clust_fns_list metasnf object

Usage

mock_clust_fns_list

Format

`mock_clust_fns_list`

A clust_fns_list object containing two clustering functions covering 2 and 5 five cluster solution versions of spectral clustering. Extracted from mock_snf_config.

Source

This data comes from the metasnf package.

Mock example of a `data_list` metasnf object

Description

Mock example of a data_list metasnf object

Usage

mock_data_list

Format

`mock_data_list`

A data list containing 4 data frames with 100 observations each: - subcortical volume (30 features) - cortical surface area (151 features) - household income (1 feature) - pubertal status (1 feature) Used as an example of an data_list metasnf object.

Source

This data comes from the metasnf package.

Mock example of a `dist_fns_list` metasnf object

Description

Mock example of a dist_fns_list metasnf object

Usage

mock_dist_fns_list

Format

`mock_dist_fns_list`

A dist_fns_list object containing a variety of distance metrics. Extracted from mock_snf_config.

Source

This data comes from the metasnf package.

Mock example of a `ext_solutions_df` metasnf object

Description

An ext_solutions_df class object generated by extending the mock_rep_solutions_df object against mock_data_list as the target data list.

Usage

mock_ext_solutions_df

Format

`mock_ext_solutions_df`

Contains 20 cluster solutions.

Source

This data comes from the metasnf package.

Mock example of a `mc_solutions_df` metasnf object

Description

Mock example of a mc_solutions_df metasnf object

Usage

mock_mc_solutions_df

Format

`mock_mc_solutions_df`

A meta cluster labeled solutions data frame derived from mock_solutions_df. Contains 20 cluster solutions.

Source

This data comes from the metasnf package.

Mock example of a `rep_solutions_df` metasnf object

Description

A solutions_df class object derived by filtering the mock_mc_solutions_df to its representative solutions.

Usage

mock_rep_solutions_df

Format

`mock_rep_solutions_df`

Contains 4 cluster solutions.

Source

This data comes from the metasnf package.

Mock example of a `settings_df` metasnf object

Description

Mock example of a settings_df metasnf object

Usage

mock_settings_df

Format

`mock_settings_df`

Settings for 20 cluster solutions.

Source

This data comes from the metasnf package.

Mock example of a `snf_config` metasnf object

Description

Mock example of a snf_config metasnf object

Usage

mock_snf_config

Format

`mock_snf_config`

An SNF config containing hyperparameters and functions defined for generating 20 cluster solutions from a data list. The config has been specified to: - limit the k hyperparameter to 40 - make use of uniformly distributed random weights - randomly select between using spectral clustering where the number of clusters can be 2, 5, decided by the eigen-gap heuristic, or decided by the rotation cost heuristic - use Gower distance for categorical and mixed data, Euclidean distance for ordinal data, and randomly select from Euclidean distance or standard/normalized Euclidean distance for continuous and discrete data The config was built using the mock_data_list loaded into the namespace after calling library("metasnf"). Used as an example of an snf_config metasnf object.

Source

This data comes from the metasnf package.

Mock example of a `solutions_df` metasnf object

Description

Mock example of a solutions_df metasnf object

Usage

mock_solutions_df

Format

`mock_solutions_df`

A solutions data frame containing 20 cluster solutions generated from mock_snf_config and mock_data_list. Used as an example of an solutions_df metasnf object.

Source

This data comes from the metasnf package.

Mock example of a `t_solutions_df` metasnf object

Description

Mock example of a t_solutions_df metasnf object

Usage

mock_t_solutions_df

Format

`mock_t_solutions_df`

A transposed solutions data frame containing 20 cluster solutions generated from mock_solutions_df. Used as an example of a t_solutions_df metasnf object.

Source

This data comes from the metasnf package.

Mock example of a `weights_matrix` metasnf object

Description

Mock example of a weights_matrix metasnf object

Usage

mock_weights_matrix

Format

`mock_weights_matrix`

A weights_matrix class object containing 20 sets of weights for 183 features.

Source

This data comes from the metasnf package.

Extract number of features stored in an object

Description

Extract number of features stored in an object

Usage

n_features(x)

Arguments

x

The object to extract number of features from.

Value

The number of features in x.

Extract number of observations stored in an object

Description

Extract number of observations stored in an object

Usage

n_observations(x)

Arguments

x

The object to extract number of observations from.

Value

The number of observations in x.

Constructor for `ari_matrix` class object

Description

Constructor for ari_matrix class object

Usage

new_ari_matrix(aml, dist_method, hclust_method)

Arguments

aml

An ari_matrix-like matrix object to be validated.

Value

A ari_matrix object.

Constructor for `clust_fns_list` class object

Description

Constructor for clust_fns_list class object

Usage

new_clust_fns_list(cfll)

Arguments

cfll

A clust_fns_list-like list class object.

Value

A clust_fns_list class object.

Constructor for `data_list` class object

Description

Constructor for data_list class object

Usage

new_data_list(dll)

Arguments

dll

A data list-like list class object.

Value

A data_list object, which is a nested list with class data_list.

Constructor for `dist_fns_list` class object

Description

Constructor for dist_fns_list class object

Usage

new_dist_fns_list(dfll)

Arguments

dfll

A distance metrics list-like list object to be validated.

Value

A dist_fns_list object.

Constructor for `ext_solutions_df` class object

Description

Constructor for ext_solutions_df class object

Usage

new_ext_solutions_df(ext_sol_dfl)

Arguments

ext_sol_dfl

An extended solutions data frame-like object.

Value

An ext_solutions_df object, which is a data frame with class ext_solutions_df.

Constructor for `settings_df` class object

Description

Constructor for settings_df class object

Usage

new_settings_df(sdfl)

Arguments

sdfl

A settings data frame-like matrix object to be validated.

Value

A settings_df object.

Constructor for `similarity_matrix_list` class object

Description

Constructor for similarity_matrix_list class object

Usage

new_sim_mats_list(smll)

Arguments

smll

A similarity matrix list-like object.

Value

A similarity_matrix_list class object.

Constructor for `snf_config` class object

Description

Constructor for snf_config class object

Usage

new_snf_config(scl)

Arguments

scl

An SNF config-like list class object.

Value

An snf_config object.

Constructor for `solutions_df` class object

Description

Constructor for solutions_df class object

Usage

new_solutions_df(sol_dfl)

Arguments

sol_dfl

A solutions data frame-like object to be validated and converted into a solutions data frame.

Value

A solutions_df class object.

Constructor for `weights_matrix` class object

Description

Constructor for weights_matrix class object

Usage

new_weights_matrix(wml)

Arguments

wml

A weights_matrix-like matrix object to be validated.

Value

A weights_matrix object.

Helper function for creating what hidden ft/obs/sols message

Description

Helper function for creating what hidden ft/obs/sols message

Usage

not_shown_message(
  hidden_solutions = NULL,
  hidden_observations = NULL,
  hidden_features = NULL
)

Arguments

hidden_solutions

Number of hidden solutions.

hidden_observations

Number of hidden observations.

hidden_features

Number of hidden features.

Value

If all arguments are NULL or 0, returns NULL. Otherwise, output a neatly formatted string indicating how many observations, features, and/or observations were not shown.

Convert columns of a data frame to numeric type (if possible)

Description

Converts all columns in a data frame that can be converted to numeric type to numeric type.

Usage

numcol_to_numeric(df)

Arguments

df

A data frame.

Value

The data frame coercible columns converted to type numeric.

Ordinal regression p-value

Description

Returns the overall p-value of an ordinal regression on a categorical predictor and response vectors.

Usage

ord_reg_pval(predictor, response)

Arguments

predictor

A categorical or numeric feature.

response

A numeric feature.

Value

pval A p-value (class "numeric").

Parallel processing form of batch_snf

Description

Parallel processing form of batch_snf

Usage

parallel_batch_snf(
  dl,
  dfl,
  cfl,
  sdf,
  wm,
  similarity_matrix_dir,
  return_sim_mats,
  processes
)

Arguments

dl

A data list.

dfl

cfl

List of custom clustering algorithms to apply to the final fused network. See ?clust_fns_list.

sdf

matrix indicating parameters to iterate SNF through.

wm

A matrix containing feature weights to use during distance matrix calculation. See ?weights_matrix for details on how to build this.

similarity_matrix_dir

If specified, this directory will be used to save all generated similarity matrices.

return_sim_mats

If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE.

processes

Number of parallel processes used when executing SNF.

Value

The same values as ?batch_snf().

Helper function to pick columns from a data frame

Description

Helper function to pick columns from a data frame

Usage

pick_cols(x, cols)

Arguments

x

A data frame

cols

Vector of column names to be picked

Value

x with only columns in cols

Helper function to pluralize a string

Description

Helper function to pluralize a string

Usage

pl(x)

Arguments

x

A vector of length 1 or greater.

Value

A string "s" if the length of x is greater than 1, otherwise an empty string.

Heatmap of pairwise adjusted rand indices between solutions

Description

Heatmap of pairwise adjusted rand indices between solutions

Usage

## S3 method for class 'ari_matrix'
plot(
  x,
  order = NULL,
  cluster_rows = FALSE,
  cluster_columns = FALSE,
  log_graph = FALSE,
  scale_diag = "none",
  min_colour = "#282828",
  max_colour = "firebrick2",
  col = circlize::colorRamp2(c(min(x), max(x)), c(min_colour, max_colour)),
  ...
)

meta_cluster_heatmap(
  x,
  order = NULL,
  cluster_rows = FALSE,
  cluster_columns = FALSE,
  log_graph = FALSE,
  scale_diag = "none",
  min_colour = "#282828",
  max_colour = "firebrick2",
  col = circlize::colorRamp2(c(min(x), max(x)), c(min_colour, max_colour)),
  ...
)

Arguments

x

Matrix of adjusted rand indices from calc_aris()

order

Numeric vector containing row order of the heatmap.

cluster_rows

Whether rows should be clustered.

cluster_columns

Whether columns should be clustered.

log_graph

If TRUE, log transforms the graph.

scale_diag

Method of rescaling matrix diagonals. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0).

min_colour

Colour used for the lowest value in the heatmap.

max_colour

Colour used for the highest value in the heatmap.

col

Colour ramp to use for the heatmap.

...

Additional parameters passed to similarity_matrix_heatmap(), the function that this function wraps.

Value

Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the pairwise adjusted Rand indices (similarities) between the cluster solutions of the provided solutions data frame.

Examples


    dl <- data_list(
        list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
        list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
        list(income, "household_income", "demographics", "continuous"),
        list(pubertal, "pubertal_status", "demographics", "continuous"),
        uid = "unique_id"
    )
    
    set.seed(42)
    my_sc <- snf_config(
        dl = dl,
        n_solutions = 20,
        min_k = 20,
        max_k = 50
    )
    
    sol_df <- batch_snf(dl, my_sc)
    
    sol_df
    
    sol_aris <- calc_aris(sol_df)
    
    meta_cluster_order <- get_matrix_order(sol_aris)
    
    # `split_vec` found by iteratively plotting ari_hm or by ?shiny_annotator()
    split_vec <- c(6, 10, 16)
    ari_hm <- plot(
        sol_aris,
        order = meta_cluster_order,
        split_vector = split_vec
    )

Plot of feature values in a data list

Description

This plot, built on ComplexHeatmap::Heatmap(), visualizes the feature values in a data list as a continuous heatmap with observations along the columns and features along the rows.

Usage

## S3 method for class 'data_list'
plot(
  x,
  y = NULL,
  cluster_rows = TRUE,
  cluster_columns = TRUE,
  heatmap_legend_param = NULL,
  row_title = "Observation",
  column_title = "Feature",
  show_row_names = FALSE,
  ...
)

Arguments

x

A data_list object.

y

Optional argument to plot, not used in this method.

cluster_rows

Logical indicating whether to cluster the rows (observations).

cluster_columns

Logical indicating whether to cluster the columns (features).

heatmap_legend_param

A list of parameters for the heatmap legend.

row_title

Title for the rows (observations).

column_title

Title for the columns (features).

show_row_names

Logical indicating whether to show row names.

...

Additional arguments passed to ComplexHeatmap::Heatmap().

Value

A heatmap visualization of feature values.

Plot of cluster assignments in an extended solutions data frame

Description

This plot, built on ComplexHeatmap::Heatmap(), visualizes the cluster assignments in a solutions data frame as a categorical heatmap with observations along the columns and clusters along the rows.

Usage

## S3 method for class 'ext_solutions_df'
plot(
  x,
  y = NULL,
  cluster_rows = TRUE,
  cluster_columns = TRUE,
  show_row_names = TRUE,
  show_column_names = TRUE,
  heatmap_legend_param = NULL,
  row_title = "Solution",
  column_title = "Observation",
  ...
)

## S3 method for class 't_ext_solutions_df'
plot(x, ...)

Arguments

x

An ext_solutions_df object.

y

Optional argument to plot, not used in this method.

cluster_rows

If the value is a logical, it controls whether to make cluster on rows. The value can also be a hclust or a dendrogram which already contains clustering. Check https://jokergoo.github.io/ComplexHeatmap-reference/book/a-single-heatmap.html#clustering .

cluster_columns

Whether make cluster on columns? Same settings as cluster_rows.

show_row_names

Whether show row names.

show_column_names

Whether show column names.

heatmap_legend_param

A list contains parameters for the heatmap legends. See color_mapping_legend,ColorMapping-method for all available parameters.

row_title

Title on the row.

column_title

Title on the column.

...

Additional arguments passed to ComplexHeatmap::Heatmap().

Value

A ComplexHeatmap::Heatmap() object visualization of cluster assignments.

Heatmap for visualizing an SNF config

Description

Create a heatmap where each row corresponds to a different set of hyperparameters in an SNF config object. Numeric parameters are scaled normalized and non-numeric parameters are added as heatmap annotations. Rows can be reordered to match prior meta clustering results.

Usage

## S3 method for class 'snf_config'
plot(
  x,
  order = NULL,
  hide_fixed = FALSE,
  show_column_names = TRUE,
  show_row_names = TRUE,
  rect_gp = grid::gpar(col = "black"),
  colour_breaks = c(0, 1),
  colours = c("black", "darkseagreen"),
  column_split_vector = NULL,
  row_split_vector = NULL,
  column_split = NULL,
  row_split = NULL,
  column_title = NULL,
  include_weights = TRUE,
  include_settings = TRUE,
  ...
)

config_heatmap(
  x,
  order = NULL,
  hide_fixed = FALSE,
  show_column_names = TRUE,
  show_row_names = TRUE,
  rect_gp = grid::gpar(col = "black"),
  colour_breaks = c(0, 1),
  colours = c("black", "darkseagreen"),
  column_split_vector = NULL,
  row_split_vector = NULL,
  column_split = NULL,
  row_split = NULL,
  column_title = NULL,
  include_weights = TRUE,
  include_settings = TRUE,
  ...
)

## S3 method for class 'settings_df'
plot(
  x,
  order = NULL,
  hide_fixed = FALSE,
  show_column_names = TRUE,
  show_row_names = TRUE,
  rect_gp = grid::gpar(col = "black"),
  colour_breaks = c(0, 1),
  colours = c("black", "darkseagreen"),
  column_split_vector = NULL,
  row_split_vector = NULL,
  column_split = NULL,
  row_split = NULL,
  column_title = NULL,
  include_weights = TRUE,
  include_settings = TRUE,
  ...
)

## S3 method for class 'weights_matrix'
plot(
  x,
  order = NULL,
  hide_fixed = FALSE,
  show_column_names = TRUE,
  show_row_names = TRUE,
  rect_gp = grid::gpar(col = "black"),
  colour_breaks = c(0, 1),
  colours = c("black", "darkseagreen"),
  column_split_vector = NULL,
  row_split_vector = NULL,
  column_split = NULL,
  row_split = NULL,
  column_title = NULL,
  include_weights = TRUE,
  include_settings = TRUE,
  ...
)

Arguments

x

An snf_config class object.

order

Numeric vector indicating row ordering of SNF config.

hide_fixed

Whether fixed parameters should be removed.

show_column_names

Whether show column names.

show_row_names

Whether show row names.

rect_gp

Graphic parameters for drawing rectangles (for heatmap body). The value should be specified by gpar and fill parameter is ignored.

colour_breaks

Numeric vector of breaks for the legend.

colours

Vector of colours to use for the heatmap. Should match the length of colour_breaks.

column_split_vector

Vector of indices to split columns by.

row_split_vector

Vector of indices to split rows by.

column_split

Split on columns. For heatmap splitting, please refer to https://jokergoo.github.io/ComplexHeatmap-reference/book/a-single-heatmap.html#heatmap-split .

row_split

Same as split.

column_title

Title on the column.

include_weights

If TRUE, includes feature weights of the weights matrix into the config heatmap.

include_settings

If TRUE, includes columns from the settings data frame into the config heatmap.

...

Additional parameters passed to ComplexHeatmap::Heatmap.

Value

Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the scaled values of the provided SNF config.

Examples

dl <- data_list(
    list(income, "household_income", "demographics", "ordinal"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(fav_colour, "favourite_colour", "demographics", "categorical"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

sc <- snf_config(
    dl,
    n_solutions = 10,
    dropout_dist = "uniform"
)

plot(sc)

Plot of cluster assignments in a solutions data frame

Description

This plot, built on ComplexHeatmap::Heatmap(), visualizes the cluster assignments in a solutions data frame as a categorical heatmap with observations along the columns and clusters along the rows.

Usage

## S3 method for class 'solutions_df'
plot(
  x,
  y = NULL,
  cluster_rows = FALSE,
  cluster_columns = TRUE,
  heatmap_legend_param = NULL,
  row_title = "Solution",
  column_title = "Observation",
  ...
)

## S3 method for class 't_solutions_df'
plot(x, ...)

Arguments

x

A solutions_df object.

y

Optional argument to plot, not used in this method.

cluster_rows

cluster_columns

Whether make cluster on columns? Same settings as cluster_rows.

heatmap_legend_param

A list contains parameters for the heatmap legends. See color_mapping_legend,ColorMapping-method for all available parameters.

row_title

Title on the row.

column_title

Title on the column.

...

Additional arguments passed to ComplexHeatmap::Heatmap().

Value

A ComplexHeatmap::Heatmap() object visualization of cluster assignments.

Add "uid_" prefix to all UID values in uid column

Description

Add "uid_" prefix to all UID values in uid column

Usage

prefix_dll_uid(dll)

Arguments

dll

A data list-like list class object.

Value

A data list with UIDs prefixed with the string "uid_".

Print method for class `ari_matrix`

Description

Custom formatted print for weights matrices that outputs information about feature weights functions to the console.

Usage

## S3 method for class 'ari_matrix'
print(x, ...)

Arguments

x

A ari_matrix class object.

...

Other arguments passed to print (not used in this function)

Value

Function prints to console but does not return any value.

Print method for class `clust_fns_list`

Description

Custom formatted print for clustering functions list objects that outputs information about the contained clustering functions to the console.

Usage

## S3 method for class 'clust_fns_list'
print(x, ...)

Arguments

x

A clust_fns_list class object.

...

Other arguments passed to print (not used in this function)

Value

Function prints to console but does not return any value.

Print method for class `data_list`

Description

Custom formatted print for data list objects that outputs information about the contained observations and components to the console.

Usage

## S3 method for class 'data_list'
print(x, ...)

Arguments

x

A data_list class object.

...

Other arguments passed to print (not used in this function)

Value

Function prints to console but does not return any value.

Print method for class `dist_fns_list`

Description

Custom formatted print for distance metrics list objects that outputs information about the contained distance metrics to the console.

Usage

## S3 method for class 'dist_fns_list'
print(x, ...)

Arguments

x

A dist_fns_list class object.

...

Other arguments passed to print (not used in this function)

Value

Function prints to console but does not return any value.

Print method for class `ext_solutions_df`

Description

Custom formatted print for extended solutions data frame class objects.

Usage

## S3 method for class 'ext_solutions_df'
print(x, n = NULL, ...)

Arguments

x

A ext_solutions_df class object.

n

Number of rows to print, passed into tibble::print.tbl_df().

...

Other arguments passed to print (not used in this function).

Value

Function prints to console but does not return any value.

Print method for class `settings_df`

Description

Custom formatted print for settings data frame that outputs information about SNF hyperparameters to the console.

Usage

## S3 method for class 'settings_df'
print(x, ...)

Arguments

x

A settings_df class object.

...

Other arguments passed to print (not used in this function)

Value

Function prints to console but does not return any value.

Print method for class `sim_mats_list`

Description

Custom formatted print for similarity matrix list

Usage

## S3 method for class 'sim_mats_list'
print(x, ...)

Arguments

x

A sim_mats_list class object.

...

Other arguments passed to print (not used in this function).

Print method for class `snf_config`

Description

Custom formatted print for SNF config

Usage

## S3 method for class 'snf_config'
print(x, ...)

Arguments

x

A snf_config class object.

...

Other arguments passed to print (not used in this function)

Value

Function prints to console but does not return any value.

Print method for class `solutions_df`

Description

Custom formatted print for weights matrices that outputs information about feature weights functions to the console.

Usage

## S3 method for class 'solutions_df'
print(x, n = NULL, tips = TRUE, ...)

Arguments

x

A weights_matrix class object.

n

Number of rows to print, passed into tibble::print.tbl_df().

tips

If TRUE, include lines on how to print more rows / transposed.

...

Other arguments passed to print (not used in this function).

Value

Function prints to console but does not return any value.

Print method for class `t_ext_solutions_df`

Description

Custom formatted print for transposed solutions data frame class objects.

Usage

## S3 method for class 't_ext_solutions_df'
print(x, ...)

Arguments

x

A t_solutions_df class object.

...

Other arguments passed to print (not used in this function)

Value

Function prints to console but does not return any value.

Print method for class `t_solutions_df`

Description

Custom formatted print for transposed solutions data frame class objects.

Usage

## S3 method for class 't_solutions_df'
print(x, ...)

Arguments

x

A t_solutions_df class object.

...

Other arguments passed to print (not used in this function)

Value

Function prints to console but does not return any value.

Print method for class `weights_matrix`

Description

Custom formatted print for weights matrices that outputs information about feature weights functions to the console.

Usage

## S3 method for class 'weights_matrix'
print(x, ...)

Arguments

x

A weights_matrix class object.

...

Other arguments passed to print (not used in this function)

Value

Function prints to console but does not return any value.

Helper function for outputting tip on changing rows printed

Description

Helper function for outputting tip on changing rows printed

Usage

print_with_n_message()

Value

Output a message to use print with n to change displayed rows.

Helper function for transposing solutions_df message

Description

Helper function for transposing solutions_df message

Usage

print_with_t_message()

Value

Output a message to use print with n to change displayed rows.

Mock ABCD pubertal status data

Description

Like the mock data frame "abcd_pubertal", but with "unique_id" as the "uid".

Usage

pubertal

Format

`pubertal`

A data frame with 275 rows and 2 columns:

unique_id: The unique identifier of the ABCD dataset
pubertal_status: Average reported pubertal status between child and parent (1-5 categorical scale)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Heatmap of p-values

Description

Heatmap of p-values

Usage

pval_heatmap(
  ext_sol_df,
  order = NULL,
  cluster_columns = TRUE,
  cluster_rows = FALSE,
  show_row_names = FALSE,
  show_column_names = TRUE,
  min_colour = "red2",
  max_colour = "white",
  legend_breaks = c(0, 1),
  col = circlize::colorRamp2(legend_breaks, c(min_colour, max_colour)),
  heatmap_legend_param = list(color_bar = "continuous", title = "p-value", at = c(0, 1)),
  rect_gp = grid::gpar(col = "black"),
  column_split_vector = NULL,
  row_split_vector = NULL,
  column_split = NULL,
  row_split = NULL,
  ...
)

Arguments

ext_sol_df

An ext_solutions_df class object (produced from the function extend_solutions.

order

Numeric vector containing row order of the heatmap.

cluster_columns

Whether columns should be sorted by hierarchical clustering.

cluster_rows

Whether rows should be sorted by hierarchical clustering.

show_row_names

Whether row names should be shown.

show_column_names

Whether column names should be shown.

min_colour

Colour used for the lowest value in the heatmap.

max_colour

Colour used for the highest value in the heatmap.

legend_breaks

Numeric vector of breaks for the legend.

col

Colour function for ComplexHeatmap::Heatmap()

heatmap_legend_param

Legend function for ComplexHeatmap::Heatmap()

rect_gp

Cell border function for ComplexHeatmap::Heatmap()

column_split_vector

Vector of indices to split columns by.

row_split_vector

Vector of indices to split rows by.

column_split

Standard parameter of ComplexHeatmap::Heatmap.

row_split

Standard parameter of ComplexHeatmap::Heatmap.

...

Additional parameters passed to ComplexHeatmap::Heatmap.

Value

Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the provided p-values.

Examples


dl <- data_list(
    list(income, "household_income", "demographics", "ordinal"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(fav_colour, "favourite_colour", "demographics", "categorical"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

sc <- snf_config(
    dl,
    n_solutions = 4,
    dropout_dist = "uniform",
    max_k = 50
)

sol_df <- batch_snf(dl, sc)

ext_sol_df <- extend_solutions(sol_df, dl)

pval_heatmap(ext_sol_df)

Quality metrics

Description

These functions calculate conventional metrics of cluster solution quality.

Usage

calculate_silhouettes(sol_df)

calculate_dunn_indices(sol_df)

calculate_db_indices(sol_df)

Arguments

sol_df

A solutions_df class object created by batch_snf() with the parameter return_sim_mats = TRUE.

Details

calculate_silhouettes: A wrapper for cluster::silhouette that calculates silhouette scores for all cluster solutions in a provided solutions data frame. Silhouette values range from -1 to +1 and indicate an overall ratio of how close together observations within a cluster are to how far apart observations across clusters are. You can learn more about interpreting the results of this function by calling ?cluster::silhouette.

calculate_dunn_indices: A wrapper for clv::clv.Dunn that calculates Dunn indices for all cluster solutions in a provided solutions data frame. Dunn indices, like silhouette scores, similarly reflect similarity within clusters and separation across clusters. You can learn more about interpreting the results of this function by calling ?clv::clv.Dunn.

calculate_db_indices: A wrapper for clv::clv.Davies.Bouldin that calculates Davies-Bouldin indices for all cluster solutions in a provided solutions data frame. These values can be interpreted similarly as those above. You can learn more about interpreting the results of this function by calling ?clv::clv.Davies.Bouldin.

Value

A list of silhouette class objects, a vector of Dunn indices, or a vector of Davies-Bouldin indices depending on which function was used.

Examples

## Not run: 
input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)

sc <- snf_config(input_dl, n_solutions = 5)

sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)

# calculate Davies-Bouldin indices
davies_bouldin_indices <- calculate_db_indices(sol_df)

# calculate Dunn indices
dunn_indices <- calculate_dunn_indices(sol_df)

# calculate silhouette scores
silhouette_scores <- calculate_silhouettes(sol_df)

## End(Not run)

Generate random removal sequence

Description

Helper function to contribute to rows within the settings data frame. Number of columns removed follows a uniform or exponential probability distribution.

Usage

random_removal(
  columns,
  min_removed_inputs,
  max_removed_inputs,
  dropout_dist = "exponential"
)

Arguments

columns

Columns of the settings_df that are passed in

min_removed_inputs

The smallest number of input data frames that may be randomly removed.

max_removed_inputs

The largest number of input data frames that may be randomly removed.

dropout_dist

Indication of how input data frames should be dropped. can be "none" (no dropout), "uniform" (uniformly draw number between min and max removed inputs), or "exponential" (like uniform, but using an exponential distribution; default).

Value

Settings data frame row containing inclusion information.

Row-binding of solutions data frame class objects

Description

Row-binding of solutions data frame class objects

Usage

## S3 method for class 'ext_solutions_df'
rbind(..., reset_indices = FALSE)

Arguments

...

An arbitrary number of ext_solutions_df class objects.

reset_indices

If TRUE, re-labels the "solutions" indices in the solutions data frame from 1 to the number of defined settings.

Value

An ext_solutions_df class object.

Row-binding of solutions data frame class objects

Description

Row-binding of solutions data frame class objects

Usage

## S3 method for class 'solutions_df'
rbind(..., reset_indices = FALSE)

Arguments

...

An arbitrary number of solutions_df class objects.

reset_indices

If TRUE, re-labels the "solutions" indices in the solutions data frame from 1 to the number of defined settings.

Value

A solutions_df class object.

Row-binding of t_solutions_df class objects

Description

Vertically stack two or more t_solutions_df class objects.

Usage

## S3 method for class 't_solutions_df'
rbind(...)

Arguments

...

An arbitrary number of t_solutions_df class objects.

Value

A t_solutions_df class object.

Row-bind weights matrices

Description

Vertically stack two or more weights_matrix class objects.

Usage

## S3 method for class 'weights_matrix'
rbind(...)

Arguments

...

An arbitrary number of weights_matrix class objects.

Value

A weights_matrix class object.

Remove observations with incomplete data from a data list-like list object

Description

Helper function during data_list class initialization. First applies stats::na.omit() to the data frames named "data" within a nested list. Then removes any observations that are not present across all data frames.

Usage

remove_dll_incomplete(dll)

Arguments

dll

A data list-like list class object.

Value

The provided dll with missing observations removed.

Rename features in a data list

Description

Rename features in a data list

Usage

rename_dl(dl, name_mapping)

Arguments

dl

A nested list of input data from data_list().

name_mapping

A named vector where the values are the features to be renamed and the names are the new names for those features.

Value

A data list ("list"-class object) with adjusted feature names.

Examples

dl <- data_list(
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

summary(dl, "feature")

name_changes <- c(
    "anxiety_score" = "cbcl_anxiety_r",
    "depression_score" = "cbcl_depress_r"
)

dl <- rename_dl(dl, name_changes)

summary(dl, "feature")

Reorder the uids in a data list

Description

Reorder the uids in a data list

Usage

reorder_dl_uids(dl, ordered_uids)

Arguments

dl

A nested list of input data from data_list().

ordered_uids

A vector of the uid values in the data list in the desired order of the sorted data list.

Value

A data list ("list"-class object) with reordered observations.

Helper resampling function found in ?sample

Description

Like sample, but when given a single value x, returns back that single value instead of a random value from 1 to x.

Usage

resample(x, ...)

Arguments

x

Vector or single value to sample from

...

Remaining arguments for base::sample function

Value

Numeric vector result of running base::sample.

Run SNF

Description

Helper function for running a single SNF config pipeline.

Usage

run_snf(i, dl, sc, return_sim_mats, sim_mats_dir, p)

Arguments

i

Row of settings_df and weights_matrix within SNF config to use.

dl

A nested list of input data from data_list().

sc

return_sim_mats

If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE.

sim_mats_dir

If specified, this directory will be used to save all generated similarity matrices.

Value

A list containing a cluster solution (numeric vector) and a the fused network used to create that cluster solution. The fused network is NULL if return_sim_mats is FALSE.

Save a heatmap object to a file

Description

Save a heatmap object to a file

Usage

save_heatmap(heatmap, path, width = 480, height = 480, res = 100)

Arguments

heatmap

The heatmap object to save.

path

The path to save the heatmap to.

width

The width of the heatmap.

height

The height of the heatmap.

res

The resolution of the heatmap.

Value

Does not return any value. Saves heatmap to file.

Adjust the diagonals of a matrix

Description

Adjust the diagonals of a matrix to reduce contrast with off-diagonals during plotting.

Usage

scale_diagonals(matrix, method = "mean")

Arguments

matrix

Matrix to rescale.

method

Method of rescaling. Can be:

"mean" (replace diagonals with average value of off-diagonals)
"zero" (replace diagonals with 0)
"min" (replace diagonals with min value of off-diagonals)
"max" (replace diagonals with max value of off-diagonals)

Value

A "matrix" class object with rescaled diagonals.

Build a settings data frame

Description

The settings_df is a data frame whose rows completely specify the hyperparameters and decisions required to transform individual input data frames (found in a data list, see ?data_list) into a single similarity matrix through SNF. The format of the settings data frame is as follows:

A column named "solution": This column is used to keep track of the rows and should have integer values only.
A column named "alpha": This column contains the value of the alpha hyperparameter that will be used on that run of the SNF pipeline.
A column named "k": Like above, but for the K (nearest neighbours) hyperparameter.
A column named "t": Like above, but for the t (number of iterations) hyperparameter.
A column named "snf_scheme": Which of 3 pre-defined schemes will be used to integrate the data frames of the data list into a final fused network. The purpose of varying these schemes is primarily to increase the diversity of the generated cluster solutions.
- A value of 1 corresponds to the "individual" scheme, in which all data frames are directly merged by SNF into the final fused network. This scheme corresponds to the approach shown in the original SNF paper.
- A value of 2 corresponds to the "two-step" scheme, in which all data frames within a domain are first merged into a domain-specific fused network. Next, domain-specific networks are fused once more by SNF into the final fused network. This scheme is useful for fairly re-weighting SNF pipelines with unequal numbers of data frames across domains.
- A value of 3 corresponds to the "domain" scheme, in which all data frames within a domain are first concatenated into a single domain- specific data frame before being merged by SNF into the final fused network. This approach serves as an alternative way to re-weight SNF pipelines with unequal numbers of data frames across domains. You can learn more about this parameter here: https://branchlab.github.io/metasnf/articles/snf_schemes.html.
A column named "clust_alg": Specification of which clustering algorithm will be applied to the final similarity matrix. By default, this column can take on the integer values 1 or 2, which correspond to spectral clustering where the number of clusters is determined by the eigen-gap or rotation cost heuristic respectively. You can learn more about this parameter here: https://branchlab.github.io/metasnf/articles/clustering_algorithms.html.
A column named "cnt_dist": Specification of which distance metric will be used for data frames of purely continuous data. You can learn about this metric and its defaults here: https://branchlab.github.io/metasnf/articles/distance_metrics.html
A column named "dsc_dist": Like above, but for discrete data frames.
A column named "ord_dist": Like above, but for ordinal data frames.
A column named "cat_dist": Like above, but for categorical data frames.
A column named "mix_dist": Like above, but for mixed-type (e.g., both categorical and discrete) data frames.
One column for every input data frame in the corresponding data list which can either have the value of 0 or 1. The name of the column should be formatted as "inc_[]" where the square brackets are replaced with the name (as found in dl_summary(dl)$"name") of each data frame. When 0, that data frame will be excluded from that run of the SNF pipeline. When 1, that data frame will be included.

Usage

settings_df(
  dl,
  n_solutions = 0,
  min_removed_inputs = 0,
  max_removed_inputs = length(dl) - 1,
  dropout_dist = "exponential",
  min_alpha = NULL,
  max_alpha = NULL,
  min_k = NULL,
  max_k = NULL,
  min_t = NULL,
  max_t = NULL,
  alpha_values = NULL,
  k_values = NULL,
  t_values = NULL,
  possible_snf_schemes = c(1, 2, 3),
  clustering_algorithms = NULL,
  continuous_distances = NULL,
  discrete_distances = NULL,
  ordinal_distances = NULL,
  categorical_distances = NULL,
  mixed_distances = NULL,
  dfl = NULL,
  snf_input_weights = NULL,
  snf_domain_weights = NULL,
  retry_limit = 10,
  allow_duplicates = FALSE
)

Arguments

dl

A nested list of input data from data_list().

n_solutions

Number of rows to generate for the settings data frame.

min_removed_inputs

The smallest number of input data frames that may be randomly removed. By default, 0.

max_removed_inputs

The largest number of input data frames that may be randomly removed. By default, this is 1 less than all the provided input data frames in the data list.

dropout_dist

min_alpha

max_alpha

The maximum value that the alpha hyperparameter can have. See min_alpha parameter. Cannot be used in conjunction with the alpha_values parameter.

min_k

max_k

The maximum value that the k hyperparameter can have. See min_k parameter. Cannot be used in conjunction with the k_values parameter.

min_t

max_t

The maximum value that the t hyperparameter can have. See min_t parameter. Cannot be used in conjunction with the t_values parameter.

alpha_values

k_values

t_values

possible_snf_schemes

clustering_algorithms

continuous_distances

A vector of continuous distance metrics to use when a custom dist_fns_list is provided.

discrete_distances

A vector of categorical distance metrics to use when a custom dist_fns_list is provided.

ordinal_distances

A vector of categorical distance metrics to use when a custom dist_fns_list is provided.

categorical_distances

A vector of categorical distance metrics to use when a custom dist_fns_list is provided.

mixed_distances

A vector of mixed distance metrics to use when a custom dist_fns_list is provided.

dfl

List containing distance metrics to vary over. See ?generate_dist_fns_list.

snf_input_weights

Nested list containing weights for when SNF is used to merge individual input measures (see ?generate_snf_weights)

snf_domain_weights

Nested list containing weights for when SNF is used to merge domains (see ?generate_snf_weights)

retry_limit

allow_duplicates

Value

A settings data frame

Launch a shiny app to identify meta cluster boundaries

Description

This function calls the htShiny() function from the package InteractiveComplexHeatmap to assist users in identifying the indices of the boundaries between meta clusters in a meta cluster heatmap. By providing a heatmap of inter-solution similarities (obtained through meta_cluster_heatmap()), users can click on positions within the heatmap that appear to meaningfully separate major sets of similar cluster solutions by visual inspection. The corresponding indices of the clicked positions are printed to the console and also shown within the app. This function can only run from an interactive session of R.

Usage

shiny_annotator(ari_heatmap)

Arguments

ari_heatmap

Heatmap of ARIs to divide into meta clusters.

Value

Does not return any value. Launches interactive shiny applet.

Examples


dl <- data_list(
    list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

set.seed(42)
my_sc <- snf_config(
    dl = dl,
    n_solutions = 20,
    min_k = 20,
    max_k = 50
)

sol_df <- batch_snf(dl, my_sc)

sol_aris <- calc_aris(sol_df)

meta_cluster_order <- get_matrix_order(sol_aris)

ari_hm <- meta_cluster_heatmap(sol_aris, order = meta_cluster_order)

# Click on meta cluster boundaries to obtain `split_vec` values
shiny_annotator(ari_hm)

split_vec <- c(6, 10, 16)

ari_hm <- meta_cluster_heatmap(
    sol_aris,
    order = meta_cluster_order,
    split_vector = split_vec
)

Create or extract a `sim_mats_list` class object

Description

Create or extract a sim_mats_list class object

Usage

sim_mats_list(x)

Arguments

x

The object to create or extract a sim_mats_list from.

Value

A sim_mats_list class object.

Plot heatmap of similarity matrix

Description

Plot heatmap of similarity matrix

Usage

similarity_matrix_heatmap(
  similarity_matrix,
  order = NULL,
  cluster_solution = NULL,
  scale_diag = "mean",
  log_graph = TRUE,
  cluster_rows = FALSE,
  cluster_columns = FALSE,
  show_row_names = FALSE,
  show_column_names = FALSE,
  data = NULL,
  left_bar = NULL,
  right_bar = NULL,
  top_bar = NULL,
  bottom_bar = NULL,
  left_hm = NULL,
  right_hm = NULL,
  top_hm = NULL,
  bottom_hm = NULL,
  annotation_colours = NULL,
  min_colour = NULL,
  max_colour = NULL,
  split_vector = NULL,
  row_split = NULL,
  column_split = NULL,
  ...
)

Arguments

similarity_matrix

A similarity matrix

order

Vector of numbers to reorder the similarity matrix (and data if provided). Overwrites ordering specified by cluster_solution param.

cluster_solution

Row of a solutions data frame or column of a transposed solutions data frame.

scale_diag

Method of rescaling matrix diagonals. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0).

log_graph

If TRUE, log transforms the graph.

cluster_rows

Parameter for ComplexHeatmap::Heatmap.

cluster_columns

Parameter for ComplexHeatmap::Heatmap.

show_row_names

Parameter for ComplexHeatmap::Heatmap.

show_column_names

Parameter for ComplexHeatmap::Heatmap.

data

A data frame containing elements requested for annotation.

left_bar

right_bar

See left_bar.

top_bar

See left_bar.

bottom_bar

See left_bar.

left_hm

Like left_bar, but with a heatmap annotation instead of a barplot annotation.

right_hm

See left_hm.

top_hm

See left_hm.

bottom_hm

See left_hm.

annotation_colours

Named list of heatmap annotations and their colours.

min_colour

Colour used for the lowest value in the heatmap.

max_colour

Colour used for the highest value in the heatmap.

split_vector

A vector of partition indices.

row_split

Standard parameter of ComplexHeatmap::Heatmap.

column_split

Standard parameter of ComplexHeatmap::Heatmap.

...

Additional parameters passed into ComplexHeatmap::Heatmap.

Value

Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the similarities between observations in the provided matrix.

Examples


my_dl <- data_list(
    list(
        data = expression_df,
        name = "expression_data",
        domain = "gene_expression",
        type = "continuous"
    ),
    list(
        data = methylation_df,
        name = "methylation_data",
        domain = "gene_methylation",
        type = "continuous"
    ),
    uid = "patient_id"
)

sc <- snf_config(my_dl, n_solutions = 10)

sol_df <- batch_snf(my_dl, sc, return_sim_mats = TRUE)

sim_mats <- sim_mats_list(sol_df)

similarity_matrix_heatmap(
    sim_mats[[1]],
    cluster_solution = sol_df[1, ]
)

Generate a complete path and filename to store an similarity matrix

Description

Generate a complete path and filename to store an similarity matrix

Usage

similarity_matrix_path(similarity_matrix_dir, i)

Arguments

similarity_matrix_dir

Directory to store similarity matrices.

i

Corresponding solution.

Value

Complete path and filename to store an similarity matrix.

Squared (including weights) Euclidean distance

Description

Squared (including weights) Euclidean distance

Usage

siw_euclidean_distance(df, weights_row)

Arguments

df

data frame containing at least 1 data column.

weights_row

Single-row data frame where the column names contain the column names in df and the row contains the corresponding weights.

Value

distance_matrix A distance matrix.

Define configuration for generating a set of SNF-based cluster solutions

Description

snf_config() constructs an SNF config object which inherits from classes snf_config and list. This object is used to store all settings required to transform data stored in a data_list class object into a space of cluster solutions by SNF. The SNF config object contains the following components: 1. A settings data frame (inherits from settings_df and data.frame). Data frame that stores SNF-specific hyperparameters and information about feature selection and weighting, SNF schemes, clustering algorithms, and distance metrics. Each row of the settings data frame corresponds to a distinct cluster solution. 2. A clustering algorithms list (inherits from clust_fns_list and list), which stores all clustering algorithms that the settings data frame can point to. 3. A distance metrics list (inherits from dist_metrics_list and list), which stores all distance metrics that the settings data frame can point to. 4. A weights matrix (inherits from weights_matrix, matrix, and array'), which stores the feature weights to use prior to distance calculations. Each column of the weights matrix corresponds to a different feature in the data list and each row corresponds to a different row in the settings data frame.

Usage

snf_config(
  dl = NULL,
  sdf = NULL,
  dfl = NULL,
  cfl = NULL,
  wm = NULL,
  n_solutions = 0,
  min_removed_inputs = 0,
  max_removed_inputs = length(dl) - 1,
  dropout_dist = "exponential",
  min_alpha = NULL,
  max_alpha = NULL,
  min_k = NULL,
  max_k = NULL,
  min_t = NULL,
  max_t = NULL,
  alpha_values = NULL,
  k_values = NULL,
  t_values = NULL,
  possible_snf_schemes = c(1, 2, 3),
  clustering_algorithms = NULL,
  continuous_distances = NULL,
  discrete_distances = NULL,
  ordinal_distances = NULL,
  categorical_distances = NULL,
  mixed_distances = NULL,
  snf_input_weights = NULL,
  snf_domain_weights = NULL,
  retry_limit = 10,
  cnt_dist_fns = NULL,
  dsc_dist_fns = NULL,
  ord_dist_fns = NULL,
  cat_dist_fns = NULL,
  mix_dist_fns = NULL,
  automatic_standard_normalize = FALSE,
  use_default_dist_fns = FALSE,
  clust_fns = NULL,
  use_default_clust_fns = FALSE,
  weights_fill = "ones"
)

Arguments

dl

A nested list of input data from data_list().

sdf

A settings_df class object. Overrides settings data frame related parameters.

dfl

A dist_fns_list class object. Overrides distance functions list related parameters.

cfl

A clust_fns_list class object. Overrides clustering functions list related parameters.

wm

A weights_matrix class object. Overrides weights matrix related parameters.

n_solutions

Number of rows to generate for the settings data frame.

min_removed_inputs

The smallest number of input data frames that may be randomly removed. By default, 0.

max_removed_inputs

The largest number of input data frames that may be randomly removed. By default, this is 1 less than all the provided input data frames in the data list.

dropout_dist

min_alpha

max_alpha

The maximum value that the alpha hyperparameter can have. See min_alpha parameter. Cannot be used in conjunction with the alpha_values parameter.

min_k

max_k

The maximum value that the k hyperparameter can have. See min_k parameter. Cannot be used in conjunction with the k_values parameter.

min_t

max_t

The maximum value that the t hyperparameter can have. See min_t parameter. Cannot be used in conjunction with the t_values parameter.

alpha_values

k_values

t_values

possible_snf_schemes

clustering_algorithms

continuous_distances

A vector of continuous distance metrics to use when a custom dist_fns_list is provided.

discrete_distances

A vector of categorical distance metrics to use when a custom dist_fns_list is provided.

ordinal_distances

A vector of categorical distance metrics to use when a custom dist_fns_list is provided.

categorical_distances

A vector of categorical distance metrics to use when a custom dist_fns_list is provided.

mixed_distances

A vector of mixed distance metrics to use when a custom dist_fns_list is provided.

snf_input_weights

Nested list containing weights for when SNF is used to merge individual input measures (see ?generate_snf_weights)

snf_domain_weights

Nested list containing weights for when SNF is used to merge domains (see ?generate_snf_weights)

retry_limit

cnt_dist_fns

A named list of continuous distance metric functions.

dsc_dist_fns

A named list of discrete distance metric functions.

ord_dist_fns

A named list of ordinal distance metric functions.

cat_dist_fns

A named list of categorical distance metric functions.

mix_dist_fns

A named list of mixed distance metric functions.

automatic_standard_normalize

If TRUE, will automatically use standard normalization prior to calculation of any numeric distances. This parameter overrides all other distance functions list-related parameters.

use_default_dist_fns

If TRUE, prepend the base distance metrics (euclidean distance for continuous, discrete, and ordinal data and gower distance for categorical and mixed data) to the resulting distance metrics list.

clust_fns

A list of named clustering functions

use_default_clust_fns

weights_fill

String indicating what to populate generate rows with. Can be "ones" (default; fill matrix with 1), "uniform" (fill matrix with uniformly distributed random values), or "exponential" (fill matrix with exponentially distributed random values).

Value

An snf_config class object.

Examples

# Simple random config for 5 cluster solutions
input_dl <- data_list(
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)
my_sc <- snf_config(
    dl = input_dl,
    n_solutions = 5
)

# specifying possible K range
my_sc <- snf_config(
    dl = input_dl,
    n_solutions = 5,
    min_k = 20,
    max_k = 40
)

# Random feature weights across from uniform distribution
my_sc <- snf_config(
    dl = input_dl,
    n_solutions = 5,
    min_k = 20,
    max_k = 40,
    weights_fill = "uniform"
)

# Specifying custom pre-built clustering and distance functions
# - Random alternation between 2-cluster and 5-cluster solutions
# - When continuous or discrete data frames are being processed,
#   randomly alternate between standardized/normalized Euclidean
#   distance and regular Euclidean distance
my_sc <- snf_config(
    dl = input_dl,
    n_solutions = 5,
    min_k = 20,
    max_k = 40,
    weights_fill = "uniform",
    clust_fns = list(
        "two_cluster_spectral" = spectral_two,
        "five_cluster_spectral" = spectral_five
    ),
    cnt_dist_fns = list(
         "euclidean" = euclidean_distance,
         "std_nrm_euc" = sn_euclidean_distance
    ),
    dsc_dist_fns = list(
         "euclidean" = euclidean_distance,
         "std_nrm_euc" = sn_euclidean_distance
    )
)

SNF schemes

Description

These functions manage the way in which input data frames are passed into SNF to yield a final fused network.

Usage

two_step_merge(
  dl,
  k = 20,
  alpha = 0.5,
  t = 20,
  cnt_dist_fn,
  dsc_dist_fn,
  ord_dist_fn,
  cat_dist_fn,
  mix_dist_fn,
  weights_row
)

domain_merge(
  dl,
  cnt_dist_fn,
  dsc_dist_fn,
  ord_dist_fn,
  cat_dist_fn,
  mix_dist_fn,
  weights_row,
  k,
  alpha,
  t
)

individual(
  dl,
  k = 20,
  alpha = 0.5,
  t = 20,
  cnt_dist_fn,
  dsc_dist_fn,
  ord_dist_fn,
  cat_dist_fn,
  mix_dist_fn,
  weights_row
)

Arguments

dl

A nested list of input data from data_list().

k

k hyperparameter.

alpha

alpha/eta/sigma hyperparameter.

t

SNF number of iterations hyperparameter.

cnt_dist_fn

distance metric function for continuous data.

dsc_dist_fn

distance metric function for discrete data.

ord_dist_fn

distance metric function for ordinal data.

cat_dist_fn

distance metric function for categorical data.

mix_dist_fn

distance metric function for mixed data.

weights_row

data frame row containing feature weights.

Details

individual: The "vanilla" scheme - does distance matrix conversions of each input data frame separately before a single call to SNF fuses them into the final fused network.

domain_merge: Given a data list, returns a new data list where all data objects of a particular domain have been concatenated.

two_step_merge: Individual data frames into individual similarity matrices into one fused network per domain into one final fused network.

Helper function for using the correct SNF scheme

Description

Helper function for using the correct SNF scheme

Usage

snf_step(
  dl,
  scheme,
  k = 20,
  alpha = 0.5,
  t = 20,
  cnt_dist_fn,
  dsc_dist_fn,
  ord_dist_fn,
  cat_dist_fn,
  mix_dist_fn,
  weights_row
)

Arguments

dl

A nested list of input data from data_list().

scheme

Which SNF system to use to achieve the final fused network.

k

k hyperparameter.

alpha

alpha/eta/sigma hyperparameter.

t

SNF number of iterations hyperparameter.

cnt_dist_fn

distance metric function for continuous data.

dsc_dist_fn

distance metric function for discrete data.

ord_dist_fn

distance metric function for ordinal data.

cat_dist_fn

distance metric function for categorical data.

mix_dist_fn

distance metric function for mixed data.

weights_row

data frame row containing feature weights.

Value

A fused similarity network (matrix).

Helper function for organizing solutions df-like column order

Description

Reorders columns of a solutions data frame to "solution", "nclust", "mc", then all other column names.

Usage

sol_df_col_order(x)

Arguments

x

Object with columns "solution", "nclust", and "mc".

Value

x with column names reordered.

Constructor for `solutions_df` class object

Description

Constructor for solutions_df class object

Usage

solutions_df(sol_dfl, smll, sc, dl)

Arguments

sol_dfl

A solutions data frame-like object to be validated and converted into a solutions data frame.

smll

A similarity matrix list-like object to be validated and used to construct a solutions data frame.

sc

An snf_config object used to construct a solutions data frame.

dl

An data_list object used to construct a solutions data frame.

Value

A solutions data frame (solutions_df class object).

Helper function to determine which row and columns to split on

Description

Helper function to determine which row and columns to split on

Usage

split_parser(
  row_split_vector = NULL,
  column_split_vector = NULL,
  row_split = NULL,
  column_split = NULL,
  n_rows,
  n_columns
)

Arguments

row_split_vector

A vector of row indices to split on.

column_split_vector

A vector of column indices to split on.

row_split

Standard parameter of ComplexHeatmap::Heatmap.

column_split

Standard parameter of ComplexHeatmap::Heatmap.

n_rows

The number of rows in the data.

n_columns

The number of columns in the data.

Value

"list"-class object containing row_split and column_split character vectors to pass into ComplexHeatmap::Heatmap.

Structure of a `ari_matrix` object

Description

Structure of a ari_matrix object

Usage

## S3 method for class 'ari_matrix'
str(object, ...)

Arguments

object

A ari_matrix class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `clust_fns_list` object

Description

Structure of a clust_fns_list object

Usage

## S3 method for class 'clust_fns_list'
str(object, ...)

Arguments

object

A clust_fns_list class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `data_list` object

Description

Structure of a data_list object

Usage

## S3 method for class 'data_list'
str(object, ...)

Arguments

object

A data_list class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `dist_fns_list` object

Description

Structure of a dist_fns_list object

Usage

## S3 method for class 'dist_fns_list'
str(object, ...)

Arguments

object

A dist_fns_list class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `ext_solutions_df` object

Description

Structure of a ext_solutions_df object

Usage

## S3 method for class 'ext_solutions_df'
str(object, ...)

Arguments

object

A ext_solutions_df class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `settings_df` object

Description

Structure of a settings_df object

Usage

## S3 method for class 'settings_df'
str(object, ...)

Arguments

object

A settings_df class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `sim_mats_list` object

Description

Structure of a sim_mats_list object

Usage

## S3 method for class 'sim_mats_list'
str(object, ...)

Arguments

object

A sim_mats_list class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `snf_config` object

Description

Structure of a snf_config object

Usage

## S3 method for class 'snf_config'
str(object, ...)

Arguments

object

A snf_config class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `solutions_df` object

Description

Structure of a solutions_df object

Usage

## S3 method for class 'solutions_df'
str(object, ...)

Arguments

object

A solutions_df class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `t_ext_solutions_df` object

Description

Structure of a t_ext_solutions_df object

Usage

## S3 method for class 't_ext_solutions_df'
str(object, ...)

Arguments

object

A t_ext_solutions_df class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `t_solutions_df` object

Description

Structure of a t_solutions_df object

Usage

## S3 method for class 't_solutions_df'
str(object, ...)

Arguments

object

A t_solutions_df class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Structure of a `weights_matrix` object

Description

Structure of a weights_matrix object

Usage

## S3 method for class 'weights_matrix'
str(object, ...)

Arguments

object

A weights_matrix class object.

...

Additional arguments (not used).

Value

Does not return an object; outputs object structure to console.

Mock ABCD subcortical volumes data

Description

Like the mock data frame "abcd_subc_v", but with "unique_id" as the "uid".

Usage

subc_v

Format

`subc_v`

A data frame with 174 rows and 31 columns:

unique_id: The unique identifier of the ABCD dataset
...: Subcortical volumes of various ROIs (mm^3, I think)

Source

Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:

Create subsamples of a data list

Description

Given a data list, return a list of smaller data lists that are generated through random sampling (without replacement). The results of this function can be passed into batch_snf_subsamples() to obtain a list of resampled solutions data frames.

Usage

subsample_dl(
  dl,
  n_subsamples,
  subsample_fraction = NULL,
  n_observations = NULL
)

Arguments

dl

A nested list of input data from data_list().

n_subsamples

Number of subsamples to create.

subsample_fraction

Percentage of patients to include per subsample.

n_observations

Number of patients to include per subsample.

Value

A "list" class object containing n_subsamples number of data lists. Each of those data lists contains a random subsample_fraction fraction of the observations of the provided data list.

Examples

my_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

my_dl_subsamples <- subsample_dl(
    my_dl,
    n_subsamples = 20,
    subsample_fraction = 0.85
)

Calculate pairwise adjusted Rand indices across subsamples of data

Description

Given a list of subsampled solutions data frames from 'batch_snf_subsamples(), this function calculates the adjusted Rand indices across all the subsamples of each solution. ARI calculation between two subsamples only factors in observations that were present in both subsamples.

Usage

subsample_pairwise_aris(subsample_solutions, verbose = FALSE)

Arguments

subsample_solutions

A list of solutions data frames from subsamples of the data. This object is generated by the function batch_snf_subsamples().

verbose

If TRUE, output progress to console.

Value

A two-item list: "raw_aris", a list of inter-subsample pairwise ARI matrices (one for each full cluster solution) and "ari_summary", a data frame containing the mean and SD of the inter-subsample ARIs for each original cluster solution.

Examples


my_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)

my_dl_subsamples <- subsample_dl(
    my_dl,
    n_subsamples = 20,
    subsample_fraction = 0.85
)

batch_subsample_results <- batch_snf_subsamples(
    my_dl_subsamples,
    sc
)

pairwise_aris <- subsample_pairwise_aris(
    batch_subsample_results,
    verbose = TRUE
)

# Visualize ARIs 
ComplexHeatmap::Heatmap(
    pairwise_aris$"raw_aris"[[1]],
    heatmap_legend_param = list(
        color_bar = "continuous",
        title = "Inter-Subsample\nARI",
        at = c(0, 0.5, 1)
    ),
    show_column_names = FALSE,
    show_row_names = FALSE
)

Summarize a clust_fns_list object

Description

Summarize a clust_fns_list object

Usage

summarize_clust_fns_list(cfl)

Arguments

cfl

A clust_fns_list class object.

Value

summary_df "data.frame" class object containing the name and index of each clustering algorithm in the provided clust_fns_list.

Summarize a distance functions list

Description

Summarize a distance functions list

Usage

summarize_dfl(dfl)

Arguments

dfl

A dist_fns_list.

Value

"data.frame"-class object summarizing items in a distance metrics list.

Summarize a data list

Description

Defunct function for summarizing a data list. Please use summary() instead.

Usage

summarize_dl(data_list, scope = "component")

Arguments

data_list

A nested list of input data from data_list().

scope

The level of detail for the summary. Options are:

"component" (default): One row per component (data frame) in the data list.
"feature": One row for each feature in the data list.

Value

data.frame class object summarizing all components (or features if scope == "component").

Summarize p-value columns of an extended solutions data frame

Description

Summarize p-value columns of an extended solutions data frame

Usage

summarize_pvals(ext_sol_df)

Arguments

ext_sol_df

Result of extend_solutions

Value

The provided extended solutions data frame along with columns for the min, mean, and maximum across p-values for each row.

Summary method for class `ari_matrix`

Description

Provides a summary of the ari_matrix class object, including the distribution of the adjusted Rand index (ARI) values and the number of solutions.

Usage

## S3 method for class 'ari_matrix'
summary(object, ...)

Arguments

object

A ari_matrix class object.

...

Other arguments passed to summary (not used in this function).

Value

A named list containing the number of solutions and the distribution of ARI values.

Summary method for class `clust_fns_list`

Description

This summary function simply returns to the console the number of functions contained in the clust_fns_list object.

Usage

## S3 method for class 'clust_fns_list'
summary(object, ...)

Arguments

object

A clust_fns_list class object.

...

Other arguments passed to summary (not used in this function).

Value

Returns no value. Outputs a message to the console.

Summary method for class `data_list`

Description

Returns a data list summary (data.frame class object) containing information on components, features, variable types, domains, and component dimensions.

Usage

## S3 method for class 'data_list'
summary(object, scope = "component", ...)

Arguments

object

A data_list class object.

scope

The level of detail for the summary. By default, this is set to "component", which returns a summary of the data list at the component level. Can also be set to "feature", resulting in a summary at the feature level.

...

Other arguments passed to summary (not used in this function)

Value

A data.frame class object. If scope is "component", each row shows the name, variable type, domain, and dimensions of each component. If scope is "feature", each row shows the name, variable type, and domain of each feature.

Summary method for class `dist_fns_list`

Description

This summary function simply returns to the console the number of functions contained in the dist_fns_list object.

Usage

## S3 method for class 'dist_fns_list'
summary(object, ...)

Arguments

object

A dist_fns_list class object.

...

Other arguments passed to summary (not used in this function).

Value

Returns no value. Outputs a message to the console.

Summary method for class `ext_solutions_df`

Description

This summary function provides a summary of the ext_solutions_df class object, including the number of solutions, the distribution of the number of clusters, the number of features, the number of observations, and the distribution of p-values.

Usage

## S3 method for class 'ext_solutions_df'
summary(object, ...)

Arguments

object

A ext_solutions_df class object.

...

Other arguments passed to summary (not used in this function).

Value

A named list containing the number of solutions, the distribution of the number of clusters, the number of features, the number of observations, and the distribution of p-values.

Summary method for class `settings_df`

Description

This summary function provides a summary of the settings_df class object, including the number of settings, the distribution of alpha values, the distribution of k values, and the distribution of clustering functions.

Usage

## S3 method for class 'settings_df'
summary(object, ...)

Arguments

object

A settings_df class object.

...

Other arguments passed to summary (not used in this function).

Value

A named list containing summary information of the settings data frame.

Summary method for class `sim_mats_list`

Description

This summary function simply returns to the console the number of functions contained in the sim_mats_list object.

Usage

## S3 method for class 'sim_mats_list'
summary(object, ...)

Arguments

object

A sim_mats_list class object.

...

Other arguments passed to summary (not used in this function).

Value

Returns no value. Outputs a message to the console.

Summary method for class `snf_config`

Description

This summary function provides a summary of the snf_config class object, including the settings data frame, clustering functions list, distance functions list, and weights matrix.

Usage

## S3 method for class 'snf_config'
summary(object, ...)

Arguments

object

A snf_config class object.

...

Other arguments passed to summary (not used in this function).

Value

A named list containing the summaries of objects within the config.

Summary method for class `solutions_df`

Description

This summary function provides a summary of the solutions_df class object, including the number of solutions, the distribution of the number of clusters, and the number of observations.

Usage

## S3 method for class 'solutions_df'
summary(object, ...)

Arguments

object

A ext_solutions_df class object.

...

Other arguments passed to summary (not used in this function).

Value

A named list containing the number of solutions, the distribution of the number of clusters, and the number of observations.

Summary method for class `t_ext_solutions_df`

Description

This summary function provides a summary of the t_ext_solutions_df class object, including the number of solutions, the distribution of the number of clusters, the number of features, the number of observations, and the distribution of p-values.

Usage

## S3 method for class 't_ext_solutions_df'
summary(object, ...)

Arguments

object

A t_ext_solutions_df class object.

...

Other arguments passed to summary (not used in this function).

Value

A named list containing the number of solutions, the distribution of the number of clusters, the number of features, the number of observations, and the distribution of p-values.

Summary method for class `t_solutions_df`

Description

This summary function provides a summary of the t_solutions_df class object, including the number of solutions, the distribution of the number of clusters, the number of features, the number of observations, and the distribution of p-values.

Usage

## S3 method for class 't_solutions_df'
summary(object, ...)

Arguments

object

A t_solutions_df class object.

...

Other arguments passed to summary (not used in this function).

Value

A named list containing the number of solutions, the distribution of the number of clusters, the number of features, the number of observations, and the distribution of p-values.

Summary method for class `weights_matrix`

Description

This summary function provides a summary of the weights_matrix class object, including the minimum, maximum, mean, and standard deviation of the feature weights.

Usage

## S3 method for class 'weights_matrix'
summary(object, ...)

Arguments

object

A weights_matrix class object.

...

Other arguments passed to summary (not used in this function).

Value

A named list containing the summary statistics of the weights matrix, the number of solutions, and the number of features.

Pull features used to calculate summary p-values from an object

Description

Pull features used to calculate summary p-values from an object

Usage

summary_features(x)

Arguments

x

The object to extract summary features from.

Value

A character vector of summary features.

Training and testing split

Description

Given a vector of uid_id and a threshold, returns a list of which members should be in the training set and which should be in the testing set. The function relies on whether or not the absolute value of the Jenkins's one_at_a_time hash function exceeds the maximum possible value (2147483647) multiplied by the threshold.

Usage

train_test_assign(train_frac, uids, seed = 42)

Arguments

train_frac

The fraction (0 to 1) of observations for training

uids

A character vector of UIDs to be distributed into training and test sets.

seed

Seed used for Jenkins's one_at_a_time hash function.

Value

A named list containing the training and testing uid_ids.

Pull UIDs from an object

Description

Pull UIDs from an object

Usage

uids(x)

Arguments

x

The object to extract UIDs from.

Value

A character vector of UIDs.

Validator for `ari_matrix` class object

Description

Validator for ari_matrix class object

Usage

validate_ari_matrix(aml)

Arguments

aml

An ari_matrix-like matrix object to be validated.

Value

If aml has a valid structure for a ari_matrix class object, returns the input unchanged. Otherwise, raises an error.

Validator for `clust_fns_list` class object

Description

Validator for clust_fns_list class object

Usage

validate_clust_fns_list(cfll)

Arguments

cfll

A clust_fns_list-like list class object.

Value

If cfll has a valid structure for a clust_fns_list class object, returns the input unchanged. Otherwise, raises an error.

Validator for data_list class object

Description

Validator for data_list class object

Usage

validate_data_list(dll)

Arguments

dll

A data list-like list class object.

Value

If dll has a valid structure for a data_list class object, returns the input unchanged. Otherwise, raises an error.

Validator for dist_fns_list class object

Description

Validator for dist_fns_list class object

Usage

validate_dist_fns_list(dfll)

Arguments

dfll

A distance metrics list-like list object to be validated.

Value

If dfll has a valid structure for a dist_fns_list class object, returns the input unchanged. Otherwise, raises an error.

Validator for `ext_solutions_df` class object

Description

Validator for ext_solutions_df class object

Usage

validate_ext_solutions_df(ext_sol_dfl)

Arguments

ext_sol_dfl

An extended solutions data frame-like object.

Value

If ext_sol_dfl has a valid structure for an object of class ext_solutions_df, returns ext_sol_dfl. Otherwise, raises an error.

Validator for `settings_df` class object

Description

Validator for settings_df class object

Usage

validate_settings_df(sdfl)

Arguments

sdfl

A settings data frame-like matrix object to be validated.

Value

If sdfl has a valid structure for a settings_df class object, returns the input unchanged. Otherwise, raises an error.

Validator for `similarity_matrix_list` class object

Description

Validator for similarity_matrix_list class object

Usage

validate_sim_mats_list(smll)

Arguments

smll

A similarity matrix list-like object.

Value

If smll has a valid structure for class similarity_matrix_list, returns smll. Otherwise, raises an error.

Validator for snf_config class object

Description

Validator for snf_config class object

Usage

validate_snf_config(scl)

Arguments

scl

An SNF config-like list class object.

Value

If dll has a valid structure for a data_list class object, returns input unchanged. Otherwise, raises an error.

Validator for `solutions_df` class object

Description

Validator for solutions_df class object

Usage

validate_solutions_df(sol_dfl)

Arguments

sol_dfl

A solutions data frame-like object to be validated and converted into a solutions data frame.

Value

If sol_dfl has a valid structure for a solutions_df class object, returns the input unchanged. Otherwise, raises an error.

Validator for `weights_matrix` class object

Description

Validator for weights_matrix class object

Usage

validate_weights_matrix(wml)

Arguments

wml

A weights_matrix-like matrix object to be validated.

Value

If wml has a valid structure for a weights_matrix class object, returns the input unchanged. Otherwise, raises an error.

Manhattan plot of feature-feature association p-values

Description

Manhattan plot of feature-feature association p-values

Usage

var_manhattan_plot(
  dl,
  key_var,
  neg_log_pval_thresh = 5,
  threshold = NULL,
  point_size = 5,
  text_size = 20,
  plot_title = NULL,
  hide_x_labels = FALSE,
  bonferroni_line = FALSE
)

Arguments

dl

List of data frames containing data information.

key_var

Feature for which the association p-values of all other features are plotted.

neg_log_pval_thresh

Threshold for negative log p-values.

threshold

p-value threshold to plot dashed line at.

point_size

Size of points in the plot.

text_size

Size of text in the plot.

plot_title

Title of the plot.

hide_x_labels

If TRUE, hides x-axis labels.

bonferroni_line

If TRUE, plots a dashed black line at the Bonferroni-corrected equivalent of the p-value threshold.

Value

A Manhattan plot (class "gg", "ggplot") showing the association p-values of features against one key feature in a data list.

Examples

dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

var_manhattan <- var_manhattan_plot(
    dl,
    key_var = "household_income",
    plot_title = "Correlation of Features with Household Income",
    text_size = 16,
    neg_log_pval_thresh = 3,
    threshold = 0.05
)

Generate a matrix to store feature weights

Description

Function for building a weights matrix independently of an SNF config. The weights matrix contains one row corresponding to each row of the settings data frame in an SNF config (one row for each resulting cluster solution) and one column for each feature in the data list used for clustering. Values of the weights matrix are passed to distance metrics functions during the conversion of input data frames to distance matrices. Typically, there is no need to use this function directly. Instead, users should provide weights matrix-building parameters to the snf_config() function.

Usage

weights_matrix(dl = NULL, n_solutions = 1, weights_fill = "ones")

Arguments

dl

A nested list of input data from data_list().

n_solutions

Number of rows to generate the template weights matrix for.

weights_fill

Value

wm A properly formatted matrix containing columns for all the features that require weights and rows.

Examples

input_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

sc <- snf_config(input_dl, n_solutions = 5)

wm <- weights_matrix(input_dl, n_solutions = 5, weights_fill = "uniform")

# updating an SNF config in parts
sc$"weights_matrix" <- wm

metasnf: Meta Clustering with Similarity Network Fusion

Description

Author(s)

See Also

Mock ABCD anxiety data

Description

Usage

Format

abcd_anxiety

Source

Mock ABCD "colour" data

Description

Usage

Format

abcd_colour

Source

Mock ABCD cortical surface area data

Description

Usage

Format

abcd_cort_sa

Source

Mock ABCD cortical thickness data

Description

Usage

Format

abcd_cort_t

Source

Mock ABCD depression data

Description

Usage

Format

abcd_depress

Source

Mock ABCD income data

Description

Usage

Format

abcd_income

Source

Mock ABCD income data

Description

Usage

Format

abcd_income

Source

Mock ABCD pubertal status data

Description

Usage

Format

abcd_pubertal

Source

Mock ABCD subcortical volumes data

Description

Usage

Format

abcd_subc_v

Source

Add columns to a data frame

Description

Usage

Arguments

Value

Add rows to a settings_df

Description

Usage

Arguments

Value

Heatmap of pairwise adjusted rand indices between solutions

Description

Usage

Arguments

Value

Mock age data

Description

Usage

Format

age_df

Source

Alluvial plot of patients across cluster counts and important features

`abcd_anxiety`

`abcd_colour`

`abcd_cort_sa`

`abcd_cort_t`

`abcd_depress`

`abcd_income`

`abcd_income`

`abcd_pubertal`

`abcd_subc_v`

`age_df`

`anxiety`

Coerce a `data_list` class object into a `data.frame` class object

Coerce a `ext_solutions_df` class object into a `data.frame` class object

Coerce a `settings_df` class object into a `data.frame` class object

Coerce a `settings_df` class object into a `data.frame` class object

Coerce a `solutions_df` class object into a `data.frame` class object

Coerce a `t_ext_solutions_df` class object into a `data.frame` class object

Coerce a `t_solutions_df` class object into a `data.frame` class object

Coerce a `weights_matrix` class object into a `data.frame` class object

Coerce a `clust_fns_list` class object into a `list` class object

Coerce a `data_list` class object into a `list` class object

Coerce a `dist_fns_list` class object into a `list` class object