Title: | Reconstructing the Regulatory Programs of Target Genes in scRNA-Seq Data |
Version: | 0.2.0 |
Description: | Implementation of the scregclust algorithm described in Larsson, Held, et al. (2024) <doi:10.1038/s41467-024-53954-3> which reconstructs regulatory programs of target genes in scRNA-seq data. Target genes are clustered into modules and each module is associated with a linear model describing the regulatory program. |
Encoding: | UTF-8 |
Depends: | R (≥ 3.5.0) |
Imports: | Matrix, stats, methods, utils, reshape, igraph, graphics, grid, cli, prettyunits, ggplot2, rlang, Rcpp (≥ 1.0.8) |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0), Seurat (≥ 5.0.0), glmGamPoi, hdf5r |
LinkingTo: | Rcpp, RcppEigen |
VignetteBuilder: | knitr |
RoxygenNote: | 7.3.2 |
License: | GPL (≥ 3) |
Config/testthat/edition: | 3 |
URL: | https://scmethods.github.io/scregclust/, https://github.com/scmethods/scregclust/ |
BugReports: | https://github.com/scmethods/scregclust/issues |
NeedsCompilation: | yes |
Packaged: | 2024-12-06 14:06:57 UTC; felixheld |
Author: | Felix Held |
Maintainer: | Felix Held <felix.held@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-12-06 14:30:01 UTC |
scregclust: Reconstructing the Regulatory Programs of Target Genes in scRNA-Seq Data
Description
Implementation of the scregclust algorithm described in Larsson, Held, et al. (2024) doi:10.1038/s41467-024-53954-3 which reconstructs regulatory programs of target genes in scRNA-seq data. Target genes are clustered into modules and each module is associated with a linear model describing the regulatory program.
Details
Computational methods for the scregclust algorithm
Author(s)
Ida Larsson, Felix Held, Sven Nelander
See Also
Useful links:
Report bugs at https://github.com/scmethods/scregclust/issues
Allocate 3d-array and fill with matrix along first dimension
Description
Allocate 3d-array and fill with matrix along first dimension
Usage
alloc_array(input, n_cl)
Arguments
input |
the matrix of size |
n_cl |
the size of the three-dimensional array's first dimension |
Value
The allocated and filled array of size n_cl x n_obs x n_genes
Extract final configurations into a data frame
Description
Extract final configurations into a data frame
Usage
available_results(obj)
Arguments
obj |
An object of class |
Value
A data.frame
containing penalization parameters and
final configurations for those penalizations.
Create a table of module overlap for two clusterings
Description
Compares two clusterings and creates a table of overlap between them. Module labels do not have to match.
Usage
cluster_overlap(k1, k2)
Arguments
k1 |
First clustering |
k2 |
Second clustering |
Value
A matrix showing the module overlap with the labels of k1
in
the columns and the labels of k2
in the rows.
Compute NNLS coefficients
Description
Computes non-negative least squares coefficients with a matrix right hand side.
Usage
coef_nnls(x, y, eps = 1e-12, max_iter = 1000L)
Arguments
x |
Coefficient matrix (p x n matrix) |
y |
Right hand side (p x m matrix) |
eps |
Convergence tolerance |
max_iter |
Maximum number of iterations |
Value
A list containing
beta |
The estimated coefficient matrix |
iterations |
A vector containing the number of iterations needed
for the |
References
Duy Khuong Nguyen and Tu Bao Ho. Accelerated anti-lopsided algorithm for nonnegative least squares. International Journal of Data Science and Analytics, 3(1):23–34, 2017.
Adapted from https://github.com/khuongnd/nnls_antilopsided
Compute OLS coefficients
Description
If the design matrix has full column-rank, then use the normal least squares estimate. Otherwise, use the Moore-Penrose inverse to compute the least squares estimate.
Usage
coef_ols(y, x)
Arguments
y |
Target vector (n x 1)/matrix (n x m) |
x |
Design matrix (n x p) |
Value
Vector of OLS coefficients
Compute ridge regression coefficients
Description
Compute ridge regression coefficients
Usage
coef_ridge(y, x, lambda)
Arguments
y |
Target vector (n x 1)/matrix (n x m) |
x |
Design matrix (n x p) |
lambda |
Positive parameter for ridge penalty |
Value
Vector of ridge regression coefficients
Compute Hubert's and Arabie's Adjusted Rand index
Description
Compute Hubert's and Arabie's Adjusted Rand index
Usage
compute_adjusted_rand_index(k1, k2)
Arguments
k1 |
First clustering as vector of integers |
k2 |
Second clustering as vector of integers |
Value
The Adjusted Rand index as a numeric value
References
Lawrence Hubert and Phipps Arabie (1985). "Comparing partitions". Journal of Classification. 2 (1): 193–218. DOI:10.1007/BF01908075
Compute the Rand index
Description
Compute the Rand index
Usage
compute_rand_index(k1, k2)
Arguments
k1 |
First clustering as vector of integers |
k2 |
Second clustering as vector of integers |
Value
The Rand index as a numeric value
References
W. M. Rand (1971). "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association 66 (336): 846–850. DOI:10.2307/2284239
ADMM algorithm for solving the group-penalized least squares problem
Description
Implements estimation of the coop-lasso problem.
Usage
coop_lasso(
y,
x,
lambda,
weights,
beta_0 = NULL,
rho_0 = 0.2,
alpha_0 = 1.5,
n_update = 2L,
eps_corr = 0.2,
max_iter = 1000L,
eps_rel = 1e-08,
eps_abs = 1e-12,
verbose = FALSE
)
Arguments
y |
Target (n x m) |
x |
Design matrix (n x p) |
lambda |
Penalization parameter |
weights |
A specific weight for each group (typically this is
|
beta_0 |
Initial value for coefficients, allowing for warm start.
Can be set to NULL, which results in the initial |
rho_0 |
Initial ADMM step-size |
alpha_0 |
Initial ADMM relaxation parameter |
n_update |
Number of steps in-between updates of the step-size/adaptation parameters |
eps_corr |
Lower bound for the correlation in the step-size update steps |
max_iter |
Maximum number of iterations |
eps_rel |
Relative tolerance for convergence check |
eps_abs |
Absolute tolerance for convergence check |
verbose |
Whether or not information about the optimization process should be printed to the terminal |
Value
A list containing
beta |
The coefficients at convergence |
iterations |
Number of iterations |
References
Xu et al. (2017) Adaptive relaxed ADMM: Convergence theory and practical implementation. DOI 10.1109/CVPR.2017.765
Format count table nicely
Description
Format count table nicely
Usage
count_table(counts, title, row_names, col_width = 5)
Arguments
counts |
a list of count vectors with |
title |
title above the table |
row_names |
a vector of row names, one for each count vector |
col_width |
minimum width for columns |
Value
A string formatted as a table
Fast computation of correlation
Description
This uses a more memory-intensive but much faster algorithm than
the built-in cor
function.
Usage
fast_cor(x, y)
Arguments
x |
first input matrix |
y |
second input matrix |
Details
Computes the correlation between the columns of x
and y
.
Value
Correlations matrix between the columns of x
and y
Determine module sizes
Description
Determine module sizes
Usage
find_module_sizes(module, n_modules)
Arguments
module |
Vector of module indices |
n_modules |
Total number of modules |
Value
A named vector containing the name of the module (its index or
"Noise"
) and the number of elements in that module
Get the average number of active regulators per module
Description
Get the average number of active regulators per module
Usage
get_avg_num_regulators(fit)
Arguments
fit |
An object of class |
Value
A data.frame
containing the average number of active regulators
per module for each penalization parameter.
Return the number of final configurations
Description
Returns the number of final configurations per penalization parameter in an scRegClust object.
Usage
get_num_final_configs(fit)
Arguments
fit |
An object of class |
Value
An integer vector containing the number of final configurations for each penalization parameter.
Compute Rand indices
Description
Compute Rand indices for fitted scregclust object
Usage
get_rand_indices(fit, groundtruth, adjusted = TRUE)
Arguments
fit |
An object of class |
groundtruth |
A known clustering of the target genes (integer vector) |
adjusted |
If TRUE, the Adjusted Rand index is computed. Otherwise the ordinary Rand index is computed. |
Value
A data.frame
containing the Rand indices. Since there can
be more than one final configuration for some penalization
parameters, Rand indices are averaged for each fixed penalization
parameter. Returned are the mean, standard deviation and number
of final configurations that were averaged.
References
W. M. Rand (1971). "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association 66 (336): 846–850. DOI:10.2307/2284239
Lawrence Hubert and Phipps Arabie (1985). "Comparing partitions". Journal of Classification. 2 (1): 193–218. DOI:10.1007/BF01908075
Return list of regulator genes
Description
Return list of regulator genes
Usage
get_regulator_list(mode = c("TF", "kinase"))
Arguments
mode |
Determines which genes are considered to be regulators. Currently supports TF=transcription factors and kinases. |
Value
a list of gene symbols
See Also
Extract target gene modules for given penalization parameters
Description
Extract target gene modules for given penalization parameters
Usage
get_target_gene_modules(fit, penalization = NULL)
Arguments
fit |
An object of class |
penalization |
A numeric vector of penalization parameters.
The penalization parameters specified here must have
been used used during fitting of the |
Value
A list of lists of final target modules. One list for each
parameter in penalization
. The lists contain the modules of
target genes for each final configuration.
Compute indicator matrix of pairwise distances smaller than threshold
Description
Computes the Jaccard distance between rows of a matrix and returns a sparse symmetric indicator matrix containing the entries with a distance of less than a given upper bound. Note that the diagonal is always 1.
Usage
jaccard_indicator(x, upper_bnd = 0.8)
Arguments
x |
the input matrix with vectors to be compared in the rows. |
upper_bnd |
pairs with a Jaccard distance below this upper bound are returned as 1 while all others receive the entry 0. |
Value
A list of vectors describing a sparse lower triangular pattern matrix
i |
Row indices |
j |
Column indices |
Perform the computations for thresholded Jaccard distance
Description
Perform the computations for thresholded Jaccard distance
Usage
jaccard_indicator_comp(gs, eps)
Arguments
gs |
a list of integer vectors, one for each row, giving the column
indices of the non-zero elements of the row or |
eps |
an upper bound on the Jaccard distance ( |
Details
This function is optimized for sparse matrices and computes the pairwise
Jaccard distances between the rows of the input matrix. Note that the
actual distance is not saved. Instead, a threshold (eps
) is supplied
and an indicator matrix is returned, with a one indicating that the
distance is smaller than eps
(equivalently, the Jaccard similarity
is larger than 1 - eps
).
Value
A list with row and column indices in the #row x #row indicator
matrix specifying which rows in the original matrix had a distance
of at most eps
.
Perform the k-means++ algorithm
Description
Performs the k-means++ algorithm to cluster the rows of the input matrix.
Usage
kmeanspp(x, n_cluster, n_init_clusterings = 10L, n_max_iter = 10L)
Arguments
x |
Input matrix (n x p) |
n_cluster |
Number of clusters |
n_init_clusterings |
Number of repeated random initializations to perform |
n_max_iter |
Number of maximum iterations to perform in the k-means algorithm |
Details
Estimation is repeated
Value
An object of class stats::kmeans
.
References
David Arthur and Sergei Vassilvitskii. K-Means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027––1035. Society for Industrial and Applied Mathematics, 2007.
Determine initial centers for the kmeans++ algorithm
Description
Determine initial centers for the kmeans++ algorithm
Usage
kmeanspp_init(n_cluster, x = NULL, dm = NULL)
Arguments
x |
data matrix to be clustered |
dm |
distance matrix (between rows of x; of class "dist") |
Value
Row indices of initial cluster centers of x
Plot average silhouette scores and average predictive R^2
Description
Plot average silhouette scores and average predictive R^2
Usage
plot_module_count_helper(list_of_fits, penalization)
Arguments
list_of_fits |
A list of |
penalization |
Either a single numeric value requesting the results
for the same penalty parameter across all fits in
|
Value
A ggplot2 plot showing the average silhouette score and the
average predictive R^2
Plotting the regulatory table from scregclust as a directed graph
Description
Plotting the regulatory table from scregclust as a directed graph
Usage
plot_regulator_network(
output,
arrow_size = 0.3,
edge_scaling = 30,
no_links = 6,
col = c("gray80", "#FC7165", "#BD828C", "#9D8A9F", "#7D92B2", "#BDA88C", "#FCBD65",
"#F2BB90", "#E7B9BA", "#BDB69C", "#92B27D", "#9B8BA5", "#9D7DB2", "#94A5BF")
)
Arguments
output |
Object of type |
arrow_size |
Size of arrow head |
edge_scaling |
Scaling factor for edge width |
no_links |
Threshold value (0-10) for number of edges to show, higher value = more stringent threshold = less edges |
col |
color |
Value
Graph with gene modules and regulators as nodes
Plot individual silhouette scores
Description
Plot individual silhouette scores
Usage
plot_silhouettes(list_of_fits, penalization, final_config = 1L)
Arguments
list_of_fits |
A list of |
penalization |
Either a single numeric value requesting the results
for the same penalty parameter across all fits in
|
final_config |
The final configuration that should be visualized.
Either a single number to be used for all fits in
|
Value
A ggplot2 plot showing the the silhouette scores for each supplied fit.
Quick'n'dirty progress bar
Description
Creates a progress bar and returns it as a string.
Usage
progstr(step, n_steps, name, finished = FALSE, progress_length = 20L)
Arguments
step |
current step being worked on |
n_steps |
total number of steps |
name |
name of the process |
finished |
whether the process is finished |
progress_length |
length of the progress bar in ascii signs |
Value
A string formatted as a progress bar
Remove empty modules
Description
Remove empty modules
Usage
remove_empty_modules(module)
Arguments
module |
Vector of module indices |
Details
Only iterates through modules with positive index, leaving the noise module untouched.
Value
The updated vector of module indices with empty modules removed.
Reset input 3d-array by filling matrix along first dimension
Description
Reset input 3d-array by filling matrix along first dimension
Usage
reset_array(arr, input)
Arguments
arr |
The 3d-array of dimension |
input |
The matrix of size |
Uncover gene modules and their regulatory programs from single-cell data
Description
Use the scRegClust algorithm to determine gene modules and their regulatory programs from single-cell data.
Usage
scregclust(
expression,
genesymbols,
is_regulator,
penalization,
n_modules,
initial_target_modules = NULL,
sample_assignment = NULL,
center = TRUE,
split1_proportion = 0.5,
total_proportion = 1,
split_indices = NULL,
prior_indicator = NULL,
prior_genesymbols = NULL,
prior_baseline = 1e-06,
prior_weight = 0.5,
min_module_size = 0L,
allocate_per_obs = TRUE,
noise_threshold = 0.025,
n_cycles = 50L,
use_kmeanspp_init = TRUE,
n_initializations = 50L,
max_optim_iter = 10000L,
tol_coop_rel = 1e-08,
tol_coop_abs = 1e-12,
tol_nnls = 1e-04,
compute_predictive_r2 = TRUE,
compute_silhouette = FALSE,
nowarnings = FALSE,
verbose = TRUE,
quick_mode = FALSE,
quick_mode_percent = 0.1
)
Arguments
expression |
|
genesymbols |
A vector of gene names corresponding to rows of
|
is_regulator |
An indicator vector where |
penalization |
Sparsity penalty related to the amount of regulators associated with each module. Either a single positive number or a vector of positive numbers. |
n_modules |
Requested number of modules (integer).
If this is provided without specifying |
initial_target_modules |
The initial assignment of target genes to
modules of length |
sample_assignment |
A vector of sample assignment for each cell, can
be used to perform the data splitting with
stratification. Has to be of length |
center |
Whether or not genes should be centered within each subgroup
defined in |
split1_proportion |
The proportion to use for the first dataset during
data splitting. The proportion for the second
dataset is |
total_proportion |
Can be used to only use a proportion of the supplied
observations. The proportion of the first dataset
during data splitting in relation to the full
dataset will be
|
split_indices |
Can be used to provide an explicit data split. If this
is supplied then |
prior_indicator |
An indicator matrix (sparse or dense) of size |
prior_genesymbols |
A vector of gene names of length q corresponding
to the rows/columns in |
prior_baseline |
A positive baseline for the network prior. The larger this parameter is, the less impact the network prior will have. |
prior_weight |
A number between 0 and 1 indicating the strength of the prior in relation to the data. 0 ignores the prior and makes the algorithm completely data-driven. 1 uses only the prior during module allocation. |
min_module_size |
Minimum required size of target genes in a module. Smaller modules are emptied. |
allocate_per_obs |
Whether module allocation should be performed for
each observation in the second data split separately.
If |
noise_threshold |
Threshold for the best |
n_cycles |
Number of maximum algorithmic cycles. |
use_kmeanspp_init |
Use kmeans++ for module initialization if
|
n_initializations |
Number of kmeans(++) initialization runs. |
max_optim_iter |
Maximum number of iterations during optimization in the coop-Lasso and NNLS steps. |
tol_coop_rel |
Relative convergence tolerance during optimization in the coop-Lasso step. |
tol_coop_abs |
Absolute convergence tolerance during optimization in the coop-Lasso step. |
tol_nnls |
Convergence tolerance during optimization in the NNLS step. |
compute_predictive_r2 |
Whether to compute predictive |
compute_silhouette |
Whether to compute silhouette scores for each target gene. |
nowarnings |
When turned on then no warning messages are shown. |
verbose |
Whether to print progress. |
quick_mode |
Whether to use a reduced number of noise targets to speed up computations. |
quick_mode_percent |
A number in [0, 1) indicating the amount of
noise targets to use in the re-allocation process
if |
Value
A list with S3 class scregclust
containing
penalization |
The supplied |
results |
A list of result lists (each with S3 class
|
initial_target_modules |
Initial allocation of target genes into modules. |
split_indices |
either verbatim the vector given as input or a vector encoding the splits as NA = not included, 1 = split 1 or 2 = split 2. Allows reproducibility of data splits. |
For each supplied penalization parameter, results
contains a list with
the current
penalization
parameter,the supplied
genesymbols
after filtering (as used during fitting),the supplied
is_regulator
vector after filtering (as used during fitting),the number of fitted modules
n_modules
,whether the current run
converged
to a single configuration (as a boolean),as well as an
output
object containing the numeric results for each final configuration.
It is possible that the algorithm ends in a finite cycle of configurations
instead of a unique final configuration.
Therefore, output
is a list with each element itself being a list
with the following contents:
reg_table
a regulator table, a matrix of weights for each regulator and module
module
vector of same length as
genesymbols
containing the module assignments for all genes with regulators marked asNA
. Genes considered noise are marked as-1
.module_all
same as
module
, however, genes that were marked as noise (-1 inmodule
) are assigned to the module in which it has the largestR^2
, even if it is belownoise_threshold
.r2
matrix of predictive
R^2
value for each target gene and modulebest_r2
vector of best predictive
R^2
for each gene (regulators marked with NA)best_r2_idx
module index corresponding to best predictive
R^2
for each gene (regulators marked with NA)r2_module
a vector of predictive
R^2
values for each module (included ifcompute_predictive_r2 == TRUE
)importance
a matrix of importance values for each regulator (rows) and module (columns) (included if
compute_predictive_r2 == TRUE
)r2_cross_module_per_target
a matrix of cross module
R^2
values for each target gene (rows) and each module (columns) (included ifcompute_silhouette == TRUE
)silhouette
a vector of silhouette scores for each target gene (included if
compute_silhouette == TRUE
)models
regulator selection for each module as a matrix with regulators in rows and modules in columns
signs
regulator signs for each module as a matrix with regulators in rows and modules in columns
weights
average regulator coefficient for each module
coeffs
list of regulator coefficient matrices for each module for all target genes as re-estimated in the NNLS step
sigmas
matrix of residual variances, one per target gene in each module; derived from the residuals in NNLS step
Package data before clustering
Description
Package data before clustering
Usage
scregclust_format(expression_matrix, mode = c("TF", "kinase"))
Arguments
expression_matrix |
The p x n gene expression matrix with gene symbols as rownames. |
mode |
Determines which genes are considered to be regulators. |
Value
A list with
genesymbols |
The gene symbols extracted from the expression matrix |
sample_assignment |
A vector filled with |
is_regulator |
Whether a gene is considered to be a regulator or not,
determined dependent on |
See Also
Split Sample
Description
Splits sample in train and test set
Usage
split_sample(
z,
stratification,
is_regulator,
split_indices,
split1_proportion,
total_proportion,
center
)
Arguments
z |
matrix of single cell data with rows as genes and columns as cells. |
stratification |
a vector by which the sampling will be stratified
of length |
is_regulator |
an indicator vector, telling which rows in |
split_indices |
a vector of given split indices. can be |
split1_proportion |
proportion to include in first data split |
total_proportion |
proportion of data to include overall in splitting |
center |
TRUE if data should be row-centered. Set to FALSE otherwise. |
Value
a list containing
z1_reg |
first data split, TF-part |
z2_reg |
second data split, TF-part |
z1_target |
first data split, non-TF part |
z2_target |
second data split, non-TF part |
split_indices |
either verbatim the vector given as input or a vector encoding the splits as NA = not included, 1 = split 1 or 2 = split 2. Allows reproducibility of data splits. |