Install the required dependencies if not already installed:
# The simulator
install.packages("remotes")
remotes::install_github("tgbrooks/dependentsimr")
# DESeq2 - optional dependency only needed if doing RNA-seq data sets
# Needed for type = "DESeq2"
install.packages("BiocManager")
BiocManager::install("DESeq2")
# Two more dependencies that are used in some simulation modes
# Needed for method = "corpcor"
install.packages("corpcor")
# Needed for method = "spiked Wishart"
install.packages("sparsesvd")
# Extras needed for this vignette
install.packages("tidyverse")
Load the libraries:
Set a seed for reproducibility.
Simulations with dependentsimr
start with a real data
set that should be mimicked. This data can be almost any shape,
including continuous, or discrete. Here, we will simulate RNA-seq, which
has discrete counts that we handle with a special “DESeq2” model. Other
data types can use normal, Poisson, or empirical (which uses only the
values present in the data set) types, or even a combination of
these.
We will use the following data set, which is a trimmed-down RNA-seq data of 1000 genes on 12 samples, originating from GSE151923.
head(read_counts)
#> # A tibble: 6 × 13
#> gene_id Arpp19_10_3 Arpp19_4_1 Arpp19_5_1 Arpp19_5_3 Arpp19_5_4 Arpp19_6_1
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ENSMUSG000… 2377 2162 2169 2650 2217 2552
#> 2 ENSMUSG000… 908 799 730 941 761 780
#> 3 ENSMUSG000… 316 342 373 373 325 405
#> 4 ENSMUSG000… 584 650 466 687 551 606
#> 5 ENSMUSG000… 2739 2531 2594 2988 2614 3081
#> 6 ENSMUSG000… 807 753 746 894 806 819
#> # ℹ 6 more variables: Arpp19_7_2 <dbl>, Arpp19_9_1 <dbl>, Dach1_11_1 <dbl>,
#> # Dach1_2_2 <dbl>, Dach1_3_1 <dbl>, Dach1_5_3 <dbl>
We first just format it as a matrix for use with
dependentsimr
instead of as a tibble.
The first step is to compute a ‘random structure’ for the data, which encodes how the variables (genes, in this case) in the data set are dependent upon each other. This computes a covariance matrix which captures the dependence between genes, however, that matrix is not the same as the usual sample covariance matrix. It also computes the shape (such as mean and variance) of each variable individually.
There are three implemented methods (“pca”, “spiked Wishart”, and “corpcor”) with different characteristics. We demonstrate here each.
First, the “pca” method takes a rank
and includes only
gene-gene dependence up to that many dimensions. Typically small values
(2 or 3, for example) would be used for rank
, similar to
when performing PCA, unless a large number of input samples are
available. Note that we put our read_counts
data into a
list()
. This is done since
get_random_structure
also supports multiple data sets of
different ‘modes’ (for example, proteomics and transcriptomics done on
the same samples). We only have one ‘mode’ that we give the arbitrary
name of data
.
rs_pca <- get_random_structure(list(data=count_matrix), method="pca", rank=2, types="DESeq2")
#> converting counts to integer mode
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
Second, the spiked Wishart method also takes a rank
value but uses it differently. Typically, this will be larger, up to the
total number of samples minus one, which is what we use here. For very
large sample counts, rank
might need to be capped well
below the number of samples for computational efficiency reasons.
rs_wishart <- get_random_structure(list(data=count_matrix), rank=11, method="spiked Wishart", types="DESeq2")
#> converting counts to integer mode
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
Third, a method based off the corpcor
package is
implemented. It takes no parameters but often underestimates the amount
of gene-gene dependence in the data.
rs_corpcor <- get_random_structure(list(data=count_matrix), method="corpcor", types="DESeq2")
#> converting counts to integer mode
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> Estimating optimal shrinkage intensity lambda (correlation matrix): 0.7601
#> Estimating optimal shrinkage intensity lambda.var (variance vector): 1
Lastly, we could generate data without gene-gene dependence for comparison by removing the dependence from any of our random structures.
Real RNA-seq data sets typically have variation between samples in
the total number of reads per sample. Here, we just copy the total
number of reads from the real data set to get realistic values. A
library_size
of 1 corresponds to the average read depth of
the data used to generate the random structure. This step is only needed
when generating RNA-seq data with type = "DESeq2"
.
actual_library_sizes <- count_matrix |> apply(2, sum)
N_SAMPLES <- 6
library_sizes <- sample(actual_library_sizes / mean(actual_library_sizes), size=N_SAMPLES, replace=TRUE)
Now we generate some ‘control’ samples, which have the same
expression levels (as well as gene-gene dependence) as was estimated
from the original data set. We’ll use rs_pca
that we
generated above for example, but any of the methods could be used. Each
column is a separate simulated sample. Samples are independent of each
other in this model, however genes within a sample are dependent upon
each other.
control_sim_data <- draw_from_multivariate_corr(rs_pca, n_samples=N_SAMPLES, size_factors=library_sizes)$data
Note that we used $data
here because that was the name
we provided to get_random_structure
and it was the only
mode of data that we had. If multiple modes were used to generate the
random structure, then we draw_from_multivariate_corr
returns a list with each of them.
To generate a set of ‘treatment’ samples which differ from the ‘control’ samples, we will have to modify the random structure to reflect the desired expression values of the treatment group. We do this by just choosing some genes at random and picking a fold-change for each at random.
# fraction of all genes to modify in treatment
FRAC_DE <- 0.1
# Min and max value for the log fold change (LFC) for each DE gene chosen
# non-DE genes will have a LFC of 0
LFC_MIN <- 0.5
LFC_MAX <- 4.0
N_DE <- floor(FRAC_DE * nrow(control_sim_data))
# Pick which genes are DE
de_genes <- sample(nrow(control_sim_data), N_DE)
# Pick the LFCs of each of those genes
de_lfc <- runif(N_DE, LFC_MIN, LFC_MAX) * sample(c(-1, 1), size=N_DE, replace=TRUE)
Now, we create a new random structure object and modify it to have
the desired expression values. Information about the marginal
distribution of each gene (e.g., mean and variance but not dependence on
other genes) is stored in rs$marginals
for a random
structure rs
. The specifics depend upon the
type
used in get_random_structure
, but for
DESeq2
, the q
value determines the mean
expression, so we change that here.
rs_treatment <- rs_pca
rs_treatment$marginals$data$q[de_genes] <- 2^(de_lfc) * rs_treatment$marginals$data$q[de_genes]
Lastly, we generate the data for the ‘treatment’ samples.
To demonstrate the data, we show a PCA plot of the control and treatment groups.
sim_data <- cbind(control_sim_data, treatment_sim_data)
pca <- prcomp(t(sim_data), center=TRUE, scale.=TRUE, rank.=2)
pca_points <- predict(pca, t(sim_data))
pca_data <- tibble(x=pca_points[,1], y=pca_points[,2]) |>
mutate(group = c(rep("control", N_SAMPLES), rep("treatment", N_SAMPLES)))
ggplot(pca_data, aes(x, y, color=group)) +
geom_point() +
labs(x = "PC1", y = "PC2")
As expected, the control and treatment group separate nicely, but there is also a substantial component of variation (PC1) which both groups have. This arises from the dependence between genes, which is the same in both groups in our simulation.