Type: | Package |
Version: | 1.0.0 |
Date: | 2018-05-01 |
Title: | Repertoire Dissimilarity Index |
Description: | Methods for calculation and visualization of the Repertoire Dissimilarity Index. Citation: Bolen and Rubelt, et al (2017) <doi:10.1186/s12859-017-1556-5>. |
License: | CC BY-SA 4.0 |
URL: | http://rdi.readthedocs.io |
BugReports: | https://bitbucket.org/cbolen1/rdicore/issues |
LazyData: | true |
Depends: | R (≥ 3.0.0) |
Imports: | beanplot, gplots, pdist, stringr |
Suggests: | knitr, ggplot2 |
RoxygenNote: | 6.0.1 |
NeedsCompilation: | no |
Packaged: | 2018-05-04 14:59:52 UTC; bolenc |
Author: | Christopher Bolen [aut, cre], Florian Rubelt [aut], Jason Vander Heiden [aut] |
Maintainer: | Christopher Bolen <cbolen1@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2018-05-07 11:14:21 UTC |
Calculate repertoire distances
Description
Calculate repertoire distances from a matrix of vdjCounts
Usage
calcRDI(vdjCounts, distMethod = c("euclidean", "cor"), subsample = TRUE,
nIter = 100, constScale = TRUE, units = c("lfc", "pct"), ...)
Arguments
vdjCounts |
a matrix of repertoire counts, as created by calcVDJCounts |
distMethod |
one of c("euclidean","cor") determining how to calculate the distance from the matrix of vdj counts. See Details. |
subsample |
logical; if true, all repertoires are subsampled to be equal size. |
nIter |
value defining how many times the subsampling should be repeated. Only used if subsample is TRUE. |
constScale |
logical; if |
units |
One of "lfc" or "pct". This determines the method used for transforming the repertoire counts. See Details. |
... |
additional parameters; these are ignored by this function. |
Details
There are two options for distance methods, "euclidean" and "cor". Euclidean refers to
standard euclidean distance, and is the standard for the RDI measure described in
(Bolen et al. Bioinformatics 2016). In contrast, cor refers to a correlation-based
distance metric, where the distance is defined as (1-correlation)
between each
column of vdjCounts.
The units
parameter is used to determine the transformation function for the
repertoire counts. If units='lfc'
(default), then the arcsinh transformation
is applied to the count matrix, resulting in a distance metric which
will scale with the average log fold change of each gene. In contrast,
units='pct'
will result in no transformation of the count matrix, and distances
will be proportional to the average percent change of each gene, instead. Note that
"units" is a bit of a misnomer, as the distance metric doesn't actually represent the
true log-fold or percent change in the repertoires. In order to actually estimate
these parameters, refer to the rdiModel and convertRDI
functions.
Value
A dissimilarity structure containing distances between repertoires, averaged across each subsampe run. In addition to the standard attributes in a dist object, three additional attributes are defined as follows:
ngenes | integers, the number of genes in each column of "genes" that were included in at least one repertoire. |
nseq | integer, the number of sequences used after subsampling
the repertoires. If subsample=FALSE , this is not
defined. |
units | string, either "lfc" or "pct", depending on the "units" in the original call |
Examples
#create genes
genes = sample(letters, 10000, replace=TRUE)
#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE),
cellType = sample(c("B","T"), 10000, replace=TRUE)
)
##generate repertoire counts
cts = calcVDJcounts(genes,seqAnnot)
##calculate RDI
d = calcRDI(cts)
##calculate RDI in percent space
d_pct = calcRDI(cts,units="pct")
##convert RDI to actual 'lfc' estimates and compare
dtrue = convertRDI(d)$pred
plot(d, dtrue)
Calculate repertoire counts
Description
Create count matrices for a set of repertoires
Usage
calcVDJcounts(genes, seqAnnot, select = NULL, combine = NULL,
vdjDrop = NULL, splitBy = NULL, simplifyNames = TRUE,
splitCommas = FALSE, ...)
Arguments
genes |
vector (or matrix) containing the gene calls for each sequence. If genes is a matrix, counts will be calculated for each column of 'genes', and the resulting count matrices will be concatenated. See Details. |
seqAnnot |
matrix containing repertoire annotations. |
select |
a list containing definitions of which repertoires to use. See Details. |
combine |
a list defining repertoires that should be grouped together. See Details. |
vdjDrop |
a list specifying specific genes to exclude from analysis. See Details. |
splitBy |
the columns in seqAnnot to use for splitting repertoires. Default is to use all columns in seqAnnot. |
simplifyNames |
logical; if true, any columns of seqAnnot where all selected (and collapsed) sequences share the same value will not be used to make the names of sequenceCodes. |
splitCommas |
logical; if true, seqAnnot is assumed to contain. comma-separated lists of sequence annotations, which will be split up before generating sequence codes. Note: setting this to TRUE will make processing much slower. |
... |
additional parameters; these are ignored by this function. |
Details
In most cases, genes
will be a single vector or one-column matrix. However,
there are some cases where a row of seqAnnot
corresponds to two (or more) genes
(e.g. the V and J gene segments of a single immune sequence). Rather than make multiple
rows for each gene, the calcVDJcounts
function provides the option to provide
a multi-column matrix for genes
. The counts for each column will be tallied
separately, and are then concatenated.
To ensure equal variance across all repertoires, the default RDI metric uses
subsampling to ensure that all repertoires have the same number of sequences. The
default RDI metric subsamples all repertoires to the size of the smallest repertoire,
which may result in a loss of power for comparisons between larger repertoires.
In order to increase power for various tests, it is often useful to only calculate
the repertoire counts for a subset of the repertoires in seqAnnot. This can be done by
using the select
and combine
parameters to specify which
repertoires to include in the analysis.
Both parameters are lists containing entries
with the same name as one of the columns of seqAnnot. For select
, each entry is
a vector defining which values to include (e.g., to include only Visit 1 and 3, you
might specify select=list(visit=c("V1","V3"))
, where the 'visit'
column
in seqAnnot contains the values "V1"
,"V2"
, and "V3"
). In this
case, any rows of genes
and seqAnnot
that come from a repertoire not
specified in select
will be discarded. By default, if a select
code is
not specified for a column in seqAnnot
, all values from that column will be
included.
The combine
parameter works in a similar fashion, but instead of a vector
describing which parameters to include, you can specify a vector of regular
expressions, and any values of the seqAnnot
column that match the regular
expression will be combined into a single repertoire (e.g. to combine visits 1 and 3
into a single repertoire, you might specify combine=list(visit="V[13]")
).
The vdjDrop
parameter is also useful for limiting sequences. Like
select
and combine
, this is a named list, with entries corresponding to
the columns of genes
. Each entry of vdjDrop
is a vector of gene segment
names to remove from the analysis. All sequences containing those genes are removed
from the analysis before subsampling.
Once unwanted rows have been removed, the columns of seqAnnot
are concatenated
to generate "repertoire" labels for each row. The repertoire labels are then used
to split the rows of genes
, and gene prevalence is tallied within a repertoire.
By default, columns of seqAnnot
that are constant after subsetting will not be
included in the label. However, this can be controlled by the simplifyNames
parameter. If simplifyNames
is FALSE, all columns of seqAnnot
are
included when generating labels.
Value
A matrix where each row represents a gene, and each column represents a repertoire.
Examples
#create genes
genes = sample(letters, 10000, replace=TRUE)
#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE),
visit = sample(c("V1","V2","V3"), 10000, replace=TRUE),
cellType = sample(c("B","T"), 10000, replace=TRUE)
)
##generate repertoire counts for all repertoires
cts = calcVDJcounts(genes,seqAnnot)
##Only include visit 1
cts = calcVDJcounts(genes,seqAnnot, select=list(visit="V1"))
## Just T cell repertoires, combining visit 1 and 3 together, and dropping visit 2
cts = calcVDJcounts(genes,seqAnnot,
select=list(cellType="T", visit=c("V1","V3")),
combine=list(visit="V[13]"))
Convert RDI measures
Description
Method to convert RDI values to fold/percent change
Usage
convertRDI(d, models = NULL, calcSD = FALSE)
Arguments
d |
Distance matrix (as produced by calcRDI), or a vector of distances. |
models |
Set of RDI models, as produced by rdiModel. If |
calcSD |
logical; if |
Details
The convertRDI function works by first generating a model for the RDI values at a given repertoire size and feature count using the rdiModel function (see that method's help file for more details). The RDI models predict the average log-fold/percent change across a range of RDI values, and allows us to convert RDI to a more stable and interpretable metric.
In addition to the average log-fold or percent change value, rdiModel also generates models for the standard deviation at each RDI value. This is useful for understanding the confidence intervals around the fold change estimate.
Value
A list containing either one or two features:
pred | The converted predictions; same length as d . |
sd | If calcSD==T , a set of standard deviation estimates for each
prediction.
|
Examples
#create genes
genes = sample(letters, 10000, replace=TRUE)
#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE))
#calculate RDI
d = rdi(genes, seqAnnot)
##convert RDI to actual 'lfc' estimates and compare
dtrue = convertRDI(d)$pred
plot(d, dtrue)
##look at SD ranges around lfc estimates
dtrue = convertRDI(d, calcSD=TRUE)
##plot using ggplot2
library(ggplot2)
x = as.numeric(d)
y = as.numeric(dtrue$pred)
sd = as.numeric(dtrue$sd)
qplot(x,y)+geom_errorbar(aes(x=x, ymin=y-sd, ymax=y+sd))
RDI ladder plotting function
Description
function for adding a pre-computed RDI ladder onto a plot
Usage
plotRDIladder(ladder, side = 4, toPlot = NULL, labelLadder = TRUE,
add = TRUE, cex = 0.7, lineCol = NULL, fillCol = "#AAAAAA")
Arguments
ladder |
the ladder object to add, as created by rdiLadder |
side |
integer; value between 1 and 4 indicating where the ladder will be added. 1 - bottom, 2 - left, 3 - top, 4 - right. |
toPlot |
logical vector; which ladders should be plotted? By default, ladders that are significantly overlapped by their neighbor and those that are majority outside the plotting region are removed. |
labelLadder |
logical; if |
add |
logical; if |
cex |
character expansion for ladder labels. |
lineCol |
the colors to be used for the ladder border. If the length of |
fillCol |
the colors to be used to fill the ladder. If the length of |
Details
This function is used in conjunction with rdiLadder to add a useful annotation to any plot containing RDI values.
Because RDI values vary according to the number of genes and size of the repertoires, they are not useful as numbers by themselves. Instead, it is useful to compare them with estimates of the true difference between the two repertoires. This function adds a series of density curves along one side of a standard plotting region, each one representing the most likely RDI values between two repertoires that vary by a set amount.
By default, not all density curves from the ladder
parameter are plotted.
Instead, the function intelligently chooses which ladders to plot based on the amount
of overlap between neighboring ladders. If a ladder is significantly overlapped by the
ladder below it, then the ladder will not be plotted. In addition, if the mean of a
ladder is outside the main plotting region, it will be dropped. In order to control this
behavior, you can directly specify which ladders are plotted using the toPlot
parameter.
Value
Invisibly returns the location of the ladder (if side 1 or 3, the y location; otherwise, the x location).
See Also
Calculate RDI dissimilarity matrix
Description
Wrapper function for calculating RDIs
Usage
rdi(genes, seqAnnot, params = NULL, ...)
Arguments
genes |
vector (or matrix) containing the gene calls for each sequence. If genes is a matrix, counts will be calculated for each column of 'genes', and the resulting count matrices will be concatenated. |
seqAnnot |
matrix containing repertoire annotations. Must be same length as 'genes'. |
params |
list; contains parameters to pass to child functions.
Should contain |
... |
other parameters to pass to calcVDJcounts and calcRDI. |
Details
This function is a wrapper for the two core functions of RDI,
calcVDJcounts and calcRDI. To control the function of
both calcVDJcounts
and calcRDI
, additional parameters can be specified
either directly in the RDI function call, or parameters for the individual functions
can be wrapped up into lists of parameters and passed into the params
parameter.
params
should be a list containing at least one of two parameter lists:
countParams
and distParams
, which
will be passed to calcVDJcounts
and calcRDI
, respectively. An example
analysis is included below.
Value
Dissimilarity structure, as calculated by dist. In addition to the standard attributes returned by dist, two additional attributes are defined as follows:
nseq | integer, the number of sequences used after subsampling the repertoires |
ngenes | integers, the number of genes in each column of "genes" that were included in at least one repertoire. |
Examples
#create genes
genes = sample(letters, 10000, replace=TRUE)
#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE),
visit = sample(c("V1","V2","V3"), 10000, replace=TRUE),
cellType = sample(c("B","T"), 10000, replace=TRUE)
)
#parameters
params = list(
countParams = list(
select = list(
visit = c("V1","V3"),
cellType = "B"
),
combine = list(
visit = "V[13]"
),
simplifyNames = FALSE
),
distParams = list(
constScale=FALSE
)
)
##calculate RDI
d = rdi(genes, seqAnnot, params)
##plot using hierarchical clustering
plot(hclust(d))
RDI Axis annotation function
Description
This function takes a RDI model, as generated by rdiModel, and adds an axis with annotations in the fold change space.
Usage
rdiAxis(model, side = 2, at = NULL, ...)
Arguments
model |
A model object, as generated by rdiModel. |
side |
The side the axis will be added to. (1 - bottom; 2 - left; 3 - top; 4 - right). Default is 2. |
at |
The points at which the tick marks are drawn. By default, tickmarks are placed at 'round' fold/percent change values using the "pretty" breakpoints function. This may not be ideal if log-RDI values are being plotted. |
... |
Additional parameters to pass to axis |
Details
This function is designed to replace the default axes generated by a plot
function. Instead of annotating the true RDI value, rdiAxis
will estimate
the "true difference" values at various points within the plotting region, and will
annotate the axis with those estimates.
It is worth noting that although the RDI value can range below rdiModel
's
estimate for "identical" repertoires, no negative true difference values will be
annotated, as these values do not make sense.
See Also
rdiModel, rdiLadder, plotRDIladder
Examples
#create genes
genes = sample(letters, 10000, replace=TRUE)
#create sequence annotations
seqAnnot = data.frame(donor = sample(1:10, 10000, replace=TRUE))
#calculate RDI
d = rdi(genes, seqAnnot)
##create a "baseVect" with the same probability as our features
##since we sampled uniformly, the base vector has equal probability
baseVect = rep(1/length(letters),length(letters))
##generate an RDI model
m = rdiModel(attr(d, "nseq"), baseVects=baseVect)
##convert RDI to lfc
td = convertRDI(d,models=m)$pred
par(mar=c(4,4,1,4),las=1,mgp=c(3,0.5,0))
plot(td,d, ylab="RDI", xlab="LFC")
##now add "true difference" axis annotation to the right side of the plot
rdiAxis(m, side=4)
RDI ladder
Description
Function for creating the RDI ladder for a specific number of sequences
Usage
rdiLadder(n, ngenes = NULL, baseVects = NULL, diffPoints = NULL,
units = c("lfc", "pct"), ...)
Arguments
n |
the repertoire size; alternatively, an rdiModel object as created by rdiModel. |
ngenes |
numeric vector indicating the number of genes in each chain. If baseVect is not provided, this parameter will be used to generate a base prevalence vector for each of the genes. |
baseVects |
A vector or list of vectors representing the total prevalence of each gene (for each chain) in the dataset. See rdiModel for details. |
diffPoints |
numeric vector; each value specifices either a log2 fold change or percent deviation value (depending on the 'units') at which the RDI ladder will be calculated. |
units |
String; either "lfc" or "pct", depending on what transform was used in the original RDI calculations. See Details. |
... |
Additional parameters to be passed to rdiModel |
Details
Because RDI values vary according to the number of genes and size of the repertoires, they are not useful as numbers by themselves. Instead, it is useful to compare them with estimates of the true difference between the two repertoires. This function uses the models generated by rdiModel to generate estimated RDI values corresponding to a set of pre-defined true distance (log-fold change or percent) values. This function is primarily meant to be used in conjunction with plotRDIladder in order to add a useful reference point for RDI values.
The units used for the RDI model should always match the units used to generate your RDI values. For more details on units, refer to the details of calcRDI
Value
A list of the same length as diffPoints, with each entry in the list containing the mean RDI value and standard deviation corresponding to a given true difference value.
See Also
plotRDIladder, rdiModel, rdiAxis
RDI Models
Description
Generate models equating RDI values to true differences in underlying prevalence values
Usage
rdiModel(n, ngenes = NULL, baseVects = NULL, nIter = 50, nSample = 20,
units = c("lfc", "pct"), constScale = TRUE)
Arguments
n |
the repertoire size. |
ngenes |
numeric vector indicating the number of genes in each chain.
If |
baseVects |
A vector or list of vectors representing the total prevalence of each gene (for each chain) in the dataset. Differential datasets will be created from alterations of this vector. If not provided, a base vector will be randomly . generated at each subsample step containing the number of genes specified by ngenes. |
nIter |
The number of iterations (i.e. number of datasets to generate). |
nSample |
The number of samples to generate for each subsample. Each sample will have a different true fold change, but the same starting vector |
units |
String; either "lfc" or "pct", depending on what transform was used in the original RDI calculations. See Details. |
constScale |
logical; if |
Details
This method uses simulated sequencing datasets to estimate the RDI values for datasets with a known true deviation.
Briefly, a baseline probability vector (either randomly generated or supplied by the
baseVects
parameter) is randomly perturbed, and the difference between the
baseline vector and the perturbed vector is calculated. Then, nSample
sequencing
datasets of size n are randomly drawn from both the baseline vector and the perturbed
vector, and the RDI distance between all datasets calculated. This process is repeated
nIter
times, resulting in a dataset of RDI values and matched true differences.
A set of spline models is then fit to the data: one from RDI to true difference, and
another from true difference to RDI value, allowing for bi-directional conversions.
If a baseline probability vector is not provided, one will be generated from an empirical model of gene segment prevalence. However, for best performance, this is not recommended. Estimates of true fold change is very sensitive to the distribution of features in your count dataset, and it is important that your baseline vector match your overall dataset as accurately as possible. The best baseline vector is almost always the average feature prevalence across all repertoires in a dataset, although manually generated baseline vectors may also work well.
The units used for the RDI model should always match the units used to generate your RDI values. For more details on units, refer to the details of calcRDI.
Value
A list containing three objects:
fit | an object of class "smooth.spline" , based on a spline
model with the true difference (lfc or pct) as the independent (x)
and RDI as the dependent (y). Used for converting from true difference
to RDI. |
rev.fit | an object of class "smooth.spline" .
The opposite of fit. Used for converting from RDI to true difference.
|
units | one of c("lfc","pct") , representing the units of the true
difference values.
|
See Also
rdiAxis, rdiLadder, plotRDIladder
Examples
#create genes
genes = sample(letters, 10000, replace=TRUE)
#create sequence annotations
seqAnnot = data.frame(donor = sample(1:4, 10000, replace=TRUE))
#calculate RDI
d = rdi(genes, seqAnnot)
##create a "baseVect" with the same probability as our features
##since we sampled uniformly, the base vector has equal probability
baseVect = rep(1/length(letters),length(letters))
##generate an RDI model
m = rdiModel(attr(d, "nseq"), baseVects=baseVect)
##plot the spline model
plot(m$fit, xlab="log fold change",ylab="RDI",type='l')
##convert RDI to log fold change
mean = predict(m$rev.fit, d)$y
mean[mean<0] = 0