Type: | Package |
Version: | 1.1.0 |
Date: | 2023-10-10 |
Title: | Infers Novel Immunoglobulin Alleles from Sequencing Data |
Description: | Infers the V genotype of an individual from immunoglobulin (Ig) repertoire sequencing data (AIRR-Seq, Rep-Seq). Includes detection of any novel alleles. This information is then used to correct existing V allele calls from among the sample sequences. Citations: Gadala-Maria, et al (2015) <doi:10.1073/pnas.1417683112>, Gadala-Maria, et al (2019) <doi:10.3389/fimmu.2019.00129>. |
License: | AGPL-3 |
URL: | http://tigger.readthedocs.io |
BugReports: | https://bitbucket.org/kleinstein/tigger/issues |
LazyData: | true |
BuildVignettes: | true |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
Depends: | R (≥ 4.0), ggplot2 (≥ 3.4.0) |
Imports: | alakazam (≥ 1.3.0), dplyr (≥ 1.0.0), doParallel, foreach, graphics, gridExtra, gtools, iterators, lazyeval, parallel, rlang, stats, stringi, tidyr (≥ 1.1.0), utils |
Suggests: | knitr, rmarkdown, testthat |
RoxygenNote: | 7.2.3 |
NeedsCompilation: | no |
Packaged: | 2023-10-10 12:54:05 UTC; susanna |
Author: | Daniel Gadala-Maria [aut], Susanna Marquez [aut, cre], Moriah Cohen [aut], Jason Vander Heiden [aut], Gur Yaari [aut], Steven Kleinstein [aut, cph] |
Maintainer: | Susanna Marquez <susanna.marquez@yale.edu> |
Repository: | CRAN |
Date/Publication: | 2023-10-10 13:40:02 UTC |
tigger: Infers Novel Immunoglobulin Alleles from Sequencing Data
Description
Infers the V genotype of an individual from immunoglobulin (Ig) repertoire sequencing data (AIRR-Seq, Rep-Seq). Includes detection of any novel alleles. This information is then used to correct existing V allele calls from among the sample sequences. Citations: Gadala-Maria, et al (2015) doi:10.1073/pnas.1417683112, Gadala-Maria, et al (2019) doi:10.3389/fimmu.2019.00129.
Author(s)
Maintainer: Susanna Marquez susanna.marquez@yale.edu
Authors:
Daniel Gadala-Maria daniel.gadala-maria@yale.edu
Moriah Cohen moriah.cohen@biu.ac.il
Jason Vander Heiden jason.vanderheiden@gmail.com
Gur Yaari gur.yaari@biu.ac.il
Steven Kleinstein steven.kleinstein@yale.edu [copyright holder]
See Also
Useful links:
- http://tigger.readthedocs.io
- Report bugs at https://bitbucket.org/kleinstein/tigger/issues
Example human immune repertoire data
Description
A data.frame of example V(D)J immunoglobulin sequences derived from a
single individual (PGP1), sequenced on the Roche 454 platform, and assigned by
IMGT/HighV-QUEST to IGHV1 family alleles.
Format
A data.frame where rows correspond to unique V(D)J sequences and
columns include:
- "sequence_alignment": IMGT-gapped V(D)J nucleotide sequence.
- "v_call": IMGT/HighV-QUEST V segment allele calls.
- "d_call": IMGT/HighV-QUEST D segment allele calls.
- "j_call": IMGT/HighV-QUEST J segment allele calls.
- "junction_length": Junction region length.
References
Gadala-Maria, et al. (2015) Automated analysis of high-throughput B cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. PNAS. 112(8):E862-70.
See Also
See SampleDb for a Change-O formatted version of AIRRDb.
Human IGHV germlines
Description
A character vector of all human IGHV germline gene segment alleles
in IMGT/GENE-DB (2019-06-01, 372 alleles).
See IMGT data updates: https://www.imgt.org/IMGTgenedbdoc/dataupdates.html.
Format
Values correspond to IMGT-gapped nucleotide sequences (with nucleotides capitalized and gaps represented by ".") while names correspond to stripped-down IMGT allele names (e.g. "IGHV1-18*01").
References
Xochelli, et al. (2014) Immunoglobulin heavy variable (IGHV) genes and alleles: new entities, new names and implications for research and prognostication in chronic lymphocytic leukaemia. Immunogenetics. 67(1):61-6.
Example human immune repertoire data
Description
A data.frame of example V(D)J immunoglobulin sequences derived from a
single individual (PGP1), sequenced on the Roche 454 platform, and assigned by
IMGT/HighV-QUEST to IGHV1 family alleles.
Format
A data.frame where rows correspond to unique V(D)J sequences and
columns include:
- "SEQUENCE_IMGT": IMGT-gapped V(D)J nucleotide sequence.
- "V_CALL": IMGT/HighV-QUEST V segment allele calls.
- "D_CALL": IMGT/HighV-QUEST D segment allele calls.
- "J_CALL": IMGT/HighV-QUEST J segment allele calls.
- "JUNCTION_LENGTH": Junction region length.
References
Gadala-Maria, et al. (2015) Automated analysis of high-throughput B cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. PNAS. 112(8):E862-70.
See Also
See AIRRDb for an AIRR formatted version of SampleDb.
Example genotype inference results
Description
A data.frame of genotype inference results from inferGenotype
after novel allele detection via findNovelAlleles.
Source data was a collection of V(D)J immunoglobulin sequences derived from a single
individual (PGP1), sequenced on the Roche 454 platform, and assigned by
IMGT/HighV-QUEST to IGHV1 family alleles.
Format
A data.frame where rows correspond to genes carried by an
individual and columns list the alleles of those genes and their counts.
References
Gadala-Maria, et al. (2015) Automated analysis of high-throughput B cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. PNAS. 112(8):E862-70.
See Also
See inferGenotype for detailed column descriptions.
Example Human IGHV germlines
Description
A character vector of all 344 human IGHV germline gene segment alleles
in IMGT/GENE-DB release 2014-08-4.
Format
Values correspond to IMGT-gapped nucleotide sequences (with nucleotides capitalized and gaps represented by ".") while names correspond to stripped-down IMGT allele names (e.g. "IGHV1-18*01").
References
Xochelli, et al. (2014) Immunoglobulin heavy variable (IGHV) genes and alleles: new entities, new names and implications for research and prognostication in chronic lymphocytic leukaemia. Immunogenetics. 67(1):61-6.
Example novel allele detection results
Description
A data.frame of novel allele detection results from findNovelAlleles.
Source data was a collection of V(D)J immunoglobulin sequences derived from a single
individual (PGP1), sequenced on the Roche 454 platform, and assigned by
IMGT/HighV-QUEST to IGHV1 family alleles.
Format
A data.frame where rows correspond to alleles checked for
polymorphisms and columns give results as well as parameters used to run
the test.
References
Gadala-Maria, et al. (2015) Automated analysis of high-throughput B cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. PNAS. 112(8):E862-70.
See Also
See findNovelAlleles for detailed column descriptions.
Clean up nucleotide sequences
Description
cleanSeqs capitalizes nucleotides and replaces all characters
besides c("A", "C", "G", "T", "-", ".") with "N".
Usage
cleanSeqs(seqs)
Arguments
seqs |
vector of nucleotide sequences. |
Value
A modified vector of nucleotide sequences.
See Also
sortAlleles and updateAlleleNames can help format a list of allele names.
Examples
# Clean messy nucleotide sequences
seqs <- c("AGAT.taa-GAG...ATA", "GATACAGTXXZZAGNNPPACA")
cleanSeqs(seqs)
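The cleaning rule described for cleanSeqs can be approximated in base R. This is a minimal sketch of the stated behavior, not the package's implementation; the function name clean_seqs_sketch is hypothetical.

```r
# Sketch only: uppercase everything, then replace any character that is
# not A, C, G, T, "-" or "." with "N".
clean_seqs_sketch <- function(seqs) {
  seqs <- toupper(seqs)
  gsub("[^ACGT.-]", "N", seqs)
}

clean_seqs_sketch(c("AGAT.taa-GAG...ATA", "GATACAGTXXZZAGNNPPACA"))
# "AGAT.TAA-GAG...ATA" "GATACAGTNNNNAGNNNNACA"
```

Gap characters are preserved, so the length of each sequence is unchanged.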
Find novel alleles from repertoire sequencing data
Description
findNovelAlleles analyzes mutation patterns in sequences thought to
align to each germline allele in order to determine which positions
might be polymorphic.
Usage
findNovelAlleles(
data,
germline_db,
v_call = "v_call",
j_call = "j_call",
seq = "sequence_alignment",
junction = "junction",
junction_length = "junction_length",
germline_min = 200,
min_seqs = 50,
auto_mutrange = TRUE,
mut_range = 1:10,
pos_range = 1:312,
pos_range_max = NULL,
y_intercept = 0.125,
alpha = 0.05,
j_max = 0.15,
min_frac = 0.75,
nproc = 1
)
Arguments
data |
|
germline_db |
vector of named nucleotide germline sequences
matching the V calls in |
v_call |
name of the column in |
j_call |
name of the column in |
seq |
name of the column in |
junction |
Junction region nucleotide sequence, which includes
the CDR3 and the two flanking conserved codons. Default
is |
junction_length |
Number of junction nucleotides in the junction sequence.
Default is |
germline_min |
the minimum number of sequences that must have a particular germline allele call for the allele to be analyzed. |
min_seqs |
minimum number of total sequences (within the desired mutational range and nucleotide range) required for the samples to be considered. |
auto_mutrange |
if |
mut_range |
range of mutations that samples may carry and be considered by the algorithm. |
pos_range |
range of IMGT-numbered positions that should be considered by the algorithm. |
pos_range_max |
Name of the column in |
y_intercept |
y-intercept threshold above which positions should be considered potentially polymorphic. |
alpha |
alpha value used for determining whether the
fit y-intercept is greater than the |
j_max |
maximum fraction of sequences perfectly aligning to a potential novel allele that are allowed to utilize a particular combination of junction length and J gene. The closer to 1, the less strict the filter for junction length and J gene diversity will be. |
min_frac |
minimum fraction of sequences that must have usable nucleotides in a given position for that position to be considered. |
nproc |
number of processors to use. |
Details
The TIgGER allele-finding algorithm, briefly, works as follows: Mutations are determined through comparison to the provided germline. Mutation frequency at each *position* is determined as a function of *sequence-wide* mutation counts. Polymorphic positions exhibit a high mutation frequency regardless of the sequence-wide mutation count. False positives among potential novel alleles, which can result from clonally related sequences, are guarded against by ensuring that sequences perfectly matching the potential novel allele utilize a wide range of combinations of J gene and junction length.
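The position-wise idea in the paragraph above can be illustrated with a toy base R sketch. It is not the package's implementation: the germline, the sample sequences, and all variable names are made up, and the significance test controlled by alpha is omitted. The sketch only fits a line of mutation frequency against sequence-wide mutation count for each position and compares the y-intercept to a threshold.

```r
# Toy data (hypothetical): position 3 carries a G in every sample
germline <- "AAAAAAAAAA"
samples <- c("AAGAAAAAAA", "AAGAAAAAAA",   # 1 mismatch each
             "AAGAAAACAA", "AAGACAAAAA",   # 2 mismatches each
             "AAGACAAACA", "ATGAAAACAA")   # 3 mismatches each

germ_chars <- strsplit(germline, "")[[1]]
# Logical matrix: one row per sequence, one column per position, TRUE = mismatch
mismatch <- t(vapply(strsplit(samples, ""),
                     function(s) s != germ_chars,
                     logical(length(germ_chars))))
mut_count <- rowSums(mismatch)

# Mutation frequency at each position for each sequence-wide mutation count
counts <- sort(unique(mut_count))
freq <- t(vapply(counts,
                 function(k) colMeans(mismatch[mut_count == k, , drop = FALSE]),
                 numeric(length(germ_chars))))

# Fit frequency ~ mutation count per position and keep the y-intercept
intercepts <- vapply(seq_along(germ_chars),
                     function(p) unname(coef(lm(freq[, p] ~ counts))[1]),
                     numeric(1))

y_intercept <- 0.125
which(intercepts > y_intercept)
# 3
```

Position 3 is mismatched in every sequence regardless of how many other mutations the sequence carries, so its fitted intercept is 1, while positions hit only by sporadic mutations have intercepts at or below zero in this toy data.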
Value
A data.frame with a row for each known allele analyzed.
Besides metadata on the parameters used in the search, each row will have
either a note as to where the polymorphism-finding algorithm exited or a
nucleotide sequence for the predicted novel allele, along with columns providing
additional evidence.
The output contains the following columns:
- germline_call: The input (uncorrected) V call.
- note: Comments regarding the inference.
- polymorphism_call: The novel allele call.
- nt_substitutions: Mutations identified in the novel allele, relative to the reference germline (germline_call).
- novel_imgt: The novel allele sequence.
- novel_imgt_count: The number of times the sequence novel_imgt is found in the input data. Considers the subsequence of novel_imgt in the pos_range.
- novel_imgt_unique_j: Number of distinct J calls associated with novel_imgt in the input data. Considers the subsequence of novel_imgt in the pos_range.
- novel_imgt_unique_cdr3: Number of distinct CDR3 sequences associated with novel_imgt in the input data. Considers the subsequence of novel_imgt in the pos_range.
- perfect_match_count: Final number of sequences retained to call the new allele. These are unique sequences that have V segments that perfectly match the predicted germline in the pos_range.
- perfect_match_freq: perfect_match_count / germline_call_count
- germline_call_count: The number of sequences with the germline_call in the input data that were initially considered for the analysis.
- germline_call_freq: The fraction of sequences with the germline_call in the input data initially considered for the analysis.
- germline_imgt: Germline sequence for germline_call.
- germline_imgt_count: The number of times the germline_imgt sequence is found in the input data.
- mut_min: Minimum mutation count considered by the algorithm.
- mut_max: Maximum mutation count considered by the algorithm.
- mut_pass_count: Number of sequences in the mutation range.
- pos_min: First position of the sequence considered by the algorithm (IMGT numbering).
- pos_max: Last position of the sequence considered by the algorithm (IMGT numbering).
- y_intercept: The y-intercept above which positions were considered potentially polymorphic.
- y_intercept_pass: Number of positions that pass the y_intercept threshold.
- snp_pass: Number of sequences that pass the y_intercept threshold and are within the desired nucleotide range (min_seqs).
- unmutated_count: Number of unmutated sequences.
- unmutated_freq: Number of unmutated sequences over germline_imgt_count.
- unmutated_snp_j_gene_length_count: Number of distinct combinations of SNP, J gene, and junction length.
- snp_min_seqs_j_max_pass: Number of SNPs that pass both the min_seqs and j_max thresholds.
- alpha: Significance threshold to be used when constructing the confidence interval for the y-intercept.
- min_seqs: Input min_seqs. The minimum number of total sequences (within the desired mutational range and nucleotide range) required for the samples to be considered.
- j_max: Input j_max. The maximum fraction of sequences perfectly aligning to a potential novel allele that are allowed to utilize a particular combination of junction length and J gene.
- min_frac: Input min_frac. The minimum fraction of sequences that must have usable nucleotides in a given position for that position to be considered.
The following comments can appear in the note column:
- Novel allele found: A novel allele was detected.
- Same as: The same novel allele sequence has been identified multiple times. If this happens, the function will also throw the message 'Duplicated polymorphism(s) found'.
- Plurality sequence too rare: No sequence is frequent enough to pass the J test (j_max).
- A J-junction combination is too prevalent: Not enough J diversity (j_max).
- No positions pass y-intercept test: No positions above y_intercept.
- Insufficient sequences in desired mutational range: mut_range and pos_range.
- Not enough sequences: Not enough sequences in the desired mutational range and nucleotide range (min_seqs).
- No unmutated versions of novel allele found: All observed variants of the allele are mutated.
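The J diversity check behind the j_max notes can be pictured as follows. This base R sketch is an assumption about the general idea, not the package's code: it computes the largest fraction of perfectly matching sequences that share one combination of J gene and junction length and compares it to j_max. The data and the helper name max_combo_fraction are hypothetical.

```r
# Hypothetical J calls and junction lengths for sequences that perfectly
# match a candidate novel allele
j_call <- c("IGHJ4*02", "IGHJ4*02", "IGHJ6*02", "IGHJ3*02", "IGHJ4*02", "IGHJ5*02")
junction_length <- c(48, 48, 60, 45, 51, 48)

# Largest share of sequences using a single (J gene, junction length) pair
max_combo_fraction <- function(j_call, junction_length) {
  combos <- table(paste(j_call, junction_length, sep = "|"))
  max(combos) / sum(combos)
}

frac <- max_combo_fraction(j_call, junction_length)
j_max <- 0.15
frac <= j_max
# FALSE: one combination accounts for 2 of 6 sequences, above j_max
```

With so few sequences the check fails easily; larger and more diverse sets of perfectly matching sequences are needed before a single combination drops below the default threshold.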
See Also
selectNovel to filter the results to show only novel alleles. plotNovel to visualize the data supporting any novel alleles hypothesized to be present in the data, and inferGenotype and inferGenotypeBayesian to determine if the novel alleles are frequent enough to be included in the subject's genotype.
Examples
# Note: In this example, with SampleGermlineIGHV,
# which contains reference germlines retrieved in August 2014,
# TIgGER finds the allele IGHV1-8*02_G234T. This allele
# was added to IMGT as IGHV1-8*03 on March 28, 2018.
# Find novel alleles and return relevant data
novel <- findNovelAlleles(AIRRDb, SampleGermlineIGHV)
selectNovel(novel)
Determine which calls represent an unmutated allele
Description
findUnmutatedCalls determines which allele calls would represent a
perfect match with the germline sequence, given a vector of allele calls and
mutation counts. In the case of multiple alleles being assigned to a
sequence, only the subset that would represent a perfect match is returned.
Usage
findUnmutatedCalls(allele_calls, sample_seqs, germline_db)
Arguments
allele_calls |
vector of strings representing Ig allele calls, where multiple calls are separated by a comma. |
sample_seqs |
V(D)J-rearranged sample sequences matching the order
of the given |
germline_db |
vector of named nucleotide germline sequences |
Value
A vector of strings containing the members of allele_calls that represent unmutated sequences.
Examples
# Find which of the sample alleles are unmutated
calls <- findUnmutatedCalls(AIRRDb$v_call, AIRRDb$sequence_alignment,
germline_db=SampleGermlineIGHV)
Generate evidence
Description
generateEvidence builds a table of evidence metrics for the final novel V
allele detection and genotyping inferences.
Usage
generateEvidence(
data,
novel,
genotype,
genotype_db,
germline_db,
j_call = "j_call",
junction = "junction",
fields = NULL
)
Arguments
data |
a |
novel |
the |
genotype |
the |
genotype_db |
a vector of named nucleotide germline sequences in the genotype. Returned by genotypeFasta. |
germline_db |
the original uncorrected germline database used by findNovelAlleles to identify novel alleles. |
j_call |
name of the column in |
junction |
Junction region nucleotide sequence, which includes
the CDR3 and the two flanking conserved codons. Default
is |
fields |
character vector of column names used to split the data to
identify novel alleles, if any. If |
Value
Returns the genotype input data.frame with the following additional columns
providing supporting evidence for each inferred allele:
- field_id: Data subset identifier, defined with the input parameter fields. A variable number of columns, specified with the input parameter fields.
- polymorphism_call: The novel allele call.
- novel_imgt: The novel allele sequence.
- closest_reference: The closest reference gene and allele in the germline_db database.
- closest_reference_imgt: Sequence of the closest reference gene and allele in the germline_db database.
- germline_call: The input (uncorrected) V call.
- germline_imgt: Germline sequence for germline_call.
- nt_diff: Number of nucleotides that differ between the new allele and the closest reference (closest_reference) in the germline_db database.
- nt_substitutions: A comma separated list of specific nucleotide differences (e.g. 112G>A) in the novel allele.
- aa_diff: Number of amino acids that differ between the new allele and the closest reference (closest_reference) in the germline_db database.
- aa_substitutions: A comma separated list with specific amino acid differences (e.g. 96A>N) in the novel allele.
- sequences: Number of sequences unambiguously assigned to this allele.
- unmutated_sequences: Number of records with the unmutated novel allele sequence.
- unmutated_frequency: Proportion of records with the unmutated novel allele sequence (unmutated_sequences / sequences).
- allelic_percentage: Percentage at which the (unmutated) allele is observed in the sequence dataset compared to other (unmutated) alleles.
- unique_js: Number of unique J sequences found associated with the novel allele. Only sequences that have been unambiguously assigned to the novel allele (polymorphism_call) are counted.
- unique_cdr3s: Number of unique CDR3s associated with the inferred allele. Only sequences that have been unambiguously assigned to the novel allele (polymorphism_call) are counted.
- mut_min: Minimum mutation count considered by the algorithm.
- mut_max: Maximum mutation count considered by the algorithm.
- pos_min: First position of the sequence considered by the algorithm (IMGT numbering).
- pos_max: Last position of the sequence considered by the algorithm (IMGT numbering).
- y_intercept: The y-intercept above which positions were considered potentially polymorphic.
- alpha: Significance threshold to be used when constructing the confidence interval for the y-intercept.
- min_seqs: Input min_seqs. The minimum number of total sequences (within the desired mutational range and nucleotide range) required for the samples to be considered.
- j_max: Input j_max. The maximum fraction of sequences perfectly aligning to a potential novel allele that are allowed to utilize a particular combination of junction length and J gene.
- min_frac: Input min_frac. The minimum fraction of sequences that must have usable nucleotides in a given position for that position to be considered.
- note: Comments regarding the novel allele inference.
See Also
See findNovelAlleles, inferGenotype and genotypeFasta for generating the required input.
Examples
# Generate input data
novel <- findNovelAlleles(AIRRDb, SampleGermlineIGHV,
v_call="v_call", j_call="j_call", junction="junction",
junction_length="junction_length", seq="sequence_alignment")
genotype <- inferGenotype(AIRRDb, find_unmutated=TRUE,
germline_db=SampleGermlineIGHV,
novel=novel,
v_call="v_call", seq="sequence_alignment")
genotype_db <- genotypeFasta(genotype, SampleGermlineIGHV, novel)
data_db <- reassignAlleles(AIRRDb, genotype_db,
v_call="v_call", seq="sequence_alignment")
# Assemble evidence table
evidence <- generateEvidence(data_db, novel, genotype,
genotype_db, SampleGermlineIGHV,
j_call = "j_call",
junction = "junction")
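The nt_diff and nt_substitutions columns summarize how a novel allele differs from its closest reference. The base R sketch below shows one way such a summary could be computed for two aligned sequences of equal length. It assumes the notation reads position, reference nucleotide, ">", novel nucleotide; the helper name and the toy sequences are hypothetical.

```r
# Sketch only: list differences between two aligned, equal-length sequences
# in a "112G>A"-style notation (position, reference base, novel base).
nt_substitutions_sketch <- function(reference, novel) {
  ref <- strsplit(reference, "")[[1]]
  nov <- strsplit(novel, "")[[1]]
  pos <- which(ref != nov)
  if (length(pos) == 0) return("")
  paste0(pos, ref[pos], ">", nov[pos], collapse = ",")
}

nt_substitutions_sketch("ACGTAC", "ACATAT")
# "3G>A,6C>T"
```

The number of comma separated entries corresponds to the nt_diff count for the same pair of sequences.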
Return the nucleotide sequences of a genotype
Description
genotypeFasta converts a genotype table into a vector of nucleotide
sequences.
Usage
genotypeFasta(genotype, germline_db, novel = NA)
Arguments
genotype |
|
germline_db |
vector of named nucleotide germline sequences
matching the alleles detailed in |
novel |
an optional |
Value
A named vector of strings containing the germline nucleotide sequences of the alleles in the provided genotype.
See Also
Examples
# Find the sequences that correspond to the genotype
genotype_db <- genotypeFasta(SampleGenotype, SampleGermlineIGHV, SampleNovel)
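Conceptually, converting a genotype table into sequences is a lookup of allele names in the germline database. The base R sketch below illustrates that idea under stated assumptions: the genotype has a gene column and a comma separated alleles column (as described for inferGenotype), full allele names are formed as gene, "*", allele, and the toy sequences are made up. Novel alleles are not handled here.

```r
# Hypothetical genotype table and germline database
genotype <- data.frame(gene = c("IGHV1-2", "IGHV1-3"),
                       alleles = c("02,04", "01"),
                       stringsAsFactors = FALSE)
germline_db <- c("IGHV1-2*02" = "ACGT", "IGHV1-2*04" = "ACGA",
                 "IGHV1-3*01" = "TTGA", "IGHV1-3*02" = "TTGC")

# Expand each row into full allele names, e.g. "IGHV1-2*02"
allele_names <- unlist(lapply(seq_len(nrow(genotype)), function(i) {
  paste0(genotype$gene[i], "*", strsplit(genotype$alleles[i], ",")[[1]])
}))

# Named vector of the sequences carried in the genotype
genotype_seqs <- germline_db[allele_names]
names(genotype_seqs)
# "IGHV1-2*02" "IGHV1-2*04" "IGHV1-3*01"
```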
Determine the mutation counts from allele calls
Description
getMutCount takes a set of nucleotide sequences and their allele calls
and determines the distance between each sequence and any germline alleles
contained within its call.
Usage
getMutCount(samples, allele_calls, germline_db)
Arguments
samples |
vector of IMGT-gapped sample V sequences |
allele_calls |
vector of strings representing Ig allele calls for
the sequences in |
germline_db |
vector of named nucleotide germline sequences
matching the calls detailed in |
Value
A list equal in length to samples, containing the Hamming
distance to each germline allele contained within each call within
each element of samples.
Examples
# Insert a mutation into a germline sequence
s2 <- s3 <- SampleGermlineIGHV[1]
stringi::stri_sub(s2, 103, 103) <- "G"
stringi::stri_sub(s3, 107, 107) <- "C"
sample_seqs <- c(SampleGermlineIGHV[2], s2, s3)
# Pretend that one sample sequence has received an ambiguous allele call
sample_alleles <- c(paste(names(SampleGermlineIGHV[1:2]), collapse=","),
names(SampleGermlineIGHV[2]),
names(SampleGermlineIGHV[1]))
# Compare each sequence to its assigned germline(s) to determine the distance
getMutCount(sample_seqs, sample_alleles, SampleGermlineIGHV)
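The distance computation described for getMutCount can be sketched in base R without the package. This is an illustration only: it assumes all sequences are aligned and of equal length, it does not skip gap or N characters, and the germline names, sequences, and function names are hypothetical.

```r
# Hypothetical germline database with two alleles of equal length
germline_db <- c("IGHV1-2*01" = "ACGTACGT", "IGHV1-2*02" = "ACGTACGA")

# Hamming distance between two equal-length strings
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# For each sample, the distance to every allele named in its call
mut_count_sketch <- function(samples, allele_calls, germline_db) {
  mapply(function(s, call) {
    alleles <- strsplit(call, ",")[[1]]
    vapply(alleles, function(al) hamming(s, germline_db[[al]]), integer(1))
  }, samples, allele_calls, SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

res <- mut_count_sketch(c("ACGTACGT", "ACGAACGA"),
                        c("IGHV1-2*01,IGHV1-2*02", "IGHV1-2*02"),
                        germline_db)
res[[1]]  # distance 0 to IGHV1-2*01 and 1 to IGHV1-2*02
res[[2]]  # distance 1 to IGHV1-2*02
```

An ambiguous call yields one distance per listed allele, which is what makes it possible to keep only the alleles that match perfectly.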
Find the location of mutations in a sequence
Description
getMutatedPositions takes two vectors of aligned sequences and
compares pairs of sequences. It returns a list of the nucleotide positions of
any differences.
Usage
getMutatedPositions(
samples,
germlines,
ignored_regex = "[\\.N-]",
match_instead = FALSE
)
Arguments
samples |
vector of strings representing aligned sequences |
germlines |
vector of strings representing aligned sequences
to which |
ignored_regex |
regular expression indicating what characters should be ignored (such as gaps and N nucleotides). |
match_instead |
if |
Value
A list of the nucleotide positions of any differences between the input vectors.
Examples
# Create strings to act as sample sequences and a reference sequence
seqs <- c("----GATA", "GAGAGAGA", "TANA")
ref <- "GATAGATA"
# Find the differences between the two
getMutatedPositions(seqs, ref)
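A pairwise comparison of this kind can be sketched in base R as shown below. The sketch makes its own assumptions, which may differ from the package: sequences are compared only over their shared length, a position is skipped when either sequence has an ignored character there, and a single reference is recycled across all samples. The function name is hypothetical.

```r
# Sketch only: positions where sample and germline differ, skipping
# positions where either sequence has a gap or N.
mutated_positions_sketch <- function(samples, germlines,
                                     ignored_regex = "[\\.N-]") {
  germlines <- rep_len(germlines, length(samples))
  mapply(function(s, g) {
    n <- min(nchar(s), nchar(g))
    s_chars <- strsplit(substr(s, 1, n), "")[[1]]
    g_chars <- strsplit(substr(g, 1, n), "")[[1]]
    ignored <- grepl(ignored_regex, s_chars, perl = TRUE) |
      grepl(ignored_regex, g_chars, perl = TRUE)
    which(s_chars != g_chars & !ignored)
  }, samples, germlines, SIMPLIFY = FALSE, USE.NAMES = FALSE)
}

res <- mutated_positions_sketch(c("----GATA", "GAGAGAGA", "TANA"), "GATAGATA")
res[[2]]
# 3 7
```

In this sketch the first sample has no differences once its leading gaps are ignored, and the third sample differs from the reference only at position 1.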
Find mutation counts for frequent sequences
Description
getPopularMutationCount determines which sequences occur frequently
for each V gene and returns the mutation count of those sequences.
Usage
getPopularMutationCount(
data,
germline_db,
v_call = "v_call",
seq = "sequence_alignment",
gene_min = 0.001,
seq_min = 50,
seq_p_of_max = 1/8,
full_return = FALSE
)
Arguments
data |
|
germline_db |
named list of IMGT-gapped germline sequences. |
v_call |
name of the column in |
seq |
name of the column in |
gene_min |
portion of all unique sequences a gene must constitute to avoid exclusion. |
seq_min |
number of copies of the V that must be present to avoid exclusion. |
seq_p_of_max |
for each gene, the fraction of the most common V sequence count that a sequence must meet to avoid exclusion. |
full_return |
if |
Value
A data frame of genes that have a frequent sequence mutation count above 1.
See Also
getMutatedPositions can be used to find which positions of a set of sequences are mutated.
Examples
getPopularMutationCount(AIRRDb, SampleGermlineIGHV)
Infer a subject-specific genotype using a frequency method
Description
inferGenotype infers a subject's genotype using a frequency method.
The genotype is inferred by finding the minimum set of alleles that
can explain the majority of each gene's calls. The most common allele of
each gene is included in the genotype first, and the next most common allele
is added until the desired fraction of allele calls can be explained. In this
way, mistaken allele calls (resulting from sequences which
by chance have been mutated to look like another allele) can be removed.
Usage
inferGenotype(
data,
germline_db = NA,
novel = NA,
v_call = "v_call",
seq = "sequence_alignment",
fraction_to_explain = 0.875,
gene_cutoff = 1e-04,
find_unmutated = TRUE
)
Arguments
data |
|
germline_db |
named vector of sequences containing the
germline sequences named in
|
novel |
optional |
v_call |
column in |
seq |
name of the column in |
fraction_to_explain |
the portion of each gene that must be explained by the alleles that will be included in the genotype. |
gene_cutoff |
either a number of sequences or a fraction of
the length of |
find_unmutated |
if |
Details
Allele calls representing cases where multiple alleles have been
assigned to a single sample sequence are rare among unmutated
sequences but may result if nucleotides for certain positions are
not available. Calls containing multiple alleles are treated as
belonging to all groups. If novel is provided, all
sequences that are assigned to the same starting allele as any
novel germline allele will have the novel germline allele appended
to their assignment prior to searching for unmutated sequences.
Value
A data.frame of alleles denoting the genotype of the subject containing
the following columns:
- gene: The gene name without allele.
- alleles: Comma separated list of alleles for the given gene.
- counts: Comma separated list of observed sequences for each corresponding allele in the alleles list.
- total: The total count of observed sequences for the given gene.
- note: Any comments on the inference.
Note
This method works best with data derived from blood, where a large portion of sequences are expected to be unmutated. Ideally, there should be hundreds of allele calls per gene in the input.
See Also
plotGenotype for a colorful visualization and genotypeFasta to convert the genotype to nucleotide sequences. See inferGenotypeBayesian to infer a subject-specific genotype using a Bayesian approach.
Examples
# Infer IGHV genotype, using only unmutated sequences, including novel alleles
inferGenotype(AIRRDb, germline_db=SampleGermlineIGHV, novel=SampleNovel,
find_unmutated=TRUE)
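The frequency method described for inferGenotype can be illustrated for a single gene with a few lines of base R. This is a simplified sketch with made-up counts and a hypothetical helper name; it ignores ambiguous calls, gene_cutoff, and the search for unmutated sequences.

```r
# Sketch only: add alleles from most to least common until the chosen
# fraction of a gene's calls is explained.
pick_alleles <- function(allele_counts, fraction_to_explain = 0.875) {
  sorted <- sort(allele_counts, decreasing = TRUE)
  cum_frac <- cumsum(sorted) / sum(sorted)
  n_keep <- which(cum_frac >= fraction_to_explain)[1]
  sorted[seq_len(n_keep)]
}

# Hypothetical call counts for three alleles of one gene
counts <- c("01" = 60, "02" = 35, "03" = 5)
names(pick_alleles(counts))
# "01" "02"
```

Allele 01 alone explains 60% of the calls, which is below 0.875, so allele 02 is added to reach 95%. Allele 03 is left out as a likely miscall.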
Infer a subject-specific genotype using a Bayesian approach
Description
inferGenotypeBayesian infers a subject's genotype by applying a Bayesian framework
with a Dirichlet prior for the multinomial distribution. Up to four distinct alleles are
allowed in an individual’s genotype. Four likelihood distributions were generated by
empirically fitting three high coverage genotypes from three individuals
(Laserson and Vigneault et al, 2014). A posterior probability is calculated for the
four most common alleles. The certainty of the highest probability model is
calculated using a Bayes factor (the likelihood of the most likely model divided by that of the second-most likely model).
The larger the Bayes factor (K), the greater the certainty in the model.
Usage
inferGenotypeBayesian(
data,
germline_db = NA,
novel = NA,
v_call = "v_call",
seq = "sequence_alignment",
find_unmutated = TRUE,
priors = c(0.6, 0.4, 0.4, 0.35, 0.25, 0.25, 0.25, 0.25, 0.25)
)
Arguments
data |
a |
germline_db |
named vector of sequences containing the
germline sequences named in |
novel |
an optional |
v_call |
column in |
seq |
name of the column in |
find_unmutated |
if |
priors |
a numeric vector of priors for the multinomial distribution.
The |
Details
Allele calls representing cases where multiple alleles have been
assigned to a single sample sequence are rare among unmutated
sequences but may result if nucleotides for certain positions are
not available. Calls containing multiple alleles are treated as
belonging to all groups. If novel is provided, all
sequences that are assigned to the same starting allele as any
novel germline allele will have the novel germline allele appended
to their assignment prior to searching for unmutated sequences.
Value
A data.frame of alleles denoting the genotype of the subject with the log10
of the likelihood of each model and the log10 of the Bayes factor. The output
contains the following columns:
- gene: The gene name without allele.
- alleles: Comma separated list of alleles for the given gene.
- counts: Comma separated list of observed sequences for each corresponding allele in the alleles list.
- total: The total count of observed sequences for the given gene.
- note: Any comments on the inference.
- kh: log10 likelihood that the gene is homozygous.
- kd: log10 likelihood that the gene is heterozygous.
- kt: log10 likelihood that the gene is trizygous.
- kq: log10 likelihood that the gene is quadrozygous.
- k_diff: log10 ratio of the highest to second-highest zygosity likelihoods.
Note
This method works best with data derived from blood, where a large portion of sequences are expected to be unmutated. Ideally, there should be hundreds of allele calls per gene in the input.
References
Laserson U and Vigneault F, et al. High-resolution antibody dynamics of vaccine-induced immune responses. PNAS. 2014 111(13):4928-33.
See Also
plotGenotype for a colorful visualization and genotypeFasta to convert the genotype to nucleotide sequences. See inferGenotype to infer a subject-specific genotype using a frequency method.
Examples
# Infer IGHV genotype, using only unmutated sequences, including novel alleles
inferGenotypeBayesian(AIRRDb, germline_db=SampleGermlineIGHV, novel=SampleNovel,
find_unmutated=TRUE, v_call="v_call", seq="sequence_alignment")
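Because kh, kd, kt and kq are log10 likelihoods, the log10 ratio reported as k_diff is the difference between the two largest of them. The base R sketch below shows that arithmetic with made-up values; the helper name is hypothetical and this is not the package's code.

```r
# Sketch only: log10 Bayes factor as the gap between the two largest
# log10 likelihoods.
k_diff_sketch <- function(log10_lik) {
  top_two <- sort(log10_lik, decreasing = TRUE)[1:2]
  unname(top_two[1] - top_two[2])
}

# Hypothetical log10 likelihoods for one gene
k_diff_sketch(c(kh = -12.5, kd = -3.0, kt = -8.0, kq = -20.0))
# 5
```

Here the heterozygous model is the most likely and is favored over the next best model by five orders of magnitude.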
Insert polymorphisms into a nucleotide sequence
Description
insertPolymorphisms replaces nucleotides in the desired locations of a
provided sequence.
Usage
insertPolymorphisms(sequence, positions, nucleotides)
Arguments
sequence |
starting nucleotide sequence. |
positions |
numeric vector of positions to be changed. |
nucleotides |
character vector of nucleotides to which to change the positions. |
Value
A sequence with the desired nucleotides in the provided locations.
Examples
insertPolymorphisms("HUGGED", c(1, 6, 2), c("T", "R", "I"))
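The replacement itself amounts to indexed assignment on the characters of the sequence. A minimal base R sketch, with a hypothetical function name and no claim to match the package's implementation:

```r
# Sketch only: split into characters, overwrite the chosen positions,
# and paste back together.
insert_polymorphisms_sketch <- function(sequence, positions, nucleotides) {
  chars <- strsplit(sequence, "")[[1]]
  chars[positions] <- nucleotides
  paste(chars, collapse = "")
}

insert_polymorphisms_sketch("HUGGED", c(1, 6, 2), c("T", "R", "I"))
# "TIGGER"
```

Positions and nucleotides are paired by order, so position 1 receives "T", position 6 receives "R", and position 2 receives "I".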
Show a colorful representation of a genotype
Description
plotGenotype plots a genotype table.
Usage
plotGenotype(
genotype,
facet_by = NULL,
gene_sort = c("name", "position"),
text_size = 12,
silent = FALSE,
...
)
Arguments
genotype |
|
facet_by |
column name in |
gene_sort |
string defining the method to use when sorting alleles.
if |
text_size |
point size of the plotted text. |
silent |
if |
... |
additional arguments to pass to ggplot2::theme. |
Value
A ggplot object defining the plot.
See Also
Examples
# Plot genotype
plotGenotype(SampleGenotype)
# Facet by subject
genotype_a <- genotype_b <- SampleGenotype
genotype_a$SUBJECT <- "A"
genotype_b$SUBJECT <- "B"
geno_sub <- rbind(genotype_a, genotype_b)
plotGenotype(geno_sub, facet_by="SUBJECT", gene_sort="pos")
Visualize evidence of novel V alleles
Description
plotNovel is used to visualize the evidence of any novel V
alleles found using findNovelAlleles. It can also be used to
visualize the results for alleles that did
Usage
plotNovel(
data,
novel_row,
v_call = "v_call",
j_call = "j_call",
seq = "sequence_alignment",
junction = "junction",
junction_length = "junction_length",
pos_range_max = NULL,
ncol = 1,
multiplot = TRUE
)
Arguments
data |
|
novel_row |
single row from a data frame as output by findNovelAlleles that contains a polymorphism-containing germline allele. |
v_call |
name of the column in |
j_call |
name of the column in |
seq |
name of the column in |
junction |
Junction region nucleotide sequence, which includes
the CDR3 and the two flanking conserved codons. Default
is |
junction_length |
number of junction nucleotides in the junction sequence.
Default is |
pos_range_max |
Name of the column in |
ncol |
number of columns to use when laying out the plots. |
multiplot |
whether to return one single plot ( |
Details
The first panel in the plot shows, for all sequences which align to a particular germline allele, the mutation frequency at each position along the aligned sequence as a function of the sequence-wide mutation count. Each line is a position. Positions that contain polymorphisms (rather than somatic hypermutations) will exhibit a high apparent mutation frequency for a range of sequence-wide mutation counts. The positions are color coded as follows:
red: the position(s) pass(es) the novel allele test
yellow: the position(s) pass(es) the y-intercept test but not other tests
blue: the position(s) did not pass the y-intercept test and was (were) not further considered
The second panel shows the nucleotide usage at each of the polymorphic positions as a function of sequence-wide mutation count. If no polymorphisms were identified, the panel will show the mutation count.
To avoid cases where a clonal expansion might lead to a false positive, TIgGER examines the combinations of J gene and junction length among sequences which perfectly match the proposed germline allele. Clonally related sequences usually share the same V gene, J gene and junction length. Requiring the novel allele to be found in different combinations of J gene and junction lengths is a proxy for requiring it to be found in different clonal lineages.
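The idea of counting J gene and junction length combinations can be sketched in base R. The toy data frame below is made up for illustration, and stripping the allele suffix with sub() is a simplifying assumption; this is not the code tigger runs internally.

```r
# Toy data: each row stands for a sequence that perfectly matches a
# candidate germline allele. Column names follow the defaults in Usage.
perfect <- data.frame(
  j_call = c("IGHJ4*02", "IGHJ4*02", "IGHJ6*02", "IGHJ4*02"),
  junction_length = c(48, 48, 54, 51),
  stringsAsFactors = FALSE
)

# Reduce the J call to the gene level (assumption: a single call per row).
j_gene <- sub("\\*.*$", "", perfect$j_call)

# Count distinct (J gene, junction length) combinations.
combos <- unique(data.frame(j_gene = j_gene,
                            junction_length = perfect$junction_length))
n_combos <- nrow(combos)
n_combos
# 3
```

The first two rows share both J gene and junction length, so they count as one combination and could plausibly come from the same clone.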
Examples
# Plot the evidence for the first (and only) novel allele in the example data
novel <- selectNovel(SampleNovel)
plotNovel(AIRRDb, novel[1, ], v_call="v_call", j_call="j_call",
seq="sequence_alignment", junction="junction", junction_length="junction_length",
multiplot=TRUE)
Read immunoglobulin sequences
Description
readIgFasta
reads a fasta-formatted file of immunoglobulin (Ig)
sequences and returns a named vector of those sequences.
Usage
readIgFasta(fasta_file, strip_down_name = TRUE, force_caps = TRUE)
Arguments
fasta_file |
fasta-formatted file of immunoglobulin sequences. |
strip_down_name |
if |
force_caps |
if |
Value
Named vector of strings representing Ig alleles.
See Also
writeFasta to do the inverse.
Examples
## Not run:
# germlines <- readIgFasta("ighv.fasta")
## End(Not run)
Correct allele calls based on a personalized genotype
Description
reassignAlleles
uses a subject-specific genotype to correct
preliminary allele assignments of a set of sequences derived
from a single subject.
Usage
reassignAlleles(
data,
genotype_db,
v_call = "v_call",
seq = "sequence_alignment",
method = "hamming",
path = NA,
keep_gene = c("gene", "family", "repertoire")
)
Arguments
data |
|
genotype_db |
vector of named nucleotide germline sequences
matching the calls detailed in |
v_call |
name of the column in |
seq |
name of the column in |
method |
method to use when realigning sequences to
the genotype_db sequences. Currently, only |
path |
directory containing the tool used in the realignment method, if needed. Hamming distance does not require a path to a tool. |
keep_gene |
string indicating if the gene ( |
Details
In order to save time, initial gene assignments are preserved and
the allele calls are chosen from among those provided in genotype_db,
based on a simple alignment to the sample sequence.
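A minimal base R sketch of choosing the closest allele by Hamming distance follows. The helper hamming and the short toy sequences are hypothetical and only illustrate the principle; the package's own realignment may differ in details such as tie handling.

```r
# Hypothetical helper: number of mismatching characters over the shared length.
hamming <- function(a, b) {
  n <- min(nchar(a), nchar(b))
  sum(strsplit(substr(a, 1, n), "")[[1]] != strsplit(substr(b, 1, n), "")[[1]])
}

# Toy genotype: two alleles of the same gene (made-up sequences).
genotype_db <- c("IGHV1-2*02" = "ACGTACGTAC", "IGHV1-2*04" = "ACGTTCGTAC")
sample_seq <- "ACGTTCGTAG"

# The gene assignment is kept; only the allele within the genotype is chosen.
distances <- vapply(genotype_db, hamming, integer(1), b = sample_seq)
best_allele <- names(distances)[which.min(distances)]
best_allele
# "IGHV1-2*04"
```

Here the sample differs from IGHV1-2*02 at two positions and from IGHV1-2*04 at one, so the latter is selected.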
Value
A modified input data.frame containing the best allele call from
among the sequences listed in genotype_db in the v_call_genotyped column.
Examples
# Extract the database sequences that correspond to the genotype
genotype_db <- genotypeFasta(SampleGenotype, SampleGermlineIGHV, novel=SampleNovel)
# Use the personalized genotype to determine corrected allele assignments
output_db <- reassignAlleles(AIRRDb, genotype_db, v_call="v_call",
seq="sequence_alignment")
Select rows containing novel alleles
Description
selectNovel
takes the result from findNovelAlleles and
selects only the rows containing unique, novel alleles.
Usage
selectNovel(novel, keep_alleles = FALSE)
Arguments
novel |
|
keep_alleles |
|
Details
If, for instance, a subject carries IGHV1-2*02 and a novel allele that is
equally close to IGHV1-2*02 and IGHV1-2*05, the novel allele may be
detected by analyzing sequences that best align to either of these alleles.
If keep_alleles is TRUE, both polymorphic allele calls will be retained.
In the case that multiple mutation ranges are checked for the same allele,
only one mutation range will be kept in the output.
Value
A data.frame containing only unique, novel alleles (if any)
that were in the input.
Examples
novel <- selectNovel(SampleNovel)
Sort allele names
Description
sortAlleles
returns a sorted vector of strings representing Ig allele
names. Names are first sorted by gene family, then by gene, then by allele.
Duplicated genes have their alleles sorted as if they were part of their
non-duplicated counterparts (e.g. IGHV1-69D*01 comes after IGHV1-69*01
but before IGHV1-69*02), and non-localized genes (e.g. IGHV1-NL1*01)
come last within their gene family.
Usage
sortAlleles(allele_calls, method = c("name", "position"))
Arguments
allele_calls |
vector of strings representing Ig allele names. |
method |
string defining the method to use when sorting alleles.
If |
Value
A sorted vector of strings representing Ig allele names.
See Also
Like sortAlleles, updateAlleleNames can help
format a list of allele names.
Examples
# Create a list of allele names
alleles <- c("IGHV1-69D*01","IGHV1-69*01","IGHV1-2*01","IGHV1-69-2*01",
"IGHV2-5*01","IGHV1-NL1*01", "IGHV1-2*01,IGHV1-2*05",
"IGHV1-2", "IGHV1-2*02", "IGHV1-69*02")
# Sort the alleles by name
sortAlleles(alleles)
# Sort the alleles by position in the locus
sortAlleles(alleles, method="pos")
Subsample repertoire
Description
subsampleDb
will sample the same number of sequences for each gene, family
or allele (specified with mode) in data. Samples or subjects can
be subsampled independently by setting group.
Usage
subsampleDb(
data,
gene = "v_call",
mode = c("gene", "allele", "family"),
min_n = 1,
max_n = NULL,
group = NULL
)
Arguments
data |
|
gene |
name of the column in |
mode |
one of |
min_n |
minimum number of observations to sample from each group. A group with fewer observations than the minimum is excluded. |
max_n |
maximum number of observations to sample for all |
group |
columns containing additional grouping variables, e.g. sample_id.
These groups will be subsampled independently. If
|
Details
data will be split into gene, allele or family subsets (mode) from
which the same number of sequences will be subsampled. If mode=gene,
for each gene in the field gene from data, a maximum of max_n sequences
will be subsampled. Input sequences that have multiple gene calls (ties)
can be subsampled from any of their calls, but these duplicated samplings
will be removed, and the final subsampled data will contain unique rows.
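The per-gene capping can be sketched in base R as follows. The toy data frame, the gene column derived with sub(), and the fixed seed are assumptions made for illustration; ties (multiple gene calls per sequence) are not modelled, although the final unique() mirrors the removal of duplicated samplings.

```r
set.seed(1)  # make the sketch reproducible

# Toy repertoire with AIRR-style column names.
db <- data.frame(
  sequence_id = paste0("seq", 1:7),
  v_call = c(rep("IGHV1-2*02", 4), rep("IGHV3-23*01", 3)),
  stringsAsFactors = FALSE
)

# Derive the gene by dropping the allele suffix (simplified; ignores ties).
db$gene <- sub("\\*.*$", "", db$v_call)

max_n <- 2  # cap on the number of sequences kept per gene

# Sample at most max_n row indices within each gene, then subset once.
idx <- unlist(lapply(split(seq_len(nrow(db)), db$gene), function(rows) {
  rows[sample.int(length(rows), min(length(rows), max_n))]
}))
subsampled <- db[sort(unique(idx)), ]
table(subsampled$gene)
```

Each gene contributes two rows here because both genes have at least max_n sequences.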
Value
Subsampled version of the input data.
See Also
Examples
subsampleDb(AIRRDb)
tigger
Description
Here we provide a Tool for Immunoglobulin Genotype Elucidation via Rep-Seq (TIgGER). TIgGER infers the set of Ig alleles carried by an individual (including any novel alleles) and then uses this set of alleles to correct the initial assignments given to sample sequences by existing tools.
Details
Immunoglobulin repertoire sequencing (AIRR-Seq, Rep-Seq) data is currently the subject of much study. A key step in analyzing these data involves assigning the closest known V(D)J germline alleles to the (often somatically mutated) sample sequences using a tool such as IMGT/HighV-QUEST. However, if the sample utilizes alleles not in the germline database used for alignment, this step will fail. Additionally, this alignment has an associated error rate of ~5 mutations. The purpose of TIgGER is to address these issues.
Allele detection and genotyping
- findNovelAlleles: Detect novel alleles.
- plotNovel: Plot evidence of novel alleles.
- inferGenotype: Infer an Ig genotype using a frequency approach.
- inferGenotypeBayesian: Infer an Ig genotype using a Bayesian approach.
- plotGenotype: A colorful genotype visualization.
- genotypeFasta: Convert a genotype to sequences.
- reassignAlleles: Correct allele calls.
- generateEvidence: Generate evidence for the genotype and allele detection inference.
Mutation handling
- getMutatedPositions: Find mutation locations.
- getMutCount: Find distance from germline.
- findUnmutatedCalls: Subset unmutated sequences.
- getPopularMutationCount: Find most common sequence's mutation count.
- insertPolymorphisms: Insert SNPs into a sequence.
Input, output and formatting
- readIgFasta: Read a fasta file of Ig sequences.
- updateAlleleNames: Correct outdated allele names.
- sortAlleles: Sort allele names intelligently.
- cleanSeqs: Standardize sequence format.
References
Gadala-Maria, et al. (2015) Automated analysis of high-throughput B cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. PNAS. 112(8):E862-70.
Update IGHV allele names
Description
updateAlleleNames
takes a set of IGHV allele calls and replaces any
outdated names (e.g. IGHV1-f) with the new IMGT names.
Usage
updateAlleleNames(allele_calls)
Arguments
allele_calls |
vector of strings representing IGHV allele names. |
Value
Vector of strings representing updated IGHV allele names.
Note
IMGT has removed IGHV2-5*10 and IGHV2-5*07 as it has determined they
are actually alleles 02 and 04, respectively. The updated allele
names are based on IMGT release 2014-08-4.
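The two replacements mentioned in this note can be expressed as a named-vector lookup in base R. This is only a sketch of the renaming idea, restricted to the two alleles named above; the vector renames is hypothetical and is not the table used by updateAlleleNames.

```r
# Hypothetical lookup built from the note above: old name -> current name.
renames <- c("IGHV2-5*10" = "IGHV2-5*02",
             "IGHV2-5*07" = "IGHV2-5*04")

calls <- c("IGHV2-5*07", "IGHV1-2*02", "IGHV2-5*10")

# Replace only the calls that appear in the lookup; leave the rest untouched.
hit <- calls %in% names(renames)
calls[hit] <- unname(renames[calls[hit]])
calls
# "IGHV2-5*04" "IGHV1-2*02" "IGHV2-5*02"
```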
References
Xochelli et al. (2014) Immunoglobulin heavy variable (IGHV) genes and alleles: new entities, new names and implications for research and prognostication in chronic lymphocytic leukaemia. Immunogenetics. 67(1):61-6
See Also
Like updateAlleleNames, sortAlleles can help
format a list of allele names.
Examples
# Create a vector that uses old gene/allele names.
alleles <- c("IGHV1-c*01", "IGHV1-f*02", "IGHV2-5*07")
# Update the alleles to the new names
updateAlleleNames(alleles)
Write to a fasta file
Description
writeFasta
writes a named vector of sequences to a file in fasta
format.
Usage
writeFasta(named_sequences, file, width = 60, append = FALSE)
Arguments
named_sequences |
named vector of strings representing sequences |
file |
the name of the output file. |
width |
the number of characters to be printed per line. If not between 1 and 255, width will be infinite. |
append |
|
Value
A named vector of strings representing Ig alleles.
See Also
readIgFasta to do the inverse.
Examples
## Not run:
# writeFasta(germlines, "ighv.fasta")
## End(Not run)
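To show what the width argument means for the file layout, here is a minimal base R sketch of writing fasta records with wrapped sequence lines. The helper write_fasta_sketch is hypothetical and is not the package's implementation; it assumes non-empty sequences with unique names.

```r
# Hypothetical helper: write a named character vector as fasta records,
# wrapping each sequence at `width` characters per line.
write_fasta_sketch <- function(named_sequences, file, width = 60) {
  lines <- unlist(lapply(names(named_sequences), function(nm) {
    s <- named_sequences[[nm]]
    starts <- seq(1, nchar(s), by = width)  # start index of each chunk
    c(paste0(">", nm), substring(s, starts, starts + width - 1))
  }))
  writeLines(lines, file)
}

seqs <- c("IGHV1-2*02" = "ACGTACGTAC", "IGHV3-23*01" = "GGGTTT")
out <- tempfile(fileext = ".fasta")
write_fasta_sketch(seqs, out, width = 4)
readLines(out)
```

With width = 4, the ten-nucleotide toy sequence is split into lines of four, four and two characters under its header line.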