Title: | Simulates Pooled Sequencing Genetic Data |
Version: | 1.1.0 |
Description: | Simulates pooled sequencing data under a variety of conditions. Also allows for the evaluation of the average absolute difference between allele frequencies computed from genotypes and those computed from pooled data. Carvalho et al., (2022) <doi:10.1101/2023.01.20.524733>. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.1 |
Imports: | MCMCpack, Metrics, scrm, stats |
Suggests: | knitr, rmarkdown, ggplot2, testthat (≥ 3.0.0) |
VignetteBuilder: | knitr |
URL: | https://github.com/joao-mcarvalho/poolHelper |
BugReports: | https://github.com/joao-mcarvalho/poolHelper/issues |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2023-06-29 12:25:30 UTC; jcarvalho |
Author: | João Carvalho |
Maintainer: | João Carvalho <jgcarvalho@fc.ul.pt> |
Repository: | CRAN |
Date/Publication: | 2023-06-29 17:50:02 UTC |
Compute expected heterozygosity per site
Description
Computes the expected heterozygosity for a given site.
Usage
ExpHet_site(geno_site)
Arguments
geno_site |
is a vector where each entry contains the genotype for a single individual, coded as 0,1,2 and using NA for the missing data. |
Value
a numerical value corresponding to the expected heterozygosity at that site.
Compute expected heterozygosity within a population
Description
This functions calculates the value of the expected heterozygosity for each SNP.
Usage
Expected_Het(Pop_Pi)
Arguments
Pop_Pi |
is a matrix or list of allele frequencies. When dealing with a single locus, this input is a matrix and when dealing with multiple loci it is a list. Each entry of that list is a matrix representing a different locus. Each row of that matrix should correspond to a different population and each column to a different SNP. |
Value
if the input is a single matrix, the output will be a matrix where each row represents a different population and each column is the expected heterozygosity of a population at that site. If the input is a list, the output will also be a list, with each entry corresponding to a different locus. Each of those entries will be a matrix with different populations in different rows and the expected heterozygosity of different sites at different columns.
Create genotypes from a output with haplotypes
Description
This function applies the hap2geno
function to all entries of a
list. Each entry of that list is a different locus simulated with
scrm::scrm()
. Thus, this function converts the haplotypes of all simulated
loci into genotypes.
Usage
GetGenotypes(haplotypes, nDip)
Arguments
haplotypes |
a list of haplotypes obtained from the simulations done
with |
nDip |
an integer representing the total number of diploid individuals to simulate. Note that this is the total number of diploid individuals and not the number of individuals per population. |
Value
a list of genotypes. Each entry of the list is a matrix corresponding to a different locus. locus. At each matrix, each column is a different site and each row is a different genotype
Compute allele frequencies from genotypes
Description
Computes alternative allele frequencies from genotypes by dividing the total number of alternative alleles by the total number of gene copies.
Usage
Ifreqs(nDip, genotypes)
Arguments
nDip |
an integer representing the total number of diploid individuals
to simulate. Note that |
genotypes |
a list of simulated genotypes, where each entry is a matrix corresponding to a different locus. At each matrix, each column is a different SNP and each row is a different individual. |
Value
a list of allele frequencies. Each entry of the list corresponds to a different locus.
Examples
genotypes <- run_scrm(nDip = 10, nloci = 10)
Ifreqs(nDip = 10, genotypes)
Compute allele frequencies from pooled sequencing data
Description
Computes the frequency of the alternative allele in Pool-seq data and removes any site with too few minor-allele reads from both the pool frequencies and the frequencies computed directly from genotypes.
Usage
Pfreqs(reference, alternative, coverage, min.minor, ifreqs)
Arguments
reference |
a matrix with the number of reference allele reads. Each row should be a different population and each column a different site. |
alternative |
a matrix with the number of alternative allele reads. Each row should be a different population and each column a different site. |
coverage |
a matrix with the total coverage. Each row should be a different population and each column a different site. |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
ifreqs |
a vector of allele frequencies computed directly from the genotypes where each entry corresponds to a different site. |
Details
The frequency at a given SNP is calculated according to: pi = c/r
, where c
= number of alternative allele reads and r = total number of observed reads.
Additionally, if a site has less minor-allele reads than min.minor
across all populations, that site is removed from the data.
Value
a list with two entries. The ifreqs
entry contains the allele
frequencies computed directly from genotypes and pfreqs
the allele
frequencies computed from pooled sequencing data.
Examples
set.seed(10)
# create a vector of allele frequencies
freqs <- runif(20)
set.seed(10)
# create a matrix with the number of reads with the alternative allele
alternative <- matrix(sample(x = c(0,5,10), size = 20, replace = TRUE), nrow = 1)
# create a matrix with the depth of coverage
coverage <- matrix(sample(100:150, size = 20), nrow = 1)
# the number of reads with the reference allele is obtained by subtracting
# the number of alternative allele reads from the depth of coverage
reference <- coverage - alternative
# compute allele frequencies from pooled sequencing data
Pfreqs(reference = reference, alternative = alternative, coverage = coverage,
min.minor = 2, ifreqs = freqs)
Calculate population frequency at each SNP
Description
The frequency at a given SNP is calculated according to: pi = c/r
, where c
= number of minor-allele reads and r = total number of observed reads.
Usage
calculatePi(listPool, nLoci)
Arguments
listPool |
a list containing the "minor" element, representing the
number of reads with the minor-allele and the "total" element that contains
information about the total number of reads. The list should also contain a
"major" entry with the information about reads containing the major-allele.
The output of the |
nLoci |
an integer that represents the total number of independent loci in the dataset. |
Details
This function takes as input a list that contains the number of reads with the minor allele and the number of total reads per population at a given site. The names of the respective elements of the list should be minor and total. It works with lists containing just one set of minor and total reads, corresponding to a single locus, and with lists where each entry contains a different set of minor and total number of reads, corresponding to different loci.
Value
a list with two named entries
pi |
a list with the allele frequencies of each population. Each list entry is a matrix, corresponding to a different locus. Each row of a matrix corresponds to a different population and each column to a different site. |
pool |
a list with three different entries: major, minor and total.
This list is similar to the one obtained with the |
Examples
# simulate coverage at 5 SNPs for two populations, assuming 20x mean coverage
reads <- simulateCoverage(mean = c(20, 20), variance = c(100, 100), nSNPs = 5, nLoci = 1)
# simulate the number of reads contributed by each individual
# for each population there are two pools, each with 5 individuals
indContribution <- popsReads(list_np = rep(list(rep(5, 2)), 2), coverage = reads, pError = 5)
# set seed and create a random matrix of genotypes for the 20 individuals - 10 per population
set.seed(10)
genotypes <- matrix(rpois(100, 0.5), nrow = 20)
# simulate the number of reference reads for the two populations
readsReference <- numberReferencePop(genotypes = genotypes, indContribution = indContribution,
size = rep(list(rep(5, 2)), 2), error = 0.01)
# create Pooled DNA sequencing data for these two populations and for a single locus
pools <- poolPops(nPops = 2, nLoci = 1, indContribution = indContribution,
readsReference = readsReference)
# define the major and minor alleles for this pool-seq data
# note that we have to select the first entry of the pools list
# because this function works for matrices
pools <- findMinor(reference = pools$reference[[1]], alternative = pools$alternative[[1]],
coverage = pools$total[[1]])
# calculate population frequency at each SNP of this locus
calculatePi(listPool = pools, nLoci = 1)
Compute the number of reference reads over a matrix
Description
This function works over all the rows and columns of a matrix and computes the number of reads containing the reference allele at each site and for each individual.
Usage
computeReference(genotypes, indContribution, error)
Arguments
genotypes |
is a matrix of genotypes. Each column of the matrix should be a different site and each row a different individual. Genotypes should be encoded as 0: reference homozygote, 1: heterozygote and 2: alternative homozygote. |
indContribution |
is a matrix of individual contributions. Each row of that matrix is a different individual and each column is a different site. Thus, each entry of the matrix should contain the number of reads contributed by that individual at that particular site. |
error |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
Value
a matrix with the number of reference allele reads contributed by each individual. Each row of the matrix represents a different individual and each column is a different site.
Examples
# probability of contribution for 10 individuals at 5 sites
probs <- indProbs(np = 10, nSNPs = 5, pError = 5)
# simulate the number of reads contributed, assuming 20 coverage for each site
indContribution <- indReads(np = 10, coverage = rep(20, 5), probs = probs)
# set seed and create a random matrix of genotypes
set.seed(10)
genotypes <- matrix(rpois(50, 0.5), nrow = 10)
# simulate the number of reads with the reference allele
computeReference(genotypes = genotypes, indContribution = indContribution, error = 0.01)
Average absolute difference between expected heterozygosity
Description
Calculates the average absolute difference between the expected heterozygosity computed directly from genotypes and from pooled sequencing data.
Usage
errorHet(
nDip,
nloci,
pools,
pError,
sError,
mCov,
vCov,
min.minor,
minimum = NA,
maximum = NA,
theta = 10
)
Arguments
nDip |
an integer representing the total number of diploid individuals
to simulate. Note that |
nloci |
is an integer that represents how many independent loci should be simulated. |
pools |
a list with a vector containing the size (in number of diploid individuals) of each pool. Thus, if a population was sequenced using a single pool, the vector should contain only one entry. If a population was sequenced using two pools, each with 10 individuals, this vector should contain two entries and both will be 10. |
pError |
an integer representing the value of the error associated with DNA pooling. This value is related with the unequal contribution of both individuals and pools towards the total number of reads observed for a given population - the higher the value the more unequal are the individual and pool contributions. |
sError |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
mCov |
an integer that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. |
vCov |
an integer that defines the variance of the depth of coverage across all sites. |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
minimum |
an optional integer representing the minimum coverage allowed. Sites where the population has a depth of coverage below this threshold are removed from the data. |
maximum |
an optional integer representing the maximum coverage allowed. Sites where the population has a depth of coverage above this threshold are removed from the data. |
theta |
a value for the mutation rate assuming theta = 4Nu, where u is the neutral mutation rate per locus. |
Details
Different combinations of parameters can be tested to check the effect of the
various parameters. The average absolute difference is computed with the
mae function, assuming the expected heterozygosity computed
directly from the genotypes as the actual
input argument and the
expected heterozygosity from pooled data as the predicted
input
argument.
Value
a data.frame with columns detailing the number of diploid individuals, the pool error, the number of pools, the number of individuals per pool, the mean coverage, the variance of the coverage and the average absolute difference between the expected heterozygosity computed from genotypes and from pooled data.
Examples
# single population sequenced with a single pool of 100 individuals
errorHet(nDip = 100, nloci = 10, pools = list(100), pError = 100, sError = 0.01,
mCov = 100, vCov = 250, min.minor = 2)
# single population sequenced with two pools, each with 50 individuals
errorHet(nDip = 100, nloci = 10, pools = list(c(50, 50)), pError = 100, sError = 0.01,
mCov = 100, vCov = 250, min.minor = 2)
# single population sequenced with two pools, each with 50 individuals
# removing sites with coverage below 10x or above 180x
errorHet(nDip = 100, nloci = 10, pools = list(c(50, 50)), pError = 100, sError = 0.01,
mCov = 100, vCov = 250, min.minor = 2, minimum = 10, maximum = 180)
Filter sites according to a minor-allele reads threshold
Description
Removes sites from matrices with counts of reads. If a site has less
minor-allele reads than min.minor
across all populations, that site is
removed from the data.
Usage
filterMinor(reference, alternative, coverage, min.minor)
Arguments
reference |
a matrix with the number of reads with the reference allele. Each row should be a different population and each column a different site. |
alternative |
a matrix with the number of reads with the alternative allele. Each row should be a different population and each column a different site. |
coverage |
is a matrix of total coverage. Each row of the matrix should be a different population and each column a different site. Thus, each entry of the matrix contains the total number of observed reads for that population at a given site. |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
Value
a list with three named entries:
reference |
a list with one entry per locus. Each entry is a matrix with the number of reference allele reads. Each column represents a different site. |
alternative |
a list with one entry per locus. Each entry is a matrix with the number of alternative allele reads. Each column represents a different site. |
total |
a list with one entry per locus. Each entry is a matrix with the total depth of coverage. Each column represents a different site. |
Filter Pool-seq data according to a minor-allele reads threshold
Description
Removes sites from Pool-seq data. If a site has less minor-allele reads than
min.minor
across all populations, that site is removed from the data.
Usage
filterPool(pool, nloci, min.minor)
Arguments
pool |
a list containing the "reference" element, representing the number of reads with the reference allele, the "alternative" element representing the number of reads with the alternative allele and the "total" element that contains information about the total number of reads. |
nloci |
an integer that represents the total number of independent loci in the dataset. |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
Value
a list with three named entries:
reference |
a list with one entry per locus. Each entry is a matrix with the number of reference allele reads. Each column represents a different site. |
alternative |
a list with one entry per locus. Each entry is a matrix with the number of alternative allele reads. Each column represents a different site. |
total |
a list with one entry per locus. Each entry is a matrix with the total depth of coverage. Each column represents a different site. |
Define major and minor alleles
Description
This function checks which of the two simulated alleles (reference or alternative) corresponds to the minor allele. This function can also be used to remove sites according to a minor-allele reads threshold.
Usage
findMinor(reference, alternative, coverage)
Arguments
reference |
is a matrix of reference allele reads. Each row of the matrix should be a different population and each column a different site. Thus, each entry of the matrix contains the number of observed reads with the reference allele for that population at a given site. |
alternative |
is a matrix of alternative allele reads. Each row of the matrix should be a different population and each column a different site. Thus, each entry of the matrix contains the number of observed reads with the alternative allele for that population at a given site. |
coverage |
is a matrix of total coverage. Each row of the matrix should be a different population and each column a different site. Thus, each entry of the matrix contains the total number of observed reads for that population at a given site. |
Details
More precisely, this function counts the number of reads with the reference
or alternative allele at each site and then sets the minor allele as the
least frequent of the two. This is done across all populations and so the
major and minor alleles are defined at a global level. Then if the
min.minor
input is not NA, sites where the number of minor allele reads,
across all populations, is below the user-defined threshold are removed.
Value
a list with three names entries
major |
a list with one entry per locus. Each entry is a matrix with the number of major allele reads for each population. Each column represents a different site and each row a different population. |
minor |
a list with one entry per locus. Each entry is a matrix with the number of minor allele reads for each population. Each column represents a different site and each row a different population. |
total |
a list with one entry per locus. Each entry is a matrix with the coverage of each population. Each column represents a different site and each row a different population. |
Examples
# simulate coverage at 5 SNPs for two populations, assuming 20x mean coverage
reads <- simulateCoverage(mean = c(20, 20), variance = c(100, 100), nSNPs = 5, nLoci = 1)
# simulate the number of reads contributed by each individual
# for each population there are two pools, each with 5 individuals
indContribution <- popsReads(list_np = rep(list(rep(5, 2)), 2), coverage = reads, pError = 5)
# set seed and create a random matrix of genotypes for the 20 individuals - 10 per population
set.seed(10)
genotypes <- matrix(rpois(100, 0.5), nrow = 20)
# simulate the number of reference reads for the two populations
readsReference <- numberReferencePop(genotypes = genotypes, indContribution = indContribution,
size = rep(list(rep(5, 2)), 2), error = 0.01)
# create Pooled DNA sequencing data for these two populations and for a single locus
pools <- poolPops(nPops = 2, nLoci = 1, indContribution = indContribution,
readsReference = readsReference)
# define the major and minor alleles for this Pool-seq data
# we have to select the first entry of the pools list because this function works for matrices
findMinor(reference = pools$reference[[1]], alternative = pools$alternative[[1]],
coverage = pools$total[[1]])
Randomly select the required number of loci from the pooled sequencing data
Description
This function removes loci without polymorphic sites and then randomly selects the required number of loci from a larger dataset.
Usage
forcePool(nSims, pool, target)
Arguments
nSims |
is an integer representing how many types of simulations were performed. The possible types of simulations include loci without barriers against migration between divergent ecotypes, loci without migration from the C towards the W ecotype, loci without migration from the W towards the C ecotypes and loci where no migration occurs between divergent ecotypes. |
pool |
a list containing the "minor" element, representing the number of reads with the minor-allele and the "total" element that contains information about the total number of Reads. The list should also contain a "major" entry with the information about reads containing the major-allele. |
target |
is a vector with the required number of loci per category. If extra loci were simulated, this vector informs how many loci of each simulation type should be randomly selected. |
Details
If extra loci were simulated to try to have sufficient loci to keep the required number of loci after filtering, then this function is used to remove extra loci. This is done by randomly selecting the required number of loci from the full contingent of extra simulated loci.
Value
a list with three named entries
major |
a list with one entry per locus. Each entry is a matrix with the number of major allele reads for each population. Each column represents a different site and each row a different population. |
minor |
a list with one entry per locus. Each entry is a matrix with the number of minor allele reads for each population. Each column represents a different site and each row a different population. |
total |
a list with one entry per locus. Each entry is a matrix with the coverage of each population. Each column represents a different site and each row a different population. |
Compute the number of reference reads
Description
This function takes as input the total depth of coverage and computes how many of those reads are reference allele reads.
Usage
getNumReadsR_vector(genotype_v, readCount_v, error)
Arguments
genotype_v |
is a vector with the genotype of a given individual. Each entry of the vector should be a different site. Genotypes should be encoded as 0: reference homozygote, 1: heterozygote and 2: alternative homozygote. |
readCount_v |
is a vector with the number of reads contributed by the same given individual. Each entry of that vector should be a different site. |
error |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
Details
More precisely, this function computes the number of reference reads per site for one individual, given the genotype of the individual at each site, the total number of reads observed for the individual at that site and an error rate.
Value
a vector with the number of reference allele reads. Each entry of the vector corresponds to a different individual.
Examples
# number of reference allele reads for three individuals, each with 10x coverage
# one individual is homozygote for the reference allele (0), other is heterozygote (1)
# and the last is homozygote for the alternative allele (2)
getNumReadsR_vector(genotype_v = c(0,1,2), readCount_v = c(10, 10, 10), error = 0.01)
Convert haplotypes to genotypes
Description
This function converts haplotypes simulated with scrm::scrm()
into
genotypes by adding the entries on one row with the entries of the subsequent
row.
Usage
hap2geno(haplo)
Arguments
haplo |
a matrix of haplotypes obtained from the simulations done with
|
Value
a matrix of genotypes with half the rows of the haplo
matrix. Each
column of this matrix is a different site and each row is a different
genotype.
Create invariable sites
Description
This function applies a correction for the situations where scrm::scrm()
does not produce a single polymorphic site for a given locus. In this
situation, two artificial sites are created at that locus. All individuals
are assumed to be homozygous for the reference allele at those sites.
Usage
haplo.fix(haplotypes, nHap)
Arguments
haplotypes |
a list of haplotypes obtained from the simulations done
with |
nHap |
an integer representing the total number of haplotypes simulated. |
Value
a list of haplotypes identical to haplotypes
, but without empty
loci.
Probability of contribution of each individual
Description
This function computes the probability of contribution for each individual of a given pool. Please note that this function works for a single pool and should not be directly applied to situations where multiple pools were used.
Usage
indProbs(np, nSNPs, pError)
Arguments
np |
an integer specifying how many individuals were pooled. |
nSNPs |
an integer indicating how many SNPs exist in the data. |
pError |
an integer representing the value of the error associated with DNA pooling. This value is related with the unequal individual contribution towards the total number of reads contributed by a single pool - the higher the value the more unequal are the individual contributions. |
Value
a matrix with the probabilities of contribution for each individual. Each row represents a different individual and each column is a different site.
Examples
# probability of contribution for 10 individuals at 5 sites
indProbs(np = 10, nSNPs = 5, pError = 100)
Reads contributed by each individual
Description
This function simulates the contribution, in terms of reads, of each individual of a given pool. Please note that this function works for a single pool and should not be directly applied to situations where multiple pools were used.
Usage
indReads(np, coverage, probs)
Arguments
np |
an integer specifying how many individuals were pooled. |
coverage |
a vector containing the total depth of coverage of a given pool. Each entry of the vector represents a different site. |
probs |
a matrix containing the probability of contribution of each
individual. This matrix can be obtained with the |
Value
a matrix with the number of reads contributed by each individual towards the coverage of its pool. Each row of the matrix is a different individual and each column a different site.
Examples
# probability of contribution for 10 individuals at 5 sites
probs <- indProbs(np = 10, nSNPs = 5, pError = 100)
# simulate the number of reads contributed, assuming 10x coverage for each site
indReads(np = 10, coverage = rep(10, 5), probs = probs)
Average absolute difference between allele frequencies computed from genotypes and from Pool-seq data
Description
Calculates the average absolute difference between the allele frequencies computed directly from genotypes and from pooled sequencing data.
Usage
maeFreqs(
nDip,
nloci,
pError,
sError,
mCov,
vCov,
min.minor,
minimum = NA,
maximum = NA,
theta = 10
)
Arguments
nDip |
is an integer or a vector representing the total number of
diploid individuals to simulate. Note that |
nloci |
is an integer that represents how many independent loci should be simulated. |
pError |
an integer or a vector representing the value of the error associated with DNA pooling. This value is related with the unequal contribution of both individuals and pools towards the total number of reads observed for a given population - the higher the value the more unequal are the individual and pool contributions. If it is a vector, then each vector entry will be simulated independently. |
sError |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
mCov |
an integer or a vector that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. If it is a vector, then each vector entry will be simulated independently. |
vCov |
an integer or a vector that defines the variance of the depth of
coverage across all sites. If the |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
minimum |
an optional integer representing the minimum coverage allowed. Sites where the population has a depth of coverage below this threshold are removed from the data. |
maximum |
an optional integer representing the maximum coverage allowed. Sites where the population has a depth of coverage above this threshold are removed from the data. |
theta |
a value for the mutation rate assuming theta = 4Nu, where u is the neutral mutation rate per locus. |
Details
The average absolute difference is computed with the mae
function, assuming the frequencies computed directly from the genotypes as
the actual
input argument and the frequencies from pooled data as the
predicted
input argument.
Note that this functions allows for different combinations of parameters.
Thus, the effect of different combinations of parameters on the average
absolute difference can be tested. For instance, it is possible to check what
is the effect of different coverages by including more than one value in the
mCov
input argument. This function will run and compute the average
absolute difference for all combinations of the nDip
, pError
and mCov
input arguments. This function assumes that a single pool of
size nDip
was used to sequence the population.
Value
a data.frame with columns detailing the number of diploid individuals, the pool error, the number of pools, the number of individuals per pool, the mean coverage, the variance of the coverage and the average absolute difference between the frequencies computed from genotypes and from pooled data.
Examples
# a simple test with a simple combination of parameters
maeFreqs(nDip = 100, nloci = 10, pError = 100, sError = 0.01, mCov = 100, vCov = 200, min.minor = 1)
# effect of two different pool error values in conjugation with a fixed coverage and pool size
maeFreqs(nDip = 100, nloci = 10, pError = c(100, 200), sError = 0.01,
mCov = 100, vCov = 200, min.minor = 1)
# effect of two different pool error values in conjugation with a fixed pool size
# and two different coverages
maeFreqs(nDip = 100, nloci = 10, pError = c(100, 200), sError = 0.01,
mCov = c(100, 200), vCov = c(200, 500), min.minor = 1)
Average absolute difference between the expected heterozygosity computed from genotypes and from Pool-seq data
Description
Calculates the average absolute difference between the expected heterozygosity computed directly from genotypes and from pooled sequencing data.
Usage
maeHet(
nDip,
nloci,
pError,
sError,
mCov,
vCov,
min.minor,
minimum = NA,
maximum = NA,
theta = 10
)
Arguments
nDip |
is an integer or a vector representing the total number of
diploid individuals to simulate. Note that |
nloci |
is an integer that represents how many independent loci should be simulated. |
pError |
an integer or a vector representing the value of the error associated with DNA pooling. This value is related with the unequal contribution of both individuals and pools towards the total number of reads observed for a given population - the higher the value the more unequal are the individual and pool contributions. If it is a vector, then each vector entry will be simulated independently. |
sError |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
mCov |
an integer or a vector that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. If it is a vector, then each vector entry will be simulated independently. |
vCov |
an integer or a vector that defines the variance of the depth of
coverage across all sites. If the |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
minimum |
an optional integer representing the minimum coverage allowed. Sites where the population has a depth of coverage below this threshold are removed from the data. |
maximum |
an optional integer representing the maximum coverage allowed. Sites where the population has a depth of coverage above this threshold are removed from the data. |
theta |
a value for the mutation rate assuming theta = 4Nu, where u is the neutral mutation rate per locus. |
Details
The average absolute difference is computed with the mae
function, assuming the expected heterozygosity computed directly from the
genotypes as the actual
input argument and the expected heterozygosity
from pooled data as the predicted
input argument.
Note that this functions allows for different combinations of parameters.
Thus, the effect of different combinations of parameters on the average
absolute difference can be tested. For instance, it is possible to check what
is the effect of different coverages by including more than one value in the
mCov
input argument. This function will run and compute the average
absolute difference for all combinations of the nDip
, pError
and mCov
input arguments. This function assumes that a single pool of
size nDip
was used to sequence the population.
Value
a data.frame with columns detailing the number of diploid individuals, the pool error, the number of pools, the number of individuals per pool, the mean coverage, the variance of the coverage and the average absolute difference between the expected heterozygosity computed from genotypes and from pooled data.
Examples
# a simple test with a simple combination of parameters
maeHet(nDip = 100, nloci = 10, pError = 100, sError = 0.01, mCov = 100, vCov = 200, min.minor = 1)
# effect of two different pool error values in conjugation with a fixed coverage and pool size
maeHet(nDip = 100, nloci = 10, pError = c(100, 200), sError = 0.01,
mCov = 100, vCov = 200, min.minor = 1)
# effect of two different pool error values in conjugation with a fixed pool size
# and two different coverages
maeHet(nDip = 100, nloci = 10, pError = c(100, 200), sError = 0.01,
mCov = c(100, 200), vCov = c(200, 500), min.minor = 1)
Average absolute difference between allele frequencies
Description
Calculates the average absolute difference between the allele frequencies computed directly from genotypes and from pooled sequencing data.
Usage
maePool(
nDip,
nloci,
pools,
pError,
sError,
mCov,
vCov,
min.minor,
minimum = NA,
maximum = NA,
theta = 10
)
Arguments
nDip |
an integer representing the total number of diploid individuals
to simulate. Note that |
nloci |
is an integer that represents how many independent loci should be simulated. |
pools |
a list with a vector containing the size (in number of diploid individuals) of each pool. Thus, if a population was sequenced using a single pool, the vector should contain only one entry. If a population was sequenced using two pools, each with 10 individuals, this vector should contain two entries and both will be 10. |
pError |
an integer representing the value of the error associated with DNA pooling. This value is related with the unequal contribution of both individuals and pools towards the total number of reads observed for a given population - the higher the value the more unequal are the individual and pool contributions. |
sError |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
mCov |
an integer that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. |
vCov |
an integer that defines the variance of the depth of coverage across all sites. |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
minimum |
an optional integer representing the minimum coverage allowed. Sites where the population has a depth of coverage below this threshold are removed from the data. |
maximum |
an optional integer representing the maximum coverage allowed. Sites where the population has a depth of coverage above this threshold are removed from the data. |
theta |
a value for the mutation rate assuming theta = 4Nu, where u is the neutral mutation rate per locus. |
Details
Different combinations of parameters can be tested to check the effect of the
various parameters. The average absolute difference is computed with the
mae function, assuming the frequencies computed directly from
the genotypes as the actual
input argument and the frequencies from
pooled data as the predicted
input argument.
Value
a data.frame with columns detailing the number of diploid individuals, the pool error, the number of pools, the number of individuals per pool, the mean coverage, the variance of the coverage and the average absolute difference between the frequencies computed from genotypes and from pooled data.
Examples
# single population sequenced with a single pool of 100 individuals
maePool(nDip = 100, nloci = 10, pools = list(100), pError = 100, sError = 0.01,
mCov = 100, vCov = 250, min.minor = 2)
# single population sequenced with two pools, each with 50 individuals
maePool(nDip = 100, nloci = 10, pools = list(c(50, 50)), pError = 100, sError = 0.01,
mCov = 100, vCov = 250, min.minor = 2)
# single population sequenced with two pools, each with 50 individuals
# removing sites with coverage below 10x or above 180x
maePool(nDip = 100, nloci = 10, pools = list(c(50, 50)), pError = 100, sError = 0.01,
mCov = 100, vCov = 250, min.minor = 2, minimum = 10, maximum = 180)
Average absolute difference between allele frequencies computed from genotypes supplied by the user and from Pool-seq data
Description
Calculates the average absolute difference between the allele frequencies computed directly from genotypes and from pooled sequencing data. The genotypes used should be supplied by the user and can be simulated using different software and under the demographic model of choice.
Usage
mymae(
genotypes,
pools,
pError,
sError,
mCov,
vCov,
min.minor,
minimum = NA,
maximum = NA
)
Arguments
genotypes |
a list of genotypes, where each entry is a matrix corresponding to a different locus. At each matrix, each column is a different SNP and each row is a different individual. Genotypes should be coded as 0, 1 or 2. |
pools |
a list with a vector containing the size (in number of diploid individuals) of each pool. Thus, if a population was sequenced using a single pool, the vector should contain only one entry. If a population was sequenced using two pools, each with 10 individuals, this vector should contain two entries and both will be 10. |
pError |
an integer representing the value of the error associated with DNA pooling. This value is related with the unequal contribution of both individuals and pools towards the total number of reads observed for a given population - the higher the value the more unequal are the individual and pool contributions. |
sError |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
mCov |
an integer that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. |
vCov |
an integer that defines the variance of the depth of coverage across all sites. |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
minimum |
an optional integer representing the minimum coverage allowed. Sites where the population has a depth of coverage below this threshold are removed from the data. |
maximum |
an optional integer representing the maximum coverage allowed. Sites where the population has a depth of coverage above this threshold are removed from the data. |
Details
The average absolute difference is computed with the mae
function, assuming the frequencies computed directly from the genotypes as
the actual
input argument and the frequencies from pooled data as the
predicted
input argument.
Note that this functions allows for different combinations of parameters.
Thus, the effect of different combinations of parameters on the average
absolute difference can be tested. For instance, it is possible to check what
is the effect of different coverages by including more than one value in the
mCov
input argument. This function will run and compute the average
absolute difference for all combinations of the pools
, pError
and mCov
input arguments.
Value
a data.frame with columns detailing the number of diploid individuals, the pool error, the number of pools, the number of individuals per pool, the mean coverage, the variance of the coverage and the average absolute difference between the frequencies computed from genotypes and from pooled data.
Examples
# 100 individuals sampled at a single locus
genotypes <- run_scrm(nDip = 100, nloci = 1, theta = 5)
# compute the mean absolute error assuming a coverage of 100x and two pools of 50 individuals each
mymae(genotypes = genotypes, pools = list(c(50, 50)), pError = 100, sError = 0.001,
mCov = 100, vCov = 250, min.minor = 0)
# 10 individuals sampled at 5 different loci
genotypes <- run_scrm(nDip = 10, nloci = 5, theta = 5)
# compute the mean absolute error assuming a coverage of 100x and one pool of 10 individuals
mymae(genotypes = genotypes, pools = list(10), pError = 100, sError = 0.001,
mCov = 100, vCov = 250, min.minor = 0)
Compute the number of reference reads at multiple loci
Description
This function computes the number of reference reads over multiple loci and for a single population.
Usage
numberReference(genotypes, indContribution, error)
Arguments
genotypes |
is a list, where each entry corresponds to a different locus. Each entry should be a matrix containing the genotypes (coded as 0, 1 or 2). Each column of that matrix should be a different site and each row a different individual. |
indContribution |
either a list or a matrix (that the function will convert to a list). Each list entry should be a matrix of individual contributions. Each row of that matrix is a different individual and each column is a different site. Thus, each entry of the matrix should contain the number of reads contributed by that individual at that particular site. |
error |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
Details
Note that this function will also work on a single locus, provided that the input is in the list format.
Value
a list with one entry per locus. Each of those entries is a matrix with the number of reference allele reads contributed by each individual. Each matrix row represents a different individual and each column is a different site.
Compute the number of reference reads for multiple populations
Description
This function computes the number of reference reads over a single locus for multiple populations.
Usage
numberReferencePop(genotypes, indContribution, size, error)
Arguments
genotypes |
either a list with a single entry (one locus) or a matrix (that the function will convert to a list) containing the genotypes (coded as 0, 1 or 2). Each column of that matrix should be a different site and each row a different individual. |
indContribution |
a list where each entry contains the information for a single population. Each entry should be a matrix, with as many rows as the number of individuals of that population. Each row contains the number of contributed reads for a given individual and across all sites. |
size |
a list with one entry per population. Each entry should be a vector containing the size (in number of diploid individuals) of each pool. Thus, if a population was sequenced using a single pool, the vector should contain only one entry. If a population was sequenced using two pools, each with 10 individuals, this vector should contain two entries and both will be 10. |
error |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
Details
Note that this function will not work as intended if the input consists of multiple loci.
Value
a list with one entry per population. Each entry contains the number of reference allele reads for the individuals of that population and for that locus. Different individuals are in different rows and each columns represents a different site.
Examples
# simulate coverage at 5 SNPs for two populations, assuming 20x mean coverage
reads <- simulateCoverage(mean = c(20, 20), variance = c(100, 100), nSNPs = 5, nLoci = 1)
# simulate the number of reads contributed by each individual
# for each population there are two pools, each with 5 individuals
indContribution <- popsReads(list_np = rep(list(rep(5, 2)), 2), coverage = reads, pError = 5)
# set seed and create a random matrix of genotypes for the 20 individuals - 10 per population
set.seed(10)
genotypes <- matrix(rpois(100, 0.5), nrow = 20)
# simulate the number of reference reads for the two populations
numberReferencePop(genotypes = genotypes, indContribution = indContribution,
size = rep(list(rep(5, 2)), 2), error = 0.01)
Create 'synchronized' file from Pool-seq data
Description
Creates and saves a file with the information from Pool-seq data coded in the 'synchronized' format.
Usage
pool2sync(reference, alternative, file, pos = NULL)
Arguments
reference |
is a list where each entry corresponds to a different locus. Each list entry is a vector with the number of reads with the reference allele. Each entry of the vector corresponds to a different SNP. This list can have a single entry if the data is comprised of a single locus. |
alternative |
is a list where each entry corresponds to a different locus. Each list entry is a vector with the number of reads with the alternative allele. Each entry of the vector corresponds to a different SNP. This list can have a single entry if the data is comprised of a single locus. |
file |
is a character string naming the file to write to. |
pos |
is an optional input (default is NULL). If the actual position of the SNPs are known, they can be used as input here. When working with a single locus, this should be a numeric vector with each entry corresponding to the position of each SNP. If the data has multiple loci, this should be a list where each entry is a numeric vector with the position of the SNPs for a different locus. |
Details
It starts by converting the number of reads with the reference
allele
and the alternative
allele to a
A-count:T-count:C-count:G-count:N-count:deletion-count string. Here, we
assume that the reference allele is always A and the alternative is always T.
Then, this A-count:T-count:C-count:G-count:N-count:deletion-count string is combined with other necessary information such as the chromosome of each SNP, the position of the SNP and the reference character. This step creates a data frame where each row corresponds to a different SNP.
A file
is then created and saved in the current working directory,
with the Pool-seq data coded in the 'synchronized' file format.
Value
a file in the current working directory containing Pool-seq data in the 'synchronized' format.
Examples
# simulate Pool-seq data for 100 individuals sampled at a single locus
genotypes <- run_scrm(nDip = 100, nloci = 1, theta = 5)
# simulate Pool-seq data assuming a coverage of 100x and two pools of 50 individuals each
pool <- simPoolseq(genotypes = genotypes, pools = c(50, 50), pError = 100, sError = 0.001,
mCov = 100, vCov = 250, min.minor = 0)
# create a 'synchronized' file of the simulated data - this will create a txt file
# pool2sync(reference = pool$reference, alternative = pool$alternative, file = "mysync.txt")
# simulate Pool-seq data for 10 individuals sampled at 5 loci
genotypes <- run_scrm(nDip = 10, nloci = 5, theta = 5)
# simulate Pool-seq data assuming a coverage of 100x and a single pool of 10 individuals
pool <- simPoolseq(genotypes = genotypes, pools = 10, pError = 100, sError = 0.001,
mCov = 100, vCov = 250, min.minor = 0)
# create a 'synchronized' file of the simulated data - this will create a txt file
# pool2sync(reference = pool$reference, alternative = pool$alternative, file = "mysync.txt")
Create VCF file from Pool-seq data
Description
Creates and saves a file with the information from Pool-seq data coded in the VCF format.
Usage
pool2vcf(reference, alternative, total, file, pos = NULL)
Arguments
reference |
is a list where each entry corresponds to a different locus. Each list entry is a vector with the number of reads with the reference allele. Each entry of the vector corresponds to a different SNP. This list can have a single entry if the data is comprised of a single locus. |
alternative |
is a list where each entry corresponds to a different locus. Each list entry is a vector with the number of reads with the alternative allele. Each entry of the vector corresponds to a different SNP. This list can have a single entry if the data is comprised of a single locus. |
total |
is a list where each entry corresponds to a different locus. Each list entry is a vector with the total number of reads observed at each SNP. Each entry of the vector corresponds to a different SNP. This list can have a single entry if the data is comprised of a single locus. |
file |
is a character string naming the file to write to. |
pos |
is an optional input (default is NULL). If the actual position of the SNPs are known, they can be used as input here. When working with a single locus, this should be a numeric vector with each entry corresponding to the position of each SNP. If the data has multiple loci, this should be a list where each entry is a numeric vector with the position of the SNPs for a different locus. |
Details
It starts by converting the number of reads with the reference
allele,
the alternative
allele and the total
depth of coverage to a
R,A:DP string. R is the number of reads of the reference allele, A is the
number of reads of the alternative allele and DP is the total depth of
coverage.
Then, this information coded as R,A:DP is combined with other necessary information such as the chromosome of each SNP, the position of the SNP and the quality of the genotype among others. This creates a data frame where each row corresponds to a different SNP.
A file
is then created and saved in the current working directory,
with the header lines that go above the table in a VCF file. Finally, the
data frame is appended to that file.
Value
a file in the current working directory containing Pool-seq data in the VCF format.
Examples
# simulate Pool-seq data for 100 individuals sampled at a single locus
genotypes <- run_scrm(nDip = 100, nloci = 1, theta = 5)
# simulate Pool-seq data assuming a coverage of 100x and two pools of 50 individuals each
pool <- simPoolseq(genotypes = genotypes, pools = c(50, 50), pError = 100, sError = 0.001,
mCov = 100, vCov = 250, min.minor = 0)
# create a vcf file of the simulated data - this will create a txt file
# pool2vcf(reference = pool$reference, alternative = pool$alternative,
# total = pool$total, file = "myvcf.txt")
# simulate Pool-seq data for 10 individuals sampled at 5 loci
genotypes <- run_scrm(nDip = 10, nloci = 5, theta = 5)
# simulate Pool-seq data assuming a coverage of 100x and a single pool of 10 individuals
pool <- simPoolseq(genotypes = genotypes, pools = 10, pError = 100, sError = 0.001,
mCov = 100, vCov = 250, min.minor = 0)
# create a vcf file of the simulated data - this will create a txt file
# pool2vcf(reference = pool$reference, alternative = pool$alternative,
# total = pool$total, file = "myvcf.txt")
Create Pooled DNA sequencing data for multiple populations
Description
This function combines the information for each individual of each population into information at the population level.
Usage
poolPops(nPops, nLoci, indContribution, readsReference)
Arguments
nPops |
An integer representing the total number of populations in the dataset. |
nLoci |
An integer that represents the total number of independent loci in the dataset. |
indContribution |
Either a list or a matrix (when dealing with a single locus). |
readsReference |
A list, where each entry contains the information for a
single locus. Each list entry should then have one separate entry per
population. Each of these entries should be a matrix, with each row
corresponding to a single individual and each column a different site.
Thus, each entry of the matrix contains the number of observed reads with
the reference allele for that individual at a given site. The output of the
|
Details
In other words, the information of all individuals in a given population is
combined into a single population value and this is done for the various
populations. In this situation, each entry of the indContribution
and
readsReference
lists should contain one entry per population - being, in
essence, a list within a list. Please note that this function is intended to
work for multiple populations and should not be used with a single
population.
Value
a list with three names entries
reference |
a list with one entry per locus. Each entry is a matrix with the number of reference allele reads for each population. Each column represents a different site and each row a different population. |
alternative |
a list with one entry per locus. Each entry is a matrix with the number of alternative allele reads for each population. Each column represents a different site and each row a different population. |
total |
a list with one entry per locus. Each entry is a matrix with the coverage of each population. Each column represents a different site and each row a different population. |
Examples
# simulate coverage at 5 SNPs for two populations, assuming 20x mean coverage
reads <- simulateCoverage(mean = c(20, 20), variance = c(100, 100), nSNPs = 5, nLoci = 1)
# simulate the number of reads contributed by each individual
# for each population there are two pools, each with 5 individuals
indContribution <- popsReads(list_np = rep(list(rep(5, 2)), 2), coverage = reads, pError = 5)
# set seed and create a random matrix of genotypes for the 20 individuals - 10 per population
set.seed(10)
genotypes <- matrix(rpois(100, 0.5), nrow = 20)
# simulate the number of reference reads for the two populations
readsReference <- numberReferencePop(genotypes = genotypes, indContribution = indContribution,
size = rep(list(rep(5, 2)), 2), error = 0.01)
# create Pooled DNA sequencing data for these two populations and for a single locus
poolPops(nPops = 2, nLoci = 1, indContribution = indContribution, readsReference = readsReference)
Probability of contribution of each pool
Description
This function computes the probability of contribution of each pool towards the total depth of coverage of a single population. If multiple pools where used to sequence a single population, it is possible that some pools contribute more than others.
Usage
poolProbs(nPools, vector_np, nSNPs, pError)
Arguments
nPools |
an integer indicating how many pools were used to sequence the population. |
vector_np |
is a vector where each entry contains the number of diploid individuals of a given pool. Thus, if a population was sequenced using two pools, each with 10 individuals, this vector would contain two entries and both will be 10. |
nSNPs |
an integer indicating how many SNPs exist in the data. |
pError |
an integer representing the value of the error associated with DNA pooling. This value is related with the unequal pool contribution towards the total number of reads of a population - the higher the value the more unequal are the pool contributions. |
Value
a matrix with the probabilities of contribution for each pool. Each row represents a different pool and each column is a different site.
Examples
# probability of contribution at 8 SNPs for 5 pools, each with 10 individuals
poolProbs(nPools = 5, vector_np = rep(10, 5), nSNPs = 8, pError = 50)
Reads contributed by each pool
Description
This function simulates the contribution, in terms of reads, of each pool. The number of reads contributed from all pools is equal to the total coverage of the population.
Usage
poolReads(nPools, coverage, probs)
Arguments
nPools |
an integer indicating how many pools were used to sequence the population. |
coverage |
a vector containing the total depth of coverage of the population. Each entry of the vector represents a different site. |
probs |
a matrix containing the probability of contribution of each pool
used to sequence the population. This matrix can be obtained with the
|
Value
a matrix with the number of reads contributed by each pool towards the total coverage of the population. Each row of the matrix is a different pool and each column a different site.
Examples
# simulate the probability of contribution of each pool
probs <- poolProbs(nPools = 5, vector_np = rep(10, 5), nSNPs = 8, pError = 50)
# simulate the number of reads contributed, assuming 10x coverage for each site
poolReads(nPools = 5, coverage = rep(10, 8), probs = probs)
Compute number of reads for each individual and across all sites
Description
This function computes the contribution of each individual towards the total coverage of a given population.
Usage
popReads(vector_np, coverage, pError)
Arguments
vector_np |
is a vector where each entry contains the number of diploid individuals of a given pool. Thus, if a population was sequenced using two pools, each with 10 individuals, this vector would contain two entries and both will be 10. |
coverage |
a vector containing the total depth of coverage of the population. Each entry of the vector represents a different site. |
pError |
an integer representing the value of the error associated with DNA pooling. This value is related with the unequal contribution of both individuals and pools towards the total number of reads observed for a given population - the higher the value the more unequal are the individual and pool contributions. |
Details
If multiple pools were used to sequence a population, this will compute the contribution of each pool and then use that to calculate how many reads does that pool contribute. Next, the probability of contribution of each individual is computed and utilized to calculate the number of reads that each individual contributes towards the total number of reads observed in the corresponding pool.
Value
a matrix with the number of reads contributed by each individual. Each row of the matrix corresponds to a different individual and each column to a different site.
Examples
# simulate number of reads contributed by each individual towards the total population coverage
# assuming a coverage of 10x at 5 sites and two pools, each with 5 individuals
popReads(vector_np = c(5, 5), coverage = rep(10, 5), pError = 100)
Simulate total number of reads for multiple populations
Description
Simulates the contribution of each individual towards the total coverage of its population.
Usage
popsReads(list_np, coverage, pError)
Arguments
list_np |
is a list where each entry corresponds to a different population. Each entry is a vector and each vector entry contains the number of diploid individuals of a given pool. Thus, if a population was sequenced using two pools, each with 10 individuals, this vector would contain two entries and both will be 10. |
coverage |
a matrix containing the total depth of coverage of all populations. Each row corresponds to a different population and each column to a different site. |
pError |
an integer representing the value of the error associated with DNA pooling. This value is related with the unequal contribution of both individuals and pools towards the total number of reads observed for a given population - the higher the value the more unequal are the individual and pool contributions. |
Details
If multiple pools were used to sequence a population, this will compute the contribution of each pool and then use that to calculate how many reads does that pool contribute. Next, the probability of contribution of each individual is computed and utilized to calculate the number of reads that each individual contributes towards the total number of reads observed in the corresponding pool. These steps will be performed for each population, thus obtaining the number of reads contributed by each individual for each population.
Value
a list with one entry per population. Each entry represents the number of reads contributed by each individual towards the total coverage of its population. Different individuals correspond to different rows and different sites to different columns.
Examples
# simulate coverage for two populations sequenced at 10x at 5 sites
reads <- simulateCoverage(mean = c(10, 10), variance = c(20, 20), nSNPs = 5, nLoci = 1)
# simulate the individual contribution towards that coverage
# assuming that the first population was sequenced using two pools of 5 individuals
# and the second using a single pool with 10 individuals
popsReads(list_np = list(c(5, 5), 10), coverage = reads, pError = 5)
Apply a minor allele reads threshold
Description
Removes sites where the total number of minor-allele reads is below a certain threshold.
Usage
removeSites(freqs, alternative, coverage, minor, min.minor)
Arguments
freqs |
a vector of allele frequencies where each entry corresponds to a different site. |
alternative |
a matrix with the number of reads with the alternative allele. Each row should be a different population and each column a different site. |
coverage |
a matrix with the total coverage. Each row should be a different population and each column a different site. |
minor |
a matrix with the number of minor-allele reads. Each row should be a different population and each column a different site. |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
Details
If a site has less minor-allele reads than min.minor
across all
populations, that site is removed from the data.
Value
a list with three named entries:
freqs |
is a vector with the allele frequencies minus the frequency of the removed sites. |
alternative |
is a matrix with the number of alternative-allele reads per site, minus any removed sites. |
coverage |
is a matrix with the depth of coverage minus the coverage of the removed sites. |
Examples
# create a vector of allele frequencies
freqs <- runif(20)
set.seed(10)
# create a matrix with the number of reads with the alternative allele
alternative <- matrix(sample(x = c(0,5,10), size = 20, replace = TRUE), nrow = 1)
# create a matrix with the depth of coverage
coverage <- matrix(sample(100:150, size = 20), nrow = 1)
# the number of reads with the reference allele is obtained by subtracting
# the number of alternative allele reads from the depth of coverage
reference <- coverage - alternative
# find the minor allele at each site
minor <- findMinor(reference = reference, alternative = alternative, coverage = coverage)
# keep only the matrix with the minor allele reads
minor <- minor[["minor"]]
# remove sites where the number of minor-allele reads is below the threshold
removeSites(freqs = freqs, alternative = alternative, coverage = coverage,
minor = minor, min.minor = 2)
Apply a coverage-based filter over a list
Description
This function removes sites that have a coverage below a minimum
value and
sites with a coverage above a maximum
value. This is done over multiple
loci, assuming that each entry of the reads
list is a different locus. If a
list of genotypes is also supplied, then those same sites are also removed
from each locus of the genotypes.
Usage
remove_by_reads(nLoci, reads, minimum, maximum, genotypes = NA)
Arguments
nLoci |
an integer that represents how many independent loci were simulated. |
reads |
a list with the total depth of coverage. Each entry of the list should be a matrix corresponding to a different locus. Each row of that matrix should be the coverage of a different population and each column a different site. |
minimum |
an integer representing the minimum coverage allowed. Sites where any population has a depth of coverage below this threshold are removed from the data. |
maximum |
an integer representing the maximum coverage allowed. Sites where any population has a depth of coverage above this threshold are removed from the data. |
genotypes |
an optional list input with the genotypes. Each entry of the list should be a matrix corresponding to a different locus. Each column of the matrix should be a different site and each row a different individual. |
Value
a list with the total depth of coverage similar to the reads
input
argument but without sites where the coverage was below the minimum
or
above the maximum
. If the genotypes were included, a second list entry
will also be included in the output, containing the genotypes minus the
sites that were removed.
Examples
set.seed(10)
# simulate coverage for 10 locus
reads <- simulateCoverage(mean = c(25, 25), variance = c(200, 200), nSNPs = 10, nLoci = 10)
# remove sites with coverage below 10x or above 100x
reads <- remove_by_reads(nLoci = 10, reads = reads, minimum = 5, maximum = 100)
# notice that some locus no longer have 10 SNPs - those sites were removed
reads
Apply a coverage-based filter to a matrix
Description
This function removes sites that have a coverage below a minimum
value and
sites with a coverage above a maximum
value. If a matrix of genotypes is
also supplied, then those same sites are also removed from that matrix.
Usage
remove_by_reads_matrix(reads, minimum, maximum, genotypes = NA)
Arguments
reads |
a matrix with the total depth of coverage. Each row of the matrix should be the coverage of a different population and each column a different site. |
minimum |
an integer representing the minimum coverage allowed. Sites where any population has a depth of coverage below this threshold are removed from the data. |
maximum |
an integer representing the maximum coverage allowed. Sites where any population has a depth of coverage above this threshold are removed from the data. |
genotypes |
an optional matrix input with the genotypes. Each column of the matrix should be a different site and each row a different individual. |
Value
a matrix with the total depth of coverage minus the sites (i.e. columns) where the coverage for any of the populations was below the minimum or above the maximum. If genotypes were supplied, then the output will be a list, with one entry per locus. Each entry will contain the filtered coverage in the first entry and the genotypes, minus the removed sites, in the second entry.
Examples
set.seed(10)
# simulate coverage for a single locus - select the first entry to obtain a matrix
reads <- simulateCoverage(mean = c(25, 25), variance = c(200, 200), nSNPs = 10, nLoci = 1)[[1]]
# check the coverage matrix
reads
# remove sites with coverage below 10x or above 100x
remove_by_reads_matrix(reads = reads, minimum = 10, maximum = 100)
Simulate a single population
Description
Simulates the evolution of biological sequences for a single population with variable theta values.
Usage
run_scrm(nDip, nloci, theta = 10)
Arguments
nDip |
an integer representing the total number of diploid individuals
to simulate. Note that |
nloci |
is an integer that represents how many independent loci should be simulated. |
theta |
a value for the mutation rate assuming theta = 4Nu, where u is the neutral mutation rate per locus. |
Value
a list with genotypes. Each entry of the list corresponds to a different locus. For each locus, the genotypes are in a matrix, with each row representing a different individual and each column a different site.
Examples
run_scrm(nDip = 100, nloci = 10)
run_scrm(nDip = 100, nloci = 10, theta = 5)
Simulate Pool-seq data
Description
Simulates pooled sequencing data given a set of parameters and individual genotypes.
Usage
simPoolseq(
genotypes,
pools,
pError,
sError,
mCov,
vCov,
min.minor,
minimum = NA,
maximum = NA
)
Arguments
genotypes |
a list of genotypes, where each entry is a matrix corresponding to a different locus. At each matrix, each column is a different SNP and each row is a different individual. Genotypes should be coded as 0, 1 or 2. |
pools |
a list with a vector containing the size (in number of diploid individuals) of each pool. Thus, if a population was sequenced using a single pool, the vector should contain only one entry. If a population was sequenced using two pools, each with 10 individuals, this vector should contain two entries and both will be 10. |
pError |
an integer representing the value of the error associated with DNA pooling. This value is related with the unequal contribution of both individuals and pools towards the total number of reads observed for a given population - the higher the value the more unequal are the individual and pool contributions. |
sError |
a numeric value with error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
mCov |
an integer that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. |
vCov |
an integer that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. |
min.minor |
is an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have less minor-allele reads than this threshold will be removed from the data. |
minimum |
an optional integer representing the minimum coverage allowed. Sites where the population has a depth of coverage below this threshold are removed from the data. |
maximum |
an optional integer representing the maximum coverage allowed. Sites where the population has a depth of coverage above this threshold are removed from the data. |
Details
Note that this functions allows for different combinations of parameters.
Thus, Pool-seq data can be simulated for a variety of parameters. For
instance, different mean depths of coverage can be used to simulate Pool-seq
data. It is also possible to simulate Pool-seq data using different pool
sizes (by changing the pools
input) and different values of the
Pool-seq error parameter (pError
).
Value
a list with three named entries:
reference |
a list with one entry per locus. Each entry is a matrix with the number of reference allele reads. Each column represents a different site. |
alternative |
a list with one entry per locus. Each entry is a matrix with the number of alternative allele reads. Each column represents a different site. |
total |
a list with one entry per locus. Each entry is a matrix with the total depth of coverage. Each column represents a different site. |
Examples
# simulate Pool-seq data for 100 individuals sampled at a single locus
genotypes <- run_scrm(nDip = 100, nloci = 1, theta = 5)
# simulate Pool-seq data assuming a coverage of 100x and two pools of 50 individuals each
simPoolseq(genotypes = genotypes, pools = c(50, 50), pError = 100, sError = 0.001,
mCov = 100, vCov = 250, min.minor = 0)
# simulate Pool-seq data for 10 individuals sampled at 5 loci
genotypes <- run_scrm(nDip = 10, nloci = 5, theta = 5)
# simulate Pool-seq data assuming a coverage of 100x and a single pool of 10 individuals
simPoolseq(genotypes = genotypes, pools = 10, pError = 100, sError = 0.001,
mCov = 100, vCov = 250, min.minor = 0)
Simulate coverage at a single locus
Description
Simulates the total number of reads, for each polymorphic site of a given locus using a negative binomial distribution.
Usage
simReads(mean, variance, nSNPs = NA, genotypes = NA)
Arguments
mean |
an integer that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. If a vector is supplied instead, the function assumes that each entry of the vector is the mean for a different population. |
variance |
an integer that defines the variance of the depth of coverage across all sites. If a vector is supplied instead, the function assumes that each entry of the vector is the variance for a different population. |
nSNPs |
an integer representing the number of polymorphic sites per
locus to simulate. This is an optional input but either this or the
|
genotypes |
a matrix of simulated genotypes, where each column is a
different SNP and each row is a different individual. This is an optional
input but either this or the |
Details
The total number of reads is simulated with a negative binomial and according
to a user-defined mean depth of coverage and variance. This function is
intended to work with a matrix of genotypes, simulating the depth of coverage
for each site present in the genotypes. However, it can also be used to
simulate coverage distributions independent of genotypes, by choosing how
many sites should be simulated (with the nSNPs
option).
Value
a matrix with the total coverage per population and per site. Different rows represent different populations and each column is a different site.
Examples
# coverage for one population at 10 sites
simReads(mean = 20, variance = 100, nSNPs = 10)
# simulate coverage at one locus with 10 SNPs for two populations:
# the first with 100x and the second with 50x
simReads(mean = c(100, 50), variance = c(250, 150), nSNPs = 10)
Simulate total number of reads per site
Description
This function simulates the total number of reads, for each polymorphic site using a negative binomial distribution.
Usage
simulateCoverage(mean, variance, nSNPs = NA, nLoci = NA, genotypes = NA)
Arguments
mean |
an integer that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. If a vector is supplied instead, the function assumes that each entry of the vector is the mean for a different population. |
variance |
an integer that defines the variance of the depth of coverage across all sites. If a vector is supplied instead, the function assumes that each entry of the vector is the variance for a different population. |
nSNPs |
an integer representing the number of polymorphic sites per
locus to simulate. This is an optional input but either this or the
|
nLoci |
an optional integer that represents how many independent loci should be simulated. |
genotypes |
a list of simulated genotypes, where each entry is a matrix
corresponding to a different locus. At each matrix, each column is a
different SNP and each row is a different individual. This is an optional
input but either this or the |
Details
The total number of reads is simulated with a negative binomial and according
to a user-defined mean depth of coverage and variance. This function is
intended to work with a list of genotypes, simulating the depth of coverage
for each site present in the genotypes. However, it can also be used to
simulate coverage distributions independent of genotypes, by choosing how
many loci to simulate (with the nLoci
option) and choosing how many sites
per locus should be simulated (with the nSNPs
option).
Value
a list with the total coverage per population and per site. Each list entry is a matrix corresponding to a different locus. For each matrix, different rows represent different populations and each column is a different site.
Examples
# simulate 10 loci, each with 10 SNPs for a single population
simulateCoverage(mean = 100, variance = 250, nSNPs = 10, nLoci = 10)
# simulate 10 loci, each with 10 SNPs for two populations:
# the first with 100x and the second with 50x
simulateCoverage(mean = c(100, 50), variance = c(250, 150), nSNPs = 10, nLoci = 10)
# simulate coverage given a set of genotypes
# run scrm and obtain genotypes
genotypes <- run_scrm(nDip = 100, nloci = 10)
# simulate coverage
simulateCoverage(mean = 50, variance = 200, genotypes = genotypes)
Split matrix of genotypes
Description
This function splits a matrix into different list entries. The matrix is
split according to a set of row indexes defined by the size
input.
Usage
splitMatrix(matrix, size)
Arguments
matrix |
is obviously a matrix, ideally one containing genotypes. Each column of the matrix should be a different site and each row a different individual. |
size |
a list with one entry per population. Each entry should be a vector containing the size (in number of diploid individuals) of each pool. Thus, if a population was sequenced using a single pool, the vector should contain only one entry. If a population was sequenced using two pools, each with 10 individuals, this vector should contain two entries and both will be 10. |
Details
The size
input is utilized to create the index of the rows that go into the
different list entries. It specifies the size, in terms of number of
individuals, of each population.
Value
a list with one entry per entry of the size input argument. Each entry contains the information of the original matrix for the number of individuals specified by the corresponding entry of the size input argument.
Examples
set.seed(10)
# create a random matrix
mymatrix <- matrix(round(runif(n = 50, min = 10, max = 25)), nrow = 10)
# split that matrix assuming 8 individuals in the first population and two in the second
splitMatrix(matrix = mymatrix, size = list(8, 2))
Create sync string for a single SNP
Description
Creates a string with the information for a single SNP. The information is coded as A-count:T-count:C-count:G-count:N-count:deletion-count. Note that we assume that the reference allele is always A and the alternative is always T.
Usage
strg2sync(reference, alternative)
Arguments
reference |
an integer representing the number of reads with the reference allele. |
alternative |
an integer representing the number of reads with the alternative allele. |
Value
a character string coded as A-count:T-count:C-count:G-count:N-count:deletion-count.
Create vcf string for a single SNP
Description
Creates a string with the information for a single SNP. The information is coded as R,A:DP. R is the number of reads of the reference allele, A is the number of reads of the alternative allele and DP is the total depth of coverage.
Usage
strg2vcf(reference, alternative, total)
Arguments
reference |
an integer representing the number of reads with the reference allele |
alternative |
an integer representing the number of reads with the alternative allele |
total |
an integer representing the total number of reads observed at this SNP. |
Value
a character string coded as R,A:DP.
Create vcf table with relevant information
Description
Creates a data frame in the VCF format for all SNPs and across all loci in the data set.
Usage
vcfinfo(string, pos = NULL)
Arguments
string |
is a character vector or a list where each entry contains a
character vector for a different locus. Each entry of this character vector
contains the information for a single SNP coded as R,A:DP. The output of
the |
pos |
is an optional input (default is NULL). If the actual position of the SNPs are known, they can be used as input here. When working with a single locus, this should be a numeric vector with each entry corresponding to the position of each SNP. If the data has multiple loci, this should be a list where each entry is a numeric vector with the position of the SNPs for a different locus. |
Details
This function combines the information coded as R,A:DP with other necessary information such as the chromosome of each SNP, the position of the SNP and the quality of the genotype among others. Note that in the character string, R is the number of reads of the reference allele, A is the number of reads of the alternative allele and DP is the total depth of coverage. Each row of the data frame corresponds to a different SNP.
Value
a data frame with 10 different columns
chr |
Chromosome. Each locus is treated as different linkage group. |
pos |
Co-ordinate. The coordinate of the SNP. |
ID |
Identifier. |
REF |
Reference allele. We assume that the reference allele is always an A. Note that this is not necessarily the major allele. |
ALT |
Alternative allele. We assume that the alternative allele is always a T. |
QUAL |
Quality score out of 100. We assume that this score is always 100. |
FILTER |
If this SNP passed quality filters. |
INFO |
Further information. Provides further information on the variants. |
FORMAT |
Information about the following columns. This column tells us how the number of reads is coded in the next column. |
pop1 |
Number of reference-allele reads, alternative-allele reads and total depth of coverage observed for this population at this SNP. |
Create vcf string for all SNPs in multiple loci
Description
Creates a string with the information for all SNPs across multiple loci. The information is coded as R,A:DP. R is the number of reads of the reference allele, A is the number of reads of the alternative allele and DP is the total depth of coverage. Each entry of the character string corresponds to a different SNP and each entry of the list to a different locus.
Usage
vcfloci(reference, alternative, total)
Arguments
reference |
is a list where each entry corresponds to a different locus. Each list entry is a vector with the number of reads with the reference allele. Each entry of the vector corresponds to a different SNP. |
alternative |
is a list where each entry corresponds to a different locus. Each list entry is a vector with the number of reads with the alternative allele. Each entry of the vector corresponds to a different SNP. |
total |
is a list where each entry corresponds to a different locus. Each list entry is a vector with the total number of reads observed at each SNP. Each entry of the vector corresponds to a different SNP. |
Value
is a list where each entry corresponds to a different locus. Each entry of the list is a character vector with as many entries as the number of SNPs in the locus. Each entry of this character vector contains the information for a single SNP coded as R,A:DP.
Create vcf string for all SNPs in a single locus
Description
Creates a string with the information for all SNPs. The information is coded as R,A:DP. R is the number of reads of the reference allele, A is the number of reads of the alternative allele and DP is the total depth of coverage. Each entry of the character string corresponds to a different SNP.
Usage
vcflocus(reference, alternative, total)
Arguments
reference |
is a vector with the number of reads with the reference allele. Each entry of the vector corresponds to a different SNP. |
alternative |
is a vector with the number of reads with the alternative allele. Each entry of the vector corresponds to a different SNP. |
total |
is a vector with the total number of reads observed at each SNP. Each entry of the vector corresponds to a different SNP. |
Value
is a character vector with as many entries as the number of SNPs in the locus. Each entry of this character vector contains the information for a single SNP coded as R,A:DP.