Type: Package
Title: Fast Principal Component Analysis for Outlier Detection
Version: 4.4.0
Date: 2024-09-24
Description: Methods to detect genetic markers involved in biological adaptation. 'pcadapt' provides statistical tools for outlier detection based on Principal Component Analysis. Implements the method described in (Luu, 2016) <doi:10.1111/1755-0998.12592> and later revised in (Privé, 2020) <doi:10.1093/molbev/msaa053>.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
Imports: bigutilsr (≥ 0.3), data.table, ggplot2, magrittr, mmapcharr (≥ 0.3), Rcpp (≥ 0.12.8), RSpectra
LinkingTo: mmapcharr, Rcpp, rmio
Suggests: plotly, shiny, spelling, testthat, vcfR
RoxygenNote: 7.2.3
Encoding: UTF-8
Language: en-US
URL: https://github.com/bcm-uga/pcadapt
BugReports: https://github.com/bcm-uga/pcadapt/issues
NeedsCompilation: yes
Packaged: 2024-09-24 12:52:57 UTC; au639593
Author: Keurcien Luu [aut], Michael Blum [aut], Florian Privé [aut, cre], Eric Bazin [ctb], Nicolas Duforet-Frebourg [ctb]
Maintainer: Florian Privé <florian.prive.21@gmail.com>
Repository: CRAN
Date/Publication: 2024-09-24 13:10:05 UTC
Convert a bed to a matrix
Description
Convert a bed to a matrix
Usage
bed2matrix(bedfile, n = NULL, p = NULL)
Arguments
bedfile |
Path to a bed file. |
n |
Number of samples. Default reads it from corresponding fam file. |
p |
Number of SNPs. Default reads it from corresponding bim file. |
Value
An integer matrix.
Examples
bedfile <- system.file("extdata", "geno3pops.bed", package = "pcadapt")
mat <- bed2matrix(bedfile)
dim(mat)
table(mat)
Get the principal component most associated with a genetic marker
Description
get.pc returns a data frame in which each row contains the index of a genetic marker and the principal component most correlated with it.
Usage
get.pc(x, list)
Arguments
x |
an object of class 'pcadapt'. |
list |
a list of integers corresponding to the indices of the markers of interest. |
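The idea of "the principal component most correlated with a marker" can be sketched in base R. This is only an illustration on simulated genotypes, assuming that "most correlated" means the largest absolute correlation between a marker's genotypes and the PC scores; the names geno, markers and top_pc are hypothetical, and this is not the package's own code.

```r
set.seed(1)
n <- 50   # individuals
p <- 30   # genetic markers
# Simulated diploid genotypes coded 0/1/2 (individuals in rows)
geno <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n, ncol = p)

# PCA on the centered genotypes; keep the scores of the first K components
K <- 2
scores <- prcomp(geno)$x[, 1:K]

# Markers of interest (hypothetical indices)
markers <- c(3, 7, 12)

# Correlation of each selected marker with each retained PC (3 x K matrix)
r <- cor(geno[, markers], scores)

# For each marker, the PC with the largest absolute correlation
top_pc <- unname(apply(abs(r), 1, which.max))
result <- data.frame(SNP = markers, PC = top_pc)
print(result)
```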
Retrieve population names
Description
get.pop.names retrieves the population names from the population file.
Usage
get.pop.names(pop)
Arguments
pop |
a list of integers or strings specifying which population the individuals belong to. |
Examples
## see also ?pcadapt for examples
Population colorization
Description
get.score.color allows the user to display individuals of the same predefined population with the same color when using the option "scores" in pcadapt.
Usage
get.score.color(pop)
Arguments
pop |
a list of integers or strings specifying which population the individuals belong to. |
Examples
## see also ?pcadapt for examples
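What the two helpers above are for can be pictured with a small base-R sketch: extract the distinct population labels, then give every individual the color of its population. This only mimics the behavior described here under that assumption; pop, pop_names and ind_cols are hypothetical names, and the palette choice (rainbow) is arbitrary.

```r
# One population label per individual (hypothetical data)
pop <- c("A", "A", "B", "C", "B", "C", "A")

# Distinct population names, in order of first appearance
pop_names <- unique(pop)

# One color per population, then one color per individual
palette_cols <- rainbow(length(pop_names))
ind_cols <- palette_cols[match(pop, pop_names)]

print(data.frame(pop = pop, col = ind_cols))
```

A vector such as ind_cols is the kind of per-individual coloring one would expect a score plot to use.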
pcadapt statistics
Description
get_statistics returns chi-squared distributed statistics.
Usage
get_statistics(zscores, method, pass)
Arguments
zscores |
a numeric matrix containing the z-scores. |
method |
a character string specifying the method to be used to compute
the p-values. Two statistics are currently available, "mahalanobis" and "componentwise". |
pass |
a boolean vector. |
Value
The returned value is a list containing the test statistics and the associated p-values.
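As a rough illustration of how z-scores can be turned into chi-squared statistics and p-values in the component-wise case, the base-R sketch below squares simulated z-scores, rescales each component by a median-based inflation factor, and reads p-values from a chi-squared distribution with 1 degree of freedom. The simulated data and the exact rescaling are assumptions made for the example, not a copy of the package's code.

```r
set.seed(42)
n_snp <- 1000
K <- 2
# Simulated, slightly inflated z-scores (one column per principal component)
zscores <- matrix(rnorm(n_snp * K, sd = 1.2), ncol = K)

# Squared z-scores are the raw test statistics
stat <- zscores^2

# Per-component inflation factor: observed median over the chi-squared(1) median
gif <- apply(stat, 2, median, na.rm = TRUE) / qchisq(0.5, df = 1)

# Rescaled statistics and upper-tail p-values
chi2_stat <- sweep(stat, 2, gif, "/")
pvalues <- pchisq(chi2_stat, df = 1, lower.tail = FALSE)

res <- list(stat = stat, gif = gif, chi2_stat = chi2_stat, pvalues = pvalues)
str(res)
```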
Neutral Distribution Estimation
Description
hist_plot plots the histogram of the chi-squared statistics, as well as the estimated null distribution.
Usage
hist_plot(x, K)
Arguments
x |
an output from |
K |
an integer indicating which principal component the histogram will be associated with. |
Manhattan Plot
Description
manhattan_plot displays a Manhattan plot which represents the p-values for each SNP for a particular test statistic.
Usage
manhattan_plot(x, chr.info, snp.info, plt.pkg = "ggplot", K = 1)
Arguments
x |
an object of class "pcadapt" generated with pcadapt. |
chr.info |
a list containing the chromosome information for each marker. |
snp.info |
a list containing the names of all genetic markers present in the input. |
plt.pkg |
a character string specifying the package to be used to
display the graphical outputs. Use |
K |
an integer specifying which principal component to display when
|
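The quantity shown on a Manhattan plot is usually -log10 of the p-value plotted against marker position. The base-R sketch below builds that table from simulated p-values with two planted signals; the data, the column names and the plotting call are illustrative assumptions, independent of the package.

```r
set.seed(7)
# Simulated p-values for 500 markers, with two strong signals planted
pvalues <- runif(500)
pvalues[c(100, 250)] <- c(1e-9, 1e-12)

# One row per marker: position index and -log10(p)
manhattan <- data.frame(
  index = seq_along(pvalues),
  neglog10p = -log10(pvalues)
)

# The two markers that stand out most
top <- manhattan[order(-manhattan$neglog10p)[1:2], ]
print(top)

# Draw the plot only in an interactive session
if (interactive()) {
  plot(manhattan$index, manhattan$neglog10p,
       xlab = "SNP index", ylab = "-log10(p-value)", pch = 19, cex = 0.5)
}
```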
Principal Component Analysis for outlier detection
Description
pcadapt performs principal component analysis and computes p-values to test for outliers. The test for outliers is based on the correlations between genetic variation and the first K principal components. pcadapt also handles Pool-seq data, for which the statistical analysis is performed on the genetic marker frequencies. Returns an object of class pcadapt.
Usage
pcadapt(
input,
K = 2,
method = "mahalanobis",
min.maf = 0.05,
ploidy = 2,
LD.clumping = NULL,
pca.only = FALSE,
tol = 1e-04
)
## S3 method for class 'pcadapt_matrix'
pcadapt(
input,
K = 2,
method = c("mahalanobis", "componentwise"),
min.maf = 0.05,
ploidy = 2,
LD.clumping = NULL,
pca.only = FALSE,
tol = 1e-04
)
## S3 method for class 'pcadapt_bed'
pcadapt(
input,
K = 2,
method = c("mahalanobis", "componentwise"),
min.maf = 0.05,
ploidy = 2,
LD.clumping = NULL,
pca.only = FALSE,
tol = 1e-04
)
## S3 method for class 'pcadapt_pool'
pcadapt(
input,
K = (nrow(input) - 1),
method = "mahalanobis",
min.maf = 0.05,
ploidy = NULL,
LD.clumping = NULL,
pca.only = FALSE,
tol
)
Arguments
input |
The output of function read.pcadapt. |
K |
an integer specifying the number of principal components to retain. |
method |
a character string specifying the method to be used to compute
the p-values. Two statistics are currently available, "mahalanobis" and "componentwise". |
min.maf |
Threshold of minor allele frequencies above which p-values are
computed. Default is |
ploidy |
Number of trials, parameter of the binomial distribution. Default is 2, which corresponds to diploidy, such as for the human genome. |
LD.clumping |
Default is NULL. |
pca.only |
a logical value indicating whether PCA results should be returned (before computing any statistic). |
tol |
Convergence criterion of |
Details
First, a principal component analysis is performed on the scaled and centered genotype data. Depending on the specified method, different test statistics can be used.
mahalanobis (default): the robust Mahalanobis distance is computed for each genetic marker using a robust estimate of both mean and covariance matrix between the K vectors of z-scores.
communality: the communality statistic measures the proportion of variance explained by the first K PCs. Deprecated in version 4.0.0.
componentwise: returns a matrix of z-scores.
To compute p-values, test statistics (stat) are divided by a genomic inflation factor (gif) when method="mahalanobis". When using method="mahalanobis", the scaled statistics (chi2_stat) should follow a chi-squared distribution with K degrees of freedom. When using method="componentwise", the squared z-scores should follow a chi-squared distribution with 1 degree of freedom. For Pool-seq data, pcadapt provides p-values based on the Mahalanobis distance for each SNP.
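The Mahalanobis route described above can be sketched in base R. Note the swap: the package uses a robust estimate of the mean and covariance, whereas this sketch uses the classical estimates from colMeans and cov purely to stay dependency-free. The simulated z-scores and the median-based definition of the inflation factor are likewise assumptions of the example.

```r
set.seed(3)
n_snp <- 2000
K <- 2
# Simulated z-scores: one row per marker, one column per principal component
zscores <- matrix(rnorm(n_snp * K), ncol = K) * 1.3

# Squared Mahalanobis distance of each marker (classical, NOT robust, estimates)
stat <- mahalanobis(zscores, center = colMeans(zscores), cov = cov(zscores))

# Inflation factor: observed median over the chi-squared(K) median
gif <- median(stat) / qchisq(0.5, df = K)

# Rescaled statistics are compared to a chi-squared with K degrees of freedom
chi2_stat <- stat / gif
pvalues <- pchisq(chi2_stat, df = K, lower.tail = FALSE)

summary(pvalues)
```

By construction the median of the rescaled statistics matches the median of the reference chi-squared distribution, which is the purpose of dividing by the inflation factor.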
Value
The returned value is an object of class pcadapt.
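A typical call sequence might look like the sketch below, which reads the bed file bundled with the package, runs pcadapt with K = 2, and flags markers after a Benjamini-Hochberg correction. It assumes that the returned object stores its p-values in an element named pvalues; the 0.1 threshold is an arbitrary choice for the example. The code is skipped when the package is not installed.

```r
x <- NULL
if (requireNamespace("pcadapt", quietly = TRUE)) {
  library(pcadapt)

  # Example bed file shipped with the package
  bedfile <- system.file("extdata", "geno3pops.bed", package = "pcadapt")
  input <- read.pcadapt(bedfile, type = "bed")

  # PCA-based scan with two principal components (default Mahalanobis method)
  x <- pcadapt(input, K = 2)

  # Assumed element name: x$pvalues holds one p-value per marker
  padj <- p.adjust(x$pvalues, method = "BH")
  outliers <- which(padj < 0.1)   # hypothetical threshold
  print(length(outliers))
}
```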
Convert ped files
Description
ped2pcadapt converts ped files to the pcadapt format.
Usage
ped2pcadapt(input, output)
Arguments
input |
a character string specifying the name of the file to be converted. |
output |
a character string specifying the name of the output file. |
pcadapt visualization tool
Description
plot.pcadapt is a method designed for objects of class pcadapt. It provides plotting options for quick visualization of pcadapt objects. Different options are currently available: "screeplot", "scores", "stat.distribution", "manhattan" and "qqplot".
"screeplot" shows the decay of the genotype matrix singular values and provides a figure to help with the choice of K.
"scores" plots the projection of the individuals onto the first two principal components.
"stat.distribution" displays the histogram of the selected test statistics, as well as the estimated distribution for the neutral SNPs.
"manhattan" draws the Manhattan plot of the p-values associated with the statistic of interest.
"qqplot" draws a Q-Q plot of the p-values associated with the statistic of interest.
Usage
## S3 method for class 'pcadapt'
plot(
x,
...,
option = "manhattan",
i = 1,
j = 2,
pop,
col,
chr.info = NULL,
snp.info = NULL,
plt.pkg = "ggplot",
K = NULL
)
Arguments
x |
an object of class "pcadapt" generated with pcadapt. |
... |
... |
option |
a character string specifying the figures to be displayed. If
|
i |
an integer indicating onto which principal component the individuals
are projected when the "scores" option is chosen.
Default value is set to 1. |
j |
an integer indicating onto which principal component the individuals
are projected when the "scores" option is chosen.
Default value is set to 2. |
pop |
a list of integers or strings specifying which subpopulation the individuals belong to. |
col |
a list of colors to be used in the score plot. |
chr.info |
a list containing the chromosome information for each marker. |
snp.info |
a list containing the names of all genetic markers present in the input. |
plt.pkg |
a character string specifying the package to be used to
display the graphical outputs. Use |
K |
an integer specifying the principal component of interest. |
Examples
## see ?pcadapt for examples
Summary
Description
print_convert prints out a summary of the file conversion.
Usage
print_convert(input, output, M, N, pool)
Arguments
input |
a genotype matrix or a character string specifying the name of the file to be converted. |
output |
a character string specifying the name of the output file. |
M |
an integer specifying the number of genetic markers present in the data. |
N |
an integer specifying the number of individuals present in the data. |
pool |
an integer specifying the type of data. '0' for genotype data, '1' for pooled data. |
p-values Q-Q Plot
Description
qq_plot plots a Q-Q plot of the p-values computed.
Usage
qq_plot(x, K = 1)
Arguments
x |
an output from |
K |
an integer specifying which principal component to display when |
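The coordinates behind a p-value Q-Q plot can be computed in a few lines of base R: sorted observed p-values are compared with the quantiles expected under a uniform distribution, both on the -log10 scale. The simulated p-values and variable names below are assumptions for illustration, not the package's internals.

```r
set.seed(11)
# Simulated p-values, with one missing value as can occur for filtered markers
pvalues <- c(runif(999), NA)

# Drop missing values and sort in increasing order
p <- sort(pvalues[!is.na(pvalues)])

# Expected uniform quantiles and observed values, on the -log10 scale
expected <- -log10(ppoints(length(p)))
observed <- -log10(p)

# Under the null hypothesis the points should lie close to the diagonal
if (interactive()) {
  plot(expected, observed,
       xlab = "Expected -log10(p-value)", ylab = "Observed -log10(p-value)")
  abline(0, 1, col = "red")
}
```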
File Converter
Description
read.pcadapt converts genotype matrices or files to an appropriate format readable by pcadapt. For a file as input, you can choose to return either a matrix or convert it to bed/bim/fam files. For a matrix as input, this returns a matrix.
Usage
read.pcadapt(
input,
type = c("pcadapt", "lfmm", "vcf", "bed", "ped", "pool", "example"),
type.out = c("bed", "matrix"),
allele.sep = c("/", "|"),
pop.sizes,
ploidy,
local.env,
blocksize
)
Arguments
input |
A genotype matrix or a character string specifying the name of the file to be converted. Matrices should use NAs to encode missing values. To encode missing values in 'pcadapt' and 'lfmm' files, 9s should be used. |
type |
A character string specifying the type of data to be converted from. Converters from 'vcf' and 'ped' formats are not maintained anymore; if you have any issue with those, please use PLINK >= 1.9 to convert them to the 'bed' format. |
type.out |
Either a bed file or a standard R matrix. If the input is a matrix, then the output is automatically a matrix (so that you don't need to specify this parameter). If the input is a bed file, then the output is also a bed file. |
allele.sep |
a vector of characters indicating what delimiters are used
in VCF files. By default, only "|" and "/" are recognized.
So, this argument is only useful for |
pop.sizes |
deprecated argument. |
ploidy |
deprecated argument. |
local.env |
deprecated argument. |
blocksize |
deprecated argument. |
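The two missing-value conventions mentioned for input (9 in 'pcadapt' and 'lfmm' files, NA in R matrices) can be reconciled with a one-line recoding. The small matrix below is made-up data and geno is a hypothetical name; the sketch only shows the recoding step, not a call into the package.

```r
# Genotypes as they might be read from a file that encodes missing values as 9
geno <- matrix(c(0, 1, 2, 9,
                 1, 9, 0, 2,
                 2, 2, 1, 0),
               nrow = 3, byrow = TRUE)

# Matrices should use NA for missing values, so recode the 9s
geno[geno == 9] <- NA

print(geno)
print(sum(is.na(geno)))   # number of missing genotypes
```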
Shiny app
Description
pcadapt comes with a Shiny interface.
Usage
run.pcadapt()
Principal Components Analysis Scores Plot
Description
score_plot plots the projection of the individuals onto the first two principal components.
Usage
score_plot(x, i = 1, j = 2, pop, col, plt.pkg = "ggplot")
Arguments
x |
an output from |
i |
an integer indicating onto which principal component the individuals
are projected when the "scores" option is chosen.
Default value is set to 1. |
j |
an integer indicating onto which principal component the individuals
are projected when the "scores" option is chosen.
Default value is set to 2. |
pop |
a list of integers or strings specifying which subpopulation the individuals belong to. |
plt.pkg |
a character string specifying the package to be used to
display the graphical outputs. Use |
Principal Components Analysis Scree Plot
Description
scree_plot plots the scree plot associated with the principal components analysis performed on the dataset. NB: pcadapt has to be run on the dataset in order to get an output readable by plot.screePlot.
Usage
scree_plot(x, K = NULL)
Arguments
x |
an output from |
K |
an integer specifying the number of components to plot. |
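A scree plot displays how much variance each of the leading components captures. The base-R sketch below derives those proportions from the singular values of a scaled, simulated genotype matrix; the data, the value K = 10 and the variable names are assumptions for illustration, and the computation is independent of the package.

```r
set.seed(5)
# Simulated genotypes: 40 individuals (rows) by 100 markers (columns)
geno <- matrix(rbinom(40 * 100, size = 2, prob = 0.4), nrow = 40)

# Center and scale each marker, then take the singular values
d <- svd(scale(geno))$d

# Proportion of variance captured by each of the first K components
K <- 10
prop_var <- d[1:K]^2 / sum(d^2)
print(round(prop_var, 3))

# The scree plot itself, drawn only in an interactive session
if (interactive()) {
  plot(seq_len(K), prop_var, type = "b",
       xlab = "PC", ylab = "Proportion of explained variance")
}
```

A marked drop followed by a flat tail is the usual visual cue for choosing how many components to keep.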
vcfR-based converter
Description
vcf2pcadapt uses the package vcfR to extract the genotype information from a vcf file and exports it under the format required by pcadapt.
Note that this VCF converter is not maintained anymore. Please use PLINK.
Usage
vcf2pcadapt(input, output = "tmp.pcadapt", allele.sep = c("/", "|"))
Arguments
input |
a character string specifying the name of the file to be converted. |
output |
a character string indicating the name of the output file. |
allele.sep |
a vector of characters indicating what delimiters are used to separate alleles. |
Convert vcfR genotype matrices
Description
vcf_convert converts outputs of extract.gt to the pcadapt format.
Usage
vcf_convert(string_geno, output, allele_sep)
Arguments
string_geno |
a genotype matrix extracted from a VCF file with 'vcfR'. |
output |
a character string indicating the name of the output file. |
allele_sep |
a vector of characters indicating what delimiters are used to separate alleles. |
Write PLINK files
Description
Function to write bed/bim/fam files from a pcadapt or an lfmm file.
Usage
writeBed(file, is.pcadapt)
Arguments
file |
A pcadapt or lfmm file. |
is.pcadapt |
a boolean value. |
Value
The input 'bedfile' path.