Title: | Estimate Kinship and FST under Arbitrary Population Structure |
Version: | 1.3.23 |
Description: | Provides functions to estimate the kinship matrix of individuals from a large set of biallelic SNPs, and extract inbreeding coefficients and the generalized FST (Wright's fixation index). Method described in Ochoa and Storey (2021) <doi:10.1371/journal.pgen.1009241>. |
LazyData: | true |
Depends: | R (≥ 3.5.0) |
Imports: | RColorBrewer, graphics, grDevices, ape |
Suggests: | BEDMatrix, testthat, knitr, rmarkdown, bnpsd |
License: | GPL-3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.3 |
VignetteBuilder: | knitr |
URL: | https://github.com/StoreyLab/popkin/ |
BugReports: | https://github.com/StoreyLab/popkin/issues |
NeedsCompilation: | no |
Packaged: | 2023-01-06 23:37:19 UTC; viiia |
Author: | Alejandro Ochoa |
Maintainer: | Alejandro Ochoa <alejandro.ochoa@duke.edu> |
Repository: | CRAN |
Date/Publication: | 2023-01-07 00:30:02 UTC |
A package for estimating kinship and FST under arbitrary population structure
Description
The heart of this package is the popkin()
function, which estimates the kinship matrix of all individual pairs from their genotype matrix.
Inbreeding coefficients, the generalized FST
, and the individual-level pairwise FST
matrix are extracted from the kinship matrix using inbr()
, fst()
, and pwfst()
, respectively.
fst()
accepts weights for individuals to balance subpopulations obtained with weights_subpops()
.
Kinship matrices can be renormalized (to change the most recent common ancestor population or MRCA) using rescale_popkin()
.
Coancestry matrices can be estimated from allele frequency matrices using popkin_af()
.
Lastly, kinship, coancestry, and pairwise FST matrices can be visualized using plot_popkin()
(with the help of inbr_diag()
for kinship matrices only).
Author(s)
Maintainer: Alejandro Ochoa alejandro.ochoa@duke.edu (ORCID)
Authors:
John D. Storey jstorey@princeton.edu (ORCID)
See Also
Useful links:
Examples
# estimate and visualize kinship and FST from a genotype matrix
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow = 3, byrow = TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
subpops2 <- 1:3 # alternate labels treat every individual as a different subpop
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# estimate the kinship matrix from the genotypes "X"!
# all downstream analysis require "kinship", none use "X" after this
kinship <- popkin(X, subpops) # calculate kinship from X and optional subpop labels
# plot the kinship matrix, marking the subpopulations
# note inbr_diag replaces the diagonal of kinship with inbreeding coefficients
plot_popkin( inbr_diag(kinship), labs = subpops )
# extract inbreeding coefficients from kinship
inbreeding <- inbr(kinship)
# estimate FST
weights <- weights_subpops(subpops) # weigh individuals so subpopulations are balanced
Fst <- fst(kinship, weights) # use kinship matrix and weights to calculate fst
Fst <- fst(inbreeding, weights) # estimate more directly from inbreeding vector (same result)
# estimate and visualize the pairwise FST matrix
pairwise_fst <- pwfst(kinship) # estimated matrix
leg_title <- expression(paste('Pairwise ', F[ST])) # fancy legend label
# NOTE no need for inbr_diag() here!
plot_popkin(pairwise_fst, labs = subpops, leg_title = leg_title)
# rescale the kinship matrix using different subpopulations (implicitly changes the MRCA)
kinship2 <- rescale_popkin(kinship, subpops2)
# toy allele frequency data
P <- matrix( runif( 9 ), nrow = 3 )
# estimate coancestry from allele frequency matrix
# (useful from P matrices from admixture models)
coancestry <- popkin_af( P )
Label ancestries based on best match to individual labels
Description
Returns labels for each ancestry (columns) of an admixture matrix which is the best matching label among the average individual (rows) of each subpopulation. More specifically, each ancestry is associated to the subpopulation label in which its admixture proportion was the highest averaging over all individuals from that subpopulation. If there are two or more ancestries that match to the same label, these are made unique by appending its order of appearance (if the label is "A", then the first column that matches to it is labeled "A1", the next one "A2", etc).
Usage
admix_label_cols(Q, labs)
Arguments
Q |
The admixture proportions matrix. |
labs |
Subpopulation labels for individuals (rows of |
Value
The best label assignments for the ancestries (columns of Q
), made unique by indexes if there are overlaps.
See Also
admix_order_cols()
to automatically order ancestries given ordered individuals.
plot_admix()
for plotting admixture matrices.
Examples
# toy admixture matrix with labels for individuals/rows that match well with ancestry/columns
Q <- matrix(
c(
0.1, 0.8, 0.1,
0.1, 0.7, 0.2,
0.0, 0.4, 0.6,
0.0, 0.3, 0.7,
0.9, 0.0, 0.1
),
nrow = 5,
ncol = 3,
byrow = TRUE
)
labs <- c('X', 'X', 'Y', 'Y', 'Z')
# to calculate matches and save as column names, do this:
colnames( Q ) <- admix_label_cols( Q, labs )
# expected column names: c('Z', 'X', 'Y')
Reorder admixture matrix columns
Description
Returns the order of the columns (ancestries) of an admixture matrix so that they are in their average order of appearance in rows (individuals). More specifically, for each ancestry it calculates its mean row (expected row number weighted by this ancestry's proportion distribution among rows), and returns the order in which these mean row values are increasing. In datasets where the rows/individuals are already ordered in a meaningful way (for example, by distance from the species' geographical origin, and generally grouping the most similar individuals together), this function can lead to a more pleasing automated visualization of the admixture proportions.
Usage
admix_order_cols(Q)
Arguments
Q |
The admixture proportions matrix. |
Value
The desired order of the columns (a vector of indexes).
See Also
admix_label_cols()
to automatically assign labels to ancestries given labels to individuals.
plot_admix()
for plotting admixture matrices.
Examples
# here is a toy admixture proportions matrix with columns in no meaningful order
Q <- matrix(
c(
0.1, 0.8, 0.1,
0.1, 0.7, 0.2,
0.0, 0.4, 0.6,
0.0, 0.3, 0.7,
0.9, 0.0, 0.1
),
nrow = 5,
ncol = 3,
byrow = TRUE
)
# get nicer order
indexes <- admix_order_cols( Q )
# apply reordering to columns
Q <- Q[ , indexes ]
# notice that now the first columns takes on the highest values initially,
# followed by the second column, and lastly the third column.
Calculate a kinship matrix between subpopulations by averaging individual data
Description
This function calculates a kinship matrix between subpopulations, whose values are the average kinship values between all individual pairs where one individual is in the first subpopulation and the other individual is in the second subpopulation.
To estimate coancestry instead of kinship, which is recommended to get more interpretable diagonal values, the input kinship matrix should be transformed using inbr_diag()
.
Usage
avg_kinship_subpops(kinship, subpops, subpop_order = unique(subpops))
Arguments
kinship |
A symmetric |
subpops |
The length- |
subpop_order |
The optional order of subpopulations in the output matrix.
|
Value
The symmetric K
-by-K
kinship matrix between subpopulations, where K
is the number of unique subpopulations in subpops
, ordered as in subpop_order
.
Examples
# a toy kinship matrix with 5 individuals belonging to 2 subpopulations
kinship <- matrix(
c(
0.7, 0.4, 0.4, 0.1, 0.0,
0.4, 0.7, 0.4, 0.2, 0.1,
0.4, 0.4, 0.7, 0.2, 0.0,
0.1, 0.2, 0.2, 0.6, 0.1,
0.0, 0.1, 0.0, 0.1, 0.6
),
nrow = 5,
ncol = 5
)
subpops <- c(1, 1, 1, 2, 2)
# calculate mean kinship between (and within) subpopulations
# a 2x2 matrix
avg_kinship_subpops( kinship, subpops )
# calculate coancestry estimate instead (difference is diagonal)
avg_kinship_subpops( inbr_diag( kinship ), subpops )
Calculate FST from a population-level kinship matrix or vector of inbreeding coefficients
Description
This function simply returns the weighted mean inbreeding coefficient.
If weights are NULL
(default), the regular mean is calculated.
If a kinship matrix is provided, then the inbreeding coefficients are extracted from its diagonal using inbr()
(requires the diagonal to contains self-kinship values as popkin()
returns, and not inbreeding coefficients as inbr_diag()
returns).
If there is local inbreeding and it can be estimated (from known pedigrees, for example), it can be subtracted from the total inbreeding coefficients, resulting in a vector of structural inbreeding that correctly averages into FST.
Usage
fst(x, weights = NULL, x_local = NULL)
Arguments
x |
The vector of inbreeding coefficients, or the kinship matrix if |
weights |
Weights for individuals (optional, defaults to uniform weights) |
x_local |
An optional vector of inbreeding coefficients, or a local kinship matrix if |
Details
The returned weighted mean inbreeding coefficient equals the generalized FST if all individuals are "locally outbred" (i.e. if the self-relatedness of every individual stems entirely from the population structure rather than due partly to having unusually closely related parents, such as first or second cousins).
Note most individuals in population-scale human data are locally outbred.
If there are locally-inbred individuals, but their local inbreeding cannot be estimated, then the returned value will overestimate FST.
Good estimates of local inbreeding can be passed (parameter x_local
), in which case the code will subtract their effect and FST will be more accurate.
Value
FST
Examples
# Get FST from a genotype matrix
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow = 3, byrow = TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# estimate the kinship matrix "kinship" from the genotypes "X"!
kinship <- popkin(X, subpops) # calculate kinship from X and optional subpop labels
weights <- weights_subpops(subpops) # can weigh individuals so subpopulations are balanced
Fst <- fst(kinship, weights) # use kinship matrix and weights to calculate fst
Fst <- fst(kinship) # no (or NULL) weights implies uniform weights
inbr <- inbr(kinship) # if you extracted inbr for some other analysis...
Fst <- fst(inbr, weights) # ...use this inbreeding vector as input too!
HGDP subset
Description
Subset of the HGDP dataset.
Usage
hgdp_subset
Format
a matrix of 0's, 1's and 2's.
Value
genotype matrix
Source
Stanford HGDP http://www.hagsc.org/hgdp/files.html
Extract inbreeding coefficients from a kinship matrix
Description
The kinship matrix contains transformed inbreeding coefficients along the diagonal.
This function extracts the vector of inbreeding values from the input kinship matrix, by transforming the diagonal using the formula 2 * x - 1
.
Usage
inbr(kinship)
Arguments
kinship |
The |
Value
The length-n
vector of inbreeding coefficient for each individual.
See Also
inbr_diag()
to replace kinship diagonal with inbreeding values (better for plots)
Examples
#########
# illustrate the main transformation on a 2x2 kinship matrix:
# same inbreeding values for both individuals
inbr <- 0.2
# corresponding self kinship (diagonal values) for both individuals
kinship_self <- (1 + inbr)/2
# actual kinship matrix (zero kinship between individuals)
kinship <- matrix(c(kinship_self, 0, 0, kinship_self), nrow=2)
# expected output of inbr (extracts inbreeding coefficients)
inbr_exp <- c(inbr, inbr)
# actual output from this function
inbr_obs <- inbr(kinship)
# verify that they match (up to machine precision)
stopifnot( all( abs(inbr_obs - inbr_exp) < .Machine$double.eps ) )
#########
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow=3, byrow=TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# estimate the kinship matrix from the genotypes "X"!
kinship <- popkin(X, subpops) # calculate kinship from X and optional subpop labels
# extract inbreeding coefficients from Kinship
inbr <- inbr(kinship)
Replace kinship diagonal with inbreeding coefficients
Description
The usual kinship matrix contains self-kinship values along their diagonal given by diag(kinship) = ( 1 + inbr ) / 2
, where inbr
is the vector of inbreeding coefficient.
This function returns a modified kinship matrix with diagonal values replaced with inbr
(off-diagonal values stay the same).
The resulting matrix is better for visualization, but is often not appropriate for modeling (e.g. in mixed-effects models for association or heritability estimation).
Usage
inbr_diag(kinship)
Arguments
kinship |
A kinship matrix with self-kinship values along the diagonal.
Can pass multiple kinship matrices contained in a list.
If |
Value
The modified kinship matrix, with inbreeding coefficients along the diagonal, preserving column and row names.
If the input was a list of kinship matrices, the output is the corresponding list of transformed matrices.
NULL
inputs are preserved without causing errors.
See Also
The inverse function is given by bnpsd::coanc_to_kinship()
.
inbr()
to extract the vector of inbreeding values from a kinship matrix.
Examples
#########
# illustrate the main transformation on a 2x2 kinship matrix:
# same inbreeding values for both individuals
inbr <- 0.2
# corresponding self kinship (diagonal values) for both individuals
kinship_self <- (1 + inbr)/2
# kinship between the two individuals
kinship_between <- 0.1
# actual kinship matrix
kinship <- matrix(c(kinship_self, kinship_between, kinship_between, kinship_self), nrow=2)
# expected output of inbr_diag (replaces self kinship with inbreeding)
kinship_inbr_diag_exp <- matrix(c(inbr, kinship_between, kinship_between, inbr), nrow=2)
# actual output from this function
kinship_inbr_diag_obs <- inbr_diag(kinship)
# verify that they match (up to machine precision)
stopifnot( all( abs(kinship_inbr_diag_obs - kinship_inbr_diag_exp) < .Machine$double.eps ) )
# for a list of matrices, returns list of transformed matrices:
inbr_diag( list(kinship, kinship) )
# a list with NULL values also works
inbr_diag( list(kinship, NULL, kinship) )
#########
# Construct toy data (to more closely resemble real data analysis)
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow=3, byrow=TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# estimate the kinship matrix from the genotypes "X"!
kinship <- popkin(X, subpops) # calculate kinship from X and optional subpop labels
# lastly, replace diagonal of kinship matrix with inbreeding coefficients
kinship_inbr_diag <- inbr_diag(kinship)
Calculate the weighted mean kinship
Description
This function computes a particular weighted mean kinship that arises in the context of kinship and FST estimators and in the definition of the effective sample size. This function allows for weights to be zero or even negative, but they are internally normalized to sum to one.
Usage
mean_kinship(kinship, weights = NULL)
Arguments
kinship |
The kinship matrix |
weights |
Weights for individuals (optional).
If |
Value
The weighted mean kinship matrix, equivalent to drop( weights %*% kinship %*% weights )
after normalizing weights to sum to one.
Examples
# construct a dummy kinship matrix
kinship <- matrix(c(0.5, 0, 0, 0.6), nrow=2)
# this is the ordinary mean
mean_kinship(kinship)
# weighted mean with twice as much weight on the second individual
# (weights are internally normalized to sum to one)
weights <- c(1, 2)
mean_kinship(kinship, weights)
Calculates the effective sample size of the data
Description
The effective sample size n_eff
is the equivalent number of independent haplotypes that gives the same variance as that observed under the given population.
The variance in question is for the weighted sample mean ancestral allele frequency estimator.
It follows that n_eff
equals the inverse of the weighted mean kinship.
If max = TRUE
, a calculation is performed that implicitly uses optimal weights which maximize n_eff
, which equals the sum of the elements of the inverse kinship matrix.
However, if nonneg = TRUE
and if the above solution has negative weights (common), optimal non-negative weights are found instead (there are three algorithms available, see algo
).
If max = FALSE
, then the input weights are used in this calculation, and if weights are NULL
, uniform weights are used.
Usage
n_eff(
kinship,
max = TRUE,
weights = NULL,
nonneg = TRUE,
algo = c("gradient", "newton", "heuristic"),
tol = 1e-10
)
Arguments
kinship |
An |
max |
If |
weights |
Weights for individuals (optional).
If |
nonneg |
If |
algo |
Algorithm for finding optimal non-negative weights (applicable only if |
tol |
Tolerance parameter for "gradient" and "newton" algorithms. The algorithms converge when the norm of the step vector is smaller than this tolerance value. |
Details
The maximum n_eff
possible is 2 * n
, where n
is the number of individuals; this value is attained only when all haplotypes are independent (a completely unstructured population in Hardy-Weinberg equilibrium).
The minimum n_eff
possible is 1, which is attained in an extremely structured population with FST of 1, where every individual has exactly the same haplotype at every locus (no heterozygotes).
Moreover, for K
extremely-differentiated subpopulations (FST = 1 per subpopulation) n_eff = K
.
In this way, n_eff
is smaller than the ideal value of 2 * n
depending on the amount of kinship (covariance) in the data.
Occasionally, depending on the quality of the input kinship matrix, the estimated n_eff
may be outside the theoretical [1
, 2*n
] range, in which case the return value is set to the closest boundary value.
The quality of the results depends on the success of matrix inversion (which for numerical reasons may incorrectly contain negative eigenvalues, for example) or of the gradient optimization.
Value
A list containing n_eff
and weights
(optimal weights if max = TRUE
, input weights otherwise).
Examples
# Get n_eff from a genotype matrix
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow=3, byrow=TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# estimate the kinship matrix "kinship" from the genotypes "X"!
kinship <- popkin(X, subpops) # calculate kinship from X and optional subpop labels
weights <- weights_subpops(subpops) # can weigh individuals so subpopulations are balanced
# use kinship matrix to calculate n_eff
# default mode returns maximum n_eff possible across all non-negative weights that sum to one
# also returns the weights that were optimal
obj <- n_eff(kinship)
n_eff_max <- obj$n_eff
w_max <- obj$weights
# version that uses weights provided
obj <- n_eff(kinship, max = FALSE, weights = weights)
n_eff_w <- obj$n_eff
w <- obj$weights # returns input weights renormalized for good measure
# no (or NULL) weights implies uniform weights
obj <- n_eff(kinship, max = FALSE)
n_eff_u <- obj$n_eff
w <- obj$weights # uniform weights
Make a structure/admixture plot
Description
This function facilitates structure plots with options that resemble those of plot_popkin()
in name and results.
The biggest difference is this function plots single panels (technically 2 panels including the legend, unless it is omitted), whereas plot_popkin()
can plot multiple kinship matrices with a shared legend.
Usage
plot_admix(
Q,
col = RColorBrewer::brewer.pal(max(ncol(Q), 3), "Paired"),
mar_pad = 0.2,
panel_letters = NA,
panel_letters_cex = 1.5,
panel_letters_adj = -0.1,
axis_lab_cex = 1,
xlab = "Individuals",
xlab_line = 1,
xlab_cex = axis_lab_cex,
ylab = "Ancestry",
ylab_line = 2,
ylab_side = 2,
ylab_cex = axis_lab_cex,
leg_title = "Ancestries",
leg_title_cex = axis_lab_cex,
leg_title_line = 2,
leg_cex = 1,
leg_mar = leg_title_line + 1,
leg_width = 0.2,
leg_las = 0,
leg_omit = FALSE,
layout_add = !leg_omit,
names = FALSE,
names_cex = 1,
names_line = NA,
names_las = 2,
labs = NULL,
labs_cex = 1,
labs_las = 0,
labs_line = 0,
labs_sep = TRUE,
labs_lwd = 1,
labs_col = "black",
labs_ticks = FALSE,
labs_text = TRUE,
labs_even = FALSE,
...
)
Arguments
Q |
The admixture proportions matrix, with |
col |
A vector of at least |
mar_pad |
Margin padding used for legend panel only (margins for first/main panel are not altered by this function). |
panel_letters |
Panel letter to include in first/main panel (default |
panel_letters_cex |
Scaling factor of panel letter (default 1.5). |
panel_letters_adj |
X-axis adjustment for panel letter (default -0.1). Negative values place the letter into the left margin area. Might need adjustment depending on the size of the left margin. AXIS LABEL OPTIONS |
axis_lab_cex |
Scaling factor for x-axis, y-axis, and legend title labels (which can also be set individually, see below). |
xlab |
X-axis label (default "Individuals").
Set to |
xlab_line |
The value of |
xlab_cex |
Scaling factor for x-axis label. |
ylab |
Y-axis label (default "Ancestry").
Set to |
ylab_line |
The value of |
ylab_side |
The value of |
ylab_cex |
Scaling factor for y-axis label. LEGEND (COLOR KEY) OPTIONS |
leg_title |
The name of the categorical ancestry variable (default "Ancestries"). |
leg_title_cex |
Scaling factor for legend title label. |
leg_title_line |
The value of |
leg_cex |
Scaling factor for ancestry labels. |
leg_mar |
Margin values for the kinship legend panel only.
A length-4 vector (in |
leg_width |
The width of the legend panel, relative to the width of the main panel.
This value is passed to |
leg_las |
The ancestry label orientations (in format that |
leg_omit |
If |
layout_add |
If INDIVIDUAL LABEL OPTIONS |
names |
If |
names_cex |
Scaling factor for the individual names. |
names_line |
Line where individual names are placed. |
names_las |
Orientation of labels relative to axis. Default (2) makes labels perpendicular to axis. SUBPOPULATION LABEL OPTIONS |
labs |
Subpopulation labels for individuals in the admixture matrix. Use a matrix of labels to show groupings at more than one level (for a hierarchy or otherwise). |
labs_cex |
A vector of label scaling factors for each level of labs. |
labs_las |
A vector of label orientations (in format that |
labs_line |
A vector of lines where labels are placed (in format that |
labs_sep |
A vector of logicals that specify whether lines separating the subpopulations are drawn for each level of labs. |
labs_lwd |
A vector of line widths for the lines that divide subpopulations (if |
labs_col |
A vector of colors for the lines that divide subpopulations (if |
labs_ticks |
A vector of logicals that specify whether ticks separating the subpopulations are drawn for each level of labs. |
labs_text |
A vector of logicals that specify whether the subpopulation labels are shown for each level of labs. Useful for including separating lines or ticks without text. |
labs_even |
A vector of logicals that specify whether the subpopulations labels are drawn with equal spacing for each level of labs.
When |
... |
Additional options passed to |
See Also
admix_order_cols()
to automatically order ancestries given ordered individuals.
admix_label_cols()
to automatically assign labels to ancestries given labels to individuals.
Examples
# create random proportions for two ancestries
Q <- runif( 10 )
Q <- cbind( Q, 1 - Q )
# add ancestry names
colnames( Q ) <- c('A1', 'A2')
# plot this data!
plot_admix( Q )
# See vignette for more elaborate examples!
Plot a phylo
tree object
Description
This is a wrapper around ape::plot.phylo()
that makes several adjustments so plots agree more with accompanying kinship matrices.
In particular, tree is reversed on the y-axis to match matrix orientation, y-axis spacing is more padded for small trees, and an x-axis scale is always added.
Usage
plot_phylo(
tree,
xlab = "Coancestry",
xmax = NULL,
leg_n = 5,
edge_width = 1,
...
)
Arguments
tree |
A |
xlab |
The x-axis label (default "Coancestry"). |
xmax |
X-axis maximum limit. |
leg_n |
The desired number of ticks in the x-axis (input to |
edge_width |
The width of the tree edges (passed to |
... |
Additional parameters passed to |
See Also
plot_popkin()
can create multipanel figures including kinship matrices and trees (calling the present function in the process).
Examples
# create a small random tree
library(ape)
tree <- rtree( 3 )
# plot it!
plot_phylo( tree )
Visualize one or more kinship matrices and other related objects
Description
This function plots one or more kinship matrices, trees (class phylo
objects, see ape
package), and arbitrary functions, and a shared legend for the kinship color key.
Many options allow for fine control of individual or subpopulation labeling.
Usage
plot_popkin(
kinship,
titles = NULL,
col = NULL,
col_n = 100,
mar = NULL,
mar_pad = 0.2,
oma = 1.5,
diag_line = FALSE,
panel_letters = toupper(letters),
panel_letters_cex = 1.5,
panel_letters_adj = -0.1,
ylab = "Individuals",
ylab_adj = NA,
ylab_line = 0,
ylab_side = 2,
ylab_per_panel = FALSE,
layout_add = TRUE,
layout_rows = 1,
leg_per_panel = FALSE,
leg_title = "Kinship",
leg_cex = 1,
leg_n = 5,
leg_mar = 3,
leg_width = 0.3,
leg_column = NA,
names = FALSE,
names_cex = 1,
names_line = NA,
names_las = 2,
labs = NULL,
labs_cex = 1,
labs_las = 0,
labs_line = 0,
labs_sep = TRUE,
labs_lwd = 1,
labs_col = "black",
labs_ticks = FALSE,
labs_text = TRUE,
labs_even = FALSE,
null_panel_data = FALSE,
weights = NULL,
raster = is.null(weights),
sym = FALSE,
...
)
Arguments
kinship |
A numeric kinship matrix, a |
titles |
Titles to add to each matrix panel (default is no titles). Applied to kinship and phylo panels only. |
col |
Colors for kinship heatmap (default is a red-white-blue palette symmetric about zero constructed using |
col_n |
The number of colors to use in the heatmap (applies if |
mar |
Margins shared by all panels (if a vector) or for each panel (if a list of such vectors).
If the vector has length 1, |
mar_pad |
Margin padding added to all panels ( |
oma |
Outer margin vector.
If length 1, the value of |
diag_line |
If |
panel_letters |
Vector of strings for labeling panels (default A-Z).
No labels are added if |
panel_letters_cex |
Scaling factor of panel letters (default 1.5). |
panel_letters_adj |
X-axis adjustment for panel letters (default -0.1). Negative values place the letter into the left margin area. Might need adjustment depending on the size of the left margin. |
ylab |
The y-axis label (default "Individuals").
If |
ylab_adj |
The value of |
ylab_line |
The value of |
ylab_side |
The value of |
ylab_per_panel |
Forces y-axis labels to appear for each panel, in the inner margins.
Most useful to cover the case where there is a single panel but no outer margins ( LAYOUT OPTIONS |
layout_add |
If |
layout_rows |
Number of rows in layout, used only if LEGEND (COLOR KEY) OPTIONS |
leg_per_panel |
If |
leg_title |
The name of the variable that the kinship heatmap colors measure (default "Kinship"), or a vector of such values if they vary per panel. |
leg_cex |
Scaling factor for |
leg_n |
The desired number of ticks in the kinship legend y-axis, and phylo x-axis (input to |
leg_mar |
Margin values for the kinship legend panel only, or a list of such values if they vary per panel.
A length-4 vector (in |
leg_width |
The width of the legend panel, relative to the width of a single main panel.
This value is passed to |
leg_column |
The column number in which to place the kinship legend (default INDIVIDUAL LABEL OPTIONS |
names |
If |
names_cex |
Scaling factor for the column and row names of a kinship matrix, or the tip labels of a |
names_line |
Line where kinship column and row names are placed, or a vector of such values if they vary per panel. Has no effect on non-kinship panels. |
names_las |
Orientation of labels relative to axis. Default (2) makes labels perpendicular to axis. Has no effect on non-kinship panels. SUBPOPULATION LABEL OPTIONS |
labs |
Subpopulation labels for individuals in kinship matrices. Use a matrix of labels (individuals along rows, levels along columns) to show groupings at more than one level (for a hierarchy or otherwise). If input is a vector or a matrix, the same subpopulation labels are shown for every kinship matrix; the input must be a list of such vectors or matrices if the labels vary per panel. Has no effect on non-kinship panels. |
labs_cex |
A vector of label scaling factors for each level of labs, or a list of such vectors if labels vary per panel. |
labs_las |
A vector of label orientations (in format that |
labs_line |
A vector of lines where labels are placed (in format that |
labs_sep |
A vector of logicals that specify whether lines separating the subpopulations are drawn for each level of labs, or a list of such vectors if labels vary per panel. |
labs_lwd |
A vector of line widths for the lines that divide subpopulations (if |
labs_col |
A vector of colors for the lines that divide subpopulations (if |
labs_ticks |
A vector of logicals that specify whether ticks separating the subpopulations are drawn for each level of labs, or a list of such vectors if labels vary per panel. |
labs_text |
A vector of logicals that specify whether the subpopulation labels are shown for each level of labs, or a list of such vectors if labels vary per panel. Useful for including separating lines or ticks without text. |
labs_even |
A vector of logicals that specify whether the subpopulations labels are drawn with equal spacing for each level of labs, or a list of such vectors if labels vary per panel.
When |
null_panel_data |
If |
weights |
A vector with weights for every individual, or a list of such vectors if they vary per panel. The width of every individual in the kinship matrix becomes proportional to their weight. Individuals with zero or negative weights are omitted. Has no effect on non-kinship panels. |
raster |
A logical equivalent to |
sym |
If AXIS LABEL OPTIONS |
... |
Additional options passed to |
Details
plot_popkin
plots the input kinship matrices as-is.
For best results, a standard kinship matrix (such as the output of popkin()
) should have its diagonal rescaled to contain inbreeding coefficients using inbr_diag()
before plot_popkin
is used.
This function permits the labeling of individuals (from row and column names when names = TRUE
) and of subpopulations (passed through labs
).
The difference is that the labels passed through labs
are assumed to be shared by many individuals, and lines (or other optional visual aids) are added to demarcate these subgroups.
Examples
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow = 3, byrow = TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# estimate the kinship matrix from the genotypes "X"!
kinship <- popkin(X, subpops) # calculate kinship from X and optional subpop labels
# simple plot of the kinship matrix, marking the subpopulations only
# note inbr_diag replaces the diagonal of kinship with inbreeding coefficients
# (see vignette for more elaborate examples)
plot_popkin( inbr_diag(kinship), labs = subpops )
Estimate kinship from a genotype matrix and subpopulation assignments
Description
Given the biallelic genotypes of n
individuals, this function returns the n
-by-n
kinship matrix such that the kinship estimate between the most distant subpopulations is zero on average (this sets the ancestral population to the most recent common ancestor population).
Usage
popkin(
X,
subpops = NULL,
n = NA,
loci_on_cols = FALSE,
mean_of_ratios = FALSE,
mem_factor = 0.7,
mem_lim = NA,
want_M = FALSE,
m_chunk_max = 1000
)
Arguments
X |
Genotype matrix, |
subpops |
The length- |
n |
Number of individuals (required only when |
loci_on_cols |
If |
mean_of_ratios |
Chose how to weigh loci.
If |
mem_factor |
Proportion of available memory to use loading and processing data.
Ignored if |
mem_lim |
Memory limit in GB, used to break up data into chunks for very large datasets.
Note memory usage is somewhat underestimated and is not controlled strictly.
Default in Linux is |
want_M |
If |
m_chunk_max |
Sets the maximum number of loci to process at the time. Actual number of loci loaded may be lower if memory is limiting. |
Details
The subpopulation assignments are only used to estimate the baseline kinship (the zero value).
If the user wants to re-estimate the kinship matrix using different subpopulation labels,
it suffices to rescale it using rescale_popkin()
(as opposed to starting from the genotypes again, which gives the same answer but more slowly).
Value
If want_M = FALSE
, returns the estimated n
-by-n
kinship matrix only.
If X
has names for the individuals, they will be copied to the rows and columns of this kinship matrix.
If want_M = TRUE
, a named list is returned, containing:
-
kinship
: the estimatedn
-by-n
kinship matrix -
M
: then
-by-n
matrix of non-missing pair counts (seewant_M
option).
See Also
popkin_af()
for coancestry estimation from allele frequency matrices.
Examples
# Construct toy data
X <- matrix(
c(0, 1, 2,
1, 0, 1,
1, 0, 2),
nrow = 3,
byrow = TRUE
) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
kinship <- popkin(X, subpops) # calculate kinship from genotypes and subpopulation labels
Compute popkin's A
and M
matrices from genotypes
Description
This function returns lower-level, intermediate calculations for the main popkin
function.
These are not intended for most users, but rather for researchers studying the estimator.
Usage
popkin_A(
X,
n = NA,
loci_on_cols = FALSE,
mean_of_ratios = FALSE,
mem_factor = 0.7,
mem_lim = NA,
m_chunk_max = 1000
)
Arguments
X |
Genotype matrix, |
n |
Number of individuals (required only when |
loci_on_cols |
If |
mean_of_ratios |
Chose how to weigh loci.
If |
mem_factor |
Proportion of available memory to use loading and processing data.
Ignored if |
mem_lim |
Memory limit in GB, used to break up data into chunks for very large datasets.
Note memory usage is somewhat underestimated and is not controlled strictly.
Default in Linux is |
m_chunk_max |
Sets the maximum number of loci to process at the time. Actual number of loci loaded may be lower if memory is limiting. |
Value
A named list containing:
-
A
: n-by-n matrix, for individualsj
andk
, of averagew_i * ( ( x_ij - 1 ) * ( x_ik - 1 ) - 1)
values across all locii
inX
; ifmean_of_ratios = FALSE
,w_i = 1
, otherwisew_i = 1 / (p_est_i * (1 - p_est_i) )
wherep_est_i
is the reference allele frequency. -
M
: n-by-n matrix of sample sizes (number of loci with non-missing individualj
andk
pairs, used to normalizeA
)
See Also
The main popkin()
function (a wrapper of this popkin_A
function and popkin_A_min_subpops()
to estimate the minimum A
value).
Examples
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow = 3, byrow = TRUE) # genotype matrix
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
# library(BEDMatrix)
# X <- BEDMatrix(file) # load genotype matrix object
obj <- popkin_A(X) # calculate A and M from genotypes
A <- obj$A
M <- obj$M
Estimate the minimum expected value of a matrix A
using subpopulations
Description
This function averages the values of a square matrix A
between every subpopulation pair and returns the minimum of these averages.
If no subpopulation partition is provided, the function returns the minimum value of A
excluding the diagonal, to agree when the code treats each individual as a subpopulation.
The return value can be used to adjust an A
matrix to yield the kinship matrix.
Usage
popkin_A_min_subpops(A, subpops = NULL)
Arguments
A |
A symmetric |
subpops |
A length- |
Value
The minimum of the average between-subpopulation A
values, which estimates the minimum expected value of A
See Also
popkin_A()
to generate the A
matrix usually inputted into this function (popkin_A_min_subpops
).
popkin()
is the wrapper function around both of these.
avg_kinship_subpops()
for the full matrix of mean kinship values between subpopulations.
Examples
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow=3, byrow=TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# calculate A from genotypes
A <- popkin_A(X)$A
# the recommended form using appropriate subpopulation labels
A_min_est <- popkin_A_min_subpops( A, subpops )
# this recovers the popkin estimate
kinship <- 1 - A / A_min_est
stopifnot( kinship == popkin( X, subpops ) )
# a simple default for exploratory analysis, equals min( A ) for correctly-calculated A
A_min_est <- popkin_A_min_subpops( A )
stopifnot( A_min_est == min( A ) )
Estimate coancestry from an allele frequency matrix and subpopulation assignments
Description
Given the individual-specific allele frequencies of n
individuals, this function returns the n
-by-n
coancestry matrix.
This function is the analog of popkin()
for allele frequencies rather than genotypes, and as a consequence estimates coancestry instead of kinship.
These coancestry estimates are unbiased if the true allele frequencies are provided, but may be less accurate when the allele frequencies themselves are estimated.
This function is intended for cases where allele frequencies, but not individual genotypes, are available; otherwise it is best to use the individual genotypes and popkin()
.
An application of interest is the allele frequency matrices from admixture models, in which case the columns correspond to subpopulations rather than individuals, and subpops = NULL
is an acceptable choice.
Usage
popkin_af(
P,
subpops = NULL,
loci_on_cols = FALSE,
mem_factor = 0.7,
mem_lim = NA,
want_M = FALSE,
m_chunk_max = 1000
)
Arguments
P |
|
subpops |
The length- |
loci_on_cols |
If |
mem_factor |
Proportion of available memory to use loading and processing data.
Ignored if |
mem_lim |
Memory limit in GB, used to break up data into chunks for very large datasets.
Note memory usage is somewhat underestimated and is not controlled strictly.
Default in Linux is |
want_M |
If |
m_chunk_max |
Sets the maximum number of loci to process at the time. Actual number of loci loaded may be lower if memory is limiting. |
Value
If want_M = FALSE
, returns the estimated n
-by-n
coancestry matrix only.
If P
has names for the individuals, they will be copied to the rows and columns of this coancestry matrix.
If want_M = TRUE
, a named list is returned, containing:
-
coancestry
: the estimatedn
-by-n
coancestry matrix -
M
: then
-by-n
matrix of non-missing pair counts (seewant_M
option).
See Also
popkin()
for kinship estimation from genotype matrices.
Examples
# a matrix of random uniform allele frequencies
# (unstructured, unlike real data)
P <- matrix( runif( 9 ), nrow = 3 )
coancestry <- popkin_af( P )
Estimate the individual-level pairwise FST matrix
Description
This function construct the individual-level pairwise FST matrix implied by the input kinship matrix.
If the input is the true kinship matrix, the return value corresponds to the true pairwise FST matrix.
On the other hand, if the input is the estimated kinship returned by popkin()
, the same code results in the return value being the pairwise FST estimates described in our paper.
In all cases the diagonal of the pairwise FST matrix is zero by definition.
Usage
pwfst(kinship)
Arguments
kinship |
The |
Value
The n
-by-n
pairwise FST matrix
Examples
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow=3, byrow=TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# estimate the kinship matrix from the genotypes "X"!
kinship <- popkin(X, subpops) # calculate kinship from X and optional subpop labels
# lastly, compute pairwise FST matrix from the kinship matrix
pwF <- pwfst(kinship)
Rescale kinship matrix to set a given kinship value to zero.
Description
If you already have a population kinship matrix, and you desire to estimate the kinship matrix in a subset of the individuals, you could do it the slow way (reestimating starting from the genotypes of the subset of individuals) or you can do it the fast way: first subset the kinship matrix to only contain the individuals of interest, then use this function to rescale this kinship matrix so that the minimum kinship is zero. This rescaling is required when subsetting results in a more recent Most Recent Common Ancestor (MRCA) population compared to the original dataset (for example, if the original data had individuals from across the world but the subset only contains individuals from a single continent).
Usage
rescale_popkin(kinship, subpops = NULL, min_kinship = NA)
Arguments
kinship |
An |
subpops |
The length- |
min_kinship |
A scalar kinship value to define the new zero kinship. |
Details
This function rescales the input kinship
matrix so that the value min_kinship
in the original kinship matrix becomes zero, using the formula
kinship_rescaled = ( kinship - min_kinship ) / ( 1 - min_kinship )
.
This is equivalent to changing the ancestral population of the data.
If subpopulation labels subpops
are provided (recommended), they are used to estimate min_kinship
using the function popkin_A_min_subpops()
, which is the recommended way to set the MRCA population correctly.
If both subpops
and min_kinship
are provided, only min_kinship
is used.
If both subpops
and min_kinship
are omitted, the function sets min_kinship = min( kinship )
.
Value
The rescaled n
-by-n
kinship matrix, with the desired level of relatedness set to zero.
Examples
# Construct toy data
X <- matrix(c(0,1,2,1,0,1,1,0,2), nrow=3, byrow=TRUE) # genotype matrix
subpops <- c(1,1,2) # subpopulation assignments for individuals
subpops2 <- 1:3 # alternate labels treat every individual as a different subpop
# NOTE: for BED-formatted input, use BEDMatrix!
# "file" is path to BED file (excluding .bed extension)
## library(BEDMatrix)
## X <- BEDMatrix(file) # load genotype matrix object
# suppose we first estimate kinship without subpopulations, which will be more biased
kinship <- popkin(X) # calculate kinship from genotypes, WITHOUT subpops
# then we visualize this matrix, figure out a reasonable subpopulation partition
# now we can adjust the kinship matrix!
kinship2 <- rescale_popkin(kinship, subpops)
# prev is faster but otherwise equivalent to re-estimating kinship from scratch with subpops:
# kinship2 <- popkin(X, subpops)
# can also manually set the level of relatedness min_kinship we want to be zero:
min_kinship <- min(kinship) # a naive choice for example
kinship2 <- rescale_popkin(kinship, min_kinship = min_kinship)
# lastly, omiting both subpops and min_kinship sets the minimum value in kinship to zero
kinship3 <- rescale_popkin(kinship2)
# equivalent to both of:
# kinship3 <- popkin(X)
# kinship3 <- rescale_popkin(kinship2, min_kinship = min(kinship))
Validate a kinship matrix
Description
Tests that the input is a valid kinship matrix (a numeric, square, and optionally symmetric R matrix).
Intended for matrices to plot and for other uses, including biased estimates, so there is flexibility as to what constitutes a valid kinship matrix.
Throws errors if the input is not as above.
Can instead return TRUE
/FALSE
if logical = TRUE
.
Usage
validate_kinship(kinship, sym = TRUE, name = "kinship", logical = FALSE)
Arguments
kinship |
The kinship matrix to validate. |
sym |
If |
name |
Default "kinship".
Change to desired variable name for more informative error messages (i.e. "A" when used to validate the |
logical |
If |
Details
True kinship matrices have values strictly between 0 and 1, and diagonal values strictly between 0.5 and 1. However, estimated matrices may contain values slightly out of range. For greater flexibility, this function does not check for out-of-range values.
Value
If logical = FALSE
(default), nothing.
If logical = TRUE
, returns TRUE
if the input is a valid kinship matrix, FALSE
otherwise.
Examples
# this is a valid (positive) example
kinship <- matrix(c(0.5, 0, 0, 0.6), nrow=2)
# this will run without errors or warnings
validate_kinship(kinship)
# negative examples
# dies if input is missing
try( validate_kinship() )
# and if input is not a matrix
try( validate_kinship( 1:5 ) )
# and for non-numeric matrices
char_mat <- matrix(c('a', 'b', 'c', 'd'), nrow=2)
try( validate_kinship( char_mat ) )
# and non-square matrices
non_kinship <- matrix(1:2, nrow=2)
try( validate_kinship( non_kinship ) )
# and non-symmetric matrices
non_kinship <- matrix(1:4, nrow=2)
try( validate_kinship( non_kinship ) )
# but example passes if we drop symmetry requirement this way
validate_kinship( non_kinship, sym = FALSE )
# instead of stopping, can get a logical value
# this returns FALSE
validate_kinship( non_kinship, logical = TRUE )
Get weights for individuals that balance subpopulations
Description
This function returns positive weights that sum to one for individuals using subpopulation labels, such that every subpopulation receives equal weight.
In particular, if there are K
subpopulations, then the sum of weights for every individuals of a given subpopulation will equal 1 / K
.
The weight of every individual is thus inversely proportional to the number of individuals in its subpopulation.
If the optional sub-subpopulation labels are also provided, then each sub-subpopulation within a given subpopulation is also weighted equally.
Usage
weights_subpops(subpops, subsubpops = NULL)
Arguments
subpops |
The length- |
subsubpops |
The optional length- |
Value
The length-n
vector of weights for each individual.
Examples
# if every individual has a different subpopulation, weights are uniform:
subpops <- 1:10
weights <- weights_subpops( subpops )
stopifnot( all( weights == rep.int( 1/10, 10 ) ) )
# subpopulations can be strings too
subpops <- c('a', 'b', 'c')
weights <- weights_subpops( subpops )
stopifnot( all( weights == rep.int( 1/3, 3 ) ) )
# if there are two subpopulations
# and the first has twice as many individuals as the second
# then the individuals in this first subpopulation weight half as much
# as the ones in the second subpopulation
subpops <- c(1, 1, 2)
weights <- weights_subpops( subpops )
stopifnot( all( weights == c( 1/4, 1/4, 1/2 ) ) )
# 2-level hierarchy example
subpops <- c(1, 1, 1, 2, 2)
subsubpops <- c('a', 'b', 'b', 'c', 'd')
weights <- weights_subpops( subpops, subsubpops )
stopifnot( all( weights == c( 1/4, 1/8, 1/8, 1/4, 1/4 ) ) )