Type: | Package |
Title: | Statistical Tools for Quantitative Genetic Analyses |
Version: | 1.1.6 |
Date: | 2024-12-13 |
Maintainer: | Peter Soerensen <peter.sorensen@r-qgg.org> |
Description: | Provides an infrastructure for efficient processing of large-scale genetic and phenotypic data including core functions for: 1) fitting linear mixed models, 2) constructing marker-based genomic relationship matrices, 3) estimating genetic parameters (heritability and correlation), 4) performing genomic prediction and genetic risk profiling, and 5) single or multi-marker association analyses. Rohde et al. (2019) <doi:10.1101/503631>. |
License: | GPL-3 |
Encoding: | UTF-8 |
Depends: | R (≥ 3.3.0) |
Imports: | Rcpp (≥ 1.0.5), data.table, parallel, statmod, stats, MCMCpack, MASS, coda, corpcor, Matrix, methods |
LinkingTo: | Rcpp, RcppArmadillo |
RoxygenNote: | 7.3.2 |
URL: | https://github.com/psoerensen/qgg |
BugReports: | https://github.com/psoerensen/qgg/issues |
NeedsCompilation: | yes |
Packaged: | 2024-12-15 16:03:53 UTC; au223366 |
Author: | Peter Soerensen [aut, cre], Palle Duun Rohde [aut], Izel Fourie Soerensen [aut] |
Repository: | CRAN |
Date/Publication: | 2024-12-16 16:50:02 UTC |
Compute prediction accuracy for a quantitative or binary trait
Description
Compute prediction accuracy for a quantitative or binary trait
Usage
acc(yobs = NULL, ypred = NULL, typeoftrait = "quantitative")
Arguments
yobs |
is a vector of observed phenotypes |
ypred |
is a vector of predicted phenotypes |
typeoftrait |
is a character with possible values "binary" or "quantitative" (default) |
LD pruning of summary statistics
Description
Perform LD pruning of summary statistics before they are used in gene set enrichment analyses.
Usage
adjLD(
stat = NULL,
Glist = NULL,
chr = NULL,
statistics = "p-value",
r2 = 0.9,
ldSets = NULL,
threshold = 1,
method = "pruning"
)
Arguments
stat |
A data frame with marker summary statistics. Ensure that it is in the correct format. |
Glist |
List of information about the genotype matrix stored on disk. |
chr |
Chromosome(s) being processed. |
statistics |
Specify what type of statistics ("b", "z", or "p-value") is being processed. Default is "p-value". |
r2 |
Threshold used in the clumping/pruning procedure. Default is 0.9. |
ldSets |
List of marker sets - names correspond to row names in 'stat'. |
threshold |
P-value threshold used in LD pruning. |
method |
Method used in adjustment for linkage disequilibrium. Options are "pruning" or "clumping". Default is "pruning". |
Check concordance between marker effect and sparse LD matrix.
Description
Check concordance between predicted and observed marker effect. Marker effect is predicted based on sparse LD matrix in Glist.
Usage
adjLDStat(
stat = NULL,
Glist = NULL,
chr = NULL,
region = NULL,
msize = NULL,
threshold = 1e-05,
overlap = NULL,
niter = 5
)
Arguments
stat |
data frame with marker summary statistics (see required format above) |
Glist |
list of information about genotype matrix stored on disk |
chr |
chromosome for which marker effect is checked |
region |
genome region (in base pairs) for which marker effect is checked |
msize |
is the number of markers used in the prediction |
threshold |
p-value threshold used for chisquare test for difference between observed and predicted marker effects |
overlap |
is the number of markers overlapping between adjacent genome region |
niter |
is the number of iteration used for detecting outliers |
Adjustment of marker summary statistics using clumping and thresholding
Description
Adjust marker summary statistics using linkage disequilibrium information from Glist.
Usage
adjStat(
stat = NULL,
Glist = NULL,
chr = NULL,
statistics = "b",
r2 = 0.9,
ldSets = NULL,
threshold = 1,
header = NULL,
method = "pruning"
)
Arguments
stat |
A data frame with marker summary statistics (see required format above). |
Glist |
List of information about genotype matrix stored on disk. |
chr |
Chromosome(s) being processed. |
statistics |
Specify what type of statistics ("b" or "z") is being processed. Default is "b". |
r2 |
Threshold used in clumping/pruning procedure. Default is 0.9. |
ldSets |
List of marker sets - names correspond to row names in 'stat'. |
threshold |
P-value threshold used in clumping procedure. Default is 1. |
header |
Character vector with column names to be excluded in the LD adjustment. |
method |
Method used in adjustment for linkage disequilibrium. Default is "clumping". |
Details
Required input format for summary statistics:
stat can be a data.frame(rsids, chr, pos, ea, nea, eaf, b, seb, stat, p, n) (single trait)
stat can be a list(marker=(rsids, chr, pos, ea, nea, eaf), b, seb, stat, p, n) (multiple trait)
For details about the summary statistics format, see the main function description.
Author(s)
Peter Soerensen
Adjust B-values
Description
This function adjusts the B-values based on the LD structure and other parameters. The adjustment is done in subsets, and a plot of observed vs. predicted values is produced for each subset.
Usage
adjustB(
b = NULL,
seb = NULL,
LD = NULL,
msize = 100,
overlap = 50,
nrep = 20,
shrink = 0.001,
threshold = 1
)
Arguments
b |
A numeric vector containing the B-values to be adjusted. If NULL (default), no adjustments are made. |
LD |
A matrix representing the linkage disequilibrium (LD) structure. |
msize |
An integer specifying the size of the subsets. |
overlap |
An integer specifying the overlap size between consecutive subsets. |
shrink |
A numeric value used for shrinkage. Default is 0.001. |
threshold |
A numeric value specifying the threshold. Default is 1e-8. |
Value
A list containing the adjusted B-values.
Adjust Linkage Disequilibrium (LD) Using Map Information
Description
Adjusts the linkage disequilibrium (LD) values based on map information, effective population size, and sample size used for map construction.
Usage
adjustMapLD(LD = NULL, map = NULL, neff = 11600, nmap = 186, threshold = 0.001)
Arguments
LD |
A matrix representing the linkage disequilibrium (LD) structure. |
map |
A numeric vector containing the map information. |
neff |
An integer representing the effective population size. Default is 11600. |
nmap |
An integer representing the sample size used for map construction. Default is 186. |
threshold |
A numeric value specifying the threshold for setting LD to zero. Currently unused in the function. Default is 0.001. |
Value
A matrix where each value is the adjusted LD based on the input map, effective population size, and map construction sample size.
Compute AUC
Description
Compute Area Under the Curve (AUC)
Usage
auc(yobs = NULL, ypred = NULL)
Arguments
yobs |
is a vector of observed phenotypes |
ypred |
is a vector of predicted phenotypes |
Quality Control of Marker Summary Statistics
Description
Quality control is a fundamental step in GWAS summary statistics analysis. The function is equipped to handle various tasks including mapping marker ids, checking the effect allele and its frequency, determining build versions, and excluding data based on multiple criteria.
Usage
checkStat(
Glist = NULL,
stat = NULL,
excludeMAF = 0.01,
excludeMAFDIFF = 0.05,
excludeINFO = 0.8,
excludeCGAT = TRUE,
excludeINDEL = TRUE,
excludeDUPS = TRUE,
excludeMHC = FALSE,
excludeMISS = 0.05,
excludeHWE = 1e-12
)
Arguments
Glist |
List containing information about genotype matrix stored on disk. |
stat |
Data frame of marker summary statistics. It should either follow the "internal" or "external" format. |
excludeMAF |
Numeric. Exclusion threshold for minor allele frequency. Default is 0.01. |
excludeMAFDIFF |
Numeric. Threshold for excluding markers based on allele frequency difference. Default is 0.05. |
excludeINFO |
Numeric. Exclusion threshold for info score. Default is 0.8. |
excludeCGAT |
Logical. Exclude ambiguous alleles (CG or AT). Default is TRUE. |
excludeINDEL |
Logical. Exclude insertion/deletion markers. Default is TRUE. |
excludeDUPS |
Logical. Exclude markers with duplicated ids. Default is TRUE. |
excludeMHC |
Logical. Exclude markers located in MHC region. Default is FALSE. |
excludeMISS |
Numeric. Exclusion threshold for sample missingness. Default is 0.05. |
excludeHWE |
Numeric. Exclusion threshold for Hardy Weinberg Equilibrium test p-value. Default is 1e-12. |
Details
Performs quality control on GWAS summary statistics, which includes: - Mapping marker ids to LD reference panel data. - Checking effect allele, frequency, and build version. - Excluding based on various criteria like MAF, HWE, INDELS, and more.
The function works with both "internal" and "external" formats of summary statistics. When the summary statistics format is "external", the function maps marker ids based on chr-pos-ref-alt information. It also aligns the effect allele with the LD reference panel and flips effect sizes if necessary. When allele frequencies are not provided, it uses the frequencies from the genotype data.
Required headers for external summary statistics: marker, chr, pos, ea, nea, eaf, b, seb, stat, p, n
Required headers for internal summary statistics: rsids, chr, pos, ea, nea, eaf, b, seb, stat, p, n
Value
A data frame with processed and quality-controlled summary statistics.
Author(s)
Peter Soerensen
Compute Receiver Operating Curve statistics
Description
Compute ROC
Usage
computeROC(yobs = NULL, ypred = NULL)
Arguments
yobs |
vector of observed phenotype |
ypred |
vector of predicted phenotype |
Author(s)
Palle Duun Rohde
Create Linkage Disequilibrium (LD) Sets
Description
This function partitions a vector of LD scores into sets or blocks based on cumulative LD scores, with constraints on block size and separation. The method ensures that blocks are evenly distributed and reduces overlap based on LD patterns.
Usage
createLDsets(
ldscores = NULL,
msize = 200,
maxsize = 2000,
nsplit = 200,
verbose = FALSE
)
Arguments
ldscores |
A numeric vector of LD scores for markers, where names correspond to marker identifiers. |
msize |
An integer specifying the minimum block size for averaging LD scores. Default is 200. |
maxsize |
An integer specifying the maximum block size. Default is 2000. |
nsplit |
An integer specifying the number of splits (blocks) to create. Default is 200. |
verbose |
A logical value. If |
Details
The function uses cumulative sums of LD scores to create blocks of markers that minimize overlap while satisfying block size constraints. Blocks are defined iteratively by identifying regions with low cumulative LD scores and expanding them until they reach the defined block size limits.
- **Cumulative LD Calculation:** The function calculates the average cumulative LD scores
over sliding windows of size msize
.
- **Block Splitting:** Regions with the lowest cumulative LD scores are selected iteratively
to define block boundaries.
- **Plotting:** If verbose = TRUE
, the function generates two diagnostic plots:
- Block sizes for each LD set.
- Cumulative LD scores across genome positions, highlighting split positions.
Value
A list where each element is a vector of marker names corresponding to a block of LD scores.
Bayesian linear regression models
Description
Bayesian linear regression (BLR) models:
- unified mapping of genetic variants, estimation of genetic parameters (e.g. heritability) and prediction of disease risk)
- handles different genetic architectures (few large, many small effects)
- scale to large data (e.g. sparse LD)
In the Bayesian multiple regression model the posterior density of the model parameters depend on the likelihood of the data given the parameters and a prior probability for the model parameters
The prior density of marker effects defines whether the model will induce variable selection and shrinkage or shrinkage only. Also, the choice of prior will define the extent and type of shrinkage induced. Ideally the choice of prior for the marker effect should reflect the genetic architecture of the trait, and will vary (perhaps a lot) across traits.
The following prior distributions are provided:
Bayes N: Assigning a Gaussian prior to marker effects implies that the posterior means are the BLUP estimates (same as Ridge Regression).
Bayes L: Assigning a double-exponential or Laplace prior is the density used in the Bayesian LASSO
Bayes A: similar to ridge regression but t-distribution prior (rather than Gaussian) for the marker effects ; variance comes from an inverse-chi-square distribution instead of being fixed. Estimation via Gibbs sampling.
Bayes C: uses a “rounded spike” (low-variance Gaussian) at origin many small effects can contribute to polygenic component, reduces the dimensionality of the model (makes Gibbs sampling feasible).
Bayes R: Hierarchical Bayesian mixture model with 4 Gaussian components, with variances scaled by 0, 0.0001 , 0.001 , and 0.01 .
Usage
gbayes(
y = NULL,
X = NULL,
W = NULL,
stat = NULL,
covs = NULL,
trait = NULL,
fit = NULL,
Glist = NULL,
chr = NULL,
rsids = NULL,
b = NULL,
bm = NULL,
seb = NULL,
LD = NULL,
n = NULL,
formatLD = "dense",
vg = NULL,
vb = NULL,
ve = NULL,
ssg_prior = NULL,
ssb_prior = NULL,
sse_prior = NULL,
lambda = NULL,
scaleY = TRUE,
h2 = NULL,
pi = 0.001,
updateB = TRUE,
updateG = TRUE,
updateE = TRUE,
updatePi = TRUE,
adjustE = TRUE,
models = NULL,
nug = 4,
nub = 4,
nue = 4,
verbose = FALSE,
msize = 100,
mask = NULL,
GRMlist = NULL,
ve_prior = NULL,
vg_prior = NULL,
tol = 0.001,
nit = 100,
nburn = 0,
nthin = 1,
nit_local = NULL,
nit_global = NULL,
method = "mixed",
algorithm = "mcmc"
)
Arguments
y |
is a vector or matrix of phenotypes |
X |
is a matrix of covariates |
W |
is a matrix of centered and scaled genotypes |
stat |
dataframe with marker summary statistics |
covs |
is a list of summary statistics (output from internal cvs function) |
trait |
is an integer used for selection traits in covs object |
fit |
is a list of results from gbayes |
Glist |
list of information about genotype matrix stored on disk |
chr |
is the chromosome for which to fit BLR models |
rsids |
is a character vector of rsids |
b |
is a vector or matrix of marginal marker effects |
bm |
is a vector or matrix of adjusted marker effects for the BLR model |
seb |
is a vector or matrix of standard error of marginal effects |
LD |
is a list with sparse LD matrices |
n |
is a scalar or vector of number of observations for each trait |
formatLD |
is a character specifying LD format (formatLD="dense" is default) |
vg |
is a scalar or matrix of genetic (co)variances |
vb |
is a scalar or matrix of marker (co)variances |
ve |
is a scalar or matrix of residual (co)variances |
ssg_prior |
is a scalar or matrix of prior genetic (co)variances |
ssb_prior |
is a scalar or matrix of prior marker (co)variances |
sse_prior |
is a scalar or matrix of prior residual (co)variances |
lambda |
is a vector or matrix of lambda values |
scaleY |
is a logical; if TRUE y is centered and scaled |
h2 |
is the trait heritability |
pi |
is the proportion of markers in each marker variance class (e.g. pi=c(0.999,0.001),used if method="ssvs") |
updateB |
is a logical for updating marker (co)variances |
updateG |
is a logical for updating genetic (co)variances |
updateE |
is a logical for updating residual (co)variances |
updatePi |
is a logical for updating pi |
adjustE |
is a logical for adjusting residual variance |
models |
is a list structure with models evaluated in bayesC |
nug |
is a scalar or vector of prior degrees of freedom for prior genetic (co)variances |
nub |
is a scalar or vector of prior degrees of freedom for marker (co)variances |
nue |
is a scalar or vector of prior degrees of freedom for prior residual (co)variances |
verbose |
is a logical; if TRUE it prints more details during iteration |
msize |
number of markers used in compuation of sparseld |
mask |
is a vector or matrix of TRUE/FALSE specifying if marker should be ignored |
GRMlist |
is a list providing information about GRM matrix stored in binary files on disk |
ve_prior |
is a scalar or matrix of prior residual (co)variances |
vg_prior |
is a scalar or matrix of prior genetic (co)variances |
tol |
is tolerance, i.e. convergence criteria used in gbayes |
nit |
is the number of iterations |
nburn |
is the number of burnin iterations |
nthin |
is the thinning parameter |
nit_local |
is the number of local iterations |
nit_global |
is the number of global iterations |
method |
specifies the methods used (method="bayesN","bayesA","bayesL","bayesC","bayesR") |
algorithm |
specifies the algorithm |
Value
Returns a list structure including
b |
vector or matrix (mxt) of posterior means for marker effects |
d |
vector or matrix (mxt) of posterior means for marker inclusion probabilities |
vb |
scalar or vector (t) of posterior means for marker variances |
vg |
scalar or vector (t) of posterior means for genomic variances |
ve |
scalar or vector (t) of posterior means for residual variances |
rb |
matrix (txt) of posterior means for marker correlations |
rg |
matrix (txt) of posterior means for genomic correlations |
re |
matrix (txt) of posterior means for residual correlations |
pi |
vector (1xnmodels) of posterior probabilities for models |
h2 |
vector (1xt) of posterior means for model probability |
param |
a list current parameters (same information as item listed above) used for restart of the analysis |
stat |
matrix (mxt) of marker information and effects used for genomic risk scoring |
Author(s)
Peter Sørensen
Examples
# Simulate data and test functions
W <- matrix(rnorm(100000),nrow=1000)
set1 <- sample(1:ncol(W),5)
set2 <- sample(1:ncol(W),5)
sets <- list(set1,set2)
g <- rowSums(W[,c(set1,set2)])
e <- rnorm(nrow(W),mean=0,sd=1)
y <- g + e
fitM <- gbayes(y=y, W=W, method="bayesN")
fitA <- gbayes(y=y, W=W, method="bayesA")
fitL <- gbayes(y=y, W=W, method="bayesL")
fitC <- gbayes(y=y, W=W, method="bayesC")
Compute Genomic BLUP values
Description
Compute Genomic BLUP values based on linear mixed model fit output from greml
Usage
gblup(
GRMlist = NULL,
GRM = NULL,
fit = NULL,
ids = NULL,
idsCLS = NULL,
idsRWS = NULL
)
Arguments
GRMlist |
list providing information about GRM matrix stored in binary files on disk |
GRM |
list of one or more genomic relationship matrices |
fit |
list object output from greml function |
ids |
vector of ids for which BLUP values is computed |
idsCLS |
vector of column ids in GRM for which BLUP values is computed |
idsRWS |
vector of row ids in GRM for which BLUP values is computed |
Get elements from genotype matrix stored in PLINK bedfiles
Description
Extracts specific rows (based on ids or row numbers) and columns (based on rsids or column numbers) from a genotype matrix stored on disk. The extraction is based on provided arguments such as chromosome number, ids, rsids, etc. Genotypes can be optionally scaled and imputed.
Usage
getG(
Glist = NULL,
chr = NULL,
bedfiles = NULL,
bimfiles = NULL,
famfiles = NULL,
ids = NULL,
rsids = NULL,
rws = NULL,
cls = NULL,
impute = TRUE,
scale = FALSE
)
Arguments
Glist |
A list structure containing information about genotypes stored on disk. |
chr |
An integer representing the chromosome for which the genotype matrix is to be extracted. It is required. |
bedfiles |
A vector of filenames for the PLINK bed-file. |
bimfiles |
A vector of filenames for the PLINK bim-file. |
famfiles |
A vector of filenames for the PLINK fam-file. |
ids |
A vector of individual IDs for whom the genotype data needs to be extracted. |
rsids |
A vector of SNP identifiers for which the genotype data needs to be extracted. |
rws |
A vector of row numbers to be extracted from the genotype matrix. |
cls |
A vector of column numbers to be extracted from the genotype matrix. |
impute |
A logical or integer. If TRUE, missing genotypes are replaced with their expected values (2 times the allele frequency). If set to an integer, missing values are replaced by that integer. |
scale |
A logical. If TRUE, the genotype markers are scaled to have a mean of zero and variance of one. |
Details
This function facilitates the extraction of specific genotype data from storage based on various criteria. The extracted genotype data can be optionally scaled or imputed. If rsids are provided that are not found in the 'Glist', a warning is raised.
Value
A matrix with extracted genotypic data. Rows correspond to individuals, and columns correspond to SNPs. Row names are set to individual IDs, and column names are set to rsids.
Extract elements from genomic relationship matrix (GRM) stored on disk
Description
Extract elements from genomic relationship matrix (GRM) (whole or subset) stored on disk.
Usage
getGRM(
GRMlist = NULL,
ids = NULL,
idsCLS = NULL,
idsRWS = NULL,
cls = NULL,
rws = NULL
)
Arguments
GRMlist |
list providing information about GRM matrix stored in binary files on disk |
ids |
vector of ids in GRM to be extracted |
idsCLS |
vector of column ids in GRM to be extracted |
idsRWS |
vector of row ids in GRM to be extracted |
cls |
vector of columns in GRM to be extracted |
rws |
vector of rows in GRM to be extracted |
Retrieve Sparse LD Matrix for a Given Chromosome
Description
Extracts and returns a sparse LD (Linkage Disequilibrium) matrix for the specified chromosome based on genotypic data provided in 'Glist'.
Usage
getLD(Glist = NULL, chr = NULL, rsids = NULL)
Arguments
Glist |
A list structure containing genotypic data, including rsids for LD calculation ('rsidsLD'), LD file locations ('ldfiles'), and 'msize' which indicates the size for surrounding region to consider for LD. |
chr |
A specific chromosome from which LD sets need to be extracted. |
rsids |
A vector of rsids that need to be included in the sparse LD matrix. Default is NULL, implying all rsids in the chromosome will be used. |
Details
The function constructs the LD matrix by reading LD values from binary files stored in 'Glist$ldfiles'. Each column of the matrix represents a SNP from 'rsidsLD', and rows represent LD values for surrounding SNPs. The main diagonal (msize + 1 row) is set to 1 for all SNPs.
Value
A matrix containing LD values. The matrix is of size (msize * 2 + 1) x mchr, where mchr is the number of rsids in the chromosome. The returned matrix has column names corresponding to rsids in the chromosome, and row names representing relative positions to the current SNP, from -msize to msize.
Extract Linkage Disequilibrium (LD) Scores
Description
This function retrieves LD scores from a structured list ('Glist') that includes LD scores, chromosome information, and SNP identifiers. You can extract all LD scores or filter them by a specific chromosome and/or SNP identifiers (rsids).
Usage
getLDscores(Glist = NULL, chr = NULL, rsids = NULL)
Arguments
Glist |
A list structure containing genotypic data, including rsids for LD calculation ('rsidsLD'), LD file locations ('ldfiles'), and 'msize' which indicates the size for surrounding region to consider for LD. |
chr |
A specific chromosome from which LD sets need to be extracted. |
rsids |
A vector of rsids that need to be included in the sparse LD matrix. Default is NULL, implying all rsids in the chromosome will be used. |
Value
A vector containing LD scores.
Get marker LD sets
Description
Extracts marker LD sets based on a sparse LD matrix stored in the Glist object.
Usage
getLDsets(Glist = NULL, chr = NULL, r2 = 0.5)
Arguments
Glist |
A list structure containing information about genotypes stored on disk. |
chr |
A numeric value specifying the chromosome for which LD sets are to be extracted. |
r2 |
A numeric threshold, defaulting to 0.5, used for extracting LD sets. |
Retrieve the map for specified rsids on a given chromosome.
Description
Fetch the map associated with provided rsids for a given chromosome from the list 'Glist'.
Usage
getMap(Glist = NULL, chr = NULL, rsids = NULL)
Arguments
Glist |
A list structure with information about genotypes stored on disk. |
chr |
A chromosome from which the map is retrieved. |
rsids |
A vector of rsids for which the map is needed. |
Value
A vector containing the map corresponding to the specified rsids on the given chromosome.
Retrieve marker rsids in a specified genome region.
Description
Get marker rsids (reference SNP cluster IDs) from a specified genome region based on markers present in 'Glist'.
Usage
getMarkers(Glist = NULL, chr = NULL, region = NULL)
Arguments
Glist |
A list structure with information about genotypes stored on disk. |
chr |
A chromosome from which markers are extracted. |
region |
A genome region (in base pairs) from which markers are extracted. |
Value
A vector of rsids that fall within the specified region on the given chromosome.
Retrieve the positions for specified rsids on a given chromosome.
Description
Fetch the genomic positions associated with provided rsids for a given chromosome from the list 'Glist'.
Usage
getPos(Glist = NULL, chr = NULL, rsids = NULL)
Arguments
Glist |
A list structure with information about genotypes stored on disk. |
chr |
A chromosome from which the positions are retrieved. |
rsids |
A vector of rsids for which the positions are needed. |
Value
A vector containing the positions corresponding to the specified rsids on the given chromosome.
Extract Sparse Linkage Disequilibrium (LD) Information
Description
Retrieves and formats linkage disequilibrium (LD) data from binary files based on a specified chromosome and LD threshold. It provides options for returning the data in sparse or dense format.
Usage
getSparseLD(
Glist = NULL,
chr = NULL,
r2 = 0,
onebased = FALSE,
rsids = NULL,
format = "sparse"
)
Arguments
Glist |
A list containing details such as the LD file path, msize, rsids for LD. |
chr |
A numeric value representing the chromosome for which LD data is to be extracted. |
r2 |
A numeric value specifying the LD threshold for extraction. Default is 0. |
onebased |
A logical value indicating whether indices are one-based (default) or zero-based. |
rsids |
A vector of rsids for which the LD data needs to be extracted in dense format. Default is NULL. |
format |
A character string specifying the format of the result, either "sparse" (default) or "dense". |
Value
If 'format' is "sparse", a list with two components: 'indices' and 'values'. Each component is a list of length equal to the number of rsids in the specified chromosome. If 'format' is "dense", a matrix with rows and columns named after the rsids is returned.
Filter genetic marker data based on different quality measures
Description
Quality control is a critical step for working with summary statistics (in particular for external). Processing and quality control of GWAS summary statistics includes:
- map marker ids (rsids/cpra (chr, pos, ref, alt)) to LD reference panel data - check effect allele (flip EA, EAF, Effect) - check effect allele frequency - thresholds for MAF and HWE - exclude INDELS, CG/AT and MHC region - remove duplicated marker ids - check which build version - check for concordance between marker effect and LD data
External summary statistics format: marker, chr, pos, effect_allele, non_effect_allele, effect_allele_freq, effect, effect_se, stat, p, n
Internal summary statistics format: rsids, chr, pos, a1, a2, af, b, seb, stat, p, n
Usage
gfilter(
Glist = NULL,
excludeMAF = 0.01,
excludeMISS = 0.05,
excludeINFO = NULL,
excludeCGAT = TRUE,
excludeINDEL = TRUE,
excludeDUPS = TRUE,
excludeHWE = 1e-12,
excludeMHC = FALSE,
assembly = "GRCh37"
)
Arguments
Glist |
A list containing information about the genotype matrix stored on disk. |
excludeMAF |
A scalar threshold. Exclude markers with a minor allele frequency (MAF) below this threshold. Default is 0.01. |
excludeMISS |
A scalar threshold. Exclude markers with missingness (MISS) above this threshold. Default is 0.05. |
excludeINFO |
A scalar threshold. Exclude markers with an info score (INFO) below this threshold. Default is 0.8. |
excludeCGAT |
A logical value; if TRUE exclude markers if the alleles are ambiguous (i.e., either CG or AT combinations). |
excludeINDEL |
A logical value; if TRUE exclude markers that are insertions or deletions (INDELs). |
excludeDUPS |
A logical value; if TRUE exclude markers if their identifiers are duplicated. |
excludeHWE |
A scalar threshold. Exclude markers where the p-value for the Hardy-Weinberg Equilibrium test is below this threshold. Default is 0.01. |
excludeMHC |
A logical value; if TRUE exclude markers located within the MHC region. |
assembly |
A character string indicating the name of the genome assembly (e.g., "GRCh38"). |
Author(s)
Peter Soerensen
Single marker association analysis using linear models or linear mixed models
Description
The function glma performs single marker association analysis between genotype markers and the phenotype either based on linear model analysis (LMA) or mixed linear model analysis (MLMA).
The basic MLMA approach involves 1) building a genetic relationship matrix (GRM) that models genome-wide sample structure, 2) estimating the contribution of the GRM to phenotypic variance using a random effects model (with or without additional fixed effects) and 3) computing association statistics that account for this component on phenotypic variance.
MLMA methods are the method of choice when conducting association mapping in the presence of sample structure, including geographic population structure, family relatedness and/or cryptic relatedness. MLMA methods prevent false positive associations and increase power. The general recommendation when using MLMA is to exclude candidate markers from the GRM. This can be efficiently implemented via a leave-one-chromosome-out analysis. Further, it is recommend that analyses of randomly ascertained quantitative traits should include all markers (except for the candidate marker and markers in LD with the candidate marker) in the GRM, except as follows. First, the set of markers included in the GRM can be pruned by LD to reduce running time (with association statistics still computed for all markers). Second, genome-wide significant markers of large effect should be conditioned out as fixed effects or as an additional random effect (if a large number of associated markers). Third, when population stratification is less of a concern, it may be useful using the top associated markers selected based on the global maximum from out-of sample predictive accuracy.
Usage
glma(
y = NULL,
X = NULL,
W = NULL,
Glist = NULL,
chr = NULL,
fit = NULL,
verbose = FALSE,
statistic = "mastor",
ids = NULL,
rsids = NULL,
msize = 100,
scale = TRUE
)
Arguments
y |
vector or matrix of phenotypes |
X |
design matrix for factors modeled as fixed effects |
W |
matrix of centered and scaled genotypes |
Glist |
list of information about genotype matrix stored on disk |
chr |
chromosome for which summary statistics are computed |
fit |
list of information about linear mixed model fit (output from greml) |
verbose |
is a logical; if TRUE it prints more details during optimization |
statistic |
single marker test statistic used (currently based on the "mastor" statistics). |
ids |
vector of individuals used in the analysis |
rsids |
vector of marker rsids used in the analysis |
msize |
number of genotype markers used for batch processing |
scale |
logical if TRUE the genotypes have been scaled to mean zero and variance one |
Value
Returns a dataframe (if number of traits = 1) else a list including
coef |
single marker coefficients |
se |
standard error of coefficients |
stat |
single marker test statistic |
p |
p-value |
Author(s)
Peter Soerensen
References
Chen, W. M., & Abecasis, G. R. (2007). Family-based association tests for genomewide association scans. The American Journal of Human Genetics, 81(5), 913-926.
Loh, P. R., Tucker, G., Bulik-Sullivan, B. K., Vilhjalmsson, B. J., Finucane, H. K., Salem, R. M., ... & Patterson, N. (2015). Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nature genetics, 47(3), 284-290.
Kang, H. M., Sul, J. H., Zaitlen, N. A., Kong, S. Y., Freimer, N. B., Sabatti, C., & Eskin, E. (2010). Variance component model to account for sample structure in genome-wide association studies. Nature genetics, 42(4), 348-354.
Lippert, C., Listgarten, J., Liu, Y., Kadie, C. M., Davidson, R. I., & Heckerman, D. (2011). FaST linear mixed models for genome-wide association studies. Nature methods, 8(10), 833-835.
Listgarten, J., Lippert, C., Kadie, C. M., Davidson, R. I., Eskin, E., & Heckerman, D. (2012). Improved linear mixed models for genome-wide association studies. Nature methods, 9(6), 525-526.
Listgarten, J., Lippert, C., & Heckerman, D. (2013). FaST-LMM-Select for addressing confounding from spatial structure and rare variants. Nature Genetics, 45(5), 470-471.
Lippert, C., Quon, G., Kang, E. Y., Kadie, C. M., Listgarten, J., & Heckerman, D. (2013). The benefits of selecting phenotype-specific variants for applications of mixed models in genomics. Scientific reports, 3.
Zhou, X., & Stephens, M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nature genetics, 44(7), 821-824.
Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M., & Aulchenko, Y. S. (2012). Rapid variance components-based method for whole-genome association analysis. Nature genetics, 44(10), 1166-1170.
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M., & Price, A. L. (2014). Advantages and pitfalls in the application of mixed-model association methods. Nature genetics, 46(2), 100-106.
Bulik-Sullivan, B. K., Loh, P. R., Finucane, H. K., Ripke, S., Yang, J., Patterson, N., ... & Schizophrenia Working Group of the Psychiatric Genomics Consortium. (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature genetics, 47(3), 291-295.
Examples
# Simulate data
W <- matrix(rnorm(1000000), ncol = 1000)
colnames(W) <- as.character(1:ncol(W))
rownames(W) <- as.character(1:nrow(W))
y <- rowSums(W[, 1:10]) + rowSums(W[, 501:510]) + rnorm(nrow(W))
# Create model
data <- data.frame(y = y, mu = 1)
fm <- y ~ 0 + mu
X <- model.matrix(fm, data = data)
# Linear model analyses and single marker association test
stat <- glma(y=y,X=X,W = W)
head(stat)
# Compute GRM
GRM <- grm(W = W)
# Estimate variance components using REML analysis
fit <- greml(y = y, X = X, GRM = list(GRM), verbose = TRUE)
# Single marker association test
stat <- glma(fit = fit, W = W)
head(stat)
Finemapping using Bayesian Linear Regression Models
Description
In the Bayesian multiple regression model, the posterior density of the model parameters depends on the likelihood of the data given the parameters and a prior probability for the model parameters. The choice of the prior for marker effects can influence the type and extent of shrinkage induced in the model.
Usage
gmap(
Glist = NULL,
stat = NULL,
sets = NULL,
models = NULL,
rsids = NULL,
ids = NULL,
mask = NULL,
lambda = NULL,
vb = NULL,
vg = NULL,
ve = NULL,
vy = NULL,
pi = NULL,
gamma = NULL,
mc = 5000,
h2 = 0.5,
nub = 4,
nug = 4,
nue = 4,
ssb_prior = NULL,
ssg_prior = NULL,
sse_prior = NULL,
vb_prior = NULL,
vg_prior = NULL,
ve_prior = NULL,
updateB = TRUE,
updateG = TRUE,
updateE = TRUE,
updatePi = TRUE,
formatLD = "dense",
checkLD = FALSE,
shrinkLD = FALSE,
shrinkCor = FALSE,
pruneLD = FALSE,
checkConvergence = FALSE,
critVe = 3,
critVg = 3,
critVb = 3,
critPi = 3,
critB = 3,
critB1 = 0.5,
critB2 = 3,
verbose = FALSE,
eigen_threshold = 0.995,
cs_threshold = 0.9,
cs_r2 = 0.5,
nit = 1000,
nburn = 100,
nthin = 1,
output = "summary",
method = "bayesR",
algorithm = "mcmc-eigen",
seed = 10
)
Arguments
Glist |
A list containing information on genotypic data, including SNPs, chromosomes, positions, and optionally, LD matrices. |
stat |
A data frame or list of summary statistics including effect sizes, standard errors, sample sizes, etc. |
sets |
Optional list specifying sets of SNPs for mapping. |
models |
Optional list of predefined models for Bayesian regression. |
rsids |
Vector of SNP identifiers. |
ids |
Vector of sample identifiers. |
mask |
Logical matrix indicating SNPs to exclude from analysis. |
lambda |
Vector of initial values for penalty parameters in the model. |
vb |
Initial value for the marker effect variance (default: NULL). |
vg |
Initial value for the genetic variance (default: NULL). |
ve |
Initial value for the residual variance (default: NULL). |
vy |
Initial value for the phenotypic variance (default: NULL). |
pi |
Vector of initial values for pi parameters in the model (default of pi=c(0.999,0.001) for bayesC and pi=c(0.994,0.003,0.002,0.001). |
gamma |
Vector of initial values for gamma parameters in the model (default of gamma=c(0,1) for bayesC and gamma=c(0,0.01,0.1,1). |
mc |
Number of potentiel genome-wide causal markers for the trait analysed - only used for specification of ssb_prior (default: 5000). |
h2 |
Heritability estimate (default: 0.5). |
nub , nug , nue |
Degrees of freedom parameters for the priors of marker, genetic, and residual variances, respectively. |
ssb_prior , ssg_prior , sse_prior |
Priors for the marker, genetic, and residual variances. |
vb_prior , vg_prior , ve_prior |
Additional priors for marker, genetic, and residual variances (default: NULL). |
updateB , updateG , updateE , updatePi |
Logical values specifying whether to update marker effects, genetic variance, residual variance, and inclusion probabilities, respectively. |
formatLD |
Format of LD matrix ("dense" by default). |
checkLD |
Logical, whether to check the LD matrix for inconsistencies (default: FALSE). |
shrinkLD , shrinkCor |
Logical, whether to apply shrinkage to the LD or correlation matrices (default: FALSE). |
pruneLD |
Logical, whether to prune LD matrix (default: FALSE). |
checkConvergence |
Logical, whether to check for convergence of the Gibbs sampler (default: FALSE). |
critVe , critVg , critVb , critPi , critB , critB1 , critB2 |
Convergence criteria for residual, genetic, and marker variances, inclusion probabilities, and marker effects. |
verbose |
Logical, whether to print detailed output for debugging (default: FALSE). |
eigen_threshold |
Threshold for eigenvalues in eigen decomposition (default: 0.995). |
cs_threshold , cs_r2 |
PIP and r2 thresholds credible set construction (default: cs_threshold=0.9, cs_r2=0.5) |
nit |
Number of iterations in the MCMC sampler (default: 5000). |
nburn |
Number of burn-in iterations (default: 500). |
nthin |
Thinning interval for MCMC (default: 5). |
output |
Level of output, options include "summary", "full". |
method |
The regression method to use, options include "blup", "bayesN", "bayesA", "bayesL", "bayesC", "bayesR". |
algorithm |
Algorithm for MCMC sampling, options include "mcmc", "em-mcmc", "mcmc-eigen". |
seed |
Random seed for reproducibility (default: 10). |
Details
This function implements Bayesian linear regression models to provide unified mapping of genetic variants, estimate genetic parameters (e.g. heritability), and predict disease risk. It is designed to handle various genetic architectures and scale efficiently with large datasets.
Value
Returns a list structure including the following components:
Author(s)
Peter Sørensen
Prepare genotype data for all statistical analyses
Description
All functions in qgg relies on a simple data infrastructure that takes five main input sources; phenotype data (y), covariate data (X), genotype data (G or Glist), a genomic relationship matrix (GRM or GRMlist) and genetic marker sets (sets).
The genotypes are stored in a matrix (n x m (individuals x markers)) in memory (G) or in a binary file on disk (Glist).
It is only for small data sets that the genotype matrix (G) can stored in memory. For large data sets the genotype matrix has to stored in a binary file on disk (Glist). Glist is as a list structure that contains information about the genotypes in the binary file.
The gprep function prepares the Glist, and is required for downstream analyses of large-scale genetic data. Typically, the Glist is prepared once, and saved as an *.Rdata-file.
The gprep function reads genotype information from binary PLINK files, and creates the Glist object that contains general information about the genotypes such as reference alleles, allele frequencies and missing genotypes, and construct a binary file on the disk that contains the genotypes as allele counts of the alternative allele (memory usage = (n x m)/4 bytes).
The gprep function can also be used to prepare sparse ld matrices. The r2 metric used is the pairwise correlation between markers (allele count alternative allele) in a specified region of the genome. The marker genotype is allele count of the alternative allele which is assumed to be centered and scaled.
The Glist structure is used as input parameter for a number of qgg core functions including: 1) construction of genomic relationship matrices (grm), 2) construction of sparse ld matrices, 3) estimating genomic parameters (greml), 4) single marker association analyses (glma), 5) gene set enrichment analyses (gsea), and 6) genomic prediction from genotypes and phenotypes (gsolve) or genotypes and summary statistics (gscore).
Usage
gprep(
Glist = NULL,
task = "prepare",
study = NULL,
fnBED = NULL,
ldfiles = NULL,
bedfiles = NULL,
bimfiles = NULL,
famfiles = NULL,
mapfiles = NULL,
ids = NULL,
rsids = NULL,
assembly = NULL,
overwrite = FALSE,
msize = 100,
r2 = NULL,
kb = NULL,
cm = NULL,
ncores = 1
)
Arguments
Glist |
A list containing information about the genotype matrix stored on disk. |
task |
A character string specifying the task to perform. Possible tasks are "prepare" (default), "sparseld", "ldscores", and "geneticmap". |
study |
The name of the study. |
fnBED |
Path and filename of the .bed binary file used to store genotypes on disk. |
ldfiles |
Path and filename of the .ld binary files used for storing the sparse LD matrix on disk. |
bedfiles |
A vector of filenames for the PLINK bed-files. |
bimfiles |
A vector of filenames for the PLINK bim-files. |
famfiles |
A vector of filenames for the PLINK fam-files. |
mapfiles |
A vector of filenames for the mapfiles. |
ids |
A vector of individual identifiers used in the study. |
rsids |
A vector of marker rsids used in the study. |
assembly |
Character string indicating the name of the assembly. |
overwrite |
A logical value; if TRUE, the binary genotype/LD file will be overwritten. |
msize |
Number of markers used in the computation of sparseld. |
r2 |
A threshold value (more context might be beneficial, e.g., threshold for what?). |
kb |
Size of the genomic region in kilobases (kb). |
cm |
Size of the genomic region in centimorgans (cm). |
ncores |
Number of processing cores to be used for genotype processing. |
Value
Returns a list structure (Glist) with information about the genotypes.
Author(s)
Peter Soerensen
Examples
bedfiles <- system.file("extdata", "sample_chr1.bed", package = "qgg")
bimfiles <- system.file("extdata", "sample_chr1.bim", package = "qgg")
famfiles <- system.file("extdata", "sample_chr1.fam", package = "qgg")
Glist <- gprep(study="Example", bedfiles=bedfiles, bimfiles=bimfiles,
famfiles=famfiles)
Genomic rescticted maximum likelihood (GREML) analysis
Description
The greml function is used for the estimation of genomic parameters (co-variance, heritability and correlation) for linear mixed models using restricted maximum likelihood estimation (REML) and genomic prediction using best linear unbiased prediction (BLUP).
The linear mixed model can account for multiple genetic factors (fixed and random genetic marker effects), adjust for complex family relationships or population stratification and adjust for other non-genetic factors including lifestyle characteristics. Different genetic architectures (infinitesimal, few large and many small effects) is accounted for by modeling genetic markers in different sets as fixed or random effects and by specifying individual genetic marker weights. Different genetic models (e.g. additive and non-additive) can be specified by providing additive and non-additive genomic relationship matrices (GRMs) (constructed using grm). The GRMs can be accessed from the R environment or from binary files stored on disk facilitating the analyses of large-scale genetic data.
The output contains estimates of variance components, fixed and random effects, first and second derivatives of log-likelihood and the asymptotic standard deviation of parameter estimates.
Assessment of predictive accuracy (including correlation and R2, and AUC for binary phenotypes) can be obtained by providing greml with a data frame, or a list that contains sample IDs used in the validation (see examples for details).
Genomic parameters can also be estimated with DMU (http://www.dmu.agrsci.dk/DMU/) if interface =”DMU”. This option requires DMU to be installed locally, and the path to the DMU binary files has to be specified (see examples below for details).
Usage
greml(
y = NULL,
X = NULL,
GRMlist = NULL,
GRM = NULL,
theta = NULL,
ids = NULL,
validate = NULL,
maxit = 100,
tol = 1e-05,
bin = NULL,
ncores = 1,
wkdir = getwd(),
verbose = FALSE,
interface = "R",
fm = NULL,
data = NULL
)
Arguments
y |
is a vector or matrix of phenotypes |
X |
is a design matrix for factors modeled as fixed effects |
GRMlist |
is a list providing information about GRM matrix stored in binary files on disk |
GRM |
is a list of one or more genomic relationship matrices |
theta |
is a vector of initial values of co-variance for REML estimation |
ids |
is a vector of individuals used in the analysis |
validate |
is a data frame or list of individuals used in cross-validation (one column/row for each validation set) |
maxit |
is the maximum number of iterations used in REML analysis |
tol |
is tolerance, i.e. convergence criteria used in REML |
bin |
is the directory for fortran binaries (e.g. DMU binaries dmu1 and dmuai) |
ncores |
is the number of cores used for the analysis |
wkdir |
is the working directory used for REML |
verbose |
is a logical; if TRUE it prints more details during optimization |
interface |
is used for specifying whether to use R or Fortran implementations of REML |
fm |
is a formula with model statement for the linear mixed model |
data |
is a data frame containing the phenotypic observations and fixed factors specified in the model statements |
Value
returns a list structure including:
llik |
log-likelihood at convergence |
theta |
covariance estimates from REML |
asd |
asymptotic standard deviation |
b |
vector of fixed effect estimates |
varb |
vector of variances of fixed effect estimates |
g |
vector or matrix of random effect estimates |
e |
vector or matrix of residual effects |
accuracy |
matrix of prediction accuracies (only returned if [validate?] is provided) |
Author(s)
Peter Soerensen
References
Lee, S. H., & van der Werf, J. H. (2006). An efficient variance component approach implementing an average information REML suitable for combined LD and linkage mapping with a general complex pedigree. Genetics Selection Evolution, 38(1), 25.
Examples
# Simulate data
W <- matrix(rnorm(1000000), ncol = 1000)
colnames(W) <- as.character(1:ncol(W))
rownames(W) <- as.character(1:nrow(W))
y <- rowSums(W[, 1:10]) + rowSums(W[, 501:510]) + rnorm(nrow(W))
# Create model
data <- data.frame(y = y, mu = 1)
fm <- y ~ 0 + mu
X <- model.matrix(fm, data = data)
# Compute GRM
GRM <- grm(W = W)
# REML analyses
fitG <- greml(y = y, X = X, GRM = list(GRM))
# REML analyses and cross validation
# Create marker sets
setsGB <- list(A = colnames(W)) # gblup model
setsGF <- list(C1 = colnames(W)[1:500], C2 = colnames(W)[501:1000]) # gfblup model
setsGT <- list(C1 = colnames(W)[1:10], C2 = colnames(W)[501:510]) # true model
GB <- lapply(setsGB, function(x) {grm(W = W[, x])})
GF <- lapply(setsGF, function(x) {grm(W = W[, x])})
GT <- lapply(setsGT, function(x) {grm(W = W[, x])})
n <- length(y)
fold <- 10
nvalid <- 5
validate <- replicate(nvalid, sample(1:n, as.integer(n / fold)))
cvGB <- greml(y = y, X = X, GRM = GB, validate = validate)
cvGF <- greml(y = y, X = X, GRM = GF, validate = validate)
cvGT <- greml(y = y, X = X, GRM = GT, validate = validate)
cvGB$accuracy
cvGF$accuracy
cvGT$accuracy
Computing the genomic relationship matrix (GRM)
Description
The grm function is used to compute a genomic relationship matrix (GRM) based on all, or a subset of marker genotypes. GRM for additive, and non-additive (dominance and epistasis) genetic models can be constructed. The output of the grm function can either be a within-memory GRM object (n x n matrix), or a GRM-list which is a list structure that contains information about the GRM stored in a binary file on the disk.
Usage
grm(
Glist = NULL,
GRMlist = NULL,
ids = NULL,
rsids = NULL,
rws = NULL,
cls = NULL,
W = NULL,
method = "add",
scale = TRUE,
msize = 100,
ncores = 1,
fnG = NULL,
overwrite = FALSE,
returnGRM = FALSE,
miss = NA,
impute = TRUE,
pedigree = NULL,
task = "grm"
)
Arguments
Glist |
list providing information about genotypes stored on disk |
GRMlist |
list providing information about GRM matrix stored in binary files on disk |
ids |
vector of individuals used for computing GRM |
rsids |
vector marker rsids used for computing GRM |
rws |
rows in genotype matrix used for computing GRM |
cls |
columns in genotype matrix used for computing GRM |
W |
matrix of centered and scaled genotypes |
method |
indicator of method used for computing GRM: additive (add, default), dominance (dom) or epistasis (epi-pairs or epi-hadamard (all genotype markers)) |
scale |
logical if TRUE the genotypes in Glist has been scaled to mean zero and variance one |
msize |
number of genotype markers used for batch processing |
ncores |
number of cores used to compute the GRM |
fnG |
name of the binary file used for storing the GRM on disk |
overwrite |
logical if TRUE the binary file fnG will be overwritten |
returnGRM |
logical if TRUE function returns the GRM matrix to the R environment |
miss |
the missing code (miss=NA is default) used for missing values in the genotype data |
impute |
if missing values in the genotype matrix W then mean impute |
pedigree |
is a dataframe with pedigree information |
task |
either computation of GRM (task="grm" which is default) or eigenvalue decomposition of GRM (task="eigen") |
Value
Returns a genomic relationship matrix (GRM) if returnGRM=TRUE else a list structure (GRMlist) with information about the GRM stored on disk
Author(s)
Peter Soerensen
Examples
# Simulate data
W <- matrix(rnorm(1000000), ncol = 1000)
colnames(W) <- as.character(1:ncol(W))
rownames(W) <- as.character(1:nrow(W))
# Compute GRM
GRM <- grm(W = W)
# Eigen value decompostion GRM
eig <- grm(GRM=GRM, task="eigen")
Genomic scoring based on single marker summary statistics
Description
Computes genomic predictions using single marker summary statistics and observed genotypes.
Usage
gscore(
Glist = NULL,
chr = NULL,
bedfiles = NULL,
bimfiles = NULL,
famfiles = NULL,
stat = NULL,
fit = NULL,
ids = NULL,
scaleMarker = TRUE,
scaleGRS = TRUE,
impute = TRUE,
msize = 100,
ncores = 1,
verbose = FALSE
)
Arguments
Glist |
List of information about genotype matrix. Default is NULL. |
chr |
Chromosome for which genomic scores is computed. Default is NULL. |
bedfiles |
Names of the PLINK bed-files. Default is NULL. |
bimfiles |
Names of the PLINK bim-files. Default is NULL. |
famfiles |
Names of the PLINK fam-files. Default is NULL. |
stat |
Matrix of single marker effects. Default is NULL. |
fit |
Fit object output from gbayes. Default is NULL. |
ids |
Vector of individuals used in the analysis. Default is NULL. |
scaleMarker |
Logical; if TRUE the genotype markers are scaled to mean zero and variance one. Default is TRUE. |
scaleGRS |
Logical; if TRUE the GRS are scaled to mean zero and variance one. Default is TRUE. |
impute |
Logical; if TRUE, missing genotypes are set to its expected value (2*af where af is allele frequency). Default is TRUE. |
msize |
Number of genotype markers used for batch processing. Default is 100. |
ncores |
Number of cores used in the analysis. Default is 1. |
verbose |
Logical; if TRUE, more details are printed during optimization. Default is FALSE. |
Value
Returns the genomic scores based on the provided parameters.
Author(s)
Peter Soerensen
Examples
## Plink bed/bim/fam files
bedfiles <- system.file("extdata", paste0("sample_chr",1:2,".bed"), package = "qgg")
bimfiles <- system.file("extdata", paste0("sample_chr",1:2,".bim"), package = "qgg")
famfiles <- system.file("extdata", paste0("sample_chr",1:2,".fam"), package = "qgg")
# Summarize bed/bim/fam files
Glist <- gprep(study="Example", bedfiles=bedfiles, bimfiles=bimfiles, famfiles=famfiles)
# Simulate phenotype
sim <- gsim(Glist=Glist, chr=1, nt=1)
# Compute single marker summary statistics
stat <- glma(y=sim$y, Glist=Glist, scale=FALSE)
# Compute genomic scores
gsc <- gscore(Glist = Glist, stat = stat)
Gene set enrichment analysis
Description
The function gsea can perform several different gene set enrichment analyses. The general procedure is to obtain single marker statistics (e.g. summary statistics), from which it is possible to compute and evaluate a test statistic for a set of genetic markers that measures a joint degree of association between the marker set and the phenotype. The marker set is defined by a genomic feature such as genes, biological pathways, gene interactions, gene expression profiles etc.
Currently, four types of gene set enrichment analyses can be conducted with gsea; sum-based, count-based, score-based, and our own developed method, the covariance association test (CVAT). For details and comparisons of test statistics consult doi:10.1534/genetics.116.189498.
The sum test is based on the sum of all marker summary statistics located within the feature set. The single marker summary statistics can be obtained from linear model analyses (from PLINK or using the qgg glma approximation), or from single or multiple component REML analyses (GBLUP or GFBLUP) from the greml function. The sum test is powerful if the genomic feature harbors many genetic markers that have small to moderate effects.
The count-based method is based on counting the number of markers within a genomic feature that show association (or have single marker p-value below a certain threshold) with the phenotype. Under the null hypothesis (that the associated markers are picked at random from the total number of markers, thus, no enrichment of markers in any genomic feature) it is assumed that the observed count statistic is a realization from a hypergeometric distribution.
The score-based approach is based on the product between the scaled genotypes in a genomic feature and the residuals from the liner mixed model (obtained from greml).
The covariance association test (CVAT) is derived from the fit object from greml (GBLUP or GFBLUP), and measures the covariance between the total genomic effects for all markers and the genomic effects of the markers within the genomic feature.
The distribution of the test statistics obtained from the sum-based, score-based and CVAT is unknown, therefore a circular permutation approach is used to obtain an empirical distribution of test statistics.
Usage
gsea(
stat = NULL,
sets = NULL,
Glist = NULL,
W = NULL,
fit = NULL,
g = NULL,
e = NULL,
threshold = 0.05,
method = "sum",
nperm = 1000,
ncores = 1
)
Arguments
stat |
vector or matrix of single marker statistics (e.g. coefficients, t-statistics, p-values) |
sets |
list of marker sets - names corresponds to row names in stat |
Glist |
list providing information about genotypes stored on disk |
W |
matrix of centered and scaled genotypes (used if method = cvat or score) |
fit |
list object obtained from a linear mixed model fit using the greml function |
g |
vector (or matrix) of genetic effects obtained from a linear mixed model fit (GBLUP of GFBLUP) |
e |
vector (or matrix) of residual effects obtained from a linear mixed model fit (GBLUP of GFBLUP) |
threshold |
used if method='hyperg' (threshold=0.05 is default) |
method |
including sum, cvat, hyperg, score |
nperm |
number of permutations used for obtaining an empirical p-value |
ncores |
number of cores used in the analysis |
Value
Returns a dataframe or a list including
stat |
marker set test statistics |
m |
number of markers in the set |
p |
enrichment p-value for marker set |
Author(s)
Peter Soerensen
Examples
# Simulate data
W <- matrix(rnorm(1000000), ncol = 1000)
colnames(W) <- as.character(1:ncol(W))
rownames(W) <- as.character(1:nrow(W))
y <- rowSums(W[, 1:10]) + rowSums(W[, 501:510]) + rnorm(nrow(W))
# Create model
data <- data.frame(y = y, mu = 1)
fm <- y ~ 0 + mu
X <- model.matrix(fm, data = data)
# Single marker association analyses
stat <- glma(y=y,X=X,W=W)
# Create marker sets
f <- factor(rep(1:100,each=10), levels=1:100)
sets <- split(as.character(1:1000),f=f)
# Set test based on sums
b2 <- stat[,"stat"]**2
names(b2) <- rownames(stat)
mma <- gsea(stat = b2, sets = sets, method = "sum", nperm = 100)
head(mma)
# Set test based on hyperG
p <- stat[,"p"]
names(p) <- rownames(stat)
mma <- gsea(stat = p, sets = sets, method = "hyperg", threshold = 0.05)
head(mma)
G <- grm(W=W)
fit <- greml(y=y, X=X, GRM=list(G=G), theta=c(10,1))
# Set test based on cvat
mma <- gsea(W=W,fit = fit, sets = sets, nperm = 1000, method="cvat")
head(mma)
# Set test based on score
mma <- gsea(W=W,fit = fit, sets = sets, nperm = 1000, method="score")
head(mma)
Genomic simulation
Description
Simulate Genotype and Phenotype Data
Usage
gsim(Glist = NULL, chr = 1, nt = 1, W = NULL, n = 1000, m = 1000, rsids = NULL)
Arguments
Glist |
A list of information about the genotype matrix. Default is 'NULL'. |
chr |
The chromosome(s) being used in the simulation. Default is 1. |
nt |
Number of traits. Default is 1. |
W |
Matrix of centered and scaled genotypes. Default is 'NULL'. |
n |
Number of individuals. Default is 1000. |
m |
Number of markers. Default is 1000. |
rsids |
A character vector of rsids. Default is 'NULL'. |
Details
This function simulates genotype and phenotype data based on the 'Glist', which is information about the genotype matrix.
Value
A list containing:
-
y
: Phenotypes. -
W
: Matrix of centered and scaled genotypes. -
e
: Errors. -
g
: Genotype effect. -
b0
,b1
: Coefficients. -
set0
,set1
: Selected markers. -
causal
: Causal markers.
Author(s)
Peter Soerensen
Examples
## Plink bed/bim/fam files
bedfiles <- system.file("extdata", paste0("sample_chr",1:2,".bed"), package = "qgg")
bimfiles <- system.file("extdata", paste0("sample_chr",1:2,".bim"), package = "qgg")
famfiles <- system.file("extdata", paste0("sample_chr",1:2,".fam"), package = "qgg")
# Summarize bed/bim/fam files
Glist <- gprep(study="Example", bedfiles=bedfiles, bimfiles=bimfiles, famfiles=famfiles)
# Simulate phenotype
sim <- gsim(Glist=Glist, chr=1, nt=1)
head(sim$y)
head(sim$e)
head(sim$causal)
Simulate Genetic Data Based on Given Parameters
Description
This function simulates phenotype data by random sampling of markers available on 'Glist'. Default parameters for the simulated phenotype reflect the genetic architecture assumed by BayesC prior(Habier et al., 2011). This function is under active development.
Usage
gsimC(Glist = NULL, h2 = NULL, m = NULL, prp.cau = NULL, n = NULL)
Arguments
Glist |
A list containing genetic data. If NULL, the function will stop with an error. |
h2 |
Heritability. If NULL, heritability of 0.5 is assumed. |
m |
Number of causal markers. The values for either 'm' or 'prp.cau' should be provided at any given time. If the list of quality controlled markers is not available then list of raw markers is used. If 'm' is NULL and 'prp.cau' is also NULL, 'prp.cau' will default to 0.001. |
prp.cau |
Proportion of causal markers. The values for either 'm' or 'prp.cau' should be provided at any given time. |
n |
Number of individuals randomly sampled from 'Glist'. If NULL, all the individuals on 'Glist' is used. |
Value
A list containing:
-
y
: Vector of simulated phenotypes. -
g
: Vector of simulated genetic values. -
e
: Vector of simulated residual effects. -
b
: Vector of effect sizes of the simulated causal markers. -
causal
: Vector of ids for the simulated causal markers. -
h2
: Estimated heritability of the simulated phenotype.
Author(s)
Peter Soerensen
Merina Shrestha
Simulate Genetic Data Based on Given Parameters
Description
This function simulates phenotype data by random sampling of markers available on 'Glist'. Default parameters for the simulated phenotype reflect the genetic architecture assumed by BayesR prior(Erbe et al., 2012). Marker effect are sampled from mixture distributions leading to small, moderate and large effect sizes. This function is under active development.
Usage
gsimR(Glist = NULL, h2 = NULL, m = NULL, prp.cau = NULL, n = NULL)
Arguments
Glist |
A list containing genetic data. If NULL, the function will stop with an error. |
h2 |
Heritability. If NULL, heritability of 0.5 is assumed. |
m |
Number of causal markers. The values for either 'm' or 'prp.cau' should be provided at any given time. If the list of quality controlled markers is not available then list of raw markers is used. If 'm' is NULL and 'prp.cau' is also NULL, 'prp.cau' will default to 0.001. |
prp.cau |
Proportion of causal markers. The values for either 'm' or 'prp.cau' should be provided at any given time. |
n |
Number of individuals randomly sampled from 'Glist'. If NULL, all the individuals on 'Glist' is used. |
Value
A list containing:
-
y
: Vector of simulated phenotypes. -
g
: Vector of simulated genetic values. -
e
: Vector of simulated residual effects. -
b
: list of vectors for effect sizes of the simulated causal markers in three classes. -
causal
: list of vectors for ids for the simulated causal markers in three classes. -
h2
: Estimated heritability of the simulated phenotype.
Author(s)
Peter Soerensen
Merina Shrestha
Solve linear mixed model equations
Description
The gsolve function is used for solving of linear mixed model equations. The algorithm used to solve the equation system is based on a Gauss-Seidel (GS) method (matrix-free with residual updates) that handles large data sets.
The linear mixed model fitted can account for multiple traits, multiple genetic factors (fixed or random genetic marker effects), adjust for complex family relationships or population stratification, and adjust for other non-genetic factors including lifestyle characteristics. Different genetic architectures (infinitesimal, few large and many small effects) is accounted for by modeling genetic markers in different sets as fixed or random effects and by specifying individual genetic marker weights.
Usage
gsolve(
y = NULL,
X = NULL,
GRM = NULL,
va = NULL,
ve = NULL,
Glist = NULL,
W = NULL,
ids = NULL,
rsids = NULL,
sets = NULL,
scale = TRUE,
lambda = NULL,
weights = FALSE,
maxit = 500,
tol = 1e-05,
method = "gsru",
ncores = 1
)
Arguments
y |
vector or matrix of phenotypes |
X |
design matrix of fixed effects |
GRM |
genetic relationship matrix |
va |
genetic variance |
ve |
residual variance |
Glist |
list of information about genotype matrix stored on disk |
W |
matrix of centered and scaled genotypes |
ids |
vector of individuals used in the analysis |
rsids |
vector of marker rsids used in the analysis |
sets |
list containing marker sets rsids |
scale |
logical if TRUE the genotypes in Glist will be scaled to mean zero and variance one |
lambda |
overall shrinkage factor |
weights |
vector of single marker weights used in BLUP |
maxit |
maximum number of iterations used in the Gauss-Seidel procedure |
tol |
tolerance, i.e. the maximum allowed difference between two consecutive iterations of the solver to declare convergence |
method |
used in solver (currently only methods="gsru": gauss-seidel with resiudal update) |
ncores |
number of cores used in the analysis |
Author(s)
Peter Soerensen
Examples
# Simulate data
W <- matrix(rnorm(1000000), ncol = 1000)
colnames(W) <- as.character(1:ncol(W))
rownames(W) <- as.character(1:nrow(W))
m <- ncol(W)
causal <- sample(1:ncol(W),50)
y <- rowSums(W[,causal]) + rnorm(nrow(W),sd=sqrt(50))
X <- model.matrix(y~1)
Sg <- 50
Se <- 50
h2 <- Sg/(Sg+Se)
lambda <- Se/(Sg/m)
lambda <- m*(1-h2)/h2
# BLUP of single marker effects and total genomic effects based on Gauss-Seidel procedure
fit <- gsolve( y=y, X=X, W=W, lambda=lambda)
Perform Hardy Weinberg Equilibrium Test
Description
Perform Hardy Weinberg Equilibrium Test ££(HWE)
Usage
hwe(Glist = NULL)
Arguments
Glist |
is a list structure with information about genotypes stored on disk |
LD score regression
Description
The ldsc function is used for LDSC analysis
Usage
ldsc(
Glist = NULL,
ldscores = NULL,
sets = NULL,
method = "regression",
residual = FALSE,
z = NULL,
b = NULL,
seb = NULL,
af = NULL,
stat = NULL,
tol = 1e-08,
n = NULL,
intercept = TRUE,
what = "h2",
maxZ2 = NULL,
SE.h2 = FALSE,
SE.rg = FALSE,
blk = 200
)
Arguments
Glist |
list of information about genotype matrix stored on disk |
ldscores |
vector of LD scores (optional as LD scores are stored within Glist) |
sets |
Optional list specifying sets of SNPs for mapping. |
method |
the regression method to use, options include "regression", "bayesC", "bayesR". |
residual |
logical if TRUE then add a residual that capture the h2 not explained by the sets |
z |
matrix of z statistics for n traits |
b |
matrix of marker effects for n traits if z matrix not is given |
seb |
matrix of standard errors of marker effects for n traits if z matrix not is given |
af |
vector of allele frequencies |
stat |
dataframe with marker summary statistics |
tol |
smallest value for h2 |
n |
vector of sample sizes for the traits (element i corresponds to column vector i in z matrix) |
intercept |
logical if TRUE the LD score regression includes intercept |
what |
either computation of heritability (what="h2") or genetic correlation between traits (what="rg") |
maxZ2 |
maximum value for squared value of z-statistics |
SE.h2 |
logical if TRUE standard errors and significance for the heritability estimates are computed using a block jackknife approach |
SE.rg |
logical if TRUE standard errors and significance for the genetic correlations are computed using a block jackknife approach |
blk |
numeric size of the blocks used in the jackknife estimation of standard error (default = 200) |
Value
Returns a matrix of heritability estimates when what="h2", and if SE.h2=TRUE standard errors (SE) and significance levels (P) are returned. If what="rg" an n-by-n matrix of correlations is returned where the diagonal elements being h2 estimates. If SE.rg=TRUE a list is returned with n-by-n matrices of genetic correlations, estimated standard errors and significance levels.
Author(s)
Peter Soerensen
Palle Duun Rohde
Examples
# Plink bed/bim/fam files
#bedfiles <- system.file("extdata", paste0("sample_chr",1:2,".bed"), package = "qgg")
#bimfiles <- system.file("extdata", paste0("sample_chr",1:2,".bim"), package = "qgg")
#famfiles <- system.file("extdata", paste0("sample_chr",1:2,".fam"), package = "qgg")
#
## Summarize bed/bim/fam files
#Glist <- gprep(study="Example", bedfiles=bedfiles, bimfiles=bimfiles, famfiles=famfiles)
#
## Filter rsids based on MAF, missingness, HWE
#rsids <- gfilter(Glist = Glist, excludeMAF=0.05, excludeMISS=0.05, excludeHWE=1e-12)
#
## Compute sparse LD (msize=size of LD window)
##ldfiles <- system.file("extdata", paste0("sample_chr",1:2,".ld"), package = "qgg")
##Glist <- gprep(Glist, task="sparseld", msize=200, rsids=rsids, ldfiles=ldfiles, overwrite=TRUE)
#
#
##Simulate data
#W1 <- getG(Glist, chr=1, scale=TRUE)
#W2 <- getG(Glist, chr=2, scale=TRUE)
#W <- cbind(W1,W2)
#causal <- sample(1:ncol(W),5)
#b1 <- rnorm(length(causal))
#b2 <- rnorm(length(causal))
#y1 <- W[, causal]%*%b1 + rnorm(nrow(W))
#y2 <- W[, causal]%*%b2 + rnorm(nrow(W))
#data1 <- data.frame(y = y1, mu = 1)
#data2 <- data.frame(y = y2, mu = 1)
#X1 <- model.matrix(y ~ 0 + mu, data = data1)
#X2 <- model.matrix(y ~ 0 + mu, data = data2)
## Linear model analyses and single marker association test
#maLM1 <- lma(y=y1, X=X1,W = W)
#maLM2 <- lma(y=y2,X=X2,W = W)
#
## Compute heritability and genetic correlations for trait 1 and 2
#z1 <- maLM1[,"stat"]
#z2 <- maLM2[,"stat"]
#z <- cbind(z1=z1,z2=z2)
#h2 <- ldsc(Glist, z=z, n=c(500,500), what="h2")
#rg <- ldsc(Glist, z=z, n=c(500,500), what="rg")
Compute LD (Linkage Disequilibrium) Scores for a Given Chromosome.
Description
This function calculates LD scores for the specified chromosome(s) based on genotypic data provided in 'Glist'. The LD score quantifies the amount of Linkage Disequilibrium at a given SNP.
Usage
ldscore(
Glist = NULL,
chr = NULL,
onebased = TRUE,
nbytes = 4,
cm = NULL,
kb = NULL
)
Arguments
Glist |
A list structure with genotypic data stored, including positions ('pos'), map information ('map'), rsids for LD calculation ('rsidsLD'), and LD file locations ('ldfiles'). |
chr |
A single chromosome or a vector of chromosomes for which LD scores need to be computed. Default is NULL, implying all chromosomes in 'Glist' will be used. |
onebased |
Logical, if 'TRUE', the indexing of positions and other genomic information is 1-based. Default is 'TRUE'. |
nbytes |
The size (in bytes) of each numeric value to read from the binary LD files. Default is 4. |
cm |
The threshold in centiMorgans for filtering LD values. Default is NULL. |
kb |
The threshold in kilobases for filtering LD values. Default is NULL. If specified, it will be converted to base pairs internally. |
Details
The function computes the LD scores for each SNP by reading LD values from binary files stored in 'Glist$ldfiles'. It can filter SNPs based on physical distance ('kb') or genetic map distance ('cm'). If both 'cm' and 'kb' are NULL, all LD values are used in computation.
Value
A list containing computed LD scores for each chromosome in the input.
Bayesian Multi-marker Analysis of Genomic Annotation (Bayesian MAGMA)
Description
This function analyzes feature sets using MAGMA or Bayesian methods for association testing. It supports joint or marginal testing, as well as Bayesian linear regression using different priors ('bayesC', 'bayesR').
Usage
magma(
stat = NULL,
sets = NULL,
method = "magma",
type = "joint",
test = "one-sided",
pi = 0.001,
nit = 5000,
nburn = 1000
)
Arguments
stat |
A numeric vector or matrix of summary statistics, where rows represent features and columns represent phenotypes. |
sets |
A list of feature sets (e.g., genes, SNPs) to be analyzed. |
method |
A string specifying the method to use. Options are '"magma"', '"blr"', '"bayesC"', or '"bayesR"'. Default is '"magma"'. |
type |
A string specifying the type of analysis to perform. Options are '"joint"' (default) or '"marginal"'. Only used with 'method = "magma"'. |
test |
A string specifying the statistical test. Options are '"one-sided"' (default) or '"two-sided"'. Only used with 'method = "magma"'. |
pi |
A numeric value specifying the proportion of non-zero effects. Used for Bayesian methods. Default is '0.001'. |
nit |
An integer specifying the number of iterations for Bayesian methods. Default is '5000'. |
nburn |
An integer specifying the number of burn-in iterations for Bayesian methods. Default is '1000'. |
Details
The function uses either the MAGMA approach for set-based testing or Bayesian linear regression to estimate effect sizes and probabilities of association for feature sets. For Bayesian methods, a spike-and-slab prior is applied.
The 'stat' input must have row names corresponding to feature identifiers. The 'sets' input must be a named list, where each element corresponds to a feature set.
Value
A data frame or list with analysis results.
Map Sets to rsids
Description
This function maps sets to rsids. If a 'Glist' is provided, 'rsids' are extracted from the 'Glist'. It returns a list of matched RSIDs for each set.
Usage
mapSets(sets = NULL, rsids = NULL, Glist = NULL, index = TRUE)
Arguments
sets |
A list of character vectors where each vector represents a set of items. If the names of the sets are not provided, they are named as "Set1", "Set2", etc. |
rsids |
A character vector of RSIDs. If 'Glist' is provided, this parameter is ignored. |
Glist |
A list containing an element 'rsids' which is a character vector of RSIDs. |
index |
A logical. If 'TRUE' (default), it returns indices of RSIDs; otherwise, it returns the RSID names. |
Value
A list where each element represents a set and contains matched RSIDs or their indices.
Map marker summary statistics to Glist
Description
Quality control is a critical step for working with summary statistics (in particular for external). Processing and quality control of GWAS summary statistics includes:
- map marker ids (rsids/cpra (chr, pos, ref, alt)) to LD reference panel data
- check effect allele (flip EA, EAF, Effect)
- check effect allele frequency
- thresholds for MAF and HWE
- exclude INDELS, CG/AT and MHC region
- remove duplicated marker ids
- check which build version
- check for concordance between marker effect and LD data
Usage
mapStat(
Glist = NULL,
stat = NULL,
excludeMAF = 0.01,
excludeMAFDIFF = 0.05,
excludeINFO = 0.8,
excludeCGAT = TRUE,
excludeINDEL = TRUE,
excludeDUPS = TRUE,
excludeMHC = FALSE,
excludeMISS = 0.05,
excludeHWE = 1e-12
)
Arguments
Glist |
list of information about genotype matrix stored on disk |
stat |
dataframe with marker summary statistics |
excludeMAF |
exclude marker if minor allele frequency (MAF) is below threshold (0.01 is default) |
excludeMAFDIFF |
exclude marker if minor allele frequency difference (MAFDIFF) between Glist$af and stat$af is above threshold (0.05 is default) |
excludeINFO |
exclude marker if info score (INFO) is below threshold (0.8 is default) |
excludeCGAT |
exclude marker if alleles are ambigous (CG or AT) |
excludeINDEL |
exclude marker if it an insertion/deletion |
excludeDUPS |
exclude marker id if duplicated |
excludeMHC |
exclude marker if located in MHC region |
excludeMISS |
exclude marker if missingness (MISS) is above threshold (0.05 is default) |
excludeHWE |
exclude marker if p-value for Hardy Weinberg Equilibrium test is below threshold (0.01 is default) |
Author(s)
Peter Soerensen
Merge multiple GRMlist objects
Description
Merge multiple GRMlist objects each with information about a genomic rfelationship matrix stored on disk
Usage
mergeGRM(GRMlist = NULL)
Arguments
GRMlist |
list providing information about GRM matrix stored in binary files on disk |
Adjustment of marker effects using correlated trait information
Description
The 'mtadj' function uses selection index theory to determine the optimal weights across 'n' traits. These weights are then used to adjust marker effects by 'n' correlated traits. More details can be found [here](https://www.nature.com/articles/s41467-017-02769-6).
Usage
mtadj(
h2 = NULL,
rg = NULL,
stat = NULL,
b = NULL,
z = NULL,
n = NULL,
mtotal = NULL,
meff = 60000,
method = "ols",
statistics = "z"
)
Arguments
h2 |
A vector of heritability estimates. |
rg |
An n-by-n matrix of genetic correlations. |
stat |
A dataframe containing marker summary statistics. |
b |
A matrix of marker effects. |
z |
A matrix of z-scores. |
n |
A vector indicating the sample size used to estimate marker effects for each trait. |
mtotal |
Total number of markers. |
meff |
Effective number of uncorrelated genomic segments (default = 60,000). |
method |
Method to estimate marker effects. Can be "OLS" (ordinary least square, default) or "BLUP" (best linear unbiased prediction). |
statistics |
Specifies which kind of statistics ("b" or "z") should be used in the analysis. |
Value
A matrix of adjusted marker effects for each trait.
Author(s)
Palle Duun Rohde and Peter Soerensen
Examples
#bedfiles <- system.file("extdata", "sample_22.bed", package = "qgg")
#bimfiles <- system.file("extdata", "sample_22.bim", package = "qgg")
#famfiles <- system.file("extdata", "sample_22.fam", package = "qgg")
#Glist <- gprep(study="1000G", bedfiles=bedfiles, bimfiles=bimfiles,famfiles=famfiles)
#Glist <- gprep(Glist, task="sparseld", msize=200)
#
##Simulate data
#set.seed(23)
#
#W <- getG(Glist, chr=1, scale=TRUE)
#causal <- sample(1:ncol(W),50)
#set1 <- c(causal, sample(c(1:ncol(W))[-causal],10))
#set2 <- c(causal, sample(c(1:ncol(W))[-set1],10))
#
#b1 <- rnorm(length(set1))
#b2 <- rnorm(length(set2))
#y1 <- W[, set1]%*%b1 + rnorm(nrow(W))
#y2 <- W[, set2]%*%b2 + rnorm(nrow(W))
#
## Create model
#data1 <- data.frame(y = y1, mu = 1)
#data2 <- data.frame(y = y2, mu = 1)
#X1 <- model.matrix(y ~ 0 + mu, data = data1)
#X2 <- model.matrix(y ~ 0 + mu, data = data2)
#
## Linear model analyses and single marker association test
#maLM1 <- glma(y=y1, X=X1,W = W)
#maLM2 <- glma(y=y2,X=X2,W = W)
#
## Compute genetic parameters
#z1 <- maLM1[,"stat"]
#z2 <- maLM2[,"stat"]
#
#z <- cbind(z1=z1,z2=z2)
#
#h2 <- ldsc(Glist, z=z, n=c(500,500), what="h2")
#rg <- ldsc(Glist, z=z, n=c(500,500), what="rg")
#
## Adjust summary statistics using estimated genetic parameters
#b <- cbind(b1=maLM1[,"b"],b2=maLM2[,"b"])
#bm <- mtadj( h2=h2, rg=rg, b=b, n=c(500,500), method="ols")
Plot fit from gbayes
Description
Summary plots from gbayes fit
Usage
plotBayes(fit = NULL, causal = NULL, what = "bm", chr = 1)
Arguments
fit |
object from gbayes |
causal |
indices for "causal" markers |
what |
character fro what to plot (e.g. "trace", "bpm", "ve", "vg", "vb") |
chr |
chromosome to plot |
Forest plot
Description
This function generates a forest plot, which is commonly used to visualize effect sizes and their confidence intervals.
Usage
plotForest(
x = NULL,
sd = NULL,
cex = 1,
mar = NULL,
mai = NULL,
xlim = NULL,
pos = NULL,
reorder = TRUE,
xaxis = TRUE,
main = NULL,
xlab = "x"
)
Arguments
x |
A vector of point estimates or effect sizes. |
sd |
A vector of standard deviations corresponding to the values in 'x'. |
cex |
A numerical value indicating the amount by which plotting text and symbols should be scaled. Default is 1. |
mar |
A numerical vector of the form 'c(bottom, left, top, right)' which gives the number of lines of margin to be specified on the four sides of the plot. Default is 'c(5,12,3,1)'. |
mai |
A numerical vector indicating the margins in inches. |
xlim |
The x limits (x1, x2) of the plot. |
pos |
Position of y-axis labels. |
reorder |
A logical value. If 'TRUE', data points are reordered based on the values in 'x'. Default is 'TRUE'. |
xaxis |
A logical value. If 'TRUE', x-axis is drawn. Default is 'TRUE'. |
main |
An overall title for the plot. |
xlab |
A label for the x-axis. Default is "x". |
Plot LD Matrix
Description
Visualizes the linkage disequilibrium (LD) matrix using a color gradient. The function produces an image plot with custom color mapping.
Usage
plotLD(LD = NULL, cols = NULL)
Arguments
LD |
A matrix representing the LD values to be plotted. Each element should be a numeric value, typically between 0 and 1, representing the degree of LD. Rows and columns of the matrix should correspond to specific genetic markers (e.g., SNPs). |
cols |
A color palette to use for the plot. By default, it creates a blue gradient ranging from light blue ('#f0f3ff') to dark blue ('#0033BB'). |
Value
A plot visualizing the LD matrix. Row and column names of the LD matrix are used as labels on the x and y axes, respectively.
Plot Receiver Operating Curves
Description
Plot ROC
Usage
plotROC(roc.data = NULL, cols = NULL)
Arguments
roc.data |
data frame with ROC information (from computeROC) |
cols |
which columns should be used in the ROC plot |
Author(s)
Palle Duun Rohde
Bayesian Polygenic Prioritisation Scoring (Bayesian POPS)
Description
This function performs Polygenic Prioritisation Scoring (POPS) using Bayesian regression ('bayesC' or 'bayesR') or ridge regression ('rr'). It maps features to sets, performs optional feature selection based on p-value thresholds, and calculates predictive scores for prioritisation.
Usage
pops(
stat = NULL,
sets = NULL,
validate = NULL,
threshold = NULL,
method = "bayesC",
pi = 0.001,
nit = 5000,
nburn = 1000,
updateB = TRUE,
updateE = TRUE,
updatePi = TRUE,
updateG = TRUE
)
Arguments
stat |
A numeric vector or matrix of summary statistics (e.g., phenotypic values or effect sizes), where rows represent features (e.g., SNPs) and columns represent traits. Required. |
sets |
A list of feature sets (e.g., genes or SNP groups) to map to the rows of 'stat'. Required. |
validate |
An optional validation set. If provided, cross-validation results are returned instead of fitting the model. |
threshold |
A numeric value specifying a p-value threshold for feature selection. If provided, only features with p-values below this threshold are included in the model. |
method |
A string specifying the regression method. Options are '"bayesC"' (default), '"bayesR"', or '"rr"' (ridge regression). |
pi |
A numeric value specifying the proportion of non-zero effects for Bayesian methods. Default is '0.001'. |
nit |
An integer specifying the number of iterations for Bayesian methods. Default is '5000'. |
nburn |
An integer specifying the number of burn-in iterations for Bayesian methods. Default is '1000'. |
updateB |
A logical value indicating whether to update marker effects in Bayesian methods. Default is 'TRUE'. |
updateE |
A logical value indicating whether to update residual variances in Bayesian methods. Default is 'TRUE'. |
updatePi |
A logical value indicating whether to update the proportion of non-zero effects in Bayesian methods. Default is 'TRUE'. |
updateG |
A logical value indicating whether to update the genomic variances in Bayesian methods. Default is 'TRUE'. |
Value
A matrix of predicted prioritisation scores ('ypred') for each feature, ordered by their predictive values. If a validation set is provided, cross-validation results are returned instead.
Expected AUC for prediction of a binary trait using information on correlated binary trait
Description
Computes the expected Area Under the Curve (AUC) for predicting a binary trait using information on a correlated binary trait.
Usage
predict_auc_mt_cc(h2x, Nx, Kx, Px, h2y, Ny, Ky, Py, rg, Me, M)
Arguments
h2x |
Heritability of the target trait. |
Nx |
Number of samples for the target trait. |
Kx |
Prevalence of the target trait. |
Px |
Case-control proportion of the target trait. |
h2y |
Heritability of the correlated trait. |
Ny |
Number of samples for the correlated trait. |
Ky |
Prevalence of the correlated trait. |
Py |
Case-control proportion of the correlated trait. |
rg |
Genetic correlation between the target and the correlated trait. |
Me |
Number of independent chromosome segments. |
M |
Number of markers. |
Value
A numeric value representing the expected AUC.
Expected AUC for prediction of a binary trait using information on a correlated continuous trait
Description
Computes the expected Area Under the Curve (AUC) for predicting a binary trait using information from a correlated continuous trait.
Usage
predict_auc_mt_continuous(h2x, Nx, Kx, Px, h2y, Ny, rg, Me, M)
Arguments
h2x |
Heritability of the target trait. |
Nx |
Number of samples for the target trait. |
Kx |
Prevalence of the target trait. |
Px |
Case-control proportion of the target trait. |
h2y |
Heritability of the correlated trait. |
Ny |
Number of samples for the correlated trait. |
rg |
Genetic correlation between the target and correlated trait. |
Me |
Number of independent chromosome segments. |
M |
Number of markers. |
Value
A numeric value representing the expected AUC.
Expected AUC for prediction of a binary trait
Description
Computes the expected Area Under the Curve (AUC) for predicting a binary trait.
Usage
predict_auc_st(h2x, Nx, Kx, Px, Me, M)
Arguments
h2x |
Heritability of the target trait. |
Nx |
Number of samples for the target trait. |
Kx |
Prevalence of the target trait. |
Px |
Case-control proportion of the target trait. |
Me |
Number of independent chromosome segments. |
M |
Number of markers. |
Value
A numeric value representing the expected AUC.
Expected R2 for multiple trait prediction of continuous traits
Description
Computes the expected R2 value for the multiple trait prediction of continuous traits.
Usage
predict_r2_mt(h2x, Nx, h2y, Ny, rg, Me, M)
Arguments
h2x |
Heritability of the target trait. |
Nx |
Number of samples for the target trait. |
h2y |
Heritability of the correlated trait. |
Ny |
Number of samples for the correlated trait. |
rg |
Genetic correlation between the target and correlated trait. |
Me |
Number of independent chromosome segments. |
M |
Number of markers. |
Value
A numeric value representing the expected R2 for the multiple trait prediction.
Expected R2 for single trait prediction of a continuous trait
Description
Computes the expected R2 value for the single trait prediction of a continuous trait.
Usage
predict_r2_st(h2x, Nx, Me, M)
Arguments
h2x |
Heritability of the target trait. |
Nx |
Number of samples for the target trait. |
Me |
Number of independent chromosome segments. |
M |
Number of markers. |
Value
A numeric value representing the expected R2 for the single trait prediction.
Compute Nagelkerke R2
Description
Compute Nagelkerke R2 ££ perhaps add: r2nag?
Usage
rnag(yobs = NULL, ypred = NULL)
Arguments
yobs |
is a vector of observed phenotypes |
ypred |
is a vector of predicted phenotypes |
Split Vector with Overlapping Segments
Description
Splits a vector into segments of a specified length with a specified overlap. The function returns a list where each element contains a segment of the vector.
Usage
splitWithOverlap(vec, seg.length, overlap)
Arguments
vec |
A numeric or character vector to be split. |
seg.length |
An integer specifying the length of each segment. |
overlap |
An integer specifying the number of overlapping elements between consecutive segments. |
Value
A list where each element is a segment of the input vector. The segments can overlap based on the specified overlap.
Perform VEGAS Gene-Based Association Analysis
Description
This function performs VEGAS (Versatile Gene-based Association Study) to analyze gene-level associations using marker statistics and linkage disequilibrium (LD) structure from a reference panel.
Usage
vegas(
Glist = NULL,
sets = NULL,
stat = NULL,
p = NULL,
threshold = 1e-10,
tol = 1e-07,
minsize = 2,
verbose = FALSE
)
Arguments
Glist |
A list containing genomic information, such as LD matrices or genotype data. Required. |
sets |
A list of sets (e.g., genes with their associated markers) to analyze. Required. |
stat |
A data frame containing marker-level statistics. Must include 'rsids' (marker IDs) and 'p' (p-values). |
p |
A numeric matrix of p-values for markers across multiple studies. If provided, 'stat' should be NULL. |
threshold |
A numeric value specifying the lower bound for p-values to avoid numerical issues. Default is '1e-10'. |
tol |
A numeric value specifying the tolerance for eigenvalues in LD matrices. Default is '1e-7'. |
minsize |
An integer specifying the minimum number of markers required for a set to be analyzed. Default is '2'. |
verbose |
A logical value indicating whether to print progress messages. Default is 'FALSE'. |
Details
The function uses marker-level statistics to compute gene-level association statistics, accounting for LD structure among markers. The LD structure is retrieved from 'Glist', which should include precomputed LD matrices or genotype data for the markers.
Two modes are supported: - **'stat' Mode**: Uses marker statistics (e.g., p-values) from a single study to compute gene-level statistics. - **'p' Mode**: Uses marker p-values across multiple studies for meta-analysis of gene-level statistics.
Value
A data frame with the results
Write a subset of data from a BED file
Description
This function reads a BED file and writes a subset of it based on a list of SNP (rsids) identifiers to output BED, BIM, and FAM files.
Usage
writeBED(
bedRead = NULL,
bimRead = NULL,
famRead = NULL,
bedWrite = NULL,
bimWrite = NULL,
famWrite = NULL,
rsids = NULL,
endian = .Platform$endian,
useBytes = TRUE
)
Arguments
bedRead |
The full path to the input BED file to read. |
bimRead |
The full path to the input BIM file to read. |
famRead |
The full path to the input FAM file to read. |
bedWrite |
The full path to the output BED file to write. |
bimWrite |
The full path to the output BIM file to write. |
famWrite |
The full path to the output FAM file to write. |
rsids |
A character vector containing SNP rsids to select from the BIM file. |
Value
No return value. Files are written to the specified output paths.