Title: Estimate Sample Size for Population Genomic Studies
Version: 1.0.2
Description: Estimate sample sizes needed to capture target levels of genetic diversity from a population (multivariate allele frequencies) for applications like germplasm conservation and breeding efforts. Compares bootstrap samples to a full population using linear regression, employing the R-squared value to represent the proportion of diversity captured. Iteratively increases sample size until a user-defined target R-squared is met. Offers a parallelized R implementation of a previously developed 'python' method. All ploidy levels are supported. For more details, see Sandercock et al. (2024) <doi:10.1073/pnas.2403505121>.
License: Apache License (≥ 2)
URL: https://github.com/alex-sandercock/castgen
BugReports: https://github.com/alex-sandercock/castgen/issues
Imports: doParallel (≥ 1.0.17), dplyr (≥ 1.1.2), foreach (≥ 1.5.2), parallel (≥ 4.0.0), Rdpack (≥ 0.7), stats, utils, vcfR (≥ 1.15.0)
Suggests: testthat (≥ 3.0.0)
Encoding: UTF-8
RoxygenNote: 7.3.2
NeedsCompilation: no
Packaged: 2025-04-01 19:42:07 UTC; ams866
Author: Alexander M. Sandercock ORCID iD [aut, cre]
Maintainer: Alexander M. Sandercock <ams866@cornell.edu>
Repository: CRAN
Date/Publication: 2025-04-02 17:40:05 UTC

Estimate Minimum Number of Individuals to Sample to Capture Population Genomic Diversity (Genotype Matrix)

Description

This function can be used to estimate the number of individuals to sample from a population in order to capture a desired percentage of the genomic diversity. It assumes that the samples are the columns, and the genomic markers are in rows. Missing data should be set as NA, which will then be ignored for the calculations. All samples must have the same ploidy. This function was adapted from a previously developed Python method (Sandercock et al., 2023) (https://github.com/alex-sandercock/Capturing_genomic_diversity/)

Usage

capture_diversity.Gmat(
  df,
  ploidy,
  r2_threshold = 0.9,
  iterations = 10,
  sample_list = NULL,
  parallel = FALSE,
  batch = 1,
  save.result = FALSE,
  verbose = TRUE
)

Arguments

df

Genotype matrix or data.frame with the numeric count of alternate alleles (0=homozygous reference, 1 = heterozygous, 2 = homozygous alternate)

ploidy

The ploidy of the species being analyzed

r2_threshold

The ratio of diversity to capture (default = 0.9)

iterations

The number of iterations to perform to estimate the average result (default = 10)

sample_list

The list of samples to subset from the dataset (optional)

parallel

Run the analysis in parallel (True/False) (default = FALSE)

batch

The number of samples to draw in each bootstrap sample iteration (default = 1)

save.result

Save the results to a .txt file? (default = FALSE)

verbose

Print out the results to the console (default = TRUE)

Value

A data.frame with minimum number of samples required to match or exceed the input ratio

References

A.M. Sandercock, J.W. Westbrook, Q. Zhang, & J.A. Holliday, A genome-guided strategy for climate resilience in American chestnut restoration populations, Proc. Natl. Acad. Sci. U.S.A. 121 (30) e2403505121, https://doi.org/10.1073/pnas.2403505121 (2024).

Examples


#Example with a tetraploid population
set.seed(123)
test_gmat <- matrix(sample(0:4, 100, replace = TRUE), nrow = 10)
colnames(test_gmat) <- paste0("Sample", 1:10)
rownames(test_gmat) <- paste0("Marker", 1:10)
test_gmat <- as.data.frame(test_gmat)

#Estimate the number of samples required to capture 90% of the population's genomic diversity
result <- capture_diversity.Gmat(test_gmat,
                                 ploidy = 4,
                                 r2_threshold = 0.90,
                                 iterations = 10,
                                 save.result = FALSE,
                                 parallel=FALSE,
                                 verbose=FALSE)

#View results
print(result)


Estimate Minimum Number of Individuals to Sample to Capture Population Genomic Diversity (VCF)

Description

This function can be used to estimate the number of individuals to sample from a population in order to capture a desired percentage of the genomic diversity. VCF files can be either unzipped or gzipped. All samples must have the same ploidy and the VCF must contain GT information. This function was adapted from a previously developed Python method (Sandercock et al., 2024) (https://github.com/alex-sandercock/Capturing_genomic_diversity/)

Usage

capture_diversity.VCF(
  vcf,
  ploidy,
  r2_threshold = 0.9,
  iterations = 10,
  sample_list = NULL,
  parallel = FALSE,
  batch = 1,
  save.result = TRUE,
  verbose = TRUE
)

Arguments

vcf

Path to VCF file (.vcf or .vcf.gz) with genotype information

ploidy

The ploidy of the species being analyzed

r2_threshold

The ratio of diversity to capture (default = 0.9)

iterations

The number of iterations to perform to estimate the average result (default = 10)

sample_list

The list of samples to subset from the dataset (optional)

parallel

Run the analysis in parallel (True/False) (default = FALSE)

batch

The number of samples to draw in each bootstrap sample iteration (default = 1)

save.result

Save the results to a .txt file? (default = TRUE)

verbose

Print out the results to the console (default = TRUE)

Value

A data.frame with minimum number of samples required to match or exceed the input ratio

References

A.M. Sandercock, J.W. Westbrook, Q. Zhang, & J.A. Holliday, A genome-guided strategy for climate resilience in American chestnut restoration populations, Proc. Natl. Acad. Sci. U.S.A. 121 (30) e2403505121, https://doi.org/10.1073/pnas.2403505121 (2024).

Examples

#Example with a diploid vcf

# Example vcf
vcf_file <- system.file("diploid_example.vcf.gz", package = "castgen")

#Estimate the number of samples required to capture 95% of the population's genomic diversity
result <- capture_diversity.VCF(vcf_file,
                                 ploidy = 2,
                                 r2_threshold = 0.95,
                                 iterations = 10,
                                 save.result = FALSE,
                                 parallel=FALSE,
                                 verbose=FALSE)

#View results
print(result)