Help for package permubiome

Type:

Package

Title:

A Permutation Based Test for Biomarker Discovery in Microbiome Data

Version:

1.3.2

Date:

2023-09-19

Author:

Alfonso Benitez-Paez

Maintainer:

Alfonso Benitez-Paez <alfbenpa@gmail.com>

Description:

The permubiome R package was created to perform a permutation-based non-parametric analysis on microbiome data for biomarker discovery aims. This test executes thousands of comparisons in a pairwise manner, after a random shuffling of data into the different groups of study with a prior selection of the microbiome features with the largest variation among groups. Previous to the permutation test itself, data can be normalized according to different methods proposed to handle microbiome data ('proportions' or 'Anders'). The median-based differences between groups resulting from the multiple simulations are fitted to a normal distribution with the aim to calculate their significance. A multiple testing correction based on Benjamini-Hochberg method (fdr) is finally applied to extract the differentially presented features between groups of your dataset. LATEST UPDATES: v1.1 and olders incorporates function to parse COLUMN format; v1.2 and olders incorporates -optimize- function to maximize evaluation of features with largest inter-class variation; v1.3 and olders includes the -size.effect- function to perform estimation statistics using the bootstrap-coupled approach implemented in the 'dabestr' (>=0.3.0) R package. Current v1.3.2 fixed bug with "Class" recognition and updated 'dabestr' functions.

License:

GPL-3

Imports:

ggplot2, rlang, dabestr, gridExtra, Matrix, dplyr

NeedsCompilation:

Packaged:

2023-10-16 12:44:54 UTC; alfonso

Repository:

CRAN

Date/Publication:

2023-10-16 13:10:02 UTC

Parsing the data file.

Description

This function prompts for the file contained all the data needed to process. You only have to execute this function in the working directory where your file is stored properly formatted as requested.

The input file is a tab-delimited text matrix as follows:

Sample     Class      feature(1)      feature(2)      feature(n)      ...
sampleA    classX     counts(A1)      counts(A2)      counts(An)      ...
sampleB    classY     counts(B1)      counts(B2)      counts(Bn)      ...
sampleC    classX     counts(C1)      counts(C2)      counts(Cn)      ...
sampleD    classY     counts(D1)      counts(D2)      counts(Dn)      ...

From the version 1.1 on you will be able to load your data as COLUMN format, just adding the "Class" information in the second row as follows:

Sample        sampleA        sampleB        sampleC        sampleD        ...
Class         classX         classY         classX         classY         ...
feature(1)    counts(A1)     counts(B1)     counts(C1)     counts(D1)     ...
feature(2)    counts(A2)     counts(B2)     counts(C2)     counts(D2)     ...
feature(3)    counts(A3)     counts(B3)     counts(C3)     counts(D3)     ...
feature(4)    counts(A4)     counts(B4)     counts(C4)     counts(D4)     ...
feature(n)    counts(An)     counts(Bn)     counts(Cn)     counts(Dn)     ...

Usage

get.data()

Author(s)

Alfonso Benitez-Paez

References

Benitez-Paez A. 2023. Permubiome: an R package to perform permutation based test for biomarker discovery in microbiome analyses. [https://cran.r-project.org]. Benitez-Paez A, et al. mSystems. 2020;5:e00857-19. doi: 10.1128/mSystems.00857-19.

Examples

## The function is currently defined as
function () 
{
    DATA <- readline("Type the name of your data set : ")
    if (substr(DATA, 1, 1) == ""){
        tb <- read.table(system.file("extdat", "DATA_2", package = "permubiome"), 
            header = T, sep = "\t")
        print(paste("As you declare no input file, the permubiome test data was loaded"))
        save(tb, DATA, file = "permubiome.RData")
    }
    else {
        FORMAT <- readline("Type the format of your data set (PERMUBIOME or COLUMN): ")
        if (FORMAT == "PERMUBIOME") {
            tb <- read.table(DATA, header = T, sep = "\t")
            save(tb, FORMAT, file = "permubiome.RData")
        }
        else {
            biom <- read.table(DATA, sep = "\t")
            tb <- t(biom)
            colnames(tb) <- tb[1, ]
            rownames(tb) <- NULL
            tb = tb[-1, ]
            labels <- colnames(tb)
            tb <- as.data.frame(tb)
            for (i in 3:length(labels)) {
                tb[, i] <- as.numeric(as.character(tb[, i]))
            }
            save(tb, file = "permubiome.RData")
        }
    }
    print("opening DATA")
    (load("permubiome.RData"))
    df <- as.data.frame(tb)
    classes <- levels(as.factor(df$Class))
    samples <- nrow(df)
    print(paste("Your data file contains:", samples, "samples"))
    print(paste("The classes in your data file are:", classes[1], 
        "and", classes[2]))
    print(paste("The number of different categories to compare are:", 
        (ncol(tb) - 2)))
    save(tb, FORMAT, DATA, df, REFERENCE, classes, file = "permubiome.RData")
  }

Normalize the microbiome dataset prior to perform the permutation test.

Description

A critical aspect when working with microbiome data is to achieve a proper normalization to the retrieved counts, thus overpassing the variability in terms of sequencing efforts or coverage. There are several ways to do normalization, and we have implemented two well-known methods whose choice will depend on the research question investigated and the researcher's preference. Optionally, if you don't feel comfortable with normalization methods implemented in this package or if your data are already normalized, you have the option of performing no normalization on your data (method=0).

Usage

normalize(prevalence = 0.3, method = 1)

Arguments

prevalence

This controls the prevalence of microbiome features across samples in order to keep those with higher occurrence in the cohort of samples under survey. If you have 20 samples and declare a prevalence = 0.3 (default), the algorithm will remove those categories with less prevalence than 6 samples. Although the permutation test deals fairly well with zeros, we recommend setting a restricted value in order to improve the statistics for the biomarker discovery (i.e prevalence => 0.3).

method

Describes the normalization method to be used. We implemented two different strategies to normalize the microbiome data: (1) corresponds to the relative proportion of counts to the features. After retrieving the relative abundance for every feature in very sample the normalization process generate the number of reads corresponding to the features per million reads; (2) corresponds with normalization method described by Anders & Huber (2010), which uses a size factor to correct differences in sequencing coverage. If the user decides not to perform normalization, it must declare method = 0.

Author(s)

Alfonso Benitez-Paez

References

Examples

## The function is currently defined as
function (prevalence = 0.3, method = 1) 
{
    load("permubiome.RData")
    df_norm <- df
    if (method == 1) {
        y <- array(, nrow(df_norm))
        for (j in 1:nrow(df_norm)) {
            y[j] <- sum(df_norm[j, 3:ncol(df_norm)])
        }
        for (l in 3:ncol(df_norm)) {
            for (m in 1:nrow(df_norm)) {
                df_norm[m, l] <- round((df_norm[m, l]/y[m]) * 
                  1e+06, digits = 0)
            }
        }
        for (i in ncol(df_norm):3) {
            if (sum(df_norm[, i] == "0") >= (nrow(df_norm) * 
                1 - prevalence)) {
                df_norm[, i] <- NULL
            }
        }
    }
    else if (method == 2) {
        for (i in ncol(df_norm):3) {
            if (sum(df_norm[, i] == 0) >= (nrow(df_norm) * 1 - prevalence)) {
                df_norm[, i] <- NULL
            }
        }
        sfactor_matrix <- matrix(, ncol = ncol(df_norm) - 2, 
            nrow = nrow(df_norm))
        y <- array(, nrow(df_norm))
        for (m in 1:nrow(df_norm)) {
            for (l in 3:ncol(df_norm)) {
                sfactor_matrix[m, l - 2] <- signif((df_norm[m, 
                  l]/mean(df_norm[, l])), digits = 3)
            }
            y[m] <- median(sfactor_matrix[m, 1:ncol(sfactor_matrix)])
        }
        for (a in 3:ncol(df_norm)) {
            for (b in 1:nrow(df_norm)) {
                df_norm[b, a] <- round((df_norm[b, a] * y[b]), 
                  digits = 0)
            }
        }
    }
    else if (method == 0) {
        head(df_norm)
        print(paste("Your dataset was not normalized according to method option: 0"))
    }
    else {
        print(paste("Select and appropiate method for normalization: 1 ('proportions'), 
        2 ('anders'), or 0 ('none')"))
    }
    print(paste("Your normalized data now contains:", ncol(df_norm) - 
        2, "normalize categories ready to analize"))
    save(df_norm, file = "permubiome.RData")
  }

Optimization for detection of features with larger variation between classes

Description

This function is the previous step to the permutation test and it optimizes the detection of features differentially distributed between classes. The intra- and inter-classes distances are calculated with log-transformed data and then a ratio test is done to maximize the variation between classes. This pre-processing penalizes the microbiome features with a larger intra-class and lower inter-class variation, which would interfere with the statistical estimations to executed during the permutation test.

Usage

optimize()

Author(s)

Alfonso Benitez-Paez

References

Examples

## The function is currently defined as
function () 
{
    load("permubiome.RData")
    df_norm <- df_norm
    REFERENCE <- REFERENCE
    classes <- levels(as.factor(df_norm$Class))
    if (REFERENCE == "") {
        REFERENCE <- classes[1]
    }
    else if (REFERENCE == classes[2]) {
        classes[2] <- classes[1]
        classes[1] <- REFERENCE
    }
    df_norm$Class <- factor(df$Class, levels = (c(classes[1], 
        classes[2])))
    group1 <- subset(df_norm, Class == classes[1])
    group2 <- subset(df_norm, Class == classes[2])
    categories <- colnames(df_norm)
    distance_matrix <- matrix(, nrow = ncol(df_norm) - 2, ncol = 7, 
        byrow = T)
    colnames(distance_matrix) <- c("Category", paste("SumDist ", 
        "[", classes[1], "]", sep = ""), paste("SumDist ", "[", 
        classes[2], "]", sep = ""), "ClassDist", "RatioDist", 
        "Delta-Log", "Z-score")
    for (i in 3:(ncol(group1))) {
        mydata1 <- group1[, i]
        sumdist1 <- log10(sum(abs(apply(combn(mydata1, 2), 2, 
            diff))))
        distance_matrix[(i - 2), 1] <- categories[i]
        distance_matrix[(i - 2), 2] <- sumdist1
    }
    for (j in 3:(ncol(group2))) {
        mydata2 <- group2[, j]
        sumdist2 <- log10(sum(abs(apply(combn(mydata2, 2), 2, 
            diff))))
        distance_matrix[(j - 2), 3] <- sumdist2
    }
    classes_matrix <- matrix(, nrow = ncol(group1) - 2, ncol = (nrow(group1) * 
        nrow(group2)), byrow = T)
    rownames(classes_matrix) <- colnames(group1[3:ncol(group1)])
    features <- colnames(group1)
    for (k in 3:(ncol(group1))) {
        classdist <- vector()
        for (l in 1:nrow(group1[k])) {
            classdist_tmp <- as.list(abs(group2[k] - group1[l, 
                k]))
            classdist <- c(classdist, classdist_tmp[[features[k]]])
        }
        classes_matrix[(k - 2), ] <- classdist
    }
    inter.class.dist <- as.list(rowSums(classes_matrix))
    for (m in 3:(ncol(group1))) {
        distance_matrix[(m - 2), 4] <- log10(inter.class.dist[[features[m]]])
    }
    distance_matrix[, 5] <- as.numeric(distance_matrix[, 4])/((as.numeric(distance_matrix[, 
        3])/as.numeric(distance_matrix[, 2])))
    distance_matrix[, 6] <- abs(as.numeric(distance_matrix[, 
        5]) - as.numeric(distance_matrix[, 4]))
    distance_matrix[, 7] <- (as.numeric(distance_matrix[, 6]) - 
        mean(as.numeric(distance_matrix[, 6])))/sd(as.numeric(distance_matrix[, 
        6]))
    selected_features <- subset(distance_matrix, abs(as.numeric(distance_matrix[, 
        6])) > quantile(as.numeric(distance_matrix[,6]),0.96))
    save(df, df_norm, REFERENCE, classes, distance_matrix, selected_features, 
        file = "permubiome.RData")
  }

Permutation-based non-parametric analysis to infer differential abundance of features between groups.

Description

This function performs multiple simulations for every feature present in your dataset. All observations are randomly distributed between groups and the median's differences are calculated for all simulations. Differences calculated from simulations are fitted to the normal distribution, Z-scores are obtained, and the respective probability to reject the null hypothesis is then calculated. A multiple testing correction based on Benjamini-Hochberg method is done to uncover the biomarkers associated with your dataset classes. FDR threshold for differential abundance can be set at 0.1.

Usage

permutation(nperm = 1000, write.output = TRUE)

Arguments

nperm

The number of permutations to be executed during the analysis (1000 as default). The higher the number of permutations, the more precise will be the p-value returned and the function becomes more time-consuming. We recommend to use nperm = 1000 as the minimum.

write.output

When TRUE (as default), a sorted output file is generated and stored in the working directory. Control for the number of features to be present in the output is allowed with the "all" or "selected" parameters prompted.

Author(s)

Alfonso Benitez-Paez

References

Examples

## The function is currently defined as
function (nperm = 1000, write.output = TRUE) 
{
    Class <- NULL
    load("permubiome.RData")
    df_norm <- df_norm
    selected_features <- selected_features
    tags.in <- selected_features[, 1]
    tags.out <- setdiff(colnames(df_norm[3:ncol(df_norm)]), tags.in)
    for (a in ncol(df_norm):3) {
        if ((colnames(df_norm[a]) %in% tags.out)) {
            df_norm[, a] <- NULL
        }
    }
    classes <- levels(df_norm$Class)
    if (REFERENCE == "") {
        REFERENCE <- classes[1]
    }
    else if (REFERENCE == classes[2]) {
        classes[2] <- classes[1]
        classes[1] <- REFERENCE
    }
    df_norm$Class <- factor(df$Class, levels = (c(classes[1], 
        classes[2])))
    group1 <- subset(df_norm, Class == classes[1])
    group2 <- subset(df_norm, Class == classes[2])
    categories <- colnames(df_norm)
    size1 <- nrow(group1)
    size2 <- nrow(group2)
    size <- size1 + size2
    pvalue_matrix <- matrix(, nrow = ncol(df_norm) - 2, ncol = 5, 
        byrow = T)
    colnames(pvalue_matrix) <- c("Category", paste("Median ", 
        classes[1], sep = ""), paste("Median ", classes[2], sep = ""), 
        "p.value", "p.adj (fdr)")
    print(paste("Permutation test in progress - This can take some seconds or minutes!"))
    for (i in 3:(ncol(df_norm))) {
        category <- categories[i]
        diff <- median(group1[, i]) - median(group2[, i])
        x <- c(group1[, i], group2[, i])
        y <- array(, nperm)
        for (j in 1:nperm) {
            set <- sample(size, size2, replace = FALSE)
            diff_iter <- median(x[set]) - median(x[-set])
            y[j] <- diff_iter
            ref_score <- (diff - mean(y))/sd(y)
        }
        if (ref_score >= 0) {
            pvalue.i <- pnorm(ref_score, lower.tail = F)
        }
        else {
            pvalue.i <- pnorm(ref_score)
        }
        if (pvalue.i != 0) {
            pvalue_matrix[(i - 2), 1] <- category
        }
        if (pvalue.i != 0) {
            pvalue_matrix[(i - 2), 2] <- round(median(group1[, 
                i]), digits = 0)
        }
        if (pvalue.i != 0) {
            pvalue_matrix[(i - 2), 3] <- round(median(group2[, 
                i]), digits = 0)
        }
        if (pvalue.i != 0) {
            pvalue_matrix[(i - 2), 4] <- format((pvalue.i * 2), 
                digits = 7, scientific = F)
        }
        else {
            pvalue_matrix[(i - 2), 2] <- 1
        }
        pb = txtProgressBar(min = 0, max = (ncol(df_norm) - 2), 
            initial = 0, style = 3)
        setTxtProgressBar(pb, (i - 2))
        invisible()
    }
    pvalue_matrix <- pvalue_matrix[order(pvalue_matrix[, 4]), 
        ]
    pvalue_matrix[, 5] <- format(p.adjust(as.numeric(pvalue_matrix[, 
        4], n = nrow(pvalue_matrix), method = "fdr")), digits = 7, 
        scientific = F)
    cat("\n")
    if (write.output == TRUE) {
        all <- readline("Do you want to include all fetures in the output? (yes/no) : ")
        if (substr(all, 1, 1) == "n") {
            select <- as.numeric(readline("Level of significance to output features (i.e. 0.2) : "))
            significant <- subset(pvalue_matrix, pvalue_matrix[, 
                5] <= select)
        }
        else {
            significant <- pvalue_matrix
        }
        write.table(significant, file = "permutation.output", 
            quote = F, row.names = F, sep = "\t")
        print(paste("Permutation test done and output table printed!"))
    }
    else {
        significant <- pvalue_matrix
        significant
        print(paste("Permutation test done!"))
    }
    save(df, df_norm, REFERENCE, classes, selected_features, 
        nperm, tags.in, tags.out, pvalue_matrix, file = "permubiome.RData")
  }

Plotting the features with differential abundance.

Description

Option to plot individually all features found to be differentially presented in the classes of your dataset.

Usage

plots()

Details

When executed, the name of the feature as well as the different output options will be prompted.

Author(s)

Alfonso Benitez-Paez

References

Examples

## The function is currently defined as
function () 
{
    Class <- NULL
    non_zero <- NULL
    Occurring <- NULL
    prevalence <- NULL
    ref<-NULL
    loadNamespace("ggplot2")
    loadNamespace("Matrix")
    load("permubiome.RData")
    df_norm <- df_norm
    category <- readline("Type the category you want to plot : ")
    if (category == "") {
        category <- colnames(df_norm[3])
        print(paste("As you declared no categories, the very first one of your 
        dataset is plotted!"))
    }
    df_to_plot <- df_norm[, c("Sample", "Class", category)]
    classes <- levels(df_to_plot$Class)
    if (REFERENCE == "") {
        REFERENCE <- classes[1]
    }
    else if (REFERENCE == classes[2]) {
        classes[2] <- classes[1]
        classes[1] <- REFERENCE
    }
    df_to_plot$ref <- factor(df_to_plot$Class, levels = (c(classes[1], 
        classes[2])))
    p1 <- (ggplot(df_to_plot, aes(ref, df_to_plot[, 
        category], fill = Class), environment = environment())) + 
        geom_boxplot(notch = F, outlier.colour = "blue", outlier.shape = 1, 
            outlier.size = 3) + geom_point(colour = "#000000", 
        size = 2.5, pch = 19) + scale_fill_manual(values = c("#E41A1C", 
        "#377EB8")) + ggtitle(category) + guides(fill = FALSE) + 
        theme(plot.title = element_text(size = 24, face = "bold")) + 
        ylab("Normalized read proportion") + 
        xlab("Classes") + theme(axis.text = element_text(size = 12), 
        axis.title = element_text(size = 16, face = "bold")) + 
        coord_flip() + theme(plot.margin = unit(c(0.25, 
        0.25, 0.25, 0.25), "cm"))
    non_zero <- as.data.frame((tapply(df_to_plot[[category]], 
        df_to_plot$ref, nnzero)))
    total <- as.data.frame((tapply(df_to_plot[[category]], df_to_plot$ref, 
        length)))
    prevalence_table <- data.frame(names = factor(c(classes[1], 
        classes[2]), levels = c(classes[1], classes[2])), Occurring = c(non_zero[, 
        1]), Subjects = c(total[, 1]))
    prevalence_table$prevalence <- (prevalence_table$Occurring/prevalence_table$Subjects) * 
        100
    p2 <- ggplot(prevalence_table, aes(x = names, y = prevalence, fill = names, 
        width = 0.75)) + geom_bar(stat = "identity", colour = "grey20") + 
        scale_fill_manual(values = c("#E41A1C", "#377EB8")) + 
        coord_flip() + guides(fill = FALSE) + ggtitle(paste("", 
        "", sep = " ")) + theme(plot.title = element_text(size = 24, 
        face = "bold")) + ylab("Prevalence (percentage)") + theme(axis.title.y = element_blank(), 
        axis.text.y = element_blank(), axis.text.x = element_text(size = 12), 
        axis.title = element_text(size = 16, face = "bold")) + 
        theme(plot.margin = unit(c(0.25, 0.25, 0.25, 0.25), "cm")) + 
        ylim(0, 100)
    output <- readline("Do you want an output file (yes/no)? : ")
    if (substr(output, 1, 1) == "y") {
        extension <- readline("What extension do you prefer for the output plot 
        (ps, pdf, jpeg, tiff, png, bmp )? : ")
        tiff(filename = paste(category, extension, sep = "."), 
            width = 1000, height = 200, res = 100, units = "px")
        grid.arrange(p1, p2, ncol = 2)
        dev.off()
    }
    else {
        print(grid.arrange(p1, p2, ncol = 2))
    }
    save(df, df_norm, df_to_plot, REFERENCE, classes, file = "permubiome.RData")
  }

Executing estimation statistics based on bootstrap-coupled approach

Description

Assessing the size effect on selected microbiome features found to be differentially abundant between classes. This analysis is based on the Data Analysis using Bootstrap-Coupled Estimation (dabestr) R package and gives you the option to create Gardner-Altman estimation plots individually all features found to be differentially presented in your dataset.

Usage

size.effect(category = "", replicates = 5000, 
paired = FALSE, plot.file = "tiff", id.pairs = NULL)

Arguments

category

Name of the microbiome feature, which differential abundance between classes will be further explored.

replicates

The number of bootstrap resamples that have to be generated. Integer, default 5000.

paired

If TRUE, the two groups are treated as paired samples, please add an extra column (id.pairs) to parse identity of the datapoint. Default FALSE, the control_group group is treated as pre-intervention and the test_group group is considered post-intervention.

plot.file

Extension for plot graphics (ps, pdf, jpeg, tiff, png, bmp). Default "tiff".

id.pairs

Column name for information to parse identity of the datapoint in case of paired data.

Details

Be careful to type the "category" correctly to be analyzed in order to that matches with the table contained information.

Author(s)

Alfonso Benitez-Paez

References

Examples

## The function is currently defined as
function (category = "", replicates = 5000, 
    paired = FALSE, plot.file = "tiff", id.pairs = NULL)
{
    Class <- NULL
    ref <- NULL
    loadNamespace("dabestr")
    loadNamespace("rlang")
    loadNamespace("dplyr")
    load("permubiome.RData")
    df_norm <- df_norm
    if (paired == TRUE) {
        print(paste("You declared paired data, be sure to include the correct -id.column- argument 
	to parse the identity of the datapoint!"))
    }
    classes <- levels(df_norm$Class)
    if (REFERENCE == "") {
        REFERENCE <- classes[1]
    }
    else if (REFERENCE == classes[2]) {
        classes[2] <- classes[1]
        classes[1] <- REFERENCE
    }
    df_norm<-tibble(df_norm)
    prepare.stats <- load(df_norm, Class, category, paired = paired, 
        idx = c(classes[1], classes[2]), id_col = id.pairs)
    prepare.stats$y<-quo_set_expr(prepare.stats$y, as.symbol(category))
    print(prepare.stats)
    if (category == "") {
        category <- colnames(df_norm[3])
        print(paste("As you declared no categories, the very first one of your dataset will be 
	processed!"))
    }
    estimation.stats<-median_diff(prepare.stats, perm_count = replicates)
    e_plot <- plot(estimation.stats, group.summaries = "median_quartiles", 
        palette = "Set1", rawplot.ylabel = paste(category, "normalized reads", 
            sep = " "), tick.fontsize = 12, axes.title.fontsize = 18)
    tiff(filename = paste(category, "estimation", plot.file, sep = "."), 
        width = 650, height = 600, res = 100, units = "px")
    e_plot
    dev.off()
    print(e_plot)
    save(df, df_norm, REFERENCE, classes, file = "permubiome.RData")
  }