This section covers basic univariate tests for two-group comparison, covering t-test, Wilcoxon test, and multiple testing.
The following example compares the abundance of a selected bug between two conditions. Let us assume that the data is already properly normalized.
Let us load example data
library(microbiome)
data(dietswap)
<- dietswap
d
# Pick microbial abundances for a given taxonomic group
<- "Dialister"
taxa
# Construct a data.frame with the selected
# taxonomic group and grouping
<- data.frame(Abundance = abundances(d)[taxa,],
df Group = meta(d)$nationality)
Compare the groups visually using a boxplot (left). However, we observe that the abundances are in absolute scale and therefore the comparison is not clear. Let us try the log10 transformation. Now, the data contains many zeros and taking log10 will yield infinite values. Hence we choose the commonly used, although somewhat problematic, log10(1+x) transformation (right).
<- ggplot(df, aes(x = Group, y = Abundance)) +
p1 geom_boxplot() +
labs(title = "Absolute abundances", y = "Abundance\n (read count)")+ theme(plot.title = element_text(size=18))
# Let us add the log10(1+x) version:
$Log10_Abundance <- log10(1 + df$Abundance)
df<- ggplot(df, aes(x = Group, y = Log10_Abundance)) +
p2 geom_boxplot() +
labs(title = "Log10 abundances", y = "Abundance\n (log10(1+x) read count)")+ theme(plot.title = element_text(size=18))
library(patchwork)
print(p1 + p2)
The groups seem to differ. Let us test the difference statistically. First, let us perform t-test, which is based on Gaussian assumptions. Each group is expected to follow Gaussian distribution.
Significance p-value with t-test:
print(t.test(Log10_Abundance ~ Group, data = df)$p.value)
## [1] 0.02554997
Now let us investigate the Gaussian assumption in more detail. Boxplots may not show deviations from Gaussian assumptions very clearly Let us try another visualization; the density plot.
<- ggplot(df, aes(fill = Group, x = Log10_Abundance)) +
p geom_density(alpha = 0.5)
print(p)
Apparently, the data is not Gaussian distributed. In such cases, a common procedure is to use non-parametric tests. These do not make assumptions of the data distribution but instead compare the ordering of the samples.
So, let us look at the significance p-value with Wilcoxon test (log10 data):
print(wilcox.test(Log10_Abundance ~ Group, data = df)$p.value)
## [1] 0.02979053
But since the test is non-parametric, we can as well use the original absolute abundances since the log transformation does not change sample ordering on which the Wilcoxon test is based.
Let us verify that the absolute abundances yield the same p-value for Wilcoxon test:
print(wilcox.test(Abundance ~ Group, data = df)$p.value)
## [1] 0.02979053
Let us compare how much the results would differ in the whole data between t-test and Wilcoxon test. To remove non-varying taxa that would demand extra scripting, let us for demonstration purposes now focus on core taxa that are observed in more than 20% of the samples with more than 3 reads.
# Core taxa to be tested
<- core_members(d, prevalence = 20/100, detection = 3)
test.taxa
# Calculate p-values with the two different methods for each taxonomic unit
<- c()
pvalue.ttest <- c()
pvalue.wilcoxon for (taxa in test.taxa) {
# Create a new data frame for each taxonomic group
<- data.frame(Abundance = abundances(d)[taxa,],
df Log10_Abundance = log10(1 + abundances(d)[taxa,]),
Group = meta(d)$nationality)
<- t.test(Log10_Abundance ~ Group, data = df)$p.value
pvalue.ttest[[taxa]] <- wilcox.test(Abundance ~ Group, data = df)$p.value
pvalue.wilcoxon[[taxa]]
}# Arrange the results in a data.frame
<- data.frame(taxon = test.taxa,
pvalues pvalue.ttest = pvalue.ttest,
pvalue.wilcoxon = pvalue.wilcoxon)
# Note that multiple testing occurs.
# We must correct the p-values.
# let us apply the standard Benjamini-Hochberg False Discovery Rate (FDR)
# correction
$pvalue.ttest.adjusted <- p.adjust(pvalue.ttest)
pvalues#pvalues$pvalue.ttest.adjusted <- p.adjust(pvalues$pvalue.ttest)
$pvalue.wilcoxon.adjusted <- p.adjust(pvalue.wilcoxon) pvalues
Compare the p-value histograms between raw and adjusteed p-values.
library(reshape2)
library(tidyverse)
$pvalue.wilcoxon<- as.numeric(pvalue.wilcoxon)
pvalues<- ggplot(pvalues, aes(x = pvalue.wilcoxon)) + geom_histogram(bins = 50, binwidth = .03)+
p1 labs(title = "Raw p-values") +
ylim(c(0, 80))
<- ggplot(pvalues, aes(x = pvalue.wilcoxon.adjusted)) +
p2 geom_histogram(bins = 50, binwidth = .03) +
labs(title = "Adjusted p-values") +
ylim(c(0, 80))
print(p1 + p2)
Now compare these adjusted p-values between t-test and Wilcoxon test. Let us also highlight the p = 0.05 intervals.
<- ggplot(data = pvalues,
p aes(x = pvalue.ttest.adjusted,
y = pvalue.wilcoxon.adjusted)) +
geom_text(aes(label = taxon)) +
geom_abline(aes(intercept = 0, slope = 1)) +
geom_hline(aes(yintercept = 0.05), shape = 2) +
geom_vline(aes(xintercept = 0.05), shape = 2)
print(p)
This section provides a brief hands-on introduction to the practical motivations and use linear (and generalized linear) models.
Let us compare two groups with a linear model. We use Log10 abundances since this is closer to the Gaussian assumptions than the absolute count data. We can fit a linear model with Gaussian variation as follows:
<- glm(Log10_Abundance ~ Group, data = df, family = "gaussian") res
Let us investigate model coefficients
kable(summary(res)$coefficients, digits = 5)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 0.64825 | 0.02877 | 22.53405 | 0 |
GroupAFR | 0.20313 | 0.04308 | 4.71530 | 0 |
The intercept equals to the mean in the first group:
print(mean(subset(df, Group == "AAM")$Log10_Abundance))
## [1] 0.6482493
The group term equals to the difference between group means:
print(mean(subset(df, Group == "AFR")$Log10_Abundance) -
mean(subset(df, Group == "AAM")$Log10_Abundance))
## [1] 0.2031287
Note that the linear model (default) significance equals to t-test assuming equal variances.
print(t.test(Log10_Abundance ~ Group, data = df, var.equal=TRUE)$p.value)
## [1] 4.284318e-06
An important advantage of linear models, compared to plain t-test is that they allow incorporating additional variables, such as potential confounders (age, BMI, gender..):
# Add a covariate:
$sex <- meta(d)$sex
df
# Fit the model:
<- glm(Log10_Abundance ~ Group + sex, data = df, family = "gaussian") res
Let us briefly discuss the ideas underlying generalized linear models before proceeding to the next section.
The Generalized linear model (GLM) allows a richer family of probability distributions to describe the data. Intuitively speaking, GLMs allow the modeling of nonlinear, nonsymmetric, and nongaussian associations. GLMs consist of three elements: - A probability distribution (from exponential family) - A linear predictor η = Xβ . - A link function g such that \(E(Y) = μ = g^{−1}(η)\).
We use Poisson with (its natural) log-link. Fit abundance (read counts) assuming that the data is Poisson distributed, and the logarithm of its mean, or expectation, is obtained with a linear model.
<- glm(Abundance ~ Group, data = df, family = "poisson") res
Investigate the model output:
kable(summary(res)$coefficients, digits = 5)
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | 1.49409 | 0.04272 | 34.97577 | 0 |
GroupAFR | 1.04329 | 0.05122 | 20.36855 | 0 |
GLMs are the basis for advanced testing of differential abundance in sequencing data. This is necessary, as the sequencing data sets deviate from symmetric, continuous, Gaussian assumptions in many ways.
Sequencing data consists of discrete counts:
print(abundances(d)[1:5,1:3])
## Sample-1 Sample-2 Sample-3
## Actinomycetaceae 0 1 0
## Aerococcus 0 0 0
## Aeromonas 0 0 0
## Akkermansia 18 97 67
## Alcaligenes faecalis et rel. 1 2 3
The data is sparse:
hist(log10(1 + abundances(d)), 100)
Long tails of rare taxa:
<- apply(abundances(d),1,median)/1e3
medians library(reshape2)
<- melt(otu_tibble(d)) A
## Using FeatureID as id variables
$FeatureID <- factor(A$FeatureID, levels = rev(names(sort(medians))))
A<- ggplot(A, aes(x = FeatureID, y = value)) +
p geom_boxplot() +
labs(y = "Abundance (reads)", x = "Taxonomic Group") +
scale_y_log10() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, size=6))
print(p)
Overdispersion (variance exceeds the mean):
<- apply(abundances(d),1,mean)
means <- apply(abundances(d),1,var)
variances
# Calculate mean and variance over samples for each taxon
library(reshape2)
library(dplyr)
<- melt(otu_tibble(d))
df names(df) <- c("Taxon", "Sample", "Reads")
<- df %>% group_by(Taxon) %>%
df summarise(mean = mean(Reads),
variance = var(Reads))
# Illustrate overdispersion
library(scales)
<- ggplot(df, aes(x = mean, y = variance)) +
p geom_point() +
geom_abline(aes(intercept = 0, slope = 1)) +
scale_x_log10(labels = scales::scientific) +
scale_y_log10(labels = scales::scientific) +
labs(title = "Overdispersion (variance > mean)")
print(p)
DESeq2 analysis can accommodate those particular assumptions about sequencing data.
# Start by converting phyloseq object to deseq2 format
library(DESeq2)
<- phyloseq_to_deseq2(d, ~ group + nationality)
ds2
# Run DESeq2 analysis (all taxa at once!)
<- DESeq(ds2)
dds
# Investigate results
<- results(dds)
res <- as.data.frame(res)
deseq.results <- deseq.results
df $taxon <- rownames(df)
df<- df %>% arrange(log2FoldChange, padj)
df
# Print the results; flitered and sorted by pvalue and effectsize
library(knitr)
<- df %>% filter(pvalue < 0.05 & log2FoldChange > 1.5) %>%
df arrange(pvalue, log2FoldChange)
kable(df, digits = 5)
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | taxon | |
---|---|---|---|---|---|---|---|
Clostridium difficile et rel. | 29.20535 | 1.91205 | 0.13432 | 14.23457 | 0.00000 | 0.00000 | Clostridium difficile et rel. |
Mitsuokella multiacida et rel. | 51.65152 | 3.04116 | 0.28687 | 10.60107 | 0.00000 | 0.00000 | Mitsuokella multiacida et rel. |
Klebisiella pneumoniae et rel. | 12.39749 | 1.83825 | 0.18531 | 9.91994 | 0.00000 | 0.00000 | Klebisiella pneumoniae et rel. |
Megasphaera elsdenii et rel. | 44.16494 | 1.78333 | 0.23072 | 7.72937 | 0.00000 | 0.00000 | Megasphaera elsdenii et rel. |
Escherichia coli et rel. | 66.93783 | 1.68345 | 0.25330 | 6.64609 | 0.00000 | 0.00000 | Escherichia coli et rel. |
Weissella et rel. | 3.63459 | 1.53142 | 0.23140 | 6.61792 | 0.00000 | 0.00000 | Weissella et rel. |
Serratia | 5.74035 | 3.07334 | 0.47848 | 6.42308 | 0.00000 | 0.00000 | Serratia |
Moraxellaceae | 0.42171 | 1.70079 | 0.47147 | 3.60743 | 0.00031 | 0.00075 | Moraxellaceae |
For comparison purposes, assess significances and effect sizes based on Wilcoxon test.
<- taxa(d)
test.taxa <- c()
pvalue.wilcoxon <- c()
foldchange for (taxa in test.taxa) {
# Create a new data frame for each taxonomic group
<- data.frame(Abundance = abundances(d)[taxa,],
df Log10_Abundance = log10(1 + abundances(d)[taxa,]),
Group = meta(d)$nationality)
# Calculate pvalue and effect size (difference beween log means)
<- wilcox.test(Abundance ~ Group, data = df)$p.value
pvalue.wilcoxon[[taxa]] <- coef(lm(Log10_Abundance ~ Group, data = df))[[2]]
foldchange[[taxa]]
}# Correct p-values for multiple testing
<- p.adjust(pvalue.wilcoxon) pvalue.wilcoxon.adjusted
par(mfrow = c(1,2))
plot(deseq.results$padj, pvalue.wilcoxon.adjusted,
xlab = "DESeq2 adjusted p-value",
ylab = "Wilcoxon adjusted p-value",
main = "P-value comparison")
abline(v = 0.05, h = 0.05, lty = 2)
plot(deseq.results$log2FoldChange, foldchange,
xlab = "DESeq2",
ylab = "Linear model",
main = "Effect size comparison")
abline(0,1)
%>%
deseq.results ::rownames_to_column("Taxon") %>%
tibblefilter(padj <= 0.05 & abs(log2FoldChange) >= 1.5) %>%
ggplot(aes(log2FoldChange, reorder(Taxon, log2FoldChange))) +
geom_col() +
labs(x="log2FoldChange", y="Taxon")