Chapter 9 Differential abundance analysis
Here, we analyse abundances with three different methods: Wilcoxon test (CLR), DESeq2, and ANCOM-BC. All of these test statistical differences between groups. We will analyse Genus level abundances.
We might want to first perform prevalence filtering to reduce the amount of multiple tests. In this particular dataset, all genera pass a prevalence threshold of 10%, therefore, we do not perform filtering.
9.1 Wilcoxon test
A Wilcoxon test estimates the difference in an outcome between two groups. It is a non-parametric alternative to a t-test, which means that the Wilcoxon test does not make any assumptions about the data.
Let’s first combine the data for the testing purpose.
# Agglomerates data to Genus level
<- agglomerateByRank(tse, rank = "Genus")
tse_genus
# Perform clr transformation. A Pseudocount of 1 needs to be added,
# because the data contains zeros and the clr transformation includes a
# log transformation.
<- transformCounts(tse_genus, method = "clr", pseudocount = 1)
tse_genus
# Does transpose, so samples are in rows, then creates a data frame.
<- data.frame(t(assay(tse_genus, "clr")))
abundance_analysis_data # We will analyse whether abundances differ depending on the"patient_status".
# There are two groups: "ADHD" and "control".
# Let's include that to the data frame.
<- cbind(
abundance_analysis_data
abundance_analysis_data, patient_status = colData(tse_genus)$patient_status
)
Now we can start with the Wilcoxon test. We test all the taxa by looping through columns, and store individual p-values to a vector. Then we create a data frame from collected data.
The code below does the Wilcoxon test only for columns that contain abundances, not for columns that contain patient status.
<- names(abundance_analysis_data[, !names(abundance_analysis_data) %in% "patient_status"])
genera
<- c() # Initialize empty vector for p-values
wilcoxon_p
# Do "for loop" over selected column names
for (i in genera) {
<- wilcox.test(abundance_analysis_data[, i] ~ patient_status,
result data = abundance_analysis_data)
# Stores p-value to the vector with this column name
<- result$p.value
wilcoxon_p[[i]]
}
<- data.frame(taxa = names(wilcoxon_p),
wilcoxon_p p_raw = unlist(wilcoxon_p))
Multiple tests were performed. These are not independent, so we need to adjust p-values for multiple testing. Otherwise, we would increase the chance of a type I error drastically depending on our p-value threshold. By applying a p-value adjustment, we can keep the false positive rate at a level that is acceptable. What is acceptable depends on our research goals. Here we use the fdr method, but there are several other methods as well.
$p_adjusted <- p.adjust(wilcoxon_p$p_raw, method = "fdr") wilcoxon_p
# prepare a dataframe to plot p values
<- data.frame(x = c(wilcoxon_p$p_raw, wilcoxon_p$p_adjusted),
df type=rep(c("raw", "fdr"),
c(length(wilcoxon_p$p_raw),
length(wilcoxon_p$p_adjusted))))
# make a histrogram of p values and adjusted p values
<- ggplot(df) +
wilcoxon_plot geom_histogram(aes(x=x, fill=type)) +
labs(x = "p-value", y = "Frequency")
wilcoxon_plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
9.2 DESeq2
Our second analysis method is DESeq2. This method performs the data normalization automatically. It also takes care of the p-value adjustment, so we don’t have to worry about that.
DESeq2 utilizes a negative binomial distribution to detect differences in read counts between groups. Its normalization takes care of the differences between library sizes and compositions. DESeq2 analysis includes multiple steps, but they are done automatically. More information can be found, e.g., from Harvard Chan Bioinformatic Core’s tutorial Introduction to DGE - ARCHIVED
Now let us show how to do this. First, run the DESeq2 analysis.
# Creates DESeq2 object from the data. Uses "patient_status" to create groups.
<- DESeqDataSet(tse_genus, ~patient_status) ds2
## converting counts to integer mode
## Warning in DESeqDataSet(tse_genus, ~patient_status): 2 duplicate rownames were
## renamed by adding numbers
## Warning in DESeqDataSet(tse_genus, ~patient_status): some variables in design
## formula are characters, converting to factors
# Does the analysis
<- DESeq(ds2) dds
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 11 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
# Gets the results from the object
<- results(dds)
res
# Creates a data frame from results
<- as.data.frame(res)
df
# Adds taxon column that includes names of taxa
$taxon <- rownames(df)
df
# Orders the rows of data frame in increasing order firstly based on column
# "log2FoldChange" and secondly based on "padj" column
<- df %>% arrange(log2FoldChange, padj)
df
::kable(head(df)) %>%
knitr::kable_styling("striped") %>%
kableExtra::scroll_box(width = "100%") kableExtra
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | taxon | |
---|---|---|---|---|---|---|---|
Genus:Ruminococcaceae_UCG-014 | 22.548297 | -24.891267 | 2.460684 | -10.115589 | 0.0000000 | 0.0000000 | Genus:Ruminococcaceae_UCG-014 |
Order:Bacteroidales | 40.353733 | -9.241798 | 2.136205 | -4.326270 | 0.0000152 | 0.0002730 | Order:Bacteroidales |
Genus:Faecalibacterium | 231.079502 | -7.074433 | 1.745612 | -4.052694 | 0.0000506 | 0.0006835 | Genus:Faecalibacterium |
Genus:Catabacter | 18.045614 | -6.615454 | 1.716150 | -3.854823 | 0.0001158 | 0.0012508 | Genus:Catabacter |
Genus:Butyricicoccus | 2.392885 | -5.179608 | 2.948055 | -1.756957 | 0.0789251 | 0.3278426 | Genus:Butyricicoccus |
Order:Gastranaerophilales | 2.067972 | -3.054975 | 2.938641 | -1.039588 | 0.2985315 | 0.7269742 | Order:Gastranaerophilales |
9.3 ANCOM-BC
The analysis of composition of microbiomes with bias correction (ANCOM-BC) is a recently developed method for differential abundance testing. It is based on an earlier published approach. The former version of this method could be recommended as part of several approaches: A recent study compared several mainstream methods and found that among another method, ANCOM produced the most consistent results and is probably a conservative approach. Please note that based on this and other comparisons, no single method can be recommended across all datasets. Rather, it could be recommended to apply several methods and look at the overlap/differences.
As the only method, ANCOM-BC incorporates the so called sampling fraction into the model. The latter term could be empirically estimated by the ratio of the library size to the microbial load. Variations in this sampling fraction would bias differential abundance analyses if ignored. Furthermore, this method provides p-values, and confidence intervals for each taxon. It also controls the FDR and it is computationally simple to implement.
As we will see below, to obtain results, all that is needed is to pass
a phyloseq object to the ancombc()
function. Therefore, below we first convert
our tse
object to a phyloseq
object. Then, we specify the formula. In this formula, other covariates could potentially be included to adjust for confounding.
Please check the function documentation
to learn about the additional arguments that we specify below. Also, see here for another example for more than 1 group comparison.
# currently, ancombc requires the phyloseq format, but we can convert this easily
<- makePhyloseqFromTreeSummarizedExperiment(tse)
pseq <- phyloseq::tax_glom(pseq, taxrank = "Genus")
pseq_genus
= ancombc(
out phyloseq = pseq_genus,
formula = "patient_status",
p_adj_method = "fdr",
zero_cut = 0.90, # by default prevalence filter of 10% is applied
lib_cut = 0,
group = "patient_status",
struc_zero = TRUE,
neg_lb = TRUE,
tol = 1e-5,
max_iter = 100,
conserve = TRUE,
alpha = 0.05,
global = TRUE
)<- out$res res
The object out
contains all relevant information. Again, see the
documentation of the function
under Value for an explanation of all the output objects. Our question can be answered
by looking at the res
object, which now contains dataframes with the coefficients,
standard errors, p-values and q-values. Conveniently, there is a dataframe diff_abn
.
Here, we can find all differentially abundant taxa. Below we show the first 6 entries of this dataframe:
::kable(head(res$diff_abn)) %>% kableExtra::kable_styling("striped") %>%
knitr::scroll_box(width = "100%") kableExtra
patient_statusControl | |
---|---|
172647198 | FALSE |
1726478 | FALSE |
172647201 | FALSE |
17264798 | FALSE |
172647195 | FALSE |
1726472 | FALSE |
In total, this method detects 14 differentially abundant taxa.
9.4 Comparison of the methods
Let’s compare results that we got from the methods. As we can see from the scatter plot, DESeq2 gives lower p-values than Wilcoxon test.
<- data.frame(df$padj, wilcoxon_p$p_adjusted)
mf <- ggplot(mf, aes(x = df$padj, y = wilcoxon_p$p_adjusted)) +
p labs(x = "DESeq2 adjusted p-value", y = "Wilcoxon test adjusted p-value") +
geom_count() +
scale_size_area(max_size = 10)
print(p)
Prints number of p-values under 0.05
print(paste0("Wilcoxon test p-values under 0.05: ", sum(wilcoxon_p$p_adjusted<0.05, na.rm = TRUE), "/", length(wilcoxon_p$p_adjusted)))
## [1] "Wilcoxon test p-values under 0.05: 2/54"
print(paste0("DESeq2 p-values under 0.05: ", sum(df$padj<0.05, na.rm = TRUE), "/", length(df$padj)))
## [1] "DESeq2 p-values under 0.05: 7/54"
print(paste0("ANCOM p-values under 0.05: ", sum(out$res$diff_abn$patient_statusControl), "/", length(out$res$diff_abn$patient_statusControl)))
## [1] "ANCOM p-values under 0.05: 14/49"
We can also look at the intersection of identified taxa
# to let R check this for us, we need to make sure,
# to use the same tax names (I call it labels here) everywhere.
<- tibble(
wilcox_labels wilcox_labels_new = rowData(tse_genus)$Genus,
taxa = colnames(data.frame(t(assay(tse_genus, "clr"))))
)<-wilcoxon_p %>%
wilcox_taxa left_join(wilcox_labels, by = "taxa") %>%
filter(p_adjusted <= 0.05) %>%
$wilcox_labels_new
.<- filter(df, padj <= 0.05) %>%
deseq2_taxa $taxon %>%
.::str_remove("Genus:") %>%
stringr::str_remove("Order:")
stringr# for ancom we need to assign genus names to ids
<- tibble::rownames_to_column(
taxid_df as.data.frame(rowData(tse)),
"taxid") %>%
select(taxid, Genus)
<- tibble::rownames_to_column(out$res$diff_abn, "taxid") %>%
ancom_taxa left_join(taxid_df, by = "taxid") %>%
filter(patient_statusControl) %>%
$Genus
.
# all methods identified in common:
Reduce(intersect, list(deseq2_taxa, wilcox_taxa, ancom_taxa))
## [1] "Faecalibacterium" "[Ruminococcus]_gauvreauii_group"
9.5 Comparison of abundance
In previous steps, we got information which taxa vary between ADHD and control groups. Let’s plot those taxa in the boxplot, and compare visually if abundances of those taxa differ in ADHD and control samples. For comparison, let’s plot also taxa that do not differ between ADHD and control groups.
Let’s first gather data about taxa that have highest p-values.
# There are some taxa that do not include Genus level information. They are
# excluded from analysis.
# str_detect finds if the pattern is present in values of "taxon" column.
# Subset is taken, only those rows are included that do not include the pattern.
<- df[ !stringr::str_detect(df$taxon, "Genus:uncultured"), ]
df
# Sorts p-values in decreasing order. Takes 3 first ones. Takes those rows that match
# with p-values. Takes taxa.
<- df[df$padj %in% sort(df$padj, decreasing = TRUE)[1:3], ]$taxon
highest3
# From clr transformed table, takes only those taxa that had highest p-values
<- assay(tse_genus, "clr")[highest3, ]
highest3
# Transposes the table
<- t(highest3)
highest3
# Adds colData that includes patient status infomation
<- data.frame(highest3, as.data.frame(colData(tse_genus)))
highest3
# Some taxa names are that long that they don't fit nicely into title. So let's add there
# a line break after e.g. "Genus". Here the dot after e.g. Genus is replaced with
# ": \n"
colnames(highest3)[1:3] <- lapply(colnames(highest3)[1:3], function(x){
# Replaces the first dot
<- stringr::str_replace(x, "[.]", ": ")
temp
# Replace all other dots and underscores with space
<- stringr::str_replace_all(temp, c("[.]" = " ", "_" = " "))
temp
# Adds line break so that 25 characters is the maximal width
<- stringr::str_wrap(temp, width = 25)
temp })
Next, let’s do the same but for taxa with lowest p-values.
# Sorts p-values in increasing order. Takes 3rd first ones. Takes those rows that match
# with p-values. Takes taxa.
<- df[df$padj %in% sort(df$padj, decreasing = FALSE)[1:3], ]$taxon
lowest3
# From clr transformed table, takes only those taxa that had lowest p-values
<-assay(tse_genus, "clr")[lowest3, ]
lowest3
# Transposes the table
<- t(lowest3)
lowest3
# Adds colData that includes patient status infomation
<- data.frame(lowest3, as.data.frame(colData(tse_genus)))
lowest3
# Some taxa names are that long that they don't fit nicely into title. So let's add there
# a line break after e.g. "Genus". Here the dot after e.g. Genus is replaced with
# ": \n"
colnames(lowest3)[1:3] <- lapply(colnames(lowest3)[1:3], function(x){
# Replaces the first dot
<- stringr::str_replace(x, "[.]", ": ")
temp
# Replace all other dots and underscores with space
<- stringr::str_replace_all(temp, c("[.]" = " ", "_" = " "))
temp
# Adds line break so that 25 characters is the maximal width
<- stringr::str_wrap(temp, width = 25)
temp })
Then we can plot these six different taxa. Let’s arrange them into the same picture.
# Puts plots in the same picture
::grid.arrange(
gridExtra
# Plot 1
ggplot(highest3, aes(x = patient_status, y = highest3[,1])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(highest3)[1]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 2
ggplot(highest3, aes(x = patient_status, y = highest3[,2])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(highest3)[2]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 3
ggplot(highest3, aes(x = patient_status, y = highest3[,3])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(highest3)[3]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 4
ggplot(lowest3, aes(x = patient_status, y = lowest3[,1])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(lowest3)[1]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 5
ggplot(lowest3, aes(x = patient_status, y = lowest3[,2])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(lowest3)[2]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# Plot 6
ggplot(lowest3, aes(x = patient_status, y = lowest3[,3])) +
geom_boxplot() +
ylab("CLR abundances") + # y axis title
ggtitle(names(lowest3)[3]) + # main title
theme(title = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title.x=element_blank()), # makes titles smaller, removes x axis title
# 3 columns and 2 rows
ncol = 3,
nrow = 2
)
We plotted those taxa that have the highest and lowest p values according to DESeq2. Can you create a plot that shows the difference in abundance in “[Ruminococcus]_gauvreauii_group”, which is the other taxon that was identified by all tools. Try for yourself! Below you find one way how to do it.
select(
abundance_analysis_data,
patient_status, Ruminococcus_gauvreauii_group = contains("gauvreauii_group")) %>%
ggplot(aes(patient_status, Ruminococcus_gauvreauii_group)) +
geom_boxplot()