17 Differential abundance

Differential Abundance Analysis (DAA) is a method used to identify differences in the abundances of individual taxonomic features (at any taxonomic level) between two or more groups, such as treatment versus control groups. Here, we demonstrate its implementation on Tengeler2020 data set.

The goal of DAA is to identify biomarkers of a certain phenotype or condition, and gain understanding of a complex system by looking at its isolated components. For example, the identification of a bacterial taxon that is more or less abundant between healthy patients and diseased patients can lead to important insights into the underlying mechanisms of the disease. In other words, differentially abundant taxonomic features can be involved in the dynamics of the disease, which in turn helps understand the system as a whole. Despite its relevance in current research, the DAA approach has also been subject to debate (Quinn, Gordon-Rodriguez, and Erb 2021).

Quinn, Thomas P., Elliott Gordon-Rodriguez, and Ionas Erb. 2021. “A Critique of Differential Abundance Analysis, and Advocacy for an Alternative.” arXiv:2104.07266 [q-Bio, Stat], April. https://arxiv.org/abs/2104.07266.

17.1 Statistical challenges of microbiome data

As discussed in Section 12.1, data display unique properties that are exclusively addressed by DAA tools developed for microbiome analysis.

We recommend to have a look at Nearing et al. (2022). In this study, multiple DAA methods were applied to 38 different datasets and their results were compared to one another. The study highlighted that different methods might yield varying results due to differing assumptions and normalization techniques. Interestingly, Pelto et al. (2024) concluded that elementary DAA methods achieve the highest consistency.

Pelto, Juho, Kari Auranen, Janne Kujala, and Leo Lahti. 2024. “Elementary Methods Provide More Replicable Results in Microbial Differential Abundance Analysis.” https://arxiv.org/abs/2404.02691.

Recently, Yang and Chen (2022) comprehensively evaluated DAA methods via a semi-parametric framework and 106 real datasets, concluding that different methods can produce contradictory results, creating the risk of cherry-picking the most favorable options for one’s own hypothesis. Therefore, it is highly recommended to perform DAA with multiple methods to verify if the findings are consistent across different approaches.

Built on the findings of Calgaro et al. (2020), the benchdamic (Calgaro et al. 2022) package could offer a valuable support in this regard through a comprehensive evaluation process. It serves both practitioners by comparing DA methods from existing literature, and method developers by providing an impartial tool to evaluate their new approaches in comparison to what is already available. For more details, refer to its extensive vignette.

Calgaro, Matteo, Chiara Romualdi, Levi Waldron, Davide Risso, and Nicola Vitulo. 2020. “Assessment of Statistical Methods from Single Cell, Bulk RNA-Seq, and Metagenomics Applied to Microbiome Data.” Genome Biology 21 (1): 191. https://doi.org/10.1186/s13059-020-02104-1.

Calgaro, Matteo, Chiara Romualdi, Davide Risso, and Nicola Vitulo. 2022. “Benchdamic: Benchmarking of Differential Abundance Methods for Microbiome Data.” Bioinformatics 39 (1). https://doi.org/10.1093/bioinformatics/btac778.

17.1.0.1 Approaches to Handle Zero-Inflated Data :

The first approach to target zero-inflated data consists of specialized models, such as over-dispersed count models and zero-inflated mixture models. DESeq2, edgeR and corncorb are based on over-dispersed count models, whereas metagenomeSeq, RAIDA, ZIBB and Omnibus implement zero-inflated mixture models to address zero-inflation. Typically, these models assume a negative binomial, beta-binomial or normal/log-normal distribution.

Another approach to deal with zero-inflated data is zero imputation where zeross are replaced with estimated values. ALDEx2 and eBay apply a Bayesian model to impute the zeros when working with proportion data, accounting for sampling variability and sequencing depth variation. Other methods, such as MaAsLin3 and ANCOMBC impute the zeros with a pseudo-count strategy.

17.1.0.2 Approaches to handle Compositional Data :

To address the compositionality of microbiome data, several approaches have been developed to perform robust normalization with methods specifically designed to reduce the bias found in compositional data. Some examples include :

Trimmed mean of M-values (TMM) normalization used by edgeR.
Relative log expression (RLE) normalization used by DESeq2.
Cumulative sum scaling (CSS) normalization used by metagenomeSeq.
Centered log-ratio transformation (CLR) normalization used by ALDEx2.
Geometric mean of pairwise ratios (GMPR) normalization used by Omnibus and Wrench normalization (Kumar et al. 2018), which corrects the compositional bias by an empirical Bayes approach.

Kumar, M. Senthil, V. Eric Slud, Kwame Okrah, C. Stephanie Hicks, Sridhar Hannenhalli, and Corrada Héctor Bravo. 2018. “Analysis and Correction of Compositional Bias in Sparse Sequencing Count Data.” BMC Genomics 19 (November). https://doi.org/10.1186/s12864-018-5160-5.

Other methods to deal with compositional data entail reference taxonomic features approach used by DACOMP and RAIDA, analyzing the pattern of pairwise log ratios as done by ANCOM and bias-correction applied by ANCOMBC.

17.2 Using the tools

In this section, we demonstrate the use of five methods that can be recommended based on recent literature ( ALDEx2 (Gloor, Macklaim, and Fernandes 2016), ANCOM-BC (Lin and Peddada 2020), MaAsLin3 (Nickols et al. 2024), LinDA (H. Zhou et al. 2022) and ZicoSeq (Yang and Chen 2022)).

The purpose of this section is to show how to perform DAA in R, not how to correctly do causal inference. Depending on your experimental setup and your theory, you must determine how to specify any model exactly. E.g., there might be confounding factors that might drive (the absence of) differences between the shown groups that we ignore here for simplicity. Or your dataset is repeated sampling design, matched-pair design or the general longitudinal design. We will demonstrate how to include covariates in those models. We picked a dataset that merely has microbial abundances in a TSE object as well as a grouping variable in the sample data. We simplify the examples by only including two of the three groups.

library(mia)
library(knitr)

# Import dataset
data("Tengeler2020", package = "mia")
tse <- Tengeler2020

# Show patient status by cohort
table(tse$patient_status, tse$cohort) |>
  kable()

	Cohort_1	Cohort_2	Cohort_3
ADHD	4	5	4
Control	6	5	3

17.2.1 Preparing the data for DAA

Before starting the analysis, it is recommended to reduce the size and complexity of the data to make the results more reproducible. For this purpose, we agglomerate the features by genus and filter them by a prevalence threshold of 10%.

# Agglomerate by genus and subset by prevalence
tse <- subsetByPrevalent(tse, rank = "Genus", prevalence = 10/100)

# Transform count assay to relative abundances
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance")

While some DAA tools provide optional arguments for prevalence filtering, here we filtered the tse object directly. This way, we ensure that the input data remains the same when multiple tools are used.

17.2.2 ALDEx2

In this section, we will show how to perform DAA with ALDEx2, which can be regarded as the method of choice for its consistency, as it normally identifies features that are also found by complementary methods (Nearing et al. 2022). A more extensive introduction to its functionality is available in the ALDEx2 vignette.

ALDEx2 estimates technical variation within each sample per taxon by utilizing the Dirichlet distribution. It furthermore applies the CLR transformation (or closely related log-ratio transforms). Depending on the experimental setup, it will perform a two sample Welch’s t-test and Wilcoxon test or a one-way ANOVA and Kruskal-Wallis test. For more complex study designs, there is a possibility to utilize the glm functionality within ALDEx2. The Benjamini-Hochberg procedure is applied by default to correct for multiple testing.

# Load package
library(ALDEx2)

# Generate Monte Carlo samples of the Dirichlet distribution for each sample.
# Convert each instance using the centered log-ratio transform.
# This is the input for all further analyses.
set.seed(123)
x <- aldex.clr(assay(tse), tse$patient_status)

The t-test:

# calculates expected values of the Welch's t-test and Wilcoxon rank
# test on the data returned by aldex.clr
x_tt <- aldex.ttest(x, paired.test = FALSE, verbose = FALSE)

Effect sizes:

# Determines the median clr abundance of the feature in all samples and in
# groups, the median difference between the two groups, the median variation
# within each group and the effect size, which is the median of the ratio
# of the between group difference and the larger of the variance within groups
x_effect <- aldex.effect(x, CI = TRUE, verbose = FALSE)

# combine all outputs
aldex_out <- data.frame(x_tt, x_effect)

Now, we can create a so called Bland-Altman or MA plot (left). It shows the association between the relative abundance and the magnitude of the difference per sample. Next to that, we can also create a plot that shows the dispersion on the x-axis instead of log-ratio abundance. Red dots represent genera that are differentially abundant (\(q \leq 0.1\)) between the 2 groups. Black points are rare taxonomic features and grey ones are abundant features. The dashed line represent an effect size of 1. Gloor, Macklaim, and Fernandes (2016) provides more information on these plots.

par(mfrow = c(1, 2))

aldex.plot(
    aldex_out,
    type = "MA",
    test = "welch",
    xlab = "Log-ratio abundance",
    ylab = "Difference",
    cutoff = 0.05)

aldex.plot(
    aldex_out,
    type = "MW",
    test = "welch",
    xlab = "Dispersion",
    ylab = "Difference",
    cutoff = 0.05)

The evaluation as differential abundant in above plots is based on the corrected p-value. According to the ALDEx2 developers, the safest approach is to identify those features where the 95% CI of the effect size does not cross 0. As we can see in below table, this is not the case for any of the identified genera (see overlap column, which indicates the proportion of overlap). Also, the authors recommend to focus on effect sizes and CIs rather than interpreting the p-value. To keep the comparison simple, we will here use the p-value as decision criterion. But please be aware that the effect size together with the CI is a better answer to the question we are typically interested in.

library(tidyverse)

aldex_out |>
  rownames_to_column(var = "Genus") |>
  # here we choose the wilcoxon output rather than t-test output
  filter(wi.eBH <= 0.05)  |>
  dplyr::select(Genus, we.eBH, wi.eBH, effect, overlap) |>
  kable()

Genus	we.eBH	wi.eBH	effect	overlap
[Ruminococcus]_gauvreauii_group	0.1082	0.0355	0.8719	0.1321

17.2.3 ANCOM-BC

The analysis of composition of microbiomes with bias correction (ANCOM-BC) (Lin and Peddada 2020) is a recently developed method for differential abundance testing. It is based on an earlier published approach (Mandal et al. 2015). The previous version of ANCOM was among the methods that produced the most consistent results and is probably a conservative approach (Nearing et al. 2022). However, the new ANCOM-BC method operates quite differently compared to the former ANCOM method.

Nearing, Jacob T., Gavin M. Douglas, Molly G. Hayes, Jocelyn MacDonald, Dhwani K. Desai, Nicole Allward, Casey M. A. Jones, et al. 2022. “Microbiome Differential Abundance Methods Produce Different Results Across 38 Datasets.” Nature Communications 13 (1): 342. https://doi.org/10.1038/s41467-022-28034-z.

As the only method, ANCOM-BC incorporates the so called sampling fraction into the model. The latter term could be empirically estimated by the ratio of the library size to the microbial load. According to the authors, ignoring variations in this sampling fraction would bias DAA results. Furthermore, this method provides p-values and confidence intervals for each taxon. It also controls the FDR and it is computationally simple to implement.

Note that the original method was implemented in the ancombc() function (see extended tutorial). The method has since then been updated and new features have been added to enable multi-group comparisons and repeated measurements among other improvements. We do not cover the more advanced features of ANCOMBC in this tutorial as these features are documented in detail in this tutorial.

We now proceed with a simple example. First, we specify a formula. In this formula, other covariates could potentially be included to adjust for confounding. We show this further below. Again, please make sure to check the function documentation as well as the linked tutorials to learn about the additional arguments that we specify.

# Load package
library(ANCOMBC)

# Run ANCOM-BC at the genus level and only including the prevalent genera
ancombc2_out <- ancombc2(
    data = tse,
    assay.type = "counts",
    fix_formula = "patient_status",
    p_adj_method = "fdr",
    prv_cut = 0,
    group = "patient_status",
    struc_zero = TRUE,
    neg_lb = TRUE,
    # multi group comparison is deactivated automatically
    global = TRUE)

The object out contains all model output. Again, see the documentation of the function under Value for details. Our question whether taxonomic features are differentially abundant can be answered by looking at the res object, which contains dataframes with the coefficients, standard errors, p-values and q-values. Below we show the first entries of this dataframe.

# store the FDR adjusted results
ancombc2_out$res |>
  dplyr::select(taxon, lfc_patient_statusControl, q_patient_statusControl) |>
  filter(q_patient_statusControl < 0.05) |>
  arrange(q_patient_statusControl) |>
  head() |>
  kable()

taxon	lfc_patient_statusControl	q_patient_statusControl
Ruminococcus_1	2.647	0.0136
Coprobacter	-1.382	0.0240
Subdoligranulum	1.634	0.0240
[Ruminococcus]_gauvreauii_group	1.236	0.0435

17.2.4 MaAsLin3

Let us next illustrate MaAsLin3 (Nickols et al. 2024). This method is based on generalized linear models and flexible for different study designs and covariate structures. For details, check their Biobakery tutorial.

# Load package
library(maaslin3)

# MaAsLin3 takes tse as input data

# We can specify different GLMs/normalizations/transforms.
maaslin3_out <- maaslin3(
    input_data = tse,
    output = "DAA_example",
    formula = "~ patient_status",
    normalization = "TSS",
    transform = "LOG",
    verbosity = "ERROR"
    )

Which genera are identified as differentially abundant? MaAsLin3 tests differential abundance and differential prevalence. We choose the differential abundance results. You can see that the results includes two adjusted p-values: individual and joint. “Individual” is q-value for differential abundance while “joint” combines both differential abundance and differential prevalence results for the association.

maaslin3_out <- maaslin3_out[["fit_data_abundance"]][["results"]]
maaslin3_out |>
  filter(qval_joint <= 0.05) |>
  kable()

feature	metadata	value	name	coef	null_hypothesis	stderr	pval_individual	qval_individual	pval_joint	qval_joint	error	model	N	N_not_zero
Coprobacter	patient_status	Control	patient_statusControl	-2.193	0.3124	0.4074	0.0006	0.0563	0.0006	0.0306	NA	linear	27	12
Faecalibacterium	patient_status	Control	patient_statusControl	-5.129	0.3124	1.1783	0.0016	0.0713	0.0016	0.0388	NA	linear	27	11

This will create a folder that is called in the output specified above. It contains also figures to visualize difference between significant features. The following figure summarizes the DAA results generated by MaAsLin3. As can be seen, Coprobacter and Faecalibacterium are diffential abundant between the study groups.

include_graphics(file.path("DAA_example", "figures", "summary_plot.png"))

17.2.5 PhILR

PhILR is a tree-based method that tests group-wise associations based on balances. A detailed introduction to this method is available in this Bioconductor tutorial.

17.2.6 Comparison of methods

Although the methods described above yield unidentical results, they are expected to agree on a few differentially abundant taxonomic features. To draw more informed conclusions, it is good practice to compare the outcomes of different methods in terms of found features, their effect sizes and significances, as well as other method-specific aspects. Such comparative approach is outlined in this exercise.

17.3 DAA with confounding

Confounders can be defined as variables that are related to and affect the apparent dynamics between the response and the main independent variable. They are common in experimental studies. Generally, they can be classified into three groups:

Biological confounders, such as age and sex
Technical confounders produced during sample collection, processing and analysis
Confounders resulting from experimental models, such as batch effects and sample history

Controlling for confounders is an important practice to reach an unbiased conclusion. To perform causal inference, it is crucial that the method is able to include confounders in the model. This is not possible with statistical tests of general use, such as the Wilcoxon test. In contrast, methods that target DAA, such as those described in this chapter, allow controlling for confounders. In the following examples, we will perform DAA with a main independent variable and a few confounders.

17.3.1 Selecting confounders

In addition to patient status, we will now control for two confounders: cohort and library size. The former is a categorical variable with three factors, whereas the latter is a discrete numerical variable. Remarkably, most DAA methods accept these two and several other data types.

For demonstration, library size is treated as a confounder and included in the formulas of the DAA methods. Although this is a satisfactory approach to control for uneven sequencing efforts across samples, rarefaction generally represents a better solution (Schloss 2023). With that said, library size can be readily computed and added to the colData.

Schloss, Patrick D. 2023. “Rarefaction Is Currently the Best Approach to Control for Uneven Sequencing Effort in Amplicon Sequence Analyses.” bioRxiv, 2023–06. https://doi.org/10.1101/2023.06.23.546313.

# Compute and store library size in colData
colData(tse)$library_size <- colSums(assay(tse, "counts"))

17.3.2 ANCOM-BC

Here, confounders can be added to the formula along with patient status, the main outcome variable. This way, the model evaluates whether differentially abundant taxonomic features are associated with one of the variables when the other two are kept constant.

# perform the analysis
ancombc2_out <- ancombc2(
    tse,
    assay.type = "counts",
    fix_formula = "patient_status + cohort + library_size",
    p_adj_method = "fdr",
    lib_cut = 0,
    group = "patient_status",
    struc_zero = TRUE,
    neg_lb = TRUE,
    alpha = 0.05,
    # multi-group comparison is deactivated automatically
    global = TRUE)

In the output, each taxon is assigned with several effect sizes (lfc, which stands for log-fold change) and adjusted p-values (q). For categorical variables such as patient status and cohort, the statistics indicate whether the abundance of a given taxon is significantly different between the specified group (column name) and the reference group (the group that does not appear in the column names), whereas for numerical variables such as library size, they indicate whether the abundance of a given taxon varies with that variable.

ancombc2_out$res |>
  dplyr::select(starts_with(c("taxon", "lfc", "q"))) |>
  arrange(q_patient_statusControl) |>
  head() |>
  kable()

taxon	lfc_(Intercept)	lfc_patient_statusControl	lfc_cohortCohort_2	lfc_cohortCohort_3	lfc_library_size	q_(Intercept)	q_patient_statusControl	q_cohortCohort_2	q_cohortCohort_3	q_library_size
Erysipelatoclostridium	-4.0180	1.640	-0.6592	2.6290	1e-04	0.0004	0e+00	0.1709	0.0000	0.0000
Coprobacter	-0.0541	-2.552	0.5573	2.2533	0e+00	0.9473	0e+00	0.2417	0.0000	0.0000
Ruminococcus_1	-3.0054	2.416	1.9413	2.3661	0e+00	0.0059	0e+00	0.0067	0.0000	0.3512
Subdoligranulum	-0.2294	1.117	-0.2230	0.2076	0e+00	0.7766	0e+00	0.6505	0.2652	0.0168
[Ruminococcus]_gauvreauii_group	-1.9342	1.085	0.5407	1.2639	0e+00	0.0111	0e+00	0.2355	0.0000	0.0710
uncultured_bacterium	0.9141	-1.779	0.1219	0.8114	0e+00	0.2132	6e-04	0.8130	0.0076	0.0334

17.4 Further information on tools for DAA

17.4.0.1 LinDA

LinDA covers linear models for differential abundance analysis of microbiome compositional data (H. Zhou et al. (2022)). This is very similar to ANCOMBC with few differences:

LinDA corrects for the compositional bias differently using the mode of all regression coefficients.
It is faster (100x-1000x than ANCOMBC and according to the authors);
It supports hierarchical models. The latest ANCOMBC versions are also supporting hierarchical models.

Nevertheless, LinDA seems a promising tool that achieves a very good power/fdr trade-off together with ANCOMBC according to the review. The speed improvements might make it critical especially for datasets that have higher sample or feature set sizes.

17.4.0.2 ZicoSeq

Subsequently, we demonstrate DAA with ZicoSeq, a method based on linear models and permutation. Further details can be found in this tutorial. This approach has been assessed to exhibit high power and a low false discovery rate, which has the following components:

Winsorization to decrease the influence of outliers;
Posterior sampling based on a beta mixture prior to address sampling variability and zero inflation;
Reference-based multiple-stage normalization to address compositional effects;Additional resources

Additional resources

DAA can be performed by several means. Although most of them provide similar functionality, some may be more suitable than others given a certain study design or data type. Commonly used DAA tools include:

ALDEx2 (Gloor, Macklaim, and Fernandes 2016)
ANCOM (Mandal et al. 2015)
ANCOMBC (Lin and Peddada 2020)
BIRDMAn (Rahman et al. 2023)
corncob (Martin, Witten, and Willis 2021)
DACOMP (Brill, Amnon, and Ruth 2022)
DESeq2 (Love, Huber, and Anders 2014)
eBay (Liu et al. 2020)
edgeR (Y. Chen, Lun, and Smyth 2016)
fastANCOM (C. Zhou et al. 2022)
fido (Silverman et al. 2019)
LDM (Hu and Satten 2020)
lefser (Khleborodova 2021)
limma (Ritchie et al. 2015)
LimROTS (Anwar et al. 2025)
LinDA (H. Zhou et al. 2022)
MaAsLin2 (Mallick, Rahnavard, and McIver 2020)
MaAsLin3 (Nickols et al. 2024)
metagenomeSeq (Paulson, Talukder, and Bravo 2017)
Omnibus (J. Chen et al. 2018)
RAIDA (Sohn, Du, and An 2015)
sccomp (Mangiola et al. 2023)
t-test
Wilcoxon test
ZicoSeq (Yang and Chen 2022)
ZINQ (Ling et al. 2021)

Ling, Wodan, Zhao Ni, M. Plantinga Anna, J. Launer Lenore, A. Fodor Anthony, A. Meyer Katie, and C. Wu Michael. 2021. “Powerful and Robust Non-Parametric Association Testing for Microbiome Data via a Zero-Inflated Quantile Approach (ZINQ).” Microbiome 181 (September). https://doi.org/10.1186/s40168-021-01129-3.

Yang, Lu, and Jun Chen. 2022. “A Comprehensive Evaluation of Microbial Differential Abundance Analysis Methods: Current Status and Potential Solutions.” Microbiome 10 (130): 2049–2618. https://doi.org/10.1186/s40168-022-01320-0.

Mangiola, Stefano, Alexandra J. Roth-Schulze, Marie Trussart, Enrique Zozaya-Valdés, Mengyao Ma, Zijie Gao, Alan F. Rubin, Terence P. Speed, Heejung Shim, and Anthony T. Papenfuss. 2023. “Sccomp: Robust Differential Composition and Variability Analysis for Single-Cell Data.” Proceedings of the National Academy of Sciences 120 (33): e2203828120. https://doi.org/10.1073/pnas.2203828120.

Sohn, Michael B., Ruofei Du, and Lingling An. 2015. “A Robust Approach for Identifying Differentially Abundant Features in Metagenomic Samples.” Bioinformatics 31 (July). https://doi.org/10.1093/bioinformatics/btv165.

Chen, Jun, Emily King, Rebecca Deek, Zhi Wei, Yue Yu, Diane Grill, and Karla Ballman. 2018. “An Omnibus Test for Differential Distribution Analysis of Microbiome Sequencing Data.” Bioinformatics 34 (February).

Paulson, JN, H Talukder, and HC Bravo. 2017. “Longitudinal Differential Abundance Analysis of Marker-Gene Surveys Using Smoothing Splines.” Biorxiv. https://doi.org/10.1101/099457.

Nickols, William A., Thomas Kuntz, Jiaxian Shen, Sagun Maharjan, Himel Mallick, Eric A. Franzosa, Kelsey N. Thompson, Jacob T. Nearing, and Curtis Huttenhower. 2024. “MaAsLin 3: Refining and Extending Generalized Multivariable Linear Models for Meta-Omic Association Discovery,” December. https://doi.org/10.1101/2024.12.13.628459.

Mallick, Himel, Ali Rahnavard, and Lauren J. McIver. 2020. MaAsLin 2: Multivariable Association in Population-Scale Meta-Omics Studies. http://huttenhower.sph.harvard.edu/maaslin2.

Zhou, Huijuan, Kejun He, Jun Chen, and Xianyang Zhang. 2022. “LinDA: Linear Models for Differential Abundance Analysis of Microbiome Compositional Data.” Genome Biology 23 (1): 95. https://doi.org/10.1186/s13059-022-02655-5.

Anwar, Ali Mostafa, Akewak Jeba, Leo Lahti, and Eleanor Coffey. 2025. “LimROTS: A Hybrid Method Integrating Empirical Bayes and Reproducibility-Optimized Statistics for Robust Analysis of Proteomics Data.” bioRxiv. https://doi.org/10.1101/2025.02.06.636801.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Khleborodova, Asya. 2021. Lefser: R Implementation of the LEfSE Method for Microbiome Biomarker Discovery. https://github.com/waldronlab/lefser.

Hu, Yijuan, and Glen A Satten. 2020. “Testing Hypotheses about the Microbiome Using the Linear Decomposition Model (LDM).” Bioinformatics 36 (July): 4106--4115.

Silverman, Justin D, Kim Roche, Zachary C Holmes, Lawrence A David, and Sayan Mukherjee. 2019. “Bayesian Multinomial Logistic Normal Models Through Marginally Latent Matrix-t Processes.” arXiv e-Prints. https://arxiv.org/abs/1903.11695.

Zhou, Chao, Huimin Wang, Hongyu Zhao, and Tao Wang. 2022. “fastANCOM: A Fast Method for Analysis of Compositions of Microbiomes.” Bioinformatics 38 (March): 2039–41.

Chen, Yunshun, Aaron TL Lun, and Gordon K Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.” F1000Research 5: 1438. https://doi.org/10.12688/f1000research.8987.2.

Liu, Chi, Yaoming Cui, Xiangzhen Li, and Minjie Yao. 2020. “Microeco: An r Package for Data Mining in Microbial Community Ecology.” FEMS Microbiology Ecology 97 (2). https://doi.org/10.1093/femsec/fiaa255.

Love, Michael, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biol 15 (December): 550. https://doi.org/10.1186/s13059-014-0550-8.

Brill, Barak, Amir Amnon, and Heller Ruth. 2022. “Testing for Differential Abundance in Compositional Counts Data, with Application to Microbiome Studies.” The Annals of Applied Statistics 16 (December).

Martin, Bryan D, Daniela Witten, and Amy D Willis. 2021. Corncob: Count Regression for Correlated Observations with the Beta-Binomial. https://CRAN.R-project.org/package=corncob.

Rahman, Gibraan, James T. Morton, Cameron Martino, Gregory D. Sepich-Poore, Celeste Allaband, Caitlin Guccione, Yang Chen, Daniel Hakim, Mehrbod Estaki, and Rob Knight. 2023. “BIRDMAn: A Bayesian Differential Abundance Framework That Enables Robust Inference of Host-Microbe Associations.” bioRxiv. https://doi.org/10.1101/2023.01.30.526328.

Lin, Huang, and Shyamal Das Peddada. 2020. “Analysis of Compositions of Microbiomes with Bias Correction.” Nature Communications 11 (1): 1–11. https://doi.org/https://doi.org/10.1038/s41467-020-17041-7.

Mandal, Siddhartha, Will Van Treuren, Richard A. White, Merete Eggesbø, Rob Knight, and Shyamal D. Peddada. 2015. “Analysis of Composition of Microbiomes: A Novel Method for Studying Microbial Composition.” Microbial Ecology in Health & Disease 26 (0). https://doi.org/10.3402/mehd.v26.27663.

Gloor, Gregory B., Jean M. Macklaim, and Andrew D. Fernandes. 2016. “Displaying Variation in Large Datasets: Plotting a Visual Summary of Effect Sizes.” Journal of Computational and Graphical Statistics 25 (3): 971–79. https://doi.org/10.1080/10618600.2015.1131161.

Exercises

Goal: The goal of these exercises it to learn how to perform differential abundance analysis and to visualize the results.

Exercise 1: Differential abundance analysis (DAA)

Load any of the example datasets mentioned in Section 4.2.
Observe colData. Ensure that the data includes a meaningful way to group samples (e.g., Region, Treatment, Diagnosis). If it does not, choose another dataset.
Agglomerate the data to the genus level and filter by a prevalence of 10%.
Run DAA with MaAsLin3. Model abundances without confounders, i.e., the formula should simply be "~ groups".
Explore the results table. Is there any taxonomic features with significant abundance difference?
Print the default plot from MaAsLin3.
Run the same analysis now with confounders. Do the results differ? Why?
Select the significant taxonomic features and visualize their abundance with a violin plot.

Useful functions:

utils::data(), SummarizedExperiment::colData(), mia::subsetByPrevalent(), maaslin3::maaslin3(), knitr::include_graphics(), scater::plotExpression()