Appendix C — Differential abundance analysis

C.0.1 Differential abundance analysis with ordinal regression model

We show here how to perform DAA by analyzing relative abundances with ordinal regression model (ORM). We have chosen ORM as it makes few assumptions on the distributions of abundances while it can incorporate covariates (including interaction terms) in the analysis. As ORM performs very similarly to the familiar non-parametric Wilcoxon test in the basic two group comparison it can also be seen as an extension of the Wilcoxon test (Harrell 2001; Liu et al. 2017).

Harrell, Frank E. 2001. Regression Modeling Strategies. Springer New York. https://doi.org/10.1007/978-1-4757-3462-1.

Liu, Qi, Bryan E. Shepherd, Chun Li, and Frank E. Harrell. 2017. “Modeling Continuous Response Variables Using Ordinal Regression.” Statistics in Medicine 36 (November): 4316. https://doi.org/10.1002/SIM.7433.

By using ORM to model the skewed and/or zero inflated distributions of microbial abundances, we can avoid the strong parametric assumptions (e.g. negative binomial distribution) made by some methods. Furthermore, we do not need to transform abundances. This good because the choice of the transformation may be somewhat arbitrary while it may affect the DAA results. Moreover, ORM has been shown to perform well in practice in DAA when considering its sensitivity and the consistency of its results (Pelto et al. 2025).

We first create a function that (for each taxon) runs ORM and returns the effect size estimate and p-value (based on score test) for each variable in the model.

# The function uses the orm() function from the rms package
library(rms)

run_orm <- function(abundance, metadata, formula) {
    # Create the design matrix of the model
    mm <- model.matrix(formula, metadata) |>
        cbind.data.frame(abundance)
    mm[["(Intercept)"]] <- NULL

    inds <- 1:(ncol(mm) - 1)
    vars <- colnames(mm)[inds]

    # Fit ordinal regression model with all variables
    # (Intercepts are automatically added)
    fit_1 <- orm(abundance ~ ., data = mm)
    score_1 <- fit_1$stats["Score"]

    res <- data.frame(estimate = fit_1$coefficients[vars], p_value = NA)

    # Calculate p-value based on score test for each variable
    if (length(inds) == 1) {
        res$p_value <- fit_1$stats["Score P"]
    } else {
        for (i in inds) {
            fit_0 <- orm(abundance ~ ., data = mm[, -i])
            score_0 <- fit_0$stats["Score"]
            res$p_value[i] <- as.numeric(1 - pchisq(score_1 - score_0, df = 1))
        }
    }

    res[["variable"]] <- rownames(res)
    rownames(res) <- NULL
    return(res)
}

We use the same Tengeler2020 dataset as in Chapter 16.

library(mia)

# Import dataset
data("Tengeler2020", package = "mia")
tse <- Tengeler2020

# Agglomerate by genus and subset by prevalence
tse <- subsetByPrevalent(tse, rank = "Genus", prevalence = 0.10)

# Transform count data and create assay of relative abundances
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance")

# Transform patient_status to a factor and set "Control" as the reference level
tse$patient_status <- factor(tse$patient_status, levels = c("Control", "ADHD"))

# We also transform cohort as a factor
tse$cohort <- factor(tse$cohort, levels = c("Cohort_1", "Cohort_2", "Cohort_3"))

C.0.1.1 The basic two-group comparison

We compare the relative abundance of taxa between the people with and without ADHD (indicated by the variable patient_status).

The code below works as follows.

We extract the assay with relative abundances from the tse object
The assay is transposed and transformed to a data.frame
The function run_orm is run for each taxon
1. The . indicates here the relative abundance of each taxon
2. The metadata including the variable patient_status is extracted from the tse object
3. The model formula indicates that we included only patient_status in our analysis
The results for all taxa are binded to as one data.frame
Lastly, we calculate p-values adjusted from multiple testing (q-value) by using the standard FDR correction by Benjamini and Hochberg (1995) (method = "BH").

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Methodological) 57 (January): 289–300. https://doi.org/10.1111/J.2517-6161.1995.TB02031.X.

library(purrr)
library(dplyr)
library(knitr)

res_daa_basic <- assay(tse, "relabundance") |> # Extract relative abundances
    t() |> # Transpose the assay
    as.data.frame() |> # Transform to a data.frame
    map(
        ~ run_orm( # Run ORM for each taxon
            .,
            metadata = colData(tse) |> as.data.frame(), # Extract the metadata
            formula = ~patient_status # Model formula. Now only patient_status
        )
    ) |>
    bind_rows(.id = "taxon") |> # Bind the results of all taxa
    mutate(
        q_value = p.adjust(p_value, method = "BH")
    ) # Calculate q-values
# (FDR adjusted p-values)

# The results for taxa with the smallest p-values
res_daa_basic |>
    arrange(p_value) |>
    head() |>
    kable()

taxon	estimate	p_value	variable	q_value
[Ruminococcus]_gauvreauii_group	-3.469550	0.0000919	patient_statusADHD	0.0045014
Erysipelatoclostridium	-2.367021	0.0032408	patient_statusADHD	0.0613379
Faecalibacterium	2.332248	0.0049883	patient_statusADHD	0.0613379
[Clostridium]_innocuum_group	-2.076335	0.0050072	patient_statusADHD	0.0613379
Catabacter	2.300598	0.0091887	patient_statusADHD	0.0765760
Ruminococcaceae_UCG-014	14.341405	0.0101499	patient_statusADHD	0.0765760

In the output

variable indicates the predictor variable. As patient_status is a categorical variable, it also tells the group used as the non-reference level (here ADHD as Control was set as the reference level).
estimate is log(odds ratio). Most importantly, positive values indicate abundance being higher in the non-reference group (here ADHD). Otherwise, estimates can be a bit difficult to interpret for ORMs.

C.0.1.2 More complex analyses

Analysis with covariates and analysis of interaction terms can be performed in by just using different formula in the run_orm function, e.g. formula = patient_status + cohort or formula = patient_status * cohort.

C.0.2 Analysis of presence/absence with Firth logistic regression

As some versions of logistic regression perform badly in the cases where, for instance, the prevalence of a taxon is zero in one group we show here how to perform DPAA with a robust version of logistic regression introduced by Firth (1993). This version has been observed to perform well in DPAA in practice (Pelto et al. 2025).

Firth, David. 1993. “Bias Reduction of Maximum Likelihood Estimates.” Biometrika 80 (March): 27. https://doi.org/10.2307/2336755.

Pelto, Juho, Kari Auranen, Janne V Kujala, and Leo Lahti. 2025. “Elementary Methods Provide More Replicable Results in Microbial Differential Abundance Analysis.” Briefings in Bioinformatics 26 (March): 130. https://doi.org/10.1093/BIB/BBAF130.

We first create a function that runs Firth logistic regression for presence/absence and returns the effect size estimate and p-value for each variable in the model.

# The function uses the logistf() function from the logistf package
library(logistf)

run_firth <- function(abundance, metadata, formula) {
    # Create the design matrix of the model
    mm <- model.matrix(formula, metadata) |>
        cbind.data.frame(abundance)
    mm[["(Intercept)"]] <- NULL

    # Fit Firth logistic regression model
    fit <- logistf(abundance ~ .,
        data = mm,
        control = logistf.control(maxit = 1000)
    )

    res <- data.frame(
        estimate = fit$coefficients,
        p_value = fit$prob
    )

    res <- res[!rownames(res) %in% c("(Intercept)"), , drop = FALSE]
    res[["variable"]] <- rownames(res)
    rownames(res) <- NULL

    return(res)
}

We use the same Tengeler2020 dataset as in Chapter 16.

library(mia)
library(scater)

# Import dataset
data("Tengeler2020", package = "mia")
tse <- Tengeler2020

# Agglomerate by genus and subset by prevalence
tse <- subsetByPrevalent(tse, rank = "Genus", prevalence = 0.10)

# Transform count data and create assay of presence/absences
tse <- transformAssay(tse, assay.type = "counts", method = "pa")

# Transform patient_status to a factor and set "Control" as the reference level
tse$patient_status <- factor(tse$patient_status, levels = c("Control", "ADHD"))

# We also transform cohort as a factor
tse$cohort <- factor(tse$cohort, levels = c("Cohort_1", "Cohort_2", "Cohort_3"))

# Calculate the sequencing depth (the total number of reads/counts) for each
# sample (needed when analyzing presence/absence of taxa)
tse <- addPerCellQC(tse)

C.0.2.1 The basic two-group comparison

We perform here the basic two-group comparison while controlling for sequencing depth (reads) as in Section 16.3.5. We compare the presence/absence of taxa between the people with and without ADHD.

library(purrr)
library(dplyr)
library(knitr)

res_firth_basic <- assay(tse, "pa") |> # Extract presence/absences
    t() |> # Transpose the assay
    as.data.frame() |> # Transform to a data.frame
    map(
        ~ run_firth( # Run logistic regression
            ., # for each taxon
            metadata = colData(tse) |> as.data.frame(), # Extract the metadata
            formula = ~ patient_status + total # Model formula
        )
    ) |>
    bind_rows(.id = "taxon") |> # Bind the results of all taxa
    filter(variable != "total") |> # Remove the variable total
    mutate(
        q_value = p.adjust(p_value, method = "BH")
    ) # Calculate q-values
# (FDR adjusted p-values)

# The results for taxa with the smallest p-values
res_firth_basic |>
    arrange(p_value) |>
    head() |>
    kable()

taxon	estimate	p_value	variable	q_value
Erysipelatoclostridium	-2.595419	0.0015120	patient_statusADHD	0.0740877
[Ruminococcus]_gauvreauii_group	-3.252377	0.0032377	patient_statusADHD	0.0793226
uncultured_2	-6.429475	0.0088418	patient_statusADHD	0.1047792
Ruminococcaceae_UCG-014	3.021242	0.0089423	patient_statusADHD	0.1047792
Barnesiella	2.021508	0.0113858	patient_statusADHD	0.1047792
uncultured_1	-2.894353	0.0128301	patient_statusADHD	0.1047792

In the output

variable indicates the predictor variable. As patient_status is a categorical variable, it also tells the group used as the non-reference level (here ADHD as Control was set as the reference level).
estimate is the familiar log(odds ratio). A positive value indicates the taxon being more prevalent in the non-reference group (here ADHD).

C.0.2.2 More complex analyses

Analysis with covariates and analysis of interaction terms can be performed in by just using different formula in the run_firth function, e.g. formula = patient_status + cohort + reads or formula = patient_status * cohort + reads.