# The function uses the orm() function from the rms package
library(rms)
run_orm <- function(abundance, metadata, formula) {
# Create the design matrix of the model
mm <- model.matrix(formula, metadata) |>
cbind.data.frame(abundance)
mm[["(Intercept)"]] <- NULL
inds <- 1:(ncol(mm) - 1)
vars <- colnames(mm)[inds]
# Fit ordinal regression model with all variables
# (Intercepts are automatically added)
fit_1 <- orm(abundance ~ ., data = mm)
score_1 <- fit_1$stats["Score"]
res <- data.frame(estimate = fit_1$coefficients[vars], p_value = NA)
# Calculate p-value based on score test for each variable
if (length(inds) == 1) {
res$p_value <- fit_1$stats["Score P"]
} else {
for (i in inds) {
fit_0 <- orm(abundance ~ ., data = mm[, -i])
score_0 <- fit_0$stats["Score"]
res$p_value[i] <- as.numeric(1 - pchisq(score_1 - score_0, df = 1))
}
}
res[["variable"]] <- rownames(res)
rownames(res) <- NULL
return(res)
}Appendix C — Differential abundance analysis
C.0.1 Differential abundance analysis with ordinal regression model
We show here how to perform DAA by analyzing relative abundances with ordinal regression model (ORM). We have chosen ORM as it makes few assumptions on the distributions of abundances while it can incorporate covariates (including interaction terms) in the analysis. As ORM performs very similarly to the familiar non-parametric Wilcoxon test in the basic two group comparison it can also be seen as an extension of the Wilcoxon test (Harrell 2001; Liu et al. 2017).
By using ORM to model the skewed and/or zero inflated distributions of microbial abundances, we can avoid the strong parametric assumptions (e.g. negative binomial distribution) made by some methods. Furthermore, we do not need to transform abundances. This good because the choice of the transformation may be somewhat arbitrary while it may affect the DAA results. Moreover, ORM has been shown to perform well in practice in DAA when considering its sensitivity and the consistency of its results (Pelto et al. 2025).
We first create a function that (for each taxon) runs ORM and returns the effect size estimate and p-value (based on score test) for each variable in the model.
We use the same Tengeler2020 dataset as in Chapter 16.
library(mia)
# Import dataset
data("Tengeler2020", package = "mia")
tse <- Tengeler2020
# Agglomerate by genus and subset by prevalence
tse <- subsetByPrevalent(tse, rank = "Genus", prevalence = 0.10)
# Transform count data and create assay of relative abundances
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance")
# Transform patient_status to a factor and set "Control" as the reference level
tse$patient_status <- factor(tse$patient_status, levels = c("Control", "ADHD"))
# We also transform cohort as a factor
tse$cohort <- factor(tse$cohort, levels = c("Cohort_1", "Cohort_2", "Cohort_3"))C.0.1.1 The basic two-group comparison
We compare the relative abundance of taxa between the people with and without ADHD (indicated by the variable patient_status).
The code below works as follows.
- We extract the assay with relative abundances from the
tseobject - The assay is transposed and transformed to a
data.frame - The function
run_ormis run for each taxon- The
.indicates here the relative abundance of each taxon - The metadata including the variable
patient_statusis extracted from thetseobject - The model formula indicates that we included only
patient_statusin our analysis
- The
- The results for all taxa are binded to as one
data.frame - Lastly, we calculate p-values adjusted from multiple testing (
q-value) by using the standard FDR correction by Benjamini and Hochberg (1995) (method = "BH").
library(purrr)
library(dplyr)
library(knitr)
res_daa_basic <- assay(tse, "relabundance") |> # Extract relative abundances
t() |> # Transpose the assay
as.data.frame() |> # Transform to a data.frame
map(
~ run_orm( # Run ORM for each taxon
.,
metadata = colData(tse) |> as.data.frame(), # Extract the metadata
formula = ~patient_status # Model formula. Now only patient_status
)
) |>
bind_rows(.id = "taxon") |> # Bind the results of all taxa
mutate(
q_value = p.adjust(p_value, method = "BH")
) # Calculate q-values
# (FDR adjusted p-values)
# The results for taxa with the smallest p-values
res_daa_basic |>
arrange(p_value) |>
head() |>
kable()| taxon | estimate | p_value | variable | q_value |
|---|---|---|---|---|
| [Ruminococcus]_gauvreauii_group | -3.469550 | 0.0000919 | patient_statusADHD | 0.0045014 |
| Erysipelatoclostridium | -2.367021 | 0.0032408 | patient_statusADHD | 0.0613379 |
| Faecalibacterium | 2.332248 | 0.0049883 | patient_statusADHD | 0.0613379 |
| [Clostridium]_innocuum_group | -2.076335 | 0.0050072 | patient_statusADHD | 0.0613379 |
| Catabacter | 2.300598 | 0.0091887 | patient_statusADHD | 0.0765760 |
| Ruminococcaceae_UCG-014 | 14.341405 | 0.0101499 | patient_statusADHD | 0.0765760 |
In the output
-
variableindicates the predictor variable. Aspatient_statusis a categorical variable, it also tells the group used as the non-reference level (hereADHDasControlwas set as the reference level). -
estimateis log(odds ratio). Most importantly, positive values indicate abundance being higher in the non-reference group (hereADHD). Otherwise, estimates can be a bit difficult to interpret for ORMs.
C.0.1.2 More complex analyses
Analysis with covariates and analysis of interaction terms can be performed in by just using different formula in the run_orm function, e.g. formula = patient_status + cohort or formula = patient_status * cohort.
C.0.2 Analysis of presence/absence with Firth logistic regression
As some versions of logistic regression perform badly in the cases where, for instance, the prevalence of a taxon is zero in one group we show here how to perform DPAA with a robust version of logistic regression introduced by Firth (1993). This version has been observed to perform well in DPAA in practice (Pelto et al. 2025).
We first create a function that runs Firth logistic regression for presence/absence and returns the effect size estimate and p-value for each variable in the model.
# The function uses the logistf() function from the logistf package
library(logistf)
run_firth <- function(abundance, metadata, formula) {
# Create the design matrix of the model
mm <- model.matrix(formula, metadata) |>
cbind.data.frame(abundance)
mm[["(Intercept)"]] <- NULL
# Fit Firth logistic regression model
fit <- logistf(abundance ~ .,
data = mm,
control = logistf.control(maxit = 1000)
)
res <- data.frame(
estimate = fit$coefficients,
p_value = fit$prob
)
res <- res[!rownames(res) %in% c("(Intercept)"), , drop = FALSE]
res[["variable"]] <- rownames(res)
rownames(res) <- NULL
return(res)
}We use the same Tengeler2020 dataset as in Chapter 16.
library(mia)
library(scater)
# Import dataset
data("Tengeler2020", package = "mia")
tse <- Tengeler2020
# Agglomerate by genus and subset by prevalence
tse <- subsetByPrevalent(tse, rank = "Genus", prevalence = 0.10)
# Transform count data and create assay of presence/absences
tse <- transformAssay(tse, assay.type = "counts", method = "pa")
# Transform patient_status to a factor and set "Control" as the reference level
tse$patient_status <- factor(tse$patient_status, levels = c("Control", "ADHD"))
# We also transform cohort as a factor
tse$cohort <- factor(tse$cohort, levels = c("Cohort_1", "Cohort_2", "Cohort_3"))
# Calculate the sequencing depth (the total number of reads/counts) for each
# sample (needed when analyzing presence/absence of taxa)
tse <- addPerCellQC(tse)C.0.2.1 The basic two-group comparison
We perform here the basic two-group comparison while controlling for sequencing depth (reads) as in Section 16.3.5. We compare the presence/absence of taxa between the people with and without ADHD.
library(purrr)
library(dplyr)
library(knitr)
res_firth_basic <- assay(tse, "pa") |> # Extract presence/absences
t() |> # Transpose the assay
as.data.frame() |> # Transform to a data.frame
map(
~ run_firth( # Run logistic regression
., # for each taxon
metadata = colData(tse) |> as.data.frame(), # Extract the metadata
formula = ~ patient_status + total # Model formula
)
) |>
bind_rows(.id = "taxon") |> # Bind the results of all taxa
filter(variable != "total") |> # Remove the variable total
mutate(
q_value = p.adjust(p_value, method = "BH")
) # Calculate q-values
# (FDR adjusted p-values)
# The results for taxa with the smallest p-values
res_firth_basic |>
arrange(p_value) |>
head() |>
kable()| taxon | estimate | p_value | variable | q_value |
|---|---|---|---|---|
| Erysipelatoclostridium | -2.595419 | 0.0015120 | patient_statusADHD | 0.0740877 |
| [Ruminococcus]_gauvreauii_group | -3.252377 | 0.0032377 | patient_statusADHD | 0.0793226 |
| uncultured_2 | -6.429475 | 0.0088418 | patient_statusADHD | 0.1047792 |
| Ruminococcaceae_UCG-014 | 3.021242 | 0.0089423 | patient_statusADHD | 0.1047792 |
| Barnesiella | 2.021508 | 0.0113858 | patient_statusADHD | 0.1047792 |
| uncultured_1 | -2.894353 | 0.0128301 | patient_statusADHD | 0.1047792 |
In the output
-
variableindicates the predictor variable. Aspatient_statusis a categorical variable, it also tells the group used as the non-reference level (hereADHDasControlwas set as the reference level). -
estimateis the familiar log(odds ratio). A positive value indicates the taxon being more prevalent in the non-reference group (hereADHD).
C.0.2.2 More complex analyses
Analysis with covariates and analysis of interaction terms can be performed in by just using different formula in the run_firth function, e.g. formula = patient_status + cohort + reads or formula = patient_status * cohort + reads.