Appendix A — Microbe Set Enrichment Analysis (MSEA)

Similar to gene set enrichment analyses for genes (Subramanian et al. 2005), an obvious next step following differential abundance analysis in microbiome studies is to conduct enrichment analysis for microbe sets, known as microbe set enrichment analysis (MSEA) (Kou et al. 2020). Similar to GSEA, the primary goal of MSEA is to detect the modest but coordinated changes in pre-specified sets of related microbial features. Such a set might include all the microbes in a specific pathway or microbial genes that have been shown to be co-regulated based on previously published studies. Like GSEA, MSEA aggregates the per-feature statistics across microbes within a microbe set. This corresponds to the hypothesis that many relevant phenotype differences are manifested by small but consistent changes in a set of features.

Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.
Kou, Yan, Xiaomin Xu, Zhengnong Zhu, Lei Dai, and Yan Tan. 2020. “Microbe-Set Enrichment Analysis Facilitates Functional Interpretation of Microbiome Profiling Data.” Scientific Reports 10 (1): 21466. https://doi.org/10.1038/s41598-020-78511-y.

The goal of the MSEA approach is to determine if the members of S (microbe set) are randomly distributed throughout the ranked list of features (L) or primarily found at the top or bottom. We will use the R package gsEasy to conduct the MSEA test described by Subramanian et al. (2005).

A.1 Input data for MSEA using species relative abundance data

In this chapter, we will use the publicly available Inflammatory Bowel Diseases (IBD) microbiome data from the integrative Human Microbiome Project (iHMP) available from the curatedMetagenomicData package (Lloyd-Price et al. 2019). We aim to conduct MSEA analysis based on both taxonomic profiles (species relative abundances) and functional profiles (pathway relative abundances).

Lloyd-Price, Jason, Cesar Arze, Ashwin N. Ananthakrishnan, Melanie Schirmer, Julian Avila-Pacheco, Tiffany W. Poon, Elizabeth Andrews, et al. 2019. “Multi-Omics of the Gut Microbial Ecosystem in Inflammatory Bowel Diseases.” Nature 569 (7758): 655–62. https://doi.org/https://doi.org/10.1038/s41586-019-1237-9.

A.2 Performing the MSEA analysis with species relative abundance data

We will first prepare the input feature table and sample metadata for differential abundance analysis using Maaslin2 (Mallick et al. 2021). The ranked feature list from the differential abundance analysis serves as an input for the MSEA.

Mallick, Himel, Ali Rahnavard, Lauren J. McIver, Siyuan Ma, Yancong Zhang, Long H. Nguyen, Timothy L. Tickle, et al. 2021. “Multivariable Association Discovery in Population-Scale Meta-Omics Studies.” PLOS Computational Biology 17 (11): e1009442. https://doi.org/https://doi.org/10.1371/journal.pcbi.1009442.
##################
# Load iHMP data #
##################

library(curatedMetagenomicData)
library(dplyr)

se_relative <- sampleMetadata |>
    filter(study_name == "HMP_2019_ibdmdb") |>
    returnSamples("relative_abundance", rownames = "short")

##########################
# Create sample metadata #
##########################

sample_metadata <-
    colData(se_relative) |>
    as.data.frame() |>
    filter(visit_number == 1) |>
    dplyr::select(c("age", "disease", "antibiotics_current_use"))

#################
# Set reference #
#################

sample_metadata$disease <- as.factor(sample_metadata$disease)
sample_metadata$disease <- relevel(sample_metadata$disease, "healthy")

###########################
# Create species features #
###########################

feature_species_t <- as.data.frame(assay(se_relative))
rownames(feature_species_t) <- sub(".*s__", "", rownames(feature_species_t))

##############################
# Subset to baseline samples #
##############################

feature_species <- as.data.frame(t(feature_species_t))
feature_species <- feature_species[rownames(sample_metadata), ]
feature_species <- feature_species / 100
rm(feature_species_t)
rm(se_relative)

In the next step, we will use Maaslin2 to fit a multivariable regression model for testing the association between microbial species abundance versus IBD diagnosis. The analysis method we use here is “LM”, which is the default setting. We also adjust for age and antibiotic usage, following the original study.

knitr::opts_chunk$set(eval = FALSE)
library(Maaslin2)

fit_data <- Maaslin2(
    input_data = feature_species,
    input_metadata = sample_metadata,
    normalization = "NONE",
    output = "output_species",
    fixed_effects = c("disease", "age", "antibiotics_current_use")
)

Unlike gene expression studies, we do not have well-defined signatures or modules for microbiome data. Here, we will construct data-driven modules using weighted gene co-expression network analysis (WGCNA) (Langfelder and Horvath 2008), (Geistlinger et al. 2023). We aim to ensure that the effect of disease and other covariates has been removed by working on the residuals. Following the WGCNA tutorial, our first step will be to check whether there are any outliers in our data.

Langfelder, Peter, and Steve Horvath. 2008. WGCNA: An r Package for Weighted Correlation Network Analysis.” BMC Bioinformatics 9: 559. https://doi.org/https://doi.org/10.1186/1471-2105-9-559.
Geistlinger, Ludwig, Chloe Mirzayi, Fatima Zohra, Rimsha Azhar, Shaimaa Elsafoury, Clare Grieve, Jennifer Wokaty, et al. 2023. BugSigDB Captures Patterns of Differential Abundance Across a Broad Range of Host-Associated Microbial Signatures.” Nature Biotechnology 42 (5): 790–802. https://doi.org/https://doi.org/10.1038/s41587-023-01872-y.
library(WGCNA)

datExpr <- as.data.frame(t(fit_data$residuals))
gsg <- goodSamplesGenes(datExpr, verbose = 3)
gsg$allOK

If the last statement returns TRUE, no outliers are identified. If not, we need to remove the outliers from the data.

if (!gsg$allOK) {
    if (sum(!gsg$goodGenes) > 0) {
        printFlush(paste(
            "Removing genes:",
            paste(names(datExpr)[!gsg$goodGenes], collapse = ", ")
        ))
    }
    if (sum(!gsg$goodSamples) > 0) {
        printFlush(paste(
            "Removing samples:",
            paste(rownames(datExpr)[!gsg$goodSamples], collapse = ", ")
        ))
    }
    datExpr <- datExpr[gsg$goodSamples, gsg$goodGenes]
}

After removing the outliers, we need to choose a suitable soft threshold parameter for creating the modules as part of the WGCNA algorithm. This power value must produce a graph similar to a scale-free network. We can use the mean connectivity graphic for the selection of this power parameter.

# Choose a set of soft threshold parameters
powers <- c(c(1:20), seq(from = 22, to = 30, by = 2))
sft <- pickSoftThreshold(
    datExpr,
    powerVector = powers, verbose = 5, dataIsExpr = TRUE,
    RsquaredCut = 0.30
)

In this step, we will conduct a one-step module detection based on the selected soft threshold parameter selected above.

power <- sft$powerEstimate
net <- blockwiseModules(
    datExpr,
    power = power,
    corFnc = "bicor",
    corOptions = list(maxPOutliers = 0.1),
    networkType = "unsigned",
    maxBlockSize = ncol(datExpr),
    minModuleSize = 3,
    TOMType = "unsigned",
    reassignThreshold = 0,
    mergeCutHeight = 0,
    verbose = 3
)

####################
# How many modules #
####################

ncol(net$MEs)
table(net$colors)

The WGCNA algorithm produced 14 modules which we can visualize as follows.

##########################
# Plot module dendrogram #
##########################

eigenGenes <- net$MEs
MEDiss <- 1 - cor(eigenGenes)
METree <- hclust(as.dist(MEDiss), method = "average")
plot(METree, main = "Clustering of module eigengenes", xlab = "", sub = "")

Next, we calculate hub genes for the modules and create the mapping files to proceed with the MSEA.

###########################################
# Re-calculate modules and find hub genes #
###########################################

moduleColors <- net$colors
MEs0 <- moduleEigengenes(datExpr, moduleColors)$eigengenes
modules_data <- orderMEs(MEs0)

#######################
# Create mapping file #
#######################

library(tidyverse)

feature_by_modules <- as.data.frame(net$colors)
feature_by_modules <- rownames_to_column(feature_by_modules)
colnames(feature_by_modules) <- c("Feature", "Module")
features_mapping <- feature_by_modules
features_mapping$Module <- paste("ME", features_mapping$Module, sep = "")

Finally, we will run the MSEA analysis on the modules we constructed using WGCNA. Here, we first create a wrapper for the MSEA analysis using the gsEasy package.

library(reshape2)
library(gsEasy)

################
# MSEA Wrapper #
################

run_MSEA <- function(microbeSet, # A list
                     ranked_features, # Ranked list of featured
                     filter.count = 3,
                     seed = 1234,
                     fdr.correction = "BH") {
    ###################
    # Filter out sets #
    ##################

    microbeSet0 <- microbeSet
    cond <- sapply(microbeSet0, function(x) length(x) > filter.count)
    microbeSet <- microbeSet0[cond]
    lengthmicrobeSet <- as.data.frame(
        reshape2::melt(lapply(microbeSet, function(x) length(x)))
    )
    colnames(lengthmicrobeSet) <- c("Freq", "Set")

    ################
    # Classic MSEA #
    ################

    set.seed(seed)
    enrichment <- as.data.frame(
        sapply(microbeSet, function(set) gset(S = set, r = ranked_features))
    )
    colnames(enrichment) <- "ES"
    enrichment <- rownames_to_column(enrichment, "Set")
    enrichment <- merge(enrichment, lengthmicrobeSet, "Set")
    enrichment$qval <- p.adjust(enrichment$ES, fdr.correction)

    ##########
    # Return #
    ##########

    return(enrichment)
}

Before running the MSEA, we also need to rank the differential analysis results from Maaslin2. We use the topGO package to create a list of microbe sets from the mapping file created above.

###################
# Rank DA results #
###################

results <- fit_data$results |> filter(metadata == "disease")
results$qval <- p.adjust(results$pval, "BH")
results <- results[order(results$qval, decreasing = FALSE), ]

###################
# MSEA Processing #
###################

library(topGO)
module_map <- features_mapping
mod.gs <- tapply(module_map$Module, module_map$Feature, as.character)
microbeSet <- inverseList(mod.gs)
microbeSet

We are now ready to run the MSEA analysis. We run \(100,000\) permutations to calculate the enrichment scores.

MSEA <- run_MSEA(microbeSet, results$feature)
MSEA <- MSEA[
    , c("Set", "Freq", "ES", setdiff(names(MSEA), c("Set", "Freq", "ES")))
]
colnames(MSEA) <- c("ID", "Size", "pval", "qval")
MSEA$ID <- paste(MSEA$ID, " (", MSEA$Size, ")", sep = "")

We can plot the enrichment scores to visualize the MSEA results.

p <- MSEA |>
    arrange(-pval) |>
    mutate(ID = factor(ID, levels = ID)) |>
    ggplot(aes(y = -log10(pval), x = ID)) +
    geom_bar(stat = "identity", fill = "cornflowerblue") +
    theme_bw() +
    coord_flip() +
    ggtitle("Statistically significant modules associated with disease") +
    xlab("") +
    ylab("MSEA enrichment score")

print(p)

Based on the MSEA results, we obtain 13 enriched modules of microbial species. We can also examine the members of the top enriched modules.

A.4 Performing the MSEA analysis with pathway relative abundance data

Next, we repeat the MSEA with the pathway relative abundance data from the iHMP project and follow the same steps as before.

##########################
# Load HMP2 pathway data #
##########################

se_pathway <- sampleMetadata |>
    filter(study_name == "HMP_2019_ibdmdb") |>
    returnSamples("pathway_abundance", rownames = "short")

##########################
# Create sample metadata #
##########################

sample_metadata <- colData(se_pathway) |>
    as.data.frame() |>
    filter(visit_number == 1) |>
    dplyr::select("age", "disease", "antibiotics_current_use")

# Set reference
sample_metadata$disease <- as.factor(sample_metadata$disease)
sample_metadata$disease <- relevel(sample_metadata$disease, "healthy")

###########################
# Create Pathway Features #
###########################

feature_pwys_t <- as.data.frame(assay(se_pathway))
feature_pwys_t <- rownames_to_column(feature_pwys_t, "ID")
feature_pwys_t <- feature_pwys_t |>
    filter(!grepl("\\|", ID)) |>
    filter(!ID %in% c("UNMAPPED", "UNINTEGRATED")) |>
    column_to_rownames("ID") |>
    as.data.frame()

##############################
# Subset to baseline samples #
##############################

feature_pwys <- as.data.frame(t(feature_pwys_t))
feature_pwys <- feature_pwys[rownames(sample_metadata), ]
feature_pwys <- feature_pwys / 100
rm(feature_pwys_t)
rm(se_pathway)

As before, we first run a Maaslin2 analysis using default settings and construct the modules using residuals from the Maaslin2 models.

fit_data <- Maaslin2(
    input_data = feature_pwys,
    input_metadata = sample_metadata,
    normalization = "NONE",
    output = "output_pwys",
    fixed_effects = c("disease", "age", "antibiotics_current_use")
)

##########################
# Extract the residuals #
##########################

datExpr <- as.data.frame(t(fit_data$residuals))

########################
# Create WGCNA modules #
########################

gsg <- goodSamplesGenes(datExpr, verbose = 3)
gsg$allOK

if (!gsg$allOK) {
    if (sum(!gsg$goodGenes) > 0) {
        printFlush(paste(
            "Removing genes:",
            paste(names(datExpr)[!gsg$goodGenes], collapse = ", ")
        ))
    }
    if (sum(!gsg$goodSamples) > 0) {
        printFlush(paste(
            "Removing samples:",
            paste(rownames(datExpr)[!gsg$goodSamples], collapse = ", ")
        ))
    }
    datExpr <- datExpr[gsg$goodSamples, gsg$goodGenes]
}

gsg <- goodSamplesGenes(datExpr, verbose = 3)
gsg$allOK # TRUE

###################################
# Choose soft threshold parameter #
###################################

powers <- c(c(1:20), seq(from = 22, to = 30, by = 2))
sft <- pickSoftThreshold(
    datExpr,
    powerVector = powers, verbose = 5, dataIsExpr = TRUE,
    RsquaredCut = 0.30
)

##############################
#  One-step module detection #
##############################

power <- sft$powerEstimate
net <- blockwiseModules(
    datExpr,
    power = power,
    corFnc = "bicor",
    corOptions = list(maxPOutliers = 0.1),
    networkType = "unsigned",
    maxBlockSize = ncol(datExpr),
    minModuleSize = 3,
    TOMType = "unsigned",
    reassignThreshold = 0,
    mergeCutHeight = 0,
    verbose = 3
)

####################
# How many modules #
####################

ncol(net$MEs)
table(net$colors)

##########################
# Plot module dendrogram #
##########################

eigenGenes <- net$MEs
MEDiss <- 1 - cor(eigenGenes)
METree <- hclust(as.dist(MEDiss), method = "average")
plot(METree, main = "Clustering of module eigengenes", xlab = "", sub = "")

###########################################
# Re-calculate modules and find hub genes #
###########################################

moduleColors <- net$colors
MEs0 <- moduleEigengenes(datExpr, moduleColors)$eigengenes
modules_data <- orderMEs(MEs0)

#######################
# Create mapping file #
#######################

feature_by_modules <- as.data.frame(net$colors)
feature_by_modules <- rownames_to_column(feature_by_modules)
colnames(feature_by_modules) <- c("Feature", "Module")
features_mapping <- feature_by_modules
features_mapping$Module <- paste("ME", features_mapping$Module, sep = "")

We perform the MSEA as before using the modules from the WGCNA analysis on the pathways.

###################
# Rank DA results #
###################

results <- fit_data$results |> filter(metadata == "disease")
results$qval <- p.adjust(results$pval, "BH")
sum(results$qval < 0.05)
results <- results[order(results$qval, decreasing = FALSE), ]

###################
# MSEA Processing #
##################

module_map <- features_mapping
mod.gs <- tapply(module_map$Module, module_map$Feature, as.character)
microbeSet <- inverseList(mod.gs)
microbeSet

############
# Run MSEA #
############

MSEA <- run_MSEA(microbeSet, results$feature)
MSEA <- MSEA[
    , c("Set", "Freq", "ES", setdiff(names(MSEA), c("Set", "Freq", "ES")))
]
colnames(MSEA) <- c("ID", "Size", "pval", "qval")
MSEA$ID <- paste(MSEA$ID, " (", MSEA$Size, ")", sep = "")

########
# Plot #
########

p <- MSEA |>
    arrange(-pval) |>
    mutate(ID = factor(ID, levels = ID)) |>
    ggplot(aes(y = -log10(pval), x = ID)) +
    geom_bar(stat = "identity", fill = "cornflowerblue") +
    theme_bw() +
    coord_flip() +
    ggtitle("Statistically significant modules associated with disease") +
    xlab("") +
    ylab("MSEA enrichment score")

p

Based on the MSEA results, we obtain 4 enriched modules of microbial pathways. We can similarly examine the members of the top enriched modules.

Back to top