Authors: Tuomas Borman, Pande Erawijantari, Giulio Benedetti, Stefanie Peschel, Leo Lahti.
Last modified: 16 September, 2024.

Overview

Description

This tutorial introduces the latest advances in Bioconductor tools and data structures supporting microbiome analysis. We will show how this can enhance interoperability across omics and walk through a typical microbiome data science workflow. You will learn from active developers how to access open microbiome data resources, utilize optimized data structures, assess community diversity and composition, integrate taxonomic and functional hierarchies, and visualize microbiome data. We will follow the online book “Orchestrating Microbiome Analysis with Bioconductor” and Bioconductor SummarizedExperiment framework, supporting optimized analysis and integration of hierarchical, multi-domain microbiome data. Participants are encouraged to install the latest versions of R and Bioconductor.

Pre-requisites

  • Basic knowledge of R coding
  • Understanding of microbiome research

If your time allows, we recommend to spend some time to explore beforehand Orchestrating Microbiome Analysis (OMA) online book.

Participation

We have instructor-led demonstrations where participants are expected to follow along. Questions are encouraged at any time during the workshop.

It is preferred to use your own laptop. You can find help for installation from here. There are also available pre-installed virtual machines by Galaxy/Bioconductor. (Under “EUROBIOC2024” tab, session called “Package Demo: mia”)

R / Bioconductor packages used

In this workshop, we will focus on the mia package, which is designed for microbiome data science. The mia package provides essential tools for analyzing microbiome data using the SummarizedExperiment framework.

Additionally, we will cover NetCoMi and iSEEtree in network and interactive data exploration session, respectively.

Time outline

Activity Time
Bioconductor resources for microbiome data science 30m
Analysis of community diversity and composition 30m
Microbiome data integration 30m
Coffee break 30m
Microbial network analysis 30m
Interactive microbiome data exploration with iSEEtree 30m
Recap and Q&A 30m

Workshop goals and objectives

Learning goals

  • Methods for microbiome analysis: Learn to harness SummarizedExperiment ecosystem in your project.

  • Data structure: Understand how to utilize the TreeSummarizedExperiment class for effective microbiome data analysis.

  • Access resources: Get familiar with additional tools and resources, including the OMA online book.

Learning objectives

  • Analyze and apply methods: Apply the framework to process and analyze microbiome data.

  • Visualize data: Generate and interpret common graphs for microbiome data

  • Explore documentation: Use the OMA to explore additional tools and methods.

Workshop

Introduction

In this workflow, we demonstrate how to analyze data from publicly available microbiome resources using Bioconductor methods. We will cover essential data wrangling steps, explore alpha and beta diversity metrics, and introduce the fundamentals of data integration and microbial network analysis. Additionally, we will explore how to utilize the tools through a graphical interface.

Bioconductor resources for microbiome data science

Data container

The miaverse framework is based on TreeSummarizedExperiment data container, extension to common SummarizedExperiment object.

_TreeSummarizedExperiment_ class

TreeSummarizedExperiment class

Load packages

First, we load the required packages into the session. The following script ensures that any packages not already installed are automatically installed.

# List of packages that we need
packages <- c(
    "ComplexHeatmap",
    "curatedMetagenomicData",
    "dplyr",
    "ggsignif",
    "iSEEtree",
    "mia",
    "miaViz",
    "NetCoMi",
    "scater",
    "shadowtext",
    "patchwork",
    "scater",
    "SpiecEasi",
    "tidyr"
    )

# Get packages that are already installed
packages_already_installed <- packages[ packages %in% installed.packages() ]
# Get packages that need to be installed
packages_need_to_install <- setdiff( packages, packages_already_installed )

# Install GitHub packages
if( "SpiecEasi" %in% packages_need_to_install ){
    # SpiecEasi must be installed from "latentcor" branch
    devtools::install_github("zdk123/SpiecEasi", force = TRUE, ref = "latentcor")
    # Remove from installation list
    packages_need_to_install <- packages_need_to_install[ !packages_need_to_install %in% "SpiecEasi" ]
}
if( "SPRING" %in% packages_need_to_install ){
    devtools::install_github("GraceYoon/SPRING")
    # Remove from installation list
    packages_need_to_install <- packages_need_to_install[ !packages_need_to_install %in% "SPRING" ]
}
if( "NetCoMi" %in% packages_need_to_install ){
    # NetCoMi must be installed from "TSE" branch
    devtools::install_github(
        "stefpeschel/NetCoMi", force = TRUE, ref = "TSE",
        dependencies = c("Depends", "Imports", "LinkingTo"),
        repos = c("https://cloud.r-project.org/", BiocManager::repositories()))
    # Remove from installation list
    packages_need_to_install <- packages_need_to_install[ !packages_need_to_install %in% "NetCoMi" ]
}

# Loads BiocManager into the session. Install it if it is not already installed.
if( !require("BiocManager") ){
    install.packages("BiocManager")
    library("BiocManager")
}

# If there are packages that need to be installed, installs them with BiocManager
# Updates old packages.
if( length(packages_need_to_install) > 0 ) {
    install(version = "devel")
    install(packages_need_to_install, ask = FALSE, update = FALSE)
}

# Load all packages into session. Stop if there are packages that were not
# successfully loaded
pkgs_not_loaded <- !sapply(packages, require, character.only = TRUE)
pkgs_not_loaded <- names(pkgs_not_loaded)[ pkgs_not_loaded ]
if( length(pkgs_not_loaded) > 0 ){
    stop(
        "Error in loading the following packages into the session: '",
        paste0(pkgs_not_loaded, collapse = "', '"), "'")
}

If you encountered an error, install packages manually.

Importing data

There are several openly available datasets listed in OMA. In this workflow, we will retrieve data from the curatedMetagenomicData resource. We’ll focus on a dataset that includes both colorectal cancer (CRC) patients and control subjects. Our goal is to investigate whether CRC is associated with alterations in the gut microbiota.

Let’s fetch the data from the database.

# Study from which we want to fetch data
study <- "GuptaA_2019"
# Experiments that we want to fetch
experiments <- c("pathway_abundance", "relative_abundance") |>
    paste0(collapse = "|")
search_pattern <- paste0(study, ".(", experiments, ")")

# Fetch data
data_list <- curatedMetagenomicData(
    search_pattern, dryrun = FALSE, counts = TRUE, rownames = "short")
#> $`2021-03-31.GuptaA_2019.relative_abundance`
#> dropping rows without rowTree matches:
#>   k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
#>   k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Enorma|s__[Collinsella]_massiliensis
#>   k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
#>   k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra

# Add new names
names(data_list) <- c("pathway", "taxonomy")

# The assay name for taxonomy profile is incorrect since we have counts, not
# relative abundances
tse <- data_list[["taxonomy"]]
assayNames(tse) <- "counts"
data_list[["taxonomy"]] <- tse

# Rename also pathway abundance. They are in relative scale
tse <- data_list[["pathway"]]
assayNames(tse) <- "relabundance"
data_list[["pathway"]] <- tse

# For pathway data, extract pathway information from rownames and add them to
# rowData
tse <- data_list[["pathway"]]
rowData(tse)[["full_name"]] <- rownames(tse)
rowData(tse)[["pathway"]] <- sub("\\|.*", "", rowData(tse)[["full_name"]])
data_list[["pathway"]] <- tse

# Get taxonomy object
tse <- data_list[["taxonomy"]]
tse
#> class: TreeSummarizedExperiment 
#> dim: 304 60 
#> metadata(1): agglomerated_by_rank
#> assays(1): counts
#> rownames(304): [Bacteroides] pectinophilus [Clostridium] hylemonae ...
#>   Victivallis vadensis Weissella confusa
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(27): study_name subject_id ... disease_stage
#>   disease_location
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (304 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL

The output is a TreeSummarizedExperiment (TreeSE) object. For details on how to handle and access the data, refer to data containers chapter in OMA.

Preprocess

A common step in microbiome workflows is agglomeration, where we summarize data at specific taxonomy ranks. Additionally, preprocessing involves transforming the data to address the unique characteristics of microbiome profiling data, for instance. Details on agglomeration and transformation are covered in agglomeration and transformation chapters in OMA, respectively.

# Agglomerate data
tse_family <- agglomerateByRank(tse, rank = "family")
#> Warning: The following values are already present in `metadata` and will be
#> overwritten: 'agglomerated_by_rank'. Consider using the 'name' argument to
#> specify alternative names.

# Transform the main TreeSE
tse <- transformAssay(tse, method = "relabundance")
tse <- transformAssay(tse, method = "clr", pseudocount = 1)
# Transform agglomerated data
tse_family <- transformAssay(tse_family, method = "relabundance")
tse_family <- transformAssay(tse_family, method = "clr", pseudocount = 1)

Analysis of community diversity and composition

Now that the data has been pre-processed, we could start to analyze the overview of the microbiome profile. For the dataset above, we can continue to perform analysis, for example to answer these questions:

  1. How does the microbial community differ between the CRC (n=30) vs healthy group (n=30)?

  2. How does the beta diversity compare between the CRC (n=30) vs healthy group (n=30)?

  3. What species shows significantly different relative abundance between 2 groups?

Let’s start with analyzing the differences on microbial community diversity between CRC and healthy control group. Community diversity is a central concept in microbiome research, and in the human microbiome study it has been shown that the microbial diversity associated with the host health status. Several diversity indices are available in the ecological literature. You can explore further in the OMA book about what analysis in-term of community diversity can be analyzed using mia package.

For simplicity, let’s calculate the microbial richness and alpha diversity using the Shannon index (other calculation using different index also available, and you can explore further)

tse <- addAlpha(tse, index = c("observed", "shannon"))

# to see the observed and shannon calculation
head(tse$observed)
#> [1]  55  29 148  68  63  51
head(tse$shannon)
#> [1] 2.755092 1.044521 3.305171 2.744079 2.570301 2.937390

To see the differences between healthy and CRC group, let’s visualize the species richness and shannon diversity using violin plot (you can also experiment with boxplot). Additionally, let’s also calculate the statistical significance and plot it together using ggsignif with the wilcoxon test. Please notes that the package does not support p-value adjustment (such as with “fdr”). For more complex test, please see example here on how you can calculate p-value with adjustment using stat_compare_means function from the ggpubr package, and add visually appealing p-values to the plots or if you want to perform more than two groups comparison.

The information about diseases state is available from the colData of tse object, in the “diseases” column.

# Changes old levels with new levels
tse$disease <- factor(tse$disease)
# Access the levels of the factor
disease_levels <- levels(tse$disease)

# Compute all pairwise combinations of these levels
comb <- combn(disease_levels, 2, simplify = FALSE)

# Observed richness plot
observed_plot <- plotColData(
    tse, 
    y = "observed", 
    x = "disease",
    colour_by = "disease"
    ) + 
  geom_signif(comparisons = comb, map_signif_level = FALSE) +
  theme_bw() + 
  theme(text = element_text(size = 8))

# Shannon diversity plot
shannon_plot <- plotColData(
    tse, 
    y = "shannon", 
    x = "disease",
    colour_by = "disease"
    ) +
  geom_signif(comparisons = comb, map_signif_level = FALSE) +
  theme_bw() + 
  theme(text = element_text(size = 8))

# Combine the two plots into a grid
combined_plot <- observed_plot + shannon_plot + plot_layout(guides = "collect")

# Display the combined plot
print(combined_plot)
#> Warning in wilcox.test.default(c(55, 29, 148, 68, 63, 51, 37, 44, 45, 61, :
#> cannot compute exact p-value with ties

Let’s pause and think

  • Can you also do the diversity analysis using different indexes?

  • What can you inffer by these observations?

Let’s continue observing the beta-diversity between 2 groups. Beta diversity quantifies the dissimilarity between communities (multiple samples), as opposed to alpha diversity, which focuses on variation within a community (one sample). This OMA book section explained various indexes that commonly used in the microbial ecology context.

For this tutorial, let’s try to calculate the bray-curtis distance and plot the PCoA.

# Perform PCoA
tse <- runMDS(
    tse,
    FUN = getDissimilarity,
    method = "bray",
    assay.type = "relabundance",
    name = "MDS_bray"
    )

# Create ggplot object
p <- plotReducedDim(tse, "MDS_bray", colour_by = "disease")

# Calculate explained variance
e <- attr(reducedDim(tse, "MDS_bray"), "eig")
rel_eig <- e / sum(e[e > 0])

# Add explained variance for each axis
p1 <- p + labs(
    x = paste("PCoA 1 (", round(100 * rel_eig[[1]], 1), "%", ")", sep = ""),
    y = paste("PCoA 2 (", round(100 * rel_eig[[2]], 1), "%", ")", sep = "")
    )
p1

Sometimes, we also want to create ellipse to guide our visualization whether the microbial community in two groups is different. You can achieve it using the code below.

p_ellipse <- p1 + stat_ellipse(aes(color = colour_by), level = 0.95)
p_ellipse

Additionally, we may want to also perform PERMANOVA analysis to further estimate whether the microbial community is statistically different. Please explore the comprehensive example in the OMA book.

Let’s pause and think

  • Above is the example for unsupervised ordination, can you give example of it supervised counterpart?

  • Can you distinguish between both method and when to use it?

OMA book also provide further exploration to explore the microbial community structure, for example based on community typing.

Another primary analysis that usually can be done in microbiome study is differential abundance analysis (DAA) for example to answer the research question 3. However, due to limited time slots, we will not demonstrate in this tutorial. You can explore various example in OMA book for DAA here

Microbiome data integration

Data handling

As demonstrated, CRC patients appear to have altered microbiota. However, this observation only indicates an association and does not imply causality or reveal underlying mechanisms. To explore complex mechanisms, we often need more comprehensive data. Efficiently handling data from multiple experiments requires integrating these datasets into a unified data container. Bioconductor offers two approaches for this:

  • Using the altExp slot within the TreeSE object, which is a straightforward method.
  • Employing MultiAssayExperiment, a more general and flexible approach, though it comes with a more complex data structure compared to TreeSE.

For detailed information on both approaches, refer to OMA. In this tutorial, we will focus on the simpler method using the altExp slot.

The data fetched at the beginning of this workflow is a list containing multiple experiments, each represented as a unique (Tree)SE object. Managing this list and the connections between samples can be error-prone and cumbersome. To simplify this, we can use the altExp slot to handle the bookkeeping. Let’s start by checking which experiments are available:

names(data_list)
#> [1] "pathway"  "taxonomy"

As we see, we have metabolic pathway predictions in addition to taxonomic profiling. We can add the entire list of experiments to the altExp slot.

altExps(tse) <- data_list
tse
#> class: TreeSummarizedExperiment 
#> dim: 304 60 
#> metadata(1): agglomerated_by_rank
#> assays(3): counts relabundance clr
#> rownames(304): [Bacteroides] pectinophilus [Clostridium] hylemonae ...
#>   Victivallis vadensis Weissella confusa
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(29): study_name subject_id ... observed shannon
#> reducedDimNames(1): MDS_bray
#> mainExpName: NULL
#> altExpNames(2): pathway taxonomy
#> rowLinks: a LinkDataFrame (304 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL

With the altExp slot now populated, you can access the data using the altExp() function and specify which experiment you want to retrieve.

altExp(tse, "pathway")
#> class: SummarizedExperiment 
#> dim: 12535 60 
#> metadata(0):
#> assays(1): relabundance
#> rownames(12535): UNMAPPED UNINTEGRATED ... PWY-7197: pyrimidine
#>   deoxyribonucleotide
#>   phosphorylation|g__Desulfovibrio.s__Desulfovibrio_piger PWY-6703:
#>   preQ0 biosynthesis|g__Butyricimonas.s__Butyricimonas_virosa
#> rowData names(2): full_name pathway
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(27): study_name subject_id ... disease_stage
#>   disease_location

As mentioned earlier, the advantage of the altExp slot is that it handles sample linkages for us. For example, if we select a subset of data, the object automatically subsets all associated experiments accordingly.

For example, let’s select only the samples from senior patients.

tse_sub <- tse[ , tse$age_category == "senior"]
tse_sub
#> class: TreeSummarizedExperiment 
#> dim: 304 8 
#> metadata(1): agglomerated_by_rank
#> assays(3): counts relabundance clr
#> rownames(304): [Bacteroides] pectinophilus [Clostridium] hylemonae ...
#>   Victivallis vadensis Weissella confusa
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(8): GupDM_A4 GupDM_A5 ... GupDM_JO GupDM_JP
#> colData names(29): study_name subject_id ... observed shannon
#> reducedDimNames(1): MDS_bray
#> mainExpName: NULL
#> altExpNames(2): pathway taxonomy
#> rowLinks: a LinkDataFrame (304 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL

# Check that all sample names still match
all( colnames(altExp(tse_sub, "pathway")) == colnames(tse_sub) )
#> [1] TRUE

Now, we can see that the data has been successfully subsetted, with all sample names matching across experiments.

In addition to adding the entire list, we can also add individual experiments. Let’s add the agglomerated data separately to keep the whole dataset organized.

altExp(tse, "family") <- tse_family
tse
#> class: TreeSummarizedExperiment 
#> dim: 304 60 
#> metadata(1): agglomerated_by_rank
#> assays(3): counts relabundance clr
#> rownames(304): [Bacteroides] pectinophilus [Clostridium] hylemonae ...
#>   Victivallis vadensis Weissella confusa
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(29): study_name subject_id ... observed shannon
#> reducedDimNames(1): MDS_bray
#> mainExpName: NULL
#> altExpNames(3): pathway taxonomy family
#> rowLinks: a LinkDataFrame (304 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL

Cross-association analysis

Now that the entire dataset is well-organized and in one place, let’s dive into analyzing the interconnections between experiments. Specifically, we’ll examine whether the abundance of certain bacteria is associated with specific metabolic activity pathway.

We’ll perform a simple cross-association analysis to explore these relationships. First, we’ll preprocess the functional annotation data by agglomerating the pathways based on each pathway’s function, and then further by prevalence.

# Agglomerate based on pathways function
altExp(tse, "pathway_func") <- agglomerateByVariable(
    altExp(tse, "pathway"),
    by = "rows",
    f = "pathway"
    )
# Agglomerate based on prevalence
altExp(tse, "pathway_prev") <- agglomerateByPrevalence(
    altExp(tse, "pathway_func"),
    assay.type = "relabundance",
    detection = 0.001,
    prevalence = 0.2
    )

The data now includes the most prevalent pathways, each representing a single metabolic activity. Next, we apply a transformation as the final preprocessing step for our functional data.

altExp(tse, "pathway_prev") <- transformAssay(
    altExp(tse, "pathway_prev"),
    assay.type = "relabundance",
    method = "clr"
    )
altExp(tse, "pathway_prev")
#> class: SummarizedExperiment 
#> dim: 24 60 
#> metadata(0):
#> assays(2): relabundance clr
#> rownames(24): BRANCHED-CHAIN-AA-SYN-PWY: superpathway of branched amino
#>   acid biosynthesis DTDPRHAMSYN-PWY: dTDP-L-rhamnose biosynthesis I ...
#>   VALSYN-PWY: L-valine biosynthesis Other
#> rowData names(2): full_name pathway
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(27): study_name subject_id ... disease_stage
#>   disease_location

We’re now ready to perform the cross-association analysis. We will examine the associations between microbial families and the most abundant pathways. To visualize the results, we’ll use a heatmap.

# Perform cross-associaton
res <- getCrossAssociation(
    tse,
    altexp1 = "family",
    altexp2 = "pathway_prev",
    assay.type1 = "clr",
    assay.type2 = "clr",
    mode = "matrix",
    test.signif = TRUE
    )
#> Calculating correlations...
#> altexp1: family, altexp2: pathway_prev, assay.type1: clr, col.var1: -, assay.type2: clr, col.var2: -
#> by: 1, function: stats::cor.test, method: kendall, test.signif: TRUE, p.adj.method: fdr, paired: FALSE, show.warnings: TRUE
#> Converting table into matrices...

# Function for marking significant correlations with "X"
add_signif <- function(j, i, x, y, width, height, fill) {
    # If the p-value is under threshold
    if( !is.na(res$p_adj[i, j]) & res$p_adj[i, j] < 0.05 ){
        # Print "X"
        grid.shadowtext(
            sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "#f5f5f5"))
    }
}

# Create a heatmap
p <- Heatmap(
    # Correlation values
    res$cor,
    # Print values to cells
    cell_fun = add_signif,
    # Adjust legend
    heatmap_legend_param = list(
        title = "correlation", legend_height = unit(5, "cm")),
    # Adjust column names
    column_names_rot = 45,
    column_names_max_height = unit(20, "cm")
)
p

For more information on data integration, see the following sections:

Microbial network analysis

After having analyzed associations between microbial families and pathways in the last section, we are now interested in the microbial associations itself, which helps us to understand the complex interplay of microbial communities.

In a typical microbiome network, nodes are ASVs or taxa on a certain rank (here family). The nodes are connected by edges, which represent microbial associations. Higher edge thickness corresponds to higher edge weight. Edges with a positive underlying association are blue, those with a negative association are red.

An exemplary microbiome network.

An exemplary microbiome network.

From read counts to networks

An essential step in microbial network learning is to obtain an adjacency matrix representing the network from a given read count matrix:

Association estimation methods need to take the typical characteristics of microbiome data into account. Here is a brief overview of the typical data characteristics and common approaches to solve them:

Compositionality

  • Data transformation (e.g., clr or rclr)
  • See the OMA chapter on data transformations

Zero-inflation

  • Adding small pseudo counts (to the zeros only!)
  • More sophisticated zero imputation methods, e.g., the R package zCompositions

High dimensionality (p>>n)

Sparse association estimation (sparsity assumption is needed due to the underdetermined mathematical system). Depending on the desired type of association, there’s a wide range of existing methods and R packages:

  • Correlation:
    • SparCC (“Sparse Correlations for Compositional data”)
    • CCLasso (“Correlation Inference for Compositional Data Through Lasso”)
    • CCREPE (“Compositionality Corrected by REnormalization and PErmutation”)
  • Conditional dependence / partial correlation:
    • SPIEC-EASI (“SParse InversE Covariance Estimation for Ecological ASsociation Inference”)
    • SPRING (“Semi-Parametric Rank-based approach for INference in Graphical model”)
    • gCoda
  • Proportionality

These methods are also available through the R package NetCoMi, which we will use to construct and analyze an association network for the gut microbiome.

The complete workflow implemented in NetCoMi is shown in the OMA chapter on network learning.

Network learning

The function for network construction netConstruct() offers the possibility to filter samples and taxa according to certain criteria. Here, we keep only taxa occurring in at least 10% of the samples.

We use the SPIEC-EASI method to estimate microbial associations. The approach includes two graphical model inference procedures:

  • Meinhausen and Bühlmann’s neighborhood selection (the MB method)
  • Inverse covariance selection (the glasso method).

We will use glasso here.

SPIEC-EASI uses the StARS (“Stability Approach to Regularization Selection”) method to obtain a sparse association matrix. We set the StARS threshold to 0.1 to get a denser graph (the default is 0.05).

Since zero replacement, normalization, and sparsification (model selection) is included in SPIEC-EASI, we set the corresponding netConstruct arguments to “none”.

net <- netConstruct(
    tse_family,
    taxRank = "family",
    filtTax = "numbSamp",
    filtTaxPar = list(numbSamp = 0.1),
    measure = "spieceasi",
    measurePar = list(method = "glasso", 
                      # nlambda and rep.num should be higher (30 to 50)
                      nlambda=20, 
                      pulsar.params=list(rep.num=10,
                                         thresh = 0.1), # defines sparsity
                      norm.params = list(method = "mclr", 
                                         types = "count")),
    zeroMethod = "none",
    normMethod = "none",
    sparsMethod = "none",
    dissFunc = "signed",
    seed = 16092024)

netConstruct() returns an object of the class microNet, which contains all matrices generated during network construction.

The object also contains an edge list, giving each edge’s estimated association, dissimilarity, and adjacency. Let’s take a quick look at the edges with the highest and lowest edge weights:

edgelist <- net$edgelist1[order(net$edgelist1$adja, 
                                decreasing = TRUE), ]
head(edgelist)
#>                  v1               v2      asso      diss      adja
#> 45 Odoribacteraceae    Rikenellaceae 0.3771518 0.5580538 0.4419462
#> 10  Akkermansiaceae Odoribacteraceae 0.3090037 0.5877909 0.4122091
#> 53    Rikenellaceae   Tannerellaceae 0.2994215 0.5918524 0.4081476
#> 46 Odoribacteraceae   Tannerellaceae 0.2967157 0.5929942 0.4070058
#> 17   Bacteroidaceae   Tannerellaceae 0.2913105 0.5952686 0.4047314
#> 16   Bacteroidaceae    Rikenellaceae 0.2838756 0.5983830 0.4016170
tail(edgelist)
#>                  v1                  v2       asso      diss      adja
#> 54 Selenomonadaceae      Tannerellaceae -0.1088053 0.7445822 0.2554178
#> 23   Clostridiaceae    Lactobacillaceae -0.1088343 0.7445920 0.2554080
#> 40   Eubacteriaceae    Lactobacillaceae -0.1116851 0.7455485 0.2544515
#> 27   Clostridiaceae    Selenomonadaceae -0.1404570 0.7551348 0.2448652
#> 12  Akkermansiaceae    Selenomonadaceae -0.1423893 0.7557742 0.2442258
#> 41 Lactobacillaceae Methanobacteriaceae -0.2651769 0.7953543 0.2046457

Network analysis

The microNet object is now passed to netAnalyze() to perform network analysis.

The function computes several common network characteristics such as centrality measures, hubs, cluster assignment, and global network measures.

The user has several options to choose from, such as a clustering method, how to define hubs, and whether or not to normalize centrality values. See the help page ?netAnalyze for a description of the arguments.

netprops <- netAnalyze(net,
                       clustMethod = "cluster_fast_greedy",
                       hubPar = "eigenvector",
                       normDeg = FALSE,
                       gcmHeat = FALSE)

We use NetCoMi's plot function to visualize the network.

In the first plot, node colors represent the detected clusters and node sizes are scaled by eigenvector centrality. Hub nodes are highlighted by default. Unconnected nodes are not included in the plot.

Note that nodes are sometimes placed too close together so that the labels overlap. You may need to play around with the repulsion argument until you find a value where the labels are legible, but also the clusters are still well recognizable.

# Palette with distinguishable colors
nodecols <- c("#E69F00", "#56B4E9", "#009E73",  "#CC79A7", "#F0E442", 
              "#0072B2", "#D55E00")

plot(netprops,
     repulsion = 1,
     rmSingles = TRUE,
     shortenLabels = "none",
     labelScale = FALSE,
     nodeSize = "eigenvector",
     nodeSizeSpread = 3,
     nodeColor = "cluster",
     colorVec = nodecols,
     nodeTransp = 20,
     posCol = c("#5C7aee", "#273cc2"),
     hubBorderCol = "gray40",
     cexNodes = 1.8,
     edgeTranspHigh = 20,
     title1 = "Network properties highlighted",
     showTitle = TRUE,
     cexTitle = 2,
     mar = c(1, 3, 4, 8))

legend(0.9, 1.1, cex = 1.7, title = "estimated correlation:",
       legend = c("+","-"), lty = 1, lwd = 3, col = c("#273cc2","red"),
       bty = "n", horiz = TRUE)

par(xpd = TRUE)
legend(0.7, -0.7, cex = 1.5, bty = "n",
       legend = c("Node color: Clusters", 
                  "Node size: Eigenvector centrality",
                  "Hubs: Highest eigen. centrality"))

We now color nodes according to their phylum. The node sizes are proportional to a taxon’s sum of mclr-transformed abundances.

# Generate vector with phylum names for node coloring
phyla <- rowData(tse_family)$phylum

# Replace phyla that occur only once with "other"
phylum_counts <- table(phyla)
phyla <- ifelse(phyla %in% names(phylum_counts[phylum_counts == 1]) | 
                  is.na(phyla), "Other", phyla)

phyla <- as.factor(phyla)

# Make level "Other" the last one
levels(phyla) <- c(setdiff(levels(phyla), "Other"), "Other")

# Assign family names
names(phyla) <- rowData(tse_family)$family

# Node colors (phylum "Other" should be gray)
colors_phylum <- c(nodecols[1:(nlevels(phyla) - 1)], "#999999")

plot(netprops,
     repulsion = 1,
     rmSingles = TRUE,
     shortenLabels = "none",
     labelScale = FALSE,
     nodeSize = "mclr",
     nodeColor = "feature",
     featVecCol = phyla,
     colorVec = colors_phylum,
     nodeTransp = 20,
     posCol = c("#5C7aee", "#273cc2"),
     highlightHubs = FALSE,
     cexNodes = 1.8,
     edgeWidth = 0.8,
     edgeTranspHigh = 20,
     title1 = "Data features highlighted",
     showTitle = TRUE,
     cexTitle = 2,
     mar = c(1, 10, 4, 6))

# Add legends
legend(0.7, 1.1, cex = 1.7, title = "estimated correlation:",
       legend = c("+","-"), lty = 1, lwd = 3, col = c("#273cc2","red"),
       bty = "n", horiz = TRUE)

# Colors used in the legend should be equally transparent as in the plot
col_transp <- colToTransp(colors_phylum, 20)

legend(-1.8, 1.1, cex = 1.7, pt.cex = 2.5, title = "Phylum:",
       legend=levels(phyla), col = col_transp, bty = "n", pch = 16)

par(xpd = TRUE)
legend(0.5, -0.8, cex = 1.5, bty = "n",
       legend = c("Node color: Phylum", 
                  "Node size: Sum of mclr abundances"))

In addition to the visual analysis, we can also have a look at the computed network properties.

summary(netprops, numbNodes = 5)
#> 
#> Component sizes
#> ```````````````          
#> size: 28 1
#>    #:  1 8
#> ______________________________
#> Global network properties
#> `````````````````````````
#> Largest connected component (LCC):
#>                                  
#> Relative LCC size         0.77778
#> Clustering coefficient    0.27864
#> Modularity                0.36060
#> Positive edge percentage 77.77778
#> Edge density              0.14286
#> Natural connectivity      0.05141
#> Vertex connectivity       1.00000
#> Edge connectivity         1.00000
#> Average dissimilarity*    0.95366
#> Average path length**     1.97776
#> 
#> Whole network:
#>                                  
#> Number of components      9.00000
#> Clustering coefficient    0.27864
#> Modularity                0.36060
#> Positive edge percentage 77.77778
#> Edge density              0.08571
#> Natural connectivity      0.03746
#> -----
#> *: Dissimilarity = 1 - edge weight
#> **: Path length = Units with average dissimilarity
#> 
#> ______________________________
#> Clusters
#> - In the whole network
#> - Algorithm: cluster_fast_greedy
#> ```````````````````````````````` 
#>                
#> name: 0 1 2 3 4
#>    #: 8 9 7 8 4
#> 
#> ______________________________
#> Hubs
#> - In alphabetical/numerical order
#> - Based on empirical quantiles of centralities
#> ```````````````````````````````````````````````                 
#>  Akkermansiaceae 
#>  Odoribacteraceae
#> 
#> ______________________________
#> Centrality measures
#> - In decreasing order
#> - Centrality of disconnected components is zero
#> ````````````````````````````````````````````````
#> Degree (unnormalized):
#>                   
#> Akkermansiaceae  9
#> Odoribacteraceae 9
#> Clostridiaceae   8
#> Lactobacillaceae 8
#> Rikenellaceae    8
#> 
#> Betweenness centrality (normalized):
#>                         
#> Lactobacillaceae 0.27066
#> Tannerellaceae   0.18234
#> Odoribacteraceae 0.15954
#> Akkermansiaceae  0.14245
#> Eubacteriaceae   0.13960
#> 
#> Closeness centrality (normalized):
#>                         
#> Akkermansiaceae  0.89177
#> Odoribacteraceae 0.88758
#> Rikenellaceae    0.84766
#> Clostridiaceae   0.83774
#> Lactobacillaceae 0.82564
#> 
#> Eigenvector centrality (normalized):
#>                         
#> Odoribacteraceae 1.00000
#> Akkermansiaceae  0.91584
#> Rikenellaceae    0.78864
#> Tannerellaceae   0.76453
#> Clostridiaceae   0.72856

The network consists of one large connected component (lcc) with 28 nodes and 8 unconnected nodes.

Akkermansiaceae and Odoribacteraceae are identified as hub nodes (highly connected families), which are nodes with an eigenvector centrality above the 95% quantile of all computed eigenvector centrality values.

Network comparison

As next step, we might be interested in whether the microbial network differs between colorectal cancer (CRC) patients and control subjects.

This task could be performed with the NetCoMi function netCompare, as described in the OMA chapter on network comparison.

Interactive microbiome data exploration with iSEEtree

iSEEtree is a shiny app that provides a Graphical User Interface (GUI) to visualize and interactively explore TreeSE objects. We will demonstrate its capability by reproducing some of the plots that were previously created in this workshop, this time without the need of coding (small lie).

As we did before for tse, we add alpha and beta diversity metrics also to the agglomerated experiment tse_family, so we will be able to visualize these metrics in the app.

# Add information on alpha diversity
tse_family <- addAlpha(tse_family,
                       index = c("coverage_diversity", "shannon_diversity"))

# Add information on beta diversity
tse_family <- runMDS(tse_family,
                     FUN = getDissimilarity,
                     method = "bray",
                     assay.type = "relabundance",
                     ncomp = 10,
                     name = "MDS_bray")

Launching the app is easy. All you need to do is run the command iSEE on the TreeSE object that you want to explore.

# Launch app
if (interactive()) {
  iSEE(tse_family)
}

The default panel layout does not contain the so-called ColumnDataTable panel, which plots variables from the colData. Therefore, we need to specify a custom panel layout with the initial argument to generate the alpha diversity plot.

# Launch app with custom layout
if (interactive()) {
  iSEE(tse_family, initial = c(ColumnDataTable(), ColumnDataPlot()))
}

If interested, you can explore further functionality and the panel catalogue in the iSEEtree vignettes.

Ideas for self-learning:

  • Learn what different panels show by reading their help page
  • Adjust a plot to deliver the message you want
  • Use selection parameters to show a subset of the data
  • Save a plot to image and/or access its R code

Session info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    grid      stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] tidyr_1.3.1                     patchwork_1.2.0                
#>  [3] shadowtext_0.1.4                scater_1.33.4                  
#>  [5] scuttle_1.15.4                  NetCoMi_1.1.0                  
#>  [7] SpiecEasi_1.1.3                 miaViz_1.13.10                 
#>  [9] ggraph_2.2.1                    ggplot2_3.5.1                  
#> [11] mia_1.13.38                     MultiAssayExperiment_1.31.5    
#> [13] iSEEtree_0.99.6                 iSEE_2.17.4                    
#> [15] ggsignif_0.6.4                  dplyr_1.1.4                    
#> [17] curatedMetagenomicData_3.13.0   TreeSummarizedExperiment_2.13.0
#> [19] Biostrings_2.73.1               XVector_0.45.0                 
#> [21] SingleCellExperiment_1.27.2     SummarizedExperiment_1.35.1    
#> [23] Biobase_2.65.1                  GenomicRanges_1.57.1           
#> [25] GenomeInfoDb_1.41.1             IRanges_2.39.2                 
#> [27] S4Vectors_0.43.2                BiocGenerics_0.51.1            
#> [29] MatrixGenerics_1.17.0           matrixStats_1.4.1              
#> [31] ComplexHeatmap_2.21.0           BiocManager_1.30.25            
#> 
#> loaded via a namespace (and not attached):
#>   [1] nnet_7.3-19                 DT_0.33                    
#>   [3] vctrs_0.6.5                 digest_0.6.37              
#>   [5] png_0.1-8                   corpcor_1.6.10             
#>   [7] shape_1.4.6.1               pcaPP_2.0-5                
#>   [9] rbiom_1.0.3                 ggrepel_0.9.6              
#>  [11] corrplot_0.94               permute_0.9-7              
#>  [13] mediation_4.5.0             SPRING_1.0.4               
#>  [15] MASS_7.3-61                 pkgdown_2.1.0              
#>  [17] reshape2_1.4.4              httpuv_1.6.15              
#>  [19] foreach_1.5.2               withr_3.0.1                
#>  [21] psych_2.4.6.26              xfun_0.47                  
#>  [23] ggfun_0.1.6                 survival_3.7-0             
#>  [25] memoise_2.0.1               ggbeeswarm_0.7.2           
#>  [27] systemfonts_1.1.0           gtools_3.9.5               
#>  [29] ragg_1.3.3                  tidytree_0.4.6             
#>  [31] zoo_1.8-12                  GlobalOptions_0.1.2        
#>  [33] pbapply_1.7-2               Formula_1.2-5              
#>  [35] KEGGREST_1.45.1             promises_1.3.0             
#>  [37] httr_1.4.7                  rhdf5filters_1.17.0        
#>  [39] rhdf5_2.49.0                rstudioapi_0.16.0          
#>  [41] shinyAce_0.4.2              UCSC.utils_1.1.0           
#>  [43] miniUI_0.1.1.1              generics_0.1.3             
#>  [45] base64enc_0.1-3             curl_5.2.2                 
#>  [47] zlibbioc_1.51.1             ScaledMatrix_1.13.0        
#>  [49] polyclip_1.10-7             doSNOW_1.0.20              
#>  [51] quadprog_1.5-8              GenomeInfoDbData_1.2.12    
#>  [53] ExperimentHub_2.13.1        SparseArray_1.5.34         
#>  [55] xtable_1.8-4                stringr_1.5.1              
#>  [57] desc_1.4.3                  ade4_1.7-22                
#>  [59] doParallel_1.0.17           evaluate_0.24.0            
#>  [61] S4Arrays_1.5.7              BiocFileCache_2.13.0       
#>  [63] preprocessCore_1.67.0       glmnet_4.1-8               
#>  [65] pulsar_0.3.11               irlba_2.3.5.1              
#>  [67] colorspace_2.1-1            filelock_1.0.3             
#>  [69] shinyWidgets_0.8.6          magrittr_2.0.3             
#>  [71] later_1.3.2                 viridis_0.6.5              
#>  [73] ggtree_3.13.1               lattice_0.22-6             
#>  [75] DECIPHER_3.1.4              cowplot_1.1.3              
#>  [77] Hmisc_5.1-3                 pillar_1.9.0               
#>  [79] nlme_3.1-166                huge_1.3.5                 
#>  [81] iterators_1.0.14            decontam_1.25.0            
#>  [83] compiler_4.4.1              beachmat_2.21.6            
#>  [85] stringi_1.8.4               biomformat_1.33.0          
#>  [87] tokenizers_0.3.0            minqa_1.2.8                
#>  [89] plyr_1.8.9                  crayon_1.5.3               
#>  [91] abind_1.4-8                 gridGraphics_0.5-1         
#>  [93] graphlayouts_1.1.1          bit_4.0.5                  
#>  [95] rootSolve_1.8.2.4           mixedCCA_1.6.2             
#>  [97] sandwich_3.1-1              fastcluster_1.2.6          
#>  [99] codetools_0.2-20            textshaping_0.4.0          
#> [101] BiocSingular_1.21.3         bslib_0.8.0                
#> [103] slam_0.1-53                 phyloseq_1.49.0            
#> [105] GetoptLong_1.0.5            tidytext_0.4.2             
#> [107] multtest_2.61.0             mime_0.12                  
#> [109] splines_4.4.1               circlize_0.4.16            
#> [111] Rcpp_1.0.13                 dbplyr_2.5.0               
#> [113] sparseMatrixStats_1.17.2    qgraph_1.9.8               
#> [115] knitr_1.48                  blob_1.2.4                 
#> [117] utf8_1.2.4                  here_1.0.1                 
#> [119] clue_0.3-65                 BiocVersion_3.20.0         
#> [121] lme4_1.1-35.5               pbivnorm_0.6.0             
#> [123] fs_1.6.4                    checkmate_2.3.2            
#> [125] DelayedMatrixStats_1.27.3   Rdpack_2.6.1               
#> [127] ggplotify_0.1.2             lavaan_0.6-18              
#> [129] tibble_3.2.1                Matrix_1.7-0               
#> [131] statmod_1.5.0               lpSolve_5.6.21             
#> [133] tweenr_2.0.3                pkgconfig_2.0.3            
#> [135] tools_4.4.1                 cachem_1.1.0               
#> [137] rbibutils_2.2.16            RSQLite_2.3.7              
#> [139] viridisLite_0.4.2           DBI_1.2.3                  
#> [141] impute_1.79.0               fastmap_1.2.0              
#> [143] rmarkdown_2.28              scales_1.3.0               
#> [145] shinydashboard_0.7.2        AnnotationHub_3.13.3       
#> [147] sass_0.4.9                  rpart_4.1.23               
#> [149] snow_0.4-4                  farver_2.1.2               
#> [151] tidygraph_1.3.1             mgcv_1.9-1                 
#> [153] yaml_2.3.10                 VGAM_1.1-11                
#> [155] foreign_0.8-87              cli_3.6.3                  
#> [157] purrr_1.0.2                 lifecycle_1.0.4            
#> [159] mvtnorm_1.3-1               bluster_1.15.1             
#> [161] backports_1.5.0             rintrojs_0.3.4             
#> [163] BiocParallel_1.39.0         gtable_0.3.5               
#> [165] rjson_0.2.22                limma_3.61.9               
#> [167] parallel_4.4.1              ape_5.8                    
#> [169] SnowballC_0.7.1             jsonlite_1.8.8             
#> [171] colourpicker_1.3.0          bit64_4.0.5                
#> [173] glasso_1.11                 yulab.utils_0.1.7          
#> [175] vegan_2.6-8                 BiocNeighbors_1.99.0       
#> [177] RcppParallel_5.1.9          janeaustenr_1.0.0          
#> [179] jquerylib_0.1.4             highr_0.11                 
#> [181] shinyjs_2.1.0               orca_1.1-2                 
#> [183] lazyeval_0.2.2              shiny_1.9.1                
#> [185] dynamicTreeCut_1.63-1       htmltools_0.5.8.1          
#> [187] GO.db_3.19.1                rappdirs_0.3.3             
#> [189] glue_1.7.0                  rprojroot_2.0.4            
#> [191] treeio_1.29.1               mnormt_2.1.1               
#> [193] jpeg_0.1-10                 gridExtra_2.3              
#> [195] boot_1.3-31                 igraph_2.0.3               
#> [197] R6_2.5.1                    fdrtool_1.2.18             
#> [199] labeling_0.4.3              listviewer_4.0.0           
#> [201] cluster_2.1.6               Rhdf5lib_1.27.0            
#> [203] aplot_0.2.3                 nloptr_2.1.1               
#> [205] DirichletMultinomial_1.47.0 DelayedArray_0.31.11       
#> [207] tidyselect_1.2.1            vipor_0.4.7                
#> [209] WGCNA_1.72-5                htmlTable_2.4.3            
#> [211] ggforce_0.4.2               AnnotationDbi_1.67.0       
#> [213] filematrix_1.3              rsvd_1.0.5                 
#> [215] munsell_0.5.1               data.table_1.16.0          
#> [217] htmlwidgets_1.6.4           RColorBrewer_1.1-3         
#> [219] rlang_1.1.4                 Cairo_1.6-2                
#> [221] ggnewscale_0.5.0            fansi_1.0.6                
#> [223] beeswarm_0.4.0