18  Network learning & analysis

Learning and analyzing microbial association networks is another common exploratory data analysis approach aimed at understanding the complex interplay of microbial communities in their natural habitat. Microbial networks consist of nodes, representing microbial species or taxa, and edges expressing their associations. The mathematical representation of a network is the adjacency matrix, which has a non-zero entry for each edge in the network.

A typical workflow for estimating a microbial network involves several steps, including data preprocessing, estimating microbial associations, and transforming them into edge weights. The resulting network can be analyzed using local network properties, such as centrality measures, or global measures that describe the overall structure of the network. Network plots provide further insight into the microbial community structure and allow for exploratory analysis of microbial relationships.

In this chapter, we go through the complete workflow of constructing and analyzing a single microbial network, step by step. A potential next analysis task is to compare networks between groups, such as patients and controls, different environmental conditions, or different time points. How to compare networks between two groups is explained in Chapter 19.

18.1 Network learning

18.1.1 Typical workflow

Figure 18.1 shows the workflow for learning/constructing a microbial association network as proposed by Peschel et al. (2021). The respective steps are explained below.

Figure 18.1: The typical input is a \(p\) x \(n\) dimensional count matrix coming from a sequencing process, where \(n\) is the number of samples and \(p\) the number of features / ASVs / OTUs. Steps 1 through 6 are explained below. Each matrix resulting from steps 4, 5, and 6 plays a specific role in the final network: The adjacency matrix is used for edge colors, dissimilarity for layout, and similarity for edge weights. In weighted networks, the similarity matrix equals the adjacency matrix.
  1. Zero replacement: Since the following steps usually require non-zero entries in the read count matrix, zero counts must be replaced. A simple solution is to add a pseudocount to the data. Other possible approaches are implemented in the R package zCompositions. A minimal sketch of steps 1, 2, and 4 is given below this list.

  2. Normalization: To avoid compositional effects, the data are normalized using a compositionality aware transformation. A common approach is the centered log-ratio (clr) transformation, which moves the data from a \(p\)-dimensional simplex to Euclidean space so that standard statistical analysis methods are valid. A variance stabilizing transformation (vst) is also a suitable approach for normalizing microbial count data (Badri et al. 2020).

  3. Association estimation: This is the crucial step in network learning to obtain statistical relations between the taxa. Common association measures include correlation, conditional dependence (which we will equate to partial correlation), and proportionality. Further information on these three types of association, together with a selection of compositionality-aware approaches, can be found in Section 18.6.

  4. Sparsification: Transforming the estimated associations directly into adjacencies would lead to a dense network where all nodes are connected and only weighted network measures are meaningful. Therefore, the association matrix is usually sparsified to select edges of interest. A common sparsification approach for correlations is thresholding, where correlations with a magnitude below the threshold are set to zero. Another possibility is a statistical test (Student’s t-test or permutation test) with the null hypothesis that the correlation is equal to zero (see the sketch below this list). SpiecEasi uses the StARS stability selection approach (Liu, Roeder, and Wasserman 2010) to decide on an appropriate sparsification level of the inferred conditional dependence graph.

  5. Transformation into dissimilarity: A common next step is to simply use the absolute values of the sparsified associations as the edge weights. In this way, correlations of high magnitude (both positive and negative) will have a high edge weight. From a biological point of view, it would also make sense to assign a low edge weight to taxa that are strongly negatively associated, which would correspond to a high dissimilarity value. Here we follow Dongen and Enright (2012) to directly transform the sparse associations \(r_{ij}^*\) into dissimilarities, which can later be used for shortest path network measures. Depending on the desired handling of negative associations, one of the two proposed transformations should be chosen:

    5a: “signed”: \(d_{ij} = \sqrt{0.5(1-r^*_{ij})}\), where strongly negatively associated taxa have the largest distance and are placed further apart in the network.

    5b: “unsigned”: \(d_{ij} = \sqrt{1-{r_{ij}^*}^2}\), resulting in a small distance between strongly associated taxa (regardless of the sign).

  6. Transformation into similarity / edge weight: Finally, the dissimilarities are transformed into similarities by \(s_{ij} = 1 - d_{ij}\), which are used as edge weights. Thus, the similarity matrix is equal to the adjacency matrix in a weighted network.
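
The following minimal sketch illustrates steps 1, 2, and 4 on a small simulated count matrix. It is meant only to make these steps concrete: the pseudocount, the threshold of 0.3, and the significance level of 0.05 are arbitrary illustrative choices, and the SPRING workflow used below performs zero handling, normalization, and sparsification internally.

set.seed(1)
# Toy count matrix: 5 taxa (rows) x 20 samples (columns), including some zeros
counts_toy <- matrix(rpois(5 * 20, lambda = 3), nrow = 5)

# Step 1: zero replacement with a simple pseudocount
counts_pc <- counts_toy + 1

# Step 2: clr transformation, applied per sample (column)
clr_toy <- apply(counts_pc, 2, function(x) log(x) - mean(log(x)))

# Step 4, option 1: sparsify the correlation matrix by thresholding
cor_toy <- cor(t(clr_toy))
cor_thresh <- cor_toy
cor_thresh[abs(cor_thresh) < 0.3] <- 0

# Step 4, option 2: Student's t-test of H0: correlation = 0
n <- ncol(clr_toy)
t_stat <- cor_toy * sqrt((n - 2) / (1 - cor_toy^2))
p_val <- 2 * pt(-abs(t_stat), df = n - 2)
cor_test <- cor_toy
cor_test[p_val > 0.05] <- 0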

The main association measure used in this chapter is the SPRING (“Semi-Parametric Rank-based approach for INference in Graphical model”) method proposed by Yoon, Gaynanova, and Müller (2019). SPRING learns conditional dependency graphs for compositional data and follows the neighborhood selection method introduced by Meinshausen and Bühlmann (2006) (“MB”). We will show how to apply the method directly, as well as how to use it in conjunction with the R package NetCoMi, which is specifically designed for the construction and analysis of networks for microbiome data.

See Section 18.6 for more information on the three association types (correlation, partial correlation, and proportionality) and Section 18.7 for a comparison of networks built with each of them.

We demonstrate the workflow using the PeerJ data set (Potbhare et al. 2022). It contains skin microbial profiles of 58 subjects.

Potbhare, Renuka, Ameeta RaviKumar, Eveliina Munukka, Leo Lahti, and Richa Ashma. 2022. “Skin Microbiota Diversity Among Genetically Unrelated Individuals of Indian Origin.” PeerJ 10: e13075. https://doi.org/10.7717/peerj.13075.
library(mia)

data("peerj13075", package = "mia")
tse <- peerj13075
dim(tse)
##  [1] 674  58

18.1.2 Install packages

Three packages used in this chapter are available on GitHub only: SpiecEasi, SPRING, and NetCoMi. We recommend that you install these packages before proceeding.

if(!require(SpiecEasi)){
  devtools::install_github("zdk123/SpiecEasi")
}

if(!require(SPRING)){
  devtools::install_github("GraceYoon/SPRING")
}

if(!require(NetCoMi)){
  devtools::install_github(
      "stefpeschel/NetCoMi", force = TRUE, ref = "TSE",
       dependencies = c("Depends", "Imports", "LinkingTo"),
       repos = c("https://cloud.r-project.org/", BiocManager::repositories()))
}

18.1.3 Data preparation

Before applying the network learning methods, we perform some data preparation steps:

  • Aggregation to genus level
  • Add relative abundance assay
  • Prevalence filtering (keep genera with prevalence > 20%)
  • Add assay with log10 transformed abundances
  • Add assay with clr transformed abundances
# Agglomerate to genus level
tse <- agglomerateByRank(tse, rank = "genus")

# Add relative abundances
tse <- transformAssay(
    tse,
    assay.type = "counts",
    method = "relabundance",
    MARGIN = "cols")

# Filter by prevalence
tse <- subsetByPrevalent(
    tse,
    prevalence = 0.2,
    detection = 0,
    assay.type = "relabundance")

# Add log10-transformed abundances
tse <- transformAssay(tse, method = "log10", pseudocount = 1)

# Add clr-transformed abundances
tse <- transformAssay(tse, method = "clr", pseudocount = 1)

dim(tse)
##  [1] 147  58

18.1.4 SPRING network

As explained in Section 18.1.1, we use SPRING (“Semi-Parametric Rank-based approach for INference in Graphical model”) as the association measure. We first use the SPRING() function directly to construct a conditional dependency graph.

Neither zero replacement nor normalization (steps 1 and 2 in our workflow) is required because SPRING uses a modified clr (mclr) transformation that can handle zero counts, and the correlation estimation method itself can also deal with zeros in the data. mclr is similar to the clr transformation except that mclr considers only the non-zero values. More precisely, the geometric mean is derived from positive values only, and zero counts remain zero after the transformation. This approach is similar to the “robust clr” (rclr) transformation included in the vegan package, except that mclr applies a positive shift to all non-zero values to make them strictly positive. See Yoon, Gaynanova, and Müller (2019) for details.
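
To make the idea concrete, here is a minimal sketch of an mclr-like transformation. It is an illustration only, not SPRING’s exact implementation (SPRING applies the transformation internally, so no manual step is needed for the workflow below), and the shift constant eps is an arbitrary choice.

mclr_sketch <- function(x, eps = 0.1) {
    out <- numeric(length(x))
    pos <- x > 0
    logx <- log(x[pos])
    # clr based on the positive entries only; zero counts stay zero
    out[pos] <- logx - mean(logx)
    # shift the non-zero values so that they become strictly positive
    if (min(out[pos]) <= 0) out[pos] <- out[pos] - min(out[pos]) + eps
    out
}

# Apply sample-wise (columns of the count matrix)
mclr_mat <- apply(assay(tse, "counts"), 2, mclr_sketch)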

Yoon, Grace, Christian L Müller, and Irina Gaynanova. 2021. “Fast Computation of Latent Correlations.” Journal of Computational and Graphical Statistics 30 (4): 1249–56.

The Rmethod argument is set to “approx” to estimate the correlations using a hybrid multi-linear interpolation approach proposed by Yoon, Müller, and Gaynanova (2021). This method considerably reduces the runtime while controlling the approximation error.

SPRING uses the StARS (“Stability Approach to Regularization Selection”) method (Liu, Roeder, and Wasserman 2010) to obtain a sparse association matrix, so step 4 of our workflow is also covered. We set the StARS threshold to 0.05 to get a sparser graph.

library(SPRING)
set.seed(13075)
spring_est <- SPRING(
    t(assay(tse, "counts")),
    Rmethod = "approx",
    thresh = 0.05,
    lambdaseq = "data-specific")
# Get index of the optimal lambda selected by StARS
opt.K <- spring_est$output$stars$opt.index

# Store partial correlation matrix belonging to the optimal lambda as matrix
spring_cor <- SpiecEasi::symBeta(as.matrix(spring_est$output$est$beta[[opt.K]]))
spring_cor <- as.matrix(spring_cor)
rownames(spring_cor) <- colnames(spring_cor) <- rownames(tse)
diag(spring_cor) <- 1

As explained in Section 18.1.1, the estimated associations are sparsified and transformed into dissimilarities, and finally transformed into similarities, which are the adjacency values. We write a function for these steps, which will be reused later.

Since SPRING already includes a sparsification approach, the thresh argument is not needed here, but it will be needed in Section 18.7 for other association measures.

To be consistent with the workflow, we provide two dissimilarity transformations: “signed” and “unsigned” (see Section 18.1.1 for an explanation). These transformations were introduced by Dongen and Enright (2012). We use the “signed” transformation in our examples so that strongly negatively associated genera have low edge weights.

Dongen, Stijn van, and Anton J. Enright. 2012. “Metric distances derived from cosine similarity and Pearson and Spearman correlations.” arXiv Preprint 2: 2–6. http://arxiv.org/abs/1208.3145.

The output of the function is an igraph object, which can be plotted and analyzed using functions from the igraph package.

# Arguments:
# - assoMat: association matrix
# - thresh: associations below the threshold are set to zero
# - dissTrans: dissimilarity transformation ("signed" or "unsigned")

transform_asso <- function(assoMat, thresh = NULL, dissTrans = "signed") {
  # Sparsification
  if (!is.null(thresh)) {
      assoMat[abs(assoMat) < thresh] <- 0
  }

  # Compute dissimilarity matrix
  if (dissTrans == "signed") {
      dissMat <- sqrt(0.5 * (1 - assoMat))
  } else {
      dissMat <- sqrt(1 - assoMat^2)
  }

  # Dissimilarity between nodes with zero correlation is set to 1
  # (these nodes are unconnected and thus should have maximum dissimilarity)
  dissMat[assoMat == 0] <- 1

  # Compute similarity matrix
  simMat <- 1 - dissMat

  # Turn into igraph object
  graphObj <- SpiecEasi::adj2igraph(simMat)

  return(list(graph = graphObj, adja = simMat, asso = assoMat, diss = dissMat))
}
# Create graph object
spring_graph <- transform_asso(spring_cor)$graph

18.1.5 NetCoMi network

The NetCoMi (Peschel et al. 2021) package is specifically designed to construct, analyze, and compare networks for microbiome data and implements the complete workflow described in Section 18.1.1. Instead of using several functions for each of the steps, NetCoMi provides a single function for network construction (netConstruct()), so the package streamlines the workflow considerably. The user can choose from a variety of methods for data preprocessing, association estimation, sparsification, and transformation. The returned microNet object can then be passed to netAnalyze() (the network analysis function) so that all necessary information is available for the network analysis workflow.

Peschel, Stefanie, Christian L Müller, Erika von Mutius, Anne-Laure Boulesteix, and Martin Depner. 2021. “NetCoMi: network construction and comparison for microbiome data in R.” Briefings in Bioinformatics 22 (4): bbaa290. https://doi.org/10.1093/bib/bbaa290.

We again use SPRING as one of the association measures available in NetCoMi to construct a conditional dependency graph.

To demonstrate how taxa are filtered with netConstruct(), we will use the unfiltered tse object this time. The filtering is the same as before: Taxa occurring in less than 20% of the samples are removed.

netcomi_net <- netConstruct(
    tse,
    taxRank = "genus",
    filtTax = "numbSamp",
    filtTaxPar = list(numbSamp = 0.2),
    measure = "spring",
    measurePar = list(thresh = 0.05, Rmethod = "approx"),
    sparsMethod = "none",
    dissFunc = "signed",
    seed = 13075)

netConstruct() returns an object of the class microNet, which contains all matrices generated during network construction.

The object also contains an edge list, giving each edge’s estimated association, dissimilarity, and adjacency. Let’s take a quick look at the edges with the highest and lowest edge weights:

edgelist <- netcomi_net$edgelist1[
    order(netcomi_net$edgelist1$adja, decreasing = TRUE), ]
edgelist |> head()
##                  v1                    v2   asso   diss   adja
##  73     Citrobacter           Escherichia 0.3682 0.5621 0.4379
##  63    Buttiauxella              Serratia 0.2426 0.6154 0.3846
##  69   Chitinivibrio Pseudogracilibacillus 0.2203 0.6244 0.3756
##  19        Algicola           Siccibacter 0.2193 0.6248 0.3752
##  111     Haliangium          Marinobacter 0.2148 0.6266 0.3734
##  143 Planomicrobium         Virgibacillus 0.2006 0.6322 0.3678
edgelist |> tail()
##                 v1             v2       asso   diss   adja
##  132 Mycobacterium Salinibacillus  0.0017483 0.7065 0.2935
##  24      Amphritea    Providencia  0.0014085 0.7066 0.2934
##  102       Erwinia    Siccibacter  0.0013114 0.7066 0.2934
##  116     Holophaga   Methylarcula  0.0007828 0.7068 0.2932
##  17       Algicola     Lysobacter  0.0005191 0.7069 0.2931
##  95   Enterobacter     Janibacter -0.0013921 0.7076 0.2924

As before, the adjacency matrix is converted into an igraph object. Further steps like sparsification and transformation are not necessary because they are done internally by netConstruct().

netcomi_graph <- SpiecEasi::adj2igraph(abs(netcomi_net$adjaMat1))

18.2 Network analysis with igraph

The computed network is now analyzed using appropriate methods. We will first use the igraph package to analyze the SPRING network. NetCoMi’s netAnalyze() function will be used later to analyze the constructed microNet object.

18.2.1 Network plot

To get an overview of the network structure, a common first step is to plot the network. Here we use the igraph package, a state-of-the-art package for network analysis and visualization; other packages suitable for network plotting are qgraph and ggnet2. Since we will also use igraph for the network analysis later on, we use its plotting function here as well.

We use the Fruchterman-Reingold layout (a force-directed layout) for node placement. By placing strongly connected nodes close together and those with low edge weight far apart, this layout results in an easy-to-read network plot.

The node sizes are proportional to a taxon’s log10-transformed abundance, which we previously added to the tse object, averaged across all samples. The values are rescaled to be visually distinguishable.

Since we created two graph objects, one with SPRING and one with NetCoMi, we plot them side by side. The two plots should be identical.

# Node sizes
vsize <- (colMeans(t(assay(tse, "log10"))) + 1) * 3

# Fruchterman-Reingold layout from igraph package
set.seed(13075)
lay_fr <- layout_with_fr(spring_graph)

par(mfrow = c(1,2))
plot(spring_graph, layout = lay_fr, vertex.size = vsize,
     vertex.label = NA, main = "SPRING network")
plot(netcomi_graph, layout = lay_fr, vertex.size = vsize,
     vertex.label = NA, main = "NetCoMi network\n(with SPRING associations)")

18.2.2 Centrality measures

Centrality measures express the importance of nodes within the network. Common measures are the degree, betweenness, closeness, and eigenvector centrality. The igraph package provides functions to compute these measures. We wrap a function around the code to reuse it later.

get_centr <- function(graph_obj) {
  # We access igraph directly with "::" because there are more packages loaded
  # in this chapter that contain a degree() function.
  df <- data.frame(Degree = igraph::degree(graph_obj))
  df$Betweenness <- betweenness(graph_obj)
  df$Closeness <- closeness(graph_obj, normalized = TRUE)
  df$Eigenvector <- eigen_centrality(graph_obj)$vector
  return(df)
}

centr_df <- get_centr(spring_graph)
rownames(centr_df) <- rownames(spring_cor)
centr_df |> head(15)
##                   Degree Betweenness Closeness Eigenvector
##  Abyssicoccus          0           0       NaN   1.135e-18
##  Acidaminococcus       2           0    0.6464   1.862e-01
##  Acinetobacter         3           5    0.5544   1.257e-01
##  Actinomyces           0           0       NaN   1.135e-18
##  Actinoplanes          2          99    0.5104   1.099e-02
##  Aerococcus            0           0       NaN   1.135e-18
##  Aeromonas             4         351    0.7477   2.023e-01
##  Agromyces             6         435    0.7733   5.293e-01
##  Algicola              4         264    0.7206   1.500e-01
##  Alicyclobacillus      0           0       NaN   1.135e-18
##  Alteribacillus        0           0       NaN   1.135e-18
##  Ammoniibacillus       1           0    0.5064   5.419e-02
##  Amphritea             5         382    0.6256   1.335e-02
##  Amycolatopsis         1           0    0.6117   1.338e-01
##  Anaerococcus          2         392    0.4064   1.288e-03

The closeness centrality is “NaN” for some genera. These are unconnected nodes, as can be seen by the zero degree and betweenness centrality.
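
As a quick check, we can count the unconnected nodes and compare this number with the number of undefined closeness values:

# Number of nodes with degree zero and number of NaN closeness values
sum(igraph::degree(spring_graph) == 0)
sum(is.nan(centr_df$Closeness))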

18.2.3 Scale node sizes by degree

Centrality measures can be visualized in the network plot by scaling the node sizes according to one of these measures. We plot the Spring graph using the same layout as before and with the node sizes scaled according to all four centrality measures.

Of the four centrality measures, only the degree has a range suitable to be used as node size. The other centrality measures must be rescaled because their range is either too small or too large. The following scaling is a suggestion that works for this example. The values might be adapted for other data sets.

get_vsizes <- function(centr_df) {
    df <- as.matrix(centr_df)
    df[, "Betweenness"] <- log(df[, "Betweenness"])
    df[, "Closeness"] <- df[, "Closeness"] * 10
    df[, "Eigenvector"] <- df[, "Eigenvector"] * 10
    df[is.infinite(df) | is.na(df)] <- 0
    return(df)
}

vsize_df <- get_vsizes(centr_df)
vsize_df |> head()
##                  Degree Betweenness Closeness Eigenvector
##  Abyssicoccus         0       0.000     0.000   1.135e-17
##  Acidaminococcus      2       0.000     6.464   1.862e+00
##  Acinetobacter        3       1.609     5.544   1.257e+00
##  Actinomyces          0       0.000     0.000   1.135e-17
##  Actinoplanes         2       4.595     5.104   1.099e-01
##  Aerococcus           0       0.000     0.000   1.135e-17
par(mfrow = c(2,2))
for (i in seq_along(centr_df)) {
    plot(spring_graph, layout = lay_fr, vertex.size = vsize_df[, i],
        vertex.label = NA, main = colnames(centr_df)[i])
}

We observe that the two-node component at the bottom has a much higher closeness centrality than the nodes belonging to the main component of the network. Closeness centrality as commonly defined is thus misleading when the network consists of disconnected components: nodes in small components only need to reach their few component members and therefore receive inflated values compared with nodes in large components. To overcome this problem, centrality values, and especially closeness centrality, are often calculated only for the largest connected component (LCC), which we do below.

# Extract the LCC
dg_net <- igraph::decompose.graph(spring_graph)
idx_lcc <- which.max(unlist(lapply(dg_net, function(x) length(igraph::V(x)))))
lcc <- dg_net[[idx_lcc]]

# Compute centrality values for the LCC
centr_df_lcc <- get_centr(lcc)

# Replace centrality values by those for LCC and set all others to zero
lcc_nodes <- as.numeric(rownames(centr_df_lcc))
centr_df[lcc_nodes, ] <- centr_df_lcc
centr_df[-lcc_nodes, ] <- 0

# Node/vertex sizes
vsize_df <- get_vsizes(centr_df)
par(mfrow = c(2,2))
for (i in seq_along(centr_df)) {
    plot(spring_graph, layout = lay_fr, vertex.size = vsize_df[, i],
        vertex.label = NA, main = colnames(centr_df)[i])
}

Note that NetCoMi follows a different approach to overcome this problem. NetCoMi uses the definition of closeness centrality proposed by Tore Opsahl, which is well defined even for disconnected networks and assigns higher closeness centrality values to nodes in larger components. This is more intuitive because nodes in a larger component are connected to a larger number of other nodes than in small components.
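
A closely related measure that is also well defined for disconnected graphs is harmonic centrality, the sum of inverse shortest path distances, where unreachable node pairs contribute zero. This is not NetCoMi’s exact implementation, but the following sketch shows the idea with igraph (edge weights are ignored here for simplicity):

# Harmonic centrality: unreachable node pairs contribute zero instead of
# producing NaN values, and nodes in larger components get higher values
harm <- igraph::harmonic_centrality(spring_graph, weights = NA, normalized = TRUE)
head(sort(harm, decreasing = TRUE))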

18.2.4 Degree distribution

The degree distribution is another popular measure that expresses the probability distribution of degrees over the entire network. It thus provides insight into the overall network structure. Here we plot the degree distribution of the SPRING network; the degree distributions of all four association measures are compared in Section 18.7.5.1.

library(ggplot2)

# Compute degree distribution
ddist <- igraph::degree.distribution(spring_graph)

# Data frame needed for ggplot2
df <- data.frame(
    Degree = as.factor((seq_along(ddist)) - 1), Fraction = ddist)

ggplot(data = df, aes(x = Degree, y = Fraction, group = 1)) +
    geom_line() +
    geom_point() +
    labs(title = "Degree distribution") +
    theme_bw()

The network has a large number of singletons and sparsely connected nodes, and only a few nodes with a degree of seven or more.

18.2.5 Clustered heatmaps

Using the ComplexHeatmap package, we plot a heatmap of the association matrix estimated with SPRING. Rows and columns are sorted according to the clusters identified via hierarchical clustering.

We select the 50 nodes with the highest sum of edge weights to get a smaller heatmap.

library(ComplexHeatmap)
library(circlize)

sel <- names(sort(rowSums(spring_cor), decreasing = TRUE))[seq_len(50)]
adja_sel <- spring_cor[sel, sel]
# Color vector
col <- colorRamp2(
    c(-1, -0.5, 0, 0.5, 1),
    c("royalblue4", "lightblue", "white", "orange", "firebrick3"))

Heatmap(
    adja_sel,
    col = col,
    rect_gp = gpar(col = "gray", lwd = 1),
    show_row_names = FALSE,
    show_column_names = FALSE,
    name = "Association")

The associations are generally quite low, and there are no prominent clusters detected by hierarchical clustering.

18.2.6 Global network measures

Global measures describe the overall network structure. We take a look at three common measures: density, transitivity, and average path length. The values are again computed with igraph functions.

18.2.6.1 Density

Definition: Proportion of present edges from all possible edges.

edge_density(spring_graph)
##  [1] 0.01444

18.2.6.2 Transitivity (clustering coefficient)

Here, we consider only the global clustering coefficient, which is defined as the ratio of triangles to connected triples.

transitivity(spring_graph)
##  [1] 0.1456

18.2.6.3 Average path length

Definition: Mean of the shortest distance between each pair of nodes.

average.path.length(spring_graph)
##  [1] 1.699

18.3 Network analysis with NetCoMi

The netcomi_net object of class microNet created before is now passed to netAnalyze() to perform network analysis with NetCoMi.

The function computes several common network characteristics such as centrality measures, cluster assignment, the graphlet correlation matrix, as well as global network measures.

The user has several options to choose from, such as a clustering method, how to define hubs, and whether or not to normalize centrality values. See the help page ?netAnalyze for a description of the arguments.

By default, a heatmap of the Graphlet Correlation Matrix (GCM) is returned (with graphlet correlations in the upper triangle and significance codes resulting from Student’s t-test in the lower triangle). See ?calcGCM and ?testGCM for details.

netcomi_netprops <- netAnalyze(
    netcomi_net,
    clustMethod = "cluster_fast_greedy",
    hubPar = "eigenvector",
    normDeg = FALSE)

summary(netcomi_netprops, numbNodes = 5)
##  
##  Component sizes
##  ```````````````              
##  size: 103 2  1
##     #:   1 1 42
##  ______________________________
##  Global network properties
##  `````````````````````````
##  Largest connected component (LCC):
##                                   
##  Relative LCC size         0.70068
##  Clustering coefficient    0.21253
##  Modularity                0.70495
##  Positive edge percentage 99.35065
##  Edge density              0.02932
##  Natural connectivity      0.01173
##  Vertex connectivity       1.00000
##  Edge connectivity         1.00000
##  Average dissimilarity*    0.99084
##  Average path length**     3.81101
##  
##  Whole network:
##                                   
##  Number of components     44.00000
##  Clustering coefficient    0.21253
##  Modularity                0.70787
##  Positive edge percentage 99.35484
##  Edge density              0.01444
##  Natural connectivity      0.00782
##  -----
##  *: Dissimilarity = 1 - edge weight
##  **: Path length = Units with average dissimilarity
##  
##  ______________________________
##  Clusters
##  - In the whole network
##  - Algorithm: cluster_fast_greedy
##  ```````````````````````````````` 
##                                    
##  name:  0  1  2  3  4 5  6 7 8 9 10
##     #: 42 16 11 14 20 9 17 8 5 3  2
##  
##  ______________________________
##  Hubs
##  - In alphabetical/numerical order
##  - Based on empirical quantiles of centralities
##  ```````````````````````````````````````````````                      
##   Agromyces            
##   Aneurinibacillus     
##   Anoxybacillus        
##   Chitinivibrio        
##   Erwinia              
##   Geobacillus          
##   Janibacter           
##   Pseudogracilibacillus
##  
##  ______________________________
##  Centrality measures
##  - In decreasing order
##  - Centrality of disconnected components is zero
##  ````````````````````````````````````````````````
##  Degree (unnormalized):
##                         
##  Anoxybacillus        10
##  Erwinia              10
##  Chitinivibrio         9
##  Janibacter            8
##  Escherichia/Shigella  7
##  
##  Betweenness centrality (normalized):
##                      
##  Janibacter    0.3941
##  Chitinivibrio 0.3137
##  Enterobacter  0.2780
##  Buttiauxella  0.2454
##  Erwinia       0.2095
##  
##  Closeness centrality (normalized):
##                              
##  Chitinivibrio         0.5247
##  Janibacter            0.5067
##  Anoxybacillus         0.4841
##  Pseudogracilibacillus 0.4659
##  Erwinia               0.4653
##  
##  Eigenvector centrality (normalized):
##                              
##  Anoxybacillus         1.0000
##  Chitinivibrio         0.9831
##  Pseudogracilibacillus 0.8657
##  Janibacter            0.8361
##  Aneurinibacillus      0.6890

Interpretation of some findings:

  • The largest connected component (LCC) contains 103 of the 147 nodes (relative LCC size 103/147 ≈ 0.70), and the network contains 42 singletons.
  • 10 clusters have been identified, containing 2 to 20 nodes.
  • There are 8 hub nodes detected which by definition are the nodes with the highest eigenvector centrality.
  • The average path length in the LCC is 3.811. This means that on average it takes 3.811 steps (step length is the average dissimilarity) to get from one node to another. Note that the average path length in NetCoMi is defined differently than in the igraph package, which is why the values differ.
  • Low values of edge density and the connectivity measures indicate that the network is rather sparse and not robust to perturbations (i.e., removal of nodes or edges).

18.4 Network visualization

Further insight into the network structure can be gained by visualizing the network. We have already seen examples of how to plot a network using the igraph package. Here we will use NetCoMi’s plot function. It takes as input the microNetProps object returned by netAnalyze(), which contains all computed network properties. This has the advantage that the user can choose which properties to plot by simply changing some arguments. The plot function is based on qgraph, which is another state-of-the-art R package for network visualization. The help page can be accessed via ?plot.microNetProps.

18.4.1 Highlight node properties

In the first plot, node colors represent the detected clusters and node sizes are scaled by eigenvector centrality. Hub nodes are highlighted by default. Singletons are not included in the plot. To improve readability, NetCoMi’s “intelligent” label shortening approach is used.

Note that nodes are sometimes placed too close together so that the labels overlap. You may need to play around with the repulsion argument until you find a value where the labels are legible, but also the clusters are still well recognizable.

plot(netcomi_netprops,
     repulsion = 0.98,
     rmSingles = TRUE,
     shortenLabels = "intelligent",
     labelScale = FALSE,
     nodeSize = "eigenvector",
     nodeSizeSpread = 3,
     nodeColor = "cluster",
     hubBorderCol = "gray40",
     cexNodes = 1.8,
     edgeTranspHigh = 20,
     title1 = "Network properties highlighted",
     showTitle = TRUE,
     cexTitle = 2.3,
     mar = c(1, 3, 4, 8))

legend(0.7, 1.1, cex = 1.7, title = "estimated correlation:",
       legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"),
       bty = "n", horiz = TRUE)

18.4.2 Highlight data features

We now color nodes according to their phylum. The node sizes are proportional to a taxon’s sum of mclr-transformed abundances. As already mentioned in Section 18.1.4, this is the normalization method used by SPRING. A color palette from RColorBrewer is used here.

library(RColorBrewer)
# Generate vector with phylum names for node coloring
phyla <- as.factor(rowData(tse)$phylum)
names(phyla) <- rowData(tse)$genus

# Create color vector
colvec <- RColorBrewer::brewer.pal(length(levels(phyla)), "Set3")

plot(netcomi_netprops,
     repulsion = 0.98,
     rmSingles = TRUE,
     shortenLabels = "intelligent",
     labelScale = FALSE,
     nodeSize = "mclr",
     nodeColor = "feature",
     featVecCol = phyla,
     colorVec =  colvec,
     nodeTransp = 20,
     highlightHubs = FALSE,
     cexNodes = 1.8,
     edgeTranspHigh = 20,
     title1 = "Data features highlighted",
     showTitle = TRUE,
     cexTitle = 2.3,
     mar = c(1, 10, 4, 6))

# Add legends
legend(0.7, 1.1, cex = 1.7, title = "estimated correlation:",
       legend = c("+","-"), lty = 1, lwd = 3, col = c("#009900","red"),
       bty = "n", horiz = TRUE)

# Colors used in the legend should be equally transparent as in the plot
col_transp <- colToTransp(colvec, 20)

legend(-1.8, 1.1, cex = 1.7, pt.cex = 2.5, title = "Phylum:",
       legend=levels(phyla), col = col_transp, bty = "n", pch = 16)

A few things to observe:

  • Genera belonging to the same phylum tend to cluster together, though not perfectly.
  • Genera with a low total count play a rather unimportant role in the network, i.e., they have a low centrality.
  • There is only one negative edge in the network. This edge is between two clusters, as expected when using the “signed” transformation.

18.5 Which method(s) to choose?

Throughout all the steps from primary data to potentially significant network features, there is a variety of methods and parameters to choose from. However, there is no general consensus in the community on the “right” way to estimate and analyze microbial networks. In the absence of a “best method” for inferring and analyzing microbial networks, researchers may be tempted to try different methods and report only the optimal results or those that fit some prior knowledge. This carries the risk of “overfitting” the analysis to the existing data so that the results are not replicable for new data (Ullmann et al. 2023).

Ullmann, Theresa, Stefanie Peschel, Philipp Finger, Christian L Müller, and Anne-Laure Boulesteix. 2023. “Over-Optimism in Unsupervised Microbiome Analysis: Insights from Network Learning and Clustering.” PLoS Computational Biology 19 (1): e1010820. https://doi.org/10.1371/journal.pcbi.1010820.

Therefore, the selection of the workflow building blocks should be set up once and independently of any hypothesis about the data, thus avoiding the fallacy of starting to “fish” for results that best fit a previously formulated hypothesis. For example, one should ask prior to the analysis whether correlation or conditional dependence as a measure of association better fits the research question and choose the method accordingly. Another example is the choice of transformation from estimated association to dissimilarity (i.e., “signed” or “unsigned”), which completely changes the interpretation and characteristics of the network. This choice should be made based on the research question before starting the analysis.

18.6 More about association measures

As mentioned in the introduction of this chapter, there are three types of association measures that are commonly used to express relationships between taxa: correlation, conditional dependence, and proportionality. Below, we provide a brief explanation of each of these measures, along with lists of available compositionality-aware approaches.

  • Correlation: Two popular measures of ecological association are Pearson’s correlation coefficient and Spearman’s rank correlation coefficient, both of which can be inferred from empirical (sample) covariances. However, in the \(p\gg n\) setting, which most microbiome datasets are in, sample covariances and correlations are unreliable because the parameters being estimated are typically underdetermined. One way to improve sample covariance estimates is to assume that the underlying covariance matrix is sparse and use a regularized covariance estimator to implement this structural assumption. The Schäfer-Strimmer shrinkage estimator (Schäfer and Strimmer 2005) is one possible method for estimating a sparse correlation matrix. Other popular methods, especially designed to estimate correlations for compositional data, are SparCC (“Sparse Correlations for Compositional data”) by J. Friedman and Alm (2012), CCREPE (“Compositionality Corrected by REnormalization and PErmutation”) by Faust et al. (2012), and CCLasso (“Correlation inference for Compositional data through Lasso”) by Huaying Fang et al. (2015). The latter three methods already include a compositionality aware normalization, and SparCC also includes a zero replacement approach.
  • Conditional dependence: Since standard correlations include both direct and indirect dependencies, conditional dependence or partial correlation is often preferred for measuring association. Unlike (marginal) correlation, it expresses the relationship between two features conditioned on all other features in the data set. The approach and R package SpiecEasi (“Sparse InversE Covariance estimation for Ecological Association and Statistical Inference”) by Kurtz et al. (2015) is specifically designed for inferring ecological networks from microbiome data and includes two approaches for estimating conditional dependence structures between taxa: Neighborhood Selection; short “MB” (Meinshausen and Bühlmann 2006) and (inverse) covariance selection (Jerome Friedman, Hastie, and Tibshirani 2008), which is based on a penalized maximum likelihood approach and is also known as “graphical lasso”. Another approach and R package for inferring partial correlations from microbiome data is SPRING (“Semi-Parametric Rank-based approach for INference in Graphical model”) by Yoon, Gaynanova, and Müller (2019). They also use the MB neighborhood selection method, but introduce a novel semi-parametric rank-based approach for sparse partial correlation estimation that can naturally handle the excess of zeros in the data. gCoda (H. Fang et al. 2017) is another conditional dependence measure based on penalized maximum likelihood estimation. All of the aforementioned conditional dependence measures address the high dimensionality of microbiome data.
  • Proportionality: Lovell et al. (2015) introduce proportionality as an alternative measure of pairwise association for compositional data. The idea is that if the relative abundances between two taxa \(i\) and \(j\) are proportional, then their corresponding absolute abundances are also proportional: \(\frac{\omega_i}{m} \propto \frac{\omega_j}{m} \Rightarrow \omega_i \propto \omega_j\), where \(m\) is the sum of counts in the sample. It follows that proportionality is identical for the observed (relative) read counts and the true unobserved counts. The proportionality measure proposed by Lovell et al. (2015) is based on log-ratio variance \(var(log \frac{x_i}{x_j})\), which is zero when \(\omega_i\) and \(\omega_j\) are perfectly proportional. Proportionality is implemented in the R package propr. Badri et al. (2020) extend the proportionality measure to a so-called “shrinkage proportionality estimator”. It combines proportionality with the covariance shrinkage approach to obtain consistent association estimates even with small sample sizes.
Schäfer, Juliane, and Korbinian Strimmer. 2005. “A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.” Statistical Applications in Genetics and Molecular Biology 4 (1): 1–30. https://doi.org/10.2202/1544-6115.1175.
Faust, Karoline, J. Fah Sathirapongsasuti, Jacques Izard, Nicola Segata, Dirk Gevers, Jeroen Raes, and Curtis Huttenhower. 2012. “Microbial Co-occurrence Relationships in the Human Microbiome.” PLoS Computational Biology 8 (7): e1002606. https://doi.org/10.1371/journal.pcbi.1002606.
Fang, Huaying, Chengcheng Huang, Hongyu Zhao, and Minghua Deng. 2015. “CCLasso: Correlation inference for compositional data through Lasso.” Bioinformatics 31 (19): 3172–80. https://doi.org/10.1093/bioinformatics/btv349.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2008. “Sparse inverse covariance estimation with the graphical lasso.” Biostatistics (Oxford, England) 9 (3): 432–41. https://doi.org/10.1093/BIOSTATISTICS/KXM045.
Yoon, Grace, Irina Gaynanova, and Christian L Müller. 2019. “Microbial networks in SPRING - Semi-parametric rank-based correlation and partial correlation estimation for quantitative microbiome data.” Frontiers in Genetics 10: 516. https://doi.org/10.3389/fgene.2019.00516.
Fang, H, C Huang, H Zhao, and M Deng. 2017. “gCoda: Conditional Dependence Network Inference for Compositional Data.” Journal of Computational Biology 24 (7): 699–708. https://doi.org/10.1089/cmb.2017.0054.
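
As a small illustration of the shrinkage estimator mentioned above, the Schäfer-Strimmer correlation estimator is available as cor.shrink() in the corpcor package. The following sketch (not part of the main workflow of this chapter) applies it to the clr-transformed abundances; the threshold of 0.4 used for sparsification is an arbitrary illustrative choice.

library(corpcor)

# Schäfer-Strimmer shrinkage estimate of the correlation matrix,
# computed from the clr-transformed abundances (samples in rows)
shrink_est <- cor.shrink(t(assay(tse, "clr")))

# Store as a plain matrix with taxa names
shrink_cor <- matrix(as.numeric(shrink_est), nrow = nrow(shrink_est))
rownames(shrink_cor) <- colnames(shrink_cor) <- rownames(tse)

# Sparsify and convert into a graph using transform_asso() from Section 18.1.4
shrink_graph <- transform_asso(shrink_cor, thresh = 0.4)$graph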

18.7 Comparison of association measures

In this section, we provide three additional examples for constructing a network using each of the three types of association:

  • Correlation using SparCC
  • Partial correlation using SpiecEasi
  • Proportionality using the shrinkage proportionality measure

18.7.1 SparCC

The first association measure we look at is SparCC (“Sparse Correlations for Compositional data”), introduced by J. Friedman and Alm (2012). It estimates Pearson correlations while taking into account the compositional structure of the data. The SpiecEasi package provides an implementation of this method.

Friedman, J, and EJ Alm. 2012. “Inferring Correlation Networks from Genomic Survey Data.” PLoS Computational Biology 8 (9): e1002687. https://doi.org/10.1371/journal.pcbi.1002687.
# Set seed for reproducibility
set.seed(13075)
# Compute correlation matrix
sparcc_cor <- SpiecEasi::sparcc(t(assay(tse, "counts")))$Cor
rownames(sparcc_cor) <- colnames(sparcc_cor) <- rownames(tse)

We reuse the transform_asso() function created in Section 18.1.4, which sparsifies the association matrix, transforms it into a similarity matrix, and finally returns an igraph object.

Two threshold values are used to see the effect of sparsification later in the network plot.

sparcc_trans03 <- transform_asso(sparcc_cor, thresh = 0.3)
sparcc_trans04 <- transform_asso(sparcc_cor, thresh = 0.4)

sparcc_graph03 <- sparcc_trans03$graph
sparcc_graph04 <- sparcc_trans04$graph

18.7.2 Shrinkage proportionality

In the second example, microbial associations are measured by proportionality, originally introduced by Lovell et al. (2015). We use the shrinkage proportionality estimator proposed by Badri et al. (2020), which gives consistent results even for small sample sizes. Since there is no R package implementing this estimator, we use the rho_shrink_est() function provided in the GitHub repository associated with the paper. The function is slightly modified to take normalized counts as input.

Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. 2015. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11 (3): 1–12. https://doi.org/10.1371/journal.pcbi.1004075.
Badri, Michelle, Zachary D. Kurtz, Richard Bonneau, and Christian L. Müller. 2020. “Shrinkage improves estimation of microbial associations under different normalization methods.” NAR Genomics and Bioinformatics 2 (4). https://doi.org/10.1093/NARGAB/LQAA100.
# The cov.shrink() function used below comes from the corpcor package
library(corpcor)

# norm_counts: clr-transformed count matrix with samples in rows
rho_shrink_est <- function(norm_counts, ...) {
    shrunk_cov <- cov.shrink(norm_counts, ...)
    p <- ncol(norm_counts)
    J <- matrix(rep(diag(shrunk_cov), p), p)
    rho <- 2 * shrunk_cov / (J + t(J))
    (rho + t(rho)) / 2
}
# Apply the shrinkage proportionality estimator to the clr-transformed counts
prop_est <- as(rho_shrink_est(t(assay(tse, "clr"))), "matrix")
##  Estimating optimal shrinkage intensity lambda.var (variance vector): 0.0857 
##  
##  Estimating optimal shrinkage intensity lambda (correlation matrix): 0.3634

Again, we use our transformation function to convert the association matrix into a graph object.

prop_trans <- transform_asso(prop_est, thresh = 0.4)
prop_graph <- prop_trans$graph

18.7.3 SpiecEasi - MB

As a third example, we use the SpiecEasi (“Sparse InversE Covariance estimation for Ecological Association and Statistical Inference”) approach proposed by Kurtz et al. (2015) to estimate a sparse conditional dependency graph. The neighborhood selection method (“MB”) introduced by Meinshausen and Bühlmann (2006) is used for network learning. The approach is implemented in the R package SpiecEasi.

Kurtz, ZD, CL Müller, ER Miraldi, DR Littman, MJ Blaser, and RA Bonneau. 2015. “Sparse and Compositionally Robust Inference of Microbial Ecological Networks.” PLoS Computational Biology 11 (5): e1004226. https://doi.org/10.1371/journal.pcbi.1004226.
Meinshausen, Nicolai, and Peter Bühlmann. 2006. “High-dimensional graphs and variable selection with the Lasso.” Annals of Statistics 34 (3): 1436–62. https://doi.org/10.1214/009053606000000281.
library(SpiecEasi)
set.seed(13075)
se_mb_est <- spiec.easi(
    t(assay(tse, "counts")),
    method = 'mb', nlambda = 20,
    pulsar.params = list(rep.num = 20))

Since SpiecEasi uses the StARS (“Stability Approach to Regularization Selection”) method (Liu, Roeder, and Wasserman 2010) to obtain a sparse association matrix, we don’t need to set a threshold here. We store the partial correlations corresponding to the StARS-optimal lambda and convert them into an igraph object.

Liu, Han, Kathryn Roeder, and Larry Wasserman. 2010. “Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models.” Advances in Neural Information Processing Systems. https://doi.org/10.48550/arxiv.1006.3316.
# Get optimal matrix with partial correlations
se_mb_cor <- as.matrix(getOptBeta(se_mb_est))
se_mb_cor <- as.matrix(symBeta(se_mb_cor))
rownames(se_mb_cor) <- colnames(se_mb_cor) <- rownames(tse)
diag(se_mb_cor) <- 1

# Create graph object
se_mb_graph <- transform_asso(se_mb_cor)$graph

18.7.4 Network plots

The graph objects can now be plotted using the igraph package. The same layout is used in all four plots so that the networks are comparable.

# Node sizes
vsize <- (colMeans(t(assay(tse, "log10"))) + 1) * 3

# Use Fruchterman-Reingold (force-directed) layout
set.seed(13075)
lay_fr <- layout_with_fr(se_mb_graph)

par(mfrow = c(2,2))
plot(sparcc_graph03, layout = lay_fr, vertex.size = vsize,
     vertex.label = NA, main = "SparCC (thresh 0.3)")
plot(sparcc_graph04, layout = lay_fr, vertex.size = vsize,
     vertex.label = NA, main = "SparCC (thresh 0.4)")
plot(prop_graph, layout = lay_fr, vertex.size = vsize,
     vertex.label = NA, main = "Shrinkage proportionality\n(thresh 0.4)")
plot(se_mb_graph, layout = lay_fr, vertex.size = vsize,
     vertex.label = NA, main = "SpiecEasi (MB)")

A few observations: The densities of the SparCC (threshold 0.4), proportionality, and SpiecEasi networks are comparable, while the SparCC correlation network with threshold 0.3 is much denser. However, there are edges in the proportionality and SpiecEasi networks that are not present in the two SparCC networks. The SpiecEasi network has fewer highly connected nodes than the other three networks, but more nodes with one or two connections.

We will look at the degree distribution in the next section to quantify these observations.

18.7.5 Network analysis

Here we repeat some of the network analysis approaches explained in Section 18.2. The analyses are performed simultaneously for the three association measures as well as the SPRING network constructed in Section 18.1.4. Therefore, we start by creating a list of all the graph objects we need for the analyses.

graphlist <- list(
    SparCC = sparcc_graph04,
    Proportionality = prop_graph,
    SpiecEasi = se_mb_graph,
    SPRING = spring_graph)

18.7.5.1 Degree distribution

The degree distribution is plotted for all four measures to compare the overall network structure.

# Compute degree distributions
ddlist <- lapply(graphlist, igraph::degree.distribution)

# Maximum degree
maxdeg <- max(lengths(ddlist))

# Make list elements the same length
for(i in seq_along(graphlist)) {
    length(ddlist[[i]]) <- maxdeg
}

# Data frame needed for ggplot2
df <- data.frame(
    Degree = rep(seq_len(maxdeg) - 1, length(graphlist)),
    Fraction = unlist(ddlist),
    Method = rep(names(graphlist), each = maxdeg))

ggplot(df, aes(x = Degree, y = Fraction, group = Method)) +
    geom_line(aes(color = Method)) +
    geom_point(aes(color = Method)) +
    theme_bw()

The SparCC and shrinkage proportionality networks have a considerably higher proportion of singletons (zero-degree nodes) than the two conditional dependency graphs, but a lower proportion of nodes with degrees between one and five. The SpiecEasi and SPRING graphs, on the other hand, have a higher proportion of low degree nodes, but no highly connected nodes with a degree greater than eleven.

18.7.5.2 Clustered heatmaps

Using the ComplexHeatmap package, we plot heatmaps of the association matrices for the four considered association measures. Rows and columns are sorted according to the clusters identified via hierarchical clustering.

For each association measure, we select the 50 nodes with the highest sum of edge weights.

# Function for selecting taxa with highest sum of edge weights
select_taxa <- function(adja, ntaxa = 50) {
    sel <- names(sort(rowSums(adja), decreasing = TRUE))[seq_len(ntaxa)]
    adja[sel, sel]
}

assolist <- list()
assolist$SparCC <- select_taxa(sparcc_trans04$adja)
assolist$Proportionality <- select_taxa(prop_trans$adja)
assolist$SpiecEasi <- select_taxa(se_mb_cor)
assolist$SPRING <- select_taxa(spring_cor)
# Color vector for the legend
col <- colorRamp2(
    c(-1, -0.5, 0, 0.5, 1),
    c("royalblue4", "lightblue", "white", "orange", "firebrick3"))

hm_list <- list()

for(i in seq_along(assolist)) {
    if (i %in% c(2, 4)) {
        showlegend <- TRUE
    } else {
        showlegend <- FALSE
    }

    hm_list[[i]] <- Heatmap(
        assolist[[i]],
        col = col,
        rect_gp = gpar(col = "gray", lwd = 1),
        show_row_names = FALSE,
        show_column_names = FALSE,
        column_title = names(assolist)[i],
        name = "Association",
        show_heatmap_legend = showlegend) |>

        draw() |>
        grid.grabExpr()
}

# Plot with the wrap_plots() function from the patchwork package
library(patchwork)
wrap_plots(hm_list, ncol = 2, widths = c(8, 10, 8, 10))

The SparCC and the proportionality network show a block structure, where each block corresponds to a cluster. The clusters are less pronounced in the conditional dependence networks. The latter also generally have lower edge weights.

18.7.5.3 Global network measures

For each association measure, the three global network measures density, transitivity, and average path length are computed and stored in a data frame for comparison.

# Compute density and store in a data frame
glob <- data.frame(Density = unlist(lapply(graphlist, edge_density)))

# Transitivity
glob$Transitivity <- unlist(lapply(graphlist, transitivity))

# Average path length
glob$Av.path <- unlist(lapply(graphlist, average.path.length))

glob
##                  Density Transitivity Av.path
##  SparCC          0.01510       0.5458  0.8932
##  Proportionality 0.01295       0.5402  1.1847
##  SpiecEasi       0.02162       0.2145  1.6322
##  SPRING          0.01444       0.1456  1.6993