vignettes/eccb_workflow.Rmd
eccb_workflow.Rmd
Authors: Tuomas Borman, Pande Erawijantari, Giulio Benedetti,
Stefanie Peschel, Leo Lahti.
Last modified: 16 September,
2024.
This tutorial introduces the latest advances in Bioconductor tools and data structures supporting microbiome analysis. We will show how this can enhance interoperability across omics and walk through a typical microbiome data science workflow. You will learn from active developers how to access open microbiome data resources, utilize optimized data structures, assess community diversity and composition, integrate taxonomic and functional hierarchies, and visualize microbiome data. We will follow the online book “Orchestrating Microbiome Analysis with Bioconductor” and Bioconductor SummarizedExperiment framework, supporting optimized analysis and integration of hierarchical, multi-domain microbiome data. Participants are encouraged to install the latest versions of R and Bioconductor.
If your time allows, we recommend to spend some time to explore beforehand Orchestrating Microbiome Analysis (OMA) online book.
We have instructor-led demonstrations where participants are expected to follow along. Questions are encouraged at any time during the workshop.
It is preferred to use your own laptop. You can find help for installation from here. There are also available pre-installed virtual machines by Galaxy/Bioconductor. (Under “EUROBIOC2024” tab, session called “Package Demo: mia”)
In this workshop, we will focus on the mia package, which is designed for microbiome data science. The mia package provides essential tools for analyzing microbiome data using the SummarizedExperiment framework.
Additionally, we will cover NetCoMi and iSEEtree in network and interactive data exploration session, respectively.
Activity | Time |
---|---|
Bioconductor resources for microbiome data science | 30m |
Analysis of community diversity and composition | 30m |
Microbiome data integration | 30m |
Coffee break | 30m |
Microbial network analysis | 30m |
Interactive microbiome data exploration with iSEEtree | 30m |
Recap and Q&A | 30m |
Methods for microbiome analysis: Learn to harness SummarizedExperiment ecosystem in your project.
Data structure: Understand how to utilize the TreeSummarizedExperiment class for effective microbiome data analysis.
Access resources: Get familiar with additional tools and resources, including the OMA online book.
Analyze and apply methods: Apply the framework to process and analyze microbiome data.
Visualize data: Generate and interpret common graphs for microbiome data
Explore documentation: Use the OMA to explore additional tools and methods.
In this workflow, we demonstrate how to analyze data from publicly available microbiome resources using Bioconductor methods. We will cover essential data wrangling steps, explore alpha and beta diversity metrics, and introduce the fundamentals of data integration and microbial network analysis. Additionally, we will explore how to utilize the tools through a graphical interface.
The miaverse framework is based on TreeSummarizedExperiment data container, extension to common SummarizedExperiment object.
First, we load the required packages into the session. The following script ensures that any packages not already installed are automatically installed.
# List of packages that we need
packages <- c(
"ComplexHeatmap",
"curatedMetagenomicData",
"dplyr",
"ggsignif",
"iSEEtree",
"mia",
"miaViz",
"NetCoMi",
"scater",
"shadowtext",
"patchwork",
"scater",
"SpiecEasi",
"tidyr"
)
# Get packages that are already installed
packages_already_installed <- packages[ packages %in% installed.packages() ]
# Get packages that need to be installed
packages_need_to_install <- setdiff( packages, packages_already_installed )
# Install GitHub packages
if( "SpiecEasi" %in% packages_need_to_install ){
# SpiecEasi must be installed from "latentcor" branch
devtools::install_github("zdk123/SpiecEasi", force = TRUE, ref = "latentcor")
# Remove from installation list
packages_need_to_install <- packages_need_to_install[ !packages_need_to_install %in% "SpiecEasi" ]
}
if( "SPRING" %in% packages_need_to_install ){
devtools::install_github("GraceYoon/SPRING")
# Remove from installation list
packages_need_to_install <- packages_need_to_install[ !packages_need_to_install %in% "SPRING" ]
}
if( "NetCoMi" %in% packages_need_to_install ){
# NetCoMi must be installed from "TSE" branch
devtools::install_github(
"stefpeschel/NetCoMi", force = TRUE, ref = "TSE",
dependencies = c("Depends", "Imports", "LinkingTo"),
repos = c("https://cloud.r-project.org/", BiocManager::repositories()))
# Remove from installation list
packages_need_to_install <- packages_need_to_install[ !packages_need_to_install %in% "NetCoMi" ]
}
# Loads BiocManager into the session. Install it if it is not already installed.
if( !require("BiocManager") ){
install.packages("BiocManager")
library("BiocManager")
}
# If there are packages that need to be installed, installs them with BiocManager
# Updates old packages.
if( length(packages_need_to_install) > 0 ) {
install(version = "devel")
install(packages_need_to_install, ask = FALSE, update = FALSE)
}
# Load all packages into session. Stop if there are packages that were not
# successfully loaded
pkgs_not_loaded <- !sapply(packages, require, character.only = TRUE)
pkgs_not_loaded <- names(pkgs_not_loaded)[ pkgs_not_loaded ]
if( length(pkgs_not_loaded) > 0 ){
stop(
"Error in loading the following packages into the session: '",
paste0(pkgs_not_loaded, collapse = "', '"), "'")
}
If you encountered an error, install packages manually.
There are several openly available datasets listed in OMA. In this workflow, we will retrieve data from the curatedMetagenomicData resource. We’ll focus on a dataset that includes both colorectal cancer (CRC) patients and control subjects. Our goal is to investigate whether CRC is associated with alterations in the gut microbiota.
Let’s fetch the data from the database.
# Study from which we want to fetch data
study <- "GuptaA_2019"
# Experiments that we want to fetch
experiments <- c("pathway_abundance", "relative_abundance") |>
paste0(collapse = "|")
search_pattern <- paste0(study, ".(", experiments, ")")
# Fetch data
data_list <- curatedMetagenomicData(
search_pattern, dryrun = FALSE, counts = TRUE, rownames = "short")
#> $`2021-03-31.GuptaA_2019.relative_abundance`
#> dropping rows without rowTree matches:
#> k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
#> k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Enorma|s__[Collinsella]_massiliensis
#> k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
#> k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra
# Add new names
names(data_list) <- c("pathway", "taxonomy")
# The assay name for taxonomy profile is incorrect since we have counts, not
# relative abundances
tse <- data_list[["taxonomy"]]
assayNames(tse) <- "counts"
data_list[["taxonomy"]] <- tse
# Rename also pathway abundance. They are in relative scale
tse <- data_list[["pathway"]]
assayNames(tse) <- "relabundance"
data_list[["pathway"]] <- tse
# For pathway data, extract pathway information from rownames and add them to
# rowData
tse <- data_list[["pathway"]]
rowData(tse)[["full_name"]] <- rownames(tse)
rowData(tse)[["pathway"]] <- sub("\\|.*", "", rowData(tse)[["full_name"]])
data_list[["pathway"]] <- tse
# Get taxonomy object
tse <- data_list[["taxonomy"]]
tse
#> class: TreeSummarizedExperiment
#> dim: 304 60
#> metadata(1): agglomerated_by_rank
#> assays(1): counts
#> rownames(304): [Bacteroides] pectinophilus [Clostridium] hylemonae ...
#> Victivallis vadensis Weissella confusa
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(27): study_name subject_id ... disease_stage
#> disease_location
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (304 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL
The output is a TreeSummarizedExperiment
(TreeSE
) object. For details on how to handle and access
the data, refer to data
containers chapter in OMA.
A common step in microbiome workflows is agglomeration, where we summarize data at specific taxonomy ranks. Additionally, preprocessing involves transforming the data to address the unique characteristics of microbiome profiling data, for instance. Details on agglomeration and transformation are covered in agglomeration and transformation chapters in OMA, respectively.
# Agglomerate data
tse_family <- agglomerateByRank(tse, rank = "family")
#> Warning: The following values are already present in `metadata` and will be
#> overwritten: 'agglomerated_by_rank'. Consider using the 'name' argument to
#> specify alternative names.
# Transform the main TreeSE
tse <- transformAssay(tse, method = "relabundance")
tse <- transformAssay(tse, method = "clr", pseudocount = 1)
# Transform agglomerated data
tse_family <- transformAssay(tse_family, method = "relabundance")
tse_family <- transformAssay(tse_family, method = "clr", pseudocount = 1)
Now that the data has been pre-processed, we could start to analyze the overview of the microbiome profile. For the dataset above, we can continue to perform analysis, for example to answer these questions:
How does the microbial community differ between the CRC (n=30) vs healthy group (n=30)?
How does the beta diversity compare between the CRC (n=30) vs healthy group (n=30)?
What species shows significantly different relative abundance between 2 groups?
Let’s start with analyzing the differences on microbial community diversity between CRC and healthy control group. Community diversity is a central concept in microbiome research, and in the human microbiome study it has been shown that the microbial diversity associated with the host health status. Several diversity indices are available in the ecological literature. You can explore further in the OMA book about what analysis in-term of community diversity can be analyzed using mia package.
For simplicity, let’s calculate the microbial richness and alpha diversity using the Shannon index (other calculation using different index also available, and you can explore further)
tse <- addAlpha(tse, index = c("observed", "shannon"))
# to see the observed and shannon calculation
head(tse$observed)
#> [1] 55 29 148 68 63 51
head(tse$shannon)
#> [1] 2.755092 1.044521 3.305171 2.744079 2.570301 2.937390
To see the differences between healthy and CRC group, let’s visualize
the species richness and shannon diversity using violin plot (you can
also experiment with boxplot). Additionally, let’s also calculate the
statistical significance and plot it together using ggsignif with the
wilcoxon test. Please notes that the package does not support p-value
adjustment (such as with “fdr”). For more complex test, please see example
here on how you can calculate p-value with adjustment using
stat_compare_means
function from the ggpubr package, and
add visually appealing p-values to the plots or if you want to perform
more than two groups comparison.
The information about diseases state is available from the
colData
of tse
object, in the “diseases”
column.
# Changes old levels with new levels
tse$disease <- factor(tse$disease)
# Access the levels of the factor
disease_levels <- levels(tse$disease)
# Compute all pairwise combinations of these levels
comb <- combn(disease_levels, 2, simplify = FALSE)
# Observed richness plot
observed_plot <- plotColData(
tse,
y = "observed",
x = "disease",
colour_by = "disease"
) +
geom_signif(comparisons = comb, map_signif_level = FALSE) +
theme_bw() +
theme(text = element_text(size = 8))
# Shannon diversity plot
shannon_plot <- plotColData(
tse,
y = "shannon",
x = "disease",
colour_by = "disease"
) +
geom_signif(comparisons = comb, map_signif_level = FALSE) +
theme_bw() +
theme(text = element_text(size = 8))
# Combine the two plots into a grid
combined_plot <- observed_plot + shannon_plot + plot_layout(guides = "collect")
# Display the combined plot
print(combined_plot)
#> Warning in wilcox.test.default(c(55, 29, 148, 68, 63, 51, 37, 44, 45, 61, :
#> cannot compute exact p-value with ties
Let’s pause and think
Can you also do the diversity analysis using different indexes?
What can you inffer by these observations?
Let’s continue observing the beta-diversity between 2 groups. Beta diversity quantifies the dissimilarity between communities (multiple samples), as opposed to alpha diversity, which focuses on variation within a community (one sample). This OMA book section explained various indexes that commonly used in the microbial ecology context.
For this tutorial, let’s try to calculate the bray-curtis distance and plot the PCoA.
# Perform PCoA
tse <- runMDS(
tse,
FUN = getDissimilarity,
method = "bray",
assay.type = "relabundance",
name = "MDS_bray"
)
# Create ggplot object
p <- plotReducedDim(tse, "MDS_bray", colour_by = "disease")
# Calculate explained variance
e <- attr(reducedDim(tse, "MDS_bray"), "eig")
rel_eig <- e / sum(e[e > 0])
# Add explained variance for each axis
p1 <- p + labs(
x = paste("PCoA 1 (", round(100 * rel_eig[[1]], 1), "%", ")", sep = ""),
y = paste("PCoA 2 (", round(100 * rel_eig[[2]], 1), "%", ")", sep = "")
)
p1
Sometimes, we also want to create ellipse to guide our visualization whether the microbial community in two groups is different. You can achieve it using the code below.
p_ellipse <- p1 + stat_ellipse(aes(color = colour_by), level = 0.95)
p_ellipse
Additionally, we may want to also perform PERMANOVA analysis to further estimate whether the microbial community is statistically different. Please explore the comprehensive example in the OMA book.
Let’s pause and think
Above is the example for unsupervised ordination, can you give example of it supervised counterpart?
Can you distinguish between both method and when to use it?
OMA book also provide further exploration to explore the microbial community structure, for example based on community typing.
Another primary analysis that usually can be done in microbiome study is differential abundance analysis (DAA) for example to answer the research question 3. However, due to limited time slots, we will not demonstrate in this tutorial. You can explore various example in OMA book for DAA here
As demonstrated, CRC patients appear to have altered microbiota. However, this observation only indicates an association and does not imply causality or reveal underlying mechanisms. To explore complex mechanisms, we often need more comprehensive data. Efficiently handling data from multiple experiments requires integrating these datasets into a unified data container. Bioconductor offers two approaches for this:
altExp
slot within the TreeSE
object, which is a straightforward method.MultiAssayExperiment
, a more general and
flexible approach, though it comes with a more complex data structure
compared to TreeSE
.For detailed information on both approaches, refer to OMA.
In this tutorial, we will focus on the simpler method using the
altExp
slot.
The data fetched at the beginning of this workflow is a
list
containing multiple experiments, each represented as a
unique (Tree)SE
object. Managing this list and the
connections between samples can be error-prone and cumbersome. To
simplify this, we can use the altExp
slot to handle the
bookkeeping. Let’s start by checking which experiments are
available:
names(data_list)
#> [1] "pathway" "taxonomy"
As we see, we have metabolic pathway predictions in addition to
taxonomic profiling. We can add the entire list of experiments to the
altExp
slot.
altExps(tse) <- data_list
tse
#> class: TreeSummarizedExperiment
#> dim: 304 60
#> metadata(1): agglomerated_by_rank
#> assays(3): counts relabundance clr
#> rownames(304): [Bacteroides] pectinophilus [Clostridium] hylemonae ...
#> Victivallis vadensis Weissella confusa
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(29): study_name subject_id ... observed shannon
#> reducedDimNames(1): MDS_bray
#> mainExpName: NULL
#> altExpNames(2): pathway taxonomy
#> rowLinks: a LinkDataFrame (304 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL
With the altExp
slot now populated, you can access the
data using the altExp()
function and specify which
experiment you want to retrieve.
altExp(tse, "pathway")
#> class: SummarizedExperiment
#> dim: 12535 60
#> metadata(0):
#> assays(1): relabundance
#> rownames(12535): UNMAPPED UNINTEGRATED ... PWY-7197: pyrimidine
#> deoxyribonucleotide
#> phosphorylation|g__Desulfovibrio.s__Desulfovibrio_piger PWY-6703:
#> preQ0 biosynthesis|g__Butyricimonas.s__Butyricimonas_virosa
#> rowData names(2): full_name pathway
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(27): study_name subject_id ... disease_stage
#> disease_location
As mentioned earlier, the advantage of the altExp
slot
is that it handles sample linkages for us. For example, if we select a
subset of data, the object automatically subsets all associated
experiments accordingly.
For example, let’s select only the samples from senior patients.
tse_sub <- tse[ , tse$age_category == "senior"]
tse_sub
#> class: TreeSummarizedExperiment
#> dim: 304 8
#> metadata(1): agglomerated_by_rank
#> assays(3): counts relabundance clr
#> rownames(304): [Bacteroides] pectinophilus [Clostridium] hylemonae ...
#> Victivallis vadensis Weissella confusa
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(8): GupDM_A4 GupDM_A5 ... GupDM_JO GupDM_JP
#> colData names(29): study_name subject_id ... observed shannon
#> reducedDimNames(1): MDS_bray
#> mainExpName: NULL
#> altExpNames(2): pathway taxonomy
#> rowLinks: a LinkDataFrame (304 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL
# Check that all sample names still match
all( colnames(altExp(tse_sub, "pathway")) == colnames(tse_sub) )
#> [1] TRUE
Now, we can see that the data has been successfully subsetted, with all sample names matching across experiments.
In addition to adding the entire list, we can also add individual experiments. Let’s add the agglomerated data separately to keep the whole dataset organized.
altExp(tse, "family") <- tse_family
tse
#> class: TreeSummarizedExperiment
#> dim: 304 60
#> metadata(1): agglomerated_by_rank
#> assays(3): counts relabundance clr
#> rownames(304): [Bacteroides] pectinophilus [Clostridium] hylemonae ...
#> Victivallis vadensis Weissella confusa
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(29): study_name subject_id ... observed shannon
#> reducedDimNames(1): MDS_bray
#> mainExpName: NULL
#> altExpNames(3): pathway taxonomy family
#> rowLinks: a LinkDataFrame (304 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL
Now that the entire dataset is well-organized and in one place, let’s dive into analyzing the interconnections between experiments. Specifically, we’ll examine whether the abundance of certain bacteria is associated with specific metabolic activity pathway.
We’ll perform a simple cross-association analysis to explore these relationships. First, we’ll preprocess the functional annotation data by agglomerating the pathways based on each pathway’s function, and then further by prevalence.
# Agglomerate based on pathways function
altExp(tse, "pathway_func") <- agglomerateByVariable(
altExp(tse, "pathway"),
by = "rows",
f = "pathway"
)
# Agglomerate based on prevalence
altExp(tse, "pathway_prev") <- agglomerateByPrevalence(
altExp(tse, "pathway_func"),
assay.type = "relabundance",
detection = 0.001,
prevalence = 0.2
)
The data now includes the most prevalent pathways, each representing a single metabolic activity. Next, we apply a transformation as the final preprocessing step for our functional data.
altExp(tse, "pathway_prev") <- transformAssay(
altExp(tse, "pathway_prev"),
assay.type = "relabundance",
method = "clr"
)
altExp(tse, "pathway_prev")
#> class: SummarizedExperiment
#> dim: 24 60
#> metadata(0):
#> assays(2): relabundance clr
#> rownames(24): BRANCHED-CHAIN-AA-SYN-PWY: superpathway of branched amino
#> acid biosynthesis DTDPRHAMSYN-PWY: dTDP-L-rhamnose biosynthesis I ...
#> VALSYN-PWY: L-valine biosynthesis Other
#> rowData names(2): full_name pathway
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(27): study_name subject_id ... disease_stage
#> disease_location
We’re now ready to perform the cross-association analysis. We will examine the associations between microbial families and the most abundant pathways. To visualize the results, we’ll use a heatmap.
# Perform cross-associaton
res <- getCrossAssociation(
tse,
altexp1 = "family",
altexp2 = "pathway_prev",
assay.type1 = "clr",
assay.type2 = "clr",
mode = "matrix",
test.signif = TRUE
)
#> Calculating correlations...
#> altexp1: family, altexp2: pathway_prev, assay.type1: clr, col.var1: -, assay.type2: clr, col.var2: -
#> by: 1, function: stats::cor.test, method: kendall, test.signif: TRUE, p.adj.method: fdr, paired: FALSE, show.warnings: TRUE
#> Converting table into matrices...
# Function for marking significant correlations with "X"
add_signif <- function(j, i, x, y, width, height, fill) {
# If the p-value is under threshold
if( !is.na(res$p_adj[i, j]) & res$p_adj[i, j] < 0.05 ){
# Print "X"
grid.shadowtext(
sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "#f5f5f5"))
}
}
# Create a heatmap
p <- Heatmap(
# Correlation values
res$cor,
# Print values to cells
cell_fun = add_signif,
# Adjust legend
heatmap_legend_param = list(
title = "correlation", legend_height = unit(5, "cm")),
# Adjust column names
column_names_rot = 45,
column_names_max_height = unit(20, "cm")
)
p
For more information on data integration, see the following sections:
After having analyzed associations between microbial families and pathways in the last section, we are now interested in the microbial associations itself, which helps us to understand the complex interplay of microbial communities.
In a typical microbiome network, nodes are ASVs or taxa on a certain rank (here family). The nodes are connected by edges, which represent microbial associations. Higher edge thickness corresponds to higher edge weight. Edges with a positive underlying association are blue, those with a negative association are red.
An essential step in microbial network learning is to obtain an adjacency matrix representing the network from a given read count matrix:
Association estimation methods need to take the typical characteristics of microbiome data into account. Here is a brief overview of the typical data characteristics and common approaches to solve them:
Compositionality
Zero-inflation
High dimensionality (p>>n)
Sparse association estimation (sparsity assumption is needed due to the underdetermined mathematical system). Depending on the desired type of association, there’s a wide range of existing methods and R packages:
These methods are also available through the R package NetCoMi, which we will use to construct and analyze an association network for the gut microbiome.
The complete workflow implemented in NetCoMi
is shown in
the OMA
chapter on network learning.
The function for network construction netConstruct()
offers the possibility to filter samples and taxa
according to certain criteria. Here, we keep only taxa occurring
in at least 10% of the samples.
We use the SPIEC-EASI method to estimate microbial associations. The approach includes two graphical model inference procedures:
We will use glasso here.
SPIEC-EASI uses the StARS (“Stability Approach to Regularization Selection”) method to obtain a sparse association matrix. We set the StARS threshold to 0.1 to get a denser graph (the default is 0.05).
Since zero replacement, normalization, and sparsification (model
selection) is included in SPIEC-EASI, we set the corresponding
netConstruct
arguments to “none”.
net <- netConstruct(
tse_family,
taxRank = "family",
filtTax = "numbSamp",
filtTaxPar = list(numbSamp = 0.1),
measure = "spieceasi",
measurePar = list(method = "glasso",
# nlambda and rep.num should be higher (30 to 50)
nlambda=20,
pulsar.params=list(rep.num=10,
thresh = 0.1), # defines sparsity
norm.params = list(method = "mclr",
types = "count")),
zeroMethod = "none",
normMethod = "none",
sparsMethod = "none",
dissFunc = "signed",
seed = 16092024)
netConstruct()
returns an object of the class
microNet
, which contains all matrices generated during
network construction.
The object also contains an edge list, giving each edge’s estimated association, dissimilarity, and adjacency. Let’s take a quick look at the edges with the highest and lowest edge weights:
edgelist <- net$edgelist1[order(net$edgelist1$adja,
decreasing = TRUE), ]
head(edgelist)
#> v1 v2 asso diss adja
#> 45 Odoribacteraceae Rikenellaceae 0.3771518 0.5580538 0.4419462
#> 10 Akkermansiaceae Odoribacteraceae 0.3090037 0.5877909 0.4122091
#> 53 Rikenellaceae Tannerellaceae 0.2994215 0.5918524 0.4081476
#> 46 Odoribacteraceae Tannerellaceae 0.2967157 0.5929942 0.4070058
#> 17 Bacteroidaceae Tannerellaceae 0.2913105 0.5952686 0.4047314
#> 16 Bacteroidaceae Rikenellaceae 0.2838756 0.5983830 0.4016170
tail(edgelist)
#> v1 v2 asso diss adja
#> 54 Selenomonadaceae Tannerellaceae -0.1088053 0.7445822 0.2554178
#> 23 Clostridiaceae Lactobacillaceae -0.1088343 0.7445920 0.2554080
#> 40 Eubacteriaceae Lactobacillaceae -0.1116851 0.7455485 0.2544515
#> 27 Clostridiaceae Selenomonadaceae -0.1404570 0.7551348 0.2448652
#> 12 Akkermansiaceae Selenomonadaceae -0.1423893 0.7557742 0.2442258
#> 41 Lactobacillaceae Methanobacteriaceae -0.2651769 0.7953543 0.2046457
The microNet
object is now passed to
netAnalyze()
to perform network
analysis.
The function computes several common network characteristics such as centrality measures, hubs, cluster assignment, and global network measures.
The user has several options to choose from, such as a clustering
method, how to define hubs, and whether or not to normalize centrality
values. See the help page ?netAnalyze
for a description of
the arguments.
netprops <- netAnalyze(net,
clustMethod = "cluster_fast_greedy",
hubPar = "eigenvector",
normDeg = FALSE,
gcmHeat = FALSE)
We use NetCoMi's
plot function to visualize the
network.
In the first plot, node colors represent the detected clusters and node sizes are scaled by eigenvector centrality. Hub nodes are highlighted by default. Unconnected nodes are not included in the plot.
Note that nodes are sometimes placed too close together so that the
labels overlap. You may need to play around with the
repulsion
argument until you find a value where the labels
are legible, but also the clusters are still well recognizable.
# Palette with distinguishable colors
nodecols <- c("#E69F00", "#56B4E9", "#009E73", "#CC79A7", "#F0E442",
"#0072B2", "#D55E00")
plot(netprops,
repulsion = 1,
rmSingles = TRUE,
shortenLabels = "none",
labelScale = FALSE,
nodeSize = "eigenvector",
nodeSizeSpread = 3,
nodeColor = "cluster",
colorVec = nodecols,
nodeTransp = 20,
posCol = c("#5C7aee", "#273cc2"),
hubBorderCol = "gray40",
cexNodes = 1.8,
edgeTranspHigh = 20,
title1 = "Network properties highlighted",
showTitle = TRUE,
cexTitle = 2,
mar = c(1, 3, 4, 8))
legend(0.9, 1.1, cex = 1.7, title = "estimated correlation:",
legend = c("+","-"), lty = 1, lwd = 3, col = c("#273cc2","red"),
bty = "n", horiz = TRUE)
par(xpd = TRUE)
legend(0.7, -0.7, cex = 1.5, bty = "n",
legend = c("Node color: Clusters",
"Node size: Eigenvector centrality",
"Hubs: Highest eigen. centrality"))
We now color nodes according to their phylum. The node sizes are proportional to a taxon’s sum of mclr-transformed abundances.
# Generate vector with phylum names for node coloring
phyla <- rowData(tse_family)$phylum
# Replace phyla that occur only once with "other"
phylum_counts <- table(phyla)
phyla <- ifelse(phyla %in% names(phylum_counts[phylum_counts == 1]) |
is.na(phyla), "Other", phyla)
phyla <- as.factor(phyla)
# Make level "Other" the last one
levels(phyla) <- c(setdiff(levels(phyla), "Other"), "Other")
# Assign family names
names(phyla) <- rowData(tse_family)$family
# Node colors (phylum "Other" should be gray)
colors_phylum <- c(nodecols[1:(nlevels(phyla) - 1)], "#999999")
plot(netprops,
repulsion = 1,
rmSingles = TRUE,
shortenLabels = "none",
labelScale = FALSE,
nodeSize = "mclr",
nodeColor = "feature",
featVecCol = phyla,
colorVec = colors_phylum,
nodeTransp = 20,
posCol = c("#5C7aee", "#273cc2"),
highlightHubs = FALSE,
cexNodes = 1.8,
edgeWidth = 0.8,
edgeTranspHigh = 20,
title1 = "Data features highlighted",
showTitle = TRUE,
cexTitle = 2,
mar = c(1, 10, 4, 6))
# Add legends
legend(0.7, 1.1, cex = 1.7, title = "estimated correlation:",
legend = c("+","-"), lty = 1, lwd = 3, col = c("#273cc2","red"),
bty = "n", horiz = TRUE)
# Colors used in the legend should be equally transparent as in the plot
col_transp <- colToTransp(colors_phylum, 20)
legend(-1.8, 1.1, cex = 1.7, pt.cex = 2.5, title = "Phylum:",
legend=levels(phyla), col = col_transp, bty = "n", pch = 16)
par(xpd = TRUE)
legend(0.5, -0.8, cex = 1.5, bty = "n",
legend = c("Node color: Phylum",
"Node size: Sum of mclr abundances"))
In addition to the visual analysis, we can also have a look at the computed network properties.
summary(netprops, numbNodes = 5)
#>
#> Component sizes
#> ```````````````
#> size: 28 1
#> #: 1 8
#> ______________________________
#> Global network properties
#> `````````````````````````
#> Largest connected component (LCC):
#>
#> Relative LCC size 0.77778
#> Clustering coefficient 0.27864
#> Modularity 0.36060
#> Positive edge percentage 77.77778
#> Edge density 0.14286
#> Natural connectivity 0.05141
#> Vertex connectivity 1.00000
#> Edge connectivity 1.00000
#> Average dissimilarity* 0.95366
#> Average path length** 1.97776
#>
#> Whole network:
#>
#> Number of components 9.00000
#> Clustering coefficient 0.27864
#> Modularity 0.36060
#> Positive edge percentage 77.77778
#> Edge density 0.08571
#> Natural connectivity 0.03746
#> -----
#> *: Dissimilarity = 1 - edge weight
#> **: Path length = Units with average dissimilarity
#>
#> ______________________________
#> Clusters
#> - In the whole network
#> - Algorithm: cluster_fast_greedy
#> ````````````````````````````````
#>
#> name: 0 1 2 3 4
#> #: 8 9 7 8 4
#>
#> ______________________________
#> Hubs
#> - In alphabetical/numerical order
#> - Based on empirical quantiles of centralities
#> ```````````````````````````````````````````````
#> Akkermansiaceae
#> Odoribacteraceae
#>
#> ______________________________
#> Centrality measures
#> - In decreasing order
#> - Centrality of disconnected components is zero
#> ````````````````````````````````````````````````
#> Degree (unnormalized):
#>
#> Akkermansiaceae 9
#> Odoribacteraceae 9
#> Clostridiaceae 8
#> Lactobacillaceae 8
#> Rikenellaceae 8
#>
#> Betweenness centrality (normalized):
#>
#> Lactobacillaceae 0.27066
#> Tannerellaceae 0.18234
#> Odoribacteraceae 0.15954
#> Akkermansiaceae 0.14245
#> Eubacteriaceae 0.13960
#>
#> Closeness centrality (normalized):
#>
#> Akkermansiaceae 0.89177
#> Odoribacteraceae 0.88758
#> Rikenellaceae 0.84766
#> Clostridiaceae 0.83774
#> Lactobacillaceae 0.82564
#>
#> Eigenvector centrality (normalized):
#>
#> Odoribacteraceae 1.00000
#> Akkermansiaceae 0.91584
#> Rikenellaceae 0.78864
#> Tannerellaceae 0.76453
#> Clostridiaceae 0.72856
The network consists of one large connected component (lcc) with 28 nodes and 8 unconnected nodes.
Akkermansiaceae and Odoribacteraceae are identified as hub nodes (highly connected families), which are nodes with an eigenvector centrality above the 95% quantile of all computed eigenvector centrality values.
As next step, we might be interested in whether the microbial network differs between colorectal cancer (CRC) patients and control subjects.
This task could be performed with the NetCoMi function
netCompare
, as described in the OMA
chapter on network comparison.
iSEEtree is a
shiny app that provides a Graphical User Interface (GUI) to visualize
and interactively explore TreeSE
objects. We will
demonstrate its capability by reproducing some of the plots that were
previously created in this workshop, this time without the need of
coding (small lie).
As we did before for tse
, we add alpha and beta
diversity metrics also to the agglomerated experiment
tse_family
, so we will be able to visualize these metrics
in the app.
# Add information on alpha diversity
tse_family <- addAlpha(tse_family,
index = c("coverage_diversity", "shannon_diversity"))
# Add information on beta diversity
tse_family <- runMDS(tse_family,
FUN = getDissimilarity,
method = "bray",
assay.type = "relabundance",
ncomp = 10,
name = "MDS_bray")
Launching the app is easy. All you need to do is run the command
iSEE
on the TreeSE
object that you want to
explore.
# Launch app
if (interactive()) {
iSEE(tse_family)
}
The default panel layout does not contain the so-called
ColumnDataTable
panel, which plots variables from the
colData.
Therefore, we need to specify a custom panel
layout with the initial
argument to generate the alpha
diversity plot.
# Launch app with custom layout
if (interactive()) {
iSEE(tse_family, initial = c(ColumnDataTable(), ColumnDataPlot()))
}
If interested, you can explore further functionality and the panel catalogue in the iSEEtree vignettes.
Ideas for self-learning:
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 grid stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] tidyr_1.3.1 patchwork_1.2.0
#> [3] shadowtext_0.1.4 scater_1.33.4
#> [5] scuttle_1.15.4 NetCoMi_1.1.0
#> [7] SpiecEasi_1.1.3 miaViz_1.13.10
#> [9] ggraph_2.2.1 ggplot2_3.5.1
#> [11] mia_1.13.38 MultiAssayExperiment_1.31.5
#> [13] iSEEtree_0.99.6 iSEE_2.17.4
#> [15] ggsignif_0.6.4 dplyr_1.1.4
#> [17] curatedMetagenomicData_3.13.0 TreeSummarizedExperiment_2.13.0
#> [19] Biostrings_2.73.1 XVector_0.45.0
#> [21] SingleCellExperiment_1.27.2 SummarizedExperiment_1.35.1
#> [23] Biobase_2.65.1 GenomicRanges_1.57.1
#> [25] GenomeInfoDb_1.41.1 IRanges_2.39.2
#> [27] S4Vectors_0.43.2 BiocGenerics_0.51.1
#> [29] MatrixGenerics_1.17.0 matrixStats_1.4.1
#> [31] ComplexHeatmap_2.21.0 BiocManager_1.30.25
#>
#> loaded via a namespace (and not attached):
#> [1] nnet_7.3-19 DT_0.33
#> [3] vctrs_0.6.5 digest_0.6.37
#> [5] png_0.1-8 corpcor_1.6.10
#> [7] shape_1.4.6.1 pcaPP_2.0-5
#> [9] rbiom_1.0.3 ggrepel_0.9.6
#> [11] corrplot_0.94 permute_0.9-7
#> [13] mediation_4.5.0 SPRING_1.0.4
#> [15] MASS_7.3-61 pkgdown_2.1.0
#> [17] reshape2_1.4.4 httpuv_1.6.15
#> [19] foreach_1.5.2 withr_3.0.1
#> [21] psych_2.4.6.26 xfun_0.47
#> [23] ggfun_0.1.6 survival_3.7-0
#> [25] memoise_2.0.1 ggbeeswarm_0.7.2
#> [27] systemfonts_1.1.0 gtools_3.9.5
#> [29] ragg_1.3.3 tidytree_0.4.6
#> [31] zoo_1.8-12 GlobalOptions_0.1.2
#> [33] pbapply_1.7-2 Formula_1.2-5
#> [35] KEGGREST_1.45.1 promises_1.3.0
#> [37] httr_1.4.7 rhdf5filters_1.17.0
#> [39] rhdf5_2.49.0 rstudioapi_0.16.0
#> [41] shinyAce_0.4.2 UCSC.utils_1.1.0
#> [43] miniUI_0.1.1.1 generics_0.1.3
#> [45] base64enc_0.1-3 curl_5.2.2
#> [47] zlibbioc_1.51.1 ScaledMatrix_1.13.0
#> [49] polyclip_1.10-7 doSNOW_1.0.20
#> [51] quadprog_1.5-8 GenomeInfoDbData_1.2.12
#> [53] ExperimentHub_2.13.1 SparseArray_1.5.34
#> [55] xtable_1.8-4 stringr_1.5.1
#> [57] desc_1.4.3 ade4_1.7-22
#> [59] doParallel_1.0.17 evaluate_0.24.0
#> [61] S4Arrays_1.5.7 BiocFileCache_2.13.0
#> [63] preprocessCore_1.67.0 glmnet_4.1-8
#> [65] pulsar_0.3.11 irlba_2.3.5.1
#> [67] colorspace_2.1-1 filelock_1.0.3
#> [69] shinyWidgets_0.8.6 magrittr_2.0.3
#> [71] later_1.3.2 viridis_0.6.5
#> [73] ggtree_3.13.1 lattice_0.22-6
#> [75] DECIPHER_3.1.4 cowplot_1.1.3
#> [77] Hmisc_5.1-3 pillar_1.9.0
#> [79] nlme_3.1-166 huge_1.3.5
#> [81] iterators_1.0.14 decontam_1.25.0
#> [83] compiler_4.4.1 beachmat_2.21.6
#> [85] stringi_1.8.4 biomformat_1.33.0
#> [87] tokenizers_0.3.0 minqa_1.2.8
#> [89] plyr_1.8.9 crayon_1.5.3
#> [91] abind_1.4-8 gridGraphics_0.5-1
#> [93] graphlayouts_1.1.1 bit_4.0.5
#> [95] rootSolve_1.8.2.4 mixedCCA_1.6.2
#> [97] sandwich_3.1-1 fastcluster_1.2.6
#> [99] codetools_0.2-20 textshaping_0.4.0
#> [101] BiocSingular_1.21.3 bslib_0.8.0
#> [103] slam_0.1-53 phyloseq_1.49.0
#> [105] GetoptLong_1.0.5 tidytext_0.4.2
#> [107] multtest_2.61.0 mime_0.12
#> [109] splines_4.4.1 circlize_0.4.16
#> [111] Rcpp_1.0.13 dbplyr_2.5.0
#> [113] sparseMatrixStats_1.17.2 qgraph_1.9.8
#> [115] knitr_1.48 blob_1.2.4
#> [117] utf8_1.2.4 here_1.0.1
#> [119] clue_0.3-65 BiocVersion_3.20.0
#> [121] lme4_1.1-35.5 pbivnorm_0.6.0
#> [123] fs_1.6.4 checkmate_2.3.2
#> [125] DelayedMatrixStats_1.27.3 Rdpack_2.6.1
#> [127] ggplotify_0.1.2 lavaan_0.6-18
#> [129] tibble_3.2.1 Matrix_1.7-0
#> [131] statmod_1.5.0 lpSolve_5.6.21
#> [133] tweenr_2.0.3 pkgconfig_2.0.3
#> [135] tools_4.4.1 cachem_1.1.0
#> [137] rbibutils_2.2.16 RSQLite_2.3.7
#> [139] viridisLite_0.4.2 DBI_1.2.3
#> [141] impute_1.79.0 fastmap_1.2.0
#> [143] rmarkdown_2.28 scales_1.3.0
#> [145] shinydashboard_0.7.2 AnnotationHub_3.13.3
#> [147] sass_0.4.9 rpart_4.1.23
#> [149] snow_0.4-4 farver_2.1.2
#> [151] tidygraph_1.3.1 mgcv_1.9-1
#> [153] yaml_2.3.10 VGAM_1.1-11
#> [155] foreign_0.8-87 cli_3.6.3
#> [157] purrr_1.0.2 lifecycle_1.0.4
#> [159] mvtnorm_1.3-1 bluster_1.15.1
#> [161] backports_1.5.0 rintrojs_0.3.4
#> [163] BiocParallel_1.39.0 gtable_0.3.5
#> [165] rjson_0.2.22 limma_3.61.9
#> [167] parallel_4.4.1 ape_5.8
#> [169] SnowballC_0.7.1 jsonlite_1.8.8
#> [171] colourpicker_1.3.0 bit64_4.0.5
#> [173] glasso_1.11 yulab.utils_0.1.7
#> [175] vegan_2.6-8 BiocNeighbors_1.99.0
#> [177] RcppParallel_5.1.9 janeaustenr_1.0.0
#> [179] jquerylib_0.1.4 highr_0.11
#> [181] shinyjs_2.1.0 orca_1.1-2
#> [183] lazyeval_0.2.2 shiny_1.9.1
#> [185] dynamicTreeCut_1.63-1 htmltools_0.5.8.1
#> [187] GO.db_3.19.1 rappdirs_0.3.3
#> [189] glue_1.7.0 rprojroot_2.0.4
#> [191] treeio_1.29.1 mnormt_2.1.1
#> [193] jpeg_0.1-10 gridExtra_2.3
#> [195] boot_1.3-31 igraph_2.0.3
#> [197] R6_2.5.1 fdrtool_1.2.18
#> [199] labeling_0.4.3 listviewer_4.0.0
#> [201] cluster_2.1.6 Rhdf5lib_1.27.0
#> [203] aplot_0.2.3 nloptr_2.1.1
#> [205] DirichletMultinomial_1.47.0 DelayedArray_0.31.11
#> [207] tidyselect_1.2.1 vipor_0.4.7
#> [209] WGCNA_1.72-5 htmlTable_2.4.3
#> [211] ggforce_0.4.2 AnnotationDbi_1.67.0
#> [213] filematrix_1.3 rsvd_1.0.5
#> [215] munsell_0.5.1 data.table_1.16.0
#> [217] htmlwidgets_1.6.4 RColorBrewer_1.1-3
#> [219] rlang_1.1.4 Cairo_1.6-2
#> [221] ggnewscale_0.5.0 fansi_1.0.6
#> [223] beeswarm_0.4.0