Authors: Tuomas Borman1
Last modified: 6 November, 2025.


In microbiome research, there is an increasing demand for robust, transparent, and reproducible computational methods. Bioconductor is well-positioned to to fulfill this need by providing high-quality, open-source bioinformatics software. At the core of Bioconductor are standardized data containers that harmonize data management and enable interoperability across the ecosystem.
This workshop introduces the latest advances in microbiome analysis within Bioconductor, focusing on the mia (Microbiome Analysis) framework. The framework is built upon the TreeSummarizedExperiment data container — the central data structure in Bioconductor — and provides a comprehensive suite of statistical tools for microbiome data science. It is continuously expanding with new methods and offers native support for multi-table and hierarchical data, while being computationally more efficient than previous approaches. These features enable users to build modular and reproducible workflows for microbiome analysis.
During the workshop, participants will gain hands-on experience with the TreeSummarizedExperiment data container and the mia framework through a practical case study. The practicals cover common downstream analysis steps, from data import and wrangling to diversity estimation, differential abundance analysis, and visualization. After the workshop, participants will have the skills to continue learning through the freely available Orchestrating Microbiome Analysis (OMA) online book, which promotes best practices and supports adoption of the ecosystem. Together, these resources foster scalable, transparent, and community-driven microbiome data science.
To get most of the training session, you should meet the following pre-requisites.
If your time allows, we recommend to spend some time to explore beforehand Orchestrating Microbiome Analysis (OMA) online book.
Participants are encouraged to ask questions throughout the workshop. The session will follow a tutorial, with participants running the tutorial alongside the instructor.
In this training session, we will cover a common methods and packages for microbiome data science in SummarizedExperiment ecosystem. We will have specific focus on mia, which provides essential methods for conducting microbiome analysis.
| Activity | Time |
|---|---|
| Practicalities and background | 30m |
| Trained-guided live coding | 60m |
| Break | 15m |
| Trained-guided live coding continues | 60m |
| Time to try things out on your own | 30m |
| Questions, discussion and recap | 15m |
| Total | 3h 30m |
Below, we we import a dataset containing 60 samples from healthy controls and patients with colorectal cancer (CRC). First, we import the data files.
library(ape)
dir_name <- file.path("data", "GuptaA_2019")
# Abundance table
path <- file.path(dir_name, "taxonomy_abundance.csv")
assay <- read.csv(path, row.names = 1L)
# Taxonomy table
path <- file.path(dir_name, "taxonomy_table.csv")
taxonomy_table <- read.csv(path, row.names = 1L)
# Sample metadata
path <- file.path(dir_name, "sample_metadata.csv")
sample_metadata <- read.csv(path, row.names = 1L)
# Phylogeny
path <- file.path(dir_name, "phylogeny.tree")
phylogeny <- read.tree(path)Then we create TreeSummarizedExperiment object. Note: data types must be in specific format.
library(mia)
# Abundance table
assay <- assay |> as.matrix()
assay_list <- SimpleList(counts = assay)
# Taxonomy table and sample metadata
taxonomy_table <- taxonomy_table |> DataFrame()
sample_metadata <- sample_metadata |> DataFrame()
# Construct TreeSE
tse <- TreeSummarizedExperiment(
assays = assay_list,
rowData = taxonomy_table,
colData = sample_metadata,
rowTree = phylogeny
)TreeSummarizedExperiment
extends SummarizedExperiment
class by adding a support for microbiome-specific datatypes. These
include, for instance, rowTree slot that can be utilized to
store phylogeny or any other hierarchical presentation of the data. All
slots derived from SummarizedExperiment
class are also available in TreeSummarizedExperiment,
providing full backward compatibility.

tse
#> class: TreeSummarizedExperiment
#> dim: 308 60
#> metadata(0):
#> assays(1): counts
#> rownames(308): species-Escherichia_coli species-Alistipes_putredinis
#> ... species-Campylobacter_ureolyticus
#> species-Prevotella_sp._oral_taxon_376
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(27): study_name subject_id ... disease_stage
#> disease_location
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (308 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULLSlots can be accessed with dedicated accessor functions. For
instance, colData (sample metadata) can be accessed with
colData() function.
# Show only first five rows and columns
colData(tse)[1:5, 1:5]
#> DataFrame with 5 rows and 5 columns
#> study_name subject_id body_site antibiotics_current_use
#> <character> <character> <character> <character>
#> GupDM_A_11 GuptaA_2019 GupDM_A11 stool no
#> GupDM_A_15 GuptaA_2019 GupDM_A15 stool no
#> GupDM_A1 GuptaA_2019 GupDM_A1 stool no
#> GupDM_A10 GuptaA_2019 GupDM_A10 stool no
#> GupDM_A12 GuptaA_2019 GupDM_A12 stool no
#> study_condition
#> <character>
#> GupDM_A_11 CRC
#> GupDM_A_15 CRC
#> GupDM_A1 CRC
#> GupDM_A10 CRC
#> GupDM_A12 CRCThe key functionality of data containers is that it does the sample and feature bookkeeping for us. E.g., we can subset the data container without need for worrying about sample matching between abundance table and sample metadata.
tse[1:10, c(1, 2)]
#> class: TreeSummarizedExperiment
#> dim: 10 2
#> metadata(0):
#> assays(1): counts
#> rownames(10): species-Escherichia_coli species-Alistipes_putredinis ...
#> species-Prevotella_bivia species-Odoribacter_splanchnicus
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(2): GupDM_A_11 GupDM_A_15
#> colData names(27): study_name subject_id ... disease_stage
#> disease_location
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (10 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULLMicrobiome data has unique characteristics, meaning that dealing with such data also poses unique challenges and approaches. The mia package provides methods for performing common operations on microbiome data within the SummarizedExperiment ecosystem.
Below, we agglomerate the data into all available taxonomy levels.
tse <- agglomerateByRanks(tse)At first glance, it might seem that nothing has changed. However, the
agglomerated data is stored in the altExp slot. This slot
keeps track of the sample mapping and stores different versions of the
data.
We can access data agglomeration into the phylum level with the following command:
altExp(tse, "phylum")
#> class: TreeSummarizedExperiment
#> dim: 11 60
#> metadata(1): agglomerated_by_rank
#> assays(1): counts
#> rownames(11): Actinobacteria Bacteroidota ... Synergistetes
#> Verrucomicrobia
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(60): GupDM_A_11 GupDM_A_15 ... GupDM_JO GupDM_JP
#> colData names(27): study_name subject_id ... disease_stage
#> disease_location
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (11 rows)
#> rowTree: 1 phylo tree(s) (11 leaves)
#> colLinks: NULL
#> colTree: NULLThe data looks similar to original data; only the number of rows has
changed. While we could store the phylum-level data in a separate
variable, it’s better to keep it in the altExp slot, as it
maintains consistent sample mapping for us.
Microbiome data is typically zero-inflated, meaning that there are lots of unobserved features. Let’s first visualize the distribution of counts.
library(miaViz)
plotHistogram(tse, assay.type = "counts")
As we can see, the distribution is highly right-skewed. To make the data more normally-distributed, one can apply centered log-ratio transformation.
tse <- transformAssay(
tse,
assay.type = "counts",
method = "rclr",
altexp = altExpNames(tse)
)And when we visualize the distribution…
plotHistogram(tse, assay.type = "rclr")
… we see that the data is centered at zero and exhibit a distribution that is more similar to normal than before.
Another common transformation is relative transformation.
tse <- transformAssay(
tse,
assay.type = "counts",
method = "relabundance",
altexp = altExpNames(tse)
)We can access the transformed data with the following command:
assay(tse, "relabundance")[1:5, 1:5]
#> GupDM_A_11 GupDM_A_15 GupDM_A1
#> species-Escherichia_coli 0.26212849 0.5733251 0.246128460
#> species-Alistipes_putredinis 0.15494579 0.0000000 0.008134657
#> species-Porphyromonas_asaccharolytica 0.07437682 0.0000000 0.003352148
#> species-Bacteroides_uniformis 0.07322334 0.0000000 0.010739912
#> species-Prevotella_stercorea 0.06433625 0.0000000 0.008512455
#> GupDM_A10 GupDM_A12
#> species-Escherichia_coli 0.3415149 0.0128953044
#> species-Alistipes_putredinis 0.0000000 0.0000000000
#> species-Porphyromonas_asaccharolytica 0.0000000 0.0000000000
#> species-Bacteroides_uniformis 0.0000000 0.0026836120
#> species-Prevotella_stercorea 0.0000000 0.0004406144A common visualizing technique is to summarize relative abundances of features with a bar plot. Below, we visualize phyla. To compare controls and CRC patients we can visualize the groups separately.
# Get phylum-level data
tse_phyla <- altExp(tse, "phylum")
# Create a bar plot
plotAbundance(
tse_phyla,
rank = "phylum",
assay.type = "relabundance",
col.var = "disease",
facet.cols = TRUE,
scales = "free_x"
)
From the figure, we observe that especially abundance of Proteobacteria seems to differ between the groups.
Alpha diversity indices can be calculated with
addAlpha().
tse <- addAlpha(tse, assay.type = "counts")The results are stored in colData. By default, the
function returns a set of indices that considers different aspects of
diversity. Below, we visualize Faith’s diversity that assess the
phylogenetic diversity of samples.
plotBoxplot(tse, col.var = "faith_diversity", x = "disease")
From the figure, we can observe that CRC patients have more diverse microbiomes. This may suggest that their gut is colonized by microbes that are not typically present in a healthy gut.
A common beta diversity method is Principal Coordinate Analysis (PCoA) also known as Multi-dimensional Scaling (MDS). It is unsupervised technique that can be utilized to find patterns from the data.
tse <- addMDS(
tse,
assay.type = "counts",
method = "unifrac"
)PCoA results are commonly visualized with a scatter plot. Here we color points based on disease.
library(scater)
plotReducedDim(tse, dimred = "MDS", colour_by = "disease")
We can see clear pattern. CRC patients’ microbiome profile seem to differ from healthy ones.
Next, we can utilize distance-based Redundancy Analysis (dbRDA). It is similar to PCoA, but it specifically aims to assess how much variance or association is accounted to sample covariates.
tse <- addRDA(
tse,
assay.type = "rclr",
method = "euclidean",
formula = x ~ disease + gender
)Similarly, we can visualize the results with a biplot, specific type of scatter plot.
plotRDA(tse, dimred = "RDA", colour.by = "disease")
Feature loadings from the dbRDA analysis offer a first detailed look at the features that are associated with CRC.
plotLoadings(tse, dimred = "RDA", ncomponents = 2L, layout = "lollipop")
For instance, Prevotella copri is positively associated with the first coordinate (or x-axis in our biplot). Because, CRC was also positively associated with the first coordinate, this suggests association between higher abundance of Prevotella copri and CRC.
In differential abundance analysis, we can dig deeper on our analysis. We use maaslin3, which uses linear models to assess differentially abundant (and prevalent) features.
library(maaslin3)
res <- maaslin3(
tse,
assay.type = "counts",
formula = ~ disease,
normalization = "TSS",
transform = "LOG",
output = "maaslin3_output",
verbosity = "ERROR"
)The function generates different outputs. The output includes summary
plot in a directory that was specified by output
argument.
library(knitr)
file_path <- file.path("maaslin3_output", "figures", "summary_plot.png")
include_graphics(file_path)
The results confirm our dbRDA results. There are also other features that are either differentially abundant or prevalent in study groups.
Join us!


sessionInfo()
#> R Under development (unstable) (2025-11-06 r88990)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] knitr_1.50 maaslin3_1.3.0
#> [3] scater_1.39.0 scuttle_1.21.0
#> [5] miaViz_1.19.1 ggraph_2.2.2
#> [7] ggplot2_4.0.0 mia_1.17.11
#> [9] TreeSummarizedExperiment_2.19.0 Biostrings_2.79.2
#> [11] XVector_0.51.0 SingleCellExperiment_1.33.0
#> [13] MultiAssayExperiment_1.37.1 SummarizedExperiment_1.41.0
#> [15] Biobase_2.71.0 GenomicRanges_1.63.0
#> [17] Seqinfo_1.1.0 IRanges_2.45.0
#> [19] S4Vectors_0.49.0 BiocGenerics_0.57.0
#> [21] generics_0.1.4 MatrixGenerics_1.23.0
#> [23] matrixStats_1.5.0 ape_5.8-1
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.6.0 ggplotify_0.1.3
#> [3] tibble_3.3.0 cellranger_1.1.0
#> [5] polyclip_1.10-7 DirichletMultinomial_1.53.0
#> [7] lifecycle_1.0.4 lattice_0.22-7
#> [9] MASS_7.3-65 SnowballC_0.7.1
#> [11] magrittr_2.0.4 sass_0.4.10
#> [13] rmarkdown_2.30 jquerylib_0.1.4
#> [15] yaml_2.3.10 pbapply_1.7-4
#> [17] cowplot_1.2.0 DBI_1.2.3
#> [19] RColorBrewer_1.1-3 multcomp_1.4-29
#> [21] abind_1.4-8 purrr_1.2.0
#> [23] fillpattern_1.0.2 yulab.utils_0.2.1
#> [25] TH.data_1.1-4 tweenr_2.0.3
#> [27] rappdirs_0.3.3 sandwich_3.1-1
#> [29] gdtools_0.4.4 ggrepel_0.9.6
#> [31] irlba_2.3.5.1 tokenizers_0.3.0
#> [33] tidytree_0.4.6 vegan_2.7-2
#> [35] BiocStyle_2.39.0 rbiom_2.2.1
#> [37] parallelly_1.45.1 pkgdown_2.2.0
#> [39] permute_0.9-8 DelayedMatrixStats_1.33.0
#> [41] getopt_1.20.4 codetools_0.2-20
#> [43] DelayedArray_0.37.0 ggtext_0.1.2
#> [45] xml2_1.4.1 ggforce_0.5.0
#> [47] tidyselect_1.2.1 aplot_0.2.9
#> [49] farver_2.1.2 ScaledMatrix_1.19.0
#> [51] viridis_0.6.5 jsonlite_2.0.0
#> [53] BiocNeighbors_2.5.0 decontam_1.31.0
#> [55] tidygraph_1.3.1 survival_3.8-3
#> [57] emmeans_2.0.0 systemfonts_1.3.1
#> [59] tools_4.6.0 ggnewscale_0.5.2
#> [61] treeio_1.35.0 ragg_1.5.0
#> [63] Rcpp_1.1.0 glue_1.8.0
#> [65] gridExtra_2.3 SparseArray_1.11.1
#> [67] xfun_0.54 mgcv_1.9-4
#> [69] dplyr_1.1.4 withr_3.0.2
#> [71] BiocManager_1.30.26 fastmap_1.2.0
#> [73] bluster_1.21.0 digest_0.6.38
#> [75] rsvd_1.0.5 R6_2.6.1
#> [77] gridGraphics_0.5-1 estimability_1.5.1
#> [79] textshaping_1.0.4 tidyr_1.3.1
#> [81] fontLiberation_0.1.0 DECIPHER_3.7.0
#> [83] graphlayouts_1.2.2 htmlwidgets_1.6.4
#> [85] S4Arrays_1.11.0 pkgconfig_2.0.3
#> [87] gtable_0.3.6 S7_0.2.0
#> [89] janeaustenr_1.0.0 htmltools_0.5.8.1
#> [91] fontBitstreamVera_0.1.1 logging_0.10-108
#> [93] scales_1.4.0 png_0.1-8
#> [95] optparse_1.7.5 ggfun_0.2.0
#> [97] tzdb_0.5.0 reshape2_1.4.5
#> [99] coda_0.19-4.1 nlme_3.1-168
#> [101] cachem_1.1.0 zoo_1.8-14
#> [103] stringr_1.6.0 parallel_4.6.0
#> [105] vipor_0.4.7 desc_1.4.3
#> [107] pillar_1.11.1 grid_4.6.0
#> [109] vctrs_0.6.5 slam_0.1-55
#> [111] BiocSingular_1.27.0 beachmat_2.27.0
#> [113] xtable_1.8-4 cluster_2.1.8.1
#> [115] beeswarm_0.4.0 evaluate_1.0.5
#> [117] readr_2.1.5 mvtnorm_1.3-3
#> [119] cli_3.6.5 compiler_4.6.0
#> [121] rlang_1.1.6 crayon_1.5.3
#> [123] tidytext_0.4.3 labeling_0.4.3
#> [125] plyr_1.8.9 fs_1.6.6
#> [127] ggbeeswarm_0.7.2 ggiraph_0.9.2
#> [129] stringi_1.8.7 viridisLite_0.4.2
#> [131] BiocParallel_1.45.0 lazyeval_0.2.2
#> [133] fontquiver_0.2.1 Matrix_1.7-4
#> [135] hms_1.1.4 patchwork_1.3.2
#> [137] sparseMatrixStats_1.23.0 gridtext_0.1.5
#> [139] igraph_2.2.1 memoise_2.0.1
#> [141] bslib_0.9.0 ggtree_4.1.1
#> [143] readxl_1.4.5University of Turku, tvborm@utu.fi↩︎