Brain, bacteria, and behaviour summer course

July 7 - July 11, 2025

Rendered on: 2025-07-10 11:32

Day 1

Bioconductor

Community-driven open-source project

Training programs & workshops
Conferences & community support
Bioinformatics software

Software

~2,300 R packages
Review, testing, documentation

Data containers form the core

TreeSummarizedExperiment

class: TreeSummarizedExperiment 
dim: 19216 26 
metadata(0):
assays(1): counts
rownames(19216): 549322 522457 ... 200359 271582
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (19216 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL

MIcrobiome Analysis (mia)

Microbiome data science ecosystem
Distributed through several R packages
The mia package is in the top 7.6% of Bioconductor packages by downloads

Community-driven ecosystem of tools

mia (Data analysis)
miaViz (Visualization)
miaSim (Simulation)
miaTime (Time series analysis)
miaDash (Graphical user interface)
iSEETree (Interactive visualization)
Expanded by independent developers

Logos shown: mia, MGnifyR, HoloFoodR, iSEE, MAE, SE, SCE, scater, benchdamic, netcomi, radEmu, DESeq2, Biobakery.

Advantages

Shared data container
Scalable & optimized for large datasets
Comprehensive documentation

Allows us to develop efficient microbiome data science workflows

Orchestrating Microbiome Analysis with Bioconductor

Resources and tutorials for microbiome analysis
Community-built best practices
Open to contributions!

Go to the Orchestrating Microbiome Analysis (OMA) online book

microbiome.github.io/OMA

Reproducible reporting

To create human-readable reports
Transparency, reusability and reproducibility
Debugging

Literate programming

Programming paradigm introduced by Donald Knuth (1984) in which a computer program is given as an explanation of its logic in a natural language, embedded with code chunks, from which compilable source code can be generated.
(Adapted from Wikipedia)

Reproducible notebooks

We use Quarto for reproducible documentation.
Next-generation successor to R Markdown
Supported by RStudio

Demonstration

Exercises

From OMA online book, Chapter 1: Microbiome data science in Bioconductor

microbiome.github.io/OMA/docs/devel/pages/intro.html

Exercise 2
Exercise 3
Exercise 4

Data containers

Demonstration

Exercises

From OMA online book, Chapter 3: Data containers

All exercises

From OMA online book, Chapter 10: Subsetting

All exercises

Day 2

Preprocessing

Agglomeration

Often only the higher taxonomic ranks are of interest
Reduce noise

→ Summarize rows to a higher taxonomic rank
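
The idea of summarizing rows to a higher rank can be sketched in base R. This is only an illustration on a made-up count table with a hypothetical `phylum` vector; it is not the mia workflow used in the demonstration, which operates on a TreeSummarizedExperiment.

```r
# Toy count table: 5 taxa (rows) x 3 samples (columns); all values are made up.
counts <- matrix(
  c(10, 0, 3,
     5, 2, 0,
     0, 7, 1,
     4, 4, 4,
     1, 0, 9),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("taxon", 1:5), paste0("sample", 1:3))
)

# Hypothetical phylum assignment for each row (taxon).
phylum <- c("Firmicutes", "Firmicutes",
            "Bacteroidota", "Bacteroidota",
            "Proteobacteria")

# rowsum() adds up all rows that share the same group label.
phylum_counts <- rowsum(counts, group = phylum)
print(phylum_counts)

# Agglomeration only merges rows, so the per-sample totals are unchanged.
print(colSums(counts))
print(colSums(phylum_counts))
```

The table shrinks from five rows to three, one per phylum, while each sample keeps its total count.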

Demonstration

Exercises

From OMA online book, Chapter 11: Agglomeration

1.2, 1.3, 1.4, 1.5, 1.6

Microbiome data

High variability

The abundance of a feature varies widely from sample to sample

Zero-inflation

Many taxa absent in most samples

Compositionality

Only relative abundances observed
Total counts per sample are arbitrary
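
A small base R sketch of what compositionality means in practice, using a made-up three-taxon community: sequencing depth does not change the relative abundances, and growth of one taxon lowers the relative abundance of the others even though their absolute amounts stay the same.

```r
# Hypothetical absolute abundances of three taxa in one community.
community <- c(taxonA = 100, taxonB = 300, taxonC = 600)

to_relative <- function(x) x / sum(x)

# The same community sequenced at two different depths.
shallow <- community
deep <- community * 10
print(all.equal(to_relative(shallow), to_relative(deep)))

# Only taxonA grows; taxonB and taxonC are unchanged in absolute terms.
grown <- community
grown["taxonA"] <- 1000
print(to_relative(community))
print(to_relative(grown))
```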

Transformation

Mitigate biases
Make comparable
Meet the assumptions of statistical tests

Demonstration

Diversity

Alpha diversity

Diversity within a sample
Richness vs evenness
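
To make the difference between richness and evenness concrete, here is a minimal base R sketch with two made-up samples. The helper names (`richness`, `shannon`, `pielou`) are defined here for illustration; observed richness, the Shannon index, and Pielou's evenness are used as example indices and are not the only choices.

```r
# Observed richness: number of taxa with a non-zero count.
richness <- function(x) sum(x > 0)

# Shannon index: -sum(p * log(p)) over the taxa that are present.
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

# Pielou's evenness: Shannon index divided by its maximum, log(richness).
pielou <- function(x) shannon(x) / log(richness(x))

even_sample <- c(25, 25, 25, 25)   # four taxa, equally abundant
uneven_sample <- c(97, 1, 1, 1)    # four taxa, one dominates

for (s in list(even_sample, uneven_sample)) {
  cat("richness:", richness(s),
      " shannon:", round(shannon(s), 3),
      " evenness:", round(pielou(s), 3), "\n")
}
```

Both samples have the same richness, but the dominated sample has a much lower Shannon index and evenness.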

Demonstration

Exercises

From OMA online book, Chapter 12: Transformation

1.2, 1.3, 1.4, 1.5, 1.6

From OMA online book, Chapter 14: Alpha diversity

1.2, 1.3, 1.4, 1.9, 1.10

From OMA online book, Chapter 13: Community composition

1.2, 1.3, 1.4, 1.5

Day 3

Beta diversity

Similarity between samples
Dissimilarity/distance between samples, clustering, ordination…

Ordination

Simplify and visualize high-dimensional data
Projects data into lower dimensional latent space

Ordination methods

PCA, PCoA/MDS, RDA, …
Euclidean vs non-Euclidean
Unsupervised vs supervised

Principal component analysis (PCA)

Unsupervised ordination method
Euclidean distance
Aitchison distance: CLR + Euclidean distance
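
The relation "Aitchison distance = CLR + Euclidean distance" can be checked with a short base R sketch. The counts are simulated, and a pseudocount of 1 is added as an assumption so that the logarithm is defined; `clr` is a helper defined here, not a function from the course material.

```r
set.seed(1)

# Simulated counts: 6 samples (rows) x 4 taxa (columns).
# Adding 1 guarantees there are no zeros before taking logs.
counts <- matrix(
  rpois(24, lambda = 20) + 1, nrow = 6,
  dimnames = list(paste0("sample", 1:6), paste0("taxon", 1:4))
)

# CLR of one sample: log abundances centered by their mean.
clr <- function(x) log(x) - mean(log(x))
clr_mat <- t(apply(counts, 1, clr))

# Aitchison distance: Euclidean distance between CLR-transformed samples.
aitchison <- dist(clr_mat)

# PCA on the CLR-transformed table.
pca <- prcomp(clr_mat, center = TRUE, scale. = FALSE)

# PCA only rotates the data, so distances in the full PC space
# match the Aitchison distances.
print(all.equal(as.vector(dist(pca$x)), as.vector(aitchison)))
print(round(pca$x[, 1:2], 3))
```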

Principal coordinate analysis (PCoA)

Multidimensional scaling (MDS)
Unsupervised ordination method
Any dissimilarity metric (e.g., Bray-Curtis dissimilarity)
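
A minimal base R sketch of PCoA on Bray-Curtis dissimilarities, using a made-up table of four samples. The `bray_curtis` helper is written out here for illustration; `cmdscale()` from the stats package performs classical multidimensional scaling.

```r
# Bray-Curtis dissimilarity between two count vectors.
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Toy counts: 4 samples (rows) x 3 taxa (columns); values are made up.
counts <- matrix(
  c(10,  0, 5,
     8,  1, 6,
     0, 12, 3,
     1, 10, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample", 1:4), paste0("taxon", 1:3))
)

# Pairwise dissimilarity matrix.
n <- nrow(counts)
d <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(counts[i, ], counts[j, ])
  }
}
print(round(d, 3))

# Classical MDS (PCoA) into two dimensions.
pcoa <- cmdscale(as.dist(d), k = 2)
print(round(pcoa, 3))
```

Samples 1 and 2 have similar profiles, as do samples 3 and 4, so the two pairs end up far apart in the dissimilarity matrix.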

Redundancy analysis (RDA)

Supervised ordination method
Find variance explained by sample metadata
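
The core idea of RDA, regressing the abundance table on sample metadata and then ordinating the fitted values, can be sketched in base R. Everything below is simulated, and the variable names are hypothetical; dedicated implementations add scaling options and significance tests that this sketch leaves out.

```r
set.seed(42)
n <- 20

# Hypothetical binary sample metadata (e.g. control = 0, treatment = 1).
group <- rep(c(0, 1), each = n / 2)

# Simulated, already transformed abundances: 20 samples x 3 taxa.
# Only taxon1 is shifted by the group variable.
Y <- cbind(
  taxon1 = rnorm(n) + 2 * group,
  taxon2 = rnorm(n),
  taxon3 = rnorm(n)
)
Y <- scale(Y, center = TRUE, scale = FALSE)

# Step 1: multivariate linear regression of the table on the metadata.
fit <- lm(Y ~ group)
Y_hat <- fitted(fit)

# Step 2: share of total variance captured by the fitted values.
total_var <- sum(apply(Y, 2, var))
constrained_var <- sum(apply(Y_hat, 2, var))
prop_explained <- constrained_var / total_var
print(round(prop_explained, 3))

# Step 3: PCA of the fitted values gives the constrained axes.
rda_axes <- prcomp(Y_hat)
print(round(rda_axes$sdev, 3))
```

With a single metadata variable the fitted values vary along one direction only, so only the first constrained axis has non-zero variance.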

Demonstration

Exercises

From OMA online book, Chapter 15: Community similarity

Exercise 1
Exercise 2
Exercise 3

Differential abundance analysis (DAA)

Identify taxa whose abundance differs between groups
Classical statistical tests
Methods designed specifically for microbiome data

Elementary methods provide more replicable results in microbial differential abundance analysis

Relative abundances with a Wilcoxon test
Log-transformed relative abundances with a t-test
Presence/absence of taxa with logistic regression

Pelto et al. 2025
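
The three elementary approaches listed above can be run with base R alone. The sketch below uses simulated counts for two groups; the group labels, the taxon names, and the pseudocount choice (half the smallest non-zero relative abundance) are assumptions made for illustration and are not taken from the cited paper.

```r
set.seed(123)
n_per_group <- 15
n <- 2 * n_per_group
group <- factor(rep(c("control", "case"), each = n_per_group),
                levels = c("control", "case"))

# Simulated counts: 30 samples (rows) x 3 taxa (columns).
# taxon1 is simulated to be more abundant in cases.
counts <- cbind(
  taxon1 = rnbinom(n, size = 1, mu = ifelse(group == "case", 8, 2)),
  taxon2 = rnbinom(n, size = 1, mu = 20),
  taxon3 = rpois(n, lambda = 50)
)
rel <- counts / rowSums(counts)
taxon_rel <- rel[, "taxon1"]

# 1. Wilcoxon test on relative abundances.
wilcox_p <- wilcox.test(taxon_rel ~ group, exact = FALSE)$p.value

# 2. t-test on log-transformed relative abundances.
#    The pseudocount is an assumption of this sketch.
eps <- min(rel[rel > 0]) / 2
log_rel <- log(taxon_rel + eps)
t_p <- t.test(log_rel ~ group)$p.value

# 3. Logistic regression on presence/absence.
present <- as.integer(counts[, "taxon1"] > 0)
logit_fit <- glm(present ~ group, family = binomial)
logit_p <- coef(summary(logit_fit))["groupcase", "Pr(>|z|)"]

print(c(wilcoxon = wilcox_p, t_test = t_p, logistic = logit_p))
```

In a real analysis the same tests would be repeated for every taxon and the p-values adjusted for multiple testing, for example with `p.adjust()`.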

Relative abundance

\[ \text{X}_{ij} = \frac{\text{Count}_{ij}}{\sum_{k=1}^m \text{Count}_{ik}} \]

where

\(i\) indexes the sample,
\(j\) indexes the taxon,
\(m\) is the total number of taxa.

Log-transformed relative abundance

\[ \text{X}_{ij} = \log \left( \text{Relative abundance}_{ij} + \epsilon \right) \]

where

\(\epsilon\) is a small pseudocount to avoid \(\log(0)\).
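
Both transformations can be reproduced in a few lines of base R. The sketch follows the indexing of the formulas (samples in rows, taxa in columns); the counts and the value of `epsilon` are made up for illustration.

```r
# Toy counts: 2 samples (rows, index i) x 3 taxa (columns, index j).
counts <- matrix(
  c(10, 30, 60,
     0,  5, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample1", "sample2"), c("taxonA", "taxonB", "taxonC"))
)

# Relative abundance: each count divided by its sample total.
rel_abund <- counts / rowSums(counts)
print(rel_abund)

# Without a pseudocount, the zero in sample2 becomes -Inf.
print(log(rel_abund))

# Log-transformed relative abundance with a small pseudocount.
epsilon <- 1e-6   # illustrative value, an assumption of this sketch
log_rel <- log(rel_abund + epsilon)
print(round(log_rel, 3))
```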

Wilcoxon vs t-test

Elementary methods provide more replicable results in microbial differential abundance analysis

We are developing a package containing all these methods

Exercises

From OMA online book, Chapter 17: Differential abundance

All exercises

Day 4

Multiomics integration

Data containers
Methods

Alternative experiment

altExp()
Slot in TreeSummarizedExperiment
One-to-one sample mapping

Methods

Association
Ordination
Supervised machine learning

Cross-association

CLR + Spearman

Centered log-ratio (CLR)

\[ \text{CLR}(x_i) = \log\left(\frac{x_i}{g(x)}\right),\quad g(x) = \left(\prod_{j=1}^{D} x_j \right)^{1/D} \]

\(x_i\): the \(i\)-th component of the composition
\(g(x)\): geometric mean of the composition
\(D\): number of components in the composition

\[ \text{Arithmetic mean} = \frac{2 + 8 + 32}{3} = \frac{42}{3} = 14 \]

\[ \text{Geometric mean} = \sqrt[3]{2 \cdot 8 \cdot 32} = \sqrt[3]{512} = 8 \]

Removes compositional constraints (e.g., constant sum)
Allows use of standard statistical tools (e.g. PCA)
Symmetric: values centered around zero
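
The worked example with 2, 8, and 32 can be checked in base R, along with two of the properties listed above. `geometric_mean` and `clr` are helper functions defined here for illustration.

```r
geometric_mean <- function(x) exp(mean(log(x)))
clr <- function(x) log(x / geometric_mean(x))

x <- c(2, 8, 32)

print(mean(x))             # arithmetic mean: 14
print(geometric_mean(x))   # geometric mean: 8

# CLR values: log(2/8), log(8/8), log(32/8)
print(clr(x))

# Centered around zero: the CLR values of a sample sum to zero.
print(sum(clr(x)))

# Scale invariance: multiplying all components by a constant
# (e.g. a different sequencing depth) leaves the CLR values unchanged.
print(all.equal(clr(x), clr(10 * x)))
```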

Pearson correlation

“Normalized covariance”
Spearman's rho = Pearson correlation computed on the ranks of the values

\[ r = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } \cdot \sqrt{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } } \]
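
The formula translates directly into base R, and applying it to ranks gives Spearman's rho. The `pearson` helper and the data are made up for illustration; the built-in `cor()` is used only as a cross-check.

```r
# Pearson correlation written out term by term from the formula.
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

# A monotonic but clearly non-linear relationship.
x <- c(1, 2, 3, 4, 5)
y <- c(1, 4, 9, 16, 100)

# Pearson: below 1 because the relationship is not linear.
print(pearson(x, y))
print(cor(x, y))

# Spearman: Pearson on the ranks; equals 1 because y increases with x.
spearman <- pearson(rank(x), rank(y))
print(spearman)
print(cor(x, y, method = "spearman"))
```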

Demonstration

Exercises

From OMA online book, Chapter 23: Cross-association

All exercises

Contact info

tvborm@utu.fi & t.f.s.bastiaanssen@amsterdamumc.nl
github.com/microbiome/OMA/discussions
community-bioc.zulipchat.com

Feedback form: https://forms.gle/mZRAAWiuFtpPjoZp6

Extra material

Formulas in R

Symbolic way to express a model or relationship between variables

| Formula | How to Read | When to Use |
|---|---|---|
| `y ~ 1` | y is modeled by an intercept only | Null (intercept-only) model |
| `y ~ x` | y is modeled by x | Simple linear regression |
| `y ~ x + z` | y is modeled by x and z (additive effects) | Multiple predictors, no interactions |
| `y ~ x * z` | y is modeled by x, z, and their interaction x:z | Modeling an interaction between x and z |
| `y ~ x + (1 \| group)` | y is modeled by x plus a random intercept for group | Repeated measures, hierarchical data, group-level baseline variation |
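
One way to see what a formula means is to inspect the design matrix it produces. The base R sketch below does this for the fixed-effect formulas in the table, using a simulated data frame `toy` (a hypothetical name). The random-intercept syntax `(1 | group)` is not handled by base R's `lm()`; it requires a mixed-model package such as lme4 and is left out here.

```r
set.seed(1)

# Simulated data with a known interaction between x and z.
toy <- data.frame(x = rnorm(30), z = rnorm(30))
toy$y <- 1 + 2 * toy$x - toy$z + 0.5 * toy$x * toy$z + rnorm(30, sd = 0.1)

# Columns of the design matrix show which terms each formula expands to.
print(colnames(model.matrix(y ~ 1, data = toy)))
print(colnames(model.matrix(y ~ x, data = toy)))
print(colnames(model.matrix(y ~ x + z, data = toy)))
print(colnames(model.matrix(y ~ x * z, data = toy)))

# Fitting the interaction model estimates one coefficient per column.
fit <- lm(y ~ x * z, data = toy)
print(round(coef(fit), 2))
```

`y ~ x * z` expands to the same columns as `y ~ x + z + x:z`, which is why the table reads it as "x, z, and their interaction".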