Brain, bacteria, and behaviour summer course

July 7 - July 11, 2025

Rendered on: 2025-07-10 11:32

Day 1

Bioconductor

  • Community-driven open-source project
  1. Training programs & workshops
  2. Conferences & community support
  3. Bioinformatics software

Software

  • ~2,300 R packages
  • Review, testing, documentation

Data containers form the core

TreeSummarizedExperiment

TreeSummarizedExperiment class

class: TreeSummarizedExperiment 
dim: 19216 26 
metadata(0):
assays(1): counts
rownames(19216): 549322 522457 ... 200359 271582
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (19216 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL

MIcrobiome Analysis (mia)

  • Microbiome data science ecosystem
  • Distributed through several R packages
  • The mia package is in the top 7.6% of Bioconductor packages by downloads

Community-driven ecosystem of tools

Tools in the ecosystem: mia, MGnifyR, HoloFoodR, iSEE, MultiAssayExperiment (MAE), SummarizedExperiment (SE), SingleCellExperiment (SCE), scater, benchdamic, NetCoMi, radEmu, DESeq2, bioBakery

Advantages

  • Shared data container
  • Scalable & optimized for large datasets
  • Comprehensive documentation

Allows us to develop efficient microbiome data science workflows

Orchestrating Microbiome Analysis with Bioconductor

  • Resources and tutorials for microbiome analysis
  • Community-built best practices
  • Open to contributions!

Go to the Orchestrating Microbiome Analysis (OMA) online book

Reproducible reporting

  • To create human-readable reports
  • Transparency, reusability and reproducibility
  • Debugging

Literate programming

Programming paradigm introduced by Donald Knuth (1984) in which a computer program is given as an explanation of its logic in a natural language, embedded with code chunks, from which compilable source code can be generated.
(Adapted from Wikipedia)

Reproducible notebooks

  • We use Quarto for reproducible documentation.
  • Next-generation successor to R Markdown
  • Supported by RStudio

Demonstration

Exercises

From OMA online book, Chapter 1: Microbiome data science in Bioconductor

microbiome.github.io/OMA/docs/devel/pages/intro.html

  • Exercise 2
  • Exercise 3
  • Exercise 4

Data containers

Demonstration

Exercises

From OMA online book, Chapter 3: Data containers

  • All exercises

From OMA online book, Chapter 10: Subsetting

  • All exercises

Day 2

Preprocessing

Agglomeration

  • Interested only in higher taxonomic ranks
  • Reduce noise

→ Summarize rows at a higher taxonomic rank
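
The summarizing step can be sketched as follows. The course material is R-based; this sketch uses plain Python only to stay dependency-free, and the taxa names, phylum labels, counts, and the helper `agglomerate` are all made up for illustration.

```python
from collections import defaultdict

# Toy count table: one row per taxon, one column per sample (invented numbers).
counts = {
    "genus_A": [10, 0, 3],
    "genus_B": [5, 2, 0],
    "genus_C": [0, 7, 1],
}

# Toy taxonomy table: the higher rank that each row belongs to.
phylum_of = {
    "genus_A": "Firmicutes",
    "genus_B": "Firmicutes",
    "genus_C": "Bacteroidetes",
}


def agglomerate(counts, rank_of):
    """Sum the rows that share the same value at the chosen rank."""
    n_samples = len(next(iter(counts.values())))
    summed = defaultdict(lambda: [0] * n_samples)
    for taxon, row in counts.items():
        group = rank_of[taxon]
        summed[group] = [a + b for a, b in zip(summed[group], row)]
    return dict(summed)


phylum_counts = agglomerate(counts, phylum_of)
print(phylum_counts)
```

Three genus-level rows collapse into two phylum-level rows, while the per-sample totals stay the same.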

Demonstration

Exercises

From OMA online book, Chapter 11: Agglomeration

  • 1.2, 1.3, 1.4, 1.5, 1.6

Microbiome data

High variability

  • The abundance of a feature varies widely from sample to sample

Zero-inflation

  • Many taxa absent in most samples

Compositionality

  • Only relative abundances observed
  • Total counts per sample are arbitrary

Transformation

  • Mitigate biases
  • Make comparable
  • Meet the assumptions of statistical tests

Demonstration

Diversity

Alpha diversity

  • Diversity within a sample
  • Richness vs evenness
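
The difference between richness and evenness can be illustrated with a small sketch in plain Python. The choice of the Shannon index and Pielou's evenness is an assumption made here for illustration (the slides do not name specific indices), and the counts are invented.

```python
import math


def richness(counts):
    """Number of taxa observed (count > 0) in one sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over the observed taxa."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def pielou_evenness(counts):
    """Shannon index divided by its maximum, ln(richness)."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


even_sample = [25, 25, 25, 25]   # four taxa, equally abundant
uneven_sample = [97, 1, 1, 1]    # four taxa, one dominates

for name, sample in [("even", even_sample), ("uneven", uneven_sample)]:
    print(name, richness(sample), round(shannon(sample), 3),
          round(pielou_evenness(sample), 3))
```

Both samples have the same richness, but the dominated sample has a much lower evenness.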

Demonstration

Exercises

From OMA online book, Chapter 12: Transformation

  • 1.2, 1.3, 1.4, 1.5, 1.6

From OMA online book, Chapter 14: Alpha diversity

  • 1.2, 1.3, 1.4, 1.9, 1.10

From OMA online book, Chapter 13: Community composition

  • 1.2, 1.3, 1.4, 1.5

Day 3

Beta diversity

  • Similarity between samples
  • Dissimilarity/distance between samples, clustering, ordination…

Ordination

  • Simplify and visualize high-dimensional data
  • Project data into a lower-dimensional latent space

Ordination methods

  • PCA, PCoA/MDS, RDA, …
  • Euclidean vs non-Euclidean
  • Unsupervised vs supervised

Principal component analysis (PCA)

  • Unsupervised ordination method
  • Euclidean distance
  • Aitchison distance: CLR + Euclidean distance
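
The last bullet can be sketched in a few lines of plain Python: apply the centered log-ratio (CLR) transformation to each sample, then take the Euclidean distance. The samples are invented and strictly positive; handling zeros (for example with a pseudocount) is left out of this sketch.

```python
import math


def clr(x):
    """Centered log-ratio: log of each value minus the mean of the logs."""
    logs = [math.log(v) for v in x]   # requires strictly positive values
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def aitchison(x, y):
    """Aitchison distance = Euclidean distance between CLR-transformed samples."""
    return euclidean(clr(x), clr(y))


sample_a = [10, 20, 70]
sample_b = [100, 200, 700]   # same composition as sample_a, 10x the total count
sample_c = [70, 20, 10]

print(euclidean(sample_a, sample_b))   # large: driven by the total count
print(aitchison(sample_a, sample_b))   # close to zero: the composition is identical
print(aitchison(sample_a, sample_c))   # positive: the composition differs
```

The plain Euclidean distance reacts to the arbitrary total count, whereas the Aitchison distance depends only on the composition.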

Principal coordinate analysis (PCoA)

  • Multidimensional scaling (MDS)
  • Unsupervised ordination method
  • Any dissimilarity metric (e.g., Bray-Curtis dissimilarity)
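
As an illustration of the kind of input PCoA works on, the sketch below computes a pairwise Bray-Curtis dissimilarity matrix in plain Python. The sample names and counts are invented, and the ordination step itself is not shown.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


# Toy samples: counts for the same three taxa in each sample.
samples = {
    "s1": [10, 0, 5],
    "s2": [5, 5, 5],
    "s3": [0, 8, 0],
}

# Pairwise dissimilarity matrix, stored as a dict of dicts.
names = list(samples)
dissimilarity = {
    a: {b: bray_curtis(samples[a], samples[b]) for b in names} for a in names
}

for a in names:
    print(a, [round(dissimilarity[a][b], 3) for b in names])
```

Samples that share no taxa (s1 and s3) reach the maximum dissimilarity of 1, and identical samples give 0.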

Redundancy analysis (RDA)

  • Supervised ordination method
  • Find variance explained by sample metadata

Demonstration

Exercises

From OMA online book, Chapter 15: Community similarity

  • Exercise 1
  • Exercise 2
  • Exercise 3

Differential abundance analysis (DAA)

  • Identify taxa whose abundance differs between groups
  • Classical statistical tests
  • Methods dedicated to microbiome data

Elementary methods provide more replicable results in microbial differential abundance analysis

  • Relative abundances with a Wilcoxon test
  • Log-transformed relative abundances with a t-test
  • Presence/absence of taxa with logistic regression

Pelto et al. 2025
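
The first elementary method compares the relative abundance of one taxon between two groups by rank. Assuming the two-group rank-sum (Mann-Whitney) form of the Wilcoxon test, its test statistic can be sketched in plain Python as below. The group names and values are invented, and the p-value calculation is deliberately left out.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: the number of (a, b) pairs with a > b.

    Tied pairs count as one half.
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Relative abundance of a single taxon in two groups of samples (invented).
control = [0.01, 0.02, 0.03]
case = [0.10, 0.12, 0.15]

u_case = mann_whitney_u(case, control)
u_control = mann_whitney_u(control, case)
print(u_case, u_control)
```

Here every case value exceeds every control value, so U for the case group takes its maximum, the product of the two group sizes.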

Relative abundance

\[ \text{X}_{ij} = \frac{\text{Count}_{ij}}{\sum_{k=1}^m \text{Count}_{ik}} \]

where

  • \(i\) indexes the sample,
  • \(j\) indexes the taxon,
  • \(m\) is the total number of taxa.

Log-transformed relative abundance

\[ \text{X}_{ij} = \log \left( \text{Relative abundance}_{ij} + \epsilon \right) \]

where

  • \(\epsilon\) is a small pseudocount to avoid \(\log(0)\).
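
Both transformations above can be sketched in plain Python for a single sample. The counts are invented and the default pseudocount `epsilon=1e-6` is an arbitrary choice for illustration, not a recommendation.

```python
import math


def relative_abundance(sample_counts):
    """Divide each taxon count by the total count of the sample."""
    total = sum(sample_counts)
    return [count / total for count in sample_counts]


def log_relative_abundance(sample_counts, epsilon=1e-6):
    """Log of (relative abundance + pseudocount); epsilon avoids log(0)."""
    return [math.log(x + epsilon) for x in relative_abundance(sample_counts)]


shallow = [0, 30, 70]     # one sample, three taxa
deep = [0, 300, 700]      # same composition, 10x the total count

print(relative_abundance(shallow))
print(relative_abundance(deep))
print(log_relative_abundance(shallow))
```

The two samples give the same relative abundances even though their total counts differ, and the absent taxon maps to a finite value close to log(epsilon).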

Wilcoxon vs t-test

We are developing a package containing all these methods

Exercises

From OMA online book, Chapter 17: Differential abundance

  • All exercises

Day 4

Multiomics integration

  • Data containers
  • Methods

Alternative experiment

Methods

  • Association
  • Ordination
  • Supervised machine learning

Cross-association

  • CLR + Spearman

Centered log-ratio (CLR)

\[ \text{CLR}(x_i) = \log\left(\frac{x_i}{g(x)}\right),\quad g(x) = \left(\prod_{j=1}^{D} x_j \right)^{1/D} \]

  • \(x_i\): the \(i\)-th component of the composition
  • \(g(x)\): geometric mean of the composition
  • \(D\): number of components in the composition

\[ \text{Arithmetic mean} = \frac{2 + 8 + 32}{3} = \frac{42}{3} = 14 \]

\[ \text{Geometric mean} = \sqrt[3]{2 \cdot 8 \cdot 32} = \sqrt[3]{512} = 8 \]

  • Removes compositional constraints (e.g., constant sum)
  • Allows use of standard statistical tools (e.g. PCA)
  • Symmetric: values centered around zero
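
As a worked example, reusing the composition \((2, 8, 32)\) from the geometric-mean calculation above and assuming the natural logarithm:

\[ \text{CLR}(2, 8, 32) = \left( \log\frac{2}{8},\ \log\frac{8}{8},\ \log\frac{32}{8} \right) = \left( -\log 4,\ 0,\ \log 4 \right) \approx (-1.386,\ 0,\ 1.386) \]

The three values sum to zero, which is what "centered" refers to.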

Pearson correlation

  • “Normalized covariance”
  • Spearman rho = Pearson calculated for ranks of values

\[ r = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } \cdot \sqrt{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } } \]
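
The formula above, and the statement that Spearman's rho is Pearson's correlation calculated on ranks, can be checked with a small sketch in plain Python. The helper names and the data are made up for illustration; tied values receive the average of their ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (math.sqrt(ss_x) * math.sqrt(ss_y))


def ranks(values):
    """Ranks from 1 to n; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]   # increases with x, but not linearly

print(round(pearson(x, y), 4))
print(round(spearman(x, y), 4))
```

Because the relationship is monotone but not linear, Spearman's rho reaches 1 while Pearson's r stays slightly below it.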

Demonstration

Exercises

From OMA online book, Chapter 23: Cross-association

  • All exercises

Contact info

QR code to discussion forum

Feedback form: https://forms.gle/mZRAAWiuFtpPjoZp6

Extra material

Formulas in R

Symbolic way to express a model or relationship between variables

Formula               How to read                                      When to use
y ~ 1                 Model with intercept only                        Intercept-only (null model)
y ~ x                 y is modeled by x                                Simple linear regression
y ~ x + z             y is modeled by x and z (additive effects)       Multiple predictors, no interactions
y ~ x * z             x + z + interaction x:z                          Modeling an interaction between x and z
y ~ x + (1 | group)   y is modeled by x + random intercept for group   Repeated measures, hierarchical data, group-level baseline variation
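
Two of these formulas can also be read as model equations. The sketch below assumes numeric predictors and a Gaussian error term; the coefficient names \(\beta_0, \dots, \beta_3\), the random intercept \(u_j\), and the error term \(\varepsilon\) are notation introduced here for illustration.

For y ~ x * z, with \(i\) indexing the observation:

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i + \varepsilon_i \]

For y ~ x + (1 | group), with \(i\) indexing the observation and \(j\) indexing the group:

\[ y_{ij} = \beta_0 + u_j + \beta_1 x_{ij} + \varepsilon_{ij}, \quad u_j \sim \mathcal{N}(0, \sigma_u^2) \]

In the second model, every group shares the slope \(\beta_1\) but has its own baseline \(\beta_0 + u_j\).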