Transformation

Tuesday, February 3, 2026

Microbiome data

  • High variability — feature abundances vary widely across samples
  • Zero-inflation — many taxa absent in most samples
  • Compositionality — only relative abundances observed; total counts per sample are arbitrary

Heteroscedasticity

Zero-inflation

Compositionality

  • Sequencing measures relative abundances, not true microbial counts
  • Values are parts of a whole
  • Parts are linked by a fixed total (library size)
    • If one increases, at least one other must decrease
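
A minimal base-R sketch of this dependence, using made-up counts for three hypothetical taxa. Only taxon A changes in absolute terms, yet the relative abundances of B and C drop because the parts must still sum to one.

```r
# Hypothetical absolute counts for three taxa (illustrative values only)
before <- c(A = 100, B = 50, C = 50)
after  <- c(A = 300, B = 50, C = 50)  # only taxon A grows

# Closure: divide by the total so that the parts sum to 1
rel_before <- before / sum(before)
rel_after  <- after / sum(after)

# B and C are unchanged in absolute terms but shrink in relative terms
print(rbind(before = rel_before, after = rel_after))
```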

Goals of transformation

  • Mitigate biases in abundances
  • Make features comparable across samples
  • Meet assumptions of downstream statistical analyses

Relative abundance

Library sizes are arbitrary

\[ \text{X}_{ij} = \frac{\text{Count}_{ij}}{\sum_{k=1}^m \text{Count}_{ik}} \]

where

  • \(i\) indexes the sample,
  • \(j\) indexes the taxon,
  • \(m\) is the total number of taxa.
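
The formula can be sketched in base R as follows. The count matrix is hypothetical and is laid out with samples in rows and taxa in columns to match the indices \(i\) and \(j\) above; note that mia objects store taxa in rows instead.

```r
# Hypothetical count matrix: rows are samples (i), columns are taxa (j)
counts <- matrix(
  c(10, 30, 60,
     0,  5, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample1", "sample2"), c("taxonA", "taxonB", "taxonC"))
)

# Divide each count by the library size (row sum) of its sample
relabundance <- counts / rowSums(counts)

print(relabundance)
print(rowSums(relabundance))  # every sample sums to 1
```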

Log-transformed relative abundance

Library sizes are arbitrary and data are zero-inflated

\[ \text{X}_{ij} = \log \left( \text{Relative abundance}_{ij} + \epsilon \right) \]

where

  • \(\epsilon\) is a small pseudocount to avoid \(\log(0)\).
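
A small base-R sketch of why the pseudocount is needed. The relative abundances and the value of `epsilon` are illustrative assumptions; in practice the choice of pseudocount can influence downstream results.

```r
# Hypothetical relative abundances for one sample, including an absent taxon
relabundance <- c(taxonA = 0.7, taxonB = 0.3, taxonC = 0)

# Without a pseudocount the absent taxon maps to -Inf
raw_log <- log(relabundance)
print(raw_log)

# Assumed pseudocount; any small positive value avoids log(0)
epsilon <- 1e-6
log_relabundance <- log(relabundance + epsilon)
print(log_relabundance)
```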

Centered log-ratio (CLR)

Library sizes are arbitrary and values are dependent

\[ \text{CLR}(x_i) = \log\left(\frac{x_i}{g(x)}\right),\quad g(x) = \left(\prod_{j=1}^{D} x_j \right)^{1/D} \]

  • \(x_i\): abundance of taxon \(i\) in the sample
  • \(D\): number of taxa in the sample
  • \(g(x)\): geometric mean across all taxa in the sample
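
A minimal base-R sketch of the definition for a single sample. The function name `clr` and the input values are illustrative, and the function assumes strictly positive abundances, so zeros would need a pseudocount first.

```r
# CLR for one sample: log of each abundance relative to the geometric mean
clr <- function(x) {
  stopifnot(all(x > 0))       # log(0) is undefined; handle zeros beforehand
  g <- exp(mean(log(x)))      # geometric mean g(x)
  log(x / g)
}

x <- c(taxonA = 2, taxonB = 8, taxonC = 32)  # hypothetical abundances
clr_x <- clr(x)
print(clr_x)
```

Taxa above the geometric mean get positive values and taxa below it get negative values.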

Intuition

  1. Values are dependent, so only ratios matter
  2. Standard statistics rely on subtraction and distance
    • Use log-scale
  3. Library sizes are arbitrary
    • Center the data

On the log scale, a ratio becomes a difference:

\[ \log\left(\frac{x}{y}\right) = \log(x) - \log(y) \]

Centering

\[ \begin{aligned} \text{CLR}(x_i) &= \log\left(\frac{x_i}{g(x)}\right) \\[2mm] &= \log(x_i) - \log(g(x)) \end{aligned} \]

\[ \text{Arithmetic mean} = \frac{2 + 8 + 32}{3} = \frac{42}{3} = 14 \]

\[ \begin{aligned} \text{Geometric mean} &= \sqrt[3]{2 \cdot 8 \cdot 32} \\[2mm] &= \exp\left( \frac{\log 2 + \log 8 + \log 32}{3} \right) \\[2mm] &= 8 \end{aligned} \]
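
The two means can be checked with a few lines of base R. This is only a sketch of the arithmetic above; the variable names are illustrative.

```r
x <- c(2, 8, 32)

arithmetic_mean    <- mean(x)                   # (2 + 8 + 32) / 3
geometric_mean     <- prod(x)^(1 / length(x))   # cube root of 2 * 8 * 32
geometric_mean_log <- exp(mean(log(x)))         # same quantity via the log scale

print(c(arithmetic = arithmetic_mean,
        geometric = geometric_mean,
        geometric_via_log = geometric_mean_log))
```

The geometric mean is the exponential of the arithmetic mean of the logs, which is why centering on the log scale uses it.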

Centered log-ratio (CLR)

\[ \begin{aligned} \text{CLR}(x_i) &= \log\left(\frac{x_i}{g(x)}\right) \\[2mm] &= \log(x_i) - \log(g(x)) \\[2mm] &= \log(x_i) - \frac{1}{D}\sum_{j=1}^{D}\log(x_j) \end{aligned} \]
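
The last line of the derivation suggests a short base-R sketch of why centering deals with arbitrary library sizes: multiplying every abundance in a sample by the same factor shifts all logs equally, and subtracting the mean log removes that shift. The values and the factor of 10 are illustrative assumptions.

```r
# CLR written as in the last line: log(x_i) minus the mean of the logs
clr <- function(x) log(x) - mean(log(x))

x        <- c(2, 8, 32)  # hypothetical abundances in one sample
x_deeper <- 10 * x       # same composition, library size 10 times larger

clr_x        <- clr(x)
clr_x_deeper <- clr(x_deeper)

print(clr_x)
print(clr_x_deeper)  # matches clr_x up to floating-point error
print(sum(clr_x))    # CLR values of a sample sum to (numerically) zero
```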

The correct transformation depends on

  1. Your research question
  2. Your analysis method

Use the simplest justified approach that works

Demonstration

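library(mia)

# Example data (assumed from the output below): GlobalPatterns, shipped with mia
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
# Add a relative abundance assay alongside the raw counts
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance")
print(tse)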
class: TreeSummarizedExperiment 
dim: 19216 26 
metadata(0):
assays(2): counts relabundance
rownames(19216): 549322 522457 ... 200359 271582
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (19216 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL
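
The same function can also apply CLR. The following is a sketch, assuming the mia package is installed and that the object above is the GlobalPatterns example dataset shipped with mia (an assumption based on the printed dimensions and sample names). The pseudocount of 1 is an illustrative choice, not a recommendation.

```r
library(mia)

# Assumption: GlobalPatterns example data shipped with mia
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# CLR needs strictly positive values, so a pseudocount is added to the counts
tse <- transformAssay(tse, assay.type = "counts", method = "clr", pseudocount = 1)

# The transformed values are stored as a new assay named after the method
print(assayNames(tse))
```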

Exercises

From OMA online book, Chapter 12: Transformation

  • All exercises
