Transformation

Monday, August 25, 2025

Microbiome data

  • High variability — feature abundances vary widely across samples
  • Zero-inflation — many taxa absent in most samples
  • Compositionality — only relative abundances observed; total counts per sample are arbitrary

Heteroscedasticity

Zero-inflation

Compositionality

  • Values are interdependent: if one proportion increases, at least one other must decrease
  • Simplex space
    • If we know N-1 values, we know the last one
    • Constraint: all values positive and sum to a constant (often 1)
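The sum constraint can be illustrated with a minimal sketch in base R. The counts and the helper name `closure` are made up for illustration; only one taxon changes in absolute terms, yet the proportions of the others shift.

```r
# Toy absolute counts for three taxa in one sample (made-up numbers)
before <- c(taxonA = 100, taxonB = 100, taxonC = 100)

# Only taxonA grows; taxonB and taxonC are unchanged in absolute terms
after <- c(taxonA = 400, taxonB = 100, taxonC = 100)

# Closure (hypothetical helper): divide by the total so the values sum to 1
closure <- function(x) x / sum(x)

rel_before <- closure(before)
rel_after <- closure(after)

# taxonB and taxonC appear to decrease although their counts did not change
print(round(rel_before, 3))
print(round(rel_after, 3))

# Knowing N-1 proportions fixes the last one
last_from_rest <- 1 - sum(rel_after[c("taxonA", "taxonB")])
print(last_from_rest)
```

The apparent drop in taxonB and taxonC comes purely from the closure step, which is why proportions cannot be read as independent measurements.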

Goals of transformation

  • Mitigate biases in abundances
  • Make features comparable across samples
  • Meet assumptions of downstream statistical analyses

Relative abundance

\[ \text{X}_{ij} = \frac{\text{Count}_{ij}}{\sum_{k=1}^m \text{Count}_{ik}} \]

where

  • \(i\) indexes the sample,
  • \(j\) indexes the taxon,
  • \(m\) is the total number of taxa.
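As a minimal sketch, the formula can be applied to a toy count matrix in base R. The numbers are made up, and the matrix is stored with taxa in rows and samples in columns (the layout used by assays in mia), so \(X_{ij}\) corresponds to `relabundance[j, i]` here.

```r
# Toy count matrix: rows are taxa, columns are samples (made-up numbers)
counts <- matrix(
  c(10, 0, 30,   # sampleA
    5, 5, 90),   # sampleB
  nrow = 3,
  dimnames = list(c("taxon1", "taxon2", "taxon3"), c("sampleA", "sampleB"))
)

# Divide every column (sample) by its total count
relabundance <- sweep(counts, 2, colSums(counts), "/")

print(relabundance)

# Each sample now sums to 1
print(colSums(relabundance))
```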

Log-transformed relative abundance

\[ \text{X}_{ij} = \log \left( \text{Relative abundance}_{ij} + \epsilon \right) \]

where

  • \(\epsilon\) is a small pseudocount to avoid \(\log(0)\).
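The role of the pseudocount can be seen in a short base R sketch. The relative abundances and the value of `epsilon` are assumptions chosen for illustration, not recommended defaults.

```r
# Relative abundances for one sample, including an absent taxon (made-up numbers)
relabundance <- c(taxon1 = 0.25, taxon2 = 0, taxon3 = 0.75)

# Without a pseudocount, the zero becomes -Inf
log_naive <- log(relabundance)
print(log_naive)

# Assumed pseudocount; the choice is arbitrary and sets the value assigned to zeros
epsilon <- 1e-6
log_pseudo <- log(relabundance + epsilon)
print(log_pseudo)
```

Because every zero maps to roughly `log(epsilon)`, the chosen value determines how far absent taxa sit from observed ones on the log scale.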

Centered log-ratio (CLR)

\[ \text{CLR}(x_i) = \log\left(\frac{x_i}{g(x)}\right),\quad g(x) = \left(\prod_{j=1}^{D} x_j \right)^{1/D} \]

  • \(x_i\): the \(i\)-th component (taxon) of the composition of a single sample
  • \(D\): the number of components in the composition
  • \(g(x)\): geometric mean of the composition

\[ \text{Arithmetic mean} = \frac{2 + 8 + 32}{3} = \frac{42}{3} = 14 \]

\[ \text{Geometric mean} = \sqrt[3]{2 \cdot 8 \cdot 32} \\[2mm] = \exp\Bigg( \frac{\log 2 + \log 8 + \log 32}{3} \Bigg) \\[2mm] = 8 \]
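The same numbers can be checked in base R; this small sketch computes the geometric mean both as a root of the product and through the mean of the logs.

```r
x <- c(2, 8, 32)

# Arithmetic mean
arithmetic_mean <- mean(x)

# Geometric mean as the D-th root of the product
geometric_mean_root <- prod(x)^(1 / length(x))

# Geometric mean as the exponential of the mean log
geometric_mean_log <- exp(mean(log(x)))

print(arithmetic_mean)
print(geometric_mean_root)
print(geometric_mean_log)
```

The log form is usually preferred in practice because the product of many values can overflow or underflow.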

Centered log-ratio (CLR)

\[ \text{CLR}(x_i) \;=\; \log\left(\frac{x_i}{g(x)}\right) \\[2mm] = \log(x_i) - \log(g(x)) \\[2mm] = \log(x_i) - \frac{1}{D}\sum_{j=1}^{D}\log(x_j) \]

Interpretation:

  • Positive CLR → feature is more abundant than the sample’s geometric mean
  • Negative CLR → feature is less abundant than the sample’s geometric mean
  • Magnitude → how strongly the feature deviates from the geometric mean
    • With the natural log: CLR = 1 → ~2.7× the geometric mean
    • CLR = 2 → ~7.4× the geometric mean
    • CLR = −1 → ~0.37× the geometric mean
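A minimal base R sketch of the CLR, reusing the values 2, 8 and 32 from the geometric mean example. The function name `clr` is a hypothetical helper written for this illustration, not a library function, and it assumes strictly positive input.

```r
# Hypothetical helper: CLR of one composition, natural log
clr <- function(x) {
  stopifnot(all(x > 0))  # zeros must be handled beforehand, e.g. with a pseudocount
  log(x) - mean(log(x))
}

x <- c(taxon1 = 2, taxon2 = 8, taxon3 = 32)
clr_x <- clr(x)
print(clr_x)

# Back-transforming gives the fold difference from the geometric mean (8)
fold <- exp(clr_x)
print(fold)

# CLR values of a sample sum to zero (up to floating-point error)
print(sum(clr_x))

# Rescaling the composition, e.g. to proportions, leaves the CLR unchanged
print(all.equal(clr(x / sum(x)), clr_x))
```

Here taxon2 equals the geometric mean and gets a CLR of 0, while taxon1 and taxon3 lie a factor of 4 below and above it.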

Demonstration

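library(mia)

# Assumed input: the GlobalPatterns example data from mia (matches the output below)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# Add a relative-abundance assay derived from the counts
tse <- transformAssay(tse, method = "relabundance")
print(tse)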
class: TreeSummarizedExperiment 
dim: 19216 26 
metadata(0):
assays(2): counts relabundance
rownames(19216): 549322 522457 ... 200359 271582
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (19216 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL
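A CLR assay could be added in the same way. This is a sketch under stated assumptions: it uses the GlobalPatterns example data shipped with mia, and a recent mia version in which `transformAssay()` accepts `method = "clr"` and `pseudocount = TRUE` (letting the package choose a pseudocount so that zeros can be log-transformed).

```r
library(mia)

# Assumption: GlobalPatterns example data from mia
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# CLR-transform the counts; the result is stored as a new assay named "clr"
tse <- transformAssay(
  tse,
  assay.type = "counts",
  method = "clr",
  pseudocount = TRUE
)

# The new assay sits alongside the original counts
print(assayNames(tse))

# CLR values are centered within each sample, so column means are close to zero
clr_col_means <- colMeans(assay(tse, "clr"))
print(summary(clr_col_means))
```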

Exercises

From OMA online book, Chapter 12: Transformation

  • All exercises
