Transformation
Tuesday, February 3, 2026
Microbiome data
- High variability — feature abundances vary widely across samples
- Zero-inflation — many taxa absent in most samples
- Compositionality — only relative abundances observed; total counts per sample are arbitrary
Heteroscedasticity
Zero-inflation
Compositionality
- Sequencing measures relative abundances, not true microbial counts
- Values are parts of a whole
- Parts are linked by a fixed total (library size)
- If one increases, at least one other must decrease
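A minimal sketch of this constraint in base R, using made-up counts for three hypothetical taxa. Only taxon A changes in absolute terms, yet the proportions of B and C fall because the parts must sum to one.

```r
# Hypothetical absolute counts for three taxa (illustrative values only)
before <- c(A = 100, B = 50, C = 50)
after  <- c(A = 300, B = 50, C = 50)  # only taxon A grows

# Closure: divide by the total so the parts sum to 1
prop_before <- before / sum(before)
prop_after  <- after / sum(after)

print(round(prop_before, 3))
print(round(prop_after, 3))
```

B and C did not change in absolute terms, but their relative abundances halve.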
Relative abundance
Library sizes are arbitrary
\[
\text{X}_{ij} = \frac{\text{Count}_{ij}}{\sum_{k=1}^m \text{Count}_{ik}}
\]
where
- \(i\) indexes the sample,
- \(j\) indexes the taxon,
- \(m\) is the total number of taxa.
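The formula can be sketched in base R as follows. The count matrix is made up for illustration and is laid out with samples in rows and taxa in columns to match the indices above (the mia object shown later stores taxa in rows instead).

```r
# Hypothetical count matrix: rows are samples (i), columns are taxa (j)
counts <- matrix(
  c( 10,  30,   60,
    200, 600, 1200),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample1", "sample2"),
                  c("taxonA", "taxonB", "taxonC"))
)

# Divide each count by its sample total (the library size)
relabundance <- counts / rowSums(counts)

print(rowSums(counts))  # library sizes differ by a factor of 20
print(relabundance)     # yet both samples have the same composition
```

The two samples differ only in sequencing depth, so their relative abundances are identical.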
Centered log-ratio (CLR)
Library sizes are arbitrary and values are dependent
\[
\text{CLR}(x_i) = \log\left(\frac{x_i}{g(x)}\right),\quad
g(x) = \left(\prod_{j=1}^{D} x_j \right)^{1/D}
\]
- \(x_i\): abundance of taxon \(i\) in the sample
- \(D\): number of taxa
- \(g(x)\): geometric mean across all taxa in the sample
Intuition
- Values dependent, only ratios matter
- Standard statistics rely on subtraction and distance
- A ratio is not a difference, but the logarithm turns it into one:
\[
\log\left(\frac{x}{y}\right) = \log(x) - \log(y)
\]
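The identity can be checked numerically. The sketch below uses made-up counts for two hypothetical taxa, and also shows that the log-ratio is unchanged when both counts are rescaled by the same factor, as happens when sequencing depth changes.

```r
x <- 40  # hypothetical count of taxon x
y <- 10  # hypothetical count of taxon y

# A ratio on the log scale equals a difference of logs
log_ratio  <- log(x / y)
difference <- log(x) - log(y)
print(all.equal(log_ratio, difference))

# Rescaling both counts by the same factor leaves the log-ratio unchanged
scale_factor <- 25
rescaled <- log(scale_factor * x) - log(scale_factor * y)
print(all.equal(log_ratio, rescaled))
```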
Intuition
- Values dependent, only ratios matter
- Standard statistics rely on subtraction and distance
- Library sizes are arbitrary
Centering
\[
\text{CLR}(x_i) \;=\; \log\left(\frac{x_i}{g(x)}\right) \\[2mm]
= \log(x_i) - \log(g(x))
\]
\[
\text{Arithmetic mean} = \frac{2 + 8 + 32}{3} = \frac{42}{3} = 14
\]
\[
\text{Geometric mean} = \sqrt[3]{2 \cdot 8 \cdot 32} \\[2mm]
= \exp\Bigg( \frac{\log 2 + \log 8 + \log 32}{3} \Bigg) \\[2mm]
= 8
\]
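The same comparison in base R, as a small sketch. Both forms of the geometric mean are computed; the log form is the one that reappears in the CLR.

```r
x <- c(2, 8, 32)

arithmetic_mean <- mean(x)

# Geometric mean as the D-th root of the product
geometric_root <- prod(x)^(1 / length(x))

# Equivalent form: exponential of the mean of the logs
geometric_log <- exp(mean(log(x)))

print(arithmetic_mean)
print(geometric_root)
print(geometric_log)
```

The two geometric-mean forms agree up to floating-point rounding.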
Centered log-ratio (CLR)
\[
\text{CLR}(x_i) \;=\; \log\left(\frac{x_i}{g(x)}\right) \\[2mm]
= \log(x_i) - \log(g(x)) \\[2mm]
= \log(x_i) - \frac{1}{D}\sum_{j=1}^{D}\log(x_j)
\]
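The last line of the derivation translates directly into base R. This is a minimal sketch for a single sample, not the implementation used by any package: the helper `clr()` and the counts are hypothetical, and it assumes strictly positive values, since zeros would have to be handled first (for example with a pseudocount) because \(\log(0)\) is undefined.

```r
# CLR for one sample: log(x_i) minus the mean of the logs.
# Assumes every value is strictly positive.
clr <- function(x) {
  stopifnot(all(x > 0))
  log(x) - mean(log(x))
}

counts <- c(taxonA = 2, taxonB = 8, taxonC = 32)  # hypothetical counts
clr_counts <- clr(counts)
print(clr_counts)

# Centering: the CLR values of a sample sum to zero (up to rounding)
print(all.equal(sum(clr_counts), 0))

# Scale invariance: multiplying the sample by a constant changes nothing
print(all.equal(clr(counts * 1000), clr_counts))
```

Because the result does not depend on the sample total, applying the function to raw counts or to relative abundances gives the same values.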
Centered log-ratio (CLR)
Demonstration
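library(mia)
data("GlobalPatterns", package = "mia")  # example data shipped with mia
tse <- GlobalPatterns
# Add a "relabundance" assay computed from "counts"
tse <- transformAssay(tse, method = "relabundance")
print(tse)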
class: TreeSummarizedExperiment
dim: 19216 26
metadata(0):
assays(2): counts relabundance
rownames(19216): 549322 522457 ... 200359 271582
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (19216 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL