Transformation
Monday, August 25, 2025
Microbiome data
- High variability — feature abundances vary widely across samples
- Zero-inflation — many taxa absent in most samples
- Compositionality — only relative abundances observed; total counts per sample are arbitrary
Heteroscedasticity
Zero-inflation
Compositionality
- Components are dependent: if one proportion increases, at least one other must decrease
- Compositions live in simplex space
- If we know D−1 of the D values, the last one is determined
- Constraint: all values positive and sum to a constant (often 1)
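The dependence between components can be illustrated with a small sketch. The counts and taxon names below are hypothetical, chosen only to show that growth in one taxon lowers the proportions of the others even though their absolute counts are unchanged.

```r
# Hypothetical absolute counts for three taxa in one sample
counts <- c(taxonA = 10, taxonB = 20, taxonC = 70)
rel_before <- counts / sum(counts)

# Only taxonA grows (10 -> 110); taxonB and taxonC stay the same in absolute terms
counts_after <- counts
counts_after["taxonA"] <- 110
rel_after <- counts_after / sum(counts_after)

# Proportions of taxonB and taxonC drop although their counts did not change
print(rbind(before = rel_before, after = rel_after))

# The last proportion is fully determined by the other two
last_from_others <- 1 - sum(rel_after[c("taxonA", "taxonB")])
print(last_from_others)
```

This is why an apparent decrease in relative abundance does not by itself imply a decrease in absolute abundance.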
Relative abundance
\[
\text{X}_{ij} = \frac{\text{Count}_{ij}}{\sum_{k=1}^m \text{Count}_{ik}}
\]
where
- \(i\) indexes the sample,
- \(j\) indexes the taxon,
- \(m\) is the total number of taxa.
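A minimal base R sketch of the formula above, assuming a hypothetical count matrix with samples in rows and taxa in columns (the same orientation as the indices \(i\) and \(j\)). Note that this orientation is an assumption for illustration; a TreeSummarizedExperiment stores taxa in rows and samples in columns.

```r
# Hypothetical count matrix: rows are samples (i), columns are taxa (j)
counts <- matrix(
  c(10, 30, 60,
     5,  0, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample1", "sample2"), c("taxonA", "taxonB", "taxonC"))
)

# Divide each count by its sample total (the row sum)
rel_abund <- counts / rowSums(counts)

print(rel_abund)
print(rowSums(rel_abund))  # every sample sums to 1
```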
Centered log-ratio (CLR)
\[
\text{CLR}(x_i) = \log\left(\frac{x_i}{g(x)}\right),\quad
g(x) = \left(\prod_{j=1}^{D} x_j \right)^{1/D}
\]
- \(x_i\): the \(i\)-th component of the composition
- \(g(x)\): geometric mean of the composition
\[
\text{Arithmetic mean} = \frac{2 + 8 + 32}{3} = \frac{42}{3} = 14
\]
\[
\begin{aligned}
\text{Geometric mean} &= \sqrt[3]{2 \cdot 8 \cdot 32} \\[2mm]
&= \exp\Bigg( \frac{\log 2 + \log 8 + \log 32}{3} \Bigg) \\[2mm]
&= 8
\end{aligned}
\]
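The same arithmetic can be checked in base R. This sketch computes the geometric mean both ways shown above, as a cube root of the product and as the exponential of the mean log.

```r
x <- c(2, 8, 32)

# Arithmetic mean: sum divided by the number of values
arithmetic_mean <- mean(x)

# Geometric mean, form 1: D-th root of the product
geometric_mean_root <- prod(x)^(1 / length(x))

# Geometric mean, form 2: exponential of the mean of the logs
geometric_mean_log <- exp(mean(log(x)))

print(c(arithmetic = arithmetic_mean,
        geometric_root = geometric_mean_root,
        geometric_log = geometric_mean_log))
```

The log form is usually preferable in practice because the product of many values can overflow or underflow.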
Centered log-ratio (CLR)
\[
\begin{aligned}
\text{CLR}(x_i) &= \log\left(\frac{x_i}{g(x)}\right) \\[2mm]
&= \log(x_i) - \log(g(x)) \\[2mm]
&= \log(x_i) - \frac{1}{D}\sum_{j=1}^{D}\log(x_j)
\end{aligned}
\]
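The last line of the derivation translates directly into code. Below is a minimal sketch for a single composition, assuming natural logarithms and strictly positive values; the helper name `clr` is hypothetical and is not the mia implementation. Zeros would have to be handled before taking logs.

```r
# Minimal CLR for one composition (natural log, strictly positive input)
clr <- function(x) {
  stopifnot(all(x > 0))    # log is undefined for zeros
  log(x) - mean(log(x))    # log(x_i) minus the mean of the logs
}

x <- c(2, 8, 32)
clr_x <- clr(x)
print(clr_x)

# Same result from the first form, log(x_i / g(x))
g <- exp(mean(log(x)))
print(log(x / g))

# CLR values sum to zero, and rescaling the input does not change them
print(sum(clr_x))
print(clr(x / sum(x)))
```

Because the result is unchanged by rescaling, CLR gives the same values whether it is applied to raw counts or to relative abundances of the same sample.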
Interpretation:
- Positive CLR → feature is more abundant than the sample’s geometric mean
- Negative CLR → feature is less abundant than the sample’s geometric mean
- Magnitude → how strongly the feature deviates from the geometric mean
- On the natural-log scale:
  - CLR = 1 → ~2.7× the geometric mean
  - CLR = 2 → ~7.4× the geometric mean
  - CLR = −1 → ~0.37× the geometric mean
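The fold changes quoted above follow from exponentiating the CLR value, assuming the natural logarithm was used. A quick check in base R:

```r
# CLR values from the interpretation list
clr_values <- c(1, 2, -1)

# exp() undoes the natural log, giving the ratio to the geometric mean
fold_vs_geometric_mean <- exp(clr_values)

print(round(fold_vs_geometric_mean, 2))
```

With a different log base (for example base 2), the same CLR values would correspond to different fold changes.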
Centered log-ratio (CLR)
Demonstration
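library(mia)
# Assumption: tse is the GlobalPatterns example dataset shipped with mia (it matches the output below)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, method = "relabundance")  # adds a "relabundance" assay next to "counts"
print(tse)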
class: TreeSummarizedExperiment
dim: 19216 26
metadata(0):
assays(2): counts relabundance
rownames(19216): 549322 522457 ... 200359 271582
rowData names(7): Kingdom Phylum ... Genus Species
colnames(26): CL3 CC1 ... Even2 Even3
colData names(7): X.SampleID Primer ... SampleType Description
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (19216 rows)
rowTree: 1 phylo tree(s) (19216 leaves)
colLinks: NULL
colTree: NULL