12 Transformation
Data transformations are common in (microbial) ecology (Legendre and Gallagher 2001). They are used to improve compatibility with the assumptions of specific statistical methods, mitigate biases, enhance the comparability of samples or features, or obtain more interpretable values.
Examples include the logarithmic transformation, the calculation of relative abundances (percentages), and compositionality-aware transformations such as the centered log-ratio transformation (clr).
12.1 Characteristics of microbiome data
Microbiome sequencing data has unique characteristics that must be addressed; otherwise, the results may lead to incorrect conclusions. Specifically, microbiome sequencing data is characterized by high variability, zero-inflation and compositionality. High variability means that the abundance of a taxon often varies by several orders of magnitude from sample to sample. Zero-inflation means that typically more than 70% of the values are zeros, which could be due to either physical absence (structural zeros) or insufficient sampling effort (sampling zeros). Compositionality means that a change in the absolute abundance of one taxon leads to apparent changes in the relative abundances of the other taxa in the same sample. If neglected, such properties may cause significant bias in the results of differential abundance analysis (DAA) or other statistical tests. Therefore, several approaches have been developed to address the unique properties of microbiome data and provide statistically reliable results.
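The compositionality effect described above can be illustrated with a few lines of base R. This is a minimal sketch on made-up numbers; the taxon names and abundances are hypothetical and chosen only for illustration.

```r
# Toy absolute abundances for three hypothetical taxa in one sample
absolute_before <- c(taxonA = 100, taxonB = 50, taxonC = 50)

# Only taxonA changes in absolute terms (it grows tenfold)
absolute_after <- absolute_before
absolute_after["taxonA"] <- 1000

# Relative abundances (proportions that sum to 1)
relative_before <- absolute_before / sum(absolute_before)
relative_after <- absolute_after / sum(absolute_after)

# taxonB and taxonC appear to decrease although their absolute
# abundances did not change
round(rbind(before = relative_before, after = relative_after), 3)
```

The relative abundances of taxonB and taxonC drop only because the total changed, which is the kind of apparent variation that compositionality-aware methods aim to handle.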
12.2 Common transformation methods
Let us now summarize some commonly used transformations in microbiome data science; further details and benchmarks are available in the references.
alr: The additive log-ratio transformation is part of the broader Aitchison family of transformations, together with ‘clr’ and ‘rclr’. The biggest difference from these is that alr selects a single feature or component as a reference and expresses all other features as log-ratios relative to it. Greenacre, Martínez-Álvaro, and Blasco (2021) provide guidance on choosing an appropriate reference feature.
clr: Centered log ratio transformation (Aitchison 1986) is used to reduce data skewness and compositionality bias in relative abundances, while bringing the data to the logarithmic scale. This transformation is frequently applied in microbial ecology as it may enhance comparability of relative differences between samples (Gloor et al. 2017). However, the resulting transformed values are difficult to interpret directly, and this transformation only applies to positive values. The usual solution for making values non-zero is to add a pseudocount, which adds another type of bias in the data (see ‘rclr’ below).
hellinger: Hellinger transformation is equal to the square root of relative abundances. This ecological transformation can be useful if we are interested in changes in relative abundances.
log, log2, log10: Logarithmic transformations, used e.g. to reduce data skewness. With compositional data, the clr (or rclr) transformation is often preferred.
pa: Presence/absence transformation ignores abundances and only indicates whether the given feature is detected above the given threshold (default: 0). This simple transformation is relatively widely used in ecological research. It has shown good performance in microbiome-based classification (Giliberti et al. 2022, Karwowska2024).
rank: Rank transformation replaces each value by its rank. This is useful, for instance, in non-parametric statistics. See also ‘rrank’ (relative rank transformation).
rclr: The robust clr (‘rclr’) is similar to regular clr (see above) but allows data with zeros and avoids the need to add a pseudocount (Martino et al. 2019).
relabundance: Relative transformation, also known as total sum scaling (TSS) and compositional transformation. This converts counts into proportions (on the scale [0, 1]) that sum up to 1 within each sample. Much of the currently available taxonomic abundance data from high-throughput assays (16S, metagenomic sequencing) is compositional by nature, even if the data is provided as counts (Gloor et al. 2017).
standardize: Standardize (or ‘z-score’) transformation scales data to zero mean and unit variance. This is used to bring features (or samples) to more comparable levels in terms of mean and scale of the values. This can enhance visualization and interpretation of the data.
Other available transformations include Chi square (‘chi.square’), frequency transformation (‘frequency’), and scaling that makes the margin sum of squares equal to one (‘normalize’).
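To make the definitions above concrete, the following base R sketch computes a few of the listed transformations by hand on a toy count table. It is an illustration under stated assumptions, not the implementation used by mia: the table, the pseudocount of 1 and the choice of taxon3 as the alr reference are all arbitrary, and the results are not guaranteed to match the package output digit for digit.

```r
# Toy count table: rows are features (taxa), columns are samples
counts <- matrix(
    c(10,  0,  5,
      20, 30,  0,
      70, 70, 95),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("taxon1", "taxon2", "taxon3"), c("S1", "S2", "S3"))
)

# relabundance: divide each column (sample) by its total
relabundance <- sweep(counts, 2, colSums(counts), "/")

# hellinger: square root of relative abundances
hellinger <- sqrt(relabundance)

# pa: 1 if the feature is detected above the threshold (here 0), else 0
pa <- (counts > 0) * 1

# clr: log value minus the mean log value of the same sample.
# Zeros are handled here with an arbitrary pseudocount of 1 (an assumption).
log_counts <- log(counts + 1)
clr <- sweep(log_counts, 2, colMeans(log_counts), "-")

# alr: log-ratio of each feature to a reference feature
# (taxon3 is chosen arbitrarily as the reference)
alr <- sweep(log_counts[-3, ], 2, log_counts[3, ], "-")
```

Note that the clr values of each sample sum to zero, whereas alr drops the reference feature and therefore returns one row fewer than the input.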
Transformations on abundance assays can be performed with mia::transformAssay(), keeping both the original and the transformed assay(s). The transformed abundance assay is stored back to the ‘assays’ slot in the data object. The function applies a sample-wise (column-wise) transformation when MARGIN = ‘cols’ and a feature-wise (row-wise) transformation when MARGIN = ‘rows’. A complete list of available transformations and parameters is available in the function help.
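The difference between the two margins can be sketched in base R with a z-score transformation. This is only an illustration of the row-wise versus column-wise idea on a toy matrix; it does not call transformAssay, and the object names are hypothetical.

```r
# Toy count table: rows are features (taxa), columns are samples
counts <- matrix(
    c( 1,  2,  3,
      10, 20, 60),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("taxon1", "taxon2"), c("S1", "S2", "S3"))
)

# Sample-wise (column-wise): each column gets zero mean and unit variance
z_cols <- scale(counts)

# Feature-wise (row-wise): transpose, scale the columns, transpose back,
# so that each row gets zero mean and unit variance
z_rows <- t(scale(t(counts)))
```

The two results differ, so it is worth checking which margin a given method uses by default before comparing transformed values.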
A pseudocount is a small positive value added to the normalized data to avoid taking the logarithm of zero. Its value can have a significant impact on the results when applying a logarithm transformation to normalized data, as the logarithm transformation is a nonlinear operation that can fundamentally change the data distribution (Costea et al. 2014).
The pseudocount should be chosen consistently across all normalization methods being compared, for example, by setting it to a value smaller than the smallest non-zero abundance value before transformation. Some tools, like ancombc2, take into account the effect of the pseudocount by performing sensitivity tests using multiple pseudocount values. See Chapter 17.
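The sensitivity to the pseudocount can be checked with a small base R experiment. The sketch below is an assumption-laden illustration: the abundances are made up, the helper clr_with_pseudocount is hypothetical, and "half of the smallest non-zero value" is just one possible rule for picking a value below the minimum abundance.

```r
# Toy relative abundances for one sample, including a zero
x <- c(taxon1 = 0.60, taxon2 = 0.39, taxon3 = 0.01, taxon4 = 0)

# Hypothetical helper: clr after adding a pseudocount
clr_with_pseudocount <- function(values, pseudocount) {
    log_values <- log(values + pseudocount)
    log_values - mean(log_values)
}

# One possible rule: half of the smallest non-zero abundance
small_pseudocount <- min(x[x > 0]) / 2

# A much smaller alternative, for comparison
tiny_pseudocount <- 1e-6

clr_small <- clr_with_pseudocount(x, small_pseudocount)
clr_tiny <- clr_with_pseudocount(x, tiny_pseudocount)

# The transformed values, especially for the zero entry, depend
# strongly on the chosen pseudocount
round(rbind(small = clr_small, tiny = clr_tiny), 2)
```

Because the zero entry is pushed much further down with the smaller pseudocount, all other clr values shift as well, which is why the same pseudocount should be used across the methods being compared.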
12.3 Rarefaction
Another approach to control for uneven sampling depths is to apply rarefaction with rarefyAssay, which subsamples the samples to an equal number of reads. This remains controversial, however, and strategies to mitigate the information loss in rarefaction have been proposed (Schloss 2024a, 2024b). Moreover, this practice has been discouraged for the analysis of differentially abundant microorganisms (see McMurdie and Holmes 2014).
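The idea behind rarefaction can be sketched in base R as random subsampling of reads without replacement down to a common depth. This is a simplified illustration, not the rarefyAssay implementation; the helper rarefy_sample and the toy table are hypothetical, and the smallest library size is used as the target depth by assumption.

```r
set.seed(1)  # subsampling is random; fix the seed for reproducibility

# Toy count table with uneven sequencing depths (columns are samples)
counts <- matrix(
    c(50, 10, 200,
      30,  5, 100,
      20,  5, 700),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("taxon1", "taxon2", "taxon3"), c("S1", "S2", "S3"))
)

# Hypothetical helper: draw `depth` reads without replacement from one sample
rarefy_sample <- function(sample_counts, depth) {
    # Expand the counts into one entry per read, labelled by feature index
    reads <- rep(seq_along(sample_counts), times = sample_counts)
    drawn <- sample(reads, size = depth, replace = FALSE)
    # Count how many drawn reads belong to each feature
    tabulate(drawn, nbins = length(sample_counts))
}

# Rarefy every sample to the smallest library size
depth <- min(colSums(counts))
rarefied <- apply(counts, 2, rarefy_sample, depth = depth)
rownames(rarefied) <- rownames(counts)

# All samples now have the same number of reads
colSums(rarefied)
```

In this toy example the deepest sample loses most of its reads, which illustrates the information loss that makes the practice controversial.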
12.4 Transformations in practice
Below, we apply the relative abundance transformation to the counts table.
# Load example data
library(mia)
data("Tengeler2020")
tse <- Tengeler2020
# Transform counts assay to relative abundances
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance")
Get the values in the resulting assay and view its first entries with the head command.
assay(tse, "relabundance") |> head()
## A110 A12 A15 A19 A21 A23 A25
## Bacteroides 0.47393 0.28657 0.00000 0.22459 0.27397 0.32796 0.21594
## Bacteroides_1 0.32230 0.00000 0.16664 0.07080 0.08503 0.04193 0.00000
## Parabacteroides 0.00000 0.02390 0.00000 0.01400 0.02283 0.00000 0.00994
## Bacteroides_2 0.00000 0.04709 0.00000 0.14019 0.10376 0.00000 0.05362
## Akkermansia 0.03057 0.04659 0.07539 0.01489 0.01323 0.12818 0.04012
## Bacteroides_3 0.00000 0.16011 0.00000 0.11362 0.09605 0.00000 0.04760
## A28 A29 A34 A36 A37 A39 A111
## Bacteroides 0.19379 0.14221 0.27229 0.37622 0.38072 0.00000 0.49423
## Bacteroides_1 0.00000 0.00000 0.30309 0.38768 0.00000 0.00000 0.39163
## Parabacteroides 0.01752 0.01749 0.00000 0.00000 0.10521 0.43546 0.00000
## Bacteroides_2 0.07981 0.07957 0.00000 0.02852 0.32992 0.00000 0.02786
## Akkermansia 0.03593 0.01411 0.07693 0.05196 0.02641 0.05413 0.01383
## Bacteroides_3 0.07525 0.19865 0.00000 0.00000 0.00000 0.00000 0.00000
## A13 A14 A16 A17 A18 A210 A22
## Bacteroides 0.05534 0.22500 0.25188 0.21775 0.50314 0.22023 0.09614
## Bacteroides_1 0.00000 0.05667 0.00000 0.00000 0.29137 0.09577 0.00000
## Parabacteroides 0.33893 0.00000 0.10925 0.06554 0.00000 0.00000 0.28324
## Bacteroides_2 0.02042 0.00000 0.09913 0.10589 0.00000 0.00000 0.04164
## Akkermansia 0.05270 0.09199 0.09739 0.10253 0.03738 0.13349 0.05105
## Bacteroides_3 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
## A24 A26 A27 A33 A35 A38
## Bacteroides 0.375716 0.076844 0.24740 0.34204 0.05687 0.20908
## Bacteroides_1 0.266715 0.000000 0.10112 0.04590 0.00000 0.16765
## Parabacteroides 0.000000 0.172826 0.00000 0.00000 0.27061 0.00000
## Bacteroides_2 0.010230 0.006608 0.02002 0.04444 0.02484 0.03583
## Akkermansia 0.038485 0.118447 0.07714 0.02582 0.07364 0.04671
## Bacteroides_3 0.008525 0.000000 0.02200 0.03333 0.00000 0.01621
In the ‘pa’ transformation, the abundance table is converted to a presence/absence table that ignores abundances and only indicates whether the given feature is detected.
# Here, assay.type is not explicitly specified, although it is good practice
# to do so. By default, the function uses the assay named "counts" for the
# transformation.
tse <- transformAssay(tse, method = "pa")
assay(tse, "pa") |> head()
## A110 A12 A15 A19 A21 A23 A25 A28 A29 A34 A36 A37 A39 A111
## Bacteroides 1 1 0 1 1 1 1 1 1 1 1 1 0 1
## Bacteroides_1 1 0 1 1 1 1 0 0 0 1 1 0 0 1
## Parabacteroides 0 1 0 1 1 0 1 1 1 0 0 1 1 0
## Bacteroides_2 0 1 0 1 1 0 1 1 1 0 1 1 0 1
## Akkermansia 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## Bacteroides_3 0 1 0 1 1 0 1 1 1 0 0 0 0 0
## A13 A14 A16 A17 A18 A210 A22 A24 A26 A27 A33 A35 A38
## Bacteroides 1 1 1 1 1 1 1 1 1 1 1 1 1
## Bacteroides_1 0 1 0 0 1 1 0 1 0 1 1 0 1
## Parabacteroides 1 0 1 1 0 0 1 0 1 0 0 1 0
## Bacteroides_2 1 0 1 1 0 0 1 1 1 1 1 1 1
## Akkermansia 1 1 1 1 1 1 1 1 1 1 1 1 1
## Bacteroides_3 0 0 0 0 0 0 0 1 0 1 1 0 1
You can now view the entire list of abundance assays in your data object with:
assays(tse)
## List of length 3
## names(3): counts relabundance pa
A common question is whether the centered log-ratio (clr) transformation should be applied directly to raw counts or if a prior transformation, such as conversion to relative abundances, is necessary.
In theory, the clr transformation is scale-invariant, meaning it does not matter whether it is applied to raw or relative abundances, as long as the relative scale of abundances remains the same. However, in practice, there are some differences due to the introduction of a pseudocount, which can introduce bias.
There is no single correct answer, but the following considerations may help:
- Data imputation should typically be applied to raw abundances, regardless of the microbial profiling pipeline used or whether the obtained abundances are counts or relative abundances.
- Once a pseudocount has been added, it makes no difference whether one first converts to relative abundances before applying clr or applies clr directly to the adjusted counts.
- Since applying clr directly to raw counts is the simpler approach, it is generally recommended.
- One might also consider using robust clr instead.
tse <- transformAssay(
x = tse,
assay.type = "counts",
method = "clr",
pseudocount = TRUE,
name = "clr"
)
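The scale invariance discussed above can be verified numerically in base R. The sketch below uses a hand-written clr function and a pseudocount of 1, both of which are assumptions made for illustration rather than the behaviour of transformAssay.

```r
# Toy counts for one sample, including a zero
counts <- c(taxon1 = 120, taxon2 = 30, taxon3 = 0, taxon4 = 850)

# Plain clr for strictly positive values
clr <- function(values) log(values) - mean(log(values))

# Add the pseudocount first (1 is an arbitrary choice for illustration)
adjusted <- counts + 1

# Route 1: clr directly on the adjusted counts
clr_counts <- clr(adjusted)

# Route 2: convert the adjusted counts to relative abundances, then clr
clr_relative <- clr(adjusted / sum(adjusted))

# Both routes agree up to floating-point error
all.equal(clr_counts, clr_relative)

# Adding the same pseudocount after converting to relative abundances
# is a different operation and gives different values
clr_other <- clr(counts / sum(counts) + 1)
isTRUE(all.equal(clr_counts, clr_other))
```

The first comparison holds because dividing by the sample total only subtracts a constant on the log scale, which the centering removes. The second comparison fails because a pseudocount of 1 is large relative to proportions but small relative to counts.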
To incorporate phylogenetic information, one can apply the phylogenetic isometric log-ratio (PhILR) transformation (Silverman et al. 2017). Unlike standard transformations, PhILR accounts for the genetic relationships between taxa. This is important because closely related species often share similar properties, which traditional transformations fail to capture.
tse <- transformAssay(tse, method = "philr", MARGIN = 1L, pseudocount = TRUE)
Unlike other transformations, PhILR outputs a table where rows represent nodes of the phylogeny. These new features do not match the features of the TreeSE, which is why the new dataset is stored in altExp.
altExp(tse, "philr")
## class: TreeSummarizedExperiment
## dim: 149 27
## metadata(0):
## assays(1): philr
## rownames(149): node_1 node_2 ... node_148 node_149
## rowData names(0):
## colnames(27): A110 A12 ... A35 A38
## colData names(4): patient_status cohort patient_status_vs_cohort
## sample_name
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
Microbiome data is characterized by the following features:
- Compositional
- High variability
- Zero-inflated
The OSCA book provides additional information on normalization from the perspective of single-cell analysis.