12 Transformation

Data transformations are common in (microbial) ecology (Legendre and Gallagher 2001). They are used to improve compatibility with the assumptions of specific statistical methods, mitigate biases, enhance the comparability of samples or features, or obtain more interpretable values.

Legendre, Pierre, and Eugene D. Gallagher. 2001. “Ecologically Meaningful Transformations for Ordination of Species Data.” Oecologia 129 (2): 271–80. https://doi.org/10.1007/s004420100716.

Examples include the logarithmic transformation, the calculation of relative abundances (percentages), and compositionality-aware transformations such as the centered log-ratio transformation (clr).

12.1 Characteristics of microbiome data

Microbiome sequencing data has unique characteristics that must be addressed; otherwise, incorrect decisions might be made based on the results. Specifically, microbiome sequencing data is characterized by high variability, zero-inflation, and compositionality. High variability means that the abundance of a taxon often varies by several orders of magnitude from sample to sample. Zero-inflation means that typically more than 70% of the values are zeros, which can be due to either physical absence (structural zeros) or insufficient sampling effort (sampling zeros). Compositionality means that a change in the absolute abundance of one taxon leads to apparent changes in the relative abundances of other taxa in the same sample. If neglected, these properties may cause significant bias in the results of differential abundance analysis (DAA) or other statistical tests. Therefore, several approaches have been developed to address these properties and provide statistically reliable results.
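Compositionality and zero-inflation can be illustrated with a few lines of base R. This is a minimal sketch on made-up numbers; the taxon names and the values are hypothetical and do not come from any dataset used in this chapter.

```r
# Hypothetical absolute abundances of three taxa in one sample
absolute_before <- c(taxonA = 100, taxonB = 50, taxonC = 50)

# Only taxonA changes: its absolute abundance increases fourfold
absolute_after <- absolute_before
absolute_after["taxonA"] <- 400

# Relative abundances (each vector sums to 1)
relative_before <- absolute_before / sum(absolute_before)
relative_after <- absolute_after / sum(absolute_after)

# taxonB and taxonC appear to decrease although their absolute
# abundances did not change
print(rbind(before = relative_before, after = relative_after))

# Zero-inflation: fraction of zeros in a hypothetical toy count table
toy_counts <- matrix(c(0, 12, 0, 0, 3, 0, 250, 0, 0, 0, 0, 7), nrow = 4)
zero_fraction <- mean(toy_counts == 0)
print(zero_fraction)
```

The same `mean(x == 0)` idiom can be applied to a real count assay to check how sparse the data is before choosing a transformation.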

12.2 Common transformation methods

Let us now summarize some commonly used transformations in microbiome data science; further details and benchmarks are available in the references.

alr: The additive log-ratio transformation is part of the broader Aitchison family of transformations, together with ‘clr’ and ‘rclr’. The main difference is that alr selects a single feature or component as a reference and expresses all other features as log-ratios relative to it. Greenacre, Martínez-Álvaro, and Blasco (2021) provide guidance on choosing an appropriate reference feature.
clr: The centered log-ratio transformation (Aitchison 1986) is used to reduce data skewness and compositionality bias in relative abundances, while bringing the data to the logarithmic scale. This transformation is frequently applied in microbial ecology as it may enhance comparability of relative differences between samples (Gloor et al. 2017). However, the resulting transformed values are difficult to interpret directly, and the transformation only applies to positive values. The usual solution for making all values non-zero is to add a pseudocount, which introduces another type of bias in the data (see ‘rclr’ below).
hellinger: Hellinger transformation is equal to the square root of relative abundances. This ecological transformation can be useful if we are interested in changes in relative abundances.
log, log2, log10: Logarithmic transformations, used e.g. to reduce data skewness. With compositional data, the clr (or rclr) transformation is often preferred.
pa: The presence/absence transformation ignores abundances and only indicates whether the given feature is detected above the given threshold (default: 0). This simple transformation is relatively widely used in ecological research. It has shown good performance in microbiome-based classification (Giliberti et al. 2022, Karwowska2024).
rank: The rank transformation replaces each value by its rank. See also ‘rrank’ (relative rank transformation). This is useful, for instance, in non-parametric statistics.
rclr: The robust clr (‘rclr’) is similar to the regular clr (see above) but allows data with zeros and avoids the need to add a pseudocount (Martino et al. 2019).
relabundance: The relative transformation, also known as total sum scaling (TSS) and compositional transformation. This converts counts into proportions on the scale [0, 1] that sum to 1 within each sample. Much of the currently available taxonomic abundance data from high-throughput assays (16S, metagenomic sequencing) is compositional by nature, even if the data is provided as counts (Gloor et al. 2017).
standardize: The standardize (or ‘z-score’) transformation scales data to zero mean and unit variance. This is used to bring features (or samples) to more comparable levels in terms of the mean and scale of the values. This can enhance visualization and interpretation of the data.
Other available transformations include Chi-square (‘chi.square’), frequency transformation (‘frequency’), and normalization that makes the margin sum of squares equal to one (‘normalize’).
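
For reference, some of the transformations listed above can be written out explicitly. The notation below is our own and is introduced only for illustration: x_{ij} is the abundance of feature i in sample j, D is the number of features, r is the index of the alr reference feature, and t is the detection threshold of the presence/absence transformation.

```latex
\begin{align*}
  % relabundance (total sum scaling)
  p_{ij} &= \frac{x_{ij}}{\sum_{k=1}^{D} x_{kj}} \\
  % hellinger
  h_{ij} &= \sqrt{p_{ij}} \\
  % clr (requires x_{kj} > 0 for all k)
  \operatorname{clr}(x_{ij}) &= \log x_{ij} - \frac{1}{D} \sum_{k=1}^{D} \log x_{kj} \\
  % alr with reference feature r
  \operatorname{alr}(x_{ij}) &= \log \frac{x_{ij}}{x_{rj}}, \qquad i \neq r \\
  % pa with threshold t (default t = 0)
  \operatorname{pa}(x_{ij}) &=
    \begin{cases}
      1 & \text{if } x_{ij} > t \\
      0 & \text{otherwise}
    \end{cases}
\end{align*}
```

The clr and alr expressions involve logarithms of the abundances, which is why zeros have to be handled (for instance with a pseudocount) before these transformations can be applied.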

Greenacre, Michael, Marina Martínez-Álvaro, and Agustín Blasco. 2021. “Compositional Data Analysis of Microbiome and Any-Omics Datasets: A Validation of the Additive Logratio Transformation.” Frontiers in Microbiology 12 (October). https://doi.org/10.3389/fmicb.2021.727398.

Aitchison, J. 1986. The Statistical Analysis of Compositional Data. London, UK: Chapman & Hall.

Gloor, GB, JM Macklaim, V Pawlowsky-Glahn, and JJ Egozcue. 2017. “Microbiome Datasets Are Compositional: And This Is Not Optional.” Frontiers in Microbiology 8. https://doi.org/10.3389/fmicb.2017.02224.

Martino, C, J. T. Morton, C. A. Marotz, L. R. Thompson, A Tripathi, R Knight, and K Zengler. 2019. “A Novel Sparse Compositional Technique Reveals Microbial Perturbations.” mSystems 4.

Transformations on abundance assays can be performed with mia::transformAssay(). The transformed assay is stored back in the ‘assays’ slot of the data object alongside the original, so both remain available. The function applies a sample-wise (column-wise) transformation when MARGIN = ‘cols’ and a feature-wise (row-wise) transformation when MARGIN = ‘rows’. A complete list of available transformations and parameters is available in the function help.
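
The difference between the two margins can be illustrated without mia. The following base R sketch uses a hypothetical toy matrix with features as rows and samples as columns. It mimics the idea of sample-wise and feature-wise transformations and is not how transformAssay() is implemented.

```r
# Hypothetical toy counts: 3 features (rows) x 3 samples (columns)
counts <- matrix(
    c(10, 30, 60,
      5, 5, 90,
      20, 0, 80),
    nrow = 3,
    dimnames = list(c("feat1", "feat2", "feat3"), c("S1", "S2", "S3"))
)

# Sample-wise (column-wise): relative abundances, each column sums to 1
relabundance <- sweep(counts, 2, colSums(counts), "/")
print(colSums(relabundance))

# Feature-wise (row-wise): z-score, each row gets mean 0 and sd 1.
# scale() works on columns, so the matrix is transposed before and after.
zscore <- t(scale(t(counts)))
print(round(rowMeans(zscore), 10))
print(apply(zscore, 1, sd))
```

Relative abundances are naturally computed per sample, whereas standardization is often applied per feature so that features become comparable across samples.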

Important

A pseudocount is a small positive value added to the normalized data to avoid taking the logarithm of zero. Its value can have a significant impact on the results when applying a logarithmic transformation to normalized data, as the logarithm is a nonlinear operation that can fundamentally change the data distribution (Costea et al. 2014).

Pseudocount should be chosen consistently across all normalization methods being compared, for example, by setting it to a value smaller than the minimum abundance value before transformation. Some tools, like ancombc2, take into account the effect of the pseudocount by performing sensitivity tests using multiple pseudocount values. See Chapter 17.

Costea, Paul I., Georg Zeller, Shinichi Sunagawa, and Peer Bork. 2014. “A Fair Comparison.” Nature Methods 11: 359. https://doi.org/10.1038/nmeth.2897.
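
The sensitivity to the pseudocount can be seen on a small example. This base R sketch uses hypothetical relative abundances for a single sample and compares two pseudocount choices, one derived from the smallest non-zero abundance and one arbitrary tiny constant. Both choices are illustrative assumptions and not recommendations.

```r
# Hypothetical relative abundances for one sample, including a zero
relabundance <- c(taxonA = 0.60, taxonB = 0.399, taxonC = 0.001, taxonD = 0)

# Two candidate pseudocounts, both below the smallest non-zero abundance
min_nonzero <- min(relabundance[relabundance > 0])
pseudocounts <- c(half_min = min_nonzero / 2, tiny = 1e-6)

# log10 transformation under each pseudocount (taxa in rows, choices in columns)
logged <- sapply(pseudocounts, function(pc) log10(relabundance + pc))
print(round(logged, 2))

# Distance between the rare taxonC and the absent taxonD on the log scale
gap <- logged["taxonC", ] - logged["taxonD", ]
print(round(gap, 2))
```

The abundant taxa are barely affected, but the distance between the rare and the absent taxon depends strongly on the pseudocount, which is why the value should be chosen consistently across the methods being compared.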

12.3 Rarefaction

Another approach to control for uneven sampling depths is to apply rarefaction with rarefyAssay(), which normalizes the samples to an equal number of reads. This remains controversial, however, and strategies to mitigate the information loss in rarefaction have been proposed (Schloss 2024a, 2024b). Moreover, this practice has been discouraged for the analysis of differentially abundant microorganisms (see McMurdie and Holmes 2014).
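
The idea behind rarefaction, and the information loss it involves, can be sketched in base R as random subsampling of reads without replacement. The helper rarefy_sample() and the toy counts below are hypothetical and written for illustration only; this is not the implementation used by rarefyAssay().

```r
set.seed(42)  # rarefaction is random; fix the seed for reproducibility

# Hypothetical toy counts: 4 features (rows) x 3 samples (columns)
counts <- matrix(
    c(50, 30, 15, 5,
      500, 300, 150, 50,
      8, 0, 2, 0),
    nrow = 4,
    dimnames = list(paste0("feat", 1:4), c("S1", "S2", "S3"))
)

# Subsample one sample to `depth` reads without replacement
rarefy_sample <- function(x, depth) {
    reads <- rep(seq_along(x), times = x)           # one entry per read
    kept <- reads[sample.int(length(reads), depth)] # draw without replacement
    tabulate(kept, nbins = length(x))               # reads kept per feature
}

depth <- 100
# Samples with fewer reads than the target depth have to be dropped
deep_enough <- colSums(counts) >= depth
rarefied <- apply(
    counts[, deep_enough, drop = FALSE], 2, rarefy_sample, depth = depth
)
rownames(rarefied) <- rownames(counts)

print(rarefied)
print(colSums(rarefied))
```

In this toy example, sample S3 is discarded entirely and 900 of the 1000 reads in S2 are thrown away, which illustrates the information loss that makes the practice controversial.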

Schloss, Patrick D. 2024a. “Rarefaction Is Currently the Best Approach to Control for Uneven Sequencing Effort in Amplicon Sequence Analyses.” mSphere 9 (2): e00354–23. https://doi.org/10.1128/msphere.00354-23.

———. 2024b. “Waste Not, Want Not: Revisiting the Analysis That Called into Question the Practice of Rarefaction.” mSphere 9 (1): e00355–23. https://doi.org/10.1128/msphere.00355-23.

McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Computational Biology 10 (4): e1003531.

12.4 Transformations in practice

Below, we apply the relative abundance transformation to a counts table.

# Load example data
library(mia)
data("Tengeler2020")
tse <- Tengeler2020

# Transform counts assay to relative abundances
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance")

Retrieve the resulting assay and view its first rows with head().

assay(tse, "relabundance") |> head()
##                     A110     A12     A15     A19     A21     A23     A25
##  Bacteroides     0.47393 0.28657 0.00000 0.22459 0.27397 0.32796 0.21594
##  Bacteroides_1   0.32230 0.00000 0.16664 0.07080 0.08503 0.04193 0.00000
##  Parabacteroides 0.00000 0.02390 0.00000 0.01400 0.02283 0.00000 0.00994
##  Bacteroides_2   0.00000 0.04709 0.00000 0.14019 0.10376 0.00000 0.05362
##  Akkermansia     0.03057 0.04659 0.07539 0.01489 0.01323 0.12818 0.04012
##  Bacteroides_3   0.00000 0.16011 0.00000 0.11362 0.09605 0.00000 0.04760
##                      A28     A29     A34     A36     A37     A39    A111
##  Bacteroides     0.19379 0.14221 0.27229 0.37622 0.38072 0.00000 0.49423
##  Bacteroides_1   0.00000 0.00000 0.30309 0.38768 0.00000 0.00000 0.39163
##  Parabacteroides 0.01752 0.01749 0.00000 0.00000 0.10521 0.43546 0.00000
##  Bacteroides_2   0.07981 0.07957 0.00000 0.02852 0.32992 0.00000 0.02786
##  Akkermansia     0.03593 0.01411 0.07693 0.05196 0.02641 0.05413 0.01383
##  Bacteroides_3   0.07525 0.19865 0.00000 0.00000 0.00000 0.00000 0.00000
##                      A13     A14     A16     A17     A18    A210     A22
##  Bacteroides     0.05534 0.22500 0.25188 0.21775 0.50314 0.22023 0.09614
##  Bacteroides_1   0.00000 0.05667 0.00000 0.00000 0.29137 0.09577 0.00000
##  Parabacteroides 0.33893 0.00000 0.10925 0.06554 0.00000 0.00000 0.28324
##  Bacteroides_2   0.02042 0.00000 0.09913 0.10589 0.00000 0.00000 0.04164
##  Akkermansia     0.05270 0.09199 0.09739 0.10253 0.03738 0.13349 0.05105
##  Bacteroides_3   0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
##                       A24      A26     A27     A33     A35     A38
##  Bacteroides     0.375716 0.076844 0.24740 0.34204 0.05687 0.20908
##  Bacteroides_1   0.266715 0.000000 0.10112 0.04590 0.00000 0.16765
##  Parabacteroides 0.000000 0.172826 0.00000 0.00000 0.27061 0.00000
##  Bacteroides_2   0.010230 0.006608 0.02002 0.04444 0.02484 0.03583
##  Akkermansia     0.038485 0.118447 0.07714 0.02582 0.07364 0.04671
##  Bacteroides_3   0.008525 0.000000 0.02200 0.03333 0.00000 0.01621

In the ‘pa’ transformation, the abundance table is converted to a presence/absence table that ignores abundances and only indicates whether the given feature is detected. This simple transformation is relatively widely used in ecological research. It has shown good performance in microbiome-based classification (Giliberti et al. 2022, Karwowska2024).

Giliberti, R, S Cavaliere, IE Mauriello, D Ercolini, and E Pasolli. 2022. “Host Phenotype Classification from Human Microbiome Data Is Mainly Driven by the Presence of Microbial Taxa.” PLoS Comput Biol. 18 (4): e1010066. https://doi.org/10.1371/journal.pcbi.1010066.

# Here, assay.type is not explicitly specified; however, it is good practice
# to do so. By default, the function uses the assay named "counts" for the
# transformation.
tse <- transformAssay(tse, method = "pa")
assay(tse, "pa") |> head()
##                  A110 A12 A15 A19 A21 A23 A25 A28 A29 A34 A36 A37 A39 A111
##  Bacteroides        1   1   0   1   1   1   1   1   1   1   1   1   0    1
##  Bacteroides_1      1   0   1   1   1   1   0   0   0   1   1   0   0    1
##  Parabacteroides    0   1   0   1   1   0   1   1   1   0   0   1   1    0
##  Bacteroides_2      0   1   0   1   1   0   1   1   1   0   1   1   0    1
##  Akkermansia        1   1   1   1   1   1   1   1   1   1   1   1   1    1
##  Bacteroides_3      0   1   0   1   1   0   1   1   1   0   0   0   0    0
##                  A13 A14 A16 A17 A18 A210 A22 A24 A26 A27 A33 A35 A38
##  Bacteroides       1   1   1   1   1    1   1   1   1   1   1   1   1
##  Bacteroides_1     0   1   0   0   1    1   0   1   0   1   1   0   1
##  Parabacteroides   1   0   1   1   0    0   1   0   1   0   0   1   0
##  Bacteroides_2     1   0   1   1   0    0   1   1   1   1   1   1   1
##  Akkermansia       1   1   1   1   1    1   1   1   1   1   1   1   1
##  Bacteroides_3     0   0   0   0   0    0   0   1   0   1   1   0   1

You can now view the entire list of abundance assays in your data object with:

assays(tse)
##  List of length 3
##  names(3): counts relabundance pa

A common question is whether the centered log-ratio (clr) transformation should be applied directly to raw counts or if a prior transformation, such as conversion to relative abundances, is necessary.

In theory, the clr transformation is scale-invariant, meaning it does not matter whether it is applied to raw or relative abundances, as long as the relative scale of abundances remains the same. However, in practice, there are some differences due to the introduction of a pseudocount, which can introduce bias.

There is no single correct answer, but the following considerations may help:

Data imputation should typically be applied to raw abundances, regardless of the microbial profiling pipeline used or whether the obtained abundances are counts or relative abundances.
Once a pseudocount has been added, it makes no difference whether one first converts to relative abundances before applying clr or applies clr directly to the adjusted counts.
Since applying clr directly to raw counts is the simpler approach, it is generally recommended.
One might also consider using robust clr instead.
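
The scale invariance discussed above can be checked numerically. This is a minimal base R sketch on hypothetical counts for one sample; the clr() helper defined here is written for illustration and is not the mia implementation.

```r
# Hypothetical counts for one sample, including a zero
counts <- c(taxonA = 120, taxonB = 30, taxonC = 0, taxonD = 50)

# clr: log of each value minus the mean of the logs
clr <- function(x) log(x) - mean(log(x))

# Add the pseudocount first, so that all values are positive
adjusted <- counts + 1

# clr on adjusted counts versus clr on their relative abundances
clr_from_counts <- clr(adjusted)
clr_from_relabundance <- clr(adjusted / sum(adjusted))
print(all.equal(clr_from_counts, clr_from_relabundance))

# Adding a pseudocount on a different scale (here half of the smallest
# non-zero relative abundance) is not equivalent and changes the result
relabundance <- counts / sum(counts)
pseudo_rel <- min(relabundance[relabundance > 0]) / 2
clr_pseudo_after <- clr(relabundance + pseudo_rel)
print(round(rbind(clr_from_counts, clr_pseudo_after), 3))
```

The first comparison shows that the order of operations does not matter once the pseudocount has been added. The second shows that the differences seen in practice come from the pseudocount itself and not from the clr transformation.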

tse <- transformAssay(
    x = tse,
    assay.type = "counts",
    method = "clr",
    pseudocount = TRUE,
    name = "clr"
)

To incorporate phylogenetic information, one can apply the phylogenetic isometric log-ratio (PhILR) transformation (Silverman et al. 2017). Unlike standard transformations, PhILR accounts for the genetic relationships between taxa. This is important because closely related species often share similar properties, which traditional transformations fail to capture.

Silverman, Justin D, Alex D Washburne, Sayan Mukherjee, and Lawrence A David. 2017. “A Phylogenetic Transform Enhances Analysis of Compositional Microbiota Data.” eLife 6. https://doi.org/10.7554/eLife.21887.

tse <- transformAssay(tse, method = "philr", MARGIN = 1L, pseudocount = TRUE)

Unlike other transformations, PhILR outputs a table whose rows represent the nodes of the phylogeny. Because these new features do not match the features of the TreeSE, the transformed dataset is stored in altExp.

altExp(tse, "philr")
##  class: TreeSummarizedExperiment 
##  dim: 149 27 
##  metadata(0):
##  assays(1): philr
##  rownames(149): node_1 node_2 ... node_148 node_149
##  rowData names(0):
##  colnames(27): A110 A12 ... A35 A38
##  colData names(4): patient_status cohort patient_status_vs_cohort
##    sample_name
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: NULL
##  rowTree: NULL
##  colLinks: NULL
##  colTree: NULL

Summary

Microbiome data is characterized by the following features:

Compositional
High variability
Zero-inflated

The OSCA book provides additional information on normalization from the perspective of single-cell analysis.

Exercises

Goal: The goal is to learn how to apply different transformations.

Exercise 1: Transform data

Load any of the example datasets mentioned in Section 4.2.
Visualize counts with a histogram. Describe the data distribution. Are there many zeros?
Transform the counts assay into relative abundances and store it into the TreeSE as an assay named relabund.
Similarly, perform a CLR transformation on the counts assay with a pseudocount of 1 and add it to the TreeSE as a new assay.
List the available assays by name.
Visualize the CLR-transformed data with a histogram. Compare the distribution with the distribution of the counts data.
Access the CLR assay and store it in a variable. Select a subset of its first 100 features and 10 samples, and print the abundance table. Explore the data.
Agglomerate the data with agglomerateByRanks(). Now transform the data with altexp = altExpNames(tse).
If the data has a phylogenetic tree, perform the PhILR transformation. Where was the transformed data stored? Compare the feature names with those of the original data. Why do the names differ?

Useful functions:

data(), plotHistogram(), transformAssay(), assayNames(), assay(), agglomerateByRanks(), altExp(), rownames()