10  Transformation

Data transformations are common in (microbial) ecology (Legendre and Gallagher 2001). They are used to improve compatibility with the assumptions of specific statistical methods, mitigate biases, enhance the comparability of samples or features, or obtain more interpretable values.

Legendre, Pierre, and Eugene D. Gallagher. 2001. “Ecologically Meaningful Transformations for Ordination of Species Data.” Oecologia 129 (2): 271–80. https://doi.org/10.1007/s004420100716.

Examples include the logarithmic transformation, the calculation of relative abundances (percentages), and compositionality-aware transformations such as the centered log-ratio transformation (clr).

10.1 Characteristics of microbiome data

Microbiome sequencing data has unique characteristics that must be addressed; otherwise, incorrect decisions might be made based on the results. Specifically, microbiome sequencing data is characterized by high variability, zero-inflation and compositionality. High variability means that the abundance of a taxon often varies by several orders of magnitude from sample to sample. Zero-inflation means that typically more than 70% of the values are zeros, which could be due to either physical absence (structural zeros) or insufficient sampling effort (sampling zeros). Compositionality means that a change in the absolute abundance of one taxon leads to apparent changes in the relative abundances of the other taxa in the same sample. If neglected, such properties may cause significant bias in the results of differential abundance analysis (DAA) or other statistical tests. Therefore, several approaches have been developed to address these properties and provide statistically reliable results.
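
The three properties can be made concrete with a small base R sketch. The count table and the absolute abundances below are made-up toy values chosen for illustration only; they are not taken from any real dataset.

```r
# Toy count table (made-up values): rows are taxa, columns are samples.
counts <- matrix(
    c(0,   0, 1200,     0,
      5,   0,    0,     0,
      0,   0,    0, 30000,
      2,   0,    0,     0,
      0, 900,    0,     0),
    nrow = 5, byrow = TRUE,
    dimnames = list(paste0("taxon", 1:5), paste0("sample", 1:4)))

# Zero-inflation: proportion of zero entries in the table.
zero_fraction <- mean(counts == 0)

# High variability: spread of the non-zero counts in orders of magnitude.
nonzero <- counts[counts > 0]
orders_of_magnitude <- log10(max(nonzero) / min(nonzero))

# Compositionality: only taxon A changes in absolute abundance, yet the
# relative abundances of B and C change as well.
absolute_before <- c(A = 100, B = 100, C = 200)
absolute_after  <- c(A = 400, B = 100, C = 200)
relative_before <- absolute_before / sum(absolute_before)
relative_after  <- absolute_after / sum(absolute_after)

zero_fraction
orders_of_magnitude
rbind(relative_before, relative_after)
```

In this toy example, taxa B and C appear to decrease in relative terms although their absolute abundances are unchanged.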

10.2 Common transformation methods

Let us now summarize some commonly used transformations in microbiome data science; further details and benchmarks are available in the references.

  • ‘relabundance’ Relative transformation, also known as total sum scaling (TSS) and compositional transformation. This converts counts into proportions (on the scale [0, 1]) that sum up to 1. Much of the currently available taxonomic abundance data from high-throughput assays (16S, metagenomic sequencing) is compositional by nature, even if the data is provided as counts (Gloor et al. 2017).

  • ‘clr’ Centered log-ratio transformation (Aitchison 1986) is used to reduce data skewness and compositionality bias in relative abundances, while bringing the data to the logarithmic scale. This transformation is frequently applied in microbial ecology (Gloor et al. 2017). However, it only applies to positive values. The usual solution is to add a pseudocount, which introduces another type of bias in the data. The robust clr transformation (‘rclr’) is similar to the regular clr but allows data with zeros and thus avoids the need to add a pseudocount (Martino et al. 2019). While the values resulting from these transformations are difficult to interpret directly, they may enhance the comparability of relative differences between samples. The clr is part of the broader Aitchison family of transformations; the additive log-ratio transformation (‘alr’) is also available.

  • ‘pa’ Presence/absence transformation ignores abundances and only indicates whether the given feature is detected above the given threshold (default: 0). This simple transformation is relatively widely used in ecological research. It has shown good performance in microbiome-based classification (Giliberti et al. 2022, Karwowska2024).

  • ‘standardize’ Standardize (or ‘z-score’) transformation scales data to zero mean and unit variance. This is used to bring features (or samples) to more comparable levels in terms of the mean and scale of the values. This can enhance visualization and interpretation of the data.

  • ‘log’, ‘log2’, ‘log10’ Logarithmic transformations, used e.g. to reduce data skewness. With compositional data, the clr (or rclr) transformation is often preferred.

  • ‘hellinger’ Hellinger transformation is equal to the square root of relative abundances. This ecological transformation can be useful if we are interested in changes in relative abundances.

  • ‘rank’ Rank transformation replaces each value by its rank. This is useful, for instance, in non-parametric statistics. See also ‘rrank’ (relative rank transformation).

  • Other available transformations include Chi-square (‘chi.square’), frequency transformation (‘frequency’), and normalization that makes the margin sum of squares equal to one (‘normalize’).
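
Several of the transformations listed above can be written out by hand in base R, which may help to see what each one computes. This is a minimal sketch on a made-up count table, not the implementation used by any package. In particular, zeros are simply left as NA in the robust clr here, and a pseudocount of 1 is an arbitrary illustrative choice for the clr; actual implementations may handle both differently.

```r
# Toy count table (made-up values): rows are taxa, columns are samples.
counts <- matrix(
    c(10, 0, 90,
       0, 5, 45,
      30, 5, 15),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("taxon", 1:3), paste0("sample", 1:3)))

# Relative abundance: divide each sample (column) by its total.
relabundance <- sweep(counts, 2, colSums(counts), "/")

# clr: log values centered by the mean log value of each sample.
# Zeros require a pseudocount (here 1, an illustrative assumption).
log_counts <- log(counts + 1)
clr <- sweep(log_counts, 2, colMeans(log_counts), "-")

# Robust clr sketch: same centering, but computed over non-zero values only.
# Zeros are left as NA in this sketch.
log_nonzero <- log(counts)
log_nonzero[!is.finite(log_nonzero)] <- NA
rclr <- sweep(log_nonzero, 2, colMeans(log_nonzero, na.rm = TRUE), "-")

# Presence/absence with detection threshold 0.
pa <- (counts > 0) * 1

# Hellinger: square root of the relative abundances.
hellinger <- sqrt(relabundance)

round(clr, 3)
```

The clr values sum to zero within each sample, which reflects the centering step.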

Gloor, GB, JM Macklaim, V Pawlowsky-Glahn, and JJ Egozcue. 2017. “Microbiome Datasets Are Compositional: And This Is Not Optional.” Frontiers in Microbiology 8. https://doi.org/10.3389/fmicb.2017.02224.
Aitchison, J. 1986. The Statistical Analysis of Compositional Data. London, UK: Chapman & Hall.
Martino, C, J. T. Morton, C. A. Marotz, L. R. Thompson, A Tripathi, R Knight, and K Zengler. 2019. “A Novel Sparse Compositional Technique Reveals Microbial Perturbations.” mSystems 4.

Transformations on abundance assays can be performed with mia::transformAssay(), which keeps both the original and the transformed assay(s). The transformed abundance assay is stored back to the ‘assays’ slot in the data object. The function applies a sample-wise (column-wise) transformation when MARGIN = ‘cols’ and a feature-wise (row-wise) transformation when MARGIN = ‘rows’. A complete list of available transformations and parameters is available in the function help.
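
The difference between the two margins can be illustrated in base R with a z-score standardization. This sketch uses a made-up matrix and base R's scale(); it only illustrates the column-wise versus row-wise distinction and is not meant to reproduce how transformAssay() works internally.

```r
# Toy matrix (made-up values): rows are features, columns are samples.
x <- matrix(
    c( 1,  2,  3,
      10, 20, 60,
       5,  5,  8),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("feature", 1:3), paste0("sample", 1:3)))

# Column-wise (sample-wise): each sample gets zero mean and unit variance.
z_cols <- scale(x)

# Row-wise (feature-wise): transpose, standardize, and transpose back so
# that each feature gets zero mean and unit variance.
z_rows <- t(scale(t(x)))

round(colMeans(z_cols), 10)
round(rowMeans(z_rows), 10)
```

The same input thus yields different values depending on the margin, so the choice should follow from whether samples or features are to be made comparable.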

Important

A pseudocount is a small positive value added to the normalized data to avoid taking the logarithm of zero. Its value can have a significant impact on the results when applying a logarithmic transformation to normalized data, as the logarithm is a nonlinear operation that can fundamentally change the data distribution (Costea et al. 2014).

Pseudocount should be chosen consistently across all normalization methods being compared, for example, by setting it to a value smaller than the minimum abundance value before transformation. Some tools, like ancombc2, take into account the effect of the pseudocount by performing sensitivity tests using multiple pseudocount values. See Chapter 16.
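
The effect of the pseudocount can be sketched in base R on a made-up vector of relative abundances. The three pseudocount values are hypothetical choices for illustration, and taking half of the smallest non-zero abundance is just one possible way, assumed here, to pick a value below the minimum.

```r
# Made-up relative abundances of four taxa in one sample; taxon A is absent.
rel <- c(taxonA = 0, taxonB = 0.001, taxonC = 0.099, taxonD = 0.9)

# Hypothetical pseudocount choices to compare.
pseudocounts <- c(1e-6, 1e-4, 1e-2)

# log10 transformation under each pseudocount (taxa in rows).
log_transformed <- sapply(pseudocounts, function(p) log10(rel + p))
colnames(log_transformed) <- paste0("pc=", pseudocounts)

# Apparent difference between the rare taxon B and the absent taxon A
# on the log10 scale, for each pseudocount.
gap <- log_transformed["taxonB", ] - log_transformed["taxonA", ]

# One consistent choice: a value below the smallest non-zero abundance
# (half of it is an assumption made for this sketch).
min_nonzero <- min(rel[rel > 0])
pseudocount <- min_nonzero / 2

round(log_transformed, 3)
round(gap, 3)
```

The gap between the absent and the rare taxon shrinks as the pseudocount grows, which shows why the same value should be used across the methods being compared.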

Costea, Paul I., Georg Zeller, Shinichi Sunagawa, and Peer Bork. 2014. “A Fair Comparison.” Nature Methods 11: 359. https://doi.org/10.1038/nmeth.2897.

10.3 Transformations in practice

# Load example data
library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# Transform "counts" assay to relative abundances ("relabundance"), with
# pseudocount 1
tse <- transformAssay(
    tse, assay.type = "counts", method = "relabundance", pseudocount = 1)

# Transform relative abundance assay ("relabundance") to "clr", using
# pseudocount if necessary; name the resulting assay to "clr"
tse <- transformAssay(
    x = tse, assay.type = "relabundance", method = "clr", pseudocount = TRUE,
    name = "clr")

Retrieve the values of the resulting assay and view its first rows with the head() function.

assay(tse, "clr") |> head()
##             CL3    CC1     SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
##  549322 -0.9056 -1.054 -0.7109  -0.264 -0.2319 -0.3381 -0.4562 -0.2601
##  522457 -0.9056 -1.054 -0.7109  -0.264 -0.2319 -0.3381 -0.4562 -0.2601
##  951    -0.9056 -1.054 -0.7109  -0.264 -0.2319 -0.3381  0.1924 -0.2601
##  244423 -0.9056 -1.054 -0.7109  -0.264 -0.2319 -0.3381 -0.4562 -0.2601
##  586076 -0.9056 -1.054 -0.7109  -0.264 -0.2319 -0.3381 -0.4562 -0.2601
##  246140 -0.9056 -1.054 -0.7109  -0.264 -0.2319 -0.3381 -0.4562 -0.2601
##         M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm  AQC4cm  AQC7cm     NP2
##  549322 -0.2193 -0.1554  -0.3118   0.2891  2.465  3.4909  3.8475  0.4076
##  522457 -0.2193 -0.1554  -0.3118  -0.2950 -0.653  0.1236  0.9658 -0.2330
##  951    -0.2193 -0.1554  -0.3118  -0.2950 -0.653 -0.7237 -0.7218 -0.2330
##  244423 -0.2193 -0.1554  -0.3118  -0.2950 -0.653  2.0279  2.3827 -0.2330
##  586076 -0.2193 -0.1554  -0.3118  -0.2950 -0.653  0.1236 -0.1711 -0.2330
##  246140 -0.2193 -0.1554  -0.3118  -0.2950 -0.653 -0.2128  0.4424 -0.2330
##             NP3     NP5 TRRsed1 TRRsed2 TRRsed3    TS28    TS29   Even1
##  549322 -0.3929 -0.3238 -0.2659 -0.4643 -0.4281 -0.2514 -0.2354 -0.3146
##  522457 -0.3929 -0.3238 -0.2659 -0.4643 -0.4281 -0.2514 -0.2354 -0.3146
##  951    -0.3929 -0.3238 -0.2659 -0.4643 -0.4281 -0.2514 -0.2354 -0.3146
##  244423 -0.3929 -0.3238 -0.2659 -0.4643 -0.4281 -0.2514 -0.2354 -0.3146
##  586076 -0.3929 -0.3238 -0.2659 -0.4643 -0.4281 -0.2514 -0.2354 -0.3146
##  246140 -0.3929 -0.3238 -0.2659 -0.4643 -0.4281 -0.2514 -0.2354 -0.3146
##           Even2   Even3
##  549322 -0.2334 -0.2185
##  522457 -0.2334 -0.2185
##  951    -0.2334 -0.2185
##  244423 -0.2334 -0.2185
##  586076 -0.2334 -0.2185
##  246140 -0.2334 -0.2185

In the ‘pa’ transformation, the abundance table is converted to a presence/absence table that ignores abundances and only indicates whether the given feature is detected. Despite its simplicity, this transformation has shown good performance in microbiome-based classification (Giliberti et al. 2022, Karwowska2024).

Giliberti, R, S Cavaliere, IE Mauriello, D Ercolini, and E Pasolli. 2022. “Host Phenotype Classification from Human Microbiome Data Is Mainly Driven by the Presence of Microbial Taxa.” PLoS Comput Biol. 18 (4): e1010066. https://doi.org/10.1371/journal.pcbi.1010066.

# Here, `assay.type` is not explicitly specified, so the function
# uses the "counts" assay for the transformation.
tse <- transformAssay(tse, method = "pa")
assay(tse, "pa") |> head()
##         CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong M11Tong
##  549322   0   0   0       0       0       0       0       0       0       0
##  522457   0   0   0       0       0       0       0       0       0       0
##  951      0   0   0       0       0       0       1       0       0       0
##  244423   0   0   0       0       0       0       0       0       0       0
##  586076   0   0   0       0       0       0       0       0       0       0
##  246140   0   0   0       0       0       0       0       0       0       0
##         LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm NP2 NP3 NP5 TRRsed1 TRRsed2
##  549322        0        1      1      1      1   1   0   0       0       0
##  522457        0        0      0      1      1   0   0   0       0       0
##  951           0        0      0      0      0   0   0   0       0       0
##  244423        0        0      0      1      1   0   0   0       0       0
##  586076        0        0      0      1      1   0   0   0       0       0
##  246140        0        0      0      1      1   0   0   0       0       0
##         TRRsed3 TS28 TS29 Even1 Even2 Even3
##  549322       0    0    0     0     0     0
##  522457       0    0    0     0     0     0
##  951          0    0    0     0     0     0
##  244423       0    0    0     0     0     0
##  586076       0    0    0     0     0     0
##  246140       0    0    0     0     0     0

You can now view the entire list of abundance assays in your data object with:

assays(tse)
##  List of length 4
##  names(4): counts relabundance clr pa
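
Transformations can also be chained, since a transformed assay can serve as the input for the next call. The following sketch, which assumes the mia package is installed, applies a log10 transformation to the counts and then standardizes each feature. The object name tse_log, the assay names, and the pseudocount of 1 are illustrative choices, not recommendations.

```r
library(mia)
data("GlobalPatterns", package = "mia")
tse_log <- GlobalPatterns  # separate object, name chosen for this sketch

# log10 transformation of the counts; the pseudocount of 1 is an
# illustrative choice that avoids taking the logarithm of zero
tse_log <- transformAssay(
    tse_log, assay.type = "counts", method = "log10", pseudocount = 1,
    name = "log10")

# Standardize (z-score) each feature, i.e. each row, of the log10 assay
tse_log <- transformAssay(
    tse_log, assay.type = "log10", method = "standardize", MARGIN = "rows",
    name = "log10_z")

assayNames(tse_log)
```

Giving each result an explicit name keeps the intermediate and final assays distinguishable in the data object.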

Summary

Microbiome data is characterized by the following features:

  • Compositional
  • High variability
  • Zero-inflated

The OSCA book provides additional information on normalization from the perspective of single-cell analysis.
