13  Alpha Diversity

13.1 Background

Alpha diversity, or within-sample diversity, is a central concept in microbiome research. In ecological literature, several distinct but related alpha diversity indices, often referring to richness and evenness - the number of taxa and how they are distributed, respectively - are commonly used (Willis 2019; Whittaker 1960). The term diversity can be used to collectively refer to all these indices.

Willis, Amy D. 2019. “Rarefaction, Alpha Diversity, and Statistics.” Frontiers in Microbiology 10. https://doi.org/10.3389/fmicb.2019.02407.
Whittaker, R. H. 1960. “Vegetation of the Siskiyou Mountains, Oregon and California.” Ecological Monographs 30 (3): 279–338. https://doi.org/https://doi.org/10.2307/1943563.

13.1.1 Applications

Alpha diversity is predominantly used to quantify complexity in the microbiome. In the general adult population, lower alpha diversity and lower bacterial load have been associated to worse overall physical and mental health (Valles-Colomer et al. 2019; Vandeputte et al. 2017). However, this principle may not generalize to other populations, most notably in early life and in patient cohorts (Ma, Li, and Gotelli 2019).

Valles-Colomer, Mireia, Gwen Falony, Youssef Darzi, Ettje F. Tigchelaar, Jun Wang, Raul Y. Tito, Carmen Schiweck, et al. 2019. “The Neuroactive Potential of the Human Gut Microbiota in Quality of Life and Depression.” Journal Article. Nature Microbiology. https://doi.org/10.1038/s41564-018-0337-x.
Vandeputte, Doris, Gunter Kathagen, Kevin D’hoe, Sara Vieira-Silva, Mireia Valles-Colomer, João Sabino, Jun Wang, et al. 2017. “Quantitative Microbiome Profiling Links Gut Community Variation to Microbial Load.” Nature 551 (7681): 507–11.
Ma, Zhanshan (Sam), Lianwei Li, and Nicholas J Gotelli. 2019. “Diversity-Disease Relationships and Shared Species Analyses for Human Microbiome-Associated Diseases.” The ISME Journal 13 (8): 1911–19. https://doi.org/10.1038/s41396-019-0395-y.

13.1.2 Approaches

The majority of alpha diversity metrics are closely related, though this is not evident from their names. Bastiaanssen et al. (2023) lay out this relationship across two factors (See table below); First, alpha diversity metrics can be defined as special cases of a unifying equation of diversity, where the Hill number determines the specific index captured. Lower Hill numbers favour richness, the number of distinct taxa, whereas higher numbers favour evenness, how the taxa are distributed over the sample (Hill 1973). Second, some alpha diversity metrics are weighed based on phylogeny, like Faith’s PD (1992) and PhILR (Silverman et al. 2017).

Bastiaanssen, Thomaz FS, Thomas P Quinn, and Amy Loughman. 2023. “Bugs as Features (Part 1): Concepts and Foundations for the Compositional Data Analysis of the Microbiome–Gut–Brain Axis.” Nature Mental Health 1 (12): 930–38. https://doi.org/10.1038/s44220-023-00148-3.
Hill, M. O. 1973. Diversity and Evenness: A Unifying Notation and Its Consequences.” Ecology 54 (2): 427–32. https://doi.org/10.2307/1934352.
Faith, D. P. 1992. “Conservation Evaluation and Phylogenetic Diversity.” Biological Conservation 61 (1): 10. https://doi.org/10.1016/0006-3207(92)91201-3.
Silverman, Justin D, Alex D Washburne, Sayan Mukherjee, and Lawrence A David. 2017. “A Phylogenetic Transform Enhances Analysis of Compositional Microbiota Data.” eLife 6. https://doi.org/10.7554/eLife.21887.
Hill number 0
Hill number 1
Hill number 2
Dependent on presence and absence of taxa, not abundance Dependent on how evenly taxa are distributed in a sample The probability of two randomly picked taxa not being the same
Neutral Diversity

Weighs each taxon equally, no assumptions about phylogeny

Richness (Chao1) Shannon Entropy Simpson’s Index
Phylogenetic Diversity

Indices are scaled based on taxonomic closeness with a phylogenetic tree

Faith’s Phylogenetic Diversity Phylogenetic Entropy Rao’s Quadratic Diversity
The equation for general diversity can be defined as follows: \({{}^qD = (\sum_{i=1}^{R}p^q_i})^{\frac{1}{(1-q)}}\),
with q = Hill number, R = number of features, p = relative feature abundance.
Note: Richness estimators and denoising

Several estimators have been developed to address the confounding effect of limited sampling size on observed richness, most notably ACE (A and SM 1992) and Chao1 (A 1984). Notably, these approaches may yield misleading results for modern 16S data, which commonly features denoising and removal of singletons (Deng, Umbach, and Neufeld 2024).

Deng, Yongcui, Alexander K Umbach, and Josh D Neufeld. 2024. “Nonparametric Richness Estimators Chao1 and ACE Must Not Be Used with Amplicon Sequence Variant Data.” The ISME Journal 18 (1): wrae106. https://doi.org/10.1093/ismejo/wrae106.
A, Chao. 1984. “Non-Parametric Estimation of the Number of Classes in a Population.” Scandinavian Journal of Statistics 11 (4): 265–70. https://www.jstor.org/stable/4615964.
A, Chao, and Lee SM. 1992. “Estimating the Number of Classes via Sample Coverage.” Journal of the American Statistical Association 87 (417): 210–17. https://doi.org/https://doi.org/10.1080/01621459.1992.10475194.

13.2 Examples

13.2.1 Calculate alpha diversity measures

Alpha diversity can be estimated with the addAlpha() function, which interacts with other packages implementing the calculation, such as vegan (Oksanen et al. 2020) and picante (Kembel et al. 2010; W et al. 2010). These functions calculate the given indices, and add them to the colData slot of the SummarizedExperiment object with the given name.

Oksanen, Jari, F. Guillaume Blanchet, Michael Friendly, Roeland Kindt, Pierre Legendre, Dan McGlinn, Peter R. Minchin, et al. 2020. Vegan: Community Ecology Package. https://CRAN.R-project.org/package=vegan.
Kembel, Steven W, Peter D Cowan, Matthew R Helmus, William K Cornwell, Helene Morlon, David D Ackerly, Simon P Blomberg, and Campbell O Webb. 2010. Picante: R Tools for Integrating Phylogenies and Ecology. https://cran.r-project.org/web/packages/picante.
W, Kembel Steven, Cowan Peter D, Helmus Matthew R, Cornwell William K, Morlon Helene, Ackerly David D, Blomberg Simon P, and Webb Campbell O. 2010. Picante: R tools for integrating phylogenies and ecology.” Bioinformatics 26 (11): 1463–64. https://doi.org/https://doi.org/10.1093/bioinformatics/btq166.
# First, let's load some example data.
library(mia)
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns

# Compute one or multiple indices simultaneously through the index 'parameter'. 
tse <- addAlpha(
    tse, assay.type = "counts", index = c("observed", "shannon", "faith"),
    detection = 10)

# Check some of the first values in colData
tse$observed |> head()
##  [1] 3144 3791 2319  761  671 1004
tse$shannon  |> head()
##  [1] 6.577 6.777 6.498 3.828 3.288 4.289

Certain indices have additional options, here observed has detection parameter that control the detection threshold. Species over this threshold is considered as detected. See full list of options from from help(addAlpha).

Note: Phylogenetic distances require a tree

Because tse is a TreeSummarizedExperiment object, its phylogenetic tree is used by default. However, the optional argument tree must be provided if tse does not contain a rowTree.

13.2.2 Visualize alpha diversity measures

As alpha diversity metrics typically summarize high-dimensional samples into singular values, many visualization approaches are available. Once calculated, these metrics can be analyzed directly from the colData, for example, by plotting them using plotColData() from the scater package (McCarthy et al. 2020). Here, we use the observed species as a measure of richness. Let’s visualize the results against selected colData variables (sample type and final barcode).

McCarthy, Davis, Kieran Campbell, Aaron Lun, and Quin Wills. 2020. Scater: Single-Cell Analysis Toolkit for Gene Expression Data in r. http://bioconductor.org/packages/scater/.
library(scater)
plotColData(
    tse,
    "observed",
    "SampleType",
    colour_by = "Final_Barcode") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = "Sample types", y = expression(Richness[Observed]))

Observerd richness plotted grouped by sample type with colour-labeled barcode.

13.2.2.1 Alpha diversity measure comparisons

We can compare alpha diversities for example by calculating correlation between them. Below, a visual comparison between shannon and faith indices is shown with a scatter plot.

tse <- addAlpha(tse, assay.type = "counts", index = "shannon")

plotColData(tse, x = "shannon", y = "faith") +
    labs(x="Shannon index", y="Faith (phylogenetic) index") +
    geom_smooth(method = "lm")

cor.test(tse[["shannon"]], tse[["faith"]])
##  
##      Pearson's product-moment correlation
##  
##  data:  tse[["shannon"]] and tse[["faith"]]
##  t = 2.2, df = 24, p-value = 0.04
##  alternative hypothesis: true correlation is not equal to 0
##  95 percent confidence interval:
##   0.02719 0.68822
##  sample estimates:
##     cor 
##  0.4102

Let us visualize results from multiple alpha diversity measures against a given sample grouping available in colData (here, sample type). These have been readily stored in the colData slot, and they are thus directly available for plotting.

library(patchwork)

# Create the plots
plots <- lapply(
    c("observed", "shannon", "faith"),
    plotColData,
    object = tse,
    x = "SampleType",
    colour_by = "SampleType")

# Fine-tune visual appearance
plots <- lapply(
    plots, "+",
    theme(axis.text.x = element_blank(),
          axis.title.x = element_blank(),
          axis.ticks.x = element_blank()))

# Plot the figures
wrap_plots(plots, ncol = 1) +
  plot_layout(guides = "collect")

13.2.3 Statistical analysis of alpha diversity measures

We can then analyze the statistical significance. We use the non-parametric Wilcoxon or Mann-Whitney test, as it is more flexible than the commonly used Student’s t-Test, since it does not assume normality.

pairwise.wilcox.test(
    tse[["observed"]], tse[["SampleType"]], p.adjust.method = "fdr")
##  
##      Pairwise comparisons using Wilcoxon rank sum exact test 
##  
##  data:  tse[["observed"]] and tse[["SampleType"]] 
##  
##                     Feces Freshwater Freshwater (creek) Mock Ocean
##  Freshwater         0.4   -          -                  -    -    
##  Freshwater (creek) 0.3   0.3        -                  -    -    
##  Mock               0.3   0.3        0.3                -    -    
##  Ocean              0.8   1.0        0.3                0.3  -    
##  Sediment (estuary) 0.3   0.8        0.3                0.3  0.8  
##  Skin               0.3   0.8        0.3                0.3  0.8  
##  Soil               0.3   0.3        0.5                0.3  0.3  
##  Tongue             0.3   0.5        0.3                0.8  0.5  
##                     Sediment (estuary) Skin Soil
##  Freshwater         -                  -    -   
##  Freshwater (creek) -                  -    -   
##  Mock               -                  -    -   
##  Ocean              -                  -    -   
##  Sediment (estuary) -                  -    -   
##  Skin               1.0                -    -   
##  Soil               0.3                0.3  -   
##  Tongue             0.3                0.3  0.3 
##  
##  P value adjustment method: fdr

13.2.3.1 Visualizing significance in group-wise comparisons

Next, let’s compare the Shannon index between sample groups and visualize the statistical significance. Using the stat_compare_means function from the ggpubr package, we can add visually appealing p-values to our plots.

To add adjusted p-values, we have to first calculate them.

library(ggpubr)
library(tidyverse)


index <- "shannon"
group_var <- "SampleType"

# Subsets the data. Takes only those samples that are from feces, skin, or
# tongue.
tse_sub <- tse[ , tse[[group_var]] %in% c("Feces", "Skin", "Tongue") ]

# Changes old levels with new levels
tse_sub$SampleType <- factor(tse_sub$SampleType)

# Calculate p values
pvals <- pairwise.wilcox.test(
    tse_sub[[index]], tse_sub[[group_var]], p.adjust.method = "fdr")
# Put them to data.frame format
pvals <- pvals[["p.value"]] |>
    as.data.frame()
varname <- "group1"
pvals[[varname]] <- rownames(pvals)
# To long format
pvals <- reshape(
    pvals,
    direction = "long",
    varying = colnames(pvals)[ !colnames(pvals) %in% varname ],
    times = colnames(pvals)[ !colnames(pvals) %in% varname ],
    v.names = "p",
    timevar = "group2",
    idvar = "group1"
    ) |>
    na.omit()
# Add y-axis position
pvals[["y.position"]] <- apply(pvals, 1, function(x){
    temp1 <- tse[[index]][ tse[[group_var]] == x[["group1"]] ]
    temp2 <- tse[[index]][ tse[[group_var]] == x[["group2"]] ]
    temp <- max( c(temp1, temp2) )
    return(temp)
})
pvals[["y.position"]] <- max(pvals[["y.position"]]) +
    order(pvals[["y.position"]]) * 0.2
# Round values
pvals[["p"]] <- round(pvals[["p"]], 3)

# Create a boxplot
p <- plotColData(
    tse_sub, x = group_var, y = index,
    show_boxplot = TRUE, show_violin = FALSE) +
    theme(text = element_text(size = 10)) +
    stat_pvalue_manual(pvals)
p

13.3 Further reading

Article on ggpubr package provides further examples for estimating and highlighting significances.

Back to top