Estimate alpha diversity indices

These functions estimate alpha diversity indices, optionally using rarefaction.

addAlpha(x, ...)

getAlpha(x, ...)

# S4 method for class 'SummarizedExperiment'
addAlpha(x, ...)

# S4 method for class 'SummarizedExperiment'
getAlpha(
  x,
  assay.type = "counts",
  index = c("dbp_dominance", "faith_diversity", "observed_richness", "shannon_diversity"),
  name = index,
  niter = NULL,
  BPPARAM = SerialParam(),
  ...
)

Arguments

x

a SummarizedExperiment object.

...

optional arguments:

sample: Integer scalar. Specifies the rarefaction depth i.e. the number of counts drawn from each sample. (Default: min(colSums2(assay(x, assay.type))))
tree.name: (Faith's index). Character scalar. Specifies which rowTree will be used. (Default: "phylo")
node.label: (Faith's index). Character vector or NULL. Specifies the links between rows and node labels of the phylogenetic tree specified by tree.name. If a row is not linked with the tree, the missing link should be denoted as NA. When NULL, all the rownames should be found from the tree. (Default: NULL)
only.tips: (Faith's index). Logical scalar. Specifies whether to remove internal nodes when Faith's index is calculated. When only.tips=TRUE, those rows that are not tips of tree are removed. (Default: FALSE)
threshold: (Coverage and all evenness indices). Numeric scalar. A value from 0 to 1 that sets the threshold for coverage and evenness indices. When evenness indices are calculated, values less than or equal to this threshold are set to zero. For the coverage index, see Details. (Default: 0.5 for coverage, 0 for evenness indices)
quantile: (log modulo skewness index). Numeric scalar. Arithmetic abundance classes are evenly cut up to this quantile of the data. The assumption is that abundances higher than this are not common, and they are classified in their own group. (Default: 0.5)
nclasses: (log modulo skewness index). Integer scalar. The number of arithmetic abundance classes from zero to the quantile cutoff indicated by quantile. (Default: 50)
ntaxa: (absolute and relative indices). Integer scalar. The n-th position of the dominant taxa to consider. (Default: 1)
aggregate: (absolute, dbp, dmn, and relative indices). Logical scalar. Specifies whether to aggregate the values for the top members selected by ntaxa. If TRUE, the sum of relative abundances is returned. Otherwise, the relative abundance is returned for the single taxon with the indicated rank. (Default: TRUE)
detection: (observed index). Numeric scalar. Selects the detection threshold for the abundances. (Default: 0)

assay.type

Character scalar. Specifies the name of assay used in calculation. (Default: "counts")

index

Character vector. Specifies the alpha diversity indices to be calculated.

name

Character vector. A name for the column of the colData where results will be stored. (Default: index)

niter

Integer scalar. Specifies the number of rarefaction rounds. Rarefaction is not applied when niter=NULL (see Details section). (Default: NULL)

BPPARAM

A BiocParallelParam object specifying whether the calculation should be parallelized.

Value

getAlpha returns a DataFrame. addAlpha returns x with additional colData column(s) named according to name.

Details

Different diversity metrics consider different aspects of a microbial community. Cassol et al. (2025) categorized alpha diversity metrics into four categories: richness, dominance, information, and phylogenetic. These categories provide complementary information, and by default, the *Alpha functions return one index from each category: observed richness, Berger-Parker dominance, Shannon index for "information", and Faith's phylogenetic index.

Diversity

Alpha diversity is a joint quantity that combines elements of community richness and evenness. Diversity increases, in general, when species richness or evenness increases.

The following diversity indices are available:

'coverage': Number of species needed to cover a given fraction of the ecosystem (50 percent by default). Tune this with the threshold argument.
'faith': Faith's phylogenetic alpha diversity index measures how long the taxonomic distance is between taxa that are present in the sample (Faith 1992). Larger values represent higher diversity. The current implementation is based on the Stacked Faith's Phylogenetic Diversity (SFPhD) algorithm (Armstrong et al. 2021), which produces values equivalent to picante::pd with the parameter include.root=TRUE. Using this index requires a rowTree.

If the data includes features that are not in the tree's tips but in internal nodes, there are two options. First, you can keep those features and prune the tree to match the features, so that each tip can be found among the features. The other option is to remove all features that are not tips (see the only.tips parameter).
'fisher': Fisher's alpha; as implemented in vegan::fisher.alpha. (Fisher et al. 1943)
'gini_simpson': Gini-Simpson diversity i.e. $1 - lambda$, where $lambda$ is the Simpson index, calculated as the sum of squared relative abundances. This corresponds to the diversity index 'simpson' in vegan::diversity. This is also called Gibbs–Martin, or Blau index in sociology, psychology and management studies. The Gini-Simpson index (1-lambda) should not be confused with Simpson's dominance (lambda), Gini index, or inverse Simpson index (1/lambda).
'inverse_simpson': Inverse Simpson diversity: $1/lambda$ where $lambda=sum(p^2)$ and p refers to relative abundances. This corresponds to the diversity index 'invsimpson' in vegan::diversity. This should not be confused with the closely related Gini-Simpson index.
'log_modulo_skewness': The rarity index characterizes the concentration of species at low abundance. Here, we use the skewness of the frequency distribution of arithmetic abundance classes (see Magurran & McGill 2011). These are typically right-skewed; to avoid taking log of occasional negative skews, we follow Locey & Lennon (2016) and use the log-modulo transformation that adds a value of one to each measure of skewness to allow logarithmization.
'shannon': Shannon diversity (entropy).
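The definitions above can be illustrated with a minimal base R sketch for a single sample. The count vector is hypothetical, the natural logarithm is assumed for Shannon entropy, and the coverage line only follows the verbal definition given above; it is not the package's internal code.

```r
# Hypothetical counts for one sample (one value per taxon)
counts <- c(50, 30, 15, 5, 0)

# Relative abundances of the taxa that are present
p <- counts[counts > 0] / sum(counts)

# Shannon entropy (natural logarithm assumed)
shannon <- -sum(p * log(p))

# Simpson index (lambda) and the two diversity indices derived from it
lambda <- sum(p^2)
gini_simpson <- 1 - lambda
inverse_simpson <- 1 / lambda

# Coverage: number of most abundant taxa needed to reach the threshold
threshold <- 0.5
coverage <- which(cumsum(sort(p, decreasing = TRUE)) >= threshold)[1]
```

In this toy sample the single most abundant taxon already accounts for half of the counts, so the coverage at the 0.5 threshold is 1.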

Dominance

A dominance index quantifies the dominance of one or a few species in a community. Greater values indicate higher dominance.

Dominance indices are in general negatively correlated with alpha diversity indices (species richness, evenness, diversity, rarity). More dominant communities are less diverse.

The following community dominance indices are available:

'absolute': The absolute index equals the absolute abundance of the most dominant n species of the sample (specify the number with the argument ntaxa). The index gives positive integer values.
'dbp': The Berger-Parker index (see Berger & Parker 1970) is a special case of the 'relative' index. dbp is the relative abundance of the most abundant species of the sample. The index gives values in the interval 0 to 1, where larger values represent greater dominance.

$$dbp = \frac{N_1}{N_{tot}}$$ where $N_1$ is the absolute abundance of the most dominant species and $N_{tot}$ is the sum of absolute abundances of all species.
'core_abundance': The core abundance index is related to core species. Core species are the species that are most abundant across all samples, i.e., in the whole data set. Core species are defined as those species that have a prevalence of over 50%; that is, a species must be prevalent in 50% of the samples to belong to the core. The core species are then used to calculate the core abundance index. The core abundance index is the sum of relative abundances of core species in the sample. The index gives values in the interval 0 to 1, where larger values represent greater dominance.

$$core\_abundance = \frac{N_{core}}{N_{tot}}$$ where $N_{core}$ is the sum of absolute abundance of the core species and $N_{tot}$ is the sum of absolute abundances of all species.
'gini': Gini index is probably best-known from socio-economic contexts (Gini 1921). In economics, it is used to measure, for example, how unevenly income is distributed among population. Here, Gini index is used similarly, but income is replaced with abundance.

If there is a small group of species that represents a large portion of the total abundance of microbes, the inequality is large and the Gini index is closer to 1. If all species have equally large abundances, the equality is perfect and the Gini index equals 0. This index should not be confused with the Gini-Simpson index, which quantifies diversity.
'dmn': McNaughton’s index is the sum of relative abundances of the two most abundant species of the sample (McNaughton & Wolf, 1970). Index gives values in the unit interval:

$$dmn = \frac{N_1 + N_2}{N_{tot}}$$

where $N_1$ and $N_2$ are the absolute abundances of the two most dominant species and $N_{tot}$ is the sum of absolute abundances of all species.
'relative': The relative index equals the relative abundance of the most dominant n species of the sample (specify the number with the argument ntaxa). This index gives values in the interval 0 to 1.

$$relative = \frac{N_1}{N_{tot}}$$

where $N_1$ is the absolute abundance of the most dominant species and $N_{tot}$ is the sum of absolute abundances of all species.
'simpson_lambda': Simpson's (dominance) index or Simpson's lambda is the sum of squared relative abundances. This index gives values in the unit interval. This value equals the probability that two randomly chosen individuals belong to the same species. The higher the probability, the greater the dominance (see e.g. Simpson 1949).

$$lambda = \sum(p^2)$$

where p refers to relative abundances.

There is also a more advanced Simpson dominance index (Simpson 1949). However, this is not provided and the simpler squared sum of relative abundances is used instead as the alternative index is not in the unit interval and it is highly correlated with the simpler variant implemented here.
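The rank-based dominance formulas above can be checked with a short base R sketch. The count vector and the variable names are hypothetical, and the handling of ntaxa and aggregate only mirrors the argument descriptions in this page; it is not taken from the package source.

```r
# Hypothetical counts for one sample (one value per taxon)
counts <- c(50, 30, 15, 5, 0)
sorted <- sort(counts, decreasing = TRUE)
n_tot <- sum(counts)

# Berger-Parker: relative abundance of the most abundant taxon
dbp <- sorted[1] / n_tot

# McNaughton: summed relative abundance of the two most abundant taxa
dmn <- (sorted[1] + sorted[2]) / n_tot

# Simpson's lambda: sum of squared relative abundances
simpson_lambda <- sum((counts / n_tot)^2)

# 'relative' and 'absolute' indices for ntaxa = 2
ntaxa <- 2
relative_aggregated <- sum(sorted[seq_len(ntaxa)]) / n_tot  # aggregate = TRUE
relative_single <- sorted[ntaxa] / n_tot                    # aggregate = FALSE
absolute_aggregated <- sum(sorted[seq_len(ntaxa)])          # aggregate = TRUE
```

With ntaxa = 2 and aggregation, the relative index coincides with McNaughton's index, which is why dbp and dmn can be seen as special cases of 'relative'.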

Evenness

Evenness is a standard index in community ecology, and it quantifies how evenly the abundances of different species are distributed.

By default, four indices are returned, each taking into account different aspects: richness (the number of observed unique features), dominance (Berger-Parker), information (Shannon), and phylogenetics (Faith) (Cassol et al., 2025).

The available evenness indices include the following (all in lowercase):

'camargo': Camargo's evenness (Camargo 1992)
'simpson_evenness': Simpson’s evenness is calculated as inverse Simpson diversity (1/lambda) divided by observed species richness S: (1/lambda)/S.
'pielou': Pielou's evenness (Pielou, 1966), also known as Shannon or Shannon-Weaver/Wiener/Weiner evenness; H/ln(S). The Shannon-Weaver is the preferred term; see Spellerberg and Fedor (2003).
'evar': Smith and Wilson’s Evar index (Smith & Wilson 1996).
'bulla': Bulla’s index (O) (Bulla 1994).

Desirable statistical evenness metrics avoid strong bias towards very large or very small abundances, are independent of richness, and range within the unit interval with increasing evenness (Smith & Wilson 1996). Evenness metrics that fulfill these criteria include at least camargo, simpson_evenness, evar (Smith and Wilson), and bulla. Also see Magurran & McGill (2011) and Beisel et al. (2003) for further details.
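As an illustration of the two evenness indices whose formulas are stated above, H/ln(S) and (1/lambda)/S, the following base R sketch computes both for one sample. The helper name evenness_sketch and the count vectors are hypothetical, and the natural logarithm is assumed.

```r
# Hypothetical helper: Pielou's and Simpson's evenness for one sample
evenness_sketch <- function(counts) {
    # Relative abundances of the taxa that are present
    p <- counts[counts > 0] / sum(counts)
    S <- length(p)               # observed richness
    H <- -sum(p * log(p))        # Shannon entropy (natural logarithm assumed)
    lambda <- sum(p^2)           # Simpson index
    c(pielou = H / log(S),
      simpson_evenness = (1 / lambda) / S)
}

uneven <- evenness_sketch(c(50, 30, 15, 5, 0))
even <- evenness_sketch(c(10, 10, 10, 10))
```

Both indices reach 1 for the perfectly even community and fall below 1 for the uneven one.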

Richness

The richness is calculated per sample. This is a standard index in community ecology, and it provides an estimate of the number of unique species in the community. This is often not directly observed for the whole community but only for a limited sample from the community. This has led to alternative richness indices that provide different ways to estimate the species richness.

Richness index differs from the concept of species diversity or evenness in that it ignores species abundance, and focuses on the binary presence/absence values that indicate simply whether the species was detected.

The function takes all index names in full lowercase. The user can provide the desired spelling through the argument name (see examples).

The following richness indices are provided.

'ace': Abundance-based coverage estimator (ACE) is another nonparametric richness index that uses sample coverage, defined based on the sum of the probabilities of the observed species. This method divides the species into abundant (more than 10 reads or observations) and rare groups in a sample and tends to underestimate the real number of species. The ACE index ignores the abundance information for the abundant species, based on the assumption that the abundant species are observed regardless of their exact abundance. We use here the bias-corrected version (O'Hara 2005, Chiu et al. 2014) implemented in estimateR. For an exact formulation, see estimateR. Note that this index comes with an additional column with standard error information.
'chao1': This is a nonparametric estimator of species richness. It assumes that rare species carry information about the (unknown) number of unobserved species. We use here the bias-corrected version (O'Hara 2005, Chiu et al. 2014) implemented in estimateR. This index implicitly assumes that every taxon has equal probability of being observed. Note that it gives a lower bound to species richness. For an exact formulation, see estimateR. This estimator uses only the singleton and doubleton counts, and hence it gives more weight to the low abundance species. Note that this index comes with an additional column with standard error information.
'hill': Effective species richness aka Hill index (see e.g. Chao et al. 2016). Currently only the case 1D is implemented. This corresponds to the exponential of Shannon diversity. Intuitively, the effective richness indicates the number of species whose even distribution would lead to the same diversity as the observed community, where the species abundances are unevenly distributed.
'observed': The observed richness gives the number of species that are detected above a given detection threshold in the observed sample (default 0). This is conceptually the simplest richness index. The corresponding index in the vegan package is "richness".
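A minimal base R sketch of three of the richness indices described above follows. The count vector is hypothetical, and the Chao1 line uses the commonly cited bias-corrected form based on singleton and doubleton counts; this is an assumption for illustration, and estimateR remains the reference for the exact formulation and the standard errors.

```r
# Hypothetical counts for one sample (one value per taxon)
counts <- c(12, 7, 3, 2, 2, 1, 1, 1, 0, 0)

# Observed richness: taxa detected above the detection threshold
detection <- 0
observed <- sum(counts > detection)

# Hill index of order 1: exponential of Shannon entropy
p <- counts[counts > 0] / sum(counts)
hill <- exp(-sum(p * log(p)))

# Chao1 (assumed bias-corrected form) from singletons and doubletons
f1 <- sum(counts == 1)
f2 <- sum(counts == 2)
chao1 <- observed + f1 * (f1 - 1) / (2 * (f2 + 1))
```

The effective richness never exceeds the observed richness, while Chao1 is at least as large because it adds an estimate of unobserved taxa.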

Rarefaction

Rarefaction can be used to control for uneven sequencing depths, although it is a highly debated method. Some consider it the only option that successfully controls the variation caused by uneven sampling depths. The main argument against rarefaction is that it omits data.

Rarefaction works by sampling the counts randomly. This random sampling is done niter times. In each iteration, sample counts are drawn at random from each sample, and alpha diversity is calculated for this subset. After the iterative process, there are niter results, which are then averaged to get the final result.

Refer to Schloss (2024) for more details on rarefaction.
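The iterative procedure can be sketched in base R for a single sample. The helper rarefy_once, the count vector, and the choice to sample without replacement are assumptions made for illustration; they are not taken from the package implementation.

```r
set.seed(42)

# Hypothetical counts for one sample and rarefaction settings
counts <- c(120, 60, 15, 4, 1)
depth <- 50    # plays the role of the 'sample' argument
niter <- 10

# Hypothetical helper: draw 'depth' reads without replacement
rarefy_once <- function(counts, depth) {
    # Expand counts to one entry per read, labelled by taxon index
    reads <- rep(seq_along(counts), times = counts)
    drawn <- sample(reads, size = depth, replace = FALSE)
    # Count the drawn reads per taxon
    tabulate(drawn, nbins = length(counts))
}

# Observed richness of each rarefied subset
richness <- vapply(seq_len(niter), function(i) {
    sum(rarefy_once(counts, depth) > 0)
}, integer(1))

# Final result: average over the niter rounds
rarefied_richness <- mean(richness)
```

Averaging over several rounds reduces the noise that a single random subsample would introduce, at the cost of niter times more computation.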

References

Armstrong G. et al. (2021) Efficient computation of Faith's phylogenetic diversity with applications in characterizing microbiomes. Genome Res. 31(11):2131-2137. doi: 10.1101/gr.275777.121

Beisel J-N. et al. (2003) A Comparative Analysis of Diversity Index Sensitivity. Internat. Rev. Hydrobiol. 88(1):3-15. https://portais.ufg.br/up/202/o/2003-comparative_evennes_index.pdf

Berger WH & Parker FL (1970) Diversity of Planktonic Foraminifera in Deep-Sea Sediments. Science 168(3937):1345-1347. doi: 10.1126/science.168.3937.1345

Bulla L. (1994) An index of diversity and its associated diversity measure. Oikos 70:167–171

Camargo, JA. (1992) New diversity index for assessing structural alterations in aquatic communities. Bull. Environ. Contam. Toxicol. 48:428–434.

Cassol, I., Ibañez, M. & Bustamante, J.P. (2025) Key features and guidelines for the application of microbial alpha diversity metrics. Sci. Rep. 15:622. doi: 10.1038/s41598-024-77864-y

Chao A. (1984) Non-parametric estimation of the number of classes in a population. Scand J Stat. 11:265–270.

Chao A, Chun-Huo C, Jost L (2016). Phylogenetic Diversity Measures and Their Decomposition: A Framework Based on Hill Numbers. Biodiversity Conservation and Phylogenetic Systematics, Springer International Publishing, pp. 141–172, doi:10.1007/978-3-319-22461-9_8.

Chiu, C.H., Wang, Y.T., Walther, B.A. & Chao, A. (2014). Improved nonparametric lower bound of species richness via a modified Good-Turing frequency formula. Biometrics 70, 671-682.

Faith D.P. (1992) Conservation evaluation and phylogenetic diversity. Biological Conservation 61(1):1-10.

Fisher R.A., Corbet, A.S. & Williams, C.B. (1943) The relation between the number of species and the number of individuals in a random sample of animal population. Journal of Animal Ecology 12, 42-58.

Gini C (1921) Measurement of Inequality of Incomes. The Economic Journal 31(121): 124-126. doi: 10.2307/2223319

Locey KJ and Lennon JT. (2016) Scaling laws predict global microbial diversity. PNAS 113(21):5970-5975; doi:10.1073/pnas.1521291113.

Magurran AE, McGill BJ, eds (2011) Biological Diversity: Frontiers in Measurement and Assessment (Oxford Univ Press, Oxford), Vol 12.

McNaughton, SJ and Wolf LL. (1970). Dominance and the niche in ecological systems. Science 167:131–139

O'Hara, R.B. (2005). Species richness estimators: how many species can dance on the head of a pin? J. Anim. Ecol. 74, 375-386.

Pielou, EC. (1966) The measurement of diversity in different types of biological collections. J Theoretical Biology 13:131–144.

Schloss PD (2024) Rarefaction is currently the best approach to control for uneven sequencing effort in amplicon sequence analyses. mSphere 28;9(2):e0035423. doi: 10.1128/msphere.00354-23

Simpson EH (1949) Measurement of Diversity. Nature 163(688). doi: 10.1038/163688a0

Smith B and Wilson JB. (1996) A Consumer's Guide to Evenness Indices. Oikos 76(1):70-82.

Spellerberg and Fedor (2003). A tribute to Claude Shannon (1916–2001) and a plea for more rigorous use of species richness, species diversity and the ‘Shannon–Wiener’ Index. Global Ecology & Biogeography 12, 177–179.

Examples


library(mia)
data("GlobalPatterns")
tse <- GlobalPatterns

# Calculate the default Shannon index with no rarefaction
tse <- addAlpha(tse, index = "shannon")

# Shows the estimated Shannon index
tse$shannon
#>  [1] 6.576517 6.776603 6.498494 3.828368 3.287666 4.289269 4.849999 4.874747
#>  [9] 2.672103 3.905419 3.093981 3.651142 3.552736 3.372495 4.027716 4.230515
#> [17] 4.483806 4.563943 6.157462 4.869817 5.461840 4.126538 3.452772 4.083665
#> [25] 3.956909 4.006375

# Calculate observed richness with 10 rarefaction rounds
tse <- addAlpha(
    tse,
    assay.type = "counts",
    index = "observed_richness",
    sample = min(colSums(assay(tse, "counts")), na.rm = TRUE),
    niter = 10
)

# Shows the estimated observed richness
tse$observed_richness
#>  [1] 3497 3824 2838  738  588 1326 2287 1790  506 1584  807  920 2211 2022 2226
#> [16]  969 1196  865 2995 2113 2451  823  695  884  731  628

# One can also calculate the indices and get the results without adding
# them to colData
res <- getAlpha(tse, index = "shannon")
res |> head()
#> DataFrame with 6 rows and 1 column
#>           shannon
#>         <numeric>
#> CL3       6.57652
#> CC1       6.77660
#> SV1       6.49849
#> M31Fcsw   3.82837
#> M11Fcsw   3.28767
#> M31Plmr   4.28927