11  Exploration & quality control

This chapter focuses on the quality control and exploration of microbiome data and establishes commonly used descriptive summaries. Familiarizing yourself with the peculiarities of a given dataset is essential for data analysis and model building.

The dataset should not suffer from severe technical biases, and you should at least be aware of potential challenges, such as outliers, biases, unexpected patterns, and so forth. Standard summaries and visualizations can help, and the rest comes with experience. Moreover, exploration and quality control often entail iterative processes.

11.1 Abundance

Abundance visualization is an important data exploration approach. miaViz integrated the plotAbundanceDensity() function to plot the most abundant taxa along with several options.

Next, a few demonstrations are shown using the (Lahti et al. 2014) dataset. A jitter plot based on relative abundance data, similar to the one presented at (Salosensaari et al. 2021) (Supplementary Fig.1), can be visualized as follows:

Lahti, L, JSalojärvi, A Salonen, M Scheffer, and WM de Vos. 2014. “Tipping Elements in the Human Intestinal Ecosystem.” Nature Communications 2014: 1–10. https://doi.org/https://doi.org/10.1038/ncomms5344.
Salosensaari, Aaro, Ville Laitinen, Aki Havulinna, Guillaume Méric, Susan Cheng, Markus Perola, Liisa Valsta, et al. 2021. “Taxonomic Signatures of Cause-Specific Mortality Risk in Human Gut Microbiome.” Nature Communications 12: 1–8. https://www.nature.com/articles/s41467-021-22962-y.
# Load example data
library(miaTime)
library(miaViz)
data(hitchip1006)
tse <- hitchip1006

# Add relative abundances
tse <- transformAssay(tse, MARGIN = "cols", method = "relabundance")

# Use argument names
# assay.type / assay.type / assay.type
# depending on the mia package version
plotAbundanceDensity(
    tse, layout = "jitter",
    assay.type = "relabundance",
    n = 40, point_size=1, point.shape=19,
    point.alpha=0.1) +
    scale_x_log10(label=scales::percent)

The relative abundance values for the top-5 taxonomic features can be visualized as a density plot over a log-scaled axis, with “nationality” indicated by colors:

plotAbundanceDensity(
    tse, layout = "density",
    assay.type = "relabundance",
    n = 5, colour.by="nationality",
    point.alpha=1/10) +
    scale_x_log10()

11.2 Prevalence

Prevalence quantifies the frequency of samples where certain microbes were detected (above a given detection threshold). Prevalence can be given as sample size (N) or percentage (unit interval).

Investigating prevalence allows you either to focus on changes which pertain to the majority of the samples, or identify rare microbes, which may be conditionally abundant in a small number of samples.

The population prevalence (frequency) at a 1% relative abundance threshold (detection = 1/100 and as.relative = TRUE) can look like this.

getPrevalence(tse, detection = 1/100, sort = TRUE, as.relative = TRUE) |>
    head()
##  Faecalibacterium prausnitzii et rel.           Ruminococcus obeum et rel. 
##                                0.9522                               0.9140 
##    Oscillospira guillermondii et rel.        Clostridium symbiosum et rel. 
##                                0.8801                               0.8714 
##      Subdoligranulum variable at rel.     Clostridium orbiscindens et rel. 
##                                0.8358                               0.8315

The function arguments detection and as.relative can also be used to access how many samples pass a threshold for raw counts. Here, the population prevalence (frequency) at the absolute abundance threshold (as.relative = FALSE) with a read count of 1 (detection = 1) is accessed.

getPrevalence(
    tse, detection = 1, sort = TRUE, assay.type = "counts",
    as.relative = FALSE) |>
    head()
##             Uncultured Mollicutes      Uncultured Clostridiales II 
##                                 1                                1 
##        Uncultured Clostridiales I               Tannerella et rel. 
##                                 1                                1 
##    Sutterella wadsworthia et rel. Subdoligranulum variable at rel. 
##                                 1                                1

If the output is to be used for subsetting or storing the data in the rowData, set sort = FALSE.

11.2.1 Prevalence analysis

To investigate microbiome prevalence at a selected taxonomic level, two approaches are available.

First, the data can be agglomerated to the taxonomic level and then getPrevalence() can be applied to the resulting object.

# Agglomerate taxa abundances to Phylum level
#and add the new table to the altExp slot
altExp(tse,"Phylum") <- agglomerateByRank(tse, "Phylum")
# Check prevalence for the Phylum abundance table
#from the altExp slot
getPrevalence(
    altExp(tse,"Phylum"), detection = 1/100, sort = TRUE,
    assay.type = "counts", as.relative = TRUE) |>
    head()
##       Firmicutes   Bacteroidetes  Actinobacteria  Proteobacteria 
##        1.0000000       0.9852302       0.4821894       0.2988705 
##  Verrucomicrobia    Fusobacteria 
##        0.1277150       0.0008688

Alternatively, the rank argument can be set to perform the agglomeration on the fly.

getPrevalence(
    tse, rank = "Phylum", detection = 1/100, sort = TRUE,
    assay.type = "counts", as.relative = TRUE) |>
    head()
##       Firmicutes   Bacteroidetes  Actinobacteria  Proteobacteria 
##        1.0000000       0.9852302       0.4821894       0.2988705 
##  Verrucomicrobia    Fusobacteria 
##        0.1277150       0.0008688
Note

By default, na.rm = TRUE is used for agglomeration in getPrevalence(), whereas the default for agglomerateByRank() is FALSE to prevent accidental data loss.

If you only need the names of the prevalent taxa, getPrevalent() is available. This returns the taxa that exceed the given prevalence and detection thresholds.

getPrevalent(tse, detection = 0, prevalence = 50/100)
prev <- getPrevalent(
    tse, detection = 0, prevalence = 50/100, rank = "Phylum", sort = TRUE)
prev
Note

The detection and prevalence thresholds are not the same, as detection can be applied to relative counts or absolute counts depending on whether as.relative is set TRUE or FALSE

The function getPrevalentAbundance() can be used to check the total relative abundance of the prevalent taxa (between 0 and 1).

11.2.2 Rare taxa

Related functions are available for the analysis of rare taxa (rare_members(), rare_abundance(), low_abundance(), getRare(), subsetByRare()).

11.2.3 Plotting prevalence

To plot the prevalence, add the prevalence of each taxon to rowData. Here, we are analyzing the Phylum-level abundances, which are stored in the altExp slot.

rowData(altExp(tse,"Phylum"))$prevalence <- getPrevalence(
    altExp(tse,"Phylum"), detection = 1/100,
    sort = FALSE,
    assay.type = "counts", as.relative = TRUE)

The prevalences can then be plotted using the plotting functions from the scater package.

library(scater)
plotRowData(altExp(tse,"Phylum"), "prevalence", colour_by = "Phylum")

The prevalence can also be visualized on the taxonomic tree with the miaViz package.

tse <- agglomerateByRanks(tse)
altExps(tse) <- lapply(
    altExps(tse), function(y){
        rowData(y)$prevalence <- getPrevalence(
            y, detection = 1/100,
            sort = FALSE,
            assay.type = "counts",
            as.relative = TRUE)
        return(y)
    })
top_phyla <- getTop(
    altExp(tse,"Phylum"),
    method="prevalence",
    top=5L,
    assay.type="counts")
top_phyla_mean <- getTop(
    altExp(tse,"Phylum"),
    method="mean",
    top=5L,
    assay.type="counts")

x <- unsplitByRanks(tse, ranks = taxonomyRanks(tse)[1:6])
x <- addHierarchyTree(x)

After some preparation, the data are assembled and can be plotted with plotRowTree().

library(miaViz)
# Filter rows where Phylum is in top_phyla
selected <- rowData(x)$Phylum %in% top_phyla

# Plot the data
plotRowTree(
    x[selected, ],
    edge.colour.by = "Phylum",
    tip.colour.by = "prevalence",
    node.colour.by = "prevalence")

Prevalence of top phyla as judged by prevalence
# Filter for Phylum in top_phyla_mean
selected <- rowData(x)$Phylum %in% top_phyla_mean

# Plot the data
plotRowTree(
    x[selected, ],
    edge.colour.by = "Phylum",
    tip.colour.by = "prevalence",
    node.colour.by = "prevalence")

Prevalence of top phyla as judged by mean abundance

11.3 Quality control

Next, let’s load the GlobalPatterns dataset to illustrate standard microbiome data summaries.

library(mia)
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns

11.3.1 Top taxa

The function getTop() identifies top taxa in the data.

# Pick the top taxa
top_features <- getTop(tse, method="median", top=10)

# Check the information for these
rowData(tse)[top_features, taxonomyRanks(tse)][1:5, 1:3]
##  DataFrame with 5 rows and 3 columns
##             Kingdom         Phylum               Class
##         <character>    <character>         <character>
##  549656    Bacteria  Cyanobacteria         Chloroplast
##  331820    Bacteria  Bacteroidetes         Bacteroidia
##  317182    Bacteria  Cyanobacteria         Chloroplast
##  94166     Bacteria Proteobacteria Gammaproteobacteria
##  279599    Bacteria  Cyanobacteria    Nostocophycideae

11.3.2 Unique taxa

The function getUnique returns unique taxa in the data.

# Get unique taxa at rank Class
getUnique(tse, "Class", sort = TRUE)
##    [1] "0319-6G9"              "09D2Y74"               "12-24"                
##    [4] "4C0d-2"                "5B-18"                 "A712011"              
##    [7] "AB16"                  "ABS-6"                 "Acidobacteria"        
##   [10] "Acidobacteria-5"       "Actinobacteria"        "Alphaproteobacteria"  
##   [13] "Anaerolineae"          "Armatimonadia"         "BB34"                 
##   [16] "BD4-9"                 "BPC102"                "BSV19"                
##   [19] "Bacilli"               "Bacteroidia"           "Betaproteobacteria"   
##   [22] "Bljii12"               "C2"                    "C6"                   
##   [25] "CH21"                  "Caldisericia"          "Caldithrixae"         
##   [28] "Chlamydiae"            "Chloracidobacteria"    "Chlorobia"            
##   [31] "Chloroflexi"           "Chloroflexi-4"         "Chloroplast"          
##   [34] "Chthonomonadetes"      "Clostridia"            "Dehalococcoidetes"    
##   [37] "Deinococci"            "Deltaproteobacteria"   "Elusimicrobia"        
##   [40] "Endomicrobia"          "Epsilonproteobacteria" "Erysipelotrichi"      
##   [43] "FFCH393"               "FFCH6980"              "Fibrobacteres"        
##   [46] "Flavobacteria"         "Fusobacteria"          "GKS2-174"             
##   [49] "GN07"                  "GN08"                  "GN13"                 
##   [52] "GN15"                  "GW-22"                 "Gammaproteobacteria"  
##   [55] "Gemmatimonadetes"      "Halobacteria"          "Holophagae"           
##   [58] "Ignavibacteria"        "JG37-AG-4"             "JS1"                  
##   [61] "KSB3"                  "Koll-18"               "Ktedonobacteria"      
##   [64] "Kueneniae"             "Lentisphaerae"         "Leptospirae"          
##   [67] "MJK10"                 "ML615J-28"             "MSB-5A5"              
##   [70] "MSB-5B5"               "MVS-40"                "Methanobacteria"      
##   [73] "Methanomicrobia"       "Methylacidiphilae"     "Mollicutes"           
##   [76] "Nitrospira"            "Nostocophycideae"      "ODP123"               
##   [79] "OP11-2"                "OP11-3"                "OP8_1"                
##   [82] "OP8_2"                 "OPB56"                 "OS-K"                 
##   [85] "Opitutae"              "Oscillatoriophycideae" "PAUC37f"              
##   [88] "PBS-25"                "PRR-11"                "PRR-12"               
##   [91] "PW285"                 "Phycisphaerae"         "Planctomycea"         
##   [94] "R76-B18"               "RA13C7"                "RB25"                 
##   [97] "RB384"                 "S15B-MN24"             "S1a-1H"               
##  [100] "SBRH58"                "SHA-114"               "SJA-176"              
##  [103] "SJA-28"                "SJA-4"                 "SM1B09"               
##  [106] "SOGA31"                "Sd-NA"                 "Solibacteres"         
##  [109] "Spartobacteria"        "Sphingobacteria"       "Spirochaetes"         
##  [112] "Sva0725"               "Synechococcophycideae" "Synergistia"          
##  [115] "TG3-1"                 "TG3-2"                 "TK-SH13"              
##  [118] "TK17"                  "TM7-1"                 "TM7-3"                
##  [121] "TP21"                  "Thaumarchaeota"        "Thermomicrobia"       
##  [124] "Thermoplasmata"        "Thermoprotei"          "Thermotogae"          
##  [127] "Ucn15732"              "VC12-cl04"             "VHS-B5-50"            
##  [130] "Verruco-5"             "Verrucomicrobiae"      "WM88"                 
##  [133] "WWE1"                  "Zetaproteobacteria"    "agg27"                
##  [136] "iii1-8"                "koll11"                "pMC2A209"             
##  [139] "vadinHA49"             NA

11.3.3 Library size / read count

The total counts per sample can be calculated using perCellQCMetrics()/addPerCellQC() from the scater package. The former one just calculates the values, whereas the latter directly adds them to colData. mia provides the function summary for SE objects which returns the summary of counts for all samples and features including measures of central tendency.

library(scater)
# Get an overview of sample and taxa counts
summary(tse, assay.type= "counts")
##  $samples
##  # A tibble: 1 × 6
##    total_counts min_counts max_counts median_counts mean_counts stdev_counts
##           <dbl>      <dbl>      <dbl>         <dbl>       <dbl>        <dbl>
##  1     28216678      58688    2357181       1106849    1085257.      650145.
##  
##  $features
##  # A tibble: 1 × 3
##    total singletons per_sample_avg
##    <int>      <int>          <dbl>
##  1 19216       2134          4022.

# Calculate total counts per sample
perCellQCMetrics(tse)[1:5,]
##  DataFrame with 5 rows and 3 columns
##                sum  detected     total
##          <numeric> <numeric> <numeric>
##  CL3        864077      6964    864077
##  CC1       1135457      7679   1135457
##  SV1        697509      5729    697509
##  M31Fcsw   1543451      2667   1543451
##  M11Fcsw   2076476      2574   2076476

# Calculate and add total counts to `colData`
tse <- addPerCellQC(tse)
colData(tse)[1:5,1:3]
##  DataFrame with 5 rows and 3 columns
##          X.SampleID   Primer Final_Barcode
##            <factor> <factor>      <factor>
##  CL3        CL3      ILBC_01        AACGCA
##  CC1        CC1      ILBC_02        AACTCG
##  SV1        SV1      ILBC_03        AACTGT
##  M31Fcsw    M31Fcsw  ILBC_04        AAGAGA
##  M11Fcsw    M11Fcsw  ILBC_05        AAGCTG

The distribution of calculated library sizes can be visualized as a histogram (left) or by sorting the samples by library size (right).

library(ggplot2)
library(dplyr)
library(patchwork)

# Create a histrogram showing distribution of library sizes
p1 <- ggplot(colData(tse)) +
        geom_histogram(aes(x = sum/1e6), color = "black", fill = "gray") +
        labs(x = "Library size (million reads)", y = "Frequency (n)") +
        theme_classic()

# Order data in increasing order
df <- as.data.frame(colData(tse)) |>
    arrange(sum) |>
    mutate(proportion = (1:n())/n())

# Create a scatter plot showing how library sizes are distributed
p2 <- ggplot(df, aes(y = proportion, x = sum/1e6)) +
    geom_point() +
    labs(x = "Library size (million reads)", y = "Proportion of samples") +
    theme_classic()

p1 + p2

Library size distribution.

Library sizes and other variables from colData can be visualized using a specified function called plotColData().

# Sort samples by read count,
# order the factor levels,
# and store back to tse as DataFrame
colData(tse) <- as.data.frame(colData(tse)) |>
    arrange(X.SampleID) |>
    mutate(X.SampleID = factor(X.SampleID, levels=X.SampleID)) %>%
        DataFrame
plotColData(tse,"sum","X.SampleID", colour_by = "SampleType") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    labs(y = "Library size (N)", x = "Sample ID")

Library sizes per sample.
plotColData(tse,"sum","SampleType", colour_by = "SampleType") +
    theme(axis.text.x = element_text(angle = 45, hjust=1))

Library sizes per sample type.

In addition, data can be rarefied with rarefyAssay, which normalizes the samples to an equal number of reads. This remains controversial, however, and strategies to mitigate the information loss in rarefaction have been proposed (Schloss 2024a) (Schloss 2024b). Moreover, this practice has been discouraged for the analysis of differentially abundant microorganisms (see (McMurdie and Holmes 2014)).

Schloss, Patrick D. 2024a. “Rarefaction Is Currently the Best Approach to Control for Uneven Sequencing Effort in Amplicon Sequence Analyses.” mSphere 9 (2): e00354–23. https://doi.org/10.1128/msphere.00354-23.
———. 2024b. “Waste Not, Want Not: Revisiting the Analysis That Called into Question the Practice of Rarefaction.” mSphere 9 (1): e00355–23. https://doi.org/10.1128/msphere.00355-23.
McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Computational Biology 10 (4): e1003531.

11.3.4 Contaminant sequences

Samples might be contaminated with exogenous sequences. The impact of each contaminant can be estimated based on its frequencies and concentrations across the samples.

The following decontam functions are based on (Davis et al. 2018) and support such functionality:

Davis, Nicole M, Diana M Proctor, Susan P Holmes, David A Relman, and Benjamin J Callahan. 2018. “Simple Statistical Identification and Removal of Contaminant Sequences in Marker-Gene and Metagenomics Data.” Microbiome 6 (1): 1–14.
Back to top