4  Exploration and Quality Control

This chapter focuses on the quality control and exploration of microbiome data and establishes commonly used descriptive summaries. Familiarizing with the peculiarities of a given dataset is the essential basis for any data analysis and model building.

The dataset should not suffer from severe technical biases, and you should at least be aware of potential challenges, such as outliers, biases, unexpected patterns and so forth. Standard summaries and visualizations can help, and the rest comes with experience. The exploration and quality control can be iterative processes.

4.1 Abundance

Abundance visualization is an important data exploration approach. miaViz offers the function plotAbundanceDensity to plot the most abundant taxa with several options.

Next, a few demonstrations are shown, using the (Lahti et al. 2014) dataset. A Jitter plot based on relative abundance data, similar to the one presented at (Salosensaari et al. 2021) supplementary figure 1, could be visualized as follows:

Lahti, L, JSalojärvi, A Salonen, M Scheffer, and WM de Vos. 2014. “Tipping Elements in the Human Intestinal Ecosystem.” Nature Communications 2014: 1–10. https://doi.org/https://doi.org/10.1038/ncomms5344.
Salosensaari, Aaro, Ville Laitinen, Aki Havulinna, Guillaume Méric, Susan Cheng, Markus Perola, Liisa Valsta, et al. 2021. “Taxonomic Signatures of Cause-Specific Mortality Risk in Human Gut Microbiome.” Nature Communications 12: 1–8. https://www.nature.com/articles/s41467-021-22962-y.
# Load example data
library(miaTime)
library(miaViz)
data(hitchip1006)
tse <- hitchip1006

# Add relative abundances
tse <- transformAssay(tse, MARGIN = "samples", method = "relabundance")

# Use argument names
# assay.type / assay.type / assay.type
# depending on the mia package version
plotAbundanceDensity(tse, layout = "jitter", assay.type = "relabundance",
                     n = 40, point_size=1, point_shape=19, point_alpha=0.1) + 
                     scale_x_log10(label=scales::percent)

The relative abundance values for the top-5 taxonomic features can be visualized as a density plot over a log scaled axis, with “nationality” indicated by colors:

plotAbundanceDensity(tse, layout = "density", assay.type = "relabundance",
                     n = 5, colour_by="nationality", point_alpha=1/10) +
    scale_x_log10()

4.2 Prevalence

Prevalence quantifies the frequency of samples where certain microbes were detected (above a given detection threshold). The prevalence can be given as sample size (N) or percentage (unit interval).

Investigating prevalence allows you either to focus on changes which pertain to the majority of the samples, or identify rare microbes, which may be conditionally abundant in a small number of samples.

The population prevalence (frequency) at a 1% relative abundance threshold (detection = 1/100 and as_relative = TRUE), can look like this.

head(getPrevalence(tse, detection = 1/100, sort = TRUE, as_relative = TRUE))
##  Faecalibacterium prausnitzii et rel.           Ruminococcus obeum et rel. 
##                                0.9522                               0.9140 
##    Oscillospira guillermondii et rel.        Clostridium symbiosum et rel. 
##                                0.8801                               0.8714 
##      Subdoligranulum variable at rel.     Clostridium orbiscindens et rel. 
##                                0.8358                               0.8315

The function arguments detection and as_relative can also be used to access, how many samples do pass a threshold for raw counts. Here, the population prevalence (frequency) at the absolute abundance threshold (as_relative = FALSE) at read count 1 (detection = 1) is accessed.

head(getPrevalence(tse, detection = 1, sort = TRUE, assay.type = "counts",
                   as_relative = FALSE))
##             Uncultured Mollicutes      Uncultured Clostridiales II 
##                                 1                                1 
##        Uncultured Clostridiales I               Tannerella et rel. 
##                                 1                                1 
##    Sutterella wadsworthia et rel. Subdoligranulum variable at rel. 
##                                 1                                1

If the output should be used for subsetting or storing the data in the rowData, set sort = FALSE.

4.2.1 Prevalence analysis

To investigate microbiome prevalence at a selected taxonomic level, two approaches are available.

First the data can be agglomerated to the taxonomic level and getPrevalence applied on the resulting object.

# Agglomerate taxa abundances to Phylum level, and add the new table
# to the altExp slot
altExp(tse,"Phylum") <- mergeFeaturesByRank(tse, "Phylum")
# Check prevalence for the Phylum abundance table from the altExp slot
head(getPrevalence(altExp(tse,"Phylum"), detection = 1/100, sort = TRUE,
                   assay.type = "counts", as_relative = TRUE))
##       Firmicutes   Bacteroidetes  Actinobacteria  Proteobacteria 
##        1.0000000       0.9852302       0.4821894       0.2988705 
##  Verrucomicrobia   Cyanobacteria 
##        0.1277150       0.0008688

Alternatively, the rank argument could be set to perform the agglomeration on the fly.

head(getPrevalence(tse, rank = "Phylum", detection = 1/100, sort = TRUE,
                   assay.type = "counts", as_relative = TRUE))
##       Firmicutes   Bacteroidetes  Actinobacteria  Proteobacteria 
##        1.0000000       0.9852302       0.4821894       0.2988705 
##  Verrucomicrobia   Cyanobacteria 
##        0.1277150       0.0008688

Note that, by default, na.rm = TRUE is used for agglomeration in getPrevalence, whereas the default for mergeFeaturesByRank is FALSE to prevent accidental data loss.

If you only need the names of the prevalent taxa, getPrevalentFeatures is available. This returns the taxa that exceed the given prevalence and detection thresholds.

getPrevalentFeatures(tse, detection = 0, prevalence = 50/100)
prev <- getPrevalentFeatures(tse, detection = 0, prevalence = 50/100,
                         rank = "Phylum", sort = TRUE)
prev

Note that the detection and prevalence thresholds are not the same, since detection can be applied to relative counts or absolute counts depending on whether as_relative is set TRUE or FALSE

The function ‘getPrevalentAbundance’ can be used to check the total relative abundance of the prevalent taxa (between 0 and 1).

4.2.2 Rare taxa

Related functions are available for the analysis of rare taxa (rareMembers; rareAbundance; lowAbundance, getRareFeatures, subsetByRareFeatures).

4.2.3 Plotting prevalence

To plot the prevalence, add the prevalence of each taxon to rowData. Here, we are analysing the Phylum level abundances, which are stored in the altExp slot.

rowData(altExp(tse,"Phylum"))$prevalence <- 
    getPrevalence(altExp(tse,"Phylum"), detection = 1/100, sort = FALSE,
                  assay.type = "counts", as_relative = TRUE)

The prevalences can then be plotted using the plotting functions from the scater package.

library(scater)
plotRowData(altExp(tse,"Phylum"), "prevalence", colour_by = "Phylum")

The prevalence can also be visualized on the taxonomic tree with the miaViz package.

altExps(tse) <- splitByRanks(tse)
altExps(tse) <-
   lapply(altExps(tse),
          function(y){
              rowData(y)$prevalence <- 
                  getPrevalence(y, detection = 1/100, sort = FALSE,
                                assay.type = "counts", as_relative = TRUE)
              y
          })
top_phyla <- getTopFeatures(altExp(tse,"Phylum"),
                        method="prevalence",
                        top=5L,
                        assay.type="counts")
top_phyla_mean <- getTopFeatures(altExp(tse,"Phylum"),
                             method="mean",
                             top=5L,
                             assay.type="counts")
x <- unsplitByRanks(tse, ranks = taxonomyRanks(tse)[1:6])
x <- addHierarchyTree(x)

After some preparation, the data is assembled and can be plotted with plotRowTree.

library(miaViz)
plotRowTree(x[rowData(x)$Phylum %in% top_phyla,],
            edge_colour_by = "Phylum",
            tip_colour_by = "prevalence",
            node_colour_by = "prevalence")

Prevalence of top phyla as judged by prevalence
plotRowTree(x[rowData(x)$Phylum %in% top_phyla_mean,],
            edge_colour_by = "Phylum",
            tip_colour_by = "prevalence",
            node_colour_by = "prevalence")

Prevalence of top phyla as judged by mean abundance

4.3 Quality control

Next, let us load the GlobalPatterns dataset to illustrate standard microbiome data summaries.

library(mia)
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns 

4.3.1 Top taxa

The getTopFeatures identifies top taxa in the data.

# Pick the top taxa
top_features <- getTopFeatures(tse, method="median", top=10)

# Check the information for these
rowData(tse)[top_features, taxonomyRanks(tse)]
##  DataFrame with 10 rows and 7 columns
##             Kingdom         Phylum               Class             Order
##         <character>    <character>         <character>       <character>
##  549656    Bacteria  Cyanobacteria         Chloroplast     Stramenopiles
##  331820    Bacteria  Bacteroidetes         Bacteroidia     Bacteroidales
##  317182    Bacteria  Cyanobacteria         Chloroplast     Stramenopiles
##  94166     Bacteria Proteobacteria Gammaproteobacteria    Pasteurellales
##  279599    Bacteria  Cyanobacteria    Nostocophycideae        Nostocales
##  158660    Bacteria  Bacteroidetes         Bacteroidia     Bacteroidales
##  329744    Bacteria Actinobacteria      Actinobacteria   Actinomycetales
##  326977    Bacteria Actinobacteria      Actinobacteria Bifidobacteriales
##  248140    Bacteria  Bacteroidetes         Bacteroidia     Bacteroidales
##  550960    Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
##                     Family           Genus                Species
##                <character>     <character>            <character>
##  549656                 NA              NA                     NA
##  331820     Bacteroidaceae     Bacteroides                     NA
##  317182                 NA              NA                     NA
##  94166     Pasteurellaceae     Haemophilus Haemophilusparainflu..
##  279599        Nostocaceae  Dolichospermum                     NA
##  158660     Bacteroidaceae     Bacteroides                     NA
##  329744             ACK-M1              NA                     NA
##  326977 Bifidobacteriaceae Bifidobacterium Bifidobacteriumadole..
##  248140     Bacteroidaceae     Bacteroides      Bacteroidescaccae
##  550960 Enterobacteriaceae     Providencia                     NA

4.3.2 Library size / read count

The total counts/sample can be calculated using perCellQCMetrics/addPerCellQC from the scater package. The former one just calculates the values, whereas the latter one directly adds them to colData.

library(scater)
perCellQCMetrics(tse)
##  DataFrame with 26 rows and 3 columns
##                sum  detected     total
##          <numeric> <numeric> <numeric>
##  CL3        864077      6964    864077
##  CC1       1135457      7679   1135457
##  SV1        697509      5729    697509
##  M31Fcsw   1543451      2667   1543451
##  M11Fcsw   2076476      2574   2076476
##  ...           ...       ...       ...
##  TS28       937466      2679    937466
##  TS29      1211071      2629   1211071
##  Even1     1216137      4213   1216137
##  Even2      971073      3130    971073
##  Even3     1078241      2776   1078241
tse <- addPerCellQC(tse)
colData(tse)
##  DataFrame with 26 rows and 10 columns
##          X.SampleID   Primer Final_Barcode Barcode_truncated_plus_T
##            <factor> <factor>      <factor>                 <factor>
##  CL3        CL3      ILBC_01        AACGCA                   TGCGTT
##  CC1        CC1      ILBC_02        AACTCG                   CGAGTT
##  SV1        SV1      ILBC_03        AACTGT                   ACAGTT
##  M31Fcsw    M31Fcsw  ILBC_04        AAGAGA                   TCTCTT
##  M11Fcsw    M11Fcsw  ILBC_05        AAGCTG                   CAGCTT
##  ...            ...      ...           ...                      ...
##  TS28         TS28   ILBC_25        ACCAGA                   TCTGGT
##  TS29         TS29   ILBC_26        ACCAGC                   GCTGGT
##  Even1        Even1  ILBC_27        ACCGCA                   TGCGGT
##  Even2        Even2  ILBC_28        ACCTCG                   CGAGGT
##  Even3        Even3  ILBC_29        ACCTGT                   ACAGGT
##          Barcode_full_length SampleType
##                     <factor>   <factor>
##  CL3             CTAGCGTGCGT      Soil 
##  CC1             CATCGACGAGT      Soil 
##  SV1             GTACGCACAGT      Soil 
##  M31Fcsw         TCGACATCTCT      Feces
##  M11Fcsw         CGACTGCAGCT      Feces
##  ...                     ...        ...
##  TS28            GCATCGTCTGG      Feces
##  TS29            CTAGTCGCTGG      Feces
##  Even1           TGACTCTGCGG      Mock 
##  Even2           TCTGATCGAGG      Mock 
##  Even3           AGAGAGACAGG      Mock 
##                                         Description       sum  detected
##                                            <factor> <numeric> <numeric>
##  CL3     Calhoun South Carolina Pine soil, pH 4.9      864077      6964
##  CC1     Cedar Creek Minnesota, grassland, pH 6.1     1135457      7679
##  SV1     Sevilleta new Mexico, desert scrub, pH 8.3    697509      5729
##  M31Fcsw M3, Day 1, fecal swab, whole body study      1543451      2667
##  M11Fcsw M1, Day 1, fecal swab, whole body study      2076476      2574
##  ...                                            ...       ...       ...
##  TS28                                       Twin #1    937466      2679
##  TS29                                       Twin #2   1211071      2629
##  Even1                                      Even1     1216137      4213
##  Even2                                      Even2      971073      3130
##  Even3                                      Even3     1078241      2776
##              total
##          <numeric>
##  CL3        864077
##  CC1       1135457
##  SV1        697509
##  M31Fcsw   1543451
##  M11Fcsw   2076476
##  ...           ...
##  TS28       937466
##  TS29      1211071
##  Even1     1216137
##  Even2      971073
##  Even3     1078241

The distribution of calculated library sizes can be visualized as a histogram (left), or by sorting the samples by library size (right).

library(ggplot2)

p1 <- ggplot(colData(tse)) +
        geom_histogram(aes(x = sum), color = "black", fill = "gray", bins = 30) +
        labs(x = "Library size", y = "Frequency (n)") + 
        # scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x), 
        # labels = scales::trans_format("log10", scales::math_format(10^.x))) +
        theme_bw() +
        theme(panel.grid.major = element_blank(), # Removes the grid
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black")) # Adds y-axis

library(dplyr)
df <- as.data.frame(colData(tse)) %>%
        arrange(sum) %>%
        mutate(index = 1:n())
p2 <- ggplot(df, aes(y = index, x = sum/1e6)) +
        geom_point() +  
        labs(x = "Library size (million reads)", y = "Sample index") +  
        theme_bw() +
        theme(panel.grid.major = element_blank(), # Removes the grid
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black")) # Adds y-axis

library(patchwork)
p1 + p2

Library size distribution.

Library sizes other variables from colData can be visualized by using specified function called plotColData.

# Sort samples by read count, order the factor levels, and store back to tse as DataFrame
# TODO: plotColData could include an option for sorting samples based on colData variables
colData(tse) <- as.data.frame(colData(tse)) %>%
                 arrange(X.SampleID) %>%
             mutate(X.SampleID = factor(X.SampleID, levels=X.SampleID)) %>%
         DataFrame
plotColData(tse,"sum","X.SampleID", colour_by = "SampleType") + 
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    labs(y = "Library size (N)", x = "Sample ID")       

Library sizes per sample.
plotColData(tse,"sum","SampleType", colour_by = "SampleType") + 
    theme(axis.text.x = element_text(angle = 45, hjust=1))

Library sizes per sample type.

In addition, data can be rarefied with subsampleCounts, which normalises the samples to an equal number of reads. However, this practice has been discouraged for the analysis of differentially abundant microorganisms (see (McMurdie and Holmes 2014)).

McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Computational Biology 10 (4): e1003531.

4.3.3 Contaminant sequences

Samples might be contaminated with exogenous sequences. The impact of each contaminant can be estimated based on their frequencies and concentrations across the samples.

The following decontam functions are based on the (Davis et al. 2018) and support such functionality:

Davis, Nicole M, Diana M Proctor, Susan P Holmes, David A Relman, and Benjamin J Callahan. 2018. “Simple Statistical Identification and Removal of Contaminant Sequences in Marker-Gene and Metagenomics Data.” Microbiome 6 (1): 1–14.
  • isContaminant, isNotContaminant
  • addContaminantQC, addNotContaminantQC
Back to top