9  Exploration & quality control

This chapter focuses on the quality control (QC) and exploration of microbiome data and establishes commonly used descriptive summaries. Familiarizing yourself with the peculiarities of a given dataset is essential for data analysis and model building to make justified conclusions.

The dataset should not suffer from severe technical biases, and you should at least be aware of potential challenges, such as outliers, biases, unexpected patterns, and so forth. Standard summaries and visualizations can help, and the rest comes with experience. Moreover, exploration and QC often entail iterative processes.

There are available guidelines for QC, for instance, (Zuur, Ieno, and Elphick 2010). However, it should be noted that one should not follow any protocols strictly but rather use them as a template and customize them to one’s dataset. The goal of QC is not only to improve quality of data, but also to understand it and its limitations. Poor data quality leads to poor results which is illustrated by a common expression in computer science:

Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. 2010. “A Protocol for Data Exploration to Avoid Common Statistical Problems.” Methods in Ecology and Evolution 1 (1): 3–14. https://doi.org/https://doi.org/10.1111/j.2041-210X.2009.00001.x.

“Garbage in, garbage out”

Below we download a dataset from microbiomeDataSets. The dataset contains 16S data from the gut microbiome of baboons.

library(microbiomeDataSets)
tse <- baboongut()

9.1 Summarize data

When you first get your hands dirty with the data, the first step is to summarize it and get a sense of what kind of data you’re dealing with. Printing TreeSE already provides useful information about the dataset’s dimensions, such as the number of samples and taxa.

tse
##  class: TreeSummarizedExperiment 
##  dim: 613 16234 
##  metadata(0):
##  assays(1): counts
##  rownames(613): asv_10002 asv_10020 ... asv_9990 asv_9991
##  rowData names(7): Domain Phylum ... Genus ASV
##  colnames(16234): sample_11410-AACTCCTGTGGA-396
##    sample_11413-CGGCCTAAGTTC-397 ... sample_11413-ACCCTCAGCCCA-397
##    sample_12049-TCAGCAAATGGT-407
##  colData names(35): sample baboon_id ... pc4_bc pc5_bc
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (613 rows)
##  rowTree: 1 phylo tree(s) (613 leaves)
##  colLinks: NULL
##  colTree: NULL
##  referenceSeq: a DNAStringSet (613 sequences)

The dataset includes 613 taxa identified from 16234 samples. mia provides the summary() function for TreeSE objects which returns the summary of counts for all samples and features including measures of central tendency.

library(mia)

# Calculate summary tables
summary(tse, assay.type = "counts")
##  $samples
##  # A tibble: 1 × 6
##    total_counts min_counts max_counts median_counts mean_counts stdev_counts
##           <dbl>      <dbl>      <dbl>         <dbl>       <dbl>        <dbl>
##  1    787665544        874     467725         47009      48519.       19089.
##  
##  $features
##  # A tibble: 1 × 3
##    total singletons per_sample_avg
##    <int>      <int>          <dbl>
##  1   613          0           209.

The returned tables show that samples exhibit lots of variance in library sizes. Moreover, we can observe that there are no singletons in the dataset (library size and singletons are discussed in more detail in Section 9.2).

Another type of summary can be generated using the summarizeDominance() function. This function returns a table displaying both the absolute and relative abundance of each taxon — that is, how many times a taxon was detected in the dataset and the proportion of samples in which it was identified. Below, we create a summary table for genera.

df <- summarizeDominance(tse, rank = "Genus")
df
##  # A tibble: 43 × 3
##    dominant_taxa                   n rel_freq
##    <chr>                       <int>    <dbl>
##  1 Bifidobacterium             11160  0.687  
##  2 Prevotella 9                 2141  0.132  
##  3 Rikenellaceae RC9 gut group  1555  0.0958 
##  4 Candidatus Methanogranum      460  0.0283 
##  5 Megasphaera                   246  0.0152 
##  6 CAG-873                       138  0.00850
##  # ℹ 37 more rows

Based on the summary table, Bifidobacterium seems to be highly presented in the baboon gut.

mia also provides other functions to summarize the dataset. For example, you can retrieve unique, top, prevalent, or rare taxa using getUnique(), getTop(), getPrevalent(), and getRare(), respectively.

Let’s first check which phyla are included in the dataset.

uniq <- getUnique(tse, rank = "Phylum")
uniq
##   [1] <NA>               Tenericutes        Firmicutes        
##   [4] Lentisphaerae      Actinobacteria     Elusimicrobia     
##   [7] Kiritimatiellaeota Euryarchaeota      Bacteroidetes     
##  [10] Spirochaetes       Proteobacteria     Epsilonbacteraeota
##  [13] Cyanobacteria     
##  42 Levels: Acidobacteria Actinobacteria Aquificae ... WS4

There are 13 phyla present in the dataset.

Next, we might be interested on assessing which features are the most abundance. This can be done by utilizing getTop() function. Below, we pick top 10 taxa which is determined based on median abundance.

# Pick the top taxa
top_features <- getTop(tse, method = "median", top = 10)
top_features
##   [1] "asv_8243" "asv_1543" "asv_1874" "asv_9331" "asv_4629" "asv_3018"
##   [7] "asv_2919" "asv_3109" "asv_9212" "asv_7528"

These ten features, have the highest median abundance across all samples. getPrevalent() differs from getTop() so that it retrieves which taxa exceeds the specified prevalence and detection thresholds. Here we determine prevalent genera.

prev <- getPrevalent(tse, rank = "Genus", prevalence = 0.2, detection = 0)
prev |> head()
##  [1] "Acidaminococcus"   "Alloprevotella"    "Anaerosporobacter"
##  [4] "Bifidobacterium"   "Brachyspira"       "Butyricicoccus"

Out of 92 genera, 78 of them are detected to be prevalent, meaning they are found in a sufficient number of samples with sufficiently high abundance.

getRare() function is counterpart of getPrevalent(). It returns those taxa that do not exceed the thresholds.

rare <- getRare(tse, rank = "Genus", prevalence = 0.2, detection = 0)
rare |> head()
##  [1] "Acetitomaculum"               "Acholeplasma"                
##  [3] "Candidatus Soleaferrea"       "Catenibacterium"             
##  [5] "Cellulosilyticum"             "Lachnospiraceae NC2004 group"

It returns 14 taxa which is expected as it should match with number of genera after subtracting the prevalent ones.

9.2 Outliers, singletons and contaminants

Outliers are observations that deviate significantly from the rest of the data. Researchers should be cautious when handling outliers, as what appears to be an outlier may actually result from a valid biological mechanism. This is especially relevant in the microbiome field, where sequenced counts often exhibit seemingly irregular variation. In such cases, transformations (see Chapter 12) are typically the preferred approach for handling these observations. However, if an observation or sample is significantly affected by measurement error, removing it may be a reasonable option. The approach to dealing with outliers in each dataset requires careful consideration by the researcher.

Singletons are sequences that appears only once in a dataset, meaning it is observed in just a single sequencing read. Often rare features are removed from the data as they usually represent sequencing artifacts. See Section 10.6 and Section 11.3 for more details on prevalence filtering and agglomeration.

9.2.1 Library size

Library size refers to the total number of counts found in a single sample. The returned tables in Section 9.1 showed that samples exhibit lots of variance in library sizes. We can then visualize the library sizes. The total counts per sample can be calculated using addPerCellQC() from the scater package.

library(scater)

# Calculate and add total counts to colData
tse <- addPerCellQC(tse)

The results are stored in colData. We can then visualize these results with a violin plot and histogram.

library(miaViz)
library(patchwork)

p1 <- plotColData(tse, x = "sex", y = "total", colour_by = "age")
p2 <- plotHistogram(tse, col.var = "total")

p1 + p2

The distribution of library size is right-skewed. Most of the samples follow normal distribution while some of the samples deviates from this. This might be caused by technical variations or biological factors.

Nevertheless, the most important check is to ensure that the sampling depth is sufficient. In case of insufficient sampling depth, one might consider filtering the data based on library size (Section 10.4).

To control uneven sampling depths, one should apply data transformation or apply rarefaction. These both approaches are discussed in Chapter 12.

9.2.2 Contaminant sequences

Samples might be contaminated with exogenous sequences. The impact of each contaminant can be estimated based on its frequencies and concentrations across the samples.

The following decontam functions are based on (Davis et al. 2018) and support such functionality:

Davis, Nicole M, Diana M Proctor, Susan P Holmes, David A Relman, and Benjamin J Callahan. 2018. “Simple Statistical Identification and Removal of Contaminant Sequences in Marker-Gene and Metagenomics Data.” Microbiome 6 (1): 1–14.

Contaminations can be detected using two main approaches: frequency-based and prevalence-based testing. In frequency-based testing, the user must provide the DNA concentration of samples. The abundance of features is then compared to the DNA concentration, as contaminants are expected to show an inverse relationship — they make up a larger fraction in low-DNA samples and smaller fraction in high-DNA samples.

In the prevalence-based approach, sequence prevalence is compared between true biological samples and control samples. This method assumes that contaminants are more prevalent in negative controls, as these lack true biological DNA and primarily contain background noise from contamination sources.

The dataset contains DNA concentration recorded in the “post_pcr_dna_ng” column in the sample metadata. This information can be used for frequency-based contamination identification.

library(knitr)

colData(tse) |> head() |> kable()
sample baboon_id collection_date sex age social_group group_size rain_month_mm season hydro_year month readcount plate post_pcr_dna_ng diet_PC1 diet_PC2 diet_PC3 diet_PC4 diet_PC5 diet_PC6 diet_PC7 diet_PC8 diet_PC9 diet_PC10 diet_PC11 diet_PC12 diet_PC13 diet_shannon_h asv_richness asv_shannon_h pc1_bc pc2_bc pc3_bc pc4_bc pc5_bc sum detected total
sample_11410-AACTCCTGTGGA-396 sample_11410-AACTCCTGTGGA-396 Baboon_1 2010-08-14 M 8.375 g_2.1 74 0.0 dry 2010 8 26140 129 78.98 44.245 13.304 -3.182 1.639 -1.5291 1.2958 1.957 0.840 0.8753 0.2847 -0.4202 0.4844 -0.0784 0.7761 179 2.972 -0.0649 0.2813 -0.0912 -0.0204 0.0731 25202 159 25202
sample_11413-CGGCCTAAGTTC-397 sample_11413-CGGCCTAAGTTC-397 Baboon_1 2010-05-14 M 8.123 g_2.1 72 44.6 wet 2010 5 11186 9 49.70 -1.621 -3.402 -16.786 8.296 -0.4551 -0.4667 1.290 -5.776 0.9291 -1.7560 0.1513 -1.9706 -5.9968 1.6158 60 1.574 0.3312 0.2543 -0.1423 0.2182 -0.2007 10975 58 10975
sample_11409-CAGAAGGTGTGG-395 sample_11409-CAGAAGGTGTGG-395 Baboon_1 2011-01-08 M 8.778 g_2.1 80 12.2 wet 2011 1 30532 196 59.16 -34.954 -8.627 -5.499 -9.029 -1.9215 9.7852 13.418 10.094 -10.1494 -3.0103 3.0000 0.2265 0.1694 1.8709 164 3.045 0.1434 0.2225 0.0493 0.0680 -0.0521 29799 151 29799
sample_11412-AAGGCCTTTACG-397 sample_11412-AAGGCCTTTACG-397 Baboon_1 2010-11-18 M 8.638 g_2.1 75 70.6 wet 2011 11 35855 198 46.91 -14.331 5.290 9.872 12.223 -1.3256 1.4292 1.043 -1.948 -1.7495 -0.4891 -3.5910 0.2045 0.0381 1.4014 113 2.485 0.1796 0.2159 0.0269 -0.1299 -0.0282 35143 107 35143
sample_11409-GTTGCTGAGTCC-395 sample_11409-GTTGCTGAGTCC-395 Baboon_1 2011-01-22 M 8.816 g_2.1 81 0.0 wet 2011 1 49800 146 50.14 -6.118 -8.050 -7.833 -8.504 -8.9970 4.1727 4.685 12.025 -9.4630 0.8942 2.1969 0.4153 0.1357 1.9947 201 3.317 0.0630 0.2135 0.0708 0.0638 -0.0419 48532 189 48532
sample_11408-GGTCTTAGCACC-395 sample_11408-GGTCTTAGCACC-395 Baboon_1 2010-10-14 M 8.542 g_2.1 76 0.0 dry 2010 10 45778 194 46.91 15.927 2.057 5.217 4.699 -5.0800 0.4955 2.755 -3.277 -2.1587 -0.2728 1.5095 -1.5876 0.3703 1.4930 244 3.783 0.0027 0.1412 0.1576 0.0705 0.0437 43298 217 43298

Now we can detect contaminant sequences. We run addContaminantQC() which adds results to rowData.

tse <- addContaminantQC(tse, concentration = "post_pcr_dna_ng")

rowData(tse) |> head() |> kable()
Domain Phylum Class Order Family Genus ASV freq prev p.freq p.prev p contaminant
asv_10002 Bacteria NA NA NA NA NA asv_10002 0.0003 3355 0.6914 NA 0.6914 FALSE
asv_10020 Bacteria Tenericutes Mollicutes Mollicutes RF39 NA NA asv_10020 0.0001 2511 0.7099 NA 0.7099 FALSE
asv_10027 Bacteria Tenericutes Mollicutes Mollicutes RF39 NA NA asv_10027 0.0001 2835 0.7766 NA 0.7766 FALSE
asv_10036 Bacteria Tenericutes Mollicutes Mollicutes RF39 NA NA asv_10036 0.0016 8633 0.7051 NA 0.7051 FALSE
asv_10052 NA NA NA NA NA NA asv_10052 0.0001 2833 0.7341 NA 0.7341 FALSE
asv_10072 Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae Ruminiclostridium 6 asv_10072 0.0001 2138 0.6394 NA 0.6394 FALSE

The method performs statistical tests to identify contaminants. By default, it uses a 0.1 probability threshold for identifying contaminants, which can be adjusted as needed. In this case, 0% of sequences were detected to be contaminants. We can then filter out these features from the data.

tse <- tse[ !rowData(tse)[["contaminant"]], ]

As an example, we also demonstrate how to apply the prevalence-based approach. Note that the data used here is arbitrary, and in practice, you should use real control sample information.

First, we add arbitrary control samples:

control <- rep(FALSE, ncol(tse))
control[ sample(seq_len(ncol(tse)), 1) ] <- TRUE
tse[["control"]] <- control

Next, we perform the analysis. Note that we could have applied both frequency-based and prevalence-based methods simultaneously by specifying the method parameter in the previous step.

not_contaminant <- isNotContaminant(tse, control = "control", detailed = FALSE)
not_contaminant |> head()
##  [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

To identify non-contaminant sequences with prevalence approach, a threshold of 0.5 is used, by default. As noted in the previous step, the add*() functions add results to rowData. Here, we used function that return only the results. By specifying detailed = FALSE, we obtain the results as a vector, which can then be used for subsetting the data.

9.3 Data distribution

Microbiome counts data is rarely normally distributed (see Section 12.1). However, many common statistical tests assume normality and using them while violating the assumptions might lead to incorrect conclusions. While several tests for normality exist — such as Shapiro-Wilk test — they do not replace visual observation.

plotHistogram() function provides easy way to visualize the counts distribution with a histogram.

plotHistogram(tse, assay.type = "counts")

The microbiome data is typically zero-inflated, meaning that there are lots of zeroes. Same method can also be used to visualize continuous variables from colData. For categorical values, one can utilize plotBarplot().

p1 <- plotHistogram(tse, col.var = "age") +
    labs(x = "Age")
p2 <- plotBarplot(tse, col.var = "sex") +
    labs(x = "Sex")

p1 + p2

9.3.1 Abundance

Abundance visualization is an important data exploration approach. plotAbundanceDensity() function generates a plot to visualize the most abundant taxa along with several options.

Next, a few demonstrations are shown using the (Lahti et al. 2014) dataset. A jitter plot based on relative abundance data, similar to the one presented at (Salosensaari et al. 2021) (Supplementary Fig.1), can be visualized as follows:

Lahti, L, JSalojärvi, A Salonen, M Scheffer, and WM de Vos. 2014. “Tipping Elements in the Human Intestinal Ecosystem.” Nature Communications 2014: 1–10. https://doi.org/https://doi.org/10.1038/ncomms5344.
Salosensaari, Aaro, Ville Laitinen, Aki Havulinna, Guillaume Méric, Susan Cheng, Markus Perola, Liisa Valsta, et al. 2021. “Taxonomic Signatures of Cause-Specific Mortality Risk in Human Gut Microbiome.” Nature Communications 12: 1–8. https://www.nature.com/articles/s41467-021-22962-y.
# Add relative abundances
tse <- transformAssay(tse, method = "relabundance")

plotAbundanceDensity(
    tse, layout = "jitter",
    assay.type = "relabundance",
    n = 40, point.size=1, point.shape=19,
    point.alpha = 0.1) +
    scale_x_log10(label=scales::percent)

The relative abundance values for the top-5 taxonomic features can be visualized as a density plot over a log-scaled axis, with “sex” indicated by colors:

plotAbundanceDensity(
    tse, layout = "density",
    assay.type = "relabundance",
    n = 5, colour.by = "sex",
    point.alpha = 0.1) +
    scale_x_log10()

Alternatively, scater::plotExpression() can be used to visualize taxa with a violin plot. Below, we visualize top-10 features, selected based on their mean abundance.

# Select top features
top <- getTop(tse, top = 10L, method = "mean")

plotExpression(
    tse,
    features = top,
    x = "sex",
    assay.type = "relabundance",
    point_alpha = 0.01
    ) +
    scale_y_log10()

9.3.2 Prevalence

Prevalence quantifies the frequency of samples where certain microbes were detected (above a given detection threshold). Prevalence can be given as sample size (N) or percentage (unit interval).

Investigating prevalence allows you either to focus on changes which pertain to the majority of the samples, or identify rare microbes, which may be conditionally abundant in a small number of samples.

We can plot histogram of prevalence of taxa. This would tell us whether there are many taxa present in most of the samples or are there mostly rare taxa. The population prevalence (frequency) at a 0.01% relative abundance threshold (detection = 0.1/100 and as.relative = TRUE) can look like this.

# Add prevalence of features
tse <- addPrevalence(tse, detection = 0.1/100, as.relative = TRUE)
# Plot them with a histogram
plotHistogram(tse, row.var = "prevalence")

Most of the taxa are present only in minority of samples with specified abundance threshold. Similar conclusion can be made with visualization generated with plotPrevalentAbundance() or plotRowPrevalence() functions.

p1 <- plotPrevalentAbundance(tse, as.relative = TRUE)
# Remove y axis text as there are so many features that one cannot read them
p2 <- plotRowPrevalence(tse, as.relative = TRUE) +
    theme(
       axis.text.y=element_blank(), 
       axis.ticks.y=element_blank()
    )

p1 + p2

The plots above shows that most of the taxa are present with low abundance. However, there are couple taxa that are both high abundant and prevalent in the dataset.

The figures also show the overall trend that most microbes are low in abundance and occur in a limited number of samples. This is a typical pattern in microbiome datasets, and can further visualized as follows.

plotPrevalence(tse, as.relative = TRUE)

The plot shows the relationship between microbial relative abundance and prevalence across samples. Taxa with higher prevalence (yellow) tend to have lower relative abundance, while less prevalent taxa (purple) can still reach higher abundances in some cases.

To analyze how present the core taxa are in samples, we can calculate how large proportions the core taxa present in each sample. We can then visualize this distribution again with a histogram.

# Calculate the proportion of core taxa
tse <- addPrevalentAbundance(tse, prevalence = 50/100, detection = 0.1/100)
# Visualize
plotHistogram(tse, col.var = "prevalent_abundance")

In most of the samples, the core taxa represent over 3/4 of species. However, there are some samples where this proportion is much lower.

9.4 Colinearity and independence

Microbial species are rarely independent; rather, they influence each other’s abundance through complex networks of competition and symbiosis. Colinearity occurs when information from one variable is already included in some other variable. Modeling variables that exhibit collinearity can lead to issues such as reduced interpretability, overfitting and incorrect estimations.

Collinearity of variables can be assessed, for instance, with correlation heatmaps (see Chapter 18) or with scatter plots.

Exercises

Goal: The goal is to learn relevant function in quality control and initial exploration of the data.

Exercise 1: QC and exploration

  1. Load any of the example datasets mentioned in Section 4.2.

  2. Summarize the counts with with histogram.

  3. Add prevalence of taxa to rowData.

  4. Visualize the prevalence distribution with histogram. Does the data include many prevalent taxa or are they found only in small part of samples?

  5. Add library sizes to colData.

  6. Visualize the library size distribution with histogram. Does the sampling depth differ a lot?

  7. Visualize categorical values in colData with a bar plot.

  8. Get the available taxonomy ranks in the data.

  9. Calculate a table that summarizes the dominance of genera or any other rank. Which taxa is present in the highest number of samples? In how many samples it is present? What percentage of samples the number corresponds to?

  10. Get the most prevalent features in specific taxonomy rank. Use counts table, and set prevalence and detection threshold to 20% and 1, respectively.

  11. Get the most abundant features based on median abundance. How this differs from prevalent features?

  12. Visualize the most prevalent features.

Useful functions:

data(), plotHistogram(), addPrevalence(), rowData(), addPerCellQCMetrics(), colData(), plotBarplot(), taxonomyRanks(), summarizeDominance(), getPrevalent(), getTop, plotAbundanceDensity()

Back to top