11 Exploration & quality control
This chapter focuses on the quality control and exploration of microbiome data and establishes commonly used descriptive summaries. Familiarizing yourself with the peculiarities of a given dataset is essential for data analysis and model building.
The dataset should not suffer from severe technical biases, and you should at least be aware of potential challenges, such as outliers, biases, unexpected patterns, and so forth. Standard summaries and visualizations can help, and the rest comes with experience. Moreover, exploration and quality control often entail iterative processes.
11.1 Abundance
Abundance visualization is an important data exploration approach. miaViz
integrated the plotAbundanceDensity()
function to plot the most abundant taxa along with several options.
Next, a few demonstrations are shown using the (Lahti et al. 2014) dataset. A jitter plot based on relative abundance data, similar to the one presented at (Salosensaari et al. 2021) (Supplementary Fig.1), can be visualized as follows:
# Load example data
library(miaTime)
library(miaViz)
data(hitchip1006)
tse <- hitchip1006
# Add relative abundances
tse <- transformAssay(tse, MARGIN = "cols", method = "relabundance")
# Use argument names
# assay.type / assay.type / assay.type
# depending on the mia package version
plotAbundanceDensity(
tse, layout = "jitter",
assay.type = "relabundance",
n = 40, point_size=1, point.shape=19,
point.alpha=0.1) +
scale_x_log10(label=scales::percent)
The relative abundance values for the top-5 taxonomic features can be visualized as a density plot over a log-scaled axis, with “nationality” indicated by colors:
plotAbundanceDensity(
tse, layout = "density",
assay.type = "relabundance",
n = 5, colour.by="nationality",
point.alpha=1/10) +
scale_x_log10()
11.2 Prevalence
Prevalence quantifies the frequency of samples where certain microbes were detected (above a given detection threshold). Prevalence can be given as sample size (N) or percentage (unit interval).
Investigating prevalence allows you either to focus on changes which pertain to the majority of the samples, or identify rare microbes, which may be conditionally abundant in a small number of samples.
The population prevalence (frequency) at a 1% relative abundance threshold (detection = 1/100
and as.relative = TRUE
) can look like this.
getPrevalence(tse, detection = 1/100, sort = TRUE, as.relative = TRUE) |>
head()
## Faecalibacterium prausnitzii et rel. Ruminococcus obeum et rel.
## 0.9522 0.9140
## Oscillospira guillermondii et rel. Clostridium symbiosum et rel.
## 0.8801 0.8714
## Subdoligranulum variable at rel. Clostridium orbiscindens et rel.
## 0.8358 0.8315
The function arguments detection
and as.relative
can also be used to access how many samples pass a threshold for raw counts. Here, the population prevalence (frequency) at the absolute abundance threshold (as.relative = FALSE
) with a read count of 1 (detection = 1
) is accessed.
getPrevalence(
tse, detection = 1, sort = TRUE, assay.type = "counts",
as.relative = FALSE) |>
head()
## Uncultured Mollicutes Uncultured Clostridiales II
## 1 1
## Uncultured Clostridiales I Tannerella et rel.
## 1 1
## Sutterella wadsworthia et rel. Subdoligranulum variable at rel.
## 1 1
If the output is to be used for subsetting or storing the data in the rowData
, set sort = FALSE
.
11.2.1 Prevalence analysis
To investigate microbiome prevalence at a selected taxonomic level, two approaches are available.
First, the data can be agglomerated to the taxonomic level and then getPrevalence()
can be applied to the resulting object.
# Agglomerate taxa abundances to Phylum level
#and add the new table to the altExp slot
altExp(tse,"Phylum") <- agglomerateByRank(tse, "Phylum")
# Check prevalence for the Phylum abundance table
#from the altExp slot
getPrevalence(
altExp(tse,"Phylum"), detection = 1/100, sort = TRUE,
assay.type = "counts", as.relative = TRUE) |>
head()
## Firmicutes Bacteroidetes Actinobacteria Proteobacteria
## 1.0000000 0.9852302 0.4821894 0.2988705
## Verrucomicrobia Fusobacteria
## 0.1277150 0.0008688
Alternatively, the rank
argument can be set to perform the agglomeration on the fly.
getPrevalence(
tse, rank = "Phylum", detection = 1/100, sort = TRUE,
assay.type = "counts", as.relative = TRUE) |>
head()
## Firmicutes Bacteroidetes Actinobacteria Proteobacteria
## 1.0000000 0.9852302 0.4821894 0.2988705
## Verrucomicrobia Fusobacteria
## 0.1277150 0.0008688
By default, na.rm = TRUE
is used for agglomeration in getPrevalence()
, whereas the default for agglomerateByRank()
is FALSE
to prevent accidental data loss.
If you only need the names of the prevalent taxa, getPrevalent()
is available. This returns the taxa that exceed the given prevalence and detection thresholds.
getPrevalent(tse, detection = 0, prevalence = 50/100)
prev <- getPrevalent(
tse, detection = 0, prevalence = 50/100, rank = "Phylum", sort = TRUE)
prev
The detection
and prevalence
thresholds are not the same, as detection
can be applied to relative counts or absolute counts depending on whether as.relative
is set TRUE
or FALSE
The function getPrevalentAbundance()
can be used to check the total relative abundance of the prevalent taxa (between 0 and 1).
11.2.2 Rare taxa
Related functions are available for the analysis of rare taxa (rare_members()
, rare_abundance()
, low_abundance()
, getRare()
, subsetByRare()
).
11.2.3 Plotting prevalence
To plot the prevalence, add the prevalence of each taxon to rowData
. Here, we are analyzing the Phylum-level abundances, which are stored in the altExp
slot.
rowData(altExp(tse,"Phylum"))$prevalence <- getPrevalence(
altExp(tse,"Phylum"), detection = 1/100,
sort = FALSE,
assay.type = "counts", as.relative = TRUE)
The prevalences can then be plotted using the plotting functions from the scater
package.
library(scater)
plotRowData(altExp(tse,"Phylum"), "prevalence", colour_by = "Phylum")
The prevalence can also be visualized on the taxonomic tree with the miaViz
package.
tse <- agglomerateByRanks(tse)
altExps(tse) <- lapply(
altExps(tse), function(y){
rowData(y)$prevalence <- getPrevalence(
y, detection = 1/100,
sort = FALSE,
assay.type = "counts",
as.relative = TRUE)
return(y)
})
top_phyla <- getTop(
altExp(tse,"Phylum"),
method="prevalence",
top=5L,
assay.type="counts")
top_phyla_mean <- getTop(
altExp(tse,"Phylum"),
method="mean",
top=5L,
assay.type="counts")
x <- unsplitByRanks(tse, ranks = taxonomyRanks(tse)[1:6])
x <- addHierarchyTree(x)
After some preparation, the data are assembled and can be plotted with plotRowTree()
.
library(miaViz)
# Filter rows where Phylum is in top_phyla
selected <- rowData(x)$Phylum %in% top_phyla
# Plot the data
plotRowTree(
x[selected, ],
edge.colour.by = "Phylum",
tip.colour.by = "prevalence",
node.colour.by = "prevalence")
# Filter for Phylum in top_phyla_mean
selected <- rowData(x)$Phylum %in% top_phyla_mean
# Plot the data
plotRowTree(
x[selected, ],
edge.colour.by = "Phylum",
tip.colour.by = "prevalence",
node.colour.by = "prevalence")
11.3 Quality control
Next, let’s load the GlobalPatterns
dataset to illustrate standard microbiome data summaries.
11.3.1 Top taxa
The function getTop()
identifies top taxa in the data.
# Pick the top taxa
top_features <- getTop(tse, method="median", top=10)
# Check the information for these
rowData(tse)[top_features, taxonomyRanks(tse)][1:5, 1:3]
## DataFrame with 5 rows and 3 columns
## Kingdom Phylum Class
## <character> <character> <character>
## 549656 Bacteria Cyanobacteria Chloroplast
## 331820 Bacteria Bacteroidetes Bacteroidia
## 317182 Bacteria Cyanobacteria Chloroplast
## 94166 Bacteria Proteobacteria Gammaproteobacteria
## 279599 Bacteria Cyanobacteria Nostocophycideae
11.3.2 Unique taxa
The function getUnique
returns unique taxa in the data.
# Get unique taxa at rank Class
getUnique(tse, "Class", sort = TRUE)
## [1] "0319-6G9" "09D2Y74" "12-24"
## [4] "4C0d-2" "5B-18" "A712011"
## [7] "AB16" "ABS-6" "Acidobacteria"
## [10] "Acidobacteria-5" "Actinobacteria" "Alphaproteobacteria"
## [13] "Anaerolineae" "Armatimonadia" "BB34"
## [16] "BD4-9" "BPC102" "BSV19"
## [19] "Bacilli" "Bacteroidia" "Betaproteobacteria"
## [22] "Bljii12" "C2" "C6"
## [25] "CH21" "Caldisericia" "Caldithrixae"
## [28] "Chlamydiae" "Chloracidobacteria" "Chlorobia"
## [31] "Chloroflexi" "Chloroflexi-4" "Chloroplast"
## [34] "Chthonomonadetes" "Clostridia" "Dehalococcoidetes"
## [37] "Deinococci" "Deltaproteobacteria" "Elusimicrobia"
## [40] "Endomicrobia" "Epsilonproteobacteria" "Erysipelotrichi"
## [43] "FFCH393" "FFCH6980" "Fibrobacteres"
## [46] "Flavobacteria" "Fusobacteria" "GKS2-174"
## [49] "GN07" "GN08" "GN13"
## [52] "GN15" "GW-22" "Gammaproteobacteria"
## [55] "Gemmatimonadetes" "Halobacteria" "Holophagae"
## [58] "Ignavibacteria" "JG37-AG-4" "JS1"
## [61] "KSB3" "Koll-18" "Ktedonobacteria"
## [64] "Kueneniae" "Lentisphaerae" "Leptospirae"
## [67] "MJK10" "ML615J-28" "MSB-5A5"
## [70] "MSB-5B5" "MVS-40" "Methanobacteria"
## [73] "Methanomicrobia" "Methylacidiphilae" "Mollicutes"
## [76] "Nitrospira" "Nostocophycideae" "ODP123"
## [79] "OP11-2" "OP11-3" "OP8_1"
## [82] "OP8_2" "OPB56" "OS-K"
## [85] "Opitutae" "Oscillatoriophycideae" "PAUC37f"
## [88] "PBS-25" "PRR-11" "PRR-12"
## [91] "PW285" "Phycisphaerae" "Planctomycea"
## [94] "R76-B18" "RA13C7" "RB25"
## [97] "RB384" "S15B-MN24" "S1a-1H"
## [100] "SBRH58" "SHA-114" "SJA-176"
## [103] "SJA-28" "SJA-4" "SM1B09"
## [106] "SOGA31" "Sd-NA" "Solibacteres"
## [109] "Spartobacteria" "Sphingobacteria" "Spirochaetes"
## [112] "Sva0725" "Synechococcophycideae" "Synergistia"
## [115] "TG3-1" "TG3-2" "TK-SH13"
## [118] "TK17" "TM7-1" "TM7-3"
## [121] "TP21" "Thaumarchaeota" "Thermomicrobia"
## [124] "Thermoplasmata" "Thermoprotei" "Thermotogae"
## [127] "Ucn15732" "VC12-cl04" "VHS-B5-50"
## [130] "Verruco-5" "Verrucomicrobiae" "WM88"
## [133] "WWE1" "Zetaproteobacteria" "agg27"
## [136] "iii1-8" "koll11" "pMC2A209"
## [139] "vadinHA49" NA
11.3.3 Library size / read count
The total counts per sample can be calculated using perCellQCMetrics()
/addPerCellQC()
from the scater
package. The former one just calculates the values, whereas the latter directly adds them to colData
. mia provides the function summary
for SE objects which returns the summary of counts for all samples and features including measures of central tendency.
library(scater)
# Get an overview of sample and taxa counts
summary(tse, assay.type= "counts")
## $samples
## # A tibble: 1 × 6
## total_counts min_counts max_counts median_counts mean_counts stdev_counts
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28216678 58688 2357181 1106849 1085257. 650145.
##
## $features
## # A tibble: 1 × 3
## total singletons per_sample_avg
## <int> <int> <dbl>
## 1 19216 2134 4022.
# Calculate total counts per sample
perCellQCMetrics(tse)[1:5,]
## DataFrame with 5 rows and 3 columns
## sum detected total
## <numeric> <numeric> <numeric>
## CL3 864077 6964 864077
## CC1 1135457 7679 1135457
## SV1 697509 5729 697509
## M31Fcsw 1543451 2667 1543451
## M11Fcsw 2076476 2574 2076476
# Calculate and add total counts to `colData`
tse <- addPerCellQC(tse)
colData(tse)[1:5,1:3]
## DataFrame with 5 rows and 3 columns
## X.SampleID Primer Final_Barcode
## <factor> <factor> <factor>
## CL3 CL3 ILBC_01 AACGCA
## CC1 CC1 ILBC_02 AACTCG
## SV1 SV1 ILBC_03 AACTGT
## M31Fcsw M31Fcsw ILBC_04 AAGAGA
## M11Fcsw M11Fcsw ILBC_05 AAGCTG
The distribution of calculated library sizes can be visualized as a histogram (left) or by sorting the samples by library size (right).
library(ggplot2)
library(dplyr)
library(patchwork)
# Create a histrogram showing distribution of library sizes
p1 <- ggplot(colData(tse)) +
geom_histogram(aes(x = sum/1e6), color = "black", fill = "gray") +
labs(x = "Library size (million reads)", y = "Frequency (n)") +
theme_classic()
# Order data in increasing order
df <- as.data.frame(colData(tse)) |>
arrange(sum) |>
mutate(proportion = (1:n())/n())
# Create a scatter plot showing how library sizes are distributed
p2 <- ggplot(df, aes(y = proportion, x = sum/1e6)) +
geom_point() +
labs(x = "Library size (million reads)", y = "Proportion of samples") +
theme_classic()
p1 + p2
Library sizes and other variables from colData
can be visualized using a specified function called plotColData()
.
# Sort samples by read count,
# order the factor levels,
# and store back to tse as DataFrame
colData(tse) <- as.data.frame(colData(tse)) |>
arrange(X.SampleID) |>
mutate(X.SampleID = factor(X.SampleID, levels=X.SampleID)) %>%
DataFrame
plotColData(tse,"sum","X.SampleID", colour_by = "SampleType") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
labs(y = "Library size (N)", x = "Sample ID")
plotColData(tse,"sum","SampleType", colour_by = "SampleType") +
theme(axis.text.x = element_text(angle = 45, hjust=1))
In addition, data can be rarefied with rarefyAssay, which normalizes the samples to an equal number of reads. This remains controversial, however, and strategies to mitigate the information loss in rarefaction have been proposed (Schloss 2024a) (Schloss 2024b). Moreover, this practice has been discouraged for the analysis of differentially abundant microorganisms (see (McMurdie and Holmes 2014)).
11.3.4 Contaminant sequences
Samples might be contaminated with exogenous sequences. The impact of each contaminant can be estimated based on its frequencies and concentrations across the samples.
The following decontam functions are based on (Davis et al. 2018) and support such functionality:
-
isContaminant()
,isNotContaminant()
-
addContaminantQC()
,addNotContaminantQC()