11  Exploration & quality control

This chapter focuses on the quality control and exploration of microbiome data and establishes commonly used descriptive summaries. Familiarizing yourself with the peculiarities of a given dataset is essential for data analysis and model building.

The dataset should not suffer from severe technical biases, and you should at least be aware of potential challenges, such as outliers, biases, unexpected patterns, and so forth. Standard summaries and visualizations can help, and the rest comes with experience. Moreover, exploration and quality control often entail iterative processes.

11.1 Abundance

Abundance visualization is an important data exploration approach. miaViz integrated the plotAbundanceDensity() function to plot the most abundant taxa along with several options.

Next, a few demonstrations are shown using the (Lahti et al. 2014) dataset. A jitter plot based on relative abundance data, similar to the one presented at (Salosensaari et al. 2021) (Supplementary Fig.1), can be visualized as follows:

Lahti, L, JSalojärvi, A Salonen, M Scheffer, and WM de Vos. 2014. “Tipping Elements in the Human Intestinal Ecosystem.” Nature Communications 2014: 1–10. https://doi.org/https://doi.org/10.1038/ncomms5344.
Salosensaari, Aaro, Ville Laitinen, Aki Havulinna, Guillaume Méric, Susan Cheng, Markus Perola, Liisa Valsta, et al. 2021. “Taxonomic Signatures of Cause-Specific Mortality Risk in Human Gut Microbiome.” Nature Communications 12: 1–8. https://www.nature.com/articles/s41467-021-22962-y.
# Load example data
library(miaTime)
library(miaViz)
data(hitchip1006)
tse <- hitchip1006

# Add relative abundances
tse <- transformAssay(tse, MARGIN = "cols", method = "relabundance")

# Use argument names
# assay.type / assay.type / assay.type
# depending on the mia package version
plotAbundanceDensity(
    tse, layout = "jitter",
    assay.type = "relabundance",
    n = 40, point_size=1, point.shape=19,
    point.alpha=0.1) +
    scale_x_log10(label=scales::percent)

The relative abundance values for the top-5 taxonomic features can be visualized as a density plot over a log-scaled axis, with “nationality” indicated by colors:

plotAbundanceDensity(
    tse, layout = "density",
    assay.type = "relabundance",
    n = 5, colour.by="nationality",
    point.alpha=1/10) +
    scale_x_log10()

11.2 Prevalence

Prevalence quantifies the frequency of samples where certain microbes were detected (above a given detection threshold). Prevalence can be given as sample size (N) or percentage (unit interval).

Investigating prevalence allows you either to focus on changes which pertain to the majority of the samples, or identify rare microbes, which may be conditionally abundant in a small number of samples.

The population prevalence (frequency) at a 1% relative abundance threshold (detection = 1/100 and as.relative = TRUE) can look like this.

getPrevalence(tse, detection = 1/100, sort = TRUE, as.relative = TRUE) |>
    head()
##  Faecalibacterium prausnitzii et rel.           Ruminococcus obeum et rel. 
##                                0.9522                               0.9140 
##    Oscillospira guillermondii et rel.        Clostridium symbiosum et rel. 
##                                0.8801                               0.8714 
##      Subdoligranulum variable at rel.     Clostridium orbiscindens et rel. 
##                                0.8358                               0.8315

The function arguments detection and as.relative can also be used to access how many samples pass a threshold for raw counts. Here, the population prevalence (frequency) at the absolute abundance threshold (as.relative = FALSE) with a read count of 1 (detection = 1) is accessed.

getPrevalence(
    tse, detection = 1, sort = TRUE, assay.type = "counts",
    as.relative = FALSE) |>
    head()
##             Uncultured Mollicutes      Uncultured Clostridiales II 
##                                 1                                1 
##        Uncultured Clostridiales I               Tannerella et rel. 
##                                 1                                1 
##    Sutterella wadsworthia et rel. Subdoligranulum variable at rel. 
##                                 1                                1

If the output is to be used for subsetting or storing the data in the rowData, set sort = FALSE.

11.2.1 Prevalence analysis

To investigate microbiome prevalence at a selected taxonomic level, two approaches are available.

First, the data can be agglomerated to the taxonomic level and then getPrevalence() can be applied to the resulting object.

# Agglomerate taxa abundances to Phylum level
#and add the new table to the altExp slot
altExp(tse,"Phylum") <- agglomerateByRank(tse, "Phylum")
# Check prevalence for the Phylum abundance table
#from the altExp slot
getPrevalence(
    altExp(tse,"Phylum"), detection = 1/100, sort = TRUE,
    assay.type = "counts", as.relative = TRUE) |>
    head()
##       Firmicutes   Bacteroidetes  Actinobacteria  Proteobacteria 
##        1.0000000       0.9852302       0.4821894       0.2988705 
##  Verrucomicrobia    Fusobacteria 
##        0.1277150       0.0008688

Alternatively, the rank argument can be set to perform the agglomeration on the fly.

getPrevalence(
    tse, rank = "Phylum", detection = 1/100, sort = TRUE,
    assay.type = "counts", as.relative = TRUE) |>
    head()
##       Firmicutes   Bacteroidetes  Actinobacteria  Proteobacteria 
##        1.0000000       0.9852302       0.4821894       0.2988705 
##  Verrucomicrobia    Fusobacteria 
##        0.1277150       0.0008688
Note

By default, na.rm = TRUE is used for agglomeration in getPrevalence(), whereas the default for agglomerateByRank() is FALSE to prevent accidental data loss.

If you only need the names of the prevalent taxa, getPrevalent() is available. This returns the taxa that exceed the given prevalence and detection thresholds.

getPrevalent(tse, detection = 0, prevalence = 50/100)
prev <- getPrevalent(
    tse, detection = 0, prevalence = 50/100, rank = "Phylum", sort = TRUE)
prev
Note

The detection and prevalence thresholds are not the same, as detection can be applied to relative counts or absolute counts depending on whether as.relative is set TRUE or FALSE

The function getPrevalentAbundance() can be used to check the total relative abundance of the prevalent taxa (between 0 and 1).

11.2.2 Rare taxa

Related functions are available for the analysis of rare taxa (rare_members(), rare_abundance(), low_abundance(), getRare(), subsetByRare()).

11.2.3 Plotting prevalence

To plot the prevalence, add the prevalence of each taxon to rowData. Here, we are analyzing the Phylum-level abundances, which are stored in the altExp slot.

rowData(altExp(tse,"Phylum"))$prevalence <- getPrevalence(
    altExp(tse,"Phylum"), detection = 1/100,
    sort = FALSE,
    assay.type = "counts", as.relative = TRUE)

The prevalences can then be plotted using the plotting functions from the scater package.

library(scater)
plotRowData(altExp(tse,"Phylum"), "prevalence", colour_by = "Phylum")

The prevalence can also be visualized on the taxonomic tree with the miaViz package.

tse <- agglomerateByRanks(tse)
altExps(tse) <- lapply(
    altExps(tse), function(y){
        rowData(y)$prevalence <- getPrevalence(
            y, detection = 1/100,
            sort = FALSE,
            assay.type = "counts",
            as.relative = TRUE)
        return(y)
    })
top_phyla <- getTop(
    altExp(tse,"Phylum"),
    method="prevalence",
    top=5L,
    assay.type="counts")
top_phyla_mean <- getTop(
    altExp(tse,"Phylum"),
    method="mean",
    top=5L,
    assay.type="counts")

x <- unsplitByRanks(tse, ranks = taxonomyRanks(tse)[1:6])
x <- addHierarchyTree(x)

After some preparation, the data are assembled and can be plotted with plotRowTree().

library(miaViz)
# Filter rows where Phylum is in top_phyla
selected <- rowData(x)$Phylum %in% top_phyla

# Plot the data
plotRowTree(
    x[selected, ],
    edge.colour.by = "Phylum",
    tip.colour.by = "prevalence",
    node.colour.by = "prevalence")

Prevalence of top phyla as judged by prevalence
# Filter for Phylum in top_phyla_mean
selected <- rowData(x)$Phylum %in% top_phyla_mean

# Plot the data
plotRowTree(
    x[selected, ],
    edge.colour.by = "Phylum",
    tip.colour.by = "prevalence",
    node.colour.by = "prevalence")

Prevalence of top phyla as judged by mean abundance

11.3 Quality control

Next, let’s load the GlobalPatterns dataset to illustrate standard microbiome data summaries.

library(mia)
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns

11.3.1 Top taxa

The function getTop() identifies top taxa in the data.

# Pick the top taxa
top_features <- getTop(tse, method="median", top=10)

# Check the information for these
rowData(tse)[top_features, taxonomyRanks(tse)][1:5, 1:3]
##  DataFrame with 5 rows and 3 columns
##             Kingdom         Phylum               Class
##         <character>    <character>         <character>
##  549656    Bacteria  Cyanobacteria         Chloroplast
##  331820    Bacteria  Bacteroidetes         Bacteroidia
##  317182    Bacteria  Cyanobacteria         Chloroplast
##  94166     Bacteria Proteobacteria Gammaproteobacteria
##  279599    Bacteria  Cyanobacteria    Nostocophycideae

11.3.2 Unique taxa

The function getUnique returns unique taxa in the data.

# Get unique taxa at rank Class
getUnique(tse, "Class", sort = TRUE)
##    [1] "0319-6G9"              "09D2Y74"               "12-24"                
##    [4] "4C0d-2"                "5B-18"                 "A712011"              
##    [7] "AB16"                  "ABS-6"                 "Acidobacteria"        
##   [10] "Acidobacteria-5"       "Actinobacteria"        "Alphaproteobacteria"  
##   [13] "Anaerolineae"          "Armatimonadia"         "BB34"                 
##   [16] "BD4-9"                 "BPC102"                "BSV19"                
##   [19] "Bacilli"               "Bacteroidia"           "Betaproteobacteria"   
##   [22] "Bljii12"               "C2"                    "C6"                   
##   [25] "CH21"                  "Caldisericia"          "Caldithrixae"         
##   [28] "Chlamydiae"            "Chloracidobacteria"    "Chlorobia"            
##   [31] "Chloroflexi"           "Chloroflexi-4"         "Chloroplast"          
##   [34] "Chthonomonadetes"      "Clostridia"            "Dehalococcoidetes"    
##   [37] "Deinococci"            "Deltaproteobacteria"   "Elusimicrobia"        
##   [40] "Endomicrobia"          "Epsilonproteobacteria" "Erysipelotrichi"      
##   [43] "FFCH393"               "FFCH6980"              "Fibrobacteres"        
##   [46] "Flavobacteria"         "Fusobacteria"          "GKS2-174"             
##   [49] "GN07"                  "GN08"                  "GN13"                 
##   [52] "GN15"                  "GW-22"                 "Gammaproteobacteria"  
##   [55] "Gemmatimonadetes"      "Halobacteria"          "Holophagae"           
##   [58] "Ignavibacteria"        "JG37-AG-4"             "JS1"                  
##   [61] "KSB3"                  "Koll-18"               "Ktedonobacteria"      
##   [64] "Kueneniae"             "Lentisphaerae"         "Leptospirae"          
##   [67] "MJK10"                 "ML615J-28"             "MSB-5A5"              
##   [70] "MSB-5B5"               "MVS-40"                "Methanobacteria"      
##   [73] "Methanomicrobia"       "Methylacidiphilae"     "Mollicutes"           
##   [76] "Nitrospira"            "Nostocophycideae"      "ODP123"               
##   [79] "OP11-2"                "OP11-3"                "OP8_1"                
##   [82] "OP8_2"                 "OPB56"                 "OS-K"                 
##   [85] "Opitutae"              "Oscillatoriophycideae" "PAUC37f"              
##   [88] "PBS-25"                "PRR-11"                "PRR-12"               
##   [91] "PW285"                 "Phycisphaerae"         "Planctomycea"         
##   [94] "R76-B18"               "RA13C7"                "RB25"                 
##   [97] "RB384"                 "S15B-MN24"             "S1a-1H"               
##  [100] "SBRH58"                "SHA-114"               "SJA-176"              
##  [103] "SJA-28"                "SJA-4"                 "SM1B09"               
##  [106] "SOGA31"                "Sd-NA"                 "Solibacteres"         
##  [109] "Spartobacteria"        "Sphingobacteria"       "Spirochaetes"         
##  [112] "Sva0725"               "Synechococcophycideae" "Synergistia"          
##  [115] "TG3-1"                 "TG3-2"                 "TK-SH13"              
##  [118] "TK17"                  "TM7-1"                 "TM7-3"                
##  [121] "TP21"                  "Thaumarchaeota"        "Thermomicrobia"       
##  [124] "Thermoplasmata"        "Thermoprotei"          "Thermotogae"          
##  [127] "Ucn15732"              "VC12-cl04"             "VHS-B5-50"            
##  [130] "Verruco-5"             "Verrucomicrobiae"      "WM88"                 
##  [133] "WWE1"                  "Zetaproteobacteria"    "agg27"                
##  [136] "iii1-8"                "koll11"                "pMC2A209"             
##  [139] "vadinHA49"             NA

11.3.3 Library size / read count

The total counts per sample can be calculated using perCellQCMetrics()/addPerCellQC() from the scater package. The former one just calculates the values, whereas the latter directly adds them to colData. mia provides the function summary for SE objects which returns the summary of counts for all samples and features including measures of central tendency.

library(scater)
# Get an overview of sample and taxa counts
summary(tse, assay.type= "counts")
##  $samples
##  # A tibble: 1 × 6
##    total_counts min_counts max_counts median_counts mean_counts stdev_counts
##           <dbl>      <dbl>      <dbl>         <dbl>       <dbl>        <dbl>
##  1     28216678      58688    2357181       1106849    1085257.      650145.
##  
##  $features
##  # A tibble: 1 × 3
##    total singletons per_sample_avg
##    <int>      <int>          <dbl>
##  1 19216       2134          4022.

# Calculate total counts per sample
perCellQCMetrics(tse)[1:5,]
##  DataFrame with 5 rows and 3 columns
##                sum  detected     total
##          <numeric> <numeric> <numeric>
##  CL3        864077      6964    864077
##  CC1       1135457      7679   1135457
##  SV1        697509      5729    697509
##  M31Fcsw   1543451      2667   1543451
##  M11Fcsw   2076476      2574   2076476

# Calculate and add total counts to `colData`
tse <- addPerCellQC(tse)
colData(tse)[1:5,1:3]
##  DataFrame with 5 rows and 3 columns
##          X.SampleID   Primer Final_Barcode
##            <factor> <factor>      <factor>
##  CL3        CL3      ILBC_01        AACGCA
##  CC1        CC1      ILBC_02        AACTCG
##  SV1        SV1      ILBC_03        AACTGT
##  M31Fcsw    M31Fcsw  ILBC_04        AAGAGA
##  M11Fcsw    M11Fcsw  ILBC_05        AAGCTG

The distribution of calculated library sizes can be visualized as a histogram (left) or by sorting the samples by library size (right).

library(ggplot2)
library(dplyr)
library(patchwork)

# Create a histrogram showing distribution of library sizes
p1 <- ggplot(colData(tse)) +
        geom_histogram(aes(x = sum/1e6), color = "black", fill = "gray") +
        labs(x = "Library size (million reads)", y = "Frequency (n)") +
        theme_classic()

# Order data in increasing order
df <- as.data.frame(colData(tse)) |>
    arrange(sum) |>
    mutate(proportion = (1:n())/n())

# Create a scatter plot showing how library sizes are distributed
p2 <- ggplot(df, aes(y = proportion, x = sum/1e6)) +
    geom_point() +
    labs(x = "Library size (million reads)", y = "Proportion of samples") +
    theme_classic()

p1 + p2

Library size distribution.

Library sizes and other variables from colData can be visualized using a specified function called plotColData().

# Sort samples by read count,
# order the factor levels,
# and store back to tse as DataFrame
colData(tse) <- as.data.frame(colData(tse)) |>
    arrange(X.SampleID) |>
    mutate(X.SampleID = factor(X.SampleID, levels=X.SampleID)) %>%
        DataFrame
plotColData(tse,"sum","X.SampleID", colour_by = "SampleType") +
    theme(axis.text.x = element_text(angle = 45, hjust=1)) +
    labs(y = "Library size (N)", x = "Sample ID")

Library sizes per sample.
plotColData(tse,"sum","SampleType", colour_by = "SampleType") +
    theme(axis.text.x = element_text(angle = 45, hjust=1))

Library sizes per sample type.

If you want to subset data based on library size, see Section 8.4.

In addition, data can be rarefied with rarefyAssay, which normalizes the samples to an equal number of reads. This remains controversial, however, and strategies to mitigate the information loss in rarefaction have been proposed (Schloss 2024a) (Schloss 2024b). Moreover, this practice has been discouraged for the analysis of differentially abundant microorganisms (see (McMurdie and Holmes 2014)).

Schloss, Patrick D. 2024a. “Rarefaction Is Currently the Best Approach to Control for Uneven Sequencing Effort in Amplicon Sequence Analyses.” mSphere 9 (2): e00354–23. https://doi.org/10.1128/msphere.00354-23.
———. 2024b. “Waste Not, Want Not: Revisiting the Analysis That Called into Question the Practice of Rarefaction.” mSphere 9 (1): e00355–23. https://doi.org/10.1128/msphere.00355-23.
McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Computational Biology 10 (4): e1003531.

11.3.4 Contaminant sequences

Samples might be contaminated with exogenous sequences. The impact of each contaminant can be estimated based on its frequencies and concentrations across the samples.

The following decontam functions are based on (Davis et al. 2018) and support such functionality:

Davis, Nicole M, Diana M Proctor, Susan P Holmes, David A Relman, and Benjamin J Callahan. 2018. “Simple Statistical Identification and Removal of Contaminant Sequences in Marker-Gene and Metagenomics Data.” Microbiome 6 (1): 1–14.

Contaminations can be detected using two main approaches: frequency-based and prevalence-based testing. In frequency-based testing, the user must provide the DNA concentration of samples. The abundance of features is then compared to the DNA concentration, as contaminants are expected to show an inverse relationship — they make up a larger fraction in low-DNA samples and smaller fraction in high-DNA samples.

In the prevalence-based approach, sequence prevalence is compared between true biological samples and control samples. This method assumes that contaminants are more prevalent in negative controls, as these lack true biological DNA and primarily contain background noise from contamination sources.

Below we download a dataset from microbiomeDataSets.

library(microbiomeDataSets)
tse <- baboongut()

The dataset contains 16S data from the gut microbiome of baboons, with DNA concentration recorded in the “post_pcr_dna_ng” column in the sample metadata. This information can be used for frequency-based contamination identification.

library(knitr)

colData(tse) |> head() |> kable()
sample baboon_id collection_date sex age social_group group_size rain_month_mm season hydro_year month readcount plate post_pcr_dna_ng diet_PC1 diet_PC2 diet_PC3 diet_PC4 diet_PC5 diet_PC6 diet_PC7 diet_PC8 diet_PC9 diet_PC10 diet_PC11 diet_PC12 diet_PC13 diet_shannon_h asv_richness asv_shannon_h pc1_bc pc2_bc pc3_bc pc4_bc pc5_bc
sample_11410-AACTCCTGTGGA-396 sample_11410-AACTCCTGTGGA-396 Baboon_1 2010-08-14 M 8.375 g_2.1 74 0.0 dry 2010 8 26140 129 78.98 44.245 13.304 -3.182 1.639 -1.5291 1.2958 1.957 0.840 0.8753 0.2847 -0.4202 0.4844 -0.0784 0.7761 179 2.972 -0.0649 0.2813 -0.0912 -0.0204 0.0731
sample_11413-CGGCCTAAGTTC-397 sample_11413-CGGCCTAAGTTC-397 Baboon_1 2010-05-14 M 8.123 g_2.1 72 44.6 wet 2010 5 11186 9 49.70 -1.621 -3.402 -16.786 8.296 -0.4551 -0.4667 1.290 -5.776 0.9291 -1.7560 0.1513 -1.9706 -5.9968 1.6158 60 1.574 0.3312 0.2543 -0.1423 0.2182 -0.2007
sample_11409-CAGAAGGTGTGG-395 sample_11409-CAGAAGGTGTGG-395 Baboon_1 2011-01-08 M 8.778 g_2.1 80 12.2 wet 2011 1 30532 196 59.16 -34.954 -8.627 -5.499 -9.029 -1.9215 9.7852 13.418 10.094 -10.1494 -3.0103 3.0000 0.2265 0.1694 1.8709 164 3.045 0.1434 0.2225 0.0493 0.0680 -0.0521
sample_11412-AAGGCCTTTACG-397 sample_11412-AAGGCCTTTACG-397 Baboon_1 2010-11-18 M 8.638 g_2.1 75 70.6 wet 2011 11 35855 198 46.91 -14.331 5.290 9.872 12.223 -1.3256 1.4292 1.043 -1.948 -1.7495 -0.4891 -3.5910 0.2045 0.0381 1.4014 113 2.485 0.1796 0.2159 0.0269 -0.1299 -0.0282
sample_11409-GTTGCTGAGTCC-395 sample_11409-GTTGCTGAGTCC-395 Baboon_1 2011-01-22 M 8.816 g_2.1 81 0.0 wet 2011 1 49800 146 50.14 -6.118 -8.050 -7.833 -8.504 -8.9970 4.1727 4.685 12.025 -9.4630 0.8942 2.1969 0.4153 0.1357 1.9947 201 3.317 0.0630 0.2135 0.0708 0.0638 -0.0419
sample_11408-GGTCTTAGCACC-395 sample_11408-GGTCTTAGCACC-395 Baboon_1 2010-10-14 M 8.542 g_2.1 76 0.0 dry 2010 10 45778 194 46.91 15.927 2.057 5.217 4.699 -5.0800 0.4955 2.755 -3.277 -2.1587 -0.2728 1.5095 -1.5876 0.3703 1.4930 244 3.783 0.0027 0.1412 0.1576 0.0705 0.0437

Now we can detect contaminant sequences. We run addContaminantQC() which adds results to rowData.

library(mia)

tse <- addContaminantQC(tse, concentration = "post_pcr_dna_ng")

rowData(tse) |> head() |> kable()
Domain Phylum Class Order Family Genus ASV freq prev p.freq p.prev p contaminant
asv_10002 Bacteria NA NA NA NA NA asv_10002 0.0003 3355 0.6914 NA 0.6914 FALSE
asv_10020 Bacteria Tenericutes Mollicutes Mollicutes RF39 NA NA asv_10020 0.0001 2511 0.7099 NA 0.7099 FALSE
asv_10027 Bacteria Tenericutes Mollicutes Mollicutes RF39 NA NA asv_10027 0.0001 2835 0.7766 NA 0.7766 FALSE
asv_10036 Bacteria Tenericutes Mollicutes Mollicutes RF39 NA NA asv_10036 0.0016 8633 0.7051 NA 0.7051 FALSE
asv_10052 NA NA NA NA NA NA asv_10052 0.0001 2833 0.7341 NA 0.7341 FALSE
asv_10072 Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae Ruminiclostridium 6 asv_10072 0.0001 2138 0.6394 NA 0.6394 FALSE

The method performs statistical tests to identify contaminants. By default, it uses a 0.1 probability threshold for identifying contaminants, which can be adjusted as needed. In this case, 0% of sequences were detected to be contaminants. We can then filter out these features from the data.

tse <- tse[ !rowData(tse)[["contaminant"]], ]

As an example, we also demonstrate how to apply the prevalence-based approach. Note that the data used here is arbitrary, and in practice, you should use real control sample information.

First, we add arbitrary control samples:

control <- rep(FALSE, ncol(tse))
control[ sample(seq_len(ncol(tse)), 1) ] <- TRUE
tse[["control"]] <- control

Next, we perform the analysis. Note that we could have applied both frequency-based and prevalence-based methods simultaneously by specifying the method parameter in the previous step.

not_contaminant <- isNotContaminant(tse, control = "control", detailed = FALSE)
not_contaminant |> head()
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

To identify non-contaminant sequences with prevalence approach, a threshold of 0.5 is used, by default. As noted in the previous step, the add*() functions add results to rowData. Here, we used function that return only the results. By specifying detailed = FALSE, we obtain the results as a vector, which can then be used for subsetting the data.

Exercises

Goal: The goal is to learn relevant function in quality control and initial exploration of the data.

Exercise 1: QC and exploration

  1. Load any of the example datasets mentioned in Section 4.2.

  2. Summarize the counts with with histogram.

  3. Add prevalence of taxa to rowData.

  4. Visualize the prevalence distribution with histogram. Does the data include many prevalent taxa or are they found only in small part of samples?

  5. Add library sizes to colData.

  6. Visualize the library size distribution with histogram. Does the sampling depth differ a lot?

  7. Visualize categorical values in colData with a bar plot.

  8. Get the available taxonomy ranks in the data.

  9. Calculate a table that summarizes the dominance of genera or any other rank. Which taxa is present in the highest number of samples? In how many samples it is present? What percentage of samples the number corresponds to?

  10. Get the most prevalent features in specific taxonomy rank. Use counts table, and set prevalence and detection threshold to 20% and 1, respectively.

  11. Get the most abundant features based on median abundance. How this differs from prevalent features?

  12. Visualize the most prevalent features.

Useful functions:

data(), plotHistogram(), addPrevalence(), rowData(), addPerCellQCMetrics(), colData(), plotBarplot(), taxonomyRanks(), summarizeDominance(), getPrevalent(), getTop, plotAbundanceDensity()

Back to top