8  Taxonomic Information

library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

Taxonomic information is a key part of analyzing microbiome data, and without it, any type of data analysis probably will not make much sense. However, the degree of detail of taxonomic information differs depending on the dataset and annotation data used.

Therefore, the mia package expects a loose assembly of taxonomic information and assumes certain key aspects:

In this chapter, we will refer to the co-abundant groups as CAGs, which are clusters of taxa that co-vary across samples.

8.1 Assigning taxonomic information

There are a number of methods to assign taxonomic information. We like to give a short introduction about the methods available without ranking one over the other. This has to be your choice based on the result for the individual dataset.

8.1.1 DADA2

The dada2 package (Callahan et al. 2016) implements the assignTaxonomy() function, which takes as input the ASV sequences associated with each row of data and a training dataset. For more information visit the dada2 homepage.

Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. “DADA2: High-Resolution Sample Inference from Illumina Amplicon Data.” Nature Methods 13: 581–83. https://doi.org/10.1038/nmeth.3869.

8.1.2 DECIPHER

The DECIPHER package (Wright 2020) implements the IDTAXA algorithm to assign either taxonomic information or function information. For mia, only the first option is of interest for now and more information can be found on the DECIPHER website.

Wright, Erik. 2020. DECIPHER: Tools for Curating, Analyzing, and Manipulating Biological Sequences.

8.2 Functions to access taxonomic information

8.2.1 Check taxonomy ranks in data

checkTaxonomy() checks whether the taxonomic information is usable for mia

checkTaxonomy(tse)
##  [1] TRUE

Since the rowData can contain other data, taxonomyRanks() will return the columns that mia assumes to contain the taxonomic information.

taxonomyRanks(tse)
##  [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

This can then be used to subset the rowData to columns needed.

rowData(tse)[, taxonomyRanks(tse)]
##  DataFrame with 19216 rows and 7 columns
##             Kingdom        Phylum        Class        Order        Family
##         <character>   <character>  <character>  <character>   <character>
##  549322     Archaea Crenarchaeota Thermoprotei           NA            NA
##  522457     Archaea Crenarchaeota Thermoprotei           NA            NA
##  951        Archaea Crenarchaeota Thermoprotei Sulfolobales Sulfolobaceae
##  244423     Archaea Crenarchaeota        Sd-NA           NA            NA
##  586076     Archaea Crenarchaeota        Sd-NA           NA            NA
##  ...            ...           ...          ...          ...           ...
##  278222    Bacteria           SR1           NA           NA            NA
##  463590    Bacteria           SR1           NA           NA            NA
##  535321    Bacteria           SR1           NA           NA            NA
##  200359    Bacteria           SR1           NA           NA            NA
##  271582    Bacteria           SR1           NA           NA            NA
##               Genus                Species
##         <character>            <character>
##  549322          NA                     NA
##  522457          NA                     NA
##  951     Sulfolobus Sulfolobusacidocalda..
##  244423          NA                     NA
##  586076          NA                     NA
##  ...            ...                    ...
##  278222          NA                     NA
##  463590          NA                     NA
##  535321          NA                     NA
##  200359          NA                     NA
##  271582          NA                     NA

taxonomyRankEmpty() checks for empty values in the given rank and returns a logical vector of length(x).

all(!taxonomyRankEmpty(tse, rank = "Kingdom"))
##  [1] TRUE
table(taxonomyRankEmpty(tse, rank = "Genus"))
##  
##  FALSE  TRUE 
##   8008 11208
table(taxonomyRankEmpty(tse, rank = "Species"))
##  
##  FALSE  TRUE 
##   1413 17803

8.2.2 Get taxonomy labels

getTaxonomyLabels() is a multi-purpose function, which turns taxonomic information into a character vector of length(x)

head(getTaxonomyLabels(tse))
##  [1] "Class:Thermoprotei"               "Class:Thermoprotei_1"            
##  [3] "Species:Sulfolobusacidocaldarius" "Class:Sd-NA"                     
##  [5] "Class:Sd-NA_1"                    "Class:Sd-NA_2"

By default, this will use the lowest non-empty information to construct a string with the following scheme level:value. If all levels are the same, this part is omitted, but can be added by setting with.rank = TRUE.

phylum <- !is.na(rowData(tse)$Phylum) &
    vapply(data.frame(apply(
       rowData(tse)[, taxonomyRanks(tse)[3:7]], 1L, is.na)), all, logical(1))
head(getTaxonomyLabels(tse[phylum,]))
##  [1] "Crenarchaeota"    "Crenarchaeota_1"  "Crenarchaeota_2" 
##  [4] "Actinobacteria"   "Actinobacteria_1" "Spirochaetes"
head(getTaxonomyLabels(tse[phylum,], with.rank = TRUE))
##  [1] "Phylum:Crenarchaeota"    "Phylum:Crenarchaeota_1" 
##  [3] "Phylum:Crenarchaeota_2"  "Phylum:Actinobacteria"  
##  [5] "Phylum:Actinobacteria_1" "Phylum:Spirochaetes"

By default the return value of getTaxonomyLabels() contains only unique elements by passing it through make.unique. This step can be omitted by setting make.unique = FALSE.

head(getTaxonomyLabels(tse[phylum,], with.rank = TRUE, make.unique = FALSE))
##  [1] "Phylum:Crenarchaeota"  "Phylum:Crenarchaeota"  "Phylum:Crenarchaeota" 
##  [4] "Phylum:Actinobacteria" "Phylum:Actinobacteria" "Phylum:Spirochaetes"

To apply the loop resolving function resolveLoop() from the TreeSummarizedExperiment package (Huang 2020) within getTaxonomyLabels(), set resolve.loops = TRUE.

8.2.3 Get information on certain taxa

The function getUnique() gives a list of unique taxa for the specified taxonomic rank.

head(getUnique(tse, rank = "Phylum"))
##  [1] "Crenarchaeota"  "Euryarchaeota"  "Actinobacteria" "Spirochaetes"  
##  [5] "MVP-15"         "Proteobacteria"

With mapTaxonomy(), you can search information on certain taxa from the taxonomy table. For instance, we can check all the taxa that matches with “Escherichia”.

mapTaxonomy(GlobalPatterns, taxa = "Escherichia")
##  $Escherichia
##  DataFrame with 1 row and 7 columns
##             Kingdom         Phylum               Class             Order
##         <character>    <character>         <character>       <character>
##  249227    Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
##                     Family       Genus     Species
##                <character> <character> <character>
##  249227 Enterobacteriaceae Escherichia          NA

8.2.4 Generate a hierarchy tree on the fly

A hierarchy tree shows mapping between the taxonomic levels in taxonomic rank table (included in rowData), rather than the detailed phylogenetic relations. Usually, a phylogenetic tree refers to latter which is why we call here the generated tree as “hierarchy tree”.

To create a hierarchy tree, getHierarchyTree() used the information and returns a phylo object. Duplicate information from the rowData is removed.

getHierarchyTree(tse)
##  
##  Phylogenetic tree with 1645 tips and 1089 internal nodes.
##  
##  Tip labels:
##    Species:Cenarchaeumsymbiosum, Species:pIVWA5, Species:CandidatusNitrososphaeragargensis, Species:SCA1145, Species:SCA1170, Species:Sulfolobusacidocaldarius, ...
##  Node labels:
##    root:ALL, Kingdom:Archaea, Phylum:Crenarchaeota, Class:C2, Class:Sd-NA, Class:Thaumarchaeota, ...
##  
##  Rooted; includes branch lengths.
tse <- addHierarchyTree(tse)
tse
##  class: TreeSummarizedExperiment 
##  dim: 19216 26 
##  metadata(0):
##  assays(1): counts
##  rownames(19216): 549322 522457 ... 200359 271582
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (19216 rows)
##  rowTree: 1 phylo tree(s) (1645 leaves)
##  colLinks: NULL
##  colTree: NULL

The implementation is based on the toTree() function from the TreeSummarizedExperiment package (Huang 2020).

Huang, Ruizhu. 2020. TreeSummarizedExperiment: A S4 Class for Data with Tree Structures.

8.3 Set taxonomy ranks

If your data includes taxonomy ranks that are not included by default in mia, you can set the ranks manually. By doing so, mia will be able to detect and utilize these taxonomy ranks from your data as expected.

Get default ranks of mia.

getTaxonomyRanks()
##   [1] "domain"       "superkingdom" "kingdom"      "phylum"      
##   [5] "class"        "order"        "family"       "genus"       
##   [9] "species"      "strain"

Set ranks to your own ranks. Remember that the order is meaningful.

# Set ranks
setTaxonomyRanks(c("test", "test2", "apple"))

# Get ranks
getTaxonomyRanks()
##  [1] "test"  "test2" "apple"
Back to top