8 Taxonomic Information
Taxonomic information is a key part of analyzing microbiome data, and without it, any type of data analysis probably will not make much sense. However, the degree of detail of taxonomic information differs depending on the dataset and annotation data used.
Therefore, the mia
package expects a loose assembly of taxonomic information and assumes certain key aspects:
- Taxonomic information is given as character vectors or factors in the
rowData
of aSummarizedExperiment
object. - The columns containing the taxonomic information must be named
domain
,kingdom
,phylum
,class
,order
,family
,genus
,species
or with a capital first letter. - The columns must be given in the order shown above.
- column can be omitted, but the order must remain
In this chapter, we will refer to the co-abundant groups as CAGs, which are clusters of taxa that co-vary across samples.
8.1 Assigning taxonomic information
There are a number of methods to assign taxonomic information. We like to give a short introduction about the methods available without ranking one over the other. This has to be your choice based on the result for the individual dataset.
8.1.1 DADA2
The dada2
package (Callahan et al. 2016) implements the assignTaxonomy()
function, which takes as input the ASV sequences associated with each row of data and a training dataset. For more information visit the dada2 homepage.
8.1.2 DECIPHER
The DECIPHER
package (Wright 2020) implements the IDTAXA
algorithm to assign either taxonomic information or function information. For mia
, only the first option is of interest for now and more information can be found on the DECIPHER website.
8.2 Functions to access taxonomic information
8.2.1 Check taxonomy ranks in data
checkTaxonomy()
checks whether the taxonomic information is usable for mia
checkTaxonomy(tse)
## [1] TRUE
Since the rowData
can contain other data, taxonomyRanks()
will return the columns that mia
assumes to contain the taxonomic information.
taxonomyRanks(tse)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
This can then be used to subset the rowData
to columns needed.
rowData(tse)[, taxonomyRanks(tse)]
## DataFrame with 19216 rows and 7 columns
## Kingdom Phylum Class Order Family
## <character> <character> <character> <character> <character>
## 549322 Archaea Crenarchaeota Thermoprotei NA NA
## 522457 Archaea Crenarchaeota Thermoprotei NA NA
## 951 Archaea Crenarchaeota Thermoprotei Sulfolobales Sulfolobaceae
## 244423 Archaea Crenarchaeota Sd-NA NA NA
## 586076 Archaea Crenarchaeota Sd-NA NA NA
## ... ... ... ... ... ...
## 278222 Bacteria SR1 NA NA NA
## 463590 Bacteria SR1 NA NA NA
## 535321 Bacteria SR1 NA NA NA
## 200359 Bacteria SR1 NA NA NA
## 271582 Bacteria SR1 NA NA NA
## Genus Species
## <character> <character>
## 549322 NA NA
## 522457 NA NA
## 951 Sulfolobus Sulfolobusacidocalda..
## 244423 NA NA
## 586076 NA NA
## ... ... ...
## 278222 NA NA
## 463590 NA NA
## 535321 NA NA
## 200359 NA NA
## 271582 NA NA
taxonomyRankEmpty()
checks for empty values in the given rank
and returns a logical vector of length(x)
.
all(!taxonomyRankEmpty(tse, rank = "Kingdom"))
## [1] TRUE
table(taxonomyRankEmpty(tse, rank = "Genus"))
##
## FALSE TRUE
## 8008 11208
table(taxonomyRankEmpty(tse, rank = "Species"))
##
## FALSE TRUE
## 1413 17803
8.2.2 Get taxonomy labels
getTaxonomyLabels()
is a multi-purpose function, which turns taxonomic information into a character vector of length(x)
head(getTaxonomyLabels(tse))
## [1] "Class:Thermoprotei" "Class:Thermoprotei_1"
## [3] "Species:Sulfolobusacidocaldarius" "Class:Sd-NA"
## [5] "Class:Sd-NA_1" "Class:Sd-NA_2"
By default, this will use the lowest non-empty information to construct a string with the following scheme level:value
. If all levels are the same, this part is omitted, but can be added by setting with.rank = TRUE
.
phylum <- !is.na(rowData(tse)$Phylum) &
vapply(data.frame(apply(
rowData(tse)[, taxonomyRanks(tse)[3:7]], 1L, is.na)), all, logical(1))
head(getTaxonomyLabels(tse[phylum,]))
## [1] "Crenarchaeota" "Crenarchaeota_1" "Crenarchaeota_2"
## [4] "Actinobacteria" "Actinobacteria_1" "Spirochaetes"
head(getTaxonomyLabels(tse[phylum,], with.rank = TRUE))
## [1] "Phylum:Crenarchaeota" "Phylum:Crenarchaeota_1"
## [3] "Phylum:Crenarchaeota_2" "Phylum:Actinobacteria"
## [5] "Phylum:Actinobacteria_1" "Phylum:Spirochaetes"
By default the return value of getTaxonomyLabels()
contains only unique elements by passing it through make.unique
. This step can be omitted by setting make.unique = FALSE
.
head(getTaxonomyLabels(tse[phylum,], with.rank = TRUE, make.unique = FALSE))
## [1] "Phylum:Crenarchaeota" "Phylum:Crenarchaeota" "Phylum:Crenarchaeota"
## [4] "Phylum:Actinobacteria" "Phylum:Actinobacteria" "Phylum:Spirochaetes"
To apply the loop resolving function resolveLoop()
from the TreeSummarizedExperiment
package (Huang 2020) within getTaxonomyLabels()
, set resolve.loops = TRUE
.
8.2.3 Get information on certain taxa
The function getUnique()
gives a list of unique taxa for the specified taxonomic rank.
With mapTaxonomy()
, you can search information on certain taxa from the taxonomy table. For instance, we can check all the taxa that matches with “Escherichia”.
mapTaxonomy(GlobalPatterns, taxa = "Escherichia")
## $Escherichia
## DataFrame with 1 row and 7 columns
## Kingdom Phylum Class Order
## <character> <character> <character> <character>
## 249227 Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## Family Genus Species
## <character> <character> <character>
## 249227 Enterobacteriaceae Escherichia NA
8.2.4 Generate a hierarchy tree on the fly
A hierarchy tree shows mapping between the taxonomic levels in taxonomic rank table (included in rowData
), rather than the detailed phylogenetic relations. Usually, a phylogenetic tree refers to latter which is why we call here the generated tree as “hierarchy tree”.
To create a hierarchy tree, getHierarchyTree()
used the information and returns a phylo
object. Duplicate information from the rowData
is removed.
getHierarchyTree(tse)
##
## Phylogenetic tree with 1645 tips and 1089 internal nodes.
##
## Tip labels:
## Species:Cenarchaeumsymbiosum, Species:pIVWA5, Species:CandidatusNitrososphaeragargensis, Species:SCA1145, Species:SCA1170, Species:Sulfolobusacidocaldarius, ...
## Node labels:
## root:ALL, Kingdom:Archaea, Phylum:Crenarchaeota, Class:C2, Class:Sd-NA, Class:Thaumarchaeota, ...
##
## Rooted; includes branch lengths.
tse <- addHierarchyTree(tse)
tse
## class: TreeSummarizedExperiment
## dim: 19216 26
## metadata(0):
## assays(1): counts
## rownames(19216): 549322 522457 ... 200359 271582
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(26): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (19216 rows)
## rowTree: 1 phylo tree(s) (1645 leaves)
## colLinks: NULL
## colTree: NULL
The implementation is based on the toTree()
function from the TreeSummarizedExperiment
package (Huang 2020).
8.3 Set taxonomy ranks
If your data includes taxonomy ranks that are not included by default in mia
, you can set the ranks manually. By doing so, mia
will be able to detect and utilize these taxonomy ranks from your data as expected.
Get default ranks of mia
.
getTaxonomyRanks()
## [1] "domain" "superkingdom" "kingdom" "phylum"
## [5] "class" "order" "family" "genus"
## [9] "species" "strain"
Set ranks to your own ranks. Remember that the order is meaningful.
# Set ranks
setTaxonomyRanks(c("test", "test2", "apple"))
# Get ranks
getTaxonomyRanks()
## [1] "test" "test2" "apple"