Cleaning taxonomy table

Once you have sequenced 16S rRNA gene libraries and processed the raw sequences with your choice of tool, Deblur, DADA2, NG-Tax, etc. You will get a table of taxa abundances and taxonomic classification of taxa. This can be merged with metadata and used for downstream analysis.

Data can be in *.biom format or tables.

Let’s get started!

Here, we will focus on cleaning taxonomy table sotred in tax_table slot using phyloseq and microbiome.

Check the read_phyloseq function from microbiome package for importing and converting you data into phyloseq format.

An introduction to helper functions in microbiome and phyloseq packages can be found here documentation. Unfortunately, many users do not go through this. So we suggested checking basic introductory documentation of software tools!

First install the packages of interest.

# From cran  
install.packages("devtools")

# From Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) # Install BiocManager if not already installed. 
    install.packages("BiocManager")

BiocManager::install("microbiome")

# From GitHub
devtools::install_github("microsud/microbiomeutilities")

Note: When possible we will try to call the functions together with package name for the purpose of clarity of packages providing the function. E.g. microbiome::transform()

Load R package

# Load a library  
library(microbiomeutilities)

Load example data

We use the data stored in microbiomeutilties R package.

data("zackular2014") # access the test data
# assign to an aboject with shorter name.
ps <- zackular2014

Get to know your data

Check

Firstly, check and understand some basic information about your data.

microbiome::summarize_phyloseq(ps)

## Compositional = NO2

## 1] Min. number of reads = 157042] Max. number of reads = 2168883] Total number of reads = 45946264] Average number of reads = 52211.65909090915] Median number of reads = 49415.57] Sparsity = 0.7284265902528656] Any OTU sum to 1 or less? NO8] Number of singletons = 09] Percent of OTUs that are singletons 
##         (i.e. exactly one read detected across all samples)010] Number of sample variables are: 7investigation_typeproject_nameDiseaseStateagebody_productFOBT.resultmaterial2

## [[1]]
## [1] "1] Min. number of reads = 15704"
## 
## [[2]]
## [1] "2] Max. number of reads = 216888"
## 
## [[3]]
## [1] "3] Total number of reads = 4594626"
## 
## [[4]]
## [1] "4] Average number of reads = 52211.6590909091"
## 
## [[5]]
## [1] "5] Median number of reads = 49415.5"
## 
## [[6]]
## [1] "7] Sparsity = 0.728426590252865"
## 
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
## 
## [[8]]
## [1] "8] Number of singletons = 0"
## 
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)0"
## 
## [[10]]
## [1] "10] Number of sample variables are: 7"
## 
## [[11]]
## [1] "investigation_type" "project_name"       "DiseaseState"       "age"                "body_product"       "FOBT.result"       
## [7] "material"

Questions you need to ask regarding your data:
How many taxa are present in the data ?
Do you expect these number of taxa in your ecosystem of interest?
Have you successfully imported all your samples ?
Are all the taxonomic ranks included in your taxonomy ?
What is the Minimum and Maximum number of reads in your samples?
Are there singletons?
Is your data zero-inflated Sparsity?

Access parts

# OTU table  

otu_tab <- microbiome::abundances(ps)

# check 

otu_tab[1:5,1:5] # for my table show me [1 to 5 rows, 1 to 5 columns]

##               Adenoma10-2757 Adenoma11-2775 Adenoma13-2803 Adenoma14-2817 Adenoma15-2847
## d__denovo1773              0              0              0              0              0
## d__denovo1771              0              0              0              6              0
## d__denovo1776              0              0              3              0              0
## d__denovo1777              0              0              0             15              1
## d__denovo1775              1              0              1              1              4

# Taxonomy table
tax_tab <- phyloseq::tax_table(ps)
# check 
tax_tab[1:5,1:5] # for my table show me [1 to 5 otu ids, 1 to 5 first five ranks]

## Taxonomy Table:     [5 taxa by 5 taxonomic ranks]:
##               Domain        Phylum             Class            Order              Family              
## d__denovo1773 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__Bacteroidaceae" 
## d__denovo1771 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__Bacteroidaceae" 
## d__denovo1776 "k__Bacteria" "p__Firmicutes"    "c__Clostridia"  "o__Clostridiales" "f__Ruminococcaceae"
## d__denovo1777 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__Bacteroidaceae" 
## d__denovo1775 "k__Bacteria" "p__Firmicutes"    "c__Clostridia"  "o__Clostridiales" "f__Lachnospiraceae"

Notice something, very common to see patterns like "k__" prefixes and OTU id "d__denovo" or sequences as row names.

Clean taxonomy table

Cleaning of taxonomy tables is useful to do at the beginning of the analysis. It is time-consuming but also useful to understanding taxonomic information of your taxa.

Let us use two functions to change OTU ids. gsub from base R and taxa_names from phyloseq.

# accessing the OTUids 
taxa_names(ps)[1:5] # print first 5 ids

## [1] "d__denovo1773" "d__denovo1771" "d__denovo1776" "d__denovo1777" "d__denovo1775"

# find and substitute
taxa_names(ps) <- gsub(taxa_names(ps), pattern = "d__", replacement = "")  

# Check again Taxonomy table
phyloseq::tax_table(ps)[1:3,1:3] # check for my table show me [1 to 3 otu ids, 1 to 3 first three ranks].

## Taxonomy Table:     [3 taxa by 3 taxonomic ranks]:
##            Domain        Phylum             Class           
## denovo1773 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia"
## denovo1771 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia"
## denovo1776 "k__Bacteria" "p__Firmicutes"    "c__Clostridia"

What happened here? We first accessed the names of taxa (OTUids) with taxa_names(ps) which contains the names of OTUids.

taxa_names(ps)[1:5] # print first 5 names

## [1] "denovo1773" "denovo1771" "denovo1776" "denovo1777" "denovo1775"

For each of these, we ask gsub to find a pattern "d__" and substitute it with "" (nothing).
By providing “taxa_names(ps) <-” we straight away substitute these names.

We can substitute “denovo” with just “OTU-”

# find and substitute
taxa_names(ps) <- gsub(taxa_names(ps), pattern = "denovo", replacement = "OTU-")

Next, cleaning the "k__" and similar values.

# extending gsub to entire table 
tax_table(ps)[, colnames(tax_table(ps))] <- gsub(tax_table(ps)[, colnames(tax_table(ps))],     pattern = "[a-z]__", replacement = "")

What happened here? By using colnames(tax_table(ps)) we are specifying names of the columns in tax_table(ps).

colnames(tax_table(ps))

## [1] "Domain"  "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

For columns in tax_table(ps) we used gsub to check all the columns for patterns ranging from [a-z] joined by __ like this [a-z]__ and substitute it with "" i.e. nothing.

phyloseq::tax_table(ps)[1:3,1:3] # check for my table show me [1 to 3 otu ids, 1 to 3 first three ranks].

## Taxonomy Table:     [3 taxa by 3 taxonomic ranks]:
##          Domain     Phylum          Class        
## OTU-1773 "Bacteria" "Bacteroidetes" "Bacteroidia"
## OTU-1771 "Bacteria" "Bacteroidetes" "Bacteroidia"
## OTU-1776 "Bacteria" "Firmicutes"    "Clostridia"

There are several other special characters and labels that you should check for in your data. This is an important step, especially when aggregating/collapsing at higher taxonomic ranks.

Some examples are below. These are not detected in the test data used here. However, we have come across these in our research.

# Replace sequence IDs with ATACAACTATACG with ASV1 and so on...
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))

Replace special characters.

tax_table(ps)[,colnames(tax_table(ps))] <- gsub(tax_table(ps)[,colnames(tax_table(ps))],pattern="=*",replacement="")

# remove asterisk
tax_table(ps)[,colnames(tax_table(ps))] <- gsub(tax_table(ps)[,colnames(tax_table(ps))],pattern="[*]",replacement="") # notice here the square brackets. These are required when special characters such as * and ~ below are used as they have functional role in R base programming. 

# remove ~
tax_table(ps)[,colnames(tax_table(ps))] <- gsub(tax_table(ps)[,colnames(tax_table(ps))],pattern="[~]",replacement="")

Other than gsub, we can also go through one column at a time in the taxonomy table to specifically substitute ambiguous markings.

# For each column separately we attempt to replace "k__<empty>" with "" for consistency.
tax_table(ps)[tax_table(ps) == "k__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "p__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "c__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "o__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "f__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "g__<empty>"] <- ""
# some more ambiguities 
tax_table(ps)[tax_table(ps) == "o__Unknown_Order"] <- ""
tax_table(ps)[tax_table(ps) == "c__Unknown_Class"] <- ""
tax_table(ps)[tax_table(ps) == "f__Unknown_Family"] <- ""

What happened here? We asked for any pattern matching "k__" in the table and replace it with "" nothing.

Aggregate at genus level.

ps.gen <- phyloseq::tax_glom(ps, "Genus", NArm = TRUE)

Notice the NArm = TRUE parameter. This will remove anything that is unclassified at genus level. This choice will depend on your research goals.
Check the names.

taxa_names(ps.gen)[1:5]

## [1] "OTU-174"  "OTU-1079" "OTU-363"  "OTU-369"  "OTU-103"

Substitute these IDs with names of genus.

taxa_names(ps.gen) <- tax_table(ps.gen)[,"Genus"]

Check if the gsub above worked.

taxa_names(ps.gen)[1:5]

## [1] "Listeria"         "Allisonella"      "Enterorhabdus"    "Cloacibacillus"   "Clostridium_XlVb"

What happened here? Similar to our first instance of gsub operation to replace "d__", we asked R to substitute taxa_names(ps.gen) which are ids with corresponding genus identity accessed by this tax_table(ps.gen)[,"Genus"]

tax_table(ps.gen)[,"Genus"][1:5] # [1:5] is first five to print.

## Taxonomy Table:     [5 taxa by 1 taxonomic ranks]:
##                  Genus             
## Listeria         "Listeria"        
## Allisonella      "Allisonella"     
## Enterorhabdus    "Enterorhabdus"   
## Cloacibacillus   "Cloacibacillus"  
## Clostridium_XlVb "Clostridium_XlVb"

Print all genus names. We use the unique() function to print each name only once.

# 
unique(tax_table(ps.gen)[,"Genus"] )

## Taxonomy Table:     [104 taxa by 1 taxonomic ranks]:
##                                    Genus                               
## Listeria                           "Listeria"                          
## Allisonella                        "Allisonella"                       
## Enterorhabdus                      "Enterorhabdus"                     
## Cloacibacillus                     "Cloacibacillus"                    
## Clostridium_XlVb                   "Clostridium_XlVb"                  
## Clostridium_XlVa                   "Clostridium_XlVa"                  
## Subdoligranulum                    "Subdoligranulum"                   
## Streptococcus                      "Streptococcus"                     
## Anaerococcus                       "Anaerococcus"                      
## Gemella                            "Gemella"                           
## Sporobacter                        "Sporobacter"                       
## Acinetobacter                      "Acinetobacter"                     
## Streptophyta                       "Streptophyta"                      
## Slackia                            "Slackia"                           
## Acidaminococcus                    "Acidaminococcus"                   
## Citrobacter                        "Citrobacter"                       
## Desulfovibrio                      "Desulfovibrio"                     
## Fusobacterium                      "Fusobacterium"                     
## Prevotella                         "Prevotella"                        
## Actinomyces                        "Actinomyces"                       
## Clostridium_XI                     "Clostridium_XI"                    
## Gemmiger                           "Gemmiger"                          
## Finegoldia                         "Finegoldia"                        
## Insolitispirillum                  "Insolitispirillum"                 
## Solobacterium                      "Solobacterium"                     
## Gordonibacter                      "Gordonibacter"                     
## Lactonifactor                      "Lactonifactor"                     
## Oscillibacter                      "Oscillibacter"                     
## Acetanaerobacterium                "Acetanaerobacterium"               
## Staphylococcus                     "Staphylococcus"                    
## Rothia                             "Rothia"                            
## Enterobacter                       "Enterobacter"                      
## Veillonella                        "Veillonella"                       
## Coprobacillus                      "Coprobacillus"                     
## Bilophila                          "Bilophila"                         
## Megasphaera                        "Megasphaera"                       
## Anaerosporobacter                  "Anaerosporobacter"                 
## Lactococcus                        "Lactococcus"                       
## Akkermansia                        "Akkermansia"                       
## Methanobrevibacter                 "Methanobrevibacter"                
## Collinsella                        "Collinsella"                       
## Dialister                          "Dialister"                         
## Ruminococcus2                      "Ruminococcus2"                     
## Dorea                              "Dorea"                             
## Marvinbryantia                     "Marvinbryantia"                    
## Blautia                            "Blautia"                           
## Bacteroides                        "Bacteroides"                       
## Escherichia/Shigella               "Escherichia/Shigella"              
## Roseburia                          "Roseburia"                         
## Anaerostipes                       "Anaerostipes"                      
## Holdemania                         "Holdemania"                        
## Parabacteroides                    "Parabacteroides"                   
## Faecalibacterium                   "Faecalibacterium"                  
## Clostridium_IV                     "Clostridium_IV"                    
## Eubacterium                        "Eubacterium"                       
## Barnesiella                        "Barnesiella"                       
## Deinococcus                        "Deinococcus"                       
## Neisseria                          "Neisseria"                         
## Hydrogenoanaerobacterium           "Hydrogenoanaerobacterium"          
## Odoribacter                        "Odoribacter"                       
## Enterococcus                       "Enterococcus"                      
## Anaerotruncus                      "Anaerotruncus"                     
## Parasutterella                     "Parasutterella"                    
## Salmonella                         "Salmonella"                        
## Butyricimonas                      "Butyricimonas"                     
## Peptoniphilus                      "Peptoniphilus"                     
## Lachnospiracea_incertae_sedis      "Lachnospiracea_incertae_sedis"     
## Pseudomonas                        "Pseudomonas"                       
## Coprococcus                        "Coprococcus"                       
## Leuconostoc                        "Leuconostoc"                       
## Helicobacter                       "Helicobacter"                      
## Catenibacterium                    "Catenibacterium"                   
## Anaerofustis                       "Anaerofustis"                      
## Eggerthella                        "Eggerthella"                       
## Anaerovorax                        "Anaerovorax"                       
## Parvimonas                         "Parvimonas"                        
## Hallella                           "Hallella"                          
## Anaerofilum                        "Anaerofilum"                       
## Mannheimia                         "Mannheimia"                        
## Turicibacter                       "Turicibacter"                      
## Lactobacillus                      "Lactobacillus"                     
## Butyricicoccus                     "Butyricicoccus"                    
## Clostridium_XVIII                  "Clostridium_XVIII"                 
## Herbaspirillum                     "Herbaspirillum"                    
## Pseudoflavonifractor               "Pseudoflavonifractor"              
## Paraprevotella                     "Paraprevotella"                    
## Ruminococcus                       "Ruminococcus"                      
## Alistipes                          "Alistipes"                         
## Bifidobacterium                    "Bifidobacterium"                   
## Sutterella                         "Sutterella"                        
## Howardella                         "Howardella"                        
## Flavonifractor                     "Flavonifractor"                    
## Mogibacterium                      "Mogibacterium"                     
## Erysipelotrichaceae_incertae_sedis "Erysipelotrichaceae_incertae_sedis"
## Rhodobacter                        "Rhodobacter"                       
## Weissella                          "Weissella"                         
## Oxalobacter                        "Oxalobacter"                       
## Clostridium_III                    "Clostridium_III"                   
## Haemophilus                        "Haemophilus"                       
## Clostridium_sensu_stricto          "Clostridium_sensu_stricto"         
## Porphyromonas                      "Porphyromonas"                     
## Phascolarctobacterium              "Phascolarctobacterium"             
## Atopobium                          "Atopobium"                         
## Hespellia                          "Hespellia"

There can also be spaces between names e.g Prevotella 1 which can be changed to Prevotella_1

taxa_names(ps) <- gsub(" ", ".", taxa_names(ps))

There can also be spaces between names e.g [Prevotella] copri which can be changed to Prevotella copri

taxa_names(ps) <- gsub("\\[|\\]", "", taxa_names(ps))

There can also be spaces between names e.g Prevotella-copri which can be changed to Prevotella.copri

taxa_names(ps) <- gsub("-", ".", taxa_names(ps))

Also check the format_to_besthit in microbiomeutilities R package here.

These are just the common examples we observe in our research. There can be may other challenges with taxonomic labeling. However, we hope this has given a general impression of how we can do cleaning of data in R with just gsub and phyloseq functions to access information from within a phyloseq object.
For more information of gsub and related grepl check the base R Pattern Matching and Replacement