Once you have sequenced 16S rRNA gene libraries and processed the raw sequences with your choice of tool, Deblur, DADA2, NG-Tax, etc. You will get a table of taxa abundances and taxonomic classification of taxa. This can be merged with metadata and used for downstream analysis.
Data can be in *.biom format or tables.
Let’s get started!
Here, we will focus on cleaning taxonomy table sotred in tax_table
slot using phyloseq and microbiome.
Check the read_phyloseq
function from microbiome package for importing and converting you data into phyloseq format.
An introduction to helper functions in microbiome and phyloseq packages can be found here documentation. Unfortunately, many users do not go through this. So we suggested checking basic introductory documentation of software tools!
First install the packages of interest.
# From cran
install.packages("devtools")
# From Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) # Install BiocManager if not already installed.
install.packages("BiocManager")
BiocManager::install("microbiome")
# From GitHub
devtools::install_github("microsud/microbiomeutilities")
Note: When possible we will try to call the functions together with package name for the purpose of clarity of packages providing the function. E.g. microbiome::transform()
# Load a library
library(microbiomeutilities)
We use the data stored in microbiomeutilties
R package.
data("zackular2014") # access the test data
# assign to an aboject with shorter name.
ps <- zackular2014
Firstly, check and understand some basic information about your data.
microbiome::summarize_phyloseq(ps)
## Compositional = NO2
## 1] Min. number of reads = 157042] Max. number of reads = 2168883] Total number of reads = 45946264] Average number of reads = 52211.65909090915] Median number of reads = 49415.57] Sparsity = 0.7284265902528656] Any OTU sum to 1 or less? NO8] Number of singletons = 09] Percent of OTUs that are singletons
## (i.e. exactly one read detected across all samples)010] Number of sample variables are: 7investigation_typeproject_nameDiseaseStateagebody_productFOBT.resultmaterial2
## [[1]]
## [1] "1] Min. number of reads = 15704"
##
## [[2]]
## [1] "2] Max. number of reads = 216888"
##
## [[3]]
## [1] "3] Total number of reads = 4594626"
##
## [[4]]
## [1] "4] Average number of reads = 52211.6590909091"
##
## [[5]]
## [1] "5] Median number of reads = 49415.5"
##
## [[6]]
## [1] "7] Sparsity = 0.728426590252865"
##
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
##
## [[8]]
## [1] "8] Number of singletons = 0"
##
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n (i.e. exactly one read detected across all samples)0"
##
## [[10]]
## [1] "10] Number of sample variables are: 7"
##
## [[11]]
## [1] "investigation_type" "project_name" "DiseaseState" "age" "body_product" "FOBT.result"
## [7] "material"
Questions you need to ask regarding your data:
How many taxa are present in the data ?
Do you expect these number of taxa in your ecosystem of interest?
Have you successfully imported all your samples ?
Are all the taxonomic ranks included in your taxonomy ?
What is the Minimum and Maximum number of reads in your samples?
Are there singletons?
Is your data zero-inflated Sparsity
?
# OTU table
otu_tab <- microbiome::abundances(ps)
# check
otu_tab[1:5,1:5] # for my table show me [1 to 5 rows, 1 to 5 columns]
## Adenoma10-2757 Adenoma11-2775 Adenoma13-2803 Adenoma14-2817 Adenoma15-2847
## d__denovo1773 0 0 0 0 0
## d__denovo1771 0 0 0 6 0
## d__denovo1776 0 0 3 0 0
## d__denovo1777 0 0 0 15 1
## d__denovo1775 1 0 1 1 4
# Taxonomy table
tax_tab <- phyloseq::tax_table(ps)
# check
tax_tab[1:5,1:5] # for my table show me [1 to 5 otu ids, 1 to 5 first five ranks]
## Taxonomy Table: [5 taxa by 5 taxonomic ranks]:
## Domain Phylum Class Order Family
## d__denovo1773 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__Bacteroidaceae"
## d__denovo1771 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__Bacteroidaceae"
## d__denovo1776 "k__Bacteria" "p__Firmicutes" "c__Clostridia" "o__Clostridiales" "f__Ruminococcaceae"
## d__denovo1777 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__Bacteroidaceae"
## d__denovo1775 "k__Bacteria" "p__Firmicutes" "c__Clostridia" "o__Clostridiales" "f__Lachnospiraceae"
Notice something, very common to see patterns like "k__" prefixes and OTU id "d__denovo" or sequences as row names.
Cleaning of taxonomy tables is useful to do at the beginning of the analysis. It is time-consuming but also useful to understanding taxonomic information of your taxa.
Let us use two functions to change OTU ids. gsub
from base R and taxa_names
from phyloseq.
# accessing the OTUids
taxa_names(ps)[1:5] # print first 5 ids
## [1] "d__denovo1773" "d__denovo1771" "d__denovo1776" "d__denovo1777" "d__denovo1775"
# find and substitute
taxa_names(ps) <- gsub(taxa_names(ps), pattern = "d__", replacement = "")
# Check again Taxonomy table
phyloseq::tax_table(ps)[1:3,1:3] # check for my table show me [1 to 3 otu ids, 1 to 3 first three ranks].
## Taxonomy Table: [3 taxa by 3 taxonomic ranks]:
## Domain Phylum Class
## denovo1773 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia"
## denovo1771 "k__Bacteria" "p__Bacteroidetes" "c__Bacteroidia"
## denovo1776 "k__Bacteria" "p__Firmicutes" "c__Clostridia"
What happened here? We first accessed the names of taxa (OTUids) with taxa_names(ps)
which contains the names of OTUids.
taxa_names(ps)[1:5] # print first 5 names
## [1] "denovo1773" "denovo1771" "denovo1776" "denovo1777" "denovo1775"
For each of these, we ask gsub
to find a pattern "d__" and substitute it with "" (nothing).
By providing “taxa_names(ps) <-” we straight away substitute these names.
We can substitute “denovo” with just “OTU-”
# find and substitute
taxa_names(ps) <- gsub(taxa_names(ps), pattern = "denovo", replacement = "OTU-")
Next, cleaning the "k__" and similar values.
# extending gsub to entire table
tax_table(ps)[, colnames(tax_table(ps))] <- gsub(tax_table(ps)[, colnames(tax_table(ps))], pattern = "[a-z]__", replacement = "")
What happened here? By using colnames(tax_table(ps))
we are specifying names of the columns in tax_table(ps)
.
colnames(tax_table(ps))
## [1] "Domain" "Phylum" "Class" "Order" "Family" "Genus" "Species"
For columns in tax_table(ps)
we used gsub
to check all the columns for patterns ranging from [a-z]
joined by __
like this [a-z]__
and substitute it with "" i.e. nothing.
phyloseq::tax_table(ps)[1:3,1:3] # check for my table show me [1 to 3 otu ids, 1 to 3 first three ranks].
## Taxonomy Table: [3 taxa by 3 taxonomic ranks]:
## Domain Phylum Class
## OTU-1773 "Bacteria" "Bacteroidetes" "Bacteroidia"
## OTU-1771 "Bacteria" "Bacteroidetes" "Bacteroidia"
## OTU-1776 "Bacteria" "Firmicutes" "Clostridia"
There are several other special characters and labels that you should check for in your data. This is an important step, especially when aggregating/collapsing at higher taxonomic ranks.
Some examples are below. These are not detected in the test data used here. However, we have come across these in our research.
# Replace sequence IDs with ATACAACTATACG with ASV1 and so on...
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
Replace special characters.
tax_table(ps)[,colnames(tax_table(ps))] <- gsub(tax_table(ps)[,colnames(tax_table(ps))],pattern="=*",replacement="")
# remove asterisk
tax_table(ps)[,colnames(tax_table(ps))] <- gsub(tax_table(ps)[,colnames(tax_table(ps))],pattern="[*]",replacement="") # notice here the square brackets. These are required when special characters such as * and ~ below are used as they have functional role in R base programming.
# remove ~
tax_table(ps)[,colnames(tax_table(ps))] <- gsub(tax_table(ps)[,colnames(tax_table(ps))],pattern="[~]",replacement="")
Other than gsub
, we can also go through one column at a time in the taxonomy table to specifically substitute ambiguous markings.
# For each column separately we attempt to replace "k__<empty>" with "" for consistency.
tax_table(ps)[tax_table(ps) == "k__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "p__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "c__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "o__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "f__<empty>"] <- ""
tax_table(ps)[tax_table(ps) == "g__<empty>"] <- ""
# some more ambiguities
tax_table(ps)[tax_table(ps) == "o__Unknown_Order"] <- ""
tax_table(ps)[tax_table(ps) == "c__Unknown_Class"] <- ""
tax_table(ps)[tax_table(ps) == "f__Unknown_Family"] <- ""
What happened here? We asked for any pattern matching "k__
Aggregate at genus level.
ps.gen <- phyloseq::tax_glom(ps, "Genus", NArm = TRUE)
Notice the NArm = TRUE
parameter. This will remove anything that is unclassified at genus level. This choice will depend on your research goals.
Check the names.
taxa_names(ps.gen)[1:5]
## [1] "OTU-174" "OTU-1079" "OTU-363" "OTU-369" "OTU-103"
Substitute these IDs with names of genus.
taxa_names(ps.gen) <- tax_table(ps.gen)[,"Genus"]
Check if the gsub
above worked.
taxa_names(ps.gen)[1:5]
## [1] "Listeria" "Allisonella" "Enterorhabdus" "Cloacibacillus" "Clostridium_XlVb"
What happened here? Similar to our first instance of gsub
operation to replace "d__", we asked R to substitute taxa_names(ps.gen)
which are ids with corresponding genus identity accessed by this tax_table(ps.gen)[,"Genus"]
tax_table(ps.gen)[,"Genus"][1:5] # [1:5] is first five to print.
## Taxonomy Table: [5 taxa by 1 taxonomic ranks]:
## Genus
## Listeria "Listeria"
## Allisonella "Allisonella"
## Enterorhabdus "Enterorhabdus"
## Cloacibacillus "Cloacibacillus"
## Clostridium_XlVb "Clostridium_XlVb"
Print all genus names. We use the unique() function to print each name only once.
#
unique(tax_table(ps.gen)[,"Genus"] )
## Taxonomy Table: [104 taxa by 1 taxonomic ranks]:
## Genus
## Listeria "Listeria"
## Allisonella "Allisonella"
## Enterorhabdus "Enterorhabdus"
## Cloacibacillus "Cloacibacillus"
## Clostridium_XlVb "Clostridium_XlVb"
## Clostridium_XlVa "Clostridium_XlVa"
## Subdoligranulum "Subdoligranulum"
## Streptococcus "Streptococcus"
## Anaerococcus "Anaerococcus"
## Gemella "Gemella"
## Sporobacter "Sporobacter"
## Acinetobacter "Acinetobacter"
## Streptophyta "Streptophyta"
## Slackia "Slackia"
## Acidaminococcus "Acidaminococcus"
## Citrobacter "Citrobacter"
## Desulfovibrio "Desulfovibrio"
## Fusobacterium "Fusobacterium"
## Prevotella "Prevotella"
## Actinomyces "Actinomyces"
## Clostridium_XI "Clostridium_XI"
## Gemmiger "Gemmiger"
## Finegoldia "Finegoldia"
## Insolitispirillum "Insolitispirillum"
## Solobacterium "Solobacterium"
## Gordonibacter "Gordonibacter"
## Lactonifactor "Lactonifactor"
## Oscillibacter "Oscillibacter"
## Acetanaerobacterium "Acetanaerobacterium"
## Staphylococcus "Staphylococcus"
## Rothia "Rothia"
## Enterobacter "Enterobacter"
## Veillonella "Veillonella"
## Coprobacillus "Coprobacillus"
## Bilophila "Bilophila"
## Megasphaera "Megasphaera"
## Anaerosporobacter "Anaerosporobacter"
## Lactococcus "Lactococcus"
## Akkermansia "Akkermansia"
## Methanobrevibacter "Methanobrevibacter"
## Collinsella "Collinsella"
## Dialister "Dialister"
## Ruminococcus2 "Ruminococcus2"
## Dorea "Dorea"
## Marvinbryantia "Marvinbryantia"
## Blautia "Blautia"
## Bacteroides "Bacteroides"
## Escherichia/Shigella "Escherichia/Shigella"
## Roseburia "Roseburia"
## Anaerostipes "Anaerostipes"
## Holdemania "Holdemania"
## Parabacteroides "Parabacteroides"
## Faecalibacterium "Faecalibacterium"
## Clostridium_IV "Clostridium_IV"
## Eubacterium "Eubacterium"
## Barnesiella "Barnesiella"
## Deinococcus "Deinococcus"
## Neisseria "Neisseria"
## Hydrogenoanaerobacterium "Hydrogenoanaerobacterium"
## Odoribacter "Odoribacter"
## Enterococcus "Enterococcus"
## Anaerotruncus "Anaerotruncus"
## Parasutterella "Parasutterella"
## Salmonella "Salmonella"
## Butyricimonas "Butyricimonas"
## Peptoniphilus "Peptoniphilus"
## Lachnospiracea_incertae_sedis "Lachnospiracea_incertae_sedis"
## Pseudomonas "Pseudomonas"
## Coprococcus "Coprococcus"
## Leuconostoc "Leuconostoc"
## Helicobacter "Helicobacter"
## Catenibacterium "Catenibacterium"
## Anaerofustis "Anaerofustis"
## Eggerthella "Eggerthella"
## Anaerovorax "Anaerovorax"
## Parvimonas "Parvimonas"
## Hallella "Hallella"
## Anaerofilum "Anaerofilum"
## Mannheimia "Mannheimia"
## Turicibacter "Turicibacter"
## Lactobacillus "Lactobacillus"
## Butyricicoccus "Butyricicoccus"
## Clostridium_XVIII "Clostridium_XVIII"
## Herbaspirillum "Herbaspirillum"
## Pseudoflavonifractor "Pseudoflavonifractor"
## Paraprevotella "Paraprevotella"
## Ruminococcus "Ruminococcus"
## Alistipes "Alistipes"
## Bifidobacterium "Bifidobacterium"
## Sutterella "Sutterella"
## Howardella "Howardella"
## Flavonifractor "Flavonifractor"
## Mogibacterium "Mogibacterium"
## Erysipelotrichaceae_incertae_sedis "Erysipelotrichaceae_incertae_sedis"
## Rhodobacter "Rhodobacter"
## Weissella "Weissella"
## Oxalobacter "Oxalobacter"
## Clostridium_III "Clostridium_III"
## Haemophilus "Haemophilus"
## Clostridium_sensu_stricto "Clostridium_sensu_stricto"
## Porphyromonas "Porphyromonas"
## Phascolarctobacterium "Phascolarctobacterium"
## Atopobium "Atopobium"
## Hespellia "Hespellia"
There can also be spaces between names e.g Prevotella 1
which can be changed to Prevotella_1
taxa_names(ps) <- gsub(" ", ".", taxa_names(ps))
There can also be spaces between names e.g [Prevotella] copri
which can be changed to Prevotella copri
taxa_names(ps) <- gsub("\\[|\\]", "", taxa_names(ps))
There can also be spaces between names e.g Prevotella-copri
which can be changed to Prevotella.copri
taxa_names(ps) <- gsub("-", ".", taxa_names(ps))
Also check the format_to_besthit
in microbiomeutilities
R package here.
These are just the common examples we observe in our research. There can be may other challenges with taxonomic labeling. However, we hope this has given a general impression of how we can do cleaning of data in R with just gsub
and phyloseq
functions to access information from within a phyloseq
object.
For more information of gsub
and related grepl
check the base R Pattern Matching and Replacement