Processing phyloseq objects
Instructions to manipulate microbiome data sets using tools from the phyloseq package and some extensions from the microbiome package, including subsetting, aggregating and filtering.
Load example data:
library(phyloseq)
library(microbiome)
library(knitr)
data(atlas1006)
# Rename the example data (which is a phyloseq object)
pseq <- atlas1006
summarize_phyloseq(pseq)
## Compositional = NO2
## 1] Min. number of reads = 19002] Max. number of reads = 288833] Total number of reads = 135465644] Average number of reads = 11769.38662033015] Median number of reads = 111717] Sparsity = 0.2090022054400866] Any OTU sum to 1 or less? NO8] Number of singletons = 09] Percent of OTUs that are singletons
## (i.e. exactly one read detected across all samples)010] Number of sample variables are: 10agesexnationalityDNA_extraction_methodprojectdiversitybmi_groupsubjecttimesample2
## [[1]]
## [1] "1] Min. number of reads = 1900"
##
## [[2]]
## [1] "2] Max. number of reads = 28883"
##
## [[3]]
## [1] "3] Total number of reads = 13546564"
##
## [[4]]
## [1] "4] Average number of reads = 11769.3866203301"
##
## [[5]]
## [1] "5] Median number of reads = 11171"
##
## [[6]]
## [1] "7] Sparsity = 0.209002205440086"
##
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
##
## [[8]]
## [1] "8] Number of singletons = 0"
##
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n (i.e. exactly one read detected across all samples)0"
##
## [[10]]
## [1] "10] Number of sample variables are: 10"
##
## [[11]]
## [1] "age" "sex" "nationality" "DNA_extraction_method" "project"
## [6] "diversity" "bmi_group" "subject" "time" "sample"
A phyloseq object contains OTU table (taxa abundances), sample metadata, taxonomy table (mapping between OTUs and higher-level taxonomic classifications), and phylogenetic tree (relations between the taxa). Some of these are optional.
Pick metadata as data.frame:
meta <- meta(pseq)
Taxonomy table:
taxonomy <- tax_table(pseq)
Abundances for taxonomic groups (‘OTU table’) as a TaxaxSamples matrix:
# Absolute abundances
otu.absolute <- abundances(pseq)
# Relative abundances
otu.relative <- abundances(pseq, "compositional")
Total read counts:
reads_sample <- readcount(pseq)
# check for first 5 samples
reads_sample[1:5]
## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5
## 7593 10148 7131 10855 12000
Add read per sample to phyloseq object metadata.
sample_data(pseq)$reads_sample <- reads_sample
# reads_sample is add to the last column in sample_data of pseq object.
head(meta(pseq)[,c("sample", "reads_sample")])
## sample reads_sample
## Sample-1 Sample-1 7593
## Sample-2 Sample-2 10148
## Sample-3 Sample-3 7131
## Sample-4 Sample-4 10855
## Sample-5 Sample-5 12000
## Sample-6 Sample-6 7914
Melting phyloseq data for easier plotting:
df <- psmelt(pseq)
kable(head(df))
OTU | Sample | Abundance | age | sex | nationality | DNA_extraction_method | project | diversity | bmi_group | subject | time | sample | reads_sample | Phylum | Family | Genus | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
110989 | Prevotella melaninogenica et rel. | Sample-448 | 14961 | 54 | female | CentralEurope | o | 18 | 5.98 | lean | 448 | 0 | Sample-448 | 26546 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
111380 | Prevotella melaninogenica et rel. | Sample-360 | 14296 | 45 | female | CentralEurope | o | 13 | 5.49 | severeobese | 360 | 0 | Sample-360 | 21086 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
111232 | Prevotella melaninogenica et rel. | Sample-190 | 13676 | 34 | female | CentralEurope | r | 7 | 6.06 | lean | 190 | 0 | Sample-190 | 23954 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
111553 | Prevotella melaninogenica et rel. | Sample-743 | 13509 | 52 | male | US | NA | 19 | 5.21 | obese | 743 | 0 | Sample-743 | 20632 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
110590 | Prevotella melaninogenica et rel. | Sample-366 | 13490 | 52 | female | CentralEurope | o | 15 | 5.63 | obese | 366 | 0 | Sample-366 | 19651 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
111029 | Prevotella melaninogenica et rel. | Sample-375 | 13384 | 45 | female | CentralEurope | o | 16 | 5.64 | severeobese | 375 | 0 | Sample-375 | 21408 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
Sample names and variables
head(sample_names(pseq))
## [1] "Sample-1" "Sample-2" "Sample-3" "Sample-4" "Sample-5" "Sample-6"
Total OTU abundance in each sample
s <- sample_sums(pseq)
Abundance of a given species in each sample
head(abundances(pseq)["Akkermansia",])
## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 Sample-6
## 21 36 475 61 34 14
Select a subset by metadata fields:
pseq.subset <- subset_samples(pseq, nationality == "AFR")
Select a subset by providing sample names:
# Check sample names for African Females in this phyloseq object
s <- rownames(subset(meta(pseq), nationality == "AFR" & sex == "Female"))
# Pick the phyloseq subset with these sample names
pseq.subset2 <- prune_samples(s, pseq)
Pick samples at the baseline time points only:
pseq0 <- baseline(pseq)
The microbiome package provides a wrapper for standard sample/OTU transforms. For arbitrary transforms, use the transform_sample_counts function in the phyloseq package. Log10 transform is log(1+x) if the data contains zeroes. Also “Z,” “clr,” “hellinger,” and “shift” are available as common transformations. Relative abundances (note that the input data needs to be in absolute scale, not logarithmic!):
pseq.compositional <- microbiome::transform(pseq, "compositional")
The CLR (“clr”) transformation is also available, and comes with a pseudocount to avoid zeroes. An alternative method is to impute the zero-inflated unobserved values. Sometimes a multiplicative Kaplan-Meier smoothing spline (KMSS) replacement, multiplicative lognormal replacement, or multiplicative simple replacement are used. These are available in the zCompositions R package (functions multKM, multLN, and multRepl, respectively). Use at least n.draws = 1000 in practice, here less used to speed up example.
data(dietswap)
x <- dietswap
# Compositional data
x2 <- microbiome::transform(x, "compositional")
Sample variable names
sample_variables(pseq)
## [1] "age" "sex" "nationality" "DNA_extraction_method" "project"
## [6] "diversity" "bmi_group" "subject" "time" "sample"
## [11] "reads_sample"
Pick values for a given variable
head(get_variable(pseq, sample_variables(pseq)[1]))
## [1] 28 24 52 22 25 42
Assign new fields to metadata
# Calculate diversity for samples
div <- microbiome::alpha(pseq, index = "shannon")
# Assign the estimated diversity to sample metadata
sample_data(pseq)$diversity <- div
Number of taxa
n <- ntaxa(pseq)
Most abundant taxa
topx <- top_taxa(pseq, n = 10)
Names
ranks <- rank_names(pseq) # Taxonomic levels
taxa <- taxa(pseq) # Taxa names at the analysed level
Subset taxa
pseq.bac <- subset_taxa(pseq, Phylum == "Bacteroidetes")
Prune (select) taxa:
# List of Genera in the Bacteroideted Phylum
taxa <- map_levels(NULL, "Phylum", "Genus", pseq)$Bacteroidetes
# With given taxon names
ex2 <- prune_taxa(taxa, pseq)
# Taxa with positive sum across samples
ex3 <- prune_taxa(taxa_sums(pseq) > 0, pseq)
Filter by user-specified function values (here variance):
f <- filter_taxa(pseq, function(x) var(x) > 1e-05, TRUE)
List unique phylum-level groups:
head(get_taxa_unique(pseq, "Phylum"))
## [1] "Actinobacteria" "Firmicutes" "Proteobacteria" "Verrucomicrobia" "Bacteroidetes" "Spirochaetes"
Pick the taxa abundances for a given sample:
samplename <- sample_names(pseq)[[1]]
# Pick abundances for a particular taxon
tax.abundances <- abundances(pseq)[, samplename]
library(jeevanuDB)
# Example data
data("moving_pictures")
# Rename
ps <- moving_pictures
taxa_names(ps)[1:2]
## [1] "TACGGAGGATACGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGTGGCTGATTAAGTCAGTGGTGAAAAGCTATGGCTCAACCATAGTCTTGCCGTTGATACTGGTTAGCTTGAGTGTAGATGATGTAGGCGGAATGCGTAG"
## [2] "TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGAGCGTAGGTGGCTTGATAAGTCAGATGTGAAAGCCCCGGGCTTAACCTGGGAACGGCATCTGATACTGTTAGGCTAGAGTAGGTGAGAGGAAGGTAGAATTCCAGG"
The taxa names are unique sequences. This common when using dada2
for denoising and identifying amplicon sequence variants.
print(ps)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3644 taxa and 1966 samples ]
## sample_data() Sample Data: [ 1966 samples by 89 sample variables ]
## tax_table() Taxonomy Table: [ 3644 taxa by 7 taxonomic ranks ]
It would be useful to store these inside the refseq()
slot of phyloseq. For this we can use microbiome::add_refseq
ps <- add_refseq(ps, tag = "ASV")
# check again
taxa_names(ps)[1:2]
## [1] "ASV1" "ASV2"
print(ps)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3644 taxa and 1966 samples ]
## sample_data() Sample Data: [ 1966 samples by 89 sample variables ]
## tax_table() Taxonomy Table: [ 3644 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 3644 reference sequences ]
As can be seen, the sequence ids are converted to ASV ids and sequences stored within the refseq()
slot.
Converting or amending the ASV ids with their best/lowest taxonomic classification can be very handy. This can be achieved with microbiome::add_besthit()
ps <- add_besthit(ps)
taxa_names(ps)[1:2]
## [1] "ASV1:g__Porphyromonas.s__" "ASV2:g__Psychrobacter.s__pulmonis"
If required, the names can be cleaned as follows.
taxa_names(ps) <- gsub("[a-x]__", "",taxa_names(ps))
taxa_names(ps)[1:2]
## [1] "ASV1:Porphyromonas." "ASV2:Psychrobacter.pulmonis"
Aggregate taxa to higher taxonomic levels. This is particularly useful if the phylogenetic tree is missing. When it is available, see merge_samples, merge_taxa and tax_glom).
pseq2 <- aggregate_taxa(pseq, "Phylum")
Merge the desired taxa into “Other” category. Here, we merge all Bacteroides groups into a single group named Bacteroides.
pseq2 <- merge_taxa2(pseq, pattern = "^Bacteroides", name = "Bacteroides")
Merging phyloseq objects with the phyloseq package:
merge_phyloseq(pseqA, pseqB)
Joining otu/asv table and taxonomy in one data frame
library(dplyr)
library(microbiome)
data("atlas1006") # example data from microbiome pkg
otu.tax.tibble <- combine_otu_tax(atlas1006, column.id = "ASVID")
# view first fist rows and columns
otu.tax.tibble[1:5,1:5]
## # A tibble: 5 x 5
## ASVID Phylum Family Genus `Sample-1`
## <chr> <chr> <chr> <chr> <dbl>
## 1 Actinomycetaceae Actinobacteria Actinobacteria Actinomycetaceae 0
## 2 Aerococcus Firmicutes Bacilli Aerococcus 0
## 3 Aeromonas Proteobacteria Proteobacteria Aeromonas 0
## 4 Akkermansia Verrucomicrobia Verrucomicrobia Akkermansia 21
## 5 Alcaligenes faecalis et rel. Proteobacteria Proteobacteria Alcaligenes faecalis et rel. 1
In many cases, user may want to extract slots from phyloseq as tibbles for custom data processing.
This can be achieved with following functions.
otu.tib <- otu_tibble(atlas1006, column.id = "FeatureID")
tax.tib <- tax_tibble(atlas1006, column.id = "FeatureID")
sample.tib <- sample_tibble(atlas1006, column.id = "SampleID")
As an alternative to phyloseq::psmelt
, we provide psmelt2
which returns a tibble in long format.
ps.long.tib <- psmelt2(pseq)
head(ps.long.tib)
## # A tibble: 6 x 17
## FeatureID .SampleID value Phylum Family Genus age sex nationality DNA_extraction_~ project diversity$diver~ bmi_group subject
## <chr> <chr> <dbl> <chr> <chr> <chr> <int> <fct> <fct> <fct> <fct> <dbl> <fct> <fct>
## 1 Actinomyc~ Sample-1 0 Actino~ Actino~ Actin~ 28 male US <NA> 1 3.19 severeob~ 1
## 2 Actinomyc~ Sample-2 0 Actino~ Actino~ Actin~ 24 fema~ US <NA> 1 3.39 obese 2
## 3 Actinomyc~ Sample-3 0 Actino~ Actino~ Actin~ 52 male US <NA> 1 2.86 lean 3
## 4 Actinomyc~ Sample-4 0 Actino~ Actino~ Actin~ 22 fema~ US <NA> 1 3.06 underwei~ 4
## 5 Actinomyc~ Sample-5 0 Actino~ Actino~ Actin~ 25 fema~ US <NA> 1 3.07 lean 5
## 6 Actinomyc~ Sample-6 0 Actino~ Actino~ Actin~ 42 male US <NA> 1 2.94 lean 6
## # ... with 3 more variables: time <dbl>, sample <chr>, reads_sample <dbl>
pseq.rarified <- rarefy_even_depth(pseq)
Convert between taxonomic levels (here from Genus (Akkermansia) to Phylum (Verrucomicrobia):
m <- map_levels("Akkermansia", "Genus", "Phylum", tax_table(pseq))
print(m)
## [1] "Verrucomicrobia"
Visualize frequencies of given factor (sex) levels within the indicated groups (group):
p <- plot_frequencies(sample_data(pseq), "bmi_group", "sex")
print(p)
# Retrieving the actual data values:
# kable(head(p@data), digits = 2)
Custom functions are provided to cut age or BMI information into discrete classes.
group_bmi(c(22, 28, 31), "standard")
## [1] lean overweight obese
## Levels: underweight lean overweight obese severe morbid super
group_age(c(17, 41, 102), "decades")
## [1] [10,20) [40,50) [100,110]
## Levels: [10,20) [20,30) [30,40) [40,50) [50,60) [60,70) [70,80) [80,90) [90,100) [100,110]
Add metadata to a phyloseq object. For reproducibility, we just use the existing metadata in this example, but this can be replaced by another data.frame (samples x fields).
# Example data
data(dietswap)
pseq <- dietswap
# Pick the existing metadata from a phyloseq object
# (or retrieve this from another source)
df <- meta(pseq)
# Merge the metadata back in the phyloseq object
pseq2 <- merge_phyloseq(pseq, sample_data(df))