Microbiome data processing

Processing phyloseq objects

Instructions to manipulate microbiome data sets using tools from the phyloseq package and some extensions from the microbiome package, including subsetting, aggregating and filtering.

Load example data:

library(phyloseq)
library(microbiome)
library(knitr)
data(atlas1006)   
# Rename the example data (which is a phyloseq object)
pseq <- atlas1006

Summarizing the contents of a phyloseq object

summarize_phyloseq(pseq)

## Compositional = NO2

## 1] Min. number of reads = 19002] Max. number of reads = 288833] Total number of reads = 135465644] Average number of reads = 11769.38662033015] Median number of reads = 111717] Sparsity = 0.2090022054400866] Any OTU sum to 1 or less? NO8] Number of singletons = 09] Percent of OTUs that are singletons 
##         (i.e. exactly one read detected across all samples)010] Number of sample variables are: 10agesexnationalityDNA_extraction_methodprojectdiversitybmi_groupsubjecttimesample2

## [[1]]
## [1] "1] Min. number of reads = 1900"
## 
## [[2]]
## [1] "2] Max. number of reads = 28883"
## 
## [[3]]
## [1] "3] Total number of reads = 13546564"
## 
## [[4]]
## [1] "4] Average number of reads = 11769.3866203301"
## 
## [[5]]
## [1] "5] Median number of reads = 11171"
## 
## [[6]]
## [1] "7] Sparsity = 0.209002205440086"
## 
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
## 
## [[8]]
## [1] "8] Number of singletons = 0"
## 
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)0"
## 
## [[10]]
## [1] "10] Number of sample variables are: 10"
## 
## [[11]]
##  [1] "age"                   "sex"                   "nationality"           "DNA_extraction_method" "project"              
##  [6] "diversity"             "bmi_group"             "subject"               "time"                  "sample"

Retrieving data elements from phyloseq object

A phyloseq object contains OTU table (taxa abundances), sample metadata, taxonomy table (mapping between OTUs and higher-level taxonomic classifications), and phylogenetic tree (relations between the taxa). Some of these are optional.

Pick metadata as data.frame:

meta <- meta(pseq)

Taxonomy table:

taxonomy <- tax_table(pseq)

Abundances for taxonomic groups (‘OTU table’) as a TaxaxSamples matrix:

# Absolute abundances
otu.absolute <- abundances(pseq)

# Relative abundances
otu.relative <- abundances(pseq, "compositional")

Total read counts:

reads_sample <- readcount(pseq)
# check for first 5 samples
reads_sample[1:5]

## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 
##     7593    10148     7131    10855    12000

Add read per sample to phyloseq object metadata.

sample_data(pseq)$reads_sample <- reads_sample

# reads_sample is add to the last column in sample_data of pseq object.
head(meta(pseq)[,c("sample", "reads_sample")])

##            sample reads_sample
## Sample-1 Sample-1         7593
## Sample-2 Sample-2        10148
## Sample-3 Sample-3         7131
## Sample-4 Sample-4        10855
## Sample-5 Sample-5        12000
## Sample-6 Sample-6         7914

Melting phyloseq data for easier plotting:

df <- psmelt(pseq)
kable(head(df))

	OTU	Sample	Abundance	age	sex	nationality	DNA_extraction_method	project	diversity	bmi_group	subject	sample	reads_sample	Phylum	Family	Genus
110989	Prevotella melaninogenica et rel.	Sample-448	14961	54	female	CentralEurope	o	18	5.98	lean	448	Sample-448	26546	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
111380	Prevotella melaninogenica et rel.	Sample-360	14296	45	female	CentralEurope	o	13	5.49	severeobese	360	Sample-360	21086	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
111232	Prevotella melaninogenica et rel.	Sample-190	13676	34	female	CentralEurope	r	7	6.06	lean	190	Sample-190	23954	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
111553	Prevotella melaninogenica et rel.	Sample-743	13509	52	male	US	NA	19	5.21	obese	743	Sample-743	20632	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
110590	Prevotella melaninogenica et rel.	Sample-366	13490	52	female	CentralEurope	o	15	5.63	obese	366	Sample-366	19651	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
111029	Prevotella melaninogenica et rel.	Sample-375	13384	45	female	CentralEurope	o	16	5.64	severeobese	375	Sample-375	21408	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.

Sample operations

Sample names and variables

head(sample_names(pseq))

## [1] "Sample-1" "Sample-2" "Sample-3" "Sample-4" "Sample-5" "Sample-6"

Total OTU abundance in each sample

s <- sample_sums(pseq)

Abundance of a given species in each sample

head(abundances(pseq)["Akkermansia",])

## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 Sample-6 
##       21       36      475       61       34       14

Select a subset by metadata fields:

pseq.subset <- subset_samples(pseq, nationality == "AFR")

Select a subset by providing sample names:

# Check sample names for African Females in this phyloseq object
s <- rownames(subset(meta(pseq), nationality == "AFR" & sex == "Female"))

# Pick the phyloseq subset with these sample names
pseq.subset2 <- prune_samples(s, pseq)

Pick samples at the baseline time points only:

pseq0 <- baseline(pseq)

Data transformations

The microbiome package provides a wrapper for standard sample/OTU transforms. For arbitrary transforms, use the transform_sample_counts function in the phyloseq package. Log10 transform is log(1+x) if the data contains zeroes. Also “Z,” “clr,” “hellinger,” and “shift” are available as common transformations. Relative abundances (note that the input data needs to be in absolute scale, not logarithmic!):

pseq.compositional <- microbiome::transform(pseq, "compositional")

The CLR (“clr”) transformation is also available, and comes with a pseudocount to avoid zeroes. An alternative method is to impute the zero-inflated unobserved values. Sometimes a multiplicative Kaplan-Meier smoothing spline (KMSS) replacement, multiplicative lognormal replacement, or multiplicative simple replacement are used. These are available in the zCompositions R package (functions multKM, multLN, and multRepl, respectively). Use at least n.draws = 1000 in practice, here less used to speed up example.

data(dietswap)
x <- dietswap
# Compositional data 
x2 <- microbiome::transform(x, "compositional")

Variable operations

Sample variable names

sample_variables(pseq)

##  [1] "age"                   "sex"                   "nationality"           "DNA_extraction_method" "project"              
##  [6] "diversity"             "bmi_group"             "subject"               "time"                  "sample"               
## [11] "reads_sample"

Pick values for a given variable

head(get_variable(pseq, sample_variables(pseq)[1]))

## [1] 28 24 52 22 25 42

Assign new fields to metadata

# Calculate diversity for samples
div <- microbiome::alpha(pseq, index = "shannon")

# Assign the estimated diversity to sample metadata
sample_data(pseq)$diversity <- div

Taxa operations

Number of taxa

n <- ntaxa(pseq)

Most abundant taxa

topx <- top_taxa(pseq, n = 10)

Names

ranks <- rank_names(pseq)  # Taxonomic levels
taxa  <- taxa(pseq)        # Taxa names at the analysed level

Subset taxa

pseq.bac <- subset_taxa(pseq, Phylum == "Bacteroidetes")

Prune (select) taxa:

# List of Genera in the Bacteroideted Phylum
taxa <- map_levels(NULL, "Phylum", "Genus", pseq)$Bacteroidetes

# With given taxon names
ex2 <- prune_taxa(taxa, pseq)

# Taxa with positive sum across samples
ex3 <- prune_taxa(taxa_sums(pseq) > 0, pseq)

Filter by user-specified function values (here variance):

f <- filter_taxa(pseq, function(x) var(x) > 1e-05, TRUE)

List unique phylum-level groups:

head(get_taxa_unique(pseq, "Phylum"))

## [1] "Actinobacteria"  "Firmicutes"      "Proteobacteria"  "Verrucomicrobia" "Bacteroidetes"   "Spirochaetes"

Pick the taxa abundances for a given sample:

samplename <- sample_names(pseq)[[1]]

# Pick abundances for a particular taxon
tax.abundances <- abundances(pseq)[, samplename]

Get

library(jeevanuDB)

# Example data
data("moving_pictures")
# Rename
ps <- moving_pictures
taxa_names(ps)[1:2]

## [1] "TACGGAGGATACGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGTGGCTGATTAAGTCAGTGGTGAAAAGCTATGGCTCAACCATAGTCTTGCCGTTGATACTGGTTAGCTTGAGTGTAGATGATGTAGGCGGAATGCGTAG"
## [2] "TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGAGCGTAGGTGGCTTGATAAGTCAGATGTGAAAGCCCCGGGCTTAACCTGGGAACGGCATCTGATACTGTTAGGCTAGAGTAGGTGAGAGGAAGGTAGAATTCCAGG"

The taxa names are unique sequences. This common when using dada2 for denoising and identifying amplicon sequence variants.

print(ps)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3644 taxa and 1966 samples ]
## sample_data() Sample Data:       [ 1966 samples by 89 sample variables ]
## tax_table()   Taxonomy Table:    [ 3644 taxa by 7 taxonomic ranks ]

It would be useful to store these inside the refseq() slot of phyloseq. For this we can use microbiome::add_refseq

ps <- add_refseq(ps, tag = "ASV")
# check again
taxa_names(ps)[1:2]

## [1] "ASV1" "ASV2"

print(ps)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3644 taxa and 1966 samples ]
## sample_data() Sample Data:       [ 1966 samples by 89 sample variables ]
## tax_table()   Taxonomy Table:    [ 3644 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 3644 reference sequences ]

As can be seen, the sequence ids are converted to ASV ids and sequences stored within the refseq() slot.

Labelling ASVs with best taxonomy

Converting or amending the ASV ids with their best/lowest taxonomic classification can be very handy. This can be achieved with microbiome::add_besthit()

ps <- add_besthit(ps)
taxa_names(ps)[1:2]

## [1] "ASV1:g__Porphyromonas.s__"         "ASV2:g__Psychrobacter.s__pulmonis"

If required, the names can be cleaned as follows.

taxa_names(ps) <- gsub("[a-x]__", "",taxa_names(ps))
taxa_names(ps)[1:2]

## [1] "ASV1:Porphyromonas."         "ASV2:Psychrobacter.pulmonis"

Merging operations

Aggregate taxa to higher taxonomic levels. This is particularly useful if the phylogenetic tree is missing. When it is available, see merge_samples, merge_taxa and tax_glom).

pseq2 <- aggregate_taxa(pseq, "Phylum")

Merge the desired taxa into “Other” category. Here, we merge all Bacteroides groups into a single group named Bacteroides.

pseq2 <- merge_taxa2(pseq, pattern = "^Bacteroides", name = "Bacteroides")

Merging phyloseq objects with the phyloseq package:

merge_phyloseq(pseqA, pseqB)

Joining otu/asv table and taxonomy in one data frame

library(dplyr) 
library(microbiome)
data("atlas1006") # example data from microbiome pkg
otu.tax.tibble <- combine_otu_tax(atlas1006, column.id = "ASVID")
# view first fist rows and columns
otu.tax.tibble[1:5,1:5]

## # A tibble: 5 x 5
##   ASVID                        Phylum          Family          Genus                        `Sample-1`
##   <chr>                        <chr>           <chr>           <chr>                             <dbl>
## 1 Actinomycetaceae             Actinobacteria  Actinobacteria  Actinomycetaceae                      0
## 2 Aerococcus                   Firmicutes      Bacilli         Aerococcus                            0
## 3 Aeromonas                    Proteobacteria  Proteobacteria  Aeromonas                             0
## 4 Akkermansia                  Verrucomicrobia Verrucomicrobia Akkermansia                          21
## 5 Alcaligenes faecalis et rel. Proteobacteria  Proteobacteria  Alcaligenes faecalis et rel.          1

Get tibbles

In many cases, user may want to extract slots from phyloseq as tibbles for custom data processing.
This can be achieved with following functions.

otu.tib <- otu_tibble(atlas1006, column.id = "FeatureID")

tax.tib <- tax_tibble(atlas1006, column.id = "FeatureID")

sample.tib <- sample_tibble(atlas1006, column.id = "SampleID")

Convert to tidy long format

As an alternative to phyloseq::psmelt, we provide psmelt2 which returns a tibble in long format.

ps.long.tib <- psmelt2(pseq)
head(ps.long.tib)

## # A tibble: 6 x 17
##   FeatureID  .SampleID value Phylum  Family  Genus    age sex   nationality DNA_extraction_~ project diversity$diver~ bmi_group subject
##   <chr>      <chr>     <dbl> <chr>   <chr>   <chr>  <int> <fct> <fct>       <fct>            <fct>              <dbl> <fct>     <fct>  
## 1 Actinomyc~ Sample-1      0 Actino~ Actino~ Actin~    28 male  US          <NA>             1                   3.19 severeob~ 1      
## 2 Actinomyc~ Sample-2      0 Actino~ Actino~ Actin~    24 fema~ US          <NA>             1                   3.39 obese     2      
## 3 Actinomyc~ Sample-3      0 Actino~ Actino~ Actin~    52 male  US          <NA>             1                   2.86 lean      3      
## 4 Actinomyc~ Sample-4      0 Actino~ Actino~ Actin~    22 fema~ US          <NA>             1                   3.06 underwei~ 4      
## 5 Actinomyc~ Sample-5      0 Actino~ Actino~ Actin~    25 fema~ US          <NA>             1                   3.07 lean      5      
## 6 Actinomyc~ Sample-6      0 Actino~ Actino~ Actin~    42 male  US          <NA>             1                   2.94 lean      6      
## # ... with 3 more variables: time <dbl>, sample <chr>, reads_sample <dbl>

Rarefaction

pseq.rarified <- rarefy_even_depth(pseq)

Taxonomy

Convert between taxonomic levels (here from Genus (Akkermansia) to Phylum (Verrucomicrobia):

m <- map_levels("Akkermansia", "Genus", "Phylum", tax_table(pseq))
print(m)

## [1] "Verrucomicrobia"

Metadata

Visualize frequencies of given factor (sex) levels within the indicated groups (group):

p <- plot_frequencies(sample_data(pseq), "bmi_group", "sex")
print(p)

# Retrieving the actual data values:
# kable(head(p@data), digits = 2)

Custom functions are provided to cut age or BMI information into discrete classes.

group_bmi(c(22, 28, 31), "standard")

## [1] lean       overweight obese     
## Levels: underweight lean overweight obese severe morbid super

group_age(c(17, 41, 102), "decades")

## [1] [10,20)   [40,50)   [100,110]
## Levels: [10,20) [20,30) [30,40) [40,50) [50,60) [60,70) [70,80) [80,90) [90,100) [100,110]

Add metadata to a phyloseq object. For reproducibility, we just use the existing metadata in this example, but this can be replaced by another data.frame (samples x fields).

# Example data
data(dietswap)
pseq <- dietswap

# Pick the existing metadata from a phyloseq object
# (or retrieve this from another source)
df <- meta(pseq)

# Merge the metadata back in the phyloseq object
pseq2 <- merge_phyloseq(pseq, sample_data(df))