5 Importing microbiome data
This section demonstrates how to import taxonomic profiling data in R.
5.1 Data access
ADHD-associated changes in gut microbiota and brain in a mouse model
Tengeler AC et al. (2020) Gut microbiota from persons with attention-deficit/hyperactivity disorder affects the brain in mice. Microbiome 8:44.
In this study, mice are colonized with microbiota from participants with ADHD (attention deficit hyperactivity disorder) and healthy participants. The aim of the study was to assess whether the mice display ADHD behaviors after being inoculated with ADHD microbiota, suggesting a role of the microbiome in ADHD pathology.
Download the data from data subfolder.
5.2 Importing microbiome data in R
Import example data by modifying the examples in the online book section on data exploration and manipulation.
The data files in our example are in biom container, which is a standard file format for microbiome data. Other file formats exist as well, and import details vary by platform. Here, we import biom data files into a specific data container (structure) in R, TreeSummarizedExperiment (TreeSE) Huang et al. (2020).
The data container provides the basis for downstream data analysis in the miaverse data science framework.
Figure sources:
Original article - Huang R et al. (2021) TreeSummarizedExperiment: a S4 class for data with hierarchical structure. F1000Research 9:1246.
Reference Sequence slot extension - Lahti L et al. (2020) Upgrading the R/Bioconductor ecosystem for microbiome research F1000Research 9:1464 (slides).
5.2.1 Example solution
Let us first import the biom file into R / TreeSE container.
# Load the mia R package
library(mia)
# Defining file paths
## Biom file (taxonomic profiles)
<- "data/Tengeler2020/Aggregated_humanization2.biom"
biom_file_path
# Import the data into SummarizedExperiment container
<- loadFromBiom(biom_file_path)
se
# Convert this data to TreeSE container (no direct importer exists)
<- as(se, "TreeSummarizedExperiment") tse
Check and clean up rowData (information on the taxonomic features).
# Investigate the rowData of this data object
print(head(rowData(tse)))
## DataFrame with 6 rows and 6 columns
## taxonomy1 taxonomy2 taxonomy3
## <character> <character> <character>
## 1726470 "k__Bacteria p__Bacteroidetes c__Bacteroidia
## 1726471 "k__Bacteria p__Bacteroidetes c__Bacteroidia
## 17264731 "k__Bacteria p__Bacteroidetes c__Bacteroidia
## 17264726 "k__Bacteria p__Bacteroidetes c__Bacteroidia
## 1726472 "k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae
## 17264724 "k__Bacteria p__Bacteroidetes c__Bacteroidia
## taxonomy4 taxonomy5 taxonomy6
## <character> <character> <character>
## 1726470 o__Bacteroidales f__Bacteroidaceae g__Bacteroides"
## 1726471 o__Bacteroidales f__Bacteroidaceae g__Bacteroides"
## 17264731 o__Bacteroidales f__Porphyromonadaceae g__Parabacteroides"
## 17264726 o__Bacteroidales f__Bacteroidaceae g__Bacteroides"
## 1726472 o__Verrucomicrobiales f__Verrucomicrobiaceae g__Akkermansia"
## 17264724 o__Bacteroidales f__Bacteroidaceae g__Bacteroides"
# We notice that the rowData fields do not have descriptibve names.
# Hence, let us rename the columns in rowData
names(rowData(tse)) <- c("Kingdom", "Phylum", "Class", "Order",
"Family", "Genus")
# We also notice that the taxa names are of form "c__Bacteroidia" etc.
# Goes through the whole DataFrame. Removes '.*[kpcofg]__' from strings, where [kpcofg]
# is any character from listed ones, and .* any character.
<- BiocParallel::bplapply(rowData(tse),
rowdata_modified FUN = stringr::str_remove,
pattern = '.*[kpcofg]__')
# Genus level has additional '\"', so let's delete that also
<- BiocParallel::bplapply(rowdata_modified,
rowdata_modified FUN = stringr::str_remove,
pattern = '\"')
# rowdata_modified is a list, so convert this back to DataFrame format.
# and assign the cleaned data back to the TSE rowData
rowData(tse) <- DataFrame(rowdata_modified)
# Recheck rowData after the modifications
print(head(rowData(tse)))
## DataFrame with 6 rows and 6 columns
## Kingdom Phylum Class Order
## <character> <character> <character> <character>
## 1726470 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 1726471 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 17264731 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 17264726 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 1726472 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales
## 17264724 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## Family Genus
## <character> <character>
## 1726470 Bacteroidaceae Bacteroides
## 1726471 Bacteroidaceae Bacteroides
## 17264731 Porphyromonadaceae Parabacteroides
## 17264726 Bacteroidaceae Bacteroides
## 1726472 Verrucomicrobiaceae Akkermansia
## 17264724 Bacteroidaceae Bacteroides
Next let us add sample information (colData) to our TreeSE object.
## Sample phenodata
<- "data/Tengeler2020/Mapping_file_ADHD_aggregated.csv"
sample_meta_file_path
# Check what type of data it is
# read.table(sample_meta_file_path)
# It seems like a comma separated file and it does not include headers
# Let us read the file; note that sample names are in the first column
<- read.table(sample_meta_file_path, sep=",", header=FALSE, row.names=1)
sample_meta
# Check the data
print(head(sample_meta))
## V2 V3 V4 V5
## A110 ADHD Cohort_1 ADHD_Cohort_1 A110
## A12 ADHD Cohort_1 ADHD_Cohort_1 A12
## A15 ADHD Cohort_1 ADHD_Cohort_1 A15
## A19 ADHD Cohort_1 ADHD_Cohort_1 A19
## A21 ADHD Cohort_2 ADHD_Cohort_2 A21
## A23 ADHD Cohort_2 ADHD_Cohort_2 A23
# Add headers for the columns (as they seem to be missing)
colnames(sample_meta) <- c("patient_status", "cohort",
"patient_status_vs_cohort", "sample_name")
# Add this sample data to colData of the taxonomic data object
# Note that the data must be given in a DataFrame format (required for our purposes)
colData(tse) <- DataFrame(sample_meta)
# Check the colData after modifications
print(head(colData(tse)))
## DataFrame with 6 rows and 4 columns
## patient_status cohort patient_status_vs_cohort sample_name
## <character> <character> <character> <character>
## A110 ADHD Cohort_1 ADHD_Cohort_1 A110
## A12 ADHD Cohort_1 ADHD_Cohort_1 A12
## A15 ADHD Cohort_1 ADHD_Cohort_1 A15
## A19 ADHD Cohort_1 ADHD_Cohort_1 A19
## A21 ADHD Cohort_2 ADHD_Cohort_2 A21
## A23 ADHD Cohort_2 ADHD_Cohort_2 A23
Add phylogenetic tree (rowTree).
## Phylogenetic tree
<- "data/Tengeler2020/Data_humanization_phylo_aggregation.tre"
tree_file_path
# Read the tree file
<- ape::read.tree(tree_file_path)
tree
# Add tree to rowTree
rowTree(tse) <- tree
We can save the data object as follows.
saveRDS(tse, file="tse.rds")
5.2.2 Loading readily processed data
Alternatively, you can just load the already processed data set.
By using this readily processed data set we can skip the data import step, and assume that the data has already been appropriately preprocessed and available in the TreeSE container.
<- readRDS("data/Tengeler2020/tse.rds") tse
In addition to our example data, further demonstration data sets are available in the TreeSE data container: see OMA demo data sets.
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## Loading required package: SingleCellExperiment
## Loading required package: TreeSummarizedExperiment
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: MultiAssayExperiment
## Loading required package: ggplot2
## Loading required package: ggraph
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Biostrings':
##
## collapse, intersect, setdiff, setequal, union
## The following object is masked from 'package:XVector':
##
## slice
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:GenomicRanges':
##
## intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
##
## intersect
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following object is masked from 'package:matrixStats':
##
## count
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union