4 Import
This chapter demonstrates how data can be imported into data containers from files. Additionally, multiple databases offer curated microbiome data, which are introduced in Section 4.2.
In prior to importing the data to container, the raw sequences must be mapped to abundances. The specific method depends on sequencing type and other data-specific factors. There are several resources available to guide this process Metagenomics wiki being one of them. In Section C.2 we provide an example workflow for processing 16S rRNA sequences into a data container using DADA2.
4.1 Import microbiome data from files
The data containers can be constructed from scratch. For most common microbiome data formats, there is dedicated importers available which streamlines the importing.
4.1.1 Standard data formats
Specific import functions are provided for:
- BIOM files (see
help(mia::importBIOM)
) - QIIME2 files (see
help(mia::importQIIME2)
) - Mothur files (see
help(mia::importMothur)
) - MetaPhlAn files (see
help(mia::importMetaPhlAn)
) - HUMAnN files (see
help(mia::importHUMAnN)
) - taxpasta files (see
help(mia::importTaxpasta)
)
Here we show how Biom files are imported into a TreeSE
object using as an example Tengeler2020, which is further described here. This dataset consists of 3 files, which can be fetched or downloaded from this repository:
- biom file: abundance table and taxonomy information
- csv file: sample metadata
- tree file: phylogenetic tree
To begin with, we store the data in a local directory within the working directory, such as data/, and define the source file paths.
biom_file_path <- system.file(
"extdata", "Aggregated_humanization2.biom", package = "OMA")
sample_meta_file_path <- system.file(
"extdata", "Mapping_file_ADHD_aggregated.csv", package = "OMA")
tree_file_path <- system.file(
"extdata", "Data_humanization_phylo_aggregation.tre", package = "OMA")
Now we can read in the biom file and convert it into a TreeSE
object. In addition, we retrieve the rank names from the prefixes of the taxa names and then remove them with the rank.from.prefix
and prefix.rm
optional arguments.
library(mia)
# read biom and convert it to TreeSE
tse <- importBIOM(
biom_file_path,
rank.from.prefix = TRUE,
prefix.rm = TRUE,
artifact.rm = TRUE)
# Check
tse
## class: TreeSummarizedExperiment
## dim: 151 27
## metadata(0):
## assays(1): counts
## rownames(151): 1726470 1726471 ... 17264756 17264757
## rowData names(6): kingdom phylum ... family genus
## colnames(27): A110 A111 ... A38 A39
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
The assays
slot includes a list of abundance tables. The imported abundance table is named as “counts”. Let us inspect only the first cols and rows.
assay(tse, "counts")[1:3, 1:3]
## A110 A111 A12
## 1726470 17722 11630 0
## 1726471 12052 0 2679
## 17264731 0 970 0
The rowdata
includes taxonomic information from the biom file. The head()
command shows just the beginning of the data table for an overview.
knitr::kable()
helps print the information more nicely.
rowData(tse) |> head()
## DataFrame with 6 rows and 6 columns
## kingdom phylum class order
## <character> <character> <character> <character>
## 1726470 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 1726471 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 17264731 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 17264726 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## 1726472 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales
## 17264724 Bacteria Bacteroidetes Bacteroidia Bacteroidales
## family genus
## <character> <character>
## 1726470 Bacteroidaceae Bacteroides
## 1726471 Bacteroidaceae Bacteroides
## 17264731 Porphyromonadaceae Parabacteroides
## 17264726 Bacteroidaceae Bacteroides
## 1726472 Verrucomicrobiaceae Akkermansia
## 17264724 Bacteroidaceae Bacteroides
We notice that the imported biom file did not contain any colData
yet, so only an empty dataframe appears in this slot.
Let us add colData
from the sample metadata, which is stored in a CSV file.
# CSV file with colnames in the first row and rownames in the first column
sample_meta <- read.csv(
sample_meta_file_path, sep = ",", row.names = 1)
# Add this sample data to colData of the taxonomic data object
# Note that the samples in the sample data must be in the same order as
# in the original biom file and that data must be given in a DataFrame format
colData(tse) <- DataFrame(sample_meta)
Now the colData
includes the sample metadata.
colData(tse) |> head()
## DataFrame with 6 rows and 4 columns
## Treatment Cohort TreatmentxCohort Description
## <character> <character> <character> <character>
## A110 ADHD Cohort_1 ADHD_Cohort_1 A110
## A12 ADHD Cohort_1 ADHD_Cohort_1 A12
## A15 ADHD Cohort_1 ADHD_Cohort_1 A15
## A19 ADHD Cohort_1 ADHD_Cohort_1 A19
## A21 ADHD Cohort_2 ADHD_Cohort_2 A21
## A23 ADHD Cohort_2 ADHD_Cohort_2 A23
Finally, we add a phylogenetic tree to the rowData
slot. Such feature is available only in TreeSE
objects. Similarly, Trees specifying the sample hierarchy can be stored in the colTree
slot.
Here, we read in the file containing the phylogenetic tree and insert it in corresponding slot of the TreeSE
object.
# Reads the tree file
tree <- ape::read.tree(tree_file_path)
# Add tree to rowTree
rowTree(tse) <- tree
# Check
tse
## class: TreeSummarizedExperiment
## dim: 151 27
## metadata(0):
## assays(1): counts
## rownames(151): 1726470 1726471 ... 17264756 17264757
## rowData names(6): kingdom phylum ... family genus
## colnames(27): A110 A12 ... A35 A38
## colData names(4): Treatment Cohort TreatmentxCohort Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (151 rows)
## rowTree: 1 phylo tree(s) (151 leaves)
## colLinks: NULL
## colTree: NULL
Now the rowTree
slot contains the phylogenetic tree:
4.1.2 Non-standard formats
Microbiome (taxonomic) profiling data is commonly distributed in various file formats. You can import such external data files as a TreeSE
object, but the details depend on the file format. Here, we provide examples for common formats. Some datasets and raw files to learn how to import raw data and construct TreeSE/MAE
containers are available in the microbiome data repository.
4.1.2.1 CSV import
CSV data tables can be imported with the standard R functions, then converted to the desired format. For detailed examples, you can check the Bioconductor course material by Martin Morgan. You can also check the example files and construct your own CSV files accordingly.
Recommendations for the CSV files are the following. File names are arbitrary; we refer here to the same names as in the examples:
Abundance table (
assay_taxa.csv
): data matrix (features x samples); first column provides feature IDs, the first row provides sample IDs; other values should be numeric (abundances).Row data (
rowdata_taxa.csv
): data table (features x info); first column provides feature IDs, the first row provides column headers; this file usually contains the taxonomic mapping between different taxonomic levels. Ideally, the feature IDs (row names) match one-to-one with the abundance table row names.Column data (
coldata.csv
): data table (samples x info); first column provides sample IDs, the first row provides column headers; this file usually contains the sample metadata/phenodata (such as subject age, health etc). Ideally, the sample IDs match one-to-one with the abundance table column names.
After you have set up the CSV files, you can read them in R:
count_file <- system.file("extdata", "assay_taxa.csv", package = "OMA")
tax_file <- system.file("extdata", "rowdata_taxa.csv", package = "OMA")
sample_file <- system.file("extdata", "coldata.csv", package = "OMA")
# Load files
counts <- read.csv(count_file, row.names=1) # Abundance table (e.g. ASV data; to assay data)
tax <- read.csv(tax_file, row.names=1) # Taxonomy table (to rowData)
samples <- read.csv(sample_file, row.names=1) # Sample data (to colData)
After reading the data in R, ensure the following:
abundance table (
counts
): numericmatrix
, with feature IDs as rownames and sample IDs as column names.rowdata
(tax
):DataFrame
, with feature IDs as rownames. If this is adata.frame
you can use the functionDataFrame()
to change the format. Column names are free but in microbiome analysis they usually they refer to taxonomic ranks. The rownames in rowdata should match with rownames in abundance table.coldata
(samples
):DataFrame
, with sample IDs as rownames. If this is adata.frame
you can use the functionDataFrame()
to change the format. Column names are free. The rownames incoldata
should match with colnames in abundance table.
Always ensure that the tables have rownames! The TreeSE constructor compares rownames and ensures that, for example, right samples are linked with right patient.
Also, ensure that the row and column names match one-to-one between abundance table, rowdata
, and coldata
:
If you hesitate about the format of the data, you can compare to one of the available demonstration datasets, and make sure that your data components have the same format.
There are many different source files and many different ways to read data in R. One can do data manipulation in R as well. Investigate the entries as follows.
# coldata rownames match assay colnames
all(rownames(samples) == colnames(counts)) # our dataset
## [1] TRUE
class(samples) # should be data.frame or DataFrame
## [1] "data.frame"
# rowdata rownames match assay rownames
all(rownames(tax) == rownames(counts)) # our dataset
## [1] TRUE
class(tax) # should be data.frame or DataFrame
## [1] "data.frame"
# Counts
class(counts) # should be a numeric matrix
## [1] "matrix" "array"
Ensure that colnames of assay match with rownames of colData
, and rownames of assay match with rownames of rowData
.
If your data do not have names, you have to be especially careful, since this can lead to errors!
4.1.2.2 Constructing TreeSE
Now, let’s create the TreeSE
object from the input data tables. Here we also convert the data objects in their preferred formats:
- counts –> numeric matrix
- rowData –> DataFrame
- colData –> DataFrame
The SimpleList
could be used to include multiple alternative assays, if necessary.
# Create a TreeSE
tse_taxa <- TreeSummarizedExperiment(
assays = SimpleList(counts = counts),
colData = DataFrame(samples),
rowData = DataFrame(tax))
tse_taxa
## class: TreeSummarizedExperiment
## dim: 12706 40
## metadata(0):
## assays(1): counts
## rownames(12706): GAYR01026362.62.2014 CVJT01000011.50.2173 ...
## JRJTB:03787:02429 JRJTB:03787:02478
## rowData names(7): Phylum Class ... Species OTU
## colnames(40): C1 C2 ... C39 C40
## colData names(6): Sample Rat ... Fat XOS
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
Now you should have a ready-made TreeSE
data object that can be used in downstream analyses.
4.1.2.3 Constructing MAE
To construct a MAE
object, just combine multiple TreeSE
data containers.
Here we import metabolite data from the same study.
count_file <- system.file("extdata", "assay_metabolites.csv", package = "OMA")
sample_file <- system.file("extdata", "coldata.csv", package = "OMA")
# Load files
counts <- read.csv(count_file, row.names=1)
samples <- read.csv(sample_file, row.names=1)
# Create a TreeSE for the metabolite data
tse_metabolite <- TreeSummarizedExperiment(
assays = SimpleList(concs = as.matrix(counts)),
colData = DataFrame(samples))
tse_metabolite
## class: TreeSummarizedExperiment
## dim: 38 40
## metadata(0):
## assays(1): concs
## rownames(38): Butyrate Acetate ... Malonate 1,3-dihydroxyacetone
## rowData names(0):
## colnames(40): C1 C2 ... C39 C40
## colData names(6): Sample Rat ... Fat XOS
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
When creating TreeSE, assay
must be a matrix, and both colData
and rowData
must be DataFrame objects.
Now we can combine these two experiments into MAE.
# Create an ExperimentList that includes experiments
experiments <- ExperimentList(
microbiome = tse_taxa, metabolite = tse_metabolite)
# Create a MAE
mae <- MultiAssayExperiment(experiments = experiments)
mae
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] microbiome: TreeSummarizedExperiment with 12706 rows and 40 columns
## [2] metabolite: TreeSummarizedExperiment with 38 rows and 40 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
4.2 Data resources
Open demonstration data for testing and benchmarking purposes is available from multiple locations. This chapter introduces some options. The other chapters of this book provide ample examples about the use of the data.
4.2.1 Package data
The mia
R package contains example datasets that are direct conversions from the alternative phyloseq
container to the TreeSE
container.
List the available datasets in the mia
package:
Load the GlobalPatterns
data from the mia
package:
data("GlobalPatterns", package="mia")
GlobalPatterns
## class: TreeSummarizedExperiment
## dim: 19216 26
## metadata(0):
## assays(1): counts
## rownames(19216): 549322 522457 ... 200359 271582
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(26): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (19216 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
R packages contain additional demonstration data sets (see the Datasets section of the reference page):
4.2.2 ExperimentHub data
ExperimentHub provides a variety of data resources, including the microbiomeDataSets package (Morgan and Shepherd 2021; Lahti, Ernst, and Shetty 2021).
A table of the available datasets is available through the availableDataSets()
function.
library(microbiomeDataSets)
availableDataSets()
## Dataset
## 1 GrieneisenTSData
## 2 LahtiMLData
## 3 LahtiMData
## 4 OKeefeDSData
## 5 SilvermanAGutData
## 6 SongQAData
## 7 SprockettTHData
All data are downloaded from ExperimentHub and cached for local re-use. Check the man pages of each function for a detailed documentation of the data contents and references. Let us retrieve a MAE dataset:
# mae <- HintikkaXOData()
# Since HintikkaXOData is now added to mia, we can load it directly from there
# We suggest to check other datasets from microbiomeDataSets
data(HintikkaXOData, package = "mia")
mae <- HintikkaXOData
Data is available in SE, TreeSE and MAE data containers; see the for example Section 20.1 for more details.
4.2.3 Curated metagenomic data
curatedMetagenomicData is a large collection of curated human microbiome datasets, provided as TreeSE
objects (Pasolli et al. 2017). The resource provides curated human microbiome data including gene families, marker abundance, marker presence, pathway abundance, pathway coverage, and relative abundance for samples from different body sites. See the package homepage for more details on data availability and access.
As one example, let us retrieve the Vatanen (2016) (Vatanen et al. 2016) data set. This is a larger collection with a bit longer download time.
library(curatedMetagenomicData)
tse <- curatedMetagenomicData("Vatanen*", dryrun = FALSE, counts = TRUE)
4.2.4 Human microbiome compendium
MicroBioMap dataset includes over 170k samples of publicly available 16S rRNA amplicon sequencing data, all processed using the same pipeline and reference database (Davis et al. 2024). After installing the MicroBioMap package (see the original website for instructions), you can load the compendium with
library(MicroBioMap)
cpd <- getCompendium()
This returns a TreeSE
object. Currently, the rowTree
slot of the TreeSE
is not populated.
After loading the compendium, you will have immediate access to nearly 170,000 microbiome samples of publicly available 16S rRNA amplicon sequencing data, all processed using the same pipeline and reference database. For more use examples in R/Bioconductor, see the MicroBioMap vignette.
4.2.5 Other data sources
The current collections provide access to vast microbiome data resources. The output has to be converted into TreeSE/MAE
separately.
- MGnifyR provides access to EBI/MGnify
- HoloFoodR provides access to EBI/HoloFood
- qiitr provides access to QIITA
- qiime2R provides access to QIIME2