4 Import

This chapter demonstrates how data can be imported into data containers from files. Additionally, multiple databases offer curated microbiome data, which are introduced in Section 4.2.

In prior to importing the data to container, the raw sequences must be mapped to abundances. The specific method depends on sequencing type and other data-specific factors. There are several resources available to guide this process Metagenomics wiki being one of them. In Section C.2 we provide an example workflow for processing 16S rRNA sequences into a data container using DADA2.

4.1 Import microbiome data from files

The data containers can be constructed from scratch. For most common microbiome data formats, there is dedicated importers available which streamlines the importing.

4.1.1 Standard data formats

Specific import functions are provided for:

BIOM files (see help(importBIOM, package = "mia"))
QIIME2 files (see help(importQIIME2, package = "mia"))
Mothur files (see help(importMothur, package = "mia"))
MetaPhlAn files (see help(importMetaPhlAn, package = "mia"))
HUMAnN files (see help(importHUMAnN, package = "mia"))
taxpasta files (see help(importTaxpasta, package = "mia"))

Here we show how Biom files are imported into a TreeSE object using as an example Tengeler2020, which is further described here. This dataset consists of 3 files, which can be fetched or downloaded from this repository:

biom file: abundance table and taxonomy information
csv file: sample metadata
tree file: phylogenetic tree

To begin with, we store the data in a local directory within the working directory, such as data/, and define the source file paths.

biom_file_path <- system.file(
    "extdata", "Aggregated_humanization2.biom", package = "OMA")
sample_meta_file_path <- system.file(
    "extdata", "Mapping_file_ADHD_aggregated.csv", package = "OMA")
tree_file_path <- system.file(
    "extdata", "Data_humanization_phylo_aggregation.tre", package = "OMA")

Now we can read in the biom file and convert it into a TreeSE object. In addition, we retrieve the rank names from the prefixes of the taxa names and then remove them with the rank.from.prefix and prefix.rm optional arguments.

library(mia)

# read biom and convert it to TreeSE
tse <- importBIOM(
    biom_file_path,
    rank.from.prefix = TRUE,
    prefix.rm = TRUE,
    artifact.rm = TRUE)

# Check
tse
##  class: TreeSummarizedExperiment 
##  dim: 151 27 
##  metadata(0):
##  assays(1): counts
##  rownames(151): 1726470 1726471 ... 17264756 17264757
##  rowData names(6): kingdom phylum ... family genus
##  colnames(27): A110 A111 ... A38 A39
##  colData names(0):
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: NULL
##  rowTree: NULL
##  colLinks: NULL
##  colTree: NULL

The assays slot includes a list of abundance tables. The imported abundance table is named as “counts”. Let us inspect only the first cols and rows.

assay(tse, "counts")[1:3, 1:3]
##            A110  A111  A12
##  1726470  17722 11630    0
##  1726471  12052     0 2679
##  17264731     0   970    0

The rowdata includes taxonomic information from the biom file. The head() command shows just the beginning of the data table for an overview.

knitr::kable() helps print the information more nicely.

rowData(tse) |> head()
##  DataFrame with 6 rows and 6 columns
##               kingdom          phylum            class              order
##           <character>     <character>      <character>        <character>
##  1726470     Bacteria   Bacteroidetes      Bacteroidia      Bacteroidales
##  1726471     Bacteria   Bacteroidetes      Bacteroidia      Bacteroidales
##  17264731    Bacteria   Bacteroidetes      Bacteroidia      Bacteroidales
##  17264726    Bacteria   Bacteroidetes      Bacteroidia      Bacteroidales
##  1726472     Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales
##  17264724    Bacteria   Bacteroidetes      Bacteroidia      Bacteroidales
##                        family           genus
##                   <character>     <character>
##  1726470       Bacteroidaceae     Bacteroides
##  1726471       Bacteroidaceae     Bacteroides
##  17264731  Porphyromonadaceae Parabacteroides
##  17264726      Bacteroidaceae     Bacteroides
##  1726472  Verrucomicrobiaceae     Akkermansia
##  17264724      Bacteroidaceae     Bacteroides

We notice that the imported biom file did not contain any colData yet, so only an empty dataframe appears in this slot.

colData(tse) |> head()
##  DataFrame with 6 rows and 0 columns

Let us add colData from the sample metadata, which is stored in a CSV file.

# CSV file with colnames in the first row and rownames in the first column
sample_meta <- read.csv(
    sample_meta_file_path, sep = ",", row.names = 1)

sample_meta |> head()
##       Treatment   Cohort TreatmentxCohort Description
##  A110      ADHD Cohort_1    ADHD_Cohort_1        A110
##  A12       ADHD Cohort_1    ADHD_Cohort_1         A12
##  A15       ADHD Cohort_1    ADHD_Cohort_1         A15
##  A19       ADHD Cohort_1    ADHD_Cohort_1         A19
##  A21       ADHD Cohort_2    ADHD_Cohort_2         A21
##  A23       ADHD Cohort_2    ADHD_Cohort_2         A23

You have to be especially careful when you add sample metadata, effectively replacing the existing, empty colData. The method does not check sample names, which can lead to mistakes. Therefore, you must manually ensure that the sample names in the metadata not only exist but also match those in TreeSE.

# Check that sample names exist
if( is.null(rownames(sample_meta)) || is.null(colnames(tse)) ){
    stop("Sample names missing!", call. = FALSE)
}

# Sort rows in sample metadata to match with TreeSE
sample_order <- match(colnames(tse), rownames(sample_meta))
sample_meta <- sample_meta[sample_order, , drop = FALSE]

# If sample metadata did not include all samples, missing ones are now named NA.
# That is why we replace sample metadata names with ones in TreeSE. 
rownames(sample_meta) <- colnames(tse)

Now, we can add sample metadata to the colData.

colData(tse) <- DataFrame(sample_meta)
colData(tse) |> head()
##  DataFrame with 6 rows and 4 columns
##         Treatment      Cohort TreatmentxCohort Description
##       <character> <character>      <character> <character>
##  A110        ADHD    Cohort_1    ADHD_Cohort_1        A110
##  A111     Control    Cohort_1 Control_Cohort_1        A111
##  A12         ADHD    Cohort_1    ADHD_Cohort_1         A12
##  A13      Control    Cohort_1 Control_Cohort_1         A13
##  A14      Control    Cohort_1 Control_Cohort_1         A14
##  A15         ADHD    Cohort_1    ADHD_Cohort_1         A15

Finally, we add a phylogenetic tree to the rowData slot. Such feature is available only in TreeSE objects. Similarly, Trees specifying the sample hierarchy can be stored in the colTree slot.

Here, we read in the file containing the phylogenetic tree and insert it in corresponding slot of the TreeSE object.

# Reads the tree file
tree <- ape::read.tree(tree_file_path)

# Add tree to rowTree
rowTree(tse) <- tree

# Check
tse
##  class: TreeSummarizedExperiment 
##  dim: 151 27 
##  metadata(0):
##  assays(1): counts
##  rownames(151): 1726470 1726471 ... 17264756 17264757
##  rowData names(6): kingdom phylum ... family genus
##  colnames(27): A110 A111 ... A38 A39
##  colData names(4): Treatment Cohort TreatmentxCohort Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (151 rows)
##  rowTree: 1 phylo tree(s) (151 leaves)
##  colLinks: NULL
##  colTree: NULL

Now the rowTree slot contains the phylogenetic tree:

rowTree(tse) |> head()

4.1.2 Non-standard formats

Microbiome (taxonomic) profiling data is commonly distributed in various file formats. You can import such external data files as a TreeSE object, but the details depend on the file format. Here, we provide examples for common formats. Some datasets and raw files to learn how to import raw data and construct TreeSE/MAE containers are available in the microbiome data repository.

4.1.2.1 CSV import

CSV data tables can be imported with the standard R functions, then converted to the desired format. For detailed examples, you can check the Bioconductor course material by Martin Morgan. You can also check the example files and construct your own CSV files accordingly.

Recommendations for the CSV files are the following. File names are arbitrary; we refer here to the same names as in the examples:

Abundance table (assay_taxa.csv): data matrix (features x samples); first column provides feature IDs, the first row provides sample IDs; other values should be numeric (abundances).
Row data (rowdata_taxa.csv): data table (features x info); first column provides feature IDs, the first row provides column headers; this file usually contains the taxonomic mapping between different taxonomic levels. Ideally, the feature IDs (row names) match one-to-one with the abundance table row names.
Column data (coldata.csv): data table (samples x info); first column provides sample IDs, the first row provides column headers; this file usually contains the sample metadata/phenodata (such as subject age, health etc). Ideally, the sample IDs match one-to-one with the abundance table column names.

After you have set up the CSV files, you can read them in R:

count_file  <- system.file("extdata", "assay_taxa.csv", package = "OMA")
tax_file    <- system.file("extdata", "rowdata_taxa.csv", package = "OMA")
sample_file <- system.file("extdata", "coldata.csv", package = "OMA")

# Load files
counts  <- read.csv(count_file, row.names=1)   # Abundance table (e.g. ASV data; to assay data)
tax     <- read.csv(tax_file, row.names=1)     # Taxonomy table (to rowData)
samples <- read.csv(sample_file, row.names=1)  # Sample data (to colData)

After reading the data in R, ensure the following:

abundance table (counts): numeric matrix, with feature IDs as rownames and sample IDs as column names.
rowdata (tax): DataFrame, with feature IDs as rownames. If this is a data.frame you can use the function DataFrame() to change the format. Column names are free but in microbiome analysis they usually they refer to taxonomic ranks. The rownames in rowdata should match with rownames in abundance table.
coldata (samples): DataFrame, with sample IDs as rownames. If this is a data.frame you can use the function DataFrame() to change the format. Column names are free. The rownames in coldata should match with colnames in abundance table.

Always ensure that the tables have rownames! The TreeSE constructor compares rownames and ensures that, for example, right samples are linked with right patient.

Also, ensure that the row and column names match one-to-one between abundance table, rowdata, and coldata:

# Match rows and columns
counts <- counts[rownames(tax), rownames(samples)]

# Let's ensure that the data is in correct (numeric matrix) format:
counts <- as.matrix(counts)

If you hesitate about the format of the data, you can compare to one of the available demonstration datasets, and make sure that your data components have the same format.

There are many different source files and many different ways to read data in R. One can do data manipulation in R as well. Investigate the entries as follows.

# coldata rownames match assay colnames
all(rownames(samples) == colnames(counts)) # our dataset
##  [1] TRUE
class(samples) # should be data.frame or DataFrame
##  [1] "data.frame"

# rowdata rownames match assay rownames
all(rownames(tax) == rownames(counts)) # our dataset
##  [1] TRUE
class(tax) # should be data.frame or DataFrame
##  [1] "data.frame"

# Counts
class(counts) # should be a numeric matrix
##  [1] "matrix" "array"

Important!

Ensure that colnames of assay match with rownames of colData, and rownames of assay match with rownames of rowData.

If your data do not have names, you have to be especially careful, since this can lead to errors!

4.1.2.2 Constructing TreeSE

Now, let’s create the TreeSE object from the input data tables. Here we also convert the data objects in their preferred formats:

counts –> numeric matrix
rowData –> DataFrame
colData –> DataFrame

The SimpleList could be used to include multiple alternative assays, if necessary.

# Create a TreeSE
tse_taxa <- TreeSummarizedExperiment(
    assays =  SimpleList(counts = counts),
    colData = DataFrame(samples),
    rowData = DataFrame(tax))

tse_taxa
##  class: TreeSummarizedExperiment 
##  dim: 12706 40 
##  metadata(0):
##  assays(1): counts
##  rownames(12706): GAYR01026362.62.2014 CVJT01000011.50.2173 ...
##    JRJTB:03787:02429 JRJTB:03787:02478
##  rowData names(7): Phylum Class ... Species OTU
##  colnames(40): C1 C2 ... C39 C40
##  colData names(6): Sample Rat ... Fat XOS
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: NULL
##  rowTree: NULL
##  colLinks: NULL
##  colTree: NULL

Now you should have a ready-made TreeSE data object that can be used in downstream analyses.

4.1.2.3 Constructing MAE

To construct a MAE object, just combine multiple TreeSE data containers.

Here we import metabolite data from the same study.

count_file <- system.file("extdata", "assay_metabolites.csv", package = "OMA")
sample_file <- system.file("extdata", "coldata.csv", package = "OMA")

# Load files
counts  <- read.csv(count_file, row.names=1)
samples <- read.csv(sample_file, row.names=1)

# Create a TreeSE for the metabolite data
tse_metabolite <- TreeSummarizedExperiment(
    assays = SimpleList(concs = as.matrix(counts)),
    colData = DataFrame(samples))

tse_metabolite
##  class: TreeSummarizedExperiment 
##  dim: 38 40 
##  metadata(0):
##  assays(1): concs
##  rownames(38): Butyrate Acetate ... Malonate 1,3-dihydroxyacetone
##  rowData names(0):
##  colnames(40): C1 C2 ... C39 C40
##  colData names(6): Sample Rat ... Fat XOS
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: NULL
##  rowTree: NULL
##  colLinks: NULL
##  colTree: NULL

Important!

When creating TreeSE, assay must be a matrix, and both colData and rowData must be DataFrame objects.

Now we can combine these two experiments into MAE.

# Create an ExperimentList that includes experiments
experiments <- ExperimentList(
    microbiome = tse_taxa, metabolite = tse_metabolite)

# Create a MAE
mae <- MultiAssayExperiment(experiments = experiments)

mae
##  A MultiAssayExperiment object of 2 listed
##   experiments with user-defined names and respective classes.
##   Containing an ExperimentList class object of length 2:
##   [1] microbiome: TreeSummarizedExperiment with 12706 rows and 40 columns
##   [2] metabolite: TreeSummarizedExperiment with 38 rows and 40 columns
##  Functionality:
##   experiments() - obtain the ExperimentList instance
##   colData() - the primary/phenotype DataFrame
##   sampleMap() - the sample coordination DataFrame
##   `$`, `[`, `[[` - extract colData columns, subset, or experiment
##   *Format() - convert into a long or wide DataFrame
##   assays() - convert ExperimentList to a SimpleList of matrices
##   exportClass() - save data to flat files

4.2 Data resources

Open demonstration data for testing and benchmarking purposes is available from multiple locations. This chapter introduces some options. The other chapters of this book provide ample examples about the use of the data.

4.2.1 Package data

The mia R package contains example datasets that are direct conversions from the alternative phyloseq container to the TreeSE container.

List the available datasets in the mia package:

data(package="mia")

Load the GlobalPatterns data from the mia package:

data("GlobalPatterns", package = "mia")
GlobalPatterns
##  class: TreeSummarizedExperiment 
##  dim: 19216 26 
##  metadata(0):
##  assays(1): counts
##  rownames(19216): 549322 522457 ... 200359 271582
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (19216 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

R packages contain additional demonstration data sets (see the Datasets section of the reference page):

4.2.2 ExperimentHub data

ExperimentHub provides a variety of data resources, including the microbiomeDataSets package (Morgan and Shepherd 2021; Lahti, Ernst, and Shetty 2021).

Morgan, Martin, and Lori Shepherd. 2021. ExperimentHub: Client to Access ExperimentHub Resources.

Lahti, Leo, Felix G. M. Ernst, and Sudarshan Shetty. 2021. microbiomeDataSets: Experiment Hub Based Microbiome Datasets.

A table of the available datasets is available through the availableDataSets() function.

library(microbiomeDataSets)
availableDataSets()
##              Dataset
##  1  GrieneisenTSData
##  2       LahtiMLData
##  3        LahtiMData
##  4      OKeefeDSData
##  5 SilvermanAGutData
##  6        SongQAData
##  7   SprockettTHData

All data are downloaded from ExperimentHub and cached for local re-use. Check the man pages of each function for a detailed documentation of the data contents and references. Let us retrieve a MAE dataset:

# mae <- HintikkaXOData()
# Since HintikkaXOData is now added to mia, we can load it directly from there
# We suggest to check other datasets from microbiomeDataSets
data(HintikkaXOData, package = "mia")
mae <- HintikkaXOData

Data is available in SE, TreeSE and MAE data containers; see the for example Chapter 23 for more details.

4.2.3 Curated metagenomic data

curatedMetagenomicData is a large collection of curated human microbiome datasets, provided as TreeSE objects (Pasolli et al. 2017). The resource provides curated human microbiome data including gene families, marker abundance, marker presence, pathway abundance, pathway coverage, and relative abundance for samples from different body sites. See the package homepage for more details on data availability and access.

Pasolli, Edoardo, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, et al. 2017. “Accessible, Curated Metagenomic Data Through ExperimentHub.” Nature Methods 14 (11): 1023–24. https://doi.org/https://doi.org/10.1038/nmeth.4468.

Vatanen, Tommi, Aleksandar D. Kostic, Eva d’Hennezel, Heli Siljander, Eric A. Franzosa, Moran Yassour, Raivo Kolde, et al. 2016. “Variation in Microbiome LPS Immunogenicity Contributes to Autoimmunity in Humans.” Cell 165 (May): 842–53. https://doi.org/10.1016/j.cell.2016.04.007.

As one example, let us retrieve the Vatanen (2016) (Vatanen et al. 2016) data set. This is a larger collection with a bit longer download time.

library(curatedMetagenomicData)
tse <- curatedMetagenomicData("Vatanen*", dryrun = FALSE, counts = TRUE)

4.2.4 Human microbiome compendium

MicroBioMap dataset includes over 170k samples of publicly available 16S rRNA amplicon sequencing data, all processed using the same pipeline and reference database (Davis et al. 2024). After installing the MicroBioMap package (see the original website for instructions), you can load the compendium with

Davis, Sean, Richard Abdill, Ran Blehkman, Samantha Graham, and Casey Greene. 2024. MicroBioMap: Access the Microbiome Compendium from r. https://github.com/seandavi/MicroBioMap.

library(MicroBioMap)
cpd <- getCompendium()

This returns a TreeSE object. Currently, the rowTree slot of the TreeSE is not populated.

After loading the compendium, you will have immediate access to nearly 170,000 microbiome samples of publicly available 16S rRNA amplicon sequencing data, all processed using the same pipeline and reference database. For more use examples in R/Bioconductor, see the MicroBioMap vignette.

4.2.5 Other data sources

The current collections provide access to vast microbiome data resources. The output has to be converted into TreeSE/MAE separately.

MGnifyR provides access to EBI/MGnify
HoloFoodR provides access to EBI/HoloFood
qiitr provides access to QIITA
qiime2R provides access to QIIME2

Exercises

Goal: The goal of these exercises is to learn how to import data from files or database.

Exercise 1: Construct TreeSE

Download the data

Download the CSV files from this directory.

Load the data into R

Read the downloaded CSV files into R using read.csv() and store them in the following variables:

assay → contains the main data matrix (e.g., microbial counts)
rowdata → contains metadata for features (e.g., taxonomic information)
coldata → contains metadata for samples (e.g., sample descriptions)

Convert data into correct format
- assay → converted into a matrix
- rowdata → converted into a DataFrame
- coldata → converted into a DataFrame
Ensure that all tables have row and column names and rows of rowdata correspond to rows of assay and rows of coldata match with columns of assay.
Create a SimpeList containing the abundance table
Create a TreeSummarizedExperiment object.
Verify the import

Ensure that the data has been correctly imported by checking that the values in the TreeSE object match those in the original CSV files. - Pick a random row (feature) and column (sample). - Compare the corresponding value in the original assay data and the TreeSE object.

Exercise 2: Load BIOM file

Download this BIOM file.
Import the BIOM file to R.

Exercise 3: Fetch data from MGnify

Explore available data from here and choose study or sample. How many public samples there are available?
Create MgnifyClient() object.
Search available analyses on the selected sample or study.
Fetch associated metadata on the analyses, and select analyses to fetch. Select analyses only from single analysis pipeline.
Retrieve data into TreeSE.

Useful functions:

read.csv(), DataFrame(), TreeSummarizedExperiment(), matrix(), rownames(), colnames(), SimpleList(), download.file(), importBIOM(), MGnifyR::MgnifyClient(), MGnifyR::searchAnalysis(), MGnifyR::getMetadata(), MGnifyR::getResult()