23 Resources
23.1 Data containers
23.1.1 Data container documentation
SingleCellExperiment (Lun and Risso 2020)
SummarizedExperiment (Morgan et al. 2020)
TreeSummarizedExperiment (Huang 2020)
MultiAssayExperiment (Ramos et al. 2017)
23.1.2 Other relevant containers
- DataFrame, which behaves similarly to data.frame, yet is efficient and fast when used with large datasets.
- DNAString, along with DNAStringSet, RNAString and RNAStringSet, for efficient storage and handling of long biological sequences; these are offered within the Biostrings package (Pagès et al. 2020).
- GenomicRanges (Lawrence et al. 2013) offers an efficient representation and manipulation of genomic annotations and alignments; see e.g. GRanges and GRangesList in An Introduction to the GenomicRanges Package.
NGS Analysis Basics provides a walk-through of the above-mentioned features with detailed examples.
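As a quick orientation, the containers listed above can be tried out with a few lines of R. This is only a minimal sketch: it assumes that the Bioconductor packages S4Vectors, Biostrings and GenomicRanges are installed, and all sample names, sequences and coordinates are made up for illustration.

```r
# Assumption: S4Vectors, Biostrings and GenomicRanges are installed.
library(S4Vectors)
library(Biostrings)
library(GenomicRanges)

# DataFrame: used much like a base data.frame (toy values)
df <- DataFrame(sample = c("S1", "S2"), depth = c(1200L, 950L))
df$depth_k <- df$depth / 1000

# DNAStringSet: several DNA sequences in one object (toy sequences)
seqs <- DNAStringSet(c(asv1 = "ACGTTGCA", asv2 = "GGCATTAC"))
seq_lengths <- width(seqs)          # length of each sequence
rc <- reverseComplement(seqs)       # reverse complement of each sequence

# GRanges: genomic intervals plus an optional metadata column (toy ranges)
gr <- GRanges(
  seqnames = c("chr1", "chr1"),
  ranges   = IRanges(start = c(10, 50), end = c(20, 80)),
  strand   = c("+", "-"),
  gene     = c("geneA", "geneB")
)
range_widths <- width(gr)           # number of positions covered by each range
```

Printing df, seqs or gr in an interactive session shows the compact summaries these classes provide.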
23.1.3 phyloseq: an alternative container for microbiome data
The phyloseq package and class became the first widely used data container for microbiome data science in R. Many methods for taxonomic profiling data are readily available for this class. We provide here a short description of how phyloseq and *Experiment classes relate to each other.
- assays: This slot is similar to otu_table in phyloseq. In a SummarizedExperiment object, multiple assays (e.g. raw counts and transformed counts) can be stored. See also MultiAssayExperiment (Ramos et al. 2017) for storing data from multiple experiments such as RNASeq, Proteomics, etc.
- rowData: This slot is similar to tax_table in phyloseq and stores taxonomic information.
- colData: This slot is similar to sample_data in phyloseq and stores information related to samples.
- rowTree: This slot is similar to phy_tree in phyloseq and stores the phylogenetic tree.
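To make the slot correspondence more concrete, the sketch below builds a small TreeSummarizedExperiment from scratch and reads each slot back. It assumes that the TreeSummarizedExperiment, SummarizedExperiment and ape packages are installed; the counts, taxonomy, sample groups and the random tree are toy data invented for illustration.

```r
# Assumption: TreeSummarizedExperiment, SummarizedExperiment and ape are installed.
library(SummarizedExperiment)
library(TreeSummarizedExperiment)

# Toy count matrix: features (ASVs) in rows, samples in columns
counts <- matrix(
  c(10, 0, 3, 7,
     5, 2, 8, 1,
     0, 9, 4, 6),
  nrow = 4,
  dimnames = list(paste0("ASV", 1:4), paste0("S", 1:3))
)
# A second, transformed assay (relative abundances per sample)
relabundance <- sweep(counts, 2, colSums(counts), "/")

# Feature-level information (compare with tax_table in phyloseq)
taxonomy <- DataFrame(
  Phylum = c("Firmicutes", "Firmicutes", "Bacteroidetes", "Bacteroidetes"),
  row.names = rownames(counts)
)
# Sample-level information (compare with sample_data in phyloseq)
samples <- DataFrame(
  group = c("case", "control", "case"),
  row.names = colnames(counts)
)
# Random toy tree whose tip labels match the feature names
tree <- ape::rtree(4, tip.label = rownames(counts))

tse <- TreeSummarizedExperiment(
  assays  = list(counts = counts, relabundance = relabundance),
  rowData = taxonomy,
  colData = samples,
  rowTree = tree
)

assay_names <- assayNames(tse)   # names of the stored assays
feature_info <- rowData(tse)     # taxonomic information
sample_info <- colData(tse)      # sample information
feature_tree <- rowTree(tse)     # phylogenetic tree
```

Storing the raw and transformed counts as separate named assays keeps them aligned with the same rowData, colData and rowTree.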
In this book, you will encounter terms such as FeatureIDs and SampleIDs.
- FeatureIDs: These are basically OTU/ASV ids, which are the row names in assays and rowData.
- SampleIDs: As the name suggests, these are sample ids, which are the column names in assays and the row names in colData.
FeatureIDs and SampleIDs are used in the text, but the technical terms rownames and colnames are encouraged, since they relate to the actual objects we work with.
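The relationship between these terms and the actual row and column names can be checked directly on an object. The following is a minimal sketch that assumes the SummarizedExperiment package is installed; the feature names, sample names and annotation values are hypothetical.

```r
# Assumption: SummarizedExperiment is installed; all values are toy data.
library(SummarizedExperiment)

counts <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("ASV1", "ASV2", "ASV3"), c("S1", "S2"))
)
se <- SummarizedExperiment(
  assays  = list(counts = counts),
  rowData = DataFrame(Genus = c("GenusA", "GenusB", "GenusC"),
                      row.names = rownames(counts)),
  colData = DataFrame(group = c("x", "y"),
                      row.names = colnames(counts))
)

feature_ids <- rownames(se)   # "FeatureIDs" in the text
sample_ids  <- colnames(se)   # "SampleIDs" in the text

# The same names index the assay, rowData and colData
same_in_assay   <- identical(feature_ids, rownames(assay(se, "counts")))
same_in_rowdata <- identical(feature_ids, rownames(rowData(se)))
same_in_coldata <- identical(sample_ids, rownames(colData(se)))

# Subsetting by names keeps all slots in sync
sub <- se["ASV2", "S1"]
```

Because subsetting acts on the whole object, the assay, rowData and colData of sub all refer to the same single feature and sample.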
23.1.3.1 Benchmarking TreeSE with phyloseq
TreeSE objects can be converted into phyloseq objects and vice versa, so it is possible to compare the two containers in terms of computational efficiency. TreeSE and phyloseq were benchmarked against one another using mia v1.2.3 and phyloseq v1.38.0, respectively. Five standard microbiome analysis operations were applied to four datasets of varying size with both containers. In a nutshell, TreeSE and phyloseq showed similar performance on small and medium-sized datasets for most of the operations. However, TreeSE performed more efficiently as the size of the datasets increased. Further details on these results can be found in the benchmarking repository.
23.1.3.2 Resources on phyloseq
The phyloseq container provides methods analogous to those of TreeSE. The following material can be used to familiarize yourself with these alternative methods:
- List of R tools for microbiome analysis
- phyloseq (McMurdie and Holmes 2013)
- microbiome tutorial
- microbiomeutilities
- phyloseq/TreeSE cheatsheet
- Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses (Callahan et al. 2016).
23.2 R programming resources
23.2.1 Base R and RStudio
If you are new to R, you could try swirl for a kickstart to R programming. Further support resources are available through the Bioconductor project (Huber et al. 2015).
23.2.2 Bioconductor Classes
Introduction to data analysis with R and Bioconductor; Carpentries introduction, including R & RStudio installation instructions
S4 system
The S4 class system has brought several useful features to the object-oriented programming paradigm within R, and it is widely used in R/Bioconductor packages (Huber et al. 2015).
- Hervé Pagès, A quick overview of the S4 class system.
- Laurent Gatto, A practical tutorial on S4 programming
- How S4 Methods Work (J. Chambers 2006)
- John M. Chambers. Software for Data Analysis: Programming with R. Springer, New York, 2008. ISBN-13 978-0387759357 (J. M. Chambers 2008)
- Robert Gentleman. R Programming for Bioinformatics. Chapman & Hall/CRC, New York, 2008. ISBN-13 978-1420063677 (Gentleman 2008)
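For readers who have not seen S4 code before, the sketch below shows the main building blocks covered in these references: a class with typed slots, a validity check, a generic with a method, and a custom show method. The class AbundanceTable and the generic libSize are hypothetical names invented for this illustration; they are not part of any Bioconductor package.

```r
library(methods)

# A toy S4 class with two typed slots (hypothetical, for illustration only)
setClass(
  "AbundanceTable",
  slots = c(counts = "matrix", sample_info = "data.frame")
)

# Validity check: one row of sample_info per column of counts
setValidity("AbundanceTable", function(object) {
  if (ncol(object@counts) != nrow(object@sample_info)) {
    "number of columns in 'counts' must equal number of rows in 'sample_info'"
  } else {
    TRUE
  }
})

# A generic function and its method for the new class
setGeneric("libSize", function(x, ...) standardGeneric("libSize"))
setMethod("libSize", "AbundanceTable", function(x, ...) colSums(x@counts))

# Custom printing via the built-in 'show' generic
setMethod("show", "AbundanceTable", function(object) {
  cat("AbundanceTable with", nrow(object@counts), "features and",
      ncol(object@counts), "samples\n")
})

tab <- new(
  "AbundanceTable",
  counts = matrix(1:6, nrow = 3,
                  dimnames = list(paste0("ASV", 1:3), c("S1", "S2"))),
  sample_info = data.frame(group = c("a", "b"), row.names = c("S1", "S2"))
)
sizes <- libSize(tab)   # per-sample totals

# Creating an invalid object fails at construction time
invalid <- tryCatch(
  new("AbundanceTable",
      counts = matrix(1:6, nrow = 3),
      sample_info = data.frame(group = "a")),
  error = function(e) "rejected"
)
```

Bioconductor containers such as SummarizedExperiment follow the same pattern: slots are normally accessed through generics (for example assay() or colData()) rather than directly with @.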
23.3 Reproducible reporting with Quarto
23.3.1 Learn Quarto
Reproducible reporting is the starting point for robust interactive data science. Perform the following tasks:
If you are entirely new to Quarto, take this short tutorial to get introduced to the most important functions within Quarto. Then experiment with different options from the Quarto cheatsheet.
Create a Quarto template in RStudio, and render it into a document (markdown, PDF, docx or other format). In case you are new to Quarto, its documentation provides guidelines to use Quarto with the R language (here) and the RStudio IDE (here).
Further examples and tips for Quarto are available in this online tutorial to interactive reproducible reporting.
23.3.2 Additional material on Rmarkdown
Being able to use Quarto in R partly relies on your previous knowledge of Rmarkdown. The following resources can help you get familiar with Rmarkdown:
Figure sources:
Original article
- Huang R et al. (2021) TreeSummarizedExperiment: a S4 class for data with hierarchical structure. F1000Research 9:1246. (Huang et al. 2021)
Reference Sequence slot extension
- Lahti L et al. (2020) Upgrading the R/Bioconductor ecosystem for microbiome research. F1000Research 9:1464 (slides).