28 Resources
28.1 Data containers
28.1.1 Data container documentation
28.1.2 Other relevant containers
- DataFrame, which behaves similarly to data.frame, yet is efficient and fast when used with large datasets.
- DNAString, along with DNAStringSet, RNAString and RNAStringSet, offers efficient storage and handling of long biological sequences within the Biostrings package (Pagès et al. 2020).
- GenomicRanges (Lawrence et al. 2013) offers an efficient representation and manipulation of genomic annotations and alignments; see e.g. GRanges and GRangesList in An Introduction to the GenomicRanges Package.
NGS Analysis Basics provides a walk-through of the above-mentioned features with detailed examples.
28.1.3 phyloseq: an alternative container for microbiome data
The phyloseq package and class became the first widely used data container for microbiome data science in R. Many methods for taxonomic profiling data are readily available for this class. Here we briefly describe how the phyloseq and SummarizedExperiment classes relate to each other.
- assays: This slot is similar to otu_table in phyloseq. A SummarizedExperiment object can store multiple assays, for instance raw counts and transformed counts. See also Ramos et al. (2017) for storing data from multiple experiments such as RNASeq, proteomics, etc.
- rowData: This slot is similar to tax_table in phyloseq and stores taxonomic information.
- colData: This slot is similar to sample_data in phyloseq and stores information related to samples.
- rowTree: This slot is similar to phy_tree in phyloseq and stores the phylogenetic tree.
In this book, you will encounter terms such as FeatureIDs and SampleIDs.
- FeatureIDs: These are basically OTU/ASV ids, which are the row names in assays and rowData.
- SampleIDs: As the name suggests, these are sample ids, which are the column names in assays and the row names in colData.

FeatureIDs and SampleIDs are used in places, but the technical terms rownames and colnames are encouraged, since they relate to the actual objects we work with.
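The correspondence between these identifiers and the row and column names can be illustrated with a minimal sketch in base R. The plain matrix and data frames below are only stand-ins for the assays, rowData and colData slots, not actual SummarizedExperiment objects; the object names (counts, row_data, col_data) and all values are hypothetical.

```r
# Stand-in for an assay: features (OTUs/ASVs) in rows, samples in columns.
counts <- matrix(
    c(10, 0, 3,
       5, 2, 8),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("ASV1", "ASV2"), c("S1", "S2", "S3"))
)

# Stand-in for rowData: one row per feature (taxonomic information).
row_data <- data.frame(
    Genus = c("Bacteroides", "Prevotella"),
    row.names = c("ASV1", "ASV2")
)

# Stand-in for colData: one row per sample (sample information).
col_data <- data.frame(
    Group = c("control", "case", "case"),
    row.names = c("S1", "S2", "S3")
)

# FeatureIDs are the row names shared by the assay and rowData.
feature_ids <- rownames(counts)

# SampleIDs are the column names of the assay and the row names of colData.
sample_ids <- colnames(counts)

# Both checks should hold for consistently labelled data.
identical(feature_ids, rownames(row_data))
identical(sample_ids, rownames(col_data))
```

In an actual SummarizedExperiment object the same rownames() and colnames() calls apply to the container itself, which is why these terms are preferred over FeatureIDs and SampleIDs.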

28.1.3.1 Benchmarking TreeSE with phyloseq
TreeSummarizedExperiment objects can be converted into phyloseq objects and vice versa, so the two containers can be compared in terms of computational efficiency. TreeSummarizedExperiment (with mia v1.2.3) and phyloseq (v1.38.0) were benchmarked against one another. Five standard microbiome analysis operations were applied to four datasets of varying size with both containers. In a nutshell, TreeSummarizedExperiment and phyloseq showed similar performance on small and medium-sized datasets for most of the operations. However, TreeSummarizedExperiment performed more efficiently as the size of the datasets increased. Further details on these results can be found in the benchmarking repository.
28.1.3.2 Resources on phyloseq
The phyloseq container provides methods analogous to those of TreeSummarizedExperiment. The following material can be used to familiarize yourself with these alternative methods:
- List of R tools for microbiome analysis
- phyloseq (McMurdie and Holmes 2013)
- microbiome tutorial
- microbiomeutilities
- phyloseq / TreeSummarizedExperiment cheatsheet from Appendix D
- Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses (Callahan et al. 2016).
28.2 R programming resources
28.2.1 Base R and RStudio
If you are new to R, you could try swirl for a kickstart to R programming. Further support resources are available through the Bioconductor project (Huber et al. 2015).
- Base R and RStudio cheatsheets
- Package-specific cheatsheets
- Visualization with ggplot2
- R graphics cookbook
28.2.2 Bioconductor Classes
Introduction to data analysis with R and Bioconductor; Carpentries introduction, including R & RStudio installation instructions
S4 system
The S4 class system has brought several useful features to object-oriented programming in R, and it is widely used in R/Bioconductor packages (Huber et al. 2015).
- Hervé Pagès, A quick overview of the S4 class system.
- Laurent Gatto, A practical tutorial on S4 programming
- How S4 Methods Work (J. Chambers 2006)
- John M. Chambers. Software for Data Analysis: Programming with R. Springer, New York, 2008. ISBN-13 978-0387759357 (J. M. Chambers 2008)
- Robert Gentleman. R Programming for Bioinformatics. Chapman & Hall/CRC, New York, 2008. ISBN-13 978-1420063677 (Gentleman 2008)
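As a minimal sketch of the S4 features covered by these resources (formal class definitions with typed slots, validity checks, generics and methods), the following example uses only the methods package that ships with R. The class Sample and the generic describe are hypothetical names chosen for illustration; they do not come from any Bioconductor package.

```r
library(methods)

# Define a class with typed slots and a validity check.
setClass(
    "Sample",
    representation(id = "character", depth = "numeric"),
    validity = function(object) {
        if (length(object@depth) != 1L || object@depth < 0) {
            return("'depth' must be a single non-negative number")
        }
        TRUE
    }
)

# Define a generic function and a method for the new class.
setGeneric("describe", function(x, ...) standardGeneric("describe"))

setMethod("describe", "Sample", function(x, ...) {
    paste0("Sample ", x@id, " with sequencing depth ", x@depth)
})

# Create a valid object and dispatch the method on it.
s <- new("Sample", id = "S1", depth = 1000)
describe(s)

# Objects that violate the validity check cannot be created:
# new() raises an error, which is caught here and turned into NULL.
bad <- tryCatch(
    new("Sample", id = "S2", depth = -1),
    error = function(e) NULL
)
is.null(bad)
```

Bioconductor containers follow the same pattern on a larger scale: slots hold the data, validity checks keep the slots consistent with each other, and generics provide a common interface across classes.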
28.3 Reproducible reporting with Quarto
28.3.1 Learn Quarto
Reproducible reporting is the starting point for robust interactive data science. Perform the following tasks:
If you are entirely new to Quarto, take this short tutorial to get introduced to the most important functions within Quarto. Then experiment with different options from the Quarto cheatsheet.
Create a Quarto template in RStudio, and render it into a document (markdown, PDF, docx or other format). In case you are new to Quarto, its documentation provides guidelines to use Quarto with the R language (here) and the RStudio IDE (here).
Further examples and tips for Quarto are available in this online tutorial on interactive reproducible reporting.
28.3.2 Additional material on Rmarkdown
Being able to use Quarto in R partly relies on your previous knowledge of Rmarkdown. The following resources can help you get familiar with Rmarkdown:
Figure sources:
Original article
- Huang R et al. (2021) TreeSummarizedExperiment: a S4 class for data with hierarchical structure. F1000Research 9:1246. (Huang et al. 2021)
Reference Sequence slot extension
- Lahti L et al. (2020) Upgrading the R/Bioconductor ecosystem for microbiome research. F1000Research 9:1464 (slides).
28.3.3 Further reading
The following online books provide good general data science background:
- Data science basics in R
- Modern Statistics for Modern Biology open access book (Holmes S, Huber W)
- The Bioconductor project (background on the Bioconductor project; Carpentries workshop)