31  Resources

31.1 Data containers

31.1.1 Data container documentation

Lun, Aaron, and Davide Risso. 2020. SingleCellExperiment: S4 Classes for Single Cell Data.
Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2020. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.
Huang, Ruizhu. 2020. TreeSummarizedExperiment: A S4 Class for Data with Tree Structures.

31.1.2 Other relevant containers

  • DataFrame, which behaves similarly to data.frame but is efficient and fast when used with large datasets.
  • DNAString, along with DNAStringSet, RNAString and RNAStringSet, which offer efficient storage and handling of long biological sequences within the Biostrings package (Pagès et al. 2020).
  • GenomicRanges (Lawrence et al. 2013), which offers an efficient representation and manipulation of genomic annotations and alignments; see e.g. GRanges and GRangesList in An Introduction to the GenomicRanges Package.
Pagès, H., P. Aboyoun, R. Gentleman, and S. DebRoy. 2020. Biostrings: Efficient Manipulation of Biological Strings. https://bioconductor.org/packages/Biostrings.
Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin Morgan, and Vincent Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology 9. https://doi.org/10.1371/journal.pcbi.1003118.

NGS Analysis Basics provides a walk-through of the above-mentioned features with detailed examples.

31.1.3 phyloseq: an alternative container for microbiome data

The phyloseq package and class became the first widely used data container for microbiome data science in R. Many methods for taxonomic profiling data are readily available for this class. We provide here a short description of how phyloseq and Experiment classes relate to each other.

  • assays: This slot is similar to otu_table in phyloseq. A SummarizedExperiment object can store multiple assays, such as raw counts and transformed counts. See also Ramos et al. (2017) for storing data from multiple experiments such as RNASeq, Proteomics, etc.
  • rowData: This slot is similar to tax_table in phyloseq and stores taxonomic information.
  • colData: This slot is similar to sample_data in phyloseq and stores information related to samples.
  • rowTree: This slot is similar to phy_tree in phyloseq and stores the phylogenetic tree.

Ramos, Marcel, Lucas Schiffer, Angela Re, Rimsha Azhar, Azfar Basunia, Carmen Rodriguez Cabrera, Tiffany Chan, et al. 2017. “Software for the Integration of Multiomics Experiments in Bioconductor.” Cancer Research. https://doi.org/10.1158/0008-5472.CAN-17-0344.

In this book, you will encounter terms such as FeatureIDs and SampleIDs.

  • FeatureIDs: These are the OTU/ASV ids, which are the row names in assays and rowData.
  • SampleIDs: These are the sample ids, which are the column names in assays and the row names in colData.

Although FeatureIDs and SampleIDs are used in the text, we encourage the technical terms rownames and colnames, since they relate to the actual objects we work with.
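The slot layout and ID conventions described above can be sketched with a toy object. This is only an illustration under stated assumptions: it requires the Bioconductor package SummarizedExperiment to be installed, and all feature names, sample names, taxonomy labels and counts are made up for the example.

```r
# Minimal sketch; assumes the Bioconductor package SummarizedExperiment is installed.
library(SummarizedExperiment)

# Toy count matrix: 3 features (rows) x 2 samples (columns); values are made up.
counts <- matrix(
  c(10, 0, 5, 3, 8, 1),
  nrow = 3,
  dimnames = list(c("ASV1", "ASV2", "ASV3"), c("S1", "S2"))
)

# Feature-level (taxonomy) and sample-level metadata, keyed by the same IDs.
taxonomy <- DataFrame(
  Phylum = c("Firmicutes", "Bacteroidetes", "Firmicutes"),
  row.names = rownames(counts)
)
samples <- DataFrame(
  Group = c("control", "case"),
  row.names = colnames(counts)
)

se <- SummarizedExperiment(
  assays  = list(counts = counts),   # comparable to otu_table
  rowData = taxonomy,                # comparable to tax_table
  colData = samples                  # comparable to sample_data
)

# A second assay can sit alongside the raw counts (here: relative abundance).
assay(se, "relabundance") <- sweep(counts, 2, colSums(counts), "/")

assayNames(se)   # names of the stored assays
rownames(se)     # FeatureIDs: row names of assays and rowData
colnames(se)     # SampleIDs: column names of assays, row names of colData
```

Because every slot is keyed by the same row and column names, subsetting the object (for example `se[, "S1"]`) keeps assays and metadata in sync.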

31.1.3.1 Benchmarking TreeSE with phyloseq

TreeSE objects can be converted into phyloseq objects and vice versa, so the two containers can be compared in terms of computational efficiency. TreeSE and phyloseq were benchmarked against one another using mia v1.2.3 and phyloseq v1.38.0, respectively. Five standard microbiome analysis operations were applied to four datasets of varying size with both containers. In a nutshell, TreeSE and phyloseq showed similar performance on small and medium-sized datasets for most of the operations. However, TreeSE performed more efficiently as the size of the datasets increased. Further details on these results can be found in the benchmarking repository.

31.1.3.2 Resources on phyloseq

The phyloseq container provides methods analogous to those of TreeSE. The following material can be used to familiarize yourself with these alternative methods:

McMurdie, PJ, and S Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PLoS ONE 8: e61217. https://doi.org/10.1371/journal.pone.0061217.
Callahan, Ben J., Kris Sankaran, Julia A. Fukuyama, Paul J. McMurdie, and Susan P. Holmes. 2016. “Bioconductor Workflow for Microbiome Data Analysis: From Raw Reads to Community Analyses [Version 2; Peer Review: 3 Approved].” F1000Research 5: 1492. https://doi.org/10.12688/f1000research.8986.2.

31.2 R programming resources

31.2.1 Base R and RStudio

If you are new to R, you could try swirl for a kickstart to R programming. Further support resources are available through the Bioconductor project (Huber et al. 2015).

31.2.2 Bioconductor Classes

Introduction to data analysis with R and Bioconductor; Carpentries introduction, including R & RStudio installation instructions

S4 system

The S4 class system has brought several useful features to the object-oriented programming paradigm within R, and it is widely used in R/Bioconductor packages (Huber et al. 2015).
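As a rough illustration of what the S4 system offers (formal classes with typed slots, validity checks, and generics with method dispatch), the following sketch uses only the methods package that ships with R. The class name ToyContainer and the generic nSamples are hypothetical names invented for this example; they do not come from any Bioconductor package.

```r
# Base R only: the S4 system lives in the 'methods' package shipped with R.
library(methods)

# Define a class with typed slots ('ToyContainer' is a hypothetical name).
setClass(
  "ToyContainer",
  slots = c(counts = "matrix", samples = "data.frame")
)

# The validity function is run when an object is created with new().
setValidity("ToyContainer", function(object) {
  if (ncol(object@counts) != nrow(object@samples)) {
    return("number of count columns must equal number of sample rows")
  }
  TRUE
})

# A generic plus a method that dispatches on the class of 'x'.
setGeneric("nSamples", function(x) standardGeneric("nSamples"))
setMethod("nSamples", "ToyContainer", function(x) ncol(x@counts))

obj <- new(
  "ToyContainer",
  counts  = matrix(1:6, nrow = 3),            # 3 features x 2 samples
  samples = data.frame(group = c("a", "b"))   # one row per sample
)

nSamples(obj)   # number of samples in the toy object
is(obj)         # class of the object
```

Creating an object whose slots disagree, for example two count columns with three sample rows, fails at construction time, which is the kind of consistency guarantee the Bioconductor containers rely on.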

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21. http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html.
  Online Document:
Chambers, JM. 2006. “How S4 Methods Work.” Technical report.
  Books:
  • John M. Chambers. Software for Data Analysis: Programming with R. Springer, New York, 2008. ISBN-13 978-0387759357 (J. M. Chambers 2008)
  • Robert Gentleman. R Programming for Bioinformatics. Chapman & Hall/CRC, New York, 2008. ISBN-13 978-1420063677 (Gentleman 2008)
Chambers, John M. 2008. Software for Data Analysis: Programming with R. Vol. 2. Springer.
Gentleman, Robert. 2008. R Programming for Bioinformatics. CRC Press.

31.3 Reproducible reporting with Quarto

31.3.1 Learn Quarto

Reproducible reporting is the starting point for robust interactive data science. Perform the following tasks:

  • If you are entirely new to Quarto, take this short tutorial to get introduced to the most important functions within Quarto. Then experiment with different options from the Quarto cheatsheet.

  • Create a Quarto template in RStudio, and render it into a document (markdown, PDF, docx or other format). In case you are new to Quarto, its documentation provides guidelines to use Quarto with the R language (here) and the RStudio IDE (here).

  • Further examples and tips for Quarto are available in this online tutorial to interactive reproducible reporting.
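The create-and-render task listed above can also be sketched from the R console rather than through the RStudio menus. The example below is a minimal sketch under several assumptions: the file name and document content are illustrative, and rendering is attempted only if the quarto R package and the Quarto command-line tool are both available (rendering an R code cell additionally needs knitr and rmarkdown).

```r
# Write a minimal Quarto document to a temporary file (content is illustrative).
fence <- strrep("`", 3)   # code-cell delimiter, built here to keep the example tidy
qmd <- file.path(tempdir(), "report.qmd")
writeLines(c(
  "---",
  "title: \"Example report\"",
  "format: html",
  "---",
  "",
  "## Summary",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), qmd)

# Render only when the 'quarto' R package and the Quarto CLI can be found.
if (requireNamespace("quarto", quietly = TRUE) &&
    !is.null(quarto::quarto_path())) {
  quarto::quarto_render(qmd)
}
```

Changing the `format` field in the header (for example to `pdf` or `docx`) is one way to experiment with other output types.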

31.3.2 Additional material on Rmarkdown

Being able to use Quarto in R partly relies on your previous knowledge of Rmarkdown. The following resources can help you get familiar with Rmarkdown:

Figure sources:

Original article

  • Huang R et al. (2021) TreeSummarizedExperiment: a S4 class for data with hierarchical structure. F1000Research 9:1246. (Huang et al. 2021)
Huang, Ruizhu, Charlotte Soneson, Felix G. M. Ernst, et al. 2021. “TreeSummarizedExperiment: A S4 Class for Data with Hierarchical Structure [Version 2; Peer Review: 3 Approved].” F1000Research 9: 1246. https://doi.org/10.12688/f1000research.26669.2.

Reference Sequence slot extension

31.3.3 Further reading

The following online books provide good general data science background:

  • Data science basics in R (https://r4ds.had.co.nz)
  • Modern Statistics for Modern Biology (https://www.huber.embl.de/msmb/), open access book (Holmes S, Huber W)
  • The Bioconductor project (background on the Bioconductor project; Carpentries workshop)