# setup
library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
# this line sets some options for all the chunks (global chunk options)
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
24 Exercises
Here you can find assignments on different topics.
Tips for exercises:
- Add comments that explain what each line or block of code does. This helps you and others understand your code and find bugs; it also makes the code easier to reuse and promotes transparency.
- Interpret the results by comparing them to the literature. List the main findings so that the results can be understood by others without advanced knowledge of data analytics.
- Avoid hard-coding. Use variables that get their values at the beginning of the pipeline. That way it is easier to modify parameters and reuse the code, as sketched below.
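For instance, a minimal sketch of the last tip, reusing the `tse` object from the setup chunk above (the variable names and values are only illustrative):
# Define analysis parameters once, at the top of the script
detection_threshold <- 1      # minimum count for a taxon to be considered detected
prevalence_threshold <- 0.1   # minimum fraction of samples
# Reuse them downstream instead of repeating literal values
tse_prev <- subsetByPrevalent(tse,
    detection = detection_threshold, prevalence = prevalence_threshold)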
24.1 Basics of R/Bioconductor
Bioconductor training material has been contributed to Carpentries. You can check the following lessons for basic background on R and Bioconductor.
24.2 Workflows
24.2.1 Reproducible reporting with Quarto
The following batch of exercises walks you through typical use cases of Quarto in RStudio. Before heading to the exercises, it is recommended to read the Quarto guidelines for RStudio.
24.2.1.1 New document
This exercise gets you started with creating a Quarto document and adding text to it with typing conventions borrowed from the markdown syntax. Feel free to render the document with the Render button after each step to see the changes in the final report.
- Open RStudio and create a new Quarto file named `My first Quarto`.
- Add the section heading `My first section` and write some text of your choice underneath. You can choose the level of a heading by the number of preceding hashes (`#`).
- Add a subsection named `List of items` and list three items underneath, both ordered and unordered. You can initialize items with numbers (`1.`, `2.`, `3.`, …) or stars (`*`) for the ordered and unordered case, respectively.
- Add another subsection named `Link to web` and add a clickable link to the OMA book, using the pattern `[text](url)`.
- Render the document and check its appearance.
Nice start! You are now able to create a Quarto document, understand its syntax and can render it into a reproducible report. If you got stuck, you can look up the docs on creating and rendering Quarto documents.
24.2.1.2 Code chunks
While customizable text is nothing new by itself, the advantage of Quarto (and previously Rmarkdown) is to combine text with code in R or other programming languages, so that both the analytical pipeline and verbal description can be put in one place. In this exercise, we learn how to write and run code in Quarto.
- Open RStudio and create a new Quarto file (click on File tab - New File - Quarto document).
- Initialize a code chunk by pressing `Ctrl` + `Alt` + `I` (`Cmd` + `Option` + `I` on macOS) and define the variables `A <- "my name"` and `B <- 0` in it.
- Write the text `Below is my first code chunk` just above the code chunk.
- Initialize one more code chunk and add 100 to the variable `B` in it.
- Write the text `Below I change variable B` just above the second chunk.
- Extra: Write the following line of text: `my name is A and I am B years old`, where A and B are the values of the variables defined in the code chunks upstream and change if those variables are modified. Be aware that inline code can be added as `` `r my_inline_command` ``.
Good job. You are now able to combine text and code in a Quarto document. If you got stuck, you can refer to the Quarto docs on using code chunks.
24.2.1.3 Knitr options
Code chunks can be greatly customized in terms of visibility, execution, output size, location and much more. This is possible with the knitr chunk options, which usually appear at the beginning of the target chunk with the syntax `#| option-name: value`, also described here. In this exercise, we explore part of the knitr potential.
- Open RStudio and create a new Quarto file.
- Initialize three code chunks and label them as `setup`, `fig-box` and `tbl-coldata`, respectively. Remember that the name of a chunk can be specified with the `label` option.
- Write some appropriate code in each chunk and render the document (a hypothetical filling is sketched at the end of this exercise).
- Set `include: false` in the `setup` chunk, `fig-width: 10` in the `fig-box` chunk and `echo: false` in the `tbl-coldata` chunk. Render the document again and find the differences from before.
The setup code and output are no longer visible, resulting in a cleaner document without initialization details. Figures in the fig-box chunk are now wider, providing better visibility and detail in the graphics. The code in the tbl-coldata chunk is hidden, making the document more concise and focused on the results rather than the code implementation.
- Add the options `fig-cap` and `tbl-cap` to the `fig-box` and `tbl-coldata` chunks, respectively. They require some text input, which makes for the caption of the figure or table.
- Extra: Create a cross-reference to `fig-box` and `tbl-coldata` in the text above the respective code chunks. You can do that with the syntax `@chunk-name`.
- Extra: Define a custom folder for storing figures with `fig.path`. Insert it in `knitr::opts_chunk$set`, so that it applies globally to all the figures generated in the document.
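The original chunk contents are not shown here, so below is one hypothetical filling; each `#| label:` block belongs in its own chunk, and the plotted and tabulated columns assume GlobalPatterns:
#| label: setup
library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

#| label: fig-box
boxplot(colSums(assay(tse, "counts")) ~ tse$SampleType)

#| label: tbl-coldata
knitr::kable(head(as.data.frame(colData(tse))))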
Congratulations! You are now familiar with the great potential and flexibility of knitr chunk options. An exhaustive list of available options can be found in the knitr documentation.
24.2.1.4 YAML instructions
The box at the beginning of every Quarto document contains yaml options that let you define the metadata of the document. They will affect the appearance of the document when it is rendered. By default, the box includes yaml options for the title, format and editor to be used, but much more information on layout, code execution and figures can be specified. A comprehensive list of yaml options is available here. In this exercise, we will get a tiny taste of such functionality.
- Open RStudio and create a new Quarto file.
- In the yaml box at the beginning of the document, change the title from `Untitled` to `My first Quarto`.
- In the same box, add the two lines `author` and `date`, followed by your name and today's date, respectively.
- Render the document and check its appearance.
- Extra: Set `toc: true` to produce a table of contents. This line should follow `format` and `html` at the second level of indentation.
Well done! Now you are able to specify yaml options and understand how they affect your Quarto document. If you got stuck, you can check this section of the Quarto documentation.
24.2.1.5 Quarto parameters
An advanced feature of Quarto consists of execution parameters, which are externally pre-defined variables that are also accessible in the Quarto document. They can be specified in the yaml box under `params`. Here we learn how to use them.
- Open RStudio and create a new Quarto file.
- In the yaml box at the beginning of the document, add a line named `params`, followed by an indented line with `gamma: 10`.
- Initialize a code chunk and type `str(params$gamma)` in it.
- Render the document and check what happened.
- Define one more parameter `beta: 3` and multiply `gamma` by `beta` in a code chunk below.
- Render the document again and check what happened.
Well done! You can now use an advanced feature of Quarto such as parameters. If you got stuck, here you can find more information about parameter definition and usage.
24.3 Data containers: TreeSE
TreeSE containers represent the working unit of the mia package. In the following exercises we learn how to construct, explore and work with them. A few demo datasets can be imported with mia and accessed as explained in Section 2.3.
24.3.1 Constructing a data object
Here we cover how to construct a TreeSE from CSV files, using the components of OKeefeDSData from the microbiomeDataSets package as an example dataset.
- Fetch or download the files in this directory.
Show solution
- Read in the csv files with `read.csv` and store them into the variables `okeefe_assay`, `okeefe_rowdata` and `okeefe_coldata`, respectively (as sketched below).
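A minimal sketch of the reading step, assuming the downloaded files are named after their components (the file names are hypothetical):
okeefe_assay <- as.matrix(read.csv("okeefe_assay.csv", row.names = 1))
okeefe_rowdata <- read.csv("okeefe_rowdata.csv", row.names = 1)
okeefe_coldata <- read.csv("okeefe_coldata.csv", row.names = 1)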
- Create a TreeSE from the individual components with `TreeSummarizedExperiment`. Note that the function requires three arguments: assays, rowData and colData, to which you can give the appropriate items.
Show solution
tse <- TreeSummarizedExperiment(assays = list(counts = okeefe_assay),
                                colData = okeefe_coldata,
                                rowData = okeefe_rowdata)
- Check that the import was done correctly. E.g., choose random samples and features, and check that their values match between the raw files and the TreeSE.
Show solution
okeefe_assay[1:5, 1:7]
assay(tse, "counts")[1:5,1:7]
Useful functions: DataFrame, TreeSummarizedExperiment, matrix, rownames, colnames, SimpleList
24.3.2 Importing data
Raw data of different types can be imported as a TreeSE with a number of functions explained in Section 2.4.2. You can also check the function reference in the mia package.
- Get familiar with the microbiome data repository and read the instructions in its README to import and construct datasets from there.
- Import data by using mia. (functions: importMetaPhlAn | importMothur | importQIIME2 | convertFromBIOM | convertFromDADA2 …)
- Try out conversions between TreeSE and phyloseq data containers (convertFromPhyloseq; convertToPhyloseq)
24.3.3 Preliminary exploration
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Get a summary of the TreeSE with `summary`. What is the mean count across samples? How many features recur only once (singletons)?
Show solution
summary(tse)
- Check the dimensions of the TreeSE with `dim`, or alternatively with `nrow` and `ncol`. How many samples and features are present?
Show solution
dim(tse)
- List feature and sample names with `rownames` and `colnames`, respectively.
- Check what information about samples and features is contained in the `colData` and `rowData` of the TreeSE with `names`.
- Extra: Calculate the number of unique taxa at each taxonomic rank. You can use `apply` to count the unique elements in each column of the rowData.
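A possible solution for the extra step:
# Number of unique taxa in each taxonomic rank column of the rowData
apply(as.data.frame(rowData(tse))[, taxonomyRanks(tse)], 2,
      function(x) length(unique(x)))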
24.3.4 Assay retrieval
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- List the names of all available assays with `assayNames`.
Show solution
assayNames(tse)
- Fetch the list of assays with `assays`.
Show solution
assays(tse)
- Retrieve the first assay of the TreeSE with `assay`, where the second argument can be either the name or the index of the desired assay.
Show solution
assay(tse, "counts")[1:5,1:7]
Well done! You can now locate and retrieve individual assays of a TreeSE. If you got stuck, you can refer to Section 2.2.1 of this book.
24.3.5 Sample information
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Check the names of the samples with `colnames`.
Show solution
colnames(tse)
- List the information on samples available in the colData with `names`.
- Visualize the colData with `View` and briefly look at the information stored in the different columns.
- Get the abundances of all features for a specific sample, such as `ID34`, from an assay of your choice.
24.3.6 Feature information
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Check the names of the features with `rownames`.
Show solution
rownames(tse)
- List the information on features available in the rowData with `names`.
- Visualize the rowData with `View` and briefly look at the information stored in the different columns.
- Get the abundances of a specific feature, such as `OTU1810`, across all samples. You can access feature-specific abundances from an assay of your choice.
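A one-line sketch; the feature name is taken from the exercise and may not exist in every dataset:
# Abundances of one feature across all samples from the counts assay
assay(tse, "counts")["OTU1810", ]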
If you got stuck, you can look up Chapter 3 and Section 5.2.1 on how to pick specific abundances and generate row trees, respectively.
24.3.7 Other elements
Extract some of the other TreeSE elements listed in Chapter 2. Note that such data are not always included.
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 and store it into a variable named `tse`.
- Fetch the metadata of the TreeSE. Is there any information available?
Show solution
names(metadata(tse))
- Access the phylogenetic tree with `rowTree`. How big is it in terms of tips and nodes? If you like, you can visualize it with `ggtree`.
Show solution
rowTree(tse)
- Check if a sample tree is available with `colTree`; such a tree is suitable for hierarchical or nested study designs.
Show solution
colTree(tse)
- If present, obtain the information on feature DNA sequences from the DNA sequence slot with `referenceSeq`.
Show solution
referenceSeq(tse)
24.4 Data manipulation
24.4.1 Subsetting
In this exercise, we’ll go over the basics of subsetting a TreeSE object. Keep in mind that it’s good practice to always keep the original data intact. Therefore, we suggest creating a new TreeSE object whenever you subset.
Please have a try at the following exercises.
- Subset the TreeSE object to specific samples and features.
1.1. Find some samples and features to subset with.
1.2. Subset the TreeSE with your found samples and features.
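A minimal sketch, assuming GlobalPatterns (where the rowData has a Phylum column and the colData a SampleType column):
# Keep features assigned to Firmicutes and faecal samples only
tse_sub <- tse[rowData(tse)$Phylum %in% "Firmicutes", tse$SampleType %in% "Feces"]
dim(tse_sub)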
24.4.2 Library sizes
- Calculate library sizes
- Subsample / rarefy the counts (see: rarefyAssay)
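A sketch of both steps; `rarefyAssay` is available in recent versions of mia:
# Library size = total sum of counts in each sample
libsizes <- colSums(assay(tse, "counts"))
summary(libsizes)
# Subsample the counts to an even depth; the result is stored as a new assay
tse <- rarefyAssay(tse, assay.type = "counts")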
Useful functions: nrow, ncol, dim, summary, table, quantile, unique, scater::addPerCellQC, mia::agglomerateByRank
24.4.3 Prevalent and core taxonomic features
- Estimate prevalence for your chosen feature (a specific row or taxonomic group)
Show solution
library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
# First, agglomerate the features to the Phylum rank
tse_phylum <- agglomerateByRank(tse, "Phylum")
# Calculate relative abundance
tse_phylum <- transformAssay(tse_phylum, method = "relabundance")
# Then get the prevalence for a specific feature
res <- getPrevalence(tse_phylum, detection = 1/100, sort = TRUE, assay.type = "relabundance")
res["Proteobacteria"]
- Identify all prevalent features and subset the data accordingly.
Show solution
# Get the prevalent features
prevalent <- getPrevalent(tse, assay.type = "counts", detection = 1, prevalence = 0.1)
# Subset the data to the prevalent features
tse_prevalent1 <- tse[rownames(tse) %in% prevalent, ]
# Alternatively, subsetByPrevalent can be used to subset directly
tse_prevalent2 <- subsetByPrevalent(tse, assay.type = "counts", detection = 1, prevalence = 0.1)
identical(tse_prevalent1, tse_prevalent2)
- How many features are left?
Show solution
nrow(tse_prevalent1)
- Recalculate the most prevalent features based on relative abundance. How many are left?
Show solution
tse <- transformAssay(tse, MARGIN = "cols", method="relabundance")
tse_prevalent <- subsetByPrevalent(tse, assay.type = "relabundance", detection = 0.01/100, prevalence = 50/100)
nrow(tse_prevalent)
- Visualize prevalence of most prevalent Phyla using a violin/bean plot.
Show solution
# Import the scater library
library(scater)
# Agglomerate based on prevalence. Agglomerate to Phylum level.
tse_phylum <- subsetByPrevalent(tse, rank = "Phylum", assay.type = "relabundance", detection = 0.001, prevalence = 0.2)
# Store the prevalence of each taxon
res <- getPrevalence(tse_phylum, detection = 0.001, sort = FALSE, assay.type = "relabundance")
rowData(tse_phylum)$prevalence <- res
# Then plot the row data
plotRowData(tse_phylum, "prevalence", colour_by = "Phylum")
Useful functions: getPrevalence, getPrevalent, subsetByPrevalent
24.4.4 Data exploration
- Summarize the sample metadata variables. How many age groups are there, and how are they distributed? What are the 0%, 25%, 50%, 75%, and 100% quantiles of library size?
- Create two histograms: one showing the distribution of absolute counts, the other showing how the CLR-transformed values are distributed.
- Visualize how relative abundances are distributed between taxa in samples.
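A sketch of the histogram step (the CLR transformation needs a `pseudocount` because of the zero counts):
# Histogram of absolute counts
hist(assay(tse, "counts"))
# Histogram of CLR-transformed values
tse <- transformAssay(tse, method = "clr", pseudocount = 1)
hist(assay(tse, "clr"))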
Useful functions: nrow, ncol, dim, summary, table, quantile, unique, transformAssay, ggplot, wilcox.test, agglomerateByRank, miaViz::plotAbundance
24.4.5 Other functions
- Merge data objects (merge, mergeSEs)
- Melt the data for visualization purposes (meltSE)
24.4.6 Assay transformation
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Transform the counts assay into relative abundances with `transformAssay` and store it in the TreeSE as an assay named `relabund` (see Section 5.4).
- Similarly, perform a CLR transformation on the counts assay with a `pseudocount` of 1 and add it to the TreeSE as a new assay.
- List the available assays by name with `assays`.
- Access the clr assay and select a subset of its first 100 features and 10 samples. Remember that assays can be subsetted with `assay[row_idx, col_idx]`.
- Take the same subset of the TreeSE, and check how this affects the individual transformed assays. A TreeSE can also be subsetted with `tse[row_idx, col_idx]`.
- Extra: If the data has a phylogenetic tree, perform the phILR transformation.
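A sketch of the main steps:
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance", name = "relabund")
tse <- transformAssay(tse, assay.type = "counts", method = "clr", pseudocount = 1)
assayNames(tse)
# Subset an individual assay...
assay(tse, "clr")[1:100, 1:10]
# ...or the whole TreeSE, which subsets all assays at once
tse_sub <- tse[1:100, 1:10]
assayNames(tse_sub)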
24.5 Abundance tables
24.5.1 Taxonomic levels
- Import the mia package, load one of the example data sets mentioned in Chapter 3.3 with `data` (you need one with taxonomic information at Phylum level) and store it into a variable named `tse`.
- List the available taxonomic ranks in the data with `taxonomyRanks`.
- Agglomerate the data to Phylum level with `agglomerateByRank` and the appropriate value for `rank`.
- Report the dimensions of the TreeSE before and after agglomerating. You can use `dim` for that.
- Extra: Perform a CLR transformation on the data. Does this affect agglomeration?
- Extra: List the full taxonomic information for a few selected taxa, such as `OTU1` and `OTU1368`. For that you can use `mapTaxonomy` on a specific subset of the TreeSE.
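A sketch of the agglomeration steps (the OTU names are taken from the exercise and assume they exist in the chosen dataset):
taxonomyRanks(tse)
tse_phylum <- agglomerateByRank(tse, rank = "Phylum")
dim(tse)
dim(tse_phylum)
# Extra: full taxonomy of selected taxa
mapTaxonomy(tse[c("OTU1", "OTU1368"), ])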
24.5.2 Alternative experiments
- Import the mia package, load one of the example data sets mentioned in Chapter 3.3 with `data` (you need one with taxonomic information) and store it into a variable named `tse`. Can you find the same taxonomy ranks in the rowData?
- Check the taxonomic ranks of the features with `taxonomyRanks`. What is the deepest taxonomic rank available?
- Agglomerate the TreeSE to each taxonomic rank and store the resulting experiments as altExps. This can be performed automatically with `agglomerateByRanks`.
- Check the names of the generated altExps with `altExpNames` and retrieve a complete list with `altExps`.
- Retrieve the data agglomerated by genus from the corresponding `altExp`. As with assays, you can access the desired altExp by name or index.
- Extra: Split the data based on other features with `splitOn`.
24.5.3 Beta Diversity
This exercise focuses on calculating dissimilarities, or beta diversity.
- Import the mia and vegan packages and load a dataset. This can be your own or one of the example datasets built into mia.
- Agglomerate the data to all available taxonomic ranks by using `agglomerateByRanks()`.
Show solution
tse <- agglomerateByRanks(tse)
altExpNames(tse)
- Import the scater library and run PCoA on one of the created alternative experiments using Bray-Curtis dissimilarity.
Show solution
# We use scater library for calculating ordination
library(scater)
# Choose rank
rank <- "Genus"
# Calculate relative abundance
altExp(tse, rank) <- transformAssay(altExp(tse, rank), MARGIN = "cols", method="relabundance")
# Calculate ordination
altExp(tse, rank) <- runMDS(
altExp(tse, rank),
FUN = vegan::vegdist,
method = "bray",
assay.type = "relabundance",
name = "MDS_bray")
- Plot the first two coordinates with the `plotReducedDim` function.
Show solution
p <- plotReducedDim(altExp(tse, rank), "MDS_bray")
p
- Add the explained variance ratios for each coordinate on the axes labels and plot the PCoA again.
Show solution
# Calculate explained variance
e <- attr(reducedDim(altExp(tse, rank), "MDS_bray"), "eig")
rel_eig <- e / sum(e[e > 0])
# Add explained variance for each axis
p <- p +
labs(
x = paste("PCoA 1 (", round(100 * rel_eig[[1]], 1), "%", ")", sep = ""),
y = paste("PCoA 2 (", round(100 * rel_eig[[2]], 1), "%", ")", sep = ""))
p
24.6 Community (alpha) diversity
24.6.1 Estimation
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Calculate multiple alpha diversity indices with `estimateDiversity` without any additional arguments.
- Check the names of the colData with `names`. Can you identify which columns contain the alpha diversity indices?
- Extra: Agglomerate the TreeSE by phylum and compare the mean Shannon diversity of the original experiment with its agglomerated version. You can use `agglomerateByRank` to perform the agglomeration and `mean` to calculate the mean values of the respective columns in the colData.
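A sketch, using the function names given above:
tse <- estimateDiversity(tse)
names(colData(tse))
# Extra: mean Shannon diversity before and after phylum-level agglomeration
tse_phylum <- estimateDiversity(agglomerateByRank(tse, rank = "Phylum"))
mean(tse$shannon)
mean(tse_phylum$shannon)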
24.6.2 Visualization
- Import the mia and scater packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Calculate the Shannon diversity index and Faith's phylogenetic diversity with `estimateDiversity` and the appropriate arguments for `index`.
- Make a boxplot with Shannon diversity on the y axis and sample type on the x axis using `scater::plotColData`.
- Repeat the previous point with Faith's phylogenetic diversity and compare the sample distributions of the two alpha diversity indices. How greatly do they differ?
- Extra: Make a scatterplot with Shannon diversity on the y axis and Faith's phylogenetic diversity on the x axis with `plotColData`. Colour the points by sample type with the appropriate optional argument.
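A sketch, assuming a dataset with a SampleType column and a row tree, such as GlobalPatterns:
library(scater)
tse <- estimateDiversity(tse, index = c("shannon", "faith"))
# Boxplots of alpha diversity by sample type
plotColData(tse, y = "shannon", x = "SampleType")
plotColData(tse, y = "faith", x = "SampleType")
# Extra: scatterplot of the two indices
plotColData(tse, y = "shannon", x = "faith", colour_by = "SampleType")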
24.6.3 Correlation
- Import the mia and scater packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Calculate the coverage and Shannon diversity indices with `estimateDiversity` and the appropriate arguments for `index`.
- Test the correlation between the two indices with `cor.test`. Remember that colData parameters are accessible with `tse$param_name`. Use Kendall tau coefficients as the `method` to measure correlation. Is the correlation weak or strong, significant or not?
- Make a scatterplot with the Shannon diversity index on the y axis and coverage on the x axis. You can do that with `plotColData`. How do the two indices relate to one another?
- Extra: Compute the library size of the samples by applying `colSums` to the counts assay of the TreeSE, and test the correlation of library size with Shannon diversity or coverage. Which index is more correlated with library size?
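A sketch of the correlation steps (the index names follow the exercise text):
tse <- estimateDiversity(tse, index = c("coverage", "shannon"))
# Kendall correlation between the two indices
cor.test(tse$coverage, tse$shannon, method = "kendall")
plotColData(tse, y = "shannon", x = "coverage")
# Extra: correlation with library size
tse$libsize <- colSums(assay(tse, "counts"))
cor.test(tse$libsize, tse$shannon, method = "kendall")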
In this example, we inspected the correlation between two related variables, also known as multicollinearity, and checked the correlation with library size, which is part of quality control. However, the correlation between alpha diversity and other numerical data about samples, such as participants' age and weight, also represents an important analysis in several studies.
24.6.4 Differences between groups
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Calculate the Gini-Simpson diversity with `estimateDiversity` and the appropriate argument for `index`. Set `name` to `simpson`. You will use this name to access the diversity index from the colData.
- Inspect the Diet column in the colData. Determine how the samples are grouped in terms of diet. You can see the number of unique elements in a column with `unique`.
- Test differences in Gini-Simpson diversity between different diets with `kruskal.test`. Remember that colData parameters are accessible with `tse$param_name`.
- Is diversity significantly different between the vegan and mixed diets? To visualize that, make a boxplot with Gini-Simpson diversity on the y axis and diet on the x axis with `plotColData`.
- Extra: Repeat points 3 through 5, this time for age groups. Make sure that you use a statistical test appropriate for the number of groups and their distribution.
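A sketch, assuming a dataset with a Diet column, such as peerj13075:
tse <- estimateDiversity(tse, index = "gini_simpson", name = "simpson")
# How are the samples grouped in terms of diet?
unique(tse$Diet)
# Test for differences between the diet groups
kruskal.test(tse$simpson, as.factor(tse$Diet))
# Visualize the group distributions
plotColData(tse, y = "simpson", x = "Diet")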
24.7 Community similarity
24.7.1 Reduced dimensions retrieval
- Import the mia package, load enterotype with `data` and store it into a variable named `tse`.
- List all available reduced dimensions with `reducedDims`. At this point, no reducedDims are likely found, because we haven't created any yet.
- Perform PCA using the scater library and store its output in the TreeSE by running `tse <- runPCA(tse, assay.type = "counts")`. Note that it is required to specify the assay on which the dimensionality reduction should be conducted.
- View the names of all available reduced dimensions with `reducedDimNames`. Has something new appeared?
- Extra: Access the `PCA` reducedDim object with `reducedDim` and explore its content. How are the different dimensions stored? Extract an array with only the values from the second dimension by indexing the object with `[ , 2]`.
24.7.2 Visualization basics with PCA
- Import the mia and scater packages, load enterotype with `data` and store it into a variable named `tse`.
- Perform a 3-component PCA based on the `counts` assay. You can use `runPCA` and set the optional arguments `ncomponents` and `assay.type` to the appropriate values.
- Plot the first two dimensions of the PCA with `scater::plotReducedDim`, to which you should give the appropriate reducedDim name as the second argument. Note that by default only the first two dimensions are shown.
- Check which information is stored in the colData of the TreeSE. What would be worth visualizing in our ordination plot?
- Make the same plot again, but this time colour the observations by Enterotype. You can do that by setting `colour.by` to the appropriate column name in the colData of the TreeSE.
- Extra: Plot all three dimensions of the PCA with `scater::plotReducedDim` and the optional argument `ncomponents`. Colour the observations by Enterotype. Which pair of dimensions best explains the variance between Enterotypes?
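A sketch of the PCA steps:
library(mia)
library(scater)
data("enterotype", package = "mia")
tse <- enterotype
# 3-component PCA on the counts assay
tse <- runPCA(tse, ncomponents = 3, assay.type = "counts")
# First two dimensions, then coloured by Enterotype
plotReducedDim(tse, "PCA")
plotReducedDim(tse, "PCA", colour_by = "Enterotype")
# Extra: all three dimensions
plotReducedDim(tse, "PCA", ncomponents = 3, colour_by = "Enterotype")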
24.7.3 Principal Coordinate Analysis (PCoA)
PCoA turns out to be particularly relevant for microbiome analysis, because unlike PCA it can generate reduced dimensions from distances other than Euclidean. There are several ecological distances to choose from, and you can find many of them listed as methods in the documentation of `vegan::vegdist`.
- Import the mia and scater packages, load enterotype with `data` and store it into a variable named `tse`.
- Transform the counts assay to relative abundances with `transformAssay`.
- Perform Multi-Dimensional Scaling (MDS) based on the relative abundance assay in terms of Bray-Curtis dissimilarity. You can use `scater::runMDS` with the compulsory argument `FUN = vegan::vegdist`.
- Plot the first two dimensions of the MDS with `plotReducedDim`, to which you should give the appropriate reducedDim name as the second argument. Colour the observations by Enterotype with `colour.by`.
- Extra: Perform MDS again with `scater::runMDS`, but this time use Jaccard dissimilarity. The distance metric can be defined with the optional argument `method`, choosing from the methods in `?vegan::vegdist`. If you don't want to overwrite the reducedDim object made in point 3, set `name` to a name of your choice. Visualize and compare it to the plot from point 4.
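For the extra step, a Jaccard-based MDS stored under a custom name might look like this:
tse <- runMDS(tse, FUN = vegan::vegdist, method = "jaccard",
    assay.type = "relabundance", name = "MDS_jaccard")
plotReducedDim(tse, "MDS_jaccard", colour_by = "Enterotype")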
Good job! You are now able to produce and visualize reduced dimensions of a TreeSE. `runMDS` is actually one of several algorithms for PCoA and dimensionality reduction, which you can find in Section 7.2.2.
24.7.4 PERMANOVA analysis
In this exercise we focus on studying the weight of variables on the microbiome composition. The significance of each variable on beta diversity is tested with PERMANOVA (point 4), and the homogeneity assumption is also controlled with a PERMDISP2 analysis (point 5).
- Import the mia and vegan packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Transform the `counts` assay into relative abundances with `transformAssay`.
- Extract the relative abundance assay, transpose it and save it into a variable named `relabund_assay`.
- Perform PERMANOVA with `vegan::adonis2` to see how much Diet can explain the relative abundance assay (`formula = relabund_assay ~ Diet`) in terms of Bray-Curtis dissimilarity (`method = "bray"`). Also set `data = colData(tse)`, `by = "margin"` and `permutations = 99`. What do the results tell you about Diet with respect to beta diversity?
- Extra: Test the homogeneity of dispersion across Diet groups with `anova(betadisper(my_mat, my_groups))`, where `my_mat` is the Bray-Curtis dissimilarity matrix of `relabund_assay` calculated with `vegdist(relabund_assay, "bray")` and `my_groups` is the vector of Diet values obtained from the colData of the TreeSE.
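A sketch of points 2-5 (the Diet column assumes peerj13075; note that `adonis2` expects a plain data.frame):
library(vegan)
tse <- transformAssay(tse, method = "relabundance")
relabund_assay <- t(assay(tse, "relabundance"))
adonis2(relabund_assay ~ Diet, data = as.data.frame(colData(tse)),
    method = "bray", by = "margin", permutations = 99)
# Extra: homogeneity of dispersion across Diet groups
anova(betadisper(vegdist(relabund_assay, "bray"), colData(tse)$Diet))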
Well done! You went through testing the effect and significance of Diet on beta diversity. Keep in mind that the formula fed to `adonis2` can take more than one independent variable, so you can also (and very often should) include the covariates of your study.
24.7.5 Redundancy analysis (RDA)
Here we apply RDA, an ordination method that finds the dimensions of largest variation in the data in light of the specified variables (point 3). The results of RDA are usually assessed with PERMANOVA (point 5), and the homogeneity assumption should be checked as in the previous exercise. This is a relatively complex procedure, but breaking it down into the steps below will hopefully make it clearer.
- Import the mia and vegan packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Transform the `counts` assay into relative abundances with `transformAssay`.
- Perform RDA with `getRDA` to see how much Diet can explain the relative abundance assay (`formula = assay ~ Diet` and `assay.type = "relabundance"`) in terms of Bray-Curtis dissimilarity (`method = "bray"`).
- Extract the RDA dimensions from the appropriate reducedDim slot with `attr(reducedDim(tse, "RDA"), "rda")` and store them into `rda`.
- Test the effect and significance of Diet on beta diversity by PERMANOVA with `vegan::anova.cca`. Feed this function with `rda` and set `by = "margin"` and `permutations = 99`. What do the results tell you about Diet?
- Extra: Check what other parameters are stored in the colData of peerj13075, add them to the formula (`formula = assay ~ Diet + ...`) of `getRDA` and proceed to see how that changes the results of PERMANOVA.
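A sketch using `runRDA`, which stores the ordination in the reducedDim slot in the same way as the next exercise:
tse <- runRDA(tse, formula = assay ~ Diet,
    assay.type = "relabundance", distance = "bray")
rda <- attr(reducedDim(tse, "RDA"), "rda")
vegan::anova.cca(rda, by = "margin", permutations = 99)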
Well done! You went through an RDA analysis followed by significance testing with PERMANOVA and PERMDISP2. In the next exercise we'll dig deeper into quantifying the contributions to beta diversity.
24.7.6 Beta diversity analysis
In this exercise, we’ll ask you to implement a distance-based Redundancy Analysis (dbRDA) to understand the relationships between microbial community composition and various environmental or experimental factors within your dataset. You can refer to Section 7.4.1 for a step-by-step walkthrough, which may be simplified in the future.
- Import the mia and vegan packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Create a new assay with the relative abundance for your dataset and merge the features by a rank of your choice.
Show solution
tse <- transformAssay(tse, MARGIN = "cols", method="relabundance")
- Select a group of samples and perform a permutational analysis on the calculated relative abundance using Bray-Curtis dissimilarities
Show solution
tse$Group <- tse$SampleType == "Feces"
- Implement dbRDA on relative abundances and perform another permutational analysis on the redundancy analysis.
Show solution
tse <- runRDA(
tse,
assay.type = "relabundance",
formula = assay ~ Group,
distance = "bray",
na.action = na.exclude)
rda_info <- attr(reducedDim(tse, "RDA"), "significance")
rda_info
`rda_info` now holds information on the variation between the chosen groups. In our case, the homogeneity test has a p-value of 0.01, indicating that the groups have unequal variances; the PERMANOVA results are therefore compromised and cannot reliably tell whether there are significant differences between the groups.
- Nevertheless, extract the p-values. Can the differences between samples be explained with variables from their metadata?
Show solution
rda_info[["permanova"]][["Pr(>F)"]]
Useful functions: scater::runMDS, runRDA, transformAssay, agglomerateByRank
24.8 Differential abundance
24.8.1 Standard analysis with ALDEx2
- Import the mia and ALDEx2 packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Agglomerate the TreeSE by genus and filter by a prevalence of 10%. You can perform both operations in one go with `subsetByPrevalent` by specifying the `rank` and `prevalence` arguments.
Show solution
tse <- subsetByPrevalent(tse,
rank = "genus",
prevalence = 0.1)
- Model the counts assay of the TreeSE with `aldex.clr` and store it into the variable `x`. As a second argument, provide the grouping variable `Diet`, which is contained in a column of the colData.
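A possible solution:
library(ALDEx2)
# Model the counts, using Diet from the colData as the grouping variable
x <- aldex.clr(assay(tse, "counts"), tse$Diet)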
- Feed `x` to the functions `aldex.ttest` to perform t-tests and `aldex.effect` to estimate effect sizes. Store the outputs into `x_tt` and `x_effect`, respectively.
Show solution
x_tt <- aldex.ttest(x)
x_effect <- aldex.effect(x, CI = TRUE)
- Create a data.frame named `aldex_out` which includes both `x_tt` and `x_effect`, and filter for the features with `wi.eBH < 0.05`. Are there any significantly differentially abundant taxa?
Show solution
# rownames_to_column and filter require the tibble and dplyr packages
library(tibble)
library(dplyr)
aldex_out <- data.frame(x_tt, x_effect)
aldex_out %>%
  rownames_to_column(var = "Genus") %>%
  # Keep only the significant features
  filter(wi.eBH < 0.05) %>%
  dplyr::select(Genus, we.eBH, wi.eBH, effect, overlap)
- Extra: If these results appear boring, repeat steps 1 - 5, but use `Gender` or `Age` as the grouping variable. Do we have any better luck with Gender? What is the problem with Age?
24.8.2 Controlling for confounders
- Import the mia and MicrobiomeStat packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Agglomerate the TreeSE by genus and filter by a prevalence of 10%. You can perform both operations in one go with `subsetByPrevalent` by specifying the `rank` and `prevalence` arguments.
Show solution
tse <- subsetByPrevalent(tse,
rank = "genus",
prevalence = 0.1)
- Model the counts assay of the TreeSE with `linda` and store the output into a variable named `linda_out`. Provide the colData converted into a data.frame (with `as.data.frame`) as the second argument, and a `formula` with Age, Gender and Diet as variables. For example, `formula = "~ A + B"` represents a formula with variables A and B.
Show solution
linda_out <- linda(as.data.frame(assay(tse, "counts")),
as.data.frame(colData(tse)),
formula = "~ Diet + Gender + Age")
- Extract the `output$AgeElderly` object from `linda_out` with `$` and store it into a variable named `linda_res`.
Show solution
linda_res <- linda_out$output$AgeElderly
- Filter `linda_res` for features with `reject == TRUE`. How many differentially abundant taxa were found? What are their names, and how significant are they in terms of log-fold change and adjusted p-value?
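A sketch of the filtering step (linda's output includes `reject`, `log2FoldChange` and `padj` columns):
library(dplyr)
library(tibble)
linda_res %>%
  rownames_to_column("Taxon") %>%
  filter(reject) %>%
  dplyr::select(Taxon, log2FoldChange, padj)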
24.8.3 Comparing methods
Here, we conduct DAA with the same parameters as in the previous exercise, but with a different method, namely ZicoSeq. We aim to compare the results between these two methods and draw better informed conclusions from such a comparative approach.
- Import the mia and GUniFrac packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Agglomerate the TreeSE by genus and filter by a prevalence of 10%. You can perform both operations in one go with `subsetByPrevalent` by specifying the `rank` and `prevalence` arguments.
Show solution
tse <- subsetByPrevalent(tse,
rank = "genus",
prevalence = 0.1)
- Model the counts assay of the TreeSE with `ZicoSeq`, providing it as the `feature.dat` argument, and store the output into a variable named `zicoseq_out`. Provide also the colData converted to a data.frame (with `as.data.frame`) as `meta.dat`. In addition, set `grp.name` to `"Age"`, `adj.name` to `c("Diet", "Gender")`, `feature.dat.type` to `"count"`, `return.feature.dat` to `TRUE` and `perm.no` to `999`.
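A possible solution:
library(GUniFrac)
zicoseq_out <- ZicoSeq(meta.dat = as.data.frame(colData(tse)),
    feature.dat = as.matrix(assay(tse, "counts")),
    grp.name = "Age", adj.name = c("Diet", "Gender"),
    feature.dat.type = "count",
    return.feature.dat = TRUE, perm.no = 999)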
- View the top six differentially abundant taxa and their adjusted p-values with `head(sort(zicoseq_out$p.adj.fdr))`. Is there any significant taxon according to ZicoSeq? Compared to the output of linda, do we see the same taxa at the top in terms of significance? Overall, to what extent do the two methods agree with one another?
Show solution
# The pipe and arrange come from the dplyr package
library(dplyr)
zicoseq_res <- cbind.data.frame(p.raw = zicoseq_out$p.raw,
                                p.adj.fdr = zicoseq_out$p.adj.fdr)
zicoseq_res %>%
  arrange(p.raw) %>%
  head() %>%
  knitr::kable()
24.8.4 Workflow 1
- Get the abundances for an individual feature (taxonomic group / row).
- Visualize the abundances per group with boxplot / jitterplot.
- Is the difference significant (Wilcoxon test)?
- Is the difference significant (linear model with covariates)?
- How do transformations affect the outcome (log10, clr..)?
- Get p-values for all features (taxa), for instance with a for loop.
- Do multiple testing correction.
- Compare the results from different tests with a scatterplot.
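A sketch of steps 1-3, assuming GlobalPatterns and comparing two of its sample types:
library(ggplot2)
# Abundances of a single feature (here the first row) across samples
df <- data.frame(abundance = assay(tse, "counts")[1, ],
                 group = tse$SampleType)
# Boxplot with jittered points per group
ggplot(df, aes(x = group, y = abundance)) +
  geom_boxplot() +
  geom_jitter(width = 0.2)
# Wilcoxon test between two of the groups
wilcox.test(abundance ~ group,
            data = droplevels(subset(df, group %in% c("Feces", "Soil"))))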
Useful functions: [], ggplot2::geom_boxplot, ggplot2::geom_jitter, wilcox.test, lm, transformAssay, p.adjust
24.8.5 Workflow 2
- Install the latest development version of mia from GitHub.
- Load experimental dataset from mia.
- Compare the abundances of each taxon between groups. First, use the Wilcoxon or Kruskal-Wallis test. Then use some other method dedicated to microbiome data.
- Summarize the findings by plotting a taxa vs samples heatmap. Add a column annotation that tells the group of each sample, and a row annotation that tells whether the difference for each taxon was statistically significant.
- Choose statistically significant taxa and visualize their abundances with boxplot & jitterplot.
Useful functions: wilcox.test, kruskal.test, ggplot2::ggplot, pheatmap::pheatmap, ComplexHeatmap::Heatmap, ANCOMBC::ancombc2, ALDEx2::aldex, Maaslin2::Maaslin2, agglomerateByRank, transformAssay, subsetByPrevalent
24.9 Visualization
24.9.1 Multivariate ordination
- Load experimental dataset from mia.
- Create PCoA with Bray-Curtis dissimilarities.
- Create PCA with Aitchison dissimilarities.
- Visualize and compare both.
- Test other transformations, dissimilarities, and ordination methods.
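A minimal sketch; the object names `pcoa_bray` and `pca_aitchison` are reused in the clustering exercise further below, and the dataset choice is an assumption:
library(mia)
library(scater)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
# PCoA (MDS) with Bray-Curtis dissimilarity on relative abundances
tse <- transformAssay(tse, method = "relabundance")
pcoa_bray <- runMDS(tse, FUN = vegan::vegdist, method = "bray",
    assay.type = "relabundance", name = "MDS_bray")
# PCA on CLR-transformed counts (Euclidean distance on CLR = Aitchison distance)
tse <- transformAssay(tse, method = "clr", pseudocount = 1)
pca_aitchison <- runPCA(tse, assay.type = "clr", name = "PCA")
# Visualize and compare the two ordinations
plotReducedDim(pcoa_bray, "MDS_bray", colour_by = "SampleType")
plotReducedDim(pca_aitchison, "PCA", colour_by = "SampleType")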
Useful functions: scater::runMDS, runNMDS, transformAssay, ggplot, scater::plotReducedDim
24.9.2 Heatmap visualization
- Load one of the datasets from mia and the miaViz and sechm packages.
- Agglomerate your TreeSE to include the most prevalent Phyla.
Show solution
tse <- agglomerateByPrevalence(tse, rank = "Phylum", prevalence = 0.9)
- Add a relative abundance assay to the TreeSE.
Show solution
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance")
- Visualize the relative abundances with a heatmap using the sechm library.
Show solution
setSechmOption("hmcols", value=c("#F0F0FF","#007562"))
sechm(tse, features = rownames(tse), assayName="relabundance", show_colnames=TRUE)
24.9.3 Advanced Heatmaps
- Import mia, miaViz, ComplexHeatmap and shadowtext, as well as a dataset of your choice.
Show solution
library(mia)
library(miaViz)
library(shadowtext)
library(ComplexHeatmap)
data("Tengeler2020", package = "mia")
tse <- Tengeler2020
- Agglomerate the data by prevalence to a rank of your choice, apply a log10 transformation and ensure unique rownames for when we plot the heatmap.
Show solution
tse <- agglomerateByPrevalence(tse, rank = "Family")
tse <- transformAssay(tse, method = "log10", pseudocount = TRUE)
rownames(tse) <- getTaxonomyLabels(tse)
- Calculate the correlations and their respective p-values between the log10 and counts assays.
Show solution
correlations <- getCrossAssociation(
tse,
test.signif=TRUE,
assay.type1 = "log10",
assay.type2 = "counts",
method = "spearman",
p.adj.threshold = NULL,
cor.threshold = NULL,
# Remove when mia is fixed
mode = "matrix",
sort = TRUE,
show.warnings = FALSE)
- Plot a heatmap using the correlations and mark significant correlations (p < 0.05) with a cross.
Show solution
# Create a heatmap and store it
p <- Heatmap(
correlations$cor,
# Print values to cells
cell_fun = function(j, i, x, y, width, height, fill) {
# If the p-value is under threshold
if( !is.na(correlations$pval[i, j]) & correlations$pval[i, j] < 0.05 ){
# Print "X"
grid.shadowtext(sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "#f5f5f5"))
}
},
heatmap_legend_param = list(title = "", legend_height = unit(5, "cm"))
)
p
- Plot a similar heatmap, but adjust the color scale to ensure that 0 is in the middle.
Hint: utilize circlize::colorRamp2()
Show solution
library(circlize)
# Update color scale to ensure 0 is in the middle
col_fun<-colorRamp2(c(-1, -0.5, 0, 0.5, 1), c("darkblue","skyblue","white", "brown1", "darkred"))
# Create a heatmap and store it
p <- Heatmap(
correlations$cor,
# Print values to cells
cell_fun = function(j, i, x, y, width, height, fill) {
# If the p-value is under threshold
if( !is.na(correlations$pval[i, j]) & correlations$pval[i, j] < 0.05 ){
# Print "X"
grid.shadowtext(sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "#f5f5f5"))
}
},
heatmap_legend_param = list(title = "", legend_height = unit(5, "cm")),
col = col_fun
)
p
Extra cool exercise: Identify patterns and clusters. To make the exercise more engaging, let's do an exploratory data analysis task:
24.9.3.1 1. Cluster Analysis on PCoA/PCA Results:
- Perform clustering on the ordination results (e.g., k-means clustering).
- Visualize the clusters on the ordination plots.
library(scater)
# Setting seed for reproducibility
set.seed(123)
# Perform k-means clustering on the reduced dimensions of pcoa_bray
# and store the cluster assignments in the colData
pcoa_bray$cluster <- factor(kmeans(reducedDim(pcoa_bray, "MDS_bray"), centers = 3)$cluster)
# Plot the reduced dimensions of pcoa_bray with clusters as colours
plotReducedDim(pcoa_bray, "MDS_bray", colour_by = "cluster")
# Perform k-means clustering on the reduced dimensions of pca_aitchison
pca_aitchison$cluster <- factor(kmeans(reducedDim(pca_aitchison, "PCA"), centers = 3)$cluster)
# Plot the reduced dimensions of pca_aitchison with clusters as colours
plotReducedDim(pca_aitchison, "PCA", colour_by = "cluster")
Here `reducedDim` extracts the reduced-dimensionality representation of the data (e.g., from a PCA or PCoA object, such as those created in the ordination exercise above), and `plotReducedDim` plots these reduced dimensions; its `colour_by` argument accepts the name of a colData column, which is why the cluster assignments are stored in the colData first.
24.9.3.2 2. Heatmap with Annotations:
- Annotate the heatmap with additional metadata (e.g., subject groups or sample types).
# Load necessary libraries
library(SummarizedExperiment)
library(pheatmap)
# Create an example SummarizedExperiment object with assay data and sample metadata
example_data <- matrix(runif(200), nrow = 20, ncol = 10,
                       dimnames = list(paste0("Feature", 1:20), paste0("Sample", 1:10)))
col_data <- DataFrame(Group = rep(c("A", "B"), each = 5),
                      row.names = paste0("Sample", 1:10))
enterotype <- SummarizedExperiment(assays = list(counts = example_data), colData = col_data)
# Extract sample metadata
sample_metadata <- colData(enterotype)
# Extract assay data
assay_data <- assay(enterotype, "counts")
# Plot heatmap with clustering and sample annotations
# (rownames of annotation_col must match the colnames of the matrix)
pheatmap(assay_data, cluster_rows = TRUE, cluster_cols = TRUE,
         annotation_col = as.data.frame(sample_metadata))
24.10 Multiomics
24.10.1 Basic exploration
Here we learn how to conduct preliminary exploration on a MAE, using HintikkaXOData as an example dataset.
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- Which experiments make up the MAE? How many samples and features are contained in each experiment? You can get a summary of all experiments with `experiments`, and check each individual experiment with `dim`, `nrow` and `ncol`.
- What are the names of the features and samples of the different experiments? You can see them with `rownames` and `colnames`, respectively.
- What information is known about the samples? Remember that information about samples is stored in the `colData` of the MAE.
- Extra: How do the samples of the individual experiments map to the columns of the MAE? You can find the sample mapping in the `sampleMap` of the MAE.
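A sketch of the exploration:
library(mia)
data("HintikkaXOData", package = "mia")
mae <- HintikkaXOData
# Overview of the experiments and their dimensions
experiments(mae)
dim(mae[[1]])
# Sample information and sample mapping
colData(mae)
sampleMap(mae)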
So far so good. You explored a MAE and its experiments, getting a taste of how information is organized in its internal structure.
24.10.2 Experiment agglomeration
Here we learn how to manipulate an experiment contained by a MAE and save the new modified version of the experiment in a suitable place (the altExp slot).
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- Agglomerate the microbiota experiment by Genus and store the output in the `altExp` slot of the microbiota experiment, with the custom name `microbiota_genus`.
- How many features remain after agglomerating? What are their names?
- Extra: Create one more alternative experiment named `prevalent_microbiota_family`, which contains the microbiota experiment agglomerated by Family with a prevalence threshold of 10%. You can agglomerate and select by prevalence in one go with `agglomerateByPrevalence`.
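A sketch, assuming the microbiota experiment is the first one in the MAE:
altExp(mae[[1]], "microbiota_genus") <- agglomerateByRank(mae[[1]], rank = "Genus")
dim(altExp(mae[[1]], "microbiota_genus"))
# Extra: agglomerate by Family and filter by 10% prevalence in one go
altExp(mae[[1]], "prevalent_microbiota_family") <- agglomerateByPrevalence(
    mae[[1]], rank = "Family", prevalence = 0.1)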
Good job! You agglomerated one of the experiments in the MAE and stored it as an alternative experiment.
24.10.3 Experiment transformation
We proceed with an exercise on a different type of data manipulation, that is, transformation of assays of individual experiments in the MAE.
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- What assays are contained in each individual experiment? You can check their names with `assays`.
- Apply a log10 transformation to the assay of the metabolite experiment. For that you can use `transformAssay`; don't forget to specify the assay to be transformed with the argument `assay.type`.
- Apply a CLR transformation to the counts assay of the microbiota experiment. To ensure non-null values in the assay, set `pseudocount` equal to 1.
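A sketch, assuming the assay names used in HintikkaXOData (`nmr` for metabolites, `counts` for microbiota):
# log10 transformation of the metabolite assay
mae[["metabolites"]] <- transformAssay(mae[["metabolites"]],
    assay.type = "nmr", method = "log10")
# CLR transformation of the microbiota counts
mae[["microbiota"]] <- transformAssay(mae[["microbiota"]],
    assay.type = "counts", method = "clr", pseudocount = 1)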
You made it! You learnt how to apply different transformations to the assays of individual experiments in a MAE with `transformAssay`, specifying optional arguments based on the method used.
24.10.4 Assay extraction
The following exercise walks you through disassembling a MAE object in order to retrieve a specific assay, or to store its components as multiple separate csv files.
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- Extract the individual metabolite experiment from the MAE into a distinct TreeSE object named `metabolites`.
- Which and how many assays are contained in `metabolites`? You can check that with `assays` or `assayNames`.
- Write a csv file for the nmr assay with `write.csv`. You can access an individual assay of a TreeSE with `assay` by specifying the name of the desired assay.
- Extra: Repeat steps 1 through 4 for the microbiota and biomarkers experiments, so that a completely disassembled version of the MAE is available.
- Extra: Besides experiments, MAEs also include a sampleData and a sampleMap, which are accessible with `colData(mae)` and `sampleMap(mae)`, respectively. Save each of these two elements into a csv file as well.
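A sketch of the extraction and export (the output file names are illustrative):
metabolites <- mae[["metabolites"]]
assayNames(metabolites)
write.csv(assay(metabolites, "nmr"), "metabolites_nmr.csv")
# Extra: export the sample data and the sample map as well
write.csv(colData(mae), "sample_data.csv")
write.csv(sampleMap(mae), "sample_map.csv")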
Well done! You just split a MAE into its components and stored them as csv files. This script shows a possible approach.
24.10.5 MAE reconstruction
Next, we will try to reconstruct the same MAE from the files you created. Make sure you know their names and location! Alternatively, you can fetch or download the CSV files in this directory with the readily disassembled components of HintikkaXOData.
- Read in the csv files containing assays with `read.csv` and save each of them into a variable named `<assay name>_assays`.
- Create one TreeSE from each assay object with the `TreeSummarizedExperiment` function, as explained in this exercise.
- Read in the sampleData and the sampleMap and store them into the variables `sample_data` and `sample_map`, respectively.
- Combine the components with `MultiAssayExperiment`, where the first argument is an `ExperimentList` (for now include only the microbiota and metabolites TreeSEs), the second is the colData and the third is the sampleMap.
- Make sure that the MAE experiments are identical to the original TreeSEs. You can do that qualitatively by checking their `head` and quantitatively by looking at their `dim`.
- Extra: Add the biomarkers TreeSE as a new experiment to the MAE. Note that new experiments can be added to a MAE through simple concatenation with `c(mae, experiment)`.
Good job! Now you are aware of how MAEs are built and we can proceed to some analytical exercises.
24.10.6 Cross-correlation analysis
Now we will perform a cross-correlation analysis between two of the experiments in the MAE.
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- Analyze correlations between the microbiota and the biomarkers experiments with `getCrossAssociation`. Don't forget to specify the experiments you want to compare with the arguments `experiment1` and `experiment2`, and which of their assays with `assay.type1` and `assay.type2`.
- What does the output look like? By default, correlation is measured in terms of Kendall tau coefficients. Repeat point 2, but this time change `method` to Spearman coefficients.
- Are you able to infer significance from the output?
- Visualize the results with a heatmap similarly to the example in Section 12.1. Do you see any significant correlations? Interpret your results.
- Extra: Perform cross-correlation analysis between the remaining experiments (microbiota vs metabolites and metabolites vs biomarkers) and visualize the results with heatmaps.
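A sketch of point 2, assuming the experiment and assay names used in HintikkaXOData:
res <- getCrossAssociation(mae,
    experiment1 = "microbiota", experiment2 = "biomarkers",
    assay.type1 = "counts", assay.type2 = "signals",
    method = "spearman", test.signif = TRUE)
head(res)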
Great job! You performed a cross-correlation analysis between two experiments of a MAE and visualized the results with a heatmap. You are also able to customise the correlation method and significance testing used for the analysis.