# setup
library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
# this line sets some options for all the chunks (global chunk options)
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
24 Exercises
Here you can find assignments on different topics.
Tips for exercises:
- Add comments that explain what each line or block of code does. This helps you and others understand your code and find bugs; it also makes the code easier to reuse and promotes transparency.
- Interpret the results by comparing them to the literature. List the main findings so that the results can be understood by others without advanced knowledge of data analytics.
- Avoid hard-coding. Use variables that get their values at the beginning of the pipeline. That way it is easier to modify parameters and reuse the code, as sketched below.
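For instance, a minimal sketch of the last tip, reusing the `tse` object from the setup chunk above (the variable names and values are only illustrative):
# Define analysis parameters once, at the top of the script
detection_threshold <- 1      # minimum count for a taxon to be considered detected
prevalence_threshold <- 0.1   # minimum fraction of samples
# Reuse them downstream instead of repeating literal values
tse_prev <- subsetByPrevalent(tse,
    detection = detection_threshold, prevalence = prevalence_threshold)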
24.1 Basics of R/Bioconductor
Bioconductor training material has been contributed to Carpentries. You can check the following lessons for basic background on R and Bioconductor.
24.2 Workflows
24.2.1 Reproducible reporting with Quarto
The following batch of exercises walks you through typical use cases of Quarto in RStudio. Before heading to the exercises, it is recommended to read the Quarto guidelines for RStudio.
24.2.1.1 New document
This exercise gets you started with creating a Quarto document and adding text to it with typing conventions borrowed from the markdown syntax. Feel free to render the document with the Render button after each step to see the changes in the final report.
- Open RStudio and create a new Quarto file named `My first Quarto`.
- Add the section heading `My first section` and write some text of your choice underneath. You can choose the level of a heading by the number of preceding hashes (`#`).
- Add a subsection named `List of items` and list three items underneath, both ordered and unordered. You can initialize items with numbers (`1.`, `2.`, `3.`, …) or stars (`*`) for the ordered and unordered case, respectively.
- Add another subsection named `Link to web` and add a clickable link to the OMA book, using the pattern `[text](url)`.
- Render the document and check its appearance.
Nice start! You are now able to create a Quarto document, understand its syntax and can render it into a reproducible report. If you got stuck, you can look up the docs on creating and rendering Quarto documents.
24.2.1.2 Code chunks
While customizable text is nothing new by itself, the advantage of Quarto (and previously Rmarkdown) is to combine text with code in R or other programming languages, so that both the analytical pipeline and verbal description can be put in one place. In this exercise, we learn how to write and run code in Quarto.
- Open RStudio and create a new Quarto file (click on File tab - New File - Quarto document).
- Initialize a code chunk by pressing `Ctrl` + `Alt` + `I` (`Cmd` + `Option` + `I` on macOS) and define the variables `A <- "my name"` and `B <- 0` in it.
- Write the text `Below is my first code chunk` just above the code chunk.
- Initialize one more code chunk and add 100 to the variable `B` in it.
- Write the text `Below I change variable B` just above the second chunk.
- Extra: Write the following line of text: `my name is A and I am B years old`, where A and B are the values of the variables defined in the code chunks upstream and change if those variables are modified. Be aware that inline code can be added as `` `r my_inline_command` ``.
Good job. You are now able to combine text and code in a Quarto document. If you got stuck, you can refer to the Quarto docs on using code chunks.
24.2.1.3 Knitr options
Code chunks can be greatly customized in terms of visibility, execution, output size, location and much more. This is possible with the knitr chunk options, which usually appear at the beginning of the target chunk with the syntax `#| option-name: value`, also described here. In this exercise, we explore part of the knitr potential.
- Open RStudio and create a new Quarto file.
- Initialize three code chunks and label them as `setup`, `fig-box` and `tbl-coldata`, respectively. Remember that the name of a chunk can be specified with the `label` option.
- Write some appropriate code in each chunk and render the document (a hypothetical filling is sketched at the end of this exercise).
- Set `include: false` in the `setup` chunk, `fig-width: 10` in the `fig-box` chunk and `echo: false` in the `tbl-coldata` chunk. Render the document again and find the differences from before.
The setup code and output are no longer visible, resulting in a cleaner document without initialization details. Figures in the fig-box chunk are now wider, providing better visibility and detail in the graphics. The code in the tbl-coldata chunk is hidden, making the document more concise and focused on the results rather than the code implementation.
- Add the options `fig-cap` and `tbl-cap` to the `fig-box` and `tbl-coldata` chunks, respectively. They require some text input, which makes for the caption of the figure or table.
- Extra: Create a cross-reference to `fig-box` and `tbl-coldata` in the text above the respective code chunks. You can do that with the syntax `@chunk-name`.
- Extra: Define a custom folder for storing figures with `fig.path`. Insert it in `knitr::opts_chunk$set`, so that it applies globally to all the figures generated in the document.
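The original chunk contents are not shown here, so below is one hypothetical filling; each `#| label:` block belongs in its own chunk, and the plotted and tabulated columns assume GlobalPatterns:
#| label: setup
library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

#| label: fig-box
boxplot(colSums(assay(tse, "counts")) ~ tse$SampleType)

#| label: tbl-coldata
knitr::kable(head(as.data.frame(colData(tse))))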
Congratulations! You are now familiar with the great potential and flexibility of knitr chunk options. An exhaustive list of available options can be found in the knitr documentation.
24.2.1.4 YAML instructions
The box at the beginning of every Quarto document contains yaml options that let you define the metadata of the document. They will affect the appearance of the document when it is rendered. By default, the box includes yaml options for the title, format and editor to be used, but much more information on layout, code execution and figures can be specified. A comprehensive list of yaml options is available here. In this exercise, we will get a tiny taste of such functionality.
- Open RStudio and create a new Quarto file.
- In the yaml box at the beginning of the document, change the title from `Untitled` to `My first Quarto`.
- In the same box, add the two lines `author` and `date`, followed by your name and today's date, respectively.
- Render the document and check its appearance.
- Extra: Set `toc: true` to produce a table of contents. This line should follow `format` and `html` at the second level of indentation.
Well done! Now you are able to specify yaml options and understand how they affect your Quarto document. If you got stuck, you can check this section of the Quarto documentation.
24.2.1.5 Quarto parameters
An advanced feature of Quarto consists of execution parameters, which are externally pre-defined variables that are also accessible in the Quarto document. They can be specified in the yaml box under `params`. Here we learn how to use them.
- Open RStudio and create a new Quarto file.
- In the yaml box at the beginning of the document, add a line named `params`, followed by an indented line with `gamma: 10`.
- Initialize a code chunk and type `str(params$gamma)` in it.
- Render the document and check what happened.
- Define one more parameter `beta: 3` and multiply `gamma` by `beta` in a code chunk below.
- Render the document again and check what happened.
Well done! You can now use an advanced feature of Quarto such as parameters. If you got stuck, here you can find more information about parameter definition and usage.
24.3 Data containers: TreeSE
TreeSE containers represent the working unit of the mia package. In the following exercises we learn how to construct, explore and work with them. A few demo datasets can be imported with mia and accessed as explained in Section 2.3.
24.3.1 Constructing a data object
Here we cover how to construct a TreeSE from CSV files, using the components of OKeefeDSData from the microbiomeDataSets package as an example dataset.
- Fetch or download the files in this directory.
Show solution
- Read in the csv files with `read.csv` and store them into the variables `okeefe_assay`, `okeefe_rowdata` and `okeefe_coldata`, respectively (as sketched below).
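A minimal sketch of the reading step, assuming the downloaded files are named after their components (the file names are hypothetical):
okeefe_assay <- as.matrix(read.csv("okeefe_assay.csv", row.names = 1))
okeefe_rowdata <- read.csv("okeefe_rowdata.csv", row.names = 1)
okeefe_coldata <- read.csv("okeefe_coldata.csv", row.names = 1)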
- Create a TreeSE from the individual components with `TreeSummarizedExperiment`. Note that the function requires three arguments: assays, rowData and colData, to which you can give the appropriate items.
Show solution
tse <- TreeSummarizedExperiment(assays = list(counts = okeefe_assay),
                                colData = okeefe_coldata,
                                rowData = okeefe_rowdata)
- Check that the import was done correctly. E.g., choose random samples and features, and check that their values match between the raw files and the TreeSE.
Show solution
okeefe_assay[1:5, 1:7]
assay(tse, "counts")[1:5,1:7]
Useful functions: DataFrame, TreeSummarizedExperiment, matrix, rownames, colnames, SimpleList
24.3.2 Importing data
Raw data of different types can be imported as a TreeSE with a number of functions explained in Section 2.4.2. You can also check the function reference in the mia package.
- Get familiar with the microbiome data repository and read the instructions in its README to import and construct datasets from there.
- Import data by using mia. (functions: importMetaPhlAn | importMothur | importQIIME2 | convertFromBIOM | convertFromDADA2 …)
- Try out conversions between TreeSE and phyloseq data containers (convertFromPhyloseq; convertToPhyloseq)
24.3.3 Preliminary exploration
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Get a summary of the TreeSE with `summary`. What is the mean count across samples? How many features recur only once (singletons)?
Show solution
summary(tse)
- Check the dimensions of the TreeSE with `dim`, or alternatively with `nrow` and `ncol`. How many samples and features are present?
Show solution
dim(tse)
- List feature and sample names with `rownames` and `colnames`, respectively.
- Check what information about samples and features is contained in the `colData` and `rowData` of the TreeSE with `names`.
- Extra: Calculate the number of unique taxa at each taxonomic rank. You can use `apply` to count the unique elements in each column of the rowData.
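A possible solution for the extra step:
# Number of unique taxa in each taxonomic rank column of the rowData
apply(as.data.frame(rowData(tse))[, taxonomyRanks(tse)], 2,
      function(x) length(unique(x)))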
24.3.4 Assay retrieval
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- List the names of all available assays with `assayNames`.
Show solution
assayNames(tse)
- Fetch the list of assays with `assays`.
Show solution
assays(tse)
- Retrieve the first assay of the TreeSE with `assay`, where the second argument can be either the name or the index of the desired assay.
Show solution
assay(tse, "counts")[1:5,1:7]
Well done! You can now locate and retrieve individual assays of a TreeSE. If you got stuck, you can refer to Section 2.2.1 of this book.
24.3.5 Sample information
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Check the names of the samples with `colnames`.
Show solution
colnames(tse)
- List the information on samples available in the colData with `names`.
- Visualize the colData with `View` and briefly look at the information stored in the different columns.
- Get the abundances of all features for a specific sample, such as `ID34`, from an assay of your choice.
24.3.6 Feature information
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Check the names of the features with `rownames`.
Show solution
rownames(tse)
- List the information on features available in the rowData with `names`.
- Visualize the rowData with `View` and briefly look at the information stored in the different columns.
- Get the abundances of a specific feature, such as `OTU1810`, across all samples. You can access feature-specific abundances from an assay of your choice.
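A one-line sketch; the feature name is taken from the exercise and may not exist in every dataset:
# Abundances of one feature across all samples from the counts assay
assay(tse, "counts")["OTU1810", ]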
If you got stuck, you can look up Chapter 3 and Section 5.2.1 on how to pick specific abundances and generate row trees, respectively.
24.3.7 Other elements
Extract some of the other TreeSE elements listed in Chapter 2. Note that such data are not always included.
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 and store it into a variable named `tse`.
- Fetch the metadata of the TreeSE. Is there any information available?
Show solution
names(metadata(tse))
- Access the phylogenetic tree with `rowTree`. How big is it in terms of tips and nodes? If you like, you can visualize it with `ggtree`.
Show solution
rowTree(tse)
- Check if a sample tree is available with `colTree`; such a tree is suitable for hierarchical or nested study designs.
Show solution
colTree(tse)
- If present, obtain the information on feature DNA sequences from the DNA sequence slot with `referenceSeq`.
Show solution
referenceSeq(tse)
24.4 Data manipulation
24.4.1 Subsetting
In this exercise, we’ll go over the basics of subsetting a TreeSE object. Keep in mind that it’s good practice to always keep the original data intact. Therefore, we suggest creating a new TreeSE object whenever you subset.
Please have a try at the following exercises.
- Subset the TreeSE object to specific samples and features.
1.1. Find some samples and features to subset with.
1.2. Subset the TreeSE with your found samples and features.
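A minimal sketch, assuming GlobalPatterns (where the rowData has a Phylum column and the colData a SampleType column):
# Keep features assigned to Firmicutes and faecal samples only
tse_sub <- tse[rowData(tse)$Phylum %in% "Firmicutes", tse$SampleType %in% "Feces"]
dim(tse_sub)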
24.4.2 Library sizes
- Calculate library sizes
- Subsample / rarefy the counts (see: rarefyAssay)
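A sketch of both steps; `rarefyAssay` is available in recent versions of mia:
# Library size = total sum of counts in each sample
libsizes <- colSums(assay(tse, "counts"))
summary(libsizes)
# Subsample the counts to an even depth; the result is stored as a new assay
tse <- rarefyAssay(tse, assay.type = "counts")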
Useful functions: nrow, ncol, dim, summary, table, quantile, unique, scater::addPerCellQC, mia::agglomerateByRank
24.4.3 Prevalent and core taxonomic features
- Estimate prevalence for your chosen feature (a specific row or taxonomic group)
Show solution
library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
# First, agglomerate the features to the Phylum rank
tse_phylum <- agglomerateByRank(tse, "Phylum")
# Calculate relative abundance
tse_phylum <- transformAssay(tse_phylum, method = "relabundance")
# Then get the prevalence for a specific feature
res <- getPrevalence(tse_phylum, detection = 1/100, sort = TRUE, assay.type = "relabundance")
res["Proteobacteria"]
- Identify all prevalent features and subset the data accordingly.
Show solution
# Get the prevalent features
prevalent <- getPrevalent(tse, assay.type = "counts", detection = 1, prevalence = 0.1)
# Subset the data to the prevalent features
tse_prevalent1 <- tse[rownames(tse) %in% prevalent, ]
# Alternatively, subsetByPrevalent can be used to subset directly
tse_prevalent2 <- subsetByPrevalent(tse, assay.type = "counts", detection = 1, prevalence = 0.1)
identical(tse_prevalent1, tse_prevalent2)
- How many features are left?
Show solution
nrow(tse_prevalent1)
- Recalculate the most prevalent features based on relative abundance. How many are left?
Show solution
tse <- transformAssay(tse, MARGIN = "cols", method="relabundance")
tse_prevalent <- subsetByPrevalent(tse, assay.type = "relabundance", detection = 0.01/100, prevalence = 50/100)
nrow(tse_prevalent)
- Visualize prevalence of most prevalent Phyla using a violin/bean plot.
Show solution
# Import the scater library
library(scater)
# Agglomerate based on prevalence. Agglomerate to Phylum level.
tse_phylum <- subsetByPrevalent(tse, rank = "Phylum", assay.type = "relabundance", detection = 0.001, prevalence = 0.2)
# Store the prevalence of each taxon
res <- getPrevalence(tse_phylum, detection = 0.001, sort = FALSE, assay.type = "relabundance")
rowData(tse_phylum)$prevalence <- res
# Then plot the row data
plotRowData(tse_phylum, "prevalence", colour_by = "Phylum")
Useful functions: getPrevalence, getPrevalent, subsetByPrevalent
24.4.4 Data exploration
- Summarize the sample metadata variables. How many age groups are there, and how are they distributed? What are the 0%, 25%, 50%, 75%, and 100% quantiles of library size?
- Create two histograms: one showing the distribution of absolute counts, the other showing how the CLR-transformed values are distributed.
- Visualize how relative abundances are distributed between taxa in samples.
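A sketch of the histogram step (the CLR transformation needs a `pseudocount` because of the zero counts):
# Histogram of absolute counts
hist(assay(tse, "counts"))
# Histogram of CLR-transformed values
tse <- transformAssay(tse, method = "clr", pseudocount = 1)
hist(assay(tse, "clr"))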
Useful functions: nrow, ncol, dim, summary, table, quantile, unique, transformAssay, ggplot, wilcox.test, agglomerateByRank, miaViz::plotAbundance
24.4.5 Other functions
- Merge data objects (merge, mergeSEs)
- Melt the data for visualization purposes (meltSE)
24.4.6 Assay transformation
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Transform the counts assay into relative abundances with `transformAssay` and store it in the TreeSE as an assay named `relabund` (see Section 5.4).
- Similarly, perform a CLR transformation on the counts assay with a `pseudocount` of 1 and add it to the TreeSE as a new assay.
- List the available assays by name with `assays`.
- Access the clr assay and select a subset of its first 100 features and 10 samples. Remember that assays can be subsetted with `assay[row_idx, col_idx]`.
- Take the same subset of the TreeSE, and check how this affects the individual transformed assays. A TreeSE can also be subsetted with `tse[row_idx, col_idx]`.
- Extra: If the data has a phylogenetic tree, perform the phILR transformation.
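A sketch of the main steps:
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance", name = "relabund")
tse <- transformAssay(tse, assay.type = "counts", method = "clr", pseudocount = 1)
assayNames(tse)
# Subset an individual assay...
assay(tse, "clr")[1:100, 1:10]
# ...or the whole TreeSE, which subsets all assays at once
tse_sub <- tse[1:100, 1:10]
assayNames(tse_sub)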
24.5 Abundance tables
24.5.1 Taxonomic levels
- Import the mia package, load one of the example data sets mentioned in Chapter 3.3 with `data` (you need one with taxonomic information at Phylum level) and store it into a variable named `tse`.
- List the available taxonomic ranks in the data with `taxonomyRanks`.
- Agglomerate the data to Phylum level with `agglomerateByRank` and the appropriate value for `rank`.
- Report the dimensions of the TreeSE before and after agglomerating. You can use `dim` for that.
- Extra: Perform a CLR transformation on the data. Does this affect agglomeration?
- Extra: List the full taxonomic information for a few selected taxa, such as `OTU1` and `OTU1368`. For that you can use `mapTaxonomy` on a specific subset of the TreeSE.
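A sketch of the agglomeration steps (the OTU names are taken from the exercise and assume they exist in the chosen dataset):
taxonomyRanks(tse)
tse_phylum <- agglomerateByRank(tse, rank = "Phylum")
dim(tse)
dim(tse_phylum)
# Extra: full taxonomy of selected taxa
mapTaxonomy(tse[c("OTU1", "OTU1368"), ])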
24.5.2 Alternative experiments
- Import the mia package, load one of the example data sets mentioned in Chapter 3.3 with `data` (you need one with taxonomic information) and store it into a variable named `tse`. Can you find the same taxonomy ranks in the rowData?
- Check the taxonomic ranks of the features with `taxonomyRanks`. What is the deepest taxonomic rank available?
- Agglomerate the TreeSE to each taxonomic rank and store the resulting experiments as altExps. This can be performed automatically with `agglomerateByRanks`.
- Check the names of the generated altExps with `altExpNames` and retrieve a complete list with `altExps`.
- Retrieve the data agglomerated by genus from the corresponding `altExp`. As with assays, you can access the desired altExp by name or index.
- Extra: Split the data based on other features with `splitOn`.
24.5.3 Beta Diversity
This exercise focuses on calculating dissimilarities, or beta diversity.
- Import the mia and vegan packages and load a dataset. This can be your own or one of the example datasets built into mia.
- Agglomerate the data to all available taxonomic ranks by using `agglomerateByRanks()`.
Show solution
tse <- agglomerateByRanks(tse)
altExpNames(tse)
- Import the scater library and run PCoA on one of the created alternative experiments using Bray-Curtis dissimilarity.
Show solution
# We use scater library for calculating ordination
library(scater)
# Choose rank
rank <- "Genus"
# Calculate relative abundance
altExp(tse, rank) <- transformAssay(altExp(tse, rank), MARGIN = "cols", method="relabundance")
# Calculate ordination
altExp(tse, rank) <- runMDS(
altExp(tse, rank),
FUN = vegan::vegdist,
method = "bray",
assay.type = "relabundance",
name = "MDS_bray")
- Plot the first two coordinates with the `plotReducedDim` function.
Show solution
p <- plotReducedDim(altExp(tse, rank), "MDS_bray")
p
- Add the explained variance ratios for each coordinate on the axes labels and plot the PCoA again.
Show solution
# Calculate explained variance
e <- attr(reducedDim(altExp(tse, rank), "MDS_bray"), "eig")
rel_eig <- e / sum(e[e > 0])
# Add explained variance for each axis
p <- p +
labs(
x = paste("PCoA 1 (", round(100 * rel_eig[[1]], 1), "%", ")", sep = ""),
y = paste("PCoA 2 (", round(100 * rel_eig[[2]], 1), "%", ")", sep = ""))
p
24.6 Community (alpha) diversity
24.6.1 Estimation
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Calculate multiple alpha diversity indices with `estimateDiversity` without any additional arguments.
- Check the names of the colData with `names`. Can you identify which columns contain the alpha diversity indices?
- Extra: Agglomerate the TreeSE by phylum and compare the mean Shannon diversity of the original experiment with its agglomerated version. You can use `agglomerateByRank` to perform the agglomeration and `mean` to calculate the mean values of the respective columns in the colData.
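A sketch, using the function names given above:
tse <- estimateDiversity(tse)
names(colData(tse))
# Extra: mean Shannon diversity before and after phylum-level agglomeration
tse_phylum <- estimateDiversity(agglomerateByRank(tse, rank = "Phylum"))
mean(tse$shannon)
mean(tse_phylum$shannon)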
24.6.2 Visualization
- Import the mia and scater packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Calculate the Shannon diversity index and Faith's phylogenetic diversity with `estimateDiversity` and the appropriate arguments for `index`.
- Make a boxplot with Shannon diversity on the y axis and sample type on the x axis using `scater::plotColData`.
- Repeat the previous point with Faith's phylogenetic diversity and compare the sample distributions of the two alpha diversity indices. How greatly do they differ?
- Extra: Make a scatterplot with Shannon diversity on the y axis and Faith's phylogenetic diversity on the x axis with `plotColData`. Colour the points by sample type with the appropriate optional argument.
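A sketch, assuming a dataset with a SampleType column and a row tree, such as GlobalPatterns:
library(scater)
tse <- estimateDiversity(tse, index = c("shannon", "faith"))
# Boxplots of alpha diversity by sample type
plotColData(tse, y = "shannon", x = "SampleType")
plotColData(tse, y = "faith", x = "SampleType")
# Extra: scatterplot of the two indices
plotColData(tse, y = "shannon", x = "faith", colour_by = "SampleType")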
24.6.3 Correlation
- Import the mia and scater packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Calculate the coverage and Shannon diversity indices with `estimateDiversity` and the appropriate arguments for `index`.
- Test the correlation between the two indices with `cor.test`. Remember that colData parameters are accessible with `tse$param_name`. Use Kendall tau coefficients as the `method` to measure correlation. Is the correlation weak or strong, significant or not?
- Make a scatterplot with the Shannon diversity index on the y axis and coverage on the x axis. You can do that with `plotColData`. How do the two indices relate to one another?
- Extra: Compute the library size of the samples by applying `colSums` to the counts assay of the TreeSE, and test the correlation of library size with Shannon diversity or coverage. Which index is more correlated with library size?
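A sketch of the correlation steps (the index names follow the exercise text):
tse <- estimateDiversity(tse, index = c("coverage", "shannon"))
# Kendall correlation between the two indices
cor.test(tse$coverage, tse$shannon, method = "kendall")
plotColData(tse, y = "shannon", x = "coverage")
# Extra: correlation with library size
tse$libsize <- colSums(assay(tse, "counts"))
cor.test(tse$libsize, tse$shannon, method = "kendall")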
In this example, we inspected the correlation between two related variables, also known as multicollinearity, and checked the correlation with library size, which is part of quality control. However, the correlation between alpha diversity and other numerical data about samples, such as participants' age and weight, also represents an important analysis in several studies.
24.6.4 Differences between groups
- Import the mia package, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Calculate the Gini-Simpson diversity with `estimateDiversity` and the appropriate argument for `index`. Set `name` to `simpson`. You will use this name to access the diversity index from the colData.
- Inspect the Diet column in the colData. Determine how the samples are grouped in terms of diet. You can see the number of unique elements in a column with `unique`.
- Test differences in Gini-Simpson diversity between different diets with `kruskal.test`. Remember that colData parameters are accessible with `tse$param_name`.
- Is diversity significantly different between the vegan and mixed diets? To visualize that, make a boxplot with Gini-Simpson diversity on the y axis and diet on the x axis with `plotColData`.
- Extra: Repeat points 3 through 5, this time for age groups. Make sure that you use a statistical test appropriate for the number of groups and their distribution.
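A sketch, assuming a dataset with a Diet column, such as peerj13075:
tse <- estimateDiversity(tse, index = "gini_simpson", name = "simpson")
# How are the samples grouped in terms of diet?
unique(tse$Diet)
# Test for differences between the diet groups
kruskal.test(tse$simpson, as.factor(tse$Diet))
# Visualize the group distributions
plotColData(tse, y = "simpson", x = "Diet")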
24.7 Community similarity
24.7.1 Reduced dimensions retrieval
- Import the mia package, load enterotype with `data` and store it into a variable named `tse`.
- List all available reduced dimensions with `reducedDims`. At this point, no reducedDims are likely found, because we haven't created any yet.
- Perform PCA using the scater library and store its output in the TreeSE by running `tse <- runPCA(tse, assay.type = "counts")`. Note that it is required to specify the assay on which the dimensionality reduction should be conducted.
- View the names of all available reduced dimensions with `reducedDimNames`. Has something new appeared?
- Extra: Access the `PCA` reducedDim object with `reducedDim` and explore its content. How are the different dimensions stored? Extract an array with only the values from the second dimension by indexing the object with `[ , 2]`.
24.7.2 Visualization basics with PCA
- Import the mia and scater packages, load enterotype with `data` and store it into a variable named `tse`.
- Perform a 3-component PCA based on the `counts` assay. You can use `runPCA` and set the optional arguments `ncomponents` and `assay.type` to the appropriate values.
- Plot the first two dimensions of the PCA with `scater::plotReducedDim`, to which you should give the appropriate reducedDim name as the second argument. Note that by default only the first two dimensions are shown.
- Check which information is stored in the colData of the TreeSE. What would be worth visualizing in our ordination plot?
- Make the same plot again, but this time colour the observations by Enterotype. You can do that by setting `colour.by` to the appropriate column name in the colData of the TreeSE.
- Extra: Plot all three dimensions of the PCA with `scater::plotReducedDim` and the optional argument `ncomponents`. Colour the observations by Enterotype. Which pair of dimensions best explains the variance between Enterotypes?
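A sketch of the PCA steps:
library(mia)
library(scater)
data("enterotype", package = "mia")
tse <- enterotype
# 3-component PCA on the counts assay
tse <- runPCA(tse, ncomponents = 3, assay.type = "counts")
# First two dimensions, then coloured by Enterotype
plotReducedDim(tse, "PCA")
plotReducedDim(tse, "PCA", colour_by = "Enterotype")
# Extra: all three dimensions
plotReducedDim(tse, "PCA", ncomponents = 3, colour_by = "Enterotype")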
24.7.3 Principal Coordinate Analysis (PCoA)
PCoA turns out to be particularly relevant for microbiome analysis, because unlike PCA it can generate reduced dimensions from distances other than Euclidean. There are several ecological distances to choose from, and you can find many of them listed as methods in the documentation of `vegan::vegdist`.
- Import the mia and scater packages, load enterotype with `data` and store it into a variable named `tse`.
- Transform the counts assay to relative abundances with `transformAssay`.
- Perform Multi-Dimensional Scaling (MDS) based on the relative abundance assay in terms of Bray-Curtis dissimilarity. You can use `scater::runMDS` with the compulsory argument `FUN = vegan::vegdist`.
- Plot the first two dimensions of the MDS with `plotReducedDim`, to which you should give the appropriate reducedDim name as the second argument. Colour the observations by Enterotype with `colour.by`.
- Extra: Perform MDS again with `scater::runMDS`, but this time use Jaccard dissimilarity. The distance metric can be defined with the optional argument `method`, choosing from the methods in `?vegan::vegdist`. If you don't want to overwrite the reducedDim object made in point 3, set `name` to a name of your choice. Visualize and compare it to the plot from point 4.
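For the extra step, a Jaccard-based MDS stored under a custom name might look like this:
tse <- runMDS(tse, FUN = vegan::vegdist, method = "jaccard",
    assay.type = "relabundance", name = "MDS_jaccard")
plotReducedDim(tse, "MDS_jaccard", colour_by = "Enterotype")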
Good job! You are now able to produce and visualize reduced dimensions of a TreeSE. `runMDS` is actually one of several algorithms for PCoA and dimensionality reduction, which you can find in Section 7.2.2.
24.7.4 PERMANOVA analysis
In this exercise we focus on studying the weight of variables on the microbiome composition. The significance of each variable on beta diversity is tested with PERMANOVA (point 4), and the homogeneity assumption is also controlled with a PERMDISP2 analysis (point 5).
- Import the mia and vegan packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Transform the `counts` assay into relative abundances with `transformAssay`.
- Extract the relative abundance assay, transpose it and save it into a variable named `relabund_assay`.
- Perform PERMANOVA with `vegan::adonis2` to see how much Diet can explain the relative abundance assay (`formula = relabund_assay ~ Diet`) in terms of Bray-Curtis dissimilarity (`method = "bray"`). Also set `data = colData(tse)`, `by = "margin"` and `permutations = 99`. What do the results tell you about Diet with respect to beta diversity?
- Extra: Test the homogeneity of dispersion across Diet groups with `anova(betadisper(my_mat, my_groups))`, where `my_mat` is the Bray-Curtis dissimilarity matrix of `relabund_assay` calculated with `vegdist(relabund_assay, "bray")` and `my_groups` is the vector of Diet values obtained from the colData of the TreeSE.
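A sketch of points 2-5 (the Diet column assumes peerj13075; note that `adonis2` expects a plain data.frame):
library(vegan)
tse <- transformAssay(tse, method = "relabundance")
relabund_assay <- t(assay(tse, "relabundance"))
adonis2(relabund_assay ~ Diet, data = as.data.frame(colData(tse)),
    method = "bray", by = "margin", permutations = 99)
# Extra: homogeneity of dispersion across Diet groups
anova(betadisper(vegdist(relabund_assay, "bray"), colData(tse)$Diet))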
Well done! You went through testing the effect and significance of Diet on beta diversity. Keep in mind that the formula fed to `adonis2` can take more than one independent variable, so you can also (and very often should) include the covariates of your study.
24.7.5 Redundancy analysis (RDA)
Here we apply RDA, an ordination method that finds the dimensions of largest variation in the data in light of the specified variables (point 3). The results of RDA are usually assessed with PERMANOVA (point 5), and the homogeneity assumption should be checked as in the previous exercise. This is a relatively complex procedure, but breaking it down into the steps below will hopefully make it clearer.
- Import the mia and vegan packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Transform the `counts` assay into relative abundances with `transformAssay`.
- Perform RDA with `getRDA` to see how much Diet can explain the relative abundance assay (`formula = assay ~ Diet` and `assay.type = "relabundance"`) in terms of Bray-Curtis dissimilarity (`method = "bray"`).
- Extract the RDA dimensions from the appropriate reducedDim slot with `attr(reducedDim(tse, "RDA"), "rda")` and store them into `rda`.
- Test the effect and significance of Diet on beta diversity by PERMANOVA with `vegan::anova.cca`. Feed this function with `rda` and set `by = "margin"` and `permutations = 99`. What do the results tell you about Diet?
- Extra: Check what other parameters are stored in the colData of peerj13075, add them to the formula (`formula = assay ~ Diet + ...`) of `getRDA` and proceed to see how that changes the results of PERMANOVA.
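A sketch using `runRDA`, which stores the ordination in the reducedDim slot in the same way as the next exercise:
tse <- runRDA(tse, formula = assay ~ Diet,
    assay.type = "relabundance", distance = "bray")
rda <- attr(reducedDim(tse, "RDA"), "rda")
vegan::anova.cca(rda, by = "margin", permutations = 99)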
Well done! You went through an RDA analysis followed by significance testing with PERMANOVA and PERMDISP2. In the next exercise we'll dig deeper into quantifying the contributions to beta diversity.
24.7.6 Beta diversity analysis
In this exercise, we’ll ask you to implement a distance-based Redundancy Analysis (dbRDA) to understand the relationships between microbial community composition and various environmental or experimental factors within your dataset. You can refer to Section 7.4.1 for a step-by-step walkthrough, which may be simplified in the future.
- Import the mia and vegan packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Create a new assay with the relative abundance for your dataset and merge the features by a rank of your choice.
Show solution
tse <- transformAssay(tse, MARGIN = "cols", method="relabundance")
- Select a group of samples and perform a permutational analysis on the calculated relative abundance using Bray-Curtis dissimilarities
Show solution
tse$Group <- tse$SampleType == "Feces"
- Implement dbRDA on relative abundances and perform another permutational analysis on the redundancy analysis.
Show solution
tse <- runRDA(
tse,
assay.type = "relabundance",
formula = assay ~ Group,
distance = "bray",
na.action = na.exclude)
rda_info <- attr(reducedDim(tse, "RDA"), "significance")
rda_info
`rda_info` now holds information on the variation between the chosen groups. In our case, the homogeneity test has a p-value of 0.01, indicating that the groups have unequal variances; the PERMANOVA results are therefore compromised and cannot reliably tell whether there are significant differences between the groups.
- Nevertheless, extract the p-values. Can the differences between samples be explained with variables from their metadata?
Show solution
rda_info[["permanova"]][["Pr(>F)"]]
Useful functions: scater::runMDS, runRDA, transformAssay, agglomerateByRank
24.8 Differential abundance
24.8.1 Standard analysis with ALDEx2
- Import the mia and ALDEx2 packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Agglomerate the TreeSE by genus and filter by a prevalence of 10%. You can perform both operations in one go with `subsetByPrevalent` by specifying the `rank` and `prevalence` arguments.
Show solution
tse <- subsetByPrevalent(tse,
rank = "genus",
prevalence = 0.1)
- Model the counts assay of the TreeSE with `aldex.clr` and store it into the variable `x`. As a second argument, provide the grouping variable `Diet`, which is contained in a column of the colData.
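A possible solution:
library(ALDEx2)
# Model the counts, using Diet from the colData as the grouping variable
x <- aldex.clr(assay(tse, "counts"), tse$Diet)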
- Feed `x` to the functions `aldex.ttest` to perform t-tests and `aldex.effect` to estimate effect sizes. Store the outputs into `x_tt` and `x_effect`, respectively.
Show solution
x_tt <- aldex.ttest(x)
x_effect <- aldex.effect(x, CI = TRUE)
- Create a data.frame named `aldex_out` which includes both `x_tt` and `x_effect`, and filter for the features with `wi.eBH < 0.05`. Are there any significantly differentially abundant taxa?
Show solution
# rownames_to_column and filter require the tibble and dplyr packages
library(tibble)
library(dplyr)
aldex_out <- data.frame(x_tt, x_effect)
aldex_out %>%
  rownames_to_column(var = "Genus") %>%
  # Keep only the significant features
  filter(wi.eBH < 0.05) %>%
  dplyr::select(Genus, we.eBH, wi.eBH, effect, overlap)
- Extra: If these results appear boring, repeat steps 1 - 5, but use `Gender` or `Age` as the grouping variable. Do we have any better luck with Gender? What is the problem with Age?
24.8.2 Controlling for confounders
- Import the mia and MicrobiomeStat packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Agglomerate the TreeSE by genus and filter by a prevalence of 10%. You can perform both operations in one go with `subsetByPrevalent` by specifying the `rank` and `prevalence` arguments.
Show solution
tse <- subsetByPrevalent(tse,
rank = "genus",
prevalence = 0.1)
- Model the counts assay of the TreeSE with `linda` and store the output into a variable named `linda_out`. Provide the colData converted into a data.frame (with `as.data.frame`) as the second argument, and a `formula` with Age, Gender and Diet as variables. For example, `formula = "~ A + B"` represents a formula with variables A and B.
Show solution
linda_out <- linda(as.data.frame(assay(tse, "counts")),
as.data.frame(colData(tse)),
formula = "~ Diet + Gender + Age")
- Extract the `output$AgeElderly` object from `linda_out` with `$` and store it into a variable named `linda_res`.
Show solution
linda_res <- linda_out$output$AgeElderly
- Filter `linda_res` for features with `reject == TRUE`. How many differentially abundant taxa were found? What are their names, and how significant are they in terms of log-fold change and adjusted p-value?
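A sketch of the filtering step (linda's output includes `reject`, `log2FoldChange` and `padj` columns):
library(dplyr)
library(tibble)
linda_res %>%
  rownames_to_column("Taxon") %>%
  filter(reject) %>%
  dplyr::select(Taxon, log2FoldChange, padj)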
24.8.3 Comparing methods
Here, we conduct DAA with the same parameters as in the previous exercise, but with a different method, namely ZicoSeq. We aim to compare the results between these two methods and draw better informed conclusions from such a comparative approach.
- Import the mia and GUniFrac packages, load any of the example data sets mentioned in Chapter 3.3 with `data` and store it into a variable named `tse`.
- Agglomerate the TreeSE by genus and filter by a prevalence of 10%. You can perform both operations in one go with `subsetByPrevalent` by specifying the `rank` and `prevalence` arguments.
Show solution
tse <- subsetByPrevalent(tse,
rank = "genus",
prevalence = 0.1)
- Model the counts assay of the TreeSE with `ZicoSeq`, providing it as the `feature.dat` argument, and store the output into a variable named `zicoseq_out`. Provide also the colData converted to a data.frame (with `as.data.frame`) as `meta.dat`. In addition, set `grp.name` to `"Age"`, `adj.name` to `c("Diet", "Gender")`, `feature.dat.type` to `"count"`, `return.feature.dat` to `TRUE` and `perm.no` to `999`.
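A possible solution:
library(GUniFrac)
zicoseq_out <- ZicoSeq(meta.dat = as.data.frame(colData(tse)),
    feature.dat = as.matrix(assay(tse, "counts")),
    grp.name = "Age", adj.name = c("Diet", "Gender"),
    feature.dat.type = "count",
    return.feature.dat = TRUE, perm.no = 999)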
- View the top six differentially abundant taxa and their adjusted p-values with `head(sort(zicoseq_out$p.adj.fdr))`. Is there any significant taxon according to ZicoSeq? Compared to the output of linda, do we see the same taxa at the top in terms of significance? Overall, to what extent do the two methods agree with one another?
Show solution
# The pipe and arrange come from the dplyr package
library(dplyr)
zicoseq_res <- cbind.data.frame(p.raw = zicoseq_out$p.raw,
                                p.adj.fdr = zicoseq_out$p.adj.fdr)
zicoseq_res %>%
  arrange(p.raw) %>%
  head() %>%
  knitr::kable()
24.8.4 Workflow 1
- Get the abundances for an individual feature (taxonomic group / row).
- Visualize the abundances per group with boxplot / jitterplot.
- Is the difference significant (Wilcoxon test)?
- Is the difference significant (linear model with covariates)?
- How do transformations affect the outcome (log10, clr..)?
- Get p-values for all features (taxa), for instance with a for loop.
- Do multiple testing correction.
- Compare the results from different tests with a scatterplot.
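A sketch of steps 1-3, assuming GlobalPatterns and comparing two of its sample types:
library(ggplot2)
# Abundances of a single feature (here the first row) across samples
df <- data.frame(abundance = assay(tse, "counts")[1, ],
                 group = tse$SampleType)
# Boxplot with jittered points per group
ggplot(df, aes(x = group, y = abundance)) +
  geom_boxplot() +
  geom_jitter(width = 0.2)
# Wilcoxon test between two of the groups
wilcox.test(abundance ~ group,
            data = droplevels(subset(df, group %in% c("Feces", "Soil"))))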
Useful functions: [], ggplot2::geom_boxplot, ggplot2::geom_jitter, wilcox.test, lm, transformAssay, p.adjust
24.8.5 Workflow 2
- Install the latest development version of mia from GitHub.
- Load experimental dataset from mia.
- Compare the abundances of each taxon between groups. First, use the Wilcoxon or Kruskal-Wallis test. Then use some other method dedicated to microbiome data.
- Summarize the findings by plotting a taxa vs samples heatmap. Add a column annotation that tells the group of each sample, and a row annotation that tells whether the difference for each taxon was statistically significant.
- Choose statistically significant taxa and visualize their abundances with boxplot & jitterplot.
Useful functions: wilcox.test, kruskal.test, ggplot2::ggplot, pheatmap::pheatmap, ComplexHeatmap::Heatmap, ANCOMBC::ancombc2, ALDEx2::aldex, Maaslin2::Maaslin2, agglomerateByRank, transformAssay, subsetByPrevalent
24.9 Visualization
24.9.1 Multivariate ordination
- Load experimental dataset from mia.
- Create PCoA with Bray-Curtis dissimilarities.
- Create PCA with Aitchison dissimilarities.
- Visualize and compare both.
- Test other transformations, dissimilarities, and ordination methods.
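A minimal sketch; the object names `pcoa_bray` and `pca_aitchison` are reused in the clustering exercise further below, and the dataset choice is an assumption:
library(mia)
library(scater)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns
# PCoA (MDS) with Bray-Curtis dissimilarity on relative abundances
tse <- transformAssay(tse, method = "relabundance")
pcoa_bray <- runMDS(tse, FUN = vegan::vegdist, method = "bray",
    assay.type = "relabundance", name = "MDS_bray")
# PCA on CLR-transformed counts (Euclidean distance on CLR = Aitchison distance)
tse <- transformAssay(tse, method = "clr", pseudocount = 1)
pca_aitchison <- runPCA(tse, assay.type = "clr", name = "PCA")
# Visualize and compare the two ordinations
plotReducedDim(pcoa_bray, "MDS_bray", colour_by = "SampleType")
plotReducedDim(pca_aitchison, "PCA", colour_by = "SampleType")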
Useful functions: scater::runMDS, runNMDS, transformAssay, ggplot, scater::plotReducedDim
24.9.2 Heatmap visualization
- Load one of the datasets from mia and the miaViz and sechm packages.
- Agglomerate your TreeSE to include the most prevalent Phyla.
Show solution
tse <- agglomerateByPrevalence(tse, rank = "Phylum", prevalence = 0.9)
- Add a relative abundance assay to the TreeSE.
Show solution
tse <- transformAssay(tse, assay.type = "counts", method = "relabundance")
- Visualize the relative abundances with a heatmap using the sechm library.
Show solution
setSechmOption("hmcols", value=c("#F0F0FF","#007562"))
sechm(tse, features = rownames(tse), assayName="relabundance", show_colnames=TRUE)
24.9.3 Advanced Heatmaps
- Import mia, miaViz, ComplexHeatmap and shadowtext, as well as a dataset of your choice.
Show solution
library(mia)
library(miaViz)
library(shadowtext)
library(ComplexHeatmap)
data("Tengeler2020", package = "mia")
tse <- Tengeler2020
- Agglomerate the data by prevalence to a rank of your choice, apply a log10 transformation and ensure unique rownames for when we plot the heatmap.
Show solution
tse <- agglomerateByPrevalence(tse, rank = "Family")
tse <- transformAssay(tse, method = "log10", pseudocount = TRUE)
rownames(tse) <- getTaxonomyLabels(tse)
- Calculate the correlations and their respective p-values between the log10 and counts assays.
Show solution
correlations <- getCrossAssociation(
tse,
test.signif=TRUE,
assay.type1 = "log10",
assay.type2 = "counts",
method = "spearman",
p.adj.threshold = NULL,
cor.threshold = NULL,
# Remove when mia is fixed
mode = "matrix",
sort = TRUE,
show.warnings = FALSE)
- Plot a heatmap using the correlations and mark significant correlations (p < 0.05) with a cross.
Show solution
# Create a heatmap and store it
p <- Heatmap(
correlations$cor,
# Print values to cells
cell_fun = function(j, i, x, y, width, height, fill) {
# If the p-value is under threshold
if( !is.na(correlations$pval[i, j]) & correlations$pval[i, j] < 0.05 ){
# Print "X"
grid.shadowtext(sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "#f5f5f5"))
}
},
heatmap_legend_param = list(title = "", legend_height = unit(5, "cm"))
)
p
- Plot a similar heatmap, but adjust the color scale to ensure that 0 is in the middle.
Hint: utilize circlize::colorRamp2()
Show solution
library(circlize)
# Update color scale to ensure 0 is in the middle
col_fun<-colorRamp2(c(-1, -0.5, 0, 0.5, 1), c("darkblue","skyblue","white", "brown1", "darkred"))
# Create a heatmap and store it
p <- Heatmap(
correlations$cor,
# Print values to cells
cell_fun = function(j, i, x, y, width, height, fill) {
# If the p-value is under threshold
if( !is.na(correlations$pval[i, j]) & correlations$pval[i, j] < 0.05 ){
# Print "X"
grid.shadowtext(sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "#f5f5f5"))
}
},
heatmap_legend_param = list(title = "", legend_height = unit(5, "cm")),
col = col_fun
)
p
Extra cool exercise: Identify patterns and clusters. To make the exercise more engaging, let's do an exploratory data analysis task:
24.9.3.1 1. Cluster Analysis on PCoA/PCA Results:
- Perform clustering on the ordination results (e.g., k-means clustering).
- Visualize the clusters on the ordination plots.
library(scater)
# Setting seed for reproducibility
set.seed(123)
# Perform k-means clustering on the reduced dimensions of pcoa_bray
# and store the cluster assignments in the colData
pcoa_bray$cluster <- factor(kmeans(reducedDim(pcoa_bray, "MDS_bray"), centers = 3)$cluster)
# Plot the reduced dimensions of pcoa_bray with clusters as colours
plotReducedDim(pcoa_bray, "MDS_bray", colour_by = "cluster")
# Perform k-means clustering on the reduced dimensions of pca_aitchison
pca_aitchison$cluster <- factor(kmeans(reducedDim(pca_aitchison, "PCA"), centers = 3)$cluster)
# Plot the reduced dimensions of pca_aitchison with clusters as colours
plotReducedDim(pca_aitchison, "PCA", colour_by = "cluster")
Here `reducedDim` extracts the reduced-dimensionality representation of the data (e.g., from a PCA or PCoA object, such as those created in the ordination exercise above), and `plotReducedDim` plots these reduced dimensions; its `colour_by` argument accepts the name of a colData column, which is why the cluster assignments are stored in the colData first.
24.9.3.2 2. Heatmap with Annotations:
- Annotate the heatmap with additional metadata (e.g., subject groups or sample types).
# Load necessary libraries
library(SummarizedExperiment)
library(pheatmap)
# Create an example SummarizedExperiment object with assay data and sample metadata
example_data <- matrix(runif(200), nrow = 20, ncol = 10,
                       dimnames = list(paste0("Feature", 1:20), paste0("Sample", 1:10)))
col_data <- DataFrame(Group = rep(c("A", "B"), each = 5),
                      row.names = paste0("Sample", 1:10))
enterotype <- SummarizedExperiment(assays = list(counts = example_data), colData = col_data)
# Extract sample metadata
sample_metadata <- colData(enterotype)
# Extract assay data
assay_data <- assay(enterotype, "counts")
# Plot heatmap with clustering and sample annotations
# (rownames of annotation_col must match the colnames of the matrix)
pheatmap(assay_data, cluster_rows = TRUE, cluster_cols = TRUE,
         annotation_col = as.data.frame(sample_metadata))
24.10 Multiomics
24.10.1 Basic exploration
Here we learn how to conduct preliminary exploration on a MAE, using HintikkaXOData as an example dataset.
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- Which experiments make up the MAE? How many samples and features are contained in each experiment? You can get a summary of all experiments with `experiments`, and check each individual experiment with `dim`, `nrow` and `ncol`.
- What are the names of the features and samples of the different experiments? You can see them with `rownames` and `colnames`, respectively.
- What information is known about the samples? Remember that information about samples is stored in the `colData` of the MAE.
- Extra: How do the samples of the individual experiments map to the columns of the MAE? You can find the sample mapping in the `sampleMap` of the MAE.
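A sketch of the exploration:
library(mia)
data("HintikkaXOData", package = "mia")
mae <- HintikkaXOData
# Overview of the experiments and their dimensions
experiments(mae)
dim(mae[[1]])
# Sample information and sample mapping
colData(mae)
sampleMap(mae)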
So far so good. You explored a MAE and its experiments, getting a taste of how information is organized in its internal structure.
24.10.2 Experiment agglomeration
Here we learn how to manipulate an experiment contained by a MAE and save the new modified version of the experiment in a suitable place (the altExp slot).
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- Agglomerate the microbiota experiment by Genus and store the output in the `altExp` slot of the microbiota experiment, with the custom name `microbiota_genus`.
- How many features remain after agglomerating? What are their names?
- Extra: Create one more alternative experiment named `prevalent_microbiota_family`, which contains the microbiota experiment agglomerated by Family with a prevalence threshold of 10%. You can agglomerate and select by prevalence in one go with `agglomerateByPrevalence`.
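A sketch, assuming the microbiota experiment is the first one in the MAE:
altExp(mae[[1]], "microbiota_genus") <- agglomerateByRank(mae[[1]], rank = "Genus")
dim(altExp(mae[[1]], "microbiota_genus"))
# Extra: agglomerate by Family and filter by 10% prevalence in one go
altExp(mae[[1]], "prevalent_microbiota_family") <- agglomerateByPrevalence(
    mae[[1]], rank = "Family", prevalence = 0.1)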
Good job! You agglomerated one of the experiments in the MAE and stored it as an alternative experiment.
24.10.3 Experiment transformation
We proceed with an exercise on a different type of data manipulation, that is, transformation of assays of individual experiments in the MAE.
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- What assays are contained in each individual experiment? You can check their names with `assays`.
- Apply a log10 transformation to the assay of the metabolite experiment. For that you can use `transformAssay`; don't forget to specify the assay to be transformed with the argument `assay.type`.
- Apply a CLR transformation to the counts assay of the microbiota experiment. To ensure non-null values in the assay, set `pseudocount` equal to 1.
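A sketch, assuming the assay names used in HintikkaXOData (`nmr` for metabolites, `counts` for microbiota):
# log10 transformation of the metabolite assay
mae[["metabolites"]] <- transformAssay(mae[["metabolites"]],
    assay.type = "nmr", method = "log10")
# CLR transformation of the microbiota counts
mae[["microbiota"]] <- transformAssay(mae[["microbiota"]],
    assay.type = "counts", method = "clr", pseudocount = 1)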
You made it! You learnt how to apply different transformations to the assays of individual experiments in a MAE with `transformAssay`, specifying optional arguments based on the method used.
24.10.4 Assay extraction
The following exercise walks you through disassembling a MAE object in order to retrieve a specific assay, or to store its components as multiple separate csv files.
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- Extract the individual metabolite experiment from the MAE into a distinct TreeSE object named `metabolites`.
- Which and how many assays are contained in `metabolites`? You can check that with `assays` or `assayNames`.
- Write a csv file for the nmr assay with `write.csv`. You can access an individual assay of a TreeSE with `assay` by specifying the name of the desired assay.
- Extra: Repeat steps 1 through 4 for the microbiota and biomarkers experiments, so that a completely disassembled version of the MAE is available.
- Extra: Besides experiments, MAEs also include a sampleData and a sampleMap, which are accessible with `colData(mae)` and `sampleMap(mae)`, respectively. Save each of these two elements into a csv file as well.
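A sketch of the extraction and export (the output file names are illustrative):
metabolites <- mae[["metabolites"]]
assayNames(metabolites)
write.csv(assay(metabolites, "nmr"), "metabolites_nmr.csv")
# Extra: export the sample data and the sample map as well
write.csv(colData(mae), "sample_data.csv")
write.csv(sampleMap(mae), "sample_map.csv")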
Well done! You just split a MAE into its components and stored them as csv files. This script shows a possible approach.
24.10.5 MAE reconstruction
Next, we will try to reconstruct the same MAE from the files you created. Make sure you know their names and location! Alternatively, you can fetch or download the CSV files in this directory with the readily disassembled components of HintikkaXOData.
- Read in the csv files containing assays with `read.csv` and save each of them into a variable named `<assay name>_assays`.
- Create one TreeSE from each assay object with the `TreeSummarizedExperiment` function, as explained in this exercise.
- Read in the sampleData and the sampleMap and store them into the variables `sample_data` and `sample_map`, respectively.
- Combine the components with `MultiAssayExperiment`, where the first argument is an `ExperimentList` (for now include only the microbiota and metabolites TreeSEs), the second is the colData and the third is the sampleMap.
- Make sure that the MAE experiments are identical to the original TreeSEs. You can do that qualitatively by checking their `head` and quantitatively by looking at their `dim`.
- Extra: Add the biomarkers TreeSE as a new experiment to the MAE. Note that new experiments can be added to a MAE through simple concatenation with `c(mae, experiment)`.
Good job! Now you are aware of how MAEs are built and we can proceed to some analytical exercises.
24.10.6 Cross-correlation analysis
Now we will perform a cross-correlation analysis between two of the experiments in the MAE.
- Import the mia package, load HintikkaXOData with `data` and store it into a variable named `mae`.
- Analyze correlations between the microbiota and the biomarkers experiments with `getCrossAssociation`. Don't forget to specify the experiments you want to compare with the arguments `experiment1` and `experiment2`, and which of their assays with `assay.type1` and `assay.type2`.
- What does the output look like? By default, correlation is measured in terms of Kendall tau coefficients. Repeat point 2, but this time change `method` to Spearman coefficients.
- Are you able to infer significance from the output?
- Visualize the results with a heatmap similarly to the example in Section 12.1. Do you see any significant correlations? Interpret your results.
- Extra: Perform cross-correlation analysis between the remaining experiments (microbiota vs metabolites and metabolites vs biomarkers) and visualize the results with heatmaps.
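A sketch of point 2, assuming the experiment and assay names used in HintikkaXOData:
res <- getCrossAssociation(mae,
    experiment1 = "microbiota", experiment2 = "biomarkers",
    assay.type1 = "counts", assay.type2 = "signals",
    method = "spearman", test.signif = TRUE)
head(res)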
Great job! You performed a cross-correlation analysis between two experiments of a MAE and visualized the results with a heatmap. You are also able to customise the correlation method and significance testing used for the analysis.