Chapter 11 Additional Community Typing
# Load data
library(mia)
data(enterotype)
<- enterotype tse
11.1 Community composition
11.1.1 Composition barplot
The typical way to visualise the microbiome compositions is by using a composition barplot. In the following, we calculate the relative abundances and the top 10 taxa. The barplot is ordered by “Bacteroidetes”:
library(miaViz)
# Remove unwanted taxon
<- subsetTaxa(tse, rownames(tse) != "-1")
tse # Only consider Sanger technique
<- subsetSamples(tse, colData(tse)$SeqTech == "Sanger")
mat
# Get top taxa
<- getTopTaxa(tse, top = 10, abund_values = "counts")
rel_taxa
# Take only the top taxa
<- subsetTaxa(mat, is.element(rownames(tse), rel_taxa))
mat
# Visualise composition barplot, with samples order by "Bacteroides"
plotAbundance(mat, abund_values = "counts",
order_rank_by = "abund", order_sample_by = "Bacteroides")
11.1.2 Composition heatmap
Community composition can be visualised with a heatmap. The colour of each intersection point represents the abundance of a taxon in a specific sample.
Here, the CLR + Z-transformed abundances are shown.
library(pheatmap)
library(grid)
library(RColorBrewer)
# CLR and Z transforms
<- transformCounts(tse, abund_values = "counts", method = "clr", pseudocount = 1)
mat <- transformFeatures(mat, abund_values = "clr", method = "z")
mat # Take only the top taxa
<- subsetTaxa(mat, is.element(rownames(tse), rel_taxa))
mat
# For annotating heat map
<- data.frame("SeqTech" = colData(mat)$SeqTech)
SeqTechs rownames(SeqTechs) <- colData(mat)$Sample_ID
# count matrix for pheatmap
<- assays(mat)$z
mat
# Make grid for heatmap
<- seq(-2, 2, length.out = 10)
breaks setHook("grid.newpage", function() pushViewport(viewport(x = 1, y = 1, width = 0.9,
height = 0.9, name = "vp",
just = c("right","top"))),
action = "prepend")
pheatmap(mat, color = rev(brewer.pal(9, "RdBu")), breaks = breaks, main = "SeqTechs", treeheight_row = 0, treeheight_col = 0, show_colnames = 0, annotation_col = SeqTechs, cluster_cols = F)
setHook("grid.newpage", NULL, "replace")
grid.text("Sample", x = 0.39, y = -0.04, gp = gpar(fontsize = 16))
grid.text("Genus", x = -0.04, y = 0.47, rot = 90, gp = gpar(fontsize = 16))
11.2 Cluster into CSTs
The burden of specifying the number of clusters falls on the researcher. To help make an informed decision, we turn to previously established methods for doing so. In this section we introduce three such methods (aside from DMM analysis) to cluster similar samples. They include the Elbow Method, Silhouette Method, and Gap Statistic Method. All of them will utilise the `kmeans’ algorithm which essentially assigns clusters and minimises the distance within clusters (a sum of squares calculation). The default distance metric used is the Euclidean metric.
The scree plot allows us to see how much of the variance is captured by each dimension in the MDS ordination.
library(ggplot2); th <- theme_bw()
# reset data
<- enterotype
tse
# MDS analysis with the first 20 dimensions
<- scater::runMDS(tse, FUN = vegan::vegdist, method = "bray",
tse name = "MDS_BC", exprs_values = "counts", ncomponents = 20)
<- reducedDim(tse, "MDS_BC", withDimnames = TRUE)
ord # retrieve eigenvalues
<- attr(ord, "eig")
eigs
# variance in each of the axes
<- eigs / sum(eigs)
var # first 12 values
<- data.frame(x = c(1:12), y = var[1:12])
df
# create scree plot
<- ggplot(df, aes(x, y)) +
p geom_bar(stat="identity") +
xlab("Principal Component") +
ylab("Variance") +
ggtitle("Scree Plot")
p
From the scree plot (above), we can see that the two principal dimensions can account for 68% of the total variation, but at least the top 4 eigenvalues are considerable. Dimensions beyond 3 and 4 may be useful so let’s try to remove the less relevant dimensions.
# histogram of MDS eigenvalues from the fifth dimension onward
<- hist(eigs[5:length(eigs)], 100) h
plot(h$mids, h$count, log = "y", type = "h", lwd = 10, lend = 2)
As you can see, dimensions 5, 6, and 7 still stand out so we will include 7 MDS dimensions.
11.2.1 Elbow Method
This method graphs the sum of the sum of squares for each \(k\) (number of clusters), where \(k\) ranges between 1 and 10. Where the bend or `elbow’ occurs is the optimal value of \(k\).
library(factoextra)
# take only first 7 dimensions
<- 7
NDIM <- ord[, 1:NDIM]
x
# Elbow Method
::fviz_nbclust(x, kmeans, method = "wss") +
factoextrageom_vline(xintercept = 3, linetype = 2) +
labs(subtitle = "Elbow Method") + th
As you can see, the bend occurs at \(k=3\), implying there are three clusters present in the enterotype data.
11.2.2 Silhouette Method
This method on the otherhand returns a width for each \(k\). In this case, we want the \(k\) that maximises the width.
# Silhouette method
::fviz_nbclust(x, kmeans, method = "silhouette") +
factoextralabs(subtitle = "Silhouette method") + th
The graph shows the maximum occurring at \(k=2\). \(k=3\) is also strongly suggested by the plot and is consistent with what we obtained from the elbow method.
11.2.3 Gap-Statistic Method
The Gap-Statistic Method is the most complicated among the methods discussed here. In the plot produced, we want the \(k\) value that maximises the output (local and global maxima), but we also want to pay attention to where the plot jumps if the maximum value doesn’t turn out to be helpful.
# Gap Statistic Method
::fviz_nbclust(x, kmeans, method = "gap_stat", nboot = 50)+
factoextralabs(subtitle = "Gap Statistic Method") + th
\(k=6\) clusters, as the peak suggests, would not be very revealing since having too many clusters can cease to be revealing and so we look to the points where the graph jumps. We can see this occurs at \(k=2\) and \(k=8\). The output indicates that there are at least two clusters present. Since we have previous evidence for the existence of three clusters and \(k=3\) yields a higher gap statistic than \(k=2\), \(k=3\) seems to be a good choice.
One could argue for the existence of two or three clusters as they are both good candidates. At times like these it helps to visualise the clustering in an MDS or NMDS plot.
Now let’s divide the subjects into their respective clusters.
library(cluster)
# assume 3 clusters
<- 3
K <- ord[, 1:NDIM]
x
<- as.factor(pam(x, k = K, cluster.only = T))
clust # swapping the assignment of 2 and 3 to match ravel cst enumeration
== 2] <- NA
clust[clust == 3] <- 2
clust[clust is.na(clust)] <- 3
clust[
colData(tse)$CST <- clust
<- as.character(seq(K)) CSTs
Let’s take a look at the MDS ordination with the Bray-Curtis dissimilarity in the first four dimensions.
library(scater)
library(RColorBrewer)
library(cowplot)
# set up colours
<- brewer.pal(6, "Paired")[c(2, 5, 3)] # Length 6 for consistency with pre-revision CST+ colouration
CSTColors names(CSTColors) <- CSTs
<- scale_colour_manual(name = "CST", values = CSTColors)
CSTColorScale <- scale_fill_manual(name = "CST", values = CSTColors)
CSTFillScale
# plot MDS with Bray-Curtis dimensions 1 and 2
<- scater::plotReducedDim(tse, "MDS_BC", colour_by = "CST", point_alpha = 1,
p1 percentVar = var[c(1, 2)]*100) + th + labs(title = "Ordination by Cluster") +
theme(plot.title = element_text(hjust = 0.5))
# plot MDS with Bray-Curtis dimensions 3 and 4
<- scater::plotReducedDim(tse, "MDS_BC", colour_by = "CST", point_alpha = 1,
p2 ncomponents = c(3, 4), percentVar = var[c(1, 2, 3, 4)]*100) + th
plot_grid(p1 + CSTColorScale, p2 + CSTColorScale, nrow = 2)
And now nMDS.
<- runNMDS(tse, FUN = vegan::vegdist, method = "bray",
tse name = "NMDS_BC", exprs_values = "counts", ncomponents = 20)
## initial value 4.717325
## iter 5 value 2.138115
## iter 10 value 1.879881
## iter 15 value 1.787953
## iter 20 value 1.750342
## iter 25 value 1.730981
## iter 30 value 1.719617
## iter 30 value 1.717980
## iter 30 value 1.717898
## final value 1.717898
## converged
::plotReducedDim(tse, "NMDS_BC", colour_by = "CST", point_alpha = 1) + th +
scaterlabs(title = "NMDS Bray-Curtis by Cluster") +
theme(plot.title = element_text(hjust = 0.5)) + CSTColorScale
We can also look at a multitude of other distance metrics and subsets of the data.
11.3 Holmes and McMurdie Workflow
For the following distance measures, we will only consider the sequencing technique “Sanger” from the enterotype data.
dim(tse)
## [1] 553 280
# Subset data
<- subsetSamples(tse, colData(tse)$SeqTech == "Sanger")
tse2
# Change NAs to 0
<- as.vector(colData(tse2)$Enterotype)
x is.na(x)] <- 0
x[<- factor(x, levels = c("1", "2", "3", "0"))
x colData(tse2)$Enterotype <- x
dim(tse2)
## [1] 553 41
table(colData(tse2)$Enterotype)
##
## 1 2 3 0
## 8 6 18 9
11.3.1 Jensen-Shannon Distance
library(philentropy)
<- function (x) as.dist(philentropy::JSD(x))
custom_FUN <- scater::runMDS(tse2, FUN = custom_FUN, name = "MDS_JSD", exprs_values = "counts")
tse2 ::plotReducedDim(tse2, "MDS_JSD", colour_by = "Enterotype",
scatershape_by = "Enterotype", point_size = 4, point_alpha = 1) +
ggtitle("Multidimensional Scaling using Jensen-Shannon Divergence") + th
11.3.2 NonMetric Multidimensional Scaling
<- mia::runNMDS(tse2, FUN = custom_FUN, nmdsFUN = "monoMDS", name = "NMDS_JSD")
tse2 ::plotReducedDim(tse2, "NMDS_JSD", colour_by = "Enterotype",
scatershape_by = "Enterotype", point_size = 4, point_alpha = 1) +
ggtitle("Non-Metric Multidimensional Scaling using Jensen-Shannon Divergence") + th
11.3.3 Chi-Squared Correspondence Results
<- mia::runCCA(tse2, name = "PCA_CS", exprs_values = "counts")
tse2 ::plotReducedDim(tse2, "PCA_CS", colour_by = "Enterotype",
scatershape_by = "Enterotype", point_size = 3, point_alpha = 1) +
ggtitle("Correspondance Analysis using Chi-Squared Distance") + th
11.3.4 Bray-Curtis MDS
<- scater::runMDS(tse2, FUN = vegan::vegdist, name = "MDS_BC", exprs_values = "counts")
tse2 ::plotReducedDim(tse2, "MDS_BC", colour_by = "Enterotype",
scatershape_by = "Enterotype", point_size = 4, point_alpha = 1) +
ggtitle("MDS using Bray-Curtis Dissimilarity") + th
11.3.5 Bray-Curtis NMDS
<- mia::runNMDS(tse2, FUN = vegan::vegdist, nmdsFUN = "monoMDS", method = "bray", name = "NMDS_BC")
tse2 ::plotReducedDim(tse2, "NMDS_BC", colour_by = "Enterotype",
scatershape_by = "Enterotype", point_size = 4, point_alpha = 1) +
ggtitle("nMDS using Bray-Curtis") + th
11.3.6 Other distances from the Philentropy package
library(gridExtra)
# take distances that more or less match Holmes and McMurdie
<- getDistMethods()[c(1, 2, 23, 42)]
distances # variable to store results
<- vector("list", length(distances))
plist names(plist) <- distances
# calculate each distance measure
for (i in distances[1:length(distances)]) {
<- function (x) as.dist(philentropy::distance(x, method = i, p = 1))
custom_FUN <- scater::runMDS(tse2, FUN = custom_FUN, name = i, exprs_values = "counts")
tse2 # store result in plist
<- scater::plotReducedDim(tse2, i, colour_by = "Enterotype")
plist[[i]]
}
library(plyr)
# plot
<- ldply(plist, function(x) x$data)
df names(df) <- c("distance", "Axis.1", "Axis.2", "Enterotype")
<- ggplot(df, aes(Axis.1, Axis.2, color = Enterotype, shape = Enterotype)) +
p geom_point(size = 2.5, alpha = 0.5) +
facet_wrap(~distance, scales = "free") +
theme(strip.text = element_text(size = 8)) +
ggtitle("MDS on various distance metrics for Enterotype dataset") + th
p
11.3.7 Add a Clustering Variable
It may be revealing to see the full data (with subsetting) clustered according to a colData variable.
# remove taxon
<- subsetTaxa(tse, rownames(tse) != "-1")
tse3
# Remove NAs
<- as.vector(colData(tse3)$Enterotype)
x <- factor(x)
x colData(tse)$Enterotype <- x
# loop over distances
for (i in distances[1:length(distances)]) {
<- function (x) as.dist(philentropy::distance(x, method = i, p = 1))
custom_FUN <- scater::runMDS(tse3, FUN = custom_FUN, name = i, exprs_values = "counts")
tse3
<- scater::plotReducedDim(tse3, i, colour_by = "SeqTech", shape_by = "Enterotype")
plist[[i]]
}# plot
<- ldply(plist, function(x) x$data)
df names(df) <- c("distance", "Axis.1", "Axis.2", "SeqTech", "Enterotype")
<- ggplot(df, aes(Axis.1, Axis.2, color = SeqTech, shape = Enterotype)) +
p geom_point(size = 2, alpha = 0.5) +
facet_wrap(~distance, scales = "free") +
theme(strip.text = element_text(size = 8)) +
ggtitle("MDS on various distance metrics for Enterotype dataset") + th
p
11.3.8 More Clustering
Dropping 9 outliers (the 0 enterotype), let’s visualise the clustering in a different manner. Based on our analysis above, we concluded there were 3 distinct clusters. The authors of the Nature paper published on the enterotype data created a figure similar to the following:
library(ade4)
# remove 9 outliers
<- subsetSamples(tse2, colData(tse2)$Enterotype != 0)
tse2 colData(tse2)$Enterotype <- factor(colData(tse2)$Enterotype)
<- colData(tse2)$Enterotype
clus <- calculateJSD(tse2)
dist <- dudi.pco(dist, scannf = F, nf = 3)
pcoa s.class(pcoa$li, grid = F, fac = clus)
Revisit the distances without the 9 outlier values.
# loop over distances
for (i in distances[1:length(distances)]) {
<- function (x) as.dist(philentropy::distance(x, method = i, p = 1))
custom_FUN <- scater::runMDS(tse2, FUN = custom_FUN, name = i, exprs_values = "counts")
tse2
<- scater::plotReducedDim(tse2, i, colour_by = "Enterotype")
plist[[i]]
}# plot
<- ldply(plist, function(x) x$data)
df names(df) <- c("distance", "Axis.1", "Axis.2", "Enterotype")
<- ggplot(df, aes(Axis.1, Axis.2, color = Enterotype, shape = Enterotype)) +
p geom_point(size = 2.5, alpha = 0.5) +
facet_wrap(~distance, scales = "free") +
theme(strip.text = element_text(size = 8)) +
ggtitle("MDS on various distance metrics for Enterotype dataset with 9 samples removed") + th
p
Would we use the same number of clusters as before with 9 values missing?
## Slightly faster way to use pam
<- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))
pam1
= function(x, k) {
pamPCoA list(cluster = pam(x[,1:2], k, cluster.only = TRUE))
}# gap statistic method
<- clusGap(assays(tse2)$counts, FUN = pamPCoA, K.max = 10)
gs fviz_gap_stat(gs)
We still see a jump at three so it is still a good choice for the number of clusters.
11.4 CST Analysis
Looking at heat maps of the CSTs can reveal structure on a different level. Using all of the data again (for the top 20 taxa) and taking the z transformation of the clr transformed counts, we have
# Find top 20 taxa
<- getTopTaxa(tse, top = 20)
rel_taxa # Z transform of CLR counts
<- transformCounts(tse, abund_values = "counts", method = "clr", pseudocount = 1)
mat <- transformFeatures(mat, abund_values = "clr", method = "z")
mat # Select top 20 taxa
<- subsetTaxa(mat, is.element(rownames(tse), rel_taxa))
mat <- assays(mat)$z
mat
# Order CSTs
<- mat[, order(clust)]
mat colnames(mat) <- names(sort(clust))
# Plot
<- as.data.frame(sort(clust))
CSTs names(CSTs) <- "CST"
<- seq(-2, 2, length.out = 10)
breaks # Make grid for heatmap
setHook("grid.newpage", function() pushViewport(viewport(x = 1, y = 1, width = 0.9,
height = 0.9, name = "vp",
just = c("right","top"))),
action = "prepend")
pheatmap(mat, color = rev(brewer.pal(9, "RdBu")), breaks = breaks, main = "All CSTs", treeheight_row = 0, treeheight_col = 0, show_colnames = 0, annotation_col = CSTs, cluster_cols = F)
setHook("grid.newpage", NULL, "replace")
grid.text("Sample", x = 0.39, y = -0.04, gp = gpar(fontsize = 16))
grid.text("Genus", x = -0.04, y = 0.47, rot = 90, gp = gpar(fontsize = 16))
11.5 Session Info
sessionInfo()
## R Under development (unstable) (2022-01-07 r81454)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ade4_1.7-18 plyr_1.8.6
## [3] gridExtra_2.3 philentropy_0.5.0
## [5] cowplot_1.1.1 scater_1.23.2
## [7] scuttle_1.5.0 cluster_2.1.2
## [9] factoextra_1.0.7 RColorBrewer_1.1-2
## [11] pheatmap_1.0.12 miaViz_1.3.2
## [13] ggraph_2.0.5 ggplot2_3.3.5
## [15] mia_1.3.13 MultiAssayExperiment_1.21.5
## [17] TreeSummarizedExperiment_2.3.0 Biostrings_2.63.1
## [19] XVector_0.35.0 SingleCellExperiment_1.17.2
## [21] SummarizedExperiment_1.25.3 Biobase_2.55.0
## [23] GenomicRanges_1.47.5 GenomeInfoDb_1.31.1
## [25] IRanges_2.29.1 S4Vectors_0.33.9
## [27] BiocGenerics_0.41.2 MatrixGenerics_1.7.0
## [29] matrixStats_0.61.0
##
## loaded via a namespace (and not attached):
## [1] backports_1.4.1 igraph_1.2.11
## [3] lazyeval_0.2.2 splines_4.2.0
## [5] BiocParallel_1.29.10 digest_0.6.29
## [7] yulab.utils_0.0.4 htmltools_0.5.2
## [9] viridis_0.6.2 fansi_1.0.0
## [11] magrittr_2.0.1 memoise_2.0.1
## [13] ScaledMatrix_1.3.0 DECIPHER_2.23.0
## [15] graphlayouts_0.8.0 colorspace_2.0-2
## [17] blob_1.2.2 ggrepel_0.9.1
## [19] xfun_0.29 dplyr_1.0.7
## [21] crayon_1.4.2 RCurl_1.98-1.5
## [23] jsonlite_1.7.2 ape_5.6-1
## [25] glue_1.6.0 polyclip_1.10-0
## [27] gtable_0.3.0 zlibbioc_1.41.0
## [29] DelayedArray_0.21.2 car_3.0-12
## [31] BiocSingular_1.11.0 abind_1.4-5
## [33] scales_1.1.1 DBI_1.1.2
## [35] rstatix_0.7.0 Rcpp_1.0.7
## [37] viridisLite_0.4.0 decontam_1.15.0
## [39] gridGraphics_0.5-1 tidytree_0.3.7
## [41] bit_4.0.4 rsvd_1.0.5
## [43] ellipsis_0.3.2 pkgconfig_2.0.3
## [45] farver_2.1.0 sass_0.4.0
## [47] utf8_1.2.2 ggplotify_0.1.0
## [49] tidyselect_1.1.1 labeling_0.4.2
## [51] rlang_0.4.12 reshape2_1.4.4
## [53] munsell_0.5.0 tools_4.2.0
## [55] cachem_1.0.6 DirichletMultinomial_1.37.0
## [57] generics_0.1.1 RSQLite_2.2.9
## [59] broom_0.7.11 evaluate_0.14
## [61] stringr_1.4.0 fastmap_1.1.0
## [63] yaml_2.2.1 ggtree_3.3.1
## [65] knitr_1.37 bit64_4.0.5
## [67] tidygraph_1.2.0 purrr_0.3.4
## [69] nlme_3.1-153 sparseMatrixStats_1.7.0
## [71] aplot_0.1.2 compiler_4.2.0
## [73] beeswarm_0.4.0 ggsignif_0.6.3
## [75] treeio_1.19.1 tibble_3.1.6
## [77] tweenr_1.0.2 bslib_0.3.1
## [79] stringi_1.7.6 highr_0.9
## [81] lattice_0.20-45 Matrix_1.4-0
## [83] vegan_2.5-7 permute_0.9-5
## [85] vctrs_0.3.8 pillar_1.6.4
## [87] lifecycle_1.0.1 jquerylib_0.1.4
## [89] BiocNeighbors_1.13.0 bitops_1.0-7
## [91] irlba_2.3.5 patchwork_1.1.1
## [93] R6_2.5.1 bookdown_0.24
## [95] vipor_0.4.5 MASS_7.3-54
## [97] assertthat_0.2.1 withr_2.4.3
## [99] GenomeInfoDbData_1.2.7 mgcv_1.8-38
## [101] parallel_4.2.0 ggfun_0.0.4
## [103] beachmat_2.11.0 tidyr_1.1.4
## [105] rmarkdown_2.11 DelayedMatrixStats_1.17.0
## [107] carData_3.0-5 ggpubr_0.4.0
## [109] ggnewscale_0.4.5 ggforce_0.3.3
## [111] ggbeeswarm_0.6.0
11.6 Bibliography
Robert L. Thorndike. 1953. “Who Belongs in the Family?” Psychometrika. 18 (4): 267–276. doi:10.1007/BF02289263
Peter J. Rousseeuw. 1987. “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.
Robert Tibshirani, Guenther Walther, and Trevor Hastie. 2002. Estimating the number of clusters in a data set via the gap statistic method. (63): 411-423. doi:10.1111/1467-9868.00293.
Susan Holmes and Joey McMurdie. 2015. Reproducible Research: Enterotype Example. http://statweb.stanford.edu/~susan/papers/EnterotypeRR.html.
Daniel B. DiGiulio et al. 2015. Temporal and spatial variation of the human microbiota during pregnancy. (112): 11060–11065. doi:10.1073/pnas.1502875112
Arumugam, M., et al. (2011). Enterotypes of the human gut microbiome.