11 Agglomeration

In Chapter 10, we covered how to select features from the dataset. Agglomeration, on the other hand, involves combining data points into broader feature categories by summing their values. For instance, if you have counts for individual species, you might agglomerate them into groups based on the family of each species. This means you would add up the counts of all species within a particular family to get a single value that represents that family. While this method simplifies your dataset and you lose the specific information about individual species, it can, however, be useful in and highlighting overall trends.

The choice between subsetting and agglomeration depends on your research goals and the type of insights you want to gain from your data. For example, if you are interested in comparing compositions of two types of microbiomes (e.g., gut microbiome and soil microbiome), you might want to first examine their taxonomic profiles at the level of phyla or families, rather than species, to get an idea of their broad similarities or differences. Agglomeration is most often used to reduce the number of features, especially in sequencing data, where there may not be enough resolution to reliably differentiate between closely related species. By combining data at higher taxonomic levels, allows you to focus on broader patterns in the community while managing the complexity of the dataset. Similarly, in addition to aggregating features to higher categories, you might want to aggregate samples within the given sample type, to examine these broad differences across sample types rather than samples.

Note

When you are interested in the total feature abundances of certain taxonomy ranks, the data should first be agglomerated to the specified taxonomy level. Afterward, we can then select the desired taxa from the dataset.

library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

11.1 Agglomerate based on taxonomy rank

One of the main applications of taxonomic information in sequencing data is to agglomerate counts (such as ASV counts) to specific taxonomic levels and to model how different sample-specific variables may influence the feature composition at different taxonomic levels. For this mia contains the agglomerateByRank() function.

At its simplest, the function takes a TreeSE object as input and outputs a TreeSE object agglomerated to a specified taxonomy level using the rank parameter. Note that we can also choose to prune the phylogenetic tree to correspond to the agglomerated data, by setting update.tree = TRUE.

# Agglomerate
tse_phylum <- agglomerateByRank(tse, rank = "Phylum", update.tree = TRUE)
tse_phylum
##  class: TreeSummarizedExperiment 
##  dim: 66 26 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(66): ABY1_OD1 AC1 ... ZB2 ZB3
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (66 rows)
##  rowTree: 1 phylo tree(s) (66 leaves)
##  colLinks: NULL
##  colTree: NULL

The output now contains 66 features, a reduction from the original 19216 rows. Agglomeration has combined features together to higher feature categories, but the samples remain unchanged from the original dataset. Let’s take a look at the rowData to see how it looks.

rowData(tse_phylum)
##  DataFrame with 66 rows and 7 columns
##                     Kingdom         Phylum       Class       Order
##                 <character>    <character> <character> <character>
##  ABY1_OD1          Bacteria       ABY1_OD1          NA          NA
##  AC1               Bacteria            AC1          NA          NA
##  AD3               Bacteria            AD3          NA          NA
##  Acidobacteria     Bacteria  Acidobacteria          NA          NA
##  Actinobacteria    Bacteria Actinobacteria          NA          NA
##  ...                    ...            ...         ...         ...
##  WS1               Bacteria            WS1          NA          NA
##  WS2               Bacteria            WS2          NA          NA
##  WS3               Bacteria            WS3          NA          NA
##  ZB2               Bacteria            ZB2          NA          NA
##  ZB3               Bacteria            ZB3          NA          NA
##                      Family       Genus     Species
##                 <character> <character> <character>
##  ABY1_OD1                NA          NA          NA
##  AC1                     NA          NA          NA
##  AD3                     NA          NA          NA
##  Acidobacteria           NA          NA          NA
##  Actinobacteria          NA          NA          NA
##  ...                    ...         ...         ...
##  WS1                     NA          NA          NA
##  WS2                     NA          NA          NA
##  WS3                     NA          NA          NA
##  ZB2                     NA          NA          NA
##  ZB3                     NA          NA          NA

As we observe from the taxonomy table, all lower ranks below Phylum now contain NA values. This is expected, as we have agglomerated the data to the Phylum level, meaning that the lowest rank that rows can be uniquely mapped to is the Phylum rank.

Since we specified update.tree = TRUE, the phylogenetic tree has also been agglomerated. This is evident from the tree, which now has only 66 tips, each corresponding to a single row in the dataset.

rowTree(tse_phylum)
##  
##  Phylogenetic tree with 66 tips and 65 internal nodes.
##  
##  Tip labels:
##    Crenarchaeota, Euryarchaeota, Actinobacteria, Spirochaetes, MVP-15, Proteobacteria, ...
##  Node labels:
##    , 0.858.4, 1.000.21479, 0.946.619, 0.953.588, 0.940.473, ...
##  
##  Rooted; includes branch length(s).

Additionally, we can examine the counts assay table to assess how the agglomeration has affected the counts of unique features per sample.

assay(tse_phylum, "counts") |> head()
##                     CL3    CC1    SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
##  ABY1_OD1            11     33      0       0       0       0       0       0
##  AC1                  0      0      0       0       0       0       0       0
##  AD3               1299  15577     17       2       1       3       0       0
##  Acidobacteria   269202 183132  92863     120     115     249    1901     332
##  Actinobacteria   39601  90280 121703    2540     841   85825  103067   27295
##  Armatimonadetes    782   1124   1972       0       1       4      20       8
##                  M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm   NP2
##  ABY1_OD1              0       0        0        6      6     18     21     0
##  AC1                   0       0        0        1     25    209    265     0
##  AD3                   2       1        0        0      3      1      5     0
##  Acidobacteria       150      62      365      469  14622  25997  26860   190
##  Actinobacteria   123464    8822   444605   648008  26392  27519  30304 20047
##  Armatimonadetes       4      10      463     1683    806    919   1296     7
##                    NP3    NP5 TRRsed1 TRRsed2 TRRsed3  TS28   TS29  Even1
##  ABY1_OD1            0      0       2       0       1     0      0      0
##  AC1                 0      0       0       1       0     0      0      0
##  AD3                 1      2       0       5       1     1      0      4
##  Acidobacteria     242     89     830    5650    3110   108    108   1562
##  Actinobacteria  19592 168762    1058    2290    2351 27569 130158 105541
##  Armatimonadetes     1      2       2       0       4     1      4      7
##                  Even2 Even3
##  ABY1_OD1            0     0
##  AC1                 0     0
##  AD3                26     2
##  Acidobacteria     565   310
##  Actinobacteria  79261 89965
##  Armatimonadetes     5     6

The values in the counts assay are significantly larger than in the original data, indicating that the values have been summed during agglomeration.

Now when the data is agglomerated, we can check the abundances of certain phyla across the samples.

# Store features of interest into phyla
phyla <- c("Actinobacteria", "Chlamydiae")
# subset by feature
tse_sub <- tse_phylum[phyla, ]
# Show dimensions
assay(tse_sub)
##                   CL3   CC1    SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
##  Actinobacteria 39601 90280 121703    2540     841   85825  103067   27295
##  Chlamydiae       265   702     22       0       1       5      22       0
##                 M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm   NP2
##  Actinobacteria  123464    8822   444605   648008  26392  27519  30304 20047
##  Chlamydiae           0       0        3        8      9     26     35     6
##                   NP3    NP5 TRRsed1 TRRsed2 TRRsed3  TS28   TS29  Even1
##  Actinobacteria 19592 168762    1058    2290    2351 27569 130158 105541
##  Chlamydiae        36      9       2      85      62     2      0      1
##                 Even2 Even3
##  Actinobacteria 79261 89965
##  Chlamydiae         1     0

Note

As data was agglomerated and then further subsetted, the number of rows should equal the number of phyla used in subsetting (in this case, just 2).

Furthermore, we can observe that the agglomeration will be applied to every assay in the dataset. Let’s add another assay to the original dataset and then perform agglomeration again, this time at the Family level.

# Add another assay
assay(tse, "another_assay", withDimnames = FALSE) <- matrix(
  runif(ncol(tse)*nrow(tse), 0, 1), ncol = ncol(tse), nrow = nrow(tse))

# Agglomerate
tse_family <- agglomerateByRank(tse, rank = "Family")

assayNames(tse_family)
##  [1] "counts"        "another_assay"

We can now confirm that the agglomerated dataset contains two assays: “counts” and “another_assay,” consistent with the original data structure.

If the data is agglomerated by features, the ideal location to store the resulting dataset is the alternative experiment, altExp slot (see Section 3.7). Let’s add the Phylum data there.

altExp(tse, "phylum") <- tse_phylum
altExpNames(tse)
##  [1] "phylum"

altExpNames now consists of Phylum level data. This can be extended to use any taxonomic level listed in taxonomyRanks(tse). While it is certainly possible to agglomerate data one taxonomic level at a time, you can also aggregate data across all available ranks in a single step using the agglomerateByRanks() function. This function returns a TreeSE object that includes the agglomerated data sets in the altExp slot.

If you want the data as a list as discussed in Section 7.1, you can achieve this by specifying as.list = TRUE

tse <- agglomerateByRanks(tse)
altExpNames(tse)
##  [1] "phylum"  "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"  
##  [8] "Species"

11.2 Aggregate data based on variable

The agglomerateByRank() function aggregates data based on rowData, that is, the agglomerates are made based on features, such as taxonomy information. Alternatively, you can also use agglomerateByVariable() to aggregate the data.

agglomerateByVariable() is designed to aggregate data based on other criteria than taxa, such as sample groups or clusters. The function operates similarly to feature aggregation, but the by parameter must be set to “rows.”

Note

Note that a feature dataset might contain taxa with the same lower-level rank, even if their higher-level ranks differ. For this reason, agglomerateByVariable() should not be used for aggregating taxonomy ranks.

# Agglomerate samples based on type
tse_sub <- agglomerateByVariable(tse, by = "cols", group = "SampleType")
tse_sub
##  class: TreeSummarizedExperiment 
##  dim: 19216 9 
##  metadata(0):
##  assays(2): counts another_assay
##  rownames(19216): 549322 522457 ... 200359 271582
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(9): Feces Freshwater ... Soil Tongue
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(8): phylum Kingdom ... Genus Species
##  rowLinks: a LinkDataFrame (19216 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

Now, the data includes as many columns as there are sample types.

In Section 16.1.2, we will explore how cluster information can be used to agglomerate data effectively.

11.3 Agglomerate based on prevalence

Section 10.6 demonstrated how to select data with only prevalent or rare features. In some cases, it might be beneficial to combine the features that would otherwise be removed. For example, instead of filtering out rare taxa, we can combine the reads from all these rare taxa under a new feature which we can call “Rare taxa”. The function agglomerateByPrevalence() accomplishes this by merging filtered features into a single feature called “Other” by default, preventing unnecessary loss of information. This is particularly useful when working with left-skewed feature data, such as soil microbiome data sets where rare taxa may represent a significant proportion of the reads when combined.

Here, we demonstrate how to agglomerate the data using prevalence and abundance thresholds of 20% and 0.1%, respectively. This means that for a feature to be considered detected in a sample, its relative abundance must exceed 0.1% and after this, the feature must be detected in at least 20% of samples. Taxa that do not meet these criteria are combined in the “Other” group.

To use relative abundances for the detection threshold, we must first transform the data to relative abundances (aka normalise as proportions). These kind of transformations are introduce in more detail in Chapter 12.

# Transform
tse <- transformAssay(tse, method = "relabundance")
# Agglomerate
tse_prev <- agglomerateByPrevalence(
  tse,
  assay.type = "relabundance",
  prevalence = 20 / 100,
  detection = 0.1 / 100
  )
tse_prev
##  class: TreeSummarizedExperiment 
##  dim: 20 26 
##  metadata(0):
##  assays(3): counts another_assay relabundance
##  rownames(20): 329744 326977 ... 108747 Other
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(8): phylum Kingdom ... Genus Species
##  rowLinks: a LinkDataFrame (20 rows)
##  rowTree: 1 phylo tree(s) (20 leaves)
##  colLinks: NULL
##  colTree: NULL

We have now successfully agglomerated the data based on prevalence. Similarly, if we were interested in only the most prevalent Phyla, we can certainly first agglomerate the data by Phylum rank and then by prevalence, combining the rarer Phyla into one category. In fact, the function agglomerateByPrevalence() allows us to accomplish both tasks simultaneously.

tse_prev_phylum <- agglomerateByPrevalence(
  tse,
  rank = "Phylum",
  prevalence = 20 / 100,
  detection =  1
  )
tse_prev_phylum
##  class: TreeSummarizedExperiment 
##  dim: 48 26 
##  metadata(2): agglomerated_by_rank agglomerated_by_rank
##  assays(3): counts another_assay relabundance
##  rownames(48): ABY1_OD1 AD3 ... WS3 Other
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (48 rows)
##  rowTree: 1 phylo tree(s) (48 leaves)
##  colLinks: NULL
##  colTree: NULL

Exercises

Goal: The goal is to learn how to agglomerate the data.

Exercise 1: Agglomerate data

Load any of the example datasets mentioned in Section 4.2.
List the available taxonomic ranks in the data with taxonomyRanks().
Select a single taxonomy rank and list all unique values in this rank. How many unique values there are?
Agglomerate data to the rank and save the agglomerated data to new variable.
How many rows there are in agglomerated data? Does this correspond to the number of unique values in step 2? If not, why they do not match?
Observe rowData of agglomerated data. What has happened? Why there are so many NAs?
Agglomerate data to all taxonomy ranks at once.
Print all available alternative experiments. What these names correspond to?
Select a single alternative experiment, and retrieve the agglomerated data from the altExp slot. How many rows and columns there are?
Agglomerate the data based on prevalence and save the agglomerated data to a new variable.
How does prevalence-agglomeration differ from prevalence-subsetting?
Add prevalence-agglomerated data to altExp slot of the original TreeSE object.
Explore colData. Select some grouping variable and agglomerate the data on the groups. Save the agglomerated data to a new variable. How many rows and columns there are? Why this data cannot be added to altExp()?

Useful functions:

utils::data(), mia::taxonomyRanks(), mia::getUnique(), mia::agglomerateByRank(), SummarizedExperiment::rowData(), mia::agglomerateByRanks(), SingleCellExperiment::altExpNames(), SingleCellExperiment::altExp(), base::dim(), mia::agglomerateByPrevalence(), mia::subsetByPrevalent(), SingleCellExperiment::altExp()<-, SummarizedExperiment::colData(), mia::agglomerateByVariable()