9  Agglomeration

In Chapter 8, we covered how to select features from the dataset. Agglomeration, on the other hand, involves combining data points into broader categories by summing their values. For instance, if you have counts for individual species, you might agglomerate them into groups based on their family or genus. This means you would add up the counts of all species within a particular family to get a single value that represents that family. While this method simplifies your dataset and highlights overall trends, it means you lose the specific information about individual species.

The choice between these subsetting and agglomeration depends on your research goals and the type of insights you want to gain from your data. Agglomeration is often used to reduce the number of features, especially in sequencing data, where there may not be enough resolution to reliably differentiate between closely related species. By combining data at higher taxonomic levels, you can focus on broader patterns and trends in the community while managing the complexity of the dataset.

Moreover, whenthe total abundances of certain taxonomy rank are important, the data should first be agglomerated to the specified taxonomy level. Afterward, we can select the desired taxa from the dataset.

library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

9.1 Agglomerate based on taxonomy rank

One of the main applications of taxonomic information in regards to count data is to agglomerate count data on taxonomic levels and track the influence of changing conditions through these levels. For this mia contains the agglomerateByRank() function.

At its simplest, the function takes a TreeSE object as input and outputs a TreeSE object agglomerated to a specified taxonomy level using the rank parameter. Additionally, we can choose to prune the phylogenetic tree to correspond to the agglomerated data.

# Agglomerate
tse_phylum <- agglomerateByRank(tse, rank = "Phylum", update.tree = TRUE)
tse_phylum
##  class: TreeSummarizedExperiment 
##  dim: 66 26 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(66): ABY1_OD1 AC1 ... ZB2 ZB3
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (66 rows)
##  rowTree: 1 phylo tree(s) (66 leaves)
##  colLinks: NULL
##  colTree: NULL

The output now contains 66 features, a reduction from the original 19216 rows. It’s important to note that the samples remain unchanged from the original dataset. Let’s take a look at the rowData to see how it looks.

rowData(tse_phylum)
##  DataFrame with 66 rows and 7 columns
##                     Kingdom         Phylum       Class       Order
##                 <character>    <character> <character> <character>
##  ABY1_OD1          Bacteria       ABY1_OD1          NA          NA
##  AC1               Bacteria            AC1          NA          NA
##  AD3               Bacteria            AD3          NA          NA
##  Acidobacteria     Bacteria  Acidobacteria          NA          NA
##  Actinobacteria    Bacteria Actinobacteria          NA          NA
##  ...                    ...            ...         ...         ...
##  WS1               Bacteria            WS1          NA          NA
##  WS2               Bacteria            WS2          NA          NA
##  WS3               Bacteria            WS3          NA          NA
##  ZB2               Bacteria            ZB2          NA          NA
##  ZB3               Bacteria            ZB3          NA          NA
##                      Family       Genus     Species
##                 <character> <character> <character>
##  ABY1_OD1                NA          NA          NA
##  AC1                     NA          NA          NA
##  AD3                     NA          NA          NA
##  Acidobacteria           NA          NA          NA
##  Actinobacteria          NA          NA          NA
##  ...                    ...         ...         ...
##  WS1                     NA          NA          NA
##  WS2                     NA          NA          NA
##  WS3                     NA          NA          NA
##  ZB2                     NA          NA          NA
##  ZB3                     NA          NA          NA

As we observe from the taxonomy table, all lower ranks below Phylum now contain NA values. This is expected, as we have agglomerated the data to the Phylum level, meaning that the lowest rank that rows can be uniquely mapped to is the Family rank.

Since we specified update.tree = TRUE, the phylogenetic tree has also been agglomerated. This is evident from the tree, which now has only 66 tips, each corresponding to a single row in the dataset.

rowTree(tse_phylum)
##  
##  Phylogenetic tree with 66 tips and 65 internal nodes.
##  
##  Tip labels:
##    549322, 173903, 368907, 51888, 591769, 249529, ...
##  Node labels:
##    , 0.858.4, 1.000.21479, 0.946.619, 0.953.588, 0.940.473, ...
##  
##  Rooted; includes branch lengths.

Additionally, we can examine the counts assay to assess how the agglomeration has affected the counts.

assay(tse_phylum, "counts") |> head()
##                     CL3    CC1    SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
##  ABY1_OD1            11     33      0       0       0       0       0       0
##  AC1                  0      0      0       0       0       0       0       0
##  AD3               1299  15577     17       2       1       3       0       0
##  Acidobacteria   269202 183132  92863     120     115     249    1901     332
##  Actinobacteria   39601  90280 121703    2540     841   85825  103067   27295
##  Armatimonadetes    782   1124   1972       0       1       4      20       8
##                  M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm   NP2
##  ABY1_OD1              0       0        0        6      6     18     21     0
##  AC1                   0       0        0        1     25    209    265     0
##  AD3                   2       1        0        0      3      1      5     0
##  Acidobacteria       150      62      365      469  14622  25997  26860   190
##  Actinobacteria   123464    8822   444605   648008  26392  27519  30304 20047
##  Armatimonadetes       4      10      463     1683    806    919   1296     7
##                    NP3    NP5 TRRsed1 TRRsed2 TRRsed3  TS28   TS29  Even1
##  ABY1_OD1            0      0       2       0       1     0      0      0
##  AC1                 0      0       0       1       0     0      0      0
##  AD3                 1      2       0       5       1     1      0      4
##  Acidobacteria     242     89     830    5650    3110   108    108   1562
##  Actinobacteria  19592 168762    1058    2290    2351 27569 130158 105541
##  Armatimonadetes     1      2       2       0       4     1      4      7
##                  Even2 Even3
##  ABY1_OD1            0     0
##  AC1                 0     0
##  AD3                26     2
##  Acidobacteria     565   310
##  Actinobacteria  79261 89965
##  Armatimonadetes     5     6

The values in the counts assay are significantly larger than in the original data, indicating that the values have been summed during agglomeration.

Now when the data is agglomerated, we can check the abundances of certain phyla.

# Store features of interest into phyla
phyla <- c("Actinobacteria", "Chlamydiae")
# subset by feature
tse_sub <- tse_phylum[phyla, ]
# Show dimensions
assay(tse_sub)
##                   CL3   CC1    SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
##  Actinobacteria 39601 90280 121703    2540     841   85825  103067   27295
##  Chlamydiae       265   702     22       0       1       5      22       0
##                 M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm   NP2
##  Actinobacteria  123464    8822   444605   648008  26392  27519  30304 20047
##  Chlamydiae           0       0        3        8      9     26     35     6
##                   NP3    NP5 TRRsed1 TRRsed2 TRRsed3  TS28   TS29  Even1
##  Actinobacteria 19592 168762    1058    2290    2351 27569 130158 105541
##  Chlamydiae        36      9       2      85      62     2      0      1
##                 Even2 Even3
##  Actinobacteria 79261 89965
##  Chlamydiae         1     0
Note

As data was agglomerated, the number of rows should equal the number of phyla used to index (in this case, just 2).

Furthermore, we can observe that the agglomeration will be applied to every assay in the dataset. Let’s add another assay to the dataset and then perform agglomeration again, this time at the Family level.

# Add another assay
assay(tse, "another_assay", withDimnames = FALSE) <- matrix(
  runif(ncol(tse)*nrow(tse), 0, 1), ncol = ncol(tse), nrow = nrow(tse))

# Agglomerate
tse_family <- agglomerateByRank(tse, rank = "Family")

assayNames(tse_family)
##  [1] "counts"        "another_assay"

We can now confirm that the agglomerated dataset contains two assays: “counts” and “another_assay,” consistent with the original data structure.

If the data is agglomerated by features, the ideal location to store the resulting dataset is as an alternative experiment, altExp slot (see Section 3.7). Let’s add the Phylum data there.

altExp(tse, "phylum") <- tse_phylum
altExpNames(tse)
##  [1] "phylum"

altExpNames now consists of Phylum level data. This can be extended to use any taxonomic level listed in taxonomyRanks(tse). While it is certainly possible to agglomerate data one taxonomic level at a time, you can also aggregate data across all available ranks in a single step using the agglomerateByRanks() function. This function returns a TreeSE object that includes the agglomerated data in the altExp slot.

If you want the data as a list as discussed in Section 7.1, you can achieve this by specifying as.list = TRUE

tse <- agglomerateByRanks(tse)
altExpNames(tse)
##  [1] "phylum"  "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"  
##  [8] "Species"

9.2 Aggregate data based on variable

The agglomerateByRank() function aggregates data while considering taxonomy information. For more flexible aggregations, the agglomerateByVariable() method is also available. In some cases, both functions may yield the same results; however, agglomerateByRank() ensures that the entire taxonomy of a feature is unique for successful agglomeration.

For example, a dataset might contain taxa with the same lower-level rank, even if their higher-level ranks differ. Thus, agglomerateByVariable() should not be used for aggregating taxonomy ranks.

Instead, agglomerateByVariable() is designed to aggregate data based on other criteria, such as sample groups or clusters. For instance, we can aggregate the data by sample types. The function operates similarly to feature aggregation, but the by parameter must be set to “rows.”

# Agglomerate samples based on type
tse_sub <- agglomerateByVariable(tse, by = "cols", group = "SampleType")
tse_sub
##  class: TreeSummarizedExperiment 
##  dim: 19216 9 
##  metadata(0):
##  assays(2): counts another_assay
##  rownames(19216): 549322 522457 ... 200359 271582
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(9): Feces Freshwater ... Soil Tongue
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(8): phylum Kingdom ... Genus Species
##  rowLinks: a LinkDataFrame (19216 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

Now, the data includes as many columns as there are sample types.

In Section 15.1.2, we will explore how cluster information can be used to agglomerate data effectively.

9.3 Agglomerate based on prevalence

Section 8.6 demonstrated how to select only prevalent or rare features. In some cases, it might be beneficial to combine these features that would otherwise be removed. The function agglomerateByPrevalence() accomplishes this by merging filtered features into a single feature called “Other” by default, preventing unnecessary loss of information. This is particularly useful for retaining features that may still represent a significant proportion when combined.

Here, we demonstrate how to agglomerate the data using prevalence thresholds of 20% and 0.1%, respectively. This means that for a feature to be considered detected in a sample, its abundance must exceed 0.1%. Furthermore, the feature must be present in at least 20% of samples to be retained in the dataset rather than placed in the “Other” group.

To use relative abundances for the detection threshold, we must first transform the data to relative abundances. We will let the Chapter 10 to introduce transformations in more detail.

# Transform
tse <- transformAssay(tse, method = "relabundance")
# Agglomerate
tse_prev <- agglomerateByPrevalence(
  tse,
  assay.type = "relabundance",
  prevalence = 20 / 100,
  detection = 0.1 / 100
  )
tse_prev
##  class: TreeSummarizedExperiment 
##  dim: 20 26 
##  metadata(0):
##  assays(3): counts another_assay relabundance
##  rownames(20): 329744 326977 ... 108747 Other
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(8): phylum Kingdom ... Genus Species
##  rowLinks: a LinkDataFrame (20 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

We have now successfully agglomerated the data based on prevalence. If we are interested in prevalent Phyla, we can certainly first agglomerate the data by Phylum rank and then by prevalence. The function agglomerateByPrevalence() allows us to accomplish both tasks simultaneously.

tse_prev_phylum <- agglomerateByPrevalence(
  tse,
  rank = "Phylum",
  prevalence = 20 / 100,
  detection =  1
  )
tse_prev_phylum
##  class: TreeSummarizedExperiment 
##  dim: 48 26 
##  metadata(2): agglomerated_by_rank agglomerated_by_rank
##  assays(3): counts another_assay relabundance
##  rownames(48): ABY1_OD1 AD3 ... WS3 Other
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (48 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
Back to top