9 Agglomeration
In Chapter 8, we covered how to select features from the dataset. Agglomeration, on the other hand, involves combining data points into broader categories by summing their values. For instance, if you have counts for individual species, you might agglomerate them into groups based on their family or genus. This means you would add up the counts of all species within a particular family to get a single value that represents that family. While this method simplifies your dataset and highlights overall trends, it means you lose the specific information about individual species.
The choice between these subsetting and agglomeration depends on your research goals and the type of insights you want to gain from your data. Agglomeration is often used to reduce the number of features, especially in sequencing data, where there may not be enough resolution to reliably differentiate between closely related species. By combining data at higher taxonomic levels, you can focus on broader patterns and trends in the community while managing the complexity of the dataset.
Moreover, whenthe total abundances of certain taxonomy rank are important, the data should first be agglomerated to the specified taxonomy level. Afterward, we can select the desired taxa from the dataset.
9.1 Agglomerate based on taxonomy rank
One of the main applications of taxonomic information in regards to count data is to agglomerate count data on taxonomic levels and track the influence of changing conditions through these levels. For this mia
contains the agglomerateByRank()
function.
At its simplest, the function takes a TreeSE
object as input and outputs a TreeSE
object agglomerated to a specified taxonomy level using the rank
parameter. Additionally, we can choose to prune the phylogenetic tree to correspond to the agglomerated data.
# Agglomerate
tse_phylum <- agglomerateByRank(tse, rank = "Phylum", update.tree = TRUE)
tse_phylum
## class: TreeSummarizedExperiment
## dim: 66 26
## metadata(1): agglomerated_by_rank
## assays(1): counts
## rownames(66): ABY1_OD1 AC1 ... ZB2 ZB3
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(26): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (66 rows)
## rowTree: 1 phylo tree(s) (66 leaves)
## colLinks: NULL
## colTree: NULL
The output now contains 66 features, a reduction from the original 19216 rows. It’s important to note that the samples remain unchanged from the original dataset. Let’s take a look at the rowData
to see how it looks.
rowData(tse_phylum)
## DataFrame with 66 rows and 7 columns
## Kingdom Phylum Class Order
## <character> <character> <character> <character>
## ABY1_OD1 Bacteria ABY1_OD1 NA NA
## AC1 Bacteria AC1 NA NA
## AD3 Bacteria AD3 NA NA
## Acidobacteria Bacteria Acidobacteria NA NA
## Actinobacteria Bacteria Actinobacteria NA NA
## ... ... ... ... ...
## WS1 Bacteria WS1 NA NA
## WS2 Bacteria WS2 NA NA
## WS3 Bacteria WS3 NA NA
## ZB2 Bacteria ZB2 NA NA
## ZB3 Bacteria ZB3 NA NA
## Family Genus Species
## <character> <character> <character>
## ABY1_OD1 NA NA NA
## AC1 NA NA NA
## AD3 NA NA NA
## Acidobacteria NA NA NA
## Actinobacteria NA NA NA
## ... ... ... ...
## WS1 NA NA NA
## WS2 NA NA NA
## WS3 NA NA NA
## ZB2 NA NA NA
## ZB3 NA NA NA
As we observe from the taxonomy table, all lower ranks below Phylum now contain NA
values. This is expected, as we have agglomerated the data to the Phylum level, meaning that the lowest rank that rows can be uniquely mapped to is the Family rank.
Since we specified update.tree = TRUE
, the phylogenetic tree has also been agglomerated. This is evident from the tree, which now has only 66 tips, each corresponding to a single row in the dataset.
rowTree(tse_phylum)
##
## Phylogenetic tree with 66 tips and 65 internal nodes.
##
## Tip labels:
## 549322, 173903, 368907, 51888, 591769, 249529, ...
## Node labels:
## , 0.858.4, 1.000.21479, 0.946.619, 0.953.588, 0.940.473, ...
##
## Rooted; includes branch lengths.
Additionally, we can examine the counts assay to assess how the agglomeration has affected the counts.
assay(tse_phylum, "counts") |> head()
## CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
## ABY1_OD1 11 33 0 0 0 0 0 0
## AC1 0 0 0 0 0 0 0 0
## AD3 1299 15577 17 2 1 3 0 0
## Acidobacteria 269202 183132 92863 120 115 249 1901 332
## Actinobacteria 39601 90280 121703 2540 841 85825 103067 27295
## Armatimonadetes 782 1124 1972 0 1 4 20 8
## M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm NP2
## ABY1_OD1 0 0 0 6 6 18 21 0
## AC1 0 0 0 1 25 209 265 0
## AD3 2 1 0 0 3 1 5 0
## Acidobacteria 150 62 365 469 14622 25997 26860 190
## Actinobacteria 123464 8822 444605 648008 26392 27519 30304 20047
## Armatimonadetes 4 10 463 1683 806 919 1296 7
## NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
## ABY1_OD1 0 0 2 0 1 0 0 0
## AC1 0 0 0 1 0 0 0 0
## AD3 1 2 0 5 1 1 0 4
## Acidobacteria 242 89 830 5650 3110 108 108 1562
## Actinobacteria 19592 168762 1058 2290 2351 27569 130158 105541
## Armatimonadetes 1 2 2 0 4 1 4 7
## Even2 Even3
## ABY1_OD1 0 0
## AC1 0 0
## AD3 26 2
## Acidobacteria 565 310
## Actinobacteria 79261 89965
## Armatimonadetes 5 6
The values in the counts assay are significantly larger than in the original data, indicating that the values have been summed during agglomeration.
Now when the data is agglomerated, we can check the abundances of certain phyla.
# Store features of interest into phyla
phyla <- c("Actinobacteria", "Chlamydiae")
# subset by feature
tse_sub <- tse_phylum[phyla, ]
# Show dimensions
assay(tse_sub)
## CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
## Actinobacteria 39601 90280 121703 2540 841 85825 103067 27295
## Chlamydiae 265 702 22 0 1 5 22 0
## M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm NP2
## Actinobacteria 123464 8822 444605 648008 26392 27519 30304 20047
## Chlamydiae 0 0 3 8 9 26 35 6
## NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
## Actinobacteria 19592 168762 1058 2290 2351 27569 130158 105541
## Chlamydiae 36 9 2 85 62 2 0 1
## Even2 Even3
## Actinobacteria 79261 89965
## Chlamydiae 1 0
As data was agglomerated, the number of rows should equal the number of phyla used to index (in this case, just 2).
Furthermore, we can observe that the agglomeration will be applied to every assay in the dataset. Let’s add another assay to the dataset and then perform agglomeration again, this time at the Family level.
# Add another assay
assay(tse, "another_assay", withDimnames = FALSE) <- matrix(
runif(ncol(tse)*nrow(tse), 0, 1), ncol = ncol(tse), nrow = nrow(tse))
# Agglomerate
tse_family <- agglomerateByRank(tse, rank = "Family")
assayNames(tse_family)
## [1] "counts" "another_assay"
We can now confirm that the agglomerated dataset contains two assays: “counts” and “another_assay,” consistent with the original data structure.
If the data is agglomerated by features, the ideal location to store the resulting dataset is as an alternative experiment, altExp
slot (see Section 3.7). Let’s add the Phylum data there.
altExp(tse, "phylum") <- tse_phylum
altExpNames(tse)
## [1] "phylum"
altExpNames
now consists of Phylum
level data. This can be extended to use any taxonomic level listed in taxonomyRanks(tse)
. While it is certainly possible to agglomerate data one taxonomic level at a time, you can also aggregate data across all available ranks in a single step using the agglomerateByRanks()
function. This function returns a TreeSE
object that includes the agglomerated data in the altExp
slot.
If you want the data as a list
as discussed in Section 7.1, you can achieve this by specifying as.list = TRUE
tse <- agglomerateByRanks(tse)
altExpNames(tse)
## [1] "phylum" "Kingdom" "Phylum" "Class" "Order" "Family" "Genus"
## [8] "Species"
9.2 Aggregate data based on variable
The agglomerateByRank()
function aggregates data while considering taxonomy information. For more flexible aggregations, the agglomerateByVariable()
method is also available. In some cases, both functions may yield the same results; however, agglomerateByRank() ensures that the entire taxonomy of a feature is unique for successful agglomeration.
For example, a dataset might contain taxa with the same lower-level rank, even if their higher-level ranks differ. Thus, agglomerateByVariable()
should not be used for aggregating taxonomy ranks.
Instead, agglomerateByVariable()
is designed to aggregate data based on other criteria, such as sample groups or clusters. For instance, we can aggregate the data by sample types. The function operates similarly to feature aggregation, but the by
parameter must be set to “rows.”
# Agglomerate samples based on type
tse_sub <- agglomerateByVariable(tse, by = "cols", group = "SampleType")
tse_sub
## class: TreeSummarizedExperiment
## dim: 19216 9
## metadata(0):
## assays(2): counts another_assay
## rownames(19216): 549322 522457 ... 200359 271582
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(9): Feces Freshwater ... Soil Tongue
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(8): phylum Kingdom ... Genus Species
## rowLinks: a LinkDataFrame (19216 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
Now, the data includes as many columns as there are sample types.
In Section 15.1.2, we will explore how cluster information can be used to agglomerate data effectively.
9.3 Agglomerate based on prevalence
Section 8.6 demonstrated how to select only prevalent or rare features. In some cases, it might be beneficial to combine these features that would otherwise be removed. The function agglomerateByPrevalence()
accomplishes this by merging filtered features into a single feature called “Other” by default, preventing unnecessary loss of information. This is particularly useful for retaining features that may still represent a significant proportion when combined.
Here, we demonstrate how to agglomerate the data using prevalence thresholds of 20% and 0.1%, respectively. This means that for a feature to be considered detected in a sample, its abundance must exceed 0.1%. Furthermore, the feature must be present in at least 20% of samples to be retained in the dataset rather than placed in the “Other” group.
To use relative abundances for the detection threshold, we must first transform the data to relative abundances. We will let the Chapter 10 to introduce transformations in more detail.
# Transform
tse <- transformAssay(tse, method = "relabundance")
# Agglomerate
tse_prev <- agglomerateByPrevalence(
tse,
assay.type = "relabundance",
prevalence = 20 / 100,
detection = 0.1 / 100
)
tse_prev
## class: TreeSummarizedExperiment
## dim: 20 26
## metadata(0):
## assays(3): counts another_assay relabundance
## rownames(20): 329744 326977 ... 108747 Other
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(26): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(8): phylum Kingdom ... Genus Species
## rowLinks: a LinkDataFrame (20 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
We have now successfully agglomerated the data based on prevalence. If we are interested in prevalent Phyla, we can certainly first agglomerate the data by Phylum rank and then by prevalence. The function agglomerateByPrevalence()
allows us to accomplish both tasks simultaneously.
tse_prev_phylum <- agglomerateByPrevalence(
tse,
rank = "Phylum",
prevalence = 20 / 100,
detection = 1
)
tse_prev_phylum
## class: TreeSummarizedExperiment
## dim: 48 26
## metadata(2): agglomerated_by_rank agglomerated_by_rank
## assays(3): counts another_assay relabundance
## rownames(48): ABY1_OD1 AD3 ... WS3 Other
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(26): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (48 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL