R/agglomerate.R, R/getPrevalence.R
agglomerate-methods.Rd
Agglomeration functions can be used to sum up data based on specific criteria such as taxonomic ranks, variables or prevalence.
agglomerateByRank(x, ...)
mergeFeaturesByRank(x, ...)
# S4 method for SummarizedExperiment
agglomerateByRank(
x,
rank = taxonomyRanks(x)[1],
onRankOnly = FALSE,
na.rm = FALSE,
empty.fields = c(NA, "", " ", "\t", "-", "_"),
...
)
# S4 method for SummarizedExperiment
mergeFeaturesByRank(
x,
rank = taxonomyRanks(x)[1],
onRankOnly = FALSE,
na.rm = FALSE,
empty.fields = c(NA, "", " ", "\t", "-", "_"),
...
)
# S4 method for SingleCellExperiment
agglomerateByRank(x, ..., altexp = NULL, strip_altexp = TRUE)
# S4 method for SingleCellExperiment
mergeFeaturesByRank(x, ..., altexp = NULL, strip_altexp = TRUE)
# S4 method for TreeSummarizedExperiment
agglomerateByRank(
x,
...,
agglomerate.tree = agglomerateTree,
agglomerateTree = FALSE
)
# S4 method for TreeSummarizedExperiment
mergeFeaturesByRank(x, ..., agglomerate.tree = FALSE)
agglomerateByPrevalence(x, ...)
# S4 method for SummarizedExperiment
agglomerateByPrevalence(
x,
rank = taxonomyRanks(x)[1L],
other_label = "Other",
...
)
x
a SummarizedExperiment object
...
arguments passed to the agglomerateByRank function for SummarizedExperiment objects, to getPrevalence and getPrevalentTaxa (as used in agglomerateByPrevalence), and to mergeRows and sumCountsAcrossFeatures.
remove_empty_ranks
A single boolean value for selecting whether to remove those columns of rowData that include only NAs after agglomeration. (By default: remove_empty_ranks = FALSE)
make_unique
A single boolean value for selecting whether to make rownames unique. (By default: make_unique = TRUE)
detection
Detection threshold for absence/presence. Either an absolute value compared directly to the values of x or, if as_relative = TRUE, a relative value between 0 and 1.
prevalence
Prevalence threshold (in 0 to 1). The required prevalence is strictly greater by default. To include the limit, set include_lowest to TRUE.
as_relative
Logical scalar: Should the detection threshold be applied on compositional (relative) abundances? (default: FALSE)
rank
a single character value defining a taxonomic rank. Must be one of the values returned by the taxonomyRanks() function.
onRankOnly
TRUE or FALSE: Should information only from the specified rank be used or from ranks equal and above? See details. (default: onRankOnly = FALSE)
na.rm
TRUE or FALSE: Should taxa with an empty rank be removed? Use it with caution, since empty entries on the selected rank will be dropped. This setting can be tweaked by defining empty.fields to your needs. (default: na.rm = FALSE)
empty.fields
a character vector defining which values should be regarded as empty. (Default: c(NA, "", " ", "\t", "-", "_")). They will be removed before agglomeration if na.rm = TRUE.
altexp
String or integer scalar specifying an alternative experiment containing the input data.
strip_altexp
TRUE or FALSE: Should alternative experiments be removed prior to agglomeration? This prevents too many nested alternative experiments by default. (default: strip_altexp = TRUE)
agglomerate.tree
TRUE or FALSE: should rowTree() also be agglomerated? (Default: agglomerate.tree = FALSE)
agglomerateTree
alias for agglomerate.tree.
other_label
A single character value used as the label for the summary of non-prevalent taxa. (default: other_label = "Other")
agglomerateByRank returns a taxonomically-agglomerated, optionally-pruned object of the same class as x.
agglomerateByPrevalence returns a taxonomically-agglomerated object of the same class as x and based on prevalent taxonomic results.
Depending on the available taxonomic data and its structure, setting onRankOnly = TRUE has certain implications on the interpretability of your results. If no loops exist (loops meaning two higher ranks containing the same lower rank), the results should be comparable. You can check for loops using detectLoop.
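The effect of a loop can be illustrated with a small base R sketch. This is not the package's implementation: the toy taxonomy table, the counts, and the use of rowsum() with a pasted path are assumptions made purely for illustration. Grouping on the rank column alone merges rows that a grouping on the full taxonomic path would keep apart.

```r
# Hypothetical toy taxonomy: genus "G1" appears under two different
# families, which is a loop in the sense described above.
tax <- data.frame(
  Family = c("FamA", "FamA", "FamB", "FamB"),
  Genus  = c("G1",   "G2",   "G1",   "G3"),
  stringsAsFactors = FALSE
)

# Hypothetical counts: 4 taxa (rows) x 2 samples (columns), filled by column.
counts <- matrix(
  c(10, 20, 30, 40,
     1,  2,  3,  4),
  nrow = 4,
  dimnames = list(NULL, c("S1", "S2"))
)

# Grouping on the Genus column only: both "G1" rows are summed together.
by_rank_only <- rowsum(counts, group = tax$Genus)

# Grouping on the path Family;Genus: the two "G1" rows stay separate.
full_path <- paste(tax$Family, tax$Genus, sep = ";")
by_full_path <- rowsum(counts, group = full_path)

nrow(by_rank_only)  # 3 groups: G1, G2, G3
nrow(by_full_path)  # 4 groups: FamA;G1, FamA;G2, FamB;G1, FamB;G3
```

Without the loop (for instance, if the third row were genus "G4"), both groupings would produce the same number of rows.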
Agglomeration sums up the values of assays at the specified taxonomic level. With certain assays, e.g. those that include binary or negative values, this summing can produce meaningless values. In those cases, consider performing agglomeration first, and then applying the transformation afterwards.
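Why the order of operations matters can be seen in a minimal base R sketch. The matrix, the family labels, and the use of rowsum() as a stand-in for agglomeration are assumptions for illustration only, not the package's internals.

```r
# Hypothetical counts: 3 taxa (rows) x 2 samples (columns), filled by column.
counts <- matrix(
  c(5, 2, 3,
    0, 0, 7),
  nrow = 3,
  dimnames = list(c("t1", "t2", "t3"), c("S1", "S2"))
)
family <- c("FamA", "FamA", "FamB")

# Presence/absence transformation first, then summing by family:
# FamA gets the value 2 in sample S1, which is no longer binary.
pa <- (counts > 0) * 1
pa_then_sum <- rowsum(pa, group = family)

# Summing by family first, then presence/absence: all values stay 0 or 1.
summed <- rowsum(counts, group = family)
sum_then_pa <- (summed > 0) * 1

pa_then_sum
sum_then_pa
```

The second ordering keeps the assay interpretable as presence/absence at the family level.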
agglomerateByPrevalence sums up the values of assays at the taxonomic level specified by rank (by default the highest taxonomic level available) and selects the summed results that exceed the given population prevalence at the given detection level. The other summed values (below the threshold) are agglomerated in an additional row taking the name indicated by other_label (by default "Other").
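The selection logic described above can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the counts are hypothetical and already summed at the chosen rank, the detection threshold is applied to relative abundances, and both comparisons are taken to be strict (the argument description only states this for prevalence).

```r
# Hypothetical rank-level counts: 4 phyla x 4 samples, filled by column.
# Every sample sums to 100.
counts <- matrix(
  c(50, 40,  9,  1,
    60, 30, 10,  0,
    70, 29,  0,  1,
    80, 20,  0,  0),
  nrow = 4,
  dimnames = list(c("P1", "P2", "P3", "P4"), paste0("S", 1:4))
)

detection  <- 1 / 100   # relative abundance threshold
prevalence <- 50 / 100  # fraction of samples required

# Relative abundance per sample, then the fraction of samples in which
# each phylum exceeds the detection threshold.
rel  <- sweep(counts, 2, colSums(counts), "/")
prev <- rowMeans(rel > detection)

# Keep phyla whose prevalence is strictly greater than the threshold;
# everything else is relabelled and summed into a single "Other" row.
keep  <- prev > prevalence
group <- ifelse(keep, rownames(counts), "Other")
agg   <- rowsum(counts, group = group, reorder = FALSE)

prev
agg
```

Here P3 has a prevalence of exactly 0.5, so with a strict comparison it is folded into "Other" together with P4. Column totals are unchanged, because every row ends up in exactly one group.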
data(GlobalPatterns)
# print the available taxonomic ranks
colnames(rowData(GlobalPatterns))
#> [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
taxonomyRanks(GlobalPatterns)
#> [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
# agglomerate at the Family taxonomic rank
x1 <- agglomerateByRank(GlobalPatterns, rank="Family")
## How many taxa before/after agglomeration?
nrow(GlobalPatterns)
#> [1] 19216
nrow(x1)
#> [1] 603
# agglomerate the tree as well
x2 <- agglomerateByRank(GlobalPatterns, rank="Family",
agglomerate.tree = TRUE)
nrow(x2) # same number of rows, but
#> [1] 603
rowTree(x1) # ... different
#>
#> Phylogenetic tree with 19216 tips and 19215 internal nodes.
#>
#> Tip labels:
#> 549322, 522457, 951, 244423, 586076, 246140, ...
#> Node labels:
#> , 0.858.4, 1.000.154, 0.764.3, 0.995.2, 1.000.2, ...
#>
#> Rooted; includes branch lengths.
rowTree(x2) # ... tree
#>
#> Phylogenetic tree with 603 tips and 602 internal nodes.
#>
#> Tip labels:
#> 549322, 951, 244423, 143239, 215972, 138353, ...
#> Node labels:
#> , 0.858.4, 1.000.154, 0.764.3, 0.995.2, 0.943.7, ...
#>
#> Rooted; includes branch lengths.
# If the assay contains binary or negative values, summing might lead to
# meaningless values, and you will get a warning. In these cases, you might
# want to redo the transformation on the agglomerated data.
tse <- transformAssay(GlobalPatterns, method = "pa")
tse <- agglomerateByRank(tse, rank = "Genus")
#> Warning: 'pa' includes binary values.
#> Agglomeration of it might lead to meaningless values.
#> Check the assay, and consider doing transformation again manually with agglomerated data.
tse <- transformAssay(tse, method = "pa")
# removing empty labels by setting na.rm = TRUE
sum(is.na(rowData(GlobalPatterns)$Family))
#> [1] 5603
x3 <- agglomerateByRank(GlobalPatterns, rank="Family", na.rm = TRUE)
nrow(x3) # different from x2
#> [1] 341
# Because all the rownames are from the same rank, rownames do not include
# prefixes, in this case "Family:".
print(rownames(x3[1:3,]))
#> [1] "Sulfolobaceae" "SAGMA-X" "Cenarchaeaceae"
# To add them, use getTaxonomyLabels function.
rownames(x3) <- getTaxonomyLabels(x3, with_rank = TRUE)
print(rownames(x3[1:3,]))
#> [1] "Family:Sulfolobaceae" "Family:SAGMA-X" "Family:Cenarchaeaceae"
# use 'remove_empty_ranks' to remove columns that include only NAs
x4 <- agglomerateByRank(GlobalPatterns, rank="Phylum", remove_empty_ranks = TRUE)
head(rowData(x4))
#> DataFrame with 6 rows and 2 columns
#> Kingdom Phylum
#> <character> <character>
#> Phylum:Crenarchaeota Archaea Crenarchaeota
#> Phylum:Euryarchaeota Archaea Euryarchaeota
#> Phylum:Actinobacteria Bacteria Actinobacteria
#> Phylum:Spirochaetes Bacteria Spirochaetes
#> Phylum:MVP-15 Bacteria MVP-15
#> Phylum:Proteobacteria Bacteria Proteobacteria
# If the assay contains NAs, you might want to consider replacing them,
# since summing up NAs leads to NA
x5 <- GlobalPatterns
# Replace first value with NA
assay(x5)[1,1] <- NA
x6 <- agglomerateByRank(x5, "Kingdom")
head( assay(x6) )
#> CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong
#> Archaea NA 1248 28811 33 57 42 112 140 303
#> Bacteria 862815 1134209 668698 1543418 2076419 718901 433782 186157 2000099
#> M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm NP2 NP3
#> Archaea 30 131 145 4459 24692 28051 1826 43197
#> Bacteria 100157 2117461 1217167 1163289 2332489 1671242 521808 1435768
#> NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1 Even2 Even3
#> Archaea 33996 843 8418 14250 1598 1690 150 23 91
#> Bacteria 1618758 57845 484708 265454 935868 1209381 1215987 971050 1078150
# Replace NAs with 0. This is justified when we are summing up counts.
assay(x5)[ is.na(assay(x5)) ] <- 0
x6 <- agglomerateByRank(x5, "Kingdom")
head( assay(x6) )
#> CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr M31Tong
#> Archaea 1262 1248 28811 33 57 42 112 140 303
#> Bacteria 862815 1134209 668698 1543418 2076419 718901 433782 186157 2000099
#> M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm NP2 NP3
#> Archaea 30 131 145 4459 24692 28051 1826 43197
#> Bacteria 100157 2117461 1217167 1163289 2332489 1671242 521808 1435768
#> NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1 Even2 Even3
#> Archaea 33996 843 8418 14250 1598 1690 150 23 91
#> Bacteria 1618758 57845 484708 265454 935868 1209381 1215987 971050 1078150
## Look at enterotype dataset...
data(enterotype)
## Print the available taxonomic ranks. Shows only 1 available rank,
## not useful for agglomerateByRank
taxonomyRanks(enterotype)
#> [1] "Genus"
## Data can be aggregated based on prevalent taxonomic results
tse <- GlobalPatterns
tse <- agglomerateByPrevalence(tse,
rank = "Phylum",
detection = 1/100,
prevalence = 50/100,
as_relative = TRUE)
#> Warning: The 'getPrevalentTaxa' function is deprecated. Use 'getPrevalentFeatures' instead.
tse
#> class: TreeSummarizedExperiment
#> dim: 6 26
#> metadata(2): agglomerated_by_rank agglomerated_by_rank
#> assays(1): counts
#> rownames(6): Phylum:Actinobacteria Phylum:Proteobacteria ...
#> Phylum:Firmicutes Other
#> rowData names(7): Kingdom Phylum ... Genus Species
#> colnames(26): CL3 CC1 ... Even2 Even3
#> colData names(7): X.SampleID Primer ... SampleType Description
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL
# Here data is aggregated at the taxonomic level "Phylum". The five phyla
# that exceed the population prevalence threshold of 50/100 make up the
# first five rows of the assay in the aggregated data. The sixth and last row,
# named "Other" by default, holds the summed values of all the other phyla
# that are below the prevalence threshold.
assay(tse)[,1:5]
#> CL3 CC1 SV1 M31Fcsw M11Fcsw
#> Phylum:Actinobacteria 39601 90280 121703 2540 841
#> Phylum:Proteobacteria 294228 361327 224004 18798 86614
#> Phylum:Cyanobacteria 1955 3353 16676 423 212812
#> Phylum:Bacteroidetes 67395 96398 93436 804395 1424107
#> Phylum:Firmicutes 8584 4726 3524 700084 330423
#> Other 452314 579373 238166 17211 21679