3 Data Manipulation
3.1 Tidying and Subsetting
3.1.1 Tidy data
For several custom analysis and visualization packages, such as those from tidyverse
, the SE
data can be converted to a long data.frame format with meltSE
.
library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, MARGIN = "cols", method="relabundance")
molten_tse <- mia::meltSE(tse,
add.row = TRUE,
add.col = TRUE,
assay.type = "relabundance")
# Show the first five columns
molten_tse[,1:5]
## # A tibble: 499,616 × 5
## FeatureID SampleID relabundance Kingdom Phylum
## <fct> <fct> <dbl> <chr> <chr>
## 1 549322 CL3 0 Archaea Crenarchaeota
## 2 549322 CC1 0 Archaea Crenarchaeota
## 3 549322 SV1 0 Archaea Crenarchaeota
## 4 549322 M31Fcsw 0 Archaea Crenarchaeota
## 5 549322 M11Fcsw 0 Archaea Crenarchaeota
## 6 549322 M31Plmr 0 Archaea Crenarchaeota
## # ℹ 499,610 more rows
3.1.2 Subsetting
Subsetting data helps to draw the focus of an analysis on particular sets of samples and / or features. When dealing with large datasets, the subset of interest can be extracted and investigated separately. This might improve performance and reduce the computational load.
Load:
- mia
- dplyr
- knitr
- data
GlobalPatterns
Let us store GlobalPatterns
into tse
and check its original number of features (rows) and samples (columns). Note: When subsetting by sample, expect the number of columns to decrease. When subsetting by feature, expect the number of rows to decrease.
3.1.2.1 Subset by sample (column-wise)
For the sake of demonstration, here we will extract a subset containing only the samples of human origin (feces, skin or tongue), stored as SampleType
within colData(tse)
as well as in tse
.
First, we would like to see all the possible values that SampleType
can have and how frequent those are:
# Inspect possible values for SampleType
unique(tse$SampleType)
## [1] Soil Feces Skin
## [4] Tongue Freshwater Freshwater (creek)
## [7] Ocean Sediment (estuary) Mock
## 9 Levels: Feces Freshwater Freshwater (creek) Mock ... Tongue
# Show the frequency of each value
tse$SampleType %>% table()
. | Freq |
---|---|
Feces | 4 |
Freshwater | 2 |
Freshwater (creek) | 3 |
Mock | 3 |
Ocean | 3 |
Sediment (estuary) | 3 |
Skin | 3 |
Soil | 3 |
Tongue | 2 |
Note: after subsetting, expect the number of columns to equal the sum of the frequencies of the samples that you are interested in. For instance, ncols = Feces + Skin + Tongue = 4 + 3 + 2 = 9
.
Next, we logical index across the columns of tse
(make sure to leave the first index empty to select all rows) and filter for the samples of human origin. For this, we use the information on the samples from the meta data colData(tse)
.
As a sanity check, the new object tse_subset_by_sample
should have the original number of features (rows) and a number of samples (columns) equal to the sum of the samples of interest (in this case 9).
Several criteria can be used to subset by sample:
- origin
- sampling time
- sequencing method
- DNA / RNA barcode
- cohort
3.1.2.2 Subset by feature (row-wise)
Similarly, here we will extract a subset containing only the features that belong to the phyla Actinobacteria and Chlamydiae, stored as Phylum
within rowData(tse)
. However, subsetting by feature implies a few more obstacles, such as the presence of NA
elements and the possible need for agglomeration.
As previously, we would first like to see all the possible values that Phylum
can have and how frequent those are:
# Show the frequency of each value
rowData(tse)$Phylum %>% table() %>% head() %>%
kable() %>%
kableExtra::kable_styling("striped", latex_options="scale_down") %>%
kableExtra::scroll_box(width = "100%")
. | Freq |
---|---|
ABY1_OD1 | 7 |
AC1 | 1 |
AD3 | 9 |
Acidobacteria | 1021 |
Actinobacteria | 1631 |
Armatimonadetes | 61 |
Note: after subsetting, expect the number of columns to equal the sum of the frequencies of the feature(s) that you are interested in. For instance, nrows = Actinobacteria + Chlamydiae = 1631 + 21 = 1652
.
Depending on your research question, you might or might not need to agglomerate the data in the first place: if you want to find the abundance of each and every feature that belongs to Actinobacteria and Chlamydiae, agglomeration is not needed. However, if you want to find the total abundance of all features that belong to Actinobacteria or Chlamydiae, agglomeration is recommended.
3.1.2.2.1 Non-agglomerated data
Next, we logical index across the rows of tse
(make sure to leave the second index empty to select all columns) and filter for the features that fall in either Actinobacteria or Chlamydiae group. For this, we use the information on the samples from the metadata rowData(tse)
.
The first term with the %in%
operator includes all the features of interest, whereas the second term after the AND operator &
filters out all features that have an NA
in place of the phylum variable.
As a sanity check, the new object, tse_subset_by_feature
, should have the original number of samples (columns) and a number of features (rows) equal to the sum of the features of interest (in this case, 1652).
3.1.2.2.2 Agglomerated data
When total abundances of certain phyla are of relevance, the data is initially agglomerated by Phylum. Then, similar steps as in the case of non-agglomerated data are followed.
# Agglomerate by phylum
tse_phylum <- tse %>% agglomerateByRank(rank = "Phylum")
# Subset by feature and remove NAs
tse_phylum_subset_by_feature <- tse_phylum[
rowData(tse_phylum)$Phylum %in%
c("Actinobacteria", "Chlamydiae")
&!is.na(rowData(tse_phylum)$Phylum),
]
# Show dimensions
dim(tse_phylum_subset_by_feature)
## [1] 2 26
Note: as data was agglomerated, the number of rows should equal the number of phyla used to index (in this case, just 2).
Alternatively:
The code above returns the non-agglomerated version of the data.
Fewer characteristics can be used to subset by feature:
- Taxonomic rank
- Meta-taxonomic group
For subsetting by kingdom, agglomeration does not apply, whereas for the other ranks it can be applied if necessary.
3.1.2.3 Subset by sample and feature
Finally, we can subset data by sample and feature at once. The resulting subset contains all the samples of human origin and all the features of phyla Actinobacteria or Chlamydiae.
Note: the dimensions of tse_subset_by_sample_feature
are on par with those of the previous subsets (9 columns filtered by sample and 1652 rows filtered by feature).
If a study was to consider and quantify the presence of Actinobacteria as well as Chlamydiae in different sites of the human body, tse_subset_by_sample_feature
might be a suitable subset to start with.
3.1.2.4 Remove empty columns and rows
Sometimes data might contain, e.g., features that are not present in any of the samples. This can occur, for example, after data subsetting. In certain analyses, we might want to remove those instances.
# Agglomerate data at Genus level
tse_genus <- agglomerateByRank(tse, rank = "Genus", na.rm = FALSE)
# List bacteria that we want to include
genera <- c("Class:Thermoprotei",
"Genus:Sulfolobus",
"Genus:Sediminicola")
# Subset data
tse_genus_sub <- tse_genus[genera, ]
tse_genus_sub
## class: TreeSummarizedExperiment
## dim: 3 26
## metadata(1): agglomerated_by_rank
## assays(1): counts
## rownames(3): Class:Thermoprotei Genus:Sulfolobus Genus:Sediminicola
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(26): CL3 CC1 ... Even2 Even3
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (3 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
Now we can see that certain samples do not include any bacteria. We can remove those.
# Remove samples that do not contain any bacteria
tse_genus_sub <- tse_genus_sub[ ,
colSums(assay(
tse_genus_sub, "counts")) != 0
]
tse_genus_sub
## class: TreeSummarizedExperiment
## dim: 3 18
## metadata(1): agglomerated_by_rank
## assays(1): counts
## rownames(3): Class:Thermoprotei Genus:Sulfolobus Genus:Sediminicola
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(18): CL3 M31Fcsw ... TRRsed3 Even2
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (3 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
The same action can also be applied to the features.
# Take only those samples that are collected from feces, skin, or tongue
tse_genus_sub <- tse_genus[ , tse_genus$SampleType %in%
c("Feces", "Skin", "Tongue")]
tse_genus_sub
## class: TreeSummarizedExperiment
## dim: 1516 9
## metadata(1): agglomerated_by_rank
## assays(1): counts
## rownames(1516): Class:0319-6G9 Class:09D2Y74 ... Phylum:ZB2
## Phylum:ZB3
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(9): M31Fcsw M11Fcsw ... TS28 TS29
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (1516 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
We can see that there are bacteria that are not present in the samples we chose. We can remove those bacteria from the data.
# Take only those bacteria that are present
tse_genus_sub <- tse_genus_sub[rowSums(assay
(tse_genus_sub,
"counts")) > 0, ]
tse_genus_sub
## class: TreeSummarizedExperiment
## dim: 1081 9
## metadata(1): agglomerated_by_rank
## assays(1): counts
## rownames(1081): Class:0319-6G9 Class:5B-18 ... Phylum:SR1
## Phylum:WPS-2
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(9): M31Fcsw M11Fcsw ... TS28 TS29
## colData names(7): X.SampleID Primer ... SampleType Description
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (1081 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
3.1.3 Splitting
You can split the data based on variables by using the functions agglomerateByRanks
and splitOn
.
agglomerateByRanks
splits the data based on taxonomic ranks. Since the elements of the output list share columns, they can be stored into altExp
by setting as.list = FALSE
.
tse <- agglomerateByRanks(tse)
altExps(tse)
## List of length 7
## names(7): Kingdom Phylum Class Order Family Genus Species
If you want to split the data based on another variable than taxonomic rank, use splitOn
. It works for row-wise and column-wise splitting.
splitOn(tse, "SampleType")
## List of length 9
## names(9): Soil Feces Skin Tongue ... Ocean Sediment (estuary) Mock
3.2 Add or modify data
The information contained by the colData
of a TreeSE
can be modified by accessing the desired variables.
# modify the Description entries
colData(tse)$Description <- paste(
colData(tse)$Description,
"modified description")
# view modified variable
head(tse$Description)
## [1] "Calhoun South Carolina Pine soil, pH 4.9 modified description"
## [2] "Cedar Creek Minnesota, grassland, pH 6.1 modified description"
## [3] "Sevilleta new Mexico, desert scrub, pH 8.3 modified description"
## [4] "M3, Day 1, fecal swab, whole body study modified description"
## [5] "M1, Day 1, fecal swab, whole body study modified description"
## [6] "M3, Day 1, right palm, whole body study modified description"
New information can also be added to the experiment by creating a new variable.
3.3 Merge data
mia
package has mergeSEs
function that merges multiple SummarizedExperiment
objects. For example, it is possible to combine multiple TreeSE
objects which each includes one sample.
mergeSEs
works much like standard joining operations. It combines rows and columns and allows you to specify the merging method.
# Take subsets for demonstration purposes
tse1 <- tse[, 1]
tse2 <- tse[, 2]
tse3 <- tse[, 3]
tse4 <- tse[1:100, 4]
# With inner join, we want to include all shared rows. When using mergeSEs function
# all samples are always preserved.
tse <- mergeSEs(list(tse1, tse2, tse3, tse4), join = "inner")
tse
## class: TreeSummarizedExperiment
## dim: 100 4
## metadata(0):
## assays(1): counts
## rownames(100): 239672 243675 ... 104332 159421
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(4): CC1 CL3 M31Fcsw SV1
## colData names(8): X.SampleID Primer ... Description NewVariable
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (100 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
# Left join preserves all rows of the 1st object
tse <- mergeSEs(tse1, tse4, missing.values = 0, join = "left")
tse
## class: TreeSummarizedExperiment
## dim: 19216 2
## metadata(0):
## assays(1): counts
## rownames(19216): 239672 243675 ... 239967 254851
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(2): CL3 M31Fcsw
## colData names(8): X.SampleID Primer ... Description NewVariable
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (19216 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL