7 Data wrangling
This chapter introduces several essential techniques for preparing data for analysis: splitting data, modifying data, and converting data to a data.frame. It also explains how to merge multiple SummarizedExperiment objects when needed. For a basic understanding of TreeSE, please refer to Chapter 3.
7.1 Splitting
You can split the data based on variables by using the functions agglomerateByRanks() and splitOn(). The former is detailed in Chapter 9.
If you want to split the data based on a variable other than taxonomic rank, use splitOn(). It works for both row-wise and column-wise splitting. Splitting the data may be useful, for example, if you want to analyze data from different cohorts separately.
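As a minimal sketch of column-wise splitting, the example below splits a TreeSE by a sample-level variable. It assumes that the GlobalPatterns dataset shipped with mia stands in for tse and that SampleType is the variable of interest; neither choice is stated in the text above. The grouping variable is passed positionally because its argument name has differed between mia versions.

```r
library(mia)

# Assumption: GlobalPatterns (shipped with mia) stands in for the chapter's tse
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# Split column-wise by a colData variable (SampleType is an illustrative choice)
tse_list <- splitOn(tse, "SampleType")

# The result is a list with one TreeSE per sample type
names(tse_list)
```

Each element of the list can then be analyzed separately, for instance one cohort or sample type at a time.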
7.2 Add or modify variables
The information contained in the colData of a TreeSE can be added to or modified by accessing the desired variables. Adding new variables or updating existing ones helps ensure that all relevant metadata is available for subsequent analyses.
# Modify the Description entries
colData(tse)$Description <- paste(
    colData(tse)$Description, "modified description")
# View the modified variable
tse$Description |> head()
## [1] "Calhoun South Carolina Pine soil, pH 4.9 modified description"
## [2] "Cedar Creek Minnesota, grassland, pH 6.1 modified description"
## [3] "Sevilleta new Mexico, desert scrub, pH 8.3 modified description"
## [4] "M3, Day 1, fecal swab, whole body study modified description"
## [5] "M1, Day 1, fecal swab, whole body study modified description"
## [6] "M3, Day 1, right palm, whole body study modified description"
New information can be added to the experiment by creating a new variable.
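A minimal sketch of this step follows. It assumes the GlobalPatterns dataset from mia as tse and uses simulated values as placeholder data. The name NewVariable is chosen to match the column that appears in the melted output later in this chapter.

```r
library(mia)

# Assumption: GlobalPatterns (shipped with mia) stands in for the chapter's tse
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# Simulate one value per sample (placeholder data, for illustration only)
set.seed(42)
new_data <- runif(ncol(tse))

# Store the values as a new colData variable
colData(tse)$NewVariable <- new_data

# The shorthand tse$NewVariable accesses the same column
head(tse$NewVariable)
```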
Alternatively, you can add a whole table by merging it with the existing colData.
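One way to add a whole table is sketched below, again assuming GlobalPatterns as tse. The columns var1 and var2 hold simulated placeholder values. The table is reordered to the sample order before binding, because cbind() does not match rows by name.

```r
library(mia)

# Assumption: GlobalPatterns (shipped with mia) stands in for the chapter's tse
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# Simulated sample metadata with one row per sample (placeholder values)
set.seed(42)
new_data <- data.frame(
    var1 = runif(ncol(tse)),
    var2 = runif(ncol(tse)),
    row.names = colnames(tse))

# Reorder rows to the sample order of tse, then bind to the existing colData
new_data <- new_data[colnames(tse), , drop = FALSE]
colData(tse) <- cbind(colData(tse), new_data)

colnames(colData(tse))
```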
Similar steps can also be applied to rowData. If you have an assay whose rows and columns align with the existing ones, you can easily add the assay to the TreeSE object.
Here we add an assay that contains random numbers, but in real life these steps might come in handy after you have transformed the data with a custom transformation that is not available in mia.
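The following sketch shows one way to do this, assuming GlobalPatterns as tse. The matrix is given the same dimnames as the object so that rows and columns align with the existing assay; the assay name "random" follows the text.

```r
library(mia)

# Assumption: GlobalPatterns (shipped with mia) stands in for the chapter's tse
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# Random matrix with the same dimensions and dimnames as the existing assay
set.seed(42)
mat <- matrix(
    rnorm(nrow(tse) * ncol(tse)),
    nrow = nrow(tse), ncol = ncol(tse),
    dimnames = dimnames(tse))

# Store the matrix as a new assay named "random"
assay(tse, "random") <- mat

# List the assays now stored in the object
assayNames(tse)
```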
The TreeSE object now has an additional assay called “random”. When adding new samples or features to your existing dataset, you can use cbind() to add columns for new samples or rbind() to add rows for new features.
tse2 <- cbind(tse, tse)
tse2
## class: TreeSummarizedExperiment
## dim: 19216 52
## metadata(0):
## assays(2): counts random
## rownames(19216): 549322 522457 ... 200359 271582
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(52): CL3 CC1 ... Even2 Even3
## colData names(10): X.SampleID Primer ... var1 var2
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (19216 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
However, the aforementioned functions assume that the rows align correctly when combining columns, and vice versa. In practice, this is often not the case; for example, samples may have different feature sets. In such situations, using a merging approach is the appropriate method.
7.3 Merge data
The mia package provides the mergeSEs() function, which merges multiple SummarizedExperiment objects. For example, it can combine multiple TreeSE objects that each contain a single sample.
mergeSEs() works much like standard joining operations. It combines rows and columns and allows you to specify the merging method.
# Take subsets for demonstration purposes
tse1 <- tse[, 1]
tse2 <- tse[, 2]
tse3 <- tse[, 3]
tse4 <- tse[1:100, 4]
# An inner join keeps only the rows shared by all objects. mergeSEs()
# always preserves all samples.
tse <- mergeSEs(list(tse1, tse2, tse3, tse4), join = "inner")
tse
## class: TreeSummarizedExperiment
## dim: 100 4
## metadata(0):
## assays(1): counts
## rownames(100): 239672 243675 ... 104332 159421
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(4): CC1 CL3 M31Fcsw SV1
## colData names(10): X.SampleID Primer ... var1 var2
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (100 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
# Left join preserves all rows of the 1st object
tse <- mergeSEs(tse1, tse4, missing.values = 0, join = "left")
tse
## class: TreeSummarizedExperiment
## dim: 19216 2
## metadata(0):
## assays(1): counts
## rownames(19216): 239672 243675 ... 239967 254851
## rowData names(7): Kingdom Phylum ... Genus Species
## colnames(2): CL3 M31Fcsw
## colData names(10): X.SampleID Primer ... var1 var2
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (19216 rows)
## rowTree: 1 phylo tree(s) (19216 leaves)
## colLinks: NULL
## colTree: NULL
7.4 Melting data
For several custom analysis and visualization packages, such as those from tidyverse, the SE data can be converted to a long data.frame format with meltSE().
library(knitr)
# Melt SE object
molten_tse <- meltSE(
tse,
add.row = TRUE,
add.col = TRUE,
assay.type = "counts")
molten_tse |> head() |> kable()
FeatureID | SampleID | counts | Kingdom | Phylum | Class | Order | Family | Genus | Species | X.SampleID | Primer | Final_Barcode | Barcode_truncated_plus_T | Barcode_full_length | SampleType | Description | NewVariable | var1 | var2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
239672 | CL3 | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | CL3 | ILBC_01 | AACGCA | TGCGTT | CTAGCGTGCGT | Soil | Calhoun South Carolina Pine soil, pH 4.9 modified description | 0.6300 | 0.0010 | 0.6086 |
239672 | M31Fcsw | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | M31Fcsw | ILBC_04 | AAGAGA | TCTCTT | TCGACATCTCT | Feces | M3, Day 1, fecal swab, whole body study modified description | 0.5745 | 0.9621 | 0.6618 |
243675 | CL3 | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | CL3 | ILBC_01 | AACGCA | TGCGTT | CTAGCGTGCGT | Soil | Calhoun South Carolina Pine soil, pH 4.9 modified description | 0.6300 | 0.0010 | 0.6086 |
243675 | M31Fcsw | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | M31Fcsw | ILBC_04 | AAGAGA | TCTCTT | TCGACATCTCT | Feces | M3, Day 1, fecal swab, whole body study modified description | 0.5745 | 0.9621 | 0.6618 |
444679 | CL3 | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | CL3 | ILBC_01 | AACGCA | TGCGTT | CTAGCGTGCGT | Soil | Calhoun South Carolina Pine soil, pH 4.9 modified description | 0.6300 | 0.0010 | 0.6086 |
444679 | M31Fcsw | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | M31Fcsw | ILBC_04 | AAGAGA | TCTCTT | TCGACATCTCT | Feces | M3, Day 1, fecal swab, whole body study modified description | 0.5745 | 0.9621 | 0.6618 |
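Once the data is in long format, standard data.frame tools apply directly. The sketch below assumes GlobalPatterns as tse, subset to its first 100 features to keep the example small, and uses base R's aggregate() to compute total counts per sample from the FeatureID, SampleID, and counts columns shown above. A tidyverse pipeline would work equally well; base R is used here only to avoid extra dependencies.

```r
library(mia)

# Assumption: GlobalPatterns (shipped with mia) stands in for the chapter's tse;
# only the first 100 features are kept to make the example fast
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns[1:100, ]

# Melt the counts assay into long format (one row per feature-sample pair)
molten_tse <- meltSE(tse, assay.type = "counts")

# Sum the counts per sample from the long table
totals <- aggregate(counts ~ SampleID, data = molten_tse, FUN = sum)
head(totals)
```

The per-sample totals should agree with colSums(assay(tse, "counts")), which gives a quick check that the melt preserved the data.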