7  Data wrangling

This chapter introduces several essential techniques for preparing data for analysis: splitting data, modifying data, and converting data to a data.frame. It also explains how to merge multiple SummarizedExperiment objects when needed. For a basic understanding of TreeSE, please refer to Chapter 3.

7.1 Splitting

You can split the data based on variables by using the functions agglomerateByRanks() and splitOn(). The former is detailed in Chapter 9.

If you want to split the data based on a variable other than taxonomic rank, use splitOn(). It works for both row-wise and column-wise splitting. Splitting the data may be useful, for example, if you want to analyze data from different cohorts separately.

library(mia)
data("GlobalPatterns")
tse <- GlobalPatterns

splitOn(tse, "SampleType")
##  List of length 9
##  names(9): Soil Feces Skin Tongue ... Ocean Sediment (estuary) Mock
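The result is a list-like object that holds one TreeSE per level of the splitting variable. As a minimal sketch (the names tse_list, soil and n_samples are illustrative, not part of mia), individual subsets can be picked by name and inspected separately:

```r
library(mia)
data("GlobalPatterns")
tse <- GlobalPatterns

# Split column-wise: one TreeSE per sample type
tse_list <- splitOn(tse, "SampleType")

# Pick a single subset by name
soil <- tse_list[["Soil"]]
dim(soil)

# Number of samples in each subset
n_samples <- vapply(as.list(tse_list), ncol, integer(1))
n_samples
```

Each element is a regular TreeSE, so the usual accessors and analysis functions apply to it directly.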

7.2 Add or modify variables

The information in the colData of a TreeSE can be extended or modified by accessing the desired variables. Adding new variables or updating existing ones helps ensure that all relevant metadata is available for subsequent analyses.

# modify the Description entries
colData(tse)$Description <- paste(
    colData(tse)$Description, "modified description")

# view modified variable
tse$Description |> head()
##  [1] "Calhoun South Carolina Pine soil, pH 4.9 modified description"  
##  [2] "Cedar Creek Minnesota, grassland, pH 6.1 modified description"  
##  [3] "Sevilleta new Mexico, desert scrub, pH 8.3 modified description"
##  [4] "M3, Day 1, fecal swab, whole body study modified description"   
##  [5] "M1, Day 1, fecal swab, whole body study  modified description"  
##  [6] "M3, Day 1, right palm, whole body study modified description"

New information can be added to the experiment by creating a new variable.

# simulate new data
new_data <- runif(ncol(tse))

# store new data as new variable in colData
colData(tse)$NewVariable <- new_data

# view new variable
tse$NewVariable |> head()
##  [1] 0.6300 0.3415 0.7651 0.5745 0.0632 0.8530

Alternatively, you can add a whole table by merging it with the existing colData.

# simulate new data
new_data <- data.frame(var1 = runif(ncol(tse)), var2 = runif(ncol(tse)))
rownames(new_data) <- colnames(tse)

# Combine existing data with new data
colData(tse) <- cbind(colData(tse), new_data)
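Note that cbind() combines the two tables by position rather than by row name, so the new table must list the samples in the same order as the TreeSE. The following is a minimal sketch of a defensive approach, assuming the new table carries sample names as row names; the table new_data is simulated and deliberately shuffled for illustration:

```r
library(mia)
data("GlobalPatterns")
tse <- GlobalPatterns

# Hypothetical sample table whose rows are in a different order than the TreeSE
set.seed(1)
new_data <- data.frame(var1 = runif(ncol(tse)), row.names = colnames(tse))
new_data <- new_data[sample(nrow(new_data)), , drop = FALSE]

# Check that every sample is present, then reorder rows to match the TreeSE
stopifnot(setequal(rownames(new_data), colnames(tse)))
new_data <- new_data[colnames(tse), , drop = FALSE]

# Now the rows line up and the tables can be combined safely
colData(tse) <- cbind(colData(tse), new_data)
```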

Similar steps can also be applied to rowData. If you have an assay whose rows and columns align with the existing ones, you can easily add the assay to the TreeSE object.

Here we add an assay that contains random numbers, but in real life these steps might come in handy after you have transformed the data with a custom transformation that is not available in mia.

# Create a matrix with random values
mat <- rnorm(ncol(tse)*nrow(tse), 0, 1)
mat <- matrix(mat, ncol = ncol(tse), nrow = nrow(tse))
# Add matrix to tse
assay(tse, "random", withDimnames = FALSE) <- mat

assayNames(tse)
##  [1] "counts" "random"

We can see that the TreeSE object now has an additional assay called “random”. When adding new samples or features to your existing dataset, you can use cbind() to add columns (new samples) or rbind() to add rows (new features).

tse2 <- cbind(tse, tse)
tse2
##  class: TreeSummarizedExperiment 
##  dim: 19216 52 
##  metadata(0):
##  assays(2): counts random
##  rownames(19216): 549322 522457 ... 200359 271582
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(52): CL3 CC1 ... Even2 Even3
##  colData names(10): X.SampleID Primer ... var1 var2
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (19216 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
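For the row-wise case, here is a minimal sketch that cuts the features of the example data into two illustrative blocks (tse_a and tse_b are hypothetical names) and stacks them back together with rbind():

```r
library(mia)
data("GlobalPatterns")
tse <- GlobalPatterns

# Two objects with different features but identical samples
tse_a <- tse[1:50, ]
tse_b <- tse[51:100, ]

# rbind() stacks features; the samples must match between the objects
tse_rows <- rbind(tse_a, tse_b)
dim(tse_rows)
```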

However, these functions assume that the rows align correctly when combining columns, and vice versa. In practice, this is often not the case; for example, samples may have different feature sets. In such situations, merging is the appropriate approach.

7.3 Merge data

The mia package provides the mergeSEs() function, which merges multiple SummarizedExperiment objects. For example, it is possible to combine multiple TreeSE objects, each of which contains a single sample.

mergeSEs() works much like standard joining operations. It combines rows and columns and allows you to specify the merging method.

# Take subsets for demonstration purposes
tse1 <- tse[, 1]
tse2 <- tse[, 2]
tse3 <- tse[, 3]
tse4 <- tse[1:100, 4]
# With an inner join, only the rows shared by all objects are kept.
# mergeSEs() always preserves all samples.
tse <- mergeSEs(list(tse1, tse2, tse3, tse4), join = "inner")
tse
##  class: TreeSummarizedExperiment 
##  dim: 100 4 
##  metadata(0):
##  assays(1): counts
##  rownames(100): 239672 243675 ... 104332 159421
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(4): CC1 CL3 M31Fcsw SV1
##  colData names(10): X.SampleID Primer ... var1 var2
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (100 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

# Left join preserves all rows of the 1st object
tse <- mergeSEs(tse1, tse4, missing.values = 0, join = "left")
tse
##  class: TreeSummarizedExperiment 
##  dim: 19216 2 
##  metadata(0):
##  assays(1): counts
##  rownames(19216): 239672 243675 ... 239967 254851
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(2): CL3 M31Fcsw
##  colData names(10): X.SampleID Primer ... var1 var2
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (19216 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
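The remaining join types follow the same pattern. The sketch below assumes that the join argument also accepts "full" and "right", as described in the mia documentation for mergeSEs(); the object names tse_full and tse_right are illustrative:

```r
library(mia)
data("GlobalPatterns")
tse <- GlobalPatterns

# Subsets for demonstration purposes
tse1 <- tse[, 1]
tse4 <- tse[1:100, 4]

# Full join keeps every row found in any object; gaps are filled with 0
tse_full <- mergeSEs(list(tse1, tse4), missing.values = 0, join = "full")
dim(tse_full)

# Right join keeps only the rows of the second object
tse_right <- mergeSEs(tse1, tse4, missing.values = 0, join = "right")
dim(tse_right)
```

Setting missing.values to 0 treats features that are absent from an object as unobserved. Whether that is appropriate depends on the data; leaving the gaps as NA may be safer when absence cannot be assumed.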

7.4 Melting data

Several custom analysis and visualization packages, such as those from the tidyverse, expect data in a long format. The SE data can be converted to a long data.frame with meltSE().

library(knitr)

# Melt SE object
molten_tse <- meltSE(
    tse,
    add.row = TRUE,
    add.col = TRUE,
    assay.type = "counts")

molten_tse |> head() |> kable()
| FeatureID | SampleID | counts | Kingdom | Phylum | Class | Order | Family | Genus | Species | X.SampleID | Primer | Final_Barcode | Barcode_truncated_plus_T | Barcode_full_length | SampleType | Description | NewVariable | var1 | var2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 239672 | CL3 | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | CL3 | ILBC_01 | AACGCA | TGCGTT | CTAGCGTGCGT | Soil | Calhoun South Carolina Pine soil, pH 4.9 modified description | 0.6300 | 0.0010 | 0.6086 |
| 239672 | M31Fcsw | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | M31Fcsw | ILBC_04 | AAGAGA | TCTCTT | TCGACATCTCT | Feces | M3, Day 1, fecal swab, whole body study modified description | 0.5745 | 0.9621 | 0.6618 |
| 243675 | CL3 | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | CL3 | ILBC_01 | AACGCA | TGCGTT | CTAGCGTGCGT | Soil | Calhoun South Carolina Pine soil, pH 4.9 modified description | 0.6300 | 0.0010 | 0.6086 |
| 243675 | M31Fcsw | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | M31Fcsw | ILBC_04 | AAGAGA | TCTCTT | TCGACATCTCT | Feces | M3, Day 1, fecal swab, whole body study modified description | 0.5745 | 0.9621 | 0.6618 |
| 444679 | CL3 | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | CL3 | ILBC_01 | AACGCA | TGCGTT | CTAGCGTGCGT | Soil | Calhoun South Carolina Pine soil, pH 4.9 modified description | 0.6300 | 0.0010 | 0.6086 |
| 444679 | M31Fcsw | 0 | Archaea | Crenarchaeota | C2 | B10 | NA | NA | NA | M31Fcsw | ILBC_04 | AAGAGA | TCTCTT | TCGACATCTCT | Feces | M3, Day 1, fecal swab, whole body study modified description | 0.5745 | 0.9621 | 0.6618 |
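Once the data is in long format, it can be passed straight to tidyverse verbs. As a minimal sketch, assuming the dplyr package is installed (the names molten and totals are illustrative), the following sums the counts within each sample type:

```r
library(mia)
library(dplyr)
data("GlobalPatterns")
tse <- GlobalPatterns

# Melt the counts assay and attach the sample metadata
molten <- meltSE(tse, assay.type = "counts", add.col = TRUE)

# Total counts per sample type
totals <- molten |>
    group_by(SampleType) |>
    summarise(total_counts = sum(counts))
totals
```

The long table has one row per feature and sample pair, so it grows quickly; for large datasets it may be worth agglomerating or subsetting before melting.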