6  Data Wrangling

This chapter covers various basic techniques for preparing data for subsequent analysis. These include subsetting, splitting, and modifying data, as well as converting data to a data.frame through melting and merging multiple SummarizedExperiment objects. See Chapter 3 for basics of TreeSE.

6.1 Subsetting

Subsetting data helps to draw the focus of an analysis on particular sets of samples and / or features. When dealing with large datasets, the subset of interest can be extracted and investigated separately. This might improve performance and reduce the computational load.

Load:

  • mia
  • dplyr
  • knitr
  • data GlobalPatterns

Let us store GlobalPatterns into tse and check its original number of features (rows) and samples (columns). Note: When subsetting by sample, expect the number of columns to decrease. When subsetting by feature, expect the number of rows to decrease.

# Store data into se and check dimensions
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns
# Show dimensions (features x samples)
dim(tse)
##  [1] 19216    26

6.1.1 Subset by sample (column-wise)

For the sake of demonstration, here we will extract a subset containing only the samples of human origin (feces, skin or tongue), stored as SampleType within colData(tse) as well as in tse.

First, we would like to see all the possible values that SampleType can have and how frequent those are:

# Inspect possible values for SampleType
unique(tse$SampleType)
##  [1] Soil               Feces              Skin              
##  [4] Tongue             Freshwater         Freshwater (creek)
##  [7] Ocean              Sediment (estuary) Mock              
##  9 Levels: Feces Freshwater Freshwater (creek) Mock ... Tongue
# Show the frequency of each value
tse$SampleType %>% table()
. Freq
Feces 4
Freshwater 2
Freshwater (creek) 3
Mock 3
Ocean 3
Sediment (estuary) 3
Skin 3
Soil 3
Tongue 2
Note

After subsetting, expect the number of columns to equal the sum of the frequencies of the samples that you are interested in. For instance, ncols = Feces + Skin + Tongue = 4 + 3 + 2 = 9.

Next, we logical index across the columns of tse (make sure to leave the first index empty to select all rows) and filter for the samples of human origin. For this, we use the information on the samples from the meta data colData(tse).

# Subset by sample
tse_sub <- tse[ , tse$SampleType %in% c("Feces", "Skin", "Tongue")]

# Show dimensions
dim(tse_sub)
##  [1] 19216     9

As a sanity check, the new object tse_subset_by_sample should have the original number of features (rows) and a number of samples (columns) equal to the sum of the samples of interest (in this case 9).

Several criteria can be used to subset by sample:

  • origin
  • sampling time
  • sequencing method
  • DNA / RNA barcode
  • cohort

6.1.2 Subset by feature (row-wise)

Similarly, here we will extract a subset containing only the features that belong to the phyla Actinobacteria and Chlamydiae, stored as Phylum within rowData(tse). However, subsetting by feature implies a few more obstacles, such as the presence of NA elements and the possible need for agglomeration.

As previously, we would first like to see all the possible values that Phylum can have and how frequent those are:

# Inspect possible values for phylum
# and show the first five values
cat(paste(unique(rowData(tse)$Phylum)[1:5], collapse = "\n"))
##  Crenarchaeota
##  Euryarchaeota
##  Actinobacteria
##  Spirochaetes
##  MVP-15
# Show the frequency of each value
rowData(tse)$Phylum %>% table() %>% head() %>%
  kable() %>%
  kableExtra::kable_styling("striped", latex_options="scale_down") %>%
  kableExtra::scroll_box(width = "100%")
. Freq
ABY1_OD1 7
AC1 1
AD3 9
Acidobacteria 1021
Actinobacteria 1631
Armatimonadetes 61
Note

After subsetting, expect the number of columns to equal the sum of the frequencies of the feature(s) that you are interested in. For instance, nrows = Actinobacteria + Chlamydiae = 1631 + 21 = 1652.

Depending on your research question, you might or might not need to agglomerate the data in the first place: if you want to find the abundance of each and every feature that belongs to Actinobacteria and Chlamydiae, agglomeration is not needed. However, if you want to find the total abundance of all features that belong to Actinobacteria or Chlamydiae, agglomeration is recommended (see Chapter 9 for details).

6.1.3 Select features based on taxonomy rank

Next, we logical index across the rows of tse (make sure to leave the second index empty to select all columns) and filter for the features that fall in either Actinobacteria or Chlamydiae group. For this, we use the information on the samples from the metadata rowData(tse).

The first term with the %in% operator includes all the features of interest, whereas the second term after the AND operator & filters out all features that have an NA in place of the phylum variable.

# Subset by feature
tse_sub <- tse[
    rowData(tse)$Phylum %in% c("Actinobacteria","Chlamydiae"), ]

# Show dimensions
dim(tse_sub)
##  [1] 1652   26

As a sanity check, the new object, tse_subset_by_feature, should have the original number of samples (columns) and a number of features (rows) equal to the sum of the features of interest (in this case, 1652).

6.1.4 Subset by sample and feature

Finally, we can subset data by sample and feature at once. The resulting subset contains all the samples of human origin and all the features of phyla Actinobacteria or Chlamydiae.

# Subset by sample and feature and remove NAs
tse_sub <- tse[
  rowData(tse)$Phylum %in% c("Actinobacteria", "Chlamydiae") &
      !is.na(rowData(tse)$Phylum),
  tse$SampleType %in% c("Feces", "Skin", "Tongue")]

# Show dimensions
dim(tse_sub)
##  [1] 1652    9
Note

The dimensions of tse_sub are on par with those of the previous subsets (9 columns filtered by sample and 1652 rows filtered by feature).

If a study was to consider and quantify the presence of Actinobacteria as well as Chlamydiae in different sites of the human body, tse_sub might be a suitable subset to start with.

6.1.5 Remove empty columns and rows

Sometimes data might contain, e.g., features that are not present in any of the samples. This can occur, for example, after data subsetting. In certain analyses, we might want to remove those instances.

# Agglomerate data at Genus level
tse_genus <- agglomerateByRank(tse, rank = "Genus", na.rm = FALSE)
# List bacteria that we want to include
genera <- c(
    "Class:Thermoprotei",
    "Genus:Sulfolobus",
    "Genus:Sediminicola")
# Subset data
tse_sub <- tse_genus[genera, ]

tse_sub
##  class: TreeSummarizedExperiment 
##  dim: 3 26 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(3): Class:Thermoprotei Genus:Sulfolobus Genus:Sediminicola
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (3 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
# List total counts of each sample
colSums(assay(tse_sub, "counts"))[1:5]
##      CL3     CC1     SV1 M31Fcsw M11Fcsw 
##        1       0       0       1       1

Now we can see that certain samples do not include any bacteria. We can remove those.

# Remove samples that do not contain any bacteria
tse_genus_sub <- tse_sub[
  , colSums(assay(tse_sub, "counts")) != 0]
tse_sub
##  class: TreeSummarizedExperiment 
##  dim: 3 26 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(3): Class:Thermoprotei Genus:Sulfolobus Genus:Sediminicola
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (3 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

The same action can also be applied to the features.

# Take only those samples that are collected from feces, skin, or tongue
tse_sub <- tse_genus[
    , tse_genus$SampleType %in% c("Feces", "Skin", "Tongue")]

tse_sub
##  class: TreeSummarizedExperiment 
##  dim: 1516 9 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(1516): Class:0319-6G9 Class:09D2Y74 ... Phylum:ZB2
##    Phylum:ZB3
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(9): M31Fcsw M11Fcsw ... TS28 TS29
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (1516 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
# What is the number of bacteria that are not present?
sum(rowSums(assay(tse_sub, "counts")) == 0)
##  [1] 435

We can see that there are bacteria that are not present in the samples we chose. We can remove those bacteria from the data.

# Take only those bacteria that are present
tse_sub <- tse_sub[
  rowSums(assay(tse_sub, "counts")) > 0, ]

tse_sub
##  class: TreeSummarizedExperiment 
##  dim: 1081 9 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(1081): Class:0319-6G9 Class:5B-18 ... Phylum:SR1
##    Phylum:WPS-2
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(9): M31Fcsw M11Fcsw ... TS28 TS29
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (1081 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

6.2 Splitting

You can split the data based on variables by using the functions agglomerateByRanks() and splitOn().

agglomerateByRanks() splits the data based on taxonomic ranks. Since the elements of the output list share columns, they can be stored into altExp by setting as.list = FALSE.

agglomerateByRanks(tse, as.list = TRUE)
##  List of length 7
##  names(7): Kingdom Phylum Class Order Family Genus Species

If you want to split the data based on another variable than taxonomic rank, use splitOn(). It works for row-wise and column-wise splitting.

splitOn(tse, "SampleType")
##  List of length 9
##  names(9): Soil Feces Skin Tongue ... Ocean Sediment (estuary) Mock

6.3 Add or modify data

The information contained by the colData of a TreeSE can be modified by accessing the desired variables.

# modify the Description entries
colData(tse)$Description <- paste(
    colData(tse)$Description, "modified description")

# view modified variable
head(tse$Description)
##  [1] "Calhoun South Carolina Pine soil, pH 4.9 modified description"  
##  [2] "Cedar Creek Minnesota, grassland, pH 6.1 modified description"  
##  [3] "Sevilleta new Mexico, desert scrub, pH 8.3 modified description"
##  [4] "M3, Day 1, fecal swab, whole body study modified description"   
##  [5] "M1, Day 1, fecal swab, whole body study  modified description"  
##  [6] "M3, Day 1, right palm, whole body study modified description"

New information can also be added to the experiment by creating a new variable.

# simulate new data
new_data <- runif(ncol(tse))

# store new data as new variable in colData
colData(tse)$NewVariable <- new_data

# view new variable
head(tse$NewVariable)
##  [1] 0.12835 0.04433 0.65614 0.29787 0.23109 0.43326

6.4 Melting data

For several custom analysis and visualization packages, such as those from tidyverse, the SE data can be converted to a long data.frame format with meltSE().

library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, MARGIN = "cols", method="relabundance")
# Melt SE object
molten_tse <- mia::meltSE(
    tse,
    add.row = TRUE,
    add.col = TRUE,
    assay.type = "relabundance")
# Show the first five columns
molten_tse[, 1:5]
##  # A tibble: 499,616 × 5
##    FeatureID SampleID relabundance Kingdom Phylum       
##    <fct>     <fct>           <dbl> <chr>   <chr>        
##  1 549322    CL3                 0 Archaea Crenarchaeota
##  2 549322    CC1                 0 Archaea Crenarchaeota
##  3 549322    SV1                 0 Archaea Crenarchaeota
##  4 549322    M31Fcsw             0 Archaea Crenarchaeota
##  5 549322    M11Fcsw             0 Archaea Crenarchaeota
##  6 549322    M31Plmr             0 Archaea Crenarchaeota
##  # ℹ 499,610 more rows

6.5 Merge data

mia package has mergeSEs() function that merges multiple SummarizedExperiment objects. For example, it is possible to combine multiple TreeSE objects which each includes one sample.

mergeSEs() works much like standard joining operations. It combines rows and columns and allows you to specify the merging method.

# Take subsets for demonstration purposes
tse1 <- tse[, 1]
tse2 <- tse[, 2]
tse3 <- tse[, 3]
tse4 <- tse[1:100, 4]
# With inner join, we want to include all shared rows. When using mergeSEs
# function all samples are always preserved.
tse <- mergeSEs(list(tse1, tse2, tse3, tse4), join = "inner")
tse
##  class: TreeSummarizedExperiment 
##  dim: 100 4 
##  metadata(0):
##  assays(1): counts
##  rownames(100): 239672 243675 ... 104332 159421
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(4): CC1 CL3 M31Fcsw SV1
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (100 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
# Left join preserves all rows of the 1st object
tse <- mergeSEs(tse1, tse4, missing.values = 0, join = "left")
tse
##  class: TreeSummarizedExperiment 
##  dim: 19216 2 
##  metadata(0):
##  assays(1): counts
##  rownames(19216): 239672 243675 ... 239967 254851
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(2): CL3 M31Fcsw
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (19216 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
Back to top