3  Data Manipulation

3.1 Tidying and subsetting

3.1.1 Tidy data

For several custom analysis and visualization packages, such as those from tidyverse, the SE data can be converted to a long data.frame format with meltAssay.

library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, MARGIN = "samples", method="relabundance")
molten_tse <- mia::meltAssay(tse,
                        add_row_data = TRUE,
                        add_col_data = TRUE,
                        assay.type = "relabundance")
molten_tse
##  # A tibble: 499,616 × 17
##    FeatureID SampleID relabundance Kingdom Phylum        Class        Order
##    <fct>     <fct>           <dbl> <chr>   <chr>         <chr>        <chr>
##  1 549322    CL3                 0 Archaea Crenarchaeota Thermoprotei <NA> 
##  2 549322    CC1                 0 Archaea Crenarchaeota Thermoprotei <NA> 
##  3 549322    SV1                 0 Archaea Crenarchaeota Thermoprotei <NA> 
##  4 549322    M31Fcsw             0 Archaea Crenarchaeota Thermoprotei <NA> 
##  5 549322    M11Fcsw             0 Archaea Crenarchaeota Thermoprotei <NA> 
##  6 549322    M31Plmr             0 Archaea Crenarchaeota Thermoprotei <NA> 
##  # ℹ 499,610 more rows
##  # ℹ 10 more variables: Family <chr>, Genus <chr>, Species <chr>, …

3.1.2 Subsetting

Subsetting data helps to draw the focus of analysis on particular sets of samples and / or features. When dealing with large datasets, the subset of interest can be extracted and investigated separately. This might improve performance and reduce the computational load.

Load:

  • mia
  • dplyr
  • knitr
  • data GlobalPatterns

Let us store GlobalPatterns into tse and check its original number of features (rows) and samples (columns). Note: when subsetting by sample, expect the number of columns to decrease; when subsetting by feature, expect the number of rows to decrease.

# Store data into se and check dimensions
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns
# Show dimensions (features x samples)
dim(tse) 
##  [1] 19216    26

3.1.2.1 Subset by sample (column-wise)

For the sake of demonstration, here we will extract a subset containing only the samples of human origin (feces, skin or tongue), stored as SampleType within colData(tse) and also in tse.

First, we would like to see all the possible values that SampleType can take on and how frequent those are:

# Inspect possible values for SampleType
unique(tse$SampleType)
##  [1] Soil               Feces              Skin              
##  [4] Tongue             Freshwater         Freshwater (creek)
##  [7] Ocean              Sediment (estuary) Mock              
##  9 Levels: Feces Freshwater Freshwater (creek) Mock ... Tongue
# Show the frequency of each value
tse$SampleType %>% table()
. Freq
Feces 4
Freshwater 2
Freshwater (creek) 3
Mock 3
Ocean 3
Sediment (estuary) 3
Skin 3
Soil 3
Tongue 2

Note: after subsetting, expect the number of columns to equal the sum of the frequencies of the samples that you are interested in. For instance, ncols = Feces + Skin + Tongue = 4 + 3 + 2 = 9.

Next, we logical index across the columns of tse (make sure to leave the first index empty to select all rows) and filter for the samples of human origin. For this, we use the information on the samples from the meta data colData(tse).

# Subset by sample
tse_subset_by_sample <- tse[ , tse$SampleType %in% c("Feces", "Skin", "Tongue")]

# Show dimensions
dim(tse_subset_by_sample)
##  [1] 19216     9

As a sanity check, the new object tse_subset_by_sample should have the original number of features (rows) and a number of samples (columns) equal to the sum of the samples of interest (in this case 9).

Several characteristics can be used to subset by sample:

  • origin
  • sampling time
  • sequencing method
  • DNA / RNA barcode
  • cohort

3.1.2.2 Subset by feature (row-wise)

Similarly, here we will extract a subset containing only the features that belong to the phyla Actinobacteria and Chlamydiae, stored as Phylum within rowData(tse). However, subsetting by feature implies a few more obstacles, such as the presence of NA elements and the possible need for agglomeration.

As previously, we would first like to see all the possible values that Phylum can take on and how frequent those are:

# Inspect possible values for phylum
unique(rowData(tse)$Phylum)
##   [1] "Crenarchaeota"    "Euryarchaeota"    "Actinobacteria"  
##   [4] "Spirochaetes"     "MVP-15"           "Proteobacteria"  
##   [7] "SBR1093"          "Fusobacteria"     "Tenericutes"     
##  [10] "ZB3"              "Cyanobacteria"    "GOUTA4"          
##  [13] "TG3"              "Chlorobi"         "Bacteroidetes"   
##  [16] "Caldithrix"       "KSB1"             "SAR406"          
##  [19] "LCP-89"           "Thermi"           "Gemmatimonadetes"
##  [22] "Fibrobacteres"    "GN06"             "AC1"             
##  [25] "TM6"              "OP8"              "Elusimicrobia"   
##  [28] "NC10"             "SPAM"             NA                
##  [31] "Acidobacteria"    "CCM11b"           "Nitrospirae"     
##  [34] "NKB19"            "BRC1"             "Hyd24-12"        
##  [37] "WS3"              "PAUC34f"          "GN04"            
##  [40] "GN12"             "Verrucomicrobia"  "Lentisphaerae"   
##  [43] "LD1"              "Chlamydiae"       "OP3"             
##  [46] "Planctomycetes"   "Firmicutes"       "OP9"             
##  [49] "WPS-2"            "Armatimonadetes"  "SC3"             
##  [52] "TM7"              "GN02"             "SM2F11"          
##  [55] "ABY1_OD1"         "ZB2"              "OP11"            
##  [58] "Chloroflexi"      "SC4"              "WS1"             
##  [61] "GAL15"            "AD3"              "WS2"             
##  [64] "Caldiserica"      "Thermotogae"      "Synergistetes"   
##  [67] "SR1"
# Show the frequency of each value
rowData(tse)$Phylum %>% table()
. Freq
ABY1_OD1 7
AC1 1
AD3 9
Acidobacteria 1021
Actinobacteria 1631
Armatimonadetes 61
BRC1 13
Bacteroidetes 2382
CCM11b 2
Caldiserica 3
Caldithrix 10
Chlamydiae 21
Chlorobi 64
Chloroflexi 437
Crenarchaeota 106
Cyanobacteria 393
Elusimicrobia 31
Euryarchaeota 102
Fibrobacteres 7
Firmicutes 4356
Fusobacteria 37
GAL15 2
GN02 8
GN04 7
GN06 2
GN12 1
GOUTA4 11
Gemmatimonadetes 191
Hyd24-12 4
KSB1 6
LCP-89 2
LD1 2
Lentisphaerae 21
MVP-15 5
NC10 9
NKB19 16
Nitrospirae 74
OP11 6
OP3 30
OP8 27
OP9 4
PAUC34f 3
Planctomycetes 638
Proteobacteria 6416
SAR406 21
SBR1093 9
SC3 8
SC4 8
SM2F11 5
SPAM 22
SR1 5
Spirochaetes 124
Synergistetes 7
TG3 5
TM6 27
TM7 32
Tenericutes 143
Thermi 46
Thermotogae 1
Verrucomicrobia 470
WPS-2 20
WS1 5
WS2 2
WS3 70
ZB2 2
ZB3 2

Note: after subsetting, expect the number of columns to equal the sum of the frequencies of the feature(s) that you are interested in. For instance, nrows = Actinobacteria + Chlamydiae = 1631 + 21 = 1652.

Depending on your research question, you might or might not need to agglomerate the data in the first place: if you want to find the abundance of each and every feature that belongs to Actinobacteria and Chlamydiae, agglomeration is not needed; if you want to find the total abundance of all features that belong to Actinobacteria or Chlamydiae, agglomeration is recommended.

3.1.2.2.1 Non-agglomerated data

Next, we logical index across the rows of tse (make sure to leave the second index empty to select all columns) and filter for the features that fall in either Actinobacteria or Chlamydiae group. For this, we use the information on the samples from the metadata rowData(tse).

The first term with the %in% operator includes all the features of interest, whereas the second term after the AND operator & filters out all features that have an NA in place of the phylum variable.

# Subset by feature
tse_subset_by_feature <- tse[rowData(tse)$Phylum %in% c("Actinobacteria", "Chlamydiae") & !is.na(rowData(tse)$Phylum), ]

# Show dimensions
dim(tse_subset_by_feature)
##  [1] 1652   26

As a sanity check, the new object, tse_subset_by_feature, should have the original number of samples (columns) and a number of features (rows) equal to the sum of the features of interest (in this case, 1652).

3.1.2.2.2 Agglomerated data

When total abundances of certain phyla are of relevance, the data is initially agglomerated by Phylum. Then, similar steps as in the case of non-agglomerated data are followed.

# Agglomerate by phylum
tse_phylum <- tse %>% mergeFeaturesByRank(rank = "Phylum")

# Subset by feature and remove NAs
tse_phylum_subset_by_feature <- tse_phylum[rowData(tse_phylum)$Phylum %in% c("Actinobacteria", "Chlamydiae") & !is.na(rowData(tse_phylum)$Phylum), ]

# Show dimensions
dim(tse_phylum_subset_by_feature)
##  [1]  2 26

Note: as data was agglomerated, the number of rows should equal the number of phyla used to index (in this case, just 2).

Alternatively:

# Store features of interest into phyla
phyla <- c("Phylum:Actinobacteria", "Phylum:Chlamydiae")
# subset by feature
tse_phylum_subset_by_feature <- tse_phylum[phyla, ]
# Show dimensions
dim(tse_subset_by_feature)
##  [1] 1652   26

The code above returns the non-agglomerated version of the data.

Fewer characteristics can be used to subset by feature:

  • Taxonomic rank
  • Meta-taxonomic group

For subsetting by kingdom, agglomeration does not apply, whereas for the other ranks it can be applied if necessary.

3.1.2.3 Subset by sample and feature

Finally, we can subset data by sample and feature at once. The resulting subset contains all the samples of human origin and all the features of phyla Actinobacteria or Chlamydiae.

# Subset by sample and feature and remove NAs
tse_subset_by_sample_feature <- tse[rowData(tse)$Phylum %in% c("Actinobacteria", "Chlamydiae") & !is.na(rowData(tse)$Phylum), tse$SampleType %in% c("Feces", "Skin", "Tongue")]

# Show dimensions
dim(tse_subset_by_sample_feature)
##  [1] 1652    9

Note: the dimensions of tse_subset_by_sample_feature agree with those of the previous subsets (9 columns filtered by sample and 1652 rows filtered by feature).

If a study was to consider and quantify the presence of Actinobacteria as well as Chlamydiae in different sites of the human body, tse_subset_by_sample_feature might be a suitable subset to start with.

3.1.2.4 Remove empty columns and rows

Sometimes data might contain, e.g., features that are not present in any of the samples. This can occur, for example, after the data subsetting. In certain analyses, we might want to remove those instances.

# Agglomerate data at Genus level 
tse_genus <- mergeFeaturesByRank(tse, rank = "Genus")
# List bacteria that we want to include
genera <- c("Class:Thermoprotei", "Genus:Sulfolobus", "Genus:Sediminicola")
# Subset data
tse_genus_sub <- tse_genus[genera, ]

tse_genus_sub
##  class: TreeSummarizedExperiment 
##  dim: 3 26 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(3): Class:Thermoprotei Genus:Sulfolobus Genus:Sediminicola
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(26): CL3 CC1 ... Even2 Even3
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (3 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
# List total counts of each sample
colSums(assay(tse_genus_sub, "counts"))
##       CL3      CC1      SV1  M31Fcsw  M11Fcsw  M31Plmr  M11Plmr  F21Plmr 
##         1        0        0        1        1        0        4        1 
##   M31Tong  M11Tong LMEpi24M SLEpi20M   AQC1cm   AQC4cm   AQC7cm      NP2 
##         7        3        0        2       64      105      136      222 
##       NP3      NP5  TRRsed1  TRRsed2  TRRsed3     TS28     TS29    Even1 
##      6433     1154        2        2        2        0        0        0 
##     Even2    Even3 
##         2        0

Now we can see that certain samples do not include any bacteria. We can remove those.

# Remove samples that do not contain any bacteria
tse_genus_sub <- tse_genus_sub[ , colSums(assay(tse_genus_sub, "counts")) != 0 ]
tse_genus_sub
##  class: TreeSummarizedExperiment 
##  dim: 3 18 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(3): Class:Thermoprotei Genus:Sulfolobus Genus:Sediminicola
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(18): CL3 M31Fcsw ... TRRsed3 Even2
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (3 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

The same action can also be applied to the features.

# Take only those samples that are collected from feces, skin, or tongue
tse_genus_sub <- tse_genus[ , tse_genus$SampleType %in% c("Feces", "Skin", "Tongue")]

tse_genus_sub
##  class: TreeSummarizedExperiment 
##  dim: 1516 9 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(1516): Class:Thermoprotei Genus:Sulfolobus ...
##    Genus:Coprothermobacter Phylum:SR1
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(9): M31Fcsw M11Fcsw ... TS28 TS29
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (1516 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
# What is the number of bacteria that are not present?
sum(rowSums(assay(tse_genus_sub, "counts")) == 0)
##  [1] 435

We can see that there are bacteria that are not present in these samples we chose. We can remove those bacteria from the data.

# Take only those bacteria that are present
tse_genus_sub <- tse_genus_sub[rowSums(assay(tse_genus_sub, "counts")) > 0, ]

tse_genus_sub
##  class: TreeSummarizedExperiment 
##  dim: 1081 9 
##  metadata(1): agglomerated_by_rank
##  assays(1): counts
##  rownames(1081): Genus:Sulfolobus Order:NRP-J ...
##    Genus:Coprothermobacter Phylum:SR1
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(9): M31Fcsw M11Fcsw ... TS28 TS29
##  colData names(7): X.SampleID Primer ... SampleType Description
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (1081 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

3.1.3 Splitting

You can split the data based on variables by using the functions splitByRanks and splitOn.

splitByRanks splits the data based on taxonomic ranks. Since the elements of the output list share columns, they can be stored into altExp.

altExps(tse) <- splitByRanks(tse)
altExps(tse)
##  List of length 7
##  names(7): Kingdom Phylum Class Order Family Genus Species

If you want to split the data based on another variable than taxonomic rank, use splitOn. It works for row-wise and column-wise splitting.

splitOn(tse, "SampleType")
##  List of length 9
##  names(9): Soil Feces Skin Tongue ... Ocean Sediment (estuary) Mock

3.2 Add or modify data

The information contained by the colData of a TreeSE can be modified by accessing the desired variables.

# modify the Description entries
colData(tse)$Description <- paste(colData(tse)$Description, "modified description")

# view modified variable
head(tse$Description)
##  [1] "Calhoun South Carolina Pine soil, pH 4.9 modified description"  
##  [2] "Cedar Creek Minnesota, grassland, pH 6.1 modified description"  
##  [3] "Sevilleta new Mexico, desert scrub, pH 8.3 modified description"
##  [4] "M3, Day 1, fecal swab, whole body study modified description"   
##  [5] "M1, Day 1, fecal swab, whole body study  modified description"  
##  [6] "M3, Day 1, right palm, whole body study modified description"

New information can also be added to the experiment by creating a new variable.

# simulate new data
new_data <- runif(ncol(tse))

# store new data as new variable in colData
colData(tse)$NewVariable <- new_data

# view new variable
head(tse$NewVariable)
##  [1] 0.4025 0.1458 0.1353 0.1317 0.1748 0.0327

3.3 Merge data

mia package has mergeSEs function that merges multiple SummarizedExperiment objects. For example, it is possible to combine multiple TreeSE objects which each includes one sample.

mergeSEs works much like standard joining operations. It combines rows and columns and allows you to specify the merging method.

# Take subsets for demonstration purposes
tse1 <- tse[, 1]
tse2 <- tse[, 2]
tse3 <- tse[, 3]
tse4 <- tse[1:100, 4]
# With inner join, we want to include all shared rows. When using mergeSEs function
# all samples are always preserved.
tse <- mergeSEs(list(tse1, tse2, tse3, tse4), join = "inner")
tse
##  class: TreeSummarizedExperiment 
##  dim: 100 4 
##  metadata(0):
##  assays(1): counts
##  rownames(100): 239672 243675 ... 104332 159421
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(4): CC1 CL3 M31Fcsw SV1
##  colData names(8): X.SampleID Primer ... Description NewVariable
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (100 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL
# Left join preserves all rows of the 1st object
tse <- mergeSEs(tse1, tse4, missing_values = 0, join = "left")
tse
##  class: TreeSummarizedExperiment 
##  dim: 19216 2 
##  metadata(0):
##  assays(1): counts
##  rownames(19216): 239672 243675 ... 239967 254851
##  rowData names(7): Kingdom Phylum ... Genus Species
##  colnames(2): CL3 M31Fcsw
##  colData names(8): X.SampleID Primer ... Description NewVariable
##  reducedDimNames(0):
##  mainExpName: NULL
##  altExpNames(0):
##  rowLinks: a LinkDataFrame (19216 rows)
##  rowTree: 1 phylo tree(s) (19216 leaves)
##  colLinks: NULL
##  colTree: NULL

3.3.1 Additional functions

Back to top