Data Manipulation

Why data manipulation?

Raw data might be uninformative or incompatible with a method. We want to be able to modify, polish, subset, agglomerate and transform it.

Why so complex?

TreeSE containers organise information to improve flexibility and accessibility, which comes with a bit of complexity. To get started, focus on three components: assays, colData and rowData.

Example 1.1: Data Import

We work with microbiome data inside TreeSummarizedExperiment (TreeSE) containers and mia is our toolkit.

# Load the Tengeler2020 dataset (already a TreeSE) and assign it to tse
library(mia)
data("Tengeler2020", package = "mia")
tse <- Tengeler2020

The components of a TreeSE can all be seen at a glance.

# Print TreeSE
tse
class: TreeSummarizedExperiment 
dim: 151 27 
metadata(0):
assays(1): counts
rownames(151): Bacteroides Bacteroides_1 ... Parabacteroides_8
  Unidentified_Lachnospiraceae_14
rowData names(6): Kingdom Phylum ... Family Genus
colnames(27): A110 A12 ... A35 A38
colData names(4): patient_status cohort patient_status_vs_cohort
  sample_name
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (151 rows)
rowTree: 1 phylo tree(s) (151 leaves)
colLinks: NULL
colTree: NULL

Example 1.2: Column data

Columns represent the samples of an experiment.

# Retrieve sample names
head(colnames(tse), 3)
[1] "A110" "A12"  "A15" 

All information about the samples is stored in colData.

# Retrieve sample data
head(colData(tse), 3)
DataFrame with 3 rows and 4 columns
     patient_status      cohort patient_status_vs_cohort sample_name
        <character> <character>              <character> <character>
A110           ADHD    Cohort_1            ADHD_Cohort_1        A110
A12            ADHD    Cohort_1            ADHD_Cohort_1         A12
A15            ADHD    Cohort_1            ADHD_Cohort_1         A15

Individual variables about the samples can be accessed directly.

# Retrieve sample variables
head(tse$patient_status, 3)
[1] "ADHD" "ADHD" "ADHD"
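
Because sample variables are plain vectors, base R summaries work on them directly. The following is a minimal sketch, not part of the original example: it counts samples per patient status and cross-tabulates status against cohort. The object names status_counts and status_by_cohort are introduced here for illustration only, and the first lines reload the data so the snippet runs on its own.

```r
library(mia)
data("Tengeler2020", package = "mia")
tse <- Tengeler2020

# Count samples per level of one sample variable
status_counts <- table(tse$patient_status)
status_counts

# Cross-tabulate two sample variables stored in colData
status_by_cohort <- table(status = tse$patient_status, cohort = tse$cohort)
status_by_cohort
```

A quick table like this helps to spot unbalanced groups before any subsetting or testing.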

Example 1.3: Row data

Rows represent the features of an experiment.

# Retrieve feature names
head(rownames(tse), 3)
[1] "Bacteroides"     "Bacteroides_1"   "Parabacteroides"

All information about the features is stored in rowData.

# Retrieve feature data
head(rowData(tse), 3)
DataFrame with 3 rows and 6 columns
                    Kingdom        Phylum       Class         Order
                <character>   <character> <character>   <character>
Bacteroides        Bacteria Bacteroidetes Bacteroidia Bacteroidales
Bacteroides_1      Bacteria Bacteroidetes Bacteroidia Bacteroidales
Parabacteroides    Bacteria Bacteroidetes Bacteroidia Bacteroidales
                            Family           Genus
                       <character>     <character>
Bacteroides         Bacteroidaceae     Bacteroides
Bacteroides_1       Bacteroidaceae     Bacteroides
Parabacteroides Porphyromonadaceae Parabacteroides

Individual variables about the features can be accessed from rowData.

# Retrieve feature variables
head(rowData(tse)$Genus, 3)
[1] "Bacteroides"     "Bacteroides"     "Parabacteroides"
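
Feature variables can be summarised in the same way as sample variables. As a small sketch (the name features_per_phylum is an assumption made for this illustration), the number of features assigned to each phylum can be counted with base R:

```r
library(mia)
data("Tengeler2020", package = "mia")
tse <- Tengeler2020

# Count how many features (rows) belong to each phylum
features_per_phylum <- table(rowData(tse)$Phylum)

# Show the most represented phyla first
sort(features_per_phylum, decreasing = TRUE)
```

Note that table() silently drops missing values by default, so the counts add up to the number of rows only when every feature has a phylum assigned.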

Example 1.4: Assays

The assays of an experiment (counts, relative abundance, etc.) can be found in assays.

assays(tse)
List of length 1
names(1): counts

assayNames returns only their names.

assayNames(tse)
[1] "counts"

An individual assay can be retrieved with assay.

assay(tse, "counts")[seq(6), seq(6)]
                 A110   A12  A15  A19  A21  A23
Bacteroides     17722 11630    0 8806 1740 1791
Bacteroides_1   12052     0 2679 2776  540  229
Parabacteroides     0   970    0  549  145    0
Bacteroides_2       0  1911    0 5497  659    0
Akkermansia      1143  1891 1212  584   84  700
Bacteroides_3       0  6498    0 4455  610    0
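
An assay retrieved this way is an ordinary matrix with features in rows and samples in columns, so matrix functions apply to it. The sketch below is an illustrative addition: it computes the total count per sample (often called library size) and the number of features detected per sample. The names lib_size and n_detected are hypothetical and not part of mia.

```r
library(mia)
data("Tengeler2020", package = "mia")
tse <- Tengeler2020

# Extract the count matrix (features x samples)
counts <- assay(tse, "counts")

# Total counts per sample
lib_size <- colSums(counts)
summary(lib_size)

# Number of features with a non-zero count in each sample
n_detected <- colSums(counts > 0)
head(n_detected, 3)
```

Large differences in library size between samples are one reason to transform counts before comparing samples.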

Exercise 1

Extra:

Raw data can be retrieved here.

Example 2.1: Subsetting

We can subset features or samples of a TreeSE, but first we need to pick a variable.

# Check levels of a sample variable
unique(tse$patient_status)
[1] "ADHD"    "Control"

To subset samples, we filter columns with a conditional.

# Subset by a sample variable
subcol_tse <- tse[ , tse$patient_status == "ADHD"]
dim(subcol_tse)
[1] 151  13

We now want to subset by our favourite Phylum.

# Check levels of a feature variable
unique(rowData(tse)$Phylum)
[1] "Bacteroidetes"   "Verrucomicrobia" "Proteobacteria"  "Firmicutes"     
[5] "Cyanobacteria"  

To subset features, we filter rows with a conditional.

# Subset by a feature variable
subrow_tse <- tse[rowData(tse)$Phylum == "Firmicutes", ]
dim(subrow_tse)
[1] 97 27
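
Row and column conditions can also be combined in a single call. The sketch below is an illustrative addition that keeps Firmicutes features in ADHD samples. It uses %in% instead of ==, because == returns NA when a value is missing whereas %in% returns FALSE, which makes the filter safer on data with incomplete annotations. The names keep_rows, keep_cols and sub_tse are assumptions for this example.

```r
library(mia)
data("Tengeler2020", package = "mia")
tse <- Tengeler2020

# Logical filters for features (rows) and samples (columns)
keep_rows <- rowData(tse)$Phylum %in% "Firmicutes"
keep_cols <- tse$patient_status %in% "ADHD"

# Subset rows and columns at once
sub_tse <- tse[keep_rows, keep_cols]
dim(sub_tse)
```

Given the two subsets shown above, the result should have 97 features and 13 samples. Assays, rowData and colData are all subset together, so they stay in sync.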

Example 2.2: Agglomeration

Agglomeration summarises the assays at a higher taxonomic rank by combining the abundances of related taxa. We can agglomerate by any of the available ranks.

# View rank options
taxonomyRanks(tse)
[1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"  

We agglomerate by Phylum and store the new experiment in the altExp slot.

# Agglomerate by Phylum and store into altExp slot
altExp(tse, "phylum") <- agglomerateByRank(tse, rank = "Phylum")
altExp(tse, "phylum")
class: TreeSummarizedExperiment 
dim: 5 27 
metadata(1): agglomerated_by_rank
assays(1): counts
rownames(5): Bacteroidetes Cyanobacteria Firmicutes Proteobacteria
  Verrucomicrobia
rowData names(6): Kingdom Phylum ... Family Genus
colnames(27): A110 A12 ... A35 A38
colData names(4): patient_status cohort patient_status_vs_cohort
  sample_name
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (5 rows)
rowTree: 1 phylo tree(s) (151 leaves)
colLinks: NULL
colTree: NULL
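
To build intuition for what agglomeration does to the counts, the summing step can be sketched with base R's rowsum(). This is only an illustration of the idea, not how agglomerateByRank is implemented: it ignores the tree, the rowData and any handling of missing ranks. The names phylum and phylum_counts are assumptions for this example.

```r
library(mia)
data("Tengeler2020", package = "mia")

# Count matrix and the phylum label of each feature
counts <- assay(Tengeler2020, "counts")
phylum <- rowData(Tengeler2020)$Phylum

# Sum the rows that share the same phylum
phylum_counts <- rowsum(counts, group = phylum)
dim(phylum_counts)

# Summing within groups leaves the per-sample totals unchanged
all.equal(colSums(phylum_counts), colSums(counts))
```

The result is a plain matrix with one row per phylum, whereas agglomerateByRank returns a full TreeSE with matching rowData and links to the tree.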

Example 2.3: Transformation

Data can be transformed for different reasons. For example, to make samples with different total counts comparable, we can use relative abundance, which divides each count by the total of its sample.

# Transform counts to relative abundance
tse <- transformAssay(tse,
                      assay.type = "counts",
                      method = "relabundance")

# View sample-wise sums
head(colSums(assay(tse, "relabundance")), 3)
A110  A12  A15 
   1    1    1 
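
The arithmetic behind relative abundance is a division of each column by its sum. The sketch below reproduces it in base R on a small made-up matrix (the taxa, samples and values are invented for illustration and do not come from Tengeler2020):

```r
# Toy count matrix: 3 taxa (rows) x 2 samples (columns)
counts <- matrix(c(10, 30, 60,
                    5,  5, 10),
                 nrow = 3,
                 dimnames = list(c("taxon_a", "taxon_b", "taxon_c"),
                                 c("s1", "s2")))

# Divide every column by its total
relab <- sweep(counts, 2, colSums(counts), "/")
relab

# Each sample now sums to 1
colSums(relab)
```

Sample s1 has 100 counts in total and s2 only 20, yet after the transformation taxon_c accounts for 0.6 of s1 and 0.5 of s2, which can be compared directly.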

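Or, to put features on a common scale with mean 0 and standard deviation 1, we can use z-scores: \(Z = \frac{x - \mu}{\sigma}\), where \(\mu\) and \(\sigma\) are the mean and standard deviation of the feature. Standardising rescales the values but does not change the shape of their distribution.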

# Transform relative abundance to z-scores
tse <- transformAssay(tse,
                      assay.type = "relabundance",
                      method = "z",
                      MARGIN = "features")

# View feature-wise standard deviations
head(rowSds(assay(tse, "z")), 3)
    Bacteroides   Bacteroides_1 Parabacteroides 
              1               1               1 
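
The z-score formula can be checked by hand on a small made-up matrix. This base R sketch uses invented values and is not taken from the dataset. It standardises each feature (row) with the sample standard deviation, as sd() does; whether mia uses exactly the same estimator is an assumption here.

```r
# Toy matrix: 2 features (rows) x 3 samples (columns)
x <- matrix(c(1, 10,
              2, 20,
              3, 30),
            nrow = 2,
            dimnames = list(c("feat_a", "feat_b"),
                            c("s1", "s2", "s3")))

# Feature-wise mean and standard deviation
mu <- rowMeans(x)
sigma <- apply(x, 1, sd)

# Subtract the mean and divide by the standard deviation, row by row
z <- (x - mu) / sigma
z

# Every feature now has mean 0 and standard deviation 1
rowMeans(z)
apply(z, 1, sd)
```

Both features end up with the same z-scores even though their raw values differ by a factor of ten, which is the point of standardising. A feature that is constant across samples has a standard deviation of 0, so this hand calculation would return NaN for it.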

Exercise 2

Extra:

Resources