22  Multi-omics prediction and classification

This chapter illustrates how we can create a classification model that utilizes multi-omics data. In contrast, Chapter 23 shows classification models that utilize single-omics data.

In multiview data analysis, there are two main types of approaches: early fusion and late fusion. Early fusion combines all datasets into a single representation, which then serves as input for a supervised learning model. In contrast, late fusion builds individual models for each data view and combines their predictions using a second-level model as the final predictor. However, these traditional paradigms treat the data views in isolation and do not allow for interactions or dependencies between them. A more advanced method, called cooperative learning (Ding et al. 2022), also known as intermediate fusion, combines the best of both worlds by encouraging predictions from different data views to align through an agreement parameter (\(\rho\)).
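To make the role of the agreement parameter concrete, the core cooperative learning objective for two data views \(X\) and \(Z\) with response \(y\) can be sketched (omitting the sparsity penalties of the full method) as

\[
\min_{\theta_x,\, \theta_z} \; \frac{1}{2} \lVert y - X\theta_x - Z\theta_z \rVert^2 + \frac{\rho}{2} \lVert X\theta_x - Z\theta_z \rVert^2 .
\]

Setting \(\rho = 0\) recovers early fusion (a single model fit to the concatenated views), while larger values of \(\rho\) increasingly reward predictions that agree across views.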

Ding, Daisy Yi, Shuangning Li, Balasubramanian Narasimhan, and Robert Tibshirani. 2022. “Cooperative Learning for Multiview Analysis.” Proceedings of the National Academy of Sciences 119 (38): e2202113119. https://doi.org/10.1073/pnas.2202113119.

22.1 Import and preprocess data

We use the publicly available Inflammatory Bowel Diseases (IBD) data from the curatedMetagenomicData package (Lloyd-Price et al. 2019), where we aim to predict IBD disease status based on both taxonomic (species abundances) and functional (pathway abundances) profiles.

Lloyd-Price, Jason, Cesar Arze, Ashwin N. Ananthakrishnan, Melanie Schirmer, Julian Avila-Pacheco, Tiffany W. Poon, Elizabeth Andrews, et al. 2019. “Multi-Omics of the Gut Microbial Ecosystem in Inflammatory Bowel Diseases.” Nature 569 (7758): 655–62. https://doi.org/10.1038/s41586-019-1237-9.
library(curatedMetagenomicData)
library(dplyr)
library(mia)

tse1 <- sampleMetadata |>
  filter(study_name == "HMP_2019_ibdmdb") |>
  returnSamples("relative_abundance", rownames = "short")

tse2 <- sampleMetadata |>
  filter(study_name == "HMP_2019_ibdmdb") |>
  returnSamples("pathway_abundance", rownames = "short")

# Create a MAE object
mae <- MultiAssayExperiment(
    ExperimentList(microbiota = tse1, pathway = tse2)
    )
mae
##  A MultiAssayExperiment object of 2 listed
##   experiments with user-defined names and respective classes.
##   Containing an ExperimentList class object of length 2:
##   [1] microbiota: TreeSummarizedExperiment with 579 rows and 1627 columns
##   [2] pathway: SummarizedExperiment with 22113 rows and 1627 columns
##  Functionality:
##   experiments() - obtain the ExperimentList instance
##   colData() - the primary/phenotype DataFrame
##   sampleMap() - the sample coordination DataFrame
##   `$`, `[`, `[[` - extract colData columns, subset, or experiment
##   *Format() - convert into a long or wide DataFrame
##   assays() - convert ExperimentList to a SimpleList of matrices
##   exportClass() - save data to flat files

Let’s first check how many patients are in each group.

table(mae[[1]]$disease) / ncol(mae[[1]])
##  
##      IBD healthy 
##   0.7382  0.2618

The dataset appears to be imbalanced, with nearly three-quarters of the patients having IBD. This imbalance may impact the training process and how the model learns to predict outcomes for underrepresented group members.
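If this imbalance turned out to harm the model, one simple mitigation would be to downsample the majority class before training. The sketch below is purely illustrative and is not applied in the rest of the chapter.

# Illustrative only: downsample the majority class to balance the groups
set.seed(42)
labels <- mae[[1]]$disease
n_min <- min(table(labels))
keep <- unlist(lapply(
    split(colnames(mae[[1]]), labels),
    sample, size = n_min))
table(mae[[1]][, keep]$disease)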

Further data pre-processing is necessary to handle near-zero-variance features in this dataset. A combination of variance and prevalence filtering is applied to the dataset to include only meaningful features.

# Subset by prevalence
mae[[1]] <- subsetByPrevalent(
    mae[[1]], assay.type = "relative_abundance", prevalence = 0.1,
    include.lowest = TRUE)
mae[[2]] <- subsetByPrevalent(
    mae[[2]], assay.type = "pathway_abundance", prevalence = 0.1,
    include.lowest = TRUE)

# Subset those features that have near zero variance
rowData(mae[[1]])[["sd"]] <- rowSds(assay(mae[[1]], "relative_abundance"))
rowData(mae[[2]])[["sd"]] <- rowSds(assay(mae[[2]], "pathway_abundance"))
mae[[1]] <- mae[[1]][ rowData(mae[[1]])[["sd"]] > 0.001, ]
mae[[2]] <- mae[[2]][ rowData(mae[[2]])[["sd"]] > 0.001, ]

# Transform the data with CLR
mae[[1]] <- transformAssay(
  mae[[1]], assay.type = "relative_abundance", method = "clr",
  pseudocount = TRUE)
mae[[2]] <- transformAssay(
  mae[[2]], assay.type = "pathway_abundance", method = "clr",
  pseudocount = TRUE)
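For reference, the centered log-ratio (CLR) transformation applied above maps each sample's compositional vector \(x = (x_1, \ldots, x_D)\) to

\[
\operatorname{clr}(x)_i = \log \frac{x_i}{g(x)}, \qquad g(x) = \Big( \prod_{j=1}^{D} x_j \Big)^{1/D},
\]

where \(g(x)\) is the geometric mean of the sample. The pseudocount added above keeps zero values out of the logarithm.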

Ultimately, 204 features are retained, consisting of 125 pathways and 79 species.
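The retained dimensions can be verified directly:

# Number of features remaining in each experiment after filtering
sapply(experiments(mae), nrow)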

22.2 Fit model

We randomly select 20% of the samples for the validation set, ensuring they are not used in training. This approach provides a more robust estimate of how well the model generalizes to unseen data.

set.seed(377)
colData(mae)[["validation_set"]] <- sample(
    c(TRUE, FALSE), size = ncol(mae[[1]]), replace = TRUE, prob = c(0.2, 0.8))
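A quick check confirms that roughly 20% of the samples were assigned to the validation set:

# Proportion of samples in the validation set
mean(colData(mae)[["validation_set"]])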

Now, we are ready to fit the model using the IntegratedLearner package, which supports early, late, and intermediate fusion methods (Mallick et al. 2023). Under the hood, it leverages the SuperLearner package. The process begins by fitting “base learners” for each layer, with separate models trained for each omic dataset. Then, a “meta learner” integrates the outputs from these individual models to make the final predictions. For “base learners” and “meta learner”, we use random forest and rank loss minimization models, respectively.

Mallick, Himel, Anupreet Porwal, Satabdi Saha, Piyali Basak, Vladimir Svetnik, and Erina Paul. 2023. “An Integrated Bayesian Framework for Multi-Omics Prediction and Classification.” Statistics in Medicine 43 (5): 983–1002. https://doi.org/10.1002/sim.9953.
library(IntegratedLearner)
library(SuperLearner)

# Fit the model
fit <- IntegratedLearnerFromMAE(
  mae,
  # Select experiments and assays
  experiment = names(experiments(mae)),
  assay.type = c("clr", "clr"),
  # Select columns from sample metadata
  outcome.col = "disease",
  valid.col = "validation_set",
  # Options for base and meta models
  base_learner = "SL.randomForest",
  meta_learner = "SL.nnls.auc",
  # The outcome is binary
  family = binomial()
)
##  Running base model for layer  1 ...
##  Running base model for layer  2 ... 
##  Running stacked model...
##  Running concatenated model...
##  Time for model fit : 1.603 minutes 
##  ========================================
##  Model fit for individual layers: SL.randomForest 
##  Model fit for stacked layer: SL.nnls.auc 
##  Model fit for concatenated layer: SL.randomForest 
##  ========================================
##  AUC metric for training data: 
##  Individual layers: 
##  microbiota    pathway 
##       0.971      0.947 
##  ======================
##  Stacked model:0.97 
##  ======================
##  Concatenated model:0.965 
##  ======================
##  ========================================
##  AUC metric for test data: 
##  Individual layers: 
##  microbiota    pathway 
##       0.961      0.929 
##  ======================
##  Stacked model:0.96 
##  ======================
##  Concatenated model:0.949 
##  ======================
##  ========================================
##  Weights for individual layers predictions in IntegratedLearner: 
##  microbiota    pathway 
##       0.949      0.051 
##  ========================================

The output includes the following models:

- Individual models: Trained separately for each omic dataset.
- Stacked model (late fusion): Combines the predictions from each individual model into a single meta-model.
- Concatenated model (early fusion): A single model trained on all features merged into one dataset.
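The per-model predictions are also available from the fitted object; based on the fit$yhat.test field used in the next section, they can be inspected directly:

# Predicted probabilities on the validation set, one column per model
head(fit$yhat.test)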

22.3 Visualize results

We can summarize the model performance using a Receiver Operating Characteristic (ROC) plot.

library(tidyverse)
library(cowplot)

p <- IntegratedLearner:::plot.learner(fit)

Based on the ROC plot shown above, we observe that the AUC is 0.929 when considering only the pathway abundance data in the model, and 0.961 for the model including only the species abundance data. The AUC increases to 0.949 with the early fusion model and reaches 0.96 with the late fusion model. Overall, the integrated classifiers perform on par with or better than the individual layers in distinguishing between IBD patients and healthy controls.

The model’s performance can also be evaluated and summarized using a confusion matrix.

library(caret)

obs <- factor(fit$Y_test)
pred <- ifelse(fit$yhat.test$stacked > 0.5, levels(obs)[[2]], levels(obs)[[1]])
pred <- factor(pred, levels = levels(obs))
conf <- confusionMatrix(data = pred, reference = obs)
conf
##  Confusion Matrix and Statistics
##  
##            Reference
##  Prediction   1   2
##           1 257  22
##           2   2  70
##                                        
##                 Accuracy : 0.932       
##                   95% CI : (0.9, 0.956)
##      No Information Rate : 0.738       
##      P-Value [Acc > NIR] : < 2e-16     
##                                        
##                    Kappa : 0.81        
##                                        
##   Mcnemar's Test P-Value : 0.000105    
##                                        
##              Sensitivity : 0.992       
##              Specificity : 0.761       
##           Pos Pred Value : 0.921       
##           Neg Pred Value : 0.972       
##               Prevalence : 0.738       
##           Detection Rate : 0.732       
##     Detection Prevalence : 0.795       
##        Balanced Accuracy : 0.877       
##                                        
##         'Positive' Class : 1           
##  

The model appears to perform well, with an accuracy of 93.2%. It correctly predicts almost all IBD patients. It is also worth noting that over three-quarters of the controls are classified correctly.
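Individual statistics can also be extracted programmatically from the caret confusionMatrix object:

# Pull headline metrics from the confusion matrix object
conf$overall[["Accuracy"]]
conf$byClass[["Sensitivity"]]
conf$byClass[["Specificity"]]
conf$byClass[["Balanced Accuracy"]]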

Lastly, we may be interested in identifying the features that contribute most to predicting the outcome. To do this, we first extract feature importance scores from the individual models. Next, we scale these importance scores based on the weights assigned to each layer in the stacked model. This process provides us with the overall importance of each feature in the final model.

library(miaViz)

# Get individual models
models <- fit$model_fits$model_layers
# Get importances
importances <- lapply(seq_len(length(models)), function(i){
  # Get importances
  temp <- models[[i]]$importance
  # Scale based on weight in stacked model
  temp <- temp * fit$weights[[i]]
  return(temp)
  })
# Combine the feature importances
importances <- do.call(rbind, importances)

# Plot feature importances
p <- plotLoadings(importances, ncomponents = 1, n = 20, show.color = FALSE)
p

From the plot, we can observe that Alistipes putredinis and Firmicutes bacterium CAG:83 appear to have the greatest predictive power among all the features in determining the outcome. However, no single feature dominates: the predictive power is fairly evenly distributed across the features.

Summary

For more details on modeling with IntegratedLearner, refer to the IntegratedLearner vignette.
