Abstract

Background and objective

   The development of machine learning-based models for predicting severe
   diseases has been a major focus of the scientific community. The
   current study extends Convolutional Neural Networks, a highly
   sophisticated tool, to multidimensional omics data classification
   problems and tests the newly introduced method on publicly available
   transcriptomics and proteomics data.

Methods

   In this study, we introduce Omics-CNN, a Convolutional Neural
   Network-based pipeline that couples Convolutional Neural Networks
   with dimensionality reduction, preprocessing, clustering, and
   explainability techniques, making them suitable for building highly
   accurate and interpretable classification models from high-throughput
   omics data. The developed tool has the potential to classify patients
   according to the expression of genetic and clinical factors and to
   identify features that can act as diagnostic biomarkers. For
   dimensionality reduction, univariate and multivariate techniques were
   explored and compared. After training the model, Gradient-weighted
   Class Activation Mapping (Grad-CAM) analysis was performed to
   determine the most important features in the classification of the
   samples.

Results

   The newly introduced pipeline was applied to one transcriptomics and
   one proteomics dataset to identify diagnostic models and
   biosignatures for Ischemic Stroke (IS) and COVID-19 infection,
   yielding highly accurate biosignatures with accuracies of 96 % and
   95.41 %, respectively. Classification models based on only a small
   subset of attributes provided lower predictive accuracy, but
   identified a compact transcript biosignature (KRT15, VPRBP, TNFRSF4,
   GORASP2) for Ischemic Stroke diagnosis and a compact protein
   biosignature (ADGRB3, VNN2, AGER, CIAPIN1) for COVID-19 infection
   diagnosis.

Conclusions

   Omics-CNN overcame the inherent problems of applying Convolutional
   Neural Networks to train diagnostic models with quantitative omics
   data. It outperformed previous machine learning models developed on
   the same datasets for Ischemic Stroke and COVID-19 infection
   diagnosis and determined the most contributing biomarkers for both
   diseases.

   Keywords: Convolutional neural networks, Transcriptomics, Personalized
   medicine, Covid-19, Ischemic stroke

1. Introduction

   The big data revolution has become noticeable in the field of
   Biology. High-throughput technologies (e.g., microarrays and
   next-generation sequencing) produce large amounts of biological data,
   also known as omics data. Omics include different types of data, such
   as genomics, epigenomics, transcriptomics, and proteomics, and can
   contribute to a comprehensive understanding of biological processes
   and pathways as well as of life-threatening diseases. The combination
   of such information lays the groundwork for precision medicine.
   Precision medicine uses data from different sources, thus offering a
   holistic description of a patient's overall health, expediting disease
   diagnosis, and allowing for the selection of the most suitable
   therapeutic protocol for each patient [1]. The increased complexity of
   omics data in terms of dimensionality and noise impedes their analysis
   with simple statistical tests. As a result, omics data demand more
   sophisticated methods of computational intelligence that can capture
   complex nonlinear and hierarchical associations between the examined
   features.

   In this context, Deep Neural Networks [2] are powerful computing
   tools that exploit the high correlation of the input data in sample
   classification. In recent years, such computing tools have attracted
   interest in the analysis of complex data, including omics data. Some
   of the studies conducted so far involve complicated architectures,
   which either combine different types of Deep Neural Networks, such as
   Recurrent Neural Networks with Convolutional Neural Networks [3], or
   analyze different data, such as clinical or multi-omics data [4].
   There is a clear trend toward incorporating high-throughput omics or
   multi-omics analysis in biomedical research to explain the complex
   relationships between molecular layers. Such powerful tools can be
   widely adapted to various applications of high-dimensional omics data
   and have great potential to facilitate more accurate and personalized
   clinical decision-making, especially in life-threatening diseases,
   such as cancer, cancer metastasis [5], stroke [6], or COVID-19 [7].

   Convolutional Neural Networks (CNNs) have been widely used for
   classification and regression applications on imaging data. Their
   ability to take into account complex relationships between inputs
   makes them well suited to the analysis of omics data, yet only a few
   applications to omics data have been reported due to technical
   limitations. In Ref. [8], the authors developed a model called
   OmicsMapNet for the analysis of high-dimensional omics data as
   2-dimensional images. The most contributory features in the trained
   CNN were confirmed by pathway analysis. Another study [9] proposed a
   CNN approach that incorporates spectral clustering information
   processing to classify lung cancer, with greater performance than
   other machine learning algorithms, such as Support Vector Machines
   (SVM) or Random Forests (RFs). Moreover, as the power of gene
   expression profiles in cancer identification has been proven, another
   analysis [10] generated 2D images by integrating gene expression
   profiles with the protein-protein interaction (PPI) network from
   human samples and analyzed them with a CNN. Another approach applying
   a CNN to non-image data [11], called DeepFeature, successfully
   transformed omics data into a form that is optimal for fitting a CNN
   model and returned sets of the most important genes used internally
   for computing predictions.

   Despite the promising results of these applications, CNN methods are
   not generally applicable to omics data, for two reasons. First, most
   omics datasets contain far more features than samples. Second, and
   more importantly, CNNs require a specific structured organization of
   the inputs, as in the case of images, which is not straightforward
   for omics data. To overcome these limitations, we propose an omics
   classification pipeline based on a 1D CNN. The pipeline uses
   dimensionality reduction methods to alleviate the high dimensionality
   of omics data and clustering techniques to organize the input
   features in a meaningful manner, allowing the CNN to exploit this
   feature organization to improve classification performance. A 1D CNN
   was preferred over a 2D CNN because of the inherent organization of
   omics data as 1D feature vectors.
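
   The idea of organizing features before convolution can be sketched as
   follows. This is a minimal illustration, not the Omics-CNN
   implementation: it assumes average-linkage hierarchical clustering on
   correlation distance as the clustering step, and the helper names
   (order_features_by_clustering, conv1d_valid) and the toy data are
   hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list


def order_features_by_clustering(X):
    """Return a column order that places correlated features side by side."""
    # Cluster the features (columns), using 1 - Pearson correlation as distance.
    tree = linkage(X.T, method="average", metric="correlation")
    # Dendrogram leaf order keeps the members of each cluster contiguous.
    return leaves_list(tree)


def conv1d_valid(x, kernel):
    """1D sliding-window filter without padding (cross-correlation, as in CNN layers)."""
    return np.correlate(x, kernel, mode="valid")


rng = np.random.default_rng(0)
n_samples = 40
# Two hypothetical latent signals, each driving three features.
a = rng.normal(size=n_samples)
b = rng.normal(size=n_samples)


def noise():
    return 0.1 * rng.normal(size=n_samples)


# Columns are deliberately interleaved: a, b, a, b, a, b.
X = np.column_stack([a + noise(), b + noise(), a + noise(),
                     b + noise(), a + noise(), b + noise()])

order = order_features_by_clustering(X)
X_ordered = X[:, order]
# After reordering, a width-3 filter can cover one feature cluster at a time.
feature_map = conv1d_valid(X_ordered[0], np.ones(3) / 3.0)
```

   With the columns reordered, neighboring positions in the 1D input
   carry related features, which is the kind of local structure a
   convolutional filter can exploit.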

   The present study designed and applied a new CNN-based tool for the
   discovery of diagnostic biosignatures and models for Ischemic Stroke
   and COVID-19 using high-dimensional transcriptomics and proteomics
   datasets, respectively. Regarding Ischemic Stroke, an early and
   accurate diagnosis can improve the probability of a positive outcome.
   Many studies have aimed to identify biomarkers that facilitate the
   early diagnosis of acute ischemic stroke (AIS). In a prior study
   [17], the most important genes were identified using differential
   expression analysis. These were tested in a logistic regression model
   and further validated by qRT-PCR. Another study that provided insight
   into the molecular basis of AIS [18] used a machine-learning
   technique known as genetic algorithm k-nearest neighbors (GA/kNN) to
   identify a pattern of gene expression that could optimally
   discriminate between patients and neurologically asymptomatic
   controls. Furthermore, a more recent study [5] used a hybrid genetic
   algorithm–support vector machine learning tool combined with a
   network comparison approach to identify transcription patterns
   characteristic of patients with acute ischemic stroke. Another recent
   study [19] compared different computational methods to identify the
   optimal model for analyzing microRNA expression data to discriminate
   patients with AIS from controls. Machine learning algorithms,
   including artificial neural networks (ANNs), random forests, extreme
   gradient boosting, and support vector machines (SVM), were applied.
   The different models used three different combinations of microRNAs.
   The ANN and SVM models had the best performance according to the
   Area Under the Curve (AUC).

   Moreover, concerning
   COVID-19, several recent studies have focused on identifying emerging
   biomarkers for SARS-CoV-2 detection, COVID-19 diagnostics, treatment,
   prognosis, and the design of new therapies. Specifically, a recent
   study examined a combination of clinical parameters and protein
   abundances to predict survival or death in COVID-19 patients [20].
   For this analysis, the WEKA machine learning tool was used for
   training and validation, with 9 clinical and 45 protein-based
   putative biomarkers being associated with the survival or death of
   COVID-19 patients. Another study, which examined the severity of the
   disease based on proteomic data [6], found that the Support Vector
   Machine had the best performance compared to other machine learning
   algorithms. Furthermore, another study used a combination of
   proteomic and metabolomic data to predict severity with a Random
   Forest model and reported an AUC of 95.7 % [21].

2. Methods

     * 1) Datasets

   The first dataset came from microarray experiments (Affymetrix
   whole-genome expression arrays U133 2.0) on peripheral blood samples
   from stroke patients and non-stroke controls. The samples were
   derived from three different scientific studies. As the three
   datasets were generated using different instrumentation and
   experimental setups, they were separately normalized from the raw
   data to homogenize them, generating a single expression matrix with a
   consistent scaling of expression levels. The dataset in [9] consists
   of 20 stroke and 20 control peripheral blood mononuclear cell (PBMC)
   samples, while the dataset in [10] has 39 stroke and 25 control whole
   blood samples that were evaluated at three different time points
   (within 3 h, 5 h, and 24 h of the stroke event); only the within-3 h
   time point data were used for this study. The dataset in [11] has 23
   stroke and 23 control whole blood samples. The final integrated
   dataset included patients who had IS and healthy patients (control
   group), for a total of 82 patients with IS and 68 healthy patients.
   After keeping the expression values of the genes measured in all
   three independent studies, we ended up with 13243 quantified gene
   transcripts for each sample. A linear regression method was used to
   correct the data for the different sample types (PBMC and whole
   blood), and transcripts that differed significantly between the two
   sample types were filtered out. This study aimed to predict whether a
   sample came from a stroke patient, based on the expression of its
   genes.
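
   The sample-type correction can be illustrated with a small sketch.
   The text does not specify the regression design, so the version below
   assumes the simplest case: each gene is regressed on a single 0/1
   sample-type indicator and the fitted sample-type effect is
   subtracted. The function name and the toy values are hypothetical.

```python
import numpy as np


def remove_sample_type_effect(expr, is_pbmc):
    """Regress each gene on a sample-type flag and subtract the fitted effect.

    expr:    (n_samples, n_genes) expression matrix.
    is_pbmc: (n_samples,) array of 0/1 flags (1 = PBMC, 0 = whole blood).
    """
    # Design matrix with an intercept and the sample-type indicator.
    design = np.column_stack([np.ones(len(is_pbmc)), is_pbmc.astype(float)])
    # Least squares for all genes at once: expr ~ intercept + beta * is_pbmc.
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    beta = coef[1]  # per-gene sample-type effect
    return expr - np.outer(is_pbmc, beta)


# Hypothetical toy data: gene 0 is shifted by +2 in PBMC samples, gene 1 is not.
is_pbmc = np.array([1, 1, 1, 0, 0, 0])
expr = np.array([[7.0, 3.0], [7.2, 3.1], [6.8, 2.9],
                 [5.0, 3.0], [5.2, 3.1], [4.8, 2.9]])
corrected = remove_sample_type_effect(expr, is_pbmc)
```

   After the correction, both sample types share the same per-gene mean
   while the within-group variation is preserved. A real analysis would
   typically also include disease status in the design so that the
   stroke signal is not removed along with the sample-type effect.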

   The samples of the second dataset were derived from the MGH study
   [12]. The data include protein measurements obtained with a
   high-throughput antibody technique and essential clinical parameters
   from plasma samples of COVID-19 patients and controls. The study was
   conducted by a group of clinicians and immunologists at MGH and
   enrolled patients with a clinical concern for COVID-19 upon Emergency
   Department arrival. Out of the 384 patients enrolled, 306 tested
   positive for COVID-19, whereas the 78 who tested negative were
   included as a control group. Blood samples from the COVID-19-positive
   patients were collected on days 0, 3, and 7, while virus-negative
   patients were sampled only on day 0. In the present study, we used
   the data obtained on day 0 to discriminate between COVID-19 and
   control patients. Several clinical and demographic characteristics
   were also collected alongside the proteomics measurements, including
   age, body mass index (BMI), pre-existing medical conditions, and
   laboratory measurements of C-reactive protein (CRP), absolute
   neutrophil count, and D-dimer.

   For each omics measurement of both datasets, logistic regression models
   were fitted using the sklearn version 1.1.1 Python package.
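
   A per-feature logistic regression of this kind might look like the
   sketch below. The text does not say how the fitted models were scored
   or used, so ranking features by cross-validated accuracy is an
   assumption, as are the function name and the synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def univariate_logistic_scores(X, y, cv=5):
    """Fit one logistic regression per feature; return mean CV accuracy per feature."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        model = LogisticRegression(max_iter=1000)
        # X[:, [j]] keeps the 2D shape that scikit-learn estimators expect.
        fold_scores = cross_val_score(model, X[:, [j]], y, cv=cv, scoring="accuracy")
        scores[j] = fold_scores.mean()
    return scores


# Synthetic example: 100 samples, 5 features, only feature 2 is informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 2] += 3.0 * y

scores = univariate_logistic_scores(X, y)
ranking = np.argsort(scores)[::-1]  # best-scoring feature first
```

   Here the informative feature ends up at the top of the ranking, which
   is how such univariate models can serve as a simple feature-scoring
   step.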

     * 2) Omics-CNN

2.1. Data preprocessing

   The central idea of the study was to examine whether Convolutional
   Neural Networks (CNNs), a deep learning method previously established
   for imaging data analysis, are suitable for high-throughput omics
   data analysis. To make CNNs applicable to high-throughput omics data,
   we designed and implemented a new pipeline and tool called Omics-CNN.
   The proposed pipeline is presented in detail in Fig. 1. The first
   step of the proposed pipeline was data preprocessing. The datasets
   were examined for missing values, since data imputation or filtering
   techniques are essential for training CNN models. Attributes with
   more than 30 % missing values were filtered out. The remaining
   missing values were imputed with a kNN algorithm, using the
   KNNImputer method of the sklearn library version 1.1.1 and k = 20
   (default value) [13]. We used a combination of the Local Outlier
   Factor method [14] and Principal Components Analysis (PCA) to
   identify potential outliers. In both datasets, less than 5 % of the
   data were marked as outliers.
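
   The preprocessing steps above can be sketched with scikit-learn as
   follows. This is an illustration under stated assumptions, not the
   Omics-CNN code: the text does not describe how PCA and the Local
   Outlier Factor were combined, so running LOF on the first two
   principal components is one possible choice. The function name and
   toy data are hypothetical. Note that KNNImputer's own default in
   scikit-learn is n_neighbors=5, so k = 20 is passed explicitly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.neighbors import LocalOutlierFactor


def preprocess(X, max_missing=0.30, k=20, n_components=2):
    """Filter sparse attributes, impute with kNN, and flag outlier samples."""
    # 1) Drop attributes (columns) with more than 30 % missing values.
    keep = np.isnan(X).mean(axis=0) <= max_missing
    X_kept = X[:, keep]
    # 2) Impute the remaining missing values with k nearest neighbors.
    X_imputed = KNNImputer(n_neighbors=k).fit_transform(X_kept)
    # 3) Assumed combination: project onto principal components, then run LOF
    #    with its scikit-learn defaults; fit_predict returns -1 for outliers.
    components = PCA(n_components=n_components).fit_transform(X_imputed)
    labels = LocalOutlierFactor().fit_predict(components)
    return X_imputed, keep, labels == -1


# Hypothetical toy matrix: 60 samples, 8 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
X[:30, 0] = np.nan   # 50 % missing: this column is dropped
X[:6, 1] = np.nan    # 10 % missing: this column is kept and imputed
X[-1, 2:] += 15.0    # one extreme sample

X_imputed, keep, is_outlier = preprocess(X)
```

   The returned mask marks candidate outliers only; whether flagged
   samples are removed or kept is a separate decision.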

Fig. 1.

   The Omics-CNN pipeline. Different colors in the input data correspond
   to omics data from different modalities, showing that the Omics-CNN
   pipeline is applicable to all quantitative omics data. (For
   interpretation of the references to color in this figure legend, the