Abstract

Background and objective
The development of machine learning-based models for predicting severe diseases has been one of the main concerns of the scientific community. The current study seeks to extend a highly sophisticated tool, Convolutional Neural Networks, to multidimensional omics data classification problems, and tests the newly introduced method on publicly available transcriptomics and proteomics data.

Methods
In this study, we introduce Omics-CNN, a Convolutional Neural Network-based pipeline that couples Convolutional Neural Networks with dimensionality reduction, preprocessing, clustering, and explainability techniques so that highly accurate and interpretable classification models can be built from high-throughput omics data. The developed tool has the potential to classify patients according to the expression of genetic and clinical factors and to identify features that can act as diagnostic biomarkers. For dimensionality reduction, univariate and multivariate techniques were explored and compared. After training the model, Gradient-weighted Class Activation Mapping analysis was performed to determine the features most important for classifying the samples.

Results
The newly introduced pipeline was applied to one transcriptomics and one proteomics dataset to identify diagnostic models and biosignatures for Ischemic Stroke (IS) and COVID-19 infection, reporting highly accurate biosignatures with accuracies of 96 % and 95.41 %, respectively. Classification models based on only a small subset of the attributes provided lower predictive accuracy but identified a compact transcript biosignature (KRT15, VPRBP, TNFRSF4, GORASP2) for Ischemic Stroke diagnosis and a compact protein biosignature (ADGRB3, VNN2, AGER, CIAPIN1) for COVID-19 infection diagnosis.

Conclusions
Omics-CNN overcame the inherent problems of applying Convolutional Neural Networks to the training of diagnostic models with quantitative omics data. It outperformed previous machine learning models developed on the same datasets for Ischemic Stroke and COVID-19 infection diagnosis and determined the most contributing biomarkers for both diseases.

Keywords: Convolutional neural networks, Transcriptomics, Personalized medicine, Covid-19, Ischemic stroke

1. Introduction
The big data revolution has become noticeable in the field of biology. High-throughput technologies (e.g., microarrays and next-generation sequencing) produce large amounts of biological data, collectively known as omics. Omics include different types of data, such as genomics, epigenomics, transcriptomics, and proteomics, and can contribute to a comprehensive understanding of biological processes and pathways as well as of life-threatening diseases. The combination of such information lays the groundwork for precision medicine. Precision medicine uses data from different sources, offering a holistic description of a patient's overall health, expediting disease diagnosis, and allowing the most suitable therapeutic protocol to be selected for each patient [1]. The increased complexity of omics data in terms of dimensionality and noise impedes their analysis with simple statistical tests. As a result, omics data demand more sophisticated computational intelligence methods that can capture complex nonlinear and hierarchical associations between the examined features. In this context, Deep Neural Networks [2] are powerful computing tools that exploit the high correlation of the input data in sample classification. In recent years, such computing tools have attracted interest in the analysis of complex data, including omics data.
Indicatively, some of the studies conducted so far involve complicated architectures, which either combine different types of Deep Neural Networks, such as Recurrent Neural Networks with Convolutional Neural Networks [3], or analyze different data, such as clinical or multi-omics data [4]. There is a clear trend toward incorporating high-throughput omics or multi-omics analysis in biomedical research to explain the complex relationships between molecular layers. Such powerful tools can be adapted to various applications of high-dimensional omics data and have great potential to facilitate more accurate and personalized clinical decision-making, especially in life-threatening conditions such as cancer, cancer metastasis [5], stroke [6], or COVID-19 [7]. Convolutional Neural Networks (CNNs) have been widely used for classification and regression with imaging data. Their ability to take into account complex relationships between the inputs makes them well suited to the analysis of omics data, yet only a few applications to omics data have been reported because of technical limitations. In Ref. [8], the authors developed a model called OmicsMapNet for the analysis of high-dimensional omics data as 2-dimensional images. The most contributory features in the trained CNN were confirmed by pathway analysis. Another study [9] proposed a CNN approach combined with spectral clustering information processing to classify lung cancer, with greater performance than other machine learning algorithms such as Support Vector Machines (SVMs) or Random Forests (RFs). Moreover, since the power of gene expression profiles in cancer identification has been proven, another analysis [10] generated 2D images by integrating gene expression profiles with a protein-protein interaction (PPI) network from human samples and analyzed them with a CNN.
Another approach applying a CNN to non-image data [11], called DeepFeature, successfully transformed omics data into a form that is optimal for fitting a CNN model and returned the sets of most important genes used internally for computing predictions. Despite the promising results of these applications, CNN methods are not generally applicable to omics data, first because the number of features is far higher than the number of available samples in most of these datasets and, more importantly, because a CNN requires a specific structured organization of its inputs, as in the case of images, which is not straightforward for omics data. To overcome these limitations, we propose an omics classification pipeline based on a 1D CNN. It uses dimensionality reduction methods to alleviate the high dimensionality of omics data and clustering techniques to organize the input features in a meaningful manner, so that the CNN can exploit this feature organization to improve classification performance. A 1D CNN was preferred over a 2D CNN because omics data are inherently organized as 1D feature vectors. The present study explored the design and application of a new CNN-based tool for the discovery of diagnostic biosignatures and models for Ischemic Stroke and COVID-19 using high-dimensional transcriptomics and proteomics datasets, respectively. Regarding Ischemic Stroke, an early and accurate diagnosis can improve the probability of a positive outcome. Many studies have aimed to identify biomarkers that facilitate the early diagnosis of acute ischemic stroke (AIS). In a prior study [17], the most important genes were identified using differential expression analysis. These were tested in a logistic regression model and further validated by qRT-PCR.
Another study that provided molecular insight into AIS [18] used a machine learning technique known as genetic algorithm/k-nearest neighbors (GA/kNN) to identify a pattern of gene expression that could optimally discriminate between patients and neurologically asymptomatic controls. Furthermore, a more recent study [5] employed a hybrid genetic algorithm–support vector machine learning tool combined with a network comparison approach to identify transcription patterns characteristic of patients with acute ischemic stroke. A recent study [19] compared different computational methods to identify the optimal model for analyzing microRNA expression data and discriminating patients with AIS from controls. Machine learning algorithms, including artificial neural networks (ANNs), random forests, extreme gradient boosting, and support vector machines (SVMs), were applied. The different models used three different combinations of microRNAs. The ANN and SVM models had the best performance according to the AUC (Area Under the Curve). Concerning COVID-19, several recent studies have focused on identifying emerging biomarkers for SARS-CoV-2 detection, COVID-19 diagnostics, treatment, prognosis, and the design of new therapies. Specifically, a recent study examined a combination of clinical parameters and protein abundances to predict survival or death in COVID-19 patients [20]. For this analysis, the WEKA machine learning tool was used for training and validation, and 9 clinical and 45 protein-based putative biomarkers were associated with the survival or death of COVID-19 patients. Another study, which examined the severity of the disease based on proteomic data [6], found that the Support Vector Machine had the best performance compared to other machine learning algorithms.
Furthermore, another study used a combination of proteomic and metabolomic data to predict severity with a Random Forest model and reported an AUC of 95.7 % [21].

2. Methods

* 1) Datasets
The first dataset came from microarray experiments (Affymetrix whole-genome expression arrays U133 2.0) on peripheral blood samples from stroke patients and non-stroke control patients. The samples were derived from three different scientific studies. Because the three datasets were generated using different instrumentation and experimental setups, they were normalized separately from the raw data and then homogenized into a single expression matrix with a consistent scaling of expression levels. The dataset of [9] consists of 20 stroke and 20 control peripheral blood mononuclear cell (PBMC) samples, while that of [10] has 39 stroke and 25 control whole blood samples that were evaluated at three time points (within 3 h, 5 h, and 24 h of the stroke event); only the within-3 h time point data were used for this study. The dataset of [11] has whole blood samples from 23 stroke and 23 control subjects. The final integrated dataset therefore included 82 patients with IS and 68 healthy patients (control group). After retaining only the expression values of genes measured in all three independent studies, 13243 quantified gene transcripts remained for each sample. A linear regression method was used to correct the data for sample type (PBMC versus whole blood), and transcripts that differed significantly between the two sample types were filtered out. The aim was to predict, from the expression of specific genes, whether a sample came from a stroke patient. The samples of the second dataset were derived from the MGH study [12].
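The integration of the three stroke studies described above (retaining commonly measured genes, then correcting for sample type by linear regression) can be illustrated with a minimal sketch. The text does not specify the exact regression model, so the version below is an assumption: each gene is regressed on a binary PBMC/whole-blood indicator and the fitted sample-type shift is subtracted. The function names `common_genes` and `remove_sample_type_effect` are hypothetical, the per-study normalization is assumed to have been done beforehand, and the additional filtering of transcripts that differ significantly between sample types is not shown.

```python
import numpy as np


def common_genes(gene_lists):
    """Return the genes measured in every study, in the order of the first study."""
    shared = set(gene_lists[0]).intersection(*map(set, gene_lists[1:]))
    return [g for g in gene_lists[0] if g in shared]


def remove_sample_type_effect(expr, is_pbmc):
    """Regress each gene on a sample-type indicator and remove the fitted shift.

    expr    : (n_samples, n_genes) expression matrix
    is_pbmc : (n_samples,) array of 0/1 flags (1 = PBMC, 0 = whole blood)

    Assumption: a simple per-gene model expr ~ intercept + sample_type.
    """
    expr = np.asarray(expr, dtype=float)
    flag = np.asarray(is_pbmc, dtype=float)
    design = np.column_stack([np.ones_like(flag), flag])  # intercept + sample type
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)  # shape (2, n_genes)
    # Subtract only the sample-type term, so whole-blood values stay unchanged.
    return expr - np.outer(flag, coef[1])
```

With this simple model, the corrected PBMC and whole-blood samples share the same per-gene mean. A model that also includes disease status as a covariate would be a natural refinement when the groups are unbalanced across sample types.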
The data include protein measurements obtained with a high-throughput antibody technique, together with essential clinical parameters, from plasma samples of COVID-19 patients and controls. The study was conducted by a group of clinicians and immunologists at MGH and enrolled patients with a clinical concern for COVID-19 upon Emergency Department arrival. Of the 384 patients enrolled, 306 tested positive for COVID-19, whereas the 78 who tested negative were included as a control group. Blood samples from the COVID-19-positive patients were collected on days 0, 3, and 7, while virus-negative patients were sampled only on day 0. In the present study, we used the data obtained on day 0 to discriminate between COVID-19 patients and controls. Several clinical and demographic characteristics were also collected alongside the proteomics measurements, including age, body mass index (BMI), pre-existing medical conditions, and laboratory measurements of C-reactive protein (CRP), absolute neutrophil count, and D-dimer. For each omics measurement of both datasets, logistic regression models were fitted using the Python package sklearn, version 1.1.1.

* 2) Omics-CNN

2.1. Data preprocessing
The central idea of the study was to examine whether Convolutional Neural Networks (CNNs), a deep learning method established for imaging data analysis, are suitable for high-throughput omics data analysis. To make CNNs applicable to high-throughput omics data, we designed and implemented a new pipeline and tool called Omics-CNN. The proposed pipeline is presented in detail in Fig. 1. The first step of the proposed pipeline was data preprocessing. The datasets were first checked for missing values, since data imputation or filtering is essential for training CNN models. Attributes with more than 30 % missing values were removed.
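As a minimal sketch of this step, the missing-value filter can be combined with the kNN imputation that the pipeline applies next. The function name `filter_and_impute` and its parameter names are hypothetical, and missing entries are assumed to be encoded as `np.nan`. The number of neighbors is passed explicitly as k = 20, the value reported in the text, because scikit-learn's own default for `n_neighbors` in `KNNImputer` is 5.

```python
import numpy as np
from sklearn.impute import KNNImputer


def filter_and_impute(X, max_missing=0.30, k=20):
    """Drop features with too many missing values, then kNN-impute the rest.

    X : (n_samples, n_features) array with np.nan marking missing entries.
    Returns the imputed matrix and the column indices of the retained features.
    """
    X = np.asarray(X, dtype=float)
    missing_fraction = np.isnan(X).mean(axis=0)
    # Features with more than `max_missing` missing values are removed.
    keep = np.flatnonzero(missing_fraction <= max_missing)
    # k is set explicitly; scikit-learn's default n_neighbors is 5.
    imputer = KNNImputer(n_neighbors=k)
    return imputer.fit_transform(X[:, keep]), keep
```

Returning the retained column indices alongside the imputed matrix makes it possible to map the surviving features back to their gene or protein names in later steps.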
The remaining missing values were imputed with a kNN algorithm, using the KNNImputer method of the sklearn library, version 1.1.1, with k = 20 [13]. We used a combination of the Local Outlier Factor method [14] and Principal Component Analysis (PCA) to identify potential outliers. In both datasets, less than 5 % of the data were marked as outliers.

Fig. 1. The Omics-CNN pipeline. Different colors in the input data correspond to omics data from different modalities, showing that the Omics-CNN pipeline is applicable to all quantitative omics data. (For interpretation of the references to color in this figure legend, the