Abstract Amyotrophic Lateral Sclerosis (ALS) is a complex and rare neurodegenerative disorder characterized by significant genetic, molecular, and clinical heterogeneity. Despite numerous endeavors to discover the genetic factors underlying ALS, a significant number of these factors remain unknown. This knowledge gap highlights the necessity for personalized medicine approaches that can provide more comprehensive information for the purposes of diagnosis, prognosis, and treatment of ALS. This work utilizes an innovative approach by employing a machine learning-facilitated, multi-omic model to develop a more comprehensive knowledge of ALS. Through unsupervised clustering on gene expression profiles, 9,847 genes associated with ALS pathways are isolated and integrated with 7,699 genes containing rare, presumed pathogenic genomic variants, leading to a comprehensive amalgamation of 17,546 genes. Subsequently, a Variational Autoencoder is applied to distil complex biomedical information from these genes, culminating in the creation of the proposed Multi-Omics for ALS (MOALS) model, which has been designed to expose intricate genotype-phenotype interconnections within the dataset. Our meticulous investigation elucidates several pivotal ALS signaling pathways and demonstrates that MOALS is a superior model, outclassing other machine learning models based on single omic approaches such as SNV and RNA expression, enhancing accuracy by 1.7 percent and 6.2 percent, respectively. The findings of this study suggest that analyzing the relationships within biological systems can provide heuristic insights into the biological mechanisms that help to make highly accurate ALS diagnosis tools and achieve more interpretable results. 
Keywords: ALS diagnosis, Pathway level analysis, Variational autoencoder, Multi-omic integration

Graphical abstract [figure]

Highlights
• An advanced machine learning-based multi-omic model was developed to enhance the understanding of ALS mechanisms.
• MOALS combined 9,847 ALS-related genes with 7,699 rare variant genes, creating a comprehensive dataset for precise analysis.
• Findings showed that unsupervised clustering exposed critical ALS pathways, validating MOALS performance in gene analysis.
• The proposed model outperformed existing ML methods, with accuracy improvements ranging from 1.7% to 6.2% in ALS prediction.

Nomenclature
ALS: Amyotrophic Lateral Sclerosis
MND: Motor Neuron Disease
FDA: Food and Drug Administration
GWAS: Genome-Wide Association Studies
ML: Machine Learning
FALS: Familial Amyotrophic Lateral Sclerosis
SALS: Sporadic Amyotrophic Lateral Sclerosis
RNA-seq: RNA Sequencing
mRNA: Messenger RNA
WGS: Whole Genome Sequencing
SNV: Single Nucleotide Variant
SHAP: Shapley Additive Explanations
iPSC: Induced Pluripotent Stem Cells
ATAC-seq: Assay for Transposase-Accessible Chromatin using sequencing
ChIP-seq: Chromatin Immunoprecipitation Sequencing
Hi-C: A method to study the three-dimensional architecture of genomes
MOALS: Multi-Omics for ALS
VAE: Variational Autoencoder
GRCh38: Genome Reference Consortium Human Build 38
bwa-mem2: Burrows-Wheeler Aligner-Maximum Exact Matches 2
GATK: Genome Analysis Toolkit
VEP: Variant Effect Predictor
FASTQC: Fast Quality Control
FDR: False Discovery Rate
BQSR: Base Quality Score Recalibration
CE: Cross Entropy
MSE: Mean Squared Error
DEM: Deep Embedding Module
BCE: Binary Cross Entropy
SVM: Support Vector Machine
UMAP: Uniform Manifold Approximation and Projection
PCA: Principal Component Analysis
MTLR: Multi-Task Logistic Regression
GradNorm: Gradient Normalization
KEGG: Kyoto Encyclopedia of Genes and Genomes
ER: Endoplasmic Reticulum
UPR: Unfolded Protein Response
UPS: Ubiquitin-Proteasome System
STRING-db: Search Tool for the Retrieval of Interacting Genes/Proteins database
TF-IDF: Term Frequency-Inverse Document Frequency
ROC: Receiver Operating Characteristic
AUC: Area Under the Curve
FCNN: Fully Connected Neural Network
RFR: Random Forest Regression
SVR: Support Vector Regression
RMSE: Root Mean Square Error
MAE: Mean Absolute Error
MedAE: Median Absolute Error
R²: Coefficient of Determination
C-index: Concordance Index
IBS: Integrated Brier Score
CoxPH: Cox Proportional Hazards
AI: Artificial Intelligence
RNA: Ribonucleic Acid

1. Introduction

Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disease that causes Motor Neuron (MN) loss in the spinal cord and the motor cortex. ALS, also known as Lou Gehrig's disease, leads to progressive paralysis, muscular atrophy, and death. According to the US Centers for Disease Control and Prevention, 12,000 to 15,000 Americans are thought to have ALS [1]. About 10 percent of ALS cases are familial, while the remaining 90 percent are sporadic. Some monogenic drivers of familial ALS include mutations in the genes C9orf72, SOD1, TARDBP, and FUS [2]. However, the pathogenesis of sporadic ALS still has no known genetic or environmental cause.

Familial and sporadic ALS patients have few treatment options. Despite 30 years of clinical trials, only Rilutek (riluzole), Radicava (edaravone), Relyvrio (sodium phenylbutyrate and taurursodiol), and Qalsody (tofersen) have been approved by the FDA as symptomatic treatments for ALS (in 1995, 2017, 2022, and 2023, respectively). Unfortunately, none of these drugs stops the disease or restores motor function [3]. In understanding the genetic complexity of ALS, it is crucial to consider that known causative genetic variants manifest predominantly in later life and account for only 10 percent of ALS heritability, leaving a large “missing heritability” component that may be polygenic or even omnigenic in nature [4].
Addressing these issues requires a systems biology perspective that considers disease as a dysfunction in biological modules or pathways. Our research addresses these gaps by focusing on the biological pathways disrupted in ALS, employing a two-step approach for mapping genotypes to disease prevalence. In the first stage, genes were clustered based on their expression profiles in specific brain and spinal cord regions of ALS patients, compared to healthy controls. In the second stage, germline mutations and gene expression data were integrated into a general classifier using a multi-omics approach. Our methodology introduces a novel diagnostic process for ALS, addressing critical methodological challenges such as the limitations of genome-wide association studies (GWAS) in capturing non-additive genetic effects like epistasis. These findings open up avenues for further research, particularly in the exploration of machine learning models to capture the complexity of gene-gene interactions and the potentially omnigenic nature of ALS.

Machine learning (ML), as a data-driven platform, has played a significant role in advancing the diagnosis of many diseases, including ALS. For ALS, however, ML models have relied mainly on single-omic approaches, for diagnosis [4], [7], [9], prognosis [12], mutation analysis [11], subtyping [6], biomarker identification [10], pathway analysis [5], biological identification [13], and gene discovery [8]. Another ML approach, RefMap [14], uses iPSCs, ATAC-seq, histone ChIP-seq, Hi-C, and RNA-seq for gene discovery. Further details about these studies and their specific features are summarized in Table 1 (top).

Table 1. Literature review of ML-based methods for ALS (top) and other diseases (bottom).
Ref | Disease | Omics | Approach | ML model | Feature
[5] | ALS | mRNA expression | Pathways | Unsupervised hierarchical clustering | Classification of SALS and control; identification of a common pathogenic link between FALS and SALS
[6] | ALS | RNA-seq | Subtyping | Clustering | Identification of TARDBP/TDP-43 and retrotransposon expression as two factors for ALS
[7] | ALS | Whole genome sequence | Diagnosis | Convolutional neural network, deep neural network | Identification of ALS-associated promoter regions; ALS classification
[8] | ALS | Protein–protein interaction data, gene function annotation, known disease-gene associations | Gene discovery | Knowledge-based machine learning | Gene prioritization for ALS
[9] | ALS | RNA expression | Diagnosis | Deep convolutional neural networks and Shapley values | ALS classification
[10] | ALS | Plasma samples | Biomarker | Random forest | Very high prediction rates for ALS diagnosis and prognosis
[11] | ALS | Whole genome sequence | Mutation | Unsupervised machine learning | Identification of a subset of common genetic variants for ALS
[12] | ALS | Demographic, family history, genetic factor | Prognosis | Probabilistic causal discovery | Assessment of the association of genetic factors with ALS clinical progression
[13] | ALS | Multichannel fluorescence microscopy data | Biological identification | Image-based deep learning | Evaluation of the impact of stress on valosin-containing protein related to ALS
[14] | ALS | iPSCs, ATAC-seq, histone ChIP-seq, Hi-C, RNA-seq | Diagnosis | Regional fine-mapping (RefMap) | Identification of risk genes related to ALS
[4] | ALS | Whole genome sequence | Diagnosis | Capsule networks | Disease prediction from individual genotype profiles
------------------------------------------------------------------
[15] | Cancer | mRNA expression, DNA methylation, microRNA expression | Diagnosis | Variational autoencoder | Implementation of a multi-task platform for cancer diagnosis
[16] | Cancer, Alzheimer | mRNA, DNA methylation, RNA expression | Biomarker | Graph convolutional networks | Identification of important biomarkers
[17] | Cancer | Mutations, copy number changes, DNA methylation, gene expression | Biomarker | Graph convolutional networks | Identification of new cancer genes and their associated molecular mechanisms
[18] | Cancer | mRNA expression, DNA methylation, microRNA expression | Diagnosis | Interpretable deep learning, variational autoencoder | Discovery of biomedical knowledge; cancer classification
[19] | Cancer | DNA methylation, gene expression | Translation framework | Generative adversarial networks | Omics-to-omics translation

So far, single-omic approaches have led to significant progress in ML-based diagnosis of ALS. For instance, [4] was the first to apply capsule networks at whole-genome scale, achieving 86.9 percent predictive accuracy and revealing ‘non-additive’ genes that had remained obscured in linear models. In [9], convolutional neural networks were combined with Shapley Additive Explanations (SHAP): RNA expression values were converted into pixel-based images for analysis, yielding 80.7 percent accuracy and a level of interpretability that allowed disease-critical genes to be identified. Moreover, the two-step deep convolutional neural network approach of [7] underscores the importance of domain-specific architecture, offering 77 percent accuracy in ALS prediction and highlighting the value of incorporating prior genomic knowledge into machine learning models. Single-omic techniques, however, are limited in their ability to integrate different data types, and they struggle to explore the genetic, biochemical, metabolic, proteomic, and epigenetic mechanisms that are important underlying factors for ALS.
A multi-omic approach can provide a more comprehensive, higher-resolution picture of the layers of regulation and the interconnected complexity of biological systems. Incorporating multi-omic data into ML models provides a powerful analytical option capable of finding patterns in dense datasets for genomically heterogeneous and complex diseases [15], [18], biomarker identification [16], [17], and other applications [19]. Several state-of-the-art computational methods address the prevailing challenges of omics data analysis in the biomedical sector. For example, [18] proposed XOmiVAE to address the pressing need for explainability in deep learning applications, particularly in cancer classification, by elucidating the contributions of individual genes and latent dimensions. Complementing this, [17] introduced EMOGI, which integrates multi-omics pan-cancer data with protein–protein interaction networks through graph convolutional networks, thereby providing an accurate and interpretable model for predicting cancer genes. In another study, [16] presented MOGONET, which advances multi-omics integrative analysis and performs well in both data classification and biomarker discovery across disparate biomedical applications. To confront the challenges associated with high-dimensional data and cross-omics translation, [15] and [19] implemented OmiEmbed and OmiTrans, respectively. However, all of these efforts have been restricted to cancers and Alzheimer's disease; ALS diagnosis by a multi-omics approach has not yet been addressed. Further details about the multi-omics studies in other diseases and their specific features are reported in Table 1 (bottom).
In this paper, for the first time, an ML-based multi-omics approach for ALS diagnosis, first-symptom prediction, and survival prediction is presented. We use unsupervised clustering of gene expression profiles to obtain interpretable biological processes and identify 9,847 genes associated with ALS pathways, which were then integrated with 7,699 genes that include rare, predicted pathogenic genomic variants prioritized based on biological knowledge. We use a variational autoencoder to capture biomedical information in the integrated 17,546 genes. We named our model Multi-Omics for ALS (MOALS). Our investigation detects several ALS signaling pathways, and MOALS outperforms other existing ALS classification models in accuracy. In summary, the contributions of this paper are as follows:
• Acquiring and preprocessing whole genome sequencing (WGS) data to extract and analyze single nucleotide variants (SNVs) based on biological knowledge.
• Selecting features using a clustering algorithm for ALS pathway-level analyses on mRNA transcriptomes.
• Combining the above two in a variational autoencoder learning framework to predict ALS status, the age at first symptoms, and survival.

2. Materials and methods

This section describes the methodology followed for feature selection, the ML platform, and the pathway-level analysis. The platform, named MOALS, is designed for multi-omic data processing. Its workflow can be broken down into three components:
(1) Pre-processing and extraction of genes in ALS pathways by clustering of gene expression profiles: transcriptome data were pre-processed, and feature preselection was performed to identify ALS pathway signaling and to eliminate noise in the expression data that may degrade classification performance.
(2) Variant extraction: single nucleotide variants (SNVs) were detected with sequence alignment/map tools, and the variants were prioritized based on biological knowledge using bioinformatic programs. Briefly, raw fastq data were aligned to the human reference genome (GRCh38) using bwa-mem2. SNVs and small insertions and deletions (indels) were then called using GATK (version 4.2.5.0). Next, SNVs and small indels were annotated with gene information using VEP (version 104.2). Finally, variants were filtered and ranked using custom scripts.
(3) Machine learning using multi-omics data: the initial class probability estimates from each single omic were used to compute the multi-omics integration with a variational autoencoder, based on concatenating the omics layers.

The following sub-sections explain each part in detail. An overview of the steps is depicted in Fig. 1.

Figure 1. Overview of the implemented method for ALS diagnosis.

2.1. Data acquisition and preprocessing

The sequencing data (RNA and WGS) analyzed in this study come from the Target ALS cohort and were obtained through a data request to the New York Genome Center. The 672 cases, comprising 593 ALS and 79 non-ALS cases, were selected based on the availability of the high-quality multi-omics data necessary for robust analysis. Because few Control samples were available, we selected a robust model specifically designed to handle the imbalanced nature of the data.

2.1.1. RNA-sequencing analysis

The raw RNA sequencing data were processed in-house according to the following pipeline. After obtaining the raw sequencing data in fastq format, we used FASTQC [20] to perform quality control, with the mean quality value at each base position in the read and the per-sequence quality scores as the major criteria for data quality evaluation.
Kallisto [21] was used to pseudo-align the sequences to the GRCh38 reference from Ensembl release 95 [22]. To conduct pathway analysis, GO annotations and homology information were obtained from the Ensembl BioMart database [23]. Enrichr [24] was used to perform gene set enrichment analysis. For the pathway analysis, we used the Kyoto Encyclopedia of Genes and Genomes (KEGG), a comprehensive database resource that offers a systematic understanding of biological functions and the interconnection of the elements of the biological system. KEGG pathway annotations were tested against the whole genome as a background reference to identify statistically significant pathways. A false discovery rate (FDR) cutoff of 0.05 was used to select significantly enriched pathways.

2.1.2. Genomic variant extraction

Whole-genome sequencing (WGS) data, in the form of raw fastq files, were aligned to the GRCh38 reference genome following the GATK best-practices workflow. This workflow incorporated BWA-MEM for alignment, Picard tools for marking duplicate reads, local realignment around indels, and base quality score recalibration (BQSR) [25] to refine alignment accuracy. After alignment, variants were called per sample using HaplotypeCaller, followed by joint genotyping and Variant Quality Score Recalibration to improve the reliability of variant identification. The resulting variants were then annotated using Variant Effect Predictor (VEP) [26] and bcftools [27] to determine their potential impacts, focusing on those with predicted effects on protein-coding sequences according to the functional annotation. After annotation, a series of filtering steps was applied.
This involved isolating variants associated with canonical transcripts and recognized gene symbols, and variants with a population allele frequency < 0.01 or absent from large normal populations [26]. Using the biological context acquired from annotation, the variants were scored and ranked based on their predicted impacts, prioritizing variants such as frameshift, stop gained, transcript ablation, stop lost, start lost, transcript amplification, and splice donor and acceptor variants. These variant classes were assigned scores of 5, 3, or 2 to reflect their relative significance. After evaluating all patient samples, 7,699 variants of interest were identified. The number of occurrences of each variant differed between individuals. This systematic approach identified and evaluated variants with strong predicted functional implications.

2.2. Feature selection based on pathway-level analyses

Fuzzy k-means clustering is used as the central technique for grouping mRNA gene expression data to discern patterns associated with ALS. Unlike traditional k-means clustering, which assigns each data point rigidly to a single cluster, fuzzy k-means permits a data point to belong to multiple clusters with varying degrees of membership. This fuzziness is crucial in revealing complex patterns in gene expression data, particularly where the boundaries between different expression levels are not distinctly defined. The primary advantage of using fuzzy k-means for gene expression analysis is its ability to handle the inherent uncertainty and variability in gene expression levels.
This provides a nuanced and comprehensive understanding of the underlying biological phenomena, enabling more biologically meaningful interpretations of the heterogeneous nature of mRNA transcripts associated with ALS. The flexibility of fuzzy k-means is particularly relevant in addressing the complex nature of neurodegenerative disorders such as ALS, thereby facilitating more refined and precise analyses compared to conventional clustering methods. To enhance the description and reproducibility of our clustering process, we detail our methodology as follows:
• Mathematical Formulation: The fuzzy k-means algorithm assigns membership degrees using the formula shown in Equation (1):

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{\lVert x_i - v_j \rVert}{\lVert x_i - v_k \rVert} \right)^{\frac{2}{m-1}}} \tag{1}$$

where $u_{ij}$ is the membership degree of the i-th data point in the j-th cluster, $x_i$ is the i-th data point, $v_j$ is the centroid of the j-th cluster, c is the number of clusters, and m is the fuzziness parameter.
• Cluster Initialization: We employed the fuzzy k-means algorithm with a specified range of cluster counts (6 to 12 clusters) to determine the optimal granularity for our data.
• Cluster Optimization: The algorithm iteratively adjusts the cluster centers based on a weighted average of the data points, where weights correspond to the degree of belonging of each point to a particular cluster. The process continues until the changes in the cluster centers are minimal, ensuring convergence.
• Membership Degree Evaluation: Each gene's membership degree to the cluster centers was evaluated to score and rank the genes based on their centrality. This scoring influenced their subsequent selection for pathway analysis.
• Statistical Evaluation: Ranked genes were analyzed through pathway-level enrichment tests to determine their biological significance, enhancing our understanding of ALS-associated pathways.
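The membership formula in Equation (1) can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the function name `fuzzy_memberships`, the default fuzziness m = 2, the Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import math

def fuzzy_memberships(point, centroids, m=2.0):
    """Membership degrees u_ij of one data point in each cluster (Equation 1)."""
    dists = [math.dist(point, v) for v in centroids]
    # A point lying exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_ij = 1 / sum_k (||x_i - v_j|| / ||x_i - v_k||)^(2 / (m - 1))
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
        for d_j in dists
    ]

# Toy example (hypothetical values): one expression profile, three cluster centres.
u = fuzzy_memberships([1.0, 1.0], [[0.0, 0.0], [2.0, 2.0], [5.0, 5.0]])
```

Here the profile is equidistant from the first two centres, so it receives equal membership in both and only a small membership in the distant third cluster; the memberships sum to one by construction.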
By providing these additional details, we aim to improve the transparency of our analytical approach and allow for better reproducibility of our results by other researchers in the field.

2.3. Network architecture

The MOALS platform integrates a multi-task deep learning framework to analyze multi-omics data, which is crucial for applications such as ALS classification, prediction of the age at first symptoms, and survival prediction. Central to this framework is the Deep Embedding Module (DEM), which employs a Variational Autoencoder (VAE) to transform high-dimensional multi-omics data into a meaningful, low-dimensional latent space.

2.3.1. Variational Autoencoder (VAE) overview

The VAE is the cornerstone of our DEM, offering a robust method for learning deep representations of complex datasets. Unlike traditional autoencoders, VAEs encode inputs into a latent (hidden) space probabilistically. This approach not only compresses the data but also allows new data points to be generated, which facilitates the modeling of complex biological phenomena.

Mathematical framework of VAE

In the VAE, each high-dimensional input vector $x^{(i)} \in \mathbb{R}^d$ from the multi-omics dataset $D$ is mapped to a latent vector $z^{(i)} \in \mathbb{R}^p$, where $p \ll d$. The mapping is done through a probabilistic encoding process defined by a distribution $q_\phi(z|x)$, typically assumed to be Gaussian, as shown in Equation (2):

$$q_\phi(z|x) = \mathcal{N}\left(z; \mu(x), \sigma(x)^2 I\right) \tag{2}$$

where $\mu(x)$ and $\sigma(x)$ are outputs of the encoder network, parameterized by $\phi$, representing the mean and standard deviation of the latent distribution.
The VAE optimizes its parameters by maximizing the Evidence Lower Bound (ELBO) on the log-likelihood of the data, as shown in Equation (3):

$$\mathcal{L}(\phi, \theta; x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right] - D_{KL}\left(q_\phi(z|x^{(i)}) \,\|\, p(z)\right) \tag{3}$$

where $p_\theta(x|z)$ is the probability of reconstructing x given z, modeled by the decoder network with parameters $\theta$, and $D_{KL}$ is the Kullback–Leibler divergence, which encourages the encoded latent variables to approximate a prior distribution $p(z)$, typically a standard normal distribution. The KL term ensures that the VAE not only reconstructs the data efficiently but also regularizes the learning process to limit overfitting, making the model more robust to unseen data.

Implementation details

In our implementation, the VAE's encoder and decoder networks consist of fully connected layers with non-linear activation functions such as ReLU, which are crucial for capturing complex patterns in the data. The encoder compresses the input into the latent variables $\mu$ and $\sigma$, and the decoder reconstructs the input from a latent representation sampled using the reparameterization trick, as shown in Equation (4):

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{4}$$

This step ensures that gradients can be backpropagated through the stochastic sampling process, making the network trainable with standard backpropagation. The VAE module integrated in the MOALS framework is pivotal for reducing dimensionality and uncovering latent structure in complex multi-omics data, facilitating downstream tasks such as classification and regression with enhanced interpretability and accuracy. The end-to-end downstream network of MOALS is capable of classification, regression, and survival prediction.
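The sampling step of Equation (4) can be sketched in a few lines of standard-library Python. This is only an illustration under assumed values: the function name `reparameterize` and the numbers for μ and σ are hypothetical, and plain Python carries no gradients. In an autodiff framework the same expression keeps μ and σ differentiable because all randomness is isolated in ε.

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element-wise (Equation 4)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)        # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 2.0]         # hypothetical encoder means
sigma = [0.1, 0.3, 0.0]       # hypothetical encoder standard deviations
z = reparameterize(mu, sigma, rng)
```

A latent dimension with σ = 0 reproduces its mean exactly, while the other dimensions scatter around their means with the spread given by σ.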
Using the multi-task training method described in the following sections, each downstream task can be trained individually or jointly with other downstream tasks. For classification-type downstream tasks, such as (Control & Target) type classification and site of motor onset classification, a multilayer fully connected network was attached to the DEM. The output dimensionality of the classification downstream network was set to the number of categories. An analogous network was attached to the DEM for regression, but only one neuron was kept in the output layer to predict the target scalar value. A more complex downstream network is used for survival prediction, which is covered in greater detail in Section 2.3.3. The downstream networks also regularize the low-dimensional latent representation, encouraging the DEM to learn omics embeddings that are relevant to the downstream tasks. With the downstream modules, a single well-trained multi-task MOALS network can reconstruct a comprehensive diagnostic, prognostic, and demographic profile from omics data.

2.3.2. Learning strategy

The joint loss function, like the overall structure, has two key elements: the deep embedding loss and the downstream task loss. For each omics type, $x_j$ denotes the input profile and $x'_j$ denotes the corresponding reconstructed profile, where M is the number of omics types and j is the omics index. The deep embedding loss is calculated using Equation (5):

$$L_{embed} = \frac{1}{M} \sum_{j=1}^{M} BCE(x_j, x'_j) + D_{KL}\left(\mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, I)\right) \tag{5}$$

where BCE is the binary cross-entropy between the input and reconstructed profiles, and $D_{KL}$ is the KL divergence between the learned distribution and a standard Gaussian distribution.
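As a worked illustration of Equation (5), the sketch below computes the embedding loss for toy inputs using only the standard library. Several details are assumptions not stated in the text: omics values are taken to be scaled to [0, 1], BCE is averaged over features, the KL term uses the closed form for a diagonal Gaussian, and the function names are hypothetical.

```python
import math

def bce(x, x_rec, eps=1e-12):
    """Mean binary cross-entropy between a profile and its reconstruction."""
    total = 0.0
    for a, b in zip(x, x_rec):
        b = min(max(b, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return total / len(x)

def kl_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def embedding_loss(profiles, reconstructions, mu, sigma):
    """Equation (5): BCE averaged over the M omics types plus the KL term."""
    m_types = len(profiles)
    recon = sum(bce(x, xr) for x, xr in zip(profiles, reconstructions)) / m_types
    return recon + kl_standard_normal(mu, sigma)

# Toy example: M = 2 omics types, uninformative reconstructions, latent equal to the prior.
loss = embedding_loss(
    profiles=[[1.0, 0.0], [0.0, 1.0]],
    reconstructions=[[0.5, 0.5], [0.5, 0.5]],
    mu=[0.0, 0.0],
    sigma=[1.0, 1.0],
)
```

With reconstructions fixed at 0.5 the BCE term equals log 2 for each omics type, and the KL term vanishes because the latent distribution already matches the standard normal prior, so the total loss is log 2.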
The loss function of a classification task is shown in Equation (6):

$$L_{classification} = CE(y, y') \tag{6}$$

where y is the true label, $y'$ is the predicted label, and CE is the cross-entropy loss. The loss function of the regression task is defined analogously, as shown in Equation (7):

$$L_{regression} = MSE(y, y') \tag{7}$$

where MSE is the mean squared error between the actual and predicted values.

MOALS was trained in three phases that made use of the aforementioned loss functions. The first phase was unsupervised and focused exclusively on the deep embedding module: only the deep embedding loss was backpropagated, and only the parameters of the DEM were updated. In the second phase, the pre-trained embedding network was fixed while the downstream networks were trained: the total downstream loss was backpropagated, and only the downstream networks were updated.

2.3.3. Survival function strategy

The survival function is defined as shown in Equation (8):

$$S(t) = P[T > t] \tag{8}$$

where T denotes the time elapsed between sample acquisition and the occurrence of the event. The survival function gives the probability that death (the failure event) has not happened by time t. The associated hazard function is given by Equation (9):

$$h(t) = \lim_{dt \to 0} \frac{P[t \le T < t + dt \mid T \ge t]}{dt} \tag{9}$$

This is the instantaneous rate at which the failure event occurs. High hazard values denote a high risk of death at time t.
The hazard function is rarely used in its original form; instead, a risk score for each sample x is calculated as shown in Equation (10):

$$r(x) = \sum_{i=1}^{m} h(t_i, x) \tag{10}$$

Training the survival prediction downstream network requires not only the omics data x but also the event time T and the event indicator E. The indicator was set to 1 when a failure happened during the study and to 0 when it did not, the latter case being known as censoring. For censored subjects, the time T is the interval between sample collection and the subject's last contact.

2.3.4. Multi-task learning

For the downstream task of survival prediction, the MOALS architecture was adapted from the multi-task logistic regression (MTLR) model. First, the time axis was split into m time intervals $\{l_i\}_{i=1}^{m}$, where $l_i = [t_{i-1}, t_i)$, with $t_0 = 0$ and $t_m \ge \max(T)$. The hyperparameter m denotes the number of time intervals; a larger m gives finer time resolution at the expense of processing resources. A multi-layer fully connected network underpins our survival prediction system, and its output layer has as many dimensions as there are time intervals. Consequently, the survival prediction network outputs an m-dimensional vector $y' = [y'_1, y'_2, \ldots, y'_m]$. The survival label of each subject was likewise encoded as an m-dimensional vector $y = [y_1, y_2, \ldots, y_m]$, with $y_i$ denoting the subject's survival status at time point $t_i$.
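The interval construction and label encoding described above can be sketched as follows. The text does not state how the intervals are spaced or which value of $y_i$ marks the event, so this sketch assumes equal-width intervals and the usual MTLR convention that $y_i = 1$ once the event has occurred before $t_i$; both function names are hypothetical, and censored subjects are not handled here.

```python
def make_intervals(upper, m):
    """Boundaries t_0 = 0 < t_1 < ... < t_m = upper of m equal-width intervals.

    `upper` should be at least max(T); equal widths are an assumption.
    """
    width = upper / m
    return [i * width for i in range(m + 1)]

def encode_survival(event_time, boundaries):
    """Label vector y with y_i = 1 if the event occurred before t_i, else 0.

    Assumed encoding direction, valid for uncensored subjects only.
    """
    return [1 if event_time < t_i else 0 for t_i in boundaries[1:]]

boundaries = make_intervals(upper=10.0, m=5)   # hypothetical time scale
y = encode_survival(5.0, boundaries)           # event falls in the interval [4, 6)
```

Under these assumptions a subject whose event falls in the third interval is labeled alive at $t_1$ and $t_2$ and dead from $t_3$ onward, which is the monotone label pattern that the MTLR likelihood sums over.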
For a sample x, the probability of observing the label y under the network parameters $\theta$ is given by Equation (11):

$$P_\theta(y|x) = \frac{\exp\left(\sum_{i=1}^{m} y_i y'_i\right)}{\sum_{j=0}^{m} \exp\left(\sum_{i=j+1}^{m} y'_i\right)} \tag{11}$$

The goal of the survival network is to find the parameters $\theta$ that maximize the log-likelihood; consequently, the loss function for survival prediction is written as shown in Equation (12):

$$L_{survival} = -\sum_{i=1}^{m} y_i y'_i + \log \sum_{j=0}^{m} \exp\left(\sum_{i=j+1}^{m} y'_i\right) \tag{12}$$

This loss can be applied directly to the survival downstream network and is incorporated into the joint loss function of MOALS. As an alternative to training each downstream network separately, several downstream networks in MOALS were trained simultaneously using the joint loss function of the downstream tasks. This resulted in an integrated model capable of reconstructing a comprehensive phenotypic profile for each individual, as shown in Equation (13):

$$L_{down} = \frac{1}{K} \sum_{k=1}^{K} W_k L_{down_k} \tag{13}$$

where $L_{down_k}$ is the loss associated with downstream task k and $W_k$ is its weight, which can be set explicitly as a hyperparameter or treated as a trainable parameter during training. In the last step, after pre-training the embedding and downstream networks separately, the total loss function defined in Equation (13) was calculated and backpropagated. During this last training phase, the whole MOALS network, including the DEM and the downstream networks, was fine-tuned to maximize performance. The multi-task optimization approach gradient normalization (GradNorm) was adapted to the MOALS architecture to balance the optimization of the different tasks. The weight of each downstream loss is updated at every training iteration.
GradNorm penalizes the network when a task's gradients are either too large or too small, ensuring that all tasks learn at a consistent rate. First, the gradient norm of each downstream task is computed using Equation (14):

$$G_\theta^{(k)} = \left\lVert \nabla_\theta W_k L_{down_k} \right\rVert_2 \tag{14}$$

where $\theta$ denotes the parameters of the last encoding layer of the DEM of MOALS. The mean gradient norm over all tasks is then determined as shown in Equation (15):

$$\bar{G}_\theta = \frac{1}{K} \sum_{k=1}^{K} G_\theta^{(k)} \tag{15}$$

where $K$ is the number of downstream tasks. The relative inverse training rate of each task is defined in Equation (16):

$$r_k = \frac{\tilde{L}_{down_k}}{\frac{1}{K} \sum_{k=1}^{K} \tilde{L}_{down_k}} \tag{16}$$

where $\tilde{L}_{down_k} = L_{down_k} / L^{0}_{down_k}$ is the ratio of the current loss of downstream task $k$ to its initial loss. The GradNorm loss is then defined in Equation (17):

$$L_{grad} = \sum_{k=1}^{K} \left| G_\theta^{(k)} - \bar{G}_\theta \times r_k^{\alpha} \right|_1 \tag{17}$$

where $\alpha$ is a hyperparameter controlling the strength with which tasks are pulled toward a common training rate. During each training iteration, a separate backpropagation pass was run on $L_{grad}$, which was used only to update $W_k$.

2.4. Models' training and evaluating procedure

This section briefly describes the training and evaluation of MOALS, our proposed algorithm. MOALS is implemented with the PyTorch deep learning library for multi-omics ALS prediction. The data were separated into training and testing sets in a stratified manner to keep the proportion of each class, and five-fold cross-validation on the train-validate data was used to optimize the architecture and other hyperparameters of MOALS. Accuracy, precision, recall, and F1-score were selected as the four metrics for evaluating the performance of the proposed algorithm.
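The GradNorm quantities in Equations (15) to (17) can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: the function name and arguments are hypothetical, and the per-task gradient norms of Equation (14) are taken as given numbers, whereas in actual training they would come from automatic differentiation with respect to the last encoding layer.

```python
def gradnorm_loss(grad_norms, losses, initial_losses, alpha):
    """Evaluate L_grad (Eq. 17) from per-task gradient norms and losses."""
    k = len(grad_norms)
    if not (len(losses) == len(initial_losses) == k):
        raise ValueError("all inputs must have one entry per task")
    # Eq. (15): mean gradient norm over the K tasks.
    mean_norm = sum(grad_norms) / k
    # Loss ratio of each task: current loss divided by its initial loss.
    ratios = [cur / init for cur, init in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / k
    # Eq. (16): relative inverse training rate of each task.
    rates = [ratio / mean_ratio for ratio in ratios]
    # Eq. (17): L1 distance between each norm and its rate-scaled target.
    targets = [mean_norm * (rate ** alpha) for rate in rates]
    return sum(abs(g - t) for g, t in zip(grad_norms, targets))
```

When every task has the same gradient norm and has reduced its loss by the same factor, the targets coincide with the norms and the loss is zero, so no weight needs to change; any imbalance produces a positive value that would drive the update of the task weights.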
The network architecture is fully connected through all layers. Shallow machine learning models were also considered in this paper; the findings indicate that a deep network is essential for obtaining a high recall. The GradNorm algorithm is applied when optimizing the network parameters, with an initial learning rate of 0.02 and a decay of 2e-4. The optimization was run for over 200 epochs with a batch size of 32. The hyperparameters used to train this model are listed in Table 2.

Table 2. Hyper-parameters used in the model.

| Hyper-parameter | Value |
|---|---|
| Latent dimension | 128 |
| Learning rate | 1e-3 |
| Batch size | 32 |
| Epoch number (unsupervised) | 50 |
| Epoch number (supervised) | 100 |

The platform's network architecture is optimized both for the multi-omics approach and for each omic individually, with the variables optimized separately in each case. Next, MOALS is applied to the test data to determine the performance of our approach. For these samples, the candidate genes and the whole gene expression data are used individually. A GPU with 12 gigabytes of memory was used to develop and test the proposed algorithm.

2.5. A comprehensive comparison between different machine learning methods

MOALS was evaluated against other algorithms on the test data. Three algorithms (SVM, random forest, and a fully connected neural network) were used for this comparison, each preceded by dimension reduction with UMAP or PCA. Cross-validation was then used to optimize the hyperparameters, and finally the performance on the examined dataset is reported.
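The stratified five-fold split and the four classification metrics described in Section 2.4 can be sketched with the standard library alone. The helper names below are hypothetical and the code is an illustration of the general technique, not the pipeline used in the study; labels are assumed to be 1 for ALS and 0 for control.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Each fold serves once as the validation set while the remaining folds are used for training, and the per-fold metrics are then averaged.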
SVM, a well-known binary classification method, is used to find a non-linear boundary that maximizes the margin between the two classes. A radial basis function is used as the SVM kernel, with a coefficient of 0.001. Random forest is another widely used machine learning method; it generates multiple decision trees and combines their individual predictions to reach a final classification. Accuracy is closely tied to the number of decision trees, but increasing the number of trees also lengthens training. In the current study, 100 trees are used, with a maximum depth of 5 and at most 100 features. The neural network chosen for comparison with our proposed algorithm consists of a series of fully connected layers, in which every neuron in one layer is connected to every neuron in the next. Its most significant feature is that it is structure agnostic: no special assumptions about the input data are needed. In the current study, three hidden layers with the ‘relu’ activation function are used in addition to the input and output layers. UMAP, the dimension reduction method used in this investigation, is founded on three assumptions about the data: the data is uniformly distributed on a Riemannian manifold; the Riemannian metric is approximately locally constant; and the manifold is locally connected. Under these assumptions, the manifold can be modeled with a fuzzy topological structure.
UMAP then searches for a low-dimensional projection of the data whose fuzzy topological structure is as close as possible to that of the original data. The number of neighbors and the minimum distance were set to 10 and 0.05, respectively. PCA captures much of the variation present in the data with fewer parameters and provides information on the overall structure of the evaluated dataset. It does so by using linear combinations of the original variables to synthesize orthogonal axes. In the current manuscript, four components are used to generate the orthogonal axes.

3. Results and discussion

3.1. Identification of ALS pathway correlated genes

We performed unsupervised clustering on the RNA expression profiles to discover groups of genes with similar expression patterns in ALS samples relative to healthy controls. The main objective of this clustering was not solely to pinpoint genes directly associated with ALS but to explore a broader spectrum of biological pathways potentially contributing to ALS pathogenesis. This comprehensive approach aids in understanding the complex network of interactions and the potential overlapping pathways that could influence ALS and other neurodegenerative diseases. For each cluster of genes identified by the algorithm, we conducted a pathway enrichment analysis. This analysis allows us to identify not just the ALS pathway itself but also other significant pathways that might be mechanistically linked. These include pathways related to neurodegeneration, cellular stress responses, and protein homeostasis, which are vital for understanding the broader biological context of ALS. We tested different numbers of clusters (6-12) for the clustering of gene expression and found that, irrespective of the number of clusters, at least one cluster was consistently associated with ALS.
This not only demonstrates the robustness of our clustering approach but also supports the hypothesis that molecular signatures of ALS are strongly represented in the dataset. For downstream analysis, we extracted the genes that appeared in any cluster enriched for the ALS KEGG pathway, which identified 9847 genes. It is important to note that our clustering approach also highlighted other pathways with even more significant p-values, such as Endocytosis. This finding underscores the multi-faceted nature of neurodegenerative diseases, in which multiple biological processes are often interconnected. The identification of pathways like Endocytosis with smaller p-values than ALS itself suggests potential upstream or parallel processes that could influence or be influenced by ALS pathology.

The Endocytosis pathway plays a crucial role in cellular homeostasis by mediating the internalization and recycling of cell surface receptors, lipids, and other molecules. Dysregulation in endocytosis has been implicated in the pathogenesis of neurodegenerative diseases, including ALS [28]. In ALS, defective endocytosis could lead to impaired clearance of misfolded proteins and other cellular debris, contributing to neuronal damage. This pathway's significance in ALS is underscored by its strong association with other neurodegenerative processes, suggesting that alterations in endocytic trafficking may be a common mechanism in ALS and other neurodegenerative disorders.

The endoplasmic reticulum (ER) is critical for proper protein folding and processing. In ALS, mutations in proteins involved in the ER stress response, such as VAPB, have been shown to disrupt ER homeostasis, leading to the accumulation of misfolded proteins and triggering the unfolded protein response (UPR) [29]. Persistent UPR activation can lead to neuronal apoptosis, contributing to the progressive loss of motor neurons observed in ALS patients.
The identification of this pathway emphasizes the importance of protein homeostasis in ALS and highlights potential therapeutic targets aimed at modulating ER stress responses.

Ubiquitin-mediated proteolysis is essential for the degradation of damaged or misfolded proteins via the ubiquitin-proteasome system (UPS). In ALS, disruptions in the UPS have been reported, leading to the accumulation of ubiquitinated protein aggregates in motor neurons, a hallmark of the disease. This pathway's involvement in ALS is supported by the frequent observation of ubiquitin-positive inclusions in post-mortem ALS tissues. Understanding how ubiquitin-mediated proteolysis is compromised in ALS could provide insights into disease mechanisms and inform strategies to enhance protein clearance in affected neurons [30].

Autophagy is a cellular process involved in the degradation and recycling of damaged organelles and proteins. In ALS, autophagy dysfunction is believed to contribute to the accumulation of toxic proteins and organelles in motor neurons. Enhancing autophagy has been proposed as a potential therapeutic strategy for ALS, aimed at reducing the burden of protein aggregates and promoting neuronal survival. The identification of the autophagy pathway in our analysis reinforces its critical role in ALS and provides a rationale for exploring autophagy modulators as potential treatments [31].

The overlap between ALS and other neurodegenerative diseases, such as Huntington's, Parkinson's, and Alzheimer's, suggests shared pathogenic mechanisms. Common features include the accumulation of misfolded proteins, mitochondrial dysfunction, and oxidative stress. The identification of these pathways in ALS patients supports the hypothesis that ALS may share molecular pathways with other neurodegenerative conditions. This cross-disease perspective could lead to the development of broad-spectrum therapeutics targeting these shared mechanisms [32].
Moreover, the implications of these pathways extend beyond ALS, potentially illuminating common mechanisms underlying other neurodegenerative diseases such as Alzheimer's, Parkinson's, and Huntington's diseases. The disruption of cellular trafficking in the Endocytosis pathway, the imbalance in protein processing, and the failure of protein clearance mechanisms observed in ubiquitin mediated proteolysis are not only pivotal to ALS but are also critical components in the pathology of these other diseases [33]. This suggests a shared pathological framework, where targeting these pathways could lead to broad-spectrum therapeutic strategies for multiple neurodegenerative disorders. By identifying and understanding these interconnected pathways, our study contributes to a holistic view of neurodegeneration, offering insights that could inform cross-disease therapeutic approaches.

Table 3 presents the results of the Enrichr analysis, which show a substantial association between the submitted gene list and the ALS pathway, with a p-value of $1.27 \times 10^{-21}$ and an adjusted p-value of $2.02 \times 10^{-19}$. Values this small make it very unlikely that the association arose by chance, which reinforces the credibility of the inferred biological link. Additionally, the odds ratio of 2.94 indicates that the genes in our list are approximately three times more likely to belong to the ALS pathway than expected by chance, and the combined score of 141.56 further attests to the overall significance of this association.

Table 3. The results of ALS signaling pathways obtained by MOALS.
| Name | P-value | Adjusted p-value | Odds Ratio | Combined score |
|---|---|---|---|---|
| Endocytosis | 2.35E-24 | 7.48E-22 | 4.28 | 232.93 |
| Amyotrophic lateral sclerosis | 1.27E-21 | 2.02E-19 | 2.94 | 141.56 |
| Protein processing in endoplasmic reticulum | 9.79E-21 | 1.04E-18 | 5.35 | 246.54 |
| Ubiquitin mediated proteolysis | 1.38E-19 | 1.10E-17 | 6.38 | 277.09 |
| Huntington disease | 4.26E-17 | 2.72E-15 | 2.79 | 105.35 |
| Pathways of neurodegeneration | 6.55E-17 | 3.48E-15 | 2.23 | 83.14 |
| Parkinson disease | 1.63E-15 | 7.41E-14 | 2.96 | 100.94 |
| Autophagy | 3.14E-15 | 1.17E-13 | 4.71 | 157.32 |
| Spinocerebellar ataxia | 3.30E-15 | 1.17E-13 | 4.5 | 150.16 |
| Prion disease | 8.63E-12 | 2.48E-10 | 2.38 | 60.55 |
| Thermogenesis | 9.34E-12 | 2.48E-10 | 2.57 | 65.37 |
| Alzheimer disease | 8.40E-11 | 2.06E-09 | 2.01 | 46.59 |

Table 4 summarizes the results obtained from STRING-db: 96 of the 352 genes annotated to the ALS pathway are present in the analyzed network, with an association strength of 1.73. This table corroborates the Enrichr results, with a false discovery rate (FDR) of $5.52 \times 10^{-154}$ giving a very high level of confidence in the relevance of this association. The very small FDR supports the validity of the biological connection inferred between the analyzed gene network and ALS.

Table 4. The results of ALS signaling pathways for first hundred candidates' genes obtained by MOALS.
| Pathway | Description | Count in Network | Strength | False discovery rate |
|---|---|---|---|---|
| hsa03050 | Proteasome | 20 of 43 | 1.96 | 9.63E-30 |
| hsa05014 | Amyotrophic lateral sclerosis | 96 of 352 | 1.73 | 5.52E-154 |
| hsa04136 | Autophagy - other | 7 of 29 | 1.67 | 1.18E-08 |
| hsa05017 | Spinocerebellar ataxia | 31 of 135 | 1.65 | 6.29E-39 |
| hsa05012 | Parkinson disease | 45 of 240 | 1.56 | 1.02E-54 |
| hsa05020 | Prion disease | 48 of 265 | 1.55 | 4.51E-58 |
| hsa05016 | Huntington disease | 52 of 298 | 1.53 | 1.05E-62 |
| hsa05010 | Alzheimer disease | 56 of 355 | 1.49 | 6.77E-66 |

Taken together, Table 3 and Table 4 show that the Enrichr and STRING-db analyses converge on a similar conclusion, highlighting a significant association between the examined genes and the ALS pathway. Enrichr provides a comprehensive perspective on the statistical and biological significance of the gene list in the context of ALS, while STRING-db accentuates the interactive networks among the genes, reinforcing the biological relevance of the findings. The convergence of these analyses strengthens the inferred association between the analyzed genes and ALS and opens avenues for exploring the intricate molecular mechanisms underlying ALS.

Fig. 2 displays a UMAP-based scatter plot visualizing terms from the KEGG 2021 Human gene set library extracted from the fuzzy k-means clustering algorithm. Terms, represented by points, are plotted on the first two UMAP dimensions and clustered by the Leiden algorithm based on computed TF-IDF values, allowing similar gene sets to be grouped together. Larger, black-outlined points signify significantly enriched terms, particularly those relating to ALS, Parkinson's, Alzheimer's, and neurodegeneration pathways.

Figure 2. The scatter plot to demonstrate disease classification based on KEGG conducted by MOALS.
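As a small worked check on Table 3, the combined scores appear to be numerically consistent with the product of the odds ratio and the negative natural logarithm of the p-value. This relation is inferred here from the tabulated numbers and is an assumption, not something stated in the text; the sketch below simply recomputes it for four rows copied from Table 3.

```python
import math

# (pathway, p-value, odds ratio, reported combined score), copied from Table 3.
TABLE_3_ROWS = [
    ("Endocytosis", 2.35e-24, 4.28, 232.93),
    ("Amyotrophic lateral sclerosis", 1.27e-21, 2.94, 141.56),
    ("Protein processing in endoplasmic reticulum", 9.79e-21, 5.35, 246.54),
    ("Ubiquitin mediated proteolysis", 1.38e-19, 6.38, 277.09),
]


def combined_score(p_value, odds_ratio):
    """Assumed relation: combined score = -ln(p) * odds ratio."""
    return -math.log(p_value) * odds_ratio


def relative_gap(row):
    """Relative difference between the recomputed and the reported score."""
    _, p_value, odds_ratio, reported = row
    return abs(combined_score(p_value, odds_ratio) - reported) / reported


for row in TABLE_3_ROWS:
    print(f"{row[0]}: recomputed {combined_score(row[1], row[2]):.2f}, reported {row[3]}")
```

The recomputed values differ from the reported ones only by amounts that the two-decimal rounding of the odds ratio can explain, which is why a pathway with a modest p-value but a large odds ratio (such as Ubiquitin mediated proteolysis) can carry the highest combined score.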
Next, genes were ranked according to their degree of interconnection (STRINGdb) and their association with ALS. STRINGdb detects statistically significant associations between a list of input genes and known biological pathways; for this analysis, KEGG pathways with adjusted p-values less than 0.05 were selected. These pathways agree closely with the current literature on the molecular pathways of ALS. The input genes selected during the first phase of our analysis are more interconnected than would normally be expected for a random set of genes of the same size. This enrichment indicates that the genes are at least partially biologically connected as a group. Fig. 3 illustrates the interconnections among the ranked genes. Five clusters related to ALS and other neurodegenerative diseases were detected. The figure shows the first 100 ranked genes (as nodes), connected by 604 edges. Each pair of connected nodes is linked by one or several edges, which are classified by color. Co-expression edges, colored black, identify genes that tend to show a coordinated expression pattern across a group of genes.

Figure 3. The clustering illustration of the first hundred candidates' genes by MOALS.

3.2. Explaining multi-omics integration results

An in-depth analysis was conducted to distinguish between ALS and healthy cases using single and multi-omic data, focusing explicitly on either RNA expression data or SNV data. To evaluate the stability and performance consistency of the classification models, a comprehensive five-fold cross-validation analysis was performed.
The results, illustrated in Fig. 4, include boxplots for accuracy, precision, recall, and F1-score across the validation folds, stratified by the classification methods and omic data types: Gene Expression, SNV, and their combination. Additionally, 95% confidence intervals were calculated for each performance metric, providing a more accurate representation of variability across folds. The effect sizes (Cohen's d) were also computed to assess the practical significance of the observed differences between ALS and control groups, with values ranging between 0.87 and 1.16. By the conventional thresholds for Cohen's d (0.8 and above), this range indicates a large effect size, underscoring the practical relevance of the classification models in distinguishing between ALS and healthy cases.

Figure 4. 5-Fold cross-validation results for multiple classification methods on omics data.

The boxplots show notable fluctuations in model performance metrics, which may indicate variability in model robustness or adaptive responses to the integration of multi-omic data. These detailed distributions provide deeper insights into the predictive stability of each method beyond the mean performance metrics typically reported. Notably, the MOALS method consistently demonstrated superior performance across all metrics, particularly in the Gene Expression + SNV category, where it achieved the highest median scores and showed relatively tight interquartile ranges, indicating less variability and higher reliability in comparison to other methods. Relating these insights to the consolidated results presented in Table 5, it is evident that the tabulated data encapsulates the average performance metrics from the cross-validation exercise.
While the table effectively compares the methodological efficacies, the boxplots enrich this comparison by detailing the range and consistency of performance across multiple experimental runs, thus offering a holistic view of each method's efficacy and reliability. The standout performance of MOALS in the cross-validation plots underscores its robustness and its potential as a highly effective tool for integrating multi-omic data in the classification of ALS.

Table 5. Classification results based on six classification methods applied by single omic and multi-omics approaches.

| Method | Gene Expression: Accuracy | Gene Expression: Precision | Gene Expression: Recall | Gene Expression: F1-score | SNV: Accuracy | SNV: Precision | SNV: Recall | SNV: F1-score | Gene Expression + SNV: Accuracy | Gene Expression + SNV: Precision | Gene Expression + SNV: Recall | Gene Expression + SNV: F1-score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UMAP+SVM | 0.7564 | 0.7663 | 0.7438 | 0.7573 | 0.7367 | 0.7415 | 0.7273 | 0.7368 | 0.7846 | 0.7938 | 0.7743 | 0.7847 |
| PCA+SVM | 0.7406 | 0.7562 | 0.7305 | 0.7497 | 0.7252 | 0.7369 | 0.7161 | 0.7234 | 0.7796 | 0.7816 | 0.7653 | 0.7713 |
| UMAP+RF | 0.7761 | 0.7812 | 0.7610 | 0.7704 | 0.7589 | 0.7612 | 0.7414 | 0.7598 | 0.8010 | 0.8116 | 0.7920 | 0.8002 |
| PCA+RF | 0.7656 | 0.7729 | 0.7590 | 0.7645 | 0.7414 | 0.7599 | 0.7371 | 0.7486 | 0.7900 | 0.8019 | 0.7809 | 0.7971 |
| UMAP+FCNN | 0.7824 | 0.7979 | 0.7702 | 0.7864 | 0.7628 | 0.7749 | 0.7561 | 0.7618 | 0.8183 | 0.8230 | 0.8076 | 0.8176 |
| PCA+FCNN | 0.7564 | 0.7663 | 0.7438 | 0.7573 | 0.7367 | 0.7415 | 0.7273 | 0.7368 | 0.7846 | 0.7938 | 0.7743 | 0.7847 |
| VAE+FCNN | 0.8048 | 0.8176 | 0.7965 | 0.8035 | 0.7706 | 0.7826 | 0.7638 | 0.7762 | 0.8505 | 0.8476 | 0.8255 | 0.8371 |
| MOALS | 0.8485 | 0.8374 | 0.8566 | 0.8214 | - | - | - | - | 0.9206 | 0.9186 | 0.9148 | 0.9211 |

The comparative analysis delineates the performance of various dimensionality reduction techniques when coupled with machine learning classifiers, specifically in the context of Gene Expression data, SNV data, and their integration.
Performance is quantified via accuracy, precision, recall, and F1-score, offering a comprehensive evaluation of each method's efficacy. UMAP with SVM (UMAP+SVM) reaches a moderate accuracy of 0.7564 on Gene Expression data, with precision at 0.7663, recall at 0.7438, and an F1-score of 0.7573, which suggests a balanced classification capability. Applying the same combination to the integrated Gene Expression and SNV dataset improves the results, with the accuracy and F1-score rising to 0.7846 and 0.7847, respectively, suggesting that SVM classifiers benefit from a richer feature space that encapsulates a more diverse biological signal. The PCA+SVM combination, while similar in approach to UMAP+SVM, records a slightly reduced accuracy of 0.7406 and an F1-score of 0.7497 for Gene Expression data. Its metrics for the combined dataset are marginally lower than those for UMAP+SVM, with an accuracy of 0.7796 and an F1-score of 0.7713, indicating that UMAP extracts features better suited to SVM classifiers than PCA does.

Moving to ensemble methods, UMAP paired with Random Forest (UMAP+RF) shows an accuracy of 0.7761 and an F1-score of 0.7704 for Gene Expression data, an improvement over the SVM-based methods. Combining Gene Expression and SNV data under UMAP+RF further improves accuracy to 0.8010 and the F1-score to 0.8002, suggesting an efficient use of the combined data features. For neural network-based classifiers, UMAP+FCNN displays notable efficacy, particularly in the integrated dataset, where the accuracy reaches 0.8183 and the F1-score climbs to 0.8176. This combination outperforms all of the non-neural network classifiers, indicating FCNN's superior ability in modeling complex data interactions.
A substantial leap in performance is evident with VAE+FCNN, especially for the integrated dataset, where the accuracy rises to 0.8505 and the F1-score to 0.8371. This indicates the VAE's powerful feature extraction capability in conjunction with FCNN's classification strength. The standout performer, MOALS, which incorporates a clustering algorithm for feature selection followed by VAE+FCNN for classification, achieves the highest accuracy of 0.9206 and an F1-score of 0.9211 in the integrated dataset. These metrics are considerably higher than those of the other methods, underscoring the impact of feature selection through clustering on classifier performance. The robustness of the MOALS approach is evidenced by its top-tier metrics across all evaluated categories in the combined dataset. MOALS narrows the feature space down to the most discriminative gene sets, which facilitates a more refined and targeted classification process.

Fig. 5 illustrates the Receiver Operating Characteristic (ROC) curves for the various classification methods, with a focus on the integrated Gene Expression and SNV dataset. The Area Under the Curve (AUC) is a critical metric for evaluating the performance of classifiers, as it provides a single scalar value with which to compare models. Among the methods evaluated, MOALS achieved the highest AUC of 0.91, indicating superior discriminatory power in distinguishing between ALS and control cases. This performance further supports the MOALS approach as the most effective model in our study, outperforming the other deep learning and traditional machine learning models.
The VAE+FCNN model, while performing well, recorded a lower AUC of 0.84, followed by UMAP+FCNN at 0.82, reinforcing the advantages of multi-omics data integration combined with the MOALS methodology. The remaining models exhibited moderate performance, with AUC values ranging from 0.78 to 0.80, highlighting the impact of model selection and data integration strategies on predictive accuracy.

Figure 5. AUC-ROC curve for ALS classification using multi-omics (Gene Expression + SNV).

Fig. 6 presents a comparison of ROC curves between the MOALS model using single-omic data (Gene Expression) and the MOALS model using integrated multi-omic data (Gene Expression + SNV). The results indicate that the multi-omic approach outperforms the single-omic approach, as evidenced by the higher AUC value (0.91 for multi-omic vs. 0.83 for single-omic). This highlights the improvement in classification accuracy that can be achieved by incorporating diverse data types, thus providing a more comprehensive understanding of the underlying biology in ALS.

Figure 6. ROC curve comparison for ALS classification using MOALS with single-omic (Gene Expression) and multi-omic (Gene Expression + SNV) data.

In conclusion, the study underscores the crucial role of integrating multi-omic data with dimensionality reduction and machine learning techniques in analyzing complex biological data. The findings highlight how deep learning, when combined with advanced dimensionality reduction strategies and multi-omic integration, can effectively reveal the subtle biological dynamics underlying complex diseases like ALS, paving the way for breakthroughs in precision medicine.
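To make the AUC values quoted for Figs. 5 and 6 more concrete, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen ALS sample receives a higher score than a randomly chosen control. The function name and the toy inputs are hypothetical; this illustrates the metric itself, not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `labels` holds 1 for the positive class (ALS) and 0 for controls;
    ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Under this reading, an AUC of 0.91 means that in roughly 91% of ALS-control pairs the ALS sample is scored higher, while 0.5 corresponds to random ranking.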
To elucidate the superiority of MOALS, a comprehensive comparison with existing models such as PCA+FCNN, UMAP+FCNN, and other commonly utilized models in ALS research is provided. This comparison extends beyond performance metrics to include methodological differences, highlighting how MOALS's integrative approach to multi-omics data provides a more robust and accurate prediction model. Specifically, the integration of clustering algorithms and VAE+FCNN allows MOALS to effectively handle the heterogeneity and complexity of multi-omic data, leading to improved prediction accuracy and reliability. The use of advanced dimensionality reduction techniques combined with deep learning architectures differentiates MOALS from traditional models that may not fully exploit the potential of integrated multi-omic datasets. Furthermore, the implementation of cross-validation in MOALS is detailed, explaining how this was adapted to manage the complexities of multi-omic data, including the stratification of data across different omic types to ensure that the validation process is robust and reflects the true predictive power of the model. These methodological advancements help explain why MOALS performs better than other models: it combines genetic (SNV) information with transcriptomic (RNA expression) data, providing a more complete view of the disease pathology. In summary, MOALS improves substantially on existing models for the prediction and understanding of complex diseases like ALS. Our findings suggest that the MOALS model, with its superior diagnostic accuracy and robustness in integrating multi-omic data, could be effectively incorporated into existing clinical diagnostic pathways for ALS.
Specifically, we propose that the model be used as a complementary tool alongside traditional clinical tests, such as neuroimaging and electrophysiological studies, to enhance diagnostic precision. By incorporating MOALS into the diagnostic workflow, clinicians may achieve faster and more accurate identification of ALS, enabling earlier intervention and potentially improving patient outcomes. This integration could also lead to more efficient utilization of healthcare resources by reducing the need for multiple confirmatory tests, thereby streamlining the diagnostic process. Furthermore, the predictive power of MOALS extends beyond diagnosis; it also demonstrates a significant capability in related predictive tasks, such as estimating the onset age of ALS symptoms and survival prediction.

Fig. 7 provides a comprehensive display of regression lines and diagnostic accuracy for various models. In the regression plots, the MOALS model is distinguished by achieving the highest $R^2$ value of 0.78, indicating a strong linear correlation and capturing about 78% of the variance in age data. This high $R^2$ value highlights MOALS' superior capability in integrating complex multi-omic data, which significantly surpasses other models like VAE+FCNN with an $R^2$ of 0.71, and models like UMAP+RFR and PCA+RFR, which show moderate fits with $R^2$ values of 0.49 and 0.57, respectively. The confidence intervals around the regression lines in MOALS' plot are notably narrower, pointing to its higher precision in age predictions. This contrast is evident when compared to PCA+FCNN and UMAP+FCNN, where broader intervals suggest greater prediction variability.

Figure 7. Comparison of different models for age prediction and diagnostic accuracy.

In addition to regression analysis, the box plots in Fig. 7 highlight the diagnostic accuracy between ALS and Control groups.
Here, statistical analysis, particularly t-tests, shows significant differences in the means of ALS and Control groups, underscoring the diagnostic reliability of the models. Despite the unbalanced sample sizes between the ALS and control groups, the diagnostic accuracy distributions depicted in Fig. 7 demonstrate a remarkable consistency across both groups. The box plots reveal that the median accuracy levels for ALS and Control are closely aligned within the context of each model, particularly for the MOALS model. This suggests that despite the variability in group sizes, the model's performance in distinguishing between ALS and Control remains robust, ensuring reliable diagnostic outcomes. This finding is particularly significant as it underscores the model's capability to generalize well across different group sizes without a loss in predictive accuracy. It highlights the efficacy of the model in handling imbalanced datasets, which is a common challenge in medical diagnostics. Such resilience in performance, evidenced by the narrow interquartile ranges and similar medians in the box plots, supports the utility of the model in clinical settings where the proportion of cases to controls may not always be balanced. The statistical robustness of MOALS, corroborated by the t-tests indicating significant differences between the ALS and Control groups, further enhances the credibility of the model. This statistical rigor, combined with the consistent accuracy across diverse group sizes, provides compelling evidence of the model's suitability for real-world applications, promising enhanced diagnostic precision in the clinical diagnosis of ALS.

Table 6 corroborates these findings by quantifying the error indices (RMSE, MAE, and MedAE) across the models. MOALS' remarkable reduction in all error metrics, especially in a multi-omics environment, is clearly delineated, reaffirming its efficacy in the predictive analytics realm.
This examination highlights the role of diverse omic data in improving the accuracy and reliability of biomedical predictions, and positions MOALS as a promising tool among predictive methodologies.

Table 6. The first symptom age prediction results based on six prediction methods applied by single omic and multi-omics approaches. Columns are grouped by input data: Gene Expression (GE), SNV, and Gene Expression + SNV (GE+SNV).

| Method | GE MedAE | GE MAE | GE RMSE | GE $R^2$ | SNV MedAE | SNV MAE | SNV RMSE | SNV $R^2$ | GE+SNV MedAE | GE+SNV MAE | GE+SNV RMSE | GE+SNV $R^2$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UMAP+SVR | 8.839 | 11.492 | 14.830 | 0.41 | 10.482 | 11.592 | 14.482 | 0.41 | 9.953 | 10.884 | 13.861 | 0.43 |
| PCA+SVR | 8.919 | 10.592 | 14.005 | 0.40 | 9.963 | 10.836 | 13.936 | 0.42 | 9.829 | 10.563 | 13.685 | 0.43 |
| UMAP+RFR | 8.899 | 9.971 | 12.991 | 0.45 | 8.928 | 9.949 | 13.002 | 0.48 | 8.936 | 9.925 | 12.893 | 0.49 |
| PCA+RFR | 8.850 | 9.283 | 12.629 | 0.54 | 8.994 | 9.385 | 12.693 | 0.52 | 8.317 | 9.623 | 12.317 | 0.57 |
| UMAP+FCNN | 8.965 | 10.037 | 13.199 | 0.51 | 8.842 | 9.513 | 12.839 | 0.50 | 8.629 | 9.666 | 12.582 | 0.52 |
| PCA+FCNN | 8.482 | 9.472 | 11.973 | 0.55 | 8.328 | 8.873 | 11.284 | 0.61 | 8.094 | 9.434 | 11.119 | 0.56 |
| VAE+FCNN | 8.194 | 8.548 | 10.913 | 0.64 | 7.893 | 8.361 | 9.952 | 0.68 | 6.625 | 8.242 | 10.619 | 0.71 |
| MOALS | 7.683 | 8.192 | 9.837 | 0.69 | | | | | 6.015 | 7.158 | 9.609 | 0.78 |

A close examination of the table shows that MOALS, the proposed multi-omics approach, outperforms the other models, with the lowest error on every listed metric (MedAE, MAE, and RMSE) and the highest $R^2$, 0.78, in the Gene Expression + SNV approach. Relative to the MOALS errors, the best-performing multi-omics approach built on alternative algorithms (VAE+FCNN) has MedAE, MAE, and RMSE values that are higher by 10.14 percent, 15.14 percent, and 10.51 percent, respectively.
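The percentage gaps quoted for Table 6 can be checked by hand. The short Python sketch below recomputes them from the Gene Expression + SNV columns for VAE+FCNN and MOALS. The helper name `relative_gap` and the two denominator conventions are illustrative assumptions, not part of the original analysis.

```python
# Errors copied from Table 6, Gene Expression + SNV columns.
VAE_FCNN = {"MedAE": 6.625, "MAE": 8.242, "RMSE": 10.619}
MOALS = {"MedAE": 6.015, "MAE": 7.158, "RMSE": 9.609}


def relative_gap(baseline: float, proposed: float, relative_to: str) -> float:
    """Return the error gap (baseline - proposed) as a percentage.

    Hypothetical helper for illustration only.
    relative_to="proposed" divides the gap by the proposed model's error;
    relative_to="baseline" divides it by the baseline model's error.
    """
    if relative_to == "proposed":
        denominator = proposed
    elif relative_to == "baseline":
        denominator = baseline
    else:
        raise ValueError("relative_to must be 'proposed' or 'baseline'")
    return 100.0 * (baseline - proposed) / denominator


# Gap expressed relative to the MOALS error (matches the quoted percentages).
gap_vs_moals = {
    m: round(relative_gap(VAE_FCNN[m], MOALS[m], "proposed"), 2) for m in MOALS
}
# Gap expressed relative to the VAE+FCNN error (a conventional "reduction").
gap_vs_baseline = {
    m: round(relative_gap(VAE_FCNN[m], MOALS[m], "baseline"), 2) for m in MOALS
}

print(gap_vs_moals)
print(gap_vs_baseline)
```

Dividing the gap by the MOALS error reproduces the quoted 10.14, 15.14, and 10.51 percent. Dividing by the VAE+FCNN error instead gives reductions of 9.21, 13.15, and 9.51 percent, so the convention should be kept in mind when comparing these numbers with other studies.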
In detail, MOALS attains the lowest MedAE (6.015), MAE (7.158), and RMSE (9.609), underscoring its predictive precision and reliability in estimating the age at first symptom in ALS. Its advantage is also evident relative to the single omic approaches based on Gene Expression or SNV data alone. The VAE model performed well, especially in the single omic approaches, but MOALS achieved higher accuracy and lower error in the integrated Gene Expression + SNV setting. This reinforces the value of integrated multi-omic methods over single omic analyses for predictive precision and reliability. Likewise, the PCA+FCNN and UMAP+FCNN models, despite reasonable performance in both single and multi-omic approaches, did not match the accuracy or the error margins achieved by MOALS.

In conclusion, the comparison shows that the developed multi-omics model, MOALS, predicts the onset of ALS symptoms with greater precision and lower error than the alternatives, supporting its potential as a tool in biomedical research. The integration of diverse omic data contributes to this reliability and precision and provides a benchmark for comparable predictive methods.
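For reference, the four measures reported in Table 6 can be computed from paired true and predicted onset ages as in the sketch below. It uses only the Python standard library. The function name `regression_metrics` and the toy ages are hypothetical, and $R^2$ is taken here to be the coefficient of determination, which is an assumption because the text does not state the exact definition used.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Compute MedAE, MAE, RMSE, and R^2 for paired observations.

    Hypothetical helper for illustration; R^2 is the coefficient of
    determination, 1 - SS_res / SS_tot (an assumed definition).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_errors)

    medae = statistics.median(abs_errors)             # median absolute error
    mae = sum(abs_errors) / n                         # mean absolute error
    rmse = math.sqrt(sum(e * e for e in abs_errors) / n)  # root mean squared error

    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    r2 = 1.0 - ss_res / ss_tot

    return {"MedAE": medae, "MAE": mae, "RMSE": rmse, "R2": r2}


# Toy onset ages in years (made-up values, not data from the study).
true_ages = [50, 60, 70, 40]
predicted_ages = [52, 57, 70, 45]
print(regression_metrics(true_ages, predicted_ages))
```

MedAE is less sensitive to a few large misses than MAE, and RMSE penalizes large misses most heavily, which is why reporting all three alongside $R^2$ gives a fuller picture than any single number.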
Table 7 presents a comparative evaluation of survival prediction using various methods applied in both single omic and multi-omics approaches. The models are compared on two metrics commonly used for survival prediction tasks: the Concordance index (C-index) and the Integrated Brier Score (IBS). A C-index of 1 indicates perfect ranking of survival times, while a value of 0.5 indicates a model performing no better than random. The IBS, which summarizes the accuracy of the predicted survival function across time points, ranges between 0 and 1, with lower scores indicating higher accuracy.

Table 7. The survival prediction results in different methods applied by single omic and multi-omics approaches. Columns are grouped by input data: Gene Expression (GE), SNV, and Gene Expression + SNV (GE+SNV).

| Method | GE C-Index | GE IBS | SNV C-Index | SNV IBS | GE+SNV C-Index | GE+SNV IBS |
|---|---|---|---|---|---|---|
| UMAP+CoxPH | 0.617 | 0.209 | 0.627 | 0.208 | 0.693 | 0.201 |
| PCA+CoxPH | 0.592 | 0.224 | 0.616 | 0.220 | 0.671 | 0.216 |
| UMAP+random survival forest | 0.614 | 0.220 | 0.629 | 0.219 | 0.655 | 0.217 |
| PCA+random survival forest | 0.659 | 0.211 | 0.673 | 0.199 | 0.685 | 0.204 |
| VAE(Selected Gene) | 0.697 | 0.200 | 0.719 | 0.195 | 0.745 | 0.188 |
| MOALS | 0.777 | 0.180 | | | 0.837 | 0.121 |

MOALS stands out with a C-index of 0.837, the highest across all examined models and approaches, together with the lowest IBS, 0.121, reinforcing the predictive accuracy and reliability of the proposed multi-omics model.
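As a concrete illustration of the C-index described above, the following sketch implements a simple pairwise version (in the style of Harrell's C) in plain Python. The function name and toy data are hypothetical, pairs with tied survival times are skipped for simplicity, and this is not necessarily the implementation used to produce Table 7. The IBS is not sketched because it additionally requires censoring-weighted Brier scores integrated over time.

```python
def concordance_index(times, events, risk_scores):
    """Pairwise concordance index for right-censored survival data.

    times       : observed follow-up times
    events      : 1 if the event was observed, 0 if censored
    risk_scores : higher score means higher predicted risk (shorter survival)

    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's observed time. Ties in risk count as 0.5.
    Hypothetical helper for illustration only.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Toy data (made-up values): months of follow-up and event indicators.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]

print(concordance_index(toy_times, toy_events, [0.9, 0.7, 0.4, 0.1]))
print(concordance_index(toy_times, toy_events, [0.1, 0.4, 0.7, 0.9]))
print(concordance_index(toy_times, toy_events, [0.5, 0.5, 0.5, 0.5]))
```

On the toy data, a perfectly ordered risk score gives 1.0, the reversed ordering gives 0.0, and a constant score gives 0.5, which matches the interpretation of the reference values 1 and 0.5 given in the text.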
The MOALS values in Table 7 show its precision in predicting survival and its reliability relative to the other models, especially in the integrated Gene Expression + SNV approach. VAE(Selected Gene) also performed well, with a C-index of 0.719 and an IBS of 0.195 in the SNV approach, making it a noteworthy contender among survival prediction models. Even so, it does not match MOALS in the multi-omics context. Within the single omic approaches, the models show varying degrees of accuracy and reliability, with MOALS achieving the highest C-index (0.777) and the lowest IBS (0.180) in Gene Expression. This indicates that MOALS retains its predictive advantage even in single omic analyses.

Taken together, these results support the precision of the proposed multi-omics model, MOALS, in survival prediction. Its higher C-index and lower IBS point to more reliable and precise survival predictions and to its potential as an instrument in biomedical research and analytics.

4.
Limitations of the study and proposed framework for future research

The insights from this study highlight the significant potential of the MOALS model in unraveling the complex molecular landscapes and gene interrelationships associated with Amyotrophic Lateral Sclerosis (ALS). Through the integration of multi-omics data and advanced machine learning techniques like Variational Autoencoders (VAEs), this research has made considerable progress in identifying key pathways that may contribute to ALS pathogenesis. However, several limitations should be addressed, and future research directions considered, to enhance the model's robustness, generalizability, and clinical applicability.

A primary challenge is the dependency on large-scale, high-quality multi-omic datasets, which are not always accessible. This limitation could lead to biased findings, as genetic and environmental factors influencing ALS can vary widely across populations. Future research should prioritize the inclusion of geographically and ethnically diverse datasets to improve the model's generalizability and relevance across different clinical settings. The current study's datasets may not fully capture global genetic diversity, limiting the assessment of the MOALS model's applicability across varied populations. To enhance generalizability, validating and refining the model using more diverse datasets is crucial.

The computational complexity of the MOALS model, especially with VAEs, poses another significant limitation. The high demands for computational resources and expertise may challenge broader applications, particularly in clinical settings. Future research should focus on optimizing computational efficiency and enhancing the model's interpretability, potentially by incorporating explainable AI techniques.
While the study successfully integrates gene expression and genomic variant data, the MOALS model's full potential could be realized by adding further omics layers, such as proteomics and metabolomics. These layers would offer a more comprehensive understanding of ALS, although integrating such diverse data types presents challenges. Addressing these could lead to even more robust models and the identification of novel biomarkers.

Finally, the translational impact of the MOALS model depends on rigorous validation in clinical settings. This includes independent validation, clinical trials, and assessing the model's ability to predict clinical outcomes. Ethical and data privacy considerations must also be addressed, particularly concerning the use of sensitive genomic data. Ensuring data privacy and navigating ethical concerns are essential for the equitable application of models like MOALS in clinical practice.

5. Conclusion

This study developed a novel multi-omics approach to understanding the genetic underpinnings of Amyotrophic Lateral Sclerosis (ALS) using machine learning techniques. By integrating gene expression profiles and rare pathogenic genomic variants, the study assembled a combined set of 17,546 genes: 9,847 associated with ALS pathways and 7,699 containing rare, presumed pathogenic variants. The Multi-Omics for ALS (MOALS) model, utilizing unsupervised clustering and a Variational Autoencoder (VAE), revealed intricate genotype-phenotype correlations within the dataset. The MOALS model outperformed traditional single-omic models, improving diagnostic accuracy by 1.7% and 6.2% compared to SNV and RNA expression-based models, respectively. These findings highlight the advantage of a multi-omic approach in capturing the complex biological interactions underlying ALS, offering a more nuanced understanding of the disease's molecular architecture. Given its performance, the MOALS model holds potential for enhancing diagnostic precision, informing prognosis, and guiding personalized therapeutic strategies.
This study underscores the importance of integrating multi-omic data in Amyotrophic Lateral Sclerosis research and contributes to uncovering the molecular mechanisms driving ALS and other complex disorders.

CRediT authorship contribution statement

Hima Nikafshan Rad: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Formal analysis, Conceptualization. Zheng Su: Conceptualization. Anne Trinh: Writing – review & editing. M.A. Hakim Newton: Writing – review & editing. Jannah Shamsani: Formal analysis. NYGC ALS Consortium: Data curation. Abdul Karim: Investigation. Abdul Sattar: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements