Abstract Schizophrenia (SCZ) is a complex psychiatric disorder presenting challenges for characterization. The current study aimed to identify and evaluate disease‐responsive essential genes (DREGs) to enhance the molecular characterization of SCZ. RNA‐sequencing data from PsychENCODE (536 SCZ patients, 832 controls) and peripheral blood transcriptome data from 144 recruited subjects (59 SCZ patients, 6 non‐SCZ psychiatric patients, 79 controls) are analyzed. Shared differential expression genes are obtained using three algorithms. Support vector machine (SVM)‐based recursive feature elimination is employed to identify DREGs. The biological relevance of these DREGs is examined through protein–protein interaction network, pathway enrichment, polygenic scoring, and brain tissue expression. Key DREGs are validated in SCZ animal models. A DREGs‐based machine‐learning model for SCZ characterization is developed and its performance is assessed using multiple datasets. The analysis identified 184 DREGs forming an interconnected network involved in synaptic plasticity, inflammation, neuronal development, and neurotransmission. DREGs exhibited distinct expression in SCZ‐related brain regions and animal models. Their genetic contributions are comparable to genome‐wide polygenic risk scores. The DREG‐based SVM model demonstrated high performance (AUC 85% for SCZ characterization, 79% for specificity). These findings provide new insights into the molecular mechanisms underlying SCZ and emphasize the potential of DREGs in improving SCZ characterization. Keywords: characterization, machine learning, molecular signatures, schizophrenia, transcriptome __________________________________________________________________ This study leverages machine learning‐based transcriptomic analysis integrated with genomic annotation and experimental validation to identify 184 disease‐responsive essential genes (DREGs) in schizophrenia. These DREGs form an interconnected network involved in synaptic plasticity, neuroinflammation, and neurodevelopment. The developed DREG‐based model demonstrates high performance in disease characterization, offering new insights into molecular mechanisms and potential therapeutic targets. graphic file with name ADVS-12-2407628-g001.jpg 1. Introduction Schizophrenia (SCZ) is a complex psychiatric disorder with a significant societal burden, affecting roughly 0.3% of the population and characterized by a combination of psychotic symptoms, cognitive deficits, and functional impairments.^[ [62]^1 ^] Understanding the underlying pathogenic mechanisms of SCZ is crucial for improving diagnosis, treatment, and patient outcomes. Considerable progress has been made in identifying genetic risk factors through genome‐wide association studies (GWAS).^[ [63]^2 , [64]^3 ^] These studies have provided valuable insights into the genetic architecture of SCZ, suggesting that the disorder is influenced by the combined effects of numerous common genetic variants. However, the translation of GWAS findings into clinically useful risk prediction models has been challenging.^[ [65]^4 ^] Genetic risk factors alone often have limited predictive power, as the complex pathogenesis of SCZ likely involves the interplay of various molecular mechanisms beyond genetic variations.^[ [66]^5 , [67]^6 ^] Transcriptomic analysis has emerged as a complementary approach to elucidate the molecular underpinnings of SCZ.^[ [68]^7 , [69]^8 ^] By examining disease‐driven gene expression patterns, researchers can uncover key genes and pathways involved in the pathogenesis of SCZ, which may also contain important genetic variations underlying disease susceptibility and development.^[ [70]^9 , [71]^10 ^] Recent biomedical research has opened new avenues for identifying disease‐associated features, particularly through the use of artificial intelligence techniques like machine learning (ML).^[ [72]^11 ^] While previous studies have employed ML on peripheral blood or prefrontal cortex (PFC) transcriptomic data to distinguish SCZ cases from healthy controls,^[ [73]^12 , [74]^13 ^] the absence of external validation and functional analysis on the identified genes has undermined reproducibility and limited their utility as stable disease‐responsive features. Furthermore, these studies are typically confined to either blood or PFC data, lacking an integrated approach that encompasses both peripheral and central transcriptomic profiles. This gap highlights the need for integrating PFC and peripheral blood transcriptomics via ML to uncover more stable disease‐responsive features and reliable peripheral biomarkers.^[ [75]^14 ^] Building on GWAS insights, our study employs a comprehensive approach, integrating transcriptomic analysis with genomic data and experimental validation, to identify disease‐responsive essential genes (DREGs) that enhance SCZ characterization. By applying advanced ML methods to a large cohort of postmortem brain and peripheral blood RNA‐sequencing data, we aim to capture core SCZ‐driven transcriptional patterns, elucidate underlying biological mechanisms, and evaluate these DREGs as potential disease markers. To better illustrate our analytical framework and workflow, we have provided a detailed schematic in Figure [76]1 . Unlike previous studies focused on distinguishing SCZ cases from controls, we target disease‐driven molecular signatures involved in SCZ pathogenesis. This comprehensive approach aims to provide a deeper understanding of SCZ's complex molecular mechanisms and further develop improved characterization models with clinical applications. Our approach complements previous GWAS efforts and offers a fresh perspective on the disorder's genetic and genomic basis. Figure 1. Figure 1 [77]Open in a new tab The workflow for SCZ DREGs identification, analysis, and characterization. (SCZ DREGs identification) Using PsychENCODE data (three PFC RNA‐Seq datasets) and new peripheral blood RNA‐Seq data, 70% were used for differential expression gene (DEG) analysis. DEGs in each dataset were identified by intersecting DESeq2, EdgeR, and Limma results, then subjected to support vector machine (SVM)‐based feature elimination to identify DREGs. (Biological significance analysis of SCZ DREGs) Protein–protein interaction analysis used a self‐constructed latest human interactome. GO and KEGG analyses revealed SCZ‐related pathway enrichment. DREGs expression were validated in human brain tissues and SCZ models. PRS analysis assessed DREGs' genetic contribution comparable to genome‐wide PRS. (Evaluation of DREGs' SCZ characterization) Eight top machine learning models performed tenfold cross‐validation on 70% of PFC and blood RNA‐Seq data to obtain the best characterization model. The best model was used to validate DREGs' SCZ characterization in three independent datasets: internal test set, external test set (Dataset 2), and SCZ/non‐SCZ patient set (Dataset 3). Results were evaluated using AUC values of ROC curves. SCZ: schizophrenia; PFC: prefrontal cortex; DREGs: disease‐responsive essential genes; PPI: protein–protein interaction; PRS: polygenic risk score; ROC: receiver operating characteristic; AUC: area under the curve; LR: logistic regression; DT: decision tree; RF: random forest; ET: extra tree; GBDT: gradient boosting decision tree; XGBoost: eXtreme gradient boosting; SVM: support vector machine; MLP: multilayer perceptron. 2. Results 2.1. Characterization of 184 SCZ DREGs Table [78]S1 (Supporting Information) and Figure [79]2A–L present the detailed results of differentially expressed genes by DESeq2, EdgeR, and Limma analysis. Pathway enrichment analysis results are shown in Table [80]S2 (Supporting Information). Integrating pathways enriched with differentially expressed genes from the four training datasets identified 70 significant pathways and 600 corresponding genes (Figure [81]2M,N; Table [82]S3, Supporting Information). Recursive feature elimination using the support vector machine (SVM) model (Figure [83]2O) selected 184 DREGs (Table [84]S4, Supporting Information). These DREGs were further used for constructing a characterization model for SCZ. Figure 2. Figure 2 [85]Open in a new tab The results of characterization of 184 SCZ DREGs. (A–L) Differentially expressed genes in four datasets: Analysis of the differentially expressed genes obtained from four datasets (CommonMind Consortium [CMC], Human Brain Collection Core [HBCC], and Lieber Institute for Brain Development [LIBD], Dataset 1 from peripheral blood) using limma (A–D); edgeR (E–H) and DESeq2 (I–L). Given of the potential transcriptional heterogeneity of different tissues, the direction of expression changes was not strictly limited, and thus the absolute value of log2FC was considered. (M) 70 shared enriched pathways for differentially expressed genes: Presenting the 70 shared enriched pathways associated with differentially expressed genes in the four training sets. Each circle color represents pathways enriched in different training sets, with a significance threshold of P‐value <0.05. (N) 600 common genes included in the shared enriched pathways: Highlighting the 600 common genes found in the shared enriched pathways of differentially expressed genes across the four training sets. The circles denote genes included in the shared enriched pathways in different training sets, indicated by varying colors. (O) Result curve of recursive feature elimination: Demonstrating the result curve of recursive feature elimination based on the 600 shared genes. The x‐axis represents the number of discarded genes, while the y‐axis represents the average under the curve accuracy value of the SVM model after tenfold cross‐validation. The shaded area depicts the 95% confidence interval. Each point represents the result of a specific experiment, and the gray dotted line indicates the number of genes finally discarded when reaching the optimal AUC value. 2.2. DREGs Exhibit Significant Biological and Clinical Relevance 2.2.1. A Significantly Interconnected Protein–Protein Interaction (PPI) Network Encoded by the DREGs We constructed a comprehensive human interactome dataset with 24 178 genes and 2 544 177 interactions (Table [86]S5, Supporting Information). Among the 184 DREGs, we identified 155 with direct interactions, forming a densely interconnected PPI network of 155 genes with 900 interactions (Figure [87]3A). Permutation tests compared this network to 1000 randomly generated PPI networks, showing that the DREGs PPI network had significantly more protein interactions (P < 1×10^−16) (Figure [88]3B). Network parameters (node degree and betweenness centrality) indicated that DREGs had higher values than background (BG) genes (node degree P < 2×10^−16, betweenness centrality P = 4.8×10^−16) (Figure [89]3C,D). These findings demonstrate enriched protein interactions and central roles of DREGs in the network. Figure 3. Figure 3 [90]Open in a new tab The characteristics of the significantly interconnected PPI Network encoded by the DREGs. (A) A densely interconnected PPI network encoded by DREGs: The visualization of a densely interconnected PPI network, where nodes represent genes and edges represent interaction relationships. Pink nodes indicate hub genes, while cyan nodes denote genes included in two functional modules. (B) Permutation test results of 1000 random PPI networks: During the permutation test, 1000 random PPI networks were generated by randomly selecting 155 genes from the human interactome to maintain the same number of nodes as the DREGs PPI network. The size of the direct connectivity component (number of edges in the network) was compared between the DREGs PPI network and the random PPI networks. The largest random PPI network had a direct connectivity component size of 248, which is significantly smaller than the observed 900 in the DREGs PPI network (grey dashed line). (C) Comparative box plots of node degree: A comparative box plot of the node degree between the DREGs PPI network and a background gene (BGG) PPI network. The Wilcoxon test was performed, revealing a highly significant difference between the two networks (P < 2 × 10^−16). (D) Comparative box plots of betweenness centrality: A comparative box plot indicating the betweenness centrality between the DREGs PPI network and a BGG PPI network. The Wilcoxon test was conducted, showing a statistically significant difference between the two networks (P = 4.8×10^−16). Data presented as violin plots with embedded box plots showing the distribution of log2‐transformed values. The box plots display the median (central line), first and third quartiles (box boundaries), and whiskers extending to the most extreme data points that are not considered outliers. The violin plots show the kernel density estimation of the underlying distribution for both DREGs and BGGs (background genes) groups. 2.2.2. Identification of 19 key DREGs in the PPI Network We analyzed the DREGs PPI network to investigate its characteristics. We defined hub genes as DREGs with at least 20 direct interactions with other DREGs, identifying 8 hub genes: ESR1, GRB2, STAT3, BRD4, CDK9, TRIM28, MYH9, and DOT1L (Figure [91]3A). Using ClusterONE, we identified two significant functional modules: module 1 (P = 0.019) and module 2 (P = 0.03) (Figure [92]4C,D). Module 1 consists of 8 genes: RFGAP1, CYTH2, ADORA2A, IFFO1, PACSIN2, ENTPD1, BICD1, and KDELR3. Module 2 comprises 3 genes: PLXND1, PLXNA2, and SEMA7A. Figure 4. Figure 4 [93]Open in a new tab The results of dominant pathway enrichment in the DREG PPI network. (A) Pathway enrichment of top 32 genes in DREG set: Analysis of pathway enrichment for the top 32 genes in the DREG set. The presence of genes across different pathways (DREGs, Hub genes, Module 1, and Module 2) was examined with a repetition threshold of 30% of the maximum number of repetitions (135 times for BCL2, approximately 41 times for TICAM1). The left subfigure displays the significance of enriched GO (Molecular Function [MF], Biological Process[BP], Cell Component [CC]) and KEGG entries, focusing on the top 20 most significant entries. The right subfigure indicates the number of DREGs genes associated with each entry. Significant enrichments included immune response regulation, synaptic plasticity, neuronal development and projection, glutamatergic synapse, MAPK signaling pathway, JAK‐STAT signaling pathway, and neurotrophin signaling pathway. (B) Hub pathway enrichment of gene sets: Genes with direct interactions greater than 20 were extracted from the DREGs PPI network to identify hub genes. GO and KEGG enrichment analysis was conducted on the hub genes. The left subfigure presents the significance of GO (BP, CC, MF) and KEGG enriched entries, with the top 20 most significant entries. The right subfigure shows the gene count from the 184 DREGs present in each entry. (C) Function module 1: The specific genes enriched in this module are presented. (D) Function module 2: The specific genes enriched in this module are presented. (E) Module 1 pathway enrichment: Pathway enrichment analysis of the functional enrichment module 1. The left subfigure displays significant GO (BP, CC, MF) and KEGG entries, with the top 20 most significant entries. The right subfigure shows the gene count from the 184 DREGs present in each entry. (F) Module 2 pathway enrichment: Pathway enrichment analysis of the functional enrichment module 2. The left subfigure displays significant GO (BP, CC, MF) and KEGG entries, with the top 20 most significant entries. The right subfigure shows the gene count from the 184 DREGs present in each entry. 2.2.3. Dominant Enrichment of Pathways Associated with Synaptic Plasticity, Immune Inflammation, Neuronal Development, Neurotransmitters, and Astrocytes in the DREGs PPI Network To examine the convergence of SCZ DREGs in the DREGs PPI network toward specific pathways, we performed GO and KEGG pathway enrichment analysis. The analysis identified significant pathways including synaptic plasticity, neuronal development/projection, synaptic transmission, inflammation regulation, calcium homeostasis, neurotransmitter regulation, vesicle transport/secretion, GPCR signaling, miRNA regulation, and MAPK/neurotrophin/toll‐like receptor/TNF/JAK‐STAT signaling (Table [94]S6, Supporting Information). Further analysis of key DREGs, including 8 hub genes and 2 densely connected modules, revealed enrichments in pathways related to epigenetic gene regulation, immune response, inflammation, neurotransmitter secretion, synaptic transmission, astrocyte activation, synaptic plasticity, neuronal development, and Notch/IL6/toll‐like receptor/JAK‐STAT/chemokine signaling (Tables [95]S7–[96]S9, Supporting Information). We further analyzed gene repetitions across different pathways within each gene set (Tables [97]S10–[98]S13, Supporting Information) to quantify pathway enrichment and assess the significance of each pathway. In the DREGs set, immune regulation, synaptic plasticity, neuronal development, glutamate synapse, and MAPK/JAK‐STAT/neurotrophin signaling were significantly enriched (Figure [99]4A). The hub gene set showed remarkable enrichments in chromatin remodeling, transcriptional regulation, miRNA regulation, JAK‐STAT signaling, and chemokine signaling (Figure [100]4B). Module 1 was associated with glutamate‐based neurotransmitter secretion and synaptic transmission (Figure [101]4E), while module 2 was related to synaptic plasticity, neuronal development, and projection (Figure [102]4F). Notably, the most repeated genes in the hub and module gene sets were among the top 32 genes in the DREGs set (Tables [103]S10–S13, Supporting Information). Additionally, SYT11, like ADORA2A, is another noteworthy gene linked to SCZ in our unpublished study, playing a role in mediating SCZ‐like behaviors through dopamine overtransmission. 2.2.4. Expression Changes in DREGs Across Various Human Brain Tissues We analyzed RNA‐seq data from various sources to study the expression patterns of DREGs in different brain contexts. In human brain tissues (GTEx V8 database), DREGs showed higher expression levels compared to BG genes (DREGs: P = 0.085, hub genes: P < 1×10^−8, module 1: P = 0.020, module 2: P = 0.001), with hub genes displaying the highest overall expression (Figure [104]5A). Hub genes and genes in module 2 showed consistent expression trends across different brain tissues, while genes in module 1 had lower expression in certain brain regions (Figure [105]5A). During brain development (BrainSpan database), DREGs, hub genes, and module 2 genes showed significantly higher expression levels across all developmental stages compared to BG genes (DREGs: P = 2×10^−5, hub genes: P < 1×10^−8, module 2: P < 1×10^−8) (Figure [106]5B). Hub genes exhibited a peak in expression after birth, while hub genes and genes in module 2 displayed prominent expression fluctuations throughout development. In diverse brain regions (Human Brain Transcriptome [HBT] database), DREGs, hub genes, and module 1–2 genes had significantly higher expression levels across different brain regions (all P < 1×10^−16) (Figure [107]5C). Hub genes consistently showed the highest expression, and module 2 genes demonstrated expression variations specific to different brain regions. In the SCZ‐associated middle temporal gyrus (MTG) and anterior cingulate gyrus (CgGr) (Allen database), DREGs, hub genes, module 1–2 genes showed significantly higher expression levels compared to BG genes (Figure [108]5D,E). Fluctuating expression patterns of key DREGs were observed in glutamate‐type neurons, with consistent trends between hub genes and DREGs. Notably, genes in modules 1 and 2 exhibited distinct expression patterns between MTG and CgGr, indicating diverse roles in different neuron types, particularly glutamatergic neurons. Figure 5. Figure 5 [109]Open in a new tab The results of expression patterns in DREGs across various human brain tissues. (A) Expression patterns of DREGs, key DREGs, and BG genes in 13 types of human brain tissues in the GTEx database. The expression trends of hub genes and genes in module 2 were consistent across different brain tissues, showing high expression in the brain spinal cord (vervical c‐1). On the other hand, genes in module 1 exhibited low expression in this specific brain tissue. (B) Spatiotemporal expression patterns of DREGs, key DREGs, and BG genes in 13 brain development stages in the Brainspan database. These stages ranged from embryonic to young adulthood. BG genes and module 1 genes showed no significant changes in expression throughout development (P = 0.1146328, one‐way repeated measures ANOVA). However, hub genes and module 2 genes displayed noticeable expression fluctuations, with hub genes peaking after birth and maintaining relatively high expression levels during development. (C) Expression patterns of DREGs, key DREGs, and BG genes in 16 brain regions in the HBT database. The 16 brain areas include primary auditory (A1) cortex (A1C), amygdala (AMY), cerebellar cortex (CBC), dorsolateral prefrontal cortex (DFC), hippocampus (HIP), posterior inferior parietal cortex (IPC), inferior temporal cortex (ITC), primary motor (M1) cortex (M1C), mediodorsal nucleus of the thalamus (MD), medial prefrontal cortex (MFC), orbital prefrontal cortex (OFC), primary somatosensory (S1) cortex (S1C), superior temporal cortex (STC), striatum (STR), primary visual (V1) cortex (V1C), and ventrolateral prefrontal cortex (VFC). Hub genes displayed the highest overall expression across all regions, while genes in module 2 exhibited significant expression variations among brain regions. (D) Expression patterns of DREGs, key DREGs, and BG genes in cell types of the middle temporal gyrus in the Allen database. (E) Expression patterns of DREGs, key DREGs, and BG genes in cell types of the anterior cingulate gyrus in the Allen database. In both brain regions, the key DREGs in glutamate‐type neurons (IT, L4 IT, L5 ET, L5/6 IT Car3, L6 CT, L6b) exhibited obvious expression fluctuations, along with consistent expression trends in hub genes and DREGs. Furthermore, the expression patterns of genes in modules 1–2 varied between the two brain regions. In the middle temporal gyrus, genes in modules 1–2 were mainly expressed in the cell type L6 CT, while in the anterior cingulate gyrus, genes in modules 1–2 were highly expressed not only in the cell type L6 CT but also in L5 ET. All values are presented as mean expression levels across genes within each gene set (Hub genes, Module1, Module2, DREGs, and BG genes) for each condition (tissues, developmental stages, brain regions, or cell types). Expression values were normalized and processed according to their respective databases: TPM values for GTEx, RPKM values for BrainSpan, and normalized expression values for HBT and Allen Brain Atlas data. 2.2.5. Significant Changes in Expression Patterns of 9 Novel Key DREGs in Animal Models We identified 19 key DREGs, including 8 hub genes, 8 genes from module 1, and 3 genes from module 2. Among these, 10 genes (ADORA2A, ENTPD1, PLXNA2, SEMA7A, ESR1, GRB2, STAT3, BRD4, TRIM28, MYH9) have previously been associated with SCZ (see [110]Supporting Information). To validate the expression patterns of these key DREGs, we focused on 9 novel genes (BICD1, IFFO1, ARFGAP1, KDELR3, CYTH2, PACSIN2, PLXND1, CDK9, DOT1L) using an SCZ animal model induced by MK‐801 (Figure [111]S1, Supporting Information). We examined the mRNA levels of these 9 key DREGs in the peripheral blood and the PFC of the modeled mice (Figure [112]6 ). In the animal models, 8 out of the 9 novel genes showed statistically significant expression changes in the brain samples, with the KDELR3 gene exhibiting a trend toward significance (Figure [113]6A–I). Although four DREGs (KDELR3, PACSIN2, CDK9, PLXND1) did not exhibit statistically significant differences in the peripheral blood of the SCZ animal model, their expression trends were consistent with those observed in human brain data (Figure [114]6D,E,G,H). Despite small sample sizes, these findings confirm DREGs as reliable SCZ responsive indicators, with potential to elucidate SCZ mechanisms. Figure 6. Figure 6 [115]Open in a new tab Differential expression profiles of 9 key DREGs in the peripheral blood and prefrontal cortex of SCZ animal models and human RNA‐seq datasets. mRNA expression changes of (A) BICD1; (B) IFFO1; (C) ARFGAP1; (D) KDELR3;(E) CYTH2; (F) PACSIN2; (G) PLXND1; (H) CDK9; (I) DOTIL in the peripheral blood and prefrontal cortex of human (left) and mice (right). The human peripheral blood data is derived from the combined Datasets 1 and 2, while the human prefrontal cortex data is obtained from the merged datasets of the CMC, LIBD, and HBCC. Statistical comparisons were performed using Student's t‐test. Data are presented as means ± SEM (n = 8 per group). Significant differences between SCZ and control groups are indicated in the figure (P < 0.05). 2.3. Strong Polygenic Risk for SCZ Associated with DREGs Polygenic Risk Scoring (PRS) When applied to the UK Biobank (UKB)‐SCZ dataset, both the genome‐wide PRS and the 184‐DREGs PRS derived from Psychiatric Genomic Consortium version 3 (PGC3)‐SCZ showed significant associations with SCZ status (permutation P < 0.00001; Table [116]S14 and Figure [117]S2, Supporting Information). In the smaller PsychENCODE‐SCZ dataset, the genome‐wide PRS remained significant (permutation P < 0.00001), although the significance of the 184‐DREGs PRS was slightly reduced (permutation P < 0.005). However, the consistent effect direction with odds ratios (ORs) > 1.1 was noteworthy (Table [118]S14, Supporting Information). These findings suggest that the polygenic risk contributed by the 184‐DREGs SNPs for SCZ is significantly higher than random chance and is even comparable to those optimal PRS using genes across whole genome. 2.4. DREGs Exhibit Characteristic Capabilities and Specificity for SCZ To optimize characterization models for SCZ, we combined four training sets and evaluated eight models using tenfold cross‐validation, with optimized parameters detailed in Table [119]S15 (Supporting Information). The SVM model performed best, surpassing other models with an average accuracy of 89.21% (Figure [120]7A). We selected the optimized SVM model, called the DREGs‐based SVM (DRES) model, as the ideal characterization model. Testing the DRES model on internal datasets showed accuracy rates of 69%, 76%, 82%, and 83%, with corresponding area under the curve (AUC) values of 73%, 81%, 88%, and 85% (Figure [121]7B). External evaluation on testset Dataset 2 demonstrated a characteristic accuracy of 83% and an AUC value of 85% (Figure [122]7B). The DRES model effectively differentiated SCZ from non‐SCZ conditions, achieving an AUC of 79% and an accuracy of 83% (Figure [123]7C). Our findings suggest the potential of the DRES model in identifying individuals with SCZ across different disease categories. Figure 7. Figure 7 [124]Open in a new tab Assessment of characteristic performance and specificity of DREGs for SCZ. (A) Performance evaluation results of optimized machine learning models: Results of 10‐fold cross‐validation on 8 optimized machine learning models using the combined training dataset. The x‐axis represents the names of the 8 optimized machine learning models, while the y‐axis represents the average accuracy achieved through 10‐fold cross‐validation. All values are presented as mean ± SEM of accuracy values obtained from 10‐fold cross‐validation. (B) Receiver operating characteristic (ROC) curves of DREG‐based SVM model for different testsets: ROC curves illustrating the performance of the DREG‐based SVM model on various test sets.(C) ROC curves for differentiating SCZ and non‐SCZ patients: The ROC curve demonstrating the ability to distinguish between patients with SCZ and patients with non‐SCZ conditions. Only 1 out of 6 SCZ patients was misclassified, and only methamphetamine‐induced psychosis was not differentiated among the 6 non‐SCZ diseases. The ROC curves, AUC (area under the curve (AUC), and ACC (accuracy) values were calculated using the scikit‐learn package in Python. Specifically, the roc_curve and auc_score functions were used for ROC curve generation and AUC calculation, while accuracy_score was used for ACC calculation. 3. Discussion This study leverages cross‐tissue transcriptomic data and various omics annotation/integration approaches to provide both biological and clinical insight on SCZ manifestation. In clinical settings, there is currently a lack of effective approaches for diagnosing and characterizing SCZ. This study utilizes ML‐based approaches with RNA‐seq datasets to characterize SCZ. Our novel methods address the limitations of current characterization methods (low discriminative ability), further improving characterization for individuals with SCZ (AUC > 0.8). Analyzing the PPI network, performing pathway enrichment, utilizing human brain datasets, and conducting laboratory experiments collectively demonstrate the crucial role of DREGs in SCZ etiology. This approach has potential for extending to other neuropsychiatric disorders, facilitating precision psychiatry. While many studies have utilized transcriptomics data and ML to identify characteristic expression patterns^[ [125]^15 ^] or biomarkers for SCZ,^[ [126]^16 ^] there remains room for further refinement. Most of these studies did not conduct functional analyses or experimental validations of the identified genes,^[ [127]^12 , [128]^13 , [129]^15 , [130]^16 ^] which may limit their effectiveness in providing stable disease characterizations. In our study, we first combined traditional bioinformatics methods and recursive feature elimination algorithms with multiple RNA‐seq datasets. This approach identified 184 SCZ DREGs, improving our ability to extract relevant disease‐responsive features. Then through PPI network analysis, we found strong evidence that DREGs form a highly interconnected network involved in SCZ pathogenesis. Within this network, we identified 19 key DREGs, including 11 genes in densely connected modules and 8 hub genes. Enrichment analysis revealed shared gene ontology terms and pathways, including neuronal development, immune response regulation, synaptic plasticity, and epigenetic gene expression regulation, known to be involved in SCZ.^[ [131]^17 ^] Both the genome‐wide PRS and the 184‐DREGs PRS were significantly associated with SCZ status, validating the biological relevance of DREGs in SCZ characterization. These findings support the use of DREGs as a reliable gene set for characterizing SCZ. Among the 19 key DREGs, 10 have previously been linked to SCZ. SYT11, along with ADORA2A, is notable as both genes play crucial roles in the glutamatergic and dopaminergic systems,^[ [132]^18 ^] which are implicated in SCZ pathogenesis.^[ [133]^17 ^] ADORA2A in astrocytes regulates glial glutamate transporter 1 activity, potentially leading to disturbances in glutamine levels and SCZ induction.^[ [134]^19 ^] Besides, our unpublished study suggests that abnormal SYT11 expression contributes to SCZ‐related behaviors through dopamine overtransmission. Significant associations between ADORA2A, SYT11 polymorphisms, and SCZ susceptibility have been identified, indicating their central role in triggering SCZ.^[ [135]^20 ^] Additionally, nine newly identified genes (BICD1, IFFO1, ARFGAP1, KDELR3, CYTH2, PACSIN2, PLXND1, CDK9, DOT1L) are implicated in SCZ, some of which are involved in other central nervous system disorders.^[ [136]^21 ^] Furthermore, all nine of these novel genes are involved in synaptic function (PACSIN2,^[ [137]^22 ^] PLXND1,^[ [138]^23 ^] DOT1L,^[ [139]^24 ^]) neurodevelopment (BICD1,^[ [140]^25 ^] ARFGAP1,^[ [141]^26 ^] CYTH2,^[ [142]^27 ^]) and immune response (IFFO1,^[ [143]^28 ^] KDELR3,^[ [144]^29 ^] CDK9,^[ [145]^30 ^]), aligning closely with our enrichment analysis. These findings resonate with the conclusions of a recent Science article, which used single‐nucleus RNA sequencing technology to identify significant transcriptional changes in synaptic and neurodevelopmental pathways across various cell types in the PFC of SCZ patients.^[ [146]^31 ^] This convergence of evidence further emphasizes the critical roles of these key DREGs in synaptic plasticity and neuronal development in the pathophysiology of SCZ. Additionally, RT‐qPCR validation revealed consistent similar expression change directions for these nine key DREGs between human and SCZ animal model brain samples, with statistically significant changes observed in both. Peripheral blood samples showed consistent trends across humans and animal models, though not all reached statistical significance. These findings highlight the potential relevance of these genes in SCZ pathophysiology, demonstrating parallels in expression patterns between human and animal models in both central and peripheral tissues. Further functional research is needed to understand how these nine genes regulate molecular mechanisms in SCZ. Transcriptomic data poses challenge due to its wide nature and the potential for overfitting in data analysis models.^[ [147]^32 ^] To address this, we employed dimensionality reduction and utilized a practical ML model to minimize overfitting.^[ [148]^33 ^] Our unique analysis methods identified SCZ DREGs from diverse RNA‐seq datasets (brain and blood) and provided insights into underlying biological processes and pathways. We also developed an accurate disease characterization model using DREGs and ML, indicating strong performance in classifying SCZ patients and distinguishing them from other psychiatric disorders. While previous studies using omics data have improved risk stratification for SCZ and other psychiatric disorders,^[ [149]^34 ^] our pipeline offers a more targeted focus by characterizing SCZ‐responsive genes and identifying core pathogenic mechanisms with minimal transcriptome data. Unlike the broad approach of Wang et al.,^[ [150]^34 ^] which created a comprehensive functional genomic resource, our study focuses on disease‐driven expression patterns specific to SCZ. By leveraging ML, functional annotation, network analysis, and animal validation, we provide deeper insights into the roles of DREGs in synaptic function, immune regulation, and neurodevelopment. Notably, the disease‐responsive essential genes may contain SCZ risk or development variations, but examining how these SNPs regulate DREGs' expression changes was beyond the scope of our study. Our study addresses some limitations in previous research. Merikangas et al.^[ [151]^35 ^] faced methodological inconsistencies and covariate variations, leading to few consistently replicated genes. Unlike their literature‐based approach, we integrate multiple RNA‐seq datasets with ML, ensuring the stability of identified genes through experimental validation. By aligning with LIBD principles,^[ [152]^36 ^] we correct for batch effects and include critical covariates such as age and sex, addressing confounding factors and improving robustness. While there may be ancestry‐dependent differential expression genes for brain disorders, most are less constrained and sensitive to evolutionary changes.^[ [153]^37 ^] However, by conducting differential expression analyses independently for each dataset and employing ancestry‐matched data in our PRS analysis, we enhance the validity and reliability of our findings. This approach ensures that ancestry‐related biases are unlikely to significantly impact our results. Despite fewer peripheral blood samples, our findings remain robust through repeat validation. Additionally, recent findings by Ruzicka et al.^[ [154]^31 ^] support the importance of neurodevelopment and synapse‐related pathways in SCZ, validating our mechanisms. We also evaluated the genetic effects of DREGs via PRS analysis, offering new insights into SCZ's genetic basis. While Wang et al.^[ [155]^34 ^] provided a broad foundational understanding, our study offers targeted mechanistic insights and potential clinical applications, emphasizing the unique advantages of our approach. Our work demonstrates the importance of LIBD principles^[ [156]^36 ^] like sample size, covariates, and expression complexity, reflecting the effectiveness of these principles in advancing SCZ research. While our study demonstrates robustness in identifying SCZ characterization using RNA‐seq data and ML techniques, caution should be exercised in extrapolating the findings to broader populations. First, gene expression changes are regulated by a multitude of factors and cannot be solely attributed to disease response. Second, the relatively small sample size and the predominance of Han Chinese participants may limit the generalizability of our results to other ethnic groups or clinical settings. Additionally, the reliance on peripheral blood samples for RNA‐seq analysis may not fully capture the disease‐specific transcriptional changes occurring in the brain, and the absence of regulatory RNA data, such as miRNA, limits our understanding of the transcriptional regulatory networks associated with disease features. The lack of unified processing standards across different laboratories adds complexity and challenges to our study. Despite these limitations, our model offers significant value in accurately characterizing SCZ. The findings have important implications for clinical practice, potentially aiding in earlier and more precise diagnosis. Give of our preliminary results, future studies, with larger sample sizes, diverse populations, and additional types of data, would be required to further validate and expand our findings. 4. Conclusion In summary, our study presents a comprehensive approach to enhance SCZ characterization by integrating ML‐based transcriptomic analysis with genomic data annotation and experimental validation. We identified 184 DREGs significantly associated with SCZ, conducted pathway enrichment and PPI network analyses, and validated key DREGs in SCZ animal models. Additionally, we assessed the genetic contribution of DREGs using PRS and developed high‐performance machine‐learning models for SCZ characterization. Our findings contribute to improved disease characterization, elucidate SCZ molecular mechanisms, and suggest new potential therapeutic targets. Future research will focus on functional validation, longitudinal studies, and expanding to broader cohorts to enhance robustness and generalizability. 5. Experimental Section Sample Collection Participants were recruited from multiple sites. Dataset 1 included episodic SCZ patients and healthy controls from Yingtan Mental Health Hospital. Dataset 2 comprised episodic SCZ patients from Shandong Mental Health Center and healthy controls from Qilu Hospital of Shandong University. Dataset 3 consisted of both episodic SCZ and non‐SCZ psychiatric patients from Yingtan Mental Health Hospital. Participants provided written informed consent and 2.5 mL of whole blood was collected for RNA sequencing. The study followed ethical principles outlined in the 2002 Declaration of Helsinki, with approval from the Medical Ethics Committee of Xi'an Jiaotong University Health Science Center. The study employed three independent datasets (Dataset 1 with 43 SCZ and 59 controls; Dataset 2 with 10 SCZ and 20 controls; Dataset 3 with 6 SCZ and 6 non‐SCZ psychiatric patients), with Supplementary Methods (Supporting Information) providing detailed information and inclusion/exclusion criteria. All patient samples had more than one‐month medication‐free history, and Dataset 1 consisted of medication‐naïve first‐episode patients. This study was conducted in accordance with the ethical principles outlined in the 2002 Declaration of Helsinki. The protocol was approved by the Medical Ethics Committee of Xi'an Jiaotong University Health Science Center (approval number: NO. 2017030). Written informed consent was obtained from all participants prior to their enrollment in the study. RNA Sequencing and Data Pre‐Processing Total RNA extraction from peripheral blood samples was performed using the PAXgene Blood RNA Kit (BD Biosciences, USA) following the manufacturer's instructions for datasets 1, 2, and 3. Total RNA quality was assessed using agarose gel electrophoresis and quantified with a NanoDrop spectrophotometer (NanoDrop, USA). For mRNA library construction, total RNA underwent ribosomal RNA depletion using the Epicenter Ribo‐Zero kit. TruSeq RNA Sample Preparation kit processed 3 µg RNA/sample following Illumina's protocol. RT‐PCR employed Phusion high‐fidelity DNA polymerase, indexed (X) primers, and universal PCR primers. AMPure XP system purified the products, while library quality was evaluated on the Agilent Bioanalyzer 2100 system. The Illumina NovaSeq 6000 platform sequenced the mRNA libraries. Initial quality control was performed using FastQC^[ [157]^38 ^] to assess sequencing data quality, including base quality distribution, GC content, and sequence duplication levels. Fastp^[ [158]^39 ^] software was used to filter out low‐quality reads and adapter sequences. The remaining reads were aligned to the human reference genome hg19 using HISAT2.2.4.^[ [159]^40 ^] All count data were finally generated for subsequent analysis. Existing Data and Combined Data Preparation This study also utilized three PFC RNA‐seq datasets (CommonMind Consortium [CMC], Human Brain Collection Core [HBCC], and Lieber Institute for Brain Development [LIBD]) accessed from PsychENCODE, comprising SCZ patients and healthy controls (Table [160]S16, Supporting Information). To ensure reliable results, these datasets and Dataset 1 (43 SCZ and 59 controls) were split into training and internal test sets (7:3 ratio, no overlap). The training sets were used for DREG extraction, model training, and hyperparameter optimization, while the internal test sets assessed model performance. Dataset 2 served as an external test set to evaluate the analysis pipeline and findings' robustness. PRS was performed using three large‐scale genetic datasets (PGC, UKB, and PsychENCODE project including CMC, HBCC, and LIBD, detailed in Table [161]S16, Supporting Information) to assess SCZ status holistically. PGC version 3 data for SCZ GWAS was publicly available. UKB raw data was accessed under approved application No. 86920. PsychENCODE data was accessed via Synapse portal with granted approval to Dr. Guan's team. This approach aimed to evaluate the collective impact of identified SCZ DREGs and their genomic contribution to this mental disorder. Supplementary Methods ([162]Supporting Information) provide brief descriptions of each dataset. Extraction of Characteristic DREGs The four RNA‐seq datasets (CMC, HBCC, LIBD, Dataset 1) were divided into training and test sets (7:3 ratio). The SVA (Surrogate Variable Analysis) package was used to correct for batch effects, ensuring that the variability introduced by different batches did not confound our results. Differential expression analysis was performed on the training datasets using the limma^[ [163]^41 ^], Deseq2^[ [164]^42 ^], and edgeR^[ [165]^43 ^] packages, with significance determined as P‐value < 0.05 and |logFC| > 0. To mitigate the impact of age and gender on gene expression, these variables were included as covariates in the differential expression analysis using DESeq2, edgeR, and limma. Specifically, the design matrix incorporated age and gender along with the primary condition of phenotype (SCZ vs. control). The final set of differentially expressed genes for each training dataset was determined by taking the intersection of the results from these three software packages. ENSEMBL IDs were converted to ENTREZ IDs, and pathway enrichment analysis was conducted using the clusterProfiler package.^[ [166]^44 ^] Significant pathways (P‐value < 0.05) and their shared differentially expressed genes were integrated to identify responsive characteristics for SCZ. After correcting and standardizing the count matrices, recursive feature elimination (RFE)^[ [167]^45 ^] with a SVM model was employed to identify characteristic DREGs. Further details can be found in the Supplementary Methods ([168]Supporting Information). Analysis of the Biological Basis of DREGs The newest human interactome database was self‐constructed by integrating PPI data from multiple sources (String, Biogrid, Bioplex, CCSB, HINT, HPRD, IntAct, and Mint) and analyzed the PPI networks formed by DREGs. Various network parameters were computed, including node degree and betweenness centrality, to understand the characteristics of DREGs in the network. Permutation tests were performed to compare the network connectivity with 1000 randomly generated networks, while Wilcoxon tests were used to compare network parameters (node degree and betweenness centrality) between DREGs and background genes. Hub genes and densely connected modules were identified in the PPI network, defining all hub genes and those within these modules as key DREGs. Pathway enrichment analysis explored the biological functions of DREGs, hub genes, and modules. Additionally, gene expression profiles in various tissues, developmental stages, brain regions, and specific cell types were analyzed using RNA sequencing data from multiple databases (GTEx, BrainSpan, HBT, and Allen). These analyses provided insights into the functional roles and expression patterns of DREGs relevant to SCZ. Further details can be found in the Supplementary Methods ([169]Supporting Information). Detecting Expression Patterns of Key DREGs in SCZ Animal Models All animal experiments were performed using male C57BL/6J mice (n = 8 per group) obtained from Beijing Vital River Laboratory Animal Technology Co., Ltd. (Beijing, China). The experimental procedures were conducted in accordance with institutional guidelines and approved by the Institutional Animal Care and Use Committee of Xi'an Jiaotong University (approval number: No. 2022680). SCZ mice models were established by NMDA receptor antagonist MK‐801 to further validate the expression changes of SCZ DREGs. Then the prefrontal cortex and blood samples were collected for RNA extraction. The following qRT‐PCR were carried out in the Bio‐Rad CFX96 detection instrument (Bio‐Rad, USA). Primer sequences are provided in Table [170]S17 (Supporting Information). For more details, please refer to the Supplementary Methods ([171]Supporting Information). Assessing the Genetic Effect of DREGs Using PRS To assess the polygenic risk of SCZ, PRSice2 was employed with GWAS summary statistics (PGC3) and raw data (PsychENCODE‐SCZ and UKB‐SCZ) with matched ancestry. The aim was to characterize SCZ risk in PsychENCODE‐SCZ or UKB‐SCZ (target data), using PGC3‐SCZ as the training dataset.^[ [172]^46 ^] No sample overlap occurred between the training and target datasets. Common SNPs (minor allele frequency > 0.05) in both datasets were analyzed for compatibility. Exclusion criteria included imputation scores < 0.5 in training data and palindromic SNPs with ambiguous alleles. Two PRS models were created: one with genome‐wide SNPs and another mapping to DREGs (± 10kb boundary). PRS were standardized and associated using logistic regression, adjusting for gender and principal components (3 for PsychENCODE‐SCZ, 4 for UKB‐SCZ). PRSice2 generated optimal PRS scores across different P thresholds (5×10^−8, 10^−7, 10^−6, 10^−5, 10^4, .001, .01, .05, 0.1, 0.2, 0.3, 0.4, 0.5, and 1),^[ [173]^47 ^] with permutation (100 000 times) correcting for multiple testing and overfitting.^[ [174]^48 ^] SCZ Characterization via ML Models Four training sets of RNA‐seq data were combined to create a unified dataset and employed eight high‐performance ML models. These models were used to assess the reliability and stability of DREGs in characterizing SCZ, including logistic regression (LR), decision tree (DT), random forest (RF), extra tree (ET), gradient boosting decision tree (GBDT), eXtreme gradient boosting (XGBoost), SVM, and multilayer perceptron (MLP). The model with the best generalization performance was selected after hyperparameter tuning and validated it using four internal test sets and one external test set (Dataset 2). The model's discriminative performance between SCZ and non‐SCZ disorders was also evaluated using Dataset 3. The model performance was evaluated using the scikit‐learn package in Python to calculate the AUC and accuracy values. Supplementary Methods ([175]Supporting Information) provide further details. Statistical Analysis—Pre‐Processing RNA‐seq data underwent quality control using FastQC v0.11.9, with low‐quality reads and adapter sequences filtered by Fastp v0.20.0. Batch effects were corrected using SVA package. Data normalization included variance stabilizing transformation in DESeq2. Age and gender were included as covariates in differential expression analyses, with data aligned to human reference genome hg19 using HISAT2.2.4. Statistical Analysis—Data Presentation Laboratory experimental data and animal experiments are presented as mean ± SEM. RNA‐seq expression values are presented according to their respective databases (TPM values for GTEx, RPKM values for BrainSpan, and normalized expression values for HBT and Allen Brain Atlas). Network analysis results are presented as violin plots with embedded box plots showing log2‐transformed values, with PPI network comprising 155 DREGs forming 900 direct interactions. Machine learning model performance is presented as mean ± SEM from tenfold cross‐validation. Statistical Analysis—Sample size The study analyzed RNA‐sequencing data from two sources: 1) PsychENCODE public database (536 SCZ patients, 832 controls), and 2) the newly generated peripheral blood RNA‐seq data from three independent cohorts (Dataset 1: 43 SCZ and 59 controls; Dataset 2: 10 SCZ and 20 controls; Dataset 3: 6 SCZ and 6 non‐SCZ psychiatric patients). Animal experiments used 8 mice per group. Statistical Analysis—Statistical Methods Differential expression analysis employed three algorithms (limma, DESeq2, and edgeR) with significance defined as P‐value < 0.05 and |logFC| > 0. PPI network connectivity was evaluated using two‐sided permutation tests (1000 times), while network parameters were compared using two‐sided Wilcoxon tests. One‐way repeated measures ANOVA assessed expression changes across developmental stages. Student's t‐test was used for animal experimental data. Pathway enrichment used hypergeometric tests with P‐value < 0.05 threshold. PRS analysis used permutation testing (100 000 times) for multiple testing correction. Statistical Analysis—Software RNA‐seq analyses were performed using R version 4.2.0 with packages including limma, DESeq2, edgeR, and clusterProfiler. Network analyses used R packages igraph and ggplot2. Machine learning analyses were conducted using Python's scikit‐learn package, with roc_curve and auc_score functions for receiver operating characteristic (ROC) curves and AUC calculations, and accuracy_score for ACC calculations. PRS analyses used PRSice2. Animal experimental data were analyzed using SPSS version 23. Conflict of Interest No. The authors declare no conflict of interest. Supporting information Supporting Information [176]ADVS-12-2407628-s001.docx^ (316KB, docx) Acknowledgements