Abstract RNA methylation modification influences various processes in the human body and has gained increasing attention from scholars. Predicting genes associated with RNA methylation pathways can significantly aid biologists in studying RNA methylation processes. Several prediction methods have been investigated, but their performance is still limited by the scarcity of positive samples. To address the challenge of data imbalance in RNA methylation-associated gene prediction tasks, this study employed a generative adversarial network to learn the feature distribution of the original dataset. The quality of synthetic samples was controlled using the Classifier Two-Sample Test (CTST). These synthetic samples were then added to the data blocks to mitigate class distribution imbalance. Experimental results demonstrated that integrating the synthetic samples generated by our proposed model with the original data enhances the prediction performance of various classifiers, outperforming other oversampling methods. Moreover, gene ontology (GO) enrichment analyses further demonstrate the effectiveness of the predicted genes associated with RNA methylation pathways. The model generating gene samples with PyTorch is available at [30]https://github.com/heyheyheyheyhey1/WGAN-GP_RNA_methylation Keywords: RNA methylation, Pathways, Machine learning, Generative adversarial nets Subject terms: RNA, Computational biology and bioinformatics, Computational models, Protein function predictions Introduction RNA methylation refers to the chemical process in which methyl groups are selectively added to methyladenine in RNA under the catalytic action of methyltransferase. So far, researchers have discovered over 160 RNA modifications^[31]1 that are extensively distributed across eukaryotes and prokaryotes. Among these, the most prevalent form of RNA modification is methylation. By studying this process, researchers have discovered associations between RNA methylation modifications and various fields, including cancer^[32]2,[33]3, cardiovascular disease^[34]4, embryonic development^[35]5, and cell differentiation^[36]6. These findings highlight the significance of RNA methylation and suggest its potential implications in diverse biological processes and diseases. Further research in this area may shed more light on the functional roles of RNA methylation and its therapeutic applications^[37]7. Numerous reports have revealed that specific enzymes or proteins within the RNA methylation pathway regulate distinct functions and biochemical processes in the human body^[38]8–[39]11. These studies are based on the identification of proteins, genes, or specific enzymes associated with RNA methylation. Despite the significant role of RNA methylation process in various aspects of the human body, researchers have limited knowledge about the pathways involved. This limitation primarily arises from the substantial financial and time investments required to identify gene functions using experimental wet laboratory methods^[40]12,[41]13. Fortunately, due to intensive research of large-scale omics data and artificial intelligence methods, researchers can now explore target gene functions using computational approaches that are faster and more cost-effective than wet methods. Several reports have employed machine learning-based methods to identify genes associated with RNA methylation pathways^[42]14. 
Tsagkogeorga et al.^[43]14 used expression data from genes associated with RNA methylation as positive samples and trained machine learning models to predict additional potential genes. The results suggest that ’dry’ computational methods could be used to identify RNA methylation-associated genes. However, due to the limited number of known samples and the highly imbalanced ratio of positive to negative samples, the classifier may incorrectly identify majority class samples as minority class samples^[44]15. Any misclassification of genes obtained through computational methods leads to increased research costs, including time and funding expenses. Thus, avoiding overfitting and improving the effectiveness of classification methods in highly imbalanced and high-dimensional data remain challenging. Various approaches have been developed to address the issue of data imbalance, with the most common approach involving the oversampling of minority class samples^[45]16. Random oversampling is a widely employed algorithm that duplicates randomly selected minority class samples from the original dataset to balance the sample distribution. However, this algorithm primarily enhances the classifier’s weights on the minority class but lacks diversity in generating new samples^[46]17. Another effective technique, the synthetic minority oversampling technique (SMOTE)^[47]18, generates samples that lie near the minority class in the feature space, introducing diversity through interpolation among neighboring samples. However, SMOTE has limited impact on most classifiers trained on high-dimensional data, particularly in the field of bioinformatics^[48]19. This limitation also applies to certain SMOTE variants (e.g., Borderline-SMOTE^[49]20). Directly applying these algorithms to highly imbalanced datasets may not alleviate overfitting and enhance classification performance. Deep learning offers an alternative approach to address the challenge of data imbalance^[50]21. Researchers utilized deep learning techniques to generate synthetic samples that supplement the original dataset. One common approach is the use of Generative Adversarial Networks^[51]22 (GANs), which consist of two networks that ‘compete’ against each other: a generator and a discriminator. The generator is responsible for producing synthetic data, while the discriminator evaluates the authenticity of the data. Through an adversarial training process, these two networks compete until they reach equilibrium, where the discriminator can no longer distinguish between real and synthetic data, and the generator cannot improve its sample generation further. This process allows the generator to learn the underlying data distribution and generate synthetic samples that can be utilized for downstream tasks. Several studies have confirmed that GANs and their variants can generate synthetic data by learning the distribution of minority class sample to supplement the original dataset. For example, Jing et al.^[52]23 utilized attention-augmented convolutional layers and an adversarial training mechanism to enhance the identification of transcription factor binding sites (TFBSs). Ma et al.^[53]24 utilized a deep convolutional generative adversarial network (DC-GAN) to classify white blood cell. Gadermayr et al.^[54]25 used a cycle GAN to translate images with healthy conditions into images with affected conditions for augmenting labeled segmentation data. 
Frid-Adar et al.^[55]26 employed GANs to synthesize CT image data and demonstrated that the generated images achieved favorable results despite a limited number of samples. Not only have GANs shown remarkable success in image synthesis, but Xiao et al.^[56]27 also applied a WGAN to learn high-dimensional features from cancer gene expression data. Wan and Jones^[57]28 introduces CTST to evaluate the quality of WGANGP-synthesized protein feature samples and improve protein function prediction performance. Despite the remarkable success of GANs in balancing datasets, their application to highly imbalanced gene expression dataset has, surprisingly, not yet been studied, to the best of our knowledge. Thus, we propose a novel GAN-based approach to address the challenges posed by highly imbalanced data in predicting genes related to RNA methylation pathways. Firstly, we enhance WGAN-GP^[58]29 to improve its capability in learning the distribution of gene expression data. Secondly, we employ the Classifier Two Sample Test^[59]30 (CTST) to regulate the quality of synthetic samples generated by WGAN-GP. Thirdly, we implement a data block structure and incorporate the optimal samples, as identified by CTST, into each block to augment the minority samples. We then employ various classifiers to compare the performance metrics with conventional sampling methods. The results suggest that the samples generated by WGAN-GP integrate well with the minority class samples, as indicated by a CTST accuracy close to 0.5. The data block structure mitigates the risk of overfitting caused by highly unbalanced data, reducing the need for synthetic samples. The newly added samples effectively complement the minority class sample data, leading to improved predictive performance. Method Workflow of the study Generally, our proposed method consists of four steps (Fig. [60]1). In the first step, we perform feature extraction on the dataset to reduce dimensionality and then train the WGAN-GP using the positive samples. In the second step, we evaluate all WGAN-GP weights using CTST and select the model weights that yield the best CTST results for downstream task oversampling. In the third step, we divide the training set into multiple data blocks and oversample each block using the selected WGAN-GP model to achieve balanced data blocks. In the fourth step, we train the machine learning models using these balanced data blocks and evaluate their performance using the test set. Fig. 1. [61]Fig. 1 [62]Open in a new tab Workflow of our study. Data set and feature transformation We use the dataset created and made available by Tsagkogeorga et al.^[63]14. The original dataset was collected from [64]Harmonizome^[65]31 website. We selected 15 one-hot-encoded datasets to construct our dataset. The dataset was initially standardized to continuous values ranging from 0 to 1, or − 1 to 1, where 1 indicates a strong positive gene-feature association, 0 indicates no observed gene-feature association, and − 1 indicates a strong negative gene-feature association. The genes within these 15 datasets, along with their associated features, exhibit variations in both quantity and distribution. The number of genes and features differs across datasets. Some genes may appear in one or multiple datasets, and certain features may be present across various datasets as well. Each gene encompasses 50,176 features. Within the 15 datasets, Some genes lack specific features that are prevalent in other datasets. 
Consequently, features absent from a gene in this context are marked as null. The null values are subsequently assigned a value of 0. While a value of 0 signifies the absence of a gene-feature association, it does not impact the ultimate outcomes of training and prediction. To obtain highly informative features and reduce dimensionality, we remove the following two types of features: (i) those with zero values in more than 70% of cases, and (ii) those with variance less than 16% of the total variance.

The improved data block structure for balancing

Given the significant imbalance between positive and negative samples in our dataset, directly applying oversampling may not improve classifier performance (as discussed in subsection "Ablation study"). Therefore, we adopted an improved data block structure for oversampling. Specifically, we divide the training dataset into subsets and then use oversampling to balance these subsets. This approach, initially introduced by Yin et al.^32, integrates undersampling and oversampling schemes, reducing the information loss of undersampling by slicing the data into multiple chunks while supplementing the dataset. We enhance this data block structure by keeping the proportion of positive and negative samples in these subsets intentionally unbalanced, so that it integrates better with the oversampling methods. Specifically, the subsets are created by combining the same positive samples with different negative samples (the positive samples represent the minority class). The number of positive and negative samples in each subset is allocated proportionally based on the hyperparameters. Generally, each subset contains fewer positive samples than negative samples, aiming to achieve a balanced distribution of positive and negative samples after oversampling. Let N_pos and N_neg denote the numbers of positive and negative samples, S_pos and S_neg the corresponding sample sets, S = S_pos ∪ S_neg the entire dataset, λ = N_neg / N_pos the imbalance ratio of the dataset, and α the hyperparameter that controls the oversampling rate. The pseudocode for the data block construction process to achieve balance is shown in Algorithm 1, and a code sketch of the same procedure is given below.

Algorithm 1. Data block balancing
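The two preprocessing steps above (feature filtering and block balancing) can be sketched as follows. This is a minimal illustration of our reading of the text, not the authors' released code; the thresholds, the interpretation of α as the ratio of positives to negatives within a block, and the `oversample` callback (which would wrap the WGAN-GP generator introduced next) are assumptions.

```python
import numpy as np
import pandas as pd

def filter_features(X: pd.DataFrame, zero_frac: float = 0.70, var_quantile: float = 0.16) -> pd.DataFrame:
    """Drop features that are zero for more than `zero_frac` of the genes, then
    drop those whose variance lies in the lowest `var_quantile` fraction of the
    remaining feature variances (our reading of the two filtering rules)."""
    X = X.loc[:, (X == 0).mean(axis=0) <= zero_frac]
    var = X.var(axis=0)
    return X.loc[:, var >= var.quantile(var_quantile)]

def build_balanced_blocks(X_pos, X_neg, alpha, oversample, seed=0):
    """Algorithm 1 sketch: split the negatives into chunks of roughly
    N_pos / alpha samples, pair every chunk with all positive samples, and call
    `oversample(X_pos, n)` to synthesize enough positives to make each block 1:1."""
    rng = np.random.default_rng(seed)
    X_pos, X_neg = np.asarray(X_pos), np.asarray(X_neg)
    n_pos = len(X_pos)
    block_neg = max(int(round(n_pos / alpha)), n_pos)   # negatives per block
    order = rng.permutation(len(X_neg))
    blocks = []
    for start in range(0, len(X_neg), block_neg):
        neg_chunk = X_neg[order[start:start + block_neg]]
        n_synth = len(neg_chunk) - n_pos                 # synthetic positives needed
        X_synth = oversample(X_pos, n_synth) if n_synth > 0 else np.empty((0, X_pos.shape[1]))
        X_block = np.vstack([X_pos, X_synth, neg_chunk])
        y_block = np.r_[np.ones(n_pos + len(X_synth)), np.zeros(len(neg_chunk))]
        blocks.append((X_block, y_block))
    return blocks
```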
The small-sample augmentation based on WGAN-GP

Generative Adversarial Networks (GANs) are neural networks capable of learning high-dimensional feature distributions from data samples^33. A GAN generally consists of two neural networks: a generator G and a discriminator D. The generator receives random noise as input and produces samples that are intended to be indistinguishable from real data. The discriminator's task is to distinguish between the fake samples generated by G and the real samples from the training data. During training, the generator and the discriminator engage in a minimax two-player game: the generator aims to create samples that can deceive the discriminator, while the discriminator tries to correctly identify whether samples are real or fake. Ideally, a Nash equilibrium is reached when the generator can no longer produce samples that the discriminator can distinguish from real samples, and the discriminator is unable to differentiate between real and fake samples. Formally, let V(G, D) be the objective function. The minimax game is evaluated as:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \tag{1}$$

where x is sampled from the real data distribution $p_{data}$ and z is sampled from the prior noise distribution $p_z$. $\mathbb{E}$ denotes the expectation, D(x) is the score assigned by the discriminator D to x, and G(z) is the fake sample generated by G from noise drawn from the prior distribution $p_z$. In a GAN, the generator and discriminator are trained alternately; that is, the weights of one network are fixed while the other network is trained. Specifically, when training the discriminator D, the generator G is kept fixed and D is optimized by maximizing $\log D(x)$. Conversely, when training the generator G, the discriminator is fixed and G is optimized by minimizing $\log(1 - D(G(z)))$.

However, conventional GANs face training difficulties, primarily due to the Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence loss functions. Specifically, when the two distributions do not overlap, the JS divergence between $p_{data}$ and the distribution of G(z) approaches a constant value, leading to issues such as gradient vanishing and mode collapse^28. WGAN addresses the problem of gradient vanishing by replacing these divergence measures with the Wasserstein distance. Equation (2) gives the value function of WGAN:

$$V = \mathbb{E}_{x \sim p_{data}}[f_w(x)] - \mathbb{E}_{z \sim p_z}[f_w(g_\theta(z))] \tag{2}$$

where $f_w$ is a function that satisfies the K-Lipschitz constraint, and w and $\theta$ are the parameters of f and g, respectively. To satisfy the K-Lipschitz constraint, w is constrained so that $f_w$ has a slope of less than K between any two points. Given the powerful fitting capability of neural networks, we can simply replace f and g with D and G, respectively. Thus, the objective function of WGAN is given by Eq. (3):

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] \tag{3}$$

Because D must satisfy the K-Lipschitz constraint, all parameters of D are clipped to a fixed range after each training epoch, for example from −0.01 to 0.01. Compared to traditional GANs, WGAN incorporates the K-Lipschitz constraint to address gradient vanishing, and its effective smoothing of the value function improves trainability. Nevertheless, enforcing the K-Lipschitz constraint through parameter clipping is somewhat crude, as clipping can significantly impair the network's fitting capability: it often causes a large number of weights to concentrate at the clipping boundaries, making it difficult to fit complex distributions and significantly degrading performance^34. WGAN-GP adopts a more refined approach by incorporating a gradient penalty to enforce the K-Lipschitz constraint. Specifically, during training of the discriminator, a gradient penalty term is added to the discriminator's loss function. Let GP denote the gradient penalty; Eq. (4) shows how it is calculated.
$$GP = \gamma \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \tag{4}$$

And the objective function of WGAN-GP is rewritten as follows:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] - GP \tag{5}$$

where $\hat{x}$ is obtained by interpolating between real samples drawn from $p_{data}$ and samples produced by the generator, $\gamma$ is the gradient penalty coefficient, typically a positive real number chosen empirically, and $\nabla_{\hat{x}} D(\hat{x})$ denotes the gradient of $D(\hat{x})$ with respect to $\hat{x}$.

Furthermore, the CNN structure originally used in WGAN-GP is designed for image feature extraction and is not well suited to gene expression data. Therefore, it is crucial to adapt WGAN-GP to better accommodate gene expression data. To this end, we substitute the CNN structure in WGAN-GP with a fully connected neural network and further optimize the network parameters to tailor it to our dataset. Each fully connected layer is followed by a Leaky ReLU (Rectified Linear Unit) activation function with a negative slope of 0.2. In the discriminator, the number of neurons in each layer progressively decreases until it reaches 1, while in the generator, the number of neurons in each layer progressively increases until it matches the number of features in the dataset.

Further generator selection

In this study, we utilize the CTST approach to select the generator that produces the optimal synthetic samples, rather than evaluating the samples themselves, which differs from^28. Optimal synthetic samples are those that closely mimic the feature distribution of the real positive training samples without being identical to the real gene expression samples. In essence, when the number of synthetic samples equals that of real samples, CTST employs a binary classifier to differentiate between the two sets of samples; if cross-validation shows that the classifier can easily separate them, the synthetic samples are judged to deviate from the real distribution. Specifically, given two sample sets of equal size that follow two distributions, P and Q, CTST tests the null hypothesis that P is equal to Q. If the null hypothesis is accepted, the classifier's accuracy will be near chance level (50%); if rejected, the accuracy will approach 100%^35. During the training of WGAN-GP, we apply the CTST after each epoch to promptly assess the generator and retain the optimal network weights for downstream tasks. Given the time-intensive nature of network training, cross-validation is performed only once at this stage, and the network weights are retained based on the accuracy of this single validation. However, these weights may be influenced by chance due to the probabilistic nature of data generation. For instance, the network may produce a batch of samples that happens to resemble the target distribution, resulting in a CTST accuracy close to 0.5 even though the network has only partially captured that distribution, so the retained weights may still harm performance in downstream tasks. To assess the generator fit at different epochs, we therefore conduct additional CTST experiments after WGAN-GP training: multiple CTSTs are performed for each saved weight, and the generator whose mean accuracy is closest to 0.5 is selected as the data generator for the downstream task. In this study, we conduct the CTST using the Support Vector Machine (SVM) classification algorithm.
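A minimal sketch of this selection step is shown below, assuming the real positive and synthetic samples are NumPy arrays. The `checkpoints` objects with a `sample()` method are placeholders for the saved WGAN-GP generators, and the SVM settings and fold count are illustrative rather than the authors' exact configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def ctst_accuracy(real_pos, synthetic, n_splits=5, seed=0):
    """Classifier Two-Sample Test: train an SVM to tell real positive samples
    from an equal number of synthetic ones and return its cross-validated
    accuracy. Values near 0.5 mean the two sets are indistinguishable."""
    rng = np.random.default_rng(seed)
    n = min(len(real_pos), len(synthetic))
    X = np.vstack([real_pos[rng.choice(len(real_pos), n, replace=False)],
                   synthetic[rng.choice(len(synthetic), n, replace=False)]])
    y = np.r_[np.ones(n), np.zeros(n)]
    return cross_val_score(SVC(), X, y, cv=n_splits, scoring="accuracy").mean()

def select_generator(checkpoints, real_pos, n_repeats=200):
    """Repeat the CTST for each saved generator and keep the one whose mean
    accuracy is closest to 0.5 (the `sample()` method is a placeholder)."""
    scores = {ck: np.mean([ctst_accuracy(real_pos, ck.sample(len(real_pos)), seed=i)
                           for i in range(n_repeats)])
              for ck in checkpoints}
    return min(scores, key=lambda ck: abs(scores[ck] - 0.5))
```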
Prediction and evaluation

The dataset for this study comprises 26,936 gene expression samples, including 92 samples related to RNA methylation (positive samples) and 26,304 unlabeled samples (regarded as negative samples). The dataset was split into training and test sets in an 8:2 ratio. A total of 74 positive samples and 21,044 negative samples were used for training and cross-validation (CV) to determine hyperparameters via grid search^36,37, while 18 positive samples and 5260 negative samples were allocated for testing. The positive training samples were first used to train WGAN-GP in order to obtain generators for producing synthetic samples for downstream tasks, and were later used for training the classification model, as shown in Table 1.

Table 1. Numbers of gene samples for training, CV, and test.

|                       | Train and CV | Test | Total  |
|-----------------------|--------------|------|--------|
| Positive gene numbers | 74           | 18   | 92     |
| Negative gene numbers | 21,044       | 5260 | 26,304 |
| Total                 | 21,118       | 5278 | 26,936 |

To prevent mode collapse during the training of WGAN-GP, we employ an Early Stopping mechanism^38. If the CTST result, which measures the dissimilarity between the samples produced by the generator and the real samples, falls below a preset threshold of 0.02, the weights from that epoch are saved and made available for downstream tasks. However, to increase the number of generator models available for selection, training does not stop immediately once the Early Stopping condition is met; instead, it continues until the preset number of epochs is reached. By analyzing the probability density curve of results between 0.4 and 0.6, we observed a peak around 0.52, which ensures that most useless generators are filtered out while a relatively large number of generators is retained. We therefore chose 0.02 as the default threshold for preliminary filtering, as shown in Supplementary Figure 1.

To evaluate the performance of predicting genes related to RNA methylation pathways, we employ five machine learning classifiers: Support Vector Machine (SVM), Gradient Boosting (GB), Gaussian Naive Bayes (GNB), Random Forest (RF), and Logistic Regression (LR). After dividing the data based on the oversampling rate α, these classifiers are trained on an equal number of data blocks. For SVM, GB, and RF, we determine the optimal hyperparameters with a grid search, selecting the hyperparameter combination that achieves the highest accuracy in 3-fold cross-validation. Five metrics are used to evaluate the classifiers' performance: Accuracy, Precision, Recall, F1 score, and Matthews Correlation Coefficient (MCC). Given the highly imbalanced ratio of positive to negative samples in the test set, these metrics are computed across multiple data blocks derived from the test set, and the average performance across all blocks is taken as the final test result. Unlike the training set, the test set's data block splitting process does not include the oversampling step.
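For concreteness, the adapted WGAN-GP described above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the released implementation (see the GitHub link in the Abstract); the layer widths, noise dimension, optimizer settings, γ = 10 and n_critic = 5 are assumptions, and 1517 corresponds to the feature count reported later for the filtered dataset.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    """Stack of Linear + LeakyReLU(0.2) layers; no activation after the last layer
    (an output activation such as Tanh could be appended depending on feature scaling)."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

n_features, z_dim = 1517, 128                      # z_dim and widths are illustrative
G = mlp([z_dim, 256, 512, 1024, n_features])       # generator widths grow to the feature count
D = mlp([n_features, 1024, 512, 256, 1])           # discriminator widths shrink to a single score

def gradient_penalty(D, real, fake, gamma=10.0):
    """Eq. (4): penalise the critic's gradient norm at points interpolated
    between real and generated samples."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return gamma * ((grad.norm(2, dim=1) - 1) ** 2).mean()

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))

def train_step(real, n_critic=5):
    """One WGAN-GP update: several critic steps (Eq. 5) followed by one generator step."""
    for _ in range(n_critic):
        z = torch.randn(real.size(0), z_dim)
        fake = G(z).detach()
        loss_D = D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    z = torch.randn(real.size(0), z_dim)
    loss_G = -D(G(z)).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```

After each epoch, the generator's output would be scored with the CTST routine above and the weights saved whenever the early-stopping criterion on the CTST result is met.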
Result

Influence of hyperparameters and training strategies

To investigate the effect of the oversampling rate α on model performance, we performed cross-validation on the training set using various oversampling rates. Our experiments involved nine different α values across five classifiers. Generally, each classifier exhibited varied validation performance with different α values, and a fixed α value also produced diverse results across classifiers. As shown in Fig. 2, the highest accuracy (0.9757) was achieved with an α of 0.4 paired with the Logistic Regression (LR) classifier. Conversely, the lowest accuracy (0.8771) occurred with an α of 0.1 paired with the Gradient Boosting (GB) classifier, indicating a difference of approximately 14.3%. Notably, Random Forest (RF) was the most affected by variations in α, with a performance difference of 9.49%. In contrast, Gaussian Naive Bayes (GNB) was the least affected, showing a difference of only 1.27% between its best (0.9539) and worst (0.9412) results. The performance drop for the other classifiers relative to their best results was 4.48% for the Support Vector Machine (SVM), 6.34% for GB, and 1.82% for LR. This suggests that either too few synthetic samples (α close to 1) or too many (α close to 0) can degrade performance. Based on these validation results, unless otherwise specified, we set α to 0.4 for the following analyses.

Fig. 2. Impact of the oversampling rate α. The horizontal axis is the oversampling rate, the vertical axis is the validation accuracy, and the lines correspond to the different classifiers.

Ablation study

The data block structure and the oversampling process are the two major components of our approach. To study their effectiveness, we created two variants of the scheme: one without the data block structure and another without the oversampling process. These two variants were used to train the five machine learning models, and we evaluated their performance using four metrics to assess the impact of the data block structure and oversampling on overall model performance. Figure 3 depicts the results of the ablation study.

Fig. 3. Overview of the ablation study.

In general, our complete scheme exhibits superior predictive performance across all five classifiers, whereas removing the data block structure or proper oversampling degrades performance to varying extents. Among the classifiers, GNB shows the least sensitivity to these components, with reductions of 1.76% (MCC), 0.7% (F1 score), and 3.1% (recall) when the data block structure is removed, and 7.5% (MCC), 3.4% (F1 score), and 2.7% (recall) when WGAN-GP oversampling is missing. In contrast, the other four classifiers are far more sensitive to the absence of the data block structure. For instance, GB experiences a 26.9% drop in accuracy, 60.2% in recall, 44.6% in F1 score, and 42.3% in MCC. Similarly, LR shows decreases of 16.9% (accuracy), 40.9% (recall), 24.3% (F1 score), and 27.0% (MCC). RF faces reductions of 19.4% (accuracy), 45.8% (recall), 28.9% (F1 score), and 30.7% (MCC), while SVM declines by 26.5% (accuracy), 59.8% (recall), 44.3% (F1 score), and 41.7% (MCC). When the oversampling process is absent, the situation is more moderate: MCC drops by 5.3% for GB, 5.9% for SVM, 10.2% for RF, and 6.4% for LR, with similar declines in the other metrics. These results clearly demonstrate that merely generating a large number of synthetic samples is ineffective for highly imbalanced classification problems. In contrast, our proposed scheme, which integrates the benefits of both oversampling and undersampling, achieves better performance. By incorporating data blocks, it significantly reduces the need for synthetic samples, preserves the original sample distribution, and effectively expands the dataset. The data block structure also facilitates the use of multiple weak classifiers, enabling ensemble techniques that capture diverse patterns in the data, which reduces individual model bias and improves generalization.
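A minimal sketch of this block-wise ensemble follows, assuming the balanced blocks produced earlier. The choice of base estimators, the illustrative SVM grid, and probability averaging as the combination rule are our assumptions rather than the authors' exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_block_ensemble(blocks, base="lr"):
    """Train one classifier per balanced data block (grid search with 3-fold CV
    for the SVM variant, as described in the evaluation setup)."""
    models = []
    for X_blk, y_blk in blocks:
        if base == "svm":
            clf = GridSearchCV(SVC(probability=True),
                               {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]}, cv=3)
        else:
            clf = LogisticRegression(max_iter=1000)
        models.append(clf.fit(X_blk, y_blk))
    return models

def predict_proba(models, X):
    """Average the positive-class probability over all block models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```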
A total of 61 weights were saved according to the preset parameters. The additional CTST results for these 61 weights are displayed in Fig. 4, sorted by variance. After 200 repetitions of the experiment, slight differences were observed in the mean values of the generator test results. The majority of the generator test results still meet the preset hyperparameter criteria, falling between 0.498 and 0.502, as denoted by the green dots in Fig. 4. A few generators produced test results that deviate from the preset criteria yet remain within the acceptable range for downstream tasks, from 0.4 to 0.498 and from 0.502 to 0.6, indicated by the orange points; these generators also show an increase in variance. In contrast, a few generators, represented by the red dots, exhibit test results that deviate markedly from the 0.5 criterion, with an absolute difference exceeding 0.1, alongside higher variance. In such cases, the samples generated by these generators could introduce bias into the classifier's training. These results affirm the importance of the additional CTST round for refining the selection process.

Fig. 4. CTST results over the 61 saved model weights, sorted by variance.

WGAN-GP successfully generated high-quality positive samples

Overall, the samples synthesized by WGAN-GP aligned strongly with the original positive samples. We ran a supplementary CTST after each training epoch. As shown in Fig. 5, at the beginning of generator training the generated samples diverged significantly from the distribution of the positive samples, resulting in a CTST accuracy of 1.0. This observation underscores how easily the CTST classifier discriminated between the generated and positive samples at this stage.

Fig. 5. Distribution changes between positive and synthetic samples. a epoch 0; b epoch 200; c epoch 500; d epoch 800.

Around the 200th epoch, the generator began capturing the distinctive feature distribution of the positive samples. Consequently, an overlap emerged in the feature space between the generated and positive samples, reducing the CTST accuracy to approximately 0.7. As training progressed, by the 500th epoch this overlapping region had expanded further, causing the CTST accuracy to drop to about 0.6. After more than 800 training iterations, WGAN-GP had internalized the genetic feature distribution of the authentic positive samples: the CTST accuracy decreased to 0.5 and fluctuated around this value, signifying the generator's ability to align with the positive sample distribution. Additionally, the training loss curve further indicated that the generator attained a local optimum as the CTST accuracy reached 0.5. Furthermore, independent CTST evaluations of each saved generator model revealed that more than half of the generators consistently maintained a classifier accuracy around 0.5. This result highlights the effectiveness of WGAN-GP in producing high-quality synthetic samples, as illustrated in Fig. 5.
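The per-epoch panels of Fig. 5 can be reproduced by embedding real and synthetic samples into two dimensions. The paper does not state which projection was used, so the t-SNE below, along with the plotting details, is an assumption:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_real_vs_synthetic(real_pos, synthetic, title="epoch ?"):
    """Embed real positive and synthetic samples together and scatter them,
    mirroring the per-epoch overlap panels of Fig. 5."""
    X = np.vstack([real_pos, synthetic])
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
    n = len(real_pos)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=12, label="real positive")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=12, label="synthetic")
    plt.title(title)
    plt.legend()
    plt.show()
```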
While WGAN-GP mitigates the training difficulties of conventional GANs by incorporating the gradient penalty, proper network design and hyperparameter tuning remained crucial for successful GAN training. In addition to visualizing samples at various epochs, observing the change in loss during GAN training proved informative^27. For WGAN-GP, the generator's loss reflects the disparity between generated and real samples. As shown in Fig. 6, the generator initially exhibited poor fitting ability, resulting in a rising loss at the start of training. After adversarial training against the discriminator, the generator started capturing the underlying sample distribution, causing the loss to decrease. As training progressed, the generator's loss began to fluctuate, indicating the intermediate stage of training. Around 600–800 epochs, the generator and discriminator reached a Nash equilibrium in the minimax game, as indicated by the consistent reduction and eventual stabilization of the loss. This suggests the generative adversarial network had reached an approximately optimal state.

Fig. 6. WGAN-GP training loss over 5000 epochs. The blue line indicates the discriminator loss and the orange line the generator loss.

Performance comparison

As shown in Table 2, our approach consistently exhibits superior prediction performance across most classifier combinations. For instance, when paired with the SVM classifier, our method improves accuracy by 16.01%, recall by 38.81%, F1 by 23.34%, and MCC by 25.14% compared to the second-best approach. Similarly, the combination with the GB classifier achieves respective improvements of 13.77%, 33.91%, 19.35%, and 21.87%. When combined with the RF classifier, our approach surpasses the second-best model by 16.83%, 40.58%, 24.26%, and 26.73%, and for the LR classifier our method outperforms the runner-up by 11.69%, 30.39%, 15.92%, and 18.89%. Across most performance metrics, our scheme attains top-tier results, securing the highest accuracy (0.9320), F1 (0.9354), and MCC (0.9327) when combined with LR. However, with the GNB classifier, the RandomOversample scheme achieves the best performance, with an average accuracy of 0.9183, a recall of 1.0, an F1 of 0.9261, and an MCC of 0.8505. This suggests that the RandomOversample method is more inclined to identify positive samples. Nonetheless, our scheme remains competitive, producing accuracy, recall, F1, and MCC scores that closely approach those of the leading scheme, with only minor differences.

Table 2. Accuracy, Recall, F1 and MCC for 4 approaches.

| Method | Classifier | Accuracy | Recall | F1 | MCC |
|---|---|---|---|---|---|
| SMOTE | SVM | 0.5 | 0 | 0 | 0 |
| SMOTE | GB | 0.7629 | 0.5263 | 0.6894 | 0.5970 |
| SMOTE | RF | 0.7368 | 0.4736 | 0.6428 | 0.5570 |
| SMOTE | LR | 0.8413 | 0.6842 | 0.8118 | 0.7192 |
| SMOTE | GNB | 0.9164 | 0.9473 | 0.9201 | 0.8365 |
| BorderlineSMOTE | SVM | 0.7630 | 0.5263 | 0.6895 | 0.5973 |
| BorderlineSMOTE | GB | 0.7891 | 0.5789 | 0.7330 | 0.6375 |
| BorderlineSMOTE | RF | 0.5789 | 0.1578 | 0.2727 | 0.2927 |
| BorderlineSMOTE | LR | 0.8151 | 0.6315 | 0.7735 | 0.6776 |
| BorderlineSMOTE | GNB | 0.8094 | 0.6315 | 0.7684 | 0.6627 |
| RandomOversample | SVM | 0.5 | 0 | 0 | 0 |
| RandomOversample | GB | 0.7631 | 0.5263 | 0.6896 | 0.5976 |
| RandomOversample | RF | 0.5263 | 0.0526 | 0.1 | 0.1644 |
| RandomOversample | LR | 0.8153 | 0.6315 | 0.7737 | 0.6781 |
| RandomOversample | GNB | 0.8115 | 0.7894 | 0.8088 | 0.6263 |
| Ours | SVM | 0.9232 | 0.9145 | 0.9230 | 0.8487 |
| Ours | GB | 0.9269 | 0.9181 | 0.9266 | 0.8562 |
| Ours | RF | 0.9315 | 0.9321 | 0.9322 | 0.8650 |
| Ours | LR | 0.9320 | 0.9354 | 0.9328 | 0.8665 |
| Ours | GNB | 0.9108 | 0.9785 | 0.9178 | 0.8324 |

The highest values for each specific classifier are highlighted in bold, while the highest values across all classifiers are underlined.
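For reference, the SMOTE, Borderline-SMOTE and random-oversampling baselines in Table 2 can be run with imbalanced-learn. The sketch below is our reconstruction of the comparison setup, not the authors' script; `X_train`, `y_train`, `X_test` and `y_test` are assumed to come from the split in Table 1, and LR stands in for the other classifiers.

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef

samplers = {"SMOTE": SMOTE(random_state=0),
            "BorderlineSMOTE": BorderlineSMOTE(random_state=0),
            "RandomOversample": RandomOverSampler(random_state=0)}

for name, sampler in samplers.items():
    # Baselines in Table 2 oversample the whole training set (no data blocks) before fitting.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    print(name, matthews_corrcoef(y_test, clf.predict(X_test)))
```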
Notably, some approaches encountered a significant problem, misclassifying all test set samples as negative (see^32, Table 3). This drops their average accuracy, recall, F1, and MCC to 0.5, 0, 0, and 0, respectively, and in such cases the approaches were deemed unusable. Among the three compared schemes, the combination of SMOTE with SVM exhibits these invalid values, and RandomOversample encounters the same issue when combined with SVM. While BorderlineSMOTE mitigates the problem, it still underperformed compared to our approach. In contrast, our method demonstrated better data compatibility and was more effective at handling highly imbalanced gene expression data.

To investigate the differences in predictive performance between samples synthesized by WGAN-GP and those from other schemes, we substituted WGAN-GP with three oversampling schemes, creating three new variants of our approach. These variants balanced the data blocks rather than the entire dataset before training, following the same methodology as our proposed approach. A scenario with no oversampling was used as the baseline for comparison. Table 3 shows the test results for the various oversampling schemes combined with data block construction. Generally, the samples generated by WGAN-GP outperform the other four compared schemes across most metrics and classifiers, achieving the highest overall scores in accuracy, recall, F1, and MCC. Specifically, with the SVM classifier, WGAN-GP attained the highest accuracy (0.9232), precision (0.9342), F1 (0.9230), and MCC (0.8487). With the GB classifier, WGAN-GP achieved the highest accuracy (0.9269), F1 (0.9266), and MCC (0.8562). With the RF classifier, WGAN-GP achieved the highest accuracy (0.9315), recall (0.9321), F1 (0.9322), and MCC (0.8650). With the LR classifier, WGAN-GP reached the highest accuracy (0.9320), precision (0.9327), F1 (0.9328), and MCC (0.8665). With the GNB classifier, WGAN-GP achieved the highest recall (0.9785), F1 (0.9178), and MCC (0.8324). Among these results, the WGAN-GP with LR combination attained the highest overall accuracy, F1, and MCC scores, while the combination with GNB achieved the highest recall. The highest overall precision was obtained by the combination of SMOTE with GB.

Table 3. Comparison of model performance with the data block construction.
| Oversample method | Classifier | Accuracy | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|---|
| WGAN-GP | SVM | 0.9232 | 0.9342 | 0.9145 | 0.9230 | 0.8487 |
| WGAN-GP | GB | 0.9269 | 0.9379 | 0.9181 | 0.9266 | 0.8562 |
| WGAN-GP | RF | 0.9315 | 0.9345 | 0.9321 | 0.9322 | 0.8650 |
| WGAN-GP | LR | 0.9320 | 0.9327 | 0.9354 | 0.9328 | 0.8665 |
| WGAN-GP | GNB | 0.9108 | 0.8673 | 0.9785 | 0.9178 | 0.8324 |
| BorderlineSMOTE | SVM | 0.9171 | 0.9277 | 0.9087 | 0.9168 | 0.8367 |
| BorderlineSMOTE | GB | 0.9252 | 0.9365 | 0.9159 | 0.9249 | 0.8529 |
| BorderlineSMOTE | RF | 0.9293 | 0.9328 | 0.9291 | 0.9299 | 0.8605 |
| BorderlineSMOTE | LR | 0.9244 | 0.9217 | 0.9321 | 0.9255 | 0.8514 |
| BorderlineSMOTE | GNB | 0.9046 | 0.8658 | 0.9662 | 0.9114 | 0.8189 |
| SMOTE | SVM | 0.9129 | 0.9229 | 0.9055 | 0.9121 | 0.8291 |
| SMOTE | GB | 0.9237 | 0.9389 | 0.9099 | 0.9229 | 0.8501 |
| SMOTE | RF | 0.9301 | 0.9352 | 0.9282 | 0.9307 | 0.8622 |
| SMOTE | LR | 0.9248 | 0.9255 | 0.9282 | 0.9257 | 0.8519 |
| SMOTE | GNB | 0.9042 | 0.8736 | 0.9518 | 0.9095 | 0.8151 |
| RandomOversample | SVM | 0.9193 | 0.9308 | 0.9099 | 0.9190 | 0.8411 |
| RandomOversample | GB | 0.9181 | 0.9289 | 0.9096 | 0.9178 | 0.8388 |
| RandomOversample | RF | 0.9235 | 0.9210 | 0.9312 | 0.9249 | 0.8493 |
| RandomOversample | LR | 0.9286 | 0.9259 | 0.9360 | 0.9297 | 0.8597 |
| RandomOversample | GNB | 0.9112 | 0.8695 | 0.9751 | 0.9177 | 0.8324 |
| ModCGAN | SVM | 0.8947 | 0.9171 | 0.8726 | 0.8927 | 0.7930 |
| ModCGAN | GB | 0.9193 | 0.9293 | 0.9115 | 0.9184 | 0.8420 |
| ModCGAN | RF | 0.9196 | 0.9255 | 0.9169 | 0.9194 | 0.8424 |
| ModCGAN | LR | 0.8995 | 0.9198 | 0.8798 | 0.8980 | 0.8020 |
| ModCGAN | GNB | 0.9114 | 0.8959 | 0.9366 | 0.9137 | 0.8276 |
| ACGAN | SVM | 0.8381 | 0.9117 | 0.7532 | 0.8221 | 0.7538 |
| ACGAN | GB | 0.8728 | 0.9265 | 0.8142 | 0.8646 | 0.7538 |
| ACGAN | RF | 0.8706 | 0.9238 | 0.8124 | 0.8625 | 0.8625 |
| ACGAN | LR | 0.8461 | 0.9167 | 0.7658 | 0.8317 | 0.7048 |
| ACGAN | GNB | 0.8937 | 0.9136 | 0.8752 | 0.8926 | 0.7905 |
| *Non | SVM | 0.8909 | 0.8763 | 0.9178 | 0.8947 | 0.7863 |
| *Non | GB | 0.8957 | 0.8834 | 0.9189 | 0.8989 | 0.7957 |
| *Non | RF | 0.8795 | 0.8645 | 0.9085 | 0.8840 | 0.7639 |
| *Non | LR | 0.8983 | 0.8767 | 0.9342 | 0.9029 | 0.8016 |
| *Non | GNB | 0.8719 | 0.8279 | 0.9506 | 0.8829 | 0.7570 |

*Non means that no oversampling method was integrated. The highest values for each specific classifier are highlighted in bold, while the highest values across all classifiers are underlined.

In addition to common oversampling methods, we also compared other, more sophisticated oversampling schemes. ModCGAN^39,40 takes a different approach to ensuring the quality of synthetic samples: after the GAN is trained, an MLP classifier is trained to assess whether the generated samples meet the required quality. Similarly, ACGAN^41 builds on CGAN by forcing the discriminator to reconstruct the auxiliary information of CGAN, thereby regulating the quality of the synthetic samples. We integrated these methods with data blocks to compare the impact of synthetic sample quality on predicting RNA methylation-related genes. Although ModCGAN and ACGAN perform similarly to our method across the five metrics and five classifiers, our method shows a 1–5% improvement on most metrics. The key to ModCGAN lies in using an external classifier to distinguish between real samples and noise, thereby regulating the generator's outputs. However, ModCGAN assumes that noise follows only a joint distribution of uniform and Gaussian distributions. For large datasets such as ours, the large pool of negative samples may overlap with the Gaussian or uniform distribution, leading the external classifier to exclude some qualified synthetic samples. Although this study does not use GANs to generate majority class samples, a small number of positive samples might also overlap with the predefined noise distribution. Therefore, it is more reasonable to use the distinguishability between synthetic and real samples as the standard for assessing synthetic sample quality. It is worth noting that the WGAN-GP oversampling method, when combined with different classifiers, can achieve the highest MCC and F1 scores for each respective classifier.
Since the MCC metric is commonly used to evaluate binary classification models, particularly for unbalanced datasets, this suggests that WGAN-GP generates more diverse samples, thereby reducing the classifier’s bias towards the majority class. GO terms and KEGG pathway enrichment analyses With [MATH: α :MATH] set to 0.4, each classifier retained 116 models. The average score of these models for each gene was then calculated as the classifier’s final score. Subsequently, high-confidence genes identified by each classifier were utilized for GO terms enrichment analysis. High-confidence genes are defined as the top 1% of genes ranked by confidence. Based on the gene quantity in our dataset, each classifier typically has 270 high-confidence genes. However, since GNB tends to overestimate confidence for many genes, we considered the top 1500 genes in GNB’s confidence rankings as high-confidence (beyond rank 1500, other classifiers typically assign confidence levels below 0.5). Figure [107]7 compares the top 20 GO terms ranked by each classifier. Overall, the predictions from the five classifiers exhibit strong consistency. GO terms enrichment analysis indicates that high-confidence genes are significantly enriched for RNA methylation functions. The terms with the highest enrichment include methylation, RNA modification, RNA splicing, and RNA localization. Other terms related to rRNA, mRNA, tRNA, and ncRNA are also present in the enrichment analysis, with over 200 genes enriched for these GO terms. This confirms that the predicted genes are closely associated with RNA methylation process. Fig. 7. [108]Fig. 7 [109]Open in a new tab GO terms enrichment analyses of high-confidence genes. Supplementary Figure [110]2 shows the KEGG pathway enrichment analysis of the top 1% confidence genes predicted by five classifiers. Despite GNB’s tendency to assign high confidence to most genes, the pathway enrichment analyses of the five models showed strong similarity. All five enrichment analysis reports highlighted pathways strongly related to RNA methylation, such as Spliceosome, mRNA surveillance pathway, RNA polymerase, and RNA degradation. The enrichment counts ranged from 20 to 40, and all P-values were all below 0.05, indicating high confidence in these items. Notably, the KEGG enrichment report from the GB classifier erroneously included a pathway related to Coronavirus disease - COVID-19, which did not appear in other classifiers. The count and P-value for this item were much lower than those related to RNA methylation, further indicating that the model consistently predict genes involved in RNA methylation. Functional insights into the new predictions To gain deeper insights into the role of newly predicted genes in the RNA methylation process, we selected the top 1% confidence genes (predicted by SVM) and constructed a PPI network using the STRING database^[111]42. This network included 188 newly predicted genes and 76 known genes related to RNA methylation. The STRING database constructed a densely interactive PPI network for these genes, comprising 4093 edges. Using the Louvain^[112]43community detection algorithm, we identified six tightly connected communities within the PPI network, as shown in Supplementary Figure [113]3. Community 0 consists of 71 genes, including 48 known RNA methylation-related genes and 23 newly predicted genes. 
Functional analysis indicates significant enrichment for terms such as tRNA N2-guanine methylation (GO:0002940, P = 3.53e−06), rRNA 2'-O-methylation (GO:0000451, P = 3.53e−06), tRNA N1-guanine methylation (GO:0002939, P = 0.00025), tRNA (guanine-N7)-methylation (GO:0106004, P = 0.0180), and rRNA (guanine-N7)-methylation (GO:0070476, P = 0.0180). Community 1 consists of 18 genes, including only one known RNA methylation-related gene, FDXACB1. Even so, functional enrichment analysis shows significant enrichment for terms such as histidyl-tRNA aminoacylation (GO:0006427, P = 0.0018), phenylalanyl-tRNA aminoacylation (GO:0006432, P = 0.0044), tRNA aminoacylation for mitochondrial protein translation (GO:0070127, P = 0.0154), aminoacyl-tRNA ligase activity (GO:0004812, P = 1.19e−13), and RNA adenylyltransferase activity (GO:1990817, P = 0.0046). Community 2 consists of 22 genes, including 20 newly predicted genes; HSD17B10 and RSAD1 in this community are known RNA methylation-related genes. Functional enrichment analysis suggests significant enrichment for terms such as tRNA 5'-leader removal (GO:0001682, P = 1.49e−16), tRNA 5'-end processing (GO:0099116, P = 1.41e−18), tRNA splicing via endonucleolytic cleavage and ligation (GO:0006388, P = 1.48e−06), and tRNA-specific ribonuclease activity (GO:0004549). Community 3 consists of 80 genes, including 7 known RNA methylation-related genes (CMTR1, CMTR2, LARP7, MEPCE, RNGTT, RNMT, TGS1) and 73 newly predicted genes. GO enrichment analysis shows significant enrichment for terms such as cap1 mRNA methylation (GO:0097309, P = 0.0280), 7-methylguanosine RNA capping (GO:0009452, P = 1.02e−08), RNA 5'-cap (guanine-N7)-methylation (GO:0106005, P = 0.433), 7-methylguanosine mRNA capping (GO:0006370, P = 5.86e−07), mRNA (nucleoside-2'-O-)-methyltransferase activity (GO:0004483, P = 0.0264), mRNA methyltransferase activity (GO:0008174, P = 0.0099), snRNA binding (GO:0017069, P = 1.69e−06), and O-methyltransferase activity (GO:0008171, P = 0.0479). Community 4 consists of 16 genes, including 12 known RNA methylation-related genes and 4 newly predicted genes. Functional enrichment analysis reveals significant enrichment for terms such as snRNA (adenine-N6)-methylation (GO:0120049, P = 0.0014), S-adenosylmethionine biosynthetic process (GO:0006556, P = 0.0033), mRNA methylation (GO:0080009, P = 9.06e−17), mRNA (2'-O-methyladenosine-N6-)-methyltransferase activity (GO:0016422, P = 6.17e−06), RNA (adenine-N6-)-methyltransferase activity (GO:0008988, P = 6.17e−06), mRNA methyltransferase activity (GO:0008174, P = 5.16e−07), and RNA N6-methyladenosine methyltransferase complex (GO:0036396, P = 6.55e−16). Community 5 consists of 57 genes, including 6 known RNA methylation-related genes (CEBPZ, DIMT1, EMG1, FBL, FBLL1, TFB2M) and 51 newly predicted genes. Functional enrichment analysis indicates significant enrichment for terms such as Box H/ACA RNA 3'-end processing (GO:0000495, P = 0.0074), U5 snRNA 3'-end processing (GO:0034476, P = 0.0119), U1 snRNA 3'-end processing (GO:0034473, P = 0.0119), histone-glutamine methyltransferase activity (GO:1990259, P = 0.0073), rRNA (adenine-N6,N6-)-dimethyltransferase activity (GO:0000179, P = 0.0158), snoRNA binding (GO:0030515, P = 9.05e−11), 3'-5'-exoribonuclease activity (GO:0000175, P = 8.02e−09), rRNA methyltransferase activity (GO:0008649, P = 4.28e−06), catalytic activity acting on RNA (GO:0140098, P = 6.20e−20), and N-methyltransferase activity (GO:0008170, P = 0.0242).
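The community detection step described above can be sketched with networkx's Louvain implementation (available since networkx 2.8). The file name, the column layout of the exported STRING interactions, and the confidence cutoff of 400 are assumptions for illustration only.

```python
import networkx as nx

# Assumes STRING interactions were exported as a whitespace-separated edge list:
# protein1 protein2 combined_score (header in the first row).
G = nx.Graph()
with open("string_interactions.tsv") as fh:
    next(fh)                                  # skip header line
    for line in fh:
        a, b, score = line.split()[:3]
        if float(score) >= 400:               # keep medium-confidence edges or better
            G.add_edge(a, b)

# Louvain community detection on the PPI network.
communities = nx.community.louvain_communities(G, seed=0)
for i, comm in enumerate(sorted(communities, key=len, reverse=True)):
    print(f"Community {i}: {len(comm)} genes")
```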
Presuming that the top 20 false-positive genes, based on confidence ranking, represent new predictions of the model, Existing researches reveal that these genes play critical roles in various RNA- or methylation-related biological processes. DDX56 (score = 0.979), a member of the DEAD box RNA helicase family, plays a crucial role in several RNA-related biological processes^[114]44. METTL18 (score = 0.971) has been reported to be closely related to METTL17^[115]45. METTL18 is a methyltransferase, while METTL17 regulates mitochondrial ribosomal RNA modification^[116]45. PUS3 (score = 0.969) is involved in RNA pseudouridylation in humans^[117]46 and has also been studied in relation to methylation and sepsis^[118]47. PRPF4B (score = 0.967), a member of the Clk/Sty kinases family, encodes the pre-mRNA processing factor 4 kinase (PRP4K)^[119]48. PAPOLG (score = 0.965), a poly(A) polymerase (PAP) family member, plays a key role in mRNA stability and translational modifications^[120]49. EXOSC2 (score = 0.962) provides catalytic activity to the RNA exosome^[121]50. PPIG (score = 0.955) interacts directly with the phosphorylated RNA Pol II carboxy-terminal domain (CTD) via its RS domains both in vivo and in vitro^[122]51.TRUB2 (score = 0.953) is involved in mitochondrial mRNA pseudouridylation , regulating 16S rRNA and mitochondrial translation^[123]52. DHX38 (score = 0.946) encodes RNA helicase PRP16, essential for disrupting the U2-U6 helix I between the first and second catalytic steps of the splicing^[124]53. SF3B1 (score = 0.945) plays a central role in the RNA splicing function of SF3b^[125]54. UTP23 (score = 0.945) is a trans-acting factor involved in the early assembly of 18S rRNA^[126]55. CHTOP (score = 0.944) binds competitively with arginine methyltransferases PRMT1 and PRMT5, promoting the asymmetric or symmetric methylation of arginine residues^[127]56. DDX47 (score = 0.944) plays a role in pre-rRNA processing^[128]57. RRP9 (score = 0.942) is involved in the maturation of ribosomal RNA, ensuring proper ribosome formation and function. HARS2 (score = 0.941) encodes the mitochondrial histidyl-tRNA synthetase (mt-HisRS)^[129]58. UPF2 (score = 0.940) gene plays a role in mRNA degradation^[130]59. DDX46 (score = 0.938) recruits ALKBH5, an eraser of the RNA modification N6-methyladenosine (m6A), through its DEAD helicase domain to demethylate m6A-modified antiviral transcripts^[131]60. Notably, recent research^[132]61 reported that CSTF2T(score = 0.941)’s paralog, CSTF2, acts as a mediator of m6A deposition, thereby regulating mRNA m6A modification. This suggests that CSTF2T may play a similar role in the m6A methylation process. Influence of highly informative features In this study, we used a dataset (full set) filtered for zero values and variance. This dataset consists of 26,936 genes and 1517 features, including 62 GO annotations, 1110 GTEx, 39 TISSUES, 162 PathCommons,108 HPA, one InterPro, and 35 BioGPS features. While GO annotations and InterPro features are informative, they may bias the model’s predictions towards pre-existing methylation-related annotations. To investigate the impact of GO and InterPro features on the model, we further removed all GO and InterPro features from the dataset, resulting in a reduced set with 1454 features. We retrained the classifiers (started from generator’s training) using the reduced set and calculated scores for all genes across different models. The probability score distribution curves are shown in Supplementary Figure [133]4. 
The prediction trends across all five classifiers were very similar: the classifiers tended to assign low confidence to most genes. In the low confidence interval (score < 0.5), the peak for the reduced set without GO and InterPro features shifted slightly to the right. This suggests that the reduced set slightly increased confidence in the low confidence interval, indicating that GO and InterPro features prevented classifiers from blindly assigning high confidence to most genes. As shown in Supplementary Figure [134]5 and Supplementary Table [135]1, although the test results declined after training with the reduced set, GO enrichment analysis revealed a strong association with RNA methylation. This suggests that, while the reduced set may lower evaluation metrics, it does not strongly bias the model’s predictions. Additionally, we conducted a feature importance analysis based on the LR model. Supplementary Figure [136]6 shows the importance rankings of the top 50 features in both the full set and the reduced set. In the full set, GO (26/50) and InterPro (1/50) features accounted for over 50% (27/50) of the top positions. Their importance scores were significantly higher than those of GTEx and HPA, indicating that the model was more heavily influenced by GO and InterPro features than by other types. In the top 50 rankings of the reduced set (without GO and InterPro features), GTEx and HPA appeared most frequently. This is likely due to that GTEx and HPA provide crucial insights into gene expression across various tissues, which can help identifying tissue-specific patterns of RNA methylation. Conclusion In this study, we highlight the importance of RNA methylation and address the imbalance learning problem in predicting genes associated with RNA methylation pathways. We propose a deep learning-based Wasserstein generative adversarial network with gradient penalty (WGAN-GP) approach to alleviate the imbalance problem in gene prediction tasks. We use five machine learning classifiers (SVM, GB, GNB, RF, LR) for model training and evaluate performance using various metrics. Experimental results demonstrate that the improved WGAN-GP effectively captures the features of gene expression data and accurately models the distribution of positive samples in the feature space. Additionally, it generates synthetic samples that are difficult to distinguish from real positive samples and exhibit better diversity compared to other oversampling methods. Our study presents an effective and performance-challenging approach compared to conventional schemes. Furthermore, our work has several potential extensions. First, while our study incorporates gene expression knowledge bases, future research should integrate additional data, such as gene interactions, to further enhance predictions of genes associated with RNA methylation pathways. Second, employing advanced techniques to reduce feature dimensionality could improve prediction performance. Lastly, exploring end-to-end models that map inputs directly to outputs, without relying on intermediate representations, could reduce the need for manual intervention in the modeling process. Supplementary Information [137]Supplementary Information.^ (2.1MB, pdf) Acknowledgements