Abstract
RNA methylation modification influences various processes in the human
body and has gained increasing attention from scholars. Predicting
genes associated with RNA methylation pathways can significantly aid
biologists in studying RNA methylation processes. Several prediction
methods have been investigated, but their performance is still limited
by the scarcity of positive samples. To address the challenge of data
imbalance in RNA methylation-associated gene prediction tasks, this
study employed a generative adversarial network to learn the feature
distribution of the original dataset. The quality of synthetic samples
was controlled using the Classifier Two-Sample Test (CTST). These
synthetic samples were then added to the data blocks to mitigate class
distribution imbalance. Experimental results demonstrated that
integrating the synthetic samples generated by our proposed model with
the original data enhances the prediction performance of various
classifiers, outperforming other oversampling methods. Moreover, gene
ontology (GO) enrichment analyses further support the relevance of the
predicted genes to RNA methylation pathways. The PyTorch model for
generating gene samples is available at
https://github.com/heyheyheyheyhey1/WGAN-GP_RNA_methylation
Keywords: RNA methylation, Pathways, Machine learning, Generative
adversarial nets
Subject terms: RNA, Computational biology and bioinformatics,
Computational models, Protein function predictions
Introduction
RNA methylation refers to the chemical process in which methyl groups
are selectively added to specific bases in RNA (forming, for example,
methyladenine) under the catalytic action of methyltransferases. So
far, researchers have discovered over
160 RNA modifications^[31]1 that are extensively distributed across
eukaryotes and prokaryotes. Among these, the most prevalent form of RNA
modification is methylation. By studying this process, researchers have
discovered associations between RNA methylation modifications and
various fields, including cancer^[32]2,[33]3, cardiovascular
disease^[34]4, embryonic development^[35]5, and cell
differentiation^[36]6. These findings highlight the significance of RNA
methylation and suggest its potential implications in diverse
biological processes and diseases. Further research in this area may
shed more light on the functional roles of RNA methylation and its
therapeutic applications^[37]7.
Numerous reports have revealed that specific enzymes or proteins within
the RNA methylation pathway regulate distinct functions and biochemical
processes in the human body^[38]8–[39]11. These studies are based on
the identification of proteins, genes, or specific enzymes associated
with RNA methylation. Despite the significant role of the RNA
methylation process in various aspects of the human body, researchers have limited
knowledge about the pathways involved. This limitation primarily arises
from the substantial financial and time investments required to
identify gene functions using experimental wet laboratory
methods^[40]12,[41]13. Fortunately, due to intensive research of
large-scale omics data and artificial intelligence methods, researchers
can now explore target gene functions using computational approaches
that are faster and more cost-effective than wet methods. Several
reports have employed machine learning-based methods to identify genes
associated with RNA methylation pathways^[42]14. Tsagkogeorga et
al.^[43]14 used expression data from genes associated with RNA
methylation as positive samples and trained machine learning models to
predict additional potential genes. The results suggest that ‘dry’
computational methods could be used to identify RNA
methylation-associated genes. However, due to the limited number of
known samples and the highly imbalanced ratio of positive to negative
samples, the classifier may incorrectly identify majority class samples
as minority class samples^[44]15. Any misclassification of genes
obtained through computational methods leads to increased research
costs, including time and funding expenses. Thus, avoiding overfitting
and improving the effectiveness of classification methods in highly
imbalanced and high-dimensional data remain challenging.
Various approaches have been developed to address the issue of data
imbalance, with the most common approach involving the oversampling of
minority class samples^[45]16. Random oversampling is a widely employed
algorithm that duplicates randomly selected minority class samples from
the original dataset to balance the sample distribution. However, this
algorithm primarily enhances the classifier’s weights on the minority
class but lacks diversity in generating new samples^[46]17. Another
effective technique, the synthetic minority oversampling technique
(SMOTE)^[47]18, generates samples that lie near the minority class in
the feature space, introducing diversity through interpolation among
neighboring samples. However, SMOTE has limited impact on most
classifiers trained on high-dimensional data, particularly in the field
of bioinformatics^[48]19. This limitation also applies to certain SMOTE
variants (e.g., Borderline-SMOTE^[49]20). Directly applying these
algorithms to highly imbalanced datasets may not alleviate overfitting
and enhance classification performance.
Deep learning offers an alternative approach to address the challenge
of data imbalance^[50]21. Researchers utilized deep learning techniques
to generate synthetic samples that supplement the original dataset. One
common approach is the use of Generative Adversarial Networks^[51]22
(GANs), which consist of two networks that ‘compete’ against each
other: a generator and a discriminator. The generator is responsible
for producing synthetic data, while the discriminator evaluates the
authenticity of the data. Through an adversarial training process,
these two networks compete until they reach equilibrium, where the
discriminator can no longer distinguish between real and synthetic
data, and the generator cannot improve its sample generation further.
This process allows the generator to learn the underlying data
distribution and generate synthetic samples that can be utilized for
downstream tasks. Several studies have confirmed that GANs and their
variants can generate synthetic data by learning the distribution of
minority class samples to supplement the original dataset. For example,
Jing et al.^[52]23 utilized attention-augmented convolutional layers
and an adversarial training mechanism to enhance the identification of
transcription factor binding sites (TFBSs). Ma et al.^[53]24 utilized a
deep convolutional generative adversarial network (DC-GAN) to classify
white blood cells. Gadermayr et al.^[54]25 used a cycle GAN to translate
images with healthy conditions into images with affected conditions for
augmenting labeled segmentation data. Frid-Adar et al.^[55]26 employed
GANs to synthesize CT image data and demonstrated that the generated
images achieved favorable results despite a limited number of samples.
Not only have GANs shown remarkable success in image synthesis, but
Xiao et al.^[56]27 also applied a WGAN to learn high-dimensional
features from cancer gene expression data. Wan and Jones^[57]28
introduced CTST to evaluate the quality of WGAN-GP-synthesized protein
feature samples and to improve protein function prediction performance.
Despite the remarkable success of GANs in balancing datasets, their
application to highly imbalanced gene expression datasets has,
surprisingly, not yet been studied, to the best of our knowledge. Thus,
we propose a novel GAN-based approach to address the challenges posed
by highly imbalanced data in predicting genes related to RNA
methylation pathways. Firstly, we enhance WGAN-GP^[58]29 to improve its
capability in learning the distribution of gene expression data.
Secondly, we employ the Classifier Two Sample Test^[59]30 (CTST) to
regulate the quality of synthetic samples generated by WGAN-GP.
Thirdly, we implement a data block structure and incorporate the
optimal samples, as identified by CTST, into each block to augment the
minority samples. We then employ various classifiers to compare the
performance metrics with conventional sampling methods. The results
suggest that the samples generated by WGAN-GP integrate well with the
minority class samples, as indicated by a CTST accuracy close to 0.5.
The data block structure mitigates the risk of overfitting caused by
highly unbalanced data, reducing the need for synthetic samples. The
newly added samples effectively complement the minority class sample
data, leading to improved predictive performance.
Method
Workflow of the study
Generally, our proposed method consists of four steps (Fig. [60]1). In
the first step, we perform feature extraction on the dataset to reduce
dimensionality and then train the WGAN-GP using the positive samples.
In the second step, we evaluate all WGAN-GP weights using CTST and
select the model weights that yield the best CTST results for
downstream task oversampling. In the third step, we divide the training
set into multiple data blocks and oversample each block using the
selected WGAN-GP model to achieve balanced data blocks. In the fourth
step, we train the machine learning models using these balanced data
blocks and evaluate their performance using the test set.
Fig. 1. Workflow of our study.
Data set and feature transformation
We use the dataset created and made available by Tsagkogeorga et
al.^[63]14. The original dataset was collected from
[64]Harmonizome^[65]31 website. We selected 15 one-hot-encoded datasets
to construct our dataset. The dataset was initially standardized to
continuous values ranging from 0 to 1, or − 1 to 1, where 1 indicates a
strong positive gene-feature association, 0 indicates no observed
gene-feature association, and − 1 indicates a strong negative
gene-feature association. The genes within these 15 datasets, along
with their associated features, exhibit variations in both quantity and
distribution. The number of genes and features differs across datasets.
Some genes may appear in one or multiple datasets, and certain features
may be present across various datasets as well. Each gene encompasses
50,176 features. Within the 15 datasets, some genes lack specific
features that are prevalent in other datasets. Consequently, features
absent from a gene in this context are marked as null. The null values
are subsequently assigned a value of 0. While a value of 0 signifies
the absence of gene-feature associations, it does not impact the
ultimate outcomes of training and prediction. To obtain highly
informative features and reduce dimensionality, we remove the following
two types of features: (i) those with zero values in more than 70% of
cases, and (ii) those with variance less than 16% of the total
variance.
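As an illustration, the following is a minimal sketch of this filtering step (hypothetical helper name; the matrix is assumed to be a NumPy array of genes by features, and interpreting the 16% variance criterion as a fraction of the mean per-feature variance is an assumption, since the criterion is stated only verbally):

```python
import numpy as np

def filter_features(X, max_zero_frac=0.70, var_frac=0.16):
    """Drop uninformative columns from a genes-by-features matrix X.

    (i)  features that are zero in more than `max_zero_frac` of the genes;
    (ii) features whose variance falls below `var_frac` of the mean
         per-feature variance (assumed reading of "16% of the total variance").
    """
    zero_rate = np.mean(X == 0, axis=0)
    variances = X.var(axis=0)
    var_threshold = var_frac * variances.mean()
    keep = (zero_rate <= max_zero_frac) & (variances >= var_threshold)
    return X[:, keep], keep
```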
The improved data block structure for balancing
Given the significant imbalance between positive and negative samples
in our dataset, directly applying oversampling may not improve
classifier performance (as discussed in subsection “[66]Ablation
study”). Therefore, we adopted an improved data block structure for
oversampling. Specifically, we divide the training dataset into subsets
and then use oversampling to balance these subsets. This approach,
initially introduced by Yin et al.^[67]32, integrates undersampling and
oversampling schemes, which reduces information loss from undersampling
by slicing into multiple chunks while supplementing the dataset. We
enhance this data block structure by ensuring that the proportion of
positive and negative samples in these subsets is intentionally
unbalanced to better integrate with the oversampling methods.
Specifically, these subsets are created by combining the same positive
samples with different negative samples (the positive samples represent
the minority class). The number of positive and negative samples in
each subset is allocated proportionally based on the hyperparameters.
Generally, each subset contains fewer positive samples than negative
samples, aiming to achieve a balanced distribution of positive and
negative samples after oversampling.
Assuming the numbers of positive and negative samples are $N_{pos}$ and
$N_{neg}$, respectively, and the sets of positive and negative samples
are $S_{pos}$ and $S_{neg}$, respectively, $S = S_{pos} \cup S_{neg}$
denotes the entire dataset, $\lambda = N_{neg}/N_{pos}$ denotes the
imbalance ratio of the dataset, and $\alpha$ is a hyperparameter that
controls the oversampling rate. The pseudocode for the data block
construction process to achieve balance is shown in Algorithm 1.
Algorithm 1. Data block balancing.
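A minimal sketch of the block construction is shown below (hypothetical function; one plausible reading of $\alpha$ is assumed, in which each block receives roughly $N_{pos}/\alpha$ negatives so that the oversampler only has to add $N_{pos}(1/\alpha - 1)$ synthetic positives per block):

```python
import numpy as np

def build_blocks(X_pos, X_neg, alpha=0.4, seed=0):
    """Pair all positive samples with disjoint chunks of negative samples.

    Assumed reading of the oversampling rate alpha: each block holds about
    n_pos / alpha negatives, so adding (n_neg_block - n_pos) synthetic
    positives balances the block; alpha = 1 would require no synthesis.
    """
    rng = np.random.default_rng(seed)
    n_pos = len(X_pos)
    neg_per_block = int(np.ceil(n_pos / alpha))
    order = rng.permutation(len(X_neg))
    blocks = []
    for start in range(0, len(X_neg), neg_per_block):
        neg_chunk = X_neg[order[start:start + neg_per_block]]
        blocks.append((X_pos, neg_chunk))  # synthetic positives are added later
    return blocks
```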
The small-sample augmentation based on WGAN-GP
Generative Adversarial Networks (GANs) are neural networks capable of
learning high-dimensional feature distributions from data
samples^[70]33. A GAN generally consists of two neural networks: a
generator G and a discriminator D. The generator receives random noise
as input and produces samples that are intended to be indistinguishable
from real data. The discriminator’s task is to distinguish between the
fake samples generated by G and the real samples from the training
data. During training, the generator and the discriminator engage in a
minimax two-player game: the generator aims to create samples that can
deceive the discriminator, while the discriminator tries to correctly
identify whether samples are real or fake. Ideally, a Nash equilibrium
is reached when the generator can no longer produce samples that the
discriminator can distinguish from real samples, and the discriminator
is unable to differentiate between real and fake samples. Formally, let
V(G, D) be the objective function. The minimax game is formulated as:
$$\min_G \max_D V(G,D) = \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1-D(G(z)))] \tag{1}$$

where $x$ is sampled from the real data distribution $p_{data}$ and $z$
is sampled from the prior noise distribution $p_z$. $\mathbb{E}$
denotes the expectation, $D(x)$ denotes the score assigned by the
discriminator D to $x$, and $G(z)$ denotes the fake samples generated
by G from noise sampled from the prior distribution $p_z$. In a GAN,
the generator and discriminator are trained alternately; that is, the
weights of one network are fixed while the other network is trained.
Specifically, when training the discriminator D, the generator G is
kept fixed and D is optimized by maximizing $\log D(x)$; conversely,
when training the generator G, the discriminator is fixed and G is
optimized by minimizing $\log(1-D(G(z)))$.
However, conventional GANs face training difficulties primarily due to
the Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence
loss functions. Specifically, when dealing with non-overlapping
distributions, the JS divergence between $p_{data}$ and $G(z)$ can
approach a constant value, leading to issues such as gradient vanishing
and mode collapse^[71]28. WGAN addresses the problem of gradient
vanishing by replacing these divergence measures with the Wasserstein
distance. Equation ([72]2) gives the value function of WGAN:

$$V = \mathbb{E}_{x\sim p_{data}}[f_w(x)] - \mathbb{E}_{z\sim p_z}[f_w(g_\theta(z))] \tag{2}$$
where $f_w$ is a function that satisfies the K-Lipschitz constraint,
and $w$ and $\theta$ are the parameters of $f$ and $g$, respectively.
To satisfy the K-Lipschitz constraint, $w$ is restricted so that $f_w$
has a slope of at most K between any two points. Given the powerful
fitting capability of neural networks, we can simply replace $f$ and
$g$ with D and G, respectively. Thus, the objective function of WGAN is
given by Eq. ([73]3):

$$\min_G \max_D V(G,D) = \mathbb{E}_{x\sim p_{data}}[D(x)] - \mathbb{E}_{z\sim p_z}[D(G(z))] \tag{3}$$
Given that D must satisfy the K-Lipschitz constraint, all parameters in
D are clipped to a fixed range after each training epoch, for example,
from − 0.01 to 0.01. Compared to traditional GANs, WGAN incorporates
the K-Lipschitz constraint to address the issue of gradient vanishing
and its effective smoothing of the value function improves
trainability. Nevertheless, WGAN’s approach to enforcing the
K-Lipschitz constraint through parameter clipping can be somewhat
crude, as this clipping process can significantly impact the network’s
fitting capability. This often results in a large number of weight
values being concentrated along the clipped boundaries, making it
challenging to fit complex distributions and significantly impairing
performance^[74]34. WGAN-GP adopts a more refined approach by
incorporating a gradient penalty to enforce the K-Lipschitz constraint.
Specifically, during the training of the discriminator, a gradient
penalty term is added to the discriminator's loss function. Let GP
denote the gradient penalty; Eq. ([75]4) gives its calculation:
$$GP = \gamma\,\mathbb{E}_{\hat{x}\sim p_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right] \tag{4}$$
The objective function of WGAN-GP is then rewritten as:

$$\min_G \max_D V(G,D) = \mathbb{E}_{x\sim p_{data}}[D(x)] - \mathbb{E}_{z\sim p_z}[D(G(z))] - GP \tag{5}$$
where $\hat{x}$ is obtained by interpolating between real samples drawn
from $p_{data}$ and generated samples, $\gamma$ is the gradient penalty
coefficient, typically a positive real number chosen empirically, and
$\nabla_{\hat{x}} D(\hat{x})$ denotes the gradient of $D(\hat{x})$ with
respect to $\hat{x}$.
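A minimal PyTorch sketch of the gradient penalty in Eq. (4) and of one discriminator (critic) update under Eq. (5) is given below (hypothetical function names; γ = 10 is a common WGAN-GP choice rather than a value reported here):

```python
import torch

def gradient_penalty(D, real, fake, gamma=10.0):
    """GP = gamma * E[(||grad_xhat D(xhat)||_2 - 1)^2], as in Eq. (4)."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    return gamma * ((grads.norm(2, dim=1) - 1) ** 2).mean()

def discriminator_step(D, G, real, z_dim, opt_D):
    """Maximise E[D(x)] - E[D(G(z))] - GP, i.e. minimise its negative (Eq. 5)."""
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = G(z).detach()
    loss_D = D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    return loss_D.item()
```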
Furthermore, the CNN structure originally used in WGAN-GP is designed
for image feature extraction and is not well-suited for gene expression
data. Therefore, it is crucial to adapt WGAN-GP to better accommodate
gene expression data. To this end, we substitute the CNN structure in
WGAN-GP with a fully-connected neural network and perform additional
optimization of the network parameters to tailor it specifically to our
dataset. Each fully-connected layer is followed by a Leaky ReLU
(Rectified Linear Unit) activation function with a negative slope of
0.2. In the discriminator, the number of neurons in each layer
progressively decreases until it reaches 1, while in the generator, the
number of neurons in each layer progressively increases until it
matches the number of features in the dataset.
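A minimal sketch of the fully-connected generator and discriminator described above (the specific layer widths and noise dimension are illustrative assumptions; only the Leaky ReLU slope of 0.2 and the widening/narrowing pattern follow the text):

```python
import torch.nn as nn

def mlp(sizes):
    """Stack of linear layers, each hidden layer followed by LeakyReLU(0.2)."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.LeakyReLU(0.2))  # negative slope 0.2, as in the text
    return nn.Sequential(*layers)

n_features, z_dim = 1517, 128                # z_dim is an illustrative choice
# generator: widths grow from the noise dimension up to the feature count
G = mlp([z_dim, 256, 512, 1024, n_features])
# discriminator (critic): widths shrink down to a single score, no sigmoid (WGAN-GP)
D = mlp([n_features, 1024, 512, 256, 1])
```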
Further generator selection
In this study, we utilize the CTST approach to select the generator
that produces the optimal synthetic samples, rather than evaluating the
samples themselves, which differs from^[76]28. Optimal synthetic
samples are those that closely mimic the feature distribution of real
positive training samples without being identical to the real gene
expression samples. In essence, when the number of synthetic samples
equals that of real samples, CTST employs a binary classifier to
differentiate between the two sets of samples. If cross-validation
results indicate that the classifier can easily distinguish between the
two sets, the synthetic samples do not adequately reproduce the real
distribution. Specifically, given two
sample sets of equal size that follow two distributions, P and Q, CTST
tests the null hypothesis that P is equal to Q. If the null hypothesis
is accepted, the classifier’s accuracy will be near chance level (50%);
if rejected, the accuracy will approach 100%^[77]35.
During the training process of WGAN-GP, we employed the CTST after each
epoch to promptly assess the generator and retain the optimal network
weights for downstream tasks. Given the time-intensive nature of
network training, cross-validation was performed only once, and the
network weights were retained based on the accuracy of this validation.
However, these weights may be influenced by chance due to the
probabilistic nature of data generation. For instance, the network may
produce samples that closely resemble a specific distribution,
resulting in a CTST accuracy close to 0.5. At this point, the network
has partially captured the target sample distribution, but the
retained weights may still affect performance in downstream tasks. To
assess the impact of generator fitting over different epochs, we
conducted additional CTST experiments after WGAN-GP training. Multiple
CTSTs were performed with different weights, and the generator with the
mean accuracy closest to 0.5 was selected as the data generator for the
downstream task. In this study, we conduct the CTST using the Support
Vector Machine (SVM) classification algorithm.
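A minimal sketch of the CTST as used here (hypothetical helper; an SVM is cross-validated on the task of separating equal-sized sets of real and synthetic samples, and a mean accuracy near 0.5 indicates the two sets are statistically indistinguishable):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def ctst_accuracy(real, synthetic, folds=5):
    """Classifier Two-Sample Test: ~0.5 -> indistinguishable, ~1.0 -> distinguishable."""
    n = min(len(real), len(synthetic))           # equal-sized sets
    X = np.vstack([real[:n], synthetic[:n]])
    y = np.hstack([np.ones(n), np.zeros(n)])
    scores = cross_val_score(SVC(), X, y, cv=folds, scoring="accuracy")
    return scores.mean()
```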
Prediction and evaluation
The dataset for this study comprises 26,936 gene expression data
samples, including 92 samples related to RNA methylation (positive
samples) and 26,304 unlabeled samples (regarded as negative samples).
The dataset was split into training and test sets in an 8:2 ratio. A
total of 74 positive samples and 21,044 negative samples were utilized
for training and cross-validation (CV) to determine hyperparameters via
grid search^[78]36,[79]37, while 18 positive samples and 5260 negative
samples were allocated for testing. The positive training samples were
initially used to train WGAN-GP in order to acquire generators for
producing synthetic samples for downstream tasks, and later employed
for training the classification model, as shown in Table [80]1.
Table 1. Numbers of gene samples for training, CV, and testing.

                          Train and CV    Test    Total
Positive gene numbers     74              18      92
Negative gene numbers     21,044          5260    26,304
Total                     21,118          5278    26,936
To prevent mode collapse during the training process of WGAN-GP, we
employ an Early Stopping mechanism^[82]38. If the CTST result, which
measures the dissimilarity between synthetic samples generated by the
generator and real samples, falls below a preset threshold of 0.02, the
weights from that epoch are saved and made available for downstream
tasks. However, to increase the number of generator models available
for selection, training does not stop immediately upon reaching the
Early Stopping condition; instead, it continues until the preset number
of epochs is reached. By analyzing the probability density curve of
results between 0.4 and 0.6, we observed a peak around 0.52, which
ensures that most useless generators are filtered out while retaining a
relatively large number of generators. Therefore, we chose 0.02 as the
default threshold for preliminary filtering, as shown in Supplementary
Figure [83]1.
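A sketch of how the 0.02 threshold can gate which generator weights are kept during training (hypothetical helper; `train_one_epoch` stands in for the adversarial updates and `ctst_fn` for a CTST routine such as the one sketched earlier):

```python
import copy
import numpy as np
import torch

def collect_candidate_generators(G, D, real_pos, z_dim, n_epochs,
                                 train_one_epoch, ctst_fn, threshold=0.02):
    """Save generator weights whenever |CTST accuracy - 0.5| < threshold."""
    saved = []
    for epoch in range(n_epochs):
        train_one_epoch(G, D)                    # one epoch of adversarial updates
        with torch.no_grad():
            z = torch.randn(len(real_pos), z_dim)
            synthetic = G(z).cpu().numpy()
        acc = ctst_fn(np.asarray(real_pos), synthetic)
        if abs(acc - 0.5) < threshold:
            saved.append((epoch, acc, copy.deepcopy(G.state_dict())))
    # training runs to the preset epoch count; afterwards, repeated CTSTs pick
    # the saved generator whose mean accuracy lies closest to 0.5
    return saved
```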
To evaluate the performance of predicting genes related to RNA
methylation pathways, we employ five machine learning classifiers:
Support Vector Machine (SVM), Gradient Boosting (GB), Gaussian Naive
Bayes (GNB), Random Forest (RF), and Logistic Regression (LR). After
dividing the data based on the oversampling rate $\alpha$, these
classifiers are trained on an equal number of data blocks. For
SVM, GB, and RF, we determine the optimal hyperparameters using a
grid-search technique, selecting the hyperparameter combination that
achieves the highest accuracy through 3-fold cross-validation.
Five metrics are utilized for evaluating the classifiers’ performance:
Accuracy, Precision, Recall, F1 score, and Matthews Correlation
Coefficient (MCC). Given the highly imbalanced ratio of positive to
negative samples in the test set, these metrics are computed across
multiple data blocks derived from the test set, and the average
performance across all blocks is taken as the final test result. Unlike
the training set, the test set’s data block splitting process does not
include the oversampling step.
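A minimal sketch of the block-wise evaluation (hypothetical helper; each test block is scored separately and the metrics are averaged, with no oversampling applied to test blocks):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

def evaluate_on_blocks(model, test_blocks):
    """test_blocks: iterable of (X_block, y_block) pairs split from the test set."""
    metrics = {"acc": [], "prec": [], "rec": [], "f1": [], "mcc": []}
    for X_block, y_block in test_blocks:
        y_pred = model.predict(X_block)
        metrics["acc"].append(accuracy_score(y_block, y_pred))
        metrics["prec"].append(precision_score(y_block, y_pred, zero_division=0))
        metrics["rec"].append(recall_score(y_block, y_pred, zero_division=0))
        metrics["f1"].append(f1_score(y_block, y_pred, zero_division=0))
        metrics["mcc"].append(matthews_corrcoef(y_block, y_pred))
    return {k: float(np.mean(v)) for k, v in metrics.items()}
```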
Result
Influence of hyperparameters and training strategies
To investigate the effect of the oversampling rate $\alpha$ on model
performance, we performed cross-validation on the training set using
various oversampling rates. Our experiments involved nine different
$\alpha$ values across five classifiers. Generally, each classifier
exhibited varied validation performance with different $\alpha$ values,
and a fixed $\alpha$ value also produced diverse results across
classifiers. As shown in Fig. [84]2, the highest accuracy (0.9757) was
achieved with an $\alpha$ of 0.4 paired with the Logistic Regression
(LR) classifier. Conversely, the lowest accuracy (0.8771) occurred with
an $\alpha$ of 0.1 paired with the Gradient Boosting (GB) classifier, a
difference of approximately 14.3%. Notably, Random Forest (RF) was the
most affected by variations in $\alpha$, with a performance difference
of 9.49%. In contrast, Gaussian Naïve Bayes (GNB) was the least
affected, showing a difference of only 1.27% between its best (0.9539)
and worst (0.9412) results. The performance drops for the other
classifiers relative to their best results were 4.48% for Support
Vector Machine (SVM), 6.34% for GB, and 1.82% for LR. This suggests
that inadequate synthetic samples ($\alpha$ close to 1) or excessive
synthetic samples ($\alpha$ close to 0) can result in performance
degradation. Based on these validation results, unless otherwise
specified, we set $\alpha$ to 0.4 for the following analysis.
Fig. 2. Impact of the oversampling rate $\alpha$: the horizontal axis
shows the oversampling rate, the vertical axis shows the validation
accuracy, and the lines denote the different classifiers.
Ablation study
The data block structure and oversampling process are two major
components in our approach. To study the effectiveness of these
components, we created two variants of the scheme: one without the data
block structure and another without the oversampling process. These two
variants were used to train five machine learning models. We then
evaluated their performance using four metrics to assess the impact of
the data block structure and oversampling on overall model
performance. Figure [86]3 depicts the results of the ablation study.
Fig. 3. Overview of the ablation study.
In general, our proposed complete scheme exhibits superior predictive
performance across all five classifiers. However, without the data
block structure or proper oversampling, performance degradation occurs
to varying extents. Among the classifiers, the GNB classifier shows the
least sensitivity to these components, with performance reductions of
1.76% (MCC), 0.7% (F1 score), and 3.1% (recall) when the data block
structure is removed, and 7.5% (MCC), 3.4% (F1 score), and 2.7% (recall)
when WGAN-GP oversampling is missing. In contrast, the other four
classifiers demonstrate greater sensitivity to the absence of the data
block structure. For instance, GB experiences a 26.9% drop in
accuracy, 60.2% in recall, 44.6% in F1 score, and 42.3% in MCC.
Similarly, LR shows decreases of 16.9% (accuracy), 40.9% (recall),
24.3% (F1 score), and 27.0% (MCC). RF faces reductions of 19.4%
(accuracy), 45.8% (recall), 28.9% (F1 score), and 30.7% (MCC), while
SVM shows declines of 26.5% (accuracy), 59.8% (recall), 44.3% (F1
score), and 41.7% (MCC). When the oversampling process is absent, the
situation is more moderate: MCC drops by 5.3% for GB, 5.9% for SVM,
10.2% for RF, and 6.4% for LR, with similar declines in the other
metrics. These results clearly
demonstrate that merely generating a large number of synthetic samples
to address imbalanced datasets is ineffective for highly imbalanced
classification problems. In contrast, our proposed scheme, which
integrates the benefits of both oversampling and undersampling,
achieves better performance. By incorporating data blocks, it
significantly reduces the need for synthetic samples, preserves the
original sample distribution, and effectively expands the dataset. The
data block structure also facilitates the use of multiple weak
classifiers, enabling ensemble techniques that capture diverse patterns
in the data, reducing individual model bias and improving
generalization.
A total of 61 weights were saved based on the preset parameters. The
additional CTST experimental results for these 61 weights are displayed
in Fig. [89]4, sorted by variance. After conducting 200 repetitions of
the experiment, slight differences were observed in the mean values of
all generator test results. The majority of the generator test results
still adhere to the preset hyperparameter criteria, falling between
0.498 and 0.502 as denoted by the green dots in Fig. [90]4. A few
generators produced test results that deviate from the preset criteria,
yet remain within the acceptable range for downstream tasks, ranging
from 0.4 to 0.498 and 0.502 to 0.6, indicated by the orange points.
These generators also show an increase in variance. In contrast, a few
generators, represented by the red dots, exhibit test results
significantly deviating from the 0.5 criterion, with an absolute
difference exceeding 0.1, alongside higher variance. In such cases, the
samples generated by these generators could introduce bias into the
classifier model’s training. These results affirm the importance of
incorporating additional CTST to refine the selection process.
Fig. 4. CTST results over the 61 saved model weights, sorted by variance.
WGAN-GP successfully generated high-quality positive samples
Overall, the samples synthesized by WGAN-GP exhibited a strong
alignment with the original positive samples. We incorporated
supplementary CTST after each training epoch. As shown in Fig. [92]5,
at the beginning of generator training, the generated samples
significantly diverged from the distribution of positive samples,
resulting in a CTST accuracy of 1.0. This observation underscores the
effective discrimination achieved by the CTST classifier between the
generated and positive samples.
Fig. 5. Distribution changes between positive and synthetic samples:
(a) epoch 0; (b) epoch 200; (c) epoch 500; (d) epoch 800.
Around the 200th epoch, the generator began capturing the distinctive
feature distribution of the positive samples. Consequently, an overlap
emerged in the feature space between the generated and positive
samples, reducing the CTST accuracy to approximately 0.7. As training
progressed, by the 500th epoch this overlapping region had expanded
further, causing the CTST accuracy to fall to about 0.6.
After more than 800 training epochs, WGAN-GP successfully
internalized the genetic feature distribution of authentic positive
samples. Evidently, the CTST accuracy decreased to 0.5 and exhibited
fluctuation around this value, signifying the generator’s ability to
effectively align with the positive sample distribution. Additionally,
the training loss curve further indicated that the generator attained a
local optimum as the CTST accuracy reached 0.5.
Furthermore, independent CTST evaluations on each saved generator model
unveiled that more than half of the generators consistently uphold a
classifier accuracy around 0.5. This result highlights the
effectiveness of WGAN-GP in producing high-quality synthetic samples,
as illustrated in Fig. [95]5.
While WGAN-GP partially addressed the training difficulties of
conventional GANs by incorporating gradient penalties, proper neural
network design and hyperparameter
tuning remained crucial for successful GAN training. In addition to
visualizing samples at various epochs, observing changes in loss during
GAN training proved effective^[96]27. For WGAN-GP, the generator’s loss
reflected the disparity between generated and real samples. As shown in
Fig. [97]6, the generator initially exhibited poor sample-fitting
ability, resulting in a rise in loss early in the training process.
After undergoing adversarial training with the discriminator, the
generator started capturing the underlying sample distribution, causing
a decrease in loss. As the training progressed, the generator’s loss
began to fluctuate, indicating its position in the intermediate stage.
Around 600–800 epochs, the generator and discriminator reached a Nash
equilibrium in the minimax game, as indicated by the consistent
reduction and eventual stabilization of loss. This suggests the
generative adversarial network had reached an approximate optimal
state.
Fig. 6. WGAN-GP training loss over 5000 epochs. The blue line indicates
the discriminator loss and the orange line indicates the generator loss.
Performance comparison
As shown in Table [99]2, our approach consistently exhibits superior
prediction performance across most classifier combinations. For
instance, when paired with the SVM classifier, our method improves
accuracy by 16.01%, recall by 38.81%, F1 by 23.34%, and MCC by 25.14%
compared to the second-best approach. Similarly, the combination with
GB classifier achieves respective improvements of 13.77%, 33.91%,
19.35%, and 21.87%. Additionally, when combined with RF classifier, our
approach surpasses the second-best model by 16.83%, 40.58%, 24.26%, and
26.73%. For the LR classifier, our method outperforms the runner-up by
11.69%, 30.39%, 15.92%, and 18.89%. Across most performance metrics,
our scheme consistently attains top-tier results, securing the highest
accuracy (0.9320), F1 (0.9354), and MCC (0.9327) when combined with LR.
However, with the GNB classifier, the RandomOversample scheme achieves the
best performance, boasting an average accuracy of 0.9183, a recall of
1.0, F1 of 0.9261, and an MCC of 0.8505. This suggests that the
RandomOversample method is more inclined to identify positive samples.
Nonetheless, our scheme remains competitive, producing accuracy,
recall, F1, and MCC scores that closely approach those of the leading
scheme, with only minor differences.
Table 2. Accuracy, Recall, F1, and MCC for four approaches.

Classifier   Method             Accuracy   Recall   F1       MCC
SVM          SMOTE              0.5        0        0        0
SVM          BorderlineSMOTE    0.7630     0.5263   0.6895   0.5973
SVM          RandomOversample   0.5        0        0        0
SVM          Ours               0.9232     0.9145   0.9230   0.8487
GB           SMOTE              0.7629     0.5263   0.6894   0.5970
GB           BorderlineSMOTE    0.7891     0.5789   0.7330   0.6375
GB           RandomOversample   0.7631     0.5263   0.6896   0.5976
GB           Ours               0.9269     0.9181   0.9266   0.8562
RF           SMOTE              0.7368     0.4736   0.6428   0.5570
RF           BorderlineSMOTE    0.5789     0.1578   0.2727   0.2927
RF           RandomOversample   0.5263     0.0526   0.1      0.1644
RF           Ours               0.9315     0.9321   0.9322   0.8650
LR           SMOTE              0.8413     0.6842   0.8118   0.7192
LR           BorderlineSMOTE    0.8151     0.6315   0.7735   0.6776
LR           RandomOversample   0.8153     0.6315   0.7737   0.6781
LR           Ours               0.9320     0.9354   0.9328   0.8665
GNB          SMOTE              0.9164     0.9473   0.9201   0.8365
GNB          BorderlineSMOTE    0.8094     0.6315   0.7684   0.6627
GNB          RandomOversample   0.8115     0.7894   0.8088   0.6263
GNB          Ours               0.9108     0.9785   0.9178   0.8324

The highest values for each specific classifier are highlighted in
bold, while the highest values across all classifiers are underlined.
Notably, some approaches encountered a significant problem,
misclassifying all test set samples as negative (see^[101]32, Table 3).
This led to their average accuracy, recall, F1, and MCC values dropping
to 0.5, 0, 0, 0, respectively. In such cases, these approaches were
deemed unusable. Among the three compared schemes, the combination of
SMOTE with SVM exhibits invalid values. Similarly, RandomOversample
encounters the same issue when combined with SVM. While BorderlineSMOTE
mitigates the problem, it still underperformed compared to our approach.
In contrast, our method demonstrated better data compatibility and was
more effective at handling highly imbalanced gene expression data.
To investigate the differences in predictive performance between
samples synthesized by WGAN-GP and other schemes, we substitute WGAN-GP
with three oversampling schemes, creating three new variants of our
approach. These variants balanced the data blocks rather than the
entire dataset before training, following the same methodology as our
proposed approach. A scenario with no oversampling was used as the
baseline for comparison.
Table [102]3 shows the test results for various oversampling schemes
combined with data block construction. Generally, samples generated by
WGAN-GP outperform the other four compared schemes across most metrics
and classifiers, achieving the highest overall scores in accuracy,
recall, F1, and MCC. Specifically: With the SVM classifier, WGAN-GP
attained the highest accuracy (0.9232), recall (0.9342), F1 (0.9230),
and MCC (0.8487). With the GB classifier, WGAN-GP achieved the highest
accuracy (0.9269), F1 (0.9266), and MCC (0.8562). When using the RF
classifier, WGAN-GP achieved the highest accuracy (0.9315), recall
(0.9321), F1 (0.9322), and MCC (0.8650). With the LR classifier,
WGAN-GP reached the highest accuracy (0.9320), precision (0.9327), F1
(0.9328), and MCC (0.8665). With the GNB classifier, WGAN-GP achieved
the highest Recall (0.9785), F1 (0.9178), and MCC (0.8324). Among the
results, the WGAN-GP with LR combination attained the highest overall
accuracy, F1, and MCC scores, while the combination with GNB achieved
the highest recall score. The highest overall precision was obtained by
the combination of SMOTE with GB.
Table 3. Comparison of model performance with the data block structure.
Oversample method Classifier Accuracy Precision Recall F1 MCC
WGAN-GP SVM 0.9232 0.9342 0.9145 0.9230 0.8487
GB 0.9269 0.9379 0.9181 0.9266 0.8562
RF 0.9315 0.9345 0.9321 0.9322 0.8650
LR 0.9320 0.9327 0.9354 0.9328 0.8665
GNB 0.9108 0.8673 0.9785 0.9178 0.8324
BorderlineSMOTE SVM 0.9171 0.9277 0.9087 0.9168 0.8367
GB 0.9252 0.9365 0.9159 0.9249 0.8529
RF 0.9293 0.9328 0.9291 0.9299 0.8605
LR 0.9244 0.9217 0.9321 0.9255 0.8514
GNB 0.9046 0.8658 0.9662 0.9114 0.8189
SMOTE SVM 0.9129 0.9229 0.9055 0.9121 0.8291
GB 0.9237 0.9389 0.9099 0.9229 0.8501
RF 0.9301 0.9352 0.9282 0.9307 0.8622
LR 0.9248 0.9255 0.9282 0.9257 0.8519
GNB 0.9042 0.8736 0.9518 0.9095 0.8151
RandomOversample SVM 0.9193 0.9308 0.9099 0.9190 0.8411
GB 0.9181 0.9289 0.9096 0.9178 0.8388
RF 0.9235 0.9210 0.9312 0.9249 0.8493
LR 0.9286 0.9259 0.9360 0.9297 0.8597
GNB 0.9112 0.8695 0.9751 0.9177 0.8324
ModCGAN SVM 0.8947 0.9171 0.8726 0.8927 0.7930
GB 0.9193 0.9293 0.9115 0.9184 0.8420
RF 0.9196 0.9255 0.9169 0.9194 0.8424
LR 0.8995 0.9198 0.8798 0.8980 0.8020
GNB 0.9114 0.8959 0.9366 0.9137 0.8276
ACGAN SVM 0.8381 0.9117 0.7532 0.8221 0.7538
GB 0.8728 0.9265 0.8142 0.8646 0.7538
RF 0.8706 0.9238 0.8124 0.8625 0.8625
LR 0.8461 0.9167 0.7658 0.8317 0.7048
GNB 0.8937 0.9136 0.8752 0.8926 0.7905
*Non SVM 0.8909 0.8763 0.9178 0.8947 0.7863
GB 0.8957 0.8834 0.9189 0.8989 0.7957
RF 0.8795 0.8645 0.9085 0.8840 0.7639
LR 0.8983 0.8767 0.9342 0.9029 0.8016
GNB 0.8719 0.8279 0.9506 0.8829 0.7570
*Non means no oversampling method integrated.
The highest values for each specific classifier are highlighted in
bold, while the highest values across all classifiers are underlined.
In addition to common oversampling methods, we also compared other,
more sophisticated oversampling schemes. ModCGAN^[104]39,[105]40
employs a different approach to ensure the quality of synthetic samples
by training an MLP classifier after the GAN is trained to assess
whether the generated samples meet the required quality. Similarly,
ACGAN^[106]41 builds on CGAN by forcing the discriminator to
reconstruct the auxiliary information of CGAN, thereby regulating the
quality of the synthetic samples. We integrated these methods with data
blocks to compare the impact of synthetic sample quality on predicting
RNA methylation-related genes.
Although ModCGAN and ACGAN perform similarly to our method across five
metrics and five classifiers, our method shows a 1–5% improvement on
most metrics. The key to ModCGAN lies in using an external classifier
to distinguish between real samples and noise, thereby regulating the
generator’s outputs. However, ModCGAN assumes that noise follows only a
joint distribution of uniform and Gaussian distributions. In cases
involving large datasets, as in this study, the much larger pool of negative samples may
overlap with the Gaussian or uniform distribution, leading the external
classifier to exclude some qualified synthetic samples. Although this
study does not use GANs to generate majority class samples, a small
number of positive samples might also overlap with the predefined noise
distribution. Therefore, it is more reasonable to use the
distinguishability between synthetic and real samples as a standard for
assessing the quality of synthetic samples.
It is worth noting that the WGAN-GP oversampling method, when combined
with different classifiers, can achieve the highest MCC and F1 scores
for each respective classifier. Since the MCC metric is commonly used
to evaluate binary classification models, particularly for unbalanced
datasets, this suggests that WGAN-GP generates more diverse samples,
thereby reducing the classifier’s bias towards the majority class.
GO terms and KEGG pathway enrichment analyses
With $\alpha$ set to 0.4, each classifier retained 116 models. The
average score of
these models for each gene was then calculated as the classifier’s
final score. Subsequently, high-confidence genes identified by each
classifier were utilized for GO terms enrichment analysis.
High-confidence genes are defined as the top 1% of genes ranked by
confidence. Based on the gene quantity in our dataset, each classifier
typically has 270 high-confidence genes. However, since GNB tends to
overestimate confidence for many genes, we considered the top 1500
genes in GNB’s confidence rankings as high-confidence (beyond rank
1500, other classifiers typically assign confidence levels below 0.5).
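A minimal sketch of the ranking step (hypothetical helper; the probability scores from one classifier's 116 retained models are averaged per gene and the top fraction is kept):

```python
import numpy as np

def high_confidence_genes(score_matrix, gene_ids, top_frac=0.01):
    """score_matrix: (n_models, n_genes) probabilities from one classifier's models."""
    mean_scores = score_matrix.mean(axis=0)             # final per-gene score
    n_top = max(1, int(round(top_frac * len(gene_ids))))
    top_idx = np.argsort(mean_scores)[::-1][:n_top]     # highest-confidence genes
    return [gene_ids[i] for i in top_idx], mean_scores[top_idx]
```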
Figure [107]7 compares the top 20 GO terms ranked by each classifier.
Overall, the predictions from the five classifiers exhibit strong
consistency. GO terms enrichment analysis indicates that
high-confidence genes are significantly enriched for RNA methylation
functions. The terms with the highest enrichment include methylation,
RNA modification, RNA splicing, and RNA localization. Other terms
related to rRNA, mRNA, tRNA, and ncRNA are also present in the
enrichment analysis, with over 200 genes enriched for these GO terms.
This confirms that the predicted genes are closely associated with the
RNA methylation process.
Fig. 7. GO term enrichment analyses of high-confidence genes.
Supplementary Figure [110]2 shows the KEGG pathway enrichment analysis
of the top 1% confidence genes predicted by five classifiers. Despite
GNB’s tendency to assign high confidence to most genes, the pathway
enrichment analyses of the five models showed strong similarity. All
five enrichment analysis reports highlighted pathways strongly related
to RNA methylation, such as Spliceosome, mRNA surveillance pathway, RNA
polymerase, and RNA degradation. The enrichment counts ranged from 20
to 40, and all P-values were all below 0.05, indicating high confidence
in these items. Notably, the KEGG enrichment report from the GB
classifier erroneously included a pathway related to Coronavirus
disease - COVID-19, which did not appear in other classifiers. The
count and P-value for this item were much lower than those related to
RNA methylation, further indicating that the models consistently
predict genes involved in RNA methylation.
Functional insights into the new predictions
To gain deeper insights into the role of newly predicted genes in the
RNA methylation process, we selected the top 1% confidence genes
(predicted by SVM) and constructed a PPI network using the STRING
database^[111]42. This network included 188 newly predicted genes and
76 known genes related to RNA methylation. The STRING database yielded
a densely connected PPI network for these genes, comprising 4093 edges.
Using the Louvain^[112]43 community detection algorithm, we identified
six tightly connected communities within the PPI network, as shown in
Supplementary Figure [113]3.
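A minimal sketch of the community detection step (hypothetical; the edge list is assumed to have been exported from STRING, and a Louvain implementation is assumed to be available in networkx ≥ 2.8):

```python
import networkx as nx

def detect_communities(edges, seed=42):
    """edges: iterable of (protein_a, protein_b) pairs, e.g. exported from STRING."""
    G = nx.Graph()
    G.add_edges_from(edges)
    # Louvain community detection; returns a list of node sets, one per community
    return nx.community.louvain_communities(G, seed=seed)
```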
Community 0 consists of 71 genes, including 48 known RNA
methylation-related genes and 23 newly predicted genes. Functional
analysis indicates significant enrichment such as tRNA N2-guanine
methylation (GO:0002940, P = 3.53e−06), rRNA 2'-O-methylation
(GO:0000451, P = 3.53e−06), tRNA N1-guanine methylation (GO:0002939,
P = 0.00025), tRNA (guanine-N7)-methylation (GO:0106004, P = 0.0180),
and rRNA (guanine-N7)-methylation (GO:0070476, P = 0.0180).
Community 1 consists of 18 genes, including only one known RNA
methylation-related gene: FDXACB1. Even so, functional
enrichment analysis shows significant enrichment such as histidyl-tRNA
aminoacylation (GO:0006427, P = 0.0018), phenylalanyl-tRNA
aminoacylation (GO:0006432, P = 0.0044), tRNA aminoacylation for
mitochondrial protein translation (GO:0070127, P = 0.0154),
aminoacyl-tRNA ligase activity (GO:0004812, P = 1.19e−13), and RNA
adenylyltransferase activity (GO:1990817, P = 0.0046).
Community 2 consists of 22 genes, including 20 newly predicted genes.
HSD17B10 and RSAD1 in this community are known RNA methylation-related
genes. Functional enrichment analysis suggests significant enrichment
such as tRNA 5'-leader removal (GO:0001682, P = 1.49e−16), tRNA 5'-end
processing (GO:0099116, P = 1.41e−18), tRNA splicing via
endonucleolytic cleavage and ligation (GO:0006388, P = 1.48e−06), and
tRNA-specific ribonuclease activity (GO:0004549).
Community 3 consists of 80 genes, including 7 known RNA
methylation-related genes (CMTR1, CMTR2, LARP7, MEPCE, RNGTT, RNMT,
TGS1) and 73 newly predicted genes. GO enrichment analysis shows
significant enrichment such as cap1 mRNA methylation (GO:0097309,
P = 0.0280), 7-methylguanosine RNA capping (GO:0009452, P = 1.02e−08),
RNA 5'-cap (guanine-N7)-methylation (GO:0106005, P = 0.433),
7-methylguanosine mRNA capping (GO:0006370, P = 5.86e−07), mRNA
(nucleoside-2'-O-)-methyltransferase activity (GO:0004483, P = 0.0264),
mRNA methyltransferase activity (GO:0008174, P = 0.0099), snRNA binding
(GO:0017069, P = 1.69e−06), and O-methyltransferase activity
(GO:0008171, P = 0.0479).
Community 4 consists of 16 genes, including 12 known RNA
methylation-related genes and 4 newly predicted genes. Functional
enrichment analysis reveals significant enrichment such as snRNA
(adenine-N6)-methylation (GO:0120049, P = 0.0014), S-adenosylmethionine
biosynthetic process (GO:0006556, P = 0.0033), mRNA methylation
(GO:0080009, P = 9.06e−17), mRNA
(2'-O-methyladenosine-N6-)-methyltransferase activity (GO:0016422,
P = 6.17e−06), RNA (adenine-N6-)-methyltransferase activity
(GO:0008988, P = 6.17e−06), mRNA methyltransferase activity
(GO:0008174, P = 5.16e−07), and RNA N6-methyladenosine
methyltransferase complex (GO:0036396, P = 6.55e−16).
Community 5 consists of 57 genes, including 6 known RNA
methylation-related genes (CEBPZ, DIMT1, EMG1, FBL, FBLL1, TFB2M) and
51 newly predicted genes. Functional enrichment analysis indicates
significant enrichment such as Box H/ACA RNA 3'-end processing
(GO:0000495, P = 0.0074), U5 snRNA 3'-end processing (GO:0034476,
P = 0.0119), U1 snRNA 3'-end processing (GO:0034473, P = 0.0119),
histone-glutamine methyltransferase activity (GO:1990259, P = 0.0073),
rRNA (adenine-N6,N6-)-dimethyltransferase activity (GO:0000179,
P = 0.0158), snoRNA binding (GO:0030515, P = 9.05e−11),
3'-5'-exoribonuclease activity (GO:0000175, P = 8.02e−09), rRNA
methyltransferase activity (GO:0008649, P = 4.28e−06), catalytic
activity acting on RNA (GO:0140098, P = 6.20e−20), and
N-methyltransferase activity (GO:0008170, P = 0.0242).
Presuming that the top 20 false-positive genes, based on confidence
ranking, represent new predictions of the model, existing research
reveals that these genes play critical roles in various RNA- or
methylation-related biological processes. DDX56 (score = 0.979), a
member of the DEAD box RNA helicase family, plays a crucial role in
several RNA-related biological processes^[114]44. METTL18
(score = 0.971) has been reported to be closely related to
METTL17^[115]45. METTL18 is a methyltransferase, while METTL17
regulates mitochondrial ribosomal RNA modification^[116]45. PUS3
(score = 0.969) is involved in RNA pseudouridylation in humans^[117]46
and has also been studied in relation to methylation and
sepsis^[118]47. PRPF4B (score = 0.967), a member of the Clk/Sty kinases
family, encodes the pre-mRNA processing factor 4 kinase
(PRP4K)^[119]48. PAPOLG (score = 0.965), a poly(A) polymerase (PAP)
family member, plays a key role in mRNA stability and translational
modifications^[120]49. EXOSC2 (score = 0.962) provides catalytic
activity to the RNA exosome^[121]50. PPIG (score = 0.955) interacts
directly with the phosphorylated RNA Pol II carboxy-terminal domain
(CTD) via its RS domains both in vivo and in vitro^[122]51. TRUB2
(score = 0.953) is involved in mitochondrial mRNA pseudouridylation,
regulating 16S rRNA and mitochondrial translation^[123]52. DHX38
(score = 0.946) encodes RNA helicase PRP16, essential for disrupting
the U2-U6 helix I between the first and second catalytic steps of
splicing^[124]53. SF3B1 (score = 0.945) plays a central role in the RNA
splicing function of SF3b^[125]54. UTP23 (score = 0.945) is a
trans-acting factor involved in the early assembly of 18S rRNA^[126]55.
CHTOP (score = 0.944) binds competitively with arginine
methyltransferases PRMT1 and PRMT5, promoting the asymmetric or
symmetric methylation of arginine residues^[127]56. DDX47
(score = 0.944) plays a role in pre-rRNA processing^[128]57. RRP9
(score = 0.942) is involved in the maturation of ribosomal RNA,
ensuring proper ribosome formation and function. HARS2 (score = 0.941)
encodes the mitochondrial histidyl-tRNA synthetase (mt-HisRS)^[129]58.
UPF2 (score = 0.940) gene plays a role in mRNA degradation^[130]59.
DDX46 (score = 0.938) recruits ALKBH5, an eraser of the RNA
modification N6-methyladenosine (m6A), through its DEAD helicase domain
to demethylate m6A-modified antiviral transcripts^[131]60.
Notably, recent research^[132]61 reported that CSTF2T (score = 0.941)’s
paralog, CSTF2, acts as a mediator of m6A deposition, thereby
regulating mRNA m6A modification. This suggests that CSTF2T may play a
similar role in the m6A methylation process.
Influence of highly informative features
In this study, we used a dataset (full set) filtered for zero values
and variance. This dataset consists of 26,936 genes and 1517 features,
including 62 GO annotations, 1110 GTEx, 39 TISSUES, 162 PathCommons, 108
HPA, one InterPro, and 35 BioGPS features. While GO annotations and
InterPro features are informative, they may bias the model’s
predictions towards pre-existing methylation-related annotations. To
investigate the impact of GO and InterPro features on the model, we
further removed all GO and InterPro features from the dataset,
resulting in a reduced set with 1454 features. We retrained the
classifiers (starting from the generator’s training) using the reduced set
and calculated scores for all genes across different models. The
probability score distribution curves are shown in Supplementary Figure
[133]4. The prediction trends across all five classifiers were very
similar: the classifiers tended to assign low confidence to most genes.
In the low confidence interval (score < 0.5), the peak for the reduced
set without GO and InterPro features shifted slightly to the right.
This suggests that the reduced set slightly increased confidence in the
low confidence interval, indicating that GO and InterPro features
prevented classifiers from blindly assigning high confidence to most
genes. As shown in Supplementary Figure [134]5 and Supplementary Table
[135]1, although the test results declined after training with the
reduced set, GO enrichment analysis revealed a strong association with
RNA methylation. This suggests that, while the reduced set may lower
evaluation metrics, it does not strongly bias the model’s predictions.
Additionally, we conducted a feature importance analysis based on the
LR model. Supplementary Figure [136]6 shows the importance rankings of
the top 50 features in both the full set and the reduced set. In the
full set, GO (26/50) and InterPro (1/50) features accounted for over
50% (27/50) of the top positions. Their importance scores were
significantly higher than those of GTEx and HPA, indicating that the
model was more heavily influenced by GO and InterPro features than by
other types. In the top 50 rankings of the reduced set (without GO and
InterPro features), GTEx and HPA appeared most frequently. This is
likely because GTEx and HPA provide crucial insights into gene
expression across various tissues, which can help identify
tissue-specific patterns of RNA methylation.
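A minimal sketch of the importance ranking (hypothetical; taking absolute logistic-regression coefficients as importance scores is an assumption about how the ranking was computed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def top_lr_features(X, y, feature_names, k=50):
    """Rank features by the absolute value of their logistic-regression coefficient."""
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    importance = np.abs(lr.coef_[0])
    order = np.argsort(importance)[::-1][:k]
    return [(feature_names[i], float(importance[i])) for i in order]
```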
Conclusion
In this study, we highlight the importance of RNA methylation and
address the imbalance learning problem in predicting genes associated
with RNA methylation pathways. We propose a deep learning-based
Wasserstein generative adversarial network with gradient penalty
(WGAN-GP) approach to alleviate the imbalance problem in gene
prediction tasks. We use five machine learning classifiers (SVM, GB,
GNB, RF, LR) for model training and evaluate performance using various
metrics. Experimental results demonstrate that the improved WGAN-GP
effectively captures the features of gene expression data and
accurately models the distribution of positive samples in the feature
space. Additionally, it generates synthetic samples that are difficult
to distinguish from real positive samples and exhibit better diversity
compared to other oversampling methods. Our study therefore presents an
effective approach that is highly competitive with conventional schemes.
Furthermore, our work has several potential extensions. First, while
our study incorporates gene expression knowledge bases, future research
should integrate additional data, such as gene interactions, to further
enhance predictions of genes associated with RNA methylation pathways.
Second, employing advanced techniques to reduce feature dimensionality
could improve prediction performance. Lastly, exploring end-to-end
models that map inputs directly to outputs, without relying on
intermediate representations, could reduce the need for manual
intervention in the modeling process.
Supplementary Information
Supplementary Information (2.1 MB, PDF).
Acknowledgements