Abstract
RNA methylation modification influences various processes in the human
body and has gained increasing attention from scholars. Predicting
genes associated with RNA methylation pathways can significantly aid
biologists in studying RNA methylation processes. Several prediction
methods have been investigated, but their performance is still limited
by the scarcity of positive samples. To address the challenge of data
imbalance in RNA methylation-associated gene prediction tasks, this
study employed a generative adversarial network to learn the feature
distribution of the original dataset. The quality of synthetic samples
was controlled using the Classifier Two-Sample Test (CTST). These
synthetic samples were then added to the data blocks to mitigate class
distribution imbalance. Experimental results demonstrated that
integrating the synthetic samples generated by our proposed model with
the original data enhances the prediction performance of various
classifiers, outperforming other oversampling methods. Moreover, gene
ontology (GO) enrichment analyses further support the relevance of the
predicted genes to RNA methylation pathways. The PyTorch model for
generating gene samples is available at
https://github.com/heyheyheyheyhey1/WGAN-GP_RNA_methylation
Keywords: RNA methylation, Pathways, Machine learning, Generative
adversarial nets
Subject terms: RNA, Computational biology and bioinformatics,
Computational models, Protein function predictions
Introduction
RNA methylation refers to the chemical process in which methyl groups
are selectively added to specific bases in RNA (forming, for example,
methyladenine) under the catalytic action of methyltransferases. So
far, researchers have discovered over
160 RNA modifications^[31]1 that are extensively distributed across
eukaryotes and prokaryotes. Among these, the most prevalent form of RNA
modification is methylation. By studying this process, researchers have
discovered associations between RNA methylation modifications and
various fields, including cancer^[32]2,[33]3, cardiovascular
disease^[34]4, embryonic development^[35]5, and cell
differentiation^[36]6. These findings highlight the significance of RNA
methylation and suggest its potential implications in diverse
biological processes and diseases. Further research in this area may
shed more light on the functional roles of RNA methylation and its
therapeutic applications^[37]7.
Numerous reports have revealed that specific enzymes or proteins within
the RNA methylation pathway regulate distinct functions and biochemical
processes in the human body^[38]8–[39]11. These studies are based on
the identification of proteins, genes, or specific enzymes associated
with RNA methylation. Despite the significant role of the RNA
methylation process in various aspects of the human body, researchers have limited
knowledge about the pathways involved. This limitation primarily arises
from the substantial financial and time investments required to
identify gene functions using experimental wet laboratory
methods^[40]12,[41]13. Fortunately, due to intensive research of
large-scale omics data and artificial intelligence methods, researchers
can now explore target gene functions using computational approaches
that are faster and more cost-effective than wet methods. Several
reports have employed machine learning-based methods to identify genes
associated with RNA methylation pathways^[42]14. Tsagkogeorga et
al.^[43]14 used expression data from genes associated with RNA
methylation as positive samples and trained machine learning models to
predict additional potential genes. The results suggest that ‘dry’
computational methods could be used to identify RNA
methylation-associated genes. However, due to the limited number of
known samples and the highly imbalanced ratio of positive to negative
samples, the classifier may incorrectly identify majority class samples
as minority class samples^[44]15. Any misclassification of genes
obtained through computational methods leads to increased research
costs, including time and funding expenses. Thus, avoiding overfitting
and improving the effectiveness of classification methods in highly
imbalanced and high-dimensional data remain challenging.
Various approaches have been developed to address the issue of data
imbalance, with the most common approach involving the oversampling of
minority class samples^[45]16. Random oversampling is a widely employed
algorithm that duplicates randomly selected minority class samples from
the original dataset to balance the sample distribution. However, this
algorithm primarily enhances the classifier’s weights on the minority
class but lacks diversity in generating new samples^[46]17. Another
effective technique, the synthetic minority oversampling technique
(SMOTE)^[47]18, generates samples that lie near the minority class in
the feature space, introducing diversity through interpolation among
neighboring samples. However, SMOTE has limited impact on most
classifiers trained on high-dimensional data, particularly in the field
of bioinformatics^[48]19. This limitation also applies to certain SMOTE
variants (e.g., Borderline-SMOTE^[49]20). Directly applying these
algorithms to highly imbalanced datasets may not alleviate overfitting
and enhance classification performance.
Deep learning offers an alternative approach to address the challenge
of data imbalance^[50]21. Researchers utilized deep learning techniques
to generate synthetic samples that supplement the original dataset. One
common approach is the use of Generative Adversarial Networks^[51]22
(GANs), which consist of two networks that ‘compete’ against each
other: a generator and a discriminator. The generator is responsible
for producing synthetic data, while the discriminator evaluates the
authenticity of the data. Through an adversarial training process,
these two networks compete until they reach equilibrium, where the
discriminator can no longer distinguish between real and synthetic
data, and the generator cannot improve its sample generation further.
This process allows the generator to learn the underlying data
distribution and generate synthetic samples that can be utilized for
downstream tasks. Several studies have confirmed that GANs and their
variants can generate synthetic data by learning the distribution of
minority class samples to supplement the original dataset. For example,
Jing et al.^[52]23 utilized attention-augmented convolutional layers
and an adversarial training mechanism to enhance the identification of
transcription factor binding sites (TFBSs). Ma et al.^[53]24 utilized a
deep convolutional generative adversarial network (DC-GAN) to classify
white blood cells. Gadermayr et al.^[54]25 used a cycle GAN to translate
images with healthy conditions into images with affected conditions for
augmenting labeled segmentation data. Frid-Adar et al.^[55]26 employed
GANs to synthesize CT image data and demonstrated that the generated
images achieved favorable results despite a limited number of samples.
Not only have GANs shown remarkable success in image synthesis, but
Xiao et al.^[56]27 also applied a WGAN to learn high-dimensional
features from cancer gene expression data. Wan and Jones^[57]28
introduced CTST to evaluate the quality of WGAN-GP-synthesized protein
feature samples and to improve protein function prediction performance.
Despite the remarkable success of GANs in balancing datasets, their
application to highly imbalanced gene expression datasets has,
surprisingly, not yet been studied, to the best of our knowledge. Thus,
we propose a novel GAN-based approach to address the challenges posed
by highly imbalanced data in predicting genes related to RNA
methylation pathways. Firstly, we enhance WGAN-GP^[58]29 to improve its
capability in learning the distribution of gene expression data.
Secondly, we employ the Classifier Two Sample Test^[59]30 (CTST) to
regulate the quality of synthetic samples generated by WGAN-GP.
Thirdly, we implement a data block structure and incorporate the
optimal samples, as identified by CTST, into each block to augment the
minority samples. We then employ various classifiers to compare the
performance metrics with conventional sampling methods. The results
suggest that the samples generated by WGAN-GP integrate well with the
minority class samples, as indicated by a CTST accuracy close to 0.5.
The data block structure mitigates the risk of overfitting caused by
highly unbalanced data, reducing the need for synthetic samples. The
newly added samples effectively complement the minority class sample
data, leading to improved predictive performance.
Method
Workflow of the study
Generally, our proposed method consists of four steps (Fig. [60]1). In
the first step, we perform feature extraction on the dataset to reduce
dimensionality and then train the WGAN-GP using the positive samples.
In the second step, we evaluate all WGAN-GP weights using CTST and
select the model weights that yield the best CTST results for
downstream task oversampling. In the third step, we divide the training
set into multiple data blocks and oversample each block using the
selected WGAN-GP model to achieve balanced data blocks. In the fourth
step, we train the machine learning models using these balanced data
blocks and evaluate their performance using the test set.
Fig. 1. Workflow of our study.
Data set and feature transformation
We use the dataset created and made available by Tsagkogeorga et
al.^[63]14. The original dataset was collected from
[64]Harmonizome^[65]31 website. We selected 15 one-hot-encoded datasets
to construct our dataset. The dataset was initially standardized to
continuous values ranging from 0 to 1, or − 1 to 1, where 1 indicates a
strong positive gene-feature association, 0 indicates no observed
gene-feature association, and − 1 indicates a strong negative
gene-feature association. The genes within these 15 datasets, along
with their associated features, exhibit variations in both quantity and
distribution. The number of genes and features differs across datasets.
Some genes may appear in one or multiple datasets, and certain features
may be present across various datasets as well. Each gene encompasses
50,176 features. Within the 15 datasets, some genes lack specific
features that are prevalent in other datasets. Consequently, features
absent from a gene in this context are marked as null. The null values
are subsequently assigned a value of 0. While a value of 0 signifies
the absence of gene-feature associations, it does not impact the
ultimate outcomes of training and prediction. To obtain highly
informative features and reduce dimensionality, we remove the following
two types of features: (i) those with zero values in more than 70% of
cases, and (ii) those with variance less than 16% of the total
variance.
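As an illustration, the following is a minimal sketch of this filtering step (hypothetical helper name; the matrix is assumed to be a NumPy array of genes by features, and interpreting the 16% variance criterion as a fraction of the mean per-feature variance is an assumption, since the criterion is stated only verbally):

```python
import numpy as np

def filter_features(X, max_zero_frac=0.70, var_frac=0.16):
    """Drop uninformative columns from a genes-by-features matrix X.

    (i)  features that are zero in more than `max_zero_frac` of the genes;
    (ii) features whose variance falls below `var_frac` of the mean
         per-feature variance (assumed reading of "16% of the total variance").
    """
    zero_rate = np.mean(X == 0, axis=0)
    variances = X.var(axis=0)
    var_threshold = var_frac * variances.mean()
    keep = (zero_rate <= max_zero_frac) & (variances >= var_threshold)
    return X[:, keep], keep
```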
The improved data block structure for balancing
Given the significant imbalance between positive and negative samples
in our dataset, directly applying oversampling may not improve
classifier performance (as discussed in subsection “[66]Ablation
study”). Therefore, we adopted an improved data block structure for
oversampling. Specifically, we divide the training dataset into subsets
and then use oversampling to balance these subsets. This approach,
initially introduced by Yin et al.^[67]32, integrates undersampling and
oversampling schemes, which reduces information loss from undersampling
by slicing into multiple chunks while supplementing the dataset. We
enhance this data block structure by ensuring that the proportion of
positive and negative samples in these subsets is intentionally
unbalanced to better integrate with the oversampling methods.
Specifically, these subsets are created by combining the same positive
samples with different negative samples (the positive samples represent
the minority class). The number of positive and negative samples in
each subset is allocated proportionally based on the hyperparameters.
Generally, each subset contains fewer positive samples than negative
samples, aiming to achieve a balanced distribution of positive and
negative samples after oversampling.
Assuming the numbers of positive and negative samples are $N_{pos}$ and
$N_{neg}$, respectively, and the sets of positive and negative samples
are $S_{pos}$ and $S_{neg}$, respectively, $S = S_{pos} \cup S_{neg}$
denotes the entire dataset, $\lambda = N_{neg}/N_{pos}$ denotes the
imbalance ratio of the dataset, and $\alpha$ is a hyperparameter that
controls the oversampling rate. The pseudocode for the data block
construction process to achieve balance is shown in Algorithm 1.
Algorithm 1. Data block balancing.
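A minimal sketch of the block construction is shown below (hypothetical function; one plausible reading of $\alpha$ is assumed, in which each block receives roughly $N_{pos}/\alpha$ negatives so that the oversampler only has to add $N_{pos}(1/\alpha - 1)$ synthetic positives per block):

```python
import numpy as np

def build_blocks(X_pos, X_neg, alpha=0.4, seed=0):
    """Pair all positive samples with disjoint chunks of negative samples.

    Assumed reading of the oversampling rate alpha: each block holds about
    n_pos / alpha negatives, so adding (n_neg_block - n_pos) synthetic
    positives balances the block; alpha = 1 would require no synthesis.
    """
    rng = np.random.default_rng(seed)
    n_pos = len(X_pos)
    neg_per_block = int(np.ceil(n_pos / alpha))
    order = rng.permutation(len(X_neg))
    blocks = []
    for start in range(0, len(X_neg), neg_per_block):
        neg_chunk = X_neg[order[start:start + neg_per_block]]
        blocks.append((X_pos, neg_chunk))  # synthetic positives are added later
    return blocks
```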
The small-sample augmentation based on WGAN-GP
Generative Adversarial Networks (GANs) are neural networks capable of
learning high-dimensional feature distributions from data
samples^[70]33. A GAN generally consists of two neural networks: a
generator G and a discriminator D. The generator receives random noise
as input and produces samples that are intended to be indistinguishable
from real data. The discriminator’s task is to distinguish between the
fake samples generated by G and the real samples from the training
data. During training, the generator and the discriminator engage in a
minimax two-player game: the generator aims to create samples that can
deceive the discriminator, while the discriminator tries to correctly
identify whether samples are real or fake. Ideally, a Nash equilibrium
is reached when the generator can no longer produce samples that the
discriminator can distinguish from real samples, and the discriminator
is unable to differentiate between real and fake samples. Formally, let
V(G, D) be the objective function. The minimax game is formulated as:
$$\min_G \max_D V(G,D) = \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1-D(G(z)))] \tag{1}$$

where $x$ is sampled from the real data distribution $p_{data}$ and $z$
is sampled from the prior noise distribution $p_z$. $\mathbb{E}$
denotes the expectation, $D(x)$ denotes the score assigned by the
discriminator D to $x$, and $G(z)$ denotes the fake samples generated
by G from noise sampled from the prior distribution $p_z$. In a GAN,
the generator and discriminator are trained alternately; that is, the
weights of one network are fixed while the other network is trained.
Specifically, when training the discriminator D, the generator G is
kept fixed and D is optimized by maximizing $\log D(x)$; conversely,
when training the generator G, the discriminator is fixed and G is
optimized by minimizing $\log(1-D(G(z)))$.
However, conventional GANs face training difficulties primarily due to
the Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence
loss functions. Specifically, when dealing with non-overlapping
distributions, the JS divergence between $p_{data}$ and $G(z)$ can
approach a constant value, leading to issues such as gradient vanishing
and mode collapse^[71]28. WGAN addresses the problem of gradient
vanishing by replacing these divergence measures with the Wasserstein
distance. Equation ([72]2) gives the value function of WGAN:

$$V = \mathbb{E}_{x\sim p_{data}}[f_w(x)] - \mathbb{E}_{z\sim p_z}[f_w(g_\theta(z))] \tag{2}$$
where $f_w$ is a function that satisfies the K-Lipschitz constraint,
and $w$ and $\theta$ are the parameters of $f$ and $g$, respectively.
To satisfy the K-Lipschitz constraint, $w$ is restricted so that $f_w$
has a slope of at most K between any two points. Given the powerful
fitting capability of neural networks, we can simply replace $f$ and
$g$ with D and G, respectively. Thus, the objective function of WGAN is
given by Eq. ([73]3):

$$\min_G \max_D V(G,D) = \mathbb{E}_{x\sim p_{data}}[D(x)] - \mathbb{E}_{z\sim p_z}[D(G(z))] \tag{3}$$
Given that D must satisfy the K-Lipschitz constraint, all parameters in
D are clipped to a fixed range after each training epoch, for example,
from − 0.01 to 0.01. Compared to traditional GANs, WGAN incorporates
the K-Lipschitz constraint to address the issue of gradient vanishing
and its effective smoothing of the value function improves
trainability. Nevertheless, WGAN’s approach to enforcing the
K-Lipschitz constraint through parameter clipping can be somewhat
crude, as this clipping process can significantly impact the network’s
fitting capability. This often results in a large number of weight
values being concentrated along the clipped boundaries, making it
challenging to fit complex distributions and significantly impairing
performance^[74]34. WGAN-GP adopts a more refined approach by
incorporating a gradient penalty to enforce the K-Lipschitz constraint.
Specifically, during the training of the discriminator, a gradient
penalty term is added to the discriminator's loss function. Let GP
denote the gradient penalty; Eq. ([75]4) gives its calculation:
$$GP = \gamma\,\mathbb{E}_{\hat{x}\sim p_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right] \tag{4}$$
The objective function of WGAN-GP is then rewritten as:

$$\min_G \max_D V(G,D) = \mathbb{E}_{x\sim p_{data}}[D(x)] - \mathbb{E}_{z\sim p_z}[D(G(z))] - GP \tag{5}$$
where $\hat{x}$ is obtained by interpolating between real samples drawn
from $p_{data}$ and generated samples, $\gamma$ is the gradient penalty
coefficient, typically a positive real number chosen empirically, and
$\nabla_{\hat{x}} D(\hat{x})$ denotes the gradient of $D(\hat{x})$ with
respect to $\hat{x}$.
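A minimal PyTorch sketch of the gradient penalty in Eq. (4) and of one discriminator (critic) update under Eq. (5) is given below (hypothetical function names; γ = 10 is a common WGAN-GP choice rather than a value reported here):

```python
import torch

def gradient_penalty(D, real, fake, gamma=10.0):
    """GP = gamma * E[(||grad_xhat D(xhat)||_2 - 1)^2], as in Eq. (4)."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    return gamma * ((grads.norm(2, dim=1) - 1) ** 2).mean()

def discriminator_step(D, G, real, z_dim, opt_D):
    """Maximise E[D(x)] - E[D(G(z))] - GP, i.e. minimise its negative (Eq. 5)."""
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = G(z).detach()
    loss_D = D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    return loss_D.item()
```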
Furthermore, the CNN structure originally used in WGAN-GP is designed
for image feature extraction and is not well-suited for gene expression
data. Therefore, it is crucial to adapt WGAN-GP to better accommodate
gene expression data. To this end, we substitute the CNN structure in
WGAN-GP with a fully-connected neural network and perform additional
optimization of the network parameters to tailor it specifically to our
dataset. Each fully-connected layer is followed by a Leaky ReLU
(Rectified Linear Unit) activation function with a negative slope of
0.2. In the discriminator, the number of neurons in each layer
progressively decreases until it reaches 1, while in the generator, the
number of neurons in each layer progressively increases until it
matches the number of features in the dataset.
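A minimal sketch of the fully-connected generator and discriminator described above (the specific layer widths and noise dimension are illustrative assumptions; only the Leaky ReLU slope of 0.2 and the widening/narrowing pattern follow the text):

```python
import torch.nn as nn

def mlp(sizes):
    """Stack of linear layers, each hidden layer followed by LeakyReLU(0.2)."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.LeakyReLU(0.2))  # negative slope 0.2, as in the text
    return nn.Sequential(*layers)

n_features, z_dim = 1517, 128                # z_dim is an illustrative choice
# generator: widths grow from the noise dimension up to the feature count
G = mlp([z_dim, 256, 512, 1024, n_features])
# discriminator (critic): widths shrink down to a single score, no sigmoid (WGAN-GP)
D = mlp([n_features, 1024, 512, 256, 1])
```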
Further generator selection
In this study, we utilize the CTST approach to select the generator
that produces the optimal synthetic samples, rather than evaluating the
samples themselves, which differs from^[76]28. Optimal synthetic
samples are those that closely mimic the feature distribution of real
positive training samples without being identical to the real gene
expression samples. In essence, when the number of synthetic samples
equals that of real samples, CTST employs a binary classifier to
differentiate between the two sets of samples. If cross-validation
results indicate that the classifier can easily distinguish between the
two sets, the synthetic samples do not adequately reproduce the real
distribution. Specifically, given two
sample sets of equal size that follow two distributions, P and Q, CTST
tests the null hypothesis that P is equal to Q. If the null hypothesis
is accepted, the classifier’s accuracy will be near chance level (50%);
if rejected, the accuracy will approach 100%^[77]35.
During the training process of WGAN-GP, we employed the CTST after each
epoch to promptly assess the generator and retain the optimal network
weights for downstream tasks. Given the time-intensive nature of
network training, cross-validation was performed only once, and the
network weights were retained based on the accuracy of this validation.
However, these weights may be influenced by chance due to the
probabilistic nature of data generation. For instance, the network may
produce samples that closely resemble a specific distribution,
resulting in a CTST accuracy close to 0.5. At this point, the network
has partially captured the target sample distribution, but the
retained weights may still affect performance in downstream tasks. To
assess the impact of generator fitting over different epochs, we
conducted additional CTST experiments after WGAN-GP training. Multiple
CTSTs were performed with different weights, and the generator with the
mean accuracy closest to 0.5 was selected as the data generator for the
downstream task. In this study, we conduct the CTST using the Support
Vector Machine (SVM) classification algorithm.
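A minimal sketch of the CTST as used here (hypothetical helper; an SVM is cross-validated on the task of separating equal-sized sets of real and synthetic samples, and a mean accuracy near 0.5 indicates the two sets are statistically indistinguishable):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def ctst_accuracy(real, synthetic, folds=5):
    """Classifier Two-Sample Test: ~0.5 -> indistinguishable, ~1.0 -> distinguishable."""
    n = min(len(real), len(synthetic))           # equal-sized sets
    X = np.vstack([real[:n], synthetic[:n]])
    y = np.hstack([np.ones(n), np.zeros(n)])
    scores = cross_val_score(SVC(), X, y, cv=folds, scoring="accuracy")
    return scores.mean()
```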
Prediction and evaluation
The dataset for this study comprises 26,936 gene expression data
samples, including 92 samples related to RNA methylation (positive
samples) and 26,304 unlabeled samples (regarded as negative samples).
The dataset was split into training and test sets in an 8:2 ratio. A
total of 74 positive samples and 21,044 negative samples were utilized
for training and cross-validation (CV) to determine hyperparameters via
grid search^[78]36,[79]37, while 18 positive samples and 5260 negative
samples were allocated for testing. The positive training samples were
initially used to train WGAN-GP in order to acquire generators for
producing synthetic samples for downstream tasks, and later employed
for training the classification model, as shown in Table [80]1.
Table 1. Numbers of gene samples for training, CV, and testing.

                          Train and CV    Test    Total
Positive gene numbers     74              18      92
Negative gene numbers     21,044          5260    26,304
Total                     21,118          5278    26,936
To prevent mode collapse during the training process of WGAN-GP, we
employ an Early Stopping mechanism^[82]38. If the CTST result, which
measures the dissimilarity between synthetic samples generated by the
generator and real samples, falls below a preset threshold of 0.02, the
weights from that epoch are saved and made available for downstream
tasks. However, to increase the number of generator models available
for selection, training does not stop immediately upon reaching the
Early Stopping condition; instead, it continues until the preset number
of epochs is reached. By analyzing the probability density curve of
results between 0.4 and 0.6, we observed a peak around 0.52, which
ensures that most useless generators are filtered out while retaining a
relatively large number of generators. Therefore, we chose 0.02 as the
default threshold for preliminary filtering, as shown in Supplementary
Figure [83]1.
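A sketch of how the 0.02 threshold can gate which generator weights are kept during training (hypothetical helper; `train_one_epoch` stands in for the adversarial updates and `ctst_fn` for a CTST routine such as the one sketched earlier):

```python
import copy
import numpy as np
import torch

def collect_candidate_generators(G, D, real_pos, z_dim, n_epochs,
                                 train_one_epoch, ctst_fn, threshold=0.02):
    """Save generator weights whenever |CTST accuracy - 0.5| < threshold."""
    saved = []
    for epoch in range(n_epochs):
        train_one_epoch(G, D)                    # one epoch of adversarial updates
        with torch.no_grad():
            z = torch.randn(len(real_pos), z_dim)
            synthetic = G(z).cpu().numpy()
        acc = ctst_fn(np.asarray(real_pos), synthetic)
        if abs(acc - 0.5) < threshold:
            saved.append((epoch, acc, copy.deepcopy(G.state_dict())))
    # training runs to the preset epoch count; afterwards, repeated CTSTs pick
    # the saved generator whose mean accuracy lies closest to 0.5
    return saved
```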
To evaluate the performance of predicting genes related to RNA
methylation pathways, we employ five machine learning classifiers:
Support Vector Machine (SVM), Gradient Boosting (GB), Gaussian Naive
Bayes (GNB), Random Forest (RF), and Logistic Regression (LR). After
dividing the data based on the oversampling rate $\alpha$, these
classifiers are trained on an equal number of data blocks. For
SVM, GB, and RF, we determine the optimal hyperparameters using a
grid-search technique, selecting the hyperparameter combination that
achieves the highest accuracy through 3-fold cross-validation.
Five metrics are utilized for evaluating the classifiers’ performance:
Accuracy, Precision, Recall, F1 score, and Matthews Correlation
Coefficient (MCC). Given the highly imbalanced ratio of positive to
negative samples in the test set, these metrics are computed across
multiple data blocks derived from the test set, and the average
performance across all blocks is taken as the final test result. Unlike
the training set, the test set’s data block splitting process does not
include the oversampling step.
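A minimal sketch of the block-wise evaluation (hypothetical helper; each test block is scored separately and the metrics are averaged, with no oversampling applied to test blocks):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

def evaluate_on_blocks(model, test_blocks):
    """test_blocks: iterable of (X_block, y_block) pairs split from the test set."""
    metrics = {"acc": [], "prec": [], "rec": [], "f1": [], "mcc": []}
    for X_block, y_block in test_blocks:
        y_pred = model.predict(X_block)
        metrics["acc"].append(accuracy_score(y_block, y_pred))
        metrics["prec"].append(precision_score(y_block, y_pred, zero_division=0))
        metrics["rec"].append(recall_score(y_block, y_pred, zero_division=0))
        metrics["f1"].append(f1_score(y_block, y_pred, zero_division=0))
        metrics["mcc"].append(matthews_corrcoef(y_block, y_pred))
    return {k: float(np.mean(v)) for k, v in metrics.items()}
```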
Result
Influence of hyperparameters and training strategies
To investigate the effect of the oversampling rate $\alpha$ on model
performance, we performed cross-validation on the training set using
various oversampling rates. Our experiments involved nine different
$\alpha$ values across five classifiers. Generally, each classifier
exhibited varied validation performance with different $\alpha$ values,
and a fixed $\alpha$ value also produced diverse results across
classifiers. As shown in Fig. [84]2, the highest accuracy (0.9757) was
achieved with an $\alpha$ of 0.4 paired with the Logistic Regression
(LR) classifier. Conversely, the lowest accuracy (0.8771) occurred with
an $\alpha$ of 0.1 paired with the Gradient Boosting (GB) classifier, a
difference of approximately 14.3%. Notably, Random Forest (RF) was the
most affected by variations in $\alpha$, with a performance difference
of 9.49%. In contrast, Gaussian Naïve Bayes (GNB) was the least
affected, showing a difference of only 1.27% between its best (0.9539)
and worst (0.9412) results. The performance drops for the other
classifiers relative to their best results were 4.48% for Support
Vector Machine (SVM), 6.34% for GB, and 1.82% for LR. This suggests
that inadequate synthetic samples ($\alpha$ close to 1) or excessive
synthetic samples ($\alpha$ close to 0) can result in performance
degradation. Based on these validation results, unless otherwise
specified, we set $\alpha$ to 0.4 for the following analysis.
Fig. 2. Impact of the oversampling rate $\alpha$: the horizontal axis
shows the oversampling rate, the vertical axis shows the validation
accuracy, and the lines denote the different classifiers.
Ablation study
The data block structure and oversampling process are two major
components in our approach. To study the effectiveness of these
components, we created two variants of the scheme: one without the data
block structure and another without the oversampling process. These two
variants were used to train five machine learning models. We then
evaluated their performance using four metrics to assess the impact of
the data block structure and oversampling on overall model
performance. Figure [86]3 depicts the results of the ablation study.
Fig. 3. Overview of the ablation study.
In general, our proposed complete scheme exhibits superior predictive
performance across all five classifiers. However, without the data
block structure or proper oversampling, performance degradation occurs
to varying extents. Among the classifiers, the GNB classifier shows the
least sensitivity to these components, with performance reductions of
1.76% (MCC), 0.7% (F1 score), and 3.1% (recall) when the data block
structure is removed, and 7.5% (MCC), 3.4% (F1 score), and 2.7% (recall)
when WGAN-GP oversampling is missing. In contrast, the other four
classifiers demonstrate greater sensitivity to the absence of the data
block structure. For instance, GB experiences a 26.9% drop in
accuracy, 60.2% in recall, 44.6% in F1 score, and 42.3% in MCC.
Similarly, LR shows decreases of 16.9% (accuracy), 40.9% (recall),
24.3% (F1 score), and 27.0% (MCC). RF faces reductions of 19.4%
(accuracy), 45.8% (recall), 28.9% (F1 score), and 30.7% (MCC), while
SVM shows declines of 26.5% (accuracy), 59.8% (recall), 44.3% (F1
score), and 41.7% (MCC). When the oversampling process is absent, the
situation is more moderate: MCC drops by 5.3% for GB, 5.9% for SVM,
10.2% for RF, and 6.4% for LR, with similar declines in the other
metrics. These results clearly
demonstrate that merely generating a large number of synthetic samples
to address imbalanced datasets is ineffective for highly imbalanced
classification problems. In contrast, our proposed scheme, which
integrates the benefits of both oversampling and undersampling,
achieves better performance. By incorporating data blocks, it
significantly reduces the need for synthetic samples, preserves the
original sample distribution, and effectively expands the dataset. The
data block structure also facilitates the use of multiple weak
classifiers, enabling ensemble techniques that capture diverse patterns
in the data, reducing individual model bias and improving
generalization.
A total of 61 weights were saved based on the preset parameters. The
additional CTST experimental results for these 61 weights are displayed
in Fig. [89]4, sorted by variance. After conducting 200 repetitions of
the experiment, slight differences were observed in the mean values of
all generator test results. The majority of the generator test results
still adhere to the preset hyperparameter criteria, falling between
0.498 and 0.502 as denoted by the green dots in Fig. [90]4. A few
generators produced test results that deviate from the preset criteria,
yet remain within the acceptable range for downstream tasks, ranging
from 0.4 to 0.498 and 0.502 to 0.6, indicated by the orange points.
These generators also show an increase in variance. In contrast, a few
generators, represented by the red dots, exhibit test results
significantly deviating from the 0.5 criterion, with an absolute
difference exceeding 0.1, alongside higher variance. In such cases, the
samples generated by these generators could introduce bias into the
classifier model’s training. These results affirm the importance of
incorporating additional CTST to refine the selection process.
Fig. 4. CTST results over the 61 saved model weights, sorted by variance.
WGAN-GP successfully generated high-quality positive samples
Overall, the samples synthesized by WGAN-GP exhibited a strong
alignment with the original positive samples. We incorporated
supplementary CTST after each training epoch. As shown in Fig. [92]5,
at the beginning of generator training, the generated samples
significantly diverged from the distribution of positive samples,
resulting in a CTST accuracy of 1.0. This observation underscores the
effective discrimination achieved by the CTST classifier between the
generated and positive samples.
Fig. 5. Distribution changes between positive and synthetic samples:
(a) epoch 0; (b) epoch 200; (c) epoch 500; (d) epoch 800.
Around the 200th epoch, the generator began capturing the distinctive
feature distribution of the positive samples. Consequently, an overlap
emerged in the feature space between the generated and positive
samples, reducing the CTST accuracy to approximately 0.7. As training
progressed, by the 500th epoch this overlapping region had expanded
further, causing the CTST accuracy to fall to about 0.6.
After more than 800 training epochs, WGAN-GP successfully
internalized the genetic feature distribution of authentic positive
samples. Evidently, the CTST accuracy decreased to 0.5 and exhibited
fluctuation around this value, signifying the generator’s ability to
effectively align with the positive sample distribution. Additionally,
the training loss curve further indicated that the generator attained a
local optimum as the CTST accuracy reached 0.5.
Furthermore, independent CTST evaluations on each saved generator model
unveiled that more than half of the generators consistently uphold a
classifier accuracy around 0.5. This result highlights the
effectiveness of WGAN-GP in producing high-quality synthetic samples,
as illustrated in Fig. [95]5.
While WGAN-GP partially addressed the training difficulties of
conventional GANs by incorporating gradient penalties, proper neural
network design and hyperparameter
tuning remained crucial for successful GAN training. In addition to
visualizing samples at various epochs, observing changes in loss during
GAN training proved effective^[96]27. For WGAN-GP, the generator’s loss
reflected the disparity between generated and real samples. As shown in
Fig. [97]6, the generator initially exhibited poor sample-fitting
ability, resulting in a rise in loss early in the training process.
After undergoing adversarial training with the discriminator, the
generator started capturing the underlying sample distribution, causing
a decrease in loss. As the training progressed, the generator’s loss
began to fluctuate, indicating its position in the intermediate stage.
Around 600–800 epochs, the generator and discriminator reached a Nash
equilibrium in the minimax game, as indicated by the consistent
reduction and eventual stabilization of loss. This suggests the
generative adversarial network had reached an approximate optimal
state.
Fig. 6. WGAN-GP training loss over 5000 epochs. The blue line indicates
the discriminator loss and the orange line indicates the generator loss.
Performance comparison
As shown in Table [99]2, our approach consistently exhibits superior
prediction performance across most classifier combinations. For
instance, when paired with the SVM classifier, our method improves
accuracy by 16.01%, recall by 38.81%, F1 by 23.34%, and MCC by 25.14%
compared to the second-best approach. Similarly, the combination with
GB classifier achieves respective improvements of 13.77%, 33.91%,
19.35%, and 21.87%. Additionally, when combined with RF classifier, our
approach surpasses the second-best model by 16.83%, 40.58%, 24.26%, and
26.73%. For the LR classifier, our method outperforms the runner-up by
11.69%, 30.39%, 15.92%, and 18.89%. Across most performance metrics,
our scheme consistently attains top-tier results, securing the highest
accuracy (0.9320), F1 (0.9354), and MCC (0.9327) when combined with LR.
However, with the GNB classifier, the RandomOversample scheme achieves the
best performance, boasting an average accuracy of 0.9183, a recall of
1.0, F1 of 0.9261, and an MCC of 0.8505. This suggests that the
RandomOversample method is more inclined to identify positive samples.
Nonetheless, our scheme remains competitive, producing accuracy,
recall, F1, and MCC scores that closely approach those of the leading
scheme, with only minor differences.
Table 2. Accuracy, Recall, F1, and MCC for four approaches.

Classifier   Method             Accuracy   Recall   F1       MCC
SVM          SMOTE              0.5        0        0        0
SVM          BorderlineSMOTE    0.7630     0.5263   0.6895   0.5973
SVM          RandomOversample   0.5        0        0        0
SVM          Ours               0.9232     0.9145   0.9230   0.8487
GB           SMOTE              0.7629     0.5263   0.6894   0.5970
GB           BorderlineSMOTE    0.7891     0.5789   0.7330   0.6375
GB           RandomOversample   0.7631     0.5263   0.6896   0.5976
GB           Ours               0.9269     0.9181   0.9266   0.8562
RF           SMOTE              0.7368     0.4736   0.6428   0.5570
RF           BorderlineSMOTE    0.5789     0.1578   0.2727   0.2927
RF           RandomOversample   0.5263     0.0526   0.1      0.1644
RF           Ours               0.9315     0.9321   0.9322   0.8650
LR           SMOTE              0.8413     0.6842   0.8118   0.7192
LR           BorderlineSMOTE    0.8151     0.6315   0.7735   0.6776
LR           RandomOversample   0.8153     0.6315   0.7737   0.6781
LR           Ours               0.9320     0.9354   0.9328   0.8665
GNB          SMOTE              0.9164     0.9473   0.9201   0.8365
GNB          BorderlineSMOTE    0.8094     0.6315   0.7684   0.6627
GNB          RandomOversample   0.8115     0.7894   0.8088   0.6263
GNB          Ours               0.9108     0.9785   0.9178   0.8324

The highest values for each specific classifier are highlighted in
bold, while the highest values across all classifiers are underlined.
Notably, some approaches encountered a significant problem,
misclassifying all test set samples as negative (see^[101]32, Table 3).
This led to their average accuracy, recall, F1, and MCC values dropping
to 0.5, 0, 0, 0, respectively. In such cases, these approaches were
deemed unusable. Among the three compared schemes, the combination of
SMOTE with SVM exhibits invalid values. Similarly, RandomOversample
encounters the same issue when combined with SVM. While BorderlineSMOTE
mitigates the problem, it still underperformed compared to our approach.
In contrast, our method demonstrated better data compatibility and was
more effective at handling highly imbalanced gene expression data.
To investigate the differences in predictive performance between
samples synthesized by WGAN-GP and other schemes, we substitute WGAN-GP
with three oversampling schemes, creating three new variants of our
approach. These variants balanced the data blocks rather than the
entire dataset before training, following the same methodology as our
proposed approach. A scenario with no oversampling was used as the
baseline for comparison.
Table [102]3 shows the test results for various oversampling schemes
combined with data block construction. Generally, samples generated by
WGAN-GP outperform the other four compared schemes across most metrics
and classifiers, achieving the highest overall scores in accuracy,
recall, F1, and MCC. Specifically: With the SVM classifier, WGAN-GP
attained the highest accuracy (0.9232), recall (0.9342), F1 (0.9230),
and MCC (0.8487). With the GB classifier, WGAN-GP achieved the highest
accuracy (0.9269), F1 (0.9266), and MCC (0.8562). When using the RF
classifier, WGAN-GP achieved the highest accuracy (0.9315), recall
(0.9321), F1 (0.9322), and MCC (0.8650). With the LR classifier,
WGAN-GP reached the highest accuracy (0.9320), precision (0.9327), F1
(0.9328), and MCC (0.8665). With the GNB classifier, WGAN-GP achieved
the highest Recall (0.9785), F1 (0.9178), and MCC (0.8324). Among the
results, the WGAN-GP with LR combination attained the highest overall
accuracy, F1, and MCC scores, while the combination with GNB achieved
the highest recall score. The highest overall precision was obtained by
the combination of SMOTE with GB.
Table 3. Comparison of model performance with the data block structure.
Oversample method Classifier Accuracy Precision Recall F1 MCC
WGAN-GP SVM 0.9232 0.9342 0.9145 0.9230 0.8487
GB 0.9269 0.9379 0.9181 0.9266 0.8562
RF 0.9315 0.9345 0.9321 0.9322 0.8650
LR 0.9320 0.9327 0.9354 0.9328 0.8665
GNB 0.9108 0.8673 0.9785 0.9178 0.8324
BorderlineSMOTE SVM 0.9171 0.9277 0.9087 0.9168 0.8367
GB 0.9252 0.9365 0.9159 0.9249 0.8529
RF 0.9293 0.9328 0.9291 0.9299 0.8605
LR 0.9244 0.9217 0.9321 0.9255 0.8514
GNB 0.9046 0.8658 0.9662 0.9114 0.8189
SMOTE SVM 0.9129 0.9229 0.9055 0.9121 0.8291
GB 0.9237 0.9389 0.9099 0.9229 0.8501
RF 0.9301 0.9352 0.9282 0.9307 0.8622
LR 0.9248 0.9255 0.9282 0.9257 0.8519
GNB 0.9042 0.8736 0.9518 0.9095 0.8151
RandomOversample SVM 0.9193 0.9308 0.9099 0.9190 0.8411
GB 0.9181 0.9289 0.9096 0.9178 0.8388
RF 0.9235 0.9210 0.9312 0.9249 0.8493
LR 0.9286 0.9259 0.9360 0.9297 0.8597
GNB 0.9112 0.8695 0.9751 0.9177 0.8324
ModCGAN SVM 0.8947 0.9171 0.8726 0.8927 0.7930
GB 0.9193 0.9293 0.9115 0.9184 0.8420
RF 0.9196 0.9255 0.9169 0.9194 0.8424
LR 0.8995 0.9198 0.8798 0.8980 0.8020
GNB 0.9114 0.8959 0.9366 0.9137 0.8276
ACGAN SVM 0.8381 0.9117 0.7532 0.8221 0.7538
GB 0.8728 0.9265 0.8142 0.8646 0.7538
RF 0.8706 0.9238 0.8124 0.8625 0.8625
LR 0.8461 0.9167 0.7658 0.8317 0.7048
GNB 0.8937 0.9136 0.8752 0.8926 0.7905
*Non SVM 0.8909 0.8763 0.9178 0.8947 0.7863
GB 0.8957 0.8834 0.9189 0.8989 0.7957
RF 0.8795 0.8645 0.9085 0.8840 0.7639
LR 0.8983 0.8767 0.9342 0.9029 0.8016
GNB 0.8719 0.8279 0.9506 0.8829 0.7570
*Non means no oversampling method integrated.
The highest values for each specific classifier are highlighted in
bold, while the highest values across all classifiers are underlined.
In addition to common oversampling methods, we also compared other,
more sophisticated oversampling schemes. ModCGAN^[104]39,[105]40
employs a different approach to ensure the quality of synthetic samples
by training an MLP classifier after the GAN is trained to assess
whether the generated samples meet the required quality. Similarly,
ACGAN^[106]41 builds on CGAN by forcing the discriminator to
reconstruct the auxiliary information of CGAN, thereby regulating the
quality of the synthetic samples. We integrated these methods with data
blocks to compare the impact of synthetic sample quality on predicting
RNA methylation-related genes.
Although ModCGAN and ACGAN perform similarly to our method across five
metrics and five classifiers, our method shows a 1–5% improvement on
most metrics. The key to ModCGAN lies in using an external classifier
to distinguish between real samples and noise, thereby regulating the
generator’s outputs. However, ModCGAN assumes that noise follows only a
joint distribution of uniform and Gaussian distributions. In cases
involving large datasets, as in this study, the much larger pool of negative samples may
overlap with the Gaussian or uniform distribution, leading the external
classifier to exclude some qualified synthetic samples. Although this
study does not use GANs to generate majority class samples, a small
number of positive samples might also overlap with the predefined noise
distribution. Therefore, it is more reasonable to use the
distinguishability between synthetic and real samples as a standard for
assessing the quality of synthetic samples.
It is worth noting that the WGAN-GP oversampling method, when combined
with different classifiers, can achieve the highest MCC and F1 scores
for each respective classifier. Since the MCC metric is commonly used
to evaluate binary classification models, particularly for unbalanced
datasets, this suggests that WGAN-GP generates more diverse samples,
thereby reducing the classifier’s bias towards the majority class.
GO terms and KEGG pathway enrichment analyses
With $\alpha$ set to 0.4, each classifier retained 116 models. The
average score of
these models for each gene was then calculated as the classifier’s
final score. Subsequently, high-confidence genes identified by each
classifier were utilized for GO terms enrichment analysis.
High-confidence genes are defined as the top 1% of genes ranked by
confidence. Based on the gene quantity in our dataset, each classifier
typically has 270 high-confidence genes. However, since GNB tends to
overestimate confidence for many genes, we considered the top 1500
genes in GNB’s confidence rankings as high-confidence (beyond rank
1500, other classifiers typically assign confidence levels below 0.5).
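A minimal sketch of the ranking step (hypothetical helper; the probability scores from one classifier's 116 retained models are averaged per gene and the top fraction is kept):

```python
import numpy as np

def high_confidence_genes(score_matrix, gene_ids, top_frac=0.01):
    """score_matrix: (n_models, n_genes) probabilities from one classifier's models."""
    mean_scores = score_matrix.mean(axis=0)             # final per-gene score
    n_top = max(1, int(round(top_frac * len(gene_ids))))
    top_idx = np.argsort(mean_scores)[::-1][:n_top]     # highest-confidence genes
    return [gene_ids[i] for i in top_idx], mean_scores[top_idx]
```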
Figure [107]7 compares the top 20 GO terms ranked by each classifier.
Overall, the predictions from the five classifiers exhibit strong
consistency. GO terms enrichment analysis indicates that
high-confidence genes are significantly enriched for RNA methylation
functions. The terms with the highest enrichment include methylation,
RNA modification, RNA splicing, and RNA localization. Other terms
related to rRNA, mRNA, tRNA, and ncRNA are also present in the
enrichment analysis, with over 200 genes enriched for these GO terms.
This confirms that the predicted genes are closely associated with the
RNA methylation process.
Fig. 7. GO term enrichment analyses of high-confidence genes.
Supplementary Figure [110]2 shows the KEGG pathway enrichment analysis
of the top 1% confidence genes predicted by five classifiers. Despite
GNB’s tendency to assign high confidence to most genes, the pathway
enrichment analyses of the five models showed strong similarity. All
five enrichment analysis reports highlighted pathways strongly related
to RNA methylation, such as Spliceosome, mRNA surveillance pathway, RNA
polymerase, and RNA degradation. The enrichment counts ranged from 20
to 40, and all P-values were all below 0.05, indicating high confidence
in these items. Notably, the KEGG enrichment report from the GB
classifier erroneously included a pathway related to Coronavirus
disease - COVID-19, which did not appear in other classifiers. The
count and P-value for this item were much lower than those related to
RNA methylation, further indicating that the models consistently
predict genes involved in RNA methylation.
Functional insights into the new predictions
To gain deeper insights into the role of newly predicted genes in the
RNA methylation process, we selected the top 1% confidence genes
(predicted by SVM) and constructed a PPI network using the STRING
database^[111]42. This network included 188 newly predicted genes and
76 known genes related to RNA methylation. The STRING database yielded
a densely connected PPI network for these genes, comprising 4093 edges.
Using the Louvain^[112]43 community detection algorithm, we identified
six tightly connected communities within the PPI network, as shown in
Supplementary Figure [113]3.
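A minimal sketch of the community detection step (hypothetical; the edge list is assumed to have been exported from STRING, and a Louvain implementation is assumed to be available in networkx ≥ 2.8):

```python
import networkx as nx

def detect_communities(edges, seed=42):
    """edges: iterable of (protein_a, protein_b) pairs, e.g. exported from STRING."""
    G = nx.Graph()
    G.add_edges_from(edges)
    # Louvain community detection; returns a list of node sets, one per community
    return nx.community.louvain_communities(G, seed=seed)
```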
Community 0 consists of 71 genes, including 48 known RNA
methylation-related genes and 23 newly predicted genes. Functional
analysis indicates significant enrichment such as tRNA N2-guanine
methylation (GO:0002940, P = 3.53e−06), rRNA 2'-O-methylation
(GO:0000451, P = 3.53e−06), tRNA N1-guanine methylation (GO:0002939,
P = 0.00025), tRNA (guanine-N7)-methylation (GO:0106004, P = 0.0180),
and rRNA (guanine-N7)-methylation (GO:0070476, P = 0.0180).
Community 1 consists of 18 genes, including only one known RNA
methylation-related gene: FDXACB1. Even so, functional
enrichment analysis shows significant enrichment such as histidyl-tRNA
aminoacylation (GO:0006427, P = 0.0018), phenylalanyl-tRNA
aminoacylation (GO:0006432, P = 0.0044), tRNA aminoacylation for
mitochondrial protein translation (GO:0070127, P = 0.0154),
aminoacyl-tRNA ligase activity (GO:0004812, P = 1.19e−13), and RNA
adenylyltransferase activity (GO:1990817, P = 0.0046).
Community 2 consists of 22 genes, including 20 newly predicted genes.
HSD17B10 and RSAD1 in this community are known RNA methylation-related
genes. Functional enrichment analysis suggests significant enrichment
such as tRNA 5'-leader removal (GO:0001682, P = 1.49e−16), tRNA 5'-end
processing (GO:0099116, P = 1.41e−18), tRNA splicing via
endonucleolytic cleavage and ligation (GO:0006388, P = 1.48e−06), and
tRNA-specific ribonuclease activity (GO:0004549).
Community 3 consists of 80 genes, including 7 known RNA
methylation-related genes (CMTR1, CMTR2, LARP7, MEPCE, RNGTT, RNMT,
TGS1) and 73 newly predicted genes. GO enrichment analysis shows
significant enrichment such as cap1 mRNA methylation (GO:0097309,
P = 0.0280), 7-methylguanosine RNA capping (GO:0009452, P = 1.02e−08),
RNA 5'-cap (guanine-N7)-methylation (GO:0106005, P = 0.433),
7-methylguanosine mRNA capping (GO:0006370, P = 5.86e−07), mRNA
(nucleoside-2'-O-)-methyltransferase activity (GO:0004483, P = 0.0264),
mRNA methyltransferase activity (GO:0008174, P = 0.0099), snRNA binding
(GO:0017069, P = 1.69e−06), and O-methyltransferase activity
(GO:0008171, P = 0.0479).
Community 4 consists of 16 genes, including 12 known RNA
methylation-related genes and 4 newly predicted genes. Functional
enrichment analysis reveals significant enrichment such as snRNA
(adenine-N6)-methylation (GO:0120049, P = 0.0014), S-adenosylmethionine
biosynthetic process (GO:0006556, P = 0.0033), mRNA methylation
(GO:0080009, P = 9.06e−17), mRNA
(2'-O-methyladenosine-N6-)-methyltransferase activity (GO:0016422,
P = 6.17e−06), RNA (adenine-N6-)-methyltransferase activity
(GO:0008988, P = 6.17e−06), mRNA methyltransferase activity
(GO:0008174, P = 5.16e−07), and RNA N6-methyladenosine
methyltransferase complex (GO:0036396, P = 6.55e−16).
Community 5 consists of 57 genes, including 6 known RNA
methylation-related genes (CEBPZ, DIMT1, EMG1, FBL, FBLL1, TFB2M) and
51 newly predicted genes. Functional enrichment analysis indicates
significant enrichment such as Box H/ACA RNA 3'-end processing
(GO:0000495, P = 0.0074), U5 snRNA 3'-end processing (GO:0034476,
P = 0.0119), U1 snRNA 3'-end processing (GO:0034473, P = 0.0119),
histone-glutamine methyltransferase activity (GO:1990259, P = 0.0073),
rRNA (adenine-N6,N6-)-dimethyltransferase activity (GO:0000179,
P = 0.0158), snoRNA binding (GO:0030515, P = 9.05e−11),
3'-5'-exoribonuclease activity (GO:0000175, P = 8.02e−09), rRNA
methyltransferase activity (GO:0008649, P = 4.28e−06), catalytic
activity acting on RNA (GO:0140098, P = 6.20e−20), and
N-methyltransferase activity (GO:0008170, P = 0.0242).
Presuming that the top 20 false-positive genes, based on confidence
ranking, represent new predictions of the model, existing research
reveals that these genes play critical roles in various RNA- or
methylation-related biological processes. DDX56 (score = 0.979), a
member of the DEAD box RNA helicase family, plays a crucial role in
several RNA-related biological processes^[114]44. METTL18
(score = 0.971) has been reported to be closely related to
METTL17^[115]45. METTL18 is a methyltransferase, while METTL17
regulates mitochondrial ribosomal RNA modification^[116]45. PUS3
(score = 0.969) is involved in RNA pseudouridylation in humans^[117]46
and has also been studied in relation to methylation and
sepsis^[118]47. PRPF4B (score = 0.967), a member of the Clk/Sty kinases
family, encodes the pre-mRNA processing factor 4 kinase
(PRP4K)^[119]48. PAPOLG (score = 0.965), a poly(A) polymerase (PAP)
family member, plays a key role in mRNA stability and translational
modifications^[120]49. EXOSC2 (score = 0.962) provides catalytic
activity to the RNA exosome^[121]50. PPIG (score = 0.955) interacts
directly with the phosphorylated RNA Pol II carboxy-terminal domain
(CTD) via its RS domains both in vivo and in vitro^[122]51. TRUB2
(score = 0.953) is involved in mitochondrial mRNA pseudouridylation,
regulating 16S rRNA and mitochondrial translation^[123]52. DHX38
(score = 0.946) encodes RNA helicase PRP16, essential for disrupting
the U2-U6 helix I between the first and second catalytic steps of
splicing^[124]53. SF3B1 (score = 0.945) plays a central role in the RNA
splicing function of SF3b^[125]54. UTP23 (score = 0.945) is a
trans-acting factor involved in the early assembly of 18S rRNA^[126]55.
CHTOP (score = 0.944) binds competitively with arginine
methyltransferases PRMT1 and PRMT5, promoting the asymmetric or
symmetric methylation of arginine residues^[127]56. DDX47
(score = 0.944) plays a role in pre-rRNA processing^[128]57. RRP9
(score = 0.942) is involved in the maturation of ribosomal RNA,
ensuring proper ribosome formation and function. HARS2 (score = 0.941)
encodes the mitochondrial histidyl-tRNA synthetase (mt-HisRS)^[129]58.
UPF2 (score = 0.940) gene plays a role in mRNA degradation^[130]59.
DDX46 (score = 0.938) recruits ALKBH5, an eraser of the RNA
modification N6-methyladenosine (m6A), through its DEAD helicase domain
to demethylate m6A-modified antiviral transcripts^[131]60.
Notably, recent research^[132]61 reported that CSTF2T (score = 0.941)’s
paralog, CSTF2, acts as a mediator of m6A deposition, thereby
regulating mRNA m6A modification. This suggests that CSTF2T may play a
similar role in the m6A methylation process.
Influence of highly informative features
In this study, we used a dataset (full set) filtered for zero values
and variance. This dataset consists of 26,936 genes and 1517 features,
including 62 GO annotations, 1110 GTEx, 39 TISSUES, 162 PathCommons, 108
HPA, one InterPro, and 35 BioGPS features. While GO annotations and
InterPro features are informative, they may bias the model’s
predictions towards pre-existing methylation-related annotations. To
investigate the impact of GO and InterPro features on the model, we
further removed all GO and InterPro features from the dataset,
resulting in a reduced set with 1454 features. We retrained the
classifiers (starting from the generator’s training) using the reduced set
and calculated scores for all genes across different models. The
probability score distribution curves are shown in Supplementary Figure
[133]4. The prediction trends across all five classifiers were very
similar: the classifiers tended to assign low confidence to most genes.
In the low confidence interval (score < 0.5), the peak for the reduced
set without GO and InterPro features shifted slightly to the right.
This suggests that the reduced set slightly increased confidence in the
low confidence interval, indicating that GO and InterPro features
prevented classifiers from blindly assigning high confidence to most
genes. As shown in Supplementary Figure [134]5 and Supplementary Table
[135]1, although the test results declined after training with the
reduced set, GO enrichment analysis revealed a strong association with
RNA methylation. This suggests that, while the reduced set may lower
evaluation metrics, it does not strongly bias the model’s predictions.
Additionally, we conducted a feature importance analysis based on the
LR model. Supplementary Figure [136]6 shows the importance rankings of
the top 50 features in both the full set and the reduced set. In the
full set, GO (26/50) and InterPro (1/50) features accounted for over
50% (27/50) of the top positions. Their importance scores were
significantly higher than those of GTEx and HPA, indicating that the
model was more heavily influenced by GO and InterPro features than by
other types. In the top 50 rankings of the reduced set (without GO and
InterPro features), GTEx and HPA appeared most frequently. This is
likely because GTEx and HPA provide crucial insights into gene
expression across various tissues, which can help identify
tissue-specific patterns of RNA methylation.
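A minimal sketch of the importance ranking (hypothetical; taking absolute logistic-regression coefficients as importance scores is an assumption about how the ranking was computed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def top_lr_features(X, y, feature_names, k=50):
    """Rank features by the absolute value of their logistic-regression coefficient."""
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    importance = np.abs(lr.coef_[0])
    order = np.argsort(importance)[::-1][:k]
    return [(feature_names[i], float(importance[i])) for i in order]
```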
Conclusion
In this study, we highlight the importance of RNA methylation and
address the imbalance learning problem in predicting genes associated
with RNA methylation pathways. We propose a deep learning-based
Wasserstein generative adversarial network with gradient penalty
(WGAN-GP) approach to alleviate the imbalance problem in gene
prediction tasks. We use five machine learning classifiers (SVM, GB,
GNB, RF, LR) for model training and evaluate performance using various
metrics. Experimental results demonstrate that the improved WGAN-GP
effectively captures the features of gene expression data and
accurately models the distribution of positive samples in the feature
space. Additionally, it generates synthetic samples that are difficult
to distinguish from real positive samples and exhibit better diversity
compared to other oversampling methods. Our study therefore presents an
effective approach that is highly competitive with conventional schemes.
Furthermore, our work has several potential extensions. First, while
our study incorporates gene expression knowledge bases, future research
should integrate additional data, such as gene interactions, to further
enhance predictions of genes associated with RNA methylation pathways.
Second, employing advanced techniques to reduce feature dimensionality
could improve prediction performance. Lastly, exploring end-to-end
models that map inputs directly to outputs, without relying on
intermediate representations, could reduce the need for manual
intervention in the modeling process.
Supplementary Information
Supplementary Information (2.1 MB, PDF).
Acknowledgements