Abstract
Traditional differential expression genes (DEGs) identification models
have limitations in small sample size datasets because they require
meeting distribution assumptions, otherwise resulting high false
positive/negative rates due to sample variation. In contrast, tabular
data model based on deep learning (DL) frameworks do not need to
consider the data distribution types and sample variation. However,
applying DL to RNA-Seq data is still a challenge due to the lack of
proper labeling and the small sample size compared to the number of
genes. Data augmentation (DA) extracts data features using different
methods and procedures, which can significantly increase complementary
pseudo-values from limited data without significant additional cost.
Based on this, we combine DA and DL framework-based tabular data model,
propose a model TabDEG, to predict DEGs and their
up-regulation/down-regulation directions from gene expression data
obtained from the Cancer Genome Atlas database. Compared to five
counterpart methods, TabDEG has high sensitivity and low
misclassification rates. Experiment shows that TabDEG is robust and
effective in enhancing data features to facilitate classification of
high-dimensional small sample size datasets and validates that
TabDEG-predicted DEGs are mapped to important gene ontology terms and
pathways associated with cancer.
Introduction
NGS(Next-generation Sequencing) is a high-throughput sequencing method
used to analyze the transcriptome and genome of a species. It is
considered revolutionary and widely used in genetics [[30]1]. Many gene
sequencing technologies, such as whole-genome DNA sequencing and total
RNA-sequencing, utilize NGS. Total RNA sequencing can detect multiple
forms of non-coding RNA and then overcome limitations of traditional
microarray experiments [[31]2–[32]4]. RNA sequencing (RNA-Seq)
technology, with its own merits mentioned above, is widely used in gene
expression analysis. It can be used to explore the evolution of gene
expression [[33]5] and assess the impact of gene expression levels in
medicine [[34]6]. Additionally, RNA-Seq enables certain applications
that are not possible with microarray experiments. For instance, gene
differential expression profile based on RNA-Seq in disease tissues
(such as cancer cells) is used to obtain a complete transcriptomic
profile for discovery of new genes and information about their
function, mechanism of action, and pathways [[35]7]. With the market
availability of NGS platforms, RNA-Seq has become an attractive
alternative method to traditional ones for detecting differentially
expressed genes (DEGs). By conducting differentially expressed (DE)
analysis, researchers can identify genes that undergo changes in
biological patterns between health and disease conditions. This can
help prioritize these condition-specific genes as potential biomarkers
for a particular disease [[36]8].
RNA-Seq data is mapped to a reference genome and summarized as “reads”.
Existing methods for classifying microarray data [[37]9–[38]11] cannot
be applied to RNA-Seq data due to potential discrete distributions,
such as Poisson and Negative Binomial [[39]12, [40]13]. Thus, several
statistical methods have been proposed for RNA-Seq data analysis
[[41]14]. Normalization is the first step in RNA-Seq data analysis to
eliminate variations caused by different numbers of reads from
different samples. “Sequencing Depth (SD)” [[42]15], proportional to
the total number of reads, is estimated and used to divide individual
counts for normalization. Accurate normalization ensures unbiased
interpretation of RNA-Seq data.
There are two problems with current pre-normalization strategies in
RNA-Seq data analysis: different methods may estimate SD differently,
and there are no systematic guidelines for selecting the best method
[[43]16]. Additionally, the assumption that most genes are not
differentially expressed (DE) between samples in RNA-Seq analysis may
not always hold true, such as in cancer samples where most genes may be
DE [[44]17]. To address these issues, scale-invariant (SI) [[45]18]
methods have been proposed, which give consistent results despite
differences in SD estimates. However, SI approaches may not be
applicable to all tasks, and some non-SI methods can result in biased
results [[46]19, [47]20]. This can lead to high rates of false
positives and false negatives when predicting DE genes in RNA-Seq data
analysis.
The challenges and solutions of deep learning-based DEGs identification
In recent years, many machine learning (ML) methods have been developed
for gene expression classification. For example, support vector
machines (SVM) use mutual information and Hine to classify genes to
distinguish between colon cancer patients and healthy patients
[[48]21]; logistic regression (LR) has also been used to classify
different types of cancers and normal tissues [[49]22]; A randomized
rest-based approach to classify genes in microarray data was also
proposed [[50]23]. However, these ML methods require prior knowledge
about gene features for classifier training.
In the particular case of high-throughput genomics, with the increase
of gene expression data, deep learning (DL) has been shown to capture
internal structural features of biological data and extract high-level
abstract features from high-dimensional sequencing or expression data,
thereby improving the performance and interpretability of traditional
ML methods [[51]18]. DL includes a series of techniques such as
multi-layer neural networks (MLNN), convolutional neural networks
(CNN), deep autoencoders (AE), and recurrent neural networks (RNN),
which have achieved excellent performance in fields such as computer
vision, pattern recognition, and natural language processing [[52]24].
However, applying DL to RNA-Seq is still challenging because the sample
size is small and lacks appropriate labels compared to the huge number
of genes [[53]25, [54]26]. To address the imbalance between sample size
and features (genes), two types of solutions are proposed in the
literature: data augmentation (DA) and transfer learning (TL)
strategies. It should be noted that compared to TL, DA is more suitable
for small-sample datasets, as it can enhance model robustness and
generalization ability.
Data augmentation (DA) is a process of enlarging the training dataset
by generating pseudo data equivalent to more data without significantly
increasing the amount of actual data, using various methods and
procedures. Generally, DA can be classified into supervised and
unsupervised methods. Supervised DA can be further divided into single
sample DA and multiple sample DA, while unsupervised DA is divided into
generating new data and learning augmentation strategy. For instance,
affine or perspective transformations can be applied to images
[[55]27–[56]29], as well as more advanced techniques based on
variational autoencoders [[57]30] and generative adversarial neural
networks [[58]31]. In experimental settings, DA has biological
significance in that it allows for the generation of new data points
with varied biologically relevant features, enabling better study of
biological variability under different conditions. These new data
points can be utilized to enhance machine learning models’ accuracy,
classification and prediction results, as well as validate and support
biological hypotheses [[59]32]. Moreover, DA can aid in addressing
noise and uncertainty in experiments, thereby improving the quality and
reliability of data. By leveraging data augmentation, researchers can
better utilize the original data and obtain more meaningful and
accurate information from it, thereby enhancing our understanding and
interpretation of biological phenomena and problems [[60]33].
Selecting the suitable deep network model
In addition, there are limitations to commonly used techniques for
neural networks in gene expression data due to its unstructured nature.
In a recent study, multilayer perceptron (MLP) and CNN were applied in
combination with linear discriminant analysis (LDA), logistic
regression (LR), plain Bayesian (NB), random forest (RF), and support
vector machine (SVM) methods to predict disease staging or distinguish
diseased samples from normal samples by comparing tests on more than 30
gene expression genes [[61]34]. The results showed that CNN were one of
the worst-performing methods in most of the analyzed datasets. The
reason for this outcome may be that the effectiveness of CNN depends on
how the convolution filter exploits the local motif patterns present in
the analyzed data. In the image domain, neighboring pixels do not share
information independently, so local patterns can be extracted through
the convolution layer. However, gene expression data lacks local
patterns because neighboring genes in the same sample are considered to
be independent. Hence, gene expression samples do not exhibit “spatial
coherence”, and it is the “local” regions of genes in the sample that
do not share similar information. Therefore, if pixels in an image are
rearranged, their relative positions change, which may negatively
affect the feature extraction and performance of CNNs [[62]35]. RNN, as
a class of neural networks for sequential data (e.g. time series and
text series), is very effective for data with sequential properties
[[63]36], but less effective for gene expression types. In order to
take advantage of DL’s ability to process large datasets and reduce the
need for feature engineering by other machine learning methods, a
Google experimental team proposed a new high-performance and
interpretable canonical deep tabular data learning architecture, TabNet
[[64]37]. TabNet could enhance the ability of the end-to-end learning
process to efficiently encode tabular type data such as gene expression
data. End-to-end learning is a learning paradigm provided by deep
learning, where the entire learning process is left entirely to the DL
model to learn the mapping from the original data to the desired output
directly [[65]38]. This approach circumvents the inherent pitfalls of
multiple modules and could reduce the complexity of the operational
process.
It is possible to create a model based on TabNet architecture that uses
trained features to identify DEGs in large datasets and predict
upregulating (UR)/downregulating (DR) directions. To our knowledge,
there are no end-to-end models in existing literature based on
attentional mechanisms for deep neural network (DNN) models combined
with DA methods to predict DEGs and determine their direction of action
by training gene expression data. Considering the excellent performance
of TabNet network on tabular data, we propose a robust model (TabDEG)
in this paper, which combines DA methods to classify DEGs and predict
UR/DR directions from RNA-seq cancer datasets with higher accuracy than
other ML approaches.
Materials and methods
Data collection and preprocessing
Cancer datasets used in this paper are downloaded from UCSX Xena Data
Browser(the UCSX Xena Data Browser serves as a data browser that
provides access to and analyzes TCGA data,
[66]https://xenabrowser.net/datapages/) and they belong to the type of
pan-cancer. This type of cancer dataset includes *11K RNA-Seq gene
expression samples from 10 different tumor types and is converted to
RSEM values by transformation log2(TPM + 0.001). The TCGA dataset
selected in this paper comes from a variety of samples from patients
with different regions, races and clinical characteristics, and thus
contains different genomic information, which helps to reduce group
bias and improve the representativeness of the results. Meanwhile,
different tumor types may have different characteristics, development
patterns, and treatments, so by using datasets of multiple tumor types,
the over-reliance on specific tumor types can be reduced, and the
generalization ability of the algorithm can be improved, making it more
robust in dealing with unknown or novel tumor types. The ten datasets
presented in [67]Table 1 have different sample size and each sample
contains expression values for 60,488 input variables (transcripts).
Table 1. Values of hyperparameters used in TabDEG model.
Hyperparameters Train Test
max_epochs 200 200
patience 50 50
batch_size 512 512
virtual_batch_size 64 64
drop_last False False
loss_fn loss_fn loss_fn
learning_rate 5e-4 5e-4
weight_decay 1e-5 1e-5
n_d 8 8
n_a 8 8
n_step 1 1
lambda_sparse 0 0
optimizer_fn torch.optim.Adam torch.optim.Adam
optimizer_params dict(lr = 2e-2,weight decay = le-5) dict(lr =
2e-2,weight decay = le-5)
mask_type “entmax” “entmax”
scheduler_params dict(milestones=[50, 100, 150],gamma = 0.95)
dict(milestones=[25, 50, 100, 150],gamma = 0.95)
scheduler_fn torch.optim.lr_schedulerMultiStepLR
torch.optim.lr_schedulerMultiStepLR
verbose 10 10
eval_metric LogLossMetric,SmoothedLogLossMetric
LogLossMetric,SmoothedLogLossMetric
[68]Open in a new tab
The data pre-processing steps are listed as follows (using COAD dataset
as an example):
1. The dataset from TCGA is prepared and represented as an expression
matrix with genes as rows and samples as columns. For all 60,488
RNA transcripts, all samples are screened from the tumor and normal
groups, i.e. all columns with “01A” and “11A” in their names.
2. Import the gene annotation file gene_length_Table, which contains a
total of 56,716 transcript annotation information. And the
annotation information of only 19,627 coding RNAs is needed in this
paper. The ENSEMBL identifier of RNA transcript is used to
intersect TCGA dataset and each row in TCGA dataset is labelled
with the corresponding gene name.
3. Import the “tidyverse” package and then start pre-processing the
gene expression data. RNA transcripts with an expression level
greater than 1 are screened out and a total of 17983 RNA
transcripts are screened. The screen-obtained datasets are divided
into T and N according to its column names (see Step 1 for
details), which are used as control conditions in the next step.
4. After importing “DESeq2” package, the function
DESeqDataSetFromMatrix() is used to construct a DESeqDataSet(dds)
object from the expression matrix and sample information. After
constructing the dds object, DESeq() is used to perform variance
analysis.
5. Since in RNA-seq data, different sample conditions have different
sequencing depths and RNA compositions, which may negatively affect
downstream analysis, the function estimateSizeFactors() is used to
calculate the normalization coefficient ‘sizeFactor’ in the process
of differential analysis. Then estimateDispersions() is used to
estimate the degree of gene dispersion. Finally, statistical tests
are performed using nbinomWaldTest() and the results of variance
analysis could be obtained by calling results().
6. In the result data of difference analysis, the screening conditions
are set as abs(log2FoldChange) > 1 and padj < 0.05, and genes
satisfying these two conditions are treated as DEGs. Non-DEGs are
labelled as “2”.
7. Finally, DEGs are marked as UR/DR directions based on logFC
thresholds. If logFC > 1, DEGs are treated as UR genes and marked
as “1”; and if logFC < -1, DEGs are treated as DR genes and marked
as “0”. LogFC > 1 indicates at least a 2-fold change in the
expression level of a gene between two conditions, and such a
change is considered to be relatively large to provide a higher
significance difference and thus more likely to reflect biological
importance. In many related studies [[69]39], logFC > 1 is
considered a reasonable threshold for screening out DEGs of
functional and biological importance.
Data augmentation
In the introduction, it is pointed out that applying DL to RNA-Seq data
is still very challenging due to the small sample size (n) relative to
the number of genes (g). To address this problem, we adopt a DA method
to simulate biological changes and generate data similar to natural
observations [[70]33], partially alleviating the huge imbalance between
sample size and feature (gene) number in RNA-Seq data and improving the
recognition and classification ability of DL models for complex and
variable biological data. DA can be achieved by combining different
dimensionality reduction and clustering methods. In this paper, we use
PCA, UMAP, and K-means methods. PCA is a common dimensionality
reduction method that finds the main components of data by seeking
directions that maximize variance, but it may ignore variance in other
directions and thus overlook fine structures [[71]40]. In contrast,
UMAP aims to preserve the topological relationships of adjacent samples
in the data, namely finer local structures [[72]41]. PCA and UMAP have
different characteristics and can preserve different types of
structures in the data. Combining them can more effectively extract
features from the data. K-means clustering is a popular unsupervised
learning method that has high accuracy in small sample clustering and
can optimize iterative functions to iteratively correct and prune
clusters obtained to determine the clustering of certain samples
[[73]42]. By using these three methods simultaneously, we can
efficiently perform data augmentation while minimizing accuracy loss.
In experiments, we use statistical knowledge and ML techniques to
expand effective features of TCGA data sets that have been preprocessed
to improve the performance of DL models. Specifically, we classify the
features of the preprocessed TCGA data set by keyword into two groups,
T and N, and then apply factor analysis (FA), UMAP, and K-means methods
to each group. Finally, we extract the information of the two groups of
features and merge the results to obtain the original data set. The
detailed workflow of the entire process can be seen in [74]Fig 1.
Fig 1. Workflow of TabDEG.
[75]Fig 1
[76]Open in a new tab
Factor analysis
FA is a correlation-based data analysis technique and is usually used
to do dimensionality reduction in multidimensional data analysis. In
essence, FA could be used to explore some kind of structure hidden
behind multidimensional data with a large number of observations, with
aim to find common factors for variation in a set of variables and to
group variables with same nature into a factor [[77]40]. There are many
types of FA and the method of Principal Component Analysis (PCA) is
used in this paper.
In a TCGA dataset, assume that genes are denoted as (X[1], X[2], …X[M])
and M represents the number of genes. For each gene X[i](i = 1, 2, …,
M), the corresponding sample is denoted as (x[i1], x[i2], …, x[iN]),
that is, the feature corresponding to each gene has N columns. Thus,
RNA-Seq data containing N samples of M genes could be recorded as an
expression matrix X with dimension M × N. The empirical mean of each
dimension (genes) is denoted as u[i](i = 1, …, M) where
[MATH:
ui=1N
∑j=1Nxij :MATH]
. And the deviation matrix B = X^T − hu^T from the mean is calculated
by the empirical mean vector u = (u[1], u[2], …u[M])^T where h is a
column vector of dimension N × 1 with all entries being 1. Further
calculate the covariance matrix
[MATH:
C=1N-1B*B :MATH]
of the above deviation matrix where * is the conjugate transpose
operator. Finally, the covariance matrix C is decomposed as:
[MATH:
C=VDVT :MATH]
where V is a M × N dimensional matrix composed of N eigenvectors for C
and D is a diagonal matrix with dimension N composed of N eigenvalues
for C. All eigenvectors and eigenvalues in V and D are paired in
decreasing order for eigenvalues, that is, the i − th eigenvalue
corresponds to the i − th eigenvector. The appropriate top p feature
vectors are chosen and treated as the basis for new dimensions and
these new dimensions enable meaningful data augmentation.
UMAP
UMAP is an algorithm for dimensionality reduction by mapping
high-dimensional probability distributions to a low-dimensional space
[[78]41]. The algorithm is mainly based on the theory of refinement and
topological algorithms, which can preserve more global structure and
continuity of data subsets, and has superior runtime performance and
better scalability. In addition, UMAP has no computational restrictions
on the embedding dimension, so it can be used as a generalized machine
learning dimensionality reduction technique [[79]43]. In this paper,
UMAP method is used to process the pre-processed TCGA dataset with aim
to obtain “new” features and this process mainly consists of two steps:
(i)Learning about flow structures in high-dimensional space [[80]41].
In the exploratory analysis of the TCGA dataset, UMAP represents data
using a weighted graph called “fuzzy surface complex”, and finds
nearest neighbors by expanding the radius of each point, ensuring a
balance between local and global structure.
(ii)Finding a low-dimensional representation of the manifold considered
[[81]41]. The learned manifold is projected onto a low-dimensional
space, and the standard Euclidean distance in the globally coordinate
system is used to set distances on the manifold. By selecting and
controlling the minimum point distribution to avoid overlap, the
cross-entropy cost function is minimized to find the optimal weight of
edges in the low-dimensional representation, and stochastic gradient
descent is used to perform the minimization process.
[MATH:
CE=∑e<
mo>∈E{w
h(e)lo
gwh
(e)wl(e)+
(1-wh(e))log1-w
h(e)1-wl
(e)} :MATH]
For edges() in the set E(i.e. all edges found in step 1), the first
term in above formula would act as an “attractor force” as long as
there is a large weight associated with the high-dimensional case. The
reason may be that this term will be minimized in the largest possible
case which would occur when the distance between points involved is as
large as possible. When a high-dimensional weight is small, the second
term acts as a “repulsive force”. The reason may be that this term will
be minimized by making as small as possible. Finally, an array
containing the coordinates of each data point is obtained in the
specified low-dimensional space. The complete UMAP workflow is showed
in [82]Fig 2.
Fig 2. UMAP workflow.
[83]Fig 2
[84]Open in a new tab
K-MEANS
In the process of clustering analysis, the number of clusters K is
generally determined in advance so that the samples in the same
clusters are distributed as closely as possible and the distance
between clusters is as large as possible. Almost all algorithms for
clustering analysis are designed to divide the data analyzed into
independent sets of data samples such that the variances between sets
of clusters are equal and this process is mathematically described as
minimizing the sum of squares within clusters [[85]42]. As an
unsupervised clustering algorithm, K-means is relatively simple to
implement and has good clustering effect so that it is widely used.
There has been perfect strategy to chose the number of clusters K. The
Calinski-Harabasz index is essentially the ratio of the inter-cluster
distance to the intra-cluster distance and the overall calculation
process is similar to the variance calculation, so it is also referred
to as the variance ratio criterion [[86]44].
[MATH:
VRCk=SSBSS
mi>W×N-
kk-1 :MATH]
where SSB is the between-class variance and defined by
[MATH:
SSB=∑
i=1kni‖mi-m‖2<
/mrow> :MATH]
, m is the center point of all points, m[i] is the center point of a
certain class; SSW is the within-class variance and defined by
[MATH:
SSW=∑
i=1k∑x∈Ci‖x-m<
/mi>i‖2 :MATH]
;
[MATH:
N-kk-
1 :MATH]
is the complexity. The larger VRC[k] ratio, the greater data
separation.
After choosing a suitable value K and completing K-means clustering,
new features formed by clusters are obtained for DA. One could refer to
[87]Fig 3 for the flowchart of K-means clustering.
Fig 3. K-means clustering flowchart.
[88]Fig 3
[89]Open in a new tab
Frame structure of model
We proposed a new method called TabDEG, which combines DA and TabNet
and can be used for multi-classification problems such as DEGs
classification and UR/DR direction prediction. PyTorch in python 3.8 is
used to implement TabDEG. This model is suitable for RNA-Seq datasets
with or without suitable labels and with a small number of samples, and
achieves good learning results. Currently, there are no literature
reports on end-to-end models based on attention mechanisms combined
with DA methods for training gene expression data to predict DEGs and
determine their direction of action.
TabDEG model is divided into two stages: (1)The first stage is DA and
its specific structure is shown in [90]Fig 4. The input to the first
stage is two sets of vectors representing two feature sets (N and T)
for each gene in cancer dataset. These two feature sets are extracted
by UMAP, PCA, and K-means in the augmentation step respectively to
generate new input features which are then extended to the same range
or distribution. (2)The second stage is to input data into TabNet for
training and prediction. The definite structure of TabNet is shown in
[91]Fig 5 and its core part consists of encoder and decoder. As shown
in [92]Fig 5, the encoder consists of Feature transformer, Attention
transformer and Feature selection mask, while the decoder is relatively
simple and is designed to retain only the Feature transformer part. In
the stage of entire encode-decode, a special masker is used, with each
dimension of the input features corresponding to a Bernoulli
distribution. The parameters of Bernoulli distributions are controlled
by a manually set pre-training ratio.
Fig 4. DA workflow diagram.
[93]Fig 4
[94]Open in a new tab
The data is divided into two parts: disease group (T) and normal group
(N). UMAP, KMEANS and PCA methods is used with aim to get mixed feature
data in the process of DA. And then the data is standardized to ensure
data scale consistency.
Fig 5. TabNet structure chart.
[95]Fig 5
[96]Open in a new tab
(a) TabNet encoder, composed of a feature transformer, an attentive
transformer and feature masking. A split block divides the processed
representation and decomposition products obtained are used by the
attentive transformer of the subsequent step as well as for the overall
output. For each step, the feature selection mask provides
interpretable information about the model’s functionality and the masks
could be aggregated to obtain global feature important attribution. (b)
TabNet decoder, composed of a feature transformer block at each step.
(c) Attention transformer, the features of the n_a part enter the FC
layer and expand to the same dimension as the input features to
facilitate subsequent attention calculations, then enter the BN layer,
and then multiply with “priors” before entering the sparsemax layer.
The sparsemax layer is used for feature selection and “priors” is a
constant vector with all entries being 1 in the initial step, which
would change in each step. (d) Feature transformer, the output of the
feature transformer is divided into two parts: n_d(n features for
decision) and n_a(n features for attetion), where n_a is used for
further calculation in subsequent steps and n_d is used for the final
decision. (the output of the initial step has only n_a-dimensional
features, rather than n_d-dimensional features to participate in the
final decision).
DA model construction
In the introduction, it is pointed out that RNA-Seq datasets usually
lack suitable labels, have a small number of samples, and are
disproportionate to the number of genes, leading to poor performance of
general DL models. To solve the problem of imbalance between the number
of samples and the number of genes in RNA-Seq datasets, we used DA
methods in this study.
In TabDEG, the DA approach is used to train models with aim to solve
the gene classification task for cancer datasets, i.e. RNA-Seq
datasets, and construct models to solve multiple classification
problems such as classifying DEGs and predicting UR/DR directions. As
shown in [97]Fig 4, the features of TCGA datasets are divided into two
different subsets which are labeled as T and N. For these two subsets,
three DA methods (PCA, UMAP, and K-means) are respectively used to
generate new valid input features (labeled as T[1] and N[1]) in order
to achieve effect enhancement and improve the predictive power of gene
classification for model involved. Taking PCA as an example, an
appropriate hyperparameter p for these two data subsets shoud be
chosen, i.e. the remaining size after dimension reduction. Intuitively,
large differences within T would correspond to more clusters p being
generated while small differences within N would correspond to fewer
clusters p being generated. Along this line of thought, a initial value
p is specified and then confirm the approximate range by iterative
experiments. Similarly, for UMAP and K-means, a similar approach is
used to confirm the approximate range of their hyperparameters. After
confirming hyperparameters of these algorithms built into these
methods, effective input features could be obtained and they could
typically boost effective feature information by 15% to 20%.
In the process of augmenting datasets, [98]Fig 4 shows that all scales
of raw and augmented datasets are not normalized. This management could
avoid and control effect of scale variations as small as possible and
each raw feature is scaled to the same range or distribution before
starting training. A rank transformation could be used to smooth out
the data distribution with less being affected by outliers. However,
this transformation may distort the correlations and distances among
and within features. In order to neutralize these two aspects, a
transformation function QuantileTransformer(), which is parameter-free
and based on quantile function theory, is used to map raw feature data
onto a normal distribution field. Mix transformed raw feature data with
augmented new feature data together to obtain the complete experimental
data (E). As shown in [99]Fig 1, E is divided into train data and test
set in the ratio of 4 to 1 and then the train data is divided into
train set and verification set in the ratio of 4 to 1. During the
experiments, ten large datasets are trained and 5-fold cross-validation
with 10 repetitions is executed in order to avoid bias.
TabNet model construction
Google Research has proposed a new high-performance, interpretable
learning architecture called TabNet, which can directly train on raw
tabular data without any feature engineering. In TabNet, each decision
step uses sequential attention for feature selection inference and
unsupervised pre-training to predict masked features, thereby improving
model accuracy and interpretability. TabNet uses a single DL framework
for end-to-end learning and corresponds to two interpretable points: a
local interpretable point that shows the importance and combination of
each input feature, and a global interpretable point that quantifies
the contribution of each input feature to the output.
Feature selection mask in the stage provides interpretable information
about model capabilities and this mask could be aggregated to obtain
global feature important properties. As could be seen in [100]Fig 5,
the final n_d (the part that passes through the relu output) of each
step is multiplied with the sparse vector output by the sparsemax layer
of the Attention transformer of this round of steps and the results of
the multiplication at each step are accumulated. In fact, it combines
the characteristic importance of two parts which is a small
integration. The output part will be filtered by the ReLU with negative
number of output being directly set to 0 and then the Attention
transformer would also output a mask importance. These two importances
are integrated by multiplying themselves as the result of determining
the feature importance of the current step to the input features. After
the entire encode-decode stage, the reshaped features are obtained, the
activation output of a linear layer is taken as the feature
representation and Softmax() is applied to find the probability of each
class in the range [0, 1].
The experiment is conducted to test the effectiveness of TabDEG model.
All 10 cancer datasets are trained and used to test each dataset. Since
the train data has three labels (“0”, “1” and “2”), this train proess
is a three-class classification problem. In the first phase of
experiment, two feature sets (N and T features) for each gene are
inputted into PCA(), KMeans(), and UMAP() respectively with aim to
extract new valid features and then these two raw sets of features are
inputted into transformation QuantileTransformer() for normalization.
Both the normalized features and the new valid features form final
experimental dataset.
In the stage of experiment, the LabelSmoothing module is built in order
to reduce the weight of the real sample label category when calculating
the loss function, which has the effect of suppressing overfitting and
increasing the generalization ability of TabDEG model. Then TabNet runs
for 100 epochs with a batch size of 512, after the outcome within
LabelSmoothing module being inputted into. It uses Adam() as the
optimizer to calculate the loss between a given input X and the output
(y_pred) and updates the parameters according to the gradient. The
pytorch-tabnet packaged in python3.8 is used to execute the experiment.
The selection of hyperparameters is shown in [101]Table 1 and the train
and prediction of RNA-Seq data can be officially started after these
hyperparameters are specified. For predicted class values 0, 1, 2, the
input gene is accordingly classifed as DR, UR or non-DEGs. To estimate
the predictive performance of TabDEG model, the average value of four
criteria (accuracy, recall, precision and F1-measure) are calculated.
Results
In this section, the new method TabDEG proposed in this paper, data
experiment details and results will be listed. The workflow of TabDEG
is presented in [102]Fig 1. It falls into two stages: the first stage
includes data collection, pre-processing and labeling; the second stage
consists of data augmentation, training and testing the model involved
to predict DEGs and UR/DR directions. The specific architecture of
neural networks involved in TabDEG is given in [103]Fig 5 and these
networks are used to train and classify empirical data studied in this
paper.
Ten cancer datasets (referring to [104]Table 2) from the TCGA project
is used here to do data analysis experiment which consists of three
stages. The first stage is to download these ten datasets via UCSX Xena
Data Explorer and pre-process them respectively. And then the DA
process is applied to these pre-processed datasets in order to
encapsulate them as input data in the next stage. In second stage, the
input data were fed into nueral networks listed in [105]Fig 5 for
training in order to classify DEGs and predict UR/DR directions in all
datasets. In the last stage, the performance of TabDEG model is
compared with other ML methods (DTC [[106]45], CNNs [[107]35], LSTM
[[108]36], RFC [[109]34] and XGBoost [[110]46]) in terms of criteria
such as accuracy, recall, precision, F1-measure and ROC curve.
Table 2. Dataset abbreviations for cancer datasets.
Dataset abbreviation type
BRCA Breast invasive carcinoma
COAD Colon adenocarcinoma
HNSC Head and neck squamous cell carcinoma
KIRC Kidney renal clear cell carcinoma
LUAD Lung adenocarcinoma
LUSC Lung squamous cell carcinoma
PRAD Prostate adenocarcinoma
STAD Stomach adenocarcinoma
THCA Thyroid carcinoma
UCEC Uterine Corpus endometrial carcinoma
[111]Open in a new tab
All through this paper, the following abbreviations are used to denote
relevant datasets. The datasets after pre-processing and labeling are
recorded as R (“raw data”) and the feature column in raw data contains
normal group and tumor group. The feature column of normal group is
denoted as N (“normal data”) while it is marked as N[1] after adding
the feature column generated by DA process. The feature column of tumor
group is denoted as T (“tumor data”) while it is marked as T[1] after
adding the feature column generated by DA process. N[1] and T[1] are
recombined into a new data set which is denoted as E (“Experimental
data”). Then experimental data E is divided into train data and test
set in the ratio of 4 to 1 and the train data is also divided into a
train set and a validation set in the ratio of 4 to 1.
Comparison of TabDEG performance with other ML methods
This section evaluated the performance of TabDEG and 5 other ML methods
(DTC, CNNs, RFC, LSTM, and XGBoost) in predicting DEGs and UR/DR
directions. Comparing using metrics such as accuracy, recall,
precision, F1-measure, and ROC score, the TabDEG model and its
corresponding models were trained on the training sets of all 10 cancer
datasets and tested using 5-fold cross-validation. The experimental
results showed that, as shown in [112]Table 3, the scores of TabDEG
exceeded 93% for each dataset, with a score of 96% for the LUSC
dataset. Compared with the 5 control ML models, TabDEG scores at least
2% higher. In terms of ROC scores, as shown in [113]Fig 6 (only two
datasets are presented in the text, while the rest are presented in the
[114]S1 File), the scores for all 10 cancer datasets were greater than
0.9. Compared with other ML methods, TabDEG performed better on most
datasets, while CNNs and LSTM showed more fluctuation in their results.
The experimental results show that The TabDEG model is better than the
control five ML models and can effectively classify DEGs and predict
UR/DR genes.
Table 3. Performance of TabDEG against other five ML models on test data of
all ten datasets with five-fold cross-validation being used.
PRECISION TabDEG LSTM CNN DTC RCF XGBoost RECALL TabDEG LSTM CNN DTC
RCF XGBoost
BRCA 0.93 0.9 0.79 0.74 0.75 0.91 BRCA 0.93 0.82 0.74 0.72 0.7 0.88
COAD 0.94 0.92 0.92 0.79 0.76 0.92 COAD 0.94 0.87 0.86 0.78 0.67 0.89
HNSC 0.93 0.91 0.78 0.75 0.74 0.9 HNSC 0.93 0.89 0.7 0.74 0.66 0.88
KIRC 0.94 0.94 0.85 0.77 0.76 0.92 KIRC 0.94 0.88 0.73 0.77 0.78 0.91
LUAD 0.94 0.91 0.71 0.78 0.75 0.92 LUAD 0.94 0.84 0.74 0.77 0.75 0.89
LUSC 0.96 0.88 0.92 0.81 0.79 0.93 LUSC 0.96 0.9 0.87 0.81 0.8 0.93
PRAD 0.93 0.9 0.86 0.72 0.78 0.91 PRAD 0.93 0.84 0.78 0.72 0.54 0.84
STAD 0.93 0.9 0.88 0.74 0.72 0.9 STAD 0.93 0.89 0.8 0.73 0.65 0.87
THCA 0.93 0.91 0.76 0.71 0.77 0.9 THCA 0.93 0.87 0.83 0.71 0.57 0.82
UCEC 0.94 0.89 0.9 0.78 0.74 0.91 UCEC 0.94 0.83 0.84 0.78 0.73 0.91
F1-SCORE TabDEG LSTM CNN DTC RCF XGBoost ACCURACY TabDEG LSTM CNN DTC
RCF XGBoost
BRCA 0.93 0.85 0.72 0.73 0.69 0.89 BRCA 0.93 0.89 0.74 0.79 0.79 0.92
COAD 0.94 0.89 0.88 0.78 0.67 0.9 COAD 0.94 0.91 0.86 0.81 0.75 0.92
HNSC 0.93 0.9 0.7 0.74 0.65 0.89 HNSC 0.93 0.91 0.7 0.78 0.74 0.9
KIRC 0.94 0.9 0.7 0.77 0.76 0.91 KIRC 0.94 0.91 0.73 0.8 0.81 0.92
LUAD 0.94 0.86 0.71 0.77 0.74 0.9 LUAD 0.94 0.88 0.74 0.81 0.8 0.92
LUSC 0.96 0.89 0.89 0.81 0.78 0.93 LUSC 0.96 0.88 0.87 0.81 0.8 0.93
PRAD 0.93 0.86 0.79 0.72 0.51 0.87 PRAD 0.93 0.92 0.78 0.83 0.79 0.92
STAD 0.93 0.89 0.81 0.74 0.66 0.89 STAD 0.93 0.91 0.8 0.77 0.74 0.9
THCA 0.93 0.89 0.78 0.71 0.6 0.86 THCA 0.93 0.94 0.83 0.85 0.84 0.93
UCEC 0.94 0.85 0.84 0.78 0.72 0.91 UCEC 0.94 0.85 0.84 0.8 0.75 0.91
[115]Open in a new tab
Fig 6. ROC curves with scores for different methods across ten datasets.
[116]Fig 6
[117]Open in a new tab
GO enrichment analysis of predicted UR and DR genes
After classifying genes into DEGs and their UR/DR directions using our
TabDEG model, we evaluated the GO enrichment of the predicted UR and DR
genes using ToppGene Suite. As shown in the [118]Table 4, the predicted
UR and DR genes were enriched with some common GO terms associated with
carcinogenesis.
Table 4. Datasets GO ID/attribute p-value q-value.
Datasets GO ID/attribute p-value q-value
BRCA G protein-coupled receptor signaling pathway 1.74E-08 1.02E-04
cell fate commitment 2.32E-04 1.00E+00
monocarboxylic acid metabolic process 2.43E-04 1.00E+00
negative regulation of cell development 3.80E-04 1.00E+00
modulation of chemical synaptic transmission 1.53E-04 8.98E-01
humoral immune response 1.74E-07 1.02E-03
killing of cells of another organism 6.44E-07 3.78E-03
trans-synaptic signaling 1.80E-06 1.06E-02
positive regulation of protein kinase A signaling 9.52E-05 2.97E-01
UCEC antimicrobial humoral response 1.07E-08 7.09E-05
humoral immune response 4.61E-07 3.06E-03
keratinocyte differentiation 4.68E-06 3.10E-02
meiotic cell cycle 4.50E-05 2.98E-01
antibacterial humoral response 6.25E-05 4.14E-01
skin development 6.64E-05 4.40E-01
nuclear division 1.17E-04 7.78E-01
angiogenesis 1.05E-08 5.92E-05
regulation of angiogenesis 7.10E-05 4.00E-01
positive regulation of cell differentiation 1.67E-04 9.44E-01
ameboidal-type cell migration 1.10E-03 1.00E+00
regulation of cell migration 4.31E-05 2.43E-01
epithelial cell proliferation 5.40E-04 1.00E+00
regulation of transporter activity 7.10E-05 4.00E-01
G protein-coupled receptor signaling pathway 1.38E-04 7.81E-01
[119]Open in a new tab
According to the data provided and GO ID/attribute, we can see that the
predicted UR and DR gene mappings in the BRCA and UCEC datasets are
mainly related to processes such as tumor cell proliferation, survival,
invasion, and metastasis [[120]47]. In the BRCA dataset, common GO
terms related to cancer include tumor cell proliferation, survival,
invasion, and metastasis, such as the G protein-coupled receptor
signaling pathway (such as GO:0007186) which usually plays an important
role in tumor cell proliferation and infiltration; pathways related to
cell death and immune response (such as GO:0006959 and GO:0031640) are
related to the anti-tumor effect and immune escape of tumor cells;
pathways related to extracellular matrix degradation and cell movement
(such as GO:0099537 and GO:0045165) are related to the invasion and
metastasis of tumor cells [[121]48].
For the UCEC dataset, the GO terms mapped by the predicted UR and DR
genes mainly involve aspects such as cell adhesion and connection, cell
signal transduction, and enzyme-linked receptor protein signaling, and
the abnormal activation or inhibition of these pathways is closely
related to the occurrence and development of various cancers [[122]49].
For example, the cell-cell adhesion protein E-cadherin plays an
important role in the adhesion and metastasis of cancer cells, while
pathways such as the RAS-MAPK pathway and the PI3K-AKT-mTOR pathway are
abnormally activated in many tumors, directly participating in the
growth and invasion of tumor cells [[123]50]. The significant
enrichment of these GO terms suggests that these biological processes
may play a key role in tumors and may become targets for drug
development.
Pathway enrichment analysis of predicted UR and DR genes
We performed pathway enrichment analysis on the predicted UR and DR
genes in the biological test data of the BRCA and UCEC datasets. We
reported 13 important pathways related to the progress of various
cancer datasets, all of which are sourced from the predicted UR and DR
genes of BRCA and UCEC in [124]S2 File. We will discuss the DEGs that
are repeatedly mapped to multiple pathways (using the BRCA data set as
an example) in the following paragraph. In [125]Fig 7, we show the
UR/DR genes in the BRCA dataset that are repeatedly mapped to multiple
pathways [[126]51, [127]52].
Fig 7. Pathways mapped from predicted UR and DR genes of BRCA.
[128]Fig 7
[129]Open in a new tab
UR genes: HTR2A is involved in tumor cell proliferation, invasion, and
metastasis and is associated with the occurrence and progression of
breast cancer [[130]53]; CCL11 is related to cancer-related
inflammation [[131]54]; PTHLH is also considered a breast cancer growth
factor [[132]55]; PROX1 participates in breast cancer formation and
development by inhibiting angiogenesis and regulating epithelial cell
proliferation [[133]56]; NKX6–1 can inhibit the proliferation and
invasion of breast cancer cells [[134]57]; IFNG is an important factor
in anti-tumor immune response [[135]58]; MMP9 is a protease that
promotes tumor invasion and metastasis [[136]59]; DRD2 may affect tumor
growth and metastasis by regulating the proliferation and migration of
tumor cells in neurons [[137]60]; KIT is involved in the proliferation
and invasion of tumor cells [[138]61].
DR genes: COMP can promote the proliferation, invasion, and metastasis
of breast cancer cells and has a certain impact on the occurrence and
progression of breast cancer [[139]62]; ADIPOQ, as an anti-tumor
protein, plays an important role in regulating the apoptosis,
metabolism, and immune response of tumor cells, and its level is
closely related to the prognosis of breast cancer patients [[140]63].
These UR/DR genes identified by our model play important roles in
complex signal transduction networks, participating in the regulation
and cross-reaction of multiple signaling pathways. Further research on
these genes can reveal their mechanisms and key nodes in the entire
signal transduction network, providing a more comprehensive and
detailed understanding for the development of new therapeutic
strategies.
Discussion & conclusion
With the continuous development of NGS technology, RNA-seq has become
an important tool for exploring cellular heterogeneity and disease
states. However, there is currently a lack of classification methods
suitable for all large-scale datasets. Neural network models are not
limited by data distribution and are therefore suitable for classifying
all RNA-Seq data. DL has demonstrated strong performance advantages in
fields such as bioinformatics and will further explore biological
problems such as gene screening in precision medicine in the future.
However, applying DL to classify RNA-Seq data remains highly
challenging, with one reason being the significant imbalance between
sample size and gene features. In addition, these data are typically
unstructured, so commonly used neural network techniques still have
limitations in gene expression data.
This paper proposes a model called TabDEG, which combines DA and
TabNet, aiming to solve multiple classification problems, such as
classifying DEGs and predicting UR/DR direction. The model can be
generalized to RNA-Seq datasets without restrictions on sample size and
appropriate labels, and achieve good learning results. Experimental
results show that the TabDEG model performs better than (at least
comparable to) corresponding machine learning models, effectively
classifying DEGs and predicting UR/DR direction from the test sets of
these TCGA cancer datasets. We validated the biological enrichment of
predicted UR and DR genes in BRCA and UCEC datasets, including GO and
pathway enrichment. We found that predicted UR and DR genes were
significantly enriched in cancer-related GO terms with significant
p-values and q-values. Similarly, in terms of biological pathways, we
found that predicted UR and DR genes were enriched in pathways related
to breast cancer, such as the NABA MATRISOME and NABA CORE MATRISOME
pathways. The pathways mapped by predicted UR and DR genes from UCEC
datasets also played an important role in carcinogenesis, such as KEGG
metabolic disease-related pathways.
The proposed TabDEG model provides a new method for predicting DEGs and
their UR/DR direction from both trained and untrained datasets using
logFC values and statistical knowledge. Through downstream analysis of
predicted UR and DR genes, we gained insight into potential mechanisms
and helped identify regulatory factors for breast cancer and
endometrial cancer. Therefore, by predicting and classifying DEGs and
their UR/DR direction, TabDEG can help explore potential biomarkers
from other RNA-seq datasets.
Supporting information
S1 File. ROC curves with scores for different methods across ten
datasets.
(PDF)
[141]pone.0305857.s001.pdf^ (2.7MB, pdf)
S2 File. Mapped predicted genes in cancer pathways.
(PDF)
[142]pone.0305857.s002.pdf^ (216.7KB, pdf)
Data Availability
The code.zip file is the code involved in the article, and the data.zip
file is the data used in the experiments in the article, and these have
now been uploaded to: [143]https://github.com/xueyupi/my_tabdeg.git.
Funding Statement
This work was supported by a Natural Science Foundation of Guangdong
Province grant (2023A1515012891 awarded to ZW). The funders had no role
in study design, data collection and analysis, decision to publish, or
preparation of the manuscript.
References