Abstract
Simple Summary
Breast cancer is the most fatal female cancer, which the existing
clinical and pathological information sometimes fails to diagnose
accurately. Recent artificial intelligence-based studies have shown the
capability of identifying molecular biomarkers using high-throughput
genomics data. Our aim was to apply machine learning methods to a large
cohort of transcriptomics data for gene reduction and the construction
of a diagnostic model for cancer classification. Advanced statistical
methods and cross-validation with another set of machine learning
methods increased the accuracy of the diagnostic model and predicted a
novel diagnostic nine-gene signature. Further, survival analysis
revealed a novel prognostic model of eight-gene signatures.
Experimental validation confirmed the expression of the identified gene
signatures in breast cancer patients and increased the reliability of
the study. The identified gene signature biomarkers have the potential
to improve healthcare management with precise diagnosis and prognosis
at a reduced cost.
Abstract
Background: Breast cancer (BC) is one of the most common female
cancers. Clinical and histopathological information is collectively
used for diagnosis, but is often not precise. We applied machine
learning (ML) methods to identify the valuable gene signature model
based on differentially expressed genes (DEGs) for BC diagnosis and
prognosis. Methods: A cohort of 701 samples from 11 GEO BC microarray
datasets was used for the identification of significant DEGs. Seven ML
methods, including RFECV-LR, RFECV-SVM, LR-L1, SVC-L1, RF, and
Extra-Trees were applied for gene reduction and the construction of a
diagnostic model for cancer classification. Kaplan–Meier survival
analysis was performed for prognostic signature construction. The
potential biomarkers were confirmed via qRT-PCR and validated by
another set of ML methods including GBDT, XGBoost, AdaBoost, KNN, and
MLP. Results: We identified 355 DEGs and predicted BC-associated
pathways, including kinetochore metaphase signaling, PTEN, senescence,
and phagosome-formation pathways. A hub of 28 DEGs and a novel
diagnostic nine-gene signature (COL10A, S100P, ADAMTS5, WISP1, COMP,
CXCL10, LYVE1, COL11A1, and INHBA) were identified using stringent
filter conditions. Similarly, a novel prognostic model consisting of
eight-gene signatures (CCNE2, NUSAP1, TPX2, S100P, ITM2A, LIFR, TNXA,
and ZBTB16) was also identified using disease-free survival and overall
survival analysis. Gene signatures were validated by another set of ML
methods. Finally, qRT-PCR results confirmed the expression of the
identified gene signatures in BC. Conclusion: The ML approach helped
construct novel diagnostic and prognostic models based on the
expression profiling of BC. The identified nine-gene signature and
eight-gene signatures showed excellent potential in BC diagnosis and
prognosis, respectively.
Keywords: breast cancer, gene expression profiling, artificial
intelligence, machine learning methods, diagnostic and prognostic model
1. Introduction
Breast cancer (BC) is the most common cause of cancer death in women,
with 1 in 8 cancer cases, and its incidence has increased significantly
despite the preventive and curative approaches utilized in recent years
[[38]1,[39]2]. In 2020, the International Agency for Research on Cancer
(IARC) and World Health Organization (WHO) reported 2.26 million new BC
cases and 684,996 global cases of BC mortality in females, surpassing
lung cancer with 2.20 million new cases. Further, the diagnosis of new
cases and BC death by 2040 is predicted to increase to over 3 million
and 1 million, respectively [[40]3,[41]4]. BC is a heterogeneous
disease and the symptoms may include a bump, skin dimpling, nipple
discharge, scaly hair patch and flaky skin around the nipple, and
thickness/swelling in some parts of the breast [[42]5]. BC survival
rates were found to be variable, at ~80%, ~60%, and ~40% in high-,
mid-, and low-income countries, respectively.
An accurate diagnosis is key for the optimal treatment of cancer
patients. At present, cancer classification and diagnosis heavily
depend on the subjective evaluation of physical examination,
clinical/pathological test, radiological scan, and histopathological
information, but they are subject to human errors [[43]6].
Surprisingly, medical error is the third leading cause of death, even
in the most advanced countries such as the USA [[44]7]. Additionally,
in some instances, (i) incomplete or misleading clinical information,
(ii) complicated radiological images, and (iii) variable, atypical, or
lack of morphologic features in histological information may result in
diagnostic confusion, and thus affect patient care [[45]2].
Molecular diagnostics offer precise, fair, and efficient breast cancer
classification, but are not widely applied in clinical settings.
Microarray-platform-based assays, including the Affymetrix GeneChip
Human Genome U133 Plus 2.0 array (Affymetrix, Santa Clara, CA, USA),
have the ability to measure thousands of gene expressions
simultaneously for each data point (sample) [[46]8,[47]9]. Expression
profiling to check for variability in gene expression is an important
factor influencing the precision and accuracy of clinical decisions in
the diagnosis of BC [[48]1,[49]8,[50]10,[51]11,[52]12,[53]13]. Despite
the large-scale, high-dimensional, and highly redundant type of
microarray data, with numerous tools to identify genes that are
differentially expressed across cancer/disease phenotypes, the
interpretation of results and follow-up analysis are quite challenging.
DNA-microarray-based gene expression profiling is promising for BC
diagnosis and prognosis [[54]12,[55]14], but limitations such as small
sample size, biased case vs. control distribution, multiple BC
subtypes, variable populations, and different platforms, complicate the
analysis, and the identification of gene signatures remains an issue
[[56]15,[57]16,[58]17,[59]18,[60]19].
In 1999, the first gene expression signature was identified to classify
leukemia into acute myeloid leukemia (AML) and acute lymphoblastic
leukemia (ALL). Since then, a series of gene expression signatures have
been reported for various cancers to classify tumors, tumor types,
tumor stages, and predict the disease prognosis [[61]20,[62]21].
BC is a very heterogeneous disease and is categorized into five
molecular subtypes: HER2+, basal (ER−/HER2−/PR−), luminal A (ER+/HER2,
with a low-proliferative phenotype), luminal B (ER+/HER2, with a
high-proliferative phenotype), and normal BC [[63]17]. Each subtype
exhibits distinct transcriptomics patterns, and finding unified BC
biomarkers or the gene signature applicable to all molecular subtypes
remains a challenge [[64]13,[65]22]. Hence, multiple datasets need to
be integrated to find universal diagnostic/prognostic biomarkers
broadly applicable to all BC subtypes.
A stringent filtration condition drastically reduces the number of
DEGs, but filters out many biologically relevant genes as well, whereas
a lenient cutoff allows for many genes to pass through, and a follow-up
issue arises in selecting the most interesting genes. Although
conducting gene ontology and pathway enrichment analysis is useful in
predicting biological processes, cellular components, molecular
function, networks, and canonical pathways for the detected DEGs, the
selection of the most relevant genes, a diagnostic and/or prognostic
biomarker, in cancer remains a challenge. Machine learning methods and
different evaluation techniques such as the Kaplan–Meier (KM) estimator
might be useful in identifying the biologically relevant genes from a
long DEG list, without any obvious selection way [[66]8,[67]11].
Another problem in high-throughput gene expression profiling is
reducing the extremely high dimensionality of irrelevant or redundant
gene features responsible for cancer classification accuracy. Feature
selection methods have been used to select key genes from thousands of
expressed genes, but the large numbers of microarray genes used in most
existing methods for cancer classification often hamper the model
outcomes. For an efficient diagnostic model for BC,
machine-learning-based feature selection methods were applied to a
smaller number of differentially expressed genes passing the standard
statistical cutoff of p < 0.5 and log2folds change > 2 in BC.
To address these challenges, we integrated eleven GEO oligonucleotide
microarray datasets to create a gene expression database of 701 samples
(356 breast tumors and 345 normal breast tissues) and applied different
R packages and machine learning methods on gene expression data for the
molecular classification, accurate diagnosis, and prognostic evaluation
of the identified gene signatures in BC. A larger sample size gave
greater analysis power because it constricted the distribution of the
test statistic. Further, an almost equal group (3356 vs. 3345) reduced
the biases in machine-learning-based data analysis, and increases the
accuracy of the model and predicted biomarkers. We also demonstrated
that the combined use of molecular pathway analysis, expression
analysis, feature selection methods, and survival analysis was helpful
in selecting gene signatures with high confidence.
2. Materials and Methods
2.1. Data Sets and Patients
The raw gene expression data, a set of binary files in a CEL format, of
BC from eleven datasets, including [68]GSE61304, [69]GSE42568,
[70]GSE7904, [71]GSE3744, [72]GSE29431, [73]GSE26910, [74]GSE31138,
[75]GSE71053, [76]GSE10780, [77]GSE30010, and [78]GSE111662, were
retrieved from the Gene Expression Omnibus (GEO) database using
“GEOquery” library of the R program
([79]https://www.ncbi.nlm.nih.gov/geo/, accessed on 2 January 2023). We
selected only the human breast tumor samples to eliminate differential
genetic interference in different BC cell lines. Clinicopathological
information from the original studies was used for analysis. The ratio
of breast tumors to normal breast was biased in the majority of the
deposited GEO datasets, including [80]GSE61304, [81]GSE42568,
[82]GSE7904, [83]GSE3744, and [84]GSE26910. Thus, we included
additional normal breast cases ([85]GSE30010 and [86]GSE111662) to
balance the data (n = 701, 356 = BT vs. 345 = NB) for a better outcome,
while identifying DEGs or applying ML methods to develop a diagnostic
model ([87]Table 1). This study was approved by the university’s CEGMR
bioethical committee (16-CEGMR-bioeth-2022), dated 13 October 2022, and
we recruited patients for the validation of potential biomarkers after
obtaining their consent.
Table 1.
GEO datasets from the [88]GPL570 platform used for gene expression
profiling.
Dataset Title/Description Normalization Methods No. of Samples
Percentage of Cancer
[89]GSE61304 Novel biomarker discovery for stratification and prognosis
of breast cancer patients MAS5 signal intensity 62 (58 breast tumor + 4
normal breast) 94%
[90]GSE42568 Breast cancer gene expression analysis Log2 GCRMA signal
intensity 121 (104 breast tumor + 17 normal breast) 86%
[91]GSE7904 Expression data from human breast tissue RMA expression
value 50 (43 breast tumor + 7 normal breast) 86%
[92]GSE3744 Human breast tumor expression GCRMA calculated signal
intensity, log2 transformed 47 (40 breast tumor + 7 normal breast) 85%
[93]GSE29431 Identifying breast cancer biomarkers RMA expression values
66 (54 breast tumor + 12 normal breast) 82%
[94]GSE26910 Stromal molecular signatures of breast and prostate cancer
Log2 RMA signal 12 (6 breast tumor + 6 normal breast) 50%
[95]GSE31138 Identifying novel anti-angiogenic targets in human breast
cancer Log2 RMA signal 6 (3 breast tumor + 3 normal breast) 50%
[96]GSE71053 Differential effect of surgical manipulation on gene
expression in normal breast tissue and breast tumour tissue
Log2-normalized signal 18 (6 breast tumor + 12 normal breast) 33%
[97]GSE10780 Proliferative genes dominate malignancy risk gene
signature in histologically normal breast tissue RMA expression value
185 (42 breast tumor + 143 normal breast) 23%
[98]GSE30010 Expression data from breast samples of postmenopausal
women RMA expression value 107 (0 breast tumor + 107 normal breast) 0%
[99]GSE111662 Whole breast tissue gene expression in comparison to
expression in epithelial and stromal tissues RMA expression values 27
(0 breast tumor + 27 normal breast) 0%
Total 701 (356 breast tumors + 345 normal breasts) 51%
[100]Open in a new tab
2.2. Preprocessing and Differential Expression Analysis
The median expression values of less than 5.55 intensity on the log2
scale of each probe, indicating the failure of true hybridization, were
filtered out. We also excluded the probes expressed in less than two
samples. We merged all the raw CEL files (n = 701) and applied the RMA
method for the normalization of expression values, and generated box
plots using the “oligo” package from R software. Principal component
analysis (PCA) was performed using “prcomp function”, and hierarchical
clustering was carried out using the “pheatmap” R package to correlate
the samples with the probes.
We used linear models for the microarray “Limma” package of R to
identify differentially expressed genes (DEGs), using an empirical
Bayesian method to assess the differences in gene expression. Wang et
al. (2021) conducted a study demonstrating the superior performance of
the moderated t-test when the sample size was ≥40. [[101]23,[102]24].
Our study, with a sample size of 701, comprising 356 breast tumors and
345 normal breast samples, exceeds the required 95% power of the test,
and the nearly equal representation of the test (tumor) and control
(normal) groups mitigates biases in machine-learning-based data
analysis, and enhances the accuracy of the model and predicted
biomarkers. The “decideTests” function was used to differentiate
between the altered (up or down) and normal expression. The “topTable”
function from R was applied with a cut-off-adjusted p-value
(Benjamini–Hochberg-corrected false discovery rate) < 0.05 and log2
fold change > ±2 to detect the most significant DEGs in BC compared
with normal samples. Unannotated probes, not representing genes, were
removed, and duplicate probes, representing single genes, were averaged
for expression values to get a unique set of DEGs.
2.3. Functional Pathway and Gene Set Enrichment Analysis
We used a comprehensive set of functional annotation tools, such as the
QIAGEN Ingenuity Pathway Analysis (IPA knowledgebase v84978992, QIAGEN,
USA) and WEB-based Gene SeT AnaLysis Toolkit (WebGestalt 2019,
[103]https://www.webgestalt.org/, accessed on 2 January 2023), to
investigate and understand the biological meaning of long-list
significant DEGs [[104]1,[105]25,[106]26]. We explored gene ontologies,
enriched and canonical pathways, upstream regulators, disease and
functions, and the networks associated with BC. Over-representation (or
enrichment) analysis (ORA), a statistical method, was used to determine
the presence of known genes in pre-defined sets, as well as in
dataset/DEGs.
2.4. Machine Learning and Feature Selection Methods
We applied machine learning methods to BC transcriptomics data and
considered performance measurements such as classification accuracy,
specificity, sensitivity, and AUC, to identify the most informative
features. To determine the effectiveness of a classification model, a
set of performance metrics was used for assessment, such as measuring
the model’s ability to accurately classify instances into the correct
categories. We used the confusion matrix to compute the accuracy,
precision, recall, and F1 score, as shown in Equations (1)–(4).
[MATH:
Accuracy=TP+
mo>TNTP+F
P+FN+TN<
/mi> :MATH]
(1)
[MATH:
Precision=∑T
mi>P∑TP+∑
FP :MATH]
(2)
[MATH:
Recall=∑TP
∑TP+∑FN :MATH]
(3)
[MATH:
F1−score=2∗p
mi>recision∗recallpreci
mi>sion+recall
:MATH]
(4)
In addition, the model’s performance was evaluated by plotting the
receiver operating characteristic curve (ROC) and the area under the
ROC (AUC), which is a metric used to measure the model’s effectiveness.
Models with larger AUCs are considered to have higher performance.
The Scikit-learn (sklearn) in python platform was used to build the ROC
curves of the DEGs and measure the AUC to compare the diagnostic value
of the DEGs, and to predict the accuracy of the detected DEGs. The ROC
curve, reflecting the relationship between sensitivity and specificity,
and AUC were used to determine the diagnostic value of a factor in a
specific disease, with AUC values between 0.5 and 1 representing low
and high authenticity, respectively.
We used seven machine learning algorithms, including (i) recursive
feature elimination with cross validation (RFECV) with logistic
regression, (ii) RFECV with support vector machine (SVM), (iii) Lasso
regularization (L1) with logistic regression, (iv) Lasso regularization
(L1) with support vector classification (SVC-L1), (v) random forest
(RF) classifier, (vi) extremely randomized trees (extra trees)
classifier, and (vii) genetic algorisms (GA), to find the most
significantly expressed genes in all samples (n = 701, 356BT + 345NB)
and construct the diagnostic model from candidate DEGs. To validate the
constructed diagnostic and prognostic models, we used five additional
ML methods, including (i) adaptive boosting (AdaBoost), (ii)
gradient-boosted decision trees (GBDT), (iii) K-nearest neighbors
(KNN), (iv) multilayer perceptron (MLP), and (v) the extreme gradient
boosting (XGBoost).
2.4.1. RFECV with Logistic Regression or with SVM
We used RFECV with logistic regression and SVM in Python using the
RFECV class from the scikit-learn library. The RFECV with logistic
regression or SVM is a method of feature selection in machine learning.
It is a combination of two techniques: recursive feature elimination
(RFE) and cross validation (CV). RFE is a backward selection algorithm
that starts with all the features and removes the weakest feature until
a specified number of features is left. CV, on the other hand, is a
technique used to evaluate the performance of a model by dividing the
data into several folds and training the model on different folds,
while testing it on one fold at a time. In RFECV with logistic
regression or SVM, the RFE algorithm is combined with CV to eliminate
features, while also evaluating the performance of the logistic
regression model. This helps determine the optimal number of features
that provide the best performance, while avoiding overfitting. By
combining these two techniques, RFECV with logistic regression or SVM
ensures that the final feature set is not only informative, but also
generalizable to new data.
2.4.2. LASSO Regularization (L1) Using Logistic Regression or Support Vector
Classification
We used L1 with logistic regression and SVM from the scikit-learn
library to identify the most significant genes [[107]27]. LASSO
regularization is a technique used to reduce the number of features in
a model and prevent overfitting. The technique shrinks the magnitude of
the coefficients using a penalty term proportional to the absolute
value of the coefficients, resulting in some coefficients becoming
zero. It helps select the most relevant features, while reducing the
impact of irrelevant or noisy features on the model’s performance. L1
regularization is particularly useful when dealing with a large number
of features or highly correlated features. The loss function of Lasso
regression is defined as shown in (5):
[MATH: ∑i=1<
/mn>n(Yi−∑j=1<
/mn>pXij
βj)2+λ∑j=1<
/mn>pβj :MATH]
(5)
where lambda is the regularization parameter that controls the strength
of the penalty term.
In logistic regression with LASSO regularization, the L1 penalty term
helps reduce the impact of irrelevant or noisy features on the model’s
performance by shrinking their coefficients toward zero. This can
improve the model’s interpretability and reduce the risk of
overfitting, especially when dealing with high-dimensional data.
In SVM, LASSO regularization is implemented by adding a penalty term to
the objective function that minimizes the classification error. The
penalty term is dependent on the magnitude of the coefficients of the
features, so larger coefficients receive a larger penalty. This ensures
that features with a large impact on the classification result receive
a smaller penalty and are more likely to be included in the final
model.
2.4.3. Random Forest
The random forest classifier is a machine learning algorithm used for
both classification and regression problems [[108]28]. It is an
ensemble of decision trees, where each tree is trained on a random
subset of data. The final prediction is made by taking the average of
all the trees’ predictions. In feature selection, a random forest
classifier is used to select the most important features in the
dataset. The algorithm calculates the importance of each feature by
measuring the average decrease in impurity for that feature. The higher
the average decrease, the more important the feature is considered.
2.4.4. Extra Trees Classifier
The extra trees classifier is an ensemble machine learning algorithm
that can be used for feature selection in Python [[109]29]. It is a
type of random forest classifier where multiple decision trees are
grown and combined to make a prediction. The algorithm works by
randomly selecting a subset of features at each split in the tree and
determining the most important features based on their impact on the
final prediction. By aggregating the feature importance scores across
all trees, the extra trees classifier can provide a ranking of the most
important features for a given dataset. This can be useful in
identifying the most relevant features for building a predictive model
and reducing the dimensionality of the data.
2.4.5. Genetic Algorithm
This algorithm uses principles of evolution and natural selection to
find the optimal combination of features that result in the best model
performance. The algorithm starts with a random set of features and
uses a fitness function to evaluate the performance of each
combination. The best performing combinations are then recombined and
mutated to create a new generation of features, and the process
continues until a satisfactory set of features is found [[110]30].
2.4.6. XGBoost
This algorithm is an ensemble learning method that works by combining
multiple weak models into a strong one [[111]31]. It uses gradient
boosting, which is a method of iteratively training decision trees on
residuals to improve the model performance. It provides faster
computation and parallelization of training, which is useful when
working with large datasets. It also has built-in regularization
techniques to reduce overfitting, which is a common problem in machine
learning.
2.4.7. GBDT
This is a machine learning algorithm that works by building an ensemble
of decision trees in a way that each subsequent tree focuses on the
errors made by the previous trees [[112]32]. This iterative process
results in a model that can learn complex non-linear relationships in
data. It has the ability to handle large datasets, handle missing data,
and provide accurate predictions with high interpretability.
2.4.8. MLP
This is a type of feedforward artificial neural network that consists
of multiple layers of nodes that process information from the input
layer to the output layer through a series of nonlinear transformations
[[113]33]. The nodes in each layer are connected to the nodes in the
previous and next layers, and each node applies an activation function
to the weighted sum of its inputs. The goal of training the MLP is to
minimize this cost function by adjusting the weights and biases in the
network using an optimization algorithm such as gradient descent. This
allows the model to learn the best set of weights that can accurately
predict the binary classification labels for unseen data. The cost
function in the binary classification of MLP uses the binary
cross-entropy loss of function and is defined as shown in (6):
[MATH:
cross−entropy
=−1N∑i=1<
/mn>Nyi<
/mi>log(pi )+1−yilog(1−pi<
/mi> ) :MATH]
(6)
where N is the number of samples,
[MATH:
yi
:MATH]
is the actual outcome,
[MATH: pi
:MATH]
is the probability of the tumor class, and
[MATH:
1−pi <
/mo> :MATH]
is the probability of the normal class.
2.4.9. AdaBoost
This is an ensemble learning algorithm that combines several weak
learners to create a strong learner [[114]34]. It works by repeatedly
fitting a weak learner to the data, and adjusting the weights of the
training samples to focus on the misclassified ones. The algorithm then
combines these weak learners to form a strong learner that is capable
of accurately predicting the target variable.
2.4.10. KNN
The K-nearest neighbors (KNN) is a popular machine learning algorithm
belonging to the family of instance-based or lazy learning algorithms,
which means that it does not attempt to learn a function from the
training data [[115]35]. Instead, KNN stores all the training data, and
classifies new data based on the similarity of its features to those in
the training set. The number of nearest neighbors (K) is a
hyperparameter that can be tuned to improve performance.
2.5. Survival Analysis Using the Kaplan–Meier Estimator
The KM estimator, a statistical technique tool (available at
[116]https://kmplot.com/analysis/, accessed on 10 February 2023) was
used for calculating survival probability functions to investigate the
overall and relapse-free survival of prognostic genes for breast cancer
patients. It is assumed that the occurrence of the event is fixed in
time, and both the censored observations and data points have an equal
chance of survival [[117]8,[118]36].
The mathematical expression of KM is expressed as shown in (7):
[MATH:
St=∏i:
ti≤t<
mrow>ni−dini=∏i:
ti≤t1−dini
:MATH]
(7)
[MATH:
St :MATH]
stands for survival function.
In this context,
[MATH:
ni
:MATH]
refers to the count of individuals at risk at a specific time tᵢ, and
[MATH:
di
:MATH]
is the count of events that happen at the same time, tᵢ. The survival
curve remains unchanging between the two events or times, i.e., between
tᵢ and tᵢ + 1 [[119]36].
This analysis was conducted for mRNA (gene chip) microarray data for
the relapse-free and overall survival. The KM analysis was performed
with a confidence interval and log rank p-value cut-off of >95% and
≤0.05, respectively. We proceeded to further check the mRNA (RNA seq)
datasets for overall survival for genes which were significant in both
microarray data for relapse-free survival and overall survival.
Finally, we established eight gene hubs (four upregulated and four
downregulated) for prognostic importance.
The web-based KMplot tool incorporates three databases in the
background: TCGA, EGA, and GEO [[120]37]. The Kaplan–Meier method is a
strong non-parametric statistical approach used for predicting the
likelihood of survival. The KM analysis was performed with a confidence
interval and log rank p-value cut-off of >95% and ≤0.05, respectively.
2.6. RNA Isolation and qRT-PCR
Trizol was used to lyse the cells, and chloroform and isopropanol were
used to extract RNA. After determining the RNA concentration, the cDNA
(complimentary deoxyribonucleic acid) was reverse-transcribed. The
primer sets were designed for the identified gene signature using
Primer-3 software (V.0.4.0). ABI 7500 instruments were used for
real-time quantitative PCR. Endogenous GAPDH gene expression was
measured as the internal control to determine the relative expression
of the detected genes. The reaction was run in a final volume of 10 μL,
comprising 5 μL SYBR-Green qPCR master mix (KAPA Biosystems,
Wilmington, MA, USA), 10 pmol of each primer, and 20 ng genomic DNA.
PCR was performed in triplicate using the SYBR-Green qPCR master mix
(KAPA Biosystems, USA) in a 96-well plate. Raw data were generated
through the use of StepOne Plus™ Real-Time PCR Systems and Data Assist
software. qPCR data were analyzed by ∆∆CT or the Livak method, and the
GraphPad PRISM software was used for presentation.
2.7. Statistical Analysis
All statistical analyses were conducted using R software (version
v.4.2.2) (R core team 2021). R was also used for the picture
generation. The chi-squared test was used to compare categorical
variables of patient characteristics. The Wilcoxon rank sum test was
used to compare the expression signature. The p-values were adjusted
for multiple comparisons using the Benjamini–Hochberg method, and the
default value <0.05 was considered statistically significant, otherwise
specified. Cox regression analysis (univariate and/or multivariate) was
used to assess the contribution of all parameters, such as evaluating
the independent predictive OS performance of different clinical factors
and the detected biomarkers. The KM curve and time-dependent ROC curve
were drawn by the R package “survminer” and “survivalROC”,
respectively.
3. Results
3.1. Differentially Expressed Genes in BC
Breast tumor (356) and normal breast tissue (345) samples from the
Affymetrix GeneChip Human Genome U133 Plus 2.0 arrays platform
(HG-U133_Plus_2) with 54,675 features/probes from 11 different GEO data
series were used as discovery cohorts. Out of 54,675 hybridized probes,
only 46,597 probes passed the cut-off: median expression >5.5 and
present in at least two samples. The expression data were
GC-RMA-normalized. The raw intensities and RMA-normalized expression
values were shown in Boxplot ([121]Figure 1).
Figure 1.
[122]Figure 1
[123]Open in a new tab
Boxplot showing the expression distribution for dataset [124]GSE7904.
(A) Raw (un-normalized) expression distribution with log2 scale in the
range of −200 to 400. (B) Normalized intensities showing almost similar
distributions of expression intensities, with the log2 scale in the
range of 0 to 12.
For a tumor-normal matrix, the “decideTests” function differentiated
46,597 probe signal intensities into altered (over-expressed (15,000),
under-expressed (21,952)) and unaltered (9645) expression. The
“topTable” functions revealed the most significant DEGs in breast
cancer (n = 487) for the screening criteria of the adjusted P-value
(Benjamini–Hochberg-corrected false discovery rate) < 0.05 and fold
change (log2FC > ±2). Additionally, unannotated probes, not
representing valid genes (n = 20), were removed first, then duplicate
genes, multiple probes representing single genes (n = 193), were
averaged for expression values (n = 83) to get a unique set of DEGs (n
= 355, upregulated = 77 and downregulated = 278) ([125]Figure 2 and
[126]Table 2).
Figure 2.
[127]Figure 2
[128]Open in a new tab
Volcano plot showing differentially expressed genes: (i) the majority
were non-significant (black), (ii) upregulated DEGs (red), and (iii)
downregulated DEGs (blue).
Table 2.
Top ten up- and downregulated differentially expressed genes from a
total of 355 DEGs in breast cancer.
Gene Symbol Gene Name Log2FC adj.p-Value Decide Test
COL11A1 Collagen Type XI Alpha 1 Chain 4.36 1.69 × 10^−172 Upregulated
TOP2A DNA Topoisomerase II Alpha 3.96 4.16 × 10^−220 Upregulated
S100P S100 Calcium-Binding Protein P 3.70 3.57 × 10^−137 Upregulated
COL10A1 Collagen Type X Alpha 1 Chain 3.59 7.47 × 10^−192 Upregulated
RRM2 Ribonucleotide Reductase Regulatory Subunit M2 3.47 4.44 × 10^−205
Upregulated
CKS2 CDC28 Protein Kinase Regulatory Subunit 2 3.26 8.66 × 10^−201
Upregulated
MMP1 Matrix Metallopeptidase 1 3.21 7.95 × 10^−113 Upregulated
COMP Cartilage Oligomeric Matrix Protein 3.15 6.27 × 10^−137
Upregulated
NUSAP1 Nucleolar And Spindle-Associated Protein 1 3.08 9.67 × 10^−194
Upregulated
ANLN Anillin, Actin-Binding Protein 3.07 2.42 × 10^−173 Upregulated
ADH1B Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide −4.84 2.10 ×
10^−168 Downregulated
ADIPOQ Adiponectin, C1Q And Collagen Domain Containing −4.47 6.08 ×
10^−119 Downregulated
PLIN1 Perilipin 1 −4.20 8.42 × 10^−161 Downregulated
LEP Leptin −4.11 7.20 × 10^−105 Downregulated
LPL Lipoprotein Lipase −4.09 2.35 × 10^−130 Downregulated
SDPR Serum Deprivation Response −4.06 1.34 × 10^−201 Downregulated
RBP4 Retinol Binding Protein 4, Plasma −4.06 4.37 × 10^−118
Downregulated
C2orf40 Chromosome 2 Open Reading Frame 40 −4.05 3.15 × 10^−211
Downregulated
ABCA8 Atp-Binding Cassette Subfamily A Member 8 −4.05 4.21 × 10^−166
Downregulated
NTRK2 Neurotrophic Tyrosine Kinase, Receptor, Type 2 −4.04 4.78 ×
10^−181 Downregulated
[129]Open in a new tab
3.2. Function Pathway Analysis and Network Enrichment Analysis
Functional analysis based on the Z-score and −log(p-value) indicated
activation of the kinetochore metaphase signaling pathway (2.71, 7.35),
PTEN pathway (2.64, 2.31), HOTAIR regulatory pathway (2.12, 2.71), and
WNT/β-catenin signaling pathway (2.45, 1.47), and suppression of the
senescence pathway (−3.16, 3.02), phagosome formation (−3.15, 2.09),
FAK signaling (−3.128, 1.74), oxytocin signaling pathway (−3.05, 3.8),
and breast cancer regulation by Stathmin1 (−3.0, 2.06) ([130]Figure 3).
Most significantly enriched molecular processes were the extracellular
matrix, cell division, mitotic cell cycle process, cell migration, and
the regulation of cell proliferation ([131]Table 3).
Figure 3.
[132]Figure 3
[133]Open in a new tab
Canonical pathways derived using the IPA tool. (A) Kinetochore
metaphase signaling pathway, (B) PTEN pathway overlapped with breast
cancer associated genes.
Table 3.
Gene ontology of biological processes derived from the enrichment of
differentially expressed genes in breast cancer.
Gene Set Description Gene Set Size Expect Values Overlap Value
Enrichment Ratio FDR
GO:0031012 Extracellular matrix 487 09.48 35 3.69 3.31 × 10^−8
GO:0051301 Cell division 576 11.21 41 3.65 2.69 × 10^−9
GO:1903047 Mitotic cell cycle process 780 15.18 46 3.02 2.62 × 10^−8
GO:0016477 Cell migration 1352 26.32 68 2.58 1.45 × 10^−9
GO:0042127 Regulation of cell proliferation 1535 29.88 72 2.40 3.75 ×
10^−9
GO:0048870 Cell motility 1493 29.06 69 2.37 1.64 × 10^−8
GO:0051674 Localization of cell 1493 29.06 69 2.37 1.64 × 10^−8
GO:0009719 Response to endogenous stimulus 1574 30.64 71 2.31 2.01 ×
10^−8
GO:0008283 Cell proliferation 1953 38.02 87 2.28 6.20 × 10^−10
GO:0009888 Tissue development 1814 35.31 77 2.18 3.31 × 10^−8
[134]Open in a new tab
3.3. Machine Learning Algorithms for the Identification of Diagnostic
Biomarker Genes
Initially, seven ML algorithms predicted genes with diagnostic
importance using 355 DEGs significantly expressed in BC samples, and
important genes predicted by at least four ML models were selected for
further analysis (n = 65). Additionally, we analyzed each dataset
individually and checked the status of 355 DEGs in each dataset; the
genes present in at least four datasets were selected for further
analysis (n = 94). We identified 28 common genes passing both criteria
(DEGs > 3 ML and DEGs > 4 datasets) as the potential hub of BC
diagnostic and prognostic biomarkers ([135]Figure 4A). With more
stringent conditions (DEGs > 5 ML and DEGs > 7 dataset) and based on
their role in tumorigenesis, a novel diagnostic nine-gene signature
(COL10A, S100P, ADAMTS5, WISP1, COMP, CXCL10, LYVE1, COL11A1, and
INHBA) was identified for BC. An unsupervised hierarchical
clustering-based heatmap showed a correlation in a pairwise fashion
between the samples and probes. A heatmap of unfiltered probes (n =
54676) was ambiguous and non-conclusive, while the unique set of DEGs
(n = 355), hub genes (n = 28), and gene signatures (n = 9) for BC
diagnosis showed a distinct correlation between the samples and gene
expression ([136]Figure 4B).
Figure 4.
[137]Figure 4
[138]Open in a new tab
(A) Venn diagram showing 28 hub genes derived from the intersection of
DEGs > 3 ML and DEGs > 4 datasets. (B) Unsupervised hierarchical
clustering: heatmap of 701 samples, including 356 breast tumor (BT,
cyan) and 345 normal breast (NB, pink) tissues, showing the gene
expression pattern of 28 hub genes, including diagnostic and prognostic
gene signatures. Upregulated genes are shown in red and downregulated
genes are in blue.
3.4. Machine-Learning-Algorithm-Based 10-Fold Cross-Validation
We used a 10-fold cross-validation technique to evaluate the
performance of ML models for diagnostic and prognostic gene signatures.
To perform 10-fold cross-validation, the dataset was divided into 10
equally sized folds. The model was then trained and validated 10 times,
each time using a different fold for validation and the remaining nine
folds for training. The process was repeated for all the folds, and the
results were averaged to obtain an estimate of the model’s performance.
This provided a more reliable estimate of the model’s performance
compared to using a single-train test split, which may be biased based
on the specific data that were selected; it can also help prevent
overfitting.
For validating the diagnostic performance of our nine-gene signature, a
new set of ML algorithms (GBDT, XGBoost, AdaBoost, KNN, and MLP) was
employed to evaluate the model. By comparing the diagnostic efficiency,
accuracy, and precision of different algorithms, the constructed
diagnostic model was validated ([139]Figure 5 and [140]Table 4). Each
ML model’s performance was evaluated by measuring a range of
performance metrices, including AUC, accuracy, precision, recall, and
the F1 score. Here, all the ML methods predicted were above 95, and the
ML model had the greatest AUC value indicating the candidate gene
signature as a potential biomarker. KNN showed the highest values for
all the evaluation metrics (mean F1 = 0.982), which indicates that it
performed the best among the five models. In contrast, MLP showed the
lowest values for all evaluation metrics. Furthermore, biomarkers that
could distinguish disease samples and normal samples were analyzed
according to PCA, as it groups the samples based on similarities
([141]Figure 6).
Figure 5.
[142]Figure 5
[143]Open in a new tab
K-nearest neighbors (KNN)-based ML model for diagnostic gene signature
showing the mean ROC (AUC 0.989 ± 0.013).
Table 4.
Machine learning methods for the 10-fold cross-validation of the
diagnostic nine-gene signature.
ML Model Mean AUC Mean ACC Mean Precision Mean Recall Mean F1
KNN 0.989 0.981 0.983 0.980 0.982
GBDT 0.995 0.973 0.970 0.978 0.973
AdaBoost 0.992 0.974 0.972 0.977 0.975
XGBoost 0.994 0.971 0.969 0.975 0.972
MLP 0.975 0.960 0.961 0.961 0.960
[144]Open in a new tab
Figure 6.
[145]Figure 6
[146]Open in a new tab
PCA plot showing an overall distribution of the samples (n = 701),
including breast tumor (blue) and normal breast tissue (red) based on
transcriptomics profiles: (A) 54,675 probes, (B) 355 DEGs, (C) 28 hub
genes, and (D) diagnostic nine-gene signature.
3.5. Survival Analysis to Identify Genes with Prognostic Importance
Survival analysis using the KM estimator was conducted for 28 hub
genes. First, relapse-free survival and overall survival analyses were
performed for mRNA (gene-chip), followed by the overall survival for
mRNA (RNA seq). We identified a novel prognostic gene signature of
eight genes (CCNE2, NUSAP1, TPX2, S100P, ITM2A, LIFR, TNXA, and ZBTB16)
that were significant (log-rank p-value < 0.05) under both RFS and OS
conditions ([147]Table 5 and [148]Table 6, and [149]Figure 7 and
[150]Figure 8). The hazard ratio (HR) compares the risk of death
(overall survival) and postoperative follow-up (relapse-free survival)
occurring between the high and low expressions of the gene. The HR
value < 1 or >1 indicates that the risks associated with a lower
expression and higher expression of the gene are significantly
different. On the other hand, the confidence intervals (CI) indicate
the level of uncertainty around the estimated survival probability at
each time point. A narrower confidence interval indicates that the
survival estimate is more precise and that the sample size is large
enough to produce reliable results. The CI value demonstrates the
precision and reliability of the results. Finally, the log-rank p-value
measures the statistical significance that helps determine whether the
observed difference in survival between the groups (high and low
expression of the gene) is statistically significant (<0.05), i.e.,
unlikely to have occurred by chance. Therefore, log-rank p-values are
the deciding factors for survival significance.
Table 5.
mRNA (gene chip) and the relapse-free survival analysis of 28 hub
genes, with the measured hazard ratio (HR), confidence interval (CI),
and log-rank p-value.
Gene Symbol Probe_IDs HR CI Log-Rank p-Value Decision (Log-Rank
p-Value)
ADAMTS5 219935_at 0.9 0.77–0.94 1.50 × 10^−3 Significant
CCNE2 205034_at 1.9 1.67–2.06 1.00 × 10^−16 Significant
CKS2 204170_s_at 1.7 1.51–1.85 1.00 × 10^−16 Significant
CXCL10 204533_at 1.2 1.12–1.37 4.40 × 10^−5 Significant
EDNRB 206701_x_at 0.8 0.69–0.85 2.20 × 10^−7 Significant
FABP4 203980_at 0.9 0.81–0.99 2.58 × 10^−2 Significant
GPC3 209220_at 0.8 0.76–0.92 5.00 × 10^−4 Significant
ITM2A 202747_s_at 0.7 0.63–0.77 1.40 × 10^−12 Significant
LIFR 225575_at 0.7 0.56–0.75 1.60 × 10^−8 Significant
MATN2 202350_s_at 0.9 0.78–0.95 3.30 × 10^−3 Significant
LYVE1 219059_s_at 0.9 0.81–0.99 3.78 × 10^−2 Significant
NUSAP1 218039_at 1.7 1.54–1.89 1.00 × 10^−16 Significant
SCN4B 236359_at 0.6 0.55–0.75 1.00 × 10^−8 Significant
SDPR 222717_at 0.7 0.57–0.77 9.70 × 10^−8 Significant
SPRY2 204011_at 0.9 0.79–0.97 1.02 × 10^−2 Significant
TF 214063_s_at 0.9 0.78–0.96 5.20 × 10^−3 Significant
TNXA 216333_x_at 0.7 0.63–0.77 2.40 × 10^−12 Significant
TPX2 210052_s_at 1.6 1.48–1.82 1.00 × 10^−16 Significant
WISP1 229802_at 0.8 0.64–0.87 1.00 × 10^−4 Significant
ZBTB16 205883_at 0.7 0.58–0.72 1.00 × 10^−16 Significant
COL11A1 37892_at 1.2 1.12–1.38 2.30 × 10^−5 Significant
INHBA 210511_s_at 1.2 1.06–1.3 1.70 × 10^−3 Significant
S100P 204351_at 1.5 1.31–1.61 6.30 × 10^−3 Significant
COL10A1 205941_s_at 1.0 0.88–1.08 6.60 × 10^−1 Insignificant
COMP 205713_s_at 0.9 0.85–1.04 2.53 × 10^−1 Insignificant
GJB2 223278_at 1.0 0.88–1.19 7.89 × 10^−1 Insignificant
LRRC15 213909_at 0.9 0.82–1.01 7.16 × 10^−2 Insignificant
MME 203435_s_at 1.1 0.98–1.2 1.28 × 10^−1 Insignificant
[151]Open in a new tab
Table 6.
mRNA (gene chip) and the overall survival analysis of 28 hub genes,
with the measured hazard ratio (HR), confidence interval (CI), and
log-rank p-value.
Gene SymboL Probe_IDs HR CI Log-Rank p-Value Decision (Log-Rank
p-Value)
CCNE2 205034_at 1.47 1.22–1.78 5.00 × 10^−5 Significant
CKS2 204170_s_at 1.32 1.09–1.59 3.70 × 10^−3 Significant
ITM2A 202747_s_at 0.61 0.5–0.73 2.00 × 10^−7 Significant
LIFR 225575_at 0.59 0.45–0.78 1.00 × 10^−4 Significant
NUSAP1 218039_at 1.65 1.36–2 1.90 × 10^−7 Significant
SDPR 222717_at 0.70 0.53–0.92 8.90 × 10^−3 Significant
TNXA 216333_x_at 0.71 0.59–0.85 3.00 × 10^−4 Significant
TPX2 210052_s_at 1.56 1.29–1.89 3.20 × 10^−6 Significant
ZBTB16 205883_at 0.63 0.52–0.76 1.40 × 10^−6 Significant
S100P 204351_at 1.50 1.25–1.82 2.10 × 10^−5 Significant
ADAMTS5 219935_at 0.85 0.7–1.02 8.00 × 10^−2 Insignificant
COL10A1 205941_s_at 0.96 0.79–1.15 6.38 × 10^−1 Insignificant
COMP 205713_s_at 1.06 0.88–1.27 5.67 × 10^−1 Insignificant
CXCL10 204533_at 0.9 0.75–1.09 2.98 × 10^−1 Insignificant
EDNRB 206701_x_at 0.88 0.73–1.06 1.85 × 10^−1 Insignificant
FABP4 203980_at 0.84 0.7–1.02 7.78 × 10^−2 Insignificant
GJB2 223278_at 1.18 0.9–1.54 2.29 × 10^−1 Insignificant
GPC3 209220_at 0.84 0.7–1.02 7.47 × 10^−2 Insignificant
MATN2 202350_s_at 0.85 0.7–1.02 8.10 × 10^−2 Insignificant
LRRC15 213909_at 0.87 0.72–1.04 1.30 × 10^−1 Insignificant
LYVE1 219059_s_at 1.04 0.86–1.25 6.90 × 10^−1 Insignificant
MME 203435_s_at 0.83 0.69–1.01 5.95 × 10^−2 Insignificant
SCN4B 236359_at 0.83 0.63–1.08 1.68 × 10^−1 Insignificant
SPRY2 204011_at 0.89 0.74–1.08 2.36 × 10^−1 Insignificant
TF 214063_s_at 0.99 0.82–1.19 9.08 × 10^−1 Insignificant
WISP1 229802_at 0.79 0.6–1.03 8.33 × 10^−2 Insignificant
COL11A1 37892_at 1.12 0.93–1.35 2.37 × 10^−1 Insignificant
INHBA 210511_s_at 1.12 0.93–1.36 0.2212 Insignificant
CKS2 204170_s_at 1.32 1.09–1.59 0.0612 Insignificant
SDPR 222717_at 0.7 0.53–0.92 0.0628 Insignificant
[152]Open in a new tab
Figure 7.
[153]Figure 7
[154]Open in a new tab
KM plot based on the relapse-free survival analysis of eight individual
genes (mRNA, gene-chip) of prognostic gene signature. The X-axis and
Y-axis represent time in months and the probability of the survival of
patients, respectively. The impact of the high and low expression of
the gene on patient survival is shown in red and black lines,
respectively.
Figure 8.
[155]Figure 8
[156]Open in a new tab
KM plot based on the overall survival analysis of eight individual
genes (mRNA, gene-chip) of prognostic gene signature. The X-axis and
Y-axis represent time in months and the probability of the survival of
patients, respectively. The impact of the high and low expression of
the gene on patient survival is shown in red and black lines,
respectively.
First, we validated the prognostic eight-gene signature using mRNA (RNA
seq) dataset based on the RFS and OS analyses by collectively measuring
the hazard ratio (HR), confidence interval (CI), and log-rank p-value
([157]Figure 9 and [158]Table 7), and the prognostic signatures were
again validated using five ML methods, including GBDT, XGBoost,
AdaBoost, KNN, and MLP. Among the five ML models, GBDT showed the
highest values for the mean AUC (0.993), mean accuracy (0.980), and
mean F1 score (0.98), while XGBoost showed the highest mean precision
(0.981). KNN showed the second-highest values for all the evaluation
metrics, while MLP showed the lowest values ([159]Figure 10 and
[160]Table 8).
Figure 9.
[161]Figure 9
[162]Open in a new tab
RFS and OS analyses and the validation of upregulated (CCNE2, NUSAP1,
TPX2, and S100P), and downregulated (ITM2A, LIFR, TNXA, and ZBTB16)
gene groups (mRNA, RNA seq) of the prognostic gene signature. The
X-axis and Y-axis represent time in months and the probability of the
survival of patients. The impact of the high and low expression of the
gene on patient survival is shown in red and black lines, respectively.
Table 7.
mRNA (RNA seq) based on the relapse-free survival and overall survival
analyses of eight prognostic gene hubs collectively measuring the
hazard ratio (HR), confidence interval (CI), and log-rank p-value.
Survival Type Gene Hub HR CI Log-Rank p-Value Decision by Log-Rank
p-Value Expression
RFS CCNE2, NUSAP1, TPX2, S100P 1.62 1.46–1.79 1.00 × 10^−16 Significant
Upregulated
RFS ITM2A, LIFR, TNXA-TNXB, ZBTB16 0.58 0.50–0.68 1.90 × 10^−12
Significant Downregulated
OS CCNE2, NUSAP1, TPX2, S100P 1.44 1.19–1.74 0.00014 Significant
Upregulated
OS ITM2A, LIFR, TNXA-TNXB, ZBTB16 0.57 0.43–0.75 4.20 × 10^−5
Significant Downregulated
[163]Open in a new tab
Figure 10.
[164]Figure 10
[165]Open in a new tab
Gradient-boosting decision trees (GBDT) based on the ML model for the
prognostic gene signature showing the mean ROC (AUC 0.993 ± 0.006).
Table 8.
Machine learning model for the 10-fold cross-validation of the
prognostic eight-gene signature.
ML Model Mean_AUC Mean_ACC Mean_Precision Mean_Recall Mean_F1
GBDT 0.993 0.980 0.983 0.977 0.980
XGBoost 0.992 0.976 0.981 0.972 0.976
AdaBoost 0.987 0.967 0.965 0.972 0.968
KNN 0.985 0.979 0.978 0.980 0.979
MLP 0.979 0.966 0.975 0.958 0.966
[166]Open in a new tab
3.6. qRT-PCR Analysis
qPCR was used to confirm the identified BC biomarkers (gene signatures)
by determining the relative expression of 16 genes (nine-gene signature
for diagnosis: COL10A, S100P, ADAMTS5, WISP1, COMP, CXCL10, LYVE1,
COL11A1, and INHBA; and eight-gene signature for prognosis: CCNE2,
NUSAP1, TPX2, S100P, ITM2A, LIFR, TNXA, and ZBTB16), with an overlap of
the S100P gene ([167]Figure 11). GAPDH was used as the internal
control. We found qPCR results in concordance with microarray-analyzed
expression patterns ([168]Table 9).
Figure 11.
[169]Figure 11
[170]Open in a new tab
qRT-PCR results showing overexpression of COL10A, S100P, WISP1, COMP,
CXCL10, COL11A1, INHBA; CCNE2, NUSAP1, TPX2, and S100P genes, and
under-expression of ADAMTS5, LYVE1, ITM2A, LIFR, TNXA, and ZBTB16
genes.
Table 9.
Expression of gene signatures in microarray and qRT-PCR. The results of
qRT-PCR are represented by the quantitative expression (Rq) and
fold-change (FC), along with the standard deviation (StdDev) and
p-values.
Gene Microarray qRT-PCR
FC adj.p-Value Rq FC StdDev p-Value
Expression of Diagnostic Gene Signature
COL11A1 3.907519 5.1 × 10^−163 9.834254 3.297816 0.64495 3.47 × 10^−7
COL10A1 3.842333 2.1 × 10^−178 10.30614 3.365432 0.660567 1.84 × 10^−8
S100P 3.701498 3.6 × 10^−137 5.345665 2.418369 0.995349 2.73 × 10^−11
COMP 3.150415 6.3 × 10^−137 6.191366 2.630258 0.425433 5.81 × 10^−15
INHBA 3.042628 2.6 × 10^−157 7.21937 2.851873 0.901961 1.97 × 10^−8
WISP1 2.551547 6 × 10^−105 4.783692 2.258124 0.461439 2.9 × 10^−8
ADAMTS5 −3.13169 3.3 × 10^−184 0.209914 −2.25213 0.511629 2.88 × 10^−9
CXCL10 2.530934 2.16 × 10^−95 4.763343 2.251974 0.86944 2.16 × 10^−5
LYVE1 −3.14204 1.2 × 10^−142 0.177332 −2.49548 0.877592 1.73 × 10^−6
Expression of Prognostic Gene Signature
CCNE2 2.530327 3.7 × 10^−154 5.112678 2.354079 0.392123 3.1 × 10^−15
NUSAP1 2.732299 2.4 × 10^−124 7.065653 2.820823 0.9127 3.81 × 10^−10
TPX2 2.145025 5.6 × 10^−135 5.179436 2.372795 0.432888 4.72 × 10^−10
ITM2A −2.54576 9.1 × 10^−149 0.241341 −2.05085 0.683145 4.21 × 10^−8
LIFR −3.0494 5.6 × 10^−159 0.271185 −1.88265 0.853309 1.78 × 10^−6
TNXA −2.54523 1.9 × 10^−129 0.228871 −2.12739 0.736176 7.89 × 10^−7
ZBTB16 −2.4943 1.12 × 10^−115 0.187856 −2.4123 0.567998 3.32 × 10^−9
S100P 3.701498 3.6 × 10^−137 5.345665 2.418369 0.995349 2.73 × 10^−11
[171]Open in a new tab
4. Discussion
In recent years, multiple molecular diagnostic prognostic and
predictive biomarkers have been proposed, and despite the availability
of few molecular tests, traditional pathological factors such as the
number of lymph node metastases, tumor size, and tumor grade, continue
to be mandatory for clinical decisions [[172]38]. However, in the era
of personalized treatment, these factors alone are inadequate and
require molecular/genomic assistance, as cancer occurs via genetic
alterations that transform normal cells into tumor cells. Although
significant knowledge exists related to carcinogenesis, a complete
understanding of cancer development mechanisms is still required. In
recent years, genomics and proteomics have played a vital part in the
development of different biomarkers for breast cancer
[[173]39,[174]40]. Gene expression profiling can detect genetic
alterations in the origin, growth, proliferation, and metastasis of
tumors, and classify them accordingly. Gene expression signatures,
derived from DEGs, specifically correlate these genetic alterations
with clinical variables such as the diagnosis and prognosis
[[175]20,[176]41]. A correct and timely diagnosis is the starting point
of treatment and determination of prognosis is the most immediate
challenge in patient management. This can be best achieved through a
combination of traditional clinicopathological prognostic factors,
molecular biomarkers such as single-gene tests (ER, PR, HER2) and
specific multigene tests (gene signatures).
Among the 355 DEGs identified in a combined BC cohort, COL11A1, TOP2A,
S100P, COL10A1, and RRM2 were the most upregulated, while ADH1B,
ADIPOQ, PLIN1, LEP, and LPL were the most downregulated DEGs in BC. The
pathway and enrichment analyses of DEGs revealed activation of the
kinetochore metaphase signaling pathway, PTEN pathway, HOTAIR
regulatory pathway, etc., and suppression of the senescence pathway and
phagosome formation pathways in BC. The most significantly enriched
molecular processes were the extracellular matrix, cell division,
mitotic cell cycle process, cell migration, and regulation of cell
proliferation. First, to verify the reliability of our method of
screening for biomarkers, we confirmed our finding of DEGs, pathways,
and gene ontologies using literature mining and verification. Matching
our results with previous findings was good evidence that they are
indeed involved in the development and progression of BC
[[177]42,[178]43,[179]44,[180]45,[181]46].
Kinetochore architecture and its functional regulation is one of the
most fascinating multi-protein machineries in a cell [[182]47]. The
kinetochore metaphase signaling is essential for chromosome segregation
in mitosis and meiosis [[183]48]. The critical regulators of alignment
and segregation of chromosomes during mitosis, aurora B kinase (AURKB),
dual specificity protein kinase TTK (Mps1), and kinetochore protein
NDC80 homolog (NDC80) previously reported were significant in our study
too [[184]49]. Another essential pathway that was significantly
upregulated in our study was PTEN/PI3K/AKT. This controls the signaling
of numerous biological processes, including apoptosis, cell
proliferation, cell growth, and metabolism. Phosphatase and tensin
homolog deleted on chromosome 10 (PTEN) is a dual protein/lipid
phosphatase, of which the main substrate is phosphatidyl-inositol,3,4,5
triphosphate (PIP3), the product of PI3K [[185]50,[186]51]. The PTEN
tumor suppressor is the chief brake of the PI3K-Akt pathway and a
common target for inactivation in somatic cancers [[187]52]. PTEN
activity is frequently lost in several metastatic human cancers due to
mutations, deletions, or promoter methylation silencing [[188]50].
Senescence is associated with mitochondrial metabolic activities such
as the tricarboxylic acid cycle, oxidative phosphorylation, and
glycolytic pathways. The old senescent cells die during aging or
apoptosis. The senescence pathway promotes cell cycle arrest triggered
in response to stress with increased AMP/ADP:ATP and NAD^+/NADH ratios,
and activating AMPK, p53, p16, KRAS, etc. [[189]53,[190]54,[191]55].
The in vitro demonstration of oncogene-induced senescence establishes
senescence as a vital tumor-suppressive mechanism, in addition to
apoptosis. Senescence not only stops the proliferation of premalignant
cells (tumorigenesis), but also eases the clearance of affected cells
through the immunosurveillance [[192]56]. In vivo studies showed that
suppression of the senescence pathway can also promote mammary
tumorigenesis [[193]57].
AI and ML techniques based on automated medical diagnosis are
increasing gradually for clinical, pathological, and radiological
reports. The fusion of multiple techniques in different types of data
processing for cancer study must be a further instrument to obtain
successful results. An earlier convolution neural network approach had
been applied for image processing in medical diagnosis [[194]58].
However, using AI and ML in the evaluation of high-throughput genomics
data from patients in diagnostic decision-making is still a bottle neck
in healthcare [[195]59,[196]60,[197]61]. Typically, microarray data
have thousands of features (genes/probes), but only a few samples (in
tens or hundreds). For ML classification, it is better to have a large
cohort with fewer features. Eleven BC datasets from different studies
were integrated to increase the cohort size. Transcriptomics profiling
resulted in 355 DEGs associated with BC, but this number was
technically too big to recommend for gene signature biomarkers for
diagnostic or prognostic tests. However, AI and ML have the potential
to filter out genes with the best diagnostic and prognostic importance.
Thus, for BC diagnosis via binary classification (whether or not BC),
we used seven ML and feature selection methods (RFECV-LR, RFECV-SVM,
RF, extra trees, LASSO, SVM-L1, SVM-L2, and GA) for gene reduction, and
found high accuracy in the models. We identified a hub of 28 genes
predicted by at least three ML methods and present in at least four BC
datasets. RFECV-LR and RFECV-SVM improved the classification accuracy
of logistic regression by selecting the most relevant features for the
model and helped reduce overfitting by removing irrelevant or redundant
features. Recursive feature elimination was utilized to rank the genes,
with a random forest classifier used to evaluate gene fitness through
five-fold cross-validation [[198]62]. The extra trees method was
versatile, less prone to overfitting, computationally efficient, robust
to noise, and could handle missing data [[199]29,[200]63]. LASSO had
advantages in its ability to handle multicollinearity, and provided a
sparse solution for both variable selection and shrinkage problems
[[201]64,[202]65]. SVM models were frequently used for classification
and regression tasks using L1 and L2 regularization [[203]66]. L1
regularization had improved the interpretability of the model and
reduced overfitting by encouraging sparsity and selecting only the most
relevant features for the classification task. GA utilized the concept
of survival of the fittest and was based on a population-based search
approach for a robust and efficient search [[204]67].
Based on the importance of the 28 hub genes in BC and using stringent
filter conditions such as the genes predicted to be diagnostically
important by at least five ML methods and present in at least seven BC
datasets, a novel nine-gene signature (COL10A, S100P, ADAMTS5, WISP1,
COMP, CXCL10, LYVE1, COL11A1, and INHBA) was identified. Similarly, by
evaluating 28 hub genes using RFS and OS analyses by KM plot, a novel
prognostic model consisting of an eight-gene signature (CCNE2, NUSAP1,
TPX2, S100P, ITM2A, LIFR, TNXA, and ZBTB16) was identified. Many gene
expression signatures have been proposed for BC diagnosis and prognosis
in recent years; few are under trial and five of them succeeded to get
FDA approval for commercial and clinical application, including
OncotypeDX (21-gene signature), MammaPrint (70-gene signature),
Prosigna (58-gene signature), EndoPredict (12-gene signature), and
Breast Cancer Index (7-gene signature) [[205]68,[206]69].
Gene signature validation was crucial before recommendation for further
analysis and clinical trials. We used 10-fold-cross validation by five
ML methods including KNN, GBDT, AdaBoost, XGBoost, and MLP. Several
studies have reported successful applications of these ML methods in BC
gene hub/signature validation
[[207]70,[208]71,[209]72,[210]73,[211]74,[212]75]. KNN-based validation
was used to classify genes based on their expression profiles, and
identify the gene clusters associated with cancer metastasis
[[213]72,[214]73]. GBDT and AdaBoost were used to identify the key
genes and pathways associated with breast cancer metastasis
[[215]72,[216]74,[217]76]. In addition, to identify disease-associated
genes and pathways, XGBoost predicted cancer recurrence based on gene
expression data [[218]71,[219]73]. The MLP method predicted gene
clusters from expression data that were functionally related and
associated with BC [[220]70,[221]72].
CCNE2 (Cyclin E2), involved in cell cycle regulation, can serve as an
individual indicator of the likely outcome for BC patients. It is
upregulated in tumor tissues and has the potential to function as a
biomarker and linked to worse metastasis-free survival (MFS) outcomes
and a poor overall survival [[222]77,[223]78]. NUSAP1 (nucleolar and
spindle-associated protein 1) playing a critical role in cell division
and being a useful prognostic marker, has been implicated in various
types of cancer, including BC [[224]79]. The TPX2 (targeting protein
for xenopus kinesin-like protein 2), a microtubule-associated protein
involved in spindle formation and cell division, is highly expressed in
various cancers, including BC [[225]80]. A high expression of TPX2 can
reduce the survival time of HER2-positive patients, as well as triple
negative BC [[226]81]. S100P, a calcium-binding protein, is involved in
cell proliferation, differentiation, and apoptosis. An overexpression
of S100P in BC cells makes it more aggressive, and hence it has the
potential as a prognostic and therapeutic biomarker [[227]8]. ITM2A
(integral membrane protein 2A) regulates cellular growth and survival,
and its low expression may play a role in the progression of BC,
especially at advanced stages and higher grades of triple-negative
breast cancer [[228]82]. A decreased expression of LIFR (leukemia
inhibitory factor receptor) may be a marker for a poorer prognosis and
reduced survival in BC [[229]83]. TNXA (tenascin XA) is an
extracellular matrix protein involved in cell adhesion and migration.
The survival analysis of abnormally expressed TNXA in breast tissue
indicates poor prognosis [[230]84]. A low expression of ZBTB16 (zinc
finger and BTB domain-containing protein 16) in BC has been associated
with a poor prognosis and an increased risk of metastasis
[[231]85,[232]86].
5. Conclusions
Artificial intelligence and machine learning approaches for the
identification of novel diagnostic and prognostic gene signature
biomarkers for breast cancer using microarray-based gene expression
profiles were attempted. Initially, we identified a total of 355 DEGs
via gene expression profiling of BC microarray data, and our
artificial-intelligence-based strategy significantly reduced the number
of genes needed for an effective evaluation of diagnostic and
prognostic importance. As a result, two novel gene signatures were
highlighted, (i) diagnostic nine-gene signature (COL10A, S100P,
ADAMTS5, WISP1, COMP, CXCL10, LYVE1, COL11A1, and INHBA) and (ii)
prognostic eight-gene signature (CCNE2, NUSAP1, TPX2, S100P, ITM2A,
LIFR, TNXA, and ZBTB16), using machine learning algorithms and survival
analysis. The results were confirmed via qPCR of BC samples, and
validated by another set of ML methods to measure the model accuracy
and precision. To the best of our knowledge, the identified diagnostic
and prognostic gene signatures are novel and have clinical potential.
Acknowledgments