Abstract

   Ovarian cancer is one of the most common gynecological malignancies,
   ranking third after cervical and uterine cancer. High-grade serous
   ovarian cancer (HGSOC) is one of the most aggressive subtype, and the
   late onset of its symptoms leads in most cases to an unfavourable
   prognosis. Current predictive algorithms used to estimate the risk of
   having Ovarian Cancer fail to provide sufficient sensitivity and
   specificity to be used widely in clinical practice. The use of
   additional biomarkers or parameters such as age or menopausal status to
   overcome these issues showed only weak improvements. It is necessary to
   identify novel molecular signatures and the development of new
   predictive algorithms able to support the diagnosis of HGSOC, and at
   the same time, deepen the understanding of this elusive disease, with
   the final goal of improving patient survival. Here, we apply a Machine
   Learning-based pipeline to an open-source HGSOC Proteomic dataset to
   develop a decision support system (DSS) that displayed high discerning
   ability on a dataset of HGSOC biopsies. The proposed DSS consists of a
   double-step feature selection and a decision tree, with the resulting
   output consisting of a combination of three highly discriminating
   proteins: TOP1, PDIA4, and OGN, that could be of interest for further
   clinical and experimental validation. Furthermore, we took advantage of
   the ranked list of proteins generated during the feature selection
   steps to perform a pathway analysis to provide a snapshot of the main
   deregulated pathways of HGSOC. The datasets used for this study are
   available in the Clinical Proteomic Tumor Analysis Consortium (CPTAC)
   data portal ([34]https://cptac-data-portal.georgetown.edu/).

   Subject terms: Diagnostic markers, Biomedical engineering, Tumour
   biomarkers

Introduction

   Ovarian cancer is the seventh most common cancer in women and the
   eighth-most common cause of cancer death overall, with five-year
   survival rates below 45%. Along with the increasing life expectancy,
   the number of cases diagnosed each year is also growing, with only a
   minimal improvement in mortality^[35]1,[36]2.

   Although once considered a single entity, ovarian cancer can be
   subdivided into different histological subtypes that differ in
   molecular patterns, cells of origin, and clinical features. Among these
   types, high-grade serous ovarian carcinoma (HGSOC) is the most commonly
   diagnosed^[37]3 and is responsible for an elevated number of deaths.
   Its molecular features consist of a p53 mutation for 96% of the cases,
   while BRCA1/BRCA2 accounts for 22% of cases^[38]4. One of the principal
   factors influencing the elevated mortality of HGSOC patients is the
   inability to perform an early diagnosis, due to the symptoms being
   diverse and non-specific^[39]5. While the long-term survival of
   patients with stage I and II of ovarian cancer is respectively up to
   90% and 70%, 4/5 of patients with HGSOC are diagnosed during stage III,
   and IV, resulting in a significantly lower survival rate of less than
   20%^[40]6,[41]7. Several studies have shown the importance of an
   accurate pre-operative evaluation and correct staging to enhance the
   prognosis of patients with a pelvic mass suspected of HGSOC. In fact,
   those treated by gynecologic oncologists had significantly lower
   morbidity and overall increased survival than those treated by general
   gynecologists and general surgeons^[42]5, [43]8–[44]10.

   Several biomarkers, such as CA125^[45]11, HE4^[46]12 and
   osteopontin^[47]13 have been used for the risk assessment of ovarian
   cancer in patients with a pelvic mass. Each of the biomarkers can be
   used alone or combined in multiple-biomarker algorithms (e.g.
   RMI^[48]14, ROMA^[49]15, OVA1^[50]16), having received both FDA and EU
   approval ^[51]17.

   However, the screening methods based on these multiple-biomarker
   algorithms show different limits hampering their usage in clinical
   practice. All of them include CA125, a marker expressed in only 80% of
   Ovarian Cancer cases, and only in the 50% in the early stage of the
   disease^[52]18. The lack of expression in CA125 levels exhibited in
   some ovarian cancer cases and especially in the early stages of the
   disease is reflected by the sensitivity of the algorithms based on
   CA125. Furthermore, other studies show that different physiological and
   pathological conditions exhibit an increased expression of CA125
   levels, thus limiting its specificity for the detection of this
   disease^[53]19,[54]20. The use of additional biomarkers to overcome the
   limits of CA125 usually improves the sensitivity of the algorithm but
   always leads to a reduced specificity to detect ovarian
   cancer^[55]21–[56]23. Hence, the necessity to find new molecular
   distinctive features that could both improve the disease understanding
   and be used as a starting point to develop new diagnostic tools, in
   order to establish one of the most appropriate treatment strategies,
   with the intention to improve ovarian cancer survival rates.

   With this in mind, the purpose of this study was to dissect the
   pathways deregulated in HGSOC and find new possible biomarkers with
   high discriminating power, sensitivity and specificity that are
   localized in the serum, in order to be potentially assessed without
   invasive or expensive approaches. To reach this goal, we analyzed a
   publicly available ovarian cancer proteomic dataset using Machine
   Learning based algorithms, which can manage optimally such large scale
   omic datasets. The data used in this publication were generated by the
   Clinical Proteomic Tumor Analysis Consortium (NCI/NIH) ^[57]24.

   Our computational approach allows us to overcome the decline in the
   specificity of existing tests, maintaining both sensitivity and
   specificity respectively at 98.2% and 97.2%.

Materials and methods

Database

   For this study, we used the publicly available database generated by
   the Clinical Proteomic Tumor Analysis Consortium (CPTAC) ^[58]24. The
   Decision Support System (DSS) was trained, tested, and validated using
   the CPTAC Ovarian Cancer Confirmatory Study Proteomic Dataset, which
   includes the analysis form Ovarian tissue sample from a cohort of 100
   individuals with HGSOC and 25 Non-Tumor ovarian samples, performed by
   the Johns Hopkins University (JHU) and Pacific Northwest National
   Laboratory (PNNL) using isobaric Tags for Relative and Absolute
   Quantification (iTRAQ) protein quantification method^[59]25. Clinical
   features were present only for Tumor patients. The Tumor cohort was
   composed of women ranging from 36 to 85 years, with an average age of
   59. The 7% of the participants had an history of other malignancies.
   The anatomic site of origin of tumor specimens are: ovary 52%, omentum
   41%, peritoneum 3%, pelvic mass 3% and unknown origin 1%. All samples
   are classified as “Serous Adenocarcinoma”. FIGO staging ranges from IIB
   to IV (not specified whether A or B), with the majority of the samples
   classified as stage IIIC (63.8%), followed by IV (15.2%), IIIB (7.6%),
   IIIA (2.9%), IC (1.9%), IIB (1%) and a remaining 7.6% of specimens
   having uncertain classification. The 80.8% of the samples are
   classified as Grade 3, 5.8% as Grade 2, 0.9% as Grade 1, while for
   12.5% of the samples grading was not reported. The efficacy of the DSS
   was further tested on the dataset generated from the CPTAC and TCGA
   Cancer Proteome Study of Ovarian Tissue, including the analysis of
   samples from 174 Ovarian tumors, of which 169 from HGSOC, also
   performed by JHU and PNNL using iTRAQ^[60]26. Cohort is composed of
   women ranging from 35 years to 87, with an average age of 60.5. Tumor
   tissue site is Ovary for 98% of the samples, Omentum in 1% of the
   samples and Peritoneum ovary in 1%. All samples are classified as
   “Serous Cystadenocarcinoma”. FIGO staging of the samples goes from
   stage IC to IV (not specified whether A or B), where stage IIIC
   accounts for 69.9% of the samples, IV for 17%, IIIB and IIC accounting
   each one for 4.4%, IC for 1.5%, and IIA, IIB and IIA accounting each
   one for 1%. The 81.5% of the samples are Grade 3, 16.5% are Grade 2, 1%
   are Grade 1, while grading is unknown for 1% of the samples. Datasets
   were subsequently processed in Python (distribution 3.9.1) using NumPy
   and pandas libraries to merge JHU and PNNL datasets and remove protein
   columns containing more than 10% of missing values. After that, the
   data were processed and analyzed using a software tool coded in
   MATLAB2020b (Mathworks Inc., MA).

Machine Learning pipeline

   Here we describe the Machine Learning pipeline used to develop the
   Decision Support System. Each sample from the dataset is described by
   its features (i.e., the proteins). We report such pipeline in Fig.
   [61]1. It includes the following steps:

Figure 1.

   [62]Figure 1
   [63]Open in a new tab

   Machine Learning pipeline.

Feature selection based on correlation analysis

   In this step, we computed for each feature the Pearson correlation
   coefficient with respect to the target variable (tumor/non tumor). The
   correlation coefficient between two random variables is a measure of
   their linear dependency. If each feature has N scalar observations,
   then the Pearson correlation coefficient of the i-th feature
   [MATH: <msub><mi>f</mi><mi>i</mi></msub> :MATH]
   is defined as
   [MATH: <mrow><mtable><mtr><mtd
   columnalign="right"><mrow><mi>ρ</mi><mrow><mo
   stretchy="false">(</mo><mrow><msub><mi>f</mi><mi>i</mi></msub><mo>,</mo
   ><mi>t</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><munderover><mo>∑</mo><mrow><mi
   >j</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></munderover><mfenced
   close=")"
   open="("><mfrac><mrow><msub><mi>f</mi><mi>i</mi></msub><mrow><mo
   stretchy="false">(</mo><mi>j</mi><mo
   stretchy="false">)</mo></mrow><mo>-</mo><msub><mi>μ</mi><msub><mi>f</mi
   ><mi>i</mi></msub></msub></mrow><msub><mi>σ</mi><msub><mi>f</mi><mi>i</
   mi></msub></msub></mfrac></mfenced><mfenced close=")"
   open="("><mfrac><mrow><mi>t</mi><mrow><mo
   stretchy="false">(</mo><mi>j</mi><mo
   stretchy="false">)</mo></mrow><mo>-</mo><msub><mi>μ</mi><mi>t</mi></msu
   b></mrow><msub><mi>σ</mi><mi>t</mi></msub></mfrac></mfenced></mrow></mt
   d></mtr></mtable></mrow> :MATH]
   1

   where
   [MATH: <msub><mi>μ</mi><msub><mi>f</mi><mi>i</mi></msub></msub> :MATH]
   ,
   [MATH: <msub><mi>σ</mi><msub><mi>f</mi><mi>i</mi></msub></msub> :MATH]
   ,
   [MATH: <msub><mi>μ</mi><mi>t</mi></msub> :MATH]
   ,
   [MATH: <msub><mi>σ</mi><mi>t</mi></msub> :MATH]
   are the mean and standard deviation of the i-th feature and the target
   variable, respectively. The values of the coefficients can range from −
   1 to 1, with − 1 representing a direct, negative correlation, 0
   representing no correlation, and 1 representing a direct, positive
   correlation. All features with an absolute value of the correlation
   coefficient higher than 0.6 are then selected. In this way, we selected
   all the features with a high (positive or negative) correlation with
   the target variable.

Feature selection based on relief method

   All the features selected from the Correlation Analysis are then
   examined with a second feature selection step based on the ReliefF
   algorithm^[64]27. Such an algorithm ranks the importance of the
   features with respect to the target value. The importance of a feature
   is represented by the weight of that feature. The values of those
   weights can range from
   [MATH: <mrow><mo>-</mo><mn>1</mn></mrow> :MATH]
   to 1, with the largest positive weights assigned to the most important
   features. The algorithm penalizes the features that provide different
   values to k neighbors of the same class while rewarding the ones that
   provide different values to k neighbors of different classes.

Decision tree

   The features (i.e. the proteins) selected by the reliefF method are
   used to train the CART^[65]28 algorithm for the binary
   (Tumor/Non-Tumor) classification task. We chose to use a decision tree
   classifier for its high interpretability and explainability, unlike
   other methods of machine and deep learning. The CART tree is a binary
   decision tree that is constructed by splitting a node into two child
   nodes repeatedly, beginning from the root node that contains the whole
   learning sample. The basic idea of the tree growth is to choose a split
   among all the possible splits at each node so that the resulting child
   nodes are the “purest”. The purity metric defines a node as 100% impure
   when its samples evenly belong (50:50) to both the classes while
   defining a node as 100% pure when all of its data belongs to a single
   class. In this algorithm, only univariate splits are considered. That
   is, each split depends on the value of just one feature. At node t, the
   best split s is chosen to maximize a splitting criterion
   [MATH: <mrow><mi mathvariant="normal">Δ</mi><mi>i</mi><mo
   stretchy="false">(</mo><mi>s</mi><mo>,</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   . When the impurity measure for a node can be defined, the splitting
   criterion corresponds to a decrease in impurity. In our case, we used a
   Gini criterion as the impurity measure. During the training, we chose
   not to impose a control on the tree’s depth, fixing the maximum number
   of splits as the size of the training set
   [MATH: <mrow><mo>-</mo><mn>1</mn></mrow> :MATH]
   and the minimum leaf size (the minimum number of samples in the leafs)
   as 1. Furthermore, we fixed the cost of classifying a sample into class
   j if its true class is i equal to:
     *
       [MATH:
       <mrow><msub><mi>C</mi><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></
       msub><mo>=</mo><mn>1</mn></mrow> :MATH]
       , if
       [MATH: <mrow><mi>i</mi><mo>≠</mo><mi>j</mi></mrow> :MATH]
     *
       [MATH:
       <mrow><msub><mi>C</mi><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></
       msub><mo>=</mo><mn>0</mn></mrow> :MATH]
       , if
       [MATH: <mrow><mi>i</mi><mo>=</mo><mi>j</mi></mrow> :MATH]

   We decided also not to implement a pruning strategy.

Performance evaluation

   To evaluate the performance of our system we computed the confusion
   matrix. A confusion matrix is an N
   [MATH: <mo>×</mo> :MATH]
   N matrix used for evaluating the performance of a classification model,
   where N is the number of target classes. In our case, the task
   performed by the model is a binary classification task, thus N is equal
   to 2. From the confusion matrix we calculated the classification
   accuracy
   [MATH: <mfenced close=")"
   open="("><mi>A</mi><mi>c</mi><mi>c</mi><mo>=</mo><mfrac><mrow><mi>T</mi
   ><mi>P</mi><mo>+</mo><mi>T</mi><mi>N</mi></mrow><mrow><mi>P</mi><mo>+</
   mo><mi>N</mi></mrow></mfrac></mfenced> :MATH]
   , the precision per class
   [MATH: <mrow><mo stretchy="false">(</mo><msub><mi>P</mi><mrow><mi
   mathvariant="italic">Tumor</mi></mrow></msub><mo>=</mo><mfrac><mrow><mi
   mathvariant="italic">TP</mi></mrow><mrow><mi>T</mi><mi>P</mi><mo>+</mo>
   <mi>F</mi><mi>P</mi></mrow></mfrac></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>P</mi><mrow><mi
   mathvariant="italic">NonTumor</mi></mrow></msub><mo>=</mo><mfrac><mrow>
   <mi
   mathvariant="italic">TN</mi></mrow><mrow><mi>T</mi><mi>N</mi><mo>+</mo>
   <mi>F</mi><mi>N</mi></mrow></mfrac><mrow><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   , sensitivity and specificity
   [MATH: <mfenced close=")"
   open="("><mi>S</mi><mi>e</mi><mi>n</mi><mi>s</mi><mi>i</mi><mi>t</mi><m
   i>i</mi><mi>v</mi><mi>i</mi><mi>t</mi><mi>y</mi><mo>=</mo><mfrac><mrow>
   <mi
   mathvariant="italic">TP</mi></mrow><mi>P</mi></mfrac><mo>,</mo><mspace
   width="0.166667em"></mspace><mi>S</mi><mi>p</mi><mi>e</mi><mi>c</mi><mi
   >i</mi><mi>f</mi><mi>i</mi><mi>c</mi><mi>i</mi><mi>t</mi><mi>y</mi><mo>
   =</mo><mfrac><mrow><mi
   mathvariant="italic">TN</mi></mrow><mi>N</mi></mfrac></mfenced> :MATH]
   . Furthermore for each class we compute the F1 score, a relevant metric
   in case of unbalanced dataset,
   [MATH: <mrow><mi>F</mi><msub><mn>1</mn><mrow><mi
   mathvariant="italic">Tumor</mi></mrow></msub><mo>=</mo><mn>2</mn><mrow>
   </mrow><mo>∗</mo><mfenced close=")"
   open="("><mfrac><mrow><msub><mi>P</mi><mrow><mi
   mathvariant="italic">Tumor</mi></mrow></msub><mrow></mrow><mo>∗</mo><mi
   >S</mi><mi>e</mi><mi>n</mi><mi>s</mi><mi>i</mi><mi>t</mi><mi>i</mi><mi>
   v</mi><mi>i</mi><mi>t</mi><mi>y</mi></mrow><mrow><msub><mi>P</mi><mrow>
   <mi
   mathvariant="italic">Tumor</mi></mrow></msub><mo>+</mo><mi>S</mi><mi>e<
   /mi><mi>n</mi><mi>s</mi><mi>i</mi><mi>t</mi><mi>i</mi><mi>v</mi><mi>i</
   mi><mi>t</mi><mi>y</mi></mrow></mfrac></mfenced></mrow> :MATH]
   and
   [MATH: <mrow><mi>F</mi><msub><mn>1</mn><mrow><mi
   mathvariant="italic">NonTumor</mi></mrow></msub><mo>=</mo><mn>2</mn><mr
   ow></mrow><mo>∗</mo><mfenced close=")"
   open="("><mfrac><mrow><msub><mi>P</mi><mrow><mi
   mathvariant="italic">NonTumor</mi></mrow></msub><mrow></mrow><mo>∗</mo>
   <mi>S</mi><mi>p</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>f</mi><mi>i</mi><
   mi>c</mi><mi>i</mi><mi>t</mi><mi>y</mi></mrow><mrow><msub><mi>P</mi><mr
   ow><mi
   mathvariant="italic">NonTumor</mi></mrow></msub><mo>+</mo><mi>S</mi><mi
   >p</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>f</mi><mi>i</mi><mi>c</mi><mi>
   i</mi><mi>t</mi><mi>y</mi></mrow></mfrac></mfenced></mrow> :MATH]
   .

   As usual, P and N denote the number of positive patients (with Tumor)
   and negative patients (Non-Tumor) records, whereas TP, TN, FP and FN
   stands respectively for true positive, true negative, false positive
   and false negative classifications. A true positive classification
   implies that the patients are correctly detected by the system as
   patients without tumor, whereas a true negative classification
   indicates that the system correctly recognizes the patients with HGSOC.
   We developed two main performance test:
     * Test 1 This test is developed to evaluate the performance of the
       system only on CPTAC dataset using a 5-fold cross-validation
       procedure as follows. First, we randomly shuffled the dataset and
       split it into 5 groups. For each group, a single group is taken as
       a hold out or test data set and the remaining groups as a training
       data set. After training and test, the evaluation score is retained
       and the model is discarded. This operation is then repeated for
       each group. Importantly, each sample in the data set is assigned to
       an individual group and stays in that group for the duration of the
       procedure. This means that each sample is given the opportunity to
       be used in the hold out set once and used to train the model 4
       times. This procedure results in a less biased or less optimistic
       estimate of the system performance than other methods, such as a
       simple train/test split.
     * Test 2 This test is developed to evaluate the robustness of our
       system. We trained the system on CPTAC Dataset and tested it on a
       different dataset called Cancer Proteome Study of Ovarian Tissue
       (TCGA). This latter dataset is composed of 216 tumor patients.

Pathway enrichment analysis

   We used the ranked lists of proteins resulting from the correlation
   analysis, as input to perform a Pathway Enrichment Analysis using
   GSEA^[66]29,[67]30 v.4.1.0 desktop software. The pathway gene set
   database was:
   Human_GO_AllPathways_with_GO_iea_January_13_2021_symbol.gmt release
   13-01-2021, downloaded from [68]http://baderlab.org/GeneSets. This file
   includes pathways from GO, Panther, NetPath, NCI, Reactome and MSigDB,
   both C2 and Hallmark collection. The number of permutations was set to
   1000 and the maximum size of the sets was set to 200. Visualization of
   enrichment results was performed with Cytoscape^[69]31 v.3.8.2 using
   EnrichmentMap Pipeline Collection apps^[70]32, setting the FDR Q value
   cutoff to 0.01. In this work, we selected all the features with a
   coefficient higher than the average value taken by the positive
   coefficients.

Results

   As the first step of feature selection, the correlation was assessed
   between each feature and the tumor or non tumor variable, in order to
   possibly identify the most relevant molecular features of the tumor
   phenotype. The dataset after the pre-processing step consisted of 209
   samples and 6223 proteins. In Table [71]1 we reported the results
   obtained setting the correlation coefficient cutoff to 0.6, thus
   reducing the significant features to 137 proteins. After the second
   step of feature selection, the list was further reduced to 46 proteins.

Table 1.

   Here are summarized the results of the correlation between proteomics
   data and tumor phenotype. It appears that a vast portion of the
   proteins displayed no evident correlation, and the majority of the
   proteins were negatively correlated.
                        Tumor
   Positive correlation  20
   Negative correlation  117
   Noncorrelation       6086
   [72]Open in a new tab

   We then used the entire set of proteins and their respective
   correlation coefficient as a ranked list to perform a GSEA pathway
   enrichment analysis. The output was subsequently visualized and
   interpreted using the Cytoscape add-on EnrichmentMap. Resulting
   Normalized Enrichment Scores (NESs) ranged from -3.3251 to 3.4016. A
   subnetwork (Fig. [73]3) was generated from the main enrichment map
   selecting the most enriched pathways, setting the cutoff of NES to
   [MATH: <mrow><mo>+</mo><mo>-</mo></mrow> :MATH]
   2.5, in order to drive the attention only on the most represented
   pathways. As in Fig. [74]3A, B the over-represented pathways are
   related to three main categories: RNA maturation and export,
   Translation and DNA Repair. By contrast, under-represented pathways
   (Fig. [75]3C) include: immune response, cell-matrix adhesion and
   extracellular matrix adhesion, protease activities, G-Protein coupled
   receptors signalling, myogenesis, muscular contraction, wound healing
   and blood coagulation.

Figure 3.

   [76]Figure 3
   [77]Open in a new tab

   A Subnetwork was created from the main network to increase the
   interpretability. Red and blue nodes represent pathways that are
   upregulated (A, B) and downregulated (C). The diameter of each node is
   proportional to the number of proteins included. Pathways sharing
   proteins are connected with blue edges, with the thickness of the edges
   proportional to the number of protein shared. Clusters of nodes were
   manually annotated.

Explainable decision support system for tumor/non-tumor classification and
biomarker discovery

   With respect to test 1, we evaluated our method on the dataset
   presented in “[78]Database” section. So, we started with a full dataset
   consisting of 209 samples and 6223 proteins. After the first step of
   Feature Selection based on Correlation Analysis, 137 features were
   left. Then, after the ReliefF-based Feature Selection step, we obtained
   46 proteins. Finally, the dataset comprising 209 samples of 46 features
   was used to train the decision tree classifier. The model and the
   biomarkers achieved are shown in Fig. [79]2. The model is characterized
   by a graph with split conditions on three proteins: TOP1, PDIA4 and
   OGN. Furthermore, in Table [80]2 we report the classification confusion
   matrix that was computed collecting the prediction at the end of each
   iteration of the 5-fold cross-validation. All computed metrics from the
   confusion matrix are equal to 98.1% for accuracy, 98.2% for the
   sensitivity, 97.6% for specificity, 93% for precision of Non-Tumor
   class and 99.4% for precision of Tumor class, and 95.3% and 98.8% for
   F1-score of Non-Tumor and Tumor classes, respectively. With respect to
   test 2 we analyze the robustness of our system: for this reason we
   trained it on a dataset (CPTAC) and tested on a different one (TCGA).
   This latter dataset is composed of 216 tumor patients. In Table [81]3
   we report the confusion matrix achieved. Furthermore, we calculate the
   accuracy of the system and the precision, sensitivity and F1-score per
   Tumor class that are equal to 98.2%, 100%, 97.2%, and 98.6%
   respectively. We did not computed metrics regarding the Non-Tumor class
   since the TCGA dataset does not present samples of this class.

Figure 2.

   Figure 2
   [82]Open in a new tab

   Final decision tree, with focus on the biomarkers.

Table 2.

   This Confusion Matrix is achieved in fivefold-cross-validation on CPTAC
   Ovarian Cancer Confirmatory Study Proteomic Dataset (209 samples). The
   matrix compares the actual target values (Truth) with those predicted
   (Pred.) by our model. On first diagonal are reported the samples
   correctly classified, whereas on second diagonal are reported the
   misclassified samples.
   Pred.     Truth
             Non-tumor Tumor
   Non-tumor    40       3
   Tumor         1      165
   [83]Open in a new tab

Table 3.

   This Confusion Matrix reports the performance of our system trained on
   CPTAC Ovarian Cancer Confirmatory Study Proteomic Dataset and tested on
   TCGA Cancer Proteome Study of Ovarian Tissue (216 samples). The matrix
   compares the actual target values (Truth) with those predicted (Pred.)
   by our model. On first diagonal are reported the samples correctly
   classified, whereas on second diagonal are reported the misclassified
   samples. The TCGA dataset only presents samples from the Tumor class.
   Pred.     Truth
             Non-tumor Tumor
   Non-tumor 0           6
   Tumor     0          210
   [84]Open in a new tab

Discussion

   Given the impact and the high mortality rate of HGSOC, numerous studies
   from the past few years took advantage of ’-omic’ scale expression data
   to characterize its underlying molecular features and to discover novel
   biomarkers. Nevertheless, the vast majority of existing studies makes
   use of RNA expression rather than protein expression. The main reason
   is the advantage of transcriptomics being a robust and cost-effective
   high-throughput technology. However, mRNA levels do not always
   correlate to protein abundance, given the number of regulatory
   processes occurring after mRNA transcription^[85]33,[86]34. Hence, to
   find novel biomarkers suitable for cost-effective and non-invasive
   diagnostic methods such as blood or serum testing, we choose to base
   our analysis on Proteomics data.

Correlation-based overview on the most deregulated pathways

   We first performed a correlation analysis. In this way, we reduced the
   number of features in the dataset, and at the same time, removed the
   “background noise” represented by the proteins that had a random
   correlation with the Tumor phenotype^[87]35. We then used the gene set
   enrichment analysis to extract biological insight from the ranked list
   of proteins that emerged from the correlation analysis. Among the
   over-represented pathways, displayed in Fig. [88]3 and summarized in
   Table [89]4, we found established and well-known cancer signatures,
   such as the increase of MYC and E2F downstream genes and DNA-Repair
   related genes such as MCMs and RAD21^[90]36–[91]39. Interestingly, as
   shown in Fig. [92]3B, pathways related to mRNA splicing, export,
   metabolism, and translation were strikingly abundant and predominant
   among all the over-represented pathways. Given the crucial role of
   splicing as a source of biological complexity and plasticity, this same
   mechanism can be exploited by cancer cells to adapt and thrive in
   tumor-induced pathological conditions such as hypoxia^[93]40 and,
   favoring tumor progression, by contributing to the reprogramming of the
   cellular processes^[94]41. In accordance with this, a study shows that
   the spliceosome inhibitory drug Sudemycin is able to induce selective
   cytotoxicity in chronic lymphocytic leukemia (CLL) cells by targeting
   SF3B1, a component of U2 snRNP, which is also found in 13 nodes of our
   network. At the level of RNA export, there are several forms of cancer
   associated with dysregulation of some nucleoporins (Nup98, Nup214),
   components of the transcription-export complex TREX (THOC1), and
   exportines (XPO1, XPO5) that are also included in several nodes of our
   network and may be worth investigating further for their involvement in
   HGSOC^[95]42–[96]44. As shown in Fig. [97]3A a large portion of
   pathways involved in the assembly of the initiation complex and
   ribosome biogenesis were significantly over-represented. Increasing
   evidence links deregulation of translational control to cancer
   insurgence and progression. Indeed, one of the most regulated steps
   during translation is its initiation, given its role in the decision of
   the rate of production of every protein, or if it is produced at
   all^[98]45. It is therefore not surprising that initiation factor
   encoding genes (eIFs) are overexpressed in a variety of cancers, such
   as breast, prostate and pancreatic cancer^[99]46, [100]47. Altered
   ribosome biogenesis also concurs to the altered translational activity
   of cancer cells; for example, it has been observed that in the
   aggressive breast cancer cell line MA-, 43S pre-rRNA was abnormal,
   resulting in an impaired ability to initiate p53 cap-independent
   translation via IRES^[101]48. Another cluster of pathways that stood
   out from our analysis involves nonsense-mediated decay (NMD) activity.
   NMD is a mechanism of post-transcriptional gene regulation, whose main
   purpose is exerting quality control on the mRNA through the recognition
   of premature termination codons (PTC), that may be introduced because
   of genetic mutations, or errors occurring during transcription or
   splicing. Beyond quality control, NMD emerged also as a mechanism for
   fine-tuning the amount of certain proteins^[102]49. An example is
   represented by the regulation of selenocysteine-containing proteins
   (SePs), such as glutathione peroxidase 1 (Se-GPx1) abundance in
   response to a decrease in selenium (Se) concentrations via NMD
   recognition of a Sec TGA codon^[103]50. Indeed, among the pathways
   present in this highly interconnected cluster, two groups of proteins
   are involved in selenocysteine synthesis^[104]51. SePs are known to be
   oxidoreductases, using selenocysteine in their active site. Their role
   in malignancy progression may vary according to the stage: on one hand
   they can inhibit tumor development by dampening oxidative insults that
   could induce mutagenesis and genomic instability while, on the other,
   they could offer tumor cells a competitive advantage to oxidative
   stress and chemotherapeutics, at an advanced stage^[105]52. This may
   indicate that in the context of HGSOC, they could favor tumor
   progression. The last members of this supercluster are proteins
   involved in the Slit/Robo pathway. Slits are a family of secreted
   proteins, as they bind to the transmembrane Robo receptors, they
   activate a signalling pathway that regulates various physiological
   processes, such as neural axon guidance, angiogenesis, cellular
   proliferation and motility, thus making it worthwhile to lead future
   research toward investigating their role as new druggable targets for
   HGSOC^[106]53, [107]54. Conversely, Fig. [108]3C shows the pathways
   that are significantly less represented in tumor cells than expected in
   physiological conditions. The first recognizable cluster involves the
   immune response. The avoidance of immune destruction is one of the
   hallmarks of cancer and has always represented a hot topic for research
   since the discovery of immunotherapy focused on targeting immune
   checkpoints^[109]55. In particular, the central nodes are involved in
   the regulation of complement activation, suggesting that HGSOC cells
   counteract the complement activation also by downregulating proteins
   involved in its activation such as CR2^[110]56. The second cluster of
   Fig. [111]3C involves cell-substrate adhesion and extracellular matrix
   (ECM) organization. Under-representation of pathways related to
   adhesion is a characteristic of cancer cells, in fact, adhesion
   molecules not only maintain contact with other cells or the substrate
   but also play a role as signalling molecules for a variety of cellular
   functions, such as growth regulation and gene expression, moreover,
   loss of adhesion is related to the Epithelial-Mesenchymal Transition
   (EMT), which leads to cell migration and invasiveness^[112]57,[113]58.
   Here we found that proteases inhibitor-related pathways are
   significantly underrepresented. Proteases are enzymes that catalyze the
   hydrolysis of proteins, they take part in a plethora of physiological
   functions and their deregulation is associated with as many pathologies
   such as neurodegenerative disorders, inflammatory diseases,
   cardiovascular diseases and cancer^[114]59. Serpins, in particular, are
   serine protease inhibitors, regulating several biological activities,
   including coagulation, regulation of blood pressure, angiogenesis and
   hormone transport. Among the Serpins present in the nodes of our
   networks, Serpin B1, Serpin B5 and Serpin B9 have been found to be
   associated to tumor suppression and increased overall survival in
   Colorectal Cancer, suggesting that they could exert the same role also
   in HGSOC^[115]60–[116]62. The next cluster examined in Fig. [117]3C
   belongs to the pathways involved in the negative regulation of
   coagulation. Activated Protein C (APC) is One of the most recurrent
   proteins among the nodes, along with its interactors Thrombodulin (TM)
   and Endothelial Cell Protein C Receptor (EPCR). APC is a serine
   protease that acts as an anticoagulant by inhibiting thrombin formation
   when the latter is bound to TM. This function is enhanced by EPCR,
   which binds APC and presents it to the TM-Thrombin complex^[118]63. The
   role of these three proteins in tumorigenesis is supported by the
   observation that the decrease or loss in their expression is related to
   tumor progression and poor prognosis^[119]64. It is accepted that
   enhanced coagulation represents a risk factor for the development of
   metastasis, possibly due to the fact that thrombin may favor the
   adherence of cancer cells either to platelets and to endothelial
   cells^[120]65. Interestingly, pathways related to myogenesis and
   muscular contraction were also found significantly under-represented.
   Among the nodes, Dystrophin (DMD) and other muscular
   distrophy-associated proteins: dysferlin and calpain-3 are found
   ubiquitously. These proteins are well-known for their role in the
   Duchenne muscular dystrophy, however, a role in cancer pathogenesis is
   slowly emerging. In this respect, it has been observed that Duchenne
   muscular dystrophy mdx mouse model was prone to develop skeletal
   muscle-associated tumors and that the dystrophic muscle presented
   genomic instability in a tumor-like fashion both in the mouse model and
   in humans^[121]66. Furthermore, DMD has been found to be downregulated
   in several tumors affecting the nervous system, hematological
   malignancies, melanoma and carcinomas, including lung adenocarcinoma,
   prostate, colon and breast cancer^[122]67. Our results show that DMD
   has a strong negative correlation to the tumor phenotype (
   [MATH: <mrow><mo>-</mo><mn>0.75</mn></mrow> :MATH]
   ), thus suggesting that an altered DMD expression may play a relevant
   role in the pathogenesis of HGSOC. The last underrepresented pathway is
   the G Protein-coupled receptor (GPCR) signalling pathway. GPCRs are the
   largest family of transmembrane signal transduction proteins, involved
   in a variety of biological processes, ranging from neurotransmission to
   hormone release, tissue development and homeostasis. It is not
   surprising that their dysfunction leads to numerous diseases^[123]68.
   Among the GCPRs present in the nodes of our network, the most relevant
   are GNA13, GNAS, SHH, FZD3 and SMO. These proteins exhibit loss of
   function mutations in cancers such as diffused B-cell lymphoma,
   Burkitt’s Lymphoma and basal cell carcinoma^[124]69, suggesting a
   possible role as oncosuppressors also in HGSOC. Overall, this analysis
   offers a plausible overview of the relevantly deregulated pathways in
   HGSOC, with most the pathways already known to be related to tumor
   progression, and some that could represent new paths to explore, in
   order to dissect the mechanisms underlying this gynecological
   malignancy. Given these premises, it may be worth lead future
   researches on the emerged proteins and their link to HGSOC.

Table 4.

   Summary of the 100 top-most deregulated pathways, ranked by their NES
   values, selected from the pathways composing the Subnetwork in Fig.
   [125]3. Pathways are named according to their Gene Ontology name or
   their standard name. In the left column are listed the 50 pathways that
   are found to be less represented in HGSOC tumor biopsies, a lower NES
   score corresponds to a lower representation. The right column displays
   the 50 pathways that appear to be the most over represented. A higher
   NES score correspond to a higher over representation.
   Less represented pathways Over-represented pathways
   Pathway description NES Pathway description NES
   Regulation of vascular smooth muscle cell proliferation − 1.8195
   Pre-mRNA splicing 3.4016
   Positive regulation of phospholipid metabolic process − 1.818 mRNA
   Splicing 3.3727
   Neutrophil chemotaxis − 1.8175 Regulation of mRNA processing 3.3537
   Positive regulation of lipid transport − 1.8168 Cap-dependent
   translation initiation 3.2584
   Positive regulation of protein kinase B signaling − 1.8157 rRNA
   processing 3.2518
   IGF1R signaling cascade − 1.8154 rRNA processing in the nucleus and
   cytosol 3.2488
   Allograft rejection − 1.8151 Influenza viral RNA transcription and
   replication 3.2475
   Positive regulation of transporter activity − 1.8148 Influenza
   infection 3.2379
   PID_IFNG_PATHWAY − 1.8141 Major pathway of rRNA processing in the
   nucleolus and cytosol 3.2266
   BIOCARTA_BIOPEPTIDES_PATHWAY − 1.8141 L13a-mediated translational
   silencing of ceruloplasmin expression 3.2208
   Regulation of heart rate − 1.8134 Spliceosomal complex 3.2119
   Tertiary granule lumen − 1.8111 Viral gene expression 3.2043
   PID_CXCR4_PATHWAY − 1.8088 Eukaryotic translation initiation 3.1978
   Negative regulation of small molecule metabolic process − 1.8082 GTP
   hydrolysis and joining of the 60S ribosomal subunit 3.1919
   Negative regulation of cell-substrate adhesion − 1.8075 Regulation of
   mRNA splicing, via spliceosome 3.1886
   Regulation of glucose transmembrane transport − 1.8065 Cytosolic
   ribosome 3.1753
   Monocarboxylic acid transport − 1.8039 Ribosome 3.1709
   Positive regulation of cholesterol transport − 1.8038 Formation of a
   pool of free 40S subunits 3.1677
   Gastrin signaling pathway − 1.8037 Viral transcription 3.1615
   Activation of MAPKK activity − 1.8037 Ribosomal subunit 3.1467
   Cortical cytoskeleton − 1.8036 Structural constituent of ribosome
   3.1382
   Amine metabolic process − 1.8035 Eukaryotic translation elongation
   3.137
   Negative regulation of cell projection organization − 1.8027
   Translational initiation 3.1285
   PID_ERBB1_DOWNSTREAM_PATHWAY − 1.8018 Peptide chain elongation 3.1255
   Negative regulation of neuron projection development − 1.8012
   Regulation of RNA splicing 3.122
   IRS-related events triggered by IGF1R − 1.8001 SRP-dependent
   cotranslational protein targeting to membrane 3.1111
   Growth factor receptor binding − 1.7996 Nonsense mediated decay (NMD)
   independent of the exon junction complex (EJC) 3.1079
   Regulation of reactive oxygen species biosynthetic process − 1.799
   Viral mRNA translation 3.1065
   Neuronal system − 1.7989 Eukaryotic translation termination 3.0998
   Negative regulation of axonogenesis − 1.7965 HALLMARK_MYC_TARGETS_V1
   3.0941
   Opioid signalling − 1.7963 Response of EIF2AK4 (GCN2) to amino acid
   deficiency 3.0812
   Cell–cell adhesion via plasma-membrane adhesion molecules − 1.7957
   Protein targeting to ER 3.0792
   BIOCARTA_HER2_PATHWAY − 1.7956 Nonsense mediated decay (NMD) enhanced
   by the exon junction complex (EJC) 3.0764
   PID_ERBB1_RECEPTOR_PROXIMAL_PATHWAY − 1.795 Nonsense-mediated decay
   (NMD) 3.0737
   Phosphatidylinositol binding − 1.7946 Catalytic step 2 spliceosome
   3.072
   Phosphatidic acid biosynthetic process − 1.7934 Selenocysteine
   synthesis 3.0554
   Granulocyte chemotaxis − 1.7913 SRP-dependent cotranslational protein
   targeting to membrane 3.0479
   Regulation of blood vessel endothelial cell migration − 1.791
   Establishment of protein localization to endoplasmic reticulum 3.0463
   B cell receptor signaling pathway − 1.7905 Regulation of expression of
   SLITs and ROBOs 3.0373
   Monocarboxylic acid binding − 1.7896 Cotranslational protein targeting
   to membrane 3.0326
   Toll-like receptor cascades − 1.7875 Nuclear-transcribed mRNA catabolic
   process, nonsense-mediated decay 3.0094
   Regulation of calcium-mediated signaling − 1.7874 Regulation of
   alternative mRNA splicing, via spliceosome 3.0092
   Triglyceride metabolism − 1.7864 Selenoamino acid metabolism 2.972
   Multicellular organismal movement − 1.7857 Protein localization to
   endoplasmic reticulum 2.9604
   Hydrogen peroxide catabolic process − 1.7848 Ribonucleoprotein complex
   assembly 2.9379
   Negative regulation of cellular response to growth factor stimulus
   − 1.7846 Ribonucleoprotein complex subunit organization 2.9292
   Gamma carboxylation, hypusine formation and arylsulfatase activation
   − 1.7846 Activation of the mRNA upon binding of the cap-binding complex
   and eIFs, and subsequent binding to 43S 2.9227
   Regulation of sodium ion transport − 1.7843 rRNA processing 2.9199
   Detection of external stimulus − 1.7843 mRNA Processing 2.9151
   Regulation of Rho protein signal transduction − 1.7842 Translation
   initiation complex formation 2.8601
   [126]Open in a new tab

Decision support system based on three discriminating biomarkers

   As shown in Fig. [127]1, the step following Correlation Analysis
   consisted in a second feature selection method based on Relief
   algorithm. This allowed a further reduction and a list of the most
   important features ordered by importance score. The topmost 46 features
   were used as input to train and develop the highly discriminating
   Decision Support System, which is able to distinguish a tumor from a
   Non-Tumor patient based on the differential expression of three
   proteins: Topoisomerase 1 (TOP1), Protein Disulfide Isomerase Family A
   Member 4 (PDIA4) and Osteoglycin (OGN) ,as displayed in Fig. [128]2.
   Strikingly, as assessed in Test 1, the system showed 97.6% of
   specificity, 98.2% of sensitivity on the CPATC Ovarian Cancer
   Confirmatory Study Proteomic Dataset,with an F1 score of 98.8% for the
   tumor class and 93% for the fewer cases belonging to Non-Tumor class,
   while once tested on the second dataset (Test 2), it showed 97.2%
   sensitivity and 98.6% F1 score, thus eliminating the risk that the good
   performance was due to overfitting. Furthermore, these three proteins
   also appear to have a serum localization, thus making them ideal
   candidates, after clinical validation, for the development of
   non-invasive tests. The first biomarker is TOP1, one of the six human
   topoisomerases, whose function is to unwind negative DNA supercoilings
   occurring during the events of replication ^[129]70. TOP1 is also known
   to play a role in the maintenance of genomic integrity, in fact, a
   decrease in TOP1 activity, due to low expression or lack of recruitment
   to chromatin by SMARCA4, may result in DNA damage and genomic
   breaks^[130]71,[131]72. This is reflected by the upregulation of TOP1
   in cancer cells, which undergo through replicative and transcriptional
   stress^[132]73. Given this crucial role, there are several FDA-approved
   drugs targeting TOP1. The most famous are the camptothecin alkaloid
   derivatives, which act by binding at the interface between the DNA and
   the topoisomerase^[133]74. The second biomarker, PDIA4, is one of the
   largest member of the Protein Disulfide Isomerases family (PDIs), which
   are known to mediate protein folding via either the formation or the
   breakage of disulfide bonds^[134]75. Other than its protein folding
   function, exerted when located in the endoplasmic reticulum, PDIA4 can
   also be present on the surface of the platelet, where it participates
   in thrombus formation^[135]76. It has been observed to be
   over-expressed in a cohort of Epithelial Ovarian Cancer (EOC) patients,
   where it was associated with disease progression and poor
   prognosis^[136]77, potential mechanisms involve the inhibition of
   apoptosis emerged in another study, where the over-expression of PDIA4
   in tumor cells reduced caspase 3 and 7 activity favoring cell
   growth^[137]78, thus potentially enabling tumor resistance to
   therapy^[138]79. Lastly, OGN, a small leucine-rich proteoglycan (SLRP)
   protein. Its function is different in different cell types: in the
   extracellular compartment it is involved in collagen cross-linking,
   while in vascular smooth cells (VSMCs) and fibroblasts, a reduced
   expression leads to cellular proliferation. Its implications in tumor
   progression are quite recent but evident. For instance, OGN appears to
   be under the control of p53, and several studies show a reduction or
   lack of OGN expression in a variety of cancers, among which breast,
   colon, lung, ovarian and pancreatic cancer^[139]80. It has been
   observed in bladder cancer that ECRG4 promotes OGN expression by
   upregulating NFIC, preventing the activation of NF-KB downstream
   pathways, thus inhibiting cell proliferation and migration^[140]81.

   Furthermore, in breast cancer, OGN seems to reverse epithelial to
   mesenchymal transition by repressing the PI3K/Akt/mTOR axis^[141]82.
   Overall, the DSS managed to identify, among the HGSOC proteome, three
   proteins that are known to be linked to tumorigenesis. In addition, the
   high sensitivity and specificity of these biomarkers for the
   distinction between tumor and Non-Tumor patients, coupled with the fact
   that they also appear to be localized in the serum, is promising for
   their possible clinical use for the diagnosis of HGSOC. It’s worth
   noting that in our analysis seral biomarkers CA125 and HE4 were found
   to not correlate with Tumor phenotype, and were consequently dropped at
   the fist step of the pipeline. This prevented us from performing a
   proper comparison, since the lack of correlation implies that if we
   build a classifier using only these two proteins, this will be with any
   probability unable to distinguish Tumor from Non Tumor samples if
   applied to our datasets.

Conclusions

   To summarize, we provided a reliable overview of the most relevant
   deregulated pathways in HGSOC, focusing mainly on those genes that were
   not related directly to HGSOC before, thus providing novel associations
   and new starting points for future researches. Furthermore, we
   developed a Decision Support System able to find three possible
   Biomarkers for the diagnosis of HGSOC. These three proteins are
   ubiquitous and exert their primary function in physiological
   conditions. However, a role for TOP1 as an oncogene has been already
   strongly suggested, being found upregulated in different types of
   tumors, including breast, liver and colorectal
   cancers ^[142]83–[143]86. Indeed, several TOP1-targeting drugs have
   received FDA approval ^[144]74,[145]87,[146]88. The connection of PDIA4
   and OGN with tumor progression is relatively recent, PDIA4 has been
   found overexpressed in a cohort of EOC patients, and associated with
   poor prognosis, cell gowth and resistance. On the other hand, a
   decrease in OGN expression was found in different types of cancers.
   This is coherent with the results of our dataset analysis, in which we
   found they showed a strong correlation with the tumor phenotype, with
   TOP1 and PDIA4 positively correlating and OGN being negatively
   correlated. Furthermore, the predictive efficiency of this system in
   considerably high in both of the tested datasets. Notwithstanding,
   further validation is crucial to support this in silico results, and,
   for a possible clinical use, further studies are needed to assess if
   the proportions of these biomarkers are maintained in the serum as they
   are in HGSOC biopsies. Finally, once clinically and experimentally
   validated, this pipeline could be easily applied to other tumor
   datasets for the purpose of discovering novel biomarkers and clinical
   predictors.

Acknowledgements