Abstract PURPOSE Endometrial cancer (EC) is the most common gynecologic cancer in the United States with rising incidence and mortality. Despite optimal treatment, 15%-20% of all patients will recur. To better select patients for adjuvant therapy, it is important to accurately predict patients at risk for recurrence. Our objective was to train, validate, and test models of EC recurrence using lasso regression and other machine learning (ML) and deep learning (DL) analytics in a large, comprehensive data set. METHODS Data from patients with EC were downloaded from the Oncology Research Information Exchange Network database and stratified into low risk, The International Federation of Gynecology and Obstetrics (FIGO) grade 1 and 2, stage I (N = 329); high risk, or FIGO grade 3 or stages II, III, IV (N = 324); and nonendometrioid histology (N = 239) groups. Clinical, pathologic, genomic, and genetic data were used for the analysis. Genomic data included microRNA, long noncoding RNA, isoforms, and pseudogene expressions. Genetic variation included single-nucleotide variation (SNV) and copy-number variation (CNV). In the discovery phase, we selected variables informative for recurrence (P < .05), using univariate analyses of variance. Then, we trained, validated, and tested multivariate models using selected variables and lasso regression, MATLAB (ML), and TensorFlow (DL). RESULTS Recurrence clinic models for low-risk, high-risk, and high-risk nonendometrioid histology had AUCs of 56%, 70%, and 65%, respectively. For training, we selected models with AUC >80%: five for the low-risk group, 20 models for the high-risk group, and 20 for the nonendometrioid group. The two best low-risk models included clinical data and CNVs. For the high-risk group, three of the five best-performing models included pseudogene expression. For the nonendometrioid group, pseudogene expression and SNV were overrepresented in the best models. 
CONCLUSION Prediction models of EC recurrence built with ML and DL analytics had better performance than models with clinical and pathologic data alone. Prospective validation is required to determine clinical utility.

BACKGROUND

Endometrial cancer (EC) incidence and mortality continue to rise^1 despite advancements in adjuvant therapy over the past two decades, with a projected mortality increase of 55% by 2030.^2 In addition, important clinical trials have changed standards of treatment for low-risk and low-intermediate-risk EC (PORTEC-1 and GOG 99),^3,4 high-intermediate-risk EC (PORTEC-2 and ASTEC),^5,6 and high-risk EC (PORTEC-3, GOG 249, and GOG 258).^7-10 Furthermore, immunotherapy and targeted therapy have been introduced in advanced-stage and recurrent ECs with notable success (RUBY, GY-018, and DUO-E).^11-13 Regardless of these advances, treatment failure occurs in approximately 10%-15% of patients with early-stage EC. Although nonendometrioid EC types account for a disproportionately high number of EC recurrences and cancer-related deaths,^14 the majority of treatment failures and recurrences occur in endometrioid EC.^14,15 Thus, identifying patients who might benefit from additional surveillance and treatment to prevent recurrence and reduce mortality in EC would be of great value.

CONTEXT
* Key Objective
  To build and test models of endometrial cancer recurrence integrated with clinical/pathologic risk.
* Knowledge Generated
  Models of recurrence integrating clinical and genomic data performed well in each risk group (low, high, and nonendometrioid) of the original database, Oncology Research Information Exchange Network, a network of US cancer centers. Testing in an independent database (The Cancer Genome Atlas) had some limitations and performed worse.
* Relevance
  Integrating genomic, clinical, and pathologic data improved performance of models for EC recurrence.
Larger data sets with similar data are needed to externally validate these models.

Historical studies included EC clinical and pathologic characteristics to stratify risk for recurrence and to inform adjuvant treatment.^16,17 Since the publication of The Cancer Genome Atlas (TCGA) and description of specific molecular profiles for EC,^18 efforts have been directed to stratify EC treatment on the basis of these four profiles^19: (1) POLE-mut, characterized by mutation in DNA polymerase-ε; (2) mismatch repair (MMR) deficiency with functional loss of the MMR proteins, resulting in microsatellite instability (MSI); (3) TP53-abnormal (TP53abn); and (4) no specific molecular profile.^20 The new 2023 International Federation of Gynecology and Obstetrics (FIGO) EC staging took this molecular classification into consideration and added POLE-mut as a characteristic of good prognosis in early-stage EC, and TP53abn as a sign of worse prognosis in early-stage EC.^21 Models for EC recurrence integrating this new molecular classification and clinical data were superior to models with clinical data alone, with performances measured by the AUC over 70%^22 versus below 70%,^23 respectively. Unfortunately, these models were not validated in independent data sets,^22 nor are their performances ideal for use in a clinical setting. Trials on the basis of those molecular-clinical models for treatment selection (RAINBO and PORTEC-4a) will be mature by the end of 2028.^20,24 Therefore, there is room for improvement in EC recurrence prediction to better select patients for adjuvant treatment. In a preliminary pilot study, we identified several prediction models for EC recurrence using integration of clinical and genomic data.^25 Genomic data included gene, exon, long noncoding RNA (lncRNA), and microRNA expression (MIR), single-nucleotide variation (SNV) and copy-number variation (CNV), and structural variation.
The best-performing model had an AUC of 90% (95% CI, 75% to 100%) and was built with just five lncRNAs.^25 However, those models performed poorly when validated in an independent data set, the Oncology Research Information Exchange Network (ORIEN), with an AUC of 57% (95% CI, 51% to 63%), likely because the initial model was constructed with a limited data set with only seven reported cases of EC recurrence. To improve the performance and accuracy of these models, we need a larger database of EC, with more diverse cases and with larger representation of low-risk and high-risk endometrioid EC cases and nonendometrioid cases. The objective of this study is to train, validate, and test models of EC recurrence with lasso regression, other machine learning (ML), and deep learning (DL) analytics integrating clinical and genomic data from ORIEN, a large comprehensive EC database.^26,27

METHODS

Study Design

We performed a retrospective, multi-institution, case-control study with data originating from the ORIEN network EC data set. ORIEN is composed of multiple cancer centers that have agreed to use the same institutional review board–approved protocol and consent (Total Cancer Care Protocol) to follow patients throughout their lifetime.^26,27 A copy of the protocol is included in the Data Supplement. Patients consent to donate medical records and tissue specimens for molecular profiling, as an approach to improve design and performance of personalized cancer care. RNA and DNA were extracted from tumor specimens and processed to obtain the necessary genomic data.
The study analysis was carried out in two phases: (1) phase I: selection of the variables and groups of variables that were most informative for the outcome of interest, EC recurrence, using cross-validation and lasso regression; and (2) phase II: with the selected variables from phase I, we trained and validated models for EC recurrence using lasso regression, the MATLAB Classification Learner app, and TensorFlow analytics. Finally, we tested these models with lasso regression, MATLAB apps, and TensorFlow analytics in an independent EC data set, TCGA.

Patients' Inclusion and Clinical Data

All patients in the ORIEN database with EC, including all histologies, who had information about recurrent disease were included. Patients with EC recurrence (cases) were those in whom EC reappeared, either locally (vaginal), regionally (pelvis), or distally, after completion of treatment with no evidence of disease (NED). Index cases included women with a new EC event after treatment, those who had cancer at the last surveillance, or those who died from cancer. Controls were patients with NED during the whole follow-up. A total of 892 women with EC who had RNA and DNA sequenced and had recurrence information were included in this analysis, with an average follow-up of 31 months: 186 with EC recurrence (cases, average follow-up of 28 months) and 706 without (controls, average follow-up of 31 months). Clinical, pathologic, treatment, and molecular baseline characteristics of these patients are detailed in Table 1 (14 clinical-pathologic) and the Data Supplement (Table S1; 28 laboratory values). Included patients entered the ORIEN database between 2004 and 2021.

TABLE 1.
Patients' Baseline Characteristics

Characteristic | Low-Risk (FIGO stage I and grade 1 and 2) Endometrioid Type | | | High-Risk (FIGO stage II, III, IV or grade 3) Endometrioid Type | | | Nonendometrioid Type | |
 | Recurrent (n = 38) | Nonrecurrent (n = 291) | P | Recurrent (n = 69) | Nonrecurrent (n = 255) | P | Recurrent (n = 79) | Nonrecurrent (n = 160) | P
Age, years, average | 61 | 61 | .731 | 60 | 60 | .973 | 66 | 64 | .430
BMI, average | 38.7 | 36.3 | .224 | 33.6 | 35.5 | .203 | 31.9 | 32.3 | .800
Ethnicity | | | .412 | | | .082^a | | | .746
  Cuban | 0 | 2 | | 0 | 1 | | 0 | 0 |
  Hispanic | 1 | 9 | | 6 | 10 | | 4 | 9 |
  Mexican | 0 | 2 | | 3 | 5 | | 2 | 2 |
  Non-Hispanic | 36 | 260 | | 60 | 230 | | 72 | 145 |
  Puerto Rican | 0 | 4 | | 0 | 0 | | 1 | 1 |
Race | | | .885 | | | .775 | | | .986
  American Native | 0 | 0 | | 3 | 0 | | 0 | 0 |
  Asian | 1 | 6 | | 1 | 5 | | 1 | 6 |
  Black | 1 | 12 | | 3 | 10 | | 9 | 10 |
  Filipino | 0 | 1 | | 0 | 2 | | 0 | 1 |
  Islander | 1 | 0 | | 0 | 0 | | 0 | 0 |
  Other | 0 | 6 | | 0 | 6 | | 0 | 3 |
  White | 35 | 266 | | 62 | 232 | | 69 | 137 |
Smoking | | | .132 | | | .650 | | | .036^a
  Yes | 7 | 89 | | 41 | 156 | | 50 | 97 |
  No | 25 | 162 | | 18 | 79 | | 13 | 53 |
  Unknown | 6 | 40 | | 10 | 20 | | 16 | 10 |
Alcohol use | | | .217 | | | .322 | | | .945
  Yes | 12 | 119 | | 29 | 103 | | 33 | 77 |
  No | 20 | 123 | | 25 | 120 | | 26 | 62 |
  Unknown | 6 | 49 | | 15 | 32 | | 20 | 21 |
Personal history | | | | | | | | |
  Familial polyposis | 0 | 1 | .992 | 0 | 0 | NA | 0 | 0 | NA
  HPV | 0 | 3 | .991 | 2 | 4 | .670 | 1 | 2 | .949
  Hyperplasia | 3 | 31 | .323 | 1 | 19 | .053^a | 2 | 8 | .421
  CIN | 0 | 3 | .991 | 0 | 0 | NA | 0 | 1 | .987
  Lynch syndrome | 0 | 4 | .992 | 2 | 3 | .470 | 0 | 1 | .987
  Anemia | 3 | 34 | .509 | 16 | 37 | .060^a | 17 | 23 | .063^a
  COPD | 3 | 11 | .246 | 1 | 7 | .540 | 2 | 8 | .450
  DM | 4 | 44 | .928 | 12 | 40 | .752 | 9 | 6 | .018^a
  Heart disease | 0 | 1 | .995 | 3 | 9 | .701 | 6 | 10 | .468
  MI/CHF | 4 | 36 | .436 | 12 | 19 | .016^a | 8 | 20 | .771
  Stroke | 3 | 29 | .688 | 10 | 19 | .078^a | 8 | 20 | .771
  DVT | 2 | 32 | .296 | 16 | 27 | .004^a | 13 | 18 | .105
  PE | 3 | 24 | .370 | 13 | 20 | .005^a | 12 | 13 | .035^a
  Hyperlipidemia | 4 | 61 | .143 | 15 | 47 | .433 | 11 | 36 | .303
  Hypertension | 24 | 167 | .480 | 42 | 139 | .380 | 43 | 98 | .852
  Hypothyroidism | 7 | 53 | .928 | 6 | 39 | .197 | 15 | 38 | .827
  Pain | 11 | 67 | .907 | 14 | 56 | .888 | 28 | 37 | .005^a
Grade | | | .705 | | | .501 | | | .989
  1 | 17 | 188 | | 5 | 18 | | 0 | 2 |
  2 | 4 | 55 | | 8 | 24 | | 1 | 3 |
  3 | NA | NA | | 7 | 16 | | 24 | 41 |
  Undifferentiated | NA | NA | | NA | NA | | 7 | 26 |
FIGO stage | | | | | | | | |
  I | 38 | 291 | | 4 | 70 | Ref | 19 | 71 | Ref
  II | NA | NA | | 11 | 35 | .006^a | 7 | 17 | .406
  III | NA | NA | | 27 | 78 | .001^a | 27 | 45 | .023^a
  IV | NA | NA | | 19 | 17 | <.001^a | 22 | 15 | <.001^a
MMR | | | .581 | | | 1.000 | | | .519
  MMRp | 11 | 72 | | 10 | 47 | | 11 | 30 |
  MMRd | 9 | 45 | | 10 | 47 | | 14 | 28 |
  Unknown | 18 | 174 | | 49 | 161 | | 54 | 102 |
MI | | | .491 | | | .157 | | | .615
  <50% | 27 | 227 | | 2 | 53 | | 7 | 35 |
  ≥50% | 10 | 64 | | 2 | 12 | | 5 | 18 |
Adjuvant radiation (any type) | | | .671 | | | .199 | | | <.001^a
  No | 27 | 217 | | 42 | 133 | | 54 | 71 |
  Yes | 7 | 68 | | 27 | 122 | | 25 | 89 |
Initial chemotherapy | | | .164 | | | .161 | | | .315
  No | 30 | 273 | | 39 | 169 | | 42 | 74 |
  Yes | 4 | 16 | | 29 | 85 | | 37 | 86 |
Histology | | | | | | | | |
  Carcinoma | | | | | | | 5 | 26 | Ref
  Carcinosarcoma | | | | | | | 12 | 22 | .085^a
  Clear | | | | | | | 7 | 9 | .046^a
  Mixed | | | | | | | 11 | 52 | .872
  Mucinous | | | | | | | 1 | 2 | .469
  Serous | | | | | | | 43 | 49 | .004^a

NOTE. These are the baseline variables determined at treatment completion and included in the analysis.
Abbreviations: CIN, cervical intraepithelial neoplasia; COPD, chronic obstructive pulmonary disease; DM, diabetes mellitus; DVT, deep vein thrombosis; FIGO, The International Federation of Gynecology and Obstetrics; HPV, human papillomavirus; MI/CHF, myocardial infarction/congestive heart failure; MI, myometrial invasion; MMRd, mismatch repair deficient; MMRp, mismatch repair proficient; NA, not available; PE, pulmonary embolism; Ref, reference variable.
^a Statistically significant with P value <.05.

Patients with 2009 FIGO stage I and histologic grade 1 or 2 endometrioid EC had an overall recurrence rate of 11.6% (38/329) and were considered low risk for recurrence. Patients with a histologic grade 3 endometrioid EC or with FIGO stage II-IV had an overall recurrence rate of 21.3% (69/324) and were considered high risk for recurrence. Patients with nonendometrioid type EC (serous, carcinosarcoma, clear cell, undifferentiated, and mixed) had an overall recurrence rate of 33.1% (79/239) and were also considered high risk for recurrence. Given the different risks of recurrence for each group (different phenotype), we built a model of recurrence for each group.

Genomic Data

Data Preprocessing

Details about data preprocessing are found in the Data Supplement.
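The risk stratification described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the `Tumor` record, the `risk_group` function, and the string labels are hypothetical names introduced here, and the counts are copied from the recurrence rates reported in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tumor:
    # Hypothetical record; field names are assumptions for illustration.
    histology: str   # e.g. "endometrioid", "serous", "clear cell"
    figo_stage: int  # 2009 FIGO stage coded as 1..4
    grade: int       # histologic grade 1..3


def risk_group(t: Tumor) -> str:
    """Assign one of the three modeling groups described in the text."""
    if t.histology != "endometrioid":
        return "nonendometrioid"
    if t.figo_stage == 1 and t.grade in (1, 2):
        return "low-risk endometrioid"
    # Grade 3 or FIGO stage II-IV endometrioid tumors.
    return "high-risk endometrioid"


# (recurrent, total) per group, as reported in the text.
COUNTS = {
    "low-risk endometrioid": (38, 329),
    "high-risk endometrioid": (69, 324),
    "nonendometrioid": (79, 239),
}


def recurrence_rate_pct(group: str) -> float:
    """Recurrence rate of a group as a percentage, rounded to one decimal."""
    recurrent, total = COUNTS[group]
    return round(100 * recurrent / total, 1)
```

The per-group counts add up to the 186 cases and 892 patients reported for the whole cohort, which is a quick consistency check on the stratification.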
Genomic Variables

All genomic data were normalized and log2-transformed before analysis, including number of SNVs, CNVs, and fusion transcripts. For analysis and modeling, we included gene, gene isoform, MIR, lncRNA, pseudogene, and fusion transcript expressions, as well as SNV and CNV data (Table 2).

TABLE 2. Variable Selection and Variables After Prediction Model Construction With Type of Data

Type of Data | Baseline | Endometrioid, Low Risk | Endometrioid, High Risk | Nonendometrioid
Clinical^a | 42 | 1 | 6 | 4
SNV | 19,239 | 800 | 1,257 | 1,461
CNV | 23,445 | 5,311 | 7,052 | 4,234
Fusion | 10,942 | 2,430 | 2,681 | 5,464
Gene expression | 26,629 | 10,021 | 5,134 | 3,217
Isoforms expression | 61,427 | 18,472 | 15,268 | 6,404
MIR^a | 1,881 | - | - | -
LncRNA | 16,849 | 4,915 | 4,033 | 1,285
Pseudogenes | 15,250 | 5,247 | 3,053 | 1,171

NOTE. The baseline represents the initial number of variables for each type of data. After the selection with ANOVA (P value <.05), the most informative variables were kept for the multivariable lasso regression for all risk groups: low-risk and high-risk endometrioid EC, and nonendometrioid EC. Clinical baseline characteristics are detailed in Table 1 (14 clinical-pathologic) and the Data Supplement (Table S1; 28 laboratory values). The selected clinical variable for low-risk endometrioid EC was BMI; the selected clinical variables for high-risk endometrioid EC were FIGO stage, Hispanic ethnicity, radiation treatment after surgery (either brachytherapy or external-beam), albumin, bilirubin, and RDW; and the selected clinical variables for nonendometrioid EC were FIGO stage, histologic type (serous, clear cell, carcinosarcoma, undifferentiated or dedifferentiated, or mixed), radiation treatment after surgery (either brachytherapy or external-beam), and albumin.
Abbreviations: ANOVA, analysis of variance; CNV, copy-number variation; EC, endometrial cancer; FIGO, The International Federation of Gynecology and Obstetrics; lncRNA, long noncoding RNA; MIR, microRNA expression; RDW, RBC distribution width.
^a Lasso regression was performed directly with no preselection because of the smaller number of variables in clinical data and MIR.

Data Analysis and Modeling

Selection of Variables

Briefly, in the first phase of analysis, we selected the most informative variables for prediction of EC recurrence with analysis of variance (P < .05) and cross-validation with 10 replicates for each fold. Then, selected variables from the univariate analysis were incorporated into multivariate lasso regression prediction models of EC recurrence. Data types were progressively combined to create more complex prediction models. For details about selection of variables and training, validating, and testing of selection models, see the Data Supplement.

Training, Validating, and Testing EC Recurrence Models

Only variables included in models with an AUC >0.8 in phase I were brought forward to the second phase of analysis. Then, we trained, validated, and tested models including the selected variables from phase I, using lasso regression, other ML methods included in MATLAB apps, and DL (TensorFlow) analytics.

Testing of Prediction Models in an External Data Set

Additionally, we used the TCGA EC data set for external testing of the best prediction models for EC recurrence trained in the ORIEN set in phase II. The best prediction models of EC recurrence were tested with lasso regression, MATLAB, and TensorFlow, including TCGA data as the testing set.
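Because model selection and testing here rest on the AUC, and external testing also reports accuracy, a minimal sketch of both metrics may help. It is written in plain Python for illustration and is not the authors' R, MATLAB, or TensorFlow code. The function names, the 0.5 decision threshold, and the toy data are assumptions. The AUC is computed as the probability that a recurrent case receives a higher score than a control, with ties counted as one half.

```python
def auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct calls when scores >= threshold are called recurrences."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def carry_forward(phase1_aucs, cutoff=0.8):
    """Keep phase I models whose AUC exceeds the cutoff (AUC > 0.8 in the text)."""
    return {name: a for name, a in phase1_aucs.items() if a > cutoff}


# Toy data (assumed): 1 recurrence among 10 patients, and a model that
# assigns every patient the same low score.
toy_labels = [1] + [0] * 9
toy_scores = [0.1] * 10
toy_accuracy = accuracy(toy_labels, toy_scores)  # every patient called nonrecurrent
toy_auc = auc(toy_labels, toy_scores)            # no discrimination at all
```

With these toy values the constant model is right for 9 of 10 patients yet has an AUC of 0.5, which illustrates why accuracy alone can look good on data where recurrences are rare.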
Pathway Analysis

Pathway enrichment analysis for selected genes was performed in the R environment with the package clusterProfiler,^28,29 which interrogates the Kyoto Encyclopedia of Genes and Genomes database to identify overrepresented pathways given a gene set.^30

RESULTS

Selection of Variables

Recurrence models with only clinical data for low-risk endometrioid, high-risk endometrioid, and nonendometrioid EC had an AUC of 0.56, 0.81, and 0.65, respectively. For low-risk endometrioid EC, the only variable informative for recurrence was BMI (odds ratio [OR], 1.06; Table 2). For high-risk endometrioid EC, six clinical variables were included in the prediction model: FIGO stage (OR, 1.5), Hispanic ethnicity (OR, 1.5), radiation treatment after surgery (either brachytherapy or external-beam; OR, 0.69), and albumin (OR, 0.47), bilirubin (OR, 4.1), and RBC distribution width (RDW; OR, 0.96) values (Data Supplement, Table S1). For nonendometrioid type, FIGO stage (OR, 1.59), histologic type (serous, clear cell, carcinosarcoma, undifferentiated or dedifferentiated, or mixed; OR, 1.02), radiation treatment after surgery (either brachytherapy or external-beam; OR, 0.88), and albumin values (OR, 0.53) were in the clinical model. Almost two thirds of MMR information was missing from the database (Table 1), and even more MSI testing was missing, probably because more than a fourth of specimens were collected before the manuscript with TCGA data was published in 2013.^18 Missing MMR and MSI information made it impractical to add them to the models. We assessed all SNVs of the genes included in the TCGA classification: POLE, TP53, and the MMR/MSI genes MLH1, MSH2, MSH6, and PMS2 (Data Supplement, Table S2), although none of the variants of these genes were present in any well-performing model of EC recurrence for any risk level. Initial models included only significant variables from each data type.
Data types were later combined to create more complex models. For the training, validation, and testing phases, we only selected those variables included in models that had an AUC ≥0.8. Figure 1 details the composition of those models for the different risk groups. These best models with combinations of different data types were used in the second phase for training, validation, and testing of final prediction models for EC recurrence.

FIG 1. Selection of best models of EC recurrence after combination of data types. EC recurrence models for all risk groups with performances ≥0.8 measured by the AUC. The three panels represent risk-based groups: (A) Low-risk endometrioid EC best models (blue); (B) high-risk endometrioid EC best models (orange); and (C) nonendometrioid group best models (red). Different performances on all three panels are displayed in ascending order. The x-axis is AUC as a percentage (0%-100%). The red error mark displays the 95% CI. Overall, over 300 models with different combinations of data types were tested. We only displayed the best (A) five models for low-risk endometrioid EC, (B) 19 models for high-risk endometrioid EC, and (C) 20 for nonendometrioid EC. Genomic variation: CNV, copy-number variation; EC, endometrial cancer; SNV, single-nucleotide variation. Transcriptome: FUS, fusion transcript expression; ISO, gene isoform expression; LNC, long noncoding RNA expression; MIR, microRNA expression; mRNA, gene expression; PSE, pseudogene expression.

Training, Validation, and Testing Models for EC Recurrence

We built, validated, and tested models with the selected features from phase I. First, we trained and validated models (cross-validation) with selected variables using lasso regression. Additionally, we trained, validated, and tested models using selected variables in phase I from the ORIEN database in two analytical platforms: MATLAB (ML) and TensorFlow (DL; Table 3).
In Table 3, we included only the best-performing model of the 35 possible results from MATLAB. For low-risk endometrioid EC, the only clinical variable resulting from the analysis was BMI. For high-risk endometrioid EC, three of the five best-performing models included pseudogene expression. Some pseudogenes and SNVs were predominant in the best-performing models after validation and testing for nonendometrioid EC. Details of validation and training for some of these models are shown in the Data Supplement (Fig S3). All variables included in the best prediction models of EC for every risk group are detailed in the Data Supplement (Figs S4-S6 and Tables S7-S9). Pathway enrichment analysis results are also detailed in the Data Supplement (Fig S10).

TABLE 3. Validation and Testing of Best Prediction Models

Risk Groups | Variables (No.) | Variables (data types) | Lasso Validation AUC | Lasso Validation 95% CI | MATLAB Validation AUC | MATLAB Testing AUC | TensorFlow Validation AUC | TensorFlow Testing AUC
Low-risk endometrioid | 2 | Clinic + CNV | 0.97 | 0.95 to 0.99 | 0.87 | 0.98 | 0.99 | 0.91
 | 2 | CNV + MIR | 0.90 | 0.82 to 0.97 | 0.88 | 0.98 | 0.99 | 0.95
 | 3 | Clinic + PSE + ISO | 0.95 | 0.90 to 1.00 | 0.96 | 0.99 | 0.99 | 0.95
High-risk endometrioid | 1 | PSE | 0.92 | 0.87 to 0.98 | 0.85 | 0.94 | 0.99 | 0.97
 | 1 | SNV | 0.97 | 0.94 to 0.99 | 0.88 | 0.95 | 0.99 | 0.96
 | 3 | MIR + PSE + mRNA | 0.96 | 0.94 to 0.99 | 0.90 | 0.91 | 1.00 | 0.93
 | 3 | Clinic + PSE + fusion | 0.95 | 0.88 to 1.02 | 0.91 | 1.00 | 0.94 | 0.92
 | 3 | SNV + LNC + ISO | 0.98 | 0.97 to 0.99 | 0.94 | 0.97 | 1.00 | 0.94
High-risk nonendometrioid | 2 | SNV + fusion | 0.92 | 0.88 to 0.96 | 0.97 | 0.88 | 0.99 | 0.91
 | 2 | MIR + PSE | 0.95 | 0.93 to 0.98 | 0.92 | 0.95 | 1.00 | 0.98
 | 2 | SNV + PSE | 0.96 | 0.94 to 0.99 | 0.95 | 0.98 | 0.99 | 0.94
 | 3 | Clinic + ISO + mRNA | 0.98 | 0.97 to 1.00 | 0.91 | 1.00 | 0.99 | 1.00
 | 3 | Clinic + ISO + SNV | 0.93 | 0.87 to 0.99 | 0.95 | 0.96 | 1.00 | 1.00
 | 3 | CNV + PSE + SNV | 0.96 | 0.94 to 0.99 | 0.93 | 0.97 | 0.99 | 0.94
 | 3 | SNV + PSE + MIR | 0.96 | 0.94 to 0.99 | 0.93 | 1.00 | 1.00 | 0.97

NOTE. Validation of best models of EC recurrence on the basis of risk classification.
The initial model was built and validated with cross-validation with a lasso regression in an R environment (left side of the table). Validation and testing were performed in two analytical platforms (right side of the table): MATLAB (ML) and TensorFlow (DL). The upper part of the table has patients with low-risk endometrioid EC: two of the best models include clinical data and CNVs. The only resulting variable for clinical data is BMI, other variables are not informative for recurrence in this risk group. The middle part of the table has patients with high-risk endometrioid EC: three of the five best-performing models include PSE. The lower part of the table has patients with nonendometrioid EC: PSE and SNV were overrepresented in best performance models. Abbreviations: CNV, copy-number variation; DL, deep learning; EC, endometrial cancer; FUS, fusion transcript expression; ISO, gene isoform expression; LNC, long noncoding RNA expression; MIR, microRNA expression; ML, machine learning; mRNA, gene expression; PSE, pseudogene expression; SNV, single-nucleotide variation. External Testing of Models for EC Recurrence We evaluated some of the best-performing models of endometrioid EC recurrence in TCGA data. TCGA endometrioid EC data set clinical characteristics that were found to be informative in the clinical model of recurrence for the ORIEN endometrioid data set are described in the Data Supplement (Table S11). There were some variables included in the high-risk endometrioid EC clinical model of ORIEN that were not available in TCGA, such as laboratory information, specifically albumin, bilirubin, and RDW values. The nonendometrioid data from ORIEN had more diverse histologic types than just serous cancers, so we considered that it was more problematic to evaluate those models in TCGA. After downloading and preprocessing TCGA data set as we did with ORIEN, we selected those variables included in EC endometrioid best-performing models. 
Unfortunately, in half of the tested models, some variables were missing in TCGA; therefore, some modifications had to be made to the original ORIEN model to allow for testing. These modifications, or relearned models, were made on the Clinic + PSE + ISO (clinical, pseudogene, and isoform expressions) model for the low-risk set, and the SNV, SNV + LNC + ISO (SNV, lncRNA, and isoform expressions), MIR + PSE + mRNA (miRNA, pseudogene, and gene expressions), and Clinic + PSE + FUS (clinical, pseudogene, and fusion transcript expressions) models for the high-risk set, and are marked in Table 4 (footnote a). Testing of ORIEN models of endometrioid EC recurrence in TCGA data had good accuracy but poor AUCs (Table 4) for both analytical platforms (ML and DL). This is most likely due to the unbalanced data: recurrences account for <10% of all samples in the low-risk group and <19% in the high-risk group. Details about some of these results are depicted in the Data Supplement (Fig S12). Additionally, other factors may have influenced the testing performance in TCGA EC data: (1) missing variables with respect to the original ORIEN model; (2) more reported recurrences in ORIEN versus TCGA in each risk group (low-risk: 12% versus 10%; high-risk: 21% versus 18%, respectively); although these differences were not statistically significant (chi-square P value >.05), underreported outcomes (recurrences) may have detrimental effects on prediction performance; and (3) 61% (252 of 411) of all EC TCGA samples were collected and processed before 2010, in comparison with only 7% (52 of 708) of all ORIEN EC samples.

TABLE 4.
External Testing of Best Prediction Models in TCGA Endometrioid EC Dataset

Risk Groups | Variables (No.) | Variables (data types) | Lasso Validation AUC (95% CI) | Lasso Testing AUC (95% CI) | MATLAB Testing Accuracy | MATLAB Testing AUC | TensorFlow Testing Accuracy | TensorFlow Testing AUC
Low-risk endometrioid | 2 | Clinic + CNV | 0.97 (0.95 to 0.99) | 0.58 (0.43 to 0.72) | 0.90 | 0.54 | 0.91 | 0.49
 | 2 | CNV + MIR | 0.90 (0.82 to 0.97) | 0.52 (0.36 to 0.67) | 0.90 | 0.54 | 0.86 | 0.42
 | 3 | Clinic + PSE + ISO^a | 0.95 (0.90 to 1.00) | 0.54 (0.37 to 0.71) | 0.90 | 0.54 | 0.90 | 0.50
High-risk endometrioid | 1 | PSE | 0.92 (0.87 to 0.98) | 0.52 (0.42 to 0.62) | 0.83 | 0.53 | 0.69 | 0.52
 | 1 | SNV^a | 0.97 (0.94 to 0.99) | 0.50 (0.50 to 0.50) | 0.82 | 0.50 | 0.82 | 0.50
 | 3 | MIR + PSE + mRNA^a | 0.93 (0.87 to 0.99) | 0.53 (0.44 to 0.63) | 0.73 | 0.54 | 0.18 | 0.50
 | 3 | Clinic + PSE + FUS^a | 0.95 (0.88 to 1.02) | 0.54 (0.45 to 0.64) | 0.82 | 0.54 | 0.35 | 0.55
 | 3 | SNV + LNC + ISO^a | 0.98 (0.97 to 0.99) | 0.50 (0.50 to 0.50) | 0.82 | 0.52 | 0.18 | 0.50

NOTE. Evaluation (testing) of best models of endometrioid EC recurrence in TCGA endometrioid EC data set by risk classification. The initial model was built and validated (cross-validation) with a lasso regression in an R environment on the ORIEN EC data set (left side of the table). This ORIEN model was validated in TCGA data (lasso testing). Additional testing was performed on two analytical platforms (right side of the table): MATLAB (ML) and TensorFlow (DL). Performance was measured in terms of AUC and accuracy. The upper part of the table has patients with low-risk endometrioid EC: two of the best models include clinical data and CNVs. The lower part of the table has patients with high-risk endometrioid EC: three of the five best-performing models include PSE.
Abbreviations: CNV, copy-number variation; DL, deep learning; EC, endometrial cancer; FUS, fusion transcript expression; ISO, gene isoform expression; LNC, long noncoding RNA expression; MIR, microRNA expression; ML, machine learning; mRNA, gene expression; PSE, pseudogene expression; SNV, single-nucleotide variation; TCGA, The Cancer Genome Atlas.
^a When TCGA did not have all the original clinical/genomic data, the ORIEN prediction models had to be relearned first with the data available in TCGA, and then were tested in TCGA data with different analytics platforms (lasso, MATLAB, and TensorFlow).

DISCUSSION

In this study, we trained, validated, and tested models for EC recurrence stratified by risk factors. Risk factors were based on historical clinical-pathologic characteristics that have been used over the past 40 years to determine adjuvant treatment for EC.^3-5,8 The resulting prediction models were tested in a subset of the ORIEN database and were found to have excellent performances, on the basis of their AUC. ORIEN is one of the largest databases of EC clinical and genomic information maintained prospectively by a network of academic institutions.^26,27,31 Accordingly, clinical information and surveillance are optimized. Additionally, we tried to evaluate these EC recurrence models in TCGA clinical-genomic data. Unfortunately, not all data were available for testing, so some compromises had to be made. Also, there was a concern about recurrence reporting: recurrence in early-stage, low-risk EC is a rare event and missed reporting may result in misclassification. In higher-risk EC, recurrence is more frequent, but there was still less reporting of disease relapse in the TCGA data set than in ORIEN's network data.
Furthermore, almost two thirds of specimen collection, processing, and analysis for TCGA was performed before 2010 with older technology and shorter reads (50mers v 100-150mers), in comparison with only 7% for ORIEN, which may have affected overlapping sequence reading and counts for fusion transcripts, CNV, and other somatic structural variations.^32 All these factors lead us to conclude that ORIEN-trained models tested in TCGA data may have had conflicting performances. Models of EC prediction with only clinical-pathologic data performed similarly to previous historical models.^16,17,22,23 What was unique about our study is that we separated model building by EC risk group, so that variables whose prediction potential could have been diluted in the whole data set showed prediction capabilities for individual risk groups. For example, for the low-risk endometrioid EC group, which is the most common of all EC groups (58% in our database), grade and myometrial invasion did not show any prediction potential and only BMI performed fairly in predicting EC recurrence. The American Cancer Society recently projected that EC will surpass ovarian cancer in mortality this year,^2 and this seems to be driven by the increasing incidence of high-risk histologic subtypes accounting for a disproportionate number of EC deaths.^33 This has to be coupled with an obesity epidemic with links to cancer incidence and mortality.^34,35 In clinical-pathologic models for higher-risk groups, both endometrioid and nonendometrioid, FIGO stage and radiation after surgery were predictors of recurrence: the higher the stage, the higher the risk for recurrence, and radiation protected from recurrence. Notably, in high-risk endometrioid type, Hispanic ethnicity conferred higher risk for recurrence, but race was not a factor.
Black women tend to have more nonendometrioid EC types and a higher EC mortality rate than White women.^36 However, in our study, Black race did not confer more risk for recurrence in these more aggressive EC types, either endometrioid or nonendometrioid. This could be due to the relatively low number of Black women included, only 5% of the total. Administration of initial chemotherapy was not a predictor of recurrence, probably because the distribution of women who received chemotherapy was similar for each risk group (Table 1). Other laboratory values were associated with high-risk endometrioid EC recurrence: increasing levels of albumin and RDW conferred less risk for recurrence, while elevated levels of bilirubin increased it. Likewise, increasing levels of albumin were protective against recurrence for nonendometrioid EC. Low serum albumin (hypoalbuminemia) has been considered a marker for illness severity and has been incorporated in several prognostic scores such as the Acute Physiology and Chronic Health Evaluation score, Child's classification in patients with liver cirrhosis, and the Glasgow Prognostic Score.^37 Additionally, hypoalbuminemia has been associated with poor prognosis in ovarian cancer^38 and EC.^37 Therefore, it is plausible that decreasing levels of albumin confer higher risk of disease recurrence in more aggressive types of EC, both endometrioid and nonendometrioid. The integrated genomic characterization of EC by TCGA represented a shift in EC tumor classification.^18 The initial TCGA classification resulted from clustering results from gene copy number, whole exome sequencing (WES) of 248 tumor-normal pairs, MSI status, RNA expression, protein expression, and DNA methylation analyses.
Clustering, an unsupervised learning approach, is a useful method to identify underlying groups on the basis of the available data, particularly when there is no previous knowledge about grouping.^[158]39 One limitation of clustering algorithms is overlap between groups with similar data points even when they are of a different class.^[159]40 Using methods available in clinical practice, investigators were able to refine the four TCGA molecular subgroups with surrogate markers (p53 abnormalities, MSI, and POLE mutations), resulting in a classification tool.^[160]22,[161]41 Models for EC recurrence that integrated these clinical and molecular markers had an AUC around 0.7, without external validation.^[162]22 In our study, we trained, validated, and tested integrated models of EC recurrence with performance superior to that of models using TCGA molecular surrogates. Previously described POLE pathologic somatic mutations (exons 9, 13, and 14)^[163]42 had low incidence in both recurrent and nonrecurrent ECs (Data Supplement, Table S2), with no statistically significant differences between the two within EC risk groups. When all risk groups were taken together, there were 34 somatic variants in nonrecurrent EC (of 672) and only two in recurrent cases (of 184), with a chi-square P value = .022. There were cases with recurrent EC (even in the low-risk group) that had POLE mutations. As larger data sets of SNV from WES or whole genome sequencing become available, we will be able to assess the real frequency of that genomic alteration in all EC groups and its association with recurrence. In our analysis, we did not find POLE somatic variation to be a predictor of recurrence for any of the risk groups, including the low-risk group. Similarly, TP53 variation was not a predictor of EC recurrence in any of the groups, including the nonendometrioid, despite significantly more recurrent cases having mutations other than p.P72R (Data Supplement, Table S2). 
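The pooled POLE comparison above (34 variants among 672 nonrecurrent cases v two among 184 recurrent cases) can be checked with a chi-square test on a 2 × 2 table. The following is a minimal sketch, assuming that each variant corresponds to one distinct case, which the text does not state. The function name `chi_square_2x2` is hypothetical, and the result depends on whether a continuity correction is applied, so it may not exactly reproduce the reported P value.

```python
import math


def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square test for the 2 x 2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom, the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    stat = 0.0
    for obs, exp in zip(observed, expected):
        diff = abs(obs - exp)
        if yates:
            # Yates continuity correction: shrink each deviation by 0.5.
            diff = max(diff - 0.5, 0.0)
        stat += diff ** 2 / exp
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Assumption: one POLE variant per case, so the counts form a case-level table.
nonrecurrent_with, nonrecurrent_total = 34, 672
recurrent_with, recurrent_total = 2, 184

table = (
    nonrecurrent_with, nonrecurrent_total - nonrecurrent_with,
    recurrent_with, recurrent_total - recurrent_with,
)

stat_plain, p_plain = chi_square_2x2(*table)
stat_yates, p_yates = chi_square_2x2(*table, yates=True)
print(f"uncorrected: chi2={stat_plain:.3f}, P={p_plain:.4f}")
print(f"Yates:       chi2={stat_yates:.3f}, P={p_yates:.4f}")
```

Because one expected cell count is below 10, an exact test might also be considered for a table this sparse; that choice is outside what the text reports.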
Our interpretation is that TP53 SNVs are so prevalent in nonendometrioid types (including serous) that they do not discriminate well which samples are at risk for recurrence. Nor was variation in the genes involved in MMR a predictor of EC recurrence in any group. All these results indicate that not all molecular characteristics associated with prognosis in EC are necessarily good classifiers or predictors of EC recurrence. The best-performing models for low-risk endometrioid EC recurrence included altered CNV in some lncRNAs. Most of them were protective against disease relapse. In previous analyses, with less precise definition of outcome phenotypes and smaller sample size, we also detected lncRNAs as important variables in predicting EC recurrence.^[164]25 In this study, we identified some ncRNAs with altered copy number that were part of the DNA repair mechanism and conferred protection against EC recurrence in low-risk EC. The association between DNA damage, DNA repair, and cancer is well known and is the basis for novel therapies, such as poly(ADP-ribose) polymerase inhibitors, checkpoint inhibitors, and even immunotherapies.^[165]43 The DNA damage response (DDR) coordinates DNA repair through a complex network of cellular pathways. Genes encoding DDR factors are frequently mutated in cancer, causing genomic instability.^[166]44 In some cancers, such as colorectal cancer, ncRNAs have been associated with prognosis, cancer progression, or suppression.^[167]45 For example, LINC00905 has been associated with worse recurrence in cervical cancer,^[168]46 LINC00847 was associated with worse prognosis in pancreatic cancer,^[169]46 ZNF674-AS1 may inhibit migration and invasion in lung cancer,^[170]47 and TPRG1-AS1 inhibits liver cancer progression.^[171]48 It seems that the last two effects on cancer progression were mediated by interactions with MIRs. 
Interactions between ncRNAs, DDR mechanisms, and disease relapse in low-risk EC must be elucidated before we can tap into their potential for treatment targeting. High-risk endometrioid and nonendometrioid ECs had several common variables that were included in the best-performing prediction models of recurrence. The majority of those were pseudogenes, but there were also two SNVs, in SAMM50 and SELENOH, and the pathway analysis pointed to an overrepresentation of the mitophagy machinery. Mitophagy is a specialized form of autophagy that plays a significant role in the occurrence and development of cancers.^[172]49 In EC, mitophagy activity is closely associated with tumor cell metabolism, proliferation, survival, and resistance to treatment.^[173]50 Additionally, pseudogene expression alone can accurately classify the major histologic subtypes of EC.^[174]51 Pseudogenes are evolutionary relics present in the genomes of a wide variety of species, and recent multiomics studies have determined that dysregulation of many pseudogenes is associated with relapse of disease in diverse cancer types.^[175]52 One of the strengths of this study is that it was performed on the ORIEN network EC data set, a prospectively collected database from US academic institutions with comprehensive clinical, pathologic, and genomic data. Despite the prospective collection of the database, this was an observational, case-control study, with limitations inherent to this type of design. For example, MMR and MSI status was only partially available and thus not useful for modeling. However, we took advantage of the prospective nature of the data collection and outcome surveillance, including disease relapse. 
Additionally, all genomics analyses were performed uniformly, following best-practice analytics and National Cancer Institute analysis recommendations.^[176]53 We grouped patients on the basis of classic characteristics of risk so that, for model training, we had homogeneous phenotypes of EC recurrence: low-risk, high-risk, and nonendometrioid groups. All models were trained with cross-validation and then tested on samples that were left out of the initial training. Additionally, we performed external validation of the best-performing models in an independent data set, TCGA. TCGA validation had some limitations because of potential disease relapse underreporting and surveillance shortcomings, the historical nature of the database collection that limited some analytics, and potential differences in genetic background between the two populations, ORIEN and TCGA.^[177]54 To avoid overfitting, we performed cross-validation in the discovery phase as well as in the training of models and left out samples for further testing. However, the discovery phase and model training were performed on the same data, which could lead to overfitting. To better evaluate the clinical value of these prediction models, we will need to perform prospective evaluation with independent EC data collected from collaborative institutions, like the ORIEN network. Other data could be included in these models in the future to improve their performance if external testing is disappointing. Artificial intelligence analysis of histopathology slides and their association with outcome prediction is evolving rapidly,^[178]55 and some DL models trained on slides to predict outcomes are promising.^[179]56 Additionally, we could create multimodal models, integrating DL models from tabular and image data, to create more robust and better-performing models for EC recurrence. 
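One way to reduce the overfitting risk noted above, where variable selection and model training use the same data, is to repeat the selection inside each cross-validation fold so that held-out samples never influence which variables are kept. The sketch below is illustrative only and is not the pipeline used in this study: it runs on synthetic random data, scores variables with a simple difference in class means rather than an analysis of variance, and all function names (`kfold_indices`, `select_top_features`, `selection_per_fold`) are hypothetical.

```python
import random


def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices once and deal them into n_folds test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def select_top_features(X, y, rows, n_keep):
    """Rank features by absolute difference in class means, using only `rows`.

    This is a simple stand-in for a univariate test, not the study's method.
    """
    pos = [r for r in rows if y[r] == 1]
    neg = [r for r in rows if y[r] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[r][j] for r in pos) / len(pos)
        mean_neg = sum(X[r][j] for r in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:n_keep])


def selection_per_fold(X, y, n_folds=5, n_keep=3, seed=0):
    """For each fold, select features from the training rows only."""
    folds = kfold_indices(len(X), n_folds, seed)
    results = []
    for k, test_rows in enumerate(folds):
        train_rows = [r for f, fold in enumerate(folds) if f != k for r in fold]
        kept = select_top_features(X, y, train_rows, n_keep)
        results.append((train_rows, test_rows, kept))
    return results


# Synthetic data (assumption): 60 samples, 20 pure-noise features, and
# alternating 0/1 recurrence labels, so no feature is truly informative.
rng = random.Random(42)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(60)]

splits = selection_per_fold(X, y)
for fold_number, (train_rows, test_rows, kept) in enumerate(splits):
    print(f"fold {fold_number}: train={len(train_rows)} "
          f"test={len(test_rows)} selected={kept}")
```

With noise-only features, the selected set typically changes from fold to fold, which shows why variables chosen on the full data set can make performance on those same samples look better than it is.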
In summary, training, validating, and testing models of EC recurrence in a comprehensive database from the ORIEN network resulted in excellent-performing models that, after prospective evaluation, could help to assess which patients are at risk of relapse and are potential candidates for clinical trials.
DISCLAIMER The contents of this publication are the sole responsibility of the authors and do not necessarily reflect the views, assertions, opinions, or policies of the Uniformed Services University of the Health Sciences, the Henry M. Jackson Foundation for the Advancement of Military Medicine, Inc, the Department of Defense, or the Departments of the Army, Navy, or Air Force. Mention of trade names, commercial products, or organizations does not imply endorsement by the US government. SUPPORT Supported in part by the NIH 5R01CA99908-18 (K. Leslie PI), and by the Research Fund of the Gynecologic Oncology Division of the University of Iowa Hospitals and Clinics, and supported in part by the American Association of Obstetricians and Gynecologists Foundation (AAOGF) Bridge Funding Award. 
DATA SHARING STATEMENT A data sharing statement provided by the authors is available with this article at DOI [180]https://doi.org/10.1200/PO-24-00859. AUTHOR CONTRIBUTIONS Conception and design: Jesus Gonzalez Bosquet, Rob L. Dood, Vincent M. Wagner Financial support: Jesus Gonzalez Bosquet Administrative support: Jesus Gonzalez Bosquet, Michelle Churchman Provision of study materials or patients: Erin George, Casey M. Cosgrove, Kathleen Darcy, Lisa Landrum, Rob J. Rounbehler, Michelle Churchman Collection and assembly of data: Erin George, Ahmad A. Tarhini, Casey M. Cosgrove, Bodour Salhia, Lauren E. Dockery, Stephen B. Edge, Lisa Landrum, Rob J. Rounbehler, Michelle Churchman Data analysis and interpretation: Jesus Gonzalez Bosquet, Andrew Polio, Erin George, Ahmad A. Tarhini, Marilyn S. Huang, Bradley Corr, Aliza L. Leiser, Kathleen Darcy, Christopher M. Tarney, Rob L. Dood, Michael J. Cavnar Manuscript writing: All authors Final approval of manuscript: All authors Accountable for all aspects of the work: All authors AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to [181]www.asco.org/rwc or [182]ascopubs.org/po/author-center. Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians ([183]Open Payments). Erin George Consulting or Advisory Role: Incyclix Bio Research Funding: Merck Serono Ahmad A. 
Tarhini Consulting or Advisory Role: Bristol Myers Squibb, Merck, Genentech/Roche, Novartis, Sanofi/Regeneron, Partner Therapeutics, Clinigen Group, Eisai, Bayer, Instil Bio, ConcertAI, BioNTech, AstraZeneca, Nested Research Funding: Bristol Myers Squibb (Inst), Merck (Inst), Genentech/Roche (Inst), OncoSec (Inst), Sanofi/Regeneron (Inst), Clinigen Group (Inst), InflaRx (Inst), Acrotech Biopharma (Inst), Pfizer (Inst), Agenus (Inst), Scholar Rock (Inst), Agenus (Inst) Casey M. Cosgrove Honoraria: UpToDate, GOG Foundation, Immunogen Consulting or Advisory Role: GlaxoSmithKline, AstraZeneca, Imvax, Intuitive Surgical Research Funding: GlaxoSmithKline, Regeneron Marilyn S. Huang Honoraria: Intersphere MJH Consulting or Advisory Role: Tesaro, Seagen, Aptitude Health, Agenus, Cooper Surgical, touchIME, FLASCO, Eisai, Immunogen, Aspira Women's Health, Clovis Oncology, Curio Science, VBL Therapeutics, Swedish Cancer Center, Voluntis, Seagen, IntegrityCE, Elsevier, MJH Healthcare Holdings, LLC, Natera, Immunogen, Merck, AstraZeneca, Pfizer, AbbVie Research Funding: Merck (Inst) Bradley Corr Consulting or Advisory Role: GlaxoSmithKline (Inst), Merck (Inst), AstraZeneca/Merck (Inst), Immunogen, Imvax, Gilead Sciences (Inst), Corcept Therapeutics (Inst), Zentalis (Inst) Research Funding: Clovis Oncology (Inst), Immunogen (Inst) Bodour Salhia Leadership: CpG Diagnostics Stock and Other Ownership Interests: CpG Diagnostics Inc Consulting or Advisory Role: AstraZeneca Patents, Royalties, Other Intellectual Property: Patents filed and pending at University of Southern California Travel, Accommodations, Expenses: CpG Diagnostics In Stephen B. Edge Honoraria: North American Center for Continuing Medical Education Lisa Landrum Consulting or Advisory Role: GlaxoSmithKline No other potential conflicts of interest were reported. REFERENCES