Abstract

The aim of this study was to use whole‐exome sequencing to derive a molecular classifier for nasopharyngeal carcinoma (NPC) and evaluate its clinical performance. We performed whole‐exome sequencing on 82 primary NPC tumors from Sun Yat‐sen University Cancer Center (Guangzhou cohort) to obtain somatic single‐nucleotide variants, indels, and copy number variants. A novel molecular classifier was then developed and validated in another NPC cohort (Hong Kong cohort, n = 99). Survival was estimated by the Kaplan‐Meier method and compared using the log‐rank test. A Cox proportional hazards model was used for univariate and multivariate analyses. We identified three prominent NPC genetic subtypes: RAS/PI3K/AKT (based on RAS, AKT1, and PIK3CA mutations), cell‐cycle (based on CDKN2A/CDKN2B deletions, and CDKN1B and CCND1 amplifications), and unclassified (based on dominant mutations in epigenetic regulators, such as KMT2C/2D, or the Notch signaling pathway, such as NOTCH1/2). These subtypes differed in survival, with good, intermediate, and poor progression‐free survival in the unclassified, cell‐cycle, and RAS/PI3K/AKT subgroups, respectively, in the Guangzhou, Hong Kong, and combined cohorts (n = 82, P = 0.0342; n = 99, P = 0.0372; and n = 181, P = 0.0023; log‐rank test). We have uncovered genetic subtypes of NPC with distinct mutations and/or copy number changes, reflecting discrete paths of NPC tumorigenesis and providing a roadmap for developing new prognostic biomarkers and targeted therapies.

Keywords: copy number variants, indels, molecular classifier, nasopharyngeal carcinoma, single‐nucleotide variants

1. INTRODUCTION

Nasopharyngeal carcinoma (NPC) has a unique geographical distribution that is distinct from that of other head and neck cancers. Currently, the tumor‐lymph node‐metastasis (TNM) staging system is the key clinical tool for prognostication, risk stratification, and treatment decisions.
Although patients of the same TNM stage receive similar treatments, their clinical outcomes vary greatly. Similarly, a previous study found comparable relative survival rates between undifferentiated and differentiated subtypes according to the World Health Organization histological classification (1). Thus, current staging and histological classification systems are insufficient for predicting patient survival. One current hypothesis is that differences in prognosis and treatment efficacy might be attributed to biological heterogeneity. Accurate classification is essential for physicians to assign treatments appropriately and evaluate clinical outcomes.

In the past 2 decades, researchers have found that gene expression profiles differ between cancer patients and healthy controls (2, 3, 4). Despite its early promise as a diagnostic and prognostic tool, gene expression profiling remains cost‐prohibitive and challenging to implement in a clinical setting. In NPC patients, several studies have reported molecular classifications based only on the expression of protein‐coding genes and microRNAs (7‐9), which have not been widely applied in clinical settings, even though they are relevant to prognosis. Recently, molecular profiling has been achieved by high‐throughput analyses, making molecular classifications based on genetic lesions more comprehensive and prevalent in clinical cancer management.
In colorectal cancer, BRAF V600E and activating KRAS mutations are associated with metastasis, leading to poor survival (5). The Cancer Genome Atlas has proposed a molecular classification, based mainly on genomic mutations, amplifications, and fusion genes, that divides gastric cancer into four subtypes, providing a roadmap for patient stratification and trials of targeted therapies (6). A large‐scale international database demonstrated that four consensus molecular subtypes with distinguishable features, including mutations and copy number changes, facilitated clinical treatment in colorectal cancer (7). Notably, the advent of large‐scale DNA molecular profiling has helped identify novel molecular targets that can be applied to the treatment of particular cancer patients (8, 9). Thus, developing a molecular classifier from DNA high‐throughput sequencing data that can identify dysregulated pathways and candidate drivers in NPC is an urgent priority.

To gain further insight into the genetic heterogeneity of primary NPC and to establish a DNA‐based molecular classifier capable of performing multiple‐gene classification for prognostic and therapeutic stratification, we performed whole‐exome sequencing (WES) on 82 formalin‐fixed paraffin‐embedded NPC tumors as a discovery cohort (Guangzhou NPC Cohort [GZNPC]). The WES data of 99 external NPC cases (Hong Kong NPC Cohort [HKNPC]) were used as an independent validation cohort (10). This study revealed that a molecular classifier derived from somatic single nucleotide variants (SNVs), indels, and copy number variants (CNVs) could better illustrate NPC tumorigenesis and thus could be used as a tool to explore different therapeutic strategies.

2. MATERIALS AND METHODS

2.1. Clinical specimens

Between July 2007 and December 2012, we obtained 82 primary NPC tumor tissues and corresponding blood samples from the Department of Pathology and Biobank at Sun Yat‐sen University Cancer Center (GZNPC cohort, Guangzhou, China). All specimens were independently reviewed by two pathologists to determine World Health Organization histological classification and tumor cellularity. NPC specimens with >50% tumor cellularity were used for sectioning, nucleic acid extraction, and library preparation. Additionally, 99 NPC patients from a Hong Kong study were used as the validation cohort (HKNPC) (10). Detailed clinical data of all patients are summarized in Table 1. This study was approved by the institutional review board of Sun Yat‐sen University Cancer Center (RDDA2019001009).

Table 1. Clinical characteristics of the surveyed NPC patients, comprising 82 patients from Guangzhou and 99^a patients from Hong Kong

| Characteristic | Discovery cohort (n = 82, %) | External validation cohort (n = 99, %) | Total (n = 181, %) | P value |
|---|---|---|---|---|
| Age at diagnosis (years) | | | | |
| Median | 47 | 49 | 48 | 0.0721^b |
| Range | 19‐71 | 23‐80 | 19‐80 | |
| OS, months | | | | |
| Median | 48 | 56 | 50 | 0.1029^b |
| Range | 7‐94 | 2‐122 | 2‐122 | |
| PFS, months | | | | |
| Median | 31 | 22 | 29 | 0.3034^b |
| Range | 1‐63 | 2‐96 | 1‐96 | |
| Sex | | | | |
| Male | 62 (75.6) | 71 (73.2) | 133 (74.3) | 0.7346^c |
| Female | 20 (24.4) | 26 (26.8) | 46 (25.7) | |
| Unknown | 0 | 2 | 2 | |
| Smoking status | | | | |
| Nonsmoker | 45 (61.6) | 47 (51.7) | 92 (57.9) | 0.0809^c |
| Smoker | 28 (38.4) | 44 (48.3) | 72 (42.1) | |
| Unknown | 9 | 8 | 17 | |
| Clinical stage | | | | |
| Early stage (I + II) | 7 (8.5) | 25 (26.1) | 32 (18.0) | 0.0030^c |
| Advanced stage (III + IV) | 75 (91.5) | 71 (73.9) | 146 (82.0) | |
| Unknown | 0 | 3 | 3 | |
| WHO classification | | | | |
| NKUC | 73 (89.0) | 91 (93.8) | 164 (91.6) | 0.0205^c |
| NKDC | 7 (8.5) | 0 | 7 (3.3) | |
| KSCC | 2 (2.5) | 6 (6.2) | 8 (5.1) | |
| Unknown | 0 | 2 | 2 | |
| PFS rate (5 year, 95% CI) (%) | 45.4 (26.4‐62.7) | 41.5 (30.1‐52.5) | 46.4 (37.8‐54.5) | |
| OS rate (5 year, 95% CI) (%) | 78.2 (63.5‐87.5) | 65.9 (53.7‐75.5) | 71.8 (62.9‐78.9) | |

NPC, nasopharyngeal carcinoma; OS, overall survival; PFS, progression‐free survival; CI, confidence interval; NKUC, nonkeratinizing undifferentiated carcinoma; NKDC, nonkeratinizing differentiated carcinoma; KSCC, keratinizing squamous cell carcinoma.
^a All NPC cases are from Asia.
^b Wilcoxon rank sum test.
^c Pearson's χ² test.

2.2. Whole‐exome sequencing

Details of DNA isolation methods are provided in the online Data S1. DNA library preparation for NPC tumors and matched controls was performed according to the Agilent SureSelectXT protocol (Santa Clara, CA) with minor modifications. Briefly, 2 μg of tumor DNA and 200 ng of control DNA were fragmented by ultrasonication (M220; Covaris, Woburn, MA), and fragments were captured using SureSelectXT Human All Exon V5+UTRs 75M (Cat no. 5190‐6214; Agilent). Quantities and sizes of the libraries were determined using a Qubit fluorescence detector (Life Technologies, Carlsbad, CA) and an Agilent Bioanalyzer 2100, respectively. Finally, WES libraries were hybridized to an Illumina HiSeq PE Cluster Kit and SBS Kit v4 for enrichment and were sequenced as 150‐bp paired‐end reads, with a dual eight‐base index barcode, on an Illumina HiSeq 1500 sequencing platform according to the manufacturer's protocols (Illumina, San Diego, CA). The raw WES sequence data have been deposited in the Genome Sequence Archive, Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (11, 12), under accession number CRA001397 and are publicly accessible at http://bigd.big.ac.cn/gsa/s/29XtNNXW.

2.3. Variant calling

WES data were analyzed to identify somatic alterations in each tumor, including somatic SNVs, indels, and CNVs. DNA from peripheral blood was used as a reference to filter out germline variants.
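The matched-blood filtering idea can be illustrated with a minimal sketch. This is not the authors' pipeline: the variant callers and thresholds are not given in this excerpt. The `Variant` tuple, the `filter_germline` function, and all coordinates below are hypothetical toy values. The sketch assumes a call counts as germline when the identical chromosome, position, reference allele, and alternate allele also appear in the patient's blood sample. Production somatic callers model allele fractions and sequencing error rather than relying on exact set membership.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """A simple variant record (hypothetical structure for illustration)."""
    chrom: str
    pos: int   # 1-based position
    ref: str   # reference allele
    alt: str   # alternate allele


def filter_germline(tumor_calls: List[Variant],
                    blood_calls: List[Variant]) -> List[Variant]:
    """Return tumor calls that are absent from the matched blood sample.

    Assumption: an exact (chrom, pos, ref, alt) match in blood marks a
    variant as germline. Input order of the tumor calls is preserved.
    """
    germline = set(blood_calls)
    return [v for v in tumor_calls if v not in germline]


# Toy data only; these coordinates do not correspond to real mutations.
tumor = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "G", "A"),
    Variant("chr2", 300, "C", "T"),
]
blood = [Variant("chr1", 200, "G", "A")]

somatic = filter_germline(tumor, blood)
for v in somatic:
    print(v.chrom, v.pos, v.ref, v.alt)
```

In this toy case the `chr1:200 G>A` call is dropped because it is also present in blood, leaving two candidate somatic variants.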