Abstract

   Survival analysis of the Cancer Genome Atlas (TCGA) dataset is a
   well-known method for discovering gene expression-based prognostic
   biomarkers of head and neck squamous cell carcinoma (HNSCC). A cutoff
   point is usually used in survival analysis for patient dichotomization
   when using continuous gene expression values. There is some
   optimization software for cutoff determination. However, the software’s
   predetermined cutoffs are usually set at the medians or quantiles of
   gene expression values. There are also few clinicopathological features
   available in pre-processed datasets. We applied an in-house workflow,
   including data retrieving and pre-processing, feature selection,
   sliding-window cutoff selection, Kaplan–Meier survival analysis, and
   Cox proportional hazard modeling for biomarker discovery. In our
   approach for the TCGA HNSCC cohort, we scanned human protein-coding
   genes to find optimal cutoff values. After adjustments with
   confounders, clinical tumor stage and surgical margin involvement were
   found to be independent risk factors for prognosis. According to the
   results tables that show hazard ratios with Bonferroni-adjusted p
   values under the optimal cutoff, three biomarker candidates, CAMK2N1,
   CALML5, and FCGBP, are significantly associated with overall survival.
   We validated this discovery by using the another independent HNSCC
   dataset ([30]GSE65858). Thus, we suggest that transcriptomic analysis
   could help with biomarker discovery. Moreover, the robustness of the
   biomarkers we identified should be ensured through several additional
   tests with independent datasets.

   Keywords: head and neck squamous cell carcinoma (HNSCC), the Cancer
   Genome Atlas (TCGA), transcriptomic analysis, survival analysis,
   optimal cutoff, effect size, calcium/calmodulin dependent protein
   kinase II inhibitor 1 (CAMK2N1), calmodulin like 5 (CALML5), Fc
   fragment of IgG binding protein (FCGBP), mindfulness meditation

1. Introduction

   Head and neck squamous cell carcinoma (HNSCC), including that of oral,
   oropharyngeal, and hypopharyngeal origins, is the fourth leading cause
   of cancer-related death for males in Taiwan [[31]1]. The
   age-standardized incidence rate of HNSCC in males is 42.43 per 100,000
   persons [[32]2]. The treatment strategies of HNSCC are surgery alone,
   systemic therapy with concurrent radiation therapy (systemic
   therapy/RT), and surgery with adjuvant systemic therapy/RT (according
   to National Comprehensive Cancer Network, NCCN, Clinical Practice
   Guidelines for HNSCC, Version 2.2020) [[33]3]. Despite the improvements
   in those interventions, the survival of HNSCC has improved only
   marginally over the past decade worldwide [[34]4]. The critical
   advancements of targeted therapy and immuno-oncology should benefit
   from emerging prognostic biomarkers guiding modern systemic therapies.

   Cumulative knowledge shows that some biomarkers have prognostic
   significance in HNSCC. For example, node-negative HNSCC patients with
   p53 overexpression were found to have lower survival [[35]5].
   Overexpression of hypoxia-inducible factor (HIF)-1 alpha [[36]6] or
   Ki-67 [[37]7] was found to be correlated with poor response to
   radiotherapy of HNSCC. Both epidermal growth factor receptor (EGFR)
   [[38]8,[39]9] and matrix metalloproteinase (MMP) [[40]10] were found to
   be overexpressed to promote the invasion and metastasis of HNSCC. From
   2000 to 2006, the first anti-EGFR antibody-drug (cetuximab) was
   developed and combined with radiotherapy, known as bio-RT, to increase
   survival with unresectable locoregionally advanced disease [[41]11].
   The systemic therapy of cetuximab plus platinum-fluorouracil
   chemotherapy (EXTREME regimen) improves overall survival when given as
   a first-line treatment in patients with recurrent or metastatic HNSCC
   [[42]12,[43]13]. It was approved by the US Food and Drug Administration
   (FDA) in 2008. The bio-RT could be proceeded with docetaxel, cisplatin,
   and 5-fluorouracil (Tax-PF) induction chemotherapy to overcome
   radio-resistance of HNSCC [[44]14].

   However, Rampias and his colleagues [[45]15] suggested that Harvey rat
   sarcoma viral oncoprotein (HRAS) mutations could mediate cetuximab
   resistance in systemic therapy of HNSCC via the EGFR/rat sarcoma
   (RAS)/extracellular signal-regulated kinases (ERK) signaling pathway.
   After that, the EGFR tyrosine kinase inhibitor (TKI) was introduced to
   help cetuximab in 2018. Anti-tumor activity was observed in a phase 1
   trial for HNSCC patients using cetuximab and afatinib, a TKI of EGFR,
   human epidermal growth factor receptor (HER)2, and HER4 [[46]16]. Other
   EGFR TKIs, such as gefitinib, erlotinib, and osimertinib, were also
   developed to treat advanced HNSCC. Although 90% of HNSCCs overexpress
   EGFR, cetuximab only has a 10–20% response rate in those patients. As
   of 2019, cetuximab was still the only drug of choice with proven
   efficacy for selected HNSCC patients [[47]17].

   In the immuno-oncology era, the immune-checkpoint inhibitor (ICI) was
   introduced in 2014 for treating HNSCC [[48]18,[49]19]. The ICI works on
   immune checkpoint molecules, including programmed death 1 (PD-1),
   cytotoxic T lymphocyte antigen 4 (CTLA-4), T-cell immunoglobulin mucin
   protein 3 (TIM-3), lymphocyte activation gene 3 (LAG-3), T cell
   immunoglobin and immunoreceptor tyrosine-based inhibitory motif
   (TIGIT), glucocorticoid-induced tumor necrosis factor receptor (GITR),
   and V-domain Ig suppressor of T-cell activation (VISTA) [[50]20]. The
   US FDA has approved the anti-PD-1 agents (e.g., pembrolizumab and
   nivolumab) as monotherapies for platinum-treated patients with
   recurrent or metastatic HNSCC [[51]21]. According to the phase 3
   KEYNOTE-048 study, PD-L1 is a validated biomarker used as clinical
   guidance for candidate selection of pembrolizumab [[52]22,[53]23].
   However, due to the complexity of immune-tumor interactions, ICI has
   20% response rate in programmed death ligand 1 (PD-L1)-expressing
   patients (over 50% in immunohistochemistry (IHC) staining of HNSCC)
   [[54]19,[55]23].

   According to our previous proteomic study from 2010 to 2017, thymosin
   beta-4 X-linked (TMSB4X) is related to tumor growth and the metastasis
   of HNSCC [[56]24]. It was also reported by the subsequent
   investigations that TMSB4X contributes to tumor aggressiveness through
   epithelial-mesenchymal-transition (EMT) in pancreatic [[57]25], gastric
   [[58]26], colorectal [[59]27], lung [[60]28], ovarian [[61]29], and
   melanoma [[62]30] cancers. Thus, it might be suggested that TMSB4X is a
   candidate for tumor type-agnostic therapy [[63]31], as a common
   biomarker of several types of cancer.

   The Cancer Genome Atlas (TCGA) has clinical and genomic data of HNSCC
   (528 participants), which were standardized and are available at a
   unified data portal, Genomic Data Commons (GDC) of the the National
   Cancer Institute (NCI). The advantages of applying the TCGA data for
   cancer biomarker identification include:
     * To the best of our knowledge, the TCGA database is the largest
       collection (in terms of both cancer types and cohort size,
       especially in HNSCC) of comprehensive genomics with survival data
       available in the field of cancer research. The whole-genome
       sequencing data were harmonized across all genome data analysis
       centers. Many databases adopt the essential demographic data from
       TCGA, since it has comprehensive physical and social features of
       patients, such as exposure to alcohol, asbestos, radioactive radon,
       tobacco smoking, and cigarettes.
     * TCGA has a remarkable advantage for computational and life
       scientists who study cancer, since useful web-based tools and APIs
       are ready to analyze and visualize TCGA data. It might be getting
       help soon from the research community for trouble-shooting
       purposes.
     * Many achievements in diagnosis, treatment, and prevention that
       relied on the TCGA data have already been published and keep
       increasing in number [[64]32].

   Usually, researchers develop an in-house workflow of gene expression
   analysis of TCGA data to find HNSCC biomarkers. It would be helpful to
   show that alterations in gene expression correlate with phenotypes of
   HNSCC. Some researchers
   [[65]33,[66]34,[67]35,[68]36,[69]37,[70]38,[71]39] tried to find
   differentially expressed genes (DEGs) of the HNSCC samples at both
   genotypic and phenotypic levels (without survival information) for
   biomarker discovery. Gene expression data were downloaded from the TCGA
   or Gene Expression Omnibus (GEO) databases (e.g., [72]GSE117973
   [[73]39]; HIPO-HNC cohort has n = 87). They used the Database for
   Annotation, Visualization, and Integrated Discovery (DAVID, available
   at [74]https://david.ncifcrf.gov/, accessed 15 November 2019) to obtain
   information for Gene Ontology (GO), including biological processes,
   cellular components, and molecular functions. Kyoto Encyclopedia of
   Genes and Genomes (KEGG) pathway analysis was also used to annotate the
   potential functions of their biomarker candidates. The pathway
   enrichment analysis of DEGs was also performed by DAVID, STRING
   (available at [75]https://string-db.org, accessed 15 November 2019),
   and Cytoscape software [[76]37,[77]38]. Li [[78]36] and his colleagues
   made an R package (GDCRNATool) for the implementation of those
   workflows for gene expression analyses of the TCGA. Xu and his
   colleagues [[79]40] also identified their biomarkers via DEGs analysis.
   The significant impacts of genes on overall survival were evaluated
   with Kaplan–Meier survival curves with a log-rank test (p value <
   [MATH: <mrow><mrow><mn>0.01</mn></mrow></mrow> :MATH]
   ) and univariate Cox regression. They validated the candidate genes by
   using the web-based tools of Gene Expression Profiling Interactive
   Analysis—GEPIA—and Human Protein Atlas (HPA) databases. GEPIA was
   developed, using TCGA datasets, by Zefang Tang and his colleagues
   (version 1 [[80]41], version 2 [[81]42], and GEPIA2021 [[82]43],
   available at [83]http://gepia2021.cancer-pku.cn/, accessed 18 July
   2021). HPA [[84]44] applied immunohistochemistry (IHC) for the TCGA
   database (please see details in the Discussions section “Validation by
   Web-based Tools”). Finally, their biomarkers were verified by using the
   gene expression profile from the GEO and HNSCC cell lines and tissues.

   Other investigators should gather genes of interest to specific cancer
   types. They should upload those genes manually onto web-based tools,
   such as SurvExpress [[85]45] (available at
   [86]http://bioinformatica.mty.itesm.mx:8080/Biomatec/SurvivaX.jsp,
   accessed 11 August 2021), and then analyze cohorts of interest (e.g.,
   TCGA). After downloading the survival results, they could curate plots
   and tables carefully. It is not possible to scan the whole human
   protein-coding genome in this way. The web-based tools might set a
   cutoff at the median, 1/4 quantile, or 3/4 quantile for subsequent
   analyses. There are several visualization tools and R packages which
   deal with cutoff determination [[87]46], such as Prognoscan [[88]47],
   Cutoff Finder [[89]48], Findcut [[90]49], OptimalCutpoints [[91]50],
   cutpointr (available at [92]https://github.com/thie1e/cutpointr,
   accessed 20 November 2018), and cutoffR (available at
   [93]https://cran.r-project.org/web/packages/cutoffR, accessed 27 June
   2021). However, none of them could perform survival analysis in tandem
   with cutoff selection and whole-genome scanning.

   In summary, identifying predictive biomarkers for selecting
   standard-of-care or advanced systemic therapy [[94]50] in HNSCC is
   crucial. Our approach describes an in-house workflow implemented in R
   script, which runs on the Rstudio server. Its functions include data
   retrieving and pre-processing, feature selection, sliding-window cutoff
   selection, Kaplan–Meier survival analysis, Cox proportional hazard
   modeling, and biomarker discovery. The independent HNSCC dataset
   ([95]GSE65858) [[96]51] was used to validate this strategy. The
   workflow, shown in [97]Figure 1, has scanned 20,500 human
   protein-coding genes of the TCGA HNSCC cohort to yield a model with
   biomarker estimates using gene-expression-based survival analysis.

Figure 1.

   [98]Figure 1
   [99]Open in a new tab

   A workflow of HNSCC biomarker discovery. The workflow includes data
   retrieval from the TCGA GDC data portal, data processing with merging
   and cleaning, and then performing the survival analyses (within yellow
   square). The Cutoff engine (in R script: cutofFinder_func.HNSCC.R, a
   serial cutoff for grouping patients with low or high expression of a
   specific gene, to yield a collection of p values; please see Materials
   and Methods section for details) might calculate all possible
   Kaplan–Meier p values (corrected by false discovery rate (FDR) method)
   to find the optimal cutoff value of gene expression for subsequent Cox
   modeling. The candidate selection performs (1) dissection and selection
   of candidate genes with further Bonferroni-adjusted p values and the
   hazard ratios of a Cox model, based on the results from the survival
   analyses; (2) survival analyses of the other HNSCC dataset
   ([100]GSE65858) using Kaplan–Meier estimates (with FDR corrections) and
   Cox modeling. The biomarker candidates were consensus results of TCGA
   and [101]GSE65858. (HNSCC: head and neck squamous cell carcinoma; TCGA:
   the Cancer Genome Atlas; RNA-Seq: RNA sequencing; GDC: Genomic Data
   Commons).

2. Results

   The TCGA HNSCC cohort was used for exploration of biomarker candidates.
   A total of 9416 Kaplan–Meier plots (under sliding-window cutoff
   selection) with associated Cox univariate and multivariate tables were
   generated by Cox modeling ([102]Figure 1) and justified by the ranking
   of hazard ratios. In total, 967 out of 9416 genes were kept by criteria
   of FDR-adjusted Kaplan–Meier p values (<
   [MATH: <mrow><mrow><mn>0.05</mn></mrow></mrow> :MATH]
   ) and hazard ratios (HR) derived from Cox’s model ([103]Figure 2a,b,
   initial trial). In the next step, a stringent Bonferroni p value
   correction was used to yield 20 genes ([104]Figure 2c,d).

Figure 2.

   [105]Figure 2
   [106]Open in a new tab

   The initial progress of candidate selection from the TCGA HNSCC cohort.
   The p value of Kaplan–Meier survival was one of the selection criteria.
   The effect size was estimated by Cox’s hazard ratio. Initial trial
   step: (a) univariate HR versus FDR-adjusted p value; (b) multivariate
   HR versus FDR-adjusted p value. After stringent restriction by
   Bonferroni-adjusted p values and Cox’s HR, a few top-ranked genes were
   acquired by (c) univariate HR versus Bonferroni-adjusted p value; (d)
   multivariate HR versus Bonferroni-adjusted p value. (TCGA: the Cancer
   Genome Atlas; HR: hazard ratio; FDR: false discovery rate).

   CAMK2N1, CALML5, and FCGBP and 17 other genes (DKK1, STC2, PGK1, SURF4,
   USP10, NDFIP1, FOXA2, STIP1, DKC1, ZNF557, ZNF266, IL19, MYO1H, EVPLL,
   PNMA5, IQCN, and NPB) had significant FDR-adjusted p values (<
   [MATH: <mrow><mrow><mn>0.0003</mn></mrow></mrow> :MATH]
   ) in the Kaplan–Meier estimates and appropriate hazard ratios (HRs) (>
   [MATH: <mrow><mrow><mn>1.8</mn></mrow></mrow> :MATH]
   or <
   [MATH: <mrow><mrow><mn>0.6</mn></mrow></mrow> :MATH]
   ) in Cox’s model ([107]Figure 3; log
   [MATH:
   <mrow><mrow><msub><mrow></mrow><mn>10</mn></msub><mrow><mo>(</mo><mn>0.
   0003</mn><mo>)</mo></mrow><mo>=</mo><mo>−</mo><mn>3.5</mn></mrow></mrow
   > :MATH]
   ). The volcano plot reveals that those top 20 genes
   (Bonferroni-adjusted p <
   [MATH: <mrow><mrow><mn>0.05</mn></mrow></mrow> :MATH]
   ) form the peaks. At the same time, Cox’s HRs separate them in regard
   to significant prognostic impact.

Figure 3.

   [108]Figure 3
   [109]Open in a new tab

   A volcano plot of genes in survival analyses of TCGA HNSCC. This cohort
   was applied for exploration of the candidate biomarkers. A total of
   9416 genes had unadjusted p values of less than 0.05. CAMK2N1, CALML5,
   FCGBP, and 17 other genes (marked in black square) had hazard ratios
   (HRs) >
   [MATH: <mrow><mrow><mn>1.8</mn></mrow></mrow> :MATH]
   or <
   [MATH: <mrow><mrow><mn>0.6</mn></mrow></mrow> :MATH]
   . The 22 genes, listed on the side, had hazard ratios between 0.6 and
   1.5. Red spots:
   [MATH: <mrow><mrow><mi>H</mi><mi>R</mi></mrow></mrow> :MATH]
   > 1.0. Green spots:
   [MATH: <mrow><mrow><mi>H</mi><mi>R</mi></mrow></mrow> :MATH]
   < 1.0. (X-axis: Kaplan–Meier survival estimates, with FDR-adjusted p
   values (log10 transformed); y-axis: HR of Cox proportional hazard
   regression model.)

   In our validation study using the [110]GSE65858 cohort [[111]51] (under
   median cutoffs), CAMK2N1, CALML5, and FCGBP (3 out of those 20 genes
   discovered in the TCGA cohort) kept ahead of the curve with their
   FDR-corrected p values (<
   [MATH: <mrow><mrow><mn>0.05</mn></mrow></mrow> :MATH]
   ), and Cox’s HRs (>
   [MATH: <mrow><mrow><mn>1.8</mn></mrow></mrow> :MATH]
   or <
   [MATH: <mrow><mrow><mn>0.6</mn></mrow></mrow> :MATH]
   ) ([112]Supplementary Table S3). However, the significance of the other
   17 genes was insufficient compared to that of DUSP6, MSMB, and RBM11
   ([113]Figure 4). Conversely, there are 22 genes which have high hazard
   ratios (>
   [MATH: <mrow><mrow><mn>1.8</mn></mrow></mrow> :MATH]
   or <
   [MATH: <mrow><mrow><mn>0.6</mn></mrow></mrow> :MATH]
   ) in the [114]GSE65858 cohort ([115]Figure 4); their hazard ratios were
   between 0.6 and 1.5 in the study of TCGA HNSCC ([116]Figure 3). Thus,
   there is a consensus between the TCGA and [117]GSE65858 cohorts that
   CAMK2N1, CALML5, and FCGBP are significant candidates for HNSCC
   biomarkers.

Figure 4.

   [118]Figure 4
   [119]Open in a new tab

   Volcano plot of genes in survival analyses of [120]GSE65858 cohort.
   This HNSCC cohort was used for filtering of our candidate genes:
   CAMK2N1, CALML5, and FCGBP. In total, 534 genes had FDR-adjusted p
   values less than 0.05 Red spots: hazard ratios are greater than 1.0;
   Green spots: hazard ratios are under 1.0. The 22 genes, listed on the
   side, had hazard ratios >
   [MATH: <mrow><mrow><mn>1.8</mn></mrow></mrow> :MATH]
   or <
   [MATH: <mrow><mrow><mn>0.6</mn></mrow></mrow> :MATH]
   . (X-axis: Kaplan–Meier survival estimates, with FDR-adjusted p values,
   log10 transformed; y-axis: the hazard ratio (HR) under the Cox
   proportional hazard regression model).

   Our top candidate is calcium/calmodulin dependent protein kinase II
   inhibitor 1 (CAMK2N1). The Kaplan–Meier curve reveals that 152 patients
   bearing higher expression levels of CAMK2N1 suffered from an only 35%
   5-year OS rate. In comparison, the other 262 patients with lower
   expression levels had better prognoses (Bonferroni-adjusted p
   [MATH: <mrow><mrow><mo>=</mo><mn>0.002</mn></mrow></mrow> :MATH]
   ) ([121]Figure 5a). [122]Figure 5b’s cumulative p value plot shows that
   the 147 uncorrected p values (<
   [MATH: <mrow><mrow><mn>0.05</mn></mrow></mrow> :MATH]
   ) were estimated by a serial cut from 144 to 290 persons for grouping
   the cohort in our cutoff finding procedure (cutofFinder_func.R;
   [123]Figure 1, cutoff engine). The smallest p value (2.97 × 10^−7),
   when cut at n = 262 (63.3% of total cohort 414, with the cutoff value
   of 0.027 for RNA-Seq by Expectation-Maximization—RSEM), was defined as
   an optimal p value. The plot in [124]Figure 5b shows a “backlash” curve
   with the half of values below 1.0 × 10^−3.

Figure 5.

   [125]Figure 5
   [126]Open in a new tab

   Kaplan–Meier survival analyses, by cutoff finding. The Kaplan–Meier
   curves of (a) CAMK2N1, (c) CALML5, and (e) FCGBP with optimal p values.
   The cutoffs in the cumulative p value plots of (b) CAMK2N1, (d) CALML5,
   and (f) FCGBP, show that over 50% of those unadjusted p values derived
   by the sliding-window cutoff-finding procedure are below 0.001. (* p: p
   value adjusted by false discovery rate, FDR).

   Conversely, the gene most associated with better survival was
   calmodulin like 5 (CALML5). In [127]Figure 5c, a Kaplan–Meier curve
   reveals 200 patients bearing higher expression of CALML5 had a 60%
   5-year OS survival rate (Bonferroni-adjusted p
   [MATH: <mrow><mrow><mo>=</mo><mn>0.039</mn></mrow></mrow> :MATH]
   ). The sliding-window cutoff-selection-generated cumulative p value
   plot is in [128]Figure 5d. This plot reveals a “V” curve with the
   minimum at the middle portion. The 166 uncorrected p values were
   estimated by a serial cut from 125 to 290 for grouping the cohort. The
   smallest p value (5.87 × 10^−6), when cut at n = 214 (51.7% of total
   cohort 414), was defined as an optimal p value with a cutoff value of
   −0.359 for RSEM.

   The third candidate is Fc fragment of IgG binding protein (FCGBP). It
   was also correlated with better survival in both the TCGA and
   [129]GSE65858 cohorts. In [130]Figure 5e, a Kaplan–Meier curve reveals
   282 patients bearing higher expression levels of FCGBP had a 60% 5-year
   OS survival rate (Bonferroni-adjusted p
   [MATH: <mrow><mrow><mo>=</mo><mn>0.008</mn></mrow></mrow> :MATH]
   ). The sliding-window cutoff-selection-generated cumulative p value
   plot is in [131]Figure 5f. This plot has a “W-shaped” curve with the
   majority of values being far below 1.0 × 10^−3. The 166 uncorrected p
   values were estimated by a serial cut from 125 to 290 for grouping the
   cohort. The smallest p value (1.21 × 10^−6), when cut at n = 132 (31.9%
   of total cohort 414), was defined as an optimal p value with a cutoff
   value of −0.472 for RSEM.

   After adjustments for confounders, CAMK2N1 overexpression became an
   independent prognostic factor (multivariate HR 2.007 (95% CI:
   1.490–2.704, p < 0.001), [132]Table 1). The clinical T stage (HR 1.982
   (95% CI: 1.048–3.745, p = 0.035)) and surgical margin status (HR 1.631
   (95% CI: 1.182–2.250, p = 0.003)) also have significant impacts on a
   patient’s survival. A patient being older than 65 could worsen survival
   (HR 1.391 (95% CI: 1.025–1.888, p = 0.034)). The M stage should be
   ignored in this cohort due to only 3 out of 414 patients having distant
   metastasis.

Table 1.

   Univariate/multivariate Cox proportional hazard regression analyses on
   OS time of CAMK2N1 gene expression in HNSCC.
   Features Univariate Multivariate
   HR CI95% p Value HR CI95% p Value
   Gender Female 1 1
   Male 1.157 0.843–1.587 0.367 1.076 0.767–1.510 0.671
   Age at diagnosis
   [MATH: <mrow><mrow><mo><</mo><mo>=</mo><mn>65</mn></mrow></mrow> :MATH]
   y 1 1
   [MATH: <mrow><mrow><mo>></mo><mn>65</mn></mrow></mrow> :MATH]
   y 1.329 0.990–1.784 0.058 1.391 1.025–1.888 0.034
   Clinical T Status T1 + T2 1 1
   T3 + T4 1.409 1.028–1.931 0.033 1.982 1.048–3.745 0.035
   Clinical N Status N0 1 1
   N1-3 1.185 0.890–1.577 0.246 1.145 0.801–1.636 0.457
   Clinical M Status M0 1 1
   M1 4.097 1.009–16.644 0.049 7.314 1.590–33.631 0.011
   Clinical Stage Stage I + II 1 1
   Stage III + IV 1.245 0.882–1.759 0.213 0.621 0.287–1.343 0.226
   Surgical Margin status Negative 1 1
   Positive 1.591 1.155–2.191 0.004 1.631 1.182–2.250 0.003
   Tobacco Exposure Low 1 1
   High 1.364 1.008–1.844 0.044 1.363 0.990–1.875 0.058
   Gene Expression Low 1 1
   High 2.101 1.572–2.809 < 0.001 2.007 1.490–2.704 <0.001
   [133]Open in a new tab

   (OS: overall survival; HR: hazard ratio; CI95%: 95% confidence
   interval; p value significant code is denoted: red < 0.05).

   In summary, those three biomarker candidates, clinical T stage, and the
   presence of a surgical margin are independent prognostic factors in
   HNSCC. We also found those candidates have proper effect sizes—Cox’s HR
   >
   [MATH: <mrow><mrow><mn>1.8</mn></mrow></mrow> :MATH]
   or <
   [MATH: <mrow><mrow><mn>0.6</mn></mrow></mrow> :MATH]
   ; [134]Table 2. Thus, the prognostic model with coefficients was
   established by the TCGA HNSCC cohort and validated by the [135]GSE65858
   cohort.

Table 2.

   The top 3 genes with prognostic impacts on HNSCC.
   Gene ID Gene Description Kaplan–Meier Survival Cox Univariate Cox
   Multivariate
   FDR
   p Value Bonferroni
   p Value HR * CI95% HR * CI95%
   CAMK2N1 calcium/calmodulin-
   dependent protein
   kinase II inhibitor 1 1.63 × 10^−5 0.002 2.101 1.572–2.809 2.007
   1.490–2.704
   CALML5 calmodulin like 5 1.97 × 10^−4 0.039 0.51 0.379–0.686 0.493
   0.364–0.667
   FCGBP Fc fragment of
   IgG binding protein 4.83× 10^−5 0.008 0.484 0.359–0.653 0.496
   0.366–0.674
   [136]Open in a new tab

   Selection criteria (fit all): (1) Kaplan–Meier Bonferroni-adjusted p <
   0.05; (2) Cox’s univariate and multivariate HR >= 1.8 or <= 0.6 in TCGA
   cohort; (3) Cox’s univariate and multivariate HR >= 1.8 or <= 0.6 in
   [137]GSE65858 cohort. * Cox’s model: p <0.001 (HR: hazard ratio; CI95%:
   95% confidence interval; FDR: false discovery rate).

3. Discussion

3.1. The Three Biomarkers in Cancer

3.1.1. The Protein/Pathology Atlas

   Proteomics analysis in the Human Protein Atlas project (HPA) was based
   on 26,941 antibodies targeting 17,165 unique proteins. The HPA’s
   Pathology Atlas analyzed each protein in patients using
   immunohistochemistry (IHC) analysis based on tissue microarrays (TMAs)
   adopted from TCGA. Kaplan–Meier survival analyses were based on RNA-Seq
   expression levels of human genes in HNSCC tissue and the clinical
   outcome.

   CAMK2N1 is on the list of unfavorable prognostic genes for HNSCC, and
   lung or liver cancer from the Human Protein Atlas (HPA) (pathology
   atlas [[138]44] is available at
   [139]https://www.proteinatlas.org/humanproteome/pathology/head+and+neck
   +cancer, Version: 20.0 updated: 19 November 2020, accessed 27 June
   2021). CALML5 and FCGBP are on the list of favorable prognostic genes
   for HNSCC (available at
   [140]https://www.proteinatlas.org/ENSG00000178372-CALML5/pathology;
   [141]https://www.proteinatlas.org/ENSG00000275395-FCGBP/pathology,
   respectively). Furthermore, CALML5 was validated by the HNSCC cohort at
   mRNA and protein levels [[142]44,[143]52].

3.1.2. Literature Review

   We searched Embase/Pubmed to find the evidence of our three biomarker
   candidates in cancer research.

   CAMK2N1 (calcium/calmodulin dependent protein kinase II inhibitor 1) is
   an endogenous inhibitor of calcium/calmodulin-dependent protein kinase
   II, CaMKII. CaMKII is a multi-functional kinase composed of four
   different chains: alpha, beta, gamma, and delta. CAMK2A encodes the
   alpha chain. Although the mRNA expression of CaMKII’s endogenous
   inhibitor CAMK2N1 inversely correlates with the severity of medullary
   thyroid carcinoma [[144]53], both CAMK2N1 (
   [MATH:
   <mrow><mrow><mi>H</mi><mi>R</mi><mo>=</mo><mn>2.1</mn></mrow></mrow>
   :MATH]
   ) and CAMK2A (
   [MATH:
   <mrow><mrow><mi>H</mi><mi>R</mi><mo>=</mo><mn>1.6</mn></mrow></mrow>
   :MATH]
   ) were overexpressed in the TCGA HNSCC patientswith worse outcomes.
   Overexpression of CAMK2N1 is also unfavorable in head and neck cancer,
   as revealed by the Human Pathology Atlas (available at
   [145]https://www.proteinatlas.org/ENSG00000162545-CAMK2N1/pathology/hea
   d+and+neck+cancer, accessed on 10 March 2021). Another web-based tool,
   KM plotter [[146]54,[147]55], shows that CAMK2N1 either worsens or
   improves survival in various cancer types.

   Calmodulin-like 5 (CALML5) is overexpressed in differentiating
   keratinocytes [[148]56]. In patients with HPV-associated HNSCC,
   hypermethylation of the CALML5 gene is associated with significantly
   reduced survival, with a hazard ratio of 7.01 (95% CI: 1.01–48.66)
   [[149]57]. CALML5 expression could be a protective mechanism for
   patient survival. Furthermore, CALML5 was validated by the HNSCC cohort
   at mRNA and protein levels [[150]44,[151]52].

   The Fc fragment of the IgG binding protein (
   [MATH:
   <mrow><mrow><mi>F</mi><mi>c</mi><mi>γ</mi><mi>B</mi><mi>P</mi></mrow></
   mrow> :MATH]
   , FCGBP) is expressed in the normal thyroid and is down-regulated in
   papillary and follicular thyroid carcinomas [[152]58,[153]59].
   Overexpression of FCGBP has hazard ratio of 0.306 (95% CI: 0.136–0.686)
   for gallbladder cancer [[154]60].

   In conclusion, the three prognostic genes underlined have been
   highlighted by published studies using the TCGA cohort, an in-house
   cohort, or in vitro and in vivo experiments.

3.2. Feature Selection for Survival Modeling

   Besides ethnicity, age, gender, TNM stage, radiation therapy,
   chemotherapy, and targeted therapy, the comprehensive adversely
   prognostic features in HNSCC should also include tobacco exposure, EGFR
   amplification, human papillomavirus (HPV) status, positive/close
   surgical margin (<5 mm), extra-nodal extension (ENE), lymph-vascular
   space invasion (LVSI), perineural invasion (PNI), depth of invasion
   (DOI) (>5 mm), metastatic lymph node density (LND) [[155]61], and worst
   pattern of invasion score 5 (WPOI-5), which is defined as tumor
   dispersion (1 mm apart between tumor satellites) or positive PNI/LVSI
   [[156]62]. The features of DOI, LND, and tumor dispersion are not
   available in the TCGA dataset. The Brandwein–Gensler risk model
   (lymphocytic host response, WPOI-5, and PNI) [[157]63,[158]64] has been
   suggested for routine pathological examinations. In previous reports of
   HNSCC, the loco-regional failure was high when the initial frozen
   section had a positive/close surgical margin, and even the final margin
   revision revealed a negative effect [[159]65]. According to [160]Table
   1, in our study, the positive surgical margin had to yield a hazard
   ratio greater than 1.6 to influence a patient’s OS. It is suggested by
   authors
   [[161]66,[162]67,[163]68,[164]69,[165]70,[166]71,[167]72,[168]73,[169]7
   4,[170]75] that the reason for a positive/close surgical margin is
   possibly tumor aggressiveness or dispersion (WPOI-5) instead of
   iatrogenic actions in surgery. The surgical margin status was also
   suggested as an independent surrogate for tumor dispersion in the HNSCC
   study. Thus, we selected common clinicopathological features during our
   biomarker discovery, including gender, age, clinical T, clinical N,
   clinical M, surgical margin status, and tobacco exposure, to adjust
   confounders (details description at Materials and Methods section).

3.3. The Purpose of Sliding-Window Cutoff Selection

   By trying to find an optimal cutpoint of that RNA expression data to
   maximize candidate mining coverage, this strategy could identify more
   but sometimes weak “biomarkers.” Thus, we had to try our best to handle
   the effect size with Cox’s modeling. Additionally, validation of those
   candidates was required by using other independent datasets.

   Statistical significance (p value) is affected by sample size, error,
   effect size (substantive significance) [[171]76,[172]77], and cutoff.
   The effect size is the magnitude of the difference (e.g., hazard ratio)
   between the groups being compared. The effect size is independent of
   the sample size [[173]76].

   In a study with a large sample size, the difference can be noticed
   easily (i.e., p value <
   [MATH: <mrow><mrow><mn>0.05</mn></mrow></mrow> :MATH]
   ) due to decreased standard error [[174]76]. However, a small effect
   size (non-zero) is often meaningless or implies substantive
   insignificance (e.g., hazard ratio between 0.8 and 1.2). Conversely,
   the effect size can be large but fail to gain statistical significance
   if the sample size is small. The following errors could also impact the
   p value:
     * A random error, defined as the variability in data, is not
       considered a bias but rather occurs randomly across the entire
       study population and can distort the measurement process (e.g.,
       RNA-Seq experiments). A larger sample size could reduce the random
       error.
     * A systematic error is a bias, a selection biases, an information
       bias, or a confounder. It could deleteriously impact the
       statistical significance. A larger sample size could not affect the
       systematic error.

   While statistical significance can inform the researcher of whether an
   effect exists, the p value will not directly tell one of the effect
   size. Thus, if there is no error in two study groups, and the sample
   sizes are the same (not small), the group which has a larger effect
   size will have a small p value [[175]77]. If a skewed cutoff that
   splits between the two groups—for example, into 425 versus
   75—statistical significance will be gained by increasing the effect
   size artificially.

   There was a benefit in using the sliding-window cutoff method (the
   between
   [MATH: <mrow><mrow><mn>30</mn><mo>%</mo></mrow></mrow> :MATH]
   and
   [MATH: <mrow><mrow><mn>70</mn><mo>%</mo></mrow></mrow> :MATH]
   quantile) in Kaplan–Meier analysis of the TCGA HNSCC cohort at the
   beginning. We compared the results of cutoff at the optimal p value by
   sliding window or just at the median of gene expression. The numbers of
   genes (with unadjusted p values <
   [MATH: <mrow><mrow><mn>0.05</mn></mrow></mrow> :MATH]
   ) were 6284 and 3118, respectively. After FDR correction, it became 967
   versus 209, respectively. The sliding-window cutoff method could catch
   more potential candidates which have p values far less than
   [MATH: <mrow><mrow><mn>0.001</mn></mrow></mrow> :MATH]
   for subsequent Cox’s modeling. That is because of the properly selected
   cutoff improving the statistical significance. A smaller p value might
   predict large effect size (HR) associated biomarkers. Then, these
   preliminary candidates will have an opportunity to be carefully
   selected by using FDR, stringent Bonferroni correction, their effect
   sizes (Cox’s HR), and the use of another independent cohort
   ([176]GSE65858) to prevent false discovery.

   We can explain the aforementioned situation by examples. When reviewing
   the special cases of genes such as NDFIP1, DKC1, PNMA5, and NPB, we
   noticed that NDFIP1, with a p value of 0.05 at the 50% quantile
   (median) cutoff, could achieve a p value of 2.62 × 10^−6 at a 70%
   quantile cutoff. NDFIP1 has a FDR-adjusted p value of 1.07 × 10^−4
   ([177]Supplementary Figure S1, a “S” or “W”-shaped portion of the p
   value plot). However, it was excluded as a candidate by a small effect
   size (
   [MATH: <mrow><mrow><mi>H</mi><mi>R</mi></mrow></mrow> :MATH]
   = 1.33 in [178]GSE65858) of less than 1.8.

   The other example is IL19. It has a p value plot with acute S-curve
   bending at the median zone, which lets the FDR-adjusted p value have a
   large difference between the 50% quantile cutoff (FDR-KM p = 0.115) and
   an optimal 48% quantile cutoff (FDR-KM p = 6.54 × 10^−6). This optimal
   cutoff method could boost its statistic significance to pass correction
   by both FDR and Bonferroni methods. Even IL19 became a candidate
   through its effect size (
   [MATH: <mrow><mrow><mi>H</mi><mi>R</mi></mrow></mrow> :MATH]
   = 0.472 in TCGA cohort), but failed in the validation with the
   [179]GSE65858 cohort (small effect size as
   [MATH: <mrow><mrow><mi>H</mi><mi>R</mi></mrow></mrow> :MATH]
   = 0.630, and FDR-KM p = 0.031).

3.4. Technical Considerations

   There are two essential points of biomarker discovery from survival
   analysis of the TCGA HNSCC dataset.

   First, since TCGA genomic data were harmonized, the pre-processing of
   TCGA RNA-Seq in our workflow was done as follows:
     * HNSCC samples without complete clinical information were removed;
     * Null expressed genes in more than 30% of the HNSCC samples were
       excluded;
     * The updated number of protein-coding genes in the TCGA HNSCC was
       20,500.

   After investigation of the mRNA expression dataset obtained through
   NCI’s Firehose API, we found that the expression values of two genes
   (gene ID: 9906 and gene ID: 728661) were saved together under the
   entity of gene symbol “SLC35E2.” The expression file of SLC35E2 was
   almost double those of SLC35E1 and SLC35E3 in size. According to the
   Human Gene Database (available at
   [180]https://www.genecards.org/Search/Keyword?queryString=SLC35E2,
   accessed on 10 March 2021), SLC35E2A (Gene ID: 9906) and SLC35E2B (Gene
   ID: 728661) should be the correct entities for the TCGA HNSCC dataset.
   SLC35E2 is the previous symbol of SLC35E2A (reference at
   [181]https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:
   20863, accessed on 10 March 2021). Thus, we reassigned the expression
   values of SLC35E2A and SLC35E2B and updated the number of
   protein-coding genes in this TCGA HNSCC dataset from 20,499 to 20,500.

   Second, we analyzed the error log during the cutoff finding and Cox
   modeling. The result shows that program could be halted under several
   technical situations. These included:
     * If 32.2% of events had a “one group” issue in the confusion matrix
       of Chi-square test in Cox regression (coxph), due to a zero in the
       M (distant metastasis) patient subgroup;
     * If 21.05% of errors occurred via “one group” issues in log-rank
       test (survdiff or survdiff.fit function in R package “survival”) in
       the Kaplan–Meier estimate;
     * If 0.78% had unknown reasons (so those 159 genes were excluded in
       our workflow).

   These technical problems could not be detected prior to program
   running. They might have been due to skewed distribution of the
   expression value or even random error derived during the RNA sequencing
   procedure.

3.5. Limitations of the Study

   The validation rate in the current study was 3 out of 20 (15%). The
   TCGA HNSCC and [182]GSE65858 cohorts have similar demographic features.
   However, the percentages of females were 26.9% and 17.0%. Moreover, the
   percentages of smokers and tobacco they had were 76.9% and 82.2%, and
   45.8 and 28.3 pack-years, respectively. In total, 60.3% of females were
   smokers in the TCGA HNSCC; 67.3% of females were smokers in the
   [183]GSE65858. The results of the current study also show that higher
   tobacco exposure is an independent risk factor of HNSCC patient
   survival (HR 1.364 (95% CI: 1.008–1.844, p = 0.044)). This implies that
   even though gender is not a prognostic factor, the heavy female smoker
   might contribute some genetic alterations to HNSCC.

   American white people were 85.6% of the TCGA HNSCC cohort, and German
   people were 90% of the [184]GSE65858 cohort. Grunwald and his
   colleagues [[185]78] found that oropharyngeal cancer is more common in
   North America (51.2%) than Northern Europe (32.4% Germany). However, we
   found only nine patients with cancer of the oropharynx in the TCGA
   HNSCC. Of oropharyngeal SCC patients (102 samples, 37.8%), 52.2% were
   positive for HPV in [186]GSE65858. Thus, the oropharyngeal SCC was far
   less prevalent (1.7%) in the TCGA cohort than in the [187]GSE65858
   cohort (37.8%). The oropharyngeal SCC should have a different genetic
   background than non-oropharyngeal HNSCC [[188]79]. This could be
   another reason for the poor consensus across different HNSCC datasets.
   Dataset selection and proper demographic matching are important in
   biomarker discovery.

   Our approach was to build a regression model to predict a patient’s
   survival by the TCGA’s gene expression. The model was validated by
   [189]GSE65858 cohort. Thus, we performed made a head-to-head comparison
   of Cox’s hazard ratios (5404 genes) from TCGA HNSCC and [190]GSE65858
   datasets ([191]Figure 6a). It showed a poor [[192]80] Pearson’s
   correlation coefficient [[193]81] (r = 0.27). [194]Figure 6b shows that
   20 genes, including three biomarker candidates—CAMK2N1, CALML5, and
   FCGBP—have a moderate [[195]80] effect size (HR) correlation between
   the two cohorts (Pearson’s r = 0.68). Overfitting of Cox’s regression
   model could achieve only a moderate correlation. In such a case, an
   overfitted model will only perfectly match every single gene in the
   TCGA, and has higher variability when predicting what it never saw—the
   [196]GSE65858 dataset. Those, as mentioned earlier, might be the
   reasons why the validation rate was so low. Overfitting will limit the
   usefulness of the model in its generalization. It might be controlled
   well by using cross-validation.

Figure 6.

   [197]Figure 6
   [198]Open in a new tab

   A head-to-head comparison of Cox’s hazard ratios from TCGA HNSCC and
   [199]GSE65858 datasets. TCGA HNSCC and [200]GSE65858 cohorts were
   applied for identification and validation of the candidate biomarkers
   in HNSCC. (a) A total of 5404 genes had Cox’s hazard ratios from TCGA
   HNSCC and [201]GSE65858 (Pearson’s correlation coefficient [[202]81], r
   = 0.27). CAMK2N1, CALML5, FCGBP, and 17 other genes (marked in black)
   had hazard ratios (HRs) >1.8 or <0.6. Red spots: HRs > 1.0 in TCGA
   HNSCC. Green spots: HRs > 1.0 in TCGA HNSCC. Sizes of spots: bigger for
   Kaplan–Meier p values in TCGA HNSCC. (b) The 20 genes were extracted
   and shown. The hazard ratios of those genes have a moderate correlation
   between the two cohorts (Pearson’s r = 0.68). (X-axis: Hazard ratios of
   Cox proportional hazard regression model from TCGA HNSCC; y-axis: Those
   values from [203]GSE65858; TCGA: the Cancer Genome Atlas; HNSCC: head
   and neck squamous cell carcinoma).

   Moreover, the head-to-head comparison of Kaplan–Meier p values (5404
   and 20 genes) from those two cohorts is also shown in
   [204]Supplementary Figure S2a,b. It reveals poor Pearson’s correlations
   (r = 0.01 and r = 0.19). Thus, the significance of CAMK2N1, CALML5, and
   FCGBP was still sufficient compared to that of the other 17 genes in
   the [205]GSE65858 study. There were 270 participants whose data were
   collected in [206]GSE65858 for our validation study. Again, the effect
   size (i.e., hazard ratio) can be large but fail to gain statistical
   significance if the sample size is not large enough. A high-quality
   HNSCC dataset (with protein-coding gene expression and survival data, n
   > 500) is not easy to find in the GEO database. The three biomarker
   candidates were discovered by TCGA HNSCC and then validated by the
   [207]GSE65858 dataset. More data are required for further confirmation.

3.6. Future Directions in Translational Medicine

3.6.1. Proteomics Validation

   Although we combined the power of genome-wide scanning and an optimal
   cutoff finder for survival analysis, the study has some limitations. We
   are aware of the importance of direct assessment of protein products
   comprising the basic functional units in cancer cells’ biological
   processes. The announcement of the Cancer Proteome Atlas (TCPA,
   [208]http://tcpaportal.org, accessed on 10 March 2021) excites the
   cancer research community [[209]82,[210]83]. Through the utility of the
   reverse-phase protein arrays (RPPAs) or reverse-phase protein lysate
   microarray (RPMA), microarray “Western blots” in the TCPA could help to
   test our hypotheses from RNA-Seq studies. However, in the TCPA database
   (v3.0 [[211]84]), there are only 237 antibodies available, not covering
   our candidates so far.

3.6.2. Laboratory Validation

   We encourage multidisciplinary studies that use complementary
   computational and experimental approaches to address challenging cancer
   research. Such in vitro and in vivo validation experiments will be
   undertaken in our laboratory. We plan to analyze the mRNA (e.g.,
   qRT-PCR) and protein (e.g., Western blot) of HNSCC cell lysate to
   confirm the candidate genes’ expression. The effects of overexpression
   and knockdown of the genes by lentiviral clones should be observed in
   cell function assays (e.g., proliferation, migration, and invasion) and
   mouse xenograft models (e.g., tumor growth).

   Moreover, this bioinformatics paper provides targets and supports the
   community’s rationale for looking into these HNSCC candidates via in
   vitro and in vivo validation. We aim to promote a reproducible
   bioinformatics [[212]85,[213]86] method allowing successful repetition
   and extension of analyses based on the TCGA or other in-house HNSCC
   datasets. Good research reproducibility practice is necessary to allow
   the reuse of code and results for new projects. It may turn out to be a
   time-saver in the longer run. When multiple scientists can reproduce a
   result, it will also validate our initial results and readiness to
   progress to the next research phase. Once our laboratory or the
   community confirms those candidates as targets, compound screening
   [[214]87,[215]88,[216]89] could facilitate more personalized therapy
   for HNSCC patients.

3.6.3. Cancer Type-Agnostic Study

   Our strategy still has the strength to explore more possible biomarkers
   from RNA-Seq datasets in cancer research. In our previous work, altered
   glucose metabolism—the Warburg effect [[217]90]—promoted the
   progression of HNSCC, which is partially attributed to the solute
   carrier family 2 member A4 (SLC2A4, or glucose transporter 4, GLUT4)
   and tripartite motif-containing 24 (TRIM24) pathway [[218]91,[219]92].
   Lactic acidosis-induced GLUT4 overexpression was also found in lung
   cancer cells [[220]93]. Currently, pembrolizumab and nivolumab’s
   success has been based on a common biomarker (e.g., PD-1) in several
   types of cancer. It is a model of tumor type-agnostic therapy
   [[221]31]. There are several common biomarkers of immune-checkpoint
   inhibitor (ICI) under evaluation, including tumor-infiltrating
   lymphocytes (TIL), interferon gamma (IFN-
   [MATH: <mrow><mi>γ</mi></mrow> :MATH]
   ), and tumor mutational burden (TMB) [[222]23]. The other ICI,
   anti-LAG-3 (pelatlimab), is currently being evaluated in phase I/IIA
   [[223]50] (ClinicalTrials.gov Identifier: [224]NCT01968109) and II-IVA
   [[225]94] ([226]NCT04080804) studies.

   In line with tumor-agnostic research, we plan to explore common
   biomarkers crossing TCGA diseases. However, the GDC provided
   standardized data frames that could not directly fit our workflow’s
   scope. Before the global gene scanning process, it is necessary to
   re-format, transpose, and merge the 528 patients’ clinical datasets and
   correlate 20,500 expressions of bio-specimens. This process should be
   carefully curated to confirm the data integrity within the correct
   definition [[227]95]. We also plan to upgrade our R script for the
   cutoff engine to C++ and source it in the Rstudio server. The high
   performance of C++ could speed up the critical steps in this workflow
   involving heavy computation of matrix data [[228]96]. Moreover, it will
   be possible to introduce the Rstudio Shiny app
   ([229]https://shiny.rstudio.com, accessed on 10 March 2021) as a
   web-based tool (named “pvalueTex”) packaged with our workflow in the
   future.

3.6.4. Holistic Cancer Care

   There are 81 physical, pathological, and social conditions derived from
   participants available for survival modeling in the TCGA, such as age,
   gender, residual tumor, vital status, days-to-last-followup, cancer
   stage, smoking duration, exposure to alcohol, asbestos, and radioactive
   radon. However, the TCGA did not collect other features related to
   holistic care. When going for holistic cancer care [[230]97,[231]98],
   spiritual and emotional conditions are equally essential, alongside
   physical and social status ([232]Figure 7). Psychosocial stress is
   associated with cancer incidence [[233]98,[234]99,[235]100], metastasis
   [[236]99,[237]101,[238]102,[239]103], and poor survival [[240]104].
   These impacts might be mediated through the
   hypothalamic–pituitary–adrenal (HPA) axis [[241]105]. Holistic
   healthcare providers engage patients with eye contact for mind-to-mind
   connection. Their empathy, sympathy, and compassion are induced by the
   suffering of patients from those diseases. They try to treat patients
   by prescribing medicine (and “themselves”) or performing surgery. Thus,
   the healing resilience of patients should be induced by unconditional
   positive regard. The patients trust those who take care of them and
   have the confidence to increase the capacity to recover from diseases
   through a mind–brain–body connection manner ([242]Figure 7).

Figure 7.

   [243]Figure 7
   [244]Open in a new tab

   The concept of holistic care for HNSCC patients. Beyond carcinogenesis:
   In the mind–brain–body axis, a stressful environment (giant black
   arrow) will trigger an emotional response. The subconscious mind
   (brain) releases stress hormones and inflammation signals in response
   to negative emotions. The body’s internal environment (cells) alters
   epigenetic control in gene regulation and mRNA expression. Over a long
   time, the tissue/cells will be transformed into dysplasia and then
   malignancy (e.g., HNSCC) with help from known carcinogens. Cancer care:
   Holistic care should take care of cancer patients’ spiritual,
   emotional, physical, and socioeconomic needs. Physical care will be
   carried out by medication therapy or surgery. After establishing a
   therapeutic relationship (TR), the physicians’ spiritual properties
   (empathy, sympathy, and compassion) will engage cancer patients and
   recover their self-compassion to gain resilience against the disease
   through their mind–brain–body axis. Thus, we suggest that electric
   healthcare records (EHR) should include physical, pathological, and
   psychological data, and even more spiritual information. The TCGA might
   collect those “holistic features” (green dashed line) for further study
   of personalized medicine.

4. Materials and Methods

4.1. Patient Cohort

   A large-scale cancer database, aggregating many independent features,
   is necessary to facilitate the biomarker discovery. The Cancer Genome
   Atlas (TCGA) project [[245]106] has been developed since 2005 and
   supervised by the National Cancer Institute’s (NCI) Center for Cancer
   Genomics and the National Human Genome Research Institute (NHGRI),
   funded by the US government. TCGA represents comprehensive genomics and
   clinic data from 84,392 patients among 33 major cancer types (data
   release 27.0—29 October 2020, available at
   [246]https://www.cancer.gov/about-nci/organization/ccg/research/structu
   ral-genomics/tcga/studied-cancers, accessed on 10 March 2021). TCGA and
   the genome data analysis center (GDAC) generated and analyzed DNA
   (mutations, copy number variations, methylation sites, simple
   nucleotide polymorphisms), RNA (microarray, RNA-Seq, microRNA), and
   protein (reverse protein phased array) data derived from biospecimens.
   Sample types available at TCGA are primary solid tumors, recurrent
   solid tumors, blood-derived normal and tumor, metastatic, and solid
   normal tissue.

   The NCI’s Genomic Data Commons (GDC, available at
   [247]https://portal.gdc.cancer.gov, accessed on 10 March 2021)
   receives, processes, and distributes genomic, clinical, and biospecimen
   data from the TCGA database and other cancer research programs. The
   clinical features have been defined by TCGA GDC data dictionary: Common
   Data Element (CDE) [[248]107]. The RNA-Seq expression data have been
   harmonized and re-aligned against an official reference genome build
   (Genome Reference Consortium Homo sapiens genome assembly 38, GRCh38).
   TCGA, GDC, and some research communities offer several computational
   tools to the public for facilitating cancer research. GDC Data Portal
   has the official web-based TCGA data analysis tools. Other available
   web-based tools have been reviewed by Zhang et al. [[249]108] and
   Matthieu Foll (availalbe at
   [250]https://github.com/IARCbioinfo/awesome-TCGA, accessed on 10 March
   2021). One of the GDACs, the Broad TCGA Data and Analyses (Broad GDAC),
   provides Firehose, a repository of the TCGA public-accessible Level 3
   data and Level 4 analyses. Broad GDAC Firehose is an analytical
   infrastructure that analyses algorithms not performed by the GDC (e.g.,
   GISTIC, MutSig2CV, correlation with clinical variables, mRNA
   clustering). A web-based version of Broad GDAC Firehose is Firebrowse
   (available at firebrowse.org, Version: 1.1.40, 13 October 2019). Broad
   GDAC Firebrowse provides graphical tools such as viewGene to explore
   expression levels and iCoMut to explore a mutation analysis of each
   TCGA disease.

   GDC’s application programmable interface (API) uses the
   Representational State Transfer (REST) architecture and provides
   accessibility to external users for programmatic access to the same
   functionality found through GDC Portals. Those functions include
   searching, viewing, submitting, and downloading subsets of data files,
   metadata, and annotations based on specific parameters. Moreover, if
   restricted data are requested, the user must provide a token along with
   the API call. This token can be downloaded directly from the GDC
   Portals. Broad GDAC Firebrowse RESTful API can be accessed using an R
   package, FirebrowseR (available at
   [251]https://github.com/mariodeng/FirebrowseR, accessed on 10 March
   2021) [[252]109].

   GDC is available at
   [253]https://portal.gdc.cancer.gov/projects/TCGA-HNSC, accessed on 10
   March 2021. TCGA offers several computational tools to the public that
   facilitat cancer research. Broad genome data analysis center (GDAC)
   Firebrowse (firebrowse.org, version 1.1.35, 27 September 2016) is one
   of those tools to provide data access for each TCGA disease through a
   Representational State Transfer (REST) application programmable
   interface (API). The 528 TCGA HNSCC patients’ clinical information and
   RNA-Seq data were obtained from the Firebrowse RESTful API with an R
   package, FirebrowseR (available at
   [254]https://github.com/mariodeng/FirebrowseR, accessed on 10 March
   2021) [[255]109]. We utilized FirebrowseR with our analysis workflow
   ([256]Figure 1, green square) to receive standardized data frames while
   avoiding data re-formatting, often causing some errors. [257]GSE65858
   is a dataset we used for candidate selection in our workflow. After
   removing missing data, there were 270 participants whose data were
   collected for our validation study. Initially, there were 288 HNSCC
   participants involved in their prospective study [[258]51]. At the
   University Hospital Leipzig, Germany, these patients were diagnosed as
   having oral, oropharyngeal, hypopharyngeal, or laryngeal squamous cell
   carcinomas (SCCs). Patients were excluded if they had a prior history
   of cancer other than HNSCC within the last five years. The 49 (17.0%)
   females and 239 (83.0%) males had a median age of 58 years old. A total
   of 82.2% were smokers who consumed 28.3 pack-years of cigarettes. In
   total, 88.5% of participants used alcoholic beverages; 84.9% with oral
   SCC were HPV-negative; 52.2% with oropharyngeal SCC were positive for
   HPV. The cancer stage distribution among this cohort was 19.0% early
   stages (I/II) and 81.3% late stages (III/IV).

   Regarding the TCGA database, 528 HNSCC participants from several
   centers were used in the prospective studies [[259]110]. The 142
   (26.9%) females and 386 (73.1%) males had a median age of 61 years old.
   A total of 97.5% were smokers who consumed 45.8 pack-years of
   cigarettes. In total, 67.6% of participants used alcoholic beverages;
   82.1% of participants with oral SCC were HPV-negative. The cancer stage
   distribution among this cohort was 104 (20.7%) early stages (I/II) and
   424 (79.3%) late stages (III/IV).

4.1.1. RNA Sequencing Data

   The number of protein-coding genes was suggested to be 20,500
   [[260]111]. The GDC Data Portal-provided TCGA data were harmonized with
   re-aligned RNA sequencing data against an official reference genome
   build (Genome Reference Consortium Homo sapiens genome assembly 38,
   GRCh38). RNA-Seq expression level read counts produced by Illumina
   HiSeq were normalized using the Fragments per kilobase per million
   reads mapped (FPKM) method, as described in [[261]112]. The RNA-Seq
   preprocessor of Broad GDAC picked the RNA-Seq by
   Expectation-Maximization (RSEM) value from Illumina HiSeq/GA2 messenger
   RNA-Seq level_3 (v2) dataset of NCI GDC. It made the messenger RNA-Seq
   matrix with log2 transformed for the downstream analysis, as described
   in their paper [[262]113]. We utilized FirebrowseR’s function call,
   Samples.mRNASeq(cohort = “HNSC,” gene = GeneName, format = “csv”), to
   download the RNA-Seq dataset of every HNSCC patient and to save 20,499
   data frame files, named “HNSCC.mRNA.Exp.[GeneName].Fire.Rda.” After
   careful investigation of the genomics dataset, the RNA-Seq values of
   “solute carrier family 35 member E2A (SLC35E2A)” and “solute carrier
   family 35 member E2B (SLC35E2B)” were considered two distinct
   expression entities. We concluded that the number of protein-coding
   genes in the TCGA dataset is 20,500. We removed null expressed genes,
   over 30% of the cohort, to avoid useless results.

4.1.2. Clinical Data

   We utilized FirebrowseR’s function call, Samples.Clinical(cohort =
   “HNSC,” format = “csv”), to get all 81 clinical features (including
   pathological data, defined by TCGA GDC data dictionary: Common Data
   Element (CDE) [[263]107]) of all 528 HNSCC patients, which were saved
   as one data frame file: “HNSCC.clinical.Fire.Rda” (accessed November
   2019).

   “HNSCC.clinical.Fire.Rda” tables each have 20,500
   “HNSCC.mRNA.Exp.[GeneName]. Fire.Rda” tables were transposed and merged
   by their _participant_barcode (unique patient identification, ID) to
   yield a data frame with 528 rows (participants) against 20,581 columns
   (81 clinical features and 20,500 protein-coding RNA-Seq of cancer
   specimens). The clinicopathological features selected for our workflow
   included gender, age, clinical tumor size, clinical cervical lymph node
   metastases, clinical distant metastasis assessment, pathological
   surgical margin, and tobacco exposure with their corresponding survival
   data. The tumor size (T), cervical lymph node metastases (N), and
   distal metastasis status (M) were classified according to the American
   Joint Committee on Cancer (AJCC) [[264]62] along with he Union for
   International Cancer Control (UICC) [[265]114] TNM system for clinical
   staging of HNSCC. We made data clean by removing duplicated rows and
   columns.

4.2. Cutoff Finder Core Engine

   To evaluate the effect of gene expression on patient survival, we used
   sliding-window cutoff selection by stratifying patients with
   Kaplan–Meier survival analysis according to each gene’s low/high
   expression. Our cutofFinder_func subroutine employs the minimum p value
   approach to recognizing cutoff points in continuous gene expression
   measurement for patient sub-populations. First, patients were ordered
   by RNA-Seq values (RSEM) of a given gene. Next, patients were
   stratified at a serial cut (counted people ranked between the 30th and
   70th percentiles of the cohort; [266]Figure 1 cutoff engine). The
   survival risk differences of the two groups were estimated by log-rank
   test to yield around 165 Kaplan–Meier p values for each gene. Then, the
   optimal cutoff of RNA-Seq giving the minimum p value was selected by
   the cutofFinder_func subroutine. This iteration method could calculate
   all possible cutoffs of each gene’s expression in this cohort. After
   each run of the cutofFinder_func function call for an individual gene,
   it returned an optimal cutoff for specific patient groups (e.g., low
   expression in 262 persons versus high expression in 152 persons with
   calcium/calmodulin dependent protein kinase II inhibitor 1; [267]Figure
   5). The cutoff would be returned to the main program to allow
   downstream Cox survival analysis. The percentile range we applied, 30%
   to 70%, was used to avoid a small grouping effect [[268]47,[269]115].
   In case there was no significant p value, a median expression of this
   gene was set as its cutpoint as usual. The false discovery rate (FDR)
   (<
   [MATH: <mrow><mrow><mn>0.05</mn></mrow></mrow> :MATH]
   ) correction [[270]116] shows which genes should be retained for
   subsequent univariate and multivariate analysis. It ensures the control
   of type I error of multiple tested p values during our cutoff finding
   procedure. Then Bonferroni adjustment of that p values was used for
   candidate selection.

4.3. Statistical Consideration for Survival Analysis

   Our workflow has loops to call the function
   survival_marginSFP(GeneName) with the given GeneName to process the
   survival analysis gene by gene. We dichotomized the clinicopathological
   features, which included gender (male/ female), age at diagnosis
   (below/beyond 65 years-old), clinical tumor size (T1-2/T3-4), clinical
   nodal status (negative/positive), clinical distant metastasis
   (negative/positive), TNM staging (early/late), surgical margin status
   (negative/positive), and tobacco exposure (low/high). The patients were
   grouped by an RNA-Seq value of each gene—low or high-expression
   according an optimal p value obtained from the cutofFinder_func
   subroutine (see the section of “Cutoff Finder Core Engine”). Pearson’s
   chi-square test was used for these binary variables. Kaplan–Meier
   survival was analyzed using the log-rank test for two groups OS
   comparison.

   The Cox proportional-hazards regression model [[271]117,[272]118] is
   commonly used for modeling survival data. It allows analyzing survival
   for one or more variables and provides the effect sizes (coefficients,
   i.e., hazard ratios) for them [[273]119]. The Cox model also accounts
   for confounding factors [[274]120]. It is expressed by the hazard
   function denoted by h(t). The hazard function represents the risk of a
   specific event (e.g., death) at time t. It can be estimated as follows:
   [MATH:
   <mrow><mrow><mi>h</mi><mrow><mo>(</mo><mi>t</mi><mo>)</mo></mrow><mo>=<
   /mo><msub><mi>h</mi><mn>0</mn></msub><mrow><mo>(</mo><mi>t</mi><mo>)</m
   o></mrow><mo>×</mo><mi>e</mi><mi>x</mi><mi>p</mi><mrow><mo>(</mo><msub>
   <mi>β</mi><mn>1</mn></msub><msub><mi>X</mi><mn>1</mn></msub><mo>+</mo><
   msub><mi>β</mi><mn>2</mn></msub><msub><mi>X</mi><mn>2</mn></msub><mo>+<
   /mo><msub><mi>β</mi><mn>3</mn></msub><msub><mi>X</mi><mn>3</mn></msub><
   mo>+</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>+</mo><msub><mi>β</mi><mi>n<
   /mi></msub><msub><mi>X</mi><mi>n</mi></msub><mo>)</mo></mrow></mrow></m
   row> :MATH]

   where
     * t represents the survival time;
     *
       [MATH:
       <mrow><mrow><mi>h</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow></mrow>
       :MATH]
       is the hazard function determined by a set of n covariates (
       [MATH:
       <mrow><mrow><msub><mi>X</mi><mn>1</mn></msub><mo>.</mo><mo>.</mo><m
       o>.</mo><msub><mi>X</mi><mi>n</mi></msub></mrow></mrow> :MATH]
       )—e.g., clinicopathological features, including age, gender, gene
       expression, cancer stage (tumor size, nodal metastases, distant
       metastases), surgical margin, smoking, and alcohol; unfortunately,
       spiritual, emotional, and social status are not available in TCGA
       database;
     * The coefficients (
       [MATH:
       <mrow><mrow><msub><mi>β</mi><mn>1</mn></msub><mo>.</mo><mo>.</mo><m
       o>.</mo><msub><mi>β</mi><mi>n</mi></msub></mrow></mrow> :MATH]
       ) measure the impacts—the effect sizes of covariates;
     * The term
       [MATH: <mrow><msub><mi>h</mi><mn>0</mn></msub></mrow> :MATH]
       is called the baseline hazard. It corresponds to the hazard value
       if all the
       [MATH: <mrow><msub><mi>X</mi><mi>i</mi></msub></mrow> :MATH]
       are equal to zero. The “t” in
       [MATH:
       <mrow><mrow><mi>h</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow></mrow>
       :MATH]
       indicates the hazard may vary over time.

   Thus, the biomarker discovery strategy is survival modeling through a
   collection of
   [MATH:
   <mrow><mrow><msub><mi>X</mi><mn>1</mn></msub><mo>.</mo><mo>.</mo><mo>.<
   /mo><msub><mi>X</mi><mi>n</mi></msub></mrow></mrow> :MATH]
   features from cancer datasets.

   A univariate Cox proportional regression model, using the “coxph”
   function in R package “survival,” has been applied to calculate hazard
   ratios, 95% confidence interval (95% CI), and significance, and to
   estimate the independent contribution of each clinicopathological
   feature and gene expression level to the overall survival.

   In a multivariate test, those covariates used include the
   clinicopathological features (gender, age at diagnosis separated by
   being 65 years old or not, clinical tumor size (T1 or T2/T3 or T4),
   clinical nodal status (N0/N+), clinical distant metastasis (M0/M1), TNM
   staging (stage 1 or 2/stage 3 or 4), surgical margin status
   (negative/positive), and tobacco exposure (low/high)); and gene
   expression levels (low/high) defined by cutoffs. All covariates were
   pooled in the hazard function h(t) to estimate their impact on the
   overall survival.

   Results were considered statistically significant when a two-sided p
   value was less than 0.05, or a lower threshold if indicated. The false
   discovery rate (FDR) (<
   [MATH: <mrow><mrow><mn>0.05</mn></mrow></mrow> :MATH]
   ) could be used to pick up the optimal p value to ensure the control
   for type I error of the Kaplan–Meier survival test during the cutoff
   finding procedure. There were also multiple correlated tests of null
   hypotheses during our global scanning of 20,500 protein-coding genes.
   The stringent Bonferroni correction could result in an adjusted p value
   to ensure the control for type I error.

   The resulting data, including Kaplan–Meier curves, cumulative p value
   plots, and Cox regression tables, were exported to “.xlsx” and “.Rda"
   files (by R package “r2excel”) for subsequent biomarker selection.

4.4. Biomarker Selection and Validation

   Those genes with prognostic impacts, whose hazard ratios were
   [MATH: <mrow><mrow><mo>></mo><mo>=</mo><mn>1.8</mn></mrow></mrow>
   :MATH]
   or
   [MATH: <mrow><mrow><mo><</mo><mo>=</mo><mn>0.6</mn></mrow></mrow>
   :MATH]
   in both Cox models (univariate and multivariate), were assigned as
   provisional candidates. Bonferroni-adjusted (Kaplan–Meier) p values
   were used to rank candidates for the decision ([275]Figure 1, candidate
   selection).

   [276]GSE65858 [[277]51] is a dataset we used for helping with candidate
   selection in our workflow. The Gene Expression Omnibus (GEO) database
   [[278]121], founded by National Center for Biotechnology Information
   (NCBI), is a public repository supporting MIAME-compliant data,
   including microarray and sequence-based experiments. The GEOquery R
   package [[279]122] was used to download the RNA-Seq dataset (in SOFT or
   MINiML format) of a HNSCC cohort, [280]GSE65858, from the GEO database
   (available at
   [281]https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE65858, accessed on
   10 March 2021). [282]GSE65858 has OS, RFS, and survival time. There
   were 270 HNSCC participants involved in this cohort. The expression
   data were generated using the platform [283]GPL10558 (Illumina
   HumanHT-12 v4.0 Expression BeadChip), which targets more than 30,330
   annotated genes (47,000 probes, derived from the NCBI Reference
   Sequence, release 38 on 7 November 2009). We have performed
   Kaplan–Meier (with FDR-correction of p value) and Cox survival analyses
   with gene expression cutoffs at their median values. The biomarker
   candidates were a consensus result of TCGA and [284]GSE65858 analyses.

5. Conclusions

   Our findings suggested three biomarker candidates—CAMK2N1, CALML5, and
   FCGBP—which are all heavily associated with the prognosis of OS under
   an optimal cutoff with stringent Bonferroni p values and proper effect
   size (HR).

   The microenvironment of HNSCC, influenced by the mind–brain–body axis,
   requires further exploration and understanding using holistic
   multi-parametric approaches. Since mindfulness meditation will be
   helpful in cancer healthcare, we continually educate our cancer
   patients that they should confess for not taking care of their bodies
   and spirits in the past, and give sincere thanks for their physical
   body’s hard work.

Acknowledgments