Graphical abstract

   graphic file with name fx1.jpg
   [53]Open in a new tab

Highlights

     * •
       An integrative omics study of early-onset colorectal cancer in a
       Chinese cohort
     * •
       Omics architecture supports early-onset tumorigenesis and
       therapeutic development
     * •
       Mutational landscape varies by ancestry, and LMTK3 mutation has
       Chinese specificity
     * •
       LMTK3 works at DNA repair process, immunophenotype, and drug target
     __________________________________________________________________

   The molecular characterization of early-onset colorectal cancer remains
   undetermined. We report a multi-omics landscape of young colorectal
   tumors in a Chinese cohort. These profiles reveal distinct molecular
   alterations and immunophenotypes in early-onset tumorigenesis and
   provide an ancestry-specific LMTK3 as a modulator, biomarker, and
   therapeutic target for precision oncology.

Introduction

   Colorectal cancer is estimated to be the third most common cancer and
   the second leading cause of cancer death worldwide.[54]^1 Although the
   incidence has generally declined in high-income countries, mainly due
   to cancer screenings,[55]^2 the increasing rate of patients with
   early-onset colorectal cancer (EOCRC), who are diagnosed before 50
   years of age, is of great concern, especially in China.[56]^3^,[57]^4
   Currently, the reasons for this widespread increase in EOCRC remain
   elusive, but accumulating studies have revealed several salient
   characteristics, such as greater consumption of a Western-style diet,
   obesity, and physical inactivity as well as molecular alterations in
   driver genes of APC, TP53, and KRAS,[58]^5^,[59]^6^,[60]^7 indicating
   that these established risk factors for late-onset colorectal cancer
   (LOCRC) also play important roles in EOCRC.

   However, notably, compared with patients with LOCRC, patients with
   EOCRC are more likely to be male, have less environmental exposure
   accumulation, have a history of inflammatory bowel disease or a family
   history of colorectal cancer, and have left-sided and advanced-stage
   tumors with genetic and molecular heterogeneity.[61]^5^,[62]^6^,[63]^8
   A recent report described a higher tumor mutation burden (TMB) in
   African American patients than in White patients as well as different
   mutation frequencies in driver genes between diverse populations of
   patients with EOCRC and LOCRC.[64]^9^,[65]^10 These findings supported
   a differential mutational landscape across both different ages of onset
   and diverse ancestries. In addition, to address the extensive molecular
   complexity of this disease, previous studies of colorectal cancer
   cohorts of The Cancer Genome Atlas (TCGA) and the Clinical Proteomic
   Tumor Analysis Consortium have described multi-omics features related
   to colorectal tumorigenesis, progression, and therapeutic
   development.[66]^11^,[67]^12 However, these studies focused more on
   overall colorectal cancer in non-Asian populations, generating a
   knowledge gap in age-of-onset-specific etiological processes of
   colorectal tumorigenesis.

   In this molecular characterization study, we profiled the genome,
   epigenome, transcriptome, and proteome of EOCRC tumors based on a
   Chinese colorectal cancer cohort. Identifying the distinct multi-omics
   architecture of these EOCRC tumors compared with LOCRC tumors would
   improve the understanding of age-of-onset disparities in colorectal
   cancer etiology and provide valuable insight into the development of
   precision therapies.

Results

Overview of population features and multi-omics profiling

   The demographic characteristics of patients with colorectal cancer are
   summarized in [68]Table S1. A total of 79 patients with colorectal
   cancer were enrolled from the Nanjing Colorectal Cancer (NJCRC) cohort
   ([69]Figure 1A), namely, 30 (37.97%) patients with EOCRC and 49
   (62.03%) patients with LOCRC. All patients provided tumor tissues and
   normal adjacent tissues (NATs) for multi-omics profiling
   ([70]Figure 1B). Principal-component analysis clearly separated tumors
   and NATs among the epigenomic, transcriptomic, and proteomic spectra
   ([71]Figure S1A). For age-related features, we observed a strong
   positive correlation between chronological age at diagnosis and DNA
   methylation age (DNAmAge) in tissues (r[Tumor] = 0.566, p[Tumor] =
   7.64e−5; r[NAT] = 0.910, p[NAT] < 2.2e−16; [72]Figure S1B).
   Furthermore, the pathway enrichment analysis for the transcriptomic
   comparison between EOCRC and LOCRC showed that the top 10 significant
   Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways in either
   tumors or NATs included pathways related to aging-associated diseases
   ([73]Figure S1C). These findings highlight the ability to identify
   EOCRC-specific molecular features at multi-omics layers.

Figure 1.

   [74]Figure 1
   [75]Open in a new tab

   Overview of population features and multi-omics profiling

   (A) Multi-omics profiling for designedly recruited NJCRC patients. The
   histogram plot indicates the age distribution of candidate patients
   with colorectal cancer for omics analyses; the density plot presents
   the age distribution of overall patients with colorectal cancer in
   NJCRC. For the definition of age-of-onset-subgroup-specific molecules,
   omics molecules located in the yellow pie plot with dashed lines are
   specific to EOCRC. BioRender supports the icon.

   (B) Illustration of 79 paired colorectal cancer samples used in the
   individual omics measurements. The genome, transcriptome, epigenome
   (DNA methylation), and proteome are colored blue/green (WGS/WES),
   orange, yellow, and gray, respectively. For each omics dataset, the top
   panel represents tumor tissues, and the bottom panel represents NATs
   from the same donor.

Genomic alterations in EOCRC tumors

   There were a total of 19,251 somatic protein altering events, namely
   17,442 somatic single-nucleotide variants (SNVs) and 1,809 somatic
   insertions or deletions (indels), in 9,815 genes ([76]Table S2A)
   according to a combination of whole-genome sequencing (WGS) and
   whole-exome sequencing (WES). At the gene level, we identified 11
   significantly mutated genes (SMGs) in NJCRC tumors ([77]Figure 2A;
   [78]Table S2B), including 7 well-defined driver genes of colorectal
   cancer[79]^13 (i.e., TP53 [65.8%], APC [63.3%], KRAS [39.2%], FBXW7
   [11.4%], RNF43 [10.1%], TCF7L2 [10.1%], and ACVR2A [6.3%];
   [80]Figures 2A, [81]S2A, and S2B; [82]Table S2B). Of note, NJCRC tumors
   showed significantly lower mutation frequencies of APC and FBXW7 than
   those from the TCGA European cohort (APC: 63.3% versus 90.7%, p =
   1.68e−7; FBXW7: 11.4% versus 35.8%, p = 2.49e−5; [83]Figure S2B) and
   marginally lower mutation frequencies of TP53 and KRAS (TP53: 65.8%
   versus 77%, p = 0.054; KRAS: 39.2% versus 50.4%, p = 0.091). In
   addition, BRAF and PTEN were found to have extremely low mutation
   frequencies in NJCRC tumors (BRAF: 3.8% and PTEN: 5.1%; [84]Table S2A).
   These results were consistent with a previous ancestry comparison
   between Asian and European subjects.[85]^14 In addition, although no
   additional SMGs were identified separately in EOCRC and LOCRC tumors
   ([86]Figure 2B; [87]Table S2B), three well-known driver genes (APC,
   TP53, and KRAS) were shared across subgroups.[88]^5^,[89]^6 The
   statistical power for detecting SMGs across each group was more than
   80% ([90]Table S2B).

Figure 2.

   [91]Figure 2
   [92]Open in a new tab

   Genomic alteration in EOCRC tumors from NJCRC

   (A) Mutation spectrum of 11 SMGs.

   (B) Venn diagrams presenting the overlap of SMGs among groups of all
   samples, EOCRC, and LOCRC.

   (C) Differential mutational features between age-of-onset subgroups of
   EOCRC and LOCRC, including TMB, proportions of six mutational subtypes,
   and mutational signatures by COSMIC. p value was calculated via
   Wilcoxon rank-sum test.

   (D and E) Somatic copy-number alterations in colorectal tumors. UpSetR
   plots show the overlap of somatic copy-number alterations (SCNAs) at
   the arm level (D) and the focal level (E, left) across groups of all
   samples, EOCRC, and LOCRC. The bar plot shows the mutation rate of
   EOCRC-specific SCNAs at the focal level (E, right). SCNA includes
   amplification (Amp) and deletion (Del).

   In terms of the global genomic alterations, we observed a significantly
   higher TMB in EOCRC tumors than in LOCRC tumors (median = 3.92/Mb
   versus 2.63/Mb, p = 0.031; [93]Figure 2C) but a non-significant
   difference in microsatellite instability (MSI) status or hypermutation
   status ([94]Table S2C). For mutational subtypes of transition and
   transversion in [95]Figure 2C and [96]Table S2C, EOCRC tumors
   frequently had more mutations of the T > G subtype (p = 2.74e−4), but
   fewer of the C > A (p = 0.029) and C > G subtypes (p = 0.040), which
   could possibly contribute to mutational signatures in the Catalog of
   Somatic Mutations in Cancer (COSMIC) with increased proportions of
   signatures 3 (associated with failure of DNA repair), 9 (attributed to
   polymerase η), 17 (unknown), and 28 (unknown) but decreased proportions
   of signature 1 (related to age of cancer diagnosis).

   Regarding somatic copy-number alterations (SCNAs), 11 amplifications
   and 7 deletions at the arm level were identified across 79 tumors
   ([97]Figures 2D and [98]S2C; [99]Table S2D), including previously
   reported arm amplifications of 1q, 7p and q, 8p and q, 13q, and 20p and
   q, as well as deletions of 17p and q and 18p and q.[100]^11 In
   age-of-onset subgroups, we observed distinct SCNAs between EOCRC and
   LOCRC tumors ([101]Figures 2D and [102]S2C; [103]Table S2D), including
   some arm-level SCNAs prone to appear in young tumors, such as two
   well-documented deletions (14q and 15q).[104]^15^,[105]^16 Similarly,
   focal SCNA events displayed differential patterns across each group
   ([106]Figures 2E, [107]S2D, and S2E; [108]Table S2E).

EOCRC-specific molecular events across epigenomic, transcriptomic, and
proteomic levels

   At a global view of DNA methylation, we found a higher but
   non-significant occurrence of the CpG island methylator phenotype
   (CIMP) in EOCRC tumors (66.7%) than in LOCRC tumors (47.4%;
   [109]Table S3), similar to previous studies.[110]^5 Furthermore, we
   noted a dramatic discordance of DNA methylation in EOCRC tumors that
   the correlation of DNAmAge and chronological age at diagnosis remained
   significant within LOCRC tissues (r[Tumor] = 0.579, p[Tumor] = 0.009;
   r[NAT] = 0.717, p[NAT] = 5.45e−4) and EOCRC NATs (r[NAT] = 0.837,
   p[NAT] = 3.52e−7) but diminished within EOCRC tumors (r[Tumor] =
   −0.124, p[Tumor] = 0.565; [111]Figure 3A). Further examination of DNA
   methylation patterns revealed evident and site-specific differentially
   methylated positions ([112]Figures S3A and S3B) and regions
   ([113]Figure S3C) derived from each group, and the following pathway
   enrichment analysis disclosed 52 significant KEGG pathways harboring
   six pathways specific to EOCRC that included two cancer pathways
   (colorectal cancer and endometrial cancer; [114]Figure S3D).

Figure 3.

   [115]Figure 3
   [116]Open in a new tab

   EOCRC-specific molecular events across epigenomic, transcriptomic, and
   proteomic levels

   (A) Correlations between chronological age at diagnosis of colorectal
   cancer and DNAmAge across tumors and NATs stratified by age-of-onset
   subgroups. Pearson’s correlation test was used to calculate correlation
   coefficients (r) and p values.

   (B) Differences in abundances of six immune cell types between EOCRC
   and LOCRC. p value was calculated via Wilcoxon rank-sum test.

   (C) Heatmap of TMB and TIME in four clusters at “hot,” “pre-hot,”
   “cold,” and “high TMB” immunophenotypes. K-means clustering by column
   is 4, accompanied by hierarchical clustering.

   (D) Differential distributions of four immunophenotypes in age-of-onset
   subgroups.

   (E) Pathway enrichment analysis for the age-related proteome. A total
   of 1,192 proteins were assigned to age-related proteins via
   differential expression analysis of the proteome between EOCRC and
   LOCRC tumors at p < 0.05.

   (F) Differential expression pattern of platinum drug resistance-related
   genes. All genes existed in both transcriptomic and proteomic
   comparisons between EOCRC and LOCRC tumors. ∗p < 0.05, ∗∗p < 0.01,
   ∗∗∗p < 0.001, calculated via t test; ns, not significant; FC, fold
   change in LOCRC versus EOCRC.

   The tumor immune microenvironment (TIME) decomposed from transcriptomic
   profiles showed significantly higher immune infiltration in EOCRC
   tumors than in LOCRC tumors, which was probably attributed to
   immunosenescence[117]^17 ([118]Figure 3B; [119]Table S3). Considering
   that both TMB and TIME have been explored as independent biomarkers of
   immunotherapy efficacy,[120]^18^,[121]^19 we performed hierarchical
   clustering to refine tumor subtypes and identified four distinct
   immunophenotypes, namely, “hot,” “pre-hot,” “cold,” and “high TMB”
   ([122]Figure 3C), as well as a dramatically high proportion of hot
   tumors in EOCRC (60%; [123]Figure 3D). Notably, these immunophenotypes
   could dramatically stratify 5-year survival in patients with EOCRC (p =
   0.003) as validated in the TCGA cohort ([124]Figures S3E–S3G). In
   addition, we observed a similar distribution of consensus molecular
   subtypes (CMS) subgroups at 17.7%, 40.5%, 19%, 21.5%, and 1.3% for
   CMS1, CMS2, CMS3, CMS4, and CMS1/CMS2, respectively, as reported
   previously,[125]^12 but no significant differences between EOCRC and
   LOCRC tumors were found ([126]Table S3), which differed from a recent
   study.[127]^20 Subsequently, among differentially expressed genes
   generated by comparing tumors versus NATs across each group
   ([128]Figure S3H), 130 genes were specifically assigned to EOCRC (47
   upregulated and 83 downregulated; [129]Figure S3I), and the
   corresponding enrichment analysis indicated six EOCRC-specific KEGG
   pathways ([130]Figure S3J), including fatty acid degradation (FAD),
   mismatch repair, and cell adhesion molecules (CAMs).

   Regarding the proteome, more than 100 differentially expressed proteins
   between tumors and NATs distinguished each group ([131]Figure S3K) and
   included four downregulated proteins with EOCRC specificity
   ([132]Figure S3L). Intriguingly, among the 23 significant KEGG pathways
   enriched in differential proteins ([133]Figure S3M), two EOCRC-specific
   KEGG pathways were consistent with those found in the transcriptomic
   comparison (CAMs and FAD; [134]Figure S3J), sharing two downregulated
   genes in EOCRC tumors (ADH1B in FAD, p[mRNA] = 1.96e−10 and
   p[Protein] = 6.67e−6; CADM3 in CAMs, p[mRNA] = 2.06e−13 and
   p[Protein] = 6.52e−6; [135]Figure S3N). In addition, we profiled the
   age-related proteome by comparing EOCRC and LOCRC tumors, and as shown
   in [136]Figure 3E, the longevity regulating pathway was enriched,
   similar to the transcriptomic findings ([137]Figure S1C). Notably, 13
   genes linked to a clinical management pathway of platinum drug
   resistance ([138]Figure 3E) displayed inconsistent expression patterns,
   and most genes were more highly expressed at the mRNA level but less
   expressed at the protein level in EOCRC tumors relative to LOCRC tumors
   ([139]Figure 3F; [140]Table S3).

Integrative omics architecture for EOCRC and clinical therapeutics

   Next, we evaluated cis- and trans-effects of SCNAs on both mRNA and
   protein abundances. For cis-effects of SCNAs on mRNA abundance, we
   observed strong positive correlations along the diagonal
   ([141]Figure S4A). Most diagonal signals corresponded to arm-level
   changes, consistent with previous studies (e.g., chromosomes 7, 13, and
   20; [142]Table S2D).[143]^11^,[144]^21 Additionally, we confirmed a
   previously reported trans-acting SCNA hotspot on chromosome 20q at the
   mRNA level ([145]Figure S4A). In contrast, both cis- and trans-effects
   of SCNAs on protein abundance were considerably sparse
   ([146]Figure S4A), probably due to the limited number of proteomic
   samples. Notably, each age-of-onset subgroup displayed similar
   correlation patterns, but cis-/trans-effects were apparently dampened
   in EOCRC tumors ([147]Figure 4A). Furthermore, in the annotation of the
   most amplified region of 20q13.12 specific to EOCRC ([148]Figure 2E),
   we found a correlation hotspot at the mRNA level that disappeared at
   the protein level ([149]Table S2F).

Figure 4.

   [150]Figure 4
   [151]Open in a new tab

   Integrative omics architecture for EOCRC and clinical therapeutics

   (A) Correlations between SCNAs and the abundance of mRNA and protein
   across age-of-onset subgroups of EOCRC (left) and LOCRC (right).
   Significant positive and negative correlations are indicated in red and
   green, respectively (p < 0.01). SCNA-driven cis-effects appear as the
   red diagonal line; trans-effects appear as vertical red and green
   lines. The accompanying histograms show the number of significant cis-
   and trans-events (p < 0.01) corresponding to the indicated genomic loci
   (blue bars upward) as well as the overlap between SCNA-mRNA and
   SCNA-protein events (black bars downward).

   (B) Gene-wise correlation of mRNA and protein abundances in tumors and
   NATs across age-of-onset subgroups of EOCRC (left) and LOCRC (right).
   Correlation coefficient was calculated via Spearman’s correlation
   analysis.

   (C) Pathway enrichment annotation for significant positive mRNA-protein
   pairs within EOCRC (left) and LOCRC (right) tumors.

   (D) Profiling of differential expression patterns of immune-related
   genes. A total of 223 of 770 genes were shared in both transcriptomic
   and proteomic datasets. Volcano plots indicate the comparison between
   tumors of EOCRC and LOCRC at transcriptomic (left) and proteomic
   (right) levels. Brown lines represent criteria of p < 0.05 (calculated
   via t test) and log2FC >1 for significantly differential genes, where
   upregulated genes in LOCRC tumors are colored red and upregulated genes
   in EOCRC tumors are colored blue. Other genes are colored gray.

   To understand the regulatory relationship between transcriptomic and
   proteomic processes, we calculated gene-wise (inter-sample) and
   sample-wise (intra-sample) correlations. Gene-wise mRNA-protein
   correlations displayed striking differences within tumors (r[mean] =
   0.229) and NATs (r[mean] = 0.026) ([152]Figure S4B), consistent with
   previous studies,[153]^11^,[154]^21 and each age-of-onset subgroup had
   a similar correlation (EOCRC: r[Tumor] = 0.256 versus r[NAT] = 0.035;
   LOCRC: r[Tumor] = 0.291 versus r[NAT] = 0.108; [155]Figure 4B), which
   indicated more active translational efficiency or less degradation of
   proteins underlying posttranscriptional mechanisms that occurred in
   colorectal tumors. In addition, there was a consistently positive
   correlation in fold changes of mRNA versus protein across each subgroup
   ([156]Figure S4C). Furthermore, 29.36% and 41.59% of mRNA-protein pairs
   had significant positive Spearman’s correlations in EOCRC and LOCRC
   tumors, respectively (p < 0.05; [157]Figure 4B), along with distinct
   network clusters where DNA damage repair pathways and metabolic
   networks were markedly enriched in EOCRC tumors ([158]Figures 4C and
   [159]S4D). In terms of sample-wise mRNA-protein correlations, we
   detected a fairly consistent correlation in tumors (r = 0.294) and NATs
   (r = 0.221), but these correlations were not associated with age of
   onset ([160]Figure S4E).

   For potential therapeutic targets of colorectal cancer, we attempted to
   assess gene expression patterns for cancer testis (CT)
   antigens,[161]^22 pan-essential genes,[162]^23 and a PanCancer Immune
   Profiling panel[163]^24 using omics datasets of this study.
   Interestingly, some genes had differential expression at both mRNA and
   protein levels in comparisons of tumors versus NATs and EOCRC versus
   LOCRC tumors ([164]Figures S4F–S4H and [165]4D). In particular, one
   gene from the immune panel, CCL21, was consistently significantly
   increased in EOCRC tumors compared with LOCRC tumors at both mRNA and
   protein levels, despite its lower expression in tumors compared with
   NATs ([166]Figure 4D).

LMTK3 mutation specific to EOCRC independently triggers molecular alterations
in tumors

   In addition to SMGs, we found nine differentially mutated genes (DMGs)
   specifically in EOCRC tumors compared with LOCRC tumors (p < 0.01;
   [167]Figure 5A; [168]Table S4), which had a significant performance in
   distinguishing the two age-of-onset subgroups with an area under the
   receiver operating characteristic curve (AUC) of 0.893
   ([169]Figure S5A). Importantly, LMTK3 (p = 0.016) and GPR98 (p = 0.019)
   were also mutated in EOCRC tumors (versus LOCRC tumors) in the
   International Cancer Genome Consortium (ICGC) Chinese cohort
   ([170]Figure S5B; [171]Table S4) but not in Europeans ([172]Figure 5B;
   [173]Table S4). Accordingly, five genes exhibited ancestry-specific
   higher mutation rates in EOCRC tumors from the NJCRC cohort than the
   TCGA European cohort; one of these genes was LMTK3 (20% in NJCRC versus
   2.78% in TCGA, p = 0.041; [174]Figure 5C). Collectively, we focused on
   the consequences of LMTK3 mutation on molecular phenotypes and clinical
   outcomes.

Figure 5.

   [175]Figure 5
   [176]Open in a new tab

   Identification of LMTK3 specifically mutated in EOCRC tumors alongside
   the multi-omics annotation

   (A–C) Differential mutation rates of 9 DMGs between age-of-onset
   subgroups in the NJCRC (A) and TCGA European colorectal cancer cohorts
   (B) and between EOCRC tumors of the two cohorts (C). p values
   (∗∗p < 0.01) and odds ratios were calculated via Fisher’s exact test.
   NA, not available because of zero mutation in the TCGA cohort.

   (D) Mutational lolliplot of LMTK3 mutations in colorectal tumors of
   NJCRC (upward) and ICGC (downward) cohorts. Mutations are shaped in
   circles and colored differently.

   (E) Significantly co-occurring genes with LMTK3 in NJCRC tumors. DDR,
   DNA damage response. p value was derived from Fisher’s exact test.

   (F) A cartoon comparison plot of LMTK3 protein structure between wild
   type and mutant. Wild-type LMTK3 is colored green, and the mutant is
   colored light blue. The red stick indicates the amino acid with/without
   LMTK3 mutations.

   (G–K) Differences in TMB (G), somatic signature (H), immune
   infiltration (I), immunophenotype (J), and LMTK3 expression (K) between
   LMTK3 mutant and wild-type tumors in NJCRC. p value was calculated via
   Wilcoxon rank-sum test.

   Seven mutations (six missense and one frameshift) existed in LMTK3
   derived from six EOCRC tumors harboring over 90% statistical power
   ([177]Figure 5D; [178]Table S2B). They were mutually mutated with 157
   genes (p < 0.05; [179]Table S5A), including five colorectal cancer
   driver genes[180]^13 ([181]Figure 5E), and altered LMTK3 protein
   structure ([182]Figure 5F). LMTK3 mutant tumors had a significantly
   higher TMB (p = 0.036; [183]Figure 5G; [184]Table S5B) and increased
   proportions of signatures 3, 9, 17, and 28 ([185]Figure 5H;
   [186]Table S5B) but decreased signature 1 ([187]Table S5B). No
   significant differences in MSI status or SCNAs were elicited by mutant
   LMTK3 ([188]Table S5B). Given that signature 3 is strongly associated
   with the failure of DNA repair by homologous recombination,[189]^25 we
   further evaluated co-occurring mutation patterns of 450 DNA damage
   response (DDR) genes.[190]^26 Among them, both SHPRH (p = 0.014) and
   POLE (p = 0.027) were concurrently mutated with LMTK3 ([191]Figure 5E;
   [192]Table S5A), but their mRNA levels had no correlation and were not
   affected by their own mutations ([193]Figures S5C and S5D).

   Regarding the epigenome, LMTK3 mutant tumors had an increased but
   non-significant proportion of low CIMPs, from 56.8% to 66.7%
   ([194]Table S5B). In addition, LMTK3 mutations did not alter the
   difference between DNAmAge and chronological age at diagnosis in EOCRC
   tumors ([195]Figure S5E). Among the 37 CpGs in LMTK3 ([196]Figure S5F),
   cg13996617 located in the gene body was significantly hyper-methylated
   in tumors, especially in EOCRC tumors (p = 2.64e−11; [197]Figure S5G),
   but it was not correlated with LMTK3 expression or mutation
   ([198]Figure S5H).

   In the transcriptome, LMTK3 mutations were independently associated
   with increased proportions of CD8^+ T cells and macrophages (p = 0.012
   and 0.008, respectively; [199]Figure 5I; [200]Table S5B) as well as
   tumors with a hot immunophenotype ([201]Figure 5J) but not the
   expression of corresponding immune cell markers ([202]Figure S5C). In
   addition, LMTK3 mutant tumors had an increased but non-significant
   prevalence of CMS1 (33.3%; referred to as a microsatellite
   unstable/immune subtype) relative to the wild type (16.7%;
   [203]Figure S5I).

   Furthermore, among 5,887 mRNA-protein pairs in tumors, 229 were
   significantly affected by LMTK3 mutations, intriguingly including 161
   coordinately and 68 exclusively expressed genes ([204]Figure S5J).
   These coordinate genes were enriched into 25 significant pathways,
   including four well-defined oncogenic pathways involving colorectal
   cancer development and immunotherapy, namely antigen processing and
   presentation,[205]^27 the transforming growth factor β (TGF-β)
   signaling pathway,[206]^28 the interleukin-17 (IL-17) signaling
   pathway,[207]^29 and the mRNA surveillance pathway,[208]^30 and
   clustered into five informative networks ([209]Figure S5K). Moreover,
   LMTK3 expression was dramatically elevated via LMTK3 mutations (p =
   0.007; [210]Figure 5K) but not via SCNA ([211]Figure S5L) or DNA
   methylation ([212]Figure S5H) in LMTK3.

LMTK3 differs in gene expression as a promising biomarker and target in
clinical implications

   LMTK3 was significantly upregulated in the majority of tumors
   ([213]Figure S6A). In colorectal cancer, regardless of onset age, both
   mRNA and protein abundances of LMTK3 were highly expressed in tumors
   ([214]Figures 6A and 6B) but were not associated with 5-year survival
   (as was mutation; [215]Figure S6B) or tumor stage ([216]Figure S6C).
   Similar to a previous study in breast cancer,[217]^31 we observed
   significant mutual correlations among cascade genes of LMTK3,
   PRKCA-AKT1-FOXO3 signaling, and ESR1 in NATs (r[LMTK3-AKT1] = 0.56, p =
   1.07e−7; r[LMTK3-FOXO3] = 0.42, p = 1.43e−4; r[LMTK3-ESR1] = 0.34, p =
   1.92e−3; [218]Figure S6D) that were attenuated in tumors; however,
   LMTK3 mutations did not alter the gene expression pattern of
   PRKCA-AKT1-FOXO3 signaling ([219]Figure S5C). We further dissected the
   LMTK3 expression pattern in TIME of colorectal tumors at the
   single-cell level ([220]Figure S6E) and intuitively observed that LMTK3
   was predominantly expressed in epithelial cells and significantly
   increased in EOCRC tumors, especially within B cells and stromal cells
   ([221]Figure 6C), but not in corresponding cell subtypes
   ([222]Figure S6F).

Figure 6.

   [223]Figure 6
   [224]Open in a new tab

   Differential expression pattern and clinical implication of LMTK3

   (A) Differential expression of LMTK3 between colorectal tumors and
   NATs. p value was calculated via Wilcoxon rank-sum test.

   (B) Semiquantitative analysis of the immunohistochemical staining
   intensity of LMTK3 protein between colorectal cancer tissues and normal
   mucosal tissues. p value was calculated via Wilcoxon rank-sum test.
   Representative immunohistochemical images of LMTK3 protein expression
   in colorectal tumors (Ts; left) and normal mucosal tissues (Ns; right).

   (C) LMTK3 expression pattern in the single-cell transcriptome of
   colorectal cancer tissues. The t-distributed stochastic neighbor
   embedding (tSNE) shows 47,285 cells of colorectal cancer tissues
   between age-of-onset subgroups across six major cell types (left).
   Cells expressing LMTK3 are colored purple in the tSNE plot using the
   normalized value after canonical correlation analysis (CCA) correction
   (right). Scatterplots indicate differential expression of LMTK3 in
   tumors between age-of-onset subgroups across six cell types (bottom).
   ∗p < 0.05 and ∗∗∗p < 0.001 were calculated via Wilcoxon rank-sum test.
   ns, no significance.

Discussion

   The study characterizes the molecular landscape with a focus on
   patients of Chinese ancestry with EOCRC and indicates the value of
   multi-omics integration in disclosing molecular heterogeneity in
   age-of-onset-specific tumorigenesis and demonstrates the utility of
   omics measurements in precision therapeutic development.

   We initially observed profound aggressive features for EOCRC tumors
   underlying the mutational signature 1, oncogenic SCNAs (14/15q and
   20q13.12), DNAmAge, and carcinogenic pathway enrichment across each
   molecular layer, which indicated the feasibility and necessity of
   identifying molecular phenotypes in EOCRC tumorigenesis. To fully
   utilize the omics information, we launched transcriptomic and proteomic
   analyses and found predominant signaling processes of FAD, CAMs, and
   DNA repair involved in EOCRC development. ADH1B, a key metabolic enzyme
   of alcohol prioritized in FAD pathways, was concordantly downregulated
   at both mRNA and protein levels in EOCRC tumors, which might underlie
   the inverse association of the fatty acid metabolic signature capturing
   alcohol consumption alongside colorectal cancer
   development.[225]^32^,[226]^33 CADM3, a key protein prioritized in the
   CAM pathway, was decreased in EOCRC tumors, consistent with its
   suppressive role in the tumorigenicity of colon cancer cells.[227]^34 A
   significant increase in signature 3, related to DNA repair, was
   observed in EOCRC tumors, followed by pathway enrichment and network
   clustering of DNA repair processes, increased immune infiltration, and
   a hot immunophenotype, indicating the prominent inter-play of
   immune-inflammation response and DDR in the young tumor
   microenvironment.[228]^35^,[229]^36 These molecular phenotypes help to
   enrich our etiological understanding of EOCRC.

   These distinct omics profiles provide valuable insights into the
   development of new therapeutic strategies for EOCRC. TMB is an
   important independent genomic biomarker indicating the likelihood of a
   response to immunotherapy, including in colorectal
   cancer.[230]^37^,[231]^38 EOCRC tumors have a significantly high TMB
   and remarkably unique pathway enrichment in mismatch repair, suggesting
   a therapeutic preference for immunotherapy.[232]^39 In addition, we
   profiled the gene expression spectrum for platinum drug resistance, CT
   antigens, pan-essential genes, and an immune panel to identify
   potential drug targets. Importantly, most genes displayed inconsistent
   expression patterns at both mRNA and protein levels in the comparison
   between age-of-onset subgroups, which may be attributed to tighter
   posttranscriptional regulation in young tumors. For example, CCL21 may
   play important roles in clinical implications of EOCRC, as previous
   evidence supports that cancer therapy, by combining CCL21, serves as an
   effective strategy for colorectal cancer immunotherapy through inducing
   antitumor immune responses.[233]^40^,[234]^41 Hence, these discrepant
   genes in young tumors prompted us to take a precision approach to drug
   development, treatment selection, and therapeutic management for EOCRC.

   LMTK3 varied in mutation frequency across age of onset and ancestry,
   which potentially triggered molecular alterations in EOCRC
   tumorigenesis. We found that the LMTK3 mutation not only co-occurred
   with driver genes but also upregulated gene expression (including its
   cognate gene) in defined oncogenic pathways, consistent with agonistic
   features of EOCRC tumors. Intriguingly, SHPRH and POLE, two DNA repair
   genes, were concurrently mutated with LMTK3, consistent with the
   increased signature 3 by mutant LMTK3. Specifically, POLE mutations
   have also been reported with increased frequency in Chinese patients
   and younger tumors.[235]^42^,[236]^43 Moreover, pathogenic
   characteristics of large T cell infiltration and a higher TMB driven by
   POLE mutation[237]^42^,[238]^43 were in accordance with those driven by
   LMTK3 mutation. Notably, LMTK3 is a druggable target for cancer therapy
   (including colorectal cancer), independent of its RNA expression,
   acting as a protein kinase modulating ERα by or not by PRKCA-AKT1-FOXO3
   signaling[239]^31 and producing a small molecule underlying the latest
   druggable genome evidence at the tier 3 priority for cancer
   treatments.[240]^44^,[241]^45 Collectively, mutation-relevant findings
   could be used to interpret the biological mechanism of DNA repair and
   immune-inflammation response in early-onset colorectal tumorigenesis
   and further support the discovery of new therapeutic targets.

   In summary, we performed a molecular characterization of colorectal
   cancer through genomic, epigenomic, transcriptomic, and proteomic
   analyses, when integrated appropriately, to delineate the molecular
   complexity underlying differences in the age of onset and ancestry.
   These discoveries bring insights that could be used to mitigate
   disparities among patients with colorectal cancer via the prospective
   exploration of precision oncology in drug development and therapeutic
   modalities.

Limitations of the study

   We acknowledge several limitations in this study. First, we noted some
   obvious but not significant molecular alterations, such as increased
   MSI and low CIMP driven by the early onset or LMTK3 mutation. These
   results are probably due to the limited number of sequenced samples.
   Second, we did not profile posttranslational modifications, such as the
   phosphoproteome, and did not perform trans-omics-based subtype analysis
   because of a low number of, consistency of, and overlapping of samples
   of DNA methylation and proteomic profiling. These limitations restrict
   the synergy of multi-omics analysis to provide a unified view of the
   molecular heterogeneity of EOCRC. Third, there was a lack of survival
   analysis in NJCRC due to the short-term follow up, and the absence of
   experimental validation of LMTK3 mutation or expression involving age
   of onset and colorectal tumor aggressiveness. Therefore, larger sample
   sizes that increase the power of identifying molecular events; more
   in-depth profiling that includes the test in the single-cell level;
   deep clinical phenotyping that provides clinical insights; novel
   analytic frameworks that enrich the knowledge of cancer etiology; and
   functional experiments that elaborate cancer pathophysiology are
   warranted in future work.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Antibodies
     __________________________________________________________________

   Mouse monoclonal anti-LMTK3 Antibody (I-17) Santa Cruz Biotechnology
   Cat# sc-100418; RRID:AB_2234692
     __________________________________________________________________

   Biological samples
     __________________________________________________________________

   Tumor and normal adjacent tissues from human colorectal cancer patients
   Nanjing ColoRectal Cancer (NJCRC) cohort This paper
   Tissue microarray of colorectal cancer Shanghai Outdo Biotech Co., Ltd.
   Cat# HColA180Su19
     __________________________________________________________________

   Critical commercial assays
     __________________________________________________________________

   QIAamp DNA Mini Kit QIAGEN Cat# 51306
   TruSeq Nano DNA HT Library Prep Kit Illumina Cat# FC-121-4003
   SureSelectXT Reagent Kit Agilent Cat# G9641B
   SureSelectXT Human All Exon V6 Kit Agilent Cat# 5190-8865
   Infinium HumanMethylation EPIC BeadChip Kit Illumina Cat# WG-317-1001
   EZ DNA Methylation Kit Zymo Research Cat# D5001
   AllPrep DNA/RNA Mini Kit QIAGEN Cat# 80204
     __________________________________________________________________

   Deposited data
     __________________________________________________________________

   Colorectal cancer whole-genome sequencing data This study
   [242]https://db.cngb.org/cnsa/Accession number: CNP0004009
   Colorectal cancer whole-exome sequencing data This study
   [243]https://db.cngb.org/cnsa/Accession number: CNP0004009
   Colorectal cancer whole-genome DNA methylation data This study
   [244]https://db.cngb.org/cnsa/Accession number: CNP0004009
   Colorectal cancer whole-transcriptome sequencing data This study
   [245]https://db.cngb.org/cnsa/Accession number: CNP0004009
   Colorectal cancer mass-spectrometry-based proteome data This study
   [246]https://db.cngb.org/cnsa/Accession number: CNP0004009
   Colorectal cancer single-cell transcriptome data Lee et al.[247]^35
   [248]https://www.ncbi.nlm.nih.gov/geo/GEO: [249]GSE132465
   ICGC mutation data ICGC [250]https://dcc.icgc.org
   TCGA mutation data TCGA [251]https://portal.gdc.cancer.gov
   TCGA expression data TCGA [252]https://portal.gdc.cancer.gov
   Pan cancer expression data PanCanAtlas
   [253]https://gdc.cancer.gov/about-data/publications/pancanatlas
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   BWA (version 0.7.17-r1188) Li et al.[254]^46
   [255]http://bio-bwa.sourceforge.net
   SAMtools (version 0.1.18) Li et al.[256]^47 [257]http://www.htslib.org
   Picard (version 2.9.0) Broad Institute
   [258]http://broadinstitute.github.io/picard
   GATK (version 3.8.0) DePristo et al.[259]^48
   [260]https://gatk.broadinstitute.org/hc/en-us
   MuTect2 (version 1.1.4) Cibulskis et al.[261]^49
   [262]https://gatk.broadinstitute.org/hc/en-us
   Strelka (version 1.0.13) Saunders et al.[263]^50
   [264]https://github.com/Illumina/strelka
   ANNOVAR Wang et al.[265]^51
   [266]https://annovar.openbioinformatics.org/en/latest
   MutSigCV (version 1.41) Lawrence et al.[267]^52
   [268]https://software.broadinstitute.org/cancer/cga/mutsig
   MSMuTect (version 1.0) Maruvka et al.[269]^53
   [270]https://github.com/getzlab/MSMuTect
   Control-FREEC (version 6.2) Boeva et al.[271]^54
   [272]http://boevalab.inf.ethz.ch/FREEC/
   GISTIC2 (version 2.0.23) Mermel et al.[273]^55
   [274]https://github.com/broadinstitute/gistic2
   iScan System Illumina
   [275]https://www.illumina.com/systems/array-scanners/iscan.html
   DNA Methylation Age Calculator Horvath[276]^56
   [277]http://dnamage.genetics.ucla.edu
   HISAT2 (version 2.0.4) Kim et al.[278]^57
   [279]http://daehwankimlab.github.io/hisat2
   StringTie (version 1.3.4) Pertea et al.[280]^58
   [281]https://ccb.jhu.edu/software/stringtie
   Cufflinks (version 2.1.1) Roberts et al.[282]^59
   [283]http://cole-trapnell-lab.github.io/cufflinks
   TIMER Li et al.[284]^60 [285]http://cistrome.org/TIMER
   ESTIMATE Yoshihara et al.[286]^61
   [287]https://bioinformatics.mdanderson.org/estimate
   Proteome Discoverer 2.2 Thermo Fisher
   [288]https://assets.thermofisher.com/TFS-Assets/CMD/manuals/Man-XCALI-9
   7808-Proteome-Discoverer-User-ManXCALI97808-EN.pdf
   OmicsX Pan et al.[289]^62 [290]http://bioinfo.wilmer.jhu.edu/OmicsX/
   R package deconstructSigs Rosenthal et al.[291]^63
   [292]https://github.com/raerose01/deconstructSigs
   R package ChAMP Tian et al.[293]^64
   [294]https://www.bioconductor.org/packages/release/bioc/vignettes/ChAMP
   /inst/doc/ChAMP.html
   R package ConsensusClusterPlus Wilkerson et al.[295]^65
   [296]https://bioconductor.org/packages/release/bioc/html/ConsensusClust
   erPlus.html
   R package CMSclassifier Guinney et al.[297]^12
   [298]https://github.com/Sage-Bionetworks/CMSclassifier
   R package corrplot N/A
   [299]https://cran.r-project.org/web/packages/corrplot/index.html
   R package clusterProfiler Yu et al.[300]^66
   [301]https://bioconductor.org/packages/release/bioc/html/clusterProfile
   r.html
   R package factoextra N/A
   [302]https://cran.r-project.org/web/packages/factoextra/index.html
   R package Seurat Butler et al.[303]^67
   [304]https://satijalab.org/seurat
   R package RCAv2 Li et al.[305]^68
   [306]https://github.com/prabhakarlab/RCAv2
   R package cancerSeqStudy Bailey et al.[307]^69
   [308]https://github.com/KarchinLab/cancerSeqStudy
   R package UpSetR Conway et al.[309]^70
   [310]https://github.com/hms-dbmi/UpSetR
   ProteinPaint St. Jude Children’s Research Hospital.
   [311]https://proteinpaint.stjude.org/
   PyMOL Schrödinger, Inc. [312]https://pymol.org/2/
   [313]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources and reagents should be
   directed to and will be fulfilled by the Lead Contact, Meilin Wang
   ([314]mwang@njmu.edu.cn).

Materials availability

   This study did not generate new, unique reagents.

Experimental model and subject details

Specimens and clinical data

   The study protocol was reviewed and approved by the Ethics Committee of
   Nanjing Medical University. Colorectal cancer samples for omics
   measurements were designedly selected from the Nanjing ColoRectal
   Cancer (NJCRC) cohort.[315]^71^,[316]^72 The NJCRC cohort belongs to
   the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO)
   and comprises 1,316 cases in total with 366 cases of early-onset
   colorectal cancer (EOCRC; 27.81%). Briefly, the NJCRC cohort included
   newly diagnosed, untreated colorectal cancer patients undergoing
   surgical resection. Basic patient demographics were collected via a
   face-to-face interview with subjects who provided written informed
   consent; and the corresponding clinical information regarding the
   status of surgery and relevant diagnostic information were collected
   using case report forms.

   In this study, 79 paired tumor tissues and normal adjacent tissues
   (NATs) were designedly selected for omics detection; the tumor tissues
   were from sporadic tumors without hereditary syndromes and
   histopathologically confirmed by pathologists and verified to have at
   least 70% tumor cell nuclei.

Publicly available datasets

   We requested available colorectal cancer omics datasets from two
   databases, namely, International Cancer Genome Consortium (ICGC;
   [317]https://dcc.icgc.org/, version 2019_08_31) and The Cancer Genome
   Atlas (TCGA; [318]https://portal.gdc.cancer.gov/, version 2016_01_28),
   for global or candidate molecular evaluation as appropriate. In terms
   of the evaluation of ancestral disparity, we retained samples with
   explicit ancestry information specific to Asian (Chinese) or European
   populations. Therefore, 325 Chinese individuals with colorectal cancer
   cases (COCA-CN from ICGC) and 226 European individuals with colorectal
   cancer (COAD and READ combined from TCGA) were included for further
   analysis. Demographic characteristics are shown in [319]Table S1.

   It is noteworthy that the proportion of EOCRC patients in the overall
   NJCRC (27.81%), ICGC (27.69% in Chinese), and TCGA (15.93% in
   Europeans) cohorts was over 10% less than that recruited in this study
   (37.97%), which indicated adequate statistical power to observe
   molecular features in consideration of cost-effectiveness.[320]^73 In
   addition, the Pan-Cancer Atlas initiative
   ([321]https://gdc.cancer.gov/about-data/publications/pancanatlas)
   provided transcriptomic datasets across 32 cancer types for the
   evaluation of candidate gene expression patterns.

Method details

Whole-genome sequencing (WGS) and whole-exome sequencing (WES)

   We carried out both WGS and WES paired-end sequencing reactions for 27
   and 52 pairs of tumors and NATs, respectively, in collaborations with
   Novogene Co., Ltd. (Beijing, China). Briefly, DNA was extracted from
   fresh frozen tissue blocks using a QIAGEN QIAamp DNA Mini Kit. For WGS,
   library preparation was performed using at least 0.8 μg of genomic DNA
   per sample with a high molecular weight (>20 kb single band) via the
   TruSeq Nano DNA HT Library Prep Kit. Pooled DNA libraries were
   sequenced on an Illumina HiSeq X Ten platform with 150-bp paired-end
   sequencing. For WES, library construction and whole-exome capture of
   genomic DNA were performed using the SureSelectXT Reagent Kit (Agilent)
   and SureSelectXT Human All Exon V6 Kit (Agilent). The captured DNA was
   then sequenced on an Illumina HiSeq 2500 sequencing platform with
   150-bp paired-end sequencing.

Genomic data processing

   The FastQC package was used to assess the quality-score distribution of
   the sequencing reads. All raw Illumina sequence data were demultiplexed
   and converted to fastq files, with the subsequent trimming of adaptors,
   contamination, and low-quality nucleotides to obtain clean data.
   Paired-end sequencing reads were aligned to the human reference genome
   (GRCh37) using Burrows-Wheeler Aligner (BWA; version
   0.7.17-r1188);[322]^46 then, SAMtools (version 0.1.18),[323]^47 Picard
   (version 2.9.0; [324]http://broadinstitute.github.io/picard/), and
   Genome Analysis Toolkit (GATK; version 3.8.0)[325]^48 were used to sort
   BAM files and perform duplicate marking, local realignment, and base
   quality recalibration to generate final BAM files for computing the
   sequence coverage and depth. The average sequence coverage was 56× for
   27 tumors and 35× for 27 NATs for WGS, as well as 188× for 52 tumors
   and 115× for 52 NATs for WES, which were utilized for downstream
   analyses of somatic mutation, microsatellite instability (MSI), tumor
   mutation burden (TMB), and somatic copy number alteration (SCNA).

Somatic mutation detection

   MuTect2 (version 1.1.4)[326]^49 and Strelka (version 1.0.13)[327]^50
   were used to call somatic single nucleotide variants (SNVs) and small
   insertions/deletions (INDELs) from the comparison between tumors and
   NATs, respectively. Somatic calls were then subjected to quality
   control using the following exclusion criteria: (1) they were found in
   a panel of normal tissues built by matched normal tissues; (2) they
   were located in the segmental duplication region marked by the UCSC
   browser ([328]http://genome.ucsc.edu/); and (3) they were found in the
   1000 Genomes Project or Exome Aggregation Consortium with the same
   mutation direction. The filtered somatic mutations were then annotated
   with ANNOVAR[329]^51 based on GENCODE v19 for protein altering events.

Analyses of significantly mutated gene (SMG) and differentially mutated gene
(DMG)

   The significance of candidate mutations was evaluated using MutSigCV
   (version 1.41),[330]^52 which considered the global mutation situation,
   mutation frequency of genes near the mutation site, whether the site
   was located in a region where the chromosome was easy to open, and
   other parameters to find genes with a mutation rate higher than the
   background mutation rate. Genes with a false discovery rate (FDR; q
   value) less than 0.1 were considered SMGs.

   DMGs were defined as genes with a significantly differential mutation
   rate between EOCRC and LOCRC underlying a p value less than 0.01 using
   Fisher’s exact test. A mutational lolliplot was generated with
   ProteinPaint ([331]https://proteinpaint.stjude.org/), and the Pfam
   database ([332]https://pfam.xfam.org/) supports protein domains for
   candidate genes (e.g., LMTK3). PyMOL software was used to visualize
   LMTK3 protein structure with or without LMTK3 mutations.

Microsatellite instability (MSI) status prediction

   MSMuTect (version 1.0) was applied to infer the somatic microsatellite
   level by comparing somatic microsatellite INDEL (MS INDEL)
   calling.[333]^53 MSMuTect used the Kolmogorov-Smirnov test to identify
   microsatellite sites with different alleles between tumor and normal
   pairs. Patients with MS INDELs greater than 40 were grouped into the
   MSI-high (MSI-H) group, while the others were grouped into the
   microsatellite stability (MSS) group.

Tumor mutation burden (TMB) and hypermutation status calculations

   TMB, an emerging biomarker of immunotherapy responses,[334]^74 is
   defined as the number of nonsynonymous somatic mutations (including
   SNVs and INDELs) per mega base (Mb) of genome examined. The
   hypermutation status of the tumor was categorized by the threshold
   TMB > 12/Mb.[335]^75

Mutational signature analysis

   All mutation data from WES and WGS datasets were converted into a
   matrix composed of 96 features comprising six substitution patterns
   (C > A, C > G, C > T, T > A, T > C, and T > G) using each possible 5'
   (C, A, G, and T) and 3' (C, A, G, and T) context for all samples.
   Furthermore, the R package deconstructSigs[336]^63 was applied to
   delineate the proportion of 30 known mutational signatures derived from
   Catalogue Of Somatic Mutations In Cancer (COSMIC;
   [337]http://cancer.sanger.ac.uk/cosmic/signatures; v2, March 2015;
   [338]Table S2) assigned to each tumor tissue.

Somatic copy number alteration (SCNA) analysis

   SCNAs were identified using Control-FREEC (version 6.2),[339]^54 of
   which amplified or deleted genomic regions were identified using
   GISTIC2 (version 2.0.23).[340]^55 For the focal level analysis,
   G-scores were calculated for significant genomic and gene-coding
   regions on the basis of the frequency and amplitude of amplification or
   deletion affecting each gene. A default q value threshold (0.25) of
   GISTIC2 was used to define highly amplified or deleted regions for both
   arm- and focal-level analysis.

Whole-genome DNA methylation detection

   DNA methylation profiling for 43 pairs of tumors and NATs (including 24
   EOCRC and 19 LOCRC) was performed with CapitalBio (Beijing, China)
   using the Illumina Infinium HumanMethylation EPIC BeadChip, which
   covered over 850,000 CpG loci across the whole human genome at
   single-nucleotide resolution. Briefly, the EZ DNA Methylation Kit (Zymo
   Research) was used to perform sodium bisulfite-treated arrays for
   genomic DNA. To minimize batch effects, pairs of tumor and normal
   tissues were randomly distributed to different arrays. The
   bisulfite-converted genomic DNA was isothermally amplified overnight
   and fragmented enzymatically. Precipitated DNA using isopropyl alcohol
   was resuspended in hybridization buffer and dispensed onto BeadChips (8
   samples/chip). The hybridization procedure was performed at 48°C
   overnight using an Illumina hybridization oven. After hybridization,
   the free DNA was washed away, and BeadChips were processed through a
   single nucleotide extension followed by immunohistochemistry staining.
   Finally, we downloaded the manifest files of each array in advance and
   imaged BeadChips using an Illumina iScan.

DNA methylation profile processing

   The methylation level of each CpG is represented as β values (M/[M+U],
   ranging from 0 to 1), where M represents the methylated intensity and U
   represents the unmethylated intensity. In this study, we processed all
   raw data of the Illumina array following the pipeline of R package
   ChAMP[341]^64 to perform quality control of the dataset and
   differential expression analysis as needed. Briefly, low-quality probes
   were excluded if probes (1) had a detection p value > 0.1; (2) had bead
   counts < 3 in at least 5% of samples; (3) were non-CpG labeled; (4)
   were related to single nucleotide polymorphisms; (5) were aligned to
   multiple locations; and (6) were located on chromosomes X and Y.
   Ultimately, 744,713 CpG probes were included for type-II probe
   normalization,[342]^76 which were used for downstream analyses.

DNA methylation age estimation

   The DNA methylation age (DNAmAge), also known as the epigenetic age,
   was estimated based on DNA methylation profiles using a freely
   available web-based tool (DNA Methylation Age Calculator;
   [343]http://dnamage.genetics.ucla.edu/).[344]^56 Briefly, DNA
   methylation profiles originating from tumors and NATs were reduced to
   datasets with 28,587 CpGs, which were then analyzed based on DNA
   sources subjected to “colon”. The output variable of “DNAmAge” was used
   to determine the correlation with chronological age at diagnosis of
   colorectal cancer using Pearson’s correlation test.

CpG island methylator phenotype (CIMP) calculation

   The CIMP status of each sample was predicted using CpGs derived from
   promoter regions with a high standard deviation (SD > 0.2) of the
   methylation level in tumor tissues and a low methylation level (mean β
   value < 0.05) in normal tissues.[345]^77 The R package
   ConsensusClusterPlus was used to perform consensus clustering analysis
   on the basis of the K-means algorithm.[346]^65 In this study, as
   limited samples were available for DNA methylation profiling, we
   grouped samples with hypermethylation patterns as CIMP-H and all others
   as CIMP-L.

Analyses of differentially methylated position/region

   For the differentially methylated position analysis, R package limma
   (implemented in ChAMP package) was used to calculate the p value for
   differential methylation using a linear model, and CpG probes meeting
   the criteria of log2-fold change (FC) >1 and Bonferroni adjusted p
   value <0.05 (i.e., p < 0.05/744,713) were considered to be
   significantly differential CpGs. Differentially methylated region
   algorithms implemented Bumphunter (deposited in ChAMP package), which
   first grouped all probes into small clusters (or regions) and then
   applied a random permutation method to estimate significantly
   differential candidate regions accompanied by a Bonferroni adjusted p
   value <0.05 (i.e., p < 0.05/5,674 for all samples, 0.05/4,432 for
   EOCRC, 0.05/4,294 for LOCRC).

Whole-transcriptome sequencing and data processing

   The transcriptomes of 79 pairs of tumors and NATs were generated in
   collaboration with Novogene Co., Ltd. (Beijing, China) using the
   Illumina HiSeq 2500 platform with 150-bp sequencing based on the
   Standard Illumina RNA sequencing (RNA-Seq) protocol. RNA was extracted
   from fresh frozen tissue blocks using the QIAGEN AllPrep DNA/RNA Mini
   Kit following the manufacturer’s instructions. RNA reads were
   generated, aligned to the GENCODE v19 genome assembly with HISAT2
   (version 2.0.4),[347]^57 and quantified with StringTie (version
   1.3.4).[348]^58 Cufflinks (version 2.1.1)[349]^59 was then used to
   quantify the gene expression abundance and calculate the standardized
   gene expression (represented as fragments per kilobase per million
   mapped reads [FPKM]) for each sample.

   In this study, a total of 18,094 protein-coding RNAs were initially
   identified in expression profiles. When performing quality control
   underlying criteria of call rate >80% of samples and gene expression
   level >0.5 FPKM, we included 13,070 genes for the general differential
   analysis. However, when focusing on evaluating expression or
   correlation patterns in candidate genes, we retained all available
   genes, which allowed us to capture low-abundance genes for further
   analysis, such as LMTK3, which exhibited low expression at the mRNA
   level but high expression at the protein level.

Immune cell infiltration estimation

   The Tumor IMmune Estimation Resource (TIMER) algorithm was employed to
   estimate the infiltrations of diverse immune cell types in tumor
   tissues based on tumor RNA-Seq profiles.[350]^60 A leukocyte gene
   signature matrix was used as the reference data, including 2,271
   signature genes overexpressed in the immune lineage. Batch effects
   between in-house data and the external reference datasets were removed
   using the ComBat function from R package sva. Six types of immune
   infiltrates were estimated in this study, including B cells, CD4^+
   T cells, CD8^+ T cells, Neutrophils, Macrophages, and Dendritic cells.

   The Estimation of STromal and Immune cells in MAlignant Tumours using
   Expression data (ESTIMATE) algorithm used the transcriptome to
   calculate the estimated levels of infiltrating stromal and immune cells
   and estimated tumor purity.[351]^61 The ESTIMATE score infers the tumor
   purity of tumor tissues.

Consensus clustering for transcriptomic data

   We employed the consensus molecular subtypes (CMS) classifier,
   exploited by random forest algorithm implemented in the R package
   CMSclassifier
   ([352]https://github.com/Sage-Bionetworks/CMSclassifier),[353]^12 to
   assign the CMS for each sample based on the gene expression matrix.
   Using a default posterior probability of 0.5 as a threshold for sample
   classification, we assigned 79 prospective tumors to four CMS subtypes,
   i.e., CMS1-4.[354]^12

Immunophenotype identification

   Tumor immunophenotypes were defined by combining TMB and immune
   infiltration underlying K-means = 4, accompanied by hierarchical
   clustering within each cluster. The “Hot” immunophenotype was
   characterized by CD8^+ T cell infiltration, followed by “pre-Hot” with
   CD4^+ T cell infiltration; in contrast, “Cold” tumors had less immune
   cell infiltration, and “high TMB” tumors were enriched by the high
   value of TMB.

Differentially expressed gene analysis

   A two-sided paired t test was performed for each gene expression
   comparison between two subgroups, such as tumors versus NATs across
   age-of-onset subgroups, as well as EOCRC versus LOCRC across tumors and
   NATs. Differentially expressed genes were defined as differential genes
   with absolute log2FC > 1 and Bonferroni adjusted p value <0.05 (i.e.,
   p < 0.05/13,070) in each comparison.

Correlation pattern analysis

   The R package corrplot was utilized to analyze and visualize the
   correlation pattern of candidate genes, such as genes from the cascade
   of LMTK3, PRKCA-AKT1-FOXO3, and ESR1, as well as co-occurring mutant
   genes of SHPRH, POLE, and LMTK3. The correlation coefficient and p
   value were calculated via Spearman’s correlation analysis.

Tandem mass tag (TMT) detection and processing

   TMT is a chemical label used for mass spectrometry (MS)-based
   identification and quantitation of proteins. In this study, we
   collaborated with Novogene Co., Ltd. (Beijing, China) to detect the
   proteome in 37 pairs of tumors and NATs. Briefly, a BSA quantitative
   kit was used to quantify extracted tissue proteins, followed by quality
   control using SDS-PAGE and Coomassie brilliant blue R-250 staining.
   Eligible protein samples were labeled using TMT labeling reagents,
   which were then mixed in equal volumes. For multiple labeled groups, a
   common reference was created by pooling an equal quantity of each
   sample. Fractions were separated using an L3000 HPLC system (Rigol).
   Shotgun proteomics analyses were performed using an EASY-nLC™
   1200 UHPLC system (Thermo Fisher) coupled with a Q Exactive HF (X) mass
   spectrometer (Thermo Fisher) operating in the data-dependent
   acquisition (DDA) mode. The raw data of MS detection were named “.raw”.

   In terms of protein identification and quantitation, Proteome
   Discoverer 2.2 (Thermo Fisher) was used to retrieve the detected
   spectra from each run separately against the Homo sapiens database
   (112,443 sequences). To improve the quality of available proteins,
   proteins were retained if they met the criteria of Peptide Spectrum
   Matches (PSMs) with a credibility >99%, contained >1 unique peptide,
   and had an FDR <0.01. Finally, a total of 9,772 proteins were
   identified in the TMT platform, and after filtering based on the
   criteria of call rate greater than 50% in both tumors and NATs, 6,621
   proteins (assigned by 6,356 genes) were retained for further analysis.

Differentially expressed protein analysis

   A two-sided paired t test was performed for each protein expression
   comparison between two groups, such as tumors and NATs across
   age-of-onset subgroups and EOCRC versus LOCRC across tumors and NATs.
   Differentially expressed proteins were regarded as differential
   proteins when the absolute log2FC > 1 and Bonferroni adjusted p value
   <0.05 (i.e., p < 0.05/6,621) in each comparison.

SCNA-driven cis- and trans-effects

   Correlations between SCNA and the abundance of mRNA and protein were
   determined via 5,861 common genes present among omics datasets of SCNA,
   mRNA, and protein. Correlation coefficient and p value were calculated
   via Spearman’s correlation analysis. SCNA-driven cis- and trans-effects
   for a given gene were determined with statistically significant
   (p < 0.01) positive or negative correlations.

Correlation of mRNA and protein expression

   A total of 5,887 genes with mRNA (13,070 genes) and protein (6,621
   proteins) data across 37 pairs of colorectal tumors and NATs were used
   to measure gene-wise and sample-wise mRNA and protein correlations. We
   followed the pipeline of OmicsX
   ([355]http://bioinfo.wilmer.jhu.edu/OmicsX/)[356]^62 to perform the
   omics correlation analysis.

Pathway enrichment analysis

   Pathway enrichment analysis was conducted separately for each omics
   dataset and each comparison. The champ.GSEA function implemented in R
   package ChAMP was used to enrich genes from differentially methylated
   regions (from the epigenome) specifically for Kyoto Encyclopedia of
   Genes and Genomes (KEGG) pathways using Fisher’s exact test. The R
   package clusterProfiler[357]^66 was applied to enrich and cluster
   differentially expressed genes or proteins (from tumors versus NATs,
   EOCRC versus LOCRC, and Mutants versus Wild-types, as appropriate) into
   KEGG pathways, and the deposited function of enrichKEGG was used to
   enrich significantly positively correlated genes from mRNA-protein
   pairs underlying background genes of the positive correlation-ranked
   list of genes.

Principal component analysis (PCA)

   We performed PCA on all omics samples to illustrate the global
   epigenomic (DNA methylation), transcriptomic (mRNA), and proteomic
   (protein) differences between tumors and NATs ([358]Figure S1A). PCA
   using prcomp function in R package stats was implemented for
   unsupervised clustering analysis on the centered and scaled expression
   matrix of each omics dataset. The R package factoextra was used to
   visualize the PCA output.

Immunocytochemistry of LMTK3

   A tissue microarray including 93 surgical colorectal cancer specimens
   was purchased from Shanghai Outdo Biotech Co., Ltd. (HColA180Su19). The
   primary antibody applied for immunohistochemical detection of LMTK3
   protein expression was anti-LMTK3 (sc-100418, Santa Cruz
   Biotechnology). Two experienced pathologists scored the staining
   results in a blinded manner. The immunostaining intensity was scored as
   1 (low, +), 2 (medium, ++) or 3 (high, +++) ([359]Figure S6G), and the
   stained percentage was scored semiquantitatively as 1 (0-25%), 2
   (26-50%), 3 (51-75%) or 4 (76-100%). Both the cell nucleus and
   cytoplasm were scored separately. Multiplication of the intensity and
   percentage scores, followed by the summation for those from the nucleus
   and cytoplasm, resulted in a score ranging from 1 to 24 for each
   tissue.

Single-cell transcriptome analysis

   We retrieved single-cell 3′ mRNA sequencing data from the Gene
   Expression Omnibus (GEO) database under accession code
   [360]GSE132465,[361]^35 which included 23 colorectal cancer patients
   with 23 primary colorectal cancer and 10 matched normal mucosa samples.
   We followed the pipeline by the author to process the dataset[362]^35
   and divided it into the comparison between EOCRC and LOCRC. Briefly,
   the sequencing data were aligned to the human reference genome (GRCh38)
   and processed using the CellRanger 2.1.0 pipeline (10x Genomics). The R
   package Seurat[363]^67 was used to filter the raw gene expression
   matrix according to the following criteria, namely, cells with >1,000
   unique molecular identifier (UMI) counts, >200 genes and <6,000 genes,
   and <20% mitochondrial gene expression in UMI counts, which generated a
   filtered gene expression matrix with 67,296 cells. This filtered matrix
   was then normalized to the total UMI counts per cell and transformed to
   the natural log scale. The multiple canonical correlation analysis
   (multiCCA) implemented in Seurat was used for batch correction. To
   eliminate cells of ambiguous identity, multiCCA and reference component
   analysis (RCA; R package RCAv2)[364]^68 were combined for initial
   clustering and cell type identification, and discordant cells were
   removed from the downstream analysis. After defining the global cell
   type, cells with a number of genes exceeding the outliers were removed
   to eliminate potential doublets, which generated a final gene
   expression matrix with 63,689 cells available in this study. The
   t-distributed stochastic neighbor embedding (tSNE) deposited in R
   package Seurat was used to cluster cells in low-dimensional space.

Quantification and statistical analysis

   Statistical power of SMG identification was calculated through a
   binomial model implemented in R package cancerSeqStudy[365]^69 based on
   the parameters of sample size, effect size (i.e., population mutation
   rate), basepair mutation rate (per base), and significance level (at
   p = 5e-6). The Wilcoxon rank-sum test, t test, and Kruskal-Wallis test
   were used for quantitative data comparisons, and Fisher’s exact test
   and χ^2 test were used to assess the difference in qualitative data, as
   appropriate. The R package UpSetR[366]^70 was used to visualize set
   intersections or overlap components across subgroups. The area under
   the receiver operating characteristic curve was used to assess the
   ability to distinguish between EOCRC and LOCRC using DMGs. The
   Kaplan-Meier survival curve was generated to assess the 5-year survival
   rate, and the log-rank test was used to test for significant
   differences. All reported p values were two-sided.

Acknowledgments