Abstract
Genetic variants can affect protein function by driving aberrant
subcellular localization. However, comprehensive analysis of how
mutations promote tumor progression by influencing nuclear localization
is currently lacking. Here, we systematically characterize potential
shuttling-attacking mutations (SAMs) across cancers through developing
the deep learning model pSAM for the ab initio decoding of the sequence
determinants of nucleocytoplasmic shuttling. Leveraging cancer
mutations across 11 cancer types, we find that SAMs enrich functional
genetic variations and critical genes in cancer. We experimentally
validate a dozen SAMs, among which R14M in PTEN, P255L in CHFR, etc.
are identified to disrupt the nuclear localization signals through
interfering their interactions with importins. Further studies confirm
that the nucleocytoplasmic shuttling altered by SAMs in PTEN and CHFR
rewire the downstream signaling and eliminate their function of tumor
suppression. Thus, this study will help to understand the molecular
traits of nucleocytoplasmic shuttling and their dysfunctions mediated
by genetic variants.
Subject terms: Mutation, Oncogenes, Protein translocation, Cancer
genomics, Cancer models
__________________________________________________________________
Genetic variants can affect protein function and promote tumour
progression by driving aberrant subcellular localization. Here, the
authors characterise potential shuttling-attacking mutations (SAMs)
across cancers using a deep learning model and experimental validation,
and find how SAMs disrupt critical signalling interactions involved in
cancer progression.
Introduction
Many shuttling proteins are transported between different subcellular
compartments in response to external signals and regulate a broad
spectrum of biological processes^[50]1–[51]3. This aberrant shuttling
of proteins is often driven by binding to specific molecules that
recognize distinct targeting signals. The dynamic nucleocytoplasmic
trafficking process is functionally and mechanistically diversified and
is mostly regulated by the nuclear localization signal (NLS) and
nuclear export signal (NES)^[52]4,[53]5. The NLS is recognized by the
corresponding nuclear transporters, classically members of the importin
superfamily, which can interact with nucleoporins to help
NLS-containing proteins entry the nucleus through the nuclear pore
complex (NPCs)^[54]6. However, many nuclear proteins do not contain
classical NLSs, and must either use alternate entry mechanisms
including non-classical NLSs, binding with other nuclear localized
proteins, direct interaction with NPCs and nuclear retention mediated
by DNA or RNA affinity^[55]7–[56]9. In recent years, an increasing
number of proteins initially identified in the cytoplasm have also been
observed to localize to the nucleus and perform different functions,
especially in tumors and disease states^[57]10–[58]14. These
discoveries further underscore the importance of additional
investigations on the regulation of protein subcellular localization.
Cancer-driving mutations contribute to abnormal cellular functions that
trigger tumorigenesis through a variety of mechanisms. To date, the
pathological mechanisms of only a fraction of known mutations have been
elucidated. In particular, for protein nuclear localization, only
several investigations have described protein nucleocytoplasmic
shuttling due to cancer-driving mutations^[59]15–[60]18. However, all
of these studies reported a limited number of mutations in individual
proteins. An urgent need is to systematically identify cancer mutations
altering protein nucleocytoplasmic shuttling that drives tumorigenesis,
and an effective in silico predictor is necessary. Numerous studies
have contributed to predicting protein nuclear
localization^[61]19–[62]29, and these efforts have substantially
improved our understanding of proteins localized in specific
organelles. However, limitations still exist. Most predictors yield
prediction scores only for nuclear localization and do not pinpoint
which sequences potentially modulate transport into the nucleus. Other
tools^[63]30–[64]33 build models and conduct predictions based on
existing targeting peptides, which are unable to predict proteins with
unconventional localization sequences. Furthermore, most of these
models are based on classical machine learning algorithms and sequence
alignment methods, and there is still room for performance improvement.
Thus, although disrupting nuclear localization is a critical mechanism
by which mutations cause protein dysfunction, a computational tool for
identifying shuttling-attacking mutations (SAMs) is lacking; therefore,
systematic studies on this topic are lacking.
In this work, to elucidate SAMs, we construct a deep-learning framework
to decipher the impact of single amino acid mutations on protein
nuclear localization. First, we build a deep learning model, named
prediction of SAMs (pSAM), based on a hybrid convolutional neural
network (CNN) and bidirectional long short-term memory (BiLSTM)
architecture with an attention mechanism to accurately predict the
potential nuclear localization of proteins. Since deep learning methods
can independently identify helpful information in sequences without
relying on knowledge-based biological features and solve scientific
problems of interest well^[65]34–[66]36, we avoid using known targeting
peptides to detect potential vital regulatory regions in the protein
sequence. Instead, the full-length sequence of the protein is used to
predict specific nuclear localization, and a residue-level contribution
analysis (RCA) method is utilized for model interpretation to score the
residue-level contribution. The rigorous evaluation convincingly
reveals that the pSAM model built in this study accurately predicts
nuclear localization probability with state-of-the-art performance, and
ab initio prediction without any prior NLS or NES information confers
the ability to pinpoint key nuclear localization-related sequence
determinants beyond the canonical presequences and internal signals.
While investigating somatic mutations across 11 variant-abundant cancer
types, we find that SAMs constitute approximately 6.7% of the mutations
in NLSs and that cancer mutations in the sequence determinant regions
are more frequent and more deleterious. The potential disruption of
nucleocytoplasmic shuttling induced by single-nucleotide or truncated
mutations is analyzed based on this predictor and partially
experimentally validated. Overall, since most predicted disruptions are
confirmed by experiments, our proposed model is shown to be accurate at
the residue level and can facilitate investigations into the molecular
mechanism of protein nuclear localization.
Results
Mutations might disrupt protein nuclear localization but are largely unknown
Cancer-causing mutations lead to vicious consequences through numerous
molecular mechanisms^[67]37. The elucidation of protein SAMs requires
potential protein sequence determinants that are essential for nuclear
localization to be decoded (Fig. [68]1A). We collected all cancer
mutations across 11 cancer types in The Cancer Genome Atlas (TCGA) and
compared the occurrence of known missense mutations in the targeting
peptides and other regions. We integrated several well-known databases,
including UniProt, SeqNLS, ValidNESs and NESbase, to annotate
experimentally validated NLSs/NESs in proteins. The frequency of
mutations was significantly greater in the targeting peptide regions of
NLSs/NESs (Fig. [69]1B and Supplementary Fig. [70]1A, B). We further
used simulated mutations (i.e., signature-corrected randomized
mutations) in the mutation enrichment analyses; using this, we found
that putative mutations were significantly depleted in the validated
NLS/NES regions (Supplementary Fig. [71]1C, D). We also analyzed the
nuclear localization of cancer driver and non-driver genes. The results
indicated that nearly 60% of the cancer driver genes are located in the
nucleus, while only approximately 25% of non-driver genes are located
in the nucleus (Supplementary Fig. [72]1E-F). We investigated whether
mutations in targeting peptide regions are more likely to disrupt
protein functions by introducing a deleterious score to measure their
deleteriousness. Mutations in targeting peptide regions were more
deleterious (Fig. [73]1C and Supplementary Fig. [74]1G, H), suggesting
that these mutations significantly predispose patients to protein
dysfunction. Because transcript factors (TFs) and transcriptional
repressors perform transcriptional regulatory functions upon entering
the nucleus, we explored the transcriptional regulatory effect of
mutations in the experimentally determined NLS region of TFs and
transcriptional repressors on their substrates in the TCGA-UCEC cohort.
BACH2 is a transcriptional repressor that may play a role in cancer and
neuronal differentiation^[75]38. Interestingly, we noticed a
significant upregulation of its substrates when the mutations were
located at the experimentally validated NLS regions, for example, those
of TXNRD1 (Fig. [76]1D) and CLIP1 (Fig. [77]1E), two well-known
oncogenic drivers that promote tumorigenicity^[78]39,[79]40. These
results suggest that mutations in the targeting peptide region may
affect the nucleocytoplasmic shuttling of the protein, thereby
regulating its biological function.
Fig. 1. Study overview.
[80]Fig. 1
[81]Open in a new tab
A Somatic mutations from 11 cancer types in TCGA were analyzed to
determine their potential effect on nucleocytoplasmic shuttling and
revealed a significant enrichment in targeting peptides. A deep
learning model, pSAM, was then constructed to precisely predict the
protein nuclear localization probability and ab initio inference of
protein sequence determinants without knowledge of known targeting
peptides based on deep learning coupled with residue-level contribution
analysis. The pSAM model and somatic mutation dataset were subsequently
used for downstream analyses. B Mutation frequency distribution of the
targeting peptide regions and other regions from 11 cancer types in
TCGA. The proteins with validated NLS/NES region were included (n = 564
proteins), and the NLS/NES regions were compared with the rest regions.
The data are presented as a box-and-whisker graph (bounds of box: first
to third quartile, bottom and top line: minimum to maximum, central
line: median). C The deleteriousness score distribution of mutations
localized in targeting peptide regions and other regions from 11 cancer
types in TCGA. The target regions include valid NLS and NES regions. D,
E Mutations in NLS regions of BACH2 affect the expression of its
transcriptional substrates, (D) CLIP1 and (E) TXNRD1, in the TCGA-UCEC
cohort (Mutated group: n = 4 samples; Non-Mutated group: n = 579
samples). The data are presented as a box-and-whisker graph (bounds of
box: first to third quartile, bottom and top line: minimum to maximum,
central line: median). Two-sided Wilcoxon test was used for (B),
two-sided Chi-square test was used for (C), two-sided Student’s t test
was used for (D) and (E). Source data are provided as a Source Data
file.
We collected eukaryotic protein sequences and localization annotations
from the UniProt database^[82]41 and summarized the distributions of
nuclear localization (Supplementary Fig. [83]1I). Amino acid frequency
statistics indicated that nucleus-localized proteins contain a greater
number of positively charged residues (Supplementary Fig. [84]1I),
consistent with previous reports that the classical NLS motif consists
of lysine and arginine^[85]7. In addition, we found that serine was
also enriched in nucleus-localized proteins and that proline and
glutamate were enriched in extracellular proteins.
Ab initial identification of sequence determinants with a deep learning
approach
We constructed a deep neural network, pSAM, to predict protein nuclear
localization and analyzed the residue-level contribution of the nuclear
localization probability calculation to dissect the decision-making
process of pSAM and identify the sequence determinants for nuclear
localization (DNLs). We collected eukaryotic protein sequences and
localization annotations from the UniProt database^[86]41
(Supplementary Fig. [87]2A). We randomly divided the whole dataset
(20-2000 amino acids) into a training set, validation set, and testing
set at a ratio of 7:2:1. Then, we use a combined deep neural network
based on CNN, BiLSTM, and attention mechanisms to construct the model
to predict nuclear localization probability (Fig. [88]1A). The input
protein sequences were one-hot encoded into a (2000 ×21) embedded
matrix, which was used as the input for the multi-subnetwork CNN
layers. Then, the output was passed to the BiLSTM layer, followed by
the attention layer based on Luong-style attention to assign an
importance score to each hidden state of the BiLSTM layer. The Adam
algorithm was used to optimize the model, and batch normalization,
dropout, and early stopping (Supplementary Fig. [89]2B) were employed
to avoid overfitting the model. Hyperparameters were carefully tuned
for the residual block combination, CNN node size, dropout rate, and
recurrent gate size (Supplementary Data [90]1).
To assess the predictive power of the constructed deep neural network
model, we evaluated various aspects of performance. First, the
structures of the merged model and other individual models were
compared, and the robust performance of the combined deep neural
network based on CNN, BiLSTM, and attention mechanisms showed its
superiority (Supplementary Fig. [91]2C). The optimized model named pSAM
presented satisfactory robustness in the ten-fold cross-validations,
with area under the receiver operating characteristic (ROC) curve (AUC)
values of 0.9726–0.9848 (Supplementary Fig. [92]2D). Finally, pSAM
achieved AUC values of 0.9932, 0.9876, and 0.9865 in the training
dataset, validation dataset, and testing dataset, respectively
(Supplementary Fig. [93]2E). With a median nuclear localization
probability of 0.5 as the cutoff, the true negative rate was 99.06%,
and the true positive rate was 88.49% (Supplementary Fig. [94]2F).
After screening the best threshold for obtaining the optimal
sensitivity and specificity, the true negative and true positive rates
were 98.76% and 93.54%, respectively (Supplementary Fig. [95]2G).
As mentioned above, there are a handful of in silico tools applicable
to nucleus-specific or general prediction of subcellular localization.
Considering both predictor availability and classification system
diversity, we compared pSAM with the general localization prediction
tools YLoc^[96]22, PSORT_II^[97]20, MultiLoc^[98]26, DeepLoc^[99]27 and
MULocDeep^[100]28 and the nuclear localization-specific prediction tool
NucPred^[101]29. We also trained another model using 8,421
experimentally verified nuclear proteins and 18,278 non-nuclear
proteins from NLSdb based on the same architecture as pSAM^[102]42.
Comparing all tools, pSAM outperformed the other tools, although the
1000 proteins used for evaluation were not included in the training
dataset of pSAM but were probably in the other tools mentioned above
(Fig. [103]2A). Taken together, these findings indicate that pSAM shows
state-of-the-art performance for predicting nuclear localization.
Fig. 2. Evaluation of the performance of the pSAM model and identification of
DNLs with residue-level contribution analysis.
[104]Fig. 2
[105]Open in a new tab
A pSAM shows significantly better predictive performance than existing
tools for 1000 randomly selected proteins. B The pSAM-predicted
probabilities on nuclear-localized peptides whose NLS activity scores
were measured previously. The data are presented as a box-and-whisker
graph (bounds of box: first to third quartile, bottom and top line:
minimum to maximum, central line: median). NLS activity score classes:
scores 9−10 (n = 57 peptides), scores 5-8 (n = 168 peptides) and scores
1-4 (n = 149 peptides). C The delta scores between amino acids located
in known NLS regions and randomly selected regions. Alterations in
nuclear localization probability was calculated for 8920 amino acids in
NLSs and randomly selected regions, respectively. The data are
presented as a box-and-whisker graph (bounds of box: first to third
quartile, bottom and top line: minimum to maximum, central line:
median). D The position relative to the terminal terminus of the
predicted DNL regions. E The residue-level contribution of the nuclear
localization probability of PML. F The residue-level contribution of
the nuclear localization probability of TP53. G The t-SNE distribution
of predicted NLS-like and NES-like regions. H The t-SNE distribution of
known NLSs and NESs. I The number of known and predicted NLSs and NESs.
J Domain analysis of NLS regions and NLS-like regions retrieved from
NLSdb and pSAM. The error bands represent 95% confidence intervals. K
Shuffling analysis of known NLSs and predicted determinants of nuclear
localization (DNLs). For the 1752 selected proteins, the matched
wildtype sequence (WT), validated NLS-truncated sequence, DNL-truncated
sequence and randomly truncated sequence were compared. The data are
presented as a box-and-whisker graph (bounds of box: first to third
quartile, bottom and top line: minimum to maximum, central line:
median). L Coverage of known and predicted targeting peptides for
nuclear localization. Two-sided Kruskal-Wallis test was used for (B),
two-sided Student’s t test was used for (C), two-sided Pearson
correlation test was used for (J), two-sided Wilcoxon test was used for
(K). Source data are provided as a Source Data file.
Since pSAM does not utilize any prior knowledge about nuclear
localization, we further assessed its performance by interrogating the
consistencies between the predictions and current knowledge. First, we
performed predictions for specific proteins known to localize in the
nucleus, including proteins containing ankyrin repeat domains,
histones, proteins with validated NLS annotations, TFs and their
cofactors. These proteins were predicted to have a significantly higher
nuclear localization probability when compared against the whole
proteome, especially the widely studied TFs and histones (Supplementary
Fig. [106]2H). Furthermore, we predicted the nuclear localization score
of the NLS peptides generated by Kosugi et al.^[107]30, from which the
relative NLS activities were detected according to the localization
phenotype of the green fluorescent protein (GFP) reporter
(Supplementary Data [108]2). The scores predicted by pSAM and the
measured NLS activity showed satisfactory agreement (Fig. [109]2B). We
then calculated the delta score, which refers to the alteration in
nuclear localization probability upon simulated region loss or
simulated mutation. As expected, the residues located in the
experimentally validated NLSs exhibited higher delta scores than the
residues in randomly selected regions (Fig. [110]2C and Supplementary
Data [111]3).
We then performed residue-level contribution analysis (RCA) to dissect
the decision-making process of pSAM and identify the DNLs
(Supplementary Fig. [112]3A and Supplementary Data [113]4). We also
calculated the relative positions of these DNLs in the proteins,
observing that although DNLs can be located at almost any part of the
protein sequence, there is a striking enrichment at the N-terminus
(Fig. [114]2D), which allows the NLS to be recognized by importin
proteins^[115]7. We further compared the known NLSs and the newly
defined DNLs. For example, promyelocytic leukemia protein (PML) is a
protein that exhibits antiviral activity against both DNA and RNA
viruses^[116]43,[117]44 and was annotated to be able to localize to the
nucleus in the UniProt database; its NLS is the sequence
476-KRKCSQTQCPRKVIK-490. We observed that the delta score and nuclear
localization probability rapidly increased from residue T473 to K490
(Fig. [118]2E), which caused the final nuclear localization probability
of PML to reach 0.625, a moderate score. These findings for PML present
high concordance with the findings of a previous study^[119]45.
Furthermore, we performed a detailed analysis of another representative
protein, cellular tumor antigen p53 (TP53), a tumor suppressor
expressed in many tumor types that induces growth arrest or apoptosis,
depending on the physiological circumstances^[120]46,[121]47. Three
DNLs on the p53 protein identified by pSAM were predicted to be
significantly associated with nuclear localization (Fig. [122]2F). The
second DNL, 319-KKK-321, overlapped with the known NLS, and the third
DNL, 339-EMFRELNEALELADKQ-354, overlapped with the known NES,
indicating that both regions are critical residues regulating p53
nuclear localization. Although p53 was annotated to have only one NLS
and one NES according to the UniProt database, it was reported to have
an NLS at positions 305-321 and two NESs at both
terminals^[123]48–[124]50, consistent with the positioning of the first
DNL, 28-ENNVLSPLPSQAMDDLM-44, at the N-terminus. The results indicated
high concordance between the identified DNLs and known NLSs/NESs.
The predicted DNLs were then classified into NLS-like, NES-like and NA
types to further interpret the predictions from pSAM, according to the
residue composition of NLS and NES motifs^[125]42. To increase their
credibility, we further calculated the false discovery rates (FDRs) of
NLS-like and NES-like DNLs based on the sequence similarity with
experimentally validated NLSs and NESs, respectively. The DNLs with an
FDR < 0.05 were annotated as the predicted NLSs or NESs. The predicted
NLS-like and NES-like DNLs were very close to the known NLSs and NESs
(Fig. [126]2G, H), verifying the reliability of predicted NLSs and
NESs. Our approach has defined 3,095 NLSs and 1,394 NESs, largely
extending the knowledge base of NLSs and NESs (Fig. [127]2I and
Supplementary Data [128]4).
The amino acid composition and sequence structure were similar between
predicted NLSs/NESs and validated NLSs/NESs (Supplementary
Fig. [129]3B). The motif analysis detected the regions significantly
enriched in two classical NLS motifs and other regions enriched in DNA
binding-related motifs (Supplementary Data [130]5). Interestingly, the
domain analysis revealed that the NLS and NLS-like DNLs were enriched
in DNA-binding domains such as homeodomains and helix-loop-helix
DNA-binding domains (Supplementary Data [131]6-[132]7). We compared the
domain enrichment results for NLS-like regions with known NLS regions
in NLSdb (Supplementary Data [133]6-[134]7). The two datasets were
consistent, and both contained enriched zinc finger structural domains
(Fig. [135]2I), while previous studies revealed that DNA-/RNA-binding
domains in diverse nucleic acid-binding proteins potentially function
as NLSs/NESs^[136]51,[137]52. We shuffled the annotated regions for
proteins with known targeting peptides, and the significant decrease in
predicted nuclear localization probabilities for the truncated proteins
confirmed the accuracy of pSAM (Fig. [138]2J); the further decrease
observed for the proteins with predicted truncated regions indicated
that the current annotation of sequences critical for nuclear
localization was insufficient (Fig. [139]2K), which is intuitive. In
addition, we performed shuffling analysis of the N-terminus, middle
region, and C-terminus of the human proteome. A striking decrease in
nuclear localization probability was observed in N-terminally truncated
proteins, consistent with the previous discovery that the N-terminus is
enriched in key sequences critical for nuclear localization
(Supplementary Fig. [140]3C).
Many DNLs have not been classified into NLSs/NESs due to a lack of
sequence features, and a tempting speculation is that these proteins
are transported into the nucleus through specific pathways. For
example, gelsolin family members enter the nucleus by binding directly
to NTF2 without the involvement of importin^[141]53. This suggested
that some DNLs might be involved in protein interactions, but an
approach for validating these functions and the nuclear proteins with
which they interact remains elusive. Annotation of the determinants
revealed that the potential targeting peptide annotation covers
approximately 75% of human proteins (Fig. [142]2L), and further
analysis on these regions will improve our understanding of subcellular
localization.
Systematic analyses of SAMs across cancers
With this state-of-the-art predictor of protein nuclear localization,
we speculate that abnormal protein nucleocytoplasmic shuttling is
driven by cancer-related mutations. First, we compared the occurrence
of known missense mutations in the determinants and other regions, and
the results showed that the frequency of mutations was significantly
greater in DNLs (Fig. [143]3A). The results of signature-corrected
randomized mutation simulation showed that simulated mutations were
significantly depleted in the predicted NLS/NES regions (Supplementary
Fig. [144]3D). We investigated whether the mutations in DNLs were more
likely to disrupt protein functions by introducing a deleterious score
to measure their deleteriousness. Mutations in DNL regions were more
deleterious (Fig. [145]3B), suggesting that these mutations
significantly predispose patients to protein dysfunction. We calculated
the relative delta nucleus probability according to the percentage of
simulated mutation sites. We selected a two-tailed alpha > 5% as the
statistical significance of mutation rates. When the mutation rate was
5%, the delta score was approximately 0.1 (exactly, 0.1022), so we used
0.1 as the threshold for determining whether a site was significant
(Fig. [146]3C). Then, we calculated the potential changes in protein
nuclear localization during mutagenesis for all cancer mutations in
TCGA (Supplementary Data [147]8). Approximately 6.7% of the mutations
in the NLS regions affected protein nuclear localization
(Fig. [148]3D). Several protein functional clusters, including enzymes,
TFs, and cancer-related proteins, were selected to inspect the effects
of mutagenesis on protein nuclear localization. The prediction results
from pSAM were consistent with most deleterious predictors
(Supplementary Data [149]8). Pathway enrichment analysis revealed that
the SAM-affected proteins were enriched in pathways related to mTORC1
signaling, IL2-STAT5 signaling, Myc signaling and PI3K-akt-mTOR
signaling, all of which are involved in the occurrence and development
of tumors, indicating that these SAMs would regulate the occurrence and
development of tumors through various biological processes by affecting
the nucleocytoplasmic shuttling of proteins (Fig. [150]3E).
Fig. 3. The discovery of abnormal nuclear localization driven by
cancer-related mutations.
[151]Fig. 3
[152]Open in a new tab
A Mutation frequency distribution of predicted DNL regions and other
regions. Proteins with annotated somatic mutation from 11 cancer types
in TCGA were included (protein number: BLCA, n = 3934 proteins; BRCA,
n = 3746 proteins; CESC, n = 3582 proteins; COAD, n = 4210 proteins;
HNSC, n = 3606 proteins; LIHC, n = 2812 proteins; LUAD, n = 3948
proteins; LUSC, n = 3909 proteins; SKCM, n = 4079 proteins; STAD,
n = 4092 proteins; UCEC, n = 4417 proteins). The data are presented as
a box-and-whisker graph (bounds of box: first to third quartile, bottom
and top line: minimum to maximum, central line: median). B The
deleteriousness score distribution of mutations located in predicted
DNL regions and other regions. C Relative delta scores according to the
percentage of simulated mutation sites. We selected a two-tailed alpha
> 5% as the level of statistical significance of mutation rates. When
the mutation rate was 5%, the delta score was approximately 0.1
(0.1022), so we used 0.1 as the threshold for determining whether a
site was significant. D The percentage of SAMs with different delta
score cutoffs in NLS regions. E Enrichment analysis of mutated genes
affecting nuclear localizations. F Diagram depicting the strategy for
associating SAMs with putative transcription factor substrates. G COAD
patients (n = 411 patients) carry SAMs in TP63 and ATF3, affecting the
expression of their transcriptional substrates BLCAP and NOTUM. 15 SMAs
in TP63 and 9 SAMs in ATF3 were detected, which divided the COAD
patients into SAM-carried patients and non-SAM patients. The data are
presented as a box-and-whisker graph (bounds of box: first to third
quartile, bottom and top line: minimum to maximum, central line:
median). Two-sided Wilcoxon test was used for (A) and (G), two-sided
Chi-square test was used for (B) and two-sided hypergeometric test was
used to calculate p values and Benjamini-Hochberg method was used to
adjusted the p values for (E). TF, transcript factor, WT, wild type,
Mut, mutated type. Source data are provided as a Source Data file.
As an approach to expanding the molecular mechanisms of SAMs, we
reasoned that TFs might be a bridge connecting SAMs and biological
functions because the mRNA expression levels of the substrates might
serve as indices for TF protein function. Considering that TFs function
after entering the nucleus, SAMs might block this process and lead to a
decrease in the mRNA expression of their targets (Fig. [153]3F). As a
proof of principle, differentially expressed genes were analyzed for
tumors harboring mutations in TP63, which is a known tumor suppressor
TF. TP63 upregulates the expression of BLCAP, encoding a protein that
reduces growth arrest at the G(1)/S checkpoint and reduces cell growth
by stimulating apoptosis^[154]54. When TP63 carries SAMs, the nuclear
localization of its encoded protein decreases, which leads to decreased
expression of BLCAP, therefore promoting the progression of tumors
(Fig. [155]3G). ATF3 and NOTUM are another example of a TF-substrate
pair that also reflects the loss of the nuclear localization ability of
TFs due to mutation, regulating tumor cell stemness and growth^[156]55.
Recently, an increasing number of proteins initially identified in the
cytoplasm have also been observed to localize to the nucleus and
perform different functions, especially in tumors and disease
states^[157]10–[158]14. We summarized the SAMs that increased the
nuclear import ability of tumor suppressor genes and constructed a
protein‒protein interaction network for those genes that acquired
nuclear localization capability (Fig. [159]4A). The cytoplasmic
proteins that interacted with those genes were enriched in housekeeping
pathways, such as those of the tricarboxylic acid cycle, focal adhesion
and cytokine-cytokine receptor interactions (Fig. [160]4B). However,
proteins located in the nucleus play important roles in
tumor-associated pathways, such as ubiquitin-mediated proteolysis and
the cell cycle, providing a potential explanation for mutations in
tumor suppressor genes leading to tumor progression. TNK1 is known to
be localized in the cytoplasm and may serve as an oncogene or a tumor
suppressor depending on the cell context^[161]56,[162]57. However, few
studies have determined how TNK1 gene mutations cause cancer, and none
of them have considered the potential effect of nuclear localization.
We discovered numerous mutations that could result in TNK1 nuclear
localization (Figs. [163]4A and [164]4C). 14-3-3 proteins interact with
and inhibit the kinase activity of TNK1 in the cytoplasm^[165]56. TNK1
gained nuclear localization capability while harboring the detected
SAMs. Tyrosine kinases can regulate a wide range of TFs, and we
discovered that some carcinogenic TFs, such as JUN, could act
downstream of TNK1. The above analysis provides a possible explanation
for the duality of TNK1 function.
Fig. 4. Validation of nuclear localization affected by mutations.
[166]Fig. 4
[167]Open in a new tab
A PPI network of tumor suppressor genes harboring SAMs and their
experimentally validated interacting proteins. B KEGG enrichment of the
interacting proteins located in different nucleocytoplasmic contexts. C
Diagram of the mechanism by which TNK1 functions as a tumor suppressor
in the cytoplasm and SAM blocks this process by entering the nucleus,
which induces an oncogenic function. D The residue-level contribution
of the nuclear localization probability of KAT8 and its R144C mutant. E
Immunofluorescence staining for KAT8 and its R144C mutant. Scale bar =
10 μm. Representative experiment out of n = 3 biological replicates. F
Affinity validation of selected wild-type and mutant proteins with
importin. Representative experiment out of n = 3 biological replicates.
Two-sided hypergeometric test was used to calculate p values and
Benjamini-Hochberg method was used to adjusted the p values for (B).
WT, wild type; Mut, mutated type; NLPT, nuclear localization
probability trajectory. Source data are provided as a Source Data file.
We attempted to determine the relationship between phosphorylation and
nuclear localization in the DNL region. We first mutated the Ser/Thr
sites on DNLs to the phosphomimetic Asp and the inactive mimic Ala.
After further enrichment analysis of proteins with significantly
altered nuclear localization potential, we found that phosphorylated
PLK3 substrates were more likely to have altered nuclear localization
ability (Supplementary Fig. [168]4A), and PLK3 expression was
associated with the overall survival of patients (Supplementary
Fig. [169]4B). PLK3 is a mediator of apoptosis and cellular stress that
regulates the cell cycle and functions as an oncogene or tumor
suppressor in a variety of cancers. However, researchers have not
elucidated the function of the phosphorylation of its substrate by
PLK3, and our analysis provides a possible molecular mechanism by which
PLK3-mediated modification of its substrates affects nuclear
localization and thus regulates downstream biological networks.
Validation of the predicted perturbation of nucleocytoplasmic shuttling
driven by SAMs
Due to the lack of standard datasets to assess the performance of pSAM,
we selected ten SAMs (KAT8 p.R144C, DEAF1 p.K304N, NFE2L1 p.R674C, CMAS
p.R201W, PTEN p.R14M, NHLRC1 p.G274W, CHFR p.P255L, SYNPO2 p.G4E, MAPK1
p.G136E and TP63 p.R393Q), which were predicted to have a significant
impact on nuclear localization probability (Fig. [170]4D and
Supplementary Fig. [171]5A), to further validate the performance of
pSAM. Among all 10 predicted SAMs, eight (KAT8 p.R144C, DEAF1 p.K304N,
NFE2L1 p.R674C, CMAS p.R201W, PTEN p.R14M, NHLRC1 p.G274W, CHFR p.P255L
and SYNPO2 p.G4E) were proven to exhibit a considerable degree of
decreased nuclear localization, and only two SAMs failed to be
demonstrated with positive results (Fig. [172]4E and Supplementary
Fig. [173]5B). Previously, the KAT8 lysine acetyltransferase was
reported to be essential for cancer development, and several studies
have shown that KAT8 is a potential therapeutic target for
cancer^[174]58–[175]60. Nuclear localization probability trajectory
(NLPT) analysis revealed that KAT8 underwent a significant change in
nuclear localization after mutation at R144 (Fig. [176]4D), and
immunofluorescence staining revealed that the KAT8 R144C mutation
significantly abolished the nuclear localization ability of the protein
(Fig. [177]4E).
Furthermore, we scanned SAMs on NES regions, and then predicted their
impacts on the nuclear localization probability of proteins. The
protein CHP1 has two validated NES regions: 138-147 and 176-185
(Supplementary Data [178]2). The first NES region was predicted to have
a SAM, namely CHP1 p.R140L (Supplementary Data [179]8). Therefore, we
experimentally examined the effects of the p.R140L mutation on the
nuclear localization probability of CHP1. The results revealed that the
wild type of CHP1 was excluded from the nucleus, while the mutated type
of CHP1 was retained in the nucleus, demonstrating that the p.R140L
mutation would significantly impair the nuclear export function of the
NES motif (Supplementary Fig. [180]5C, D).
In addition, we observed interesting cases in which individual point
mutations had little effect, but synergistic mutation of two loci
significantly reduced the nuclear localization of the protein
(Supplementary Fig. [181]5E, F). Through a literature search, we
selected six proteins with the potential to bind to the nuclear import
factor importin. We identified the impact of mutations on the binding
of these proteins to importins through affinity tests. Among these six
proteins, four exhibited decreased binding to importin due to mutation
effects (Fig. [182]4F).
Since the current NLS annotations were incomplete, we performed a
truncation analysis of the identified DNLs and screened proteins for
experimental validation (Supplementary Data [183]9). As shown in
Supplementary Fig. [184]6, after truncating the known NLS signals of
QKI and METTL3, the nuclear entry ability of these proteins did not
decrease significantly. However, the truncation of the predicted
regions significantly reduced the nuclear localization level. METTL3 is
an N6-methyltransferase that methylates adenosine residues at the N(6)
position of RNAs, thus regulating various cellular processes, such as
the response to DNA damage, the circadian clock and primary miRNA
processing^[185]61–[186]63. According to the prediction from pSAM,
METTL3 contains three potential DNLs, the third of which is a known
NLS^[187]64. Interestingly, truncation of the NLS did not lead to a
significant decrease in the nuclear localization fluorescence signal.
After truncating the three predicted regions, the nuclear localization
ability of METTL3 decreased significantly, which further verified the
reliability of the prediction from pSAM. In addition, both truncating
the known NLS signal and identifying DNLs led to a decrease in the
nuclear accumulation of NTH. According to the NLPT analysis, the known
NLS sequence exhibited the highest delta score, suggesting that this
region plays a key role in regulating NTH nuclear localization. HIPK3
and CDK8, which have no known NLSs, also showed a significant decrease
in nuclear entry after the predicted DNLs were truncated (Supplementary
Fig. [188]6). Taken together, these findings show that pSAM can
precisely predict the sequence determinants critical for nuclear
localization at the residue level.
R14M disrupted the nuclear localization and tumor-suppressive function of
PTEN
PTEN has been reported to be a tumor suppressor with lipid phosphatase
activity and shown to be a potential therapeutic target for
cancer^[189]65,[190]66. PTEN localizes to both the nucleus and the
cytoplasm and shuttles between each compartment through numerous
mechanisms^[191]67. Thus, we selected the PTEN R14M mutant for
subsequent experimental and functional validation. NLPT analysis
revealed that PTEN underwent a significant change in nuclear
localization after mutation at R14 (Fig. [192]5A and Supplementary
Fig. [193]5A), and immunofluorescence staining revealed that the PTEN
R14M mutation significantly abolished the nuclear localization of the
protein (Fig. [194]5B and Supplementary Fig. [195]5B). R14M is located
at the junction where the phosphatase and an N-terminal α-helix region
are connected in the crystal structure of PTEN (Fig. [196]5C), which is
exposed at the surface of PTEN. Previous studies have emphasized the
critical role of the N-terminal sequence in promoting the nuclear entry
of PTEN^[197]68. PTEN modulates cell survival by antagonizing the
PI3K/Akt/p27 signaling pathway in the cytoplasm^[198]69,[199]70;
however, when it is located in the nucleus, it increases cyclin D1
levels and G0-G1 cell cycle arrest through the downregulation of
ERK^[200]71 and mediates p53-dependent apoptosis^[201]72. Nuclear PTEN
exerts a stronger tumor-suppressive effect than cytoplasmic
PTEN^[202]67. Our analysis suggested that the CCND1 mRNA and p27
phosphorylation levels were significantly increased in patients
carrying PTEN mutations, while the p53 protein level was significantly
decreased (Fig. [203]5D). We confirmed that PTEN could regulate the
phosphorylation of ERK and that the R14M mutation disrupted this
process (Fig. [204]5E and Supplementary Fig. [205]7A). To identify the
biological functions of the PTEN R14M mutation, we performed
transcriptome analysis and found that several oncogenic pathways, such
as the oxidative phosphorylation, mTOR and MAPK signaling pathways,
were dysregulated (Fig. [206]5F and Supplementary Fig. [207]7B).
Further cytological experiments showed that the PTEN R14M mutant
significantly increased tumor growth and colony formation
(Fig. [208]5G–I and Supplementary Fig. [209]7C). In vivo tumorigenesis
experiments showed that the PTEN R14M mutant significantly reduced the
tumor-suppressive function of PTEN (Fig. [210]5J and Supplementary
Fig. [211]7D). Based on these results, the PTEN R14M mutation disrupts
the function of the PTEN protein by regulating its nuclear
localization. In addition, we randomly selected a tumor suppressor
gene, the E3 ubiquitin-protein ligase CHFR, and observed similar
biological phenomena (Supplementary Fig. [212]5A, B and Supplementary
Fig. [213]7D–G).
Fig. 5. The PTEN R14M mutation affects the function of the PTEN protein by
regulating its nuclear localization.
[214]Fig. 5
[215]Open in a new tab
A The residue-level contribution of the nuclear localization
probability of PTEN and its R14M mutant. B Immunofluorescence staining
for PTEN and its R14M mutant. Scale bar = 10 μm. C Locations of the
phosphatase domain, C2 domain and K14 (red) in the 3D structure of
PTEN. D Diagram of the mechanism by which PTEN functions as a tumor
suppressor in the nucleus and cytoplasm. Boxplot showing the substrate
(CCND1 mRNA [mutated, n = 39 samples; non-mutated, n = 487 samples],
p27 phosphorylated protein and p53 protein [mutated, n = 34 samples;
non-mutated, n = 354 samples]) expression levels of PTEN in UCEC
patients carrying SAMs. The data are presented as box-and-whisker
graphs (bounds of box: first to third quartile, bottom and top line:
minimum to maximum, central line: median). E Western blot showing the
expression and phosphorylation level of the ERK by PTEN R14M mutation
in two cell lines. F KEGG pathway enrichment analysis of differentially
expressed genes between PTEN-overexpressing and R14M-mutated cell
lines. G Western blot showing the protein expression by PTEN R14M
mutations. H, I Cell growth assays showing the effects of PTEN and its
mutant on the HCT116 cell lines. H Relative growth rare and (I)
Relative colony number. Three groups were compared: OE vector (n = 3
replicates), OE PTEN WT (n = 3 replicates) and OE PTEN R14M-mutated
(n = 3 replicates). The data are presented as the mean±S.D. J In vivo
tumorigenesis experiments of PTEN and its mutants in the DLD1 cell
line. Three groups were compared: OE vector (n = 6 mice), OE PTEN WT
(n = 6 mice) and OE PTEN R14M-mutated (n = 6 mice). The data are
presented as the mean±S.D. Representative experiment out of n = 3
biological replicates for (B, E, G). Two-sided Wilcoxon test was used
for (D), two-sided hypergeometric test was used for (F), and two-sided
Student’s t test was used for (H–J). Source data are provided as a
Source Data file.
The SAM-carrying gene network connects nucleocytoplasmic shuttling and
carcinogenesis
Understanding how SAMs interact with tumorigenesis is a long-standing
challenge. The fundamental question is what effect the mutation has on
the structure or function of the protein and which pathways are
targeted^[216]73,[217]74. We constructed a tumorigenesis-related
network by integrating the genes that harbor nuclear SAMs and cancer
hallmark pathways to identify the position of nuclear SAMs in human
cancer (Fig. [218]6). We classified these genes into eleven classes:
hypoxia, myogenesis, spermatogenesis, PI3K/AKT/mTOR, complement, MYC
targets, epithelial-mesenchymal transition, cholesterol homeostasis,
mitotic spindle and androgen response. According to the network, most
genes are known cancer-associated genes, drug targets, enzymes and
metastasis-related genes. We further mapped the cancer hallmark genes
from a well-curated database named HOC^[219]75 to the selected genes,
and enrichment analysis revealed that eight cancer hallmark traits,
namely, sustained proliferation, evasion of antigrowth signaling,
immune evasion, replicative immortality, invasion and metastasis,
angiogenesis, cell death and metabolism, are significantly affected by
nuclear SAMs (Fig. [220]6).
Fig. 6. Nuclear localization probability altered genes regulating the network
that connects nuclear localization and carcinogenesis.
[221]Fig. 6
[222]Open in a new tab
The 139 genes were classified into eleven groups based on their major
biological functions. Representative hallmark genes are shown in the
circle. Source data are provided as a Source Data file.
Discussion
Nuclear localization enables and determines protein function and then
orchestrates intracellular processes and responses to external signals.
Protein shuttling between different subcellular compartments is driven
by distinct targeting signals and regulates a broad spectrum of
biological processes, particularly in tumor and disease
states^[223]3,[224]11,[225]13. However, until now, due to the lack of a
robust tool and method to infer the impact of amino acid perturbations
on protein nuclear localization at the residue level, there has been a
lack of systematic analysis of how mutations affect the development of
tumors by influencing protein nuclear localization. Here, we first
systematically identified the sequence determinants for distinct
nuclear localization by performing ab initio analysis and visualized
the residue-level contribution of nuclear localization probability
across human proteins. Moreover, we investigated the potential
disruption of nucleocytoplasmic shuttling induced by single-nucleotide
or truncated mutations, followed by experimental validation, which
highlighted the general mechanism of tumorigenesis driven by mutations
that alter the subcellular shuttling of proteins. While validating all
mutations is beyond our capabilities, the predicted disruptions
confirmed by the experiments support that our proposed model might
facilitate investigations into the molecular mechanisms of protein
nucleocytoplasmic shuttling and provide candidates for further
experimental discoveries.
The RCA analysis helped us identify the determinants for protein
nuclear localization. PML can localize to the nucleus and exhibits
antiviral activity against both DNA and RNA viruses^[226]43,[227]44.
The identified DNL was consistent with previous studies^[228]45. The
tumor suppressor protein TP53, which is expressed in many tumor types,
induces growth arrest or apoptosis, depending on the physiological
circumstances^[229]46,[230]47, and was annotated to have only one NLS
and one NES in the UniProt database. These known NLSs and NESs are
consistent with the predictions from pSAM. In addition, pSAM suggests
that the N-terminal sequence of TP53 may be associated with nuclear
localization. METTL3 is an N6-methyltransferase that regulates various
processes, such as the response to DNA damage, the differentiation of
stem cells, the circadian clock and primary miRNA
processing^[231]61–[232]63. Interestingly, truncating the known NLS
region did not significantly decrease the nuclear localization
fluorescence signal. After truncating the predicted determinants
(comprising the known NLS), the nuclear localization ability of METTL3
decreased significantly. Taken together, pSAM provides comprehensive
annotation of localization signals on proteins and can identify both
classical and unconventional localization signals.
The interpretation of our results should be of interest to researchers
in several fields. We discovered that the DNL regions are highly
correlated with DNA-binding domains, indicating the potential role of
zinc finger domains in protein nuclear localization (Supplementary
Data [233]6), which was proven by previous studies showing that
DNA-/RNA-binding domains in diverse nucleic acid-binding proteins
function as NLSs/NESs^[234]51,[235]52. We speculated that some proteins
that do not contain potential NLS-like/NES-like DNLs may be transported
into the nucleus through relatively specific pathways. For example,
gelsolin family members enter the nucleus by binding directly to NTF2
without the involvement of importin^[236]53. These findings suggested
that a variety of DNLs could perform protein-interacting functions, but
the approach for validating their protein-interacting functions and the
nuclear proteins with which they interact remains elusive. After the
annotation of determinants, the potential targeting signal annotation
covered approximately 75% of the human proteins, which will greatly
improve our understanding of nuclear localization.
Currently, only a few experimental studies have investigated this
topic, and most studies on cancer mutations have focused on functional
domains, such as catalytic domains, rather than on nuclear localization
signals. We identified many somatic mutations that may regulate protein
function by affecting nuclear localization, and these functional
annotations of mutations may help researchers further understand the
pathogenesis of disease. PTEN has been reported to shuttle between the
nucleus and the cytoplasm via numerous mechanisms^[237]67. Our analysis
showed that PTEN underwent a significant change in nuclear localization
after mutation at R14, and we further confirmed that the PTEN R14M
mutation significantly abolished the nuclear localization ability of
the protein and reduced the tumor-suppressive function of PTEN. PTEN
modulates cell survival by antagonizing the PI3K/Akt/p27 signaling
pathway in the cytoplasm^[238]69,[239]70, whereas when it is located in
the nucleus, it increases cyclin D1 levels and G0-G1 cell cycle arrest
through the downregulation of ERK^[240]71 and mediates p53-dependent
apoptosis^[241]72. We summarized the SAMs that increase the nuclear
importing ability of tumor suppressor genes and found that the
cytoplasmic proteins that interact with those genes are enriched in
housekeeping pathways, while proteins located in the nucleus play
important roles in oncogenic pathways. However, few studies have
determined how TNK1 functions as an oncogene or a tumor suppressor
depending on the cell context. We proposed a potential mechanism by
which the kinase activity of TNK1 is inhibited by 14-3-3 proteins in
the cytoplasm and regulates some carcinogenic TFs in the nucleus when
mutated. We elucidated the potential mechanisms of some proteins in
tumors from the perspective of mutations leading to abnormal nuclear
localization, resulting in abnormal protein function. The molecular
mechanisms of many of these genes deserve further investigation.
The existing knowledge posed limitations to the prediction model
constructed in this study. On the one hand, positive datasets for
nuclear localization should be reliable, but negative datasets are
challenging. Although some proteins were not detected in a certain
subcellular compartment, this lack of detection may be due to research
limitations, and their existence and function in certain subcellular
compartments have not been elucidated. For example, in recent years, an
increasing number of proteins that were originally detected in the
cytoplasm have also been observed to localize in the nucleus, and some
of them have been proven to perform completely different functions upon
entering the nucleus, especially in tumor and disease
states^[242]10–[243]14. This limitation will be overcome by determining
the comprehensive nuclear localization of more proteins. On the other
hand, the proportion of targeting peptides of proteins, particularly
internal signals such as NLSs/NESs, has been experimentally confirmed
to be limited, which leads to a lower concordance between known
NLSs/NESs and identified DNLs than between known presequences and known
DNLs. Lastly, environmental conditions could affect the subcellular
location and shuttling of proteins, but the environmental conditions
across diverse organisms and histopathological settings are complicated
and are hard to be assessed. Nevertheless, such influences are not
directly caused and are attained through intracellular mechanisms such
as mutations, protein posttranslational modifications or protein
cleavage, which change the natures of proteins. Thus, although
environmental conditions could provide some clues, our study focused on
the characteristics of the proteins and how genetic variants alter
their subcellular location. Further detailed and comprehensive
investigation about the impacts of environmental conditions will better
corroborate the reliability of our results.
Although we only verified the SAMs and DNLs in several proteins, we
anticipate that the modeling methods and results mentioned in this
article will provide opportunities to dissect the molecular landscape
of nucleocytoplasmic shuttling and reveal the protein variants
mediating dysregulated localization and dysfunction in cancers and
diseases. Moreover, we constructed the iNuLoC database
([244]http://inuloc.omicsbio.info) and prediction tools based on the
nuclear localization prediction model to facilitate researchers’ use,
which might provide guidance to researchers studying the nuclear
localization of proteins.
Methods
Ethical statement
Our study was approved by the Medical Ethics Committee of Sun Yat-sen
University Cancer Center (G2023-356) and complied with the Declaration
of Helsinki. All animal experiments were performed based on the
protocol approved by Institutional Ethics Committee for Clinical
Research and Animal Trials of the Sun Yat-sen University Cancer Center
(No. L102012023220Y), and all mice were housed in a
temperature-controlled room under pathogen-free conditions on a 12 h
light–dark cycle. Transplanted tumors were not to exceed a diameter of
2.0 cm or 10% of body weight as permitted by the Institutional Ethics
Committee for Clinical Research and Animal Trials of the Sun Yat-sen
University Cancer Center.
TCGA mutation distribution and deleteriousness analysis
The mutation data of 11 cancer types (BLCA, BRCA, CESC, COAD, HNSC,
LIHC, LUAD, LUSC, SKCM, STAD, and UCEC) were downloaded from the Xena
Browser ([245]https://xenabrowser.net/datapages/) for further analysis.
Only missense variants were retained. For each protein, we counted the
number of mutations falling within the NLS region and other regions as
[MATH:
N(n
ls) :MATH]
and
[MATH:
N(o
ther) :MATH]
, respectively, and then calculated the number of mutations that
occurred per thousand amino acids in the NLS region and other regions
and defined them as
[MATH:
Mut(
nls) :MATH]
and
[MATH:
Mut(
other) :MATH]
, respectively.
[MATH:
Mut(
nls)=N
(nls)l1×
1000 :MATH]
1
[MATH:
Mut(
other)=N(other)l2×1000 :MATH]
2
where
[MATH: l1 :MATH]
and
[MATH: l2 :MATH]
are the amino acid numbers of NLS regions and other regions,
respectively.
We introduced a deleteriousness score to measure their deleteriousness
and investigated whether mutations in NLS regions are more likely to
impair specific protein functions, as reported by Chen et al.^[246]76.
We defined a mutation as deleterious to protein function if a given SNV
disrupted a functional domain or regulatory region of a specific
protein. Thus, five software programs, namely, SIFT^[247]77, PolyPhen2
HDIV^[248]78, PROVEAN^[249]79, FATHMM^[250]80 and
AlphaMissense^[251]81, were used to predict the functional consequences
of our identified NLS region mutations. To ensure the accuracy of the
predictions, we further defined the deleteriousness score by
integrating the predictions of the above five software programs.
Specifically, the deleteriousness score was calculated by counting the
number of methods described above that considered the mutation to be
deleterious. A deleteriousness score of 0 means that the predicted
mutations are tolerated in all methods. In contrast, a deleteriousness
score of 5 means that the corresponding mutations are predicted to be
harmful by all five predictors. Thus, the deleteriousness score can
range from 0 to 5, where a higher score indicates a greater probability
of deleteriousness. Next, a two-tailed proportionality test was applied
to assess the differences in deleteriousness between mutations that
occurred in NLS regions and those that occurred in other regions.
Gene set collection
The TFs and cofactors were downloaded from AnimalTFDB^[252]82. The
proteins with experimentally validated NLS annotations were downloaded
from the SeqNLS^[253]31 and UniProtKB databases. The proteins with NES
annotations were downloaded from the ValidNESs^[254]83,
NESbase^[255]84, and UniProtKB databases. Proteins that contained
ankyrin repeat domains were retrieved from the Pfam database^[256]85.
The list of histones was downloaded from the UniProtKB database. We
also extracted 8421 experimentally verified nuclear proteins and 18,278
non-nuclear proteins from NLSdb^[257]42. NLSdb provided a large number
of NLS motifs generated in silico, including 2253 NLSs, which were also
used in the validation of our data^[258]42.
Signature-corrected randomized mutations
Signature-corrected randomized mutations were generated as
described^[259]86. To compute the signature-corrected randomized
mutations, we used the MOAT (Mutations Overburdening Annotations Tool;
[260]https://github.com/gersteinlab/MOAT), which is a computational
system for identifying significant mutation burdens in genomic elements
with an empirical, nonparametric method^[261]87. We used the variant
distribution simulator, namely MOAT-s, to produce shuffled mutation
sets preserving mutational signatures associated with cancer while
removing the local position-specific effects of cancer mutations. TCGA
mutation data of 11 cancer types was selected as the input of variant
file. Here, observed mutations were randomly shuffled within a
50,000-base pair (bp) window. During the shuffling process, the
tri-nucleotide context of a mutation was preserved. Mutations affecting
known cancer driver genes were excluded from the generation of these
randomized mutation sets. As the reference genome file of the MOAT was
based on GRCh37 (hg19), we used the UCSC liftover software to transform
the gene location files.
Driver gene analysis
We analyzed whether these genes were significantly mutated in human
cancers to ascertain which genes might promote tumor development or
repress tumor growth. The 20/20+ method^[262]88 is based on a random
forest that scores each gene as a cancer driver gene. We performed the
20/20+ method using default parameters for each of the 11 cancer types
individually and aggregated the results of all cancers together. A
significant putative driver gene was determined with a cutoff P value
less than 0.05.
CERES dependency score
CERES dependency scores were based on data from CRISPR technology to
knock down target genes and perform cell depletion assays^[263]89.
Target gene dependency scores were obtained from the DepMap web portal
([264]https://depmap.org/portal/download/). A lower CERES score for a
particular gene in a specific cell line indicates that the gene of
interest has an essential function in that cell line. A score of 0
indicates that the gene is not essential; accordingly, a score of -1 is
the median of all panessential gene function scores.
Dataset preparation
We downloaded all reviewed protein sequences of eukaryotes from the
UniProt database^[265]41 (UniProt release 2021_01) for a total of
157,460 proteins from 6,852 organisms. To ensure high data quality, the
proteins with no nuclear localization information or uncertain
annotations with keywords such as ‘by similarity’, ‘potential’, and
‘probable’ were classified into the “without localization” subgroup.
Then, we extracted their localization information based on UniProt
annotations and classified the remaining proteins into “nuclear
localization” and “nonnuclear localization” subgroups based on whether
they contained nuclear localization information. Since CNNs take
fixed-size inputs, protein sequences longer than 2000 amino acids or
shorter than 20 amino acids, accounting for 2.75% of sequences, were
excluded from the dataset. As a result, 36,016 proteins with nuclear
localization and 63,904 proteins with nonnuclear localization were
ultimately extracted. For some protein sequences containing nonstandard
amino acids (i.e., B, O, U, and Z), we designated these amino acids as
pseudo amino acids, i.e., “-“. We then split the whole dataset into a
training set, a validation set, and a testing set at a ratio of 7:2:1.
The training dataset was used for cross-validation and hyperparameter
tuning to generate the optimal model. The validation dataset was used
to adjust the model weights, determine the number of training session
epochs, and evaluate the model performance. In addition, we randomly
selected 1000 protein sequences (500 nuclear localization proteins and
500 nonnuclear localization proteins) from the testing dataset to
compare the performance of the model with that of other protein nuclear
localization prediction tools.
Model construction
We designed a series of deep learning-based models to predict the
potential nuclear localizations of proteins and then evaluated model
performance to select the optimal prediction model. We collected
eukaryotic protein sequences and localization annotations from the
UniProt database^[266]41 (Supplementary Fig. [267]2A). Approximately
30% of the proteins lacked clear subcellular localization annotations,
and the ratio of nuclear to non-nuclear protein localization was
approximately 1:2. To avoid too many outliers, protein sequences with
extreme lengths of less than 20 amino acids or more than 2000 amino
acids were excluded. The input protein sequences are one-hot encoded
into a (2000 ×21) embedded matrix, and this matrix is later used as the
input for the constructed CNN layers. We randomly divided the whole
dataset into a training set, validation set, and testing set at a ratio
of 7:2:1. Then, we use a combined deep neural network based on CNN,
BiLSTM, and attention mechanisms to construct the model to predict
nuclear localization probability (Fig. [268]1A). The input protein
sequences were one-hot encoded into a (2000 ×21) embedded matrix, which
was used as the input for the CNN layers. Different convolutional
kernel sizes (e.g., 1, 2, 4, 8, 12, and 16) were combined to form a
multisubnetwork CNN layer. Batch normalization and rectified linear
unit (ReLU) activation are applied before each convolutional layer. In
addition, all padding methods in the CNN layers are valid padding
methods, which means that the convolution kernel is not allowed to go
beyond the original image boundaries. Subsequently, we used the
concatenate layer to merge the output of the multilayer residual block
into one layer, followed by an additional batch normalization layer and
ReLU activation function.
For subcellular localization, the precise identification of targeting
signals often captures the interest of biologists. LSTM coupled with
attention mechanisms has proven to be effective in addressing this
issue. Here, we compute the activation of LSTM memory blocks (hidden
states) at each position in the protein sequence and apply an attention
function to determine the significance of each hidden state, which
represents the importance of the underlying positions in the input
sequence. This enables us to assign importance to each position in the
protein sequence to decode essential amino acids that influence
subcellular localization. Subsequently, we calculate the
attention-weighted sum of all hidden states, using it as input to a
dense feedforward neural network with a hidden layer and Adam output
for the final prediction of nuclear localization.
Hence, the model’s output is passed to the BiLSTM layer, which uses the
L2 Regularizer to apply a penalty to the layer’s kernel, bias, and
output with a value of 0.01. After these two layers, a dropout layer is
used to prevent further model overfitting. Finally, we used the
attention layer constructed based on Luong-style attention to assign an
importance score to each hidden state of the BiLSTM layer. The
attention-weighted sum of all hidden states was used as the input of
the final dense feed-forward neural network with one hidden layer and
output as a two-dimensional classification probability score. The
optimizer of the model is the Adam optimizer, the loss calculation
function is a categorical cross-entropy loss function, the evaluation
metric is accuracy, and the model training batch size is set to 64. In
addition, we introduce the early stop method to prevent further model
overfitting, and the evaluation metric is that the loss in the
validation set does not decrease after more than two successive epochs.
The rationale for selecting this architecture is grounded in the
following conceptual flow of information from distinctive subsequences
to categorization. The CNN filters address the challenge of aligning
sequences and offer a trained filter-bank attuned to specific sequence
patterns. If such a pattern appears anywhere in the sequence, it
results in an elevated absolute input to the LSTM at that specific
position. The bidirectional LSTM layers then locally integrate this
information both forward and backward in sequence. The attention
function, leveraging its acquired contextual awareness, conveys
discriminative information from these RNN hidden states to the dense
layer and subsequently to the Adam classifier.
Hyperparameter tuning
Hyperparameter tuning was performed over the residual block
combination, CNN node size, dropout rate, and recurrent gate size. The
alternative residual block combination consists of three or four
parallel subnetworks with four combination modes: (4×1, 8×1, 12×1),
(1×1, 2×1, 4×1, 8×1), (4×1, 8×1, 1×1) and (4×1, 8×1, 12×1, 16×1). After
comparing the performances on the training and validation sets, we
selected combinations of (4×1, 8×1, 1×1), (1×1, 4×1, 8×1), and (8×1,
1×1, 4×1) as the optimal model combinations. To define the best model
architecture, the CNN node size, dropout rate, and recurrent gate size
were initially set to 64, 0.5, and 128, respectively. Then, we
determined the CNN node size, dropout rate, and recurrent gate size one
by one, keeping two of the parameters unchanged and adjusting one of
the remaining parameters. The CNN node ranges were 32, 64, 128, and
256. The dropout rate range was 0.3, 0.4, 0.5 and 0.6. The range of the
recurrent gate was 64, 128, 256, and 512. We selected a CNN node size =
64, dropout rate range = 0.4, and recurrent gate size = 128 for further
analysis because these values yielded the highest validation accuracy.
Model performance evaluation
As previously described^[269]90, we adopted two measurements to
evaluate the model’s performance: sensitivity (Sn) and specificity
(Sp). To evaluate the model performance, we performed 10-fold
cross-validation and calculated the AUC values. An AUC value of 1
indicates that the model can predict each series accurately, while an
AUC value of 0.5 indicates that the model makes random predictions. A
higher AUC indicates that the model is more effective.
Comparison of pSAM with existing tools
We compared the prediction accuracy of pSAM with that of several other
tools to further evaluate its performance. Although dozens of in silico
tools have been developed for nuclear localization prediction, some of
them are no longer accessible. Considering both code and webserver
availability and classification system diversity, we selected the
general localization prediction tools YLoc^[270]22, PSORT_II^[271]20,
MultiLoc^[272]26, DeepLoc^[273]27 and MULocDeep^[274]28 and the nuclear
localization-specific prediction tool NucPred^[275]29. We also trained
another model using the 8,421 experimentally verified nuclear proteins
and 18,278 non-nuclear proteins from NLSdb based on the same
architecture as pSAM (CNN+BiLSTM)^[276]42. These classification systems
involve three widely used prediction algorithms, sequence alignment,
machine learning, and deep learning, and show satisfactory performance
in their domains. For this comparison, we randomly selected 1000
protein sequences (500 proteins with specific localization and 500
proteins with other localization) from the testing dataset. With pSAM,
we predicted the nuclear localization probability based on the model
constructed with the training dataset. The 1000 protein sequences must
not be used in the training dataset in this situation. Since the other
tools are primarily web-based tools and limit the number of protein
sequences submitted at a time, we uploaded the protein sequences
individually and obtained the prediction results. The results for the
1000 sequences from each tool were downloaded and merged to retrieve
the nuclear localization score. Finally, we calculated the AUC for each
tool and presented these results as ROC curves.
Prediction of nuclear localization probabilities of reported peptides with
NLS activity
Kosugi et al.^[277]30 selected peptides bound by importin alpha and
identified six classes of NLSs by screening random peptide libraries
via mRNA display (Supplementary Data [278]2). The relative NLS
activities of these altered sequences were examined in yeast and
classified into ten classes according to the localization phenotype of
the GFP reporter. We added ten pseudo amino acids (“-“) at both termini
since the altered sequences were less than 20 amino acids in length.
Then, the sequences were encoded and predicted by the pSAM model to
obtain their nuclear localization probabilities. Finally, we used a
boxplot to show the consistency of the predicted and experimental
results.
Shuffling analysis
We assessed the relative importance of the three protein regions—the
terminal-terminus, the middle region, and the C-terminus—by shuffling
each region independently in the human proteome. The truncated
sequences and the wild-type proteins were encoded by one-hot encoding
and transferred as the input of pSAM. Then, the protein nuclear
localization probabilities were calculated for these sequences. A
boxplot was used to illustrate the variation in nuclear localization
probability from wild-type to random shuffling and terminal shuffling
for each protein. For proteins with known NLSs, we shuffled the
annotated NLS regions. We performed random shuffling of same-length
regions and computed the localization probability for each sequence. A
paired-sample boxplot was used to show the variation in nuclear
localization probability from the wild type to random shuffling and NLS
shuffling for each protein.
Analysis of residue-level contributions via deep learning
We then studied the learning process of pSAM by analyzing the
trajectory of the nuclear localization probability as a function of
sequence position for the best model. A detailed description of the
procedure used to determine the nuclear localization probability
trajectory is provided below.
First, the truncated protein sequence was defined from position 1 to
position
[MATH: i :MATH]
as
[MATH:
Pro(
i) :MATH]
. The nuclear localization probability of this sequence calculated
using pSAM was defined as
[MATH:
Ptrun<
/mi>ci :MATH]
. Then, we proposed
[MATH: P~trunci :MATH]
to represent the smoothed nuclear localization probability of position
[MATH: i :MATH]
using unweighted sliding-average smoothing with a window size of length
[MATH: w=16 :MATH]
,
[MATH: P~trunci=∑j=−w
mi>/2w/2Ptr
unc(i+j)min<
mfenced close=")"
open="(">L,i+w2<
/mn>−max1,i−w2<
/mn>+1 :MATH]
3
where L is the entire length of the protein sequence. For each protein,
we calculated the smoothed nuclear localization probabilities from
position 1 to L and visualized them together as the nuclear
localization probability trajectory, as shown in Fig. [279]3.
[MATH: P~trunci :MATH]
represents the cumulative probability of nuclear localization of the
protein.
We calculated the probability change
[MATH: ΔPtrunci :MATH]
as a function of position for each sequence using the same window size
[MATH: w=16 :MATH]
as mentioned above:
[MATH: ΔPtrunci=P~trunc
(i+w2)−P~trunc
(i−w2) :MATH]
4
to determine at which point in this sequence a decision is made by the
pSAM.
Given the cumulative effect of probability scores, if a protein
sequence has obtained a high
[MATH: P~trunci :MATH]
at position
[MATH: i :MATH]
, the score
[MATH: P~truncj :MATH]
at subsequent position
[MATH: j :MATH]
becomes nonsignificant relative to
[MATH: P~trunci :MATH]
. However, this position may also significantly affect protein nuclear
localization. Therefore, when calculating
[MATH: ΔPtrunci :MATH]
, we want to exclude the aforementioned factors as much as possible.
When constructing the distribution of
[MATH: ΔPtrunci :MATH]
at each position of all proteins, we found that |
[MATH: ΔPtrunci :MATH]
| > 0.1 accounted for approximately 5% of the total number of loci.
The interval containing |
[MATH: ΔPtrunci :MATH]
| > 0.1 sites significantly enhances the probability of protein
nuclear localization, resulting in a nonsignificant
[MATH: ΔPtrunci :MATH]
at subsequent sites. Hence, we designed a stepwise truncation method to
recalculate
[MATH: ΔPtrunci :MATH]
.
We calculated
[MATH: P~trunci :MATH]
and
[MATH: ΔPtrunci :MATH]
for each position
[MATH:
i(1≤i≤L) :MATH]
of each protein. Then, we counted all regions with
[MATH: ΔPtrunci :MATH]
> 0.1 and obtained the position
[MATH: j :MATH]
of the C-terminus of the last region. The truncated protein was
generated by shuffling position 1 to position
[MATH: i :MATH]
. For the truncated protein, we repeated the operation described above
for the full-length protein until there were no regions with
[MATH: ΔPtrunci :MATH]
> 0.1. Next, the calculated
[MATH: ΔPtrunci :MATH]
replaces
[MATH: ΔPtrunci :MATH]
at the corresponding position in the previous section. Hence, we
obtained the same
[MATH: P~trunci :MATH]
and updated
[MATH: ΔPtrunci :MATH]
for each protein relative to the score obtained from the first
calculation at the corresponding position.
With the calculated
[MATH: ΔPtrunci :MATH]
, we extracted all regions with
[MATH: ΔPtrunci :MATH]
> 0.1 for each protein and defined them as significant regions. These
regions have a strong influence on nuclear localization, and their
higher level of probability change reflects that the model assigns a
higher weight to this part of the region, which is worth focusing on,
and its sequence structure may be related to nuclear localization.
Subsequently, we computed the probability score changes for all
positions on the protein and constructed a normal distribution curve.
We calculated the proportion of anomalous sites under different cutoffs
by tallying the number of abnormal sites and set the cutoff
corresponding to a 5% proportion of anomalous sites for subsequent
mutation analysis.
Classification of DNLs into NLS-like and NES-like types
Although we identified various significant regions, which we defined as
sequence DNLs, their characteristics and action mechanisms remain
unknown. There is no doubt that these regions are related to nuclear
localization, either nuclear importing or exporting. We further
classified the DNLs into NLS-like and NES-like according to the
description provided by Bernhofer et al.^[280]42. Regions with at least
three positively charged residues (H, K, or R) and an overall positive
charge were defined as potential NLS-like DNLs. Regions with at least
three hydrophobic residues (A, F, I, L, M, or V) and more than 30%
hydrophobic amino acid residues were defined as potential NES-like
DNLs. The regions that could not be classified into these two groups
were defined as the unknown type (NA).
To increase their credibility, we further calculated the FDR of
NLS-like and NES-like DNLs based on the sequence similarity with
experimental validated NLS and NES, respectively. The stringdist
package was used to obtain the cosine similarity of NLS-like DNLs with
experimentally validated NLSs, as well as NES-like DNLs with
experimentally validated NESs. The p values of NLS-like and NES-like
DNLs were calculated to determine whether the given cosine similarities
were significantly higher than those of randomly selected background
sequences of the same size. The Benjamini-Hochberg method was further
used to adjust the p values and obtain FDR. The FDRs of all NLS-like
and NES-like DNLs are provided in supplementary Data [281]4.
We calculated the cosine similarity with the whole dataset for each
region and generated a sequence similarity matrix. The sequence
similarity visualization of our defined NLSs/NESs and validated
NLSs/NESs is shown by the t-distributed stochastic neighbor embedding
(t-SNE) plot.
In addition, considering that a known NLS does not have a strong motif,
to refine the features of the NLS/NES, we performed unsupervised
hierarchical clustering of the regions in the NLS-like/NES-like/NA
groups based on the sequence similarity matrix and divided each group
into six subgroups for further analysis.
Construction of the iNuLoC webserver
To facilitate basic research, we generated a webserver, iNuLoC, to
calculate the protein nuclear localization probability and predict
critical regions. iNuLoC is a platform developed to identify
potentially critical regions that facilitate protein nuclear
localization in five model organisms, namely, Homo sapiens, Mus
musculus, Rattus norvegicus, Saccharomyces cerevisiae, and Drosophila
melanogaster. It provides the prediction results from pSAM and
integrates several well-known databases, including UniProt, NLSdb,
SeqNLS, ValidNESs, and NESbase, to display known and candidate
NLSs/NESs in proteins, which might provide helpful information for
research on protein nuclear localization. The platform provides
experimentally determined NLSs/NESs for 1530 proteins through database
integration. Using the RCA method, the NLS/NES annotations were
extended to 13,481 proteins, and the final dataset matched more than
93% of all known nuclear proteins. The database also contains
14,585/23,208/27,551 predicted nuclear proteins for the proteomes of
five model organisms with high/medium/low thresholds. Users can simply
search for the name of their gene of interest and jump to the
corresponding result interface. By clicking on the “Detail” button,
users can search for detailed information about the relevant protein,
for example, potential nuclear localization probabilities, predicted
significant regions, and NLS and NES annotations from other databases.
We visualized the residue-level contribution of nuclear localization
and highlighted the critical regions.
Furthermore, information on the structure and physicochemical
properties of the protein sequences, including disorder, exposure
status, polarity, charge, secondary structure, surface accessibility,
and hydropathy, is also shown. The disorder information of the query
sequence was calculated using IUPred^[282]91. Surface accessibility and
secondary structure information were predicted by NetSurfP^[283]92. We
also integrated the pSAM model and constructed a prediction interface.
Users can submit sequence information for proteins of interest and
obtain their corresponding nuclear localization probabilities and other
biological features. The framework for the deep learning model was
Keras, with TensorFlow as its backend implementation. Finally, the
webserver was constructed in HTML, PHP, and Python and can be accessed
at [284]http://inuloc.omicsbio.info. To provide a robust service, we
tested the website of iNuLoC on various web browsers, such as the
Internet Explorer, Google Chrome, and Mozilla Firefox, and found that
it operates normally.
Domain and motif enrichment
The HMMER3 format domain information was downloaded from the Pfam
database (V3.1b2). We extracted the sequence based on the UniProt
database (UniProt release 2021_01), prepared the FASTA format file for
each identified NLS, and validated the NLS. Domain enrichment analysis
was performed using HMMER (v3.3.2)^[285]93 with the parameters --noali
-E 1.0. The motif enrichment analysis was performed using meme
(v5.3.3)^[286]94 with the parameters -minw 5 -maxw 10, 15.
Point and paired mutation analysis
The mutation data of 11 cancer types were downloaded from the Xena
Browser, and only missense variants were retained for further analysis.
We first merged all mutation files and removed duplicate mutations, and
the retained mutations were considered possible human-derived
mutations. For each mutation site, we calculated the nuclear
localization probability score of its wild-type and mutant proteins and
used the difference between the mutant protein score and the wild-type
protein score as the effect of the mutation on nuclear localization
potential.
[MATH: ΔPi,mut=Pi,mut−Pi,wt :MATH]
5
For paired mutation analysis, we first counted all possible mutations
in a protein, combined these mutated sites two by two, and constructed
the mutated sequences. These sequences were predicted with pSAM for
potential nuclear localization probabilities. Based on the predicted
single point mutation scores and wild-type scores produced as described
above, we obtained the change after mutation combination relative to
the single point mutation.
[MATH: ΔPco<
mi>mpi,j=ΔPi,mut,<
/mo>j,mut
−ΔPi,mut−ΔPj,mut :MATH]
6
If
[MATH: ΔPi,mut :MATH]
< 0,
[MATH: ΔPj,mut :MATH]
< 0 and
[MATH: ΔPco<
mi>mpi,j :MATH]
< 0, the two mutation sites synergistically affect protein nuclear
localization, and a smaller
[MATH: ΔPco<
mi>mpi,j :MATH]
represents a stronger synergistic effect.
Truncated mutation analysis
We assessed the relative importance of the identified NLSs critical for
protein nuclear localization by shuffling these regions in the human
proteome. The truncated sequences and the wild-type proteins were
encoded by one-hot encoding and transferred as the input of pSAM. Then,
the protein nuclear localization probabilities were calculated for
these sequences. The difference between the truncated protein score and
the wild-type protein score was defined as the effect of truncation on
the nuclear localization potential.
Protein iteration network
The experimentally confirmed PPI dataset was downloaded and curated
from BioGRID^[287]95, IID^[288]96, I2D^[289]97, BioPlex^[290]98 and
IntAct^[291]99. In total, 377,796 interactions among 20,379 proteins
were collected. The visualization of the network was generated with
Cytoscape^[292]100.
Phosphomimetic and inactivation mutation analysis
We assessed the relative importance of the phosphorylation sites
located in the identified NLSs critical for protein nuclear
localization by performing phosphomimetic and inactivation mutation
analyses. For each protein, we mutated the Ser and Thr amino acids to D
in the phosphomimetic mutation analysis and mutated the amino acid to
Ala in the inactivation mutation analysis. The nuclear localization
probabilities of the three types of sequences were then predicted using
pSAM. We first calculated the delta probability between the
phosphomimetic mutant and wild-type sequence as Delta1 and the delta
probability between the inactivation mutant and wild-type sequence as
Delta2. We classified the phosphorylation sites critical for nuclear
localization with thresholds of Delta1 > 0.25 and Delta2 < -0.25. The
selected proteins harboring these sites were then used for kinase
enrichment analysis to identify the critical kinase that regulates the
nuclear localization of its substrates by phosphorylation.
Cell lines and plasmid transfection
The human HCT116 and DLD1 colorectal cancer (CRC) cell lines and human
embryonic kidney (HEK) 293 T cells were obtained from the American Type
Culture Collection (ATCC, Rockville, MD, USA) and cultured according to
standard guidelines. All cells were authenticated by short tandem
repeat fingerprinting before use at the Medicine Laboratory of the
Forensic Medicine Department of Sun Yat-sen University (Guangzhou,
China). All plasmids with a Flag tag were obtained from Saisofi
Biotechnology Co., Ltd. (Jiangsu, China) and transfected into cells
using ViaFect Transfection Reagent (E4982, Promega, Madison, WI, USA)
according to the recommended protocol. CRC cells were transfected with
lentiviruses and selected with puromycin (HY-B1743A, MedChemExpress,
NJ, USA) for 7 days. The sequences of shRNA of PTEN were as following:
shPTEN#1: ACTTGAAGGCGTATACAGGA, shPTEN#2: CGACTTAGACTTGACCTATAT.
Western blot and Co-Immunoprecipitation (IP)
Proteins were extracted using RIPA buffer (P0013B, Beyotime, Jiangsu,
China), separated on SDS‒PAGE gels and then transferred to PVDF
membranes (IPVH00010, Bio-Rad Laboratories, Hercules, CA, USA). The
membranes were incubated with anti-Flag tag (14793, Cell Signaling
Technology, Danvers, MA, USA), anti-β-Actin (A5441, Sigma-Aldrich),
anti-phosphorylated ERK (ab76299, Abcam, Cambridge, MA, USA), anti-ERK2
(ab32081, Abcam), and anti-Importin Alpha 5 (KPNA1) (18137-1-AP,
Proteintech), anti-KPNA2 (ab289858, Abcam), anti-KPNA4 (ab302556,
Abcam), anti-PTEN (ab170941, Abcam) antibodies (4 °C, overnight).
followed by peroxidase-conjugated secondary antibodies (ZB-2301 and
ZB-2305, ZSGB-BIO, Beijing, China) (RT, 1 h). The bands were visualized
using chemiluminescence (34096, Thermo Fisher Scientific, Carlsbad, CA,
USA). For Co-IP, the cell lysate was incubated with Anti-Flag and
protein A/G beads (MedChemExpress, NJ, USA) overnight at 4 °C. The
protein-magnetic beads complexes were washed and eluted by 1× loading
buffer, and the bound proteins were used in Western blot.
Immunofluorescence assay
Cells were plated on glass-bottom culture dishes (Cellvis, Mountain,
CA, USA), fixed with 4% paraformaldehyde for 15 min at 37 °C, and
permeabilized with 0.2% Triton X-100 in PBS for 10 min at room
temperature. The samples were blocked with 5% FBS for 1 h at room
temperature and incubated with an anti-FLAG antibody and Alexa Fluor®
488 conjugated antibody overnight at 4 °C and for 1 h at room
temperature. 4’,6-Diamidino-2-phenylindole (DAPI) (Invitrogen,
Carlsbad, CA, USA)-containing medium was then used to counterstain the
nuclei, and images were observed with a confocal fluorescence
microscope (LSM880 with Fast Airyscan, Zeiss, Germany); scale bar = 10
μm.
Library construction, quality control and sequencing
Total RNA was used as input material for the RNA sample preparations.
Sequencing libraries were generated using the NEBNext Ultra RNA Library
Prep Kit for Illumina (NEB, USA, Catalog #: E7530L) following the
manufacturer’s recommendations, and index codes were added to attribute
sequences to each sample.
Briefly, mRNA was purified from total RNA using poly-T oligo-attached
magnetic beads. Fragmentation was carried out using divalent cations
under elevated temperature in NEBNext First Strand Synthesis Reaction
Buffer (5X). First-strand cDNA was synthesized using random hexamer
primers and M-MuLV reverse transcriptase (RNase H). Second-strand cDNA
synthesis was subsequently performed using DNA polymerase I and RNase
H. The remaining overhangs were converted into blunt ends via
exonuclease/polymerase activities. After adenylation of the 3’ ends of
the DNA fragments, NEBNext adaptors with hairpin loop structures were
ligated to prepare for hybridization. To preferentially select cDNA
fragments 370~420 bp in length, the library fragments were purified
with the AMPure XP system (Beverly, MA, USA). Then, 3 μL of USER Enzyme
(NEB, USA) was used with size-selected, adaptor-ligated cDNA at 37 °C
for 15 min, followed by 5 min at 95 °C before PCR. Then, PCR was
performed with Phusion High-Fidelity DNA polymerase, universal PCR
primers and Index (X) Primer. Finally, the PCR products were purified
(AMPure XP system), and the library quality was assessed on an Agilent
5400 system (Agilent, USA) and quantified by QPCR (1.5 nM).
The qualified libraries were pooled and sequenced on Illumina platforms
with the PE150 strategy at Novogene Bioinformatics Technology Co., Ltd.
(Beijing, China), according to the effective library concentration and
data amount needed.
Cell proliferation and colony formation assays
HCT116 cells (1 × 10^3 cells/well) were seeded in a 96-well plate, and
cell viability was determined using MTS (Qiagen, Hilden, Germany)
according to the manufacturer’s instructions. A colony formation assay
was performed in 12-well plates with 250 cells, which were cultured for
1 week. After being fixed, the colonies were counted and stained with
0.05% crystal violet.
In vivo tumorigenesis
All experiments were performed using a protocol approved by our
institutional Animal Care and Use Committee (SYSUCC). Female BALB/c
nude mice (5 weeks old) were obtained from Beijing Vital River
Laboratory Animal Technology Co., Ltd. Next, 2 × 10^6 CRC cells
overexpressing wild-type or site-specific PTEN or CHFR mutants were
injected subcutaneously into the flanks of each mouse. The diameter and
width of the tumors from the mice were measured every 4 days, and the
tumor volumes were calculated using the formula V = 0.5 × D × W^2 (V,
volume; D, diameter; W, width). All the mice were sacrificed at the
appropriate time points, and the tumors were removed and weighed for
further analysis.
Statistics and reproducibility
When comparing data from two groups, a Student’s t test or Wilcoxon
test was used (in R program version 4.0.3). Chi-square test was used in
the analyses of contingency tables. Log-rank tests were adopted to test
differences in survival. Hypergeometric test was used to calculate p
values for enrichment analyses. A two-sided p value less than 0.05 was
considered to indicate statistical significance. Benjamini-Hochberg
method was used to adjusted the p values for the calculation of FDRs if
necessary. All experiments were performed at least three times as
independent experiments with similar results, and representative images
are shown.
Reporting summary
Further information on research design is available in the [293]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[294]Supplementary Information^ (15.2MB, pdf)
[295]Peer Review File^ (7.5MB, pdf)
[296]41467_2025_57858_MOESM3_ESM.pdf^ (169.9KB, pdf)
Description of Additional Supplementary Files
[297]Supplementary Data 1^ (30.2MB, xlsx)
[298]Supplementary Data 2^ (357.4KB, xlsx)
[299]Supplementary Data 3^ (84.2KB, xlsx)
[300]Supplementary Data 4^ (4.3MB, xlsx)
[301]Supplementary Data 5^ (46.4KB, xlsx)
[302]Supplementary Data 6^ (71.4KB, xlsx)
[303]Supplementary Data 7^ (6.8MB, xlsx)
[304]Supplementary Data 8^ (42.8MB, xlsx)
[305]Supplementary Data 9^ (637.2KB, xlsx)
[306]Reporting Summary^ (1.4MB, pdf)
Source data
[307]Source data^ (38.3MB, xlsx)
Acknowledgements