Abstract
Cerebral cavernous malformation (CCM) is a polygenic disease with
intricate genetic interactions contributing to quantitative
pathogenesis across multiple factors. The principal pathogenic genes of
CCM, specifically KRIT1, CCM2, and PDCD10, have been reported,
accompanied by a growing wealth of genetic data related to mutations.
Furthermore, numerous other molecules associated with CCM have been
unearthed. However, tackling such massive volumes of unstructured data
remains challenging until the advent of advanced large language models.
In this study, we developed an automated analytical pipeline
specialized in single nucleotide variants (SNVs) related biomedical
text analysis called BRLM. To facilitate this, BioBERT was employed to
vectorize the rich information of SNVs, while a deep residue network
was used to discriminate the classes of the SNVs. BRLM was initially
constructed on mutations from 12 different types of TCGA cancers,
achieving an accuracy exceeding 99%. It was further examined for CCM
mutations in familial sequencing data analysis, highlighting an
upstream master regulator gene fibroblast growth factor 1 (FGF1). With
multi-omics characterization and validation in biological function,
FGF1 demonstrated to play a significant role in the development of
CCMs, which proved the effectiveness of our model. The BRLM web server
is available at [37]http://1.117.230.196.
Keywords: Whole genome sequencing, Cerebral cavernous malformation,
Deep learning, Large language model, Natural language processing
Graphical Abstract
[38]ga1
[39]Open in a new tab
Highlights
* •
A language model is introduced to assist the variants
classification in familial cerebral cavernous malformation.
* •
Integrative bioinformatics analysis, rooted in the classification,
guarantees potential mutated genes identification.
* •
Upstream master regulator gene FGF1 plays a substantial role based
on multi-omics verification.
1. Introduction
Cerebral cavernous malformation (CCM; OMIM 116860), also known as
cerebral cavernous angiomas, can manifest as sporadic or autosomal
dominant conditions. These conditions consist of a varied range of
relatively prevalent lesions that have important clinical implications
[40][58]. These angiomas may arise sporadically or be inherited, with
identified causative genes primarily attributed to KRIT1, CCM2, and
PDCD10 [41][31], which have been linked to the molecular diagnostic
criteria of the condition for the last two decades [42][44]. However,
not all sources succeeded in identifying mutations within specified
loci of the above three genes. The advent of next-generation sequencing
techniques has unveiled a broader spectrum of CCM-associated genes.
With regard to transcriptome sequencing, the focus had been on 1325
genes displaying differential expression between CCM endothelial cells
(CCMECs) and Human brain microvascular endothelial cells (HBMECs)
[43][44], along with 80 enriched pathway terms [44][31]. Additionally,
whole genome sequencing (WGS) had even detected instances of
heterozygous inversion without mutations [45][47]. Presently, the most
advanced approach, single cell RNA sequencing (scRNA-seq), has provided
a comprehensive gene expression atlas for mouse models of CCM across
various cell groups [46][36].
Despite the collective findings from these aforementioned studies, an
all-encompassing pathogenic genetic mutations for CCM remains elusive.
Consequently, the substantial body of recent research has shifted its
focus from singular genes to polygenic factors associated with
diseases. This study aims to explore the genetic regulatory aspects of
inherited CCM WGS data, based on SNVs' annotated textual information
mining.
Those multi-factor texts were presented in various forms, encompassing
functional descriptions [47][58], regulatory networks [48][7],
scientific literatures [49][54], and retrievable databases [50][38],
thereby rendering these discoveries as distinct “informational isolated
island” [51][17]. Complex disease traits have been revealed to stem
from the interplay of multiple elements. Among the potential
contributors to the pathogenesis of complex diseases, SNVs, genes, and
gene interactions stand out as prominent factors [52][50]. In response,
researchers have formulated various algorithms aimed at scrutinizing
biomarkers. Examples include the penalized logistic regression [53][8],
multi-factor dimension reduction [54][8], set association [55][59],
Bayesian network and random forest method [56][62], to name a few.
However, these algorithms are limited by the need for explicit feature
extractions, which must be engineered artificially.
While genetic information is inherently an aspect of natural language,
where large language model (LLM) had been devoted to natural language
processing (NLP) in comprehensive feature tracking and summarization.
The two primary NLP technologies currently utilized are GPT [57][55]
and BERT [58][13]. GPT has not received any specific biological
training, although the recently developed scGPT [59][11] had been
limited to single cell sequencing analysis. BioBERT [60][25], on the
other hand, specializes in the biomedical field and has been utilized
for biomedical natural language encoding. Such language learning model
had been applied to molecular interactions mining [61][48].
By using BioBERT [62][25] for NLP, unstructured annotation texts were
converted into per-SNV vectors. All SNV information was transformed
into computationally manageable vectors, which was a great challenge to
overcome gradient disappearance and performance degradation in
traditional deep neural networks. These issues were effectively
addressed since the invention of ResNet [63][51]. The ResNet50 was
borrowed in this study for BioBERT encoded vector classification with
intention, as it provides an ideal balance of depth and computational
efficiency for a variety of tasks [64][51]. Thus, classification of
input vectors for vast variants was carried out using reconstructed
ResNet50 [65][51], although ResNet was originally employed in image
recognition.
Using BioBERT [66][25] as encorder and ResNet50 [67][51] as classifier,
we name our model as BRLM, short for BioBERT vectorized input for
ResNet classification language model, dedicated to pathogenicity
classification of SNVs in annotation texts. BRLM’s performance was
validated on 12 TCGA cancers to ensure its accuracy and robustness. To
demonstrate its proficiency in resolving practical problems, we
successfully classified SNVs for familial CCM WGS variants, and
analyzed mutated genes involved in perturbated pathways.
BRLM was ultimately applied on CCM risk element mining to isolate the
top three risk levels SNVs (pathogenic, likely pathogenic and uncertain
significance) for further investigation. Three-level SNVs were verified
by genetic functional domains and protein-protein interactions (PPIs)
to demonstrate its effectiveness regarding pathogenicity. Subsequently,
these three-class mutated genes were undergo KEGG pathway enrichment
analysis with up- and downstream cumulative effect evaluation. The
integrative results outlined an upstream regulator gene FGF1, which
provided a clear and concise multi-omics atlas of CCM functional
landscape.
2. Results
2.1. Accuracy evaluation of BRLM
The BRLM model was initially trained on 12 TCGA cancers, encompassing
367,224 SNVs from 3104 patients. The datasets sourced from TCGA were
diverse, containing mutations from various organs and volumes to
achieve optimal parameters for best performance. The dataset comprised
a large-scale cases of popular cancers (ACC, BRCA, and GBM), a
small-scale cases of rare cancers (CHOL, DLBC, and KICH), as well as a
medium-sized cases of common cancers. Detailed information about the 12
TCGA datasets is presented in [68]Table 1, including descriptions,
patient numbers, mutation amounts, and links for each dataset. The
density distribution of the aforementioned SNVs is depicted in [69]Fig.
1A (left).
Table 1.
TCGA data information.
Abbreviation Cancer Patients SNVs Link
ACC Adrenocortical carcinoma 90 20,161 https://portal.gdc.cancer.gov/
projects/TCGA-ACC
BLCA Bladder Urothelial Carcinoma 130 39,309
https://portal.gdc.cancer.gov/ projects/TCGA-BLCA
BRCA Breast invasive carcinoma 982 84,713
https://portal.gdc.cancer.gov/ projects/TCGA-BRCA
CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma
194 35,606 https://portal.gdc.cancer.gov/ projects/TCGA-CESC
CHOL Cholangiocarcinoma 35 3833 https://portal.gdc.cancer.gov/
projects/TCGA-CHOL
COAD Colon adenocarcinoma 154 54,349 https://portal.gdc.cancer.gov/
projects/TCGA-COAD
DLBC Lymphoid Neoplasm Diffuse Large B cell Lymphoma 48 7276
https://portal.gdc.cancer.gov/ projects/TCGA-DLBC
ESCA Esophageal carcinoma 185 36,288 https://portal.gdc.cancer.gov/
projects/TCGA-ESCA
GBM Glioblastoma multiforme 290 22,044 https://portal.gdc.cancer.gov/
projects/TCGA-GBM
KICH Kidney Chromophobe 66 5923 https://portal.gdc.cancer.gov/
projects/TCGA-KICH
KIPAN KICH (Kidney Chromophobe), KIRC (Kidney renal clear cell
carcinoma), and KIRP(Kidney renal papillary cell carcinoma) 644 47,875
[70]https://www.linkedomics.org/ data_download/TCGA-KIPAN/
LGG Brain Lower Grade Glioma 286 9847 https://portal.gdc.cancer.gov/
projects/TCGA-LGG
[71]Open in a new tab
Fig. 1.
[72]Fig. 1
[73]Open in a new tab
BRLM model construction and performance evaluation. (A)BRLM structure,
including BioBERT encoded annotation of SNVs into 728-dimensional
vectors visualized by UMAP (left), ResNet50 model architecture for SNV
classification (middle); and classification results presented by UMAP
(right). (B) TCGA pan-cancer classified variants distribution in
Nightingale rose diagram. (C) Classification performances among four
biomedical encoders in 12 TCGA cancers after 100 epochs. (D)
Classification accuracy of BRLM per 10 epochs in 12 TCGA cancers. (E)
Classification F1-score of BRLM per 10 epochs in 12 TCGA cancers. (F)
Expression comparison between tumor and normal tissues as validation of
TCGA variants classification.
Concurrently, the vectors were input into the ResNet50 classifier,
whose structure is illustrated in the mid-panel of [74]Fig. 1A along
with its classification results in the right panel. Adhering to the
American College of Medical Genetics and Genomics (ACMG) [75][39]
criteria for variant classification, all TCGA SNVs were categorized
into five classes, namely Class 1 (pathogenic), Class 2 (likely
pathogenic), Class 3 (uncertain significance), Class 4 (likely benign),
and Class 5 (benign). Likewise, the final classification results are
highly consistent with the clusters formed by UMAP. It is notable that
mutations in Class 3 (uncertain significance) comprised the largest
fraction of the sequencing analysis outcomes, and the results of BRLM
align closely with this distribution.
In addition to presenting overall statistics, we analyzed the
distribution of classes within each tumor type, as depicted in [76]Fig.
1B. Due to the diverse nature of tumor variants, their classification
proportions exhibit distinct characteristic. Class 3, representing SNVs
of uncertain significance, is prominent across multiple tumors (CESC,
CHOL, DLBC, GBM, KICH, KIPAN, and LGG). Conversely, for other tumors
such as BRCA, BLCA, COAD, and ESCA, the classes demonstrate relatively
equal proportions. Notably, more than 90% of the cases in ACC are
constituted by Class 4 (likely benign) variants. These findings are
consistent with pre-established pathogenic evidence, that certain types
of cancers such as ACC, BRCA, CESC, and COAD may exhibit a focal
concentration of pathogenic SNVs, resulting in clear categorized
mutations and drug targets. However, most carcinomas exhibit
significant heterogeneity and individual variability, making it
difficult to classify intricate risk factors.
After completing 100 training epochs across 12 cancers in TCGA, the
performance comparison of frequently used encoders in the biomedical
field (BERT, ClinicalBERT, and fastText) is presented in [77]Fig. 1C,
which is a three-dimensional line graph for the accuracy, mAP (mean
average precision), and F1-score. The performance reveals that BioBERT
is superior to other encoders, particularly in accuracy and F1-score.
To further evaluate the model’s performance, we presented the curve
graphs for accuracy and F1-score at 10-epoch intervals for each type of
cancer in [78]Fig. 1D and E. The plots demonstrate consistent testing
performance throughout each training epoch, with the optimal values of
accuracy ranging from 0.91 to 0.99 and F1-score ranging from 0.80 to
0.97. As the F1-score represents the harmonic mean of precision and
recall, their line graphs were further included in Supplementary Figure
S1, where the optimal precision ranges from 0.821 to 0.988 and the
optimal recall ranges from 0.753 to 0.967 across the 12 datasets.
To further validate the classification results, we extracted
tumor-related variants and integrated them with TCGA RNAseq dataset
expression analysis. Differential expression comparison between tumor
and normal conditions for each class is illustrated in [79]Fig. 1F. The
classified cancer-related SNVs (Class 1 and 2) exhibit
extremely-significant differences, while Class 3 (uncertain
disease-associated probability) demonstrates significant difference. In
contrast, no significant expression differences are observed between
the normal and tumor groups for neutral mutations (class 4 and 5).
2.2. Generalizability assessment of BRLM
Following the optimal model structure determined through multi-cancer
verification on the TCGA dataset, BRLM was subsequently employed to
analyze a familial WGS data with the aim of identifying variants
associated with CCM. The genetic pedigree of the family was composed of
four CCM affected individuals and four unaffected individuals marked by
asterisks; see Supplementary Figure S2 for details. Based on the WGS
analysis with Genome Analysis Toolkit (GATK) [80][5] workflow for
germline short variants, 409,666 germline variants were identified.
Several benign SNVs were found in the three well-known CCM-related
genes (KRIT1, CCM2 and PDCD10), while a large amount of variants with
unknown impacts were also identified. These benign mutations have a
high frequency in population and are located within intronic areas,
whose detailed information can be found in Supplementary Table S1. We
thus developed BRLM to assist the labor-intensive tasks of SNVs
interpretation and integration. The entire variants were then annotated
and transformed into BioBERT embedded vectors.
Classification results from ResNet50 are visualized in [81]Fig. 2A
using the UMAP technique, which underscores the specificity of BRLM in
SNV classification through clusters of aggregated distributions.
Validations of these results drawn upon both SIFT scores and Clinvar
records are demonstrated in [82]Fig. 2B and C. The SIFT scoring
mechanism ranges from 0 to 1 ([83]Fig. 2B), with color intensity
reflecting the severity of detrimental effects where a score of 0
indicates the most deleterious mutation. On the other hand, [84]Fig. 2C
illustrates the Clinvar category mapping. Based on the comparison
between the two plots, BRLM shows a good performance in identifying
CCM-related variants. Specifically, Class 1 and 2 in [85]Fig. 2A are
consistent with risk SNVs as aligned with the yellow clusters in
[86]Fig. 2B and the red unlabeled variants in [87]Fig. 2C. Similarly,
the classification results for Class 4 and 5 also correlate well with
the patterns observed in both graphs.
Fig. 2.
[88]Fig. 2
[89]Open in a new tab
BRLM mutation classification for a familial CCM WGS. (A) UMAP plot for
the BRLM classified CCM SNVs. (B) UMAP of the SNVs with SIFT Scores
attached. (C) UMAP for SNVs annotated with Clinvar categories. (D)
Functional variants statistics of regulatory and exonic regions for the
five classes. (E) Statistics of potential pathogenic variants
distribution within functional and exonic regions for Class 1, 2 and 3.
(F) Circos plot with low-density functional areas distribution
connected by PPI between high CCM risk variants in Class 1, 2. (G)
Circos plot with high-density functional areas distribution connected
by PPI between uncertain CCM risk variants in Class 3.
However, the classification of variants with uncertain significance
(Class 3) remains a contentious issue due to discrepancies between the
score deficiency highlighted in [90]Fig. 2B and the benign or likely
benign categorization in [91]Fig. 2C. To further assess the
effectiveness of our classification results, an additional eleven
algorithms for variant classification were utilized. The detailed UMAP
plots of the outcomes can be found in Supplementary Figure S3. Of the
total eleven algorithms, nine could only annotate a limited number of
variants albeit with diverse clusters, whereas the remaining two were
incapable of categorizing the variants. The majority of classified SNVs
demonstrated to be consistent with ours. Additionally, BRLM was able to
estimate unpredicted variants by other algorithms, highlighting the
comprehensiveness and superiority of our model.
Owing to the doubtable SNVs in Class 3, we investigated their
functional regions. In [92]Fig. 2D, statistical data is presented
regarding overall functional variants (left) and specific exonic
variants (right). Due to the substantial number of functional mutations
within Class 3, we further localized their chromosome distribution in
[93]Fig. 2E, corresponding to functional and specific exonic SNVs. The
overall distribution pattern remains consistent across both sets of
variants. The mutation coordinate and distribution is an important part
of the input text. However, functional mutations are distributed near
exonic mutations, revealing a possible regulatory impact.
To narrow down valid data, we extracted positive sites from SNVs
depicted in [94]Fig. 2E. These positive sites were defined by the
variant quality score recalibration (VQSR) model in GATK. VQSR learned
features by machine-learning algorithms from mutation sites to
distinguish positives from negatives. The detailed parameter settings
for VQSR are introduced in [95]Section 3.3.
Circos graphs presented in [96]Fig. 2F and G illustrate the positive
sites distribution comparison between Class 1, 2 and 3. [97]Fig. 2F
portrays greater risk variants within Class 1 and 2, whereas [98]Fig.
2G emphasizes abundant mutual effects within Class 3. Mutated genetic
functional regions are colored at the outermost chromosomes, where
exonic variants are highlighted in red. Their frequencies are depicted
in curves and bars of the two middle rings that are obtained from the
1000 Genomes and the Genome Aggregation Database (gnomAD), and internal
lines indicate Protein-Protein Interactions (PPI) among them.
Conversely, deleterious interactions among genes in Class 1 and 2 are
only centered on three links with six genes highlighted in bold: FGF1,
FGF6, ABCB1, ABCG2, IL4R, and CTLA4. Consequently, interactive effects
among these genes have higher priorities in our following analysis.
2.3. Enrichment analysis of candidate genes and pathways
To explore the biological functions affected by the categorized
CCM-related mutations, an initial KEGG pathway enrichment analysis was
performed for the mutated genes within Class 1, 2 and 3, yielding 661
enriched pathways. The simplifyEnrichment package was then invoked to
cluster the similarity matrix of enriched terms into groups using
“binary cut” [99][16], and the results are illustrated by the heatmap
in [100]Fig. 3A. Based on the eight clusters with descriptive
vocabulary frequencies indicated by font size, the mutated genes play
significant roles in the developmental processes and signaling
activities on cell adhesion and proliferation.
Fig. 3.
[101]Fig. 3
[102]Open in a new tab
Enrichment results for mutated genes in Class 1, 2 and 3. (A)
Similarity clustering heatmap for enriched pathways with term
frequencies exhibited by font size. (B) K-means clustering for the top
50 pathways with the most significant p-values. (C) The top 10 enriched
pathways enumeration in terms of p-values.
In order to further characterize the functional terms, the top 50
pathways were grouped into k-means clusters, illustrated in [103]Fig.
3B. It is notable that the top cluster relevant to cell junction
adhesion comprises 17 distinct pathways, while the other three clusters
are associated with cGMP-PKG activation, AGE-RAGE signaling, and
metabolic correlation. However, these top 50 pathways were still
extremely complicated, we thus narrowed them down to feature the top
ten pathways with the most significant p-values, as highlighted in
[104]Fig. 3C.
According to the specific results in [105]Fig. 3C, cell signaling
pathways hold a consistently substantial proportion. Particularly, the
PIK3-Akt, MAPK, Ras, and Rap1 signaling pathways hold the top four
positions, with focal adhesion ranking the third. The aforementioned
SNVs identified in Class 1, 2 and 3 exhibited the potential to induce
aberrant cell functions, thereby becoming predisposing factors for CCM.
2.4. Simulation of Mutated Genes Perturbation among Pathways
After confirming the involvement of SNVs in enriched pathways, the next
study was to address the functional roles of the mutated genes with
accumulative effects, which have remained untouched in previous CCM
studies. To quantify the functional implications of mutated genes at
the pathway level, an algorithm calculated pathway mutations
accumulative perturbation score (PMAP score) was adopted [106][26]. The
PMAP score was used to measure the actual perturbation impact on
enriched pathways under a candidate gene set encompassing the three
classes (Class 1, 2 and 3). A comprehensive list of perturbed pathways
along with their PMAP scores for each class is provided in the
Supplementary Table S2. According to their PMAP scores comparison, the
scores were almost equal between Class 1 and 2, which turned out
different from Class 3. We thus take the Class 1, 2 as a whole in the
follow-up study regarding perturbation.
Since the gene list in Class 3 contained uncertainties that differ from
the well-established lists in Class 1, 2, their proportion of perturbed
gene sets were compared within the top 10 highest scored pathways. This
result is depicted in [107]Fig. 4A as a tree plot. Based on the pie
charts at the end of each tree branch, the perturbation rate of Class
1, 2 is mainly higher or competitive to that of Class 3. Nevertheless,
the number of mutated genes in Class 3 far exceed those in Class 1, 2.
Moreover, the pathways with remarkably high PMAP scores (including
PIK3-Akt, MAPK, Ras, and Rap1 signaling pathways) are also consistent
with the top ten enriched results in [108]Fig. 3C. These findings
suggest that the SNVs have an influence on CCM cellular signaling
functions, with major effects from Class 1, 2 and minor effects from
Class 3.
Fig. 4.
[109]Fig. 4
[110]Open in a new tab
Pathways perturbation simulation derived from mutated genes in Class 1,
2 and 3. (A) Tree plot for the top 10 pathways with the highest PMAP
score, where the pie chart shows the proportion of involved genes from
Class 1, 2 and Class 3. It is evident that fewer mutated genes in Class
1, 2 play more important perturbation roles than those in Class 3. (B)
Containment relationship for top 10 perturbated pathways and functional
domain mutated genes. (C) Sankey plot for risk CCM-related elements in
three levels for mutations, genes and pathways.
For primary perturbed genes with functional relevance were identified
among Class 1 and 2, we thus constructed a perturbated network focusing
on the top ten scored pathways connected by these genes (see [111]Fig.
4B). The ratios of overlap perturbed genes are shown in Supplementary
Figure S4, which distinguishes genes that uniquely belong to one
pathway or perturb two or more pathways. The five PPI-linked genes
(FGF1, FGF6, ABCB1, ABCG2, and IL4R) from [112]Fig. 2F with Class 1 and
2 SNVs are highlighted in bold blue. Furthermore, one of the known
pathogenic genes for CCM, KRIT1, is conspicuously present in this
network (in brown bold font). However, two intronic variants in KRIT1
were common with frequencies exceeding 0.04 in the 1000 Genomes
database as shown in the Supplementary Table S1. Among the pathways,
Vascular Smooth Muscle Contraction and GABAergic Synapse exhibit
remarkable perturbation scores despite being excluded from the top 10
enriched pathways in [113]Fig. 3C, with p-values of 0.002 and 0.03
respectively. This further supports the idea that perturbation
algorithms applied in BRLM can help uncover ignored information.
In order to determine the significant risk pathways, the overlap
between the top 10 scored (in [114]Fig. 4B) and top 10 enriched (in
[115]Fig. 3C) pathways were extracted. Combined with the major effect
genes harboring functional SNVs in protein interactions highlighted in
[116]Fig. 4B, the elements at three levels (mutation, gene, and pathway
level) were established as critical CCM pathogenic routes, elucidated
through the three-bucket Sankey graph in [117]Fig. 4C. The first bucket
effectively outlines the presence of seven functional SNVs within these
genes (six Class 1 SNVs in FGF1 and one Class 2 SNV in FGF6). Notably,
all seven SNVs have just been cataloged with rsIDs from NCBI dbSNP, yet
none of them bear any reports pertaining to CCM or related biological
implications. The second bucket contains a pair of functionally
pathogenic mutated genes, namely FGF1 and FGF6 from the fibroblast
growth factor family. While the four risk pathways in the third bucket
coincide with the top four scored and five enriched results. The
overlapping results support the accuracy of both enriched and
perturbated pathways, which can improve the integrity of CCM-related
mutation knowledge.
To assess the effectiveness of the above CCM-related three-level risk
elements, whose classification entities were extracted from PubMed by
BioBERT [118][25]. The PubMed publication proportion statistics from
2010 to 2023 are shown as three trends for each level in Supplementary
Figure S5. According to the Supplementary Figure S5A, only SNV
rs17217240 had been reported in 2010, while there was no literature
support for the others. Regarding the materials for these two genes
displayed in Supplementary Figure S5B, over the past two decades, FGF1
has received considerable attention but no relation to CCM, while FGF6
has been scarcely reported. Finally, the statistics for pathway
publications reflect overall continuance attention, albeit with varying
numerical records as illustrated by the Supplementary Figure S5C. These
publication statistics unveil abundant diversity in features for
variant classification.
2.5. FGF1 is the upstream master regulator gene of perturbated pathways
Based on the enrichment analysis and the perturbation analysis, RNA-seq
differential expression geneset between CCMECs and HBMECs introduced
before had verified for final major-effect gene determination
[119][44]. FGF1 is found to be highly up-regulated (p-value=0 and
log2FoldChange=1.8), while FGF6 takes no statistical difference
(p-value=0.71 and log2FoldChange=−0.6). The detailed differentially
expressed results for genes engaged in perturbated pathways are
illustrated in the Supplementary Table S3. Herein, we dive into the
details of FGF1 from multi-omics as shown in [120]Fig. 5, including the
scRNA-seq expression clusters ([121]Fig. 5A), the WGS mutant
transcripts ([122]Fig. 5B), the RNA-seq expression profiles ([123]Fig.
5C), and the perturbated pathways ([124]Fig. 5D).
Fig. 5.
[125]Fig. 5
[126]Open in a new tab
The FGF1 acts as the master regulator gene upstream of perturbed
pathways from multi-omics results integration. (A) FGF1 specific
expressed in Astrocytes Cluster from scRNA-seq. The markers for
“Astrocytes” cluster are colored in blue, while the marker for
“Capillaries” is in red. (B) FGF1 mutation sites distribution in two
mutant transcripts from WGS. (C) Differential expression profiles for
multi-effect genes from RNA-seq. (D) Main effect gene FGF1 with
multiple functional variants located upstream of peaked genes from
perturbated and enriched pathways, reacted with mutated genes in Class
1, 2 and 3 including KRIT1, one of the three known CCM pathogenic
genes. Multi-connection mutated genes in the same pathway are outlined
with dashed lines.
For exploring expressed cell groups of FGF1, scRNA-seq was carried out
on a pair group of CCM mouse model under two normal conditions and two
deletions of Pdcd10 [127][36]. The UMAP plots for joint clustering are
displayed in [128]Fig. 5A, with 16,220 cells from the two Pdcd10-wt
mice and 19,135 cells from the two Pdcd10-ko mice after low-quality
cells filtering. Based on gene-marked UMAPs among 12 cell clusters,
FGF1 identifies special expression in cluster 8 whose marker genes
(Aqp4, Gfap, and Slc1a2 colored in blue) are primarily associated with
astrocytes. This cluster also includes the receptor genes for the FGFR
family, with Fgfr3 and Fgfr1 (colored in blue) exhibiting specific and
significant expression. Additionally, Atp1b1, a marker for capillaries
(colored in red) is also highly expressed in this cluster. Recent
research has indicated that astrocytes propel neurovascular dysfunction
during cerebral cavernous malformation lesion formation [129][29].
All six SNVs in FGF1 were classified as Class 1, corresponding to two
transcripts: FGF1–008 (transcript ID: ENST00000378046) and FGF1–013
(transcript ID: ENST00000411960). Both transcripts performed
protein-coding functions with most mutants situated on the longer one
(FGF1–013), see transcripts distribution in [130]Fig. 5B. All the
transcripts for FGF1 are listed in the Supplementary Figure S6
referenced from GENECODE19 ([131]https://www.
gencodegenes.org/human/release_19.html).
To further explore the cumulative influence between the major-effect
gene FGF1 and other minor-effect genes found by BRLM, differential
expression profiles whose parameters taken from the Supplementary Table
S3 are constructed in [132]Fig. 5C. The minor-effect genes were
detected in positive sites across Class 1, 2 and 3, along with the
corresponding perturbated pathways from [133]Fig. 4. Most of these
genes show significant differential expression in both p-value < 0.05
and |log2FoldChange|> 1 indicated by read dots in [134]Fig. 5C.
According to the categorized multi-effect genes based on mutation and
expression, an expanded biological regulatory landscape featuring
partially function exploration is conducted in [135]Fig. 5D. FGF1,
highlighted in orange, demonstrates a master regulatory role, while the
remaining genes are associated with minor impact SNVs across Class 1, 2
and 3. As we can see from [136]Fig. 5D, FGF1 is located upstream,
indicating the direct perturbation of the four signaling pathways
denoted by orange boxes. Importantly, among them the downstream of Rap1
Signaling pathway is positioned to one of the known CCM pathogenic
genes, KRIT1. These four signaling pathways play critical roles in cell
growth and tissue development, which can impact vascular integrity.
In addition to the direct action of FGF1 on the four signaling
pathways, two indirect processes, namely Vascular Smooth Muscle
Contraction and GABAergic Synapse indicated by green boxes, are
regulated by Ca2 + levels via Calcium Signaling Pathway, a downstream
fundamental cellular signaling process of FGF1. Above findings indicate
that FGF1 may activate downstream signals, which suggests a significant
role in the occurrence and development of familial CCMs. In order to
improve the reliability of our model, we have provided further evidence
for the regulatory function of FGF1 in CCM. The supporting literatures
are chronologically rearranged in Figure S6 and detailed in Section S2
of the Supplementary file. These materials provide a comprehensive
overview of FGF1, emphasizing its potent role in inducing angiogenesis
in the brain by regulating multiple growth factor signaling pathways.
The development of FGF1 as a powerful stimulator of angiogenesis for
various brain-related diseases will be emphasized in the Discussion
section.
According to the previous studies, the technologies of genetic or
chemical CCM induced models are still immature. The typical cell
experiment was to cultivate cells (mostly endothelial cells) from CCM
patients for further research based on the original RNAseq results in
[137]Fig. 5C [138][44]. On the other hand, there are diverse findings
showing FGF1’s pivotal role in angiogenesis, as detailed in Section S3
of the Supplementary file. The main results and conclusions are
summarized as follows. Firstly, FGF1 is found to induce the
proliferation and migration of endothelial cells at the cellular level
[139][61], whose results are displayed in Supplementary Figure S7.
Secondly, the FGF1 morpholino knockdown zebrafish embryos display
anomalous cell aggregation in the intermediate cell mass [140][46],
whose results are shown in Supplementary Figure S8. Last but not least,
in a rat model involving the implantation of collagen-suspended beads
into exposed femoral pedicles, a notable enhancement in vascular
density is observed at 1 and 6 weeks compared to the bolus
administration of FGF1 [141][32], whose results are presented in
Supplementary Figure S9 and Supplementary Figure S10.
3. Materials and Methods
The BRLM web server is available at [142]http://1.117.230.196 and the
source codes are available at
[143]https://github.com/wangyiqi806643897/BRLM. The raw sequence data
reported in this paper have been deposited in the Genome Sequence
Archive (Genomics, Proteomics & Bioinformatics 2021) in National
Genomics Data Center (Nucleic Acids Res 2022), China National Center
for Bioinformation / Beijing Institute of Genomics, Chinese Academy of
Sciences (GSA-Human: HRA005489) that are publicly accessible at
[144]https://ngdc.cncb.ac.cn/gsa-human. The other results are shown in
Supplementary files.
3.1. CCM family samples collection
The samples for these eight family members were collected at Shiyan
Taihe Hospital, Hubei, China, whose family genetic pedigree is
presented in Supplementary Figure S2A. Informed consent was obtained
from all subjects, and this study was approved by Taihe Hospital. The
inclusion and exclusion criteria for the data are based on the MRI
results as shown in Supplementary Figure S2B-I, with four individuals
in the case group and four in the control group. Each MRI image was
reviewed by three independent doctors who produced the official reports
and the cross-referenced slices. The case group comprises the proband
in Supplementary Figure S2B, a 34-year-old male patient with CCM, and
three affected first-degree relatives, including his mother (died of
CCM) in Supplementary Figure S2C, aunt (52 years old) in Supplementary
Figure S2D, and brother (30 years old) in Supplementary Figure S2E. The
control group comprised four relatives from his cousin’s family in
Supplementary Figure S2F-I. Seven samples were obtained from collateral
blood, except for the mother, who unfortunately passed away due to the
condition. Her sample was obtained from a block of angioma tissue.
3.2. Whole genome sequencing
Genomic DNA extracted from blood was assessed for quality using
PicoGreen and gel electrophoresis. At least 10 μg of non-degraded DNA
was provided for WGS. Tissue extraction used the Maxwell 16 Tissue DNA
Purification Kit (Promega), which followed the manufacturer's
instructions and utilized 10 mg of tissue. Additional quality controls
were conducted, including assessing DNA purity and integrity through
agarose gel electrophoresis. Furthermore, DNA purity was determined
using Nanodrop detection (OD 260/280 ratio), and DNA quantification was
carried out with Qubit 2.0. To shear approximately 300 ng of
high-quality DNA samples (OD 260/280 =1.8–2.0), a Covaris S220
Sonicator (Covaris) was used to generate fragments of ∼350 bp. The
fragmented DNA was purified using Illumina's Sample Purification Beads.
Adapter-ligated libraries were prepared using TruSeq Nano DNA Sample
Prep Kits (Illumina) in accordance with the Illumina protocol. The
sequencing was conducted on an Illumina HiSeq system for 2 * 150
paired-end sequencing at Novogene in Wuhan, China.
3.3. CCM family WGS analysis
After quality control and trimming by Fastp [145][9], the dataset
consisting of high-quality clean sequences in fastq format was
obtained. Subsequently, sequence alignment to the GRCH37 reference
genome was executed using SAMtools with default parameters, achieving a
mapping rate of over 90%. The Germline short variant discovery pipeline
of GATK version 4 (GATK4) [146][5] was then employed.
To evaluate the feasibility of each site in our data and identify false
positives, VQSR of GATK was employed in two modes: SNP mode and Indel
mode. For each mode, distinct reference databases were assigned to the
corresponding argument sets, and their parameters are detailed in
[147]Table 2. In SNP mode, the utilized databases included HapMap3.3
[148][10], OMNI2.5 [149][40], 1000 Genomes [150][12], and dbSNP
([151]http://www.ncbi.nlm.nih.gov/projects/SNP/). While in Indel mode,
the databases consisted of dbSNP
([152]http://www.ncbi.nlm.nih.gov/projects/SNP/) and Mills Gold
[153][30]. Each of the aforementioned databases necessitates the
determination of four key parameters: “known” (whether the data is used
as known variation for marking), “training” (whether the data is used
for training), “truth” (whether the data is used as the ground truth
for verifying), and “prior” (the weight of the data set in model
training, or prior likelihood).
Table 2.
Data parameters of GATK VQSR.
Resource “known” “training” “truth” “prior” Reason
SNP mode:
HapMap false true true 15.0 Strict quality control and experimental
verification
OMNI false true true 12.0 Gold standard for genotypes
1000 G false true false 10.0 Deficiency of comprehensive experimental
verification
dbSNP true false false 2.0 Submitted results without rigorously
verification
Indel mode:
Mills Gold true true true 12.0 Verified dataset
dbsnp true false false 2.0 Same as SNP mode
[154]Open in a new tab
3.4. SNVs annotation texts preparation
To annotate SNVs for model training, a multi-step pipeline was
employed. Initially, this pipeline established a comprehensive
annotation process involved three SnpEff and seven ANNOVAR referencing
databases. Simultaneously, relevant gene information was curated from
prominent genetic databases, encompassing details such as function,
expression, phenotype, and pathway.
The annotation databases encompassed resources like Clinvar [155][24]
(version 20220320), 1000 Genomes (version 1000g2015aug_ALL and
1000g2015aug_AFR) [156][12], dbSNP
([157]http://www.ncbi.nlm.nih.gov/projects/SNP/), SIFT [158][28], GWAS
[159][53], Kaviar [160][15], eigen [161][22], and gnomAD [162][23].
Besides the above database annotations, further gene information had
also added to the descriptions of the SNVs accordingly, including the
functions, expressions, distributions, phenotypes presented in
GeneCards ([163]https://www.genecards.org/) and NCBI
([164]https://www.ncbi.nlm.nih.gov/), and the pathway terms presented
in Gene Ontology (GO, [165]http://geneontology.org/) and Kyoto
Encyclopedia of Genes and Genomes (KEGG,
[166]https://www.genome.jp/kegg/).
All these genetic insights were incorporated into natural language
descriptions for BioBERT with the input format demonstrated in
[167]Table 3 and an example shown in Supplementary Table S5. The input
files are available for downloading from the web server
([168]http://1.117.230.196/results_download), along with the source
codes ([169]https://github.com/wangyiqi806643897/BRLM).
Table 3.
BioBERT input text format.
Anno Category Text Format
Annotation Pos Chr:{text} Start:{text} End:{text} Ref:{text} Alt:{text}
Ref Func.refGene:{text}Gene.refGene:{text}GeneDetail.refGene:{text}
ExonicFunc.refGene:{text} AAChange.refGene:{text}
Database CLNALLELEID:{text} CLNDN:{text} CLNDISDB:{text}
CLNREVSTAT:{text} CLNSIG:{text} 1000g2015aug_ALL:{text}
1000g2015aug_AFR:{text} avsnp150:{text} avsift:{text}
GWAVA_region_score:{text} GWAVA_tss_score:{text}
GWAVA_unmatched_score:{text} Kaviar_AF:{text} Kaviar_AC:{text}
Kaviar_AN:{text} gnomAD_exome_ALL:{text} gnomAD_exome_AFR:{text}
gnomAD_exome_AMR:{text} gnomAD_exome_ASJ:{text} gnomAD_exome_EAS:{text}
gnomAD_exome_FIN:{text} gnomAD_exome_NFE:{text}
gno-mAD_exome_OTH:{text} gnomAD_exome_SAS:{text}
Info NCBI_Summary:{text} GeneCards_Summary:{text}
Swiss-Prot_Summary:{text} GO:{text} KEGG:{text}
[170]Open in a new tab
3.5. TCGA verified dataset
Initial verification of SNVs by BRLM was conducted using data obtained
from twelve cancers within the Cancer Genome Atlas (TCGA). Cancer types
from TCGA, accompanied by their abbreviations, included Adrenocortical
carcinoma (ACC), Breast invasive carcinoma (BRCA), Bladder urothelial
carcinoma (BLCA), Colon adenocarcinoma (COAD), Cervical squamous cell
carcinoma and endocervical adenocarcinoma (CESC), Cholangiocarcinoma
(CHOL), Lymphoid neoplasm diffuse large B-cell lymphoma (DLBC),
Esophageal carcinoma (ESCA), Glioblastoma multiforme (GBM), Kidney
chromophobe (KICH), Pan-kidney cohort (KICH+KIRC+KIRP) abbreviated as
(KIPAN), and Brain lower grade glioma (LGG). Following verification of
the aforementioned cancers, mutation annotation format (MAF) files were
downloaded, with each line representing a single variant. These
original MAF files were then converted into avinput format by
convert2annovar.pl in ANNOVAR ([171]http://annovar.openbioinformatics.
org/en/latest/).
3.6. Analyzing environment configuration
WGS analysis was conducted on a server having 128 G RAM and two Silver
4114 CPUs (40 cores in total), installed with CentOS 7.0. SAMtools
mapping was performed using 8 cores, while the SNVs calling step
utilized 16 cores.
BRLM was implemented by using python3.8 with the deep learning
framework of PyTorch1.9. The learning rate was optimized by Adam with
an initial value of 10^−2, a reduction factor of 0.1, and a batch size
of 8. The model was evaluated on a cluster having four NVIDIA V100
GPUs.
3.7. BRLM’s workflow
The overall workflow of BRLM is shown in [172]Fig. 6. After obtaining
annotation texts for SNVs, the BioBERT encoder and ResNet classifier
pipelines were developed for classification, with the training on TCGA
and testing on CCM datasets.
Fig. 6.
[173]Fig. 6
[174]Open in a new tab
BRLM workflow for variant annotations classifying. Starting with
annotated data wrangling, embedded vectors are constructed by BioBERT,
which are classified by ResNet50 for distinct datasets.
3.7.1. Encoding variants
The encoder for SNVs was achieved by BioBERT which stands out as an
extensively employed model in the field of natural language processing
for medical and biological purposes. It was developed as a specialized
iteration past the BERT (Bidirectional Encoder Representations from
Transformers), initially pretrained on English Wikipedia and
BooksCorpus [175][13].
Nevertheless, owing to the significant presence of biomedical-specific
proper nouns (such as KRIT1, c.369 A>G) or terms (like exonic,
intronic), BERT’s general models struggled to achieve a satisfactory
performance. This limitation prompted the emergence of BioBERT,
tailored for tasks involving biomedical text mining. This variant was
pretrained on PubMed abstracts (available at PubMed:
[176]https://pubmed.ncbi.nlm.nih.gov/) and full-text articles from
PubMed Central (accessible at PMC:
[177]https://www.ncbi.nlm.nih.gov/pmc/). Considering that descriptions
of SNVs for annotation could also be categorized as domain-specific
natural language expressions, we leveraged BioBERT’s entity extraction
capabilities to supplant the manual literature querying process in the
interpretation of SNVs.
All variants were annotated within sentences, activating the encoding
module for deep learning purposes. Regarding BRLM, we utilized BioBERT
to encode annotated texts, yielding 728-dimensional vectors per SNV.
The employed model version was biobert-base-cased-v1.1, with a batch
size of 100 and the “mean” pooling algorithm for pooler output.
Regarding pooling, it is used to reduce the size of the feature maps
and avoid sacrificing too much information [178][6]. The “mean” pooling
focuses on overall features with less influence from outliers, making
it more robust than the “max” pooling operation [179][20].
Moreover, instructions were incorporated into each annotated sentence
subsequent to the annotations generated by ANNOVAR. The instruction
format followed the pattern of “Disease name” + “mutation sites”. These
disease names were contingent upon the data sources, either TCGA cancer
species or CCM.
3.7.2. Classifying embedded vectors
The ResNet50 [180][51] was borrowed to identify pathogenic mutations,
whose architecture was illustrated in [181]Fig. 1A. The neural network
depth designed for mutation site analysis significantly outperformed
traditional neural networks in handling SNVs’ large data inputs,
preventing gradient vanishing and performance degradation.
To elucidate the unique structure of ResNet, we conducted a comparative
summary of its specific advantages against two other classical
convolutional neural networks (VGG and Inception), as depicted in
[182]Fig. 7A. The skip connection proposed in ResNet is a significant
breakthrough of deep neural network, which provides a shortcut for
gradients to flow more easily through the network during
backpropagation. Instead of passing through every layer, gradients can
take a shortcut and directly propagate to deeper layers. This helps to
mitigate the vanishing gradient problem, allowing the network to learn
more effectively even as it becomes very deep.
Fig. 7.
[183]Fig. 7
[184]Open in a new tab
Particular structure of ResNet compared with classical convolutional
neural networks and ResNet50 architecture diagram in BRLM. (A) ResNet
residual network with skip connection can solve gradient vanishing
problem. (B) ResNet-50 architecture constructed in BRLM for SNVs
classification.
In BRLM, the classification architecture was similar to ResNet50 but
differed in network size. Precisely, we tailored our network input to
accommodate the 728-dimensional vectors generated from BioBERT, and
adjusted the subsequent layers accordingly. The detailed structures are
outlined in [185]Fig. 7B, recording the kernel size, quantity of
kernels, and stride size in each layer of the four residual blocks. In
regards to the stacked residual blocks, each block comprised of three
convolutional layers with a “skip connection” that bypassed the three
layers within each individual block. The goal was to achieve a
five-class classification, as demonstrated in [186]Fig. 1C. For this
unbalanced dataset, the class weight was applied in the
“CrossEntropyLoss” function, resulting in the loss being calculated as
[MATH: L=∑i
munder>Ni
N
∑jy
mrow>jilogpj
mrow>i :MATH]
(1)
where N[i] is the number of samples in class i with
[MATH: N=∑Ni :MATH]
,
[MATH:
yji
:MATH]
is the label of sample j in class i, and
[MATH:
pji
:MATH]
is the predicted probability of sample j in class i.
The training, validation, and test sets were divided using a 7:1:2
splitting ratio. This ratio had been considered to be the optimal
criteria [187][33]. The evaluations were performed based on a
pre-labeled pan-cancer dataset sourced from TCGA by accuracy,
precision, recall, mean average precision (mAP), and F1-score.
Precisely, the predictions are recognized in the following manner: (i)
True positive (TP) is the number of SNVs classified as pathogenic
variants correctly; (ii) False positive (FP) is the number of SNVs
classified as pathogenic variants incorrectly (unrelated mutation in
fact); (iii) False negative (FN) is the number of SNVs deemed as
non-pathogenic variants incorrectly; and (iv) True negative (TN) is the
number of SNVs deemed as non-pathogenic variants correctly. The
formulas are as follows:
[MATH: Accuracy=TP+TNTP+TN+FP+FN :MATH]
(2)
[MATH: Precision=TPTP+FP :MATH]
(3)
[MATH: Recall=TPTP+FN :MATH]
(4)
[MATH: F1−score=2*Precision*RecallPrecision+Recall :MATH]
(5)
Emphasizing the primacy of training BRLM from TCGA, we eventually
incorporated CCM variant vectors for testing.
3.8. Visualization of Classification Results
The basic plots in this paper were created in R, using the ggplot2
package’s built-in functions. This package was used for basic graphing
methods, such as barplots, dotplots, roseplots, and Sankey diagrams.
Furthermore, different types of graphs were developed employing certain
packages, which will be introduced in later sections.
3.9. UMAP construction for variants
Utilizing BRLM input vectors extracted from BioBERT, it was observed
that the variants were present in a 728-dimensional space, which makes
retaining the global structural information derived from ResNet50
classification results. To effectively map the entire set of variants
onto a two-dimensional space, the Uniform Manifold Approximation and
Projection (UMAP) dimensionality reduction technique was employed.
Variants embedding vectors from TCGA or CCM datasets can be utilized as
input entities for the “umap” package in R4.1. All UMAP graphs
presented in this paper were generated using 5 components representing
five distinct classes and 20 neighbors, while considering multiple
categorical labels including class, SIFT score, Clinvar significance,
and density.
3.10. Variety of genetic mutations statistics
The statistical charts for SNVs were generated using the R4.1 package
known as “maftools”, which is designed to process MAF files. The
“bcftools” was used to convert ANNOVAR multi-anno files from VCF to
MAF. Finally, “plotmafSummary” function was employed for charts
generating.
3.11. SNVs chromosome distribution statistics
Two types of chromosome distribution charts were utilized: the first
employed a chromosome distribution strip map using the CMplot package
in R4.2, while the second utilized a circos plot.
The circos chromosome plot was generated with the RCircos package for R
version 4.2 and its chromosome track was constructed based on the UCSC
HG19 Human CytoBandIdeogram data, which was imported along with the
corresponding gene set. Additional tracks were pre-built, each with
their own subfunction calls. The middle two frequency statistics
channels presented the 1000 Genomes and genomAD records for each
mutation site and converted them into input invocation matrices
beforehand. Internal lines were aligned with PPI pairs from the STRING
database.
3.12. Genes enrichment
The ClusterProfiler package was primarily utilized for enriching mutant
genes in the R4.2 software package with the “dotplot” function of
[188]Fig. 3C visualizing the top 10 pathways, and the “treeplot”
function was called to produce tree diagram in [189]Fig. 4A. In order
to summarize entire significant enriched results, the
simplifyEnrichment package was used to cluster similarity matrices
calculated by “term_similarity” function with a new method named
“binary cut” [190][16]. With the similarity matrix, we can directly
apply “simplifyEnrichment” function to perform partition around medoids
(PAM) with two groups on the similarity matrix in each iteration step.
The similarity clustering algorithm was applied to cluster pathways, as
depicted in the heatmap of [191]Fig. 3A.
Similarly, the K-means clustering algorithm utilized the
“emapplot_cluster” function from the ClusterProfiler package to create
the network displayed in [192]Fig. 3B.
3.13. Genes perturbation algorithm
In order to measure the pathway mutations perturbation levels, the
cumulative effect of mutated genes were quantified by PMAP scores using
an R package known as PMAPscore
([193]https://cran.r-project.org/web/packages/PMAPscore/vignettes/PMAPs
core.html). The package’s “get_mut_status” function has been modified
to accommodate variants in Class 1, 2 and 3. Subsequently, the
cumulative effect of genetic mutations on pathways is utilized to
determine the PMAP scores. This score employs a standard cumulative
perturbation measurement to capture the positioning and impact of
genetic mutations on pathways. The formulas of perturbation scores for
genes and their cumulative effect in pathway are as follows.
Gene’s Perturbation score:
[MATH: GMPscoregi=1igj+∑j=1NβijGMPscore<
/mtext>gjNdsgj :MATH]
(6)
where
[MATH: GMPscoregi :MATH]
denotes the perturbation score of mutated gene
[MATH: gj :MATH]
,
[MATH: 1igj :MATH]
is an indicator function with
[MATH: 1igj=1 :MATH]
if
[MATH: gj :MATH]
belongs to class i, otherwise 0,
[MATH: βij :MATH]
is the relationship between genes
[MATH: gi :MATH]
and
[MATH: gj :MATH]
(if
[MATH: gj :MATH]
is directly interacted with
[MATH: gi :MATH]
,
[MATH: βij=1 :MATH]
, else is 0), N is the total number of genes in pathway
[MATH: pi
:MATH]
,
[MATH: Ndsgj :MATH]
is the number of genes at the downstream of gene
[MATH: gj :MATH]
.
Pathway’s Perturbation score:
[MATH: PMAPscorepi=∑k=1NGMPscoregkNdc :MATH]
(7)
where
[MATH: PMAPscorepi :MATH]
denotes the perturbation score of pathway
[MATH: pi :MATH]
, N is the total number of genes in pathway
[MATH: pi :MATH]
,
[MATH: GMPscoregk :MATH]
denotes the perturbation score of gene
[MATH: gk :MATH]
in the pathway
[MATH: pi :MATH]
,
[MATH: Ndc :MATH]
denotes the number of genes that have mutated in the pathway
[MATH: pi :MATH]
.
3.14. Public scRNA-seq data analysis
The public scRNA-seq raw data were obtained from the Gene Expression
Omnibus database with ID [194]GSE155788
([195]https://www.ncbi.nlm.nih.gov/geo/). The Data analysis relied on
Seurat (version 4.0.5), the filtering criteria was nFeature_RNA < 300
or > 12000 with > 15% expression of mitochondrial genes. After
completing quality control, the next step was normalization. This was
accomplished by using the “LogNormalize” method with default scale
factor and log-transformation. Subsequently, the “FindVariableFeatures”
algorithm calculated a subset of features that yielded significant
cell-to-cell variation in the dataset. This function returned 2000
default features per dataset, which will be used in downstream
analysis.
Prior to the dimensionality reduction, a linear transformation (called
scaling) was applied with the “ScaleData” function. Next, the “RunPCA”
function was invoked for linear dimensionality reduction on the scaled
data. In order to determine the dimensionality, the JackStraw procedure
was executed for principal components selection. As a result, the
remaining cells were clustered together with the npcs of 30 and a
resolution of 0.3. The final clusters were annotated based on the
markers shown in [196][36].
4. Discussion
We have shown that BRLM, constructed from BioBERT encoder and ResNet
classifier, can serve as a SNVs annotation learning model to assist
variants classification and interpretation. BRLM was designed with the
aim of conducting mutation sites analysis associated with various
diseases. The accuracy of BRLM was verified on twelve cancer types in
TCGA, whose progressive accuracy in each epoch was observed across all
cancer datasets. The generalizability was validated on CCM sequencing
analysis, with SIFT scoring and Clinvar annotation validated on the
classification results. Following the perturbation scoring algorithm to
quantify the multi-effects of mutated genes in pathways, an upstream
master regulator gene FGF1 was found supported by the astrocyte cluster
of scRNAseq and differential expression of RNAseq. Our results
demonstrate the feasibility of BRLM in classifying SNVs and contribute
to the discovery of major pathogenic factors.
BRLM innovatively employed the NLP algorithm in WGS pathogenic
information mining, making the image classifier applicable to variants
classification. The application of NLP models in single-cell sequencing
datasets has gained popularity, but this technique is still limited in
clinical laboratories. Some other studies have focused on molecular
information mining from published biological texts. It should be NLP’s
first use in genome sequencing to solve clinical problems. As for this
CCM family who visited for procreation guidance, we had to categorize
all SNVs with no reasonable pathogenicity found in known CCM-caused
genes, and further employed the perturbation scoring algorithms to
quantify the accumulation effects of multi-class mutations in pathways.
Finally, the pathogenic mechanisms in CCM were elucidated, showing that
a major gene FGF1 could contribute to CCM development alongside the
minor genes’ effects. This suggests that the applicability of this
model can be widely used in clinical genetic counseling.
FGF1 can be detected in recent sequencing results owing to the broad
usage of bulk sequencing, other than previous CCM omics studies that
have only probed three known genes. For instance, the direct PCR
implementation revealed five variants in the CCM3/SERPINI1 asymmetric
bidirectional promoter [197][41], and the analysis of coding exons
identified a novel missense mutation, c .422 T > G, in CCM3 [198][42].
With the widespread adoption of next-generation sequencing (NGS), an
expanded set of genes can now be examined. When the mini-bulk RNA
sequencing data of MAP3K3 mutant individuals were compared with those
of MAP3K3 WT individuals during fCCM3 lesion formation, FGF1 was found
to be another up-regulated gene, indicating the activation of ERK1 and
ERK2 cascades [199][56]. Additionally, the downstream of FGF1, namely
TGFBR2 and ACTG2, were found to be consistent with our perturbed
results, indicating shared pathways such as Hippo and MAPK signaling
[200][43]. Subsequently, Fgf1 expression decreased after GJA1–20k
altered the endothelial cell transcriptome with hypermethylation of its
gene body in animal experiments [201][49].
The FGF family, comprising six subfamilies (FGF1, FGF4, FGF7, FGF8,
FGF9, and FGF19) [202][35], plays a critical role in embryonic
development and organogenesis by maintaining progenitor cells and
promoting their growth, differentiation, survival, and patterning
[203][21]. The FGF1 subfamily, inclusive of FGF1 and FGF2, serves as
potent angiogenic inducers that control multiple growth factor
signaling. The FGF1 subfamily regulates vessel formation [204][34],
promotes strong angiogenic responses [205][27], and induces vessel
maturation [206][14]. Notably, it has been implicated as a potential
instigator of aberrant angiogenesis [207][18], which could be linked to
the pathogenesis of arteriovenous malformations [208][19] and other
cerebral vascular anomalies [209][45]. As the only gene in the FGF
family undergoing clinical trials [210][4], FGF1 shows promise for
stimulating blood vessel growth in the brain and addressing various
brain-related diseases, including intracranial aneurysm [211][60],
Alzheimer’s disease [212][52], brain tumors [213][3], brain injuries
[214][4], and ischemic stroke [215][63].
According to the multi-omics data of FGF1 in CCM and its relative
biological functions, we speculated that the BRLM analyzed results of
FGF1 was an upstream regulatory gene with clinical implications in CCM.
The CCM-related elements presentation was a landscape of perturbed
pathways with FGF1 positioned upstream and interacting with downstream
mutated genes, including a known CCM pathogenic gene, KRIT1.
Moreover, BRLM is just constructed from genomic data, an integrating
model can be anticipated with multimodality data (such as clinical
records, images, vital signs monitoring, etc.), which may be expected
to provide more clarity in understanding pathogenicity mechanisms and
elucidating functional genes. Our next efforts will be focused on
refining information consolidation and multimodal learning
exploitation. Alternatively, directions of future work on BRLM will
encompass three models. Firstly, the initial text-coding module will be
upgraded into a time-dependent version so that the progression of
diseases can be accommodated [216][1]. Secondly, an image encoding
module will be added to cope with visual data [217][37]. Finally, a
transformer is conceived so that various modalities, such as texts and
images, can be handled simultaneously [218][2]. Our intention is to
construct a medical knowledge graph illustrating the topological
relationships among clinical symptoms, laboratory indicators,
biological markers, related diseases, and targeted drugs [219][57]. The
prototype functions have been integrated into the public interface,
incorporating targeted drug retrieval and ceRNA network construction.
Further development of these functionalities will be presented in our
following studies.
In conclusion, this study offers a novel insight for pathogenic factors
exploration through biomedical language.
analysis, potentially assisting manual retrieval methods and making up
complicated biological experiments.
5. Conclusions
In this study, a biomedical language learning model called BRLM was
developed to classify and link SNVs, with the goal of assisting in the
labor-intensive tasks of interpretation and integration. The pipeline
was utilized to classify variants into five categories, as defined by
ACMG guidelines with multi-class interaction network construction.
Comprehensive databases were compiled containing annotations that
describe variants in biomedical natural language. To facilitate this,
we employed the BioBERT, which primarily focuses on entity recognition
during embedding. Then the encoded vectors were used to train a
convolutional neural network. This model was trained on 12 TCGA
datasets and tested on a familial CCM dataset. From the classified
results, we performed pathway variants accumulative perturbation
analysis, and found a master regulatory gene, FGF1, that could be
highly related to CCM. The core contribution of this study resides in
the integration of a large language model into a variant classifier,
where the former is able to capture the essential information from the
massive input data while the latter guarantees better usage of the
acquired information. The effectiveness of this protocol has been
verified by multi-omics sequencing analysis. Moreover, a web server has
been realized to facilitate the broad usage of the proposed model. This
study highlights the potential of our approach to comprehend the
complex interplay of genetic variants within biological terms.
Institutional Review
The study was approved by the Institutional Review Board (or Ethics
Committee) of Taihe Hospital, Shiyan, Hubei, China (protocol code
2023KS33, September 15th, 2023).
Funding
This research was funded by the National Natural Science Foundation of
China (32070973 and 32060150).
CRediT authorship contribution statement
Yiqi Wang: Conceptualization, Methodology, Software. Jinmei Zuo: Data
collection and sampling. Chao Duan: Data curation. Hao Peng: Data
collection and analysis. Jia Huang: Data collection and sampling. Liang
Zhao: Conceiving and designing, reviewing and editing. Li Zhang:
Original draft preparation. Zhiqiang Dong: Visualization,
Investigation.
All authors have read and agreed to the submitted version of the
manuscript. All authors have read and agreed to the writing of the
manuscript.
Declaration of Competing Interest
The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to
influence the work reported in this paper.
Footnotes
^Appendix A
Supplementary data associated with this article can be found in the
online version at [220]doi:10.1016/j.csbj.2024.01.014.
Contributor Information
Liang Zhao, Email: s080011@e.ntu.edu.sg.
Li Zhang, Email: zhanglith@163.com.
Zhiqiang Dong, Email: dongz@mail.hzau.edu.cn.
Appendix A. Supplementary material
Supplementary material
[221]mmc1.pdf^ (20MB, pdf)
.
Supplementary material
[222]mmc2.csv^ (5.8KB, csv)
.
Supplementary material
[223]mmc3.xlsx^ (13.1KB, xlsx)
.
References