Abstract

   Cerebral cavernous malformation (CCM) is a polygenic disease with
   intricate genetic interactions contributing to quantitative
   pathogenesis across multiple factors. The principal pathogenic genes of
   CCM, specifically KRIT1, CCM2, and PDCD10, have been reported,
   accompanied by a growing wealth of genetic data related to mutations.
   Furthermore, numerous other molecules associated with CCM have been
   unearthed. However, tackling such massive volumes of unstructured data
   remains challenging until the advent of advanced large language models.
   In this study, we developed an automated analytical pipeline
   specialized in single nucleotide variants (SNVs) related biomedical
   text analysis called BRLM. To facilitate this, BioBERT was employed to
   vectorize the rich information of SNVs, while a deep residue network
   was used to discriminate the classes of the SNVs. BRLM was initially
   constructed on mutations from 12 different types of TCGA cancers,
   achieving an accuracy exceeding 99%. It was further examined for CCM
   mutations in familial sequencing data analysis, highlighting an
   upstream master regulator gene fibroblast growth factor 1 (FGF1). With
   multi-omics characterization and validation in biological function,
   FGF1 demonstrated to play a significant role in the development of
   CCMs, which proved the effectiveness of our model. The BRLM web server
   is available at [37]http://1.117.230.196.

   Keywords: Whole genome sequencing, Cerebral cavernous malformation,
   Deep learning, Large language model, Natural language processing

Graphical Abstract

   [38]ga1
   [39]Open in a new tab

Highlights

     * •
       A language model is introduced to assist the variants
       classification in familial cerebral cavernous malformation.
     * •
       Integrative bioinformatics analysis, rooted in the classification,
       guarantees potential mutated genes identification.
     * •
       Upstream master regulator gene FGF1 plays a substantial role based
       on multi-omics verification.

1. Introduction

   Cerebral cavernous malformation (CCM; OMIM 116860), also known as
   cerebral cavernous angiomas, can manifest as sporadic or autosomal
   dominant conditions. These conditions consist of a varied range of
   relatively prevalent lesions that have important clinical implications
   [40][58]. These angiomas may arise sporadically or be inherited, with
   identified causative genes primarily attributed to KRIT1, CCM2, and
   PDCD10 [41][31], which have been linked to the molecular diagnostic
   criteria of the condition for the last two decades [42][44]. However,
   not all sources succeeded in identifying mutations within specified
   loci of the above three genes. The advent of next-generation sequencing
   techniques has unveiled a broader spectrum of CCM-associated genes.
   With regard to transcriptome sequencing, the focus had been on 1325
   genes displaying differential expression between CCM endothelial cells
   (CCMECs) and Human brain microvascular endothelial cells (HBMECs)
   [43][44], along with 80 enriched pathway terms [44][31]. Additionally,
   whole genome sequencing (WGS) had even detected instances of
   heterozygous inversion without mutations [45][47]. Presently, the most
   advanced approach, single cell RNA sequencing (scRNA-seq), has provided
   a comprehensive gene expression atlas for mouse models of CCM across
   various cell groups [46][36].

   Despite the collective findings from these aforementioned studies, an
   all-encompassing pathogenic genetic mutations for CCM remains elusive.
   Consequently, the substantial body of recent research has shifted its
   focus from singular genes to polygenic factors associated with
   diseases. This study aims to explore the genetic regulatory aspects of
   inherited CCM WGS data, based on SNVs' annotated textual information
   mining.

   Those multi-factor texts were presented in various forms, encompassing
   functional descriptions [47][58], regulatory networks [48][7],
   scientific literatures [49][54], and retrievable databases [50][38],
   thereby rendering these discoveries as distinct “informational isolated
   island” [51][17]. Complex disease traits have been revealed to stem
   from the interplay of multiple elements. Among the potential
   contributors to the pathogenesis of complex diseases, SNVs, genes, and
   gene interactions stand out as prominent factors [52][50]. In response,
   researchers have formulated various algorithms aimed at scrutinizing
   biomarkers. Examples include the penalized logistic regression [53][8],
   multi-factor dimension reduction [54][8], set association [55][59],
   Bayesian network and random forest method [56][62], to name a few.
   However, these algorithms are limited by the need for explicit feature
   extractions, which must be engineered artificially.

   While genetic information is inherently an aspect of natural language,
   where large language model (LLM) had been devoted to natural language
   processing (NLP) in comprehensive feature tracking and summarization.
   The two primary NLP technologies currently utilized are GPT [57][55]
   and BERT [58][13]. GPT has not received any specific biological
   training, although the recently developed scGPT [59][11] had been
   limited to single cell sequencing analysis. BioBERT [60][25], on the
   other hand, specializes in the biomedical field and has been utilized
   for biomedical natural language encoding. Such language learning model
   had been applied to molecular interactions mining [61][48].

   By using BioBERT [62][25] for NLP, unstructured annotation texts were
   converted into per-SNV vectors. All SNV information was transformed
   into computationally manageable vectors, which was a great challenge to
   overcome gradient disappearance and performance degradation in
   traditional deep neural networks. These issues were effectively
   addressed since the invention of ResNet [63][51]. The ResNet50 was
   borrowed in this study for BioBERT encoded vector classification with
   intention, as it provides an ideal balance of depth and computational
   efficiency for a variety of tasks [64][51]. Thus, classification of
   input vectors for vast variants was carried out using reconstructed
   ResNet50 [65][51], although ResNet was originally employed in image
   recognition.

   Using BioBERT [66][25] as encorder and ResNet50 [67][51] as classifier,
   we name our model as BRLM, short for BioBERT vectorized input for
   ResNet classification language model, dedicated to pathogenicity
   classification of SNVs in annotation texts. BRLM’s performance was
   validated on 12 TCGA cancers to ensure its accuracy and robustness. To
   demonstrate its proficiency in resolving practical problems, we
   successfully classified SNVs for familial CCM WGS variants, and
   analyzed mutated genes involved in perturbated pathways.

   BRLM was ultimately applied on CCM risk element mining to isolate the
   top three risk levels SNVs (pathogenic, likely pathogenic and uncertain
   significance) for further investigation. Three-level SNVs were verified
   by genetic functional domains and protein-protein interactions (PPIs)
   to demonstrate its effectiveness regarding pathogenicity. Subsequently,
   these three-class mutated genes were undergo KEGG pathway enrichment
   analysis with up- and downstream cumulative effect evaluation. The
   integrative results outlined an upstream regulator gene FGF1, which
   provided a clear and concise multi-omics atlas of CCM functional
   landscape.

2. Results

2.1. Accuracy evaluation of BRLM

   The BRLM model was initially trained on 12 TCGA cancers, encompassing
   367,224 SNVs from 3104 patients. The datasets sourced from TCGA were
   diverse, containing mutations from various organs and volumes to
   achieve optimal parameters for best performance. The dataset comprised
   a large-scale cases of popular cancers (ACC, BRCA, and GBM), a
   small-scale cases of rare cancers (CHOL, DLBC, and KICH), as well as a
   medium-sized cases of common cancers. Detailed information about the 12
   TCGA datasets is presented in [68]Table 1, including descriptions,
   patient numbers, mutation amounts, and links for each dataset. The
   density distribution of the aforementioned SNVs is depicted in [69]Fig.
   1A (left).

Table 1.

   TCGA data information.
   Abbreviation Cancer Patients SNVs Link
   ACC Adrenocortical carcinoma 90 20,161 https://portal.gdc.cancer.gov/
   projects/TCGA-ACC
   BLCA Bladder Urothelial Carcinoma 130 39,309
   https://portal.gdc.cancer.gov/ projects/TCGA-BLCA
   BRCA Breast invasive carcinoma 982 84,713
   https://portal.gdc.cancer.gov/ projects/TCGA-BRCA
   CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma
   194 35,606 https://portal.gdc.cancer.gov/ projects/TCGA-CESC
   CHOL Cholangiocarcinoma 35 3833 https://portal.gdc.cancer.gov/
   projects/TCGA-CHOL
   COAD Colon adenocarcinoma 154 54,349 https://portal.gdc.cancer.gov/
   projects/TCGA-COAD
   DLBC Lymphoid Neoplasm Diffuse Large B cell Lymphoma 48 7276
   https://portal.gdc.cancer.gov/ projects/TCGA-DLBC
   ESCA Esophageal carcinoma 185 36,288 https://portal.gdc.cancer.gov/
   projects/TCGA-ESCA
   GBM Glioblastoma multiforme 290 22,044 https://portal.gdc.cancer.gov/
   projects/TCGA-GBM
   KICH Kidney Chromophobe 66 5923 https://portal.gdc.cancer.gov/
   projects/TCGA-KICH
   KIPAN KICH (Kidney Chromophobe), KIRC (Kidney renal clear cell
   carcinoma), and KIRP(Kidney renal papillary cell carcinoma) 644 47,875
   [70]https://www.linkedomics.org/ data_download/TCGA-KIPAN/
   LGG Brain Lower Grade Glioma 286 9847 https://portal.gdc.cancer.gov/
   projects/TCGA-LGG
   [71]Open in a new tab

Fig. 1.

   [72]Fig. 1
   [73]Open in a new tab

   BRLM model construction and performance evaluation. (A)BRLM structure,
   including BioBERT encoded annotation of SNVs into 728-dimensional
   vectors visualized by UMAP (left), ResNet50 model architecture for SNV
   classification (middle); and classification results presented by UMAP
   (right). (B) TCGA pan-cancer classified variants distribution in
   Nightingale rose diagram. (C) Classification performances among four
   biomedical encoders in 12 TCGA cancers after 100 epochs. (D)
   Classification accuracy of BRLM per 10 epochs in 12 TCGA cancers. (E)
   Classification F1-score of BRLM per 10 epochs in 12 TCGA cancers. (F)
   Expression comparison between tumor and normal tissues as validation of
   TCGA variants classification.

   Concurrently, the vectors were input into the ResNet50 classifier,
   whose structure is illustrated in the mid-panel of [74]Fig. 1A along
   with its classification results in the right panel. Adhering to the
   American College of Medical Genetics and Genomics (ACMG) [75][39]
   criteria for variant classification, all TCGA SNVs were categorized
   into five classes, namely Class 1 (pathogenic), Class 2 (likely
   pathogenic), Class 3 (uncertain significance), Class 4 (likely benign),
   and Class 5 (benign). Likewise, the final classification results are
   highly consistent with the clusters formed by UMAP. It is notable that
   mutations in Class 3 (uncertain significance) comprised the largest
   fraction of the sequencing analysis outcomes, and the results of BRLM
   align closely with this distribution.

   In addition to presenting overall statistics, we analyzed the
   distribution of classes within each tumor type, as depicted in [76]Fig.
   1B. Due to the diverse nature of tumor variants, their classification
   proportions exhibit distinct characteristic. Class 3, representing SNVs
   of uncertain significance, is prominent across multiple tumors (CESC,
   CHOL, DLBC, GBM, KICH, KIPAN, and LGG). Conversely, for other tumors
   such as BRCA, BLCA, COAD, and ESCA, the classes demonstrate relatively
   equal proportions. Notably, more than 90% of the cases in ACC are
   constituted by Class 4 (likely benign) variants. These findings are
   consistent with pre-established pathogenic evidence, that certain types
   of cancers such as ACC, BRCA, CESC, and COAD may exhibit a focal
   concentration of pathogenic SNVs, resulting in clear categorized
   mutations and drug targets. However, most carcinomas exhibit
   significant heterogeneity and individual variability, making it
   difficult to classify intricate risk factors.

   After completing 100 training epochs across 12 cancers in TCGA, the
   performance comparison of frequently used encoders in the biomedical
   field (BERT, ClinicalBERT, and fastText) is presented in [77]Fig. 1C,
   which is a three-dimensional line graph for the accuracy, mAP (mean
   average precision), and F1-score. The performance reveals that BioBERT
   is superior to other encoders, particularly in accuracy and F1-score.

   To further evaluate the model’s performance, we presented the curve
   graphs for accuracy and F1-score at 10-epoch intervals for each type of
   cancer in [78]Fig. 1D and E. The plots demonstrate consistent testing
   performance throughout each training epoch, with the optimal values of
   accuracy ranging from 0.91 to 0.99 and F1-score ranging from 0.80 to
   0.97. As the F1-score represents the harmonic mean of precision and
   recall, their line graphs were further included in Supplementary Figure
   S1, where the optimal precision ranges from 0.821 to 0.988 and the
   optimal recall ranges from 0.753 to 0.967 across the 12 datasets.

   To further validate the classification results, we extracted
   tumor-related variants and integrated them with TCGA RNAseq dataset
   expression analysis. Differential expression comparison between tumor
   and normal conditions for each class is illustrated in [79]Fig. 1F. The
   classified cancer-related SNVs (Class 1 and 2) exhibit
   extremely-significant differences, while Class 3 (uncertain
   disease-associated probability) demonstrates significant difference. In
   contrast, no significant expression differences are observed between
   the normal and tumor groups for neutral mutations (class 4 and 5).

2.2. Generalizability assessment of BRLM

   Following the optimal model structure determined through multi-cancer
   verification on the TCGA dataset, BRLM was subsequently employed to
   analyze a familial WGS data with the aim of identifying variants
   associated with CCM. The genetic pedigree of the family was composed of
   four CCM affected individuals and four unaffected individuals marked by
   asterisks; see Supplementary Figure S2 for details. Based on the WGS
   analysis with Genome Analysis Toolkit (GATK) [80][5] workflow for
   germline short variants, 409,666 germline variants were identified.
   Several benign SNVs were found in the three well-known CCM-related
   genes (KRIT1, CCM2 and PDCD10), while a large amount of variants with
   unknown impacts were also identified. These benign mutations have a
   high frequency in population and are located within intronic areas,
   whose detailed information can be found in Supplementary Table S1. We
   thus developed BRLM to assist the labor-intensive tasks of SNVs
   interpretation and integration. The entire variants were then annotated
   and transformed into BioBERT embedded vectors.

   Classification results from ResNet50 are visualized in [81]Fig. 2A
   using the UMAP technique, which underscores the specificity of BRLM in
   SNV classification through clusters of aggregated distributions.
   Validations of these results drawn upon both SIFT scores and Clinvar
   records are demonstrated in [82]Fig. 2B and C. The SIFT scoring
   mechanism ranges from 0 to 1 ([83]Fig. 2B), with color intensity
   reflecting the severity of detrimental effects where a score of 0
   indicates the most deleterious mutation. On the other hand, [84]Fig. 2C
   illustrates the Clinvar category mapping. Based on the comparison
   between the two plots, BRLM shows a good performance in identifying
   CCM-related variants. Specifically, Class 1 and 2 in [85]Fig. 2A are
   consistent with risk SNVs as aligned with the yellow clusters in
   [86]Fig. 2B and the red unlabeled variants in [87]Fig. 2C. Similarly,
   the classification results for Class 4 and 5 also correlate well with
   the patterns observed in both graphs.

Fig. 2.

   [88]Fig. 2
   [89]Open in a new tab

   BRLM mutation classification for a familial CCM WGS. (A) UMAP plot for
   the BRLM classified CCM SNVs. (B) UMAP of the SNVs with SIFT Scores
   attached. (C) UMAP for SNVs annotated with Clinvar categories. (D)
   Functional variants statistics of regulatory and exonic regions for the
   five classes. (E) Statistics of potential pathogenic variants
   distribution within functional and exonic regions for Class 1, 2 and 3.
   (F) Circos plot with low-density functional areas distribution
   connected by PPI between high CCM risk variants in Class 1, 2. (G)
   Circos plot with high-density functional areas distribution connected
   by PPI between uncertain CCM risk variants in Class 3.

   However, the classification of variants with uncertain significance
   (Class 3) remains a contentious issue due to discrepancies between the
   score deficiency highlighted in [90]Fig. 2B and the benign or likely
   benign categorization in [91]Fig. 2C. To further assess the
   effectiveness of our classification results, an additional eleven
   algorithms for variant classification were utilized. The detailed UMAP
   plots of the outcomes can be found in Supplementary Figure S3. Of the
   total eleven algorithms, nine could only annotate a limited number of
   variants albeit with diverse clusters, whereas the remaining two were
   incapable of categorizing the variants. The majority of classified SNVs
   demonstrated to be consistent with ours. Additionally, BRLM was able to
   estimate unpredicted variants by other algorithms, highlighting the
   comprehensiveness and superiority of our model.

   Owing to the doubtable SNVs in Class 3, we investigated their
   functional regions. In [92]Fig. 2D, statistical data is presented
   regarding overall functional variants (left) and specific exonic
   variants (right). Due to the substantial number of functional mutations
   within Class 3, we further localized their chromosome distribution in
   [93]Fig. 2E, corresponding to functional and specific exonic SNVs. The
   overall distribution pattern remains consistent across both sets of
   variants. The mutation coordinate and distribution is an important part
   of the input text. However, functional mutations are distributed near
   exonic mutations, revealing a possible regulatory impact.

   To narrow down valid data, we extracted positive sites from SNVs
   depicted in [94]Fig. 2E. These positive sites were defined by the
   variant quality score recalibration (VQSR) model in GATK. VQSR learned
   features by machine-learning algorithms from mutation sites to
   distinguish positives from negatives. The detailed parameter settings
   for VQSR are introduced in [95]Section 3.3.

   Circos graphs presented in [96]Fig. 2F and G illustrate the positive
   sites distribution comparison between Class 1, 2 and 3. [97]Fig. 2F
   portrays greater risk variants within Class 1 and 2, whereas [98]Fig.
   2G emphasizes abundant mutual effects within Class 3. Mutated genetic
   functional regions are colored at the outermost chromosomes, where
   exonic variants are highlighted in red. Their frequencies are depicted
   in curves and bars of the two middle rings that are obtained from the
   1000 Genomes and the Genome Aggregation Database (gnomAD), and internal
   lines indicate Protein-Protein Interactions (PPI) among them.
   Conversely, deleterious interactions among genes in Class 1 and 2 are
   only centered on three links with six genes highlighted in bold: FGF1,
   FGF6, ABCB1, ABCG2, IL4R, and CTLA4. Consequently, interactive effects
   among these genes have higher priorities in our following analysis.

2.3. Enrichment analysis of candidate genes and pathways

   To explore the biological functions affected by the categorized
   CCM-related mutations, an initial KEGG pathway enrichment analysis was
   performed for the mutated genes within Class 1, 2 and 3, yielding 661
   enriched pathways. The simplifyEnrichment package was then invoked to
   cluster the similarity matrix of enriched terms into groups using
   “binary cut” [99][16], and the results are illustrated by the heatmap
   in [100]Fig. 3A. Based on the eight clusters with descriptive
   vocabulary frequencies indicated by font size, the mutated genes play
   significant roles in the developmental processes and signaling
   activities on cell adhesion and proliferation.

Fig. 3.

   [101]Fig. 3
   [102]Open in a new tab

   Enrichment results for mutated genes in Class 1, 2 and 3. (A)
   Similarity clustering heatmap for enriched pathways with term
   frequencies exhibited by font size. (B) K-means clustering for the top
   50 pathways with the most significant p-values. (C) The top 10 enriched
   pathways enumeration in terms of p-values.

   In order to further characterize the functional terms, the top 50
   pathways were grouped into k-means clusters, illustrated in [103]Fig.
   3B. It is notable that the top cluster relevant to cell junction
   adhesion comprises 17 distinct pathways, while the other three clusters
   are associated with cGMP-PKG activation, AGE-RAGE signaling, and
   metabolic correlation. However, these top 50 pathways were still
   extremely complicated, we thus narrowed them down to feature the top
   ten pathways with the most significant p-values, as highlighted in
   [104]Fig. 3C.

   According to the specific results in [105]Fig. 3C, cell signaling
   pathways hold a consistently substantial proportion. Particularly, the
   PIK3-Akt, MAPK, Ras, and Rap1 signaling pathways hold the top four
   positions, with focal adhesion ranking the third. The aforementioned
   SNVs identified in Class 1, 2 and 3 exhibited the potential to induce
   aberrant cell functions, thereby becoming predisposing factors for CCM.

2.4. Simulation of Mutated Genes Perturbation among Pathways

   After confirming the involvement of SNVs in enriched pathways, the next
   study was to address the functional roles of the mutated genes with
   accumulative effects, which have remained untouched in previous CCM
   studies. To quantify the functional implications of mutated genes at
   the pathway level, an algorithm calculated pathway mutations
   accumulative perturbation score (PMAP score) was adopted [106][26]. The
   PMAP score was used to measure the actual perturbation impact on
   enriched pathways under a candidate gene set encompassing the three
   classes (Class 1, 2 and 3). A comprehensive list of perturbed pathways
   along with their PMAP scores for each class is provided in the
   Supplementary Table S2. According to their PMAP scores comparison, the
   scores were almost equal between Class 1 and 2, which turned out
   different from Class 3. We thus take the Class 1, 2 as a whole in the
   follow-up study regarding perturbation.

   Since the gene list in Class 3 contained uncertainties that differ from
   the well-established lists in Class 1, 2, their proportion of perturbed
   gene sets were compared within the top 10 highest scored pathways. This
   result is depicted in [107]Fig. 4A as a tree plot. Based on the pie
   charts at the end of each tree branch, the perturbation rate of Class
   1, 2 is mainly higher or competitive to that of Class 3. Nevertheless,
   the number of mutated genes in Class 3 far exceed those in Class 1, 2.
   Moreover, the pathways with remarkably high PMAP scores (including
   PIK3-Akt, MAPK, Ras, and Rap1 signaling pathways) are also consistent
   with the top ten enriched results in [108]Fig. 3C. These findings
   suggest that the SNVs have an influence on CCM cellular signaling
   functions, with major effects from Class 1, 2 and minor effects from
   Class 3.

Fig. 4.

   [109]Fig. 4
   [110]Open in a new tab

   Pathways perturbation simulation derived from mutated genes in Class 1,
   2 and 3. (A) Tree plot for the top 10 pathways with the highest PMAP
   score, where the pie chart shows the proportion of involved genes from
   Class 1, 2 and Class 3. It is evident that fewer mutated genes in Class
   1, 2 play more important perturbation roles than those in Class 3. (B)
   Containment relationship for top 10 perturbated pathways and functional
   domain mutated genes. (C) Sankey plot for risk CCM-related elements in
   three levels for mutations, genes and pathways.

   For primary perturbed genes with functional relevance were identified
   among Class 1 and 2, we thus constructed a perturbated network focusing
   on the top ten scored pathways connected by these genes (see [111]Fig.
   4B). The ratios of overlap perturbed genes are shown in Supplementary
   Figure S4, which distinguishes genes that uniquely belong to one
   pathway or perturb two or more pathways. The five PPI-linked genes
   (FGF1, FGF6, ABCB1, ABCG2, and IL4R) from [112]Fig. 2F with Class 1 and
   2 SNVs are highlighted in bold blue. Furthermore, one of the known
   pathogenic genes for CCM, KRIT1, is conspicuously present in this
   network (in brown bold font). However, two intronic variants in KRIT1
   were common with frequencies exceeding 0.04 in the 1000 Genomes
   database as shown in the Supplementary Table S1. Among the pathways,
   Vascular Smooth Muscle Contraction and GABAergic Synapse exhibit
   remarkable perturbation scores despite being excluded from the top 10
   enriched pathways in [113]Fig. 3C, with p-values of 0.002 and 0.03
   respectively. This further supports the idea that perturbation
   algorithms applied in BRLM can help uncover ignored information.

   In order to determine the significant risk pathways, the overlap
   between the top 10 scored (in [114]Fig. 4B) and top 10 enriched (in
   [115]Fig. 3C) pathways were extracted. Combined with the major effect
   genes harboring functional SNVs in protein interactions highlighted in
   [116]Fig. 4B, the elements at three levels (mutation, gene, and pathway
   level) were established as critical CCM pathogenic routes, elucidated
   through the three-bucket Sankey graph in [117]Fig. 4C. The first bucket
   effectively outlines the presence of seven functional SNVs within these
   genes (six Class 1 SNVs in FGF1 and one Class 2 SNV in FGF6). Notably,
   all seven SNVs have just been cataloged with rsIDs from NCBI dbSNP, yet
   none of them bear any reports pertaining to CCM or related biological
   implications. The second bucket contains a pair of functionally
   pathogenic mutated genes, namely FGF1 and FGF6 from the fibroblast
   growth factor family. While the four risk pathways in the third bucket
   coincide with the top four scored and five enriched results. The
   overlapping results support the accuracy of both enriched and
   perturbated pathways, which can improve the integrity of CCM-related
   mutation knowledge.

   To assess the effectiveness of the above CCM-related three-level risk
   elements, whose classification entities were extracted from PubMed by
   BioBERT [118][25]. The PubMed publication proportion statistics from
   2010 to 2023 are shown as three trends for each level in Supplementary
   Figure S5. According to the Supplementary Figure S5A, only SNV
   rs17217240 had been reported in 2010, while there was no literature
   support for the others. Regarding the materials for these two genes
   displayed in Supplementary Figure S5B, over the past two decades, FGF1
   has received considerable attention but no relation to CCM, while FGF6
   has been scarcely reported. Finally, the statistics for pathway
   publications reflect overall continuance attention, albeit with varying
   numerical records as illustrated by the Supplementary Figure S5C. These
   publication statistics unveil abundant diversity in features for
   variant classification.

2.5. FGF1 is the upstream master regulator gene of perturbated pathways

   Based on the enrichment analysis and the perturbation analysis, RNA-seq
   differential expression geneset between CCMECs and HBMECs introduced
   before had verified for final major-effect gene determination
   [119][44]. FGF1 is found to be highly up-regulated (p-value=0 and
   log2FoldChange=1.8), while FGF6 takes no statistical difference
   (p-value=0.71 and log2FoldChange=−0.6). The detailed differentially
   expressed results for genes engaged in perturbated pathways are
   illustrated in the Supplementary Table S3. Herein, we dive into the
   details of FGF1 from multi-omics as shown in [120]Fig. 5, including the
   scRNA-seq expression clusters ([121]Fig. 5A), the WGS mutant
   transcripts ([122]Fig. 5B), the RNA-seq expression profiles ([123]Fig.
   5C), and the perturbated pathways ([124]Fig. 5D).

Fig. 5.

   [125]Fig. 5
   [126]Open in a new tab

   The FGF1 acts as the master regulator gene upstream of perturbed
   pathways from multi-omics results integration. (A) FGF1 specific
   expressed in Astrocytes Cluster from scRNA-seq. The markers for
   “Astrocytes” cluster are colored in blue, while the marker for
   “Capillaries” is in red. (B) FGF1 mutation sites distribution in two
   mutant transcripts from WGS. (C) Differential expression profiles for
   multi-effect genes from RNA-seq. (D) Main effect gene FGF1 with
   multiple functional variants located upstream of peaked genes from
   perturbated and enriched pathways, reacted with mutated genes in Class
   1, 2 and 3 including KRIT1, one of the three known CCM pathogenic
   genes. Multi-connection mutated genes in the same pathway are outlined
   with dashed lines.

   For exploring expressed cell groups of FGF1, scRNA-seq was carried out
   on a pair group of CCM mouse model under two normal conditions and two
   deletions of Pdcd10 [127][36]. The UMAP plots for joint clustering are
   displayed in [128]Fig. 5A, with 16,220 cells from the two Pdcd10-wt
   mice and 19,135 cells from the two Pdcd10-ko mice after low-quality
   cells filtering. Based on gene-marked UMAPs among 12 cell clusters,
   FGF1 identifies special expression in cluster 8 whose marker genes
   (Aqp4, Gfap, and Slc1a2 colored in blue) are primarily associated with
   astrocytes. This cluster also includes the receptor genes for the FGFR
   family, with Fgfr3 and Fgfr1 (colored in blue) exhibiting specific and
   significant expression. Additionally, Atp1b1, a marker for capillaries
   (colored in red) is also highly expressed in this cluster. Recent
   research has indicated that astrocytes propel neurovascular dysfunction
   during cerebral cavernous malformation lesion formation [129][29].

   All six SNVs in FGF1 were classified as Class 1, corresponding to two
   transcripts: FGF1–008 (transcript ID: ENST00000378046) and FGF1–013
   (transcript ID: ENST00000411960). Both transcripts performed
   protein-coding functions with most mutants situated on the longer one
   (FGF1–013), see transcripts distribution in [130]Fig. 5B. All the
   transcripts for FGF1 are listed in the Supplementary Figure S6
   referenced from GENECODE19 ([131]https://www.
   gencodegenes.org/human/release_19.html).

   To further explore the cumulative influence between the major-effect
   gene FGF1 and other minor-effect genes found by BRLM, differential
   expression profiles whose parameters taken from the Supplementary Table
   S3 are constructed in [132]Fig. 5C. The minor-effect genes were
   detected in positive sites across Class 1, 2 and 3, along with the
   corresponding perturbated pathways from [133]Fig. 4. Most of these
   genes show significant differential expression in both p-value < 0.05
   and |log2FoldChange|> 1 indicated by read dots in [134]Fig. 5C.

   According to the categorized multi-effect genes based on mutation and
   expression, an expanded biological regulatory landscape featuring
   partially function exploration is conducted in [135]Fig. 5D. FGF1,
   highlighted in orange, demonstrates a master regulatory role, while the
   remaining genes are associated with minor impact SNVs across Class 1, 2
   and 3. As we can see from [136]Fig. 5D, FGF1 is located upstream,
   indicating the direct perturbation of the four signaling pathways
   denoted by orange boxes. Importantly, among them the downstream of Rap1
   Signaling pathway is positioned to one of the known CCM pathogenic
   genes, KRIT1. These four signaling pathways play critical roles in cell
   growth and tissue development, which can impact vascular integrity.

   In addition to the direct action of FGF1 on the four signaling
   pathways, two indirect processes, namely Vascular Smooth Muscle
   Contraction and GABAergic Synapse indicated by green boxes, are
   regulated by Ca2 + levels via Calcium Signaling Pathway, a downstream
   fundamental cellular signaling process of FGF1. Above findings indicate
   that FGF1 may activate downstream signals, which suggests a significant
   role in the occurrence and development of familial CCMs. In order to
   improve the reliability of our model, we have provided further evidence
   for the regulatory function of FGF1 in CCM. The supporting literatures
   are chronologically rearranged in Figure S6 and detailed in Section S2
   of the Supplementary file. These materials provide a comprehensive
   overview of FGF1, emphasizing its potent role in inducing angiogenesis
   in the brain by regulating multiple growth factor signaling pathways.
   The development of FGF1 as a powerful stimulator of angiogenesis for
   various brain-related diseases will be emphasized in the Discussion
   section.

   According to the previous studies, the technologies of genetic or
   chemical CCM induced models are still immature. The typical cell
   experiment was to cultivate cells (mostly endothelial cells) from CCM
   patients for further research based on the original RNAseq results in
   [137]Fig. 5C [138][44]. On the other hand, there are diverse findings
   showing FGF1’s pivotal role in angiogenesis, as detailed in Section S3
   of the Supplementary file. The main results and conclusions are
   summarized as follows. Firstly, FGF1 is found to induce the
   proliferation and migration of endothelial cells at the cellular level
   [139][61], whose results are displayed in Supplementary Figure S7.
   Secondly, the FGF1 morpholino knockdown zebrafish embryos display
   anomalous cell aggregation in the intermediate cell mass [140][46],
   whose results are shown in Supplementary Figure S8. Last but not least,
   in a rat model involving the implantation of collagen-suspended beads
   into exposed femoral pedicles, a notable enhancement in vascular
   density is observed at 1 and 6 weeks compared to the bolus
   administration of FGF1 [141][32], whose results are presented in
   Supplementary Figure S9 and Supplementary Figure S10.

3. Materials and Methods

   The BRLM web server is available at [142]http://1.117.230.196 and the
   source codes are available at
   [143]https://github.com/wangyiqi806643897/BRLM. The raw sequence data
   reported in this paper have been deposited in the Genome Sequence
   Archive (Genomics, Proteomics & Bioinformatics 2021) in National
   Genomics Data Center (Nucleic Acids Res 2022), China National Center
   for Bioinformation / Beijing Institute of Genomics, Chinese Academy of
   Sciences (GSA-Human: HRA005489) that are publicly accessible at
   [144]https://ngdc.cncb.ac.cn/gsa-human. The other results are shown in
   Supplementary files.

3.1. CCM family samples collection

   The samples for these eight family members were collected at Shiyan
   Taihe Hospital, Hubei, China, whose family genetic pedigree is
   presented in Supplementary Figure S2A. Informed consent was obtained
   from all subjects, and this study was approved by Taihe Hospital. The
   inclusion and exclusion criteria for the data are based on the MRI
   results as shown in Supplementary Figure S2B-I, with four individuals
   in the case group and four in the control group. Each MRI image was
   reviewed by three independent doctors who produced the official reports
   and the cross-referenced slices. The case group comprises the proband
   in Supplementary Figure S2B, a 34-year-old male patient with CCM, and
   three affected first-degree relatives, including his mother (died of
   CCM) in Supplementary Figure S2C, aunt (52 years old) in Supplementary
   Figure S2D, and brother (30 years old) in Supplementary Figure S2E. The
   control group comprised four relatives from his cousin’s family in
   Supplementary Figure S2F-I. Seven samples were obtained from collateral
   blood, except for the mother, who unfortunately passed away due to the
   condition. Her sample was obtained from a block of angioma tissue.

3.2. Whole genome sequencing

   Genomic DNA extracted from blood was assessed for quality using
   PicoGreen and gel electrophoresis. At least 10 μg of non-degraded DNA
   was provided for WGS. Tissue extraction used the Maxwell 16 Tissue DNA
   Purification Kit (Promega), which followed the manufacturer's
   instructions and utilized 10 mg of tissue. Additional quality controls
   were conducted, including assessing DNA purity and integrity through
   agarose gel electrophoresis. Furthermore, DNA purity was determined
   using Nanodrop detection (OD 260/280 ratio), and DNA quantification was
   carried out with Qubit 2.0. To shear approximately 300 ng of
   high-quality DNA samples (OD 260/280 =1.8–2.0), a Covaris S220
   Sonicator (Covaris) was used to generate fragments of ∼350 bp. The
   fragmented DNA was purified using Illumina's Sample Purification Beads.
   Adapter-ligated libraries were prepared using TruSeq Nano DNA Sample
   Prep Kits (Illumina) in accordance with the Illumina protocol. The
   sequencing was conducted on an Illumina HiSeq system for 2 * 150
   paired-end sequencing at Novogene in Wuhan, China.

3.3. CCM family WGS analysis

   After quality control and trimming by Fastp [145][9], the dataset
   consisting of high-quality clean sequences in fastq format was
   obtained. Subsequently, sequence alignment to the GRCH37 reference
   genome was executed using SAMtools with default parameters, achieving a
   mapping rate of over 90%. The Germline short variant discovery pipeline
   of GATK version 4 (GATK4) [146][5] was then employed.

   To evaluate the feasibility of each site in our data and identify false
   positives, VQSR of GATK was employed in two modes: SNP mode and Indel
   mode. For each mode, distinct reference databases were assigned to the
   corresponding argument sets, and their parameters are detailed in
   [147]Table 2. In SNP mode, the utilized databases included HapMap3.3
   [148][10], OMNI2.5 [149][40], 1000 Genomes [150][12], and dbSNP
   ([151]http://www.ncbi.nlm.nih.gov/projects/SNP/). While in Indel mode,
   the databases consisted of dbSNP
   ([152]http://www.ncbi.nlm.nih.gov/projects/SNP/) and Mills Gold
   [153][30]. Each of the aforementioned databases necessitates the
   determination of four key parameters: “known” (whether the data is used
   as known variation for marking), “training” (whether the data is used
   for training), “truth” (whether the data is used as the ground truth
   for verifying), and “prior” (the weight of the data set in model
   training, or prior likelihood).

Table 2.

   Data parameters of GATK VQSR.
   Resource “known” “training” “truth” “prior” Reason
   SNP mode:
   HapMap false true true 15.0 Strict quality control and experimental
   verification
   OMNI false true true 12.0 Gold standard for genotypes
   1000 G false true false 10.0 Deficiency of comprehensive experimental
   verification
   dbSNP true false false 2.0 Submitted results without rigorously
   verification
   Indel mode:
   Mills Gold true true true 12.0 Verified dataset
   dbsnp true false false 2.0 Same as SNP mode
   [154]Open in a new tab

3.4. SNVs annotation texts preparation

   To annotate SNVs for model training, a multi-step pipeline was
   employed. Initially, this pipeline established a comprehensive
   annotation process involved three SnpEff and seven ANNOVAR referencing
   databases. Simultaneously, relevant gene information was curated from
   prominent genetic databases, encompassing details such as function,
   expression, phenotype, and pathway.

   The annotation databases encompassed resources like Clinvar [155][24]
   (version 20220320), 1000 Genomes (version 1000g2015aug_ALL and
   1000g2015aug_AFR) [156][12], dbSNP
   ([157]http://www.ncbi.nlm.nih.gov/projects/SNP/), SIFT [158][28], GWAS
   [159][53], Kaviar [160][15], eigen [161][22], and gnomAD [162][23].

   Besides the above database annotations, further gene information had
   also added to the descriptions of the SNVs accordingly, including the
   functions, expressions, distributions, phenotypes presented in
   GeneCards ([163]https://www.genecards.org/) and NCBI
   ([164]https://www.ncbi.nlm.nih.gov/), and the pathway terms presented
   in Gene Ontology (GO, [165]http://geneontology.org/) and Kyoto
   Encyclopedia of Genes and Genomes (KEGG,
   [166]https://www.genome.jp/kegg/).

   All these genetic insights were incorporated into natural language
   descriptions for BioBERT with the input format demonstrated in
   [167]Table 3 and an example shown in Supplementary Table S5. The input
   files are available for downloading from the web server
   ([168]http://1.117.230.196/results_download), along with the source
   codes ([169]https://github.com/wangyiqi806643897/BRLM).

Table 3.

   BioBERT input text format.
   Anno Category Text Format
   Annotation Pos Chr:{text} Start:{text} End:{text} Ref:{text} Alt:{text}
   Ref Func.refGene:{text}Gene.refGene:{text}GeneDetail.refGene:{text}
   ExonicFunc.refGene:{text} AAChange.refGene:{text}
   Database CLNALLELEID:{text} CLNDN:{text} CLNDISDB:{text}
   CLNREVSTAT:{text} CLNSIG:{text} 1000g2015aug_ALL:{text}
   1000g2015aug_AFR:{text} avsnp150:{text} avsift:{text}
   GWAVA_region_score:{text} GWAVA_tss_score:{text}
   GWAVA_unmatched_score:{text} Kaviar_AF:{text} Kaviar_AC:{text}
   Kaviar_AN:{text} gnomAD_exome_ALL:{text} gnomAD_exome_AFR:{text}
   gnomAD_exome_AMR:{text} gnomAD_exome_ASJ:{text} gnomAD_exome_EAS:{text}
   gnomAD_exome_FIN:{text} gnomAD_exome_NFE:{text}
   gno-mAD_exome_OTH:{text} gnomAD_exome_SAS:{text}
   Info NCBI_Summary:{text} GeneCards_Summary:{text}
   Swiss-Prot_Summary:{text} GO:{text} KEGG:{text}
   [170]Open in a new tab

3.5. TCGA verified dataset

   Initial verification of SNVs by BRLM was conducted using data obtained
   from twelve cancers within the Cancer Genome Atlas (TCGA). Cancer types
   from TCGA, accompanied by their abbreviations, included Adrenocortical
   carcinoma (ACC), Breast invasive carcinoma (BRCA), Bladder urothelial
   carcinoma (BLCA), Colon adenocarcinoma (COAD), Cervical squamous cell
   carcinoma and endocervical adenocarcinoma (CESC), Cholangiocarcinoma
   (CHOL), Lymphoid neoplasm diffuse large B-cell lymphoma (DLBC),
   Esophageal carcinoma (ESCA), Glioblastoma multiforme (GBM), Kidney
   chromophobe (KICH), Pan-kidney cohort (KICH+KIRC+KIRP) abbreviated as
   (KIPAN), and Brain lower grade glioma (LGG). Following verification of
   the aforementioned cancers, mutation annotation format (MAF) files were
   downloaded, with each line representing a single variant. These
   original MAF files were then converted into avinput format by
   convert2annovar.pl in ANNOVAR ([171]http://annovar.openbioinformatics.
   org/en/latest/).

3.6. Analyzing environment configuration

   WGS analysis was conducted on a server having 128 G RAM and two Silver
   4114 CPUs (40 cores in total), installed with CentOS 7.0. SAMtools
   mapping was performed using 8 cores, while the SNVs calling step
   utilized 16 cores.

   BRLM was implemented by using python3.8 with the deep learning
   framework of PyTorch1.9. The learning rate was optimized by Adam with
   an initial value of 10^−2, a reduction factor of 0.1, and a batch size
   of 8. The model was evaluated on a cluster having four NVIDIA V100
   GPUs.

3.7. BRLM’s workflow

   The overall workflow of BRLM is shown in [172]Fig. 6. After obtaining
   annotation texts for SNVs, the BioBERT encoder and ResNet classifier
   pipelines were developed for classification, with the training on TCGA
   and testing on CCM datasets.

Fig. 6.

   [173]Fig. 6
   [174]Open in a new tab

   BRLM workflow for variant annotations classifying. Starting with
   annotated data wrangling, embedded vectors are constructed by BioBERT,
   which are classified by ResNet50 for distinct datasets.

3.7.1. Encoding variants

   The encoder for SNVs was achieved by BioBERT which stands out as an
   extensively employed model in the field of natural language processing
   for medical and biological purposes. It was developed as a specialized
   iteration past the BERT (Bidirectional Encoder Representations from
   Transformers), initially pretrained on English Wikipedia and
   BooksCorpus [175][13].

   Nevertheless, owing to the significant presence of biomedical-specific
   proper nouns (such as KRIT1, c.369 A>G) or terms (like exonic,
   intronic), BERT’s general models struggled to achieve a satisfactory
   performance. This limitation prompted the emergence of BioBERT,
   tailored for tasks involving biomedical text mining. This variant was
   pretrained on PubMed abstracts (available at PubMed:
   [176]https://pubmed.ncbi.nlm.nih.gov/) and full-text articles from
   PubMed Central (accessible at PMC:
   [177]https://www.ncbi.nlm.nih.gov/pmc/). Considering that descriptions
   of SNVs for annotation could also be categorized as domain-specific
   natural language expressions, we leveraged BioBERT’s entity extraction
   capabilities to supplant the manual literature querying process in the
   interpretation of SNVs.

   All variants were annotated within sentences, activating the encoding
   module for deep learning purposes. Regarding BRLM, we utilized BioBERT
   to encode annotated texts, yielding 728-dimensional vectors per SNV.
   The employed model version was biobert-base-cased-v1.1, with a batch
   size of 100 and the “mean” pooling algorithm for pooler output.
   Regarding pooling, it is used to reduce the size of the feature maps
   and avoid sacrificing too much information [178][6]. The “mean” pooling
   focuses on overall features with less influence from outliers, making
   it more robust than the “max” pooling operation [179][20].

   Moreover, instructions were incorporated into each annotated sentence
   subsequent to the annotations generated by ANNOVAR. The instruction
   format followed the pattern of “Disease name” + “mutation sites”. These
   disease names were contingent upon the data sources, either TCGA cancer
   species or CCM.

3.7.2. Classifying embedded vectors

   The ResNet50 [180][51] was borrowed to identify pathogenic mutations,
   whose architecture was illustrated in [181]Fig. 1A. The neural network
   depth designed for mutation site analysis significantly outperformed
   traditional neural networks in handling SNVs’ large data inputs,
   preventing gradient vanishing and performance degradation.

   To elucidate the unique structure of ResNet, we conducted a comparative
   summary of its specific advantages against two other classical
   convolutional neural networks (VGG and Inception), as depicted in
   [182]Fig. 7A. The skip connection proposed in ResNet is a significant
   breakthrough of deep neural network, which provides a shortcut for
   gradients to flow more easily through the network during
   backpropagation. Instead of passing through every layer, gradients can
   take a shortcut and directly propagate to deeper layers. This helps to
   mitigate the vanishing gradient problem, allowing the network to learn
   more effectively even as it becomes very deep.

Fig. 7.

   [183]Fig. 7
   [184]Open in a new tab

   Particular structure of ResNet compared with classical convolutional
   neural networks and ResNet50 architecture diagram in BRLM. (A) ResNet
   residual network with skip connection can solve gradient vanishing
   problem. (B) ResNet-50 architecture constructed in BRLM for SNVs
   classification.

   In BRLM, the classification architecture was similar to ResNet50 but
   differed in network size. Precisely, we tailored our network input to
   accommodate the 728-dimensional vectors generated from BioBERT, and
   adjusted the subsequent layers accordingly. The detailed structures are
   outlined in [185]Fig. 7B, recording the kernel size, quantity of
   kernels, and stride size in each layer of the four residual blocks. In
   regards to the stacked residual blocks, each block comprised of three
   convolutional layers with a “skip connection” that bypassed the three
   layers within each individual block. The goal was to achieve a
   five-class classification, as demonstrated in [186]Fig. 1C. For this
   unbalanced dataset, the class weight was applied in the
   “CrossEntropyLoss” function, resulting in the loss being calculated as
   [MATH: <mrow><mi mathvariant="script">L</mi><mo
   linebreak="goodbreak">=</mo><munder><mo>∑</mo><mrow><mi>i</mi></mrow></
   munder><mrow><mfrac><mrow><msub><mrow><mi>N</mi></mrow><mrow><mi>i</mi>
   </mrow></msub></mrow><mrow><mi>N</mi></mrow></mfrac></mrow><munder><mo>
   ∑</mo><mrow><mi>j</mi></mrow></munder><mrow><msubsup><mrow><mi>y</mi></
   mrow><mrow><mi>j</mi></mrow><mrow><mi>i</mi></mrow></msubsup></mrow><mi
   >log</mi><mspace
   width="1em"></mspace><msubsup><mrow><mi>p</mi></mrow><mrow><mi>j</mi></
   mrow><mrow><mi>i</mi></mrow></msubsup></mrow> :MATH]
   (1)

   where N[i] is the number of samples in class i with
   [MATH: <mrow><mi>N</mi><mo
   linebreak="goodbreak">=</mo><mrow><mo>∑</mo></mrow><mrow><msub><mrow><m
   i>N</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></mrow> :MATH]
   ,
   [MATH:
   <msubsup><mrow><mi>y</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi>i</mi>
   </mrow></msubsup> :MATH]
   is the label of sample j in class i, and
   [MATH:
   <msubsup><mrow><mi>p</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi>i</mi>
   </mrow></msubsup> :MATH]
   is the predicted probability of sample j in class i.

   The training, validation, and test sets were divided using a 7:1:2
   splitting ratio. This ratio had been considered to be the optimal
   criteria [187][33]. The evaluations were performed based on a
   pre-labeled pan-cancer dataset sourced from TCGA by accuracy,
   precision, recall, mean average precision (mAP), and F1-score.
   Precisely, the predictions are recognized in the following manner: (i)
   True positive (TP) is the number of SNVs classified as pathogenic
   variants correctly; (ii) False positive (FP) is the number of SNVs
   classified as pathogenic variants incorrectly (unrelated mutation in
   fact); (iii) False negative (FN) is the number of SNVs deemed as
   non-pathogenic variants incorrectly; and (iv) True negative (TN) is the
   number of SNVs deemed as non-pathogenic variants correctly. The
   formulas are as follows:
   [MATH: <mrow><mi mathvariant="normal">Accuracy</mi><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><mi
   mathvariant="normal">TP</mi><mo>+</mo><mi
   mathvariant="normal">TN</mi></mrow><mrow><mi
   mathvariant="normal">TP</mi><mo>+</mo><mi
   mathvariant="normal">TN</mi><mo>+</mo><mi
   mathvariant="normal">FP</mi><mo>+</mo><mi
   mathvariant="normal">FN</mi></mrow></mfrac></mrow> :MATH]
   (2)
   [MATH: <mrow><mi mathvariant="normal">Precision</mi><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><mi
   mathvariant="normal">TP</mi></mrow><mrow><mi
   mathvariant="normal">TP</mi><mo>+</mo><mi
   mathvariant="normal">FP</mi></mrow></mfrac></mrow> :MATH]
   (3)
   [MATH: <mrow><mi>Re</mi><mi mathvariant="normal">call</mi><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><mi
   mathvariant="normal">TP</mi></mrow><mrow><mi
   mathvariant="normal">TP</mi><mo>+</mo><mi
   mathvariant="normal">FN</mi></mrow></mfrac></mrow> :MATH]
   (4)
   [MATH: <mrow><mi mathvariant="normal">F</mi><mn>1</mn><mo
   linebreak="badbreak">−</mo><mi mathvariant="normal">score</mi><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><mn>2</mn><mo>*</mo><mi
   mathvariant="normal">Precision</mi><mo>*</mo><mi
   mathvariant="normal">Recall</mi></mrow><mrow><mi
   mathvariant="normal">Precision</mi><mo>+</mo><mi
   mathvariant="normal">Recall</mi></mrow></mfrac></mrow> :MATH]
   (5)

   Emphasizing the primacy of training BRLM from TCGA, we eventually
   incorporated CCM variant vectors for testing.

3.8. Visualization of Classification Results

   The basic plots in this paper were created in R, using the ggplot2
   package’s built-in functions. This package was used for basic graphing
   methods, such as barplots, dotplots, roseplots, and Sankey diagrams.
   Furthermore, different types of graphs were developed employing certain
   packages, which will be introduced in later sections.

3.9. UMAP construction for variants

   Utilizing BRLM input vectors extracted from BioBERT, it was observed
   that the variants were present in a 728-dimensional space, which makes
   retaining the global structural information derived from ResNet50
   classification results. To effectively map the entire set of variants
   onto a two-dimensional space, the Uniform Manifold Approximation and
   Projection (UMAP) dimensionality reduction technique was employed.

   Variants embedding vectors from TCGA or CCM datasets can be utilized as
   input entities for the “umap” package in R4.1. All UMAP graphs
   presented in this paper were generated using 5 components representing
   five distinct classes and 20 neighbors, while considering multiple
   categorical labels including class, SIFT score, Clinvar significance,
   and density.

3.10. Variety of genetic mutations statistics

   The statistical charts for SNVs were generated using the R4.1 package
   known as “maftools”, which is designed to process MAF files. The
   “bcftools” was used to convert ANNOVAR multi-anno files from VCF to
   MAF. Finally, “plotmafSummary” function was employed for charts
   generating.

3.11. SNVs chromosome distribution statistics

   Two types of chromosome distribution charts were utilized: the first
   employed a chromosome distribution strip map using the CMplot package
   in R4.2, while the second utilized a circos plot.

   The circos chromosome plot was generated with the RCircos package for R
   version 4.2 and its chromosome track was constructed based on the UCSC
   HG19 Human CytoBandIdeogram data, which was imported along with the
   corresponding gene set. Additional tracks were pre-built, each with
   their own subfunction calls. The middle two frequency statistics
   channels presented the 1000 Genomes and genomAD records for each
   mutation site and converted them into input invocation matrices
   beforehand. Internal lines were aligned with PPI pairs from the STRING
   database.

3.12. Genes enrichment

   The ClusterProfiler package was primarily utilized for enriching mutant
   genes in the R4.2 software package with the “dotplot” function of
   [188]Fig. 3C visualizing the top 10 pathways, and the “treeplot”
   function was called to produce tree diagram in [189]Fig. 4A. In order
   to summarize entire significant enriched results, the
   simplifyEnrichment package was used to cluster similarity matrices
   calculated by “term_similarity” function with a new method named
   “binary cut” [190][16]. With the similarity matrix, we can directly
   apply “simplifyEnrichment” function to perform partition around medoids
   (PAM) with two groups on the similarity matrix in each iteration step.
   The similarity clustering algorithm was applied to cluster pathways, as
   depicted in the heatmap of [191]Fig. 3A.

   Similarly, the K-means clustering algorithm utilized the
   “emapplot_cluster” function from the ClusterProfiler package to create
   the network displayed in [192]Fig. 3B.

3.13. Genes perturbation algorithm

   In order to measure the pathway mutations perturbation levels, the
   cumulative effect of mutated genes were quantified by PMAP scores using
   an R package known as PMAPscore
   ([193]https://cran.r-project.org/web/packages/PMAPscore/vignettes/PMAPs
   core.html). The package’s “get_mut_status” function has been modified
   to accommodate variants in Class 1, 2 and 3. Subsequently, the
   cumulative effect of genetic mutations on pathways is utilized to
   determine the PMAP scores. This score employs a standard cumulative
   perturbation measurement to capture the positioning and impact of
   genetic mutations on pathways. The formulas of perturbation scores for
   genes and their cumulative effect in pathway are as follows.

   Gene’s Perturbation score:
   [MATH: <mrow><mtext>GMPscore</mtext><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub></mrow></mfenced></mrow><mo
   linebreak="goodbreak">=</mo><msub><mrow><mn>1</mn></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub></mrow></mfenced></mrow><mo
   linebreak="badbreak">+</mo><msubsup><mo>∑</mo><mrow><mi
   mathvariant="normal">j</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi
   mathvariant="normal">N</mi></mrow></msubsup><mrow><msub><mrow><mi
   mathvariant="normal">β</mi></mrow><mrow><mi
   mathvariant="normal">ij</mi></mrow></msub><mfrac><mrow><mtext>GMPscore<
   /mtext><mrow><mfenced open="(" close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub></mrow></mfenced></mrow></mrow
   ><mrow><msub><mrow><mi mathvariant="normal">N</mi></mrow><mrow><mi
   mathvariant="normal">ds</mi></mrow></msub><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub></mrow></mfenced></mrow></mrow
   ></mfrac></mrow></mrow> :MATH]
   (6)

   where
   [MATH: <mrow><mtext>GMPscore</mtext><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub></mrow></mfenced></mrow></mrow
   > :MATH]
   denotes the perturbation score of mutated gene
   [MATH: <msub><mrow><mi mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub> :MATH]
   ,
   [MATH: <mrow><msub><mrow><mn>1</mn></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub></mrow></mfenced></mrow></mrow
   > :MATH]
   is an indicator function with
   [MATH: <mrow><msub><mrow><mn>1</mn></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub></mrow></mfenced></mrow><mo
   linebreak="goodbreak">=</mo><mn>1</mn></mrow> :MATH]
   if
   [MATH: <msub><mrow><mi mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub> :MATH]
   belongs to class i, otherwise 0,
   [MATH: <msub><mrow><mi mathvariant="normal">β</mi></mrow><mrow><mi
   mathvariant="normal">ij</mi></mrow></msub> :MATH]
   is the relationship between genes
   [MATH: <msub><mrow><mi mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub> :MATH]
   and
   [MATH: <msub><mrow><mi mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub> :MATH]
   (if
   [MATH: <msub><mrow><mi mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub> :MATH]
   is directly interacted with
   [MATH: <msub><mrow><mi mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub> :MATH]
   ,
   [MATH: <mrow><msub><mrow><mi
   mathvariant="normal">β</mi></mrow><mrow><mi
   mathvariant="normal">ij</mi></mrow></msub><mo
   linebreak="goodbreak">=</mo><mn>1</mn></mrow> :MATH]
   , else is 0), N is the total number of genes in pathway
   [MATH: <msub><mrow><mi>p</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   ,
   [MATH: <mrow><msub><mrow><mi
   mathvariant="normal">N</mi></mrow><mrow><mi
   mathvariant="normal">ds</mi></mrow></msub><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub></mrow></mfenced></mrow></mrow
   > :MATH]
   is the number of genes at the downstream of gene
   [MATH: <msub><mrow><mi mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">j</mi></mrow></msub> :MATH]
   .

   Pathway’s Perturbation score:
   [MATH: <mrow><mtext>PMAPscore</mtext><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi
   mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub></mrow></mfenced></mrow><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><msubsup><mo>∑</mo><mrow><mi
   mathvariant="normal">k</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi
   mathvariant="normal">N</mi></mrow></msubsup><mrow><mtext>GMPscore</mtex
   t><mrow><mfenced open="(" close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">k</mi></mrow></msub></mrow></mfenced></mrow></mrow
   ></mrow><mrow><msub><mrow><mi
   mathvariant="normal">N</mi></mrow><mrow><mi
   mathvariant="normal">dc</mi></mrow></msub></mrow></mfrac></mrow> :MATH]
   (7)

   where
   [MATH: <mrow><mtext>PMAPscore</mtext><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi>p</mi></mrow><mrow><mi>i</mi></mrow></m
   sub></mrow></mfenced></mrow></mrow> :MATH]
   denotes the perturbation score of pathway
   [MATH: <msub><mrow><mi mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub> :MATH]
   , N is the total number of genes in pathway
   [MATH: <msub><mrow><mi mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub> :MATH]
   ,
   [MATH: <mrow><mtext>GMPscore</mtext><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi
   mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">k</mi></mrow></msub></mrow></mfenced></mrow></mrow
   > :MATH]
   denotes the perturbation score of gene
   [MATH: <msub><mrow><mi mathvariant="normal">g</mi></mrow><mrow><mi
   mathvariant="normal">k</mi></mrow></msub> :MATH]
   in the pathway
   [MATH: <msub><mrow><mi mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub> :MATH]
   ,
   [MATH: <msub><mrow><mi mathvariant="normal">N</mi></mrow><mrow><mi
   mathvariant="normal">dc</mi></mrow></msub> :MATH]
   denotes the number of genes that have mutated in the pathway
   [MATH: <msub><mrow><mi mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub> :MATH]
   .

3.14. Public scRNA-seq data analysis

   The public scRNA-seq raw data were obtained from the Gene Expression
   Omnibus database with ID [194]GSE155788
   ([195]https://www.ncbi.nlm.nih.gov/geo/). The Data analysis relied on
   Seurat (version 4.0.5), the filtering criteria was nFeature_RNA < 300
   or > 12000 with > 15% expression of mitochondrial genes. After
   completing quality control, the next step was normalization. This was
   accomplished by using the “LogNormalize” method with default scale
   factor and log-transformation. Subsequently, the “FindVariableFeatures”
   algorithm calculated a subset of features that yielded significant
   cell-to-cell variation in the dataset. This function returned 2000
   default features per dataset, which will be used in downstream
   analysis.

   Prior to the dimensionality reduction, a linear transformation (called
   scaling) was applied with the “ScaleData” function. Next, the “RunPCA”
   function was invoked for linear dimensionality reduction on the scaled
   data. In order to determine the dimensionality, the JackStraw procedure
   was executed for principal components selection. As a result, the
   remaining cells were clustered together with the npcs of 30 and a
   resolution of 0.3. The final clusters were annotated based on the
   markers shown in [196][36].

4. Discussion

   We have shown that BRLM, constructed from BioBERT encoder and ResNet
   classifier, can serve as a SNVs annotation learning model to assist
   variants classification and interpretation. BRLM was designed with the
   aim of conducting mutation sites analysis associated with various
   diseases. The accuracy of BRLM was verified on twelve cancer types in
   TCGA, whose progressive accuracy in each epoch was observed across all
   cancer datasets. The generalizability was validated on CCM sequencing
   analysis, with SIFT scoring and Clinvar annotation validated on the
   classification results. Following the perturbation scoring algorithm to
   quantify the multi-effects of mutated genes in pathways, an upstream
   master regulator gene FGF1 was found supported by the astrocyte cluster
   of scRNAseq and differential expression of RNAseq. Our results
   demonstrate the feasibility of BRLM in classifying SNVs and contribute
   to the discovery of major pathogenic factors.

   BRLM innovatively employed the NLP algorithm in WGS pathogenic
   information mining, making the image classifier applicable to variants
   classification. The application of NLP models in single-cell sequencing
   datasets has gained popularity, but this technique is still limited in
   clinical laboratories. Some other studies have focused on molecular
   information mining from published biological texts. It should be NLP’s
   first use in genome sequencing to solve clinical problems. As for this
   CCM family who visited for procreation guidance, we had to categorize
   all SNVs with no reasonable pathogenicity found in known CCM-caused
   genes, and further employed the perturbation scoring algorithms to
   quantify the accumulation effects of multi-class mutations in pathways.
   Finally, the pathogenic mechanisms in CCM were elucidated, showing that
   a major gene FGF1 could contribute to CCM development alongside the
   minor genes’ effects. This suggests that the applicability of this
   model can be widely used in clinical genetic counseling.

   FGF1 can be detected in recent sequencing results owing to the broad
   usage of bulk sequencing, other than previous CCM omics studies that
   have only probed three known genes. For instance, the direct PCR
   implementation revealed five variants in the CCM3/SERPINI1 asymmetric
   bidirectional promoter [197][41], and the analysis of coding exons
   identified a novel missense mutation, c .422 T > G, in CCM3 [198][42].
   With the widespread adoption of next-generation sequencing (NGS), an
   expanded set of genes can now be examined. When the mini-bulk RNA
   sequencing data of MAP3K3 mutant individuals were compared with those
   of MAP3K3 WT individuals during fCCM3 lesion formation, FGF1 was found
   to be another up-regulated gene, indicating the activation of ERK1 and
   ERK2 cascades [199][56]. Additionally, the downstream of FGF1, namely
   TGFBR2 and ACTG2, were found to be consistent with our perturbed
   results, indicating shared pathways such as Hippo and MAPK signaling
   [200][43]. Subsequently, Fgf1 expression decreased after GJA1–20k
   altered the endothelial cell transcriptome with hypermethylation of its
   gene body in animal experiments [201][49].

   The FGF family, comprising six subfamilies (FGF1, FGF4, FGF7, FGF8,
   FGF9, and FGF19) [202][35], plays a critical role in embryonic
   development and organogenesis by maintaining progenitor cells and
   promoting their growth, differentiation, survival, and patterning
   [203][21]. The FGF1 subfamily, inclusive of FGF1 and FGF2, serves as
   potent angiogenic inducers that control multiple growth factor
   signaling. The FGF1 subfamily regulates vessel formation [204][34],
   promotes strong angiogenic responses [205][27], and induces vessel
   maturation [206][14]. Notably, it has been implicated as a potential
   instigator of aberrant angiogenesis [207][18], which could be linked to
   the pathogenesis of arteriovenous malformations [208][19] and other
   cerebral vascular anomalies [209][45]. As the only gene in the FGF
   family undergoing clinical trials [210][4], FGF1 shows promise for
   stimulating blood vessel growth in the brain and addressing various
   brain-related diseases, including intracranial aneurysm [211][60],
   Alzheimer’s disease [212][52], brain tumors [213][3], brain injuries
   [214][4], and ischemic stroke [215][63].

   According to the multi-omics data of FGF1 in CCM and its relative
   biological functions, we speculated that the BRLM analyzed results of
   FGF1 was an upstream regulatory gene with clinical implications in CCM.
   The CCM-related elements presentation was a landscape of perturbed
   pathways with FGF1 positioned upstream and interacting with downstream
   mutated genes, including a known CCM pathogenic gene, KRIT1.

   Moreover, BRLM is just constructed from genomic data, an integrating
   model can be anticipated with multimodality data (such as clinical
   records, images, vital signs monitoring, etc.), which may be expected
   to provide more clarity in understanding pathogenicity mechanisms and
   elucidating functional genes. Our next efforts will be focused on
   refining information consolidation and multimodal learning
   exploitation. Alternatively, directions of future work on BRLM will
   encompass three models. Firstly, the initial text-coding module will be
   upgraded into a time-dependent version so that the progression of
   diseases can be accommodated [216][1]. Secondly, an image encoding
   module will be added to cope with visual data [217][37]. Finally, a
   transformer is conceived so that various modalities, such as texts and
   images, can be handled simultaneously [218][2]. Our intention is to
   construct a medical knowledge graph illustrating the topological
   relationships among clinical symptoms, laboratory indicators,
   biological markers, related diseases, and targeted drugs [219][57]. The
   prototype functions have been integrated into the public interface,
   incorporating targeted drug retrieval and ceRNA network construction.
   Further development of these functionalities will be presented in our
   following studies.

   In conclusion, this study offers a novel insight for pathogenic factors
   exploration through biomedical language.

   analysis, potentially assisting manual retrieval methods and making up
   complicated biological experiments.

5. Conclusions

   In this study, a biomedical language learning model called BRLM was
   developed to classify and link SNVs, with the goal of assisting in the
   labor-intensive tasks of interpretation and integration. The pipeline
   was utilized to classify variants into five categories, as defined by
   ACMG guidelines with multi-class interaction network construction.
   Comprehensive databases were compiled containing annotations that
   describe variants in biomedical natural language. To facilitate this,
   we employed the BioBERT, which primarily focuses on entity recognition
   during embedding. Then the encoded vectors were used to train a
   convolutional neural network. This model was trained on 12 TCGA
   datasets and tested on a familial CCM dataset. From the classified
   results, we performed pathway variants accumulative perturbation
   analysis, and found a master regulatory gene, FGF1, that could be
   highly related to CCM. The core contribution of this study resides in
   the integration of a large language model into a variant classifier,
   where the former is able to capture the essential information from the
   massive input data while the latter guarantees better usage of the
   acquired information. The effectiveness of this protocol has been
   verified by multi-omics sequencing analysis. Moreover, a web server has
   been realized to facilitate the broad usage of the proposed model. This
   study highlights the potential of our approach to comprehend the
   complex interplay of genetic variants within biological terms.

Institutional Review

   The study was approved by the Institutional Review Board (or Ethics
   Committee) of Taihe Hospital, Shiyan, Hubei, China (protocol code
   2023KS33, September 15th, 2023).

Funding

   This research was funded by the National Natural Science Foundation of
   China (32070973 and 32060150).

CRediT authorship contribution statement

   Yiqi Wang: Conceptualization, Methodology, Software. Jinmei Zuo: Data
   collection and sampling. Chao Duan: Data curation. Hao Peng: Data
   collection and analysis. Jia Huang: Data collection and sampling. Liang
   Zhao: Conceiving and designing, reviewing and editing. Li Zhang:
   Original draft preparation. Zhiqiang Dong: Visualization,
   Investigation.

   All authors have read and agreed to the submitted version of the
   manuscript. All authors have read and agreed to the writing of the
   manuscript.

Declaration of Competing Interest

   The authors declare that they have no known competing financial
   interests or personal relationships that could have appeared to
   influence the work reported in this paper.

Footnotes

   ^Appendix A

   Supplementary data associated with this article can be found in the
   online version at [220]doi:10.1016/j.csbj.2024.01.014.

Contributor Information

   Liang Zhao, Email: s080011@e.ntu.edu.sg.

   Li Zhang, Email: zhanglith@163.com.

   Zhiqiang Dong, Email: dongz@mail.hzau.edu.cn.

Appendix A. Supplementary material

   Supplementary material
   [221]mmc1.pdf^ (20MB, pdf)

   .

   Supplementary material
   [222]mmc2.csv^ (5.8KB, csv)

   .

   Supplementary material
   [223]mmc3.xlsx^ (13.1KB, xlsx)

   .

References