Abstract
Accurate pathogenicity prediction of missense variants is critically
important in genetic studies and clinical diagnosis. Previously
published prediction methods have facilitated the interpretation of
missense variants but have limited performance. Here, we describe MVP
(Missense Variant Pathogenicity prediction), a new prediction method
that uses a deep residual network to leverage large training data sets
and many correlated predictors. We train the model separately in genes
that are intolerant of loss of function variants and genes that are
tolerant, to account for potentially different genetic effect sizes and
modes of action. We compile cancer mutation hotspots and
de novo variants from developmental disorders for benchmarking.
Overall, MVP achieves better performance in prioritizing pathogenic
missense variants than previous methods, especially in genes tolerant
of loss of function variants. Finally, using MVP, we estimate that de
novo coding variants contribute to 7.8% of isolated congenital heart
disease, nearly doubling previous estimates.
Subject terms: Machine learning, Sequence annotation, Mutation
__________________________________________________________________
Accurate prediction of variant pathogenicity is essential to
understanding genetic risks in disease. Here, the authors present a
deep neural network method for prediction of missense variant
pathogenicity, MVP, and demonstrate its utility in prioritizing de novo
variants contributing to developmental disorders.
Introduction
Missense variants are the most common type of coding genetic variants
and are a major class of genetic risk across a broad range of common
and rare diseases. Previous studies have estimated that there is a
substantial contribution from de novo missense mutations to structural
birth defects^[38]1–[39]3 and neurodevelopmental disorders^[40]4–[41]6.
However, only a small fraction of de novo missense mutations are
pathogenic^[42]4 and will cause disease. As a result, the statistical
power of detecting individual risk genes based on missense variants or
mutations is limited^[43]7. In clinical genetic testing, many
missense variants in well-established risk genes are classified as
variants of uncertain significance, unless they are highly recurrent in
patients. Previously published in silico prediction methods have
facilitated the interpretation of missense variants, such as
CADD^[44]8, VEST3^[45]9, MetaSVM^[46]10, M-CAP^[47]11, REVEL^[48]12,
PrimateAI^[49]13, and UNEECON^[50]14. However, based on recent de novo
mutation data, they all have limited performance with low positive
predictive value (Supplementary Data [51]1), especially in
non-constrained genes (defined as ExAC^[52]15 pLI < 0.5).
In this work, we hypothesize that missense variant pathogenicity
prediction can be improved in a few dimensions. First, conventional
machine learning approaches have limited capacity to leverage large
amounts of training data compared to recently developed deep learning
methods^[53]16. Second, databases of pathogenic variants curated from
the literature are known to have a substantial frequency of false
positives^[54]17, which are likely caused by common issues across
databases and therefore introduce inflation of benchmark performance.
Developing new benchmark data and methods can help to assess and
improve real performance. Finally, previous methods do not consider
gene dosage sensitivity^[55]15,[56]18, which can modulate the
pathogenicity of deleterious missense variants. Generally, missense
variants can be classified into three categories based on their impact
on protein function (“mode of action”): (1) gain of function (also
called hypermorph or neomorph); (2) dominant negative (also called
antimorph); (3) hypomorph (partial or complete loss of function, also
called amorph)^[57]19,[58]20. A hypomorphic variant with heterozygous
genotype can only be pathogenic in dosage sensitive genes^[59]6,
whereas the pathogenicity of a gain of function or dominant negative
variant with heterozygous genotype is not limited by gene dosage
sensitivity. With recently published metrics of mutation intolerance,
which is strongly associated with dosage sensitivity^[60]15, it is now
feasible to consider gene dosage sensitivity in predicting
pathogenicity. Based on these ideas, we developed a new method, MVP, to
improve missense variant pathogenicity prediction. We evaluate MVP on
the data of cancer mutation hotspots and de novo variants from
developmental disorders. Overall, MVP achieves better performance in
prioritizing pathogenic missense variants than previous methods,
especially in genes tolerant of loss of function variants.
Results
Model structure and predictors
MVP uses many correlated predictors, broadly grouped into two
categories (Supplementary Data [61]2): (a) “raw” features computed at
different scales, per base pair (e.g., amino acid constraint score and
conservation), per local context (e.g., protein structure and
modification), as well as per gene (e.g., gene mutation intolerance,
sub-genic regional depletion of missense variants^[62]21); (b)
deleteriousness scores from selected previous methods. To account for
the impact of gene dosage sensitivity on pathogenicity of heterozygous
missense variants, we trained our models in constrained genes (defined
as ExAC^[63]15 probability of being loss-of-function (LoF) intolerant
(pLI) ≥ 0.5) and non-constrained genes (ExAC pLI < 0.5) separately. We
included 38 features for the constrained gene model, and 21 features
for the non-constrained gene model. We excluded features derived from
most published prediction methods because of their limited accuracy
(Supplementary Data [64]1, [65]2). We also excluded most conservation
features from the non-constrained gene model, as pathogenic
variants in non-constrained genes are less conserved (Supplementary
Fig. [66]1).
MVP uses a deep residual neural network (ResNet)^[67]22 model. ResNet
was designed for computer vision^[68]22 and was successfully applied in
structural biology^[69]23 and genomics^[70]24. The convolutional layers
in ResNet are capable of extracting hierarchical features or nonlinear
spatial local patterns from images or sequence data. To take advantage
of this, we ordered the predictors by their correlation, so that
highly correlated predictors cluster together (Supplementary
Fig. [71]2). There are two layers of residual blocks, consisting of
convolutional filters and activation layers, and two fully connected
layers with sigmoid output (Supplementary Fig. [72]3). For each
missense variant, we defined MVP score by the rank percentile of the
ResNet’s raw sigmoid output relative to all 76 million possible
missense variants.
In this work, we focus on rare variants with large effects. We used
a minor allele frequency (MAF) threshold of 10^−4 (based on
gnomAD^[73]25) to filter variants in both training and testing
data sets.
Model training
We obtained large curated data sets of pathogenic variants from
HGMD^[74]26 and UniProt^[75]10,[76]27 as positives and random rare
missense variants from population data as negatives for training
(Supplementary Data [77]3). The training process takes around 10 min on
a 1.6 GHz GPU with 2560 cores and 8 GB of memory. Using 6-fold
cross-validation on the training set (Supplementary Fig. [78]4), MVP
achieved mean area under the curve (AUC) of 0.99 in constrained genes
and 0.97 in non-constrained genes.
Performance evaluation using curated mutation data sets
To evaluate the predictive performance of MVP and compare it with other
methods, we obtained an independent curated testing data set from
VariBench^[79]10,[80]28 (Supplementary Fig. [81]5). MVP outperformed
all other methods with an AUC of 0.96 and 0.92 in constrained and
non-constrained genes, respectively. A few recently published methods
(REVEL, M-CAP, VEST3, and MetaSVM) were among the second-best
predictors and achieved AUC around 0.9.
Similar to the HGMD and UniProt data used in training, VariBench data
are curated from the literature. False positives arising from similar
curation issues across the training and VariBench data sets could
inflate performance in testing. To address this issue, we compiled cancer
somatic mutation data for further evaluation, including missense
mutations located in inferred hotspots based on statistical evidence
from a recent study^[82]29 as positives, and randomly selected variants
from the DiscovEHR^[83]30 database as negatives. In this data set, the
performance of all methods decreased, but MVP still achieved the best
performance with AUC of 0.91 (maximum p = 3.2e−8) and 0.85 (maximum
p = 6.6e−4) in constrained and non-constrained genes, respectively
(Fig. [84]1). We observed that methods using HGMD or UniProt in
training generally have a greater performance drop than others
(Supplementary Data [85]1, Supplementary Fig. [86]6, Supplementary
Note [87]1).
Fig. 1. Comparison of receiver operating characteristic (ROC) curves of
existing prediction scores and MVP scores using cancer somatic mutation
hotspot data.
a Constrained genes (ExAC pLI ≥ 0.5): evaluation of 698 cancer
mutations located in hotspots from 150 genes, and 6989 randomly
selected mutations from DiscovEHR database excluding mutations used in
training. b Non-constrained genes (ExAC pLI < 0.5): evaluation of 177
cancer mutations located in hotspots from 55 genes and 1782 randomly
selected mutations from DiscovEHR database excluding mutations used in
training. The performance of each method is evaluated by the ROC curve,
with the area under the curve (AUC) indicated in parentheses. A higher
AUC indicates better performance.
To investigate the contribution of features to MVP predictions, we
performed cross-one-group-out experiments and used the differences in
AUC as an estimation of feature contribution (Fig. [90]2). We found
that in constrained genes, conservation scores and published
deleteriousness predictors have relatively large contribution, whereas
in non-constrained genes, protein structure and modification features
and published predictors are the most important.
Fig. 2. Measuring the contribution of features to MVP prediction performance
in cancer mutation hotspots data.
Performance contribution is measured by AUC reduction (ΔAUC) from
excluding a group of features. Since features within a group are often
highly correlated, we measured the contribution of an entire group
rather than of individual features in the group. a Constrained genes.
b Non-constrained genes. We subsampled negatives from the DiscovEHR
database and calculated ΔAUC 15 times to estimate the error bars. The
box bounds the interquartile range (IQR) and is divided by the median;
Tukey-style whiskers extend to a maximum of 1.5× IQR beyond the box.
Evaluating pathogenic de novo missense variants in CHD and ASD
To test the utility in real genetic studies, we obtained germline de
novo missense variants (DNMs) from 2645 cases in a congenital heart
disease (CHD) study^[93]2 (Supplementary Data [94]4), 3953 cases in
autism spectrum disorder (ASD) studies^[95]2,[96]4,[97]5 (Supplementary
Data [98]5), and DNMs from 1911 controls (unaffected siblings) in
Simons Simplex Collection^[99]2,[100]4,[101]5 (Supplementary
Data [102]6). Since genes with cancer mutation hotspots are relatively
well studied in both constrained and non-constrained gene sets,
assessment using de novo mutations can provide additional insight with
less bias (Supplementary Data [103]7). Because the true pathogenicity
of most of the de novo mutations is unknown, we cannot directly
evaluate the performance of prediction methods. To address this issue,
we first compared the distribution of predicted scores of DNMs in cases
with those in controls (Fig. [104]3). Using the Mann–Whitney U test,
MVP-predicted scores of variants in cases and controls are
significantly different (p = 1e−5 and 2.7e−4 for CHD vs. controls and
ASD vs. controls, respectively), and the difference is greater than
predictions from other methods.
Fig. 3. Distribution of predicted scores of de novo missense variants by MVP
and other methods.
For each method, we normalized all predictions by rank percentile, and
used two-sided Mann–Whitney U test to assess the statistical
significance of the difference between cases and controls. CHD:
congenital heart disease; ASD: autism spectrum disorder; controls:
unaffected siblings from the ASD study. Number of de novo missense
variants compared: CHD: 1486; ASD: 2050; controls: 838.
We then calculated the enrichment of DNMs predicted to be pathogenic by
each method at a given threshold in cases relative to controls, and
from this estimated precision and the number of true risk variants
(Methods); the latter is a proxy for recall, since the total number of
true positives across all cases is an (unknown) constant independent of
method. We compared the performance of MVP to other methods by
estimated precision and recall-proxy (Fig. [107]4). Based on the
optimal thresholds of MVP in cancer hotspot ROC curves, we used a score
of 0.7 in constrained genes and 0.75 in non-constrained genes to define
pathogenic DNMs (Supplementary Fig. [108]7). In constrained genes, we
observed an enrichment of 2.2 in CHD and an enrichment of 1.9 in ASD
(Supplementary Data [109]8, [110]9), achieving estimated precision of
0.55 and 0.47 (Fig. [111]4a and [112]4d), respectively. This indicates
that about 50% of the MVP-predicted pathogenic DNMs contribute to the
diseases. In non-constrained genes, we observed enrichments of 1.5 in
CHD and 1.4 in ASD (Supplementary Data [113]8, [114]9), with estimated
precisions of 0.33 and 0.29, respectively (Fig. [115]4b and [116]4e). In
all genes combined, MVP achieved an estimated precision of 40% for CHD
and 34% for ASD (Fig. [117]4c and [118]4f). The next best methods
reached 26% (M-CAP^[119]11) and 21% (PrimateAI^[120]13) given the same
recall-proxy for CHD and ASD, respectively (Supplementary Data [121]8,
[122]9). Furthermore, the estimated precision of MVP with DNMs at
optimal threshold is much closer to the expected precision based on ROC
of cancer hotspots data than the value from VariBench data
(Supplementary Fig. [123]8 and Supplementary Note [124]1), supporting
that there is less performance inflation in testing using cancer data.
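The precision estimate follows from the case-control enrichment: if predicted-pathogenic DNMs are enriched R-fold in cases relative to the per-subject rate in controls, the excess fraction 1 − 1/R is attributed to true risk variants. The sketch below illustrates that arithmetic under this simplifying assumption (the exact procedure is in Methods); the counts are hypothetical, chosen only to reproduce the 2.2-fold enrichment reported above for constrained genes in CHD.

```python
# Hedged sketch: estimate precision from case/control enrichment of
# predicted-pathogenic de novo variants. Assumes the excess over the
# control-derived expectation is entirely composed of risk variants.
def estimated_precision(n_case, subjects_case, n_control, subjects_control):
    """Enrichment R = per-subject rate in cases / per-subject rate in controls;
    estimated precision = 1 - 1/R."""
    rate_cases = n_case / subjects_case
    rate_controls = n_control / subjects_control
    enrichment = rate_cases / rate_controls
    return 1.0 - 1.0 / enrichment

# Illustrative counts giving a 2.2-fold enrichment (not the study's data):
p = estimated_precision(220, 1000, 100, 1000)
```

With a 2.2-fold enrichment this gives an estimated precision of about 0.55, consistent with the constrained-gene CHD result quoted above.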
Fig. 4. Comparison of MVP and previously published methods using de novo
missense mutations from CHD and ASD studies by precision-recall-proxy curves.
Numbers on each point indicate rank percentile thresholds, star points
indicate thresholds recommended by publications. The positions of “All
Mis” points are estimated from all missense variants in the gene set
without using any pathogenicity prediction method. Point size is
proportional to −log(p-value), calculated by a two-sided binomial
test; only points with p-value less than 0.05 are shown. a–c
Performance in CHD DNMs in constrained genes, non-constrained genes,
and all genes, respectively. d–f Performance in ASD DNMs in constrained
genes, non-constrained genes, and all genes, respectively.
MVP uses ResNet, which builds on convolutional neural networks but is
more efficient to train. To assess the importance of this choice, we
trained a Random Forest model, the core of methods such as
REVEL^[127]12, with the same training sets and features. We also
trained a fully connected neural network as a baseline for neural
network methods. ResNet slightly outperformed Random Forest and matched
the fully connected neural network on cancer somatic mutation data
(Supplementary Fig. [128]9), and substantially outperformed both on
de novo mutations in CHD and autism (Supplementary Figs. [129]10,
[130]11).
Previous studies have estimated that deleterious de novo coding
mutations, including loss of function variants and damaging missense
variants, have a small contribution to isolated CHD^[131]2. Here, we
used MVP to revisit this question. With the definition of damaging DNMs
in Jin et al. 2017^[132]2 (based on MetaSVM^[133]10), the estimated
contribution of deleterious de novo coding mutations to isolated CHD is
about 4.0%. With MVP score of 0.75, the estimation is 7.8% (95%
CI = [5.9%, 9.6%]), nearly doubling the previous estimate
(Supplementary Data [134]10, [135]11). We performed pathway enrichment
analysis of genes with MVP-predicted pathogenic de novo missense
mutations in CHD cases using Enrichr^[136]31. In isolated cases, the
genes with such variants are significantly enriched (FDR < 0.01) for
cardiac conduction (FDR = 0.006, odds ratio = 8.7) and muscle
contraction (FDR = 0.006, odds ratio = 6.8). In syndromic CHD cases who
have additional congenital anomalies or neurodevelopmental disorders,
the genes with such variants are significantly enriched in Notch, Robo,
or MAPK signaling pathways (Supplementary Fig. [137]12, Supplementary
Data [138]12), which have been implicated in other developmental
disorders.
Discussion
We developed a new method, MVP, to predict pathogenicity of missense
variants. MVP is based on residual neural networks, a supervised deep
learning approach, and was trained using a large number of curated
pathogenic variants from clinical databases, separately in constrained
genes and non-constrained genes. Using cancer mutation hotspots and de
novo mutations from CHD and ASD studies, we showed that MVP achieved
overall better performance than published methods, especially in
non-constrained genes.
Two factors may contribute to the improved prediction performance.
First, deep neural network models have a larger capacity to leverage
large training data sets than the conventional machine learning methods
used in previous publications. The residual blocks in
ResNet largely alleviate the problem of vanishing gradient when
increasing the number of layers, which further enables efficient
training on deeper networks^[139]22. Using the same features and
training data, we showed that ResNet achieved better performance than
two alternative approaches, Random Forest and a fully connected neural
network. Second, we trained the model separately
in constrained and non-constrained genes. This is motivated by the idea
that the mode of action of pathogenic variants can be different in
constrained and non-constrained genes. Constrained genes are likely to
be dosage sensitive, therefore, deleterious missense variants of any
mode of action with heterozygous genotypes can be pathogenic. In
contrast, non-constrained genes are unlikely to be dosage sensitive;
as a result, hypomorphic variants with heterozygous genotype are unlikely
to be pathogenic. Instead, gain of function or dominant negative
variants are the main type of pathogenic variants in these genes. By
training the model in these two groups of genes separately, we enhance
the ability of the model to use features that are informative for gain
of function or dominant negative variants in non-constrained genes. In addition,
we selected a smaller number of features in the model for
non-constrained genes. We excluded most of the conservation scores and
published predictors, based on the observation that known pathogenic
variants in non-constrained genes are less conserved across species
(Supplementary Fig. [140]1) and that published methods have poor
performance in prioritization of de novo variants in these genes
(Supplementary Data [141]1). We kept all the features related to
protein structure and post-translational modification. The change of
the model for non-constrained genes is supported by sensitivity
analysis using cancer mutation hotspot data (Fig. [142]2). When protein
structure related features are further excluded, the model has the
biggest drop in AUC in non-constrained genes. In contrast, excluding
conservation scores does not lead to reduced AUC.
In a recent genetic study of CHD^[143]1,[144]2, de novo mutations were
estimated to contribute to 10–30% of syndromic cases who had additional
congenital anomalies or neurodevelopmental disorders (NDD), but only
about 4% of isolated cases who did not have additional anomalies or
NDD. Isolated CHD is the most common form of the disease^[145]32. We
reanalyzed the de novo missense variants using MVP. We estimated that
predicted-pathogenic de novo mutations actually contribute to about
7.8% of isolated cases, nearly doubling the previous estimate. The revised
estimate suggests a greater utility of discovery or prioritization of
new candidate risk genes by de novo mutations in isolated CHD. Pathway
analysis shows predicted pathogenic missense variants in isolated CHD
cases are enriched in cardiac conduction and muscle contraction,
whereas the ones in syndromic CHD cases are enriched in Notch, Robo, or
MAPK signaling. The different organ-specificity of these pathways is
consistent with phenotypes. Nevertheless, the genetic cause of CHD is
very heterogeneous^[146]1,[147]2, similar to
autism^[148]4,[149]5,[150]33–[151]35. With a few thousand cases, very
few true risk genes have multiple observed deleterious de novo
mutations in the data. As a result, there is a substantial uncertainty
of the estimated contribution of de novo mutations. Future studies with
larger sample size will enable more precise estimation of the
contribution of de novo mutations and identification of new candidate
risk genes.
We note that MPC has better precision than MVP in constrained genes
with ASD data (Fig. [152]4d), while MVP has much better performance in
constrained genes with CHD data (Fig. [153]4a) and in non-constrained
genes with both ASD data (Fig. [154]4e) and CHD data (Fig. [155]4b).
Based on known risk genes for ASD^[156]36 and CHD^[157]2, autism risk
genes are under stronger negative selection against loss of function
variants (“s_het”^[158]37) than CHD risk genes. High sub-genic regional
constraint scores, a key metric used in MPC, are predominantly located
in genes under very strong negative selection. This could explain the
difference between autism and CHD mutations among constrained genes.
Supplementary Fig. [159]13 shows that variants in constrained genes
have significantly higher MPC scores than those in non-constrained
genes, for both pathogenic and benign variants in the training set. MPC
has limited performance in non-constrained genes: no MPC threshold
reaches statistical significance in non-constrained genes in ASD cases,
so MPC is not shown in the panels on non-constrained genes
(Fig. [160]4d and Fig. [161]4e). We also note that MPC achieves better
sensitivity at its published recommended threshold than MVP in all
genes with CHD data (Fig. [162]4c). There is a trade-off between
sensitivity and specificity: the recommended threshold of MPC achieves
slightly better sensitivity than the recommended threshold of MVP, but
the specificity of MVP is nearly double that of MPC. At the same
sensitivity, MVP generally outperformed other methods.
Pathogenic variants may have a range of effect size and penetrance. In
this work, we focused on rare variants with large effect size. Our
model is consistent with a common definition of pathogenicity^[163]38,
that is, pathogenic variants mechanistically contribute to disease, but
not necessarily with full penetrance. Specifically, we included
prediction features based on protein structure and protein
modification, which provide a biophysical and biochemical basis for
mechanistic contribution to disease, and evolutionary conservation,
which usually reflects direct contribution to disease. In
addition, the positives in model training were from HGMD and UniProt,
expert-curated databases with large numbers of likely pathogenic
rare variants reported in previous publications. We do not include
disease-associated common variants from genome-wide association studies
(GWAS) in training.
A limitation of MVP is the unknown but potentially high false positive
rate of the curated pathogenic variants used in training. The largest
databases, such as HGMD and VariBench, were curated from a broad range
of publications of uneven quality. As a result, independently curated
databases suffer from similar types of errors. This issue is
highlighted by the performance drop in testing on cancer somatic
mutation hotspot data compared to VariBench. Systematic efforts such
as ClinVar^[164]39 will eventually produce better and larger training
data to improve prediction performance.
Finally, we note that a single pathogenic score cannot capture the
complexity of the mechanisms of pathogenicity. Many genes may have
different modes of action in different diseases. One example is CTNNB1
(beta catenin). It is an oncogene in cancer through gain-of-function
mutations^[165]40, and a risk gene in structural birth defects and
neurodevelopmental disorders through loss of function germline
mutations^[166]41. Most prediction tools, including MVP, do not
distinguish pathogenic missense variants with gain of function from the
ones with loss of function. Explicitly predicting gain or loss of
function, as a recent study on channels did^[167]42, would improve
the utility of prediction methods.
Methods
Training data sets
We compiled 22,390 missense mutations from the Human Gene Mutation
Database Pro version 2013 (HGMD)^[168]26 under the disease mutation
(DM) category, 12,875 deleterious variants from
UniProt^[169]10,[170]27, and 4424 pathogenic variants from ClinVar
database^[171]39 as true positive (TP). In total, there are 32,074
unique positive training variants. The negative training sets include
5190 neutral variants from UniProt^[172]10,[173]27, 42,415 randomly
selected rare variants from the DiscovEHR database^[174]30, and 39,593
observed human-derived variants^[175]8. In total, there are 86,620
unique negative training variants (Supplementary Data [176]3).
Testing data sets
We have three categories of testing data sets (Supplementary
Data [177]3). The three categories are: (a) Benchmark data sets from
VariBench^[178]10,[179]28 as positives and randomly selected rare
variants from DiscovEHR database^[180]30 as negatives; (b) cancer
somatic missense mutations located in hotspots from a recent
study^[181]29 as positives and randomly selected rare
DiscovEHR database^[182]30 as negatives; (c) and de novo missense
mutation data sets from recent published exome-sequencing
studies^[183]2,[184]4,[185]5. All variants in (a) and (b) that overlap
with training data sets were excluded from testing.
We tested the performance in constrained genes (ExAC pLI ≥ 0.5) and
non-constrained genes (ExAC pLI < 0.5)^[186]15 separately.
To focus on rare variants with large effects, we selected rare variants
with MAF < 10^−4 based on the gnomAD database in both training and
testing data sets. We applied an additional filter of MAF < 10^−6 to
variants in constrained genes in both cases and controls, following a
recent study^[187]21,[188]43.
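The two-tier frequency filter above can be sketched with pandas; the column names ("maf", "pli") and the toy values are illustrative assumptions, not the actual pipeline schema.

```python
# Sketch of the frequency filter: MAF < 1e-4 everywhere, plus a
# stricter MAF < 1e-6 cutoff within constrained genes (pLI >= 0.5).
# Toy data; real pipelines would read annotated variant tables.
import pandas as pd

variants = pd.DataFrame({
    "maf": [5e-7, 5e-5, 5e-3, 5e-5],
    "pli": [0.9, 0.2, 0.9, 0.9],
})

GENERAL_MAF = 1e-4       # gnomAD-based threshold applied to all variants
CONSTRAINED_MAF = 1e-6   # stricter cutoff within constrained genes

rare = variants[variants["maf"] < GENERAL_MAF]
kept = rare[(rare["pli"] < 0.5) | (rare["maf"] < CONSTRAINED_MAF)]
```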
Features used in the MVP model
MVP uses many correlated features as predictors (Supplementary
Data [189]2). There are six categories: (1) local context: GC content
within 10 flanking bases on the reference genome; (2) amino acid
constraint, including blosum62^[190]44 and pam250^[191]45; (3)
conservation scores, including phyloP 20way mammalian and 100way
vertebrate^[192]46, GERP++^[193]47, SiPhy 29way^[194]48, and phastCons
20way mammalian and 100way vertebrate^[195]49; (4) protein structure,
interaction, and modifications, including predicted secondary
structures^[196]50, number of protein interactions from the BioPlex 2.0
Network^[197]51, whether the protein is involved in complex formation
from the CORUM database^[198]52, number of high-confidence interacting
proteins by PrePPI^[199]53, probability of a residue being located in
the interaction interface by PrePPI (based on PPISP, PINUP, PredU),
predicted accessible surface area from dbPTM^[200]54, SUMO scores in a
7-amino-acid neighborhood by GPS-SUMO^[201]55, phosphorylation site
predictions within a 7-amino-acid neighborhood by GPS3.0^[202]56, and
ubiquitination scores within a 14-amino-acid neighborhood by
UbiProber^[203]57; (5) gene mutation intolerance, including ExAC
metrics^[204]15 (pLI, pRec, lof_z) designed to measure gene dosage
sensitivity or haploinsufficiency, RVIS^[205]58, probability of causing
disease under a dominant model (“domino”)^[206]59, average selection
coefficient of loss of function variants in a gene (“s_het”)^[207]37,
and sub-genic regional depletion of missense variants^[208]21; (6)
selected deleteriousness or pathogenicity scores from previously
published methods obtained through dbNSFPv3.3a^[209]60, including
Eigen^[210]61, VEST3^[211]9, MutationTaster^[212]62,
PolyPhen2^[213]63, SIFT^[214]64, PROVEAN^[215]65,
fathmm-MKL^[216]66, FATHMM^[217]66, MutationAssessor^[218]67, and
LRT^[219]68.
For consistency, we used canonical transcripts to define all possible
missense variants^[220]21. Missing values of protein complex scores are
filled with 0 and other features are filled with −1.
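The missing-value convention above can be sketched with pandas; the feature names are illustrative, not the actual column schema.

```python
# Sketch: protein complex scores default to 0 when missing;
# all other features default to -1. Toy feature names/data.
import numpy as np
import pandas as pd

features = pd.DataFrame({
    "corum_complex": [1.0, np.nan, 0.0],   # protein complex score -> fill 0
    "phyloP": [2.3, np.nan, -1.1],         # any other feature -> fill -1
})

complex_cols = ["corum_complex"]
features[complex_cols] = features[complex_cols].fillna(0)
other_cols = [c for c in features.columns if c not in complex_cols]
features[other_cols] = features[other_cols].fillna(-1)
```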
Since pathogenic variants in constrained and non-constrained genes may
have different modes of action, we trained our models on constrained
and non-constrained variants separately with different sets of features
(38 features in the constrained model, 21 features in the
non-constrained model, Supplementary Data [221]2).
Deep learning model
MVP is based on a deep residual neural network model (ResNet)^[222]22
for predicting pathogenicity using the predictors described above. To
preserve the structured features in training data, we ordered the
features according to their correlations (Supplementary Fig. [223]3).
The model (Supplementary Fig. [224]2) takes a vector of the ordered
features as input, followed by a convolutional layer of 32 kernels with
size 3 × 1 and stride of 1, then followed by 2 computational residual
units, each consisting of 2 convolutional layers of 32 kernels with
size 3 × 1 and stride of 1 and a ReLU^[225]69 activation layer in
between. The output and input of each residual unit are summed and
passed on to a ReLU activation layer. After the two
convolutional layers with residual connections, 2 fully connected
layers of 320 × 512 and 512 × 1 are used followed by a sigmoid function
to generate the final output^[226]70.
Sigmoid(x) = 1 / (1 + e^−x)
(Supplementary Fig. [227]2). In training, we randomly partitioned the
training data into two parts, 80% for training and 20% for validation.
We trained the model with a batch size of 64, using the Adam^[228]71
optimizer to perform stochastic gradient descent^[229]72 with
cross-entropy loss between the predicted and true values. After each
full training cycle on the training set, we applied the latest model
weights to the validation data to compute the validation loss.
To avoid overfitting, we used early-stopping regularization during
training. We computed the loss on training and validation data after
each training cycle and stopped the process when the validation loss
was comparable to the training loss and did not decrease after 5 more
training cycles; we then set the model weights to the set with the
lowest validation loss. We applied the same model weights to the
testing data to obtain MVP scores for further analysis.
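A hedged PyTorch sketch of the architecture described above: an initial 3 × 1 convolution with 32 filters, two residual units of two convolutions each with a ReLU in between plus an identity skip connection, and two fully connected layers ending in a sigmoid. The padding scheme and flattened dimension are assumptions (the paper reports 320 × 512 and 512 × 1 dense layers, which depend on pooling details not fully specified here), so this is a structural illustration rather than the exact published model.

```python
# Sketch of a 1-D ResNet over an ordered feature vector (assumptions noted above).
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)          # skip connection, then ReLU

class MVPNet(nn.Module):
    def __init__(self, n_features=38):     # 38 features in the constrained model
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, stride=1, padding=1),
            ResidualUnit(32),
            ResidualUnit(32),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * n_features, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
            nn.Sigmoid(),                  # raw pathogenicity probability
        )

    def forward(self, x):                  # x: (batch, n_features)
        return self.head(self.body(x.unsqueeze(1)))

model = MVPNet(n_features=38)
out = model(torch.randn(4, 38))
```

Training would then pair this model with the Adam optimizer, binary cross-entropy loss, and the early-stopping rule described above.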
Hyperparameters in MVP
In the MVP neural network, we tested different numbers of residual
blocks for the model structure. With all other parameters fixed, the
model with two residual blocks contains 12,544 parameters before the
fully connected layers, and it saturates at around 20 iterations.
Deeper models lead to fast overfitting and unstable performance in
testing data sets (Supplementary Fig. [230]19). We used AUROC on cancer hotspot
data to investigate the effects of the number of residual blocks on the
testing performance. When 8 and 16 residual blocks are used, the AUROC
for constrained genes are 0.885 and 0.872, respectively, and the AUROC
for non-constrained genes are 0.822 and 0.814, respectively. Other
hyperparameters we chose are commonly used in deep learning models,
including kernel size of 3, pooling size of 2, filter size of 32 and
ReLU as activation functions.
Previously published methods for comparison
We compared MVP score to 13 previously published prediction scores,
namely, M-CAP^[231]11, DANN^[232]73, Eigen^[233]61, Polyphen2^[234]63,
SIFT^[235]64, MutationTaster^[236]62, FATHMM^[237]66, REVEL^[238]12,
CADD^[239]8, MetaSVM^[240]10, MetaLR^[241]10, VEST3^[242]9, and
MPC^[243]21.
Normalization of scores using rank percentile
For each method, we first obtained predicted scores for all possible
rare missense variants in canonical transcripts, then sorted the scores
and converted them into rank percentiles. A higher rank percentile
indicates a more damaging prediction; e.g., a rank percentile of 0.75
indicates that the missense variant is predicted to be more likely
pathogenic than 75% of all possible missense variants.
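The rank-percentile conversion can be sketched as follows (the scores are hypothetical; ties are simply broken in input order here):

```python
def to_rank_percentile(scores):
    """Convert raw prediction scores to rank percentiles in (0, 1]:
    a higher percentile means predicted more damaging."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(scores)
    pct = [0.0] * n
    for rank, i in enumerate(order, start=1):
        pct[i] = rank / n                 # ascending rank / total count
    return pct

scores = [0.12, 0.98, 0.55, 0.31]
print(to_rank_percentile(scores))         # [0.25, 1.0, 0.75, 0.5]
```

This normalization makes thresholds comparable across methods whose raw scores live on very different scales.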
ROC curves
We plotted receiver operating characteristic (ROC) curves and
calculated area under the curve (AUC) values on the training data with
6-fold cross-validation (Supplementary Fig. [244]4), and compared MVP's
performance with other prediction scores on the curated benchmark
testing data sets (Supplementary Fig. [245]5) and the cancer hotspot
mutation data set (Fig. [246]1). For each prediction method, we varied
the threshold for calling pathogenic mutations over a range and
computed the corresponding sensitivity and specificity based on true
positive, false positive, false negative, and true negative
predictions. The ROC curve was then generated by plotting sensitivity
against 1 − specificity at each threshold.
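The threshold sweep behind the ROC curves can be sketched with toy labels and scores (calling a variant pathogenic when its score is at or above the threshold is an assumption of the sketch):

```python
def roc_points(labels, scores, thresholds):
    """For each threshold, call variants with score >= threshold pathogenic
    and return (1 - specificity, sensitivity) pairs for the ROC curve."""
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        sens = tp / (tp + fn)             # true positive rate
        spec = tn / (tn + fp)             # true negative rate
        points.append((1 - spec, sens))
    return points

labels = [1, 1, 0, 0]                     # 1 = pathogenic, 0 = benign
scores = [0.9, 0.6, 0.4, 0.2]
print(roc_points(labels, scores, [0.1, 0.5, 0.95]))
```

Sweeping a dense grid of thresholds and connecting the points traces out the full ROC curve for one method.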
Optimal points based on ROC curves
We define the optimal threshold for the MVP score as the threshold
whose corresponding point on the ROC curve has the largest distance to
the diagonal line (Supplementary Fig. [247]7). Based on the true
positive rate and false positive rate at the optimal points of the ROC
curves, we can estimate the precision and recall in the de novo
precision-recall-proxy curves (Supplementary Fig. [248]8 and
Supplementary Note [249]1).
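Since the distance from a ROC point (FPR, TPR) to the diagonal is (TPR − FPR)/√2, maximizing that distance is equivalent to maximizing TPR − FPR. A sketch of the threshold selection, with toy values:

```python
def optimal_threshold(thresholds, fprs, tprs):
    """Pick the threshold whose ROC point lies farthest from the diagonal:
    distance = (tpr - fpr) / sqrt(2), so maximizing tpr - fpr suffices."""
    best = max(range(len(thresholds)), key=lambda i: tprs[i] - fprs[i])
    return thresholds[best]

thresholds = [0.25, 0.50, 0.75]
fprs = [0.60, 0.20, 0.05]                 # false positive rates per threshold
tprs = [0.95, 0.80, 0.40]                 # true positive rates per threshold
print(optimal_threshold(thresholds, fprs, tprs))   # 0.5
```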
Precision-recall-proxy curves
Since de novo mutation data do not have ground truth, we used the
excess of predicted pathogenic missense de novo variants in cases
compared to controls to estimate precision and a proxy of recall. For
various thresholds of the different scores, we calculated the estimated
number of risk variants and the estimated precision based on the
enrichment of predicted damaging variants in cases compared to
controls. We adjusted the number of missense de novo mutations in
controls by the ratio of synonymous de novo rates in cases versus
controls, since the data sets were sequenced and processed separately
(Supplementary Table [250]2); this partly reduced the signal but
ensured that our results were not inflated by technical differences in
data processing.
Denote the number of cases and controls as N_1 and N_0, respectively;
the number of predicted pathogenic de novo missense variants in cases
and controls as M_1 and M_0, respectively; the rate of synonymous de
novo variants in cases and controls as S_1 and S_0, respectively; the
technical adjustment rate as α; and the enrichment rate of variants in
cases compared to controls as R.
We first estimate α by:
[MATH: \alpha = \frac{S_1}{S_0} :MATH]
1
Then, assuming the rate of synonymous de novo variants in cases and
controls should be identical in the absence of technical batch effects,
we use α to adjust the estimated enrichment of pathogenic de novo
variants in cases compared to controls:
[MATH: R = \frac{M_1/N_1}{(M_0/N_0) \times \alpha} :MATH]
2
Then we can estimate the number of true pathogenic variants
([MATH: M_1' :MATH]) by:
[MATH: M_1' = M_1 \times \frac{R-1}{R} :MATH]
3
And then precision by:
[MATH: \widehat{\mathrm{Precision}} = \frac{M_1'}{M_1} :MATH]
4
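Equations (1)-(4) can be combined into a short sketch; the counts below are illustrative, not the study's data:

```python
def precision_proxy(M1, M0, N1, N0, S1, S0):
    """Estimate the precision of a pathogenicity threshold from
    case/control de novo counts, following Eqs. (1)-(4) in the text."""
    alpha = S1 / S0                        # Eq. (1): synonymous rate ratio
    R = (M1 / N1) / ((M0 / N0) * alpha)    # Eq. (2): adjusted enrichment
    M1_true = M1 * (R - 1) / R             # Eq. (3): estimated true pathogenic variants
    return M1_true / M1                    # Eq. (4): estimated precision

# Illustrative numbers: a 2x enrichment of predicted-pathogenic variants
# in cases implies that half the case calls are truly pathogenic.
print(precision_proxy(M1=200, M0=100, N1=1000, N0=1000, S1=1.0, S0=1.0))  # 0.5
```

Note that precision approaches 1 as the enrichment R grows, and is 0 when cases show no excess over (adjusted) controls (R = 1).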
Random forest model and fully-connected neural network model
To analyze the relative contributions of the features and of the deep
neural network to the improvement over existing methods, we trained a
Random Forest (RF) model and a fully connected neural network (FCNN)
model with the same features as baselines. The RF and FCNN were
implemented using the Python packages scikit-learn and Keras,
respectively. We tried several hyperparameter settings for the RF and
FCNN, and the best settings were determined by 10-fold
cross-validation. In our RF model, we used 256 trees with a depth of
12. The minimum number of examples per node allowed to be split was set
to 10 to alleviate overfitting. For the FCNN, we used 2 hidden layers
with 64 and 32 neurons, respectively, and ReLU as the activation
function. We used the Adam algorithm to minimize the objective function
with hyperparameters lr = 1e−4, β = 0.99, and ε = 1e−8. Early stopping
was applied with validation error as the metric. All 10 trained models
were kept, and for testing we averaged the outputs of the 10 models as
the final predicted scores.
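The RF baseline with the stated hyperparameters can be sketched in scikit-learn; the synthetic feature matrix and its dimensionality are stand-ins for the MVP feature set, not the study's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 38))        # synthetic stand-in feature matrix
y = (X[:, 0] + 0.5 * rng.standard_normal(500) > 0).astype(int)

# 256 trees, max depth 12, and a minimum of 10 samples per split,
# matching the RF baseline hyperparameters described above.
rf = RandomForestClassifier(
    n_estimators=256, max_depth=12, min_samples_split=10, random_state=0
)
rf.fit(X, y)
proba = rf.predict_proba(X)[:, 1]         # pathogenicity-like probability per variant
print(proba.shape)                        # (500,)
```

In the actual pipeline, one such model would be fit per cross-validation fold and the 10 fold models' outputs averaged at test time.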
Reporting summary
Further information on research design is available in the [251]Nature
Research Reporting Summary linked to this article.
Supplementary information
[252]Supplementary Information^ (4.1MB, pdf)
[253]Peer Review File^ (372.4KB, pdf)
[254]41467_2020_20847_MOESM3_ESM.pdf^ (128.7KB, pdf)
Description of additional supplementary files
[255]Supplementary Data 1^ (10.7KB, xlsx)
[256]Supplementary Data 2^ (13.1KB, xlsx)
[257]Supplementary Data 3^ (11.1KB, xlsx)
[258]Supplementary Data 4^ (382.3KB, xlsx)
[259]Supplementary Data 5^ (523KB, xlsx)
[260]Supplementary Data 6^ (217.6KB, xlsx)
[261]Supplementary Data 7^ (10.4KB, xlsx)
[262]Supplementary Data 8^ (23.9KB, xlsx)
[263]Supplementary Data 9^ (24KB, xlsx)
[264]Supplementary Data 10^ (16.6KB, xlsx)
[265]Supplementary Data 11^ (54.3KB, xlsx)
[266]Supplementary Data 12^ (18.6KB, xlsx)
[267]Reporting Summary^ (303KB, pdf)
Acknowledgements