Abstract

   Next-generation sequencing enables simultaneous analysis of hundreds of
   human genomes associated with a particular phenotype, for example, a
   disease. These genomes naturally contain extensive sequence variation
   that ranges from single-nucleotide variants (SNVs) to large-scale
   structural rearrangements. In order to establish a functional
   connection between genotype and disease-associated phenotypes, one
   needs to distinguish disease drivers from neutral passenger variants.
   Functional annotation based on experimental assays is feasible only for
   a limited number of candidate mutations. Thus, alternative computational
   tools are needed. A possible approach to annotating mutations
   functionally is to consider their spatial location relative to
   functionally relevant sites in three-dimensional (3D) structures of the
   harboring proteins. This is impeded by the lack of available protein 3D
   structures. Complementing experimentally resolved structures with
   reliable computational models is an attractive alternative. We
   developed a structure-based approach to characterizing three
   comprehensive sets of non-synonymous single-nucleotide variants
   (nsSNVs): those associated with cancer, those associated with
   non-cancer diseases and those that are putatively functionally
   neutral.
   We searched experimentally resolved protein 3D structures for potential
   homology-modeling templates for proteins harboring corresponding
   mutations. We found such templates for all proteins with
   disease-associated nsSNVs, and for 51% and 66% of proteins carrying
   common polymorphisms and annotated benign variants, respectively.
   Many mutations caused by
   nsSNVs can be found in protein–protein, protein–nucleic acid or
   protein–ligand complexes. Correction for the number of available
   templates per protein reveals that protein–protein interaction
   interfaces are not enriched in either cancer nsSNVs or nsSNVs
   associated with non-cancer diseases. Whereas cancer-associated
   mutations are enriched in DNA-binding proteins, they are rarely located
   directly in DNA-interacting interfaces. In contrast, mutations
   associated with non-cancer diseases are in general rare in DNA-binding
   proteins, but enriched in DNA-interacting interfaces in these proteins.
   All disease-associated nsSNVs are overrepresented in ligand-binding
   pockets, and nsSNVs associated with non-cancer diseases are
   additionally enriched in the protein core, where they probably affect
   overall protein stability.

Introduction

   Human genetic variation ranges from neutral polymorphisms to disease
   susceptibility variants and pathogenic mutations with high
   penetrance.^[28]1 A single individual may carry up to 3 × 10^6
   single-nucleotide variants (SNVs) and up to 3 × 10^5 insertions and
   deletions,^[29]2 but even in disease-affected individuals only a few
   variants of this continuum are expected to be causal, with the rest
   being neutral. Data on genetic variants that underlie certain disease
   phenotypes are accumulated in specific databases, for example,
   ClinVar,^[30]3 which currently contains >160 000 unique variant records
   pertaining to 27 261 genes. However, even a strong mutation-phenotype
   association by itself provides no insight into the mechanistic changes
   in protein function and/or structure that are caused by the mutation.
   These changes can result in protein instability or misfolding, or in
   perturbations of interaction energy, if the affected protein is
   involved in protein–protein, protein–nucleic acid or protein–ligand
   interactions.

   Computational analysis of the available three-dimensional (3D)
   structures of human proteins shows that disease-causing missense
   (non-synonymous) mutations often result in significant alteration of
   the amino-acid residue properties and disruption of non-covalent
   bonding.^[31]4 In contrast, functionally neutral variants tend to be
   located at the protein surface and at positions that are less
   conserved than expected at random.^[32]5, [33]6 Anecdotal data are
   available on the involvement of disease-associated missense SNPs in
   protein–protein interactions (PPI), as reviewed in references [34]7,
   [35]8, [36]9. A large-scale analysis confirms that
   disease-related mutations are frequently overrepresented on PPI
   interfaces.^[37]10

   Several computational methods have been developed to assess the impact
   of non-synonymous single-nucleotide variants (nsSNVs) on protein
   function, with SIFT^[38]11 and PolyPhen-2^[39]12 being among the most
   commonly used. Some methods take into account protein
   sequence-based phylogenetic information pertaining to the
   mutation,^[40]11, [41]13 whereas others rely on a combination of
   protein structural information, functional parameters and
   phylogenetic information derived from multiple sequence
   alignments.^[42]14, [43]15, [44]16, [45]17, [46]18 The specific
   contribution of structural parameters
   to the prediction performance has been a long-discussed issue.^[47]12,
   [48]17

   Numerous tools have been constructed to assess potential changes caused
   by SNVs in protein 3D structure. The SNPeffect database,^[49]18 for
   example, ignores the conservation profiles of SNVs and relies on
   predicted structural features (aggregation, amyloidogenicity,
   stability) as well as domain and catalytic site annotations. There are
   also tools
   that predict the energetic impact of a mutation on the stability of a
   protein or protein complex.^[50]19, [51]20, [52]21, [53]22, [54]23,
   [55]24 A thorough comparison and discussion of limitations of these
   methods can be found in references [56]17, [57]25. dSysMap^[58]26 and