Abstract
The current trend in clinical data analysis is to understand how
individuals respond to therapies and drug interactions based on their
genetic makeup. This has led to a paradigm shift in healthcare; caring
for patients is now 99% information and 1% intervention. Reducing costs
of next generation sequencing (NGS) technologies has made it possible
to take genetic profiling to the clinical setting. This requires not
just fast and accurate algorithms for variant detection, but also a
knowledge-base for variant annotation and prioritization to facilitate
tailored therapeutics based on an individual's genetic profile. Here we
show that it is possible to provide a fast and easy access to all
possible information about a variant and its impact on the gene, its
protein product, associated pathways and drug-variant interactions by
integrating previously reported knowledge from various databases. With
this objective, we have developed a pipeline, Sequence Variants
Identification and Annotation (SeqVItA) that provides end-to-end
solution for small sequence variants detection, annotation and
prioritization on a single platform. Parallelization of the variant
detection step and with numerous resources incorporated to infer
functional impact, clinical relevance and drug-variant associations,
SeqVItA will benefit the clinical and research communities alike. Its
open-source platform and modular framework allows for easy
customization of the workflow depending on the data type (single,
paired, or pooled samples), variant type (germline and somatic), and
variant annotation and prioritization. Performance comparison of
SeqVItA on simulated data and detection, interpretation and analysis of
somatic variants on real data (24 liver cancer patients) is carried
out. We demonstrate the efficacy of annotation module in facilitating
personalized medicine based on patient's mutational landscape. SeqVItA
is freely available at [27]https://bioinf.iiit.ac.in/seqvita.
Keywords: SNPs, INDELs, sequence variants, NGS, annotation,
personalized medicine, platform
Background
Precision medicine is an emerging approach for risk assessment to
diagnosis, disease prognosis, and treatment that considers individual
genetic variability into account. It requires analyzing multiple genes
quickly and sensitively from small quantities of sample, which is now
possible with the advent of next generation sequencing (NGS)
techniques. Diagnostic testing of Mendelian and heredity disorders and
risk screening of heredity cancers are a few well-established clinical
applications of NGS techniques. Targeted sequencing (TS) is widely used
in the form of gene panels, e.g., hotspot panels (either clinically
actionable or with diagnostic/prognostic significance), actionable gene
panels (includes all exons of targeted genes), and disease-focused gene
panels. With the decrease in the cost of sequencing, whole genome
(WGS), and whole exome sequencing (WES) are slowly emerging in
biomedical studies and medical practices. The major advantage with
WGS/WES approaches is that one can identify mutations not previously
reported and in non-coding regions (with WGS), which may be specific to
the individual. Thus, the revolution of high-throughput sequencing
technologies has made it possible to carry out genome-wide analysis of
somatic mutations in population-scale cancer cohorts.
Major steps involved in the prediction of sequence variants in clinical
NGS data are (1) Data pre-processing, (2) Read alignment, (3) Variant
calling, and (4) Variant annotation and prioritization. Data
pre-processing involves removing low quality reads, adapter sequences
and contamination. It is a key step in any NGS data analysis and has a
direct effect on the downstream analysis if not performed properly. A
number of data quality check, filtering and trimming tools are
available (both standalone and web-based) (Del Fabbro et al., [28]2013;
Bao et al., [29]2014). A major limitation with some of these tools is
that they are unable to handle very large datasets. For example,
PRINSEQ (Schmieder and Edwards, [30]2011) works efficiently for a
single sample, while NGS QC Toolkit (Patel and Jain, [31]2012) and
FastQC (Andrews, [32]2010) can handle small number of samples
simultaneously; however, the performance is very poor in terms of
runtime and memory usage for large datasets. Raspberry (Katta et al.,
[33]2015) is able to efficiently perform batch processing on large
number of samples in parallel and FaQCs (Lo and Chain, [34]2014)
enables quality check, trimming and filtering of low-quality reads in
large samples quickly through multi-threading.
Alignment of pre-processed short sequence reads to a reference genome
is the next step which requires large computation space and is highly
time-consuming. BWA (Li and Durbin, [35]2009) and Bowtie2 (Langmead and
Salzberg, [36]2012) are some of the fastest aligners available that use
“indexing” strategies (compression heuristics), resulting in optimal
alignment of reads. Most of the existing aligners have a provision to
map the reads using multiple cores/CPUs (parallel processing).
Standalone tools such as NGS-QCbox (Katta et al., [37]2015),
cloud-based aligners such as CloudBurst (Schatz, [38]2009), Galaxy
CloudMan (Afgan et al., [39]2010), and Crossbow (Gurtowski et al.,
[40]2012) carry out read mapping much efficiently using big data
infrastructure.
Because of large space-time resources required for the above two steps
in NGS data analysis, most variant callers take either the alignment
file in BAM format or the VCF format as input. Variants are predicted
by comparing each position with the standard reference sequence and
applying certain statistical models/heuristics to improve the
reliability of predictions. Some popular tools for detection of SNVs
and small INDELs in large NGS datasets are summarized in Table [41]1,
namely, GATK (based on Hadoop-MapReduce framework) (McKenna et al.,
[42]2010; Cibulskis et al., [43]2013), MAFsnp (Hu et al., [44]2015),
and SNVSniffer (Liu et al., [45]2016). Sequence variants are classified
as germline, somatic, or loss-of-heterozygosity (LOH) by the variant
callers. Germline variants are usually identified from a single,
paired, family-pedigree sequences or population cohorts, while a
matched normal is required for the prediction of somatic and LOH
variants. Each of these tools have their inherent strengths and
limitations and none of these can reliably detect all types and sizes
of sequence variants (Yi et al., [46]2014; Xu, [47]2018).
Table 1.
Feature comparison of SeqVItA with various popular and recent tools for
sequence variant calling, annotation and prioritization in NGS data.
Variant Callers Annotation Integrated Platforms
GATK BCFtools VarScan2 DeepVariant SNVsniffer MAFsnp ANNOVAR CADD
FCVPP2 CoVaCs RAREVATOR Canary Nimbus SeqVItA
DATA INPUT
Single ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Paired (case-control) ✓ – ✓ – ✓ – – – – - ✓ ✓ ✓ ✓
Multiple ✓ ✓ ✓ – ✓ – – – ✓ – – – ✓
Read alignment – – – – – – – – – ✓ – ✓ ✓ –
Preprocessing post-alignment ✓ ✓ – ✓ – – – – – ✓ – – – ✓
VARIANT CALLING
Germline ✓ ✓ ✓ ✓ ✓ ✓ – – – ✓ ✓ ✓ ✓ ✓
Somatic ✓ – ✓ – ✓ – – – – – ✓ ✓ ✓ ✓
Multi-allelic F ✓ – – – – ✓ – – – – – – – ✓
Parallelization ✓ ✓ ✓ ✓ – – – – – ✓ – – – ✓
ANNOTATION
Gene-locus – – – – – – ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Conservation score – – – – – – ✓ ✓ ✓ ✓ – ✓ ✓ ✓
Population alleles – – – – – – ✓ ✓ ✓ ✓ ✓ – ✓ –
Clinical associations – – – – – – ✓ ✓ ✓ – ✓ ✓ ✓ ✓
Drug associations – – – – – – – – – – – – – ✓
Prioritization – – – – – – – ✓ ✓ ✓ ✓ ✓ – ✓
Reference McKenna et al., [48]2010 Li, [49]2011 Koboldt et al.,
[50]2012 Poplin et al., [51]2017 Liu et al., [52]2016 Hu et al.,
[53]2015 Wang et al., [54]2010 Kircher et al., [55]2014 Kumar et al.,
[56]2018 Chiara et al., [57]2018 Magi et al., [58]2015 Doig et al.,
[59]2017 Brouwer et al., [60]2018
[61]Open in a new tab
Variant calling in WGS (WES) data result in millions (hundreds) of
variants, most of which may not contribute to any phenotypic condition.
Identifying a small subset of functional variants that are directly
involved in a pathway/disease is desirable. The first step in variant
prioritization is annotation. This includes identifying the functional
impact of the variant on the gene/protein product, its frequency in the
population, disease association and variant-drug association. Some
popular variant annotation tools are listed in Table [62]1. The most
common annotation feature across these tools is gene-locus annotation,
i.e., position of variant on the gene locus (exon/intron/5′/3′
UTR/promoter). Features such as phylogenetic conservation scores [e.g.,
SIFT (Ng and Henikoff, [63]2003) and PolyPhen2 (Adzhubei et al.,
[64]2013)], help in distinguishing the impact of the variants—from
damaging to tolerated missense variants, while population allele
frequencies [dbSNP (Pagès, [65]2017), 1000 Genome Project ([66]1000
Genomes Project Consortium et al., [67]2015) and gnomAD (Lek et al.,
[68]2016)] help in filtering common variations. Known variant-disease
associations can be obtained from databases such as OMIM (Hamosh et
al., [69]2002), ClinVar (Landrum et al., [70]2014), and COSMIC (Bamford
et al., [71]2004) and can help in understanding the role of variants in
disease prevalence and mode of inheritance. Another important class of
annotation is variant-drug association that enables us to interpret
efficacy of the drug in the presence of a particular variant, e.g.,
PharmGKB (Gong et al., [72]2008). Since not all variants in the genic
region are deleterious, it is necessary to prioritize the predicted
variants. Methods proposed vary from computing simple cumulative scores
to statistical measures based on available information for the variant
in various resources. Thus, annotation and prioritization of variants
is prerequisite in personalized therapy.
It may be noted from Table [73]1 that majority of the existing tools
are either only variant callers or annotation provider, with a limited
few providing both the features. Further, variant callers may be
limited by data types (single, paired, or pooled samples), aligner
outputs, and computational efficiency. One of the most widely used
annotation tool, ANNOVAR, requires input in a specific format that is
not supported by most variant callers, making it difficult to use.
There are numerous papers that propose pipeline for disease-specific
variant detection and annotation (Alexander et al., [74]2016; Krupp et
al., [75]2017; Mathur et al., [76]2018). These are useful in guiding
the experimentalist but require the user to download and install all
the tools and dependencies. To address some of these issues we have
developed an open-source, command-line driven pipeline, SeqVItA. On a
single platform all the data types and output formats are supported and
parallelization of the variant calling step makes it easy to handle
large datasets. A detailed annotation is provided for both coding and
non-coding variants and based on their functional impact a score is
given which is used for prioritizing the variant. Here we show how the
annotation module of SeqVItA can assist in making a suitable treatment
plan based on an individual's mutational profile. The efficacy of both
the variant calling and annotation modules are discussed.
Methods
Overview of SeqVitA
Sequence Variants Identification and Annotation (SeqVItA) is an
open-source platform for the prediction and annotation of SNVs and
small INDELs in whole genome (WGS), whole exome (WES), and targeted
(TS) next generation sequencing (NGS) data. It can handle single or
multiple input files simultaneously in either BAM or mpileup format. In
SeqVItA, one can detect germline mutations in single samples and
population datasets (wherein the frequency of a minor allele is also
computed). For the detection of somatic variations in cancer data, it
uses paired tumor-control samples. It is built on a modular framework
and the workflow of SeqVItA is shown in Figure [77]1. It consists of
three major modules, viz., (i) pre-processing, (ii) variant calling,
and (iii) variant annotation and prioritization. The variant calling
module consists of three sub-modules: germline, somatic, and population
to handle different type of data analyses. SeqVItA is freely available
at [78]https://bioinf.iiit.ac.in/seqvita.
Figure 1.
[79]Figure 1
[80]Open in a new tab
Workflow of SeqVItA for identification, annotation and prioritization
of sequence variants in WGS, WES, or TS data. Het, Heterozygous; Homo,
Homozygous; LOH, Loss of heterozygosity; MAF, Minor allele frequency.
Implementation
SeqVItA follows a heuristic/statistic approach similar to VarScan2
(Koboldt et al., [81]2012) for variant detection. It is implemented in
a combination of programming languages (C++, R, and Bash). The input to
SeqVItA is aligned file in either BAM or mpileup format and the
computationally intensive variant calling step is parallelized using
OpenMP. It involves filtering the variants based on predefined criteria
for read depth, base quality, variant allele frequency and statistical
significance. Additionally, SeqVItA provides annotation and
prioritization to the predicted variants for assessing the biological
significance of the variants called. This module can also be
independently accessed by the researchers for variants predicted from
any other tool. Various steps, parameters and their default values
considered in the prediction of SNVs and INDELs in SeqVItA are
summarized in Table [82]2.
Table 2.
Various steps and corresponding parameters in the detection of SNVs and
INDELs in SeqVItA are summarized.
Steps Description Key Parameter Value
Pre-processing (aligned file in BAM format) Mapping quality correction
using Equation (1) –Mqcorr 0
Filter reads based on mapping quality cut-off –Mqread 20
Variant Calling Base quality Cut-off (Phred score) –Qbase 15
Read depth at a site ≥ cut-off, site is considered for variant calling
–RD_th 10
If no. of reads supporting alternate allele ≥ cut-off, site is
considered for variant calling –VAR_th 2
Compute variant allele frequency (VAF) for variant calling –VAF_th 0.20
Check for strand bias (Discard if ≥ 90% and ≤ 10% support from same
strand) –Strand_Bias 1
p-value cut-off (FET[83]^*) for calling variants –p-value 0.01
For germline SVs, if VAF > cut-off, variant is homozygous, else
heterozygous –VAF_homo 0.75
p-value cut-off (FET[84]^*) for calling somatic, LOH variants
–somatic-p-value 0.05
Variant annotation Drug association based on gene/variant in PharmGKB
-d –
[85]Open in a new tab
^*
FET: Fisher Exact test.
Pre-processing
This step is used only when the input is alignment file in BAM format.
Variant calling is very sensitive to the quality of alignment, as a
wrong alignment of reads may lead to false positives. Hence, it is
extremely important to ensure that the reads are uniquely mapped. To
handle the mappability issue, mpileup function in Samtools has been
incorporated for recalibrating and filtering of low mapping quality
reads. Mapping quality of the reads is recalibrated using the
expression:
[MATH: Mq′
= I
nt-MqInt ×Int :MATH]
(1)
where Int is a user defined integer and Mq is the phred-scaled
probability of a read being misaligned. Default Mqcorr = 0 implies no
mappability correction (Mqcorr = 50 is recommended for the alignment
files generated using BWA or Bowtie2). The recalibrated file is
generated in an mpileup format for further analysis. If the alignment
has been obtained using only uniquely mapped reads by an aligner, one
may skip this step. This is followed by filtering of low mapping
quality reads (≤ mapping quality), which is user-defined depending on
the aligner used. If the input to SeqVItA is in mpileup format, this
step is not required.
Variant calling
In this step, depending on the type of data, the user may choose one of
the three modules for small sequence variant detection: germline module
(single sample), somatic module (paired tumor-control samples) and
population module (multiple samples). Using germline module one may
identify SNVs, INDELs or both simultaneously in a single sample, while
identification of somatic, LOH and germline sequence variants (SNVs
and/or INDELs) is carried out in somatic module for paired
tumor-control samples. SeqVItA also handles genotyping in population
data using population module that takes multiple sample files
simultaneously as input and computes frequency of minor allele in the
dataset. The input is alignment file in mpileup format and the output
is in variant call format (VCF, v4.1). The details of each of these
modules are discussed below.
Variant calling in single sample
The input to this module is mpileup format file, shown in Supplementary
Figure [86]S1. It is a tab-separated text file that contains
information about the reads aligned to reference genome. First four
columns give the chromosome number, position in reference genome,
nucleotide present at the location in reference genome and number of
reads aligned at that position, respectively. In the 5th column, base
aligned at the position in the sample: “,” or “.” if the base at the
location is the same as that in reference, characters (“A”, “G”, “T”,
or “C”) for substitution (SNV) and “+” or “−” followed by an integer
representing insertion/deletion of integer length at the position. The
6th column gives ASCII encoded base qualities of the base in the reads.
For variant calling, each line in the mpileup file is parsed and the
coverage at each location is computed by considering only those reads
that have base quality ≥ Qbase (= 15, default). For N samples, coverage
and quality of the bases is be obtained from 4+3(m−1) and 6+3(m−1)
columns respectively, for sample m = 1, 2, …, N. If the number of high
quality reads ≥ minimum coverage threshold, RD_th (= 10, default) at a
location, then the site is considered for further analysis. To make a
variant call at a position, SeqVItA checks whether minimum number of
reads supporting the variant, VAR_th (= 2, default) is satisfied.
Variant allele frequency (VAF) is then computed by taking the ratio of
the number of non-reference alleles to the total number of reads at the
locus. Further, if VAF ≥ VAF_th, it is checked for strandedness. The
strand bias is identified if majority of reads supporting the variant
(>90%) belong to the same strand (forward or reverse). The strand bias
may occur due to erroneous PCR duplication and the variants called at
such locations are discarded.
Variants passing the above filtering criteria are considered for
Genotype call, Genotype Quality, and p-value calculation to assess the
significance of the variant called. This is done by using one-tailed
Fisher exact test under the null hypothesis that the variation observed
is due to sequencing error. For this, number of reads supporting the
variant is compared to the expected distribution of reads for a
non-variant position based on sequencing error alone. Suppose, N(obs,
ref) and N(obs, var) denote number of reads supporting the reference
and variant respectively, with the total coverage at a location defined
as N(obs) = N(obs, ref) + N(obs, var). Considering sequencing error to
be 0.001, expected number of reads supporting the variant allele due to
sequencing error is N(exp, var) = N(obs) × 0.001. The expected number
of reads supporting the reference in this case is N(exp, ref) = N(obs)
− N(exp, var). If the observed number of reads supporting the variant,
N(obs, var) > N(exp, var), the site is assumed to be a true variant,
else a sequencing error. Representing these four quantities in a 2 × 2
contingency table (Table [87]3), the probability of obtaining the
observed data is given by
Table 3.
The 2 × 2 contingency table for computing the p-value using Fisher
exact test.
Reads supporting Observed Expected Row Total
Variant N(obs, var) N(exp, var) N(var)
Reference N(obs, ref) N(exp, ref) N(ref)
Column total N(obs) N(exp) N
[88]Open in a new tab
[MATH: p=CN(obs, var)N(var)× CN(obs, ref)N(ref)CN(Obs)N<
/mrow> :MATH]
(2)
where, N(var) and N(ref) represent row total for variant and reference
alleles, respectively. The probability for all values of N(obs,var) is
computed by considering all possible tables obtained by reducing the
least value in the table (N(exp, var)) to zero while keeping the row
total and column total constant. The p-value is then defined as the sum
of all the hypergeometric probabilities of these contingency tables. If
p-value is less than the cutoff (= 0.05, default), the site is said to
be a true variant. The Genotype Quality is then given by:
[MATH: Genotype Quality=-
10log (p-value) :MATH]
(3)
Once the variants are identified, a final filter is applied to
categorize variants as heterozygous or homozygous. If the variant
allele frequency at a location is greater than the user-defined
homozygous frequency cutoff, VAF_homo (= 0.75, default), the site is
said to be Homozygous (HOM = 1), else Heterozygous (HET = 1).
Variant calling in case-control samples
For identifying somatic, germline and LOH sequence variants, tumor
samples along with a matched control sample paired samples (either as
two separate alignment files in BAM format, or a single mpileup file
for both normal and tumor alignment files) is given to the somatic
module. In this case, the first three columns contain information on
chromosome number, position and nucleotide present at the location,
columns 4th, 5th, and 6th contain information regarding read depth,
reference/variant base, and base quality for sample 1 and corresponding
information for sample 2 is given in columns 7th, 8th, and 9th in the
mpileup file. As in germline module, in this case also each location is
assessed based on the coverage requirement. The p-value calculation is
then carried out to determine the genotype in the two samples
independently if the filtering criteria are satisfied. If the filtering
criteria are not met for any one of the case/control samples, the site
is said to be homozygous (reference). If the number of reads supporting
a variant allele is >75% (VAF_homo), the variant is called homozygous,
else heterozygous (as in germline module). In case of multiple variant
alleles observed at a location, the variant alleles with higher read
count, followed by base quality are reported. Positions at which one or
both the samples have a variant and the genotypes do not match, a
comparison between the tumor and control samples is carried out using
one-tailed Fisher exact test in 2 × 2 contingency tables (Table [89]4)
to compute an additional p-value called somatic p-value (Supplementary
Figure [90]S2). This helps in distinguishing the somatic and LOH
variants from germline mutations, as shown below.
Table 4.
The 2 × 2 contingency table for computing p-value using Fisher exact
test to predict somatic, germline and LOH mutations.
Reference Variant Row total
Tumor N(tum, ref) N(tum, var) N(tum)
Normal N(nor, ref) N(nor, var) N(nor)
Column total N(ref) N(var) N
[91]Open in a new tab
Total number of reads, N = Nref + Nvar = Ntum + Nnor, where Nref and
Nvar correspond to the number of reads supporting the reference and
variant, respectively and Ntum and Nnor correspond to the number of
reads in tumor and normal samples, respectively.
Here, number of reads supporting the reference allele N(tum, ref) and a
variant allele N(tum, var) in the tumor sample are compared to that
observed in the control sample, N(nor, ref) and N(nor, var). Under the
null hypothesis that reference alleles and variants alleles are
independently distributed in tumor and normal samples, the probability
of observing x reference alleles in the tumor sample is given by
[MATH: p=C
mi>xN(tum)× CN(ref)-xN(nor)CN(var)N<
/mrow> :MATH]
(4)
The probability of x reference alleles in the tumor sample is computed
by considering all possible tables obtained by reducing the least value
in the table to zero while keeping the row totals and column totals
constant. The somatic p-value is then defined as the sum of all the
hypergeometric probabilities of these contingency tables. If this
p-value ≤ somatic-p-value threshold (= 0.05, default) and the normal
matches the reference allele, the base is classified as “Somatic” and
if the normal is heterozygous, the base is classified as “LOH.” If the
somatic p-value for a base is smaller than the threshold cut-off (0.05)
and the normal base do not meet the above criteria, the base is
classified as “Unknown.” Sequence variants where the allele is
homozygous in normal, and homozygous-reference or heterozygous in the
tumor are grouped under “Unknown” category. In case somatic p-value is
greater than the threshold cut-off or the genotypes in the case-control
samples match, the variant p-value (VPV) is computed by combining tumor
and normal read counts for each allele under the null hypothesis that
the site is non-variant (i.e., variation observed is due to sequencing
error) and the base is reported as “Germline.” Based on the variant
allele frequency (VAF) and somatic p-values (one-tailed Fisher Exact
test), a somatic variant is reported as “High” priority if VAF ≥ 10% in
tumor, < 5% in normal and p < 0.07, and LOH is reported as “High”
priority if VAF ≥ 10% in normal and p < 0.07. Germline variants are
described as “High” priority if VAF ≥ 10% in both tumor and normal
samples (Supplementary Figure [92]S2). Any sequence variant not meeting
these criteria is assigned “Low” priority. Thus, the user may filter
out high confidence sequence variants. For this module, SeqVItA
generates four VCF files using the function splitVCF corresponding to
Somatic, Germline, LOH and Unknown sequence variants.
Variant calling in large population samples
Apart from detecting variants in single and paired samples, SeqVItA
also handles population data for genotyping in disease cohort studies.
In this case, the input is multiple alignment files (BAM format) or a
single mpileup file containing information of multiple samples. The
sequence variant detection and p-value computation are carried out
individually for each sample using one-tailed Fisher Exact Test
(similar to germline module). A variant site may have more than one
allele identified in multiple samples and these alleles are reported
under “ALT” column, as shown in Supplementary Figure [93]S3. Samples
with no sequence variant are represented by 0/0 and 0/1 if the sample
contains one variant allele. This is reported in the respective sample
information columns. In case more than one alternate allele is
identified in any sample, then it is represented as 0/2 under that
sample and both the alleles are reported in the “ALT” column (comma
separated) in VCF file. Minor allele frequency (MAF), defined as the
frequency of second most common allele identified in the given
population dataset, is also computed for each variant. Based on MAF
value, user may categorize the predicted sequence variant as “rare”
(MAF < 0.01) or “common” (MAF > 0.01). MAF is reported as percentage
frequency of the allele in the “INFO” column, MAF = NA implies no minor
allele is identified.
Parallelization of the variant calling step
Screening for variants at every position of the 3 billion bases in
human genome sequentially is very time-consuming. To improve
computational efficiency, we have parallelized the variant calling step
in SeqVItA using OpenMP that supports shared memory multi-processing.
In this parallel computing platform, variant calling is carried out
simultaneously for “n” positions by using “n” threads in which each
thread shares the same memory address space resulting in a faster
communication. Compared to other existing frameworks such as OpenMPI or
MapReduce, OpenMP is chosen for its efficiency in carrying out
computationally-intensive tasks (Kang et al., [94]2015).
Annotation
Typically, small sequence variants (SNVs and INDELs) range from
thousands to millions in number in a genome. However, not all variants
are likely to be associated to a phenotype. Identifying a small subset
of functional variants that are directly involved in a pathway/disease
is a major challenge in research. Though there exists a number of
variant callers, in our knowledge very few provide any
annotation/prioritization to the called variants. To address this
problem, we have developed the annotation module that extracts
information from numerous resources to assess the significance of the
predicted variants. Thus, using this module, the user may filter out
non-significant variants for further analysis. This module can also be
used independently by the user to annotate variants obtained from any
other sequence variant detection tool in VCF format. Most trivial way
to prioritize the sequence variants is by statistical significance
given by p-value (Fisher's exact test). In SeqVItA, the p-value is
provided in the INFO column of the VCF file and the user may sort and
filter the variants based on this column. Though p-value gives the
statistical significance, it does not tell us anything about the
biological significance of the variant. For evaluating the biological
significance, we provide four levels of information. First level of
screening is based on the location of the variant: whether it is
intergenic or intragenic. R package “VariantAnnotation” (Obenchain et
al., [95]2014) integrated in SeqVItA provides gene-centric annotation
to the variant, e.g., whether the variant lies in the coding region
(synonymous, non-synonymous, frameshift, amino acid change), intron,
5′/3′ UTR, splice site, promoter, or intergenic region. For intergenic
variants, gene preceding or following the variant is provided. Known
variants from dbSNP are reported using R package
“SNPlocs.Hsapiens.dbSNP144.GRCh37” (Pagès, [96]2017). In SeqVItA,
functional annotation of the predicted variant is provided at three
levels: (i) functional impact of the mutation, (ii) clinical
association, and (iii) variant/gene-drug association. For functional
impact of the variant, R package “rfPred” (Jabot-Hanin et al.,
[97]2016) is used to extract SIFT (Ng and Henikoff, [98]2003),
Polyphen2 (Adzhubei et al., [99]2013) MutationTaster (Schwarz et al.,
[100]2010), PhyloP (Pollard et al., [101]2010) and likelihood ratio
test (LRT) (Chun and Fay, [102]2009) scores from the database of
Non-synonymous SNPs Functional Predictions (dbNSFP) (Liu et al.,
[103]2011). These methods are based on either protein sequence
conservation (SIFT and Polyphen2) or DNA sequence conservation models
(MutationTaster, PhyloP, and LRT) for non-synonymous variants. These
scores range from 0 to 1, with 0 indicating variants being tolerant and
1 indicating highly deleterious. The information regarding
clinical-association is extracted from ClinVar (Landrum et al.,
[104]2014), COSMIC (Bamford et al., [105]2004), DECIPHER (Firth et al.,
[106]2009), and OMIM (Hamosh et al., [107]2002) databases. The
extracted information is in the form of clinical significance
(clinSign) and phenotype from ClinVar, COSMIC ID, OMIM ID, and DECIPHER
score (predicted probability and percentage) for the gene spanning the
variant. A gene is categorized as haploinsufficient if
DECIPHER_percentage score lies between 0 and 10%. Fourth level of
information provided is variant/gene-drug association extracted from
PharmGKB (Gong et al., [108]2008). In SeqVItA, the user may choose from
two levels of drug assignment information for the variants: one is
based on mapping with dbSNP ID (Stringent) and another based on mapping
with the gene (flexible). The information provided include the name of
the associated drug, type, and confidence level of action for a given
variant/gene. These three categories of annotations has been used to
obtain disease-associated variant prioritization from sequence data. A
snapshot of the annotated file generated by SeqVItA is shown in
Supplementary Figure [109]S4.
In the analysis of multiple samples, if the user is interested in
identifying genes carrying mutations across the samples (the variants
may or may not be the same across samples), the function
findRecurrentGenes is used to facilitate identification of variant-gene
correspondence which can be outputted in a singular tabular format
file, as shown in Supplementary Figure [110]S5.
Prioritization
All the sequence variants detected in a patient's sample may not
contribute to the disease and it would be helpful in identifying the
important set of variants for therapeutics. In SeqVItA, the
prioritization of sequence variants is based on the three categories of
annotations discussed above. A variant is assigned “High” priority if
the functional impact of the variant score is high (>0.65) from any one
of the five resources (SIFT, Polyphen2, MutationTaster, LRT, and
PhyloP), clinical association identified in at least one of the three
resources (ClinVar, COSMIC, and OMIM), and variant-drug association
identified in PharmGKB. A variant is assigned “Medium” priority if
annotation is reported from any of the three categories. Any variant
not meeting the above criteria is assigned “low” priority.
Results and discussion
We discuss below the performance of SeqVItA on simulated data and real
(cancer) data from 24 samples. In the simulated experiment, the effect
of various parameters, viz., sequencing coverage, read length, minimum
coverage threshold, base quality, type of SNVs and INDELs (homozygous
or heterozygous), and size of INDELs (1–10) on the prediction accuracy
is assessed. The performance of SeqVItA on real data is carried out by
detecting somatic variants (both SNVs and INDELs) in paired exome
sequence data (tumor and matched normal) from 24 liver cancer patients
(SRP123031). Finally, we discuss ability of the “annotate” module in
SeqVItA in analyzing the mutational landscape across 24 liver cancer
patients, and prioritize the somatic variants to understand the
probable role of novel variants in tumorigenesis.
Analysis of simulated data
Genomic DNA sequence of size 16,569 bp (hg18 assembly) is considered
for simulating the reads and is called “reference” sequence. In the
simulated experiment, detection of both homozygous and heterozygous
SNVs and INDELs is considered. To simulate homozygous condition we
insert variations at random locations in a copy of the reference
sequence and concatenate this “mutated reference” sequence to itself
(to obtain the diploid genome), while the “mutated reference” sequence
is concatenated to the “reference” sequence to simulate heterozygous
condition. That is, four independent diploid genome sequences are
constructed containing (i) 12 homozygous SNVs, (ii) 12 heterozygous
SNVs, (iii) 4 different sizes (e.g., 1, 2, 5, and 10) of homozygous
insertions and deletions, and (iv) 4 different sizes of heterozygous
insertions and deletions. Reads are simulated (for Illumina sequencing
in this case) using ART simulator (Huang et al., [111]2012) at three
different sequencing coverages (20 ×, 40 ×, and 60 ×) for two different
read lengths (50 and 100 bp), that is, a total of 6 experiments are
conducted for the four diploid genomes constructed. ART is the most
widely used simulator to generate synthetic NGS reads by emulating the
sequencing process with built-in, technology-specific read error models
and base quality value profiles parameterized empirically in large
sequencing datasets. The reads are aligned to the “reference” sequence
using Bowtie2 (Langmead and Salzberg, [112]2012). For each sequencing
depth and read length, this exercise is repeated 50 times and the
predictions are averaged over these repetitions. Variants detected
using SeqVItA are then compared with three popular resources, BCFtools
(Li, [113]2011), VarScan2 (Koboldt et al., [114]2012) and GATK
HaplotypeCaller (McKenna et al., [115]2010).
In the simulated experiments, performance of SeqVItA is evaluated on
two important parameters, viz., sequencing coverage and base quality.
Base quality is given by Phred score of the base and only those reads
that have a Phred score above a certain minimum quality score, Qbase is
considered. We considered two base quality thresholds, 15 and 30 for
the analysis. Minimum number of reads (with base quality > Qbase)
required to make a call at a location is called minimum coverage
threshold, RD_th (= 10, default). Another important parameter is the
variant allele frequency (VAF), defined as the ratio of the number of
reads supporting the variant allele to the total depth at that
location. For the analysis, the threshold for variant allele frequency,
VAF_th considered is 0.2. It is a useful parameter for filtering false
positives resulting from sequencing errors.
The prediction accuracy of SeqVItA is evaluated by computing the
Sensitivity (recall), Precision and F-score:
[MATH: Sensitivity= TPT<
/mi>P + FN
:MATH]
(5)
[MATH: Precision= TPTP + FP
mi> :MATH]
(6)
[MATH: F-score= 2 <
/mtext>× Sensitivity × PrecisionSensiti
mi>vity + Precis
mi>ion
:MATH]
(7)
where True Positive (TP) is defined as the number of sequence variants
correctly predicted, False Positive (FP) is the number of incorrect
variants calls made, and False Negative (FN) is the number of variants
missed, compared to the annotated list (Supplementary Tables [116]S1,
[117]S2).
Performance evaluation of SeqVitA
Detection of SNVs and small INDELs
We first discuss the performance of SeqVItA in detecting SNVs. In
Figure [118]2, the prediction accuracies in detecting homozygous
(“triangle”) and heterozygous (“square”) SNVs is depicted for two
different read lengths, 50 bp (“empty symbols”) and 100 bp (“filled
symbols”) and base quality score (≥15). As expected, the prediction
accuracies are better with longer reads (100 bp) compared to smaller
read lengths (50 bp), especially at lower sequencing coverage < 40×.
This may probably be due to reliable alignment in the case of longer
reads and suggests the need for higher sequencing coverage when the
read lengths are small. It is easier to detect homozygous SNVs at low
sequencing depth (20×), compared to the detection of heterozygous SNVs,
irrespective of read length. It is observed that precision ~1 in the
detection of both homozygous and heterozygous SNVs even at low
sequencing coverage (20×) and small read lengths (50 bp), indicating
the reliability of results. However, the recall values are seen to
approach 1 only for high coverage data (≥ 40×) (see Supplementary
Tables [119]S3, [120]S4).
Figure 2.
Figure 2
[121]Open in a new tab
Performance of SeqVItA in detecting SNVs in simulated data shown.
F-score values for detecting homozygous “triangle” and heterozygous
“square” SNVs with read length 50 bp (empty symbols) and 100 bp (filled
symbols). Minimum coverage threshold = 10 and Base quality ≥15.
In Figure [122]3, performance of SeqVItA in detecting homozygous and
heterozygous INDELs of varying sizes (1, 2, 5, and 10 bp) is depicted,
for two read lengths (50 and 100 bp), three sequencing depths (20 × ,
40 × , and 60 ×) and base quality score (≥15). The performance of
SeqVItA in detecting homozygous INDELs is summarized in Figure [123]3A.
In this case also the prediction accuracy is observed to be good for
high coverage data (≥40 ×) irrespective of the reads length. Poor
performance at lower sequencing depth (20 ×) is likely to be due to the
minimum coverage threshold of 10 reads not being met due to various
biases in NGS data, viz., mappability bias (in low complexity regions)
and GC bias. A clear dependence of the detection of longer INDELs on
the read length is observed. It may be noted in Figure [124]3 that
INDELs of 10 bp are missed with smaller read length (~50 bp), but are
detected by increasing the read length to 100 bp. On further increasing
the read length (200 bp) we were able to detect INDELs of larger sizes
as well, viz., 15, 20, and 25 bp (results not shown). This is because
the number of reads aligning in the vicinity of a large INDEL (≥10 bp)
is considerably reduced and longer read lengths can tolerate larger
gaps in the alignment, thus improving the performance. The performance
of detecting heterozygous INDELs in Figure [125]3B is observed to be
poor compared to that of homozygous INDELs, especially at 20 ×
sequencing coverage. We also observe dependence of prediction accuracy
on the base quality threshold. Contrary to the expectation, on
increasing the base quality cutoff value from 15 to 30, a dip in the
performance is observed, especially at lower sequencing coverage
(Supplementary Tables [126]S5–[127]S8). This is because a higher
threshold for base quality may result in fewer reads supporting
variant. These results suggest considering a higher base quality score
if the sequencing coverage is ≥ 40 × .
Figure 3.
[128]Figure 3
[129]Open in a new tab
Performance of SeqVItA in detecting INDELs in simulated data shown.
F-score values for predicting (A) homozygous (Homo) and (B)
heterozygous (Het) INDELs of various sizes: 1 bp (“diamond”), 2 bp
(“square”), 5 bp (“triangle”) and 10 bp (“circle”) for two read lengths
50 bp (empty symbols) and 100 bp (filled symbols). Minimum coverage
threshold = 10 and Base quality ≥ 15.
Minimum coverage threshold
Accurate detection of SNVs and small INDELs is affected by various
parameters and briefly discussed below. The distribution of reads is
not uniform along the genome due to various reasons, viz., PCR
artifacts leading to GC content bias, mappability bias, base quality,
sequencing error, etc. As a consequence, read depth is not uniform
across all the positions in the genome. Hence one needs to
appropriately set the minimum coverage threshold, too high values may
miss out true positive calls while too low values may increase the
number of false positives. For data samples with sequencing depth ≥ 40×
, a default minimum coverage threshold of 10 is observed to result in
high accuracy (Figures [130]2, [131]3), both in the detection of SNVs
and INDELs, irrespective of the size of reads or INDELs. However, the
performance is relatively poor in the case of samples with sequencing
depth 20 × . In this case, on decreasing the minimum coverage threshold
to 5 (1/4th of average sequencing depth), the performance is seen to
improve, especially for homozygous SNVs and short deletions (results
not shown).
Read length
It may be noted from Figures [132]2, [133]3 that the performance is
improved on using longer reads (100 bp) compared to 50 bp reads for
sequencing depth <60× , especially in the detection of heterozygous
SNVs and longer INDELs (≥5 bp). Large heterozygous indels (≥10 bp) are
not detected even at 60 × sequencing depth with 50 bp read lengths.
Base quality
Base quality score is defined by the Phred score is inversely
proportional to the probability of error in calling a base. Intuitively
we expect higher base score to result in reliable predictions. However,
we observe lower recall values on increasing the base quality score
(Supplementary Tables [134]S3–[135]S8). This is because the number of
reads available at a location are reduced, and if the sequencing
coverage is low, this may not pass the minimum coverage threshold
criteria required for variant calling.
Because of this inter-dependency between base quality score and minimum
thresholds coverages one needs to set these parameters judiciously
based on the available sequencing coverage of the data.
Comparison of SeqVitA with other tools
The performance of SeqVItA is compared with three popular sequence
variant detection tools, viz., BCFtools (Li, [136]2011), VarScan2
(Koboldt et al., [137]2012), and GATK HaplotypeCaller (McKenna et al.,
[138]2010) on simulated data. The default parameters used in each of
these four tools is summarized in Table [139]5. Performance is
evaluated in the detection of both homozygous and heterozygous SNVs and
INDELs (size ≤ 10 bp), at three sequencing depths (20 × , 40 × , and 60
×). Read length is 100 bp in simulated experiments and base quality
score is set to 15. Results are summarized in Figure [140]4 and in
Supplementary Tables [141]S9–[142]S14.
Table 5.
Various parameters considered for performance evaluation of SeqVItA on
simulated data with three other tools.
SeqVItA BCFtools VarScan2 GATK
Mapping quality bias correction Recalibration of mapping quality:
[MATH:
Mq′=Int-
mo>MqInt×Int
:MATH]
Mann–Whitney U-test – Wilcoxon rank sum test
Mapping quality cut-off 20 20 20 –
Minimum coverage cut-off 10 – 8 –
Base quality cut-off 15 15 15 15
Germline p-value cut-off 0.01 – 0.01 –
Strand bias Discard sites with < 10% or > 90% strandedness Mann–Whitney
U-test Discard sites with < 10% or > 90% strandedness Fisher exact test
[143]Open in a new tab
Figure 4.
Figure 4
[144]Open in a new tab
Performance comparison of SeqVItA with BCFtools, VarScan2 and GATK on
simulated data at three sequencing depths 20 ×, 40 ×, and 60 × in
detecting homozygous (Homo) and heterozygous (Het) (A) SNVs, (B,C)
insertions (Ins), and (D–E) deletions (Del). Read length = 100 bp, and
base quality threshold ≥ 15.
In Figures [145]4A–E, we observe that the overall performance of
SeqVItA is comparable to the other state-of-the-art tools in the
detection of homozygous SNVs. BCFtool performs best in detecting
heterozygous SNVs (at low coverage, 20×), while the performance of
SeqVItA is better in detection of INDELs. It may be noted in Figures
[146]4B–E that all the three tools failed to detect large INDELs
(homozygous and heterozygous), even at 60 × coverage. The reason
SeqVItA is able to detect larger INDELs is due to the recalibration of
the mapping quality (equation 1) that handle mappability bias in the
pre-processing step. Without pre-processing step, SeqVItA also fails to
identify large INDELs. In general, performance of all the tools is
better in detecting homozygous INDELs compared to heterozygous INDELs.
Computation time
SeqVItA is computationally very efficient compared to VarScan2 and GATK
due to parallel implementation of the variant calling step using
OpenMP. For a WES alignment file of size 14 GB (coverage: ~100×),
SeqVItA took ~40 min to detect both SNVs and INDELs and is about 3 ×
faster than GATK (~2 hrs), ~2× faster than VarScan2 (~ 1 h 15 min) and
is comparable to the time taken by BCFtools (~35 min). For comparing
the computational efficiency, all the three tools are installed on a
64-bit Ubuntu machine with an Intel i7-4700MQ @2.4GHz processor. The
minimum computational requirement for running SeqVItA is a desktop with
4 GB RAM running at 1.6 GHz and 500 GB disk space.
Cancer data
A total of 26 Liver cancer patient samples and their matched control
samples (average age: 52 years, range: 27–86 years) are downloaded from
Sequence Read Archive database (Project ID: SRP123031). It is
paired-end data of read length 76 bp obtained from targeted sequencing
of 372 gene panel for Hepatocellular carcinoma (HCC), a primary type
(70% cases) of liver cancer. The pre-processing of data is carried out
using NGC QC Toolkit and reads with quality scores < 30 are removed.
Due to an error in the pre-processing step, two samples (patients 3 and
25) are discarded, leaving 24 HCC patient samples for subsequent
analysis. The pre-processed reads are then aligned with the human
reference genome (hg19 assembly) using Bowtie2. Post-alignment
processing is carried out using Picard tools
([147]http://broadinstitute.github.io/picard/) to identify and mark PCR
duplicates (which are then ignored in the variant calling step). For
each processed file, sequence variants are identified using “somatic”
module of the pipeline and analyzed using the “annotate” module. Reads
with mapping quality, Mqread < 20 are discarded, mapping quality
correction was applied (using Mqcorr = 50 in eqn 1), and sites with
minimum coverage, RD_th (≥50) in both tumor and normal samples are
considered for variant calls.
Mutation landscape of sequent variants in liver cancer patients
The variants detected by the Somatic module are labeled germline,
somatic, LOH and “Unknown” for the 24 paired tumor-control samples. As
expected, majority of the predicted SNVs in the 24 patients are
germline variants (~90%), followed by LOH (~4.9%) and somatic (~4.1%)
variants, with an average of 2007 high confident SNVs per tumor sample.
Small INDELs of size ranging from 2 to 27 bp are also identified in
these tumor samples with the number of insertions (97 per sample,
50.5%) and deletions (95 per sample, 49.5%) nearly equal. On average,
182 high confident small INDELs are predicted per sample and of these,
like SNVs, majority of the INDELs are of germline origin (~85.5%),
followed by LOH (~8%) and somatic (~5.5%) (Supplementary Table
[148]S15).
Analysis of somatic sequence variants
Since somatic sequence variants are most likely to be responsible for
tumorigenesis, these are filtered for further functional analysis using
the “splitVCF” function and their distribution in the genic regions is
identified (Supplementary Figure [149]S6). It is observed that a large
fraction of the variants lie in the intronic regions (59%) followed by
promoter (18%), 3′ UTR (6%) and 5′ UTR (2%) regions. Among the variants
in the coding regions, most common are missense variants (10%),
followed by synonymous (4%), nonsense (0.7%), frameshift (0.5%), and
splice site (0.2%) variants.
The waterfall plot in Figure [150]5 exhibits the mutational landscape
of genes associated with recurrent sequence variants (identified in at
least 15% of the patients). It is observed that the tumor suppressor
gene TP53 carries mutations in 13 out of 24 samples (54%) though the
location and/or type of mutation vary from patient-to-patient. For
example, patient number 9, 18, and 22 exhibit nonsense mutation C>A, at
locations chr17:7577046, chr17:7579521, and chr17:7578188,
respectively; patient number 1 and 2 exhibit frameshift mutations
(G[AA/-], chr17:7578212 and C[A/-], chr17:7578231, respectively), while
patients 5 and 26 have mutations in the promoter regions of TP53 gene
(A>C, chr17:7577407, G>A, chr17:7577427 in patient 5 and G>T,
chr17:7579619, A>G, chr17:7579825, in patient 26). In the remaining
patients, the variants are observed in the intronic regions of TP53
gene.
Figure 5.
[151]Figure 5
[152]Open in a new tab
Mutational landscape of somatic sequence variants identified in 24 HCC
patient samples (intronic and intergenic variants excluded). Each
column corresponds to each patient sample and each row represents a
gene.
We also observe other well-known HCC associated genes, e.g., NCOR1
(38%), AXIN1 (21%), MDM2 (21%), CTNNB1 (21%), MTOR (17%), mutated in a
significant number of patients. Gene NCOR1 is known to exhibit strong
tumor suppressor activity, preventing tumor cell invasion, growth, and
metastasis. Genes TP53 and MDM2 participate in G2/M DNA damage
checkpoint of the cell cycle, while genes AXIN1 and CTNNB1 are involved
in Wnt/β-catenin signaling pathway, and MTOR mediates cell
proliferation and differentiation through MAPK signaling pathway. The
potential role played by these pathways in cancer is well-studied and
aberrations in these candidate genes are known to initiate liver
tumorigenesis (Guichard et al., [153]2012; Meng et al., [154]2014, p.
2; Satoh et al., [155]2000; Ou-Yang et al., [156]2018, p. 1). Out of
nine genes reported to be involved in HCC in the peer-reviewed
database, Atlas of Genetics and Cytogenetics in Oncology and Hematology
(Huret et al., [157]2013), seven have been identified in few of these
24 HCC patients. These include tumor suppressors TP53 (54%), AXIN1
(21%), SMAD2 (13%), CDKN2A (13%), RB1 (8%), and oncogenes CTNNB1 (21%)
and CCND1 (8%).
Principal component analysis based on mutated genes
To explore if the 24 HCC patients share the same set of mutated genes
or they exhibit different mutational profile, we carried out principal
component analysis on these genes. The objective is to identify groups
of patients sharing common variants and disease-causing pathways.
In the PCA plot in Figure [158]6, we observe that majority of patients
form a single group, while patient 19, 14, and 22 show deviation from
this group and patient 26 is a clear outlier. Analysis of Patient 26
data resulted in a completely different profile with a much larger set
of variants compared to the rest of the 23 patients. It had 787
sequence variants across 156 genes, while the average number of
variations observed in other patients ranges from 29 to 126 variations.
Accumulation of mutations is a known somatic mutational signature in
liver cancer (Letouzé et al., [159]2017) and noting the age of patient
26 (86 yrs), it is not surprising. Analysis of the remaining 23 patient
samples clearly also revealed the dependence of number of somatic
mutations as a function of age. On average, patients with age ≥ 55
years exhibited ~160 somatic mutations, while patients < 55 years
showed ~57 mutations per patient. Patient 19 and patient 22 also showed
relatively higher number of mutations, 124 and 126, respectively
(Patient 19: 56 yrs and Patient 22: 60 yrs). However, the set of
mutated genes is very different in the two cases. Patient 14 exhibited
slightly larger number of variants (74) and had a good overlap with
that of Patient 22. This analysis indicates that though all the
patients are diagnosed with HCC, they exhibit different genetic
profiles, suggesting that same drug/dosage may not be suitable to all
the 24 patients. Thus, there clearly exists a need for personalized
screening of the genetic variants and associated functional impact for
accurate treatment. Based on PCA, the 24 patients are clustered in four
groups (G1: Patient 26, G2: Patient 19, G3: Patients 22 and 14, and G4:
remaining 20 patients). We first discuss below the analysis of the
larger group of 20 patients to understand the role of common mutations
in HCC.
Figure 6.
Figure 6
[160]Open in a new tab
Clustering of HCC patients based on somatically mutated genes.
Recurrent somatic variants and their role in liver cancer
To identify common somatic variants and associated genes in group G4
patients, genes carrying mutations in at least 25% of the patient
samples are considered. These common set of mutated genes, summarized
in Table [161]6, are probable diagnostic and prognostic markers for HCC
patients of this group (complete listing of sequence variant hotspots
is given in Supplementary Table [162]S16). These span exons, introns,
promoters, or 5′ and 3′ untranslated regions, and may directly or
indirectly contribute to liver tumorigenesis. It is observed that not
all patients carry the same mutation in a gene. For e.g., the
characteristic TP53 mutation in 7th exon (C>A, chr17:7577534, R249S) is
observed only in two patients, 7 and 12 (i.e., 20% prevalence).
Deletions, leading to frameshift (patients 1 and 2), and diverse
substitutions in coding regions are observed in remaining patients,
while patient 5 is observed to carry two intronic variations. The
direct role played by these TP53 variants in HCC is still unknown.
Similarly, location and type of variations in FGFR1, FANCD2, and
MIR1278 (35%), JAK1 (30%), and NCOR1, NUP93, XPO1, SDHC, and TSC2 (25%)
genes are observed to vary from patient-to-patient (see Supplementary
Table [163]S16).
Table 6.
Summary of recurrent genes exhibiting somatic mutations in at least 25%
of liver cancer patients.
Gene No. of patients with SVs Patient IDs Total No. of SV hotspots
Total SNVs Total INDELs
TP53 10 1,2,5,6,7,9,10,12,18,24 10 8 2
FGFR1 7 1,7,11,13,21,23,24 5 5 0
FANCD2 7 2,7,8,9,15,20,21 5 4 1
MIR1278 7 1,6,7,9,10,17,21 7 7 0
JAK1 6 2,5,8,9,11,23 5 5 0
NCOR1 5 1,2,16,23,24 9 9 0
NUP93 5 2,6,16,23,24 4 2 2
XPO1 5 2,5,9,11,17 3 2 1
SDHC 5 7,9,12,18,20 3 3 0
TSC2 5 7,11,13,18,23 4 4 0
[164]Open in a new tab
Based on the interactions extracted from STRING database (Szklarczyk et
al., [165]2017) between these mutated genes and pathway enrichment, we
observe that cell cycle (FDR: 0.007) and PI3K-AKT (FDR: 0.002) pathways
are overrepresented in this group, as shown in Figure [166]7. Genes
TP53, FGFR1, JAK1, and TSC2 participate in PI3K/AKT pathway, an
important pathway involved in cell growth and proliferation, and
activated in many tumor types including HCC. It may be noted that the
tumor suppressor TP53 affects both the pathways. Dysregulation of cell
cycle plays a key role in promoting liver carcinogenesis through
evasion of growth suppressors (TP53 and NCOR1), sustaining
proliferative signaling (FGFR1, XPO1, and TSC2), chromosomal
instability (FANCD2 and NUP93), invasiveness, survival and metastasis
(Bisteau et al., [167]2014). Out of 20 patients, 14 show diverse
mutations (coding or intronic) in genes that directly participate in
PI3K/AKT pathway.
Figure 7.
Figure 7
[168]Open in a new tab
Interaction between recurrently mutated genes from STRING database.
Pathway enrichment analysis of these mutated genes indicate that cell
cycle (shown in red) and PI3K/AKT (shown in blue) pathways are
affected.
Thus, we see that though the 20 patients of group G4 carry different
SVs, these affect mainly cell cycle and PI3K/AKT pathways. Our analysis
also revealed some novel variants, not yet associated with HCC; G>A,
chr1:193104827 (5′ UTR) in MIR1278 (7 patients), G[GAAAATC/-],
chr2:61713209 (intron) in XPO1 (5 patients); C>G, chr1:161332346 (3′
UTR) in SDHC (4 patients), C>T, chr8:38282294 (intron), and C>G,
chr8:38272542 (intron) in FGFR1 (4 patients each). These may be
probable prognostic markers and their role in HCC may be further
investigated on larger sample data.
Comparative analysis of patient-specific sequence variants
We next show functional analysis of sequence variants identified using
SeqVItA in patient 9 (a representative of large cluster) and patients
19 and 22 (exhibiting variants differing from that of large cluster)
i.e., a comparative analysis of the member of the three groups G2, G3,
and G4. In Table [169]7, the variant genes and their functional impact
on pathways and drug response are summarized for patients 9, 19, and
22. The first striking difference observed is the number of genes
carrying variants. It is much lower in patient 9 (39 yrs) compared to
patients 19 (56 yrs) and 22 (60 yrs), probably because of the
difference in their age. Further, they share very few common mutants.
Below we discuss the analysis of “High” and “Medium” priority genes to
understand pathways and processes affected that may lead to
tumorigenesis in the three groups (complete annotations in
Supplementary [170]Table 2.XLSX).
Table 7.
Summary of sequence variants and their functional role identified in
HCC patient 9 (39 yr old, male), patient 19 (56 yr old, male), and
patient 22 (60 yr old, male).
Patient Number 9 19 22
Number of mutated genes 44 77 69
High/Medium priority genes TP53, FANCD2, SDHC, MTOR, TPMT, DNMT1,
GSTA1, CYP2B6 (8) CTNNB1, MSH1, FBXO11, MSH6, MAP3K1, KMT2C, BRCA2,
NCOR1, BCR, ARID2, CARD11, ERG, GRM3 (13) SDHC, ATR, FCGR2B, ESR1, HGF,
CDKN2A, ATM, TSHR, TP53, NCOR1, NF1, KEAP1, BCR, MSH2 (14)
Known oncogenes MTOR CTNNB1, BCR, CARD11, ERG CDKN2A, BCR
Known tumor suppressors TP53, DNMT1 NCOR1, BRCA2, KMT2C, ARID2 SDHC,
ATR, ESR1, ATM, TP53, NCOR1, NF1, KEAP1
Key pathways affected TP53: cell cycle MTOR: Upregulation frequently
observed in HCC, MTOR –| PTEN, IGF and EGF pathways DNMT1: Methylates
PTEN promoter –| PTEN activation of PI3K/AKT/mTOR pathway CTNNB1: Wnt
pathway → proliferation and survival
FBXO11: Tumor initiation & progression MSH1 & MSH6: DNA mismatch repair
MAP3K1: JNK & Erk pathway → proliferation
KMT2C, ARID2: Chromatin modifications MAP3K1, MSH6, BRCA2, NCOR1,
CTNNB1, BCR: Cell cycle GRM3: GPCR signaling pathway → proliferation
KEAP1: Oxidative stress → proliferation and survival in HCC
HGF: Overexpressed HGF binds with c-Met proliferation & survival in HCC
through c-MET signaling pathway CDKN2A: G1/S cell cycle ESR1: Estrogen
signaling pathway → proliferation
TP53, CDKN2A, ATR, ATM: p53 signaling pathway, replicative senescence,
cell cycle
Variant-Drug associations (PharmGKB) Mutations in GSTA1 & CYP2B6 affect
enzymatic activity of drugs → lower efficacy CTNNB1(A>G):
Ethnic-specific, wild-type (AA) associated with better response to CTD
therapy, not significant in heterozygous condition ESR1: Increased risk
of azoospermia in childhood cancer survivors when treated with
alkylating agents and cisplatin
[171]Open in a new tab
Patient-specific sequence variants
We observe different set of oncogenes and tumor suppressor genes
mutated in the three patients. Different signatures of somatic
mutations indicate probable differences in the processes underlying the
initiation, growth and survival of tumorigenesis in these patients.
From Table [172]7, we observe that in patient 9, mutations in oncogene
mTOR (nonsense mutation) and tumor suppressors TP53 (nonsense mutation)
and DNMT1 (2 bp intronic deletion) are likely candidates for
tumorigenesis. Genes TP53 and mTOR participate in PI3K/AKT pathway
involved in cell growth and proliferation, and are known to be
activated in many tumor types including HCC. Mechanistic target of
rapamycin (MTOR) is a highly conserved serine/threonine protein kinase
that plays a crucial role in cell proliferation, differentiation,
metabolism, and aging via two structurally and functionally distinct
multi-protein complexes: mTORC1 and mTORC2. In earlier studies mTOR
pathway has been shown to be upregulated in ~50% liver cancer patients
and associated with deregulation of PTEN, IGF, and EGF pathways, key
players in HCC (Matter et al., [173]2014). Tumor suppressor TP53
participates in G2/M DNA damage checkpoint and its activity is arrested
due to a nonsense mutation in patient 9. This affects the p53 signaling
pathway which responds to DNA damage by arresting the cell cycle and
facilitates DNA repair by transactivating DNA repair genes. DNA
methylation is known to play an important role in epigenetic gene
regulation and aberrant DNA methylation patterns are commonly found in
tumors. The other tumor suppressor gene, DNA methyltransferase 1
(DNMT1) is shown to hypermethylate promoter region of tumor suppressor,
PTEN, resulting in decreased activity of the gene in a study on rat
model (Bian et al., [174]2012). Decreased activity of PTEN leads to
triggering of PI3K/AKT/mTOR pathway and uncontrolled growth and
proliferation of tumor cells in the liver (Golob-Schwarzl et al.,
[175]2017). The 2 bp deletion observed in the intronic region of DNMT1
gene in patient 9, is reported in COSMIC to be participating in
adenocarcinoma, though indicated to be benign in ClinVar. Our results
suggest that the effect of this variant in HCC may be further
investigated. Thus, we observe that the major pathway affected in
patient 9 is PI3K/AKT/mTOR signaling pathway.
The mutational profile of Patient 19 includes four oncogenes CTNNB1
(missense/intronic), BCR (missense/synonymous/intronic), CARD11
(missense), ERG (missense) and four tumor suppressor genes NCOR1 (2
missense), BRCA2 (intronic), KMT2C (intronic), ARID2 (nonsense). The
missense mutation in CTNNB1 observed in Patient 19 is
well-characterized in several cancers, including HCC and affects
phosphorylation of β-catenin protein, preventing its degradation.
Accumulation of excessive β-catenin in cells pushes the protein into
the nucleus initiating uncontrolled growth and differentiation of tumor
cells via Wnt signaling pathway. Oncogene BCR is involved in activating
ERK pathway, a key kinase pathway that maintains cell cycle, and
aberrations in the gene are associated with proliferation,
differentiation, and inflammation. Protein encoded by CARD11 interacts
with BCL10, acts as a positive regulator of cell apoptosis and
activates the NF-κB pathway. The NF-κB protein complex acts as tumor
promoter in inflammation-associated cancers, including HCC. Oncogene
ERG and tumor suppressor gene NCOR1 are important transcriptional
regulators. While ERG participates in embryonic development, cell
proliferation, differentiation and apoptosis, NCOR1 interacts with
nuclear receptors and other transcriptional factors. NCOR1 is shown to
exhibit strong tumor suppressor activity, preventing tumor cell
invasion, growth, and metastasis in mouse models (Fozzatti et al.,
[176]2013). The mRNA expression levels of NCOR1 are identified to be
decreased in human liver cancer, either due to mutations or deletion of
the gene (Martínez-Iglesias et al., [177]2016). Tumor suppressor BRCA2
is involved in DNA damage repair, while KMT2C and ARID2 (frequently
mutated in HCC) participate in chromatin modifications. Gene KMT2C
encodes for an enzyme histone methyltransferase that methylate lysine 4
of histone H3 and is involved in epigenetic transcriptional activation.
Gene ARID2 is also involved in transcriptional activation and
repression and is required for the stability of SWI/SNF chromatin
remodeling complex SWI/SNF-B. Mutations in ARID2 have been directly
associated with hepatocellular carcinogenesis. Recently, histone
deacetylase inhibitors that affect the epigenetic pathway are used for
therapeutics in HCC. Thus, our analysis of mutational profile reveals
disruption of NF-κB, Wnt, and ERK signaling pathways, cell cycle,
epigenetic and chromatin modifications, responsible for HCC initiation
and progression in patient 19.
Patient 22 exhibits mutations in two oncogenes, CDKN2A and BCR, and
eight tumor suppressor genes, viz., SDHC, ATR, ESR1, ATM, TP53, NCOR1,
NF1, KEAP1. A detailed analysis of the variants in these genes reveals
that c-MET signaling pathway is affected via a synonymous SNV in
Hepatocyte growth factor (HGF), leading to initiation, proliferation,
and survival of hepatic tumor cells. HGF on binding with c-MET results
in a number of molecular events leading to the activation of MAPK (cell
proliferation), PI3K/AKT/mTOR (cell survival) and Rac1-Cdc42 (cell
mobility and cytoskeletal changes) signaling pathways (Goyal et al.,
[178]2013). A missense mutation in tumor suppressor, Kelch-like ECH
associated protein 1 (KEAP1) is also identified in patient 22, which
affects KEAP1-NRF2 pathway, a key regulator to cytoprotective
responses, oxidative, and electrophilic stress. Mutations in KEAP1 are
identified to disrupt the KEAP1-NRF2 regulatory system by increasing
NRF2 levels and thereby affecting cancer cell proliferation and
survival (Kansanen et al., [179]2013). In addition to oncogene CDKN2A
(missense), tumor suppressors TP53 (nonsense), ATM (splice-site
variant) and ATR (missense) participate in disruption of the cell
cycle. While TP53 participates in G2/M DNA damage checkpoint, CDKN2A
participates in G1/S DNA damage checkpoint of the cell cycle.
Aberration in these genes leads to uncontrolled cell proliferation,
increased survival, and genomic stability. Genes ATR and ATM are
involved in DNA damage response through phosphorylation of cell cycle
checkpoint kinases, CHK1, and CHK2 genes, respectively, in p53
signaling pathway. These mutations clearly indicate the disruption of
cell cycle pathway in patient 22 and numerous signaling pathways,
c-MET, MAPK, PI3K/AKT/mTOR, and p53.
The annotations for predicted variants using SeqVItA in patient 19 (114
somatic SNVs and 10 somatic INDELs) are compared with two annotation
tools, ANNOVAR and CADD (which also prioritizes the variants). The
output from variant calling step in SeqVItA was given as input to
ANNOVAR (after modifying it in desired format). Since a number of
resources integrated in ANNOVAR and SeqVItA are common, DNA
conservation scores obtained from MutationTaster, LRT, and PhyloP,
protein conservation scores from SIFT and PolyPhen2,
disease-association relationships from ClinVar, OMIM, and COSMIC, and
identification of known variants from dbSNP, identified by the two
tools were the same. In Patient 19, 25 variants reported to span
promoter regions in SeqVItA were not annotated in ANNOVAR, while
allelic frequencies across various ethnic groups (1000 Genome Project
and ExAc) were not provided by SeqVItA. All other gene-specific
variants had same annotations from the two tools. Major difference
between the annotations from SeqVItA and ANNOVAR is the
variant-gene-drug associations from PharmGKB in SeqVItA. This is a very
important feature and would assist the clinician in designing
individulized therapy based on the effect of variant on the drug
(efficacy, toxicity, and metabolism). Also, ANNOVAR does not prioritize
the variants. For this we compared the prioritized variant list of
patient 19 from SeqVItA with that provided by CADD. We observe that 9
out of 10 top-CADD scored variants were identified as High or Medium
priority variants by SeqVItA. These results clearly indicate the
reliability of SeqVItA with other state-of-the-art tools.
Mutation drug-response associations
Thiopurine S-methyltransferase (TPMT) enzyme catalyzes S-methylation of
aromatic and heterocyclic sulfhydryl compounds present in
immunosuppressants (e.g., azathioprine) and antineoplastic drugs (e.g.,
6-mercaptopurine and 6-thiogauanine). It is known to exhibit reduced
enzymatic activity in patients with specific genetic variations in 5th,
7th, and 10th exons, and is an important pharmacogenetics biomarker
(Asadov et al., [180]2017). Patient 9 carries a synonymous mutation
(G>A, chr6:18139214) in 6th exon for which no effect on TPMT enzyme
activity is reported, suggesting the suitability of these drugs in this
case. Glutathione S-transferase A1 (GSTA1) is involved in
detoxification of toxic molecules such as oxidative stress products,
prostaglandins, chemical carcinogens, and therapeutic drugs, and
mutation in this gene is associated with accumulation of toxic
molecules with high risk of liver cancer (Akhdar et al., [181]2016). In
PharmGKB four sets of drugs (or combination of drugs) associated with
this SNV are reported, viz., (i) doxorubicin, (ii) cyclophosphamide,
doxorubicin, prednisone, rituximab, vincristine, (iii) busulfan, and
(iv) cisplatin and cyclophosphamide. For the first two categories,
efficacy of the drug(s) is reported to be affected due to the variant
in GSTA1 (Rossi et al., [182]2009). Additionally, elimination
(metabolism) of drug busulfan, and toxicity or adverse drug response
(as anemia) of cisplatin and cyclophosphamide are lower in the variant
compared to wildtype GSTA1 (Khrunin et al., [183]2012; ten Brink et
al., [184]2013). Of these, doxorubicin and cisplatin are commonly used
therapeutics for HCC (Li et al., [185]2016), and may be given to
patient 9 with appropriate dosage. Cytochrome P45 2B6 (CYP2B6) is a
monooxygenase that catalyzes many reactions involved in drug metabolism
and specifically known to metabolize a number of xenobiotics including
antineoplastic drugs such as cyclophosphamide and ifosfamide,
antidepressant drugs such as bupropion and sertraline. A missense SNV
observed in exonic region of CYP2B6 gene (rs2279343) in patient 9 has a
high log-ratio test (LRT) score (~0.99) indicating it is highly
deleterious. In ClinVar and OMIM, the mutation is associated with
efavirenz drug response and lower doses of the drug are advised in
Japanese HIV patients with homozygous SNV to reduce drug toxicity
(Tsuchiya et al., [186]2004). Thus, we show that a detailed analysis of
the mutation profile of a patient can help in providing personalized
therapy.
In patient 19, an intronic variant in CTNNB1 (rs4135385) is associated
with drugs lenalidomide, cyclophosphamide, thalidomide, and
dexamethasone (CTD) (efficacy, toxicity/ADR), with an inhibitory effect
on tumor cell proliferation and inducing apoptosis in multiple myeloma
(level 3 confidence). The distribution of the genotype frequency is
population-specific, e.g., homozygous-reference (AA) is widely observed
in healthy individuals of Poland and Saudi Arabia, while
homozygous-variant (GG) is dominant in Chinese population. Efficacy of
these drugs is high when the genotype is wild-type (AA) and no
statistically significant response is observed for heterozygous
condition (AG). However, patients with wild-type allele develop
neutropenia as a side effect of the lenalidomide therapy (Butrym et
al., [187]2015).
Also, intronic variant (rs2207396) in ESR1 observed in patient 22 is
known to have an association with alkylating agents and cisplatin. In a
study (Romerius et al., [188]2011), the effect of this variant in ESR1
analyzed in cancer survivors showed an increased risk of azoospermia in
childhood cancer survivors when treated with alkylating agents and
cisplatin, specifically with heterozygous (GA) genotype compared to
homozygous (GG/AA) condition.
Though the mutations and corresponding drug responses in patient 9
(TPMT, GSTA1, and CPY2B6), patient 19 (CTNNB1) and patient 22 (ESR1)
are not directly shown in HCC, these associations indicate that along
with factors such as age, organ functions and tumor biology, genetic
constitution of the patients may also affect the efficacy of the drugs
used in treatment (20 to 95% variability). This clearly indicates the
need for screening the patient genotype prior to devising the treatment
regime.
Comparison with VarScan2 and mutect2
To compare the performance of SeqVItA on real data, the aligned BAM
files of patient 19 was considered as input to Mutect2 and the
corresponding mpileup format file as input to SeqVItA and VarScan2.
From Figure [189]8A, we observe a clear variation in the total number
of somatic variants predicted and their overlap between the three
tools: VarScan2 (52, 15), SeqVItA (114, 10), and Mutect2 (111, 12), for
SNVs and INDELs respectively. A similar observation was made by
Krøigård et al. ([190]2016) in performance evaluation of nine variant
callers for the detection of SNVs and small INDELs. From Figure [191]8B
we observe higher agreement between VarScan2 and SeqVItA results,
probably because of the same heuristic approach with hard cut-off
thresholds used in both the cases. The difference between their
prediction outputs is probably because of strandedness bias considered
in SeqVItA (but not in VarScan2). The mappability correction was not
applied in Mutect2 which may probably be one of the reasons for its
poor overlap with the other two tools, apart from the difference in the
approach. The variants detected by a single tool are likely to contain
many false positives and hence should be cross-validated either by
predictions form other tools or based on annotations.
Figure 8.
[192]Figure 8
[193]Open in a new tab
(A) Total number of somatic variants called and (B) Pair-wise agreement
(0–1 scores) between SNVs and INDELs predicted by SeqVItA, Mutect2, and
VarScan2.
Variants predicted by the three tools spanning protein-coding genes are
identified: SeqVItA (85 genes), Mutect2 (83 genes), and VarScan2 (51
genes). These genic-variants are further filtered based on the
functional relevance of the variants/genes in liver cancer. We observe
10 genes commonly identified by the three tools. These include missense
mutations in CTNNB1, CARD11, and GRM3 genes and nonsense mutation in
ARID2 gene at the same loci, significance of which has been discussed
in the previous section in Table [194]7. Additionally, SeqVItA and
VarScan2 share common mutations in KMT2C (intron), NCOR1 (missense),
and BCR (missense and synonymous) genes, but these are missed by
Mutect2. VarScan2 did not predict any variant that has a known direct
effect on the drug, while, Mutect2 identified a missense mutation
(rs1801394) in gene MTRR that is associated with drug methotrexate. The
heterozygous condition (identified in Patient 19) of this mutation is
associated with higher levels of drug toxicity in childhood acute
lymphoblastic leukemia and lymphoma (Huang et al., [195]2008). All the
above discussed variants had scaled-score > 20 in the CADD annotation
and prioritization tool, indicating their deleteriousness.
Conclusion
Precision medicine is an emerging approach that considers individual
genetic variability to explain differences in disease susceptibility,
progression, and reaction to drugs at the population level. For disease
prognosis, analysis of multiple genes quickly, and sensitively from
small sample quantities is desirable, which is now possible with the
advent of Next Generation Sequencing (NGS) techniques. However, to
achieve this, there is a need to decode large genetic data more rapidly
and accurately. Single nucleotide variants (SNVs) and small insertions
and deletions (INDELs) are the most prevalent form of genetic variants
and have been shown to play a potential role in the predisposition of
disease and contribute to variable drug response in individual
patients. In this study we show the efficacy of our tool SeqVItA in
detection and functional analysis of SNVs and small INDELs in NGS data.
Performance evaluations of SeqVItA on simulated data suggest high
sequencing depth (≥40 ×) and larger read length (≥100 bp) for reliable
prediction of variants. Compared to other popular methods for variant
detection, SeqVItA is able to accurately detect larger INDELs (5 bp)
because of mapping quality recalibration of reads in the pre-processing
step. In the analysis of 24 liver cancer patients, importance of
annotations extracted from various resources to prioritize the variants
is discussed. These are confirmed by screening the literature for their
role in liver cancer. Detailed comparative analysis of three patient
samples is carried out to show the applicability of the proposed tool
in personalized therapeutics. In the analysis of cancer data, one needs
to take care of purity, ploidy, and heterogeneity of tumor samples
which may affect the prediction accuracy of sequence variants. Further,
presence of larger structural variants such as copy number variations,
inversions and translocations should be taken into account as these may
affect the detection of small sequence variants. One major limitation
in SeqVItA's pipeline is non-availability of population-specific allele
frequencies (e.g., 1000 Genome project) which can help in screening for
rare variants. Currently in SeqVItA prioritization of predicted
variants is based on a simple cumulative scoring scheme. There is scope
for improving variant filtering using statistical or network-based
approaches. SeqVItA is actively developing and we hope to expand the
features of the annotation module to even include annotations for
variants from non-genic regions.
Author contributions
PD investigation, software, validation, writing original draft. SS
methodology, software, validation. NP conceptualization, supervision,
writing, review and editing.
Conflict of interest statement
The authors declare that the research was conducted in the absence of
any commercial or financial relationships that could be construed as a
potential conflict of interest.
Supplementary material
The Supplementary Material for this article can be found online at:
[196]https://www.frontiersin.org/articles/10.3389/fgene.2018.00537/full
#supplementary-material
[197]Click here for additional data file.^ (534.9KB, DOCX)
[198]Click here for additional data file.^ (74.5KB, XLSX)
References