Abstract

Background

   Protein-protein interactions (PPIs) are key to understanding diverse
   cellular processes and disease mechanisms. However, current PPI
   databases only provide low-resolution knowledge of PPIs, in the sense
   that "proteins" of currently known PPIs generally refer to "genes." It
   is known that alternative splicing often impacts PPI by either directly
   affecting protein interacting domains, or by indirectly impacting other
   domains, which, in turn, impacts the PPI binding. Thus, proteins
   translated from different isoforms of the same gene can have different
   interaction partners.

Results

   Due to the limitations of current experimental capacities, little data
   is available for PPIs at the resolution of isoforms, although such
   high-resolution data is crucial to map pathways and to understand
   protein functions. In fact, alternative splicing can often change the
   internal structure of a pathway by rearranging specific PPIs. To fill
   this gap, we systematically predicted genome-wide isoform-isoform
   interactions (IIIs) using RNA-seq datasets, domain-domain interaction
   data, and known PPIs. Furthermore, we constructed an III database
   (IIIDB) as a resource for studying PPIs at isoform resolution. To
   discover functional modules in the III network, we clustered the
   network and obtained 1025 isoform modules. To evaluate module
   functionality, we performed GO/pathway enrichment analysis for each
   isoform module.

Conclusions

   The IIIDB provides predictions of human protein-protein interactions at
   the high resolution of transcript isoforms that can facilitate detailed
   understanding of protein functions and biological pathways. The web
   interface allows users to search for IIIs or III network modules. The
   IIIDB is freely available at http://syslab.nchu.edu.tw/IIIDB.

Background

   Protein-protein interactions (PPIs) perform and regulate fundamental
   cellular processes. As a consequence, identifying the interacting
   partners of a protein is essential to understanding its functions. In
   recent years, remarkable progress has been made in annotating the
   functional interactions among proteins in the cell. However, in both
   experimentally derived and computationally predicted protein-protein
   interactions, a "protein" generally refers to "all isoforms of the
   respective gene." Yet, it is known that alternative splicing often
   impacts PPIs either by directly affecting protein interacting domains
   or by indirectly impacting other domains, which, in turn, impact PPI
   binding [1]. That is, alternative splicing can modulate PPIs by
   altering protein structures and domain compositions, leading to the
   gain or loss of specific molecular interactions that could be key
   links in pathways (reviewed in reference [2]). It is very likely that
   different isoforms of the same gene interact with different proteins
   and thus exert different functional roles. For example, BCL2L1 is
   alternatively spliced into two isoforms, Bcl-xL (long form) and Bcl-xS
   (short form) [3]; Bcl-xL inhibits apoptosis whereas Bcl-xS promotes
   apoptosis [4]. Vogler et al. reported that the interaction of Bcl-xL
   and BAK1 in platelets ensures platelet survival [5]. Therefore,
   comprehensively identifying protein-protein interactions at the
   isoform level is important to systematically dissect the cellular
   roles of proteins, to elucidate the exact composition of protein
   complexes, and to gain insights into metabolic pathways and a wide
   range of direct and indirect regulatory interactions.

   Thus far, a series of studies have systematically predicted PPIs
   [6-10] and established PPI databases, e.g., OPHID [11], POINT [12],
   STRING [10] and PIPs [7]. Apart from the IntAct database [13], which
   contains 116 human PPIs with isoform specification, none of these PPI
   databases currently has isoform-level PPI data. This is a major
   knowledge gap yet to be filled. The rapid accumulation of RNA-seq data
   provides unprecedented opportunities to study the structures and
   topological dynamics of PPI networks at isoform resolution. RNA-seq
   data provide two unique sources of information for isoform-isoform
   interaction (III) reconstruction: the absence or presence of specific
   isoforms under specific conditions, and the co-expression of two
   isoforms, which may contribute to their interaction propensity. In
   this study, we seized this opportunity to comprehensively predict the
   possible interactions between splicing isoforms by integrating a
   series of RNA-seq datasets with domain-domain interaction data and a
   PPI database. The resulting III network presents a high-resolution
   map of PPIs, which could be invaluable in studying biological
   processes and understanding cellular functions.

   In this report, we describe a database, IIIDB, for accessing and
   managing predicted human IIIs. In the IIIDB, users can separately
   access high-confidence and low-confidence predictions of human IIIs
   (see the detailed description in the Results section) and see the
   full evidence values for each predicted III. Figure 1 shows
   screenshots of the III search and isoform module search functions of
   the IIIDB web interface. Users can upload their own gene expression
   data for III prediction and then download the predicted results. The
   search function has three major parts: high-confidence interaction
   prediction search, low-confidence interaction prediction search, and
   isoform module search. The IIIDB provides an auto-complete function
   in all search functions: users can enter a gene symbol or gene ID in
   the auto-complete field, which provides an interface to quickly find
   and select matching values.

Figure 1.

   Screenshots of the isoform interaction and module search functions in
   the IIIDB. (A) Users can upload their own gene expression data for
   III prediction. (B) The IIIDB shows the top 100 IIIs and uses these
   interactions to construct an III network. Users can download the
   complete prediction result as a file. (C) In the interaction search
   function, users can input a gene symbol or gene ID to search for
   associated IIIs and isoform modules. The interaction search section
   provides an interface for searching high-confidence (score > 2.575)
   or low-confidence (score > 1.692) predictions. The resulting III
   table shows full evidence values, including Pearson correlations for
   the 19 RNA-seq datasets and the domain interaction score. (D) The
   result page for module search includes a user-friendly network graph
   and the pathway/GO enrichment analysis results.

   The IIIDB allows users to easily search IIIs and isoform modules, and
   provides the evidence that led to each III prediction. To visualize
   the interactions of the isoforms of the input gene, we integrated
   CytoscapeWeb [14] to generate an interactive web-based III network
   (Figure 1B). Interestingly, different isoforms of the same gene can
   belong to different isoform modules, which may open a new door to
   studying the differential functionality of a gene's isoforms. The
   IIIDB also provides GO/pathway enrichment analysis results for each
   isoform module, which help biologists gain biological insight into
   network modules at the isoform level.

Results

   To obtain isoform annotations for the human genome, we used the NCBI
   Reference Sequence (RefSeq) mRNAs as the transcriptome annotation
   [15]. Figure 2 shows the framework of III prediction. We employed a
   logistic regression approach with 19 RNA-seq datasets (Table 1) and
   the domain-domain interaction database to infer IIIs. To ensure
   consistency with known PPIs, given an III prediction between isoforms
   I1 and I2, we keep the prediction only if the gene symbols of I1 and
   I2 have a PPI in the IntAct database.
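
   The gene-level PPI filter described above can be sketched as follows.
   This is a minimal illustration under stated assumptions, not the
   authors' implementation: the function names, the toy isoform labels
   and the gene names are all hypothetical, and PPIs are treated as
   undirected gene pairs.

```python
def gene_pair(gene_a, gene_b):
    """Order-independent key for an undirected gene pair."""
    return frozenset((gene_a, gene_b))


def filter_by_known_ppi(predictions, isoform_to_gene, known_ppis):
    """Keep predicted isoform pairs whose genes form a known PPI.

    predictions: list of (isoform1, isoform2, score) tuples
    isoform_to_gene: dict mapping isoform ID -> gene symbol
    known_ppis: iterable of (gene1, gene2) pairs (e.g. from a PPI database)
    """
    ppi_keys = {gene_pair(a, b) for a, b in known_ppis}
    kept = []
    for iso1, iso2, score in predictions:
        key = gene_pair(isoform_to_gene[iso1], isoform_to_gene[iso2])
        if key in ppi_keys:
            kept.append((iso1, iso2, score))
    return kept


# Toy data (hypothetical identifiers, for illustration only).
toy_isoform_to_gene = {"A.1": "GENE_A", "A.2": "GENE_A",
                       "B.1": "GENE_B", "C.1": "GENE_C"}
toy_ppis = [("GENE_B", "GENE_A")]
toy_predictions = [("A.1", "B.1", 3.1), ("A.2", "C.1", 2.9)]
toy_kept = filter_by_known_ppi(toy_predictions, toy_isoform_to_gene, toy_ppis)
```

   In this toy run only the pair ("A.1", "B.1") survives, because GENE_A
   and GENE_C have no gene-level PPI.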

Figure 2.

   We performed logistic regression with 19 RNA-seq datasets and the
   domain-domain interaction database to construct IIIDB. In RNA-seq data
   processing, Bowtie2 and eXpress were used to calculate isoform
   expressions. To confirm with PPI, given an III prediction I1 and I2, we
   only keep this isoform interaction if the gene symbols of I1 and I2
   have PPI in IntAct database.

Table 1.

   19 RNA-seq datasets from SRA

   ID   SRA ID     # Exp  Title
   d1   ERP000546  48     Illumina bodyMap2 transcriptome
   d2   SRP005169  41     Widespread splicing changes in human brain development and aging
   d3   SRP005408  31     Gene expression profile in postmortem hippocampus using RNAseq for addicted human samples
   d4   SRP010280  31     Integrative genome-wide analysis reveals cooperative regulation of alternative splicing by hnRNP proteins
   d5   SRP002628  30     Comparative transcriptomic analysis of prostate cancer and matched normal tissue using RNA-seq
   d6   ERP000550  29     Complete transcriptomic landscape of prostate cancer in the Chinese population using RNA-seq
   d7   SRP005242  21     A Comparison of Single Molecule and Amplification Based Sequencing of Cancer Transcriptomes: RNA-Seq Comparison
   d8   SRP002079  20     GSE20301: Dynamic transcriptomes during neural differentiation of human embryonic stem cells
   d9   ERP000992  18     The effect of estrogen and progesterone and their antagonists in Ishikawa cell line compared to MCF7 and T47D cells
   d10  SRP000727  16     Alternative Isoform Regulation in Human Tissue Transcriptomes
   d11  SRP007338  16     GSE30017: Widespread regulated alternative splicing of single codons accelerates proteome evolution
   d12  SRP010166  16     GSE34914: Deep Sequence Analysis of non-small cell lung cancer: Integrated analysis of gene expression, alternative splicing, and single nucleotide variations in lung adenocarcinomas with and without oncogenic KRAS mutations
   d13  ERP000710  12     Transciptome profiling of ovarian cancer cell lines
   d14  SRP005411  11     RNA-Seq Quantification of the Complete Transcriptome of Genes Expressed in the Small Airway Epithelium of Nonsmokers and Smokers
   d15  SRP006731  11     GSE29155: RNA-Seq anlalysis of prostate cancer cell lines using Next Generation Sequencing
   d16  SRP013224  11     GSE38006: Next-generation sequencing reveals HIV-1-mediated suppression of T cell activation and RNA processing and the regulation of non-coding RNA expression in a CD4+ T cell line
   d17  ERP000418  10     Gene expression profiles between normal and breast tumor genomes
   d18  ERP000573  10     RNA and chromatin structure
   d19  SRP010483  10     GSE35296: The human pancreatic islet transcriptome: impact of pro-inflammatory cytokines

High-confidence and low-confidence prediction of IIIs

   In the IIIDB, we provide two III prediction sets derived from the
   logistic regression model: (a) high-confidence predictions, with logit
   score > 2.575 (precision 60%, recall 15%), comprising 4,476 IIIs; and
   (b) low-confidence predictions, with logit score > 1.692 (precision
   40%, recall 68%). In addition, for each known PPI, the isoform pair
   with the best logit score is included in the low-confidence set, so
   that every PPI has at least one isoform-isoform interaction. The
   low-confidence set comprises 54,605 IIIs (Figure 3).
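
   The two selection rules can be sketched as below. This is an
   illustrative sketch, not the IIIDB code: the function and variable
   names are hypothetical, the input is assumed to be already restricted
   to isoform pairs whose genes have a known PPI, and high-confidence
   pairs are assumed to also appear in the low-confidence set because
   they pass the lower cutoff.

```python
HIGH_CUTOFF = 2.575  # logit cutoff quoted in the text (high confidence)
LOW_CUTOFF = 1.692   # logit cutoff quoted in the text (low confidence)


def select_predictions(scored_pairs, isoform_to_gene):
    """Return (high, low) sets of isoform pairs.

    scored_pairs: dict mapping (isoform1, isoform2) -> logit score
    isoform_to_gene: dict mapping isoform ID -> gene symbol
    """
    high = {pair for pair, s in scored_pairs.items() if s > HIGH_CUTOFF}
    low = {pair for pair, s in scored_pairs.items() if s > LOW_CUTOFF}

    # Rescue rule: for every gene-level PPI, keep its best-scoring
    # isoform pair even if that score is below the low cutoff.
    best = {}
    for pair, s in scored_pairs.items():
        ppi = frozenset(isoform_to_gene[iso] for iso in pair)
        if ppi not in best or s > best[ppi][1]:
            best[ppi] = (pair, s)
    low |= {pair for pair, _ in best.values()}
    return high, low


# Toy data (hypothetical identifiers and scores).
toy_genes = {"a1": "A", "a2": "A", "b1": "B",
             "c1": "C", "c2": "C", "d1": "D"}
toy_scores = {("a1", "b1"): 3.0, ("a2", "b1"): 1.8,
              ("c1", "d1"): 0.5, ("c2", "d1"): 0.2}
toy_high, toy_low = select_predictions(toy_scores, toy_genes)
```

   Here ("c1", "d1") enters the low-confidence set only through the
   rescue rule, since it is the best-scoring pair for the C-D interaction.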

Figure 3.

   Precision-recall curve for the logistic regression model. At 15%
   recall, the model achieves 60% precision (high-confidence
   prediction); at 68% recall, it achieves 40% precision (low-confidence
   prediction).
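
   Each point on such a curve is the precision and recall obtained at
   one score cutoff. A minimal sketch of that computation is shown
   below; the function name and toy data are hypothetical, and labels
   are assumed to be True for gold standard positives and False for
   gold standard negatives.

```python
def precision_recall(scores, labels, cutoff):
    """Precision and recall when pairs with score > cutoff are predicted.

    scores: list of logit scores
    labels: list of bools (True = gold standard positive)
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > cutoff and y)
    fp = sum(1 for s, y in pairs if s > cutoff and not y)
    fn = sum(1 for s, y in pairs if s <= cutoff and y)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Toy data (hypothetical scores and labels).
toy_pr_scores = [3.0, 2.8, 2.0, 1.0, 0.5]
toy_pr_labels = [True, False, True, True, False]
pr_high = precision_recall(toy_pr_scores, toy_pr_labels, 2.575)
pr_low = precision_recall(toy_pr_scores, toy_pr_labels, 1.692)
```

   Lowering the cutoff trades precision for recall, which is the
   trade-off the two confidence levels expose.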

Isoform module discovery

   To discover functional modules in the III network, we applied the
   MODES network clustering method [16] to the low-confidence III
   network with the following parameters: minimum module size 3, maximum
   module size 30, and density cutoff 0.7. An important feature of MODES
   is that it can discover overlapping dense isoform modules, which
   allows one isoform to belong to multiple modules. We obtained 1025
   modules with an average size of 5.08 isoforms. To provide functional
   annotation and evaluate the functional enrichment of the modules, we
   performed enrichment analyses with the GO [17] and KEGG pathway [18]
   databases. These databases are annotated at the protein level and
   therefore provide only approximate functional annotation of isoforms.
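
   To make the size and density parameters concrete, the sketch below
   checks whether a candidate module satisfies them. It does not
   implement MODES itself. It assumes the common definition of density
   as the fraction of possible node pairs that are connected, and it
   assumes the cutoff is inclusive; both are assumptions, as are all
   names in the code.

```python
from itertools import combinations


def module_density(nodes, edges):
    """Fraction of node pairs within the module that are connected."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    edge_set = {frozenset(e) for e in edges}
    present = sum(1 for u, v in combinations(sorted(nodes), 2)
                  if frozenset((u, v)) in edge_set)
    return present / (n * (n - 1) / 2)


def passes_filters(nodes, edges, min_size=3, max_size=30, min_density=0.7):
    """Apply the size and density parameters quoted in the text."""
    size_ok = min_size <= len(set(nodes)) <= max_size
    return size_ok and module_density(nodes, edges) >= min_density


# Toy modules (hypothetical isoform labels).
toy_nodes = ["i1", "i2", "i3", "i4"]
dense_edges = [("i1", "i2"), ("i1", "i3"), ("i1", "i4"),
               ("i2", "i3"), ("i3", "i4")]           # 5 of 6 possible edges
sparse_edges = [("i1", "i2"), ("i2", "i3"), ("i3", "i4")]  # 3 of 6
dense_ok = passes_filters(toy_nodes, dense_edges)
sparse_ok = passes_filters(toy_nodes, sparse_edges)
```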

   To evaluate the significance, we generated random isoform modules
   matching the MODES isoform modules in number and size (i.e., 1025
   randomized isoform modules). Table 2 compares the enrichment results
   for the MODES modules and the randomized modules. The MODES isoform
   modules have markedly higher functional enrichment rates than the
   random modules (88.7% vs. 49.7% for GO and 36.1% vs. 10.1% for
   pathways), showing the strong biological relevance of the predicted
   modules. The MODES isoform modules were further used to build the
   isoform module database in the IIIDB, and all isoform modules are
   listed in Additional file 1. We also stored the GO and KEGG
   enrichment results for all isoform modules in the IIIDB to provide
   potential functional annotation. In the IIIDB web interface, Figure
   1D shows the result page of the isoform module search, including the
   network graph and the enrichment analysis results. On the module
   result page, users can click any gene symbol to iteratively search
   isoform modules, and can sort the module result table by clicking a
   column header.

Table 2.

   The enrichment rate of isoform modules based on GO and pathway
   enrichments

   Isoform modules  # modules  % modules enriched  % modules enriched
                               with GO^a           with pathway^b
   MODES modules    1025       88.7%               36.1%
   Randomization    1025       49.7%               10.1%

   ^a The modules enriched with GO term (P-value < 0.001).

   ^b The modules enriched with pathway (P-value < 0.01).
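
   The text does not state which statistical test produced these
   P-values. A one-sided hypergeometric test is a common choice for
   GO/pathway enrichment, so the sketch below uses it purely as an
   assumption; the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, drawn, hits):
    """P(X >= hits) for X ~ Hypergeometric.

    population: number of genes in the background
    annotated:  background genes carrying the term
    drawn:      genes in the module
    hits:       module genes carrying the term
    """
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(comb(annotated, i) * comb(population - annotated, drawn - i)
               for i in range(hits, upper + 1))
    return tail / total


# Toy example: background of 10 genes, 3 annotated with the term,
# a module of 3 genes in which all 3 carry the term.
toy_p = hypergeom_pvalue(10, 3, 3, 3)
toy_enriched = toy_p < 0.01  # pathway threshold from footnote b
```

   Under this assumption, a module counts as enriched when at least one
   term reaches the threshold in the footnotes, and the enrichment rate
   in the table is the fraction of modules for which that holds.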

Integrating diverse data sources for III prediction

   Currently, most PPI databases do not provide information at the level
   of isoforms, which presents a challenge for constructing a gold
   standard positive set (GSP) for III prediction. Fortunately, in the
   June 2013 version of the IntAct database
   (http://www.ebi.ac.uk/intact/) [13], we identified 116 human PPIs
   with isoform specification (out of a total of 43,508 distinct human
   PPIs). For example, IntAct records an III between P29590-5 and
   P03243-1, which correspond to the 5th isoform of the protein P29590
   and the 1st isoform of P03243. In addition, to obtain more IIIs for
   the GSP, we applied the following rule: given a PPI between proteins
   P1 and P2, if P1 and P2 each have only a single isoform, we also add
   the pair to the GSP. This resulted in 11,356 IIIs in the GSP set.
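
   The GSP construction rule can be sketched as follows. This is an
   illustration under assumptions, not the authors' pipeline: the
   function name and the toy genes and isoform labels are hypothetical,
   and interactions are treated as undirected pairs.

```python
def build_gsp(ppis, gene_to_isoforms, isoform_level_pairs):
    """Gold standard positives as a set of undirected isoform pairs.

    ppis: iterable of (gene1, gene2) known interactions
    gene_to_isoforms: dict mapping gene -> list of its isoform IDs
    isoform_level_pairs: interactions already annotated at isoform level
    """
    gsp = {frozenset(pair) for pair in isoform_level_pairs}
    for g1, g2 in ppis:
        iso1 = gene_to_isoforms.get(g1, [])
        iso2 = gene_to_isoforms.get(g2, [])
        # A PPI is unambiguous at isoform level only when both
        # partners have exactly one isoform.
        if len(iso1) == 1 and len(iso2) == 1:
            gsp.add(frozenset((iso1[0], iso2[0])))
    return gsp


# Toy data (hypothetical identifiers).
toy_gene_isoforms = {"G1": ["g1.a"], "G2": ["g2.a"],
                     "G3": ["g3.a", "g3.b"]}
toy_gene_ppis = [("G1", "G2"), ("G1", "G3")]
toy_annotated = [("g3.b", "g2.a")]
toy_gsp = build_gsp(toy_gene_ppis, toy_gene_isoforms, toy_annotated)
```

   The G1-G3 interaction is skipped because G3 has two isoforms, so the
   interacting isoform cannot be determined from the gene-level record.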

   The GSP covered 5,503 RefSeq IDs, and we used these RefSeq IDs to
   construct the gold standard negative set (GSN). The GSN was defined
   as isoform pairs in which one isoform was assigned to the plasma
   membrane cellular component and the other to the nuclear cellular
   component by isoform-specific subcellular localization, for which we
   performed sequence-based predictions using CELLO (subCELlular
   LOcalization predictor) [19]. To obtain accurate isoform-specific
   annotation, we used only the localization predictions that are
   consistent with UniProt GO annotations. This resulted in 36 RefSeq
   IDs for the plasma membrane cellular component and 739 RefSeq IDs for
   the nuclear cellular component. In addition, isoforms assigned to
   both the plasma membrane and the nuclear cellular component were
   excluded from the GSN.
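
   A minimal sketch of the GSN pairing is given below, with hypothetical
   names and toy labels. It pairs every membrane-only isoform with every
   nucleus-only isoform; with the counts quoted above this would give at
   most 36 x 739 = 26,604 candidate pairs, although the text does not
   state the final GSN size.

```python
from itertools import product


def build_gsn(membrane_isoforms, nuclear_isoforms):
    """Pair plasma-membrane isoforms with nuclear isoforms.

    Isoforms assigned to both compartments are excluded first.
    Returns a set of (membrane_isoform, nuclear_isoform) tuples.
    """
    membrane = set(membrane_isoforms)
    nuclear = set(nuclear_isoforms)
    ambiguous = membrane & nuclear
    return set(product(membrane - ambiguous, nuclear - ambiguous))


# Toy data (hypothetical identifiers); "x" is assigned to both.
toy_gsn = build_gsn({"m1", "m2", "x"}, {"n1", "x"})
```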

   To calculate precision and recall, we used timestamps to divide the
   GSP into training and test sets: an interaction with a timestamp
   after 1 January 2012 was assigned to the test GSP set (10,408 IIIs);
   otherwise, it was assigned to the training GSP set (948 IIIs). Once
   the GSP was determined, we used the RefSeq IDs covered by the GSP to
   build the GSN. Figure 3 shows the precision-recall curve for the
   logistic regression model.
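
   The time-based split can be sketched as below. The names and toy
   records are hypothetical, and an interaction dated exactly 1 January
   2012 is assumed to fall in the training set because the text says
   "after".

```python
from datetime import date

SPLIT_DATE = date(2012, 1, 1)  # cutoff date quoted in the text


def split_gsp(interactions):
    """Split (pair, timestamp) records into (training, test) lists."""
    training = [pair for pair, stamp in interactions if stamp <= SPLIT_DATE]
    test = [pair for pair, stamp in interactions if stamp > SPLIT_DATE]
    return training, test


# Toy records (hypothetical pairs and dates).
toy_records = [(("p1", "p2"), date(2010, 5, 3)),
               (("p3", "p4"), date(2012, 1, 1)),
               (("p5", "p6"), date(2013, 2, 14))]
toy_training, toy_test = split_gsp(toy_records)
```

   Splitting by date rather than at random means the test set contains
   only interactions reported after everything in the training set.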

Case studies

   To demonstrate the biological importance of the IIIDB, we searched
   for isoform-associated reports in the literature. Although
   isoform-specific protein function studies are very rare, we found
   isoform-specific biological evidence for BCL2L1, which validated our
   III predictions for BCL2L1. In addition, we found diverse biological
   functions for the Ras association domain family in our isoform
   modules.

BCL2L1

   BCL2L1 has two isoforms (Table 3): BCL2L1 isoform 1 is called Bcl-xL
   (long form) and BCL2L1 isoform 2 is called Bcl-xS (short form) [3].
   These isoforms play opposing roles in apoptosis: Bcl-xL inhibits
   apoptosis whereas Bcl-xS promotes apoptosis [4]. In our
   high-confidence prediction, BCL2L1 isoform 1 (Bcl-xL) interacts with
   BAK1, BAX and NLRP1 to inhibit apoptosis, but BCL2L1 isoform 2
   (Bcl-xS) does not (Table 3). In previous reports, fluorescence
   anisotropy, analytical ultracentrifugation, and NMR assays confirmed
   a direct interaction between Bcl-xL and BAK1 [5,20]. Vogler et al.
   also reported that the interaction of Bcl-xL and BAK1 in platelets
   ensures cell survival [5]. Edlich et al. reported that an interaction
   between Bcl-xL and BAX not only inhibits BAX activity but also
   maintains BAX in the cytosol [21]. In addition, Chang et al.
   demonstrated that Bcl-xL interacted with endogenous BAX in 293 cells,
   whereas no significant amount of BAX was detectable in the Bcl-xS
   immunoprecipitation [21], suggesting that Bcl-xS does not interact
   with BAX. Furthermore, Bruey et al. reported that Bcl-xL interacts
   with NALP1 (also known as NLRP1) to suppress apoptosis [22]. Thus,
   these previous biological studies support our III predictions for
   BCL2L1.

Table 3.

   BCL2L1 isoform interaction partners.

   Isoform                    RefSeq ID  mRNA length  Protein length  UniProt ID  Interaction partner (high confidence)
   BCL2L1 isoform 1 (Bcl-xL)  NM_138578  2575         233             Q07817-1    BAK1, BAX, NLRP1
   BCL2L1 isoform 2 (Bcl-xS)  NM_001191  2386         170             Q07817-2    BCL2

RASSF1

   RASSF1 is RAS association domain family member 1. RASSF1 has four
   isoforms, of which two belong to isoform modules (Mod-162 and
   Mod-384). Mod-162 consists of RASSF1 isoform A, RASSF5 and STK4,
   whereas Mod-384 consists of RASSF1 isoform C, RASSF3, RASSF4, SAV1,
   STK3 and STK4 (Figure 4). Interestingly, several studies reported
   that RASSF5 may interact with RASSF1 isoform A to suppress tumors
   [23-25], which supports Mod-162. On the other hand, RASSF1 isoform C
   may play a completely different role as an oncogene in high-grade
   tumors [26]. Although the function of RASSF1 isoform C is still not
   clear, RASSF1 isoforms A and C appear to have distinct functions. The
   IIIDB assigned RASSF1 isoforms A and C to Mod-162 and Mod-384,
   respectively, suggesting a new hypothesis about their roles in
   isoform-level modules.

Figure 4.

   Two isoform modules of RASSF1. (A) RASSF1 has four isoforms (isoform A,
   B, C and D), of which two isoforms belong to module Mod-162 and
   Mod-384. (B) The isoform module Mod-162 includes RASSF1 isoform A
   (NM_007182), RASSF5 and STK4. (C) The isoform module Mod-384
   includes RASSF1 isoform C (NM_170713), RASSF3, RASSF4, SAV1, STK3
   and STK4.

Methods

Isoform coexpression network construction

   To construct the isoform coexpression networks, we selected 19
   RNA-seq datasets with at least 10 experiments each from the Sequence
   Read Archive (SRA) database (Table 1) [27]. We ran eXpress [28] with
   the Bowtie2 aligner [29] to obtain isoform expression values. NCBI
   Reference Sequence (RefSeq) mRNAs with protein sequences were used as
   the transcriptome annotation, which included 31,454 RefSeq IDs (Jan
   2013 version) [15].
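
   Coexpression between two isoforms within one dataset is measured by
   the absolute Pearson correlation of their expression values, as
   described in the logistic regression section. A minimal sketch of
   that measure is shown below; the function name and the toy expression
   vectors are hypothetical, and returning 0.0 for a constant profile is
   an assumption made here to avoid division by zero.

```python
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile carries no signal
    return abs(sxy / sqrt(sxx * syy))


# Toy expression profiles across four experiments (hypothetical values).
toy_expr_a = [1.0, 2.0, 3.0, 4.0]
toy_expr_b = [8.0, 6.0, 4.0, 2.0]
toy_corr = abs_pearson(toy_expr_a, toy_expr_b)
```

   The two toy profiles are perfectly anti-correlated, so the absolute
   value treats them as strongly coexpressed.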

Logistic regression model

   The logistic regression model has been applied to PPI prediction
   [30,31], as it is suitable for describing the relationship between a
   binary response variable and a set of explanatory variables. We used
   a logistic regression approach to build the prediction model as
   follows:
   \[
   \mathrm{logit}(y_{ij}) = \alpha_0 + \alpha_1 E1_{ij} + \alpha_2 E2_{ij} + \cdots + \alpha_{19} E19_{ij} + \alpha_{20} DDI_{ij}
   \]

   where α_0, α_1, ..., α_20 are regression coefficients and y_ij is the
   probability of an interaction between isoform i and isoform j. DDI,
   E1, E2, ..., E19 are defined as follows: (a) Domain-domain
   interaction (DDI) score: for all combinations of human isoform pairs,
   if an isoform pair has a domain-domain interaction in the DOMINE
   database [32,33], we assign the DDI score according to the confidence
   level in the DOMINE database: high-confidence prediction, 3;
   medium-confidence prediction, 2; and low-confidence prediction, 1. If
   the isoform pair has several DDI scores, we take the highest score.
   (b) 19 RNA-seq datasets (E1, E2, ..., E19): the absolute values of
   the Pearson correlations for all isoform pairs, derived from each
   RNA-seq dataset. Since the RNA-seq datasets may differ in quality and
   data type, we used the logistic regression model to integrate the 19
   datasets, so that the fitted coefficient of each dataset reflects its
   quality.
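
   The scoring step can be sketched as follows. This is an illustration
   only: the coefficient values are made up, the function names are
   hypothetical, and a DDI score of 0 for pairs without any DOMINE
   interaction is an assumption that the text does not state.

```python
from math import exp

# DOMINE confidence levels mapped to the scores given in the text.
DDI_LEVEL_SCORE = {"high": 3, "medium": 2, "low": 1}


def ddi_score(levels):
    """Highest score among a pair's domain-domain interactions.

    Returns 0 when the pair has none (assumption, not from the text).
    """
    return max((DDI_LEVEL_SCORE[level] for level in levels), default=0)


def logit_score(alpha, coexpression, ddi):
    """alpha_0 + sum_k alpha_k * Ek + alpha_last * DDI.

    alpha: [alpha_0, alpha_1, ..., alpha_19, alpha_20]
    coexpression: absolute Pearson correlations, one per dataset
    """
    if len(alpha) != len(coexpression) + 2:
        raise ValueError("need one coefficient per feature plus intercept")
    weighted = sum(a * e for a, e in zip(alpha[1:-1], coexpression))
    return alpha[0] + weighted + alpha[-1] * ddi


def probability(score):
    """Invert the logit: y = 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + exp(-score))


# Toy example with made-up coefficients (not the fitted values).
toy_alpha = [-1.0] + [0.1] * 19 + [0.5]
toy_features = [0.5] * 19
toy_ddi = ddi_score(["low", "high"])
toy_logit = logit_score(toy_alpha, toy_features, toy_ddi)
```

   Because logit(y) = ln(y / (1 - y)), the cutoffs used in the Results
   correspond to interaction probabilities of roughly 0.93 (logit 2.575)
   and 0.84 (logit 1.692).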

Competing interests

   The authors declare that they have no competing interests.

Supplementary Material

   Additional file 1

   The isoform modules. There are 1025 isoform modules with gene symbols
   and RefSeq IDs.

Contributor Information

   Yu-Ting Tseng, Email: ting0514@gmail.com.

   Wenyuan Li, Email: wel@usc.edu.

   Ching-Hsien Chen, Email: chchencmc@gmail.com.

   Shihua Zhang, Email: zsh@amss.ac.cn.

   Jeremy JW Chen, Email: jwchen@dragon.nchu.edu.tw.

   Xianghong Jasmine Zhou, Email: xjzhou@usc.edu.

   Chun-Chi Liu, Email: jimliu@nchu.edu.tw.

Acknowledgements