Abstract

   More and more transcription factors and their motifs have been reported
   and linked to specific gene expression levels. However, focusing only
   on transcription is not sufficient for mechanism research. Most genes,
   especially in eukaryotes, are alternatively spliced to different
   isoforms. Some of these isoforms increase the biodiversity of proteins.
   From this viewpoint, transcription and splicing are two of important
   mechanisms to modulate expression levels of isoforms. To integrate
   these two kinds of regulation, we built a linear regression model to
   select a subset of transcription factors and splicing factors for each
   co-expressed isoforms using least-angle regression approach. Then, we
   applied this method to investigate the mechanism of myelodysplastic
   syndromes (MDS), a precursor lesion of acute myeloid leukemia. Results
   suggested that expression levels of most isoforms were regulated by a
   set of selected regulatory factors. Some of the detected factors, such
   as EGR1 and STAT family, are highly correlated with progression of MDS.
   We discovered that the splicing factor SRSF11 experienced alternative
   splicing switch, and in turn induced different amino acid sequences
   between MDS and controls. This splicing switch causes two different
   splicing mechanisms. Polymerase Chain Reaction experiments also
   confirmed that one of its isoforms was over-expressed in MDS. We
   analyzed the regulatory networks constructed from the co-expressed
   isoforms and their regulatory factors in MDS. Many of these networks
   were enriched in the herpes simplex infection pathway which involves
   many splicing factors, and pathways in cancers and acute or chronic
   myeloid leukemia.

Introduction

   Gene expression levels are highly dependent on the regulation of
   transcription factors which mainly bind to the near-promoter regions to
   facilitate or block the recruitment of DNA polymerase II (pol II) and
   other complexes. Some methods have been proposed to predict - gene
   expression using such binding information of transcription factors
   [29][1], [30][2]. Conlon et al. [31][2]suggested associating gene
   expression with the interaction strength between the upper stream of
   the gene and motifs of its transcription factors. This interaction was
   defined in terms of degree of matching and occurrences of binding
   sites. After that, a probabilistic approach was proposed to infer
   regulatory rules of transcriptional networks from gene expression data
   and DNA sequences [32][1]. This method started with finding
   co-expressed genes, and then extracted a large number of putative
   regulatory DNA motifs in these co-expressed genes. However, these
   methods only considered transcription levels extracted from gene array
   data and lost sight of other mechanisms such as alternative splicing.
   Moreover, these methods may suffer high false positive rate when mining
   transcription factor binding sites. To control the rate of false
   positive, integrating information from other kinds of data, like
   conservation data, is also necessary.

   Alternative splicing is one of the most versatile mechanisms of gene
   expression regulation and accounts for a considerable proportion of
   proteomic complexity in higher eukaryotes. The mRNA isoforms produced
   by this alternative processing comprise of different combination of
   exons, and may differ in structure, function, and other properties
   [33][3], [34][4]. Different mRNA isoforms are translated into different
   protein isoforms (if they exist) which may have related, distinct or
   even opposing functions.

   The alternative splicing process is regulated by many types of RNA
   binding proteins, especially the heterogeneous nuclear
   ribonucleoproteins (hnRNP) and the serine/arginine-rich (SR) family. In
   eukaryotic cells, hnRNP proteins participate in almost all pre-mRNA
   processing steps including splicing, mature mRNA export, localization,
   translation, and stability [35][5], [36][6]. SR proteins are involved
   in regulating and selecting splice sites in eukaryotic mRNAs. These
   proteins, either ‘classical’ SR proteins or additional SR proteins
   [37][7], generally have at least one RRM (RNA recognition motif) domain
   for RNA binding and a C-terminus RS domain for controlling interactions
   with other proteins (including other SR proteins). In addition to
   alternative splicing, the SR proteins are also involved in mRNA nuclear
   export and mRNA translation. One of the most important characteristics
   of these splicing factors is their functional specificity. Many animal
   experiments suggest that the RNA binding ability of individual SR
   proteins are sequence-specific and their ability to regulate
   alternative splicing is different [38][8]–[39][10].

   These highly specific and non-redundant characteristics of splicing
   factors motivated researchers to look for association between
   abnormalities in SR proteins and the development of human cancers.
   Although the underlying mechanisms are elusive and need further
   studies, splicing factors that regulate specific pathways in diseases
   can be treated as putative markers, especially when their targets in
   these disease-related pathways have experienced alternative splicing
   and produced different protein isoforms.

   In this study, we built a systematic method to identify transcription
   factors and splicing factors that regulated genes to produce different
   RNA isoforms in diseases. Our framework consists of four steps. First,
   differentially expressed mRNA isoforms (DEIs) are extracted by
   comparing abnormal cells and controls using RNA-seq analysis tools
   [40][11]. These DEIs that may experience abnormal splicing are putative
   targets of splicing factors that might function abnormally in disease.
   Assuming that co-expressed genes have a good probability of being
   functionally related, we clustered these DEIs to co-expressed groups
   using hierarchical clustering method. These co-expressed isoforms may
   be regulated by the same group of transcription factors (TFs) and
   splicing factors (SFs). Since co-expressed mRNAs are more likely to
   have their promoter regions bound by common transcription factors, we
   constructed a dataset called promoter region dataset (PRD) to mine the
   TF-isoform interactions (for example, co-expressed isoforms may also
   have common splicing factors that bind to the regions near their
   splicing sites). We then constructed an exon-intron (centered at
   splicing sites and extending 200 bp on both sides) dataset (EID) to
   explore SF-isoform interactions. The binding strength of the TF-isoform
   interactions is quantified by scoring the transcription factor’s
   binding sites in the promoter region. Similarly, the binding strength
   of the SF-isoform interactions is defined by scoring the splicing
   factor’s binding sites in the exon-intron regions of pre-mRNA. To
   integrate both kinds of regulation, we built a linear regression model
   and selected a subset of transcription factors and splicing factors
   that can regulate the expression of co-expressed isoforms using
   least-angle regression (LARS) [41][12] selection approach.

   The proposed method was applied to a RNA-seq dataset comprising of 4
   myelodysplastic syndromes (MDS) samples (RAEB subtype) and 5 matched
   controls. MDSs are defined as clonal stem cell disorder characterized
   by ineffective hematopoiesis and impaired difference in some bone
   marrow lineages, leading to peripheral-blood cytopenias. According to
   the WHO [42][13], the main categories of MDS include refractory
   cytopenia with unilineage dysplasia (RA), refractory cytopenia with
   ringed sideroblast (RARS), refractory cytopenia with multi-lineage
   dysplasia (RCMD), refractory cytopenia with excess blasts (RAEB),
   5q-syndrome and unclassifiable MDS. In this paper, we focused on RAEB
   cases which are characterized by 5–20% myeloblasts in the marrow
   [43][14]. This is a higher risk subtype and likely to transform to
   acute myelogenous leukemia (AML).

   Some work studied the genetic abnormality of MDS with mutations of key
   genes [44][15], e.g. TP53 and RUNX, which are highly related with poor
   overall survival. Recently, the recurrent mutation of splicing factor
   U2AF1 has been validated for MDS patients [45][16]. In-vitro
   experiments demonstrated that the mutated U2AF2 enhances splicing and
   promotes exon skipping. These genetic aberrations would improve the
   prediction of prognosis and development of novel treatment. However,
   mutation detection cannot determine which genes are altered due to the
   abnormality of splicing factors. Our method can recognize not only
   isoforms that experience an abnormal splicing process but also their
   putative regulatory factors. The co-expressed isoforms and their
   transcription and inferred splicing factors comprise our transcription
   and splicing networks.

   To verify the biological significance of these regulatory networks
   composed of a group of co-expressed isoforms and their regulatory
   factors in MDS, we showed theses networks were significantly enriched
   (p<0.0001) in some known Gene Ontology (GO) biology processes and Kyoto
   Encyclopedia of Genes and Genomes (KEGG) pathways. Enrichment analysis
   demonstrated that more than half of these networks were enriched in
   herpes simplex infection due to involving many splicing factors. Six
   networks were significantly related with pathways in cancer, and five
   networks were significantly related with acute or chronic myeloid
   leukemia (AML).

Methods

   Our four-step framework ([46] Figure 1 ) starts with raw RNA sequencing
   (RNA-seq) data. First, differentially expressed isoforms (DEIs) are
   identified from a large amount of isoforms. We cluster these DEIs into
   co-expressed groups that may be regulated by common TFs and SFs ([47]
   Figure 1A ). At the second step, two datasets, promoter region data
   (PRD) and exon-intron data (EID), are constructed, where the PRD
   dataset is used for mining the interaction strength between TFs and
   isoforms and the EID is used for exploring the interaction strength
   between TFs and isoforms ([48] Figure 1B ). This step outputs two
   interaction strength matrices with rows corresponding to isoforms and
   columns corresponding to TFs or SFs. Then, by taking the interaction
   strength matrices as observations of explanatory variables, we build
   linear model to regress the expression levels of isoforms in a
   co-expressed group. To detect the most important factors that regulate
   a co-expressed group, a reliable model selection method called
   least-angle regression (LARS) is applied ([49] Figure 1C ). At the
   final step, the TFs and SFs selected by the LARS are linked with their
   target genes, forming regulatory networks ([50] Figure 1D ).

Figure 1. Flowchart of proposed method for constructing regulatory networks.

   [51]Figure 1
   [52]Open in a new tab

   Flowchart of proposed method for constructing regulatory networks: (A).
   Process raw RNA-seq data, find out deferentially expressed isoforms
   using Tophat and Cufflinks and cluster these isoforms to get gene
   cluster that may be regulated by same TFs and SFs. (B). Construct two
   dataset, promote region data (PRD) and exon-intron data (EID), for
   mining the interaction strength of the TF-isoform interactions and
   SF-isoform interaction. (C). Use interaction strength to predict the
   expression levels of isoforms in a co-expressed group. (D). Link
   model-selected TFs and SFs with their target genes.

Ethics Statement

   This study was approved by the Institutional Review Board of The
   Methodist Hospital, Houston, Texas, USA and the need for written
   informed consent from the participants was waived by the IRB.

Data Preprocessing and Co-expressed Isoforms

   RNA samples were prepared from 9 individuals, 4 samples from RAEB
   patients and 5 controls ([53]http://ctsb.is.wfubmc.edu/MDS/MDS.html).
   After sequencing using Genome Analyzer II (GAII) (Illumina, San Diego,
   CA), millions of reads were produced for each sample. The raw data were
   processed using a RNA-seq analysis pipeline [54][11]. Briefly, reads in
   the FASTA format were first aligned to whole genome using TopHat
   software. Since our sequencing data are pair-ended, the fragments were
   selected at 350 bp and the length of reads was 76, the inner distance
   between mate pairs was set as 198. The standard deviation for the
   distribution on inner distances between mate pairs was set as 30 based
   on our estimation. Then Cufflinks was called to estimate the expression
   level (RPKM, reads per kilobase of transcript per million mapped reads)
   of each isoform. Then, Cuffdiff was used to identify differentially
   expressed isoforms (using default parameters). This protocol returned
   1056 isoforms that were differentially expressed in MDS compared with
   controls. Before using these DEIs to build splicing networks, we first
   filtered out those isoforms that had not been validated at the protein
   level. Then, we checked the condition (control or MDS) in which an
   isoform was up-regulated and required that all FPKM values under this
   condition were higher than 5 [55][17].

   Although there are debates on whether co-expressed genes are
   functionally related, numerous studies have suggested that at least
   some co-expressed clusters function together [56][18]–[57][20]. Here we
   applied hierarchical cluster analysis to those differentially expressed
   isoforms to obtain co-expressed groups. The distances between genes
   were defined as Pearson correlation. Allocco et al. [58][21] concluded
   that genes with strongly co-expressed mRNAs were more likely to have
   their promoter regions bound by common transcription factors. However,
   this co-regulating effect was significant only when expression of these
   co-expressed mRNAs were highly correlated. To control the similarity
   level of expression profiles of isoforms in the same group, we limited
   the size of each group to be between 10 and 30 [59][19], [60][22]. In
   our MDS study, this resulted in 34 clusters ([61] Table 1 ).

Table 1. Regulatory networks found by our model.

   No Target Genes Transcription Factors Splicing Factors
   1 GNB1, GPATCH4, MRPL51, BRD4, ODC1, ATP5J, HNRNPD, RBBP7, CALM3,
   ANP32C, PSPC1, SETD3 Blimp-1, FOX factors, FOXP1a, Sp1, STAT5B MBNL1,
   PSF, Sam68, SF2, SRp38, YB-1
   2 PARK7, KDM1A, SSR2, DNTT, HNRNPA1L2, SUPT16H, SLIRP, YWHAE, TAF15,
   TOP2A, SRSF1, ATP5A1, EEF1B2, HMGN1, U2SURP, HSD17B4, CANX, MAP7D3,
   HNRNPM, BCL11A Egr-1, Evi-1, LXR direct repeat 4, MEF-2, Nkx2-5, Pax-6,
   RSRFC4, Sp1 ETR-3, hnRNPD, hnRNPI(PTB), hnRNPQ, MBNL1, PSF, SF2, SRp38,
   SRp40, hnRNPC/C1/C2
   3 TARDBP, HDAC1, ARHGDIB, ESD, CX-CR4, TTC1, HLA-DPB1, ANP32B, ATP1A1,
   ARPC5 ATF6, COUPTF, FOX factors, FOXP1d, IRF, Nkx3-1 SRp40, hnRNPE1/E2
   4
   PRDM2,GPBP1L1,ZRANB2,DNTTIP2,SMG7,EIF4G2,PHF21A,RBM4B,GNPTAB,UFM1,ERCC5
   ,SF3B1,CD47,CDV3,BDP1,DDAH2,SNX3,PPIL4,KIT,PCMTD1,RABGAP1,DDX3X,C16orf8
   0 Blimp-1, Egr-1, Evi-1, FOX factors, KROX, NIT2, Pax-6, POU3F2,
   RSRFC4, SRF hnRNPQ, KSRP, SF1, SRp38, SRp54, SRp55, TIA-1, TIAL1, YB-1,
   hnRNPA1/A2, hnRNPE1/E2
   6 SRRM1, JUN, PRDX6, DUSP10, RBM34, HSP90AA1, SRRM2, RPS7, RPL3, RPL32,
   RPS20 Blimp-1, HNF-1alpha, KROX, MEF-2, Nkx2-5, Sp1 SRp20, SRp54, YB-1
   7 TMEM50A, DNAJC8, ZNHIT6, MRPL9, DDX21, RPL36, RPL13A, NOP58, YWHAB,
   SREK1, ZMAT2, EIF2AK1, RBM3, UPF3B, HMGB2, SRF Blimp-1, Egr-1,
   ER-alpha, Evi-1, LXR direct repeat 4, Nkx2-5, Pax-6, Pbx1a, PLZF SC35,
   SF2, SRp40, TIA-1, hnRNPA1/A2
   8 STMN1, ATPIF1, CFL1, APLP2, SARNP, MLEC, POMP, ERH, FUS, NME1, GNAI2,
   HNRNPAB, DEK, HIST1H2BC, HIST1H2AC, HIST1H4H, SEC61G, HIST1H2BK
   AP-2alphaA, Egr-1, HOXA5 (Hox-1.3), MCM1, MTF-1, NF-Y, Pax-6, PLZF,
   RXR-alpha, STAT1 hnRNPF, hnRNPI(PTB), MBNL1, SRp30c, SRp55, TIAL1
   9 SRSF4, NASP, SERBP1, CSTF3, SSRP1, ERP29, RAN, HMGB1, SRP14, PSMA4,
   NUTF2, RPS16, VAMP8, HSPE1, CCT5, TAF9, HIST1H2BD, HIST1H4E, TUBB,
   HNRNPA2B1, PPIA, FKBP3 AP-2alphaA, Egr-1, ER-alpha, Evi-1, GLI1, IRF,
   MCM1, NIT2, Nkx2-5, Pbx1a, RSRFC4, STAT1, STAT5A (homotetramer),
   UF1H3BETA 9G8, hnRNPI(PTB), HTra2alpha, SRp30c, TIAL1
   10 KHDRBS1, PIP4K2A, RPLP2, SMARCC2, RBM25, RFX7, YWHAQ, XRCC5, VDAC1,
   GNB2L1, SHFM1, SMC1A, HDGF, LSM14A, MAT2A AP-2alphaA, ATF6, Egr-1,
   MTF-1, Nkx2-5, STAT5B, ZID 9G8, HTra2alpha, MBNL1, PSF, SC35, SRp30c
   11 SFPQ, SRSF11, BCAS2, TFAM, ZFP91, CSDA, NGRN, RPS15A, RPL26, NACA2,
   EEF2, DDX18, SF3B1, MAPRE1, SATB1, RPL24, SMARCA5, RUFY1, NUP153,
   DDX39B, RPL10A, HOXA7, RPL7, TCEAL4, RBMX, SYF2 ER-alpha, GLI1, KAISO,
   mat1-Mc, MIG1, MTF-1, Nkx2-5, Pax-6, PLZF, RSRFC4, Sp1, SRF, STAT1, ZID
   9G8, DAZAP1, hnRNPF, HTra2beta1, KSRP, MBNL1, SRp30c, TIA-1,
   hnRNPC/C1/C2, hnRNPE1/E2
   12 THRAP3, PSMB4, ARF3, TCF25, GABARAP, TAF15, CHMP2A, GAR1, C9orf78,
   ACIN1, UBB, BASP1 Egr-1, Gfi1b, NF-Y, STAT5B, UF1H3BETA PSF, SC35,
   SRp40, SRp55
   13 GNL2, EIF3M, CD44, FNBP4, OSBPL8, NUFIP2, STK17B, SYPL1, MDM4,
   ANKRD36B, IER2 FOXO3a, POU3F2, TBX5 9G8, ETR-3, hnRNPF, hnRNPP(TLS),
   SRp20, hnRNPH1/H2/H3
   14 NASP, UQCRH, ZRANB2, NUCKS1, FKBP4, ANP32A, EIF3C, PTMA, SUMO3,
   HNRNPA2B1, TXN, MYL12B AIRE, COUP-TF1, Egr-1, Evi-1, GLI1, MTF-1,
   NF-kappaB, STAT5A (homodimer), UF1H3BETA Sam68
   15 EPS15, RAB13, ERLIN1, CAT, CWC15, RPL6, EID1, SEC11A, CKLF, RPL23,
   RPL19, FBXO7, EIF3L, POLR2B, RPL34, RPS3A, RPS6, CNTRL, EIF2S3, IK,
   HMGN4 AIRE, COUPTF, Egr-1, Evi-1, FOXP1a, HNF-1alpha, LXR direct repeat
   4, Nkx2-2, Nkx2-5, PLZF, Sp1, SRF, PPARalpha:RXRalpha, PPARG
   hnRNPP(TLS), KSRP, MBNL1, SF2, SRp55
   16 FUBP1, UFC1, HNRNPU, CHD4, ETV6, FGFR1OP2, PFDN5, NCKAP1L, NACA,
   NAP1L1, CTCF, GLOD4, ZNF830, RPL38, SAFB2, CEBPZ, CEP63, USO1, HNRNPD,
   PPWD1, NSA2, RPS23, HINT1, SKP1, RPS12, EIF3H, CEP350, ARID4A, PRPF40A,
   ST13, STXBP3, SERTAD2, CNBP AIRE, AP-2, ATF6, Blimp-1, Egr-1, Evi-1,
   KROX, LXR direct repeat 4, MEF-2, Nkx2-5, 1-Oct, POU3F2, RSRFC4, Sp1,
   SRF, ZBRK1 9G8, hnRNPK, hnRNPP(TLS), hnRNPQ, hnRNPU, SRp20, SRp30c,
   SRp38, SRp40, TIA-1, YB-1, hnRNPA1/A2, hnRNPC/C1/C2, hnRNPE1/E2
   17 PHTF1, H3F3C, TRIAP1, PMAIP1, KLF2, EEF1B2, IL8, HNRPDL, ARRDC3,
   FKBP5, NAMPT HNF4, HSF1, MafA, MIG1, PLZF, Sp1, ZID 9G8, KSRP
   18 MAN1A2, IFI16, UPF2, FGFR1OP2, FLT3, TERF2IP, LRRFIP1, EGR1, MLL5,
   TMEM66, IRF2BP2, ITSN2, PRRC2C, STT3B, CD164, SPRY1 Egr-1, Nkx2-5, p53,
   RSRFC4, SRF, STAT1, STAT5B (homodimer), UF1H3BETA 9G8, MBNL1, TIA-1,
   hnRNPA1/A2, hnRNPC/C1/C2, hnRNPH1/H2/H3
   19 HIST2H2BE, SLC2A3, SRSF5, TEC, PTP4A1, ESYT2, LYN, SMARCA2, JMJD1C,
   PIM3, KLF10 Egr-1, Nkx2-5, Pbx1a, Sp1 PSF
   20 ANP32E, ILF2, COX4I1, AATF, MPO, SNRPD1, FBL, RPS19, SNRPD2, RPL18,
   PPM1G, RPS21, GCFC1, CCDC72, H2AFZ, COX6C, DKC1 LXR direct repeat 4,
   Nkx2-5, 1-Oct, PPARalpha:RXRalpha, REST, STAT5A (homodimer), STAT5B,
   TBX5, UF1H3BETA hnRNPD, hnRNPU, HTra2alpha, SRp40, SRp55
   21 GATAD2B, GDI2, APBB1IP, RPS13, HNRNPA1, LTA4H, MFAP1, ME2, MYADM,
   SEPT2, C21orf59, RUNX1, RPL35A, SNX2, HSPA9, PAIP2, LARS, SOX4, POLR2K,
   STRBP, RPL7A, SLC25A5, NRD1, LARP1, C19orf77, MRPL3, PNRC2, MTIF3 ATF6,
   Egr-1, ER-alpha, HNF-1alpha, IRF-1, MEF-2, MTF-1, Nkx2-5, PLZF, RSRFC4,
   Sp1, Sp3, UF1H3BETA, ZID ETR-3, MBNL1, SRp38, SRp54, SRp55, TIAL1,
   hnRNPA1/A2, hnRNPE1/E2
   22 SELL, TMEM123, DDX6, TMBIM6, CASC4, B2M, CHD2, USP34, SETD2, PAK2,
   HSD17B11, GOLPH3, DFIP1, WRNIP1, UTRN, VAMP2, PCM1, SORL1, PAN3, ATXN3,
   XRN2, FBXO11, C4orf3, GAPT E2F-1:DP-2, Egr-1, GLI1, HNF-1alpha, IRF,
   IRF-3, KROX, mat1-Mc, MEF-2, MyoD, NF-kappaB, NIT2, Nkx2-5, NRSF,
   RSRFC4, Sp1, ZID hnRNPI(PTB), KSRP, SF2, SRp20, SRp30c
   23 RGS1, PTMS, PARP2, C16orf61, CLK1, EAF2, SLU7, RSRC2, ZCCHC7 AIRE,
   Evi-1, FOXO3a, IRF-3, MEF-2, VDR, CAR, PXR hnRNPD
   24 RGS2, CD34, RSF1, WNK1, CBFB, CSTB, IQGAP2, GNAI1, LAPTM4B, SH3KBP1,
   HNRNPM, ARCN1, MARCH6 AP-2, BLIMP1, KROX, MEF-2, Sp3, STAT5B Sam68,
   hnRNPA1/A2
   25 HNRNPU, SAP18, RBM26, LEO1, UQCRC2, RPL27, RPS9, DUSP11, DARS,
   ARPC2, ESF1, SERINC3, PITPNB, HNRNPA0, TCERG1, NPM1, C6orf48, IL2RG,
   RPS4X, API5, SEP15, RPL23A Blimp-1, Egr-1, Evi-1, GLI1, MTF-1, NIT2,
   Nkx2-2, Pax-6, PLZF, ZID DAZAP1, hnRNPD, hnRNPI(PTB), MBNL1, SRp20,
   SRp30c, YB-1, hnRNPA1/A2, hnRNPE1/E2
   27 EIF4EBP2, CELF1, ATXN7L3B, PRPF8, RPSA, SRP72, DDX46, STK38, SRSF3,
   NGFRAP1, TOP2B BLIMP1, COUP-TF1, Egr-1, HNF-1alpha, IRF, IRF-1, SRF,
   ZID YB-1
   28 PDLIM1, LMO2, C11orf67, SLC38A1, DUSP6, ROCK2, KIF2A, PHF3, CD164,
   TAX1BP1, SEPT6, DDX1 Egr-1, HNF-1alpha, RXR-alpha ETR-3, hnRNPK, SRp55
   30 PSMA1, NEDD8, ADAM10, PSMC5, UBE2G2, PDCD10, SEC62, CWC27, LCP2,
   XPO7, PEBP1 AIRE, ATF6, FOXO1, Gfi1b, 1-Oct, Sp1 hnRNPF, SRp20, TIA-1
   31 NXF1, FAU, TAF1D, PPTC7, PABPC3, ELF1, CHD9, JUND, LAPTM4A, RANBP2,
   HAT1, UBE2D3, SBDS, FLJ44635, TSC22D3, HSPH1, KPNB1 ER-alpha, ER-beta,
   GLI1, KROX, Nkx2-5, NRSF, Staf HTra2alpha, Sam68, SRp30c, SRp38, SRp55,
   hnRNPA1/A2
   32 CCND2, METAP2, ARGLU1, MIS18BP1, CCNK, PSME3, RPL38, GLTSCR2, RPS11,
   HADHA, RPL31, HSPD1, RPL37A, NOP56, BRK1, RPL37, UBE2V2 AIRE,
   AP-2alphaA, dl, Egr-1, HNF-1alpha-A, 1-Oct, Olf-1, p53, Pax-6, Pbx1a,
   RSRFC4, Sp1 SC35, SRp38, hnRNPE1/E2
   33 ITM2B, SCAPER, PPIG, MRFAP1L1, RASGEF1B, AIF1, RNPC3, H3F3B, BPTF,
   AFF1, B3GNT5 Alx-4, Blimp-1, BLIMP1, HNF-1alpha, KROX, Sp1 hnRNPF,
   SRp40, hnRNPA1/A2
   34 NIN, ZNF609, ZFP36L2, SOD1, NOL7, LTB, ITGA6, WDR77, PIM1, SEPT7
   ABF, Antp, COUP-TF1, ID1, KAISO, myogenin/NF-1, RORalpha1 hnRNPK
   [62]Open in a new tab

   The first column is differentially expressed isoform groups in our
   cases, the second and third columns are the Transcription factors and
   splicing factors predicted by our regression model. Some cells are
   blank, which means no corresponding factors for that co-expressed
   group.

Mathematic Modeling

   Transcription factors regulate transcription, which controls gene
   expression. Splicing factors regulate alternative splicing, which
   splices pre-mRNA to RNA isoforms and in turn changes protein
   expressions [63][23]. Some predictive methods have modeled gene
   expression levels as a linear function of occurrences of TF-binding
   sites (TFBSs) [64][2], [65][24]. In this paper, we focus on RNA
   isoforms instead of genes and try to link isoform expression levels not
   only with the transcriptional factors but also with splicing factors.
   Isoform expression levels were formulated as linear regression of the
   strength of TF-isoform interactions and SF-isoform interactions, given
   by
   graphic file with name pone.0079118.e014.jpg (1)

   where, Inline graphic is an isoform; Inline graphic denotes the
   expression level; Inline graphic , Inline graphic and s Inline graphic
   are regression coefficients; Inline graphic is transposition operator;
   Inline graphic are binding strength vectors of transcription factors;
   Inline graphic is the number of transcription factors; Inline graphic
   are binding strength vectors of splicing factors; Inline graphic is
   number of splicing factors and Inline graphic is the error term.

   To estimate the binding strength between TFs and isoforms, we first
   constructed a promoter region database (PRD, [66]Figure 1B ) comprised
   of promoter sequences of isoforms. As in previous study [67][25], we
   extracted 2000 bp upstream of the transcription start sites as promoter
   regions. Similarly, we constructed an exon-intron dataset (EID,
   [68]Figure 1B ) to estimate the binding strength between SFs and
   isoforms. Most splicing factor binding sites locate near the splicing
   sites [69][26], especially 200 bp on both sides of this site.
   Therefore, for isoforms with alternative splicing, we gathered 200 bp
   of sequences around their splicing sites into EID to find splicing
   factor binding sites.

   Based on the PRD and EID, we could define the TF-isoform interaction
   strength and SF-isoform interaction strength. First, we downloaded TF
   motif models (PWM, position weight matrixes) from TRANSFAC [70][27] and
   JASPAR [71][28]. Then, for each TF, a hidden markov model based method
   called MAPPER [72][29] was used to identify its sites (TFBSs) in PRD.
   For each binding site, the MAPPER also outputs a binding score
   measuring the binding affinity. TFs with no hits on PRD were removed.
   After retrieving all putative binding sites and their binding scores,
   we defined the interaction strength between a TF Inline graphic and a
   RNA isoform Inline graphic as Inline graphic . Where, Inline graphic is
   the score of Inline graphic -th binding site between Inline graphic -th
   isoform and j-th transcription factor; and Inline graphic is the number
   of binding sites.

   Unlike TFs, SFs lack a reliable PWM database available. Thus, we
   gathered 53 splicing factors that are related with bone marrow cancers
   and their binding motifs from the SpliceAid [73][30] repository. This
   database collects all the experimentally assessed motifs that are bound
   by splicing factors in humans by means of an exhaustive literature
   search. Motifs with positive scores facilitate exon definition as
   exonic splicing enhancers (ESEs) or intronic splicing silencers (ISSs)
   motifs, while motifs with negative scores facilitate intron definition
   as exonic splicing silencers (ESSs) and intronic splicing enhancers
   (ISE) motifs. The absolute values of these scores measure the levels of
   binding affinity. To control false positive rate, we filtered out
   motifs with absolute scores less than 5 and lengths longer than 15.
   Finally, 49 out of 53 splicing factors are selected (see [74]Table 2).
   Since motifs of these splicing factors are degraded short pieces (6 to
   10 base pairs on average), when retrieving all binding sites of these
   SFs using alignment tool Bowtie [75][31], we got numerous hits, most of
   which were false positives. To identify putative binding sites with
   high reliability, we downloaded base-wise conservation scores (phyloP)
   by comparing 45 vertebrate genomes with human genome from the UCSC
   Genome Browser website ([76]http://genome.ucsc.edu/). We defined the
   conservation score of a hit as the average conservation score of the
   nucleotides it spanned. Hits with scores lower than 2 were deleted.
   Though conversion score is not directly related with interaction
   strength, a highly conserved binding site on an isoform does provide
   evidence of a strong interaction. We further averaged conservation
   scores of all hits of a splicing factor on an isoform to eliminate the
   effect of exon numbers. This average is defined as interaction strength
   between a SF and an isoform, Inline graphic , where, Inline graphic is
   the conservation score of Inline graphic -th binding site between
   Inline graphic -th isoform and Inline graphic -th splicing factor;
   Inline graphic is the interaction strength between Inline graphic -th
   splicing factor and Inline graphic -th isoform; Inline graphic is the
   number of binding sites. Some factors, especially the hnRNP family, are
   due to alternative splicing, for example hnRNP H1, hnRNP H2 and hnRNP
   H3, which means they may have similar structure [77][32],motifs
   [78][33] and binding profiles. After we computed their interaction
   strength with isoforms according to our definition, the correlation
   between interaction strength vectors were high. To increase the
   robustness of our linear regression model, we averaged the binding
   strength and obtained four new factors, called hnRNP A1/A2 (average of
   hnRNP A1 and A2), hnRNP H1/H2/H3(average of hnRNP H1, H2 and H3), hnRNP
   C/C1/C2 (average of hnRNP C, C1 and C2) and hnRNP E1/E2 (average of E1
   and E2).

Table 2. Names and description of splicing factors used in our model.

   Splicing Factors Gene Name Description
   9G8 SRSF7 Splicing factor arginine/serine-rich 7. The shuttling protein
   9G8 binds TAP and can function as export factors.
   CUG-BP1 CUGBP1 CUG triplet repeat RNA binding protein 1. CUGBP1 induces
   exon 5 inclusion in cTNT gene (PMID: 9563950), induces exon 11
   exclusion in IR gene (PMID: 11528389), induces intron 2 retention in
   CIC-1 gene (PMID: 12150906).
   DAZAP1 DAZAP1 DAZ associated protein 1.
   ETR-3 CUGBP2 CUG triplet repeat RNA binding protein 2. ETR-3 induces
   exon 5 inclusion in cTNT gene (PMID: 11931771), induces exon 9
   inclusion in CFTR gene (PMID: 15657417), promotes selectively the
   exclusion of Tau exon 2 (PMID: 16862542).
   hnRNP A1 HNRNPA1 Heterogeneous nuclear ribonucleoprotein A1. hnRNP A1
   carries bidirectional shuttling signals that serve for both nuclear
   localization and export (PMID: 8521471)
   hnRNP A2/B1 HNRNPA2B1 Heterogeneous nuclear ribonucleoprotein A2/B1.
   hnRNP A2 is involved in cytoplasmic RNA transport (PMID: 11024030).
   hnRNP C HNRNPC Heterogeneous nuclear ribonucleoprotein C. Tetramer
   composed of 3 copies of isoform C1 and 1 copy of isoform C2. hnRNP C
   proteins are restricted to the nucleus because they bear a nuclear
   retention sequence (NRS) (PMID: 8830767).
   hnRNP C1 HNRNPC Heterogeneous nuclear ribonucleoprotein C. Isoform C1
   is due to Alternative Splicing.
   hnRNP C2 HNRNPC Heterogeneous nuclear ribonucleoprotein C. Isoform C2
   is due to Alternative Splicing.
   hnRNP D HNRNPD Heterogeneous nuclear ribonucleoprotein D.
   hnRNP D0 HNRNPD Heterogeneous nuclear ribonucleoprotein D. Isoform D0
   is due to Alternative Splicing.
   hnRNP DL HNRNPDL Heterogeneous nuclear ribonucleoprotein D-like.
   hnRNP E1 PCBP1 Holy(rC) binding protein 1.
   hnRNP E2 PCBP2 Holy(rC) binding protein 2.
   hnRNPF HNRNPF Heterogeneous nuclear ribonucleoprotein F.
   hnRNPH1 HNRNPH1 Heterogeneous nuclear ribonucleoprotein H1.
   hnRNPH2 HNRNPH2 Heterogeneous nuclear ribonucleoprotein H2.
   hnRNPH3 HNRNPH3 Heterogeneous nuclear ribonucleoprotein H3.
   hnRNP I(PTB) PTBP1 Polypyrimidine tract binding protein 1. In the
   context of CALCA gene, PTB enhances exon 4 inclusion (PMID: 9858533).
   nPTB functionally compensates for PTB and is up-regulated when PTB is
   removed (PMID:17679092).
   hnRNP J HNRNPK Heterogeneous nuclear ribonucleoprotein K. isoform J is
   due to Alternative Splicing.
   hnRNP K HNRNPK Heterogeneous nuclear ribonucleoprotein K. hnRNP K
   carries bidirectional shuttling signals that serve for both nuclear
   localization and export (PMID: 9218800).
   hnRNP M HNRNPM Heterogeneous nuclear ribonucleoprotein M.
   hnRNP P(TLS) FUS Fusion (involved in t (12, 16) in malignant
   liposarcoma).
   hnRNP Q SYNCRIP Synaptotagmin binding cytoplasmic RNA interacting
   protein.
   hnRNP U HNRNPU Heterogeneous nuclear ribonucleoprotein U (scaffold
   attachment factor A).
   HTra2alpha TRA2A Transformer-2 alpha.
   HTra2beta1 SRSF10 Splicing factor arginine/serine-rich 10.
   KSRP KHSRP KH-type splicing regulatory protein.
   MBNL1 MBNL1 Muscleblind-like. MBNL proteins can act as activators or
   repressors of splicing on different pre-mRNAs (PMID: 15257297). MBNLs
   are dsRNA binding factors that can bind CUG or CCUG repeats (PMID:
   14722159).
   PSF SFPQ Splicing factor proline/glutamine-rich (polypyrimidine tract
   binding protein associated).
   RBM25 RBM25 RNA binding motif protein 25. RBM25 stimulated proapoptotic
   Bcl-X(s) isoform through weak 5′ss selection in EX2 (PMID: 18663000).
   RBM4 RBM4 RNA binding motif protein 4. RBM4 induce exon inclusion of
   alpha-TM EX9a and EX2b (PMID: 16260624) and tau EX10 (PMID: 16777844).
   RBM5 RBM5 RNA binding motif protein 5.
   Sam68 KHDRBS1 KH domain containing RNA binding signal transduction
   associated 1.
   SC35 SRSF2 Splicing factor arginine/serine-rich 2. SC35 accelerates
   transcriptional elongation (co-transcriptional splicing) (PMID:
   18641664).
   SF1 SF1 Splicing factor 1. Gomafu lncRNA UACUAAC repeats bind to mouse
   SF1 with a higher affinity than the mammalian branch point consensus
   regulating splicing efficiency by changing the splicing factors nuclear
   level (PMID: 21463453)
   SF2 SRSF1 Splicing factor arginine/serine-rich 1 (splicing factor 2,
   alternate splicing factor). The shuttling protein SF2/ASF binds TAP and
   can function as export factors (18364396).
   SRp20 SRSF3 Splicing factor arginine/serine-rich 3. The shuttling
   protein SRp20 binds TAP and can function as export factors (18364396).
   SRp30c SRSF9 Splicing factor arginine/serine-rich 9.
   SRp38 FUSIP1 FUS interacting protein (serine/arginine-rich).
   Dephosphorylation converts SRp38 to a splicing repressor (PMID:
   12419250). SRp38 functions as a general splicing repressor when
   dephosphorylated, but when phosphorylated it functions as a
   sequence-specific splicing activator (PMID: 18794844).
   SRp40 SRSF5 Splicing factor arginine/serine-rich 5.
   SRp54 SRSF11 Splicing factor, arginine/serine-rich 11.
   SRp55 SRSF6 Splicing factor arginine/serine-rich 6.
   SRp75 SRSF4 Splicing factor arginine/serine-rich 4.
   TDP43 TARDBP TAR DNA binding protein. It can act as transcriptional
   repressor (21252238).
   TIA-1 TIA1 Cytotoxic granule-associated RNA binding protein.
   TIAL1 TIAL1 TIA1 cytotoxic granule-associated RNA binding protein-like
   1.
   YB-1 YBX1 Y box binding protein 1.
   ZRANB2 ZRANB2 Zinc finger, RAN-binding domain containing 2. ZRANB2
   (ZNF265) is an SR-like protein that induce exclusion of EX2 and EX3
   from the Tra2beta1 pre-mRNA in HEK293 cell (PMID: 11448987).
   [79]Open in a new tab

   This table contains 22 splicing factors which are selected to predict
   the expression levels of differentially expressed isoforms. This table
   lists their names and some related references. Most of these details