Abstract More and more transcription factors and their motifs have been reported and linked to specific gene expression levels. However, focusing only on transcription is not sufficient for mechanism research. Most genes, especially in eukaryotes, are alternatively spliced to different isoforms. Some of these isoforms increase the biodiversity of proteins. From this viewpoint, transcription and splicing are two of important mechanisms to modulate expression levels of isoforms. To integrate these two kinds of regulation, we built a linear regression model to select a subset of transcription factors and splicing factors for each co-expressed isoforms using least-angle regression approach. Then, we applied this method to investigate the mechanism of myelodysplastic syndromes (MDS), a precursor lesion of acute myeloid leukemia. Results suggested that expression levels of most isoforms were regulated by a set of selected regulatory factors. Some of the detected factors, such as EGR1 and STAT family, are highly correlated with progression of MDS. We discovered that the splicing factor SRSF11 experienced alternative splicing switch, and in turn induced different amino acid sequences between MDS and controls. This splicing switch causes two different splicing mechanisms. Polymerase Chain Reaction experiments also confirmed that one of its isoforms was over-expressed in MDS. We analyzed the regulatory networks constructed from the co-expressed isoforms and their regulatory factors in MDS. Many of these networks were enriched in the herpes simplex infection pathway which involves many splicing factors, and pathways in cancers and acute or chronic myeloid leukemia. Introduction Gene expression levels are highly dependent on the regulation of transcription factors which mainly bind to the near-promoter regions to facilitate or block the recruitment of DNA polymerase II (pol II) and other complexes. Some methods have been proposed to predict - gene expression using such binding information of transcription factors [29][1], [30][2]. Conlon et al. [31][2]suggested associating gene expression with the interaction strength between the upper stream of the gene and motifs of its transcription factors. This interaction was defined in terms of degree of matching and occurrences of binding sites. After that, a probabilistic approach was proposed to infer regulatory rules of transcriptional networks from gene expression data and DNA sequences [32][1]. This method started with finding co-expressed genes, and then extracted a large number of putative regulatory DNA motifs in these co-expressed genes. However, these methods only considered transcription levels extracted from gene array data and lost sight of other mechanisms such as alternative splicing. Moreover, these methods may suffer high false positive rate when mining transcription factor binding sites. To control the rate of false positive, integrating information from other kinds of data, like conservation data, is also necessary. Alternative splicing is one of the most versatile mechanisms of gene expression regulation and accounts for a considerable proportion of proteomic complexity in higher eukaryotes. The mRNA isoforms produced by this alternative processing comprise of different combination of exons, and may differ in structure, function, and other properties [33][3], [34][4]. Different mRNA isoforms are translated into different protein isoforms (if they exist) which may have related, distinct or even opposing functions. The alternative splicing process is regulated by many types of RNA binding proteins, especially the heterogeneous nuclear ribonucleoproteins (hnRNP) and the serine/arginine-rich (SR) family. In eukaryotic cells, hnRNP proteins participate in almost all pre-mRNA processing steps including splicing, mature mRNA export, localization, translation, and stability [35][5], [36][6]. SR proteins are involved in regulating and selecting splice sites in eukaryotic mRNAs. These proteins, either ‘classical’ SR proteins or additional SR proteins [37][7], generally have at least one RRM (RNA recognition motif) domain for RNA binding and a C-terminus RS domain for controlling interactions with other proteins (including other SR proteins). In addition to alternative splicing, the SR proteins are also involved in mRNA nuclear export and mRNA translation. One of the most important characteristics of these splicing factors is their functional specificity. Many animal experiments suggest that the RNA binding ability of individual SR proteins are sequence-specific and their ability to regulate alternative splicing is different [38][8]–[39][10]. These highly specific and non-redundant characteristics of splicing factors motivated researchers to look for association between abnormalities in SR proteins and the development of human cancers. Although the underlying mechanisms are elusive and need further studies, splicing factors that regulate specific pathways in diseases can be treated as putative markers, especially when their targets in these disease-related pathways have experienced alternative splicing and produced different protein isoforms. In this study, we built a systematic method to identify transcription factors and splicing factors that regulated genes to produce different RNA isoforms in diseases. Our framework consists of four steps. First, differentially expressed mRNA isoforms (DEIs) are extracted by comparing abnormal cells and controls using RNA-seq analysis tools [40][11]. These DEIs that may experience abnormal splicing are putative targets of splicing factors that might function abnormally in disease. Assuming that co-expressed genes have a good probability of being functionally related, we clustered these DEIs to co-expressed groups using hierarchical clustering method. These co-expressed isoforms may be regulated by the same group of transcription factors (TFs) and splicing factors (SFs). Since co-expressed mRNAs are more likely to have their promoter regions bound by common transcription factors, we constructed a dataset called promoter region dataset (PRD) to mine the TF-isoform interactions (for example, co-expressed isoforms may also have common splicing factors that bind to the regions near their splicing sites). We then constructed an exon-intron (centered at splicing sites and extending 200 bp on both sides) dataset (EID) to explore SF-isoform interactions. The binding strength of the TF-isoform interactions is quantified by scoring the transcription factor’s binding sites in the promoter region. Similarly, the binding strength of the SF-isoform interactions is defined by scoring the splicing factor’s binding sites in the exon-intron regions of pre-mRNA. To integrate both kinds of regulation, we built a linear regression model and selected a subset of transcription factors and splicing factors that can regulate the expression of co-expressed isoforms using least-angle regression (LARS) [41][12] selection approach. The proposed method was applied to a RNA-seq dataset comprising of 4 myelodysplastic syndromes (MDS) samples (RAEB subtype) and 5 matched controls. MDSs are defined as clonal stem cell disorder characterized by ineffective hematopoiesis and impaired difference in some bone marrow lineages, leading to peripheral-blood cytopenias. According to the WHO [42][13], the main categories of MDS include refractory cytopenia with unilineage dysplasia (RA), refractory cytopenia with ringed sideroblast (RARS), refractory cytopenia with multi-lineage dysplasia (RCMD), refractory cytopenia with excess blasts (RAEB), 5q-syndrome and unclassifiable MDS. In this paper, we focused on RAEB cases which are characterized by 5–20% myeloblasts in the marrow [43][14]. This is a higher risk subtype and likely to transform to acute myelogenous leukemia (AML). Some work studied the genetic abnormality of MDS with mutations of key genes [44][15], e.g. TP53 and RUNX, which are highly related with poor overall survival. Recently, the recurrent mutation of splicing factor U2AF1 has been validated for MDS patients [45][16]. In-vitro experiments demonstrated that the mutated U2AF2 enhances splicing and promotes exon skipping. These genetic aberrations would improve the prediction of prognosis and development of novel treatment. However, mutation detection cannot determine which genes are altered due to the abnormality of splicing factors. Our method can recognize not only isoforms that experience an abnormal splicing process but also their putative regulatory factors. The co-expressed isoforms and their transcription and inferred splicing factors comprise our transcription and splicing networks. To verify the biological significance of these regulatory networks composed of a group of co-expressed isoforms and their regulatory factors in MDS, we showed theses networks were significantly enriched (p<0.0001) in some known Gene Ontology (GO) biology processes and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Enrichment analysis demonstrated that more than half of these networks were enriched in herpes simplex infection due to involving many splicing factors. Six networks were significantly related with pathways in cancer, and five networks were significantly related with acute or chronic myeloid leukemia (AML). Methods Our four-step framework ([46] Figure 1 ) starts with raw RNA sequencing (RNA-seq) data. First, differentially expressed isoforms (DEIs) are identified from a large amount of isoforms. We cluster these DEIs into co-expressed groups that may be regulated by common TFs and SFs ([47] Figure 1A ). At the second step, two datasets, promoter region data (PRD) and exon-intron data (EID), are constructed, where the PRD dataset is used for mining the interaction strength between TFs and isoforms and the EID is used for exploring the interaction strength between TFs and isoforms ([48] Figure 1B ). This step outputs two interaction strength matrices with rows corresponding to isoforms and columns corresponding to TFs or SFs. Then, by taking the interaction strength matrices as observations of explanatory variables, we build linear model to regress the expression levels of isoforms in a co-expressed group. To detect the most important factors that regulate a co-expressed group, a reliable model selection method called least-angle regression (LARS) is applied ([49] Figure 1C ). At the final step, the TFs and SFs selected by the LARS are linked with their target genes, forming regulatory networks ([50] Figure 1D ). Figure 1. Flowchart of proposed method for constructing regulatory networks. [51]Figure 1 [52]Open in a new tab Flowchart of proposed method for constructing regulatory networks: (A). Process raw RNA-seq data, find out deferentially expressed isoforms using Tophat and Cufflinks and cluster these isoforms to get gene cluster that may be regulated by same TFs and SFs. (B). Construct two dataset, promote region data (PRD) and exon-intron data (EID), for mining the interaction strength of the TF-isoform interactions and SF-isoform interaction. (C). Use interaction strength to predict the expression levels of isoforms in a co-expressed group. (D). Link model-selected TFs and SFs with their target genes. Ethics Statement This study was approved by the Institutional Review Board of The Methodist Hospital, Houston, Texas, USA and the need for written informed consent from the participants was waived by the IRB. Data Preprocessing and Co-expressed Isoforms RNA samples were prepared from 9 individuals, 4 samples from RAEB patients and 5 controls ([53]http://ctsb.is.wfubmc.edu/MDS/MDS.html). After sequencing using Genome Analyzer II (GAII) (Illumina, San Diego, CA), millions of reads were produced for each sample. The raw data were processed using a RNA-seq analysis pipeline [54][11]. Briefly, reads in the FASTA format were first aligned to whole genome using TopHat software. Since our sequencing data are pair-ended, the fragments were selected at 350 bp and the length of reads was 76, the inner distance between mate pairs was set as 198. The standard deviation for the distribution on inner distances between mate pairs was set as 30 based on our estimation. Then Cufflinks was called to estimate the expression level (RPKM, reads per kilobase of transcript per million mapped reads) of each isoform. Then, Cuffdiff was used to identify differentially expressed isoforms (using default parameters). This protocol returned 1056 isoforms that were differentially expressed in MDS compared with controls. Before using these DEIs to build splicing networks, we first filtered out those isoforms that had not been validated at the protein level. Then, we checked the condition (control or MDS) in which an isoform was up-regulated and required that all FPKM values under this condition were higher than 5 [55][17]. Although there are debates on whether co-expressed genes are functionally related, numerous studies have suggested that at least some co-expressed clusters function together [56][18]–[57][20]. Here we applied hierarchical cluster analysis to those differentially expressed isoforms to obtain co-expressed groups. The distances between genes were defined as Pearson correlation. Allocco et al. [58][21] concluded that genes with strongly co-expressed mRNAs were more likely to have their promoter regions bound by common transcription factors. However, this co-regulating effect was significant only when expression of these co-expressed mRNAs were highly correlated. To control the similarity level of expression profiles of isoforms in the same group, we limited the size of each group to be between 10 and 30 [59][19], [60][22]. In our MDS study, this resulted in 34 clusters ([61] Table 1 ). Table 1. Regulatory networks found by our model. No Target Genes Transcription Factors Splicing Factors 1 GNB1, GPATCH4, MRPL51, BRD4, ODC1, ATP5J, HNRNPD, RBBP7, CALM3, ANP32C, PSPC1, SETD3 Blimp-1, FOX factors, FOXP1a, Sp1, STAT5B MBNL1, PSF, Sam68, SF2, SRp38, YB-1 2 PARK7, KDM1A, SSR2, DNTT, HNRNPA1L2, SUPT16H, SLIRP, YWHAE, TAF15, TOP2A, SRSF1, ATP5A1, EEF1B2, HMGN1, U2SURP, HSD17B4, CANX, MAP7D3, HNRNPM, BCL11A Egr-1, Evi-1, LXR direct repeat 4, MEF-2, Nkx2-5, Pax-6, RSRFC4, Sp1 ETR-3, hnRNPD, hnRNPI(PTB), hnRNPQ, MBNL1, PSF, SF2, SRp38, SRp40, hnRNPC/C1/C2 3 TARDBP, HDAC1, ARHGDIB, ESD, CX-CR4, TTC1, HLA-DPB1, ANP32B, ATP1A1, ARPC5 ATF6, COUPTF, FOX factors, FOXP1d, IRF, Nkx3-1 SRp40, hnRNPE1/E2 4 PRDM2,GPBP1L1,ZRANB2,DNTTIP2,SMG7,EIF4G2,PHF21A,RBM4B,GNPTAB,UFM1,ERCC5 ,SF3B1,CD47,CDV3,BDP1,DDAH2,SNX3,PPIL4,KIT,PCMTD1,RABGAP1,DDX3X,C16orf8 0 Blimp-1, Egr-1, Evi-1, FOX factors, KROX, NIT2, Pax-6, POU3F2, RSRFC4, SRF hnRNPQ, KSRP, SF1, SRp38, SRp54, SRp55, TIA-1, TIAL1, YB-1, hnRNPA1/A2, hnRNPE1/E2 6 SRRM1, JUN, PRDX6, DUSP10, RBM34, HSP90AA1, SRRM2, RPS7, RPL3, RPL32, RPS20 Blimp-1, HNF-1alpha, KROX, MEF-2, Nkx2-5, Sp1 SRp20, SRp54, YB-1 7 TMEM50A, DNAJC8, ZNHIT6, MRPL9, DDX21, RPL36, RPL13A, NOP58, YWHAB, SREK1, ZMAT2, EIF2AK1, RBM3, UPF3B, HMGB2, SRF Blimp-1, Egr-1, ER-alpha, Evi-1, LXR direct repeat 4, Nkx2-5, Pax-6, Pbx1a, PLZF SC35, SF2, SRp40, TIA-1, hnRNPA1/A2 8 STMN1, ATPIF1, CFL1, APLP2, SARNP, MLEC, POMP, ERH, FUS, NME1, GNAI2, HNRNPAB, DEK, HIST1H2BC, HIST1H2AC, HIST1H4H, SEC61G, HIST1H2BK AP-2alphaA, Egr-1, HOXA5 (Hox-1.3), MCM1, MTF-1, NF-Y, Pax-6, PLZF, RXR-alpha, STAT1 hnRNPF, hnRNPI(PTB), MBNL1, SRp30c, SRp55, TIAL1 9 SRSF4, NASP, SERBP1, CSTF3, SSRP1, ERP29, RAN, HMGB1, SRP14, PSMA4, NUTF2, RPS16, VAMP8, HSPE1, CCT5, TAF9, HIST1H2BD, HIST1H4E, TUBB, HNRNPA2B1, PPIA, FKBP3 AP-2alphaA, Egr-1, ER-alpha, Evi-1, GLI1, IRF, MCM1, NIT2, Nkx2-5, Pbx1a, RSRFC4, STAT1, STAT5A (homotetramer), UF1H3BETA 9G8, hnRNPI(PTB), HTra2alpha, SRp30c, TIAL1 10 KHDRBS1, PIP4K2A, RPLP2, SMARCC2, RBM25, RFX7, YWHAQ, XRCC5, VDAC1, GNB2L1, SHFM1, SMC1A, HDGF, LSM14A, MAT2A AP-2alphaA, ATF6, Egr-1, MTF-1, Nkx2-5, STAT5B, ZID 9G8, HTra2alpha, MBNL1, PSF, SC35, SRp30c 11 SFPQ, SRSF11, BCAS2, TFAM, ZFP91, CSDA, NGRN, RPS15A, RPL26, NACA2, EEF2, DDX18, SF3B1, MAPRE1, SATB1, RPL24, SMARCA5, RUFY1, NUP153, DDX39B, RPL10A, HOXA7, RPL7, TCEAL4, RBMX, SYF2 ER-alpha, GLI1, KAISO, mat1-Mc, MIG1, MTF-1, Nkx2-5, Pax-6, PLZF, RSRFC4, Sp1, SRF, STAT1, ZID 9G8, DAZAP1, hnRNPF, HTra2beta1, KSRP, MBNL1, SRp30c, TIA-1, hnRNPC/C1/C2, hnRNPE1/E2 12 THRAP3, PSMB4, ARF3, TCF25, GABARAP, TAF15, CHMP2A, GAR1, C9orf78, ACIN1, UBB, BASP1 Egr-1, Gfi1b, NF-Y, STAT5B, UF1H3BETA PSF, SC35, SRp40, SRp55 13 GNL2, EIF3M, CD44, FNBP4, OSBPL8, NUFIP2, STK17B, SYPL1, MDM4, ANKRD36B, IER2 FOXO3a, POU3F2, TBX5 9G8, ETR-3, hnRNPF, hnRNPP(TLS), SRp20, hnRNPH1/H2/H3 14 NASP, UQCRH, ZRANB2, NUCKS1, FKBP4, ANP32A, EIF3C, PTMA, SUMO3, HNRNPA2B1, TXN, MYL12B AIRE, COUP-TF1, Egr-1, Evi-1, GLI1, MTF-1, NF-kappaB, STAT5A (homodimer), UF1H3BETA Sam68 15 EPS15, RAB13, ERLIN1, CAT, CWC15, RPL6, EID1, SEC11A, CKLF, RPL23, RPL19, FBXO7, EIF3L, POLR2B, RPL34, RPS3A, RPS6, CNTRL, EIF2S3, IK, HMGN4 AIRE, COUPTF, Egr-1, Evi-1, FOXP1a, HNF-1alpha, LXR direct repeat 4, Nkx2-2, Nkx2-5, PLZF, Sp1, SRF, PPARalpha:RXRalpha, PPARG hnRNPP(TLS), KSRP, MBNL1, SF2, SRp55 16 FUBP1, UFC1, HNRNPU, CHD4, ETV6, FGFR1OP2, PFDN5, NCKAP1L, NACA, NAP1L1, CTCF, GLOD4, ZNF830, RPL38, SAFB2, CEBPZ, CEP63, USO1, HNRNPD, PPWD1, NSA2, RPS23, HINT1, SKP1, RPS12, EIF3H, CEP350, ARID4A, PRPF40A, ST13, STXBP3, SERTAD2, CNBP AIRE, AP-2, ATF6, Blimp-1, Egr-1, Evi-1, KROX, LXR direct repeat 4, MEF-2, Nkx2-5, 1-Oct, POU3F2, RSRFC4, Sp1, SRF, ZBRK1 9G8, hnRNPK, hnRNPP(TLS), hnRNPQ, hnRNPU, SRp20, SRp30c, SRp38, SRp40, TIA-1, YB-1, hnRNPA1/A2, hnRNPC/C1/C2, hnRNPE1/E2 17 PHTF1, H3F3C, TRIAP1, PMAIP1, KLF2, EEF1B2, IL8, HNRPDL, ARRDC3, FKBP5, NAMPT HNF4, HSF1, MafA, MIG1, PLZF, Sp1, ZID 9G8, KSRP 18 MAN1A2, IFI16, UPF2, FGFR1OP2, FLT3, TERF2IP, LRRFIP1, EGR1, MLL5, TMEM66, IRF2BP2, ITSN2, PRRC2C, STT3B, CD164, SPRY1 Egr-1, Nkx2-5, p53, RSRFC4, SRF, STAT1, STAT5B (homodimer), UF1H3BETA 9G8, MBNL1, TIA-1, hnRNPA1/A2, hnRNPC/C1/C2, hnRNPH1/H2/H3 19 HIST2H2BE, SLC2A3, SRSF5, TEC, PTP4A1, ESYT2, LYN, SMARCA2, JMJD1C, PIM3, KLF10 Egr-1, Nkx2-5, Pbx1a, Sp1 PSF 20 ANP32E, ILF2, COX4I1, AATF, MPO, SNRPD1, FBL, RPS19, SNRPD2, RPL18, PPM1G, RPS21, GCFC1, CCDC72, H2AFZ, COX6C, DKC1 LXR direct repeat 4, Nkx2-5, 1-Oct, PPARalpha:RXRalpha, REST, STAT5A (homodimer), STAT5B, TBX5, UF1H3BETA hnRNPD, hnRNPU, HTra2alpha, SRp40, SRp55 21 GATAD2B, GDI2, APBB1IP, RPS13, HNRNPA1, LTA4H, MFAP1, ME2, MYADM, SEPT2, C21orf59, RUNX1, RPL35A, SNX2, HSPA9, PAIP2, LARS, SOX4, POLR2K, STRBP, RPL7A, SLC25A5, NRD1, LARP1, C19orf77, MRPL3, PNRC2, MTIF3 ATF6, Egr-1, ER-alpha, HNF-1alpha, IRF-1, MEF-2, MTF-1, Nkx2-5, PLZF, RSRFC4, Sp1, Sp3, UF1H3BETA, ZID ETR-3, MBNL1, SRp38, SRp54, SRp55, TIAL1, hnRNPA1/A2, hnRNPE1/E2 22 SELL, TMEM123, DDX6, TMBIM6, CASC4, B2M, CHD2, USP34, SETD2, PAK2, HSD17B11, GOLPH3, DFIP1, WRNIP1, UTRN, VAMP2, PCM1, SORL1, PAN3, ATXN3, XRN2, FBXO11, C4orf3, GAPT E2F-1:DP-2, Egr-1, GLI1, HNF-1alpha, IRF, IRF-3, KROX, mat1-Mc, MEF-2, MyoD, NF-kappaB, NIT2, Nkx2-5, NRSF, RSRFC4, Sp1, ZID hnRNPI(PTB), KSRP, SF2, SRp20, SRp30c 23 RGS1, PTMS, PARP2, C16orf61, CLK1, EAF2, SLU7, RSRC2, ZCCHC7 AIRE, Evi-1, FOXO3a, IRF-3, MEF-2, VDR, CAR, PXR hnRNPD 24 RGS2, CD34, RSF1, WNK1, CBFB, CSTB, IQGAP2, GNAI1, LAPTM4B, SH3KBP1, HNRNPM, ARCN1, MARCH6 AP-2, BLIMP1, KROX, MEF-2, Sp3, STAT5B Sam68, hnRNPA1/A2 25 HNRNPU, SAP18, RBM26, LEO1, UQCRC2, RPL27, RPS9, DUSP11, DARS, ARPC2, ESF1, SERINC3, PITPNB, HNRNPA0, TCERG1, NPM1, C6orf48, IL2RG, RPS4X, API5, SEP15, RPL23A Blimp-1, Egr-1, Evi-1, GLI1, MTF-1, NIT2, Nkx2-2, Pax-6, PLZF, ZID DAZAP1, hnRNPD, hnRNPI(PTB), MBNL1, SRp20, SRp30c, YB-1, hnRNPA1/A2, hnRNPE1/E2 27 EIF4EBP2, CELF1, ATXN7L3B, PRPF8, RPSA, SRP72, DDX46, STK38, SRSF3, NGFRAP1, TOP2B BLIMP1, COUP-TF1, Egr-1, HNF-1alpha, IRF, IRF-1, SRF, ZID YB-1 28 PDLIM1, LMO2, C11orf67, SLC38A1, DUSP6, ROCK2, KIF2A, PHF3, CD164, TAX1BP1, SEPT6, DDX1 Egr-1, HNF-1alpha, RXR-alpha ETR-3, hnRNPK, SRp55 30 PSMA1, NEDD8, ADAM10, PSMC5, UBE2G2, PDCD10, SEC62, CWC27, LCP2, XPO7, PEBP1 AIRE, ATF6, FOXO1, Gfi1b, 1-Oct, Sp1 hnRNPF, SRp20, TIA-1 31 NXF1, FAU, TAF1D, PPTC7, PABPC3, ELF1, CHD9, JUND, LAPTM4A, RANBP2, HAT1, UBE2D3, SBDS, FLJ44635, TSC22D3, HSPH1, KPNB1 ER-alpha, ER-beta, GLI1, KROX, Nkx2-5, NRSF, Staf HTra2alpha, Sam68, SRp30c, SRp38, SRp55, hnRNPA1/A2 32 CCND2, METAP2, ARGLU1, MIS18BP1, CCNK, PSME3, RPL38, GLTSCR2, RPS11, HADHA, RPL31, HSPD1, RPL37A, NOP56, BRK1, RPL37, UBE2V2 AIRE, AP-2alphaA, dl, Egr-1, HNF-1alpha-A, 1-Oct, Olf-1, p53, Pax-6, Pbx1a, RSRFC4, Sp1 SC35, SRp38, hnRNPE1/E2 33 ITM2B, SCAPER, PPIG, MRFAP1L1, RASGEF1B, AIF1, RNPC3, H3F3B, BPTF, AFF1, B3GNT5 Alx-4, Blimp-1, BLIMP1, HNF-1alpha, KROX, Sp1 hnRNPF, SRp40, hnRNPA1/A2 34 NIN, ZNF609, ZFP36L2, SOD1, NOL7, LTB, ITGA6, WDR77, PIM1, SEPT7 ABF, Antp, COUP-TF1, ID1, KAISO, myogenin/NF-1, RORalpha1 hnRNPK [62]Open in a new tab The first column is differentially expressed isoform groups in our cases, the second and third columns are the Transcription factors and splicing factors predicted by our regression model. Some cells are blank, which means no corresponding factors for that co-expressed group. Mathematic Modeling Transcription factors regulate transcription, which controls gene expression. Splicing factors regulate alternative splicing, which splices pre-mRNA to RNA isoforms and in turn changes protein expressions [63][23]. Some predictive methods have modeled gene expression levels as a linear function of occurrences of TF-binding sites (TFBSs) [64][2], [65][24]. In this paper, we focus on RNA isoforms instead of genes and try to link isoform expression levels not only with the transcriptional factors but also with splicing factors. Isoform expression levels were formulated as linear regression of the strength of TF-isoform interactions and SF-isoform interactions, given by graphic file with name pone.0079118.e014.jpg (1) where, Inline graphic is an isoform; Inline graphic denotes the expression level; Inline graphic , Inline graphic and s Inline graphic are regression coefficients; Inline graphic is transposition operator; Inline graphic are binding strength vectors of transcription factors; Inline graphic is the number of transcription factors; Inline graphic are binding strength vectors of splicing factors; Inline graphic is number of splicing factors and Inline graphic is the error term. To estimate the binding strength between TFs and isoforms, we first constructed a promoter region database (PRD, [66]Figure 1B ) comprised of promoter sequences of isoforms. As in previous study [67][25], we extracted 2000 bp upstream of the transcription start sites as promoter regions. Similarly, we constructed an exon-intron dataset (EID, [68]Figure 1B ) to estimate the binding strength between SFs and isoforms. Most splicing factor binding sites locate near the splicing sites [69][26], especially 200 bp on both sides of this site. Therefore, for isoforms with alternative splicing, we gathered 200 bp of sequences around their splicing sites into EID to find splicing factor binding sites. Based on the PRD and EID, we could define the TF-isoform interaction strength and SF-isoform interaction strength. First, we downloaded TF motif models (PWM, position weight matrixes) from TRANSFAC [70][27] and JASPAR [71][28]. Then, for each TF, a hidden markov model based method called MAPPER [72][29] was used to identify its sites (TFBSs) in PRD. For each binding site, the MAPPER also outputs a binding score measuring the binding affinity. TFs with no hits on PRD were removed. After retrieving all putative binding sites and their binding scores, we defined the interaction strength between a TF Inline graphic and a RNA isoform Inline graphic as Inline graphic . Where, Inline graphic is the score of Inline graphic -th binding site between Inline graphic -th isoform and j-th transcription factor; and Inline graphic is the number of binding sites. Unlike TFs, SFs lack a reliable PWM database available. Thus, we gathered 53 splicing factors that are related with bone marrow cancers and their binding motifs from the SpliceAid [73][30] repository. This database collects all the experimentally assessed motifs that are bound by splicing factors in humans by means of an exhaustive literature search. Motifs with positive scores facilitate exon definition as exonic splicing enhancers (ESEs) or intronic splicing silencers (ISSs) motifs, while motifs with negative scores facilitate intron definition as exonic splicing silencers (ESSs) and intronic splicing enhancers (ISE) motifs. The absolute values of these scores measure the levels of binding affinity. To control false positive rate, we filtered out motifs with absolute scores less than 5 and lengths longer than 15. Finally, 49 out of 53 splicing factors are selected (see [74]Table 2). Since motifs of these splicing factors are degraded short pieces (6 to 10 base pairs on average), when retrieving all binding sites of these SFs using alignment tool Bowtie [75][31], we got numerous hits, most of which were false positives. To identify putative binding sites with high reliability, we downloaded base-wise conservation scores (phyloP) by comparing 45 vertebrate genomes with human genome from the UCSC Genome Browser website ([76]http://genome.ucsc.edu/). We defined the conservation score of a hit as the average conservation score of the nucleotides it spanned. Hits with scores lower than 2 were deleted. Though conversion score is not directly related with interaction strength, a highly conserved binding site on an isoform does provide evidence of a strong interaction. We further averaged conservation scores of all hits of a splicing factor on an isoform to eliminate the effect of exon numbers. This average is defined as interaction strength between a SF and an isoform, Inline graphic , where, Inline graphic is the conservation score of Inline graphic -th binding site between Inline graphic -th isoform and Inline graphic -th splicing factor; Inline graphic is the interaction strength between Inline graphic -th splicing factor and Inline graphic -th isoform; Inline graphic is the number of binding sites. Some factors, especially the hnRNP family, are due to alternative splicing, for example hnRNP H1, hnRNP H2 and hnRNP H3, which means they may have similar structure [77][32],motifs [78][33] and binding profiles. After we computed their interaction strength with isoforms according to our definition, the correlation between interaction strength vectors were high. To increase the robustness of our linear regression model, we averaged the binding strength and obtained four new factors, called hnRNP A1/A2 (average of hnRNP A1 and A2), hnRNP H1/H2/H3(average of hnRNP H1, H2 and H3), hnRNP C/C1/C2 (average of hnRNP C, C1 and C2) and hnRNP E1/E2 (average of E1 and E2). Table 2. Names and description of splicing factors used in our model. Splicing Factors Gene Name Description 9G8 SRSF7 Splicing factor arginine/serine-rich 7. The shuttling protein 9G8 binds TAP and can function as export factors. CUG-BP1 CUGBP1 CUG triplet repeat RNA binding protein 1. CUGBP1 induces exon 5 inclusion in cTNT gene (PMID: 9563950), induces exon 11 exclusion in IR gene (PMID: 11528389), induces intron 2 retention in CIC-1 gene (PMID: 12150906). DAZAP1 DAZAP1 DAZ associated protein 1. ETR-3 CUGBP2 CUG triplet repeat RNA binding protein 2. ETR-3 induces exon 5 inclusion in cTNT gene (PMID: 11931771), induces exon 9 inclusion in CFTR gene (PMID: 15657417), promotes selectively the exclusion of Tau exon 2 (PMID: 16862542). hnRNP A1 HNRNPA1 Heterogeneous nuclear ribonucleoprotein A1. hnRNP A1 carries bidirectional shuttling signals that serve for both nuclear localization and export (PMID: 8521471) hnRNP A2/B1 HNRNPA2B1 Heterogeneous nuclear ribonucleoprotein A2/B1. hnRNP A2 is involved in cytoplasmic RNA transport (PMID: 11024030). hnRNP C HNRNPC Heterogeneous nuclear ribonucleoprotein C. Tetramer composed of 3 copies of isoform C1 and 1 copy of isoform C2. hnRNP C proteins are restricted to the nucleus because they bear a nuclear retention sequence (NRS) (PMID: 8830767). hnRNP C1 HNRNPC Heterogeneous nuclear ribonucleoprotein C. Isoform C1 is due to Alternative Splicing. hnRNP C2 HNRNPC Heterogeneous nuclear ribonucleoprotein C. Isoform C2 is due to Alternative Splicing. hnRNP D HNRNPD Heterogeneous nuclear ribonucleoprotein D. hnRNP D0 HNRNPD Heterogeneous nuclear ribonucleoprotein D. Isoform D0 is due to Alternative Splicing. hnRNP DL HNRNPDL Heterogeneous nuclear ribonucleoprotein D-like. hnRNP E1 PCBP1 Holy(rC) binding protein 1. hnRNP E2 PCBP2 Holy(rC) binding protein 2. hnRNPF HNRNPF Heterogeneous nuclear ribonucleoprotein F. hnRNPH1 HNRNPH1 Heterogeneous nuclear ribonucleoprotein H1. hnRNPH2 HNRNPH2 Heterogeneous nuclear ribonucleoprotein H2. hnRNPH3 HNRNPH3 Heterogeneous nuclear ribonucleoprotein H3. hnRNP I(PTB) PTBP1 Polypyrimidine tract binding protein 1. In the context of CALCA gene, PTB enhances exon 4 inclusion (PMID: 9858533). nPTB functionally compensates for PTB and is up-regulated when PTB is removed (PMID:17679092). hnRNP J HNRNPK Heterogeneous nuclear ribonucleoprotein K. isoform J is due to Alternative Splicing. hnRNP K HNRNPK Heterogeneous nuclear ribonucleoprotein K. hnRNP K carries bidirectional shuttling signals that serve for both nuclear localization and export (PMID: 9218800). hnRNP M HNRNPM Heterogeneous nuclear ribonucleoprotein M. hnRNP P(TLS) FUS Fusion (involved in t (12, 16) in malignant liposarcoma). hnRNP Q SYNCRIP Synaptotagmin binding cytoplasmic RNA interacting protein. hnRNP U HNRNPU Heterogeneous nuclear ribonucleoprotein U (scaffold attachment factor A). HTra2alpha TRA2A Transformer-2 alpha. HTra2beta1 SRSF10 Splicing factor arginine/serine-rich 10. KSRP KHSRP KH-type splicing regulatory protein. MBNL1 MBNL1 Muscleblind-like. MBNL proteins can act as activators or repressors of splicing on different pre-mRNAs (PMID: 15257297). MBNLs are dsRNA binding factors that can bind CUG or CCUG repeats (PMID: 14722159). PSF SFPQ Splicing factor proline/glutamine-rich (polypyrimidine tract binding protein associated). RBM25 RBM25 RNA binding motif protein 25. RBM25 stimulated proapoptotic Bcl-X(s) isoform through weak 5′ss selection in EX2 (PMID: 18663000). RBM4 RBM4 RNA binding motif protein 4. RBM4 induce exon inclusion of alpha-TM EX9a and EX2b (PMID: 16260624) and tau EX10 (PMID: 16777844). RBM5 RBM5 RNA binding motif protein 5. Sam68 KHDRBS1 KH domain containing RNA binding signal transduction associated 1. SC35 SRSF2 Splicing factor arginine/serine-rich 2. SC35 accelerates transcriptional elongation (co-transcriptional splicing) (PMID: 18641664). SF1 SF1 Splicing factor 1. Gomafu lncRNA UACUAAC repeats bind to mouse SF1 with a higher affinity than the mammalian branch point consensus regulating splicing efficiency by changing the splicing factors nuclear level (PMID: 21463453) SF2 SRSF1 Splicing factor arginine/serine-rich 1 (splicing factor 2, alternate splicing factor). The shuttling protein SF2/ASF binds TAP and can function as export factors (18364396). SRp20 SRSF3 Splicing factor arginine/serine-rich 3. The shuttling protein SRp20 binds TAP and can function as export factors (18364396). SRp30c SRSF9 Splicing factor arginine/serine-rich 9. SRp38 FUSIP1 FUS interacting protein (serine/arginine-rich). Dephosphorylation converts SRp38 to a splicing repressor (PMID: 12419250). SRp38 functions as a general splicing repressor when dephosphorylated, but when phosphorylated it functions as a sequence-specific splicing activator (PMID: 18794844). SRp40 SRSF5 Splicing factor arginine/serine-rich 5. SRp54 SRSF11 Splicing factor, arginine/serine-rich 11. SRp55 SRSF6 Splicing factor arginine/serine-rich 6. SRp75 SRSF4 Splicing factor arginine/serine-rich 4. TDP43 TARDBP TAR DNA binding protein. It can act as transcriptional repressor (21252238). TIA-1 TIA1 Cytotoxic granule-associated RNA binding protein. TIAL1 TIAL1 TIA1 cytotoxic granule-associated RNA binding protein-like 1. YB-1 YBX1 Y box binding protein 1. ZRANB2 ZRANB2 Zinc finger, RAN-binding domain containing 2. ZRANB2 (ZNF265) is an SR-like protein that induce exclusion of EX2 and EX3 from the Tra2beta1 pre-mRNA in HEK293 cell (PMID: 11448987). [79]Open in a new tab This table contains 22 splicing factors which are selected to predict the expression levels of differentially expressed isoforms. This table lists their names and some related references. Most of these details