Abstract
Esophageal squamous cell carcinoma is a leading cause of cancer death.
Mapping the transcriptional landscapes such as isoforms, fusion
transcripts, as well as long noncoding RNAs have played a central role
to understand the regulating mechanism during malignant processes.
However, canonical methods such as short-read RNA-seq are difficult to
define the entire polyadenylated RNA molecules. Here, we combined
single-molecule real-time sequencing with RNA-seq to generate
high-quality long reads and to survey the transcriptional program in
esophageal squamous cells. Compared with the recent annotations of
human transcriptome (Ensembl 38 release 91), single-molecule real-time
data identified many unannotated transcripts, novel isoforms of known
genes and an expanding repository of long intergenic noncoding RNAs
(lincRNAs). By integrating with annotation of lincRNA catalog, 1,521
esophageal-cancer-specific lincRNAs were defined from single-molecule
real-time reads. Kyoto Encyclopedia of Genes and Genomes enrichment
analysis indicated that these lincRNAs and their target genes are
involved in a variety of cancer signaling pathways. Isoform usage
analysis revealed the shifted alternative splicing patterns, which can
be recaptured from clinical samples or supported by previous studies.
Utilizing vigorous searching criteria, we also detected multiple
transcript fusions, which are not documented in current gene fusion
database or readily identified from RNA-seq reads. Two novel fusion
transcripts were verified based on real-time PCR and Sanger sequencing.
Overall, our long-read single-molecule sequencing largely expands
current understanding of full-length transcriptome in esophageal cells
and provides novel insights on the transcriptional diversity during
oncogenic transformation.
Keywords: heterogeneity, long reads sequencing, esophageal squamous
cell carcinoma, transcriptome, alternative splicing, lincRNA,
transcript fusion
Introduction
Esophageal squamous cell cancer (ESCC) is a serious malignancy with
poor prognosis and mortality rate ([45]Lin et al., 2013; [46]Chen et
al., 2016). Recently, large-scale sequencing studies have revealed the
substantial genomic heterogeneity within and among ESCC patients, which
hampered the development of effective target therapies ([47]Hao et al.,
2016; [48]Yan et al., 2019). While genetic alterations initiate
tumorigenesis, how they affect the transcriptional program and
ultimately drive the malignant phenotype remains elusive. In order to
find the altered signaling pathways and novel functional transcripts
such as long intergenic noncoding RNAs (lincRNAs), several short-read
based transcriptome sequencing studies have been conducted during the
past several years ([49]Cancer Genome Atlas Research et al., 2017;
[50]Li et al., 2017). However, typical RNA-seq captures a large number
of contiguous short reads (about 100–250 bp) and reconstructs the
transcripts by statistical modeling. Thus, it is difficult to
completely describe RNA molecules from 5¢ to 3¢ end and to annotate
novel isoforms or genes using short reads ([51]Au et al., 2013;
[52]Steijger et al., 2013). On the contrary, PacBio single-molecule
real-time (SMRT) platform is capable to sequence long
circular-consensus sequence reads of several thousand base pairs and
have a good opportunity to capture full-length transcripts de novo. In
addition, hybrid sequencing algorithms have been developed to correct
sequencing errors by utilizing high accurate short reads. Thus, hybrid
PacBio SMRT sequencing provides a powerful tool to survey the
transcriptional landscape in cells ([53]Sharon et al., 2013;
[54]Tilgner et al., 2014; [55]Weirather et al., 2015).
In this study, we selected one normal immortalized esophageal squamous
epithelial cell line and four ESCC cell lines, which represent major
cell types of esophageal squamous cell carcinoma (ESCC) and
investigated the cellular heterogeneity at transcriptome level. Using
hybrid PacBio SMRT platform, our de novo sequencing of these five
representative esophageal cell lines yields ∼210 Gb of clean data or
∼2,000,000 full-length nonchimeric (FLNC) reads. All of these FLNC
reads have clear 5¢ and 3¢ messenger RNA (mRNA) canonical structure and
with an average length of >2.5 kb, which are well suitable to describe
the full transcript structures. We detected many new transcripts such
as novel isoforms, esophageal cancer specific lincRNAs and gene
fusions; we also cataloged the shifted alternative splicing (AS)
feature between cancer and normal esophageal cells, which collectively
highlighting the true heterogeneity and complexity.
Methods
Sample Preparation
Four human ESCC cell lines [KYSE140, KYSE510, TE5, and Shantou human
embryonic esophageal carcinoma (SHEEC)] and one normal immortalized
esophageal squamous epithelial cell line (SHEE) were used in this
study. Among the five cell lines in this study, KYSE140, KYSE510, and
TE5 are established from the resected specimens of patients with ESCC
([56]Shimada et al., 1992; [57]Nishihira et al., 1993). We chose these
three ESCC patient-derived cell lines as they cover all three types of
cell differentiation of primary tumor: KYSE140 is derived from a
patient with moderately differentiated squamous cell carcinoma, KYSE510
is derived from a patient with well-differentiated squamous cell
carcinoma, and TE5 is derived from a patient with poorly differentiated
squamous cell carcinoma. The SHEE and SHEEC cell lines were previously
established by our labs. SHEE cell line was a HPV18 E6E7-immoltalized
human embryonic esophageal epithelial cell line ([58]Shen et al.,
2001), and SHEEC cell line is tumor cell line established by malignant
transformation of SHEE induced by 12-O-tetradeeanoyl-phorbol-13-acetate
(TPA) ([59]Shen et al., 2000). The cell lines were authenticated by
short-tandem repeat analysis in 2018. Briefly, KYSE140 and KYSE510
cells were cultured in RPMI 1640 medium (Thermo) containing 10% fetal
bovine serum (GIBCO). TE5 cell was cultured in Dulbecco’s modified
Eagle’s medium (Thermo Fisher Scientific) supplemented with 10% fetal
bovine serum. SHEEC and SHEE cells were cultured in Dulbecco’s modified
Eagle’s medium/F12 medium (Thermo Fisher Scientific) with 10% newborn
bovine serum (Excell Biology). All cell lines were tested without
mycoplasma contamination before RNA isolation. Total RNA was isolated
with TRIzol (Invitrogen) as per the manufacturer’s instructions and
then treated with DNase and purified with PureLink^®RNA Mini Kit (Life
Technology). RNA was purified according to the following criteria: (1)
with concentration ≥300 ng/µl, (2) OD260/280 = 2.0–2.2 and OD260/230 =
1.8–2.1, and (3) RNA integrity number (RIN) ≥9, which is assessed on
the Agilent Bioanalyzer 2100 system.
Library Preparation and Sequencing
The SMRTbell^™ libraries were prepared according to the Isoform
Sequencing protocol (Iso-Seq) as described by Pacific Biosciences
(PN100-092-800-03). First, the complementary DNA (cDNA) was synthesized
by total mRNAs for each sample using the Clontech SMARTer PCR cDNA
Synthesis Kit. To increase the sequencing yield of long transcripts,
the Blue Pippin Size Selection System protocol was used to select the
>4-kb fragments for each sample after the PCR amplification of the
cDNAs. Then, the equimolar mixture of long cDNA fragments (>4 kb) and
normal cDNAs was used subsequently for SMRT sequencing. For Illumina
transcriptome library preparation and sequencing, a total amount of 3
µg RNA per sample was used as input material for the RNA sample
preparations. Sequencing libraries were generated using NEBNext^®
UltraTM RNA Library Prep Kit for Illumina^® (NEB, USA) following
manufacturer’s recommendations. The Illumina PE150 libraries were
sequenced on Hiseq 4000 platform.
Raw Data Processing and Mapping to Reference Genome
SMRT data were processed using the SMRTlink 5.0 software (Pacific
Biosciences). Circular consensus sequence was generated from subread
BAM files with the default parameters. The nonchimeric reads, which
include nonfull-length and full-length transcripts, were then clustered
by isoform level clustering algorithm. The produced clusters were
finally polished using ARROW software (Pacific Biosciences). Additional
nucleotide errors in consensus reads were corrected using the Illumina
RNA-seq data by the LoRDEC software ([60]Salmela and Rivals, 2014).
Consensus reads were aligned to reference annotations (Ensemble 38
release 91) using GMAP ([61]Wu and Watanabe, 2005) with the following
parameters –no-chimeras –cross-species –expand-offsets 1 -B 5 -K 50000
-f samse -n 1.
For Illumina RNA-seq reads mapping, reference genome and gene model
annotation files were downloaded from genome website directly
([62]ftp://ftp.ensembl.org/pub/release-91/fasta/homo_sapiens/dna/Homo_s
apiens.GRCh38.dna.primary_assembly.fa.gz). Hisat2, which is a
fast-spliced aligner with low-memory requirements, was used to build
the index of the reference genome, and paired-end clean reads were
mapped against the reference genome
([63]https://github.com/infphilo/hisat2). HTSeq v0.6.1 was used to
count the read numbers mapped to each gene. Then, read per kilobase of
exon per million mapped reads of each gene was calculated based on the
length of the gene and reads count mapped to this gene
([64]http://htseq.readthedocs.io/en/release_0.9.1/).
Gene Structure Analysis and Novel Transcript Annotations
The GMAP output BAM format file and GTF format genome annotation file
were used for gene and transcript structure determination. Long read
clusters were overlapped with gene models to find novel isoforms and
genes as previously reported ([65]Abdel-Ghany et al., 2016). We also
compared the transcription start sites (TSS) of each transcript with
the Cap Analysis of Gene Expression (CAGE) promoter tags and epigenetic
marks that are typically associated with actively transcribed promoters
(H3K4me1, H3K4me3, and H3K27ac). TSSs are defined as the first genomic
position of each transcript structure. We downloaded peak calls of CAGE
promoter tags from FANTOM5 ([66]FANTOM Consortium and the RIKEN PMI and
CLST (DGT) et al., 2014) and epigenetic marks (adult esophagus) from
the Roadmap Epigenomics Consortium ([67]Bernstein et al., 2010).
LiftOver was used when necessary
([68]https://genome.ucsc.edu/cgi-bin/hgLiftOver).
Unmapped transcripts and novel gene transcripts were scanned and
annotated by Diamond BLASTX with parameter e value “1e−5” in the
following protein/peptide databases ([69]Buchfink et al., 2015): NR
(NCBI nonredundant protein sequences), KOG/COG (Clusters of Orthologous
Groups of proteins), Swiss-Prot, Kyoto Encyclopedia of Genes and
Genomes (KEGG) Ortholog database. Similarly, novel transcripts were
also searched against the Pfam database ([70]Finn et al., 2016) by
Hmmscan software ([71]http://hmmer.org/download.html).
Analysis of Alternative Splicing Patterns
Alternative events were analyzed by SUPPA ([72]Trincado et al., 2018).
To quantify the differential isoform usage between cells, we defined
the score D of each gene as follows:
[MATH: Dj=∑i=1<
/mn>4(1−cid) whereas c=a∩b, d=a∪b :MATH]
(1)
suppose gene j has isoform set a and set b, respectively, in cell line
X and Y; c is the number of isoform intersection for set a and set b; d
is the number of isoform union for set a and set b. Thus, D sums up
scores when comparing the four esophageal squamous carcinoma cells with
SHEE. Genes with a higher D value are more diversely spliced. A set of
differentially spliced genes, which was identified from TCGA clinical
esophageal samples, were used in this study to verify the altered
splicing pattern ([73]Mao et al., 2019). Enrichment analysis of spliced
genes was conducted by DAVID against GO Biological Processes terms with
a cutoff false discovery rate ≤ 0.05 ([74]Huang da et al., 2009).
lncRNA Analysis Pipeline
SMRT transcripts were first analyzed by CNCI and PLEK with default
parameters to predict the coding potential of transcripts. These two
software adopted support vector machine algorithms to effectively
distinguish protein-coding and noncoding sequences independent of known
protein annotations ([75]Sun et al., 2013, [76]Li et al., 2014a). In
order to strictly identify the lncRNA candidates, SMRT transcripts were
also scanned against known protein sequence databases subsequently. CPC
software was used to assess the extent and quality of the open reading
frame in a transcript and search the sequences against NCBI eukaryotes’
protein database to clarify the coding and noncoding transcripts
([77]Kong et al., 2007). Pfam-scan translated each transcript in all
three possible frames to identify occurrence of any of the known
protein family domains documented in the Pfam database ([78]Finn et
al., 2016). Finally, transcripts predicted with coding potential by any
of the above four tools were filtered out, and those without protein
coding potential were candidate set of lncRNAs. The Cabili’s reference
set, which catalogs lincRNAs across 24 human tissues and cell types
(but without esophageal tissue), were downloaded from supplemental
files of the publication ([79]Cabili et al., 2011). Candidate lincRNAs
from four esophageal cancer cells, but not expressed in SHEE cells and
Cabili’s reference set, were defined as esophageal cancer cell specific
lincRNAs.
The interacting target genes of lncRNAs are predicted based on an
approach using the coexpression and colocalization pattern ([80]Dempsey
et al., 2018, [81]Gao et al., 2019). Briefly, the expression
correlation was calculated between lncRNAs and coding genes. Genes with
a Pearson correlation coefficient >0.95 (p < 0.001) and reside in 100k
upstream or downstream of lncRNA were identified as target genes of
that lncRNA.
All microRNA (miRNA) hairpin sequences were downloaded from miRBase
database ([82]http://www.mirbase.org) and were blasted against the
sequences of esophageal cancer cell specific lincRNAs to identify
potential pre-microRNA.
Transcripts Fusion Detection and Comparison With Known Gene Fusion in
Database or Predicted From RNA-Seq
SMRT transcripts were determined as transcript fusions according to the
following criteria:(1) SMRT transcripts were mapped to two or more
long-distance range separate loci, and each locus must map at least 10%
of the query transcript; (2) total combined alignment coverage is at
least 99%; (3) minimum distance between each mapped locus is above 100
kb; and (4) at least two Illumina reads were found spanning the
junction to support the candidate fusion transcripts. SMRT transcripts,
which meet all of above criteria, were regarded as fusion transcripts.
The overviews of fusion events between locations in chromosomes are
drawn in R package “RCircos” ([83]Zhang et al., 2013).
Illumina RNA-seq reads were also used to detect gene fusion events by
STAR-Fusion with default parameter
([84]https://github.com/STAR-Fusion). Furthermore, based on comparing
the number of Illumina RNA-seq reads spanning the junction (denoted as
s) and reads adjacent to it on both sides (denoted as a and b,
respectively), a set of candidate fusion transcripts with
[MATH:
as<2∩
bs<2 :MATH]
were filtered, then manually inspected for follow-up experimental
verification. For comparison of detected fusions with known records in
gene fusion database or predictions from RNA-seq data, we first
transform the gene symbols to Ensemble Ids (which are used in ChimerDB
3.0); then, gene level match is considered as an overlap result.
KEGG Pathway Enrichment Analysis
The KOBAS software was used for statistical enrichment of gene list
based on KEGG pathways ([85]Wu et al., 2006).
Fusion Transcripts Verification and Sequencing
cDNA was prepared by SuperScript III (Invitrogen). Primers were
designed to span the junctions and are listed in [86]Table S6.
Quantitative real-time PCR were conducted as previously reported
([87]Xu et al., 2011). The expected PCR products for individual fusion
were determined via melting curve analysis. PCR products were run on
gels and then gel purified and subjected to Sanger sequencing (Shanghai
Sangon Biotech Co., Ltd.).
Four RNA-seq projects (PRJNA140847 with 6 ESCC tumor tissue samples,
PRJNA298963 with 15 tumor tissue samples, PRJNA435587 with 7 tumor
tissue samples, and PRJNA147913 with 7 tumor tissue samples) were
collected. Blast with default parameters was used to align fusion
sequences to all of RNA-seq reads.
Results
Full-Length Transcriptome Sequencing of Esophageal Cells
The analysis pipeline for transcriptional landscape of esophageal cells
is illustrated in [88]Figure 1 . According to Pacbio protocols, total
RNA was purified to prepare for cDNA library for each cell line.
Transcripts were simultaneously sequenced with SMRT cells on the PacBio
Sequel and on the Illuminia Hiseq 4000 platforms. Totally, we
identified 445,983, 477,033, 491,354, 327,459, 259,482 FLNC reads from
KYSE140, KYSE510, SHEE, SHEEC, and TE5 cells, respectively, which cover
∼80% of all circular-consensus sequences in each cell line ([89]Table
1). These high-quality FLNC reads were clustered by Arrow algorithm to
obtain the consensus sequences. Short Illumina RNA-seq reads from the
same RNA samples were also produced for each cell line. After trimming
the sequencing adapter, poly(A) tail, and low-quality bases, the
filtered RNA-seq reads were utilized to further correct FLNC reads by
LoRDEC using the default parameters. After error correcting, the mean
length of reads was within 2.3–3.2 kbp in esophageal cells, indicating
good quality of the SMRT dataset ([90]Table S1A).
Figure 1.
[91]Figure 1
[92]Open in a new tab
The analyses pipeline for transcriptional landscapes in esophageal
cells.
Table 1.
Quality control of single-molecule real-time (SMRT) raw data in
esophageal cells.
Sample CCS Reads with 5′ primer Reads with 3′ primer Reads with Poly-A
Flnc Average flnc read length Flnc/CCS
KYSE140 557777 501,590 516,189 504,004 445,983 2,539 0.8
KYSE510 602754 545,561 558,823 539,496 477,033 2,442 0.79
SHEE 620818 555,241 574,085 559,246 491,354 2,566 0.79
SHEEC 424278 376,723 392,202 386,314 327,459 2,228 0.77
TE5 308010 284,929 287,900 283,066 259,482 3,073 0.84
[93]Open in a new tab
CCS, Circular Consensus Sequence; Flnc, Full-length nonchimeric reads.
Characterization of Full-Length Transcripts in Esophageal Cells
Novel Gene and Isoforms Identified From Full-Length Transcripts
We aligned long sequencing reads to the human Ensembl 38 release 91
genome using GMAP. We found an average alignment identity of ∼90%, with
∼80% transcripts uniquely mapping to the reference genome in each
esophageal squamous cell ([94]Table S1B). Similarly, short Illumina
RNA-seq reads were also mapped to reference genome with a mapping rate
>92% by HISAT2 mapper ([95]Table S1C). According to the recent
annotations, over 80% of SMRT transcripts are novel genes or isoforms
of known genes ([96]Figure 2A left panel). Compared with known isoforms
of known genes, the expression of the novel isoforms from known or
novel genes are relatively low ([97]Figure 2A right panel). The numbers
of novel transcripts sharing among different cell lines are
demonstrated in [98]Figure 2C.
Figure 2.
[99]Figure 2
[100]Open in a new tab
Single-molecule real-time (SMRT) sequencing identifies novel genes or
isoforms of known genes. (A) Left panel: bar chart illustrates the
percentages of novel isoform of known genes (purple), isoform of novel
genes (green), and novel isoform of novel genes (brown). Right panel:
average counts of full-length nonchimerics (FLNCs) in each esophageal
cell line. (B) Distances distribution of transcription start site (TSS)
in each full-length transcript to the closest epigenetic marks and Cap
Analysis of Gene Expression (CAGE) tags. (C) The numbers of novel
transcripts sharing among different cell lines. (D) Unannotated
transcripts were scanned in several peptide or protein databases for
each esophageal cell line.
To support the accuracy of obtained full-length transcripts and the
analysis pipeline, we performed a transcriptome-wide comparison of TSSs
detected in PacBio dataset with CAGE promoters and active epigenetic
marks from Roadmap Epigenomics Project. Although these external
datasets are generated from adult esophagus tissue, they are reasonable
approximation for esophageal cells. As shown in [101]Figure 2B, ∼50% of
TSSs detected by long read sequencing were within 10 bp to their
counterpart in the FANTOM5 CAGE dataset. They are even closer to the
three epigenomics marks, and the majority of TSSs have a mean distance
of 1 bp. The concordance between CAGE tags and epigenetic marks
datasets with 5¢ ends detected in PacBio dataset confirmed the validity
of identified full-length transcripts.
To further establish the accuracy of the full-length transcripts, those
unannotated transcripts were blasted against several peptide or protein
databases including NR, Pfam, KOG, KEGG, and Swiss-Prot. Protein
products from over 85% of the transcripts can be found in at least one
of the above databases, suggesting that many of the novel transcripts
are indeed translated into proteins ([102]Figure 2D). Overall, multiple
orthogonal datasets provide independent confirmations that reported
transcripts are most likely full length.
[103]Figure 3 provides two examples of these novel isoforms of known
genes. For example, VIL2 (also known as Ezrin) encoded three transcript
variants that differ in the transcriptional start site. We previously
found that there are two variants of VIL2 in the esophageal cancer
cells (V1: ENST00000337147.11;V2:ENST00000367075.3), and different
transcriptional regulatory mechanisms regulated their transcription
([104]Zhang et al., 2015; [105]Zhang et al., 2018). From current SMRT
data, it is clear that, except for the three annotated VIL2 variants in
reference genome, other 24 VIL2 variants with different expressions are
also transcribed but have not been annotated in KYSE510 cells
([106]Figures 3A, C). We recently found that the AS of Tropomyosin I
(TPM1) is regulated by its natural antisense TPM1-AS, resulting in
specifically downregulation of TPM1variants ([107]Huang et al., 2017).
Similarly, multiple novel isoforms of this gene were also detected from
SMRT data ([108]Figures 3B, D). For both VIL2 and TPM1 genes, there are
plenty of cell-type-specific variants than those shared by multiple
cells, suggesting that there may be specific splicing events in each
cell ([109]Figures 3C, D).
Figure 3.
[110]Figure 3
[111]Open in a new tab
Examples of isoforms of known genes. Single-molecule real-time (SMRT)
transcripts detected in each cell line for (A) VIL2 and (B) TPM1. (C)
Novel isoforms of VIL2 gene in KYSE510 cells. (D) Novel isoforms of
TPM1 gene in KYSE510 cells. Blue: Known transcript annotations; red:
known isoforms identified from SMRT data; black: novel isoforms from
SMRT data. The number of FLNCs detected is shown in brackets.
To investigate what kind of genes have more novel isoforms, we sorted
all genes according to the number of its isoforms and conducted the
KEGG pathway enrichment analysis of top 10% genes with the most
isoforms. We found that these genes are significantly enriched in known
cancer-associated pathways such as viral carcinogenesis, PI3K-Akt, and
MAPK signaling pathway ([112]Figure S1).
Mining Esophageal Cancer Cell Specific lincRNAs
LincRNAs are transcribed from intergenic regions between protein-coding
genes. Recent studies have pointed out their critical function
associated with the pathogenesis of ESCC ([113]Li et al., 2014b;
[114]Shen et al., 2016). Based on collections of ∼4 billion RNA-seq
reads, Cabili et al. previously have defined a reference catalog of
∼14,000 human lincRNAs with expression pattern across 24 human tissues
and cell types (without esophageal tissue) ([115]Cabili et al., 2011).
To mine esophageal cancer cell specific lincRNAs, we selected the
lncRNAs that were detected in four esophageal cancer cells but not in
the normal-like SHEE cell and Cabili’s reference set. Under a stringent
criterion (see Methods), 5,400, 5,210, 4,883, 4,756, and 2,274 lncRNAs
were directly predicted from SHEE, KYSE140, KYSE510, SHEEC, and TE5
cells, respectively ([116]Figure S2A). Totally, 1,521 specific lincRNAs
were found in esophageal cancer cells ([117]Table S2).
We also predicted that lncRNAs regulated target genes and found that
they are significantly enriched in cancer-related signaling pathways
and extracellular matrix receptor interactions, suggesting that the
interacting lincRNAs may have similar biological functions ([118]Figure
S2B). Furthermore, 37 potential pre-miRNAs were detected by aligning
known miRNA hairpin sequences against lincRNA sequences ([119]Table
S3).
Identification of Cell-Specific Isoform Usage in Esophageal Cells
Employing the SUPPA package, SMRT data were also used to analyze AS
events, which are classified into several categories, such as skipped
exon (SE), mutually exclusive exon, alternative 5¢ splice site,
alternative 3¢ splice site, alternative first, alternative last exons,
and retained intron. The results showed that SE is the richest events
among all AS types in all esophageal cells. This is consistent with
previous findings, in which SE is the most prevalent AS mechanism in
human genome ([120]Sultan et al., 2008; [121]Wang et al., 2008). In
contrast, mutually exclusive exon only accounts for ∼5% of all AS
events and is the most infrequent AS type ([122]Figure 4A). Compared
with four other tumor cells, AS events in the normal-like cell SHEE
shows no particular preference and exclusion (÷^2 test, P > 0.05,
[123]Figure 4A). Score D is constructed to quantitatively measure the
isoform usage for each gene between tumor cells and the normal-like
cell. From [124]Figure 4B, it can be seen that the top 500 diversely
spliced genes are significantly enriched in three Gene Ontology (GO)
terms “DNA repair,” “cellular response to DNA damage stimulus,” and
“positive regulation of GTPase activity” (false discovery rate ≤0.05,
[125]Figure 4B and [126]Table S4). Based on a set of differentially
splicing genes identified from TCGA clinical sequencing dataset
recently ([127]Mao et al., 2019), we also verified that DNA damage and
repair-related genes are significantly spliced in clinical esophageal
patient samples ([128] Table S5).
Figure 4.
[129]Figure 4
[130]Open in a new tab
Shifted alternative splicing pattern in esophageal cells. (A)
Percentage of splicing events in each esophageal cell line. SE, skipped
exon; MXE, mutually exclusive exon; A5, alternative 5¢ splice site; A3,
alternative 3¢ splice site; AF, alternative first; AL, alternative last
exons; and RI, retained intron. (B) Differentially spliced genes
between normal-like and malignant esophageal cells are significantly
enriched in three Gene Ontology (GO) terms.
Identification and Validation of Novel Fusion in Esophageal Cells
Based on searching criteria, 1,972 transcript fusions were identified
from the full-length SMRT reads. The parental genes of fusion
transcripts can be located from the same chromosome or from different
one. However, the frequencies have no significant difference when
comparing the esophageal tumor cells with the normal-like cell (÷^2
test, P > 0.05, [131]Figure S3A). KEGG pathway enrichment indicated
that the fusion-occurring genes are in favor of biological function
related with RNA processing (i.e., splicosome, ribosome, and RNA
transport) and cancer signaling pathways(i.e., focal adhesion, cell
cycle, and apoptosis) ([132]Figure S3B).
Comparison of transcript fusion events identified by SMRT and RNA-seq
assembly indicated that SMRT identifies much more fusions ([133]Table
2). For example, in contrast to 39 splicing fusion found from Illumina
short reads, 335 were identified from long reads in TE5 cell. ChimerDB
3.0 is an enhanced database for fusion genes from cancers ([134]Lee et
al., 2017). It archives thousands of gene fusions collected from
published reports or predicted from RNA-seq transcriptome analysis. We
compared the identified gene fusion pairs from PicBio SMRT data to all
records in ChimerDB 3.0 (i.e., including fusions found in all cancer
types). There are 2, 3, 2, 2, and 1 overlapped fusions found in TE5,
SHEE, KYSE510, KYSE140, and SHEEC, respectively ([135]Table 2). The
fact that there were few overlaps between SMRT detection and the
records in ChimerDB 3.0 suggests most fusions found from long reads are
novel transcript fusions.
Table 2.
Comparison of transcript fusions detected from PacBio data with those
from Illumina reads or ChimerDB.
TE5 SHEE KYSE510 KYSE140 SHEEC
PacBio unique fusion 335 553 415 377 292
Illumina unique fusion 39 59 31 61 49
Found in both Illumina and PacBio UQCC1–C22orf39
ARHGAP21—SVIL
KIF13A–TPMT HNRNPUL2–C11orf49
ARHGAP42–CNTN5
INCENP–GUCY2EP
CTPS1–KCNQ4
SPAG9–LINC02071
ASAP1–KB-1568E2.1
KANSL1–ARL17B IKBKAP–MRRF
MRRF–IKBKAP
GLI3–MRPL48 TNRC6B–PRR5
CERK–LINC02036
MB21D2–ATP13A4
HCG18–ERP29 –
Found in both ChimerDB and PacBio KIF13A–TPMT
NBPF1–NBPF15 KANSL1–ARL17B
NBPF1–NBPF15
SLC16A3–FTH1 NBPF1–NBPF15
CDH12–HSPD1 CPSF6–C9orf3
CDH12–HSPD1 PDE4D–
[136]Open in a new tab
With the help of RNA-seq short reads, we further employed a reads
filter to select six candidate fusion transcripts, which expressed
comparably with their parental genes (i.e., no less than twofold of
their parental genes). After manual inspection, one fusion transcript
was discarded due to ambiguous sequence mapping. We evaluated the
prevalence of the remaining transcript fusions in esophageal cells by
real-time PCR followed by Sanger sequencing ([137]Table S6). Two novel
transcript fusions, ring finger and CCCH-type domains 1–aldo-keto
reductase family 1 member B10 (RC3H1-AKR1B10) and NEK9-TTC21B, could be
verified ([138] Figure 5A and [139]Figure S4). We focused on
RC3H1-AKR1B10 since this transcript fusion is differentially expressed
in esophageal cells, with the lowest expression in normal-like SHEE
cell ([140]Figure 5B). The RC3H1 protein consists of a Roquin domain,
which is required for constitutive decay element-dependent RNA binding.
At both N- and C-terminal sides of Roquin domain, there are regions
used for nucleotide-binding. RC3H1 also contains two zinc finger motifs
([141]Schuetz et al., 2014). AKR1B10 encodes aldo/keto reductase, which
can efficiently reduce aliphatic and aromatic aldehydes ([142]Gallego
et al., 2007). The last 3¢ untranslated region exon of RC3H1 fused with
the first 5¢ untranslated region exon of AKR1B10; thus, the fused
protein is expected to retain intact functional regions from both of
two parental genes ([143]Figure 5A). Blast with default parameters was
used to align the two fusion sequences to several public available
RNA-seq datasets from ESCC patients. We did not find positive results
from these clinical samples (data not shown).
Figure 5.
[144]Figure 5
[145]Open in a new tab
Ring finger and CCCH-type domains 1–aldo-keto reductase family 1 member
B10 (RC3H1-AKR1B10) is a differentially expressed transcript fusion in
esophageal cells. (A) Schematic of RC3H1-AKR1B10 chimeric RNA in
esophageal cells. Fusion transcripts are predicted to retain intact
functional regions from both parental genes. Zn: zinc finger; ROQ:
Roquin domain; HN and HC: N- and C-terminal nucleotide-binding sites of
Roquin domain; Aldo_ket_red: aldo/kept reductase domain. (B)
Representative RT-PCR reactions demonstrating the differentially
expressed fusion in 5 esophageal cell lines.
Discussion
Esophageal cancer is a common and highly heterogeneous malignancy.
Thus, uncovering the transcriptome-wide complexity and heterogeneity
will provide clues for target therapies. The median length of human
gene transcripts is ∼2,500 bp; however, RNA-seq reads are only about
100–200 bp long. This indicates that novel isoforms or genes cannot be
reliably inferred from short reads directly. Long reads sequencing can
directly capture thousands of base pairs from single molecules, thus
greatly expanding the sequencing capability. However, it is usually
with lower accuracy comparing with short reads sequencing. In this
study, we adopted a hybrid strategy, which combines RNA-seq and SMRT
long reads sequencing, to investigate the transcriptional landscape in
well-characterized ESCC cells.
Compared with the high-quality Ensembl 38 annotation of the human
transcriptome, ∼15% of the spliced mappings in our results are from
known gene transcripts, ∼70% are from novel isoforms of known genes,
and the remaining ∼15% transcripts may represent novel isoforms of
novel genes. These results suggest that isoform and gene identification
are likely far from complete in esophageal squamous cells.
LincRNAs have been shown to play important roles in diverse cellular
processes such as regulating key cellular pathways and recruiting the
chromatin-modifying complex to specific genomic loci ([146]Rinn and
Chang, 2012). Previously, we and others have delineated several
lincRNAs which are critical determinants in ESCC tumorigenesis and
development ([147]Shen et al., 2016; [148]Lin et al., 2018; [149]Zhang
et al., 2018). Among the novel full-length transcripts, we have
identified a set of esophageal cancer specific lincRNAs and pre-miRNAs
potentially transcribed from them ([150]Tables S2 and [151]S3).
Furthermore, coexpressed coding genes are also found. KEGG enrichment
analysis indicated that these lincRNA-regulated mRNAs are involved in a
variety of cancer signaling pathways, which suggested that these cancer
cell specific lincRNAs, together with its interacting partners, may be
actively involved in cells transformation.
During the onset of carcinogenesis, shifted splicing of DNA repair
genes has previously been documented in several cancer studies, such as
BRCA1 and FANCM in breast cancer and ERCC1 in ovarian cancer ([152]Sun
et al., 2009; [153]Sevcik et al., 2013; [154]Peterlongo et al., 2015).
However, whether or not this mechanism is active in esophageal cancer
cells is largely unknown. Based on SMRT data, we found that isoform
usage of DNA damage response related genes is significantly different
between esophageal tumor cells and the normal-like cell ([155]Table
S4). Importantly, this feature can be recaptured from clinical
esophageal samples ([156]Table S5). In addition, we also provided
evidence that activators of GTPase activity are differentially spliced
between tumor and normal cells ([157]Figure 4B). Although the altered
splicing pattern of this group of genes is not clear in clinical
samples currently, several investigations have confirmed that a few
regulators of GTPase use different variants in cancers. For example,
Wang et al. found that the splicing pattern of fibroblast growth factor
receptor substrate 3, tuberous sclerosis 2, and RAS guanyl releasing
protein 2 has significant race-related differences among prostate
cancer patients ([158]Wang et al., 2017). Furthermore, a spliced
variant of ARF6 guanine nucleotide exchange factor was found to
regulate the cancer cell migration and invasion ([159]Ratcliffe et al.,
2018). Intriguingly, fibroblast growth factor receptor substrate 3,
tuberous sclerosis 2, and several ARF GEF and GTPase-activating protein
family members are all listed at the top of spliced genes in this study
([160]Table S5). Collectively, our results suggested that shift AS of
genes associated with DNA damage response and GTPase signaling may
contribute to ESCC pathogenesis and should be exploited for detailed
mechanism in the future.
Oncogenic fusions have been found in many cancers. Growing interests
have linked transcript fusion to diverse clinical applications ranging
from tumor subclassification, early diagnostics, to development of
effective treatment targeted this lesion ([161]Shaw et al., 2013;
[162]Mertens et al., 2015; [163]Yoshihara et al., 2015). However, until
now, little is known for their role in ESCCs. In this study, we
cataloged many novel transcript fusions and expanded the gene fusion
repository in esophageal cells. We detected 1,972 novel transcript
fusions from SMRT sequencing data, which are much more than those
archived in cancer fusion database or fusions predicted from RNA-seq
short reads. Based on RT-PCR and Sanger sequencing, we verified two
novel fusion transcripts. Interestingly, our analyses uncovered a new
fusion transcript between the genes RC3H1 and AKR1B10. Previous reports
indicate that RC3H1 activates the deadenylation and degradation in
constitutive decay elements containing mRNAs ([164]Leppek et al., 2013;
[165]Schuetz et al., 2014). AKR1B10 is nicotinamide adenine
dinucleotide phosphate-dependent aldo-keto reductase. It is highly
expressed in several human cancer types such as hepatocarcinoma,
nonsmall cell lung cancer and breast cancer and may play an important
role in carcinogenesis ([166]Fukumoto et al., 2005; [167]Ma et al.,
2012; [168]Cheng et al., 2018). We demonstrated that this novel fusion
is highly expressed in ESCC cells compared with the normal immortalized
esophageal squamous cell ([169]Figure 5B). However, they did not occur
in a few ESCC patients when searching these fusions from publicly
available tumor samples. One possibility is that the searched ESCC
patient datasets are limited, a total of 35 patients from 4
investigations. Thus, if the above two fusions are rare (i.e., <1%),
they are hardly to be detected in clinical cohorts with small size.
Another possibility is that the two fusions are cell specific; thus,
they cannot occur in clinical samples. The functional consequence and
clinical relevance need further investigation when larger ESCC cohorts
are publicly available.
Conclusions
Sequencing technology is currently rapidly evolving. Combining PacBio
SMRT platform with short reads sequencing, we have defined a large
number of full-length transcripts and significantly increased the gene
and isoform annotation for esophageal cells. Specifically, our
investigations into the AS diversity, cancer cell specific lincRNAs,
and detection of novel transcript fusions enlighten current
understanding of transcriptional heterogeneity and complexity during
oncogenic transformation in esophageal cells.
Data Availability Statement
The datasets supporting the conclusions of this article are available
at Gene Expression Omnibus (PRJNA515570) and in the Genome Sequence
Archive in the BIG Data Center of Beijing Institute of Genomics (BIG),
Chinese Academy of Sciences, under accession numbers (CRA001374).
Author Contributions
J-ZX, L-YX, and E-ML designed the study and the interpretation of the
findings. J-ZX, Y-MC, Q-QZ, and D-ZC performed the transcriptome and
other computational analyses. Y-WC, XZ, Y-RW, L-DL, YC, and QY carried
out library preparation and sequencing, transcript fusion confirmation,
and other experiments. J-ZX, L-YX, and E-ML oversaw the project. J-ZX
wrote the manuscript that was subsequently read and approved by all
coauthors.
Funding
This work was supported by grants from National Natural Science
Foundation of China (grant number nos. 81472613, 81673037, and
81772532).
Conflict of Interest
Author Y-MC was employed by Tianjin Novogene Bioinformatics Technology
Co. The remaining authors declare that the research was conducted in
the absence of any commercial or financial relationships that could be
construed as a potential conflict of interest.
Supplementary Material
The Supplementary Material for this article can be found online at:
[170]https://www.frontiersin.org/articles/10.3389/fgene.2019.00915/full
#supplementary-material
[171]Click here for additional data file.^ (780.5KB, xls)
[172]Click here for additional data file.^ (90.8KB, xlsx)
[173]Click here for additional data file.^ (11.5KB, xlsx)
[174]Click here for additional data file.^ (341.9KB, xlsx)
[175]Click here for additional data file.^ (53.5KB, xlsx)
[176]Click here for additional data file.^ (33KB, doc)
[177]Click here for additional data file.^ (1.3MB, tif)
[178]Click here for additional data file.^ (416.6KB, tif)
[179]Click here for additional data file.^ (756.9KB, tif)
[180]Click here for additional data file.^ (2.6MB, tif)
References