Abstract

Simple Summary

   In this study, we evaluated the differences in the alternative splicing
   (AS) profiles between normal liver tissue, HepG2 malignant cells, and
   Huh7 malignant cells using a description of AS profiles as arrays of
   genes characterized by the degree of AS (defined as the number of
   detected splice variants per gene). In brief, we demonstrated that this
   new metric can be employed to successfully identify biological pathways
   that are influenced by alterations in AS, by utilizing a
   mathematical algorithm previously developed for gene enrichment
   analysis based on gene expression profiles. Furthermore, since
   long-read RNA sequencing allows one to also describe the AS profiles as
   arrays of quantified single transcript isoforms, we employed Yanai’s
   tissue specificity index (suggested for gene expression analysis) to
   select groups of genes expressing only one or two splice variants
   specifically in liver tissue, HepG2 malignant cells, and Huh7 malignant
   cells, thus providing additional information to that derived from the
   analysis of gene expression profiles alone. Most of these splice
   variants were translated into protein products that can contribute to
   phenotypes of normal and malignant human hepatocytes, thereby making
   them of interest for further study of the mechanisms underlying
   cell malignization.

Abstract

   Long-read RNA sequencing developed by Oxford Nanopore Technologies
   provides a direct quantification of transcript isoforms, thereby making
   it possible to present alternative splicing (AS) profiles as arrays of
   single splice variants with different abundances. Additionally, AS
   profiles can be presented as arrays of genes characterized by the
   degree of alternative splicing (the DAS—the number of detected splice
   variants per gene). Here, we successfully utilized the DAS to reveal
   biological pathways influenced by the alterations in AS in human liver
   tissue and the hepatocyte-derived malignant cell lines HepG2 and Huh7,
   by employing the mathematical algorithm of gene set enrichment
   analysis. Furthermore, analysis of the AS profiles as abundances of
   single splice variants by using the graded tissue specificity index τ
   enabled the selection of groups of genes expressing particular
   splice variants specifically in liver tissue, HepG2 cells, and Huh7
   cells. The majority of these splice variants were translated into
   protein products and merit attention in further studies of
   the mechanisms underlying cell malignization. The metrics used are
   intrinsically suitable for transcriptome-wide AS profiling using
   long-read sequencing.

   Keywords: transcriptome, long-read sequencing, alternative splicing,
   degree of alternative splicing, splice variants abundance, human liver
   tissue, HepG2 and Huh7 cells, biological pathways, tissue specificity
   index

1. Introduction

   Alternative splicing (AS) allows for a single gene to be transcribed
   into two or more mRNA transcripts (splice variants or isoforms), thus
   ultimately providing a remarkable increase in proteome diversity in
   higher eukaryotes. The switching via AS to different transcript
   isoforms is involved in cellular differentiation, the control of cell
   functions, and the cell response to environmental changes
   [1,2]. AS is highly regulated, and aberrant splicing
   contributes to various diseases, including cancer. In humans, over 90%
   of transcripts undergo alternative RNA processing, and about 15% of
   hereditary diseases and cancers are thought to be associated with a
   dysregulation of AS [1,2].

   The transcriptome-wide analysis of AS was greatly boosted by the
   advent of next-generation sequencing and is mostly based on the
   high-throughput sequencing of short cDNA fragments (RNA-seq) [3].
   Yet, while accurately quantifying gene expression, short-read sequencing
   in general fails to correctly identify the isoform from which a read
   originates, since the isoforms from the same gene are similar to a
   large extent [4,5]. To overcome this issue, two metrics have
   been suggested for measuring AS events at the transcriptome-wide
   level: ‘exon usage’ [6] and the PSI (percent spliced in) index
   [7]. Both metrics indicate, in fact, how frequently a given exon is
   included in the transcript isoforms of a corresponding gene and can
   be calculated directly from the raw read counts, hence avoiding
   uncertainties regarding the short-read assembly to reveal a splicing
   pattern. Despite the ongoing attempts to improve bioinformatics tools
   for RNA-seq-based assembly to quantify splice variants (e.g.,
   [8,9]), it still remains quantitatively challenging.
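
   As an illustration of the exon-inclusion idea behind the PSI index, a
   commonly used read-count form is given below. The symbols I (reads
   supporting inclusion of the exon) and E (reads supporting its
   exclusion) are introduced here for illustration only, and the exact
   definition used in [7] may differ, for example in how read counts are
   normalized.

```latex
\mathrm{PSI} = \frac{I}{I + E}, \qquad 0 \le \mathrm{PSI} \le 1
```

   Under this form, an exon with 30 inclusion reads and 10 exclusion
   reads would have PSI = 30/(30 + 10) = 0.75.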

   The emergence of third-generation sequencers such as those of Oxford
   Nanopore Technologies (ONT) has allowed for sequencing RNA or cDNA as a
   single molecule, thus providing long reads, which can span multiple
   exons. Long-read sequencing significantly simplifies the detection of
   transcript isoforms, thus directly revealing splicing patterns [2].
   This makes the normalized abundance of single (individual) transcript
   isoforms (rather than the gene expression measured as an integral
   normalized abundance of transcript isoforms ascribed to the gene) a
   more appropriate metric for the analysis of AS profiles in the case of
   long-read ONT sequencing than ‘exon usage’ or the PSI index. Indeed,
   although ‘exon usage’ and the PSI index continue to be used for AS
   profiling based on long-read sequencing data (e.g.,
   [10,11,12]), the description of AS profiles in terms of the
   abundance of single isoforms has also been utilized in ONT-based
   transcriptome-wide studies (e.g., [13,14,15]). On the other
   hand, as we recently suggested [16], AS profiles can be described
   regardless of the expression level of any given transcript isoform as
   arrays of genes, where each gene is characterized by the number of
   detected splice variants ascribed to that gene (here referred to as the
   ‘degree of alternative splicing’, or the DAS).
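
   The DAS as defined above (the number of detected splice variants per
   gene) can be sketched in a few lines of Python. This is a minimal
   illustration, not the pipeline used in this study: the input layout
   (gene, transcript, TPM tuples), the function name, the gene and
   transcript identifiers, and the `min_tpm` detection threshold are all
   assumptions made for the example.

```python
from collections import defaultdict


def degree_of_alternative_splicing(isoforms, min_tpm=0.0):
    """Count detected splice variants per gene (the DAS).

    isoforms: iterable of (gene_id, transcript_id, tpm) tuples.
    min_tpm:  hypothetical detection threshold; an isoform counts as
              detected only if its abundance is strictly above it.
    """
    variants = defaultdict(set)
    for gene_id, transcript_id, tpm in isoforms:
        if tpm > min_tpm:
            variants[gene_id].add(transcript_id)
    # The DAS of a gene is the number of distinct detected isoforms.
    return {gene: len(ids) for gene, ids in variants.items()}


# Toy data with made-up identifiers and abundances.
example = [
    ("GENE_A", "GENE_A-201", 12.5),
    ("GENE_A", "GENE_A-202", 3.1),
    ("GENE_A", "GENE_A-203", 0.0),  # not detected, so not counted
    ("GENE_B", "GENE_B-201", 40.0),
]

das = degree_of_alternative_splicing(example)
print(das)
```

   Note that the DAS ignores how abundant each detected isoform is, which
   is what distinguishes it from a description based on single-isoform
   TPM values.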

   The aim of this study was to further explore the utility of such
   metrics as the DAS or abundances (in transcripts per million, or TPM)
   of single transcript isoforms for revealing the differences in AS
   profiles between various cell/tissue types (which we further refer to
   as ‘phenotypes’ for convenience) using long-read sequencing datasets.
   We employed bioinformatics tools that were previously developed for
   gene expression analysis, such as GSEA (gene set enrichment analysis)
   [17] and the graded tissue specificity index τ [18]. These
   tools are commonly applied to identify the biological pathways that
   are influenced by differential gene expression (e.g., [19,20]
   and references therein) or to find tissue-specific signatures of gene