Abstract

   To date, most transcriptome studies of schizophrenia focus on the
   analysis of protein-coding genes. Long noncoding RNAs (lncRNAs) are
   emerging as key tissue-specific regulators of cellular and disease
   processes. The amygdala brain region has been implicated in the
   pathophysiology of schizophrenia. We performed unbiased whole
   transcriptome profiling of amygdala tissues from 22 schizophrenia
   patients and 24 non-psychiatric controls using RNA-seq. We
   reconstructed amygdala transcriptome and employed systems biology
   approaches to annotating the functional roles of lncRNAs. As a result,
   we identified 839 novel lncRNAs in amygdala. We found in amygdala
   lncRNAs are more subtype-specific than protein-coding genes. We
   identified functional modules associated with “synaptic transmission”,
   “ribosome”, and “immune responses” which were related to schizophrenia
   pathophysiology that involved lncRNAs. Integrative functional analyses
   associating individual lncRNAs with specific pathways and functions
   further show that amygdala lncRNAs are connected with all of these
   pathways. Our study presents the first systematic landscape of lncRNAs
   in amygdala tissue from schizophrenia cases.

   Abbreviation: mRNA, Messenger RNA; lncRNA, Long noncoding RNA; SCZ,
   Schizophrenia

   Keywords: Schizophrenia, Next-generation sequencing, Transcriptome,
   Systems biology
     __________________________________________________________________

Research in Context.

Evidence Before this Study

   The etiology of schizophrenia remains unknown. Most previous studies
   focus on the roles of protein-coding genes (mRNAs) in the disease. In
   the recent advance, long non-coding RNAs (lncRNAs), a kind of
   transcripts larger than 200 nt but without protein-coding potential,
   have been illustrated to play important biological roles, especially in
   the neural system. Some lncRNAs such as Gomafu are found to contribute
   to schizophrenia. The amygdala is the region of the brain that has been
   considered to play a primary role in the processing of emotional
   response, and pathophysiology of schizophrenia.

Added Value of this Study

   To identify amygdala specific lncRNAs relevant to schizophrenia, we
   conducted a systematic analysis of lncRNAs in amygdala by using RNA-seq
   data with high sequencing depth and characterized their functions and
   roles in schizophrenia. We constructed novel amygdala lncRNAs, and
   amygdala lncRNAs are found to be more subtype-specific than
   protein-coding genes. We annotated functional roles of lncRNAs in
   amygdala using a systems biology approach. We demonstrated that lncRNA
   are involved in dysregulated gene pathways “synaptic transmission”,
   “ribosome”, and “immune responses” in schizophrenia.

Implications of All the Available Evidence

   Our study presented the first systematic landscape of lncRNAs in
   amygdala tissue from schizophrenia cases, and proposed biomarkers of
   lncRNAs in schizophrenia that are worthy experimental validation in the
   future.

   Alt-text: Unlabelled Box

1. Introduction

   Schizophrenia (SCZ) remains one of the most mysterious and costliest
   mental disorders in terms of human suffering and societal expenditure
   [[35]1]. Thus, it's important to improve our understanding of its
   molecular mechanisms of schizophrenia for developing effective
   therapies. RNA-Seq has been widely used to profile schizophrenia
   transcriptome in recent years for this purpose. Most of these studies
   have focused on differential expression analysis of protein-coding
   genes [[36]2, [37]3] and some have analyzed allele-specific expression
   in schizophrenia [[38]4].

   Long non-coding RNAs (lncRNAs), the transcripts of larger than 200 nt
   but without protein-coding potential [[39]5, [40]6], have received much
   attention in recent years. Though not fully understood, lncRNAs are
   conceived to play important roles in a variety of biological functions
   through molecule functions such as signals, decoys, guides and
   scaffolds [[41]7]. LncRNAs are generally expressed at low level, and
   many of them are unknown. This makes it difficult to detect and
   quantify these molecules. The emergence of RNA-Seq provides the exact
   technology as needed for studying lncRNAs. RNA-Seq has been used to
   investigate lncRNAs in neurons [[42]8]. Previous studies have
   demonstrated that lncRNAs can have a profound impact on gene
   regulations, especially in neurons [[43][9], [44][10], [45][11],
   [46][12], [47][13], [48][14]]. However, limited information exists on
   the roles of lncRNAs in schizophrenia. In addition, many lncRNAs,
   especially tissue-specific lncRNAs, remain to be discovered or are
   involved with biological functions that are complex or poorly defined
   and remain to be annotated.

   The main obstacle to characterize lncRNAs is their low expression
   levels. It requires very high sequencing depth for RNA-Seq, which is
   generally lacking in many previous transcriptome studies. Consequently,
   although many transcriptome studies of schizophrenia have been
   conducted, most of them don't investigate lncRNAs, partially due to the
   high sequencing depth requirement and the poor annotation of lncRNAs
   currently available [[49]2, [50]3, [51][15], [52][16], [53][17],
   [54][18], [55][19], [56][20], [57][21], [58][22]]. Liu et al.
   demonstrated dysregulation of non-coding RNA in the amygdala region of
   schizophrenia patients [[59]23], but a systematical demonstration of
   functional roles of lncRNAs in schizophrenia is lacking. Through its
   important role in the processing of emotions, amygdala region has been
   involved in the pathophysiology of schizophrenia. Currently, emotional
   perturbations are common symptoms of schizophrenia, suggesting that
   dysfunction in neurons within the amygdala region may contribute to the
   dysregulation and altered behavior of schizophrenia patients. To
   identify amygdala specific lncRNAs relevant to schizophrenia, we
   performed unbiased whole transcriptome profiling of human amygdala
   region using RNA-seq with high sequencing depth. We report the
   systematic identification of lncRNAs, and characterize their functions
   and roles in schizophrenia via systems biology approaches.

   According to DSM-IV, schizophrenia can be divided into five subtypes:
   paranoid, disorganized, catatonic, undifferentiated and residual
   schizophrenia [[60]24]. The clinical characteristics of each subtype
   can vary. It is interesting to investigate gene dysregulations in
   different subtypes that account for their distinct diversities.
   Unbiased whole transcriptome profiling would help to identify these
   dysregulations and interplay of these abnormally expressed genes
   systematically. In this study, we used amygdala tissue from 22
   schizophrenia patients including three subtypes: paranoid,
   undifferentiated and disorganized, and 24 non-psychiatric controls. We
   profiled their amygdala whole transcriptomes with ~100 millions of 100
   nt short reads per sample on average. To our knowledge, this is the
   first comprehensive characterization of lncRNAs in amygdala tissue from
   schizophrenia patients. The lncRNA landscape characterized here
   provides novel insights into the transcriptomic variations seen in the
   amygdala across subtypes of schizophrenia patients.

2. Materials and Methods

2.1. Experimental Design

   We obtained 45 amygdala samples from postmortem brain from the Lieber
   Brain bank ([61]http://www.libd.org), including 9 undifferentiated, 7
   disorganized, 5 paranoid, and 24 controls without psychiatric
   diagnoses. The PMI, age and gender of cases and controls were well
   matched ([62]Supplementary Table S1), as indicated by their no
   association with the phenotype (non-parametric Kruskal-Wallis rank sum
   test p-value of 0.2, 0.5 for PMI and age respectively; Chi-square test
   p-value of 0.67 for gender). The sample description and RNA-Seq data
   are available at [63]https://www.ncbi.nlm.nih.gov/bioproject/379666.

2.2. RNA-Seq of Amygdala Tissue

   RNA-Seq libraries were constructed using Illumina (RS-122-2001) TruSeq
   RNA sample Prep Kit following the manufacture instruction. The poly-A
   containing mRNA molecules were purified from 300 to 500 ng DNAse
   treated total RNA using oligo (dT) beads. Following the purification,
   the mRNA was fragmented into small pieces using divalent cations under
   elevated temperature (94 degree) for 2 min. Under this condition, the
   range of the fragments length is from 130 to 290 bp with a median
   length of 185 bp. Reverse transcriptase and random primers were used to
   generate the first strand cDNA from the cleaved RNA fragments. The
   second strand DNA was synthesized using DNA Polymerase I and RNaseH.
   These cDNA fragments then went through an end repair process using T4
   DNA polymerase, T4 PNK and Klenow DNA polymerase, and the addition of a
   single ‘A’ base using Klenow exo (3′ to 5′ exo minus), then ligation of
   the illumine PE adapters using T4 DNA Ligase. An index was inserted
   into Illumina adapters so that multiple samples can be sequenced in a
   single lane. These products were then purified and enriched with PCR to
   create the final cDNA library for high throughput DNA sequencing using
   Highseq2000. The concentration of RNA-seq libraries was measured by
   Qubit (Invitrogen, CA) and quantified with qPCR. The quality of RNA-seq
   library was measured by LabChipGX (Caliper, MA) using HT DNA
   1K/12K/Hi-sensitivity LabChip. The libraries are multiplexed and loaded
   on a flowcell for cluster generation on cBot (Illumina). While on
   sequencing run, the Illumina Real Time Analysis (RTA) module was used
   to perform image analysis, base calling, and the BCL Converter (CASAVA
   v1.8.2) were followed to generate FASTQ files which contain the
   sequence reads. The current sequencing depth is over 80 million
   (2x100bp 40 million paired-end) mappable sequencing reads.

2.3. Transcriptome Assembly and Novel lncRNAs Discovery

   We followed the standard protocol of the transcript-level expression
   analysis of RNA-seq experiments [[64]25]. We obtained an average read
   depth of 116M reads per sample (Supplementary Table S8). The first step
   is quality and adapter trimming. Trimming was done by Trim Galore - a
   wrapper script to automate quality and adapter trimming as well as
   quality control
   ([65]https://www.bioinformatics.babraham.ac.uk/projects/trim_galore).
   Adapter sequences and fragments with less than quality score 20 in raw
   reads were removed. The processed reads were then aligned to human
   reference genome version hg38 (reference chromosomes only) by HISAT2
   [[66]26]. Common SNPs recorded in dbSNP144 and splice sites from
   Ensemble (download from HISAT2 website) were taken into account to
   improve alignment accuracy. Only concordant read pairs were set to be
   reported. The output SAM files were converted to BAM files, then sorted
   and indexed by SAMtools [[67]27]. Following alignment and filtering an
   average of 108 million reads were obtained per sample with an average
   alignment rate of 95% (Supplementary Table S8). After reads alignment,
   we applied the median transcript integrity number (medTIN) [[68]28]
   score to assess the RNA quality of each sample. As a result, we
   obtained the medTINs in the range of 47.4 to 78.3, with a median of
   70.8, which indicated a high level of RNA integrity (Supplementary
   Table S8). Obtained BAM files were used as input by assembler StringTie
   [[69]29] (reference annotation was GENCODE v24 [[70]30] + stringent
   Human Body Map lncRNAs set from Broad Institute [[71]31]), default
   parameters were used, that means genomic regions with at least reads
   coverage 2.5 will be considered as transcripts. The resulting GTF files
   were combined by StringTie with –merge mode with GENCODE
   v24 + stringent Human Body Map lncRNAs as reference annotation. The
   merging took a consensus of individual transcriptomes, that could
   improve the reliability of novel loci and isoforms. After obtaining
   merged GTF file, we compared it with annotation of GENCODE
   v24 + stringent Human Body Map lncRNAs, by using the gffcompare tool
   from GFF utilities
   ([72]https://ccb.jhu.edu/software/stringtie/gff.shtml), to identify
   novel transcripts (whose class code were labeled as “i”, “u” and “x” by
   gffcompare). As a result, we obtained 5924 novel transcripts. Then
   these novel transcripts were used to discover novel lncRNAs as the
   input of slncky software [[73]32] to filter out potential unknown
   protein-coding genes or gene duplication events. We used slncky's novel
   lncRNAs report as our novel lncRNAs set. Then we added these lncRNAs
   annotation to the GENCODE v24 + stringent Human Body Map lncRNAs to do
   the following analysis.

2.4. Differential Expression and Pathway Enrichment Analysis

   Differential expression analysis and summarization of expression levels
   were done based on read counts. All BAM files were first processed by
   the “prepDE.py”
   ([74]https://ccb.jhu.edu/software/stringtie/dl/prepDE.py) script from
   stringTie to extract read count information in each RNA-Seq sample. The
   reference annotation GTF file we used was GENCODE v24 + stringent Human
   Body Map lncRNAs + our novel identified lncRNAs. After obtaining the
   read counts table, we conducted differential expression analysis and
   produced normalized counts using DESeq2 [[75]33]. Reads Per Kilobase
   Million (RPKMs) were obtained by the normalized counts divided by gene
   lengths. Only expressed genes (i.e. the genes' 50th-percentile RPKM
   value is larger than 0, and the genes' 95th-percentile RPKM value is
   larger than 0.05) were selected to conduct differential expression
   analysis. P-values of DESeq2 were corrected using the
   Benjamini-Hochberg procedure [[76]34] for multiple testing adjustment.
   To do pathway enrichment analysis, we calculated z-scores by using
   p-values and signs of log fold change from the output of DESeq2. These
   z-scores were used to rank genes, followed by pre-ranked GSEA [[77]35].
   Significantly enriched pathways were selected based on GSEA FDR
   q-values.

2.5. Expression Profile in Subtypes

   We used an extended method described by Cabili et al. [[78]14, [79]31]
   to quantify the SCZ subtype specificity of each lncRNA, mRNA, and
   pseudogene. For each gene, let v[ij] be the RPKM value of the gene in
   the ith sample of the jth subtype, where i ranges from 1 to n[ij], the
   number of samples in subtype j, and j ranges from 1 to J (J = 4 in this
   study) in this study (Here “subtype” also contains controls). We first
   transform expression level by log2 transformation:
   x[ij] = log[2](v[ij] + 1). Then we took the average of gene's
   expression levels across subtype:
   [MATH: <msub><mover accent="true"><mi>x</mi><mo
   stretchy="true">¯</mo></mover><mi>j</mi></msub><mo>=</mo><mfenced
   open="("
   close=")"><mrow><msubsup><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn>
   </mrow><msub><mi>n</mi><mi
   mathvariant="italic">ij</mi></msub></msubsup><msub><mi>x</mi><mi
   mathvariant="italic">ij</mi></msub></mrow></mfenced><mo>/</mo><msub><mi
   >n</mi><mi mathvariant="italic">ij</mi></msub> :MATH]
   . Ideally, if one gene was perfectly subtype-specific, then
   [MATH: <msub><mover accent="true"><mi>x</mi><mo
   stretchy="true">¯</mo></mover><mi>j</mi></msub> :MATH]
   should be equal to zero in all subtypes except one. Next, we calculated
   an expression profile that normalizes the gene expression in various
   subtypes. Formally:
   [MATH: <mi>E</mi><mo>=</mo><mfenced open="(" close=")"
   separators=",,,"><msub><mi>e</mi><mn>1</mn></msub><msub><mi>e</mi><mn>2
   </mn></msub><mo>…</mo><msub><mi>e</mi><mi>J</mi></msub></mfenced><mo>=<
   /mo><mfenced open="(" close=")" separators=",,"><mfrac><msub><mover
   accent="true"><mi>x</mi><mo
   stretchy="true">¯</mo></mover><mn>1</mn></msub><mrow><mo>∑</mo><msub><m
   over accent="true"><mi>x</mi><mo
   stretchy="true">¯</mo></mover><mi>j</mi></msub></mrow></mfrac><mo>…</mo
   ><mfrac><msub><mover accent="true"><mi>x</mi><mo
   stretchy="true">¯</mo></mover><mi>J</mi></msub><mrow><mo>∑</mo><msub><m
   over accent="true"><mi>x</mi><mo
   stretchy="true">¯</mo></mover><mi>j</mi></msub></mrow></mfrac></mfenced
   ><mo>.</mo> :MATH]

   Each e[j] is the proportion of the total expression that in the jth
   subtype. Third, we defined the idea expression profile in each subtype:
   [MATH: <msup><mi>E</mi><mn>1</mn></msup><mo>=</mo><mfenced open="("
   close=")"
   separators=",,,"><mn>1</mn><mn>0</mn><mo>…</mo><mn>0</mn></mfenced><mo>
   ,</mo><msup><mi>E</mi><mn>2</mn></msup><mo>=</mo><mfenced open="("
   close=")"
   separators=",,,"><mn>0</mn><mn>1</mn><mo>…</mo><mn>0</mn></mfenced><mo>
   ,</mo><mo>…</mo><mo>,</mo><msup><mi>E</mi><mi>J</mi></msup><mo>=</mo><m
   fenced open="(" close=")"
   separators=",,,"><mn>0</mn><mn>0</mn><mo>…</mo><mn>1</mn></mfenced>
   :MATH]

   That is to say, E^j represents that a gene only expressed in the jth
   subtype. Finally, we calculated the distance between observed gene
   expression profile with the idea expression profile E^j by the
   Jensen-Shannon divergence. The Jensen-Shannon divergence is defined as:
   [MATH: <mi>J</mi><mi>S</mi><mfenced open="("
   close=")"><mi>E</mi><msup><mi>E</mi><mi>j</mi></msup></mfenced><mo>=</m
   o><mi>H</mi><mfenced open="("
   close=")"><mfrac><mrow><mi>E</mi><mo>+</mo><msup><mi>E</mi><mi>j</mi></
   msup></mrow><mn>2</mn></mfrac></mfenced><mo>−</mo><mfrac><mrow><mi>H</m
   i><mfenced open="("
   close=")"><mi>E</mi></mfenced><mo>+</mo><mi>H</mi><mfenced open="("
   close=")"><msup><mi>E</mi><mi>j</mi></msup></mfenced></mrow><mn>2</mn><
   /mfrac><mo>,</mo><mi>j</mi><mo>=</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo>
   ,</mo><mn>3</mn><mo>,</mo><mn>4</mn><mo>,</mo> :MATH]

   where H is the Shannon entropy. Then we defined our subtype-specific
   score as:
   [MATH: <mi>S</mi><mfenced open="("
   close=")"><mi>E</mi></mfenced><mo>=</mo><mn>1</mn><mo>−</mo><munder><mo
   >min</mo><mrow><mi>j</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>…</mo><mo>,<
   /mo><mi>J</mi></mrow></munder><msqrt><mi
   mathvariant="italic">JS</mi><mfenced open="("
   close=")"><mi>E</mi><msup><mi>E</mi><mi>j</mi></msup></mfenced></msqrt>
   <mo>.</mo> :MATH]

   The larger score means more subtype-specificity observed, and a score 1
   means the gene expressed in only one subtype.

2.6. Quantification of lncRNAs and lncRNA-Pathway Association

   First, RPKMs were obtained by DESeq2. RPKMs were calculated by the
   normalized counts divided by gene length. In order to make the result
   more reliable, we only selected highly expressed genes, namely, the
   genes with 50th-percentile RPKM values larger than 0 and
   95th-percentile RPKM values >0.25. Then RPKM value of each lncRNA locus
   was correlated with all highly expressed mRNAs loci (log2 transformed:
   log[2] (RPKM+1)). All 45 samples were used to calculate correlations.
   Then for each lncRNA, a list of correlation-based (Pearson correlation)
   ranked mRNAs was constructed and subject to GSEA to do pre-ranked
   analysis [[80]35]. An association matrix of lncRNAs and GO terms was
   made: if a GO term was enriched in one lncRNA with FDR threshold 0.01,
   then we would assign 1 or −1 based on positive or negative enriched,
   otherwise 0 will be assigned.

   Heatmap was generated based on the association of lncRNAs with GO
   terms, and hierarchical clustering was applied to both rows and
   columns. We cut lncRNAs into 10 clusters based on hierarchical
   clustering. Then we used Fisher exact test to rank GO terms in each
   cluster to determine the enrichment level of associated GO terms of
   each cluster with respect to other clusters.

2.7. Gene Co-Expression Network Analysis

   Co-expression networks of highly expressed genes were identified by
   WGCNA according to the methods described previously [[81]36, [82]37].
   All 45 samples were used to construct WGCNA network. First, normalized
   RPKM values of highly expressed genes were quantified by DESeq2. Then a
   matrix of correlation (Pearson correlation) between all pairs of highly
   expressed genes was generated and further converted to an adjacency
   matrix with a power value β. The value of power value β was 20, as
   determined by the function pickSoftThreshold provided by WGCNA that met
   the scale-free fitting threshold of 0.9. Co-expression modules were
   found by the dynamic hybrid tree cut algorithm, with the parameter
   settings: height cutoff = 0.95, deepSplit = 4 and minimum cluster
   size = 20. If the correlation between the eigenvalues of two modules is
   >0.85, the two modules were considered as so similar that they were
   merged. The module-trait correlation was defined as the correlation
   between the module eigenvalue and the trait.

   After identifying co-expression networks modules, we exported the
   modules, which were used as input to Cytoscape [[83]38] for
   visualization. ARACNE (Algorithm for the Reconstruction of Accurate
   Cellular Networks) [[84]39] was applied to identify significant
   interactions between genes in each module based on their mutual
   information and remove indirect interactions through data processing
   inequality (DPI). The arguments for ARACNE were set to be DPI tolerance
   of 0 and mutual information threshold of 0.5.

3. Results

3.1. Identification of a Stringent Set of Novel Amygdala lncRNAs

   To discover novel lncRNAs, we first assembled transcriptome from
   RNA-Seq data of all 45 samples (clinical information of samples was
   summarized in [85]Supplementary Table S1), followed by a stringent
   pipeline to remove transcripts with protein-coding potential or
   low-quality data (Method). We identified a set of 839 novel lncRNA
   genes with high quality ([86]Supplementary Table S2, the names of novel
   lncRNAs begin with “MSTRG”). We found that a dominant majority of these
   novel lncRNAs (>80%) were intergenic, falling entirely in intergenic
   regions. About 10% were divergent, being transcribed in the opposite
   orientation of a coding gene with which they share a promoter. We found
   very few lncRNAs that were miRNA host genes or snoRNA host genes. To
   further evaluate the reliability of these novel lncRNAs, we applied the
   same protocol to assemble transcriptome from the RNA-Seq dataset of 79
   GTEx normal amygdala tissues [[87]40]. As a result, we recovered 215 of
   the novel lncRNAs completely, namely, with all exons found in the GTEx
   transcriptomes, and 106 of the novel lncRNAs recovered partially in
   GTEx dataset ([88]Supplementary Table S2). We applied the RPKM
   saturation module from RSeQC [[89]41] to evaluate that the current
   sequencing depth was saturated for these 839 novel lncRNAs (Fig. S1).

3.2. Amygdala lncRNAs are Shorter, Less Complex, and Expressed at Low Levels

   We first compared the structure, expression level and coding potentials
   of our assembled transcriptome ([90]Fig. 1). All comparisons were done
   using one-sided Wilcoxon test. We found that novel lncRNAs, compared
   with protein-coding genes (we use the term mRNA to indicate
   protein-coding gene in this paper), have significantly shorter
   transcript length ([91]Fig. 1a, 1477 bp vs. 1727 bp,
   p-value = 7.566e-06), shorter ORFs ([92]Fig. 1b, 192 bp vs. 390 bp,
   p-value < 2.2e-16), and longer exons ([93]Fig. 1c, 766 bp vs 238 bp,
   p-value < 2.2e-16). These properties were consistent with the lower
   estimated number of exons for lncRNAs compared with mRNAs ([94]Fig. 1d,
   [95]2 vs. 5, p-value < 2.2e-16). Taken together, we conclude that
   amygdala lncRNAs are shorter and less complex than mRNAs.

Fig. 1.

   [96]Fig. 1
   [97]Open in a new tab

   Genomic features of novel lncRNAs. (a) Transcript sizes of lncRNAs and
   mRNAs. (b) Open reading frame (ORF) sizes of lncRNA and mRNAs. (c) Exon
   sizes of lncRNAs and mRNAs. (d) Numbers of exons per lncRNA and mRNAs.
   (e) Coding potential (CPAT scores) of known lncRNAs, novel lncRNAs, and
   mRNAs. (f) Expression levels (RPKM values) of known lncRNAs, novel
   lncRNAs, and mRNAs. (a), (b), (c), (e), (f) are standard boxplots,
   which display the distribution of data by presenting the inner fence
   (the whisker, taken to 1.5× the Inter Quartile range, or IQR, from the
   quartile), first quartile, median, third quartile and outliers. The
   means are marked as tan diamonds.

Fig. 2.

   [98]Fig. 2
   [99]Open in a new tab

   Dysregulation of expression of mRNAs, lncRNAs and gene pathways. (a)
   Venn diagrams of expressed protein-coding genes (left) and lncRNAs
   (right) in three subtypes. (b) Venn diagrams of
   upregulated/downregulated mRNAs and lncRNAs in three subtypes. (c)
   Percentages of dysregulated mRNAs and lncRNAs. (d) Enrichment map for
   significant GO terms in schizophrenia (all SCZs vs normal), nominal
   p < 0.005 and FDR q value <0.1. Nodes represent gene sets that are
   significantly up or down regulated, as determined by GSEA, where the
   node size corresponds to the size of the GO term. Edges indicate
   overlap between gene sets, where the thickness indicates the overlap
   coefficient. Red nodes indicate up-regulation and blue nodes indicate
   down-regulation.

   We then measured protein-coding potential of lncRNAs and mRNAs using
   the CPAT [[100]42], which was a logistic regression model built with
   four sequence features: open reading frame size, open reading frame
   coverage, Fickett TESTCODE statistic and hexamer usage bias. We found
   that both known lncRNAs and novel lncRNAs had comparable low
   protein-coding potential, which was significantly lower than that of
   mRNAs ([101]Fig. 1e, median of 0.023, 0.023, 0.75 for known lncRNAs,
   novel lncRNAs and mRNAs, p-value < 2.2e-16 for both known lncRNA vs
   mRNAs and novel lncRNAs vs mRNAs). This suggests that our algorithm to
   identify novel lncRNAs is reliable. Notably, SCZ lncRNAs were expressed
   on average at about 10-fold lower levels than mRNAs, with the median
   expression level of 0.018, 0.057 and 0.21 RPKM for known lncRNAs, novel
   lncRNAs, and mRNAs, respectively ([102]Fig. 1f, p-value < 2.2e-16 for
   both known lncRNAs vs mRNAs and novel lncRNAs vs mRNAs). Similar
   findings were observed in other human tissues [[103]31, [104]43]. Novel
   amygdala lncRNAs exhibited similar characteristics as known lncRNAs.
   Therefore, in the following sections, we combined the novel and known
   lncRNAs for further functional analysis.

3.3. Identification of Differentially Expressed Genes and Pathways

   After obtaining the assembled novel lncRNAs, we appended these lncRNAs
   definitions together with the Broad Institute's lncRNAs set to the
   GENCODE v24 annotation. As a result, we have 19,815 mRNAs and 18,805
   lncRNAs. To rule out the potential unreliability of lowly expressed
   genes, we define a gene as expressed if, [[105]1] the genes'
   50th-percentile RPKM value is larger than 0, and [[106]2] the genes'
   95th-percentile RPKM value is larger than 0.05. Among all SCZ and
   control samples, there were 14,570 expressed mRNAs, and 4596 expressed
   lncRNAs (of which 697 were novel). For the three subtypes of SCZ
   samples, 14,331 mRNAs and 4336 lncRNAs on average were expressed. Of
   those, 13,737 mRNAs and 3407 lncRNAs were detected in all three
   subtypes of SCZs, respectively, suggesting a high level of diversity of
   lncRNA expression across the three subtypes ([107]Fig. 2a).

   To characterize schizophrenia-associated dysregulation of gene
   expression, we performed four differential expression analyses of the
   schizophrenia subtypes in comparison with normal control samples: (1)
   all SCZs vs control, (2) disorganized SCZs vs controls, (3) paranoid
   SCZs vs controls, and (4) undifferentiated SCZs vs controls, in search
   for both common SCZ related genes as well as SCZ subtypes specific
   genes (in the set of expressed genes). Under the cutoff of adjusted
   p-value < 0.05, there are 345, 182, 2 and 541 differentially expressed
   genes for these four comparisons, respectively ([108]Supplementary
   Table S3). The comparison of differentially expressed mRNAs across 3
   subtypes was summarized ([109]Fig. 2b, c). Among them, we uncovered
   several protein-coding genes significantly differentially expressed in
   all SCZ samples that were reported in previous studies [[110]18,
   [111]19, [112]21, [113]22]. These include HBA1, HBA2, HBB, IFITM1,
   GBP1, IFITM2, SERPINA3. These results were identical to the previous
   study focusing on protein-coding genes in the amygdala of schizophrenia
   patients [[114]20].

   To determine which functional pathways are most robustly involved in
   schizophrenia pathogenesis, we performed gene set enrichment analysis
   (GSEA) for each subtype and all case samples combined. At FDR level of
   0.1 (nominal p-value < 0.005), there were 178, 40, 355 and 123 GO terms
   positively enriched genes, and 94, 147, 4 and 107 GO terms negatively
   enriched genes in all SCZ, disorganized SCZ, paranoid SCZ, and
   undifferentiated SCZ, respectively. The three subtypes demonstrated
   enrichment in many unique gene sets, while some common ones were
   identified across all three subtypes ([115]Supplementary Table S4).
   When considering all SCZ vs. normal, we found the dysregulated pathways
   came from four main categories: “immunity”, “blood vessel development”
   and “ribosome and protein synthesis” were up-regulated; and “neuron and
   synapse” were down-regulated ([116]Fig. 2d). We also found the gene
   pathway from “immunity”, “ribosome”, and “neuron and synapse”
   categories in all three subtypes of SCZ ([117]Supplementary Table S4).
   These results agreed with previous studies on dysregulation of
   functional pathways in SCZ especially in amygdala tissue [[118]20]. We
   concluded that in the amygdala tissues of schizophrenia patients, gene
   pathways related to immune response, blood vessel development and
   ribosome were upregulated in expression, gene pathways related to
   synaptic transmission and behavior were suppressed. We want to further
   identify the potential roles of lncRNAs relating to these functional
   pathways in the following sections.

   At FDR level of 0.05, we detected 110, 45, 0, 171 differentially
   expressed lncRNAs in all SCZ samples, disorganized SCZ samples,
   paranoid SCZ samples and undifferentiated SCZ samples, respectively
   ([119]Supplementary Table S3). Venn diagrams and percentages of
   differentially expressed lncRNAs in the three subtypes were presented
   ([120]Fig. 2b, c). Previous studies have reported lncRNAs association
   with schizophrenia, including MIAT, DLX6-AS1, and BDNF-AS [[121]12,
   [122]44]. We didn't find these lncRNAs significantly differentially
   expressed in our dataset.

3.4. Expression Profiles of lncRNAs in Schizophrenia Subtypes

   To determine whether amygdala lncRNAs are subtype-specific in
   schizophrenia, we characterized the expression profiles of lncRNAs and
   mRNAs across the three schizophrenia subtypes in comparison with
   controls. A larger number of lncRNAs exhibited subtype-specific
   expression patterns than mRNAs based on the unsupervised clustering of
   expression profiles (only expressed genes were analyzed, [123]Fig. 3a).
   Furthermore, we calculated subtype specificity score for each gene
   using an entropy-based metric that relies on Jensen-Shannon (JS)
   divergence [[124]31] (Materials and Methods). The expression of lncRNAs
   was found to be more subtype-specific than mRNAs significantly, with
   median specificity score of 0.302 and 0.282 for lncRNAs and mRNAs,
   respectively ([125]Fig. 3b). To rule out the possibility that this
   effect was caused by increased noise from low expressed genes, we also
   calculated the specificity scores of only highly expressed genes (i.e.,
   50th-percentile RPKM values >0 and 95th-percentile RPKM values >0.25).
   There were 7382 and 530 highly expressed mRNAs and lncRNAs,
   respectively. Again, we observed these highly expressed lncRNAs showed
   a higher subtype specificity than mRNAs ([126]Fig. 3b, with median
   specificity score of 0.291 and 0.279 for lncRNAs and mRNAs,
   respectively).

Fig. 3.

   [127]Fig. 3
   [128]Open in a new tab

   Subtype specificity of lncRNAs and mRNAs. (a) Heatmaps of normalized
   expression levels of 14,570 mRNAs (left) and 4596 lncRNAs (right). (b)
   Distributions of maximal subtype specificity scores for all expressed
   lncRNAs and protein-coding genes (left) and highly expressed ones
   (right).

3.5. Functional Annotation of Amygdala lncRNAs Through Expression Correlation

   The lack of annotated features makes it a challenging task to assign
   functions to lncRNAs. However, “guilt-by-association” analysis helps to
   predict roles of mammalian lncRNAs [[129]31, [130]43, [131]45,
   [132]46]. Because genes with similar co-expression patterns tend to
   have similar functional coherency [[133]47], we conducted gene set
   enrichment analysis (GSEA) to associate GO terms and lncRNAs by
   analyzing the correlation between the expression dynamics of each
   lncRNA with the expression dynamics of each mRNA across all 45 SCZ and
   normal samples using the highly expressed 530 lncRNAs and 7382 mRNAs
   (Materials and Methods). As a result, 529 lncRNAs were identified to
   have significantly associated GO terms ([134]Fig. 4a). We grouped
   lncRNAs into 10 clusters by hierarchical clustering ([135]Supplementary
   Table S5). We found that several clusters were associated with
   protein-coding gene sets of distinct functional categories, such as
   ribosome and protein synthesis (cluster A), blood vessel development
   (cluster C), nervous system and synaptic transmission (cluster H) and
   immune system (cluster I) ([136]Fig. 4b). These pathways were found
   dysregulated in schizophrenia.

Fig. 4.

   [137]Fig. 4
   [138]Open in a new tab

   Prediction of lncRNA functions. (a) Expression-based association matrix
   of 529 highly expressed lncRNAs and functional 1623 GO terms (left).
   Rows are lncRNAs, columns are GO terms. In heatmap, red, blue, and
   white dots represent positive, negative, and no correlation
   respectively. Top 5 most enriched GO terms in cluster A, C, H, and I
   are showed (right). (b) Association heatmap of dysregulated lncRNAs in
   all SCZs (all SCZs vs controls).

   Interestingly, by focusing on differential expressed lncRNAs (all SCZ
   vs. controls), we observed that these lncRNAs were associated with
   protein synthesis, blood vessel development, nervous system pathways
   and immune system pathways ([139]Fig. 4b), and these pathways were
   dysregulated in all of three subtypes of schizophrenia. These
   observations indicate that dysregulated amygdala lncRNAs are putative
   contributors to the pathogenesis of schizophrenia.

3.6. lncRNA-mRNAs Co-Expression Network

   We built co-expression networks over the highly expressed 7912 genes
   (including 530 lncRNAs, and 7382 mRNAs) using WGCNA [[140]37], followed
   by network modules testing (total 45 samples were used). We identified
   23 co-expressed modules ([141]Supplementary Table S6), 7 of which were
   highly correlated with the schizophrenia trait ([142]Fig. 5a):
   turquoise, green, pink, greenyellow, salmon, darkred and darkturquoise
   (correlation p-value < 0.05). To identify functions of each module, we
   performed a hypergeometric test on mRNAs to detect enriched GO terms in
   each module. We found that turquoise module was enriched with neuron
   related pathways, the green module enriched with ribosomal pathways,
   and the salmon module enriched with immune response related pathways
   (Fig. S2). Our previous GSEA analysis showed these pathways were
   dysregulated in schizophrenia. The numbers of genes (including both
   mRNAs and lncRNAs) of these modules varied. We then ran the Fisher
   exact test to determine the enrichment level of associated GO terms for
   lncRNAs in each module. The associated functions of lncRNAs (GO terms
   associated with lncRNA were determined in the Section “Functional
   annotation of amygdala lncRNAs through expression correlation”) are
   consistent with their coding counterparts in each module: lncRNAs in
   the turquoise module are primarily associated with neuronal and
   synaptic pathways, lncRNAs in the green module are primarily associated
   with ribosomal pathways, and lncRNAs in the salmon module are primarily
   associated with immunity pathways ([143]Fig. 5b).

Fig. 5.

   [144]Fig. 5
   [145]Open in a new tab

   Co-expression network of lncRNAs and mRNAs. (a) Module-trait
   relationships. Each row represents a module, the column is the trait
   (schizophrenia or normal). Each cell contains corresponding correlation
   coefficient and p-value in the bracket. Numbers of mRNAs and lncRNAs in
   each module are also shown. (b) Enriched lncRNA-correlated GO terms in
   turquoise, green and salmon module. (c, d) Graphic view of lncRNA-mRNAs
   co-expression network of (c) green and (d) salmon module. Node size
   represents corresponding intramodular connectivity degrees, blue and
   red nodes represent mRNAs and lncRNAs respectively, hub genes are
   displayed in triangles. Connection lines between nodes indicate direct
   interactions determined by ARACNE. Only hub genes and genes directly
   connected to hub showed.

   Following [[146]48], we defined 5% of nodes with highest intra-modular
   connectivity as hub genes ([147]Supplementary Table S6). Connectivity
   reflects how frequently a node interacts with other nodes in a
   co-expression module. Hub genes, which had the highest connectivity,
   were usually considered as key regulators in gene co-expression
   networks since they intended to interact with more genes than others
   [[148][49], [149][50], [150][51]]. The turquoise module was enriched by
   genes from neuron pathways and many of their members were observed as
   hub genes. We did find lncRNA “ENSG00000225465.8” (RFPL1S) as a hub
   gene in turquoise (Fig. S3), which suggested that this lncRNA might
   play as a key regulator in this functional module. Interestingly, the
   gene pathways positively correlated with “ENSG00000225465.8” included
   axon, synapse, dendrite as well as other similar pathways
   ([151]Supplementary Table S7). There were 20 lncRNAs directly connected
   to the hub genes in the turquoise module. We found most of them were
   positively associated with synapse pathways ([152]Supplementary Table
   S7). In contrast, in the green and salmon modules, no lncRNA was
   identified as hub genes, but there were several lncRNAs directly
   connected with the hub genes ([153]Fig. 5c, d). In the green module,
   there were 23 lncRNAs connected to the hub genes directly. GSEA showed
   that all of these lncRNAs were associated with ribosome pathways
   ([154]Supplementary Table S7). In the salmon module, we identified one
   lncRNA connected to the hub genes directly: “ENSG00000235501.5”
   (RP4-639F20.1), which was associated with immune pathways
   ([155]Supplementary Table S7) as indicated by GSEA. To take all these
   results together, by gene co-expression network, we identified seven
   co-expression modules correlated with SCZ trait. We found three of
   these seven modules were themed by neuron and synapse, ribosome and
   immunity respectively. In particular, we noticed that lncRNAs existed
   in all of the 3 modules, acting as the hub gene in the turquoise module
   while connecting to the hub genes directly in the other two modules.
   GSEA analysis of lncRNA-mRNA correlations showed that the lncRNAs were
   significantly associated with several pathways in each module
   ([156]Supplementary Table S7). All these results suggested the
   potential involvement of lncRNAs in schizophrenia-related functional
   pathways.

4. Discussion

   There is growing evidence that non-coding RNAs, particularly miRNAs and
   lncRNAs are involved in the pathogenesis of schizophrenia. For
   instance, the lncRNA MIAT (Gomafu), is acutely regulated in response to
   neuronal activation and involved in schizophrenia-associated
   alternative splicing [[157]12]. However, they found MIAT was
   down-regulated in schizophrenia samples, whereas we didn't find
   significance difference of this lncRNA in schizophrenia and normal
   tissues. The tissue used in their study was brain cortex of mice, but
   here we used human amygdala tissue. The difference of tissue may
   account for the difference of expression change.

   The GENCODE v24 annotation dataset contains 15,941 lncRNAs and we found
   about 24.5% of them were expressed in the amygdala under our study.
   Although GENCODE was presumably the most compressive gene annotation
   [[158]52], we still identified 5924 novel transcripts not previously
   annotated. Our library preparation was based on the polyA tailed
   method, so the transcripts we found were all polyA tailed. We would
   expect more transcripts which were not polyA tailed. Among them, we
   defined a stringent set of 839 novel lncRNAs, which include lincRNAs
   (long intergenic non-coding RNAs), intronic overlapping lncRNAs, and
   exonic antisense overlapping lncRNAs. Our novel lncRNAs document the
   first novel long noncoding transcripts in amygdala tissue. To further
   validate these novel lncRNAs, we investigated if they can be found in
   other amygdala RNA-Seq data sets such as GTEx [[159]40]. Due to the
   high heterogeneity of lncRNAs, we only recovered 321 novel lncRNAs
   completely or partially in GTEx amygdala data set. Nevertheless, read
   coverage was shown to be saturated for these 839 novel lncRNAs. With
   high numbers of reads support, these novel transcripts are expected to
   be bona fide. Based on this finding, we would expect that many more
   lncRNAs remain to be identified, and the next generation sequencing
   data from other tissue types will uncover these in the near future.

   Consistent with previous RNA-Seq studies, our results showed that
   lncRNAs expressed about 10-fold lower than coding genes in bulk
   tissues. The recent single-cell RNA-Seq (scRNASeq) technology makes it
   possible to profile gene expression in each individual cell. It has
   been revealed using scRNASeq that many lncRNAs are abundantly expressed
   in individual cells and are cell type-specific [[160]53]. We expect the
   better characterization of lncRNAs in the amygdala with the
   accumulation of single-cell transcriptomics data of this tissue.

   A well-known feature of lncRNAs is their tissue specificity. Here we
   also observed high specificity of lncRNAs among different subtypes of
   schizophrenia even in a single tissue – i.e., amygdala. The
   interpretation of this is that lncRNAs may vary much more across
   different patients when compared with mRNAs even in a single tissue. To
   our knowledge, this is the first study to show that lncRNAs demonstrate
   substantially more variation between disease subtypes in schizophrenia
   than do mRNAs.

   In an effort to understand potential roles of lncRNAs in schizophrenia
   pathophysiology, we first identified differentially expressed lncRNAs
   across SCZ patients and healthy controls. Many mRNAs are known to be
   dysregulated in schizophrenia, whereas dysregulation of lncRNAs hasn't
   been analyzed in any considerable depth before, especially in amygdala
   tissue specimens. By a rigid statistical significance of FDR < 0.05, we
   observed 45, 0 and 171 lncRNAs that were differentially expressed in
   disorganized SCZs, paranoid SCZs and undifferentiated SCZs,
   respectively. We also observed 110 differentially expressed lncRNAs
   when comparing all SCZs cases vs. controls. We generated a
   co-expression network using all highly expressed lncRNAs and
   protein-coding genes. Seven modules that were correlated with SCZ were
   identified, and three of them were themed by the gene pathways
   identified to be dysregulated in SCZ. We pinpointed one lncRNA in the
   turquoise module (themed by neuron and synapse) as a hub gene. While no
   lncRNAs in the green (themed by ribosome) and salmon modules (themed by
   immunity) were identified as hub genes, we observed that lncRNA
   connected directly with hub genes in all of the three modules, which
   implies that lncRNAs may involve as regulatory factors in these
   modules.

   We subsequently annotated all expressed lncRNAs through their
   correlation with protein-coding genes. In the context of co-expression
   of protein-coding genes and GSEA functional predictions, our data
   suggest that lncRNAs may play diverse roles in amygdala tissue. The
   main pathways dysregulated in schizophrenia involve immune response,
   ribosome, blood vessel development and synaptic transmission. Our
   results show that lncRNAs are involved in all of these pathways. This
   is an intriguing possibility and suggests that lncRNAs may be involved
   in the pathogenic dysregulation of schizophrenia. These potential
   functional roles of lncRNAs in schizophrenia etiology as determined by
   bioinformatics analysis are worthy further experimental validation in
   the future.

   Due to the difficulty to collect human post-mortem samples, we were not
   able to obtain a very large size of samples, which put a power limit to
   the statistical tests. We alleviated this small sample size issue by
   focusing on pathway-level analysis and illustrated potential roles of
   lncRNAs in the pathogenic pathways of schizophrenia. Pathway analysis
   was proved to be more robust than single-gene analysis, and could
   mitigate the small sample size issue [[161]35, [162]36]. Nevertheless,
   caution should be used that some subtle but important signals may still
   remain undetectable given this sample size. The current promising
   results call for recruiting a large cohort of samples for more
   comprehensive analyses in the future. In addition, it is noted that in
   the recent DSM-5 diagnostic criteria, DSM-IV subtypes have been
   omitted, and improved to a new scale, the Clinician-Rated Dimensions of
   Psychosis Symptom Severity (C-RDPSS), for characterizing patients
   [[163]54]. In this study, the subtype information was collected
   following previous clinical practice but not C-RDPSS. We expect better
   results be obtained when using the new severity-based characterization
   of patients in future studies with the implementation of DSM-5.

   Taken together, our study provides evidence that multiple novel lncRNAs
   reside within the amygdala region in schizophrenia patients whose
   function is differentially involved in the regulation of numerous gene
   expression networks in different subtypes of schizophrenia. These
   findings suggest that lncRNAs may have multiple roles in schizophrenia
   that is highly differential between different subtypes of
   schizophrenia, which is to be explored further in future genetic and
   mechanistic studies.

   The following are the supplementary data related to this article.
   Supplementary Table S1

   Clinical information of 45 Postmortem brain samples.
   [164]mmc1.xlsx^ (12.4KB, xlsx)
   Supplementary Table S2

   BED format of 839 novel lncRNAs. The matching type to GTEx
   transcriptome was also included.
   [165]mmc2.xlsx^ (72.9KB, xlsx)
   Supplementary Table S3

   Differentially expressed genes in each subtype of schizophrenia.
   [166]mmc3.xlsx^ (71.2KB, xlsx)
   Supplementary Table S4

   Dysregulated pathways in each subtype of schizophrenia.
   [167]mmc4.xlsx^ (458.6KB, xlsx)
   Supplementary Table S5

   Clustering of lncRNAs based on associated GO terms.
   [168]mmc5.xlsx^ (17.1KB, xlsx)
   Supplementary Table S6

   Summary of co-expression modules.
   [169]mmc6.xlsx^ (342KB, xlsx)
   Supplementary Table S7

   Pathway associated with the hub lncRNAs and the lncRNAs connected to
   hub genes in the turquoise, green and salmon modules.
   [170]mmc7.xlsx^ (103.9KB, xlsx)

   Supplementary material
   [171]mmc8.pdf^ (3.1MB, pdf)

Funding Sources

   This study is funded by The Children′s Hospital of Philadelphia Endowed
   Chair in Genomic Research to Dr. Hakonarson, an Institutional
   Development Award to the Center for Applied Genomics from The
   Children’s Hospital of Philadelphia; MH096891-03S1 (NIMH); RC2
   MH089924-02 (NIMH); R01 MH097284-03 (NIMH); and U01-HG008684 (NIH).

Declaration of Interests

   The authors declare that they have no competing interests.

Author Contributions

   Z.W., and H.H. conceived and supervised the project. T.T. designed and
   implemented the methods, and conducted the experiments. X.C., Y.L.,
   R.E.G., and P.M.S. contributed to data acquisition. T.T., Z.W., and
   H.H. wrote the manuscript. All authors approved the manuscript.

Acknowledgments