Abstract

   Formalin-fixed paraffin-embedded (FFPE) tissues are widely available
   specimens for clinical studies. However, RNA degradation in FFPE
   tissues often restricts their utility. In this study, we determined
   optimal FFPE preparation conditions, including tissue ischemia at 4°C
   (<48 h) or 25°C for a short time (0.5 h), 48-h fixation at 25°C and
   sampling from FFPE scrolls instead of sections. Notably, we observed an
   increase in intronic reads and a significant change in gene rank based
   on expression level in the FFPE as opposed to fresh-frozen (FF)
   samples. Additionally, we found that more reads were mapped to genes
   associated with chemical stimulus in FFPE samples. Furthermore, we
   demonstrated that more degraded genes in FFPE samples were enriched in
   genes with short transcripts and high free energy. Besides, we found 40
   housekeeping genes exhibited stable expression in FF and FFPE samples
   across various tissues. Moreover, our study showed that FFPE samples
   yielded comparable results to FF samples in dimensionality reduction
   and pathway analyses between case and control samples. Our study
   established the optimal conditions for FFPE preparation and identified
   gene attributes associated with degradation, which would provide useful
   clues for the utility of FFPE tissues in clinical practice and
   research.

Introduction

   Formalin-fixed paraffin-embedded (FFPE) and fresh-frozen (FF) tissues
   are the common human tissue specimens in clinical practice and medical
   research ([38]1,[39]2). FF tissues are often preferred, but their
   availability is often limited due to their laborious collection and
   expensive preservation. By contrast, as FFPE tissues can be
   economically stored for long periods and linked to patient clinical
   data, they are widely used for pathological analysis and molecular
   testing. Effectively using RNA-seq data from FFPE tissues, along with
   pertinent patient clinical information, enables the acquisition of
   large experimental and control cohorts. Additionally, it also
   facilitates the study of archival specimens ([40]3), making FFPE
   samples an invaluable resource in clinical research.

   In previous studies, DNA from archival FFPE tissue samples has been
   extensively utilized for next-generation sequencing (NGS) methods such
   as whole exome sequencing ([41]4,[42]5). Nevertheless, extracting
   high-quality RNA from FFPE tissue samples is challenging due to the
   extensive crosslinking and degradation caused by formalin fixation
   ([43]6). In recent years, advances in FFPE RNA extraction technology
   could efficiently reverse the crosslinking of FFPE RNA ([44]7). Due to
   rRNA degradation in FFPE tissues, the RNA integrity number (RIN) is
   deemed inappropriate for assessing FFPE RNA integrity. The DV200
   (percentage of RNA fragments >200 nucleotides in size) was devised to
   accurately assess the quality of RNA and incorporated into the Illumina
   protocol ([45]8). For degraded RNA, DV200 is a reliable predictor of
   the probability of successful library construction ([46]8,[47]9).
   Notably, RNA integrity in FFPE specimens is influenced by many
   preanalytical factors, including ischemia time, fixation time,
   temperature, storage conditions and sampling methods. These
   preanalytical factors commonly exhibit variations in clinical
   laboratories, but the analyses of these factors on RNA integrity are
   still incomplete ([48]10). The RNA quality of FFPE samples also affects
   the concordance of expression profiles between FFPE and FF specimens
   ([49]11). The Biospecimen Preanalytical Variables (BPV) program shows
   that cold ischemia time of up to 12 h has little impact on DV200, and
   prolonged fixation time (72 h) contributes to RNA fragmentation
   ([50]12).

   While the integrity of RNA in FFPE specimens declines with longer
   preservation time ([51]13), RNA-seq has been successfully performed on
   some FFPE specimens after long-term storage ([52]4,[53]13,[54]14).
   Moreover, many studies have explore the feasibility of utilizing FFPE
   tissues in RNA-seq analyses ([55]1,[56]13). When FFPE tissues were
   adequately prepared and preserved, their expression data can be
   strongly correlated with those from frozen tissues ([57]5,[58]15–19).
   The expression data of FFPE specimens have been applied in cancer
   research ([59]15). The differentially expressed genes (DEGs) between
   different cancers obtained using FFPE tissues were significantly
   overlapped with those obtained from FF tissues ([60]12). However, not
   all FFPE specimens yielded usable RNA-seq data ([61]18,[62]19). Recent
   studies have introduced methods to normalize FFPE RNA-seq expression
   data ([63]20,[64]21) and enhanced the availability of FFPE RNA-seq
   data. Although some studies have analyzed the DEGs between paired FFPE
   and FF tissues ([65]5,[66]18), the pattern of the changes in FFPE
   expression data remains unclear.

   In this study, we assessed the RNA integrity of FFPE tissues prepared
   under various conditions and identified the optimal preparation
   conditions and sampling method. To understand the pattern of changes in
   FFPE expression data, we explored differences in RNA-seq expression
   profiling and transcript attributes. We also conducted comparative
   analysis between tumor and peritumor using FFPE and FF samples,
   respectively. Our results showed that FFPE samples were available for
   dimensionality reduction and differential expression analysis in cancer
   research. Our findings may help to guide the clinical preparation of
   FFPE tissues, understand the pattern of expression changes in FFPE
   samples and better harness the power of FFPE tissue resources.

Materials and methods

Ethics statement

   The research was conducted on lung tissues collected under
   Institutional Review Board (IRB) approved protocols at Beijing Shijitan
   Hospital. Informed consents to participate in the study were also
   obtained according to IRB. We obtained the tissues under IRB approvals
   (IRB approval number: [sjtkyll-1x-2022(35)]). All experiments were
   performed in compliance with relevant laws and institutional
   guidelines. The data were analyzed anonymously to protect the privacy
   of the patients.

Specimens preparation and RNA isolation

   We collected lung tissues from six patients to prepare the FF and FFPE
   samples with different preparation conditions. After collecting the
   tissues, we cut the tissues into 0.3 cm × 0.3 cm × 0.3 cm pieces and
   preserved the fresh tissues at −80°C. As for the FFPE tissues, we used
   a 4% neutral paraformaldehyde solution to fix the tissues under
   different fixation conditions. Next, we repeatedly immersed the tissue
   in ethanol of increasing concentration levels, ending in a 100% ethanol
   concentration, to dehydrate the tissues at room temperature. Then, we
   used xylene as a clearing agent to remove all the ethanol in the
   tissues at room temperature. Finally, tissues were infiltrated with
   paraffin wax at 63°C and then left to cool so that they solidified (see
   [67]Supplementary Table S1).

   For FFPE RNA isolation, paraffin scrolls or sections were employed for
   RNA extraction, quality assessment and sequencing. Prior to sampling,
   we cut off the outermost layer of paraffin which was exposed to air.
   Next, paraffin scrolls were obtained by cutting 5 μm thick specimens
   from FFPE samples. Then, the paraffin scrolls were soaked in water,
   adhered to the glass slides and dried to get paraffin sections. When
   investigating the impact of sampling method, we prepared the paraffin
   sections first and then the paraffin scrolls from the same FFPE blocks
   for RNA extraction. Total RNA was extracted from the FFPE samples using
   RNAstorm Kit (celldata, CD501), following the manufacturer’s
   instructions. We used Nanodrop to check the purity of total RNA
   extracted from FFPE and FF samples. The Qubit® 3.0 Fluorometer was used
   to assess the RNA concentration. The RNA integrity (DV200 and DV800)
   was assessed using the Agilent 2100 Bioanalyzer.

Data composition

   This study utilized the RNA-seq data obtained from lung tissues and the
   public data from the BPV research program. The metadata for the lung
   tissues was reported in [68]Supplementary Table S2. The data of lung
   tissues consisted of 15 FF samples and their paired FFPE samples,
   including three pairs of control-case-matched FF samples and FFPE
   samples ([69]Supplementary Table S2).

   The public data were acquired from the BPV research program developed
   by National Cancer Institute’s (NCI) Biorepositories and Biospecimen
   Research Branch (BBRB). They consisted of FF samples and FFPE samples
   from five renal clear cell carcinomas (kidney), seven serous ovarian
   carcinomas (ovary) and five colon adenocarcinomas (colon). The metadata
   for public data, including specimen ids, tissue types, cold ischemia
   time and the percentage of RNA fragments >200 nucleotides (DV200), are
   reported in [70]Supplementary Table S3. The public data of the BPV
   research program are accessible through dbGaP (#phs001304). The quality
   metrics of the individual specimen are available in a previous
   publication ([71]22).

Total RNA-sequencing library preparation and sequencing

   We prepared the sequencing libraries following the manufacturer’s
   recommendations of Ribo-off rRNA Depletion Kit (Human/Mouse/Rat)
   (Vazyme Biotech Co., Ltd., Nanjing, China, N406) and VAHTS Universal V6
   RNA-seq Library Prep Kit for Illumina (Vazyme Biotech Co., Ltd.,
   Nanjing, China, NR605). The details of library construction are as
   follows. First, we removed ribosome RNA from 200 ng total RNA using
   Ribo-off rRNA Depletion Kit (Human/Mouse/Rat) (Vazyme, N406). Then, we
   fragmented the RNA into small pieces using divalent cations at elevated
   temperatures. The cleaved RNA fragments were copied into first-strand
   cDNA using reverse transcriptase and random primers, followed by
   second-strand cDNA synthesis using DNA Polymerase I, RNase H,
   deoxyuridine triphosphate (dUTP), deoxyadenosine triphosphate
   (dATP), deoxyguanosine triphosphate (dGTP) and deoxycytidine
   triphosphate (dCTP). Then, a single ‘A’ base and the adapters were
   subsequently added to these cDNA fragments. In order to select the
   appropriate cDNA fragment size for sequencing, we selected the library
   fragments with VAHTSTM DNA Clean Beads (Vazyme, N411). The polymerase
   chain reaction (PCR) amplification was performed, and the aimed
   products were finally purified. After cluster generation, the libraries
   were sequenced on an Illumina novaseq 6000 platform, and the raw fastq
   files of 150-bp paired-end reads were generated.

Expression-level quantification

   After sequencing, we removed reads with adapters and reads in which
   unknown bases were >5%. We defined the low-quality base as the base
   whose sequencing quality was not >10. Next, we removed the reads with
   over 50% low-quality bases. At the same time, Q20, Q30 and GC content
   were calculated for clean reads. All downstream analyses were based on
   the clean reads. Reads were quality-trimmed using fastp ([72]23) with
   default parameters, which trimmed low-quality bases from the ends of
   reads, low-quality reads and residual Illumina adapters. RNA-seq reads
   were aligned to the Homo sapiens reference genome (GRCh38) from the
   Ensembl database using hisat2-2.1.0 ([73]24) with default parameters.
   After alignment, the resulting bam file was fed into the RNA-Seq
   quantification software featureCounts 2.0.3 ([74]25) with paired-end
   mode to generate counts matrices on the gene level (GTF version:
   Homo_sapiens.GRCh38.104.chr.gtf). StringTie 2.2.1 ([75]26) was used to
   generate Transcripts per kilobase of exon model per million mapped
   reads (TPM) matrices. For this study, we only quantified at the gene
   level and focused on protein-coding genes. The relevant quality control
   metrics for RNA-seq were displayed in [76]Supplementary Table S4.

Post-quantification analysis

   We further characterized reads aligned to different regions of genes.
   The read_distribution.py and geneBody_coverage.py functions from RSeQC
   5.0.1 ([77]27) were used to calculate reads distribution over genome
   features and the RNA-seq reads coverage over the gene body. The bam
   files output from Hisat2 were fed into rMATS-4.1.0 ([78]28) to analyze
   alternative splicing with default parameters.

   The classic transcript (the earliest released version of transcripts)
   of each gene was used when analyzing the correlation between the gene
   attributes and RNA degradation. The genes’ transcript length and cDNA
   sequences were obtained from the Ensembl database using the R package
   BiomaRt ([79]29). We transformed the cDNA sequences to mRNA sequences
   and calculated the frequency of nucleobase using the R package
   Biostrings. The minimum free energy (MFE) of the transcript was
   calculated using RNAfold in ViennaRNA ([80]30) package with default
   parameters. The mRNA subcellular localization was predicted by mRNAloc
   ([81]31) with a support vector machine (SVM) classification threshold
   of 0.1.

   Principal Component Analysis (PCA) was performed using TPM data and the
   R package FactoMineR ([82]32) with default parameters. The housekeeping
   gene list of human genes stably expressed across 52 tissues and cell
   types was obtained from HRT atlas V1.0 ([83]33). The human pan-cancer
   gene list and the annotation of genes were obtained from the nanoString
   website:
   [84]https://nanostring.com/products/ncounter-assays-panels/oncology/nco
   unter-pancancer-pathways-panel (last accessed date: 13 March 2023).

Statistical analysis

   The concordance correlation coefficient (CCC) function in the R package
   DescTools was used to evaluate the CCC ([85]34) between paired FFPE and
   FF samples. To perform paired differential expression analysis, we used
   the R package DESeq2 ([86]35) and added sample pairwise information as
   a covariate to the negative binomial regression model. Then, we
   identified the DEGs with thresholds of adjusted P-value < 0.05 and
   absolute (log₂FoldChange) > 1. Besides, we identified the genes with
   small changes between FF and FFPE samples with the criteria of absolute
   (log₂FoldChange) < 0.2, and lfcSE (the standard error of
   log₂FoldChange) < 0.1. R package clusterProfiler ([87]36) was used to
   perform gene set enrichment analysis (GSEA) with a q-value threshold of
   0.05. Fisher’s exact test was performed with an unadjusted P-value
   threshold of 0.005 and a |log₂ odds ratio| threshold of 1 to analyze
   the association between the DEGs and transcript attribute. We
   constructed a generalized linear model using the glm.nb function in the
   R package MASS and incorporated tissue type as a covariate to
   investigate the relationship between DV200 and gene expression. An
   adjusted P-value threshold of 0.05 was set to distinguish whether a
   gene is significantly correlated to DV200.

Results

The optimal preparation conditions of FFPE tissues for high-quality RNAs

   In order to determine the optimal preparation conditions of FFPE
   tissues, we compared the RNA quality of FFPE tissues with different
   fixation times and temperatures, ischemic times, temperatures and
   sampling methods. Though the study of BPV program found that excessive
   fixation time up to 72 h resulted in decreased RNA quality ([88]22), we
   found that adequate fixation duration (48 h) improved the quality of
   RNA in FFPE samples in our study. Among the fixation times of 12, 24
   and 48 h, we observed that FFPE tissues fixed for 48 h exhibited the
   highest DV200 and DV800 (the percentage of RNA fragments > 800
   nucleotides) (Figure [89]1A and [90]Supplementary Figure S1A).

Figure 1.

   [91]Figure 1.
   [92]Open in a new tab

   Comparison of the RNA quality of FFPE tissues under different
   preparation conditions. (A–C) Boxplot showing DV200 of FFPE samples
   with different fixation time (A), fixation temperature (B) and ischemia
   time (C). Six FFPE samples were investigated for each parameter
   combination ([93]Supplementary Table S5). (D) Histogram showing the CCC
   between paired FFPE and FF samples at different ischemia temperature
   and time.

   For the fixation temperature, the DV200 and DV800 of FFPE samples fixed
   at 4°C were close to FFPE samples fixed at 25°C (Figure [94]1B;
   [95]Supplementary Figure S1B and [96]Supplementary Table S5). As 25°C
   was closer to the FFPE production temperatures in hospitals and
   produced similar RNA quality as 4°C, we chose the fixation time of 48 h
   and fixation temperature of 25°C as the optimal fixation conditions. In
   the range of ischemia time from 0.5, 3, 6 to 12 h at 25°C, no
   significant decline in FFPE RNA quality was observed, although slightly
   higher DV200 values were noted for 0.5 h (Figure [97]1C and
   [98]Supplementary Figure S1C). Concerning ischemic temperature, when
   the ischemia time was >0.5 h, the quality of RNA extracted from the
   sample at 4°C was higher than that at 25°C ([99]Supplementary Table
   S5).

   From the perspective of expression consistency with FF samples, the
   expression profiles of the samples with ischemic treatment at 4°C were
   in high agreement with FF samples, and the CCC ranged from 0.79 to 0.9,
   which was close to FFPE samples without ischemic treatment (Figure
   [100]1D and [101]Supplementary Table S6). FFPE samples with 6 and 48 h
   of ischemia at 25°C were significantly lower in agreement with FF
   samples (Figure [102]1D and [103]Supplementary Table S6). For the
   sampling methods, the quality of RNA extracted from FFPE sections was
   generally lower than RNA extracted from the FFPE scrolls (see Materials
   and Methods section). The DV200 of FFPE sections ranged from 30 to 40,
   while the DV200 of FFPE scrolls was >60 ([104]Supplementary Figure S2).
   Furthermore, we used the RNA quality data from BPV program ([105]22)
   and found no significant difference in the DV200 value (RNA quality)
   among colon, kidney and ovary tissues in the majority of experimental
   groups ([106]Supplementary Figure S3). This result suggested that out
   results from the lung tissue can be generalized to other tissues. In
   summary, the optimal preparation conditions of FFPE are ischemia of
   tissues at 4°C (<48 h) or ischemia at 25°C for short time (0.5 h),
   fixation for 48 h at 25°C and sampling from FFPE scrolls rather than
   FFPE sections.

Read distribution and expression profiles differ between FFPE and FF samples

   To assess potential systematic differences between FF and FFPE samples,
   we performed RNA-seq on seven FF samples and their paired FFPE samples
   from the peritumoral tissues of lung ([107]Supplementary Table S2,
   sample id: L1-7). We compared the distribution of RNA-seq reads and
   expression profiles between FF and FFPE samples. First, we found that
   FFPE and FF samples did not show serious 3′ or 5′ bias on the gene body
   distribution (Figure [108]2A). However, the proportion of read
   distributed in introns was significantly higher in FFPE samples than FF
   samples, consistent with previous studies ([109]12,[110]17,[111]19).
   The proportion of intron was generally ∼25% in FF samples, while it was
   >50% in FFPE samples (Figure [112]2B and [113]C). Changes in transcript
   reads distribution and increased intronic reads affected the detection
   of alternative splicing. Approximately 20–25% of the intron retention
   events were overrepresented in FFPE samples ([114]Supplementary Figure
   S4).

Figure 2.

   [115]Figure 2.
   [116]Open in a new tab

   Differences in read distribution and expression profiles between FFPE
   samples and FF samples. (A) Average reads coverage distribution from
   transcript 5′ to 3′, the coverage of 3′ area of the transcript was
   higher in FFPE samples; (B, C) Proportion of read distributed in 3′UTR
   (untranslated regions), 5′UTR, CDS (coding sequences), introns in
   paired FFPE and FF samples. (D) Dot graph showing PCA result of
   expression profiles of FFPE and FF samples. (E) Difference of gene
   expression rank between paired FFPE and FF samples.

   In terms of the expression profiles, PCA results showed that the
   expression profiles of FFPE samples differed from those of FF samples
   (Figure [117]2D). When comparing the ranks of genes with TPM > 10
   (n = 11745) ordered by expression level in FFPE samples and their
   paired FF samples, substantial rank changes were observed in most
   genes. Only 5–10% of genes exhibited minor change (|Δrank|<100).
   Notably, the expression rank of 4.4% of genes greatly changed
   (|Δrank|>1000) in the same tissue of different sequencing batches. In
   contrast, over 40% of genes significantly changed in FFPE samples
   relative to paired FF samples (Figure [118]2E). Our findings showed
   systematic differences in read distribution and expression profiles
   between FFPE and FF samples.

Genes related to chemical stimulus over-sampled in FFPE samples

   To investigate the changes in gene expression between FFPE and FF
   samples, we performed DEG analysis on our data from lung tissue and the
   public data from BPV project for three tissues (kidney, ovary and
   colon). We identified 1001 genes were consistently over-sampled in at
   least two kinds of tissues (Figure [119]3A). There are 420 genes that
   were consistently more degraded in at least two kinds of tissues
   (Figure [120]3B). Next, we performed GSEA. GSEA results showed that the
   genes with higher expression level in FFPE were significantly enriched
   in the gene sets of chemical stimulus perception (Figure [121]3C). We
   further checked genes in the identified GO terms in our lung samples.
   After filtering low-expression genes (baseMean > 1), 12 out of 21 (57%)
   taste receptor genes and 28 out of 91 (31%) olfactory receptor genes
   were significantly over-sampled in FFPE in lung tissue. In addition,
   most of the other genes in these two terms had an over-sampled trend in
   FFPE samples (Figure [122]3D). We further analyzed the stimulus
   perception-related genes in the BPV project data and found similar
   results ([123]Supplementary Figure S5A–C). This finding suggested that
   reads of genes related to perception of chemical stimulus were
   overrepresented in FFPE.

Figure 3.

   [124]Figure 3.
   [125]Open in a new tab

   DEGs and GO biological processes enriched terms in four kinds of
   tissues. (AandB) Venn diagram showing the number of overlapped genes
   with more (A) or less (B) reads in FFPE across four kinds of tissues.
   (C) Dot graph showing top 5 significantly overrepresented GO terms
   overlapped in four kinds of tissues. (D) Volcano plot showing the DEGs
   between FFPE and FF samples.

Transcripts with altered expression in FFPE samples significantly correlated
with their properties including length, secondary structure and subcellular
localization

   After analyzing the potential biological processes involved during
   fixation, we delved into the gene properties of DEGs in FFPE versus FF
   samples. The classic transcripts (n = 19 307) of protein-coding genes
   were selected to obtain the transcript length, MFE of secondary
   structure, nucleobase content and subcellular localization information.
   Our analysis revealed that the more degraded genes in FFPE samples were
   overrepresented in genes with short transcript length and high free
   energy, regardless of tissue types (P-value < 0.05, |log₂odds_ratio|>1)
   (Figure [126]4A–[127]D). In other words, these more degraded genes in
   FFPE samples had significantly shorter transcript lengths and higher
   MFE than other genes (Figure [128]4E,F). Regarding nucleobase content,
   we found that genes over-sampled in FFPE samples were overrepresented
   in high GC content genes in lung tissues but not in other tissues
   ([129]Supplementary Figure S6). For the RNA subcellular localization,
   we found that genes more degraded in FFPE samples were overrepresented
   in the extracellular regions across all four tissue types (Figure
   [130]4A–[131]D). Additionally, almost all mitochondrial genes were
   consistently overexpressed in FFPE samples across tissue types. To
   investigate the relationship between gene expression and DV200 values,
   we constructed a generalized linear model using the FFPE samples in the
   BPV project. The results showed that only 41 genes were significantly
   correlated with the DV200 value (padj < 0.05), and most genes were not
   correlated with DV200 ([132]Supplementary Figure S7). In summary, our
   results suggest that the expression changes of genes/transcripts in
   FFPE versus FF samples are correlated to transcript length, MFE of
   secondary structure and subcellular localization.

Figure 4.

   [133]Figure 4.
   [134]Open in a new tab

   The gene properties correlated to genes more degraded in FFPE tissues.
   (A–D) Fisher’s exact test results of more degraded genes in FFPE
   samples of four different tissues; ER, endoplasmic reticulum; D1, first
   Deciles; D9, ninth Deciles. (E) Violin plot showing the transcript
   length in different DEG group. The genes significantly more degraded in
   FFPE are significantly shorter. (F) Violin plot of the MFE of secondary
   structure in different DEG groups. The genes significantly more
   degraded in FFPE have significantly higher MFE.

Housekeeping gene expression alters in FFPE samples

   To assess the potential in the expression profile of housekeeping genes
   (HK genes) in FFPE, we obtained the HK gene list from HRT Atlas v1.0
   ([135]33). We found that most HK genes were expressed in both FF and
   FFPE samples. Among the 2176 HK genes, 85–89% displayed expression
   levels exceeding 10 TPM in FFPE and FF samples (Figure [136]5A).
   Subsequently, we checked the DEG analysis results of HK genes on our
   lung samples and public data from the BPV project. Among the 2176 HK
   genes, 4–14% were differentially expressed in FFPE versus FF samples
   (Figure [137]5B,C and [138]Supplementary Figure S8).
   Glyceraldehyde-3-Phosphate Dehydrogenase (GAPDH), the common internal
   reference gene for qPCR, was more degraded in FFPE samples
   ([139]Supplementary Figure S8). Applying a criterion for constantly
   expressed HK genes (see Materials and Methods), we identified a set of
   40 HK genes with constant expression across lung, colon, kidney and
   ovary tissues, which may be suitable as internal reference genes to
   normalize gene expression data in FFPE (Figure [140]5D and
   [141]Supplementary Table S7). Among the 40 stable HK genes, SSU72,
   SNX1, RNF114, NAP1L4 and VPS26C were the top 5 stably expressed genes.
   Our results highlight the importance of carefully selecting stable HK
   genes for internal reference genes, as many HK genes showed significant
   expression differences between FFPE and FF samples.

Figure 5.

   [142]Figure 5.
   [143]Open in a new tab

   The overview of the expression of housekeeping genes in FF and FFPE
   samples. (A) Histogram showing the number of HK genes with minimum
   expression >10 TPM. (B) The volcano plot of differentially expressed HK
   genes and genes stably expressed in FF and FFPE samples from lung
   tissues. (C) Histogram showing the number of differentially expressed
   HK genes in four tissues. (D) Venn diagram showing the overlap of HK
   genes with small changes across four tissues.

FFPE RNA-seq data can be used for pan-cancer pathway analyses

   To assess the utility of FFPE RNA-seq data, we performed transcriptome
   analysis using FFPE samples and FF samples separately and assessed the
   consistency of the results. In a previous study ([144]12), a similar
   analysis has been conducted. The results for FFPE samples were highly
   consistent with FF samples, with a proportion of overlapped DEGs
   exceeding 70%. However, the huge variation between different organs
   (colon, kidney and ovary) may lead to a large number of DEGs, and
   thousands of DEGs may lead to a high proportion of overlapped DEGs.
   Thus, we used additional three pairs of samples from tumors and
   peritumor tissues ([145]Supplementary Table S2) to validate the
   results.

   First, we performed PCA on FF samples and FFPE samples, respectively.
   The FF samples and FFPE samples had similar distributions on the 2D
   plane after dimensionality reduction (Figure [146]6A–[147]C). We
   further analyze the top 5 principal components in PCA, 4 out of the top
   5 principal components correlated between FF and FFPE samples, with a
   Pearson’s correlation coefficient >0.4 (Figure [148]6D). The first
   principal component obtained from FFPE samples was highly correlated
   with that in FF samples (Pearson, r > 0.8) (Figure [149]6D), suggesting
   that the main variation in the data was similar in the FF and FFPE
   samples.

Figure 6.

   [150]Figure 6.
   [151]Open in a new tab

   Consistency of analysis results between FF and FFPE samples. (A–C) The
   dot graph showing the distributions of three pairs of
   control-case-matched FF (A) and FFPE (B) samples in the 2D after
   dimensionality reduction. (D) Heatmap showing the correlation between
   the top 5 principal components from FF and FFPE samples. (E and F) Venn
   diagram showing overlap of upregulated (E) and downregulated (F) genes
   in tumor versus peritumor between FFPE samples and FF samples. (G and
   H) Dot graph showing top 5 significantly upregulated (G) and
   downregulated (H) KEGG pathways overlapped in FF and FFPE samples. (I)
   Dot graph showing GSEA results in cancer-related pathway. The results
   obtained from FF and FFPE samples are highly consistent.

   Further, we identified the DEGs in tumor versus peritumor in FF samples
   and FFPE samples, respectively. Among the upregulated genes, 38% (217
   genes) were consistently identified in FF and FFPE samples (Figure
   [152]6E). For downregulated genes, 31% (96 genes) were consistently
   identified in FF and FFPE samples (Figure [153]6F). The consistency
   between FFPE and FF samples was lower than in previous study ([154]12).
   We then performed KEGG pathway enrichment analysis using the
   upregulated/downregulated genes and screened the top 10 significant
   pathways. The upregulated genes were consistently overrepresented in
   seven KEGG pathways across FF and FFPE samples (Figure [155]6G). We
   also found four KEGG pathways downregulated in tumors across FF and
   FFPE samples (Figure [156]6H).

   To assess the feasibility of FFPE samples for cancer research, we
   utilized the pan-cancer gene annotations (see Materials and methods)
   for GSEA. Notably, we found eight cancer-related pathways significantly
   upregulated in tumor versus peritumor in FF samples. In FFPE samples,
   10 cancer-related pathways were significantly upregulated in tumor
   tissues, including all 8 pathways identified in FF samples. These
   results suggested that FFPE samples can be utilized for cancer-related
   comparative transcriptome analysis and data mining to a considerable
   extent.

Discussion

   FFPE tissues are the most common human tissue specimens in clinical
   practice. Preparing FFPE samples with high-quality RNA is essential for
   successful sequencing and downstream analysis. Understanding the
   pattern of changes in expression data due to FFPE treatment may also
   help develop FFPE RNA-seq analysis approaches and better harness the
   power of FFPE tissue resources.

   In previous studies, several factors were found to affect the RNA
   quality of FFPE tissues, including cold ischemia time ([157]22),
   specimen size, fixation time, storage time and temperature ([158]37).
   For the storage condition, a recent study showed that the degradation
   of nucleic acids was slower when FFPE tissues were stored at 4°C or
   lower temperatures ([159]38). Our study aimed to determine an optimal
   condition to prepare FFPE specimens for high-quality RNA. Our finding
   showed that the quality of RNA extracted from FFPE specimens has little
   relation to the temperature of fixation but has a positive correlation
   to the time of fixation ranging from 12 to 48 h. The prolonged ischemia
   time also distinctly influenced the RNA quality at room temperature.
   Besides, the sampling method affected the FFPE RNA quality as well. The
   DV200 of RNA extracted from paraffin scrolls was significantly higher
   than paraffin sections. Overall, we determined that the optimal
   preparation conditions of FFPE are ischemia of tissues at 4°C (<48 h)
   or ischemia at 25°C for short time (0.5 h), fixation for 48 h at 25°C
   and sampling from FFPE scrolls rather than FFPE sections. In addition,
   FFPE should be stored at 4°C or lower temperatures to slow down RNA
   degradation ([160]38).

   The systemic differences in the RNA-seq data were observed between FFPE
   and FF samples, independent of tissue types
   ([161]5,[162]12,[163]17,[164]19). We found that more reads were mapped
   to genes associated with the perception of chemical stimulus in FFPE
   samples, which may be related to biological processes arising during
   paraformaldehyde fixation. Additionally, we first found that the more
   degraded gene was more likely to present in the genes with short
   transcript and high MFE of secondary structure. Moreover, we found that
   the genes localized to extracellular regions such as extracellular
   vesicles ([165]31,[166]39) were more likely to downgrade in FFPE
   samples across four tissues, indicating that these RNAs may be more
   likely to degrade in FFPE samples. Notably, mitochondrial genes were
   consistently over-sampled in FFPE samples, which is consistent with a
   previous study ([167]19). This phenomenon suggested that RNA in
   mitochondria may be protected from degradation by the mitochondrial
   membranes and RNA-binding proteins within the mitochondria ([168]40).
   Unexpectedly, almost no gene correlated with the DV200 values of the
   FFPE samples, indicating that the DV200 values were only suitable to
   evaluate the probability of successful library construction
   ([169]8,[170]9).

   Housekeeping genes are involved in fundamental cell biological
   processes. Thus, they are expected to maintain a constant level of
   expression in all cells and conditions ([171]41). Owing to this
   characteristic, housekeeping genes are commonly used as internal
   reference genes ([172]42,[173]43). However, the systemic differences
   between FFPE and FF samples altered the expression of many housekeeping
   genes. We found 40 housekeeping genes with a small change between FF
   and FFPE samples across four tissues. These genes may be the candidates
   of internal reference to normalize gene expression data in both FF and
   FFPE samples. The consistency of DEGs is relatively lower than in the
   previous study ([174]12). However, from the result of PCA, we found
   that the major variations in the data were highly consistent across
   FFPE and FF samples. Although many different genes were detected in
   differential expression analysis, the enrichment analysis results are
   similar. In particular, the results of cancer-related gene sets were
   highly consistent across FFPE and FF samples. Our findings suggested
   that using the FFPE samples may produce similar results as FF samples
   in comparative expression analyses between control and case samples.

   In summary, we have found an optimal FFPE specimen preparation
   condition to obtain high-quality RNA. In addition, we discovered that
   the genes with specific attributes were more prone to alter expression
   in FFPE samples. We evaluated the feasibility of FFPE RNA-seq data
   using control-case-matched samples and demonstrated that FFPE samples
   could be used for cancer-related transcriptome analysis. Our findings
   provide insights that can inform the clinical preparation of FFPE
   tissues, enhance our understanding of expression data alteration
   induced by FFPE treatment and optimize the utilization of FFPE tissue
   resources.

Data and code availability

   The data that support the findings of this study are available in the
   Genome Sequence Archive for Human (GSA-Human)
   at [175]https://ngdc.cncb.ac.cn/gsa-human/s/0AeSL3An; reference number
   (HRA004941). The public data of the BPV research program are available
   through dbGaP (#phs001304). The scripts used to reproduce the main
   results in this study are available in Github repository
   ([176]https://github.com/liny-suts/FFPE) and Zenodo
   ([177]https://doi.org/10.5281/zenodo.10516663).

Supplementary data

   [178]Supplementary Data are available at NARGAB Online.

Supplementary Material

   lqae008_Supplemental_Files
   [179]lqae008_supplemental_files.zip^ (1.4MB, zip)

Acknowledgements