Abstract

   Proliferative Diabetic Retinopathy (PDR) is a chronic complication of
   Diabetes and the main cause of blindness among the world’s working
   population at present. While there have been many studies on the
   pathogenesis of PDR, its intrinsic molecular mechanisms have not yet
   been fully elucidated. In recent years, several studies have employed
   bulk RNA-sequencing (RNA-seq) and single-cell RNA sequencing
   (scRNA-seq) to profile differentially expressed genes (DEGs) and
   cellular components associated with PDR. This study adds to this
   expanding body of work by identifying PDR’s target genes and cellular
   components by conducting an integrated transcriptome bioinformatics
   analysis. This study integrately examined two public bulk RNA-seq
   datasets(including 11 PDR patients and 7 controls) and one single-cell
   RNA-seq datasets(including 5 PDR patients) of Fibro (Vascular)
   Membranes (FVMs) from PDR patients and control. A total of 176 genes
   were identified as DEGs between PDR patients and control among both
   bulk RNA-seq datasets. Based on these DEGs, 14 proteins were identified
   in the protein overlap within the significant ligand-receptor
   interactions of retinal FVMs and Protein-Protein Interaction (PPI)
   network, three of which were associated with PDR (CD44, ICAM1, POSTN),
   and POSTN might act as key ligand. This finding may provide novel gene
   signatures and therapeutic targets for PDR.

Introduction

   Proliferative Diabetic Retinopathy (PDR) is a slow-onset, chronic
   complication of diabetes and the main cause of blindness among the
   world’s working population at present [[32]1]. The global number of
   diabetic patients is expected to rise to 366 million by 2030, with 11%
   of them developing vision-threatening retinopathy [[33]2, [34]3],
   making early diagnosis and treatment of PDR critical. While studies
   have previously examined the roles of inflammation, oxidative stress
   and cytokine production/release in PDR’s pathogenesis [[35]3], its
   underlying molecular mechanism has not been fully elucidated, and more
   studies are needed to provide a deeper understanding for uncovering
   more effective therapeutic targets.

   The identification of specific gene expression patterns can help
   understand disease pathogenesis and reveal possible theraputeic targets
   [[36]4]. Recently, bulk RNA-sequencing (RNA-seq) and single-cell RNA
   sequencing (scRNA-seq) have been widely used for gene expression
   profiling and identifying cell populations in Fibro (Vascular)
   Membranes (FVMs) from PDR patients [[37]5]. Still, to the best of our
   knowledge, no previous study on PDR has integrated RNA-seq and
   scRNA-seq to identify Differentially Expressed Genes (DEGs) between
   normal FVMs and that from PDR patients. Therefore, this study conducts
   an integrated transcriptomics analysis through RNA-seq and scRNA-seq.
   We identified three biomarkers of PDR, among which POSTN and FAK/Akt
   pathway may be potential therapeutic targets of PDR.

Data and methods

Data collection

   The gene expression datasets of PDR were obtained through the Gene
   Expression Omnibus (GEO) database ([38]http://www.ncbi.nlm.nih.gov/)
   [[39]6]. Two bulk RNA-seq datasets ([40]GSE94019 and [41]GSE102485,
   containing SRR5925083-SRR5925086, SRR5925099- SRR5925100) [[42]7,
   [43]8] were downloaded to profile DEGs, including FVMs from 11 PDR
   patients and seven controls. A single-cell RNA-seq dataset consisting
   of 7,971 FVM cells from five PDR patients was also downloaded from the
   GEO database ([44]GSE165784) [[45]5]. Validation studies were then
   conducted using data from a microarray dataset ([46]GSE60436) comprised
   of six PDR patients and three control subjects [[47]9].

Processing of RNA-seq data

   SRA Toolkit (version 2.11.3-ubuntu64) was used to download and
   preprocess the raw data from the two datasets. Raw reads were first
   separated into FASTQ files of pair-end reads, with FastQC (version
   0.11.5) used for data quality control. The clean reads were aligned to
   the human reference genome (USCS hg19) by HISAT2 (version 2.1.0).
   Finally, SAMtools (version 1.9) and HTSeq (version 0.6.1p1) were used
   to quantify and map the reads to an annotated document (GENCODE,
   version 39lift37, Oct 2021).

Differential gene expression analysis

   The R package DESeq2 [[48]10] was then used to identify the DEGs among
   the PDR and control groups in each RNA-Seq dataset. The cutoff criteria
   for determining DEGs were |log2 fold change (FC)| > 1 and FDR < 0.05.

RRA analysis

   The Robust Rank Aggregation (RRA) method [[49]11] was used to integrate
   the results of the RNA-seq studies and control batch effects introduced
   by different sequencing platforms. A lists of up- and down-regulated
   genes for each RNA-seq dataset were generated from the expression fold
   changes between PDR and control groups. RobustRankAggreg package in R
   was used to integrate and rank differentially expressed genes from each
   dataset, and genes with a score <0.05 were considered significant in
   the final integrated dataset.

Functional and pathway enrichment analysis and construction of
protein-protein interaction network

   Functional enrichment analysis was used to explore the function of the
   DEGs, including Gene Ontology (GO) functional enrichment analysis and
   the Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathway analysis
   [[50]12, [51]13], performed by the clusterProfiler package in R
   [[52]14]. String database (version 11.5) [[53]15] was used to construct
   a Protein-Protein Interaction (PPI) network based on the DEGs, allowing
   the names and protein-coding genes of all proteins interacting in the
   network to be extracted.

Processing of single-cell RNA-seq data

   The scRNA-seq data was processed by the Seurat package in R. First,
   Canonical Correlation Analysis (CCA) was used to find Mutual Nearest
   Neighbours (MNNs) [[54]16]. Cells with more than 2,500 or fewer than
   200 gene counts, or with more than 5% mitochondria, were filtered out.
   The “vst” selection method was used to find variable genes, which were
   input features for initial Principal Component Analysis (PCA) [[55]17].
   Jackstraw analysis was then performed to select the Principal
   Components (PCs) with P-values < 0.05 [[56]18]. Significant PCs were
   incorporated into further t-distributed Stochastic Neighbour Embedding
   (t-SNE) to identify different cell clusters with DEGs (resolution =
   0.5). The distribution and expression of the top 10 DEGs were displayed
   on feature plots and heat maps, respectively, while the Blueprint and
   Encode databases [[57]19–[58]21] in the R package singleR were used as
   references for defining each cell cluster.