Abstract Proliferative Diabetic Retinopathy (PDR) is a chronic complication of Diabetes and the main cause of blindness among the world’s working population at present. While there have been many studies on the pathogenesis of PDR, its intrinsic molecular mechanisms have not yet been fully elucidated. In recent years, several studies have employed bulk RNA-sequencing (RNA-seq) and single-cell RNA sequencing (scRNA-seq) to profile differentially expressed genes (DEGs) and cellular components associated with PDR. This study adds to this expanding body of work by identifying PDR’s target genes and cellular components by conducting an integrated transcriptome bioinformatics analysis. This study integrately examined two public bulk RNA-seq datasets(including 11 PDR patients and 7 controls) and one single-cell RNA-seq datasets(including 5 PDR patients) of Fibro (Vascular) Membranes (FVMs) from PDR patients and control. A total of 176 genes were identified as DEGs between PDR patients and control among both bulk RNA-seq datasets. Based on these DEGs, 14 proteins were identified in the protein overlap within the significant ligand-receptor interactions of retinal FVMs and Protein-Protein Interaction (PPI) network, three of which were associated with PDR (CD44, ICAM1, POSTN), and POSTN might act as key ligand. This finding may provide novel gene signatures and therapeutic targets for PDR. Introduction Proliferative Diabetic Retinopathy (PDR) is a slow-onset, chronic complication of diabetes and the main cause of blindness among the world’s working population at present [[32]1]. The global number of diabetic patients is expected to rise to 366 million by 2030, with 11% of them developing vision-threatening retinopathy [[33]2, [34]3], making early diagnosis and treatment of PDR critical. While studies have previously examined the roles of inflammation, oxidative stress and cytokine production/release in PDR’s pathogenesis [[35]3], its underlying molecular mechanism has not been fully elucidated, and more studies are needed to provide a deeper understanding for uncovering more effective therapeutic targets. The identification of specific gene expression patterns can help understand disease pathogenesis and reveal possible theraputeic targets [[36]4]. Recently, bulk RNA-sequencing (RNA-seq) and single-cell RNA sequencing (scRNA-seq) have been widely used for gene expression profiling and identifying cell populations in Fibro (Vascular) Membranes (FVMs) from PDR patients [[37]5]. Still, to the best of our knowledge, no previous study on PDR has integrated RNA-seq and scRNA-seq to identify Differentially Expressed Genes (DEGs) between normal FVMs and that from PDR patients. Therefore, this study conducts an integrated transcriptomics analysis through RNA-seq and scRNA-seq. We identified three biomarkers of PDR, among which POSTN and FAK/Akt pathway may be potential therapeutic targets of PDR. Data and methods Data collection The gene expression datasets of PDR were obtained through the Gene Expression Omnibus (GEO) database ([38]http://www.ncbi.nlm.nih.gov/) [[39]6]. Two bulk RNA-seq datasets ([40]GSE94019 and [41]GSE102485, containing SRR5925083-SRR5925086, SRR5925099- SRR5925100) [[42]7, [43]8] were downloaded to profile DEGs, including FVMs from 11 PDR patients and seven controls. A single-cell RNA-seq dataset consisting of 7,971 FVM cells from five PDR patients was also downloaded from the GEO database ([44]GSE165784) [[45]5]. Validation studies were then conducted using data from a microarray dataset ([46]GSE60436) comprised of six PDR patients and three control subjects [[47]9]. Processing of RNA-seq data SRA Toolkit (version 2.11.3-ubuntu64) was used to download and preprocess the raw data from the two datasets. Raw reads were first separated into FASTQ files of pair-end reads, with FastQC (version 0.11.5) used for data quality control. The clean reads were aligned to the human reference genome (USCS hg19) by HISAT2 (version 2.1.0). Finally, SAMtools (version 1.9) and HTSeq (version 0.6.1p1) were used to quantify and map the reads to an annotated document (GENCODE, version 39lift37, Oct 2021). Differential gene expression analysis The R package DESeq2 [[48]10] was then used to identify the DEGs among the PDR and control groups in each RNA-Seq dataset. The cutoff criteria for determining DEGs were |log2 fold change (FC)| > 1 and FDR < 0.05. RRA analysis The Robust Rank Aggregation (RRA) method [[49]11] was used to integrate the results of the RNA-seq studies and control batch effects introduced by different sequencing platforms. A lists of up- and down-regulated genes for each RNA-seq dataset were generated from the expression fold changes between PDR and control groups. RobustRankAggreg package in R was used to integrate and rank differentially expressed genes from each dataset, and genes with a score <0.05 were considered significant in the final integrated dataset. Functional and pathway enrichment analysis and construction of protein-protein interaction network Functional enrichment analysis was used to explore the function of the DEGs, including Gene Ontology (GO) functional enrichment analysis and the Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathway analysis [[50]12, [51]13], performed by the clusterProfiler package in R [[52]14]. String database (version 11.5) [[53]15] was used to construct a Protein-Protein Interaction (PPI) network based on the DEGs, allowing the names and protein-coding genes of all proteins interacting in the network to be extracted. Processing of single-cell RNA-seq data The scRNA-seq data was processed by the Seurat package in R. First, Canonical Correlation Analysis (CCA) was used to find Mutual Nearest Neighbours (MNNs) [[54]16]. Cells with more than 2,500 or fewer than 200 gene counts, or with more than 5% mitochondria, were filtered out. The “vst” selection method was used to find variable genes, which were input features for initial Principal Component Analysis (PCA) [[55]17]. Jackstraw analysis was then performed to select the Principal Components (PCs) with P-values < 0.05 [[56]18]. Significant PCs were incorporated into further t-distributed Stochastic Neighbour Embedding (t-SNE) to identify different cell clusters with DEGs (resolution = 0.5). The distribution and expression of the top 10 DEGs were displayed on feature plots and heat maps, respectively, while the Blueprint and Encode databases [[57]19–[58]21] in the R package singleR were used as references for defining each cell cluster.