Abstract The zebrafish is an excellent model system to study thrombocyte function and development. Due to the difficulties in separating young and mature thrombocytes, comparative transcriptomics between these two cell types has not been performed. It is important to study these differences in order to understand the mechanism of thrombocyte maturation. Here, we performed single-cell RNA sequencing of the young and mature zebrafish thrombocytes and compared the two datasets for young and mature thrombocyte transcripts. We found a total of 9143 genes expressed cumulatively in both young and mature thrombocytes, and among these, 72% of zebrafish thrombocyte-expressed genes have human orthologs according to the Ensembl human genome annotation. We also found 397 uniquely expressed genes in young and 2153 uniquely expressed genes in mature thrombocytes. Of these 397 and 2153 genes, 272 and 1620 corresponded to human orthologous genes, respectively. Of all genes expressed in both young and mature thrombocytes, 4224 have been reported to be expressed in human megakaryocytes, and 1603 were found in platelets. Among these orthologs, 156 transcription factor transcripts in thrombocytes were found in megakaryocytes and 60 transcription factor transcripts were found in platelets including a few already known factors such as Nfe2 and Nfe212a (related to Nfe2) that are present in both megakaryocytes, and platelets. These results indicate that thrombocytes have more megakaryocyte features and since platelets are megakaryocyte fragments, platelets also appear to be thrombocyte equivalents. In conclusion, our study delineates the differential gene expression patterns of young and mature thrombocytes, highlighting the processes regulating thrombocyte maturation. Future knockdown studies of these young and mature thrombocyte-specific genes are feasible and will provide the basis for understanding megakaryocyte maturation. Introduction Thrombocytes in fish and birds have been shown to be platelet equivalents and participate in hemostasis to prevent blood loss during an injury. We reported earlier that there are two types of thrombocytes in zebrafish, the young and mature thrombocytes [[32]1–[33]3]. The classification of these thrombocytes was by the fortuitous labeling of a group of thrombocytes at a specific concentration DiI-C18 (DiI) [[34]3]. These thrombocytes were functionally more active and were first recruited to the injury site. Interestingly, even among the DiI labeled thrombocytes, there are the most intensely labeled and less intensely labeled which suggest a continuum of different stages of thrombocyte maturation. Also, we noted two types of thrombocytes in the Fli1-GFP transgenic zebrafish, the less intense GFP+ thrombocytes and more intense GFP+ thrombocytes. Moreover, the less intense GFP+ thrombocytes were DiI positive. Therefore, we concluded that less intense GFP+ thrombocytes are young thrombocytes. Surprisingly, these less intense GFP+ thrombocytes are also positive for RFP driven by the mlc2 promoter [[35]4]. Thus, the RFP+ thrombocytes are young, and the rest of the more intense GFP+ thrombocytes are shown to be mature thrombocytes, probably in different stages of maturation. Megakaryocytes are precursors for human platelets, and they mature in the bone marrow [[36]5]. These cells advance in maturation by endomitosis, yielding polyploidy [[37]6, [38]7]. However, this feature appears to be unique to mammals. Interestingly from fish to birds, there are no megakaryocytes reported; however, thrombocytes seem to possess properties of megakaryocytes in at least having conserved transcription factors [[39]8]. For example, the young thrombocytes, when they mature, gain more Fli1 with the concomitant loss of GATA-1 transcription factor [[40]9]. We have also shown that there is a decrease in phosphatidylethanolamine (PE) and an increase in phosphatidylcholine (PC) in mature thrombocytes compared to young thrombocytes [[41]4]. Interestingly, amphibian thrombocytes under in vitro conditions showed polyploidy-like features [[42]10]. Several genes such as CD41, GP1B, P-SELECTIN, ACVR1, and PAR1 are expressed in megakaryocytes and have been shown to be expressed in thrombocytes [[43]11]. Thus, we have argued that thrombocytes have both megakaryocyte and platelet properties except for polyploidization. Despite these differences, the overall thrombocyte development could be used to study megakaryocyte development and maturation. Also, there are no comprehensive genetic studies to understand megakaryocyte development. At the best, we could study megakaryocyte development in vitro in cell cultures systems but will require animal models for confirming these studies. Thus, we believe that thrombocyte development and maturation in zebrafish could model megakaryocyte development maturation. However, such studies require a comprehensive analysis of factors that control thrombocyte development and maturation. As an initial step to understanding thrombocyte maturation, we sought to compare the transcriptional profiles of the young and mature zebrafish thrombocytes. Our study shows that in the full collection of thrombocytes (young and mature), there were 9143 gene transcripts in contrast to 7708 gene transcripts in megakaryocytes and 2819 gene transcripts in platelets in the published databases. Our study also shows that there are 397 genes expressing uniquely in young thrombocytes and 2153 genes that expressed in mature thrombocytes only. We used the excel spreadsheet to eliminate the duplicates and found that among the 397 genes, 272 genes had human orthologs. Likewise, of the 2153 genes expressing only in mature thrombocytes, 1620 are orthologous to human genes. We have also compared megakaryocyte expression profiles with the thrombocytes and found that the thrombocytes and megakaryocytes have common transcripts for 4224 genes. Likewise, when we compared platelet expression profiles with thrombocytes, we found 1603 gene transcripts that are commonly expressed between thrombocytes and platelets. Interestingly, none of the unique genes expressed in young thrombocytes was found to be expressed in either megakaryocytes or platelets. However, among the unique gene transcripts found in mature thrombocytes, 602 were found in megakaryocytes, and 225 were found in platelets. We have also performed heatmap analysis, a listing of human orthologs, and provided information on select categories of genes. Thus, our analysis in this paper provides comprehensive information on the differential expression of genes in young and mature thrombocytes that should be useful to study the mechanism of thrombocyte maturation and might also be useful in understanding megakaryocyte maturation and development. Methods Zebrafish husbandry Adult wild-type (WT) zebrafish and transgenic TG (fli1: EGFP) fish, Glo fish (DsRed), GloFli fish were maintained at 28°C in the circulating system of deionized water supplemented with Instant Ocean. The fish were maintained at 10 hours of dark and 14 hours of the light cycle and fed with brain shrimp and flakes [[44]12]. Zebrafish blood collection and cell sorting The fish was placed on a clean paper towel on its side, and the skin was wiped gently with Kim Tech wipes. It was then clipped by using a pair of dissection scissors between the second and fourth black stripes in line with the ventral and the dorsal fins. 2 μL of blood was collected by using a 10 μL pipette, and 2 μL of blood was added into polystyrene 5ml conical tubes (BD Falcon, 12 x 75 mm), containing 300 μL of freshly prepared ice-cold 1x Phosphate buffered saline (PBS) supplemented with 0.04% of Bovine Serum Albumin (BSA) [[45]13]. For cell sorting, GFP+ thrombocytes and RFP+ thrombocytes from GloFli fish were sorted using BD FACSCanto flow cytometer (UT Southwestern) separately into polystyrene 5ml conical tubes containing 1x PBS supplemented with 0.04% BSA. FACSCanto flow cytometer was set at 4°C with a nozzle size of 100. For optimum analysis, cell viability was set at 75% to 90% with at least 10,000 sorted cells. Samples run on FACSCanto flow cytometer were gated on PE and GFP using BD FACSdiva 8.0.2 software. For negative control, WT zebrafish were used, and for positive controls, Glo fish (DsRed) and TG (fli1: EGFP) fish were used. The sorted thrombocytes were kept on ice and were used immediately for RNA-seq by the Next Generation Sequencing Core at UT Southwestern. Humane care of zebrafish Since anesthesia affects blood clotting, and fish blood generally clots faster than human blood, blood collection is very difficult. Thus, the blood collection must be performed without anesthesia. The animals used to collect blood will not survive, and therefore, they are euthanized immediately after blood collection by overdosing the fish with buffered MS222 (2%) anesthetic. All procedures were approved by the Institutional Animal Care and Use Committee (Protocol # 20021) of the University of North Texas, and animal experiments were performed with humane care in compliance with the institutional guidelines. RNA sequencing RNA from sorted GFP+ thrombocytes and RFP+ thrombocytes were prepared according to 10x Genomics protocol. Sequencing was performed after successful library preparation and quality control. Briefly, the sorted cells were loaded immediately onto 10x Genomics Chromium Chip B. The barcoded cDNA libraries were generated using 10x Genomics Chromium Single Cell 3’ GEM library and Gel Bead Kits V3 according to the manufacturer’s instructions. The library sequencing on Illumina Hiseq 2500 V3 was performed on a depth of a minimum of 20,000 read pairs/cell for each library, and Cell Ranger software (3.0.0) version was used for barcode recovery [[46]14–[47]16]. Analysis of duplicated genes in zebrafish and identifying human orthologs We used the raw data and obtained the list of zebrafish ‘protein-coding’ genes for GFP+ thrombocytes and RFP+ thrombocytes using Ensembl and ZFIN databases. The list of the above genes was imported to an excel spreadsheet to analyze the duplicated genes. The GFP+ and RFP+ thrombocyte gene lists were colored green and red, respectively. To identify the common genes expressed in both GFP+ and RFP+ thrombocytes gene lists, we combined both lists and used the ‘conditional formatting’ function to highlight and then to find the duplicated values. We then removed the duplicated genes by using the ‘remove duplicates’ function to identify by their green and red color codes using the ‘sort and filter by color’ function, the unique GFP+ and RFP+ genes. After removing the duplicates from GFP+ and RFP+ genes, we used the list of human genes from the Ensembl database and performed a similar analysis to identify human orthologs that would appear as duplicates when zebrafish and human genes are combined. Gene classification For gene function classification, Protein Analysis Through Evolutionary Relationship (PANTHER) (version 16.0) online software was used to classify the proteins encoded by genes of interest based on functional classifications, including family and subfamily, molecular function, cellular component, biological process, and pathway [[48]17]. This method eliminates non-protein coding genes such as microRNA genes. The gene lists for each dataset, GFP+ thrombocytes, RFP+ thrombocytes, human megakaryocytes, and platelets, were entered with the gene IDs, and Danio rerio genome source ZFIN was chosen for analysis. The outputs of functional classification analysis were visualized using graphic charts and also presented as the summary of genes. Heat maps Transcripts from single-cell RNA-seq data were quantified using Salmon [[49]18]. The expression data that was generated through the 10x Genomics default pipeline was annotated using genes primarily taken from ZFIN. Gene groups were taken from BioMart-Ensembl for all available genes. The only group annotation in ZFIN was a basic group type (protein_coding, miRNA, etc.), but the actual gene function was not annotated. There are several genes in the expression files that do not have any entry in ZFIN. These genes begin with "LOC." If there was data in ZFIN or BioMart-Ensembl, they were included. Genes were clustered based on their expression via k-means clustering ranging from k = 2 to k = 10. The expression of genes is based on the average mean counts of cells falling into the generated clusters. An additional graph-based clustering was also performed, which is independent of a k-value. In this case, 4 clusters were obtained. The cellRanger t-SNE graphs were used in choosing the best clustering. We then generated the spreadsheet with a sheet for each cluster type. The counts covered a large range (0–120+), and row-wise scaling on the values was performed. This scaling provides a range of -2 to 2 such that everything will not be in the same color and two annotation bars covering the map. Due to the different number of genes in some of the sets, I row annotations were turned off. Row-wise re-ordering is based on an invisible hierarchical clustering of gene-wise distances. Column re-ordering was turned off to maintain the column order of the original excel data. Software used was the ’aheatmap’ function of the NMF library for R. Scaling was performed using the scale function of R with a transposed matrix of the mean counts. TPM analysis Transcripts from single-cell RNA-seq data organized in 24 scRNASeq files were quantified using Salmon [[50]18]. Among these 24 files in fastq format, reads from 8 files mapped to the zebrafish genome with a mapping percentage of almost 45–46%. Among these 8 files, 4 represented GFP+ thrombocytes, and the remaining represented RFP+ thrombocytes. Therefore, further downstream processing was done on these 8 files. Since the 4 fastq files were derived from the same GFP+ thrombocytes, we merged them to generate a single GFP+ thrombocyte file. Likewise, we generated a single file for RFP+ thrombocytes. Transcripts per million (TPM) distribution obtained in this analysis was divided into arbitrarily five intervals. For each of the GFP+ and RFP+ thrombocyte datasets, we classified genes as expressed when their TPM values were higher than 1. Gene-level abundance estimates were obtained using the R package (tximport) [[51]19]. Quantile mapping We have divided the expressed genes in each dataset into 10 quantiles based on their expression levels, i.e., the first quantile consists of the top 10% (approximately) of the expressed genes in the dataset, arranged in decreasing order of their expression values, etc. We performed quantile mapping by comparing the genes in each quantile across a pair of datasets. The naming design was GFP_x_RFP_x_Genes; here, x ranges from 1 to 10. For example, GFP_9_RFP_5_Genes denotes genes in the 9th quantile of GFP that overlapped with genes in the 5th quantile of RFP. System biology analysis To identify functional similarity in the two datasets (GFP+ thrombocytes and RFP+ thrombocytes), we performed a comparative analysis at the systems level. For this purpose, we identified “highly expressed” genes in each dataset as the genes that have a z-score (based on log TPM values of “expressed” genes) greater than 1.5 in each dataset. The highly expressed gene lists for GFP+ and RFP+ thrombocytes were analyzed separately using the database for annotation, visualization, and integrated discovery (DAVID V6.8) to identify enriched Kyoto encyclopedia of genes and genomes (KEGG) pathways, and gene ontology (GO) categories, namely, the biological process (BP), cellular component (CC), and molecular function (MF). The criterion of FDR (False Discovery Rate) less than or equal to 0.05 (in DAVID, FDR of 5 is equivalent to 0.05) was used to identify enriched GO terms and KEGG pathways. In DAVID, the FDR values are provided as FDR percentage, so, FDR = 1 means an FDR of 1/100 = 0.01 [[52]19]. Megakaryocyte and platelet RNA-seq analysis We used megakaryocyte single-cell RNA-seq dataset [53]GSM4278189, which was downloaded from NCBI GEO. There were 4 fastq files derived from the same sample, which were merged to generate a single fastq file. For platelets, we used platelet RNA-seq dataset SRP262885, which was downloaded from NCBI SRA. There were 5 fastq files derived from the 5 different samples. These fastq files (1 for megakaryocyte and 5 for platelets) were further processed and transcript abundance quantified using Salmon, and the gene-level abundance estimates were obtained using the R package (tximport). We identified genes as “expressed” in a dataset if their TPM values were higher than 1. We identified “highly expressed” genes in each dataset as the genes that have a Z-score (based on log TPM values of “expressed” genes) greater than 1.5 in the dataset. Results Gene expression in young and mature thrombocytes To understand the genes involved in the maturation of thrombocytes, we performed RNA-seq analysis of young and mature thrombocytes and studied conserved and differential gene expression patterns between these thrombocytes. We used the GloFli adult zebrafish, where young thrombocytes were labeled with RFP, and mature thrombocytes were labeled with more intense GFP. This adult zebrafish line, GloFli, was previously characterized in our laboratory [[54]4]. We collected 2 μL of blood from a single fish and subjected this to flow cytometry for separating RFP+ and GFP+ cells. Attempts to pool blood from multiple fish yielded mostly dead cells because of the time taken from sorting to the loading on the 10x genomics machine. We used 1,541 GFP+ mature thrombocytes and 2,176 RFP+ young thrombocytes for single-cell RNA sequencing (scRNA-seq). The quality of cDNA from GFP+ and RFP+ thrombocytes used for sequencing is shown in [55]S1 Raw images. The raw data are given in Data Review URL ([56]https://dataview.ncbi.nlm.nih.gov/object/PRJNA753153?reviewer=l15a 7d4am3m124rafh16mbsplv). We used these raw data files and obtained human orthologs and the corresponding human genes and respective genetic disorders associated with them using Ensembl and ZFIN databases. This reorganized data is shown in [57]S1 and [58]S2 Files. From this data, we found the total number of genes detected (genes with transcripts) for GFP+ cells is 8,746 and for RFP+ cells is 6,990 genes. By combining the datasets from GFP+ and RFP+ thrombocytes and after eliminating the duplicated genes, 6,593 genes were found to be commonly expressed in both GFP+ and RFP+ thrombocytes, thus, yielding a total of 9143 genes expressed cumulatively in young and mature thrombocytes. Among these genes expressed in both GFP+ and RFP+ thrombocytes, we found CD41 gene transcripts [[59]20]. Also, we found that mature thrombocytes uniquely expressed 2,153 genes, young thrombocytes expressed 397 genes exclusively by our spreadsheet method. Protein coding genes To eliminate the non-protein coding genes such as microRNA genes, the expressed genes from GFP+ (8,746) and RFP+ (6,990) thrombocytes were functionally characterized using the PANTHER online software that performs protein-coding gene functional classification. The protein gene classes for GFP+ and RFP+ thrombocytes are given in [60]Fig 1 as a pie-chart. The number of genes detected (genes with transcripts) for GFP+ cells by PANTHER analysis was 7,452 and for RFP+ cells was 5978. However, the protein coding genes in the GFP+ and RFP+ thrombocytes were 4704 and 3811, respectively. When we analyzed the uniquely expressed genes in the GFP+ thrombocytes by PANTHER, we found 1,778 and 300 genes had transcripts in GFP+ and RFP+ thrombocytes, respectively. However, the unique protein genes in GFP+ and RFP+ thrombocytes were 1,104 and 210, respectively. Fig 1. PANTHER analysis and classification of GFP+ and RFP+ thrombocyte transcripts. [61]Fig 1 [62]Open in a new tab A. When 8,746 genes expressed in GFP+ and 6990 genes expressed in RFP+ thrombocytes thrombocytes were uploaded into the PANTHER program, and Danio rerio species was selected, GFP+ 7452 genes and RFP+ 5978 genes, respectively were detected by this program. The program yielded a total number of protein hits of 4704 and 3811 for these GFP+ and RFP+genes, respectively. B. When 2153 unique genes expressed in GFP+ thrombocytes and 397 unique genes expressed in RFP+ thrombocytes were uploaded into the PANTHER program, and Danio rerio species was selected, GFP+ 1778 genes and RFP+ 300 genes, respectively, were detected by this program. The program yielded a total number of protein hits of 1104 and 210 for these unique GFP+ and RFP+unique genes, respectively. In both A and B, the percentage of each of the protein classes for GFP+ and RFP+ gene lists are shown by different colors in the pie chart, and their classification is provided by its side with similarly coded colors. Heat maps Using the transcripts from RNA-seq data, the heatmaps of the gene expression for GFP+ and RFP+ thrombocytes were generated, and they are shown in [63]Fig 2. The heat maps showed that the highly expressed genes in the RFP+ population are limited. The list of genes used in this heat map study is given in [64]S3 File. We also found that highly expressed genes among certain categories, such as chromatin/chromatin-binding or regulatory protein genes, cytoskeleton protein genes, nucleic acid-binding protein genes, and gene-specific transcriptional regulator genes are less in the RFP+ population compared to the GFP+ thrombocyte population. Fig 2. Heat map of differentially expressed genes between GFP+ and RFP+ thrombocytes. [65]Fig 2 [66]Open in a new tab Genes were arbitrarily clustered via k-means clustering ranging from k = 2 to k = 10. The expression of genes is based on the average mean counts of cells falling into the generated clusters. The cellRanger t-SNE graphs were used in choosing the best clustering and 4 clusters were found. A spreadsheet with a sheet for each cluster type was generated. The counts covered a large range (0–120+), and row-wise scaling on the values was performed, which provided a range of -2 to 2. A. Expression of all genes, 8746 genes in GFP+ thrombocytes and 6990 genes in RFP+ thrombocytes. B. Expression of chromatin/chromatin-binding or regulatory protein genes, 120 genes in GFP+ thrombocytes and 106 genes in RFP+ thrombocytes. C. Expression of cytoskeleton protein genes, 217 genes in GFP+ thrombocytes and 158 genes in RFP+ thrombocytes. D. Expression of nucleic acid-binding protein genes, 535 genes in GFP+ thrombocytes, and 478 genes in RFP+ thrombocytes. E. Expression of gene-specific transcriptional regulator genes, 449 genes in GFP+ thrombocytes, and 329 genes in RFP+ thrombocytes. Purple indicates GFP+ thrombocytes, and pink indicates RFP+ thrombocytes followed by clusters from 1 to 4 shown in light green to dark green. For each gene, red is upregulated, and blue is downregulated in the corresponding sample. TPM analysis We obtained the gene-level abundance estimates using the R package using two merged scRNASeq files, one for GFP+ and the other for RFP+ thrombocytes. The TPM values for each of the two merged scRNASeq files are provided in [67]S4 File. We then analyzed the TPM distribution and the results of this distribution, for each of the two samples (GFP+ and RFP+ thrombocytes) used in our analysis are given in [68]Fig 3. Genes expressing with a TPM value greater than 1 in GFP+ and RFP+ thrombocytes are enlisted in [69]S5 and [70]S6 Files (“expressed genes”). To identify functional similarities in the two datasets (GFP+ and RFP+), we performed a comparative analysis of gene expression at the systems level. For this purpose, we identified “highly expressed” genes in each dataset as the genes that have a Z-score (based on log TPM values of “expressed” genes) greater than 1.5. The log TPM values for GFP+ and RFP+ datasets are provided in [71]S7 and [72]S8 Files. The z-scores of all genes expressed in GFP and RFP thrombocytes are given in [73]S9 and [74]S10 Files. This resulted in 867 “highly expressed” genes for GFP+ and 302 for RFP+ cells. The list of highly expressed genes in these two thrombocyte populations is given in [75]S11 and [76]S12 Files. The results also revealed 265 highly expressed genes commonly present in both these thrombocyte populations. Thus, most of the genes analyzed showed very low or no expression. On average, 40.09% and 49.12% of genes had zero TPM for GFP+ and RFP+ cells, respectively. We also found that 20.19% and 34.55% of genes had less than one TPM in GFP+ thrombocytes and RFP+ thrombocytes, respectively. We found 15.94% and 9.57% of genes had TPM between the interval of 1–5 in GFP+ and RFP+ thrombocytes, respectively. We also found 15.01% and 4.68% of genes had TPM between the interval of 5–30 in GFP+ and RFP+ thrombocytes, respectively. The result also revealed 8.71% of GFP+ genes had a TPM of 30 and higher, contrasting with 2.05% of such RFP+ genes. Interestingly, there was a 3–4 folds increase in the number of genes with TPM values greater than 5 between the RFP+ and GFP+ thrombocytes. Fig 3. Transcripts per million (TPM) distribution for each of the two samples GFP_L001 and RFP_L001 representing GFP+ and RFP+ thrombocytes. Fig 3 [77]Open in a new tab Transcripts from 24 scRNASeq files, 12 for GFP+ and 12 for RFP+ thrombocytes were quantified. Among these files, 4 GFP+, and 4 RFP+ files (in fastq format) that showed a mapping percentage of almost 45–46% to the zebrafish genome were chosen. The 4 GFP fastq files were merged to one fastq file, and similarly 4 RFP fastq files were also merged to one fastq file and used in TPM analysis. Bar graphs show TPM distribution obtained in this analysis and are divided into five intervals for each sample representing the number of genes with TPM values from 0 to >30, and these TPM values are coded by different colors with a key shown below. Quantile mapping The quantile-mapping analysis helped us identify the percentage of overlapping genes displaying similar expression patterns in GFP+ and RFP+ thrombocytes. The most highly expressed genes in the first quantile (top 10% in expression) for GFP+ and RFP+ thrombocytes have 352 genes common between them. The mapping results for each quantile pairs for GFP+ and RFP+ thrombocytes are provided in [78]Fig 4. The details of quantile mapping are shown in [79]S13 File. The overlapping genes between GFP+ and RFP+ thrombocytes in the other 9 quantiles are 1112, approximately 28% of 3935, the total number of genes in all the quantiles ([80]Fig 4). Fig 4. Quantile mapping of GFP+ thrombocyte and RFP+ thrombocyte datasets. Fig 4 [81]Open in a new tab We have divided the expressed genes in each dataset into 10 quantiles. We have compared the genes in each quantile across the pair of datasets. In each quantile, the quantile map shows the number of common genes between the two datasets. The bars on the right of the quantile map indicate the color codes and the number of genes. The data value bars on the rightmost indicate mean TPM values; green color, the highest TPM values for GFP+ thrombocytes, and purple color, the highest TPM values for RFP+ thrombocytes. Data set characterization Further, we analyzed highly expressed genes list in the data sets for both GFP+ mature and RFP+ young thrombocytes. The DAVID program was used to identify KEGG pathways and functional terems representing GO categories, namely, the BP, CC, and MF, for each dataset. The top 15 GO enrichment terms for GFP+ and RFP+ thrombocyte datasets are shown in [82]Fig 5. FDR values are provided in [83]S14 and [84]S15 Files. Interestingly, the GO analysis revealed 241 and 251 genes involved in ribosomal machinery of RFP+ and GFP+ thrombocytes, respectively, representing a modest 4.14% increase in the number of ribosomal machinery genes transcriptionally activated in the course of development from RFP+ to GFP+ thrombocytes. However, for the intracellular component, the RFP+ thrombocytes had 66 expressed genes, whereas the GFP+ cells had expressed 105 genes, representing 59% increase in the number of transcriptionally activated genes from RFP+ to GFP+ thrombocytes. The GO enrichment values for GFP+ and RFP+ thrombocytes are shown in [85]S14 and [86]S15 Files. The list of the top 15 GO enriched gene numbers is given in [87]S16 File. Fig 5. Gene ontology analysis of GFP+ and RFP+ thrombocyte transcripts. [88]Fig 5 [89]Open in a new tab The highly expressed gene lists for GFP+ and RFP+ thrombocytes were analyzed separately using DAVID to identify gene ontology (GO) categories. The top 15 GO enrichment analysis of A. GFP+ thrombocytes and B. RFP+ thrombocytes by DAVID functional analysis. Bars represent the number (x-axis) of highly expressed genes in pathways (y-axis) for each dataset. The KEGG analysis yielded 9 datasets of the most highly expressed genes, each in GFP+ and RFP+ thrombocytes. The KEGG pathway enrichment analysis results are shown in [90]Fig 6. The values for pathway enrichment for each dataset are provided in [91]S17 and [92]S18 Files. Except for the ribosome-related genes which are similar in numbers between GFP+ and RFP+ thrombocytes, the rest of the 8 datasets yielded greater number of genes expressed in GFP+ thrombocytes compared to RFP+ thrombocytes. Interestingly, we found erythrocyte differentiation genes expressed only in RFP+ thrombocytes. Fig 6. Significantly enriched Kyoto encyclopedia of genes and genomes. Fig 6 [93]Open in a new tab The highly expressed gene lists for GFP+ and RFP+ thrombocytes were analyzed separately using the DAVID to identify enriched KEGG pathways. Bars represent the number of highly expressed genes in pathways for GFP+ and RFP+ thrombocyte datasets. Human orthologs expressed in thrombocytes Zebrafish genes that are expressed in thrombocytes were compared with human genes in the Ensembl database, and the orthologous human genes expressed in both young and mature thrombocytes were identified. They were found to be 7081 genes out of the 9143 genes (72%) expressed cumulatively in two types of thrombocytes according to Ensembl and ZFIN databases. However, since many of the zebrafish genes are duplicated, for each of these genes same human homolog will be picked up and thus will yield more duplicates in the human gene set. We, therefore, corrected for 460 human genes that were shown as duplicates or triplicates, yielding 6621 true human orthologs. Thus, approximately 72% of zebrafish genes expressed in thrombocytes are orthologous to human genes. Since we are interested in genes that are regulatory for thrombocyte development, we chose 4 categories of genes analyzed in heat maps such as chromatin/chromatin-binding regulatory protein, cytoskeleton, nucleic acid binding, and gene-specific transcriptional regulator genes. We first obtained the number of genes, each in GFP+ and RFP+ thrombocytes, that have human orthologs in the above 4 categories. We then determined the orthologs of the genes that uniquely expressed in GFP+ and RFP+ thrombocytes, as well as those that expressed across both these thrombocyte populations. The results are shown in [94]Fig 7. The results revealed approximately 30% of the orthologs belong to these 4 categories in both GFP+ and RFP+ thrombocytes. Fig 7. Different categories of human ortholog gene transcripts found in GFP+ and RFP+ thrombocytes. [95]Fig 7 [96]Open in a new tab The raw data was used to obtain the list of zebrafish genes expressed in GFP+ thrombocytes and RFP+ thrombocytes using Ensembl and ZFIN databases. The list of human orthologs corresponding to zebrafish genes was also obtained using Ensembl and ZFIN databases. The list of these human orthologs were uploaded into the PANTHER program to obtain protein categories as well as to find human orthologs corresponding to unique and commonly expressed genes in GFP+ and RFP+ thrombocytes. The bar graph represents the corresponding human orthologs of the unique RFP+ genes, unique GFP+ genes, and both RFP+ and GFP+ common genes shown in red, green, and purple colors, respectively. The number of human ortholog genes is shown in each bar, including their functional protein categories. The protein categories are gene-specific transcriptional regulators, nucleic acid-binding proteins, cytoskeleton proteins, and chromatin/chromatin-binding or regulatory proteins. Comparison of gene expression between thrombocytes and megakaryocytes/platelets To test whether thrombocytes have similar transcripts found in megakaryocytes and platelets, we compared the transcripts present in thrombocytes obtained by our single-cell RNA-seq analysis with the transcripts derived from publicly available single-cell RNA-seq data from megakaryocytes from early human embryonic cells. We obtained a list of genes that are common in both datasets and designated the human counterparts as human orthologs. The list of orthologs of thrombocyte transcripts found in megakaryocytes are given in [97]S19 and [98]S20 Files and for platelets in [99]S21 and [100]S22 Files. We found 4224 out of 9143, i.e. 46% gene transcripts from thrombocytes to be present in the megakaryocytes. Likewise, when we compared with the platelet transcripts, we found 1603 out of 9143, i.e. 17% thrombocyte gene transcripts were in platelets. When we compared both megakaryocyte and platelet transcripts with the transcripts present in thrombocytes, we found 1339 gene transcripts common to all three. However, out of the unique 397 RFP+ gene transcripts, none of them was present in megakaryocytes or platelets, while of the unique 2153 GFP+ gene transcripts, 602 were present in megakaryocytes, whereas 225 were present in platelets. Interestingly, we found that among the 397 unique RFP+ thrombocyte gene transcripts, 272 had human orthologs. Likewise, out of 2153 unique GFP+ thrombocyte gene transcripts, 1620 had human orthologs. We have also analyzed the protein-coding transcripts using the PANTHER program and found several classes of proteins, grouped by functions, that are conserved between thrombocytes and megakaryocytes/platelets ([101]Fig 8). From these data, we have chosen three families of proteins (transcription factors, cell adhesion molecules, and transmembrane receptors) involved in megakaryocyte/platelet development and function. We found a total of 156 transcription factor transcripts that are present in megakaryocytes. Among these, there were transcripts for 14 basic helix-loop-helix transcription factors, 16 basic leucine zipper transcription factors, 7 zinc finger transcription factors, 24 C2H2 zinc finger transcription factors, 7 C4 zinc finger nuclear receptors, 9 homeodomain transcription factors, 4 Rel homology transcription factors, and 9 wing helix/forkhead transcription factors. Among the basic leucine zipper transcription factors, Nfe2, an important factor for megakaryocyte development, is conserved. Fig 8. PANTHER analysis and classification of human orthologs of total thrombocyte gene transcripts found in human megakaryocytes and platelets. [102]Fig 8 [103]Open in a new tab A. The 9143 total thrombocyte gene transcripts and 7708 human megakaryocyte gene transcripts were combined in a spreadsheet and color-coded. Then duplicate genes were selected first and the duplicates were eliminated to give a list of 4224 human orthologs which were uploaded into the PANTHER program, and Danio rerio species was selected. The program detected 4119 genes and yielded a total number of 2633 protein hits for these genes. B. The 9143 total thrombocyte gene transcripts and 2819 human platelet gene transcripts were combined in a spreadsheet and color-coded. Then duplicate gene transcripts were selected first, and the duplicates were eliminated to give a list of 1603 gene transcripts for human orthologs, which were uploaded into the PANTHER program, and Danio rerio species was selected. The program detected 1553 genes and yielded a total number of 987 protein hits for these genes. In both A and B, the percentage of each of the protein classes for RFP+ and GFP+ gene lists are shown by different colors in the pie chart, and their classification is provided by its side with similarly coded colors. We also found transcripts for a total of 6 cell adhesion molecules such as 4 integrins (Itfg2, Itgb3b, Itgb1a, Itga2b), Spondin domain-containing protein (Spon2a), and Thy1 cell surface antigen (Thy1). In addition to these two groups of transcripts, we also found transcripts for 9 GPCRs, including thrombin receptor (Par1) and transcripts for 19 transmembrane receptors, including erythropoietin and thrombopoietin receptors. In thrombocytes, we also found transcripts for 60 transcriptional regulators that are also present in platelets. Among these, there were transcripts for 5 basic helix-loop-helix transcription factors, 4 basic leucine zipper transcription factors, 13 C2H2 zinc finger transcription factors, 1 C4 zinc finger nuclear receptors, 5 homeodomain transcription factors, 1 Rel homology transcription factor, and 4 wing helix/forkhead transcription factors. There were no zinc finger transcription factor transcripts among the platelet-specific transcripts. Interestingly, among the transcripts for 5 basic helix-loop-helix transcription factors found in thrombocytes, 4 of them were also present in megakaryocytes, whereas the arnt transcript was only found in platelets. Among the transcripts of basic leucine zipper transcription factors NFE2, an important factor for megakaryocyte development, is conserved. Likewise, another related transcript nfe2l2a was also shared among thrombocytes, megakaryocytes, and platelets. We found transcripts for 5 cell adhesion molecules such as 4 integrins (Itgb3b, Itgb5, Itgb1a, Itga2b), and Dag1. We did not find transcripts for Spon2a and Thy1. Likewise, we did not find transcripts for Itgb3b and Itgb5 in megakaryocytes. Moreover, we found transcripts for 7 GPCRs, including the one for Par1 that is also a megakaryocyte-specific transcript. However, two of these receptor transcripts adora2ab, ipar5a were found in platelets but not in megakaryocytes. Among the transcripts for 11 transmembrane receptors, both erythropoietin and thrombopoietin receptors’ transcripts were found similar to those in megakaryocytes. Discussion Previous results from our laboratory, using microarray analysis, revealed thrombocyte-specific genes; however, there was no analysis of GFP+ and RFP+ thrombocyte-specific genes [[104]21]. The work presented here is the first comparative analysis of GFP+ and RFP+ thrombocyte transcripts obtained by RNA-seq performed by 10x genomics. In this protocol, mRNA molecules were directly captured by Gel Bead oligo and, therefore, contaminated genomic DNA were not captured downstream. Also, we set up 75–90% cell viability in 10x genomics sequencing. We do not know whether this will skew the downstream data, and it is difficult to predict. However, bioinformatically there are steps that are usually taken to make sure that dead/dying cells are filtered out so that they do not influence the results of differential gene expression analysis. These QC or pre-processing steps usually involve filtering out cells with genes of too high expression or genes with too low expression due to possible background. They also filter out cells with a high percent of mitochondrial gene expression due to possible dead/dying cells. This procedure of filtering before further processing the data ensures that the downstream results reflect the biology of the system under investigation rather than that of cell death caused by the sample handling or preparation. Moreover, since in both of these populations, gene encoding CD41, a highly specific thrombocyte marker, was found to be expressed as revealed in the analysis of RNA-seq data obtained from the sorted populations of thrombocytes, it lends credibility to the thrombocyte populations and confirms that they are indeed thrombocyte populations with minimal contaminations from other blood cells. The comparative analysis of the GFP+ and RFP+ thrombocytes showed, more highly expressed genes in GFP+ thrombocytes compared to RFP+ thrombocytes, as confirmed by the heat map studies, probably because during maturation more number of genes are turned on while less number of genes are silenced. In our previous work on young and mature thrombocytes, we have shown that there was an increase in the expression of a Fli1 in mature thrombocytes [[105]1, [106]4]. These results are consistent with the heatmap results suggesting that maturation of thrombocytes involves an increase in thrombocyte gene expression [[107]4]. The number of genes analyzed using excel spreadsheet analysis tools is larger than the ones obtained by PANTHER analysis and does not match because there are several genes that are not recognized by the PANTHER program. This may be due to the reason that the database used may not contain all the genes. In fact, the TPM analysis also showed there are four times more genes that are highly expressed in GFP+ thrombocytes compared to RFP+ thrombocytes. Understanding thrombocyte maturation involves a closer look at regulatory genes involved in chromatin modification, transcriptional regulation, cytoskeletal reorganization, and nucleic acid-binding. Interestingly, the number of unique genes in RFP+ thrombocytes is less compared to GFP+ thrombocytes. These results are consistent with the total number of highly expressed genes and again suggest that the loss in expression of a fewer number of genes among these groups and the gain in expression of a greater number of genes are important for thrombocyte maturation. Previously it has been shown in our laboratory that there was a switch of a synthetic pathway from PE to PC. Thus, the current finding of the unique genes in GFP+ and RFP+ thrombocytes adds to the knowledge in understanding the regulation of overall thrombocyte maturation. Our analysis by quantile mapping on the level of gene expression between GFP+ and RFP+ thrombocytes also revealed a smaller percentage of genes that are highly expressed between these cells, and a greater number of genes that are commonly expressed at a lower level. These results suggest that the highly expressed genes may be performing housekeeping functions and the genes that are expressed in lower levels may perform specialized thrombocyte function and development. Interestingly, GO studies also showed housekeeping genes such as those coding intracellular group of proteins and the proteins involved in ribosomal machinery are highly expressed in both GFP+ and RFP+ thrombocytes, and genes encoding specialized proteins such as those involved in erythrocyte differentiation are expressed at lower levels, consistent with quantile mapping. KEGG analysis revealed certain specialized functions such as proteosomes, spliceosomes, protein export, and protein processing in ER to be uniquely represented in GFP+ thrombocytes, suggesting that these specialized functions are important and necessary for thrombocyte maturation. Our finding of 72% human orthologs among the zebrafish thrombocyte transcripts is strikingly similar to the percentage observed by the authors of the zebrafish genome project [[108]22]. In addition, a comparison of expressed genes between thrombocytes and megakaryocytes/platelets revealed an important fact that thrombocytes share many transcripts with megakaryocytes and platelets. In fact, the finding of a greater number of orthologs of thrombocyte genes expressing in megakaryocytes compared to platelets suggests that thrombocytes have more megakaryocyte features so far as the transcripts are concerned relative to platelets and are probably megakaryocyte equivalents, and since platelets are fragments of megakaryocytes, thrombocytes also possess the properties of platelets. The higher number of transcript similarities to megakaryocytes relative to platelets is probably because platelet mRNAs, although derived from megakaryocytes may not be stable due to the lack of complete cell architecture in platelets, again indicating thrombocytes have cell machinery similar to megakaryocytes. These results suggest that thrombocytes are forerunners for megakaryocytes where polyploidization occurred at the time of mammalian radiation. Since the thrombocytes carry necessary components for aggregation and can be released into circulation, they perform platelet functions. This finding thus supports our previous hypothesis that thrombocytes have both megakaryocyte and platelet properties [[109]23]. This fact is also strengthened by our finding that transcripts for transcription factors that are important for the differentiation and maturation of megakaryocytes are present in thrombocytes, and the platelets may carry a small number of transcripts derived from megakaryocytes. Likewise, the functionally important cell adhesion molecule transcripts are present in megakaryocytes, platelets, and thrombocytes. It is well known that megakaryocytes are derived from a bifunctional erythrocyte-megakaryocyte precursor. Thus, young thrombocytes may carry certain properties of erythrocytes, and our results are consistent with this megakaryocyte differentiation. This assumption is supported by the observation that expression of erythrocyte receptor genes occurs in young thrombocytes. Also, young thrombocytes express Gata1, an erythroid transcription factor [[110]24, [111]25]. Surprisingly, among the unique genes expressed in young thrombocytes, none of them was found in megakaryocyte transcripts, probably because the megakaryocyte RNA-seq data derived from human embryonic cells may not have representation of younger megakaryocytes. Thus, these unique genes are probably novel and important for thrombocyte maturation. In summary, the above comparative analysis revealed several genes that may be involved in the maturation of thrombocytes. These genes could be used in analyzing their function in thrombocyte maturation by gene knockdown studies. For example, the knockdown of unique RFP+ genes may advance thrombocyte maturation. In contrast, the knockdown of unique GFP+ genes may stop the maturation. The comparison of thrombocyte transcripts with megakaryocyte/platelet transcripts revealed that thrombocytes are both megakaryocyte and platelet equivalents. Conclusions In conclusion, our single-cell RNA-seq analysis and comparison of the gene expression between the young and mature thrombocytes revealed a high degree of similarity and less number of unique genes between these two thrombocyte populations. Our analysis also showed orthology between a large number of zebrafish thrombocyte and human megakaryocyte/platelet expressed genes. This information is valuable for the future knockdown studies of these genes to understand the mechanisms of thrombocyte and megakaryocyte maturation. Supporting information S1 Raw images. Amplification of a full-length of cDNA by PCR to generate sufficient mass for library construction. A 56 μL of cDNA amplification reaction was mixed to a 35 μL sample. 1,541 GFP+ mature thrombocytes (2^nd lane from left) and 2,176 RFP+ young thrombocytes (4^th lane from left) were loaded, and a 12 PCR cycle was performed. 1 μL of the sample (dilution factor 1:10) was run on Agilent 4200 Tapestation for cDNA QC and quantification. The total cDNA yield in ng is calculated by multiplying the cDNA concentration (pg/ μL) by the elution volume (40 μL) of post cDNA amplification reaction clean up the sample (taking any dilutions factors into account) and then divide by 1000 (pg/ng). The upper marker (1^st and 3^rd lanes from left) is labeled with a purple line, and the lower marker is labeled with a green line. The molecular weight products (from top 1500, 1000, 700, 500, 400, 300, 200, 100, 50, 25 bp) are indicated. (PDF) [112]Click here for additional data file.^ (189.3KB, pdf) S1 File. Extracted data from the raw values along with human orthologs and diseases associated with these genes corresponding to zebrafish transcript from GFP+ thrombocytes. (XLS) [113]Click here for additional data file.^ (17.7MB, xls) S2 File. Extracted data from the raw values along with human orthologs and diseases associated with these genes corresponding to zebrafish transcript from RFP+ thrombocytes. (XLS) [114]Click here for additional data file.^ (15.5MB, xls) S3 File. List of all genes as well as four categories of genes expressed in both GFP+ and RFP+ thrombocytes, used in heat map generation. (XLSX) [115]Click here for additional data file.^ (37.3KB, xlsx) S4 File. List of genes used in calculating the relative abundance of transcripts from both GFP+ and RFP+ thrombocytes. (XLSX) [116]Click here for additional data file.^ (744.3KB, xlsx) S5 File. List of transcripts from GFP+ thrombocytes used in TMP analysis. (XLSX) [117]Click here for additional data file.^ (252.6KB, xlsx) S6 File. List of transcripts from RFP+ thrombocytes used in TMP analysis. (XLSX) [118]Click here for additional data file.^ (108.3KB, xlsx) S7 File. List of transcripts from GFP+ thrombocytes used in logTMP analysis. (XLSX) [119]Click here for additional data file.^ (289.2KB, xlsx) S8 File. List of transcripts from RFP+ thrombocytes used in logTMP analysis. (XLSX) [120]Click here for additional data file.^ (124.9KB, xlsx) S9 File. z-score matrix values for genes expressed in GFP+ thrombocytes. (XLSX) [121]Click here for additional data file.^ (293.8KB, xlsx) S10 File. z-score matrix values for genes expressed in RFP+ thrombocytes. (XLSX) [122]Click here for additional data file.^ (126.9KB, xlsx) S11 File. List of highly expressed genes in GFP+ thrombocytes. (XLSX) [123]Click here for additional data file.^ (19.5KB, xlsx) S12 File. List of highly expressed genes in RFP+ thrombocytes. (XLSX) [124]Click here for additional data file.^ (12.4KB, xlsx) S13 File. List of expressed genes from both GFP+ and RFP+ thrombocytes used in quantile mapping. (XLSX) [125]Click here for additional data file.^ (175.7KB, xlsx) S14 File. GO enrichment analysis of expressed genes from GFP+ thrombocytes. (XLSX) [126]Click here for additional data file.^ (36.1KB, xlsx) S15 File. GO enrichment analysis of expressed genes from RFP+ thrombocytes. (XLSX) [127]Click here for additional data file.^ (25.6KB, xlsx) S16 File. Top 15 GO enrichment analysis of expressed genes from GFP+ and RFP+ thrombocytes. (XLSX) [128]Click here for additional data file.^ (26.4KB, xlsx) S17 File. KEGG analysis of expressed genes from GFP+ thrombocytes. (XLSX) [129]Click here for additional data file.^ (11.1KB, xlsx) S18 File. KEGG analysis of expressed genes from RFP+ thrombocytes. (XLSX) [130]Click here for additional data file.^ (10KB, xlsx) S19 File. List of highly expressed human megakaryocyte ortholog genes found in total thrombocytes. (XLSX) [131]Click here for additional data file.^ (33.4KB, xlsx) S20 File. List of not highly expressed human megakaryocyte ortholog genes found in total thrombocytes. (XLSX) [132]Click here for additional data file.^ (225.1KB, xlsx) S21 File. List of highly expressed human platelet transcripts found in total thrombocytes. (XLSX) [133]Click here for additional data file.^ (15.2KB, xlsx) S22 File. List of not highly expressed human platelet transcripts found in total thrombocytes. (XLSX) [134]Click here for additional data file.^ (93.5KB, xlsx) Acknowledgments