Abstract Trichosanthes kirilowii Maxim. (TK) is a dioecious plant in the Cucurbitaceae for which different sexes have separate medicinal uses. In order to study the genes related to sex determination, transcriptome sequencing was performed on flower buds of male and female plants using the high-throughput sequencing technology. A total of 145,975 unigenes and 7110 DEGs were obtained. There were 6776 DEGs annotated to 1234 GO terms and enriched to 18 functional groups, including five biological processes related to sugar metabolism. KEGG pathway analysis indicated genes involved in hormone transduction, hormone synthesis and carbohydrate metabolism. Many DEGs of TK are involved in reproductive organ formation, hormone signal transduction and regulatory networks. Combining the results of GO, KEGG and qRT-PCR, 11 sex determining candidate genes of TK were selected, including MYB80, MYB108, CER1, CBL9, ABCB19, SERK1, HSP81-3, ACS9, SEP3, AUX1 and YUC6. The results provide a foundation for the study of sex differentiation in TK. Introduction Trichosanthes kirilowii Maxim. (namely TK) is a perennial climbing herb in the family Cucurbitaceae. Its fruit (fructus trichosanthis), seeds (semen trichosanthis), peel (trichosanthis pericarpium) and root (radix trichosanthis) are all commonly used as traditional Chinese medicines. Due to the large demand for medicinal products in the marketplace, there are many planting bases for TK in China. Among these, Changqing District of Jinan City, Shandong Province and the surrounding areas of Feicheng City, Shandong Province have a long history of producing excellent varieties and high-quality medicinal materials that are famous as genuine herbal medicines. Dioecious plants play an important role in elucidating the mechanism of plant sex determination and evolution, especially plants in the Cucurbitaceae. The studies of sex identification are of great significance in both theory and practice. TK is dioecious and cross pollinated, and the tissues used for commercial medicinal usages were differ between two sexes. When harvesting seeds and fruits, a large number of female plants (with a small number of male plants) are required, and when harvesting roots, male plants are required. At present, TK can be propagated in two ways: vegetative propagation using rhizomes and sexual propagation using seeds. Although the plant sex can be controlled by rhizome propagation, the propagation coefficient is low, and large amounts of raw materials are consumed. Therefore, seed propagation is an economical and practical method of improving the planting efficiency and realizing large-scale cultivation. However, the problem with seed reproduction is that the proportions of male and female plants cannot be controlled. In the natural state, the ratio of male to female is about 7:3. Therefore, it is of great significance to identify the early sex of TK seedlings and to reveal the molecular mechanism of sex determination. At present, the methods for sex identification of TK include plant appearance, chemical reagents, isoenzymes, protein electrophoresis and molecular markers. Plant sex difference arises from differences in gene expression. Isozymes and proteins are the products of gene expression, and specific gene expression produces specific isozymes or proteins. Therefore, Yu et al. and Li used PAGE to identify the early sex of TK and found certain differences between male and female strains in enzyme concentration and spectral bands [[36]1, [37]2]. Karmakar et al. used total proteins of TK roots to identify plant sex and found a slightly sex differential band with a molecular weight of 19 KDa [[38]3]. Qu et al.’s study of RAPD-SCAR had found that S 1200 primers can generate a 600-bp amplification band specifically in male TK [[39]4]. Guo carried out isozyme electrophoretic analysis of TK leaves and found that the isozyme bands and enzyme contents from leaves of different sexes were different [[40]5]. While it is generally assumed that sex expression is dominated by the formation and accumulation of flowering substances, the above studies have shown that sexual difference exist in TK at the seedling stage. Extended to the Cucurbitaceae family, although many plants in the family are dioecious, only Concinia indica was demonstrated having sex chromosomes with an XX/XY sex determination system [[41]6]. The sex of other Cucurbitaceae plants is controlled by few genes with no sex chromosome evolved. For example, sex differentiation of cucurbit was controlled by three genes, ACS11, ACS7 and WIP1 [[42]7]. Studies of cucumber (Cucumis sativus Linn.) demonstrated that external hormones and environment affected the process of sex determination, and genes related to, ethylene synthesis and induction such as genes CsACS1G, CsACS11, CsACS2, CsACO2 and CsWIP1, were verified involved in sex regulation [[43]8, [44]9]. Although more and more metabolic pathways and genes have been found involved in gender regulation, the regulatory mechanism is not clear [[45]9]. As a Chinese traditional medicinal plant, TK was less known by scientists in other countries besides China, with no information of genome reference and transcriptome data and few EST sequences. Although Chinese researchers have done extensive exploratory work in this field and have obtained some basic results, however, it is difficult to identified sex determining genes via the existing sex linkage markers and to explore the molecular mechanism of sex determination. Therefore, the lack of genomic sequences has become a bottleneck in the study of sex differentiation of TK. The sexual difference of TK mainly observed in the flower organs. In this study, we performed RNAseq of the flower buds from female and male plants of TK to analyzed the transcriptomic profiles in different sexes. The goals were to search for differential expression genes (DEGs), and to screen for the key genes related to sex differentiation in order to lay a foundation for revealing the sex differentiation mechanism of TK at the molecular level. Methods and materials Sample collection TK is planted in the Hebao field planting base of Pinyin, Shandong province (116.45 E, 36.28 N). Samples were collected in July 2016, including flower buds of female and male plants (around 2 mm) [[46]10]. Set up 3 biological replicates for a total of 6 samples(samples of female and male flower buds are named F1, F2, F3, M1, M2 and M3). After sampling, plant buds were wrapped with tin foil and placed into liquid nitrogen, then stored in -80°C refrigerator. RNA isolation and quality assessment Total RNA of each sample was extracted by Tripure, and the concentration (optical density 260 nm/280 nm ratio) and quality (optical density 260 nm/230 nm ratio) were measured using an Aglient 2100 Bioanalyzer, with which RNA integrity (RIN) above 7.5 were used for library construction. cDNA library construction, quality control and Illumina sequencing Approximately 10 μg total RNA of each sample was used to constructed RNA libraries by NEBNext® Ultra™ RNA Library Prep Kit from Illumina® following the recommended protocol. The constructed libraries were sequenced by Beijing Institute of Genomics (BIG) under Illumina HiSeq 2500 sequencing platform with 150 bp pair-ends. The raw sequencing data reported in this paper have been deposited in the Genome Sequence Archive in BIG Data Center (Nucleic Acids Res 2019), Chinese Academy of Sciences, under accession numbers CRA002313 ([47]https://bigd.big.ac.cn/gsa). Raw sequencing processing and de novo assembly FastQc was used to detect the raw RNA reads and remove the joint sequences. The clean reads were obtained after removing connectors and low-quality reads (Q-value<10 or reads containing more than 5% ambiguous ‘N’ bases by trimmomatic (-l 5 -q 0.5 -n 0.1)). The clean reads were used to assembly using the Trinity (v.2.0.6), with a minimum contig length cutoff of 150 and a minimum k-mer size of 3 [[48]11]. Over loop was used to splice the contig and unigene fragments of clean reads to obtain the unigenes. Finally, a BUSCO (Benchmarking Universal Single-Copy Orthologs) analysis was performed using BUSCO v.2.0 with default parameters to evaluate the completeness of total gene annotated in this study [[49]12]. Screening of differentially expressed genes Cuffdiff of Cufflinks software ([50]http://cole-trapnell-lab.github.io/cufflinks/) [[51]13] was used to analyze the differences in gene expression levels in each group to identify the DEGs (differentially expressed genes). Cuffdiff uses non-parametric statistical methods to estimate the mean and variance of FPKM (expected number of fragments per kilobase of transcript sequence per millions of base pairs sequenced) values in different samples based on annotation files and identifies selected transcripts with significant differences in expression between samples through t tests. Verification of DEGs 6 differentially expressed genes were randomly selected and verified by quantitative real time PCR (qRT-PCR), to verify the consistency of expression patterns with RNA sequsing. For each sample, the PrimeScript First Strand cDNA Synthesis Kit with 1μg total mRNA (Takara, Dalian, China) was used to reverse-transcribe the mRNA into the first strand of cDNA, and the quality was measured by 1.5% agarose gel electrophoresis. We designed qPCR primers for specific genes using the Primer 5 software ([52]https://primer-premier-5.software.informer.com/). The total volume of the quantitative PCR reaction system is 20 μL, including 10 μL of 2 × SYBR® Green I Master Mix (Takara, Dalian, China), each with 0.4 μmol/L of forward and reverse primers, 1 μL of cDNA template diluted ten-fold, and the final supplement ddH[2]O to 20 μL. The amplification was carried out with the following cycling programme: 30 s at 94°C, 40 cycles of denaturation at 95°C for 5 s, annealing at 55°C for 15 s, and extension at 72°C for 15 s on a ABI 7500 fast Real-Time PCR machine. A melting curve analysis was completed immediately after the qPCR. 18S rRNA was selected as the reference gene [[53]14]. The relative expression level was calculated with the 2^−ΔΔCt method [[54]15]. Three biological replicates and three technical replicates were performed for each of the analyzed genes. Functional annotation and classification Through BLAST [[55]16] comparison software ([56]https://blast.ncbi.nlm.nih.gov/Blast.cgi), the unigenes sequences were compared with the protein databases NR (NCBI non-redundant protein sequences, [57]http://www.ncbi.nlm.nih.gov/) [[58]17] and KEGG (Kyoto Encyclopedia of Genes and Genomes, [59]http://www.genome.ad.jp/kegg/kegg2.html) [[60]18]. Classification information and gene function annotation were carried out by BLASTx (E-value ≤1.0E-05). In order to reflect the expression of sex difference genes more accurately, the GO (Gene Ontology, [61]http://www.geneontology.org/) [[62]19] function and KEGG pathway significance enrichment analyses were carried out to determine the main biological functions and the main metabolic pathways that the genes were involved in. GO enrichment analysis was performed on DEGs using the SEA tool of agriGo [[63]20] software ([64]http://bioinfo.cau.edu.cn/agriGO/), and the P values were statistically analyzed and corrected (FDR ≤ 0.05) using Fisher's exact test and the Bonferroni correction method. The KEGG pathway enrichment analysis uses KOBAS (E-value ≤1.0E-05)(KEGG Orthology-based Annotation System, [65]http://kobas.cbi.pku.edu.cn/home.do) [[66]21], where the calculation principle is the same as in the GO function enrichment analysis. To control the false positive rate, BH (Benjamini and Hochberg's test) [[67]22] was used for multiple tests with P = 0.05. A KEGG pathway meeting the above conditions was defined as a significantly enriched pathway. Results and analysis Transcriptome sequencing and de novo assembly We obtained 17,619,567 and 16,699,544 high-quality reads from the female and male libraries respectively. The effective detection rate of each library was above 90%. For species without reference genomes, de novo assembly is the most commonly used technique, and thus Trinity software ([68]https://trinitysys.fm.alibaba.com/) was used to assemble the sequencing data. In all, 145,975 unigene fragments were obtained after redundancy removed. The assembly results are shown in [69]Table 1. Among the resulting fragments, 0–400 nt had 71010 fragments; 400−800 nt had 44107 fragments; 800−2000 nt had 23331 fragments; 2000–4000 nt had 6648 fragments and there were 879 fragments ≥ 4000 nt. The completeness assessment result showed that 48.6% of BUSCO genes were “a single-copy”, 47.0% were “complete and duplicated” and 2.8% were “fragmented”, while the remaining 1.6% were “missing”, suggesting a good transcriptome assembly ([70]Table 2). Table 1. Length distribution of assembled unigenes from TK. Length (bp) Unigenes Proportion (%) 0–400 71010 48.6 400–800 44107 30.2 800–2000 23331 16 2000–4000 6648 4.6 >4000 879 0.6 [71]Open in a new tab Table 2. BUSCO assembly evaluation results. Sample Complete and Single-copy Complete and duplicated Fragmented Missing F1 150 141 7 5 M2 154 135 8 6 F3 138 148 12 5 M1 136 154 9 4 F2 153 140 6 4 M3 153 137 8 5 [72]Open in a new tab Identification and analysis of DEGs We divide the genes into three categories according to the expression of FPKM value. FPKM ≥ 10 was set as a highly expressed gene, 2 ≤ FPKM < 10 as a medium expression gene, and FPKM < 2 as a low expression gene. According to the FPKM standard, we calculated the gene expression of flower bud samples as follows: the number of low, medium and high expression genes in female flower buds were 2491, 1924; and 1165 respectively; whilst the number of low, medium and high expression genes in male flower buds were 1588, 2463, and 1530. There were fewer low expression genes in male flower buds than in female flower buds, but more genes were expressed in male flower buds than in female plants. There were few differences in the medium expression genes of male and female flower buds, but there was a large difference in the number of high expression genes. Then, we used Cuffdiff to calculate the significance of differential gene expression, and set fold change > 2 or < 0.5, P < 0.01 as the criteria for identifying DEGs. We compared the gene expression in female relative to male samples, and defined the up-regulated genes in female flower buds as up-regulated genes. The number of DEGs in flower buds was 5580; the number of up-regulated genes was 3104, and the number of down-regulated genes was 2476. qRT-PCR validation of DEGs In order to confirm the results of Illumina sequencing, we randomly select 6 candidate genes and verified the expression of differentially expressed genes in male and female by using real-time PCR. The 6 candidate genes include 3 genes with high expression in the female libraries and 3 genes with high expression in the male libraries. The primer sequences of references genes and 6 selected genes are listed in