Abstract Drought stress is one of the major adverse environmental factors reducing plant growth. With the aim to elucidate the underlying molecular basis of rice response to drought stress, comparative transcriptome analysis was conducted between drought susceptible rice cultivar Zhenshan97 and tolerant cultivar IRAT109 at the seedling stage. 436 genes showed differential expression and mainly enriched in the Gene Ontology (GO) terms of stress defence. A large number of variations exist between these two genotypes including 2564 high-quality insertion and deletions (INDELs) and 70,264 single nucleotide polymorphism (SNPs). 1041 orthologous gene pairs show the ratio of nonsynonymous nucleotide substitution rate to synonymous nucleotide substitutions rate (Ka/Ks) larger than 1.5, indicating the rapid adaptation to different environments during domestication. GO and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis of positive selection genes suggested that photosynthesis represents the most significant category. The collocation of positively selected genes with the QTLs of photosynthesis and the different photosynthesis performance of these two cultivars further illuminate the crucial function of photosynthesis in rice adaptation to drought stress. Our results also provide fruitful functional markers and candidate genes for future genetic research and improvement of drought tolerance in rice. __________________________________________________________________ Drought stress is one of the major adverse environmental factors comprising plant growth and productivity. It negatively affects various plant processes including growth, physiology, metabolism and transcription[28]^1,[29]^2. Plants evolved mechanisms to water deficit during the long period of domestication. Paddy rice Zhengshan97 and upland rice IRAT109 are such various resources characterized as drought sensitive and drought tolerant, respectively[30]^3. These two rice cultivars show distinct characters to water stress but are identical in most of the morphological and developmental traits. Multi-scale research experiments have been carried out using paddy and upland rice as bio-materials, for detection of quantitative trait loci (QTLs) related to drought stress[31]^4,[32]^5,[33]^6,[34]^7, exploration of differentially expressed genes (DEGs) using microarray or suppression subtractive hybridization (SSH) technique[35]^8,[36]^9,[37]^10, phenotypic and physiological analysis[38]^4,[39]^11 as well as genome-wide comparative analysis, which revealed genetic mechanisms underlying upland adaptation in rice[40]^12. During the divergence of the upland and lowland rice, the former must have acquired a range of adaptive mechanisms to cope with the harsh features of upland environment especially drought[41]^4,[42]^5,[43]^13. Selection related to domestication has modified a number of traits that now distinguish the modern rice accessions[44]^12,[45]^14,[46]^15,[47]^16. Signature of selection during domestication have prominent role at loci linked to the selection target traits[48]^17. Ka/Ks ratio is a key indicator telling us the way the genes have evolved or been domesticated. The Ka/Ks much greater than 1 is the strong evidence that positive selection has acted to change the protein[49]^18. Distinctive phenotype responding to severe environments can be generated under the positive selection on some key genome loci[50]^19. Comparing transcriptomes of various genotypes of specific species is useful for exploring genes under selection and elucidating the role of various biological pathways and mechanisms for imparting stress tolerance to adverse environments[51]^20,[52]^21. Although fundamental researches have provided significant insights into the physiological and molecular responses of plants to water deficit, but the divergence in transcriptome of rice genotypes with contrasting phenotypes remains unexplored. Techniques such as next-generation sequencing offer a unique opportunity to scan the transcriptomes of different genotypes. In the present study, the comparative transcriptomes of drought-tolerant rice cultivar IRAT109 and sensitive rice cultivar Zhenshan97 were analyzed comprehensively to provide insights into sequence variation, expression divergence, domestication selection of genes between drought-tolerant and sensitive rice genotypes under drought conditions. We highlight the crucial role of photosynthetic system during the domestication of these two cultivars, which experience strong positive selection during the domestication of rice cultivars to upland water deficit environments. Moreover, this study provides functional gene markers to promote future relevant research, and molecular breeding and genetic engineering projects for enhancing drought tolerance in rice. Results Transcripts assembly To get deeper insight into the drought stress responses of upland drought-resistant (IRAT109) and lowland drought-sensitive (Zhenshan 97) rice cultivars, a genome-wide transcriptional analysis was performed by using Illumina Hiseq2500 platform. Approximately 30.6 million 100bp paired-end clean reads pairs for per library were generated after adapter trimming and filtering low-quality reads. On average, about 53.7 and 51.3 million reads were uniquely mapped to the reference genome with tophat default parameters criteria in Zhenshan97 and IRAT109 respectively. Among of which, 83.2% and 85.4% of reads were properly paired mapped respectively for each sample. Transcript prediction from the present samples were initially performed separately, yet generated a transcript set with a high degree of overlap. Merging of the predicted transcripts sets resulted in 103490 transcripts with 34832 novel transcripts which were not annotated in rice gene model ([53]ftp://ussd-ftp.illumina.com/Oryza_sativa_japonica/Ensembl/MSU6/) ([54]Supplemental Table 1). The novel transcripts indicate the alternative splicing events increase the transcripts diversity. Differential expressed genes and Gene Ontology analysis Comparing upland and lowland rice cultivars under drought conditions resulted in a total of 436 genes differentially expressed genes, which were classified into 8 categories according to the log10 (FPKM + 1) value of gene expression by K-means ([55]Supplemental Fig. 1). Among 436 genes, 79 (18%) were new ones without equivalents in the current annotated genes. The rest of 357 genes included 115 up or specially regulated in Zhenshan97 and 242 of that in IRAT109. To further look into the functional categories of genes differentially expressed between upland and lowland rice, Gene Ontology analysis was implemented by searching the upland-up and lowland-up regulated genes against the AgriGO database of Oryza sativa Japonica ([56]Fig. 1). The up and specific regulated genes in Zhenshan97 are enriched in 7 most significant GO terms (P-value < 0.01) including 5 biological process GO terms as include death, cell death, programmed cell death, apoptosis and defense response and 2 significant molecular function GO terms of protein binding and protein dimerization activity. The up and specific regulated genes in IRAT109 are enriched in 10 most significant GO terms (P-value < 0.01) covering death, cell death, programmed cell death, apoptosis, response to stimulus, response to stress and defense response. The function description and expressed value of genes included in the above significant GO terms were listed ([57]Supplemental Table 2). The genes specific or upregulated in Zhenshan97 are mainly enriched in significant GO terms mainly encode disease resistance proteins, transposon proteins, receptor-like protein kinase precursor and so on. The genes specific or up regulated in IRAT109 enriched in significant GO terms mainly include serine/threonine-protein kinase and ATP-binding proteins. Figure 1. GO enrichment of DEGs between the upland and paddy rice. [58]Figure 1 [59]Open in a new tab (A) Significant GO terms of genes up-regulated or specific expressed in Zhenshan97. (B) Significant GO terms of genes up-regulated or specific expressed in IRAT109. The blue bar indicate the percent of query genes in a specific GO terms to the total query genes and the purple bar indicate that for the background. The left and right of the vertical line indicate the biological process and molecular function GO terms, respectively. To validate the RNA sequencing gene expression results, quantitative real-time reverse transcription-PCR (qRT-PCR) was used to assess the expression levels for 10 randomly selected genes in five lowland rice varieties and three upland rice varieties (Plant materials and methods). Because gene express analysis from RNA sequencing were not performed for control condition, only the results analyzed from drought stress were compared between RNA sequencing and qRT-PCR. By comparing the relative expression levels for the 10 genes under drought conditions between these two methods, consistent expression trends were observed for all the selected genes ([60]Fig. 2 and [61]Supplemental Table 2). For genes Os01g07590, Os04g21880, Os06g34960, Os08g07330, Os08g33000, Os11g36950, their expressions were very low even hardly to be detected in lowland rice cultivars whereas show comparative high expression amount in upland rice as revealed by RNA sequenceing. Consistently, the qRT-PCR produced results that these gene expression ratios of upland to lowland rice under drought stress are 5.07, 11.86, 8.42, 17.23, 13.91 and 16.31 respectively. In contrast, genes Os02g13510, Os04g30660, Os06g07370 and Os08g35310 show profoundly decreased expressions under upland rice compared to the lowland rice as detected by RNA sequencing. The qRT-PCR results revealed these four gene expression ratios of lowland to upland rice under drought stress are 144.33, 6.61, 33.06, 14.11 respectively, which indicated a good agreement between both approaches ([62]Fig. 2). It is worth mentioning that except LOC_Os01g07590 and LOC_Os02g13510, the rest of the eight genes show expression FPKM value as 0 in upland or lowland rice detected by RNA sequencing ([63]Supplemental Table 2), whereas revealed a relatively low expression value as indicated by qRT-PCR ([64]Fig. 2). This is probably because a much higher coverage is necessary in RNA sequencing for detecting the expression for those mRNAs with extremely low amount. Figure 2. Validation of 10 DEGs by qRT-PCR. [65]Figure 2 [66]Open in a new tab qRT-PCR values were calculated as means from all relevant varieties for Lowland rice under Control (LC) and Drought stress (LD) as well as for Upland rice under Control (UC) and Drought stress (UD). Error bars indicate the standard deviation. The X-axis indicates the rice materials and growth condition. Y axis indicates the relative expression level. Identification, experimental validation and distribution of SNPs and INDELs along chromosomes The variations calling by Samtool package generates total 166,538 raw variations between samples and the reference. The 110,826 variations were remained after removing those with read depth (DP) smaller than 5, phred-scaled quality (an overall quality score for the assertion made in alternate non-reference alleles) smaller than 30 and mapping quality (Root-mean-square mapping quality of covering reads) smaller than 20. As the variations output directly from the SAMtools mpileup scripts show the variations between the reference genome (cultivar Nipponbare of Oryza sativa ssp. japonica) and each investigated genotype (Zhenshan97 or IRAT109) separately, whereas the aim to the present study is to search the variations between lowland rice Zhenshan97 and upland rice IRAT109. To distinguish the variations between the two investigated genotypes apart from variations between reference and genotypes, an in-house bash script was used following variations calling with SAMtools mpileup. Finally we screened 77,647 variation sites between the two studied genotypes and 72,840 of which are homologous sites in both genotypes including 70,264 SNPs and 2576 INDELs ([67]Table 1). To further look into if there is some hot spots with high-density of functional SNPs and INDELs in the genome, we calculated the SNPs, INDELs and gene density along the chromosomes in 100kb sliding windows ([68]Fig. 3). We observed that the SNPs and INDELs density was generally low around centromeres and telomeres, particularly around the centromeres of chromosomes 4, 5, 9 and the proximal end of long arm of chromosome 9. Consistently, the gene density of these regions are also relatively low. Table 1. The variations information between the investigated Zhenshan97 and IRAT109 and between them and reference genome. variation Counts of variations Raw variations 166,538 Variations meeting the quality filtering criteria 110,826 Variations between the two investigated samples 77,647 Homologous Variation sites in each sample 72,840 SNPs 70,264 INDELs 2,576 INDELs with over 2 bases variation 811 INDELs with one base vaiation 1765 [69]Open in a new tab Figure 3. The distribution of sequence variations (SNPs and INDELs) and genes along the chromosomes. [70]Figure 3 [71]Open in a new tab The left Y-axis shows the number of sequence variations and genes. The right Y-axis indicates the Ka/Ks value of genes. The X-axis shows the position on the chromosomes with the scale as 100kb. The red and blue lines indicate the numbers of sequence variations and genes in 100kb slide windows, respectively. The black point in the graph represents the genes with Ka/Ks value bigger than 2. The detailed information about the ids and function annotation of these genes were summarized in [72]supplementary table 3. The black vertical lines shows the positions of centromere. To validate the reliability of these sequences variations with experiments, 200 2bp-INDELs were selected randomly for experimental validation with the recombinant inbred lines (RILs) generated with IRAT109 and Zhenshan97. 189 of them were validated through Polymerase Chain Reaction (PCR) amplification and PAGE (polyacrylamide gels electrophoresis). The representative INDELs were shown in [73]Fig. 4. Our results suggest that the short INDELs like 2 nucleotides polymorphism can be clearly distinguished by traditional gels analysis if the PCR products were designed in length of 150-200bp and separated on the higher resolution PAGE. Figure 4. The representative INDELs shown through PAGE and silver staining. Figure 4 [74]Open in a new tab These 2bp-INDELs A-E represents INDEL1_146103_CT, INDEL3_7749032_GT, INDEL7_6809650_AG, INDEL8_14321370_AA and INDEL9_14716537_AG, respectively. For the nomenclature of these INDELs, number after the word INDEL indicates the chromosome, the number after the first underline indicates the position on the corresponding chromosome, and the two nucleotides after the second underline is the INDEL. P1 and P2 on the top of each gel track represents the Zhenshan97 and IRAT109, and the bands after P1 and P2 represent individuals from the RIL population which was originated from Zhenshan97 and IRAT109. The Ka/Ks Calculation for orthologs between IRAT109 and Zhenshan97 To identify genes undergoing positive selection during the domestication of the upland and lowland rice cultivars, the ratios of non-synonymous to synonymous mutations for orthologs between the two genotypes were calculated. Using the bidirectional Blast between the transcripts of the two accessions and the annotated proteins sequences in gene model as references, 12626 orthologous pairs were obtained for