Abstract Single-cell RNA-seq has revealed extensive cellular heterogeneity within many organisms, but few methods have been developed for microbial clonal populations. The yeast genome displays unusually dense transcript spacing, with interleaved and overlapping transcription from both strands, resulting in a minuscule but complex pool of RNA protected by a resilient cell wall. Here, we have developed a sensitive, scalable, and inexpensive yeast single-cell RNA-seq (yscRNA-seq) method that digitally counts transcript start sites in a strand- and isoform-specific manner. YscRNA-seq detects the expression of low-abundant, non-coding RNAs, and at least half of the protein-coding genome in each cell. Within clonal cells, we observed a negative correlation for the expression of sense/antisense pairs, while paralogs and divergent transcripts co-express. Combining yscRNA-seq with index sorting, we uncovered a linear relationship between cell size and RNA content. Although we detected an average of ~3.5 molecules/gene, the number of expressed isoforms are restricted at the single-cell level. Remarkably, the expression of metabolic genes is highly variable, while their stochastic expression primes cells for increased fitness towards the corresponding environmental challenge. These findings suggest that functional transcript diversity acts as a mechanism for providing a selective advantage to individual cells within otherwise transcriptionally heterogeneous populations. __________________________________________________________________ Eukaryotic transcription is pervasive, and results from the stochastic process of gene expression leading to cell-to-cell heterogeneity. Advances in sequencing technologies and library preparation have made single cell RNA-seq (scRNA-seq) accessible to several tissues and cell lines, but few methods have been successful for unicellular microorganisms[39]^1. The development of yeast-specific scRNA-seq has been hampered by the intrinsic nature of yeast. First, its small cell size (2-5 μm) results in a minute amount of RNA per cell, which is estimated to be at least 10 times lower than mammalian cells where scRNA-seq has been pioneered[40]^2 (1 pg versus 10 pg, respectively[41]^3). Second, the cell wall poses a barrier for single-cell RNA isolation, and therefore standard RNA extraction procedures are incompatible with efficient scRNA-seq library preparation. Third, the yeast transcriptional landscape is composed of bidirectional and overlapping transcripts embedded in a dense genome—this presents a challenge for RNA-seq, and requires stranded-libraries to capture complex genome architectures. As such, scRNA-seq has only recently been able to be applied to microorganisms like yeast, using low-throughput and labor-intensive methods in combination with non-stranded library preparation[42]^4^,[43]^5. S. cerevisiae is the only organism for which transcript isoforms have been mapped in bulk at both 5’- and 3’-ends by transcript isoform profiling (TIF-seq)[44]^6, however cell-to-cell heterogeneity has mainly been studied on a case-by-case basis. Yeast populations show extensive isoform heterogeneity that contribute to phenotypic diversity[45]^6^,[46]^7. Whether this variability results from the co-expression of several isoforms or from a cell-specific selection has not been investigated; this would require resolving individual cells from a homogenous population. The strongest limitation to yeast-specific scRNA-seq is the lack of strand-specific transcript isoform methods that are sensitive enough to globally assess the transcriptome of single yeast cells. This absence underscores the need for the development of novel technologies. Here, we set out to develop a high-throughput single-cell RNA-seq method for yeast that integrates indexed cell sorting for prior phenotyping, is inexpensive (approximately US$12 per cell), and is strand-specific. By applying yscRNA-seq, we quantitatively characterized the extent to which isogenic single-cells deviate in gene expression, and measured the stochastic expression of highly variable genes that can result in fitness variation within microbial populations. Results To measure absolute gene expression and transcription start site (TSS) usage in individual yeast cells, we performed unbiased index sorting of single cells from exponentially-growing yeast cultures in rich media (YPD) using 96 well plates containing absolute ethanol for fixation and RNA preservation. For each well, we measured the forward scatter (FSC) as a proxy for cell size by fluorescence-activated flow cytometry (FACS; [47]Supplementary Figure 1a). Following cell sorting, we applied yscRNA-seq to 285 individual yeast cells (2 plates for BY4741 and 1 plate for YJM789; see [48]Methods). After ethanol evaporation, cells were lysed in buffer containing zymolyase, and 5000 molecules of external RNA control consortium (ERCC) transcripts. A 5’-biotinylated template-switching oligo (TSO) containing P5 and a unique molecular identifier (UMI) was used to generate the first strand. Full-length dscDNA libraries were amplified with limited numbers of PCR cycles, and size distribution was validated by Bioanalyzer profiling ([49]Supplementary Figure 1b). Cell-specific adapters were introduced by tagmentation using homemade Tn5[50]^8, preloaded with adapters (see [51]Methods). This greatly reduced the cost-per-cell to US$12 ([52]Supplementary Table 1). Tagmented libraries (96 samples) were pooled, and strand-specific libraries were eluted by removing the biotinylated strand with streptavidin beads ([53]Figure 1a). Size distribution was assessed before sequencing ([54]Supplementary Figure 1c). Figure 1. Absolute transcriptome quantification of single yeast cells by using yscRNA-seq. [55]Figure 1 [56]Open in a new tab (a) Schematic representation of the yscRNA-seq workflow and representative example. Full-length cDNA libraries were generated from oligo dT and Unique Molecule Identifiers (UMI)-containing template-switching oligonucleotides (TSO). Following second-strand synthesis, double-stranded cDNA (dscDNA) libraries were tagmented with Tn5-loaded cell-specific barcodes. Single-strand cDNA (sscDNA) was obtained by briefly denaturing dscDNA bound to streptavidin beads, followed by high-throughput sequencing with custom primers. The right panel represents a histogram of reads at the SCC4 locus for individual cells (rows). ORFs are indicated as blue boxes. (b) ERCC reads were mapped to the provided sequence from the manufacturer with a modification on the 5’-end to include the upstream restriction site used for ERCC expression (see [57]Methods). The plot shows the correlation of unique ERCC molecules (log[10]) detected by yscRNA-seq and the expected spike-in concentration provided by the manufacturer. (c) Comparison of yscRNA-seq sensitivity to other scRNA-seq methods. The violin plots show the probability density function across several scRNA-seq methods. For each approach, the distribution of the 50% detection probability[58]^9,[59]^10 of ERCC molecules is plotted for the indicated number of cells (n, parentheses). Highlighted rectangles show the corresponding median value for each distribution. (d) Distribution of the total number of detected transcripts per cell by yscRNA-seq or Gasch et al., 2017[60]^4. The mean of the distribution for each method is shown. (e) Spearman correlation of detected number of molecules (log[2]) by competitive PCR[61]^3 (x-axis) versus mean UMI/gene from yscRNA-seq (y-axis) (n=3211). Correlation value is shown in the bottom right corner. (f) Comparative analysis of yscRNA-seq library-reads containing YJM789 SNP (x-axis) or BY4741 (y-axis) from unindexed random sorted cells of a 1:1 mixed sample containing YJM789 and BY4741 for a 96 well plate. A first-sequencing read (read1) was used to obtain the UMI and the gene identity, while a shorter read (index1) was used to retrieve cellular barcodes. After preprocessing and aligning the reads (see [62]Methods), we applied a filter to consider only UMIs with at least 3 reads for analysis ([63]Supplementary Figure 1d). Additionally, we restricted the analysis to cells with more than 500K uniquely-mapped reads and over 1000 expressed genes (≥ 1 UMI-gene), which removed low-quality cells with a higher ratio of mitochondrial RNA ([64]Supplementary Figure 1e,f). Overall, this resulted in 127 cells for BY4741 out of 190, and 48 out of 95 for YJM789. To measure the quantitative efficiency of yscRNA-seq, we compared the number of detected molecules against expected unique ERCC molecules, and observed a linear correlation across the entire dynamic range ([65]Figure 1b) with a 15.5-25.8% efficiency (an average of 1290 ERCC molecules out of 5000 spiked-in, 775 molecules after filtering, respectively). To gauge the 5’-end detection accuracy of yscRNA-seq, we aligned reads to the reference TSS, determined by TIF-seq[66]^6 and from ERCC annotation. Our method faithfully identified TSS boundaries ([67]Supplementary Figure 1g). Next, we benchmarked the sensitivity of yscRNA-seq relative to other scRNA-seq methods with a previously established metric that calculates the number of ERCC molecules required to reach 50% detection probability[68]^9^,[69]^10. Indeed, we find that yscRNA-seq is among the most sensitive methods, with a median 50% detection probability requiring only 4.7 molecules ([70]Figure 1c). Similarly, we compared yscRNA-seq to the only available yeast scRNA-seq dataset from Gasch et al.[71]^4, which used Fluidigm’s C1 System. Overall, yscRNA-seq yielded a higher number of genes per cell (3399 versus 2392) and a good genome-wide correlation despite scRNA-seq from Gasch et al. not being strand-specific (Spearman Correlation 0.71; [72]Figure 1d, [73]Supplementary Figure 1h). Given the quantitative nature of yscRNA-seq, we assessed the concordance between RNA abundance (estimated in bulk by competitive PCR from Miura et al.[74]^3) and our BY4741 yscRNA-seq dataset. We observed a linear correlation genome-wide (Spearman Correlation 0.76; [75]Figure 1e), however when we performed the same analysis after correcting for ERCC efficiency (15.5%), the predicted number of molecules per gene across the entire transcriptome increased[76]^3 ([77]Supplementary Figure 1i). In both cases, the increase in RNA abundance can be explained by the fact that our method provides an unbiased transcriptome quantification of both coding and noncoding transcripts (ncRNAs), the latter of which had not previously been considered. Our data also suggest that ERCC correction alone may overestimate expression, possibly due to differences in reverse transcription efficiencies[78]^9^,[79]^11. To confirm that yscRNA-seq transcriptomes resemble those obtained in bulk, we compared the sum of all BY4741 libraries to our previously-published tiling array data from the same background and condition[80]^12. Interestingly, the comparison with tiling array data resulted in the highest correlation to other methods (Spearman correlation 0.83) due to the strand-specific nature of both approaches ([81]Supplementary Figure 1j). Overall, yscRNA-seq quantitatively recapitulated gene expression from two independent studies and performed within the most-efficient scRNA-seq protocols—the highest for yeast. One explanation for the higher sensitivity is that libraries are directly generated from live-sorted cells through sequential reactions, which reduces sample loss and handling to a minimum. To validate that a single yeast cell was present in each well, and that there was no well-to-well contamination, we applied yscRNA-seq to a randomly-sorted plate from a mixture of two yeast strains (BY4741 and YJM789) at a 1:1 ratio. YJM789 is a less-commonly used haploid S. cerevisiae strain isolated from an HIV patient[82]^13. At the genomic level, YJM789 differs from the reference strain in approximately 60,000 single nucleotide polymorphisms (SNPs) and 6,000 indels[83]^13. We plotted the allele frequency of reads mapping to each genotype, and considered cell doublets if more than 1% of the reads mapped to more than one genotype. Reassuringly, the frequency of doublets (2 or more cells/reaction) was less than 1%, thus confirming that yscRNA-seq libraries originated from individual cells ([84]Figure 1f, [85]Supplementary Figure 1k). For a single wild-type S. cerevisiae cell, the majority of expressed genes belonged to open reading frames (3072 ORFs), which encompassed 90.5% of the detected transcripts. Meanwhile, ncRNAs—cryptic unstable transcripts (CUTs), stable unannotated transcripts (SUTs) and others—only represented 9.5% of the detected RNA pool ([86]Figure 2a, [87]Supplementary Table 2). The ratio of ORFs/ncRNAs resembled those reported from bulk studies[88]^14^–[89]^16, but the distribution of detected ORFs was narrower than for ncRNAs. This may be due to lower abundance and stability of ncRNAs relative to mRNAs. Interestingly, we obtained similar means of expression for YJM789, indicating that the number of expressed genes per cell is similar across strains ([90]Supplementary Table 2). Moreover, when we analyzed the expression matrix for all genes in each individual cell, only 252 of 7272 genes (or 8 ORFs) were not detected in any of the 127 cells ([91]Supplementary Table 3). Overall, we observed a good genome-wide expression correlation within individual cells or across all samples (Pearson correlation 0.66 to 0.83; [92]Supplementary Figure 2a,b). Altogether, these data suggest that at least half of the yeast genome is expressed in a complex pattern within each individual cell. Figure 2. yscRNA-seq as a tool to quantitatively profile transcriptional architectures and TSS variation. [93]Figure 2 [94]Open in a new tab (a) Average distribution of all RNA species counted from yscRNA-seq libraries prepared from 127 BY4741 cells exponentially grown in YPD. Gene feature annotation was done based on Xu et al., 2009[95]^12. Density represents the distribution of genes expressed (≥1 UMI). The mean of the distribution for each gene feature is shown. (b) Distribution of genome-wide Pearson correlation values between transcripts in divergent orientation (red line, n=2555, one-tailed Wilcoxon test p-value 2.63e-09), annotated ORFs with antisense transcripts originating from their 3’-UTR (green line, n=202, one-tailed Wilcoxon p-value 1.45 e-19), paralogs (purple line, n=370) or a subset of random genes (grey dashed line, n=1846, one-tailed Wilcoxon test p-value 3.6 e-09). Classification of bidirectional and sense-antisense transcripts was obtained from Xu et al., 2009[96]^12 and paralogs from Kellis et al., 2004[97]^17. (c) Spearman correlation of the number of TSS isoforms per gene identified by yscRNA-seq and TIF-seq[98]^6. Detected yscRNA TSSs (x-axis) were obtained by pooling all yscRNA-seq libraries for all BY4741 cells (n=127) and comparing them to TIF-seq[99]^6. Correlation value is shown in the bottom right corner. (d) Ranked mean number of TSSs. Each dot represents the mean number of TSSs detected per gene per cell for BY4741 (127 cells) for genes detected in at least one cell (252/7272 genes were not detected). Mean of all TSSs is indicated with a dotted line. (e) Representative example of TSS usage in single cells. The binary heatmap represents the position (x-axis) and the usage (yellow) of the identified TSSs over YLL014W across all BY4741 yscRNA-seq libraries obtained from two biological replicates. A TSS was considered as ‘used’ if at least 2 UMIs/position were detected. Each row represents an individual cell, and each column represents the position to the TSS identified by TIF-seq[100]^6. (f) Cumulative sum of the number of TSSs used across BY4741 (y-axis) in comparison to the total number of 5’-TSS observed by TIF-seq[101]^6 (y-axis, approximately 148,000 5’-isoforms). To assess the complexity of yscRNA-seq libraries, we downsampled reads to fixed depths from one representative sample (1,027,599 reads after QC and filtering). We found that 250,000 reads recapitulated 90% of the total number of genes detected indicating that even at low sequencing depth yscRNA-seq captures a large fraction of the transcriptome ([102]Supplementary Figure 2c), which can further lower the cost/cell. To assess genome-wide transcriptional architectures by exploiting the sensitivity and strand-specific nature of yscRNA-seq, we analyzed the correlation between paralogous genes (those that arose from whole-genome duplication[103]^17), sense-antisense pairs (SAPs), and divergent transcripts[104]^12 ([105]Figure 2b). The correlation was only computed for gene pairs in which at least one gene in the pair was detected in ≥3 cells. Expression of SAPs showed a strong negative correlation compared to random gene pairs (one-tailed Wilcoxon test, p-value 1.45e-19). This is consistent with previous data showing anticorrelated expression of SAPs from bulk samples[106]^18. To explore the potential of yscRNA-seq, we investigated GAL1, which is one of the most studied SAPs. In bulk samples under glucose growth conditions, the GAL1 sense-transcript is repressed and a ncRNA (CUT445), which originates from a region of the 3’-UTR that is constantly expressed in glucose and galactose that overlaps the GAL1 TSS[107]^19^,[108]^20 For the majority of our single cells in YPD, we could neither detect the sense nor the antisense transcript—remarkably, for those where we could, the expression pattern was mutually exclusive ([109]Supplementary Figure 2d). In contrast, when we investigated the expression of gene pairs that originate from bidirectional promoters and paralogs, we observed a positive correlation for both groups (one-tailed Wilcoxon test, p-value 2.63e-09 and 3.6e-09, respectively [110]Figure 2b). These results indicate that transcription events reported in bulk reproduce within single cells, while the bimodal expression for the GAL1-CUT445 SAP is unresolvable in bulk assays. One of the main layers of transcriptional complexity arises from transcript isoform diversity, where an average of 26 isoforms per gene have been reported in yeast by TIF-seq[111]^6. To characterize the TSS variability in single cells, we sought to identify TSSs applying a similar criteria used for TIF-seq[112]^6, in which the highest-expressed position defined the major isoform. Secondary isoforms were iteratively assigned if reads occurred outside a 15 nt window centered from the previous isoform (see [113]Methods). To make the data comparable across both studies, we assessed TSS diversity by considering only TSS positions with ≥2 UMIs in yscRNA-seq, and ≥2 reads in TIF-seq[114]^6, for the same background and condition. From the sum of isoforms per gene across all cells, we detected a linear correlation between TSS number/gene (Spearman correlation 0.63; [115]Figure 2c). On average, each gene expressed 3.46 unique molecules ([116]Supplementary Figure 2e), however the mean number of TSSs per gene across all cells was 1.19 ([117]Figure 2d). Indeed, when we clustered cells based on the TSS variability of a representative gene (YLL014W with a mean of 3.09 UMIs and 1.25 TSSs of a total of 8 used-TSS-positions in our dataset), the majority of cells expressed only one isoform (coinciding with the bulk TSSs detected by TIF-seq[118]^6 at position-0), 32 cells co-expressed a second isoform from a different TSS, and in some rare cases three isoforms were detected ([119]Figure 2e). As a control, the sum of the most-expressed isoforms of all BY4741 cells revealed a linear correlation to the expression of the second most-expressed isoform (Spearman correlation 0.93). Furthermore, while the major isoform dominated the expression, the second isoform was still abundant within the population ([120]Supplementary Figure 2f). Overall, our results suggest that while a main isoform is preferred in an individual cell, TSS diversity is achieved through the expression of select isoforms between cells. To understand how TSS variability in single cells leads to the observed population isoform diversity, we assessed each single cell’s contribution by computing the cumulative sum of unique TSS isoforms/cell, comparing the result to the total number of TIF-seq[121]^6 5’-TSS isoforms (a total of 148,000 5’-isoforms with ≥2 reads, [122]Figure 2f). We observed a logarithmic increase of isoforms, that starts to plateau at ~100 cells without reaching saturation. These data support the notion that the abundant TSS diversity seen in populations is composed of the heterogeneous expression of limited gene isoforms per cell. As such, a large number of single cells would be needed to reconstruct isoform diversity. To investigate the capacities of yscRNA-seq to resolve clonal yeast populations, we applied t-distributed Stochastic Neighbor Embedding (tSNE), using the entire transcriptomes for all datasets. Indeed, tSNE analysis revealed two distinct strain-specific clusters (BY4741 and YJM789, [123]Figure 3a), validating yscRNA-seq as a sensitive method for distinguishing different yeast strains. Figure 3. yscRNA-seq reveals a high-resolution map of the transcriptional heterogeneity within clonal yeast populations. [124]Figure 3 [125]Open in a new tab (a) tSNE plot using whole transcriptome data obtained by yscRNA-seq for the indicated strains (n=127 BY4741 and n=48 YJM789). (b) Correlation of cell size and transcriptome abundance. Correlation of number of RNA molecules/cell by yscRNA-seq (x-axis) with cell size (y-axis). Cell size units represent the FSC-value determined during index sorting for the indicated yeast backgrounds. (c) Detection of variable genes. Squared coefficient of variation (CV^2) of normalized UMI counts versus mean normalized expression (UMI) for all genes are shown in black across BY4741 (n=127). Variable genes are highlighted in red, and ERCCs are shown in blue. Inference of variable genes was done by taking into account technical noise from ERCC spike-ins as previously reported[126]^24. Genes that deviate from technical noise for whose biological coefficient of variation is more than our chosen minimum (CV^2>0.25) were considered as variable. Variable genes are highlighted in red, ERCCs are shown in blue and solid green line represents the technical noise fit. (d) KEGG-pathway enrichment of highly variable genes identified in (c). Categories are ranked based on effect size (total number of enriched genes over total number of annotated genes). (e) Clustering of 127 BY4741 cells based on genome-wide expression matrix. Each dot corresponds to a cell, with dot size representing cell size (FSC-value). Clusters were generated with Seurat using 0.6 as the value for resolution parameter. (f) Cell cycle stage of clusters generated in (e). Each dot represents the expression of cluster-specific genes ordered on the x-axis, based on their peak expression through cell cycle phases, determined by Spellman et al[127]^27. Co-regulation of cell size with transcriptome size has been extensively investigated in yeast, and has traditionally been approached through the use of mutants that uncouple cell division and growth, or by arresting the cell cycle but not cell growth[128]^21^–[129]^23. The quantitative nature of yscRNA-seq, together with estimated cell size (by FSC index sorting), provides a tool to link transcriptome to phenotype. Moreover, the presence of ERCC spike-ins allows libraries to be normalized against technical noise rather than total number of reads[130]^24^,[131]^25. We found that the absolute number of RNA molecules increased linearly with cell size for both genetic backgrounds ([132]Figure 3b). Comparing unique ERCC molecules against cell size (or against total RNA molecules) confirmed that the observed changes in transcriptome size exceeded technical noise ([133]Supplementary Figure 3a,b). Reverse transcription and template switching are major sources of technical noise[134]^26, while intrinsic biological noise can originate from the stochastic nature of transcription and by the regulation of mRNA stability. To identify variable genes, we took advantage of the presence of ERCC spike-ins[135]^24, and classified genes as variable if their squared coefficient-of-variation (CV^2) was higher than the technical and expected biological variance using a 10% false discovery rate (CV^2>0.25)[136]^24. For exponentially-growing wild type BY4741, we identified 400 variable RNAs (FDR <0.01; [137]Figure 3c). Of these, 100 were ncRNAs (38 CUTs and 62 SUTs), while the remaining 300 variable genes were protein-coding. Remarkably, 50% of the detected ncRNAs were highly variable at the single-cell level, probably due to their intrinsic properties. Because UMI incorporation in yscRNA-seq occurs by template-switching at the 5’-end, we ruled out gene-length-dependent effects of the observed CV^2 ([138]Supplementary Figure 3c). Therefore, transcriptome composition was heterogeneous, despite most cells expressing a relatively similar number of genes. To characterize the nature of variable ORFs in BY4741, we performed KEGG pathway-enrichment analysis. As expected, variable genes were enriched in cell-cycle-periodic transcripts[139]^16^,[140]^27. Surprisingly, the largest groups of variable genes belonged to metabolic pathways such as galactose and glycolysis/gluconeogenesis, among others ([141]Figure 3d). For YJM789, we detected 567 variable genes (FDR<0.01), of which 124 overlapped between YJM789 and BY4741 and were significantly enriched for metabolic processes ([142]Supplementary Figure 3d). These results suggest that variable genes are partially strain-dependent, while several genes might be intrinsically variable. To understand variability within the BY4741 population, we applied tSNE analysis. We identified two distinct clusters enriched for previously annotated cell-cycle-regulated genes[143]^16^,[144]^27 ([145]Figure 3e, [146]Supplementary Figure 3e). We then phase-ordered the expression of cluster-specific genes based on previously annotated expression peaks[147]^16^,[148]^27. Cluster 1 contained cells that expressed early (M/G1) and late (G2/M) cell cycle genes, while Cluster 2 was enriched for cells expressing G1, S and late S/G2 genes ([149]Figure 3f). To rule out the contribution of batch effects, we projected both replicates of BY4741 onto the tSNE and confirmed that batch effects are negligible ([150]Supplementary Figure 3f). Our data suggests that, while phase-specific expression exists, the yeast cell cycle segregates cells into two subpopulations with distinct transcription signatures: those corresponding to cell cycle entry/exit, and those corresponding to progression through G1/S/G2. Since the majority of variable genes were related to metabolic processes such as sugar and nucleotides metabolism, we assessed whether this variability was cell cycle dependent. We overlaid the expression of variable galactose genes (GAL3, EMI2, GLK1, HXK1) onto the tSNE ([151]Figure 4a), as well as representative examples of the remaining KEGG-enriched pathways ([152]Supplementary Figure 4a). We did not observe cell cycle segregation, suggesting that the transcriptional heterogeneity of these genes is cell-cycle-independent. Galactose-induced genes are tightly repressed in the presence of glucose (YPD), and have been used as a model to understand transcriptional noise and regulation[153]^19^,[154]^28^,[155]^29. Interestingly for BY4741, we detected the expression of at least one variable gene belonging to the galactose pathway in virtually all cells ([156]Figure 4b). Despite over 80% of cells exponentially growing in glucose displaying no expression of GAL3, 20% of cells had more than 4 unique GAL3 mRNAs and 1,5% more than 10 mRNAs — an expression level above the expected noise. Figure 4. Functional consequences of stochastic gene expression. [157]Figure 4 [158]Open in a new tab (a) Overlay over the tSNE plot from [159]Figure 3e of four representative variable genes from the galactose metabolism KEGG category. Each dot represents cells in the BY4741 dataset (127 cells); the color scale represents the normalized UMI expression, log[2](UMI+1). (b) Expression of all variable galactose-related genes. Each stacked bar represents the normalized UMI expression for the indicated genes for each BY4741 cell. Bars are ordered based on their sorted position in the plate. (c) Heatmap representing the pair-wise Pearson correlation of variable galactose-related genes using the normalized number of unique molecules per cell (log[2]) across the BY4741 dataset (n=127). Warmer colors indicate a stronger correlation and colder colors represent a weaker correlation. (d) ySMN235 (BY4741 Gal3-GFP) was exponentially grown in YPD. Two groups of 10,000 cells were sorted based on their GFP intensity in YPD. Groups containing Top-2% GFP (Gal3-GFP Top-2%) or a random sort (Gal3-GFP Random) were sorted by FACS into YP media with no sugar, and quickly inoculated into YPD or YPGal. Growth curves over 48 hours were performed with a Synergy H1 plate reader. Data were obtained by measuring optical density (OD660) every 1.5 hours. To understand whether the expression of galactose-variable genes was coordinated, we analyzed the pair-wise expression of all galactose metabolism genes. We did not observe any strong correlations among gene pairs, suggesting their uncoordinated expression-regulation in single cells grown in YPD ([160]Figure 4c). We hypothesized that heterogeneity of variable galactose-related genes during growth in glucose could provide a selective advantage if cells were to face a rapid change in carbon source. To assess this possibility, we tagged the endogenous GAL3 gene with GFP at the C-terminus, sorting two populations of 10,000 cells from YPD based on GFP intensity (top 2% and a random sort control). Then, we then followed their growth in YPD or YPGal and measured Gal3-GFP fluorescence under both conditions ([161]Supplementary Figure 4b). While all populations had a similar growth rate in YPD, the top 2% of cells expressing Gal3 in YPD, displayed an increased fitness in galactose ([162]Figure 4d). These results indicate that stochastic expression variation can have a functional consequence, and can be used as a bet-hedging strategy. Discussion Here we report the development of yscRNA-seq, a method for quantitatively profiling gene expression and TSS variation in a strand-specific, cost-effective manner. YscRNA-seq enables cell profiling with minute amounts of RNA, and represents the most sensitive, inexpensive yeast scRNA-seq method to date. In addition, FACS-based single-cell isolation allows correlating cell parameters, such as cell size or reporter abundance, to the transcriptome. YscRNA-seq provides a high-throughput approach for linking phenotype to transcriptome by means of a single experiment. Pervasive transcription in yeast contributes to complex transcriptional architectures that require strand-specific methods to be resolved at the single cell level[163]^6^,[164]^12^,[165]^14. By applying yscRNA-seq, we identified genome-wide co-expression of paralogous and divergent gene pairs within single cells. In contrast, SAPs showed a stronger anticorrelation of expression in single cells than previously reported in bulk samples[166]^18^,[167]^30, establishing yscRNA-seq as a tool to resolve transcriptional architectures masked in bulk experiments. By including UMIs, we were able to digitally count gene expression and TSS usage. We detected about half of the genome as simultaneously and heterogeneously expressed in single cells. Additionally, the cumulative expression of approximately 100 cells virtually covered the entire transcriptome (>99% of ORFs). On average, each gene expressed 3.46 transcript molecules per cell but primarily from one TSS isoform. This suggests that a limited number of TSS isoforms per cell composes the massive isoform heterogeneity observed in bulk samples. These findings challenge the assumption of one mRNA molecule per gene per cell, and suggest rather the use of one isoform per gene per cell. By recording cell size during sorting, we observed a positive correlation between total RNA molecules and cell size. Our single cell analysis revealed a high degree of heterogeneity in gene expression between cells that affects at least 5.4-7.7% of genes for both strains (11.7-16% of expressed genes/cell). As expected, these genes included cell cycle-regulated transcripts, clustered in only two sets with distinct signatures (mitotic entry/exit and G1/S/G2 progression) rather than phase-specific clusters. This suggests that cell cycle signatures are not as pronounced if they are measured directly from unsynchronized single cells. Surprisingly, we found an even larger number of variable genes related to metabolic processes. One such process is galactose metabolism, whose early transcriptional activation in the presence of both glucose and galactose has been previously reported, and proposed to shorten the adaptation to diauxic shift[168]^31^,[169]^32. Indeed, we tested the functional consequences of stochastic expression for one of the highly variable genes (i.e. Gal3-GFP), and found that this variability provides a fitness advantage upon change in carbon source. This observation suggests that, while the process of transcription is inherently noisy, stochastic gene expression has functional implications for cellular fitness at the single cell level. Methods Strains Wild type S. cerevisiae BY4741 (S288C) and YJM789 strains were used for yscRNA-seq libraries. BY4741 (MATa his3-D1 leu2-D0 met15-D0 ura3-D0) and its derivative ySMN235 (GAL3-GFP::hphNT1) were obtained by genomic integration using a PCR-based strategy[170]^33. Cell growth and isolation BY4741 or YJM789 were pre-inoculated in rich media and grown overnight until OD[660]=1. The next morning, cells were diluted to OD[660]= 0.05 and grown for at least two full divisions. Prior to sorting, cells were diluted to OD[660]= 0.01 and sonicated for 3 pulses of 0.5 seconds to remove aggregates (Sonicator Branson M1800). Then, cells were filtered by passing samples through cell strainer snap-cap tubes (BD Falcon). Live single cells were index-sorted (BD InFlux) into 96 well plates containing 5μl of absolute ethanol. Gates on the pulse-width were stringent, in order to remove potential doublets. Forward and Side scatter (FSC and SSC respectively) values were recorded for each well, FSC was used as an approximation of cell size. For each plate, well-H12 was not sorted, and served as negative control. For experiments with mixed strains of BY4741 and YJM789 ([171]Figure 1f, [172]Supplementary Figure 1k), cells were grown independently as described above, except that strains were mixed at a 1:1 ratio prior to random sort (unindexed) into a 96 well plate. yscRNA-seq library preparation Sorted 96 well plates were left in a sterile hood to allow complete ethanol evaporation. Then, the cell wall was digested with a lysis buffer containing zymolyase (1 U/μl) and 5000 molecules of External RNA Control Consortium (ERCC). Cell lysis was performed for 10 minutes at 37 °C followed by 3 minutes at 72 °C in 5 μl. Plates were immediately placed on ice and 5 μl of reverse transcription reaction mix (RT-buffer and Invitrogen Super Script II) was added to synthesize first strand cDNA from a 5’-Biotinylated oligo dT primer (Bio-AATGATACGGCGACCACCGATCGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT). The oligo dT primer contains a PvuI site between the P5 sequence and the dT stretch that is used to remove 3’-ends. UMI incorporation was done in the same reaction by using a template-switching oligonucleotide consisting of RNA bases (Bio-AAUGAUACGGCGACCACCGAUNNNNNNGGG). The TSO primer sequence (Bio-GAATGATACGGCGACCACCGAT) contains P5 followed by six random nucleotides (UMI) and 3 GGGs used for template-switching. Full-length library amplification was performed with the UMI-PCR primer, which anneals to a common region (P5 sequence) present in both the oligo dT and TSO region to minimize the formation of primer-dimers. This reaction was performed for a limited number of cycles (21 cycles) with Advantage 2 Polymerase mix (Clontech). The resulting full-length dscDNA libraries were diluted (1:20) and subjected to qPCR to validate the presence or absence of cDNA library, using primers for a highly expressed housekeeping gene (TDH3.Fw: TCGTCAAGTTGGTCTCCTGG and TDH3.Rev: GGCAACGTGTTCAACCAAGT). Full-length dscDNA libraries were purified using Agencourt Ampure beads XP (Beckman Coulter), following the manufacturer’s instructions (0.6:1 ratio), and eluted in 15 μl of Elution Buffer (Qiagen). DscDNA library size was validated by Bioanalyzer ([173]Supplementary Figure 1b). To incorporate a cellular barcode and the Illumina P7 adapter, dscDNA libraries were tagmented with our homemade Tn5[174]^8 and preloaded with cell-specific barcodes as previously described[175]^34. Briefly, a plate containing 96 adapters (STRT-TN5-1 to 96) was mixed with TN5-U (PHO-CTGTCTCTTATACACATCTGACGC) 50 μM each. Primers were annealed by denaturing at 95 °C for 3 minutes and cooled down to room temperature. Annealed primers were loaded into Tn5 to make a 10X transposome stock in 80% Glycerol, and incubated for 1 hour at 37°C (50 μM transposase, 50 μM of 96 different annealed adapters). Loaded Tn5 plates can be stored for up to a month at -20°C. Tagmentation was performed by mixing the eluted dscDNA libraries with 2 μl of transposase stock (3-6 ng dscDNA, transposase 1X, 2X TAPS, 10% DMF in 20 μl) for 5 minutes at 55 °C. The tragmentation reaction was inactivated by incubation at 85 °C for 5 minutes. Tagmented libraries were captured using streptavidin beads—a 1:20 dilution of Dynabeads MyOne Streptavidin C1 beads (Invitrogen) was used, pre-washed twice with 2xBTW buffer (10 mM Trizma HCl pH 7.5, 1 mM EDTA, 2M NaCl, 0.02% Tween). To each reaction, 20 μl of pre-washed MyOne Streptavidin C1 beads were added to each well and incubated for 5 minutes at room temperature. Then, all 96 reactions were pooled together and washed once with TNT buffer (20 mM Tris-Cl pH7.5, 50 mM NaCl, 0.02% Tween 20), once with PB Buffer (Quiagen) and three times with TNT buffer. Libraries were then resuspended in PvuI digestion buffer (NEB, PvuI 0.4 U/μl, incubated for 1 hour at 37 °C) to remove the 3’-ends. After 3’-end removal, streptavidin beads were washed twice with TNT buffer and resuspended in 15 μl of nuclease-free water. To generate strand-specific libraries, samples were briefly denatured at 70 °C for 10 minutes to release the non-biotinylated strand. Then, beads were quickly bound to magnets, to remove the biotinylated strand and single-strand cDNA (sscDNA) was recovered from the supernatant. The resulting TSS-enriched sscDNA (purified with Ampure beads as above, at a 0.8:1 ratio) and libraries were quantified using the KAPA library quantification kit (Illumina). Between 8-16 pmol were hybridized and sequenced using a HiSeq Rapid Run to a sequencing depth of 1-2 million reads/cell. Single stranded cDNA library sequencing Sequencing was performed using one plate per lane in an Illumina HiSeq 2500. Libraries were loaded without denaturation in the absence of PhiX. Custom primers with Locked Nucleic Acids (LNA, Exiqon) were spiked-in, following manufacturer’s instructions. The read1 primer is located upstream of the UMI and was used to obtain a 68 bp read containing the UMI sequence, followed by the TSS of the gene of interest. A second custom index primer (index1) was used to determine the identity of the cellular barcode (8 bp length). LNA primers (0.5 μM) were spiked-in following manufacturer’s instructions for the read1 and index1 primers. UMI_PCR_read1: +GAATGA+TACGGCG+ACCA +CCGA+T Index1: CTGT+CT+CTT+ATA+CA+CA+TCTGA+CG+C The “+” indicates the position of the LNA. A detailed step-by-step protocol of yscRNA-seq can be found at protocols.io. Read pre-processing, alignment and filtering To process the sequencing data (fastq), we used a custom script written in Java to trim out the first 6 nt corresponding to UMI sequences. Poly-G’s following the UMI nucleotides were also trimmed out. Only reads containing up to 14 G’s were kept for analysis. These reads were aligned to a reference sequence combining the S. cerevisiae genome (Saccer3, SGD R64 version, [176]www.yeastgenome.org) and ERCC control sequences using Novocraft ([177]http://www.novocraft.com) with default parameters. Of note, ERCC sequences provided by the manufacturer do not contain the restriction sites used to clone the synthetic ERCCs downstream of the T7 promoter. Due to the 5’-specific nature of yscRNA-seq, these nucleotides are sequenced from read1 and cause misalignments to the reference ERCC annotation. We generated a modified the ERCC reference sequence by including the remaining T7 promoter nucleotides in the reference annotation. Updated ERCC sequences can be accessed from the Gene Expression Omnibus (see [178]Data Availability). Unique mapped reads with a quality score >30 and no soft clips in the 5’-end were kept for downstream analysis[179]^35. Molecule counting and TSS identification For each UMI the total number of reads was counted, and UMIs supported with less than 3 reads were removed. Reads that contained the same UMI sequence and mapped to the same 5′-end location in the genome, were included in the number of reads supporting this UMI. To define TSS isoforms, we used a similar approach as previously reported (TIF-seq)[180]^6. Briefly, we defined a first set of initial TSSs by considering the position in the gene where the overlap with the first nucleotide of the reads occurs. Then, we counted the number of UMIs and reads supporting these TSSs. Next, for each gene, we ordered all the TSSs present in all cells (n=127) by expression. Then, secondary isoforms were called by iterating through the ordered set of TSSs in the following way: if the TSS was within a ±7 nt range that had not been defined as a confident TSS in previous iterations, its position was used to define a new TSS. Following this criteria, the most expressed TSSs were the first to be considered as confident TSS; if the TSS was within a ±7 nt range of a previously-defined TSS, we assigned its UMIs and reads to the confident TSS to which it overlaps. In the case a TSS overlapped with two defined TSS, the number of UMIs and reads were divided and assigned to both TSSs. Finally, we considered only the confident TSS isoforms that were supported by at least two UMIs. The aforementioned computation was used to define the total confident TSS isoforms per gene. To identify TSS isoforms per cell, we iterated through each gene/cell and assigned each TSS to the most expressed TSS isoform at gene level. Variable genes calling We inferred variable genes as previously reported[181]^24. Briefly, we tested the null hypothesis that the biological squared coefficient of variation of a gene is smaller or equal than a chosen minimum (CV^2≤0.25). After estimating the coefficients (see [182]Supplementary Note 6 of Brennecke et al., 2013 for an in-depth explanation of the statistical modeling), p-values for each gene were obtained from the cumulative distribution function of the chi-squared-distribution using the pchisq() function of the stats R package. Finally, we selected genes with an FDR<0.01 after multiple testing correction. Correlations Correlation values from plots generated by the LSD package were generated automatically using the comparisonplot() function with cor=T. Spearman’s rho values are displayed on the bottom right corner of the plots unless otherwise stated. Gene-pair expression correlation (normalized by total number of RNA) was computed using the cor.test() function from the stats R package with default parameters. Cell-to-cell genome-wide and galactose metabolism variable genes correlation in gene expression was computed using the cor() function from the stats R package with default parameters. Marker genes We identified cluster specific genes using the FindAllMarkers() function from the Seurat package in R. We only tested genes detected in at least 25% of the cells of one of the clusters. Among these genes, we tested those that differ a minimum in the fraction of detection between the two clusters. For testing, we used a likelihood-ratio test for single cell gene expression that considers changes in the fraction of cells expressing a gene together with differences in the quantity of this gene among cells (McDavid, A. et al. 2013)[183]^36 (Parameters: min.pct=0.25, min.diff.pct=0.25, only.pos=T, test.use=“bimod”). Normalization Median ratio normalization: Normalization of raw UMI counts was done using the DESeq2 package (v.1.18.1) in R (Love et al., 2014)[184]^37. The estimateSizeFactorsForMatrix() function was used to compute cell-wise size factors for yeast RNA and for ERCC separately. Then, these size factors were applied to the raw UMI counts to obtain normalized expression values. ERCC normalization: To estimate the total number of molecules per gene (as represented in [185]Supplementary Figure 1i), the detection efficiency of the libraries was computed using ERCC data. We estimated the efficiency of detection as the average number of detected ERCCs per cell over the initial number of spiked-in ERCCs (5000 molecules). Then, we divided each gene’s raw UMI count by this value. Sensitivity of the method To evaluate the sensitivity of yscRNA-seq, we measured the 50% detection probability of ERCC molecules. This measure, as proposed in Svensson et al., 2017[186]^10, is obtained through a binomial logistic regression model. In this model, the detection of ERCC molecules is considered as a function of the initial number of ERCC molecules available in the sequencing reaction. This model estimates the number of molecules required to achieve a 50% detection probability of ERCC. For the binomial logistic regression, we used the glm(family=binomial(logit)) function from the stats R package. The ERCC sensitivity computation pipeline was adapted from Bagnoli et al., 2018[187]^11. We compared our 50% detection probability against the datasets from Svensson et al 2017[188]^10 and Bagnoli et al., 2018[189]^11. Growth curves ySMN235 strains were grown in YPD to mid-exponential log phase and subjected to sorting as described above. Two groups of 10,000 cells (top 2%, or random) were sorted based on Gal3-GFP intensity in 1 ml of YP media. Cells were immediately grown in duplicate in YPD (2% glucose) or YPGal (2%) for 50 hours in a Synergy H1 plate reader (30 °C, orbital shaking). Growth was measured by optical density at OD[660] every 1.5 hours. Data visualization Data were visualized using R programming language and the packages ggplot2 (v2.2.1), plotly (v4.8.0), LSD (v4.0-0), Seurat (v2.3.1), geneplotter (v1.56.0), ggpubr (v0.1.6.999), VennDiagram (v1.6.20), cowplot (v0.9.2) and Superheat (v1.0.0). Color palettes used were taken from the viridis (v0.3.0), wesanderson (v0.3.6) and RColorBrewer (v1.1-2) packages. tSNE visualizations were generated with Seurat[190]^38 using the default Bames-Hut implementation from Rtsne. We used PCA as a dimensionality reduction (dimensions 1 to 10) and default parameters (perplexity=30, theta=0.5 and Euclidian distances computation) for all the tSNE plots. FlowJo was used to analyze and generate FACS data. Statistical information To test for differences in the correlation of expression of pairs of genes against random pairs of genes, we used the one-tailed Wilcoxon signed-rank test with the wilcox.test() function from the stats R package. Functional enrichment analysis was performed with gProfileR (v0.6.6), which uses a hypergeometric test (Fisher's exact test) with default g:SCS multiple testing correction[191]^39. Supplementary Material 1 [192]NIHMS80827-supplement-1.pdf^ (3MB, pdf) Reporting Summary [193]NIHMS80827-supplement-Reporting_Summary.pdf^ (71.5KB, pdf) Acknowledgements