Abstract C. difficile is the most common cause of nosocomial diarrhea in North America and Europe. Genomes of individual strains of C. difficile are highly divergent. To determine how divergent strains respond to environmental changes, the transcriptomes of two historic and two recently isolated hypervirulent strains were analyzed following nutrient shift and osmotic shock. Illumina based RNA-seq was used to sequence these transcriptomes. Our results reveal that although C. difficile strains contain a large number of shared and strain specific genes, the majority of the differentially expressed genes were core genes. We also detected a number of transcriptionally active regions that were not part of the primary genome annotation. Some of these are likely to be small regulatory RNAs. Introduction Clostridium difficile is a toxin producing anaerobic bacillus and is the leading cause of hospital associated diarrhea in North America and Europe[[33]1,[34]2]. Clinical presentation of C. difficile infection (CDI) ranges from asymptomatic colonization, mild diarrhea, severe pseudomembranous colitis, paralytic ileus, to sepsis and death[[35]3]. Advanced age, prolonged hospitalization, antibiotic use and acid suppression therapy are some of the risk factors for CDI[[36]3,[37]4].The mortality rate is more than 80% in fulminant cases requiring colectomy[[38]5]. C. difficile produces two toxins, toxin A and toxin B, which are responsible for most of the damage caused to the host[[39]6,[40]7]. During the last decade, there has been a significant increase in the rate of CDI across the United States, Canada, and Europe[[41]8-[42]10].Emergence of more virulent strains and changes in antibiotic treatment regimens are some of the established causes of the increase in CDI [[43]11,[44]12]. Some of these strains, which have the capacity to produce more severe colitis and mortality, have been termed as hypervirulent [[45]13]. These strains belong to PCR ribotype 027[[46]13]. Ribotype 027 strains are also characterized as toxinotype III, North American pulsed field gel electrophoresis type1 (NAP1) and restriction endonuclease analysis group BI and in contrast to historical control strains, are fluoroquinolone resistant[[47]10]. In the recent years, C. difficile infection in the community setting has increased [[48]14]. It is also causing a significant number of infections in food animals[[49]15]. Recent studies have shown the possibility of foodborne transmission of C. difficile, which might explain the spread of C. difficile infections in the general community [[50]16]. Comparative genomic studies have revealed that the genome conservation in C. difficile is very low and this is a major contributing factor in the outcome of infection by a given strain [[51]17-[52]19]. The number of core genes in C. difficile is estimated to be less than 20% of the pangenome[[53]20]. This variation in the genome content could enable C. difficile strains to respond differently to environmental changes. A number of functional genomics studies have been conducted to indentify the differential gene network that controls C. difficile response to environmental changes[[54]21-[55]24]. However, all these studies are based C. difficile 630, a strain isolated in 1985 from Zurich, Switzerland. Genome of C. difficile 630 is the most commonly used reference for functional genomics studies as it was the first strain to be sequenced [[56]17]. Another reason for using C. difficile 630 as reference is that microarray based studies require a high quality complete genome for probe design. Recent advances in new generation sequencing platforms have revolutionized microbial genomics. Genome sequencing using new generation sequencing platforms has resulted in sequencing of several C. difficile strains from different continents that are old as well as newer hypervirulent isolates[[57]20,[58]25-[59]27]. These comparative genomics studies have revealed that when compared to historic strains (isolated in the 1980s), new hypervirulent strains (isolated after 2000) have undergone several genome changes that result in the hypervirulent phenotype[[60]12,[61]25]. The development of transcriptional profiling using RNA sequencing (RNA-seq) also offers vast improvements over microarray based transcriptional profiling. In order to determine how historic and recently emerged C. difficile strains respond to environmental changes we used RNA-seq to compare the transcriptome of four C. difficile strains after subjecting them to physiologically relevant in vitro stress conditions. In the first condition, cells were shifted from a rich medium [Brain heart infusion broth (BHI)] to a poor medium [Basal defined medium (BDM)] [[62]28] with supplementation of 0.5% sucrose. It has been reported that the presence of glucose and other easily metabolizable carbon sources in the growth medium suppress production of toxin A and toxin B[[63]29,[64]30]. The shift from BHI to BDM was designed to induce multiple nutritional changes and to analyze the impact of those changes in C. difficile pathways. Osmotic shock (shift from BHI to BHI supplemented with 1.5% NaCl) was chosen as the second test condition because C. difficile has been shown to have enhanced host cell adherence following osmotic shock[[65]31]. Since adhesion and colonization of animal tissue by bacteria is important in establishing infection, it is probable that without attachment, C. difficile cannot colonize and will be quickly removed by non-specific host defense mechanisms which include intestinal peristalsis, mucosal cell exfoliation and intestinal mucins[[66]32,[67]33]. Two strains in this comparison were isolated in the 1980s (CD630 and CD196), and two were hypervirulent strains isolated after 2000 ([68]R20291 and QCD_32g58). Although s C. difficile 630 is not a ribotype 027 strain, we included this strain in the comparisons as it is the only strain that was sequenced using Sanger method and to enable cross comparison with the previously reported functional genomics studies. Consistent with the large genome diversity in C. difficile, our transcriptome sequencing results show that differentially expressed core genes in various strains are not identical. We also report expression of several novel transcripts that are not part of the primary genome annotation. Materials and Methods C. difficile strains and culturing We selected four C. difficile strains for transcriptomic comparisons. The characteristics of these strains are given in [69]Table 1. Bacterial culturing was performed inside a Bactron IV anaerobic chamber (Shel Lab, Cornelius, OR). The chamber was filled and purged with an anaerobic gas mixture (10% CO2, 85% N2, 5% H2). A palladium catalyst was used in the chamber to remove any trace of oxygen. All materials used in the anaerobic chamber were pre-reduced. Spores of C. difficile strains CD630, CD196, QCD_32g58 and [70]R20291 were streaked on brain-heart infusion (BHI) agar plates containing 0.1% L-cysteine and taurocholate. The plates were incubated overnight at 37°C. Single colonies from these plates were then used inoculate pre-reduced BHI broth and were incubated at 37°C overnight. Fresh BHI broth was then inoculated by transferring 1% overnight culture. Cultures were incubated at 37°C until the OD600 reached between 0.4 - 0.5. Bacteria were then collected by centrifugation at 2,000 x g for 5 minutes. These cells were then shifted to two physiologically relevant in vitro conditions. In the first condition, cells were subjected to nutrient change by shifting to an equal volume of Basal defined medium (BDM)[[71]28] with supplementation of 0.5% sucrose. This shift was designed to induce multiple nutritional changes and to analyze the impact of those changes in C. difficile pathways. In the second condition, cells were subjected to osmotic shock by shifting to an equal volume of BHI supplemented with 1.5% NaCl. The same number of cells was transferred to fresh BHI as the control group. After incubating for 1 hour at 37°C, twice the volume of RNAprotect bacteria reagent (Qiagen, Valencia, CA) was added to the cultures to halt transcription and RNA degradation, and cells were collected by centrifugation at 2,000 x g for 10 minutes.All experiments were performed as two biological replicates. Table 1. Characteristics of C. difficile strains used in this study. Strain name Ribotype Isolation year Isolation country Hypervirulence CD630 13 1982 Switzerland no CD196 27 1985 France no QCD-32g58 27 2004 Canada yes [72]R20291 27 2006 UK yes [73]Open in a new tab Isolation of total RNA Total RNA extraction was performed with TRIzol/RNeasy hybrid RNA extraction protocol. Briefly, the bacterial pellets were re-suspended with 1 ml of TRIzol reagent and were transferred to 2 ml sterile screw-cap microcentrifuge tube. Then 0.5ml of sterile RNase-free 0.1 mm zirconia beads was added to each tube. Cells were homogenized and lysed by bead beating four times in a Mini Bead-Beater (BioSpecProducts, Inc., Bartlesville, OK) for 30 seconds with a gap of 30 seconds. After the chloroform extraction, the aqueous phase was transferred to a 1.5 ml sterile RNase-free micro-centrifuge tube and mixed with an equal volume of 100% ethanol (Sigma). This mix was loaded into an RNeasy column (Qiagen kit) and centrifuged for 30 seconds at 8,000 x g. Washing of the column, DNA digestion and elution steps were performed following the standard Qiagen protocol. Integrity of isolated RNA was estimated using Agilent Bioanalyzer 2100. Only those samples with an RNA Integrity Number (RIN) >9 were used for RNA sequencing. cDNA library synthesis and sequencing Ribo-Zero™ rRNA Removal Kit ( Epicentre® Biotechnologies, Madison, WI, USA) was used to purify mRNA from 10µg total RNA. First strand cDNA was synthesized using SuperScript® III (Invitrogen). The second-strand cDNA was synthesized using RNase H (Invitrogen) and DNA polymerase I (New England BioLabs). Then the cDNA libraries were prepared using the Illumina Paired-end Genomic DNA Sample Prep kit (Illumina) following the manufacturer's protocol. For each sample, two libraries were prepared from biological replicates. Each library was then loaded onto flow cell channels of the Illumina High-seq 2000 platform for paired-end 90 bp × 2 sequencing. The average insert size for the paired-end libraries was 200 bp (from 180 to 220 bp). For each strain, six paired-end cDNA libraries were constructed. Therefore twenty four libraries were sequenced across all strains. RNA-Seq data analysis For data processing, we used a customized RNA-Seq data analysis pipeline developed at Virginia Bioinformatics Institute (VBI) by combining open source programs. Briefly, the quality of the raw sequence reads was checked using the FastQC program ([74]http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc). The processed reads were then aligned using Bowtie version 0.12.7 [[75]34] to the corresponding C. difficile reference genome. Read alignments with mapping quality score (MAPQ) < 10 were removed. Cufflinks software package version 1.3.1 [[76]35] was used to assemble transcripts and estimate the relative abundances of the transcripts. Transcript expression levels are estimated as Fragments Per Kilobase per Million mapped reads (FPKM). Cuffdiff [[77]36], a component of Cufflinks was used to calculate transcript expression levels. When compared to control condition, genes with log[2] ratio ≥ 1.5 and FDR- adjusted p value less than or equal to 0.05 were considered as differentially expressed. The processed data and raw files from this experiment have been submitted to NCBI Gene expression Omnibus (GEO) under the accession # [78]GSE50497 and NCBI short read archive (SRA) under the accession # SRP029366. Novel gene discovery Cufflinks program provides reference annotation based assembly and seeks to build upon available information about the transcriptome of an organism to find novel genes and isoforms [[79]35]. Cufflinks output includes all annotated reference transcripts and any novel genes and isoforms that are assembled. The novel gene transcripts identified by the pipeline can be novel small RNA genes or unannotated CDS. We performed the following steps to categorize these novel transcripts. First, these transcripts were used to search the Rfam database for sRNA matches. Rfam is a comprehensive database containing families of structural RNAs, including non-coding RNA genes as well as cis-regulatory RNA elements[[80]37]. It incorporates literature curated known sRNAs and uses them as seeds to predict sRNAs for sequenced genomes using covariance model[[81]37]. In the second step, the new transcripts were searched against the non-redundant (nr) database using BLASTX to check for any protein coding gene hits. Finally, we used ORF Finder program to verify whether any transcripts are potentially protein coding genes. Assembled transcripts with no BLASTX hits and no ORF assignment were considered as sRNAs. Functional analysis of differentially expressed genes Since it is well established that C. difficile strains are known to be highly divergent [[82]18-[83]20], we classified genes in each strain into core, shared and unique categories using OrthoMCL program [[84]38]. We then combined this gene classification with gene expression level data to obtain a comparison of these genes across all strains. Pathologic program in Pathwaytools v 16.0[[85]39] was used to reconstruct the metabolic pathways of strains QCD_32g58, CD196 and [86]R20291. A previously curated high quality pathway annotation for strain CD630 by our group [[87]23] was used to remove false positive pathway predictions in these strains. Omics viewer [[88]40] was used to map the differentially expressed genes onto cellular pathways and to compare differentially expressed pathways across strains. For identifying how gene interaction networks adjust to the stress conditions tested, the differentially expressed genes were mapped to the systems level gene interaction network of C. difficile. The base interaction data for this analysis was obtained from STRING database[[89]41]. Complete interaction data from STRING v 9.0 was downloaded and C. difficile specific interaction data was then extracted using custom Linux shell scripts. The interaction data in STRING database includes both experimental as well as predicted interactions and each interaction is assigned a confidence score. We then selected interactions with confidence score of 400 or above. These would represent medium and high confidence interactions in C. difficile. This interaction network was then imported into Cytoscape [[90]42] for visualization and overlaying of transcriptomic expression data. qRT-PCR We used qRT-PCR to verify the expression levels of selected genes. Following the manufacturer’s instructions, 3.0 µg of total RNA isolated from each stress condition was converted to cDNA by using SuperScript III reverse transcriptase (Invitrogen) with random hexamers. The real-time reaction mixture included 12.5 ng cDNA, 200 nM of each of both forward and reverse primers, and 1X SYBR GreenER qPCR SuperMix (Invitrogen). Primers used in this study are listed in the File S1. qPCR was performed in 96-well optical plates using the AB 7500 Real-Time PCR System instrument and software (Invitrogen) and was analyzed by the method previously reported by our group[[91]43]. Results Genome coverage C. difficile is an unusual species because the number of conserved genes in a given strain is very low and many of the strains contain a very high number of genes that are unique to that strain[[92]19,[93]20]. It has also undergone very rapid evolution in the last two decades by acquiring several new genes [[94]25]. To understand how historic and recently emerged C. difficile respond to physiological stress, two historical (CD630 and CD196) and two recently emerged (QCD-32g58 and [95]R20291) strains were subjected to nutrient shift and osmotic shock. Gene expression under these conditions was determined using RNA-seq and these results were compared to gene expression levels during growth in BHI (control condition). A total of 24 samples from these conditions were used for paired-end bi-directional Illumina sequencing. Illumina sequence files were converted to Sanger fastq format and rRNAs were filtered. The quality of the sequence data was checked using the FastQC program. The sequence reads were 90 nucleotides in length and the total number of reads per sample was ~ 26.6 million on average. The filtered RNA-Seq sequence reads from the 24 samples were aligned to their corresponding C. difficile reference genome using Bowtie. For all samples analyzed, 88-95% of reads were mapped with MAPQ greater than or equal to 10. The transcript expression levels were estimated as FPKM using Cufflinks[[96]44]. The number of genes expressed (FPKM>0) was calculated for each sample. There was very high correlation between samples when the number of expressed genes was compared between biological replicates ([97]Table 2). We used PATRIC [[98]45] annotations of the C. difficile genomes as references. In this