Abstract

Background

   Beef cattle are among the most economically important animals in the
   world because they are farmed for their meat and leather. However, a
   lack of genetic information remains an obstacle to understanding the
   mechanisms behind the development of this animal. Analysis of the beef
   cattle transcriptome and its expression profile data are essential to
   extending the genetic information resources for this species and would
   support studies on this animal.

Results

   RNA sequencing of beef cattle was performed using the Illumina
   High-Seq2000 platform. A total of 25,605,140 and 26,214,800 reads were
   sequenced for embryonic and adult pooled samples, respectively. We
   identified 24,464–29,994 novel transcript units in two pooled samples.
   In addition, 8,533–10,144 genes showed evidence of alternative
   splicing, in agreement with the finding that alternative 3′ splicing is
   the most common type of alternative splicing event in cattle. We
   detected the expression levels of 16,174 genes, and 6,800 genes
   exhibited differential expression between the two pooled samples with a
   false discovery rate ≤0.001. Using GO enrichment and KEGG pathway
   analysis, multiple GO term and biological pathways were found to be
   significantly enriched for differentially expressed genes. In addition,
   we discovered that 30,618–31,334 putative single nucleotide
   polymorphisms were located in coding regions.

Conclusions

   We obtained a high-quality beef cattle reference transcriptome using a
   high throughput sequencing approach, thereby providing a valuable
   resource for better understanding the beef cattle genome. The
   transcriptome data will facilitate future functional studies on the
   beef cattle genome and can be applied to breeding programs for cattle
   and closely related mammals.

Introduction

   As an elite yellow cattle breed, Qinchuan beef cattle have had a long
   history of feeding and breeding in China. The practice of selecting
   good cattle to present to the master has been recorded as early as 800
   BC [25][1]. Qinchuan cattle were mainly used as draft animals
   throughout history. When Zhangqian brought back alfalfa seeds from the
   West in 126 BC via the Silk Road, people began to plant alfalfa for
   cattle feed on the Guanzhong Plain, the main production area of
   Qinchuan beef cattle. This resulted in tremendous improvements in
   Qinchuan beef cattle, particularly in relation to its body size, work
   ability, and individual meat yield.

   Genetic background and pre-birth development are known to affect the
   composition of bovine muscle tissue [26][2], [27][3]. Prenatal muscle
   development is therefore a promising area of gene discovery regarding
   the molecular events that determine adult muscle phenotype.

   However, the complexity of the bovine transcriptome has not yet been
   fully elucidated. Novel, high-throughput, deep-sequencing technologies
   are affecting genomic research by providing new strategies to analyze
   the functional complexity of transcriptomes. The RNA sequencing
   (RNA-Seq) approach produces millions of short cDNA reads that are
   mapped to a reference genome to obtain a genome-scale transcriptional
   map, which consists of the transcriptional structure and the expression
   level of each gene [28][4]. The holistic view of the transcriptome and
   its organization provided by the RNA-Seq method reveals many novel
   transcribed regions, splice isoforms, and single nucleotide
   polymorphisms (SNPs) and allows refinement of gene structures
   [29][5]–[30][10]. Finally, RNA-Seq generates absolute rather than
   relative gene expression measurements, thereby providing greater
   insight and accuracy than do microarrays [31][11], [32][12].

   In the current study, we have performed the first global analysis of
   the beef cattle transcriptome during muscle development using the
   Illumina RNA-Seq method. Although our main aim was to validate the
   RNA-Seq technology and to set up a pipeline that allows observation of
   the level of gene expression, new transcripts, splice variants, and
   SNPs, we report here a comprehensive analysis of transcriptome dynamics
   that may serve as a blueprint of the gene expression profiles that
   occur during muscle development.

Results

Deep Sequencing of Bovine Longissimus Muscle Transcriptomes

   Using RNA-Seq, this study compared the transcriptomic landscapes of
   longissimus muscle from embryo at day 135 post fertilization (Emb135d)
   versus 30-month-old (30M) adult cattle. To accomplish this, two rounds
   of linear amplification of mRNA were used, ensuring that each
   individual produced enough RNA input for analysis. Amplified RNA from
   three embryos and three adult bovines, respectively, all with the same
   sire, was pooled and sequenced on the High-Seq2000 system at
   BGI-Shenzhen, China, resulting in approximately 2 billion pair-end
   reads of 100 bp in length. The data set was analyzed according to the
   BGI bioinformatics protocols for RNA-seq. [33]Table 1 presents the
   overall results of aligning sequencing reads to the bovine reference
   genome (UMD Bos taurus 3.1 (UMD 3.1), produced by the University of
   Maryland) and genes.

Table 1. Summary of sequence read alignments to the UMD 3.1 reference genome
and gene.

   Category 30M sample Emb135 sample
   Reads number Percentage Reads number Percentage
   Total reads 26,214,800 100.00% 25,605,140 100.00%
   Total base pairs 2,359,332,000 100.00% 2,304,462,600 100.00%
   Total mapped reads 20,427,874 77.92% 20,255,767 79.11%
   Read perfectly matched to the reference genome 14,364,201 54.79%
   14,300,096 55.85%
   Read with ≤3 bp mismatch to the reference genome 6,063,673 23.13%
   5,955,671 23.26%
   Reads uniquely matched to the reference genome 19,713,327 75.20%
   18,733,163 73.16%
   Reads matched to the reference genome with multi-positions 714,547
   2.73% 1,522,604 5.95%
   Total unmapped reads to the reference genome 5,786,926 22.08% 5,349,373
   20.89%
   Total mapped reads to annotated genes 20,500,464 78.20% 18,898,384
   73.81%
   Read perfectly matched to the annotated genes 15,500,019 59.13%
   14,859,948 58.04%
   Read with ≤3 bp mismatch to the annotated genes 5,000,445 19.07%
   4,038,436 15.77%
   Reads uniquely matched to the annotated genes 17,615,556 67.20%
   16,364,686 63.91%
   Reads matched to the annotated genes with multi-positions 2,884,908
   11.00% 2,533,698 9.90%
   Total unmapped reads to annotated genes 5,714,336 21.80% 6,706,756
   26.19%
   [34]Open in a new tab

   Total reads: total number of sequencing reads. Total base pairs: total
   number of base pairs. Total mapped reads: the reads that can aligned to
   reference sequence and the ratio of it. We not only classified
   statistic by mismatch number but also classified by uniqueness of
   alignment postion, following 4 ranks are result of classified
   statistic, perfect match: in total mapped reads, no mismatch; ≤3 bp
   mismatch: in total mapped reads, mismatch number less than 3 bp; unique
   match: in total mapped reads, reads aligned to only one postion;
   multi-position match: in total mapped reads, reads aligned to two or
   more places. Total unmapped reads: reads that don't aligned to
   reference sequence.

   Sequencing reads were analyzed using SOAP2 software [35][13] by
   alignment with the UMD 3.1 reference genome. Of the total sequenced
   reads, 79.11% and 77.92% were mapped to the UMD 3.1 reference genome
   for Emb135 and 30M samples, respectively. Of these, 73.16% and 75.20%,
   respectively, were uniquely mapped to specific regions in the bovine
   genome, and 73.81% and 78.20% of reads corresponded to reference genes
   with 63.91% and 67.20% uniquely matched reads, respectively ([36]Table
   1). Unmapped and multi-position matched reads were excluded from
   further analyses ([37]Table 1).

Identification and Analysis of Novel Transcribed Units

   Using the procedures described in the materials and methods section and
   those of Zhang et al. [38][14], we detected an extensive number of
   novel transcript units. The statistics of the novel transcripts are
   presented in [39]Figure 1. In total, we obtained 29,994 and 24,464
   novel transcript units in the pooled samples for Emb135 and 30M,
   respectively, with a mean length of 353 bp and with lengths ranging
   from 150 to 22,302 bp ([40]Table S1). Many novel transcript units
   (28.54%–51.93%) have ≥1 exon, and the largest is 6,917 bp in length and
   contains 23 exons ([41]Table S1).

Figure 1. Statistics of novel transcripts.

   Figure 1
   [42]Open in a new tab

   The abscissa shows the samples 30M and Emb135, and the ordinate
   represents the number of novel transcripts occurred in each sample.

Alternative Splicing Events in the Bovine Transcriptome

   Alternative splicing (AS) plays a major role in the generation of
   proteomic and functional complexity in higher organisms [43][15],
   [44][16], [45][17], [46][18]. To explore potential AS events, we
   conducted computational analyses to determine all theoretical splicing
   junctions and then identified sequence reads that mapped to these
   regions. A brief introduction to the algorithms used to detect
   alternative splicing events follows: First, junction sites, which
   provide information about boundaries and the combinations of different
   exons in a transcript, are detected by SOAPsplice. Then, all junction
   sites of the same gene are used to distinguish the type of its
   alternative splicing event. Seven main types of alternative splicing
   exist: exon skipping, intron retention, and alternative 5′ splice site,
   alternative 3′ splice site, alternative first exon, alternative last
   exon, and mutually exclusive exon-splicing. The last three types are
   not included in our results due to the high rate of false positive
   results obtained when using the SOAPsplice software. [47]Figure 2
   (below) shows these alternative splicing events and the associated gene
   numbers that we determined through high throughput sequencing.

Figure 2. Statistics of alternative splicing events and genes in two samples.

   [48]Figure 2
   [49]Open in a new tab

   (A): The red bars show the distribution of the number of genes for each
   type of alternative splicing mode in the pooled 30M sample. The green
   bars indicate the number of alternative splicing events that occurred
   in each gene. (B) is same as A except that the results for the Emb135
   sample are presented.

   We found that approximately 66.6% of the alternatively spliced genes
   underwent multiple alternative splicing events, thus illustrating the
   complexity of the bovine transcriptome. Alternative 3′ splice
   site-splicing is the most common type of alternative splicing event,
   accounting for an average of 40.8% of all alternative splicing events
   in cattle; in contrast, intron retention is the least common type of
   splicing event, only constituting 1.7% in the pooled Emb135 sample and
   2.0% in the pooled 30M sample. The average size of retained introns in
   the Emb135 sample is 354 bp (range: 64–3,168 bp). In the 30M sample,
   the average size of retained introns is 380 bp (range: 60–8,708 bp).
   [50]Table S2 identifies AS events in each sample in detail.

Refinement of Gene Structures

   Gene structure was optimized according to the distribution of the
   reads, information from RNA-seq, and the annotation of the reference
   gene. First, we obtained the distribution of reads in the genome by
   aligning the continuous and overlapping reads to form a Transcription
   Active Region (TAR). Second, in accordance with the paired-end RNA-seq
   data, we connected the different TARs to form a potential gene model.
   Finally, we compared the gene model with the existing annotated gene to
   extend the gene's 5′- and 3′-ends. In the 30M pooled sample, more than
   2,571 regions were extended at the 5′- end, and more than 4,625 regions
   had an extension at the 3′- end. In comparison, in the Emb135 pooled
   sample, at least 2,157 regions were extended at the 5′- end, and at
   least 3,531 regions had an extension at their 3′- end. [51]Table S2
   describes the 5′- and 3′- end extensions of genes in each sample.

Identification of Differentially Expressed Genes

   The RNA-Seq technique allows analysis of the differential expression
   profile via transcript abundance with a high sensitivity for
   transcripts expressed at low levels that would otherwise be
   undetectable using standard microarrays [52][19]. The differentially
   expressed genes were selected based on the expression profiles and the
   following criteria: (1) if the change in gene expression levels between
   Emb135 and 30M was greater than or equal to two-fold (|log[2]Ratio| ≥1)
   and (2) if the false discovery rate (FDR) was less than or equal to
   0.001. In summary, we identified 16,174 differentially expressed genes
   in longissimus muscle in 30M vs. Emb135 samples. According to the
   screening criteria chosen (FDR ≤0.001 and |log[2]Ratio| ≥1), we
   restricted these 16,174 genes to 6,800 genes that exhibited
   differential expression between the two tested samples. Of these, the
   expression levels of 1,893 genes were up-regulated in the Emb135 sample
   with respect to the 30M sample. The remaining 4,907 genes were
   down-regulated in the Emb135 sample with respect to the 30M sample
   ([53]Figure 3 and [54]Table S4).

Figure 3. Statistics of differentially expressed genes (DEGs).

   Figure 3
   [55]Open in a new tab

   Red and green bars above the X-axis denote genes with up- and
   down-regulated expression in Emb135-VS-30M samples, respectively. The
   Y-axis denote number of differentially expressed genes with up- and
   down-regulated expression in Emb135-VS-30M samples, respectively.

qRT-PCR Confirmation of Differential Gene Expression

   To validate the expression profiles obtained using RNA-Seq,
   quantitative real-time PCR (qRT-PCR) was performed on 47 genes, which
   were randomly selected for high or low expression levels normalized to
   β-actin via qRT-PCR on pooled mRNA samples from Emb135 (n = 3), newborn
   (day 0 at birth, n = 3) and 30M (n = 3) cattle. Overall, of these 47
   randomly selected genes, those that were differentially expressed
   showed similar patterns of mRNA abundance in the RNA-Seq analysis and
   qRT-PCR ([56]Figure S1). [57]Table S5 lists the primers used for the
   qRT-PCR of these 47 genes.

Gene Ontology Analysis

   To further investigate the biological processes associated with the
   differentially expressed genes, we performed Gene Ontology (GO)
   analysis by running queries for each differentially expressed gene
   against the GO database [58][20], which provides information on the
   relevant molecular functions, cellular components, and biological
   processes. The results of analyzing the GO functional annotations are
   presented in [59]Figure 4 and [60]Table S6. In total, 9,202 GO terms
   were assigned to the differentially expressed genes, and these were
   classified into three independent ontology categories: (1) 6,575
   (71.5%) terms corresponding to biological processes, (2) 1,739 (18.9%)
   terms corresponding to molecular functions, and (3) 888 (9.7%) terms
   corresponding to cellular components. In the biological processes
   category, the most important enriched terms are related to
   developmental process, multicellular organismal development, and
   anatomical structure development; these were followed by system
   development, M-phase, cell cycle phase, mitotic cell cycle, and cell
   cycle. Binding and protein binding account for most of the terms in the
   molecular function category, followed by nucleotide binding, purine
   nucleotide binding, purine ribonucleotide binding, and small molecule
   binding. Finally, within the cellular component category, the GO term
   with the highest level of significance was intracellular. Furthermore,
   after intracellular, the terms intracellular part, cytoplasm, organelle
   part, intracellular organelle part, and cell exhibited the highest
   significance in the GO analysis.

Figure 4. The top 20 Gene Ontology functional annotations for the
differentially expressed genes.

   [61]Figure 4
   [62]Open in a new tab

   Bar graphs (A), (B) and (C) show three independent Gene Ontology (GO)
   information categories: cellular components, molecular functions and
   biological processes, respectively. In each bar graph, the abscissa
   represents the number of differentially expressed genes, and the
   ordinate is the ID number of the GO terms. All GO categories listed
   have enrichment P values <0.05.

KEGG Pathway Analysis

   Different genes usually cooperate with each other to exercise their
   biological functions. Pathway-based analysis helps to further
   understand the biological functions of genes. Pathway enrichment
   analysis identifies significantly enriched metabolic pathways or signal
   transduction pathways related to differentially expressed genes.
   Overall, 254 pathways were assigned for differentially expressed genes.
   15 pathways were significantly enriched (Qvalue ≤0.05) for
   differentially expressed genes ([63]Table S7). The pathway term showing
   the highest level of significance was axon guidance. Furthermore, after
   axon guidance, the terms phagosome, cell cycle, hypertrophic
   cardiomyopathy (HCM), dilated cardiomyopathy, pathways in cancer,
   amyotrophic lateral sclerosis (ALS), DNA replication, N-glycan
   biosynthesis, and regulation of actin cytoskeleton exhibited the
   highest significance in the pathway analysis.

Detecting SNP Variants in Pooled Transcriptome Samples by RNA-Seq

   The SOAPsnp program [64][21] calculates the likelihood of each genotype
   at each site based on the alignment of short reads to the UMD 3.1
   reference genome together with the corresponding sequencing quality
   scores. The program then infers the genotype with highest posterior
   probability at each site based on Bayes' theorem (the reverse
   probability model). Thus, we have accounted for the intrinsic bias or
   errors that are common in Illumina sequencing data and recalibrated the
   quality values for use in inferring consensus sequences. [65]Figure 5
   summarizes the workflow used for the discovery of putative SNPs.
   According to this workflow, 31,334 putative SNPs were identified for
   the 30M pooled samples. Similarly, for Emb135 pooled samples, 30,618
   putative SNPs were identified. For both samples, the SNP detection
   stringency conditions were as follows: (1) at least two unique mapping
   reads that support the polymorphic nucleotide and (2) a quality score
   of ≥20 ([66]Table S8).

Figure 5. Pipeline of SNP Calling.

   [67]Figure 5
   [68]Open in a new tab

   The SNPs can be identified on the consensus sequence through the
   comparison with the reference.

Discussion

   The unprecedented level of sensitivity and high throughput of deep
   sequencing technologies suggest that RNA-Seq is likely to become the
   platform of choice for transcriptome analysis, quickly superseding
   microarray-based methods for comprehensive studies of novel transcribed
   units and differential splicing activity, and for the refinement of
   gene structures, the measurement of gene expression levels and the
   discovery of expressed SNPs. However, the new level of detail offered
   by RNA-Seq raises novel statistical and computational challenges that
   limit the widespread adoption of whole-genome transcriptional profiling
   to detect genes that are differentially expressed among several
   experiments.

   In this study, we generated >2 billion sequence reads, corresponding to
   2.7 Gb of raw sequence data, using Illumina sequencing of mRNA from two
   stages of longissimus muscle development. Aligning all sequence reads
   to the bovine UMD 3.1 reference genome yielded >20 million reads that
   were successfully mapped on the UMD 3.1 reference genome. More than 5
   million reads remain that cannot be matched to the UMD 3.1 reference
   genome. This could be caused by low sequence coverage of the reference
   genome, reference errors, sequencing errors, or defined mapping
   criteria. A majority of the annotated transcripts in the Ensembl
   database were covered by sequenced reads, illustrating the sensitivity
   of RNA-seq for transcript discovery, even for genes expressed at low
   levels [69][22]. Furthermore, we identified 24,464–29,994 novel
   transcript units, which improved the gene annotations of the bovine
   genome and transcriptome.

   Alternative splicing is an important model of gene expression
   regulation that has not been generally accessible through microarray or
   SAGE analysis methods in cattle [70][23]. We found that some genes
   showed all four types of alternative splicing models, thus revealing
   the complexity of alternative splicing in cattle. We also found that
   alternative 3′-end splicing is the most common type of alternative
   splicing event in cattle and that intron retention is the least common
   alternative splicing mechanism. Our results are consistent with those
   of a similar pig study [71][24]. However, these results contrast with
   those reported for humans and Holstein cows, where exon-skipping is the
   most prevalent mechanism [72][25], [73][26]; the results also differ
   from those found for rice, for which intron retention is the primary
   alternative splicing mode [74][27]. Three types of alternative
   splicing, including alternative first exon-, alternative last exon- and
   mutually exclusive exon-splicing, were excluded from our analyses
   because the algorithms used to analyze these have not yet been
   perfected. A greater number of alternative spliced genes would be
   discovered if these alternative splicing models were considered.

   With the rapid rise of high throughput sequencing, a plethora of
   transcriptomic data is now available, in particular for mammals.
   Comparing the transcriptomes of beef cattle (Qinchuan beef cattle were
   analyzed in this study) and dairy cattle (Holstein cows were analyzed
   in a similar study [75][26]), we observed that the distribution of
   genes based on the GO terms within each ontological category derived
   for beef cattle is consistent between these two breeds. Indeed, in both
   breed types, the most represented terms in the biological process
   category are developmental process, multicellular organismal
   development, anatomical structure development, and system development.
   Although a few differences exist within each category, the general
   organization and the main terms found are similar between both breeds.
   However, metabolic process, cellular metabolic process, catalytic
   activity, and oxidoreductase activity are the major terms found in the
   biological process category for porcines. In addition, cellular
   process, metabolic process and biological regulation, multicellular
   organismal process, and developmental process are the main terms
   representing the biological process ontological category in amphioxus.
   These results indicate that similar species are likely to have a
   similar general organization for each ontological category of genes.
   Conversely, distantly related species are likely to have considerable
   differences in the general organization for each ontological gene
   category.

   In conclusion, we sequenced the transcriptome of Qinchuan beef cattle
   using the RNA-seq method. This study revealed a large number of genes
   of known and unknown function, greatly expanding the amount of genetic
   information available for this species and providing a profile of its
   developmental processes. This study also represents a first step toward
   an improved understanding of the functions of these genes and provides
   a broad and novel vision of future research at the molecular level in
   cattle.

Materials and Methods

Ethics Statement

   The Qinchuan beef cattle used in this experiment were obtained from the
   Shaanxi Kingbull animal husbandry company limited, Baoji City, China.
   Experiments were performed according to the Regulations of the
   Administration of Affairs Concerning Experimental Animals (Ministry of
   Science and Technology, China, revised in 2004) and approved by the
   Institutional Animal Care and Use Committee of the College of Animal
   Science and Technology, Northwest A&F University, China. Animals were
   allowed access to feed and water ad libitum under same, normal
   conditions and were humanely killed.

RNA Isolation

   Liver, lung, spleen, kidney, heart, stomach, intestines, fat (only
   newborn and 30M) and longissimus muscle from 9 individual were
   harvested for RNA isolation within 10 minutes after slaughter (3
   individuals at each of three key stages of myogenesis and muscle
   maturation: Emb135d, newborn, and 30M adult cattle. All fresh tissue
   samples were collected and divided into 1.5 mL plastic centrifuge tubes
   (each sample weighing around 100 mg) and snap frozen in liquid nitrogen
   for RNA-seq (longissimus muscle samples for RNA-seq were processed only
   for Emb135d and 30M cattle) and qRT-PCR use.

   Total RNA was isolated using TRIzol reagent (Invitrogen, USA) according
   to the manufacturer's instructions. Briefly, 1 mL of TRIzol was added
   to the pellet after removal of the 1.5 mL plastic centrifuge tube from
   the liquid nitrogen and immediately homogenized by pipetting;
   thereafter, RNA were isolated and purified at 4°C using standard
   phenol–chloroform extraction. DNA was removed from the RNA extracts by
   incubation with RNase-free DNase I (Qiagen, Hilden, Germany) for 1 h at
   37°C. RNA concentration and purity were determined using a NanoDrop
   ND-1000 spectrophotometer (Nanodrop, Wilmington, DE), and its quality
   and integrity were assessed based on the A260/A280 ratio and confirmed
   using a 2100 Bioanalyzer (Agilent) according to RNA integrity number.
   Equal amounts of high-quality RNA samples from each tissue were then
   pooled for cDNA synthesis and sequencing. Specifically, longissimus
   muscle from individuals at embryo (n = 3) and adult (n = 3) stages were
   pooled as samples Emb135 and 30M, respectively, for use in sequencing.

cDNA Preparation and Library Construction for Illumina Sequencing

   Messenger RNA was isolated from the total RNA samples using oligo (dT)
   magnetic beads (Invitrogen). The purified mRNA was first fragmented to
   approximately 200–700 nt fragments using the RNA fragmentation kit
   (Ambion). First-strand cDNA synthesis was performed using random
   hexamer primers and reverse transcriptase. After the first strand was
   synthesized, a custom second-strand primer and strand synthesis buffer
   (Illumina) were added, followed by dNTPs, RNase H and DNA polymerase I
   to nick translate the second-strand. The reaction was then purified
   using a QiaQuick PCR column (Qiagen) and eluted in EB buffer (Qiagen);
   the samples were then subjected to end repair, addition of a single A
   base, adaptor ligation, agarose gel isolation of approximately 200-bp
   cDNA, and PCR amplification. Finally, the samples were sequenced using
   the Illumina HiSeq™ 2000 platform. Overall, two cDNA libraries were
   constructed from pooled Emb135 (n = 3) and 30M (n = 3) samples. One
   technical replicate was performed for each sample.

Read Mapping on the Bovine Reference Genome and Data Analysis

   Illumina sequencing of the Emb135 and 30M pooled RNA was conducted in
   BGI, Shenzhen, China. The UMD 3.1 reference genome (available at
   [76]ftp://ftp.ensembl.org/pub/release-67/fasta/bos_taurus/) from the
   Ensembl database was utilized as the bovine reference genome for the
   assembly [77][28]. After trimming reads containing adaptors, reads
   containing more than five Ns per 100 bp and reads of low quality (i.e.,
   more than half of the base's qualities were less than or equal to 5),
   clean reads were 100% matched to the UMD 3.1 reference genome using
   SOAP2 [78][13] by not allowing any mismatches in reads. For the reads
   that were not able to be aligned with the reference sequences, we
   performed the SOAP alignment again while allowing less than 3 bp
   mismatches in the reads.

   A similar strategy was used to filter the unique sequence reads and
   multi-position reads. Data were normalized by calculating the number of
   reads per kilobase of exon region in a gene per million mapped reads
   (RPKM) (RPKM  =  total exon reads/mapped reads in millions × exon
   length in kb) [79][4] for each gene and then annotated with the Ensembl
   bovine genome assembly.

Identification of Novel Transcript Units

   All reads that matched the UMD 3.1 reference genome with
   multi-positions were excluded from further analysis. The intergenic
   regions were defined as regions between 200 bp downstream of one gene
   and 200 bp upstream of the next adjacent gene using the bovine mRNA
   data (Ensembl). A contiguous expression region with each base supported
   by at least two reads was considered a transcriptionally active region.
   Transcriptionally active regions that were joined by at least one set
   of paired-end reads were connected into a single transcript unit.
   Transcript units that were not overlapped with an annotated gene model
   and located in intergenic regions with a continuous mapping length ≥150
   bp and average coverage ≥2 were considered as putative novel transcript
   units.

Identification of Alternative Splicing Models

   As described by Wang et al. [80][29] and Zhang et al. [81][27] and
   according to the structure of exons, alternative splicing events were
   classified into seven different types of alternative splicing models:
   exon skipping, intron retention, alternative 5′ splice site,
   alternative 3′ splice site, alternative first exon, alternative last
   exon, and mutually exclusive exon models. The details of these
   alternative splicing models have been described by Zhang et al.
   [82][27]. Because the algorithms for the alternative first exon,
   alternative last exon and mutually exclusive exon models have not yet
   been perfected, only the remaining four alternative splicing models
   listed above were analyzed and presented in this study.

Refinement of Gene Structures

   Gene structure was optimized according to the distribution of reads,
   paired-end sequences and reference gene annotations. After aligning
   reads to the UMD 3.1 reference genome, the genomic regions with
   continuous reads and uniquely mapped reads ≥2 formed
   transcription-active regions. We connected the various
   transcriptionally active regions to form a potential gene model using
   the paired-end data. The extensions of the 5′ and 3′ boundaries were
   determined by comparing the potential gene model with the existing gene
   annotation.

Analysis of Differentially Expressed Genes

   Gene expression level was normalized by considering RPKM value [83][4].
   Differentially expressed genes and their corresponding P-values were
   determined using the methods described by Audic and Claverie [84][30].
   The significance threshold of the P-value in multiple tests was set
   based on the FDR. The fold changes (log[2]Ratio) were also estimated
   according to the normalized gene expression level in each sample. We
   used “FDR ≤0.001 and |log[2]Ratio| >2” as the threshold to judge the
   significance of gene expression differences.

Gene Ontology Annotation

   The differentially expressed genes were classified into the categories
   of cellular component, molecular function, and biological process using
   the GO annotation. The hypergeometric test was applied to map all
   differentially expressed genes to terms in the GO database
   ([85]http://www.geneontology.org/) [86][20] and to search for
   significantly enriched GO terms in differentially expressed genes by
   comparing them to the genome background. The calculated P-values were
   corrected using the Bonferroni correction, taking the corrected P-value
   ≤0.05 as the significance threshold.

KEGG Pathway

   KEGG is the major public pathway-related database. Pathway-based
   analysis helps to further understand the biological functions of genes.
   The calculation formula is the same as that used for the GO analysis.
   Pathway enrichment analysis identifies significantly enriched metabolic
   pathways or signal transduction pathways in differentially expressed
   genes by comparison against the whole genome background. The calculated
   P-values were corrected using the Bonferroni correction, taking
   pathways with Qvalues ≤0.05 as significantly enriched in differentially
   expressed genes.

Detection of Putative SNPs

   During our experiment, we used SOAPsnp software [87][21] to detect SNPs
   between two samples based on the massively parallel Illumina GA
   technology. SOAPsnp is a member of the SOAP (Short Oligonucleotide
   Analysis Package) software. The program is a resequencing utility that
   can assemble consensus sequences for the genome of a newly sequenced
   individual based on the alignment of the raw sequencing reads to a
   known reference genome. The SNPs can then be identified in the
   consensus sequences by a comparison with the reference.

Validation of RNA-Seq Data

   We used qRT-PCR, the current gold standard for the quantification of
   mRNA, to validate the repeatability and reproducibility of gene
   expression data obtained by RNA sequencing in cattle. 47 differentially
   expressed genes with the highest significance (as identified by
   RNA-seq) were chosen for validation (the set includes both up- and
   down-regulated genes). As described in the methods section dealing with
   RNA isolation, expression was measured in eight (or nine) different
   tissue samples from three pooled individual embryos, newborns, and
   adults, respectively. Measurements were averaged over all tissues and
   compared to RNA-Seq results. The first-strand cDNA was synthesized
   using superscript II reverse transcriptase (Takara). Gene-specific
   primers were designed based on the gene sequence using Primer Premier
   5.0 (Premier Biosoft) software. All primers for validation were
   designed to cross exon-exon junctions and are shown in [88]Table S5.
   The cDNA was diluted 1∶10 in RNA-free water for target gene
   measurements, with the β-actin gene used as a control. qRT-PCR
   reactions were performed in triplicate using SYBR Premix Ex Taq^TM II
   (Takara) on an Bio-Rad CFX 96 Real-Time PCR Detection system (Bio-Rad
   Laboratories, Foster City, USA) using the following program: 95°C for
   30 sec, then 40 cycles at 95°C for 5 sec, 60°C for 30 sec, 72°C for 45
   sec, followed by a final extension step at 72°C for 5 min. The relative
   gene expression values were calculated using the 2^−ΔΔCt method
   [89][31].

Supporting Information

   Figure S1

   Comparison of changes in 47 differentially expressed genes from qRT-PCR
   in three developmental stages.

   (DOC)
   [90]Click here for additional data file.^ (19.2MB, doc)
   Table S1

   Novel transcript units identified in Emb135 and 30M pooled samples.

   (XLS)
   [91]Click here for additional data file.^ (8.2MB, xls)
   Table S2

   The number of each type of alternative splicing events in longissimus
   muscle tissues.

   (XLS)
   [92]Click here for additional data file.^ (4.1MB, xls)
   Table S3

   The detailed description of the 5′ and 3′ end extension of gene in each
   sample.

   (XLS)
   [93]Click here for additional data file.^ (2MB, xls)
   Table S4

   Differentially expressed genes in Emb135-VS-30M.

   (XLS)
   [94]Click here for additional data file.^ (1.2MB, xls)
   Table S5

   Primers for Real-time RT-PCR of 47 DEG genes randomly selected for high
   or low expression levels.

   (XLS)
   [95]Click here for additional data file.^ (27.5KB, xls)
   Table S6

   Gene Ontology (GO) assignment of the bovine transcriptome for the
   biological process, molecular function and cellular component category.

   (XLS)
   [96]Click here for additional data file.^ (5.4MB, xls)
   Table S7

   List of KEGG pathway categories.

   (XLS)
   [97]Click here for additional data file.^ (92KB, xls)
   Table S8

   Detection of putative SNPs.

   (XLS)
   [98]Click here for additional data file.^ (13.4MB, xls)

Acknowledgments