Abstract

   Body colours are important and striking features for individual
   survival and reproductive success, in particular in vertebrates where
   mating behaviour and mate preference may be strongly influenced by
   non-normal phenotypes. Pigmentation disorders may be generated by
   disruption of one or many independent genes as well as by environmental
   factors. The first discovery of albino yellow catfish (Pelteobagrus
   fulvidraco Richardson) with golden skin colour from fish farms in China
   provides us valuable material to study the molecular mechanism
   underlying the abnormalities of pigmentation. In this study,
   transcriptome sequencing of fin tissues corresponding to the distinct
   body colours, wild type and mutant albino yellow catfish, were
   performed using Illumina sequencing technology. Based on
   next-generation sequencing technology and de novo assembly, we
   generated a transcriptome of P. fulvidraco. A number of genes
   differentially expressed between the wild types and albinos were
   identified, suggesting their contribution to the different phenotypes
   and fitness. However, non-synonymous mutations result from single
   nucleotide substitutions residing in coding regions may not contribute
   to such differences. Based on the high-throughput expression data
   generated for the two different types of P. fulvidraco, we found that
   alterations of expression pattern may be more common than
   non-synonymous mutations. The transcriptome of P. fulvidraco will be an
   invaluable resource for subsequent comparative genomics and
   evolutionary analyses of this economically important fish.

Introduction

   Body colours and colour patterns are significant and striking
   phenotypic traits in vertebrates and have important functions such as
   contributing to mate selection, species recognition, crypsis, and
   warning or threatening of predators. Disorders of pigmentation can
   cause variation in body colour, among which albinism was reported as an
   inborn error of metabolism proposed by Garrod [[39]1]. Albinism is
   characterized by a series of genetic abnormalities, which is caused by
   a deficiency of melanocytes or the malfunction of melanin production
   [[40]2]. Albino individuals occur in a variety of vertebrates including
   mammals [[41]3], reptiles [[42]4], amphibians [[43]5], birds [[44]6]
   and fishes [[45]7] and are usually accompanied by a series of pathology
   changes, such as anemia, osteopetrosis,reduced
   eyes,deafness,neurological and skeletal defects [[46]8–[47]10]. Albino
   individuals typically experience much higher levels of selection (i.e.,
   phenotype-specific mortality and/or reduced reproductive fitness) in
   the natural environment than normal individuals and as a consequence
   albinos in the wild are usually rare. Pigment cells of vertebrates are
   derived from the dorsal neural crest during embryonic development. So
   far, at least seven types of pigment cells have been reported,
   including melanocytes (black or dark brown), xanthophores (yellow),
   erythrophores (red/orange), iridophores (reflective silver, blue, or
   gold), leucophores (white), cyanophores (blue) and irido-erythrophores
   (violet) [[48]11, [49]12]. Only bony fishes possess all seven types of
   pigment cells [[50]11]. These cells are concerned because of their
   medical importance since a number of human diseases result from
   abnormal melanocyte development, such as piebaldism, vitiligo,
   albinism, and abnormal growth and dispersal of melanoma. And the
   studies of colour pattern formation, neural crest development, and the
   cell biology and biochemistry of melanin formation and distribution in
   animals are also performed [[51]13].

   At present, over 100 mutations associated with albinism have been
   identified and described in mice and humans (see albinism database:
   [52]http://www.ifpcs.org/albinism/). Colour mutations are also a common
   phenomenon in fish, especially in tropical ornamental fish, such as in
   group of Cichlid fish [[53]7], paradise fish (Macropodus percularis),
   swordtail (Xiphophorus helleri) and goldfish (Carassius auratus)
   [[54]14]. Albinism has also been observed in medaka (Oryzias zatipes),
   as well as in grass carp (Ctenopharyngodon idella) [[55]15], rainbow
   trout (Oncorhynchus mykiss) [[56]16], Japanese flounder (Paralichthys
   olicaceus) [[57]17], black scabbard-fish (Aphanopus carbo) [[58]9] and
   the channel catfish [[59]18]. In most of these fishes, albinism is
   inherited either in an autosomal recessive manner or in an incomplete
   dominance manner [[60]15, [61]19, [62]20].

   Due to its biological and medical significance, the molecular mechanism
   of melanin biosynthesis has been extensively studied and the conserved
   melanin formation pathway has been elucidated in vertebrates
   [[63]21–[64]23]. Melanin is produced in the melanosome of melanocytes,
   which originate from melanoblast precursors. These precursor cells were
   developed from the dorsal neural tube during embryonic development, and
   migrate to their targeted destinations along defined pathways where
   they mature into functional melanocytes [[65]1]. These processes are
   strictly controlled by many genes, among which tyrosinase (TYR) is the
   most important melanogenesis-related enzyme involved in the tyrosine
   metabolism pathway. Tyrosinase can catalyse the conversion of
   L-tyrosine to L-DOPA and regulate both the speed and specificity of
   melanogenesis [[66]24]. In addition, tyrosinase-related protein 1
   (TYRP1) and dopachrome tautomerase (DCT or TYRP2) are also important
   enzymes in melanin synthesis [[67]25] and mainly regulate the eumelanin
   pathway. Recently, the slc7a11 (encoding solute carrier family 7 member
   11) gene has been reported to be a critical genetic regulator for
   pheomelanin synthesis in hair and melanocytes [[68]26].

   Pigmentation in coat colour mutations have been studied over 100 years,
   while most studies focused on the mammals and birds [[69]3, [70]6,
   [71]27]. Limited studies were reported about fish colour mutations, of
   which most reports were about colour gene cloning and expression. In
   recent years, there were sporadic reports of genomic analysis, which
   are performed to determine the complete set of duplicated pigmentation
   genes in fish to better understand how pigment synthesis pathways have
   been affected by the FSGD (fish specific whole-genome duplication)
   [[72]27]. Xu et al. not only presented a draft genome of domesticated
   common carp, but also performed transcriptomic analyses between two
   distinct domesticated strains with body colour mutation (Songpu and
   Hebao), which provided valuable resources for genetic breeding and
   molecular studies of pigmentation formation in common carp [[73]26].
   Wang et al. also compared the skin transcriptome between two oujiang
   color varieties of common carp [[74]28]. More studies of different fish
   species with colour mutations are needed to better understand the
   molecular mechanism underlying the pigmentation in teleosts.

   Yellow catfish, a teleost belonging to the family Bagridae, is one of
   the most important economic freshwater species in China due to its
   excellent meat quality. This species is extensively cultured in East
   and South Asia in semi-intensive and intensive systems [[75]29]. In
   2012, total production of yellow catfish reached 256,650 tonnes in
   China [[76]30]. During the breeding of yellow catfish, we found a small
   number of fish with golden-pink eyes and golden colour all over the
   body ([77]Fig 1). We consider this colour variation to result from an
   absence of melanin pigmentation, just like albinism reported in other
   species. Given the complexity of colour variation and albinism, an
   understanding of how melanin pigmentation is regulated has a long way
   to go. Transcriptome analysis of yellow catfish with golden colour will
   help us to better understand the mechanism for pigment biosynthesis. In
   this study, we sequenced the transcriptome of the normal and the newly
   discovered albino yellow catfish strain (thereafter called wild type
   and albino type, respectively) using Illumina sequencing technology. We
   report the differentially expressed genes (DEs) and mutations that may
   contribute to the phenotypic differences between these two strains,
   which may provide genetic information for fish colour variation.

Fig 1. Wild and albino yellow catfish.

   [78]Fig 1
   [79]Open in a new tab

Materials and Methods

Ethics statements

   This study was approved by the Institutional Animal Care and Use
   Committees (IACUC) of Huazhong Agricultural University. No specific
   permits were required for the field studies described here. The study
   area is not privately-owned or protected in any way, and the field
   studies did not involve endangered or protected species.

2.1 Sampling

   Wild type and albino yellow catfish for one year old (30 individuals
   for each type, ~20g each) were collected from a fish pond in Jingzhou,
   Hubei Province, and were transported to Huazhong Agricultural
   University. The yellow catfish were acclimated in an indoor tank with
   sufficient oxygen and fed daily at 9:00 am for 2 weeks before the
   experiment.

2.2 RNA extraction, cDNA library creation and Illumina sequencing

   In order to identify potential genes subject to single-base
   substitution mutation and differential expression that may result in
   different phenotypes between wild types and albinos, we extracted and
   sequenced the transcriptome of both strains. For each type, we
   extracted and mixed equal quantity of total RNA extracted from anal
   fins from two individuals, and the same procedure was repeated one more
   time but using ten individuals instead to increase the reliability of
   the data. Because sequence data from public databases for P. fulvidraco
   is scarce, we extracted and mixed equal quantities of total RNA from
   skin, liver, spleen, kidney, head-kidney, and muscle of one individual
   for each strain to obtain coding sequences for this species. All
   individuals were euthanized with 300 mg/L tricaine methanesulfonate (MS
   222) before dissection and total RNA was extracted using Trizol reagent
   (Invitrogen, USA) according to the manufacturer’s instructions. The RNA
   purity and integrity were evaluated by the ratio of OD[260]/OD[280] and
   RIN value, respectively. RNA samples with OD[260]/OD[280] ratio > 1.7
   and RIN value > 7.5 were selected for deep sequencing. For
   transcriptome sequencing using Illumina’s sequencing technology, six
   cDNA libraries with insert sizes of 300–500 base pairs (bp) were
   prepared ([80]Table 1). All libraries were sequenced using Illumina
   (San Diego, CA, USA) HiSeq 2000 platforms according to the
   manufacturer’s protocols with a read length of 100 bp. It should be
   noted that libraries of mixed total RNA from anal fins of ten
   individuals for each type (samples 5 and 6 in [81]Table 1) were
   sequenced twice, once as paired-end sequence and once as single-end.

Table 1. Profiles of each library for the two strains subject to deep
sequencing.

   Sample Strains Number of individuals Tissue Reads type Number of clean
   reads1 Number of clean reads2 SRA accession number
   1 wild 1 skin, liver, spleen, kidney, head-kidney, and muscle
   paired-end 23,206,406 23,006,146 SRR1630913
   2 albino 1 skin, liver, spleen, kidney, head-kidney, and muscle
   paired-end 19,813,299 19,632,145 SRR1630912
   3 wild 2 anal fin paired-end 19,920,426 19,732,153 SRR1630911
   4 albino 2 anal fin paired-end 24,111,886 23,910,559 SRR1630910
   5 wild 10 anal fin paired-end 25,564,940 25,462,585 SRR1630906
   single-end 23,880,719 - SRR1630902
   6 albino 10 anal fin paired-end 22,479,475 22,388,185 SRR1630909
   single-end 19,996,564 - SRR1630908
   [82]Open in a new tab

   Clean reads are clipped and trimmed reads.

2.3 Data filtering, de novo assembly and annotation

   Quality controls for original reads of each library were mainly
   processed using the fastx-toolkit
   ([83]http://hannonlab.cshl.edu/fastx_toolkit/). First, sequencing
   adapters were clipped using fastx_clipper. Then fastq_quality_trimmer
   was used to trim the remaining data and clipped reads < 20 bp or
   nucleotides with phred quality scores lower than 20. Subsequently,
   fastx_quality_stats was used to summarize each library and GALAXY
   ([84]http://galaxyproject.org/) was used to draw quality score boxplots
   and nucleotide distribution charts to evaluate the sequencing quality.

   We combined all reads1 and reads2 separately and assembled them into
   contigs using Trinity software [[85]31]. To reduce redundancy and to
   preserve as many coding sequences as possible, we extracted open
   reading frames for each contig using TransDecoder
   ([86]http://sourceforge.net/projects/transdecoder/) and clustered
   identical proteins using CD-HIT [[87]32]. The corresponding nucleotide
   sequences were retained to generate a non-redundant transcriptome that
   was used for subsequent analyses. To evaluate the assembly, we compared
   the non-redundant transcriptome to protein sequence data deposited in
   UniProt ([88]http://www.uniprot.org/) using BLASTX [[89]33]. We
   employed an E value of 10^−20, and calculated the coverage between the
   transcript and its top hit protein. The analyses were performed for
   Swiss-Prot and TrEMBL, separately.

   Functional annotations of the assembled sequences were achieved
   following the SFG pipeline [[90]34]. Briefly, each sequence of the
   non-redundant transcriptome was first translated into amino acids in
   six frames and searched against GenBank's non-redundant protein
   database (NR), UniProt Swiss-Prot, and TrEMBL protein databases using
   local BLASTX with an E value of 10^−4. The proteins with the highest
   sequence similarity were retrieved and annotated to each contig. Gene
   ontology (GO) terms [[91]35] were obtained after parsing the
   corresponding UniProt flat files for each contig, and WEGO software
   [[92]36] was then used to produce a GO functional classification for
   all the contigs to interpret the distribution of the species’ gene
   functions.

2.4 Differential expression analysis

   Paired-end reads generated from mixed total RNA of anal fin for each
   sample (samples 3, 4, 5, and 6 in [93]Table 1) were used to identify
   genes probably subject to differential expression between wild types
   and albinos. For each sample, paired-end reads were mapped back to the
   non-redundant transcriptome using bowtie [[94]37]. The program
   rsem-calculate-expression within the software package RSEM [[95]38] was
   used to estimate the expression levels for each sequence using the
   expectation—maximization (EM) algorithm. The expression profiles
   generated for each sample were combined and were used to detect
   differentially expressed genes between the two strains using edgeR
   [[96]39] and EBSeq [[97]40]. For edgeR, only those contigs with at
   least ten counts across samples were tested for diff-expression. For
   both tests, contigs with fold change value > 2 and the adjusted P value
   < 0.05 according to the False Discovery rate method of Benjamini and
   Hochberg [[98]41] were selected as differentially expressed candidates.

2.5 Mutation sites analysis

   All reads generated by this work were used to identify mutation sites
   by comparing the albino strain with the wild type strain. For each
   sample, all reads were mapped back to the non-redundant transcriptome
   using bwa (-n 0.005-k 5) [[99]42], and all duplicate reads were removed
   using the MarkDuplicates program from the software package Picard
   ([100]http://broadinstittute.github.io/picard). Then we realigned the
   alignments around indels using the Genome Analysis Toolkit (GATK,
   [101]https://www.broadinstitute.org/gatk), and reads with minimum
   mapping quality < 20 were removed using samtools [[102]43]. The
   processed alignments for all samples were combined using samtools and
   were used to generate a synchronized file using the program
   mpileup2sync included in the software package PoPoolation2 [[103]44].
   The synchronized file listed allele frequencies for every population at
   every base in the non-redundant transcriptome in a concise format.
   Sites with base frequencies more than five in every wild type samples
   but absent in every albino samples were identified as
   albefaction-related mutations and were compared with the predicted
   transcript structure to decide if they were located in five prime UTR
   (untranslated region), three prime UTR, or coding regions.

2.6 Pathway enrichment analysis

   Contigs with mutation sites that changed the encoded amino acids or
   were identified as differentially expressed candidates were separately
   subject to pathway enrichment analysis using KOBAS2.0 [[104]45] with
   Danio rerio (zebrafish) as the reference species.

2.7 Colour genes analysis

   Using the reciprocal best blast hit method (RBB), orthologs between D.
   rerio and P. fulvidraco were identified. Briefly, the proteome of D.
   rerio was downloaded from Ensembl
   ([105]http://www.ensembl.org/index.html, version 75), and RBB was
   performed between proteomes of the two species using blastp with the E
   value set to 1e-9. We also required that ortholog pairs overlapped more
   than 100 amino acids and more than 60% of them were identical. After
   that, colour-related genes in P. fulvidraco were extracted according to
   gene names ([106]http://www.espcr.org/micemut/) and patterns of their
   expressions and mutations were scrutinized.

2.8 Experimental verification

   In order to validate the RNA-Seq results, quantitative realtime PCR
   (qRT-PCR) was performed on the Roche LightCycler 480 Real-Time PCR
   System with three replicates using the SYBR Green PCR kit (Roche). The
   samples used for qRT-PCR were different from the samples used for
   RNA-Seq, and were collected from the anal fins of five new individuals
   for each strain. The gene-specific primers were designed based on the
   differentially expressed genes. The β-actin was used as an internal
   reference. The relative expression levels of DE genes were normorlized
   to β-actin (2^-ΔΔCt), and all data were expressed as means±SEM.
   Statistical analysis was performed using GraphPad Prism 5.0 (unpaired t
   test). The gene and primer information is shown in [107]Table 2.

Table 2. List of primers for the qRT-PCR validation of DEs identified by
RNA-Seq.

   Gene Primers sequences(5'to3') Annealing temperature(°C) Product
   Size(bps) Contig name
   tyr Tyr-1278-F CTACATGGCACCCTTCATACC 60 93 comp71391_c1_seq1
   Tyr-1370-R AGGGTCTTGAAGGTAGGAGTA
   ednrbla ednrbla-F1 CTGTGGAAGTAACCCTCATCTG 58 166 comp76216_c0_seq17
   ednrbla-R1 CAGCCACCAGTCCTTTATATCC
   tyrp1 TYRP1-RT-F2 GAATCCAGCAGGGAACGTGA 58 145 comp70296_c0_seq1
   TYRP1-RT-R2 GCCCTCGATGGTGTTTCTGA
   slc7a11 slc7a11-RT-F1 GGAATCTTCATCGCACCC 58 115 comp77711_c1_seq25
   slc7a11-RT-R1 CAGCGTAGGACAAGGCAC
   Melar-5 Melar-5-RT-F1 TATGTGCTTCATGTCCCACT 58 116 comp52331_c0_seq1
   Melar-5-RT-R1 TCTCCTTGAGCGTCTTGC
   CD8a CD8a-RT-F1 GGACAGCGGCACTTACAG 58 287 comp73354_c0_seq23
   CD8a-RT-R1 TGACCAGAAGCAGGAGCA
   GIMAP7 GIMAP7-RT-F1 GGATTGTGCTGGTGGGAAAG 58 289 comp77845_c0_seq21
   GIMAP7-RT-R1 TTGGTGAATCTTCCGGGTTGT
   Utrophin Utrophin-RT-F1 GTGCAACTTTCGTCCACCAC 58 266 comp80248_c1_seq8
   Utrophin-RT-R1 AGGGAATGGTAGAGCTCGGT
   Lipoprotein Lipop-R-RT-F1 ATGGTGGATTGCGTCATA 58 249 comp71875_c1_seq1
   receptor Lipop-R-RT-R1 TCACTTGGTTCCCTTCTT
   ETFS EFTS-RT-F1 CAAGACTACGGGCTACAC 58 204 comp69376_c0_seq2
   EFTS-RT-R1 TTCACCTCCACCATAACA
   CD99 CD99-RT-F1 CAAGACTACGGGCTACAC 58 212 comp75592_c3_seq4
   CD99-RT-R1 TCTCACAGTTCACCTCCA
   HSP70 HSP70-RT-F1 AAACTGGCTGTATGAGGAT 58 101 comp76788_c1_seq2
   HSP70-RT-R1 TGTATCGGTCTTGAATGG
   IRF-7 IRF7-RT-F1 CTCAGGCATCTACGGCTA 58 114 comp80561_c1_seq7
   IRF7-RT-R1 AGTTCATTCTGGCTTTGG
   MYCBP2 MYCBP2-F1 CATCAGGAACCGAAAGGAGAA 58 150 comp81235_c1_seq1
   MYCBP2-R1 TGGAGACCAGGCATTGTAATAG
   comp73484_c0_seq4 LOC-RT-F1 GCACAGGCGTCTTTCTCAAC 58 121
   comp73484_c0_seq4
   LOC-RT-R1 AGTCTCAGCAATAGCAGCCG
   Neoverrucotoxin Neo-RT-F1 TGGAGACCAGGCCTGTGATA 58 154 comp73484_c0_seq5
   Neo-RT-R1 ATCCCCCGTCATGTCTCTGA
   TRIMP-29 TRIMP-RT-F2 TGCTGCTCCGATTCTTGTCC 58 188 comp73484_c0_seq10
   TRIMP-RT-R2 GCTTAGGTCCAGTTGCGACA
   comp73484_c0_seq7 comp–RT-F2 GAGGGTTGGCAGTCCATTCA 58 159
   comp73484_c0_seq7
   comp–RT-R2 GGACCATACGCGAACAGGAA
   Calpastatin Calpas-l-RT-F2 ACGACTCACTTCCTCCCG 58 237 comp73354_c0_seq5
   Calpas-l-RT-R2 CCAGTTGGTCCATTACCG
   Ptprc Ptprc-RT-F1 GACAGTGCGGACACAGAGAA 58 287 comp81041_c0_seq19
   Ptprc-RT-R1 GCTCCCTCTTCAGGCGATAC
   [108]Open in a new tab

Results

3.1 Sequencing, assembly and annotation

   The number of clean reads (clipped and trimmed reads) obtained was > 20
   M from most libraries ([109]Table 1). Integrating these data and using
   a de novo assembly algorithm, we obtained 310,143 contigs > 200 bp.
   According to the TransDecoder predictions, there were 106,201 contigs
   encoding 144,487 proteins longer than 100 amino acids. Identical
   predicted protein sequences were clustered and the corresponding
   nucleotide sequences were extracted to form a non-redundant
   transcriptome, which included 64,398 sequences with the total length
   reached more than 135 Mb. The lengths of most contigs were < 1,000 bp,
   but there were also many contigs > 5,000 bp ([110]Fig 2). The N50
   contig length, median contig length, and average contig length of the
   non-redundant transcriptome were 3,087 bp, 1,604 bp and 2,098 bp,
   respectively.

Fig 2. Contig length distribution of the P. fulvidraco transcriptome.

   [111]Fig 2
   [112]Open in a new tab

   By comparing with protein sequences deposited in NR, Swiss-Prot and
   TrEMBL, we annotated 61,184 (95%), 54,495(84.6%), and 59,314 (92.1%)
   sequences within the non-redundant transcriptome of the yellow catfish,
   respectively. [113]Table 3 shows summary of the coverage of predicted
   proteins overlapping the homologous proteins deposit in Swiss-Prot and
   TrEMBL. GO terms were assigned to 55,032 sequences according to their
   homologs annotated by UniProt. Within these sequences, about 89.6%
   (49,318) were assigned to cellular component terms, 82.6% (45,473) were
   assigned to molecular function terms, and 80% (44,003) were assigned to
   biological process terms. Most represented GO terms in cellular
   components, molecular functions and biological processes were “cell”
   and “cell part”, “binding” and “cellular process”, respectively
   ([114]Fig 3).

Table 3. The coverage of predicted proteins compared with the best hit
homologs available in UniProt Swiss-Prot and TrEMBL.

   Coverage (%) Number of proteins deposit in Swiss-Prot Number of
   proteins deposit in TrEMBL
   90 ~ 100 8,660 13,078
   80 ~ 90 2,049 2,442
   70 ~ 80 1,504 2,024
   60 ~ 70 1,359 2,105
   50 ~ 60 1,393 2,182
   40 ~ 50 1,415 2,297
   30 ~ 40 1,536 2,329
   20 ~30 1,629 2,378
   10 ~ 20 1,299 1,828
   0 ~ 10 0 0
   [115]Open in a new tab

Fig 3. The distribution of GO terms assigned to the P. fulvidraco
transcriptome.

   [116]Fig 3
   [117]Open in a new tab

3.2 Differentially expressed genes (thereafter called DEs)

   Using edgeR, we detected 27 genes that were probably DEs between the
   wild type and albino strains (|logFC| > 1 and adjusted P value < 0.05),
   of which 17 (63%) were up-regulated in wild types and 10 (37%) were
   up-regulated in albino types. While using EBSeq, the number of DEs
   between wild types and albinos reached 813 (PostFC < 0.5 or > 2 and
   adjusted PPDE < 0.05), of which 286 (35.2%) were up-regulated in wild
   types and the remaining 527 (64.8%) were up-regulated in albino types.
   All contigs detected by edgeR except for contig comp72223_c2_seq1,
   which was annotated as “high mobility group protein B2”, were also
   detected by EBSeq. Summaries of the DEs are shown in [118]S1 and
   [119]S2 Tables compares results generated by edgeR and EBSeq for all
   DEs identified in this report. Summaries of DEs with adjusted P values
   less than 0.001 according to both of the two methods, which should be
   regarded as definitely DEs, are shown in [120]Table 4. According to the
   pathway enrichment analyses, the DEs identified by EBSeq may
   participate in a number of KEGG pathways such as “Mismatch repair”,
   “ECM-receptor interaction”, etc ([121]S3 Table).

Table 4. Summaries of differentially expressed genes (adjusted P value or
PPDE < 0.001) between wild and albino types according to both edgeR and
EBSeq.

   ContigName PPDE PostFC FDR1 logFC PValue FDR2 Uniprotmatch evalue
   Description
   comp80248_c1_seq8 1 0.001454 0 -12.689 2.47E-11 1.15E-06 [122]P46939 0
   Utrophin
   comp73484_c0_seq4 1 678.8816 0 12.60986 7.67E-11 1.32E-06 E7FBQ2
   7.00E-110 Uncharacterized protein
   comp73484_c0_seq5 1 622.9236 0 12.48535 8.46E-11 1.32E-06 [123]A0ZSK3
   2.00E-39 Neoverrucotoxin subunit alpha
   comp73354_c0_seq38 1 0.001667 0 -12.4278 5.90E-10 6.88E-06 C6KE06
   2.00E-58 CD8 alpha
   comp73484_c0_seq10 1 444.998 0 11.99936 2.38E-09 2.22E-05 [124]Q8R2Q0
   9.00E-07 Tripartite motif-containing protein 29
   comp73484_c0_seq7 1 538.12 0 11.93033 4.23E-09 3.29E-05 E7FBQ2
   2.00E-136 Uncharacterized protein
   comp73354_c0_seq43 1 627.7454 0 12.49547 1.30E-08 8.68E-05 [125]Q2HYK0
   2.00E-69 Calpastatin-like protein
   comp73354_c0_seq23 1 0.0026 0 -11.7952 1.61E-08 9.40E-05 C6KE06
   2.00E-58 CD8 alpha
   comp81235_c1_seq1 1 388.2891 0 11.801 1.19E-07 0.000615 [126]O75592 0
   E3 ubiquitin-protein ligase MYCBP2
   comp71875_c1_seq1 1 0.002816 0 -11.6659 1.99E-07 0.000821 [127]Q28832 0
   Low-density lipoprotein receptor
   comp73354_c0_seq5 1 271.2564 0 9.40446 2.09E-07 0.000821 [128]Q2HYK0
   3.00E-69 Calpastatin-like protein
   comp81041_c0_seq19 1 356.9776 0 11.67911 2.11E-07 0.000821 [129]Q9IBD8
   0 Receptor-type tyrosine-protein phosphatase C
   comp77845_c0_seq21 1 0.002246 0 -11.5598 2.51E-07 0.000902 [130]Q8NHV1
   5.00E-34 GTPase IMAP family member 7
   [131]Open in a new tab

   PPDE posterior probability of differential expression

   FC fold change

   FDR1 adjusted P value

   FDR2 adjusted PPDE

3.3 Mutation analyses

   We detected 806 sites resident in 616 genes that possessed one or more
   nucleotides present in all wild types but absent in all albinos. That
   is to say, we assume that single-base substitution mutations at these
   sites may disable the normal function of the gene and result in
   albinism. However, none of these sites was homozygote in all samples of
   the wild type; summaries of these sites are shown in [132]S4 Table.
   According to the analytical predictions, 572 (71%) mutation sites
   within 455 genes were probably located in coding regions, and 70 (8.7%)
   mutation sites within 54 genes were probably located in the five prime
   UTR, with 255 (31.6%) mutation sites being located within 203 genes
   that were probably located in three prime UTR. Amino acids were changed
   by 129 single-base substitution mutations that were possibly located in
   coding regions. The results of pathway enrichment analyses for the
   genes where these sites reside in are shown in [133]S3 Table, and the
   most represented GO terms assigned to these genes were “integral
   component of membrane”, “cytoplasm” and “protein binding” ([134]S5
   Table). Interestingly, we found one of the differentially expressed
   candidates between normal and albino fishes (serial number:
   comp81166_c0_seq1, FC > 15, [135]S2 Table), which was annotated as “MHC
   class I antigen”, harboring one nonsynonymous site in the coding region
   according to the mutation analyses. We also detected 2 deletions in 2
   contigs that were homozygote in albino types, but they were both
   located in the three prime UTR, which will not result in a frameshift
   or premature stop codons that may disable the protein.

3.4 Colour-related genes and experimental verification

   We assigned D. rerio genes as orthologs for 11,978 P. fulvidraco
   contigs using the RBB method, within which 140 contigs were identified
   as differentially expressed genes using edgeR and/or EBSeq ([136]S2
   Table). However, this number should be increased because of the long
   time divergence between the orders Cypriniformes and Siluriformes
   (about 250 Mya) [[137]46] and the stringent parameters we used for
   identifying orthologs. Five colour genes, namely tyr, tyrp1, slc7a11,
   MREG, and ednrb1a, were identified as up-regulated candidates in
   albinos according to EBSeq. According to edgeR, the expression pattern
   was similar, but the adjusted P value was not significant. Mutations of
   these genes will result in disorders of pigmentation such as albinism
   ([138]http://www.espcr.org/micemut/). Quantitative RT-PCR experiments
   were performed to confirm the expression patterns of the four
   color-related genes since their up-regulation in albino strains
   compared to wild strains is a bit counter-intuitive. Four genes, tyr,
   tyrp1, slc7a11 and ednrb1a, were successfully amplified and their
   expression patterns were in accord with the results obtained using
   high-throughput data. Moreover, no single-base substitution mutations
   were found in colour-related genes.

   To further validate the DE genes detected by RNA-Seq, another sixteen
   candidates (twelve were from [139]Table 4 and another 4 were randomly
   selected from [140]S2 Table) were selected and qRT-PCR amplification
   confirmed the expression patterns of 13 genes. For the other three DE
   candidates (lipoprotein receptor, Utrophin and Ptprc), no significant
   difference was detected ([141]Fig 4).

Fig 4. The results of qRT-PCR for the DEs identified by RNA-Seq and subject
to validation.

   [142]Fig 4
   [143]Open in a new tab

Discussion

   In this report, a transcriptome represented most genes expressed in P.
   fulvidraco was generated and annotated based on the sequence homology
   to proteins deposit in public databases. The coverage of predicted
   proteins showed that 14,965 (23.2%) sequences overlapping for more than
   50% of the length of homologous proteins deposit in Swiss-Prot, of
   which 10,709 (16.6%) sequences covered more than 80% ([144]Table 3).
   Compared with homologous proteins in TrEMBL, the number of sequences
   overlapping for more than 50% of their length reached 21,831 (33.9%),
   of which 15,520 (24.1%) sequences overlapped more than 80%. Given the
   very strict parameter used (E value = 1e-20) when searching for
   homologs, the statistics presented above are expected to be
   conservative. Nonetheless, these statistics suggest that we have
   generated a high quality transcriptome for P. fulvidraco. A portion of
   contigs without any hits may represent lineage-specific genes and may
   partially explain the uniqueness of P. fulvidraco. Compared to other
   RNA-Seq reports, the ratio of annotated contigs was moderately high
   [[145]47]: this is because we removed all contigs except those encoding
   more than 100 amino acids.

   Based on the de novo assembled transcriptome, a number of
   differentially expressed genes were identified. Elsewhere,
   Seyednasrollah et al. compared eight packages detecting DEs based on
   real data and claimed that the number of detected DEs increased when
   the number of replicate samples was increased for most packages,
   including edgeR and EBSeq [[146]48]. Specifically, edgeR showed a large
   range of variability when the number of replicates was very small,
   while EBSeq may be too liberal. As a result, EBSeq detected many more
   DEs than edgeR when using only two replicate samples; increasing the
   number of replicates is expected to narrow the gap. Simulation and real
   data showed that most genuine DEs ranked at the top when sorting their
   significance such as nominal P value from the smallest to the largest
   [[147]48, [148]49]. This method is not affected by the direction of
   differential expression and edgeR performed better than EBSeq when the
   sample size was very small. Correspondingly, based on the results
   obtained by edgeR and EBSeq, most DEs identified in this report (list
   in [149]S2 Table) ranked in the top 2,000 in order of their
   significance.

   Pathway enrichment analyses of differentially expressed genes
   identified by EBSeq ([150]S3 Table) revealed that eleven candidates
   take part in the pathway “Melanogenesis” and most of these genes,
   including the important tyrosinase (TYR) and tyrosinase-related protein
   1 (TYRP1), were up-regulated in albino types ([151]Fig 5). This may
   result from the up-regulation of cystine/glutamate transporter in
   albino types (slc7a11, [152]S2 Table), which transports cystine into
   melanocytes, switch off the eumelanin (brown to black pigment)
   synthesis pathway and promote the synthesis of pheomelanin (yellow to
   red pigment). The lack of substrates for the synthesis of eumelanin may
   result in the high level of relevant enzymes and corresponding
   transcripts and reduce the melanin synthesis, which contribute to the
   abnormal body colour of the albino strain. The similar results were
   found in common carp [[153]26]. Moreover, KEGG pathways such as
   “mismatch repair”, “base excision repair”, etc., were significantly
   enriched in differentially expressed genes (adjusted P value < 0.05,
   [154]S3 Table). These are cellular mechanisms that repair damaged DNA.
   We suggest that without the protection of sufficient melanin, albino
   fishes were much more easily subject to DNA damage caused by
   ultraviolet (UV). Exposure to UV will trigger intracellular repair
   mechanisms with the result that the expression levels of many related
   genes in albino types were significantly different from those in wild
   types.

Fig 5. Differentially expressed genes between wild and albino types in anal
fins and their involvement in the melanogenesis pathway.

   [155]Fig 5
   [156]Open in a new tab

   Genes marked in red are up-regulated and in yellow represented genes
   are down-regulated in wild type versus in albino type, respectively.
   Comparison between RNA-Seq results (identified by both edgeR and EBSeq)
   and qRT-PCR validation results. X-axis shows genes validated in this
   study; Y-axis shows Log2Ratio of expression of wild versus albino
   types. “*” indicates a significant difference between the two strains.

   Genes harboring mutations that may influence their normal functions in
   albinos were also identified. However, no pathway potentially related
   to different fish body colour was detected among these genes with
   nonsynonymous mutations ([157]S3 Table). Thus, we conclude that
   phenotypic differences between wild types and albinos were mainly the
   result of differential expression of related genes rather than
   single-base substitution mutations in coding regions. This was in
   accordance with published analyses from the genome resequencing
   projects for Saccharomyces cerevisiae (yeast) and Gasterosteus
   aculeatus (threespine stickleback). For both it was claimed that at the
   beginning of divergence, alterations of expression pattern should be
   much more frequent than expected [[158]50, [159]51], and should be even
   more frequent than substitutions of encoded amino acids [[160]51].

Conclusion

   We have generated the transcriptome of the Chinese yellow catfish, P.
   fulvidraco, which will be important for subsequent analyses of this
   economically important fish. More widely, it should also be important
   for comparative genomics and evolutionary analyses in the Siluriformes.
   We conclude that the dominating molecular events underlying phenotypic
   difference between albino and normal P. fulvidraco individuals result
   from differential expression of related genes rather than single
   nucleotide mutations located in coding regions. Moreover, we found
   genes that take part in pathways “mismatch repair” and “base excision
   repair” that may be responses to ultraviolet exposure and result in
   cellular damage, which were significantly enriched in differentially
   expressed genes between wild types and albinos.

Supporting Information

   S1 Table. Summaries of differentially expressed genes between wild and
   albino types identified by edgeR and EBSeq.

   (XLS)
   [161]Click here for additional data file.^ (462KB, xls)
   S2 Table. Comparisons of results generated by edgeR and EBSeq for
   differentially expressed genes between wild and albino types.

   (XLS)
   [162]Click here for additional data file.^ (231KB, xls)
   S3 Table. Pathway enrichment analyses for differentially expressed
   genes between wild and albino types identified by EBSeq and genes in
   which non-synonymous mutations reside.

   (XLS)
   [163]Click here for additional data file.^ (168KB, xls)
   S4 Table. Summaries of mutation sites with one kind of nucleotide
   present in the wild type strain but absent in the albino strain.

   Note that some contigs may encode more than one protein sequence
   according to the prediction.

   (XLS)
   [164]Click here for additional data file.^ (143KB, xls)
   S5 Table. Single-base substitution that was possibly located in coding
   regions and result in non-synonymous mutations.

   (XLS)
   [165]Click here for additional data file.^ (128KB, xls)

Acknowledgments