Abstract

Background

   Plasmodium falciparum causes the deadliest form of malaria, which
   remains one of the most prevalent infectious diseases. Unfortunately,
   the only licensed vaccine showed limited protection and resistance to
   anti-malarial drug is increasing, which can be largely attributed to
   the biological complexity of the parasite’s life cycle. The progression
   from one developmental stage to another in P. falciparum involves
   drastic changes in gene expressions, where its infectivity to human
   hosts varies greatly depending on the stage. Approaches to identify
   candidate genes that are responsible for the development of infectivity
   to human hosts typically involve differential gene expression analysis
   between stages. However, the detection may be limited to annotated
   proteins and open reading frames (ORFs) predicted using restrictive
   criteria.

Methods

   The above problem is particularly relevant for P. falciparum; whose
   genome annotation is relatively incomplete given its clinical
   significance. In this work, systems proteogenomics approach was used to
   address this challenge, as it allows computational detection of
   unannotated, novel Open Reading Frames (nORFs), which are neglected by
   conventional analyses. Two pairs of transcriptome/proteome were
   obtained from a previous study where one was collected in the
   mosquito-infectious oocyst sporozoite stage, and the other in the
   salivary gland sporozoite stage with human infectivity. They were then
   re-analysed using the proteogenomics framework to identify nORFs in
   each stage.

Results

   Translational products of nORFs that map to antisense, intergenic,
   intronic, 3′ UTR and 5′ UTR regions, as well as alternative reading
   frames of canonical proteins were detected. Some of these nORFs also
   showed differential expression between the two life cycle stages
   studied. Their regulatory roles were explored through further
   bioinformatics analyses including the expression regulation on the
   parent reference genes, in silico structure prediction, and gene
   ontology term enrichment analysis.

Conclusion

   The identification of nORFs in P. falciparum sporozoites highlights the
   biological complexity of the parasite. Although the analyses are solely
   computational, these results provide a starting point for further
   experimental validation of the existence and functional roles of these
   nORFs,

Background

   Plasmodium falciparum causes the deadliest form of malaria, which
   impacts over 200 million individuals and results in nearly 450,000
   deaths each year, 60% of which are children aged under 5 years [[27]1].
   It belongs to the large Apicomplexa phylum of diverse eukaryotic
   intracellular parasites. In addition to malaria, this phylum also
   encompasses infectious agents of cryptosporidiosis and toxoplasmosis
   [[28]2]. The latter is caused by Toxoplasma gondii, which is estimated
   to infect over 30% of the world population, and hence considered as one
   of the most successful parasites [[29]3]. These diseases are not only
   life-threatening but also widespread and, therefore, represent a
   serious threat to public health.

   Unfortunately, both vaccines and treatment options with drugs are
   limited due to the biological complexity of the Apicomplexan parasites
   [[30]4]. Even for the well-studied Plasmodium species, there is only
   one vaccine targeted for malaria with relatively low efficacy [[31]5,
   [32]6]. The complexity arises from the highly regulated life cycles
   that allow them to inhabit different hosts and intracellular niches
   [[33]7]. To tackle this challenge, multi-omics data including genomic,
   transcriptomic, and proteomic datasets have been produced to understand
   the biological processes underlying infection and disease causation,
   which can potentially be targeted with drug design [[34]8]. However,
   these studies are often conducted in silos and, therefore, their
   insights are limited due to the lack of consistency [[35]9].

Developmental cycle of Plasmodium falciparum and annotation challenge

   Like many other Apicomplexan parasites [[36]10], P. falciparum is
   specialized to infect two separate hosts—the human host, and the female
   mosquitoes of approximately 40 Anopheles species that can transmit the
   disease [[37]11]. The developmental cycle starts with malaria-infected
   female Anopheles mosquito taking a blood meal, through which the
   sporozoites of Plasmodium parasites are transmitted from the mosquito’s
   salivary gland to the human host (Additional file [38]1: Figure S1).
   The sporozoites then travel to the liver and infect liver cells,
   allowing them to replicate and mature into schizonts. They then rupture
   and release merozoites, which infect red blood cells and is the stage
   that causes clinical manifestation of malaria. Some of the blood-stage
   parasites mature into sexual precursor cells, known as gametocytes,
   which are ingested by the Anopheles mosquito during a blood meal. The
   gametocytes are activated by environmental stimuli inside mosquito
   midgut and differentiate into gametes and fuse to form a zygote
   [[39]12]. The zygotes further develop into oocysts, within which
   sporogony takes place and produces sporozoites to invade mosquito’s
   salivary gland, ready for next round of infection via a mosquito bite.

   The progression from one developmental stage to another in P.
   falciparum involves drastic changes in gene expression [[40]13], and
   infectivity of the parasites to human hosts can also vary greatly
   depending on the stage. For instance, while oocyst sporozoites are
   highly infectious for the mosquito salivary gland, they are
   non-infectious to mammalian hosts; on the contrary, salivary gland
   sporozoites exhibit specific infectivity for mammalian liver, but
   correspondingly loose infectivity for mosquito’s salivary gland
   [[41]14]. Understanding the mechanism for this switch of infectivity is
   of great interest, because inhibiting it allows us to prevent infection
   before the symptomatic phase, which is where malaria intervention
   efforts using vaccines have been focused on [[42]15].

   Approaches to identify candidate genes that are responsible for the
   development of infectivity to mammalian hosts typically involve
   differential gene expression analysis between oocyst sporozoites and
   salivary gland sporozoites [[43]16, [44]17]. However, although these
   studies have successfully identified genes and proteins that are
   upregulated in the mammalian-infectious stage, the detection may be
   limited to annotated proteins [[45]9] and open reading frames (ORFs)
   predicted using restrictive criteria such as a minimum length of 100
   codons [[46]18].

   This problem is particularly relevant for P. falciparum, whose genome
   annotation is relatively incomplete despite its clinical significance
   [[47]19]. Even with the advancement of sequencing techniques and
   computational power, over 33% of predicted genes in the parasite genome
   remain functionally unannotated [[48]20], which represent a great
   challenge for discovering vaccine candidates and therapeutic drugs.
   Furthermore, untranslated region such as 5′ and 3′ UTRs are also absent
   in the annotation, which have been shown to regulate protein expression
   in P. falciparum [[49]21]. The lack of annotation despite community
   efforts can be attributed to several factors. As an Apicomplexan
   parasite, it is evolutionarily distant from other eukaryotic model
   organism where less than 25% of the protein sequences share significant
   similarity with those outside the phylum [[50]22]. Even within the same
   phylum, many of the parasites have evolved pathways for cell invasion
   of their specific host and, therefore, only 12–34% of the
   protein-coding genes are shared by all apicomplexans [[51]23].
   Therefore, there is a pressing need for improving the genome annotation
   of P. falciparum.

   To further complicate the annotation challenge, antisense transcription
   events have widely been discovered in P. falciparum and are thought to
   be important for gene expression regulation and parasite development
   between life cycle stages [[52]24]. Of particular interests are the
   antisense transcripts reported in the var gene regions [[53]25], which
   represent a gene family that encodes for roughly 60 different variants
   of P. falciparum erythrocyte membrane protein-1 (PfEMP1). PfEMP1 is a
   virulence factor that adhere to human receptors and hence enable the
   parasite to evade from the host immune system. To avoid recognition by
   antibodies, P. falciparum switches between the variant forms of PfEMP1
   in a mutually exclusive manner [[54]26]. However, the tightly
   controlled mechanisms underlying var gene transcriptional control
   remain largely unknown. It was recently shown that antisense long
   non-coding RNA (lncRNA) derived from intronic region of var can
   activate the corresponding var gene [[55]27], which may shed light on
   var gene expression control. Unravelling the regulatory roles of these
   noncanonical regions, such as their coding potential, will require
   novel approaches that are specifically tailored to them.

Proteogenomics and novel open reading frames (nORFs) discovery

   One approach that can address the challenges above is proteogenomics,
   which combines the power of genomic, proteomic and transcriptomic data
   to improve genome annotation and the current understanding of protein
   expression (Additional file [56]1: Figure S2) [[57]28]. The traditional
   proteomics analysis detects peptides by matching spectra from tandem
   mass spectrometry (MS/MS) against existing reference protein sequence
   database, which is prone to the assumption that the database contains
   all proteins in the genome of interest [[58]29]. Especially for less
   well-annotated genomes, this assumption will have a more significant
   impact on peptide identification and downstream analysis like protein
   quantification. The peptide search in proteogenomic approach uses
   customized database instead, which can be generated by either six frame
   translation (6FT) of the genome, or by translation of the transcriptome
   [[59]30]. It allows the detection of translation of nORFs that may
   otherwise be neglected, such as those from non-coding regions,
   including antisense transcripts, lncRNAs, intergenic and intronic
   sequences. The definitions of nORFs used in this study are as shown in
   Fig. [60]1a. Conversely, the proteomic data can provide peptide
   evidence for novel transcripts and RNA editing events, further
   improving the gene models. This approach can be iterative as more
   multi-omics data are being generated, and continuously improve genome
   annotation.

Fig. 1.

   [61]Fig. 1
   [62]Open in a new tab

   a The figure illustrates the nORF definitions used in this study.
   Translational products are quantified as peptide-spectral matches and
   classified into these five nORFs categories depending on where they map
   to the canonical gene. b Schematics of the proteogenomics analysis that
   was performed to identify translational products from novel open
   reading frames using transcriptomes and proteomes. oo-spz: oocyst
   sporozoite, sg-spz: salivary gland sporozoite from Lindner et al.
   [[63]37]

   The use of proteogenomic approach has gained its popularity given the
   increasing amount of transcriptome data available thanks to
   next-generation sequencing [[64]31, [65]32], as well as the improved
   accuracy and resolution in MS data for proteomics [[66]33]. For the
   larger human genome, it has been used to identify novel coding regions
   in normal tissues and cancer cell lines, where a significantly larger
   number of novel peptides were discovered in the latter, indicating the
   biological importance of unannotated coding sequences [[67]34]. The
   proteogenomic methods are particularly valuable for non-model organisms
   like P. falciparum, whose genome assembly and annotation tend to be
   incomplete, and protein database construction is difficult owing to the
   lack of close sequence relatives [[68]35]. Previous successful examples
   include Thermococcus gammatolerans, an archaea extremophile whose
   genome was sequenced and annotated using proteogenomic analysis
   [[69]36].

   Here, proteogenomic analysis was performed using the transcriptomic and
   proteomic data obtained from the oocyst sporozoites and salivary gland
   sporozoites of P. falciparum used by a previous study [[70]37]. nORFs
   were detected in both life cycle stages. Bioinformatics analyses
   suggest that these nORFs may be of functional importance with
   interesting properties and are, therefore, worth further experimental
   investigation.

Methods

Dataset

   A literature review was conducted to identify the datasets suitable for
   proteogenomic analysis, which should consist of a pair of proteomes and
   transcriptomes from the same developmental stage. Thanks to the
   substantial research efforts devoted to P. falciparum, several
   transcriptome and proteome datasets have been deposited in the NCBI
   Sequence Read Archive (NCBI-SRA) and the PRIDE database, respectively
   [[71]38]. However, it was difficult to find both transcriptomes and
   proteomes from the same stage, and some are not available in
   retrievable format. For instance, a study by Lasonder et al. [[72]39]
   reported the transcriptomes and associated proteomes of female and male
   gametocytes, but the proteomic data was submitted to PlasmoDB [[73]40],
   which could not be retrieved as raw data. Additionally, omics-data for
   P. falciparum are usually generated from different, independent
   studies, making the downstream analysis susceptible to batch effects.

   As a result, this work focuses on two pairs of transcriptome/proteome
   generated by Lindner et al. [[74]37]. The authors produced
   transcriptomic and proteomic data for oocyst sporozoites from wild-type
   P. falciparum parasites (NF54 strain) as well as salivary gland
   sporozoites, with three biological replicates for each sample type. The
   reasons for narrowing down to these datasets are threefold: firstly,
   the original analysis by Lindner et al. only considered annotated
   genes, leaving room for the discovery of novel ORFs. Secondly, the two
   life cycle stages analysed correspond to the mosquito-infectious and
   human-infectious stage, and therefore novel ORFs detected by
   proteogenomic analysis could help explain the development of
   human-specific infectivity. Finally, since the datasets were produced
   in the same laboratory, the conditions should presumably be similar and
   differential expression is mainly caused by biological differences
   between the developmental stages. The transcriptomic data was
   downloaded from the GEO database (Accession #[75]GSM3109291,
   [76]GSM3109292, [77]GSM3109293 for oocysts sporozoites; #
   [78]GSM3109294, [79]GSM3109295, [80]GSM3109296 for salivary gland
   sporozoites), and the proteomic data from PRIDE (Accession #PXD009726
   for salivary gland sporozoites, #PXD009728 for oocysts sporozoites).

Proteogenomic workflow

   The novelty of proteogenomic analysis comes from matching the MS/MS
   spectra against a customized database constructed [[81]28] based on the
   transcript information, instead of limiting the searches to known
   proteins as in the analysis by Lindner et al. [[82]37] and in most
   proteomic analyses [[83]41]. In this work, a customized database was
   constructed from the transcripts assembled from the HISAT2-StringTie
   pipeline as described by Pertea et al. [[84]42] (Fig. [85]1b). However,
   one should be aware of the limitation of discovering novel ORFs using
   this approach, as it is just a computational prediction.

   Briefly, the quality of RNA-seq data from oocyst sporozoites and
   salivary gland sporozoites were assessed using FastQC to check for
   contamination from sequences from other species, which is essential
   given that separating the parasite from its mosquito vector still
   represents a significant technical challenge [[86]43]. The adapter
   sequences ligated to DNA fragments for library preparation were
   subsequently trimmed using Cutadapt [[87]44]. The processed reads were
   then aligned to the reference genome for P. falciparum 3D7 strain
   retrieved from PlasmoDB ver.46 [[88]40] using HISAT2 [[89]45]. The BAM
   files with sequence alignment data were used to assemble transcripts
   using StringTie [[90]46], which was guided by the reference annotation
   GFF file (PlasmoDB ver.46) and also allow the assembly of novel
   transcripts including splice variants. The assembled transcripts,
   therefore, include known transcripts from reference annotation with
   “PF3D7” as transcript ID prefix, and potentially novel transcripts with
   “MSTRG” prefix, which are constructed from reads that cannot be
   explained by reference transcripts. Compared to gene-level
   quantification where all reads are mapped to known genes,
   transcript-based approach taken by StringTie does not assume that
   genome annotation is complete and effectively expands the search space
   for matching MS/MS peptide spectra. The transcript nucleotide sequences
   were then extracted from the reference genome using BEDTools getfasta
   [[91]47]. The constructed transcriptome database was then searched in
   six-frames for matching to MS/MS spectra by Mascot, a search engine
   that identifies protein using the MS data [[92]48]. It matches all
   peptide spectra to the in silico translated proteins derived from the
   transcriptomic database “on the fly”, hence identifies which transcript
   is supported by peptide evidence. For each peptide-spectrum match
   (PSM), Mascot also computes a probability score which is higher for
   proteins with more peptides matched to it.

Novel peptides classification

   Since the known genes have been analysed thoroughly by Lindner et al.
   when they reported the transcriptomes and proteomes [[93]37], only
   potentially novel transcripts with “MSTRG” prefix were considered in
   discovering nORFs that have been overlooked previously. Peptides that
   identify these MSTRG transcripts were subsequently classified
   (Fig. [94]1b) into different categories as defined above based on the
   position of matched peptides relative to the reference genes. Each
   category was classified independently as described below.

Antisense

   The first step to identify antisense peptides was to identify antisense
   transcripts from the pool of potentially novel MSTRG transcripts.
   Firstly, all the assembled transcripts were compared with reference
   transcripts using BEDtools intersect [[95]47], using parameters that
   only return transcript that is on the complementary strand of the
   reference transcript that it overlaps with. Peptides that are matched
   to these antisense transcripts are then compared with the protein
   sequences from six frame translation using Transeq [[96]49] to see
   which frame it originated from. Only peptides that are translated from
   frame 4 to 6 are considered translation evidence for antisense
   transcripts, because it could be degenerate peptides from reference
   genes due to six frame translation.

Intergenic

   The intergenic regions were extracted by using BEDtools complement
   [[97]47] to subtract the genome from all annotated regions and return
   genome intervals with no genes identified. The MSTRG transcripts were
   mapped to these intervals with BEDtools intersect, returning
   transcripts that overlap completely are classified as intergenic.
   Intergenic transcripts with peptides from any frame that identified
   them are intergenic ORFs.

Retained intron

   MSTRG transcripts that contain the introns of overlapping reference
   genes are determined by GffCompare [[98]50] that compares their
   intron–exon structures. The exons of these transcripts were
   subsequently intersected with the introns of reference genes to extract
   the retained intron region, which were six frame translated by Transeq
   and mapped by the peptides to see if they fall into the introns The
   intronic peptides were then checked manually to see if they are
   in-frame with the neighbouring exon on genome visualizer, Artemis
   [[99]51]; if not, they are classified as AltORFs instead.

3′ and 5′ UTR

   Unlike other well-characterized genomes, information about the
   untranslated regions of P. falciparum is relatively scarce [[100]52],
   where the majority of research has been focussed on coding sequences
   only. Therefore, the first attempt to extract 3′ and 5′ UTRs by
   subtracting coding sequence from the genomic coordinates of mRNAs
   failed, because they are of equal lengths. BEDtools intersect was
   combined with length filtering to find MSTRG transcripts with complete
   overlap with a reference gene, and are also longer than the overlapped
   gene. These transcripts contain extended region on the 3′ and/or 5′ end
   outside coding sequence and regarded as UTRs accordingly. The UTRs were
   then translated into protein sequence to check if any peptides can
   confirm their translation. Similarly, peptides from UTRs need to be
   in-frame with the parent reference gene to be considered, otherwise
   moved to the AltORFs category.

AltORF

   Along with the out-of-frame peptides from retained intron and UTR
   classification, all peptides from MSTRG transcripts (excluding
   antisense transcripts) that overlap with known genes were also compared
   with the corresponding six frame translated coding sequences. The frame
   from which the peptides are translated can then be detected, and those
   from non-canonical frame, were classified as products of AltORFs.
   However, it is worth noting that almost half of these AltORF peptides
   map to frames that are antisense (frame 4, 5 and 6) to the canonical
   genes (341/732 peptides in oocyst sporozoites, and 379/733 peptides in
   salivary gland sporozoites), these peptides are categorized as AltORF
   products as they are not associated with antisense transcript.

Differential expression analysis

   To identify novel ORFs that may be involved in the development of
   infectivity to human hosts, differential expression analysis between
   the two stages studied was conducted at both transcript and protein
   levels. For RNA-seq data, DESeq 2 [[101]53] was chosen to perform
   differential expression analysis, which is an R package available
   within the Bioconductor project [[102]54]. This is because DESeq 2 is
   widely used by the community, including Lindner et al. that published
   the analysed dataset [[103]37], and it has also been recently proved to
   have the best overall performance among 12 methods [[104]55]. Since
   DESeq2 requires read counts as an input, while StringTie outputs
   coverage values for transcript abundance, these were first converted
   from coverage to counts for each transcript, using the formula
   reads_per_transcript = coverage * transcript_len/read_len with a python
   script (available at
   [105]http://ccb.jhu.edu/software/stringtie/dl/prepDE.py). DESeq2 then
   normalizes the counts internally and compares them between oocyst
   sporozoites and salivary gland, producing statistics metrics including
   adjusted p-values and log-transformed fold changes. Transcripts with
   adjusted p-values < 0.1 were called differentially expressed between
   the two stages.

   For protein differential expression analysis, the spectral count method
   was used for protein expression analysis that compares the
   peptide-spectrum matches (PSM) for each protein between the two stages,
   which has the highest reproducibility for label-free proteomics data
   [[106]56]; it is then combined with the G test statistics [[107]57] for
   computing p-values. However, in this study the MS/MS spectra were
   matched against a customized transcriptome database instead of
   annotated protein database, definition of a protein was, therefore,
   adjusted. Because the protein sequence of a translated transcript is
   different for each frame, an assumption was made that a given peptide
   can only identify one frame of the matched transcript being translated,
   and that a maximum of one protein can be translated from each frame of
   a transcript. As a result, the sum of PSMs that identify one frame of a
   transcript are the spectral counts for the protein product from that
   frame of the transcript.

   Based on this assumption, the spectral counts were computed for each
   frame of each transcript (potential protein) with peptide evidence, and
   increase all of them by 1 to remove zero-values. These counts were
   subsequently normalized by first calculating the sum of all PSMs of
   both samples to identify the sample with smaller sum, where its PSMs
   were multiplied by the ratio of two PSM sums to minimize the background
   effect between samples. It was then possible to determine the
   differences in spectral counts of a protein between two stages by
   applying the G test of significance as follows:
   [MATH: <mrow><msub><mrow><mtext>G =
   2[C</mtext></mrow><mtext>oo</mtext></msub><mo>ln</mo><mfenced close=")"
   open="("><mfrac><msub><mtext>C</mtext><mtext>oo</mtext></msub><mrow><mo
   stretchy="false">(</mo><mfrac><mrow><msub><mtext>C</mtext><mrow><mtext>
   sg +</mtext><mspace
   width="0.333333em"></mspace></mrow></msub><msub><mtext>C</mtext><mtext>
   oo</mtext></msub></mrow><mn>2</mn></mfrac><mo
   stretchy="false">)</mo></mrow></mfrac></mfenced><msub><mrow><mspace
   width="0.333333em"></mspace><mtext>+
   C</mtext></mrow><mtext>sg</mtext></msub><mo>ln</mo><mfenced close=")"
   open="("><mfrac><msub><mtext>C</mtext><mtext>sg</mtext></msub><mrow><mo
   stretchy="false">(</mo><mfrac><mrow><msub><mtext>C</mtext><mrow><mtext>
   sg +</mtext><mspace
   width="0.333333em"></mspace></mrow></msub><msub><mtext>C</mtext><mtext>
   oo</mtext></msub></mrow><mn>2</mn></mfrac><mo
   stretchy="false">)</mo></mrow></mfrac></mfenced><mrow><mo
   stretchy="false">]</mo></mrow></mrow> :MATH]

   where G is the G test static, C[oo] and C[sg] are the normalized
   spectral counts for a protein in oocyst sporozoite and salivary gland
   sporozoite respectively. A p-value was calculated as the probability
   that a χ^2 distribution with 1 degree of freedom was more extreme than
   the G statistic for that protein. The Benjamini–Hochberg method was
   used to correct for false discovery rates from multiple hypothesis
   testing, where a protein needs to satisfy the criteria of both G-test
   and FDR < 5.0% to be called differentially expressed.

Estimation of full-length ORF from peptide-spectral match

   While differential expression analysis compares the expression level of
   one protein between samples, it is also interesting to compare the
   relative abundance of a protein with other proteins within and between
   samples. Therefore, the normalized spectral abundance factors (NSAF)
   were computed for each protein, which has been proved to provide
   reliable quantification [[108]58]. For a given protein k, the spectral
   abundance factor (SAF) is calculated as the total PSMs (or spectral
   counts) that identifies the protein normalized by its length, and the
   NSAF is then the SAF normalized by total SAF values in the sample as
   shown below:
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mtext>NSAF)</mtext><mi>k</mi></msub><mo>=
   </mo><mfrac><msub><mrow><mo
   stretchy="false">(</mo><mi>S</mi><mi>p</mi><mi>C</mi><mo
   stretchy="false">/</mo><mi>L</mi><mi>e</mi><mi>n</mi><mi>g</mi><mi>t</m
   i><mi>h</mi><mo
   stretchy="false">)</mo></mrow><mi>k</mi></msub><mrow><msubsup><mo>∑</mo
   ><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>N</mi></msubsup><msub><
   mrow><mo stretchy="false">(</mo><mi>S</mi><mi>p</mi><mi>C</mi><mo
   stretchy="false">/</mo><mi>L</mi><mi>e</mi><mi>n</mi><mi>g</mi><mi>t</m
   i><mi>h</mi><mo
   stretchy="false">)</mo></mrow><mi>i</mi></msub></mrow></mfrac></mrow>
   :MATH]

   where (NSAF)[k] is the NSAF value for protein k, SpC is spectral count
   and length is the protein length. This method adjusts for the protein
   length, because larger proteins tend to have higher probabilities of
   generating more PSMs; as well as the total protein abundance in one
   sample. However, for peptides that do not map to canonical coding
   sequences, the open reading frame in the mapped transcript from which
   they were translated needed to be estimated to obtain protein length.
   The estimated ORFs could also be used for downstream functional
   analysis.

   Briefly, for all transcripts with peptide-spectrum matches, their
   spliced nucleotide sequences were first extracted and six frame
   translation was performed. Then, for each peptide-spectrum match, the
   frame of which the matched transcript it maps to was identified and the
   translated protein sequence was extracted. All possible open reading
   frames were subsequently determined from the protein sequence defined
   by the presence of start and stop codon, which is indicated by
   methionine and “*” respectively in the protein sequence. The matched
   peptide was then mapped to these possible ORFs and see if it matches to
   any of them. If not, the ORF will be defined as the sequence flanked by
   the start and stop codons closet to the matched peptide (Additional
   file [109]1: Figure S3). An R script was written to perform this task
   and can be found in
   [110]https://github.com/PrabakaranGroup/nORFs-in-malaria.

Gene ontology term and pathway enrichment analysis

   All categories of nORFs analysed in this work except for intergenic
   ORFs have overlaps with a known gene. The sets of reference genes that
   overlapped with novel ORFs identified in the oocyst sporozoites and
   salivary gland sporozoites, respectively, were extracted and GO term
   and KEGG pathway enrichment analysis was performed on these genes using
   the Analyze tools on PlasmoDB [[111]40]. InterProScan [[112]59] was
   used instead to predict GO terms in novel ORFs.

Structural prediction

   For small proteins with less than 200 amino acids, the structures were
   predicted using an ab initio structure prediction tool QUARK [[113]60],
   which has shown top-ranking performance in the Critical Assessment of
   Structure Prediction (CASP) experiments consistently [[114]61]. Since
   template-free, ab initio methods work best on small proteins, QUARK has
   a size limit of 200 amino acids, and larger proteins were predicted
   using the template-based I-TASSER [[115]62] instead.

Results

Identification of novel transcripts and peptides

   Using the HISAT2-StringTie [[116]42] workflow to align the RNA-seq
   reads to reference genome and assemble into full length transcripts, a
   total of unique 7844 transcripts were detected, with 4727 and 4431
   canonical transcripts identified in oocyst sporozoite (oo-spz) and
   salivary gland sporozoite (sg-spz) stages, respectively, which are
   comparable with the results reported by Lindner et al. (3535 and 3575).
   The log transformed fold changes of the canonical transcripts also
   correlate reasonably well with their data (Additional file [117]1:
   Figure S4a), with a correlation coefficient of − 0.533, where the
   negative correlation is caused by the stages being compared
   (oo-spz:sg-spz vs sg-spz:oo-spz). An addition of 2045 transcripts were
   also identified, which are assembled from reads that could not be
   explained by canonical transcripts (with “MSTRG” transcript prefix).
   These potentially novel transcripts were then classified into different
   categories based on their comparison with the overlapped reference
   genes as described in “[118]Methods” section (Fig. [119]1b). From the
   MSTRG transcripts, 780 and 790 transcripts were predicted to contain 3′
   and 5′ UTR regions that are unseen in canonical transcripts along with
   326 transcripts with retained introns. In addition, 41 antisense
   transcripts and 427 intergenic transcripts were detected.
   Identification of AltORFs requires translational evidence and are,
   therefore, not applicable. The high abundance of transcripts from the
   UTR and intergenic regions agrees with a recent study [[120]63] that
   reported nearly 90% of the P. falciparum genome is being actively
   transcribed. Together, these results demonstrated the possibility of
   identifying nORFs in untranslated regions that could not have been
   possible using traditional methods, which might have biological
   functions given their transcriptional potential. The gene boundaries
   also need to be re-defined to take the regions outside coding sequence
   into consideration.

   However, transcriptional evidence alone tends to be very noisy,
   especially for non-canonical transcripts that are constructed from de
   novo assembly [[121]64]. With the availability of proteomic data, it
   was possible to use the peptide-spectrum matches (PSMs) from novel
   transcripts to identify their existence with much higher confidence. As
   shown in Fig. [122]2, most of the novel transcripts have one or two
   unique peptides mapped to them, with very rare cases of over ten
   peptides. This suggests that most of these novel transcripts tend to be
   translated in only one frame even with six-frame translation and the
   translated region probably covers a small segment of the transcript.

Fig. 2.

   [123]Fig. 2
   [124]Open in a new tab

   Distribution of number of unique peptides mapped to novel transcripts
   identified in oocyst sporozoite and salivary gland sporozoite

   Furthermore, a relatively small fraction of novel transcripts has
   translation evidence from proteomic data, which emphasizes the need to
   confirm their existence experimentally. Less than 1/3 of the intergenic
   transcripts from the two stages have peptide spectral matches that can
   identify them (n = 130), although many of them are specific to either
   salivary gland or oocyst sporozoites (Table [125]1). Notably, despite
   many transcripts showed retained introns of canonical genes, only five
   of them have peptide evidence that supports the translation of these
   introns, indicating that the current intron–exon structures of coding
   sequences are likely to be accurate. Similarly, antisense peptides were
   detected at a very low level, with one PSM in each stage. Unlike other
   untranslated regions, slightly more novel peptides were detected to be
   translated from the 3’ and 5′ UTRs. One explanation is that they
   reflect the incomplete annotation of reference genes, or they could
   arise from stop-codon read-through [[126]65] and translation of
   upstream ORFs respectively, where the latter has been shown to regulate
   the expression of downstream canonical ORFs [[127]66]. However, despite
   the sparse evidence for translation of these novel transcripts, their
   regulatory roles on canonical proteins may be at RNA level only,
   especially antisense transcripts, whose RNA molecules were shown to
   interact with chromatin and regulate gene expression in P. falciparum
   [[128]25].

Table 1.

   Number of novel transcripts identified from RNA-seq data and number of
   novel transcripts with translational evidence in oocyst and salivary
   gland sporozoites using the proteogenomic workflow
   Antisense Intergenic Retained intron 3′ UTR 5′ UTR AltORF
   41 427 326 780 790 / Novel transcript
   Oocyst sporozoite 1 64 1 22 18 261 Novel transcript with peptides
   Salivary gland sporozoite 1 87 4 24 21 261
   Transcript common in both stages 0 21 0 7 8 132
   Total unique transcript 2 130 5 39 31 390
   [129]Open in a new tab

   Majority of the novel peptides belong to the alternative ORF (AltORF)
   category, which was defined as the translational products from a
   non-canonical reading frame of the overlapped coding sequence,
   including those that fall into introns and the 3′ and 5′ UTR regions of
   the transcript. AltORFs have previously been found in viruses,
   bacteriophages as well as humans [[130]67], but their existence has not
   yet been reported in P. falciparum. This suggests that the proteome of
   P. falciparum may be much more complicated than previously thought, and
   the abundance of AltORFs suggests that the parasite may use it to
   expand the coding potential of existing genes in the compact genome.
   Although the functions of AltORFs remain largely unknown, it was
   thought that the translation of these non-canonical ORFs alone could
   provide a mechanism for expression control [[131]68]. It would,
   therefore, be interesting to investigate if these novel peptides could
   explain the development of infectivity for human host in the parasite.

ORF prediction and differential expression analysis

   Although the identification of nORFs in this study is limited to
   computational prediction, it is worthwhile to investigate whether some
   of them are more interesting targets for experimental verification than
   others. To study the potential regulatory roles of the novel peptides,
   differential expression analysis was performed on their parent proteins
   between oocyst sporozoites and salivary gland sporozoites. However,
   unlike most canonical proteins whose amino acid sequences have been
   identified experimentally, novel peptides themselves do not provide
   information of the full-length novel proteins from which they were
   generated, they are only evidence that there are PSMs that identify a
   segment of the transcript. An attempt was made to predict the open
   reading frame based on where the peptides map to translated sequence of
   transcripts (see “[132]Methods” section).

   This approach was applied to both putative novel proteins and canonical
   proteins, where the latter have annotated protein lengths that allow
   one to verify the feasibility of estimating full-length ORFs from PSMs
   of six-frame translated transcripts. By searching the MS/MS spectra
   against the customized transcriptome database, 2901 and 2933 canonical
   proteins in oocyst sporozoite and salivary gland sporozoite were
   identified, respectively, as compared to 1432 and 2040 proteins
   identified by Lindner et al. [[133]37]. The results for the
   protein-length normalized spectral abundance factors (SAF) for
   canonical proteins were then compared with those reported by Lindner et
   al. The protein abundances (in SAF) that were computed for canonical
   proteins in the samples from two stages correlated strongly with the
   published data (Additional file [134]1: Figure S4b), with correlation
   coefficients of 0.95 and 0.91 for oocyst sporozoite and salivary gland
   sporozoite, respectively, indicating that one could deduce the ORFs for
   novel peptides using this method (Additional file [135]1: Figure S3) as
   well.

   After computing the PSMs for both novel and canonical proteins in two
   different stages, one could then proceed to identify the ones that were
   differentially expressed using G-test statistics. In total, 86
   differentially expressed nORFs were observed, which share a similar
   distribution of nORF categories (classified in Fig. [136]1) with those
   present in the total proteomics data of salivary gland sporozoites and
   oocyst sporozoites (Fig. [137]3). One exception is the intergenic
   ORFs—while the intergenic peptides were the second most abundant
   category in both stages, only 6% of the differentially expressed nORFs
   are from the intergenic region.

Fig. 3.

   [138]Fig. 3
   [139]Open in a new tab

   Distribution of nORF categories in total proteomes of oocyst sporozoite
   (a) and salivary gland sporozoite (b), as well as in differentially
   expressed nORFs (c)

Antisense transcripts and peptides

   Although only two unique antisense peptides were identified in this
   study, antisense transcripts have previously been shown to be
   incorporated into chromatin and consequently activate the virulence
   factor in P. falciparum, the var gene family [[140]25]. It is,
   therefore, worth investigating further the regulatory roles of the
   antisense transcripts on their parent transcripts. The corresponding
   reference genes of the 41 antisense transcripts were first extracted
   and the transcript abundances in TPM (Transcripts Per Million) of both
   sense and antisense transcripts were calculated, and finally compared
   the TPM values using Pearson correlation. As a result, five antisense
   transcripts were shown to have significant correlation (p < 0.05) with
   the sense reference transcripts (Table [141]2), where three of them
   have positive correlation that suggests an activation mechanism.
   Interestingly, regardless of the transcript correlation, all reference
   genes with associated antisense transcripts are significantly
   downregulated at transcript level in salivary gland sporozoites.
   Especially for the two antisense transcripts that showed anticorrelated
   expression, one of the parent gene has both downregulated mRNA and
   protein expression, while the other is downregulated at mRNA level but
   upregulated at protein level. Notably, the MSTRG.401.1 transcript also
   has translational evidence, suggesting that the regulation by antisense
   transcripts could be mediated by their translational product. Overall,
   the actions of antisense transcripts observed are not uniform where
   they could potentially regulate the expression of their target genes
   not just by activation but also by repression, which has been
   well-established in mammalian genomes [[142]69]. However, since the
   transcript level of the parent reference genes do not correlate well
   with protein expression, further experimental evidence is required to
   confirm the roles of these antisense transcripts.

Table 2.

   Correlation of TPM values between antisense transcripts and their
   associated reference transcripts across the two life cycle stages
   (oocyst and salivary gland sporozoites)
   Associated reference transcript Antisense transcript Correlation
   coefficient p-value Reference transcript fold-change Reference protein
   fold-change
   PF3D7_0209600.1 MSTRG.182.1 0.984 0.000360 ↓ ns
   PF3D7_0103800.1 MSTRG.17.1 − 0.964 0.00187 ↓ ↓
   PF3D7_0414500.1 MSTRG.618.1 0.947 0.00415 ↓ ns
   PF3D7_0314000.1 MSTRG.401.1 − 0.928 0.00758 ↓  ↑
   PF3D7_1116000.1 MSTRG.2636.1 0.879 0.0211 ↓ ↓
   [143]Open in a new tab

   ↓ indicates that the fold change is negative in differential expression
   analysis and ↑ indicates positive fold-change. ns indicates that the
   reference transcript was not significantly differentially expressed

GO term and pathway enrichment

   To investigate if nORFs could be involved in a particular function in
   P. falciparum, enrichment analysis was performed on the GO terms and
   KEGG pathways on their parent reference genes, because they might
   affect the functions of existing genes. As shown in Fig. [144]4, parent
   genes of the novel peptides identified from oocyst sporozoites and
   salivary gland sporozoites are associated with very different sets of
   known genes. With a stringent p-value cut-off of 0.01, the parent genes
   that overlap with the novel peptides in oocyst sporozoites have a clear
   enrichment in processes related to cell localization and movement,
   which is further supported by the 100% enrichment of the background
   genes in actomyosin. On the other hand, the gene set for salivary gland
   sporozoite is also enriched in the actomyosin structure, and
   distinctively enriched in replisome complex as well.

Fig. 4.

   [145]Fig. 4
   [146]Open in a new tab

   GO terms and KEGG pathways enrichment of the reference genes associated
   with novel peptides identified in oocyst sporozoite and salivary gland
   sporozoite. For GO terms enrichment the p-value cut-off was 0.01, and
   0.02 for KEGG pathways. Percentage of genes means how many percents of
   the background genes (in Plasmodium falciparum) with a particular term
   are present in the gene set of interest

   Since oocyst sporozoites need to migrate from mosquito midgut to
   salivary gland and await for injection into human host via a blood
   meal, motility is crucial for host invasion and achieved by the
   invasion machinery called glideosome, which is powered by the
   actomyosin system [[147]70]. Given its importance to entry into host
   cells, attempt has been made to target the glideosome-associated
   proteins with small molecule as a potential approach to discover
   anti-malarial drugs [[148]71]. Therefore, it shows that the identified
   nORFs are associated with important loci for infectivity and could
   serve as an extended reservoir to target for future drug design.

   For KEGG pathway enrichment, a less stringent cut-off was chosen
   because otherwise no pathways could pass the filter, which still
   returned very few enriched pathways (Fig. [149]4). Interestingly, the
   gene set in salivary gland sporozoite is enriched in the pathway for
   the biosynthesis of an antibiotic, puromycin, where the genes involved
   in this pathway are all alpha/beta hydrolases. Of particular interest
   is the BEM 46-like protein, which was shown to modulate the development
   of sporozoites [[150]72] and, therefore the associated novel peptides
   may play a role in the modulation as well.

High quality peptide filtering and re-analysed differential expression

   To obtain high quality novel peptides, a quality filter of Mascot
   score > 20 was applied on the proteomic datasets and subsequently
   performed differential expression analysis on the proteins of filtered
   peptides. Approximately half of the total peptides remained after
   applying the filter (Table [151]3), whereas for novel peptides the
   proportions are much lower, ranging from 14% to 20%. This suggests that
   compared with peptides from canonical proteins, novel peptides tend to
   have lower probabilities of being a real positive match. A total of 138
   and 131 unique novel peptides remained in the oocyst and salivary gland
   sporozoite dataset respectively (Table [152]4) after the filtering,
   with AltORF being the most common category followed by intergenic,
   which follows a similar trend before filtering. Unfortunately, no
   peptides from antisense strand or retained intron passed the filter,
   which were very rare before filtering too.

Table 3.

   Percentage of novel peptides and total peptides passing the filter of
   Mascot score > 20. oo1, oo2, oo3 represent the three biological
   replicates from oocyst sporozoite and sg1, sg2, sg3 are the replicates
   from salivary gland sporozoite
   Sample Percentage of novel peptides passing the filter Percentage of
   total peptides passing the filter
   oo1 17.5 50.5
   oo2 19.8 49.7
   oo3 15.2 48.2
   sg1 14.5 53.0
   sg2 19.9 53.4
   sg3 17.7 52.8
   [153]Open in a new tab

Table 4.

   Number of novel peptides from different categories that passed the
   quality filter of Mascot score > 20
                             Number of unique novel peptides
                             5′ UTR 3′ UTR AltORF Intergenic
   Oocyst sporozoite         4      4      116    14
   Salivary gland sporozoite 4      4      107    16
   [154]Open in a new tab

   The ORFs of the filtered novel peptides as described previously, and
   see which of them was differentially expressed. With a much smaller
   pool of high-quality peptides, only five unique novel ORFs showed
   differential expression between the two stages (Table [155]5), where
   AltORF is still the most common category, and one ORF from 3′ UTR and
   intergenic regions. Interestingly, four out of five DE nORFs showed
   opposite trend in transcript and protein expression, meaning that those
   that have positive fold-change at mRNA-level showed negative
   fold-change at protein level and vice versa. The poor correlation
   between mRNA and protein abundances are not uncommon, which is often
   caused by the regulatory mechanism governing transcripts or proteins,
   but sometimes by noise and experimental error as well [[156]73].
   Further experiments are required to confirm if it is the former.

Table 5.

   High quality (Mascot score > 20) novel ORFs identified by using
   filtered peptides only
   Transcript ID Frame Category of nORF PSM in oo-spz PSM in sg-spz Log
   ratio of PSMs Log fold-change of transcripts p-value
   MSTRG.2270.1 1 3p UTR 3.32 16.00 2.27 − 1.74 0.0026
   MSTRG.231.2 1 AltORF 28.75 1.00 − 4.85 11.57 0.0000
   MSTRG.633.1 6 AltORF 22.12 2.00 − 3.47 ns 0.0000
   MSTRG.4174.1 2 Intergenic 1.11 11.00 3.31 − 3.49 0.0022
   MSTRG.4394.1 1 AltORF 1.11 15.00 3.76 − 3.34 0.0002
   [157]Open in a new tab

   PSM values were normalized as described in “[158]Methods” section

Differentially expressed 3′ UTR

   It was worth noting that the differentially expressed 3′ UTR
   (MSTRG.2270.1), despite showing higher expression in salivary gland
   sporozoites, the peptide spectrum-match that maps to the 3′-end of the
   canonical gene (gene identifier: PF3D7_1013400) was only present in
   oocyst sporozoites. This is in line with the RNA-seq data, where the
   3′-end is clearly expressed in oocyst sporozoites but almost
   undetectable in salivary gland sporozoite. Even though there were only
   two PSMs from this nORF before normalization in oocyst sporozoites, the
   3′ UTR was still captured at such low expression, it is unlikely that
   the absence of peptide from 3′ UTR in salivary gland sporozoites is due
   to chance or physical chemistry of the peptide ion. It is possible that
   the protein with 3′UTR is an isoform specific to the oocyst sporozoite
   stage.

   To understand the function of the extended 3′UTR region, InterProScan
   [[159]59] was used to analyse the sequence of this nORF. The results
   revealed that while the canonical gene PF3D7_1013400 has only a long
   stretch of “Non-cytoplasmic domain” identified, the extended 3′-end of
   71 amino acids was predicted with multiple transmembrane helices.This
   is further confirmed by the prediction of transmembrane helices from
   TMHMM [[160]74], which predicted two helices in the region after the
   canonical stop codon (Additional file [161]1 Figure S5). The inclusion
   of this transmembrane tail might also explain the low expression of
   this protein in oocyst sporozoites, because membrane proteins require
   specialised protocols to be fully solubilised in sample preparation
   [[162]75].

   Results from TMHMM also suggested that canonical protein is likely to
   be outside of the membrane, which coincides with the non-cytoplasmic
   domain detected by InterProScan. Therefore, the extended region in the
   3′ end can potentially act as an anchor that tether the canonical
   protein to the membrane in oocyst sporozoites. When the parasite
   transitions into salivary gland sporozoite, the canonical stop codon is
   used and without expressing the 3′ end of transmembrane helices.

   Since there is no function annotated for the parent gene of this 3′UTR
   other than it encodes a conserved protein, its structures with and
   without the 3′ extension were predicted using the template-based
   I-TASSER [[163]62]. It appears that the isoform with extra helices at
   the 3′-end (Fig. [164]5a, highlighted in red) has a very different
   structure than the one without, where the former is predicted to have a
   more compact, ordered structure and likely to bind to a peptide
   substrate, while the latter contains a lot of disordered loops with
   nucleic acid substrate. Therefore, it is possible that the presence of
   extra helices allows the parent protein to tether to the membrane and
   adopt a more stable structure to bind to different substrates. When the
   parasite matures into salivary gland sporozoite, it expresses the
   canonical protein with no transmembrane helices, which may be released
   into the extracellular environment given its non-cytoplasmic domain and
   potentially involved in the host-parasite interaction upon infection.

Fig. 5.

   [165]Fig. 5
   [166]Open in a new tab

   Predicted structures of a differentially expressed nORF (MSTRG.2270.1)
   with 3′UTR (3′ end coloured in red) of PF3D7_1013400 using I-TASSER; b
   differentially expressed intergenic nORF (MSTRG.4174.1) using QUARK;
   and c two AltORFs that have protein expression that was anticorrelated
   with the associated reference genes using QUARK

Intergenic ORF and transmembrane domains

   Surprisingly, an intergenic nORF was also differentially expressed
   after the quality filtering. It is a short ORF with only 30 amino acids
   and unique to salivary gland sporozoites, identified by 10 raw PSMs
   (Table [167]5). InterProScan could not detect any functional domain
   possibly due to its short length and, therefore, its structure was
   modelled using an ab initio structure prediction tool QUARK [[168]60],
   which can yield high-resolution structures for small proteins
   [[169]76]. The predicted structure (Fig. [170]5b) is highly ordered
   with two short helices connected by a loop, suggesting that the
   intergenic ORF could form protein-like products too.

   Intergenic ORFs with high-quality PSMs were then analysed to see if
   they have functional roles. Firstly, a BlastP search was performed on
   all the intergenic ORFs against the non-redundant protein sequence
   database, and no significant hit with expected value (E-value) smaller
   than 0.05 was returned, indicating that these ORFs share little or no
   sequence homology with known proteins. Interestingly, InterProScan
   detected many transmembrane domains in the intergenic ORFs from
   prediction by TMHMM (Fig. [171]6a), where 12 out of 34 were predicted
   with one or more transmembrane helices, and one of them was predicted
   with four helices. A similar scenario was observed in AltORFs
   (Fig. [172]6b) as well, where 99 out of 248 ORFs were predicted to
   contain transmembrane domain. Such abundance suggests that these ORFs
   may have biological functions in the membrane, which have been
   previously overlooked by the conventional annotation methods.
   Interestingly, this scenario has previously been observed in
   Escherichia coli as well [[173]77], where over half of the novel small
   proteins (16–50 amino acids) identified in the intergenic region were
   predicted to have a transmembrane segment and also shown to
   co-fractionate with the membrane experimentally. Therefore, it is
   possible that this may be a common strategy adopted by organisms with
   small genomes to expand their proteome, although their exact functional
   roles still require experimental testing.

Fig. 6.

   [174]Fig. 6
   [175]Open in a new tab

   Predicted protein domains in high quality (Mascot score > 20)
   intergenic ORFs (a; total ORFs = 32) and AltORFs (b; total ORFs = 248)
   from InterProScan results

Correlation of nORF expression with canonical gene

   Finally, the correlation between the protein expression of high-quality
   nORFs and that of their associated canonical gene was tested. By
   computing the normalized spectral abundance factor (NSAF) values, it
   was possible to compare the expression across proteins and samples,
   which were used to perform Pearson Correlation analysis. In total,
   nORFs (Table [176]6) that showed significant expression correlation
   (p-values < 0.05) with the parent gene were identified, which are all
   AltORFs except for one 3′ UTR. It is worth noting that majority of them
   showed positive correlation, suggesting that either a positive
   regulation on the parent gene is a wide-spread scenario in nORFs, or
   that their expression is a by-product of the canonical translation
   event via mechanisms such as ribosome shifting [[177]78].

Table 6.

   High quality novel ORFs that have significantly (p-value < 0.05)
   correlated protein expression (calculated in NSAF) with their
   associated reference genes
   Transcript ID Frame Category of nORF Reference gene p-value Correlation
   coefficient
   MSTRG.4605.2 2 AltORF PF3D7_1467600 0.0251 − 0.868
   MSTRG.2092.1 6 AltORF PF3D7_0927900 0.0425 − 0.827
   MSTRG.1076.2 2 AltORF PF3D7_0606600 0.0480 0.815
   MSTRG.1556.1 2 AltORF PF3D7_0728700 0.0457 0.820
   MSTRG.3171.1 1 AltORF PF3D7_1225600 0.0367 0.839
   MSTRG.2151.1 6 AltORF PF3D7_0934500 0.0149 0.898
   MSTRG.2472.1 2 3p UTR PF3D7_1033500 0.0134 0.904
   MSTRG.2472.1 6 AltORF PF3D7_1033500 0.0134 0.904
   MSTRG.4173.1 2 AltORF PF3D7_1417600 0.00838 0.924
   MSTRG.370.1 1 AltORF PF3D7_0311100 0.00560 0.938
   MSTRG.3602.1 6 AltORF PF3D7_1322400 0.00098 0.974
   MSTRG.4140.1 1 AltORF PF3D7_1414500 0.000685 0.979
   MSTRG.2185.1 2 AltORF PF3D7_1004300 0.000073 0.993
   [178]Open in a new tab

   Interestingly, two AltORFs (MSTRG 4605.2 and MSTRG 2092.1) have
   anti-correlated expression with the parent gene, which are less likely
   to be a by-product and more likely have regulatory role on the
   expression of canonical proteins. Moreover, one of the parent genes,
   PF3D7_1467600 was significantly downregulated in salivary gland
   sporozoite, where the associated AltORF is uniquely present, suggesting
   that it may be involved in the downregulation mechanism. InterProScan
   did not find any functional domains other than transmembrane helices in
   the two AltORFs, which was common for novel ORFs as previously
   discussed. Therefore, an attempt was made to infer function from their
   three-dimensional structures by first predicting the structures using
   QUARK (Fig. [179]5c), and then submitting them to the structure-based
   function predictor COFACTOR [[180]79]. Although there were no
   significant hits (C-score > 0.4) of the predicted Molecular Function GO
   terms for MSTRG 4605.2 (Table [181]7), MSTRG 2092.1 was predicted with
   nucleic acid binding function. Taken together, binding to the mRNA of
   canonical gene or to the chromosome could be a mechanism through which
   this AltORF regulates the expression of canonical proteins.

Table 7.

   Predicted molecular functions of the structures modelled by QUARK for
   MSTRG.4605.2 and MSTRG.2092.1, which showed negative correlation of
   protein expression with associated reference gene
   MSTRG.4605.2 MSTRG.2092.1
   Molecular function C-score Molecular function C-score
   Phosphatidic acid binding 0.13 Nucleic acid binding 0.46
   Phosphatidylinositol-4-phosphate binding 0.13 Phosphatase binding 0.40
   Sterol transporter activity 0.13 Nucleic acid binding transcription
   factor activity 0.40
   Oxysterol binding 0.13 Structural constituent of cytoskeleton 0.28
   Phosphatidylinositol-4,5-bisphosphate binding 0.13 Catalytic activity
   0.26
   [182]Open in a new tab

   C-scores are confidence scores that range from 0 to 1, which higher
   score indicating a more confident prediction

Discussion

   Despite years of research efforts, malaria continues to be one of the
   most severe global health problems and affects millions of pregnant
   women and children in the African Regions [[183]1]. This can be
   attributed to several factors including the rise of resistance to
   anti-malarial drugs [[184]80] and the lack of a vaccine that has high
   efficacy [[185]5]. It is, therefore, of great interest to discover new
   vaccine candidates and novel drug targets. Additionally, the genome
   annotation of P. falciparum remains relatively incomplete where a large
   portion of genes remains functionally unannotated and regulatory
   elements outside coding sequences are not included [[186]20]. As an
   attempt to address these issues, proteogenomics analysis was performed
   to discover novel open reading frames that would not have been
   identified using conventional approaches and could enhance the current
   understanding in the parasite biology. The datasets analysed are the
   total transcriptomes and proteomes of oocyst sporozoites and salivary
   gland sporozoites [[187]37], which correspond to the life-cycle stages
   that are mosquito-infectious and human-infectious, respectively, and
   the identified nORFs could, therefore, contribute to the understanding
   of the development of infectivity.

   In this work, putative novel transcripts were classified based on how
   they map to the canonical genes and subsequently determined where their
   peptide-spectrum matches align to the translated sequence to identify
   novel peptides. A total of 1734 novel peptides were identified, where
   269 of them passed the high-quality filter of Mascot score > 20. A
   method was also developed to predict the full-length open reading frame
   from peptides by finding the most suitable start and stop codons in the
   translated transcript, so that differential analysis could be performed
   for novel ORFs. The results for the length-normalized spectral
   abundance factors of canonical genes correlated well with those
   reported by Lindner et al. [[188]37] (Additional file [189]1:
   Figure S4b), which were computed using annotated protein lengths,
   suggesting that the approach used is feasible.

   By performing GO terms enrichment analysis on the canonical genes
   associated with the novel ORFs, it appears that they arise from
   functionally important loci that are critical to the parasite invasion
   and survival. While the gene set associated with novel ORFs identified
   in oocyst sporozoites was enriched in GO terms of motility and
   localization, the gene set for salivary gland sporozoites was more
   enriched in replisome-related GO terms (Fig. [190]4). More importantly,
   both gene sets showed significant enrichment in the genes that form the
   actomyosin complex, which is part of the parasite invasion machinery,
   also known as the glideosome. It is therefore possible that these novel
   ORFs play a role in host invasion and could serve as a drug target
   well, especially given that attempt has already been made to target
   glideosome with small molecules as a potential malaria therapy
   [[191]71].

   The functional roles of antisense transcripts have been a subject of
   debate where previous studies showed conflicting results—some suggest
   that they can activate transcription [[192]25, [193]27] and others
   showed no effects [[194]81, [195]82], where the former is achieved by
   epigenetic modifications and the latter is due to the lack of RNAi
   machineries. The presented results align with these observations where
   the levels of the 41 identified antisense transcripts do not have a
   consistent correlation with protein expression. Interestingly, some
   translation of the antisense transcripts was observed, suggesting that
   their regulatory roles could be performed by the translation product as
   well.

   It was worth noting that while a lot of assembled transcripts have
   extended 3′/5′ ends and retained introns, very few peptides are mapped
   to these untranslated regions. Nevertheless, these UTRs tend to be
   regulatory elements that function at mRNA level and very little is
   known about their roles in transcriptional and translational regulation
   in the Plasmodium parasite [[196]83], the results of this work could
   provide a starting point for further investigation. Furthermore, one
   high quality novel ORF from the 3′ UTR category was identified, which
   was differentially expressed between the two stages studied, and the
   peptide that maps to 3′ end was only observed in oocyst sporozoite
   despite low protein expression. The extra sequences provided by the 3′
   extension are predicted to form a transmembrane helix and could affect
   the protein structure significantly, which may allow the protein to
   attach to the membrane and bind to different substrates. One possible
   explanation is that the parasite may choose to use different protein
   isoforms depending on the life cycle stage by skipping a stop codon and
   express the 3′ end, which could be an efficient mechanism to change
   infectivity for different hosts.

   Finally, AltORFs and intergenic ORFs are surprisingly abundant, with
   AltORFs being the most common category in both stages as well as in
   differentially expressed novel ORFs. By correlating the nORF expression
   with their associated canonical genes, it was observed that most nORFs
   with significantly correlated expression are AltORFs showing positive
   correlation, except for two AltORFs that showed negative correlation.
   Structural analysis reveals that one of them is likely to form a
   protein structure that binds to nucleic acid, which provides a possible
   mechanism for this AltORF to regulate the expression of the associated
   gene by repressing its transcription or the translation of mRNA. On the
   other hand, intergenic ORFs were not only translated, some of them were
   even differentially expressed between oocyst and salivary gland
   sporozoites, suggesting that they might be involved in the
   stage-specific functions. An intriguing finding is that 12 out of 34
   high-quality intergenic ORFs, and 99 out of 248 AltORFs were predicted
   with transmembrane domain, which may indicate that they have an
   important functional role in the membrane. Given that a similar
   scenario has been observed in Escherichia coli, non-canonical
   transmembrane ORFs might be common in organisms with small genomes to
   expand the functional proteome.

   In summary, the novel ORFs in oocyst sporozoites and salivary gland
   sporozoites were identified through proteogenomics analysis, which
   allowed analysis on transcription and translation events outside the
   coding sequences that are annotated using conventional criteria.
   Combining analyses of differential expression, GO term enrichment and
   predicted structures, this work has shown that some of these nORFs may
   play a role in the parasite invasion and expression control. Therefore,
   they are interesting targets for further experimental validation on
   their existence and functional roles.

Supplementary information

   [197]12936_2021_3598_MOESM1_ESM.docx^ (1MB, docx)

   Additional file 1. Figures S1 to S5. Supplementary File for In silico
   identification of novel open reading frames in Plasmodium falciparum
   oocyte and salivary gland sporozoites using proteogenomics framework.

Acknowledgements