Abstract

   Ampulex clypecomplana Chen & Li, 2010 (Hymenoptera: Ampulicidae) is an
   important predatory insect in Hymenoptera. However, molecular
   information about this predatory insect is currently limited. In this
   study, we employed ONT long-read sequencing, MGI-SEQ short-read
   sequencing, Hi-C sequencing and transcriptomic data to assemble the
   high-quality genome of A. clypecomplana. The genome assembly length was
   338.43 Mb, with a scaffold N50 of 19.05 Mb. BUSCO analysis further
   confirmed the completeness of the genome assembly to be 99.2%.
   Phylogenetic analysis indicated that A. clypecomplana
   appeared approximately 132 million years ago. We annotated 110.75 Mb of
   repetitive sequences, accounting for 32.72% of the entire genome. In A.
   clypecomplana, we identified 180 gene expansions and 1029 genes that
   underwent contraction or loss. The high-quality genome of A.
   clypecomplana provides a valuable genetic resource for future research
   in evolution, molecular biology, and applied studies.

   Subject terms: Phylogenetics, Taxonomy

Background & Summary

   Ampulex clypecomplana Chen & Li, 2010, commonly known as a cockroach
   wasp, is a solitary predator within Hymenoptera, specifically in the
   Spheciformes group^[34]1–[35]5. Spheciformes, which includes
   families such as Ampulicidae, Heterogynaidae, Sphecidae, and
   Crabronidae, has been merged with the bee superfamily to form the
   Apoidea^[36]5.
   While adult Spheciformes feed on nectar and pollen, their larvae rely
   on captured prey^[37]6, contrasting with bees, which nourish both
   adults and larvae with pollen and nectar^[38]7.

   Cockroach wasps are known for their unique behavior of preying on
   cockroaches to provide sustenance for their young^[39]8. Found mainly
   in tropical and subtropical regions^[40]2,[41]6, these wasps are
   thought to have originated in the Oriental or Ethiopian
   regions^[42]2,[43]8. Adult female cockroach wasps exhibit a remarkable
   adaptation during hunting. For instance, A. compressa severs the
   cockroach’s antennae to feed on its hemolymph before dragging it to a
   secure nesting site. These females lay eggs on the middle leg of the
   paralyzed cockroach, then seal the nest with leaves and debris^[44]9.
   After hatching, the wasp larvae feed on the host’s hemolymph and
   gradually consume all but the digestive system. A. compressa larvae
   cocoon within the host, using the exoskeleton as a developmental
   chamber^[45]9–[46]14. If the structure of the host is disrupted, the
   larvae fail to complete development, highlighting the importance of the
   intact exoskeleton for their lifecycle.

   Studies on A. compressa have revealed various adaptations linked to
   prey manipulation, including venom properties that induce partial
   paralysis. Research has examined gene expression across life stages,
   specifically focusing on digestion and detoxification genes^[47]11.
   Structural and biochemical studies of the wasp’s venom glands have
   uncovered a novel peptide family containing dopamine and other
   monoamines, demonstrating effects like antibacterial activity,
   cytotoxicity, and disruption of motor functions in prey^[48]15–[49]17.

   Our laboratory discovered A. clypecomplana as a new species in 2006 in
   Mengzi, Yunnan^[50]1. Although it shares behavioral and biological
   traits with A. compressa^[51]9–[52]14, a notable difference is that A.
   clypecomplana larvae cocoon outside the cockroach host. To date, there
   are no prior reports on the chromosome number or genome size of the
   genus Ampulex. In the NCBI database, a scaffold-level genome of the
   male Ampulex compressa is available, submitted by the University of
   California, Riverside, with a genome size of 277.4 Mb (GCA_038496175.1;
   from NCBI)^[53]18; another submission by the Zoologisches
   Forschungsmuseum Alexander Koenig reports a genome size of 277.7 Mb
   (GCA_019049445.1; from NCBI)^[54]19. High-quality chromosome-level
   genome studies on cockroach wasps have been lacking, limiting research
   on their biology and prey interactions. In this study, we present the
   chromosome-level genome of A. clypecomplana, assembled using ONT
   long-read, MGI short-read, and Hi-C sequencing. This genome, spanning
   338.43 Mb with a scaffold N50 of 19.05 Mb, provides an essential
   resource for understanding the evolutionary biology of Spheciformes and
   Apoidea within Hymenoptera, establishing a foundation for further
   genomic and evolutionary research.

Methods

Sample collection

   Specimens of A. clypecomplana were collected on September 17, 2023,
   from Yiliang County, Kunming City, Yunnan Province (24.94°N, 103.16°E)
   and reared under laboratory conditions at room temperature. Female A.
   clypecomplana were provided with nymphs of Periplaneta americana
   (2–3 cm) for hunting and oviposition, while adult individuals were fed
   honey. Prior to sequencing, species identity was confirmed by
   both morphological examination^[55]1 and COI barcode
   analysis^[56]20,[57]21. The specimens were deposited at Yunnan
   Agricultural University (accession number: YNAU 2024010010).
   Tables [58]S1–[59]S6 are provided in the Supplementary Information.

Library preparation and sequencing

   Genomic DNA was extracted using the Blood & Cell Culture DNA Midi Kit
   (QIAGEN, Germany) for both long-read and short-read whole genome
   sequencing following the manufacturer’s protocol.

   DNA quality was assessed through 1% agarose gel electrophoresis to
   check for integrity and contamination. Purity measurements were
   performed using a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher
   Scientific, USA), ensuring OD260/280 values between 1.8 and 2.0 and
   OD260/230 values within 2.0–2.2. DNA concentration was further
   quantified with a Qubit® 4.0 Fluorometer (Invitrogen, USA). For
   short-read sequencing, DNA libraries were prepared with an insert size
   of 350 bp using the MGIEasy Universal DNA Library Prep Kit V1.0
   (CAT#1000005250, MGI). These libraries were sequenced on the
   DNBSEQ-T7RS platform (MGI, Shenzhen, China). Long-read sequencing was
   performed using the Nanopore PromethION platform (Oxford Nanopore
   Technologies, UK) with an insert size of approximately 20 kb. Total RNA
   was extracted from fresh samples using TRIzol reagent (TIANGEN),
   following the manufacturer’s guidelines. Poly-A RNA was enriched from
   total RNA using the Dynabeads mRNA Purification Kit (Cat#61006,
   Invitrogen) and fragmented using the fragmentation reagents from the
   MGIEasy RNA Library Prep Kit V3.1 (Cat# 1000005276, MGI). First-strand
   cDNA was synthesized using random primers and reverse transcriptase,
   followed by second-strand cDNA synthesis. The double-stranded cDNA was
   subsequently subjected to end repair, A-tailing, and adapter ligation
   according to the manufacturer’s library construction protocol. The cDNA
   fragments were then amplified by PCR and purified using MGIEasy DNA
   Clean Beads (Cat# 1000005279, MGI). The quality and size distribution
   of the constructed library were assessed using the Agilent Technologies
   2100 Bioanalyzer. The double-stranded PCR products were heat-denatured
   and circularized using the splint oligo sequence provided in the
   MGIEasy Circularization Module (Cat# 1000005260, MGI). The resulting
   single-strand circular DNA (ssCir DNA) was used as the final library
   for sequencing. The qualified libraries were sequenced on the
   DNBSEQ-T7RS platform.

   Chromosome conformation capture (Hi-C) sequencing was performed using
   fresh tissue (excluding the abdomen) from one female A. clypecomplana
   specimen. The samples were vacuum infiltrated in a nuclear isolation
   buffer containing 2% formaldehyde. Crosslinking was stopped by adding
   glycine and additional vacuum infiltration. The fixed tissue was then
   ground into a powder and resuspended in nuclear isolation buffer to
   obtain a nuclear suspension. The purified nuclei were digested with 100
   units of DpnII and labeled with biotin-14-dATP. Biotin-14-dATP on
   non-ligated DNA ends was removed by the exonuclease activity of T4
   DNA polymerase. The ligated DNA was sheared into 350 bp fragments, then
   blunt-end repaired and A-tailed, followed by purification through
   biotin-streptavidin-mediated pull-down. The resulting Hi-C libraries
   were sequenced on the MGI-2000 platform.

   In total, we obtained 84.67 Gb of ONT long reads, 27.34 Gb of MGI short
   reads, 73.04 Gb of Hi-C reads, and 8.16 Gb of RNA-seq reads for the
   genome assembly (Table [60]1).

Table 1.

   Statistics of the DNA/RNA sequence data used for genome assembly.
   Library  Insert size (bp)  Number of reads  Raw data (Gb)  Read N50 length (bp)  Sequence coverage (X)
   MGI      350               182,244,650      27.34          150                   80.78
   ONT      20,000            7,882,315        84.67          22,644                250.18
   Hi-C     350               486,927,556      73.04          150                   215.82
   RNA-seq  350               54,413,932       8.16           150                   —
   Total    —                 731,468,453      193.21         —                     —
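
   The sequence coverage values in Table 1 are consistent with dividing
   each library's raw data yield by the genome size. The sketch below
   reproduces them in Python, assuming the 338.43 Mb assembly length
   (rather than the k-mer estimate) is the denominator; the function and
   variable names are illustrative and do not come from the study.

```python
# Assembly length reported in the text (338.43 Mb), expressed in Gb.
# Assumption: this value, not the k-mer estimate, is the coverage denominator.
ASSEMBLY_SIZE_GB = 338.43 / 1000

# Raw data yields (Gb) for the DNA libraries in Table 1.
RAW_DATA_GB = {"MGI": 27.34, "ONT": 84.67, "Hi-C": 73.04}


def sequence_coverage(raw_gb: float, genome_gb: float = ASSEMBLY_SIZE_GB) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    return raw_gb / genome_gb


coverage = {lib: round(sequence_coverage(gb), 2) for lib, gb in RAW_DATA_GB.items()}
```

   Under this assumption the rounded results match the MGI, ONT, and Hi-C
   rows of the table.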

Genome size evaluation and assembly

   To ensure high-quality sequencing data, fastp v0.21.0^[62]22 was
   employed to filter raw reads and generate quality statistics for both
   raw and cleaned datasets. Prior to genome assembly, a k-mer analysis
   was conducted using MGI sequencing data to infer the genomic
   characteristics of Ampulex clypecomplana, including genome size and
   heterozygosity (Fig. [63]1). Specifically, Jellyfish v2.3.0^[64]23 was
   used to compute the frequency distribution of 17-mers, while
   GenomeScope v1.0.0^[65]24 was utilized to model genome properties. The
   final analysis estimated the genome size of A. clypecomplana to be
   approximately 515.82 Mb, with an inferred heterozygosity rate of 0.70%.
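
   For intuition, the basic idea behind k-mer-based genome size
   estimation can be sketched as follows. This is a simplified
   peak-based estimate, not the mixture model that GenomeScope fits, and
   not the exact procedure used in this study. The function name and the
   min_depth cutoff are hypothetical; the input is assumed to be
   (depth, count) pairs such as those in the two-column output of
   `jellyfish histo`.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size (bp) from a k-mer depth histogram.

    histogram: iterable of (depth, count) pairs, where count is the number
        of distinct k-mers observed at that depth.
    min_depth: depths below this are treated as sequencing-error k-mers and
        ignored (hypothetical cutoff; pick it from the trough of real data).
    """
    solid = [(d, c) for d, c in histogram if d >= min_depth]
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Take the depth with the most distinct k-mers as the main peak.
    peak_depth = max(solid, key=lambda pair: pair[1])[0]
    # Total k-mer occurrences divided by peak depth approximates genome size.
    total_kmers = sum(d * c for d, c in solid)
    return total_kmers / peak_depth


# Toy histogram: an error peak at low depth and a main peak at depth 30.
toy_histogram = [(1, 1000), (2, 300), (29, 50), (30, 100), (31, 50)]
toy_size = estimate_genome_size(toy_histogram)
```

   In a heterozygous genome the tallest peak may be the heterozygous one
   at half the homozygous depth, which is one reason model-based tools
   are preferred for real data.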

Fig. 1.

   k-mer distribution curve and heterozygosity simulation curve of Ampulex
   clypecomplana. Am: Ampulex clypecomplana. Atha: Arabidopsis thaliana.
   Simulated short-read data at corresponding depths were generated using
   the Arabidopsis thaliana genome. k-mer curve fitting was performed
   under different heterozygosity gradient combinations to estimate the
   heterozygosity of the sample Ampulex clypecomplana. The x-axis
   represents k-mer depth, while the y-axis represents k-mer depth
   frequency. For example, “AthaH0.010_X31” indicates (Atha) Arabidopsis
   thaliana with a heterozygosity (H) of 1.0% and a sequencing depth (X)
   of 31, with other labels following the same pattern.

   For de novo genome assembly, ONT reads were assembled using NextDenovo
   v2.3.1 ([67]https://github.com/Nextomics/NextDenovo.git). Due to the
   inherently high error rate of ONT raw reads, an initial self-correction
   step was performed using NextCorrect, which refined the reads into
   consensus sequences (CNS reads). These CNS reads were then analyzed by
   the NextGraph module, which identified sequence overlaps and
   constructed a preliminary genome assembly based on read correlations.
   To enhance assembly accuracy, Racon
   ([68]http://samtools.github.io/bcftools/) was used to polish the
   assembly with ONT long reads, while NextPolish^[69]25 was applied with
   default parameters to further refine the assembly using high-accuracy
   MGI short reads. To
   discard potentially redundant contigs and generate the final assembly,
   a similarity search was performed using the parameters “identity 0.8
   -overlap 0.8”.

   Clean paired-end Hi-C reads were mapped to the draft assembly using
   bowtie2 (v2.3.2)^[70]26, and uniquely mapped read pairs were retained.
   HiC-Pro (v2.8.1) ([71]https://github.com/nservant/HiC-Pro) was then
   used to identify valid interaction pairs among the uniquely mapped
   reads and to filter out invalid pairs, including dangling-end,
   self-circle, re-ligation, and dumped products. The scaffolds were then
   clustered, ordered, and oriented onto chromosomes by
   LACHESIS^[72]27. The final assembled genome length of A. clypecomplana
   is 338.43 Mb, with a scaffold N50 of 19.05 Mb and the longest scaffold
   being 38.06 Mb. The lengths of the assembled chromosomes range from
   9,677,010 bp to 42,715,628 bp. The average GC content of the assembled
   A. clypecomplana genome is 43.78% (Table [73]2, Figs. [74]2, [75]3).
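
   The scaffold N50 reported here is the length L such that scaffolds of
   length L or longer together contain at least half of the assembled
   bases. A minimal Python sketch of that definition follows; the
   function name is illustrative, and the example lengths are toy values
   rather than data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy example: total length 20, so half is 10; 6 + 5 = 11 reaches it.
toy_n50 = n50([2, 3, 4, 5, 6])
```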

Table 2.

   Genome assembly and annotation statistics of A. clypecomplana.
                   Elements                 Current Version
   Genome assembly Assembly size (Mb)       338.43
                   Number of scaffolds      611
                   Longest scaffold (Mb)    38.06
                   N50 scaffold length (Mb) 19.05
                   GC (%)                   43.78
                   BUSCO completeness (%)   99.20
                   Number of chromosomes    15
   Gene annotation Protein-coding genes     12,381
                   BUSCO completeness (%)   98.70
                   5′ UTRs                  12,058
                   3′ UTRs                  12,058
                   mRNAs                    18,127

Fig. 2.

   Heat map of Hi-C assembly of A. clypecomplana. The scale bar represents
   the interaction frequency of Hi-C links.

Fig. 3.

   Genome characteristics of A. clypecomplana. (a) Gene density; (b) GC
   content; (c) Repeat density; (d) LTR density; (e) DNA transposon
   density.

   The completeness of the assembled A. clypecomplana genome was assessed
   using Benchmarking Universal Single-Copy Orthologs (BUSCO)^[80]28 based
   on the OrthoDB database insecta_odb10. In the assembled genome, the
   completeness of BUSCO genes is approximately 99.2%, with 99.1% being
   complete single-copy, 0.1% complete duplicated, 0.3% fragmented, and
   0.5% missing (Table [81]2, Table [82]S2). These results indicate that
   the assembled genome is highly complete. The assembly of conserved
   genes was further evaluated with CEGMA^[83]29, which searches the
   genome for a set of 248 core genes from conserved protein families
   widely present in eukaryotes. CEGMA identified 246 core genes,
   representing 99.19% of the core gene set, of which 241 (97.18%) were
   complete. This indicates that the
   core genes in the assembled genome are relatively complete.
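
   The completeness percentages quoted above follow directly from the
   reported counts. A short Python check, with illustrative variable
   names:

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


CEGMA_CORE_SET = 248  # size of the core gene set stated in the text

cegma_found = percent(246, CEGMA_CORE_SET)     # core genes identified
cegma_complete = percent(241, CEGMA_CORE_SET)  # complete core genes

# BUSCO completeness is the sum of single-copy and duplicated complete genes.
busco_complete = round(99.1 + 0.1, 1)
# All four BUSCO categories should account for the whole gene set.
busco_all = round(99.1 + 0.1 + 0.3 + 0.5, 1)
```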

Genome annotation

   The genome of A. clypecomplana was annotated for repetitive elements,
   non-coding RNAs (ncRNAs), and protein-coding genes (PCGs). To generate
   a comprehensive transposable element (TE) library, the Extensive de
   novo TE Annotator (EDTA)^[84]30 pipeline was first employed to detect
   and classify repetitive sequences. Then, RepeatModeler v2.0.2^[85]31
   was used to identify non-LTR retrotransposons and unclassified TEs that
   were not captured by the EDTA pipeline. The final non-redundant TE
   library was assembled by integrating these results with data from
   Dfam3.2^[86]32. Subsequently, RepeatMasker v4.1.2
   ([87]http://www.repeatmasker.org) was used to search for known and
   novel TEs. Infernal 1.1.2^[88]33 was used to search the Rfam
   library^[89]34 to identify snRNAs, snoRNAs, and miRNAs. tRNAs were
   identified using the tRNAscan-SE^[90]35 software.

   In total, 110.75 Mb of repetitive sequences were annotated, occupying
   32.72% of the entire genome. Among these, DNA transposons accounted for
   9.40%, retroelements for 4.90%, LTR elements for 3.90%, LINEs for
   0.98%, unclassified elements for 18.18%, simple repeats for 0.14%,
   rolling-circle elements for 0.08%, and SINEs for 0.02% (Table [91]3,
   Table [92]S1,
   Fig. [93]4). A total of 331 rRNAs, 1080 tRNAs, 68 miRNAs, and 48 snRNAs
   were predicted in the A. clypecomplana genome (Table [94]S3).

Table 3.

   Statistics of repetitive elements in the A. clypecomplana genome.
     Repeat type   Length occupied (bp) Proportion in Genome
   DNA transposon  31,820,124           9.40%
   SINE            68,483               0.02%
   LINE            3,322,703            0.98%
   LTR             13,199,027           3.90%
   Rolling-circles 260,867              0.08%
   Satellite       95,716               0.03%
   Simple repeat   458,570              0.14%
   Unknown         61,521,990           18.18%
   Total           110,747,480          32.72%
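
   The proportions in Table 3 can be reproduced by dividing the occupied
   length by the genome size. The Python sketch below assumes the
   338.43 Mb assembly length as the denominator and also shows how the
   4.90% retroelement figure in the text arises as the sum of SINEs,
   LINEs, and LTR elements. Names are illustrative.

```python
# Assumption: proportions are relative to the 338.43 Mb assembly length.
GENOME_SIZE_BP = 338.43e6

# Occupied lengths (bp) taken from Table 3.
REPEAT_BP = {
    "SINE": 68_483,
    "LINE": 3_322_703,
    "LTR": 13_199_027,
    "Unknown": 61_521_990,
    "Total": 110_747_480,
}


def proportion(length_bp: float) -> float:
    """Return the percentage of the genome occupied, rounded to two decimals."""
    return round(100 * length_bp / GENOME_SIZE_BP, 2)


# Retroelements are the sum of the SINE, LINE, and LTR classes.
retroelement_bp = REPEAT_BP["SINE"] + REPEAT_BP["LINE"] + REPEAT_BP["LTR"]
retroelement_pct = proportion(retroelement_bp)
total_pct = proportion(REPEAT_BP["Total"])
```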

Fig. 4.

   Classification circle diagram of repetitive elements.

   The annotation of protein-coding genes in A. clypecomplana was
   performed using a comprehensive approach, integrating
   transcriptome-based, de novo, and homology-based prediction strategies.
   For transcriptome-guided gene prediction, RNA-seq reads were first
   aligned to the assembled genome using HISAT2 v2.2.1^[98]36. The aligned
   transcripts were then assembled and refined using the PASA pipeline
   v2.4.1 ([99]https://github.com/PASApipeline/PASApipeline), which
   facilitated the identification of candidate coding regions by
   incorporating splicing information and alternative isoform structures.

   De novo gene prediction on the repeat-masked genome was performed using
   AUGUSTUS v3.3.3^[100]37 and SNAP v2006 07-2833^[101]38. Protein
   sequences of Hymenoptera insects were downloaded from the NCBI database
   to use as references for homology-based prediction. Exonerate