Abstract

Ampulex clypecomplana Chen & Li, 2010 (Hymenoptera: Ampulicidae) is an important predatory wasp in Hymenoptera. However, molecular information about this species is currently limited. In this study, we used ONT long-read sequencing, MGI short-read sequencing, Hi-C sequencing and transcriptomic data to assemble a high-quality genome of A. clypecomplana. The assembly is 338.43 Mb long, with a scaffold N50 of 19.05 Mb. BUSCO analysis indicated that the assembly is 99.2% complete. Phylogenetic analysis indicated that A. clypecomplana appeared approximately 132 million years ago. We annotated 110.75 Mb of repetitive sequences, accounting for 32.72% of the genome. In A. clypecomplana, we identified 180 gene expansions and 1,029 genes that underwent contraction or loss. This high-quality genome provides a valuable genetic resource for future research in evolution, molecular biology, and applied studies.

Subject terms: Phylogenetics, Taxonomy

Background & Summary

Ampulex clypecomplana Chen & Li, 2010, commonly known as a cockroach wasp, is a solitary predator within Hymenoptera, specifically in the Spheciformes group^1–5. Spheciformes, which include families such as Ampulicidae, Heterogynaidae, Sphecidae, and Crabronidae, have been merged with the bee superfamily to form the Apoidea^5. While adult Spheciformes feed on nectar and pollen, their larvae rely on captured prey^6, in contrast to bees, which nourish both adults and larvae with pollen and nectar^7. Cockroach wasps are known for their distinctive behavior of preying on cockroaches to provision their young^8. Found mainly in tropical and subtropical regions^2,6, these wasps are thought to have originated in the Oriental or Ethiopian regions^2,8.

Adult female cockroach wasps exhibit a remarkable adaptation during hunting. For instance, A.
compressa females sever the cockroach’s antennae to feed on its hemolymph before dragging it to a secure nesting site. These females lay eggs on the middle leg of the paralyzed cockroach, then seal the nest with leaves and debris^9. After hatching, the wasp larvae feed on the host’s hemolymph and gradually consume all but the digestive system. A. compressa larvae spin their cocoons within the host, using the exoskeleton as a developmental chamber^9–14. If the structure of the host is disrupted, the larvae fail to complete development, which highlights the importance of an intact exoskeleton for their life cycle.

Studies on A. compressa have revealed various adaptations linked to prey manipulation, including venom properties that induce partial paralysis. Research has examined gene expression across life stages, focusing specifically on digestion and detoxification genes^11. Structural and biochemical studies of the wasp’s venom glands have uncovered a novel peptide family containing dopamine and other monoamines, with effects including antibacterial activity, cytotoxicity, and disruption of motor functions in prey^15–17.

Our laboratory discovered A. clypecomplana as a new species in 2006 in Mengzi, Yunnan^1. Although it shares behavioral and biological traits with A. compressa^9–14, a notable difference is that A. clypecomplana larvae spin their cocoons outside the cockroach host. To date, there are no reports on the chromosome number or genome size of the genus Ampulex. In the NCBI database, a scaffold-level genome of a male Ampulex compressa is available, submitted by the University of California, Riverside, with a genome size of 277.4 Mb (GCA_038496175.1; from NCBI)^18; another submission, by the Zoologisches Forschungsmuseum Alexander Koenig, reports a genome size of 277.7 Mb (GCA_019049445.1; from NCBI)^19.
High-quality chromosome-level genome studies on cockroach wasps have been lacking, which limits research on their biology and prey interactions. In this study, we present the chromosome-level genome of A. clypecomplana, assembled using ONT long-read, MGI short-read, and Hi-C sequencing. This genome, spanning 338.43 Mb with a scaffold N50 of 19.05 Mb, provides an essential resource for understanding the evolutionary biology of Spheciformes and Apoidea within Hymenoptera and establishes a foundation for further genomic and evolutionary research.

Methods

Sample collection

Specimens of A. clypecomplana were collected on September 17, 2023, from Yiliang County, Kunming City, Yunnan Province (24.94°N, 103.16°E) and reared under laboratory conditions at room temperature. Female A. clypecomplana were provided with nymphs of Periplaneta americana (2–3 cm) for hunting and oviposition, while adult individuals were fed honey. Prior to sequencing, species identity was confirmed by both morphological examination^1 and COI barcode analysis^20,21. The specimens were deposited at Yunnan Agricultural University (accession number: YNAU 2024010010). Tables S1–S6 are provided in the Supplementary Information document.

Library preparation and sequencing

Genomic DNA for both long-read and short-read whole-genome sequencing was extracted using the Blood & Cell Culture DNA Midi Kit (QIAGEN, Germany) following the manufacturer’s protocol. DNA integrity and contamination were checked by 1% agarose gel electrophoresis. Purity was measured using a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA), requiring OD260/280 values between 1.8 and 2.0 and OD260/230 values within 2.0–2.2. DNA concentration was further quantified with a Qubit® 4.0 Fluorometer (Invitrogen, USA). For short-read sequencing, DNA libraries were prepared with an insert size of 350 bp using the MGIEasy Universal DNA Library Prep Kit V1.0 (CAT#1000005250, MGI).
These libraries were sequenced on the DNBSEQ-T7RS platform (MGI, Shenzhen, China). Long-read sequencing was performed using the Nanopore PromethION platform (Oxford Nanopore Technologies, UK) with an insert size of approximately 20 kb.

Total RNA was extracted from fresh samples using TRIzol reagent (TIANGEN), following the manufacturer’s guidelines. Poly-A RNA was enriched from total RNA using the Dynabeads mRNA Purification Kit (Cat#61006, Invitrogen) and fragmented using the fragmentation reagents from the MGIEasy RNA Library Prep Kit V3.1 (Cat# 1000005276, MGI). First-strand cDNA was synthesized using random primers and reverse transcriptase, followed by second-strand cDNA synthesis. The double-stranded cDNA was then subjected to end repair, A-tailing, and adapter ligation according to the manufacturer’s library construction protocol. The cDNA fragments were amplified by PCR and purified using MGIEasy DNA Clean Beads (Cat# 1000005279, MGI). The quality and size distribution of the constructed library were assessed using the Agilent Technologies 2100 Bioanalyzer. The double-stranded PCR products were heat-denatured and circularized using the splint oligo sequence provided in the MGIEasy Circularization Module (Cat# 1000005260, MGI). The resulting single-strand circular DNA (ssCir DNA) was used as the final library for sequencing. The qualified libraries were sequenced on the DNBSEQ-T7RS platform.

Chromosome conformation capture (Hi-C) sequencing was performed using fresh tissue (excluding the abdomen) from one female A. clypecomplana specimen. The samples were vacuum infiltrated in a nuclear isolation buffer containing 2% formaldehyde. Crosslinking was stopped by adding glycine followed by additional vacuum infiltration. The fixed tissue was then ground into a powder and resuspended in nuclear isolation buffer to obtain a nuclear suspension. The purified nuclei were digested with 100 units of DpnII and labeled with biotin-14-dATP.
Biotin-14-dATP was removed from non-ligated DNA ends using the exonuclease activity of T4 DNA polymerase. The ligated DNA was sheared into 350 bp fragments, then blunt-end repaired and A-tailed, followed by purification through biotin-streptavidin-mediated pull-down. The resulting Hi-C libraries were sequenced on the MGI-2000 platform.

In total, we obtained 84.67 Gb of ONT long reads, 27.34 Gb of MGI short reads, 73.04 Gb of Hi-C reads, and 8.16 Gb of RNA-seq reads for the genome assembly (Table 1).

Table 1. Statistics of the DNA/RNA sequence data used for genome assembly.

Library    Insert size (bp)    Reads number    Raw data (Gb)    N50 read length (bp)    Sequence coverage (X)
MGI        350                 182,244,650     27.34            150                     80.78
ONT        20,000              7,882,315       84.67            22,644                  250.18
Hi-C       350                 486,927,556     73.04            150                     215.82
RNA-seq    350                 54,413,932      8.16             150                     -
Total      -                   731,468,453     193.21           -                       -

Genome size evaluation and assembly

To ensure high-quality sequencing data, fastp v0.21.0^22 was used to filter raw reads and generate quality statistics for both raw and cleaned datasets. Prior to genome assembly, a k-mer analysis was conducted on the MGI sequencing data to infer the genomic characteristics of Ampulex clypecomplana, including genome size and heterozygosity (Fig. 1). Specifically, Jellyfish v2.3.0^23 was used to compute the frequency distribution of 17-mers, and GenomeScope v1.0.0^24 was used to model genome properties. The final analysis estimated the genome size of A. clypecomplana to be approximately 515.82 Mb, with an inferred heterozygosity rate of 0.70%.

Fig. 1. k-mer distribution curve and heterozygosity simulation curve of Ampulex clypecomplana. Am: Ampulex clypecomplana. Atha: Arabidopsis thaliana. Simulated short-read data at corresponding depths were generated using the Arabidopsis thaliana genome.
k-mer curve fitting was performed under different heterozygosity gradient combinations to estimate the heterozygosity of the Ampulex clypecomplana sample. The x-axis represents k-mer depth, and the y-axis represents k-mer depth frequency. For example, “AthaH0.010_X31” indicates Arabidopsis thaliana (Atha) with a heterozygosity (H) of 1.0% and a sequencing depth (X) of 31; the other labels follow the same pattern.

For de novo genome assembly, ONT reads were assembled using NextDenovo v2.3.1 (https://github.com/Nextomics/NextDenovo.git). Because ONT raw reads have an inherently high error rate, an initial self-correction step was performed using NextCorrect, which refined the reads into consensus sequences (CNS reads). These CNS reads were then analyzed by the NextGraph module, which identified sequence overlaps and constructed a preliminary genome assembly based on read correlations. To improve assembly accuracy, Racon (http://samtools.github.io/bcftools/) was used to polish the assembly with ONT long reads, and NextPolish^25 was then applied with default parameters to further refine the assembly using high-accuracy MGI short reads. To discard potentially redundant contigs and generate the final assembly, a similarity search was performed using the parameters “identity 0.8 -overlap 0.8”.

Clean paired-end Hi-C reads were mapped to the assembled draft sequences using bowtie2 (v2.3.2)^26 to obtain uniquely mapped paired-end reads. HiC-Pro (v2.8.1) (https://github.com/nservant/HiC-Pro) was used to identify and retain valid interaction pairs among the uniquely mapped paired-end reads for further analysis. Invalid read pairs, including dangling-end, self-cycle, re-ligation, and dumped products, were filtered out by HiC-Pro. The scaffolds were then clustered, ordered, and oriented onto chromosomes using LACHESIS^27. The final assembled genome length of A.
clypecomplana is 338.43 Mb, with a scaffold N50 of 19.05 Mb and a longest scaffold of 38.06 Mb. The lengths of the assembled chromosomes range from 9,677,010 bp to 42,715,628 bp. The average GC content of the assembled A. clypecomplana genome is 43.78% (Table 2, Figs. 2, 3).

Table 2. Genome assembly and annotation statistics of A. clypecomplana.

Elements                        Current version
Genome assembly
  Assembly size (Mb)            338.43
  Number of scaffolds           611
  Longest scaffold (Mb)         38.06
  N50 scaffold length (Mb)      19.05
  GC (%)                        43.78
  BUSCO completeness (%)        99.20
  Number of chromosomes         15
Gene annotation
  Protein-coding genes          12,381
  BUSCO completeness (%)        98.70
  5′ UTRs                       12,058
  3′ UTRs                       12,058
  mRNAs                         18,127

Fig. 2. Heat map of the Hi-C assembly of A. clypecomplana. The scale bar represents the interaction frequency of Hi-C links.

Fig. 3. Genome characteristics of A. clypecomplana. (a) Gene density; (b) GC content; (c) Repeat density; (d) LTR density; (e) DNA transposon density.

The completeness of the assembled A. clypecomplana genome was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO)^28 with the OrthoDB dataset insecta_odb10. In the assembled genome, approximately 99.2% of BUSCO genes are complete (99.1% complete single-copy and 0.1% complete duplicated), 0.3% are fragmented, and 0.5% are missing (Table 2, Table S2). These results indicate that the assembled genome is highly complete. The assembly of conserved genes was further evaluated against a set of protein families widely conserved across eukaryotes (the 248-gene core set). CEGMA^29 was used to predict these core genes in the genome. The prediction identified 246 core genes, representing 99.19% of the core gene set, of which 241 were complete, accounting for 97.18%.
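The depth and completeness figures reported so far can be checked with simple arithmetic. The sketch below recomputes the sequence coverage values of Table 1 and the CEGMA percentages from the numbers given in the text. It assumes that coverage was calculated against the final assembly size (338.43 Mb) rather than the k-mer estimate, and that 1 Gb = 1000 Mb; both are inferences from the reported numbers, not statements made by the authors. The function and variable names are illustrative only.

```python
# Figures copied from Table 1, Table 2 and the BUSCO/CEGMA paragraph of this article.
ASSEMBLY_MB = 338.43  # final assembly size in Mb (assumed denominator for coverage)

RAW_DATA_GB = {"MGI": 27.34, "ONT": 84.67, "Hi-C": 73.04}
REPORTED_DEPTH_X = {"MGI": 80.78, "ONT": 250.18, "Hi-C": 215.82}

CEGMA_CORE_SET = 248   # size of the core gene set
CEGMA_FOUND = 246      # reported as 99.19%
CEGMA_COMPLETE = 241   # reported as 97.18%

BUSCO_PERCENT = {"single": 99.1, "duplicated": 0.1, "fragmented": 0.3, "missing": 0.5}


def depth_x(raw_gb: float, genome_mb: float) -> float:
    """Sequencing depth = sequenced bases / genome size (assumes 1 Gb = 1000 Mb)."""
    return raw_gb * 1000.0 / genome_mb


def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, expressed as a percentage."""
    return 100.0 * part / whole


def recompute() -> dict:
    """Recompute the reported depths and completeness percentages."""
    out = {
        f"depth_{lib}": round(depth_x(gb, ASSEMBLY_MB), 2)
        for lib, gb in RAW_DATA_GB.items()
    }
    out["cegma_found_pct"] = round(percent(CEGMA_FOUND, CEGMA_CORE_SET), 2)
    out["cegma_complete_pct"] = round(percent(CEGMA_COMPLETE, CEGMA_CORE_SET), 2)
    # Complete BUSCOs = single-copy + duplicated.
    out["busco_complete_pct"] = round(
        BUSCO_PERCENT["single"] + BUSCO_PERCENT["duplicated"], 1
    )
    return out


if __name__ == "__main__":
    for name, value in recompute().items():
        print(f"{name}: {value}")
```

Under these assumptions, the recomputed depths agree with the coverage column of Table 1, and the CEGMA shares correspond to 246 and 241 of the 248 core genes.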
These results indicate that the core genes in the assembled genome are relatively complete.

Genome annotation

The genome of A. clypecomplana was annotated for repetitive elements, non-coding RNAs (ncRNAs), and protein-coding genes (PCGs). To generate a comprehensive transposable element (TE) library, the Extensive de novo TE Annotator (EDTA)^30 pipeline was first used to detect and classify repetitive sequences. RepeatModeler v2.0.2^31 was then used to identify non-LTR retrotransposons and unclassified TEs that were not captured by the EDTA pipeline. The final non-redundant TE library was assembled by integrating these results with data from Dfam3.2^32. Subsequently, RepeatMasker v4.1.2 (http://www.repeatmasker.org) was used to search for known and novel TEs. Infernal 1.1.2^33 was used to search the Rfam library^34 to identify snRNAs, snoRNAs, and miRNAs. tRNAs were identified using tRNAscan-SE^35.

In total, 110.75 Mb of repetitive sequences were annotated, occupying 32.72% of the genome. Among these, DNA transposons accounted for 9.40%, retroelements for 4.90% (LTR elements 3.90%, LINEs 0.98%, SINEs 0.02%), unclassified elements for 18.18%, simple repeats for 0.14%, and rolling-circle elements for 0.08% (Table 3, Table S1, Fig. 4). A total of 331 rRNAs, 1,080 tRNAs, 68 miRNAs, and 48 snRNAs were predicted in the A. clypecomplana genome (Table S3).

Table 3. Statistics of repetitive elements in the A. clypecomplana genome.

Repeat type        Length occupied (bp)    Proportion in genome
DNA transposon     31,820,124              9.40%
SINE               68,483                  0.02%
LINE               3,322,703               0.98%
LTR                13,199,027              3.90%
Rolling-circles    260,867                 0.08%
Satellite          95,716                  0.03%
Simple repeat      458,570                 0.14%
Unknown            61,521,990              18.18%
Total              110,747,480             32.72%

Fig. 4. Classification circle diagram of repetitive elements.

The annotation of protein-coding genes in A.
clypecomplana was performed using a comprehensive approach that integrated transcriptome-based, de novo, and homology-based prediction strategies. For transcriptome-guided gene prediction, RNA-seq reads were first aligned to the assembled genome using HISAT2 v2.2.1^36. The aligned transcripts were then assembled and refined using the PASA pipeline v2.4.1 (https://github.com/PASApipeline/PASApipeline), which facilitated the identification of candidate coding regions by incorporating splicing information and alternative isoform structures. De novo gene prediction on the repeat-masked genome was performed using AUGUSTUS v3.3.3^37 and SNAP v2006 07-2833^38. Protein sequences of Hymenoptera insects were downloaded from the NCBI database to use as references for homology-based prediction. Exonerate