ABSTRACT

   Global sequencing of the severe acute respiratory syndrome coronavirus
   2 (SARS-CoV-2) has enabled researchers to monitor the genetic
   information of new pathogenic variants. However, the integrity and
   accuracy of most genomic information cannot be guaranteed due to
   limited sample quality and sequencing technology. This restricts
   researchers from tracking the latest changes in the virus genome. In
   this study, we aimed to characterize the diversity and conservation
   balance of complete SARS-CoV-2 genomes worldwide since the outbreak. To
   achieve this goal, we collected 5,966,490 genome sequences from various
   parts of the world, excluding those with incomplete or
   unknown/degenerate sequences. Our methodology included comparing
   sequences using BLASTN and BLASTP, analyzing RNA secondary structure,
   and calculating entropy. Our findings provide insights into the
   characteristics of SARS-CoV-2 and potential mechanisms of pathogenesis.
   We found that uracil had the highest proportion of all bases among
   various coronaviruses and cytosine to uracil mutations had the highest
   proportion among all point mutations. The consistency in the front part
   (1–26,599 nt) was significantly higher than that in the back part
   (26,600–29,903 nt) of the genome. For most genes, the similar
   consistency characteristics were also observed in their protein
   families.

IMPORTANCE

   Our results indicate that most severe acute respiratory syndrome
   coronavirus 2 genomes sampled from patients had a mutation rate ≤1.07 ‰
   and genome-tail proteins (including S protein) were the main sources of
   genetic polymorphism. The analysis of the virus-host interaction
   network of genome-tail proteins showed that they shared some antiviral
   signaling pathways, especially the intracellular protein transport
   pathway.

   KEYWORDS: coronavirus, SARS-CoV-2, genomic diversity, COVID-19

INTRODUCTION

   As of January 2023, the coronavirus disease 2019 (COVID-19) pandemic
   caused by the severe acute respiratory syndrome coronavirus-2
   (SARS-CoV-2) has resulted in a staggering 655 million confirmed cases
   and 6.67 million deaths worldwide ([36]https://covid19.who.int/).
   Despite the rapid progress in related research on SARS-CoV-2, driven by
   the combined efforts of healthcare facilities and researchers, the
   extensive infections have led to the emergence of many variants ([37]1,
   [38]2). Currently, there are 29 clades or 35 emerging lineages,
   including variants of concern (VOC), variants of interest (VOI), and
   variants under monitoring (VUM)
   ([39]https://gisaid.org/database-features/influenza-genomic-epidemiolog
   y/). In recent years, there have been several rounds of mainstream
   strain replacement, with one after another emergence of original
   strains that have replaced all parts of the early years, including
   Alpha GRY (B.1.1.7) ([40]3), Delta GK (B.1.617.2 + AY.x) ([41]4,
   [42]5), and Omicron GRA (B.1.1.529 + BA.x) ([43]6, [44]7). Although
   some strains have been dominant over months, there is still an urgent
   need to track and study all mutants to help predict unknown strains and
   reveal the evolutionary laws of SARS-CoV-2 to cope with its rapid
   genetic evolution.

   RNA viruses, like SARS-CoV-2, undergo frequent genetic changes that
   pose significant challenges to human health in terms of disease
   diagnosis, prevention, and treatment. Nucleic acid test is currently
   the most accurate diagnostic method, but mutations in viral genes may
   lead to mispriming or mismatching of the detection primers, resulting
   in false negative results ([45]8 [46]– [47]10); rapid antigen tests are
   also less sensitive to some variants ([48]11). The mutation of the
   spike protein helps variants, including Omicron, adapt to and escape
   from the human immune system, affecting the effectiveness of immunity
   and treatment ([49]12, [50]13). As vaccines are effective means of
   preventing diseases, viral gene mutations may lead to immune escape
   ([51]14). Drug resistance may occur when monoclonal antibodies against
   SARS-CoV-2 encounter variants, for example, monoclonal antibodies
   Ly-CoV555 and REGN10933 have impaired or lost activity against
   B.1.526-E484K strain ([52]15); similarly, the neutralizing activity of
   monoclonal antibodies against B.1.351, B.1.1.28, B.1.617.1, and B.1.526
   viruses was reduced or eliminated in cell culture ([53]16). Many
   studies have investigated key variants or proteins, including S protein
   (the major antigen), or a smaller number of genomes ([54]17 [55]–
   [56]19). However, due to the limitation of sample quality and
   sequencing technology, there is a lack of large-scale complete genome
   research. Continuous whole-genome sequencing of SARS-CoV-2 is necessary
   to maintain genomic surveillance ([57]20); further analysis and
   characterization of the novel variants are critical to understanding
   the prevention and therapy of COVID-19.

   In this study, we attempted to identify conserved and essential RNA
   secondary structures in the genome of SARS-CoV-2, with a particular
   emphasis on completeness and accuracy, which are critical for
   developing targeted therapies to combat viral mutations. To achieve
   this goal, we compiled SARS-CoV-2 sequence data from 5,966,490 samples
   globally, focusing only on complete genomes and excluding any uncertain
   base sequences. Then, the original proportions of the different bases
   and the base changes brought about by the variants were investigated.
   Following the alignment of all complete genomes with the full length of
   29,903, we analyzed the identity of bases at each position and
   calculated the identity rate per 100 nt. Moreover, we investigated
   genes with significant differences and their variations in comparison
   to homologous proteins across different viruses. We also identified
   human key interacting proteins and their common pathways that
   correspond to SARS-CoV-2.

RESULTS

Global SARS-CoV-2 genome composition and distribution

   As of 22 January 2023, the SARS-CoV-2 Data Hub contains 5,966,490
   nucleotides in the NCBI Virus database. After excluding incomplete
   sequence information, 1,394,180 complete genome sequences were
   obtained. As the length of the reference genome [58]NC_045512.2 is
   29,903 nucleotides, 80,833 genomes were selected as candidates based on
   the length. These candidate samples come from various parts of the
   world, with the highest number in Europe, followed by North America and
   Asia ([59]Fig. 1). These data represent sequencing data only, not the
   total number of COVID-19 cases.

Fig 1.

   [60]Fig 1
   [61]Open in a new tab

   Screening and distribution of the SARS-CoV-2 genome.

Uracil has the highest proportion in the SARS-CoV-2 genome, and cytosine to
uracil mutations are the main type of point mutations

   Examination of the 80,833 genome sequences revealed 22,588 duplicate
   values, leaving 58,245 unique values after removal. Subsequently,
   genomes with any missing (−) and degenerate information (R, Y, K, M, S,
   W, B, D, H, V, and N) were removed, resulting in 5,966 genomes with
   well-defined sequences. When analyzing these genomes as a whole, the
   highest proportion was U at 32%, followed by A＞G＞C ([62]Fig. 2A).
   Surprisingly, despite C being the lowest (18%), the proportion of
   cytosine to uracil (C-to-U) mutations was the highest of the 12 types
   of point mutations (36%). The second was G-to-U (16%). The other ten
   mutations were between 2% and 9%. A staggering 58% of mutations were
   converted to U, while only 11% were converted to C ([63]Fig. 2B and C).

Fig 2.

   [64]Fig 2
   [65]Open in a new tab

   (A) The proportion of the four bases in the SARS-CoV-2 genome. (B) The
   proportion of various transitions and transversions in point mutations.
   (C) The number of 12 types of point mutations.

   We compared the genomes of seven human-infecting coronaviruses,
   including SARS-CoV-2, and found that their sizes were similar (27,277
   to 30,713 nt), with SARS-CoV-2 ranking third in size after human
   coronavirus OC43 (HCoV-OC43) and Middle East respiratory syndrome
   coronavirus (MERS-CoV) ([66]Fig. 3A). Base composition analysis showed
   a consistent order, U＞A＞G＞C, in all seven viruses ([67]Fig. 3B).

Fig 3.

   [68]Fig 3
   [69]Open in a new tab

   (A) The reference genome length of coronaviruses that can infect
   humans. (B) The ratio of bases in the different coronaviruses.

The mutation rate of most variants is less than 1‰ (median 0.60‰)

   The genomic sequences of the variants exhibit some diversity. We
   aligned the 80,833 genomic sequences, and there were 295 genomes with
   more than 40 mismatches to the consensus sequence, of which the top
   three have 15,494, 15,492, and 2,623 mismatches, respectively ([70]Fig.
   4A). Considering that 15,494 and 15,492 are close, we performed
   sequence alignments and found that they only have two different
   nucleotides ([71]Fig. 4B). After alignment to the reference genome
   [72]NC_045512.2, we found that these two variants had two deletions and
   one insertion mutation, and the number of deleted and inserted
   nucleotides were equal, resulting in no change in genome length (3 + 9
   = 12; [73]Fig. 4C). Further sequence analysis revealed that the two
   deletion mutations leading to SS-to-C mutation in Nsp3 and an SGF
   deletion in Nsp6, respectively. The insertion mutation was located in
   the non-coding region between ORF8 and the N protein ([74]Fig. 4D). We
   aligned the insertion sequence “AACAAACAAACA” to the reference genome
   sequence NC_ 045512.2; the results showed the highest similarity to
   sequences near the location of insertion mutations (28,259–28,269;
   [75]Fig. 4E). The above mutations resulted in an unusually high number
   of mismatches, and after excluding these two variants, the number of
   mismatches for all the other variants ranged from 0 to 2,623, with a
   median of 18 (0.60 ‰), and 90% of variants had less than 30 (1.07 ‰).

Fig 4.

   [76]Fig 4
   [77]Open in a new tab

   (A) Number of mismatches in the variant genomes (descending order). (B)
   The two variants with the highest mismatches ([78]MZ233993.1 and
   [79]MZ097525.1) were highly similar in sequence, only two bases are
   different. (C) [80]MZ233993.1 and [81]MZ097525.1 contain two deletion
   mutations and one insertion mutation, and the number of deleted bases
   was equal to the number of inserted bases. (D) The deletion mutation in
   [82]MZ233993.1 and [83]MZ097525.1 results in two or three amino acid
   mutations, and the insertion mutation is located in a non-coding region
   between ORF8 and the N protein. (E) The alignment result of the
   inserted sequence “AACAAAAAACA” and reference genome [84]NC_045512.2
   aims to explore the source of insertion mutation. (F) After excluding
   the frameshift mutation, the overall distribution of the number of
   mismatches in the variants.

   The insertion mutation causing “AACA” change to “AACA-AACA-AACA-AACA”
   occurred in the non-coding region between ORF8 and N protein ([85]Fig.
   5A), seven nucleotides away from the start codon of the N protein.
   While this mutation does not affect the protein sequence, its impact on
   the Shine-Dalgarno (SD) sequence is unknown. We simulated the RNA
   secondary structure before and after the mutation ([86]Fig. 5B) and
   found that the insertion mutation led to a larger loop in front of the
   start codon, which may reduce complementarity with the ribosome and
   slow down its binding speed, thereby affecting translation rate.
   However, there were no obvious structural changes upstream and
   downstream of the mutation. Deviations from the baseline in the entropy
   plot indicate that the structure involving the mutant has low
   positional entropy, whereas the original sequence of the downstream
   gene affected by the mutation has higher randomness and positional
   entropy (dashed lines label the mutation).

Fig 5.

   [87]Fig 5
   [88]Open in a new tab

   (A) Position of the insertion mutation. (B) Comparison of the secondary
   structure and energy entropy between the reference genome
   [89]NC_045512.2 and the insertion mutation ([90]MZ233993.1 and
   [91]MZ097525.1).

The consistency of the front part was significantly higher than that of the
tail part in the SARS-CoV-2 genome

   After all genome sequences were aligned, 1–29,903 bases were analyzed
   sequentially. If the sequence of all genomes in the same location is
   identical, it is defined as identity; otherwise, it is defined as
   mismatch. The percent identity was 55.98% for the whole genome
   ([92]Fig. 6A). Subsequently, continuous identical sequences were
   analyzed. As the length increased, the number of continuous sequences
   decreased ([93]Fig. 6B). There were 1,225 sequences at n = 5. There
   were only 13 sequences at n＞20. The longest continuous consistent
   sequence length is 32. At the genome-wide level, the percent identities
   were calculated per 100 nt, from which the genome-wide landscape was
   mapped ([94]Fig. 6C). The results showed that the consistency of the
   front part of the genome (1–26,599 nt) was 62.64%, significantly higher
   than the 4.03% in the tail part (26,600–29,903 nt). The four highest
   sites are inside the genes Nsp3, Nsp8, RdRp, and S (annotated with blue
   arrows). The genes located at the end of the genome, particularly those
   beyond the M gene, exhibit low consistency. Certain genes display
   significant internal consistency changes, particularly two positions
   within the S and Nsp14 genes that are as low as 0.

Fig 6.

   [95]Fig 6
   [96]Open in a new tab

   (A) The percent identity in the SARS-CoV-2 genome. (B) Continuous
   identical sequences in the SARS-CoV-2 genome.

   The consistency of most genes in ORF1ab was higher than 62%, with a
   median of 69%; the S gene showed large internal differences; the
   majority of genes in the tail were less than 13%, with a median of 6%
   ([97]Fig. 7A). The top four sites with the highest consistency in
   SARS-CoV-2 also showed high conservation in their respective protein
   families ([98]Fig. 7B). Among them, the sites in proteins RdRp and S
   are also highly conserved in their protein families. There are one and
   three different amino acid sequences in Nsp3 and Nsp8, which can be
   used as important sites for virus identification. S protein contains
   multiple domains; S1 plays a role in the binding of receptor ACE2
   ([99]21), whereas S2 mediates the membrane fusion process ([100]22).
   Partial domains and sequence identity are highly correlated; the NTD
   domain in S1 and HR1 in S2 showed a significantly reduced (16.9% and
   52.3%), while the consistency of other parts was 69.2%; they showed
   clear demarcation ([101]Fig. 7C). In addition to S protein, another
   protein that shows drastic internal variation is Nsp14, most of which
   had high consistency, but one site was 0. This site not only varies
   greatly among different SARS-CoV-2 variants but is also extremely
   poorly conserved among homologous proteins of different viruses. The
   structure of Nsp14 protein shows two active sites corresponding to two
   functions; the site is located at a marginal position (shown by white
   ovals; [102]Fig. 7D).

Fig 7.

   [103]Fig 7
   [104]Open in a new tab

   Analysis of several gene loci with the largest differences in
   consistency. (A) Consistency analysis of the anterior part of the
   genome (ORF1ab), S gene, and the tail of the genome (other genes). (B)
   Conservation analysis of complete consensus sequence across protein
   domain families. (C) The relationship between sequence consistency and
   different domains in S protein. (D) Conservation and structural
   analysis of the least conserved site in SARS-CoV-2.

   To further investigate genes with low conservation at the 3′ end of the
   SARS-CoV-2 genome, we selected five key proteins ORF6, ORF7a, ORF7b,
   ORF8, and ORF10. A total of 213 corresponding human interacting
   proteins were collected through literature mining ([105]Table 1).
   Analysis of these proteins revealed partial overlap or shared pathways,
   with ORF8 having the highest number of interacting proteins ([106]Fig.
   8A). Pathway and process enrichment analysis was performed using DAVID
   ([107]23) and KOBAS ([108]24) with the following ontology sources: KEGG
   Pathway, GO Biological Processes, and Reactome Gene Sets (P < 0.01,
   count > 3, and enrichment factor >1.5). We obtained five sets of terms,
   with the number ranging from 3 to 17 ([109]Fig. 8B). The most
   significant term for each group was GO: 0035966: response to
   topologically incorrect protein (ORF8); GO: 0006886: intracellular
   protein transport (ORF7A); R-HSA-9679506: SARS-CoV infections (ORF6);
   WP4860: Hijack of ubiquitination by SARS-CoV-2 (ORF 10); and GO:
   0006886: intracellular protein transport (ORF 7b). GO: 0006886:
   intracellular protein transport was a common term among the five
   groups. To further understand the relationships between the terms, we
   generated a network diagram ([110]Fig. 8C and D) with different colors
   representing different groups and terms. The results showed that ORF8
   had the widest distribution and was linked to other groups. The size of
   the node was proportional to the total number of hits belonging to that
   particular term, with the proteasomal protein catabolic process,
   response to topologically incorrect protein, intracellular protein
   transport, and glycoprotein metabolic process having higher hit rates.

TABLE 1.

   Key proteins with low consistency in SARS-CoV-2 genome tail and their
   interacting proteins reported in the literature
   SARS-CoV-2 protein Human interacting proteins Reference
   ORF6 NUP98, RAE1, MTCH1, KINH, EDC4, DCTN2, ATD3A, RAE1L, FAF2, XP01,
   SPF27, MYCB2, AT1A1, ATD3B, AT2A2, ACLY, DESP, ATPB, RT27, ATPA, CE170
   [111]25
   ORF7a HEATR3 [112]25
   MDN1, HACD3, FAF2, XPO1, COPG2, COPB2, SMC2, MIC60, BAG6, QPCTL, XPOT,
   PSMD1, AT2A2, PP6R3, COPA, PSMD2, PRS7, TECR, COPB, PSMD7, AT1A1, PRS4,
   DNJA2, UBR5, PSD13, PSMD6, HSP74, SC16A, CMC1, MPCP, GEMI4, PSA, DNJC7,
   RPB2, PRS8, GHC1, CALU, EMD, DNJA3, TXTP, BAG2, DDX20, DJC10, ECHA,
   ATPB, 2AAA [113]26
   ORF7b HACD3, RO60, XPO1, COPG2, COPB2, M2OM, FAF2, QPCTL, COPA, AT2A2,
   MIC60, TECR, SMC2, XPOT, TMM33, COPB, MPCP, PSA, AT1A1, ATD3A, SND1,
   NOC3L, ATPB, DNJA1, ATPA, CALU, DPM1, RCN2 [114]26
   ORF8 PLOD2 [115]26
   TOR1A, STC2, PLAT, ITGB1, CISD3, COL6A1, PVR, DNMT1, LOX, PCSK6, INHBE,
   NPC2, MFGE8, OS9, NPTX1, POGLUT2, POGLUT3, ERO1B, PLD3, FOXRED2, CHPF,
   PUSL1, EMC1, GGH, ERLEC1, IL17RA, NGLY1, HS6ST2, SDF2, NEU1, GDF15,
   TM2D3, ERP44, EDEM3, SlL1, POFUT1, SMOC1, PLEKHF2, FBXL12, UGGT2,
   CHPF2, ADAMTS1, HYOU1, FKBP7, ADAM9, FKBP10, HACD3, BAG6, COPG2, XP01,
   SMC2, MIC60, COPA, AT2A2, CMC1, PSMD1, COPB, TECR, M2OM, XPOT, PRS7,
   UBR5, AT1A1, PSMD2, DNJA2, ATD3A, SND1, RPB2, SC16A, MPCP, PSD13,
   PSMD6, DNJA1, GHC1, TF3C1, CALU, GEMI4, PSA, PSA7, GNAS1, ECHA, ATPA,
   PRS8, ATPB, DDX20, EMD, EIF3A [116]25
   ORF10 PPT1 [117]26
   CUL2, MAP7D1, THTPA, ZYG11B, TIMM8B, RBX1, ELOC, ELOB, SMC2, MIC60,
   XP01, FAF2, HACD3, AT2A2, COPG2, AT1A1, IMDH2, XPOT, SC16A, COPA, COPB,
   GEMI4, YTHD2, MPCP, RCN2, TCPE, RSSA, TCPD [118]25
   [119]Open in a new tab

Fig 8.

   [120]Fig 8
   [121]Open in a new tab

   Genome-wide annotation and analysis of key proteins with low
   consistency at the C-terminus. (A) The repetition and connectivity
   among the human proteins interacting with ORF6, ORF7a, ORF7b, ORF8, and
   ORF10. The blue lines indicate the same protein, while the purple lines
   represent proteins in the same pathway. (B) Pathway and process
   enrichment analysis of human proteins interacting with the five key
   proteins. (C) Enriched term network diagram colored by input clusters.
   (D) Enriched term network diagram. (E) Enrichment analysis of
   protein-protein interactions. (F) MCODE components obtained from
   enrichment analysis of protein-protein interactions. (G) Summary of
   enrichment analysis performed in DisGeNET.

   We used STRING ([122]27) (physical score > 0.132) to generate a
   protein-protein interaction enrichment network, which included subsets
   of proteins that physically interacted with at least one other member
   in the list ([123]Fig. 8E). The Molecular Complex Detection (MCODE)
   algorithm ([124]28) was used to identify densely connected network
   components if the network contained 3 to 500 proteins ([125]Fig. 8F).
   ORF8 had a relatively large proportion and had significant overlap with
   ORF7a. Hypoxia (MCODE1) and endoplasmic reticulum stress (MCODE2)
   occupied central positions. Finally, we used the DisGeNET database
   ([126]29) to help determine the priority of genotype-phenotype
   relationships ([127]Fig. 8G). Contracture, premature aging syndrome,
   cortical visual impairment, and polycythemia were identified as
   symptoms or diseases related to ORF8, ORF7a, ORF10, and ORF7b.

DISCUSSION

   In this work, we reported the diversity and conservation of SARS-CoV-2
   genomes worldwide since the outbreak of the COVID-19 pandemic. A wide
   time range, samples from around the world, complete genome, and
   accurate sequence are the criteria for selecting samples in this study.
   Researchers have discovered multiple SARS-CoV-2 epidemic variants, some
   of which are more successful in evading the immune system’s defense,
   posing challenges for healthcare professionals to control the wave of
   infection, including VOCs that are gradually replacing original strains
   around the world. For example, the D614G mutation contributes to the
   adaptive advantage of SARS-CoV-2 and enhances binding to the human
   cell-surface receptor angiotensin-converting enzyme 2 (ACE2) ([128]30
   [129]– [130]32) and is therefore associated with enhanced infectivity.
   Whereas mutations L452R, S477N, and E484K may result in decreased
   antibody-neutralizing activity or increased resistance to
   vaccine-elicited antibodies ([131]33, [132]34). However, most reports
   and subsequent studies have focused on mutations in single amino acids
   or proteins; the horizon cannot be at the genome-wide level.

   Completeness and accuracy are two major aspects that limit genetic
   information. Firstly, due to limitations in sample quality and
   sequencing technology, many genome sequences are incomplete, with
   missing regions often occurring at the beginning and end of the
   sequence. While the lack of poly A at the tail has minimal impact, the
   starting sequence may contain important genetic information. Degenerate
   information in sequencing results is a major factor affecting sequence
   accuracy, and many genomes that contain unknown sequences are excluded,
   even if these are short. We selected complete genomes with well-defined
   sequences and removed duplications and incomplete or degenerate
   information, resulting in 5,966 unique samples for further analysis;
   these genomes were sourced from around the world, with the majority
   coming from Europe and the Americas.

   We have identified a nucleotide alteration that likely represents an
   evolutionary hallmark of SARS-CoV-2: C-to-U, which exhibits the highest
   proportion among all point mutations. Our analysis of the genomes of
   various coronaviruses that infect humans reveals that uracil is the
   most abundant among the four bases; this phenomenon is observed in all
   seven viruses. The primary reason for this alteration may be mediated
   by apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like
   (APOBEC) ([133]35). APOBEC was demonstrated in human immunodeficiency
   virus type 1 (HIV-1), where guanine to adenine (G-to-A) and cytosine to
   uracil (C-to-U) changes in HIV-1 transcripts have been confirmed
   ([134]36). However, the related mechanism in SARS-CoV-2 requires
   further investigation.

   We also analyzed the genomic diversity and mutation rate of variants
   and identified some rare variants with high mismatches, deletions, and
   insertion mutations. These two variants had no change in genome length
   despite containing multiple mutations. Although these mutations do not
   affect protein sequence directly, the insertion mutation may
   significantly influence the RNA secondary structure by enlarging the
   loop ahead of the start codon of the N protein, which may affect
   translational initiation.

   The rapid rate of variation is a crucial determinant of RNA viruses to
   adapt to their living environment; however, a certain level of
   conservation is a prerequisite for survival. We observed that the
   anterior segment of the SARS-CoV-2 genome exhibits 62.64% identity,
   which is significantly higher than the 4.03% identity in the tail
   segment, suggesting that conservation of the anterior segment is
   essential for viral survival, whereas the posterior segment is critical
   for viral mutation and evolution. Overall, the consistency within most
   genes is stable, with only a few genes exhibiting large fluctuations.
   Notably, the identity of the S1-NTD domain in S proteins (16.9%) is
   significantly lower than that of the adjacent S1-RBD (70.8%). RBD and
   NTD constitute the two major targets for antibodies ([135]37). The
   S1-NTD is responsible for binding to human ACE2, and several monoclonal
   antibodies have been discovered ([136]38). SARS-CoV-2 exhibits improved
   binding to the ACE2 receptor compared with SARS-CoV ([137]39, [138]40),
   implying that S1-RBD and FCS may contribute to the severity of
   infection in SARS-CoV-2.

   The majority of SARS-CoV-2 samples had a genomic mutation rate ≤ 1.07
   ‰, with the genome tail and S protein being the main sources of genetic
   polymorphism. Moreover, we investigated five key proteins ORF6, ORF7a,
   ORF7b, ORF8, and ORF10, representing the low-conserved genes in the
   SARS-CoV-2 genome concerning human interacting proteins, pathways, and
   processes. It showed that they share some antiviral signaling pathways,
   especially intracellular protein transport.

   Although our study provides comprehensive analyses of the complete
   genome sequence of SARS-CoV-2, several limitations need to be
   addressed. First, our sample does not include incomplete and degenerate
   information containing genomes. Therefore, our analysis may not
   accurately reflect the proportion of mutations in all infected cases.
   Second, while our investigation emphasizes the importance of non-coding
   regions in gene regulation, the functional significance of these
   variants requires further experimental validation. Finally, although we
   investigated molecular interactions and related diseases using
   literature mining and databases, more detailed and systematic
   experiments are needed to prove drug efficacy and side effects.

   In conclusion, our study provides valuable insights into the nucleotide
   composition, mutation profile, genomic diversity, and consistency of
   the SARS-CoV-2 genome. Our findings enhance our understanding of the
   virus’s features and improve our ability to develop antiviral
   strategies against the variants. The insertion mutation near the start
   codon of the N protein may affect translational initiation
   significantly, further validating the importance of the non-coding
   regions in gene regulation. Finally, the molecular interactions and
   related diseases revealed by our investigation of low-conserved genes
   highlight their biological significance as potential therapeutic
   targets.

MATERIALS AND METHODS

Data source and processing

   This study analyzed 5,966,490 nucleotides registered in the SARS-CoV-2
   data hub of the NCBI virus database before 22 January 2023
   ([139]https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). After excluding
   sequences containing any missing (−) and degenerate information (R, Y,
   K, M, S, W, B, D, H, V, and N), we obtained 80,833 viral accession
   ([140]Table S1). During data processing, 22,588 duplicate values and
   samples with missing or unclear information were removed, resulting in
   5,966 clear genome sequences.

Genomic and protein sequence alignment

   Sequence alignment between SARS-CoV-2 genomes was performed using
   BLASTN, Expect threshold (0.05), word size (7), match/mismatch fraction
   (1, –1), and gap cost (1, 2) were used as parameters. Low-complexity
   regions were filtered. BLASTP was used for protein homology analysis.
   The expected threshold is 0.05, word size is 6, and the cap costs are
   existence 11 and extension 1.

Genome consistency analysis

   The genome sequence data to be analyzed were downloaded through the
   NCBI virus database, and the collected sequences were aligned with
   1–29,903 nt of the reference genome using alignment software (BLAST or
   EmEditor) to find out the consistent and mismatch parts. Dividing the
   concordant fraction in each 100 nt by the total yielded the consistency
   rate, and characteristics of the consistency rate were determined.

Protein structure simulation analysis

   The protein sequence data to be analyzed were collected through the
   public database PDB, and the three-dimensional structure prediction of
   the protein was performed using protein structure prediction software
   (SWISS-MODEL) to analyze the simulation results and determine the
   conformational characteristics, stability, and function of the protein.

RNA secondary structure analysis and entropy calculation

   The 63-nt RNA (TTGTTTTAGATTTCATCTAAACGAACAAACTAAAATGTCTGATAATG
   6ACCCCAAAATCAGCG), which includes a partial coding sequence for ORF8
   and N protein along with a non-coding region in the middle, and its
   corresponding 75-nt mutant
   (TTGTTTTAGATTTCATCTAAACGAACAAACAAACAAACAAACTAAAATGTCTGATAATG
   6ACCCCAAAATCAGCG), were submitted to the RNAfold Web server
   ([141]http://rna.tbi.univie.ac.at//cgi-bin/RNAWebSuite/ RNAfold.cgi)
   using the “minimum free energy (MFE) and partition function” algorithm
   to avoid isolated base pairs. Other parameters were left as default to
   generate positional entropy, center of mass structure, and potential
   MFE structure.

Gene ontology pathway enrichment analysis

   We used the GO annotation tool DAVID for gene functional annotation,
   which consists of three main aspects: molecular function, cellular
   component, and biological process. Online tools DAVID was utilized for
   KEGG and reactome pathway analysis. The reliability and biological
   significance of the analysis results were evaluated. Statistical tests
   such as Fisher’s exact test were performed to validate the analysis
   results. The analysis results were presented in the form of graphs,
   heat maps, and other visualizations using R and Cytoscape.

Statistical analysis

   Statistical analysis was performed using SPSS 17.0 software (SPSS,
   Chicago, USA). The t-test was utilized for group comparisons, and data
   were presented as mean ± standard deviation. GraphPad Prism 8 was
   employed to generate graphics. P < 0.05 was considered statistically
   significant.

Supplementary Material

   Reviewer comments
   [142]reviewer-comments.pdf^ (1.6MB, pdf)

ACKNOWLEDGMENTS