Abstract

   Patients diagnosed with early-stage cancers have a substantially higher
   chance of survival than those with late-stage disease. However,
   options for early cancer screening are limited, with most cancer types
   lacking an effective screening tool. Here we report a miRNA-based blood
   test for multi-cancer early detection based on examination of serum
   microRNA microarray data from cancer patients and controls. First, a
   large multi-cancer training set that included 1,408 patients across 7
   cancer types and 1,408 age- and gender-matched non-cancer controls was
   used to develop a 4-microRNA diagnostic model using 10-fold
   cross-validation. In three independent validation sets comprising a
   total of 4,875 cancer patients across 13 cancer types and 3,722
   non-cancer participants, the 4-microRNA model achieved greater than 90%
   sensitivity for 9 cancer types (lung, biliary tract, bladder,
   colorectal, esophageal, gastric, glioma, pancreatic, and prostate
   cancers) and 75–84% sensitivity for 3 cancer types (sarcoma, liver, and
   ovarian cancer), while maintaining greater than 99% specificity. The
   sensitivity remained > 99% for patients with stage I lung cancer.
   Our study provided novel evidence to support the development of an
   inexpensive and accurate miRNA-based blood test for multi-cancer early
   detection.

Supplementary Information

   The online version contains supplementary material available at
   10.1038/s41598-024-73783-0.

   Keywords: Multi-cancer early detection, MicroRNA, Noninvasive,
   Blood-based diagnostic model

   Subject terms: Cancer, Computational biology and bioinformatics,
   Biomarkers, Molecular medicine, Oncology

Introduction

   Cancer ranks as the first or second leading cause of death in most
   countries worldwide^[28]1. In the United States, the American Cancer
   Society estimated 1.9 million new cancer cases and nearly 610,000 cancer
   deaths in 2022^[29]2. Patients diagnosed with early-stage cancers have
   much higher survival rates than those at late stages. For example, the
   5-year patient survival rate for localized colorectal cancers is 91%
   but only 15% for those that have spread to distant organs^[30]2.
   However, early-stage cancer patients often have no symptoms and thus
   are more likely to miss timely diagnosis^[31]3,[32]4. Therefore,
   detecting cancers at early stages is paramount to reduce cancer-related
   mortality.

   The most effective way to detect cancer early is to make screening
   tools available and accessible to the general population.
   Unfortunately, such screening options are limited.
   Currently, only four cancer types have screening tests recommended by
   the United States Preventive Services Task Force (USPSTF): mammography
   for breast cancer, cytology/HPV testing for cervical cancer,
   colonoscopy and/or stool-based testing for colon cancer, and low-dose
   CT scans for lung cancer^[33]5–[34]8. A challenge of these
   single-cancer screening tests is that, when used sequentially, they
   can dramatically increase the cumulative incidence of false
   positives^[35]9. Therefore, a low-cost, high-performance, noninvasive
   test that detects multiple cancers simultaneously would overcome the
   pitfalls of single-cancer screening tools and facilitate the adoption
   of, and compliance with, so-called multi-cancer early detection (MCED)
   in the high-risk general population.

   Here we report the development of a circulating microRNA (miRNA)-based
   MCED model using a multi-cancer training set and show its validation in
   a broader cohort of patients and controls, demonstrating high
   accuracy in detecting 12 cancer types.

Results

Participants and datasets

   To develop an MCED model, we identified eight serum miRNA microarray
   datasets from Gene Expression Omnibus (GEO)^[36]10,[37]11, which
   included data from 13 cancer types (biliary tract, bladder, breast,
   colorectal, esophageal, gastric, glioma, liver, lung, ovarian,
   pancreatic, prostate, sarcoma) and were all generated from the Japanese
   nationwide multi-year, multi-center research program “Development and
   Diagnostic Technology for Detection of miRNA in Body Fluids” using a
   standardized microarray platform. These eight datasets were originally
   used to develop individual diagnostic models for individual cancer
   types^[38]12–[39]19. In this study, we cleaned and assembled these
   datasets to build a multi-cancer Train Set comprising 1408 cancer
   patients from 7 cancer types (lung, ovarian, liver, bladder,
   esophageal, gastric, and prostate) and 1408 age- and gender-matched
   non-cancer controls for the development of a diagnostic model for
   simultaneously detecting multiple cancer types. All the remaining
   subjects, including 4875 cancer patients across 13 cancer types and 3722
   non-cancer controls, constituted three validation sets. The study
   design, microarray datasets, and construction of the train and
   validation datasets are described in detail in the Supplemental Methods
   and Fig. [40]1.

Fig. 1.

   Flow of datasets and study design. (A) Construction of the train and
   validation datasets. (B) Study design of model development and
   validation.

   Detailed demographic and clinical information for the cancer types
   with large sample sizes was described in the original publications. Briefly,
   the patients in the lung cancer dataset (n = 1566) had a mean age of 65y;
   57% were male and 62% were former or current smokers, and 78% of the
   tumors were adenocarcinoma, 14% squamous carcinoma, and 87% stage I or II.
   The bladder cancer dataset (n = 392) included patients of mean age 68y,
   72% male, 95% non-metastatic, 88% nodal-negative, 77% T1 and 80% high
   grade. The ovarian cancer dataset (n = 333) included patients with mean
   age 57y, 35% stage I or II, 96% epithelial (including 55%, 19% and 13%
   for serous, clear cell, and endometrioid histology, respectively). The
   patients in the liver cancer dataset (n = 348) were of mean age 68y,
   78% male, and 70% stage I or II. The esophageal cancer dataset
   (n = 447) consisted of patients with a mean age 67y, 97% male, and 66%
   stage I or II. The gastric cancer dataset (n = 1267) included patients
   with mean age 66y, 77% male and all stage I or II. The glioma dataset
   (n = 196) comprised patients with mean age 56y and 57% were male.
   Finally, the patients in the prostate cancer dataset (n = 769) had mean
   age 68y, 93% node-negative, and 92% non-metastatic.

Cancer diagnostic model development

   All diagnostic model development work was performed in the multi-cancer
   Train Set, which included 1408 cancer patients and 1408 non-cancer
   controls matched by age and gender (Fig. [43]1B). First, limma analysis
   was used to assess the differential expression of miRNAs between cancer
   and non-cancer. miRNAs were then ranked based on the B statistics from
   the limma analysis. The top 50 differentially expressed miRNAs are
   listed in Supplemental Table 1. A key hallmark of cancer is
   uncontrolled proliferation. We hypothesized that cancer-associated
   miRNAs would target growth signaling pathways. To investigate this,
   potential targeted genes regulated by these top 50 differentially
   expressed miRNAs were predicted from the miRDB database. Both KEGG and
   WikiPathways analysis of potential target genes indicated that several
   cancer pathways including colorectal cancer, non-small cell lung
   cancer, pancreatic cancer, prostate cancer, renal cell carcinoma,
   glioma etc. were enriched (Fig. [44]2A, B and C). Consistent with our
   hypothesis, known signal transduction pathways implicated in
   tumorigenesis and cancer progression including PI3K-Akt signaling, MAPK
   signaling, Wnt signaling, ErbB signaling, Ras signaling, etc. were
   enriched as well (Fig. [45]2A, B and D). Furthermore, network analysis
   of enriched pathways showed that common target genes were involved
   among several cancer and signaling pathways (Fig. [46]2C and D), which
   supports the implication of common miRNAs for the diagnosis of multiple
   cancer types.

Fig. 2.

   Pathway enrichment analysis of target genes regulated by the top 50
   differentially expressed miRNAs. (A) KEGG analysis of potential target
   genes regulated by these miRNAs^[49]44,[50]45. (B) WikiPathways
   analysis of potential target genes regulated by these miRNAs^[51]46.
   (C) Network plot of enriched target genes depicting the linkages of
   genes and selected KEGG pathways^[52]44,[53]45. (D) Network plot of
   enriched target genes depicting the linkages of genes and selected
   WikiPathways^[54]46. For (A) and (B), the gene ratio (no. of mapped
   genes / total no. of genes) is shown on the x-axis; bubble size is gene
   count and bubble color reflects adjusted p value.

   Ten-fold cross-validation revealed that the top 4 miRNAs (hsa-miR-5100,
   hsa-miR-1228-5p, hsa-miR-8073 and hsa-miR-663a) provided the highest
   AUC in the ROC analysis and thus were included in the final diagnostic
   model (Fig. [55]3A). We calculated a diagnostic index as the weighted
   sum of the 4 miRNA expression levels (weighted by the t statistics from
   the limma analysis), normalized to the range of 0 to 10. This
   4-miRNA model achieved an AUC value of 0.994 within the Train set
   (Fig. [56]3B). A cut-point of 5.3 was chosen to yield an overall > 99%
   specificity (i.e., < 1% false positives) across the non-cancer cases,
   and an overall 94% sensitivity (Fig. [57]3C). The AUC and sensitivity
   of the model for each of the 7 cancer types in the multi-cancer Train
   Set ranged from 0.985 and 84% for ovarian cancer to 0.998 and 100% for
   bladder and gastric cancers (Table [58]1).
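
   As a minimal sketch of how such an index could be computed, the Python
   snippet below forms a weighted sum of four expression values and
   rescales it to the 0 to 10 range. The weights, the expression values,
   the min-max scaling over the training scores, and the use of ">=" at
   the cut-point are illustrative assumptions; the text does not specify
   the normalization procedure, and all function names are hypothetical.

```python
# Hypothetical weights standing in for the limma t statistics; the real
# values are not given in the text.
WEIGHTS = {
    "hsa-miR-5100": 4.0,
    "hsa-miR-1228-5p": 3.0,
    "hsa-miR-8073": 2.0,
    "hsa-miR-663a": 1.0,
}
CUT_POINT = 5.3  # cut-point reported for the Train Set


def raw_index(expression, weights=WEIGHTS):
    """Weighted sum of the expression levels of the selected miRNAs."""
    return sum(w * expression[name] for name, w in weights.items())


def fit_scaler(training_scores, low=0.0, high=10.0):
    """Build a min-max scaler from training scores (an assumed normalization)."""
    s_min, s_max = min(training_scores), max(training_scores)
    if s_max == s_min:
        raise ValueError("training scores must not all be equal")

    def scale(score):
        return low + (high - low) * (score - s_min) / (s_max - s_min)

    return scale


def is_positive(index, cut_point=CUT_POINT):
    """Call a sample positive at or above the cut-point (>= is an assumption)."""
    return index >= cut_point


# Made-up expression profiles (arbitrary units) for three training samples.
samples = [
    {name: 0.0 for name in WEIGHTS},
    {name: 1.0 for name in WEIGHTS},
    {"hsa-miR-5100": 1.0, "hsa-miR-1228-5p": 1.0,
     "hsa-miR-8073": 0.0, "hsa-miR-663a": 0.0},
]
raw_scores = [raw_index(s) for s in samples]
scale = fit_scaler(raw_scores)
indices = [scale(r) for r in raw_scores]
calls = [is_positive(i) for i in indices]
print(indices, calls)
```

   In this toy example the scaler is fitted on the training scores only,
   so validation samples would be mapped with the same minimum and
   maximum rather than being rescaled on their own.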

Fig. 3.

   Diagnostic performance of the 4-miRNA model in the multi-cancer Train
   Set. (A) 10-fold cross validation; (B) ROC of the 4-miRNA model; (C)
   Scatterplot of the diagnostic index.

Table 1.

   Performance of the 4-miRNA model for each cancer type in the
   multi-cancer train set.
   Cancer type   N     AUC of ROC   Sensitivity
   Lung          208   0.997         99%
   Bladder       200   0.998        100%
   Ovarian       200   0.985         84%
   Liver         200   0.987         86%
   Gastric       200   0.998        100%
   Esophageal    200   0.993         92%
   Prostate      200   0.997         97%

Validation of the diagnostic model in the independent validation set 1

   The performance of the 4-miRNA model was first evaluated in the
   independent Validation Set 1 (n = 2859) that included 1358 lung cancer
   patients and 1501 non-cancer controls. The model achieved an AUC of
   1.000 (Fig. [62]4A) with a specificity of 100% and sensitivity of 99%
   (Fig. [63]4B). In addition, analysis of paired serum samples (pre- vs.
   post-surgery; n = 180) verified normalization of the diagnostic indices
   to the levels of non-cancer controls in post-surgery serum samples
   (Fig. [64]4C).

Fig. 4.

   Diagnostic performance of the 4-miRNA model in Validation Set 1, the
   lung cancer validation dataset. (A) ROC of the 4-miRNA model; (B)
   Scatterplot of the diagnostic index; (C) Scatterplot of the diagnostic
   index from pre- vs. post-operation serum samples; (D) Scatterplot of
   the diagnostic index in clinical subsets. ADC: adenocarcinoma; SqCC:
   squamous cell carcinoma; LCC: large cell carcinoma; SCLC: small cell
   lung cancer.

   Furthermore, the performance of the 4-miRNA model was evaluated across
   clinical subsets of the Validation Set 1, as defined by the clinical
   stages, TNM stages, and histology subtypes. High sensitivities were
   observed for all clinical subsets: the model achieved at least 99%
   sensitivity for 22 of the 24 clinical subsets examined, the exceptions
   being stage IIB and T3 tumors (Fig. [67]4D). In particular, the model
   demonstrated > 99% sensitivity for stage I lung cancers and for
   adenocarcinoma and squamous cell carcinoma.

Validation of the diagnostic model in the independent validation sets 2 and 3

   The independent Validation Set 2 included 1438 patients across 12
   additional cancer types and 1623 non-cancer controls. With the
   exception of breast cancer, the 4-miRNA model achieved at least 90%
   sensitivity for eight cancer types (biliary tract, bladder, colorectal,
   esophageal, gastric, glioma, pancreatic and prostate) and at least 75%
   for the other three cancer types (liver, ovarian and sarcoma)
   (Fig. [68]5A; Table 2). Notably, although the model had a reasonable
   AUC of 0.909 for breast cancer, its sensitivity was only 1% because
   of the high specificity requirement (Fig. [69]5A; Table [70]2).

Fig. 5.

   Diagnostic performance of the 4-miRNA model in Validation Sets 2 and 3.
   (A) Scatterplot of the diagnostic index in Validation Set 2; (B)
   Scatterplot of the diagnostic index in Validation Set 3.

Table 2.

   Performance of the 4-miRNA model for each cancer type in Validation
   Sets 2 and 3.
   Cancer type         N   AUC of ROC   Sensitivity
   Validation Set 2
   Biliary tract      40   0.998        100%
   Bladder           192   0.999         99%
   Breast            135   0.909          1%
   Colorectal        155   0.991         92%
   Esophageal        124   0.996         91%
   Gastric           150   0.999        100%
   Glioma             40   0.997         98%
   Liver             148   0.998         84%
   Ovarian           133   0.986         79%
   Pancreatic        149   0.995         91%
   Prostate           40   0.998         98%
   Sarcoma           132   0.976         75%
   Validation Set 3
   Gastric          1067   0.994        100%
   Glioma            196   0.993         96%
   Esophageal        247   0.992         92%
   Prostate          569   0.993         95%

   The independent Validation Set 3 included 2079 patients from four
   cancer types (esophageal, gastric, glioma and prostate) and 598
   non-cancer controls, where the sample sizes of the four cancer types
   were substantially larger than those in Validation Set 2 (247 vs. 124
   for esophageal, 1067 vs. 150 for gastric, 196 vs. 40 for glioma, and
   569 vs. 40 for prostate). The 4-miRNA model achieved > 0.99 AUC and
   > 90% sensitivity for all four cancer types, similar to those observed
   in Validation Set 2 (Fig. [74]5B; Table [75]2). The specificity of the
   model was slightly lower in Validation Set 3 than in Validation Set 2
   (0.98 vs. 0.99) (Fig. [76]5B). Therefore, for Validation Set 3, a
   sensitivity analysis with an adjusted diagnostic index cut-point of 5.6
   was explored to increase the specificity of the model to 99%. With
   this new cut-point, the model still achieved high sensitivity for all
   four cancer types, including 99% for gastric, 92% for glioma, 91% for
   prostate, and 89% for esophageal cancers.

Discussion

   Noninvasive screening tests for MCED that analyze circulating
   cell-free nucleic acids and/or proteins in body fluids, especially
   blood, have attracted considerable attention over the last decade. In this
   study, we reported the development and validation of a serum 4-miRNA
   diagnostic model and demonstrated that in three large independent
   validation sets totaling 8597 participants (4875 cancer patients across
   13 cancer types and 3722 non-cancer individuals), the 4-miRNA model can
   detect 12 cancer types simultaneously with high sensitivities (> 90%
   for 9 cancer types, and ≥ 75% for 3 cancer types) while still achieving
   a very high specificity of ~ 99%. In addition, the observation that the
   diagnostic indices for the post-surgery serum samples were reduced to
   normal levels suggests the potential utility of the model for
   monitoring response to treatment and detection of recurrence.

   Importantly, our model was able to detect early-stage cancers at high
   sensitivity. Specifically, in Validation Set 1 of lung cancer patients,
   the model detects stage I and II cancers at a sensitivity ranging from
   98.4 to 99.6% (Fig. [77]4D). In Validation Sets 2 and 3, while
   individual patient-level stage information was not available, aggregate
   stage information was provided for 6 of the 12 cancer types examined.
   First, all gastric cancer patients were stage I or II, thus the 100%
   sensitivity of our model applied to early-stage gastric cancer. Second,
   88% and 93% of bladder and prostate cancer patients, respectively, had
   node-negative disease. Thus, with 99% and 98% sensitivity for these
   two cancers, the sensitivity for stage I or II bladder and prostate
   cancers should be very high as well. Third, 66% and 70% of esophageal
   and liver cancer patients, respectively, were stage I or II. It is
   reasonable to speculate that the sensitivity for stage I or II disease
   in these two cancers is not far from the 92% and 84% sensitivity
   reported for all stages combined. In summary, based on the data
   currently available in the three validation sets, we conclude that
   our 4-miRNA model achieves
   high sensitivity for stage I or II disease of six cancer types (lung,
   gastric, bladder, prostate, esophageal, and liver).

   Of note, the original studies that generated the eight miRNA microarray
   datasets analyzed in this study also proposed miRNA panels for
   detecting each of the eight cancer types (lung, ovarian, liver, bladder,
   esophageal squamous, gastric, prostate and glioma), respectively. These
   eight miRNA panels included 41 unique miRNAs with only one overlapping
   miRNA, hsa-miR-6724-5p, which occurred in the liver and bladder cancer
   panels. While some of these panels demonstrated higher performance
   characteristics than our model for their respective cancer types, this
   is expected given their specific focus. However, if these panels were
   used sequentially to detect these eight cancer types together, the
   cumulative incidence of false positives would be approximately
   33% based on the published performance metrics. In contrast, our model,
   which detects 12 cancer types simultaneously, achieves a false positive
   rate of less than 1%.
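
   To illustrate how false positives accumulate under sequential
   testing, the following sketch computes the probability of at least
   one false positive as one minus the product of the individual
   specificities. It assumes the tests err independently, and the 95%
   specificity used for the eight single-cancer tests is a hypothetical
   placeholder, not the published value for any of the panels.

```python
from math import prod


def cumulative_false_positive_rate(specificities):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - prod(specificities)


# Eight hypothetical single-cancer tests, each with 95% specificity.
sequential = cumulative_false_positive_rate([0.95] * 8)
# One multi-cancer test with 99% specificity.
single = cumulative_false_positive_rate([0.99])
print(f"sequential: {sequential:.3f}, single: {single:.3f}")
```

   Under these assumed inputs, roughly one in three cancer-free
   individuals would receive at least one false positive from the
   sequential strategy, compared with about one in a hundred for a
   single test at 99% specificity.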

   Among the four miRNAs used in our model, hsa-miR-5100 has been reported
   to be overexpressed in lung, gastric, oral squamous cell carcinoma, and
   pancreatic cancers^[78]20–[79]24. On the other hand, hsa-miR-1228-5p
   has been implicated as overexpressed in hepatocellular carcinoma and
   kidney clear cell carcinoma^[80]25,[81]26, while hsa-miR-663a has been
   found to be overexpressed in colon cancer and metastatic prostate
   cancer^[82]27,[83]28. Gene set enrichment and network analysis showed
   that transforming growth factor beta-1 (TGFB1), a gene regulated by
   hsa-miR-663a, was implicated in signaling pathways across multiple
   cancer types including colorectal cancer, pancreatic cancer, gastric
   cancer, renal cell carcinoma, hepatocellular carcinoma and leukemia.
   The observation that the PI3K-Akt and MAPK signaling pathways are among
   the most regulated by the top 50 miRNAs suggests that the miRNAs
   originate from the cancer cells rather than from reactive
   stromal fibroblasts, tumor-associated immune cells, or biopsy-induced
   wound-related changes^[84]29. Taken together, these data support the
   use of these miRNAs as potential biomarkers for cancer early detection
   across multiple cancer types.

   Several commercial assays for MCED have emerged in recent years. Most
   of these tests used next generation sequencing (NGS) technology to
   evaluate either methylation or fragmentation patterns of circulating
   tumor DNAs^[85]30–[86]33. The MCED test that has attracted the most
   attention is the Galleri test, which examines > 100,000 targeted
   methylated regions and > 1,000,000 CpG dinucleotides. In its
   prospective, case-controlled Circulating Cell-free Genome Atlas
   (CCGA) study, Galleri achieved an overall sensitivity of 67.6% across
   12 pre-specified cancer types at stages I-III and 99.5% specificity^[87]30.
   However, the sensitivity was only 16.8% for stage I and 40.4% for stage
   II. Another MCED test, CancerSEEK, is not based on NGS technology and
   assesses four biomarker classes (aneuploidy, DNA methylation, mutations
   and proteins). In its latest retrospective, case-controlled study of
   566 cancer patients across 12 cancer types and 566 non-cancer controls,
   it showed an overall 61% sensitivity and 98.2% specificity^[88]34. The
   sensitivity dropped to 49.8% for stage I-III cancers. In summary, these
   MCED tests generally showed modest sensitivities in the range of 60–70%
   when a high 99% specificity was required, and the sensitivities dropped
   further for stage I or II cancers. Compared to these assays, our
   diagnostic model, while much simpler, demonstrated substantially higher
   sensitivities in the range of 90–100% for 9 out of 12 cancer types in
   large validation cohorts totaling almost 8600 participants. More
   importantly, our model achieves similarly high sensitivities for stage
   I or II cancers.

   The clinical utility of these MCED assays must ultimately be
   demonstrated in prospective screening trials with asymptomatic
   individuals. For example, Galleri was evaluated in the prospective
   screening study of PATHFINDER that analyzed 6621 participants
   aged ≥ 50y with 1 year follow-up^[89]35. The study detected a cancer
   signal in 92 (1.4%) participants and confirmed 35 as true positives,
   resulting in a 38% positive predictive value (PPV). In addition, 121
   participants had cancer diagnosed at the end of 1-year follow-up, which
   corresponded to a 29% sensitivity by Galleri. In its latest prospective
   observational study, SYMPLIFY, with 5461 symptomatic participants referred
   from primary care and 368 (6.7%) diagnosed with a cancer, Galleri
   achieved 66.3% sensitivity and 98.4% specificity^[90]36. For our
   4-miRNA diagnostic model, assuming a screening population with 1%
   cancer incidence rate, 90% sensitivity and 99.3% specificity, our model
   would provide a PPV of 56%, substantially higher than the 3.7–4.4% PPVs
   for the four single-cancer screening tests recommended by
   USPSTF^[91]37–[92]39.
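
   The PPV figure for the 4-miRNA model follows from Bayes' rule. A
   minimal sketch, treating the assumed 1% cancer incidence as the
   prevalence in the screened population (the function name is
   hypothetical):

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """PPV = true positives / (true positives + false positives)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Inputs stated in the text: 1% prevalence, 90% sensitivity, 99.3% specificity.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90,
                                specificity=0.993)
print(f"PPV = {ppv:.1%}")
```

   With these inputs the function returns approximately 0.565, in line
   with the 56% quoted above. Because the false-positive term scales
   with (1 - prevalence), the PPV is highly sensitive to small changes
   in specificity when prevalence is low.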

   It is worth noting that a simple four-parameter diagnostic model like
   the one described here not only costs significantly less, but also can
   be developed into an in vitro diagnostic (IVD) test using RT-qPCR
   capable of decentralized testing, which has an advantage over NGS-based
   tests that are usually implemented as a laboratory developed test
   (LDT). These characteristics are important for driving adoption and
   increasing the affordability of MCED tests, which are intended for the
   high-risk or at-risk general public, especially those from
   low-income communities.

   We acknowledge that the current study is a computational analysis using
   public datasets. Experimental validation and investigations on the role
   of the 4 miRNAs will shed light on the mechanistic understanding of the
   predictive power of these miRNAs. In particular, validating these
   miRNAs in different cohorts using different molecular techniques such
   as PCR is crucial before considering the current study results
   definitive, which is also critically important in developing these
   miRNAs into a lower-cost and practical diagnostic assay for clinical
   use. These will be the focus of our future work and are beyond the
   scope of the current study.

   In summary, our study has provided proof-of-concept data for developing
   a blood screening test based on expression profiles of circulating
   cell-free miRNAs for 12 cancer types, which account for 50% of estimated
   new cancer cases and 63% of cancer deaths in the US in 2022^[93]2.

Methods

Study design and construction of train and validation datasets

   We identified eight serum miRNA microarray datasets from Gene
   Expression Omnibus (GEO)^[94]10,[95]11. After removing redundant cases,
   we assembled three large datasets that were independent of each other:
   a lung cancer dataset (n = 3744)^[96]10,[97]12, a combined dataset by
   merging the ovarian, liver and bladder cancer datasets
   (n = 3792)^[98]10,[99]13–[100]15, and a combined dataset by merging the
   esophageal squamous cell, gastric, prostate and glioma cancer datasets
   (n = 3877)^[101]11,[102]16–[103]19.

   Based on these three large datasets, we constructed a large training
   set (‘Train Set’) that included 1408 cancer patients from 7 cancer
   types (208 lung cancer patients and 200 patients each for ovarian,
   liver, bladder, esophageal, gastric, and prostate) and 1408 age- and
   gender-matched non-cancer controls for the development of a diagnostic
   model for detecting multiple cancer types. All the remaining cases
   formed three separate independent validation sets (Fig. [104]1A and B).
   Details of how the cancer case and control samples for the Train Set
   and Validation Sets were selected are described in the Supplemental
   Methods.

Blood sample collection

   Collection of blood serum samples has been previously described in the
   original publications^[105]12–[106]19. Briefly, serum samples were
   collected prior to surgical operation from cancer patients who were
   admitted to the National Cancer Center Hospital (NCCH) between 2008 and
   2016 and stored initially at 4 °C for one week and then at -20 °C until
   further use. The exclusion criteria included those patients who were
   treated with preoperative chemotherapy and/or radiotherapy prior to
   serum sample collection. The serum samples for non-cancer controls were
   from those who had no history of cancer and no hospitalization during
   the previous 3 months and were collected along with routine blood tests
   from outpatient departments of three sources: NCCH, National Center for
   Geriatrics and Gerontology (NCGG) Biobank and Yokohama Minoru Clinic
   (YMC). Sera from cancer patients and non-cancer controls collected at
   NCCH were stored in the same way as described above, while those from
   NCGG and YMC were stored at -80 °C until use. Although our study, as an
   in silico analysis of public datasets, does not require ethical
   approval, the original studies were approved by the NCCH Institutional
   Review Board, the Ethics and Conflict of Interest Committee of the
   NCGG, and the Research Ethics Committee of Medical Corporation
   Shintokai YMC. Written informed consent was obtained from each
   participant^[107]12–[108]19.

Microarray analysis of miRNA expression

   Details about microarray expression analysis were described in the
   original publications^[109]12–[110]19. Briefly, total RNA was extracted
   from 300 µl serum, labeled by 3D-Gene miRNA Labeling kit and hybridized
   to 3D-Gene Human miRNA Oligo Chip (Toray Industries, Kanagawa, Japan)
   that evaluates the expression profiles of 2588 miRNA sequences
   registered in miRBase release 21 ([111]http://www.mirbase.org/). Low
   quality samples were discarded if the coefficient of variation of the
   negative control probes was > 0.15 or the number of flagged probes
   identified by 3D-Gene Scanner as “uneven spot images” was > 10. A miRNA was
   called “present” when its signal intensity was greater than mean plus
   two standard deviations of the negative control signals after the top
   and bottom 5% of the ranked signal intensities were removed. The signal
   intensities for miRNAs were determined after background subtraction by
   subtracting the mean signal intensity of negative control signals
   (after removing top and bottom 5% of the ranked signal intensities)
   from the miRNA signal. Finally, microarrays were normalized by
   calibrating according to three pre-selected internal control miRNAs
   (miR-149-3p, miR-2861, and miR-4463).
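
   The present call and background subtraction described above can be
   sketched as follows. This is an illustration only, not the 3D-Gene
   pipeline: the rounding used when trimming 5% from each end, the use
   of the sample standard deviation, and the signal values are all
   assumptions, and the function names are hypothetical.

```python
import statistics


def trim_extremes(values, trim_percent=5):
    """Drop the top and bottom trim_percent of the ranked values."""
    ordered = sorted(values)
    k = len(ordered) * trim_percent // 100  # rounding down is an assumption
    return ordered[k:len(ordered) - k]


def background_stats(negative_controls):
    """Mean and sample SD of the trimmed negative control signals."""
    kept = trim_extremes(negative_controls)
    return statistics.mean(kept), statistics.stdev(kept)


def call_probe(signal, bg_mean, bg_sd):
    """Return (present, background-subtracted signal) for one miRNA probe."""
    present = signal > bg_mean + 2 * bg_sd
    return present, signal - bg_mean


# Made-up negative control signals: 18 typical values plus two outliers.
negative_controls = [-500, 900] + [9] * 9 + [11] * 9
bg_mean, bg_sd = background_stats(negative_controls)
strong = call_probe(15, bg_mean, bg_sd)
weak = call_probe(11, bg_mean, bg_sd)
print(bg_mean, round(bg_sd, 3), strong, weak)
```

   Trimming before computing the mean and standard deviation keeps the
   two outlying control spots from inflating the detection threshold.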

Diagnostic model development

   miRNA biomarker identification and all model development work were done
   in the multi-cancer Train Set only. The differential miRNA expression
   between cancer vs. non-cancer was evaluated using Linear Model for
   Microarray Data (limma)^[112]40. miRNAs were then ranked based on the t
   statistics from the limma analysis and the top miRNAs were used to
   build diagnostic models for distinguishing cancer vs. non-cancer. A
   diagnostic index was calculated for each diagnostic model as a linear
   sum of expression levels of the selected miRNAs weighted by limma
   statistics. Ten-fold cross validation was performed to determine the
   optimal number of miRNAs to be included in the final diagnostic model
   that had the highest area-under-the-curve (AUC) of the Receiver
   Operating Characteristics (ROC) curves for distinguishing cancer vs.
   non-cancer. The cut-point for the diagnostic index was chosen to ensure
   at least 99% specificity (i.e., ≤ 1% false positive rate) because the
   model may potentially be used as a screening tool in the at-risk
   general public.
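
   One way to choose such a cut-point is to scan candidate values and
   keep the lowest one at which the specificity among non-cancer
   controls reaches the target, since a lower cut-point preserves more
   sensitivity. The sketch below assumes a 0.1 grid over the 0 to 10
   index range and a positive call at or above the cut-point; neither
   detail is stated in the text, and the control values are made up.

```python
def specificity_at(cut_point, control_indices):
    """Fraction of non-cancer samples scoring below the cut-point."""
    below = sum(1 for value in control_indices if value < cut_point)
    return below / len(control_indices)


def choose_cut_point(control_indices, target=0.99, step=0.1,
                     low=0.0, high=10.0):
    """Lowest grid value whose specificity reaches the target, else None."""
    n_steps = round((high - low) / step)
    for i in range(n_steps + 1):
        cut_point = round(low + i * step, 6)
        if specificity_at(cut_point, control_indices) >= target:
            return cut_point
    return None


# Made-up diagnostic indices for 100 non-cancer controls.
controls = [2.0] * 99 + [8.0]
cut_point = choose_cut_point(controls)
print(cut_point, specificity_at(cut_point, controls))
```

   In practice the cut-point would be fixed on the Train Set and then
   applied unchanged to the validation sets, as the study does with the
   value of 5.3.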

Diagnostic model validation

   The three independent validation datasets contained mutually exclusive
   samples that were not used in model development, with each offering
   distinct characteristics for the validation of the developed model.
   Validation Set 1 not only was of a very large sample size for lung
   cancer cases, but also contained comprehensive patient-level
   clinicopathologic data in contrast to the other two validation
   datasets, making it possible to assess model performance on early-stage
   cancers and different histology subtypes. Validation Set 2 contained
   samples from 12 other cancer types, thus expanding the evaluation of
   model performance across multiple cancer types. Validation Set 3
   comprised large numbers of cases from four cancer types including
   the two cancer types with low sample size in Validation Set 2, allowing
   additional independent verification of the model performance.

KEGG and WikiPathways pathway enrichment analysis of potential miRNA target genes

   Target genes of the top 50 significantly differentially
   expressed miRNAs were predicted using the miRDB database^[113]41. The
   pathway enrichment analysis on the target genes was conducted using
   Bioconductor package clusterProfiler (version 4.10.1)^[114]42,[115]43
   based on KEGG^[116]44,[117]45 and WikiPathways^[118]46. A
   Benjamini–Hochberg adjusted p value cutoff of 0.05 was used to select
   significantly enriched pathways.
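
   For readers unfamiliar with the Benjamini–Hochberg adjustment, the
   sketch below reimplements the standard procedure in plain Python and
   applies the 0.05 cutoff. It is not the clusterProfiler code used in
   the study, and the pathway labels and p values are invented for
   illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented enrichment p values for four hypothetical pathways.
pathways = ["pathway_A", "pathway_B", "pathway_C", "pathway_D"]
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [name for name, p in zip(pathways, adj_p) if p < 0.05]
print(adj_p, significant)
```

   Here three raw p values fall below 0.05, but only one pathway
   remains below the cutoff after adjustment.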

Statistical analysis

   AUC of the ROC curve analysis, sensitivity, and specificity were used
   to measure the diagnostic performance for detecting cancer vs.
   non-cancer. Sensitivity was defined as the proportion of cancer
   patients who were correctly identified as cancer by the diagnostic
   model, while specificity was defined as the proportion of non-cancer
   participants who were correctly identified as non-cancer. limma
   analysis was performed using Bioconductor package limma
   ([119]http://www.bioconductor.org)^[120]40. All statistical analysis
   was conducted using R version 4.2.1 ([121]http://www.r-project.org).
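
   The three performance measures can be computed directly from the
   diagnostic indices. The sketch below is written in Python for
   illustration (the study itself used R); the index values are made
   up, and the AUC is computed as the probability that a randomly
   chosen cancer sample scores higher than a randomly chosen control,
   with ties counted as one half.

```python
def sensitivity(cancer_indices, cut_point):
    """Proportion of cancer samples called positive (index >= cut-point)."""
    hits = sum(1 for value in cancer_indices if value >= cut_point)
    return hits / len(cancer_indices)


def specificity(control_indices, cut_point):
    """Proportion of non-cancer samples called negative (index < cut-point)."""
    hits = sum(1 for value in control_indices if value < cut_point)
    return hits / len(control_indices)


def roc_auc(cancer_indices, control_indices):
    """AUC via pairwise comparison of cancer and control scores."""
    wins = 0.0
    for c in cancer_indices:
        for n in control_indices:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_indices) * len(control_indices))


# Made-up diagnostic indices; 5.3 is the cut-point reported in the Results.
cancer = [6.0, 7.5, 5.0, 9.0]
control = [1.0, 2.0, 5.5, 3.0]
sens = sensitivity(cancer, 5.3)
spec = specificity(control, 5.3)
auc = roc_auc(cancer, control)
print(sens, spec, auc)
```

   The pairwise form is quadratic in the number of samples, which is
   acceptable for a sketch; rank-based implementations are preferable
   for cohorts of the size analyzed here.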

Electronic supplementary material

   Below is the link to the electronic supplementary material.
   [122]Supplementary Material 1 (208.3 KB, PDF)

Author contributions

   J.Z. and H.H. conceived and designed the study. J.Z. collected and
   analyzed the data. J.Z., H.R., and H.H. interpreted the data and wrote
   and finalized the manuscript.

Data availability

   All individual patient data were made publicly available by the
   original study authors. Gene Expression Omnibus (GEO) accession IDs for
   the datasets used in this study are included in the Supplemental
   Methods section.

Declarations

Competing interests

   J.Z. and H.H. are named inventors on a patent of the diagnostic model
   developed in this study. H.H. is a cofounder of, and holds equity in,
   miRoncol Diagnostics, Inc., a company that seeks to commercialize the
   diagnostic model. All other authors declare no competing interests.

Footnotes

   Publisher’s note

   Springer Nature remains neutral with regard to jurisdictional claims in
   published maps and institutional affiliations.

References