Abstract

Patients diagnosed with early-stage cancers have a substantially higher chance of survival than those with late-stage disease. However, options for early cancer screening are limited, and most cancer types lack an effective screening tool. Here we report a miRNA-based blood test for multi-cancer early detection based on examination of serum microRNA microarray data from cancer patients and controls. First, a large multi-cancer training set that included 1,408 patients across 7 cancer types and 1,408 age- and gender-matched non-cancer controls was used to develop a 4-microRNA diagnostic model using 10-fold cross-validation. In three independent validation sets comprising a total of 4,875 cancer patients across 13 cancer types and 3,722 non-cancer participants, the 4-microRNA model achieved greater than 90% sensitivity for 9 cancer types (lung, biliary tract, bladder, colorectal, esophageal, gastric, glioma, pancreatic, and prostate cancers) and 75–84% sensitivity for 3 cancer types (sarcoma, liver, and ovarian cancer), while maintaining greater than 99% specificity. The sensitivity remained > 99% for patients with stage I lung cancer. Our study provides novel evidence to support the development of an inexpensive and accurate miRNA-based blood test for multi-cancer early detection.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-024-73783-0.

Keywords: Multi-cancer early detection, MicroRNA, Noninvasive, Blood-based diagnostic model

Subject terms: Cancer, Computational biology and bioinformatics, Biomarkers, Molecular medicine, Oncology

Introduction

Cancer ranks as the first or second leading cause of death in most countries worldwide^[28]1. In the United States, the American Cancer Society estimated 1.9 million new cancer cases and nearly 610,000 cancer deaths in 2022^[29]2. Patients diagnosed with early-stage cancers have much higher survival rates than those diagnosed at late stages.
For example, the 5-year patient survival rate for localized colorectal cancers is 91% but only 15% for those that have spread to distant organs^[30]2. However, early-stage cancer patients often have no symptoms and thus are more likely to miss timely diagnosis^[31]3,[32]4. Therefore, detecting cancers at early stages is paramount to reducing cancer-related mortality. The most effective way to detect cancer early is to make cancer screening tools available and accessible to the general population. Unfortunately, the options for such screening tools are limited. Currently, only four cancer types have screening tests recommended by the United States Preventive Services Task Force (USPSTF): mammography for breast cancer, cytology/HPV testing for cervical cancer, colonoscopy and/or stool-based testing for colon cancer, and low-dose CT scans for lung cancer^[33]5–[34]8. A challenge of using these single-cancer screening tests is that, when used sequentially, they could lead to a dramatically increased cumulative incidence of false positives^[35]9. Therefore, a low-cost, high-performance and noninvasive test that can detect multiple cancers simultaneously would overcome the pitfalls of these single-cancer screening tools and greatly facilitate the adoption of, and increase compliance with, so-called multi-cancer early detection (MCED) in the high-risk general population. Here we report the development of a circulating microRNA (miRNA)-based MCED model using a multi-cancer training set and show its validation in a broader cohort of patients and controls, demonstrating high accuracy in detecting 12 cancer types.
Results

Participants and datasets

To develop an MCED model, we identified eight serum miRNA microarray datasets from Gene Expression Omnibus (GEO)^[36]10,[37]11. These datasets included data from 13 cancer types (biliary tract, bladder, breast, colorectal, esophageal, gastric, glioma, liver, lung, ovarian, pancreatic, prostate, sarcoma) and were all generated by the Japanese nationwide multi-year, multi-center research program “Development and Diagnostic Technology for Detection of miRNA in Body Fluids” using a standardized microarray platform. These eight datasets were originally used to develop individual diagnostic models for individual cancer types^[38]12–[39]19. In this study, we cleaned and assembled these datasets to build a multi-cancer Train Set comprising 1408 cancer patients from 7 cancer types (lung, ovarian, liver, bladder, esophageal, gastric, and prostate) and 1408 age- and gender-matched non-cancer controls for the development of a diagnostic model for simultaneously detecting multiple cancer types. All the remaining subjects, including 4875 cancer patients across 13 cancer types and 3722 non-cancer controls, constituted three validation sets. The study design, the microarray datasets, and the construction of the train and validation datasets are described in detail in the Supplemental Methods and Fig. [40]1.

Fig. 1. Flow of datasets and study design. (A) Construction of the train and validation datasets. (B) Study design of model development and validation.

Detailed demographic and clinical information for the cancer types with large sample sizes was described in the original publications. Briefly, the patients in the lung cancer dataset (n = 1566) had a mean age of 65y and were 57% male and 62% former or current smokers, with 78% of the tumors being adenocarcinoma, 14% squamous carcinoma, and 87% stage I or II.
The bladder cancer dataset (n = 392) included patients of mean age 68y, 72% male, 95% non-metastatic, 88% nodal-negative, 77% T1 and 80% high grade. The ovarian cancer dataset (n = 333) included patients with mean age 57y, 35% stage I or II, 96% epithelial (including 55%, 19% and 13% for serous, clear cell, and endometrioid histology, respectively). The patients in the liver cancer dataset (n = 348) were of mean age 68y, 78% male, and 70% stage I or II. The esophageal cancer dataset (n = 447) consisted of patients with a mean age 67y, 97% male, and 66% stage I or II. The gastric cancer dataset (n = 1267) included patients with mean age 66y, 77% male and all stage I or II. The glioma dataset (n = 196) comprised patients with mean age 56y and 57% were male. Finally, the patients in the prostate cancer dataset (n = 769) had mean age 68y, 93% node-negative, and 92% non-metastatic. Cancer diagnostic model development All diagnostic model development work was performed in the multi-cancer Train Set, which included 1408 cancer patients and 1408 non-cancer controls matched by age and gender (Fig. [43]1B). First, limma analysis was used to assess the differential expression of miRNAs between cancer and non-cancer. miRNAs were then ranked based on the B statistics from the limma analysis. The top 50 differentially expressed miRNAs are listed in Supplemental Table 1. A key hallmark of cancer is uncontrolled proliferation. We hypothesized that cancer-associated miRNAs would target growth signaling pathways. To investigate this, potential targeted genes regulated by these top 50 differentially expressed miRNAs were predicted from the miRDB database. Both KEGG and WikiPathways analysis of potential target genes indicated that several cancer pathways including colorectal cancer, non-small cell lung cancer, pancreatic cancer, prostate cancer, renal cell carcinoma, glioma etc. were enriched (Fig. [44]2A, B and C). 
Consistent with our hypothesis, known signal transduction pathways implicated in tumorigenesis and cancer progression, including PI3K-Akt signaling, MAPK signaling, Wnt signaling, ErbB signaling, Ras signaling, etc., were enriched as well (Fig. [45]2A, B and D). Furthermore, network analysis of enriched pathways showed that common target genes were shared among several cancer and signaling pathways (Fig. [46]2C and D), which supports the use of common miRNAs for the diagnosis of multiple cancer types.

Fig. 2. Pathway enrichment analysis of target genes regulated by the top 50 differentially expressed miRNAs. (A) KEGG analysis of potential target genes regulated by these miRNAs^[49]44,[50]45. (B) WikiPathways analysis of potential target genes regulated by these miRNAs^[51]46. (C) Network plot of enriched target genes depicting the linkages of genes and selected KEGG pathways^[52]44,[53]45. (D) Network plot of enriched target genes depicting the linkages of genes and selected WikiPathways^[54]46. For (A) and (B), the gene ratio (no. of mapped genes / total no. of genes) is shown on the x-axis; bubble size is gene count and bubble color reflects adjusted p value.

Ten-fold cross-validation revealed that the top 4 miRNAs (hsa-miR-5100, hsa-miR-1228-5p, hsa-miR-8073 and hsa-miR-663a) provided the highest AUC in the ROC analysis and thus were included in the final diagnostic model (Fig. [55]3A). We calculated a diagnostic index as the weighted sum of the 4 miRNA expression levels (weighted by the t statistics from the limma analysis) and normalized it to the range of 0 to 10. This 4-miRNA model achieved an AUC value of 0.994 within the Train Set (Fig. [56]3B). A cut-point of 5.3 was chosen to yield an overall > 99% specificity (i.e., < 1% false positives) across the non-cancer cases, and an overall 94% sensitivity (Fig. [57]3C).
The AUC and sensitivity of the model for each of the 7 cancer types in the multi-cancer Train Set ranged from 0.985 and 84% for ovarian cancer to 0.998 and 100% for bladder and gastric cancers, respectively (Table [58]1).

Fig. 3. Diagnostic performance of the 4-miRNA model in the multi-cancer Train Set. (A) 10-fold cross validation; (B) ROC of the 4-miRNA model; (C) Scatterplot of the diagnostic index.

Table 1. Performance of the 4-miRNA model for each cancer type in the multi-cancer Train Set.

Cancer type    N      AUC of ROC    Sensitivity
Lung           208    0.997         99%
Bladder        200    0.998         100%
Ovarian        200    0.985         84%
Liver          200    0.987         86%
Gastric        200    0.998         100%
Esophageal     200    0.993         92%
Prostate       200    0.997         97%

Validation of the diagnostic model in the independent validation set 1

The performance of the 4-miRNA model was first evaluated in the independent Validation Set 1 (n = 2859) that included 1358 lung cancer patients and 1501 non-cancer controls. The model achieved an AUC of 1.000 (Fig. [62]4A) with a specificity of 100% and sensitivity of 99% (Fig. [63]4B). In addition, analysis of paired serum samples (pre- vs. post-surgery; n = 180) verified normalization of the diagnostic indices to the levels of non-cancer controls in post-surgery serum samples (Fig. [64]4C).

Fig. 4. Diagnostic performance of the 4-miRNA model in Validation Set 1, the lung cancer validation dataset. (A) ROC of the 4-miRNA model; (B) Scatterplot of the diagnostic index; (C) Scatterplot of the diagnostic index from pre- vs. post-operation serum samples; (D) Scatterplot of the diagnostic index in clinical subsets. ADC: adenocarcinoma; SqCC: squamous cell carcinoma; LCC: large cell carcinoma; SCLC: small cell lung cancer.

Furthermore, the performance of the 4-miRNA model was evaluated across clinical subsets of Validation Set 1, as defined by clinical stage, TNM stage, and histology subtype.
High sensitivities were observed for all clinical subsets. The model achieved at least 99% sensitivity for 22 of the 24 clinical subsets examined, the exceptions being stage IIB and T3 tumors (Fig. [67]4D). In particular, the model demonstrated > 99% sensitivity for stage I lung cancers and for adenocarcinoma and squamous cell carcinoma.

Validation of the diagnostic model in the independent validation sets 2 and 3

The independent Validation Set 2 included 1438 patients across 12 additional cancer types and 1623 non-cancer controls. Except for breast cancer, the 4-miRNA model achieved at least 90% sensitivity for eight cancer types (biliary tract, bladder, colorectal, esophageal, gastric, glioma, pancreatic and prostate) and at least 75% for the other three cancer types (liver, ovarian and sarcoma) (Fig. [68]5A; Table 2). Notably, although the model had a reasonable AUC value of 0.909 for breast cancer, its sensitivity was only 1% at the cut-point imposed by the high specificity requirement (Fig. [69]5A; Table [70]2).

Fig. 5. Diagnostic performance of the 4-miRNA model in Validation Sets 2 and 3. (A) Scatterplot of the diagnostic index in Validation Set 2; (B) Scatterplot of the diagnostic index in Validation Set 3.

Table 2. Performance of the 4-miRNA model for each cancer type in Validation Sets 2 and 3.
Validation set      Cancer type      N       AUC of ROC    Sensitivity
Validation Set 2    Biliary tract    40      0.998         100%
Validation Set 2    Bladder          192     0.999         99%
Validation Set 2    Breast           135     0.909         1%
Validation Set 2    Colorectal       155     0.991         92%
Validation Set 2    Esophageal       124     0.996         91%
Validation Set 2    Gastric          150     0.999         100%
Validation Set 2    Glioma           40      0.997         98%
Validation Set 2    Liver            148     0.998         84%
Validation Set 2    Ovarian          133     0.986         79%
Validation Set 2    Pancreatic       149     0.995         91%
Validation Set 2    Prostate         40      0.998         98%
Validation Set 2    Sarcoma          132     0.976         75%
Validation Set 3    Gastric          1067    0.994         100%
Validation Set 3    Glioma           196     0.993         96%
Validation Set 3    Esophageal       247     0.992         92%
Validation Set 3    Prostate         569     0.993         95%

The independent Validation Set 3 included 2079 patients from four cancer types (esophageal, gastric, glioma and prostate) and 598 non-cancer controls, where the sample sizes of the four cancer types were substantially larger than those in Validation Set 2 (247 vs. 124 for esophageal, 1067 vs. 150 for gastric, 196 vs. 40 for glioma, and 569 vs. 40 for prostate). The 4-miRNA model achieved > 0.99 AUC and > 90% sensitivity for all four cancer types, similar to those observed in Validation Set 2 (Fig. [74]5B; Table [75]2). The specificity of the model was slightly lower in Validation Set 3 than in Validation Set 2 (0.98 vs. 0.99) (Fig. [76]5B). Therefore, for Validation Set 3, a sensitivity analysis with an adjusted diagnostic index cut-point of 5.6 was explored to increase the specificity of the model to 99%. With this new cut-point, the model still achieved high sensitivity for all four cancer types, including 99% for gastric, 92% for glioma, 91% for prostate, and 89% for esophageal cancers.

Discussion

Noninvasive screening tests for MCED that analyze circulating cell-free nucleic acids and/or proteins in body fluids, especially blood, have attracted considerable attention over the last decade.
In this study, we reported the development and validation of a serum 4-miRNA diagnostic model and demonstrated that in three large independent validation sets totaling 8597 participants (4875 cancer patients across 13 cancer types and 3722 non-cancer individuals), the 4-miRNA model can detect 12 cancer types simultaneously with high sensitivities (> 90% for 9 cancer types, and ≥ 75% for 3 cancer types) while still achieving a very high specificity of ~ 99%. In addition, the observation that the diagnostic indices for the post-surgery serum samples were reduced to normal levels suggests the potential utility of the model for monitoring response to treatment and detection of recurrence. Importantly, our model was able to detect early-stage cancers at high sensitivity. Specifically, in Validation Set 1 of lung cancer patients, the model detects stage I and II cancers at a sensitivity ranging from 98.4 to 99.6% (Fig. [77]4D). In Validation Sets 2 and 3, while individual patient-level stage information was not available, aggregate stage information was provided for 6 of the 12 cancer types examined. First, all gastric cancer patients were stage I or II, thus the 100% sensitivity of our model applied to early-stage gastric cancer. Second, 88% and 93% of bladder and prostate cancer patients had node negative disease. Thus, with 99% and 98% sensitivity for these two cancers, the sensitivity for stage I or II bladder and prostate cancers should be very high as well. Third, 66% and 70% of esophageal and liver cancer patients were stage I or II, respectively. It was reasonable to speculate that the sensitivity for stage I or II of these two cancers should not be far off from the 92% and 84% sensitivity reported for all stages included. 
In summary, based on the data currently available in the three validation sets, we concluded that our 4-miRNA model achieves high sensitivity for stage I or II disease of six cancer types (lung, gastric, bladder, prostate, esophageal, and liver).

Of note, the original studies that generated the eight miRNA microarray datasets analyzed in this study also proposed miRNA panels for detecting each of the eight cancer types (lung, ovarian, liver, bladder, esophageal squamous, gastric, prostate and glioma), respectively. These eight miRNA panels included 41 unique miRNAs with only one overlapping miRNA, hsa-miR-6724-5p, which occurred in the liver and bladder cancer panels. While some of these panels demonstrated higher performance characteristics than our model for their respective cancer types, this is expected given their specific focus. However, if these panels were used to detect these eight cancer types together in a sequential fashion, the cumulative incidence of false positives would be approximately 33% based on the published performance metrics. In contrast, our model, which detects 12 cancer types simultaneously, achieves a false positive rate of less than 1%.

Among the four miRNAs used in our model, hsa-miR-5100 has been reported to be overexpressed in lung, gastric, oral squamous cell carcinoma, and pancreatic cancers^[78]20–[79]24. In addition, hsa-miR-1228-5p has been implicated as overexpressed in hepatocellular carcinoma and kidney clear cell carcinoma^[80]25,[81]26, while hsa-miR-663a has been found to be overexpressed in colon cancer and metastatic prostate cancer^[82]27,[83]28. Gene set enrichment and network analysis showed that transforming growth factor beta-1 (TGFB1), a gene regulated by hsa-miR-663a, was implicated in signaling pathways across multiple cancer types including colorectal cancer, pancreatic cancer, gastric cancer, renal cell carcinoma, hepatocellular carcinoma and leukemia.
The observation that the PI3K-Akt and MAPK signaling pathways are among the most regulated by the top 50 miRNAs suggests that the miRNAs may originate from the cancer cells rather than from reactive stromal fibroblasts, tumor-associated immune cells, or biopsy-induced wound-related changes^[84]29. Taken together, these data support the use of these miRNAs as potential biomarkers for early cancer detection across multiple cancer types.

Several commercial assays for MCED have emerged in recent years. Most of these tests use next generation sequencing (NGS) technology to evaluate either methylation or fragmentation patterns of circulating tumor DNA^[85]30–[86]33. The most prominent MCED test is the Galleri test, which examines > 100,000 targeted methylated regions and > 1,000,000 CpG dinucleotides. In the prospective, case-controlled Circulating Cell-free Genome Atlas (CCGA) study, Galleri achieved an overall sensitivity of 67.6% across 12 pre-specified stage I-III cancer types and 99.5% specificity^[87]30. However, the sensitivity was only 16.8% for stage I and 40.4% for stage II. Another MCED test, not based on NGS technology, is CancerSEEK, which assesses four biomarker classes (aneuploidy, DNA methylation, mutations and proteins). In its latest retrospective, case-controlled study of 566 cancer patients across 12 cancer types and 566 non-cancer controls, it showed an overall 61% sensitivity and 98.2% specificity^[88]34. The sensitivity dropped to 49.8% for stage I-III cancers. In summary, these MCED tests generally showed modest sensitivities in the range of 60–70% when a high specificity of about 99% was required, and the sensitivities dropped further for stage I or II cancers. Compared to these assays, our diagnostic model, while much simpler, demonstrated substantially higher sensitivities in the range of 90–100% for 9 out of 12 cancer types in large validation cohorts totaling almost 8600 participants.
More importantly, our model achieves similarly high sensitivities for stage I or II cancers.

The clinical utility of these MCED assays must ultimately be demonstrated in prospective screening trials with asymptomatic individuals. For example, Galleri was evaluated in the prospective screening study PATHFINDER, which analyzed 6621 participants aged ≥ 50y with 1 year of follow-up^[89]35. The study detected a cancer signal in 92 (1.4%) participants and confirmed 35 as true positives, resulting in a 38% positive predictive value (PPV). In addition, 121 participants had cancer diagnosed by the end of the 1-year follow-up, which corresponded to a 29% sensitivity for Galleri. In its latest prospective observational study, SYMPLIFY, with 5461 symptomatic participants referred from primary care, of whom 368 (6.7%) were diagnosed with cancer, Galleri achieved 66.3% sensitivity and 98.4% specificity^[90]36. For our 4-miRNA diagnostic model, assuming a screening population with a 1% cancer incidence rate, 90% sensitivity and 99.3% specificity, the model would provide a PPV of 56%, considerably higher than the 3.7–4.4% PPVs for the four single-cancer screening tests recommended by USPSTF^[91]37–[92]39.

It is worth noting that a simple four-parameter diagnostic model like the one described here is not only expected to cost considerably less, but can also be developed into an in vitro diagnostic (IVD) test using RT-qPCR capable of decentralized testing, which is an advantage over NGS-based tests that are usually implemented as a laboratory developed test (LDT). These characteristics are important for driving adoption and increasing the affordability of MCED tests, as they are intended for the high-risk or at-risk general public, especially those from low-income communities.

We acknowledge that the current study is a computational analysis using public datasets.
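As a worked check of the PPV estimate quoted above for the 4-miRNA model, the short sketch below applies Bayes' rule to the three stated inputs (1% prevalence, 90% sensitivity, 99.3% specificity). The function name `positive_predictive_value` is illustrative only and is not part of the original analysis.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """PPV from Bayes' rule: true positives / all positive calls."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

# Inputs taken from the text: 1% cancer incidence, 90% sensitivity,
# 99.3% specificity in a hypothetical screening population.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.993)
print(f"PPV = {ppv:.1%}")  # PPV = 56.5%
```

The result, about 56%, depends strongly on the assumed prevalence and specificity: with the same sensitivity, a lower prevalence or a slightly lower specificity would reduce the PPV noticeably.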
Experimental validation and investigation of the roles of the 4 miRNAs will shed light on the mechanisms underlying the predictive power of these miRNAs. In particular, validating these miRNAs in different cohorts using different molecular techniques such as PCR is crucial before the current study results can be considered definitive, and is also critically important for developing these miRNAs into a lower-cost and practical diagnostic assay for clinical use. These will be the focus of our future work and are beyond the scope of the current study.

In summary, our study has provided proof-of-concept data for developing a blood screening test based on expression profiles of circulating cell-free miRNAs for 12 cancer types, which account for 50% of estimated new cancer cases and 63% of cancer deaths in the US in 2022^[93]2.

Methods

Study design and construction of train and validation datasets

We identified eight serum miRNA microarray datasets from Gene Expression Omnibus (GEO)^[94]10,[95]11. After removing redundant cases, we assembled three large datasets that were independent of each other: a lung cancer dataset (n = 3744)^[96]10,[97]12, a combined dataset formed by merging the ovarian, liver and bladder cancer datasets (n = 3792)^[98]10,[99]13–[100]15, and a combined dataset formed by merging the esophageal squamous cell, gastric, prostate and glioma cancer datasets (n = 3877)^[101]11,[102]16–[103]19. Based on these three large datasets, we constructed a large training set (‘Train Set’) that included 1408 cancer patients from 7 cancer types (208 lung cancer patients and 200 patients each for ovarian, liver, bladder, esophageal, gastric, and prostate) and 1408 age- and gender-matched non-cancer controls for the development of a diagnostic model for detecting multiple cancer types. All the remaining cases formed three separate independent validation sets (Fig. [104]1A and B).
Details of how the cancer case and control samples for the Train Set and Validation Sets were selected are described in the Supplemental Methods.

Blood sample collection

Collection of blood serum samples has been previously described in the original publications^[105]12–[106]19. Briefly, serum samples were collected prior to surgical operation from cancer patients who were admitted to the National Cancer Center Hospital (NCCH) between 2008 and 2016 and stored initially at 4 °C for one week and then at -20 °C until further use. Patients who were treated with preoperative chemotherapy and/or radiotherapy prior to serum sample collection were excluded. The serum samples for non-cancer controls were from individuals who had no history of cancer and no hospitalization during the previous 3 months, and were collected along with routine blood tests from outpatient departments of three sources: NCCH, National Center for Geriatrics and Gerontology (NCGG) Biobank and Yokohama Minoru Clinic (YMC). Serum samples from cancer patients and non-cancer controls collected at NCCH were stored in the same way as described above, while those from NCGG and YMC were stored at -80 °C until use. While our study, as an in silico analysis of public datasets, does not require ethical approval, the original studies were approved by the NCCH Institutional Review Board, the Ethics and Conflict of Interest Committee of the NCGG, and the Research Ethics Committee of Medical Corporation Shintokai YMC. Written informed consent was obtained from each participant^[107]12–[108]19.

Microarray analysis of miRNA expression

Details about microarray expression analysis were described in the original publications^[109]12–[110]19.
Briefly, total RNA was extracted from 300 µl serum, labeled by 3D-Gene miRNA Labeling kit and hybridized to 3D-Gene Human miRNA Oligo Chip (Toray Industries, Kanagawa, Japan) that evaluates the expression profiles of 2588 miRNA sequences registered in miRBase release 21 ([111]http://www.mirbase.org/). Low quality samples were discarded if the coefficient of variation of negative control probes > 0.15 or the number of flagged probes identified by 3D-Gene Scanner as “uneven spot images” > 10. A miRNA was called “present” when its signal intensity was greater than mean plus two standard deviations of the negative control signals after the top and bottom 5% of the ranked signal intensities were removed. The signal intensities for miRNAs were determined after background subtraction by subtracting the mean signal intensity of negative control signals (after removing top and bottom 5% of the ranked signal intensities) from the miRNA signal. Finally, microarrays were normalized by calibrating according to three pre-selected internal control miRNAs (miR-149-3p, miR-2861, and miR-4463). Diagnostic model development miRNA biomarker identification and all model development work were done in the multi-cancer Train Set only. The differential miRNA expression between cancer vs. non-cancer was evaluated using Linear Model for Microarray Data (limma)^[112]40. miRNAs were then ranked based on the t statistics from the limma analysis and the top miRNAs were used to build diagnostic models for distinguishing cancer vs. non-cancer. A diagnostic index was calculated for each diagnostic model as a linear sum of expression levels of the selected miRNAs weighted by limma statistics. Ten-fold cross validation was performed to determine the optimal number of miRNAs to be included in the final diagnostic model that had the highest area-under-the-curve (AUC) of the Receiver Operating Characteristics (ROC) curves for distinguishing cancer vs. non-cancer. 
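The diagnostic index and the specificity-constrained cut-point described in this section can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic, the weights are hypothetical stand-ins for limma statistics, min-max rescaling to 0–10 is an assumed normalization (the text only states that the index was normalized to that range), and a sample is assumed to be called cancer when its index exceeds the cut-point.

```python
import math
import numpy as np

def raw_score(expr, weights):
    """Weighted linear sum of miRNA expression levels, one score per sample."""
    return np.asarray(expr) @ np.asarray(weights)

def diagnostic_index(expr, weights, lo, hi):
    """Rescale the raw score to 0-10 using train-set bounds (assumed min-max scaling)."""
    return 10.0 * (raw_score(expr, weights) - lo) / (hi - lo)

def cutpoint_for_specificity(control_index, target=0.99):
    """Smallest observed control value c such that at least `target` of controls are <= c."""
    ordered = np.sort(control_index)
    k = math.ceil(target * len(ordered)) - 1
    return float(ordered[k])

def sensitivity_specificity(index, is_cancer, cutpoint):
    """Sensitivity and specificity when samples with index > cutpoint are called cancer."""
    called = index > cutpoint
    return float(np.mean(called[is_cancer])), float(np.mean(~called[~is_cancer]))

# Synthetic 4-miRNA expression data (illustrative only, not study data).
rng = np.random.default_rng(0)
n = 500
controls = rng.normal(0.0, 1.0, size=(n, 4))
cancers = rng.normal(2.0, 1.0, size=(n, 4))
expr = np.vstack([controls, cancers])
is_cancer = np.concatenate([np.zeros(n, bool), np.ones(n, bool)])
weights = np.array([8.0, 6.0, 5.0, 4.0])  # hypothetical stand-ins for limma statistics

scores = raw_score(expr, weights)
lo, hi = float(scores.min()), float(scores.max())
index = diagnostic_index(expr, weights, lo, hi)
cut = cutpoint_for_specificity(index[~is_cancer], target=0.99)
sens, spec = sensitivity_specificity(index, is_cancer, cut)
print(f"cut-point={cut:.2f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Choosing the cut-point from the ordered control values guarantees the target specificity on the data used to set it; as the Validation Set 3 results show, the specificity can still shift somewhat on independent samples.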
The cut-point for the diagnostic index was chosen to ensure at least 99% specificity (i.e., ≤ 1% false positive rate) as the model may potentially be used as a screening tool in at-risk general public. Diagnostic model validation The three independent validation datasets contained mutually exclusive samples that were not used in model development, with each offering distinct characteristics for the validation of the developed model. Validation Set 1 not only was of a very large sample size for lung cancer cases, but also contained comprehensive patient-level clinicopathologic data in contrast to the other two validation datasets, making it possible to assess model performance on early-stage cancers and different histology subtypes. Validation Set 2 contained samples from 12 other cancer types, thus expanding the evaluation of model performance across multiple cancer types. Validation Set 3 comprised large numbers of the cases from four cancer types including the two cancer types with low sample size in Validation Set 2, allowing additional independent verification of the model performance. KEGG and WikiPathways pathway enrichment analysis of potential miRNAs target genes The prediction of target genes of top 50 significantly differentially expressed miRNAs was performed using the database miRDB^[113]41. The pathway enrichment analysis on the target genes was conducted using Bioconductor package clusterProfiler (version 4.10.1)^[114]42,[115]43 based on KEGG^[116]44,[117]45 and WikiPathways^[118]46. A Benjamini–Hochberg adjusted p value cutoff 0.05 was used to select significantly enriched pathways. Statistical analysis AUC of the ROC curve analysis, sensitivity, and specificity were used to measure the diagnostic performance for detecting cancer vs. non-cancer. 
Sensitivity was defined as the proportion of cancer patients who were correctly identified as cancer by the diagnostic model, while specificity was defined as the proportion of non-cancer participants who were correctly identified as non-cancer. limma analysis was performed using Bioconductor package limma ([119]http://www.bioconductor.org)^[120]40. All statistical analysis was conducted using R version 4.2.1 ([121]http://www.r-project.org).

Electronic supplementary material

Below is the link to the electronic supplementary material.

[122]Supplementary Material 1 (208.3 KB, pdf)

Author contributions

J.Z. and H.H. conceived and designed the study. J.Z. collected and analyzed the data. J.Z., H.R., and H.H. interpreted the data and wrote and finalized the manuscript.

Data availability

All individual patient data were made publicly available by the original study authors. Gene Expression Omnibus (GEO) accession IDs for the datasets used in this study are included in the Supplemental Methods section.

Declarations

Competing interests

J.Z. and H.H. are named inventors on a patent for the diagnostic model developed in this study. H.H. is a cofounder of, and holds equity in, miRoncol Diagnostics, Inc, a company that seeks to commercialize the diagnostic model. All other authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References