Abstract

The advances in the field of cancer genomics have enabled researchers and clinicians to identify altered pathways and regulatory networks that differentiate subtypes manifesting as differential phenotypes of lung neuroendocrine neoplasms (NENs). The clinical heterogeneity observed among lung NEN subtypes reflects underlying biological distinctions, including differential mutation patterns, epigenetic changes and immune microenvironment activities. Although in many cases only a handful of underlying genes are used to differentiate patients, broader gene signatures might result in finer separation and help identify patients with differential survival. Lung NENs are vastly underrepresented in pan-cancer studies, which limits the options for exploring such datasets. To this end, we developed a freely available website (https://survsig.hcemm.eu/) which allows users to upload potential genes of interest, perform patient clustering, compare survival and explore gene expression signatures of lung NENs. Leveraging these biological differences enhances the accuracy of gene expression-based prognostic classifiers like SurvSig.

Keywords: Lung neuroendocrine, Expression signature, Stratification, Clustering, Survival, Machine learning

1. Introduction

Lung neuroendocrine neoplasms (NENs) account for approximately 20–25 % of all lung cancers and exhibit a wide range of clinical behaviors, from indolent to highly aggressive forms [1]. Lung NENs are classified into lung neuroendocrine tumors (NETs), representing mainly the well differentiated carcinoids (CARCI), and the poorly differentiated lung neuroendocrine carcinomas (NECs), encompassing large cell neuroendocrine carcinoma (LCNEC) and small cell lung cancer (SCLC) [2]. Carcinoids can be further separated into low-grade typical carcinoids and intermediate-grade atypical carcinoids, which have a relatively better prognosis compared to NECs.
In contrast, LCNEC and SCLC are high-grade and poorly differentiated tumors associated with aggressive behavior, rapid progression, and poor survival outcomes.

The molecular pathogenesis of lung NENs is complex and varies across the different subtypes, typically lacking common oncogenic driver mutations found in non-small cell lung cancer (NSCLC), such as KRAS and EGFR in lung adenocarcinomas [3]. Carcinoid tumors present with lower mutational burdens, often harboring somatic mutations in chromatin remodeling genes, such as MEN1 and ATRX [4], [5]. On the other hand, SCLC and LCNEC are characterized by high mutational burdens and frequent alterations in key tumor suppressor genes, such as the TP53 gene. In addition, RB1 gene mutations can be found in most SCLC tumors, and in approximately half of LCNEC tumors. LCNEC tumors with no RB1 mutations are often enriched for STK11 and KEAP1 gene mutations, potentially reflecting distinct subgroups [6], [7].

Recent advances have focused on stratifying lung NENs based on transcriptional profiles, resulting in the classification of patients into distinct molecular subtypes. Carcinoid tumors have been divided into three major molecular subtypes [5], [8]: “cluster A1” is defined by DLL3 and ASCL1 expression, “cluster A2” manifests with SLIT/ROBO pathway downregulation, and “cluster B” is comprised mainly of atypical carcinoids and tumors with MEN1 mutations, which display poorer outcomes [5]. A fourth emerging subtype, known as supra-carcinoids, has also been identified and shows molecular similarities to high-grade tumors [5]. LCNEC tumors have also been subdivided into two primary categories: type I, which are highly neuroendocrine and enriched for STK11 and KEAP1 mutations, and type II, which exhibit fewer neuroendocrine features, including enrichment of RB1 mutations and upregulation of the NOTCH pathway [7].
SCLC, meanwhile, can be categorized into molecular subtypes based on the expression of the lineage-specific transcription factors ASCL1, NEUROD1, POU2F3, and YAP1 [9]. Taken together, these analyses highlight that lung NENs comprise diverse subtypes exhibiting distinct biological behaviors beyond traditional morphological classifications. Consequently, identifying novel patient subgroups, potentially independent of established molecular clusters, may lead to improved prognostication, refined therapeutic strategies, and ultimately better clinical outcomes.

As genome-wide profiling gains popularity, the identification of transcriptional signatures associated with various conditions and cellular states has become increasingly common. These signatures have proven useful for identifying patient subtypes and stratifying cell lines, demonstrating that complex gene expression patterns can provide more precise classification of patient groups than single-gene analyses [10], [11], [12], [13], [14]. Combined with multi-omics data, such signatures have led to the identification of dysregulated pathways, such as p53 and ATM in neuroendocrine neoplasms, that are related to miRNA-mediated regulation and have potential for use as prognostic markers [15]. Furthermore, these signatures can have therapeutic implications; a notable example is the identification of the inflamed subtype in SCLC, which has a significantly better response to different treatments, including immunotherapy [12], [16], [17].

Unfortunately, lung NENs are often underrepresented in large-scale pan-cancer studies, such as The Cancer Genome Atlas (TCGA), due to challenges associated with obtaining sufficient tumor specimens. Consequently, the availability of tools for exploring lung NEN data is extremely limited, impeding both research advancements and clinical interpretations.
Our aim was to address these limitations by developing an online tool that enables the interactive exploration of these datasets. For this reason, we have compiled publicly accessible gene expression data from multiple lung NEN cohorts and developed a freely available online website allowing users to upload and interactively analyze gene signatures with various machine learning approaches. This approach will facilitate identifying potential subgroups in these heterogeneous diseases, discovering potential biomarkers and interpreting complex transcriptional data.

2. Results

We collected publicly available gene expression data for over 600 lung NENs across six datasets, encompassing the three major histological subtypes (Fig. 1A). The majority of samples are from small cell lung cancer (SCLC, n = 359) [6], [18], [19], [20], [21], which is the most prevalent NEN subtype, followed by carcinoid tumors (n = 139) [4], [5], [21] and large cell neuroendocrine carcinoma (LCNEC, n = 122) [7], [21]. Among the datasets, one cohort includes transcriptional data for non-NEN tumors, allowing for cross-histology comparisons (Fig. 1B) [21]. Additionally, we incorporated the TCGA (The Cancer Genome Atlas) integrated cohort (33 cancer types, >10k samples) [22], enabling the examination of transcriptional patterns across other cancer types (Fig. 1C).

Fig. 1. Datasets and SurvSig website. A) Number of samples from each dataset for the three lung NEN histologies. B) Number of patients for each histology of the Rousseaux cohort. C) Summary of the TCGA integrated dataset. D) Kaplan-Meier survival plot of lung histologies in the Rousseaux cohort. E) Kaplan-Meier survival comparison of SCLC patients from the 5 collected cohorts. F) Summary of the SurvSig website.
Patient survival varies significantly among the main histological subtypes of lung NENs, demonstrating a pronounced bimodal distribution. Carcinoid tumors are associated with notably favorable outcomes, where median survival was not reached in either the Alcala or Rousseaux cohorts. High-grade SCLC and LCNEC display poor survival outcomes, with median survivals of approximately 15 and 20 months, respectively (Rousseaux cohort, Fig. 1D). Detailed survival analysis of SCLC patients expectedly found significantly better prognosis in limited-stage and chemo-naive cohorts (George-SCLC, Liu-SCLC and Jiang-SCLC) compared to the Lissa and Rousseaux cohorts, where most patients received systemic chemotherapies (Fig. 1E). Comparisons between LCNEC and carcinoid tumors across the datasets showed no significant differences in survival (Fig. S1A-B).

To support data mining and exploration of gene expression signatures, we developed the SurvSig website (https://survsig.hcemm.eu/), which facilitates clustering-based analysis of complex gene expression patterns (Fig. 1F). After uploading a gene signature or gene list (up to 2000 genes), the website performs dimensionality reduction and clustering to group patients based on similar transcriptional profiles. Multiple dimensionality reduction and clustering algorithms are available and customizable by the user. The generated clustered expression heatmaps allow for survival comparisons among patient clusters, revealing potential differences in survival outcomes. Multivariate analysis and patient annotation enrichment can also be compared between patient clusters to account for clinical variables. Additionally, uploaded genes can be clustered to identify groups with similar expression patterns. SurvSig also includes gene set enrichment analysis for both uploaded and clustered gene sets to quantify pathway activity, which can be used for survival analysis in a single-gene context.
The expression profiles of single genes and enrichment scores can be compared through correlation and violin plots, where samples can be grouped and color coded based on annotations and clinical characteristics.

2.1. Neuroendocrine gene signatures

We classified tumors into neuroendocrine (NE) and non-neuroendocrine (non-NE) subtypes using the NE50 gene signature [23], which includes 25 genes highly expressed in NE tumors and 25 genes highly expressed in non-NE SCLC tumors (Fig. 2A, Table S1). A representative heatmap for the signature in the George-SCLC cohort can be seen in Fig. 2B. Approximately 75–80 % of SCLC tumor cases scored positively for NE characteristics. Conversely, a substantial proportion of LCNEC tumors exhibited stronger non-NE features (50 % in the George-LCNEC and 70 % in the Rousseaux-LCNEC cohorts). Carcinoid tumors were predominantly highly neuroendocrine, except for a few cases that were enriched for non-NE genes, potentially linked to the emerging supra-carcinoid subtype (Fig. S2A-B).

Fig. 2. Transcriptional patterns defining lung NENs. A) Distribution of Rudin subtypes (NAPY, top) and NE subtypes (NE50 signature, bottom) of lung NENs. B) Expression heatmap of the NE50 gene signature using the George-SCLC cohort (RNA-seq). C) Expression heatmap of the NE50 gene signature using the Rousseaux cohort (microarray). D) Spearman correlation of NE and non-NE geneset activities defined in the NE50 signature using the Rousseaux cohort. E) Correlation of ASCL1 expression and NE geneset activity (from NE50 signature) using the Rousseaux cohort. F-G) Expression heatmap of genes that defined molecular subtypes [7] using the F) George-SCLC cohort (RNA-seq) and G) LCNEC tumors from the Rousseaux cohort (microarray).
Next, each patient from all lung NEN datasets was classified using the Rudin classification for SCLC tumors [9], which assigns subtypes based on the expression of ASCL1 (A), NEUROD1 (N), POU2F3 (P), or YAP1 (Y). As expected, chemo-naive SCLC cohorts were predominantly enriched for ASCL1+ tumors (George-SCLC, Liu and Jiang cohorts), whereas the highly treated Lissa cohort had a higher prevalence of YAP1+ tumors (Fig. 2A). LCNEC tumors presented a significantly greater proportion of YAP1+ cases compared to SCLC. The most notable difference was seen in carcinoid tumors, where the tumors in the Alcala cohort had significantly fewer YAP1+ cases compared to the carcinoid tumors in the Rousseaux cohort despite retaining similar NE signature characteristics.

While most lung NENs retain distinct neuroendocrine (NE) characteristics, a subset exhibits prominent non-NE features, showing molecular similarities to non-small cell lung cancers (NSCLC) (Fig. 2C). This observation is supported by ssGSEA-based enrichment analyses, which demonstrate strong negative correlations between NE and non-NE gene expression across all lung tumor histologies (Fig. 2D, with differential activity quantified in Fig. S2C). Notably, while ASCL1, a key NE marker, predictably shows positive correlation with NE-specific genes in the NE50 gene set, its expression is absent in many carcinoids (Fig. 2E, with sample tumor histology indicated by color). Moreover, some NSCLC adenocarcinomas (ADC) exhibit high ASCL1 expression while maintaining a non-NE phenotype, as observed in both the Rousseaux cohort and TCGA datasets (Fig. S2D-E). These findings reinforce the notion that although ASCL1 serves as a critical regulator in neuroendocrine cells, additional factors are necessary to drive full neuroendocrine differentiation [24].
We subsequently assessed the gene signature employed to classify the Type I, Type II, and ‘nan’ (unassigned or unknown) molecular subtypes within the George-LCNEC tumor cohort (clustered data available in Table S2) [7]. Using two-dimensional UMAP dimensionality reduction combined with dynamicTreeCut clustering, we identified four distinct patient groups and three primary gene sets (Fig. 2F). This clustering analysis delineated the George-LCNEC cohort into four groups that correlated with molecular subtypes: S1 was predominantly Type II, S2 exclusively Type I, S3 a mixture of Type I and II, and S4 exclusively ‘nan’. The S1 cluster was characterized by POU2F3+ and YAP1+ tumors, while clusters S2-S4 were nearly exclusively ASCL1+. This pattern was also reflected in the NE scores, with S1 displaying a non-NE profile, S2 and S4 being mainly NE, and the mixed S3 cluster showing non-NE characteristics despite elevated ASCL1 expression.

The identified clusters were differentially enriched in three gene sets (G1-G3, gene ontology in Table S3). G1 was associated with the ‘nan’ subtype and enriched in synaptic pathway-related genes, showing significantly higher expression of neuroendocrine markers such as INSM1 and SYP (Fig. S2E). G2 was linked to the Type I subtype, with enrichment in coagulation pathways, while G3 corresponded to the Type II subtype, which is mainly composed of non-NE tumors, and includes genes related to NFKB and adhesion pathways. Notably, we observed similar clustering patterns within LCNEC samples from the Rousseaux cohort (Fig. 2G, distribution of INSM1 and SYP expression in Fig. 2F), where expression was quantified using microarray in comparison to the RNA-seq based Alcala cohort, underscoring the robust reproducibility of this gene signature in the molecular classification of LCNEC tumors.

2.2. Identifying signatures related to patient annotations

We implemented the option to identify genes of interest through various statistical approaches, enabling the detailed exploration of the datasets (“Gene Set Finder” tab on SurvSig). This can be achieved either by using an entire patient cohort, or a subset of patients, considering either the full gene set or a gene list uploaded by the user. Two main approaches are implemented: 1) selecting informative genes based on expression profiles, such as standard deviation, PCA, SVD and other approaches; 2) identifying gene groups with differential expression profiles between predefined patient groups, such as clinical characteristics or annotations.

To illustrate this feature, we extracted the top 500 most variable genes in the mixed Rousseaux cohort based on standard deviation (Fig. 3A, Table S4), including six lung cancer histologies: SCLC, LCNEC, CARCI, SQC (squamous cell carcinoma), ADC (adenocarcinoma) and BAS (basaloid). Cluster analysis and heatmap representation of patients highlighted four groups with distinct behaviors, defined by UMAP dimensionality reduction and dynamicTreeCut clustering (Fig. 3B). The S1 cluster comprised SQC and BAS (a rare subtype of SQC) tumors, while the S2 cluster predominantly included ADC tumors. Most SCLC and LCNEC tumors clustered together in the S3 group, independent of NE status, whereas nearly all carcinoid tumors were grouped in the S4 cluster (Fig. 3C).

Fig. 3. Variable genes across lung tumor histologies. A) Expression heatmap of 500 most variable genes identified in the Rousseaux cohort. B) UMAP representation and clustering of patient samples. C) Histology composition of the identified patient clusters. D) Normalized enrichment scores of NE specific genesets from multiple data sources.

The gene clusters active in lung NEN tumors (G2) were highly enriched for neuronal and secretory pathways (Fig.
S3A, gene ontology results can be found in Table S5). This gene cluster was highly expressed in all three lung NEN histologies (SCLC, LCNEC and CARCI). In contrast, the G4 gene cluster, enriched for replication-related pathways, was less active in carcinoid tumors than in high-grade lung NENs (Fig. S3B, Table S5). This differential activity was reinforced by comparing normalized enrichment scores among the lung NEN datasets (Fig. 3D and Fig. S3C), where the G2 cluster was more active in carcinoid tumors, followed by SCLC tumors and finally LCNEC tumors. In contrast, the G4 cluster was more active in SCLC and LCNEC tumors compared to the low-grade carcinoids.

2.3. Molecular and prognostic signatures in carcinoid tumors

Carcinoid tumors can be categorized into typical (TC) and atypical (AC) cases, which have recently been further classified into 4 molecular subtypes (termed A1, A2, B and supra-carcinoids) using distinct gene expression signatures [5]. The three gene signatures (A1 vs A2, A1 vs B and A2 vs B signatures [5]) yielded reproducible results using the original Alcala cohort, highlighting the transcriptional differences among the molecular subtypes (Fig. S4A), also seen in the carcinoid tumors from the Rousseaux cohort (Fig. S4B).

Using SurvSig, we constructed a single gene signature that can differentiate the molecular histologies, without the need for multiple gene lists for classification. For this, we identified 1k genes that describe the four molecular subtypes using an artificial neural network (MLP model from the scikit-learn package in Python) implemented on SurvSig (Table S6). Patient clustering using the newly defined genes by PCA dimensionality reduction (with whitening enabled) and the dynamicTreeCut algorithm (set to the complete linkage method) separated the Alcala carcinoid tumors completely based on the predefined molecular subtypes (Fig. 4A).
The A1 tumors were enriched for neuronal pathways (G3 genes), the A2 tumors were enriched for metal ion stress response (G1 genes) and wound healing pathways (G2 genes), the B tumors were enriched with synaptic pathways (G5 genes) and wound healing (G2 genes), while the supra-carcinoids were enriched for protein kinase and serine endopeptidase pathways (G4 genes) (gene ontology can be found in Fig. S4C and Table S7). Using the gene signature, we were able to annotate the molecular clusters of the carcinoid tumors from the Rousseaux cohort (microarray-based expression quantification) as well, which yielded very similar expression heatmaps despite the difference in expression quantification technologies (Fig. 4B).

Fig. 4. Expression patterns of carcinoid tumors. A-B) Clustering of 1k genes identified by artificial neural network defining molecular subtypes of carcinoids in the A) Alcala and B) Rousseaux cohorts. C-D) Clustering of patients using 358 genes identified by the nearest centroids classifier that differentiate TC and AC carcinoids using the C) Alcala (RNA-seq based expression) and D) Rousseaux (microarray-based expression) cohorts. E) UMAP and clustering of Alcala (top) and Rousseaux (bottom) cohorts. F) Kaplan-Meier survival comparison of patient cohorts from the two carcinoid cohorts. G) ssGSEA activity and cutoffs identified with automatic cutoff selection of the G1 gene list enriched in TC. H) Kaplan-Meier survival plots of high and low G1 gene set activities.

To better differentiate gene expression patterns among the TC and AC subtypes, we performed a grid search using a nearest centroids classifier (see Methods section), which identified 358 genes that are differentially enriched in TC and AC carcinoids from the Alcala cohort (Table S8). Cluster analysis of the gene signature identified three groups of genes, of which two had marked differential activity between the two subtypes (Fig. 4C).
The G1 gene set was more active in typical carcinoids and was enriched for GTP pathways, while the gene set more active in atypical carcinoids was enriched for spindle and mitotic pathways (Fig. S5A). Importantly, we observed similar separation in the independent carcinoid samples from the Rousseaux mixed cohort, highlighting the reproducibility of the gene set (Fig. 4D). As expected, the patient cluster that was enriched for typical carcinoid signatures had significantly better survival than the patient cluster enriched for atypical signatures, using the default UMAP dimensionality reduction and clustering patients into two groups using k-means (excluding samples that were included from another study with no clinical information with the “LC exclusion” toggle, and turning off Z-scoring of data). This trend was visible using both the Alcala cohort and carcinoid tumors from the Rousseaux cohort (Fig. 4E-F).

We also tested whether gene set enrichment quantified by ssGSEA (Single Sample Gene Set Enrichment Analysis) of the two gene signatures (G1 and G2) could be used as surrogate markers to predict survival. The analysis reinforced the observation that patients displaying higher activity of the typical carcinoid G1 genes had significantly better outcomes in both datasets (Fig. 4G-H). In contrast, patients with signature enrichment of the G2 genes, enriched in atypical carcinoids, had significantly worse outcomes (Fig. S5B). The significance was retained even when correcting the p-value for multiple testing from the automatic cutoff selection of all ssGSEA based survival analyses (Fig. S5C).

3. Discussion

In this study, we compiled a comprehensive collection of publicly available gene expression data from over 600 lung neuroendocrine neoplasms (NENs), encompassing carcinoids, large cell neuroendocrine carcinoma (LCNEC), and small cell lung cancer (SCLC).
By integrating this data into an interactive online platform called SurvSig, we enabled the analysis of gene expression signatures using various machine learning approaches. Our findings demonstrate that complex transcriptional patterns can effectively classify lung NEN subtypes and are associated with distinct clinical outcomes, particularly in differentiating typical and atypical carcinoid tumors.

The application of transcriptional profiling allowed us to classify tumors based on established molecular signatures, such as the Rudin classification for SCLC and the NE50 gene signature for neuroendocrine differentiation. We observed that carcinoid tumors exhibit diverse molecular subtypes with distinct gene expression patterns, which correlate with patient survival. Specifically, we identified gene clusters that differentiate typical carcinoids, associated with favorable prognosis, from atypical carcinoids, which have poorer outcomes. These gene signatures were consistent across independent cohorts, highlighting their robustness and potential utility in clinical settings.

Lung NENs have distinct pathogenic mechanisms related to their development, resulting in a heterogeneous group with distinct pathogenetic profiles across their subtypes. A recognized premalignant condition, diffuse idiopathic pulmonary neuroendocrine cell hyperplasia (DIPNECH), has been implicated in the early development of typical and atypical carcinoids, particularly among non-smoker females. Environmental factors, most notably nicotine and hypoxia related to smoking, contribute significantly to the pathogenesis of high-grade tumors. Notably, the mutational burden is substantially higher in SCLC than in carcinoids, reflecting their more aggressive behavior and accumulation of genetic alterations over time. Although smoking is a major risk factor for NECs, its role in carcinoid development remains unclear, suggesting distinct etiologies within the lung NEN spectrum [25].
Our work builds upon previous studies that have attempted to stratify lung NENs using molecular characteristics. By leveraging machine learning algorithms within SurvSig, we were able to refine these classifications and identify novel gene signatures that may serve as prognostic biomarkers. The ability to reproduce these findings across multiple datasets underscores the importance of integrating large-scale genomic data to enhance our understanding of tumor biology and improve patient stratification. Using SurvSig, we were able to construct gene signatures that stratify patients into molecular clusters using a single list. Importantly, many identified genes have recently been validated as discriminators of these clusters using protein expression through immunohistochemistry [26], [27]. Notable examples are enrichment of OTP in A1 and A2 carcinoids, ASCL1 in A1, HNF1A in A2 and B clusters and ANGPTL3 in B clusters (Fig. 4A, Table S6).

The distinct clinical trajectories observed in lung NENs can be directly attributed to their underlying biological diversity. Typical and atypical carcinoids often harbor mutations in the MEN1 gene and exhibit low proliferation rates, which correlate with their generally indolent clinical behavior. Conversely, high-grade neuroendocrine carcinomas, such as SCLC and LCNEC, are characterized by frequent genetic alterations in TP53 and RB1, extensive genomic instability, and aggressive proliferation. Recent literature also highlights the importance of tumor plasticity, with documented instances of non-small cell lung cancer (NSCLC) undergoing histological transdifferentiation into SCLC following targeted therapy resistance [28], [29]. Furthermore, a recent study has demonstrated the ability of carcinoids to differentiate to high-grade tumors by acquiring amplifications of cell-cycle regulating genes through chromothripsis [30].
Additionally, the immune microenvironment markedly varies among subtypes, with carcinoids typically displaying low immune infiltration and an immunologically 'cold' phenotype [31], [32], whereas SCLC and LCNEC frequently present a heterogeneous immune contexture, including immunosuppressive environments that diminish response to immunotherapeutic strategies [12], [32], [33], [34]. These biological differences are critical for interpreting the transcriptomic signatures captured by SurvSig, underpinning its utility as both a prognostic and biologically informative tool. Future studies integrating genomic, epigenomic, and immune profiling could further refine our understanding of these differences, potentially improving patient stratification and guiding subtype-specific treatment approaches.

Our study has several limitations to take into account. SurvSig relies solely on transcriptional patterns for sample stratification and does not incorporate genomic features such as mutations or copy number variations (CNVs) [4], [5], [6], [7], [35]. These genetic alterations are known to influence gene expression and can act as confounding factors, as demonstrated in carcinoid tumors (e.g., MEN1 mutations) and LCNEC. However, previous studies have also shown that transcriptional signatures associated with these alterations are detectable in tumors lacking the corresponding mutations, underscoring the role of alternative mechanisms (such as epigenetic inactivation [24], [36]) that can lead to similar phenotypic outcomes. Additionally, given the reliance on publicly available datasets, cohort selection bias remains a potential limitation, particularly in terms of sample diversity and completeness of clinical annotation. Moreover, it is well-established that comparing RNA expression levels directly with protein abundance has inherent limitations.
While SurvSig was specifically designed and validated for lung neuroendocrine neoplasms, the underlying methodology is broadly applicable and could, in principle, be extended to other neuroendocrine malignancies such as pancreatic neuroendocrine tumors (PanNETs) [37], [38]. Given the shared neuroendocrine transcriptional programs across tissue types, such an extension is technically feasible, as previous research has demonstrated common transcriptional and epigenetic programs in other neuroendocrine neoplasms [37], [39]. However, systematically evaluating its performance in non-pulmonary neuroendocrine cancers requires tumor-specific reference cohorts with well-annotated survival data, which falls beyond the scope of the current study. Future work may explore this direction as suitable datasets become available.

In summary, the SurvSig platform offers a versatile tool for the exploration of gene expression signatures in lung NENs. By facilitating biomarker discovery and enabling the interpretation of complex transcriptional data, SurvSig can aid in the development of personalized therapeutic strategies. SurvSig has the potential to uncover novel transcriptional signatures and genes, aiding the identification of candidate markers capable of differentiating lung NET histologies; however, these findings require further validation in the clinical setting. Continued development of such tools is essential for translating genomic insights into actionable clinical interventions, ultimately improving patient outcomes in lung NENs.

4. Material and methods

4.1. Data collection

Processed gene expression and survival data were collected from multiple publicly available neuroendocrine lung tumor cohorts. For small cell lung cancer (SCLC), data were sourced from George-SCLC [6] through cBioPortal [40], Lissa-SCLC [18], Jiang-SCLC [19], and Liu-SCLC [20], covering both treatment-naive and metastatic cases.
Large cell neuroendocrine carcinoma (LCNEC) data were obtained from George-LCNEC [7]. Pulmonary carcinoid tumor data were derived from Fernandez-CARCI [4] and Alcala-CARCI [5]. The Cancer Genome Atlas (TCGA) integrated cohort was also included [22], covering 33 different cancer types for comparative analysis, selecting earlier sample time points for duplicate entries. The Rousseaux mixed lung tumor cohort [21] was also included, where expression was profiled using microarrays. Raw CEL expression files were obtained from GEO, imported with the read.affybatch function from the affy package [41], followed by normalization with the rma function. Gene expression scores were obtained with the jscores function from the JetSet package [42].

Gene names across all cohorts were standardized to HUGO symbols through an automated mapping process to correct outdated or non-standard gene names. Unmapped genes were excluded, and duplicates were removed by retaining the entry with the highest mean expression. Datasets where expression was summarized in normalized read counts were log2-transformed to stabilize variance. For all cohorts except TCGA, NAPY [9], neuroendocrine (NE) [23], and epithelial-mesenchymal transition (EMT) scores [43] were calculated to assess tumor characteristics. These scoring methods were applied uniformly across the cohorts, ensuring consistent analysis across datasets. Cohorts are represented as individual sets on the SurvSig website, which can be selected under the “Select a Dataset” drop-down menu.

4.2. SurvSig website implementation

Python and R were used for data analysis and visualization. A web application developed using Python and Streamlit (1.39) was deployed on a Linux Debian (6.1.38) server, with integration of R for additional statistical analysis. Python (3.12) and R (4.3.3) were used for consistent analysis.
A brief example of how to use SurvSig can be found on the landing page and under the “Help” tab, which also contains several short introductory videos on the different functionalities.

4.3. Dimensionality reduction and clustering

Dimensionality reduction methods included Uniform Manifold Approximation and Projection (UMAP), t-distributed Stochastic Neighbor Embedding (t-SNE), Principal Component Analysis (PCA), and Multidimensional Scaling (MDS). UMAP [44] was implemented using umap-learn (0.5.6), while t-SNE [45], PCA, and MDS were applied using scikit-learn (1.5.1 and 1.5.2). Non-negative Matrix Factorization (NMF), implemented with scikit-learn and bignmf (1.0.5) [46], was also performed for clustering purposes. PHATE, used for nonlinear dimensionality reduction, was implemented with phate (1.0.11) [47].

On the SurvSig website, we implemented several advanced settings that users can modify for each method. For UMAP, users can modify the minimal distance, number of neighbors and distance metric. For t-SNE, users can modify the perplexity, number of iterations, distance metric and embedding method. For PCA, users can modify the tolerance, whitening and SVD solver. For MDS, users can select the number of iterations, the epsilon value and whether metric MDS is used. For standard NMF, users can select the number of iterations, tolerance, initialization method, beta loss, numeric solver and randomization of the order of coordinates, while for the NMF clustering option, users can select the number of iterations, number of trials and Lamb’s value. For PHATE, users can select the number of neighbors and principal components. In the correlation option, users can select Spearman or Pearson correlations.
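To illustrate the difference between the two correlation options, the sketch below computes Pearson correlation on raw values and Spearman correlation as Pearson correlation on tie-averaged ranks. It uses only the Python standard library; the helper names (`average_ranks`, `pearson`, `spearman`) and the toy expression values are hypothetical and are not taken from the SurvSig code, which relies on pandas and SciPy for these calculations.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: gene_b increases monotonically but non-linearly with gene_a,
# so Spearman is 1 (up to floating-point rounding) while Pearson is below 1.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0]
print("Pearson:", pearson(gene_a, gene_b))
print("Spearman:", spearman(gene_a, gene_b))
```

Spearman correlation depends only on the ordering of the values, which makes it less sensitive to outliers and to the scale of the expression data, whereas Pearson correlation measures linear association on the values themselves.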
For clustering, several methods were employed, including K-means, Gaussian Mixture Models (GMM) [48], Agglomerative Clustering, Self-Organizing Maps (SOM) [49], and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [50], using scikit-learn and MiniSom. On the SurvSig website, users can also modify advanced parameters for the clustering methods. For K-means, apart from the number of clusters, users can modify the initialization method, relative tolerance, maximum number of iterations and K-means algorithm. For Gaussian Mixture Models, users can select the initialization method, covariance type, convergence threshold and non-negative regularization. For agglomerative clustering, users can select the linkage method. For DynamicTreeCut, users can select the linkage method, distance metric and minimum cluster size. For HDBSCAN, users can select the alpha value, metric, minimum number of samples, epsilon value and cluster selection method. For OPTICS, users can select the minimum number of samples, distance metric, initialization method and epsilon value. Together, these options enable users to modify and refine both the dimensionality reduction and the clustering approaches. Correlations between gene expression profiles were calculated using Spearman or Pearson methods, implemented via pandas (versions 2.2.2 and 2.2.3).

4.4. Gene set enrichment, survival and pathway enrichment analysis

Gene set enrichment analysis was conducted using single-sample GSEA (ssGSEA) with the ssgsea function of gseapy (1.1.3) [51] using default settings. Since datasets were log-normalized, normalization was set to “None” for the analyses shown in the figures. We implemented advanced settings for the ssGSEA calculation, which include the option to select the normalization, correlation method and weights.
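As a rough illustration of what a single-sample enrichment score measures, the following standard-library Python sketch computes a simplified rank-based score in the spirit of ssGSEA: genes are ordered by expression within one sample, and the score sums the difference between the weighted cumulative distribution of the gene-set members and the cumulative distribution of the remaining genes. This is not the gseapy implementation used by SurvSig. The function name `ssgsea_score`, the default weight of 0.25, the toy gene names, and the absence of tie handling and score normalization are all assumptions made for this example.

```python
def ssgsea_score(expression, gene_set, alpha=0.25):
    """Simplified rank-based single-sample enrichment score (illustrative only).

    expression: dict mapping gene symbol -> expression value in ONE sample.
    gene_set:   collection of gene symbols forming the signature.
    alpha:      exponent applied to gene ranks when weighting set members
                (0.25 is an assumed default for this sketch).
    """
    # Order genes from highest to lowest expression within the sample.
    genes = sorted(expression, key=expression.get, reverse=True)
    n = len(genes)
    members = set(gene_set)
    in_set = [gene in members for gene in genes]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must contain some, but not all, measured genes")

    # Rank n for the most highly expressed gene, rank 1 for the lowest.
    hit_weights = [(n - i) ** alpha if hit else 0.0 for i, hit in enumerate(in_set)]
    total_hit_weight = sum(hit_weights)
    miss_step = 1.0 / (n - n_hits)

    score = 0.0
    cum_hit = 0.0   # weighted cumulative fraction of gene-set members seen so far
    cum_miss = 0.0  # cumulative fraction of non-members seen so far
    for weight, hit in zip(hit_weights, in_set):
        if hit:
            cum_hit += weight / total_hit_weight
        else:
            cum_miss += miss_step
        # Accumulate the gap between the two cumulative curves at each position.
        score += cum_hit - cum_miss
    return score


# Toy sample (hypothetical log-scale values): the signature genes are the most
# highly expressed ones, so the score is positive.
sample = {"GENE_A": 9.1, "GENE_B": 8.4, "GENE_C": 5.0,
          "GENE_D": 3.2, "GENE_E": 1.5, "GENE_F": 0.7}
print(ssgsea_score(sample, {"GENE_A", "GENE_B"}))
```

A signature whose genes sit near the top of the ranked list yields a positive score, and one whose genes sit near the bottom yields a negative score; the per-sample scores can then be compared across patients, for example to define “high” and “low” groups.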
Pathway enrichment analysis was performed using clusterProfiler (4.6.2) [52] with gene annotations from org.Hs.eg.db (3.16.0), and results were visualized using enrichplot (1.18.4). On the SurvSig website, users can select the clusterProfiler enrichment method: enrichGO (where the BP, MF, CC or all sub-ontologies can be selected) or enrichKEGG. We also implemented further options, such as p-value and q-value selection, selection of the p-value correction method, the maximum number of genes and pathways to be plotted, and the ordering of gene lists on plots.

Survival analysis used the survival R package (3.7-0) for Kaplan-Meier estimation and Cox proportional hazards models, with additional visualization, such as for multivariate analysis, provided by survminer (0.4.9). On the SurvSig website, the multivariate option for clustered heatmaps can be found on a separate tab (“Multivariate & Chi²”), while for survival analysis using ssGSEA values or single genes, multivariate analysis is performed automatically. In each case, we implemented options to modify the analysis, including the selection of clinical features and annotations to include and the choice of reference category.

For single-gene and ssGSEA score-based survival analysis, we implemented several options to stratify patients into “high” and “low” groups:
1) “Median”: patients are separated at the median value.
2) “Percentage decomposition”: percentile-based separation, where the user sets the percentage manually.
3) “Percentage decomposition (lower and upper limit)”: users specify two percentages; the “low” group consists of patients below the lower threshold, and the “high” group consists of patients above the upper threshold.
4) “Automatic”: by default, SurvSig scans all cutoffs in 1 % steps between the 10 % and 90 % percentiles of patients and returns the percentage at which the survival analysis yields the lowest p-value. To control for multiple testing, an additional plot and table summarize the calculated p-values and the FDR-adjusted p-values. Both the interval range and the step size can be adjusted by the user.
5) “Expression values cutoff”: users select the expression value used to separate the patients.

Heatmaps were generated using ComplexHeatmap (2.14.0) [53] to visualize gene expression and enrichment data. On the SurvSig website, users can choose to visualize (and cluster) data based on Z-scores or on the normalized expression data (log-transformed in the case of RNA-seq). We implemented several options to customize the heatmaps, such as specifying the color palette (and percentile thresholds), displaying dendrograms for rows and columns, and selecting sample annotations together with their colors.

4.5. Statistical analyses

Comparisons and statistical analyses were calculated using SurvSig. Comparisons between groups (such as those shown in boxplots) were calculated using the mannwhitneyu() function of SciPy in Python. Spearman correlations between genes (and ssGSEA enrichment scores) were calculated using the spearmanr function of SciPy, with the option to calculate Pearson correlation using the pearsonr function of SciPy. False discovery rate (FDR) adjustments, chi-square tests, and Z-score calculations were performed using statsmodels (0.14.4), while median, standard deviation, and variance were calculated using numpy. Correlations between clinical and gene expression data were calculated using pandas. Interactive visualizations were generated using Plotly (5.23.0).

4.6. Gene set finder

The Gene Set Finder identifies significant genes using machine learning and statistical methods. Dimensionality reduction techniques, such as PCA, independent component analysis (ICA), factor analysis (FA), and NMF, are applied to reduce complexity while retaining relevant structure.
These methods, implemented via scikit-learn, were supplemented with a standard deviation-based ranking of genes by variability. For clinically integrated analysis, classifiers such as Artificial Neural Networks (ANN) and Categorical Naive Bayes were used to identify genes associated with clinical features. Kruskal-Wallis tests, corrected for multiple testing using FDR, were used to detect associations between genes and clinical characteristics.

CRediT authorship contribution statement

Gabriella Mihalekné Fűr: Writing – review & editing, Writing – original draft, Visualization, Investigation. Alexandra Benő: Writing – review & editing, Writing – original draft, Visualization, Investigation. Schultz Christopher W: Writing – review & editing, Writing – original draft, Methodology, Investigation, Formal analysis. Petronella Topolcsányi: Writing – review & editing, Writing – original draft, Visualization, Investigation. Éva Magó: Writing – review & editing, Writing – original draft, Visualization, Investigation. Parth Desai: Writing – review & editing, Writing – original draft, Formal analysis. Nobuyuki Takahashi: Writing – review & editing, Writing – original draft, Formal analysis. Mirit I Aladjem: Writing – review & editing, Writing – original draft, Investigation, Formal analysis. William Reinhold: Writing – review & editing, Writing – original draft, Investigation, Formal analysis. Yves Pommier: Writing – review & editing, Writing – original draft, Investigation, Formal analysis. Anish Thomas: Writing – review & editing, Writing – original draft, Formal analysis, Conceptualization. Pongor Lorinc: Writing – review & editing, Writing – original draft, Supervision, Software, Methodology, Formal analysis, Conceptualization. Kolos Nemes: Writing – review & editing, Writing – original draft, Software, Methodology, Data curation, Conceptualization.

Code availability

The SurvSig website is freely available at https://survsig.hcemm.eu/.
The code is available on GitHub (https://github.com/HCEMM/SurvSig).

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work the author(s) used OpenAI ChatGPT (4o) in order to proofread the text. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.

Funding

This research was funded by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences BO/00697/23 (L.S.P.). The project received funding from the EU's Horizon 2020 Research and Innovation Program under grant agreement No. 739593. Projects no. TKP-2021-EGA-05 and 2022-2.1.1-NL-2022-00005 have been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, funded by the TKP2021-EGA and National Laboratories grant programs. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Funders have no conflict of interest.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2025.06.010. Supplementary files: mmc1.docx (1.3 MB) and mmc2.xlsx (184 KB).

References