Abstract The rapid development of high-performance technologies has greatly promoted studies of molecular oncology producing large amounts of data. Even if these data are publicly available, they need to be processed and studied to extract information useful to better understand mechanisms of pathogenesis of complex diseases, such as tumors. In this article, we illustrated a procedure for mining biologically meaningful biomarkers from microarray datasets of different tumor histotypes. The proposed methodology allows to automatically identify a subset of potentially informative genes from microarray data matrices, which differs either in the number of rows (genes) and of columns (patients). The methodology integrates nonnegative matrix factorization method, a functional enrichment analysis web tool with a properly designed gene extraction procedure to allow the analysis of omics input data with different row size. The proposed methodology has been used to mine microarray of solid tumors of different embryonic origin to verify the presence of common genes characterizing the heterogeneity of cancer-associated fibroblasts. These automatically extracted biomarkers could be used to suggest appropriate therapies to inactivate the state of active fibroblasts, thus avoiding their action on tumor progression. Keywords: NMF, metagene, microarray, cancer, fibroblast, cancer-associated fibroblast Introduction Complex human diseases are caused by the interactions of many genetic, environmental, and behavioral factors. Development of high-performance technologies greatly promoted studies on molecular oncology producing large amounts of omic data. The availability of massive volume of experimental data based on cancer researches requires the development of mathematical, statistical, and computational techniques, which automatically extract valuable information useful for a better understanding of pathogenesis mechanisms of complex diseases, such as tumors. Identifying marker genes potentially involved into the development of tumors may facilitate the understanding of cause of disease, thus contributing to the advancement of diagnostic tools and/or to the evaluation of more efficient clinical strategies. In most of the publicly available data sets, the number of genes is significantly larger than the number of samples. This high dimensionality represents a major problem for an automatic gene array–based cancer analysis. In this scenario, dimensionality reduction methods became indispensable as they can eliminate irrelevant and redundant information, thus reducing the dimensionality as well as the complexity of the original problem, with significant benefits in terms of computational efficiency, model interpretability, and data understanding. Among linear dimensionality reduction methods, low rank matrix decomposition algorithms have been successfully exploited as alternative approaches for studying various types of high-dimensional biological data, including gene expression data.^[29]1 The application of these algorithms relies on the assumption that large-scale biological data have an intrinsic low-dimensional representation, with the dimension often corresponding to the number of latent information embedded into the original data. These methods transform the space of original data into a lower dimensional more discriminating (informative) space that makes the subsequent analysis more efficient. Among low-rank reduction mechanisms, nonnegative matrix factorizations (NMFs) emerge as useful approaches for the analysis of microarray data. The intrinsic non-negativity property of these techniques, in fact, produces more intuitive results as many biological measurements are represented by positive values. Nonnegative matrix factorizations demonstrated their ability in a number of tasks, including the identification of sets of genes co-operating in a relatively tightly regulated manner;^[30]2 the discovery of potential relationships in large biological data samples and link genes to these patterns;^[31]3 the detection of distinct genomic subtypes in cancer patients;^[32]4 the inspection of expression data sets including time evolution of the gene expression profile in different samples;^[33]5 and the extraction of gene expression profiles from fibroblasts of cancer blood diseases.^[34]6 In this article, we present a gene extraction methodology, which integrates a NMF method^[35]7,[36]8 with the functional enrichment analysis web tool WebGestalt^[37]9 and a gene extraction procedure designed ad hoc to automatically mine different microarray cancer data sets to extract a reduced subset of genes to be further investigated from a biological point of view. An approach for mining multi-omics cell line data with the same row size by joint nonnegative matrix factorization (JNMF) and pathway signature analyses was recently proposed in Fujita et al.^[38]10 The methodology presented in this article instead allows to analyze microarray data matrices, which differ either in the number of rows (genes) and in the number of columns (patients) to verify the presence of common genes characterizing the heterogeneity of different cancer datasets. This proposal enriches the panorama of large-scale data-driven computational methods based on matrix factorization algorithms, which are able to extract concise and useful pieces of information from existing disease-associated data sets.^[39]1,[40]11,[41]12 Particularly, through the NMF method, our methodology mines the metagenes, which are the most representative of the information embedded into different tumor datasets and then, by simple intersection set operations allow to extract genes in a natural way to obtain interpretable and useful knowledge readily usable from biologists and analyzed, thanks to functional and visualization approaches based on the WebGestalt tool. The proposed methodology has been used to mine microarray of solid tumors of different embryonic origin to verify the presence of biomarkers characterizing the heterogeneity of cancer-associated fibroblasts (CAF) thus being applicable in clinical practice. Fibroblasts constitute the most heterogeneous and abundant population of mesenchymal cells in tumor microenvironment (TME). Their presence goes from tumor formation up to the final stage of metastatic diffusion, but their precise functional role in tumor is not fully understood yet.^[42]13 Also, it is not clear how different subtypes of CAF could exert distinct paracrine actions affecting specific tumor oncogenesis.^[43]14,[44]15 Therefore, we adopt the proposed procedure to analyze gene expression profiles of CAFs belonging to primary cultures of 3 distinct tumor histotypes (ie, colon of endodermal origin, breast carcinoma of ectodermal origin and ovary of mesodermal origin) to select common biomarkers and characterize activated fibroblast phenotype. Moreover, as it is known that bone marrow (BM) is a CAF recruitment source,^[45]16 we aim at investigating also the existence of common genes among CAF gene expression profiles of selected solid tumors and those of BM of patients with multiple myeloma (MM) and monoclonal gammopathy of undetermined significance (MGUS; MGUS is a benign pathological condition characterized by the proliferation of a plasma cell clone that rarely evolves into a malignant neoplasm.^[46]17 The latter, being a benign pathological condition, simulates a well-differentiated CAF phenotype compared to those of patients with MM.). Using the proposed methodology, we highlight the existence of biomarkers among CAFs of different embryonic origin that uniquely identify them and could be useful for developing or evaluating more efficient clinical strategies. This article is organized as follows. First, the pipeline of the whole gene extraction methodology is presented, illustrating a general way to pre-process this kind of data, its analysis core (based on the NMF method), and the ad hoc gene landscape extraction procedure. The subsequent section describes the specific biological problem we investigated, as well as the databases of CAF populations used in the experimental session. Then, the functional and pathway analysis performed during the experiments are deeply discussed and the obtained results are highlighted from a biological point of view. Some conclusive remarks conclude the article. Methods This section discusses in some detail the main architecture, data pre-processing, and basic NMF approach used in this study. The ensemble gene extraction methodology The proposed methodology is based on different operations that integrate the acquisition of data, their mathematical analysis, and the biological exploration of the obtained results. [47]Figure 1 illustrates the work-flow and the main operations performed when selected tumor data sets are provided. Figure 1. [48]Figure 1. [49]Open in a new tab Work flow of the NMF-based methodology for extraction of common genes between different tumor gene expression datasets. The methodology is mainly based on the integration of 4 parts: the data preparation module, the core module computing NMF of given data matrices, the Gene Landscape extractor devoted to the extraction of common genes between processed results, and the Functional and Visual Analysis module. The first 3 modules combine R/BioConductor platform and the NMF library, which are well established in bioinformatics research,^[50]18-[51]20 whereas the biological analysis of the latter module is performed using the well-known web tool WebGestalt. Data preparation and pre-processing Working on microarray data requires data sets are imported in the virtual environment as a single different expression set, in which probeset represents gene interrogating a [MATH: X+n ×m :MATH] particular expressed sequence.^[52]21 Once the data set is loaded, a pre-processing phase needed to be applied to create a correspondence between probeset and gene symbol. Data preparation module allows to handle gene expression set in which probesets map the same gene even when they refer to different quantities of transcript or to tackle the associated probes, which were not annotated in any sequence. [53]Figure 2 explains the cases treated by this phase to create gene expression data feeding the following algorithms. Figure 2. [54]Figure 2. [55]Open in a new tab Possible cases supported by the pre-processing phase. The pre-processing operations implemented in this module are based on the adjusted median absolute deviation (MAD) of the gene expression values. The use of the MAD (a scale factor is used to consider the MAD as a consistent estimator for the estimation of data standard deviation), which computes the variability of the data from the median, makes the process more robust to outliers. In particular, a tuple probeset id, gene symbol, and the associated expression value are uniquely identified assuming that greater is the MAD higher is the goodness of the hybridization in the experiment. The code and dataset related to data preparation and pre-processing module have been made available at GitHub at [56]https://github.com/flaespo/NMF-for-GSE. Nonnegative matrix factorization for genetic data sets After data pre-processing is performed, most omic data sets can be represented as a matrix in which each element contains the measurement of a single molecule in a single experimental condition. In the case of expression data, the resulting high-dimensional data set is re-formulated as a numerical nonnegative matrix with rows being genes and columns representing samples, for example, tissues of various patients (as in this study), development stages, or treatments. This matrix is called “gene expression matrix” and its elements [MATH: Xij :MATH] indicate the expression level of the gene i in the sample j. A main task in analyzing this matrix is to extract from it some knowledge about the underlying biological processes. Nonnegative matrix factorization can be applied to reduce data dimensionality as it decomposes a gene expression matrix X by creating a user-defined number of new column features [MATH: W(:,k) :MATH] , [MATH: (k=1,,r) :MATH] called “metagenes,” which are linear combinations of the original samples set (eg, column vectors [MATH: X(:,j) :MATH] ) weighted with nonnegative coefficients [MATH: X(:,j)k =1rW(:,k)Hkj, :MATH] for [MATH: j=1,,m :MATH] and [MATH: r<_min(m,n) :MATH] . In this way, original data are explained by a sum of additive parts and so that intuitively biological entities and mechanisms can be naturally described with a signal that is either present or absent. From an algebraic point of view, NMF finds 2 nonnegative matrices, the metagenes matrix [MATH: W+n ×r=[W(:,1),,W(:,r)] :MATH] and the metagene expression profiles matrix [MATH: H+r ×m :MATH] , whose elements [MATH: Hij :MATH] reveal the effect that the ith metagene [MATH: W(:,i) :MATH] has on the sample j, such that [MATH: XWH :MATH] . It is worthy to observe that if the value [MATH: Hij :MATH] is very small, then the corresponding metagene [MATH: W(:,i) :MATH] (having rows which are genes in X) is useless in approximating that particular sample. The decomposition could be dually viewed as individuating metasamples (rather than metagenes) and groups of genes (rather than of samples) when the entries of W are taken into account.^[57]22,[58]23 When NMFs are applied to produce clusters of genes, the rank value r is usually a priori set after trying different values, computing some quality measure of the results, and then choosing the best value according to the adopted quality criteria. In this article, however, we make use of some automatically suggested value accordingly a procedure described in Del Buono and colleagues.^[59]8,[60]23 This procedure makes use of cophenetic coefficient, residuals sum of squares, dispersion curve, and consensus matrices to optimally address a proper rank value r for each gene expression matrix X. Computationally, metagenes and expression profile values can be obtained solving a non-linear constrained optimization problem over the cone of nonnegative matrices [MATH: minW0< /mn>,H0< mi>Div(X,WH) :MATH] (1) where [MATH: Div(X,WH) :MATH] is any divergence measure which evaluates how well the low-dimensional matrix WH approximates X. In this article, the generalized Kullback-Leibler (KL) divergence [MATH: Div(X,WH)=i j(Xi jlog(< mi>Xij(WH)ij< /mrow>)X< mi>ij+(WH)ij< /mrow>) :MATH] is used, which corresponds to the maximum likelihood estimation under an independent Poisson assumption.^[61]24 The minimization problem (1) for the KL divergence function is solved applying the following multiplicative update rules for W and H: [MATH: Wij Wijk(HjkXik)/(WH)ik< /mrow>kH jkHijHijk(WkiXkj)/(WH)kj< /mrow>kW ki :MATH] Because of the non-negativity constraint, solutions to NMF are only unique up to scaling and rotation, but appropriately scaling and rotating the columns of W and rows of H will not alter the overall matrix product WH. For this reason, what is of interest in practice is not the values of matrix elements, but their relative magnitudes in each column of W (or row of H). Moreover, taking into account that more genes can participate in more than one biological process, it could be of some beneficial to investigate genes that have relatively large coefficients in each biological process. To do this, the gene.score scoring method proposed in Kim and Park^[62]25 has been adopted. This method computes a value gene.score (i) i value for each gene i in a metagene [MATH: W(:,k) :MATH] and selects those genes possessing gene.score value higher than a given threshold τ. Particularly, the score threshold τ is computed from the gene score vector itself as [MATH: τ=μ^+3σ^ :MATH] , being and [MATH: σ^ :MATH] the median and the MAD of gene scores, respectively. Metagenes in W contain the largest number of genes satisfying this empirical criterion and they can be considered as the most representative of the information hidden onto the gene expression data X. Observe that the metagenes NMF technique extracts are constrained by the dataset used to train them, so a careful selection of datasets is essential: users need to choose those being broad enough to cover the relevant sources of variability. Gene Landscape Extraction procedure Referring to [63]Figure 1, let [MATH: Xi+ni×mi :MATH] , [MATH: i=1,,t :MATH] be the gene expression matrices related to different biological experiments and let [MATH: Wi+ni×ki :MATH] be the metagenes matrices obtained solving equation (1) for each [MATH: Xi :MATH] with an a priori rank [MATH: ki :MATH] . In the proposed methodology, the constrained nonnegative optimization problem (1) have been solved via the alternating method in NMF R-package, with multiple executions and random initializations.^[64]8,[65]20 To compare gene expression matrices derived from different tumor histotypes (in this work from their associated CAF cultures), the most representative metagene [MATH: Wi(:,k.) :MATH] is extracted from each gene expression matrix [MATH: Xi :MATH] . This metagene represents the [MATH: k.th :MATH] column in the matrix [MATH: Wi :MATH] , which possesses the largest number of genes satisfying the gene.score criterium. These latter genes compose the gene-subsets automatically extracted from each tumor histotype. To identify common genes among those extracted by different tumor histotypes, an intersection set operation was performed on the identification labels of genes [MATH: Wi(:,k.) :MATH] so that [MATH: C=i= 1NWi< mo stretchy="false">(:,k.) :MATH] is a subset of gene-label common to each different tumor histotypes. It should be observed that the set of common genes [MATH: C :MATH] depends on the gene expression matrices initially selected. [66]Figure 3 provides a logical view of the Gene Landscape Extraction procedure, which represents the novel proposal to extract common genes from microarray matrices which differ in their sizes. Particularly, the microarray matrices [MATH: Xi ni×mi :MATH] obtained by the data preparation and pre-processing module from each tumor database are factorized by the NMF algorithm. This provides matrices [MATH: Wi ni×ri :MATH] and [MATH: Hi ri×mi :MATH] . Particularly, the rank values [MATH: ri :MATH] used in the factorization process of the gene expression matrices used in this article are reported in [67]Table 2. From each [MATH: Wi :MATH] , the most relevant metagene [MATH: Wi(:,k.i) :MATH] is automatically selected by the gene.score scoring method. These column vectors undergo the Gene Landscape Extraction procedure. It should be observed that which [MATH: Wi(:,k.i) :MATH] ( [MATH: i=1,,n :MATH] ) differ in their number of rows (genes) and they are ordered to have the most representative information of the specific tumor histotype in their first components. Gene Landscape Extraction procedure uses intersection set operations to identify labels of genes in each [MATH: Wi(:,k.i) :MATH] , which are common to each different tumor histotypes. Figure 3. [68]Figure 3. [69]Open in a new tab Logical view of the gene landscape extraction. Preprocessed expression data matrices related to 3 different tumors ( [MATH: X1,X2< /mn>,X3 :MATH] ) are factorized by the NMF algorithm and corresponding metagenes matrices [MATH: Wi :MATH] ( [MATH: i=1,2,3 :MATH] ) are taken into considerations. Gene.score scoring method selects the most relevant metagenes [MATH: (Wi(:,k.i)) :MATH] , ( [MATH: i=1,2,3 :MATH] ) i, which undergo the Gene Landscape Extraction procedure. This latter uses an intersection set operation to identify labels of genes common to each different tumor histotypes. The extracted subset of gene-labels is then sent to the functional and pathway analyses. Table 2. Gene ontology functional analysis: based on the parameters used, 10 categories are identified as enriched categories and all are shown in this table.^[70]9 Gene set Description Size Expect Ratio P value FDR GO:0006955 Immune response 1919 9.0964 2.5285 .00001777 0.1615 GO:1901565 Organonitrogen compound catabolic process 1240 5.8778 2.8922 .00005655 0.2571 GO:0042454 Ribonucleoside catabolic process 22 0.1043 28.7680 .00014794 0.3362 GO:0009164 Nucleoside catabolic process 34 0.1612 18.6140 .00055180 0.8992 GO:0072529 Pyrimidine-containing compound catabolic process 38 0.1801 16.6550 .00076738 0.8992 GO:0048525 Negative regulation of viral process 88 0.4171 9.5892 .00080637 0.8992 GO:1901658 Glycosyl compound catabolic process 44 0.2086 14.3840 .00118040 0.8992 GO:0006216 Cytidine catabolic process 11 0.0521 38.3570 .00118690 0.8992 GO:0009972 Cytidine deamination 11 0.0521 38.3570 .00118690 0.8992 [71]Open in a new tab The parameters for the analysis of enrichment are minimum number of IDs in the category: 5; maximum number of IDs in the category: 2000; FDR method: BH; level of significance: Top 10. Functional and visual analysis The obtained subset of genes is then analyzed by integrative bioinformatic tools, that is, the WebGestalt tool,^[72]26 to discover the role they cover into the biological process under investigation. This module performs either functional and pathway enrichment analysis of the genes in [MATH: C :MATH] . Some graphical utilities were also added to better visualize the results provided. For instance, a variation of the UpSet plot is used to highlight most frequent genes and the pathway they belong to. The novelty of this representation with respect to the standard UpSet plot is the possibility of drawing the height and the width of each bar. To quantify the intersection between each pathway, the bars in the plot are proportional to the relative frequencies and to the number of genes selected in each set of pathway. This graphical representation was used in the following. Application to the Analysis of Gene Expression Profiles of CAFs In this section, we sequentially applied the main modules of the proposed methodology for automatically extract genes from gene expression profiles of CAFs. Fibroblasts are the most heterogeneous and abundant population of mesenchymal cells in the TME,^[73]27 but their precise functional role in tumor is not fully understood yet.^[74]13 During initial phases of oncogenesis, fibroblasts are activated giving rise to fibroblasts associated with the tumor (CAF), which play a key role in generating a specific extracellular matrix (ECM) in TME.^[75]28 The kinetics of changes in CAF actions might be different in the various types of tumor, partly because of organ-specific transcriptomic profiles of resident fibroblasts^[76]29 and partly as different subtypes of CAF could exert distinct paracrine actions affecting specific tumor oncogenesis.^[77]14,[78]15 Currently, either the number of CAF subpopulations present in the tumor stroma and the role assumed by the presence of an individual population or different cell types into tumor initial development stages are unknown.^[79]6,[80]16 Some characteristics distinguish CAFs from quiescent fibroblasts as, for instance, metabolic adaptations supporting their need for advanced proliferation and biosynthesis activities.^[81]13 Furthermore, a potentially controversial area of research on CAFs is focused on their origin. In fact, to define and identify origin of fibroblasts, it is fundamental to consider that CAFs are “activated fibroblasts” and unlike the non-activated (quiescent) fibroblasts residing in the tissue, they are an expansion of the cell population proliferating “in situ” or are recruited in the tumor.^[82]30 Recruitment of BM mesenchymal stem cells, differentiation from adipose stem cells, or conversion from endothelial cells through an epithelial-endothelium-mesenchymal transition process are potential origins of CAFs.^[83]13 However, the best documented source of CAFs is the activation of “normal” resident fibroblasts which, with their heterogeneity, imply the existence of different subsets. This heterogeneity may reflect the variability with respect to phenotypic state of both cell and tissue of origin and therefore also the reporting mediators and the mechanisms to be activated.^[84]14,[85]31 In fact, CAFs can be distinguished from other types of cells within the tumor by means of exclusion criteria defined by their morphological characteristics and by a lack of expression of non-mesenchymal markers, such as those expressed by endothelial, epithelial, immune, and neuronal cells even if none of these has an absolute specificity.^[86]32 A challenging aspect in CAFs studies is the precise definition of heterogeneous CAF populations in distinct phases of tumor progression through markers informing on their functions.^[87]6,[88]33 In this study, we used the proposed NMF-based gene extractor methodology to investigate CAFs heterogeneity to identify the possible presences of genes, which can be biomarkers and characterize activated fibroblast phenotype. Gene expression profiles of CAFs belonging to primary cultures of 3 distinct solid tumor histotypes from patients (ie, colon of endodermal origin, breast carcinoma of ectodermal origin and ovary of mesodermal origin) were considered. Moreover, we also verified the existence of common genes among CAF gene expression profiles of selected solid tumors and those of BM of patients with multiple myeloma (MM)^[89]6 and MGUS.^[90]17 Identification of CAF populations and download of transcriptomic profiles International Agency for Research on Cancer (IARC) estimates 18.1 million (17.0 million excluding non-melanoma skin cancer) new cancer cases in 2018, causing about 9.6 million of deaths (9.5 million excluding non-melanoma skin cancer).^[91]34 For studying the heterogeneous CAF population, which represents the most abundant cellular component of the TME, representative tumor histotypes with high rate of mortality have been selected: colon carcinoma, breast cancer, ovarian cancer, and MM. The gene expression profiles of CAFs were downloaded from the Gene Expression Omnibus database (GEO) NCBI (a functional public genomic database that supports the submission of MIAME-compliant data) directly from the original publications of each series ([92]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GEO series).^[93]35 Data sets were selected according to the following standards: (a) the GEO platform (GPL) and (b) the number of samples with labels to identify the fibroblasts associated with the tumor (being ⩾7 the total amount of samples for each representative cancer). [94]Table 1 reports GSE series, bibliographical references, GPL, GSM