Abstract

   The rapid development of high-performance technologies has greatly
   promoted studies of molecular oncology producing large amounts of data.
   Even if these data are publicly available, they need to be processed
   and studied to extract information useful to better understand
   mechanisms of pathogenesis of complex diseases, such as tumors. In this
   article, we illustrated a procedure for mining biologically meaningful
   biomarkers from microarray datasets of different tumor histotypes. The
   proposed methodology allows to automatically identify a subset of
   potentially informative genes from microarray data matrices, which
   differs either in the number of rows (genes) and of columns (patients).
   The methodology integrates nonnegative matrix factorization method, a
   functional enrichment analysis web tool with a properly designed gene
   extraction procedure to allow the analysis of omics input data with
   different row size. The proposed methodology has been used to mine
   microarray of solid tumors of different embryonic origin to verify the
   presence of common genes characterizing the heterogeneity of
   cancer-associated fibroblasts. These automatically extracted biomarkers
   could be used to suggest appropriate therapies to inactivate the state
   of active fibroblasts, thus avoiding their action on tumor progression.

   Keywords: NMF, metagene, microarray, cancer, fibroblast,
   cancer-associated fibroblast

Introduction

   Complex human diseases are caused by the interactions of many genetic,
   environmental, and behavioral factors. Development of high-performance
   technologies greatly promoted studies on molecular oncology producing
   large amounts of omic data. The availability of massive volume of
   experimental data based on cancer researches requires the development
   of mathematical, statistical, and computational techniques, which
   automatically extract valuable information useful for a better
   understanding of pathogenesis mechanisms of complex diseases, such as
   tumors. Identifying marker genes potentially involved into the
   development of tumors may facilitate the understanding of cause of
   disease, thus contributing to the advancement of diagnostic tools
   and/or to the evaluation of more efficient clinical strategies.

   In most of the publicly available data sets, the number of genes is
   significantly larger than the number of samples. This high
   dimensionality represents a major problem for an automatic gene
   array–based cancer analysis. In this scenario, dimensionality reduction
   methods became indispensable as they can eliminate irrelevant and
   redundant information, thus reducing the dimensionality as well as the
   complexity of the original problem, with significant benefits in terms
   of computational efficiency, model interpretability, and data
   understanding.

   Among linear dimensionality reduction methods, low rank matrix
   decomposition algorithms have been successfully exploited as
   alternative approaches for studying various types of high-dimensional
   biological data, including gene expression data.^[29]1 The application
   of these algorithms relies on the assumption that large-scale
   biological data have an intrinsic low-dimensional representation, with
   the dimension often corresponding to the number of latent information
   embedded into the original data. These methods transform the space of
   original data into a lower dimensional more discriminating
   (informative) space that makes the subsequent analysis more efficient.

   Among low-rank reduction mechanisms, nonnegative matrix factorizations
   (NMFs) emerge as useful approaches for the analysis of microarray data.
   The intrinsic non-negativity property of these techniques, in fact,
   produces more intuitive results as many biological measurements are
   represented by positive values. Nonnegative matrix factorizations
   demonstrated their ability in a number of tasks, including the
   identification of sets of genes co-operating in a relatively tightly
   regulated manner;^[30]2 the discovery of potential relationships in
   large biological data samples and link genes to these patterns;^[31]3
   the detection of distinct genomic subtypes in cancer patients;^[32]4
   the inspection of expression data sets including time evolution of the
   gene expression profile in different samples;^[33]5 and the extraction
   of gene expression profiles from fibroblasts of cancer blood
   diseases.^[34]6

   In this article, we present a gene extraction methodology, which
   integrates a NMF method^[35]7,[36]8 with the functional enrichment
   analysis web tool WebGestalt^[37]9 and a gene extraction procedure
   designed ad hoc to automatically mine different microarray cancer data
   sets to extract a reduced subset of genes to be further investigated
   from a biological point of view. An approach for mining multi-omics
   cell line data with the same row size by joint nonnegative matrix
   factorization (JNMF) and pathway signature analyses was recently
   proposed in Fujita et al.^[38]10 The methodology presented in this
   article instead allows to analyze microarray data matrices, which
   differ either in the number of rows (genes) and in the number of
   columns (patients) to verify the presence of common genes
   characterizing the heterogeneity of different cancer datasets. This
   proposal enriches the panorama of large-scale data-driven computational
   methods based on matrix factorization algorithms, which are able to
   extract concise and useful pieces of information from existing
   disease-associated data sets.^[39]1,[40]11,[41]12 Particularly, through
   the NMF method, our methodology mines the metagenes, which are the most
   representative of the information embedded into different tumor
   datasets and then, by simple intersection set operations allow to
   extract genes in a natural way to obtain interpretable and useful
   knowledge readily usable from biologists and analyzed, thanks to
   functional and visualization approaches based on the WebGestalt tool.
   The proposed methodology has been used to mine microarray of solid
   tumors of different embryonic origin to verify the presence of
   biomarkers characterizing the heterogeneity of cancer-associated
   fibroblasts (CAF) thus being applicable in clinical practice.

   Fibroblasts constitute the most heterogeneous and abundant population
   of mesenchymal cells in tumor microenvironment (TME). Their presence
   goes from tumor formation up to the final stage of metastatic
   diffusion, but their precise functional role in tumor is not fully
   understood yet.^[42]13 Also, it is not clear how different subtypes of
   CAF could exert distinct paracrine actions affecting specific tumor
   oncogenesis.^[43]14,[44]15 Therefore, we adopt the proposed procedure
   to analyze gene expression profiles of CAFs belonging to primary
   cultures of 3 distinct tumor histotypes (ie, colon of endodermal
   origin, breast carcinoma of ectodermal origin and ovary of mesodermal
   origin) to select common biomarkers and characterize activated
   fibroblast phenotype. Moreover, as it is known that bone marrow (BM) is
   a CAF recruitment source,^[45]16 we aim at investigating also the
   existence of common genes among CAF gene expression profiles of
   selected solid tumors and those of BM of patients with multiple myeloma
   (MM) and monoclonal gammopathy of undetermined significance (MGUS; MGUS
   is a benign pathological condition characterized by the proliferation
   of a plasma cell clone that rarely evolves into a malignant
   neoplasm.^[46]17 The latter, being a benign pathological condition,
   simulates a well-differentiated CAF phenotype compared to those of
   patients with MM.). Using the proposed methodology, we highlight the
   existence of biomarkers among CAFs of different embryonic origin that
   uniquely identify them and could be useful for developing or evaluating
   more efficient clinical strategies.

   This article is organized as follows. First, the pipeline of the whole
   gene extraction methodology is presented, illustrating a general way to
   pre-process this kind of data, its analysis core (based on the NMF
   method), and the ad hoc gene landscape extraction procedure. The
   subsequent section describes the specific biological problem we
   investigated, as well as the databases of CAF populations used in the
   experimental session. Then, the functional and pathway analysis
   performed during the experiments are deeply discussed and the obtained
   results are highlighted from a biological point of view. Some
   conclusive remarks conclude the article.

Methods

   This section discusses in some detail the main architecture, data
   pre-processing, and basic NMF approach used in this study.

The ensemble gene extraction methodology

   The proposed methodology is based on different operations that
   integrate the acquisition of data, their mathematical analysis, and the
   biological exploration of the obtained results.

   [47]Figure 1 illustrates the work-flow and the main operations
   performed when selected tumor data sets are provided.

Figure 1.

   [48]Figure 1.
   [49]Open in a new tab

   Work flow of the NMF-based methodology for extraction of common genes
   between different tumor gene expression datasets.

   The methodology is mainly based on the integration of 4 parts: the data
   preparation module, the core module computing NMF of given data
   matrices, the Gene Landscape extractor devoted to the extraction of
   common genes between processed results, and the Functional and Visual
   Analysis module. The first 3 modules combine R/BioConductor platform
   and the NMF library, which are well established in bioinformatics
   research,^[50]18-[51]20 whereas the biological analysis of the latter
   module is performed using the well-known web tool WebGestalt.

Data preparation and pre-processing

   Working on microarray data requires data sets are imported in the
   virtual environment as a single different expression set, in which
   probeset represents gene interrogating a
   [MATH:
   <mrow><mi>X</mi><mo>∈</mo><msubsup><mi>ℝ</mi><mo>+</mo><mrow><mi>n</mi>
   <mo>×</mo><mi>m</mi></mrow></msubsup></mrow> :MATH]
   particular expressed sequence.^[52]21 Once the data set is loaded, a
   pre-processing phase needed to be applied to create a correspondence
   between probeset and gene symbol. Data preparation module allows to
   handle gene expression set in which probesets map the same gene even
   when they refer to different quantities of transcript or to tackle the
   associated probes, which were not annotated in any sequence. [53]Figure
   2 explains the cases treated by this phase to create gene expression
   data feeding the following algorithms.

Figure 2.

   [54]Figure 2.
   [55]Open in a new tab

   Possible cases supported by the pre-processing phase.

   The pre-processing operations implemented in this module are based on
   the adjusted median absolute deviation (MAD) of the gene expression
   values. The use of the MAD (a scale factor is used to consider the MAD
   as a consistent estimator for the estimation of data standard
   deviation), which computes the variability of the data from the median,
   makes the process more robust to outliers. In particular, a tuple
   probeset id, gene symbol, and the associated expression value are
   uniquely identified assuming that greater is the MAD higher is the
   goodness of the hybridization in the experiment. The code and dataset
   related to data preparation and pre-processing module have been made
   available at GitHub at [56]https://github.com/flaespo/NMF-for-GSE.

Nonnegative matrix factorization for genetic data sets

   After data pre-processing is performed, most omic data sets can be
   represented as a matrix in which each element contains the measurement
   of a single molecule in a single experimental condition. In the case of
   expression data, the resulting high-dimensional data set is
   re-formulated as a numerical nonnegative matrix with rows being genes
   and columns representing samples, for example, tissues of various
   patients (as in this study), development stages, or treatments. This
   matrix is called “gene expression matrix” and its elements
   [MATH:
   <mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow>
   :MATH]
   indicate the expression level of the gene i in the sample j. A main
   task in analyzing this matrix is to extract from it some knowledge
   about the underlying biological processes.

   Nonnegative matrix factorization can be applied to reduce data
   dimensionality as it decomposes a gene expression matrix X by creating
   a user-defined number of new column features
   [MATH: <mrow><mi>W</mi><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   ,
   [MATH: <mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>…</m
   o><mo>,</mo><mi>r</mi><mo>⊥</mo><mo stretchy="false">)</mo></mrow>
   :MATH]
   called “metagenes,” which are linear combinations of the original
   samples set (eg, column vectors
   [MATH: <mrow><mi>X</mi><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mi>j</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   ) weighted with nonnegative coefficients
   [MATH: <mrow><mi>X</mi><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mi>j</mi><mo
   stretchy="false">)</mo><mo>≈</mo><munderover><mo>∑</mo><mrow><mi>k</mi>
   <mo>=</mo><mn>1</mn></mrow><mi>r</mi></munderover><mi>W</mi><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mi>k</mi><mo
   stretchy="false">)</mo><msub><mi>H</mi><mrow><mi>k</mi><mi>j</mi></mrow
   ></msub><mo>,</mo></mrow> :MATH]

   for
   [MATH:
   <mrow><mi>j</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>…</mo><mo>,</mo><mi>m
   </mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>r</mi><mo><</mo><mo>_</mo><mi>min</mi><mo
   stretchy="false">(</mo><mi>m</mi><mo>,</mo><mi>n</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   . In this way, original data are explained by a sum of additive parts
   and so that intuitively biological entities and mechanisms can be
   naturally described with a signal that is either present or absent.

   From an algebraic point of view, NMF finds 2 nonnegative matrices, the
   metagenes matrix
   [MATH:
   <mrow><mi>W</mi><mo>∈</mo><msubsup><mi>ℝ</mi><mo>+</mo><mrow><mi>n</mi>
   <mo>×</mo><mi>r</mi></mrow></msubsup><mo>=</mo><mo
   stretchy="false">[</mo><mi>W</mi><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mn>1</mn><mo
   stretchy="false">)</mo><mo>,</mo><mo>…</mo><mo>,</mo><mi>W</mi><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mi>r</mi><mo
   stretchy="false">)</mo><mo stretchy="false">]</mo></mrow> :MATH]
   and the metagene expression profiles matrix
   [MATH:
   <mrow><mi>H</mi><mo>∈</mo><msubsup><mi>ℝ</mi><mo>+</mo><mrow><mi>r</mi>
   <mo>×</mo><mi>m</mi></mrow></msubsup></mrow> :MATH]
   , whose elements
   [MATH:
   <mrow><msub><mi>H</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow>
   :MATH]
   reveal the effect that the ith metagene
   [MATH: <mrow><mi>W</mi><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mi>i</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   has on the sample j, such that
   [MATH: <mrow><mi>X</mi><mspace
   width="0.25em"></mspace><mo>∼</mo><mo>∼</mo><mspace
   width="0.25em"></mspace><mi>W</mi><mi>H</mi></mrow> :MATH]
   . It is worthy to observe that if the value
   [MATH:
   <mrow><msub><mi>H</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow>
   :MATH]
   is very small, then the corresponding metagene
   [MATH: <mrow><mi>W</mi><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mi>i</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   (having rows which are genes in X) is useless in approximating that
   particular sample. The decomposition could be dually viewed as
   individuating metasamples (rather than metagenes) and groups of genes
   (rather than of samples) when the entries of W are taken into
   account.^[57]22,[58]23

   When NMFs are applied to produce clusters of genes, the rank value r is
   usually a priori set after trying different values, computing some
   quality measure of the results, and then choosing the best value
   according to the adopted quality criteria. In this article, however, we
   make use of some automatically suggested value accordingly a procedure
   described in Del Buono and colleagues.^[59]8,[60]23 This procedure
   makes use of cophenetic coefficient, residuals sum of squares,
   dispersion curve, and consensus matrices to optimally address a proper
   rank value r for each gene expression matrix X.

   Computationally, metagenes and expression profile values can be
   obtained solving a non-linear constrained optimization problem over the
   cone of nonnegative matrices
   [MATH:
   <mrow><munder><mrow><mi>min</mi></mrow><mrow><mi>W</mi><mo>≥</mo><mn>0<
   /mn><mo>,</mo><mspace
   width="0.25em"></mspace><mi>H</mi><mo>≥</mo><mn>0</mn></mrow></munder><
   mi>D</mi><mi>i</mi><mi>v</mi><mo
   stretchy="false">(</mo><mi>X</mi><mo>,</mo><mi>W</mi><mi>H</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   (1)

   where
   [MATH: <mrow><mi>D</mi><mi>i</mi><mi>v</mi><mo
   stretchy="false">(</mo><mi>X</mi><mo>,</mo><mi>W</mi><mi>H</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   is any divergence measure which evaluates how well the low-dimensional
   matrix WH approximates X. In this article, the generalized
   Kullback-Leibler (KL) divergence
   [MATH: <mrow><mi>D</mi><mi>i</mi><mi>v</mi><mo
   stretchy="false">(</mo><mi>X</mi><mo>,</mo><mi>W</mi><mi>H</mi><mo
   stretchy="false">)</mo><mo>=</mo><munder><mo>∑</mo><mrow><mi>i</mi><mi>
   j</mi></mrow></munder><mrow><mo>(</mo><mrow><msub><mi>X</mi><mrow><mi>i
   </mi><mi>j</mi></mrow></msub><mi
   mathvariant="normal">log</mi><mrow><mo>(</mo><mrow><mfrac><mrow><msub><
   mi>X</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow><mrow><msub><mr
   ow><mo stretchy="false">(</mo><mi>W</mi><mi>H</mi><mo
   stretchy="false">)</mo></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub><
   /mrow></mfrac></mrow><mo>)</mo></mrow><mo>−</mo><msub><mi>X</mi><mrow><
   mi>i</mi><mi>j</mi></mrow></msub><mo>+</mo><msub><mrow><mo
   stretchy="false">(</mo><mi>W</mi><mi>H</mi><mo
   stretchy="false">)</mo></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub><
   /mrow><mo>)</mo></mrow></mrow> :MATH]

   is used, which corresponds to the maximum likelihood estimation under
   an independent Poisson assumption.^[61]24 The minimization problem (1)
   for the KL divergence function is solved applying the following
   multiplicative update rules for W and H:
   [MATH: <mrow><mtable
   columnalign="left"><mtr><mtd><msub><mi>W</mi><mrow><mi>i</mi><mi>j</mi>
   </mrow></msub><mo
   stretchy="false">←</mo><msub><mi>W</mi><mrow><mi>i</mi><mi>j</mi></mrow
   ></msub><mfrac><mrow><munder><mo>∑</mo><mi>k</mi></munder><mrow><mo
   stretchy="false">(</mo><msub><mi>H</mi><mrow><mi>j</mi><mi>k</mi></mrow
   ></msub><msub><mi>X</mi><mrow><mi>i</mi><mi>k</mi></mrow></msub><mo
   stretchy="false">)</mo><mo>/</mo><mo
   stretchy="false">(</mo><mi>W</mi><mi>H</mi><msub><mo
   stretchy="false">)</mo><mrow><mi>i</mi><mi>k</mi></mrow></msub></mrow><
   /mrow><mrow><munder><mo>∑</mo><mi>k</mi></munder><mrow><msub><mi>H</mi>
   <mrow><mi>j</mi><mi>k</mi></mrow></msub></mrow></mrow></mfrac><mspace
   width="0.25em"></mspace><mspace
   width="0.25em"></mspace></mtd></mtr><mtr><mtd><msub><mi>H</mi><mrow><mi
   >i</mi><mi>j</mi></mrow></msub><mspace width="0.25em"></mspace><mo
   stretchy="false">←</mo><msub><mi>H</mi><mrow><mi>i</mi><mi>j</mi></mrow
   ></msub><mfrac><mrow><munder><mo>∑</mo><mi>k</mi></munder><mrow><mo
   stretchy="false">(</mo><msub><mi>W</mi><mrow><mi>k</mi><mi>i</mi></mrow
   ></msub><msub><mi>X</mi><mrow><mi>k</mi><mi>j</mi></mrow></msub><mo
   stretchy="false">)</mo><mo>/</mo><mo
   stretchy="false">(</mo><mi>W</mi><mi>H</mi><msub><mo
   stretchy="false">)</mo><mrow><mi>k</mi><mi>j</mi></mrow></msub></mrow><
   /mrow><mrow><munder><mo>∑</mo><mi>k</mi></munder><mrow><msub><mi>W</mi>
   <mrow><mi>k</mi><mi>i</mi></mrow></msub></mrow></mrow></mfrac></mtd></m
   tr></mtable></mrow> :MATH]

   Because of the non-negativity constraint, solutions to NMF are only
   unique up to scaling and rotation, but appropriately scaling and
   rotating the columns of W and rows of H will not alter the overall
   matrix product WH. For this reason, what is of interest in practice is
   not the values of matrix elements, but their relative magnitudes in
   each column of W (or row of H). Moreover, taking into account that more
   genes can participate in more than one biological process, it could be
   of some beneficial to investigate genes that have relatively large
   coefficients in each biological process.

   To do this, the gene.score scoring method proposed in Kim and
   Park^[62]25 has been adopted. This method computes a value gene.score
   (i) i value for each gene i in a metagene
   [MATH: <mrow><mi>W</mi><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   and selects those genes possessing gene.score value higher than a given
   threshold τ. Particularly, the score threshold τ is computed from the
   gene score vector itself as
   [MATH: <mrow><mi>τ</mi><mo>=</mo><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo
   stretchy="true">^</mo></mover><mo>+</mo><mn>3</mn><mover
   accent="true"><mrow><mi>σ</mi></mrow><mo
   stretchy="true">^</mo></mover></mrow> :MATH]
   , being and
   [MATH: <mrow><mover accent="true"><mrow><mi>σ</mi></mrow><mo
   stretchy="true">^</mo></mover></mrow> :MATH]
   the median and the MAD of gene scores, respectively.

   Metagenes in W contain the largest number of genes satisfying this
   empirical criterion and they can be considered as the most
   representative of the information hidden onto the gene expression data
   X. Observe that the metagenes NMF technique extracts are constrained by
   the dataset used to train them, so a careful selection of datasets is
   essential: users need to choose those being broad enough to cover the
   relevant sources of variability.

Gene Landscape Extraction procedure

   Referring to [63]Figure 1, let
   [MATH:
   <mrow><msub><mi>X</mi><mi>i</mi></msub><mo>∈</mo><msubsup><mi>ℝ</mi><mo
   >+</mo><mrow><msub><mi>n</mi><mi>i</mi></msub><mo>×</mo><msub><mi>m</mi
   ><mi>i</mi></msub></mrow></msubsup></mrow> :MATH]
   ,
   [MATH:
   <mrow><mi>i</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>…</mo><mo>,</mo><mi>t
   </mi></mrow> :MATH]
   be the gene expression matrices related to different biological
   experiments and let
   [MATH:
   <mrow><msub><mi>W</mi><mi>i</mi></msub><mo>∈</mo><msubsup><mi>ℝ</mi><mo
   >+</mo><mrow><msub><mi>n</mi><mi>i</mi></msub><mo>×</mo><msub><mi>k</mi
   ><mi>i</mi></msub></mrow></msubsup></mrow> :MATH]
   be the metagenes matrices obtained solving equation (1) for each
   [MATH: <mrow><msub><mi>X</mi><mi>i</mi></msub></mrow> :MATH]
   with an a priori rank
   [MATH: <mrow><msub><mi>k</mi><mi>i</mi></msub></mrow> :MATH]
   . In the proposed methodology, the constrained nonnegative optimization
   problem (1) have been solved via the alternating method in NMF
   R-package, with multiple executions and random
   initializations.^[64]8,[65]20

   To compare gene expression matrices derived from different tumor
   histotypes (in this work from their associated CAF cultures), the most
   representative metagene
   [MATH: <mrow><msub><mi>W</mi><mi>i</mi></msub><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mover
   accent="true"><mi>k</mi><mo>.</mo></mover><mo
   stretchy="false">)</mo></mrow> :MATH]
   is extracted from each gene expression matrix
   [MATH: <mrow><msub><mi>X</mi><mi>i</mi></msub></mrow> :MATH]
   . This metagene represents the
   [MATH: <mrow><mover accent="true"><mi>k</mi><mo>.</mo></mover><mi
   mathvariant="normal">th</mi></mrow> :MATH]
   column in the matrix
   [MATH: <mrow><msub><mi>W</mi><mi>i</mi></msub></mrow> :MATH]
   , which possesses the largest number of genes satisfying the gene.score
   criterium. These latter genes compose the gene-subsets automatically
   extracted from each tumor histotype.

   To identify common genes among those extracted by different tumor
   histotypes, an intersection set operation was performed on the
   identification labels of genes
   [MATH: <mrow><msub><mi>W</mi><mi>i</mi></msub><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><mover
   accent="true"><mi>k</mi><mo>.</mo></mover><mo
   stretchy="false">)</mo></mrow> :MATH]
   so that
   [MATH:
   <mrow><mi>C</mi><mo>=</mo><msubsup><mo>∩</mo><mrow><mi>i</mi><mo>=</mo>
   <mn>1</mn></mrow><mi>N</mi></msubsup><msub><mi>W</mi><mi>i</mi></msub><
   mo stretchy="false">(</mo><mo>:</mo><mo>,</mo><mover
   accent="true"><mi>k</mi><mo>.</mo></mover><mo
   stretchy="false">)</mo></mrow> :MATH]
   is a subset of gene-label common to each different tumor histotypes. It
   should be observed that the set of common genes
   [MATH: <mrow><mi>C</mi></mrow> :MATH]
   depends on the gene expression matrices initially selected.

   [66]Figure 3 provides a logical view of the Gene Landscape Extraction
   procedure, which represents the novel proposal to extract common genes
   from microarray matrices which differ in their sizes. Particularly, the
   microarray matrices
   [MATH:
   <mrow><msub><mi>X</mi><mi>i</mi></msub><mo>∈</mo><msup><mi>ℝ</mi><mrow>
   <msub><mi>n</mi><mi>i</mi></msub><mo>×</mo><msub><mi>m</mi><mi>i</mi></
   msub></mrow></msup></mrow> :MATH]
   obtained by the data preparation and pre-processing module from each
   tumor database are factorized by the NMF algorithm. This provides
   matrices
   [MATH:
   <mrow><msub><mi>W</mi><mi>i</mi></msub><mo>∈</mo><msup><mi>ℝ</mi><mrow>
   <msub><mi>n</mi><mi>i</mi></msub><mo>×</mo><msub><mi>r</mi><mi>i</mi></
   msub></mrow></msup></mrow> :MATH]
   and
   [MATH:
   <mrow><msub><mi>H</mi><mi>i</mi></msub><mo>∈</mo><msup><mi>ℝ</mi><mrow>
   <msub><mi>r</mi><mi>i</mi></msub><mo>×</mo><msub><mi>m</mi><mi>i</mi></
   msub></mrow></msup></mrow> :MATH]
   . Particularly, the rank values
   [MATH: <mrow><msub><mi>r</mi><mi>i</mi></msub></mrow> :MATH]
   used in the factorization process of the gene expression matrices used
   in this article are reported in [67]Table 2. From each
   [MATH: <mrow><msub><mi>W</mi><mi>i</mi></msub></mrow> :MATH]
   , the most relevant metagene
   [MATH: <mrow><msub><mi>W</mi><mi>i</mi></msub><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><msub><mover
   accent="true"><mi>k</mi><mo>.</mo></mover><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow> :MATH]
   is automatically selected by the gene.score scoring method. These
   column vectors undergo the Gene Landscape Extraction procedure. It
   should be observed that which
   [MATH: <mrow><msub><mi>W</mi><mi>i</mi></msub><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><msub><mover
   accent="true"><mi>k</mi><mo>.</mo></mover><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow> :MATH]
   (
   [MATH:
   <mrow><mi>i</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>…</mo><mo>,</mo><mi>n
   </mi></mrow> :MATH]
   ) differ in their number of rows (genes) and they are ordered to have
   the most representative information of the specific tumor histotype in
   their first components. Gene Landscape Extraction procedure uses
   intersection set operations to identify labels of genes in each
   [MATH: <mrow><msub><mi>W</mi><mi>i</mi></msub><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><msub><mover
   accent="true"><mi>k</mi><mo>.</mo></mover><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow> :MATH]
   , which are common to each different tumor histotypes.

Figure 3.

   [68]Figure 3.
   [69]Open in a new tab

   Logical view of the gene landscape extraction. Preprocessed expression
   data matrices related to 3 different tumors (
   [MATH:
   <mrow><msub><mi>X</mi><mn>1</mn></msub><mo>,</mo><msub><mi>X</mi><mn>2<
   /mn></msub><mo>,</mo><msub><mi>X</mi><mn>3</mn></msub></mrow> :MATH]
   ) are factorized by the NMF algorithm and corresponding metagenes
   matrices
   [MATH: <mrow><msub><mi>W</mi><mi>i</mi></msub></mrow> :MATH]
   (
   [MATH:
   <mrow><mi>i</mi><mo>=</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo>,</mo><mn>3
   </mn></mrow> :MATH]
   ) are taken into considerations. Gene.score scoring method selects the
   most relevant metagenes
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>W</mi><mi>i</mi></msub><mo
   stretchy="false">(</mo><mo>:</mo><mo>,</mo><msub><mover
   accent="true"><mi>k</mi><mo>.</mo></mover><mi>i</mi></msub><mo
   stretchy="false">)</mo><mo stretchy="false">)</mo></mrow> :MATH]
   , (
   [MATH:
   <mrow><mi>i</mi><mo>=</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo>,</mo><mn>3
   </mn></mrow> :MATH]
   ) i, which undergo the Gene Landscape Extraction procedure. This latter
   uses an intersection set operation to identify labels of genes common
   to each different tumor histotypes. The extracted subset of gene-labels
   is then sent to the functional and pathway analyses.

Table 2.

   Gene ontology functional analysis: based on the parameters used, 10
   categories are identified as enriched categories and all are shown in
   this table.^[70]9
   Gene set Description Size Expect Ratio P value FDR
   GO:0006955 Immune response 1919 9.0964 2.5285 .00001777 0.1615
   GO:1901565 Organonitrogen compound catabolic process 1240 5.8778 2.8922
   .00005655 0.2571
   GO:0042454 Ribonucleoside catabolic process 22 0.1043 28.7680 .00014794
   0.3362
   GO:0009164 Nucleoside catabolic process 34 0.1612 18.6140 .00055180
   0.8992
   GO:0072529 Pyrimidine-containing compound catabolic process 38 0.1801
   16.6550 .00076738 0.8992
   GO:0048525 Negative regulation of viral process 88 0.4171 9.5892
   .00080637 0.8992
   GO:1901658 Glycosyl compound catabolic process 44 0.2086 14.3840
   .00118040 0.8992
   GO:0006216 Cytidine catabolic process 11 0.0521 38.3570 .00118690
   0.8992
   GO:0009972 Cytidine deamination 11 0.0521 38.3570 .00118690 0.8992
   [71]Open in a new tab

   The parameters for the analysis of enrichment are minimum number of IDs
   in the category: 5; maximum number of IDs in the category: 2000; FDR
   method: BH; level of significance: Top 10.

Functional and visual analysis

   The obtained subset of genes is then analyzed by integrative
   bioinformatic tools, that is, the WebGestalt tool,^[72]26 to discover
   the role they cover into the biological process under investigation.

   This module performs either functional and pathway enrichment analysis
   of the genes in
   [MATH: <mrow><mi>C</mi></mrow> :MATH]
   . Some graphical utilities were also added to better visualize the
   results provided. For instance, a variation of the UpSet plot is used
   to highlight most frequent genes and the pathway they belong to. The
   novelty of this representation with respect to the standard UpSet plot
   is the possibility of drawing the height and the width of each bar. To
   quantify the intersection between each pathway, the bars in the plot
   are proportional to the relative frequencies and to the number of genes
   selected in each set of pathway. This graphical representation was used
   in the following.

Application to the Analysis of Gene Expression Profiles of CAFs

   In this section, we sequentially applied the main modules of the
   proposed methodology for automatically extract genes from gene
   expression profiles of CAFs.

   Fibroblasts are the most heterogeneous and abundant population of
   mesenchymal cells in the TME,^[73]27 but their precise functional role
   in tumor is not fully understood yet.^[74]13 During initial phases of
   oncogenesis, fibroblasts are activated giving rise to fibroblasts
   associated with the tumor (CAF), which play a key role in generating a
   specific extracellular matrix (ECM) in TME.^[75]28 The kinetics of
   changes in CAF actions might be different in the various types of
   tumor, partly because of organ-specific transcriptomic profiles of
   resident fibroblasts^[76]29 and partly as different subtypes of CAF
   could exert distinct paracrine actions affecting specific tumor
   oncogenesis.^[77]14,[78]15 Currently, either the number of CAF
   subpopulations present in the tumor stroma and the role assumed by the
   presence of an individual population or different cell types into tumor
   initial development stages are unknown.^[79]6,[80]16 Some
   characteristics distinguish CAFs from quiescent fibroblasts as, for
   instance, metabolic adaptations supporting their need for advanced
   proliferation and biosynthesis activities.^[81]13 Furthermore, a
   potentially controversial area of research on CAFs is focused on their
   origin. In fact, to define and identify origin of fibroblasts, it is
   fundamental to consider that CAFs are “activated fibroblasts” and
   unlike the non-activated (quiescent) fibroblasts residing in the
   tissue, they are an expansion of the cell population proliferating “in
   situ” or are recruited in the tumor.^[82]30 Recruitment of BM
   mesenchymal stem cells, differentiation from adipose stem cells, or
   conversion from endothelial cells through an
   epithelial-endothelium-mesenchymal transition process are potential
   origins of CAFs.^[83]13 However, the best documented source of CAFs is
   the activation of “normal” resident fibroblasts which, with their
   heterogeneity, imply the existence of different subsets. This
   heterogeneity may reflect the variability with respect to phenotypic
   state of both cell and tissue of origin and therefore also the
   reporting mediators and the mechanisms to be activated.^[84]14,[85]31
   In fact, CAFs can be distinguished from other types of cells within the
   tumor by means of exclusion criteria defined by their morphological
   characteristics and by a lack of expression of non-mesenchymal markers,
   such as those expressed by endothelial, epithelial, immune, and
   neuronal cells even if none of these has an absolute
   specificity.^[86]32 A challenging aspect in CAFs studies is the precise
   definition of heterogeneous CAF populations in distinct phases of tumor
   progression through markers informing on their functions.^[87]6,[88]33

   In this study, we used the proposed NMF-based gene extractor
   methodology to investigate CAFs heterogeneity to identify the possible
   presences of genes, which can be biomarkers and characterize activated
   fibroblast phenotype. Gene expression profiles of CAFs belonging to
   primary cultures of 3 distinct solid tumor histotypes from patients
   (ie, colon of endodermal origin, breast carcinoma of ectodermal origin
   and ovary of mesodermal origin) were considered. Moreover, we also
   verified the existence of common genes among CAF gene expression
   profiles of selected solid tumors and those of BM of patients with
   multiple myeloma (MM)^[89]6 and MGUS.^[90]17

Identification of CAF populations and download of transcriptomic profiles

   International Agency for Research on Cancer (IARC) estimates 18.1
   million (17.0 million excluding non-melanoma skin cancer) new cancer
   cases in 2018, causing about 9.6 million of deaths (9.5 million
   excluding non-melanoma skin cancer).^[91]34

   For studying the heterogeneous CAF population, which represents the
   most abundant cellular component of the TME, representative tumor
   histotypes with high rate of mortality have been selected: colon
   carcinoma, breast cancer, ovarian cancer, and MM. The gene expression
   profiles of CAFs were downloaded from the Gene Expression Omnibus
   database (GEO) NCBI (a functional public genomic database that supports
   the submission of MIAME-compliant data) directly from the original
   publications of each series
   ([92]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GEO
   series).^[93]35 Data sets were selected according to the following
   standards: (a) the GEO platform (GPL) and (b) the number of samples
   with labels to identify the fibroblasts associated with the tumor
   (being ⩾7 the total amount of samples for each representative cancer).
   [94]Table 1 reports GSE series, bibliographical references, GPL, GSM