Abstract
The rapid development of high-performance technologies has greatly
promoted studies of molecular oncology producing large amounts of data.
Even if these data are publicly available, they need to be processed
and studied to extract information useful to better understand
mechanisms of pathogenesis of complex diseases, such as tumors. In this
article, we illustrated a procedure for mining biologically meaningful
biomarkers from microarray datasets of different tumor histotypes. The
proposed methodology allows to automatically identify a subset of
potentially informative genes from microarray data matrices, which
differs either in the number of rows (genes) and of columns (patients).
The methodology integrates nonnegative matrix factorization method, a
functional enrichment analysis web tool with a properly designed gene
extraction procedure to allow the analysis of omics input data with
different row size. The proposed methodology has been used to mine
microarray of solid tumors of different embryonic origin to verify the
presence of common genes characterizing the heterogeneity of
cancer-associated fibroblasts. These automatically extracted biomarkers
could be used to suggest appropriate therapies to inactivate the state
of active fibroblasts, thus avoiding their action on tumor progression.
Keywords: NMF, metagene, microarray, cancer, fibroblast,
cancer-associated fibroblast
Introduction
Complex human diseases are caused by the interactions of many genetic,
environmental, and behavioral factors. Development of high-performance
technologies greatly promoted studies on molecular oncology producing
large amounts of omic data. The availability of massive volume of
experimental data based on cancer researches requires the development
of mathematical, statistical, and computational techniques, which
automatically extract valuable information useful for a better
understanding of pathogenesis mechanisms of complex diseases, such as
tumors. Identifying marker genes potentially involved into the
development of tumors may facilitate the understanding of cause of
disease, thus contributing to the advancement of diagnostic tools
and/or to the evaluation of more efficient clinical strategies.
In most of the publicly available data sets, the number of genes is
significantly larger than the number of samples. This high
dimensionality represents a major problem for an automatic gene
array–based cancer analysis. In this scenario, dimensionality reduction
methods became indispensable as they can eliminate irrelevant and
redundant information, thus reducing the dimensionality as well as the
complexity of the original problem, with significant benefits in terms
of computational efficiency, model interpretability, and data
understanding.
Among linear dimensionality reduction methods, low rank matrix
decomposition algorithms have been successfully exploited as
alternative approaches for studying various types of high-dimensional
biological data, including gene expression data.^[29]1 The application
of these algorithms relies on the assumption that large-scale
biological data have an intrinsic low-dimensional representation, with
the dimension often corresponding to the number of latent information
embedded into the original data. These methods transform the space of
original data into a lower dimensional more discriminating
(informative) space that makes the subsequent analysis more efficient.
Among low-rank reduction mechanisms, nonnegative matrix factorizations
(NMFs) emerge as useful approaches for the analysis of microarray data.
The intrinsic non-negativity property of these techniques, in fact,
produces more intuitive results as many biological measurements are
represented by positive values. Nonnegative matrix factorizations
demonstrated their ability in a number of tasks, including the
identification of sets of genes co-operating in a relatively tightly
regulated manner;^[30]2 the discovery of potential relationships in
large biological data samples and link genes to these patterns;^[31]3
the detection of distinct genomic subtypes in cancer patients;^[32]4
the inspection of expression data sets including time evolution of the
gene expression profile in different samples;^[33]5 and the extraction
of gene expression profiles from fibroblasts of cancer blood
diseases.^[34]6
In this article, we present a gene extraction methodology, which
integrates a NMF method^[35]7,[36]8 with the functional enrichment
analysis web tool WebGestalt^[37]9 and a gene extraction procedure
designed ad hoc to automatically mine different microarray cancer data
sets to extract a reduced subset of genes to be further investigated
from a biological point of view. An approach for mining multi-omics
cell line data with the same row size by joint nonnegative matrix
factorization (JNMF) and pathway signature analyses was recently
proposed in Fujita et al.^[38]10 The methodology presented in this
article instead allows to analyze microarray data matrices, which
differ either in the number of rows (genes) and in the number of
columns (patients) to verify the presence of common genes
characterizing the heterogeneity of different cancer datasets. This
proposal enriches the panorama of large-scale data-driven computational
methods based on matrix factorization algorithms, which are able to
extract concise and useful pieces of information from existing
disease-associated data sets.^[39]1,[40]11,[41]12 Particularly, through
the NMF method, our methodology mines the metagenes, which are the most
representative of the information embedded into different tumor
datasets and then, by simple intersection set operations allow to
extract genes in a natural way to obtain interpretable and useful
knowledge readily usable from biologists and analyzed, thanks to
functional and visualization approaches based on the WebGestalt tool.
The proposed methodology has been used to mine microarray of solid
tumors of different embryonic origin to verify the presence of
biomarkers characterizing the heterogeneity of cancer-associated
fibroblasts (CAF) thus being applicable in clinical practice.
Fibroblasts constitute the most heterogeneous and abundant population
of mesenchymal cells in tumor microenvironment (TME). Their presence
goes from tumor formation up to the final stage of metastatic
diffusion, but their precise functional role in tumor is not fully
understood yet.^[42]13 Also, it is not clear how different subtypes of
CAF could exert distinct paracrine actions affecting specific tumor
oncogenesis.^[43]14,[44]15 Therefore, we adopt the proposed procedure
to analyze gene expression profiles of CAFs belonging to primary
cultures of 3 distinct tumor histotypes (ie, colon of endodermal
origin, breast carcinoma of ectodermal origin and ovary of mesodermal
origin) to select common biomarkers and characterize activated
fibroblast phenotype. Moreover, as it is known that bone marrow (BM) is
a CAF recruitment source,^[45]16 we aim at investigating also the
existence of common genes among CAF gene expression profiles of
selected solid tumors and those of BM of patients with multiple myeloma
(MM) and monoclonal gammopathy of undetermined significance (MGUS; MGUS
is a benign pathological condition characterized by the proliferation
of a plasma cell clone that rarely evolves into a malignant
neoplasm.^[46]17 The latter, being a benign pathological condition,
simulates a well-differentiated CAF phenotype compared to those of
patients with MM.). Using the proposed methodology, we highlight the
existence of biomarkers among CAFs of different embryonic origin that
uniquely identify them and could be useful for developing or evaluating
more efficient clinical strategies.
This article is organized as follows. First, the pipeline of the whole
gene extraction methodology is presented, illustrating a general way to
pre-process this kind of data, its analysis core (based on the NMF
method), and the ad hoc gene landscape extraction procedure. The
subsequent section describes the specific biological problem we
investigated, as well as the databases of CAF populations used in the
experimental session. Then, the functional and pathway analysis
performed during the experiments are deeply discussed and the obtained
results are highlighted from a biological point of view. Some
conclusive remarks conclude the article.
Methods
This section discusses in some detail the main architecture, data
pre-processing, and basic NMF approach used in this study.
The ensemble gene extraction methodology
The proposed methodology is based on different operations that
integrate the acquisition of data, their mathematical analysis, and the
biological exploration of the obtained results.
[47]Figure 1 illustrates the work-flow and the main operations
performed when selected tumor data sets are provided.
Figure 1.
[48]Figure 1.
[49]Open in a new tab
Work flow of the NMF-based methodology for extraction of common genes
between different tumor gene expression datasets.
The methodology is mainly based on the integration of 4 parts: the data
preparation module, the core module computing NMF of given data
matrices, the Gene Landscape extractor devoted to the extraction of
common genes between processed results, and the Functional and Visual
Analysis module. The first 3 modules combine R/BioConductor platform
and the NMF library, which are well established in bioinformatics
research,^[50]18-[51]20 whereas the biological analysis of the latter
module is performed using the well-known web tool WebGestalt.
Data preparation and pre-processing
Working on microarray data requires data sets are imported in the
virtual environment as a single different expression set, in which
probeset represents gene interrogating a
[MATH:
X∈ℝ+n
×m :MATH]
particular expressed sequence.^[52]21 Once the data set is loaded, a
pre-processing phase needed to be applied to create a correspondence
between probeset and gene symbol. Data preparation module allows to
handle gene expression set in which probesets map the same gene even
when they refer to different quantities of transcript or to tackle the
associated probes, which were not annotated in any sequence. [53]Figure
2 explains the cases treated by this phase to create gene expression
data feeding the following algorithms.
Figure 2.
[54]Figure 2.
[55]Open in a new tab
Possible cases supported by the pre-processing phase.
The pre-processing operations implemented in this module are based on
the adjusted median absolute deviation (MAD) of the gene expression
values. The use of the MAD (a scale factor is used to consider the MAD
as a consistent estimator for the estimation of data standard
deviation), which computes the variability of the data from the median,
makes the process more robust to outliers. In particular, a tuple
probeset id, gene symbol, and the associated expression value are
uniquely identified assuming that greater is the MAD higher is the
goodness of the hybridization in the experiment. The code and dataset
related to data preparation and pre-processing module have been made
available at GitHub at [56]https://github.com/flaespo/NMF-for-GSE.
Nonnegative matrix factorization for genetic data sets
After data pre-processing is performed, most omic data sets can be
represented as a matrix in which each element contains the measurement
of a single molecule in a single experimental condition. In the case of
expression data, the resulting high-dimensional data set is
re-formulated as a numerical nonnegative matrix with rows being genes
and columns representing samples, for example, tissues of various
patients (as in this study), development stages, or treatments. This
matrix is called “gene expression matrix” and its elements
[MATH:
Xij
:MATH]
indicate the expression level of the gene i in the sample j. A main
task in analyzing this matrix is to extract from it some knowledge
about the underlying biological processes.
Nonnegative matrix factorization can be applied to reduce data
dimensionality as it decomposes a gene expression matrix X by creating
a user-defined number of new column features
[MATH: W(:,k) :MATH]
,
[MATH: (k=1,…,r⊥)
:MATH]
called “metagenes,” which are linear combinations of the original
samples set (eg, column vectors
[MATH: X(:,j) :MATH]
) weighted with nonnegative coefficients
[MATH: X(:,j)≈∑k
=1rW(:,k)Hkj, :MATH]
for
[MATH:
j=1,…,m
:MATH]
and
[MATH: r<_min(m,n) :MATH]
. In this way, original data are explained by a sum of additive parts
and so that intuitively biological entities and mechanisms can be
naturally described with a signal that is either present or absent.
From an algebraic point of view, NMF finds 2 nonnegative matrices, the
metagenes matrix
[MATH:
W∈ℝ+n
×r=[W(:,1),…,W(:,r)] :MATH]
and the metagene expression profiles matrix
[MATH:
H∈ℝ+r
×m :MATH]
, whose elements
[MATH:
Hij
:MATH]
reveal the effect that the ith metagene
[MATH: W(:,i) :MATH]
has on the sample j, such that
[MATH: X∼∼WH :MATH]
. It is worthy to observe that if the value
[MATH:
Hij
:MATH]
is very small, then the corresponding metagene
[MATH: W(:,i) :MATH]
(having rows which are genes in X) is useless in approximating that
particular sample. The decomposition could be dually viewed as
individuating metasamples (rather than metagenes) and groups of genes
(rather than of samples) when the entries of W are taken into
account.^[57]22,[58]23
When NMFs are applied to produce clusters of genes, the rank value r is
usually a priori set after trying different values, computing some
quality measure of the results, and then choosing the best value
according to the adopted quality criteria. In this article, however, we
make use of some automatically suggested value accordingly a procedure
described in Del Buono and colleagues.^[59]8,[60]23 This procedure
makes use of cophenetic coefficient, residuals sum of squares,
dispersion curve, and consensus matrices to optimally address a proper
rank value r for each gene expression matrix X.
Computationally, metagenes and expression profile values can be
obtained solving a non-linear constrained optimization problem over the
cone of nonnegative matrices
[MATH:
minW≥0<
/mn>,H≥0<
mi>Div(X,WH) :MATH]
(1)
where
[MATH: Div(X,WH) :MATH]
is any divergence measure which evaluates how well the low-dimensional
matrix WH approximates X. In this article, the generalized
Kullback-Leibler (KL) divergence
[MATH: Div(X,WH)=∑i
j(Xi
jlog(<
mi>Xij(WH)ij<
/mrow>)−X<
mi>ij+(WH)ij<
/mrow>) :MATH]
is used, which corresponds to the maximum likelihood estimation under
an independent Poisson assumption.^[61]24 The minimization problem (1)
for the KL divergence function is solved applying the following
multiplicative update rules for W and H:
[MATH: Wij
←Wij∑k(HjkXik)/(WH)ik<
/mrow>∑kH
jkHij←Hij∑k(WkiXkj)/(WH)kj<
/mrow>∑kW
ki :MATH]
Because of the non-negativity constraint, solutions to NMF are only
unique up to scaling and rotation, but appropriately scaling and
rotating the columns of W and rows of H will not alter the overall
matrix product WH. For this reason, what is of interest in practice is
not the values of matrix elements, but their relative magnitudes in
each column of W (or row of H). Moreover, taking into account that more
genes can participate in more than one biological process, it could be
of some beneficial to investigate genes that have relatively large
coefficients in each biological process.
To do this, the gene.score scoring method proposed in Kim and
Park^[62]25 has been adopted. This method computes a value gene.score
(i) i value for each gene i in a metagene
[MATH: W(:,k) :MATH]
and selects those genes possessing gene.score value higher than a given
threshold τ. Particularly, the score threshold τ is computed from the
gene score vector itself as
[MATH: τ=μ^+3σ^ :MATH]
, being and
[MATH: σ^ :MATH]
the median and the MAD of gene scores, respectively.
Metagenes in W contain the largest number of genes satisfying this
empirical criterion and they can be considered as the most
representative of the information hidden onto the gene expression data
X. Observe that the metagenes NMF technique extracts are constrained by
the dataset used to train them, so a careful selection of datasets is
essential: users need to choose those being broad enough to cover the
relevant sources of variability.
Gene Landscape Extraction procedure
Referring to [63]Figure 1, let
[MATH:
Xi∈ℝ+ni×mi :MATH]
,
[MATH:
i=1,…,t
:MATH]
be the gene expression matrices related to different biological
experiments and let
[MATH:
Wi∈ℝ+ni×ki :MATH]
be the metagenes matrices obtained solving equation (1) for each
[MATH: Xi :MATH]
with an a priori rank
[MATH: ki :MATH]
. In the proposed methodology, the constrained nonnegative optimization
problem (1) have been solved via the alternating method in NMF
R-package, with multiple executions and random
initializations.^[64]8,[65]20
To compare gene expression matrices derived from different tumor
histotypes (in this work from their associated CAF cultures), the most
representative metagene
[MATH: Wi(:,k.) :MATH]
is extracted from each gene expression matrix
[MATH: Xi :MATH]
. This metagene represents the
[MATH: k.th :MATH]
column in the matrix
[MATH: Wi :MATH]
, which possesses the largest number of genes satisfying the gene.score
criterium. These latter genes compose the gene-subsets automatically
extracted from each tumor histotype.
To identify common genes among those extracted by different tumor
histotypes, an intersection set operation was performed on the
identification labels of genes
[MATH: Wi(:,k.) :MATH]
so that
[MATH:
C=∩i=
1NWi<
mo stretchy="false">(:,k.) :MATH]
is a subset of gene-label common to each different tumor histotypes. It
should be observed that the set of common genes
[MATH: C :MATH]
depends on the gene expression matrices initially selected.
[66]Figure 3 provides a logical view of the Gene Landscape Extraction
procedure, which represents the novel proposal to extract common genes
from microarray matrices which differ in their sizes. Particularly, the
microarray matrices
[MATH:
Xi∈ℝ
ni×mi
msub> :MATH]
obtained by the data preparation and pre-processing module from each
tumor database are factorized by the NMF algorithm. This provides
matrices
[MATH:
Wi∈ℝ
ni×ri
msub> :MATH]
and
[MATH:
Hi∈ℝ
ri×mi
msub> :MATH]
. Particularly, the rank values
[MATH: ri :MATH]
used in the factorization process of the gene expression matrices used
in this article are reported in [67]Table 2. From each
[MATH: Wi :MATH]
, the most relevant metagene
[MATH: Wi(:,k.i) :MATH]
is automatically selected by the gene.score scoring method. These
column vectors undergo the Gene Landscape Extraction procedure. It
should be observed that which
[MATH: Wi(:,k.i) :MATH]
(
[MATH:
i=1,…,n
:MATH]
) differ in their number of rows (genes) and they are ordered to have
the most representative information of the specific tumor histotype in
their first components. Gene Landscape Extraction procedure uses
intersection set operations to identify labels of genes in each
[MATH: Wi(:,k.i) :MATH]
, which are common to each different tumor histotypes.
Figure 3.
[68]Figure 3.
[69]Open in a new tab
Logical view of the gene landscape extraction. Preprocessed expression
data matrices related to 3 different tumors (
[MATH:
X1,X2<
/mn>,X3 :MATH]
) are factorized by the NMF algorithm and corresponding metagenes
matrices
[MATH: Wi :MATH]
(
[MATH:
i=1,2,3
:MATH]
) are taken into considerations. Gene.score scoring method selects the
most relevant metagenes
[MATH: (Wi(:,k.i)) :MATH]
, (
[MATH:
i=1,2,3
:MATH]
) i, which undergo the Gene Landscape Extraction procedure. This latter
uses an intersection set operation to identify labels of genes common
to each different tumor histotypes. The extracted subset of gene-labels
is then sent to the functional and pathway analyses.
Table 2.
Gene ontology functional analysis: based on the parameters used, 10
categories are identified as enriched categories and all are shown in
this table.^[70]9
Gene set Description Size Expect Ratio P value FDR
GO:0006955 Immune response 1919 9.0964 2.5285 .00001777 0.1615
GO:1901565 Organonitrogen compound catabolic process 1240 5.8778 2.8922
.00005655 0.2571
GO:0042454 Ribonucleoside catabolic process 22 0.1043 28.7680 .00014794
0.3362
GO:0009164 Nucleoside catabolic process 34 0.1612 18.6140 .00055180
0.8992
GO:0072529 Pyrimidine-containing compound catabolic process 38 0.1801
16.6550 .00076738 0.8992
GO:0048525 Negative regulation of viral process 88 0.4171 9.5892
.00080637 0.8992
GO:1901658 Glycosyl compound catabolic process 44 0.2086 14.3840
.00118040 0.8992
GO:0006216 Cytidine catabolic process 11 0.0521 38.3570 .00118690
0.8992
GO:0009972 Cytidine deamination 11 0.0521 38.3570 .00118690 0.8992
[71]Open in a new tab
The parameters for the analysis of enrichment are minimum number of IDs
in the category: 5; maximum number of IDs in the category: 2000; FDR
method: BH; level of significance: Top 10.
Functional and visual analysis
The obtained subset of genes is then analyzed by integrative
bioinformatic tools, that is, the WebGestalt tool,^[72]26 to discover
the role they cover into the biological process under investigation.
This module performs either functional and pathway enrichment analysis
of the genes in
[MATH: C :MATH]
. Some graphical utilities were also added to better visualize the
results provided. For instance, a variation of the UpSet plot is used
to highlight most frequent genes and the pathway they belong to. The
novelty of this representation with respect to the standard UpSet plot
is the possibility of drawing the height and the width of each bar. To
quantify the intersection between each pathway, the bars in the plot
are proportional to the relative frequencies and to the number of genes
selected in each set of pathway. This graphical representation was used
in the following.
Application to the Analysis of Gene Expression Profiles of CAFs
In this section, we sequentially applied the main modules of the
proposed methodology for automatically extract genes from gene
expression profiles of CAFs.
Fibroblasts are the most heterogeneous and abundant population of
mesenchymal cells in the TME,^[73]27 but their precise functional role
in tumor is not fully understood yet.^[74]13 During initial phases of
oncogenesis, fibroblasts are activated giving rise to fibroblasts
associated with the tumor (CAF), which play a key role in generating a
specific extracellular matrix (ECM) in TME.^[75]28 The kinetics of
changes in CAF actions might be different in the various types of
tumor, partly because of organ-specific transcriptomic profiles of
resident fibroblasts^[76]29 and partly as different subtypes of CAF
could exert distinct paracrine actions affecting specific tumor
oncogenesis.^[77]14,[78]15 Currently, either the number of CAF
subpopulations present in the tumor stroma and the role assumed by the
presence of an individual population or different cell types into tumor
initial development stages are unknown.^[79]6,[80]16 Some
characteristics distinguish CAFs from quiescent fibroblasts as, for
instance, metabolic adaptations supporting their need for advanced
proliferation and biosynthesis activities.^[81]13 Furthermore, a
potentially controversial area of research on CAFs is focused on their
origin. In fact, to define and identify origin of fibroblasts, it is
fundamental to consider that CAFs are “activated fibroblasts” and
unlike the non-activated (quiescent) fibroblasts residing in the
tissue, they are an expansion of the cell population proliferating “in
situ” or are recruited in the tumor.^[82]30 Recruitment of BM
mesenchymal stem cells, differentiation from adipose stem cells, or
conversion from endothelial cells through an
epithelial-endothelium-mesenchymal transition process are potential
origins of CAFs.^[83]13 However, the best documented source of CAFs is
the activation of “normal” resident fibroblasts which, with their
heterogeneity, imply the existence of different subsets. This
heterogeneity may reflect the variability with respect to phenotypic
state of both cell and tissue of origin and therefore also the
reporting mediators and the mechanisms to be activated.^[84]14,[85]31
In fact, CAFs can be distinguished from other types of cells within the
tumor by means of exclusion criteria defined by their morphological
characteristics and by a lack of expression of non-mesenchymal markers,
such as those expressed by endothelial, epithelial, immune, and
neuronal cells even if none of these has an absolute
specificity.^[86]32 A challenging aspect in CAFs studies is the precise
definition of heterogeneous CAF populations in distinct phases of tumor
progression through markers informing on their functions.^[87]6,[88]33
In this study, we used the proposed NMF-based gene extractor
methodology to investigate CAFs heterogeneity to identify the possible
presences of genes, which can be biomarkers and characterize activated
fibroblast phenotype. Gene expression profiles of CAFs belonging to
primary cultures of 3 distinct solid tumor histotypes from patients
(ie, colon of endodermal origin, breast carcinoma of ectodermal origin
and ovary of mesodermal origin) were considered. Moreover, we also
verified the existence of common genes among CAF gene expression
profiles of selected solid tumors and those of BM of patients with
multiple myeloma (MM)^[89]6 and MGUS.^[90]17
Identification of CAF populations and download of transcriptomic profiles
International Agency for Research on Cancer (IARC) estimates 18.1
million (17.0 million excluding non-melanoma skin cancer) new cancer
cases in 2018, causing about 9.6 million of deaths (9.5 million
excluding non-melanoma skin cancer).^[91]34
For studying the heterogeneous CAF population, which represents the
most abundant cellular component of the TME, representative tumor
histotypes with high rate of mortality have been selected: colon
carcinoma, breast cancer, ovarian cancer, and MM. The gene expression
profiles of CAFs were downloaded from the Gene Expression Omnibus
database (GEO) NCBI (a functional public genomic database that supports
the submission of MIAME-compliant data) directly from the original
publications of each series
([92]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GEO
series).^[93]35 Data sets were selected according to the following
standards: (a) the GEO platform (GPL) and (b) the number of samples
with labels to identify the fibroblasts associated with the tumor
(being ⩾7 the total amount of samples for each representative cancer).
[94]Table 1 reports GSE series, bibliographical references, GPL, GSM