Abstract

   Accurate molecular subtypes prediction of cancer patients is
   significant for personalized cancer diagnosis and treatments. Large
   amount of multi-omics data and the advancement of data-driven methods
   are expected to facilitate molecular subtyping of cancer. Most existing
   machine learning–based methods usually classify samples according to
   single omics data, fail to integrate multi-omics data to learn
   comprehensive representations of the samples, and ignore that
   information transfer and aggregation among samples can better represent
   them and ultimately help in classification. We propose a novel
   framework named multi-omics graph convolutional network (M-GCN) for
   molecular subtyping based on robust graph convolutional networks
   integrating multi-omics data. We first apply the Hilbert–Schmidt
   independence criterion least absolute shrinkage and selection operator
   (HSIC Lasso) to select the molecular subtype-related transcriptomic
   features and then construct a sample–sample similarity graph with low
   noise by using these features. Next, we take the selected gene
   expression, single nucleotide variants (SNV), and copy number variation
   (CNV) data as input and learn the multi-view representations of
   samples. On this basis, a robust variant of graph convolutional network
   (GCN) model is finally developed to obtain samples’ new representations
   by aggregating their subgraphs. Experimental results of breast and
   stomach cancer demonstrate that the classification performance of M-GCN
   is superior to other existing methods. Moreover, the identified
   subtype-specific biomarkers are highly consistent with current clinical
   understanding and promising to assist accurate diagnosis and targeted
   drug development.

   Keywords: molecular subtyping of cancer, multi-omics data, feature
   selection, graph convolutional networks, subtype-specific biomarkers

Introduction

   Cancer is a complex and highly individualized disease with diverse
   subtypes, and molecular heterogeneity exists among different subtypes
   of the same cancer type ([38]González-García et al., 2002;
   [39]Shipitsin et al., 2007). As cancer patients of distinct molecular
   subtypes usually respond differently to same treatment, so accurate
   subtype classification can not only assist precision diagnosis but also
   facilitate effective targeted treatment ([40]Toss and Cristofanilli,
   2015; [41]Lee Y.-M. et al., 2020).

   High-throughput sequencing technologies generate a large amount of
   multi-omics data ([42]Subramanian et al., 2020), which promotes the
   proposal of many computational methods to identify the molecular
   subtypes of cancer. Some methods focus on similarity network fusion
   (SNF) to cluster cancer subtypes. [43]Wang et al. (2014) used SNF to
   combine similarity networks obtained from mRNA expression, DNA
   methylation, and microRNA expression data into one network. [44]Chen et
   al. (2021) proposed a similarity fusion method to fuse the high-order
   proximity of different omics data and preserve cluster information of
   multiple graphs. [45]Xu et al. (2019) developed a method named
   high-order path elucidated similarity (HOPES), which integrated the
   similarity of different data by high-order connected paths. These
   methods apply unsupervised spectral clustering to identify cancer
   subtypes without using the additional information of sample labels.
   With the accumulation of labeled data, some supervised machine learning
   methods are utilized to learn non-linear associations of samples’
   features and subtype labels ([46]Shieh et al., 2004; [47]Wu et al.,
   2017). [48]Guan et al. (2012) applied splitting random forest to
   discover a highly predictive gene set for sample classification.
   [49]Gao et al. (2019) utilized transcriptomic data and leveraged
   feedforward neural networks to build molecular subtyping classifiers.
   [50]Chen et al. (2020) proposed a DeepType framework, which performed
   joint supervised classification, unsupervised clustering, and
   dimensionality reduction to learn cancer-relevant data representation
   with the cluster structure. These methods treat each sample as an
   independent individual and do not take full advantage of the similarity
   and mutual representation ability between samples.

   With the strong representation ability of graph-structured data, graph
   neural networks (GNNs) have achieved great success and are gradually
   used in a node classification task. It provides one way to obtain new
   representations of nodes by combining the connectivity and features of
   its local neighborhood. Recently, some GNN-based methods have been
   proposed to predict molecular subtyping of cancer. Rhee et al.
   developed a GCN-based model to explore the gene–gene association and
   information passing for cancer subtyping ([51]Rhee et al., 2017). Lee
   et al. developed a GCN model with attention mechanisms to learn
   pathway-level representations of cancer samples for their subtype
   classification ([52]Lee et al., 2020a). Although GNN are powerful, they
   are reportedly vulnerable when the skeleton of the graph and nodes’
   feature are mixed with noise ([53]Dai et al., 2018; [54]Jin et al.,
   2020; [55]Zhang and Zitnik, 2020), so a robust GNN model is necessary
   for accurately and stably predicting cancer subtypes.

   It is well known that abnormal behaviors of cancer cells are the result
   of a series of gene mutations, gene copy number variation, and gene
   transcription level changes in key regulatory pathways ([56]Greenman et
   al., 2007; [57]Bradner et al., 2017; [58]Kuijjer et al., 2018;
   [59]Memon et al., 2021). A single type of omics data can only capture
   part of the biological complexity, whereas integrating multiple types
   of omics data can provide a more holistic view to better understand the
   interrelationships of the involved biomolecules and their functions and
   demonstrably improve the prediction accuracy of patients’ clinical
   outcome ([60]Huang et al., 2019; [61]Singh et al., 2019). To learn
   integrative representations of different omics data, Li et al.
   developed a graph autoencoder model by utilizing the prior knowledge
   graph and integrating mRNA expression and CNV data ([62]Li et al.,
   2021). Lin et al. used multi-omics data and applied deep neural
   networks to improve the classification accuracy of breast cancer sample
   ([63]Lin et al., 2020).

   In this study, we propose a novel and general framework M-GCN
   ([64]Figure 1) for molecular subtyping of cancer. It integrates
   RNA-seq, SNV, and CNV data and learns the node representation based on
   a robust GCN model. In order to reduce dimension and eliminate noise of
   transcriptomic data, we first apply HSIC Lasso to select the molecular
   subtype-related transcriptomic features, which are further used for
   constructing sample–sample similarity graph, and utilize statistics
   analysis to find genes with high mutational rates and significant copy
   number changes. The clean data and purified graph structures are
   prerequisite for building a robust GNN model. Then we use different
   non-linear transformations to learn multi-view representations of these
   three types of data. Furthermore, M-GCN strengthens connections between
   the new generated features and the graph and assigns weight to edges by
   layer-wise graph memory based on GNNGUARD ([65]Zhang and Zitnik, 2020).
   GNNGUARD is originally developed to purify the graph structure and
   nodes’ features to eliminate the effect of possible noise edge message
   passing of GCN. Next, a robust GCN model is developed to get samples’
   new representations by aggregating their subgraphs for predicting their
   subtype category. When applying M-GCN to study molecular subtyping of
   breast cancer and stomach cancer, the experiment results show that the
   subtype classification performance of M-GCN outperforms other
   state-of-the-art methods. In addition, we further identify a few
   specific biomarkers for each molecular subtype, which can potentially
   contribute to disease diagnosis.

FIGURE 1.

   [66]FIGURE 1
   [67]Open in a new tab

   Flowchart of M-GCN. (A) Filtered SNV and CNV features and molecular
   subtype-related transcriptomic features selected by HSIC Lasso are used
   as the input. Three type-specific non-linear transformation layers are
   used. The sample–sample similarity graph is constructed by molecular
   subtype-related transcriptomic features. (B) Output of non-linear
   transformations and sample–sample similarity graph are used as the
   input of GCN; convolution process is used for message passing and
   aggregation among samples; output of the final GCN layer is the
   prediction of samples’ subtype category.

Materials and Methods

Data Collection and Preprocessing

   We collect gene expression, SNV and CNV data, and clinical information
   of breast cancer (BRCA) and stomach adenocarcinoma (STAD) patients from
   The Cancer Genome Atlas (TCGA) database ([68]Weinstein et al., 2013).
   As shown in [69]Table 1, there are
   [MATH: <mrow><mn>518</mn></mrow> :MATH]
   and
   [MATH: <mrow><mn>221</mn></mrow> :MATH]
   samples of BRCA and STAD having all three types of omics data.
   Specifically, BRCA includes molecular subtypes of estrogen receptor
   positive (ER+), human epidermal growth factor receptor 2 positive
   (HER2+), and triple-negative breast cancer (TNBC) ([70]Vuong et al.,
   2014), and STAD includes molecular subtypes of chromosomal instability
   (CIN), Epstein–Barr virus (EBV), microsatellite instability (MSI), and
   genomically stable (GS) ([71]Bass et al., 2014), respectively.

TABLE 1.

   Dataset attributes.
   Cancer #Subtype #Samples of each subtype #CNV features #SNV features #
   Gene expression features
   BRCA ER+ 386 74 62 124
   HER2+ 35
   TNBC 97
   STAD CIN 107 169 166 128
   EBV 23
   MSI 46
   GS 45
   [72]Open in a new tab

   Genes whose expression values are lower than 10 and 3 are considered as
   not expressed in BRCA and STAD and then deleted, respectively. As
   RNA-seq data of these two cancer types are obtained from different
   platforms in TCGA, we set different cutoffs according to their data
   distributions. Then, fragments per kilobase of exon per million
   fragments mapped (FPKM) values of gene expression are normalized with
   log2-transformation. In each cancer type, a gene’s mutation frequency
   is defined as the number of samples with this mutation divided by the
   total number of samples. Genes with mutation frequency greater than
   [MATH: <mrow><mn>0.03</mn></mrow> :MATH]
   are selected and their SNV data are used as SNV features in BRCA. For
   STAD, mutation frequency threshold is set as
   [MATH: <mrow><mn>0.1</mn></mrow> :MATH]
   . Similarly, the genes having significant amplifications or deletions
   rates across cancer samples are selected and their CNV data are used as
   CNV features. Finally, the number of SNV features and CNV features are
   [MATH: <mrow><mn>62</mn></mrow> :MATH]
   and
   [MATH: <mrow><mn>74</mn></mrow> :MATH]
   in BRCA and
   [MATH: <mrow><mn>166</mn></mrow> :MATH]
   and
   [MATH: <mrow><mn>169</mn><mtext> </mtext></mrow> :MATH]
   in STAD, respectively. The details of the datasets are listed in
   [73]Table 1.

Feature Selection

   To obtain molecular subtype-related transcriptome features with low
   noise for constructing a purified sample–sample similarity graph and
   effective message passing, we apply a supervised non-linear feature
   selection method HSIC Lasso ([74]Yamada et al., 2014), which captures
   non-linear dependency of molecular subtyping labels and genes’
   expression level.

   Let
   [MATH: <mrow><msup><mi
   mathvariant="bold-italic">X</mi><mi>t</mi></msup><mo>=</mo><msubsup><mr
   ow><mrow><mo>{</mo><mrow><msub><mi
   mathvariant="bold-italic">x</mi><mi>i</mi></msub><mo>,</mo><msub><mi>y<
   /mi><mi>i</mi></msub></mrow><mo>}</mo></mrow></mrow><mrow><mi>i</mi><mo
   >=</mo><mn>1</mn></mrow><mi>n</mi></msubsup></mrow> :MATH]
   denote the supervised data with the
   [MATH: <mi>n</mi> :MATH]
   samples.
   [MATH: <mrow><msub><mi>x</mi><mi>i</mi></msub></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>y</mi><mi>i</mi></msub></mrow> :MATH]
   are the gene expression vector and label of
   [MATH: <mi>i</mi> :MATH]
   -
   [MATH: <mrow><mtext>th</mtext></mrow> :MATH]
   sample, respectively. Its optimization goal is as follows:
   [MATH:
   <mrow><mmultiscripts><mo> </mo><mprescripts></mprescripts><mrow><mi>γ</
   mi><mi>ε</mi><msup><mi mathvariant="bold-italic">ℝ</mi><mi
   mathvariant="bold-italic">d</mi></msup></mrow><mrow><mi
   mathvariant="bold-italic">m</mi><mi mathvariant="bold-italic">i</mi><mi
   mathvariant="bold-italic">n</mi></mrow></mmultiscripts><mfrac><mn>1</mn
   ><mn>2</mn></mfrac><mo>∥</mo><mrow><mover accent="true"><mi
   mathvariant="bold-italic">L</mi><mo>¯</mo></mover></mrow><mo>−</mo><mst
   yle
   displaystyle="true"><munderover><mo>∑</mo><mrow><mi>l</mi><mo> </mo><mo
   >=</mo><mo> </mo><mn>1</mn></mrow><mi>d</mi></munderover><mrow><msub><m
   i>γ</mi><mi mathvariant="bold-italic">l</mi></msub><msup><mrow><mover
   accent="true"><mi
   mathvariant="bold-italic">K</mi><mo>¯</mo></mover></mrow><mrow><mrow><m
   o>(</mo><mi
   mathvariant="bold-italic">l</mi><mo>)</mo></mrow></mrow></msup><msubsup
   ><mo>∥</mo><mrow><mi mathvariant="bold-italic">F</mi><mi
   mathvariant="bold-italic">r</mi><mi mathvariant="bold-italic">o</mi><mi
   mathvariant="bold-italic">b</mi></mrow><mn>2</mn></msubsup></mrow></mst
   yle><mo>+</mo><mi>λ</mi><mo>∥</mo><mi>γ</mi><msub><mo>∥</mo><mn>1</mn><
   /msub><mo>,</mo><mtext>s</mtext><mo>.</mo><mtext>t</mtext><mo>.</mo><mo
   > </mo><mo> </mo><msub><mi>γ</mi><mn>1</mn></msub><mo>,</mo><msub><mi>γ
   </mi><mn>2</mn></msub><mo>,</mo><mo>…</mo><mo>,</mo><msub><mi>γ</mi><mi
   >d</mi></msub><mo>≥</mo><mn>0</mn><mo>,</mo></mrow> :MATH]
   (1)

   where
   [MATH: <mrow><mo>∥</mo><mi>γ</mi><mo>∥</mo></mrow> :MATH]
   is the Frobenius norm,
   [MATH: <mrow><mrow><mover accent="true"><mi
   mathvariant="bold-italic">L</mi><mo>¯</mo></mover></mrow><mo>=</mo><mo>
    </mo><mi mathvariant="bold-italic">Γ</mi><mi
   mathvariant="bold-italic">L</mi><mi
   mathvariant="bold-italic">Γ</mi></mrow> :MATH]
   is centered Gram matrices,
   [MATH: <mrow><mi mathvariant="bold-italic">Γ</mi><mo>=</mo><msub><mi
   mathvariant="bold-italic">I</mi><mi
   mathvariant="bold-italic">n</mi></msub><mo>−</mo><mfrac><mn>1</mn><mi
   mathvariant="bold-italic">n</mi></mfrac><msub><mn>1</mn><mi
   mathvariant="bold-italic">n</mi></msub><msubsup><mn>1</mn><mi
   mathvariant="bold-italic">n</mi><mi
   mathvariant="bold-italic">T</mi></msubsup></mrow> :MATH]
   is the centering matrix,
   [MATH: <mrow><msub><mi mathvariant="bold-italic">I</mi><mi
   mathvariant="bold-italic">n</mi></msub></mrow> :MATH]
   is the
   [MATH: <mi>n</mi> :MATH]
   -dimensional identity matrix, 1 [n ]is the
   [MATH: <mi>n</mi> :MATH]
   -dimensional vector that all the elements are ones,
   [MATH: <mi>d</mi> :MATH]
   is the number of features,
   [MATH: <mrow><msub><mi>γ</mi><mi>l</mi></msub></mrow> :MATH]
   is the regression coefficient of the
   [MATH: <mi>l</mi> :MATH]
   -th feature,
   [MATH: <mrow><mo> </mo><msup><mrow><mover accent="true"><mi
   mathvariant="bold-italic">K</mi><mo>¯</mo></mover></mrow><mrow><mrow><m
   o>(</mo><mi>l</mi><mo>)</mo></mrow></mrow></msup><mo>=</mo><mo> </mo><m
   i mathvariant="bold-italic">Γ</mi><msup><mi
   mathvariant="bold-italic">K</mi><mrow><mrow><mo>(</mo><mi>l</mi><mo>)</
   mo></mrow></mrow></msup><mi mathvariant="bold-italic">Γ</mi></mrow>
   :MATH]
   is the centered Gram matrix,
   [MATH:
   <mrow><msubsup><mi>K</mi><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow><mr
   ow><mrow><mo>(</mo><mi>l</mi><mo>)</mo></mrow></mrow></msubsup><mo>=</m
   o><mi>K</mi><mrow><mo>(</mo><mrow><msubsup><mi>x</mi><mi>i</mi><mrow><m
   row><mo>(</mo><mi>l</mi><mo>)</mo></mrow></mrow></msubsup><mo>,</mo><ms
   ubsup><mi>x</mi><mi>j</mi><mrow><mrow><mo>(</mo><mi>l</mi><mo>)</mo></m
   row></mrow></msubsup></mrow><mo>)</mo></mrow></mrow> :MATH]
   and
   [MATH:
   <mrow><mo> </mo><msub><mi>L</mi><mrow><mi>i</mi><mo>,</mo><mi>j</mi></m
   row></msub><mo>=</mo><mi>L</mi><mrow><mo>(</mo><mrow><msub><mi>y</mi><m
   i>i</mi></msub><mo>,</mo><msub><mi>y</mi><mi>j</mi></msub></mrow><mo>)<
   /mo></mrow></mrow> :MATH]
   are calculated by kernel functions
   [MATH:
   <mrow><mi>K</mi><mrow><mo>(</mo><mrow><mi>x</mi><mo>,</mo><msup><mi>x</
   mi><mo>'</mo></msup></mrow><mo>)</mo></mrow></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>L</mi><mrow><mo>(</mo><mrow><mi>y</mi><mo>,</mo><msup><mi>y</
   mi><mo>'</mo></msup></mrow><mo>)</mo></mrow></mrow> :MATH]
   ,
   [MATH: <mtext>λ</mtext> :MATH]
   is the regularization parameter, and
   [MATH: <mrow><mi
   mathvariant="bold-italic">γ</mi><mo>=</mo><msup><mrow><mrow><mo>[</mo><
   mrow><msub><mi>γ</mi><mn>1</mn></msub><mo>,</mo><mo>…</mo><mo>,</mo><mo
   > </mo><msub><mi>γ</mi><mi>d</mi></msub><mo> </mo><mo> </mo></mrow><mo>
   ]</mo></mrow></mrow><mi>T</mi></msup></mrow> :MATH]
   is a regression coefficient vector.

Sample—Sample Similarity Graph Construction

   Since samples with similar features are more likely to fall into the
   same category, we first construct a sample–sample graph based on
   similarity of each samples pair. From the perspective of the whole
   biological system, compared with SNV and CNV, gene expression is the
   most fundamental level at which the genotype gives rise to the
   phenotype. So we only use transcriptomic data and apply spearman’s
   correlation to calculate the similarity of each two samples. We select
   sample pairs whose correlation coefficient (
   [MATH: <mi>ρ</mi> :MATH]
   ) are greater than the threshold
   [MATH: <mi>r</mi> :MATH]
   with
   [MATH: <mi>p</mi> :MATH]
   value less than
   [MATH: <mrow><mn>0.05</mn></mrow> :MATH]
   and then generate adjacency matrix
   [MATH: <mrow><mi>A</mi><mo>∈</mo><msup><mi
   mathvariant="normal">ℝ</mi><mrow><mi>n</mi><mo>×</mo><mi>n</mi></mrow><
   /msup></mrow> :MATH]
   [MATH:
   <mrow><msub><mi>A</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>=</mo
   ><mrow><mo>{</mo><mrow><mtable><mtr><mtd><mrow><mn>1</mn><mo>,</mo><mo>
    </mo><mo> </mo><msub><mi>ρ</mi><mrow><mi>ij</mi></mrow></msub><mo>≥</m
   o><mi>r</mi><mo>,</mo><mi>p</mi><mo>≤</mo><mi>0.05</mi></mrow></mtd></m
   tr><mtr><mtd><mrow><mi>0</mi><mo>,</mo><mo> </mo><mo> </mo><mi>o</mi><m
   i>t</mi><mi>h</mi><mi>e</mi><mi>r</mi><mi>s</mi></mrow></mtd></mtr></mt
   able></mrow></mrow><mo>,</mo></mrow> :MATH]
   (2)

   where
   [MATH: <mn>1</mn> :MATH]
   and 0 ([75]Eq. 2) represent that there is and there is not an edge
   between sample
   [MATH: <mi>i</mi> :MATH]
   and sample
   [MATH: <mi>j</mi> :MATH]
   , respectively.

   In the generated undirected graph
   [MATH: <mrow><mi
   mathvariant="bold-italic">G</mi><mo>=</mo><mrow><mo>(</mo><mrow><mi
   mathvariant="bold-italic">V</mi><mo>,</mo><mtext> </mtext><mi
   mathvariant="bold-italic">E</mi></mrow><mo>)</mo></mrow></mrow> :MATH]
   ,
   [MATH: <mi mathvariant="bold-italic">V</mi> :MATH]
   and
   [MATH: <mi mathvariant="bold-italic">E</mi> :MATH]
   denote the sample nodes and edges, respectively. There are
   [MATH: <mi>n</mi> :MATH]
   samples in the graph.
   [MATH: <mi mathvariant="bold-italic">X</mi> :MATH]
   = [
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">X</mi><mi>m</mi></msub><mo>,</mo><msub><mi
   mathvariant="bold-italic">X</mi><mi>c</mi></msub><mo>,</mo><msub><mi
   mathvariant="bold-italic">X</mi><mi>e</mi></msub></mrow> :MATH]
   ] are the nodes’ feature matrix, where [] is the concat operation.
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">X</mi><mi>m</mi></msub><mo>∈</mo><msup><mi
   mathvariant="normal">ℝ</mi><mrow><mi>n</mi><mo>×</mo><msub><mi>f</mi><m
   n>1</mn></msub></mrow></msup><mo>,</mo></mrow> :MATH]
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">X</mi><mi>c</mi></msub><mo>∈</mo><msup><mi
   mathvariant="normal">ℝ</mi><mrow><mi>n</mi><mo>×</mo><msub><mi>f</mi><m
   n>2</mn></msub></mrow></msup><mo>,</mo></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">X</mi><mi>e</mi></msub><mo>∈</mo><msup><mi
   mathvariant="normal">ℝ</mi><mrow><mi>n</mi><mo>×</mo><msub><mi>f</mi><m
   n>3</mn></msub></mrow></msup></mrow> :MATH]
   represent SNV feature matrix, CNV feature matrix, and gene expression
   feature matrix, respectively. The number of features in each data type
   is
   [MATH: <mrow><msub><mi>f</mi><mn>1</mn></msub><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mo> </mo><msub><mi>f</mi><mn>2</mn></msub><mo>,</mo></mrow>
   :MATH]
   and
   [MATH: <mrow><msub><mi>f</mi><mn>3</mn></msub></mrow> :MATH]
   . The features of SNV and CNV are selected by data preprocessing. Gene
   expression features are obtained by HSIC Lasso.

GCN Model Integrating Multi-Omics Data for Sample Classification

   Some noises could be introduced as there could be some mismatch between
   the graph and the new concatenate features. According to similarity
   between the new features, we introduce the idea of a robust variant
   into the GCN model to mitigate the impact of noises. Specifically, we
   apply GNNGUARD, which originally is a defense method against
   adversarial attacks. It improves robustness of GCN models by detecting
   fake edges of graph structure and removes or reduces their weights in
   message passing of GCN. GNNGUARD is implemented by neighbor importance
   estimation and layer-wise graph memory. We use GNNGUARD here to
   strengthen connections between the new features and the graph by
   assigning weight to edges. In addition, our framework can be more
   robust using dirty data with noises.

Multi-Omics Data Features Transformation

   To improve samples’ feature representations, we adopt different
   non-linear transformations to separately project gene expression
   features, SNV features, and CNV features into their feature space, and
   then concatenate them together. The projected latent feature matrix
   [MATH: <mrow><msup><mi
   mathvariant="bold-italic">H</mi><mrow><mrow><mo>(</mo><mn>0</mn><mo>)</
   mo></mrow></mrow></msup></mrow> :MATH]
   is as follows:
   [MATH: <mrow><msup><mi mathvariant="bold-italic">H</mi><mi
   mathvariant="normal">0</mi></msup><mo>=</mo><mrow><mo>[</mo><mrow><mo>
   </mo><mtext>σ</mtext><mrow><mo>(</mo><mrow><msub><mi
   mathvariant="bold-italic">X</mi><mi
   mathvariant="italic">m</mi></msub><msub><mi
   mathvariant="bold-italic">W</mi><mi
   mathvariant="italic">m</mi></msub></mrow><mo>)</mo></mrow><mo>,</mo><mt
   ext>σ</mtext><mrow><mo>(</mo><mrow><msub><mi
   mathvariant="bold-italic">X</mi><mi
   mathvariant="italic">c</mi></msub><msub><mi
   mathvariant="bold-italic">W</mi><mi
   mathvariant="italic">c</mi></msub></mrow><mo>)</mo></mrow><mo>,</mo><mt
   ext>σ</mtext><mrow><mo>(</mo><mrow><msub><mi
   mathvariant="bold-italic">X</mi><mi
   mathvariant="italic">e</mi></msub><msub><mi
   mathvariant="bold-italic">W</mi><mi
   mathvariant="italic">e</mi></msub></mrow><mo>)</mo></mrow></mrow><mo>]<
   /mo></mrow><mo>,</mo></mrow> :MATH]
   (3)

   where
   [MATH: <mi>σ</mi> :MATH]
   is the ReLU activation function,
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">W</mi><mi>m</mi></msub><mo>∈</mo><msup><mi>ℝ<
   /mi><mrow><msub><mi>f</mi><mn>1</mn></msub><mo>×</mo><msubsup><mi>f</mi
   ><mn>1</mn><mo>'</mo></msubsup></mrow></msup><mo>,</mo></mrow> :MATH]
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">W</mi><mi>c</mi></msub><mtext> </mtext><mo>∈<
   /mo><msup><mi
   mathvariant="normal">ℝ</mi><mrow><msub><mi>f</mi><mrow><mn>2</mn><mo>×<
   /mo></mrow></msub><msubsup><mi>f</mi><mn>2</mn><mo>'</mo></msubsup></mr
   ow></msup><mo>,</mo></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">W</mi><mi>e</mi></msub><mo>∈</mo><msup><mi
   mathvariant="normal">ℝ</mi><mrow><msub><mi>f</mi><mn>3</mn></msub><mo>×
   </mo><msubsup><mi>f</mi><mn>3</mn><mo>'</mo></msubsup></mrow></msup></m
   row> :MATH]
   represent the learnable non-linear transformation matrix of SNV, CNV,
   and gene expression data, respectively.
   [MATH: <mrow><msup><mi
   mathvariant="bold-italic">H</mi><mn>0</mn></msup></mrow> :MATH]
   [MATH:
   <mrow><mo>∈</mo><msup><mi>ℝ</mi><mrow><mi>n</mi><mo>×</mo><mrow><mo>(</
   mo><mrow><msubsup><mi>f</mi><mn>1</mn><mo>'</mo></msubsup><mo>+</mo><ms
   ubsup><mi>f</mi><mn>2</mn><mo>'</mo></msubsup><mo>+</mo><msubsup><mi>f<
   /mi><mn>3</mn><mo>'</mo></msubsup></mrow><mo>)</mo></mrow></mrow></msup
   ></mrow> :MATH]
   is the final output multi-view representations of samples.

Neighbor Importance Estimation

   To quantify the relevance between node
   [MATH: <mi>i</mi> :MATH]
   and node
   [MATH: <mi>j</mi> :MATH]
   for successful message passing of GCN, GNNGUARD evaluates the
   importance weight of each edge
   [MATH:
   <mrow><msub><mi>e</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow>
   :MATH]
   in each layer based on similarity measure between nodes’
   representations. The similarity
   [MATH:
   <mrow><msubsup><mi>s</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></m
   subsup></mrow> :MATH]
   is defined as follows based on the hypothesis that similar nodes are
   more likely to interact with each other:
   [MATH: <mrow><msubsup><mi mathvariant="bold-italic">s</mi><mrow><mi
   mathvariant="bold-italic">i</mi><mi
   mathvariant="bold-italic">j</mi></mrow><mi
   mathvariant="bold-italic">k</mi></msubsup><mo>=</mo><mrow><mrow><mrow><
   mo>(</mo><mrow><msubsup><mi mathvariant="bold-italic">h</mi><mi
   mathvariant="bold-italic">i</mi><mi
   mathvariant="bold-italic">k</mi></msubsup><mo>⊙</mo><msubsup><mi
   mathvariant="bold-italic">h</mi><mi mathvariant="bold-italic">j</mi><mi
   mathvariant="bold-italic">k</mi></msubsup></mrow><mo>)</mo></mrow></mro
   w><mo>/</mo><mrow><mrow><mo>(</mo><mrow><mo>∥</mo><msubsup><mi
   mathvariant="bold-italic">h</mi><mi mathvariant="bold-italic">i</mi><mi
   mathvariant="bold-italic">k</mi></msubsup><msub><mo>∥</mo><mn>2</mn></m
   sub><mo>∥</mo><msubsup><mi mathvariant="bold-italic">h</mi><mi
   mathvariant="bold-italic">j</mi><mi
   mathvariant="bold-italic">k</mi></msubsup><msub><mo>∥</mo><mn>2</mn></m
   sub></mrow><mo>)</mo></mrow></mrow></mrow><mo>,</mo></mrow> :MATH]
   (4)

   where
   [MATH:
   <mrow><msubsup><mi>s</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></m
   subsup></mrow> :MATH]
   is the cosine similarity between
   [MATH: <mi>i</mi> :MATH]
   and its neighbor
   [MATH: <mrow><mtext> </mtext><mi>j</mi></mrow> :MATH]
   in the
   [MATH: <mi>k</mi> :MATH]
   -th layer of GCN,
   [MATH: <mrow><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>i</mi><mi>k</mi></msubsup><mo>∈</mo
   ><msup><mi
   mathvariant="normal">ℝ</mi><mrow><msub><mi>D</mi><mi>k</mi></msub></mro
   w></msup></mrow> :MATH]
   and
   [MATH: <mrow><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>j</mi><mi>k</mi></msubsup><mo>∈</mo
   ><msup><mi
   mathvariant="normal">ℝ</mi><mrow><msub><mi>D</mi><mi>k</mi></msub></mro
   w></msup></mrow> :MATH]
   denote the representations of node
   [MATH: <mi>i</mi> :MATH]
   and node
   [MATH: <mi>j</mi> :MATH]
   in the
   [MATH: <mi>k</mi> :MATH]
   -th layer of GCN,
   [MATH: <mo>⊙</mo> :MATH]
   is dot product,
   [MATH: <mrow><msub><mi>D</mi><mi>k</mi></msub></mrow> :MATH]
   is the dimension of
   [MATH: <mrow><mtext> </mtext><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>i</mi><mi>k</mi></msubsup></mrow>
   :MATH]
   (or
   [MATH: <mrow><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>j</mi><mi>k</mi></msubsup></mrow>
   :MATH]
   ), and
   [MATH:
   <mrow><mo>∥</mo><mo>⋅</mo><msub><mo>∥</mo><mn>2</mn></msub><mo> </mo></
   mrow> :MATH]
   is the L-2 norm. Node similarity
   [MATH:
   <mrow><msubsup><mi>s</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></m
   subsup></mrow> :MATH]
   is normalized at the node-level within
   [MATH: <mi>i</mi> :MATH]
   ’s neighborhood as follows:
   [MATH:
   <mrow><msubsup><mi>α</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></m
   subsup><mo>=</mo><mrow><mo>{</mo><mrow><mtable><mtr><mtd><mrow><mrow><m
   row><msubsup><mi>s</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></msu
   bsup></mrow><mo>/</mo><mrow><msub><mtext>Σ</mtext><mrow><mi>j</mi><mo>∈
   </mo><msubsup><mi>N</mi><mi>i</mi><mo>∗</mo></msubsup></mrow></msub></m
   row></mrow><mrow><mrow><msubsup><mi>s</mi><mrow><mi>i</mi><mi>j</mi></m
   row><mi>k</mi></msubsup><mo>×</mo><msubsup><mrow><mover
   accent="true"><mi>N</mi><mo>^</mo></mover></mrow><mi>i</mi><mi>k</mi></
   msubsup></mrow><mo>/</mo><mrow><mrow><mo>(</mo><mrow><msubsup><mrow><mo
   ver
   accent="true"><mi>N</mi><mo>^</mo></mover></mrow><mi>i</mi><mi>k</mi></
   msubsup><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow></mrow></mrow><mtex
   t> </mtext><mtext> </mtext><mtext> </mtext><mtext> </mtext><mtext> </mt
   ext><mtext> </mtext><mtext> </mtext><mtext> </mtext><mtext> </mtext><mt
   ext> </mtext><mtext> </mtext><mtext> </mtext><mi>i</mi><mi>f</mi><mo> <
   /mo><mi>i</mi><mo>≠</mo><mi>j</mi></mrow></mtd></mtr><mtr><mtd><mrow><m
   row><mn>1</mn><mo>/</mo><mrow><mrow><mo>(</mo><mrow><msubsup><mrow><mov
   er
   accent="true"><mi>N</mi><mo>^</mo></mover></mrow><mi>i</mi><mi>k</mi></
   msubsup><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow></mrow></mrow><mtex
   t> </mtext><mtext> </mtext><mtext> </mtext><mtext> </mtext><mtext> </mt
   ext><mtext> </mtext><mtext> </mtext><mtext> </mtext><mtext> </mtext><mt
   ext> </mtext><mtext> </mtext><mtext> </mtext><mtext> </mtext><mtext> </
   mtext><mtext> </mtext><mtext> </mtext><mtext> </mtext><mtext> </mtext><
   mtext> </mtext><mtext> </mtext><mtext> </mtext><mtext> </mtext><mi>i</m
   i><mi>f</mi><mo> </mo><mi>i</mi><mo>=</mo><mi>j</mi><mo> </mo></mrow></
   mtd></mtr></mtable></mrow></mrow><mo>,</mo></mrow> :MATH]
   (5)

   where
   [MATH:
   <mrow><msubsup><mi>α</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></m
   subsup></mrow> :MATH]
   is an importance weight between node
   [MATH: <mi>i</mi> :MATH]
   and node
   [MATH: <mi>j</mi> :MATH]
   in the
   [MATH: <mi>k</mi> :MATH]
   -th layer,
   [MATH: <mrow><msubsup><mi>N</mi><mi>i</mi><mo>∗</mo></msubsup></mrow>
   :MATH]
   represents
   [MATH: <mi>i</mi> :MATH]
   ’s neighborhood (excluding node
   [MATH: <mrow><mtext> </mtext><mi>i</mi></mrow> :MATH]
   ), and
   [MATH: <mrow><msubsup><mrow><mover
   accent="true"><mi>N</mi><mo>^</mo></mover></mrow><mi>i</mi><mi>k</mi></
   msubsup><mo>=</mo><msub><mi>Σ</mi><mrow><mi>j</mi><mo>∈</mo><msubsup><m
   i>N</mi><mi>i</mi><mo>∗</mo></msubsup></mrow></msub><mo>∥</mo><msubsup>
   <mi>s</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></msubsup><msub><m
   o>∥</mo><mn>0</mn></msub></mrow> :MATH]
   . The noises can be defensed by using important weights on the basis of
   reducing the weight of dissimilar nodes. Edge pruning probability for
   edge
   [MATH:
   <mrow><msub><mi>e</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow>
   :MATH]
   is calculated by a binary indicator
   [MATH:
   <mrow><msub><mn>1</mn><mrow><msub><mi>P</mi><mn>0</mn></msub></mrow></m
   sub></mrow> :MATH]
   :
   [MATH: <mrow><mtext>σ</mtext><mrow><mo>(</mo><mrow><msubsup><mi
   mathvariant="bold-italic">c</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k<
   /mi></msubsup><msub><mi
   mathvariant="bold-italic">W</mi><mi>n</mi></msub></mrow><mo>)</mo></mro
   w><mo>,</mo></mrow> :MATH]
   as follows:
   [MATH:
   <mrow><mo> </mo><msub><mn>1</mn><mrow><msub><mi>P</mi><mn>0</mn></msub>
   </mrow></msub><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">σ</mi><mrow><mo>(</mo><mrow><msubsup><mi
   mathvariant="bold-italic">c</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k<
   /mi></msubsup><mi
   mathvariant="bold-italic">W</mi></mrow><mo>)</mo></mrow></mrow><mo>)</m
   o></mrow><mo>=</mo><mrow><mo>{</mo><mrow><mtable><mtr><mtd><mrow><mn>0<
   /mn><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo> </
   mo><mo> </mo><mo> </mo><mo> </mo><mi>i</mi><mi>f</mi><mo> </mo><mo> </m
   o><mi>σ</mi><mrow><mo>(</mo><mrow><msubsup><mi
   mathvariant="bold-italic">c</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k<
   /mi></msubsup><msub><mi
   mathvariant="bold-italic">W</mi><mi>n</mi></msub></mrow><mo>)</mo></mro
   w><mo><</mo><msub><mi>P</mi><mn>0</mn></msub></mrow></mtd></mtr><mtr><m
   td><mrow><mn>1</mn><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo> </mo><m
   o> </mo><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo
   > </mo><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo> </mo><mo>
    </mo><mo> </mo><mi>o</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>r</mi><mi>w
   </mi><mi>i</mi><mi>s</mi><mi>e</mi></mrow></mtd></mtr></mtable></mrow><
   /mrow><mo>,</mo></mrow> :MATH]
   (6)

   where
   [MATH: <mrow><msubsup><mi
   mathvariant="bold-italic">c</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k<
   /mi></msubsup><mo>=</mo><mrow><mo>[</mo><mrow><msubsup><mi>α</mi><mrow>
   <mi>i</mi><mi>j</mi></mrow><mi>k</mi></msubsup><mo>,</mo><mo> </mo><mo>
    </mo><msubsup><mi>α</mi><mrow><mi>j</mi><mi>i</mi></mrow><mi>k</mi></m
   subsup></mrow><mo>]</mo></mrow></mrow> :MATH]
   is a characteristic vector in the
   [MATH: <mi>k</mi> :MATH]
   -th layer of GCN which describes edge
   [MATH:
   <mrow><msub><mi>e</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow>
   :MATH]
   ,
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">W</mi><mi>n</mi></msub></mrow> :MATH]
   is the learnable parameter,
   [MATH: <mtext>σ</mtext> :MATH]
   is a non-linear transformation, and
   [MATH: <mrow><msub><mi>P</mi><mn>0</mn></msub></mrow> :MATH]
   is a pre-defined threshold. We update importance weight
   [MATH:
   <mrow><msubsup><mi>α</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></m
   subsup></mrow> :MATH]
   to
   [MATH: <mrow><msubsup><mrow><mover
   accent="true"><mi>α</mi><mo>^</mo></mover></mrow><mrow><mi>i</mi><mi>j<
   /mi></mrow><mi>k</mi></msubsup></mrow> :MATH]
   and prune edges with [76]Eq. 7 in order to ignore perturbed edge.
   [MATH: <mrow><msubsup><mrow><mover
   accent="true"><mi>α</mi><mo>^</mo></mover></mrow><mrow><mi>i</mi><mi>j<
   /mi></mrow><mi>k</mi></msubsup><mo>=</mo><mo> </mo><msubsup><mi>α</mi><
   mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></msubsup><msub><mn>1</mn><mr
   ow><msub><mi>P</mi><mn>0</mn></msub></mrow></msub><mrow><mo>(</mo><mrow
   ><mtext>σ</mtext><mrow><mo>(</mo><mrow><msubsup><mi
   mathvariant="bold-italic">c</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k<
   /mi></msubsup><msub><mi
   mathvariant="bold-italic">W</mi><mi>n</mi></msub></mrow><mo>)</mo></mro
   w></mrow><mo>)</mo></mrow><mo>.</mo></mrow> :MATH]
   (7)

Layer-Wise Graph Memory

   Neighbor importance estimation and edge pruning change the structure of
   graph. Because the weighted graph changes in each layer, for a stable
   training to keep partial memory of the weighted graph structure from
   the
   [MATH: <mrow><mi>k</mi><mo>−</mo><mn>1</mn></mrow> :MATH]
   -th layer for the
   [MATH: <mi>k</mi> :MATH]
   -th layer, GNNGUARD introduces a trick called layer-wise graph memory
   ([77]Figure 2). The layer-wise graph memory is defined as follows:
   [MATH:
   <mrow><msubsup><mtext>φ</mtext><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</
   mi></msubsup><mo>=</mo><mo> </mo><mi>β</mi><msubsup><mtext>φ</mtext><mr
   ow><mi>i</mi><mi>j</mi></mrow><mrow><mi>k</mi><mo>−</mo><mn>1</mn></mro
   w></msubsup><mo>+</mo><mrow><mo>(</mo><mrow><mn>1</mn><mo>−</mo><mi>β</
   mi></mrow><mo>)</mo></mrow><mrow><msubsup><mrow><mover
   accent="true"><mi>α</mi><mo>^</mo></mover></mrow><mrow><mi>i</mi><mi>j<
   /mi></mrow><mi>k</mi></msubsup></mrow><mo>,</mo></mrow> :MATH]
   (8)

   where
   [MATH:
   <mrow><mi>β</mi><mo>∈</mo><mrow><mo>[</mo><mrow><mn>0</mn><mo>,</mo><mo
   > </mo><mn>1</mn></mrow><mo>]</mo></mrow></mrow> :MATH]
   is a learnable parameter and
   [MATH:
   <mrow><msubsup><mtext>φ</mtext><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</
   mi></msubsup></mrow> :MATH]
   denotes weight for edge
   [MATH:
   <mrow><msub><mi>e</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow>
   :MATH]
   in the
   [MATH: <mi>k</mi> :MATH]
   -th layer of GCN.

FIGURE 2.

   [78]FIGURE 2
   [79]Open in a new tab

   Illustration of GNNGUARD. An example of one node (orange node) is
   chosen to demonstrate the process of layer-wise graph memory. (A)
   Message passing in i’s local neighborhood in the k-th layer of GCN. (B)
   Thickness of the gray arrow represents the weight in the message
   passing of GCN. The weight between node
   [MATH: <mrow><mi>i</mi><mo> </mo></mrow> :MATH]
   and node
   [MATH: <mi>j</mi> :MATH]
   is utilized to determine message passing between nodes, such as
   strengthening message or blocking message. To stably train model, the
   k-th layer weight coefficient keeps a partial memory of the
   [MATH: <mrow><mi>k</mi><mo>−</mo><mn>1</mn></mrow> :MATH]
   -th layer.

Node Aggregation with Multi-View Representations Based on GCN

   To learn comprehensive representations of sample nodes and multi-omics
   data, a multilayered graph convolutional network ([80]Kipf and Welling,
   2016) based on the message passing is defined as follows:
   [MATH: <mrow><msup><mi mathvariant="bold-italic">H</mi><mrow><mi
   mathvariant="bold-italic">k</mi><mo>+</mo><mn>1</mn></mrow></msup><mo>=
   </mo><mo> </mo><mtext>σ</mtext><mrow><mo>(</mo><mrow><msup><mrow><mover
   accent="true"><mi
   mathvariant="bold-italic">A</mi><mo>^</mo></mover></mrow><mi
   mathvariant="bold-italic">k</mi></msup><msup><mi
   mathvariant="bold-italic">H</mi><mi
   mathvariant="bold-italic">k</mi></msup><msup><mi
   mathvariant="bold-italic">W</mi><mi
   mathvariant="bold-italic">k</mi></msup></mrow><mo>)</mo></mrow><mo>,</m
   o></mrow> :MATH]
   (9)

   where
   [MATH: <mrow><msup><mrow><mover accent="true"><mi
   mathvariant="bold-italic">A</mi><mo>^</mo></mover></mrow><mi>k</mi></ms
   up><mo>=</mo><msup><mrow><mover accent="true"><mi
   mathvariant="bold-italic">D</mi><mo>˜</mo></mover></mrow><mrow><mi>k</m
   i><mo>−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac></mrow></msup><msup><mro
   w><mover accent="true"><mi
   mathvariant="bold-italic">A</mi><mo>˜</mo></mover></mrow><mi>k</mi></ms
   up><msup><mrow><mover accent="true"><mi
   mathvariant="bold-italic">D</mi><mo>˜</mo></mover></mrow><mrow><mi>k</m
   i><mo>−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac></mrow></msup></mrow>
   :MATH]
   represents the normalized Laplacian of the weighted graph in k-th
   layer,
   [MATH:
   <mrow><msubsup><mi>A</mi><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></m
   subsup><mo>=</mo><mo> </mo><msubsup><mtext>A</mtext><mrow><mi>i</mi><mi
   >j</mi></mrow><mi>k</mi></msubsup><msubsup><mrow><mtext> φ</mtext></mro
   w><mrow><mi>i</mi><mi>j</mi></mrow><mi>k</mi></msubsup><mo> </mo></mrow
   > :MATH]
   is recalculated at each layer to update the adjacency matrix,
   [MATH: <mrow><msup><mrow><mover accent="true"><mi
   mathvariant="bold-italic">A</mi><mo>˜</mo></mover></mrow><mi>k</mi></ms
   up><mo>=</mo><msup><mi
   mathvariant="bold-italic">A</mi><mi>k</mi></msup><mo>+</mo><mi
   mathvariant="bold-italic">I</mi></mrow> :MATH]
   is the adjacency matrix with added self-connections,
   [MATH: <mrow><msubsup><mrow><mover
   accent="true"><mi>D</mi><mo>˜</mo></mover></mrow><mrow><mi>i</mi><mi>i<
   /mi></mrow><mi>k</mi></msubsup><mo>=</mo><mstyle
   displaystyle="true"><msub><mo>∑</mo><mi>j</mi></msub><mrow><msubsup><mr
   ow><mover
   accent="true"><mi>A</mi><mo>˜</mo></mover></mrow><mrow><mi>i</mi><mi>j<
   /mi></mrow><mi>k</mi></msubsup></mrow></mstyle></mrow> :MATH]
   is the degree matrix,
   [MATH: <mi mathvariant="bold-italic">W</mi> :MATH]
   is a layer-specific learnable weight matrix from training,
   [MATH: <mrow><msup><mi
   mathvariant="bold-italic">H</mi><mn>0</mn></msup></mrow> :MATH]
   represents the input of the first GCN layer,
   [MATH: <mrow><msup><mi
   mathvariant="bold-italic">H</mi><mi>k</mi></msup></mrow> :MATH]
   is the input of the
   [MATH: <mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow> :MATH]
   -th layer, and
   [MATH:
   <mrow><msup><mi>H</mi><mrow><mi>k</mi><mo>+</mo><mn>1</mn></mrow></msup
   ></mrow> :MATH]
   is the comprehensive representation by aggregating neighbor features
   [MATH: <mrow><msup><mi
   mathvariant="bold-italic">H</mi><mi>k</mi></msup></mrow> :MATH]
   . The activation function softmax is used in the last graph
   convolutional layer to calculate the probability
   [MATH: <mrow><mtext
   mathvariant="bold-italic">P</mtext><mo>∈</mo><msup><mi
   mathvariant="normal">ℝ</mi><mrow><mi>n</mi><mo>×</mo><mi>V</mi></mrow><
   /msup></mrow> :MATH]
   of which molecular subtyping of each sample belongs to [81]Eq. 10.
   [MATH: <mrow><mo> </mo><mi
   mathvariant="bold-italic">P</mi><mo>=</mo><mtext>softmax</mtext><mrow><
   mo>(</mo><mrow><msup><mi mathvariant="bold-italic">H</mi><mi
   mathvariant="bold-italic">k</mi></msup></mrow><mo>)</mo></mrow><mo>,</m
   o></mrow> :MATH]
   (10)

   where
   [MATH: <mrow><msup><mi
   mathvariant="bold-italic">H</mi><mi>k</mi></msup></mrow> :MATH]
   is the output of final graph convolutional layer
   [MATH: <mi>k</mi> :MATH]
   , and
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">P</mi><mi>i</mi></msub></mrow> :MATH]
   is the prediction probability vector of the sample node
   [MATH: <mi>i</mi> :MATH]
   .

Loss and Optimization

   Cross-entropy is used as the loss function of our model:
   [MATH:
   <mrow><mo> </mo><mi>ℒ</mi><mo>=</mo><mo> </mo><mo>−</mo><mfrac><mi>1</m
   i><mi>n</mi></mfrac><msub><mtext>Σ</mtext><mi>i</mi></msub><msubsup><mt
   ext>Σ</mtext><mrow><mi>v</mi><mo>=</mo><mi>1</mi></mrow><mi>V</mi></msu
   bsup><msub><mi>y</mi><mrow><mi
   mathvariant="bold-italic">i</mi><mi>v</mi></mrow></msub><mo>⁡</mo><mi>l
   og</mi><mrow><mo>(</mo><mrow><msub><mi
   mathvariant="bold-italic">P</mi><mrow><mi>i</mi><mi>v</mi></mrow></msub
   ></mrow><mo>)</mo></mrow><mo>,</mo></mrow> :MATH]
   (11)

   where
   [MATH: <mi>V</mi> :MATH]
   is the number of molecular subtypes,
   [MATH: <mi>n</mi> :MATH]
   is the number of total samples,
   [MATH:
   <mrow><msub><mi>y</mi><mrow><mi>i</mi><mi>v</mi></mrow></msub></mrow>
   :MATH]
   is the ground truth label of
   [MATH: <mrow><msup><mi>i</mi><mo> </mo></msup></mrow> :MATH]
   -th sample, and
   [MATH:
   <mrow><msub><mi>P</mi><mrow><mi>i</mi><mi>v</mi></mrow></msub></mrow>
   :MATH]
   is the probability score that sample
   [MATH: <mi>i</mi> :MATH]
   in molecular subtype
   [MATH: <mi>v</mi> :MATH]
   . Adam is used to minimize the loss function ([82]Kingma and Ba, 2014).

New Sample Prediction

   When predicting which molecular subtype a new sample is, we first add
   it into dataset and the sample–sample similarity graph according to
   [83]Eq. 2. The new data is
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">X</mi><mrow><mi>n</mi><mi>e</mi><mi>w</mi></m
   row></msub><mo> </mo><mo>∈</mo><mo> </mo><msup><mi>ℝ</mi><mrow><mrow><m
   o>(</mo><mrow><mi>n</mi><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow><mo
   >×</mo><mrow><mo>(</mo><mrow><msub><mi>f</mi><mn>1</mn></msub><mo>+</mo
   ><msub><mi>f</mi><mn>2</mn></msub><mo>+</mo><msub><mi>f</mi><mn>3</mn><
   /msub></mrow><mo>)</mo></mrow></mrow></msup></mrow> :MATH]
   , where
   [MATH: <mrow><msub><mi>f</mi><mn>1</mn></msub><mo>,</mo></mrow> :MATH]
   [MATH: <mrow><msub><mi>f</mi><mn>2</mn></msub><mo>,</mo></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>f</mi><mn>3</mn></msub></mrow> :MATH]
   are the number of selected features of SNV data, CNV data, and gene
   expression data, respectively. The new graph is
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">A</mi><mrow><mi>n</mi><mi>e</mi><mi>w</mi></m
   row></msub><mo>∈</mo><msup><mi
   mathvariant="normal">ℝ</mi><mrow><mrow><mo>(</mo><mrow><mi>n</mi><mo>+<
   /mo><mn>1</mn></mrow><mo>)</mo></mrow><mo>×</mo><mrow><mo>(</mo><mrow><
   mi>n</mi><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow></mrow></msup></mr
   ow> :MATH]
   . The projected latent feature matrix
   [MATH: <mrow><msup><mi
   mathvariant="bold-italic">H</mi><mrow><mo>'</mo><mn>0</mn></mrow></msup
   ></mrow> :MATH]
   is obtained from [84]Eq. 3. Therefore, given
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">X</mi><mrow><mi>n</mi><mi>e</mi><mi>w</mi></m
   row></msub></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi
   mathvariant="bold-italic">A</mi><mrow><mi>n</mi><mi>e</mi><mi>w</mi></m
   row></msub></mrow> :MATH]
   , we can predict cancer subtype of the new sample by [85]Eqs 2–[86]11.

Experiment settings

   We implement M-GCN using the deep learning framework of PyTorch and
   train
   [MATH: <mrow><mn>500</mn></mrow> :MATH]
   epochs for M-GCN with a learning rate of
   [MATH: <mrow><mn>0.0001</mn></mrow> :MATH]
   . The dropout rate is set as
   [MATH: <mrow><mn>0.4</mn></mrow> :MATH]
   to avoid overfitting. We set
   [MATH: <mrow><mn>0.82</mn></mrow> :MATH]
   and
   [MATH: <mrow><mn>0.79</mn></mrow> :MATH]
   as spearman correlation coefficient thresholds of BRCA and STAD,
   respectively. The similarity threshold parameter (
   [MATH: <mrow><msub><mi>P</mi><mn>0</mn></msub></mrow> :MATH]
   ) in neighbor importance estimation is set to 0.1 and 0.25 on BRCA and
   STAD, respectively. For BRCA, after transformation, the data dimensions
   of SNV (
   [MATH: <mrow><msubsup><mi>f</mi><mn>1</mn><mo>'</mo></msubsup></mrow>
   :MATH]
   ), CNV (
   [MATH: <mrow><msubsup><mi>f</mi><mn>2</mn><mo>'</mo></msubsup></mrow>
   :MATH]
   ), and gene expression (
   [MATH: <mrow><msubsup><mi>f</mi><mn>3</mn><mo>'</mo></msubsup></mrow>
   :MATH]
   ) are set as
   [MATH: <mrow><mn>25</mn><mo>,</mo></mrow> :MATH]
   [MATH: <mrow><mn>20</mn><mo>,</mo></mrow> :MATH]
   and
   [MATH: <mrow><mn>60</mn><mo>,</mo></mrow> :MATH]
   respectively. M-GCN model has two GCN layers in which the number of
   neurons in first and second hidden layer are
   [MATH: <mrow><mn>32</mn></mrow> :MATH]
   and
   [MATH: <mrow><mn>3</mn><mo>,</mo></mrow> :MATH]
   respectively. As for STAD, we set
   [MATH:
   <mrow><msubsup><mi>f</mi><mn>1</mn><mo>'</mo></msubsup><mo>,</mo></mrow
   > :MATH]
   [MATH:
   <mrow><msubsup><mi>f</mi><mn>2</mn><mo>'</mo></msubsup><mo>,</mo></mrow
   > :MATH]
   [MATH:
   <mrow><msubsup><mi>f</mi><mn>3</mn><mo>'</mo></msubsup><mo>,</mo></mrow
   > :MATH]
   and the number of neurons in first and second hidden layer of GCN are
   [MATH: <mrow><mn>35</mn><mo>,</mo></mrow> :MATH]
   [MATH: <mrow><mn>40</mn><mo>,</mo></mrow> :MATH]
   [MATH: <mrow><mn>65</mn><mo>,</mo></mrow> :MATH]
   [MATH: <mrow><mn>32</mn><mo>,</mo></mrow> :MATH]
   and
   [MATH: <mrow><mn>4</mn><mo>,</mo></mrow> :MATH]
   respectively.

Evaluation Metrics

   We perform
   [MATH: <mrow><mn>10</mn></mrow> :MATH]
   -fold cross-validation to evaluate the performance of M-GCN in
   molecular subtyping tasks of BRCA and STAD. The samples are divided
   into ten groups according to stratified sampling, nine of which are
   used for training data and one for test data in turn. In each training
   process, we select transcriptomic features by using HSIC Lasso on gene
   expression data of training samples (see [87]Eq. 1) and then construct
   a separate sample–sample similarity graph. In order to evaluate the
   performance of the method comprehensively, we take several evaluation
   metrics, that is,
   [MATH:
   <mrow><mi>a</mi><mi>c</mi><mi>c</mi><mi>u</mi><mi>r</mi><mi>a</mi><mi>c
   </mi><mi>y</mi></mrow> :MATH]
   (
   [MATH: <mrow><mi>A</mi><mi>C</mi><mi>C</mi></mrow> :MATH]
   ),
   [MATH:
   <mrow><mi>p</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><mi>n</mi></mrow> :MATH]
   , recall,
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><mi>e</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><msub><mi>n</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>
   r</mi><mi>o</mi></mrow></msub><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><msub><mi>l</mi
   ><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>o</mi></mrow></msub>
   <mo>,</mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><msub><mi>e</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub></mrow> :MATH]
   , which are calculated as follows:
   [MATH: <mrow><mi
   mathvariant="italic">accuracy</mi><mo>=</mo><mo> </mo><mfrac><mrow><mi>
   T</mi><mi>P</mi><mo>+</mo><mi>T</mi><mi
   mathvariant="italic">N</mi></mrow><mrow><mi
   mathvariant="italic">T</mi><mi mathvariant="italic">P</mi><mo>+</mo><mi
   mathvariant="italic">T</mi><mi mathvariant="italic">N</mi><mo>+</mo><mi
   mathvariant="italic">F</mi><mi mathvariant="italic">P</mi><mo>+</mo><mi
   mathvariant="italic">F</mi><mi
   mathvariant="italic">N</mi></mrow></mfrac><mo>,</mo></mrow> :MATH]
   (12)
   [MATH: <mrow><mi
   mathvariant="italic">precision</mi><mo>=</mo><mo> </mo><mfrac><mrow><mi
   mathvariant="italic">T</mi><mi
   mathvariant="italic">P</mi></mrow><mrow><mi
   mathvariant="italic">T</mi><mi mathvariant="italic">P</mi><mo>+</mo><mi
   mathvariant="italic">F</mi><mi
   mathvariant="italic">P</mi></mrow></mfrac><mo>,</mo></mrow> :MATH]
   (13)
   [MATH: <mrow><mi
   mathvariant="italic">recall</mi><mo>=</mo><mo> </mo><mfrac><mrow><mi
   mathvariant="italic">T</mi><mi
   mathvariant="italic">P</mi></mrow><mrow><mi
   mathvariant="italic">T</mi><mi mathvariant="italic">P</mi><mo>+</mo><mi
   mathvariant="italic">F</mi><mi
   mathvariant="italic">N</mi></mrow></mfrac><mo>,</mo></mrow> :MATH]
   (14)
   [MATH: <mrow><mi mathvariant="italic">F</mi><mi
   mathvariant="italic">1</mi><mo>−</mo><mi
   mathvariant="italic">score</mi><mo>=</mo><mo> </mo><mfrac><mrow><mi>2</
   mi><mo>∗</mo><mi mathvariant="italic">precision</mi><mo>∗</mo><mi
   mathvariant="italic">recall</mi></mrow><mrow><mi
   mathvariant="italic">precision</mi><mo>+</mo><mi
   mathvariant="italic">recall</mi></mrow></mfrac><mo> </mo><mo>,</mo></mr
   ow> :MATH]
   (15)
   [MATH: <mrow><mi mathvariant="italic">Precisio</mi><msub><mi
   mathvariant="italic">n</mi><mrow><mi
   mathvariant="italic">macro</mi></mrow></msub><mo>=</mo><mo> </mo><mfrac
   ><mi>1</mi><mi mathvariant="italic">v</mi></mfrac><mstyle
   displaystyle="true"><munderover><mo>∑</mo><mrow><mi
   mathvariant="italic">c</mi><mo>=</mo><mi>1</mi></mrow><mi>v</mi></munde
   rover><mrow><mi mathvariant="italic">precisio</mi><msub><mi
   mathvariant="italic">n</mi><mi
   mathvariant="italic">c</mi></msub></mrow></mstyle><mo>,</mo></mrow>
   :MATH]
   (16)
   [MATH: <mrow><mi mathvariant="italic">Recal</mi><msub><mi
   mathvariant="italic">l</mi><mrow><mi
   mathvariant="italic">macro</mi></mrow></msub><mo>=</mo><mo> </mo><mfrac
   ><mi>1</mi><mi>v</mi></mfrac><mstyle
   displaystyle="true"><munderover><mo>∑</mo><mrow><mi
   mathvariant="italic">c</mi><mo>=</mo><mi>1</mi></mrow><mi
   mathvariant="bold-italic">v</mi></munderover><mrow><mi
   mathvariant="italic">recal</mi><msub><mi mathvariant="italic">l</mi><mi
   mathvariant="italic">c</mi></msub></mrow></mstyle><mo>,</mo></mrow>
   :MATH]
   (17)
   [MATH: <mrow><mi mathvariant="italic">F</mi><mi>1</mi><mo>−</mo><mi
   mathvariant="italic">scor</mi><msub><mi
   mathvariant="italic">e</mi><mrow><mi
   mathvariant="italic">macro</mi></mrow></msub><mo>=</mo><mo> </mo><mfrac
   ><mrow><mi>2</mi><mo>∗</mo><msub><mi
   mathvariant="italic">P</mi><mrow><mi
   mathvariant="italic">macro</mi></mrow></msub><mo>∗</mo><msub><mi
   mathvariant="italic">R</mi><mrow><mi
   mathvariant="italic">macro</mi></mrow></msub></mrow><mrow><msub><mi
   mathvariant="italic">P</mi><mrow><mi
   mathvariant="italic">macro</mi></mrow></msub><mo>+</mo><msub><mi
   mathvariant="italic">R</mi><mrow><mi
   mathvariant="italic">macro</mi></mrow></msub></mrow></mfrac><mo>,</mo><
   /mrow> :MATH]
   (18)

   where true positive (
   [MATH: <mrow><mi>T</mi><mi>P</mi></mrow> :MATH]
   ) is an outcome where the model correctly predicts the positive class,
   and true negative (
   [MATH: <mrow><mi>T</mi><mi>N</mi></mrow> :MATH]
   ) is an outcome where the model correctly predicts the negative class.
   For a multi-class classification task, as long as it is not a positive
   class, we define it as a negative class. False positive (
   [MATH: <mrow><mi>F</mi><mi>P</mi></mrow> :MATH]
   ) is an outcome where the model incorrectly predicts the positive
   class, and false negative (
   [MATH: <mrow><mi>F</mi><mi>N</mi></mrow> :MATH]
   ) is an outcome where the model incorrectly predicts the negative
   class.
   [MATH:
   <mrow><mi>A</mi><mi>c</mi><mi>c</mi><mi>u</mi><mi>r</mi><mi>a</mi><mi>c
   </mi><mi>y</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>p</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><mi>n</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>r</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi><mo>,
   </mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><mi>e</mi></mrow> :MATH]
   are the most commonly used evaluation indexes for classification
   performance based on the above
   [MATH: <mrow><mi>T</mi><mi>P</mi><mo>,</mo></mrow> :MATH]
   [MATH: <mrow><mi>T</mi><mi>N</mi><mo>,</mo></mrow> :MATH]
   [MATH: <mrow><mi>F</mi><mi>P</mi><mo>,</mo></mrow> :MATH]
   and
   [MATH: <mrow><mi>F</mi><mi>N</mi></mrow> :MATH]
   . Considering the evaluation bias caused by an unbalanced sample size
   in multi-classification task,
   [MATH:
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><msub><mi>n</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>
   r</mi><mi>o</mi></mrow></msub><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mtext> </mtext><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi
   ><msub><mi>l</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>o</m
   i></mrow></msub><mo>,</mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><msub><mi>e</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub></mrow> :MATH]
   are finally used for evaluation. They are weighted average of
   [MATH:
   <mrow><mi>p</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><mi>n</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>r</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi><mo>,
   </mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><mi>e</mi></mrow> :MATH]
   on each category, with each category being equally weighted.

Identification of Specific Genes of Each Molecular Subtype and Functional
Enrichment Analysis

   Since selected transcriptomic features have the potential to classify
   samples, we further identify the specific genes of each molecular
   subtype. We first take z-score normalization on the expression matrix
   of selected genes in order to make the genes’ specificity comparable
   between samples. Then, in every subtype, we calculate the mean value of
   each gene and sort the genes in descending order. Finally, we select
   top 10 genes of each subtype as specific markers excluding the genes
   that are present in at least two subtypes.

   In order to understand biological function of each certain gene set, we
   perform biological process (BP) and Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathways enrichment analysis on top 40 subtype-specific
   genes. The R package “clusterProfiler” is used.

Results

Subtype Classification Performance of M-GCN on BRCA and STAD

   To demonstrate the performance of our method, we compare the
   performance of M-GCN with six commonly used or advanced methods on STAD
   and BRCA molecular subtyping including traditional machine
   learning–based methods, neural network–based method, and a GCN-based
   method:
     * K-nearest neighbor classifier (KNN), random forest (RF), support
       vector machine classifier (SVM), and Gaussian naive Bayes (GNB) are
       traditional machine learning methods and we utilize gene
       expression, SNV, and CNV data as features.
     * DeepCC is a neural network-based method which utilizes
       transcriptomic data and leverages feedforward neural networks to
       classify molecular subtypes.
     * Li’s method is a GCN-based molecular subtyping method which
       integrates CNV data and gene expression data.

   For BRCA, our framework M-GCN achieves best performance ([88]Figure 3).
   M-GCN achieves the highest averaged
   [MATH: <mrow><mi>A</mi><mi>C</mi><mi>C</mi></mrow> :MATH]
   of
   [MATH: <mrow><mn>94</mn></mrow> :MATH]
   %, which is
   [MATH: <mrow><mn>1.5</mn></mrow> :MATH]
   % higher than the second best method RF,
   [MATH: <mrow><mo> </mo><mn>2.5</mn></mrow> :MATH]
   % better than SVM and DeepCC,
   [MATH: <mn>3</mn> :MATH]
   % higher than GNB, and
   [MATH: <mrow><mn>4.8</mn></mrow> :MATH]
   % and
   [MATH: <mrow><mn>6.7</mn></mrow> :MATH]
   % better than Li’s method and KNN, respectively. Under the
   [MATH:
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><msub><mi>n</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>
   r</mi><mi>o</mi></mrow></msub></mrow> :MATH]
   index, RF outperforms others and M-GCN ranks second. For
   [MATH:
   <mrow><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><msub><mi>l</mi
   ><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>o</mi></mrow></msub>
   </mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><msub><mi>e</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub></mrow> :MATH]
   indexes, M-GCN has the significantly advantage. Overall, KNN has the
   worst performance.

FIGURE 3.

   [89]FIGURE 3
   [90]Open in a new tab

   Prediction performance under four evaluation metrics of seven methods
   in the BRCA dataset. Pink bar represents the final performance of
   M-GCN; purple bar, orange bar, yellow bar, and green bar refer to the
   performance of KNN, RF, SVM, and GNB, respectively. Light blue bar is
   for the performance of DeepCC, and dark blue bar is for the performance
   of Li’s method.

   Furthermore, we analyze the detailed results of subtype classification.
   As shown in [91]Table 2, M-GCN achieves the best performance in
   diagnosis of ER+ subtype patients, where
   [MATH: <mrow><mn>95.9</mn></mrow> :MATH]
   % samples can be accurately predicted. By comparison, HER2+ patients
   are relatively hard to predict. Through
   [MATH: <mrow><mn>10</mn></mrow> :MATH]
   -fold cross-validation, there are average
   [MATH: <mn>6</mn> :MATH]
   out of
   [MATH: <mrow><mn>30</mn></mrow> :MATH]
   samples are wrongly predicted as TNBC.

TABLE 2.

   Classification results of M-GCN on each subtype of BRCA.
   Ratio predicted as ER+ (%) Ratio predicted as HER2+ (%) Ratio predicted
   as TNBC (%)
   ER+ 95.9 0.51 3.59
   HER2+ 0 80 20
   TNBC 7 3 90
   [92]Open in a new tab

   The meaning of the bold values provided in Tables 2 and 3 is “the
   highest prediction ratio in each subtype”.

   For the subtype classification task with more classes and smaller
   sample size, our method still performs best on STAD than other methods
   in all metrics ([93]Figure 4). The performance of neural network–based
   method DeepCC ranks second, which ignores the sample–sample graph
   structure information. These traditional machine learning–based methods
   have better scores in four metrics by utilizing multi-omics data.
   Compared with the results in BRCA, Li’s method has the largest decline
   of performance in STAD. According to the detailed classification
   results of each subtype by M-GCN under
   [MATH: <mrow><mn>10</mn></mrow> :MATH]
   -fold cross-validation ([94]Table 3), M-GCN has a
   [MATH: <mrow><mn>100</mn></mrow> :MATH]
   % predictive power for EBV stomach cancer, and a
   [MATH: <mrow><mn>90</mn></mrow> :MATH]
   % probability of correctly predicting the MSI type. However, GS is
   relatively hard to predict, especially not easily distinguishable from
   CIN.

FIGURE 4.

   [95]FIGURE 4
   [96]Open in a new tab

   Prediction performance under four evaluation metrics of seven methods
   in the STAD dataset. Pink bar represents the final performance of
   M-GCN; purple bar, orange bar, yellow bar, and green bar refer to the
   performance of KNN, RF, SVM, and GNB, respectively. Light blue bar is
   for the performance of DeepCC, and dark blue bar is for the performance
   of Li’s method.

TABLE 3.

   Classification results of M-GCN on each subtype of STAD.
   Ratio predicted as CIN (%) Ratio predicted as EBV (%) Ratio predicted
   as MSI (%) Ratio predicted as GS (%)
   CIN 93.64 0 2.72 3.64
   EBV 0 100 0 0
   MSI 6 0 90 4
   GS 20 4 6 70
   [97]Open in a new tab

   Our framework M-GCN achieves best performance in BRCA and STAD
   molecular subtypes. The number of samples in BRCA is greater than that
   in STAD, and all of these methods in BRCA have good accuracy. The
   machine learning–based methods such as RF, SVM, and GNB have a
   significant difference between BRCA and STAD tasks. In addition, these
   methods in traditional machine learning–based methods are shallow and
   cannot learn the deep and complex representations of sample nodes. The
   performance of the neural network–based method DeepCC is higher than
   most of these machine learning–based methods, which shows the deep and
   non-linear representation are important. Li’s method may suitable for
   the task with more samples. M-GCN still has the better scores in
   evaluation metrics than the multiple omic-based methods by utilizing
   the cleaned structure information and message passing of sample nodes.

Contribution of Each Element to Molecular Subtype Classification in STAD

   After assessing the performance compared with other methods, we conduct
   three ablation experiments to evaluate the contributions of feature
   selection step, SNV data, and CNV data in STAD, respectively
   ([98]Figure 5). The basic idea of ablation experiment is to learn the
   framework by removing parts of it and studying its performance. In the
   first ablation experiment, without feature selection, we use all the
   gene expression features to construct sample–sample similarity graph
   and take them as the transcriptomic feature for training the GCN-based
   molecular subtyping model. Under this setting, the prediction
   performance decreases by
   [MATH: <mn>7</mn> :MATH]
   %,
   [MATH: <mrow><mn>8.9</mn></mrow> :MATH]
   %,
   [MATH: <mrow><mn>12.1</mn></mrow> :MATH]
   %, and
   [MATH: <mrow><mn>10.6</mn></mrow> :MATH]
   % in terms of
   [MATH: <mrow><mi>A</mi><mi>C</mi><mi>C</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><msub><mi>n</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>
   r</mi><mi>o</mi></mrow></msub><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mtext> </mtext><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi
   ><msub><mi>l</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>o</m
   i></mrow></msub><mo>,</mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><msub><mi>e</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub></mrow> :MATH]
   when compared with M-GCN. In the second ablation experiment, we exclude
   SNV features from the input data.
   [MATH: <mrow><mi>A</mi><mi>C</mi><mi>C</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>o
   </mi><msub><mi>n</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mtext> </mtext><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi
   ><msub><mi>l</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>o</m
   i></mrow></msub><mo>,</mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><msub><mi>e</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub></mrow> :MATH]
   of the new trained molecular subtyping model reduce
   [MATH: <mrow><mn>3.5</mn></mrow> :MATH]
   %,
   [MATH: <mrow><mn>2.5</mn></mrow> :MATH]
   %,
   [MATH: <mrow><mn>2.6</mn></mrow> :MATH]
   %, and
   [MATH: <mrow><mn>2.5</mn></mrow> :MATH]
   %, respectively. In the last ablation experiment, excluding CNV
   features from the input data, the model’s performance has dropped
   [MATH: <mrow><mn>2.2</mn></mrow> :MATH]
   %,
   [MATH: <mrow><mn>1.2</mn></mrow> :MATH]
   %,
   [MATH: <mrow><mn>1.7</mn></mrow> :MATH]
   %, and
   [MATH: <mrow><mn>1.4</mn></mrow> :MATH]
   % in
   [MATH: <mrow><mi>A</mi><mi>C</mi><mi>C</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><msub><mi>n</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>
   r</mi><mi>o</mi></mrow></msub><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mtext> </mtext><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi
   ><msub><mi>l</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>o</m
   i></mrow></msub><mo>,</mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><msub><mi>e</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub></mrow> :MATH]
   metrics.

FIGURE 5.

   [99]FIGURE 5
   [100]Open in a new tab

   Results of ablation experiment of M-GCN in STAD. Pink bar represents
   the final performance of M-GCN, orange bar represents the performance
   of M-GCN without feature selection, green bar is for the performance of
   M-GCN without SNV features, and blue bar is for the performance of
   M-GCN without CNV features.

   Overall, the results of ablation experiments in STAD demonstrate that
   feature selection, SNV data, and CNV data are essential. Especially,
   feature selection makes a more significant contribution. One possible
   reason for this is that selected subtype-related features can help
   learn good representations of sample and reduce noise of the
   sample–sample graph.

Contribution of Each Element to Molecular Subtype Classification in BRCA

   Similarly, to explore contributions of feature selection, SNV, and CNV
   data for molecular subtyping of BRCA, we also perform ablation
   experiments. The results of three ablation experiments are shown in
   [101]Figure 6. Without the feature selection, the prediction
   performance decreases by
   [MATH: <mn>4</mn> :MATH]
   %,
   [MATH: <mrow><mn>24.1</mn></mrow> :MATH]
   %, and
   [MATH: <mrow><mn>25.4</mn></mrow> :MATH]
   % in terms of
   [MATH: <mrow><mi>A</mi><mi>C</mi><mi>C</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><msub><mi>l</mi
   ><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>o</mi></mrow></msub>
   <mo>,</mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><msub><mi>e</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub><mo>,</mo></mrow> :MATH]
   respectively. Under
   [MATH:
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><msub><mi>n</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>
   r</mi><mi>o</mi></mrow></msub></mrow> :MATH]
   index, ablation experiment outperforms M-GCN. Without SNV as the input,
   the prediction ability reduces by
   [MATH: <mrow><mn>0.3</mn></mrow> :MATH]
   %,
   [MATH: <mrow><mn>1.6</mn></mrow> :MATH]
   %,
   [MATH: <mrow><mn>0.9</mn></mrow> :MATH]
   %, and
   [MATH: <mrow><mn>1.2</mn></mrow> :MATH]
   % for
   [MATH: <mrow><mi>A</mi><mi>C</mi><mi>C</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><msub><mi>n</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>
   r</mi><mi>o</mi></mrow></msub><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mtext> </mtext><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi
   ><msub><mi>l</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>o</m
   i></mrow></msub><mo>,</mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><msub><mi>e</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub></mrow> :MATH]
   metrics. Without CNV as the input, the model’s performance has dropped
   [MATH: <mrow><mn>0.2</mn></mrow> :MATH]
   %,
   [MATH: <mrow><mn>1.0</mn></mrow> :MATH]
   %,
   [MATH: <mrow><mn>0.1</mn></mrow> :MATH]
   %, and
   [MATH: <mrow><mn>0.5</mn></mrow> :MATH]
   % in terms of
   [MATH: <mrow><mi>A</mi><mi>C</mi><mi>C</mi><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i
   </mi><mi>o</mi><msub><mi>n</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>
   r</mi><mi>o</mi></mrow></msub><mo>,</mo></mrow> :MATH]
   [MATH:
   <mrow><mo> </mo><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><msub
   ><mi>l</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>o</mi></mr
   ow></msub><mo>,</mo></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>F</mi><mn>1</mn><mo>−</mo><mi>s</mi><mi>c</mi><mi>o</mi><mi>r
   </mi><msub><mi>e</mi><mrow><mi>m</mi><mi>a</mi><mi>c</mi><mi>r</mi><mi>
   o</mi></mrow></msub><mo>,</mo></mrow> :MATH]
   respectively.

FIGURE 6.

   [102]FIGURE 6
   [103]Open in a new tab

   Results of ablation experiment of M-GCN in BRCA. Pink bar represents
   the final performance of M-GCN, orange bar represents the performance
   of M-GCN without feature selection, green bar is for the performance of
   M-GCN without SNV features, and blue bar is for the performance of
   M-GCN without CNV features.

Biomarkers of Each Subtype of BRCA and Their Functions

   On the basis of selected transcriptomic features that could accurately
   classify the breast cancer samples into various molecular subtypes, we
   further obtained the subtype-specific genes.

   We identify ten genes with highest specificity score of each subtype,
   the gene lists are shown in [104]Table 4. These identified biomarkers
   can significantly distinguish samples of different subtypes with the
   normalized gene expression by z-score transformation ([105]Figure 7).
   Among these genes, two thirds of them have been extensively studied.
   For example, Robinson et al. suggested that activating mutations in
   ESR1 were a key mechanism in acquired endocrine resistance in breast
   cancer therapy ([106]Robinson et al., 2013). In addition, the specific
   biomarkers of ER+ subtype, ESR1 ([107]Robinson et al., 2013;
   [108]Spoerke et al., 2016), AGR3 ([109]Garczyk et al., 2015), GATA3
   ([110]Ciocca et al., 2009), PCSK6 ([111]Venables et al., 2008), BCAS1
   ([112]Fenne et al., 2013), PMAIP1 ([113]Putnik et al., 2012), GPR77
   ([114]Zhu et al., 2021), and SCGB2A2 ([115]Guan et al., 2003) have been
   demonstrated to be associated with breast cancer development or
   prognosis. For HER2+ subtype, Prat et al. have found that HER2+
   patients are highly sensitive to ERBB2-targeted therapy ([116]Prat et
   al., 2020). In addition, existing studies have reported that ERBB2
   ([117]Lucci et al., 2010; [118]Alcalá-Corona et al., 2018; [119]Prat et
   al., 2020), STARD3 ([120]Sahlberg et al., 2013; [121]Vassilev et al.,
   2015; [122]Alcalá-Corona et al., 2018), GRB7 ([123]Lucci et al., 2010;
   [124]Natrajan et al., 2010; [125]Sahlberg et al., 2013;
   [126]Alcalá-Corona et al., 2018; [127]Tang et al., 2019), C17orf37
   ([128]Natrajan et al., 2010), PGAP3 ([129]Alcalá-Corona et al., 2018),
   PSMD3 ([130]Sahlberg et al., 2013), and DUSP10 ([131]Lucci et al.,
   2010) played an important role in the development and progression of
   breast cancer. For the TNBC subtype, although understanding of the
   identified subtype-specific genes is less than other two types, the
   roles of DGCR5 ([132]Jiang et al., 2020), RAD51L1 ([133]Stevens et al.,
   2011), and TTLL4 ([134]Arnold et al., 2020) in breast cancer are well
   studied.

TABLE 4.

   Specific biomarkers of each BRCA subtype and their enrichment pathways.
   The listed biomarkers rank in descending order from high to low
   specific score.
Molecular subtypes Biomarker                 Pathway and p-value
ER+                ESR1          Response to estradiol (p-value = 1.09E-02)
                   AGR3
                   GATA3
                   PCSK6
                   FLJ45983
                   BCAS1
                   PMAIP1
                   GPR77
                   SCGB2A2
                   C10orf82
HER2+              ERBB2        ERBB2 signaling pathway (p-value = 7.52E-04)
                   STARD3
                   GRB7
                   C17orf37
                   CRISP3
                   SERHL2
                   PGAP3
                   PSMD3
                   IDH1
                   DUSP10
TNBC               MFI2      Sequestering of actin monomers (p-value = 6.36E-05)
                   TFCP2L1
                   DGCR5
                   C6orf162
                   DCLRE1C
                   FAM90A1
                   RAD51L1
                   TTLL4
                   TM4SF1
                   ESYT3
   [135]Open in a new tab

FIGURE 7.

   [136]FIGURE 7
   [137]Open in a new tab

   Heatmap of the z-score normalized gene expression of the molecular
   subtype-specific biomarker genes in BRCA. Green bar, pink bar, and blue
   bar at the top represent ER+, HER2+, and TNBC subtype, respectively.

   Moreover, specific genes of ER+, HER2+, and TNBC are significantly
   enriched in biological processes of response to estradiol, ERBB2
   signaling pathway, and sequestering of actin monomers, respectively.
   Some of the findings are also highly consistent with current
   understandings. Daniel et al. have found that estrogen were important
   drivers of breast cancer proliferation and PR-B expression increased
   breast cancer cell growth in response to estradiol ([138]Daniel et al.,
   2015). Shah et al. reported that HER2+ subtype of breast cancer is
   associated with gene amplification and/or protein overexpression of
   ERBB2, which leads to aggressive tumor growth and poor clinical outcome
   ([139]Arora et al., 2008; [140]Shah and Osipo, 2016). Other enriched
   pathways of subtype-specific genes of BRCA are listed in
   [141]Supplementary Table S1.

Biomarkers of Each Subtype of STAD and Their Functions

   Compared with BRCA, the current understanding of subtype markers and
   biological mechanisms of STAD is much less and our analysis is expected
   to provide more insight. From the gene expression heatmap across all
   the samples, it can be concluded that biomarkers of STAD perform well
   in distinguishing EBV, GS, and MSI ([142]Figure 8). Through functional
   enrichment analysis, we find genes in CIN are usually enriched in
   regulation of cellular response to insulin stimulus, response to
   radiation, and telomere maintenance. Telomere maintenance in cancer
   cells is often accompanied by activated telomerase to protect
   genetically damaged DNA from normal cell senescence or apoptosis
   ([143]Basu et al., 2013). Moreover, we also identify the specific gene
   CCNE1, which was reported as one of potential targets in the CIN
   subtype ([144]Wang et al., 2019). For the EBV subtype, we infer their
   specific genes mainly involve in cilium organization and Herpes simplex
   virus 1 infection. It is well known that EBV is a gamma-herpes virus,
   and EBV subtyping accounts for nearly 10% of gastric carcinomas
   ([145]Shinozaki-Ushiku et al., 2015). Identified specific genes of MSI
   are related to regulation of microtubule cytoskeleton organization and
   positive regulation of I-kappaB kinase/NF-kappaB signaling. Gullo et
   al. analyzed 55 differentially expressed genes in microsatellite
   unstable cases and found these genes associated with microtubule
   cytoskeleton organization ([146]Gullo et al., 2018). Identified
   specific genes of GS are enriched in the biological process of fatty
   acid oxidation, protein targeting to peroxisome, and AMPK signaling
   pathway. He et al. discovered that mesenchymal stem cells promoted
   stemness and chemoresistance in stomach cancer cells through fatty acid
   oxidation ([147]He et al., 2019). Detailed pathways related with
   molecular subtyping of STAD are listed in [148]Table 5 and
   [149]Supplementary Table S2. As the identified biomarkers by our method
   for breast cancer are greatly consistent with the current clinical
   consensus, we infer that the predicted biomarkers for STAD are also
   promising to provide guidance for researchers on the further studies of
   stomach cancer.

FIGURE 8.

   [150]FIGURE 8
   [151]Open in a new tab

   Heatmap of the z-score normalized gene expression of the molecular
   subtype-specific biomarker genes in STAD. Green bar, pink bar, blue
   bar, and purple bar at the top represent CIN, EBV, GS, and MSI subtype,
   respectively.

TABLE 5.

   Specific biomarkers of each STAD subtype and their enrichment pathways.
   The listed biomarkers rank in descending order from high to low
   specific score.
   Molecular subtypes Biomarker Pathway and p-value
   CIN DDX27, OPN3, F11R, MORC2, TMEM117, CCNE1, SMG5, CIB2, RPL39L,
   ZNF480 Regulation of cellular response to insulin stimulus (p-value =
   2.43E-03), response to radiation (p-value = 1.28E-02), and telomere
   maintenance (p-value = 1.37E-02)
   EBV ATP2C1, BCL2L1, SYT13, ZNF8, EPB41L1, ZNF486, C12orf75, IQCB1,
   HABP2, RCN1 Cilium organization (p-value = 1.88E-02)
   Herpes simplex virus 1 infection (p-value = 8.63E-04)
   MSI TMEM52, DNAJA4, DAZAP2, PAK6, MIB2, KCMF1, RAD51C, PARP3, ATP5A1,
   METRN Regulation of microtubule cytoskeleton organization (p-value =
   6.63E-03); positive regulation of I-kappaB kinase/NF-kappaB signaling
   (p-value = 1.14E-02)
   GS MLYCD, CISD1, PRMT2, CRABP2, ALPL, ECHDC2, C2CD4B, CAMTA2, SH3BP5,
   IRS2 Fatty acid oxidation (p-value = 3.30E-06); protein targeting to
   peroxisome (p-value = 2.10E-03)
   AMPK signaling pathway (p-value = 3.13E-03)
   [152]Open in a new tab

Discussion

   The generation of large amounts of multi-omics data and development of
   deep learning methods offer a more effective mean to study the
   personalized diagnosis and treatment options of complex diseases, such
   as cancer ([153]Ades et al., 2017; [154]Krzyszczyk et al., 2018). In
   this study, we propose a new framework M-GCN for molecular subtyping of
   cancer, which is empowered by integrated multi-omics data and a robust
   graph convolutional network. In two case studies, that are molecular
   subtyping of breast and stomach cancer, M-GCN achieves best
   classification performance under almost all the metrics when compared
   with six advanced methods. As we all known, although GCN is a powerful
   end-to-end model, it usually ignores the noise of data and graph which
   makes GCN unstable. M-GCN first learns subtype-related features to
   denoise data and construct a relatively pure sample–sample similarity
   graph. HSIC Lasso, which is recognized as an effective feature
   selection method, is used in our study. Furthermore, M-GCN assigns
   higher weights to similar nodes and utilizes layer-wise graph memory to
   limit the network to improve the robustness of the model based on
   GNNGUARD. To learn multi-view representations of multi-omics data,
   M-GCN then re-maps denoised three types of data into their feature
   spaces. Furthermore, to fuse multi-view representations of multi-omics
   data, M-GCN utilizes information transfer among samples in the same
   class and over different classes, respectively. In addition these three
   types of data, in the future, other omics data will be added to our
   framework.

   Ablation experiments demonstrate that subtype-dependent feature
   selection contributes most to the improvement of classification
   performance of cancer molecular subtypes. Furthermore, we verify the
   stability of the feature selection process to ensure that obtained
   features are reliable. When shuffling the samples and using 90% of them
   to perform feature selection, we find the intersection of features
   picked out by the 10 rounds of feature selection processes are very
   large for both BRCA and STAD. This is extremely beneficial to train a
   stable GCN-based model.

   On the basis of subtype-related features, we further identify a few
   subtype-specific features which can potentially be used for diagnostic
   biomarkers. In our study, ESR1, ERBB2, and MFI2 are predicted as the
   subtype-specific biomarkers because they have the highest specificity
   scores for ER+, HER2+, and TNBC samples, respectively. It is worth
   noting that ESR1 and ERBB2 are well accepted markers for ER+ and HER2+
   breast subtype, indicating our prediction is highly consistent with
   current understanding. Although MFI2 has not been demonstrated as
   biomarkers of TNBC subtype by wet lab and clinical experiments, its
   encoding protein shares sequence similarity and iron-binding properties
   with members of the transferrin superfamily. Public studies have
   demonstrated these iron-binding properties serve iron uptake and
   promote cell proliferation, and high expression of these proteins are
   associated with the decreased overall survival of patients in many
   cancer types ([155]Torti and Torti, 2013; [156]Sun et al., 2018). As we
   know, TNBC patients show the poorest prognosis with a low survival time
   compared with other types of breast cancer. Moreover, DGCR5, having the
   third highest score under our prediction, reportedly incudes
   tumorigenesis of triple-negative breast cancer by affecting the
   Wnt/β-catenin signaling pathway. Overall, our study can accurately
   identify the subtype-specific biomarkers which are helpful to
   personalized diagnosis. So far, for many cancer types, there are still
   not effect means to predict their molecular subtyping. Our method is
   expected to be an important tool for effectively predicting molecular
   typing with very few genes. Moreover, the proposed framework can be
   used for other tasks, such as prediction of cancer staging and grading.

Conclusion

   Large amount of multi-omics data generated by rapid development of
   high-throughput technologies has enabled data-driven methods to apply
   in molecular subtyping of cancer. We proposed a robust GCN-based
   framework M-GCN for molecular subtyping of cancer by integrating gene
   expression, SNV, and CNV data. In addition to comprehensive information
   of individual samples, M-GCN fully considers message aggregation among
   samples for subtype classification. Compared with other six advanced
   computational methods, M-GCN achieves the best classification
   performance for molecular subtyping of breast and stomach cancer.
   Through ablation experiments, we demonstrate subtype-related
   transcriptomics features obtained by HSIC Lasso method highly
   contribute to sample classification, which is probably because the
   selected features eliminate data noise and facilitate the construction
   of purified graph. On the basis of the graph structure constructed by
   HSIC Lasso, M-GCN further strengthens connections between new features
   and the graph by assigning weights. By assigning higher weights, M-GCN
   aims to successfully pass message in GCN. Furthermore, the identified
   molecular subtype-specific marker of breast cancer is highly consistent
   with clinical cognition, so the predicted biomarkers of stomach cancer
   are promising to be used for molecular typing diagnosis of patients,
   filling in the current gap.

Acknowledgments