Abstract

   Patient classification has widespread biomedical and clinical
   applications, including diagnosis, prognosis, and treatment response
   prediction. A clinically useful prediction algorithm should be
   accurate, generalizable, be able to integrate diverse data types, and
   handle sparse data. A clinical predictor based on genomic data needs to
   be interpretable to drive hypothesis‐driven research into new
   treatments. We describe netDx, a novel supervised patient
   classification framework based on patient similarity networks, which
   meets these criteria. In a cancer survival benchmark dataset
   integrating up to six data types in four cancer types, netDx
   significantly outperforms most other machine‐learning approaches across
   most cancer types. Compared to traditional machine‐learning‐based
   patient classifiers, netDx results are more interpretable, visualizing
   the decision boundary in the context of patient similarity space. When
   patient similarity is defined by pathway‐level gene expression, netDx
   identifies biological pathways important for outcome prediction, as
   demonstrated in breast cancer and asthma. netDx can serve as a patient
   classifier and as a tool for discovery of biological features
   characteristic of disease. We provide a free software implementation of
   netDx with automation workflows.

   Keywords: multimodal data integration, multi‐omics, patient similarity
   networks, precision medicine, supervised machine learning

   Subject Categories: Computational Biology, Genome-Scale & Integrative
   Biology, Systems Medicine

Introduction

   The goal of precision medicine is to build quantitative models that
   guide clinical decision‐making by predicting disease risk and response
   to treatment using data measured for an individual. Within the next
   5 years, several countries will have general‐purpose cohort databases
   with 10,000 to > 1 million patients, with linked genetics, electronic
   health records, metabolite status, and detailed clinical phenotyping;
   examples of projects underway include the UK BioBank (Sudlow et al,
   [34]2015), the US NIH Precision Medicine Initiative
   ([35]https://obamawhitehouse.archives.gov/precision-medicine), and the
   Million Veteran Program ([36]http://www.research.va.gov/MVP/).
   Additionally, human disease‐specific research projects are profiling
   multiple data types across thousands of individuals, including genetic
   and genomic assays, brain imaging, behavioral testing, and clinical
   history from integrated electronic medical records (Hudson et al,
   [37]2010; Calkins et al, [38]2015; Collins & Varmus, [39]2015) (e.g.,
   the Cancer Genome Atlas, [40]http://cancergenome.nih.gov/).
   Computational methods to integrate these diverse patient data for
   analysis and prediction will aid understanding of disease architecture
   and promise to provide actionable clinical guidance.

   Statistical models that predict disease risk or outcome are in routine
   clinical use in fields such as cardiology, metabolic disorders, and
   oncology (Wilson et al, [41]1998; Schmidt et al, [42]2005; Gail et al,
   [43]2007; Lee et al, [44]2014). Traditional clinical risk prediction
   models typically use generalized linear regression or survival
   analysis, in which individual measures are incorporated as terms (or
   features) of a single equation. Standard methods of this type have
   limitations analyzing large data from genomic assays (e.g.,
   whole‐genome sequencing). Machine‐learning methods can handle large
   data, but are often treated as black boxes that require substantial
   effort to interpret how specific features contribute to prediction.
   Black box methods are unlikely to be clinically successful, as
   physicians frequently must understand the characteristic features of a
   disease to make a confident diagnosis (Castelvecchi, [45]2016).
   Interpretability is particularly required in genomics because of
   relatively smaller sample sizes and to better understand the molecular
   causes of disease so that targeted therapies can be designed. Further,
   many existing methods do not natively handle missing data, requiring
   data pruning or imputation, and have difficulty integrating multiple
   different data types.

   The patient similarity network framework can overcome these challenges
   and excels at integrating heterogeneous data and generating
   interpretable models. In this framework, each feature of patient data
   (e.g., gene expression profile, age) is represented as a patient
   similarity network (PSN; Fig [46]1A). Each PSN node is an individual
   patient, and an edge between two patients corresponds to pairwise
   similarity for a given feature. For instance, two patients could be
   similar in age, mutation status, or transcriptome. PSNs can be
   constructed based on any available data, using a similarity measure
   (e.g., Pearson correlation and Jaccard index). Because all data are
   converted to a single type of input (similarity networks), integration
   across diverse data types is straightforward. Patient similarity
   networks (PSN) are a recently introduced concept and have been used
   successfully for unsupervised class discovery in cancer and type 2
   diabetes (Wang et al, [47]2014; Li et al, [48]2015), but have never
   been developed for supervised patient classification.

Figure 1. The netDx method.

   Figure 1
   [49]Open in a new tab
    1. Patient data, provided as input to netDx in the form of tables. The
       simple example for predicting low/high risk of disease uses
       clinical, genomic, metabolomic, and genetic data as input.
    2. Patient similarity networks (or PSN) are networks with patients as
       nodes and weighted edges representing patient similarity by some
       measure. netDx converts patient data into a set of patient
       similarity networks. netDx identifies which networks strongly
       relate high‐risk patients (here, clinical and metabolomic data) and
       which relate low‐risk patients (clinical and gene expression data).
       Feature selection is used to score networks by their ability to
       predict patient class (details in Fig [50]EV1).
    3. netDx output. netDx returns several types of output. Top‐scoring
       features are combined into a single view, or network of overall
       patient similarity, which can be used to classify new patients
       based on relative similarity to known patient classes. netDx also
       provides standard classifier performance metrics and scores for the
       predictive value of individual features.
    4. Network‐based visualization of top predictive pathways. If pathway
       features are used, netDx identifies and visualizes the pathways
       most useful for classification.

   We describe netDx, the first PSN‐based approach for supervised patient
   classification. In this system, patients of unknown status can be
   classified based on their similarity to patients with known status. It
   applies the idea of recommender systems, similar to those used in
   Amazon or Netflix (“find movies like this one”), to precision medicine
   (“find patients who don't respond to therapy”). This process is
   clinically intuitive because it is analogous to clinical diagnosis,
   which often involves a physician relating a patient to a mental
   database of similar patients they have seen. As demonstrated below,
   netDx has strengths in classification performance, heterogeneous data
   integration, and interpretability.

Results

Algorithm description

   The overall netDx workflow is shown in Fig [51]1, with detailed views
   of particular steps in Figs [52]EV1 and [53]EV2. The example in
   Fig [54]1 conceptually shows how patient similarity networks (PSNs) can
   be used to predict whether a patient is at high or low risk of
   developing a disease based on one or more patient‐level data types.
   Similarity networks are computed for each patient pair and for each
   data type. In this example, high‐risk patients are more strongly
   connected based on their clinical profile, which may capture age and
   smoking status, and metabolomic profile. Low‐risk patients are more
   similar in their clinical and genomic profiles. The goal of netDx is to
   identify the input features predictive of high and low risk, and to
   accurately assign new patients to the correct class.

Figure EV1. Detailed workflow of netDx predictor, indicating stochastic
components.

   Figure EV1
   [55]Open in a new tab

   Feature selection is shown on the left and uses only training samples
   (blue boxes). Top‐scoring features are used to score the held‐out test
   set (red boxes). Input networks are created and loaded into a GeneMANIA
   database. Feature selection is achieved by running 10 GeneMANIA
   queries, once per patient label or category. Note that GeneMANIA
   queries are run twice. They are first run on a database containing only
   training subjects and used to compute feature scores. The second time,
   they are run to assign labels to blind test samples (right). In this
   instance, the database contains both training and test samples, and is
   limited to feature‐selected networks. Stochastic components of the
   model are indicated by a black asterisk.

Figure EV2. Conceptual overview of the GeneMANIA algorithm, used by netDx for
network integration.

   Figure EV2
   [56]Open in a new tab

   GeneMANIA is a network‐based recommender system that ranks all nodes by
   similarity to an input query (or “positive” nodes). In netDx, the nodes
   are patients and GeneMANIA uses the set of input patient similarity
   networks (left). The patient ranking is achieved by a two‐step process.
   First, input networks are integrated into a single association network
   via regularized regression that maximizes connectivity between nodes
   with the same label and reduces connectivity to other nodes (middle);
   this step computes network weights corresponding to predictive value
   for each network. Second, label propagation is applied to the
   integrated network starting with the query nodes (red), thereby ranking
   patients from most to least similar to the query (right).

Input data design

   Each patient similarity network (PSN) is a feature, similar to a
   variable in a regression model. A PSN can be generated from any kind of
   patient data, using a pairwise patient similarity measure (Fig [57]1A
   and B). For example, gene expression profile similarity can be measured
   using Pearson correlation, while patient age similarity can be measured
   as normalized age similarity. A reasonable design is to define one
   similarity network per data type, such as a single network based on
   correlating the expression of all genes in the transcriptome, or a
   single network based on similarity of responses to a clinical
   questionnaire. If a data type is multivariate, defining a network for
   each individual variable will result in more interpretable output.
   However, this approach may lead to too many features generated (e.g.,
   one network each for millions of SNPs), which increases computational
   resource requirements and risk of overfitting. Thus, as with any
   machine‐learning task, there is a trade‐off between interpretability
   and overfitting/scalability. To help address this problem for
   gene‐oriented data (e.g., transcriptomics), we group gene‐based
   measurements into biological pathways, which we assume capture relevant
   aspects of cellular and physiological processes underlying disease and
   normal phenotypes. This biological process‐based design generates
   ~2,000 features from gene expression profiles containing over 20,000
   genes, with one feature per pathway, but has the limitation that not
   all genes can be mapped to pathways.

Selecting features informative for class prediction

   In machine learning, the feature selection step identifies the features
   (here, networks) with the highest generalizable predictive power. In
   the case of netDx, predictive features are identified for each class
   using feature selection. netDx is trained on samples from the class of
   interest, using cross‐validation (Figs [58]EV1 and [59]EV2) and an
   established network integration algorithm (GeneMANIA; Mostafavi et al,
   [60]2008). When provided with a collection of PSNs, GeneMANIA takes a
   query of patients as input (“+” nodes) and solves a constrained
   regression problem on the network to maximize edges that connect query
   patients, that is, enriched for (+,+) interactions, relative to other
   edges in the network. The ideal network is one connecting all patients
   of the same class without any connections to other classes; an example
   would be a network connecting all treatment responders, and all
   non‐responders, without edges between the two. The least useful network
   is one that connects patients from one class to patients from other
   classes, without connecting any patients within the same class. The
   query‐driven regression weights each network, with higher weights
   reflecting a greater enrichment of (+,+) edges. netDx assesses the
   robustness of this feature selection by performing repeated GeneMANIA
   queries with different subsamples of the training samples. The netDx
   score for a given feature is the number of times that feature was
   assigned a positive score in a query across resampling rounds. This
   scoring process is repeated for each class. The classifier's
   sensitivity and specificity can be tuned by thresholding this score
   (for instance, features passing selection could be defined as those
   scoring ≥ 80%); a feature with a higher score achieves greater
   specificity and lower sensitivity. Similar to feature selection in
   other machine‐learning algorithms, the output of this step is a set of
   features that can be integrated to produce a predictor for the patient
   class of interest.

Class prediction using selected features

   Selected features are combined by averaging their similarity scores to
   produce a single, integrated network for each class. GeneMANIA runs
   label propagation on each integrated network outward from the query
   nodes (“+” nodes) to rank all other nodes in the network by similarity
   to the query (Fig [61]EV2). For each class, netDx generates a GeneMANIA
   database consisting of networks passing feature selection; for example,
   one database for high‐risk of lung cancer and another for low‐risk of
   lung cancer. It then runs a single GeneMANIA query on this database for
   each class, using the training samples as the query. This is equivalent
   to the query, “given predictive networks for patient class L, rank all
   patients by similarity to known examples of L”. netDx then assigns test
   patients to the class for which it has the highest similarity
   (Mostafavi et al, [62]2008; Figs [63]EV1 and [64]EV2).

netDx output

   netDx returns the set of selected features for each class, predicted
   labels for all test patients, and standard performance measures
   including the area under the receiver operating characteristic curve
   (AUROC), area under the precision‐recall curve (AUPR), and accuracy
   (Fig [65]1C). Class‐specific scores for each feature are returned; if
   pathway features are used, they are visualized using an enrichment map
   (Fig [66]1D; Merico et al, [67]2011a). A final integrated patient
   similarity network is generated by integrating all selected features.
   Inter‐ and intra‐class separation is measured using average shortest
   path methods over the integrated PSN (Fig [68]1C), and the network is
   visualized to aid interpretation of the strength of class separation.

Benchmarking performance by predicting binarized survival in cancer

   To assess the classification performance of netDx, we use an
   established cancer survival prediction benchmark available for four
   tumor types, using data from The Cancer Genome Atlas (TCGA;
   [69]http://cancergenome.nih.gov/) via the TCGA PanCancer Survival
   Prediction project website of Yuan et al ([70]2014a),
   [71]https://www.synapse.org/#!Synapse:syn1710282). These tumor types
   have been thoroughly analyzed using eight machine‐learning methods,
   which provide extensive performance results that we can compare to
   (Yuan et al, [72]2014a). Data are for renal clear cell carcinoma (The
   Cancer Genome Atlas, [73]2013; KIRC, N = 150 patients; Data ref: Yuan
   et al, [74]2014b), ovarian serous cystadenocarcinoma (The Cancer Genome
   Atlas, [75]2011; OV, N = 252, Data ref: Yuan et al, [76]2014c),
   glioblastoma multiforme (The Cancer Genome Atlas, [77]2008; GBM,
   N = 155, Data ref: Yuan et al, [78]2014d), and lung squamous cell
   carcinoma (The Cancer Genome Atlas, [79]2012a; LUSC, N = 77, Data ref:
   Yuan et al, [80]2014e). Data for a given tumor type include clinical
   variables (e.g., age and tumor grade); mRNA, miRNA, and protein
   expression; DNA methylation; and somatic copy number aberrations.
   Binarization of survival and format of clinical variables followed
   previous work (Yuan et al, [81]2014a; Data ref: Yuan et al, [82]2014f);
   comparison of data processing and predictor methods to that of the
   PanCancer Survival project is shown in Fig [83]EV3 and [84]Appendix
   Table S1.

Figure EV3. Comparison of workflow for netDx and PanCancer Survival for
benchmarking.

   Figure EV3
   [85]Open in a new tab

   The netDx workflow is shown on the left and PanCancer Survival on the
   right. To compare netDx to other methods used by the PanCancer Survival
   project, we downloaded processed data from Synapse (see [86]Data
   Availability section). To keep all steps except the predictor identical
   to the PanCancer Survival project, clinical variables were coded
   identically to the PanCancer Survival project ([87]Appendix Table S1).
   Processed variables were then provided to the predictor. Univariate
   prefiltering was performed within the cross‐validation framework of the
   respective predictor ([88]Appendix Table S1); the variability across
   this process is shown in [89]Appendix Fig S1.

   For each tumor type, we classified samples into high and low survival
   categories using a range of models, each with different combinations of
   input data, following (Yuan et al, [90]2014a) for comparability
   (Fig [91]2A). Each model was independently trained (80:20 train:test)
   and optimized. For a given train/test split, feature selection was
   performed over 10 resamplings of the training samples for each class,
   and features robustly identified in at least 9 out of 10 resamplings
   were used to classify test samples. This process was repeated for
   multiple random splits of train and test until performance measures
   stabilized (20 splits; Fig [92]2A). Performance for each model was
   measured as the average of test classification (mean AUROC) across the
   20 splits. Optimization of each model included varying choice of
   similarity metric, whether features were defined at the level of entire
   data types or of individual variables (genes), and by whether or not
   imputation was used ([93]Appendix Table S1–S3). Where features were
   defined at the level of individual genes, variables were prefiltered
   using lasso regression to reduce the number of features considered.
   Prefiltering supplements the robust feature selection step described
   above and is performed within the resampling loop, by creating features
   only using selected variables. [94]Appendix Fig S1 shows an example of
   variability produced in prefiltered features. For similarity metrics,
   we tested normalized similarity, Pearson correlation with and without
   exponential scaling, radial basis function, and
   Euclidean‐distance‐based similarity with exponential scaling. In total,
   40 different models were trained, 9–11 per cancer type ([95]Dataset
   EV1).

Figure 2. Performance benchmarking with PanCancer Survival data.

   Figure 2
   [96]Open in a new tab
    1. Method. Various combinations of patient data types were provided as
       input to netDx, to predict binary survival (Good/Poor). Performance
       was compared for renal clear cell carcinoma (KIRC, N = 150
       patients), ovarian cystadenocarcinoma (OV, N = 252), lung squamous
       carcinoma (LUSC, N = 77), and glioblastoma multiforme (GBM,
       N = 155).
    2. Average performance of netDx compared to other machine‐learning
       methods. Each panel shows data for one tumor type, and each boxplot
       shows mean AUROC (20 train/test splits) for a given
       machine‐learning method across different tested combinations of
       input data ([97]Appendix Table S1–S3). Boxplot center indicates
       median; box bounds indicate 25^th and 75^th percentile, and
       whiskers mark 1.5 times the interquartile range. Dots indicate
       points that fall outside this range. netDx is shown in pink.
       Statistical significance is computed using a one‐sided
       Wilcoxon‐Mann‐Whitney comparing netDx to all other methods
       combined; this was done to establish strictly higher performance of
       netDx relative to other methods. Bottom: Yellow box: As a reference
       point, Kaplan–Meier curves and hazard ratios are shown for
       predicted samples from a representative KIRC split (split with
       AUROC closest to the mean AUROC across 20 splits). The light shaded
       pink and blue areas in the Kaplan–Meier curve indicate 95%
       confidence intervals based on log‐hazard; P‐value from log‐rank
       test. Error bars for the hazard ratio indicate 95% confidence
       intervals; P‐value from Cox proportional hazards model.

   We use the AUROC to measure performance as this was the metric reported
   by the PanCancer Survival project (Yuan et al, [98]2014a;
   [99]Appendix Table S3). Information on the exact samples used for the
   10 train/test splits used by Yuan et al is not available, thus we used
   new random splits, but used more splits (n = 20) until we reached a
   stable AUROC estimate. For all tumor types, netDx demonstrates
   performance consistently better than, or at par with, other
   machine‐learning methods (Fig [100]2B; [101]Dataset EV1). Average netDx
   performance is significantly higher than that for all other methods for
   three of the tumors (one‐tailed WMW to test strictly higher netDx
   performance; KIRC: P = 0.03; OV: P = 0.02; GBM: P = 0.02) and is not
   significant for the fourth (LUSC: P = 0.07). Furthermore, the
   top‐performing netDx model outperforms all eight tested
   machine‐learning algorithms for kidney, ovarian, and brain cancer
   (Fig [102]2B, [103]Appendix Table S1), and outperforms all but one
   outlier model for lung cancer (netDx best = 0.72, Yuan et al second
   best = 0.71). Pairwise comparison of netDx broken down by
   machine‐learning method shows that netDx has a significantly better
   median AUROC score in most (75%) cases ([104]Appendix Table S2).
   Performance statistics reported by Yuan et al were the best performing
   models out of 320 tested for different data combinations with eight
   different machine‐learning methods: diagonal discriminant analysis;
   K‐nearest neighbor; discriminant analysis; logistic regression; nearest
   centroid; partial least squares; random forest; and support vector
   machine. Thus, netDx can perform as well as, or better than, a diverse
   panel of machine‐learning methods.

Pathway‐level feature selection identifies cellular processes predictive of
clinical condition

   Creating a single feature per data type identifies the general
   predictive value of that data layer but, without further work, does not
   provide insight into which genes or cellular processes are useful for
   classification. To better understand disease mechanisms, netDx supports
   the ability to group gene‐level measures into pathways (gene sets) so
   that the predictive value of pathway features can be measured. To
   illustrate this ability, we classified breast tumors as being of the
   Luminal A subtype or not, using tumor‐derived gene expression (N = 348
   patients; 154 Luminal A, 194 non‐Luminal A; The Cancer Genome Atlas,
   [105]2012b; Data ref: The Cancer Genome Atlas, [106]2012c,[107]d). Gene
   expression was used as input, and features were defined at the level of
   curated pathways of cellular processes (i.e., 2,119 patient similarity
   networks, one per pathway; Data ref: Merico et al, [108]2011b). We
   split the data into train and test groups in an 80:20 ratio and ran
   netDx as described in the previous section to classify Luminal A vs.
   “other”. This process was repeated with 100 train/test splits to
   achieve stable pathway‐level scores ([109]Appendix Fig S2).
   Consistently predictive pathways, representing a high‐confidence set of
   features, are visualized as an EnrichmentMap (Fig [110]3), with each
   pathway associated with a consistency score (highest netDx feature
   selection score in ≥ 70% of the 100 splits).

Figure 3. Pathway‐level feature selection in breast cancer and asthma.

   Figure 3
   [111]Open in a new tab
    1. netDx performance for binary classification of breast tumor as
       Luminal A subtype from tumor‐derived gene expression (N = 384
       patients).
    2. Pathways feature‐selected by netDx in predicting Luminal A status.
       Nodes are pathways, and edges indicate shared genes. Nodes are
       colored by highest netDx score consistently achieved out of a
       maximum possible of 10, in ≥ 70% of 100 train/test splits. Themes
       identified by AutoAnnotate (Merico et al, [112]2011a; Kucera et al,
       [113]2016).
    3. Integrated patient similarity network. Nodes represent patients,
       and edges represent average similarity computed from pathways that
       scored 10 out of 10 in all splits. Nodes are colored by tumor type.
       Edges with weight < 0.7 were excluded, and the top 20% of edges per
       node were retained. The resulting network was visualized in
       Cytoscape (spring‐embedded layout, spring strength = 5).
    4. Correlation of top‐scoring pathway features (represented as the
       first three principal components of pathway‐specific gene
       expression) with tumor type (Spearman's correlation). Table cells
       are colored by sign and magnitude of correlation (blue: Spearman
       corr. > 0; red, corr. < 0). Circled letters correspond to detailed
       panels on the right. Right: Projections of patient‐level gene
       expression in feature‐selected pathways onto the first two
       principal components (individual dots indicate patients). Points
       are colored by survival class. Decision boundaries were calculated
       using logistic regression on scatterplot data.
    5. Selected features for asthma case status in the case of asthma
       case/control prediction (N = 97 cases; N = 97 controls). Legend as
       in (B).

   Classification performance was high, with an average (SD) AUROC of
   0.97 ± 0.01, average AUPR of 0.92 ± 0.02, and average accuracy of
   89 ± 3% (100 train/test splits; Fig [114]3A). Performance for
   pathway‐level features is slightly, but significantly, better than when
   gene expression is provided as a single feature (mean AUROC for single
   network = 0.96 ± 0.02; one‐sided Wilcoxon–Mann–Whitney test P = 0.013).
   Top‐scoring pathways included themes of cell cycle progression and
   checkpoint regulation, DNA synthesis, DNA mismatch repair, and DNA
   double‐strand break repair (Fig [115]3B, [116]Dataset EV2). These
   processes are consistent with the pathways known to be dysregulated in
   luminal breast tumors and cancer progression in general. netDx also
   identified pathways related to solute carrier family membrane transport
   proteins and vesicle release, which are not traditionally linked to
   breast cancer, but which may support new insights (see
   [117]Discussion). We integrated the features from top‐scoring pathways
   (those scoring 10 out of 10 in all 100 train/test splits) into a single
   network (Fig [118]3C). In this network, LumA patients are significantly
   closer to other LumA patients (average shortest distance = 0.52),
   compared to patients of other breast cancer subtypes (average shortest
   distance = 0.59; one‐tailed Wilcoxon–Mann–Whitney test P < 2e‐16).
   Therefore, top‐scoring features succeed in separating same‐class
   patients, relative to patients of other classes.

   A common problem with genomic data is relating cohort‐level analysis
   results, such as the set of affected pathways in Fig [119]3B, to
   changes in individual patients. To address this, we performed principal
   component analysis on gene expression values of individual selected
   pathways and correlated the projections of the first three principal
   components with clinical outcome (Fig [120]3). Most features
   individually showed significant correlation with tumor subtype (e.g.,
   correlation for “Amplification of Signal from the
   Kinetochores” = −0.80, t‐test, P = 3.3e‐72), and the patient class
   boundary is visually evident in these features (Fig [121]3D). However,
   not all features had this property (e.g., correlation for
   “Glucuronidation” = 0.1, t‐test, P = 0.038). Pathways that score highly
   in feature selection and correlate with outcome are good candidates for
   follow‐up biomarker or mechanistic studies.

   As a second case study to demonstrate that netDx identifies pathways
   consistent with the biology of the condition, we predicted case/control
   status in asthma using gene expression from sorted peripheral blood
   mononuclear cells (Yang et al, [122]2015; 97 cases, 97 controls; Data
   ref: Yang et al, [123]2015) using an identical predictor design as used
   for breast cancer above (2,119 pathway‐level features). The netDx
   predictor achieved an AUROC of 0.70 ± 0.07 (SD) (Fig [124]3E; mean
   AUPR = 0.65; mean accuracy = 66%). Feature‐selected pathways included
   cytotoxic T‐lymphocyte‐related processes and Notch2 signaling
   (Fig [125]3D; [126]Appendix Table S4). These themes are consistent with
   prior knowledge of cellular changes in asthma (see [127]Discussion).
   Similar to the breast cancer example, pathway‐level features result in
   significantly improved classification performance, relative to a
   predictor design where gene expression is provided as a single feature
   (100 train/test splits; mean AUROC = 0.56, AUPR = 0.54; one‐sided WMW
   for greater performance of pathway‐based design, P < 2e‐16). These
   examples demonstrate that when used with pathway‐level features, netDx
   can provide insight into molecular mechanisms and disease‐related
   processes that discriminate patient groups. Altogether, our results
   show that using pathway‐level features can improve classification
   performance and provide insight into disease mechanisms.

Discussion

   We describe netDx, the first supervised patient classification system
   based on patient similarity networks. We demonstrate that netDx does as
   well as or better than a diverse panel of machine‐learning approaches
   in predicting survival across four different tumor types. Further,
   feature selection, especially when biological pathways are used, aids
   interpretability and provides insight into disease mechanisms important
   for classification. This framework can be used to create accurate,
   generalizable predictors and has particular strengths in data
   integration and interpretation compared to standard machine‐learning
   approaches. netDx is targeted at researchers who are interested to see
   whether their sample‐level data can answer a specific sample
   classification question. netDx provides a standard workflow that can
   determine whether the given classification question can be answered
   based on a training set and if so, provides a set of relevant features
   and a software tool to classify new samples.

   With the PanCancer benchmark set, netDx performs as well as or better
   than all machine‐learning methods tested. In the case of KIRC, the
   basic netDx model outperformed other methods without any additional
   tuning. Some of this performance gain may be due to the use of
   regularized regression in GeneMANIA. In the context of predicting gene
   function, the original GeneMANIA benchmarking tests determined that,
   relative to unregularized linear regression, regularization improved
   performance by reducing overfitting (Mostafavi et al, [128]2008). In
   other instances (LUSC, OV, and some GBM models), tuning the
   sparsification level of the input networks changed netDx's performance
   sufficiently to outperform other methods. Parameters to alter
   sparsification included how many of each patient's strongest edges were
   retained (i.e., top 30–50 edges per patient), and whether or not an
   exponential scaling filter was applied to edge weights. We found that
   the optimal parameter choice differed depending on the dataset
   ([129]Dataset EV1). Similar to regularization, the sparsification of
   input networks was demonstrated to improve GeneMANIA performance in
   gene function prediction, presumably by accentuating signal and
   reducing noise (Mostafavi et al, [130]2008). We therefore attribute the
   improved performance of netDx to the fine tuning of the input networks
   through sparsification and to the use of regularized regression in the
   GeneMANIA algorithm used by netDx. A single support vector machine
   model from Yuan et al for lung cancer survival prediction vastly
   outperformed any other model, including netDx, perhaps because it
   identified a useful non‐linear decision boundary, or was overfit.
   Future work will explore whether considering non‐linear effects (e.g.,
   via non‐linear similarity measure and network combinations) can improve
   performance.

   netDx includes support for grouping genes into pathways to improve
   interpretability. Grouping genes into a smaller number of features can
   mitigate overfitting risk, improve signal detection with sparse data,
   be easier to analyze, and improve prediction performance. The themes
   identified for Luminal A classification of breast tumors are consistent
   with processes known to be dysregulated in this type of cancer. For
   instance, themes of DNA repair and G2‐M checkpoint regulation are
   consistent with the known roles of BRCA1/BRCA2 and ATM proteins, which
   are established risk factors for breast cancer (Roy et al, [131]2011).
   Cell cycle dysregulation accompanies genomic instability as a feature
   of several cancers (Negrini et al, [132]2010). netDx also identified a
   theme of solute carrier family proteins, many of which are
   overexpressed in tumors, are thought to support the metabolic needs of
   growing tumors (McCracken & Edinger, [133]2013; Uhlen et al, [134]2015)
   and are associated with genetic risk of breast cancer (Fletcher et al,
   [135]2011). These suggest novel directions for biomarker identification
   and therapeutic targeting. Similarly, the themes identified for asthma
   case prediction, including cytotoxic T lymphocytes and associated
   apoptosis, are consistent with known asthma genetics and genomic
   results. netDx also identified Notch signaling as predictive of asthma.
   Notch signaling regulates the differentiation of T‐helper cells, and
   inhibitors of this pathway are being tested in clinical trials to
   suppress symptoms of asthma (Okamoto et al, [136]2008, [137]2009; Huang
   et al, [138]2017). In summary, when provided with pathway‐level
   features, netDx can be a useful tool for discovery research.

   We compared the output of netDx's pathway‐based patient classifier to
   traditional pathway enrichment analysis (gene set enrichment analysis
   or GSEA; Subramanian et al, [139]2005), by applying GSEA to the Luminal
   A vs. “other” gene expression comparison data mentioned previously
   (N = 348 patients, expression for 17,814 genes). GSEA identified
   roughly ~1.5 times more pathways (126 pathways, Q < 0.05; [140]Dataset
   EV3) than netDx did (80 pathways scoring ≥ 7 out of 10 in ≥ 70 out of
   100 train/test splits). Roughly half of netDx identified pathways (48%)
   were also found by GSEA ([141]Appendix Fig S3, [142]Dataset EV4), and
   these overlapping pathways are all known to be affected in breast
   cancer (e.g., cell division, cell cycle checkpoints, DNA replication,
   and DNA damage repair). Overall, netDx is more conservative than GSEA
   in terms of pathways reported, likely because netDx uses regularized
   regression to reduce redundancy in the feature selection step.

   We also compared netDx to the DIABLO multi‐omic patient classifier
   (Rohart et al, [143]2017) which uses a partial latent structure‐based
   method to identify correlation between individual features, such as
   gene expression levels, and with outcome. For this, we integrated mRNA
   and miRNA expression from 346 primary breast tumors (The Cancer Genome
   Atlas, [144]2012b), using pathway‐level features in netDx and, for
   comparison, the set of genes covered by these pathways in DIABLO. Both
   methods used identical train/test splits of 80:20. Both methods
   performed well (AUROC: DIABLO: average AUROC = 0.86; netDx: 0.96;
   accuracy: DIABLO: 90%, netDx 91.4%; [145]Dataset EV5 and [146]EV6).
   netDx selected pathways with themes similar to the Luminal A gene
   expression‐based predictor previously mentioned (Fig [147]3B),
   including cell cycle regulation and DNA damage repair
   ([148]Appendix Fig S4A). DIABLO identified a range of features
   correlated with outcome, such as the known link of mir‐30a
   overexpression with improved survival in both luminal and basal tumors
   (Kawaguchi et al, [149]2017). It also highlighted individual genes
   present in the pathways found by netDx ([150]Appendix Fig S4B), for
   example, CENPA and AURKB in Aurora signaling pathways. DIABLO focuses
   on gene‐level features, and netDx natively supports pathway‐level
   features; thus, both tools provide complementary views of predictive
   multi‐omic features that could be useful when applied in tandem.

   In conclusion, netDx is a useful tool for precision medicine because it
   combines a conceptually intuitive paradigm of patient similarity
   networks with classification performance better than or equal to
   traditional machine‐learning methods, as well as biological
   interpretability when using pathway features. Like other
   machine‐learning methods, model performance and generalizability are
   limited by feature design choices and sample size. Our ultimate vision
   is to enable clinical researchers to assess classification performance
   for questions of interest, such as “will a patient respond to one
   therapy or another?” based on patient measurements and outcomes present
   in large electronic medical record databases. Output would include a
   report card on model performance and generalizability estimates on
   independent cohorts, feature interpretation, an interactive integrated
   patient similarity network visualization enabling the exploration of
   individual patients, and a ready‐to‐run classifier for new patients
   (Pai & Bader, [151]2018).

   One limitation of netDx, as for any supervised machine‐learning method,
   is the need for a large bank of patient samples to learn from.
   Fortunately, clinical data are growing rapidly, though specific queries
   involving smaller patient cohorts will need alternative methods to
   address. Additionally, netDx needs to compute similarities of training
   as well as test patients. Private data will therefore potentially need
   to be shared with users of any netDx classifier. Secure sharing,
   differential privacy, or privacy‐preserving data mining techniques
   could be developed to enable patient similarity computations without
   the need to share complete private data. We therefore expect that netDx
   will initially be useful as a multi‐omic data analysis tool for
   research groups rather than in a production clinical environment.

   netDx is implemented as an open‐source R software package available at
   [152]http://netdx.org, with worked examples and a Docker container
   enabling reproduction of the results and figures in this paper
   ([153]https://doi.org/10.5281/zenodo.2558452). We also propose that
   users store and publicly share patient similarity networks, useful as
   features for netDx and other PSN methods, in the NDEx network exchange
   system (Pratt et al, [154]2015).

Materials and Methods

PanCancer Survival benchmark models

   We tested various models for the PanCancer survival benchmarking. This
   section describes the model details; models are named as per
   [155]Dataset EV1. The models varied based on whether or not they
   included a data imputation step, whether or not variables were
   prefiltered using lasso regression, and choice of similarity metric
   (Pearson correlation, normalized similarity, scaled Euclidean/Pearson).
   Where used, imputation or prefiltering was performed only on training
   samples inside the cross‐validation (CV) loop to avoid leaking
   information from train to test.

Base (no lasso prefiltering)

   In this model, each data type was treated as a single feature (e.g.,
   one patient similarity network was generated for gene expression, one
   for clinical data). Similarity was defined by Pearson correlation where
   a data type had more than six measures (Abdel‐Megeed, [156]1984) or by
   average normalized similarity if the data type had five or fewer
   variables. For the case of a single continuous variable, we use the
   normalized similarity, defined as follows:
   [MATH: <mrow><mi>S</mi><mrow><mo
   stretchy="false">(</mo><mi>a</mi><mo>,</mo><mi>b</mi><mo>,</mo><mi>G</m
   i><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mn>1</mn><mo>−</mo><mfrac><mro
   w><mi>a</mi><mi>b</mi><mi>s</mi><mo
   stretchy="false">(</mo><mi>a</mi><mo>−</mo><mi>b</mi><mo
   stretchy="false">)</mo></mrow><mrow><mi>m</mi><mi>a</mi><mi>x</mi><mo
   stretchy="false">(</mo><mi>G</mi><mo
   stretchy="false">)</mo><mo>−</mo><mi>m</mi><mi>i</mi><mi>n</mi><mo
   stretchy="false">(</mo><mi>G</mi><mo
   stretchy="false">)</mo></mrow></mfrac></mrow> :MATH]

   Where a and b are the values of the variable for individual patients (a
   and b) and G is the set of all values for the variable (e.g., age). For
   a set of k variables G={g[1],g[2],..g[k]}, where 1 ≤ k ≤ 5, the
   similarity S' between two patients a and b is defined as the average of
   normalized similarity for each of the variables:
   [MATH: <mrow><msup><mi>S</mi><mo>′</mo></msup><mrow><mo
   stretchy="false">(</mo><mi>a</mi><mo>,</mo><mi>b</mi><mo>,</mo><mi>G</m
   i><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mfrac><mrow><msubsup><mo>∑</mo
   ><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>k</mi></msubsup><mi>S</
   mi><mrow><mo
   stretchy="false">(</mo><mi>a</mi><mo>,</mo><mi>b</mi><mo>,</mo><msub><m
   i>g</mi><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow></mrow><mi>k</mi></mfrac></mrow> :MATH]

Variable prefiltering and scaled Euclidean/scaled Pearson similarity metrics

   This model design defines features at the level of individual variables
   (e.g., genes and clinical variables) and performs within‐CV
   prefiltering using lasso regression to rank features by their ability
   to predict outcome (Tibshirani, [157]1996). This method is likely a
   better choice than combining all variables of a data type into a single
   network when only a handful of variables carry predictive information
   for that data type. Only variables with a non‐zero weight are included
   in the analysis. Regression uses only training samples within a given
   fold to avoid leaking information from train to test. The similarity
   metric used is either Euclidean distance (model code = eucscale) or
   Pearson correlation, followed by local exponential scaling (Wang et al,
   [158]2014). Imputation (pearimpute, eucimpute) is performed within the
   cross‐validation loop separately for training and test samples to avoid
   leaking information from train to test. The lung cancer dataset
   demonstrated the best performance if the model was also limited to the
   top clinical variable from lasso (plassoc1).

Integrated patient network

   The integrated patient network is an average combination of all
   selected networks (features) to create a single network (i.e., average
   of all edge weights between patients from all selected networks).
   Visually, the goal is to view similar patients as being more tightly
   grouped and dissimilar patients as being farther apart. Similarity
   (normalized from 0 to 1) is therefore converted to dissimilarity,
   defined as 1‐similarity. Weighted shortest path distances are computed
   on this resulting dissimilarity network. The one‐tailed
   Wilcoxon–Mann–Whitney test is used to ascertain whether within‐class
   distances are collectively shorter than across‐class distances. To aid
   visualization, only edges representing the top 40% of distances in the
   network are included. The weighted shortest path between patient
   classes (a node set) in the integrated network was computed using
   Dijkstra's method (igraph v1.01; Csardi & Nepusz, [159]2006); distance
   was defined as 1‐similarity (or edge weight from a patient similarity
   network). The overall shortest path was defined as the mean pairwise
   shortest path for a node set.

Survival curve and hazard ratios

   Survival curves were constructed based on netDx‐predicted classes of
   test samples. The R packages survival and survminer were used to
   compute Kaplan–Meier curves, and rms was used to calculate the log‐rank
   test for separation of survival curves. The package survival was also
   used to compute the Cox proportional hazards model of predicted poor
   survivors, using predicted good survivors as a reference, and to
   calculate the hazard ratio and associated P‐value.

Pathway networks

   Pathway definitions were aggregated from HumanCyc (Romero et al,
   [160]2005; [161]http://humancyc.org), NetPath (Kandasamy et al,
   [162]2010; [163]http://www.netpath.org), Reactome (Croft et al,
   [164]2014; Fabregat et al, [165]2016; [166]http://www.reactome.org),
   NCI Curated Pathways (Schaefer et al, [167]2009), mSigDB (Subramanian
   et al, [168]2005;
   [169]http://software.broadinstitute.org/gsea/msigdb/), and Panther (Mi
   et al, [170]2005; [171]http://pantherdb.org/). The compiled set of
   pathways was downloaded from
   [172]http://download.baderlab.org/EM_Genesets/February_01_2018/Human/sy
   mbol/Human_AllPathways_February_01_2018_symbol.gmt (Merico et al,
   [173]2011a). Only pathways with 10–200 genes were included (2,119
   pathways). Pathway‐level patient similarity was defined as the Pearson
   correlation of the expression vectors corresponding to member genes,
   and the network was sparsified (see next section).

   For the asthma case study, Ensembl IDs were mapped to HGNC symbols
   using the Bioconductor package org.Hs.eg.db.

Sparsification of input networks

   Sparsification is useful for reducing the number of edges to be
   considered in input networks, speeding computation, and for reducing
   noise in the data, as weak correlations are removed. Sparsification
   uses three parameters: the minimum edge weight to include (cutoff; set
   at smallest possible non‐zero weight for models using exponential
   scaling, and at 0.3 for others), how many top interactions to include
   per node (topX), and the upper bound on the number of edges in a
   network (maxEdges). By default, GeneMANIA sparsifies networks by first
   excluding edge weights below a cutoff and retaining the top 50 edges
   per node (Mostafavi et al, [174]2008). We used this sparsification
   strategy in all instances where Pearson correlation was used to compute
   patient similarity. For other similarity measures (e.g., normalized
   similarity) or where exponential scaling was applied to networks, we
   found that additionally limiting the total number of edges in the
   network (i.e., adding a maxEdges parameter) improved performance in
   some models. In the latter instance, we ensured that a network included
   all input patients (so test patients could be classified) by adding the
   strongest edge for any patients excluded by the maxEdge criterion. If
   this edge weight was lower than the network cutoff, it was set to be at
   the cutoff value. Different tumor datasets performed optimally with
   different levels of sparsification: KIRC: topX = 30, maxEdges = 6,000;
   LUSC: topX = 40, maxEdges = 6,000; GBM: topX = 50, maxEdges = 3,000.

Map of selected networks

   The Enrichment Map app (3.1.0RC4) in Cytoscape 3.6.1 (Shannon et al,
   [175]2003) was used to generate enrichment maps (Merico et al,
   [176]2011a). A Jaccard overlap threshold of 0.05 was used to prune
   identical gene sets. AutoAnnotate v1.1.0 was used to cluster similar
   pathways using MCL clustering with default parameters.

Gene set enrichment analysis for LumA subtype in breast cancer

   Limma (Linear Models for Microarray and RNA‐Seq Data; Ritchie et al,
   [177]2015) was used to fit a linear model to compare mRNA levels of
   LumA breast cancer vs. other breast cancer subtypes (i.e., Her2, Basal,
   LumB, and Normal) in the TCGA BRCA mRNA microarray dataset. This
   analysis produced a list of genes ranked by a moderated t‐statistic. As
   we were interested in both up‐ and down‐regulated genes (i.e., overall
   change in LumA and other subtypes taken together), we ranked the limma
   output based on the absolute value of the moderated t‐statistic. A
   weighted pre‐ranked GSEA (gene set enrichment analysis; Subramanian
   et al, [178]2005) analysis was performed using the ranked list of genes
   obtained from the above limma analysis as input. Gene sets were
   filtered out if they had < 10 or > 500 genes and 1,000 permutations
   were run. The set of pathway definitions was identical to that used in
   the netDx breast cancer predictor.

   An Enrichment Map was created using the union of GSEA pathways enriched
   in LumA and other subtypes (FDR < 0.05) and netDx pathways that
   consistently had high scores across multiple train/test splits
   (maxScore ≥ 7). This enrichment map was created to assess overlaps in
   pathways and themes (where a theme is a group of related pathways). We
   used the Enrichment Map 3.1.0 app (Merico et al, [179]2011a) in
   Cytoscape 3.6.1 (Shannon et al, [180]2003) and the GMT file
   Human_AllPathways_February_01_2018_symbol.gmt (same as that used for
   the pathway‐based netDx breast cancer tumor subtype predictor).

Comparison of netDx and DIABLO for multi‐omic classification

   Previously published gene expression and miRNA data for 346 tumors were
   downloaded from the Cancer Genome Atlas data portal
   ([181]https://tcga-data.nci.nih.gov/docs/publications/brca_2012/;
   BRCA.348.precursor.txt; BRCA.exp.348.med.txt; The Cancer Genome Atlas,
   [182]2012b).

   miRNA were mapped to pathways. The MSigDB miRNA gene set database
   (c3.mir.v6.2.symbols.gmt) (Subramanian et al, [183]2005) was used to
   map individual miRNA to genes, and the pathway gene set database
   (Human_AllPathways_February_01_2018_symbol.gmt) was used to assign each
   miRNA to pathways in which it had a target gene. For comparison to
   netDx, we limited RNA and miRNA variables provided as input to DIABLO
   to those present in pathways. Both methods also used identical
   train/test splits.

   For netDx, a predictor was run with pathway‐level features for a single
   train/test split of 80:20. Only pathways with 10–200 genes (2,119
   pathways) or those with more than 5 miRNA (for stable computation of
   Pearson correlation, after (Abdel‐Megeed, [184]1984)) (1,935 pathways)
   were included. Pathway‐level patient similarity was defined by Pearson
   correlation (network sparsification parameters: edges with correlation
   below 0.3 were ignored, top 50 strongest edges per node were retained,
   to a maximum of 3,000 edges per network). RNA and miRNA patient
   similarity networks were constructed using the same parameters.
   Training samples were used to construct pathway‐level features (4,054
   features), and these were scored between 0 and 10 for predictive value,
   as with previous netDx predictors; features with score ≥ 9 were used to
   classify the blind test samples.

   We ran DIABLO with a gene‐centered model, following operating
   procedures specified in the online tutorial
   ([185]http://mixomics.org/mixdiablo/case-study-tcga/). The design
   matrix was a 2 × 2 matrix with 0 in the diagonals and 0.1 in the
   off‐diagonal. Block.splslda() was used to estimate the number of
   components to use; the most frequent estimate of number of components
   (median choice.ncomp$WeightedVote) was used as the number of components
   in the model. A grid‐based search was performed to estimate the number
   of variables to retain for each omic layer and component. Specifically,
   tune.block.splsda() was used with 10‐fold cross‐validation to ascertain
   the number of variables to keep (between 5 and 30 variables). DIABLO
   was used to select variables for each component.

Author contributions

   All authors contributed to netDx method development. SP and MAS
   analyzed the data. SP wrote the netDx software package with
   contributions from MAS. SH, HK, and RI developed initial versions of
   netDx. SP and GDB wrote the paper. GDB supervised the work.

Conflict of interest

   The authors declare that they have no conflict of interest.

Supporting information

   Appendix
   [186]Click here for additional data file.^ (3.4MB, pdf)

   Expanded View Figures PDF
   [187]Click here for additional data file.^ (387.1KB, pdf)

   Dataset EV1
   [188]Click here for additional data file.^ (12KB, xlsx)

   Dataset EV2
   [189]Click here for additional data file.^ (12.9KB, xlsx)

   Dataset EV3
   [190]Click here for additional data file.^ (14.6KB, xlsx)

   Dataset EV4
   [191]Click here for additional data file.^ (16.2KB, xlsx)

   Dataset EV5
   [192]Click here for additional data file.^ (10.6KB, xlsx)

   Dataset EV6
   [193]Click here for additional data file.^ (12.4KB, xlsx)

   Review Process File
   [194]Click here for additional data file.^ (551.5KB, pdf)

Acknowledgments