ABSTRACT

   Human immunodeficiency virus type 1 (HIV-1) depends on a class of host
   proteins called host dependency factors (HDFs) to facilitate its
   infection. So far experimental efforts have detected a certain number
   of HDFs, but the gene inventory of HIV-1 HDFs remains incomplete. Here,
   we implemented an existing network-based gene discovery strategy to
   predict HIV-1 HDFs. First, an encoding scheme based on a publicly
   available human tissue-specific gene functional network (GIANT;
   [37]http://giant.princeton.edu/) was designed to convert each human
   gene into a 25,825-dimensional feature vector. Then, a random
   forest-based predictive model was trained on a data set containing 868
   known HDFs and 1,736 non-HDFs. Through 5-fold cross-validation, an
   independent test, and comparison with one existing method, the proposed
   prediction method consistently revealed accurate and competitive
   performance. The highlight of our method should be ascribed to the
   introduction of the GIANT encoding scheme, which contains rich
   information regarding gene interactions. By merging known HDFs and
   genome-wide HDF prediction results, network analysis was conducted to
   catch the common patterns of HDFs in the context of the GIANT network.
   Interestingly, HDFs reveal significantly lower betweenness than
   HIV-1-interacting human proteins (i.e., HIV targets). In the meantime,
   the functional roles of HDFs were also examined by mapping all the HDF
   candidates into human protein complexes. Especially, we observed the
   frequent co-occurrence of HDFs and HIV targets at the protein complex
   level. Collectively, we hope the proposed prediction method not only
   can accelerate the HDF identification and antiviral drug target
   discovery, but also can provide some mechanistic insights into
   human-virus relationships.

   IMPORTANCE Identification of HIV-1 HDFs remains a crucial step to
   understand the complicated relationships between human and HIV-1. To
   complement the experimental identification of HDFs, we have implemented
   an existing network-based gene discovery strategy to predict HDFs from
   the human genome. The core idea of the proposed method is that the rich
   information deposited in host gene functional networks can be
   effectively utilized to infer the potential HDFs. We hope the proposed
   prediction method could further guide hypothesis-driven experimental
   efforts to interrogate human–HIV-1 relationships and provide new hints
   for the development of antiviral drugs to combat HIV-1 infection.

INTRODUCTION

   As a kind of obligate intracellular pathogens, viruses contain small
   genomes that encode a limited number of proteins. To carry out their
   activities in host cells, viruses need to exploit host proteins for
   entry, replication, and transmission. In general, such host proteins
   are referred to as host dependency factors (HDFs) ([38]1, [39]2).
   Biologically, the characterization of HDFs remains an important step to
   decipher human-virus relationships ([40]3). In the meantime, HDFs can
   serve as potential antiviral drug targets. Indeed, such a target
   discovery strategy is increasingly attractive, since it can effectively
   avoid drug resistance in comparison to therapeutic targeting of viral
   proteins ([41]4[42]–[43]6).

   As an infectious virus, HIV-1 continuously poses a serious threat to
   human health. Mechanistic understanding of human–HIV-1 interaction has
   been of long-term research interest to the community. The genome of
   HIV-1 encodes only 19 proteins. Consequently, it has to rely on HDFs to
   complete its life cycle ([44]7). In the past decades, many experimental
   methods, such as small interfering RNA (siRNA)-based screens ([45]8)
   and CRISPR/Cas9-based screens ([46]9, [47]10), have been explored to
   identify HIV-1 HDFs ([48]11[49]–[50]15). Regarding siRNA-based screens,
   individual human genes are first knocked down through RNA interference,
   then the effects of viral infection (e.g., levels of viral protein
   expression or production of viral particles in human cells) are
   measured to find potential HDFs. As a novel and powerful
   loss-of-function technique, CRISPR/Cas9 has also been applied to the
   detection of HDFs with higher sensitivity and specificity ([51]10).
   Until now, the experimentally identified HIV-1 HDFs have provided
   further insights into the functional roles of HIV-1 HDFs. Moreover, the
   relationship between HDFs and HIV-1-interacting human proteins (i.e.,
   HIV targets) has been examined in the context of human protein-protein
   interaction (PPI) networks ([52]16, [53]17). Regarding the antiviral
   drug discovery, HDF-targeted drugs have been successfully developed.
   For instance, one HIV-1 HDF called CCR5 could serve as a coreceptor for
   HIV-1 infection of CD4^+ T cells and macrophages, and small molecule
   inhibitors of CCR5 have been developed as effective anti-HIV drugs
   ([54]18).

   Thanks to the development of experimental techniques, more and more
   HIV-1 HDFs have been continuously discovered, especially with the
   application of CRISPR/Cas9-based screens ([55]2, [56]10). In the
   meantime, it has been reported that high false-negative rates exist in
   previous genome-wide siRNA-based HDF screens ([57]19). The evidence
   above clearly indicates that the current catalogs of HIV-1 HDFs remain
   incomplete. Additionally, experimental methods are often time-consuming
   and laborious. In this regard, cost-effective computational methods may
   offer a promising alternative solution for complementing the
   experimental identification of HDFs. Indeed, the available HDF data
   have provided a solid foundation for the development of prediction
   methods. Considering the functional diversity of HDFs, conventional
   sequence information-based protein family prediction is not suitable
   for this task. Rather, network-based gene discovery ([58]16, [59]20,
   [60]21) may provide an effective alternative solution to detect HDFs.
   Based on the hypothesis that the network topologies of known HDFs
   within human PPI networks can be employed to detect new HDFs, Murali et
   al. initially predicted HIV-1 HDFs through the introduction of a
   graph-theoretic approach called SinkSource ([61]16). Recently, Ackerman
   et al. proposed a method of integrating human PPI networks with
   human-virus PPIs to detect HDFs of influenza viruses ([62]21). In
   addition to successfully predicting novel HDFs, the topology
   relationships between HDFs and virus-interacting proteins in the
   context of human PPI networks have been characterized. The
   aforementioned prediction and analysis of HDFs indicated that the
   network-informed strategy is powerful for novel HDF discovery.

   In comparison to pure PPI networks, genome-wide functional networks may
   be more comprehensive to represent the complex gene/protein
   associations within cellular systems. In 2015, Troyanskaya and
   coworkers developed a series of tissue-specific functional gene
   interaction networks through a Bayesian data integration strategy
   ([63]22). The integrated data types include thousands of PPI, gene
   expression, and regulatory sequence data sets. Moreover, they have
   constructed a web server called GIANT (Genome-Scale Integrated Analysis
   of Networks in Tissues) to make the predicted functional gene
   interaction networks applicable to the community. Based on the GIANT
   network, for instance, Krishnan et al. employed a machine learning
   approach to conduct genome-wide prediction of autism risk genes. They
   successfully predicted hundreds of autism risk gene candidates with
   little or no prior genetic evidence, many of which have been
   experimentally validated ([64]23).

   Inspired by the successful applications of network-based gene discovery
   ([65]22[66]–[67]25), in this work we implemented a GIANT
   network-informed prediction method of HIV-1 HDFs with the assistance of
   machine learning algorithms. We will elaborate the overall
   computational framework, methodology details, performance assessment,
   and comparison of the proposed HDF predictor. In the meantime, we will
   also report the comprehensive network and functional analyses of HDF
   candidates inferred from genome-wide prediction, which will allow us to
   better understand the global landscape of HIV-1 HDFs.

RESULTS AND DISCUSSION

The computational framework of the proposed network-informed HDF prediction.

   The flowchart of the proposed prediction method is illustrated in
   [68]Fig. 1. At first, we manually collected known HDFs with
   experimental evidence (i.e., positive samples) and selected non-HDFs
   (i.e., negative samples) through random sampling of human genes other
   than known HDFs. We further compiled them into a training data set
   covering 868 HDFs and 1,736 non-HDFs and an independent test set
   involving 276 HDFs and 552 non-HDFs. Then, the GIANT network was used
   to infer feature vectors for HDFs/non-HDFs. Based on the GIANT encoding
   scheme, five popular machine learning methods (i.e., random forest
   [RF], naive Bayesian [NB], k-nearest neighbors [KNN], logistic
   regression [LR], and support vector machine [SVM]) were adopted to
   build the corresponding predictive models. At last, a 5-fold
   cross-validation and an independent test were carried out to select the
   best predictive model. More details about the data set preparation,
   GIANT-based feature vector construction, machine learning algorithm
   implementation, and performance metrics are available in Materials and
   Methods.

FIG 1.

   [69]FIG 1
   [70]Open in a new tab

   Flowchart of the proposed HIV-1 HDF prediction method. In the data set
   preparation step, we compiled positive genes (known HDFs) through
   literature searching and negative genes (i.e., non-HDFs) through random
   sampling of human proteins other than known HDFs. In the feature vector
   construction step, we employed the T-cell-specific GIANT network to
   convert each positive/negative sample into a 25,825-dimensional vector.
   In the model training and evaluation step, we introduced five popular
   machine learning methods, and RF was selected as the optimal machine
   learning algorithm through the 5-fold cross-validation and independent
   test. Moreover, the feature selection was conducted to rank the
   contributions of different features in the proposed encoding scheme. In
   the final step, we conducted genome-wide HDF screening based on the
   proposed method, and conducted topological analysis of the HDF
   candidates in the context of the GIANT network and examined the
   functional roles of HDF candidates in the context of human protein
   complexes.

The performance of network-based HDF prediction.

   In this work, a 5-fold cross-validation and an independent test were
   carried out to stringently assess the model performance of different
   machine learning algorithms, which were first measured through the
   receiver operating characteristic (ROC) curve and the area under ROC
   curve (AUC). It should be noted that for fair comparison, the
   parameters in different algorithms were preliminarily optimized (i.e.,
   the key parameters in each algorithm were optimized, while other
   parameters were set as default). As shown in [71]Fig. 2A, the RF-based
   model performed the best (AUC = 0.751) in the 5-fold cross-validation,
   followed by SVM (AUC = 0.737), LR (AUC = 0.718), KNN (AUC = 0.660), and
   NB (AUC = 0.640). [72]Figure 2B illustrates the ROC curves of the
   models in the independent test. Considering that the precision-recall
   (PR) curve is more suitable for characterizing the model performance
   with imbalanced positives and negatives, the PR curve and the area
   under PR curve (AUPRC) are also provided in [73]Fig. 2C and [74]D.
   Likewise, the RF-based model performed the best in either the 5-fold
   cross-validation or the independent test. For real application, it is
   important to quantify the performance at a low false-positive rate
   (FPR) control. At an FPR control of 10%, for instance, the
   corresponding sensitivity (with precision in parentheses) values for
   RF, SVM, LR, KNN, and NB are 31.2% (61.7%), 28.6% (57.1%), 29.2%
   (55.3%), 20.9% (54.2%), and 14.8% (43.2%) in the 5-fold
   cross-validation, respectively ([75]Fig. 2A and [76]C). Moreover, it is
   worth noting that the AUC/AUPRC values from the independent test
   revealed reasonably decreased performance in comparison to the 5-fold
   cross-validation, which should be ascribed to the fact that the
   positive samples in the training set and independent test set were
   selected from different experimental studies. Even so, different
   machine learning-based models showed the same performance rank in
   either the 5-fold cross-validation or the independent test, further
   suggesting the overall performance of these five machine learning-based
   models is robust.

FIG 2.

   [77]FIG 2
   [78]Open in a new tab

   Performance comparison of prediction models based on different machine
   learning methods. (A) ROC curves of the 5-fold cross-validation. (B)
   ROC curves of the independent test. (C) PR curves of the 5-fold
   cross-validation. (D) PR curves of the independent test. Parameters in
   parentheses in panels A and B denote the AUC values of different
   models, while parameters in parentheses in panels C and D stand for the
   AUPRC values. Note that the AUC/AUPRC values are reported as average ±
   standard deviation (SD).

   Considering T cells are the principal targets for HIV-1 ([79]26,
   [80]27), our predictive model is based on the T-cell-specific GIANT
   network. To demonstrate whether the predictive model is sensitive to
   different tissue-specific networks, we compared the performance of the
   T-cell-specific network against the networks of other tissues that are
   not known HIV-1 host cells. In terms of AUC or AUPRC, the
   T-cell-specific network slightly outperformed the epidermis tissue- and
   adipose tissue-specific networks ([81]Table 1), implying the
   T-cell-specific network seems to have higher signal-to-noise ratio to
   some extent.

TABLE 1.

   Model performance based on different tissue-specific GIANT networks and
   PPI networks
   Network type 5-fold cross-validation[82]^a
     __________________________________________________________________

   Independent test[83]^a
     __________________________________________________________________

   AUC AUPRC AUC AUPRC
   GIANT networks
       T cells 0.751 ± 0.003 0.554 ± 0.010 0.703 ± 0.013 0.483 ± 0.011
       Adipose tissue 0.747 ± 0.002 0.546 ± 0.008 0.703 ± 0.015 0.476 ±
   0.014
       Epidermis tissue 0.747 ± 0.002 0.552 ± 0.006 0.701 ± 0.025 0.468 ±
   0.022
   PPI networks
       PPI network in this study[84]^b 0.643 ± 0.004 0.502 ± 0.011 0.552 ±
   0.020 0.368 ± 0.025
       InWeb_InBioMap[85]^b 0.669 ± 0.006 0.501 ± 0.010 0.590 ± 0.014
   0.390 ± 0.012
   [86]Open in a new tab
   ^a

   The results are based on five different repeats of negative sample
   selections, which are expressed as average ± SD.
   ^b

   We used the same encoding strategy as the GIANT network to infer the
   compiled PPI network- or InWeb_InBioMap-based predictive model. Since
   there are a total of 16,745 proteins in the compiled PPI network, each
   sample can be converted into a 16,745-dimensional feature vector.
   Regarding the InWeb_InBioMap PPI network, the number of proteins is
   16,948, and thus each sample can be represented as a 16,948-dimensional
   vector. To train and assess the compiled PPI network- or
   InWeb_InBioMap-based model, note that some HDFs in the original
   training and independent test sets were removed since they were not
   included in these two PPI networks.

   We also reconstructed the proposed predictive model based on a human
   PPI network compiled in this study, which contains 344,703 interactions
   and 16,745 proteins. Rather than the GIANT network, the PPI network
   data used here are unweighted and tissue independent. To infer the PPI
   network-based encoding, the interaction score of an interacting protein
   pair was set to 1.0, whereas the interaction score of a noninteracting
   protein pair was set to 0.0. As shown in [87]Table 1, the PPI
   network-based model achieved an AUC value of 0.643 (AUPRC = 0.502) in
   the 5-fold cross-validation and an AUC value of 0.552 (AUPRC = 0.368)
   in the independent test, which are much lower than those of the
   corresponding counterparts in the GIANT network-based predictive model.
   Likewise, we also retrained the RF model based on a systematically
   integrated PPI network called InWeb_InBioMap ([88]28), which covers
   580,075 interactions and 16,948 proteins in version 2016-09. In
   general, the InWeb_InBioMap model outperforms the model based on the
   PPI network compiled in this work, but it is still inferior to the
   GIANT network-based model ([89]Table 1). Collectively, the above
   performance comparison of the GIANT network-based model and two PPI
   network-based models demonstrated that GIANT is a suitable gene network
   for HDF identification.

Selection of different negative data sets.

   It has been established that the real ratio of HDFs to non-HDFs is
   highly skewed, although the exact ratio of positives to negatives in
   the human genome remains elusive. To address this highly imbalanced
   classification task, the ratio of positives to negatives used in
   training/assessing machine learning models remains an open issue. On
   the one hand, models trained on balanced samples, as widely used in
   many classification tasks, cannot reflect reality. On the other hand,
   models trained on a highly imbalanced ratio will also inevitably
   generate biased results. In this context, a relatively imbalanced ratio
   of positives to negatives was often empirically adopted without strict
   optimization. Here, we conducted some computational analyses to
   investigate the different ratios of positives to negatives in model
   training and assessment. Supposing that the real ratio of HDFs to
   non-HDFs in the human proteome is 1:10, we trained predictive models
   based on four different ratios of HDFs to non-HDFs (1:1, 1:2, 1:5, and
   1:10) and assessed the performance on an independent test set with a
   1:10 ratio of HDFs to non-HDFs. By doing so, we can roughly examine the
   effects of different training sample ratios in the real application.
   Note that the HDFs in the training set and independent set were the
   same as those used in developing our original model. As shown in
   [90]Table S1 in the supplemental material, the overall performance of
   RF-based models was only slightly affected by the sample ratios in
   training. Comparatively, the training set with a 1:2 or 1:5 ratio of
   HDFs to non-HDFs yielded better performance than the ratios of 1:1 and
   1:10 ([91]Table S1). Thus, the above analyses confirmed that the ratio
   of 1:2 in this work is generally reasonable, although it is probably
   not the optimal choice.
   TABLE S1

   The performance of the independent test set based on models trained by
   four different ratios of positives to negatives. Download [92]Table S1,
   DOCX file, 0.03 MB^ (29KB, docx) .

   Copyright © 2020 Fu et al.

   This content is distributed under the terms of the [93]Creative Commons
   Attribution 4.0 International license.

   As we know, it is a challenging task to choose high-quality negative
   samples in network-based gene discovery with supervised learning. For
   instance, one limitation of our original negative sample construction
   is that some unknown HDFs are inevitably contained in the randomly
   selected negative samples and introduce noise to model training. To
   address this issue, we further examined the performance and biases of
   choosing different negative samples, including disease-associated genes
   (DAGs), HDFs from other viruses, essential genes, and genes with
   similar network degrees or expression levels to HDFs (see Materials and
   Methods for more details about the different negative data set
   preparations). Similar to our original model using randomly selected
   genes as negative samples, all of these new models were also trained by
   using RF with a 1:2 ratio of positives to negatives, and the
   corresponding performance is listed in [94]Table 2.

TABLE 2.

   The performance of models based on different negative data set
   constructions
   Negative data set construction 5-fold cross-validation[95]^a
     __________________________________________________________________

   Independent test[96]^a
     __________________________________________________________________

   AUC AUPRC AUC AUPRC
   Randomly selected genes 0.751 ± 0.003 0.554 ± 0.010 0.703 ± 0.013 0.483
   ± 0.011
   DAGs 0.662 ± 0.007 0.494 ± 0.011 0.552 ± 0.009 0.405 ± 0.011
   HDFs from other viruses 0.625 ± 0.005 0.435 ± 0.008 0.539 ± 0.005 0.372
   ± 0.005
   Essential genes 0.703 ± 0.003 0.554 ± 0.010 0.762 ± 0.005 0.654 ± 0.012
   Genes with similar T cell expression levels to HDFs 0.650 ± 0.002 0.461
   ± 0.006 0.626 ± 0.009 0.441 ± 0.013
   Genes with similar network degrees as HDFs 0.584 ± 0.003 0.399 ± 0.002
   0.597 ± 0.002 0.401 ± 0.002
   [97]Open in a new tab
   ^a

   The measurements are based on five different repeats of negative sample
   selections, which are reported as average ± SD.

   Regarding choosing DAGs as non-HDFs, the levels of performance of the
   5-fold cross-validation (AUC = 0.662 and AUPRC = 0.494) and the
   independent test (AUC = 0.552 and AUPRC = 0.405) are reasonably
   decreased in comparison to the performance of the original model.
   Biologically, HIV-1 HDFs tend to be DAGs. Of the known 1,144 HDFs used
   in our work and the initially collected 3,855 DAGs, 272 genes overlap
   (hypergeometric test, P = 4.44 × 10^−16). In the context of GIANT,
   moreover, HDFs also share similar network topology properties with DAGs
   to a certain extent, which is exemplified in the corresponding box
   plots of network degree distributions (see [98]Fig. S1 in the
   supplemental material). Since GIANT is a weighted network, note that
   all the reported network parameters in this work are also weighted, if
   not specified. Thus, choosing DAGs as negative samples increased the
   prediction difficulty. With respect to choosing HDFs from other viruses
   as negative samples, the levels of performance on the 5-fold
   cross-validation (AUC = 0.625 and AUPRC = 0.435) and independent test
   (AUC= 0.539 and AUPRC = 0.372) are considerably decreased in comparison
   to those in the original model. Again, these results may reflect the
   commonality of HDFs from different viruses. For instance, 108 out of
   the collected 834 influenza A virus subtype H1N1 HDFs overlap known
   HIV-1 HDFs (hypergeometric test, P = 5.05 × 10^−24). Moreover, the
   commonality of HIV-1 HDFs and other viral HDFs is also reflected in
   their network properties ([99]Fig. S1).
   FIG S1

   The network degree distributions of HDFs and non-HDFs based on
   different selection methods. The 868 HDFs in the training set were used
   to calculate the degree distribution. In each non-HDF construction
   method, only one negative set was used to infer the degree
   distribution, although the negative set was repeatedly generated five
   times. The diamond symbol stands for the average value. The average
   network degrees for HDFs and six different non-HDF data sets along the
   x axis are 48, 36, 35, 42, 65, 50, and 41, respectively. Download
   [100]FIG S1, TIF file, 2.0 MB^ (2.1MB, tif) .

   Copyright © 2020 Fu et al.

   This content is distributed under the terms of the [101]Creative
   Commons Attribution 4.0 International license.

   Regarding the model using essential genes as non-HDFs, the performance
   on the 5-fold cross-validation and independent test is fully comparable
   to that of our original model ([102]Table 2). As we know, the essential
   genes perform important functional roles in human cells and often
   occupy unique network positions in gene networks ([103]29, [104]30).
   For instance, the average network degree of essential genes is much
   higher than that of known HDFs ([105]Fig. S1). In this context, the
   essential genes are not suitable for being selected as non-HDFs,
   although they have less chance to be HDFs. Indeed, when we conducted
   genome-wide HDF identification through the model using essential genes
   as negatives, 11,418 out of 25,085 human genes were predicted as HDFs
   when the FPR was controlled at 5%, implying biased results have been
   yielded from the new model. (Note that the prediction threshold
   corresponding to a 5% FPR was estimated from the model with a 1:2 ratio
   of positives to negatives.) Considering the network property
   differences between HDFs and essential genes, the majority of human
   genes tend to have comparatively more similar network features with
   HDFs rather than essential genes, and thus more human genes are prone
   to be predicted as HDFs.

   When we further selected non-HDFs with similar network degrees to HDFs
   in the GIANT network, the model performance was also dramatically
   decreased, as expected ([106]Table 2). Regarding the negative data set
   with similar expression levels to HDFs, the overall performance was
   also much lower than that of the original model ([107]Table 2). The
   decreasing performance may be ascribed to the fact that the newly
   selected non-HDFs may still share similar network properties with HDFs
   ([108]Fig. S1).

   Based on the above computational experiments regarding the different
   negative sample constructions, we can conclude that using random
   proteins other than known HDFs as negative samples is still a
   reasonable choice, since the network properties of random genes can
   generally reflect the diversity of non-HDFs. As a network-based gene
   discovery method, moreover, the prediction specificity of the proposed
   method is also limited to the network properties of query proteins in
   the context of the GIANT network. For instance, other proteins with
   similar network properties to HDFs may have a high chance to be
   predicted as HDFs. We hope these pros and cons of negative sample
   constructions will be taken into consideration when developing new HDF
   prediction methods in the future.

Comparison of the proposed method with an existing prediction method.

   To our best knowledge, the method of Murali et al. is probably the only
   existing bioinformatics method to predict HIV-1 HDFs. Therefore, it is
   interesting and necessary to compare our method against Murali et al.’s
   method. In Murali et al.’s method, 908 positive genes and 455 negative
   genes were used to train and test models with 10 independent runs of
   2-fold cross-validation. As a network-based prediction, their
   prediction was based on a human PPI network consisting of 71,461
   interactions and 9,595 proteins. The adopted SinkSource algorithm was
   analogous to the functional flow algorithm, which was originally
   developed for protein function prediction. By following the method
   description of SinkSource in reference [109]16, we have implemented it
   through an in-house Python script. To ensure a fair performance
   comparison, we used the GIANT network, the training set, and the
   independent test set in our work to infer and evaluate the
   SinkSource-based prediction model. We compared the SinkSource-based
   model and our RF model through the 5-fold cross-validation and
   independent test. In general, the SinkSource-based model yielded a
   performance inferior to our RF model in terms of either AUC or AUPRC
   (see [110]Fig. S2 in the supplemental material). For instance, the
   SinkSource-based model yielded an AUC of 0.654 and an AUPRC of 0.441 in
   the 5-fold cross-validation, while the corresponding values for our RF
   model were 0.751 and 0.554.
   FIG S2

   Performance comparison of the SinkSource model and our predictive model
   based on the GIANT network. (A) ROC curves of the 5-fold
   cross-validation. (B) ROC curves of the independent test. (C) PR curves
   of the 5-fold cross-validation. (D) PR curves of the independent test.
   Parameters in parentheses of panels A and B denote the AUC values of
   different models, while parameters in parentheses of panels C and D
   stand for the AUPRC values. Note that the AUC/AUPRC values are reported
   as average ± SD. Download [111]FIG S2, TIF file, 1.4 MB^ (1.4MB, tif) .

   Copyright © 2020 Fu et al.

   This content is distributed under the terms of the [112]Creative
   Commons Attribution 4.0 International license.

   To complement the aforementioned performance comparison, we also
   attempted to retrain our model on the basis of the training data set
   used in Murali et al.’s method. To this end, we first compiled a
   training set containing 868 HDFs (positive samples) and 434 human
   essential genes (negative samples). Note that the newly compiled
   training set is slightly different from the original training set of
   Murali et al., since some genes in Murali et al.’s data set did not
   occur in the GIANT network. Then, we retrained the predictive model
   based on the GIANT network encoding scheme and assessed the performance
   through the same 10 independent runs of 2-fold cross-validation.
   Finally, we compared the corresponding AUC values to roughly assess
   these two methods. Again, our method (AUC = 0.737 and AUPRC = 0.859)
   achieved better performance than Murali et al.’s method (AUC = 0.658
   and AUPRC = 0.732, which were retrieved from reference [113]16),
   further suggesting that the GIANT network-informed HDF discovery is
   very competitive in comparison to Murali et al.’s method.

   Murali et al.’s method and our method can be classified into two
   different types of network-based gene classification. As reported by
   Liu et al. ([114]25), Murali et al.’s method belongs to a class of
   methods referred to as “label propagation,” while our RF-based method
   belongs to another class of methods called “supervised learning.”
   Although “supervised learning” is applied far less frequently than
   “label propagation” for network-based gene discovery ([115]25), we have
   clearly demonstrated the promising performance of the proposed RF model
   in predicting HDFs. Apart from the methodological difference, it is
   also worth mentioning the different choices of negative samples in
   these two methods. Our method used random genes that are not HDFs as
   negative samples, while Murali et al. used essential genes as negative
   samples. Although most of the essential genes are unlikely to be HDFs,
   the unique network properties of essential genes may generate model
   bias. As discussed in the previous section, the randomly selected
   negative samples seem to be more suitable in developing the proposed
   RF-based predictive model.

Important features contributing to the prediction of HDFs.

   In general, the GIANT-based encoding scheme is of high dimensionality
   (i.e., 25,825 dimensions). In order to obtain a more optimized feature
   vector subset, the feature selection algorithm adopted in RF (i.e., the
   Gini algorithm) was conducted to reduce the dimensions to 1,047 (see
   [116]Fig. S3 in the supplemental material). With these 1,047 top-ranked
   features, the corresponding RF model yielded an AUC value of 0.744 in
   the 5-fold cross-validation, which is very close to the performance
   based on the original GIANT-based encodings (AUC = 0.751). Although the
   feature selection did not result in performance improvement, it has
   rendered the model more concise and has allowed us to investigate the
   important features contributing to the prediction. The overlaps among
   the 1,047 genes corresponding to these 1,047 features (i.e., the top
   important genes for prediction), known HDFs, and HIV targets are shown
   in [117]Fig. 3A. Interestingly, the top important genes for prediction
   significantly overlap HIV targets ([118]Fig. 3A; hypergeometric test, P
   = 1.44 × 10^−10). We further examined the top important genes in the
   context of GIANT network. The results showed these top important genes
   tend to be significantly closer to known HDFs/HIV targets in comparison
   to other human proteins ([119]Fig. 3B; Wilcoxon test, P = 7.05 × 10^−8
   and P < 2.2 × 10^−16, respectively). Collectively, these top important
   genes for prediction tend to be known HIV targets or neighbors of known
   HDFs/HIV targets, which may partly explain why the GIANT network is
   informative in distinguishing HDFs from non-HDFs.

FIG 3.

   [120]FIG 3
   [121]Open in a new tab

   The relationships among top important genes for prediction, known HDFs,
   and HIV targets. (A) Venn diagram showing the overlaps among top
   important genes for prediction, known HDFs, and HIV targets. (B) Box
   plots showing the network distance between top important genes for
   prediction and known HDFs/HIV targets. For comparison, 2,000 human
   proteins other than known HDFs or HIV targets were randomly selected
   and compiled as a data set called “others.” Different lowercase letters
   indicate significant differences (P <  0.05), which were determined by
   one-tailed Wilcoxon rank sum test.
   FIG S3

   The performance of the 5-fold cross-validation based on different
   numbers of features. We first ranked the 25,825 features according to
   the corresponding Gini importance scores and further systematically
   investigated the performance based on the top-ranked features ranging
   from 1 to 25,825 with different steps. Briefly, the step was set to 1
   when the number of features was in the range of 1 to 10: the steps were
   set to 10, 100, and 1,000 in the ranges of 10 to 100, 100 to 1,000, and
   1,000 to 25,825, respectively. In general, the feature selection did
   not result in performance improvement, but the overall performance in
   terms of AUC was very close to the final performance when the top 1,000
   features were selected. Moreover, we found that the Gini importance
   scores of the features ranked from 997 to 1,047 were the same. Thus, we
   chose the top 1,047 features as an important feature set for further
   analysis. Download [122]FIG S3, TIF file, 2.1 MB^ (2.2MB, tif) .

   Copyright © 2020 Fu et al.

   This content is distributed under the terms of the [123]Creative
   Commons Attribution 4.0 International license.

Genome-wide screening of HDFs.

   We used the proposed method to conduct genome-wide HDF screening. In
   brief, we used the corresponding five predictive models established by
   the 5-fold cross-validation to screen potential HDFs in the human
   genome. For each human protein, the final predicted score was averaged
   over the corresponding prediction scores from the five predictive
   models. Based on the final prediction scores, we ranked the 24,681
   genes in the human genome, except 1,144 known HDFs. When the
   false-positive rate (FPR) was controlled at 5%, 857 HDF candidates were
   predicted. Note that the threshold corresponding to 5% FPR was
   estimated from the 5-fold cross-validation on the training set (ratio
   of positives to negatives = 1:2). In order to understand the
   characteristics of HDFs more comprehensively, we merged the predicted
   857 HDF candidates and experimentally determined 1,144 HDFs into a data
   set containing 2,001 HDFs (see [124]Data Set S1, sheet 1, in the
   supplemental material), which were collectively referred as HDF
   candidates in the subsequent analysis. It is worth noting that 423 out
   of these 2,001 HDF candidates are known HIV-targets, which is in line
   with previous observations that HDFs and HIV targets are strongly
   intertwined ([125]16, [126]17).
   DATA SET S1

   Sheet 1, list of the 2,001 HDFs. Sheet 2, list of complexes enriched
   with the 2,001 HDFs. Sheet 3, list of complexes enriched with 1,144
   experimentally known HDFs. Sheet 4, the five groups of training and
   independent sets used in this work. Download [127]Data Set S1, XLSX
   file, 0.3 MB^ (322.1KB, xlsx) .

   Copyright © 2020 Fu et al.

   This content is distributed under the terms of the [128]Creative
   Commons Attribution 4.0 International license.

Network analysis of experimentally validated and predicted HDFs.

   To understand the network patterns of HDFs at a larger scale, we
   conducted network topology analyses of these 2,001 HDFs in the context
   of the GIANT network. We measured each HDF’s degree, betweenness,
   closeness centrality, and clustering coefficient in the GIANT network
   ([129]Fig. 4). In brief, each gene in the network was regarded as a
   node and the edge was defined in case two genes are interacting. The
   degree of a gene denotes the number of the edges adjacent to the gene.
   The betweenness of a gene is defined as the proportion of the shortest
   paths between the interacting gene pairs that go through the node of
   interest. The closeness centrality of a gene is defined by the inverse
   of the average length of the shortest paths to all the other genes in
   the network. The clustering coefficient of a gene measures its local
   clustering within the GIANT network, which is defined as the number of
   existing edges between its neighboring genes divided by the maximal
   number of possible edges between its neighboring genes. Compared with
   other human genes, these 2,001 HDFs have significantly higher
   indicators in terms of degree, betweenness, closeness centrality, and
   clustering coefficient ([130]Fig. 4; Wilcoxon test, all P values are
   <2.20 × 10^−16). These network patterns indicated that HDFs are more
   likely to be hubs, bottlenecks, and centrally located in the GIANT
   network, which are very important to perform their functional roles.
   For instance, HDFs can control the information flow between nodes since
   they have many interacting partners and are located in the shortest
   paths between any two genes, which can probably explain why HDFs can
   help viruses effectively infect the host from the perspective of
   network biology.

FIG 4.

   [131]FIG 4
   [132]Open in a new tab

   Comparison of topological parameters among HDF candidates (i.e.,
   predicted and known HDFs), known HDFs, HIV targets, and other human
   proteins. Note that “others” denotes 2,000 randomly selected human
   proteins other than known HDFs or HIV targets. Panels A to D show the
   distributions of degree, betweenness, closeness, and the clustering
   coefficient, respectively. The red diamond stands for the average
   value. Different lowercase letters indicate significant differences (P
   < 0.05) determined by one-tailed Wilcoxon rank sum test.

   For the purpose of comparison, we also calculated the corresponding
   network property distributions for experimentally known HDFs and HIV
   targets. Briefly, the experimentally known HDFs revealed significantly
   different network properties with other proteins or HIV targets
   ([133]Fig. 4; Wilcoxon test, all P values are <2.20 × 10^−16). When the
   predicted HDFs were taken into account, the predicted and known HDFs
   (i.e., the 2,001 HDFs) tended to have similar results for network
   degree, closeness centrality, and clustering coefficient with HIV
   targets ([134]Fig. 4). Compared with Murali et al.’s work, the current
   network analysis further quantified the network property difference
   between HDFs and HIV targets. For instance, the 2,001 HDFs still reveal
   a significantly lower betweenness in comparison to HIV targets
   ([135]Fig. 4B; Wilcoxon test, P = 0.0227), indicating that betweenness
   may serve as a potential indicator to further distinguish HDFs and HIV
   targets.

Functional analysis of HDFs in the context of human complexes.

   Proteins are usually assembled into complexes and act as molecular
   machines to perform their functional roles ([136]31). A protein complex
   contains multiple functionally diversified proteins (subunits).
   Previous studies have shown that viruses regulate the biological
   processes of host cells by manipulating host protein complexes ([137]1,
   [138]7, [139]32). To conduct a large-scale investigation of HDFs in the
   context of human complexes, we collected all human protein complexes
   from a database of mammalian protein complexes called CORUM ([140]33)
   and calculated the intersection of all HDF candidates and all proteins
   participating in complexes. The results indicated that the intersection
   is significant (hypergeometric test, P = 7.20 × 10^−223), indicating
   that protein complexes are more likely to contain HDFs than randomly
   selected proteins ([141]Fig. 5A). The preference for HDFs allows
   viruses to be more efficient in manipulating the corresponding
   complexes. Note that the experimentally known HDFs were also observed
   to significantly overlap proteins participating in complexes
   ([142]Fig. 5B; hypergeometric test, P = 1.23 × 10^−62).

FIG 5.

   [143]FIG 5
   [144]Open in a new tab

   Venn diagrams showing the overlaps among HDFs, proteins in complexes,
   and HIV targets. (A) HDF candidates cover predicted and known HDFs. (B)
   HDFs only account for known HDFs.

   Moreover, Fisher’s exact test was used to calculate the significance of
   complexes enriched with HDF candidates. The inferred P values were
   further corrected to q values (false-discovery rate) by the Benjamini
   and Hochberg method ([145]34). In total, 585 of 2,824 complexes were
   observed to be significantly enriched with HDFs (q < 0.05). It is worth
   mentioning that 348 of these 585 complexes are also enriched with HIV
   targets ([146]Data Set S1, sheet 2), further suggesting that HDFs and
   HIV targets are intertwined. The top 20 complexes enriched with HDFs
   are listed in [147]Table 3, and all of them are enriched with HIV
   targets as well. For comparison, we only detected 53 complexes enriched
   with experimentally known HDFs ([148]Data Set S1, sheet 3), which is
   far less than the number of complexes enriched with the 2,001 HDF
   candidates. It is worth mentioning that the number of enriched small
   complexes (i.e., those with ≤5 subunit members) was dramatically
   decreased when only taking the experimentally known HDFs into account.
   For instance, 348 small complexes were enriched with the 2,001 HDFs,
   while the number is only 10 for the experimentally known HDFs. We hope
   the incorporation of newly predicted HDFs can allow us to catch the
   relationship between HDFs and host complexes more comprehensively.

TABLE 3.

   Top 20 protein complexes enriched with HDFs
   Complex ID Complex name No. of proteins No. of HDFs q value No. of HIV
   targets
   351 Spliceosome 143 81 2.50 × 10^−49 52
   1181 C complex spliceosome 80 46 4.62 × 10^−28 35
   193 PA700-20S-PA28 complex 36 28 2.13 × 10^−22 20
   181 26S proteasome 22 18 7.46 × 10^−15 18
   2825 BRCA1-RNA polymerase II complex 26 19 3.29 × 10^−14 22
   103 RNA polymerase II holoenzyme complex 24 18 7.88 × 10^−14 22
   2685 RNA polymerase II (RNAPII) 17 14 1.23 × 10^−11 16
   2755 17S U2 snRNP 33 19 1.29 × 10^−11 15
   32 PA700 complex 20 15 1.29 × 10^−11 7
   1332 Large Drosha complex 20 15 1.29 × 10^−11 14
   194 PA28gamma-20S proteasome 15 13 1.61 × 10^−11 14
   1183 CDC5L complex 30 18 1.61 × 10^−11 14
   2686 BRCA1-core RNA polymerase II complex 13 12 2.47 × 10^−11 12
   192 PA28-20S proteasome 16 13 6.56 × 10^−11 13
   191 20S proteasome 14 12 1.39 × 10^−10 13
   104 RNA polymerase II core complex 12 11 2.40 × 10^−10 12
   1335 SNW1 complex 18 12 1.87 × 10^−8 12
   728 CSA-POLIIa complex 13 10 5.72 × 10^−8 11
   3040 Multisynthetase complex 11 9 1.45 × 10^−7 11
   726 DDB2 complex 12 9 4.90 × 10^−7 11
   [149]Open in a new tab

   Indeed, the majority of the top 20 complexes enriched with the 2,001
   HDFs are consistent with previous observations regarding the functional
   roles of HDFs associated with HIV-1 infection, which are exemplified as
   follows. For instance, HDFs are significantly presented in the
   spliceosome complex (q = 2.50 × 10^−49). Of the 143 proteins in the
   spliceosome complex, 81 are HDFs. This suggests that many HDFs regulate
   viral infection by participating in mRNA splicing, which allows HIV-1
   to prevent host downstream immune responses by inhibiting the
   production of the spliceosome. The proteasome is an important component
   of the ATP-dependent proteolytic pathway and regulates the degradation
   of most cellular proteins. It has been common knowledge that the
   proteasome is involved in HIV-1 replication. The proteasome is required
   for the release and maturation of infectious HIV-1 particles ([150]35).
   Thus, HDFs were observed to be enriched in several proteasome-related
   complexes. SNW1 is a highly conserved protein complex associated with
   splicing and transcription. SNW1 is recruited by HIV-1 Tat to
   Tat:P-TEFb:TAR RNA complexes and is involved in Tat transcription by
   recruitment of MYC, MEN1, and TRRAP to the HIV-1 Tat-activated long
   terminal repeat (LTR) promoter, thereby overcoming the suppression of
   transcription elongation by negative elongation factors and stimulating
   transcriptional replication ([151]36). Consistent with Kӧnig et al.’s
   work, we observed the SNW1 complex is enriched with HDFs.

   In addition, we also discovered some complexes whose associations with
   HIV-1 HDFs had been rarely reported. For instance, RNA polymerase II
   catalyzes the transcription of DNA to synthesize mRNA. Obviously, HDFs
   regulate transcription to be primarily involved in the maintenance of
   viral latency, which is a crucial step in the life cycle of HIV-1. In
   the large Drosha complex, 15 of the 20 proteins are HDFs (q
   = 1.29 × 10^−11), in which 11 HDFs are newly predicted and 4 HDFs are
   known HDFs. Drosha, a nuclease of the RNase III family, executes the
   initiation step of microRNA (miRNA) processing in the nucleus as the
   core nuclease, which can cleave primary miRNAs (pri-miRNAs) to release
   pre-miRNAs. Pre-miRNAs are processed into mature miRNAs, which play a
   role in regulating HIV-1 replication and infection
   ([152]37[153]–[154]39).

   Note that Murali et al. also examined the functionality of HDFs by
   seeking the locations of HDFs in the clusters of the human PPI network
   through a network graph clustering algorithm. Although the analysis
   strategy is different from ours, both studies share the same motivation
   of understanding the functional roles of HDFs from complexes or network
   clusters. Interestingly, some common clusters or complexes were
   identified. For instance, the identified spliceosome and proteasome
   complexes in our work are also related to the top 10 clusters highly
   connected with HDFs, as reported in Murali et al.’s work. Taken
   together, the aforementioned functional analysis of HDFs in the context
   of human complexes not only recapitulates known biology regarding
   human-HIV-1 interaction but also provides some hints to interrogate the
   functional roles of HDFs as well as the associated human complexes.

   To complement the complex-based functional analysis of HDFs, we used
   DAVID ([155]40) to perform Gene Ontology (GO) and Kyoto Encyclopedia of
   Genes and Genomes (KEGG) pathway enrichment analyses on the 2,001 HDF
   candidates. Here, we only took the GO category of biological process
   into account. REVIGO ([156]41) was further employed to remove the
   redundancy of enriched GO terms. Likewise, a P value inferred from
   Fisher's exact test was further corrected to the q value by the
   Benjamini and Hochberg method ([157]30). Thus, a total number of 41 GO
   terms were enriched (q < 0.05 [the complete GO terms are available in
   [158]Table S2 in the supplemental material]), the top 20 of which are
   displayed in [159]Fig. 6A. Similarly, 23 enriched KEGG pathways of the
   2,001 HDF candidates were also inferred (q value of <0.05
   [[160]Fig. 6B]). In general, the GO/KEGG enrichment analysis has
   allowed us to understand the biological functions of HDFs more
   completely. For instance, we observed that HDFs are heavily associated
   with the spliceosome, proteasome, RNA polymerase II, and cell cycle
   through the functional annotations of GO/KEGG ([161]Fig. 6A and
   [162]B), which are consistent with previous functional analysis of
   enriched complexes to a large extent.

FIG 6.

   [163]FIG 6
   [164]Open in a new tab

   Functional enrichment analyses of the 2,001 HDF candidates. (A) GO
   enrichment analysis. (B) KEGG pathway enrichment analysis.
   TABLE S2

   The 41 enriched GO terms in the 2,001 HDFs. Download [165]Table S2,
   DOCX file, 0.03 MB^ (27.4KB, docx) .

   Copyright © 2020 Fu et al.

   This content is distributed under the terms of the [166]Creative
   Commons Attribution 4.0 International license.

Web server implementation.

   To facilitate the research community, a simple web server called HDFP
   for managing and searching the 2,001 HIV-1 HDF candidates has been made
   freely accessible at [167]http://zzdlab.com/HDFP. The web server was
   implemented with CentOS 7.4 and MySQL. It can display information about
   these 2,001 HDF candidates, including Entrez IDs, UniProt IDs, gene
   symbols, PubMed IDs, and prediction scores. Users can download all the
   detailed information regarding these 2,001 HDF candidates in an Excel
   or PDF format. Moreover, the source code of the proposed RF model, the
   training data set, and the independent test set used in this work are
   also downloadable through the web server.

Conclusions.

   In this work, we implemented an existing network-based gene discovery
   method to predict new HIV-1 HDF candidates from the GIANT network. The
   interaction scores of gene pairs in the GIANT network were used to
   construct the feature vectors of HDFs/non-HDFs. By applying the RF
   algorithm, we constructed an HDF predictor with reasonably good
   performance. Further comprehensive analyses on the combination set of
   experimentally determined HDFs and genome-wide predicted HDFs not only
   recapitulated the known knowledge regarding HIV-1 HDFs, but also
   provided further insights into the relationship between HDFs and HIV
   targets in the context of the GIANT network. In particular, HDFs
   revealed a significantly lower betweenness than HIV targets, although
   their network properties are generally similar when both experimental
   and predicted HDFs are taken into account. We further observed that the
   HDFs and HIV targets are highly intertwined, and they frequently
   co-occurred at the protein complex level, suggesting that this is an
   important avenue to decipher viral infection from the complexes
   enriched with HDFs. Taken together, our current results demonstrate
   that the GIANT network contains rich information regarding gene
   interactions and thus can be effectively employed for HDF
   identification. We hope the predicted HDF candidates can further guide
   hypothesis-driven experimental efforts to interrogate human–HIV-1
   relationships.

MATERIALS AND METHODS

Data sets.

   We collected 1,144 experimentally determined HIV-1 HDFs in total. A
   total of 868 of these 1,144 HDFs were compiled from three
   high-throughput HDF screening studies, including those by Brass et al.
   ([168]13), König et al. ([169]14), and Zhou et al. ([170]15), which
   constituted the positive samples in our training data set. The
   remaining 276 known HDFs, collected through searching literature
   published from 2008 to 2017, were used to constitute the positive
   samples in the independent test set. Those human proteins other than
   known HDFs were randomly sampled as non-HDFs (negative samples).
   Considering the number of non-HDFs in the human genome should be much
   larger than that of HDFs, the ratio of positive to negative samples was
   set to 1:2 to build/assess a predictive model. Thus, we obtained a
   training data set containing 868 HDFs and 1,736 non-HDFs and an
   independent test set containing 276 HDFs and 552 non-HDFs. Since the
   negative data were much more available, we also repeated the selection
   of negative samples five times to investigate the robustness of model
   performance perturbed by the selection of negative samples. The full
   list of HDFs and non-HDFs in the five groups of training data sets and
   independent test sets is available in [171]Data Set S1, sheet 4, or at
   [172]http://zzdlab.com/HDFP.

   For comparison, we also adopted five different ways to construct
   negative data sets. First, we used DAGs other than known HDFs as
   non-HDFs. To do so, we collected 3,855 DAGs from the OMIM database
   ([173]https://omim.org/). After filtering out DAGs associated with
   known HDFs and HIV, 3,697 DAGs were retained, 2,288 of which were
   randomly selected as non-HDFs. Second, through literature searching we
   collected 5,506 HDFs from 11 other viruses, including influenza A virus
   subtype H1N1, human papillomavirus, dengue virus, hepatitis C virus,
   etc. After removing the same genes as HIV-1 HDFs, 5,365 HDFs from these
   11 other viruses were retained and were randomly selected as negative
   samples. Third, we collected essential genes from three publications
   ([174]42[175]–[176]44). After removing the redundancy, 2,290 essential
   genes were compiled, which were further used to construct negative
   samples. Moreover, proteins with a similar network degree to HDFs in
   the GIANT network were also selected as negative samples. For each HDF,
   we randomly selected two human proteins with similar degrees. By doing
   so, a negative data set was compiled, and the statistical test
   confirmed that the degree distributions between HDFs and the newly
   obtained negatives are similar (Kolmogorov-Smirnov test, P > 0.05).
   Finally, we also randomly chose genes with similar expression levels to
   HDFs to construct negative samples. To this end, we downloaded a set of
   T cell microarray data (accession no. [177]GSE73968) ([178]45) from the
   GEO database ([179]https://www.ncbi.nlm.nih.gov/geo/). The expression
   level for each gene was further averaged by three replicates. For each
   HDF, two genes other than known HDFs but sharing similar expression
   values to the query HDF were randomly selected, and thus a negative
   gene set sharing similar expression levels to HDFs was obtained
   (Kolmogorov-Smirnov test, P > 0.05). Note that the different methods of
   non-HDF selections described above were also repeated five times to
   ensure the robustness of performance comparison.

   Experimentally validated human–HIV-1 PPI data were collected from HPIDB
   2.0 ([180]46). After PPIs containing proteins without UniProt IDs were
   filtered, 1,638 human–HIV-1 PPIs between 1,142 human proteins and 19
   HIV-1 proteins were obtained. We obtained 2,916 original protein
   complexes from CORUM ([181]http://mips.helmholtz-muenchen.de/corum/)
   and filtered out complexes containing less than two subunits or
   complexes whose subunit members had unreviewed UniProt IDs. Thus, 2,824
   complexes were retained. Experimentally determined human PPIs were
   collected from BioGRID ([182]47), IntAct ([183]48), and DIP ([184]49).
   In total, 344,703 human PPIs covering 16,745 proteins were obtained to
   compile a human PPI network.

GIANT encoding.

   GIANT provides tissue-specific interaction maps, which can be
   downloaded from [185]http://giant.princeton.edu/. In each
   tissue-specific network, the interaction probability for any gene pair
   is assigned. Considering the principal targets of HIV-1 are T cells, we
   used the T-cell-specific GIANT network to infer the feature vectors of
   HDFs and non-HDFs. For each HDF/non-HDF, the interaction probabilities
   with the 25,825 genes in the network were extracted to constitute the
   corresponding feature encoding. Thus, each HDF/non-HDF can be converted
   into a 25,825-dimensional feature vector.

Machine learning algorithms.

   In this work, we trained our predictive models through five commonly
   used machine learning algorithms (RF, SVM, LR, KNN, and NB), which were
   implemented in Python with the package scikit-learn ([186]50). RF is an
   ensemble machine learning algorithm, which creates a forest of random
   uncorrelated decision trees to achieve the best possible result. SVM
   implements classification by mapping low-dimensional-input features
   into a high-dimensional space through a kernel function. LR is a
   generalized linear model, which constructs a regression model to
   estimate the probability of a binary classification by considering the
   relationships among multiple independent variables. The core idea of
   KNN is that if the majority of the k most neighboring genes in a
   feature space belong to a certain category, the query sample should
   also belong to this category. NB is a Bayes theorem-based algorithm
   with independent assumptions among input features. Here, we used
   Gaussian NB to allow training models with noninteger input features
   ([187]51). We utilized MinMaxScaler in scikit-learn to conduct
   feature-wise standardization on the training data and applied the same
   transformation on the test set. In each algorithm, the most commonly
   used parameters were optimized through 5-fold cross-validation, while
   the other parameters were set as the default. More details about the
   parameter selection and optimization are available in [188]Table S3 in
   the supplemental material.
   TABLE S3

   Parameter selection and optimization in different algorithms. Download
   [189]Table S3, DOCX file, 0.02 MB^ (21.3KB, docx) .

   Copyright © 2020 Fu et al.

   This content is distributed under the terms of the [190]Creative
   Commons Attribution 4.0 International license.

Performance assessment.

   In this work, a 5-fold cross-validation and an independent test were
   employed to assess the predictive models. We used ROC curves to
   characterize the performance of our predictive model and further
   quantified the overall performance by the AUC value ([191]52). In the
   meantime, the PR curve and the corresponding AUPRC value were also used
   to estimate the performance, which is commonly employed when the
   positive and negative samples are imbalanced. Briefly, an ROC curve
   plots a true-positive rate (TPR) against the FPR at different
   thresholds, whereas a PR curve plots precision values at different
   recall controls. The definitions of TPR (i.e., sensitivity or recall),
   FPR (i.e., 1 − specificity), and precision are as follows:
   [MATH:
   <mrow><mtext>TPR</mtext><mo>=</mo><mtext>sensitivity</mtext><mo>=</mo><
   mtext>recall</mtext><mo>=</mo><mfrac><mrow><mtext>TP</mtext></mrow><mro
   w><mtext>TP + FN</mtext></mrow></mfrac></mrow> :MATH]
   [MATH:
   <mrow><mtext>FPR</mtext><mo>=</mo><mfrac><mrow><mtext>FP</mtext></mrow>
   <mrow><mtext>TN +
   FP</mtext></mrow></mfrac><mo>=</mo><mn>1</mn><mo>−</mo><mfrac><mrow><mt
   ext>TN</mtext></mrow><mrow><mtext>TN +
   FP</mtext></mrow></mfrac><mo>=</mo><mn>1</mn><mo>−</mo><mtext>specifici
   ty</mtext></mrow> :MATH]
   [MATH:
   <mrow><mtext>precision</mtext><mo>=</mo><mfrac><mrow><mtext>TP</mtext><
   /mrow><mrow><mtext>TP +
   FP</mtext></mrow></mfrac><mtext> </mtext></mrow> :MATH]

   where TP, FP, TN, and FN denote the number of true-positive,
   false-positive, true-negative, and false-negative instances,
   respectively. In general, the closer the value of AUC/AUPRC is to 1,
   the more powerful the predictive performance is. All ROC/PR curves were
   generated by the ROCR package in R ([192]53).

Implementation of the SinkSource algorithm.

   To compare our method with Murali et al.’s work, we implemented the
   SinkSource algorithm through a Python script by following the
   methodological details reported in references [193]16 and [194]54.