Abstract

Background: Interpretability is a topical question in recommender systems, especially in healthcare applications. An interpretable classifier quantifies the importance of each input feature for the predicted item-user association in a non-ambiguous fashion.

Results: We introduce the novel Joint Embedding Learning-classifier for improved Interpretability (JELI). By combining the training of a structured collaborative-filtering classifier and an embedding learning task, JELI predicts new user-item associations based on jointly learned item and user embeddings, while providing feature-wise importance scores. JELI therefore flexibly allows the introduction of priors on the connections between users, items, and features. In particular, JELI simultaneously (a) learns feature, item, and user embeddings; (b) predicts new item-user associations; and (c) provides importance scores for each feature. Moreover, JELI instantiates a generic approach to training recommender systems by encoding generic graph-regularization constraints.

Conclusions: First, we show that the joint training approach yields a gain in the predictive power of the downstream classifier. Second, JELI can recover feature-association dependencies. Finally, JELI requires fewer parameters than the baselines on synthetic and drug-repurposing data sets.

Keywords: Drug repurposing, Interpretability, Gene expression, Collaborative filtering

Background

The Netflix Challenge [1] popularized collaborative filtering, where connections between items and users are inferred based on the guilt-by-association principle and similarities. This approach is particularly suitable for use cases where information about known user-item associations is sparse (typically, close to 99% of all possible user-item associations are unlabelled, as in the MovieLens movie recommendation data set [2]) and where feedback is implicit. For instance, in the case of movie recommendations on streaming platforms or online advertising, the algorithm often only has access to clicks, that is, positive feedback. However, the reasons for ignoring an item can be numerous: either the item would straightforwardly receive negative feedback, or the item is too far from the user's usual exploration zone but could still be enjoyed. In some rare cases, true negative feedback might be accessible, but in even smaller numbers than the positive associations; in drug repurposing data sets, for instance, failed Phase III clinical trials provide such negatives [3]. Collaborative filtering algorithms then model a user's behavior based on their similarity to other users and the similarity of the candidate item to other items positively graded by this cluster of users. Several types of algorithms implement collaborative filtering. For instance, matrix factorizations [4, 5], such as Non-negative Matrix Factorization (NMF) [6] or Singular Value Decomposition (SVD) [7], decompose the matrix of item-user associations into a product of two low-rank tensors. Other types of algorithms are (deep) neural networks [8–10], which build item and user embeddings with convolutional or graph neural networks based on common associations and/or additional feature values. On the one hand, among the latter approaches, graph-based methods, which integrate and infer edges between features, items, and users, seem promising in terms of performance [11].
Predictions are supported by establishing complex connections between those entities. Conversely, matrix factorizations incorporate explicit interpretability, as one can try to connect the inferred latent factors to specific user and item features. One example is the factorization machine (FM) [12], which combines a linear regression-like term and a feature pairwise interaction term to output a score for binary classification. The learned coefficients of the FM explicitly contribute to the score for each item and user feature set. This type of interpretability, called feature attribution in the literature [13–16], allows further downstream statistical analysis of the feature interactions. For instance, in our motivating example of drug repurposing, the objective is to identify novel drug-disease therapeutic associations. If features are genes mutated by the pathology or targeted by the chemical compound, the overrepresented biological pathways among those that are respectively affected or repaired can be retrieved based on the set of key repurposing genes. This, in turn, offers important arguments in favor of the therapeutic value of a drug-disease indication and for further development towards marketing.

In this work, we aim to combine the performance and versatility (in terms of embeddings) of graph-based collaborative filtering with the explicit interpretability of factorization machines, to derive a "best-of-both-worlds" approach for predicting user-item associations. To achieve this, we introduce a special class of factorization machines that leverages a strong hypothesis on the structure of item and user embeddings as functions of feature embeddings. This classifier is then jointly trained with a knowledge graph completion task. The knowledge graph connects items, users, and features based on the similarity between items and between users, and on potentially additional priors on their relationships with features. The embeddings used to compute the edge probability scores in the knowledge graph are shared with the factorization machine, which allows the distillation of generic priors into the classifier.

Our paper is structured as follows. In Sect. "Related work", we give an overview of the state of the art on factorization machines and knowledge graphs and discuss how their combination might overcome some topical questions in the field. Section "Methods" introduces the JELI algorithm, which features our novel class of structured factorization machines and a joint training strategy with a knowledge graph. Finally, Sect. "Results" shows the performance and interpretability of the JELI approach on both synthetic data sets and drug repurposing applications.

Notation

For any matrix M (in capital letters), we denote by $M_{i,:}$, $M_{:,j}$, and $M_{i,j}$ respectively its $i$-th row, $j$-th column, and coefficient at position (i, j). For any vector $\mathbf{v}$ (in bold type), $\mathbf{v}_i$ is its $i$-th coefficient. Moreover, $M^\dagger$ denotes the pseudo-inverse of matrix M.

Related work

Our approach, JELI, leverages a generic knowledge graph completion task and the interpretability of factorization machines to derive a novel, explainable collaborative filtering approach.

Knowledge graph embedding learning

A knowledge graph is a set of triplets of the form (h, r, t) such that the head entity h is linked to the tail entity t by the relation r [17].
Entity and relation embeddings learned on the graph capture the structure and connections of the graph in numerical form: the embeddings are parameters of a function predicting the presence of a triplet in the graph, and those parameters are learned from the current set of edges. This approach encodes the graph structure into numerical representations, which can later be provided to a downstream regression model [18]. The edge prediction function is usually called the interaction model. Many exist [19–22]; among them, the Multi-Relational Euclidean (MuRE) model [23] is defined for any triplet (h, r, t) with respective embeddings $e_h, e_r, e_t$ of dimension d as

$$\mathrm{MuRE}(e_h, e_r, e_t) = -\| R_r e_h - (e_t + e_r) \|_2^2 + b_h + b_t, \quad (1)$$

where the $d \times d$ matrix $R_r$ and the scalars $b_h$ and $b_t$ are respectively relation-, head-, and tail-specific parameters. Notably, this interaction model has exhibited good embedding engineering properties throughout the literature [24, 25].

Yet, many challenges remain in this field of research. Current representation learning algorithms (regardless of the selected interaction model between a triplet and its embeddings) infer representations directly on the nodes and relations of the graph. However, this approach does not make it possible to establish a relationship between the nodes other than a similarity at the level of the numerical representations of neighboring nodes for specific relations in the graph. That is, specific logical operations depending on the relation are often ignored: for instance, for a relation r and its opposite $\neg r$, we would like to ensure that the score p assigned to triplet (h, r, t) is proportional to $-\bar{p}$, where $\bar{p}$ is the score associated with triplet $(h, \neg r, t)$. Moreover, knowledge graphs are currently better suited to categorical information, where entities and relationships take discrete rather than numerical values. Numerical values could describe a relation such as "users from this specific age group are twice as interested in that movie genre". Some recent works focus on integrating numerical values into knowledge graph embeddings. In KEN embeddings [26], a single-layer neural network is trained for each numeric relation, taking the attribute value as input and returning an embedding. Another approach, TransEA [27], optimizes a loss function that linearly combines, through a hyperparameter, a loss on the categorical variables (the difference between the scores and the indicator of the presence of a triplet) and a loss on numerical variables, which seeks to minimize the gap between the variable and a scalar product involving its embedding. However, these two approaches add several additional hyperparameters and do not deal with interpretability.

Resorting to knowledge-graph-infused embeddings allows us to integrate prior knowledge constraints generically into the representations of entities, both items and users. We aim to enforce a structure on those embeddings to guarantee good prediction of user-item associations, by incorporating those embeddings into a special type of factorization machine.
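To make the MuRE interaction model of Eq. (1) concrete before moving on, here is a minimal NumPy sketch of the score computation; the random relation matrix, embeddings, and biases are illustrative placeholders rather than values from any trained model.

```python
import numpy as np

def mure_score(e_h, e_r, e_t, R_r, b_h, b_t):
    """MuRE interaction score for a triplet (h, r, t), following Eq. (1).

    e_h, e_r, e_t : (d,) embeddings of the head, relation, and tail
    R_r           : (d, d) relation-specific matrix
    b_h, b_t      : head- and tail-specific scalar biases
    """
    residual = R_r @ e_h - (e_t + e_r)
    return -np.dot(residual, residual) + b_h + b_t

# Toy usage: a higher score means the triplet is deemed more plausible.
d = 4
rng = np.random.default_rng(0)
e_h, e_r, e_t = rng.normal(size=(3, d))
R_r = rng.normal(size=(d, d))
print(mure_score(e_h, e_r, e_t, R_r, b_h=0.1, b_t=-0.2))
```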
Factorization machines

Factorization machines are a type of collaborative filtering algorithm introduced by [12]. Their most common expression, the second-order factorization machine of dimension d, comprises a linear regression term (with a possibly non-zero intercept) and a term that combines interactions from all distinct pairs of features through a scalar product of their corresponding low-rank latent vectors of dimension d. This approach, particularly in the presence of sparse feature vectors, is computationally efficient while remaining performant on a variety of recommendation tasks, such as knowledge tracing in education [28] and click-through rate prediction [29]. Computationally tractable evaluation and training routines were first proposed by [30] for higher-order factorization machines (HOFMs), which were also introduced in [12] and include interactions from all distinct sets of K features, where $K \geq 2$, opening the way to even finer classification models. The definition of HOFMs is recalled in Definition 1.

Definition 1 (Higher-Order Factorization Machines (HOFMs)). Let $F \subseteq \mathbb{N}$ denote the set of available item and user features. The general expression for a HOFM [12, 30] of order $m \geq 2$ and dimensions $d_2, \dots, d_m$ that takes as input a single feature vector $\mathbf{x} \in \mathbb{R}^{|F|}$ is a model with parameters $\theta = (\omega_0, \omega^1, \omega^2, \dots, \omega^m)$, where $(\omega_0, \omega^1) \in \mathbb{R} \times \mathbb{R}^{|F|}$ and, for any $i \in \{2, \dots, m\}$, $\omega^i \in \mathbb{R}^{|F| \times d_i}$, such that

$$\mathrm{HOFM}_\theta(\mathbf{x}) \triangleq \omega_0 + (\omega^1)^\top \mathbf{x} + \sum_{2 \leq t \leq m} \; \sum_{\substack{f_1 < \dots < f_t \\ f_1, \dots, f_t \in F}} \langle \omega^t_{f_1,:}, \dots, \omega^t_{f_t,:} \rangle \; x_{f_1} \cdot x_{f_2} \cdots x_{f_{t-1}} \cdot x_{f_t}, \quad (2)$$

where $\langle \omega^t_{f_1,:}, \dots, \omega^t_{f_t,:} \rangle \triangleq \sum_{d' \leq d_t} \omega^t_{f_1, d'} \cdot \omega^t_{f_2, d'} \cdots \omega^t_{f_{t-1}, d'} \cdot \omega^t_{f_t, d'}$ for any t and indices $f_1, \dots, f_t$. In particular, for $m = 2$,

$$\mathrm{FM}_\theta(\mathbf{x}) \triangleq \underbrace{\omega_0 + (\omega^1)^\top \mathbf{x}}_{\text{linear regression term}} + \underbrace{\sum_{\substack{f < f' \\ f, f' \in F}} \langle \omega^2_{f,:}, \omega^2_{f',:} \rangle \; x_f \cdot x_{f'}}_{\text{pairwise interaction term}}. \quad (3)$$

Besides their good predictive power, factorization machines involve explicit coefficients that quantify the contribution of each set of K features to the final score associated with the positive class of associations. These coefficients offer a straightforward insight into the discriminating features for the recommendation problem, and this type of "white-box" explainability is related to a larger research field called feature attribution-based interpretability.
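As a concrete illustration of Eq. (3), the following sketch evaluates a second-order factorization machine score using the usual O(|F| d) rewriting of the pairwise term; the random parameters are placeholders, not a trained model.

```python
import numpy as np

def fm_score(x, w0, w1, W2):
    """Second-order factorization machine score, as in Eq. (3).

    x  : (F,) input feature vector
    w0 : scalar intercept
    w1 : (F,) linear coefficients
    W2 : (F, d) latent factors; <W2[f], W2[f']> weights the pair (f, f')
    """
    linear = w0 + w1 @ x
    # Pairwise term via the identity:
    # sum_{f<f'} <W2[f], W2[f']> x_f x_f'
    #   = 0.5 * (||W2^T x||^2 - sum_f ||W2[f]||^2 x_f^2)
    xz = W2.T @ x
    pairwise = 0.5 * (xz @ xz - np.sum((W2 ** 2).T @ (x ** 2)))
    return linear + pairwise

rng = np.random.default_rng(0)
F, d = 6, 3
x = rng.normal(size=F)
print(fm_score(x, 0.0, rng.normal(size=F), rng.normal(size=(F, d))))
```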
Feature attribution-based interpretability

Given a binary classifier C and a feature vector $\mathbf{x} \in \mathbb{R}^F$, a feature attribution function $\phi_C : \mathbb{R}^F \to \mathbb{R}^F$ returns importance scores for each feature contributing to the positive class score for the input vector $\mathbf{x}$. If the importance score associated with feature f is largely positive (resp., negative), then feature f drives the membership of $\mathbf{x}$ to the positive (resp., negative) class. In contrast, an importance score close to 0 indicates that feature f has little influence on the classification of data point $\mathbf{x}$. Although other types of interpretability approaches exist (based on decision rules given by single classification trees or random forests [31, 32], counterfactual examples [33], or logic rules [34, 35]), importance score-based methods allow going beyond single-feature influence. In particular, the importance scores can be integrated into downstream analyses to statistically quantify the effect of specific groups of features on the classification. For instance, when considering genes as features, an enrichment analysis [36] based on the scores can uncover overrepresented, functionally consistent cell pathways.

Some classifiers, as seen for factorization machines, readily include importance scores, whereas several other approaches compute post-hoc importance scores: the scores are evaluated based on the outputs of an already trained "black-box" classifier, such as a neural network. Such approaches include Shapley values [13], LIME [14], DeepLIFT [37] (for image annotation), or sufficient explanations [38]. Yet, recent works have shown a lack of robustness and consistency across post-hoc feature attribution methods, both empirically [15] and theoretically [16, 39]. The advantage of post-hoc approaches, however, is that they allow the explainability of any type of classifier, combining the richness of the model (predictive performance) and interpretability. The approach described in our paper then aims to encompass any generic embedding model without losing the connection to the initial features of the input vectors to the classifier.

Methods

In this section, we define the JELI algorithm, our main contribution. The full pipeline of JELI is illustrated in Fig. 1. Let us define in formal terms the inputs to the associated problem of recommending $n_i$ items $i_1, i_2, \dots, i_{n_i}$ to $n_u$ users $u_1, u_2, \dots, u_{n_u}$. The minimal input to the recommendation problem is the user-item association matrix $A \in \{-1, 0, +1\}^{n_i \times n_u}$, which summarizes the known positive ($+1$) and possibly negative ($-1$) associations and denotes unknown associations by zeroes. In simple terms, recommender systems aim to replace zeroes by $\pm 1$ while preserving the labels of the nonzero-valued associations. Second, in some cases, we also have access to the respective item and user feature matrices, denoted $S \in \mathbb{R}^{F \times n_i}$ and $P \in \mathbb{R}^{F \times n_u}$. Without loss of generality, we assume that the item and user feature matrices share the same F features $f_1, f_2, \dots, f_F$. Finally, there might be a partial graph on some of the items, users, features, and possibly other entities. For instance, such a graph might connect movies, users, and human emotions for movie recommendation [40], or drugs, diseases, pathways, and proteins or genes for drug repurposing [41, 42]. We denote this graph $\mathcal{G} \triangleq (V_{\mathcal{G}}, E_{\mathcal{G}})$, where $V_{\mathcal{G}}$ is the set of nodes in $\mathcal{G}$ and $E_{\mathcal{G}}$ is its set of (undirected, labeled) edges.

Fig. 1: Full pipeline of the JELI algorithm, from the initial inputs to the downstream tasks.

We first introduce the class of higher-order factorization machines, called redundant structured HOFMs, which classify user-item associations based on an assumption on the structure of item/user and feature embeddings.

Redundant structured HOFM (RHOFM)

This subtype of higher-order factorization machines features shared higher-order parameters across interaction orders, such that the corresponding dimensions of the HOFM satisfy $d_2 = \dots = d_m = d$ in Definition 1. As such, RHOFMs are related to the inhomogeneous ANOVA kernel HOFMs (iHOFMs) mentioned in [30]. This type of factorization machine is such that the higher-order dimensions are all equal (that is, $d_2 = \dots = d_m = d$) and the corresponding higher-order coefficients are all proportional to one another: for any $t, t' \geq 2$ and $f \in F$, there exists $c \in \mathbb{R}$ such that $\omega^t_{f,:} = c \cdot \omega^{t'}_{f,:}$ in Definition 1.
However, what distinguishes the RHOFM from an iHOFM is the following hypothesis on structure: it is assumed that every entity d-dimensional embedding $\mathbf{e} \in \mathbb{R}^d$ results from some function $s_W$ with parameter $W \in \mathbb{R}^{F \times d}$ applied to the corresponding entity feature vector $\mathbf{x} \in \mathbb{R}^F$. For instance, an embedding $\mathbf{e}$ associated with feature vector $\mathbf{x}$ under a linear structure function of dimension d is defined as $\mathbf{e} = s_W(\mathbf{x}) = \mathbf{x}^\top W$. However, any, possibly non-linear, structure function $s_W$ can be considered. Note that, for completeness, we can also define a feature vector for features themselves, which is simply the result of the indicator function on features in F: for feature $f \in F$, its corresponding feature vector is $\mathbf{x}^f \triangleq (\delta(f_j = f))_{j \leq F}$, where $\delta$ is the Kronecker symbol, so that the structure function $s_W$ can be applied to any item, user, or feature entity. Definition 2 gives the formal expression of RHOFMs for any order, dimension, and structure.

Definition 2 (Redundant structured HOFMs (RHOFMs)). The RHOFM of structure $s_W$, order m and dimension d, with parameters $\theta = (\omega_0, \omega^1, \omega^{2:m}, W) \in \mathbb{R} \times \mathbb{R}^d \times \mathbb{R}^{m-1} \times \mathbb{R}^{F \times d}$, applied to an item and a user with respective feature vectors $\mathbf{x}^i, \mathbf{x}^u \in \mathbb{R}^F$, is defined as

$$\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u) \triangleq \omega_0 + (\omega^1)^\top [\widetilde{W}^{iu}_\lambda; \widetilde{W}^{iu}_\lambda]^\top \mathbf{x}^{iu} + \sum_{2 \leq t \leq m} \omega^{2:m}_{t-1} \sum_{\substack{f_1 < \dots < f_t \\ f_1, \dots, f_t \leq 2F}} \langle [\widetilde{W}^{iu}_\lambda; \widetilde{W}^{iu}_\lambda]_{f_1,:}, \dots, [\widetilde{W}^{iu}_\lambda; \widetilde{W}^{iu}_\lambda]_{f_t,:} \rangle \; x^{iu}_{f_1} \cdot x^{iu}_{f_2} \cdots x^{iu}_{f_t}, \quad (4)$$

where $[\,\cdot\,;\,\cdot\,]$ denotes row-wise stacking, $\mathbf{x}^{iu} \triangleq [(\mathbf{x}^i)^\top, (\mathbf{x}^u)^\top]^\top \in \mathbb{R}^{2F}$ is the concatenation of the feature vectors along the row dimension, $\widetilde{\mathbf{x}}^{iu} \triangleq [\mathbf{x}^i, \mathbf{x}^u] \in \mathbb{R}^{F \times 2}$ is the concatenation along the column dimension, and $\widetilde{W}^{iu}_\lambda \triangleq (\widetilde{\mathbf{x}}^{iu} (\widetilde{\mathbf{x}}^{iu})^\top + \lambda I_F)^\dagger \, \widetilde{\mathbf{x}}^{iu} \, [s_W(\mathbf{x}^i), s_W(\mathbf{x}^u)]^\top \in \mathbb{R}^{F \times d}$ is the $\lambda$-regularized approximate least squares estimator, with $\lambda \geq 0$, of V in the following equation: $s_W(\widetilde{\mathbf{x}}^{iu}) = (\widetilde{\mathbf{x}}^{iu})^\top V$.

By reordering terms and by definition of $\widetilde{W}^{iu}_\lambda$ (full details in the Appendix), if we denote by $f \% F$ the remainder of the Euclidean division of f by F, we can notice that

$$\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u) \approx \omega_0 + (\omega^1)^\top (s_W(\mathbf{x}^i) + s_W(\mathbf{x}^u)) + \sum_{2 \leq t \leq m} \omega^{2:m}_{t-1} \sum_{\substack{f_1 < \dots < f_t \\ f_1, \dots, f_t \leq 2F}} \langle x^{iu}_{f_1} s_W(\mathbf{x}^{f_1 \% F}), \dots, x^{iu}_{f_t} s_W(\mathbf{x}^{f_t \% F}) \rangle. \quad (5)$$

In particular, for $m = 2$, $\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u)$ is roughly equal to

$$\omega_0 + (\omega^1)^\top (s_W(\mathbf{x}^i) + s_W(\mathbf{x}^u)) + \omega^{2:m} \sum_{\substack{f_1 < f_2 \\ f_1, f_2 \leq 2F}} \langle x^{iu}_{f_1} s_W(\mathbf{x}^{f_1 \% F}), x^{iu}_{f_2} s_W(\mathbf{x}^{f_2 \% F}) \rangle. \quad (6)$$

Compared to the expression of a factorization machine for $m = 2$ in Eq. (3), the RHOFM includes a structure that can be non-linear (through the function $s_W$) and a supplementary degree of freedom with parameters $\omega^1$ and $\omega^{2:m}$. The RHOFM then comprises a term linear in the item/user embeddings and a product of feature embeddings weighted by the corresponding values in the initial item and user feature vectors. Moreover, if we assume a linear structure on the RHOFM, the embedding vector for feature $f_j$ is exactly $W_{f_j,:}$, and the embeddings for items and users are the sums of the feature embeddings weighted by their corresponding values in the item and user vectors. The expression in Definition 2 is relatively computationally efficient when combined with the dynamic programming routines described in [30].
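To illustrate the structure hypothesis of Definition 2 in the linear case described above, here is a minimal sketch (with placeholder random matrices) showing that, under $s_W(\mathbf{x}) = \mathbf{x}^\top W$, feature embeddings are exactly rows of W while item and user embeddings are feature-value-weighted sums of those rows.

```python
import numpy as np

def linear_structure(x, W):
    """Linear structure function s_W: maps an entity feature vector to its embedding."""
    return x @ W  # (F,) @ (F, d) -> (d,)

rng = np.random.default_rng(0)
F, d = 10, 2
W = rng.normal(size=(F, d))            # shared feature-embedding parameter

x_item = rng.normal(size=F)            # item feature vector (e.g., a column of S)
x_user = rng.normal(size=F)            # user feature vector (e.g., a column of P)
x_feat = np.eye(F)[3]                  # indicator feature vector of feature f_4

e_item = linear_structure(x_item, W)   # weighted sum of feature embeddings
e_user = linear_structure(x_user, W)
e_feat = linear_structure(x_feat, W)   # equals W[3, :], the embedding of f_4
print(np.allclose(e_feat, W[3]))       # True under the linear structure
```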
Moreover, the redundancy in the RHOFM allows it to benefit from the same type of computational speed-up as inhomogeneous ANOVA kernels or iHOFMs. Knowing that HOFMs (in Definition 1) and iHOFMs would take as input the concatenation along the row dimension of $(\mathbf{x}^i, \mathbf{x}^u)$, and assuming that the dimensions across interaction orders are the same, i.e., $d_2 = \dots = d_m = d$, HOFMs comprise $1 + 2F + 2Fd(m-1)$ parameters, which can incur a prohibitive computational cost in practice. Similarly, iHOFMs would require the training of $1 + m + 2Fd$ parameters, whereas RHOFMs (in Definition 2) only feature $1 + m + (F+1)d$, hence removing the multiplicative constant in front of the number of features F; this matters for high-dimensional data sets such as the TRANSCRIPT drug repurposing data set [43], which gathers values for about 12,000 genes across the human genome.

Regarding interpretability, as evidenced by Eq. (5), the coefficients involved in the expression of the RHOFM are straightforwardly connected to the input embeddings. In the case of the linear structure, and when $\omega^1 = \mathbf{1}_d$ and $\omega^{2:m} = \mathbf{1}_{m-1}$ (or any other constant), the contributions from the features on the one hand and from the item/user values on the other hand can easily be disentangled. In that case, $\widetilde{W}^{iu}_\lambda \approx W$, and then, for any feature f, the intrinsic (i.e., independent of users or items) importance score is $\sum_{k \leq d} W_{f,k}$. When associated with an entity (item or user) of feature vector $\mathbf{x} \in \mathbb{R}^F$, its importance score is simply $x_f \sum_{k \leq d} W_{f,k}$. Using $(\widetilde{\mathbf{x}}^{iu})^\top \widetilde{W}^{iu}_\lambda \approx s_W(\widetilde{\mathbf{x}}^{iu})$ for non-linear structures, we can extrapolate this result to obtain the following intrinsic feature importance score.

Result 1 (Feature importance scores in an RHOFM). When $\omega^1 = \mathbf{1}_d$ and $\omega^{2:m} = \mathbf{1}_{m-1}$ (or any other constant), the intrinsic (entity-independent) feature importance score for feature $f \in F$ in an RHOFM (Definition 2) is $\sum_{k \leq d} (\widetilde{W}^{iu}_\lambda)_{f,k}$. As a consequence, the feature attribution function associated with feature vector $\mathbf{x} \in \mathbb{R}^F$ is $\phi_{\mathrm{RHOFM}}(\mathbf{x}) \triangleq \left(x_f \sum_{k \leq d} (\widetilde{W}^{iu}_\lambda)_{f,k}\right)_{f \in F}$.
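A minimal sketch of the scores in Result 1 under the linear structure, where $\widetilde{W}^{iu}_\lambda \approx W$; the matrix W and the entity feature vector are random placeholders.

```python
import numpy as np

def intrinsic_importance(W):
    """Intrinsic (entity-independent) importance score per feature: sum_k W[f, k]."""
    return W.sum(axis=1)  # (F,)

def feature_attribution(x, W):
    """Entity-level attribution phi(x): x_f * sum_k W[f, k] for each feature f."""
    return x * intrinsic_importance(W)

rng = np.random.default_rng(0)
F, d = 10, 2
W = rng.normal(size=(F, d))
x_entity = rng.normal(size=F)

print(intrinsic_importance(W))            # one score per feature
print(feature_attribution(x_entity, W))   # feature importance for this entity
```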
One could infer the RHOFM parameters by directly minimizing a loss function. However, as mentioned in the introduction, we would like to distil some prior knowledge into the RHOFM, for instance via a knowledge graph specific to the recommendation use case. By seeing the feature embeddings in the RHOFM as node embeddings in a knowledge graph, the next section describes how to jointly train the RHOFM and the feature embeddings on a knowledge graph completion task.

Joint training of the RHOFM and the knowledge graph embeddings

We leverage the information from the partial graph $\mathcal{G} \triangleq (V_{\mathcal{G}}, E_{\mathcal{G}})$ to fit the RHOFM, by reducing the classification problem to the prediction of a subset of edges in a knowledge graph completion problem. To do so, we first extend the partial graph $\mathcal{G}$, based on the user-item association matrix A and the respective item and user feature matrices S and P, to build a knowledge graph $\mathcal{K} \triangleq (V, T)$ with nine types of relations. Note that the partial graph can possibly be empty or, on the contrary, can include any edge between drugs and features, between diseases and features, and between two features.

Definition 3 (Similarity-based knowledge graph augmented with prior edges). Considering a similarity threshold $\tau \in [0, 1]$ associated with a similarity function $\mathrm{sim}: \mathbb{R}^F \times \mathbb{R}^F \to [-1, 1]$, JELI builds a knowledge graph from the data set (A, P, S) and the partial graph $\mathcal{G} \triangleq (V_{\mathcal{G}}, E_{\mathcal{G}})$ as follows:

$$V \triangleq \{i_1, i_2, \dots, i_{n_i}\} \cup \{u_1, u_2, \dots, u_{n_u}\} \cup \{f_1, f_2, \dots, f_F\}, \quad (7)$$

$$\begin{aligned}
T \triangleq\ & \{(s, \mathrm{prior}, t) \mid (s, t) \in E_{\mathcal{G}},\ s, t \in V\} \\
\cup\ & \{(i_j, -, u_k) \mid A_{i_j, u_k} = -1,\ j \leq n_i,\ k \leq n_u\} \cup \{(i_j, +, u_k) \mid A_{i_j, u_k} = +1,\ j \leq n_i,\ k \leq n_u\} \\
\cup\ & \{(u_j, \mathrm{user\text{-}sim}, u_k) \mid \mathrm{sim}(P_{:,u_j}, P_{:,u_k}) > \tau,\ j, k \leq n_u\} \cup \{(i_j, \mathrm{item\text{-}sim}, i_k) \mid \mathrm{sim}(S_{:,i_j}, S_{:,i_k}) > \tau,\ j, k \leq n_i\} \\
\cup\ & \{(i_j, \mathrm{item\text{-}feat\text{-}pos}, f_k) \mid S_{f_k, i_j} > 0,\ k \leq F,\ j \leq n_i\} \cup \{(i_j, \mathrm{item\text{-}feat\text{-}neg}, f_k) \mid S_{f_k, i_j} < 0,\ k \leq F,\ j \leq n_i\} \\
\cup\ & \{(u_j, \mathrm{user\text{-}feat\text{-}pos}, f_k) \mid P_{f_k, u_j} > 0,\ k \leq F,\ j \leq n_u\} \cup \{(u_j, \mathrm{user\text{-}feat\text{-}neg}, f_k) \mid P_{f_k, u_j} < 0,\ k \leq F,\ j \leq n_u\}. \quad (8)
\end{aligned}$$

The objective of knowledge graph completion is to fit a model predictive of the probability of the presence of a triplet in the knowledge graph. In particular, computing the score associated with triplets of the form $(h, +, t)$, for (h, t) a user-item pair, boils down to fitting a classifier of user-item interactions. Conversely, a straightforward assumption is that the score associated with triplets $(h, +, t)$ should be opposite to the score assigned to triplets $(h, -, t)$. With that in mind, denoting by $\theta$ the set of RHOFM parameters and by $\theta_{\mathrm{JELI}} \triangleq (\theta, \{R_r, r \in \text{relations}\}, \{e_r, r \in \text{relations}\}, \{b_h, h \in V\})$ the total set of parameters to estimate, we define in Eq. (9) the edge score to be maximized for triplets present in the knowledge graph $\mathcal{K}$:

$$\mathrm{score}_{\theta_{\mathrm{JELI}}}(h, r, t) \triangleq \begin{cases} \mathrm{MuRE}(s_W(\mathbf{x}^h), e_r, s_W(\mathbf{x}^t); R_r, b_h, b_t) & \text{if } r \notin \{+, -\}, \\ \mathrm{RHOFM}_\theta(\mathbf{x}^h, \mathbf{x}^t) & \text{if } r = +, \\ -\mathrm{RHOFM}_\theta(\mathbf{x}^h, \mathbf{x}^t) & \text{if } r = -. \end{cases} \quad (9)$$

Remember that the vector $\mathbf{x}^h$ is well-defined for any item, user, or feature h. We then fit the parameter $\theta_{\mathrm{JELI}}$ by minimizing the soft margin ranking loss with margin $\lambda_0 = 1$, whose expression is recalled below:

$$\forall \theta, \quad \mathcal{L}_{\mathrm{margin}}(\theta) \triangleq \sum_{(h, r, t) \in T} \; \sum_{(\bar{h}, r, t) \notin T} \log\left(1 + \exp\left(\lambda_0 - \mathrm{score}_\theta(h, r, t) + \mathrm{score}_\theta(\bar{h}, r, t)\right)\right). \quad (10)$$

Further implementation details and numerical considerations for the training pipeline are available in the Appendix.
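As an illustration of Definition 3, here is a small, simplified sketch of how the similarity-based triplets of Eq. (8) could be assembled from toy inputs; the cosine similarity, the threshold value, and the entity naming scheme are illustrative assumptions, not choices prescribed by JELI.

```python
import numpy as np

def build_triplets(A, S, P, prior_edges=(), tau=0.5):
    """Toy construction of the knowledge graph triplets of Definition 3 (Eq. 8)."""
    n_i, n_u = A.shape
    F = S.shape[0]
    items = [f"i{j}" for j in range(n_i)]
    users = [f"u{k}" for k in range(n_u)]
    feats = [f"f{k}" for k in range(F)]
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    T = [(s, "prior", t) for (s, t) in prior_edges]
    T += [(items[j], "+" if A[j, k] > 0 else "-", users[k])
          for j in range(n_i) for k in range(n_u) if A[j, k] != 0]
    T += [(users[j], "user-sim", users[k])
          for j in range(n_u) for k in range(n_u) if j != k and cos(P[:, j], P[:, k]) > tau]
    T += [(items[j], "item-sim", items[k])
          for j in range(n_i) for k in range(n_i) if j != k and cos(S[:, j], S[:, k]) > tau]
    T += [(items[j], "item-feat-pos" if S[k, j] > 0 else "item-feat-neg", feats[k])
          for j in range(n_i) for k in range(F) if S[k, j] != 0]
    T += [(users[j], "user-feat-pos" if P[k, j] > 0 else "user-feat-neg", feats[k])
          for j in range(n_u) for k in range(F) if P[k, j] != 0]
    return T

rng = np.random.default_rng(0)
A = rng.choice([-1, 0, 1], size=(3, 4), p=[0.1, 0.7, 0.2])
S, P = rng.normal(size=(5, 3)), rng.normal(size=(5, 4))
print(len(build_triplets(A, S, P)))
```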
Downstream tasks with JELI

Interestingly, not only does JELI build embeddings for the items and users available at training time, but it can also be used to produce embeddings for new entities without requiring any retraining step. Given a feature vector $\mathbf{x} \in \mathbb{R}^F$, padded with zeroes on unavailable features if needed, the corresponding embedding is $s_W(\mathbf{x})$. However, the main objective of the trained JELI model is to predict new (positive) user-item associations, possibly for items and users not observed at training time. In that case, for any pair of item and user feature vectors $(\mathbf{x}^i, \mathbf{x}^u) \in \mathbb{R}^F \times \mathbb{R}^F$, the label predicted by JELI with RHOFM parameter $\theta$ is

$$\hat{y}_{\mathrm{JELI}}(\mathbf{x}^i, \mathbf{x}^u) \triangleq \begin{cases} +1 & \text{if } \sigma(\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u)) > 0.5, \\ -1 & \text{otherwise}, \end{cases} \quad (11)$$

where $\sigma$ is the standard sigmoid function. Note that the JELI approach could be made even more generic. Besides accepting any knowledge graph, this joint training approach could feature any classifier, not necessarily an RHOFM, as long as the classifier remains interpretable, as well as any knowledge graph completion loss function or edge score function.

Results

We first validate the performance, the interpretability, and the different components of JELI on synthetic data sets, for which the ground truth on feature importance is available. Then, we apply JELI to drug repurposing, our main motivating example for interpretability in recommendation. Further information about the generation of the synthetic data sets and other numerical details is available in the Appendix. Unless otherwise specified, the order of all factorization machine variants considered (including the RHOFM classifier in JELI) satisfies $m = 2$.

In this section, we consider several evaluation metrics. First, Spearman's rank correlation [45] quantifies the quality of the importance scores. It is computed on the ground truth importance scores $\mathbf{s} \triangleq (\sum_{k \leq d} W_{f,k})_{f \in F}$ and the predicted ones $\hat{\mathbf{s}} \triangleq (\sum_{k \leq d} \hat{W}_{f,k})_{f \in F}$, with $\hat{W}$ the inferred embedding parameter. Second, the Area Under the Curve (AUC) is computed on all user-item pairs to measure the classification performance between the ground truth $A \in \{-1, 0, +1\}^{n_i \times n_u}$ and the classifier scores $\hat{A} \in \mathbb{R}^{n_i \times n_u}$. We also consider the Negative-Sampling AUC (NS-AUC) [44]. Contrary to the AUC, the NS-AUC is a ranking measure akin to an average of user-wise AUCs, giving a more refined quantification of prediction quality across users. As a complementary measure of classification quality, we also consider the Normalized Discounted Cumulative Gain (NDCG), which is proportional to the quality of the ranking of recommended drugs across diseases. Note that all those classification metrics depend solely on the classifier scores, and not on the final class labels that can be inferred by applying a fixed threshold $\tau$. The exact definitions of each metric are reported in Table 1.

Table 1. Description of the performance metrics in Section "Results"

- Spearman's $\rho$ (Spearman's correlation): $1 - 6 \sum_{f \in F} (\Delta_f)^2 / (F(F^2 - 1))$
- AUC (Area Under the Curve): $\int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(\tau; \hat{A}, A); \hat{A}, A)\, d\tau$
- NS-AUC (average NS-AUC [44]): $n_u^{-1} \sum_{u \leq n_u} |\widetilde{\Omega}_u|^{-1} \sum_{(i, i') \in \widetilde{\Omega}_u} \delta(\hat{A}_{i,u} > \hat{A}_{i',u})$
- NDCG (average NDCG@$n_i$): $n_u^{-1} \sum_{u \leq n_u} \left(\sum_{i=1}^{N_u^+} A_{\sigma_u(i), u} / \log_2(i+1)\right) / \left(\sum_{i=1}^{N_u^+} 1 / \log_2(i+1)\right)$

Spearman's $\rho$: $\Delta_f$ is the gap in rank (in decreasing order) between the true and predicted importance scores $\mathbf{s}_f$ and $\hat{\mathbf{s}}_f$ for feature f. AUC: the true positive rate between the ground truth A and the predictions $\hat{A}$ is defined as $\mathrm{TPR}(\tau; \hat{A}, A) = \sum_{(i,u): A_{i,u} = +1} \delta(\hat{A}_{i,u} > \tau) / \sum_{(i,u)} \delta(\hat{A}_{i,u} > \tau)$, the false positive rate is $\mathrm{FPR}(\tau; \hat{A}, A) = \sum_{(i,u): A_{i,u} = -1} \delta(\hat{A}_{i,u} > \tau) / \sum_{(i,u)} \delta(\hat{A}_{i,u} \leq \tau)$, and $\delta$ is the Kronecker symbol. NS-AUC: the set of true positive (respectively, negative) drug-disease associations is $\Omega^{\pm} \triangleq \{(i, u) \mid A_{i,u} = \pm 1,\ i \leq n_i,\ u \leq n_u\}$, the set of positive drugs for disease u is $\Omega^+_u \triangleq \{i \mid A_{i,u} = +1\}$, and the set of drug pairs to be correctly ranked for disease u is $\widetilde{\Omega}_u \triangleq \{(i, i') \mid A_{i,u} > A_{i',u}\}$. NDCG: $\sigma_u$ is the permutation that sorts all recommendation coefficients $\hat{A}_{i,u}, i \leq n_i$ for disease u in decreasing order, that is, $\hat{A}_{\sigma_u(1),u} \geq \hat{A}_{\sigma_u(2),u} \geq \dots \geq \hat{A}_{\sigma_u(n_i),u}$. Finally, $N_u^+$ is defined as $\min(n_i, |\Omega_u^+|)$.
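For concreteness, here is a minimal sketch of the two user-wise ranking metrics of Table 1 (NS-AUC and NDCG at rank $n_i$); it assumes dense ground-truth and score matrices and binary relevance from positive associations, which are simplifying assumptions for illustration.

```python
import numpy as np

def ns_auc(A, A_hat):
    """Average over users of the fraction of correctly ordered item pairs."""
    vals = []
    for u in range(A.shape[1]):
        pairs = [(i, j) for i in range(A.shape[0]) for j in range(A.shape[0])
                 if A[i, u] > A[j, u]]
        if pairs:
            vals.append(np.mean([A_hat[i, u] > A_hat[j, u] for i, j in pairs]))
    return float(np.mean(vals))

def ndcg_at_ni(A, A_hat):
    """Average NDCG at rank n_i, with 0/1 relevance from positive associations."""
    vals = []
    for u in range(A.shape[1]):
        order = np.argsort(-A_hat[:, u])       # items sorted by predicted score
        n_pos = int((A[:, u] == 1).sum())
        if n_pos == 0:
            continue
        k = min(A.shape[0], n_pos)
        gains = (A[order[:k], u] == 1) / np.log2(np.arange(2, k + 2))
        ideal = 1.0 / np.log2(np.arange(2, k + 2))
        vals.append(gains.sum() / ideal.sum())
    return float(np.mean(vals))

rng = np.random.default_rng(0)
A = rng.choice([-1, 0, 1], size=(6, 4))
A_hat = rng.normal(size=(6, 4))
print(ns_auc(A, A_hat), ndcg_at_ni(A, A_hat))
```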
Synthetic data sets

We consider two types of "interpretable" synthetic recommendation data, called "linear first-order" and "linear second-order", for which the ground truth feature importance scores are known. At fixed values of the dimension d, the feature number F, and the numbers of items and users $n_i$ and $n_u$, both item and user feature vectors are drawn at random from a standard Gaussian distribution, along with a matrix $W \in \mathbb{R}^{F \times d}$. In most practical recommendation tasks, the algorithm cannot access the full feature values. The reasons for missing values can be diverse [46], but they most likely follow a not-missing-at-random mechanism, meaning that the probability of a missing value depends on the features. To implement such a mechanism, we applied a slightly adapted Gaussian self-masking [47] to the corresponding item and user feature matrices, such that we expect around 10% of missing feature values. The complete set of user-item scores is obtained from a generating model $g_0 : \mathbb{R}^F \times \mathbb{R}^F \to [0, 1]$. For "first-order" synthetic data sets, $g_0$ is defined as $(\mathbf{x}^i, \mathbf{x}^u) \mapsto \sigma(\sum_{k \leq d} (\mathbf{x}^i + \mathbf{x}^u)^\top W_{:,k}) = \sigma(\mathrm{RHOFM}_{(0, \mathbf{1}_d, \mathbf{0}_{m-1}, W)}(\mathbf{x}^i, \mathbf{x}^u))$, where $\mathbf{x}^i$ and $\mathbf{x}^u$ are respectively the item and user feature vectors. For the "second-order" type, $g_0$ is simply $(\mathbf{x}^i, \mathbf{x}^u) \mapsto \sigma(\mathrm{RHOFM}_{(1, \mathbf{1}_d, \mathbf{1}_{m-1}, W)}(\mathbf{x}^i, \mathbf{x}^u))$, where the order is $m = 2$. In both cases, the corresponding structure function $s_W$ is linear, that is, $s_W(\mathbf{x}) = \mathbf{x}^\top W$, and $\lambda = 0$. Finally, since in practice most of the user-item associations are inaccessible at training time, we label user-item pairs with $-1$, 0, and $+1$ depending on their score, such that the sparsity number, that is, the percentage of unknown values in the association matrix, is equal to a prespecified value greater than 50%.
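A minimal sketch of the "first-order" generating process just described (Gaussian features, linear structure, sigmoid link), under simplifying assumptions: the Gaussian self-masking step is omitted and the labelling rule is a simple quantile threshold chosen to reach the target sparsity.

```python
import numpy as np

def make_first_order_dataset(n_i, n_u, F, d, sparsity=0.65, seed=0):
    """Toy 'first-order' synthetic data set: features, scores, and sparse labels."""
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(F, n_i))               # item features (columns)
    P = rng.normal(size=(F, n_u))               # user features (columns)
    W = rng.normal(size=(F, d))                 # ground-truth feature embeddings

    # First-order generating model: sigma(sum_k (x_i + x_u)^T W[:, k])
    scores = 1 / (1 + np.exp(-((S.T @ W).sum(axis=1)[:, None]
                               + (P.T @ W).sum(axis=1)[None, :])))

    # Label the most extreme (1 - sparsity) fraction of pairs, keep the rest unknown (0)
    A = np.zeros((n_i, n_u), dtype=int)
    cutoff_hi = np.quantile(scores, 1 - (1 - sparsity) / 2)
    cutoff_lo = np.quantile(scores, (1 - sparsity) / 2)
    A[scores >= cutoff_hi] = 1
    A[scores <= cutoff_lo] = -1
    return S, P, W, A

S, P, W, A = make_first_order_dataset(n_i=173, n_u=173, F=10, d=2)
print((A != 0).mean())  # fraction of labelled associations, about 1 - sparsity
```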
JELI is performant for various validation metrics and reliably retrieves ground truth importance scores

We generate 10 synthetic data sets of each type ($F = 10$, $d = 2$, $n_i = n_u = 173$) and run JELI 100 times with different random seeds corresponding to different training/testing splits. Table 2 shows the numerical results across those $10 \times 100$ runs for several validation metrics on the predicted item-user associations and feature importance scores.

Table 2. Average validation metrics with standard deviations across 100 iterations and 10 synthetic data sets of each type (total number of values: 1,000)

Data set type   AUC           NS-AUC        Spearman's $\rho$
First-order     0.99 ± 0.013  0.89 ± 0.124  0.83 ± 0.279
Second-order    0.98 ± 0.019  0.86 ± 0.167  0.75 ± 0.363

Average (respectively, standard deviation) values are rounded to the closest second (resp., third) decimal place. AUC: Area Under the Curve. NS-AUC: Negative-Sampling AUC [44]. Spearman's $\rho$: Spearman's rank correlation.

Although there is a large variation in prediction quality due to the random training/testing splits, the metrics in Table 2 show a high predictive power for JELI, along with a consistently high correlation between true and predicted feature importance scores: when considering the best value across the 100 iterations, the average Spearman's rank correlation for the best-trained models across all 10 data sets is 0.932 for "first-order" sets and 0.932 for "second-order" ones. The bar plots of the ground truth and predicted importance scores for each of these 10 sets and each type of synthetic data in Fig. 2 show that JELI preserves the global trend in importance scores across data sets.

We also tested the impact of the dimension parameter d and of the order m of the RHOFM on the accuracy metrics. In the previous experiments, we used $d = 2$, which is the true dimensionality of the underlying generating model. However, it appears that JELI is also robust to the choice of the dimension parameter, for all metrics, provided the dimension is large enough. Moreover, similarly to higher-order factorization machines, higher-order interactions ($m > 2$) yield a more expressive classifier model and, thus, better classification performance. However, this improvement comes at a heavy computational price, even with the dynamic programming routines of [30], where the time complexity is linear in m. The experiments and results on parameter impact can be found in Appendix 4.

Fig. 2: Barplots of the true and predicted feature importance scores for the $F = 10$ features in each synthetic data set, for the best-performing model across 100 iterations. Top two rows: "first-order" synthetic data. Bottom two rows: "second-order" synthetic data.

JELI is robust in synthetic data sets across sparsity numbers

We also compare the predictive performance of JELI with that of embedding-based recommender systems from the state of the art, namely the fast.ai collaborative learner [8], the heterogeneous attention network (HAN) algorithm [48], and the neural inductive matrix completion with graph convolutional network (NIMCGCN) [10]. We set, whenever appropriate, the same hyperparameter values for all algorithms (with $d = 2$). We run each algorithm with 100 different random seeds on 5 "first-order" synthetic data sets generated with sparsity numbers in {50%, 65%, 80%}, that is, 500 runs per sparsity value. Figure 3 and Table 3 report the boxplots and the confidence intervals for the corresponding validation metrics. In addition to the AUC and NS-AUC, we include the Normalized Discounted Cumulative Gain (NDCG), computed for each user at rank $n_i$ (the number of items) and averaged across users, as a counterpart to the NS-AUC measure.

Fig. 3: NS-AUC values across "first-order" synthetic data sets for the three sparsity numbers and 500 iterations, for JELI and state-of-the-art embedding-based recommender systems.

Table 3. Average metrics with standard deviations across 100 iterations and 5 "first-order" sets

Sparsity  Method   AUC         NS-AUC      NDCG
50%       Fast.ai  0.99 ± 0.0  0.52 ± 0.3  0.85 ± 0.1
          HAN      0.93 ± 0.0  0.62 ± 0.1  0.18 ± 0.1
          NIM      0.93 ± 0.0  0.63 ± 0.1  0.39 ± 0.1
          JELI     0.99 ± 0.0  0.92 ± 0.1  0.96 ± 0.1
65%       Fast.ai  0.99 ± 0.0  0.64 ± 0.4  0.78 ± 0.3
          HAN      0.93 ± 0.0  0.67 ± 0.0  0.12 ± 0.1
          NIM      0.94 ± 0.0  0.67 ± 0.1  0.42 ± 0.1
          JELI     0.99 ± 0.0  0.94 ± 0.0  0.94 ± 0.1
80%       Fast.ai  0.99 ± 0.0  0.91 ± 0.1  0.77 ± 0.2
          HAN      0.96 ± 0.0  0.72 ± 0.0  0.20 ± 0.1
          NIM      0.93 ± 0.0  0.61 ± 0.1  0.19 ± 0.0
          JELI     0.99 ± 0.0  0.94 ± 0.0  0.85 ± 0.2

The NDCG at rank $n_i$ is averaged across users. NIM is NIMCGCN. Bold type is used for the highest value(s) in each experiment.

As illustrated by Fig. 3, JELI consistently outperforms the state of the art on all metrics and remains robust to the sparsity number.

Ablation study: both the structure and the joint learning are crucial to the performance
We perform the same type of experiments as in Sect. "JELI is robust in synthetic data sets across sparsity numbers" on several ablated versions of JELI, in order to estimate the contribution of each part to the predictive performance. We introduce several JELI variants. First, we remove the structured and embedding parts of the RHOFM classifier. FM2 is the regular second-order factorization machine of dimension d on the 2F-dimensional input vectors, without structure on the coefficients (see Definition 1), whereas CrossFM2 is a more refined non-structured second-order factorization machine in which the feature pairwise interaction terms only comprise pairs made of one item feature and one user feature, that is, with the notation of Definition 1,

$$\mathrm{CrossFM}_{(\omega_0, \omega^1, \omega^2)}(\mathbf{x}^i, \mathbf{x}^u) \triangleq \omega_0 + (\omega^1)^\top \mathbf{x}^{iu} + \sum_{f \leq F,\, f' > F} \langle \omega^2_{f,:}, \omega^2_{f',:} \rangle \, x^i_f \, x^u_{f'-F}. \quad (12)$$

Next, we also study methods featuring separate learning of the embeddings and of the RHOFM classifier, named Separate Embedding Learning and Training algorithms (SELT). We consider different feature embedding types. SELT-PCAf runs a d-component principal component analysis (PCA) on the concatenation of the item and user matrices along the column dimension, resulting in an $F \times (n_i + n_u)$ matrix; SELT-PCAf then infers the feature embeddings from each feature's first d principal components. Another PCA-based baseline, SELT-PCAiu, applies the learned PCA transformation directly to the item and user feature vectors to obtain item and user embeddings. Finally, the SELT-KGE approach completes the knowledge graph task to obtain item and user embeddings, without enforcing the feature-dependent structure, on the knowledge graph described in Definition 3 with an empty partial graph. Then, SELT-KGE uses those item and user embeddings to train the RHOFM classifier.

The final results in Fig. 4 and Table 4 show that the component most crucial to predictive performance across sparsity numbers is the factorization machine, which is unsurprising given the literature on factorization machines applied to sparse data. One can observe that separate embedding learning and factorization machine training leads to mediocre performance. The combination of a structured factorization machine and jointly learned embeddings, that is, JELI, gives the best performance, and the gain becomes more significant as the set of known associations gets smaller (and the sparsity number larger).

Fig. 4: NS-AUC values across "first-order" synthetic data sets for the three sparsity numbers and 500 iterations, for JELI and its ablated variants. The component most crucial to good predictive performance across sparsity numbers is the factorization machine.
Table 4. Average metrics with standard deviations across 100 iterations and 5 "first-order" sets

Sparsity  Method    AUC         NS-AUC      NDCG
50%       FM2       0.99 ± 0.0  0.92 ± 0.0  0.97 ± 0.0
          CrossFM2  0.99 ± 0.0  0.93 ± 0.0  1.00 ± 0.0
          S-PCAf    0.95 ± 0.0  0.70 ± 0.1  0.58 ± 0.2
          S-PCAiu   0.95 ± 0.0  0.61 ± 0.2  0.45 ± 0.2
          S-KGE     0.91 ± 0.0  0.43 ± 0.2  0.25 ± 0.2
          JELI      0.99 ± 0.0  0.92 ± 0.1  0.96 ± 0.0
65%       FM2       0.98 ± 0.0  0.91 ± 0.0  0.87 ± 0.1
          CrossFM2  0.99 ± 0.0  0.91 ± 0.0  0.95 ± 0.0
          S-PCAf    0.95 ± 0.0  0.73 ± 0.1  0.54 ± 0.2
          S-PCAiu   0.94 ± 0.0  0.62 ± 0.0  0.34 ± 0.1
          S-KGE     0.90 ± 0.0  0.43 ± 0.0  0.06 ± 0.0
          JELI      0.99 ± 0.0  0.94 ± 0.0  0.94 ± 0.1
80%       FM2       0.97 ± 0.0  0.84 ± 0.1  0.56 ± 0.1
          CrossFM2  0.98 ± 0.0  0.87 ± 0.0  0.74 ± 0.0
          S-PCAf    0.95 ± 0.0  0.73 ± 0.1  0.38 ± 0.1
          S-PCAiu   0.93 ± 0.0  0.62 ± 0.1  0.20 ± 0.0
          S-KGE     0.91 ± 0.0  0.55 ± 0.1  0.12 ± 0.1
          JELI      0.99 ± 0.0  0.94 ± 0.0  0.85 ± 0.2

The NDCG at rank $n_i$ is averaged across users. S- indicates an instance of SELT. Bold type is used for the highest value(s) in each experiment.

Application to drug repurposing

We aim to predict new therapeutic indications, that is, novel associations between chemical compounds and diseases. The interpretability of a model predicting associations between molecules and pathologies is crucial to encourage its adoption in healthcare. In that respect, higher-order factorization machines are very interesting models due to their inherent interpretability. However, particularly for the most recent drug repurposing data sets (e.g., TRANSCRIPT [43] and PREDICT [49]), the number of features ($F \approx 12{,}000$ and $F \approx 6{,}000$, respectively) is too large to effectively train a factorization machine, due to the curse of dimensionality. Resorting to knowledge graphs then enables the construction of low-dimensional vector representations of these associations. These representations are then fed as input to the classifier during training instead of the initial feature vectors.

JELI is on par with state-of-the-art approaches on drug repurposing data sets

We now run JELI and the baseline algorithms tested in Sect. "JELI is robust in synthetic data sets across sparsity numbers" on the Gottlieb [50] (named Fdataset in the original publication), LRSSL [51], PREDICT-Gottlieb [52], and TRANSCRIPT [43] drug repurposing data sets, which feature a variety of data types and sizes. Please refer to Appendix 4 for more information. Figure 5 and Table 5 report the validation metrics over 100 different training/testing splits for each method, with $d = 15$. From those results, we can see that the performance of JELI is on par with the top algorithm, HAN, and sometimes outperforms it, while providing interpretability.

Fig. 5: AUC values across drug repurposing data sets for 100 iterations, for JELI and state-of-the-art embedding-based approaches.

Table 5. Average metrics with standard deviations across 100 iterations for each drug repurposing data set

Data set  Method   AUC         NS-AUC      NDCG
Gottlieb  Fast.ai  0.90 ± 0.0  0.50 ± 0.1  0.01 ± 0.0
          HAN      0.93 ± 0.0  0.67 ± 0.0  0.02 ± 0.0
          NIM      0.90 ± 0.0  0.51 ± 0.0  0.01 ± 0.0
          JELI     0.90 ± 0.0  0.52 ± 0.0  0.02 ± 0.0
LRSSL     Fast.ai  0.90 ± 0.0  0.49 ± 0.1  0.01 ± 0.0
          HAN      0.95 ± 0.0  0.69 ± 0.0  0.10 ± 0.0
          NIM      0.91 ± 0.0  0.53 ± 0.0  0.01 ± 0.0
          JELI     0.92 ± 0.0  0.51 ± 0.0  0.02 ± 0.0
PRED-G    Fast.ai  0.90 ± 0.0  0.50 ± 0.1  0.01 ± 0.0
          HAN      0.93 ± 0.0  0.68 ± 0.0  0.01 ± 0.0
          NIM      0.91 ± 0.0  0.49 ± 0.0  0.01 ± 0.0
          JELI     0.90 ± 0.0  0.47 ± 0.0  0.02 ± 0.0
TRANSC    Fast.ai  0.61 ± 0.1  0.57 ± 0.1  0.04 ± 0.0
          HAN      0.93 ± 0.0  0.61 ± 0.0  0.08 ± 0.0
          NIM      0.92 ± 0.0  0.57 ± 0.0  0.04 ± 0.0
          JELI     0.92 ± 0.0  0.56 ± 0.0  0.02 ± 0.0

The NDCG at rank $n_i$ is averaged across users.
NIM is the algorithm NIMCGCN, TRANSC refers to the TRANSCRIPT data set, and PRED-G to the PREDICT-Gottlieb data set.

For the sake of completeness, we also considered one of the most popular recommendation data sets, MovieLens [2], to better assess the performance of JELI for the general purpose of collaborative filtering. The goal is to predict whether a movie should be recommended to a user, that is, whether the user would rate this movie with more than 3 stars. The movie features are the release year and one-hot encodings of the movie genres, whereas the user features are the counts of each movie tag that this user has previously assigned. This experiment confirms that the performance of JELI is on par with the baselines, even in a non-biological setting. Please refer to Appendix 4 for more information.

JELI can integrate any graph prior on the TRANSCRIPT data set

We now focus on the TRANSCRIPT data set, which involves gene activity measurements across $F = 12{,}096$ genes for $n_i = 204$ drugs and $n_u = 116$ diseases. We compare the predictive power of JELI on the TRANSCRIPT data set with the default knowledge graph created by JELI (named the "None" network, as it does not rely on external sources of knowledge) and with the default graph augmented with an external knowledge graph. The "None" network corresponds to the knowledge graph of Definition 3 with an empty partial graph. We considered as external knowledge graphs DRKG [53], Hetionet [54], PharmKG and PharmKG8k (a subset of 8,000 triplets) [41], and PrimeKG [42], as provided by the Python library PyKEEN [55]. In addition, we also built a partial graph listing protein-protein interactions (where proteins are matched one-to-one to their corresponding coding genes) based on the STRING database [56]. The resulting classification accuracies are shown in Figure 6 and Table 6. Most of the external graph priors significantly improve the classification accuracy, particularly the specific information about gene regulation (prior STRING). In Appendix 4, we also show that the performance of the graph priors correlates with a more frequent grouping of genes that belong to the same functional pathways. In Appendix 5, we perform a more thorough analysis of the specific case of melanoma and show that the predicted drug-disease associations and perturbed pathways recover some elements of the literature on melanoma.

Fig. 6: Predictive performance of JELI with different graph priors, for different validation metrics.

Table 6. Average metrics with standard deviations across 10 iterations on the TRANSCRIPT data set for different graph priors

Graph prior  AUC          NS-AUC       NDCG
None         0.90 ± 0.01  0.48 ± 0.02  0.02 ± 0.01
DRKG         0.90 ± 0.00  0.48 ± 0.02  0.00 ± 0.00
Hetionet     0.89 ± 0.00  0.43 ± 0.02  0.01 ± 0.01
PharmKG      0.88 ± 0.01  0.43 ± 0.03  0.01 ± 0.01
PharmKG8k    0.91 ± 0.01  0.53 ± 0.03  0.03 ± 0.01
PrimeKG      0.89 ± 0.01  0.48 ± 0.03  0.02 ± 0.01
STRING       0.91 ± 0.00  0.55 ± 0.03  0.02 ± 0.01

Discussion

This work proposes the JELI approach for integrating knowledge graph-based regularization into an interpretable recommender system. The structure incorporated into the user and item embeddings accounts for numerical feature values in a generic fashion, which allows one to go beyond the categorical relations encoded in knowledge graphs without adding many parameters.
This method allows us to derive item and user representations of fixed dimension and to score a user-item association, even for previously unseen items and users. We have shown the performance and the explanatory power of JELI on synthetic and real-life data sets. The Python package that implements the JELI approach is available in the following open-source repository: github.com/RECeSS-EU-Project/JELI. The experimental results can be reproduced using the code uploaded at github.com/RECeSS-EU-Project/JELI-experiments.

Conclusions

This paper introduces and empirically validates our algorithmic contribution, JELI, for drug repurposing. JELI aims to provide straightforward interpretability in recommendations while integrating any graph information on items and users. However, there are a few limitations to the JELI approach. The first one is that JELI performs best on sparse user and item feature matrices, which exploit the expressiveness of factorization machines to the fullest. Moreover, this approach is quite slow compared to state-of-the-art algorithms, since it simultaneously solves two tasks: the recommendation task on user-item pairs and the knowledge graph completion task. We discuss the scalability of JELI with respect to various parameters in Appendix 6. However, this slowness is mitigated by the superior interpretability of JELI compared to the baselines. Furthermore, an interesting line of subsequent work would focus on integrating missing values into the recommendation problem. As it is, JELI ignores missing features and potentially recovers qualitative item-feature (respectively, user-feature) links during the knowledge graph completion task. That is, provided an approach to quantify the strength of the link between an item and a feature, JELI could also be extended to impute the corresponding missing feature value for this item.

Acknowledgements