Abstract

Background: Interpretability is a topical question in recommender systems, especially in healthcare applications. An interpretable classifier quantifies the importance of each input feature for the predicted item-user association in a non-ambiguous fashion.

Results: We introduce the novel Joint Embedding Learning-classifier for improved Interpretability (JELI). By combining the training of a structured collaborative-filtering classifier and an embedding learning task, JELI predicts new user-item associations based on jointly learned item and user embeddings, while providing feature-wise importance scores. JELI therefore flexibly allows the introduction of priors on the connections between users, items, and features. In particular, JELI simultaneously (a) learns feature, item, and user embeddings; (b) predicts new item-user associations; and (c) provides importance scores for each feature. Moreover, JELI instantiates a generic approach to training recommender systems by encoding generic graph-regularization constraints.

Conclusions: First, we show that the joint training approach yields a gain in the predictive power of the downstream classifier. Second, JELI can recover feature-association dependencies. Finally, JELI requires fewer parameters than the baselines on synthetic and drug-repurposing data sets.

Keywords: Drug repurposing, Interpretability, Gene expression, Collaborative filtering

Background

The Netflix Challenge [1] popularized collaborative filtering, where connections between items and users are inferred based on the guilt-by-association principle and similarities. This approach is particularly suitable for use cases where information about known user-item associations is sparse (typically, close to 99% of all possible user-item associations are unlabelled, as in the MovieLens movie recommendation data set [2]) and where feedback is implicit. For instance, in the case of movie recommendations on streaming platforms or online advertising, the algorithm often only has access to clicks, that is, positive feedback. However, the reasons for ignoring an item can be numerous: either the item would straightforwardly receive negative feedback, or the item is too far from the user's usual exploration zone but could still be enjoyed. In some rare cases, true negative feedback might be accessible, but in even smaller numbers than the positive associations; in drug repurposing data sets, for instance, failed Phase III clinical trials provide such negatives [3]. Collaborative filtering algorithms then model a user's behavior based on their similarity to other users and the similarity of the candidate item to other items positively graded by this cluster of users. Several types of algorithms implement collaborative filtering. For instance, matrix factorizations [4, 5], such as Non-negative Matrix Factorization (NMF) [6] or Singular Value Decomposition (SVD) [7], decompose the matrix of item-user associations into a product of two low-rank tensors. Other types of algorithms are (deep) neural networks [8–10], which build item and user embeddings with convolutional or graph neural networks based on common associations and/or additional feature values. On the one hand, among the latter approaches, graph-based methods, which integrate and infer edges between features, items, and users, seem promising in terms of performance [11].
Predictions are supported by establishing complex connections between those entities. Conversely, matrix factorizations incorporate explicit interpretability, as one can try to connect the inferred latent factors to specific user and item features. One example is the factorization machine (FM) [12], which combines a linear regression-like term and a feature pairwise interaction term to output a score for binary classification. The learned coefficients of the FM explicitly contribute to the score for each item and user feature set. This type of interpretability, called feature attribution in the literature [13–16], allows further downstream statistical analysis of the feature interactions. For instance, in our motivating example of drug repurposing, the objective is to identify novel drug-disease therapeutic associations. If features are genes mutated by the pathology or targeted by the chemical compound, the overrepresented biological pathways among those that are respectively affected or repaired can be retrieved based on the set of key repurposing genes. This, in turn, offers important arguments in favor of the therapeutic value of a drug-disease indication and for further development towards marketing.

In this work, we aim to combine the performance and versatility (in terms of embeddings) of graph-based collaborative filtering with the explicit interpretability of factorization machines, to derive a "best-of-both-worlds" approach for predicting user-item associations. To achieve this, we introduce a special class of factorization machines that leverages a strong hypothesis on the structure of item and user embeddings as functions of feature embeddings. This classifier is then jointly trained with a knowledge graph completion task. The knowledge graph connects items, users, and features based on the similarity between items and between users, and on potentially additional priors on their relationships with features. The embeddings used to compute the edge probability scores in the knowledge graph are shared with the factorization machine, which allows the distillation of generic priors into the classifier.

Our paper is structured as follows. In Sect. "Related work", we give an overview of the state of the art on factorization machines and knowledge graphs and discuss how their combination might overcome some topical questions in the field. Section "Methods" introduces the JELI algorithm, which features our novel class of structured factorization machines and a joint training strategy with a knowledge graph. Finally, Sect. "Results" shows the performance and interpretability of the JELI approach on both synthetic data sets and drug repurposing applications.

Notation

For any matrix M (in capital letters), we denote by $M_{i,:}$, $M_{:,j}$, and $M_{i,j}$ respectively its $i$-th row, $j$-th column, and coefficient at position (i, j). For any vector $\mathbf{v}$ (in bold type), $\mathbf{v}_i$ is its $i$-th coefficient. Moreover, $M^\dagger$ denotes the pseudo-inverse of matrix M.

Related work

Our approach, JELI, leverages a generic knowledge graph completion task and the interpretability of factorization machines to derive a novel, explainable collaborative filtering approach.

Knowledge graph embedding learning

A knowledge graph is a set of triplets of the form (h, r, t) such that the head entity h is linked to the tail entity t by the relation r [17].
Entity and relation embeddings learned on the graph capture the structure and connections of the graph in numerical form: the embeddings are parameters of a function predicting the presence of a triplet in the graph, and those parameters are learned from the current set of edges. This approach encodes the graph structure into numerical representations, which can later be provided to a downstream regression model [18]. The edge prediction function is usually called the interaction model. Many exist [19–22]; among them, the Multi-Relational Euclidean (MuRE) model [23] is defined for any triplet (h, r, t) with respective embeddings $e_h, e_r, e_t$ of dimension d as

$$\mathrm{MuRE}(e_h, e_r, e_t) = -\| R_r e_h - (e_t + e_r) \|_2^2 + b_h + b_t, \quad (1)$$

where the $d \times d$ matrix $R_r$ and the scalars $b_h$ and $b_t$ are respectively relation-, head-, and tail-specific parameters. Notably, this interaction model has exhibited good embedding engineering properties throughout the literature [24, 25].

Yet, many challenges remain in this field of research. Current representation learning algorithms (regardless of the selected interaction model between a triplet and its embeddings) infer representations directly on the nodes and relations of the graph. However, this approach does not make it possible to establish a relationship between the nodes other than a similarity at the level of the numerical representations of neighboring nodes for specific relations in the graph. That is, specific logical operations depending on the relation are often ignored: for instance, for a relation r and its opposite $\neg r$, we would like to ensure that the score p assigned to triplet (h, r, t) is proportional to $-\bar{p}$, where $\bar{p}$ is the score associated with triplet $(h, \neg r, t)$. Moreover, knowledge graphs are currently better suited to categorical information, where entities and relationships take discrete rather than numerical values. Numerical values could describe a relation such as "users from this specific age group are twice as interested in that movie genre". Some recent works focus on integrating numerical values into knowledge graph embeddings. In KEN embeddings [26], a single-layer neural network is trained for each numeric relation, taking the attribute value as input and returning an embedding. Another approach, TransEA [27], optimizes a loss function that linearly combines, through a hyperparameter, a loss on the categorical variables (the difference between the scores and the indicator of the presence of a triplet) and a loss on numerical variables, which seeks to minimize the gap between the variable and a scalar product involving its embedding. However, these two approaches add several additional hyperparameters and do not deal with interpretability.

Resorting to knowledge-graph-infused embeddings allows us to integrate prior knowledge constraints generically into the representations of entities, both items and users. We aim to enforce a structure on those embeddings to guarantee good prediction of user-item associations, by incorporating those embeddings into a special type of factorization machine.
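To make the MuRE interaction model of Eq. (1) concrete before moving on, here is a minimal NumPy sketch of the score computation; the random relation matrix, embeddings, and biases are illustrative placeholders rather than values from any trained model.

```python
import numpy as np

def mure_score(e_h, e_r, e_t, R_r, b_h, b_t):
    """MuRE interaction score for a triplet (h, r, t), following Eq. (1).

    e_h, e_r, e_t : (d,) embeddings of the head, relation, and tail
    R_r           : (d, d) relation-specific matrix
    b_h, b_t      : head- and tail-specific scalar biases
    """
    residual = R_r @ e_h - (e_t + e_r)
    return -np.dot(residual, residual) + b_h + b_t

# Toy usage: a higher score means the triplet is deemed more plausible.
d = 4
rng = np.random.default_rng(0)
e_h, e_r, e_t = rng.normal(size=(3, d))
R_r = rng.normal(size=(d, d))
print(mure_score(e_h, e_r, e_t, R_r, b_h=0.1, b_t=-0.2))
```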
Factorization machines

Factorization machines are a type of collaborative filtering algorithm introduced by [12]. Their most common expression, the second-order factorization machine of dimension d, comprises a linear regression term (with a possibly non-zero intercept) and a term that combines interactions from all distinct pairs of features through a scalar product of their corresponding low-rank latent vectors of dimension d. This approach, particularly in the presence of sparse feature vectors, is computationally efficient while remaining performant on a variety of recommendation tasks, such as knowledge tracing in education [28] and click-through rate prediction [29]. Computationally tractable evaluation and training routines were first proposed by [30] for higher-order factorization machines (HOFMs), which were also introduced in [12] and include interactions from all distinct sets of K features, where $K \geq 2$, opening the way to even finer classification models. The definition of HOFMs is recalled in Definition 1.

Definition 1 (Higher-Order Factorization Machines (HOFMs)). Let $F \subseteq \mathbb{N}$ denote the set of available item and user features. The general expression for a HOFM [12, 30] of order $m \geq 2$ and dimensions $d_2, \dots, d_m$ that takes as input a single feature vector $\mathbf{x} \in \mathbb{R}^{|F|}$ is a model with parameters $\theta = (\omega_0, \omega^1, \omega^2, \dots, \omega^m)$, where $(\omega_0, \omega^1) \in \mathbb{R} \times \mathbb{R}^{|F|}$ and, for any $i \in \{2, \dots, m\}$, $\omega^i \in \mathbb{R}^{|F| \times d_i}$, such that

$$\mathrm{HOFM}_\theta(\mathbf{x}) \triangleq \omega_0 + (\omega^1)^\top \mathbf{x} + \sum_{2 \leq t \leq m} \; \sum_{\substack{f_1 < \dots < f_t \\ f_1, \dots, f_t \in F}} \langle \omega^t_{f_1,:}, \dots, \omega^t_{f_t,:} \rangle \; x_{f_1} \cdot x_{f_2} \cdots x_{f_{t-1}} \cdot x_{f_t}, \quad (2)$$

where $\langle \omega^t_{f_1,:}, \dots, \omega^t_{f_t,:} \rangle \triangleq \sum_{d' \leq d_t} \omega^t_{f_1, d'} \cdot \omega^t_{f_2, d'} \cdots \omega^t_{f_{t-1}, d'} \cdot \omega^t_{f_t, d'}$ for any t and indices $f_1, \dots, f_t$. In particular, for $m = 2$,

$$\mathrm{FM}_\theta(\mathbf{x}) \triangleq \underbrace{\omega_0 + (\omega^1)^\top \mathbf{x}}_{\text{linear regression term}} + \underbrace{\sum_{\substack{f < f' \\ f, f' \in F}} \langle \omega^2_{f,:}, \omega^2_{f',:} \rangle \; x_f \cdot x_{f'}}_{\text{pairwise interaction term}}. \quad (3)$$

Besides their good predictive power, factorization machines involve explicit coefficients that quantify the contribution of each set of K features to the final score associated with the positive class of associations. These coefficients offer a straightforward insight into the discriminating features for the recommendation problem, and this type of "white-box" explainability is related to a larger research field called feature attribution-based interpretability.
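As a concrete illustration of Eq. (3), the following sketch evaluates a second-order factorization machine score using the usual O(|F| d) rewriting of the pairwise term; the random parameters are placeholders, not a trained model.

```python
import numpy as np

def fm_score(x, w0, w1, W2):
    """Second-order factorization machine score, as in Eq. (3).

    x  : (F,) input feature vector
    w0 : scalar intercept
    w1 : (F,) linear coefficients
    W2 : (F, d) latent factors; <W2[f], W2[f']> weights the pair (f, f')
    """
    linear = w0 + w1 @ x
    # Pairwise term via the identity:
    # sum_{f<f'} <W2[f], W2[f']> x_f x_f'
    #   = 0.5 * (||W2^T x||^2 - sum_f ||W2[f]||^2 x_f^2)
    xz = W2.T @ x
    pairwise = 0.5 * (xz @ xz - np.sum((W2 ** 2).T @ (x ** 2)))
    return linear + pairwise

rng = np.random.default_rng(0)
F, d = 6, 3
x = rng.normal(size=F)
print(fm_score(x, 0.0, rng.normal(size=F), rng.normal(size=(F, d))))
```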
Feature attribution-based interpretability

Given a binary classifier C and a feature vector $\mathbf{x} \in \mathbb{R}^F$, a feature attribution function $\phi_C : \mathbb{R}^F \to \mathbb{R}^F$ returns importance scores for each feature contributing to the positive class score for the input vector $\mathbf{x}$. If the importance score associated with feature f is largely positive (resp., negative), then feature f drives the membership of $\mathbf{x}$ to the positive (resp., negative) class. In contrast, an importance score close to 0 indicates that feature f has little influence on the classification of data point $\mathbf{x}$. Although other types of interpretability approaches exist (based on decision rules given by single classification trees or random forests [31, 32], counterfactual examples [33], or logic rules [34, 35]), importance score-based methods allow going beyond single-feature influence. In particular, the importance scores can be integrated into downstream analyses to statistically quantify the effect of specific groups of features on the classification. For instance, when considering genes as features, an enrichment analysis [36] based on the scores can uncover overrepresented, functionally consistent cell pathways.

Some classifiers, as seen for factorization machines, readily include importance scores, whereas several other approaches compute post-hoc importance scores: the scores are evaluated based on the outputs of an already trained "black-box" classifier, such as a neural network. Such approaches include Shapley values [13], LIME [14], DeepLIFT [37] (for image annotation), or sufficient explanations [38]. Yet, recent works have shown a lack of robustness and consistency across post-hoc feature attribution methods, both empirically [15] and theoretically [16, 39]. The advantage of post-hoc approaches, however, is that they allow the explainability of any type of classifier, combining the richness of the model (predictive performance) and interpretability. The approach described in our paper then aims to encompass any generic embedding model without losing the connection to the initial features of the input vectors to the classifier.

Methods

In this section, we define the JELI algorithm, our main contribution. The full pipeline of JELI is illustrated in Fig. 1. Let us define in formal terms the inputs to the associated problem of recommending $n_i$ items $i_1, i_2, \dots, i_{n_i}$ to $n_u$ users $u_1, u_2, \dots, u_{n_u}$. The minimal input to the recommendation problem is the user-item association matrix $A \in \{-1, 0, +1\}^{n_i \times n_u}$, which summarizes the known positive ($+1$) and possibly negative ($-1$) associations and denotes unknown associations by zeroes. In simple terms, recommender systems aim to replace zeroes by $\pm 1$ while preserving the labels of the nonzero-valued associations. Second, in some cases, we also have access to the respective item and user feature matrices, denoted $S \in \mathbb{R}^{F \times n_i}$ and $P \in \mathbb{R}^{F \times n_u}$. Without loss of generality, we assume that the item and user feature matrices share the same F features $f_1, f_2, \dots, f_F$. Finally, there might be a partial graph on some of the items, users, features, and possibly other entities. For instance, such a graph might connect movies, users, and human emotions for movie recommendation [40], or drugs, diseases, pathways, and proteins or genes for drug repurposing [41, 42]. We denote this graph $\mathcal{G} \triangleq (V_{\mathcal{G}}, E_{\mathcal{G}})$, where $V_{\mathcal{G}}$ is the set of nodes in $\mathcal{G}$ and $E_{\mathcal{G}}$ is its set of (undirected, labeled) edges.

Fig. 1: Full pipeline of the JELI algorithm, from the initial inputs to the downstream tasks.

We first introduce the class of higher-order factorization machines, called redundant structured HOFMs, which classify user-item associations based on an assumption on the structure of item/user and feature embeddings.

Redundant structured HOFM (RHOFM)

This subtype of higher-order factorization machines features shared higher-order parameters across interaction orders, such that the corresponding dimensions of the HOFM satisfy $d_2 = \dots = d_m = d$ in Definition 1. As such, RHOFMs are related to the inhomogeneous ANOVA kernel HOFMs (iHOFMs) mentioned in [30]. This type of factorization machine is such that the higher-order dimensions are all equal (that is, $d_2 = \dots = d_m = d$) and the corresponding higher-order coefficients are all proportional to one another: for any $t, t' \geq 2$ and $f \in F$, there exists $c \in \mathbb{R}$ such that $\omega^t_{f,:} = c \cdot \omega^{t'}_{f,:}$ in Definition 1.
However, what distinguishes the RHOFM from an iHOFM is the following hypothesis on structure: it is assumed that every entity d-dimensional embedding $\mathbf{e} \in \mathbb{R}^d$ results from some function $s_W$ with parameter $W \in \mathbb{R}^{F \times d}$ applied to the corresponding entity feature vector $\mathbf{x} \in \mathbb{R}^F$. For instance, an embedding $\mathbf{e}$ associated with feature vector $\mathbf{x}$ under a linear structure function of dimension d is defined as $\mathbf{e} = s_W(\mathbf{x}) = \mathbf{x}^\top W$. However, any, possibly non-linear, structure function $s_W$ can be considered. Note that, for completeness, we can also define a feature vector for features themselves, which is simply the result of the indicator function on features in F: for feature $f \in F$, its corresponding feature vector is $\mathbf{x}^f \triangleq (\delta(f_j = f))_{j \leq F}$, where $\delta$ is the Kronecker symbol, so that the structure function $s_W$ can be applied to any item, user, or feature entity. Definition 2 gives the formal expression of RHOFMs for any order, dimension, and structure.

Definition 2 (Redundant structured HOFMs (RHOFMs)). The RHOFM of structure $s_W$, order m and dimension d, with parameters $\theta = (\omega_0, \omega^1, \omega^{2:m}, W) \in \mathbb{R} \times \mathbb{R}^d \times \mathbb{R}^{m-1} \times \mathbb{R}^{F \times d}$, applied to an item and a user with respective feature vectors $\mathbf{x}^i, \mathbf{x}^u \in \mathbb{R}^F$, is defined as

$$\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u) \triangleq \omega_0 + (\omega^1)^\top [\widetilde{W}^{iu}_\lambda; \widetilde{W}^{iu}_\lambda]^\top \mathbf{x}^{iu} + \sum_{2 \leq t \leq m} \omega^{2:m}_{t-1} \sum_{\substack{f_1 < \dots < f_t \\ f_1, \dots, f_t \leq 2F}} \langle [\widetilde{W}^{iu}_\lambda; \widetilde{W}^{iu}_\lambda]_{f_1,:}, \dots, [\widetilde{W}^{iu}_\lambda; \widetilde{W}^{iu}_\lambda]_{f_t,:} \rangle \; x^{iu}_{f_1} \cdot x^{iu}_{f_2} \cdots x^{iu}_{f_t}, \quad (4)$$

where $[\,\cdot\,;\,\cdot\,]$ denotes row-wise stacking, $\mathbf{x}^{iu} \triangleq [(\mathbf{x}^i)^\top, (\mathbf{x}^u)^\top]^\top \in \mathbb{R}^{2F}$ is the concatenation of the feature vectors along the row dimension, $\widetilde{\mathbf{x}}^{iu} \triangleq [\mathbf{x}^i, \mathbf{x}^u] \in \mathbb{R}^{F \times 2}$ is the concatenation along the column dimension, and $\widetilde{W}^{iu}_\lambda \triangleq (\widetilde{\mathbf{x}}^{iu} (\widetilde{\mathbf{x}}^{iu})^\top + \lambda I_F)^\dagger \, \widetilde{\mathbf{x}}^{iu} \, [s_W(\mathbf{x}^i), s_W(\mathbf{x}^u)]^\top \in \mathbb{R}^{F \times d}$ is the $\lambda$-regularized approximate least squares estimator, with $\lambda \geq 0$, of V in the following equation: $s_W(\widetilde{\mathbf{x}}^{iu}) = (\widetilde{\mathbf{x}}^{iu})^\top V$.

By reordering terms and by definition of $\widetilde{W}^{iu}_\lambda$ (full details in the Appendix), if we denote by $f \% F$ the remainder of the Euclidean division of f by F, we can notice that

$$\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u) \approx \omega_0 + (\omega^1)^\top (s_W(\mathbf{x}^i) + s_W(\mathbf{x}^u)) + \sum_{2 \leq t \leq m} \omega^{2:m}_{t-1} \sum_{\substack{f_1 < \dots < f_t \\ f_1, \dots, f_t \leq 2F}} \langle x^{iu}_{f_1} s_W(\mathbf{x}^{f_1 \% F}), \dots, x^{iu}_{f_t} s_W(\mathbf{x}^{f_t \% F}) \rangle. \quad (5)$$

In particular, for $m = 2$, $\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u)$ is roughly equal to

$$\omega_0 + (\omega^1)^\top (s_W(\mathbf{x}^i) + s_W(\mathbf{x}^u)) + \omega^{2:m} \sum_{\substack{f_1 < f_2 \\ f_1, f_2 \leq 2F}} \langle x^{iu}_{f_1} s_W(\mathbf{x}^{f_1 \% F}), x^{iu}_{f_2} s_W(\mathbf{x}^{f_2 \% F}) \rangle. \quad (6)$$

Compared to the expression of a factorization machine for $m = 2$ in Eq. (3), the RHOFM includes a structure that can be non-linear (through the function $s_W$) and a supplementary degree of freedom with parameters $\omega^1$ and $\omega^{2:m}$. The RHOFM then comprises a term linear in the item/user embeddings and a product of feature embeddings weighted by the corresponding values in the initial item and user feature vectors. Moreover, if we assume a linear structure on the RHOFM, the embedding vector for feature $f_j$ is exactly $W_{f_j,:}$, and the embeddings for items and users are the sums of the feature embeddings weighted by their corresponding values in the item and user vectors. The expression in Definition 2 is relatively computationally efficient when combined with the dynamic programming routines described in [30].
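To illustrate the structure hypothesis of Definition 2 in the linear case described above, here is a minimal sketch (with placeholder random matrices) showing that, under $s_W(\mathbf{x}) = \mathbf{x}^\top W$, feature embeddings are exactly rows of W while item and user embeddings are feature-value-weighted sums of those rows.

```python
import numpy as np

def linear_structure(x, W):
    """Linear structure function s_W: maps an entity feature vector to its embedding."""
    return x @ W  # (F,) @ (F, d) -> (d,)

rng = np.random.default_rng(0)
F, d = 10, 2
W = rng.normal(size=(F, d))            # shared feature-embedding parameter

x_item = rng.normal(size=F)            # item feature vector (e.g., a column of S)
x_user = rng.normal(size=F)            # user feature vector (e.g., a column of P)
x_feat = np.eye(F)[3]                  # indicator feature vector of feature f_4

e_item = linear_structure(x_item, W)   # weighted sum of feature embeddings
e_user = linear_structure(x_user, W)
e_feat = linear_structure(x_feat, W)   # equals W[3, :], the embedding of f_4
print(np.allclose(e_feat, W[3]))       # True under the linear structure
```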
Moreover, the redundancy in the RHOFM allows it to benefit from the same type of computational speed-up as inhomogeneous ANOVA kernels or iHOFMs. Knowing that HOFMs (in Definition 1) and iHOFMs would take as input the concatenation along the row dimension of $(\mathbf{x}^i, \mathbf{x}^u)$, and assuming that the dimensions across interaction orders are the same, i.e., $d_2 = \dots = d_m = d$, HOFMs comprise $1 + 2F + 2Fd(m-1)$ parameters, which can incur a prohibitive computational cost in practice. Similarly, iHOFMs would require the training of $1 + m + 2Fd$ parameters, whereas RHOFMs (in Definition 2) only feature $1 + m + (F+1)d$, hence removing the multiplicative constant in front of the number of features F; this matters for high-dimensional data sets such as the TRANSCRIPT drug repurposing data set [43], which gathers values for about 12,000 genes across the human genome.

Regarding interpretability, as evidenced by Eq. (5), the coefficients involved in the expression of the RHOFM are straightforwardly connected to the input embeddings. In the case of the linear structure, and when $\omega^1 = \mathbf{1}_d$ and $\omega^{2:m} = \mathbf{1}_{m-1}$ (or any other constant), the contributions from the features on the one hand and from the item/user values on the other hand can easily be disentangled. In that case, $\widetilde{W}^{iu}_\lambda \approx W$, and then, for any feature f, the intrinsic (i.e., independent of users or items) importance score is $\sum_{k \leq d} W_{f,k}$. When associated with an entity (item or user) of feature vector $\mathbf{x} \in \mathbb{R}^F$, its importance score is simply $x_f \sum_{k \leq d} W_{f,k}$. Using $(\widetilde{\mathbf{x}}^{iu})^\top \widetilde{W}^{iu}_\lambda \approx s_W(\widetilde{\mathbf{x}}^{iu})$ for non-linear structures, we can extrapolate this result to obtain the following intrinsic feature importance score.

Result 1 (Feature importance scores in an RHOFM). When $\omega^1 = \mathbf{1}_d$ and $\omega^{2:m} = \mathbf{1}_{m-1}$ (or any other constant), the intrinsic (entity-independent) feature importance score for feature $f \in F$ in an RHOFM (Definition 2) is $\sum_{k \leq d} (\widetilde{W}^{iu}_\lambda)_{f,k}$. As a consequence, the feature attribution function associated with feature vector $\mathbf{x} \in \mathbb{R}^F$ is $\phi_{\mathrm{RHOFM}}(\mathbf{x}) \triangleq \left(x_f \sum_{k \leq d} (\widetilde{W}^{iu}_\lambda)_{f,k}\right)_{f \in F}$.
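A minimal sketch of the scores in Result 1 under the linear structure, where $\widetilde{W}^{iu}_\lambda \approx W$; the matrix W and the entity feature vector are random placeholders.

```python
import numpy as np

def intrinsic_importance(W):
    """Intrinsic (entity-independent) importance score per feature: sum_k W[f, k]."""
    return W.sum(axis=1)  # (F,)

def feature_attribution(x, W):
    """Entity-level attribution phi(x): x_f * sum_k W[f, k] for each feature f."""
    return x * intrinsic_importance(W)

rng = np.random.default_rng(0)
F, d = 10, 2
W = rng.normal(size=(F, d))
x_entity = rng.normal(size=F)

print(intrinsic_importance(W))            # one score per feature
print(feature_attribution(x_entity, W))   # feature importance for this entity
```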
One could infer the RHOFM parameters by directly minimizing a loss function. However, as mentioned in the introduction, we would like to distil some prior knowledge into the RHOFM, for instance via a knowledge graph specific to the recommendation use case. By seeing the feature embeddings in the RHOFM as node embeddings in a knowledge graph, the next section describes how to jointly train the RHOFM and the feature embeddings on a knowledge graph completion task.

Joint training of the RHOFM and the knowledge graph embeddings

We leverage the information from the partial graph $\mathcal{G} \triangleq (V_{\mathcal{G}}, E_{\mathcal{G}})$ to fit the RHOFM, by reducing the classification problem to the prediction of a subset of edges in a knowledge graph completion problem. To do so, we first extend the partial graph $\mathcal{G}$, based on the user-item association matrix A and the respective item and user feature matrices S and P, to build a knowledge graph $\mathcal{K} \triangleq (V, T)$ with nine types of relations. Note that the partial graph can possibly be empty or, on the contrary, can include any edge between drugs and features, between diseases and features, and between two features.

Definition 3 (Similarity-based knowledge graph augmented with prior edges). Considering a similarity threshold $\tau \in [0, 1]$ associated with a similarity function $\mathrm{sim}: \mathbb{R}^F \times \mathbb{R}^F \to [-1, 1]$, JELI builds a knowledge graph from the data set (A, P, S) and the partial graph $\mathcal{G} \triangleq (V_{\mathcal{G}}, E_{\mathcal{G}})$ as follows:

$$V \triangleq \{i_1, i_2, \dots, i_{n_i}\} \cup \{u_1, u_2, \dots, u_{n_u}\} \cup \{f_1, f_2, \dots, f_F\}, \quad (7)$$

$$\begin{aligned}
T \triangleq\ & \{(s, \mathrm{prior}, t) \mid (s, t) \in E_{\mathcal{G}},\ s, t \in V\} \\
\cup\ & \{(i_j, -, u_k) \mid A_{i_j, u_k} = -1,\ j \leq n_i,\ k \leq n_u\} \cup \{(i_j, +, u_k) \mid A_{i_j, u_k} = +1,\ j \leq n_i,\ k \leq n_u\} \\
\cup\ & \{(u_j, \mathrm{user\text{-}sim}, u_k) \mid \mathrm{sim}(P_{:,u_j}, P_{:,u_k}) > \tau,\ j, k \leq n_u\} \cup \{(i_j, \mathrm{item\text{-}sim}, i_k) \mid \mathrm{sim}(S_{:,i_j}, S_{:,i_k}) > \tau,\ j, k \leq n_i\} \\
\cup\ & \{(i_j, \mathrm{item\text{-}feat\text{-}pos}, f_k) \mid S_{f_k, i_j} > 0,\ k \leq F,\ j \leq n_i\} \cup \{(i_j, \mathrm{item\text{-}feat\text{-}neg}, f_k) \mid S_{f_k, i_j} < 0,\ k \leq F,\ j \leq n_i\} \\
\cup\ & \{(u_j, \mathrm{user\text{-}feat\text{-}pos}, f_k) \mid P_{f_k, u_j} > 0,\ k \leq F,\ j \leq n_u\} \cup \{(u_j, \mathrm{user\text{-}feat\text{-}neg}, f_k) \mid P_{f_k, u_j} < 0,\ k \leq F,\ j \leq n_u\}. \quad (8)
\end{aligned}$$

The objective of knowledge graph completion is to fit a model predictive of the probability of the presence of a triplet in the knowledge graph. In particular, computing the score associated with triplets of the form $(h, +, t)$, for (h, t) a user-item pair, boils down to fitting a classifier of user-item interactions. Conversely, a straightforward assumption is that the score associated with triplets $(h, +, t)$ should be opposite to the score assigned to triplets $(h, -, t)$. With that in mind, denoting by $\theta$ the set of RHOFM parameters and by $\theta_{\mathrm{JELI}} \triangleq (\theta, \{R_r, r \in \text{relations}\}, \{e_r, r \in \text{relations}\}, \{b_h, h \in V\})$ the total set of parameters to estimate, we define in Eq. (9) the edge score to be maximized for triplets present in the knowledge graph $\mathcal{K}$:

$$\mathrm{score}_{\theta_{\mathrm{JELI}}}(h, r, t) \triangleq \begin{cases} \mathrm{MuRE}(s_W(\mathbf{x}^h), e_r, s_W(\mathbf{x}^t); R_r, b_h, b_t) & \text{if } r \notin \{+, -\}, \\ \mathrm{RHOFM}_\theta(\mathbf{x}^h, \mathbf{x}^t) & \text{if } r = +, \\ -\mathrm{RHOFM}_\theta(\mathbf{x}^h, \mathbf{x}^t) & \text{if } r = -. \end{cases} \quad (9)$$

Remember that the vector $\mathbf{x}^h$ is well-defined for any item, user, or feature h. We then fit the parameter $\theta_{\mathrm{JELI}}$ by minimizing the soft margin ranking loss with margin $\lambda_0 = 1$, whose expression is recalled below:

$$\forall \theta, \quad \mathcal{L}_{\mathrm{margin}}(\theta) \triangleq \sum_{(h, r, t) \in T} \; \sum_{(\bar{h}, r, t) \notin T} \log\left(1 + \exp\left(\lambda_0 - \mathrm{score}_\theta(h, r, t) + \mathrm{score}_\theta(\bar{h}, r, t)\right)\right). \quad (10)$$

Further implementation details and numerical considerations for the training pipeline are available in the Appendix.
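As an illustration of Definition 3, here is a small, simplified sketch of how the similarity-based triplets of Eq. (8) could be assembled from toy inputs; the cosine similarity, the threshold value, and the entity naming scheme are illustrative assumptions, not choices prescribed by JELI.

```python
import numpy as np

def build_triplets(A, S, P, prior_edges=(), tau=0.5):
    """Toy construction of the knowledge graph triplets of Definition 3 (Eq. 8)."""
    n_i, n_u = A.shape
    F = S.shape[0]
    items = [f"i{j}" for j in range(n_i)]
    users = [f"u{k}" for k in range(n_u)]
    feats = [f"f{k}" for k in range(F)]
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    T = [(s, "prior", t) for (s, t) in prior_edges]
    T += [(items[j], "+" if A[j, k] > 0 else "-", users[k])
          for j in range(n_i) for k in range(n_u) if A[j, k] != 0]
    T += [(users[j], "user-sim", users[k])
          for j in range(n_u) for k in range(n_u) if j != k and cos(P[:, j], P[:, k]) > tau]
    T += [(items[j], "item-sim", items[k])
          for j in range(n_i) for k in range(n_i) if j != k and cos(S[:, j], S[:, k]) > tau]
    T += [(items[j], "item-feat-pos" if S[k, j] > 0 else "item-feat-neg", feats[k])
          for j in range(n_i) for k in range(F) if S[k, j] != 0]
    T += [(users[j], "user-feat-pos" if P[k, j] > 0 else "user-feat-neg", feats[k])
          for j in range(n_u) for k in range(F) if P[k, j] != 0]
    return T

rng = np.random.default_rng(0)
A = rng.choice([-1, 0, 1], size=(3, 4), p=[0.1, 0.7, 0.2])
S, P = rng.normal(size=(5, 3)), rng.normal(size=(5, 4))
print(len(build_triplets(A, S, P)))
```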
Downstream tasks with JELI

Interestingly, not only does JELI build embeddings for the items and users available at training time, but it can also be used to produce embeddings for new entities without requiring any retraining step. Given a feature vector $\mathbf{x} \in \mathbb{R}^F$, padded with zeroes on unavailable features if needed, the corresponding embedding is $s_W(\mathbf{x})$. However, the main objective of the trained JELI model is to predict new (positive) user-item associations, possibly for items and users not observed at training time. In that case, for any pair of item and user feature vectors $(\mathbf{x}^i, \mathbf{x}^u) \in \mathbb{R}^F \times \mathbb{R}^F$, the label predicted by JELI with RHOFM parameter $\theta$ is

$$\hat{y}_{\mathrm{JELI}}(\mathbf{x}^i, \mathbf{x}^u) \triangleq \begin{cases} +1 & \text{if } \sigma(\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u)) > 0.5, \\ -1 & \text{otherwise}, \end{cases} \quad (11)$$

where $\sigma$ is the standard sigmoid function. Note that the JELI approach could be made even more generic. Besides accepting any knowledge graph, this joint training approach could feature any classifier, not necessarily an RHOFM, as long as the classifier remains interpretable, as well as any knowledge graph completion loss function or edge score function.

Results

We first validate the performance, the interpretability, and the different components of JELI on synthetic data sets, for which the ground truth on feature importance is available. Then, we apply JELI to drug repurposing, our main motivating example for interpretability in recommendation. Further information about the generation of the synthetic data sets and other numerical details is available in the Appendix. Unless otherwise specified, the order of all factorization machine variants considered (including the RHOFM classifier in JELI) satisfies $m = 2$.

In this section, we consider several evaluation metrics. First, Spearman's rank correlation [45] quantifies the quality of the importance scores. It is computed on the ground truth importance scores $\mathbf{s} \triangleq (\sum_{k \leq d} W_{f,k})_{f \in F}$ and the predicted ones $\hat{\mathbf{s}} \triangleq (\sum_{k \leq d} \hat{W}_{f,k})_{f \in F}$, with $\hat{W}$ the inferred embedding parameter. Second, the Area Under the Curve (AUC) is computed on all user-item pairs to measure the classification performance between the ground truth $A \in \{-1, 0, +1\}^{n_i \times n_u}$ and the classifier scores $\hat{A} \in \mathbb{R}^{n_i \times n_u}$. We also consider the Negative-Sampling AUC (NS-AUC) [44]. Contrary to the AUC, the NS-AUC is a ranking measure akin to an average of user-wise AUCs, giving a more refined quantification of prediction quality across users. As a complementary measure of classification quality, we also consider the Normalized Discounted Cumulative Gain (NDCG), which is proportional to the quality of the ranking of recommended drugs across diseases. Note that all those classification metrics depend solely on the classifier scores, and not on the final class labels that can be inferred by applying a fixed threshold $\tau$. The exact definitions of each metric are reported in Table 1.

Table 1. Description of the performance metrics in Section "Results"

- Spearman's $\rho$ (Spearman's correlation): $1 - 6 \sum_{f \in F} (\Delta_f)^2 / (F(F^2 - 1))$
- AUC (Area Under the Curve): $\int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(\tau; \hat{A}, A); \hat{A}, A)\, d\tau$
- NS-AUC (average NS-AUC [44]): $n_u^{-1} \sum_{u \leq n_u} |\widetilde{\Omega}_u|^{-1} \sum_{(i, i') \in \widetilde{\Omega}_u} \delta(\hat{A}_{i,u} > \hat{A}_{i',u})$
- NDCG (average NDCG@$n_i$): $n_u^{-1} \sum_{u \leq n_u} \left(\sum_{i=1}^{N_u^+} A_{\sigma_u(i), u} / \log_2(i+1)\right) / \left(\sum_{i=1}^{N_u^+} 1 / \log_2(i+1)\right)$

Spearman's $\rho$: $\Delta_f$ is the gap in rank (in decreasing order) between the true and predicted importance scores $\mathbf{s}_f$ and $\hat{\mathbf{s}}_f$ for feature f. AUC: the true positive rate between the ground truth A and the predictions $\hat{A}$ is defined as $\mathrm{TPR}(\tau; \hat{A}, A) = \sum_{(i,u): A_{i,u} = +1} \delta(\hat{A}_{i,u} > \tau) / \sum_{(i,u)} \delta(\hat{A}_{i,u} > \tau)$, the false positive rate is $\mathrm{FPR}(\tau; \hat{A}, A) = \sum_{(i,u): A_{i,u} = -1} \delta(\hat{A}_{i,u} > \tau) / \sum_{(i,u)} \delta(\hat{A}_{i,u} \leq \tau)$, and $\delta$ is the Kronecker symbol. NS-AUC: the set of true positive (respectively, negative) drug-disease associations is $\Omega^{\pm} \triangleq \{(i, u) \mid A_{i,u} = \pm 1,\ i \leq n_i,\ u \leq n_u\}$, the set of positive drugs for disease u is $\Omega^+_u \triangleq \{i \mid A_{i,u} = +1\}$, and the set of drug pairs to be correctly ranked for disease u is $\widetilde{\Omega}_u \triangleq \{(i, i') \mid A_{i,u} > A_{i',u}\}$. NDCG: $\sigma_u$ is the permutation that sorts all recommendation coefficients $\hat{A}_{i,u}, i \leq n_i$ for disease u in decreasing order, that is, $\hat{A}_{\sigma_u(1),u} \geq \hat{A}_{\sigma_u(2),u} \geq \dots \geq \hat{A}_{\sigma_u(n_i),u}$. Finally, $N_u^+$ is defined as $\min(n_i, |\Omega_u^+|)$.
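For concreteness, here is a minimal sketch of the two user-wise ranking metrics of Table 1 (NS-AUC and NDCG at rank $n_i$); it assumes dense ground-truth and score matrices and binary relevance from positive associations, which are simplifying assumptions for illustration.

```python
import numpy as np

def ns_auc(A, A_hat):
    """Average over users of the fraction of correctly ordered item pairs."""
    vals = []
    for u in range(A.shape[1]):
        pairs = [(i, j) for i in range(A.shape[0]) for j in range(A.shape[0])
                 if A[i, u] > A[j, u]]
        if pairs:
            vals.append(np.mean([A_hat[i, u] > A_hat[j, u] for i, j in pairs]))
    return float(np.mean(vals))

def ndcg_at_ni(A, A_hat):
    """Average NDCG at rank n_i, with 0/1 relevance from positive associations."""
    vals = []
    for u in range(A.shape[1]):
        order = np.argsort(-A_hat[:, u])       # items sorted by predicted score
        n_pos = int((A[:, u] == 1).sum())
        if n_pos == 0:
            continue
        k = min(A.shape[0], n_pos)
        gains = (A[order[:k], u] == 1) / np.log2(np.arange(2, k + 2))
        ideal = 1.0 / np.log2(np.arange(2, k + 2))
        vals.append(gains.sum() / ideal.sum())
    return float(np.mean(vals))

rng = np.random.default_rng(0)
A = rng.choice([-1, 0, 1], size=(6, 4))
A_hat = rng.normal(size=(6, 4))
print(ns_auc(A, A_hat), ndcg_at_ni(A, A_hat))
```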
Synthetic data sets

We consider two types of "interpretable" synthetic recommendation data, called "linear first-order" and "linear second-order", for which the ground truth feature importance scores are known. At fixed values of the dimension d, the feature number F, and the numbers of items and users $n_i$ and $n_u$, both item and user feature vectors are drawn at random from a standard Gaussian distribution, along with a matrix $W \in \mathbb{R}^{F \times d}$. In most practical recommendation tasks, the algorithm cannot access the full feature values. The reasons for missing values can be diverse [46], but they most likely follow a not-missing-at-random mechanism, meaning that the probability of a missing value depends on the features. To implement such a mechanism, we applied a slightly adapted Gaussian self-masking [47] to the corresponding item and user feature matrices, such that we expect around 10% of missing feature values. The complete set of user-item scores is obtained from a generating model $g_0 : \mathbb{R}^F \times \mathbb{R}^F \to [0, 1]$. For "first-order" synthetic data sets, $g_0$ is defined as $(\mathbf{x}^i, \mathbf{x}^u) \mapsto \sigma(\sum_{k \leq d} (\mathbf{x}^i + \mathbf{x}^u)^\top W_{:,k}) = \sigma(\mathrm{RHOFM}_{(0, \mathbf{1}_d, \mathbf{0}_{m-1}, W)}(\mathbf{x}^i, \mathbf{x}^u))$, where $\mathbf{x}^i$ and $\mathbf{x}^u$ are respectively the item and user feature vectors. For the "second-order" type, $g_0$ is simply $(\mathbf{x}^i, \mathbf{x}^u) \mapsto \sigma(\mathrm{RHOFM}_{(1, \mathbf{1}_d, \mathbf{1}_{m-1}, W)}(\mathbf{x}^i, \mathbf{x}^u))$, where the order is $m = 2$. In both cases, the corresponding structure function $s_W$ is linear, that is, $s_W(\mathbf{x}) = \mathbf{x}^\top W$, and $\lambda = 0$. Finally, since in practice most of the user-item associations are inaccessible at training time, we label user-item pairs with $-1$, 0, and $+1$ depending on their score, such that the sparsity number, that is, the percentage of unknown values in the association matrix, is equal to a prespecified value greater than 50%.
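A minimal sketch of the "first-order" generating process just described (Gaussian features, linear structure, sigmoid link), under simplifying assumptions: the Gaussian self-masking step is omitted and the labelling rule is a simple quantile threshold chosen to reach the target sparsity.

```python
import numpy as np

def make_first_order_dataset(n_i, n_u, F, d, sparsity=0.65, seed=0):
    """Toy 'first-order' synthetic data set: features, scores, and sparse labels."""
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(F, n_i))               # item features (columns)
    P = rng.normal(size=(F, n_u))               # user features (columns)
    W = rng.normal(size=(F, d))                 # ground-truth feature embeddings

    # First-order generating model: sigma(sum_k (x_i + x_u)^T W[:, k])
    scores = 1 / (1 + np.exp(-((S.T @ W).sum(axis=1)[:, None]
                               + (P.T @ W).sum(axis=1)[None, :])))

    # Label the most extreme (1 - sparsity) fraction of pairs, keep the rest unknown (0)
    A = np.zeros((n_i, n_u), dtype=int)
    cutoff_hi = np.quantile(scores, 1 - (1 - sparsity) / 2)
    cutoff_lo = np.quantile(scores, (1 - sparsity) / 2)
    A[scores >= cutoff_hi] = 1
    A[scores <= cutoff_lo] = -1
    return S, P, W, A

S, P, W, A = make_first_order_dataset(n_i=173, n_u=173, F=10, d=2)
print((A != 0).mean())  # fraction of labelled associations, about 1 - sparsity
```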
JELI is performant for various validation metrics and reliably retrieves ground truth importance scores

We generate 10 synthetic data sets of each type ($F = 10$, $d = 2$, $n_i = n_u = 173$) and run JELI 100 times with different random seeds corresponding to different training/testing splits. Table 2 shows the numerical results across those $10 \times 100$ runs for several validation metrics on the predicted item-user associations and feature importance scores.

Table 2. Average validation metrics with standard deviations across 100 iterations and 10 synthetic data sets of each type (total number of values: 1,000)

Data set type   AUC           NS-AUC        Spearman's $\rho$
First-order     0.99 ± 0.013  0.89 ± 0.124  0.83 ± 0.279
Second-order    0.98 ± 0.019  0.86 ± 0.167  0.75 ± 0.363

Average (respectively, standard deviation) values are rounded to the closest second (resp., third) decimal place. AUC: Area Under the Curve. NS-AUC: Negative-Sampling AUC [44]. Spearman's $\rho$: Spearman's rank correlation.

Although there is a large variation in prediction quality due to the random training/testing splits, the metrics in Table 2 show a high predictive power for JELI, along with a consistently high correlation between true and predicted feature importance scores: when considering the best value across the 100 iterations, the average Spearman's rank correlation for the best-trained models across all 10 data sets is 0.932 for "first-order" sets and 0.932 for "second-order" ones. The bar plots of the ground truth and predicted importance scores for each of these 10 sets and each type of synthetic data in Fig. 2 show that JELI preserves the global trend in importance scores across data sets.

We also tested the impact of the dimension parameter d and of the order m of the RHOFM on the accuracy metrics. In the previous experiments, we used $d = 2$, which is the true dimensionality of the underlying generating model. However, it appears that JELI is also robust to the choice of the dimension parameter, for all metrics, provided the dimension is large enough. Moreover, similarly to higher-order factorization machines, higher-order interactions ($m > 2$) yield a more expressive classifier model and, thus, better classification performance. However, this improvement comes at a heavy computational price, even with the dynamic programming routines of [30], where the time complexity is linear in m. The experiments and results on parameter impact can be found in Appendix 4.

Fig. 2: Barplots of the true and predicted feature importance scores for the $F = 10$ features in each synthetic data set, for the best-performing model across 100 iterations. Top two rows: "first-order" synthetic data. Bottom two rows: "second-order" synthetic data.

JELI is robust in synthetic data sets across sparsity numbers

We also compare the predictive performance of JELI with that of embedding-based recommender systems from the state of the art, namely the fast.ai collaborative learner [8], the heterogeneous attention network (HAN) algorithm [48], and the neural inductive matrix completion with graph convolutional network (NIMCGCN) [10]. We set, whenever appropriate, the same hyperparameter values for all algorithms (with $d = 2$). We run each algorithm with 100 different random seeds on 5 "first-order" synthetic data sets generated with sparsity numbers in {50%, 65%, 80%}, that is, 500 runs per sparsity value. Figure 3 and Table 3 report the boxplots and the confidence intervals for the corresponding validation metrics. In addition to the AUC and NS-AUC, we include the Normalized Discounted Cumulative Gain (NDCG), computed for each user at rank $n_i$ (the number of items) and averaged across users, as a counterpart to the NS-AUC measure.

Fig. 3: NS-AUC values across "first-order" synthetic data sets for the three sparsity numbers and 500 iterations, for JELI and state-of-the-art embedding-based recommender systems.

Table 3. Average metrics with standard deviations across 100 iterations and 5 "first-order" sets

Sparsity  Method   AUC         NS-AUC      NDCG
50%       Fast.ai  0.99 ± 0.0  0.52 ± 0.3  0.85 ± 0.1
          HAN      0.93 ± 0.0  0.62 ± 0.1  0.18 ± 0.1
          NIM      0.93 ± 0.0  0.63 ± 0.1  0.39 ± 0.1
          JELI     0.99 ± 0.0  0.92 ± 0.1  0.96 ± 0.1
65%       Fast.ai  0.99 ± 0.0  0.64 ± 0.4  0.78 ± 0.3
          HAN      0.93 ± 0.0  0.67 ± 0.0  0.12 ± 0.1
          NIM      0.94 ± 0.0  0.67 ± 0.1  0.42 ± 0.1
          JELI     0.99 ± 0.0  0.94 ± 0.0  0.94 ± 0.1
80%       Fast.ai  0.99 ± 0.0  0.91 ± 0.1  0.77 ± 0.2
          HAN      0.96 ± 0.0  0.72 ± 0.0  0.20 ± 0.1
          NIM      0.93 ± 0.0  0.61 ± 0.1  0.19 ± 0.0
          JELI     0.99 ± 0.0  0.94 ± 0.0  0.85 ± 0.2

The NDCG at rank $n_i$ is averaged across users. NIM is NIMCGCN. Bold type is used for the highest value(s) in each experiment.

As illustrated by Fig. 3, JELI consistently outperforms the state of the art on all metrics and remains robust to the sparsity number.

Ablation study: both the structure and the joint learning are crucial to the performance
We perform the same type of experiments as in Sect. "JELI is robust in synthetic data sets across sparsity numbers" on several ablated versions of JELI, in order to estimate the contribution of each part to the predictive performance. We introduce several JELI variants. First, we remove the structured and embedding parts of the RHOFM classifier. FM2 is the regular second-order factorization machine of dimension d on the 2F-dimensional input vectors, without structure on the coefficients (see Definition 1), whereas CrossFM2 is a more refined non-structured second-order factorization machine in which the feature pairwise interaction terms only comprise pairs made of one item feature and one user feature, that is, with the notation of Definition 1,

$$\mathrm{CrossFM}_{(\omega_0, \omega^1, \omega^2)}(\mathbf{x}^i, \mathbf{x}^u) \triangleq \omega_0 + (\omega^1)^\top \mathbf{x}^{iu} + \sum_{f \leq F,\, f' > F} \langle \omega^2_{f,:}, \omega^2_{f',:} \rangle \, x^i_f \, x^u_{f'-F}. \quad (12)$$

Next, we also study methods featuring separate learning of the embeddings and of the RHOFM classifier, named Separate Embedding Learning and Training algorithms (SELT). We consider different feature embedding types. SELT-PCAf runs a d-component principal component analysis (PCA) on the concatenation of the item and user matrices along the column dimension, resulting in an $F \times (n_i + n_u)$ matrix; SELT-PCAf then infers the feature embeddings from each feature's first d principal components. Another PCA-based baseline, SELT-PCAiu, applies the learned PCA transformation directly to the item and user feature vectors to obtain item and user embeddings. Finally, the SELT-KGE approach completes the knowledge graph task to obtain item and user embeddings, without enforcing the feature-dependent structure, on the knowledge graph described in Definition 3 with an empty partial graph. Then, SELT-KGE uses those item and user embeddings to train the RHOFM classifier.

The final results in Fig. 4 and Table 4 show that the component most crucial to predictive performance across sparsity numbers is the factorization machine, which is unsurprising given the literature on factorization machines applied to sparse data. One can observe that separate embedding learning and factorization machine training leads to mediocre performance. The combination of a structured factorization machine and jointly learned embeddings, that is, JELI, gives the best performance, and the gain becomes more significant as the set of known associations gets smaller (and the sparsity number larger).

Fig. 4: NS-AUC values across "first-order" synthetic data sets for the three sparsity numbers and 500 iterations, for JELI and its ablated variants. The component most crucial to good predictive performance across sparsity numbers is the factorization machine.
Table 4. Average metrics with standard deviations across 100 iterations and 5 "first-order" sets

Sparsity  Method    AUC         NS-AUC      NDCG
50%       FM2       0.99 ± 0.0  0.92 ± 0.0  0.97 ± 0.0
          CrossFM2  0.99 ± 0.0  0.93 ± 0.0  1.00 ± 0.0
          S-PCAf    0.95 ± 0.0  0.70 ± 0.1  0.58 ± 0.2
          S-PCAiu   0.95 ± 0.0  0.61 ± 0.2  0.45 ± 0.2
          S-KGE     0.91 ± 0.0  0.43 ± 0.2  0.25 ± 0.2
          JELI      0.99 ± 0.0  0.92 ± 0.1  0.96 ± 0.0
65%       FM2       0.98 ± 0.0  0.91 ± 0.0  0.87 ± 0.1
          CrossFM2  0.99 ± 0.0  0.91 ± 0.0  0.95 ± 0.0
          S-PCAf    0.95 ± 0.0  0.73 ± 0.1  0.54 ± 0.2
          S-PCAiu   0.94 ± 0.0  0.62 ± 0.0  0.34 ± 0.1
          S-KGE     0.90 ± 0.0  0.43 ± 0.0  0.06 ± 0.0
          JELI      0.99 ± 0.0  0.94 ± 0.0  0.94 ± 0.1
80%       FM2       0.97 ± 0.0  0.84 ± 0.1  0.56 ± 0.1
          CrossFM2  0.98 ± 0.0  0.87 ± 0.0  0.74 ± 0.0
          S-PCAf    0.95 ± 0.0  0.73 ± 0.1  0.38 ± 0.1
          S-PCAiu   0.93 ± 0.0  0.62 ± 0.1  0.20 ± 0.0
          S-KGE     0.91 ± 0.0  0.55 ± 0.1  0.12 ± 0.1
          JELI      0.99 ± 0.0  0.94 ± 0.0  0.85 ± 0.2

The NDCG at rank $n_i$ is averaged across users. S- indicates an instance of SELT. Bold type is used for the highest value(s) in each experiment.

Application to drug repurposing

We aim to predict new therapeutic indications, that is, novel associations between chemical compounds and diseases. The interpretability of a model predicting associations between molecules and pathologies is crucial to encourage its adoption in healthcare. In that respect, higher-order factorization machines are very interesting models due to their inherent interpretability. However, particularly for the most recent drug repurposing data sets (e.g., TRANSCRIPT [43] and PREDICT [49]), the number of features ($F \approx 12{,}000$ and $F \approx 6{,}000$, respectively) is too large to effectively train a factorization machine, due to the curse of dimensionality. Resorting to knowledge graphs then enables the construction of low-dimensional vector representations of these associations. These representations are then fed as input to the classifier during training instead of the initial feature vectors.

JELI is on par with state-of-the-art approaches on drug repurposing data sets

We now run JELI and the baseline algorithms tested in Sect. "JELI is robust in synthetic data sets across sparsity numbers" on the Gottlieb [50] (named Fdataset in the original publication), LRSSL [51], PREDICT-Gottlieb [52], and TRANSCRIPT [43] drug repurposing data sets, which feature a variety of data types and sizes. Please refer to Appendix 4 for more information. Figure 5 and Table 5 report the validation metrics over 100 different training/testing splits for each method, with $d = 15$. From those results, we can see that the performance of JELI is on par with the top algorithm, HAN, and sometimes outperforms it, while providing interpretability.

Fig. 5: AUC values across drug repurposing data sets for 100 iterations, for JELI and state-of-the-art embedding-based approaches.

Table 5. Average metrics with standard deviations across 100 iterations for each drug repurposing data set

Data set  Method   AUC         NS-AUC      NDCG
Gottlieb  Fast.ai  0.90 ± 0.0  0.50 ± 0.1  0.01 ± 0.0
          HAN      0.93 ± 0.0  0.67 ± 0.0  0.02 ± 0.0
          NIM      0.90 ± 0.0  0.51 ± 0.0  0.01 ± 0.0
          JELI     0.90 ± 0.0  0.52 ± 0.0  0.02 ± 0.0
LRSSL     Fast.ai  0.90 ± 0.0  0.49 ± 0.1  0.01 ± 0.0
          HAN      0.95 ± 0.0  0.69 ± 0.0  0.10 ± 0.0
          NIM      0.91 ± 0.0  0.53 ± 0.0  0.01 ± 0.0
          JELI     0.92 ± 0.0  0.51 ± 0.0  0.02 ± 0.0
PRED-G    Fast.ai  0.90 ± 0.0  0.50 ± 0.1  0.01 ± 0.0
          HAN      0.93 ± 0.0  0.68 ± 0.0  0.01 ± 0.0
          NIM      0.91 ± 0.0  0.49 ± 0.0  0.01 ± 0.0
          JELI     0.90 ± 0.0  0.47 ± 0.0  0.02 ± 0.0
TRANSC    Fast.ai  0.61 ± 0.1  0.57 ± 0.1  0.04 ± 0.0
          HAN      0.93 ± 0.0  0.61 ± 0.0  0.08 ± 0.0
          NIM      0.92 ± 0.0  0.57 ± 0.0  0.04 ± 0.0
          JELI     0.92 ± 0.0  0.56 ± 0.0  0.02 ± 0.0

The NDCG at rank $n_i$ is averaged across users.
NIM is the algorithm NIMCGCN, TRANSC refers to the TRANSCRIPT data set, and PRED-G to the PREDICT-Gottlieb data set.

For the sake of completeness, we also considered one of the most popular recommendation data sets, MovieLens [2], to better assess the performance of JELI for the general purpose of collaborative filtering. The goal is to predict whether a movie should be recommended to a user, that is, whether the user would rate this movie with more than 3 stars. The movie features are the release year and one-hot encodings of the movie genres, whereas the user features are the counts of each movie tag that this user has previously assigned. This experiment confirms that the performance of JELI is on par with the baselines, even in a non-biological setting. Please refer to Appendix 4 for more information.

JELI can integrate any graph prior on the TRANSCRIPT data set

We now focus on the TRANSCRIPT data set, which involves gene activity measurements across $F = 12{,}096$ genes for $n_i = 204$ drugs and $n_u = 116$ diseases. We compare the predictive power of JELI on the TRANSCRIPT data set with the default knowledge graph created by JELI (named the "None" network, as it does not rely on external sources of knowledge) and with the default graph augmented with an external knowledge graph. The "None" network corresponds to the knowledge graph of Definition 3 with an empty partial graph. We considered as external knowledge graphs DRKG [53], Hetionet [54], PharmKG and PharmKG8k (a subset of 8,000 triplets) [41], and PrimeKG [42], as provided by the Python library PyKEEN [55]. In addition, we also built a partial graph listing protein-protein interactions (where proteins are matched one-to-one to their corresponding coding genes) based on the STRING database [56]. The resulting classification accuracies are shown in Figure 6 and Table 6. Most of the external graph priors significantly improve the classification accuracy, particularly the specific information about gene regulation (prior STRING). In Appendix 4, we also show that the performance of the graph priors correlates with a more frequent grouping of genes that belong to the same functional pathways. In Appendix 5, we perform a more thorough analysis of the specific case of melanoma and show that the predicted drug-disease associations and perturbed pathways recover some elements of the literature on melanoma.

Fig. 6: Predictive performance of JELI with different graph priors, for different validation metrics.

Table 6. Average metrics with standard deviations across 10 iterations on the TRANSCRIPT data set for different graph priors

Graph prior  AUC          NS-AUC       NDCG
None         0.90 ± 0.01  0.48 ± 0.02  0.02 ± 0.01
DRKG         0.90 ± 0.00  0.48 ± 0.02  0.00 ± 0.00
Hetionet     0.89 ± 0.00  0.43 ± 0.02  0.01 ± 0.01
PharmKG      0.88 ± 0.01  0.43 ± 0.03  0.01 ± 0.01
PharmKG8k    0.91 ± 0.01  0.53 ± 0.03  0.03 ± 0.01
PrimeKG      0.89 ± 0.01  0.48 ± 0.03  0.02 ± 0.01
STRING       0.91 ± 0.00  0.55 ± 0.03  0.02 ± 0.01

Discussion

This work proposes the JELI approach for integrating knowledge graph-based regularization into an interpretable recommender system. The structure incorporated into the user and item embeddings accounts for numerical feature values in a generic fashion, which allows one to go beyond the categorical relations encoded in knowledge graphs without adding many parameters.
This method allows us to derive item and user representations of fixed dimension and to score a user-item association, even for previously unseen items and users. We have shown the performance and the explanatory power of JELI on synthetic and real-life data sets. The Python package that implements the JELI approach is available in the following open-source repository: github.com/RECeSS-EU-Project/JELI. The experimental results can be reproduced using the code uploaded at github.com/RECeSS-EU-Project/JELI-experiments.

Conclusions

This paper introduces and empirically validates our algorithmic contribution, JELI, for drug repurposing. JELI aims to provide straightforward interpretability in recommendations while integrating any graph information on items and users. However, there are a few limitations to the JELI approach. The first one is that JELI performs best on sparse user and item feature matrices, which exploit the expressiveness of factorization machines to the fullest. Moreover, this approach is quite slow compared to state-of-the-art algorithms, since it simultaneously solves two tasks: the recommendation task on user-item pairs and the knowledge graph completion task. We discuss the scalability of JELI with respect to various parameters in Appendix 6. However, this slowness is mitigated by the superior interpretability of JELI compared to the baselines. Furthermore, an interesting line of subsequent work would focus on integrating missing values into the recommendation problem. As it is, JELI ignores missing features and potentially recovers qualitative item-feature (respectively, user-feature) links during the knowledge graph completion task. That is, provided an approach to quantify the strength of the link between an item and a feature, JELI could also be extended to impute the corresponding missing feature value for this item.

Acknowledgements