Abstract
Background
Interpretability is a topical question in recommender systems,
especially in healthcare applications. An interpretable classifier
quantifies the importance of each input feature for the predicted
item-user association in an unambiguous fashion.
Results
We introduce the novel Joint Embedding Learning-classifier for improved
Interpretability (JELI). By combining the training of a structured
collaborative-filtering classifier and an embedding learning task, JELI
predicts new user-item associations based on jointly learned item and
user embeddings while providing feature-wise importance scores.
Therefore, JELI flexibly allows the introduction of priors on the
connections between users, items, and features. In particular, JELI
simultaneously (a) learns feature, item, and user embeddings; (b)
predicts new item-user associations; (c) provides importance scores for
each feature. Moreover, JELI instantiates a generic approach to
training recommender systems by encoding graph-based regularization
constraints.
Conclusions
First, we show that the joint training approach yields a gain in the
predictive power of the downstream classifier. Second, JELI can recover
feature-association dependencies. Finally, JELI requires fewer
parameters than the baselines on synthetic and drug-repurposing data
sets.
Keywords: Drug repurposing, Interpretability, Gene expression,
Collaborative filtering
Background
The Netflix Challenge [1] popularized collaborative filtering, where
connections between items and users are inferred based on the
guilt-by-association principle and similarities. This approach is
particularly suitable for use cases where information about known
user-item associations is sparse—typically, close to 99% of all
possible user-item associations are unlabelled, as in the MovieLens
movie recommendation data set [2]—and when feedback is implicit. For
instance, in the case of movie recommendations on streaming platforms
or online advertising, the algorithm often only gets access to clicks,
that is, positive feedback. However, the reasons for ignoring an item
can be numerous: either the item would straightforwardly receive
negative feedback, or the item is too far from the user's usual
exploration zone but could still be enjoyed. In some rare cases, true
negative feedback might be accessible, but in even smaller numbers than
the positive associations, for instance in drug repurposing data sets,
through reports of failed Phase III clinical trials [3]. Collaborative
filtering algorithms then enable the modeling of a user's behavior
based on their similarity to other users and the similarity of the
potential recommended item to other items positively graded by this
cluster of users.
Several types of algorithms implement collaborative filtering. For
instance, matrix factorizations [4, 5], such as Non-negative Matrix
Factorization (NMF) [6] or Singular Value Decomposition (SVD) [7],
decompose the matrix of item-user associations into a product of two
low-rank tensors. Other types of algorithms are (deep) neural networks
[8–10], which build item and user embeddings with convolutional or
graph neural networks based on common associations and/or additional
feature values. On the one hand, among the latter approaches,
graph-based methods, which integrate and infer edges between features,
items, and users, show promising performance [11]: predictions are
supported by establishing complex connections between those entities.
On the other hand, matrix factorizations incorporate explicit
interpretability, as one can try to connect the inferred latent factors
to specific user and item features. One example is the factorization
machine (FM) [12], which combines a linear regression-like term and a
feature pairwise interaction term to output a score for binary
classification. The learned coefficients of the FM explicitly
contribute to the score for each item and user feature set. This type
of interpretability, called feature attribution in the literature
[13–16], allows further downstream statistical analysis of the feature
interactions. For instance, in our motivating example of drug
repurposing, the objective is to identify novel drug-disease
therapeutic associations. If features are genes mutated by the
pathology or targeted by the chemical compound, the biological pathways
overrepresented among those that are respectively affected or repaired
can be retrieved from the set of key repurposing genes. This, in turn,
offers important arguments in favor of the therapeutic value of a
drug-disease indication and for further development towards marketing.
In this work, we aim to combine the performance and versatility (in
terms of embeddings) of graph-based collaborative filtering with the
explicit interpretability of factorization machines to derive a
"best-of-both-worlds" approach for predicting user-item associations.
To achieve this, we introduce a special class of factorization machines
that leverages a strong hypothesis on the structure of item and user
embeddings as functions of feature embeddings. This classifier is then
jointly trained with a knowledge graph completion task. The knowledge
graph connects items, users, and features based on the similarities
between them and on potential additional priors on their relationships
with features. The embeddings used to compute the edge probability
scores in the knowledge graph are shared with the factorization
machine, which allows the distillation of generic priors into the
classifier.
Our paper is structured as follows. In Sect. "Related work", we give an
overview of the state of the art on factorization machines and
knowledge graphs and of how their combination might overcome some
topical questions in the field. Section "Methods" introduces the JELI
algorithm, which features our novel class of structured factorization
machines and a joint training strategy with a knowledge graph. Finally,
Sect. "Results" shows the performance and interpretability of the JELI
approach on both synthetic data sets and drug repurposing applications.
Notation For any matrix $M$ (in capital letters), we denote by
$M_{i,:}$, $M_{:,j}$ and $M_{i,j}$ respectively its $i$-th row, $j$-th
column and coefficient at position $(i, j)$. For any vector
$\mathbf{v}$ (in bold type), $v_i$ is its $i$-th coefficient. Moreover,
$M^{\dagger}$ is the pseudo-inverse of matrix $M$.
Related work
Our approach, JELI, leverages a generic knowledge graph completion task
and the interpretability of factorization machines to derive a novel,
explainable collaborative filtering approach.
Knowledge graph embedding learning
A knowledge graph is a set of triplets of the form (h, r, t) such that
the head entity h is linked to the tail entity t by the relation r
[17]. Entity and relation embeddings learned on the graph allow us to
capture its structure and connections in numerical form, as embeddings
are parameters of a function predicting the presence of a triplet in
the graph. Those parameters are learned based on the current set of
edges in the graph. This approach encodes the graph structure into
numerical representations, which can later be provided to a downstream
regression model [18]. The edge prediction function is usually called
the interaction model. Many such models exist [19–22]; among them is
the Multi-Relational Euclidean (MuRE) model [23], defined for any
triplet (h, r, t) of respective embeddings
$\mathbf{e}_h, \mathbf{e}_r, \mathbf{e}_t$ of dimension $d$ as

$$\mathrm{MuRE}(\mathbf{e}_h, \mathbf{e}_r, \mathbf{e}_t) = -\lVert R_r \mathbf{e}_h - (\mathbf{e}_t + \mathbf{e}_r) \rVert_2^2 + b_h + b_t, \qquad (1)$$

where the $d \times d$ matrix $R_r$ and the scalars $b_h$ and $b_t$ are
respectively relation-, head- and tail-specific parameters. Notably,
this interaction model has exhibited good embedding engineering
properties throughout the literature [24, 25].
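The following is a minimal sketch of the MuRE scoring function in Eq. (1), assuming NumPy; the function and variable names (mure_score, R_r, b_h, b_t) are illustrative and not part of a specific library.

```python
import numpy as np

def mure_score(e_h, e_r, e_t, R_r, b_h, b_t):
    """MuRE score of a triplet (h, r, t): -||R_r e_h - (e_t + e_r)||_2^2 + b_h + b_t."""
    diff = R_r @ e_h - (e_t + e_r)
    return -float(diff @ diff) + b_h + b_t

d = 4
rng = np.random.default_rng(0)
e_h, e_r, e_t = rng.normal(size=(3, d))   # head, relation, tail embeddings
R_r = rng.normal(size=(d, d))             # relation-specific matrix
print(mure_score(e_h, e_r, e_t, R_r, b_h=0.1, b_t=-0.2))
```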
Yet, many challenges remain in this field of research. Current
representation learning algorithms (regardless of the interaction model
selected between a triplet and its embeddings) infer representations
directly on the nodes and relations of the graph. However, this
approach does not make it possible to establish relationships between
nodes beyond a similarity of numerical representations for neighboring
nodes under specific relations in the graph. That is, specific logical
operations depending on the relation are often ignored: for instance,
for a relation r and its opposite $\neg r$, we would like to ensure
that the score $p$ assigned to the triplet (h, r, t) is proportional to
$-\bar{p}$, where $\bar{p}$ is the score associated with the triplet
$(h, \neg r, t)$. Moreover, knowledge graphs are currently better
suited to categorical information, where entities and relationships
take discrete rather than numerical values. Numerical values could
describe a relation such as "users from this specific age group are
twice as interested in that movie genre". Some recent works focus on
integrating numerical values into knowledge graph embeddings. In KEN
embeddings [26], a single-layer neural network is trained for each
numeric relation, taking the attribute as input and returning an
embedding. Another approach, TransEA [27], optimizes a loss function
that linearly combines, through a hyperparameter, a loss on the
categorical variables (the difference between the scores and the
indicator of the presence of a triplet) and another loss on numerical
variables, which seeks to minimize the gap between the variable and a
scalar product involving its embedding. However, these two approaches
add several extra hyperparameters and do not deal with
interpretability.
Resorting to knowledge-graph-infused embeddings allows us to integrate
prior knowledge constraints generically into the representations of
entities, both items and users. We aim to enforce a structure on those
embeddings that supports accurate prediction of user-item associations
by incorporating the embeddings into a special type of factorization
machine.
Factorization machines
Factorization machines are a type of collaborative filtering algorithm
introduced by [12]. Their most common expression, the second-order
factorization machine of dimension $d$, comprises a linear regression
term (with a possibly non-zero intercept) and a term that combines
interactions from all distinct pairs of features through a scalar
product of their corresponding low-rank latent vectors of dimension
$d$. This approach, particularly in the presence of sparse feature
vectors, is computationally efficient while performing well on a
variety of recommendation tasks, for instance knowledge tracing in
education [28] and click-through rate prediction [29]. Computationally
tractable evaluation and training routines were first proposed by [30]
for higher-order factorization machines (HOFMs), which were also
introduced in [12] and include interactions over all distinct sets of
$K$ features, with $K \ge 2$, opening the way to even finer
classification models. The definition of HOFMs is recalled in
Definition 1.
Definition 1

Higher-Order Factorization Machines (HOFMs). Let us denote the set of
available item and user features $F \subseteq \mathbb{N}^*$. The
general expression for a HOFM [12, 30] of order $m \ge 2$ and
dimensions $d_2, \dots, d_m$ that takes as input a single feature
vector $\mathbf{x} \in \mathbb{R}^{|F|}$ is a model with parameters
$\theta = (\omega^0, \omega^1, \omega^2, \dots, \omega^m)$, where
$(\omega^0, \omega^1) \in \mathbb{R} \times \mathbb{R}^{|F|}$ and, for
any $i \in \{2, \dots, m\}$, $\omega^i \in \mathbb{R}^{|F| \times d_i}$:

$$\mathrm{HOFM}_\theta(\mathbf{x}) \triangleq \omega^0 + (\omega^1)^\intercal \mathbf{x} + \sum_{2 \le t \le m} \; \sum_{\substack{f_1 < \dots < f_t \\ f_1, \dots, f_t \in F}} \left\langle \omega^t_{f_1,:}, \dots, \omega^t_{f_t,:} \right\rangle \, x_{f_1} \cdot x_{f_2} \cdots x_{f_{t-1}} \cdot x_{f_t}, \qquad (2)$$

where $\langle \omega^t_{f_1,:}, \dots, \omega^t_{f_t,:} \rangle \triangleq \sum_{d \le d_t} \omega^t_{f_1,d} \cdot \omega^t_{f_2,d} \cdots \omega^t_{f_{t-1},d} \cdot \omega^t_{f_t,d}$
for any $t$ and indices $f_1, \dots, f_t$. In particular, for $m = 2$,

$$\mathrm{FM}_\theta(\mathbf{x}) \triangleq \overbrace{\omega^0 + (\omega^1)^\intercal \mathbf{x}}^{\text{linear regression term}} + \overbrace{\sum_{\substack{f < f' \\ f, f' \in F}} \left\langle \omega^2_{f,:}, \omega^2_{f',:} \right\rangle \, x_f \cdot x_{f'}}^{\text{pairwise interaction term}}. \qquad (3)$$
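As an illustration, here is a minimal sketch of the second-order FM score in Eq. (3), assuming NumPy; it uses the standard rewriting of the pairwise term in $O(Fd)$ operations, and all names are illustrative.

```python
import numpy as np

def fm2_score(x, w0, w1, W2):
    """Second-order FM of Eq. (3). x: (F,), w0: scalar, w1: (F,), W2: (F, d)."""
    linear = w0 + w1 @ x
    # sum_{f<f'} <W2[f,:], W2[f',:]> x_f x_f'
    #   = 0.5 * (||W2^T x||^2 - sum_f ||W2[f,:]||^2 x_f^2)
    xw = W2.T @ x
    sq = ((W2 ** 2).T @ (x ** 2)).sum()
    pairwise = 0.5 * (xw @ xw - sq)
    return linear + pairwise
```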
Besides their good predictive power, factorization machines involve
explicit coefficients that quantify the contribution of each set of $K$
features to the final score associated with the positive class of
associations. These coefficients offer a straightforward insight into
the discriminating features for the recommendation problem, and this
type of "white-box" explainability belongs to a larger research field
called feature attribution-based interpretability.
Feature attribution-based interpretability
Given a binary classifier $C$ and a feature vector
$\mathbf{x} \in \mathbb{R}^F$, a feature attribution function
$\phi_C : \mathbb{R}^F \to \mathbb{R}^F$ returns importance scores for
each feature contributing to the positive class score for the input
vector $\mathbf{x}$. If the importance score associated with feature
$f$ is largely positive (resp., negative), it means that feature $f$
drives the membership of $\mathbf{x}$ to the positive (resp., negative)
class. In contrast, an importance score close to 0 indicates that
feature $f$ has little influence on the classification of data point
$\mathbf{x}$. Although other types of interpretability approaches exist
(based on decision rules given by single classifier trees or random
forests [31, 32], counterfactual examples [33] or logic rules
[34, 35]), importance score-based methods allow going beyond single
feature influence. In particular, the importance scores can be
integrated into downstream analyses to statistically quantify the
effect of specific groups of features on the classification. For
instance, when considering genes as features, an enrichment analysis
[36] based on the scores can uncover overrepresented, functionally
consistent cell pathways.
Some classifiers, as seen for factorization machines, readily include
importance scores, whereas several approaches compute post-hoc
importance scores, which are evaluated based on the outputs of an
already trained "black-box" classifier, such as a neural network. Such
approaches include Shapley values [13], LIME [14], DeepLIFT [37] (for
image annotation) or sufficient explanations [38]. Yet, recent works
show a lack of robustness and consistency across post-hoc feature
attribution methods, both empirically [15] and theoretically [16, 39].
However, the advantage of post-hoc approaches is that they allow the
explanation of any type of classifier and combine the richness of the
model (predictive performance) with interpretability.
The approach described in our paper thus aims to encompass any generic
embedding model without losing the connection between the classifier's
input vectors and the initial features.
Methods
In this section, we define the JELI algorithm, our main contribution.
The full pipeline of JELI is illustrated in Fig. 1. Let us define in
formal terms the inputs to the associated problem of recommending $n_i$
items $i_1, i_2, \dots, i_{n_i}$ to $n_u$ users
$u_1, u_2, \dots, u_{n_u}$. The minimal input to the recommendation
problem is the user-item association matrix
$A \in \{-1, 0, +1\}^{n_i \times n_u}$, which summarizes the known
positive ($+1$)—and possibly negative ($-1$)—associations and denotes
unknown associations by zeroes. In simple terms, the recommender system
aims to replace zeroes with $\pm 1$ while preserving the labels of
nonzero-valued associations. Second, in some cases, we also have access
to the respective item and user feature matrices, denoted
$S \in \mathbb{R}^{F \times n_i}$ and $P \in \mathbb{R}^{F \times n_u}$.
Without loss of generality, we assume that the item and user feature
matrices share the same $F$ features $f_1, f_2, \dots, f_F$. Finally,
there might be a partial graph on some of the items, users, features,
and possibly other entities. For instance, such a graph might connect
movies, users, and human emotions for movie recommendation [40], or
drugs, diseases, pathways, and proteins or genes for drug repurposing
[41, 42]. We denote this graph $G(V_G, E_G)$, where $V_G$ is the set of
nodes in $G$ and $E_G$ is its set of (undirected, labeled) edges.
Fig. 1. Full pipeline of the JELI algorithm, from the initial inputs to
the downstream tasks
We first introduce a class of higher-order factorization machines,
called redundant structured HOFMs, which classify user-item
associations based on an assumption about the structure of item/user
embeddings as functions of feature embeddings.
Redundant structured HOFM (RHOFM)
This subtype of higher-order factorization machines features shared
higher-order parameters across interaction orders, such that the
corresponding dimensions of the HOFM satisfy $d_2 = \dots = d_m = d$ in
Definition 1. As such, RHOFMs are related to the inhomogeneous ANOVA
kernel HOFMs (iHOFMs) mentioned in [30]. This type of factorization
machine is such that the higher-order dimensions are all equal (that
is, $d_2 = \dots = d_m = d$) and the corresponding higher-order
coefficients are all proportional to one another: for any
$t, t' \ge 2$ and $f \le F$, there exists $c \in \mathbb{R}$ such that
$\omega^t_f = c \cdot \omega^{t'}_f$ in Definition 1. However, what
distinguishes the RHOFM from an iHOFM is the following structural
hypothesis: it is assumed that every entity $d$-dimensional embedding
$\mathbf{e} \in \mathbb{R}^d$ results from some function $s_W$ with
parameter $W \in \mathbb{R}^{F \times d}$ applied to the corresponding
entity feature vector $\mathbf{x} \in \mathbb{R}^F$. For instance, an
embedding $\mathbf{e}$ associated with feature vector $\mathbf{x}$ with
a linear structure function of dimension $d$ is defined as
$\mathbf{e} = s_W(\mathbf{x}) = \mathbf{x} W$. However, any, possibly
non-linear, structure function $s_W$ can be considered. Note that, for
completeness, we can define a feature vector for features, which is
simply the result of the indicator function on features in $F$: for
feature $f \in F$, its corresponding feature vector is
$\mathbf{x}^f \triangleq (\delta(f_j = f))_{j \le F}$, where $\delta$
is the Kronecker symbol, such that the structure function $s_W$ can be
applied to any item, user or feature entity. Definition 2 gives the
formal expression of RHOFMs for any order, dimension, and structure.
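Here is a minimal sketch of the linear structure function $s_W$ and of the one-hot feature vector of a feature, assuming NumPy; the names (s_W, W, x_item) are illustrative.

```python
import numpy as np

F, d = 10, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(F, d))       # shared feature-embedding parameter

def s_W(x, W):
    """Linear structure function: entity embedding from its feature vector."""
    return x @ W

x_item = rng.normal(size=F)       # an item feature vector
e_item = s_W(x_item, W)           # its d-dimensional embedding

x_f3 = np.eye(F)[3]               # one-hot feature vector of feature f_3
e_f3 = s_W(x_f3, W)               # equals W[3, :], the embedding of feature f_3
```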
Definition 2

Redundant structured HOFMs (RHOFMs). The RHOFM of structure $s_W$,
order $m$ and dimension $d$, with parameters
$\theta = (\omega^0, \omega^1, \omega^{2:m}, W) \in \mathbb{R} \times \mathbb{R}^d \times \mathbb{R}^{m-1} \times \mathbb{R}^{F \times d}$,
on an item and a user of respective feature vectors
$\mathbf{x}^i, \mathbf{x}^u \in \mathbb{R}^F$ is defined as

$$\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u) \triangleq \omega^0 + (\omega^1)^\intercal (\mathbf{x}'^{iu})^\intercal \widetilde{W}^{iu}_\lambda + \sum_{2 \le t \le m} \omega^{2:m}_{t-1} \sum_{\substack{f_1 < \dots < f_t \\ f_1, \dots, f_t \le 2F}} \left\langle (\widetilde{W}^{iu}_\lambda)_{f_1,:}, \dots, (\widetilde{W}^{iu}_\lambda)_{f_t,:} \right\rangle \, x'^{iu}_{f_1} \, x'^{iu}_{f_2} \cdots x'^{iu}_{f_t}, \qquad (4)$$

where $\mathbf{x}'^{iu} \triangleq [(\mathbf{x}^i)^\intercal, (\mathbf{x}^u)^\intercal]^\intercal \in \mathbb{R}^{2F}$
is the concatenation of the feature vectors along the row dimension,
$\tilde{\mathbf{x}}^{iu} \triangleq [\mathbf{x}^i, \mathbf{x}^u]^\intercal \in \mathbb{R}^{F \times 2}$
the concatenation along the column dimension, and

$$\widetilde{W}^{iu}_\lambda \triangleq \left(\tilde{\mathbf{x}}^{iu} (\tilde{\mathbf{x}}^{iu})^\intercal + \lambda I_F\right)^\dagger (\tilde{\mathbf{x}}^{iu})^\intercal \left[s_W(\mathbf{x}^i)^\intercal, s_W(\mathbf{x}^u)^\intercal\right] \in \mathbb{R}^{F \times d}$$

is the $\lambda$-regularized approximate least squares estimator of the
following equation in $V$:
$s_W(\tilde{\mathbf{x}}^{iu}) = \tilde{\mathbf{x}}^{iu} V$, with
$\lambda \ge 0$.
By reordering terms and by definition of $\widetilde{W}^{iu}_\lambda$
(full details in Appendix), if we denote by $f \% F$ the remainder of
the Euclidean division of $f$ by $F$, we can notice that

$$\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u) \approx \omega^0 + (\omega^1)^\intercal \left(s_W(\mathbf{x}^i) + s_W(\mathbf{x}^u)\right) + \sum_{2 \le t \le m} \omega^{2:m}_{t-1} \sum_{\substack{f_1 < \dots < f_t \\ f_1, \dots, f_t \le 2F}} \left\langle x'^{iu}_{f_1} s_W(\mathbf{x}^{f_1 \% F}), \dots, x'^{iu}_{f_t} s_W(\mathbf{x}^{f_t \% F}) \right\rangle. \qquad (5)$$

In particular, for $m = 2$, $\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u)$
is roughly equal to

$$\omega^0 + (\omega^1)^\intercal \left(s_W(\mathbf{x}^i) + s_W(\mathbf{x}^u)\right) + \omega^2 \sum_{\substack{f_1 < f_2 \\ f_1, f_2 \le 2F}} \left\langle x'^{iu}_{f_1} s_W(\mathbf{x}^{f_1 \% F}), \; x'^{iu}_{f_2} s_W(\mathbf{x}^{f_2 \% F}) \right\rangle. \qquad (6)$$
Compared to the expression of a factorization machine for $m = 2$ in
Eq. (3), the RHOFM includes a structure that can be non-linear (through
the function $s_W$) and a supplementary degree of freedom with
parameters $\omega^1$ and $\omega^{2:m}$.
The RHOFM then comprises a term linear in the item/user embeddings and
a product of feature embeddings weighted by the corresponding values in
the item and user initial feature vectors. Moreover, if we assume a
linear structure on the RHOFM, the embedding vector for feature $f_j$
is exactly $W_{f_j,:}$, and the embeddings for items and users are the
sums of feature embeddings weighted by their corresponding values in
the item and user vectors. The expression in Definition 2 is relatively
computationally efficient when combined with the dynamic programming
routines described in [30]. Moreover, the redundancy in the RHOFM
allows it to benefit from the same type of computational speed-up as
inhomogeneous ANOVA kernels or iHOFMs.
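Below is a minimal sketch of the second-order RHOFM score of Eq. (6) under a linear structure $s_W(\mathbf{x}) = \mathbf{x}W$ (so that $\widetilde{W}^{iu}_\lambda \approx W$, with $\lambda = 0$), assuming NumPy; the pairwise term uses the usual sum-of-squares identity, and all names are illustrative.

```python
import numpy as np

def rhofm2_score(x_i, x_u, w0, w1, w2, W):
    """x_i, x_u: (F,) item/user features; w1: (d,); w2: scalar; W: (F, d)."""
    s = lambda x: x @ W                          # linear structure function
    F = x_i.shape[0]
    x_prime = np.concatenate([x_i, x_u])         # (2F,) concatenated features
    # factor for index f: x'_f * s_W(one-hot of feature f % F) = x'_f * W[f % F, :]
    factors = x_prime[:, None] * W[np.arange(2 * F) % F, :]
    linear = w0 + w1 @ (s(x_i) + s(x_u))
    total = factors.sum(axis=0)
    pairwise = 0.5 * (total @ total - (factors ** 2).sum())
    return linear + w2 * pairwise
```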
Knowing that HOFMs (in Definition 1) and iHOFMs would take as input the
concatenation along the row dimension of $(\mathbf{x}^i, \mathbf{x}^u)$,
and assuming that the dimensions across orders are the same, i.e.,
$d_2 = \dots = d_m = d$, HOFMs comprise $1 + 2F + 2Fd(m-1)$ parameters,
which can amount to a prohibitive computational cost in practice.
Similarly, iHOFMs would require the training of $1 + m + 2Fd$
parameters, whereas RHOFMs (in Definition 2) only feature
$1 + m + (F+1)d$, hence removing the multiplicative constant on the
number of features $F$, which matters for high-dimensional data sets
such as the TRANSCRIPT drug repurposing data set [43], which gathers
values on 12,000 genes across the human genome.
Regarding interpretability, as evidenced by Eq. (5), the coefficients
involved in the expression of the RHOFM are straightforwardly connected
to the input embeddings. In the case of the linear structure and when
$\omega^1 = \mathbf{1}_d$ and $\omega^{2:m} = \mathbf{1}_{m-1}$ (or any
other constant), the contributions from features on the one hand and
the item/user values on the other can easily be disentangled. In that
case, $\widetilde{W}^{iu}_\lambda \approx W$, and then for any feature
$f$, the intrinsic (i.e., independent of users or items) importance
score is $\sum_{k \le d} W_{f,k}$. When associated with an entity (item
or user) of feature vector $\mathbf{x} \in \mathbb{R}^F$, its
importance score is simply $x_f \sum_{k \le d} W_{f,k}$. Using
$\tilde{\mathbf{x}}^{iu} \widetilde{W}^{iu}_\lambda \approx s_W(\tilde{\mathbf{x}}^{iu})$
for non-linear structures, we can extrapolate this result to obtain the
following intrinsic feature importance score.
Result 1

Feature importance scores in a RHOFM. When $\omega^1 = \mathbf{1}_d$
and $\omega^{2:m} = \mathbf{1}_{m-1}$ (or any other constant), the
intrinsic (entity-independent) feature importance score for feature
$f \le F$ in an RHOFM (Definition 2) is
$\sum_{k \le d} (\widetilde{W}^{iu}_\lambda)_{f,k}$. As a consequence,
the feature attribution function associated with feature vector
$\mathbf{x} \in \mathbb{R}^F$ is

$$\phi_{\mathrm{RHOFM}}(\mathbf{x}) \triangleq \left( x_f \sum_{k \le d} (\widetilde{W}^{iu}_\lambda)_{f,k} \right)_{f \le F}.$$
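A minimal sketch of the attribution function of Result 1 in the linear-structure case (where $\widetilde{W}^{iu}_\lambda \approx W$), assuming NumPy; names are illustrative.

```python
import numpy as np

def rhofm_feature_attribution(x, W):
    """x: (F,) entity feature vector; W: (F, d) learned embedding parameter.
    Returns one importance score per feature: x_f * sum_k W[f, k]."""
    intrinsic = W.sum(axis=1)    # entity-independent feature scores
    return x * intrinsic
```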
One could infer the RHOFM parameters by directly minimizing a loss
function. However, as mentioned in the introduction, we would like to
distill some prior knowledge into the RHOFM, for instance via a
knowledge graph specific to the recommendation use case. By viewing the
feature embeddings in the RHOFM as node embeddings in a knowledge
graph, the next section describes how to jointly train the RHOFM and
the feature embeddings on a knowledge graph completion task.
Joint training of the RHOFM and the knowledge graph embeddings
We leverage the information from the partial graph $G(V_G, E_G)$ to fit
the RHOFM, by reducing the classification problem to the prediction of
a subset of edges in a knowledge graph completion problem. To do so, we
first extend the partial graph $G$, based on the user-item association
matrix $A$ and the item and user feature matrices $S$ and $P$, to build
a knowledge graph $K(V, T)$ with nine types of relations. Note that the
partial graph can be empty or, on the contrary, can include any edge
between drugs and features, between diseases and features, and between
two features.
Definition 3

Similarity-based knowledge graph augmented with prior edges.
Considering a similarity threshold $\tau \in [0,1]$ associated with a
similarity function $\mathrm{sim} : \mathbb{R}^F \times \mathbb{R}^F \to [-1,1]$,
JELI builds a knowledge graph from the data set $A$, $P$ and $S$ and
partial graph $G(V_G, E_G)$ as follows:

$$V \triangleq \{i_1, i_2, \dots, i_{n_i}\} \cup \{u_1, u_2, \dots, u_{n_u}\} \cup \{f_1, f_2, \dots, f_F\}, \qquad (7)$$

$$\begin{aligned}
T \triangleq\; & \{(s, \text{prior}, t) \mid (s,t) \in E_G,\ s,t \in V\}\\
\cup\; & \{(i_j, -, u_k) \mid A_{i_j,u_k} = -1,\ j \le n_i,\ k \le n_u\}\\
\cup\; & \{(i_j, +, u_k) \mid A_{i_j,u_k} = +1,\ j \le n_i,\ k \le n_u\}\\
\cup\; & \{(u_j, \text{user-sim}, u_k) \mid \mathrm{sim}(P_{:,u_j}, P_{:,u_k}) > \tau,\ j,k \le n_u\}\\
\cup\; & \{(i_j, \text{item-sim}, i_k) \mid \mathrm{sim}(S_{:,i_j}, S_{:,i_k}) > \tau,\ j,k \le n_i\}\\
\cup\; & \{(i_j, \text{item-feat-pos}, f_k) \mid S_{f_k,i_j} > 0,\ k \le F,\ j \le n_i\}\\
\cup\; & \{(i_j, \text{item-feat-neg}, f_k) \mid S_{f_k,i_j} < 0,\ k \le F,\ j \le n_i\}\\
\cup\; & \{(u_j, \text{user-feat-pos}, f_k) \mid P_{f_k,u_j} > 0,\ k \le F,\ j \le n_u\}\\
\cup\; & \{(u_j, \text{user-feat-neg}, f_k) \mid P_{f_k,u_j} < 0,\ k \le F,\ j \le n_u\}. \qquad (8)
\end{aligned}$$
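Here is a minimal sketch of the triplet construction of Definition 3, assuming an empty partial graph and cosine similarity as sim; the helper name and entity labels (build_triplets, "i0", "u0", "f0") are illustrative, and self-loops are skipped for brevity.

```python
import numpy as np

def build_triplets(A, S, P, tau=0.9):
    """A: (n_i, n_u) in {-1,0,+1}; S: (F, n_i) item features; P: (F, n_u) user features."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    n_i, n_u = A.shape
    T = []
    # known positive / negative associations
    for j, k in zip(*np.nonzero(A)):
        T.append((f"i{j}", "+" if A[j, k] > 0 else "-", f"u{k}"))
    # item-item and user-user similarity edges
    T += [(f"i{j}", "item-sim", f"i{k}") for j in range(n_i) for k in range(n_i)
          if j != k and cos(S[:, j], S[:, k]) > tau]
    T += [(f"u{j}", "user-sim", f"u{k}") for j in range(n_u) for k in range(n_u)
          if j != k and cos(P[:, j], P[:, k]) > tau]
    # signed item-feature and user-feature edges
    for f, j in zip(*np.nonzero(S)):
        T.append((f"i{j}", "item-feat-pos" if S[f, j] > 0 else "item-feat-neg", f"f{f}"))
    for f, k in zip(*np.nonzero(P)):
        T.append((f"u{k}", "user-feat-pos" if P[f, k] > 0 else "user-feat-neg", f"f{f}"))
    return T
```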
The objective of knowledge graph completion is to fit a model
predictive of the probability of the presence of a triplet in the
knowledge graph. In particular, computing the score associated with
triplets of the form $(h, +, t)$, for $(h, t)$ a user-item pair, boils
down to fitting a classifier of user-item interactions. Conversely, a
straightforward assumption is that the score associated with triplets
$(h, +, t)$ should be opposite to the score assigned to triplets
$(h, -, t)$. With that in mind, denoting the set of RHOFM parameters
$\theta$ and

$$\theta^{\mathrm{JELI}} \triangleq (\theta, \{R_r : r \text{ relation}\}, \{\mathbf{e}_r : r \text{ relation}\}, \{b_h : h \in V\})$$

as the total set of parameters to estimate, we define in Eq. (9) the
edge score to be maximized for present triplets in the knowledge graph
$K$:

$$\mathrm{score}_{\theta^{\mathrm{JELI}}}(h, r, t) \triangleq \begin{cases} \mathrm{MuRE}(s_W(\mathbf{x}^h), \mathbf{e}_r, s_W(\mathbf{x}^t); R_r, b_h, b_t) & \text{if } r \notin \{+, -\}\\ \mathrm{RHOFM}_\theta(\mathbf{x}^h, \mathbf{x}^t) & \text{if } r = +\\ -\mathrm{RHOFM}_\theta(\mathbf{x}^h, \mathbf{x}^t) & \text{if } r = -\end{cases}. \qquad (9)$$
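A minimal sketch of the edge score of Eq. (9), reusing the illustrative mure_score and rhofm2_score sketches above; the relation-specific parameter dictionary is an assumption.

```python
def jeli_edge_score(x_h, x_t, r, theta, rel_params):
    """Eq. (9). theta = (w0, w1, w2, W); rel_params[r] = (e_r, R_r, b_h, b_t)."""
    w0, w1, w2, W = theta
    if r == "+":
        return rhofm2_score(x_h, x_t, w0, w1, w2, W)
    if r == "-":
        return -rhofm2_score(x_h, x_t, w0, w1, w2, W)
    e_r, R_r, b_h, b_t = rel_params[r]
    return mure_score(x_h @ W, e_r, x_t @ W, R_r, b_h, b_t)   # s_W(x) = x W
```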
Remember that the vector $\mathbf{x}^h$ is well-defined for any item,
user, or feature $h$. Then we fit the parameter $\theta^{\mathrm{JELI}}$
by minimizing the soft margin ranking loss with margin $\lambda_0 = 1$,
whose expression is recalled below:

$$\forall \theta', \quad \mathcal{L}_{\mathrm{margin}}(\theta') \triangleq \sum_{(h, r, t) \in T} \; \sum_{(\bar{h}, r, t) \notin T} \log\left(1 + \exp\left(\lambda_0 + \mathrm{score}_{\theta'}(\bar{h}, r, t) - \mathrm{score}_{\theta'}(h, r, t)\right)\right). \qquad (10)$$

Further implementation details and numerical considerations for the
training pipeline are available in Appendix.
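A minimal sketch of the soft margin ranking loss of Eq. (10) over matched positive/negative triplet scores, assuming NumPy; how the negative triplets $(\bar h, r, t)$ are sampled is left out, and the function name is illustrative.

```python
import numpy as np

def soft_margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Eq. (10): sum over pairs of log(1 + exp(margin + neg_score - pos_score))."""
    return np.sum(np.logaddexp(0.0, margin + neg_scores - pos_scores))
```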
Downstream tasks with JELI
Interestingly, not only does JELI build embeddings for the items and
users available at training time, but it can also produce embeddings
for new entities without requiring any retraining step. Given a feature
vector $\mathbf{x} \in \mathbb{R}^F$, padded with zeroes if needed on
unavailable features, the corresponding embedding is $s_W(\mathbf{x})$.
However, the main objective of the trained JELI model is to predict new
(positive) user-item associations, possibly for items and users not
observed at training time. In that case, for any pair of item and user
feature vectors $(\mathbf{x}^i, \mathbf{x}^u) \in \mathbb{R}^F \times \mathbb{R}^F$,
the label predicted by JELI with RHOFM parameter $\theta$ is

$$\hat{y}^{\mathrm{JELI}}(\mathbf{x}^i, \mathbf{x}^u) \triangleq \begin{cases} +1 & \text{if } \sigma\left(\mathrm{RHOFM}_\theta(\mathbf{x}^i, \mathbf{x}^u)\right) > 0.5 \\ -1 & \text{otherwise} \end{cases}, \qquad (11)$$

where $\sigma$ is the standard sigmoid function. Note that the JELI
approach could be even more generic: besides any knowledge graph, this
joint training approach could feature any classifier (not necessarily
an RHOFM), as long as the classifier remains interpretable, and any
knowledge graph completion loss function or edge score function.
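A minimal sketch of the prediction rule of Eq. (11), reusing the illustrative rhofm2_score sketch above; parameter names remain assumptions.

```python
import numpy as np

def predict_label(x_i, x_u, params):
    """Return +1 if sigmoid(RHOFM score) > 0.5, else -1. params = (w0, w1, w2, W)."""
    score = rhofm2_score(x_i, x_u, *params)
    return +1 if 1.0 / (1.0 + np.exp(-score)) > 0.5 else -1
```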
Results
We first validate the performance, the interpretability, and the
different components of JELI on synthetic data sets, for which the
ground truth on feature importance is available. Then, we apply JELI to
drug repurposing, our main motivating example for interpretability in
recommendation. Further information about the generation of the
synthetic data sets and numerical details is available in Appendix.
Unless otherwise specified, the order of all factorization machine
variants considered (including the RHOFM classifier in JELI) satisfies
$m = 2$.
In this section, we consider several evaluation metrics. First,
Spearman's rank correlation [45] quantifies the quality of the
importance scores. It is computed on the ground truth importance scores
$\mathbf{s}^\star \triangleq (\sum_{k \le d} W^\star_{f,k})_{f \le F}$
and the predicted ones
$\hat{\mathbf{s}} \triangleq (\sum_{k \le d} \hat{W}_{f,k})_{f \le F}$,
with $\hat{W}$ the inferred embedding parameter. Second, the Area Under
the Curve (AUC) is computed on all user-item pairs to measure
classification performance between the ground truth
$A \in \{-1, 0, +1\}^{n_i \times n_u}$ and the classifier scores
$\hat{A} \in \mathbb{R}^{n_i \times n_u}$. We also consider the
Negative-Sampling AUC (NS-AUC) [44]. Contrary to the AUC, the NS-AUC is
a ranking measure akin to an average of user-wise AUCs, giving a more
refined quantification of prediction quality across users. As a
complementary measure of classification quality, we also consider the
Normalized Discounted Cumulative Gain (NDCG), which is proportional to
the quality of the ranking of recommended drugs across diseases. Note
that all those classification metrics depend solely on the classifier
scores, and not on the final class labels that can be inferred by
applying a fixed threshold $\tau$. The exact definitions of each metric
are reported in Table 1.
Table 1. Description of the performance metrics in Section "Results"

Notation | Performance metric | Definition
Spearman's $\rho$ | Spearman's correlation | $1 - 6\sum_{f \le F}(\Delta_f)^2 / (F(F^2 - 1))$
AUC | Area Under the Curve | $\int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(\tau; \hat{A}, A); \hat{A}, A)\, d\tau$
NS-AUC | Average NS-AUC [44] | $n_u^{-1} \sum_{u \le n_u} |\tilde{\Omega}_u|^{-1} \sum_{(i,i') \in \tilde{\Omega}_u} \delta(\hat{A}_{i,u} > \hat{A}_{i',u})$
NDCG | Average NDCG@$n_i$ | $n_u^{-1} \sum_{u \le n_u} \left(\sum_{i=1}^{N_u^+} A_{\sigma_u(i),u} / \log_2(i+1)\right) / \left(\sum_{i=1}^{N_u^+} 1 / \log_2(i+1)\right)$

Spearman's $\rho$: $\Delta_f$ is the gap in rank (in decreasing order)
between the true and predicted importance scores $(s^\star)_f$ and
$\hat{s}_f$ for feature $f$. AUC: the true positive rate between ground
truth $A$ and predictions $\hat{A}$ is defined as
$\mathrm{TPR}(\tau; \hat{A}, A) = \sum_{(i,u), A_{i,u}=+1} \delta(\hat{A}_{i,u} > \tau) / \sum_{(i,u)} \delta(\hat{A}_{i,u} > \tau)$,
the false positive rate is
$\mathrm{FPR}(\tau; \hat{A}, A) = \sum_{(i,u), A_{i,u}=-1} \delta(\hat{A}_{i,u} > \tau) / \sum_{(i,u)} \delta(\hat{A}_{i,u} \le \tau)$,
and $\delta$ is the Kronecker symbol. NS-AUC: the set of true positive,
respectively negative, drug-disease associations is
$\Omega^\pm \triangleq \{(i,u), A_{i,u} = \pm 1 \mid i \le n_i, u \le n_u\}$,
whereas the set of positive drugs for disease $u$ is
$\Omega_u^+ \triangleq \{i \mid A_{i,u} = +1\}$. Finally, the set of
correctly ranked drug pairs for disease $u$ is
$\tilde{\Omega}_u \triangleq \{(i, i') \mid A_{i,u} > A_{i',u}\}$.
NDCG: $\sigma_u$ is the permutation that sorts all coefficients of the
recommendations $\hat{A}_{i,u}, i \le n_i$ for disease $u$ in
decreasing order, that is,
$\hat{A}_{\sigma_u(1),u} \ge \hat{A}_{\sigma_u(2),u} \ge \dots \ge \hat{A}_{\sigma_u(n_i),u}$.
Finally, $N_u^+$ is defined as $\min(n_i, |\Omega_u^+|)$.
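As an illustration of the NDCG row of Table 1, here is a minimal per-user sketch (assuming NumPy, with the ground-truth labels used as relevance); names are illustrative.

```python
import numpy as np

def ndcg_at_ni(a_true_u, a_pred_u):
    """a_true_u: (n_i,) labels in {-1,0,+1}; a_pred_u: (n_i,) predicted scores."""
    order = np.argsort(-a_pred_u)                        # items sorted by decreasing score
    k = min(len(a_true_u), int((a_true_u == 1).sum()))   # N_u^+
    if k == 0:
        return 0.0
    discounts = np.log2(np.arange(2, k + 2))             # log2(i+1) for i = 1..k
    dcg = np.sum(a_true_u[order][:k] / discounts)
    idcg = np.sum(1.0 / discounts)
    return dcg / idcg
```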
Synthetic data sets
We consider two types of "interpretable" synthetic recommendation data,
called "linear first-order" and "linear second-order", for which the
ground truth feature importance scores are known. For fixed values of
the dimension $d$, the feature number $F$, and the numbers of items and
users $n_i$ and $n_u$, both item and user feature vectors are drawn at
random from a standard Gaussian distribution, along with a matrix
$W^\star \in \mathbb{R}^{F \times d}$. In most practical recommendation
tasks, the algorithm cannot access the full feature values. The reasons
for missing values can be diverse [46], but they most likely follow a
not-missing-at-random mechanism, meaning that the probability of a
missing value depends on the features. To implement such a mechanism,
we applied a slightly adapted Gaussian self-masking [47] to the
corresponding item and user feature matrices, such that we expect
around $10\%$ of missing feature values.
The complete set of user-item scores is obtained by a generating model
$g_0 : \mathbb{R}^F \times \mathbb{R}^F \to [0,1]$. For "first-order"
synthetic data sets, $g_0$ is defined as
$(\mathbf{x}^i, \mathbf{x}^u) \mapsto \sigma(\sum_{k \le d} (\mathbf{x}^i + \mathbf{x}^u) W^\star_{:,k}) = \sigma(\mathrm{RHOFM}_{(0, \mathbf{1}_d, \mathbf{0}_{m-1}, W^\star)}(\mathbf{x}^i, \mathbf{x}^u))$,
where $\mathbf{x}^i$ and $\mathbf{x}^u$ are respectively the item and
user feature vectors. For the "second-order" type, $g_0$ is simply
$(\mathbf{x}^i, \mathbf{x}^u) \mapsto \sigma(\mathrm{RHOFM}_{(1, \mathbf{1}_d, \mathbf{1}_{m-1}, W^\star)}(\mathbf{x}^i, \mathbf{x}^u))$,
where the order is $m = 2$. In both cases, the corresponding structure
function $s_{W^\star}$ is linear, that is,
$s_{W^\star}(\mathbf{x}) = \mathbf{x} W^\star$, and $\lambda = 0$.
Finally, since in practice most of the user-item associations are
inaccessible at training time, we label user-item pairs with $-1$, $0$,
and $+1$ depending on their score, such that the sparsity number—that
is, the percentage of unknown values in the association matrix—is equal
to a prespecified value greater than $50\%$.
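A minimal sketch of the "first-order" synthetic generator (linear $s_{W^\star}$, $\lambda = 0$), assuming NumPy; the feature self-masking step is omitted, the thresholding and random masking are simplifications of the labeling described above, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
F, d, n_i, n_u, sparsity = 10, 2, 173, 173, 0.5
S = rng.normal(size=(F, n_i))               # item feature matrix
P = rng.normal(size=(F, n_u))               # user feature matrix
W_star = rng.normal(size=(F, d))            # ground-truth embedding parameter

# g_0(x_i, x_u) = sigmoid(sum_k (x_i + x_u) W*_{:,k}), computed for all pairs
scores = 1.0 / (1.0 + np.exp(-((S.T @ W_star).sum(1)[:, None]
                               + (P.T @ W_star).sum(1)[None, :])))
A = np.where(scores > 0.5, 1, -1)           # full label matrix
mask = rng.random((n_i, n_u)) < sparsity    # hide a prespecified fraction
A[mask] = 0                                 # unknown associations
```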
JELI is performant for various validation metrics and reliably retrieves
ground truth importance scores
We generate 10 synthetic data sets of each type ($F = 10$, $d = 2$,
$n_i = n_u = 173$) and run JELI 100 times with different random seeds
corresponding to different training/testing splits. Table 2 shows the
numerical results across those $10 \times 100$ runs for several
validation metrics on the predicted item-user associations and feature
importance scores.
Table 2. Average validation metrics with standard deviations across 100
iterations and 10 synthetic data sets of each type (total number of
values: 1000)

Data set type | AUC | NS-AUC | Spearman's $\rho$
First-order | 0.99 ± 0.013 | 0.89 ± 0.124 | 0.83 ± 0.279
Second-order | 0.98 ± 0.019 | 0.86 ± 0.167 | 0.75 ± 0.363

Average (respectively, standard deviation) values are rounded to the
closest second (resp., third) decimal place. AUC: Area Under the Curve.
NS-AUC: Negative-Sampling AUC [44]. Spearman's $\rho$: Spearman's rank
correlation
Although there is a large variation in prediction quality due to the
random training/testing splits, the metrics in Table 2 show a high
predictive power for JELI, along with a consistently high correlation
between true and predicted feature importance scores when considering
the average best value across the 100 iterations: the average
Spearman's rank correlation for the best-trained models across all 10
data sets is 0.932 for "first-order" sets and 0.932 for "second-order"
ones. The bar plots of the ground truth and predicted importance scores
for each of these 10 sets and each type of synthetic data in Fig. 2
show that JELI preserves the global trend in importance scores across
data sets. We also tested the impact of the dimension parameter $d$ and
of the order $m$ of the RHOFM on the accuracy metrics. In the previous
experiments, we used $d = 2$, which is the true dimensionality of the
underlying generating model. However, it appears that JELI is also
robust to the choice of the dimension parameter across all metrics,
provided it is large enough. Moreover, similarly to higher-order
factorization machines, higher-order interactions ($m > 2$) yield a
more expressive classifier model and, thus, better classification
performance. However, this improvement comes at a heavy computational
price, even with the dynamic programming routines in [30], where the
time complexity is linear in $m$. The experiments and results on
parameter impact can be found in Appendix 4.
Fig. 2. Barplots of the true and predicted feature importance scores
for $F = 10$ features in each synthetic data set for the
best-performing model across 100 iterations. Top 2 rows: "first-order"
synthetic data. Bottom 2 rows: "second-order" synthetic data
JELI is robust in synthetic data sets across sparsity numbers
We also compare the predictive performance of JELI with embedding-based
recommender systems from the state of the art, namely the Fast.ai
collaborative learner [8], the heterogeneous attention network (HAN)
algorithm [48] and the neural inductive matrix completion with graph
convolutional network (NIMCGCN) [10]. Whenever appropriate, we set the
same hyperparameter values for all algorithms (with $d = 2$). We run
each algorithm with 100 different random seeds on 5 "first-order"
synthetic data sets generated with sparsity numbers in
$\{50\%, 65\%, 80\%\}$, i.e., 500 runs per sparsity number. Figure 3
and Table 3 report the boxplots and the confidence intervals on the
corresponding validation metrics. In addition to the AUC and NS-AUC, we
include the Normalized Discounted Cumulative Gain (NDCG), computed for
each user at rank $n_i$ (the number of items) and averaged across
users, as a counterpart to the NS-AUC measure.
Fig. 3. NS-AUC values across "first-order" synthetic data sets for
sparsity numbers and 500 iterations for JELI and state-of-the-art
embedding-based recommender systems
Table 3. Average metrics with standard deviations across 100 iterations
and 5 "first-order" sets

Sparsity | Method | AUC | NS-AUC | NDCG
50% | Fast.ai | 0.99 ± 0.0 | 0.52 ± 0.3 | 0.85 ± 0.1
50% | HAN | 0.93 ± 0.0 | 0.62 ± 0.1 | 0.18 ± 0.1
50% | NIM | 0.93 ± 0.0 | 0.63 ± 0.1 | 0.39 ± 0.1
50% | JELI | 0.99 ± 0.0 | 0.92 ± 0.1 | 0.96 ± 0.1
65% | Fast.ai | 0.99 ± 0.0 | 0.64 ± 0.4 | 0.78 ± 0.3
65% | HAN | 0.93 ± 0.0 | 0.67 ± 0.0 | 0.12 ± 0.1
65% | NIM | 0.94 ± 0.0 | 0.67 ± 0.1 | 0.42 ± 0.1
65% | JELI | 0.99 ± 0.0 | 0.94 ± 0.0 | 0.94 ± 0.1
80% | Fast.ai | 0.99 ± 0.0 | 0.91 ± 0.1 | 0.77 ± 0.2
80% | HAN | 0.96 ± 0.0 | 0.72 ± 0.0 | 0.20 ± 0.1
80% | NIM | 0.93 ± 0.0 | 0.61 ± 0.1 | 0.19 ± 0.0
80% | JELI | 0.99 ± 0.0 | 0.94 ± 0.0 | 0.85 ± 0.2

The NDCG at rank $n_i$ is averaged across users. NIM is NIMCGCN. Bold
type is used for the highest value(s) in each experiment
As illustrated by Fig. 3, JELI matches or outperforms the
state-of-the-art approaches on all metrics and remains robust to the
sparsity number.
Ablation study: both the structure and the joint learning are crucial to the
performance
We perform the same type of experiments as in Sect. "JELI is robust in
synthetic data sets across sparsity numbers" on several ablated
versions of JELI to estimate the contribution of each part to the
predictive performance. We introduce several JELI variants. First, we
remove the structured and embedding parts of the RHOFM classifier. FM2
is the regular second-order factorization machine of dimension $d$ on
$2F$-dimensional input vectors, without structure on the coefficients
(see Definition 1), whereas CrossFM2 is a more refined non-structured
second-order factorization machine, where the feature pairwise
interaction terms only comprise pairs of features spanning both the
item and user vectors, that is, with the notation from Definition 1,

$$\mathrm{CrossFM}_{(\omega^0, \omega^1, \omega^2)}(\mathbf{x}^i, \mathbf{x}^u) \triangleq \omega^0 + (\omega^1)^\intercal \begin{bmatrix} \mathbf{x}^i \\ \mathbf{x}^u \end{bmatrix} + \sum_{f \le F, \; f' > F} \left\langle \omega^2_{f,:}, \omega^2_{f',:} \right\rangle \, x^i_f \, x^u_{f'-F}. \qquad (12)$$
Next, we also study methods featuring separate learning of the
embeddings and of the RHOFM classifier, named Separate Embedding
Learning and Training algorithms (SELT). We consider different feature
embedding types. SELT-PCAf runs a principal component analysis (PCA)
with $d$ components on the concatenation of the item and user matrices
along the column dimension, resulting in an $F \times (n_i + n_u)$
matrix; SELT-PCAf then infers feature embeddings from each feature's
first $d$ principal components. Another PCA-based baseline, SELT-PCAiu,
applies the learned PCA transformation directly to the item and user
feature vectors to obtain item and user embeddings. Finally, the
SELT-KGE approach performs the knowledge graph completion task to
obtain item and user embeddings—without enforcing the feature-dependent
structure—on the knowledge graph described in Definition 3 with an
empty partial graph. Then, SELT-KGE uses those item and user embeddings
to train the RHOFM classifier.
The final results in Fig. 4 and Table 4 show that the part most crucial
to predictive performance across sparsity numbers is the factorization
machine, which is unsurprising given the literature on factorization
machines applied to sparse data. One can observe that separate
embedding learning and factorization machine training leads to mediocre
performance. The combination of a structured factorization machine and
jointly learned embeddings, that is, JELI, gives the best performance,
and the gap is even more significant as the set of known associations
gets smaller (and the sparsity number is larger).
Fig. 4. NS-AUC values across "first-order" synthetic data sets for
sparsity numbers and 500 iterations for JELI and ablated variants. This
shows that the part most crucial to a good predictive performance
across sparsity numbers is the factorization machine
Table 4. Average metrics with standard deviations across 100 iterations
and 5 "first-order" sets

Sparsity | Method | AUC | NS-AUC | NDCG
50% | FM2 | 0.99 ± 0.0 | 0.92 ± 0.0 | 0.97 ± 0.0
50% | CrossFM2 | 0.99 ± 0.0 | 0.93 ± 0.0 | 1.00 ± 0.0
50% | S-PCAf | 0.95 ± 0.0 | 0.70 ± 0.1 | 0.58 ± 0.2
50% | S-PCAiu | 0.95 ± 0.0 | 0.61 ± 0.2 | 0.45 ± 0.2
50% | S-KGE | 0.91 ± 0.0 | 0.43 ± 0.2 | 0.25 ± 0.2
50% | JELI | 0.99 ± 0.0 | 0.92 ± 0.1 | 0.96 ± 0.0
65% | FM2 | 0.98 ± 0.0 | 0.91 ± 0.0 | 0.87 ± 0.1
65% | CrossFM2 | 0.99 ± 0.0 | 0.91 ± 0.0 | 0.95 ± 0.0
65% | S-PCAf | 0.95 ± 0.0 | 0.73 ± 0.1 | 0.54 ± 0.2
65% | S-PCAiu | 0.94 ± 0.0 | 0.62 ± 0.0 | 0.34 ± 0.1
65% | S-KGE | 0.90 ± 0.0 | 0.43 ± 0.0 | 0.06 ± 0.0
65% | JELI | 0.99 ± 0.0 | 0.94 ± 0.0 | 0.94 ± 0.1
80% | FM2 | 0.97 ± 0.0 | 0.84 ± 0.1 | 0.56 ± 0.1
80% | CrossFM2 | 0.98 ± 0.0 | 0.87 ± 0.0 | 0.74 ± 0.0
80% | S-PCAf | 0.95 ± 0.0 | 0.73 ± 0.1 | 0.38 ± 0.1
80% | S-PCAiu | 0.93 ± 0.0 | 0.62 ± 0.1 | 0.20 ± 0.0
80% | S-KGE | 0.91 ± 0.0 | 0.55 ± 0.1 | 0.12 ± 0.1
80% | JELI | 0.99 ± 0.0 | 0.94 ± 0.0 | 0.85 ± 0.2

The NDCG at rank $n_i$ is averaged across users. S- indicates an
instance of SELT. Bold type is used for the highest value(s) in each
experiment
Application to drug repurposing
We aim to predict new therapeutic indications, that is, novel
associations between chemical compounds and diseases. The
interpretability of a model for predicting associations between
molecules and pathologies is crucial to encourage its use in
healthcare. In that respect, higher-order factorization machines are
very interesting models due to their inherent interpretability.
However, particularly for the most recent drug repurposing data sets
(e.g., TRANSCRIPT [43] and PREDICT [49]), the number of features
($F \approx 12{,}000$ and $F \approx 6{,}000$, respectively) is too
large to effectively train a factorization machine, due to the curse of
dimensionality. Resorting to knowledge graphs then enables the
construction of low-dimensional vector representations of these
associations. These representations are then fed as input to the
classifier during training instead of the initial feature vectors.
JELI is on par with state-of-the-art approaches on drug repurposing data sets
We now run JELI and the baseline algorithms tested in Sect. "JELI is
robust in synthetic data sets across sparsity numbers" on the Gottlieb
[50] (named Fdataset in the original paper), LRSSL [51],
PREDICT-Gottlieb [52] and TRANSCRIPT [43] drug repurposing data sets,
which feature a variety of data types and sizes. Please refer to
Appendix 4 for more information. Figure 5 and Table 5 report the
validation metrics across 100 different training/testing splits for
each method, with $d = 15$. From those results, we can see that the
performance of JELI is on par with the top algorithm, HAN, and
sometimes outperforms it while providing interpretability.
Fig. 5. AUC values across drug repurposing data sets for 100 iterations
for JELI and state-of-the-art embedding-based approaches
Table 5. Average metrics with standard deviations across 100 iterations
for each drug repurposing data set

Data set | Method | AUC | NS-AUC | NDCG
Gottlieb | Fast.ai | 0.90 ± 0.0 | 0.50 ± 0.1 | 0.01 ± 0.0
Gottlieb | HAN | 0.93 ± 0.0 | 0.67 ± 0.0 | 0.02 ± 0.0
Gottlieb | NIM | 0.90 ± 0.0 | 0.51 ± 0.0 | 0.01 ± 0.0
Gottlieb | JELI | 0.90 ± 0.0 | 0.52 ± 0.0 | 0.02 ± 0.0
LRSSL | Fast.ai | 0.90 ± 0.0 | 0.49 ± 0.1 | 0.01 ± 0.0
LRSSL | HAN | 0.95 ± 0.0 | 0.69 ± 0.0 | 0.10 ± 0.0
LRSSL | NIM | 0.91 ± 0.0 | 0.53 ± 0.0 | 0.01 ± 0.0
LRSSL | JELI | 0.92 ± 0.0 | 0.51 ± 0.0 | 0.02 ± 0.0
PRED-G | Fast.ai | 0.90 ± 0.0 | 0.50 ± 0.1 | 0.01 ± 0.0
PRED-G | HAN | 0.93 ± 0.0 | 0.68 ± 0.0 | 0.01 ± 0.0
PRED-G | NIM | 0.91 ± 0.0 | 0.49 ± 0.0 | 0.01 ± 0.0
PRED-G | JELI | 0.90 ± 0.0 | 0.47 ± 0.0 | 0.02 ± 0.0
TRANSC | Fast.ai | 0.61 ± 0.1 | 0.57 ± 0.1 | 0.04 ± 0.0
TRANSC | HAN | 0.93 ± 0.0 | 0.61 ± 0.0 | 0.08 ± 0.0
TRANSC | NIM | 0.92 ± 0.0 | 0.57 ± 0.0 | 0.04 ± 0.0
TRANSC | JELI | 0.92 ± 0.0 | 0.56 ± 0.0 | 0.02 ± 0.0

The NDCG at rank $n_i$ is averaged across users. NIM is the algorithm
NIMCGCN, TRANSC refers to the data set TRANSCRIPT, and PRED-G to the
data set PREDICT-Gottlieb
For the sake of completeness, we also considered one of the most
popular recommendation data sets, MovieLens [2], to better assess the
performance of JELI for general-purpose collaborative filtering. The
goal is to predict whether a movie should be recommended to a user,
that is, whether the user would rate this movie with more than 3 stars.
The movie features are the year and the one-hot encodings of the movie
genres, whereas the user features are the counts of each movie tag that
this user has previously assigned. This experiment confirms that the
performance of JELI is on par with the baselines, even in a
non-biological setting. Please refer to Appendix 4 for more
information.
JELI can integrate any graph prior on the TRANSCRIPT data set
We now focus on the TRANSCRIPT data set, which involves gene activity
measurements across $F = 12{,}096$ genes for $n_i = 204$ drugs and
$n_u = 116$ diseases. We compare the predictive power of JELI on the
TRANSCRIPT data set with the default knowledge graph created by JELI
(named the "None" network, as it does not rely on external sources of
knowledge) and with the default graph augmented with an external
knowledge graph. The "None" network corresponds to the knowledge graph
in Definition 3 with an empty partial graph. We considered as external
knowledge graphs DRKG [53], Hetionet [54], PharmKG and PharmKG8k (a
subset of 8,000 triplets) [41], and PrimeKG [42], as provided by the
Python library PyKEEN [55]. In addition, we also built a partial graph
listing protein-protein interactions (where proteins are matched
one-to-one to their corresponding coding genes) based on the STRING
database [56]. The resulting classification accuracies are shown in
Fig. 6 and Table 6. Most of the external graph priors significantly
improve the classification accuracy, particularly the specific
information about gene regulation (prior STRING). In Appendix 4, we
also show that the graph priors' performance correlates with a more
frequent grouping of genes that belong to the same functional pathways.
In Appendix 5, we perform a more thorough analysis of the specific case
of melanoma and show that the predicted drug-disease associations and
perturbed pathways allow us to recover some elements of the literature
on melanoma.
Fig. 6. Predictive performance of JELI with different graph priors on
different validation metrics
Table 6. Average metrics with standard deviations across 10 iterations
on the TRANSCRIPT data set for different graph priors

Graph prior | AUC | NS-AUC | NDCG
None | 0.90 ± 0.01 | 0.48 ± 0.02 | 0.02 ± 0.01
DRKG | 0.90 ± 0.00 | 0.48 ± 0.02 | 0.00 ± 0.00
Hetionet | 0.89 ± 0.00 | 0.43 ± 0.02 | 0.01 ± 0.01
PharmKG | 0.88 ± 0.01 | 0.43 ± 0.03 | 0.01 ± 0.01
PharmKG8k | 0.91 ± 0.01 | 0.53 ± 0.03 | 0.03 ± 0.01
PrimeKG | 0.89 ± 0.01 | 0.48 ± 0.03 | 0.02 ± 0.01
STRING | 0.91 ± 0.00 | 0.55 ± 0.03 | 0.02 ± 0.01
Discussion
This work proposes the JELI approach for integrating knowledge
graph-based regularization into an interpretable recommender system.
The structure incorporated into user and item embeddings considers
numerical feature values in a generic fashion, which allows one to go
beyond the categorical relations encoded in knowledge graphs without
adding many parameters. This method allows us to derive item and user
representations of fixed dimensions and score a user-item association,
even on previously unseen items and users. We have shown the
performance and the explainability power of JELI on synthetic and
real-life data sets. The Python package that implements the JELI
approach is available at the following open-source repository:
github.com/RECeSS-EU-Project/JELI. Experimental results can be
reproduced using the code at
github.com/RECeSS-EU-Project/JELI-experiments.
Conclusions
This paper introduces and empirically validates our algorithmic
contribution, JELI, for drug repurposing. JELI aims to provide
straightforward interpretability in recommendations while integrating
any graph information on items and users. However, there are a few
limitations to the JELI approach. The first is that JELI performs best
on sparse user and item feature matrices, which exploit the
expressiveness of factorization machines to the fullest. Moreover, this
approach is quite slow compared to state-of-the-art algorithms, since
it simultaneously solves two tasks: the recommendation task on
user-item pairs and the knowledge graph completion. We discuss the
scalability of JELI with respect to various parameters in Appendix 6.
However, this slowness is mitigated by the superior interpretability of
JELI compared to the baselines. Furthermore, an interesting subsequent
work would focus on integrating missing values into the recommendation
problem. As it is, JELI ignores the missing features and potentially
recovers qualitative item-feature (respectively, user-feature) links
during the knowledge graph completion task. That is, provided an
approach to quantify the strength of the link between an item and a
feature, JELI might also be extended to impute that item's
corresponding missing feature value.
Acknowledgements