Abstract

   Integrating gene expression across tissues and cell types is crucial
   for understanding the coordinated biological mechanisms that drive
   disease and characterise homeostasis. However, traditional multitissue
   integration methods cannot handle uncollected tissues or rely on
   genotype information, which is often unavailable and subject to privacy
   concerns. Here we present HYFA (Hypergraph Factorisation), a
   parameter-efficient graph representation learning approach for joint
   imputation of multi-tissue and cell-type gene expression. HYFA is
   genotype-agnostic, supports a variable number of collected tissues per
   individual, and imposes strong inductive biases to leverage the shared
   regulatory architecture of tissues and genes. In performance comparison
   on Genotype-Tissue Expression project data, HYFA achieves superior
   performance over existing methods, especially when multiple reference
   tissues are available. The HYFA-imputed dataset can be used to identify
   replicable regulatory genetic variations (eQTLs), with substantial
   gains over the original incomplete dataset. HYFA can accelerate the
   effective and scalable integration of tissue and cell-type
   transcriptome biorepositories.

Introduction

   Sequencing technologies have enabled profiling the transcriptome at
   tissue and single-cell resolutions, with great potential to unveil
   intra- and multi-tissue molecular phenomena such as cell signalling and
   disease mechanisms. Due to the invasiveness of the sampling process,
   gene expression is usually measured independently in easy-to-acquire
   tissues, leading to an incomplete picture of an individual’s
   physiological state and necessitating effective multi-tissue
   integration methodologies.

   A question of fundamental biological significance is to what extent the
   transcriptomes of difficult-to-acquire tissues and cell types can be
   inferred from those of accessible ones [[36]1, [37]2]. Due to their
   ease of collection, accessible tissues such as whole blood could have
   great utility for diagnosis and monitoring of pathophysiological
   conditions through metabolites, signalling molecules, and other
   biomarkers, including possible transcriptome-level associations
   [[38]3]. Moreover, all human somatic cells share the same genetic
   information, which may regulate expression in a context-dependent and
   temporal manner, partially explaining tissue- and cell-type-specific
   gene expression variation. Computational models that exploit these
   patterns could therefore be used to impute the transcriptomes of
   uncollected cell types and tissues, with potential to elucidate the
   biological mechanisms regulating a diverse range of developmental and
   physiological processes.

   Multi-tissue imputation is a central problem in transcriptomics with
   broad implications for fundamental biological research and
   translational science. The methodological problem can powerfully
   influence downstream applications, including performing differential
   expression analysis, identifying regulatory mechanisms, determining
   co-expression networks, and enabling drug target discovery. In
   practice, in experimental follow-up or clinical application, the task
   includes the special case of determining a good proxy or easily-assayed
   system for causal tissues and cell types. Multi-tissue integration
   methods can also be applied to harmonise large collections of RNA-seq
   datasets from diverse institutions, consortia, and studies [[39]4] —
   each potentially affected by technical artifacts — and to characterise
   gene expression co-regulation across tissues. Reconstruction of
   unmeasured gene expression across a broad collection of tissues and
   cell types from available reference transcriptome panels may expand our
   understanding of the molecular origins of complex traits and of their
   context specificity.

   Several methods have traditionally been employed to impute uncollected
   gene expression. Leveraging a surrogate tissue has been widely used in
   studies of biomarker discovery, diagnostics, and expression
   Quantitative Trait Loci (eQTLs), and in the development of model
   systems [[40]5, [41]6, [42]7, [43]8, [44]9]. Nonetheless, gene
   expression is known to be tissue and cell-type specific, limiting the
   utility of a proxy tissue. Other related studies impute tissue-specific
   gene expression from genetic information [[45]10]. Wang et al. [[46]11]
   propose a mixed-effects model to infer uncollected data in multiple
   tissues from eQTLs. Sul et al. [[47]12] introduce a model termed
   Meta-Tissue that aggregates information from multiple tissues to
   increase statistical power of eQTL detection. However, these approaches
   do not model the complex non-linear relationships between measured and
   unmeasured gene expression traits among tissues and cell types, and
   individual-level genetic information (such as at eQTLs) is subject to
   privacy concerns and often unavailable.

   Computationally, multi-tissue transcriptome imputation is challenging
   because the data dimensionality scales rapidly with the number of genes
   and tissues, often leading to overparameterised models. TEEBoT [[48]1]
   addresses this issue by employing principal component analysis (PCA) —
   a non-parametric dimensionality reduction method — to project the data
   into a low-dimensional manifold, followed by linear regression to
   predict target gene expression from the principal components. However,
   this technique does not account for non-linear effects and can only
   handle a single reference tissue, i.e. whole blood. Approaches such as
   standard multilayer perceptrons can exploit non-linear patterns, but
   are massively overparameterised and computationally infeasible.

   To address these challenges, we present HYFA (Hypergraph
   Factorisation), a parameter-efficient graph representation learning
   approach for joint multi-tissue and cell-type gene expression
   imputation. HYFA represents multi-tissue gene expression in a
   hypergraph of individuals, metagenes, and tissues, and learns
   factorised representations via a specialised message passing neural
   network operating on the hypergraph. In contrast to existing methods,
   HYFA supports a variable number of reference tissues, increasing the
   statistical power over single-tissue approaches, and incorporates
   inductive biases to exploit the shared regulatory architecture of
   tissues and genes. In performance comparison, HYFA attains improved
   performance over TEEBoT and standard imputation methods across a broad
   range of tissues from the Genotype-Tissue Expression (GTEx) project
   (v8) [[49]2]. Through transfer learning on a paired single-nucleus
   RNA-seq dataset (GTEx-v9) [[50]13], we further demonstrate the ability
   of HYFA to resolve cell-type signatures — average gene expression
   across cells for a given cell-type, tissue, and individual — from bulk
   gene expression. Thus, HYFA may provide a unifying transcriptomic
   methodology for multi-tissue imputation and cell-type deconvolution. In
   post-imputation analysis, application of eQTL mapping on the
   fully-imputed GTEx data yields a substantial increase in number of
   detected replicable eQTLs. HYFA is publicly available at
   [51]https://github.com/rvinas/HYFA.

Results

Hypergraph factorisation (HYFA)

   We developed HYFA, a framework for inferring the transcriptomes of
   unmeasured tissues and cell-types from bulk expression collected in a
   variable number of reference tissues ([52]Figure 1, [53]Methods). HYFA
   receives as input gene expression measurements collected from a set of
   reference tissues, as well as demographic information, and outputs gene
   expression values in a tissue of interest (e.g. uncollected). The first
   step of the workflow is to project the input gene expression into
   low-dimensional metagene representations [[54]14, [55]15] for every
   collected tissue. Each metagene summarises abstract properties of
   groups of genes, e.g. sets of genes that tend to be expressed together
   [[56]16], that are relevant for the imputation task. In a second step,
   HYFA employs a custom message passing neural network [[57]17] that
   operates on a 3-uniform hypergraph, yielding factorised individual,
   tissue, and metagene representations. Lastly, HYFA infers latent
   metagene values for the target tissue — a hyperedge-level prediction
   task — and maps these representations back to the original gene
   expression space. Through higher-order hyperedges (e.g. 4-uniform
   hypergraph), HYFA can also incorporate cell-type information and infer
   finer-grained cell-type-specific gene expression ([58]Methods).
   Altogether, HYFA offers features to reuse knowledge across tissues and
   genes, capture non-linear cross-tissue patterns of gene expression,
   learn rich representations of biological entities, and account for
   variable numbers of reference tissues.

Figure 1:

   Figure 1:
   [59]Open in a new tab

   Overview of HYFA. (a) HYFA processes gene expression from a number of
   collected tissues (e.g. accessible tissues) and infers the
   transcriptomes of uncollected tissues. (b) Workflow of HYFA. The model
   receives as input a variable number of gene expression samples
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   corresponding to the collected tissues
   [MATH: <mi>k</mi><mo>∈</mo><mi>𝒯</mi><mo
   stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo> :MATH]
   of a given individual
   [MATH: <mi>i</mi> :MATH]
   . The samples
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   are fed through an encoder that computes low-dimensional
   representations
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow
   ><mrow><mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   for each metagene
   [MATH: <mi>j</mi><mo>∈</mo><mn>1</mn><mo>.</mo><mo>.</mo><mi>M</mi>
   :MATH]
   . A metagene is a latent, low-dimensional representation that captures
   certain gene expression patterns of the high-dimensional input sample.
   These representations are then used as hyperedge features in a message
   passing neural network that operates on a hypergraph. In the hypergraph
   representation, each hyperedge labelled with
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow
   ><mrow><mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   connects an individual
   [MATH: <mi>i</mi> :MATH]
   with metagene
   [MATH: <mi>j</mi> :MATH]
   and tissue
   [MATH: <mi>k</mi> :MATH]
   if tissue
   [MATH: <mi>k</mi> :MATH]
   was collected for individual
   [MATH: <mi>i</mi> :MATH]
   , i.e.
   [MATH: <mi>k</mi><mo>∈</mo><mi>𝒯</mi><mo
   stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo> :MATH]
   . Through message passing, HYFA learns factorised representations of
   individual, tissue, and metagene nodes. To infer the gene expression of
   an uncollected tissue
   [MATH: <mi>u</mi> :MATH]
   of individual
   [MATH: <mi>i</mi> :MATH]
   , the corresponding factorised representations are fed through a
   multilayer perceptron (MLP) that predicts low-dimensional features
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow
   ><mrow><mo stretchy="false">(</mo><mi>u</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   for each metagene
   [MATH: <mi>j</mi><mo>∈</mo><mn>1</mn><mo>.</mo><mo>.</mo><mo>.</mo><mi
   mathvariant="normal">M</mi> :MATH]
   . HYFA finally processes these latent representations through a decoder
   that recovers the uncollected gene expression sample
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mo stretchy="false">(</mo><mi>u</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   .

Characterisation of cross-tissue relationships

   Characterising cross-tissue relationships at the transcriptome level
   can help elucidate coordinated gene regulation and expression, a
   fundamental phenomenon with direct implications on health homeostasis,
   disease mechanisms, and comorbidities [[60]18, [61]19, [62]20]. We
   trained HYFA on bulk gene expression from the GTEx project (GTEx-v8;
   [63]Methods) [[64]2] and assessed the cross-tissue gene expression
   predictability —measured by the Pearson correlation between the
   observed and the predicted gene expression across individuals— and
   quality of tissue embeddings ([65]Figure 2). Application of Uniform
   Manifold Approximation and Projection (UMAP) [[66]21] on the learnt
   tissue representations revealed strong clustering of
   biologically-related tissues ([67]Figure 2a), including the
   gastrointestinal system (e.g. esophageal, stomach, colonic, and
   intestinal tissues), the female reproductive tissues (i.e. uterus,
   vagina, and ovary), and the central nervous system (i.e. the 13 brain
   tissues). For every pair of reference and target tissues in GTEx, we
   then computed the Pearson correlation coefficient
   [MATH: <mi>ρ</mi> :MATH]
   between the predicted and actual gene expression, averaged the scores
   across individuals, and used a cutoff of
   [MATH: <mi>ρ</mi><mo>></mo><mn>0.5</mn> :MATH]
   to depict the top pairwise associations ([68]Figure 2b and [69]Extended
   Data Figure 1). We observed connections between most GTEx tissues and
   whole blood, which suggests that blood-derived gene expression is
   highly informative of (patho)physiological processes in other tissues
   [[70]22]. Notably, brain tissues and the pituitary gland were strongly
   associated with several tissues
   [MATH: <mo>(</mo><mi>ρ</mi><mo>></mo><mn>0.5</mn><mo>)</mo> :MATH]
   , including gastrointestinal tissues (i.e. esophagus, stomach, and
   colon), the adrenal gland, and skeletal muscle, which may account for
   known disease comorbidities.

Figure 2:

   Figure 2:
   [71]Open in a new tab

   Analysis of cross-tissue relationships (next page). Colors are assigned
   to conform to the GTEx Consortium conventions. (a) UMAP representation
   of the tissue embeddings learnt by HYFA. Note that human body systems
   cluster in the embedding space (e.g. digestive system: stomach, small
   intestine, colon, esophagus; and central nervous system). (b) Network
   of tissues depicting the predictability of target tissues with HYFA
   using the average per-sample Pearson ρ correlation coefficients. The
   dimension of each node is proportional to its degree. Edges from
   reference to target tissues indicate an average Pearson correlation
   coefficient ρ > 0.5. Interestingly, central nervous system tissues
   strongly correlate with several non-brain tissues such as
   gastrointestinal tissues and skeletal muscles. (c) Top predicted genes
   in multiple brain regions with the oesophago gastric junction as the
   reference tissue, ranked by average Pearson correlation. (d) Common
   genes in the top 1000 predicted genes for each brain tissue. (e)
   Enriched GO terms of the top shared genes at the interesection. The top
   predicted genes were enriched in signalling pathways (FDR < 0.05),
   consistent with studies reporting that gut microbes communicate to the
   central nervous system through endocrine and immune mechanisms. These
   results depict the cross-tissue associations and highlight the
   potential connection between the elements of the oesophago gastric
   junction and the ciliary neurotrophic factor, which has been linked to
   the survival of neurons [[72]33] and the control of body weight
   [[73]35].

Imputation of gene expression from whole blood transcriptome

   Knowledge about tissue-specific patterns of gene expression can
   increase our understanding of disease biology, facilitate the
   development of diagnostic tools, and improve patient subtyping [[74]23,
   [75]1], but most tissues are inaccessible or difficult to acquire. To
   address this challenge, we studied to what extent HYFA can recover
   tissue-specific gene expression from whole-blood transcriptomic
   measurements ([76]Figure 3). For each test individual with measured
   whole-blood gene expression, we predicted tissue-specific gene
   expression in the remaining collected tissues of the individual. We
   evaluated performance using the Pearson correlation
   [MATH: <mi>ρ</mi> :MATH]
   between the inferred gene expression and the ground-truth samples. We
   observed strong prediction performance for esophageal tissues
   (muscularis:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.49</mn> :MATH]
   , gastro:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.46</mn> :MATH]
   , mucosa:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.36</mn> :MATH]
   ), heart tissues (left ventricle:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.48</mn> :MATH]
   , atrial:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.46</mn> :MATH]
   ), and lung (
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.47</mn> :MATH]
   ), while Epstein Barr virus-transformed lymphocytes (
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.06</mn> :MATH]
   ), an accessible and renewable resource for functional genomics, was a
   notable outlier. We noted that the per-gene prediction scores followed
   smooth, unimodal distributions ([77]Extended Data Figure 2). The
   blood-imputed gene expression also predicted disease-relevant genes in
   hard-to-access central nervous system ([78]Extended Data Figure 3).
   These include APP, PSEN1, and PSEN2, i.e. the causal genes for
   autosomal dominant forms of early-onset Alzheimer’s disease [[79]24],
   and Alzheimer’s disease genetic risk factors such as APOE [[80]25]. We
   compared our method with TEEBoT [[81]1] (without expression
   single-nucleotide polymorphism information), which first projects the
   high-dimensional blood expression data into a low-dimensional space
   through principal component analysis (30 components; 75–80% explained
   variance) and then performs linear regression to predict the gene
   expression of the target tissue. Overall, TEEBoT and HYFA attained
   comparable scores when a single tissue (i.e. whole blood) was used as
   reference and both methods outperformed standard imputation approaches
   (mean imputation, blood surrogate, and k nearest neighbours; [82]Figure
   3c).

Figure 3:

   Figure 3:
   [83]Open in a new tab

   Performance comparison across gene expression imputation methods. (a,
   b) Per-tissue comparison between HYFA and TEEBoT when using (a)
   whole-blood and (b) all accessible tissues (whole blood, skin sun
   exposed, skin not sun exposed, and adipose subcutaneous) as reference.
   HYFA achieved superior Pearson correlation in (a) 25 out of 48 target
   tissues when a single tissue was used as reference and (b) all target
   tissues when multiple reference tissues were considered. For
   underrepresented target tissues (less than 25 individuals with source
   and target tissues in the test set), we considered all the validation
   and test individuals (translucent bars). We employed a two-sided
   Mann-Whitney-Wilcoxon tests to compute p-values (*: 1e-2 < p ≤ 5e-2,
   **: 1e-3 < p ≤ 1e-2, ***: 1e-4 < p ≤ 1e-3, ****: p ≤ 1e-4). (c, d)
   Prediction performance from (c) whole-blood gene expression (n=2424
   samples from 167 GTEx donors) and (d) accessible tissues as reference
   (n=675 samples from 167 test GTEx donors). Mean imputation replaces
   missing values with the feature averages. Blood surrogate utilises gene
   expression in whole blood as a proxy for the target tissue. k-Nearest
   Neighbours (kNN) imputes missing features with the average of measured
   values across the k nearest observations (k=20). TEEBoT projects
   reference gene expression into a low-dimensional space with principal
   component analysis (PCA; 30 components), followed by linear regression
   to predict target values. HYFA (all) employs information from all
   collected tissues of the individual. Boxes show quartiles, centerlines
   correspond to the median, and whiskers depict the distribution range
   (1.5 times the interquartile range). Outliers outside of the whiskers
   are shown as distinct points. The top axis indicates the total number n
   of independent individuals for every target tissue.

Multiple reference tissues improve performance

   We hypothesised that using multiple tissues as reference would improve
   downstream imputation performance. To evaluate this, we selected
   individuals with measured gene expression both at the target tissue and
   4 reference accessible tissues (whole blood, skin sun-exposed, skin not
   sun-exposed, and adipose subcutaneous) and employed HYFA to impute
   target expression values ([84]Figure 3 and [85]Extended Data Figure 4).
   We discarded underrepresented target tissues with less than 25 test
   individuals. Relative to using whole blood in isolation, using all
   accessible tissues as reference resulted in improved performance for 32
   out of 38 target tissues ([86]Extended Data Figure 4). This
   particularly boosted imputation performance for esophageal tissues
   (muscularis:
   [MATH: <mi
   mathvariant="normal">Δ</mi><mi>ρ</mi><mo>=</mo><mn>0.068</mn> :MATH]
   , gastro:
   [MATH: <mi
   mathvariant="normal">Δ</mi><mi>ρ</mi><mo>=</mo><mn>0.061</mn> :MATH]
   , mucosa:
   [MATH: <mi
   mathvariant="normal">Δ</mi><mi>ρ</mi><mo>=</mo><mn>0.048</mn> :MATH]
   ), colonic tissues (transverse:
   [MATH: <mi
   mathvariant="normal">Δ</mi><mi>ρ</mi><mo>=</mo><mn>0.065</mn> :MATH]
   , sigmoid:
   [MATH: <mi
   mathvariant="normal">Δ</mi><mi>ρ</mi><mo>=</mo><mn>0.056</mn> :MATH]
   ), and artery tibial (
   [MATH: <mi
   mathvariant="normal">Δ</mi><mi>ρ</mi><mo>=</mo><mn>0.079</mn> :MATH]
   ). In contrast, performance for the pituitary gland (
   [MATH: <mi
   mathvariant="normal">Δ</mi><mi>ρ</mi><mo>=</mo><mo>−</mo><mn>0.011</mn>
   :MATH]
   ), lung (
   [MATH: <mi
   mathvariant="normal">Δ</mi><mi>ρ</mi><mo>=</mo><mo>−</mo><mn>0.003</mn>
   :MATH]
   ), and stomach (
   [MATH: <mi
   mathvariant="normal">Δ</mi><mi>ρ</mi><mo>=</mo><mo>−</mo><mn>0.002</mn>
   :MATH]
   ) remained stable or dropped slightly. Moreover, the performance gap
   between HYFA and TEEBoT (trained on the set of complete multi-tissue
   samples) widened relative to the single-tissue scenario ([87]Figure 3
   and [88]Extended Data Figure 5) — HYFA obtained better performance in
   all target tissues, with statistically significant improvements in 26
   out of 38 tissues (two-sided Mann-Whitney-Wilcoxon p-value < 0.05). We
   attribute the improved scores to HYFA’s ability to process a variable
   number of reference tissues, reuse knowledge across tissues, and
   capture non-linear patterns.

Inference of cell-type signatures

   We next investigated the potential of HYFA to predict
   cell-type-specific signatures — average gene expression across cells
   from a given cell-type — in a given tissue of interest. We first
   selected GTEx donors with collected bulk (v8) and single-nucleus
   RNA-seq profiles (v9, [89]Methods). Next, we trained HYFA to infer
   cell-type signatures from the multi-tissue bulk expression profiles. We
   evaluated performance using the observed ([90]Figure 4) and inferred
   library sizes ([91]Supplementary Information K). To attenuate the small
   data size problem, we applied transfer learning on the model trained
   for the multi-tissue imputation task ([92]Methods). We observed strong
   prediction performance (Pearson correlation
   [MATH: <mi>ρ</mi> :MATH]
   between log ground truth and log predicted signatures) for vascular
   endothelial cells (heart:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.84</mn> :MATH]
   ; breast:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.88</mn> :MATH]
   , esophagus muscularis:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.68</mn> :MATH]
   ) and fibroblasts (heart:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.84</mn> :MATH]
   ; breast:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.89</mn> :MATH]
   , esophagus muscularis:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.70</mn> :MATH]
   ). Strikingly, HYFA recovered the cell-type profiles of tissues that
   were never observed in the train set with high correlation ([93]Figure
   4 and [94]Supplementary Information K), e.g. skeletal muscle (vascular
   endothelial cells:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.79</mn> :MATH]
   , fibroblasts:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.77</mn> :MATH]
   , pericytes/SMC:
   [MATH: <mi>ρ</mi><mo>=</mo><mn>0.68</mn> :MATH]
   ), demonstrating the benefits of the factorised tissue representations.
   Overall, our results highlight the potential of HYFA to impute unknown
   cell-type signatures even for tissues that were not considered in the
   original single-cell study. Additionally, our analyses point to
   promising downstream applications as single-cell RNA-seq datasets
   become larger in number of individuals ([95]Supplementary Information
   N), including deconvolution and cell-type specific eQTL mapping.

Figure 4:

   Figure 4:
   [96]Open in a new tab

   Prediction of cell-type signatures. HYFA imputes individual- and
   tissue-specific cell-type signatures from bulk multi-tissue gene
   expression. The scatter plots depict the Pearson correlation
   [MATH: <mi>ρ</mi> :MATH]
   between the logarithmised ground truth and predicted signatures for
   [MATH: <mi>N</mi> :MATH]
   unseen individuals. To infer the signatures, we used the observed
   library size
   [MATH: <msubsup><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   and number of cells
   [MATH: <msubsup><mrow><mi>n</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   ([97]Methods).

Multi-tissue imputation improves eQTL detection

   The GTEx project has enabled the identification of numerous genetic
   associations with gene expression across a broad collection of tissues
   [[98]2], also known as expression Quantitative Trait Loci (eQTLs)
   [[99]26]. However, eQTL datasets are characterised by small sample
   sizes, especially for difficult-to-acquire tissues and cell types,
   reducing the statistical power to detect eQTLs [[100]27]. To address
   this problem, we employed HYFA to impute the transcript levels of every
   uncollected tissue for each individual in GTEx, yielding a complete
   gene expression dataset of 834 individuals and 49 tissues. We then
   performed eQTL mapping ([101]Methods) on the original and imputed
   datasets and observed a substantial gain in the number of unique genes
   with detected eQTLs, the so-called eGenes ([102]Figure 5). Notably,
   this metric increased for tissues with low sample size (Spearman
   correlation coefficient
   [MATH: <mi>ρ</mi><mo>=</mo><mo>−</mo><mn>0.83</mn> :MATH]
   ) — which are most likely to benefit from borrowing information across
   tissues with shared regulatory architecture. Kidney cortex displayed
   the largest gain in number of eGenes (from 215 to 12,557), while there
   was no increase observed for whole blood.

Figure 5:

   Figure 5:
   [103]Open in a new tab

   HYFA’s imputed data improves expression Quantitative Trait Loci (eQTL)
   discovery. (a) Number of unique genes with detected eQTLs (FDR < 0.1)
   on observed (circle) and full (observed plus imputed; rhombus) GTEx
   data. Note logarithmic scale of y-axis. The eQTLs were mapped using
   MatrixEQTL [[104]70, [105]55] assuming additive genotype effect on gene
   expression. MatrixEQTL conducts a test for each SNP-gene pair and makes
   adjustments for multiple comparisons by computing the
   Benjamini-Hochberg FDR [[106]71]. (b) Fold increase in number of unique
   genes with mapped eQTLs (y-axis) versus observed sample size (x-axis).
   (c) Histogram of replication p-values among the HYFA-identified
   cis-eQTLs for whole blood (left) and brain prefrontal cortex (right).
   For replication, we used the independent eQTLGen Consortium (n>30,000;
   [[107]28]) and PsychENCODE (n=1,866; [[108]29]) eQTL datasets,
   respectively. (d) Quantile-quantile plot showing the causal variants’
   association with gene expression in blood (left) and brain frontal
   cortex (right) in the HYFA-derived dataset using experimentally
   validated causal variant data from the application of Massively
   Parallel Reporter Assay [[109]31]. All statistical tests were
   two-sided. HYFA’s imputed data substantially increases the number of
   identified associations with high replicability and significant
   enrichment of causal regulatory variants.

   To assess the quality of the identified eQTLs from HYFA imputation, we
   conducted systematic replication analyses of 1) the whole blood
   eQTL-eGene pairs, using the eQTLGen blood transcriptome dataset in more
   than 30,000 individuals [[110]28] and 2) the frontal cortex eQTL-eGene
   pairs, using the PsychENCODE pre-frontal cortex transcriptome dataset
   in 1,866 individuals [[111]29]. For each tissue, we quantified the
   replication rate for eQTL-eGene pairs using the
   [MATH: <msub><mrow><mi>π</mi></mrow><mrow><mn>1</mn></mrow></msub>
   :MATH]
   statistic [[112]30]. Notably, we found a highly significant enrichment
   for low replication p-values among the HYFA-derived eQTL-eGene pairs
   ([113]Figure 5), demonstrating strong reproducibility of the results.
   The replication rate
   [MATH: <msub><mrow><mi>π</mi></mrow><mrow><mn>1</mn></mrow></msub>
   :MATH]
   was 0.80 for whole blood and 0.96 for frontal cortex. We also evaluated
   the extent to which the HYFA imputation could capture regulatory
   variants that directly modulate gene expression using experimentally
   validated causal variants from Massively Parallel Reporter Assay
   [[114]31]. Notably, among the causal regulatory variants from this
   experimental assay, we found a highly significant enrichment for low
   p-values among the HYFA-identified eQTLs in blood and in frontal cortex
   ([115]Figure 5). Thus, HYFA imputation enabled identification of
   biologically meaningful, replicable eQTL hits in the respective
   tissues. Our results generate a large catalog of new tissue-specific
   eQTLs (Data availability), with potential to enhance our understanding
   of how regulatory variation mediates variation in complex traits,
   including disease susceptibility.

Brain-gut axis

   The brain-gut axis is a bidirectional communication system of
   signalling pathways linking the central and enteric nervous systems. We
   investigated whether the transcriptomes of tissues from the
   gastrointestinal system are predictive of gene expression in brain
   tissues ([116]Figure 2 and [117]Supplementary Information G). Overall,
   the top predicted genes were enriched in multiple signalling-related
   terms (e.g. cytokine receptor activity and interleukin-1 receptor
   activity), consistent with existing knowledge that gut microbes
   communicate with the central nervous system through signalling
   mechanisms [[118]32]. Genes in the intersection were also notably
   enriched in the ciliary neurotrophic factor receptor activity, which
   plays an important role in neuron survival [[119]33], enteric nervous
   system development [[120]34], and body weight control [[121]35].

HYFA-learned metagenes capture known biological pathways

   A key feature of HYFA is that it reuses knowledge across tissues and
   metagenes, allowing to exploit shared regulatory patterns. We explored
   whether HYFA’s inductive biases encourage learning biologically
   relevant metagenes. To determine the extent to which metagene-factors
   relate to known biological pathways, we applied Gene Set Enrichment
   Analysis (GSEA) [[122]36] to the gene loadings of HYFA’s encoder
   ([123]Methods). Similar to [[124]37], for a given query gene set, we
   calculated the maximum running sum of enrichment scores by descending
   the sorted list of gene loadings for every metagene and factor. We then
   computed pathway enrichment p-values through a permutation test and
   employed the Benjamini-Hochberg method to correct for multiple testing
   independently for every metagene-factor.

   In total, we identified 18683 statistically significant enrichments
   (FDR < 0.05) of KEGG biological processes ([[125]38]; 320 gene sets;
   [126]Figure 6) across all HYFA metagenes (n=50) and factors (n=98).
   Among the enriched terms, 2109 corresponded to signalling pathways and
   1300 to pathways of neurodegeneration. We observed considerable overlap
   between several metagenes in terms of biologically related pathways,
   e.g. factor 95 of metagene 11 had the lowest FDR for both Alzheimer’s
   disease (FDR < 0.001) and Amyotrophic Lateral Sclerosis (FDR < 0.001)
   pathways. Enrichment analysis of TRRUST [[127]39] transcription factors
   (TFs) further identified important regulators including GATA1 (known to
   regulate the development of red blood cells [[128]40]), SPI1 (which
   controls hematopoietic cell fate [[129]41]), CEBPs (which play an
   important role in the differentiation of a range of cell types and the
   control of tissue-specific gene expression; [[130]42, [131]43]), and
   STAT1 (a member of the STAT protein family that drives the expression
   of many target genes [[132]44]). We also observed that the learnt HYFA
   factors recapitulate synergistic effects among the enriched TFs
   ([133]Supplementary Information H and [134]Extended Data Figure 6). For
   example, GATA1 and SPI1, which were simultaneously enriched in 7
   factors (FDR < 0.05), functionally antagonise each other through
   physical interaction [[135]45]. Similarly, IRF1 induces STAT1
   activation via phosphorylation [[136]44, [137]46] and both TFs were
   enriched in 10 factors (FDR < 0.05), aligning with our enrichment
   analyses of GO Biological Process terms ([138]Supplementary Information
   I and [139]Extended Data Figures 7 and [140]8). Altogether, our
   analyses suggest that HYFA-learned metagenes and factors are amenable
   to biological interpretation and capture information about known
   regulators of tissue-specific gene expression.

Figure 6:

   Figure 6:
   [141]Open in a new tab

   Pathway enrichment analysis of metagene factors. (a) Manhattan plot of
   the GSEA results on the metagenes (n=50) and factors (n=98) learned by
   HYFA. The x-axis represents metagenes (colored bins) and every offset
   within the bin corresponds to a different factor. The y-axis is the
   −log q-value (FDR) from the GSEA permutation test, corrected for
   multiple testing via the Benjamini-Hochberg procedure. We identified
   18683 statistically significant enrichments (FDR < 0.05) of KEGG
   biological processes across all metagenes and factors. (b) Total number
   of enriched terms for each type of pathway. (c) FDR for pathways of
   neurodegeneration. For every pathway and metagene, we selected the
   factor with lowest FDR and depicted statistically significant values
   (FDR < 0.05). Point sizes are proportional to −log FDR values. Metagene
   11 (factor 95) had the lowest FDR for both Amyotropic Lateral Sclerosis
   (ALS) and Alzheimer’s Disease. (d) UMAP of latent values of metagene 11
   for all spinal cord (ALS: orange) and brain cortex (Alzheimer’s disease
   or Dementia: orange) GTEx samples. (e) Leading edge subsets of top-15
   enriched gene sets for factor 95 of metagene 11. (f, g) Enrichment
   plots for Amyotropic Lateral Sclerosis (f) and Alzheimer’s disease gene
   sets (g).

Discussion

   Effective multi-tissue omics integration promises a system-wide view of
   human physiology, with potential to shed light on intra- and
   multi-tissue molecular phenomena. Such an approach challenges
   single-tissue and conventional integration techniques — often unable to
   model a variable number of tissues with sufficient statistical strength
   — necessitating the development of scalable, non-linear, and flexible
   methods. Here we developed HYFA (Hypergraph Factorisation), a
   parameter-efficient approach for joint multi-tissue and cell-type gene
   expression imputation that imposes strong inductive biases to learn
   entity-independent relational semantics and demonstrates excellent
   imputation capabilities.

   We performed extensive benchmarks on data from the Genotype-Tissue
   Expression project (GTEx; [[142]2]; v8 and v9), the most comprehensive
   human transcriptome resource available, and evaluated imputation
   performance over a broad collection of tissues and cell types. In
   addition to standard transcriptome imputation approaches, we compared
   our method with TEEBoT [[143]1], a linear method that predicts target
   gene expression from the principal components of the reference
   expression. In the single-tissue reference scenario, HYFA and TEEBoT
   attained comparable imputation performance, outperforming standard
   methods. In the multi-tissue reference scenario, HYFA consistently
   outperformed TEEBoT and standard approaches in all target tissues,
   demonstrating HYFA’s capabilities to borrow non-linear information
   across a variable number of tissues and exploit shared molecular
   patterns.

   In addition to imputing tissue-level transcriptomics, we investigated
   the ability of HYFA to predict cell-type-level gene expression from
   multi-tissue bulk expression measurements. Through transfer learning,
   we trained HYFA to infer cell-type signatures from a cohort of
   single-nucleus RNA-seq [[144]13] with matching GTEx-v8 donors. The
   inferred cell-type signatures exhibited a strong correlation with the
   ground truth despite the low sample size, indicating that HYFA’s latent
   representations are rich and amenable to knowledge transfer.
   Strikingly, HYFA also recovered cell-type profiles from tissues that
   were never observed at transfer time, pointing to HYFA’s ability to
   leverage gene expression programs underlying cell-type identity
   [[145]47] even in tissues that were not considered in the original
   study [[146]13]. HYFA may also be used to impute the expression of
   disease-related genes in a tissue of interest ([147]Supplementary
   Information J).

   In post-imputation analysis, we studied whether the imputed data
   improves eQTL discovery. We employed HYFA to impute the gene expression
   levels of every uncollected tissue in GTEx-v8, yielding a complete
   dataset, and performed eQTL mapping. Compared to the original dataset,
   we observed a substantial gain in number of genes with detected eQTLs,
   with kidney cortex showing the largest gain. The increase was highest
   for tissues with low sample sizes, which are the ones expected to
   benefit the most from knowledge sharing across tissues. Notably, HYFA’s
   detected eQTLs with their target eGenes could be replicated using
   independent, single-tissue transcriptome datasets that focus on depth,
   including the blood eQTLGen [[148]28] and the brain frontal cortex
   PsychENCODE [[149]29] datasets. Moreover, we found a significant
   enrichment for experimentally validated causal variants from the
   Massively Parallel Reporter Assay ([[150]31]) dataset. Our results
   uncover a large number of previously undetected tissue-specific eQTLs
   and highlight the ability of HYFA to exploit shared regulatory
   information across tissues.

   Lastly, HYFA can provide insights on coordinated gene regulation and
   expression mechanisms across tissues. We analysed to what extent
   tissues from the gastrointestinal system are informative of gene
   expression in brain tissues — an important question that may shed light
   on the biology of the brain-gut axis — and identified enriched
   biological processes and molecular functions. Through Gene Set
   Enrichment Analysis [[151]36], we observed, among the HYFA-learned
   metagenes, a substantial amount of enriched pathways, transcription
   factors, and known regulators of biological processes, opening the door
   to biological interpretations. Future work might also seek to impose
   stronger inductive bias to ensure that metagenes are identifiable and
   robust to batch effects.

   We believe that HYFA, as a versatile graph representation learning
   framework, provides a novel methodology for effective integration of
   large-scale multi-tissue biorepositories. The hypergraph factorisation
   framework is flexible (it supports k-uniform hypergraphs of arbitrary
   node types) and may find application beyond computational genomics.

Methods

Problem formulation

   Suppose we have a transcriptomics dataset of
   [MATH: <mi>N</mi> :MATH]
   individuals/donors,
   [MATH: <mi>T</mi> :MATH]
   tissues, and
   [MATH: <mi>G</mi> :MATH]
   genes. For each individual
   [MATH:
   <mi>i</mi><mo>∈</mo><mo>{</mo><mn>1</mn><mo>,</mo><mo>…</mo><mo>,</mo><
   mi>N</mi><mo>}</mo> :MATH]
   , let
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>T</mi><mo>×</mo><mi>
   G</mi></mrow></msup> :MATH]
   be the gene expression values in
   [MATH: <mi>T</mi> :MATH]
   tissues and define the donor’s demographic information by
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>C</mi></mrow></msup>
   :MATH]
   , where
   [MATH: <mi>C</mi> :MATH]
   is the number of covariates. Denote by
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   the
   [MATH: <mi>k</mi> :MATH]
   -th entry of
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   , corresponding to the expression values of donor
   [MATH: <mi>i</mi> :MATH]
   measured in tissue
   [MATH: <mi>k</mi> :MATH]
   . For a given donor
   [MATH: <mi>i</mi> :MATH]
   , let
   [MATH: <mi>𝒯</mi><mo>(</mo><mi>i</mi><mo>)</mo> :MATH]
   represent the collection of tissues with measured expression values.
   These sets might vary across individuals. Let
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub><mo>∈</mo><mo stretchy="false">(</mo><mi
   mathvariant="double-struck">R</mi><mo>∪</mo><mo
   stretchy="false">{</mo><mi mathvariant="normal">*</mi><mo
   stretchy="false">}</mo><msup><mrow><mo
   stretchy="false">)</mo></mrow><mrow><mi>T</mi><mo>×</mo><mi>G</mi></mro
   w></msup> :MATH]
   be the measured gene expression values, where
   [MATH: <mi mathvariant="normal">*</mi> :MATH]
   denotes unobserved, so that
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><m
   o>=</mo><msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   if
   [MATH: <mi>k</mi><mo>∈</mo><mi>𝒯</mi><mo>(</mo><mi>i</mi><mo>)</mo>
   :MATH]
   and
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><m
   o>=</mo><mi mathvariant="normal">*</mi> :MATH]
   otherwise. Our goal is to infer the uncollected values in
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub> :MATH]
   by modelling the distribution
   [MATH: <mi>p</mi><mfenced separators="|"><mrow><mi
   mathvariant="bold">X</mi><mo>=</mo><msub><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>∣</mo><mover accent="true"><mrow><mi
   mathvariant="bold">X</mi></mrow><mo>˜</mo></mover><mo>=</mo><msub><mrow
   ><mover accent="true"><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub><mo>,</mo><mi
   mathvariant="bold">U</mi><mo>=</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub></
   mrow></mfenced> :MATH]
   .

Multi-tissue model

   An important challenge of modelling multi-tissue gene expression is
   that a different set of tissues might be collected for each individual.
   Moreover, the data dimensionality scales rapidly with the total number
   of tissues and genes. To address these problems, we represent the data
   in a hypergraph and develop a parameter-efficient neural network that
   operates on this hypergraph. Throughout, we make use of the concept of
   metagenes [[152]14, [153]15]. Each metagene characterises certain gene
   expression patterns and is defined as a positive linear combination of
   multiple genes [[154]14, [155]15].

Hypergraph representation

   We represent the data in a hypergraph consisting of three types of
   nodes: donor, tissue, and metagene nodes.

   Mathematically, we define a hypergraph
   [MATH: <mi>𝒢</mi><mo>=</mo><mfenced open="{" close="}"
   separators="|"><mrow><msub><mrow><mi>𝒱</mi></mrow><mrow><mi>d</mi></mro
   w></msub><mo>∪</mo><msub><mrow><mi>𝒱</mi></mrow><mrow><mi>m</mi></mrow>
   </msub><mo>∪</mo><msub><mrow><mi>𝒱</mi></mrow><mrow><mi>t</mi></mrow></
   msub><mo>,</mo><mi>ℰ</mi></mrow></mfenced> :MATH]
   , where
   [MATH: <msub><mrow><mi>𝒱</mi></mrow><mrow><mi>d</mi></mrow></msub>
   :MATH]
   is a set of donor nodes,
   [MATH: <msub><mrow><mi>𝒱</mi></mrow><mrow><mi>m</mi></mrow></msub>
   :MATH]
   is a set of metagene nodes,
   [MATH: <msub><mrow><mi>𝒱</mi></mrow><mrow><mi>t</mi></mrow></msub>
   :MATH]
   is a set of tissue nodes, and
   [MATH: <mi>ℰ</mi> :MATH]
   is a set of multi-attributed hyperedges. Each hyperedge connects an
   individual
   [MATH: <mi>i</mi> :MATH]
   with a metagene
   [MATH: <mi>j</mi> :MATH]
   and a tissue
   [MATH: <mi>k</mi> :MATH]
   if
   [MATH: <mi>k</mi><mo>∈</mo><mi>𝒯</mi><mo>(</mo><mi>i</mi><mo>)</mo>
   :MATH]
   , where
   [MATH: <mi>𝒯</mi><mo>(</mo><mi>i</mi><mo>)</mo> :MATH]
   are the collected tissues of individual
   [MATH: <mi>i</mi> :MATH]
   . The set of all hyperedges is defined as
   [MATH: <mi>ℰ</mi><mo>=</mo><mo stretchy="false">{</mo><mrow><mo
   stretchy="false">(</mo><mrow><mi>i</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>j</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>k</mi><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow
   ><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup></mrow><mo
   stretchy="false">)</mo><mo>∣</mo><mo
   stretchy="false">(</mo><mi>i</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>j</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>k</mi><mo
   stretchy="false">)</mo><mo>∈</mo><msub><mrow><mi>𝒱</mi></mrow><mrow><mi
   >d</mi></mrow></msub><mo>×</mo><msub><mrow><mi>𝒱</mi></mrow><mrow><mi>m
   </mi></mrow></msub><mo>×</mo><msub><mrow><mi>𝒱</mi></mrow><mrow><mi>t</
   mi></mrow></msub><mo>,</mo><mi>k</mi><mo>∈</mo><mi>𝒯</mi><mo
   stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo></mrow><mo
   stretchy="false">}</mo> :MATH]
   , where
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow
   ><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   are hyperedge attributes that describe characteristics of the
   interacting nodes, i.e. features of metagene
   [MATH: <mi>j</mi> :MATH]
   in tissue
   [MATH: <mi>k</mi> :MATH]
   for individual
   [MATH: <mi>i</mi> :MATH]
   .

   The hypergraph representation allows representing data in a flexible
   way, generalising the bipartite graph representation from [[156]48]. On
   the one hand, using a single metagene results in a bipartite graph
   where each edge connects an individual
   [MATH: <mi>i</mi> :MATH]
   with a tissue
   [MATH: <mi>k</mi> :MATH]
   . In this case, the edge attributes
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mn>1</mn></mrow
   ><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   are derived from the gene expression
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   of individual
   [MATH: <mi>i</mi> :MATH]
   in tissue
   [MATH: <mi>k</mi> :MATH]
   . On the other hand, using multiple metagenes leads to a hypergraph
   where each individual
   [MATH: <mi>i</mi> :MATH]
   is connected to tissue
   [MATH: <mi>k</mi> :MATH]
   through multiple hyperedges. For example, it is possible to construct a
   hypergraph where genes and metagenes are related by a one-to-one
   correspondence, with hyperedge attributes
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow
   ><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   derived directly from expression
   [MATH:
   <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow><mrow>
   <mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   . The number of metagenes thus controls a spectrum of hypergraph
   representations and, as we shall see, can help alleviate the inherent
   over-squashing problem of graph neural networks.

Message passing neural network

   Given the hypergraph representation of the multi-tissue transcriptomics
   dataset, we now present a parameter-efficient graph neural network to
   learn donor, metagene, and tissue embeddings, and infer the expression
   values of the unmeasured tissues. We start by computing hyperedge
   attributes from the multi-tissue expression data. Then, we initialise
   the embeddings of all nodes in the hypergraph, construct the message
   passing neural network, and define an inference model that builds on
   the latent node representations obtained via message passing.

Computing hyperedge attributes

   We first reduce the dimensionality of the measured transcriptomics
   values. For every individual
   [MATH: <mi>i</mi> :MATH]
   and measured tissue
   [MATH: <mi>k</mi> :MATH]
   , we project the corresponding gene expression values
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   into low-dimensional metagene representations
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow
   ><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   :
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow
   ><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>=</mo><mi
   mathvariant="normal">R</mi><mi mathvariant="normal">e</mi><mi
   mathvariant="normal">L</mi><mi mathvariant="normal">U</mi><mo>⁡</mo><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>j</mi></mrow></msub><m
   subsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup></mrow><mo
   stretchy="false">)</mo><mspace
   width="2em"></mspace><mo>∀</mo><mi>j</mi><mo>∈</mo><mn>1</mn><mo>.</mo>
   <mo>.</mo><mi>M</mi> :MATH]
   (1)

   where
   [MATH: <mi>M</mi> :MATH]
   , the number of metagenes, is a user-definable hyperparameter and
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>j</mi></mrow></msub><m
   o>∀</mo><mi>j</mi><mo>∈</mo><mn>1</mn><mo>.</mo><mo>.</mo><mi>M</mi>
   :MATH]
   are learnable parameters. In addition to characterising groups of
   functionally similar genes, employing metagenes reduces the number of
   messages being aggregated for each node, addressing the over-squashing
   problem of graph neural networks ([157]Supplementary Information B).

Initial node embeddings

   We initialise the node features of the individual
   [MATH: <msub><mrow><mi>𝒱</mi></mrow><mrow><mi>p</mi></mrow></msub>
   :MATH]
   , metagene
   [MATH: <msub><mrow><mi>𝒱</mi></mrow><mrow><mi>m</mi></mrow></msub>
   :MATH]
   , and tissue
   [MATH: <msub><mrow><mi>𝒱</mi></mrow><mrow><mi>t</mi></mrow></msub>
   :MATH]
   partitions with learnable parameters and available information. For
   metagene and tissue nodes, we use learnable embeddings as initial node
   values. The idea is that these weights, which will be approximated
   through gradient descent, should summarise relevant properties of each
   metagene and tissue. We initialise the node features of each individual
   with the available demographic information
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   of each individual
   [MATH: <mi>i</mi> :MATH]
   (we use age and sex). We encode sex as a binary value and age as a
   float normalised by 100 (e.g. age 30 is encoded as 0.30). Importantly,
   this formulation allows transfer learning between sets of distinct
   donors.

Message passing layer

   We develop a custom GNN layer to compute latent donor embeddings by
   passing messages along the hypergraph. At each layer of the GNN, we
   perform message passing to iteratively refine the individual node
   embeddings. We do not update the tissue and metagene embeddings during
   message passing, in a similar vein as knowledge graph embeddings
   [[158]49], because their node embeddings already consist of learnable
   weights that are updated through gradient descent. Sending messages to
   these nodes would also introduces a dependency between individual nodes
   and tissue and metagene features (and, by transitivity, dependencies
   between individuals). However, if we foresee that unseen entities will
   be present at test time (e.g. new tissue types), our approach can be
   extended by initialising their node features with constant values and
   introducing node-type-specific message passing equations.

   Mathematically, let
   [MATH: <mfenced open="{" close="}"
   separators="|"><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mn>1</mn></mrow><mrow><mi
   >d</mi></mrow></msubsup><mo>,</mo><mo>…</mo><mo>,</mo><msubsup><mrow><m
   i
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>N</mi></mrow><mrow><mi
   >d</mi></mrow></msubsup></mrow></mfenced><mo>,</mo><mspace
   width="0.25em"></mspace><mfenced open="{" close="}"
   separators="|"><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mn>1</mn></mrow><mrow><mi
   >m</mi></mrow></msubsup><mo>,</mo><mo>.</mo><mo>.</mo><mo>,</mo><msubsu
   p><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>M</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup></mrow></mfenced> :MATH]
   , and
   [MATH: <mfenced open="{" close="}"
   separators="|"><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mn>1</mn></mrow><mrow><mi
   >t</mi></mrow></msubsup><mo>,</mo><mo>.</mo><mo>.</mo><mo>,</mo><msubsu
   p><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>T</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup></mrow></mfenced> :MATH]
   be the donor, metagene, and tissue node embeddings, respectively. At
   each layer of the GNN, we compute refined individual embeddings
   [MATH: <mo stretchy="false">{</mo><mrow><msubsup><mrow><mover
   accent="true"><mrow><mi mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mn>1</mn></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><mo>…</mo><mo>,</mo><msubsup>
   <mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>N</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup></mrow><mo stretchy="false">}</mo>
   :MATH]
   as follows:
   [MATH: <mrow><msubsup><mrow><mover><mi
   mathvariant="bold-italic">h</mi><mo>ˆ</mo></mover></mrow><mi>i</mi><mi>
   d</mi></msubsup><mo>=</mo><msub><mi>ϕ</mi><mi>h</mi></msub><mrow><mo>(<
   /mo><mrow><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>i</mi><mi>d</mi></msubsup><mo>,</mo
   ><msub><mi
   mathvariant="bold-italic">m</mi><mi>i</mi></msub></mrow><mo>)</mo></mro
   w><mo>,</mo><mspace width="0.25em"></mspace><msub><mi
   mathvariant="bold-italic">m</mi><mi>i</mi></msub><mo>=</mo><munderover>
   <mo>∑</mo><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow><mi>M</mi></munder
   over><munder><mo>∑</mo><mrow><mi>k</mi><mo>∈</mo><mi>𝒯</mi><mrow><mo>(<
   /mo><mi>i</mi><mo>)</mo></mrow></mrow></munder><msub><mi>ϕ</mi><mi>a</m
   i></msub><mrow><mo stretchy="false">(</mo><mrow><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>j</mi><mi>m</mi></msubsup><mo>,</mo
   ><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>k</mi><mi>t</mi></msubsup><mo>,</mo
   ><msub><mi
   mathvariant="bold-italic">m</mi><mrow><mi>i</mi><mi>j</mi><mi>k</mi></m
   row></msub></mrow><mo stretchy="false">)</mo></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi
   mathvariant="bold-italic">m</mi><mrow><mi>i</mi><mi>j</mi><mi>k</mi></m
   row></msub><mo>=</mo><msub><mi>ϕ</mi><mi>e</mi></msub><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>i</mi><mi>d</mi></msubsup><mo>,</mo
   ><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>j</mi><mi>m</mi></msubsup><mo>,</mo
   ><msubsup><mi
   mathvariant="bold-italic">h</mi><mi>k</mi><mi>t</mi></msubsup><mo>,</mo
   ><msubsup><mi
   mathvariant="bold-italic">e</mi><mrow><mi>i</mi><mi>j</mi></mrow><mrow>
   <mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></mrow></msubsup></mrow><mo
   stretchy="false">)</mo></mrow><mo>,</mo></mrow> :MATH]
   (2)

   where the functions
   [MATH: <msub><mrow><mi>ϕ</mi></mrow><mrow><mi>e</mi></mrow></msub>
   :MATH]
   and
   [MATH: <msub><mrow><mi>ϕ</mi></mrow><mrow><mi>h</mi></mrow></msub>
   :MATH]
   are edge and node operations that we model as multilayer perceptrons
   (MLP), and
   [MATH: <msub><mrow><mi>ϕ</mi></mrow><mrow><mi>a</mi></mrow></msub>
   :MATH]
   is a function that determines the aggregation behavior. In its simplest
   form, choosing
   [MATH:
   <msub><mrow><mi>ϕ</mi></mrow><mrow><mi>a</mi></mrow></msub><mfenced
   separators="|"><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>k</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">m</mi></mrow><mrow><mi>i</mi><mi>j</mi><mi>k<
   /mi></mrow></msub></mrow></mfenced><mo>=</mo><mfrac><mrow><mn>1</mn></m
   row><mrow><mi>M</mi><mo>|</mo><mi>𝒯</mi><mo>(</mo><mi>i</mi><mo>)</mo><
   mo>|</mo></mrow></mfrac><msub><mrow><mi
   mathvariant="bold-italic">m</mi></mrow><mrow><mi>i</mi><mi>j</mi><mi>k<
   /mi></mrow></msub> :MATH]
   results in average aggregation. We analyse the time complexity of the
   message passing layer in [159]Supplementary Information A. Optionally,
   we can stack several message passing layers to increase the
   expressivity of the model.

   The architecture is flexible and may be extended as follows:
     * Incorporation of information about the individual embeddings
       [MATH: <msubsup><mrow><mi
       mathvariant="bold-italic">h</mi></mrow><mrow><mi>i</mi></mrow><mrow
       ><mi>d</mi></mrow></msubsup> :MATH]
       into the aggregation mechanism
       [MATH: <msub><mrow><mi>ϕ</mi></mrow><mrow><mi>a</mi></mrow></msub>
       :MATH]
       .
     * Incorporation of target tissue embeddings
       [MATH: <msubsup><mrow><mi
       mathvariant="bold-italic">h</mi></mrow><mrow><mi>u</mi></mrow><mrow
       ><mi>t</mi></mrow></msubsup> :MATH]
       , for a given target tissue
       [MATH: <mi>u</mi> :MATH]
       , into the aggregation mechanism
       [MATH: <msub><mrow><mi>ϕ</mi></mrow><mrow><mi>a</mi></mrow></msub>
       :MATH]
       .
     * Update hyperedge attributes
       [MATH: <msubsup><mrow><mi
       mathvariant="bold-italic">e</mi></mrow><mrow><mi>i</mi><mi>j</mi></
       mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
       at every layer.

Aggregation mechanism

   In practice, the proposed hypergraph neural network suffers from a
   bottleneck. In the aggregation step, the number of messages being
   aggregated is
   [MATH:
   <mi>M</mi><mo>|</mo><mi>𝒯</mi><mo>(</mo><mi>i</mi><mo>)</mo><mo>|</mo>
   :MATH]
   for each individual
   [MATH: <mi>i</mi> :MATH]
   . In the worst case, when all genes are used as metagenes (i.e.
   [MATH: <mi>M</mi><mo>=</mo><mi>G</mi> :MATH]
   ; it is estimated that humans have around
   [MATH: <mi>G</mi><mo>≈</mo><mn>25</mn><mo>,</mo><mspace
   width="0.25em"></mspace><mn>000</mn> :MATH]
   protein-coding genes), this leads to serious over-squashing — large
   amounts of information are compressed into fixed-length vectors
   [[160]50]. Fortunately, choosing a small number of metagenes reduces
   the dimensionality of the original transcriptomics values which in turn
   alleviates the over-squashing and scalability problems
   ([161]Supplementary Information B). To further attenuate
   over-squashing, we propose an attention-based aggregation mechanism
   [MATH: <msub><mrow><mi>ϕ</mi></mrow><mrow><mi>a</mi></mrow></msub>
   :MATH]
   that weighs metagenes according to their relevance in each tissue:
   [MATH:
   <msub><mrow><mi>ϕ</mi></mrow><mrow><mi>a</mi></mrow></msub><mfenced
   separators="|"><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>k</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">m</mi></mrow><mrow><mi>i</mi><mi>j</mi><mi>k<
   /mi></mrow></msub></mrow></mfenced><mo>=</mo><msub><mrow><mi>α</mi></mr
   ow><mrow><mi>j</mi><mi>k</mi></mrow></msub><msub><mrow><mi
   mathvariant="bold-italic">m</mi></mrow><mrow><mi>i</mi><mi>j</mi><mi>k<
   /mi></mrow></msub><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>α</mi></mrow><mrow><mi>j</mi><m
   i>k</mi></mrow></msub><mo>=</mo><mfrac><mrow><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">x</mi><mi
   mathvariant="normal">p</mi><mo>⁡</mo><mi>e</mi><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>k</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup></mrow><mo
   stretchy="false">)</mo></mrow><mrow><mrow><munder><mo
   stretchy="true">∑</mo><mrow><mi>v</mi></mrow></munder><mrow><mo> </mo><
   /mrow></mrow><mo> </mo><mi mathvariant="normal">e</mi><mi
   mathvariant="normal">x</mi><mi
   mathvariant="normal">p</mi><mo>⁡</mo><mi>e</mi><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>v</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>k</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup></mrow><mo
   stretchy="false">)</mo></mrow></mfrac><mo>,</mo><mspace
   width="0.25em"></mspace><mi>e</mi><mfenced
   separators="|"><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>k</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup></mrow></mfenced><mo>=</mo><msup><mrow><mi
   mathvariant="bold">a</mi></mrow><mrow><mi
   mathvariant="normal">⊤</mi></mrow></msup><mi
   mathvariant="normal">L</mi><mi mathvariant="normal">e</mi><mi
   mathvariant="normal">a</mi><mi mathvariant="normal">k</mi><mi
   mathvariant="normal">y</mi><mi mathvariant="normal">R</mi><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">L</mi><mi
   mathvariant="normal">U</mi><mo>⁡</mo><mo
   stretchy="false">(</mo><mrow><mi mathvariant="bold">W</mi><mfenced
   open="[" close="]" separators="|"><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup><mo>∥</mo><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>k</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup></mrow></mfenced></mrow><mo
   stretchy="false">)</mo><mo>,</mo> :MATH]

   where
   [MATH: <mo>∥</mo> :MATH]
   is the concatenation operation and a and
   [MATH: <mi mathvariant="bold">W</mi> :MATH]
   are learnable parameters. The proposed attention mechanism, which
   closely follows the neighbour aggregation method of graph attention
   networks [[162]51, [163]52], computes dynamic weighting coefficients
   that prioritise messages coming from important metagenes. Optionally,
   we can leverage multiple heads [[164]53] to learn multiple modes of
   interaction and increase the expressivity of the model.

Hypergraph model

   The hypergraph model, which we define as
   [MATH: <mi>f</mi> :MATH]
   , computes latent individual embeddings
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup> :MATH]
   from incomplete multi-tissue expression profiles as
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>=</mo><mi>f</mi><mo
   stretchy="false">(</mo><mrow><msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub></
   mrow><mo stretchy="false">)</mo> :MATH]
   .

Downstream imputation tasks

   The resulting donor representations
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup> :MATH]
   summarise information about a variable number of tissue types collected
   for donor i, in addition to demographic information. We leverage these
   embeddings for two downstream tasks: inference of gene expression in
   uncollected tissues and prediction of cell-type signatures.

Inference of gene expression in uncollected tissues

   Predicting the transcriptomic measurements
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   of a tissue
   [MATH: <mi>k</mi> :MATH]
   (e.g. uncollected) is achieved by first recovering the latent metagene
   values
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   for all metagenes
   [MATH: <mi>j</mi><mo>∈</mo><mn>1</mn><mo>.</mo><mo>.</mo><mi>M</mi>
   :MATH]
   , a hyperedge-level prediction task, and then decoding the gene
   expression values from the predicted metagene representations
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   with an appropriate probabilistic model.

Prediction of hyperedge attributes

   To predict the latent metagene attributes
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   for all
   [MATH: <mi>j</mi><mo>∈</mo><mn>1</mn><mo>.</mo><mo>.</mo><mi>M</mi>
   :MATH]
   , we employ a multilayer perceptron that operates on the factorised
   metagene
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup> :MATH]
   and tissue representations
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>k</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup> :MATH]
   as well as the latent variables
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup> :MATH]
   of individual
   [MATH: <mi>i</mi> :MATH]
   :
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>=</mo
   ><mi mathvariant="normal">M</mi><mi mathvariant="normal">L</mi><mi
   mathvariant="normal">P</mi><mo>⁡</mo><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mover
   accent="true"><mrow><mi mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>k</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup></mrow><mo stretchy="false">)</mo><mo>,</mo>
   :MATH]

   where the MLP is shared for all combinations of metagenes, individuals,
   and tissues.

Negative binomial imputation model

   For raw count data, we use a negative binomial likelihood. To decode
   the gene expression values for a tissue
   [MATH: <mi>k</mi> :MATH]
   of individual
   [MATH: <mi>i</mi> :MATH]
   , we define the following probabilistic model
   [MATH: <mi>p</mi><mo stretchy="false">(</mo><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><m
   over accent="true"><mrow><mi mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mi>k</mi></mrow><mo stretchy="false">)</mo> :MATH]
   :
   [MATH: <mi>p</mi><mo stretchy="false">(</mo><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><m
   over accent="true"><mrow><mi mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mi>k</mi></mrow><mo
   stretchy="false">)</mo><mo>=</mo><mrow><munderover><mo
   stretchy="true">∏</mo><mrow><mi>j</mi></mrow><mrow><mi>G</mi></mrow></m
   underover><mrow><mo> </mo></mrow></mrow><mi>p</mi><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>
   i</mi><mi>j</mi></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msu
   bsup><mo>∣</mo><msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mspace width="0.25em"></mspace><mi>j</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>k</mi></mrow><mo
   stretchy="false">)</mo><mo>,</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>
   i</mi><mi>j</mi></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msu
   bsup><mo>∣</mo><msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mspace width="0.25em"></mspace><mi>j</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>k</mi></mrow><mo
   stretchy="false">)</mo><mo>=</mo><mi mathvariant="normal">N</mi><mi
   mathvariant="normal">B</mi><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>
   i</mi><mi>j</mi></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msu
   bsup><mo>;</mo><msubsup><mrow><mi>μ</mi></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>,</mo
   ><msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow><mrow
   ><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup></mrow><mo
   stretchy="false">)</mo><mo>,</mo> :MATH]

   where NB is a negative binomial distribution. The mean
   [MATH:
   <msubsup><mrow><mi>μ</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow><mrow>
   <mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   and dispersion
   [MATH:
   <msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow><mrow>
   <mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   parameters of this distribution are computed as follows:
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>=</mo><msubsup><mrow><m
   i>l</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</
   mo></mrow></msubsup><msubsup><mrow><mi
   mathvariant="bold-italic">s</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">s</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   >(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>=</mo><mrow><mrow><mi
   mathvariant="normal">softmax</mi></mrow><mo>⁡</mo><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>s</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mfenced
   separators="|"><mrow><mi>k</mi></mrow></mfenced></mrow></msubsup><mo>+<
   /mo><msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>s</mi></mrow></msub></
   mrow><mo stretchy="false">)</mo></mrow></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">θ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mf
   enced
   separators="|"><mrow><mi>k</mi></mrow></mfenced></mrow></msubsup><mo>=<
   /mo><mrow><mrow><mi
   mathvariant="normal">exp</mi></mrow><mo>⁡</mo><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>θ</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mfenced
   separators="|"><mrow><mi>k</mi></mrow></mfenced></mrow></msubsup><mo>+<
   /mo><msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>θ</mi></mrow></msub></
   mrow><mo stretchy="false">)</mo></mrow></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mfenced
   separators="|"><mrow><mi>k</mi></mrow></mfenced></mrow></msubsup><mo>=<
   /mo><mrow><mrow><mi
   mathvariant="normal">MLP</mi></mrow><mo>⁡</mo><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mo
   stretchy="false">∥</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mro
   w><mrow><mi>M</mi></mrow></msubsup><msubsup><mrow><mover
   accent="true"><mrow><mi mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mfenced
   separators="|"><mrow><mi>k</mi></mrow></mfenced></mrow></msubsup></mrow
   ><mo stretchy="false">)</mo></mrow></mrow><mo>,</mo> :MATH]

   where
   [MATH: <msubsup><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo>(</mo>
   <mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   are mean gene-wise proportions;
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>s</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>θ</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>s</mi></mrow></msub>
   :MATH]
   , and
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>θ</mi></mrow></msub>
   :MATH]
   are learnable parameters; and
   [MATH:
   <msubsup><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo>(</mo>
   <mi>k</mi><mo>)</mo></mrow></msubsup> :MATH]
   is the library size, which is modelled with a log-normal distribution
   [MATH: <mi mathvariant="normal">l</mi><mi
   mathvariant="normal">o</mi><mi
   mathvariant="normal">g</mi><mo>⁡</mo><msubsup><mrow><mi>l</mi></mrow><m
   row><mi>i</mi></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubs
   up><mo>∼</mo><mi>𝒩</mi><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi>l</mi></mrow><mrow><mi>
   i</mi></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>;
   </mo><msubsup><mrow><mi>ν</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo>(
   </mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>,</mo><msubsup><mrow><mi>
   ω</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo
   ></mrow></msubsup></mrow><mo stretchy="false">)</mo><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi>ν</mi></mrow><mrow><mi>i</mi
   ></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>=</mo>
   <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>ν</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>+</mo><msub><mr
   ow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>ν</mi></mrow></msub><m
   o>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi>ω</mi></mrow><mrow><mi>i</mi
   ></mrow><mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>=</mo>
   <mi mathvariant="normal">e</mi><mi mathvariant="normal">x</mi><mi
   mathvariant="normal">p</mi><mspace
   width="0.25em"></mspace><mo>⁡</mo><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>ω</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo>(</mo><mi>k</mi><mo>)</mo></mrow></msubsup><mo>+</mo><msub><mr
   ow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>ω</mi></mrow></msub></
   mrow><mo stretchy="false">)</mo><mo>,</mo> :MATH]

   where
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>ν</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>ω</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>ν</mi></mrow></msub>
   :MATH]
   , and
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>ω</mi></mrow></msub>
   :MATH]
   are learnable parameters. Optionally, we can use the observed library
   size.

Gaussian imputation model

   For normalised gene expression data (i.e. inverse normal transformed
   data), we use the following Gaussian likelihood:
   [MATH: <mrow><mi>p</mi><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mi
   mathvariant="bold-italic">x</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><mover
   accent="true"><mi mathvariant="bold-italic">h</mi><mo
   stretchy="true">^</mo></mover></mrow><mi>i</mi><mi>d</mi></msubsup><mo>
   ,</mo><msub><mi
   mathvariant="bold-italic">u</mi><mi>i</mi></msub><mo>,</mo><mi>k</mi></
   mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><munderover><mo>∏</mo><mi>j</mi
   ><mi>G</mi></munderover><mi>p</mi><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mi>x</mi><mrow><mi>i</mi><mi>j</
   mi></mrow><mrow><mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><mover
   accent="true"><mi mathvariant="bold-italic">h</mi><mo
   stretchy="true">^</mo></mover></mrow><mi>i</mi><mi>d</mi></msubsup><mo>
   ,</mo><msub><mi
   mathvariant="bold-italic">u</mi><mi>i</mi></msub><mo>,</mo><mi>j</mi><m
   o>,</mo><mi>k</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mi>x</mi><mrow><mi>i</mi><mi>j</
   mi></mrow><mrow><mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><mover
   accent="true"><mi mathvariant="bold-italic">h</mi><mo
   stretchy="true">^</mo></mover></mrow><mi>i</mi><mi>d</mi></msubsup><mo>
   ,</mo><msub><mi
   mathvariant="bold-italic">u</mi><mi>i</mi></msub><mo>,</mo><mi>j</mi><m
   o>,</mo><mi>k</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mi>𝒩</mi><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mi>x</mi><mrow><mi>i</mi><mi>j</
   mi></mrow><mrow><mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>;</mo><msubsup><mi>μ</mi><m
   row><mi>i</mi><mi>j</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>,</mo><msubsup><mi>σ</mi><m
   row><mi>i</mi><mi>j</mi></mrow><mn>2</mn></msubsup><msubsup><mrow></mro
   w><mrow><mi>i</mi><mi>j</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mo>,</mo> :MATH]

   where the mean
   [MATH:
   <msubsup><mrow><mi>μ</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow><mrow>
   <mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   and standard deviation
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow><mrow>
   <mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   are computed as follows:
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>μ</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>+</mo><msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>μ</mi></mrow></msub><m
   o>,</mo><mspace width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">σ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><mi
   mathvariant="normal">s</mi><mi mathvariant="normal">o</mi><mi
   mathvariant="normal">f</mi><mi mathvariant="normal">t</mi><mi
   mathvariant="normal">p</mi><mi mathvariant="normal">l</mi><mi
   mathvariant="normal">u</mi><mi mathvariant="normal">s</mi><mo>⁡</mo><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>σ</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>+</mo><msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>σ</mi></mrow></msub></
   mrow><mo stretchy="false">)</mo><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><mi
   mathvariant="normal">M</mi><mi mathvariant="normal">L</mi><mi
   mathvariant="normal">P</mi><mo>⁡</mo><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mo
   stretchy="false">∥</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mro
   w><mrow><mi>M</mi></mrow></msubsup><msubsup><mrow><mover
   accent="true"><mrow><mi mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mo stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mo
   stretchy="false">)</mo><mo>,</mo> :MATH]

   where
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>μ</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>σ</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>μ</mi></mrow></msub>
   :MATH]
   , and
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>σ</mi></mrow></msub>
   :MATH]
   are learnable parameters and softplus
   [MATH: <mo stretchy="false">(</mo><mi>x</mi><mo
   stretchy="false">)</mo><mo>=</mo><mi mathvariant="normal">l</mi><mi
   mathvariant="normal">o</mi><mi
   mathvariant="normal">g</mi><mo>⁡</mo><mspace
   width="0.25em"></mspace><mo
   stretchy="false">(</mo><mn>1</mn><mo>+</mo><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">x</mi><mi
   mathvariant="normal">p</mi><mo>⁡</mo><mo
   stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo
   stretchy="false">)</mo> :MATH]
   .

Optimisation

   We optimise the model to maximise the imputation performance on a
   dynamic subset of observed tissues, that is, tissues that are masked
   out at train time, similar to [[165]54]. For each individual
   [MATH: <mi>i</mi> :MATH]
   , we randomly select a subset
   [MATH: <mi>𝒞</mi><mo>⊂</mo><mi>𝒯</mi><mo
   stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo> :MATH]
   of pseudo-observed tissues and treat the remaining tissues
   [MATH: <mi>𝒰</mi><mo>=</mo><mi>𝒯</mi><mo
   stretchy="false">(</mo><mi>i</mi><mo
   stretchy="false">)</mo><mo>−</mo><mi>𝒞</mi> :MATH]
   as unobserved (pseudo-missing). We then compute the individual
   embeddings
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup> :MATH]
   using the gene expression of pseudo-observed tissues
   [MATH: <mi>𝒞</mi> :MATH]
   and minimise the loss:
   [MATH: <mi>ℒ</mi><mo stretchy="false">(</mo><mrow><msub><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mspace width="0.25em"></mspace><mi>𝒞</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>𝒰</mi></mrow><mo
   stretchy="false">)</mo><mo>=</mo><mo>−</mo><mfrac><mrow><mn>1</mn></mro
   w><mrow><mfenced open="|" close="|"
   separators="|"><mrow><mi>𝒰</mi></mrow></mfenced></mrow></mfrac><mrow><m
   under><mo
   stretchy="true">∑</mo><mrow><mi>k</mi><mo>∈</mo><mi>𝒰</mi></mrow></mund
   er><mrow><mo> </mo></mrow></mrow><mrow><mrow><mi
   mathvariant="normal">log</mi></mrow><mo>⁡</mo><mrow><mi>p</mi><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mf
   enced
   separators="|"><mrow><mi>k</mi></mrow></mfenced></mrow></msubsup><mo>∣<
   /mo><msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mi>k</mi></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mo>,</mo> :MATH]

   which corresponds to the average negative log-likelihood across
   pseudo-missing tissues. Importantly, the pseudo-mask mechanism
   generates different sets of pseudo-missing tissues for each individual,
   effectively enlarging the number of training examples and regularising
   our model. We summarise the training algorithm in [166]Supplementary
   Information D.

Inference of gene expression from uncollected tissues

   At test time, we infer the gene expression values
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo stretchy="false">(</mo><mi>v</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   of an uncollected tissue
   [MATH: <mi>v</mi> :MATH]
   from a given donor
   [MATH: <mi>i</mi> :MATH]
   via the mean, i.e.
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo stretchy="false">(</mo><mi>v</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><msubsup><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>v</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   . Alternatively, we can draw random samples from the conditional
   predictive distribution
   [MATH: <mi>p</mi><mo stretchy="false">(</mo><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><mover
   accent="true"><mrow><mi mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mi>k</mi></mrow><mo stretchy="false">)</mo> :MATH]
   .

Prediction of cell-type signatures

   We next consider the problem of imputing cell-type signatures in a
   tissue of interest. We define a cell-type signature as the sum of gene
   expression profiles across cells of a given cell-type in a certain
   tissue. Formally, let
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   be the gene expression signature of cell-type
   [MATH: <mi>q</mi> :MATH]
   in a tissue of interest
   [MATH: <mi>k</mi> :MATH]
   of individual
   [MATH: <mi>i</mi> :MATH]
   . Our goal is to infer
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   from the multi-tissue gene expression measurements
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub> :MATH]
   . To achieve this, we first compute the hyperedge features of a
   hypergraph consisting of 4-node hyperedges and then infer the
   corresponding signatures with a zero-inflated model.

Prediction of hyperedge attributes

   We consider a hypergraph where each hyperedge groups an individual, a
   tissue, a metagene, and a cell-type node. For all metagenes
   [MATH: <mi>j</mi><mo>∈</mo><mn>1</mn><mo>.</mo><mo>.</mo><mi>M</mi>
   :MATH]
   , we compute latent hyperedge attributes
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   for a cell-type
   [MATH: <mi>q</mi> :MATH]
   in a tissue of interest
   [MATH: <mi>k</mi> :MATH]
   of individual
   [MATH: <mi>i</mi> :MATH]
   as follows:
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi><mi>j</m
   i></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><mi
   mathvariant="normal">M</mi><mi mathvariant="normal">L</mi><mi
   mathvariant="normal">P</mi><mo>⁡</mo><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mover
   accent="true"><mrow><mi mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi
   >m</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>k</mi></mrow><mrow><mi
   >t</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>q</mi></mrow><mrow><mi
   >c</mi></mrow></msubsup></mrow><mo stretchy="false">)</mo><mo>,</mo>
   :MATH]

   where
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mrow><mi>q</mi></mrow><mrow><mi
   >c</mi></mrow></msubsup> :MATH]
   are parameters specific to each unique cell-type
   [MATH: <mi>q</mi> :MATH]
   and the MLP is shared for all combinations of metagenes, individuals,
   tissues, and cell-types.

Zero-inflated model

   We employ the following probabilistic model:
   [MATH: <mrow><mi>p</mi><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mi
   mathvariant="bold-italic">x</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><mover
   accent="true"><mi mathvariant="bold-italic">h</mi><mo
   stretchy="true">^</mo></mover></mrow><mi>i</mi><mi>d</mi></msubsup><mo>
   ,</mo><mspace width="0.25em"></mspace><msub><mi
   mathvariant="bold-italic">u</mi><mi>i</mi></msub><mo>,</mo><mspace
   width="0.25em"></mspace><mi>k</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>q</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><munderover><mo>∏</mo><mi>j</mi
   ><mi>G</mi></munderover><mi>p</mi><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mi>x</mi><mrow><mi>i</mi><mi>j</
   mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><mover
   accent="true"><mi mathvariant="bold-italic">h</mi><mo
   stretchy="true">^</mo></mover></mrow><mi>i</mi><mi>d</mi></msubsup><mo>
   ,</mo><mspace width="0.25em"></mspace><msub><mi
   mathvariant="bold-italic">u</mi><mi>i</mi></msub><mo>,</mo><mspace
   width="0.25em"></mspace><mi>j</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>k</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>q</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><mi>p</mi><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mi
   mathvariant="bold-italic">x</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><mover
   accent="true"><mi mathvariant="bold-italic">h</mi><mo
   stretchy="true">^</mo></mover></mrow><mi>i</mi><mi>d</mi></msubsup><mo>
   ,</mo><mspace width="0.25em"></mspace><msub><mi
   mathvariant="bold-italic">u</mi><mi>i</mi></msub><mo>,</mo><mspace
   width="0.25em"></mspace><mi>j</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>k</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>q</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mi
   mathvariant="normal">ZINB</mi><mrow><mo
   stretchy="false">(</mo><mrow><msubsup><mi>x</mi><mrow><mi>i</mi><mi>j</
   mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>;</mo><mspace
   width="0.25em"></mspace><msubsup><mi>μ</mi><mrow><mi>i</mi><mi>j</mi></
   mrow><mrow><mo stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mi>θ</mi><mrow><mi>i</mi><mi>j</mi></
   mrow><mrow><mo stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mi>π</mi><mrow><mi>i</mi><mi>j</mi></
   mrow><mrow><mo stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><msubsup><mrow><mi>n<
   /mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><msubsup><mrow><mi>l</mi></mrow
   ><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mrow><mrow><mi
   mathvariant="normal">softmax</mi></mrow><mo>⁡</mo><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>s</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mfenced
   separators="|"><mrow><mi>k</mi><mo>,</mo><mi>q</mi></mrow></mfenced></m
   row></msubsup><mo>+</mo><msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>s</mi></mrow></msub></
   mrow><mo stretchy="false">)</mo></mrow></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">θ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mf
   enced
   separators="|"><mrow><mi>k</mi><mo>,</mo><mi>q</mi></mrow></mfenced></m
   row></msubsup><mo>=</mo><mrow><mrow><mi
   mathvariant="normal">exp</mi></mrow><mo>⁡</mo><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>θ</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mfenced
   separators="|"><mrow><mi>k</mi><mo>,</mo><mi>q</mi></mrow></mfenced></m
   row></msubsup><mo>+</mo><msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>θ</mi></mrow></msub></
   mrow><mo stretchy="false">)</mo></mrow></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">π</mi></mrow><mrow><mi>i</mi></mrow><mrow><mf
   enced
   separators="|"><mrow><mi>k</mi><mo>,</mo><mi>q</mi></mrow></mfenced></m
   row></msubsup><mo>=</mo><mi>σ</mi><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>π</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mfenced
   separators="|"><mrow><mi>k</mi><mo>,</mo><mi>q</mi></mrow></mfenced></m
   row></msubsup><mo>+</mo><msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>π</mi></mrow></msub></
   mrow><mo stretchy="false">)</mo><mo>,</mo> :MATH]

   where
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>s</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>θ</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>π</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>s</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>θ</mi></mrow></msub>
   :MATH]
   , and
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>π</mi></mrow></msub>
   :MATH]
   are learnable parameters;
   [MATH: <msubsup><mrow><mi>n</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   is the number of cells in the signature, and
   [MATH: <msubsup><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   is their average library size. At train time, we set
   [MATH: <msubsup><mrow><mi>n</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   to match the ground truth number of cells. At test time, the number of
   cells
   [MATH: <msubsup><mrow><mi>n</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   is user-definable. We model the average library size
   [MATH: <msubsup><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   with a log-normal distribution
   [MATH: <mi mathvariant="normal">l</mi><mi
   mathvariant="normal">o</mi><mi
   mathvariant="normal">g</mi><mo>⁡</mo><msubsup><mrow><mi>l</mi></mrow><m
   row><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∼</mo><mi>𝒩</mi><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi>l</mi></mrow><mrow><mi>
   i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>;</mo><msubsup><mrow><mi>ν<
   /mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>,</mo><msubsup><mrow><mi>ω<
   /mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mo
   stretchy="false">)</mo><mo>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi>ν</mi></mrow><mrow><mi>i</mi
   ></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>ν</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>+</mo><msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>ν</mi></mrow></msub><m
   o>,</mo><mspace
   width="0.25em"></mspace><msubsup><mrow><mi>ω</mi></mrow><mrow><mi>i</mi
   ></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">x</mi><mi
   mathvariant="normal">p</mi><mo>⁡</mo><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>ω</mi></mrow></msub><m
   subsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">e</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mo stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>+</mo><msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>ω</mi></mrow></msub></
   mrow><mo stretchy="false">)</mo><mo>,</mo> :MATH]

   where
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>ν</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>ω</mi></mrow></msub><m
   o>,</mo> :MATH]
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>ν</mi></mrow></msub>
   :MATH]
   , and
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">b</mi></mrow><mrow><mi>ω</mi></mrow></msub>
   :MATH]
   are learnable parameters. Optionally, we can use the observed library
   size.

Optimisation

   Single-cell transcriptomic studies typically measure single-cell gene
   expression for a limited number of individuals, tissues, and
   cell-types, so aggregating single-cell profiles per individual, tissue,
   and cell-type often results in small sample sizes. To address this
   challenge, we apply transfer learning by pre-training the hypergraph
   model
   [MATH: <mi>f</mi> :MATH]
   on the multi-tissue imputation task and then fine-tuning the parameters
   of the signature inference module on the cell-type signature profiles.
   Concretely, we minimise the loss:
   [MATH: <mi>ℒ</mi><mo stretchy="false">(</mo><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>,</mo><msub><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mi>k</mi><mo>,</mo><mi>q</mi></mrow><mo
   stretchy="false">)</mo><mo>=</mo><mo>−</mo><mi
   mathvariant="normal">l</mi><mi mathvariant="normal">o</mi><mi
   mathvariant="normal">g</mi><mo>⁡</mo><mi>p</mi><mo
   stretchy="false">(</mo><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><mover
   accent="true"><mrow><mi mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mi>k</mi><mo>,</mo><mi>q</mi></mrow><mo
   stretchy="false">)</mo><mo>,</mo> :MATH]

   which corresponds to the negative log-likelihood of the observed
   cell-type signatures.

Inference of uncollected gene expression

   To infer the signature of a cell-type
   [MATH: <mi>q</mi> :MATH]
   in a certain tissue
   [MATH: <mi>v</mi> :MATH]
   of interest, we first compute the latent individual embeddings
   [MATH: <msubsup><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup> :MATH]
   from the multi-tissue profiles
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mo>˜</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub> :MATH]
   and then compute the mean of the distribution
   [MATH: <mi>p</mi><mo stretchy="false">(</mo><mrow><msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∣</mo><msubsup><mrow><mover
   accent="true"><mrow><mi mathvariant="bold-italic">h</mi></mrow><mi
   mathvariant="bold-italic">ˆ</mi></mover></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>d</mi></mrow></msubsup><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>,</mo><mi>k</mi><mo>,</mo><mi>q</mi></mrow><mo
   stretchy="false">)</mo> :MATH]
   as
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo
   stretchy="false">(</mo><mrow><mn>1</mn><mo>−</mo><msubsup><mrow><mi
   mathvariant="bold-italic">π</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>k</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mo
   stretchy="false">)</mo> :MATH]
   . Alternatively, we can draw random samples from that distribution.

eQTL mapping

   The breadth of tissues in the GTEx (v8) collection enabled us to
   comprehensively evaluate the extent to which eQTL discovery could be
   improved through the HYFA-imputed transcriptome data. We mapped eQTLs
   that act in cis to the target gene (cis-eQTLs), using all SNPs within ±
   1 Mb of the transcription start site of each gene. For the imputed and
   the original (incomplete) datasets, we considered SNPs significantly
   associated with gene expression, at a false discovery rate ≤ 0.10. We
   applied the same GTEx eQTL mapping pipeline, as previously described
   [[167]55], to the imputed and original datasets to quantify the gain in
   eQTL discovery from the HYFA-imputed dataset.

Pathway enrichment analysis

   Similar to [[168]37], we employed Gene Set Enrichment Analysis (GSEA)
   [[169]36] to relate HYFA’s metagene factors to known biological
   pathways. This is advantageous to over-representation analysis, which
   requires selecting an arbitrary cutoff to select enriched genes. GSEA,
   instead, computes a running sum of enrichment scores by descending a
   sorted gene list [[170]36, [171]37].

   We applied GSEA to the gene loadings in HYFA’s encoder. Specifically,
   let
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>j</mi></mrow></msub><m
   o>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>F</mi><mo>×</mo><mi>
   G</mi></mrow></msup> :MATH]
   be the gene loadings for metagene
   [MATH: <mi>j</mi> :MATH]
   , where
   [MATH: <mi>F</mi> :MATH]
   is the number of factors (i.e. number of hyperedge attributes) and
   [MATH: <mi>G</mi> :MATH]
   is the number of genes (Equation 17. For every factor in
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">W</mi></mrow><mrow><mi>j</mi></mrow></msub>
   :MATH]
   , we employed blitzGSEA [[172]56] to calculate the running sum of
   enrichment scores by descending the gene list sorted by the factor’s
   gene loadings. The enrichment score for a query gene set is the maximum
   difference between
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>h</mi><mi>i</mi><mi>t</mi></mrow
   ></msub><mo stretchy="false">(</mo><mi>𝒮</mi><mo>,</mo><mi>i</mi><mo
   stretchy="false">)</mo> :MATH]
   and
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>m</mi><mi>i</mi><mi>s</mi><mi>s<
   /mi></mrow></msub><mo
   stretchy="false">(</mo><mi>𝒮</mi><mo>,</mo><mi>i</mi><mo
   stretchy="false">)</mo> :MATH]
   [[173]37], where
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>h</mi><mi>i</mi><mi>t</mi></mrow
   ></msub><mo stretchy="false">(</mo><mi>𝒮</mi><mo>,</mo><mi>i</mi><mo
   stretchy="false">)</mo> :MATH]
   is the proportion of genes in
   [MATH: <mi>𝒮</mi> :MATH]
   weighted by their gene loadings up to gene index
   [MATH: <mi>i</mi> :MATH]
   in the sorted list [[174]37]. We then calculated pathway enrichment
   p-values through a permutation test (with
   [MATH: <mi mathvariant="normal">n</mi><mo>=</mo><mn>100</mn> :MATH]
   trials) by randomly shuffling the gene list. We employed the
   Benjamini-Hochberg method to correct for multiple testing.

GTEx bulk and single-nucleus RNA-seq data processing

   The GTEx dataset is a public resource that has generated a broad
   collection of gene expression data collected from a diverse set of
   human tissues [[175]2]. We downloaded the data from the GTEx portal
   (Data availability). After the processing step, the GTEx-v8 dataset
   consisted of 15197 samples (49 tissues, 834 donors) and 12557 genes.
   The dataset was randomly split into 500 train, 167 validation, and 167
   test donors. Each donor had an average of 18.22 collected tissues. The
   processing steps are described below.

Normalised bulk transcriptomics (GTEx-v8)

   Following the GTEx eQTL discovery pipeline
   ([176]https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl),
   we processed the data as follows:
    1. Discard underrepresented tissues (n=5), namely bladder, cervix
       (ectocervix, endocervix), fallopian tube, and kidney (medulla).
    2. Select set of overlapping genes across all tissues.
    3. Discard donors with only one collected tissue (n=4).
    4. Select genes based on expression thresholds of ≥ 0.1 transcripts
       per kilobase million (TPM) in ≥ 20% of samples and ≥ 6 reads
       (unnormalised) in ≥ 20% of samples.
    5. Normalise read counts across samples using the trimmed mean of M
       values (TMM) method [[177]57].
    6. Apply inverse normal transformation to the expression values for
       each gene.

Cell-type signatures from a paired snRNA-seq dataset (GTEx-v9)

   We downloaded paired snRNA-seq data for 16 GTEx individuals [[178]13]
   (Data availability) collected in 8 GTEx tissues, namely skeletal
   muscle, breast, esophagus (mucosa, muscularis), heart, lung, prostate,
   and skin. We split these individuals into train, validation, and test
   donors according to the GTEx-v8 split. We processed the data as
   follows:
    1. Select set of overlapping genes between bulk RNA-seq (GTEx-v9) and
       paired snRNA-seq dataset [[179]13].
    2. Select top 3000 variable genes using the Scanpy function
       scanpy.pp.highly_variable_genes with flavour setting seurat_v3
       [[180]58, [181]59].
    3. Discard underrepresented cell-types occurring in less than 10
       tissue-individual combinations.
    4. Aggregate (i.e. sum) read counts by individual, tissue, and (broad)
       cell-type. This resulted in a dataset of 226 unique signatures, of
       which 135 belong to matching GTEx-v8 individuals.

Implementation and reproducibility

   We report the selected hyperparameters in [182]Supplementary
   Information B. HYFA is implemented in Python [[183]60]. Our framework
   and implementation are flexible (i.e. we support k-uniform
   hypergraphs), may be integrated in other bioinformatics pipelines, and
   may be useful for other applications in different domains. We used
   PyTorch [[184]61] to implement the model and scanpy [[185]58] to
   process the gene expression data. We performed hyperparameter
   optimisation with wandb [[186]62]. We employed blitzGSEA [[187]56] for
   pathway enrichment analysis. We also used NumPy [[188]63], scikit-learn
   [[189]64], pandas [[190]65], matplotlib [[191]66], seaborn [[192]67],
   and statannotations [[193]68]. [194]Figure 1 was created with
   [195]BioRender.com.

Extended Data

Extended Data Figure 1.

   Extended Data Figure 1
   [196]Open in a new tab

   Summary of per-gene prediction scores

   (a) Network of tissues depicting the predictability of target tissues
   with HYFA using the average per-gene Pearson ρ correlation
   coefficients. Edges from reference to target tissues indicate an
   average per-gene ρ > 0.4. The dimension of each node is proportional to
   its degree. (b) Distribution of per-gene Pearson correlation
   coefficients in 6 target tissues (source tissue: whole blood). We
   attribute the unimodality of the distributions to the fact that the
   data was inverse Normal transformed ([197]Methods).

Extended Data Figure 2.

   Extended Data Figure 2
   [198]Open in a new tab

   Whole blood to lung predictions for unseen individuals

   (a) Average and standard deviation of per-gene expression in lung
   versus prediction performance (prediction performance (Pearson
   correlation between predicted and ground truth expression; whole blood
   to lung). The per-gene predictions were uncorrelated with the averages
   and variances of the per-gene expression in the target tissue (average:
   ρ=0.07, variance: ρ=0.06). (b) Best and worst predicted lung genes
   (NUDT16: ρ=0.85; GALNT4: ρ=−0.08; n=166).

Extended Data Figure 3.

   Extended Data Figure 3
   [199]Open in a new tab

   Top predicted Alzheimer’s disease-relevant genes in multiple brain
   regions, with whole blood as reference tissue

   (a) Pearson correlation coefficient of top 20 predicted genes from the
   Alzheimer’s disease pathway (KEGG), ranked by average correlation. (b,
   c, d) Average per-gene expression (x-axis) versus prediction
   performance (Pearson correlation between predicted and ground truth
   expression) in (b) cerebellum, (c) cortex, and (d) hippocampus. HYFA
   exhibits strong prediction performance for several Alzheimer’s
   disease-relevant genes including APOE (cortex ρ=0.536, cerebellum:
   ρ=0.502), APP (cortex ρ=0.524), PSEN1 (cerebellum: ρ=0.459), and PSEN2
   (cortex: ρ=0.590, cerebellum: ρ=0.559, hippocampus: ρ=0.403). In
   cerebellum, PSEN1 (ρ=0.459), PSEN2 (ρ=0.559), and APOE (ρ=0.502)
   attained above expected performances (average ρ=0.448). APP (ρ=0.524),
   PSEN2 (ρ=0.590), and APOE (ρ=0.536) surpassed the expected correlation
   in cortex (average ρ=0.443).

Extended Data Figure 4.

   Extended Data Figure 4
   [200]Open in a new tab

   Prediction scores for different accessible tissues as reference

   For each target tissue, we predicted the expression values based on
   accessible tissues (whole blood, skin sun exposed, skin not sun
   exposed, and adipose subcutaneous). We report the Pearson correlation
   coefficient between the predicted values and the actual gene expression
   values. For any given target tissue, we used the same set of
   individuals to evaluate performance, namely individuals in the
   validation and test sets with collected gene expression measurements in
   all the corresponding tissues. Target tissues represented by less than
   25 test individuals were discarded. HYFA attains the best performance
   in 32 out of 38 tissues when all accessible tissues are simultaneously
   used as reference. Boxes show quartiles, centerlines correspond to the
   median, and whiskers depict the distribution range (1.5 times the
   interquartile range). Outliers outside of the whiskers are shown as
   distinct points. The top axis indicates the total number of samples for
   every target tissue.

Extended Data Figure 5.

   Extended Data Figure 5
   [201]Open in a new tab

   Performance comparison across gene expression imputation methods with
   per-gene metrics (n=12,557 genes)

   (a, b) Per-tissue comparison between HYFA and TEEBoT when using (a)
   whole-blood and (b) all accessible tissues (whole blood, skin sun
   exposed, skin not sun exposed, and adipose subcutaneous) as reference.
   We discarded target tissues represented by less than 25 test
   individuals. HYFA achieved superior Pearson correlation in (a) 25 out
   of 48 target tissues when a single tissue was used as reference and (b)
   all target tissues when multiple reference tissues were considered. For
   underrepresented target tissues (less than 25 individuals with source
   and target tissues in the test set), we considered all the validation
   and test individuals (translucent bars). (c, d) Prediction performance
   from (c) whole-blood gene expression and (d) accessible tissues as
   reference. Boxes show quartiles and whiskers depict the distribution
   range (1.5 times the interquartile range). Mean imputation replaces
   missing values with the feature averages. Blood surrogate utilises gene
   expression in whole blood as a proxy for the target tissue. k- Nearest
   Neighbours (kNN) imputes missing features with the average of measured
   values across the k nearest observations (k=20). TEEBoT projects
   reference gene expression into a low- dimensional space with principal
   component analysis (PCA; 30 components), followed by linear regression
   to predict target values. HYFA (all) employs information from all
   collected tissues. Boxes show quartiles, centerlines correspond to the
   median, and whiskers depict the distribution range (1.5 times the
   interquartile range). Outliers outside of the whiskers are shown as
   distinct points. The top axis indicates the total number of samples for
   every target tissue.

Extended Data Figure 6.

   Extended Data Figure 6
   [202]Open in a new tab

   Transcription factor (TF) enrichment analysis of metagene-factors

   For every metagene (n=50) and factor (n=98), we performed Gene Set
   Enrichment Analysis using the corresponding gene loadings of HYFA’s
   encoder ([203]Methods) and TF gene sets from the TRRUST database of
   transcription factors (Enrichr library:
   TRRUST_Transcription_Factors_2019). (a) Top enriched TFs, ranked by the
   total number of metagene-factors in which the TFs were enriched (FDR <
   0.05). (b) Circos plot of the top 9 enriched TFs (outer layer). The
   angular size is proportional to the number of enrichments. The second
   layer (bar plot) depicts the factor IDs where the TF was enriched,
   ranging from 0 (lowest bar) to 98 (higher bar). The third layer shows
   the corresponding metagene IDs (blue dots) of the enriched
   metagene-factors, increasing monotonically within the same factor. The
   edges in the middle connect TFs whenever they are both enriched in the
   same factor (FDR < 0.05). (c, d) Distribution of the GATA1 false
   discovery rates in factor 69 (FDR < 0.05 in 28/50 metagenes) and an
   arbitrary factor (enriched in 0/50 metagenes).

Extended Data Figure 7.

   Extended Data Figure 7
   [204]Open in a new tab

   GO Biological Process enrichment analysis of metagene-factors

   For every metagene (n=50) and factor (n=98), we performed Gene Set
   Enrichment Analysis using the corresponding gene loadings of HYFA’s
   encoder ([205]Methods) and Gene Ontology gene sets (GO Biological
   Process, version of 2021) (Enrichr library:
   GO_Biological_Process_2021). (a) Top enriched signaling GO terms,
   ranked by the total number of metagene- factors in which the terms were
   enriched (FDR < 0.05). (b, c) FDR distribution of the Type-I Interferon
   signaling pathway in factor 18 (FDR < 0.05 in 12/50 metagenes) and an
   arbitrary factor (enriched in 0/50 metagenes).

Extended Data Figure 8.

   Extended Data Figure 8
   [206]Open in a new tab

   GO Biological Process FDRs for signaling pathways

   GO Biological Process enrichment analysis of metagene-factors. For
   every pathway and factor, we selected the metagene with lowest FDR and
   depicted statistically significant values (FDR < 0.05). Point sizes are
   inversely proportional to the FDR values. Type I interferons (IFNs), a
   family of cytokines that activate a variety of signaling cascades, were
   the most enriched. We also detected the simultaneous enrichment of
   interferon IRF1 and STAT1 (a member of the STAT protein family that
   drives the expression of many target genes) in 10 factors (FDR < 0.05;
   [207]Extended Data Figure 6b), consistent with these results.

Supplementary Material

   Supplementary Information
   [208]NIHMS1927789-supplement-Supplementary_Information.pdf^ (1.3MB,
   pdf)

Acknowledgements