Abstract Simultaneous profiling of spatial transcriptomics (ST) and spatial metabolomics (SM) on the same or adjacent tissue sections offers a revolutionary approach to decode tissue microenvironment and identify potential therapeutic targets for cancer immunotherapy. Unlike other spatial omics, cross-modal integration of ST and SM data is challenging due to differences in feature distributions of transcript counts and metabolite intensities, and inherent disparities in spatial morphology and resolution. Furthermore, cross-sample integration is essential for capturing spatial consensus and heterogeneous patterns but is often complicated by batch effects. Here, we introduce SpatialMETA, a conditional variational autoencoder (CVAE)-based framework for cross-modal and cross-sample integration of ST and SM data. SpatialMETA employs tailored decoders and loss functions to enhance modality fusion, batch effect correction and biological conservation, enabling interpretable integration of spatially correlated ST-SM patterns and downstream analysis. SpatialMETA identifies immune spatial clusters with distinct metabolic features in cancer, revealing insights that extend beyond the original study. Compared to existing tools, SpatialMETA demonstrates superior reconstruction capability and fused modality representation, accurately capturing ST and SM feature distributions. In summary, SpatialMETA offers a powerful platform for advancing spatial multi-omics research and refining the understanding of metabolic heterogeneity within the tissue microenvironment. Subject terms: Computational models, Data integration, Machine learning __________________________________________________________________ Simultaneous profiling of spatial transcriptomics (ST) and metabolomics (SM) offers a novel way to decode tissue microenvironment heterogeneity. Here, the authors present SpatialMETA, a conditional variational autoencoder-based framework designed for the integration of ST and SM data. Introduction In recent years, spatial omics technologies have advanced rapidly, particularly spatial transcriptomics (ST), which systematically measures gene expression levels within the spatial context of tissues. ST technologies are categorized into imaging-based ST (iST) and sequencing-based ST (sST)^[46]1–[47]3. iST technologies include MERFISH^[48]4, seqFISH^[49]5, CosMx^[50]6, Xenium^[51]7, and STARMap^[52]8, while sST technologies encompass 10X Visium, Visium HD^[53]9, Slide-seq^[54]10, and Stereo-seq^[55]11. ST facilitates the study of cell-cell interactions^[56]12–[57]14 and spatial pattern identification^[58]15–[59]17, offering profound insights into fields such as cancer research, neuroscience, developmental biology, and plant biology^[60]1–[61]3. Spatial metabolomics (SM) employs mass spectrometry imaging to measure metabolite intensities and their spatial position^[62]18,[63]19. Representative SM technologies include Desorption Electrospray Ionization (DESI)^[64]20 and Matrix-Assisted Laser Desorption/Ionization (MALDI)^[65]21. These techniques are primarily applied to investigate cancer-associated metabolic reprogramming, enabling the identification of crucial biomarkers with potential diagnostic or therapeutic applications^[66]19,[67]22. Combining ST and SM allows for a comprehensive characterization of tissue microenvironment across spatial, transcriptomic, and metabolomic modalities. 
Particularly in cancer research, the tumor microenvironment (TME) presents unique properties including hypoxia, acidity, and nutrient deprivation, which suppress anticancer immune responses and diminish the effectiveness of immunotherapy in solid tumors^[68]23,[69]24. Therefore, integrating transcriptomic and metabolomic modalities is crucial for studying immune cell functional plasticity and metabolic state heterogeneity. Spatially resolved data can unveil intricate heterogeneities of tissue morphology, cell type composition, and cellular interactions, offering critical insights into tumor progression and therapeutic responses^[70]25. Simultaneous ST and SM profiling has been widely applied in studies of prostate cancer (PC)^[71]26, gastric cancer^[72]27, clear cell renal cell carcinoma (ccRCC)^[73]28, glioblastoma (GBM)^[74]29, as well as human and mouse brains^[75]30,[76]31, providing valuable insights into metabolic reprogramming and metabolic heterogeneity in these contexts. Even though measurement of ST and SM on the same tissue section has been demonstrated^[77]31, most current spatial multi-omics practices typically use adjacent or serial sections from the same tissue sample^[78]27–[79]32. The cross-modal (vertical) and cross-sample (horizontal) integration of non-spatial single-cell multi-omics data has been advanced through the application of machine learning and deep learning frameworks. Notable approaches for vertical integration include Seurat^[80]33,[81]34, totalVI^[82]35, and Stabmap^[83]36, while methods such as Seurat CCA/RPCA^[84]34, scVI^[85]37, scANVI^[86]38, SCALEX^[87]39, scPoli^[88]40, and scAtlasVAE^[89]41 are designed for horizontal integration. Recent advances in multi-omics integration have extended into spatial omics, with methods such as spaVAE^[90]42, spaMultiVAE^[91]42, SpatialGLUE^[92]43, and MISO^[93]44. Among them, spaVAE, spaMultiVAE, and SpatialGLUE are primarily designed for integrating ST with spatial proteomics (SP) or spatial epigenomics (SE), where both modalities are derived from the same tissue section. Recently, MISO was developed to integrate ST, SP, SE, and SM data derived from the same tissue sections, utilizing an algorithm specifically designed to incorporate histological images. However, most ST and SM data are obtained from adjacent tissue sections, which can create discrepancies in spatial morphology and resolution. This necessitates additional preprocessing steps to unify spatial coordinates and resolution. Moreover, SM data is represented as continuous mass spectrometry intensity, rather than sequencing counts as in ST, SP, or SE. This difference results in distinct feature distributions and necessitates a specialized model design. In addition, capturing spatial consensus and heterogeneous patterns in ST and SM data across samples is crucial. Hence, it is essential to develop methods for the simultaneous vertical and horizontal integration of ST and SM data that effectively balance modality fusion, batch effect correction, and biological conservation, to accurately characterize the complexities of tissue microenvironments. Here, we present SpatialMETA (Spatial Metabolomics and Transcriptomics Analysis), a method designed for the vertical and horizontal integration of ST and SM data. First, SpatialMETA aligns and reassigns ST and SM data to achieve a unified spatial resolution. It then employs a Conditional Variational Autoencoder (CVAE) framework to learn the joint latent embedding for vertical and horizontal integration.
To correct batch effects in the latent space, conditioning is incorporated into the decoder. To account for modality-specific features, the reconstruction loss of ST data is modeled using a zero-inflated negative binomial (ZINB) distribution, while SM data is modeled with a Gaussian distribution. Additionally, shared decoders are applied simultaneously to both single-modality and joint embeddings, facilitating modality fusion and model interpretability. We also demonstrated the efficacy of SpatialMETA using previously published human ccRCC ST and SM data, revealing a spatial immune cluster enriched with lipid metabolism-associated features. SpatialMETA outperforms existing methods in both vertical and horizontal integration on human ccRCC^[94]28, GBM^[95]29, and mouse brain^[96]31 datasets, assessed by a series of designed metrics^[97]45,[98]46. SpatialMETA provides a comprehensive suite of analytical and visualization capabilities to facilitate in-depth data interpretation. Results SpatialMETA model architecture SpatialMETA comprises four key modules: alignment, reassignment, integration, and analysis (Fig. [99]1). To align the morphology of ST and SM data obtained from adjacent tissue sections, SpatialMETA incorporates rotation, translation, and non-linear distortion (Fig. [100]1a). Publicly available ST datasets employed in this study generally exhibit lower resolution compared to SM data. To address this discrepancy, SpatialMETA employs a K-nearest neighbor (KNN) approach, reassigning SM data to ensure consistency with the spots and resolution of ST data (Fig. [101]1b). Fig. 1. Schematic representation of the integrated analysis of ST and SM with SpatialMETA. [102]Fig. 1 [103]Open in a new tab a Schematic illustration for the alignment process between ST and SM data. SM spots align with ST spots or histology image derived from ST through rotation, translation, and non-linear distortion. b Schematic illustration for the reassignment process, in which SM data is adjusted to a unified resolution with ST data by k-nearest neighbors (KNN) computation. c Schematic illustration for spatially variable features (SVFs) identification before vertical or horizontal integration. d Schematic illustration for vertical and horizontal integration of ST-SM data using a conditional variational autoencoder (CVAE). e Schematic overview for downstream analysis and visualization capabilities of SpatialMETA. f Schematic illustration for different scenarios supported by SpatialMETA. SVGs spatially variable genes, SVMs spatially variable metabolites, TOI trajectories of interest, ROI spatial regions of interest. SpatialMETA calculates the spatially variable features (SVFs) and adopts a CVAE framework to vertically and horizontally integrate the aligned ST and SM data (Fig. [104]1c, d). For vertical integration, separate encoders extract feature representations (ST-only and SM-only embeddings) from the original gene expression matrix (GEM) and normalized metabolite intensity matrix (MIM), respectively. These representations are concatenated and fed into a linear projection layer to generate a joint embedding, which is then utilized to reconstruct GEM and MIM data via ST or SM decoder. Additionally, the same ST or SM decoders are applied to the ST-only or SM-only embedding. This unique model design incorporates corresponding reconstruction losses that constrain the ST- and SM-only embedding to closely resemble the joint embedding within the latent space. 
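As a rough illustration of this shared-decoder design, the sketch below uses PyTorch-style code with hypothetical layer sizes and plain MSE stand-ins for the actual ZINB and Gaussian likelihoods; it is not the SpatialMETA implementation, only a schematic of how the same two decoders can be applied to the joint embedding and to each single-modality embedding so that the extra reconstruction terms pull the single-modality embeddings toward the joint one.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 32-dimensional embeddings, 2,000 genes, 500 metabolites.
dec_st = nn.Linear(32, 2000)   # shared ST decoder
dec_sm = nn.Linear(32, 500)    # shared SM decoder
mse = nn.MSELoss()             # stand-in for the ZINB (ST) / Gaussian (SM) losses

def shared_decoder_losses(z_joint, mu_st, mu_sm, x_st, x_sm):
    # the joint embedding and each single-modality embedding pass through the SAME decoders
    loss_joint = mse(dec_st(z_joint), x_st) + mse(dec_sm(z_joint), x_sm)
    loss_single = mse(dec_st(mu_st), x_st) + mse(dec_sm(mu_sm), x_sm)
    return loss_joint + loss_single
```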
Furthermore, angular similarity between the single-modality embedding and joint embedding quantifies the contribution of each modality, enhancing model interpretability. For simultaneous vertical and horizontal integration, SpatialMETA employs a batch-invariant encoder and a batch-variant decoder. While batch labels are not utilized during encoding, they are incorporated into the joint, ST-only, or SM-only embeddings during decoding, facilitating effective batch effects correction across multiple samples. Additionally, Maximum Mean Discrepancy (MMD) losses are applied to ensure the comparability of joint embeddings across different samples^[105]47. To model the feature distributions, SpatialMETA adopts a ZINB distribution for GEM, which is widely used in single-cell transcriptomics computational methods^[106]37,[107]48. MIM is modeled using a Gaussian negative log-likelihood loss (hereafter referred to as Gaussian loss), which reflects the typical intensity distribution of most spatially variable metabolites (SVMs) (Supplementary Fig. [108]1a). SpatialMETA also offers a suite of analysis and visualization functionalities (Fig. [109]1e), including spatial clustering, identification of marker genes and metabolites, metabolite annotation and enrichment analysis, gene-metabolite correlation analysis, user-defined spatial trajectories of interest (TOI) for gene expression and metabolite intensity, interactive analysis for user-defined spatial regions of interest (ROI), quantification of modality contribution and network analysis for spatial clusters. SpatialMETA supports various scenarios, including vertical integration for ST and SM data from single sample, simultaneous vertical and horizontal integration of ST and SM data, as well as horizontal integration for SM data only (Fig. [110]1f). Alignment and reassignment of ST and SM data To demonstrate the performance of SpatialMETA in aligning and reassigning ST and SM data, we applied the method to publicly available datasets from ccRCC (Y7_T), ccRCC (R29_T), mouse brain (m3_FMP), and GBM (248_T)^[111]28,[112]29,[113]31. Since these ST and SM datasets are obtained from adjacent tissue sections, they share overall morphological similarities but also exhibit variations in tissue morphology and misalignment in orientation. Additionally, differences in spatial resolution arise due to the use of different platforms and technologies. The raw ST and SM data, as initially presented, reflect these discrepancies in morphology and resolution (Supplementary Fig. [114]1b–e). While SM data exhibit morphological similarities to ST data and their corresponding ST histology images, positional discrepancies between SM and ST spots are evident. SpatialMETA’s alignment module addresses these discrepancies by applying a series of transformations to the SM spot coordinates, including rotation, translation, non-linear distortion, and, optionally, diffeomorphic metric mapping. These transformations are optimized using gradient descent, adapted from the STalign method^[115]49. Additionally, SpatialMETA provides an option to align the latent features of the SM data with either latent features derived from ST data or features from ST histology, utilizing rasterization to convert the spots into an image with a specified resolution^[116]49. In most instances, the transformation is optimized by aligning the outline of SM spots with that of the ST histology image, given their high degree of morphological similarity. 
However, it is also feasible to align the outline of SM spots with the ST spots. This process enables the projection of the SM data, which lacks histology images, onto histology images derived from ST data in various datasets, including ccRCC (R29_T), mouse brain (m3_FMP), and GBM (248_T) (Supplementary Fig. [117]1f–m). After the alignment of spatial coordinates for ST and SM data, SpatialMETA performs a reassignment step. In most datasets employed in this study, ST data generally exhibit lower resolution than SM data, except for the mouse brain (m3_FMP) sample, where their resolutions are comparable. By default, SpatialMETA reassigns SM data to ST coordinates to achieve a consistent spatial resolution (Supplementary Fig. [118]1n–q). For instance, in the ccRCC Y7_T sample, the raw ST data contain 2,018 spots, whereas the SM data have 10,145 spots (Supplementary Fig. [119]1b). SpatialMETA utilizes ST coordinates, identifies KNN spots in the SM data, and averages their MIM values to generate an aggregated SM dataset with 2,018 spots (Supplementary Fig. [120]1n). Furthermore, SpatialMETA offers the flexibility to reassign ST data to SM coordinates when ST data have higher spatial resolution, ensuring adaptability to emerging high-resolution ST technologies. Vertical integration of ST and SM data and benchmark analysis To evaluate the vertical and horizontal integration performance of SpatialMETA, we compared it with other state-of-the-art methods for non-spatial single-cell and spatial multi-omics data integration. The methods include deep learning-based approaches such as scVI^[121]37, scANVI^[122]38, totalVI^[123]35, scPoli^[124]40, spaVAE^[125]42, spaMultiVAE^[126]42, SpatialGLUE^[127]43, and MISO^[128]44, as well as traditional machine learning techniques, including Seurat RPCA/CCA/BNN^[129]33,[130]34, Stabmap^[131]36, and principal component analysis (PCA) (Supplementary Fig. [132]2a, Supplementary Data [133]1). Among these methods, only MISO supports the incorporation of histological image data. We evaluated MISO under two configurations: MISO_2m, which integrates the ST and SM modalities; and MISO_3m, which integrates ST, SM, and histological image embeddings. Additionally, we performed PCA on raw ST data, raw SM data, reassigned SM data, and reassigned SM combined with raw ST data, revealing a high degree of similarity in spatial patterns before and after the alignment and reassignment preprocessing (Supplementary Fig. [134]2b–d). SpatialMETA facilitates the vertical integration of ST and SM data and provides interpretable quantification of modality contribution for each spatial spot (Fig. [135]2a). To evaluate its performance for vertical integration, we introduced four key metrics for benchmark analysis: reconstruction accuracy, continuity score, marker score, and biological conservation score (Fig. [136]2b, Supplementary Fig. [137]2e–i). Reconstruction accuracy is assessed using the Pearson Correlation Coefficient (PCC) and Cosine Similarity (CS), with higher values indicating better feature representation (Supplementary Fig. [138]2e). Continuity score, evaluating smoothness and consistency, is measured using CHAOS and the percentage of abnormal spots (PAS), as proposed in SDMBench^[139]46 (Supplementary Fig. [140]2f).
Marker score, assessing the efficiency of modality fusion and the preservation of biologically meaningful spatial patterns, is quantified using Moran’s I and Geary’s C from SDMBench^[141]46, along with specificity, logistic, and Mutual Information (MI) scores proposed in this study (Supplementary Fig. [142]2g, and see “Methods” for details). Biological conservation score was evaluated by measuring the preservation of cell-type level variation, using Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), the Average Silhouette Width (ASW) of cell types, the cell-type separation Local Inverse Simpson’s Index (cLISI), and the Isolated label ASW, as introduced by scIB^[143]45 (Supplementary Fig. [144]2h). These metrics were averaged into an overall score, allowing a comprehensive benchmark against other vertical integration methods across eleven samples from ccRCC, mouse brain, and GBM. Overall, our benchmark results suggested that SpatialMETA outperformed existing methods, demonstrating superior performance in reconstruction and modality fusion (Fig. [145]2b). Additionally, we quantified the computational efficiency in terms of time and memory consumption, with SpatialMETA exhibiting reasonable memory usage and efficient processing time (generally less than 2 gigabytes (GB) of memory and 100 s of runtime) (Supplementary Fig. [146]2j). To further evaluate the scalability of SpatialMETA, we simulated ST and SM datasets with spot numbers ranging from 10,000 to 100,000. The results showed approximately linear increases in both time and memory usage, with memory consumption primarily driven by data scale rather than model complexity (Supplementary Fig. [147]2k). Fig. 2. Benchmarking and application of vertical integration using SpatialMETA. [148]Fig. 2 [149]Open in a new tab a Schematic diagram illustrating the vertical integration process of SpatialMETA. b Summary of benchmarking metrics assessing the vertical integration performance of different tools on ST and SM data. The tools are arranged in descending order based on their overall scores. Source data are provided as a [150]Source Data file. c–e Spatial plots comparing the original (upper left and middle panels) and denoised (lower left and middle panels) gene expression (left panels) and metabolite intensity (middle panels) data processed by SpatialMETA. Bar plots display the Pearson correlation coefficient (PCC) and cosine similarity between the original and denoised data for each method (right panels). Spatial plots visualizing Leiden clustering results derived from different tools for ccRCC (f), GBM (g), and mouse brain (h) datasets. Bar plots depicting spatial continuity scores for ccRCC (i), GBM (j), and mouse brain (k) datasets across tools; bars are color-coded by method. Stacked bar plots showing marker scores for ST (green) and SM (yellow) data across different tools for RCC (l), GBM (m), and mouse brain (n) samples. Imm immune clusters, Endo endothelial clusters, Stro stromal clusters, Mal malignant clusters. Note that the color legend is placed at the bottom of the figure, and the colors of clusters do not directly correspond to the same captured structures across different methods. Compared to totalVI and spaMultiVAE, SpatialMETA generally achieves the highest PCC and CS scores, demonstrating superior reconstruction accuracy (Supplementary Fig. [151]2e).
Furthermore, SpatialMETA effectively denoises both modalities in the reconstructed GEM and MIM, as represented by the selected spatially variable genes (SVGs) or SVMs (Fig. [152]2c–e). Given that the spot size of the ST data employed in this study is larger than the cell size, spot annotation could be performed by cell type deconvolution using scRNA-seq data as a reference. For example, in the case of ccRCC (Y7_T), we conducted deconvolution and automatic cell type annotation via DestVI^[153]50 using the original and reconstructed GEM along with the ccRCC scRNA-seq reference^[154]28 (Supplementary Fig. [155]3a). The reconstructed GEM visually presents a more spatially homogeneous distribution in cell type deconvolution (Supplementary Fig. [156]3b–e). To evaluate SpatialMETA’s ability to generate coherent spatial patterns through vertical integration, we benchmarked it against SeuratV5_BNN^[157]33,[158]34, totalVI^[159]1, Stabmap^[160]36, spaMultiVAE^[161]42, SpatialGLUE^[162]43, and MISO^[163]44. Leiden clustering was then performed using default parameters, and the continuity score was evaluated. For the ccRCC Y7_T sample, the Leiden clustering results obtained from SpatialMETA were manually annotated (referred to as cluster annotation) into four main spatial clusters: immune clusters (Imm), endothelial clusters (Endo), stromal clusters (Stro), and malignant clusters (Mal), which were subsequently used for downstream analyses (Supplementary Fig. [164]4a–e). The results indicated that SpatialMETA, MISO, and SeuratV5_BNN consistently generated visually coherent spatial clusters, as evaluated with CHAOS and PAS scores (Fig. [165]2f–k, and Supplementary Fig. [166]2f). TotalVI exhibited variable performance across datasets. For example, it performed well on the mouse brain and GBM samples but was less effective on the ccRCC datasets (Fig. [167]2f–k). Moreover, the Geary’s C and Moran’s I scores demonstrated that SpatialMETA effectively integrates ST and SM features compared to other methods (Fig. [168]2l–n, and Supplementary Fig. [169]2g). Notably, MISO_3m achieved the best performance in terms of biological conservation. This enhancement compared to MISO_2m may be attributed to the incorporation of histological image information in MISO_3m (Fig. [170]2b, and Supplementary Fig. [171]2h). To evaluate the effectiveness of the SpatialMETA model design, we systematically analyzed the impact of key model components on the performance of vertical integration (Supplementary Fig. [172]4f, g). Specifically, we compared different configurations of SM loss functions, including Mean Squared Error (MSE) and Gaussian, as well as the effect of SM data normalization (with or without normalization). Additionally, we evaluated decoder architectures under two scenarios: one utilizing the designed shared decoders and the other employing independent decoders, in which the concatenated joint embedding is fed directly into separate ST and SM decoders (Supplementary Fig. [173]4f). Their performance was assessed using metrics such as reconstruction accuracy, continuity score, marker score, and biological conservation score. The benchmarking results highlighted the critical importance of the Gaussian loss function and normalization for SM data in achieving effective vertical integration. Furthermore, the shared decoder architecture led to improved performance in modality fusion, outperforming the independent decoder model (Supplementary Fig. [174]4g).
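For readers who want a concrete sense of the loss-function ablation above, the following minimal PyTorch sketch contrasts a Gaussian negative log-likelihood on normalized metabolite intensities with a plain MSE baseline; the function name and arguments are illustrative rather than SpatialMETA’s API.

```python
import torch
import torch.nn.functional as F

def sm_reconstruction_loss(x, pred_mean, pred_var, kind="gaussian"):
    """Reconstruction loss for normalized metabolite intensities (illustrative)."""
    if kind == "gaussian":
        # Gaussian negative log-likelihood with a learned per-feature variance;
        # gaussian_nll_loss takes the variance (not the standard deviation).
        return F.gaussian_nll_loss(pred_mean, x, pred_var, full=True, reduction="mean")
    # plain mean squared error baseline used in the ablation comparison
    return F.mse_loss(pred_mean, x, reduction="mean")
```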
Comprehensive downstream analyses for ST and SM data with SpatialMETA SpatialMETA’s quantification of ST and SM contributions allows the identification of spatial clusters driven by either or both modalities. In the ccRCC Y7 sample, all five immune clusters (Imm_1 to 5) and stromal type 1 cluster (Stro_1) exhibit a balanced contribution from both ST and SM (Fig. [175]3a). Notably, among the immune subclusters, Imm_3 and Imm_5 showed a stronger reliance on ST compared to the other immune subtypes (Fig. [176]3b). In contrast, other spatial clusters, particularly the endothelial (Endo) and malignant (Mal) clusters, are predominantly driven by metabolomics (Fig. [177]3a, b). This pattern could be attributed to the unique metabolic characteristics of cancer cells, which adapt to survive in the hypoxic and nutrient-deprived TME^[178]51,[179]52. Additionally, endothelial cells undergo metabolic reprogramming to promote angiogenesis, supporting tumor growth within the TME by supplying oxygen and nutrients^[180]53–[181]55. In the mouse brain (m3_FMP) sample, most spatial clusters are primarily represented by ST, while SM plays a more prominent role in the cerebral cortex^[182]56 (Fig. [183]3c, d). Most spatial clusters in the GBM (248_T) sample displayed a balanced contribution from both ST and SM, highlighting the importance of both modalities (Supplementary Fig. [184]5a, b). While SpatialGLUE’s model design incorporates the calculation of modality weights, it may not be ideally suited for integrating ST and SM data. This is evidenced in the integration results for the three samples, which are all dominated by SM (Supplementary Fig. [185]5c–e). Originally, SpatialGLUE was designed for integrating ST with SP or ST with SE, where these modalities differ from SM in terms of feature distribution, which may explain the suboptimal performance of SpatialGLUE in ST and SM integration. Fig. 3. Downstream analysis of ST and SM contributions and immune-enriched spatial clusters using SpatialMETA. [186]Fig. 3 [187]Open in a new tab a Violin plots showing the ST and SM contributions across different Leiden clusters identified by SpatialMETA for ccRCC (Y7_T). b Spatial plots visualizing the ST and SM contributions generated by SpatialMETA for ccRCC (Y7_T). Colored dashed outlines correspond to the clusters identified by SpatialMETA in Fig. [188]2f. c Violin plots showing the ST and SM contributions across different Leiden clusters derived by SpatialMETA for mouse brain (m3_FMP). d Spatial plots visualizing the ST and SM contributions generated by SpatialMETA for mouse brain (m3_FMP), with the black dashed line outlining the cerebral cortex region. e Spatial plot depicting immune-cell-enriched subclusters. Spots are colored by Leiden clusters as defined by SpatialMETA, consistent with those in Fig. [189]2f. f Scatter plot illustrating the upregulated differentially expressed genes/metabolites in each immune-cell-enriched spatial subcluster. Genes/metabolites with a log2FoldChange > 0.25 and false discovery rate (FDR) < 0.05 in each cluster are highlighted, with the top 5 genes and top 3 metabolites labeled. Dot size indicates the proportion of spots expressing the respective gene or metabolite in each subcluster. g Spatial plot depicting the expression of marker genes CYP2J2 and ACSM2A, as well as marker metabolites with m/z values of 267.1356 and 872.7177 in the Imm_4 subcluster (Wilcoxon test with adjusted p values). Spots are colored based on scaled gene expression or scaled metabolite intensity. 
h Dot plot displaying gene ontology (GO) term enrichment specific to the Imm_4 subcluster (hypergeometric test with adjusted p values). i Bar plot displaying enrichment of metabolite groups specific to the Imm_4 subcluster, assessed by hypergeometric test without multiple comparison adjustment, as detailed in the “Methods” section. Source data for (h, i) are provided as a [190]Source Data file. The transcriptional and metabolic characteristics identified by SpatialMETA can provide further insights into metabolic reprogramming with therapeutic potential. To demonstrate this, we focused on immune clusters for further analysis using SpatialMETA (Fig. [191]3e, f). The original ccRCC study divided tumor samples into four groups, with the Y7_T sample defined as the IM2 (immune subtype 2) group, showing enriched fatty acid and amino acid metabolism^[192]28. Here, we further analyzed sub-spatial clusters. Notably, the Imm_4 sub-cluster prominently expresses lipid synthesis-related genes (CYP2J2 and ACSM2A) and metabolites (Phosphatidylglycerol and PA (24:0/24:1(15Z))) (Fig. [193]3g). Pathway enrichment analysis in SpatialMETA revealed upregulated lipid metabolism at both transcriptional and metabolic levels in Imm_4 (Fig. [194]3h, i). Notably, CYP2J2 has been reported as a diagnostic and prognostic biomarker associated with immune cell infiltration in RCC^[195]57,[196]58. Moreover, previous studies have highlighted lipid metabolic reprogramming as a key metabolic marker of RCC^[197]59–[198]61. SpatialMETA also integrates several user-friendly analysis modules, including automatic annotation of metabolites based on the m/z values (Supplementary Fig. [199]5f), characterization of marker genes and metabolites for each spatial cluster (Supplementary Fig. [200]5g), and correlation analysis for user-defined genes and metabolites (Supplementary Fig. [201]5h, i). For metabolites of interest, users are advised to perform additional experimental validation, such as liquid chromatography-mass spectrometry (LC-MS), to confirm their identities. In addition, SpatialMETA provides a user-interactive interface, as exemplified by its application to the Y7_T sample (Supplementary Fig. [202]6a). Users can draw a TOI from the malignant clusters to the immune clusters (Supplementary Fig. [203]6b), enabling the calculation and visualization of gene expression and metabolite intensity gradients along the trajectory (Supplementary Fig. [204]6c, d). Additionally, users can draw ROIs, such as tumor core and immune-infiltrated regions (Supplementary Fig. [205]6e), to calculate marker genes and metabolites within the ROIs (Supplementary Fig. [206]6f). Simultaneous vertical and horizontal integration of ST and SM data and benchmark analysis During vertical integration, SpatialMETA also supports horizontal integration to perform batch effect correction across samples (Fig. [207]4a). We utilized five publicly available ccRCC samples classified as the IM2 group in the original study to demonstrate this feature. The process begins by aligning and reassigning the ST and SM data from adjacent tissue sections derived from the same sample. Subsequently, batch-biased SVGs and SVMs were excluded from the list of SVFs, and the remaining SVFs were used as the input for the SpatialMETA model. This preprocessing step, compared to relying solely on highly variable genes and metabolites, alleviates batch effects (Supplementary Fig. [208]7a, b).
Following this, the CVAE model within SpatialMETA further corrected the batch effect across samples, enabling the identification of unified spatial clusters characterized by unique transcriptional and metabolic states (Fig. [209]4b, and Supplementary Fig. [210]7c, d). Fig. 4. Application of SpatialMETA to multiple ccRCC samples for vertical and horizontal integration of ST and SM data. [211]Fig. 4 [212]Open in a new tab a Schematic representation for the vertical and horizontal integration of ST and SM data from multiple samples using SpatialMETA. b UMAP visualization of integrated ST-SM data from five ccRCC samples using SpatialMETA for dimensionality reduction, colored by samples (left panel) and by annotated Leiden clusters identified by SpatialMETA (right panel). c Network plot illustrating spatial distances between different spatial clusters. Node size represents the number of spots scaled using min-max normalization, and node color indicates Leiden clusters identified by SpatialMETA. Edge weights indicate reciprocal spatial distances between nodes, calculated as average distances among all spots within each spatial cluster and scaled using min-max normalization. Only the top 4 weighted edges for each node are retained. d Summary of benchmarking metrics evaluating the performance of vertical and horizontal integration of ST and SM data by different tools. The tools are arranged in descending order based on their overall scores. Source data of the benchmark are provided as a [213]Source Data file. e Spatial plots for individual ccRCC samples, with spots colored by annotated Leiden clusters derived from SpatialMETA. The color legend is consistent with (b) (right). Imm immune clusters, Endo endothelial clusters, Stro stromal clusters, Mal malignant clusters. Spatial network analysis for clusters in integrated ccRCC IM2 samples showed that endothelial clusters consistently exhibited closer spatial proximity to malignant clusters, indicating frequent co-localization (Fig. [214]4c, and Supplementary Fig. [215]7e)^[216]62–[217]64. This finding aligns with the original study, which reported enriched endothelial signatures in IM2 tumors^[218]28. Notably, heterogeneity was also observed across samples. In samples R114_T, S15_T, and Y27_T, the Imm_2 and malignant clusters co-localized, whereas in sample X49_T, no immune clusters co-localized with malignant clusters (Supplementary Fig. [219]7e). We next conducted a benchmark analysis to evaluate the vertical and horizontal integration performance of SpatialMETA compared to existing methods, including Seurat RPCA/CCA^[220]33,[221]34, scVI^[222]37, scANVI^[223]38, totalVI^[224]35, scPoli^[225]40, spaVAE^[226]42, and PCA (Fig. [227]4d, and Supplementary Fig. [228]7f–k). The performance for vertical integration was assessed using metrics such as reconstruction accuracy, continuity score, and marker score, as described previously. To assess the performance of horizontal integration, we employed batch ASW, Graph Integration LISI (iLISI), and Principal component regression (PCR) batch for batch correction, and ARI, NMI, cell type ASW, Isolated label ASW, and cLISI for biological conservation, as introduced by scIB^[229]45. The overall integration score, which reflects both vertical and horizontal integration, was calculated as the mean of the cross-modality and cross-sample scores. Achieving a proper balance between modality fusion, batch correction, and biological conservation is crucial for effective vertical and horizontal integration of SM and ST data.
For all five ccRCC samples, we performed both vertical and horizontal integration using the aforementioned methods. SpatialMETA achieved the highest overall scores, demonstrating superior performance in vertical integration, while maintaining a reasonable balance between batch correction and biological conservation (Fig. [230]4d). We visualized the unified spatial clusters in each sample, noting that other methods often exhibit biases toward vertical or horizontal integration (Fig. [231]4e, and Supplementary Fig. [232]8). For instance, while scPoli excelled at vertical integration, achieving a high cross-modal overall score and providing smoother results, it failed to effectively remove batch effects, as evidenced by its lower cross-sample overall score and visual representation (Fig. [233]4d, Supplementary Fig. [234]8). In contrast, Seurat CCA/RPCA and spaVAE performed well in horizontal integration, but their vertical integration results were suboptimal (Fig. [235]4d, Supplementary Fig. [236]8). SpatialMETA outperformed other methods in both vertical and horizontal integration of the mouse brain samples, achieving the highest overall score and further demonstrating its robustness (Supplementary Fig. [237]9, [238]10). Additionally, when only SM data are available, SpatialMETA supports horizontal integration and analysis of the SM modality independently. Using the SM data of ccRCC samples as examples, we further benchmarked SpatialMETA against other horizontal integration methods, including Seurat CCA/RPCA, scVI, scANVI, scPoli, spaVAE, and PCA (Supplementary Fig. [239]11a). Overall, scPoli, SpatialMETA, and Seurat CCA/RPCA demonstrated superior performance compared to the other methods across continuity score, marker score, and batch correction metrics (Supplementary Fig. [240]11). These results indicate that some horizontal integration tools not specifically designed for SM data, such as scPoli and Seurat CCA/RPCA, can still be effectively applied to SM-only integration. Discussion In this study, we introduced SpatialMETA, a robust deep learning-based framework for the vertical and horizontal integration of ST and SM data. With its specialized model design for decoders and loss functions, SpatialMETA achieves a balanced performance across modality fusion, batch effect correction and biological conservation. This approach offers the potential to study the microenvironmental spatial organization and metabolic heterogeneity of tissues. When applied to the vertical integration of ccRCC ST and SM data, SpatialMETA characterizes a sub-spatial cluster of tumor-infiltrating immune cells with upregulated lipid metabolic pathways. Additionally, the horizontal integration of ST and SM data across samples enables the identification of unified spatial clusters and their consistent interactions. Benchmark analyses demonstrate that, compared to previously published non-spatial single-cell or spatial multi-omics tools, SpatialMETA achieved superior overall performance in vertical and horizontal integration of ST and SM data. Current spatial multi-omics strategies can be generally categorized into two main approaches. The first approach involves multimodal profiling on the same tissue section, generating spatial multi-omics data that retain identical spatial morphology.
Techniques following this approach either utilize the same spatial spots, such as Stereo-CITE-seq^[241]65, spatial ATAC-RNA-seq^[242]66, and CUT&Tag-RNA-seq^[243]66, or employ different spatial spots with varying resolutions, as demonstrated in a recent method by Vicari et al.^[244]27. However, these methods face experimental challenges, including conflicting tissue embedding requirements for different modalities, and technical limitations that may compromise the quality of one modality during the acquisition of another^[245]32. The second approach utilizes adjacent tissue sections for multimodal profiling, enabling the independent acquisition of each modality, as observed in the majority of ST and SM joint profiling studies^[246]27–[247]30. While this method bypasses the experimental constraints of the first approach, it introduces computational challenges due to morphological and resolution discrepancies between adjacent multi-modal sections, which necessitate alignment and reassignment during preprocessing. Several single-modality spatial alignment algorithms for ST have been rapidly developed, such as STalign^[248]49, PASTE^[249]67, GPSA^[250]68, and Spateo^[251]69, which consider both spot coordinates and transcriptomic similarity to align a series of sections from the same sample. Nevertheless, multi-modal integration methods like spaVAE, spaMultiVAE, and SpatialGLUE do not incorporate alignment, as they are designed for spatial multi-modal integration within the same tissue section and spot coordinates. Currently, most ST and SM techniques rely on adjacent tissue sections, resulting in different spatial morphology and resolution, thereby necessitating alignment and reassignment. In SpatialMETA, we align the SM data to the ST data using rotation, translation, and non-linear distortion, and perform reassignment using KNN. This process alters only the spatial coordinates and resolution of the SM data with minor influence on biological conclusions, such as the identification of marker metabolites within specific spatial clusters. Recently, MISO has incorporated the iStar^[252]70 method that utilizes histological images to enhance resolution alignment when ST and SM data are generated from the same tissue section. Such approaches may become increasingly important as emerging technologies enable multi-modal data acquisition from the same tissue section, preserving spatial morphology during resolution unification. Notably, the alignment and reassignment are currently applied only between two adjacent sections within a single sample. However, for future applications in 3D spatial omics, alignment across multiple sections may be necessary to capture the full spatial features. Looking ahead, we propose that, beyond morphological similarity, shared spatial patterns across modalities could further enhance alignment accuracy. Furthermore, recent advancements in high-resolution spatial transcriptomics and proteomics technologies, such as CosMx^[253]6, Xenium^[254]7, Visium HD^[255]9, Stereo-seq^[256]11, Slide-seq^[257]10, and CODEX^[258]71, when combined with SM modalities, present new opportunities and challenges for spatial multi-omics integration. High-resolution spatial omics data, which can profile millions of individual spots or cells within each tissue section, poses significant challenges for the computational scalability of analysis. SpatialMETA demonstrates efficient memory usage, exhibiting approximately linear growth with increasing spot numbers.
This scalability highlights SpatialMETA’s potential applicability for high-resolution multi-omics data integration. Additionally, high-resolution ST datasets often exhibit low sequencing depth per spot, which underscores the importance of incorporating histological image data. In this context, methods like MISO, which integrate image-derived features, provide valuable enhancements for modality fusion^[259]44. High-resolution spatial omics facilitates spatial profiling at the single-cell level. Therefore, incorporating single-cell segmentation algorithms such as Cellpose3^[260]72 and Proseg^[261]73 would enable the reassignment of ST and SM data based on individual cells, which may further enhance multi-modality integration. Multi-omics captures cellular identity from different molecular layers, and the integration of these datasets provides a more comprehensive understanding of cell types and states. An optimal vertical integration approach entails deriving a joint embedding that simultaneously extracts features from multi-modal data. Similar to existing methods for vertical integration of non-spatial single-cell multi-omics data, such as MultiVI^[262]74, totalVI^[263]35, Cobolt^[264]75, GLUE^[265]76 and scMVP^[266]77, SpatialMETA employs an encoder-decoder framework to obtain a joint embedding and reconstruct input matrices. SpatialMETA further utilizes shared decoders for both the joint embedding and the ST- or SM-only embeddings, ensuring consistency between the joint embedding and individual modality embeddings. This approach facilitates effective modality fusion and enhances the accuracy of ST and SM data reconstruction. Additionally, the model design enables the calculation of the angular similarity between embeddings, improving the interpretability of modality contributions. Additional ablation studies on model components further emphasize the importance of shared decoders, the selection of an appropriate SM loss function, and SM-specific normalization to achieve optimal integration performance. For vertical integration of spatial multi-omics, Tian et al. developed spaVAE, which uses a joint embedding for vertical integration, while horizontal integration is achieved by incorporating the batch information using another VAE framework, spaMultiVAE^[267]42. In contrast, SpatialMETA simultaneously integrates both batch information and the joint embedding, enabling vertical and horizontal integration within a single VAE framework. SpatialGLUE incorporates spatial position information through a graph neural network (GNN) framework^[268]43, while MISO adopts Vision Transformers for histological image representation learning^[269]44. In summary, future computational methods for spatial multi-omics integration should consider the incorporation of spatial or image features to enhance the vertical integration performance. In both single-cell and spatial omics horizontal integration, batch variations arise from both technological and biological factors^[270]32. It is crucial to avoid overcorrecting true biological variation while effectively removing technological variation during horizontal integration across different samples. Similar to SCALEX^[271]39 and scAtlasVAE^[272]78, we introduce a batch-variant decoder to ensure biological conservation, where the latent embeddings are derived solely from the original matrix and remain independent of batch information.
To facilitate batch correction, and inspired by scArches^[273]47, we incorporated the MMD loss function to ensure that the latent embeddings from different batches are similar. Additionally, achieving an optimal balance between batch correction and biological conservation is critical when evaluating horizontal integration. Therefore, we adopted previously established metrics to comprehensively assess the effectiveness of SpatialMETA’s horizontal integration^[274]45,[275]46. Notably, for the SM-only horizontal integration, some existing methods not specifically designed for SM data may still be applicable, offering additional options for integrating SM data across batches. SpatialMETA demonstrates robust performance in vertical and horizontal integration of ST and SM data; however, several limitations remain. Firstly, SpatialMETA is designed for low-resolution ST and SM data. With the rapid development of single-cell resolution spatial omics technologies, alignment and reassignment could be achieved at cellular or subcellular levels, necessitating computational advancements for precise cell segmentation. Secondly, SpatialMETA primarily focuses on the integration of ST and SM data. As spatial multi-omics technologies continue to evolve, future iterations of SpatialMETA will need to accommodate vertical integration involving more than two modalities. Thirdly, the horizontal integration capabilities of SpatialMETA have been demonstrated on a limited number of samples from a single study. With the exponential growth of spatial omics datasets, we foresee that the horizontal integration of SpatialMETA will be instrumental in constructing comprehensive spatial multi-omics atlases. Lastly, collaborative efforts between experimental and computational biologists are crucial for validating the practical applications and biological significance of spatial multi-omics data integration.

Methods

Datasets and preprocessing

The ST and SM datasets used in this study were preprocessed into GEM and MIM formats, respectively. These datasets were derived from human ccRCC, GBM, and mouse brain samples, obtained from previously published studies. Detailed information on the datasets and their respective sources is provided in Supplementary Data [276]2. The ST datasets were generated using the 10x Genomics Visium platform, while the SM datasets were produced using both the MALDI and DESI platforms. The raw GEM and MIM data, as originally provided in the respective publications, were processed according to the SpatialMETA tutorial for subsequent integration and analysis. Additionally, all Leiden clustering in this study was performed using the default parameters of the ‘scanpy.tl.leiden’ function.

Overview of SpatialMETA

SpatialMETA consists of four key components: alignment between ST and SM with different morphologies, reassignment between ST and SM with different resolutions, vertical and horizontal integration of ST and SM from different samples, and extensive analysis and visualization capabilities.

Alignment between ST and SM

We implement an alignment module ($\phi$) to perform biologically meaningful alignment between ST and SM, which requires consideration of the registration of the two modalities: spot coordinates from ST ($\mathrm{Coord}^{ST} \in \mathbb{R}^2$) and SM ($\mathrm{Coord}^{SM} \in \mathbb{R}^2$), histological images from ST ($\mathrm{Image}^{ST} \in \mathbb{R}^3$) and SM ($\mathrm{Image}^{SM} \in \mathbb{R}^3$), and transcriptomic features from ST together with metabolomic features from SM.
An alignment between ST and SM requires a transformation $\phi: \mathbb{R}^2 \rightarrow \mathbb{R}^2$, which can be represented by an affine transformation matrix $A$ that includes a rotation matrix $R \in \mathbb{R}^{2 \times 2}$, a translation matrix $t \in \mathbb{R}^{2 \times 1}$, a scaling matrix $S \in \mathbb{R}^{2 \times 2}$ for non-linear distortion, and a diffeomorphism term $\varphi$, to transform $\mathrm{Coord}^{SM}$ to $\mathrm{Coord}^{SM'}$ within the same coordinate system as $\mathrm{Coord}^{ST}$:

$$\phi(\mathrm{Coord}^{SM}) = R \cdot S \cdot \varphi(\mathrm{Coord}^{SM}) + t \quad (1)$$

where $S$ is a diagonal matrix constructed from two scaling factors $s_1$ and $s_2$:

$$S = \begin{pmatrix} s_1 & 0 \\ 0 & s_2 \end{pmatrix} \quad (2)$$

As ST and SM are often measured on adjacent or serial sections from the same tissue sample, the outlines of the two slides are similar. We transform $\mathrm{Image}^{ST}$ into a 2-dimensional point set in the same coordinate system as $\mathrm{Coord}^{ST}$, as the registration between $\mathrm{Image}^{ST}$ and $\mathrm{Coord}^{ST}$ is provided by common ST sequencing platforms. As $\mathrm{Image}^{SM}$ is often unavailable, we use the outline of the SM spot coordinates $\mathrm{Coord}^{SM}$, obtained with the concave hull algorithm, to represent the shape of the SM tissue section. To model the transcriptomic and metabolomic features, we use two independent variational autoencoders to learn latent representations for ST ($\mathrm{Latent}^{ST} \in \mathbb{R}^{L}$) and SM ($\mathrm{Latent}^{SM} \in \mathbb{R}^{L}$) at the spot level, which are then rasterized into the image space as described in STalign^[277]49. Optionally, a linear projection layer $F^{SM \rightarrow Image}: \mathbb{R}^{L} \rightarrow \mathbb{R}^{3}$ can be trained to learn the relationship between histological and metabolomic features, and a linear projection layer $F^{SM \rightarrow ST}: \mathbb{R}^{L} \rightarrow \mathbb{R}^{L}$ can be trained to learn the relationship between transcriptomic and metabolomic features. Our alignment module computes $A$, $\varphi$, and $S$ by stochastic gradient descent of Large Deformation Diffeomorphic Metric Mapping adapted from STalign^[278]49, with modifications to the default weight parameters and optimization objective functions:

$$L_{alignment} = \lambda_1 \times L_{point} + \lambda_2 \times L_{histo} + \lambda_3 \times L_{features} \quad (3)$$

where $\lambda_1$ (default = 1), $\lambda_2$ (default = 0), and $\lambda_3$ (default = 0) are weight parameters, and

$$L_{point} = \mathrm{PointDistance}\big(\mathrm{AlphaShape}(\mathrm{Coord}^{ST}),\ \mathrm{AlphaShape}(\mathrm{Image}^{ST})\big) \quad (4)$$

$$L_{histo} = \mathrm{MSE}\big(\mathrm{Image}^{ST},\ F^{SM \rightarrow Image}(\mathrm{Latent}^{SM})\big) \quad (5)$$

$$L_{features} = \mathrm{MSE}\big(\mathrm{Latent}^{ST},\ F^{SM \rightarrow ST}(\mathrm{Latent}^{SM})\big) \quad (6)$$

are the objective functions for the different measurements. $\mathrm{MSE}$ calculates the mean squared error between two sets of latent representations. $\mathrm{PointDistance}(P, Q)$ calculates the pairwise Euclidean distance between two sets of points $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_m\}$. First, for each point $p_i \in P$, we find the closest point $q_j \in Q$, and vice versa. Let $NN(i) = j$ denote the index of the closest point in $Q$ to $p_i$, and $NN(j) = i$ denote the index of the closest point in $P$ to $q_j$. Then,

$$\mathrm{PointDistance}(P, Q) = \frac{1}{N}\left(\sum_{i=1}^{N} \mathrm{dist}\big(p_i, q_{NN(i)}\big)\, w_I + \sum_{j=1}^{M} \mathrm{dist}\big(q_j, p_{NN(j)}\big)\, w_J\right) \quad (7)$$

where $w_I$ and $w_J$ are the weights for the two sets of points, and $\mathrm{dist}$ calculates the Euclidean distance between two points.
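A minimal sketch of this symmetric point-set distance, assuming SciPy’s cKDTree for the nearest-neighbour queries (function and variable names are illustrative, not SpatialMETA’s internals), is:

```python
import numpy as np
from scipy.spatial import cKDTree

def point_distance(P, Q, w_i=1.0, w_j=1.0):
    """Symmetric nearest-neighbour distance between two 2D point sets (outlines)."""
    d_pq, _ = cKDTree(Q).query(P)   # distance from each p_i to its nearest q
    d_qp, _ = cKDTree(P).query(Q)   # distance from each q_j to its nearest p
    # weighted sum of both directions, normalized by the number of points in P
    return (d_pq.sum() * w_i + d_qp.sum() * w_j) / len(P)
```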
$\mathrm{AlphaShape}(P)$ finds the edge of a set of points $P = \{p_1, p_2, \ldots, p_n\}$, where each point $p_i = (x_i, y_i)$ lies in a 2D plane; the goal is to find the concave hull of the points using Delaunay triangulation and edge detection based on the circumradius of the triangles. The Delaunay triangulation divides $P$ into a set of triangles $T = \{t_1, t_2, \ldots, t_k\}$, and each triangle $t_k$ has three vertices $t_k = (v_a, v_b, v_c)$. The circumradius is $R_k = \frac{abc}{4A}$, where $a, b, c$ are the side lengths and $A$ is the area of the triangle. Triangle edges are considered boundary edges if $R_k > \alpha$, where $\alpha$ (default = 1) is a user-defined parameter. Vertices from boundary edges are used as outline spots. In addition, if there is a significant difference between the SM and ST data positions, the SM image needs to be manually flipped and rotated by a large angle to ensure more effective alignment before using the automatic SpatialMETA Alignment Module.

Reassignment between ST and SM

After aligning the ST and SM data, the spatial positions of the ST spots are extracted and used as the new reassigned SM coordinates. Then, we perform the KNN calculation on the raw SM data to establish a correspondence between the new SM spots and the raw SM data. The process is detailed as follows:

1. KNN Calculation: For each new SM spatial position (aligned with the ST spots), the $K$ nearest raw SM spots are identified based on their Euclidean distance, with a default setting of $K = 5$. Users can adjust this value as needed for their specific analysis. An increased $K$ value results in each new SM spot being derived from a larger number of neighboring raw SM spots, which in turn produces a smoother distribution of the MIM in the newly generated SM data.

2. Distance Threshold: To ensure that only nearby SM spots are included in the KNN calculation, a distance threshold is applied, filtering out those too far from the reassigned SM spot. This threshold is determined as the product of $d_{min}$ and $f_{dist}$, where $d_{min}$ represents the minimum inter-spot distance in the ST data, computed via the “calculate_min_dist” function (recommended) or manually defined by the user. $f_{dist}$ is a scaling factor with a default value of 1.5, which can also be user-defined.

3. Average Calculation: After identifying the $K$ nearest raw SM spots, the MIM for the new SM spot is computed by averaging the values of the $K$ nearest raw SM spots (a minimal code sketch of this procedure is shown below, just before the “Encoding” subsection).

Vertical and horizontal integration among ST and SM

The vertical and horizontal integration of ST and SM in SpatialMETA is based on a CVAE. We denote the gene expression counts of each spot as $X^{st} = \{X_1^{st}, \ldots, X_n^{st}\} \in \mathbb{R}^{G}$, where $G$ is the number of selected SVGs and $n$ is the number of spots. Similarly, the metabolite intensity values of each spot are denoted as $X^{sm} = \{X_1^{sm}, \ldots, X_n^{sm}\} \in \mathbb{R}^{M}$, with $M$ indicating the number of selected SVMs and $n$ the number of spots.
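Returning briefly to the reassignment step described above, a simplified illustration (assuming scikit-learn’s NearestNeighbors; the function name reassign_sm_to_st is hypothetical and does not correspond to SpatialMETA’s API) could look like:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reassign_sm_to_st(st_coords, sm_coords, sm_intensity, k=5, dist_factor=1.5):
    """Average the k nearest raw SM spots onto each ST spot position (illustrative)."""
    # approximate minimum inter-spot distance of the ST grid
    # (the second neighbour excludes the spot itself)
    d_min = NearestNeighbors(n_neighbors=2).fit(st_coords).kneighbors(st_coords)[0][:, 1].min()
    threshold = d_min * dist_factor

    # k nearest raw SM spots for every ST spot
    dist, idx = NearestNeighbors(n_neighbors=k).fit(sm_coords).kneighbors(st_coords)

    reassigned = np.zeros((len(st_coords), sm_intensity.shape[1]))
    for i in range(len(st_coords)):
        keep = idx[i][dist[i] <= threshold]          # distance-threshold filtering
        if keep.size:                                # average intensities of nearby SM spots
            reassigned[i] = sm_intensity[keep].mean(axis=0)
    return reassigned
```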
Encoding

SpatialMETA utilizes two batch-invariant encoders, $F_{encoder}^{st}$ for ST data and $F_{encoder}^{sm}$ for SM data, to obtain their respective embeddings:

$$q^{st} = F_{encoder}^{st}(X^{st}) \quad (8)$$

$$q^{sm} = F_{encoder}^{sm}(X^{sm}) \quad (9)$$

Vertical integration

In the vertical integration process, the ST and SM representations $q^{st}$ and $q^{sm}$ are concatenated to form the joint representation $q^{joint}$. The joint representation parameterizes a multivariate normal distribution $N$, whose mean ($\mu^{joint}$) and variance ($(\sigma^{joint})^2$) are computed as follows:

$$\mu^{joint} = F_{mean}(q^{joint}) \quad (10)$$

$$\sigma^{joint} = F_{exp}(q^{joint}) + \epsilon \quad (11)$$

$$z^{joint} \sim N\big(\mu^{joint}, (\sigma^{joint})^{2} I\big) \quad (12)$$

where $\epsilon$ is typically a small number (default 1e-4) added to prevent $\sigma$ from being zero.

Latent embeddings for each modality

Additionally, the means of the individual ST and SM embeddings are computed as:

$$\mu^{st} = F_{mean}(q^{st}) \quad (13)$$

$$\mu^{sm} = F_{mean}(q^{sm}) \quad (14)$$

Decoding for joint embedding

To decode the joint embedding $z^{joint}$, SpatialMETA employs two batch-variant neural networks: $F_{decoder}^{st}$ serves as the ST decoder, and $F_{decoder}^{sm}$ serves as the SM decoder. These decoders reconstruct the respective data for ST and SM as follows:

$$\big(r_{mean}^{st},\ r_{var}^{st},\ r_{gate}^{st}\big) = F_{decoder}^{st}\big(z^{joint}, B\big) \quad (15)$$

$$\big(r_{mean}^{sm},\ r_{var}^{sm}\big) = F_{decoder}^{sm}\big(z^{joint}, B\big) \quad (16)$$

where $B$ represents the batch information, which could be samples, studies, or sequencing platforms. When SpatialMETA is applied to a single sample for vertical integration only, $B$ is set to None. When SpatialMETA is applied to multiple samples for both vertical and horizontal integration, $B$ is included to account for batch effects.

Batch embeddings

The batch embeddings are computed as:

$$F_{emb}(B) = \big(f_{emb}^{1}, \ldots, f_{emb}^{H}\big) \quad (17)$$

where the batch index for each spot is denoted as $B = \{B_1, \ldots, B_n\} \in \mathbb{R}^{H}$, with $H$ representing the number of batch levels and $n$ representing the number of spots. Each batch index $B_n$ is defined as $B_n = \{b_{n,1}, \ldots, b_{n,H}\}$. These batch embeddings are trainable parameters optimized through the reconstruction loss to capture batch-specific gene expression and metabolite intensity patterns, thereby enabling batch correction. In this study, only a single batch level, “sample”, was used due to the limited number of samples. However, for datasets integrating samples from different projects, multiple batch levels (e.g., [“sample”, “project”]) can be incorporated to achieve more comprehensive batch correction.
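The encoding and batch-conditioned decoding steps above can be summarized in a compact PyTorch-style sketch. Layer widths, the single hidden layer, and the plain linear output heads are simplifications (the actual model additionally parameterizes a ZINB distribution for ST, a Gaussian for SM, and applies the same decoders to the single-modality embeddings), so this is an illustration of the batch-invariant encoder / batch-variant decoder idea rather than SpatialMETA’s implementation.

```python
import torch
import torch.nn as nn

class CVAESketch(nn.Module):
    """Batch-invariant encoders, joint latent, batch-variant decoders (illustrative)."""
    def __init__(self, n_genes, n_metabolites, n_batches, latent_dim=32, batch_dim=8):
        super().__init__()
        self.enc_st = nn.Sequential(nn.Linear(n_genes, 128), nn.ReLU())        # F_encoder^st
        self.enc_sm = nn.Sequential(nn.Linear(n_metabolites, 128), nn.ReLU())  # F_encoder^sm
        self.to_mean = nn.Linear(256, latent_dim)     # F_mean on the joint representation
        self.to_logvar = nn.Linear(256, latent_dim)
        self.batch_emb = nn.Embedding(n_batches, batch_dim)  # trainable batch embedding F_emb
        # decoders receive the latent code concatenated with the batch embedding
        self.dec_st = nn.Linear(latent_dim + batch_dim, n_genes)
        self.dec_sm = nn.Linear(latent_dim + batch_dim, n_metabolites)

    def forward(self, x_st, x_sm, batch_idx):
        q_joint = torch.cat([self.enc_st(x_st), self.enc_sm(x_sm)], dim=-1)    # q^joint
        mu, logvar = self.to_mean(q_joint), self.to_logvar(q_joint)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)                # reparameterization
        zb = torch.cat([z, self.batch_emb(batch_idx)], dim=-1)                 # batch-variant decoding
        return self.dec_st(zb), self.dec_sm(zb), mu, logvar
```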
Decoding for ST and SM embeddings

For the corresponding ST and SM reconstructions, the shared decoders $F_{decoder}^{st}$ and $F_{decoder}^{sm}$ are used as follows:

$$\left(r_{mean}^{st\_cor},\, r_{var}^{st\_cor},\, r_{gate}^{st\_cor}\right) = F_{decoder}^{st}(\mu_{st}, B) \qquad (18)$$
$$\left(r_{mean}^{sm\_cor},\, r_{var}^{sm\_cor}\right) = F_{decoder}^{sm}(\mu_{sm}, B) \qquad (19)$$

ST and SM contribution calculation

To assess the influence of each modality on the joint embedding of each spot, the contribution of the inputs $X^{st}$ and $X^{sm}$ to the joint embedding $z_{joint}$ is determined by calculating the angular distances:

$$\theta_{st}^{i} = \frac{\mu_{st}^{i} \cdot z_{joint}^{i}}{\lVert \mu_{st}^{i} \rVert \, \lVert z_{joint}^{i} \rVert} \qquad (20)$$
$$\theta_{sm}^{i} = \frac{\mu_{sm}^{i} \cdot z_{joint}^{i}}{\lVert \mu_{sm}^{i} \rVert \, \lVert z_{joint}^{i} \rVert} \qquad (21)$$

where $i = 1, \ldots, n$ and $n$ is the number of spots. The final modality contribution is defined as the difference between $\theta_{st}^{i}$ and $\theta_{sm}^{i}$, shifted by a factor of 0.5 to scale the values between 0 and 1 (a minimal code sketch is provided further below).

ST and SM decoder outputs

The ST decoder outputs the mean ($r_{mean}^{st}$ and $r_{mean}^{st\_cor}$), variance ($r_{var}^{st}$ and $r_{var}^{st\_cor}$), and dropout probability ($r_{gate}^{st}$ and $r_{gate}^{st\_cor}$) of the gene expression counts to parameterize the zero-inflated negative binomial (ZINB) distribution and model the raw GEM. Meanwhile, the SM decoder outputs the mean ($r_{mean}^{sm}$ and $r_{mean}^{sm\_cor}$) and variance ($r_{var}^{sm}$ and $r_{var}^{sm\_cor}$) of the metabolite intensity values to parameterize the Gaussian distribution and model the raw MIM.

Total loss function in SpatialMETA

Let $L_{recon}^{sm}(X_i^{sm})$ denote the reconstruction loss for SM from $z_{joint}$ and $L_{recon}^{st}(X_i^{st})$ denote the reconstruction loss for ST from $z_{joint}$. Similarly, $L_{recon}^{sm\_cor}(X_i^{sm})$ represents the reconstruction loss for SM from $\mu_{sm}$, and $L_{recon}^{st\_cor}(X_i^{st})$ represents the reconstruction loss for ST from $\mu_{st}$. In addition, $L_{MMD}(q_i)$ denotes the maximum mean discrepancy (MMD) loss, which ensures comparable joint embeddings across samples, as defined in scArches^[279]47, and $L_{KL}(q_i)$ represents the KL-divergence loss for the variational distribution. The total loss function for SpatialMETA combines the reconstruction losses for ST and SM, the MMD loss, and the KL-divergence loss:

$$L_{total} = \frac{1}{N}\sum_{i=1}^{N}\left[ L_{recon}^{sm}(X_i^{sm}) + L_{recon}^{sm\_cor}(X_i^{sm}) + L_{recon}^{st}(X_i^{st}) + L_{recon}^{st\_cor}(X_i^{st}) + L_{MMD}(q_i) + L_{KL}(q_i) \right] \qquad (22)$$

where $i = 1, \ldots, N$ and $N$ is the number of spots. The weight of each loss component can be defined by the user to emphasize specific objectives.

Horizontal integration for SM data

Similar to the vertical and horizontal integration of ST and SM described above, we denote the metabolite intensity values of each spot as $X^{sm} = \{X_1^{sm}, \ldots, X_n^{sm}\} \in \mathbb{R}^{M}$, where $M$ is the number of selected SVMs and $n$ is the number of spots. If batch information is available, the batch index of each spot is annotated as $B = \{B_1, \ldots, B_n\} \in \mathbb{R}^{H}$, where $H$ denotes the number of batch levels and $B_n = \{b_{n,1}, \ldots, b_{n,H}\}$. A batch-invariant neural network $F_{encoder}^{sm}(X^{sm}) \rightarrow q_{sm}$ then serves as the SM encoder, and a batch-variant neural network $F_{decoder}^{sm}(z, B) \rightarrow (r_{mean}^{sm}, r_{var}^{sm})$ acts as the SM decoder.
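Returning to the modality contribution calculation above (Eqs. (20) and (21)), the following NumPy sketch computes the per-spot cosine similarities between each modality's latent mean and the joint embedding. The final rescaling to [0, 1] is one plausible reading of the "0.5 shift" described above and is an assumption rather than the authors' exact formula.

```python
import numpy as np

def modality_contribution(mu_st, mu_sm, z_joint):
    """Per-spot cosine similarity of each modality mean to the joint embedding.

    mu_st, mu_sm, z_joint : (n_spots, latent_dim) arrays.
    Returns one value per spot; larger values indicate a stronger ST contribution.
    """
    def cosine(a, b):
        return np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
        )

    theta_st = cosine(mu_st, z_joint)  # Eq. 20
    theta_sm = cosine(mu_sm, z_joint)  # Eq. 21
    # Assumption: "difference ... shifted by the factor 0.5" is read as
    # 0.5 * (theta_st - theta_sm) + 0.5, mapping the difference into [0, 1].
    return 0.5 * (theta_st - theta_sm) + 0.5
```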
The horizontal integration process mirrors the approach described above, with the SM decoder outputting the mean ($r_{mean}^{sm}$) and variance ($r_{var}^{sm}$) of the metabolite intensity values to parameterize the Gaussian distribution and model the raw MIM. The total loss function, which combines the average reconstruction loss for SM, the MMD loss, and the average KL-divergence loss, is expressed as:

$$L_{total}^{sm} = \frac{1}{N}\sum_{i=1}^{N}\left[ L_{recon}^{sm}(X_i^{sm}) + L_{MMD}(q_i) + L_{KL}(q_i) \right] \qquad (23)$$

where $i = 1, \ldots, N$ and $N$ is the number of spots. The weight of each loss component can be defined by the user to emphasize specific objectives.

Benchmark metrics calculation

For benchmarking vertical integration on individual samples, we categorized the metrics into four components: (1) Reconstruction accuracy, (2) Continuity scores, (3) Marker scores, and (4) Biology conservation. The overall score for vertical integration was calculated as the mean of the four component scores:

$$\mathrm{Overall\ score} = \frac{\mathrm{Reconstruction\ accuracy} + \mathrm{Continuity\ score} + \mathrm{Marker\ score} + \mathrm{Biology\ conservation}}{4} \qquad (24)$$

(1) Reconstruction accuracy: To evaluate the reconstruction accuracy of SpatialMETA, two quantitative metrics were used: the Pearson correlation coefficient (PCC) and cosine similarity (CS). These metrics were computed between the original data matrix $X^{original}$ and the reconstructed matrix $X^{reconstructed}$ for each variable (e.g., gene expression or metabolite intensity) independently. The reconstruction accuracy score was defined as the mean of PCC and CS:

$$\mathrm{Reconstruction\ accuracy} = \frac{\mathrm{PCC} + \mathrm{CS}}{2} \qquad (25)$$

Pearson correlation coefficient (PCC)

The PCC was calculated for each feature $i$ as:

$$\mathrm{PCC}_i = \frac{\mathrm{Cov}\!\left(x_i^{original},\, x_i^{reconstructed}\right)}{\sigma_{x_i^{original}}\, \sigma_{x_i^{reconstructed}}} \qquad (26)$$
$$\mathrm{Mean\ PCC} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{PCC}_i \qquad (27)$$

where $x_i^{original}$ and $x_i^{reconstructed}$ represent the $i$-th feature of $X^{original}$ and $X^{reconstructed}$, respectively, $\mathrm{Cov}$ denotes the covariance, $\sigma$ denotes the standard deviation, and $i = 1, \ldots, N$, with $N$ the number of features.

Cosine similarity (CS)

The cosine similarity of each feature $i$ was calculated as:

$$\mathrm{CS}_i = \frac{x_i^{original} \cdot x_i^{reconstructed}}{\lVert x_i^{original} \rVert \, \lVert x_i^{reconstructed} \rVert} \qquad (28)$$
$$\mathrm{Mean\ CS} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{CS}_i \qquad (29)$$

where $x_i^{original}$ and $x_i^{reconstructed}$ represent the $i$-th feature of $X^{original}$ and $X^{reconstructed}$, respectively, and $i = 1, \ldots, N$, with $N$ the number of features.

(2) Continuity scores: These metrics assess the smoothness and consistency of the data embedding, evaluated using CHAOS and PAS as defined in SDMBench^[280]46. Since lower CHAOS and PAS scores indicate better performance, they were transformed to $1 - \mathrm{CHAOS}$ and $1 - \mathrm{PAS}$ before aggregation. The overall continuity score was then calculated as the mean of the transformed scores:

$$\mathrm{Continuity\ score} = \frac{(1 - \mathrm{CHAOS}) + (1 - \mathrm{PAS})}{2} \qquad (30)$$

(3) Marker scores: Marker scores evaluate the accuracy of modality fusion and the preservation of biological spatial patterns. They are assessed using Moran's I and Geary's C, as defined in SDMBench^[281]46, together with the Specificity, Logistic, and Mutual Information (MI) scores. As above, Geary's C was transformed to $1 - \mathrm{Geary's\ C}$.
The marker score was calculated as:

$$\mathrm{Marker\ score} = \frac{\mathrm{Moran's\ I} + (1 - \mathrm{Geary's\ C}) + \mathrm{Specificity} + \mathrm{Logistic} + \mathrm{MI}}{5} \qquad (31)$$

Specificity score

The specificity score quantifies how specific the expression of a gene, or the intensity of a metabolite, is to particular spatial clusters within a dataset. The calculation involves the following steps:

1. Feature filtering: To mitigate the influence of lowly expressed or sparse features, an initial filtering step was applied. Features with an excessively high proportion of zero values across all spatial spots were excluded, specifically those whose proportion of zero values exceeded a predefined threshold (default: 99%, i.e., features with zero expression in ≥99% of spots). To ensure robustness, filtering was performed at multiple zero-percentage thresholds (100%, 99%, 98%, 95%), and the final specificity score was computed as the mean across these thresholds. ST and SM datasets were processed independently.

2. Specificity score calculation: For each feature $i$, its specificity to a spatial cluster $c$ is calculated as:

$$\mathrm{Specificity}_{i,c} = \frac{\mathrm{Mean\ Expression}_{i,c}}{\mathrm{Mean\ Expression}_{i}} \qquad (32)$$
$$\mathrm{Specificity\ Score} = \frac{1}{N}\sum_{i=1}^{N} \max_{c}\, \mathrm{Specificity}_{i,c} \qquad (33)$$

where $\mathrm{Mean\ Expression}_{i,c}$ is the average expression of feature $i$ in spatial cluster $c$, $\mathrm{Mean\ Expression}_{i}$ is the overall average expression of feature $i$ across all spatial clusters, and $N$ is the number of features.

Logistic score

An effective spatial clustering method should reflect feature characteristics, enabling accurate prediction of spatial clusters from feature expression. The logistic score evaluates this predictive capability as follows:

1. The dataset is divided into training and testing sets.
2. A logistic regression model is trained on the training data using feature expression values as input and spatial cluster labels as the target.
3. The trained model is then used to predict cluster labels for the testing set ($X^{test}$).
4. The accuracy of the predictions is calculated as:

$$\mathrm{Accuracy} = \frac{\mathrm{Number\ of\ correctly\ predicted\ clusters}}{\mathrm{Total\ number\ of\ clusters\ in\ the\ test\ set}}$$

Mutual information score

The mutual information score measures the dependency between feature expression levels and spatial cluster labels. The mutual information (MI) for each feature $i$ is computed as:

$$\mathrm{MI}(i, c) = \sum_{x,y} P(x, y)\, \log \frac{P(x, y)}{P(x)\, P(y)} \qquad (34)$$
$$\mathrm{Mean\ MI} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{MI}(i, c) \qquad (35)$$

where $P(x, y)$ is the joint probability distribution of the feature values and cluster labels, $P(x)$ and $P(y)$ are their marginal probabilities, and $i = 1, \ldots, N$, with $N$ the number of features.

(4) Biology conservation: To assess the conservation of biological information, metrics including ARI, NMI, Cell type ASW, Isolated label silhouette, and Graph cLISI, as defined in scIB^[282]45, were used. The ground-truth labels used for evaluation (specified via the "label key" parameter in scIB) were manually annotated based on histological features identified through H&E staining (Supplementary Fig. [283]2i).
The biology conservation score was computed as:

$$\mathrm{Biology\ conservation\ score} = \frac{\mathrm{ARI} + \mathrm{NMI} + \mathrm{Cell\ type\ ASW} + \mathrm{Isolated\ label\ silhouette} + \mathrm{Graph\ cLISI}}{5} \qquad (36)$$

For benchmarking simultaneous vertical and horizontal integration across samples, we categorized the evaluation metrics into five groups: (1) Reconstruction accuracy, (2) Continuity scores, (3) Marker scores, (4) Biology conservation, and (5) Batch correction. The definitions of reconstruction accuracy, continuity scores, marker scores, and biology conservation were identical to those used in the vertical integration benchmarking (see above). Additional metrics were introduced to assess batch correction performance in the horizontal integration context.

Batch correction

Batch correction performance was evaluated using Batch ASW, Graph iLISI, and the PCR batch score, as defined in scIB^[284]45. The batch correction score was computed as:

$$\mathrm{Batch\ correction\ score} = \frac{\mathrm{Batch\ ASW} + \mathrm{Graph\ iLISI} + \mathrm{PCR\ batch}}{3} \qquad (37)$$

In addition, the cross-modality overall score was calculated as:

$$\mathrm{Cross\text{-}modality\ score} = \frac{\mathrm{Continuity\ score} + \mathrm{Marker\ score} + \mathrm{Reconstruction\ accuracy}}{3} \qquad (38)$$

The cross-sample overall score was calculated as:

$$\mathrm{Cross\text{-}sample\ score} = \frac{\mathrm{Biology\ conservation} + \mathrm{Batch\ correction}}{2} \qquad (39)$$

The overall score was calculated as:

$$\mathrm{Overall\ score} = \frac{\mathrm{Cross\text{-}modality\ score} + \mathrm{Cross\text{-}sample\ score}}{2} \qquad (40)$$

Metrics from SDMBench and scIB were used with their default scaling (normalized to [0, 1]). The additional metrics defined in this study (PCC, CS, Specificity, Logistic, and MI) were also scaled to the [0, 1] range. The full Python code for the benchmark metrics and the visualization Jupyter notebook are available at [285]https://github.com/WanluLiuLab/SpatialMETA/tree/master/benchmark. Detailed benchmark metric values are provided in the [286]source data files.

Extensive analysis and visualization

The joint ST and SM data are stored in the AnnData format^[287]79. In this structure, spot information and annotations are contained in the "obs" DataFrame. Similarly, gene and metabolite information and annotations are stored in the "var" DataFrame, and the distinction between genes and metabolites is made using the "var.type" attribute, which categorizes entries as either ST or SM. To annotate metabolites, we first reference the Human Metabolome Database (HMDB)^[288]80. All metabolites from HMDB undergo user-defined adduct addition to generate new m/z values. A user-defined tolerance (ppm) is then applied to calculate the m/z range of each metabolite. Each metabolite is individually searched within its respective m/z range to identify matching annotations (Supplementary Fig. [289]5f). To determine the enrichment of the annotated metabolites, we perform a hypergeometric test, which assesses whether the number of annotated metabolites associated with a specific pathway is greater than expected by chance. The hypergeometric test is defined as:

$$P(k; n, K, N) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$

where $N$ is the total number of metabolites, $K$ is the number of metabolites annotated to the pathway of interest, $n$ is the number of annotated metabolites in the study, and $k$ is the number of annotated metabolites in the study that are associated with the pathway (a minimal code sketch is shown below). The interactive analysis and visualization of SpatialMETA are built with the Python graphing library Plotly Dash.
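As referenced above, the following is a minimal sketch of the hypergeometric enrichment test using `scipy.stats.hypergeom`. The helper name and the example numbers are illustrative only and are not part of the SpatialMETA API.

```python
from scipy.stats import hypergeom

def pathway_enrichment_pvalue(N, K, n, k):
    """One-sided hypergeometric test: probability of observing k or more
    pathway-annotated metabolites among the n annotated metabolites in the study.

    N : total number of metabolites
    K : metabolites annotated to the pathway of interest
    n : annotated metabolites in the study
    k : annotated metabolites in the study associated with the pathway
    """
    # hypergeom.sf(k - 1, M, n, N) gives P(X >= k); note that SciPy's parameter
    # naming (M = population size, n = successes, N = draws) differs from the text.
    return hypergeom.sf(k - 1, N, K, n)

# Illustrative usage with made-up numbers:
# p = pathway_enrichment_pvalue(N=800, K=40, n=120, k=15)
```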
SpatialMETA loads spatial data in the AnnData format along with histology images. Users can interact with the spatial data through functionalities such as box selection, lasso selection, trajectory drawing, and closed-freeform drawing. These interactive operations are recorded in the AnnData object.

Calculation of spatially variable genes and metabolites

The SVGs and SVMs of each sample were calculated using Moran's I, as provided by Squidpy^[290]81. To remove batch-biased SVGs and SVMs, we first calculated the number of samples sharing each gene and metabolite, setting a threshold to ensure they are shared among multiple samples ("min_samples", default 2). For each sample, we identified batch-biased SVGs and SVMs by calculating the fold change of each gene or metabolite between that sample and the others. Genes or metabolites with a fold change greater than the threshold ("min_logfc", default 3) were defined as batch-biased SVGs and SVMs. In addition, these genes or metabolites had to be expressed in a certain proportion of spots within the sample ("min_frac", default 0.8). The batch-biased SVGs and SVMs were removed to facilitate better vertical integration.

Annotation of spatial clusters

We employed the ccRCC scRNA-seq reference (Supplementary Fig. [291]3a) for spatial transcriptomic deconvolution using DestVI^[292]50. For the ST data, we calculated the top 2000 SVGs, and for the scRNA-seq data, we identified the top 2000 highly variable genes. We then intersected these gene sets, retaining only the overlapping genes. For the scRNA-seq data, we used the "CondSCVI" model, while for the ST data, we used "DestVI.from_rna_model". Referring to the deconvolution predictions, we manually annotated the spatial clusters according to marker gene expression: Imm (immune clusters; CD8A, CD3D, CD74), Endo (endothelial clusters; CD34, PECAM1, APLN), Stro (stromal clusters; COL3A1, ACTA2, COL1A1), and Mal (malignant clusters; NDUFA4L2, CNDP2, EGFR).

Reporting summary

Further information on research design is available in the [293]Nature Portfolio Reporting Summary linked to this article.

Supplementary information

[294]Supplementary Information (27.3MB, pdf)
[295]41467_2025_63915_MOESM2_ESM.pdf (29.7KB, pdf) Description of Additional Supplementary Files
[296]Supplementary Data 1 (14.9KB, xlsx)
[297]Supplementary Data 2 (11KB, xlsx)
[298]Reporting Summary (73.2KB, pdf)
[299]Transparent Peer review file (9.3MB, pdf)

Source data

[300]Source data Files (208.4KB, zip)

Acknowledgements