Abstract

Embeddings derived from cell graphs hold significant potential for exploring spatial transcriptomics (ST) datasets. Nevertheless, existing methodologies rely on a graph structure defined by spatial proximity, which inadequately represents the diversity inherent in cell‐cell interactions (CCIs). This study introduces STAGUE, an innovative framework that concurrently learns a cell graph structure and a low‐dimensional embedding from ST data. STAGUE employs graph structure learning to parameterize and refine a cell graph adjacency matrix, enabling the generation of learnable graph views for effective contrastive learning. The derived embeddings and cell graph improve spatial clustering accuracy and facilitate the discovery of novel CCIs. Experimental benchmarks across 86 real and simulated ST datasets show that STAGUE outperforms 15 comparison methods in clustering performance. Additionally, STAGUE delineates the heterogeneity in human breast cancer tissues, revealing the activation of epithelial‐to‐mesenchymal transition and PI3K/AKT signaling in specific sub‐regions. Furthermore, STAGUE identifies CCIs with greater alignment to established biological knowledge than those ascertained by existing graph autoencoder‐based methods. STAGUE also reveals the regulatory genes that participate in these CCIs, including those enriched in neuropeptide signaling and receptor tyrosine kinase signaling pathways, thereby providing insights into the underlying biological processes.

Keywords: graph structure learning, cell‐cell interactions, spatial clustering, spatial transcriptomics

STAGUE is an innovative framework that introduces graph structure learning into unsupervised representation learning of spatial transcriptomics data. It generates a learned adjacency matrix and spatially aware cell embeddings for the cell graph.
The adjacency matrix can discern novel cell‐cell interactions, whereas the cell embeddings support multiple downstream applications such as clustering, visualization, and trajectory inference.

1. Introduction

Within the tissues of multicellular organisms, cells often aggregate into spatially organized and functionally distinct anatomical structures [1, 2], and proximal cells frequently coordinate their biological functions through cell‐cell interactions (CCIs). The dual challenges of pinpointing these spatial clusters, i.e., spatial clustering, and decoding the intricate communication patterns between cells, i.e., CCI inference, are pivotal for a comprehensive understanding of biological mechanisms. The groundbreaking spatial transcriptomic (ST) technologies [3], which allow for gene expression profiling while retaining critical spatial localization information, present an invaluable opportunity to discover spatially continuous clusters and decipher CCIs with greater biological fidelity.

Recently, there has been great interest in applying unsupervised graph representation learning (UGRL) techniques to integrate spatial information into the learning of cell embeddings and spatial clustering [4–11]. These methods involve converting the input ST data into a cell graph, where cells or spots are treated as nodes. Edges between these nodes are established based on the spatial proximity of the corresponding cells. By applying UGRL algorithms, they seek to infer discriminative, low‐dimensional representations of cells, which are intended to preserve the essential spatial and gene expression information in the ST data.
The inferred low‐dimensional cell representations provide a powerful basis for downstream clustering with unsupervised algorithms such as k‐means [12], mclust [13], and Leiden [14]. Among these methods, some propose jointly optimizing spatially aware cell representations while inferring novel CCIs [9, 10, 11]. They build upon graph autoencoders (GAE), learning cell representations by reconstructing the predefined adjacency matrix of the input cell graph. Subsequently, novel CCIs are identified by analyzing the newly generated edges according to the reconstructed adjacency matrix.

Recently, graph contrastive learning (GCL) has emerged as an important UGRL paradigm for ST studies [15]. Existing GCL‐based methods [16–20] build upon the Deep Graph Infomax (DGI) framework [21]. They typically utilize ad‐hoc augmentation functions [15] to generate different views of the input cell graph stochastically, such as randomly shuffling node features or randomly adding/dropping edges of the original cell graph. Subsequently, they harness the principle of mutual information maximization to bring similar views closer while distancing the irrelevant ones.

It is crucial to note that in current GCL‐based methods, the generation of different cell graph views relies on a predefined graph structure based on spatial proximity. When constructing the graph topology, these methods often identify neighboring cells based on two main criteria: proximity within a designated distance or a predetermined quantity of the nearest cells. This methodology may possess certain intrinsic limitations, as merely factoring in spatial proximity does not adequately capture the complexity of intercellular communications that regulate gene expression dependencies among cells.
For instance, the signal strength from cells secreting signaling molecules (e.g., protein ligands) often attenuates with increasing distance [22], and nearby cells selectively respond to these molecules via specific receptors [23]. A smaller distance range for constructing cell graphs can reduce the inclusion of irrelevant interactions, known as noise, but it may also overlook significant long‐distance relationships. Conversely, a larger range might capture these interactions at the cost of reducing the signal‐to‐noise ratio. Therefore, depending solely on spatial information to define the graph structure is insufficient, and selecting a suitable range is challenging. Moreover, this trade‐off poses a limitation on existing GAE‐based methods for CCI inference, as their reconstruction target, i.e., the predefined cell graph adjacency matrix, may inherently contain noise.

To address the above problems, in this study, we investigate a scenario where the cell graph structure is dynamically learned and adjusted during the contrastive learning process. To this end, we draw inspiration from recent developments in graph structure learning (GSL) [24, 25, 26], and propose STAGUE, an unsupervised representation learning model for Spatial Transcriptomics with spAtially informed Graph strUcture lEarning. STAGUE infers discriminative cell representations by exploiting the contrast between multiple views of the input cell graph. By parameterizing and optimizing the cell graph structure, it can generate learnable graph views to enhance the efficiency of the contrast process. This approach also seamlessly integrates the spatial clustering and CCI inference tasks into a unified framework. Specifically, STAGUE employs a spatial learner module that effectively models a cell graph adjacency matrix by utilizing the intrinsic spatial and gene expression relationships among cells.
The learned adjacency matrix captures the statistical dependencies among cells, reflecting their proximities and potential interactions. The integration of the normalized temperature‐scaled cross‐entropy loss [25, 27] and triplet loss as joint contrastive objectives facilitates the contrast between different views. Benchmark results show that STAGUE outperforms 15 comparison methods in spatial clustering across 86 ST datasets from different platforms. Moreover, STAGUE exhibits the capability to discern heterogeneity within both cancerous and paracancerous regions in human breast cancer samples. It uncovers the localized initiation of the epithelial‐to‐mesenchymal transition as well as the activation of PI3K/AKT signaling pathways in particular sub‐regions. For the CCI inference task, STAGUE shows superior ability to discern biologically meaningful CCIs compared to existing GAE‐based methods, and the regulatory genes governing these interactions are revealed. Comprehensive parameter analyses and ablation studies validate the effectiveness of the proposed components in STAGUE.

2. Results

2.1. Overview of STAGUE

STAGUE is an unsupervised representation learning model for spatial transcriptomics (ST), utilizing spatially informed graph structure learning (GSL) to integrate spatial clustering and cell‐cell interaction (CCI) inference tasks into a unified framework (Figure 1, Experimental Section). Given an ST dataset as input, we can extract a gene expression matrix and spatial coordinates of cells. Utilizing the coordinates, we derive a raw k‐nearest neighbors (kNN) cell adjacency matrix. STAGUE mainly consists of three different views of the input ST data. The learner view is derived using a spatial learner module that integrates the expression features and spatial information to learn a cell adjacency matrix.
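As an illustration of the raw kNN cell adjacency matrix derived from the spatial coordinates, the following minimal sketch builds a binary kNN graph in plain Python. It is not the STAGUE implementation: the function name `knn_adjacency`, the toy coordinates, the choice of `k`, and the symmetrization step (keeping an edge if either cell lists the other as a neighbor) are all assumptions made for this example.

```python
import math


def knn_adjacency(coords, k):
    """Return a symmetric binary kNN adjacency matrix as a list of lists.

    coords: list of (x, y) spatial coordinates, one per cell or spot.
    k: number of nearest neighbours per cell (illustrative parameter).
    """
    n = len(coords)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of cells")
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other cells by Euclidean distance to cell i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        for j in others[:k]:
            # Assumption: symmetrize, so an edge exists if either
            # cell selects the other among its k nearest neighbours.
            adj[i][j] = 1
            adj[j][i] = 1
    return adj


# Toy example with four cells; the fourth lies far from the others.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
gamma = knn_adjacency(coords, k=1)
for row in gamma:
    print(row)
```

A matrix built this way is binary and depends only on spatial coordinates, so the distant fourth cell is still linked to its nearest neighbor regardless of whether the two cells share any expression relationship.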
Compared with the raw kNN cell adjacency matrix, the learned adjacency matrix features continuous values and is designed to capture the statistical dependencies among cells, reflecting their proximities and potential interactions. The positive view is generated from the raw kNN cell adjacency matrix and serves as the positive pair of the learner view. The negative view is formed by randomly shuffling the node features and using an identity matrix as the adjacency matrix. Before the Graph Convolutional Network (GCN) [28] encoding, the three views are processed using data augmentation techniques, including feature masking and edge dropping. The encoded embeddings are then mapped to the space where contrastive loss is applied through projection heads. The integration of the normalized temperature‐scaled cross‐entropy (NT‐Xent) loss [25, 27] and triplet loss enables an effective contrast between the three views. The learned adjacency matrix from the spatial learner module can be utilized to discern novel CCIs. Furthermore, the embeddings from the learner view can be used for multiple downstream analyses, including spatial clustering, data visualization, and trajectory inference. Note that various unsupervised clustering algorithms [12, 13, 14] are applicable to cell embeddings; in this study, the standard k‐means was utilized as the default clustering algorithm.

Figure 1. Overview of STAGUE. The input ST data comprises gene expression matrix X and spatial coordinates C. Utilizing C, a raw kNN cell adjacency matrix Γ is derived. STAGUE processes the input ST data through three views: the learner view, which combines expression features with spatial information to infer a cell adjacency matrix A capturing cell dependencies; the positive view, derived from the raw kNN cell adjacency matrix Γ; and the negative view, formed by shuffling cell features and using an identity adjacency matrix.
The three views undergo data augmentation and are subsequently encoded using GCN encoders with shared weights. The embeddings are then projected to compute contrastive loss, utilizing a joint objective composed of the NT‐Xent loss $L_{\mathrm{NT}}$ and triplet loss $L_{\mathrm{triplet}}$. The low‐dimensional embedding of the learner view H can be used for multiple downstream tasks, including clustering, visualization, and trajectory analysis, whereas the learned adjacency matrix A aids in identifying novel CCIs.

2.2. STAGUE Outperforms State‐of‐the‐Art Methods on Spatial Clustering

2.2.1. Clustering Accuracy

In this section, we evaluate the spatial clustering accuracy of STAGUE, benchmarking it against fifteen recent algorithms on a comprehensive collection that includes both real and simulated ST datasets (Experimental Section). The selected comparison algorithms encompass deep learning methods, both with and without the implementation of graph contrastive learning (GCL), as well as the latest statistical learning approaches to ensure a rigorous assessment. The first group of real datasets (Real#1) includes 14 datasets derived from a variety of technological platforms, including STARmap, MERFISH, osmFISH, Stereo‐seq, and 10x Visium. Among these datasets, eight are characterized by single‐cell resolution, while six offer spot‐level resolution. This selection presents a varied scope regarding spatial resolution, cell counts, and gene coverage. The second group of real datasets (Real#2) consists of 16 datasets selected from a recent benchmark study focused on spatial clustering [29]. These datasets involve non‐brain tissues and cover smaller, discrete tissue domains such as breast cancer, liver, and pancreatic ductal adenocarcinoma.
The first simulated dataset group (Simulated#1) comprises 28 datasets generated using the spatial pattern preserving simulation tool, SRTsim [30]. These simulations utilized the Real#1 group as references, incorporating both tissue‐based and