Abstract

Background

   Existing studies have demonstrated that the integration analysis of
   transcriptomic and epigenomic data can be used to better understand the
   onset and progression of many diseases, as well as identify new
   diagnostic and prognostic biomarkers. However, such investigations on
   large-scale sequencing data remain challenging for researchers or
   clinicians with limited bioinformatics knowledge.

Results

   To facilitate the interpretation of the gene regulatory landscape, we
   developed an R Shiny application [Linking of atac-seq to gene
   expression data (Linkage)] for exploring and visualizing potential
   cis-regulatory elements (CREs) of genes based on ATAC-seq and RNA-seq
   data. Linkage offers six modules to identify, annotate, and interpret
   potential gene regulatory elements from the whole genome step by step.
   Linkage currently supports the analysis of human and mouse datasets.
   Linkage can provide interactive visualization for the correlation
   between chromatin accessibility and gene expression. More than that,
   Linkage identifies transcription factors (TFs) that potentially drive
   the chromatin changes through identifying TF binding motifs within the
   CREs and constructing trans-regulatory networks of the target gene set.

Conclusions

   This powerful tool enables researchers to conduct extensive multi-omics
   integration analysis and generate visually appealing visualizations
   that effectively highlight the relationship between genes and
   corresponding regulatory elements. With Linkage, users can obtain
   publishable results and gain deeper insights into the gene regulatory
   landscape.

Availability and implementation

   ‘Linkage’ is freely available as a Shiny web application
   [40]https://xulabgdpu.org.cn/linkage. The source code and instructions
   for Linkage can be accessed via Github
   [41]https://github.com/luoyyyy/Linkage.

Supplementary Information

   The online version contains supplementary material available at
   10.1186/s12864-025-12115-6.

   Keywords: Multi-omics, ATAC-seq, RNA-seq, R-shiny, Gene regulatory
   elements, Transcription factors

Background

   With the rapid advancement of high-throughput sequencing technologies
   and bioinformatics, increasing attention has been paid to the critical
   regulatory roles of genomic non-coding regions [[42]1]. These regions,
   including promoters and enhancers, often harbor transcription factor
   binding sites (TFBS) that directly influence gene expression. As an
   important feature of the epigenome, chromatin accessibility reflects
   the degree to which DNA is exposed and accessible to regulatory
   proteins such as transcription factors, and has thus become a key focus
   in gene regulation studies [[43]2]. Assay for transposase-accessible
   chromatin with high-throughput sequencing (ATAC-seq) is a
   high-throughput technology that employs an engineered Tn5 transposase
   to map and quantify genome-wide chromatin accessibility [[44]3]. Higher
   chromatin accessibility means more active DNA-protein binding events
   that happen at these regions for bulk ATAC-seq data analysis. On the
   other hand, bulk mRNA sequencing (RNA-seq) is the most common
   application to estimate and quantify gene expression level, which has
   broad utilities in differential expression analysis, cancer
   classification, disease diagnosis, and optimizing therapeutic treatment
   [[45]4, [46]5].

   To date, many studies have performed bioinformatics analysis to
   characterize gene regulatory landscapes by integrating ATAC-seq and
   RNA-seq data [[47]6, [48]7]. However, this kind of work remains very
   challenging for scientists without enough experience in high-throughput
   sequencing data analysis. The emergence of web applications has
   alleviated this issue by providing user-friendly interfaces for
   bioinformatics analysis. Nevertheless, the existing web tools only
   support analyzing datasets from public domains, such as The Cancer
   Genome Atlas (TCGA) and The Encyclopedia of DNA Elements (ENCODE),
   which is not enough for researchers who intend to analyze their own
   datasets [[49]8, [50]9].

   Here, we introduce Linkage, a user-friendly interactive R Shiny
   web-based application for exploring and visualizing potential gene
   regulatory elements based on ATAC-seq and RNA-seq data. Linkage allows
   users to upload customized data or re-analyze public datasets. The main
   feature of Linkage is to identify potential gene regulatory elements
   for the whole genome by performing a correlation analysis between
   matched chromatin accessibility and gene expression across multiple
   samples. Additional modules are developed to allow users to perform
   more comprehensive and biologically meaningful analysis for the links
   between ATAC-seq peaks and target genes. Analysis and visualization
   results are returned to the web page and can be downloaded in PDF, PNG,
   JPEG, CSV, and TXT formats. The workflow and typical output schema are
   shown in Fig. [51]1.

Fig. 1.

   [52]Fig. 1
   [53]Open in a new tab

   Overview of functionality and workflows with Linkage. A The home module
   of Linkage. B The regulatory peaks search module of Linkage. C The
   regulatory peaks visualization module. D The regulatory peaks
   annotation module of Linkage. E The motif analysis module of Linkage.
   F The regulatory networks module of Linkage. G The pathway enrichment
   module of Linkage

Implementation

Home module

   The Home Module contains the brief instructions and Import Data panel
   of Linkage (Fig. [54]1A). Linkage provides two public multi-omics
   datasets, i.e., the TCGA Breast Cancer cohort dataset and the
   [55]GSE121589 dataset, for users to re-analyze them. The TCGA Breast
   Cancer cohort is a human dataset that contains bulk RNA-Seq and
   ATAC-Seq data from 72 patients with breast cancer [[56]10]. The
   [57]GSE121589 is a mouse dataset that constitutes bulk RNA-Seq and
   ATAC-Seq data from 33 mice Muscle Stem Cells (MuSCs) [[58]11]. Apart
   from the provided public datasets, Linkage also enables users to upload
   and explore their own data. Only two tab-delimited text/CSV input files
   (a chromatin accessibility matrix and a gene expression matrix) are
   required before running Linkage. The gene expression matrix file is a
   tab-delimited multi-column data matrix, where the first column
   represents gene symbols, and the following columns represent normalized
   or raw expression levels of genes for each sample. The chromatin
   accessibility matrix file is also a tab-delimited multi-column data
   matrix, where the first three columns represent the names of
   chromosomes, the start, and the end coordinates on the chromosomes of
   the peaks respectively, and the remaining columns represent normalized
   or raw chromatin accessibility levels of peaks for each sample. For
   reference genome annotation, Linkage has preloaded GENCODE Release 43
   (GRCh38.p13) for human and Release M37 (GRCm39) for mouse as default
   reference files in the web application [[59]12]. Users are advised to
   provide normalized values (e.g., TPM, CPM, or log-transformed counts)
   and at least 10 matched RNA-seq and ATAC-seq samples to ensure reliable
   integrative analysis. Each element of ready-to-analysis files will
   appear in the Chromatin Accessibility Matrix panel and Gene Expression
   Matrix panel.

Regulatory peaks search module

   The Regulatory Peaks Search Module is the key module of Linkage that
   allows users to detect all potential regulatory DNA regions for a
   specific gene (Fig. [60]1B). In this study, we define a regulatory peak
   as an ATAC-seq peak (i.e., a chromatin-accessible region) that shows a
   statistically significant correlation with the expression level of a
   nearby gene across samples. Such correlation implies potential
   cis-regulatory activity, and these peaks are considered to harbor
   candidate cis-regulatory elements (CREs). Linkage adopted the same
   procedure as in TCGA [[61]10] to associate regulatory regions with the
   predicted target genes. Specifically, given an input gene and search
   scale, Linkage automatically performs standard correlation analysis
   between the quantitative expression level of the input gene and each
   quantitative chromatin accessibility measure in the region across all
   samples (Fig. [62]2A). To accommodate input genes from various
   annotation sources, Linkage supports three widely used gene ID types:
   Ensembl ID, gene symbol, and Entrez ID.

Fig. 2.

   [63]Fig. 2
   [64]Open in a new tab

   Example of using Linkage to conduct key regulatory elements recognition
   analysis. A the scatter plot of quantitative chromatin accessibility
   and gene expression; B the statistically significant regulatory peaks;
   C the normalized ATAC-seq peak signal representing chromatin
   accessibility; D the expression boxplot of target genes; E the
   upsetplot of regulatory peaks annotation; F the Motif scanning table;
   G the interactive GRNs; H the GO/KEGG Enrichment Table

   Users can adjust the search scale and correlation analysis algorithm
   (Spearman/Pearson/Kendall). The default correlation method is Spearman
   correlation, which is robust to outliers and does not assume linearity
   between chromatin accessibility and gene expression. Since promoter
   capture Hi-C (PCHi-C) data suggested that >75% of three-dimensional
   (3D) promoter-based interactions occur within a 500 kilobase pairs
   (kbp) distance [[65]13], we set the default search scale to 500 kbp
   upstream or downstream of the input gene’s TSS as potential regulatory
   regions. This default search scale can avoid spurious predictions. To
   enable users to explore interactions between distal ATAC-seq peaks and
   genes, Linkage supports the maximum search scale up to 1 million base
   pairs (Mbp). For multiple hypothesis testing correction, we adopt a
   default false discovery rate (FDR) threshold of 0.01, which balances
   sensitivity and specificity. Then, all the statistically significant
   results are listed in the Potential Cis-regulatory Regions panel (Fig.
   [66]2B). By clicking on a specific entry of this panel, users can view
   the scatter plot of quantitative chromatin accessibility and gene
   expression from the Correlation Plot panel. The corresponding rho and
   FDR for the correlation analysis will also be shown on the scatter
   plot. A positive correlation indicates that more accessible of open
   chromatin regions appear to have higher expression level of genes. Such
   finding suggests that the CREs in such regions could potentially
   activate the expression of target genes. On the contrary, a clear
   negative correlation suggests that the target gene associated with the
   chromatin accessibility regions might be repressed by the CREs in these
   regions.

Regulatory peaks visualization module

   The Regulatory Peaks Visualization Module allows users to visualize the
   chromatin accessibility signals around a specific regulatory peak, as
   well as the corresponding quantitative expression of the target gene of
   this regulatory peak (Fig. [67]1C). Users initially select a regulatory
   peak that is obtained from the Regulatory Peaks Search Module. Linkage
   then categorizes samples into five groups based on the quantitative
   chromatin accessibility of the specific regulatory peak, ranging from
   low to high for each sample. Specifically, these five groups are
   defined using quantile-based bins of the ATAC-seq peak signal across
   samples (0–20%, 20–40%, 40–60%, 60–80%, and 80–100%), which facilitates
   visualization of chromatin accessibility variation across samples. The
   coverage track of mapped ATAC-seq reads (Fig. [68]2C) and the
   expression boxplot of the target gene for each group will be shown
   simultaneously (Fig. [69]2D).

Regulatory peaks annotation module

   The Regulatory Peaks Annotation Module allows users to visualize the
   annotation of the predicted regulatory peaks for genes that are given
   in the previous module (Fig. [70]1D). Once users click ‘Annotate
   Regulatory Peaks’, Linkage performs genomic location-based annotation
   of the identified peaks using the corresponding function from the
   ChIPseeker package [[71]14]. The annotation includes promoter, 5′ UTR,
   3′ UTR, exon, intron, downstream, and distal intergenic regions.
   Promoters are defined as ± 3 kb around the transcription start site
   (TSS). Distal intergenic regions refer to genomic regions located
   outside of annotated gene bodies and downstream zones. These
   definitions follow the default settings of ChIPseeker and are based on
   gene models provided by GENCODE [[72]12]. This standardization ensures
   consistent and biologically meaningful annotation across datasets. To
   effectively visualize the overlaps and distribution in the annotation
   for peaks, Linkage also produces the upsetplot that is adopted from the
   ChIPseeker package (Fig. [73]2E).

Cis-Regulatory elements detection module

   After identifying potential regulatory peaks, researchers can further
   explore which specific CREs are activated with Linkage (Fig. [74]1E).
   CREs and the TFs that bind on them play a central role in gene
   transcription regulation, which can be detected by ATAC-seq data. The
   Cis-Regulatory Elements Detection Module supports users to visualize
   the enriched TFBS within potential regulatory peaks.

   Motif annotation is performed using the motifmatchr package [[75]15]
   based on position weight matrices (PWMs) from the JASPAR 2022 database
   [[76]16]. This module identifies individual TFBSs within each peak,
   rather than performing a statistical enrichment analysis. By clicking
   on a specific regulatory peak, users can view the location and binding
   score information of each enriched TFBS of this DNA region from the
   Motif Scanning Table (Fig. [77]2F). Once users select one TFBS of this
   table, the corresponding sequence logo of this CRE will appear in the
   Sequence-logo Plot panel. In addition, to enhance interpretability, we
   added a new column to the Motif Scanning Table displaying the average
   expression level of each enriched TF across all samples. This
   improvement provides insight into whether the enriched TFs are also
   transcriptionally active in the dataset.

Gene regulatory network module

   The interplay between CREs and genes generates complex regulatory
   circuits that can be represented as gene regulatory networks (GRNs)
   (Fig. [78]1F). Studies of GRNs are crucial for understanding how
   cellular identity is established, maintained, and disrupted in disease.
   The Gene Regulatory Network Module helps users to visualize GRNs whose
   nodes are represented by genes and their corresponding CREs that were
   inferred from previous analysis of Linkage.

   First, users input a list of genes and TFs of interest which was
   obtained from the previous analysis. Then users can customize a series
   of parameters in association with building the GRN, including types of
   gene symbols, calculation methods, and thresholds of interactions
   between the nodes (edges of the GRNs). Once users click the ‘Build the
   GRN’ button, Linkage will perform a standard correlation analysis
   between the expression levels of the selected genes and the expression
   levels of transcription factors whose binding motifs are detected in
   the previously linked ATAC-seq peaks. The significant calculation
   results of correlation analysis are shown in the Gene-TF Table panel.
   Meanwhile, Linkage produces the corresponding informatic and
   interactive GRNs that are adopted from the visNetwork package [[79]17]
   (Fig. [80]2G). Users can further change network layouts, select
   subnetworks, and save the GRNs as spreadsheets with interaction scores
   or plots.

   Furthermore, Linkage allows users to upload the previously saved
   Gene-TF Table file to rebuild the GRN without re-running the
   correlation analysis, which is especially useful for large gene sets
   where the original computation may be time-consuming. This feature
   enables faster re-analysis and visualization by skipping the repeated
   computation of TF–gene interaction scores.

Pathway enrichment analysis module

   The Pathway Enrichment Analysis Module supports users to visualize
   tabular and graphical pathway enrichment results of the genes/TFs of
   interest that were previously produced from other modules of Linkage
   (Fig. [81]1G). The pathway enrichment analysis can link them with
   underlying molecular pathways and functional categories such as gene
   ontology (GO) [[82]18] and Kyoto Encyclopedia of Genes and Genomes
   (KEGG) [[83]19]. Within this module, users can input a list of
   interested genes/TFs and set four key parameters (i.e., adjusted
   p-value cutoff, q-value cutoff, minimal size of annotated genes for
   testing, and maximal size of annotated genes for testing) for pathway
   enrichment analysis following the implementation of the clusterProfiler
   package [[84]20]. Then once users click the ‘Build Pathway Enrichment
   Analysis’ button, Linkage automatically performs GO and KEGG enrichment
   analysis. The corresponding enrichment categories will be returned in
   the GO/KEGG Enrichment Table. Finally, Linkage implements several
   different visualization methods adopted from the enrichplot package
   [[85]21] to interpret the functional results in the GO/KEGG Enrichment
   Table from multiple perspectives (Fig. [86]2H).

Results and discussions

Case study

   To better show the functionalities of Linkage, we collect multi-omics
   profiles of human breast cancer and mice MuSCs for experiments. For the
   human breast cancer dataset, we collect 72 samples with matched
   chromatin accessibility and gene expression data. For each detected
   expressed gene, we asked which regulatory regions might have
   contributed to the expression of this gene in breast cancer patients.
   To address this question, we implemented standard correlation analysis
   between the quantitative chromatin accessibility measure and the gene
   expression level across all samples. The Spearman correlation
   coefficient, r, was used to evaluate whether the potential regulatory
   regions were significantly correlated with the gene expression. A total
   of 60,483 expressed genes were analyzed. All peaks whose summits were
   located within 500 kbp from a gene’s TSS were considered by default. A
   conservative FDR cutoff of 0.01 was used to avoid false positives. This
   analysis identified a total of 489,132 regulatory regions for 18,104
   genes and a complete list of these regulatory regions can be found in
   Supplementary Table [87]S1.

   Next, we further explored whether Linkage could provide insights into
   the regulatory networks of breast cancer-related genes. Literature
   evidence indicates that PGF, CTSB, and EDN1 play important roles in
   breast cancer progression [[88]22–[89]24]. Using the publicly available
   datasets collected for this study, we examined the associations of
   these genes with transcription factors in the Gene Regulatory Network
   module and observed distinct patterns: CTSB is linked to 27 TFs, EDN1
   to 3 TFs, whereas PGF did not show highly correlated TFs in our example
   dataset. In the Pathway Enrichment module, all three genes are enriched
   in multiple cancer-relevant pathways, consistent with previous reports
   [[90]22–[91]24], including angiogenesis (PGF), extracellular matrix
   remodeling (CTSB), and vascular homeostasis (EDN1). Due to the limited
   sample size and lack of experimental validation, we did not investigate
   the underlying mechanisms in depth. Nevertheless, this case study
   demonstrates that Linkage can uncover biologically meaningful
   CRE–TF–gene associations and pathway-level insights, while leaving
   broader exploration to the end-user.

   We also collect matched chromatin accessibility and gene expression
   data of 33 mice MuSCs. A total of 1475 regulatory regions for 910 genes
   are identified with the same protocol and a complete list of these
   regulatory regions can be found in Supplementary Table [92]S2. Notably,
   the number of significant CRE-gene associations identified in the human
   dataset was substantially higher than in the mouse dataset. This
   difference primarily stems from the larger sample size and higher
   inter-sample variability in the human cohort, which increases the
   statistical power of correlation-based detection. In contrast, the
   mouse dataset is relatively homogeneous, both in sample type and
   biological context, thereby limiting the number of detectable
   associations.

   Taken together, these results demonstrate that the Linkage application
   can identify gene expression-associated regulatory regions from
   multi-omics data, which could potentially implicate regulatory elements
   responsible for cancer development.

Comparison of linkage with other tools

   Several Shiny applications, such as SPACE and genomeSidekick, have been
   published in the arena of multi-omics data integration analyses [[93]8,
   [94]25]. We qualitatively compared the ability of Linkage with the
   previously published applications in identifying and interpreting
   potential gene regulatory elements from multi-omics data (Fig. [95]3).
   SPACE allows users to analyze and visualize tens of thousands
   accessible DNA elements from multiple cancer types, but only the data
   from predefined cohorts is available for analysis. Thus, users cannot
   upload and analyze their customized datasets with SPACE. The
   genomeSidekick software serves the purpose for investigation of
   ATAC-seq and RNA-seq data. However, the primary concentration of
   genomeSidekick remains confined to the foundational data analysis
   stages. They tend to have limited capabilities for carrying out
   downstream advanced analyses, such as linking regulatory regions to
   gene expression, scanning the potential CREs, constructing regulatory
   networks, and so on. To the best of our knowledge, Linkage is currently
   the only Shiny application that is specifically designed for predicting
   gene regulatory elements from ATAC-seq and RNA-seq data. To improve
   user experience, we also provided local installation instructions in
   the GitHub repository to facilitate smoother usage.

Fig. 3.

   [96]Fig. 3
   [97]Open in a new tab

   Comparison of supported features from currently available multi-omics
   data integration web-servers. “NA” indicates that the feature is not
   applicable to the tool; “Check mark” indicates that the feature is
   supported by the tool

   We note that Linkage requires preprocessed input, which includes
   expression and accessibility matrices derived from upstream steps such
   as alignment, quantification, peak calling, and normalization. This
   design trades full automation for flexibility and clarity. Users can
   leverage established pipelines for these steps, such as Galaxy [[98]26]
   for end‑to‑end GUI workflows, UTAP2 [[99]27] for combined transcriptome
   and epigenome processing. By relying on these robust upstream
   pipelines, Linkage remains focused on integrative correlation analysis
   and visualization with cleanly defined inputs.

Conclusions

   Linkage is a web-based Shiny application that can utilize large-scale
   multi-omics datasets for the identification and prioritization of DNA
   regulatory elements from whole genomic regions. Linkage helps users
   perform standard correlation analysis between predefined chromatin
   accessibility peaks and gene expression profiles across multiple
   samples to identify potential regulatory relationships. The
   visualization and annotation modules provide user-friendly graphical
   interfaces to visualize the ATAC-seq genome tracks and genomics
   location features of a given regulatory peak. The motif module helps
   assess the probability of potential DNA regulatory elements within
   individual regulatory peaks. The Linkage web application also supports
   users to explore the GRNs and molecular pathways composed of DNA
   regulatory elements and target genes that were collected by the
   previous analyses.

   In addition to ATAC-seq, Linkage is also compatible with other
   chromatin accessibility data types such as DNase-seq, ChIP-seq, and
   CUT&Tag, provided that a properly formatted peak matrix is supplied.
   This flexibility enables broader applicability of the tool to diverse
   epigenomic modalities.

   Thus, the Linkage web application provides an accessible and
   interactive platform for integrative analysis, even for users without
   extensive programming expertise or prior experience in multi-omics data
   integration. In conclusion, with the increasing popularity of
   higher-resolution sequencing strategies such as single-cell
   technologies, integrative multi-omics data analysis will undoubtedly
   receive more and more attention. We will keep on developing and
   upgrading the Linkage network resource to benefit the omics data
   integration research community.

Availability and requirements

   Project name: Linkage.

   Project home page: [100]https://github.com/luoyyyy/Linkage.

   Operating system(s): Platform independent.

   Programming language: R.

   Other requirements: None.

   License: MIT.

   Any restrictions to use by non-academics: None.

Supplementary Information

   [101]Supplementary material 1.^ (3.5MB, csv)
   [102]Supplementary material 2.^ (26.5MB, csv)
   [103]Supplementary material 3.^ (13.9KB, docx)

Acknowledgements