Abstract

   The regulatory mechanisms governing transcriptional programs in the
   cancer genome remain elusive, particularly those concerning cell-type
   specificity. We carefully curated single-cell assay for
   transposase-accessible chromatin sequencing (scATAC-seq) and
   single-cell RNA sequencing (scRNA-seq) data from eight distinct
   carcinoma tissues, including breast, skin, colon, endometrium, lung,
   ovary, liver, and kidney. Using single-cell multi-omics analysis, we
   identified extensive open chromatin regions and constructed peak-gene
   link networks, which can reveal distinct cancer gene regulation and
   genetic risks. We further explored conserved epigenetic regulation
   across cell types within cancer and elucidated their functional
   implications. Moreover, we identified cell-type-associated
   transcription factors (TFs) that regulate key cellular functions, such
   as the TEAD family of TFs, which widely control cancer-related
   signaling pathways in tumor cells. In colon cancer, we further
   identified tumor-specific TFs that are more highly activated in tumor
   cells than in normal epithelial cells, including CEBPG, LEF1, SOX4,
   TCF7, and TEAD4, which are pivotal in driving malignant transcriptional
   programs and represent potential therapeutic targets, as corroborated
   by single-cell sequencing data from multiple sources and in vitro
   experiments. Our findings provide a comprehensive understanding of the
   regulatory dynamics underlying carcinomas and offer valuable insights
   into potential therapeutic interventions.

   Subject terms: Cancer epigenetics, Cancer microenvironment

Introduction

   Carcinomas are a highly heterogeneous group of diseases, involving
   different cellular components within the complex milieu of the tumor
   ecosystem that each plays a pivotal role in tumor initiation and
   progression. While single-cell genomics technologies have improved our
   ability to decipher the cellular intricacies of tumors, much of the
   attention has focused on transcriptomics using single-cell RNA-seq
   (scRNA-seq) [[42]1, [43]2]. The seminal work of The Cancer Genome Atlas
   (TCGA) and others has underscored the indispensable role of the
   epigenome in the cancer landscape [[44]3, [45]4], with non-coding
   genomic regions containing regulatory elements exerting a profound
   influence on tumor biology. These regulatory sequences control the
   expression patterns of target genes by recruiting cell-type-specific
   transcription factors (TFs) [[46]5]. Despite the progress we have made
   in our understanding, conventional assays still predominantly utilize
   heterogeneous tissues that provide population-level measurements,
   highlighting the need for a detailed exploration of the regulatory
   landscape of cancer within different cell types [[47]3, [48]4].
   Therefore, deciphering the regulatory programs operating in specific
   cell types within tumors promises novel insights into the molecular
   basis of tumor biology and heterogeneity.

   Genome-wide association studies (GWAS) have identified thousands of
   genetic variants associated with different types of cancer. Many of
   these variants are located in non-protein-coding regions of the genome
   [[49]6, [50]7], and are frequently enriched in predicted
   transcriptional regulatory regions, suggesting their potential impact
   on disease risk by modulating the regulation of target genes [[51]8].
   While bulk-tissue ATAC-seq has attempted to interpret the molecular
   functions of GWAS signals, its resolution is not sufficient to allow
   characterization of the specific cell types or gene expression levels
   influenced by these single nucleotide polymorphisms (SNPs) [[52]9].

   Single-cell assay for transposase-accessible chromatin sequencing
   (scATAC-seq) can identify accessible chromatin regions by Tn5
   transposase-mediated tagmentation and captures active DNA regulatory
   elements at single-cell resolution [[53]10]. Several studies employing
   scATAC-seq techniques have provided insights into diverse tumor
   tissues, uncovering specific regulatory landscapes that control tumor
   cells [[54]11, [55]12]. The integration of scATAC-seq with single-cell
   RNA sequencing (scRNA-seq) augments the ability to explore gene
   regulation across various cell types, offering a more panoramic view of
   genome-wide regulatory elements and insights into TF binding and
   activity.

   In this study, we carefully curated scATAC-seq and scRNA-seq data of
   different cancer types from several studies [[56]11, [57]13–[58]22], to
   identify cell-type-specific regulatory molecules that mediate tumor
   biology. By analyzing 380 465 cells from primary tumor tissues, we
   identified numerous candidate cis-regulatory elements (cCREs) based on
   chromatin accessibility. We also established a gene regulatory network
   by integrated analysis of scATAC-seq and scRNA-seq datasets, revealing
   distinct cancer gene regulation and genetic risks. Finally, we
   identified cell-type-associated TFs and discovered that tumor-specific
   TFs regulate malignant transcriptional programs, suggesting them as
   potential therapeutic targets.

Materials and methods

Patient samples

   Colon cancer samples were obtained from the primary tumor and adjacent
   normal colon tissues after specimen resection at Peking University
   People’s Hospital. All sampling and experimental steps in this study
   were approved by the Ethics Committee of Peking University People’s
   Hospital (License 2021PHB088-001), and all patients signed written
   informed consent for this study.

Tissue dissociation and library preparation

   We used human colon cancer and adjacent normal colon tissues for
   single-cell multiome sequencing, with all steps performed at 4 °C. A
   frozen tissue fragment (approximately 50 mg) was placed into a
   pre-chilled 2-mL Dounce homogenizer containing 2 mL 1× homogenization
   buffer (320 mM sucrose, 0.1 mM EDTA, 0.1% NP40, 5 mM CaCl[2], 3 mM
   Mg(Ac)[2], 10 mM Tris-HCl pH 7.8, 167 μM β-mercaptoethanol, 1× protease
   inhibitor cocktail (Roche), and 1 U/μL RNase inhibitor (Thermo)). The
   tissue was homogenized approximately 15 strokes with the loose ‘A’
   pestle, then filtered through a 70-μm nylon mesh to remove larger
   debris, followed by 20 strokes with tight ‘B’ pestle. Connective tissue
   and residual debris were excluded by filtration through a 40-μm nylon
   mesh filter followed by centrifugation at 350 r.c.f for 5 min.

   The supernatant was aspirated and resuspended in 400 μL of 1×
   homogenization buffer. An equal volume of 50% iodixanol in
   homogenization buffer was added to a final concentration of 25%
   iodixanol. 600 μL of a 29% iodixanol solution (in 1× homogenization
   buffer containing 480 mM sucrose) was layered underneath the 25%
   iodixanol, and then a further 600 μL of a 35% iodixanol solution (in 1×
   homogenization containing 480 mM sucrose) was layered underneath the
   29% iodixanol layer. After centrifugation in a swinging-bucket
   centrifuge at 3000 r.c.f for 35 min, the nuclei at the interface of the
   29 and 35% iodixanol solutions were collected in a volume of 200 μL and
   counted using trypan blue.

   500,000 nuclei were washed in buffer (10 mM Tris-HCl pH 7.4, 10 mM
   NaCl, 3 mM MgCl[2], 1% BSA, 0.1% Tween-20, 1 mM DTT, and 1 U/μL RNase
   Inhibitor), followed by centrifugation at 500 r.c.f for 5 min. Nuclei
   were resuspended in 50 μL Diluted Nuclei Buffer (1× Nuclei Buffer*,
   1 mM DTT, 1 U/μL RNase Inhibitor), and the concentration of nuclei was
   determined. 15,000 nuclei were aspirated for library construction and
   sequencing. Each nuclei suspension was submitted for library
   preparation using the Chromium Next GEM Chip J Single Cell Kit
   (PN-1000234, 10× Genomics) and the Chromium Next GEM Single Cell
   Multiome ATAC + Gene Expression Reagent Kits (PN-1000283, 10×
   Genomics), according to the manufacturer’s instructions. Finally, the
   libraries were sequenced using an Illumina Novaseq6000 sequencer with a
   sequencing depth of at least 50,000 reads per cell with the paired-end
   150 bp strategy.

Data collection

   We collected single-cell sequencing data based on the following
   prioritized criteria: (1) Data generated using the 10× Genomics
   single-cell platform were selected. (2) Epithelial-origin tumors were
   included. (3) Entire tumor samples were sequenced rather than
   selectively sorting specific cell types, such as immune cells. (4)
   scATAC-seq and scRNA-seq data were generated from the same tumor
   samples. (5) Data from multiple studies were included whenever
   possible, while ensuring that batch effects between studies were
   minimal. (6) Sequencing data containing both tumor and normal tissues
   were selected from the same study. (7) For certain cancer types where
   only scATAC-seq data were available, corresponding scRNA-seq data were
   collected separately to supplement the analysis. The selected
   scATAC-seq and scRNA-seq datasets for BC, BCC, CC, EC, LC, OC, PLC, and
   RCC are summarized in Supplementary Table [59]S1. Bulk ATAC-seq and
   methylation sequencing datasets were downloaded from the UCSC Xena
   website ([60]https://xena.ucsc.edu/).

Single-cell ATAC-seq data processing

   The MACS2 method [[61]23] was used to identify accessible chromatin
   regions in each fragment file, creating a unified set of peaks for
   quantification across all datasets. scATAC-seq data was analyzed using
   the Signac [[62]24] R package (version 1.6.0). Low-quality cells were
   excluded based on the following criteria: nCount_peaks >2000,
   nCount_peaks <30,000, nucleosome signal <4, and TSS enrichment >2. To
   reduce computational resource usage, we randomly subsampled 30,000
   cells for scATAC-seq data of BC. The Signac processing pipeline was
   applied, and cluster annotation was performed by comparing differential
   accessible regions associated with marker genes for tumor cells (LGR5,
   EPCAM, CA9), T cells (CD247), plasma cells (JCHAIN), myofibroblasts
   (ACTA2), myeloid cells (ITGAX, CD163), mast cells (KIT), fibroblasts
   (PDGFRA), endothelial cells (EMCN, PECAM1), and B cells (MS4A1). Gene
   activity matrix for scATAC-seq data was calculated using the
   GeneActivity function in Signac. To remove batch effects, individual
   datasets were harmonized using the Harmony algorithm [[63]25]. Genomic
   regions with accessible chromatin peaks were annotated using the
   ChIPSeeker [[64]26] R package (version 1.28.3) and the UCSC database on
   hg38.

Single-cell RNA-seq data processing

   The Seurat [[65]27] R package (version 4.1.0) was used for the analysis
   of scRNA-seq data. Low-quality cells were excluded based on the
   following criteria: nCount_RNA < 50,000, nCount_RNA > 500,
   nFeature_RNA > 500, nFeature_RNA < 6 000, and % mitochondria < 25.
   Additionally, the DoubletFinder [[66]28] R package (version 2.0.3) was
   independently applied to each library to identify potential doublets,
   and cells predicted to be doublets were excluded. The doublet rate
   increases by 0.8% for every 1000-cell increment. For scRNA-seq data of
   BC, CC, and LC, we used the processed data provided by the authors. To
   reduce computational resource usage, we randomly subsampled 30,000
   cells for scRNA-seq data of BC. The default Seurat workflow was used
   for dimensionality reduction and clustering analysis. The same marker
   genes used in the scATAC-seq data analysis were employed for cell
   cluster annotation. To address batch effects across different datasets,
   the Harmony algorithm was used.

Single-cell multiome-seq data processing

   The raw sequencing data were initially processed using the
   cellranger-arc software (version 2.0.2, 10× Genomics) for tasks such as
   demultiplexing, alignment to the GRCh38 human reference genome, and
   generating gene-barcode matrices. MACS2 was then used to identify
   accessible chromatin regions in each fragment file and create a unified
   set of peaks for quantification across all datasets. For filtering,
   normalization, dimensionality reduction, clustering, and differential
   expression analysis, the Seurat and Signac R packages were applied. To
   ensure data quality, low-quality cells were excluded based on the
   following parameters: nCount_ATAC < 20,000, nCount_ATAC > 1 000,
   nCount_RNA < 25,000, nCount_RNA > 500, nFeature_RNA > 500,
   nFeature_RNA < 6000, TSS enrichment >2, and % mitochondria <25.

Copy number alteration detection

   Copy number alterations of epithelial cells from tumor tissues were
   inferred using InferCNV [[67]29] R package (version 1.8.1) based on
   single-cell transcriptomic profiles. Non-tumor cells from scRNA-seq
   data served as the reference for CNV calling. The cutoff was set to 0.1
   and the other parameters were kept at their default values.

Genomic coordinate overlap analysis

   To explore the DNA dictionary established by scATAC-seq, bulk ATAC-seq
   datasets and bulk DNA methylation datasets were obtained from TCGA
   through UCSC. The findOverlaps function in Signac was used to identify
   overlaps between peaks from scATAC-seq data and those from bulk
   ATAC-seq data, with a minimum of 1 bp overlap defining the genomic
   coordinate overlap. For the DNA methylation data, a direct comparison
   was made to determine whether methylation sites were located within the
   identified peak regions.

Identify cell-type-specific cancer-associated differential accessible regions

   Cell-type-specific accessible chromatin regions in each cancer type
   were identified using the FindAllMarkers function in Signac. A
   differential accessible region (DAR) was deemed cell-type-specific and
   cancer-associated if it appeared exclusively in one cancer type.
   Genomic coordinate overlap between DARs was defined as a minimum of
   1 bp overlap. To better assess the heterogeneity of cell types, the
   mean proportion of cell-type-specific cancer-associated DARs relative
   to cell-type-specific DARs was calculated.

Estimating GWAS enrichment using cell-type-specific accessible chromatin
regions

   We retrieved cancer-associated SNPs identified in various published
   genome-wide association studies from the NHGRI-EBI GWAS catalog
   [[68]30]. To assess the heritability of diverse complex traits, we
   employed LDSC (version 1.0.1) [[69]31]. GWAS summary statistics served
   as input for LDSC, enabling the computation of heritability enrichment
   for an annotated set of SNPs conditioned on a baseline model. This
   accounts for genomic features influencing heritability and jointly
   models multiple annotations. Cell-type-specific peaks were formatted
   for LDSC using the make_annotation.py script, and LD scores were
   computed for each set using the ldsc.py script. Publicly available GWAS
   summary statistics were gathered for a range of traits, including type
   2 diabetes (T2D) [[70]32], systemic lupus erythematosus (SLE) [[71]33],
   rheumatoid arthritis (RA) [[72]34], ulcerative colitis (UC) [[73]35],
   Crohn’s disease (CD) [[74]35], coronary atherosclerosis (CAD) [[75]32],
   body height (BH) [[76]36], body mass index (BMI) [[77]37], balding
   [[78]38], CC [[79]39] and OC [[80]32]. Subsequently, summary statistics
   were converted to hg38 coordinates using the UCSC liftover tool and
   formatted for LDSC using the munge_sumstats Python script. The
   recommended guidelines for cell-type-specific partitioned heritability
   analysis were followed, utilizing HapMap3 SNPs and the provided hg38
   baseline model (v2.2). The ldsc.py script was then employed to compute
   cell-type-specific enrichments of GWAS heritability.

Cis-co-accessible network construction

   The correlation structure of chromatin accessibility data was
   scrutinized using the cicero [[81]40] R package (version 1.3.9). Cicero
   estimates co-accessibility between peak pairs using a graphical LASSO
   algorithm, with a maximum interaction constraint of 500 kb.
   Importantly, prior to correlation and regularization, a bootstrap
   approach was implemented to generate metacells. This involved
   aggregating 50 cells at a time using k-nearest neighbors, effectively
   addressing the sparsity inherent in single-cell chromatin data.

Peak-gene links analysis

   A peak-gene correlation analysis was performed using a method similar
   to that described previously [[82]41]. In brief, grouped scATAC-seq and
   grouped scRNA-seq data were used to identify peak-gene links,
   incorporating log-normalized accessibility and expression. For each
   gene in GenomicRanges, the distance to the start site was adjusted, and
   the start site was resized within a ± 500 kb window. For each
   chromosome, 1000 peaks not on the same chromosome were identified and
   correlated with all genes on that chromosome. Each potential peak-gene
   correlation was transformed into a z-score using the mean and standard
   deviation of the null trans correlations. These scores were then
   converted into p-values and adjusted for multiple-hypothesis testing
   using the Benjamini-Hochberg correction (‘p.adjust’ function in R).

   To construct the peak-gene link networks, peak-promoter
   co-accessibility was computed using cicero, and peak-gene correlation
   was determined using the method described above. Reliable peak-gene
   links were defined by a high co-accessibility score (co-accessibility
   score >0.2) between peaks and the promoter of genes, as well as a
   significant correlation (FDR < 0.1) between peaks and genes. For OC,
   the FDR cutoff was set at 0.2 to obtain robust results.

Identification of cell-type-conserved regulatory programs

   Cell-type-conserved regulatory programs were identified using the
   iterative overlap method, which is implemented in ArchR [[83]42].
   Briefly, cell-type-associated regulatory regions were obtained by
   intersecting regulatory regions from regulatory networks with
   cell-type-associated DARs in each cancer type. The iterative overlap
   method was then employed to generate cell-type-associated peak sets for
   cell-type-associated regulatory regions of the same cell type. Finally,
   cell-type-associated peak sets were intersected with the
   cell-type-associated regulatory regions from each cancer type, removing
   peaks that appeared in only one cancer type.

Motif analysis and transcription factor footprinting

   Single-cell TF motif enrichment analysis was performed for 633 human
   TFs from the JASPAR 2020 database [[84]43] using the Signac wrapper for
   chromVAR [[85]44]. Initially, a motif accessibility matrix was computed
   to estimate the number of peaks containing each TF motif across all
   cells. Subsequently, chromVAR utilized this motif accessibility matrix
   to calculate deviation Z-scores for each motif. This calculation
   involved a comparison between the actual number of peaks containing the
   motif and the expected number of fragments in a background set. It also
   accounted for potential confounding factors, such as GC content bias,
   PCR amplification, and variability in Tn5 tagmentation. For a more
   detailed analysis of specific TFs of interest, the Footprint function
   in Signac was employed to conduct TF footprinting analysis in
   pseudo-bulk aggregates of single cells within the same cell type.
   Normalized values of TF footprinting were obtained from the
   GetFootprintData function in Signac.

Identification of intratumor NMF programs

   The NMF method [[86]42], implemented by the NMF R package (version
   0.24.0), was utilized to identify the underlying transcriptional
   programs within tumor cells from three samples (A001-C-007, CRC-1-8810,
   CRC-3-11773; number of tumor cells >500). We applied this approach
   (k = 2:6, nrun = 30) to the relative expression matrix of tumor cells
   in each sample, with all negative values converted to zero. The optimal
   k value, signifying a robust clustering result, was determined based on
   the point at which the cophenetic coefficient exhibited the maximum
   drop. Subsequently, 10 distinct programs across the three samples were
   identified. For each program, the top 100 genes with the highest NMF
   scores were utilized to score tumor cells in each sample, employing the
   Seurat AddModuleScore function. The correlations between these program
   scores were calculated individually for each sample. Finally, four
   meta-programs were identified through hierarchical clustering of
   averaged correlations of pairs of programs across all samples, using
   Pearson correlation as the distance metric and Ward’s linkage.
   Clustering groups covering less than half of the samples were
   disregarded. The 100 genes with the highest average NMF score within
   each highly correlated program set were selected to represent the
   corresponding meta-program. Pathway enrichment analysis for each
   meta-program was conducted using the clusterProfiler [[87]45] R package
   (version 4.0.5). Regulatory elements for genes within each meta-program
   were identified using the peak-gene link network. Candidate TFs for
   each gene were defined based on DNA sequence motifs present in the
   regulatory elements of the gene.

Analysis of meta-program dependencies

   For the dependency analysis of genes from each meta-program, we
   manually searched the Cancer Dependency Map Project database for colon
   cancer cell lines [[88]46]. Dependency data from primary cancer cell
   lines were used to estimate cellular viability.

Perturbagen’s signature of tumor-specific TFs

   Perturbagen’s signatures of tumor-specific TFs were downloaded from the
   Connectivity Map [[89]47]. Perturbagen’s signatures of CEBPG, SOX4,
   TCF7, and TEAD4 were obtained from the HT29 cell line, whereas
   perturbagen’s signature of LEF1 was obtained from all cell lines in the
   database.

Identification of drug targeting transcription factor

   Drug perturbed profiles generated from the LINCS consortium [[90]47]
   were used to screen candidate drugs targeting CEBPG, LEF1, SOX4, TCF7,
   and TEAD4. These profiles were downloaded from the ExperimentHub
   [[91]48] R package (version 2.0.0), which includes a Z-score matrix
   from differential expression analysis of 12,328 genes across 8140
   compound treatments. Genes with a Z-score smaller than –2 were
   considered differentially downregulated genes. Candidate drugs were
   defined as those that can significantly downregulate the expression of
   the target TF. Finally, tacedinaline, quinoclamine, dorsomorphin, and
   vorinostat were selected as candidate drugs for targeting CEBPG, LEF1,
   SOX4, and TEAD4, respectively.

Cell culture and reagents

   All human cell lines used in this study were sourced from the American
   Type Culture Collection. Authentication of these cell lines was
   conducted through STR locus analysis, and they were regularly screened
   for mycoplasma contamination. HEK293T cells and DLD1 cells were
   cultured in DMEM (Corning), supplemented with 10% FBS (HyClone), at
   37 °C 5% (v/v) CO[2].

Vectors and lentiviral transfection

   All short hairpin RNAs (shRNA) were cloned into pLKO.1-TRC vector.
   Target sequences were as follows:

   CEBPG: TTAGCCTTGTAATTCGAATAT;

   LEF1: CCATCAGATGTCAACTCCAAA;

   SOX4: GAAGAAGGTGAAGCGCGTCTA;

   TEAD4: GAGACAGAGTATGCTCGCTAT;

   TCF7: CAACTCTCTCTCTACGAACAT.

   Lentiviral particles were generated by transfecting the shRNA plasmid,
   packaging vectors psPAX2 and pMD2.G into HEK293T cells at a ratio of
   3:2:1 using jetPRIME Transfection Reagent (Polyplus-transfection Inc.).
   The media was replaced 12 h post-transfection. 48 h later, the media
   containing the virus were collected. After centrifugation at 1500× g
   for 45 min, the pellets containing virus were resuspended with DMEM
   medium. The efficiency of lentiviral shRNA clones was assessed by
   real-time PCR.

Lentiviral infection

   A total of 1 × 10^6 DLD1 cells were plated in each well of 12-well
   plates. 200 μL of lentivirus and 10 μg/mL polybrene were added at the
   same time. After 12 h at 37 °C, the cells were transferred to new
   culture dishes. 24 h later, the lentivirus-infected cells were
   subjected to selection with puromycin (Invivogen) for 3 days.

RNA isolation and qRT-PCR

   Total cellular RNA was extracted using TRIzol Reagent (Vazyme) and RNA
   quantified with NanoDrop 2000 (Thermo Fisher Scientific, USA).
   Subsequently, total RNA was reverse transcribed using Promega cDNA
   Synthesis SuperMix (Promega) following the manufacturer’s guidelines.
   Quantitative reverse transcription PCR (qRT-PCR) was conducted
   according to the manufacturer’s protocol on the Bio-Rad CFX ConnectTM
   system, utilizing SYBR Green Master Mix (Vazyme) and gene-specific
   primers. The qPCR primer sequences were as follows:

   CEBPG-fwd: 5′-ACTCCAGGGGTGAACGGAAT-3′;

   CEBPG-rev: 5′-CATGGGCGAACTCTTTTTGCT-3′;

   LEF1-fwd: 5′-AGAGCCCCAATATCCCTGC-3′;

   LEF1-rev: 5′-GAGAAAAGTGCTCGTCACTGT-3′;

   SOX4-fwd: 5′-AGCGACAAGATCCCTTTCATTC-3′;

   SOX4-rev: 5′-CGTTGCCGGACTTCACCTT-3′;

   TEAD4-fwd: 5′-GAACGGGGACCCTCCAATG-3′;

   TEAD4-rev: 5′-GCGAGCATACTCTGTCTCAAC-3′;

   TCF7-fwd: 5′-AACACCTCAACAAGGGCACTC-3′;

   TCF7-rev: 5′-CCCCACTTGAAACGGTTCCT-3′.

Western blot

   Cells were lysed in RIPA lysis buffer with protease inhibitor cocktail
   (Abcam). BCA protein quantification kit (Beijing Solarbio Science &
   Technology) was used to quantify the protein concentration. Equal
   amounts of protein were separated by SDS-PAGE and transferred to PVDF
   membrane. After blocking with 5% BSA for 1 h, the membranes were
   incubated overnight at 4 °C with the indicated primary antibodies,
   followed by secondary antibody. Primary antibodies were used in this
   study: anti-CEBPG (UpingBio), anti-LEF1 (Sangon Biotech), anti-SOX4
   (UpingBio), anti-TCF7 (Sangon Biotech), anti-TEAD4 (IPODIX), anti-GAPDH
   (Sungene Biotech).

Cell proliferation assay

   A seeding density of 2000 cells was employed for each well in 96-well
   plates, and the cells were allowed to adhere. At designated time
   points, the culture medium was substituted with CCK-8 solution
   (Beyotime), followed by incubation at 37 °C for 2 h. Absorbance values
   were measured at 450 nm using a Thermo Fisher Scientific Varioskan
   Flash multimode reader, and cell proliferation curves were generated
   based on the absorbance at each time point. Each group was established
   with six replicates.

Drug treatment

   To assess the mRNA and protein level of the targeted gene induced by
   drug treatment, 1 × 10^6 DLD1 cells were seeded into each 10 cm cell
   culture dish. After 24 h of incubation at 5% CO[2] and 37 °C, the cells
   were treated with 10 μM Tacedinaline (Targetmol), 10 μM Dorsomorphin
   (Macklin), 10 μM Vorinostat (Yeasen), or 4 μM Quinoclamine (Targetmol)
   for 3 days. Subsequently, the cells were harvested for the detection of
   mRNA levels of the targeted genes. For cell proliferation assays, 5 000
   DLD1 cells were seeded into each well of 96-well plates with 200 μL of
   medium per well. After 24 h of incubation at 5% CO[2] and 37 °C, the
   cells were treated with 10 μM Tacedinaline (Targetmol), 10 μM
   Dorsomorphin (Macklin), 10 μM Vorinostat (Yeasen), or 4 μM Quinoclamine
   (Targetmol) for 3 days. Cell proliferation was assessed every 24 h
   using the CCK-8 solution (Beyotime).

Cell migration assay

   A Transwell assay (8 μm, Corning) was used to evaluate the migration
   ability of tumor cells. Briefly, 1×10^5 DLD1 cells were seeded in the
   upper chamber, while the lower chamber contained culture medium with
   10% FBS. After 24 h of incubation, the membrane of the upper chamber
   was carefully removed by a cotton swab, fixed in 4% paraformaldehyde
   for 30 min at room temperature and then stained with crystal violet for
   15 min. Migrated cells were quantified and counted under an inverted
   light microscope. Each group was established with three replicates.

Cell apoptosis assay

   For the apoptosis assay, cells were suspended with 100 μL binding
   buffer containing 5 μL of Annexin V-FITC and 5 μL of 7-AAD staining
   solution (SinoBiological). After staining for 15 min, 400 μL binding
   buffer was added to terminate the reaction. Flow cytometry was used to
   distinguish dead and apoptotic cell populations. Each group was
   established with three replicates.

Statistical analysis

   Enrichment analysis for differentially accessible chromatin regions was
   performed by GREAT [[92]49] (version 4.0.4). Most of the computational
   analyses and statistical tests were performed in R version 4.1.0. The
   Wilcoxon test and Kruskal–Wallis test were used for between-group
   comparisons as appropriate, and P values < 0.05 were considered
   statistically significant.

Results

Single-cell transcriptional and chromatin accessibility profiling in diverse
carcinomas

   To elucidate the chromatin and transcriptional features of human
   primary carcinomas, we curated publicly available scATAC-seq and
   scRNA-seq data, including breast cancer (BC), basal cell cancer (BCC),
   colon cancer (CC), endometrial cancer (EC), lung cancer (LC), ovarian
   cancer (OC), primary liver cancer (PLC), and renal cell cancer (RCC)
   (Fig. [93]1A, Supplementary Table [94]S1). After rigorous quality
   control, a total of 176,928 cells from scATAC-seq and 203,537 cells
   from scRNA-seq were retained for subsequent analysis (Supplementary
   Table [95]S2). A TSS plot of the data confirms a robust enrichment of
   unique open chromatin fragments at transcription start sites (TSS),
   while periodicity in fragment length reveals characteristic peaks
   indicative of high-quality ATAC-seq experiments (Fig. [96]1B). Cells
   were clustered for each cancer type separately and cell type identities
   were assigned to each cell. To integrate data from different samples,
   we used the Harmony algorithm [[97]25] for both scATAC-seq and
   scRNA-seq data to correct batch effects. Cluster annotations were
   accomplished by comparing differentially accessible regions and
   differential gene expression with markers associated with various cell
   types, including tumor cells (LGR5, EPCAM, CA9, KRT18), T cells
   (CD247), NK cells (NCR1), plasma cells (JCHAIN, FCRL5), myofibroblasts
   (ACTA2), myeloid cells (ITGAX, CD163), mast cells (KIT), fibroblasts
   (PDGFRA), endothelial cells (EMCN, PECAM1, PLVAP), melanocytes (MLANA),
   and B cells (MS4A1) (Fig. [98]1C, D, Supplementary Fig. [99]S1A, B,
   Table [100]S3). At the same time, we identified primary copy number
   variation (CNV) events in each cell based on its transcriptomic profile
   to confirm the annotations of tumor cells by marker genes
   (Supplementary Fig. [101]S1C). We presented the cell composition
   proportions for both scATAC-seq and scRNA-seq datasets in all cancer
   types (Fig. [102]1E). Moreover, we projected all the cells onto a UMAP,
   and cells identified as the same type clustered as expected (Fig.
   [103]1F, G).

Fig. 1. Overview of cellular composition across eight carcinomas through
scATAC-seq and scRNA-seq analysis.

   [104]Fig. 1
   [105]Open in a new tab

   A Schematic representation of the cancers used in this study, and
   downstream bioinformatic analyses. B Enrichment of scATAC-seq
   accessibility near transcription start sites (TSS) and fragment length
   distribution across all datasets. C UMAP plots of scATAC-seq cells
   color-coded by cell types found in eight carcinomas. D UMAP plots of
   scRNA-seq cells color-coded by cell types found in eight carcinomas. E
   Proportion of various cell types in scATAC-seq (left) and scRNA-seq
   (right) for eight types of carcinomas. UMAP plots show the embedding of
   integrated scATAC-seq data F and scRNA-seq data G.

   To identify variances in chromatin accessibility across diverse cell
   types, we identified 257,632 peaks in BC scATAC-seq data, 138,534 peaks
   in BCC scATAC-seq data, 220,003 peaks in CC scATAC-seq data, 329,815
   peaks in EC scATAC-seq data, 263,338 peaks in LC scATAC-seq data,
   326,761 peaks in OC scATAC-seq data, 162,992 peaks in PLC scATAC-seq
   data, and 167,873 peaks in RCC scATAC-seq data individually, utilizing
   MACS2 (Fig. [106]2A). Most of them were observed in more than 10 cells
   (Supplementary Fig. [107]S2A). The predominant presence of these
   accessible peaks is noted in intronic, distal intergenic, as well as
   promoter regions of the genome. Intriguingly, our scATAC-seq approach
   detects more open chromatin regions than those identified by
   bulk-tissue ATAC-seq (Fig. [108]2B). We then merged the accessible
   chromatin regions to obtain a list of 614,619 non-overlapping peaks,
   which covered 61.5% of the regulatory elements in the registry of cCREs
   published by the ENCODE consortium [[109]50] (Fig. [110]2B), suggesting
   that scATAC-seq extends the compendium of DNA regulatory elements. In
   accordance with published findings [[111]3, [112]51], which establish
   an inverse correlation between DNA methylation and chromatin
   accessibility at regulatory elements, we examined the methylation data
   from TCGA to further delineate the accessible potential of the peaks
   identified. Notably, methylation sites within peak regions exhibit
   lower methylation β values, while those in non-peak regions display
   higher methylation β values (Supplementary Fig. [113]S2B).

Fig. 2. Single-cell ATAC-seq extends the compendium of DNA regulatory
elements and reveals cancer genetic risks.

   [114]Fig. 2
   [115]Open in a new tab

   A Quantification of peaks identified in individual cancer-type
   scATAC-seq datasets and the combined dataset. Colors correspond to the
   indicated peak types. B Venn diagram illustrating the intersection
   between cCREs identified in scATAC-seq data and the registry of cCREs
   created by other projects. This includes BC scATAC-seq data
   intersecting with TCGA-BRCA bulk ATAC-seq data, CC scATAC-seq data
   intersecting with TCGA-COAD bulk ATAC-seq data, EC scATAC-seq data
   intersecting with TCGA-UCEC bulk ATAC-seq data, LC scATAC-seq data
   intersecting with TCGA-LUAD bulk ATAC-seq data, PLC scATAC-seq data
   intersecting with TCGA-LIHC bulk ATAC-seq data, RCC scATAC-seq data
   intersecting with TCGA-KIRC bulk ATAC-seq data, and integrated
   scATAC-seq data intersecting with the ENCODE consortium. The numbers on
   the left, right, top and bottom represent the number of unique peaks in
   scATAC-seq data, the number of unique peaks in TCGA data, the number of
   overlapping peaks in scATAC-seq data, and the number of overlapping
   peaks in TCGA data, respectively. C The number of DARs for all cell
   types in the scATAC-seq data of eight carcinomas. D Normalized
   scATAC-seq tracks for all cell types in CC (upper) and the eight types
   of tumor cells (lower) at the SNP rs4554811 locus. E Normalized
   scATAC-seq tracks for all cell types in CC (upper) and the five types
   of fibroblasts (lower) at the SNP rs704417 locus. Dot plot showing the
   significance of enrichment for selected traits from BC F, CC G, and OC
   H, within DARs from different cell types. Each circle represents a cell
   type.

   We then identified differentially accessible chromatin regions (DARs)
   across distinct cell types. Tumor cells, endothelial cells, and
   fibroblasts showed a higher prevalence of DARs, whereas myeloid cells,
   plasma cells, and B cells demonstrated comparatively fewer DARs (Fig.
   [116]2C, Supplementary Table [117]S4). To examine the heterogeneity
   within the same cell type across different cancers, we identified
   cell-type-specific cancer-associated DARs. These analyses indicate that
   tumor cells exhibit the highest degree of heterogeneity across the
   eight types of cancer, followed by fibroblasts, while immune cells
   display minimal heterogeneity (Fig. [118]S2C, Supplementary Table
   [119]S4).

Single-cell ATAC-seq can reveal distinct cancer genetic risks

   To determine the impact of our identified accessible chromatin regions
   on the genetic variation associated with carcinomas, we scrutinized the
   NHGRI-EBI GWAS catalog for BC, CC, EC, LC, OC, PLC, and RCC, and noted
   a higher prevalence of SNPs within our identified accessible chromatin
   regions in BC, CC, and OC (Supplementary Fig. [120]S2D). By overlaying
   the co-accessibility map with the chromatin accessibility signal and
   GWAS statistics along the genomic axis, we were able to identify known
   chromatin accessibility sites. SNP rs4554811 and rs704417 are
   associated with increased susceptibility to colon adenocarcinoma,
   consistent with the presence of focal chromatin accessibility in CC
   scATAC-seq data [[121]52, [122]53]. SNP rs4554811 located intronically
   in VTI1A, exhibits heightened accessibility in colon tumor cells, with
   weak accessibility in other tumor cells, presenting its specific
   inherited risk in CC susceptibility (Fig. [123]2D). Additionally, SNP
   rs704417 shows strong accessibility in fibroblasts from CC, as well as
   from EC, OC, LC, and BC, suggesting a potential role for it in
   previously unappreciated cancer contexts (Fig. [124]2E). Further
   analysis revealed that numerous GWAS SNPs were open in multiple cancer
   types, suggesting that they play similar regulatory roles across
   different cancer types (Supplementary Table [125]S5).

   To further investigate the genetic risk signals for BC, CC, and OC, we
   performed cell-type-specific linkage-disequilibrium score regression
   (LDSC) analysis of our scATAC-seq clusters using GWAS summary
   statistics in BC, CC, OC, as well as other relevant traits (Fig.
   [126]2F–H, Supplementary Table [127]S6). Intriguingly, immune-related
   diseases, such as systemic lupus erythematosus (SLE), rheumatoid
   arthritis (RA), ulcerative colitis (UC), and Crohn’s disease (CD),
   exhibit a notable correlation with immune cells. Additionally,
   fibroblasts and myofibroblasts exhibit associations with traits such as
   balding, coronary atherosclerosis (CAD), and body height (BH),
   consistent with prior investigations [[128]54]. With respect to BC,
   GWAS SNPs were enriched in open regions from tumor cells, and in CC,
   both fibroblasts and tumor cells display a significant enrichment for
   GWAS SNPs, whereas in OC, enrichment was observed in fibroblasts and
   myofibroblasts. We further revealed that the expression of marker genes
   in CC fibroblasts, OC fibroblasts, and OC myofibroblasts presented
   significant correlations with overall survival, suggesting dysfunctions
   in these stromal cells contribute to a worse prognosis (Supplementary
   Fig. [129]S3A–D).

Differentially accessible regions are candidate DNA regulatory elements

   To investigate potential variations in regulatory dynamics across
   different cancer types, we visually examined our scATAC-seq data. The
   regulatory elements of MYC, known for their high degree of tissue
   specificity and distinct locations in different cancer types [[130]55],
   served as our focal point. Our analysis revealed that MYC is widely
   expressed and its locus has extensive accessibility in various cell
   types from the tumor ecosystem (Supplementary Fig. [131]S4A, B).
   Remarkably, tumor cells exhibit extensive chromatin accessibility at
   both the 5′ and 3′ DNA elements of MYC, whereas other cells primarily
   display accessibility at the 3′ DNA elements. Furthermore, we focused
   on tumor cells extracted from integrated scATAC-seq data (Supplementary
   Fig. [132]S4C, D). Our finding revealed that tumor cells from BC, BCC,
   CC, EC, LC, and OC all have strong accessibility at both the 5′ and 3′
   DNA elements of MYC, while PLC and RCC show accessibility at the 3′
   regulatory elements, consistent with previous findings [[133]3] (Fig.
   [134]3A). The comprehensive nature of our scATAC-seq data is
   underscored by its ability to reveal the enhancer regions of MYC in CC,
   RCC, LC, and OC, reinforcing its credibility (Fig. [135]3A). We next
   turned our attention to NDGR1, a gene known for its pivotal role in
   cancer development and progression, but for which a comprehensive
   understanding of its transcriptional regulation is lacking. Notably,
   the top 5 regulatory elements documented in GeneCards surface in our
   scATAC-seq data analysis (Fig. [136]3B). Differential accessibility of
   these regulatory elements across distinct tumor cells suggests varied
   regulatory influences. Additionally, we identified a region
   (chr8-133320109-133321082) exhibiting high accessibility in BCC, EC,
   OC, and RCC, which is conspicuously absent in CC cells (Fig. [137]3B,
   C).

Fig. 3. Identification of candidate regulatory elements.

   [138]Fig. 3
   [139]Open in a new tab

   A Normalized scATAC-seq tracks for the eight types of tumor cells at
   the MYC locus. Enhancer regions of MYC are colored in yellow. B
   Normalized scATAC-seq tracks for the eight types of tumor cells at the
   NDRG1 locus. Reported enhancer regions of NDRG1 are shaded in yellow
   and the predicted enhancer region is marked in gray. C The rank of DARs
   for colon cancer cells with low accessibility. D Workflow of the
   construction for peak-gene link network. The triangle represents cells
   derived from scATAC-seq data, while the circle represents cells from
   scRNA-seq data. Different colors indicate distinct cell types. E
   Connection plots illustrating the interactions between the promoter
   region and predicted enhancer region for NDRG1 in BC, BCC, EC, LC, OC,
   PLC, and RCC. The p-value of peak-gene represents the p-value from the
   peak-gene correlation analysis. The co-accessibility score of
   peak-promoter represents the co-accessibility score calculated by
   Cicero.

   To explore the interaction between DNA regulatory elements and target
   genes, we predicted peak-gene links through integrated scATAC-seq and
   scRNA-seq analysis (Fig. [140]3D). We calculated correlations between
   peak accessibility and gene expression from the integrated dataset. In
   parallel, we constructed a co-accessibility network for distal peaks
   and the promoters of target genes. Confident peak-gene links are
   contingent on significant peak-gene correlation and a high
   co-accessibility score (Fig. [141]3D). In BCC, EC, OC, and RCC,
   correlations with NDRG1 expression have p-values of 9.3e-05, 1.4e-04,
   2.3e-04, and 1.0e-04, respectively, with co-accessibility scores of
   0.24, 0.32, 0.62 and 0.88 (Fig. [142]3E). However, in BC, LC, and PLC,
   the target region displays a low correlation with the expression of
   NDRG1 (p-value = 0.25, 0.34, and 0.19). These findings suggest that the
   target region serves as a potential CRE of NDRG1 in BCC, EC, OC, and
   RCC, while displaying no regulatory effect on NDRG1 in BC, LC, and PLC.
   Intriguingly, subsequent examination of DARs across distinct tumor
   cells indicates a similarity with all exhibiting weak accessibility in
   the target region for BC, LC, and PLC (Supplementary Fig. [143]S4E–G).

   Furthermore, we systematically estimated the proportion of regulatory
   regions identified by gene regulatory networks relative to DARs of
   different cell types across diverse cancer types. Notably, five types
   of cells exhibited more than 500 DARs, of which approximately 10 to 60%
   were identified as candidate regulatory DNA elements (Supplementary
   Fig. [144]S4H).

Cell-type-associated epigenetic regulation in different carcinomas

   Using the approach described above, we identified 20,109, 11,955,
   38,292, 39,640, 18,976, 15,202, 10,140, and 13,368 unique peak-gene
   link pairs in the single-cell data from BC, BCC, CC, EC, LC, OC, PLC,
   and RCC, respectively (Fig. [145]4A). The number of peak-gene link
   pairs correlates with the number of cells from single-cell ATAC-seq
   data (Fig. [146]4B). By examining the overlap between these linked
   peaks and the DARs identified in each cell type, we identified
   cell-type-associated regulatory regions (Fig. [147]4C), which can
   reveal the specific cell functions of cancer genetic risk loci. For
   instance, SNP rs4554811 and rs704417 exhibit predicted interactions
   with nearby transcript promoter regions, indicative of a potential
   regulatory role for TCF7L1 and PRICKLE2 (Supplementary Fig. [148]S5A,
   B).

Fig. 4. Cell-type-associated epigenetic regulation in the eight carcinomas.

   [149]Fig. 4
   [150]Open in a new tab

   A Number of identified peak-gene links in the eight carcinomas. B The
   scatter plot and correlation between the cell number in scATAC-seq data
   and the number of peak-gene links (R represents Pearson’s correlation
   and its coefficient of determination, the p-value is 0.08). C Number of
   cell-type-associated peak-gene links in the eight carcinomas. D Heatmap
   representation of the cell-type-conserved regulatory elements
   predicted. Each row represents an individual scATAC-seq peak. Color
   represents the scATAC-seq log normalized accessibility for each peak. E
   GREAT ontology enrichment analysis of cell-type-conserved regulatory
   elements. F Circular bar plot of twelve overlapping pathways in the
   eight carcinomas. The length of the bar represents the number of genes
   in the indicated pathway, and the color intensity of the bar indicates
   the p-value. G Enriched motifs in different cell types. Motifs are
   ranked by p-values from smallest to largest. Motif plots of SPIB H,
   ELK4 I, FOS J, and KLF4 K. L Gene Ontology enrichment analysis of KLF4
   regulatory genes about EMT-related pathways.

   We further identified cell-type-conserved regulatory regions in
   different carcinomas, recognized as candidate regulatory elements not
   specific to a single carcinoma (Fig. [151]4D). Tumor cells, T cells,
   myeloid cells, myofibroblasts, fibroblasts, and endothelial cells
   present substantial conserved regulatory regions. To assess the
   potential functions of conserved epigenetic regulation in tumor cells,
   we employed GREAT ontology enrichment analysis on conserved chromatin
   regions and gene ontology (GO) enrichment analysis on their target
   genes. The results revealed that tumor-conserved regions predominantly
   regulate cell junction and epithelial cell development-related pathways
   (Fig. [152]4E, Supplementary Fig. [153]S5C). Moreover, we individually
   inspected the functions of tumor cell-associated regulatory elements by
   performing GO enrichment analysis on their predicted regulatory genes.
   Twelve processes were enriched in all tumor cells, including pathways
   related to cell junction and epithelial cell development, consistent
   with previous findings (Fig. [154]4F, Supplementary Fig. [155]S5D). We
   also explored the potential functions of conserved regulatory regions
   in other cell types. For instance, T cells mainly regulate T cell
   activation, myeloid cells influence the immune response, fibroblasts
   contribute to extracellular matrix organization, and both
   myofibroblasts and endothelial cells play roles in blood vessel
   development (Supplementary Fig. [156]S5E, F).

   We next applied motif enrichment analysis on cell-type-conserved
   regulatory regions to reveal key TFs of each cell type (Fig. [157]4G).
   Myeloid cells and T cells are enriched for the binding sites of
   expected cell-type-activated TFs, such as SPIB and ELK4. Fibroblasts,
   on the other hand, showed a strong enrichment for AP-1 TF binding sites
   (Fig. [158]4H, J). Notably, tumor cells exhibit an enrichment in
   binding sites for Krüppel-like factors, similar to endothelial cells
   and myofibroblasts. We identified tumor cell-conserved regulatory
   regions with KLF4 binding sites and performed GO enrichment analysis on
   the target genes of KLF4. This analysis revealed that genes regulated
   by KLF4 are significantly enriched in ontology terms related to
   epithelial-mesenchymal transition (EMT) (Fig. [159]4K, L), indicating
   that KLF4 plays a pivotal role in regulating genes associated with
   mesenchymal characteristics, consistent with previous studies
   [[160]56].

Transcription factor activities in different cell types

   In prior studies, which elucidated the role of small sets of TFs in
   governing gene expression programs in tumor cells [[161]57, [162]58],
   we sought to identify pivotal TFs participating in oncogenic
   transcriptional programs. To achieve this, we implemented a filtering
   strategy to identify highly specific TFs enriched across different cell
   types in each cancer type. Initially, we calculated the differentially
   enriched motifs through chromVAR bias-corrected scores for all cell
   types, selecting those highly enriched across diverse cell types.
   Subsequently, utilizing footprint analysis, we assessed the binding
   status of TFs across the genome for different cell types. In each cell
   type, TFs with a higher degree of activity and binding than those in
   other cell types were identified as cell-type-associated TFs (Fig.
   [163]5A, Supplementary Table [164]S7). Employing UMAP with
   bias-corrected chromVAR scores for identified cell-type-enriched
   motifs, we effectively clustered single cells into distinct cell types,
   illustrating the efficacy of our strategy in screening for
   cell-type-associated TFs (Fig. [165]5B).

Fig. 5. Transcription factor activities in different cell types in the eight
different carcinomas.

   [166]Fig. 5
   [167]Open in a new tab

   A Heatmap illustrating chromVAR bias-corrected deviation scores for the
   differential TF motifs of various cell types in eight different
   carcinomas. B UMAP visualization of single cells based on chromVAR
   bias-corrected deviations for identified motifs. C TF footprinting of
   TEAD TFs in diverse cell types in the integrated scATAC-seq data. D Bar
   plot of TEAD TFs analyzed by comparing the chromVAR score of tumor
   cells with other cell types. E Bar plot of TEAD TFs analyzed by
   comparing the gene expression of tumor cells with other cell types. F
   Dot plot of TEAD TFs regulating cancer-associated pathways in the eight
   carcinomas. The size of the dot represents the number of TF regulatory
   genes, and the color intensity of the dot indicates the p-value. G
   Kaplan-Meier survival analysis of TCGA cancer (COAD, LIHC, LUAD, KIRC,
   OV, and UCEC) patients with high (top 50%) and low (bottom 50%)
   expression for the TEAD1/2/3/4 gene.

   We made an intriguing discovery that AP-1 TFs exhibit extensive binding
   across different cell types, with elevated activity in tumor cells,
   fibroblasts, myofibroblasts, and endothelial cells (Supplementary Fig.
   [168]S6A, B). To evaluate the impact of AP-1 TFs on gene expression
   regulation, we mapped their binding sites along with those of 588 other
   TFs in pairs-linked regulatory regions. Our findings revealed that AP-1
   TFs exhibit a more extensive binding pattern compared with other TFs,
   indicating their broader regulation of gene expression (Supplementary
   Fig. [169]S6C).

   Our investigations also revealed that the identified
   cell-type-associated TFs in multiple cancer types are similar. For
   instance, EOMES and TWIST1 are highly activated in T cells and
   fibroblasts, respectively, indicating a high degree of reliability and
   consistency. We further performed GO enrichment analysis on the
   regulatory genes of EOMES in different carcinomas individually, based
   on the identified peak-gene link networks. The results show that EOMES
   is predominantly involved in the regulation of T cell activation and
   other immune-related pathways (Supplementary Fig. [170]S7A). Similarly,
   TWIST1 is involved in regulating extracellular matrix organization and
   mesenchyme development, correlating with expected fibroblast functions
   (Supplementary Fig. [171]S7B). However, distinct TF activity profiles
   emerged from analyses of different tumor cells. Notably, TEAD1/2/3/4
   are activated in tumor cells across most cancer types, except for BC,
   displaying higher binding status and expression levels compared to
   other cells in the integrated scATAC-seq and scRNA-seq data (Fig.
   [172]5C–E). TEAD family TFs are critical regulators of various
   oncogenic pathways, including the Hippo signaling pathway, Wnt
   signaling pathway, and TGF-β signaling pathway. Our investigation
   revealed that TEADs significantly regulate these pathways in various
   tumor cells (Fig. [173]5F). They also participate in the regulation of
   conserved regulatory pathways in tumor cells, such as cell junction and
   epithelial cell development. Additionally, overexpression of
   TEAD1/2/3/4 is significantly linked to poorer overall survival across
   six cancers (COAD, LUAD, LIHC, KIRC, OV, and UCEC) in TCGA data (Fig.
   [174]5G). Across all available TCGA datasets, TEAD family TFs
   consistently correlate with poorer outcomes, highlighting their
   consistent impact on different tumor cells and suggesting their
   potential as prognostic biomarkers (Supplementary Fig. [175]S7C).

Tumor-specific transcription factors regulate malignant transcriptional
programs in colon cancer

   We found that certain previously identified tumor cell-associated TFs
   are highly activated in normal epithelial. For instance, HNF4A exhibits
   high activity in colon and kidney epithelium [[176]13, [177]59], while
   TFAP2C has been reported to play a crucial role in promoting epithelial
   gene expression [[178]60]. To identify tumor-specific TFs, we conducted
   a comprehensive analysis of scATAC-seq and scRNA-seq data from normal
   colon tissues (Fig. [179]6A, B, Supplementary Fig. [180]S8A, B). Five
   TFs (i.e. CEBPG, LEF1, SOX4, TCF7, and TEAD4) display higher chromVAR
   scores, increased binding, and elevated expression in tumor cells
   compared to normal epithelial cells (Fig. [181]6C, D, Supplementary
   Table [182]S8).

Fig. 6. Tumor-specific TFs regulate malignant transcriptional programs in
colon cancer.

   [183]Fig. 6
   [184]Open in a new tab

   A Integrated UMAP projection based on scATAC-seq data, encompassing all
   cells from CC tissues and normal colon tissues. B Integrated UMAP
   projection based on scRNA-seq data, encompassing all cells from CC
   tissues and normal colon tissues. C Dot plot presenting changes of
   chromVAR bias-corrected deviation scores and gene expression between
   tumor cells and normal epithelial cells for the identified colon
   tumor-specific TF motifs. D TF footprinting of CEBPG, LEF1, SOX4, TCF7,
   and TEAD4 motifs in CC cells and normal colon epithelial cells. E
   Pairwise correlations between the expression profiles of three
   scRNA-seq CC samples. F Heatmap of average correlations across three
   samples with CC between pairs of programs. The red box indicates the
   four meta-programs. G Dot plot displaying significantly enriched
   pathways for signature genes of each meta-program. The dot size is
   proportional to the relative number of genes in the indicated pathway,
   and the color intensity indicates the p-value after FDR correction. H
   Proportion of the five identified tumor-specific TF motifs found in the
   regulatory elements for the signature genes for each of the four
   meta-programs (*p < 0.05, **p < 0.01, ***p < 0.001, ns: not
   significant).

   To better understand the regulatory landscape driven by the five
   tumor-specific TFs (i.e. CEBPG, LEF1, SOX4, TCF7, and TEAD4) in CC
   cells, we utilized scRNA-seq data from CC tissues. We began with a
   pairwise correlation analysis, which revealed distinct transcriptional
   states consistently present within these tumors (Fig. [185]6E). Using
   nonnegative matrix factorization (NMF), we sought to elucidate the
   underlying transcriptional programs comprising co-expressed genes, and
   ultimately extracted 10 malignant transcriptional programs from the
   scRNA-seq data derived from three patients with CC (Supplementary Fig.
   [186]S8C). Subsequent hierarchical clustering led to the identification
   of four meta-programs that encompassed highly similar programs across
   the three patient samples (Fig. [187]6F, Supplementary Fig. [188]S8D-E,
   Table [189]S9). Genes within meta-program 1 exhibit significant
   enrichment in the WNT signaling pathway, while meta-program 2 is
   predominantly enriched in pathways related to cell junction.
   Meta-program 3 and 4 showed enrichment in pathways associated with
   protein synthesis and DNA replication, respectively (Fig. [190]6G). To
   assess the impact of meta-programs on cellular viability in tumor
   cells, we evaluated the effect of loss-of-function of the top 100 genes
   in each meta-program on tumor cell viability, by examining CRISPR-Cas9
   screening data from The Cancer Dependency Map Consortium (DepMap).
   Dependency scores (CERES) are normalized, with scores lower than 0
   indicating essential genes for the given cell line. Notably, the mean
   CERES of most genes from 33 CC cell lines are lower than 0, suggesting
   they are essential in colon tumor cells (Supplementary Fig. [191]S8F).

   Next, we reviewed the top 100 genes within these four meta-programs and
   examined their linked peaks. Notably, motifs of the five tumor-specific
   TFs were significantly enriched in these peaks, indicating their key
   role in regulating gene expression in colon tumor cells (Fig. [192]6H).
   We further performed gene ontology enrichment analysis on the
   perturbagen’s signature of CEBPG, LEF1, SOX4, TCF7, and TEAD4, and the
   results reveal that the five tumor-specific TFs present extensive
   regulation on meta-program related pathways, consistent with the
   analysis before (Supplementary Fig. [193]S8G).

   We further investigated the regulatory interactions among CEBPG, LEF1,
   SOX4, TCF7, and TEAD4 in colon cancer cells. Our constructed regulatory
   network in colon cancer revealed that LEF1 is under the regulation of
   SOX4 and TCF7, whereas SOX4 is regulated by LEF1, TCF7, and TEAD4.
   Additionally, CEBPG is regulated by LEF1, TEAD4, and SOX4, while TEAD4
   is regulated by LEF1 and TCF7 (Supplementary Fig. [194]S9A). These
   regulatory dynamics are further confirmed through TF perturbation
   experiments (Supplementary Fig. [195]S9A, B). Analyzing the COAD
   RNA-seq data from TCGA, we observed strong correlations among the
   expressions of LEF1, SOX4, and TCF7 (Supplementary Fig. [196]S9C).
   These findings provide evidence of their regulatory relationships,
   indicating the presence of a transcriptional circuitry among
   tumor-specific TFs.

The importance of tumor-specific TFs is widely supported

   To explore the significance of tumor-specific TFs, we performed
   single-cell multiome sequencing on paired CC tissues and adjacent
   normal tissues for two patients (“Patient 1” and “Patient 2”) (Fig.
   [197]7A, Supplementary Table [198]S10). Additionally, we curated
   another dataset comprising scATAC-seq and scRNA-seq from the same
   patient, comparing tumor and normal samples (“Patient 3”). By
   integrating both transcriptomic and epigenetic data from the same
   patient samples, we minimized inter-patient variability (Fig. [199]7B,
   Supplementary Fig. [200]S10A–D). Our analysis revealed that CEBPG,
   LEF1, SOX4, TCF7, and TEAD4 exhibited higher expression levels and
   increased chromVAR scores in tumor cells compared to normal epithelial
   cells in the three patients (Fig. [201]7C). We also assessed their
   binding status across the entire open chromatin genome in tumor and
   epithelial cells, observing notably increased binding in tumor cells
   (Fig. [202]7D, Supplementary Table [203]S8). Moreover, we integrated
   scATAC-seq data from eight colon tumor samples and seven normal colon
   samples, along with scRNA-seq data from twenty-three colon tumor
   samples and ten normal colon samples, which are collectively labeled as
   “Validation” (Supplementary Fig. [204]S10E–H). The tumor-specific TFs
   with high activity in colon tumor cells, including CEBPG, LEF1, SOX4,
   TCF7, and TEAD4, exhibited elevated expression, higher chromVAR scores,
   and increased binding compared to normal epithelial cells
   (Supplementary Fig. [205]S10I, J, Table [206]S8). Further analysis
   using the TCGA and GTEx datasets confirmed that these TFs are expressed
   at higher levels in tumor tissue samples compared to normal tissue
   samples (Fig. [207]7E).

Fig. 7. The importance of tumor-specific TFs is widely supported.

   [208]Fig. 7
   [209]Open in a new tab

   A Schematic representation of the single-cell multiome sequencing
   experiments in this study. B Integrated UMAP projection based on
   scRNA-seq and scATAC-seq data, encompassing all cells from CC tissues
   and normal colon tissues of 3 patients. C Dot plot presenting changes
   of chromVAR bias-corrected deviation scores and expression levels for
   the identified colon tumor-specific TFs across tumor cells and normal
   epithelial cells. D TF footprinting of CEBPG, LEF1, SOX4, TCF7, and
   TEAD4 motifs in CC cells and normal colon epithelial cells for three
   patients. E Expression levels of the identified colon tumor-specific
   TFs in CC tissue samples (orange) and normal tissue samples (gray)
   obtained from the GEPIA2 database (*p < 0.05). F Downregulation of
   CEBPG, LEF1, SOX4, TCF7, and TEAD4 mRNA expression in the DLD1 cell
   line by shRNA knockdown. Statistically significant differences are
   determined by two-way ANOVA (*p < 0.05, **p < 0.01, ***p < 0.001). G
   Effects of sh-CEBPG, sh-LEF1, sh-SOX4, sh-TCF7, sh-TEAD4, and vector
   (control) knockdown on DLD1 cell proliferation as determined by a cell
   proliferation assay. Statistically significant differences are
   determined by two-way ANOVA (*p < 0.05, **p < 0.01, ***p < 0.001).

   To further our understanding and validate the biological functions of
   these tumor-specific TFs, we carried out targeted knockdown experiments
   (Fig. [210]7F, Supplementary Fig. [211]S11A). The results were
   striking, showing a substantial reduction in tumor cell proliferation
   by 5 days after TF knockdown (Fig. [212]7G). Moreover, migration was
   reduced, and apoptosis was induced in DLD1 cells with tumor-specific TF
   deletion, compared to control cells (Supplementary Fig. [213]S11B, C).
   To identify potential drugs targeting these TFs, we took advantage of
   the LINCS consortium, which contains 19,811 small-molecule
   compound-perturbed profiles [[214]47]. Through careful screening, we
   identified four pharmacological agents—tacedinaline, quinoclamine,
   dorsomorphin, and vorinostat—each targeting CEBPG, LEF1, SOX4, and
   TEAD4, respectively. Consistent with our expectations, these four
   compounds substantially attenuated both the expression levels of CEBPG,
   LEF1, SOX4, and TEAD4, and affected malignant phenotypes in the DLD1
   cell lines, including proliferation, migration, and apoptosis
   (Supplementary Fig. [215]S11D–H).

Discussion

   As an integrated approach, scATAC-seq and scRNA-seq data from diverse
   forms of carcinomas can provide a comprehensive analysis of chromatin
   accessibility and transcriptional characteristics within the tumor
   ecosystem. Using scATAC-seq data, we identified hundreds of thousands
   of cCREs in the cancer genome, effectively expanding the compendium of
   DNA regulatory elements compared to the bulk ATAC-seq data from TCGA.
   The robust overlap observed between methylation data and scATAC-seq
   data serves to underscore the reliability and consistency of our
   findings across platforms. These results provide a solid foundation for
   the subsequent exploration.

   Using identified open chromatin regions limited to specific cell types,
   our investigations focus on discerning the potential regulatory roles
   of selected SNPs that are significantly associated with cancer.
   Currently, the scrutiny of SNPs within non-coding regions revolves
   predominantly around their gene regulatory functions within tumor cell
   lines [[216]61–[217]64]. Rather than just focusing on tumor cells, our
   single-cell ATAC-seq data provides a more comprehensive view, allowing
   us to identify which cell types exhibit heightened accessibility to
   these cancer-risk SNPs. This insight enables a more precise
   understanding of the specific cellular functions regulated by these
   SNPs, shedding light on their roles in cancer initiation and
   progression. Our analysis also revealed significant enrichment of GWAS
   SNPs associated with cancer in fibroblasts and myofibroblasts, as shown
   by our cell-type-specific LDSC analysis. These findings underscore the
   importance of studying the regulatory functions of SNPs within these
   two stromal cell types, opening up new avenues for research and
   emphasizing the importance of considering the diverse cellular contexts
   in which cancer-risk SNPs exert their regulatory effects.

   Identifying the target genes of cCREs is important to reveal the cancer
   gene regulatory network [[218]3]. To ensure the network’s reliability,
   we adopted the method established by Granja et al. [[219]41] to assess
   the correlations between peaks and target genes. In parallel, Cicero
   analysis was performed to calculate the co-accessibility scores between
   peaks and the target gene promoters, which helped to identify more
   robust peak-gene links. Previous studies have employed similar
   strategies [[220]65, [221]66]; however, their accessibility-expression
   correlation is calculated across cell types or samples, which may limit
   the scope of conclusions due to sample size constraints and cell type
   diversity. Overall, we constructed a peak-gene link network by
   integrated analysis of scATAC-seq data and scRNA-seq data, offering a
   more comprehensive understanding of the genome-wide regulatory
   landscape of gene expression. This integrated analysis not only
   provides deeper insights into differential cancer gene regulation but
   also highlights the regulatory roles of genetic risk SNPs. Previous
   studies have discovered that cancer epigenetic regulation is
   heterogeneous [[222]67, [223]68], highlighting the importance of
   identifying conserved regulatory elements in specific cell types. Our
   analysis identifies conserved regulatory elements in tumor cells, T
   cells, myeloid cells, fibroblasts, myofibroblasts, and endothelial
   cells, revealing the potential functions they regulate and
   demonstrating inherent regulatory patterns across different carcinomas.

   Previous reports have profiled TFs across various cancers; for
   instance, Jessica et al. predicted master TFs in cancers based on
   RNA-seq data [[224]58], while Nadezhda et al. focused on altered TFs
   during cancer transitions using expression data from scRNA-seq
   [[225]12]. However, our analysis focuses on identifying TFs activated
   in specific cell types within the tumor ecosystem using scATAC-seq
   data. This approach reveals regulons involved in cancer transcriptional
   programs and offers novel insights for identifying potential
   therapeutic targets. By analyzing tumor-cell highly activated TFs, we
   found that TEAD family TFs exhibit elevated activity in tumor cells
   across most carcinomas, which are reported as important regulators of
   tumor-associated pathways [[226]69, [227]70] and widely regarded as
   targets for cancer treatment [[228]71]. We also noted that their
   increased expression is associated with a poor prognosis, indicating
   their potential as biomarkers and demonstrating the effectiveness of
   our scATAC-seq approach. Our focused investigations identified
   tumor-specific TFs in CC, by strategically excluding those TFs highly
   activated in normal epithelial cells from which the tumor originates.
   These tumor-specific TFs appear to present potential functions to
   regulate each other and play an important role in governing malignant
   transcriptional programs, acting like master TFs [[229]72]. Through
   bioinformatic analyses and experiment validation, these TFs may be
   potential targets for therapeutic interventions. For example, TCF7 and
   LEF1, key mediators of WNT signaling, emerged as major regulators in
   our study, consistent with their well-established role as drivers of CC
   [[230]73, [231]74]. Consistent with our findings, drugs targeting TCF7
   and LEF1 have been used in several studies to treat CC cells
   [[232]75–[233]77], with promising therapeutic responses. We also
   identified CEBPG, SOX4 and TEAD4, which widely control the gene
   expression of CC cells, exhibiting a strong impact on the malignant
   phenotype of colon tumor cells. While CEBPG has been extensively
   studied in myeloid and lymphoid leukemia [[234]78, [235]79], our
   findings revealed its key role in CC, but the underlying mechanisms
   remain to be investigated in future studies. Our strategy for
   identifying tumor-specific TFs is broadly applicable and can help to
   provide therapeutic targets for more epithelial cancers.

   It is crucial to acknowledge the limitations of our study to prevent
   unwarranted overinterpretation. Firstly, the sample size of our cancer
   datasets is constrained. While we identified a substantial number of
   cCREs compared to bulk ATAC-seq, augmenting the size of our dataset
   would enhance our ability to capture additional cCREs, affording a more
   comprehensive portrayal of the cellular heterogeneity within the tumor
   ecosystem and facilitating a more nuanced exploration of differential
   gene regulation. Secondly, our analyses of normal tissues are limited
   to colon samples, and thus the identification of tumor-specific TFs for
   BC, BCC, EC, LC, OC, PLC, and RCC necessitates additional studies.
   Lastly, our insights into gene expression regulation stem from the
   predictive power of integrated scATAC-seq and scRNA-seq data. While
   these findings provide valuable resources, they require rigorous
   experimentation in future studies.

   Although the molecular causes of cancer continue to elude our complete
   understanding, our investigation provides novel insights into the
   complex landscape of cancer-associated gene expression regulation. This
   is particularly noteworthy concerning genomic loci associated with
   well-defined heritable disease risks. Our analytical framework stands
   as a blueprint, offering a strategic approach for investigating complex
   traits by exploiting the power of single-cell multi-omics data.
   Ultimately, these and other studies will lead to improved design of
   rational therapeutic strategies.

Supplementary information

   [236]SupplementaryMaterials^ (11.8MB, pdf)
   [237]UncroppedWesternBlots^ (515.7KB, pdf)
   [238]aj-checklist^ (1.8MB, pdf)

Acknowledgements