Abstract
The hierarchical organization of the eukaryotic genome is crucial for
nuclear activities and cellular development. Genetic aberrations can
disrupt this 3D genomic architecture, potentially driving oncogenesis.
However, current research often lacks a comprehensive perspective,
focusing on specific mutation types and singular 3D structural levels.
Here, pathological changes from chromosomes to nucleotides are
systematically cataloged, including 10 789 interchromosomal
translocations (ICTs), 18 863 structural variants (SVs), and 162 769
single nucleotide polymorphisms (SNPs). The multilayered analysis
reveals that fewer than 10% of ICTs disrupt territories via potent 3D
interactions, and only a minimal fraction of SVs disrupt compartments
or intersect topologically associated domain structures, yet these
events significantly influence gene expression. Pathogenic SNPs
typically show reduced interactions within the 3D genomic space. To
investigate the effects of variants in the context of 3D organization,
a two‐phase scoring algorithm, 3DFunc, is developed to evaluate the
pathogenicity of variant–gene pairs in cancer. Using 3DFunc, IGHV3‐23′s
critical role in chronic lymphocytic leukemia is identified and it is
found that three pathological SNPs (rs6605578, rs7814783, rs2738144)
interact with DEFA3. Additionally, 3DGAtlas is introduced, which
provides a highly accessible 3D genome atlas and a valuable resource
for exploring the pathological effects of genetic mutations in cancer.
Keywords: 3D genome, cancer, Hi‐C, SNP, structure variants
__________________________________________________________________
The hierarchical organization of the eukaryotic genome is vital for
nuclear function, and disruptions from genetic mutations can alter this
3D architecture. Cataloging thousands of interchromosomal
translocations, structural variants, and single nucleotide
polymorphisms, their impact on 3D genome organization is revealed. The
scoring algorithm, 3DFunc, and the comprehensive 3DGAtlas resource
provide powerful tools and new insights into cancer‐related genetic
alterations.
graphic file with name ADVS-12-2408420-g005.jpg
1. Introduction
To systematically elucidate the complex circuit of connections that
exist between regulatory elements and genes, it is necessary to
consider that interphase chromatin is folded in 3D in a
cell‐type‐specific manner.^[ [36]^1 , [37]^2 ^] In recent years,
chromosome conformation capture (3C) assays, in combination with
next‐generation sequencing, have provided new insights into the global
organization of the genome. Since the advent of these technologies, we
have known that interphase genomic organization is multitiered. Each
chromosome in the human genome occupies independent spatial
territories, which play a role in the regulation of transcriptional
activity and preferential positioning of loci within the nucleus.^[
[38]^3 ^] The analysis of Hi‐C data further refined the large scale of
territories into two sets of mega‐base‐sized regions called “A” and “B”
compartments.^[ [39]^4 ^] Compartment A is enriched in the regions of
high gene density, active histone markers, early replication, and open
chromatin.^[ [40]^4 ^] Compartment B has the opposite features,
compared with A, and is associated with lamina‐associated domains, low
transcriptional activity, and late replication, and hence is enriched
in heterochromatin.^[ [41]^4 ^] The topologically associated domains
(TADs) are mega‐base‐pair (Mb)‐sized genomic regions, and the TAD
boundary regions contribute to the regulation of gene expression by
limiting the interactions between cis‐regulatory elements and target
genes. Finally, genomic loci can form specific long‐range looping
interactions within or across TAD boundaries, through which the
regulatory elements such as enhancers and insulators play a crucial
role in controlling the gene expression profile of a cell in a
context‐dependent manner.^[ [42]^5 , [43]^6 ^]
At each layer in the genome organization hierarchy, the folding
patterns exhibit a complex connection to the maintenance of genomic
function, and mutations that effect any of these layers may lead to
disease.^[ [44]^7 , [45]^8 ^] Some examples include the dissolution of
chromosomal territories, as observed in both breast and prostate
cancers.^[ [46]^9 ^] Altered genomic compartmentalization has also been
reported in cancer cells. For example, by comparing normal breast cells
(MCF‐10A) with their cancerous counterpart (MCF‐7 cells), it was found
that a homogeneous switching of 12% of all chromosomal compartments
would occur.^[ [47]^10 ^] Although the TAD structure has been found to
be a general property of the interphase chromatin across different cell
types, further studies have suggested that TADs are not simple stable
interactions that are formed between two permanent genomic loci;
rather, they are dynamic in nature.^[ [48]^11 ^] The deletion of the
Epha4 gene and CTCF associated boundary eliminates a TAD boundary,
which causes the Epha4 promoter to interact with the Pax3 gene and
drive misexpression of Pax3.^[ [49]^12 ^] Genomic duplication events
can affect the expression of many genes, and spurious TAD formation can
also lead to human disease.^[ [50]^13 ^] However, other studies have
shown that the interruption of TAD boundaries has no obvious impact on
gene expression.^[ [51]^14 , [52]^15 ^]
A chromatin loop is formed when two distant genomic loci are physically
closer than their intervening sequences. A classic example of
long‐range gene regulation involves the Shh gene, the expression of
which is regulated by an enhancer element ≈1 Mb away.^[ [53]^16 ^]
Combining 3C techniques with genome‐wide association studies (GWAS)
holds great potential for identifying new putative target
genes/pathways for intergenic‐disease‐associated single nucleotide
polymorphisms (SNPs). Recent studies showed that most
disease‐associated SNPs reside within the regulatory elements and/or
transcriptional factor binding sites in the noncoding regions of the
genome and are likely to act through long‐range chromosomal
interactions. Similar effects have also been shown in cases of prostate
cancer, breast cancer,^[ [54]^17 , [55]^18 ^] and multiple other
cancers.^[ [56]^19 ^]
Current methods for assessing variant impact, such as CADD,^[ [57]^20
^] fathmm‐MKL,^[ [58]^21 ^] and FunSeq2,^[ [59]^22 ^] provide valuable
insights into variant pathogenicity by integrating diverse biological
data. CADD evaluates both coding and noncoding variants through a
holistic scoring system, fathmm‐MKL leverages multiple kernel learnings
to predict variant effects using evolutionary and phenotypic data, and
FunSeq2 focuses on prioritizing noncoding regulatory variants that may
disrupt gene regulation.
In this study, we curate the pathological alterations from the
chromosome level down to single nucleotides, including 10 789
interchromosomal translocations (ICTs), 18 863 structure variants
(SVs), and 162 769 SNPs. We then analyze the 3D disruptions caused by
these curated variants in four layers and observe that less than 10% of
ICTs interrupted territories through strong 3D interactions, and very
few SVs interrupted compartments or crossed TAD structures. However,
these small‐scale events were found to impact gene expression
significantly. Many SNPs residing within regulatory elements and/or in
transcription factor binding motifs appeared to exert their effects by
impacting loop strength. To assess the transcriptional impact of
genetic variants within the framework of 3D genome organization, we
introduce 3DFunc, a novel approach that goes beyond traditional tools
by incorporating spatial genome architecture into its analysis. Unlike
these methods, 3DFunc uniquely integrates gene expression data from 35
tissues with Hi‐C profiles from 33 tissues, allowing it to score the
causality of genetic variants on transcription within a 3D genomic
context. The application of 3DFunc to myelodysplastic syndrome (MDS)
highlights its ability to detect causal variant–gene pairs under the
context of 3D organization. Finally, we assemble a publicly available
database, 3DGAtlas, to provide all the curated variants, the
corresponding 3D layer disruptions, as well as the scoring results
derived from 3DFunc.
2. Results
2.1. Disruptive ICTs with Strong 3D Interactions are Highly Pathogenetic
Genomic rearrangements are implicated in the pathogenesis of many types
of cancer.^[ [60]^23 ^] Rearrangements occur when two or more
double‐stranded DNA breaks are located proximally enough to fuse
together (Figure [61]1A), and recent studies have revealed that
translocation can lead to gene fusions, dysregulated gene expression,
and novel molecular functions^[ [62]^24 ^] (Figure [63]1B).
Figure 1.
Figure 1
[64]Open in a new tab
Curation of disruptive ICTs and correlation with 3D interactions. A,B)
Diagram depicting that the ICTs interrupt territories, and ICTs lead to
GFs. C) The overall distribution of high frequency fusion pairs (>10)
across genome. D) Validation of merged fusion pairs. E) Average IF of
merged fusion pairs across 12 cell lines. Difference between strong IF,
weak IF, and no IF was calculated with Kruskal–Wallis test with Dunn
test post‐hoc, **** p‐value < 0.0001. F) Circos plot of fusion pairs
with strong 3D interactions. G) The IF values for individual cell
lines, distinguishing between cancer and normal cells.
To investigate how interchromosomal translocations impact gene
expression through spatial proximity, we curated 10 789 unique ICTs,
which included 1104 unique gene fusions (GFs), and 1003 unique genes
(Table [65]S1, Supporting Information). There were 585 gene pairs
appearing with high a frequency (>10) across the genome
(Figure [66]1C). To validate the pathogenicity of merged events, we
overlaid the involved genes with driver cancer genes from DriverDBV3,^[
[67]^25 ^] which showed an 85.1% overlap. The merged fusion events
showed an 84.7% overlap with TumorFusions,^[ [68]^26 ^] a 57.6% overlap
with fusions from TCGA markers, and a 57.1% overlap with oncogenes from
the Cancer Gene Census^[ [69]^27 ^] (Figure [70]1D and Table [71]S1
(Supporting Information)). Specifically, we identified the set of
cancer‐related fusion gene pairs by overlapping fusion events from four
published resources listed in Figure [72]1D. The results demonstrate
that cancer‐related fusion gene pairs exhibit significantly higher
interaction frequencies (IFs) compared to nonrelated gene pairs,
reinforcing the link between these gene pairs and the distinct
chromatin structure alterations observed in cancer cells (Figure
[73]S1, Supporting Information).
To correlate the pathogenic gene fusions with the spatial structure of
genome territories, we calculated the flexible IF between each gene
pair with high‐resolution Hi‐C data from 12 different cell lines. The
results showed that 8.6% of the fusion pairs had strong 3D interactions
with each other and 42.7% had weak 3D interactions (Figure [74]1E).
Within the fusion pairs with strong 3D interactions, some of the
pathogeneses have already been reported, such as PAX5, a transcription
factor crucial for B‐cell commitment and maintenance, which typically
fuses with KIAA1549L in childhood B‐cell precursor ALL.^[ [75]^28 ^] In
pediatric acute leukemias, reciprocal chromosomal translocations
frequently cause gene fusions involving the lysine (K)‐specific
methyltransferase 2A gene (KMT2A); specific KMT2A fusion partners are
associated with the disease phenotype (lymphoblastic vs myeloid).
MLLT10, PDS5A, AFF4, and so on, are common fusion partners of KMT2A ^[
[76]^29 ^] (Figure [77]1F). Overall, only less than 10% of GFs
interrupt chromosomal territories through strong 3D interactions;
however, these GFs are highly correlated with pathogenesis.
Furthermore, separate analyses across individual cell lines underscored
cell‐type specificity: cancer cell lines consistently exhibited higher
IF values than normal cell lines, reinforcing the potential oncogenic
role of these 3D genome disruptions (Figure [78]1G).
2.2. Structural Variations Do Not Frequently Interrupt Compartment Switching
but Do Cause Significant Changes in Gene Expression
We first examined 18 863 pathogenic SVs that included copy number gain,
copy number loss, deletion, duplication, and insertion. We then
filtered these SVs for length, selecting those with a size of 10 kb–10
Mb for the subsequent analysis (Figure [79]S2, Supporting Information).
Variants smaller than 10 kb are less likely to span entire regulatory
units and therefore have a reduced potential to disrupt chromatin
loops. By contrast, variants larger than 10 Mb are relatively rare and
often encompass multiple TADs or even entire chromosomal arms, which
complicates the identification of precise chromatin interactions. The
filtered SVs were mapped to A/B compartments and four types of
compartment interruptions including: A–A (both breakpoints within A
compartments), B–B (both breakpoints within B compartments), A–B (left
breakpoint within A compartment and right breakpoint within B
compartment), B–A (left breakpoint within B compartment and right
breakpoint within A compartment), in which A–A and B–B were defined as
stable compartment interruptions, A–B and B–A were defined as switching
compartment interruptions. We found that two breakpoints of SV occurred
more frequently, especially in the type of A–A, within regions having
no compartmental changes, while the SVs that did cause compartment
switching, such as A–B or B–A, occurred in less than 20% of all SVs
(Figure [80]2A,B and Table [81]S2 (Supporting Information)). Next, we
calculated the flexible IF for all four possible types of compartmental
disruption, and the stable compartment interruptions showed more 3D
interactions than the switching types (Figure [82]2C and Table [83]S2
(Supporting Information)). In addition, we observed that the SVs mapped
to switching compartments involved more dosage sensitivity genes,
consistent with a previous study that found that altered compartmental
states can lead to transcriptional activity change^[ [84]^30 ^]
(Figure [85]2D and Table [86]S2 (Supporting Information)). And we
analyzed RNA‐seq data from nine cell lines, focusing on the expression
levels of genes located near SV breakpoints. We also examined whether
these SVs were associated with compartment switching, the results show
that switching compartments exhibit distinct expression patterns
compared to those in stable compartments (Figure [87]S3, Supporting
Information). Our comparative analysis of cancer‐related and nonrelated
SVs, sourced from ClinVar,^[ [88]^31 ^] COSMIC,^[ [89]^27 ^] dbVar,^[
[90]^32 ^] and TCGA,^[ [91]^33 ^] revealed that cancer‐related SVs had
higher IFs in both stable and switching compartments (Figure [92]2E).
Despite most SVs interrupting stable compartments with high 3D
interactions, the small proportion causing compartment switching had
greater transcriptional impacts with fewer 3D interactions.
Figure 2.
Figure 2
[93]Open in a new tab
Interruptions of compartments and TADs caused by SVs. A) Four types of
compartment interruptions: A–A (both breakpoints within A
compartments), B–B (both breakpoints within B compartments), A–B (left
breakpoint within A compartment and right breakpoint within B
compartment), B–A (left breakpoint within B compartment and right
breakpoint within A compartment). B) Percentage of compartment
interruption occurrences across 12 cell types. C) Average IF of
compartment interruption across 12 cell types. D) Number of genes with
dosage sensitivity across 12 cell types for the interruptions in
switching and stable compartment. E) Interaction frequency (IF)
comparison for cancer‐related SVs and nonrelated SVs within stable and
switching compartments. The differences were analyzed across 12 cell
lines. F) Four types of SVs interrupt TADs: intra‐TAD (both breakpoints
located within the same TAD), inter‐TAD1 (with one breakpoint locate in
boundary and the other in TAD), inter‐TAD2 (both breakpoints located
within the boundary), and inter‐TAD3 (two breakpoint locate in
different TADs). G) Percentage of TAD interruption occurrences across
12 cell types. H) Percentage of TAD interruptions overlap with
compartment switching regions. I) IF comparison for cancer‐related SVs
and nonrelated SVs within TADs and across TAD boundaries. The
differences were analyzed across 12 cell lines. Average IF of TAD
interruption across 12 cell types. J) p‐value were calculated by
Wilcoxon test, * p‐value < 0.05, ** p‐value < 0.01, *** p‐value
< 0.001, **** p‐value < 0.0001.
Recent studies reported that structural variation can induce dramatic
changes in TAD organization.^[ [94]^12 , [95]^13 ^] We mapped SVs to
TAD boundaries and categorized the predicted interruptions into four
categories, namely intra‐TAD (both breakpoints located within the same
TAD), inter‐TAD1 (with one breakpoint located in the boundary and the
other in TAD), inter‐TAD2 (both breakpoints located within the
boundary), and inter‐TAD3 (two breakpoints located in different TADs)
(Figure [96]2F). Among the four types of interruptions, intra‐TAD
occupied the largest proportion, followed by inter‐TAD3 (Figure [97]2G
and Table [98]S2 (Supporting Information)). We then mapped the four
types of TAD interruptions to the switching compartment regions.
Inter‐TAD3 had the highest overlapping percentage with switching
compartments, indicating that the inter‐TAD3 interruptions potentially
correlated with greater transcriptional activity changes
(Figure [99]2H). We then calculated the flexible IF for SVs in all four
types of TAD interruption. Inter‐TAD3 showed the lowest 3D connection
frequency (Figure [100]2I and Table [101]S2 (Supporting Information)).
Moreover, cancer‐related SVs showed higher IFs within and across TADs
compared to nonrelated SVs (Figure [102]2J), confirming the heightened
impact of cancer‐related SVs on 3D chromatin organization.
2.3. Analysis of Cancer‐Associated SNPs
Then, we acquired 162 769 cancer‐related SNPs from COSMIC,^[ [103]^27
^] spanning 13 distinct tissues. Among these, noncoding region SNPs
accounted for 97.6% (Figure [104]3A). Considering CTCF‐mediated
chromatin looping leads to the formation of TADs, chromatin interaction
analysis tools, such as ChIA‐PET^[ [105]^34 ^] and Hi‐C,^[ [106]^35 ^]
have mapped these interactions and recognized TADs as large‐scale
chromatin structures. CTCF or cohesin enrichment is typically observed
at the TAD boundaries.^[ [107]^36 ^] These chromatin loops enhance
intradomain interactions between regulatory elements, such as enhancers
and gene promoters (which trigger gene expression). Concurrently, they
restrict interdomain contacts to mitigate unspecific gene expression.
In this model, regulatory variants at TAD boundaries or intradomain
contacts could disrupt the protective neighborhoods formed by the
looping, leading to aberrant enhancer–promoter interactions.
Additionally, variants at active transcription‐factor (TF)‐bound
enhancers can directly modulate these enhancer–promoter interactions.
Alterations that compromise TAD structural integrity and chromatin
associations are more probable to manifest functional consequences,
potentially culminating in disease susceptibility (Figure [108]3B).
Figure 3.
Figure 3
[109]Open in a new tab
Analysis of cancer‐associated SNPs. A) Tissue distribution and coding
region distribution of COSMIC cancer‐related SNPs. B) A schematic
representation illustrating the influence of regulatory variants on TAD
boundaries and intradomain interactions. The blue triangle represents
the CTCF peak, the blue region signifies the area covered by the CTCF
loop. The yellow triangle represents the H3K27ac peak, the red square
symbolizes the enhancer, and the red area denotes the span of the E–P
loop. Notably, cancer SNPs associated with transcription‐factor
(TF)‐bound enhancers are accentuated, underscoring their pivotal role
in modulating enhancer–promoter interactions. The diagram also
emphasizes potential functional ramifications stemming from alterations
that weaken TAD structural cohesion and chromatin interactions,
potentially amplifying cancer susceptibility. C) Count of
cancer‐specific loop (CSL), overlap loop (OL), and normal‐specific loop
(NSL). D) Proportions of E–P loops, CTCF loops, and other loops within
CSL, OL, and NSL. E) Quantity of enhancer's TFBS and the corresponding
matched SNPs within CSL, OL, and NSL. F) PIP values in CSL, OL, and NSL
of skin tissue. The red dashed lines represent cutoff of high PIP, the
blue dashed lines represent the mean PIP. G) Representation of the
chromatin loop between rs12203592 and the IRF4 gene in CSL of skin
tissue. H) Flexible IF computation for CSL, OL, and NSL. p‐value were
calculated by Wilcoxon test, ** p‐value < 0.01, **** p‐value < 0.0001.
Here, we integrated available Hi‐C,^[ [110]^35 ^] ChIA‐PET/HiChIP,^[
[111]^34 , [112]^37 , [113]^38 ^] PCHi‐C,^[ [114]^39 ^] eQTL data,^[
[115]^40 ^] and CRISPR/Cas9‐verified loops^[ [116]^41 ^] across these
tissues. By mapping curated SNPs to loops, a loop containing
cancer‐related SNPs across all 13 tissues was categorized as a
cancer‐specific loop (CSL). Conversely, a loop devoid of such SNPs in
all tissues was deemed a normal specific loop (NSL). Otherwise, it was
tagged as an overlap loop (potentially linked with both cancer
relevance and nonrelevance) (Figure [117]3C). Moreover, we employed
respective tissue CTCF ChIP‐seq peaks and DNase‐seq peaks for loop
analysis, tallying the count of CTCF loops and enhancer–promoter (E–P)
loops. The counting results revealed that across the three loop
categories (CSL, overlap loop (OL), and NSL), CTCF loops predominated,
followed by E–P loops. Specifically, CSL and NSL comprised 17.8% and
19.5% E–P loops, respectively, markedly surpassing OL (5.3%). We
postulate this might be due to numerous contacts in the cancer and
normal‐specific loops that remotely regulate cancer genes or sustain
regular gene expression. By contrast, OLs seem minimally associated
with gene regulation (Figure [118]3D). Subsequently, our investigation
into the TF binding scenarios in E–P loops highlighted that the
enhancer in CSL harbored up to 35 active TF binding sites (TFBS) and a
maximum of 80 000 SNP matches, surpassing those in OL and NSL
(Figure [119]3E).
The posterior inclusion probability (PIP) for the SNPs in skin tissue
was computed using SuSiE^[ [120]^42 ^] to facilitate the determination
of SNP causality. Notably, CSL SNPs displayed the highest average PIP
(0.66), inclusive of 17 SNPs with high PIP (≥0.95) — a count surpassing
those in OL and NSL (Figure [121]3F). As a case in point, we spotlight
the intronic rs12203592 enhancer within the IRF4 gene, which has ties
with multiple phenotypic attributes, such as skin pigmentation and hair
color. The chromatin loop in CSL involving this enhancer facilitates
its interaction with the IRF4 promoter, even if separated within the
DNA sequence. It is paramount to note that the IRF4 gene correlates
with several hematological malignancies, including multiple myeloma,
chronic lymphocytic leukemia, and non‐Hodgkin lymphoma. Variants within
IRF4, exemplified by rs12203592, might influence these cancers'
vulnerability or progression given their regulatory function^[ [122]^43
^] (Figure [123]3G).
To investigate the interaction frequency of these SNPs with target
genes in 3D structures, we calculated the flexible IF for each loop
type. The findings indicated that OL exhibited the highest interaction
frequency, followed by CSL, with NSL having the least frequency
(Figure [124]3H), and IF comparison for CSL, OL, and NSL across 34 cell
lines shows the same result (Figure [125]S4, Supporting Information).
In summary, SNPs with higher pathogenic potential tend to participate
in more interactions within the 3D genomic space.
2.4. 3DFunc: A Two‐Phase Scoring Algorithm for Detecting the Variants That
Affect Gene Function through Long‐Range Genomic Interactions
To enlarge the testing datasets, we identified ICTs, SVs, and SNPs from
the pan‐cancer analysis of whole genomes (PCAWG)^[ [126]^44 ^] (Figure
[127]S5, Supporting Information). First, we collected cell‐specific
gene expression data from PCAWG and high‐resolution Hi‐C data from
4DN,^[ [128]^45 ^] then we ranked the identified variants with
downstream expression changes and IF changes individually. The verified
causal variants did not show a preferable top ranking with a single
data type, indicating that the causality of variants was complex and
could not be characterized with only a single data type (Figure
[129]S6, Supporting Information).
To address this challenge and quantify the effect of variants within
the 3D genomic context, we developed a cell‐specific, two‐phase scoring
algorithm, 3DFunc. First, 3DFunc employed Hi‐C matrices to compute the
chromatin interaction frequency between predicted variants and their
target genes. To address the low resolution of Hi‐C matrices, we
expanded the anchor regions on both the left and right ends, and
combined this with the anchor window size to calculate a flexible IF.
Subsequently, 3DFunc computed the IF changes (ICs) between 13 normal
samples and 20 cancer samples. For gene expression analysis, we used
cell‐specific PCAWG data and applied a t‐test to evaluate the
expression changes (ECs) of the mutated target genes between 20 cancer
and 15 normal cell lines. By integrating both chromatin ICs and gene
ECs, 3DFunc employed nonlinear least squares fitting to model the
relationship between these variables for each cell line. Finally, the
3DFunc score was determined by calculating the difference between the
expected and observed values for the fitted model, providing a
comprehensive quantification of the variant effects in the 3D genome
(Figure [130]4A).
Figure 4.
Figure 4
[131]Open in a new tab
3DFunc identify the transcriptional effects of diverse sets of genetic
variants. A) The schema of 3DFunc. B) The percentage of verified causal
variants within the top 10% ranking parts, p‐value was calculated by
Wilcoxon test, *** p‐value < 0.001. C) Rank variants with 3DFunc
scores. D) Comparison of SNP scoring performance across tools using
AUROC and AUPRC metrics. AUROC (left) and AUPRC (right) values were
calculated for 3DFunc, CADD, fathmm‐MKL, and FunSeq2 to assess their
accuracy in distinguishing pathological from nonpathological SNPs in
the PCAWG dataset. E) Comparison of SNP scoring performance within the
top 5% of ranked SNPs for 3DFunc, CADD, fathmm‐MKL, and FunSeq2, using
the same AUROC and AUPRC metrics as in (D).
We used 3DFunc to score the predicted variants and calculate the
percentage of verified causal variants within the top 10% of
3DFunc‐ranked variants. By cross‐referencing these top‐ranked variants
with ClinVar^[ [132]^31 ^] annotations, we demonstrated that 3DFunc
consistently prioritized verified causal variants, as SNPs labeled as
“pathogenic” in ClinVar were highly enriched within the top 10%
(Figure [133]4B,C). To further evaluate SNP scoring accuracy, we
applied 3DFunc, CADD,^[ [134]^20 ^] fathmm‐MKL,^[ [135]^21 ^] and
FunSeq2^[ [136]^22 ^] to SNPs from the PCAWG dataset. First, we mapped
these SNPs to the ClinVar database; SNPs labeled as pathological in
ClinVar were designated as positive samples, while others were
considered negative samples. Using each tool's predicted scores, we
calculated AUROC and AUPRC metrics to assess their performance in
distinguishing pathological from nonpathological SNPs under both the
original full dataset and the top 5% selection criteria. Results showed
that 3DFunc achieved the highest AUROC (0.87) and AUPRC (0.70) among
all tools when evaluated on the full dataset, demonstrating superior
accuracy in prioritizing pathological SNPs (Figure [137]4D). To address
concerns about the potential influence of false positives and dataset
imbalances, we further restricted the analysis to the top 5% of
variants predicted by each tool. Under this stricter criterion, 3DFunc
maintained the highest performance, with an AUROC of 0.89 and an
improved AUPRC of 0.77 (Figure [138]4E). By contrast, the performance
of the other tools under the top 5% selection criteria showed minimal
improvement or slight declines in AUPRC (CADD: 0.62–0.62, FunSeq2:
0.46–0.47, fathmm‐MKL: 0.29–0.26), indicating a lack of enrichment for
true positive variants in their top‐ranked predictions. The results
highlight the robustness of 3DFunc in prioritizing functional SNPs,
even under conditions designed to reduce the potential influence of
false positives.
2.5. Assessing the Efficacy of 3DFunc through Its Application to PCAWG
Variants
To further investigate and validate the functionality of the predicted
high‐scored SNPs (those with a 3DFunc score greater than 0.9 and
p‐value less than 0.05), we conducted comprehensive analyses using
publicly available datasets. First, we extracted variant–gene pairings
from PCAWG for five prevalent tissues with both Hi‐C and expression
data accessible. We then determined the significance of the 3DFunc
scores using a chi‐square test. KEGG pathway enrichment analysis of the
target genes of these SNPs revealed significant enrichment in
cancer‐related pathways, such as “Pathways in Cancer,” “MAPK Signaling
Pathway,” and immune‐related pathways including “Th1 and Th2 Cell
Differentiation” and “PD‐L1 Expression and PD‐1 Checkpoint Pathway in
Cancer” (Figure [139]5A).
Figure 5.
Figure 5
[140]Open in a new tab
Evaluating 3DFunc's performance by applying it to PCAWG variants. A)
KEGG pathway enrichment analysis of genes targeted by high‐scored SNPs.
B) Percentage of significant 3DFunc scores, and the percentage of
driver mutations in five tissues. The significance of 3DFunc scoring
with chi‐square test in tissues of ICTs, SVs, and SNPs, and the
overlapping percentage with the driver mutations from PCAWG. C) GDA
network of ICTs, SVs, and SNPs in blood tissue. D) Differential
expression analysis for TCGA MDS data with 3DFunc scoring, p‐value
< 0.5 and the absolute value of log2(fold change) larger than or equal
to 1 were used as the significant threshold. The purple points
represent the genes, the blue points represent disease, and the size of
points represent the evidence of gene–disease association. E) Virtual
4C analysis of chromatin interactions involving high‐scored SNPs near
IGHV3‐23 and DEFA3. F) Example of 3DFunc score in the locus of
WNT6/EPHA4/PAX3 gene. The heatmap indicated the Hi‐C profile and
colored by interaction frequency. The triangle represented the regions
of TAD structure. The gray squares represented SVs, and the curves
represented SV–gene interactions. The IF change, caviar probability,
and 3DFunc scores for each SV was marked with different colors. W‐SV‐1,
W‐SV‐2, and W‐SV‐3 indicate the SVs near WNT6. E‐SV‐1, E‐SV‐2, and
E‐SV‐3 indicate the SVs near EPHA4. P‐SV‐1 indicate the SV near PAX3.
Next, we overlaid the driver mutations from the PCAWG dataset with the
variants from common tissues in Hi‐C and RNA‐seq expression datasets.
The variants in blood tissues showed the most significant 3DFunc scores
(p‐value < 0.05) and the highest driver mutation rate (Figure [141]5B).
To investigate the potential pathogenesis of these high‐scored SNPs, we
further analyzed the Gene Disease Association (GDA) network for the
target genes of these variants. We observed that MDS was most
frequently associated with variants showing the highest evidence of
disease risk (Figure [142]5C and Table [143]S3 (Supporting
Information)).
Additionally, we analyzed TCGA gene expression data and identified 85
differentially expressed genes between control and MDS samples.
Notably, IGHV3‐23 and DEFA3 exhibited high 3DFunc scores. IGHV3‐23, a
gene frequently mutated in chronic lymphocytic leukemia,^[ [144]^46 ^]
and DEFA3, which is highly expressed in neutrophils, were found to
interact with high‐scored SNPs, further supporting their functional
relevance (Figure [145]5D and Table [146]S4 (Supporting Information)).
To further evaluate the chromatin interactions associated with these
SNPs, we performed virtual 4C analysis using IGHV3‐23 and DEFA3 as
viewpoints. This analysis revealed strong interaction peaks near these
genes, suggesting that the SNPs in these regions may influence
chromatin looping and regulatory mechanisms (Figure [147]5E). Finally,
we used eQTL data from the GTEx database to demonstrate significant
associations between SNPs near IGHV3‐23 and DEFA3 and the expression of
these genes across various tissues. For example, SNP rs2738144 near
DEFA3 and rs397933171 near IGHV3‐23 exhibited strong eQTL signals,
highlighting their potential role in modulating gene expression (Figure
[148]S7, Supporting Information).
In addition to SNPs, the PCAWG dataset also includes SVs, which we
utilized to further verify the effectiveness of 3DFunc in scoring
structural variants. We focused on high‐scored SVs and their
corresponding interactions within the WNT6, EPHA4, and PAX3 loci. Using
Hi‐C data from H1 cells,^[ [149]^35 ^] we annotated the TAD structures
at these loci. Among the WNT6‐related SVs, W‐SV‐3 (the third nearest SV
to WNT6) exhibited the highest 3DFunc score, particularly near the TAD
boundary. Duplication of this region misplaces the WNT6 gene closer to
an enhancer element, potentially causing its misexpression.^[ [150]^12
^] For the SVs of EPHA4, three 3DFunc scores were larger than 0.5,
indicating the causal effect of the upstream enhancer region of EPHA4.
For comparison, we also showed the scoring of the Caviar probability, a
measure of causal variants in associated regions,^[ [151]^47 ^] which
did not display higher scores for the SV‐WNT6 and SV‐EPHA4 pairs
(Figure [152]5F).
2.6. A 3D Genome Atlas of Genetic Variants and Their Pathological Effects
In the above work, thousands of pathological genetic mutations were
curated, and we analyzed the topological genomic disruptions caused by
these curated variants (Figure [153]6A), in which 1104 ICTs interrupted
territories, 5002 SVs correlated with compartment switching events,
7654 SVs disrupted TAD boundaries, and 3033 SNPs disrupted loops. To
interpret how these variants would impact gene expression through 3D
interactions in a quantitative manner, we generated a two‐phase scoring
algorithm known as 3DFunc, which combines 33 Hi‐C datasets from the 4DN
data portal,^[ [154]^45 ^] the gene expression data of 20 cancer
tissues derived from the PCAWG,^[ [155]^44 ^] and 15 normal tissues
taken from the GTEx portal.^[ [156]^48 ^] 3DFunc employs a nonlinear
least‐square curve to measure the effect of genetic variants on
long‐range gene regulation and genomic architecture. To demonstrate the
effectiveness of 3DFunc, more ICTs, SVs, and SNPs were identified from
the PCAWG datasets and scored with 3DFunc. Finally, we integrated all
of the curated data and the calculated the 3DFunc scoring results to
the 3DGAtlas database, in order to provide an atlas of genetic variants
and their pathological effects for further research. This is available
at [157]https://www.csuligroup.com/3DGAtlas/home (Figure [158]6B).
Figure 6.
Figure 6
[159]Open in a new tab
Design of 3DGAtlas. A) 10 789 ICTs, 18 863 SVs, and 162 769 SNPs were
curated from literatures, in which 1104, 5002, 7654, and 3033 events
were detected disrupting the corresponding 3D architecture layers. B)
Hi‐C profile from 33 tissues and gene expression data from 35 tissues
were combined with a nonlinear least square curve to generate two‐phase
scoring algorithm 3DFunc. The identified variants were scored with
3DFunc. The database 3DGAtlas was constructed with the literature
curated data and the 3DFunc predictions.
3. Discussion
In recent years, 3C assays in combination with high‐throughput
sequencing, such as Hi‐C, have provided new insights into the global
organization of the genome. Since the advent of these techniques, we
have known that interphase chromosomes are folded into four layers of
hierarchical structures, including territory, compartment, TAD, and
loop. The alterations in any of these layers can lead to disease or
phenotype, thus, we curated the variants and mapped them to the 3D
layers to investigate the pathological effects.
For territory, ICT could lead to gene fusions and dysregulated gene
expression through spatial proximity. We observed that the fused gene
pairs with more 3D interactions tended to corelate with pathogenesis.
It is reasonable, as DNA FISH studies have revealed, that the frequency
of translocations between two chromosomes is related to their spatial
proximity in interphase nuclei.^[ [160]^49 , [161]^50 ^] Consistent
with this finding, the 3D interaction counts from the Hi‐C profile
between whole chromosomes were found to exhibit a strong correlation
with the frequency of translocation, suggesting that spatial genomic
proximity precedes translocation.^[ [162]^51 ^]
For compartment, SV influences the expression of genes distant from the
SV breakpoints and causes disease. In this study, most of the
pathogenic SVs interrupted stable compartments, which also involved
more 3D spatial connections. Although a small part of SVs interrupt
switching compartments, these SVs impacted more gene expression through
fewer 3D interactions. The observed frequencies of A–A and B–B
interruptions provide key insights into the stability of chromatin
architecture. A‐type compartments, associated with active chromatin,
and B‐type compartments, linked to more repressive chromatin, exhibit
different interaction patterns due to their roles in the genome. The
predominance of stable A–A and B–B interactions may suggest that
disruptions in these regions could significantly impact the regulatory
landscape, given the compartmentalized nature of chromatin. Previous
studies have highlighted the importance of compartmental interactions
in maintaining gene regulation and expression patterns across cell
types. The higher frequency of these stable interruptions may be
indicative of the robust nature of chromatin domains that maintain
their structure even in the presence of genetic aberrations.^[ [163]^11
^] This has been linked to the self‐organizing principles of the
genome, where chromatin tends to preserve its domain architecture due
to factors such as loop extrusion, cohesin activity, and CTCF
binding.^[ [164]^52 ^] The A–A and B–B therefore represent a protective
mechanism, ensuring that essential regulatory regions maintain their
activity despite structural variations.
For TAD, fewer SVs occurred inter‐TAD, and the type of inter‐TAD3 (two
breakpoint locate in different TADs) correlated more with switching
compartment and had less 3D interactions. Previous studies have
demonstrated that compartments and TADs are highly dynamic in nature
and changes occurred in accordance with lineage and cell‐type
specificity.^[ [165]^53 ^] What is more, the altered
compartmentalization and interruptions of TAD boundaries have been
reported in disease and complex traits,^[ [166]^10 , [167]^13 ^] which
explained why we observed that the SVs occurred in switching
compartment and inter‐TAD structures impacted more dosage gene
expression. In addition, fewer 3D interactions were detected near these
regions, which was possibly caused by the dynamic characteristics of
switching compartments.
For loop, we observed that the cancer specific SNP–gene loops have
higher 3D interaction frequencies and with higher causality. Meanwhile,
most of the SNPs locate within noncoding regions. This is consistent
with the previous cancer studies that showed that most of the disease
associated SNPs reside within the regulatory elements and/or are
enriched in transcriptional factor binding motifs in the noncoding
region of the genome and exerts effects through long‐range chromosomal
interactions.^[ [168]^17 , [169]^54 , [170]^55 ^]
In addressing potential biases in the selection of ICTs and gene
fusions, we took several measures to ensure that the datasets used were
as representative as possible of the broader cancer landscape.
Specifically, we utilized well‐curated datasets such as PCAWG and 4DN
to minimize selection biases. These datasets provide a comprehensive
overview of somatic mutations, structural variants, and gene fusions
across multiple cancer types, ensuring the inclusion of a wide range of
genomic alterations observed in diverse tumor types and clinical
contexts.^[ [171]^33 , [172]^44 ^] However, we acknowledge that no
dataset is entirely free from biases, particularly those introduced by
the overrepresentation of certain cancer types or the technical
limitations inherent in detecting ICTs in lower‐resolution assays. To
further mitigate these potential biases, we applied rigorous inclusion
criteria, focusing only on high‐confidence gene fusions that have been
recurrently reported in multiple independent studies and cancer
types.^[ [173]^56 ^] We also cross‐referenced known driver mutations
and well‐characterized oncogenic translocations, ensuring that our
analysis captured clinically significant and functionally relevant
events, rather than rare or incidental findings. Furthermore, our
approach to ICT and gene fusion mapping involved stringent filtering
based on the size and frequency of chromosomal rearrangements, allowing
us to reduce the likelihood of artifacts arising from technical noise
in sequencing data.
4. Experimental Section
Curation of ICTs
The cancer‐related ICT events were collected from three predominant
chromosome aberration database: Mitelman,^[ [174]^57 ^] COSMIC,^[
[175]^27 ^] and ChimerPub.^[ [176]^58 ^] For the data from Mitelman
database ([177]https://mitelmandatabase.isb‐cgc.org/), only the
structural aberrations of interchromosome were retained, and the
karyotype of which were converted to chromosome coordinates with
CytoConverter.^[ [178]^59 ^] For the data from COSMIC
([179]https://cancer.sanger.ac.uk/cosmic), the structural genomic
rearrangements were downloaded, and the interchromosomal translocations
were filtered out. For the data from ChimerPub
([180]https://www.kobic.re.kr/chimerdb/chimerpub), the translocation
data with known chromosome positions were retained. Then, all the ICTs
from these three databases were merged and duplications were removed,
10 789 ICTs were retained in total. 1104 gene fusion events were
extracted from the filtered unique ICTs. All the chromosome coordinates
were extracted under the reference of hg38.
Curation of SVs
The pathological SVs were collected from Clinvar^[ [181]^31 ^] (release
of 2021‐08‐13) under the reference of hg38
([182]https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/dbvarhub/hg38/),
the SVs were filtered with the percentage of larger than 1%, the
filtered SVs included copy number gain, copy number loss, deletion,
duplication, and so on. For the subsequent analysis, 18 863 SVs with
length of 10k to 10 M bp were retained.
Curation of SNPs
The cancer‐related SNPs were collected from COSMIC.^[ [183]^27 ^] To
narrow the scope to pathological SNPs, the fine mapping eQTL data were
downloaded from the recomputed datasets of eQTL Catalogue^[ [184]^60 ^]
([185]https://www.ebi.ac.uk/eqtl/Data_access/), the credible sets of
the same tissue from different studies were merged and the duplicated
SNP–gene pairs were removed.
Disruption of 3D Layers
For the layer of territory, the filtered unique gene fusions from
curated ICTs were regarded as the disruptions of territory, and the
cytobands of two fused genes were regarded as the locations where
aberrations occurred.
For the layer of compartment, high‐resolution Hi‐C datasets (>1 billion
reads) of 12 cell types were first used from 4DN data portal^[ [186]^45
^] to assign active (A) or inactive (B) compartments genome‐wide by
FAN‐C^[ [187]^61 ^] with the resolution of 1 Mb, and the genome
sequence of hg38 was applied to calculate the average GC content, as GC
content was previously shown to correlate well with
compartmentalization.^[ [188]^61 ^] Then, the curated SVs were
overlayed with the compartments of 12 cell types individually, the
criteria for mapping included calculating the distance between SV
breakpoints and the nearest compartment transitions, the spanning
locations of SVs were regarded as the disruptions of compartments.
There were four different disruption types, A–A (both breakpoints
within A compartments), B–B (both breakpoints within B compartments),
A–B (left breakpoint within A compartment and right breakpoint within B
compartment), B–A (left breakpoint within B compartment and right
breakpoint within A compartment).
For the layer of TAD, high‐resolution Hi‐C datasets (>1 billion reads)
of 12 cell types from 4DN data portal were used to call the insulation
domains by FAN‐C with the window size of 1 Mb. The downloaded Hi‐C file
was normalized with ICE,^[ [189]^62 ^] then the resolution of the Hi‐C
data was carefully adjusted to ensure the TADs could be accurately
identified despite the presence of large SVs. The curated SVs were
overlayed with the TADs of 12 cell types individually, the criteria for
mapping included calculating the distance between SV breakpoints and
the nearest TAD boundaries, the spanning locations of SVs were regarded
as the disruptions of TADs. Additionally, a size‐based filtering
approach was implemented to exclude regions where SVs exceeded 50% of
the local TAD size, ensuring that disrupted regions did not skew the
overall results. There were four interruption types: intra‐TAD,
inter‐TAD1, inter‐TAD2, and inter‐TAD3.
For the layer of loop, the curated SNPs were assigned to the target
genes with Hi‐C, PCHi‐C, fine‐mapping eQTL, and CRISPR/Cas9‐verified
loops. The Hi‐C loops were identified with HiCCUPS,^[ [190]^62 ^] and
the PCHi‐C loops were detected by MMCT‐Loop.^[ [191]^63 ^] The
fine‐mapping eQTL was downloaded from GTEx,^[ [192]^48 ^] and
CRISPR/Cas9‐verified loops were from Gasperini et al.^[ [193]^41 ^] The
chromosome regions near SNPs were extended to 1 kb on both sides, then
the extended locus‐gene pairs were regarded as loops, and the SNPs were
regarded as the interruption within loops.
Quantification of Dosage‐Sensitive Genes
SVs were first categorized based on the chromatin compartments they
disrupted (AA, AB/BA, or BB) by mapping SV breakpoints to compartment
boundaries identified through Hi‐C data. For each category,
dosage‐sensitive genes located within the disrupted regions were
identified by cross‐referencing with ClinVar annotations. The frequency
of dosage‐sensitive genes affected by SVs within each compartment type
across the cell lines was then calculated to quantify transcriptional
changes potentially induced by these structural alterations. To assess
the impact of these SVs on transcriptional activity, compartment shifts
(e.g., from active to inactive compartments) were examined that could
alter gene expression levels. SVs disrupting AB/BA compartments, for
instance, were expected to have different regulatory effects than those
interrupting AA or BB compartments.
Calculation of Flexible IF
The flexible IF used Hi‐C matrix to measure the interaction frequency
between two specific genomic loci. To make the calculation fit
different scales of genomic length, a flexible strategy was used to
determine the window size of interaction frequency. Two candidate
regions were defined as C[a] and C[b] , respectively. The start point
of C[a] was S[a] , the end point of C[a] was E[a] , the start point of
C[b] was S[b] , the end point of C[b] was E[b] . The Hi‐C files created
by juicer^[ [194]^62 ^] contained several prebuilt matrices with
different resolutions, such as 1000, 5000, 10 000, 50 000, 100 000, and
500 000 bp. For the candidate regions C[a] and C[b] , the nearest
resolution R[a] and R[b] , respectively, could be determined. The
window size for calculating interaction frequency was set to W[ab] =
min{R[a] ,R[b] }. The interaction frequency between C[a] and C[b] was
calculated by Straw,^[ [195]^62 ^] which was represented by IF (S[a] :
E[a] , S[b] : E[b] , W[ab] ).
Since most of the Hi‐C files were low‐resolution, which made the IF
sparse, here, the candidate regions were extended as background to
calculate the flexible IF. The extended start point of C[a] was S′[ a
]= S[a] − R[a] α, the extended end point of C[a] was E′[ a ]= E[a] +
R[a] α, the extended start point of C[b] was S′[ b ]= S[b] − R[b] α,
the extended end point of C[b] was E′[ b ]= E[b] − R[b] α. The
extension coefficient α was set to
[MATH: {Ra/Rb(ifRa>Rb)
Rb/Ra<
mrow>(ifRb>Ra) :MATH]
. Then, the flexible IF was calculated by
[MATH:
FlexibleIF=IFSa:E<
/mi>a,Sb:
mo>Eb,WabIFS′a:E′a
,S′b:E′b,Wab
mrow> :MATH]
(1)
For the ICT events, the loci of two fused genes were extracted as
candidate regions for calculation. For the SV events, all the SV–gene
pairs within the same TAD structure were considered as the candidate
regions. For the SNP–gene pairs, the chromosome regions near SNPs were
extended to 1 kb on both sides, then the extended locus‐gene pairs were
regarded as candidate regions.
Validation of Merged ICTs
To validate the pathogeneses of merged ICTs, the driver cancer genes
were collected from DriverDBV3^[ [196]^25 ^]
([197]http://driverdb.tms.cmu.edu.tw/download), then the genes
involving in the ICTs were overlayed with the driver cancer genes, and
the overlapping percentage was calculated. For the gene fusions from
the whitelist of TumorFusions^[ [198]^26 ^]
([199]https://tumorfusions.org/), the fusions from TCGA marker papers,
and the TCGA fusions dataset,^[ [200]^33 ^] the curated gene fusions
were overlayed with them regardless the fusion directions, then the
overlapping percentage was calculated.
Genomic Annotation of SNPs
To investigate the 3D interaction characteristics of SNPs within
different genomic regions, the curated cancer‐related SNPs were
annotated to coding regions with the coordinates of known genes from
UCSC^[ [201]^64 ^] (release of 2021‐11‐16) and RefGene^[ [202]^65 ^]
(release of 2021‐03‐01), and the noncoding SNPs did not locate in a
coding sequence or within 10 bp of a splice site annotated in the
RefGene. And the SNPs were annotated with high resolution maps of DHSs
data.^[ [203]^66 ^] Then, SNP2TFBS^[ [204]^67 ^]
([205]https://ccg.epfl.ch/snp2tfbs/) was employed to investigate the
TFBS associated with coding and noncoding SNPs, respectively. The
enrichment of TF and the number of affected TFBS were calculated by
SNP2TFBS.
Process of PCAWG and GTEx Data
For PCAWG gene expression data, a total of 1359 samples in 20 cancer
tissues were used, the reads were aligned with the alignment tools of
TopHat2^[ [206]^68 ^] (release of v2.1.1) and STAR^[ [207]^69 ^]
(release of v2.7.10a). Read counts to genes were calculated using
Htseq‐count^[ [208]^70 ^] with GFT file from GENCODE human release v38.
Then, counts were normalized using fragments per kilobase of exon per
million mapped fragments (FPKM) normalization and upper quartile
normalization. The final expression values were given as an average of
the TopHat2 and STAR‐based alignments.
For GTEx data, the expressions of 15 healthy tissues in 3274 samples
from GTEx (phs000424.v4.p1) were analyzed with the same pipeline as
PCAWG gene expression data to calculate the FPKM value for genes. The
calculated expression data of each sample were assigned a unique
aliquot.
Classification of CTCF Loops and E–P Loops
Given that E–P loops were generally fewer in number but were more
closely associated with regulatory functions, identifying E–P loops was
prioritized in the analysis. The LoopPredictor algorithm^[ [209]^55 ,
[210]^71 ^] was first applied to classify all detected loops and those
categorized as E–P loops were selected. For the remaining loops, an
overlap analysis was performed with CTCF ChIP‐seq peaks; loops with at
least one anchor overlapping a CTCF ChIP‐seq peak were classified as
CTCF loops. The remaining loops, with no overlap, were designated as
other loops.
Finally, the number of CTCF loops, E–P loops, and other loops across
CSLs, NSLs, and OLs was quantified to facilitate a detailed comparison
of loop types within each category.
The Two‐Phase Scoring Algorithm: 3DFunc
To measure the effect of variants under the 3D context quantitatively,
a two‐phase scoring algorithm, 3DFunc, was proposed which combined
cell‐specific gene expression data from PCAWG and GTEx, as well as the
Hi‐C matrix from 4DN. For each of the 20 cancer cell lines and their
corresponding cancer types, the gene expression changes and 3D
interaction frequencies were calculated separately.
To measure the gene expression changes between cancer and normal cell
lines, the putative target gene i of predicted variants was mapped to
the PCAWG expression data for the specific cancer cell line, matched to
its unique “aliquot,” the average matched gene expression of target
gene i in n [cancer] cancer cell lines was
[MATH: ECi¯
mrow> :MATH]
, the standard deviation was
[MATH: SCi :MATH]
, and the average matched gene expression of target gene i in n
[normal]normal cell lines was
[MATH: ENi¯
mrow> :MATH]
, the standard deviation was
[MATH: SNi :MATH]
. 20 cancer and 15 normal cell lines were used, with expression change
EC calculated independently for each cancer cell line using a t‐test
[MATH: EC=Pt<ECi¯−ENi¯
SC<
mi>i2ncancer+SNi
2nnormal :MATH]
(2)
To characterize the 3D interaction frequency between predicted variant
and target gene, the flexible IF for each variant–gene in the
corresponding cancer cell line was calculated. For putative target gene
i, the flexible IF in a specific cancer cell line was C[ i ], and the
flexible IF in coresponding normal cell lines was N[ i ], the flexible
IF change was defined as IC = abs(C[ i ] − N[ i ]).
Next, nonlinear least square curve^[ [211]^72 ^] fitting procedure was
employed separately for each cancer cell line to model the relationship
between EC and IC. It was observed that the data likely followed an
exponential decay pattern, leading to the fitting function
[MATH: EC=Ae−IC/τ
:MATH]
(3)
where A and τ were determined by the residual sum of squares (RSS)
[MATH: RSS=∑obs−pr
ed2 :MATH]
(4)
The goal of nonlinear least squares fitting algorithm was to find
function parameters that minimized the RSS. The curve‐fitting process
was performed for each cancer cell line using a custom R script. The
resulting model parameters were stored in an R object named “model.”
The 3DFunc score was then calculated as the difference between the
expected EC and the observed EC, tailored to each specific cell line.
Scoring of Caviar Probability
To compare the scoring results of 3DFunc with other causal variants
detecting methods, the Caviar Probability of the predicted variants was
calculated with CAVIAR,^[ [212]^47 ^] which took GWAS and eQTL linkage
disequilibrium files as input to detect the causal variants, the
Colocalization posterior probability for each variant was used as the
final Caviar Probability.
Construction of GDA Network
The variant–gene pairs with the threshold of 3DFunc > 0.5 and p‐value
< 0.05 were first filtered, then 100, 76, 116, 166, and 163 target
genes were extracted from different blood, colon, liver, lung, and
uterus individually. The GDA network was built with DisGeNET cytoscape
app^[ [213]^73 ^] with the curated source, all the annotation types,
and all the disease class. The score and EI threshold were set from 0.5
to 1.0. The output associations included 831, 603, 971, 1091, and 925
diseases for each tissue. The GDA network was grouped by the type of
variants, and the nodes were resized by the detected evidence. Finally,
the diseases related to the accordingly tissues were selected and
marked on the GDA network.
Analysis of TCGA MDS Data
For the TCGA MDS data, two cancer samples and two normal samples in
BEATAML1.0‐COHORT were used, the gene counts were normalized by FPKM,
which were taken as the input of DESeq2^[ [214]^74 ^] and the default
parameters were used to detect differentially expressed genes. Then
p‐value < 0.5 and the absolute value of log2(fold change) larger than
or equal to 1 were used as the significant threshold to filter the
genes, which showed 85 genes significantly differential expressed.
Statistical Analysis
All statistical analyses were performed using R (version 4.3.1).
Differences between strong IF, weak IF, and no IF groups were assessed
using the Kruskal–Wallis test, with post‐hoc pairwise comparisons
conducted using the Dunn test. The percentage of verified causal
variants within the top 10% ranking was compared using the Wilcoxon
test. Flexible IF computation was applied to CSL, OL, and NSL, with
statistical significance assessed using the Wilcoxon test. ** p‐value
< 0.01, *** p‐value < 0.001, **** p‐value < 0.0001.
Data and Code Availability
The GEO datasets used in this study were available under the accession
numbers: [215]GSE116862,^[ [216]^75 ^] [217]GSE51312,^[ [218]^76 ^]
[219]GSE66733,^[ [220]^10 ^] [221]GSE71862,^[ [222]^10 ^] and
[223]GSE179128.^[ [224]^77 ^] The ArrayExpress database was under
accession number of E‐MTAB‐2323, and AMP‐CMD accession number of
TSTSR043623.^[ [225]^78 ^]
All the curated variant–gene pairs along with the 3D interruptions, as
well as the 3DFunc scoring results were available at the 3DGAtlas
database: [226]https://www.csuligroup.com/3DGAtlas/home.
The source code of 3DFunc was available at GitHub repository:
[227]https://github.com/CSUBioGroup/3DFunc. The data that support the
findings of this study were openly available in figshare
[228]https://doi.org/10.6084/m9.figshare.26318707, reference number.^[
[229]^79 ^]
Conflict of Interest
P.T.E. received sponsored research support from the Bayer AG and IBM
and had served on advisory boards or consulted for Bayer AG, MyoKardia,
and Novartis. The remaining authors declared that they have no
competing interests.
Author Contributions
L.T. and M.L. conceived of the presented idea. L.T., M.H., and J.C.
collected the data and designed the model, L.T. wrote the source code
of 3DFunc. M.H. and J.C. developed the framework of database. M.C.H.
helped improve the bioinformatics analysis. M.C.H., P.T.E., and M.L.
aided in interpreting the results and provided input on the data
presentation. All authors provided critical feedback and helped shape
the research, analysis, and paper.
Supporting information
Supporting Information
[230]ADVS-12-2408420-s004.docx^ (1.6MB, docx)
Supplemental Table 1
[231]ADVS-12-2408420-s002.xlsx^ (314.1KB, xlsx)
Supplemental Table 2
[232]ADVS-12-2408420-s003.xlsx^ (4.2MB, xlsx)
Supplemental Table 3
[233]ADVS-12-2408420-s001.xlsx^ (149.9KB, xlsx)
Supplemental Table 4
[234]ADVS-12-2408420-s005.xlsx^ (530.8KB, xlsx)
Acknowledgements