Abstract
Synthetic data generation in omics mimics real-world biological data,
providing alternatives for training and evaluation of genomic analysis
tools, controlling differential expression, and exploring data
architecture. We previously developed Precious1GPT, a multimodal
transformer trained on transcriptomic and methylation data, along with
metadata, for predicting biological age and identifying dual-purpose
therapeutic targets potentially implicated in aging and age-associated
diseases. In this study, we introduce Precious2GPT, a multimodal
architecture that integrates Conditional Diffusion (CDiffusion) and
decoder-only Multi-omics Pretrained Transformer (MoPT) models trained
on gene expression and DNA methylation data. Precious2GPT excels in
synthetic data generation, outperforming Conditional Generative
Adversarial Networks (CGANs), CDiffusion, and MoPT. We demonstrate that
Precious2GPT is capable of generating representative synthetic data
that captures tissue- and age-specific information from real
transcriptomics and methylomics data. Notably, Precious2GPT surpasses
other models in age prediction accuracy using the generated data, and
it can generate data beyond 120 years of age. Furthermore, we showcase
the potential of using this model in identifying gene signatures and
potential therapeutic targets in a colorectal cancer case study.
Subject terms: Drug discovery, Diseases
Introduction
Biological synthetic data generation in the context of omics refers to
the creation of artificial datasets that mimic the characteristics of
real biological data, particularly in genomics, transcriptomics,
proteomics, and other high-throughput biological technologies^[54]1.
Generating synthetic data is valuable for various reasons, including
the development and validation of computational methods, protection of
privacy in sensitive datasets, and augmentation of limited real-world
data. Generative adversarial networks (GANs) have been introduced as
unique models to generate synthetic genomic data, ranging from DNA
sequences to bulk RNA-seq data^[55]2,[56]3. Copula-based methods are
other examples of classical statistical approaches in generating
synthetic omics data, especially microarray gene expression data^[57]4.
Moreover, diffusion models are a recent addition to deep learning for
synthetic data generation: they simulate a diffusion process that
gradually transforms a simple noise distribution into the target data
distribution^[58]5. Large language models (LLMs) built upon the
Transformer architecture, exemplified by Generative Pre-trained
Transformer 2 (GPT-2), have also garnered substantial interest owing to
their contributions to the analysis of sequential data and their
capabilities in language understanding, generation, and
prediction^[59]6. Although these models have shown promising results in
generating high-quality synthetic data, their application to omics data
is still in its infancy.
The need for synthetic data in omics studies stems from a variety of
practical considerations and research objectives. Synthetic datasets
offer an alternative for training and evaluating genomic analysis tools
without relying on large real-world datasets. They also serve as
effective tools for controlling differential expression, testing
algorithms and pipelines, modeling interventions, and studying the
underlying structure of omics data^[60]7. For
instance, researchers often compare gene expression levels between two
or more conditions in differential expression analysis. Such conditions
could include disease vs. healthy or treated vs. untreated samples.
However, obtaining an adequate number of control samples can be
challenging due to cost constraints or limited availability^[61]8. On
the other hand, ‘conditional’ synthetic omics data generation is
crucial to simulate real-life scenarios as gene expression patterns
exhibit notable variations influenced by factors such as species,
tissue type, or age of the sample. In particular, gene expression
undergoes substantial changes throughout development and aging, with
certain genes being upregulated or downregulated during distinct
life stages^[62]9,[63]10. Therefore, synthetic data generation models
incorporating conditionality are essential in capturing the
condition-specific variations in gene expression, as well as in
confronting the disparate distribution of omics data on varying
conditions^[64]11. However, achieving a high level of realism in
synthetic data poses significant challenges, thus highlighting the
emerging need for advanced modeling interventions.
In our prior work, Precious1GPT^[65]12 demonstrated promising
capabilities in predicting biological age and identifying dual-purpose
therapeutic targets potentially implicated in aging and age-related
diseases. In this study, we employed a novel method for generating
multi-omics data that combines a language model for data generation
with a diffusion model for creating visual representations of omics
data. Several deep learning models, namely Conditional GANs^[66]13,
Conditional Diffusion (CDiffusion), the Multi-omics Pretrained
Transformer (MoPT)^[67]14, and our novel approach Precious2GPT (P2GPT),
which combines the CDiffusion and MoPT models, were compared in
generating synthetic multi-omics, multi-species, and multi-tissue
samples. While aging clocks
based on various omics data types have been proposed, such as
transcriptomics^[68]15, methylomics^[69]16, proteomics^[70]17, and
metabolomics^[71]18, P2GPT focused on transcriptomic and DNA
methylation data. This choice was driven by the abundant availability
of these data types for training and validation purposes, particularly
when considering factors like age and tissue specificity. Combining
the two models provides an advantage in addressing challenges
pertaining to tissue classification, age prediction, and identification
of crucial signaling pathways. A case study using colorectal cancer as
an exemplary dataset
was also conducted to highlight the potential and applicability of
P2GPT in accurately generating simulated biological data for
bioinformatic analyses.
Results
Construction of Precious2GPT
We propose a novel hybrid approach that combines the power of two
generation models, CDiffusion and MoPT, to generate high-quality
multi-omics methylation and expression data (Fig. [72]1). Our approach
of constructing Precious2GPT (P2GPT) involves the following steps:
Fig. 1. Schematic representation of the P2GPT model.
The top left section of the diagram delineates the diverse omics
datasets (e.g., methylation and gene expression) collected under
various conditions such as age, tissue type, species, and omics types.
From this initial data representation, lines branch out to indicate two
separate data processing streams feeding into the CDiffusion. One
stream enters the Categorical Embedding, processing discrete data
features and the other enters the Continuous Embedding, handling the
age data. Adjacent to these embedding blocks, the PyDeepInsight
Transformation highlights another preparatory step for the input data,
which is processed in parallel to the embeddings and also fed into the
CDiffusion. On the left side, CDiffusion is presented in detail to
reflect its centrality in the data analysis pipeline. Beneath this
architecture, an Inverse PyDeepInsight block reverts the transformed
data back to its omics representation after processing through the
CDiffusion model. The transformed outcomes are combined with results
from the CDiffusion in the FWLS block. The top right section of the
figure introduces the Omics Tokenizer, serving as the preliminary stage
for the LLM generation. Below the tokenizer, a larger visual represents
the architecture of the LLM model. Its output is directed back into the
omics space to broaden interpretability and also channeled into the
FWLS, where it is integrated with the CDiffusion generations. The
bottom right of the illustration showcases the Model Capabilities
block. This block emphasizes various practical applications of the
developed framework, including omics data generation, assembly of large
open datasets, facilitation of control mechanisms for PandaOmics, the
model’s capacity for target discovery, out-of-domain extrapolations,
and conditional age prediction.
Step 1: CDiffusion model generation
We employ the CDiffusion model to generate an initial dataset, which
simulates gene expression levels based on the provided gene expression
network. This network incorporates dependencies between genes, ensuring
a biologically plausible gene expression pattern.
Step 2: MoPT model evaluation
Using the generated dataset from Step 1, we evaluate the quality of
each gene’s generation using the MoPT model. The MoPT model calculates
a quality score for each gene, reflecting the similarity between the
synthetic data and real-world gene expression and DNA methylation
profiles.
Step 3: Coefficient calculation
To create a balanced combination of the two models and reflect the
proportion of quality contributed by each, we calculate coefficients
based on the quality scores obtained from the MoPT model for each gene.
These coefficients represent the relative importance of the CDiffusion
and MoPT models in the final hybrid generation. To conduct the
combination of our models, we employed the Feature Weighted Linear
Stacking (FWLS^[75]19) approach. FWLS is a technique that combines
multiple models by assigning weights to each model based on their
performance and using these weights to calculate a weighted average
prediction. The FWLS formula is as follows:
$$y_{FWLS} = \sum_{j=1}^{m} \sum_{i=1}^{n} w_{ij} \, y_{ij}$$
where $y_{FWLS}$ represents the final combined generation, $y_{ij}$
represents the generation of each individual model (MoPT and
CDiffusion) for each gene, and $w_{ij}$ represents the weight assigned
to each model for each gene. In our study, we employed a linear
regression approach to determine the optimal weights for model
combination. The weight calculation formula using linear regression is
as follows:
$$w = (X^{T} X)^{-1} X^{T} y$$
where $w$ represents the vector of weights, $X$ represents the matrix
of generations from individual models (MoPT and CDiffusion), and $y$
represents the actual target values for each subgroup of conditions
(tissue, age, omics type, species). Once the weights are determined,
they are used to calculate the final combined prediction by taking the
weighted average of the individual predictions. This approach allows us
to leverage the strengths of each model and minimize the impact of any
individual model’s weaknesses.
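The weighting step can be illustrated with a minimal sketch. It assumes
that, for a single gene and a single subgroup of conditions, the MoPT
and CDiffusion generations form the two columns of X and that y holds
the corresponding real values. The function names and the toy numbers
are hypothetical and are not taken from the P2GPT code base. The sketch
uses `numpy.linalg.lstsq`, which solves the same least-squares problem
as the closed-form expression without forming an explicit matrix
inverse.

```python
import numpy as np


def fit_stacking_weights(X, y):
    """Least-squares weights, equivalent to w = (X^T X)^{-1} X^T y
    when X has full column rank."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def combine(X, w):
    """Weighted combination of the individual model generations."""
    return X @ w


# Hypothetical generations for one gene across four samples.
mopt = np.array([1.0, 2.0, 3.0, 4.0])    # column 1: MoPT output
cdiff = np.array([1.0, 0.0, 1.0, 0.0])   # column 2: CDiffusion output
X = np.column_stack([mopt, cdiff])

# Toy target built as a known mixture so the weights can be checked.
y = 0.7 * mopt + 0.3 * cdiff

w = fit_stacking_weights(X, y)
y_fwls = combine(X, w)
print(w)
```

Because the toy target is an exact mixture of the two columns, the
fitted weights recover the mixing proportions. With real data the fit
is only approximate, and the weights would be estimated separately for
each subgroup of conditions.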
P2GPT is capable of generating tissue-specific and accurate multi-omics data
We first tested 3 basic models including CGAN, CDiffusion, and MoPT, as
well as the multi-omics versions of CDiffusion, MoPT, and their
combination, P2GPT. As a simple model, CGAN was used to serve as the
benchmark or baseline in the analysis to evaluate the performance of
other models. All 6 models performed well in classifying tissues with
both real (Table [76]1) and generated (Table [77]2) labels. For P2GPT
and the diffusion model, columns 1 and 2 of Table [78]1 present results
with genes extrapolated from the landmark genes (obtained from the
LINCS L1000 project), whereas the CGAN model uses all genes for
generation. The comparison of all 6 models on human expression, human
DNA methylation, and mouse expression data showed that, for extremely
underrepresented data such as imbalanced tissue representation, which
is common in practice, only the P2GPT model, which combines the
multi-omics CDiffusion and MoPT models, was capable of generating
accurate omics data. We further visualized the data distribution by
UMAP dimensionality reduction. For human expression data (Fig. [79]2A,
B, Supplementary Fig. [80]1), the generated labels were highly
concordant with the real labels, further suggesting that P2GPT was
capable of synthesizing tissue-specific expression data accurately. On
the other hand, real human DNA methylation data were not well clustered
into tissue types (Fig. [81]2D, Supplementary Fig. [82]1). Despite
this, the generated methylation data were highly concordant with the
real data in terms of similarity and distribution (Fig. [83]2C, Table
[84]2). For mouse expression data, P2GPT could also generate concordant
and tissue-specific data, although less accurately than for human data
(Fig. [85]2E, F, Table [86]2) (see Methods). Overall, the model
performed well in tissue-specific data generation for both DNA
methylation and expression.
Table 1.
Tissue classification with real labels
Real labels                Expression (Human)   Methylation (Human)   Expression (Mouse)
MoPT                       0.995                0.880                 0.920
CDiffusion                 0.993                0.937                 0.960
CGAN                       0.991                0.930                 0.930
CDiffusion (multi-omics)   0.994                0.887                 0.993
MoPT (multi-omics)         0.997                0.916                 0.982
P2GPT                      0.999                0.944                 0.999
Weighted F1 score for tissue prediction with the LR model. The
best-performing model in every column is P2GPT.
Table 2.
Tissue classification with generated labels
Gen labels                 Expression (Human)   Methylation (Human)   Expression (Mouse)
MoPT                       0.928                0.780                 0.632
CDiffusion                 0.945                0.870                 0.810
CGAN                       0.960                0.874                 0.501
CDiffusion (multi-omics)   0.957                0.880                 0.887
MoPT (multi-omics)         0.956                0.830                 0.697
P2GPT                      0.965                0.910                 0.899
Weighted F1 score for tissue prediction with the LR model. The
best-performing model in every column is P2GPT.
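The weighted F1 metric reported in Tables 1 and 2 can be made concrete
with a short sketch. It covers only the metric, not the LR classifier,
and the tissue labels below are hypothetical. Each class contributes
its own F1 score, weighted by the number of true samples in that class,
so well-represented tissues dominate the score when the tissue
distribution is imbalanced.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical true and predicted tissue labels for six samples.
true_tissue = ["blood", "blood", "blood", "liver", "liver", "kidney"]
pred_tissue = ["blood", "blood", "liver", "liver", "liver", "kidney"]

score = weighted_f1(true_tissue, pred_tissue)
print(round(score, 3))
```

In this toy case one blood sample is misclassified as liver, which
lowers the F1 of both classes to 0.8 while kidney stays at 1.0.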
Fig. 2. UMAP of real data and data generated by Precious2GPT.
Each point represents an individual sample. A Human expression data
colored by data type (orange, real; blue, generated). B Human
expression data colored by tissue type. C Human methylation data
colored by data type (real or generated). D Human methylation data
colored by tissue type. E Mouse expression data colored by data type
(real or generated). F Mouse expression data colored by tissue type.
P2GPT outperformed other models in age prediction using generated data
Next, we trained a CatBoost regression model on real data with age as
the target and assessed its age prediction performance on the
generated data. Our results showed that P2GPT demonstrated the best
performance, achieving the lowest mean absolute error (MAE) and highest
R^2 score across all types of datasets tested when compared to other
models (Tables [91]3 and [92]4).
Table 3.
Age prediction quality with real labels
Real labels                Expression (Human)    Methylation (Human)   Expression (Mouse)
                           MAE        R^2        MAE        R^2        MAE        R^2
MoPT                       9.210      0.255      8.050      0.620      12.011     0.646
MoPT (multi-omics)         8.754      0.210      6.580      0.770      10.839     0.685
CDiffusion                 11.00      0.165      7.700      0.840      13.376     0.674
CDiffusion (multi-omics)   8.297      0.307      7.010      0.880      11.952     0.686
P2GPT                      8.247      0.313      6.300      0.890      10.708     0.693
R^2 and MAE obtained with CatBoostRegressor. MAE is reported in days
for mouse expression and in years for human methylation and
expression. The best-performing model (lowest MAE and highest R^2) in
every column is P2GPT.
Table 4.
Age prediction quality with generated labels
Generated labels           Expression (Human)    Methylation (Human)   Expression (Mouse)
                           MAE        R^2        MAE        R^2        MAE        R^2
MoPT                       13.280     −0.260     12.70      0.650      61.982     −0.363
MoPT (multi-omics)         12.717     0.019      12.47      0.710      61.073     −0.323
CDiffusion                 12.310     0.030      11.18      0.810      50.930     −0.120
CDiffusion (multi-omics)   11.848     0.108      10.94      0.800      50.400     −0.105
P2GPT                      10.906     0.156      9.890      0.820      48.940     −0.010
R^2 and MAE obtained with CatBoostRegressor. MAE is reported in days
for mouse expression and in years for human methylation and
expression. The best-performing model (lowest MAE and highest R^2) in
every column is P2GPT.
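The two metrics in Tables 3 and 4 can be written out in a few lines.
The sketch below is only an illustration of the definitions and does
not reproduce the CatBoostRegressor pipeline; the ages and predictions
are hypothetical. It also shows why R^2 can be negative, as in several
cells of Table 4: the score falls below zero whenever the predictions
fit worse than a constant prediction of the mean age.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical chronological ages (years) and two sets of predictions.
ages = [20.0, 40.0, 60.0, 80.0]
close_pred = [25.0, 35.0, 65.0, 75.0]   # small errors
poor_pred = [80.0, 60.0, 40.0, 20.0]    # worse than predicting the mean

mae_close = mean_absolute_error(ages, close_pred)
r2_close = r2_score(ages, close_pred)
mae_poor = mean_absolute_error(ages, poor_pred)
r2_poor = r2_score(ages, poor_pred)
print(mae_close, r2_close, mae_poor, r2_poor)
```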
It is important to highlight that the error of the expression-based
regressor is substantial, especially for mouse expression data, where
R^2 is negative for every model on generated data (Table [94]4). In
contrast, the generative models conditioned on age perform markedly
better in terms of MAE and R^2 on human DNA methylation data. This
observation guided subsequent experiments to focus on DNA methylation
with the age condition. Notably, combining these heterogeneous models
yields the best performance: improvements are seen both in the models
trained on multiple omics datasets and in their combination. The best
regression results on DNA methylation (Supplementary Fig. [95]2) show
that our model achieves high quality in all tissues.
The effect of underrepresented data on P2GPT’s data synthesis
We then investigated how the representation of data affects the P2GPT’s
ability to generate new synthetic data in each of the three data types
(human expression, human DNA methylation, and mouse expression). The
model was asked to generate 300 synthetic samples for each tissue with
the additional condition of age from a uniform distribution of age
values. The model requires a relatively small number of real samples
to start generating valid synthetic expression samples conditioned on a
specific age and tissue, while the error rate for the generation of DNA
methylation samples is considerably higher (Fig. [96]3). We cannot
definitively state that the number of samples in a dataset is the main
factor in the successful generation of synthetic samples, but the
trends in the graphs suggest that it is one of the most important
factors. Here, we defined correctly generated samples as those with a
strictly proper structure, i.e., the order of genes and their generated
omics numerical values were correct.
Fig. 3. Analysis of underrepresented data synthesis using P2GPT.
Each point is a specific tissue, while the x-axis shows the number of
samples presented in the dataset, and the y-axis shows the number of
correctly generated samples out of the expected 300 samples. Left:
Human expression. Middle: Human methylation. Right: Mouse expression.
P2GPT is capable of generating highly accurate DNA methylation data across
ages
Since our results from CatBoost regression analysis (Tables [99]3 and
[100]4) indicated that DNA methylation data was the best data type for
age prediction using the generated data, we used it to compare the
similarity between real and generated data in terms of differential
methylation. For each of the tissues, we compared the DNA methylation
levels of samples of 80 ± 20 years old and 30 ± 20 years old in the
real data, as well as those of 80 ± 20 years old (predicted) and
30 ± 20 years old (predicted) in the generated data. The significantly
differentially methylated genes were then obtained from each of them to
identify the number of intersections (Fig. [101]4). We then studied how
much overlap there is between real and generated data on sets of
differentially methylated genes (Table [102]5, MA). We hypothesized
that if the models captured a certain difference between groups, in
this case, a 50-year difference, then we could generate the data for
older people or mice. For certain tissues (e.g., leukocytes or liver),
a single model performs better. However, for the occipital lobe, blood,
breast, and some other tissues, P2GPT shows better performance. For
buccal mucosa, the overlap is very small, but this is because there are
only 3 differentially methylated genes. Overall, P2GPT is more stable
across tissues, with no evident bias toward particular tissues. In
addition, we compared the DNA methylation levels in 80-year-old samples
between the real and generated data. We hypothesized that if the real
and generated data shared high similarity, low counts of differentially
methylated genes would be obtained. For the majority of tissues
analyzed, P2GPT yielded the lowest numbers of differentially methylated
genes in the comparison between real and generated data (Table [103]5,
DM). This further suggests that our model both captures the difference
in DNA methylation levels between groups at different ages and
preserves DNA methylation values within the same age group between real
and generated data.
Fig. 4. Tissues with the most overlaps between real and generated data in
differentially methylated genes were identified by 30 vs 80 years old
comparisons.
gen: generated.
Table 5.
Results of overlapping differentially methylated genes between
generated and real data
Tissue                              MoPT (multi-omics)        CDiffusion (multi-omics)  P2GPT
                                    MA           DM           MA           DM           MA           DM
Immune cells                        0.931        0.396        0.926        0.76         0.95         0.11
Saliva                              0.942        0.244        0.778        0.757        0.932        0.624
Cerebral cortex                     0.925        0.37         0.905        0.702        0.923        0.294
Blood                               0.828        0.24         0.902        0.805        0.907        0.149
Leukocyte                           0.875        0.269        0.801        0.775        0.789        0.202
Kidney                              0.364        0.168        0.533        0.659        0.752        0.01
Liver                               0.764        0.148        0.728        0.809        0.752        0.109
Peripheral blood mononuclear cell   0.759        0.100        0.697        0.169        0.748        0.000
Occipital lobe                      0.442        0.076        0.689        0.081        0.745        0.000
Breast                              0.633        0.323        0.707        0.464        0.721        0.05
Mucosa                              0.549        0.090        0.431        0.775        0.65         0.594
Cerebellum                          0.771        0.208        0.465        0.653        0.627        0.623
Frontal lobe                        0.535        0.175        0.418        0.468        0.477        0.336
Thyroid gland                       0.222        0.091        0.568        0.229        0.281        0.000
Buccal mucosa                       0.045        0.000        0.000        0.825        0.000        0.200
MA: Intersection of differentially methylated genes between the samples
of 30 and 80 years old in generated and real data. Values represent the
proportion of differentially methylated genes in the real data
comparison that intersected with the list of differentially methylated
genes in the generated data. A higher MA represents a better
performance of the model. DM: Differentially methylated genes between
80 years-old generated vs. 80 years-old real data. Values represent the
proportion of differentially methylated genes in the real data
comparison that intersected with the list of differentially methylated
genes in the generated data. Lower DM represents the better performance
of the model. For each tissue, the best performing (highest MA and
lowest DM) model(s) was/were bolded.
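As an illustration of the MA value defined above, the sketch below
computes the proportion of differentially methylated genes from the
real-data comparison that also appear in the generated-data comparison.
The function name and gene identifiers are hypothetical, and the
upstream differential methylation test is assumed to have already
produced the two gene lists.

```python
def overlap_proportion(real_dm_genes, generated_dm_genes):
    """Fraction of real-data DM genes recovered in the generated data."""
    real = set(real_dm_genes)
    if not real:
        # No differentially methylated genes in the real comparison.
        return 0.0
    return len(real & set(generated_dm_genes)) / len(real)


# Hypothetical DM gene lists from the 30 vs 80 years old comparisons.
real_dm = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
generated_dm = ["GENE_B", "GENE_C", "GENE_D", "GENE_E"]

ma = overlap_proportion(real_dm, generated_dm)
print(ma)
```

When the real comparison yields very few genes, as for buccal mucosa
with only 3 differentially methylated genes, a single missed gene
changes this proportion by a third, so small values for such tissues
should be read with caution.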
P2GPT can extrapolate age in generated data beyond the age range of the
training dataset
Since MoPT could not be used for ages that were not present in the
real training data, we used the diffusion model integrated in P2GPT in
an out-of-domain experiment to study its age prediction accuracy under
two conditions. In the first condition, the model was trained
exclusively on DNA methylation data from individuals younger than 50 or
80 years, while in the second condition, the model was trained on data
encompassing all ages (ranging from 0 to 114 years old) in the real
data. The number of samples available per age group varies across
tissues, with some tissues having underrepresented age groups (Fig.
[107]5A).
Our results indicated that, for specific tissues, the quality of the
model’s predictions was maintained particularly well along the
principal components most strongly correlated with age. Specifically,
we obtained notable results for blood and examined the components that
exhibited the highest correlation with age. We observed that the layer
representing generated samples of older individuals was positioned
right after the layer representing the 80–100 age group, regardless of
the training condition (Fig. [108]5B, C). Similarly, generated samples
at older ages were also positioned close to the real old samples in
blood when the training threshold was set at 50 years old
(Supplementary Fig. [109]3). This finding supports the model’s ability
to generate data for age values that were not encountered during
training, indicating its capacity to extrapolate beyond the observed
age range. Moreover, the results on tissues with the
underrepresented 80–100 years old group (Supplementary Fig. [110]4A) do
not differ much between both training conditions. However, when it is
well represented, the variance between samples is lower where the model
was trained on all labels (Frontal Lobe in Supplementary Fig. [111]4B).
Similar results were also observed in the cerebral cortex
(Supplementary Fig. [112]4C), cerebellum (Supplementary Fig. [113]4D)
and buccal mucosa (Supplementary Fig. [114]4E).
Fig. 5. Generation of methylation data based on real data and age.
A Distribution of age groups in real methylation data for each tissue.
B PCA for blood with real and generated data with different age bins
for models trained on data with the whole age distribution. C PCA for
blood with real and generated data with different age bins for models
trained on data with age lower than 80. The black line in each PCA
plot connects the cluster centroids of each age group, from [0,20] to
[100,120] for real data and [100,120] for generated blood data.
Applications of P2GPT-synthesized data in pathway enrichment analysis
To assess the applicability of P2GPT in biological data interpretation,
we performed differential DNA methylation analysis between the 120 and
150 years old samples in each tissue generated by P2GPT, followed by
pathway enrichment analysis on the differentially methylated genes
based on the KEGG database to identify the pathways that were
potentially related to aging. In blood, the enriched pathways for the
differentially methylated genes were associated with immune function,
organ development, and hormonal regulation (Supplementary Fig. [117]5).
In the liver, pathways associated with cytokine receptor interactions
and NOD-like receptor signaling were enriched, while pathways enriched
in the older thyroid gland revealed a shift towards chronic
inflammation and immune dysregulation as suggested by disrupted
cytokine interactions, increased neutrophil extracellular trap
formation, and alterations in hematopoietic cell lineage. Moreover,
overrepresentation analysis showed that the differentially methylated
genes were enriched in several hallmarks of aging across multiple
tissues. In particular, inflammation was found to be enriched in 4
tissues (blood, liver, saliva, thyroid gland), while genomic
instability was enriched in 2 tissues (liver, thyroid gland) and
altered intercellular communication in 2 tissues (blood, thyroid
gland). These conclusions are based on the pathways enriched across
all tissues in KEGG_2021_Human. Lists of pathways for selected tissues
are provided in Supplementary Table [118]1. We also used GPT-4 to add a
description of these pathways based on the direction of methylation and
their importance in aging. Most of the pathways are indeed related to
aging, which suggests that our P2GPT model can generate data for older
individuals and identify key biological signaling pathways of aging.
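For readers unfamiliar with overrepresentation analysis, the sketch
below shows one common formulation, a one-sided hypergeometric test.
This is an assumption for illustration: the text does not state which
statistical test or software was used, and the gene counts are
hypothetical.

```python
from math import comb


def ora_p_value(n_background, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without
    replacement from n_background genes, n_pathway of which belong
    to the pathway (hypergeometric upper tail)."""
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 background genes, 5 in the pathway,
# 4 differentially methylated genes, 3 of which fall in the pathway.
p = ora_p_value(n_background=20, n_pathway=5, n_selected=4, n_overlap=3)
print(round(p, 4))
```

A small p-value indicates that the pathway contains more
differentially methylated genes than expected by chance. In practice
the p-values would also be corrected for testing many pathways at once.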
Case study experiment in colorectal carcinoma
To demonstrate the potential of in silico-generated gene expression
data as control samples for actual colorectal cancer samples, we
generated control samples for corresponding case samples of eight
colorectal cancer (CRC) cell lines using our P2GPT model. Subsequently,
eight case-control comparisons for the corresponding CRC cell lines
were incorporated into a meta-analysis. One CRC meta-analysis was
constructed using a restricted subset of genes, referred to as the
“landmark” genes, and another was created using a comprehensive set of
genes, termed “restored” genes, which were reconstructed by a CycleGAN
model described in the Supplementary materials. For each of the two
meta-analyses, we extracted the common gene expression signatures
across all eight cell lines, which yielded two lists of gene expression
signatures. Each of them was subjected to Spearman’s correlation test
with the gene expression signature obtained from the pre-calculated CRC
meta-analysis on PandaOmics (Fig. [119]6). We observed that common gene
expression signatures calculated on an extended number of genes
(restored genes) exhibited greater similarity to the benchmark CRC
signatures (r = 0.552) compared to signatures calculated on a limited
number of genes (landmark genes, r = 0.497). Additionally, applying a
gene expression significance threshold was associated with higher
overall signature similarity.
The combined landmark signature generated with the P2GPT model
demonstrated strong similarity to the CRC benchmark signature (Fig.
[120]6A). However, its performance on the restored subset of genes was
even better (Fig. [121]6B). Consequently, the meta-analysis derived
from comparisons created using the P2GPT on an extended number of genes
was further employed for target identification analysis.
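The signature comparison can be sketched as a Spearman correlation
computed over the genes shared by two signatures. The gene names and
expression changes below are hypothetical, and the helper functions are
written out in plain Python only to make the rank-based calculation
explicit; they are not the PandaOmics implementation.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical signatures: gene -> expression change.
benchmark = {"G1": 2.0, "G2": 1.0, "G3": -0.5, "G4": -1.5, "G5": 0.3, "G6": 0.7}
generated = {"G1": 1.8, "G2": 0.2, "G3": -0.4, "G4": -2.0, "G5": 0.9, "G7": 0.1}

# Only genes present in both signatures are compared.
shared = sorted(benchmark.keys() & generated.keys())
rho = spearman([benchmark[g] for g in shared], [generated[g] for g in shared])
print(round(rho, 3))
```

Restricting the comparison to shared genes mirrors the difference
between the landmark and restored gene sets: a larger shared set gives
the correlation more genes to draw on.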
Fig. 6. Correlation matrix for colon carcinoma signatures.
Spearman correlation coefficients between colon carcinoma signatures
were calculated using only landmark genes (A) and all restored genes
(B). The colon carcinoma signature (PandaOmics CRC project signature)
was derived from the “expression analysis” section of manually curated
colon carcinoma meta-analysis in PandaOmics and corresponded to the
combined gene expression changes values for colon carcinoma. P2GPT CRC
signature was collected from the corresponding meta-analysis in
PandaOmics.
As before, the manually curated CRC project available in PandaOmics
served as a benchmark for target hypotheses. The Target ID results were
compared between the CRC meta-analysis, containing case-control
datasets that were obtained from patients, and the P2GPT CRC
meta-analysis (refer to the previous section). This case study aimed to
demonstrate that the in silico-generated control samples could be used
as control samples for actual case samples. Therefore, to compare
Target ID results, we only used omics scores for hypothesis ranking and
included solely druggable target families^[124]20. The top 20 target
hypotheses for both the benchmark and the P2GPT-restored CRC
meta-analyses are depicted on a heatmap (Fig. [125]7). Using the Target
ID approach, we explored the genes that scored highly in the PandaOmics
CRC meta-analysis and in the CRC meta-analysis built with
P2GPT-generated controls. Analysis of the overlapping genes showed that
both top 20 target hypothesis lists contain hits that are strongly
associated with CRC pathology. For instance, AKT1, PTEN, and PIK3R1 are
key modulators in the PI3K/AKT pathway, while PLK1, CDK2, and MAPK14 are
major drivers involved in cell cycle regulation. Ranked at the top of
both Target ID results, AKT has been extensively studied in disease
pathogenesis^[126]21,[127]22 and is altered in CRC patients^[128]23.
CDK2 also scored highly in both meta-analyses. CDK2 has been studied in
the G1/S phase transition^[129]24, and CDK2-selective inhibitors have
already been tested in CRC models^[130]25. Genes that
were top scored only in P2GPT-controls generated CRC meta-analysis
(PIK3CD, FYN, YES1, ATM, HRAS, TNFRSF1A, GSK3B, PLCG1, CSK, PIK3CA) are
also related to pathogenesis. For example, PIK3CD was shown to be
involved in AKT/GSK-3β/β-catenin signaling and could be considered as a
potential target^[131]26, while mutations in PIK3CA were observed in
20% to 25% of CRC^[132]27 and associated with shorter cancer-specific
survival^[133]28. The results were supported by PandaOmics knowledge
graph (Supplementary Fig. [134]6). Overall, our results suggested that
our P2GPT model can be used to generate expression data that could be
utilized in target discovery. We showed that gene expression changes
between case and control (both real and generated) samples resulted in
a similar disease-specific expression signature. At the same time, the
Target ID approach applied to data from patients (colon carcinoma
PandaOmics meta-analysis) and to the P2GPT-controls generated colorectal
cancer meta-analysis showed a strong overlap in well-known targets
for colorectal carcinoma, along with a new target hypothesis.
Fig. 7. Top 20 most promising target hypotheses for colorectal cancer.
Results were derived from the in silico Target ID scoring approach for
PandaOmics colorectal carcinoma meta-analysis (A) and P2GPT colon
cancer meta-analysis (B). To validate our approach, only omics-based
scores with the application of a druggability filter were taken into
account and used for the composition of the scores for ranking.
Discussion
Studies have identified several key biomarkers, proteins and pathways
that play important roles in aging^[137]29,[138]30. Despite this, the
multifaceted process of aging still requires substantial understanding
and unraveling of the complex biological data. To address this, the use
of artificial intelligence (AI) in the field of aging research has been
increasing recently. Indeed, deep learning-based approaches have been
proposed to play vital roles in multiple areas facilitating aging
research, such as predicting biological age^[139]16,[140]31,[141]32,
developing biomarkers^[142]15,[143]33–[144]35, identifying therapeutic
targets^[145]15,[146]20,[147]36–[148]39, and generating novel
compounds^[149]34,[150]40,[151]41. In fact, studies have also
demonstrated the potential of applying AI models to identify targets
implicated in aging and age-associated diseases, targeting established
hallmarks of aging^[152]20,[153]30,[154]37,[155]42.
In this study, we present a hybrid approach Precious2GPT (P2GPT) that
combines the complementary strengths of the CDiffusion and MoPT models
for generating high-quality multi-omics DNA methylation and expression
data. Our approach reduces the limitations of individual models and
leverages their strengths to enhance the generation process. This
innovative approach has potential applications in various fields,
including data analysis, algorithm development, and privacy
preservation for multi-omics research. We demonstrate the effectiveness
of our hybrid approach by comparing the quality of data generated using
individual models of CGAN, CDiffusion, and MoPT, with the combined
hybrid approach P2GPT. With the aid of tissue classification and age
regression experiments, the performance of models was assessed in terms
of their specificity to species and tissue types, as well as their
capability to predict age based on learned patterns from real data.
Upon training the transformer-based model with this corpus, we
demonstrated its high capability of generating new data conditioned on
specific factors like age or tissue type. In our study, we encountered
a primary challenge in generating tabular data from continuous gene
expression and DNA methylation omics data. Previous works have
attempted the conversion of table data to text before the application
of the pretrained GPT-2 model^[156]43. Another approach addressed the
complexity issue by using the GPT-2 architecture with a customized
vocabulary to improve the efficiency during both training and inference
of the model^[157]44. Hence, we devised an encoding scheme, wherein
each gene and its corresponding omics value were represented as
individual tokens. In essence, our approach treated the gene-omics data
as pseudo-text, enabling us to utilize the transformer-based model,
ultimately introducing the MoPT model. By evaluating the generated data
through age and tissue predictions across different data types and
species, we highlight the potential of transformer architectures in
bioinformatics tasks. This work represents the first biomedical-specific
adaptation of a language model for generating tabular data.
Synthetic data plays a crucial role in overcoming data insufficiency by
providing synthetic controls that replicate the biological properties
of real control samples and enhance equity in differential expression
analysis. The use of generated data also enables cost-effective testing
of algorithms and pipelines in a virtual experimental platform,
allowing researchers to mimic the effects of interventions under
specific scenarios such as varying levels of noise and different
degrees of differential expression^[158]45. Furthermore, the potential
impact of alterations in genomic profiles can be predicted with
synthetic gene knockdowns or knockins data^[159]46. Our P2GPT model
demonstrated exceptional performance in classifying tissues based on
synthetic data. The model’s accuracy is remarkable, with its
predictions closely resembling those based on real biological datasets
as evidenced by the high correlation coefficients in cross-validation
studies and the model’s robustness when tested against known
benchmarks. In the age regression analysis, P2GPT showcased its
aptitude by accurately predicting the biological age of samples using
synthetically generated DNA methylation patterns. The synthetic data,
when compared against real-world epigenetic clocks, confirmed that
P2GPT successfully captured the nuances of age-related changes, with a
minimal margin of error. This reveals the potential for wide-ranging
applications in biogerontology and personalized medicine.
Leveraging out-of-scope (OOS) experiments with the P2GPT model has
revealed that across various tissues, aging is consistently associated
with dysregulated immune function, chronic inflammation, and alteration
in cell lineage and signaling pathways. Age-associated dysregulated
immune function, accompanied by chronic inflammation (inflammaging),
contributes to the process of immunosenescence observed in aged
individuals^[160]47,[161]48. The alteration in signaling pathways has
been shown to trigger inflammaging and senescence across multiple
tissues^[162]30. These biological processes markedly contribute to the
increased disease burden observed in the elderly population and present
potential targets for therapeutic intervention. The insights delivered
by the P2GPT model’s OOS experiments underscore the value of advanced
computational models in understanding the complex biological
underpinnings of aging and spotlighting potential avenues to mitigate
its detrimental effects on health. Our results showed that our model
can be utilized to identify biologically relevant pathways and
processes through synthetic data generation.
By combining MoPT and CDiffusion models using Feature Weighted Linear
Stacking (FWLS), we aimed to improve the overall predictive performance
and generalization ability. This approach integrates diverse
perspectives and captures complementary information from each model,
resulting in a more robust and accurate prediction. Applying FWLS
during coefficient calculation allowed us to obtain more accurate
predictions by incorporating individual model strengths. By considering
model weights, we ensured that more accurate and reliable models had a
higher impact on the final generation, mitigating biases or
inaccuracies introduced by any single model and providing a more robust
prediction. Our findings indicate that the coefficients derived from
P2GPT allow a refined integration of the two models, leading to
enhanced performance with improved generation quality. Despite the
advancements achieved by integrating MoPT and CDiffusion models with
FWLS, there are certain limitations in the current P2GPT model.
Firstly, the complexity of the model poses a potential barrier to
replication and broader application. The intricacies involved in
managing and interpreting the combination of such models may limit
their use by those without deep expertise in bioinformatics and access
to substantial computational resources. Secondly, the current iteration
of P2GPT primarily processes tabular data or two-dimensional image data
and cannot yet accommodate the analysis of graph structures that
represent complex biological interactions or pathways.
Future extensions of the model that incorporate graph neural networks
could enable the analysis of data represented in graph forms, such as
protein-protein interaction networks or gene regulatory networks.
Despite these limitations, the synergistic integration of MoPT and
CDiffusion models through FWLS has successfully demonstrated an
enhanced predictive capability.
Our findings underscore the versatility and effectiveness of
transformer architectures in handling bioinformatics tasks. However, it
is important to acknowledge that the success of our P2GPT is attributed
to the generation of relatively large sequence lengths and the design
of an effective encoding scheme. Future work can expand the application
of our method in other bioinformatics tasks like survival analysis,
cross-modality prediction, and generation of omics depending on the
disease or drug, thereby broadening the usage of transformer
architectures in the field. For instance, beyond aging research, P2GPT
could facilitate the analysis of fundamental processes underlying tumor
progression, resistance, and metastasis. Additionally, modeling the
timing and administration methods of various therapy combinations could
provide insights into how tumor cells develop resistance to
drugs^[163]49–[164]51. In addition, we envision further refining our
hybrid approach by exploring additional generation models and
incorporating various omics data types. Moreover, we believe that
validation of synthetic data through downstream applications and
benchmarking against real-world datasets would enhance the utility and
robustness of synthetic multi-omics data. Lastly, we anticipate the
future integration of P2GPT into clinical settings, enabling invaluable
applications such as simulating tissue-specific biological data without
invasive biopsies to predict treatment responses, predicting biological
changes and disease progression trajectories, and incorporating various
clinical parameters to enhance the accuracy for personalized disease
monitoring and therapeutic strategies.
In summary, we developed Precious2GPT, a generative model capable of
producing methylation and expression data, which are invaluable
resources for aging research due to the scarcity of longitudinal
biological data. Through multiple lines of evidence and validation, we
demonstrated the significant potential of Precious2GPT in facilitating
aging research. Future work addressing the aforementioned limitations
would further strengthen the model’s applicability, accuracy, and
comprehensiveness, providing a powerful tool for biological discovery
and translational medical research.
Methods
Data sources
In this study, expression and methylation data were adopted across two
species, human and mouse. Access to Genotype-Tissue Expression (GTEx)
V8-protected data (phs000424) was authorized by the Data Access
Committee of NCBI dbGAP. Human transcriptomic data^[165]52 and sample
attribute data were downloaded, constituting 12,453 samples.
Complementary mouse transcriptomic data were sourced from the ARCHS4
database, V2.2 (12,541 samples). Mouse genes were mapped to their
corresponding human orthologs with the use of Human Genome Organisation
Gene Nomenclature Committee (HGNC) mappings^[166]53. Both GTEx and
ARCHS4 RNA-seq data were procured in the form of raw gene counts. These
datasets underwent log2 transformation, followed by quantile
normalization applied to each tissue type separately within the
expression datasets. After performing log2 transformation and quantile
normalization, we preserved the target distribution to facilitate its
application to novel samples. Human DNA methylation data was aggregated
from the Illumina Infinium HumanMethylation450 BeadChip array datasets,
retrieved from the China National Center for Bioinformation’s (CNCB)
data repository (8,285 samples)^[167]54. Methylation beta values were
mapped to genomic features based on the HumanMethylation450 v1.2
Manifest File. In detail, we intentionally focused our attention on the
CpGs located exclusively within the TSS200 region, as these were
interpreted as the most relevant to age prediction. The TSS200 region,
defined as the area comprising 200 base pairs upstream of the
transcription initiation site, is documented as crucial for gene
regulation processes. Consequently, the beta values of the CpGs
situated within a gene’s TSS200 were averaged for downstream analysis.
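The expression preprocessing described above (log2 transformation, quantile normalization, and a preserved target distribution for novel samples) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the pseudocount of 1 is an assumption made so that zero counts can be transformed, and ties are broken by gene order.

```python
import math

def log2_transform(counts):
    """log2(count + 1) for one sample; the +1 pseudocount is an assumption."""
    return [math.log2(c + 1.0) for c in counts]

def fit_target_distribution(samples):
    """Target distribution: mean of the k-th smallest value across samples."""
    sorted_samples = [sorted(s) for s in samples]
    n_genes = len(samples[0])
    return [sum(s[k] for s in sorted_samples) / len(samples)
            for k in range(n_genes)]

def quantile_normalize(sample, target):
    """Replace each value with the target value of the same rank."""
    order = sorted(range(len(sample)), key=lambda g: sample[g])
    normalized = [0.0] * len(sample)
    for rank, gene_idx in enumerate(order):
        normalized[gene_idx] = target[rank]
    return normalized

# Fit the target on the samples of one tissue, then reuse it for a new sample.
tissue_samples = [log2_transform(s) for s in ([0, 3, 7], [1, 15, 3])]
target = fit_target_distribution(tissue_samples)
new_sample = quantile_normalize(log2_transform([7, 0, 1]), target)
```

Storing `target` separately for each tissue is what allows a previously unseen sample to be mapped onto the same distribution without refitting.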
Preprocessing methods
For image construction, the DeepInsight technique, with the application
of convolutional neural networks (CNNs)^[168]55,[169]56 and Kohonen’s
self-organizing maps (SOMs)^[170]57,[171]58, was used to transform
non-image data into image-like representations in the CGAN and CDiffusion
models. To accelerate the training and inference of
computationally heavy models, a deep learning approach based on
CycleGAN^[172]59,[173]60 was employed to generate synthetic data in the
CDiffusion, MoPT and Precious2GPT models. In brief, generation methods
work either with text or with images. We used DeepInsight to
construct images for the CGAN and CDiffusion models and for the CDiffusion
part of the Precious2GPT model. To compare individual genes in each pixel
of the images, SOM was used instead of the t-SNE, UMAP and PCA algorithms.
For each dataset, we built a separate SOM of different dimensions to
minimize unused space in the square image, and each picture was colored by
expression or methylation, along with the training set in different
colors.
DeepInsight
CNNs were used to automatically extract features from spatially coherent
pixels, detecting higher-order statistics and non-linear correlations,
and to provide promising performance in learning complex patterns and
relationships in the data. To improve the efficiency of CNNs,
one-dimensional (1D) biological data was transformed into
two-dimensional (2D) representations. DeepInsight is a methodology
designed to transform non-image data into image-like representations,
allowing CNNs to be applied more
effectively. It serves as the basis for the DeepInsight-3D model, which
extends this approach to multi-domain tabular datasets. The DeepInsight
pipeline consists of the following steps (Supplementary Fig. [174]7):
1. Data normalization: The input data is normalized to ensure that all
features have the same scale. This is typically achieved by applying
min-max scaling, z-score normalization, or other suitable normalization
techniques.
2. Dimensionality reduction: The high-dimensional input data
is transformed into a lower-dimensional representation. This can be
done using dimensionality reduction techniques such as t-SNE, UMAP, or
PCA. The resulting lower-dimensional data retains the most important
information from the original data while reducing noise and
computational complexity.
3. Image generation: The lower-dimensional data
is then converted into a 2D image-like representation. This is achieved
by mapping each data point to a pixel in the image, with the pixel
intensity representing the value of the corresponding feature. The
resulting image preserves the spatial relationships between the data
points, allowing CNNs to effectively capture local and global patterns
in the data.
4. Convolutional neural network (CNN) training: The generated
images are used as input to a CNN, which is trained to perform a
specific task, such as classification or regression. Recently developed
techniques such as diffusion models can also be used to effectively process
such data (Supplementary Fig. [175]7).
By transforming non-image data
into image-like representations, DeepInsight-like models allow for the
efficient application of image-oriented models to a wide range of data
types, including biological data.
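The normalization and image generation steps above can be illustrated with a short sketch. It assumes that the 2D coordinates of each feature have already been produced by some dimensionality reduction method; the function names, the averaging of features that fall on the same pixel, and the zero fill for empty pixels are illustrative choices rather than details taken from the DeepInsight implementation.

```python
def min_max_scale(values):
    """Scale the feature values of one sample to [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def features_to_image(coords, values, size):
    """Place features on a size x size grid using their 2D coordinates.

    coords: one (x, y) pair per feature, from any 2D embedding (assumed given)
    values: feature values of one sample, in the same order as coords
    Features landing on the same pixel are averaged; empty pixels stay 0.0.
    """
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    x_min, x_span = min(xs), (max(xs) - min(xs)) or 1.0
    y_min, y_span = min(ys), (max(ys) - min(ys)) or 1.0
    sums = [[0.0] * size for _ in range(size)]
    counts = [[0] * size for _ in range(size)]
    for (x, y), v in zip(coords, values):
        col = min(int((x - x_min) / x_span * size), size - 1)
        row = min(int((y - y_min) / y_span * size), size - 1)
        sums[row][col] += v
        counts[row][col] += 1
    return [[sums[r][c] / counts[r][c] if counts[r][c] else 0.0
             for c in range(size)] for r in range(size)]

# Four features, two of which share a pixel in a 2 x 2 image.
coords = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (0.0, 1.0)]
image = features_to_image(coords, [0.25, 0.5, 1.0, 0.75], size=2)
```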
SOM
Kohonen’s self-organizing maps (SOMs) offer a promising alternative to
PCA or UMAP for dimensionality reduction in the context of transforming
non-image data into image-like representations. As an unsupervised
learning algorithm, SOMs excel at converting high-dimensional data into
lower-dimensional representations while preserving the topological
structure of the input data. This ability to maintain the spatial
relationships between data points makes SOMs particularly well-suited
for generating images that can be fed into convolutional neural
networks (CNNs). Unlike PCA, which focuses on linear relationships and
maximizes variance, or UMAP, which aims to preserve both local and
global structure, SOMs employ a competitive learning process that
iteratively updates neuron weight vectors to better represent the input
data. This results in a 2D grid of neurons that captures complex
relationships between variables, potentially leading to more effective
feature extraction and improved performance of the CNN. By
incorporating Kohonen’s SOMs into the DeepInsight methodology, we can
harness the unique advantages of this algorithm to enhance the analysis
of non-image data using deep neural networks.
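The competitive learning process described above can be sketched in a few lines. This is a generic, minimal SOM and not the configuration used in this study: the linear decay of the learning rate and neighborhood width, the Gaussian neighborhood, and all default values are assumptions. In a DeepInsight-style setting, each data point would be one gene (its values across samples), and the grid position of its best matching unit would give the pixel for that gene.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position (row, col) of the neuron closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM; returns weights[row][col] -> vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                 # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3    # shrinking neighborhood
        for x in rng.sample(data, len(data)):   # shuffled pass over the data
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
    return weights
```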
CycleGAN
To speed up training and inference of the heavy models (CDiffusion, MoPT
and P2GPT), we used the CycleGAN model during post-processing to
extrapolate to all genes. We trained the heavy models, generated data
with them, and then extrapolated the results using this
model. In detail, our domain X consists of data for the 978 landmark
genes^[176]59 and domain Y consists of the desired output data for 11,278
genes, which form the intersection across several omics datasets and
species types. The set of 978 genes serves as the starting point to
generate synthetic output data for the 11,278 genes.
A CycleGAN comprises two generators (G and F) and two discriminators (Dx
and Dy). Generator G maps domain X to Y (G: X → Y), while F
performs the reverse mapping (F: Y → X). Dx aims to distinguish between X
and F(Y), whereas Dy discriminates between Y and G(X). The
training goes as follows: first, the generator G translates a sample
data from domain X into a synthetic data of Domain Y. Subsequently, the
generator F tries to regenerate the original sample from this synthetic
data. The objective is to train the CycleGAN in learning the mapping
such that the regenerated data closely matches the input data. This is
referred to as forward-cycle consistency. A backward cycle consistency
is simultaneously processed from Domain Y to X, and the whole cycle
repeats continuously in learning. The network learns from the
inconsistencies between the regenerated data and the original input
data to increase the capabilities in generating synthetic data aligned
with the target domain. Importantly, the discriminators Dx and Dy also
participate in this training process, aiming to classify each instance
as coming from the actual dataset or from the respective generator.
As a result, CycleGAN can extrapolate the data from 978
genes to realistically simulate data for 11,278 genes even in cases
where paired samples are lacking. This model allowed us to generate a
large amount of data in a short time
with minor losses in quality compared to the full set. The
production model will eventually use the full dataset,
but for some experiments this is not necessary.
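The forward and backward cycle consistency described above can be written as a single loss term. The sketch below uses the mean absolute (L1) difference that is standard for CycleGAN; the adversarial terms for Dx and Dy are omitted, and the toy generators are hypothetical stand-ins for the trained networks, chosen only so that the arithmetic is easy to follow.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(G, F, x_batch, y_batch):
    """Forward cycle |F(G(x)) - x| plus backward cycle |G(F(y)) - y|."""
    forward = sum(l1_distance(F(G(x)), x) for x in x_batch) / len(x_batch)
    backward = sum(l1_distance(G(F(y)), y) for y in y_batch) / len(y_batch)
    return forward + backward

# Toy stand-ins: G expands each landmark value to two output genes,
# F averages each pair of output genes back to one landmark value.
def toy_G(x):
    return [v for v in x for _ in range(2)]

def toy_F(y):
    return [(y[i] + y[i + 1]) / 2.0 for i in range(0, len(y), 2)]

loss = cycle_consistency_loss(toy_G, toy_F,
                              x_batch=[[1.0, 2.0]],
                              y_batch=[[1.0, 3.0, 2.0, 2.0]])
```

In this toy case the forward cycle is reconstructed exactly, while the backward cycle loses the within-pair variation of the output genes, so only the backward term contributes to the loss.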
Generation methods
Mathematical formulation of conditional generation task
In the context of conditionality, we aim to develop models that can
generate data instances conditioned on multiple factors: tissue ($T$),
age ($A$), species ($S$), and omics types ($D$). We represent the
generated data as $X$, and the conditions as a tuple $C = (T, A, S, D)$.
The conditional generation task is defined given a set of training
data ($D$):
$$D = \{(X_i, C_i)\}_{i=1}^{N},$$
where $X_i$ represents the observed data instances and $C_i$
represents the corresponding conditions in order to learn a conditional
generative model $G$ that can sample data instances $X$ conditioned on
arbitrary conditions $C$.
The training objective of this model is to estimate the conditional
probability distribution $P(X \mid C)$, such that
$$G(X \mid C) \approx P(X \mid C),$$
where $G(X \mid C)$ represents the data generated by our model.
CGAN
To evaluate the performance of Precious2GPT, a conditional generative
adversarial network (CGAN) was used as a positive control in the
validation experiments. Generative adversarial networks are more
classical, easier to train and faster at inference, and thus
serve as a baseline for the other models. In some situations, they
have been observed to perform well without necessarily relying on
complex patterns, in particular when the generation task has a single
condition and the age condition is not taken into account. This
generative model was trained to generate synthetic data using two
networks, the generator $G$ and the discriminator $D$. In CGAN, the
generator $G$ was trained to produce data samples that are
indistinguishable from real data by the discriminator $D$, whilst the
generator took the conditions $C$ as input and generated data $X$
(Supplementary Fig. [177]8). In the context of multi-omics data
integration, CGANs were employed to generate realistic images
corresponding to expression or methylation data with additional
conditions: tissue type, age, omics type, and species.
CDiffusion
Diffusion models were employed to estimate the likelihood of the
generated data $X$. The model was trained to sample data through a
diffusion process conditioned on $C$, and the likelihood of the data was
maximized throughout the learning process. A PyTorch implementation of
the conditional diffusion model published on GitHub^[179]61 (available at
[178]https://github.com/tcapelle/Diffusion-Models-pytorch/tree/main)
was used as the basis of the conditional diffusion
(CDiffusion) model, and PyTorch’s embedding was applied to construct
the conditionality on categorical features in this model. At its core, this
implementation involves a U-Net block that has self-attention layers
between the downsampling and upsampling layers. Standard diffusion
models incorporate temporal information in a tensor (later referred to
as the time step) that controls the noising/denoising process based on
the current step through being embedded into every
downsampling/upsampling layer of the U-Net block. This time step is
initialized by ongoing positional encoding given the previous step’s
tensor through a sinusoidal positional embedding (Supplementary Fig.
[182]9).
The diffusion model is conditioned on categorical features of sets of
genes by adding the conditions’ embeddings into the current time step.
PyTorch’s nn.Embedding is used as a learnable embedding layer that
stores embeddings mapping each class of a categorical condition (mapped
to unique integers) into a tensor with the time step’s shape. However,
such an embedding layer is not applicable for continuous features, such
as age, so ages are fused in the time step by first undergoing a small
network (three linear layers with ReLU activation functions) that
transforms a single floating-point value (age) into the time step’s
dimensions. Finally, the ages’ embeddings are similarly added to the
time step. While the outlined approach works as is for single
conditions, including multiple conditions leads to the optimizer
tilting its attention towards the condition with higher embedded
values. This is solved by normalizing all the conditions’ embeddings
(categorical/continuous) before adding them to the time step.
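The conditioning scheme described above can be sketched without a deep learning framework. In the sketch below, the condition embeddings are passed in as plain vectors standing in for the outputs of the learnable layers (the categorical embedding layer and the small network for age). The sinusoidal layout and the choice of L2 normalization are assumptions, since the text does not specify which normalization was used.

```python
import math

def sinusoidal_embedding(t, dim):
    """Sinusoidal embedding of a diffusion time step (dim must be even)."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def l2_normalize(vec, eps=1e-8):
    """Scale a vector to unit length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def conditioned_time_embedding(t, condition_embeddings, dim):
    """Add every normalized condition embedding to the time embedding."""
    emb = sinusoidal_embedding(t, dim)
    for cond in condition_embeddings:
        cond = l2_normalize(cond)
        emb = [e + c for e, c in zip(emb, cond)]
    return emb

# A tissue embedding with large values and an age embedding with small
# values contribute on the same scale once both are normalized.
tissue_emb = [300.0, 0.0, 0.0, 400.0]
age_emb = [0.0, 0.003, 0.004, 0.0]
emb = conditioned_time_embedding(0, [tissue_emb, age_emb], dim=4)
```

Without the normalization step, the tissue embedding in this example would outweigh the age embedding by several orders of magnitude, which is the imbalance the text describes.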
MoPT
To process large numbers of genes and omics values, the memory-efficient
transformer architecture MPT^[183]14 was utilized in the construction
of Precious2GPT. It incorporates a modified architecture inspired by
GPT-2^[184]6, where the positional embeddings are replaced with
Attention with Linear Biases (ALiBi). This modification enhances extrapolation
capabilities while requiring fewer GPU memory resources during model
training. Following the retraining process, our model underwent
biological adaptation to multi-omics data, ultimately presenting as the
Multi-omics Pretrained Transformer (MoPT).
Model setup and training procedure
We prepared a tokenizer whose vocabulary consists of all genes from the
datasets, all 2-digit values, and the tokens for age, tissue and
species.
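One way to picture such a vocabulary is the sketch below, in which every gene contributes a gene token followed by a 2-digit value token. The binning of values into 100 levels, the assumption that values are already scaled to [0, 1], and the function name are all hypothetical; the text does not state how omics values were mapped to 2-digit tokens.

```python
def encode_sample(condition_tokens, gene_values, n_bins=100):
    """Turn one sample into a flat token list (pseudo-text).

    condition_tokens: tokens describing the sample, e.g. species or tissue
    gene_values: dict mapping gene symbol -> value scaled to [0, 1] (assumed)
    Each value is binned into a 2-digit token "00".."99" (assumed scheme).
    """
    tokens = list(condition_tokens)
    for gene, value in gene_values.items():
        bin_idx = min(int(value * n_bins), n_bins - 1)
        tokens.append(gene)
        tokens.append(f"{bin_idx:02d}")
    return tokens

tokens = encode_sample(["TISSUE", "Brain"],
                       {"AKT1": 0.5, "CDK2": 1.0, "PTEN": 0.0625})
```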
We utilized the “mosaicml/mpt-7b” configuration from HuggingFace^[185]14.
To set the number of parameters, we followed the Chinchilla
scaling law^[186]62, which proposes that the number of model parameters
should be proportional to the number of tokens in the training corpus at
a ratio of 1:20. Applying this law to all three datasets yielded
the following model sizes: 4.1, 1.7 and 1 million parameters for the
multi-omics, expression and methylation datasets, respectively.
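The 1:20 rule reduces to a single division, shown here as a worked example. The corpus size of 82 million tokens is not reported in the text; it is back-calculated from the 4.1 million parameter model purely for illustration, and the function name is hypothetical.

```python
def chinchilla_parameter_budget(n_training_tokens, tokens_per_parameter=20):
    """Parameter count suggested by a 1:20 parameter-to-token ratio."""
    return n_training_tokens / tokens_per_parameter

# Illustrative corpus size only (back-calculated, not a reported figure).
n_params = chinchilla_parameter_budget(82_000_000)
```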
Learning curves for the different datasets are shown in Supplementary
Fig. [187]10. For each dataset, the evaluation set comprised 1000 samples
uniformly distributed across all tissues.
MoPT generation procedure
To generate new omics samples, we pass the desired generation conditions,
such as age, tissue, species and omics data type, to the model in the form
of a plain string with spaces between conditions, e.g.: “SPECIES Mouse
dataset EXPRESSION TISSUE Brain MouseAGE84 ”. We utilized top-k
together with top-p sampling, where k = 40, p = 0.9 and
temperature = 0.8.
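The sampling settings above can be illustrated with a small, framework-free sketch that applies temperature scaling, then top-k filtering, then top-p (nucleus) filtering to one vector of next-token logits. It is a generic illustration of the technique with the stated values as defaults, not the generation code used for MoPT, and the function names are hypothetical.

```python
import math
import random

def filter_top_k_top_p(logits, k=40, p=0.9, temperature=0.8):
    """Return [(token_index, probability), ...] kept after both filters."""
    scaled = [x / temperature for x in logits]
    # top-k: keep the k highest-scoring tokens
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    peak = scaled[top[0]]
    exps = [math.exp(scaled[i] - peak) for i in top]
    total = sum(exps)
    # top-p: keep the smallest prefix whose cumulative probability reaches p
    kept, cumulative = [], 0.0
    for idx, e in zip(top, exps):
        prob = e / total
        kept.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break
    mass = sum(prob for _, prob in kept)
    return [(idx, prob / mass) for idx, prob in kept]

def sample_next_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered, renormalized distribution."""
    kept = filter_top_k_top_p(logits, **kwargs)
    r, acc = rng.random(), 0.0
    for idx, prob in kept:
        acc += prob
        if r < acc:
            return idx
    return kept[-1][0]
```

A temperature below 1 sharpens the distribution before filtering, so the combination of the three settings favors high-probability tokens while retaining some diversity.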
Validation experiments
Tissue classification
To evaluate the quality of generated omics samples, a logistic
regression model was used for the tissue classification
task. The evaluation was based on the F1-score^[188]63, weighted by
classes for both real and generated data, as the key metric for
determining the reliability of the generated data. For each model, we
computed classification metrics twice. First, we generated synthetic
samples in a 1:1 ratio with the original data, and the metrics were
calculated on these samples, where we compared the real label with the
one predicted by the classifier (Supplementary Fig. [189]11). We
subsequently examined the performance of the classifier on uniformly
generated labels from the total number of tissues to evaluate its
effectiveness in handling unbalanced classes. In the case of multiple
conditions, we additionally generated age between the minimum and
maximum values present in our data, or other label types of dataset or
species within each tissue.
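The class-weighted F1-score used here can be computed directly from the conditioning labels and the labels predicted by the classifier. The sketch below weights each tissue's F1 by its share of the true labels, which is the usual definition of a weighted average; the classifier itself is omitted, and the example labels are made up.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """F1 per class, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Tissue used as the generation condition vs. tissue predicted by the classifier.
conditioned = ["brain", "brain", "liver", "liver"]
predicted = ["brain", "liver", "liver", "liver"]
score = weighted_f1(conditioned, predicted)
```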
In addition to the aforementioned metrics, UMAP^[190]64 representations
were used to visualize both synthesized and real tissue data in
identifying disparities or similarities between the two distributions.
Age regression
To predict the age of generated data, a CatboostRegressor^[191]65 model
was applied solely based on gene omics values. The training dataset was
composed of real samples paired with their respective age values as the
target variable, while the synthesized samples were utilized as the
testing data, and their predicted age values were compared with the ages
used as the conditioning variable. The performance of each model was
reported as mean absolute error (MAE)^[192]63 and R-squared (R^2) metrics.
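Both metrics follow from their standard definitions and can be computed as below, taking the age used as the generation condition as the reference value. The regressor is omitted and the example ages are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

conditioned_age = [20.0, 40.0, 60.0, 80.0]   # ages used for generation
predicted_age = [25.0, 35.0, 65.0, 75.0]     # regressor output (made up)
mae = mean_absolute_error(conditioned_age, predicted_age)
r2 = r_squared(conditioned_age, predicted_age)
```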
Differential methylation analysis
To examine the sample homogeneity of real and generated data, we
performed several statistical tests focusing on human methylation in
multiple tissues. First, the nonparametric Mann–Whitney^[193]66
statistical test was used where we fed methylation data from different
age groups generated by the CDiffusion, MoPT and Precious2GPT models.
To evaluate the ability of models in preserving the differentially
methylated genes of distinct age groups, we identified differentially
methylated genes between the samples obtained from 80 ± 20 vs 30 ± 20
years old individuals, in both real and generated data. We then
calculated the rate of intersection as the number of
differentially methylated genes shared by the two sets divided by the number
of differentially methylated genes in the real data. Differential
methylation analysis was also performed between the 80 ± 20 years old
(real data) and 80 ± 20 years old (generated data) groups to assess the
similarity between the real and generated data (Supplementary Fig.
[194]12). To control for multiple testing, the
Benjamini–Hochberg^[195]67 correction was applied.
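The multiple testing correction and the rate of intersection can be sketched as follows. The per-gene p-values would come from the Mann–Whitney test, which is not reimplemented here; the function names, the example p-values and the gene sets are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def intersection_rate(real_genes, generated_genes):
    """Shared differentially methylated genes divided by those in real data."""
    real = set(real_genes)
    if not real:
        return 0.0
    return len(real & set(generated_genes)) / len(real)

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
rate = intersection_rate({"A", "B", "C", "D"}, {"B", "C", "E"})
```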
Out-of-scope experiment
To validate our Precious2GPT model for age prediction, we conducted
several out-of-scope experiments involving the training of models with
methylation data at different age thresholds.
1. Two models were trained for this purpose. One was trained with
samples up to thresholds of 50 and 80 years old, and the other one
was trained with the entire sample.
2. Using the model trained with distinct thresholds, we generated data
for (threshold +20, threshold +40) years old, and compared the
clusters created by the generated data with those of real data on
PCA^[196]68 representations.
3. We trained a model with available data from 100 samples per tissue
for individuals aged between 120 and 150 years old, and generated
differential methylation data for pathway analysis to predict
aging-related alterations. The pathway analysis was conducted using
the Python package gseapy^[197]69, with the KEGG-human
database^[198]70 and the 12 HALLMARKS lists serving as the enriched
pathways for the differentially methylated genes.
Case study experiment
In the case study experiment focusing on colorectal cancer (CRC), we
utilized Precious2GPT to generate synthetic controls for CRC cell lines,
namely Caco2, Lovo, SW1417, NCI-716, RKO, HCT-8, SW480 and SK-CO-1, obtained
from our internal laboratory, the Robotic Lab. We fed the gene
expression data of the eight CRC cell lines as input to facilitate the
generation of respective synthetic controls using the pre-trained
Precious2GPT model. For the generated control samples, gene expression
data was normalized and uploaded to our PandaOmics platform. Within the
platform, individual comparisons (case vs. control) were established
for each cell line. These eight comparisons were then incorporated into
meta-analysis, and the results for CRC landmark genes (obtained from
the LINCS L1000 project) were generated through TargetID panel and
Knowledge graph. To evaluate the quality of control samples generated
by Precious2GPT, we compared the AI-driven target prioritization
results between the generated CRC data and the pre-calculated results
using real data available on PandaOmics.
Supplementary information
[199]Supplementary information^ (5.2MB, pdf)
Acknowledgements