Abstract

   Synthetic data generation in omics mimics real-world biological data,
   providing alternatives for training and evaluation of genomic analysis
   tools, controlling differential expression, and exploring data
   architecture. We previously developed Precious1GPT, a multimodal
   transformer trained on transcriptomic and methylation data, along with
   metadata, for predicting biological age and identifying dual-purpose
   therapeutic targets potentially implicated in aging and age-associated
   diseases. In this study, we introduce Precious2GPT, a multimodal
   architecture that integrates Conditional Diffusion (CDiffusion) and
   decoder-only Multi-omics Pretrained Transformer (MoPT) models trained
   on gene expression and DNA methylation data. Precious2GPT excels in
   synthetic data generation, outperforming Conditional Generative
   Adversarial Networks (CGANs), CDiffusion, and MoPT. We demonstrate that
   Precious2GPT is capable of generating representative synthetic data
   that captures tissue- and age-specific information from real
   transcriptomics and methylomics data. Notably, Precious2GPT surpasses
   other models in age prediction accuracy using the generated data, and
   it can generate data beyond 120 years of age. Furthermore, we showcase
   the potential of using this model in identifying gene signatures and
   potential therapeutic targets in a colorectal cancer case study.

   Subject terms: Drug discovery, Diseases

Introduction

   Biological synthetic data generation in the context of omics refers to
   the creation of artificial datasets that mimic the characteristics of
   real biological data, particularly in genomics, transcriptomics,
   proteomics, and other high-throughput biological technologies^[54]1.
   Generating synthetic data is valuable for various reasons, including
   the development and validation of computational methods, protection of
   privacy in sensitive datasets, and augmentation of limited real-world
   data. Generative adversarial networks (GANs) have been introduced as
   unique models to generate synthetic genomic data, ranging from DNA
   sequences to bulk RNA-seq data^[55]2,[56]3. Copula-based methods are
   other examples of classical statistical approaches in generating
   synthetic omics data, especially microarray gene expression data^[57]4.
   Moreover, diffusion models are a recent addition to deep learning for
   synthetic data generation; they simulate a diffusion process that
   gradually transforms a simple noise distribution into the target data
   distribution^[58]5. Large language models (LLMs), exemplified by
   Generative Pre-trained Transformer 2 (GPT-2), build on the Transformer
   architecture and have garnered substantial interest for their
   contributions to the analysis of sequential data and their
   capabilities in language understanding, generation, and
   prediction^[59]6. Although these models have shown
   promising results in generating high-quality synthetic data, their
   application to omics data is still in its infancy.

   The need for synthetic data in omics studies stems from a variety of
   practical considerations and research objectives. Synthetic datasets
   offer alternatives for training and evaluating genomic analysis
   tools without relying on large real-world datasets. They also serve as
   effective tools for controlling differential expression, testing
   algorithms and pipelines, modeling interventions, and
   studying the underlying structure of omics data^[60]7. For
   instance, researchers often compare gene expression levels between two
   or more conditions in differential expression analysis. Such conditions
   could include disease vs. healthy or treated vs. untreated samples.
   However, obtaining an adequate number of control samples can be
   challenging due to cost constraints or limited availability^[61]8. On
   the other hand, ‘conditional’ synthetic omics data generation is
   crucial to simulate real-life scenarios as gene expression patterns
   exhibit notable variations influenced by factors such as species,
   tissue type, or age of the sample. In particular, gene expression
   undergoes substantial changes throughout development and aging, with
   certain genes being upregulated or downregulated during distinct
   life stages^[62]9,[63]10. Therefore, synthetic data generation models
   that incorporate conditionality are essential for capturing
   condition-specific variation in gene expression and for handling the
   differing distributions of omics data across
   conditions^[64]11. However, achieving a high level of realism in
   synthetic data poses significant challenges, highlighting the
   need for more advanced modeling approaches.

   In our prior work, Precious1GPT^[65]12 demonstrated
   promising capabilities in predicting biological age and identifying
   dual-purpose therapeutic targets potentially implicated in aging and
   age-related diseases. In this study, we employed a novel method for
   generating multi-omics data that combines a language model for data
   generation with a diffusion model for creating visual representations
   of omics data. Deep learning
   models including Conditional GANs^[66]13, Conditional Diffusion
   (CDiffusion), and Multi-omics Pretrained Transformer (MoPT)^[67]14,
   together with our novel model Precious2GPT (P2GPT), which combines
   CDiffusion and MoPT, were compared in generating synthetic
   multi-omics, multi-species, and multi-tissue samples. While aging clocks
   based on various omics data types have been proposed, such as
   transcriptomics^[68]15, methylomics^[69]16, proteomics^[70]17, and
   metabolomics^[71]18, P2GPT focused on transcriptomic and DNA
   methylation data. This choice was driven by the abundant availability
   of these data types for training and validation purposes, particularly
   when considering factors like age and tissue specificity. Combining
   the two approaches helps address challenges in tissue
   classification, age prediction, and identification of crucial signaling
   pathways. A case study using colorectal cancer as an exemplary dataset
   was also conducted to highlight the potential and applicability of
   P2GPT in generating simulated biological data for
   bioinformatic analyses.

Results

Construction of Precious2GPT

   We propose a novel hybrid approach that combines the power of two
   generation models, CDiffusion and MoPT, to generate high-quality
   multi-omics methylation and expression data (Fig. [72]1). Our approach
   of constructing Precious2GPT (P2GPT) involves the following steps:

Fig. 1. Schematic representation of the P2GPT model.

   The top left section of the diagram delineates the diverse omics
   datasets (e.g., methylation and gene expression) collected under
   various conditions such as age, tissue type, species, and omics types.
   From this initial data representation, lines branch out to indicate two
   separate data processing streams feeding into the CDiffusion. One
   stream enters the Categorical Embedding, processing discrete data
   features and the other enters the Continuous Embedding, handling the
   age data. Adjacent to these embedding blocks, the PyDeepInsight
   Transformation highlights another preparatory step for the input data,
   which is processed in parallel to the embeddings and also fed into the
   CDiffusion. On the left side, CDiffusion is presented in detail to
   reflect its centrality in the data analysis pipeline. Beneath this
   architecture, an Inverse PyDeepInsight block reverts the transformed
   data back to its omics representation after processing through the
   CDiffusion model. The transformed outcomes are combined with results
   from the CDiffusion in the FWLS block. The top right section of the
   figure introduces the Omics Tokenizer, serving as the preliminary stage
   for the LLM generation. Below the tokenizer, a larger visual represents
   the architecture of the LLM model. Its output is directed back into the
   omics space to broaden interpretability and also channeled into the
   FWLS, where it is integrated with the CDiffusion generations. The
   bottom right of the illustration showcases the Model Capabilities
   block. This block emphasizes various practical applications of the
   developed framework, including omics data generation, assembly of large
   open datasets, facilitation of control mechanisms for PandaOmics, the
   model’s capacity for target discovery, out-of-domain extrapolations,
   and conditional age prediction.

Step 1: CDiffusion model generation

   We employ the CDiffusion model to generate an initial dataset, which
   simulates gene expression levels based on the provided gene expression
   network. This network incorporates dependencies between genes, ensuring
   a biologically plausible gene expression pattern.

Step 2: MoPT model evaluation

   Using the generated dataset from Step 1, we evaluate the quality of
   each gene’s generation using the MoPT model. The MoPT model calculates
   a quality score for each gene, reflecting the similarity between the
   synthetic data and real-world gene expression and DNA methylation
   profiles.

Step 3: Coefficient calculation

   To create a balanced combination of the two models and reflect the
   proportion of quality contributed by each, we calculate coefficients
   based on the quality scores obtained from the MoPT model for each gene.
   These coefficients represent the relative importance of the CDiffusion
   and MoPT models in the final hybrid generation. To conduct the
   combination of our models, we employed the Feature Weighted Linear
   Stacking (FWLS^[75]19) approach. FWLS is a technique that combines
   multiple models by assigning weights to each model based on their
   performance and using these weights to calculate a weighted average
   prediction. The FWLS formula is as follows:
   $$y_{FWLS} = \sum_{j=1}^{m} \sum_{i=1}^{n} w_{ij} \cdot y_{ij}$$

   where $y_{FWLS}$ represents the final combined generation, $y_{ij}$
   represents the generation of each individual model (MoPT and
   CDiffusion) for each gene, and $w_{ij}$ represents the weight assigned
   to each model for each gene. In our
   study, we employed a linear regression approach to determine the
   optimal weights for model combination. The weight calculation formula
   using linear regression is as follows:
   $$w = (X^{T} X)^{-1} X^{T} y$$

   where $w$ represents the vector of weights, $X$ represents the matrix
   of generations from the individual models (MoPT and CDiffusion), and
   $y$ represents the actual target values for each subgroup of conditions
   (tissue, age, omics type, species). Once the weights are determined,
   they are used to calculate the final combined prediction by taking the
   weighted average of the individual predictions. This approach allows us
   to leverage the strengths of each model and minimize the impact of any
   individual model’s weaknesses.
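
   As an illustration of the weight formula above, the following minimal
   sketch solves the normal equations in closed form for the two-model
   case (one column of X per model). The function names and the toy
   values are assumptions made for this example and are not taken from
   the P2GPT implementation; in practice the fit would be repeated for
   each subgroup of conditions.

```python
def fit_two_model_weights(x1, x2, y):
    """Least-squares weights w = (X^T X)^{-1} X^T y for a two-column X.

    x1, x2: generations from two models for the same genes/samples.
    y: the corresponding real target values.
    """
    # Entries of the symmetric 2x2 matrix X^T X and of the vector X^T y.
    a = sum(v * v for v in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    p = sum(u * t for u, t in zip(x1, y))
    q = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("X^T X is singular; the two generations are collinear")
    # Closed-form inverse of the 2x2 matrix applied to X^T y.
    w1 = (d * p - b * q) / det
    w2 = (a * q - b * p) / det
    return w1, w2


def combine(x1, x2, w1, w2):
    """Weighted combination of the two generations."""
    return [w1 * u + w2 * v for u, v in zip(x1, x2)]


# Toy values (hypothetical): the target is exactly 0.25*mopt + 0.75*cdiff.
mopt = [1.0, 2.0, 3.0, 4.0]
cdiff = [2.0, 1.0, 4.0, 3.0]
real = [0.25 * u + 0.75 * v for u, v in zip(mopt, cdiff)]

w_mopt, w_cdiff = fit_two_model_weights(mopt, cdiff, real)
blended = combine(mopt, cdiff, w_mopt, w_cdiff)
```

   Because the toy target is an exact linear mix of the two inputs, the
   fitted weights recover the mixing coefficients; with real data the
   weights would instead minimize the squared error to the target values.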

P2GPT is capable of generating tissue-specific and accurate multi-omics data

   We first tested three basic models (CGAN, CDiffusion, and MoPT), as
   well as the multi-omics versions of CDiffusion and MoPT and their
   combination, P2GPT. As a simple model, CGAN served as the
   baseline against which the performance of the other models was
   evaluated. All six models performed well in classifying tissues with
   both real (Table [76]1) and generated (Table [77]2) labels. In Table [78]1,
   columns 1 and 2 for P2GPT and the diffusion model present
   results with genes extrapolated from the landmark genes (obtained from
   the LINCS L1000 project), whereas the CGAN model generates all genes.
   The comparison of all six models on human expression,
   human DNA methylation, and mouse expression data showed that for
   extremely underrepresented data, such as imbalanced tissue
   representation, which is common in practice, only P2GPT was capable
   of generating accurate omics data, by combining the multi-omics
   CDiffusion and MoPT models. We further visualized the distribution of data
   by UMAP dimensionality reduction which showed that using human
   expression data (Fig. [79]2A, B, Supplementary Fig. [80]1), the
   generated labels were highly concordant with the real labels, further
   suggesting that P2GPT was capable of synthesizing tissue-specific
   expression data accurately. In contrast, real human DNA
   methylation data did not cluster well by tissue type (Fig.
   [81]2D, Supplementary Fig. [82]1). Despite this, the generated
   methylation data were also highly
   concordant with the real data in terms of similarity and distribution
   (Fig. [83]2C, Table [84]2). In mouse expression data, P2GPT could also
   generate concordant and tissue-specific data, although less accurately
   than in human data (Fig. [85]2E, F, Table [86]2) (see Methods). Overall,
   P2GPT performed well in tissue-specific data generation for both DNA
   methylation and expression data.
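
   Tables 1 and 2 report a weighted F1 score for tissue prediction. As a
   minimal sketch of how such a score is computed, the function below
   averages per-class F1 scores weighted by the number of true samples in
   each class. The tissue labels are toy values chosen for illustration,
   and the classifier that would produce the predictions is not shown.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Weight each class by its share of the true labels.
        score += f1 * n_true / total
    return score


# Toy labels (hypothetical): three blood samples and one liver sample.
true_tissue = ["blood", "blood", "blood", "liver"]
pred_tissue = ["blood", "blood", "liver", "liver"]
score = weighted_f1(true_tissue, pred_tissue)
```

   Weighting by class support means that abundant tissues dominate the
   score, which is worth keeping in mind when tissue representation is
   imbalanced.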

Table 1.

   Tissue classification real labels
   Real labels                Expression (Human)   Methylation (Human)   Expression (Mouse)
   MoPT                       0.995                0.880                 0.920
   CDiffusion                 0.993                0.937                 0.960
   CGAN                       0.991                0.930                 0.930
   CDiffusion (multi-omics)   0.994                0.887                 0.993
   MoPT (multi-omics)         0.997                0.916                 0.982
   P2GPT                      0.999                0.944                 0.999

   Weighted F1 score for tissue prediction with an LR model. P2GPT was
   the best-performing model in every column.

Table 2.

   Tissue classification generated labels
   Gen labels                 Expression (Human)   Methylation (Human)   Expression (Mouse)
   MoPT                       0.928                0.780                 0.632
   CDiffusion                 0.945                0.870                 0.810
   CGAN                       0.960                0.874                 0.501
   CDiffusion (multi-omics)   0.957                0.880                 0.887
   MoPT (multi-omics)         0.956                0.830                 0.697
   P2GPT                      0.965                0.910                 0.899

   Weighted F1 score for tissue prediction with an LR model. P2GPT was
   the best-performing model in every column.

Fig. 2. UMAP of real data and data generated by Precious2GPT.

   Each point represents an individual sample. A Human expression data
   colored by data type (orange, real; blue, generated). B Human
   expression data colored by tissue type. C Human methylation data
   colored by data type (real or generated). D Human methylation data
   colored by tissue type. E Mouse expression data colored by data type
   (real or generated). F Mouse expression data colored by tissue type.

P2GPT outperformed other models in age prediction using generated data

   Next, we trained a CatBoost regression model on real data with age
   as the target variable and assessed its age prediction performance on
   the generated data. P2GPT demonstrated the best
   performance, achieving the lowest mean absolute error (MAE) and highest
   R^2 score across all types of datasets tested when compared to the other
   models (Tables [91]3 and [92]4).
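
   For reference, the two metrics can be computed as in the minimal
   sketch below. The age values are toy numbers, not results from this
   study. The second example shows why R^2 can be negative, as in some
   entries of Table 4: a negative value means the predictions fit worse
   than always predicting the mean age.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy ages in years (hypothetical).
ages = [20.0, 40.0, 60.0, 80.0]
close_pred = [25.0, 35.0, 65.0, 75.0]   # every prediction is off by 5 years
poor_pred = [80.0, 60.0, 40.0, 20.0]    # predictions in reversed order

mae_close, r2_close = mae(ages, close_pred), r2_score(ages, close_pred)
mae_poor, r2_poor = mae(ages, poor_pred), r2_score(ages, poor_pred)
```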

Table 3.

   Age prediction real labels quality
                              Expression (Human)   Methylation (Human)  Expression (Mouse)
   Real labels                MAE      R^2         MAE      R^2         MAE      R^2
   MoPT                       9.210    0.255       8.050    0.620       12.011   0.646
   MoPT (multi-omics)         8.754    0.210       6.580    0.770       10.839   0.685
   CDiffusion                 11.00    0.165       7.700    0.840       13.376   0.674
   CDiffusion (multi-omics)   8.297    0.307       7.010    0.880       11.952   0.686
   P2GPT                      8.247    0.313       6.300    0.890       10.708   0.693

   R^2 and MAE obtained with CatBoostRegressor. For mouse expression, MAE
   was calculated in days; for human methylation and expression, MAE was
   calculated in years. P2GPT was the best-performing model (lowest MAE
   and highest R^2) in every column.

Table 4.

   Age prediction generated labels
                              Expression (Human)   Methylation (Human)  Expression (Mouse)
   Generated labels           MAE      R^2         MAE      R^2         MAE      R^2
   MoPT                       13.280   −0.260      12.70    0.650       61.982   −0.363
   MoPT (multi-omics)         12.717   0.019       12.47    0.710       61.073   −0.323
   CDiffusion                 12.310   0.030       11.18    0.810       50.930   −0.120
   CDiffusion (multi-omics)   11.848   0.108       10.94    0.800       50.400   −0.105
   P2GPT                      10.906   0.156       9.890    0.820       48.940   −0.010

   R^2 and MAE obtained with CatBoostRegressor. For mouse expression, MAE
   was calculated in days; for human methylation and expression, MAE was
   calculated in years. P2GPT was the best-performing model (lowest MAE
   and highest R^2) in every column.

   The error of the expression-based regressor is notably large,
   especially for mouse expression data. In
   contrast, the generative models conditioned on age perform
   markedly better in terms of MAE and R^2 when applied to human
   DNA methylation data. This observation guided subsequent
   experiments to also focus on DNA methylation with the age condition.
   Combining heterogeneous models yielded the best performance: we
   observed improvements not only in models that encompass multiple omics
   datasets but also in the results produced by their integrative
   combination. As shown by the best regression results on DNA
   methylation in Supplementary Fig. [95]2, our model shows high quality
   in all tissues.

The effect of underrepresented data on P2GPT’s data synthesis

   We then investigated how the representation of data affects P2GPT’s
   ability to generate new synthetic data in each of the three data types
   (human expression, human DNA methylation, and mouse expression). The
   model was asked to generate 300 synthetic samples for each tissue, with
   age as an additional condition drawn from a uniform distribution of age
   values. The model requires a relatively small number of real
   samples to start generating valid synthetic expression samples
   conditioned on a specific age and tissue, whereas the error rate for
   the generation of DNA methylation samples is markedly higher (Fig.
   [96]3). We cannot state definitively that the number of samples in a
   dataset is the main determinant of successful generation of synthetic
   samples, but the trends in the graphs suggest that it is one of the most
   important factors. Here, we defined correctly generated samples as
   those with a strictly proper structure, i.e., the order of genes
   and their generated omics numerical values were correct.
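
   The structural check described above could be sketched as follows. The
   representation of a generated sample as a list of (gene, value) pairs,
   the placeholder gene names, and the function name are all assumptions
   made for illustration; the actual output format of the model is not
   specified here.

```python
import math


def is_well_formed(sample, gene_order):
    """True if genes follow the reference order and every value is a finite number."""
    if len(sample) != len(gene_order):
        return False
    for (gene, value), expected in zip(sample, gene_order):
        if gene != expected:
            return False
        # Reject non-numeric values (bool is excluded explicitly).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not math.isfinite(value):
            return False
    return True


gene_order = ["GENE_A", "GENE_B", "GENE_C"]  # placeholder gene names
generated = [
    [("GENE_A", 0.1), ("GENE_B", 2.3), ("GENE_C", 1.0)],           # well formed
    [("GENE_B", 2.3), ("GENE_A", 0.1), ("GENE_C", 1.0)],           # wrong gene order
    [("GENE_A", 0.1), ("GENE_B", float("nan")), ("GENE_C", 1.0)],  # invalid value
]

# Number of correctly generated samples out of those requested.
n_valid = sum(is_well_formed(s, gene_order) for s in generated)
```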

Fig. 3. Analysis of underrepresented data synthesis using P2GPT.

   Each point is a specific tissue, while the x-axis shows the number of
   samples presented in the dataset, and the y-axis shows the number of
   correctly generated samples out of the expected 300 samples. Left:
   Human expression. Middle: Human methylation. Right: Mouse expression.

P2GPT is capable of generating highly accurate DNA methylation data across
ages

   Since our results from CatBoost regression analysis (Tables [99]3 and
   [100]4) indicated that DNA methylation data was the best data type for
   age prediction using the generated data, we used it to compare the
   similarity between real and generated data in terms of differential
   methylation. For each of the tissues, we compared the DNA methylation
   levels of samples of 80 ± 20 years old and 30 ± 20 years old in the
   real data, as well as those of 80 ± 20 years old (predicted) and
   30 ± 20 years old (predicted) in the generated data. The significantly
   differentially methylated genes were then obtained from each of them to
   identify the number of intersections (Fig. [101]4). We then studied how
   much overlap there is between real and generated data on sets of
   differentially methylated genes (Table [102]5, MA). We hypothesized
   that if the models captured a certain difference between groups, in
   this case, a 50-year difference, then we could generate the data for
   older people or mice. For certain tissues (e.g., leukocytes or liver),
   a single model performed better. However, for
   the occipital lobe, blood, breast, and some other tissues,
   P2GPT performed better. For buccal mucosa, the overlap was very
   small, but only 3 genes were differentially
   methylated in this tissue. Overall, P2GPT was
   more stable across tissues, with no apparent tissue-specific
   bias. In addition, we compared the DNA methylation levels of 80-year-old
   samples between the real and generated data. We hypothesized that
   if the real and generated data shared high similarity, low counts of
   differentially methylated genes would be obtained. As shown by our
   results, for the majority of tissues analyzed, P2GPT demonstrated the
   lowest numbers of differentially methylated genes identified by the
   comparison between real and generated data (Table [103]5, DM). This
   further suggests that our model can both capture differences in DNA
   methylation levels between groups at different ages and preserve
   DNA methylation values within the same age group across real and
   generated data.

Fig. 4. Tissues with the most overlaps between real and generated data in
differentially methylated genes were identified by 30 vs 80 years old
comparisons.

   gen, generated.

Table 5.

   Results of overlapping differentially methylated genes between
   generated and real data
   Tissue                              MoPT (multi-omics)   CDiffusion (multi-omics)   P2GPT
                                       MA      DM           MA      DM                 MA      DM
   Immune cells                        0.931   0.396        0.926   0.76               0.95    0.11
   Saliva                              0.942   0.244        0.778   0.757              0.932   0.624
   Cerebral cortex                     0.925   0.37         0.905   0.702              0.923   0.294
   Blood                               0.828   0.24         0.902   0.805              0.907   0.149
   Leukocyte                           0.875   0.269        0.801   0.775              0.789   0.202
   Kidney                              0.364   0.168        0.533   0.659              0.752   0.01
   Liver                               0.764   0.148        0.728   0.809              0.752   0.109
   Peripheral blood mononuclear cell   0.759   0.100        0.697   0.169              0.748   0.000
   Occipital lobe                      0.442   0.076        0.689   0.081              0.745   0.000
   Breast                              0.633   0.323        0.707   0.464              0.721   0.05
   Mucosa                              0.549   0.090        0.431   0.775              0.65    0.594
   Cerebellum                          0.771   0.208        0.465   0.653              0.627   0.623
   Frontal lobe                        0.535   0.175        0.418   0.468              0.477   0.336
   Thyroid gland                       0.222   0.091        0.568   0.229              0.281   0.000
   Buccal mucosa                       0.045   0.000        0.000   0.825              0.000   0.200

   MA: Intersection of differentially methylated genes between the samples
   of 30 and 80 years old in generated and real data. Values represent the
   proportion of differentially methylated genes in the real data
   comparison that intersected with the list of differentially methylated
   genes in the generated data. A higher MA represents a better
   performance of the model. DM: Differentially methylated genes between
   80 years-old generated vs. 80 years-old real data. Values represent the
   proportion of differentially methylated genes in the real data
   comparison that intersected with the list of differentially methylated
   genes in the generated data. Lower DM represents the better performance
   of the model. For each tissue, the best performing (highest MA and
   lowest DM) model(s) was/were bolded.
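
   The MA value defined above is a set-overlap proportion. A minimal
   sketch, using placeholder gene identifiers and a hypothetical function
   name, is given below; returning 0.0 when the real comparison yields no
   differentially methylated genes is a convention chosen for this
   example only.

```python
def overlap_proportion(real_dm_genes, generated_dm_genes):
    """Share of real differentially methylated genes also found in generated data."""
    real = set(real_dm_genes)
    if not real:
        return 0.0  # convention for this sketch: nothing to recover
    return len(real & set(generated_dm_genes)) / len(real)


# Placeholder gene identifiers (hypothetical).
real_30_vs_80 = ["G1", "G2", "G3", "G4"]
generated_30_vs_80 = ["G2", "G3", "G4", "G9"]
ma = overlap_proportion(real_30_vs_80, generated_30_vs_80)
```

   When the real comparison yields very few genes, as for buccal mucosa,
   a single missed gene changes the proportion substantially.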

P2GPT can extrapolate age in generated data beyond the age range of the
training dataset

   Because MoPT could not be used to predict age in generated data for
   ages that were not present in the real training data, we used the
   diffusion model integrated in P2GPT in an out-of-scope experiment to
   study its age prediction accuracy under two training conditions. In
   the first condition, the model was trained exclusively on DNA
   methylation data from individuals younger than 50 or 80 years, while in the
   second condition, the model was trained using data encompassing all
   ages (ranging from 0 to 114 years old) in the real data. We show that
   the number of samples available per age group varies across tissues,
   with some tissues having underrepresented age groups (Fig. [107]5A).
   Our results indicated that the quality of the model’s predictions was
   maintained particularly well in the crucial age component within PCA
   for specific tissues. Specifically, we obtained notable results for
   blood and examined the components that exhibited the highest
   correlation with age. We observed that the layer representing generated
   samples of older individuals was positioned right after the layer
   representing the 80–100 age group, regardless of the training condition
   (Fig. [108]5B, C). Similarly, generated samples at older age were also
   positioned close to the real old samples in blood by setting the
   training threshold at 50 years old (Supplementary Fig. [109]3). This
   finding supports the model’s ability to generate samples for age values
   that were not encountered during training, indicating its capacity
   to extrapolate and generate realistic data beyond the
   observed age range. Moreover, the results on tissues with the
   underrepresented 80–100 years old group (Supplementary Fig. [110]4A) do
   not differ much between both training conditions. However, when it is
   well represented, the variance between samples is lower where the model
   was trained on all labels (Frontal Lobe in Supplementary Fig. [111]4B).
   Similar results were also observed in the cerebral cortex
   (Supplementary Fig. [112]4C), cerebellum (Supplementary Fig. [113]4D)
   and buccal mucosa (Supplementary Fig. [114]4E).

Fig. 5. Generation of methylation data based on real data and age.

   A Distribution of age groups in real methylation data for each tissue.
   B PCA for blood with real and generated data with different age bins
   for models trained on data with the whole age distribution. C PCA for
   blood with real and generated data with different age bins for models
   trained on data with age lower than 80. The black line in each PCA
   plot connects the cluster centroids of each age group from [0,20] to
   [100,120] for real data and [100,120] for generated blood data.

Applications of P2GPT-synthesized data in pathway enrichment analysis

   To assess the applicability of P2GPT in biological data interpretation,
   we performed differential DNA methylation analysis between the 120 and
   150 years old samples in each tissue generated by P2GPT, followed by
   pathway enrichment analysis on the differentially methylated genes
   based on the KEGG database to identify the pathways that were
   potentially related to aging. In blood, the enriched pathways for the
   differentially methylated genes were associated with immune function,
   organ development, and hormonal regulation (Supplementary Fig. [117]5).
   In the liver, pathways associated with cytokine receptor interactions
   and NOD-like receptor signaling were enriched, while pathways enriched
   in the older thyroid gland revealed a shift towards chronic
   inflammation and immune dysregulation as suggested by disrupted
   cytokine interactions, increased neutrophil extracellular trap
   formation, and alterations in hematopoietic cell lineage. Moreover,
   overrepresentation analysis showed that the differentially methylated
   genes were enriched in several hallmarks of aging across multiple
   tissues. In particular, inflammation was found to be enriched in 4
   tissues (blood, liver, saliva, thyroid gland), while genomic
   instability was enriched in 2 tissues (liver, thyroid gland) and
   altered intercellular communication in 2 tissues (blood, thyroid
   gland). Enriched pathways were summarized across all
   tissues using KEGG_2021_Human. Lists of pathways for selected tissues can
   be seen in Supplementary Table [118]1. We also used GPT-4 to add a
   description of these pathways based on the direction of methylation and
   importance in aging. Most of the pathways are
   indeed related to aging. This suggests that our P2GPT model can
   generate data for older individuals and help identify important
   biological signaling pathways of aging.
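
   Overrepresentation analysis is commonly based on a hypergeometric
   (one-sided Fisher) test. The sketch below shows that calculation under
   the assumption that this standard test applies; the text above does
   not state which test or tool was used, and the gene counts are toy
   numbers.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement.

    n_universe: number of genes tested in total.
    n_pathway: number of those genes annotated to the pathway.
    n_selected: number of differentially methylated genes.
    n_overlap: number of differentially methylated genes in the pathway.
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy counts (hypothetical): 10 genes, 4 in the pathway, 3 selected, 2 overlapping.
p_value = hypergeom_pvalue(10, 4, 3, 2)
```

   In a real analysis the p-values would also be corrected for multiple
   testing across pathways.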

Case study experiment in colorectal carcinoma

   To demonstrate the potential of in silico-generated gene expression
   data as control samples for actual colorectal cancer samples, we
   generated control samples for corresponding case samples of eight
   colorectal cancer (CRC) cell lines using our P2GPT model. Subsequently,
   eight case-control comparisons for the corresponding CRC cell lines
   were incorporated into a meta-analysis. One CRC meta-analysis was
   constructed using a restricted subset of genes, referred to as the
   “landmark” genes, and another was constructed using a more
   comprehensive set of genes, termed “restored”, obtained with a CycleGAN
   model that is described in the Supplementary materials. For each of the
   two meta-analyses, we extracted the common gene expression signatures
   across all eight cell lines, which yielded two lists of gene expression
   signatures. Each list was subjected to Spearman’s correlation test
   against the gene expression signature obtained from the pre-calculated
   CRC meta-analysis on PandaOmics (Fig. [119]6). We observed that common gene
   expression signatures calculated on an extended number of genes
   (restored genes) exhibited greater similarity to the benchmark CRC
   signatures (r = 0.552) compared to signatures calculated on a limited
   number of genes (landmark genes, r = 0.497). Additionally, it was
   evident that the application of a gene expression significance
   threshold positively correlated with overall signature similarities.
   The combined landmark signature generated with the P2GPT model
   demonstrated strong similarity to the CRC benchmark signature (Fig.
   [120]6A). However, its performance on the restored subset of genes was
   even better (Fig. [121]6B). Consequently, the meta-analysis derived
   from comparisons created using the P2GPT on an extended number of genes
   was further employed for target identification analysis.
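
   The signature comparison above relies on Spearman’s rank correlation.
   A minimal sketch of the coefficient, computed as the Pearson
   correlation of average ranks over the genes shared by two signatures,
   is shown below. The expression-change values are toy numbers and the
   variable names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy expression changes for five shared genes (hypothetical values).
benchmark_signature = [2.0, -1.0, 0.5, 3.0, -2.0]
synthetic_signature = [1.5, -0.5, 1.0, 1.2, -3.0]
rho = spearman(benchmark_signature, synthetic_signature)
```

   Because the coefficient depends only on ranks, it compares the
   ordering of gene-level changes rather than their magnitudes.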

Fig. 6. Correlation matrix for colon carcinoma signatures.

   Spearman correlation coefficients between colon carcinoma signatures
   were calculated using only landmark genes (A) and all restored genes
   (B). The colon carcinoma signature (PandaOmics CRC project signature)
   was derived from the “expression analysis” section of manually curated
   colon carcinoma meta-analysis in PandaOmics and corresponded to the
   combined gene expression changes values for colon carcinoma. P2GPT CRC
   signature was collected from the corresponding meta-analysis in
   PandaOmics.

   As previously, the manually curated CRC project available in PandaOmics
   served as a benchmark for target hypotheses. The Target ID results were
   compared between the CRC meta-analysis, containing case-control
   datasets that were obtained from patients, and the P2GPT CRC
   meta-analysis (refer to the previous section). This case study aimed to
   demonstrate that the in silico-generated control samples could be used
   as control samples for actual case samples. Therefore, to compare
   Target ID results, we only used omics scores for hypothesis ranking and
   included solely druggable target families^[124]20. The top 20 target
   hypotheses for both the benchmark and the P2GPT-restored CRC
   meta-analyses are depicted on a heatmap (Fig. [125]7). Using the Target
   ID approach, we explored the genes that scored highest in the PandaOmics
   CRC meta-analysis and in the CRC meta-analysis built with
   P2GPT-generated controls. Analysis of the overlapping genes showed that
   both top 20 target hypothesis lists contain hits that are strongly
   associated with CRC pathology. For instance, AKT1, PTEN, and PIK3R1 are
   key modulators in the PI3K/AKT pathway while PLK1, CDK2, and MAPK14 are
   major drivers involved in cell cycle regulation. Being ranked top in
   both Target ID results, AKT has been extensively studied in disease
   pathogenesis^[126]21,[127]22 and is altered in CRC patients^[128]23.
   CDK2 is also highly scored in both meta-analyses. CDK2 has been
   studied in the G1/S phase transition^[129]24, and CDK2-selective
   inhibitors have already been tested in CRC models^[130]25. Genes that
   were top scored only in P2GPT-controls generated CRC meta-analysis
   (PIK3CD, FYN, YES1, ATM, HRAS, TNFRSF1A, GSK3B, PLCG1, CSK, PIK3CA) are
   also related to pathogenesis. For example, PIK3CD was shown to be
   involved in AKT/GSK-3β/β-catenin signaling and could be considered as a
   potential target^[131]26, while mutations in PIK3CA were observed in
   20% to 25% of CRC^[132]27 and associated with shorter cancer-specific
   survival^[133]28. The results were supported by PandaOmics knowledge
   graph (Supplementary Fig. [134]6). Overall, our results suggested that
   our P2GPT model can be used to generate expression data that could be
   utilized in target discovery. We showed that gene expression changes
   between case and control (both real and generated) samples resulted in
   a similar disease-specific expression signature. At the same time, the
   Target ID approach applied for data from patients (colon carcinoma
   PandaOmics meta-analysis) and for P2GPT-controls generated colorectal
   cancer meta-analysis showed a strong overlap between well-known targets
   for colorectal carcinoma along with a new target hypothesis.

Fig. 7. Top 20 most promising target hypotheses for colorectal cancer.

   Results were derived from the in silico Target ID scoring approach for
   PandaOmics colorectal carcinoma meta-analysis (A) and P2GPT colon
   cancer meta-analysis (B). To validate our approach, only omics-based
   scores with the application of a druggability filter were taken into
   account and used for the composition of the scores for ranking.

Discussion

   Studies have identified several key biomarkers, proteins and pathways
   that play important roles in aging^[137]29,[138]30. Despite this, the
   multifaceted process of aging still requires substantial understanding
   and unraveling of the complex biological data. To address this, the use
   of artificial intelligence (AI) in the field of aging research has been
   increasing recently. Indeed, deep learning-based approaches have been
   proposed to play vital roles in multiple areas facilitating aging
   research, such as predicting biological age^[139]16,[140]31,[141]32,
   developing biomarkers^[142]15,[143]33–[144]35, identifying therapeutic
   targets^[145]15,[146]20,[147]36–[148]39, and generating novel
   compounds^[149]34,[150]40,[151]41. In fact, studies have also
   demonstrated the potential of applying AI models to identify targets
   implicated in aging and age-associated diseases, targeting established
   hallmarks of aging^[152]20,[153]30,[154]37,[155]42.

   In this study, we present a hybrid approach Precious2GPT (P2GPT) that
   combines the complementary strengths of the CDiffusion and MoPT models
   for generating high-quality multi-omics DNA methylation and expression
   data. Our approach reduces the limitations of individual models and
   leverages their strengths to enhance the generation process. This
   innovative approach has potential applications in various fields,
   including data analysis, algorithm development, and privacy
   preservation for multi-omics research. We demonstrate the effectiveness
   of our hybrid approach by comparing the quality of data generated using
   individual models of CGAN, CDiffusion, and MoPT, with the combined
   hybrid approach P2GPT. With the aid of tissue classification and age
   regression experiments, the performance of models was assessed in terms
   of their specificity to species and tissue types, as well as their
   capability to predict age based on learned patterns from real data.

   Upon training the transformer-based model with this corpus, we
   demonstrated its high capability of generating new data conditioned on
   specific factors like age or tissue type. In our study, we encountered
   a primary challenge in generating tabular data from continuous gene
   expression and DNA methylation omics data. Previous works have
   attempted the conversion of table data to text before the application
   of the pretrained GPT-2 model^[156]43. Another approach addressed the
   complexity issue by using the GPT-2 architecture with a customized
   vocabulary to improve the efficiency during both training and inference
   of the model^[157]44. Hence, we devised an encoding scheme, wherein
   each gene and its corresponding omics value were represented as
   individual tokens. In essence, our approach treated the gene-omics data
   as pseudo-text, enabling us to utilize the transformer-based model,
   ultimately introducing the MoPT model. By evaluating the generated data
   through predictions of age and tissue across different data types and
   species, we highlight the potential of transformer architectures in
   bioinformatics tasks. This work represents the first
   biomedical-specific adaptation of a language model for generating
   tabular data.

   Synthetic data plays a crucial role in overcoming data insufficiency by
   providing synthetic controls that replicate the biological properties
   of real control samples, and enhance equity in differential expression
   analysis. The use of generated data also enables cost-effective testing
   of algorithms and pipelines in a virtual experimental platform,
   allowing researchers to mimic the effects of interventions under
   specific scenarios such as varying levels of noise and different
   degrees of differential expression^[158]45. Furthermore, the potential
   impact of alterations in genomic profiles can be predicted with
   synthetic gene knockdowns or knockins data^[159]46. Our P2GPT model
   demonstrated exceptional performance in classifying tissues based on
   synthetic data. The model’s accuracy is remarkable, with its
   predictions closely resembling those based on real biological datasets
   as evidenced by the high correlation coefficients in cross-validation
   studies and the model’s robustness when tested against known
   benchmarks. In the age regression analysis, P2GPT showcased its
   aptitude by accurately predicting the biological age of samples using
   synthetically generated DNA methylation patterns. The synthetic data,
   when compared against real-world epigenetic clocks, confirmed that
   P2GPT successfully captured the nuances of age-related changes, with a
   minimal margin of error. This reveals the potential for wide-ranging
   applications in biogerontology and personalized medicine.

   Leveraging out-of-scope (OOS) experiments with the P2GPT model has
   revealed that across various tissues, aging is consistently associated
   with dysregulated immune function, chronic inflammation, and alteration
   in cell lineage and signaling pathways. Age-associated dysregulated
   immune function, accompanied by chronic inflammation (inflammaging),
   contributes to the process of immunosenescence observed in aged
   individuals^[160]47,[161]48. The alteration in signaling pathways has
   been shown to trigger inflammaging and senescence across multiple
   tissues^[162]30. These biological processes markedly contribute to the
   increased disease burden observed in the elderly population and present
   potential targets for therapeutic intervention. The insights delivered
   by the P2GPT model’s OOS experiments underscore the value of advanced
   computational models in understanding the complex biological
   underpinnings of aging and spotlighting potential avenues to mitigate
   its detrimental effects on health. Our results showed that our model
   can be utilized to identify biologically relevant pathways and
   processes through synthetic data generation.

   By combining MoPT and CDiffusion models using Feature Weighted Linear
   Stacking (FWLS), we aimed to improve the overall predictive performance
   and generalization ability. This approach integrates diverse
   perspectives and captures complementary information from each model,
   resulting in a more robust and accurate prediction. Applying FWLS
   during coefficient calculation allowed us to obtain more accurate
   predictions by incorporating individual model strengths. By considering
   model weights, we ensured that more accurate and reliable models had a
   higher impact on the final generation, mitigating biases or
   inaccuracies introduced by any single model and providing a more robust
   prediction. Our findings indicate that the coefficients derived from
   P2GPT allow a refined integration of the two models, leading to
   enhanced performance with improved generation quality. Despite the
   advancements achieved by integrating MoPT and CDiffusion models with
   FWLS, there are certain limitations in the current P2GPT model.
   Firstly, the complexity of the model poses a potential barrier to
   replication and broader application. The intricacies involved in
   managing and interpreting the combination of such models may limit
   their use by those without deep expertise in bioinformatics and access
   to substantial computational resources. Secondly, the current iteration
   of P2GPT primarily processes tabular data or two-dimensional image
   data, and cannot yet accommodate graph structures that represent
   complex biological interactions or pathways.
   Future extensions of the model that incorporate graph neural networks
   could enable the analysis of data represented in graph forms, such as
   protein-protein interaction networks or gene regulatory networks.
   Despite these limitations, the synergistic integration of MoPT and
   CDiffusion models through FWLS has successfully demonstrated an
   enhanced predictive capability.

   Our findings underscore the versatility and effectiveness of
   transformer architectures in handling bioinformatics tasks. However, it
   is important to acknowledge that the success of our P2GPT is attributed
   to the generation of relatively large sequence lengths and the design
   of an effective encoding scheme. Future work can expand the application
   of our method in other bioinformatics tasks like survival analysis,
   cross-modality prediction, and generation of omics depending on the
   disease or drug, thereby broadening the usage of transformer
   architectures in the field. For instance, beyond aging research, P2GPT
   could facilitate the analysis of fundamental processes underlying tumor
   progression, resistance, and metastasis. Additionally, modeling the
   timing and administration methods of various therapy combinations could
   provide insights into how tumor cells develop resistance to
   drugs^[163]49–[164]51. In addition, we envision further refining our
   hybrid approach by exploring additional generation models and
   incorporating various omics data types. Moreover, we believe that
   validation of synthetic data through downstream applications and
   benchmarking against real-world datasets would enhance the utility and
   robustness of synthetic multi-omics data. Lastly, we anticipate the
   future integration of P2GPT into clinical settings, enabling invaluable
   applications such as simulating tissue-specific biological data without
   invasive biopsies to predict treatment responses, predicting biological
   changes and disease progression trajectories, and incorporating various
   clinical parameters to enhance the accuracy for personalized disease
   monitoring and therapeutic strategies.

   In summary, we developed Precious2GPT, a generative model capable of
   producing methylation and expression data, which are invaluable
   resources for aging research due to the scarcity of longitudinal
   biological data. Through multiple lines of evidence and validation, we
   demonstrated the significant potential of Precious2GPT in facilitating
   aging research. Future work addressing the aforementioned limitations
   would further strengthen the model’s applicability, accuracy, and
   comprehensiveness, providing a powerful tool for biological discovery
   and translational medical research.

Methods

Data sources

   In this study, expression and methylation data were adopted across two
   species, human and mouse. Access to Genotype-Tissue Expression (GTEx)
   V8-protected data (phs000424) was authorized by the Data Access
   Committee of NCBI dbGAP. Human transcriptomic data^[165]52 and sample
   attribute data were downloaded, constituting 12,453 samples.
   Complementary mouse transcriptomic data was sourced from the ARCHS4
   database, V2.2 (12,541 samples). Mouse genes were mapped to their
   corresponding human orthologs with the use of Human Genome Organisation
   Gene Nomenclature Committee (HGNC) mappings^[166]53. Both GTEx and
   ARCHS4 RNA-seq data were procured in the form of raw gene counts. These
   datasets underwent log2 transformation, followed by quantile
   normalization applied to each tissue type separately within the
   expression datasets. After performing log2 transformation and quantile
   normalization, we preserved the target distribution to facilitate its
   application to novel samples. Human DNA methylation data was aggregated
   from the Illumina Infinium HumanMethylation450 BeadChip array datasets,
   retrieved from the China National Center for Bioinformation’s (CNCB)
   data repository (8,285 samples)^[167]54. Methylation beta values were
   mapped to genomic features based on the HumanMethylation450 v1.2
   Manifest File. In detail, we intentionally focused our attention on the
   CpGs located exclusively within the TSS200 region, as these were
   interpreted as the most relevant to age prediction. The TSS200 region,
   defined as the area comprising 200 base pairs upstream of the
   transcription initiation site, is documented as crucial for gene
   regulation processes. Consequently, the beta values of the CpGs
   situated within a gene’s TSS200 were averaged for downstream analysis.
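
   The expression preprocessing described above (log2 transformation,
   then per-tissue quantile normalization with a stored target
   distribution) can be sketched as follows. This is a minimal
   illustration, not the authors' pipeline: the pseudocount of 1, the
   function names, and the toy counts are assumptions.

```python
import math

def log2_transform(counts, pseudocount=1.0):
    """log2(count + pseudocount); the pseudocount is an assumption to handle zeros."""
    return [math.log2(c + pseudocount) for c in counts]

def fit_target_distribution(samples):
    """Target distribution: mean of the sorted values at each rank across samples."""
    sorted_samples = [sorted(s) for s in samples]
    n = len(sorted_samples)
    return [sum(col) / n for col in zip(*sorted_samples)]

def quantile_normalize(sample, target):
    """Replace each value by the target value at the same rank (ties by order)."""
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    out = [0.0] * len(sample)
    for rank, idx in enumerate(order):
        out[idx] = target[rank]
    return out

# Toy raw counts for two samples of one tissue (three genes each).
tissue_samples = [log2_transform(s) for s in ([3, 0, 7], [1, 15, 3])]

# Fit the target once per tissue and keep it for later use.
target = fit_target_distribution(tissue_samples)
normalized = [quantile_normalize(s, target) for s in tissue_samples]

# A novel sample of the same tissue is mapped onto the stored target.
new_sample = quantile_normalize(log2_transform([31, 0, 1]), target)
```

   Keeping `target` per tissue is what allows a novel sample to be
   normalized without refitting on the full dataset.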

Preprocessing methods

   For image construction, the DeepInsight technique, together with
   convolutional neural networks (CNNs)^[168]55,[169]56 and Kohonen’s
   self-organizing maps (SOMs)^[170]57,[171]58, was used to transform
   non-image data into image-like representations for the CGAN and
   CDiffusion models. To accelerate training and inference of the
   computationally heavy models, a CycleGAN^[172]59,[173]60 was employed
   to generate synthetic data in the CDiffusion, MoPT and Precious2GPT
   models. In brief, generation methods work either with text or with
   images. We used DeepInsight to construct images for the CGAN and
   CDiffusion models and for the CDiffusion part of the Precious2GPT
   model. To assign individual genes to individual pixels of the images, a
   SOM was used instead of the t-SNE, UMAP and PCA algorithms. For each
   dataset, we built a separate SOM with its own dimensions to minimize
   unused space in the square image, and each image was colored by
   expression or methylation, along with the training set in different
   colors.

DeepInsight

   CNNs were used to automatically extract features from spatially coherent
   pixels, detecting higher-order statistics and non-linear correlations,
   and to provide promising performance in learning complex patterns and
   relationships in the data. To improve the efficiency of CNNs,
   one-dimensional (1D) biological data was transformed into
   two-dimensional (2D) representations. DeepInsight is a methodology
   designed to transform non-image data into image-like representations,
   allowing convolutional neural networks (CNNs) to be applied more
   effectively. It serves as the basis for the DeepInsight-3D model, which
   extends this approach to multi-domain tabular datasets. The DeepInsight
   pipeline consists of the following steps (Supplementary Fig. [174]7):

    1. Data normalization: The input data is normalized to ensure that all
       features have the same scale. This is typically achieved by applying
       min-max scaling, z-score normalization, or other suitable
       normalization techniques.
    2. Dimensionality reduction: The high-dimensional input data is
       transformed into a lower-dimensional representation. This can be
       done using dimensionality reduction techniques such as t-SNE, UMAP,
       or PCA. The resulting lower-dimensional data retains the most
       important information from the original data while reducing noise
       and computational complexity.
    3. Image generation: The lower-dimensional data is then converted into
       a 2D image-like representation. This is achieved by mapping each
       data point to a pixel in the image, with the pixel intensity
       representing the value of the corresponding feature. The resulting
       image preserves the spatial relationships between the data points,
       allowing CNNs to effectively capture local and global patterns in
       the data.
    4. Convolutional neural network (CNN) training: The generated images
       are used as input to a CNN, which is trained to perform a specific
       task, such as classification or regression.

   Recently developed techniques such as diffusion models can be used to
   effectively process such data (Supplementary Fig. [175]7). By
   transforming non-image data into image-like representations,
   DeepInsight-like models allow for the efficient application of
   image-oriented models to a wide range of data types, including
   biological data.
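
   The normalization and image generation steps can be illustrated with a
   short sketch. It assumes that each feature has already been assigned a
   pixel coordinate by the dimensionality reduction step, and that
   features landing on the same pixel are averaged; the names and toy
   values are hypothetical.

```python
def min_max_scale(values):
    """Scale feature values of one sample to the [0, 1] range."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # avoid division by zero for constant samples
    return {name: (v - lo) / span for name, v in values.items()}

def features_to_image(values, coords, size):
    """Write each feature value to its (row, col) pixel; average collisions."""
    sums = [[0.0] * size for _ in range(size)]
    counts = [[0] * size for _ in range(size)]
    for name, value in values.items():
        r, c = coords[name]
        sums[r][c] += value
        counts[r][c] += 1
    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else 0.0 for c in range(size)]
        for r in range(size)
    ]

# Hypothetical pixel assignment produced by a dimensionality reduction step.
coords = {"g1": (0, 0), "g2": (0, 1), "g3": (1, 1), "g4": (1, 1)}
sample = {"g1": 2.0, "g2": 4.0, "g3": 6.0, "g4": 10.0}

image = features_to_image(min_max_scale(sample), coords, size=2)
```

   The pixel layout is fixed once per dataset, so every sample is rendered
   with the same gene-to-pixel mapping and only the intensities change.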

SOM

   Kohonen’s self-organizing maps (SOMs) offer a promising alternative to
   PCA or UMAP for dimensionality reduction in the context of transforming
   non-image data into image-like representations. As an unsupervised
   learning algorithm, SOMs excel at converting high-dimensional data into
   lower-dimensional representations while preserving the topological
   structure of the input data. This ability to maintain the spatial
   relationships between data points makes SOMs particularly well-suited
   for generating images that can be fed into convolutional neural
   networks (CNNs). Unlike PCA, which focuses on linear relationships and
   maximizes variance, or UMAP, which aims to preserve both local and
   global structure, SOMs employ a competitive learning process that
   iteratively updates neuron weight vectors to better represent the input
   data. This results in a 2D grid of neurons that captures complex
   relationships between variables, potentially leading to more effective
   feature extraction and improved performance of the CNN. By
   incorporating Kohonen’s SOMs into the DeepInsight methodology, we can
   harness the unique advantages of this algorithm to enhance the analysis
   of non-image data using deep neural networks.
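
   The competitive learning process described above can be sketched in a
   few lines. This is a toy SOM under stated assumptions (linear decay of
   the learning rate and neighborhood width, a Gaussian neighborhood, and
   hypothetical gene profiles); it is not the configuration used in this
   study.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid node whose weight vector is closest (squared Euclidean) to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay
        sigma = max(sigma0 * decay, 1e-3)
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-dist2 / (2 * sigma ** 2))
                for k in range(dim):
                    # Pull the node's weights toward the input sample.
                    w[k] += lr * influence * (x[k] - w[k])
    return weights

# Each gene is described by its (scaled) values across samples.
gene_profiles = {
    "gene_a": [0.0, 0.1, 0.0],
    "gene_a_copy": [0.0, 0.1, 0.0],
    "gene_b": [1.0, 0.9, 1.0],
    "gene_c": [0.5, 0.4, 0.6],
}
som = train_som(list(gene_profiles.values()), rows=2, cols=2)
pixel_of_gene = {g: best_matching_unit(som, v) for g, v in gene_profiles.items()}
```

   Here the grid position of each gene's best matching unit plays the role
   of its pixel coordinate, so genes with similar profiles tend to land on
   the same or neighboring pixels.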

CycleGAN

   To speed up training and inference of the heavy models (CDiffusion,
   MoPT and P2GPT), we used the CycleGAN model during post-processing to
   extrapolate to all genes. The heavy models were first trained and used
   for generation, and the results were then extrapolated using this
   model. In detail, our domain X consists of data for the 978 landmark
   genes^[176]59 and domain Y consists of the desired output data for
   11,278 genes, which form the intersection across several omics datasets
   and species. The set of 978 genes serves as the starting point to
   generate synthetic output data for the 11,278 genes.

   A CycleGAN comprises two generators (G and F) and two discriminators
   (Dx and Dy). Generator G transforms from domain X to Y (G: X → Y),
   while F does the reverse (F: Y → X). Dx aims to distinguish between X
   and F(Y), whereas Dy discriminates between Y and G(X). Training
   proceeds as follows: first, the generator G translates a sample from
   domain X into synthetic data in domain Y. Subsequently, the generator
   F tries to regenerate the original sample from this synthetic data.
   The objective is to train the CycleGAN to learn a mapping such that
   the regenerated data closely matches the input data. This is referred
   to as forward cycle consistency. Backward cycle consistency is
   enforced simultaneously from domain Y to X, and the whole cycle
   repeats throughout training. The network learns from the
   inconsistencies between the regenerated data and the original input
   data to improve its ability to generate synthetic data aligned with
   the target domain. Importantly, the discriminators Dx and Dy also
   participate in this training process, each aiming to classify whether
   an instance comes from the actual dataset or was produced by the
   respective generator. As a result, CycleGAN can extrapolate the data
   from 978 genes to realistically simulate data for 11,278 genes even
   in cases where paired samples are lacking. This model allowed us to
   generate a large amount of data in a short time with minor losses in
   quality compared to the full set. The production model will
   eventually use the full dataset, but for some experiments this is not
   necessary.
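
   The forward and backward cycle consistency described above is commonly
   written as a single loss term added to the two adversarial losses. The
   formulation below follows the standard CycleGAN objective and is given
   as a sketch: the L1 norm and the weight $\lambda$ are assumptions, as
   the text does not state which distance or weighting was used here.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{cyc}}(G, F) &=
  \mathbb{E}_{x \sim X}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim Y}\big[\lVert G(F(y)) - y \rVert_1\big], \\
\mathcal{L}(G, F, D_X, D_Y) &=
  \mathcal{L}_{\mathrm{GAN}}(G, D_Y)
  + \mathcal{L}_{\mathrm{GAN}}(F, D_X)
  + \lambda\, \mathcal{L}_{\mathrm{cyc}}(G, F).
\end{aligned}
```

   The first expectation is the forward cycle (X to Y and back) and the
   second is the backward cycle (Y to X and back); neither term requires
   paired samples from the two domains.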

Generation methods

Mathematical formulation of conditional generation task

   In the context of conditionality, we aim to develop models that can
   generate data instances conditioned on multiple factors: tissue ($T$),
   age ($A$), species ($S$), and omics types ($D$). We represent the
   generated data as $X$, and the conditions as a tuple
   $C = (T, A, S, D)$. The conditional generation task is defined given a
   set of training data ($D$):

   $$D = (X_i, C_i)_{i=1}^{N},$$

   where $X_i$ represents the observed data instances and $C_i$
   represents the corresponding conditions in order to learn a
   conditional generative model $G$ that can sample data instances $X$
   conditioned on arbitrary conditions $C$.

   The training objective of this model is to estimate the conditional
   probability distribution $P(X \mid C)$, such that

   $$G(X \mid C) \approx P(X \mid C),$$

   where $G(X \mid C)$ represents the data generated by our model.

CGAN

   To evaluate the performance of Precious2GPT, a Conditional Generative
   Adversarial Network (CGAN) was used as a positive control in the
   validation experiments. Generative adversarial networks are a more
   classical approach, are easier to train and are faster at inference,
   and therefore serve as a baseline for the other models. In some
   situations they have been observed to perform well without relying on
   complex patterns, in particular when the generation task has a single
   condition and the age condition does not need to be taken into
   account. This generative model was trained to generate synthetic data
   using two networks, the generator $G$ and the discriminator $D$. In
   CGAN, the generator $G$ was trained to produce data samples that are
   indistinguishable from real data by the discriminator $D$, whilst the
   generator took the conditions $C$ as input and generated data $X$
   (Supplementary Fig. [177]8). In the context of multi-omics data
   integration, CGANs were employed to generate realistic images
   corresponding to expression or methylation data with additional
   conditions: tissue type, age, omics type, and species.

CDiffusion

   Diffusion models were employed to estimate the likelihood of the
   generated data $X$. The model was trained to sample data through a
   diffusion process conditioned on $C$, and the likelihood of the data
   was maximized throughout the learning process. A PyTorch
   implementation published on GitHub (available at
   [178]https://github.com/tcapelle/Diffusion-Models-pytorch/tree/main)^[179]61
   was used as the basis of the conditional diffusion (CDiffusion) model,
   and PyTorch’s embedding was applied to construct the conditionality on
   categorical features in this model.

   The CDiffusion model builds on this PyTorch implementation of a
   conditional diffusion model^[180]61. At its core, the implementation
   uses a U-Net block with self-attention layers between the downsampling
   and upsampling layers. Standard diffusion models incorporate temporal
   information in a tensor (later referred to as the time step) that
   controls the noising/denoising process based on the current step by
   being embedded into every downsampling/upsampling layer of the U-Net
   block. This time step is encoded at every step through a sinusoidal
   positional embedding (Supplementary Fig. [182]9).

   The diffusion model is conditioned on categorical features of sets of
   genes by adding the conditions’ embeddings into the current time step.
   PyTorch’s nn.Embedding is used as a learnable embedding layer that
   stores embeddings mapping each class of a categorical condition (mapped
   to unique integers) into a tensor with the time step’s shape. However,
   such an embedding layer is not applicable for continuous features, such
   as age, so ages are fused into the time step by first passing through a small
   network (three linear layers with ReLU activation functions) that
   transforms a single floating-point value (age) into the time step’s
   dimensions. Finally, the ages’ embeddings are similarly added to the
   time step. While the outlined approach works as is for single
   conditions, including multiple conditions leads to the optimizer
   tilting its attention towards the condition with higher embedded
   values. This is solved by normalizing all the conditions’ embeddings
   (categorical/continuous) before adding them to the time step.
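
   The conditioning scheme described above can be sketched without a deep
   learning framework. In this illustration, fixed vectors stand in for
   the learned `nn.Embedding` lookup and for the output of the small age
   network; the use of L2 normalization is an assumption, since the text
   only states that the condition embeddings are normalized before being
   added to the time step.

```python
import math

def sinusoidal_embedding(t, dim):
    """Standard sinusoidal encoding of a scalar step t into dim values (dim even)."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so no condition dominates by magnitude."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def condition_time_step(t_emb, condition_embs):
    """Add each normalized condition embedding to the time-step embedding."""
    out = list(t_emb)
    for emb in condition_embs:
        for i, v in enumerate(l2_normalize(emb)):
            out[i] += v
    return out

# Stand-ins for learned embeddings (hypothetical values).
tissue_emb = [3.0, 4.0, 0.0, 0.0]   # would come from an embedding lookup
age_emb = [0.0, 0.0, 50.0, 0.0]     # would come from the small age network

t_emb = sinusoidal_embedding(0, dim=4)
conditioned = condition_time_step(t_emb, [tissue_emb, age_emb])
```

   Without normalization the age embedding in this toy case would be ten
   times larger than the tissue embedding, which is the imbalance the
   normalization step is meant to remove.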

MoPT

   To process large numbers of genes and omics values, a memory-efficient
   transformer architecture, MPT^[183]14, was utilized in the construction
   of Precious2GPT. It incorporates a modified architecture inspired by
   GPT-2^[184]6, where the positional embeddings are replaced with
   Attention with Linear Biases (ALiBi). This modification enhances
   extrapolation capabilities while requiring fewer GPU memory resources
   during model training. Following the retraining process, our model
   underwent biological adaptation to multi-omics data, ultimately
   presenting as the Multi-omics Pretrained Transformer (MoPT).

Model setup and training procedure

   We prepared a tokenizer whose vocabulary consists of all possible genes
   from the datasets, all 2-digit values, and tokens for age, tissues and
   species.

   We utilized the “mosaicml/mpt-7b” configuration from
   HuggingFace^[185]14. To set the number of parameters, we followed the
   Chinchilla scaling law^[186]62, which proposes that the number of model
   parameters should be proportional to the number of tokens in the
   training corpus at a ratio of 1:20. Applying this law to all three
   datasets gave the following model sizes: 4.1, 1.7 and 1 million
   parameters for the multi-omics, expression and methylation datasets,
   respectively.
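
   The 1:20 ratio can be made concrete with a small calculation. The
   corpus sizes below are not reported in the text; they are only what the
   stated ratio would imply for the reported model sizes, and the function
   names are hypothetical.

```python
TOKENS_PER_PARAMETER = 20  # 1:20 parameters-to-tokens ratio stated above

def parameters_for_tokens(n_tokens):
    """Model size suggested by the 1:20 rule for a corpus of n_tokens."""
    return n_tokens // TOKENS_PER_PARAMETER

def tokens_for_parameters(n_params):
    """Corpus size implied by the 1:20 rule for a model of n_params."""
    return n_params * TOKENS_PER_PARAMETER

# Reported model sizes (parameters) and the token counts they would imply.
model_sizes = {"multi-omics": 4_100_000, "expression": 1_700_000, "methylation": 1_000_000}
implied_tokens = {name: tokens_for_parameters(p) for name, p in model_sizes.items()}
```

   Under this rule the multi-omics model would correspond to a corpus on
   the order of 82 million tokens.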

   Learning curves for different datasets are represented in Supplementary
   Fig. [187]10. For each dataset the evaluation set was 1000 samples
   uniformly distributed by all tissues.

MoPT generation procedure

   To generate new omics samples, we pass the desired generation
   conditions (age, tissue, species and omics data type) to the model as
   a plain string with spaces between conditions, e.g., “SPECIES Mouse
   dataset EXPRESSION TISSUE Brain MouseAGE84 ”. We utilized top-k
   together with top-p sampling, where k = 40, p = 0.9 and
   temperature = 0.8.
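
   The sampling settings above can be illustrated with a single decoding
   step. The sketch below is not the MoPT generation code: the order of
   operations (temperature, then top-k, then top-p) and the toy logits are
   assumptions, and the model call that would produce the logits from the
   prompt is omitted.

```python
import math
import random

# Condition prompt, copied verbatim from the example in the text.
PROMPT = "SPECIES Mouse dataset EXPRESSION TISSUE Brain MouseAGE84 "

def sample_next_token(logits, k=40, p=0.9, temperature=0.8, rng=random):
    """Sample one token from {token: logit} with temperature, top-k and top-p."""
    # 1. Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # 2. Top-k: keep only the k highest-scoring tokens.
    top = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # 3. Softmax over the kept tokens (shifted by the max for stability).
    max_logit = top[0][1]
    exps = [(tok, math.exp(logit - max_logit)) for tok, logit in top]
    total = sum(e for _, e in exps)
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches p.
    nucleus, cumulative = [], 0.0
    for tok, e in exps:
        prob = e / total
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    tokens, weights = zip(*nucleus)
    # random.choices renormalizes the remaining weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy logits for the token following the prompt (hypothetical values).
toy_logits = {"GENE_A": 10.0, "GENE_B": 0.0, "GENE_C": -5.0}
next_token = sample_next_token(toy_logits, rng=random.Random(0))
```

   In this toy case the first token alone exceeds the 0.9 probability
   mass, so the nucleus contains a single token and the sample is
   deterministic; with flatter logits several tokens would remain.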

Validation experiments

Tissue classification

   To evaluate the quality of generated omics samples, the Logistic
   Regression model was used in the assessment for tissue classification
   tasks. The evaluation was based on the f1-score^[188]63, weighted by
   classes for both real and generated data, as the key metric for
   determining the reliability of the generated data. For each model, we
   built classification metrics twice. First, we generated synthetic
   samples in a 1:1 ratio with the original data, and the metrics were
   calculated on these samples, where we compared the real label with the
   one predicted by the classifier (Supplementary Fig. [189]11). We
   subsequently examined the performance of the classifier on uniformly
   generated labels from the total number of tissues to evaluate its
   effectiveness in handling unbalanced classes. In the case of multiple
   conditions, we additionally generated age between the minimum and
   maximum values present in our data, or other label types of dataset or
   species within each tissue.

   In addition to the aforementioned metrics, UMAP^[190]64 representations
   were used to visualize both synthesized and real tissue data in
   identifying disparities or similarities between the two distributions.
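
   For reference, the class-weighted F1 score used as the key metric can
   be computed as below. This is a plain-Python sketch of the standard
   definition (per-class F1 weighted by class support), with hypothetical
   tissue labels; it does not reproduce the classifier itself.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score

# Tissue label used to condition generation vs. label predicted by the classifier.
conditioned = ["brain", "brain", "liver", "liver", "liver", "lung"]
predicted = ["brain", "liver", "liver", "liver", "liver", "lung"]
score = weighted_f1(conditioned, predicted)
```

   Weighting by support means that frequent tissues dominate the score,
   which is one reason to also evaluate on uniformly generated labels as
   described above.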

Age regression

   To predict the age of generated data, a CatBoostRegressor^[191]65 model
   was applied using only the gene-level omics values as features. The
   training dataset was composed of real samples paired with their
   respective age values as the target variable, while the synthesized
   samples were used as the testing data: the age predicted for each
   synthesized sample was compared with the age used as its conditioning
   variable. The performance of each model was reported as mean absolute
   error (MAE)^[192]63 and R-squared (R^2).
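
   The two metrics can be computed as in the sketch below, which uses the
   standard definitions of MAE and R^2. The predicted ages are
   placeholder values standing in for the output of the regressor, and
   the function names are illustrative.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between target and predicted ages."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Ages used as generation conditions (targets) ...
conditioning_age = [20, 40, 60, 80]
# ... and placeholder predictions standing in for the regressor output.
predicted_age = [25, 35, 65, 75]

mae = mean_absolute_error(conditioning_age, predicted_age)  # 5.0
r2 = r_squared(conditioning_age, predicted_age)             # 0.95
```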

Differential methylation analysis

   To examine the sample homogeneity of real and generated data, we
   performed several statistical tests focusing on human methylation in
   multiple tissues. First, the nonparametric Mann–Whitney^[193]66
   test was applied to methylation data from different age groups
   generated by the CDiffusion, MoPT and Precious2GPT models. To evaluate
   the ability of the models to preserve the differentially methylated
   genes of distinct age groups, we identified differentially methylated
   genes between samples from 80 ± 20 vs 30 ± 20 year old individuals, in
   both real and generated data. We then calculated the intersection rate
   as the number of differentially methylated genes shared by the two
   sets divided by the number of differentially methylated genes in the
   real data. Differential methylation analysis was also performed
   between the 80 ± 20 years old (real data) and 80 ± 20 years old
   (generated data) groups to assess the similarity between the real and
   generated data (Supplementary Fig. [194]12). P-values were corrected
   for multiple testing using the Benjamini–Hochberg^[195]67 procedure.
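
   The analysis above can be sketched as follows: a per-gene
   Mann–Whitney test between two age groups, Benjamini–Hochberg
   adjustment of the resulting p-values, and the intersection rate
   between the real and generated gene sets. The significance threshold
   of 0.05, the function names and the toy data are assumptions made for
   illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

def differential_genes(old, young, genes, alpha=0.05):
    """Genes whose methylation differs between two groups (rows = samples).

    alpha = 0.05 is an assumed threshold, not taken from the text.
    """
    pvals = [mannwhitneyu(old[:, j], young[:, j],
                          alternative="two-sided").pvalue
             for j in range(old.shape[1])]
    adjusted = benjamini_hochberg(pvals)
    return {g for g, q in zip(genes, adjusted) if q < alpha}

def intersection_rate(real_set, generated_set):
    """Shared genes divided by the number of genes found in real data."""
    if not real_set:
        return 0.0
    return len(real_set & generated_set) / len(real_set)

# Toy data: gene "g0" is shifted in the older group.
rng = np.random.default_rng(0)
genes = [f"g{j}" for j in range(5)]
old = rng.normal(size=(30, 5))
old[:, 0] += 5.0
young = rng.normal(size=(30, 5))
dm_genes = differential_genes(old, young, genes)
```

   Calling differential_genes once on real data and once on generated
   data, then passing both sets to intersection_rate, gives the rate
   described above.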

Out-of-scope experiment

   To validate our Precious2GPT model for age prediction, we conducted
   several out-of-scope experiments involving the training of models with
   methylation data at different age thresholds.
    1. Two kinds of models were trained for this purpose: models trained
       only with samples up to an age threshold (50 or 80 years old), and
       a model trained with the entire sample set.
    2. Using the models trained with the distinct thresholds, we generated
       data for the (threshold + 20, threshold + 40) years old range, and
       compared the clusters formed by the generated data with those of
       real data on PCA^[196]68 representations.
    3. We trained a model with available data from 100 samples per tissue
       for individuals aged between 120 and 150 years old, and generated
       the differential methylation data for pathway analysis to predict
       aging-related alterations. The pathway analysis was conducted
       using the Python package gseapy^[197]69, with the KEGG-human
       database^[198]70 and the 12 HALLMARKS lists serving as the
       gene-set libraries against which the differentially methylated
       genes were tested for enrichment.
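
   The thresholding and PCA comparison in steps 1 and 2 can be sketched
   as below. The inclusive threshold, the decision to fit the principal
   components on real data only, and all function names are assumptions
   made for illustration; the toy matrices stand in for methylation
   profiles.

```python
import numpy as np

def training_mask(ages, threshold):
    """Samples allowed in training: age up to the threshold (inclusive)."""
    return np.asarray(ages) <= threshold

def target_age_range(threshold):
    """Age range to generate for a model trained up to `threshold`."""
    return threshold + 20, threshold + 40

def pca_project(reference, other, n_components=2):
    """Fit PCA on `reference` via SVD and project both matrices."""
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    components = vt[:n_components]
    return (reference - mean) @ components.T, (other - mean) @ components.T

# Toy usage with random stand-ins for real and generated profiles.
rng = np.random.default_rng(0)
ages = np.array([25, 45, 50, 63, 78, 91])
mask_50 = training_mask(ages, threshold=50)
low_age, high_age = target_age_range(50)

real_profiles = rng.normal(size=(40, 10))
generated_profiles = rng.normal(size=(15, 10))
real_2d, generated_2d = pca_project(real_profiles, generated_profiles)
```

   Plotting real_2d and generated_2d on the same axes shows whether the
   generated samples for an unseen age range fall near the real samples
   of that range.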

Case study experiment

   In the case study experiment focusing on colorectal cancer (CRC), we
   used Precious2GPT to generate synthetic controls for eight CRC cell
   lines, namely Caco2, Lovo, SW1417, NCI-716, RKO, HCT-8, SW480 and
   SK-CO-1, obtained from our internal laboratory, the Robotic Lab. We
   fed the gene expression data of the eight CRC cell lines as input to
   the pre-trained Precious2GPT model to generate the respective
   synthetic controls. The gene expression data of the generated control
   samples were normalized and uploaded to our PandaOmics platform.
   Within the platform, individual comparisons (case vs. control) were
   established for each cell line. These eight comparisons were then
   incorporated into a meta-analysis, and the results for CRC landmark
   genes (obtained from the LINCS L1000 project) were generated through
   the TargetID panel and Knowledge graph. To evaluate the quality of the
   control samples generated by Precious2GPT, we compared the AI-driven
   target prioritization results between the generated CRC data and the
   pre-calculated results using real data available on PandaOmics.

Supplementary information

   [199]Supplementary information^ (5.2MB, pdf)

Acknowledgements