Abstract

Synthetic data generation in omics mimics real-world biological data, providing alternatives for training and evaluation of genomic analysis tools, controlling differential expression, and exploring data architecture. We previously developed Precious1GPT, a multimodal transformer trained on transcriptomic and methylation data, along with metadata, for predicting biological age and identifying dual-purpose therapeutic targets potentially implicated in aging and age-associated diseases. In this study, we introduce Precious2GPT, a multimodal architecture that integrates Conditional Diffusion (CDiffusion) and decoder-only Multi-omics Pretrained Transformer (MoPT) models trained on gene expression and DNA methylation data. Precious2GPT excels in synthetic data generation, outperforming Conditional Generative Adversarial Networks (CGANs), CDiffusion, and MoPT. We demonstrate that Precious2GPT is capable of generating representative synthetic data that captures tissue- and age-specific information from real transcriptomics and methylomics data. Notably, Precious2GPT surpasses other models in age prediction accuracy using the generated data, and it can generate data beyond 120 years of age. Furthermore, we showcase the potential of using this model in identifying gene signatures and potential therapeutic targets in a colorectal cancer case study.

Subject terms: Drug discovery, Diseases

Introduction

Biological synthetic data generation in the context of omics refers to the creation of artificial datasets that mimic the characteristics of real biological data, particularly in genomics, transcriptomics, proteomics, and other high-throughput biological technologies^1. Generating synthetic data is valuable for various reasons, including the development and validation of computational methods, protection of privacy in sensitive datasets, and augmentation of limited real-world data.
Generative adversarial networks (GANs) have been introduced as unique models to generate synthetic genomic data, ranging from DNA sequences to bulk RNA-seq data^[55]2,[56]3. Copula-based methods are other examples of classical statistical approaches in generating synthetic omics data, especially microarray gene expression data^[57]4. Moreover, Diffusion models are a recent addition to deep learning for synthetic data generation by simulating a diffusion process, which gradually transforms a simple noise distribution into the target data distribution^[58]5. Large language models (LLMs), exemplified by Generative Pre-trained Transformer 2 (GPT-2), have also garnered substantial interest built upon the Transformer architectures, capturing their significant contributions to the analysis of sequential data, and capabilities in modeling and advanced language understanding, generation and prediction^[59]6. Although these models have shown promising results in generating high-quality synthetic data, their application to omics data is still in its infancy. The need for synthetic data in omics studies stems from a variety of practical considerations and research objectives, which offer beneficial alternatives for training and evaluating genomic analysis tools without relying on large real-world datasets. They also serve as effective tools in controlling for differential expression, testing algorithms and pipelines in synthetic data, modeling interventions, and studying the underlying data structure in omics data^[60]7. For instance, researchers often compare gene expression levels between two or more conditions in differential expression analysis. Such conditions could include disease vs. healthy or treated vs. untreated samples. However, obtaining an adequate number of control samples can be challenging due to cost constraints or limited availability^[61]8. 
On the other hand, ‘conditional’ synthetic omics data generation is crucial to simulate real-life scenarios, as gene expression patterns exhibit notable variations influenced by factors such as species, tissue type, or age of the sample. In particular, gene expression undergoes substantial changes throughout development or aging, with certain genes being upregulated or downregulated during distinct life stages^9,10. Therefore, synthetic data generation models incorporating conditionality are essential in capturing the condition-specific variations in gene expression, as well as in confronting the disparate distribution of omics data on varying conditions^11. However, achieving a high level of realism in synthetic data poses significant challenges, thus highlighting the emerging need for advanced modeling interventions.

In our prior work, Precious1GPT^12 successfully demonstrated promising capabilities in predicting biological age and identifying dual-purpose therapeutic targets potentially implicated in aging and age-related diseases. In this study, we employed a novel method incorporating the generation of multi-omics data implemented via the combination of a language model for data generation and a diffusion model for creating visual representations of omics data. Deep learning models including Conditional GANs^13, Conditional Diffusion (CDiffusion), and Multi-omics Pre-trained (MoPT)^14 approaches, and our novel intervention Precious2GPT (P2GPT) combining CDiffusion and MoPT models, were utilized and compared in generating synthetic multi-omics, multi-species and multi-tissue samples. While aging clocks based on various omics data types have been proposed, such as transcriptomics^15, methylomics^16, proteomics^17, and metabolomics^18, P2GPT focused on transcriptomic and DNA methylation data.
This choice was driven by the abundant availability of these data types for training and validation purposes, particularly when considering factors like age and tissue specificity. The synergistic effect of our novel combination approaches provides a tactical edge in addressing challenges pertaining to tissue classification, age prediction, and identification of crucial signaling pathways. A case study using colorectal cancer as an exemplary dataset was also conducted to highlight the potential and applicability of P2GPT in accurately generating simulated biological data for bioinformatic analyses.

Results

Construction of Precious2GPT

We propose a novel hybrid approach that combines the power of two generation models, CDiffusion and MoPT, to generate high-quality multi-omics methylation and expression data (Fig. 1). Our approach of constructing Precious2GPT (P2GPT) involves the following steps:

Fig. 1. Schematic representation of the P2GPT model.
The top left section of the diagram delineates the diverse omics datasets (e.g., methylation and gene expression) collected under various conditions such as age, tissue type, species, and omics types. From this initial data representation, lines branch out to indicate two separate data processing streams feeding into the CDiffusion. One stream enters the Categorical Embedding, processing discrete data features and the other enters the Continuous Embedding, handling the age data. Adjacent to these embedding blocks, the PyDeepInsight Transformation highlights another preparatory step for the input data, which is processed in parallel to the embeddings and also fed into the CDiffusion. On the left side, CDiffusion is presented in detail to reflect its centrality in the data analysis pipeline. Beneath this architecture, an Inverse PyDeepInsight block reverts the transformed data back to its omics representation after processing through the CDiffusion model. The transformed outcomes are combined with results from the CDiffusion in the FWLS block. The top right section of the figure introduces the Omics Tokenizer, serving as the preliminary stage for the LLM generation. Below the tokenizer, a larger visual represents the architecture of the LLM model. Its output is directed back into the omics space to broaden interpretability and also channeled into the FWLS, where it is integrated with the CDiffusion generations. The bottom right of the illustration showcases the Model Capabilities block. This block emphasizes various practical applications of the developed framework, including omics data generation, assembly of large open datasets, facilitation of control mechanisms for PandaOmics, the model’s capacity for target discovery, out-of-domain extrapolations, and conditional age prediction.

Step 1: CDiffusion model generation
We employ the CDiffusion model to generate an initial dataset, which simulates gene expression levels based on the provided gene expression network. This network incorporates dependencies between genes, ensuring a biologically plausible gene expression pattern.

Step 2: MoPT model evaluation
Using the generated dataset from Step 1, we evaluate the quality of each gene’s generation using the MoPT model. The MoPT model calculates a quality score for each gene, reflecting the similarity between the synthetic data and real-world gene expression and DNA methylation profiles.

Step 3: Coefficient calculation
To create a balanced combination of the two models and reflect the proportion of quality contributed by each, we calculate coefficients based on the quality scores obtained from the MoPT model for each gene. These coefficients represent the relative importance of the CDiffusion and MoPT models in the final hybrid generation. To conduct the combination of our models, we employed the Feature Weighted Linear Stacking (FWLS^19) approach.
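As a rough illustration of the coefficient calculation in Step 3, the sketch below fits one weight per model by ordinary least squares and then forms the weighted combination for a single gene. It is a simplified two-model stand-in, not the P2GPT implementation: the function names `fit_stacking_weights` and `combine`, the toy values, and the choice to fit one gene at a time within one condition subgroup are all assumptions made for illustration.

```python
def fit_stacking_weights(mopt, cdiff, target):
    """Least-squares weights for two generators, solved via the normal equations.

    mopt, cdiff: generated values from the two models (hypothetical inputs).
    target: the corresponding real values for the same condition subgroup.
    """
    # Entries of X^T X and X^T y for the two-column matrix X = [mopt, cdiff].
    a = sum(m * m for m in mopt)
    b = sum(m * c for m, c in zip(mopt, cdiff))
    d = sum(c * c for c in cdiff)
    p = sum(m * t for m, t in zip(mopt, target))
    q = sum(c * t for c, t in zip(cdiff, target))

    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("model outputs are collinear; weights are not unique")

    # Closed-form inverse of the 2x2 matrix X^T X applied to X^T y.
    w_mopt = (d * p - b * q) / det
    w_cdiff = (a * q - b * p) / det
    return w_mopt, w_cdiff


def combine(mopt, cdiff, weights):
    """Weighted linear combination of the two models' generations."""
    w_mopt, w_cdiff = weights
    return [w_mopt * m + w_cdiff * c for m, c in zip(mopt, cdiff)]


# Toy usage with made-up numbers for one gene.
mopt_values = [1.0, 2.0, 3.0, 4.0]
cdiff_values = [2.0, 1.0, 4.0, 3.0]
real_values = [1.7, 1.3, 3.7, 3.3]
weights = fit_stacking_weights(mopt_values, cdiff_values, real_values)
combined = combine(mopt_values, cdiff_values, weights)
```

With more than two base models, or with feature-dependent weights, a general linear solver would replace the closed-form 2x2 inverse used here.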
FWLS is a technique that combines multiple models by assigning weights to each model based on their performance and using these weights to calculate a weighted average prediction. The FWLS formula is as follows:

$$y_{FWLS} = \sum_{j=1}^{m} \sum_{i=1}^{n} w_{ij} \cdot y_{ij}$$

where $y_{FWLS}$ represents the final combined generation, $y_{ij}$ represents the generation of each individual model (MoPT and CDiffusion) for each gene, and $w_{ij}$ represents the weight assigned to each model for each gene.

In our study, we employed a linear regression approach to determine the optimal weights for model combination. The weight calculation formula using linear regression is as follows:

$$w = (X^{T}X)^{-1}X^{T}y$$

where $w$ represents the vector of weights, $X$ represents the matrix of generations from individual models (MoPT and CDiffusion), and $y$ represents the actual target values for each subgroup of conditions (tissue, age, omics type, species).

Once the weights are determined, they are used to calculate the final combined prediction by taking the weighted average of the individual predictions. This approach allows us to leverage the strengths of each model and minimize the impact of any individual model’s weaknesses.

P2GPT is capable of generating tissue-specific and accurate multi-omics data

We first tested 3 basic models including CGAN, CDiffusion, and MoPT, as well as the multi-omics versions of CDiffusion, MoPT, and their combination, P2GPT. As a simple model, CGAN was used to serve as the benchmark or baseline in the analysis to evaluate the performance of other models. All 6 models performed well in classifying tissues with both real (Table 1) and generated (Table 2) labels. In columns 1 and 2 of Table 1, the results for P2GPT and the diffusion model were obtained with genes extrapolated from the landmark genes (obtained from the LINCS L1000 project), whereas the CGAN model uses all genes for generation.
The comparison between all the 6 models tested using human expression, DNA methylation, and mouse expression data showed that in the case of extremely underrepresented data such as the imbalance of tissue densities, which is common in practice, only the P2GPT model is capable of generating accurate omics data by combining the multi-omics models of CDiffusion and MoPT. We further visualized the distribution of data by UMAP dimensionality reduction, which showed that using human expression data (Fig. 2A, B, Supplementary Fig. 1), the generated labels were highly concordant with the real labels, further suggesting that P2GPT was capable of synthesizing tissue-specific expression data accurately. On the other hand, real human DNA methylation data were not well clustered into tissue types (Fig. 2D, Supplementary Fig. 1). Despite this, the generated methylation data were also highly concordant with the real data in terms of similarity and distribution (Fig. 2C, Table 2). In mouse expression data, P2GPT could also generate concordant and tissue-specific data, although less accurately than in human data (Fig. 2E, F, Table 2) (see Methods). Overall, this model has shown itself to work well, both for DNA methylation and expression, in the case of tissue-specific data generation.

Table 1. Tissue classification real labels

| Real labels | Expression (Human) | Methylation (Human) | Expression (Mouse) |
|---|---|---|---|
| MoPT | 0.995 | 0.880 | 0.920 |
| CDiffusion | 0.993 | 0.937 | 0.960 |
| CGAN | 0.991 | 0.930 | 0.930 |
| CDiffusion (multi-omics) | 0.994 | 0.887 | 0.993 |
| MoPT (multi-omics) | 0.997 | 0.916 | 0.982 |
| P2GPT | 0.999 | 0.944 | 0.999 |

Weighted F1 score for tissue prediction with LR model. The best performing model was bolded.

Table 2. Tissue classification generated labels

| Gen labels | Expression (Human) | Methylation (Human) | Expression (Mouse) |
|---|---|---|---|
| MoPT | 0.928 | 0.780 | 0.632 |
| CDiffusion | 0.945 | 0.870 | 0.810 |
| CGAN | 0.960 | 0.874 | 0.501 |
| CDiffusion (multi-omics) | 0.957 | 0.880 | 0.887 |
| MoPT (multi-omics) | 0.956 | 0.830 | 0.697 |
| P2GPT | 0.965 | 0.910 | 0.899 |

Weighted F1 score for tissue prediction with LR model. The best performing model was bolded.

Fig. 2. UMAP of real data and data generated by Precious2GPT.
Each point represents an individual sample. A Human expression data colored by data type (orange, real; blue, generated). B Human expression data colored by tissue type. C Human methylation data colored by data type (real or generated). D Human methylation data colored by tissue type. E Mouse expression data colored by data type (real or generated). F Mouse expression data colored by tissue type.

P2GPT outperformed other models in age prediction using generated data

Next, we trained the CatBoost regression model using real data with age as the target and assessed the prediction performance of age in the generated data. Our results showed that P2GPT demonstrated the best performance, achieving the lowest mean absolute error (MAE) and highest R^2 score across all types of datasets tested when compared to other models (Tables 3 and 4).

Table 3. Age prediction real labels quality

| Real labels | Expression (Human) MAE | Expression (Human) R^2 | Methylation (Human) MAE | Methylation (Human) R^2 | Expression (Mouse) MAE | Expression (Mouse) R^2 |
|---|---|---|---|---|---|---|
| MoPT | 9.210 | 0.255 | 8.050 | 0.620 | 12.011 | 0.646 |
| MoPT (multi-omics) | 8.754 | 0.210 | 6.580 | 0.770 | 10.839 | 0.685 |
| CDiffusion | 11.00 | 0.165 | 7.700 | 0.840 | 13.376 | 0.674 |
| CDiffusion (multi-omics) | 8.297 | 0.307 | 7.010 | 0.880 | 11.952 | 0.686 |
| P2GPT | 8.247 | 0.313 | 6.300 | 0.890 | 10.708 | 0.693 |

R^2 and MAE with CatBoostRegressor. For mouse expression, the MAE metric was calculated in days; for human methylation and expression, the MAE metric was calculated in years. The best performing model (lowest MAE and highest R^2) was bolded.

Table 4. Age prediction generated labels

| Generated labels | Expression (Human) MAE | Expression (Human) R^2 | Methylation (Human) MAE | Methylation (Human) R^2 | Expression (Mouse) MAE | Expression (Mouse) R^2 |
|---|---|---|---|---|---|---|
| MoPT | 13.280 | −0.260 | 12.70 | 0.650 | 61.982 | −0.363 |
| MoPT (multi-omics) | 12.717 | 0.019 | 12.47 | 0.710 | 61.073 | −0.323 |
| CDiffusion | 12.310 | 0.030 | 11.18 | 0.810 | 50.930 | −0.120 |
| CDiffusion (multi-omics) | 11.848 | 0.108 | 10.94 | 0.800 | 50.400 | −0.105 |
| P2GPT | 10.906 | 0.156 | 9.890 | 0.820 | 48.940 | −0.010 |

R^2 and MAE with CatBoostRegressor. For mouse expression, the MAE metric was calculated in days; for human methylation and expression, the MAE metric was calculated in years. The best performing model (lowest MAE and highest R^2) was bolded.

Notably, the error of the expression-based regressor is substantial, especially for mouse expression data. In contrast, the generative models with the age condition demonstrate a pronounced proficiency in terms of MAE and R^2 when applied to human DNA methylation data. This observation has guided subsequent experiments to also focus on DNA methylation with the age condition. A noteworthy characteristic is the enhanced quality achieved through the unique heterogeneity and the combination of these models, which results in the optimal performance. Additionally, we have witnessed improvements not only in models that encompass multiple omics datasets but also in the results produced by their integrative combinations. As shown by the best regression results on DNA methylation in Supplementary Fig. 2, our model shows high quality in all tissues.

The effect of underrepresented data on P2GPT’s data synthesis

We then investigated how the representation of data affects the P2GPT’s ability to generate new synthetic data in each of the three data types (human expression, human DNA methylation, and mouse expression). The model was asked to generate 300 synthetic samples for each tissue with the additional condition of age from a uniform distribution of age values.
We can see that the model requires a relatively small number of real samples to start generating valid synthetic expression samples conditioned on a specific age and tissue, while the error rate for the generation of DNA methylation samples is significantly higher (Fig. 3). We cannot state definitively that the number of samples in a dataset is the main factor in the successful generation of synthetic samples, but the trends in the graphs suggest it is one of the most important factors. Here, we defined correctly generated samples as those generated in strictly proper structure, i.e., the order of genes and their generated omics numerical values were correct.

Fig. 3. Analysis of underrepresented data synthesis using P2GPT.
Each point is a specific tissue, while the x-axis shows the number of samples presented in the dataset, and the y-axis shows the number of correctly generated samples out of the expected 300 samples. Left: Human expression. Middle: Human methylation. Right: Mouse expression.

P2GPT is capable of generating highly accurate DNA methylation data across ages

Since our results from CatBoost regression analysis (Tables 3 and 4) indicated that DNA methylation data was the best data type for age prediction using the generated data, we used it to compare the similarity between real and generated data in terms of differential methylation. For each of the tissues, we compared the DNA methylation levels of samples of 80 ± 20 years old and 30 ± 20 years old in the real data, as well as those of 80 ± 20 years old (predicted) and 30 ± 20 years old (predicted) in the generated data. The significantly differentially methylated genes were then obtained from each of them to identify the number of intersections (Fig. 4). We then studied how much overlap there is between real and generated data on sets of differentially methylated genes (Table 5, MA).
We hypothesized that if the models captured a certain difference between groups, in this case, a 50-year difference, then we could generate the data for older people or mice. For certain tissues (e.g., leukocytes or liver), we observe that it is better to use a single model. However, in the case of the occipital lobe, blood, breast, and some other tissues, the P2GPT shows better performance. With buccal mucosa tissue, we have very little overlap, but this is because there are only 3 differentially methylated genes. Globally, based on statistics, we can say that our model P2GPT has become more stable between all tissues and there is no bias. In addition, we compared the DNA methylation levels in 80 years old samples between the real and generated data. We hypothesized that if the real and generated data shared high similarity, low counts of differentially methylated genes would be obtained. As shown by our results, for the majority of tissues analyzed, P2GPT demonstrated the lowest numbers of differentially methylated genes identified by the comparison between real and generated data (Table 5, DM). This further suggests that our model can identify the difference in DNA methylation levels both between groups at different ages and preserve DNA methylation values with the same age group for real and generated data.

Fig. 4. Tissues with the most overlaps between real and generated data in differentially methylated genes were identified by 30 vs 80 years old comparisons.
gen, generated.

Table 5. Results of overlapping differentially methylated genes between generated and real data

| Tissue | MoPT (multi-omics) MA | MoPT (multi-omics) DM | CDiffusion (multi-omics) MA | CDiffusion (multi-omics) DM | P2GPT MA | P2GPT DM |
|---|---|---|---|---|---|---|
| Immune cells | 0.931 | 0.396 | 0.926 | 0.76 | 0.95 | 0.11 |
| Saliva | 0.942 | 0.244 | 0.778 | 0.757 | 0.932 | 0.624 |
| Cerebral cortex | 0.925 | 0.37 | 0.905 | 0.702 | 0.923 | 0.294 |
| Blood | 0.828 | 0.24 | 0.902 | 0.805 | 0.907 | 0.149 |
| Leukocyte | 0.875 | 0.269 | 0.801 | 0.775 | 0.789 | 0.202 |
| Kidney | 0.364 | 0.168 | 0.533 | 0.659 | 0.752 | 0.01 |
| Liver | 0.764 | 0.148 | 0.728 | 0.809 | 0.752 | 0.109 |
| Peripheral blood mononuclear cell | 0.759 | 0.100 | 0.697 | 0.169 | 0.748 | 0.000 |
| Occipital lobe | 0.442 | 0.076 | 0.689 | 0.081 | 0.745 | 0.000 |
| Breast | 0.633 | 0.323 | 0.707 | 0.464 | 0.721 | 0.05 |
| Mucosa | 0.549 | 0.090 | 0.431 | 0.775 | 0.65 | 0.594 |
| Cerebellum | 0.771 | 0.208 | 0.465 | 0.653 | 0.627 | 0.623 |
| Frontal lobe | 0.535 | 0.175 | 0.418 | 0.468 | 0.477 | 0.336 |
| Thyroid gland | 0.222 | 0.091 | 0.568 | 0.229 | 0.281 | 0.000 |
| Buccal mucosa | 0.045 | 0.000 | 0.000 | 0.825 | 0.000 | 0.200 |

MA: Intersection of differentially methylated genes between the samples of 30 and 80 years old in generated and real data. Values represent the proportion of differentially methylated genes in the real data comparison that intersected with the list of differentially methylated genes in the generated data. A higher MA represents a better performance of the model.
DM: Differentially methylated genes between 80 years-old generated vs. 80 years-old real data. Values represent the proportion of differentially methylated genes in the real data comparison that intersected with the list of differentially methylated genes in the generated data. Lower DM represents the better performance of the model.
For each tissue, the best performing (highest MA and lowest DM) model(s) was/were bolded.
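The MA value defined in the footnote of Table 5 is a set-overlap proportion, which can be sketched as follows. This is a minimal illustration under assumptions: the function name `overlap_proportion` and the placeholder gene identifiers are hypothetical, and the upstream step that calls genes differentially methylated (statistical test and significance threshold) is not shown because it is not specified in this section.

```python
def overlap_proportion(real_genes, generated_genes):
    """Share of genes from the real-data comparison that also appear
    in the generated-data comparison.

    real_genes: differentially methylated genes from the real 30 vs 80 comparison.
    generated_genes: differentially methylated genes from the generated 30 vs 80 comparison.
    """
    real = set(real_genes)
    if not real:
        # No differentially methylated genes in the real data: define the overlap as 0.
        return 0.0
    return len(real & set(generated_genes)) / len(real)


# Toy usage with placeholder gene identifiers (not real results).
real_dm_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
generated_dm_genes = ["GENE_B", "GENE_C", "GENE_E"]
ma_score = overlap_proportion(real_dm_genes, generated_dm_genes)
```

Because the denominator is the size of the real gene list, tissues with very few differentially methylated genes in the real data (such as the buccal mucosa case noted in the text) can yield extreme values from only a handful of genes.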
P2GPT can extrapolate age in generated data beyond the age range of the training dataset

Since MoPT could not be used to predict age in the generated data that was not present in the training real data, as an out-of-scope experiment, we used the diffusion model integrated in P2GPT to study its age prediction accuracy based on two conditions. The first condition involved training exclusively on the DNA methylation data from individuals younger than 50 or 80 years old, while in the second condition, the model was trained using data encompassing all ages (ranging from 0 to 114 years old) in the real data. We show that the number of samples available per age group varies across tissues, with some tissues having underrepresented age groups (Fig. 5A). Our results indicated that the quality of the model’s predictions was maintained particularly well in the crucial age component within PCA for specific tissues. Specifically, we obtained notable results for blood and examined the components that exhibited the highest correlation with age. We observed that the layer representing generated samples of older individuals was positioned right after the layer representing the 80–100 age group, regardless of the training condition (Fig. 5B, C). Similarly, generated samples at older age were also positioned close to the real old samples in blood by setting the training threshold at 50 years old (Supplementary Fig. 3). This finding confirms the model’s ability to generate data for age values that have not been encountered during training, indicating its capacity to extrapolate beyond the observed age range. Moreover, the results on tissues with the underrepresented 80–100 years old group (Supplementary Fig. 4A) do not differ much between both training conditions. However, when it is well represented, the variance between samples is lower where the model was trained on all labels (Frontal Lobe in Supplementary Fig. 4B). Similar results were also observed in the cerebral cortex (Supplementary Fig. 4C), cerebellum (Supplementary Fig. 4D) and buccal mucosa (Supplementary Fig. 4E).

Fig. 5. Generation of methylation data based on real data and age.
A Distribution of age groups in real methylation data for each tissue. B PCA for blood with real and generated data with different age bins for models trained on data with the whole age distribution. C PCA for blood with real and generated data with different age bins for models trained on data with age lower than 80. The black line in each PCA plot connects the cluster centroids of each age group from [0,20] to [100,120] for real data and [100,120] for generated blood data.

Applications of P2GPT-synthesized data in pathway enrichment analysis

To assess the applicability of P2GPT in biological data interpretation, we performed differential DNA methylation analysis between the 120 and 150 years old samples in each tissue generated by P2GPT, followed by pathway enrichment analysis on the differentially methylated genes based on the KEGG database to identify the pathways that were potentially related to aging. In blood, the enriched pathways for the differentially methylated genes were associated with immune function, organ development, and hormonal regulation (Supplementary Fig. 5). In the liver, pathways associated with cytokine receptor interactions and NOD-like receptor signaling were enriched, while pathways enriched in the older thyroid gland revealed a shift towards chronic inflammation and immune dysregulation as suggested by disrupted cytokine interactions, increased neutrophil extracellular trap formation, and alterations in hematopoietic cell lineage. Moreover, overrepresentation analysis showed that the differentially methylated genes were enriched in several hallmarks of aging across multiple tissues.
In particular, inflammation was found to be enriched in 4 tissues (blood, liver, saliva, thyroid gland), while genomic instability was enriched in 2 tissues (liver, thyroid gland) and altered intercellular communication in 2 tissues (blood, thyroid gland). Here we took the conclusion of enriched pathways through all tissues in KEGG_2021_Human. Lists of pathways for selected tissues can be seen in Supplementary Table [118]1. We also added a description of these pathways through GPT-4 based on the direction of methylation and importance in aging. It can be observed that most of the pathways are indeed related to aging. This suggests that our P2GPT model can generate old people data and find the most important biological signaling pathways of aging. Case study experiment in colorectal carcinoma To demonstrate the potential of in silico-generated gene expression data as control samples for actual colorectal cancer samples, we generated control samples for corresponding case samples of eight colorectal cancer (CRC) cell lines using our P2GPT model. Subsequently, eight case-control comparisons for the corresponding CRC cell lines were incorporated into a meta-analysis. CRC meta-analysis was constructed using a restricted subset of genes, referred to as the “landmark” genes, and another meta-analysis was created using a comprehensive subset of genes, termed “restored” by a CycleGAN model which is described in Supplementary materials. For each of the two meta-analyses, we extracted the common gene expression signatures across all eight cell lines which yield two lists of gene expression signatures. Each of them was subjected to Spearman’s correlation test with the gene expression signature obtained from the pre-calculated CRC meta-analysis on PandaOmics (Fig. [119]6). 
We observed that common gene expression signatures calculated on an extended number of genes (restored genes) exhibited greater similarity to the benchmark CRC signatures (r = 0.552) compared to signatures calculated on a limited number of genes (landmark genes, r = 0.497). Additionally, it was evident that the application of a gene expression significance threshold positively correlated with overall signature similarities. The combined landmark signature generated with the P2GPT model demonstrated strong similarity to the CRC benchmark signature (Fig. 6A). However, its performance on the restored subset of genes was even better (Fig. 6B). Consequently, the meta-analysis derived from comparisons created using the P2GPT on an extended number of genes was further employed for target identification analysis.

Fig. 6. Correlation matrix for colon carcinoma signatures.
Spearman correlation coefficients between colon carcinoma signatures were calculated using only landmark genes (A) and all restored genes (B). The colon carcinoma signature (PandaOmics CRC project signature) was derived from the “expression analysis” section of manually curated colon carcinoma meta-analysis in PandaOmics and corresponded to the combined gene expression changes values for colon carcinoma. P2GPT CRC signature was collected from the corresponding meta-analysis in PandaOmics.

As previously, the manually curated CRC project available in PandaOmics served as a benchmark for target hypotheses. The Target ID results were compared between the CRC meta-analysis, containing case-control datasets that were obtained from patients, and the P2GPT CRC meta-analysis (refer to the previous section). This case study aimed to demonstrate that the in silico-generated control samples could be used as control samples for actual case samples.
Therefore, to compare Target ID results, we only used omics scores for hypothesis ranking and included solely druggable target families^[124]20. The top 20 target hypotheses for both the benchmark and the P2GPT-restored CRC meta-analyses are depicted on a heatmap (Fig. [125]7). Using the Target ID approach the top genes that were highly scored in PandaOmics CRC meta-analysis and P2GPT-controls generated CRC meta-analysis were explored. By analyzing the overlapped genes, it was observed that both top 20 target hypotheses lists contain hits that are strongly associated with CRC pathology. For instance, AKT1, PTEN, and PIK3R1 are key modulators in the PI3K/AKT pathway while PLK1, CDK2, and MAPK14 are major drivers involved in cell cycle regulation. Being ranked top in both Target ID results, AKT has been extensively studied in disease pathogenesis^[126]21,[127]22 and is altered in CRC patients^[128]23. CDK2 is also highly scored in both meta-analyses. The CDK2 has been explored in the G1/S phase transition^[129]24 and the CDK2 selective inhibitors have already been tested in CRC models^[130]25. Genes that were top scored only in P2GPT-controls generated CRC meta-analysis (PIK3CD, FYN, YES1, ATM, HRAS, TNFRSF1A, GSK3B, PLCG1, CSK, PIK3CA) are also related to pathogenesis. For example, PIK3CD was shown to be involved in AKT/GSK-3β/β-catenin signaling and could be considered as a potential target^[131]26, while mutations in PIK3CA were observed in 20% to 25% of CRC^[132]27 and associated with shorter cancer-specific survival^[133]28. The results were supported by PandaOmics knowledge graph (Supplementary Fig. [134]6). Overall, our results suggested that our P2GPT model can be used to generate expression data that could be utilized in target discovery. We showed that gene expression changes between case and control (both real and generated) samples resulted in a similar disease-specific expression signature. 
At the same time, the Target ID approach applied for data from patients (colon carcinoma PandaOmics meta-analysis) and for P2GPT-controls generated colorectal cancer meta-analysis showed a strong overlap between well-known targets for colorectal carcinoma along with a new target hypothesis.

Fig. 7. Top 20 most promising target hypotheses for colorectal cancer.
Results were derived from the in silico Target ID scoring approach for PandaOmics colorectal carcinoma meta-analysis (A) and P2GPT colon cancer meta-analysis (B). To validate our approach, only omics-based scores with the application of a druggability filter were taken into account and used for the composition of the scores for ranking.

Discussion

Studies have identified several key biomarkers, proteins and pathways that play important roles in aging^29,30. Despite this, the multifaceted process of aging still requires substantial understanding and unraveling of the complex biological data. To address this, the use of artificial intelligence (AI) in the field of aging research has been increasing recently. Indeed, deep learning-based approaches have been proposed to play vital roles in multiple areas facilitating aging research, such as predicting biological age^16,31,32, developing biomarkers^15,33–35, identifying therapeutic targets^15,20,36–39, and generating novel compounds^34,40,41. In fact, studies have also demonstrated the potential of applying AI models to identify targets implicated in aging and age-associated diseases, targeting established hallmarks of aging^20,30,37,42. In this study, we present a hybrid approach Precious2GPT (P2GPT) that combines the complementary strengths of the CDiffusion and MoPT models for generating high-quality multi-omics DNA methylation and expression data.
Our approach reduces the limitations of individual models and leverages their strengths to enhance the generation process. This innovative approach has potential applications in various fields, including data analysis, algorithm development, and privacy preservation for multi-omics research. We demonstrate the effectiveness of our hybrid approach by comparing the quality of data generated using individual models of CGAN, CDiffusion, and MoPT, with the combined hybrid approach P2GPT. With the aid of tissue classification and age regression experiments, the performance of models was assessed in terms of their specificity to species and tissue types, as well as their capability to predict age based on learned patterns from real data. Upon training the transformer-based model with this corpus, we demonstrated its high capability of generating new data conditioned on specific factors like age or tissue type. In our study, we encountered a primary challenge in generating tabular data from continuous gene expression and DNA methylation omics data. Previous works have attempted the conversion of table data to text before the application of the pretrained GPT-2 model^[156]43. Another approach addressed the complexity issue by using the GPT-2 architecture with a customized vocabulary to improve the efficiency during both training and inference of the model^[157]44. Hence, we devised an encoding scheme, wherein each gene and its corresponding omics value were represented as individual tokens. In essence, our approach treated the gene-omics data as pseudo-text, enabling us to utilize the transformer-based model, ultimately introducing the MoPT model. To evaluate the generated data with predictions of age and tissues for different data types and species, we highlight the potential of transformer architectures in bioinformatics tasks, which represents the first biomedical-specific adaptation of a language model for generating tabular data. 
Synthetic data plays a crucial role in overcoming data insufficiency by providing synthetic controls that replicate the biological properties of real control samples, and enhance equity in differential expression analysis. The use of generated data also enables cost-effective testing of algorithms and pipelines in a virtual experimental platform, allowing researchers to mimic the effects of interventions under specific scenarios such as varying levels of noise and different degrees of differential expression^[158]45. Furthermore, the potential impact of alterations in genomic profiles can be predicted with synthetic gene knockdowns or knockins data^[159]46. Our P2GPT model demonstrated exceptional performance in classifying tissues based on synthetic data. The model’s accuracy is remarkable, with its predictions closely resembling those based on real biological datasets as evidenced by the high correlation coefficients in cross-validation studies and the model’s robustness when tested against known benchmarks. In the age regression analysis, P2GPT showcased its aptitude by accurately predicting the biological age of samples using synthetically generated DNA methylation patterns. The synthetic data, when compared against real-world epigenetic clocks, confirmed that P2GPT successfully captured the nuances of age-related changes, with a minimal margin of error. This reveals the potential for wide-ranging applications in biogerontology and personalized medicine. Leveraging out-of-scope (OOS) experiments with the P2GPT model has revealed that across various tissues, aging is consistently associated with dysregulated immune function, chronic inflammation, and alteration in cell lineage and signaling pathways. Age-associated dysregulated immune function, accompanied by chronic inflammation (inflammaging), contributes to the process of immunosenescence observed in aged individuals^[160]47,[161]48. 
The alteration in signaling pathways has been shown to trigger inflammaging and senescence across multiple tissues^[162]30. These biological processes markedly contribute to the increased disease burden observed in the elderly population and present potential targets for therapeutic intervention. The insights delivered by the P2GPT model’s OOS experiments underscore the value of advanced computational models in understanding the complex biological underpinnings of aging and spotlighting potential avenues to mitigate its detrimental effects on health. Our results showed that our model can be utilized to identify biologically relevant pathways and processes through synthetic data generation. By combining MoPT and CDiffusion models using Feature Weighted Linear Stacking (FWLS), we aimed to improve the overall predictive performance and generalization ability. This approach integrates diverse perspectives and captures complementary information from each model, resulting in a more robust and accurate prediction. Applying FWLS during coefficient calculation allowed us to obtain more accurate predictions by incorporating individual model strengths. By considering model weights, we ensured that more accurate and reliable models had a higher impact on the final generation, mitigating biases or inaccuracies introduced by any single model and providing a more robust prediction. Our findings indicate that the coefficients derived from P2GPT allow a refined integration of the two models, leading to enhanced performance with improved generation quality. Despite the advancements achieved by integrating MoPT and CDiffusion models with FWLS, there are certain limitations in the current P2GPT model. Firstly, the complexity of the model poses a potential barrier to replication and broader application. 
The intricacies involved in managing and interpreting the combination of such models may limit their use by those without deep expertise in bioinformatics and access to substantial computational resources. Secondly, the current iteration of P2GPT processes primarily tabular data or bidimensional image data, and could not accommodate the analysis of graphical structures which represent complex biological interactions or pathways at this stage. Future extensions of the model that incorporate graph neural networks could enable the analysis of data represented in graph forms, such as protein-protein interaction networks or gene regulatory networks. Despite these limitations, the synergistic integration of MoPT and CDiffusion models through FWLS has successfully demonstrated an enhanced predictive capability. Our findings underscore the versatility and effectiveness of transformer architectures in handling bioinformatics tasks. However, it is important to acknowledge that the success of our P2GPT is attributed to the generation of relatively large sequence lengths and the design of an effective encoding scheme. Future work can expand the application of our method in other bioinformatics tasks like survival analysis, cross-modality prediction, and generation of omics depending on the disease or drug, thereby broadening the usage of transformer architectures in the field. For instance, beyond aging research, P2GPT could facilitate the analysis of fundamental processes underlying tumor progression, resistance, and metastasis. Additionally, modeling the timing and administration methods of various therapy combinations could provide insights into how tumor cells develop resistance to drugs^[163]49–[164]51. In addition, we envision further refining our hybrid approach by exploring additional generation models and incorporating various omics data types. 
Moreover, we believe that validation of synthetic data through downstream applications and benchmarking against real-world datasets would enhance the utility and robustness of synthetic multi-omics data. Lastly, we anticipate the future integration of P2GPT into clinical settings, enabling invaluable applications such as simulating tissue-specific biological data without invasive biopsies to predict treatment responses, predicting biological changes and disease progression trajectories, and incorporating various clinical parameters to enhance the accuracy for personalized disease monitoring and therapeutic strategies. In summary, we developed Precious2GPT, a generative model capable of producing methylation and expression data, which are invaluable resources for aging research due to the scarcity of longitudinal biological data. Through multiple lines of evidence and validation, we demonstrated the significant potential of Precious2GPT in facilitating aging research. Future work addressing the aforementioned limitations would further strengthen the model’s applicability, accuracy, and comprehensiveness, providing a powerful tool for biological discovery and translational medical research.

Methods

Data sources

In this study, expression and methylation data were obtained for two species, human and mouse. Access to Genotype-Tissue Expression (GTEx) V8-protected data (phs000424) was authorized by the Data Access Committee of NCBI dbGaP. Human transcriptomic data^[165]52 and sample attribute data were downloaded, constituting 12,453 samples. Complementary mouse transcriptomic data was sourced from the ARCHS4 database, V2.2 (12,541 samples). Mouse genes were mapped to their corresponding human orthologs with the use of Human Genome Organisation Gene Nomenclature Committee (HGNC) mappings^[166]53. Both GTEx and ARCHS4 RNA-seq data were procured in the form of raw gene counts.
These datasets underwent log2 transformation, followed by quantile normalization applied to each tissue type separately within the expression datasets. After performing log2 transformation and quantile normalization, we preserved the target distribution so that the same normalization could be applied to novel samples. Human DNA methylation data was aggregated from the Illumina Infinium HumanMethylation450 BeadChip array datasets, retrieved from the China National Center for Bioinformation’s (CNCB) data repository (8,285 samples)^[167]54. Methylation beta values were mapped to genomic features based on the HumanMethylation450 v1.2 Manifest File. In detail, we intentionally focused our attention on the CpGs located exclusively within the TSS200 region, as these were interpreted as the most relevant to age prediction. The TSS200 region, defined as the area comprising 200 base pairs upstream of the transcription initiation site, is documented as crucial for gene regulation processes. Consequently, the beta values of the CpGs situated within a gene’s TSS200 were averaged for downstream analysis.

Preprocessing methods

For image construction, the DeepInsight technique, with the application of convolutional neural networks (CNNs)^[168]55,[169]56 and Kohonen’s self-organizing maps (SOMs)^[170]57,[171]58, was used to transform non-image data into image-like representations in the CGAN and CDiffusion models. To accelerate the training and inference of computationally heavy models, a CycleGAN^[172]59,[173]60 was employed when generating synthetic data with the CDiffusion, MoPT and Precious2GPT models. In brief, generation methods work either with text or with images. We used DeepInsight to construct images for the CGAN and CDiffusion models and in the CDiffusion part of the Precious2GPT model. To allow individual genes to be compared pixel by pixel across images, SOM was used instead of the t-SNE, UMAP and PCA algorithms.
For each data set, we built a separate SOM of different dimensions to minimize unused space in the square image, and each picture was colored by expression or methylation, along with the training set in different colors. DeepInsight CNNs were used to automatically extract features from spatially coherent pixels, detecting higher-order statistics and non-linear correlations, and to provide promising performance in learning complex patterns and relationships in the data. To improve the efficiency of CNNs, one-dimensional (1D) biological data was transformed into two-dimensional (2D) representations. DeepInsight is a methodology designed to transform non-image data into image-like representations, allowing convolutional neural networks (CNNs) to be applied more effectively. It serves as the basis for the DeepInsight-3D model, which extends this approach to multi-domain tabular datasets. The DeepInsight pipeline consists of the following steps (Supplementary Fig. [174]7):

Data normalization: The input data is normalized to ensure that all features have the same scale. This is typically achieved by applying min-max scaling, z-score normalization, or other suitable normalization techniques.

Dimensionality reduction: The high-dimensional input data is transformed into a lower-dimensional representation. This can be done using dimensionality reduction techniques such as t-SNE, UMAP, or PCA. The resulting lower-dimensional data retains the most important information from the original data while reducing noise and computational complexity.

Image generation: The lower-dimensional data is then converted into a 2D image-like representation. This is achieved by mapping each data point to a pixel in the image, with the pixel intensity representing the value of the corresponding feature. The resulting image preserves the spatial relationships between the data points, allowing CNNs to effectively capture local and global patterns in the data.
Convolutional neural network (CNN) training: The generated images are used as input to a CNN, which is trained to perform a specific task, such as classification or regression.

Recently developed techniques such as diffusion models can be used to effectively process such data (Supplementary Fig. [175]7). By transforming non-image data into image-like representations, DeepInsight-like models allow for the efficient application of image-oriented models to a wide range of data types, including biological data.

SOM

Kohonen’s self-organizing maps (SOMs) offer a promising alternative to PCA or UMAP for dimensionality reduction in the context of transforming non-image data into image-like representations. As an unsupervised learning algorithm, SOMs excel at converting high-dimensional data into lower-dimensional representations while preserving the topological structure of the input data. This ability to maintain the spatial relationships between data points makes SOMs particularly well-suited for generating images that can be fed into convolutional neural networks (CNNs). Unlike PCA, which focuses on linear relationships and maximizes variance, or UMAP, which aims to preserve both local and global structure, SOMs employ a competitive learning process that iteratively updates neuron weight vectors to better represent the input data. This results in a 2D grid of neurons that captures complex relationships between variables, potentially leading to more effective feature extraction and improved performance of the CNN. By incorporating Kohonen’s SOMs into the DeepInsight methodology, we can harness the unique advantages of this algorithm to enhance the analysis of non-image data using deep neural networks.

CycleGAN

To speed up training and inference of heavy models (CDiffusion, MoPT and P2GPT), we used the CycleGAN model to extrapolate to all genes during post-processing.
The heavy models were trained and used to generate data, and their outputs were then extrapolated using the CycleGAN. In detail, our domain X consists of data for the 978 landmark genes^[176]59 and domain Y consists of the desired output data for 11,278 genes, which are the intersection across several omics datasets and species types. The set of 978 genes serves as the starting point to generate synthetic output data for the 11,278 genes. A CycleGAN comprises two generators (G & F) and two discriminators (Dx & Dy). Generator G transforms from domain X to Y (G: X → Y), while F performs the reverse mapping, i.e., F: Y → X. Dx aims to distinguish between X and F(Y), whereas Dy discriminates between Y and G(X). The training goes as follows: first, the generator G translates a data sample from domain X into synthetic data of domain Y. Subsequently, the generator F tries to regenerate the original sample from this synthetic data. The objective is to train the CycleGAN to learn the mapping such that the regenerated data closely matches the input data. This is referred to as forward-cycle consistency. A backward cycle consistency is simultaneously processed from domain Y to X, and the whole cycle repeats continuously during learning. The network learns from the inconsistencies between the regenerated data and the original input data to improve its ability to generate synthetic data aligned with the target domain. Importantly, the discriminators Dx and Dy also participate in this training process, aiming to classify whether an instance comes from the actual dataset or was produced by the respective generator. As a result, CycleGAN has the ability to extrapolate the data from 978 genes to realistically simulate data for 11,278 genes even in cases where paired samples are lacking. Finally, this model greatly helped us in generating a large amount of data in a short time with minor losses in quality compared to the full set.
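The forward and backward cycle-consistency terms described above can be illustrated with a minimal sketch. It is not the trained model: the two generators are hypothetical toy functions (`toy_g` maps 2 values to 4, `toy_f` maps 4 back to 2) standing in for the 978-to-11,278 gene mapping, the adversarial terms contributed by Dx and Dy are omitted, and the L1 distance is assumed as in the original CycleGAN formulation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Generator = Callable[[Sequence[float]], Vector]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(
    g: Generator,
    f: Generator,
    x_batch: Sequence[Sequence[float]],
    y_batch: Sequence[Sequence[float]],
) -> float:
    """Forward cycle |F(G(x)) - x| plus backward cycle |G(F(y)) - y|."""
    forward = sum(l1_distance(f(g(x)), x) for x in x_batch) / len(x_batch)
    backward = sum(l1_distance(g(f(y)), y) for y in y_batch) / len(y_batch)
    return forward + backward


def toy_g(x: Sequence[float]) -> Vector:
    """Hypothetical X -> Y generator: duplicates every input value."""
    return [v for v in x for _ in range(2)]


def toy_f(y: Sequence[float]) -> Vector:
    """Hypothetical Y -> X generator: averages consecutive pairs."""
    return [(y[i] + y[i + 1]) / 2 for i in range(0, len(y), 2)]


# The forward cycle is exact for these toy generators; the backward cycle is not,
# because averaging pairs discards information that duplication cannot restore.
loss = cycle_consistency_loss(toy_g, toy_f, [[1.0, 2.0]], [[1.0, 3.0, 5.0, 5.0]])
print(loss)
```

In an actual CycleGAN this term would be added to the adversarial losses and minimized with respect to the parameters of both generators.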
In the production model we will of course eventually use the full data set, but for some experiments this is not necessary.

Generation methods

Mathematical formulation of conditional generation task

In the context of conditionality, we aim to develop models that can generate data instances conditioned on multiple factors: tissue ($T$), age ($A$), species ($S$), and omics types ($D$). We represent the generated data as $X$, and the conditions as a tuple $C = (T, A, S, D)$. The conditional generation task is defined given a set of training data (D): $D = \{(X_i, C_i)\}_{i=1}^{N}$, where $X_i$ represents the observed data instances and $C_i$ represents the corresponding conditions, in order to learn a conditional generative model $G$ that can sample data instances $X$ conditioned on arbitrary conditions $C$. The training objective of this model is to estimate the conditional probability distribution $P(X \mid C)$, such that $G(X \mid C) \approx P(X \mid C)$, where $G(X \mid C)$ represents the data generated by our model.

CGAN

To evaluate the performance of Precious2GPT, a Conditional Generative Adversarial Network (CGAN) was used as a positive control in the validation experiments. Generative adversarial networks are a more classical approach, easier to train and faster at inference, and therefore serve as a baseline for the other models. In some situations they have been observed to perform well without relying on complex patterns, in particular when the generation task involves a single condition and the age condition does not need to be taken into account. This generative model was trained to generate synthetic data using two networks, the generator $G$ and the discriminator $D$.
In CGAN, the generator $G$ was trained to produce data samples that are indistinguishable from real data by a discriminator $D$; the generator took the conditions $C$ as input and generated data $X$ (Supplementary Fig. [177]8). In the context of multi-omics data integration, CGANs were employed to generate realistic images corresponding to expression or methylation data with additional conditions: tissue type, age, omics type, and species.

CDiffusion

Diffusion models were employed to estimate the likelihood of the generated data $X$. The model was trained to sample data through a diffusion process conditioned on $C$, and the likelihood of data was maximized throughout the learning process. A PyTorch implementation of a conditional diffusion model published on GitHub^[180]61 (available at [181]https://github.com/tcapelle/Diffusion-Models-pytorch/tree/main) served as the basis of the conditional diffusion (CDiffusion) model, and PyTorch’s embedding was applied to construct the conditionality on categorical features in this model. At its core, this implementation involves a U-Net block that has self-attention layers between the downsampling and upsampling layers. Standard diffusion models incorporate temporal information in a tensor (later referred to as the time step) that controls the noising/denoising process based on the current step by being embedded into every downsampling/upsampling layer of the U-Net block. This time step is initialized by positionally encoding the previous step’s tensor through a sinusoidal positional embedding (Supplementary Fig. [182]9).
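As an illustration of the sinusoidal positional embedding mentioned above, the following minimal sketch encodes a scalar diffusion step into a fixed-size vector. It follows the standard Transformer-style formulation (sines in the first half, cosines in the second half) and is only assumed to resemble the encoding used in the referenced implementation; the function name `sinusoidal_time_embedding` and the base of 10,000 are illustrative choices.

```python
import math
from typing import List


def sinusoidal_time_embedding(t: float, dim: int, base: float = 10000.0) -> List[float]:
    """Encode a scalar diffusion step t as a vector of length dim.

    Each of the dim // 2 frequencies contributes one sine and one cosine
    component, so nearby steps receive similar but distinguishable vectors.
    """
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even integer")
    # Geometrically spaced inverse frequencies, from 1 down towards 1 / base.
    inv_freq = [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]
    sines = [math.sin(t * w) for w in inv_freq]
    cosines = [math.cos(t * w) for w in inv_freq]
    return sines + cosines


# Step 0 maps to all-zero sines and all-one cosines.
print(sinusoidal_time_embedding(0.0, 4))
```

Because every component is bounded in [-1, 1], the resulting vector has a fixed scale regardless of the step index.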
The diffusion model is conditioned on categorical features of sets of genes by adding the conditions’ embeddings into the current time step. PyTorch’s nn.Embedding is used as a learnable embedding layer that stores embeddings mapping each class of a categorical condition (mapped to unique integers) into a tensor with the time step’s shape. However, such an embedding layer is not applicable to continuous features, such as age, so ages are fused into the time step by first passing through a small network (three linear layers with ReLU activation functions) that transforms a single floating-point value (age) into the time step’s dimensions. Finally, the ages’ embeddings are similarly added to the time step. While the outlined approach works as is for single conditions, including multiple conditions leads to the optimizer tilting its attention towards the condition with higher embedded values. This is solved by normalizing all the conditions’ embeddings (categorical/continuous) before adding them to the time step.

MoPT

To process large numbers of genes and omics values, the memory-efficient transformer architecture MPT^[183]14 was utilized in the construction of Precious2GPT. It incorporates a modified architecture inspired by GPT-2^[184]6, where the positional embeddings are replaced with Attention with Linear Biases (ALiBi). This modification enhances extrapolation capabilities while requiring fewer GPU memory resources during model training. Following the retraining process, our model underwent biological adaptation to multi-omics data, ultimately presenting as the Multi-omics Pretrained Transformer (MoPT).

Model setup and training procedure

We prepared a tokenizer whose vocabulary consists of all genes present in the datasets, all 2-digit values, and tokens referring to age, tissues and species. We utilized the “mosaicml/mpt-7b” configuration from HuggingFace^[185]14.
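To make the tokenizer description more concrete, the sketch below shows one possible way of serializing a sample as pseudo-text, with one token per gene followed by one token for its value. The details are assumptions made for illustration only: the min-max binning of continuous values into 100 two-digit tokens, the helper names (`value_to_token`, `encode_sample`, `build_vocabulary`) and the ordering of tokens are hypothetical and not taken from the actual implementation.

```python
from typing import Dict, List, Sequence


def value_to_token(value: float, lo: float, hi: float) -> str:
    """Clip value to [lo, hi], bin it into 100 bins, and render a 2-digit token."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    fraction = (min(max(value, lo), hi) - lo) / (hi - lo)
    return f"{min(int(fraction * 100), 99):02d}"


def encode_sample(
    conditions: Sequence[str],
    values: Dict[str, float],
    lo: float,
    hi: float,
) -> List[str]:
    """Condition tokens first, then alternating gene and value tokens."""
    tokens = list(conditions)
    for gene, value in values.items():
        tokens.append(gene)
        tokens.append(value_to_token(value, lo, hi))
    return tokens


def build_vocabulary(genes: Sequence[str], condition_tokens: Sequence[str]) -> Dict[str, int]:
    """Assign an integer id to every condition, gene, and 2-digit value token."""
    ordered = list(condition_tokens) + list(genes) + [f"{i:02d}" for i in range(100)]
    return {token: idx for idx, token in enumerate(dict.fromkeys(ordered))}


# Hypothetical sample with two genes and values already scaled to [0, 1].
conditions = ["SPECIES", "Mouse", "dataset", "EXPRESSION", "TISSUE", "Brain"]
print(encode_sample(conditions, {"AKT1": 0.5, "CDK2": 0.07}, lo=0.0, hi=1.0))
```

Under this scheme the vocabulary stays small (genes, conditions and 100 value tokens), while the sequence length grows to roughly twice the number of genes.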
To properly set the number of parameters we considered the Chinchilla scaling law^[186]62, which proposes that the number of model parameters should be proportional to the number of tokens in the training corpus at a ratio of 1:20. Applying this law to all three datasets, we obtained the following model sizes: 4.1, 1.7 and 1 million parameters for the multi-omics, expression and methylation datasets, respectively. Learning curves for the different datasets are shown in Supplementary Fig. [187]10. For each dataset the evaluation set consisted of 1000 samples uniformly distributed across all tissues.

MoPT generation procedure

To generate new omics samples we pass to the model the desired generation conditions, such as age, tissue, species and omics data type, in the form of a plain string with spaces between conditions, e.g.: “SPECIES Mouse dataset EXPRESSION TISSUE Brain MouseAGE84 ”. We utilized top-k together with top-p sampling, where k = 40, p = 0.9 and temperature = 0.8.

Validation experiments

Tissue classification

To evaluate the quality of generated omics samples, a Logistic Regression model was used for tissue classification tasks. The evaluation was based on the f1-score^[188]63, weighted by classes for both real and generated data, as the key metric for determining the reliability of the generated data. For each model, we computed classification metrics twice. First, we generated synthetic samples in a 1:1 ratio with the original data, and the metrics were calculated on these samples, where we compared the real label with the one predicted by the classifier (Supplementary Fig. [189]11). We subsequently examined the performance of the classifier on labels generated uniformly over the total number of tissues to evaluate its effectiveness in handling unbalanced classes. In the case of multiple conditions, we additionally generated age between the minimum and maximum values present in our data, or other label types of dataset or species within each tissue.
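The class-weighted f1-score used as the key metric above can be written out explicitly. The following is a minimal sketch of the metric only, not of the classifier: the function name `weighted_f1` and the tissue labels are illustrative, and the definition is intended to mirror the commonly used support-weighted average of per-class F1 scores, with a class that has no true positives scoring 0.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights equal to each class's support in y_true."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Hypothetical tissue labels: one brain sample is misclassified as liver.
real = ["brain", "brain", "liver", "liver"]
predicted = ["brain", "liver", "liver", "liver"]
print(weighted_f1(real, predicted))
```

Weighting by support means that frequent tissues dominate the score, which is why evaluating on uniformly generated labels gives a complementary view of performance on rare tissues.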
In addition to the aforementioned metrics, UMAP^[190]64 representations were used to visualize both synthesized and real tissue data in order to identify disparities or similarities between the two distributions.

Age regression

To predict the age of generated data, a CatboostRegressor^[191]65 model was applied solely based on gene omics values. The training dataset was composed of real samples paired with their respective age values as the target variable, while the synthesized samples were used as the testing data, so that the predicted age values could be compared with the age used as the conditioning variable. The performance of each model was reported as mean absolute error (MAE)^[192]63 and R-squared (R^2) metrics.

Differential methylation analysis

To examine the sample homogeneity of real and generated data, we performed several statistical tests focusing on human methylation in multiple tissues. First, the nonparametric Mann–Whitney^[193]66 statistical test was applied to methylation data from different age groups generated by the CDiffusion, MoPT and Precious2GPT models. To evaluate the ability of models to preserve the differentially methylated genes of distinct age groups, we identified differentially methylated genes between the samples obtained from 80 ± 20 vs 30 ± 20 years old individuals, in both real and generated data. We then calculated the rate of intersection as the number of differentially methylated genes shared by the two sets divided by the number of differentially methylated genes in the real data. Differential methylation analysis was also performed between the 80 ± 20 years old (real data) and 80 ± 20 years old (generated data) to assess the similarity between the real and generated data (Supplementary Fig. [194]12). To control for multiple testing, the Benjamini–Hochberg^[195]67 correction was applied.
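Two of the quantities described above, the Benjamini–Hochberg adjustment and the rate of intersection, are simple enough to sketch directly. The example below is illustrative only: the function names are hypothetical, the p-values and gene sets are made up, and the per-gene Mann–Whitney tests that would produce the p-values are assumed to have been run beforehand.

```python
from typing import List, Sequence, Set


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def intersection_rate(real_genes: Set[str], generated_genes: Set[str]) -> float:
    """Shared differentially methylated genes divided by those found in real data."""
    if not real_genes:
        raise ValueError("real_genes must not be empty")
    return len(real_genes & generated_genes) / len(real_genes)


# Hypothetical p-values for four genes and hypothetical significant gene sets.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
print(intersection_rate({"A", "B", "C", "D"}, {"B", "C", "E"}))
```

Note that the rate is normalized by the real-data set only, so genes that are significant solely in the generated data do not lower it.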
Out-of-scope experiment

To validate our Precious2GPT model for age prediction, we conducted several out-of-scope experiments involving the training of models with methylation data at different age thresholds.

1. Two models were trained for this purpose. One was trained with samples up to thresholds of 50 and 80 years old, and the other one was trained with the entire sample.
2. Using the model trained with distinct thresholds, we generated data for (threshold +20, threshold +40) years old, and compared the clusters created by the generated data with those of real data on PCA^[196]68 representations.
3. We trained a model with available data from 100 samples per tissue for individuals aged between 120 and 150 years old, and generated the differential methylation data for pathway analysis to predict aging-related alteration.

The pathway analysis was conducted using the Python package gseapy^[197]69, with the KEGG-human database^[198]70 and the 12 HALLMARKS lists serving as the enriched pathways for the differentially methylated genes.

Case study experiment

In the case study experiment focusing on colorectal cancer (CRC), we utilized Precious2GPT to generate synthetic controls for CRC cell lines, namely Caco2, Lovo, SW1417, NCI-716, RKO, HCT-8, SW480 and SK-CO-1, obtained from our internal laboratory, the Robotic Lab. We fed the gene expression data of the eight CRC cell lines as input to facilitate the generation of the respective synthetic controls using the pre-trained Precious2GPT model. For the generated control samples, gene expression data was normalized and uploaded to our PandaOmics platform. Within the platform, individual comparisons (case vs. control) were established for each cell line. These eight comparisons were then incorporated into a meta-analysis, and the results for CRC landmark genes (obtained from the LINCS L1000 project) were generated through the TargetID panel and Knowledge graph.
To evaluate the quality of control samples generated by Precious2GPT, we compared the AI-driven target prioritization results between the generated CRC data and the pre-calculated results using real data available on PandaOmics.

Supplementary information

[199]Supplementary information (5.2MB, pdf)

Acknowledgements