Abstract Objective Age-related macular degeneration (AMD) is a significant cause of blindness, initially characterized by the accumulation of sub-Retinal pigment epithelium (RPE) deposits, leading to progressive retinal degeneration and, eventually, irreversible vision loss. This study aimed to elucidate the differential expression of transcriptomic information in AMD and normal human RPE choroidal donor eyes and to investigate whether it could be used as a biomarker for AMD. Methods RPE choroidal tissue samples (46 Normal samples, 38 AMD samples) were obtained from the GEO ([25]GSE29801) database and screened for differentially expressed genes in normal and AMD patients using GEO2R and R to compare the degree of enrichment of differentially expressed genes in the GO, KEGG pathway. Firstly, we used machine learning models (LASSO, SVM algorithm) to screen disease signature genes and compare the differences between these signature genes in GSVA and immune cell infiltration. Secondly, we also performed a cluster analysis to classify AMD patients. We selected the best classification by weighted gene co-expression network analysis (WGCNA) to screen the key modules and modular genes with the strongest association with AMD. Based on the module genes, four machine models, RF, SVM, XGB, and GLM, were constructed to screen the predictive genes and further construct the AMD clinical prediction model. The accuracy of the column line graphs was evaluated using decision and calibration curves. Results Firstly, we identified 15 disease signature genes by lasso and SVM algorithms, which were associated with abnormal glucose metabolism and immune cell infiltration. Secondly, we identified 52 modular signature genes by WGCNA analysis. We found that SVM was the optimal machine learning model for AMD and constructed a clinical prediction model for AMD consisting of 5 predictive genes. Conclusion We constructed a disease signature genome model and an AMD clinical prediction model by LASSO, WGCNA, and four machine models. The disease signature genes are of great reference significance for AMD etiology research. At the same time, the AMD clinical prediction model provides a reference for early clinical detection of AMD and even becomes a future census tool. In conclusion, our discovery of disease signature genes and AMD clinical prediction models may become promising new targets for the targeted treatment of AMD. Keywords: Age-related macular degeneration, Weighted gene Co-expression network analysis, Machine learning, Immune cell infiltration 1. Introduction AMD is the leading cause of severe vision loss in people under 55 years of age in developed countries [[26]1], accounting for 6–9% of legal blindness worldwide[[27]2,[28]3]. AMD affects the complex of photoreceptors, retinal pigment epithelium (RPE), Bruch's membrane (BrM) and choroid. AMD is characterized by the accumulation of drusen, leading to progressive degeneration of photoreceptors and RPE, resulting in the loss of central vision. Progressive central vision loss due to AMD affects the patient's quality of life and psychosocial well-being and imposes a significant economic burden on the patient [[29][4], [30][5], [31][6], [32][7], [33][8]]. With the increasing number of older people, the high prevalence of AMD has become a serious public problem in China. AMD is considered a multifactorial disease, with risk factors including genetic factors, aging, smoking, light exposure, and abnormal nutritional intake [[34]9]. AMD has been divided into two main types: dry AMD and wet AMD [[35]10]. Wet AMD is diagnosed when neovascularization is detected. The diagnosis of AMD is currently determined by clinical symptoms, fundus photography, optical coherence tomography, and fluorescein fundus angiography [[36]11]. Since neither severe visual problems nor discomfort are present in cases of dry AMD, it is not easy to screen for early cases in the population. So far, no effective prevention methods have been identified from early to wet AMD. the clinical situation of AMD highlights the need to develop potential biomarkers to predict the incidence of AMD. The development of AMD involves multiple pathophysiological mechanisms, all of which focus on RPE dysfunction and degeneration. Therefore, we obtained samples from the GEO ([37]GSE29801) database ([38]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE29801) and analyzed normal and AMD transcriptome information to find predictive genes and models, The analysis process of this study is shown in the flow diagram 1and 2. 2. Methods and materials 2.1. Materials The transcriptome information of RPE choroidal tissue samples came from the GEO ([39]GSE29801) database, containing 46 normal and 38 AMD samples. Gene ID conversion was done by DAVID Online Tools ([40]https://david.ncifcrf.gov/), and gene expression levels were taken as mean values if there were duplicate genes. The gene expression was normalized by taking log2 for gene expression. Finally, 19289 genes and 84 samples were obtained as expression matrices. 2.2. Screening for differential genes Extramacular RPE-choroid, AMD (Ex RPE-AMD) was the Treat group; Extramacular RPE-choroid, normal (Ex RPE-Normal) was the Control group. Two methods were used to obtain differential genes. In the first one, based on GEO2R online line difference analysis, the endpoint was set as adjust. P. value < 0.05. In the second one, using library(limma)R package line difference analysis, the endpoint was set as P. value < 0.05. The different genes obtained by the two methods were intersected to be the final difference genes. 2.3. Differential gene pathway enrichment analysis To better understand the biological significance of common differential genes, we used “clusterProfiler,” “org.Hs.eg.db”, “enrichplot, “ggplot2″ R package to perform GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway enrichment analysis. 2.4. Disease signature gene screening 2.4.1. LASSO (least absolute shrinkage and selection operator) regression analysis The LASSO regression model was constructed using the “glmnet” R package, and the number of genes corresponding to the most minor cross-validation point was the optimal number of trait genes. 2.4.2. SVM (support vector machine) algorithm The SVM model is constructed using the “e1071″, “kernlab,” and “caret” R packages, and the number of genes corresponding to the most minor cross-validation point is The number of genes corresponding to the most minor cross-validation point is the number of optimal feature genes. The number of genes corresponding to the most minor cross-validation point is the optimal number. 2.5. Accuracy validation of disease signature genes and pathway enrichment analysis ROC (receiver operating characteristic) curves were constructed using the “glmnet” and “pROC” R packages. The larger the area under the ROC curve, the higher the accuracy of the gene or model for disease diagnosis. In order to further clarify the biological functions involved in the disease signature genes, GSVA(Gene set variation analysis) pathway enrichment analysis was performed using the “GSEABase” and “GSVA” R packages. pvalue<0.05 was considered statistically significant. 2.6. Immune cell infiltration The samples were analyzed for immune cell infiltration using the ‘preprocessCore,’ ‘parallel,’ and ‘e1071’ R packages. Immune cells differentially expressed in the Treat and control groups were screened using the Wilcox test and the ‘vioplot’ R package. The cut-off point was set at Pvalue<0.05. The samples were analyzed by using “limma,” “reshape2″, “tidyverse,” “ggplot2″, “tidyverse,” and “ggplot2”. “ggplot2″ R package was used to clarify the relationship between disease signature genes and immune cell infiltration. 2.7. Clustering analysis and validation Cluster analysis was performed using ConsensusClusterPlus, coalescent km clusters with 1-Pearson correlation distances, and resampling 80% of the samples 10 times with k values ranging from 2 to 9. The optimal number of fractions was determined using empirical cumulative distribution function plots. PCA analysis was performed using the “limma” and “ggplot2″ R packages to detect clustering effects. 2.8. Immune cell infiltration The immune cell distribution of the samples was calculated by the CIBERSORT algorithm using the R software to obtain the proportional fraction of immune cells for each sample in the data set. The “limma” R package was used to compare the immune cell infiltration for differences. 2.9. Cluster pathway enrichment analysis In order to further clarify whether there are differences in the biological functions of different fractal clusters, “reshape2″, “ggpubr,” “limma,” “GSEABase,” and “GSVA” R packages were used for GSVA pathway enrichment analysis. “GSEABase” and “GSVA” R packages were used for GSVA pathway enrichment analysis. 2.10. Cluster-wise WGCNA analysis WGCNA is an R package that constructs gene co-expression networks from many genes and identifies co-expression modules [[41]12]. Firstly, we used the “WGCNA” package in R to cluster the top 25% of the most fluctuating genes in the samples to assess whether there are significant outliers. Secondly, we identified gene co-expression modules based on the topological overlap, detected the modules using hierarchical clustering and dynamic tree-cutting functions, and calculated gene significance (GS) and module affiliation (MM). Modules were linked to clinical traits (sub-clusters). Correlations between these modules are calculated, and heat maps are executed to show the independence between these modules. Finally, we visualize the trait gene network. 2.11. Machine learning model construction Using the above-obtained module genes, sample [42]GSE29801 transcriptome expression levels and “caret,” “DALEX,” “ggplot2″, “randomForest,” “kernlab,” “xgboost,” “pROC “R package to construct RF random forest tree model, SVM machine learning model, XGB model, and GLM model. The samples were divided into Treat and Test groups, with the Treat group accounting for 70% of the total samples for constructing the models and the Test group accounting for 30% of the total samples for verifying the accuracy of the models. The genes obtained from the four methods were analyzed for importance, and the top 5 genes with the highest importance scores were output as the predicted genes. 2.12. Construction of nomo plots To predict the incidence of AMD, we constructed column line graphs using the importance genes obtained above and the “rms” and “rmda” R packages and evaluated the accuracy of the column line graphs using decision curves and calibration curves. 2.13. Methods Online analysis of variance using GEO2R rows, adjpvalue<0.05, was considered statistically significant. Using R4.21 row-wise analysis of variance and co-expression analysis, pvalue<0.05 was considered statistically significant. 3. Results 3.1. Screening of differential genes 52 genes differentially expressed in AMD and normal samples were obtained using GEO2R online analysis ([43]Fig. 1 A). The AMD and normal samples were evenly distributed in Uniform Manifold Approximation and Projection(UMAP) ([44]Fig. 1B) with good consistency in gene expression levels ([45]Fig. 1C), indicating that the differential genes obtained using GEO2R online analysis were reliable. Differential analysis using the limma R package line yielded 2048 genes ([46]Fig. 1 D), of which 1275 genes were down-regulated, and 773 genes were up-regulated in AMD (Supplementary file 1). A total of 50 differential genes were obtained by taking the intersection of the two methods ([47]Fig. 1E) (Supplementary file 2). Fig. 2. [48]Fig. 2 [49]Open in a new tab (A) The GO function's actual name appears on the left vertical axis. , the right vertical axis is the taxonomic name of the function, and the horizontal axis is the number of differential genes released to the corresponding GO function. (B) Circle diagram of GO enrichment analysis. (C) KEGG enrichment analysis, the vertical axis is the specific name of the KEGG function, and the horizontal axis is the number of differential genes released to the corresponding function. Fig. 1. [50]Fig. 1 [51]Open in a new tab Differential expressed genes analysis (A) Volcano plot of 52 differential genes. (B) Distribution of AMD and normal samples in [52]GSE29801 in UMAP. (C) Gene expression levels of AMD and normal samples in [53]GSE29801. (D) Heatmap of 2048 differential genes. (E) The two methods obtained the intersection of differential genes. 3.2. Differential gene GO and KEGG enrichment analysis The results of GO enrichment analysis showed that the pathways enriched in Biological Process were lipoprotein metabolic process, mRNA processing, RNA splicing, lipoprotein biosynthetic process; The pathways enriched on Cell Component were transferase complex, transferring phosphorus-containing groups, spliceosomal complex Molecular Function, the enriched pathways are palmitoyltransferase activity, S-acyltransferase activity, protein-cysteine S-acyltransferase activity, S-acyltransferase activity, and S-acyltransferase activity; The enriched pathways on the Molecular Function were palmitoyltransferase activity, S-acyltransferase activity, protein-cysteine S-acyltransferase activity, and protein-cysteine S-palmitoyltransferase activity. KEGG enrichment analysis identified the above differential genes that were metabolically active in the Ferroptosis, Mucin type O-glycan biosynthesis, Circadian rhythm, and and SNARE interactions in vesicular transport pathways. 3.3. Disease signature gene screening and validation LASSO obtained the minimum value of Log(λ) corresponding to the number of genes of 16 ([54]Fig. 3A) (Supplementary file 3), and a vertical line was drawn at the value selected by 10-fold cross-validation. As the value of λ decreases, the compression of the model increases, and the selection of essential variables by the model increases ([55]Fig. 3B). The minimum value of SVM corresponds to a number of genes of 40 ([56]Fig. 3C) (Supplementary file 4). The 15 genes were obtained after taking the intersection ([57]Fig. 3D) (Supplementary file 5). The heat map is shown in [58]Fig. 3E. The area under the AUC of these 15 genes was more significant than 0.7 ([59]Fig. 3F), indicating that these genes alone predicted AMD disease with high accuracy. The area under the AUC of the model was 0.989, with a 95% fluctuation range of 0.967–1 ([60]Fig. 3G), indicating the excellent prediction performance of this model. Figure 3. [61]Figure 3 [62]Open in a new tab (A) Results of cross-validation. The values in the middle of the two dashed lines are the range of positive and negative standard deviations of log(λ). The dashed line on the left indicates the value of the harmonic parameter log(λ) when the model error is minimal. (B). Distribution of LASSO coefficients for 30 variables. (C). The performance of the feature subset selected by SVM on the dataset and the value of the horizontal coordinate corresponding to the lowest point of the curve indicates the optimal number of genes. (D) ROC curve of 15 genes, the larger area under the curve, indicates the better prediction performance of this gene. (E) ROC curves of the 15-gene model. The larger the area under the curve, the better the prediction performance of this gene. 3.4. Disease signature gene pathway enrichment analysis GSVA enrichment analysis showed that 5 of the 15 differential genes were closely associated with the glucose metabolism pathway, such as C12orf76 was upregulated in the other glycan degradation pathway and downregulated in the Maturity onset diabetes of the young pathway ([63]Fig. 4A). LUC7L3 was upregulated in the type I diabetes mellitus pathway ([64]Fig. 4B). LOC202025 was downregulated in the taurine and hypotaurine metabolism pathway ([65]Fig. 4C)·NPIPB6 was downregulated in the taurine and hypotaurine metabolism pathway ([66]Fig. 4D). ZDHHC11 was downregulated in the other glycan degradation pathway ([67]Fig. 4E). CHRAC1 ([68]Fig. 4F) and EIF4A2 ([69]Fig. 4G) are associated with the fatty acid metabolic pathway. Fig. 4. [70]Fig. 4 [71]Fig. 4 [72]Fig. 4 [73]Fig. 4 [74]Open in a new tab GSVA enrichment of 7 differential genes. c12orf76 (A), LUC7L3 (B), LOC202025 (C), NPIPB6 (D), ZDHHC11 (E), CHRAC1(F), EIF4A2(G). FAM156A (Supplementary F. 4A), MIAT (Supplementary F. 4B), RUNX1T1 (Supplementary F. 4C), WDR27 (Supplementary F. 4D) and ZDHHC24 (Supplementary F. 4E) were associated with Pantothenate and coa biosynthe, cytokine receptor interaction and limonene and pinene degradation pathways. PTBP2 (Supplementary F. 4F), HNRNPH1 (Supplementary F. 4G), and POFUT2 (Supplementary F. 4H) were not found to be significantly enriched in the biological pathway. 3.5. Immune cell infiltration Immune cell infiltration was obtained for each sample using the R package (Supplementary file 6). Immune cell Neutrophils were highly expressed in the control group (normal samples) with pvalue = 0.0497 ([75]Fig. 5A), and the difference was statistically significant. 15 disease signature genes were strongly associated with immune cell infiltration, such as LUC7L3 positively correlated with M0 macrophages (p < 0.05), HNRNPH1 negatively correlated with M2 macrophages (p < 0.05), RUNX1T1 and WDR27 were positively correlated with NK cells activated (P < 0.05) ([76]Fig. 5B). Fig. 5. [77]Fig. 5 [78]Open in a new tab (A) Immune cell infiltration in control and Treat groups; (B) Co-expression of 15 disease signature genes with immune cells. 3.6. Cluster analysis and detection After a comprehensive evaluation, the number of clusters with the highest average consistency within the group was selected as K = 2 ([79]Fig. 6A and B). [80]Fig. 6C and D shows that CHADL, HNRNPH1, SCGB1D2, SLC39A8, and ST6GALNAC2 were highly expressed in C2, and PCA analysis showed that this cluster could well distinguish C1 from C2 ([81]Fig. 6E). Fig. 6. [82]Fig. 6 [83]Open in a new tab (A) Different colors in the horizontal axis indicate different clustering, and other colors within the color indicate the presence of impurities, the more impurities contained, the worse the clustering effect, and the vertical axis indicates different K values. (B) The clustering situation when K = 2. (C–D) Expression of 15 disease signature genes in clusters C1 and C2, heatmap (C), boxplot (D). (E) PCA analysis. 3.7. Inter-cluster immune cell infiltration and gsva pathway enrichment analysis In [84]Fig. 7A, we obtained little difference in immune cell infiltration between C1 and C2 clusters. From [85]Fig. 7B, we obtained that the immune cells with higher enrichment in C2 are T cells CD4 Memoryactivated and B cells naive. The immune cells with higher enrichment in C1 are Monocytes and Neutrophils. GSVA analysis showed C2 clusters in the NEUROTROPHIN signaling pathway, natural killer cell-mediated cytotoxicity, regulation of actin cytoskeleton, autoimmune thyroid disease, tubule-like receptor signaling pathway, FC gamma R-mediated phagocytosis, leishmaniasis infection point-like receptor signaling pathway, chronic myeloid leukemia, viral myocarditis, systemic lupus erythematosus enriched in the lupus erythematosus pathway. The C1 cluster was enriched in the peroxisome, lysine degradation, β-alanine metabolism, oxidative phosphorylation, valine leucine, and isoleucine degradation, terpene skeleton biosynthesis, aspartate and glutamate metabolism, nitrogen metabolism, unsaturated fatty acid biosynthesis, and butyric acid metabolic pathways ([86]Fig. 7c). Fig. 7. [87]Fig. 7 [88]Open in a new tab (A) Infiltration of immune cells in C1 and C2 clusters. (B) Differential analysis of immune cell infiltration in C1 and C2 clusters. (C) GSVA pathway enrichment analysis of immune cells in C1 and C2 clusters. 3.8. Weighted gene co-expression network A total of 6 different gene co-expression modules were generated by WGCNA raw letter analysis. Correlations between each module and clinical features (C1 and C2) were tested. The signature gene dendrogram and heat map indicated that the MEturquoise modules were highly correlated with C1 and C2 ([89]Fig. 8A–C). The results of the eigengene dendrogram and heat map plots showed that the MEturquoise module was positively correlated with C2 and negatively correlated with C1 ([90]Fig. 8D). Gene salience and module membership were plotted for the MEturquoise module, indicating that the module was significantly positively correlated with C2 ([91]Fig. 8E). Finally, 52 genes for the MEturquoise module were obtained. Fig. 9. [92]Fig. 9 [93]Open in a new tab (A) Boxplots of |residual|, the vertical axis indicates the model type, and the horizontal axis indicates the residual size; The smaller the residual, the higher the model accuracy. (B) Feature Importance. The Vertical axis indicates the model type; the horizontal axis indicates Root mean square error (RMSE) loss after permutations, the smaller the Root mean square error (RMSE) loss after permutations, the higher the model accuracy. The smaller the Root mean square error (RMSE) loss after permutations, the higher the model accuracy. (C) ROC curve. The larger the area under the curve, the higher the model's accuracy. (D) Reverse cumulative distribution of |residual|. The smaller the |residual|, the higher the model's accuracy. (E) Nomo clinical prediction model. (F) Calibration curves. (G) DCA decision curve. Fig. 8. [94]Fig. 8 [95]Open in a new tab Weighted gene co-expression network analysis. (A) Sample clustering dendrogram detecting outliers in C1 and C2 clusters of WGCNA. (B) Clustered dendrogram of genes with topology-based overlap dissimilarity and the specified module colors. (C) Gene co-expression network visualized as a heat map. Lighter colors indicate low co-expression, and darkening red indicates high co-expression. Darker colors on the diagonal lines are modules. (D) Module-trait associations. Each row corresponds to a module, and each column corresponds to a trait. Each cell contains the corresponding correlation and P value. The table is color-coded by correlation according to the color legend. (E) Scatter plot between each gene in the module and the C2 cluster. 3.9. Machine learning model and clinical prediction model construction and validation Through the construction of R, SVM, XGB, and GLM machine learning models, we can obtain that the predictive performance of GLM and SVM models is better in Boxplots of |residual|. ROC curve, Reverse cumulative distribution of |residual| both obtained better predictive performance of RF and SVM models, and the SVM model was selected for the following analysis in a comprehensive consideration. Feature Importance shows 10 genes, and we selected the five genes with the highest scores for constructing the nomo plot of the clinical prediction model. From the nomo plot, we can see those different gene expressions correspond to different points, and the sum of 5 gene expressions corresponding to different points in Total points, and Total different points correspond to different Risks of Disease. The Risk of Disease is the probability that we presume an individual patient has AMD. The calibration curve shows that the solid and dashed lines are very close to each other, which proves that the model is more accurate. The red curve of the DCA decision curve is far from the All curve, which proves that the model is more accurate. In conclusion, the accuracy of our constructed model is high. 4. Discussion AMD is a major cause of blindness, beginning with the formation of the outer blood retinal barrier (oBRB) by the retinal pigment epithelium (RPE), Bruch's membrane and choriocapillaris. The risk of AMD development remains poorly understood due to the lack of relevant predictive models for the human retinal pigment epithelium (RPE). We obtained 84 RPE choroidal tissue samples (46 normal samples and 38 AMD samples) from the GEO ([96]GSE29801) database. Gene expression in normal and AMD samples was first analyzed online using GEO2R and 52 differentially significant genes were obtained; 2408 genes were also screened using the R package. The genes obtained by these two methods were taken to intersect to obtain 50 differential genes. To clarify which metabolic processes and pathways these differential genes are mainly involved in, we performed GO enrichment analysis and concluded that these differential genes are mainly involved in protein synthesis and RNA editing. To further explore which metabolic pathways these genes are involved in, we performed KEGG pathway enrichment and found that these differential genes were significantly enriched in the Ferroptosis metabolic pathway. Thus, we hypothesized that protein synthesis and RNA editing in the Ferroptosis metabolic pathway might be involved in the development of AMD. The results of several scholarly studies confirm the validity of our research and speculations, for example, Wei, Ting-Ting suggests that AMD induces Ferroptosis in retinal pigment epithelial cells and that inhibition of Ferroptosis may be a potential target for the treatment of AMD [[97]13], Sun, Yun scholars concluded that Glutathione depletion induces ferroptosis in retinal pigment epithelial cells [[98]14]. In conclusion, the differential genes we screened accurately respond to the biological pathways that may affect AMD, and the use of these differential genes is a good theoretical basis for further research. Second, we used two machine learning models to screen for disease signature genes. Sixteen genes were obtained using the LASS0 algorithm and 40 genes were obtained using the SVM algorithm, and the genes obtained by these two algorithms were intersected to obtain 15 genes as disease signature genes. The ROC curves showed that the evaluation of these genes and the predictive performance of the model were good. To further verify whether our screened disease signature genes are representative of the disease, we performed GSVA enrichment analysis and obtained c12orf76 ([99]Fig. 4A), LUC7L3 ([100]Fig. 4B), LOC202025 ([101]Fig. 4C), NPIPB6 ([102]Fig. 4D), ZDHHC11 (Figure These five genes are involved in glucose metabolism, and CHRAC1 ([103]Fig. 4F) and EIF4A2 ([104]Fig. 4G) are involved in fatty acid metabolism. It is well known that diet and nutrition have a strong epidemiological link to the onset and progression of AMD. Rowan, Sheldon scholars have argued that A lower glycemic diet is associated with protection against AMD in humans, and switching from a higher to a lower glycemic diet prevents the AMD phenotype in mice [[105]13]. Pennington, Katie L also showed that lipid metabolism is involved in the development of AMD [[106]14]. The two studies mentioned above confirm the accuracy of our model and that we can regulate the progression of AMD and even cure or avoid it completely by regulating the expression of these genes. In immune cell infiltration analysis, we obtained that several genes are closely related to the level of immune cell infiltration, such as HNRNPH1, which is negatively correlated with M2 macrophages, and related studies have shown that the number of M2 macrophages is increased in AMD [[107]15]. And it has been shown that melatonin attenuates choroidal neovascularization by regulating macrophage/microglia polarization through inhibition of the RhoA/ROCK signaling pathway [[108]16]. This suggests that macrophages play an important role in the development of AMD, and the 15 disease signature genes we obtained may be targets for future immunotherapy of AMD. In conclusion, Ferroptosis, abnormal glucose metabolism, fatty acid metabolism, and the degree of immune cell infiltration may play a vital role in the development of AMD and may be a direction for future treatment. Third, to further investigate the transcriptome information characteristics of AMD samples, we performed cluster clustering analysis based on these 50 differential gene expressions. AMD samples were classified into two clusters, C1 and C2. From the immune cell infiltration, we obtained that C1 was related to oxid ative phosphorylation, while the C2 cluster was related to glycine serine and threonine metabolism and glycosphingolipid biosynthesis ganglio series. The results again validate that the protein synthesis and glucose metabolism we obtained above are involved in the development of AMD. Fourth, based on the transcriptome information of 84 samples obtained from the GEO database, we performed WGCNA to screen the optimal module MEturquoise and the 52 genes it contained, and MEturquoise was positively correlated with C2 clinical features ([109]Fig. 8D). The five highest scoring predictive genes in the SVM model were HGD, KDF1, GJB1, CNIH3 and CHADL, and we used these five predictive genes to construct an AMD clinical prediction model. The ROC and DCA curves showed that this model has accurate predictive ability. The expression levels of these five genes can be measured before the onset of AMD to predict the risk of AMD in individual patients, providing a theoretical basis for treating the disease before it occurs. 5. Conclusion In conclusion, we combined machine learning model, WGCNA and cluster clustering analysis, not only predicted the possible pathogenesis of AMD, but also constructed a clinical prediction model for AMD, which provides a theoretical basis for the treatment of AMD and screening of high-risk groups. This study has some limitations due to the limitations of disease type and database; experimental validation due to the influence of COVID-9. If conditions permit, we will conduct clinical and animal trials to further explore the cause of AMD. We believe that the treatment of AMD will gain further progress caused by our study. Author contribution statement Daoxin Han, Xiaoli He: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. Data availability statement Data will be made available on request. Declaration of competing interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Acknowledgements