Abstract

Identifying complex phenotypes from high-dimensional biological data is challenging due to the intricate interdependencies among different physiological indicators. Traditional approaches often focus on detecting outliers in single variables, overlooking the broader network of interactions that contribute to phenotype emergence. Here, we introduce ODBAE (Outlier Detection using Balanced Autoencoders), a machine learning method designed to uncover both subtle and extreme outliers by capturing latent relationships among multiple physiological parameters. ODBAE's revised loss function enhances its ability to detect two key types of outliers: influential points (IP), which disrupt latent correlations between dimensions, and high leverage points (HLP), which deviate from the norm but go undetected by traditional autoencoder-based methods. Using data from the International Mouse Phenotyping Consortium (IMPC), we show that ODBAE can identify knockout mice with complex, multi-indicator phenotypes that are normal in individual traits but abnormal when considered together. In addition, this method reveals novel metabolism-related genes and uncovers coordinated abnormalities across metabolic indicators. Our results highlight the utility of ODBAE in detecting joint abnormalities and advancing our understanding of homeostatic perturbations in biological systems.

Subject terms: Software, Bioinformatics

ODBAE offers a powerful approach for detecting complex anomalies and characterizing unknown phenotypes within biological systems.

Introduction

Phenotypes, including symptoms, are defined as the observable characteristics or traits of an organism and are the result of complex interactions between genotype and environmental factors^1,2. These traits are often quantified using physiological or pathological indicators that serve as proxies for the underlying biological or pathological processes. Recent advances in phenotypic research, particularly through genetic manipulation, have deepened our understanding of gene function and the mechanisms underlying pathological conditions across a wide range of organisms, including plants and animals^2. Traditionally, gene-related phenotypic analyses have focused on identifying abnormalities in individual physiological indicators^3–5. However, findings from the International Mouse Phenotyping Consortium (IMPC) demonstrate strong correlations between physiological indicators, suggesting that many traits and diseases result not from isolated abnormalities, but from coordinated disruptions across multiple physiological indicators^6. Our current understanding of phenotypes or diseases often emphasizes the detection of outliers in individual physiological indicators, typically those outside the normal physiological range. However, the emergence of disease in an organism is a more complex process. Homeostasis, the dynamic balance maintained by biological systems, can be perturbed at multiple levels before a single indicator deviates outside the normal range^7. This suggests that phenotypic abnormalities may manifest as imbalances between correlated indicators, even when each individual measure remains within its expected range. These imbalances may represent early warning signs of disease or dysfunction that may be missed by traditional univariate analysis.
We hypothesize that the coordinated perturbation of physiological indicators reflects an emerging phenotype or disease state, even when individual measures appear normal. By examining these subtle interdependencies, we can gain insight into how homeostasis is perturbed in knockout mouse models. Furthermore, for genes whose knockout does not result in obvious outliers in individual parameters, disruption of homeostatic balance may still occur through correlated indicators. This approach challenges the notion that the absence of a detectable abnormal phenotype in knockout mice indicates no functional impact, and instead suggests that our focus on single-indicator abnormalities may obscure more complex systemic effects. Existing methods, such as linear regression^6,8 and basic bioinformatics tools^9–11, have provided initial insights into these relationships. However, these techniques struggle to capture non-linear and complex relationships between multiple physiological indicators, especially in high-dimensional biological datasets. Machine learning methods, particularly autoencoders, have emerged as powerful tools for detecting outliers by learning complex, latent patterns within data^12–18. By compressing and reconstructing data, autoencoders can capture subtle variations and relationships that may indicate phenotypic abnormalities across multiple indicators. Their ability to learn from high-dimensional data makes them well suited for analyzing the intricate dynamics involved in phenotypic expression^19,20.

In regression analysis, all outliers can be categorized as influential points (IP) or high leverage points (HLP). IP exert a substantial influence on the model's fitting results, whereas HLP deviate from the center of the dataset but do not necessarily affect the model fit^21. Since the reconstruction of any dataset by an autoencoder can be regarded as a regression process, IP are easily detected when autoencoders are used for outlier detection. However, considerable uncertainty remains about how well HLP are detected. In this study, we introduce Outlier Detection using Balanced Autoencoders (ODBAE), a machine learning method designed to detect outliers by capturing the relationships between physiological indicators. ODBAE improves on traditional autoencoders by balancing the reconstruction error across principal component directions, enhancing its ability to detect both IP, which deviate from the expected relationships between indicators, and HLP, which are far from the data center. We apply ODBAE to developmental and metabolic datasets from IMPC to explore how coordinated perturbations in physiological parameters can reveal previously unidentified phenotypes, even in cases where knockout mice show no abnormalities in individual traits. By using ODBAE, we aim to identify novel gene functions and phenotypic complexities that have been missed by single-indicator screening methods. Collectively, these findings establish ODBAE as a powerful tool for identifying complex anomalies and unknown phenotypes within biological systems.

Results

Outlier detection in ODBAE

ODBAE takes tabular datasets from various sources as input, such as gene knockout mouse datasets from IMPC and patient health examination data. Each row in the data represents a record, and each column represents an attribute (a minimal example is shown below). The goal of ODBAE is to maintain the effectiveness of the autoencoder in detecting IP while further improving the detection performance for HLP.
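As a minimal illustration of this tabular input format (hypothetical values and column names, not data from the paper; assuming pandas):

```python
import pandas as pd

# Hypothetical tabular input: rows are records (mice), columns are
# attributes (physiological parameters). Names are illustrative only.
data = pd.DataFrame(
    {
        "body_weight": [22.1, 24.3, 23.8, 30.9],
        "body_length": [9.1, 9.4, 9.3, 9.2],
        "glucose": [8.2, 7.9, 8.4, 8.1],
        "hdl_cholesterol": [1.9, 2.1, 2.0, 2.2],
    },
    index=["mouse_1", "mouse_2", "mouse_3", "mouse_4"],
)
print(data)
```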
Together, these capabilities make ODBAE a comprehensive approach to outlier detection in high-dimensional tabular datasets. Furthermore, ODBAE also provides an anomaly explanation for each outlier, indicating which parameter or parameters are associated with the anomaly. For gene knockout mouse datasets, ODBAE can identify genes that, when knocked out, lead to anomalies, as well as the abnormal parameter pairs associated with the gene knockout.

Overview of ODBAE

To address the challenge of outlier detection in high-dimensional tabular datasets, we construct an improved autoencoder model (ODBAE) with a refined training loss function. Our model leverages the strong representation learning and feature extraction capabilities of traditional autoencoders for IP, while mitigating their limitations in HLP detection ("Methods"). The core principle underlying ODBAE-based outlier detection is that inliers can be well reconstructed, whereas outliers, including both HLP and IP, may generate significant reconstruction errors. ODBAE performs unsupervised outlier detection on the input high-dimensional datasets. During the training process, ODBAE learns as much intrinsic information as possible from the training set and reconstructs the training dataset by minimizing the loss function. The training dataset selection strategy depends on the initial outlier prevalence within the dataset. For datasets with few outliers, the entire dataset is used for both training and testing, as with data from wild-type mice in IMPC. However, if the overall proportion of outliers is not clear, a subset of the data exhibiting fewer anomalies is chosen for model training; the entire dataset, or another subset, can then be used for outlier detection. Subsequently, the trained model is applied to the test dataset, generating reconstruction errors for all sample points. Finally, sample points with reconstruction errors greater than a predefined threshold are considered outliers. As shown in Fig. 1a, outlier detection from biological datasets by ODBAE includes three steps (sketched below): (1) given a training dataset as input, the intrinsic information of normal data points is learned; (2) the trained model is used to reconstruct the test dataset, and outliers are identified according to large reconstruction errors ("Methods"); (3) the detected outliers, including HLP and IP, are explained using the highest reconstruction errors and kernel-SHAP to obtain the abnormal parameters^22.

Fig. 1. ODBAE is an approach for identifying and explaining outliers in tabular datasets; it can discover complex phenotypes and identify novel genes when applied to knockout mouse datasets. a Process for discovering complex phenotypes. ODBAE first detects outliers based on a balanced autoencoder, which balances the reconstruction of the training dataset to detect both HLP and IP, then performs anomaly explanation to obtain abnormal parameters. Finally, some outliers are identified due to joint anomalies of multiple parameters. b Anomaly explanation of ODBAE. The detected outliers are explained in terms of SHAP values. c For both males and females, abnormal mice are first identified according to the presence of metabolic parameters that deviate too far from the mean, and then the same abnormal proportion is set to identify abnormal mice with ODBAE. Finally, some important genes are obtained only by ODBAE. d Genes with an abnormal proportion of more than 50% of the corresponding single-gene knockout mouse strains are considered important genes by ODBAE. If an important gene has no known metabolic links in the MGI database, it is considered a novel gene and further validated by SNP analysis of human orthologues.
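The three-step workflow above can be sketched as follows. This is a minimal illustration assuming a basic PyTorch autoencoder trained with plain MSE and a percentile-based threshold, not the exact ODBAE implementation, which uses the balanced loss described next:

```python
import numpy as np
import torch
import torch.nn as nn

class AE(nn.Module):
    """A small fully connected autoencoder (illustrative architecture)."""
    def __init__(self, dim: int, hidden: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def detect_outliers(train: np.ndarray, test: np.ndarray,
                    quantile: float = 0.90, epochs: int = 200, lr: float = 1e-3):
    """Step 1: learn normal data; step 2: flag large reconstruction errors."""
    model = AE(train.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    x = torch.as_tensor(train, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(x) - x) ** 2).mean()  # plain MSE for this sketch
        loss.backward()
        opt.step()
    with torch.no_grad():
        t = torch.as_tensor(test, dtype=torch.float32)
        errors = ((model(t) - t) ** 2).sum(dim=1).numpy()
    threshold = np.quantile(errors, quantile)  # predefined threshold
    return errors, errors > threshold          # step 3 explains flagged points
```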
Based on mathematical analysis, ODBAE uses a revised training loss function that adds an appropriate penalty term to the Mean Squared Error (MSE) to balance the reconstruction by properly suppressing complete reconstruction of the autoencoder ("Methods"). Specifically, the penalty term ensures an equal eigenvalue difference between the training and reconstructed datasets along each principal component direction. The incorporated penalty term therefore enhances the detection of HLP, while the MSE term maintains the identification performance for IP. To explain each outlier detected by ODBAE, we first identify the top features that contribute the most to the reconstruction error, and then apply kernel-SHAP to obtain the features that have the greatest impact on them (Fig. 1b and "Methods"). Finally, ODBAE outputs instance-based outliers and their explanations. For outliers in categories, if the anomaly rate for a category of outliers is greater than the set threshold, ODBAE provides an anomaly explanation for each category according to the mean values of each feature.

ODBAE's comprehensive benefits

The power of ODBAE lies in its ability to efficiently detect both HLP and IP in complex, high-dimensional biological datasets. HLP are outliers that deviate from the central distribution, while IP significantly influence model results. To evaluate the performance of ODBAE in detecting IP, we compared it to principal component analysis^23, a common outlier detection method. ODBAE demonstrated superior performance, particularly in identifying IP, due to its enhanced ability to capture nonlinear relationships between data points (Supplementary Fig. 1a–d). We then applied ODBAE to phenotypic data from the IMPC to identify complex phenotypes in knockout mouse models. Using a threshold of the top 2% of absolute z-scores^24 for any given physiological parameter, we flagged mice as potential outliers ("Methods"). We then applied ODBAE with a predefined abnormality ratio (Fig. 1c). If more than 50% of mice from a single-gene knockout strain were classified as outliers, the corresponding gene was identified as significant and further analyzed (Fig. 1d). ODBAE was able to identify key mutant strains with complex phenotypes, even in cases where individual physiological indicators appeared normal. To illustrate, we analyzed eight developmental parameters: body length (BL), body weight (BW), bone area (BA), bone mineral density (excluding skull), total distance travelled, mean forelimb and hindlimb grip strength, heart rate, and heart weight. Using wild-type mice as the training set and knockout mice as the test set, we evaluated 1904 single-gene knockout mouse strains ("Methods"). Given the prevalence of sexual dimorphism in disease phenotypes^3,25, males and females were analyzed separately (Supplementary Data 1a–d). In the female dataset, ODBAE identified Ckb null mice as outliers despite their individual parameter values being within the normal range. These mice had normal BL and BW, but their body mass index (BMI) was abnormally low.
Specifically, four of the eight Ckb null mice had extremely low BMI values, calculated as BMI = BW (g)/BL² (cm²)^26–28. Analysis revealed that while BL and BW were within the expected ranges, their relationship was abnormal, leading to the identification of these mice as outliers (Fig. 2a). The average BMI of these Ckb-deficient mice was lower than that of 97.14% of other mice (Fig. 2b), demonstrating the sensitivity of ODBAE to complex, multivariate outliers. In addition, Ckb has previously been implicated in developmental processes and obesity, further strengthening the biological significance of these findings^29.

Fig. 2. ODBAE accurately identifies outliers with multi-dimensional joint anomalies. a Scatter plot of parameters BL and BW for the mouse developmental dataset; the data points corresponding to mice with the gene Ckb knocked out deviate significantly from most of the data points, but their values of both BL and BW are within the normal range. The BL or BW values of the data points in the shaded part are ranked in the top 2% based on the absolute value of the z-score. b The distribution of standardized BMI of all knockout mice in the mouse developmental dataset; the mean BMI of Ckb knockout mice is significantly smaller. c–h Outlier detection results for 3 synthetic 2-dimensional Gaussian distribution datasets when the outlier ratio is 0.05, including detection results of the MSE-trained autoencoder and ODBAE for a 2-dimensional Gaussian distribution dataset with diagonal covariance matrix and non-Gaussian distributed noise (c, f), a 2-dimensional Gaussian distribution dataset with diagonal covariance matrix (d, g), and a 2-dimensional Gaussian distribution dataset with non-diagonal covariance matrix (e, h). The orange line and grey line in (c) represent two principal directions. i, j Outlier detection results of the MSE-trained autoencoder (i) and ODBAE (j) for a 3-dimensional dataset with an intrinsic dimension of 2 when the outlier ratio is 0.05; the data points follow a 2-dimensional Gaussian distribution with respect to Parameter1 and Parameter3.

ODBAE also identified previously unknown phenotypes, highlighting its potential to uncover novel gene functions. For the identification of HLP, most studies choose MSE as the training loss function and force the autoencoder to completely reconstruct the training dataset, which brings two major limitations: (1) only HLP along principal component directions with poor reconstruction can be detected; (2) during anomaly detection, each dimension of the dataset is detected independently. To evaluate the performance difference between our modified loss function and MSE in HLP detection, we used the two trained autoencoders to detect outliers in several datasets following multi-dimensional Gaussian distributions. For example, the data in Fig. 2c–e were three 2-dimensional Gaussian distribution datasets, and the data in Fig. 2i formed a 3-dimensional dataset whose intrinsic dimension was two, in which the data points followed a 2-dimensional Gaussian distribution with respect to Parameter1 and Parameter3 ("Methods"). The detection result in Fig. 2c shows that most of the detected HLP lie in the direction of the principal component indicated by the blue line, but few lie in the other direction. In Fig. 2d, HLP in the two dimensions of the dataset are detected independently, making outliers that are anomalous in multiple dimensions (outliers at the four corners) undetectable.
Fig. 2i demonstrates the same issue in three-dimensional space. Moreover, these issues persisted when the activation function was replaced with another function (Supplementary Figs. 2a–d and 3a–d). This is because the saturating nature of the non-linear activation function causes the autoencoder to have different reconstruction capabilities for data points located at the boundaries of the dataset compared to those located in the interior (Supplementary Fig. 4a, c and Supplementary Note 1). Usually, the distribution of a dataset along the various principal component directions may not be balanced, and the number of data points falling into unreconstructable regions may also be uneven ("Methods"). This leads to different detection performances for HLP along different principal component directions. In addition, the distribution of the reconstructed dataset is independent across all dimensions (Supplementary Fig. 4a, c). Therefore, the new loss function in ODBAE is designed to ensure that the reconstructed dataset follows a balanced joint distribution (Supplementary Fig. 4b, d). We then used ODBAE to detect outliers in these datasets and found that most outliers were identified (Fig. 2f–h, j). We further evaluated ODBAE's detection of both IP and HLP using a synthetic 3-dimensional dataset (Supplementary Fig. 5a). Area Under Curve (AUC) and Average Precision (AP) scores were used to evaluate the accuracy of the outlier detection results ("Methods"). The generated 3-dimensional training dataset formed a 2-dimensional manifold, where the subspace represented by Parameter1 and Parameter3 followed a 2-dimensional Gaussian distribution ("Methods" and Supplementary Fig. 5a). Here, we assumed that the ratio of HLP and the ratio of IP were equal. During anomaly detection, we used the training dataset as the test dataset. We then considered the top ⌊δn⌋ data points in the subspace represented by Parameter1 and Parameter3, ranked by Mahalanobis distance, as HLP, where δ was the outlier ratio and n was the number of data points in the training dataset. In addition, we generated ⌊δn⌋ points that were not on the manifold of the training dataset and treated them as IP (Supplementary Fig. 5b). The results showed that ODBAE worked better than other anomaly detection methods (Supplementary Fig. 5c, d). In summary, ODBAE is a powerful tool for detecting complex phenotypes in high-dimensional biological datasets, capturing both isolated and coordinated abnormalities. This capability allows the identification of novel gene functions and previously undetected phenotypes, particularly in cases where traditional methods fail to detect subtle perturbations in physiological homeostasis.

ODBAE outperforms other methods by a large margin

To evaluate the strengths of ODBAE, we performed outlier detection with ODBAE on low-dimensional synthetic datasets, high-dimensional synthetic datasets, and two benchmark datasets. ODBAE was compared with an autoencoder with MSE loss function (MSE-AE), an autoencoder with Mean Absolute Error loss function (MAE-AE), and the Deep Autoencoding Gaussian Mixture Model (DAGMM)^30. The comparison results were presented through AUC and AP scores; ground-truth outliers for the synthetic datasets were labeled by Mahalanobis distance, as sketched below.
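A minimal sketch of this labeling and scoring protocol (assuming scikit-learn; the dataset, outlier ratio, and stand-in detector score are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Illustrative 2-dimensional Gaussian dataset.
n, delta = 20000, 0.05
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=n)

# Squared Mahalanobis distance of every point to the data center.
mu = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
mahal = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# The top floor(delta * n) points by Mahalanobis distance are the
# ground-truth HLP (the positive class).
labels = np.zeros(n, dtype=int)
labels[np.argsort(mahal)[-int(np.floor(delta * n)):]] = 1

# 'scores' would be per-point reconstruction errors from a trained model;
# here the Mahalanobis distance itself is reused as a stand-in score.
scores = mahal
print("AUC:", roc_auc_score(labels, scores))
print("AP :", average_precision_score(labels, scores))
```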
We generated three datasets with 2-dimensional Gaussian distributions, each with an intrinsic dimension of 2. In one dataset, the correlation between the two dimensions was small (Fig. 3a); in another, the correlation between the two dimensions was relatively large (Fig. 3b); in the last, the correlation between the two dimensions was also small and non-Gaussian distributed noise was present (Fig. 3c). Therefore, the outliers in these three datasets were all HLP. We then set different outlier ratios and calculated the corresponding AUC and AP scores of the outlier detection results. Specifically, if the outlier ratio is δ and the number of sample points is n, the input data are sorted according to their Mahalanobis distance, and the first ⌊δn⌋ sample points are regarded as positive. For each fixed anomaly ratio δ, outlier detection was repeated 10 times on the dataset, and the average accuracy scores for each method were recorded. Fig. 3d–f show the AUC scores of the test results on these three datasets, and Fig. 3g–i show the corresponding AP scores, respectively. ODBAE outperformed the other schemes in detecting HLP in all three datasets at every outlier ratio (Supplementary Data 2a–c).

Fig. 3. Outlier detection performance of ODBAE is generally better than that of other schemes. a–c Data point distribution plots for 3 synthetic datasets, including a 2-dimensional Gaussian distribution dataset with diagonal covariance matrix (a), a 2-dimensional Gaussian distribution dataset with non-diagonal covariance matrix (b), and a 2-dimensional Gaussian distribution dataset with diagonal covariance matrix and non-Gaussian distributed noise (c). d–f AUC scores of comparison results for the datasets in (a, d), (b, e), and (c, f). g–i AP scores of comparison results for the datasets in (a, g), (b, h), and (c, i). j, k Comparison results for the Dry Bean Dataset. l, m Comparison results for the Breast Cancer Dataset. In (d–m), all methods were run 10 times under different outlier ratios, and error bars indicate mean ± standard error.

Since autoencoders are often used for outlier detection in high-dimensional datasets, we also analyzed the outlier detection results of ODBAE on two high-dimensional synthetic datasets. Both datasets followed multi-dimensional Gaussian distributions, and their covariance matrices were both diagonal, so the intrinsic dimension of these two datasets was equal to their actual dimension. One dataset was 50-dimensional and the other was 100-dimensional. ODBAE again performed best in HLP detection on these high-dimensional datasets (Supplementary Fig. 6a–d). To further evaluate the robustness of ODBAE under more realistic conditions where data distributions deviate from Gaussian assumptions, we applied the method to two additional synthetic datasets: one sampled from a two-dimensional Laplace distribution and the other from a two-dimensional uniform distribution. In both scenarios, ODBAE again exhibited superior outlier detection performance (Supplementary Fig. 7a–d). Additionally, we tested ODBAE on two benchmark datasets: the Dry Bean Dataset and the Breast Cancer Dataset. The Dry Bean Dataset contains 13611 instances and 17 attributes, with 7 classes in the original dataset. In each experiment, the inliers were 2000 data points from a certain class, and δ × 2000 data points were randomly selected from the remaining 6 classes as outliers, where δ was the outlier ratio and δ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. For this dataset, we set the dimension of the hidden layer of the autoencoder to 8 during the experiment. The Breast Cancer Dataset contains 569 instances and 31 attributes. All data points fall into two categories: benign and malignant.
We randomly selected 300 data points labeled benign as inliers and δ × 300 data points from the other category as outliers. For this dataset, the outlier ratio δ ∈ {0.02, 0.04, 0.06, 0.08, 0.1} and the dimension of the hidden layer of the autoencoder was 15. Overall, owing to the improvement in HLP detection, ODBAE worked best on both benchmark datasets (Fig. 3j–m and Supplementary Data 2d, e). In summary, ODBAE outperforms several existing autoencoder-based methods in detecting anomalies in both synthetic and benchmark datasets, indicating that ODBAE can identify anomalous situations more comprehensively. ODBAE therefore provides strong support for finding unknown anomalies and exploring unknown phenotypes in existing biological data.

ODBAE exhibits a certain degree of robustness

The performance of ODBAE, like that of any machine learning framework, is influenced by several hyperparameters and data conditions. We evaluated the impact of dataset dimensionality, noise levels, and model architecture on the outlier detection performance of ODBAE. We generated Gaussian-distributed datasets of varying dimensionality to train ODBAE and evaluated its outlier detection performance by measuring the AUC scores when the outlier ratio was set to 0.1. As illustrated in Fig. 4a, the anomaly detection performance of ODBAE remains largely stable as the dimensionality of the dataset increases. This observation highlights the potential applicability of ODBAE to outlier detection tasks in high-dimensional datasets.

Fig. 4. ODBAE exhibits a certain degree of robustness. a Outlier detection performance of ODBAE across datasets with varying dimensionalities (Supplementary Data 2f). For each fixed-dimensional dataset, the model was trained 10 times, and the AUC score for outlier detection was recorded after each training run. b Impact of varying noise intensities on the performance of ODBAE (Supplementary Data 2g). Laplace-distributed noise L(0, s) was added to the dataset to simulate different noise levels. For each fixed value of s, the model was trained 10 times on the noise-augmented data, and the AUC score for outlier detection was recorded after each training run. c Outlier detection accuracy of ODBAE with architecture (L, C) on the Dry Bean Dataset (Supplementary Data 2h). For each fixed architecture (L, C), the model was trained 10 times, and the average AUC score across the 10 runs was recorded to evaluate performance. Error bars indicate mean ± standard error.

Considering that real-world datasets frequently contain noise, we further evaluated the impact of data perturbations on the anomaly detection performance of ODBAE. A two-dimensional Gaussian-distributed dataset V containing 10,000 samples was generated. To simulate varying noise intensities, we added a perturbation ϵ to each data point, where ϵ was drawn from a Laplace distribution L(0, s), with larger s values representing higher noise levels. We then evaluated the performance of ODBAE trained on the noisy datasets in detecting outliers within the original dataset V, with the outlier ratio set to 0.1. As shown in Fig. 4b, ODBAE maintained high outlier detection accuracy under moderate noise levels (s ≤ 0.1), highlighting its robustness in the presence of noise typically observed in real-world data; a sketch of this noise-augmentation setup follows.
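A minimal sketch of the noise-augmentation setup described above (the dataset and noise scales are illustrative; training and scoring would reuse the autoencoder from the earlier sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

# Original two-dimensional Gaussian dataset V (10,000 samples).
V = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=10000)

# Laplace-distributed perturbation L(0, s) for several noise intensities s.
for s in [0.01, 0.05, 0.1, 0.5]:
    eps = rng.laplace(loc=0.0, scale=s, size=V.shape)
    V_noisy = V + eps  # model is trained on V_noisy; outliers scored on V
    print(f"s={s}: noisy data std ≈ {V_noisy.std(axis=0)}")
```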
In addition, we investigated the influence of model architecture on the performance of ODBAE. We again employed the Dry Bean dataset, randomly selecting 2000 samples from a single class to form the training dataset; 200 outlier samples were randomly drawn from the remaining 6 classes. We then evaluated the outlier detection accuracy of ODBAE with varying model architectures denoted as (L, C), where the model consists of 2L + 1 hidden layers, with all layers except the bottleneck layer having dimensionality C. The results demonstrate that both the number of layers L and the dimensionality of each layer C significantly influence the performance of ODBAE (Fig. 4c). Therefore, in practical applications, tuning these architectural parameters, specifically the number of layers and the dimensionality of each layer, can help optimize the outlier detection performance of ODBAE. In summary, ODBAE maintains high outlier detection accuracy in high-dimensional datasets and exhibits robustness to moderate levels of noise. Furthermore, its performance can be further optimized in practical applications by tuning the number of layers and the dimensionality of each layer in the model architecture.

ODBAE identifies new metabolic genes

We applied ODBAE to metabolism-related datasets from IMPC to uncover genetic elements involved in metabolic regulation. Fourteen key metabolic parameters were selected for analysis: alanine aminotransferase (ALA), albumin (Alb), alkaline phosphatase, aspartate aminotransferase (ASA), calcium (Ca), creatinine (Cre), glucose (Glu), HDL-cholesterol (HDLC), phosphorus (Ph), total bilirubin (TB), total cholesterol (TC), total protein (TP), triglycerides (TG), and urea. In total, we analyzed 45922 mice from 3064 single-gene knockout strains, including 23024 males and 22898 females. ODBAE was used to detect outliers in both the male and female datasets. Initially, mice with absolute z-scores in the top 1.2% for at least one metabolic parameter were flagged as outliers. Of the 23024 male mice, 2672 were flagged as abnormal, and of the 22898 female mice, 2553 were flagged as abnormal. To ensure robust detection, we adjusted the proportion of anomalies and identified the top 10% of mice with the highest reconstruction errors as outliers. Genes were considered significant if more than 50% of mice from a given knockout strain were identified as outliers, suggesting a strong association between these genes and metabolism; a sketch of these two flagging rules follows the figure legend below. In total, ODBAE identified 128 significant genes in males (9.5%) and 111 significant genes in females (8.08%), for a total of 203 genes (Fig. 5a, b and Supplementary Data 1e–h).

Fig. 5. ODBAE identifies new metabolic genes from a metabolism-related dataset of knockout mice. a The proportion of important genes identified in the male and female datasets. b ODBAE identified 128 important genes in the male dataset and 111 important genes in the female dataset, for a total of 203 important genes. c Sexually dimorphic genes identified by ODBAE, with previously reported sexually dimorphic genes highlighted in red. d Visualization of outliers corresponding to the gene Crabp1 in the subspace formed by abnormal parameters. The absolute z-score values of Alb or HDLC of the data points in the shaded part are ranked in the top 1.2%. e Validation results of important genes in the MGI database. 78.8% of the genes are associated with metabolism, and 21.2% are newly identified by ODBAE. In addition, 51.2% and 44.2% of the new metabolic genes are identified in the male or female datasets, respectively, and 4.7% are detected in both male and female datasets. f Histogram of KEGG pathway enrichment for 43 novel genes in mice obtained through KOBAS.
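As a minimal sketch of the two flagging rules described above (hypothetical column names; assuming pandas), a mouse is flagged when any of its parameters has an |z| in the top percentile band, and a gene is called significant when more than half of its strain's mice are outliers:

```python
import pandas as pd

def flag_mice_by_zscore(values: pd.DataFrame, top_frac: float = 0.012) -> pd.Series:
    """Flag a mouse if, for any parameter, its |z| ranks in the top fraction."""
    z = ((values - values.mean()) / values.std()).abs()
    cutoffs = z.quantile(1.0 - top_frac)   # per-parameter |z| threshold
    return (z > cutoffs).any(axis=1)

def significant_genes(is_outlier: pd.Series, gene: pd.Series,
                      min_frac: float = 0.5) -> list:
    """Call a gene significant if >50% of its knockout strain's mice are outliers."""
    frac = is_outlier.groupby(gene).mean()
    return sorted(frac[frac > min_frac].index)
```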
Moreover, we observed that the metabolism-associated genes identified by ODBAE exhibited pronounced sex-specific differences between male and female mice (Fig. 5b). To evaluate the biological significance of this sexual dimorphism, we focused on metabolism-related genes exhibiting pronounced sexual dimorphism. Specifically, we selected genes for which both the male and female mouse strains included at least seven individuals, and for which the proportion of knockout mice displaying abnormal phenotypes exceeded 50% in only one sex (either male or female), while remaining at 0% in the other. In total, 18 genes were identified as showing significant sexual dimorphism, 10 of which had previously been reported as sex-dimorphic in earlier studies (Fig. 5c, "Methods" and Supplementary Data 3i). These findings confirm the sexually dimorphic nature of these genes and underscore the utility of ODBAE as a robust tool for uncovering metabolism-associated genetic mechanisms. Notably, 91 of these 203 genes could not be detected using traditional z-score-based methods. This indicates that the metabolic abnormalities associated with these genes involve complex, multi-parameter phenotypes. ODBAE's anomaly detection includes a detailed explanation of outliers based on the highest reconstruction errors and kernel-SHAP analysis, which allows us to identify the specific metabolic parameters driving the anomalies. For example, knockout of the Crabp1 gene resulted in abnormal levels of Alb and HDLC, even though the individual z-scores for these parameters were within normal ranges. Visualization of these data (Fig. 5d) shows that while the z-scores for Alb and HDLC appeared normal, the knockout mice deviated significantly from the central distribution of all data points, highlighting the complex metabolic perturbation caused by Crabp1 knockout. More examples are shown in Supplementary Fig. 8a–d. Further investigation using the Mouse Genome Informatics (MGI) database revealed that 43 of the 203 genes (21.18%) identified by ODBAE had no previously known association with metabolic phenotypes (Fig. 5e and Supplementary Data 3a). To explore the potential relevance of these newly identified genes, we analyzed their human orthologues (Supplementary Data 3b). Single-nucleotide polymorphisms (SNPs) within a ±1 kb region of these orthologues were extracted from GWAS Central^31, resulting in a total of 804 SNPs (Supplementary Data 3c). We then investigated their association with metabolic diseases, focusing on type 2 diabetes (T2D). Specifically, for each SNP, we evaluated the extent of association across T2D-related traits using data from the DIAGRAM, GIANT, and GLGC consortia^32–34. Cross-phenotype meta-analysis (CPMA)^35 identified SNPs in four gene regions (TWF2, TMED10, HOXA10, and NBAS) that were strongly linked to T2D-related traits (CPMA p < 0.05) (Supplementary Data 3d). Visualization of the corresponding outliers confirmed that these genes exhibited distinct patterns of deviation across key metabolic dimensions (Supplementary Fig. 9a–d). In addition to T2D, many of these novel genes may be associated with other metabolic disorders.
KEGG pathway analysis (via KOBAS) of the mouse datasets revealed significant enrichment in metabolic pathways^36, including glycolysis, propanoate metabolism, and pyruvate metabolism (Fig. 5f and Supplementary Data 3e). These results suggest that ODBAE not only identifies genes involved in complex metabolic phenotypes but also pinpoints pathways that are disrupted by genetic perturbations. Overall, ODBAE provides a powerful tool for uncovering novel metabolic genes and phenotypes that are undetectable using traditional z-score-based methods. It successfully identifies genes associated with complex, multi-parameter metabolic abnormalities and reveals novel genetic contributors to metabolic diseases, many of which were previously unannotated in the MGI database.

ODBAE integrates novel metabolic phenotypes

ODBAE's ability to detect outliers based on joint abnormalities, beyond traditional z-score methods, highlights its strength in integrating complex metabolic phenotypes. Here, we demonstrate how ODBAE can detect metabolic abnormalities in gene knockout mice by identifying abnormal relationships between multiple metabolic parameters, rather than focusing on single-parameter outliers. Our analysis of metabolism-related datasets from IMPC reveals that a large proportion of outliers (89.14% of male and 92.75% of female mice) exhibit abnormalities in multiple metabolic parameters simultaneously. These findings suggest that joint abnormalities, or perturbations in the correlations between parameters, are important for detecting phenotypic changes in knockout mice. For example, knockout of the Ckb gene disrupts the relationship between BW and BL, resulting in an abnormal BMI, even though BW and BL are individually within normal ranges. This highlights how gene knockouts often disrupt intrinsic correlations between physiological parameters, resulting in complex metabolic phenotypes. To further explore these correlations, we identified the parameter pairs most likely to show joint abnormalities in both male and female mice. In males, 36 different abnormal parameter pairs were observed, the most common being (Alb, Ph), (Alb, HDLC), and (ALA, HDLC) (Fig. 6a, c). In females, 40 different abnormal parameter pairs were identified, the most common being (ASA, TP), (Alb, ASA), and (TB, TP) (Fig. 6b, c). The corresponding knockout genes are shown in Fig. 6d, e. Notably, 22 abnormal parameter pairs were found in both males and females, with the most common pairs being (ALA, ASA), (ALA, HDLC), and (Alb, ASA). These common abnormalities suggest conserved metabolic pathways that are affected by specific gene knockouts in both sexes.

Fig. 6. ODBAE integrates metabolic parameter pairs that tend toward simultaneous abnormalities. a, b Chord diagrams plotting the inter-connectivity of metabolic parameters for males (a) and females (b). The outer segments represent the metabolic parameters; the size of the arcs connecting the parameters is proportional to the number of associated outliers. c Abnormal parameter pairs observed in males and females. d, e Links between genes and simultaneously abnormal metabolic parameters in males (d) and females (e).

We then investigated the biological significance of the three most frequently perturbed parameter pairs. Previous studies have shown that the ALA/HDLC ratio is associated with the risk of non-alcoholic fatty liver disease (NAFLD) and diabetes^37–39.
ODBAE identified Lepr as one of the genes causing abnormalities in this pair (Fig. 7a). When the ratio of ALA to HDLC was analyzed across all mice, the outliers corresponding to the Lepr knockout had a significantly lower ratio compared to the general population (Fig. 7b), consistent with previous findings linking Lepr mutations to T2D and NAFLD^40,41. KEGG pathway analysis of genes associated with (ALA, HDLC) revealed enrichment in pathways related to NAFLD and metabolic processes, including glycolysis (Fig. 7c and Supplementary Data 3f). For the (Alb, ASA) pair, studies suggest that the ASA/Alb ratio correlates with tumor progression and prognosis in hepatocellular carcinoma (HCC)^42,43. ODBAE identified Ap4m1 as a gene causing abnormalities in this pair (Fig. 7d). The average ratio of ASA to Alb in Ap4m1 knockout mice was significantly larger than in other mice (Fig. 7e), consistent with the role of Ap4m1 in HCC^44. KEGG pathway enrichment further linked (Alb, ASA) disruptions to pathways involved in pyruvate metabolism, glycolysis, and proximal tubule bicarbonate reclamation (Fig. 7f and Supplementary Data 3g). Finally, for genes corresponding to the parameter pair (ALA, ASA), pathway enrichment analysis revealed significant enrichment in glycosaminoglycan biosynthesis, starch and sucrose metabolism, and type I diabetes mellitus (Supplementary Fig. 10a and Supplementary Data 3h), consistent with the previously reported association of the ASA/ALA ratio with prediabetes^45. In addition to these well-characterized parameter pairs, ODBAE also identified novel correlations worth exploring further. For instance, Taf8 was highlighted as a gene causing abnormalities in the parameter pair (HDLC, TP), though little is known about the specific relationship between these two parameters (Supplementary Fig. 10b). Supplementary Fig. 10c shows that there is a linear relationship between these two parameters, and that knockout of the gene Taf8 disrupts this relationship. ODBAE's ability to uncover such relationships without prior biological inference makes it a powerful tool for discovering previously unknown metabolic interactions.

Fig. 7. Validation of the top three abnormal parameter pairs that occur in both males and females and have the highest overall frequency. a Visualization of outliers corresponding to the gene Lepr based on abnormal parameter pairs in the male dataset. b The distribution of the standardized ratio of ALA to HDLC in the mouse metabolism-related dataset; the average ratio of Lepr knockout mice is significantly smaller. c Histogram of KEGG pathway enrichment for genes corresponding to (ALA, HDLC) in humans. d Visualization of outliers corresponding to the gene Ap4m1 based on abnormal parameter pairs in the female dataset. e The distribution of the standardized ratio of ASA to Alb in the mouse metabolism-related dataset; the average ratio of Ap4m1 knockout mice is significantly larger. f Histogram of KEGG pathway enrichment for genes corresponding to (Alb, ASA) in humans.

Overall, ODBAE systematically identifies and explains complex phenotypes by revealing intrinsic relationships between multiple metabolic parameters, including both linear and non-linear associations. These insights can serve as new indicators for biomedical research and provide a more holistic view of metabolic dysfunction in knockout mice; the ratio comparison used in Fig. 7b, e is sketched below.
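A minimal sketch of that ratio comparison (hypothetical column names; assuming pandas), standardizing a per-mouse parameter ratio and comparing one knockout strain against the whole population:

```python
import pandas as pd

def standardized_ratio(df: pd.DataFrame, num: str, den: str) -> pd.Series:
    """Z-standardize the per-mouse ratio of two metabolic parameters."""
    ratio = df[num] / df[den]
    return (ratio - ratio.mean()) / ratio.std()

def strain_vs_population(df: pd.DataFrame, gene: str, num: str, den: str) -> float:
    """Mean standardized ratio of one knockout strain; the population
    mean is 0 by construction, so negative values indicate a lower ratio."""
    z = standardized_ratio(df, num, den)
    return z[df["gene"] == gene].mean()

# Example (hypothetical data frame 'df' with per-mouse rows):
# strain_vs_population(df, "Lepr", "ALA", "HDLC")
```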
In addition, the ODBAE framework can be adapted to other datasets to uncover complex phenotypes in different biological systems.

Discussion

In this study, we present ODBAE, a machine learning approach designed to detect and explain complex phenotypes by identifying outliers in high-dimensional biological datasets. The key advantage of ODBAE is its ability to detect two types of outliers: IP, which deviate significantly from expected relationships between variables, and HLP, which are far from the data center. Unlike traditional outlier detection methods that often focus on single abnormal indicators, ODBAE identifies multi-dimensional joint abnormalities, providing a more holistic view of phenotypic disruptions. This approach is particularly powerful when applied to data from knockout mouse models, where phenotypic changes are often subtle or involve complex interactions between physiological parameters. One of the most significant contributions of ODBAE is its potential to identify unexpected phenotypes that may not be apparent when analyzing individual physiological indicators. Traditional methods of phenotype analysis often focus on abnormalities in a single trait^3–5, potentially overlooking the coordinated disruptions across multiple parameters that signal underlying biological imbalances. As our understanding of disease and homeostasis evolves, it is increasingly clear that pathological processes often involve system-wide disturbances rather than isolated dysfunctions^6. ODBAE leverages these insights by integrating correlations between physiological indicators, even when individual indicators remain within normal ranges. This capability allows researchers and clinicians to better understand the interplay of homeostatic mechanisms and how they are disrupted in disease states. For example, when applied to metabolism-related datasets from IMPC, ODBAE successfully identified novel metabolic genes, including those associated with complex, multi-parameter phenotypes. These phenotypes would have been difficult to detect using conventional approaches that focus on single-parameter abnormalities. Moreover, ODBAE successfully identified sex-dimorphic genes, underscoring its utility in advancing the study of sexual dimorphism. ODBAE also uncovered new parameter pairs that tend to exhibit abnormalities together, providing insights into the intrinsic relationships between metabolic pathways. This type of discovery is essential for advancing our understanding of gene function and the pathophysiology of metabolic diseases. Notably, the utility of ODBAE extends beyond knockout mouse models. It can be broadly applied to diverse tabular biological datasets, including disease cohorts, clinical health examination records, and single-cell datasets, to identify complex phenotypes and subtle, multidimensional anomalies. Importantly, ODBAE offers not just detection but also anomaly explanation through the use of kernel-SHAP. This feature allows researchers to pinpoint the exact physiological indicators driving the detected abnormalities, facilitating deeper insights into the biological processes involved. By identifying the most significant reconstruction errors and the features contributing to them, ODBAE enables a clearer interpretation of complex phenotypic data. This is particularly relevant in clinical settings, where understanding the root cause of a phenotype can guide diagnosis and treatment decisions. A minimal sketch of this explanation step follows.
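The sketch below illustrates how kernel-SHAP can attribute a detected outlier's reconstruction error to input features. It assumes the `shap` package and a trained autoencoder such as the `AE` sketched earlier; it is an illustration, not the exact ODBAE code:

```python
import numpy as np
import shap
import torch

def explain_outlier(model, background: np.ndarray, outlier_row: np.ndarray):
    """Rank features by their kernel-SHAP attribution to the reconstruction
    error. 'model' is a trained autoencoder; 'background' is a small
    reference sample of normal data; 'outlier_row' has shape (1, n_features)."""
    def reconstruction_error(x: np.ndarray) -> np.ndarray:
        with torch.no_grad():
            t = torch.as_tensor(x, dtype=torch.float32)
            return ((model(t) - t) ** 2).sum(dim=1).numpy()

    explainer = shap.KernelExplainer(reconstruction_error, background)
    shap_values = explainer.shap_values(outlier_row)
    # Features with the largest |SHAP| values are reported as the
    # parameters most associated with the anomaly.
    return np.argsort(-np.abs(np.ravel(shap_values)))
```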
Although ODBAE demonstrates considerable potential, several areas warrant further exploration. First, the current framework assumes a Gaussian distribution in the training data, which may limit its effectiveness on datasets that do not follow this distribution. As shown in Fig. 4b, ODBAE maintains robust performance under moderate levels of noise in Gaussian-distributed datasets; however, its anomaly detection capability progressively declines with increasing noise intensity. Future work should focus on optimizing the model to handle non-Gaussian data and improving the accuracy of anomaly explanations, particularly in cases involving highly complex or subtle phenotypes. Second, although ODBAE could potentially be extended to other data types, such as time-series or imaging data, the current manuscript focuses exclusively on tabular datasets. The model's performance on other types of data, such as longitudinal time series or medical images, has not yet been empirically tested. Further exploration and validation in these areas are warranted to assess ODBAE's broader applicability. Third, the performance of ODBAE is influenced by both the number of layers and the dimensionality of each layer, suggesting that its architecture can be tuned to further enhance performance in practical applications. However, determining the optimal model configuration remains a challenging task. Moreover, since ODBAE identifies anomalies based on the reconstruction error of individual data points, the choice of threshold for this error is critical to the accuracy of outlier detection. Further optimization of threshold selection is therefore warranted. Finally, ODBAE's use in identifying phenotypes has primarily focused on metabolic parameters. Its performance in detecting phenotypes related to other physiological systems, such as neurological, cardiovascular, or developmental traits, remains unexplored. Broader testing across a range of physiological categories would demonstrate the model's versatility and help identify any system-specific limitations. Despite these challenges, ODBAE represents a powerful tool for the analysis of high-dimensional biomedical datasets. Its ability to detect multi-dimensional outliers, explain anomalies, and integrate correlated physiological indicators offers a new way to approach phenotypic screening, particularly in gene knockout models. We believe that ODBAE will facilitate the discovery of previously unrecognized phenotypes, enhancing our understanding of homeostasis and disease processes. By moving beyond the limitations of single-indicator analysis, ODBAE opens up new possibilities for both basic research and clinical diagnostics, allowing for a more comprehensive understanding of complex biological systems.

Methods

Datasets

To demonstrate the comprehensive benefits of ODBAE, we first performed outlier detection on developmental datasets obtained from the IMPC. Subsequently, we assessed ODBAE's ability to identify HLP using 3 synthetic two-dimensional Gaussian-distributed datasets (Fig. 2c–e) and one three-dimensional dataset (Fig. 2i). Finally, we evaluated the model's overall detection accuracy for both IP and HLP using an additional three-dimensional dataset (Supplementary Fig. 5a). For outlier detection on the IMPC developmental dataset, data from wild-type mice were used as the training dataset, while data from knockout mice served as the test dataset.
The training dataset comprised 8327 wild-type samples (4126 males and 4201 females), and the test dataset included 26706 knockout samples (13297 males and 13409 females). To evaluate the accuracy of ODBAE in detecting HLP, as well as its overall detection performance for both IP and HLP, all synthetic datasets consisted of 20000 samples each. To evaluate the advantages of ODBAE over existing methods, we conducted outlier detection using multiple approaches across five low-dimensional synthetic datasets, two high-dimensional synthetic datasets, and two benchmark datasets (the Dry Bean dataset and the Breast Cancer dataset). In the low-dimensional synthetic dataset experiments, we generated three two-dimensional datasets drawn from Gaussian distributions, one from a two-dimensional Laplace distribution, and one from a two-dimensional uniform distribution, each comprising 20000 samples. For the high-dimensional synthetic datasets, we constructed one 50-dimensional and one 100-dimensional Gaussian-distributed dataset, each also consisting of 20000 samples. The Dry Bean dataset contains 13611 samples across seven distinct categories; we randomly selected 2000 samples from a single class as the training dataset, with samples from the remaining six classes designated as outliers. The Breast Cancer dataset comprises 569 samples, categorized as benign or malignant. We randomly selected 300 benign samples as the training data, while malignant samples were treated as outliers. The metabolism-related dataset used to identify novel metabolic genes and phenotypes included 21234 wild-type mouse samples (10585 males and 10649 females) and 45922 knockout mouse samples (23024 males and 22898 females). Wild-type samples were employed for training, while knockout samples served as the test dataset.

Dataset preprocessing

We used developmental and metabolism-related datasets from IMPC to discover complex phenotypes and identify new genes. The developmental dataset was integrated from 8 phenotyping centers (BCM, HMGU, ICS, JAX, MRC Harwell, RBRC, TCP, and UC Davis), while the metabolism-related dataset was from 11 phenotyping centers (BCM, CCP-IMG, HMGU, ICS, KMPC, MARC, MRC Harwell, RBRC, TCP, UC Davis, and WTSI). To eliminate experiment-specific variations, we standardized the data of each phenotyping center for each dataset to have unit variance and zero mean. Min-Max normalization was then applied to each physiological parameter within each dataset, mitigating dimensional effects across different physiological parameters. SNPs within a ±1 kb region of orthologues for the 41 newly identified metabolic genes were downloaded from GWAS Central on September 23, 2023 and filtered based on $-\log(p) \ge 2$. A minimal sketch of the two normalization steps follows.
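A minimal sketch of this two-step preprocessing (hypothetical column names; assuming pandas):

```python
import pandas as pd

def preprocess(df: pd.DataFrame, center_col: str = "phenotyping_center") -> pd.DataFrame:
    """Per-center standardization followed by per-parameter Min-Max scaling."""
    params = df.columns.drop(center_col)
    # Standardize each parameter within each phenotyping center
    # (zero mean, unit variance) to remove experiment-specific variation.
    z = df.groupby(center_col)[params].transform(lambda s: (s - s.mean()) / s.std())
    # Min-Max normalize each parameter to [0, 1] to remove dimensional effects.
    return (z - z.min()) / (z.max() - z.min())
```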
Definition of HLP and IP

Since the reconstruction of any dataset by an autoencoder can be regarded as a complex regression process, we can divide the outliers in each dataset into HLP and IP^46. For a given dataset, we divide the dimension indices into two disjoint sets V₁ and V₂. Without loss of generality, we assume that V₁ = {1, 2, …, p} and V₂ = {p + 1, p + 2, …, m}. For i, j ∈ V₁, i ≠ j, there is no correlation between variables X_i and X_j, while for each k ∈ V₂, there is a map φ_k such that X_k = φ_k(X₁, X₂, …, X_p). X_i (i ∈ V₁) is called a factor and X_k (k ∈ V₂) is called a response variable. Then we define the matrix

$$Q = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}.$$

If we let $Q_i = (1, x_{i1}, \ldots, x_{ip})^\top$, the leverage value of the i-th sample point is

$$h_i = Q_i^\top (Q^\top Q)^{-1} Q_i, \quad i = 1, 2, \ldots, n.$$

If we denote the mean of variable X_k (k ∈ V₁) by μ_k and define the mean vector $\tilde\mu = (\mu_1, \mu_2, \ldots, \mu_p)^\top$, we have

$$Q^\top Q = n \begin{pmatrix} 1 & \tilde\mu^\top \\ \tilde\mu & C \end{pmatrix}, \tag{1}$$

where $C_{jk} = \frac{1}{n}\sum_{i=1}^{n} x_{ij} x_{ik}$. Thus, we can obtain the inverse of $Q^\top Q$:

$$(Q^\top Q)^{-1} = \frac{1}{n} \begin{pmatrix} 1 + \tilde\mu^\top (C - \tilde\mu\tilde\mu^\top)^{-1} \tilde\mu & -\tilde\mu^\top (C - \tilde\mu\tilde\mu^\top)^{-1} \\ -(C - \tilde\mu\tilde\mu^\top)^{-1} \tilde\mu & (C - \tilde\mu\tilde\mu^\top)^{-1} \end{pmatrix}. \tag{2}$$

Therefore, if we let $\tilde x_i = (x_{i1}, \ldots, x_{ip})^\top$, the leverage value of the i-th sample point can also be formulated as

$$h_i = \frac{1}{n} + \frac{1}{n} (\tilde x_i - \tilde\mu)^\top (C - \tilde\mu\tilde\mu^\top)^{-1} (\tilde x_i - \tilde\mu). \tag{3}$$

If we define $A = C - \tilde\mu\tilde\mu^\top$, then

$$A_{jk} = \frac{1}{n}\sum_{i=1}^{n} x_{ij} x_{ik} - \mu_j \mu_k = \frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \mu_j)(x_{ik} - \mu_k) = \frac{n-1}{n}\,\mathrm{COV}(X_j, X_k), \quad j, k \in V_1. \tag{4}$$

Furthermore, we have $A = \frac{n-1}{n}\Sigma_f$ and $A^{-1} = \frac{n}{n-1}\Sigma_f^{-1}$, where Σ_f is the covariance matrix of the factor space. Then, based on Eq. (3), we get

$$h_i = \frac{1}{n} + \frac{1}{n-1} (\tilde x_i - \tilde\mu)^\top \Sigma_f^{-1} (\tilde x_i - \tilde\mu). \tag{5}$$

Obviously, for each sample point, the leverage value is proportional to its Mahalanobis distance in the factor space. In addition, the Cook distance of the i-th sample point based on the k-th (k ∈ V₂) dimension is

$$C_{ik} = \frac{(n-p)\,(x_{ik} - \hat x_{ik})^2\, h_i}{p \left[\sum_{j=1}^{n} (x_{jk} - \hat x_{jk})^2\right] (1 - h_i)^2}, \quad i = 1, 2, \ldots, n. \tag{6}$$

For each dataset, HLP are sample points with high leverage values, and IP are sample points with a high Cook distance. Since the leverage value of the i-th sample point is proportional to its Mahalanobis distance in the factor space, HLP can also be identified by the Mahalanobis distance in the factor space. It is noteworthy that for certain datasets, such as those conforming to a multivariate Gaussian distribution, the set V₂ may be empty. In such instances, all outliers in the dataset are identified as HLP. A numerical check of the leverage-Mahalanobis relationship in Eq. (5) is sketched below.
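A short numerical check of Eqs. (3) and (5) on illustrative data (assuming NumPy; `np.cov` uses the same n − 1 denominator as Σ_f):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 3
X = rng.normal(size=(n, p))                      # factor-space sample points

Q = np.hstack([np.ones((n, 1)), X])              # design matrix Q
H = Q @ np.linalg.inv(Q.T @ Q) @ Q.T             # hat matrix; h_i = H[i, i]
leverage = np.diag(H)

mu = X.mean(axis=0)
Sigma_f_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
mahal = np.einsum("ij,jk,ik->i", diff, Sigma_f_inv, diff)

# Eq. (5): h_i = 1/n + M_i/(n - 1), so leverage and (squared) Mahalanobis
# distance agree up to this affine map.
assert np.allclose(leverage, 1 / n + mahal / (n - 1))
```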
Outlier detection of ODBAE

The autoencoder model in ODBAE consists of an encoder f and a decoder g. If we denote the input and output data points by the variables X and X̂, respectively, where $X = (X_1, X_2, \ldots, X_m)^\top$ and $\hat X = (\hat X_1, \hat X_2, \ldots, \hat X_m)^\top$, and the number of sample points is n, then $\hat X = g(f(X))$. In addition, $\mathcal{X}$ represents the whole input dataset and $X \in \mathcal{X}$. Here, x_i denotes the i-th sample point of the input dataset and $\hat x_i$ its reconstruction, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})^\top$ and $\hat x_i = (\hat x_{i1}, \hat x_{i2}, \ldots, \hat x_{im})^\top$. Ideally, we would like to train the autoencoder on the training dataset to minimize the training loss function. The generally used training loss function is the MSE, which can be represented as

$$L_{MSE}(\omega, b) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat x_i)^\top (x_i - \hat x_i) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} (x_{ij} - \hat x_{ij})^2,$$

where ω represents the weights of the network between the input layer and the output layer and b the bias values. The purpose of the training process is to guarantee that the intrinsic information of the training dataset is learned and that most of the normal sample points can be well reconstructed. It is important to note that MSE-trained autoencoders are insufficient for outlier detection. Through rigorous theoretical analysis, we demonstrate how this limitation can be addressed. The analysis is based on the following assumption.

Assumption 1. The input variable X follows an m-dimensional Gaussian distribution.

Since autoencoders mainly focus on recovering the principal components of a dataset^47,48, it is necessary to use the unit orthogonal eigenvectors as basis vectors to further study the influence of the loss function on the outlier detection results of the autoencoder. Under Assumption 1, we denote the mean vector of X by $\mu = (\mu_1, \mu_2, \ldots, \mu_m)^\top$. If the eigenvalues of the covariance matrix Σ_x are λ₁, λ₂, …, λ_m, the corresponding unit orthogonal eigenvectors (i.e., principal directions) are η₁, η₂, …, η_m, and the new coordinate of the variable X is $Y = (Y_1, Y_2, \ldots, Y_m)^\top$ when the unit orthogonal eigenvectors are taken as the basis vectors, then we can define an orthogonal matrix P whose i-th column is the unit eigenvector η_i. Clearly, $Y = P^{-1}X = P^\top X$ and $Y_k = \eta_k^\top X$, k = 1, 2, …, m. Thus, in the new coordinate system, the input variable X is converted to the variable Y. Let $\nu = (\nu_1, \nu_2, \ldots, \nu_m)^\top$ and Σ_Y denote the mean vector and covariance matrix of Y, respectively; we have $\nu = P^\top \mu$ and $\nu_k = \mathrm{E}(Y_k) = \eta_k^\top \mu$, k = 1, 2, …, m. Furthermore, the variance of the k-th element of Y is

$$D(Y_k) = \mathrm{COV}(Y_k, Y_k) = \mathrm{E}\big[(Y_k - \nu_k)(Y_k - \nu_k)\big] = \eta_k^\top \mathrm{E}\big[(X - \mathrm{E}X)(X - \mathrm{E}X)^\top\big] \eta_k = \eta_k^\top \Sigma_x \eta_k = \lambda_k \eta_k^\top \eta_k = \lambda_k, \tag{7}$$

and the covariance of Y_i and Y_j (i ≠ j) is

$$\mathrm{COV}(Y_i, Y_j) = \mathrm{E}\big[(Y_i - \mathrm{E}Y_i)(Y_j - \mathrm{E}Y_j)\big] = \eta_i^\top \mathrm{E}\big[(X - \mathrm{E}X)(X - \mathrm{E}X)^\top\big] \eta_j = \eta_i^\top \Sigma_x \eta_j = \lambda_j \eta_i^\top \eta_j = 0. \tag{8}$$

Thus, the covariance matrix of Y is Σ_Y = diag(λ₁, λ₂, …, λ_m).

Proposition 1. The covariance matrix of variable Y is $\Sigma_Y = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$.

Since X follows an m-dimensional Gaussian distribution, Y also follows an m-dimensional Gaussian distribution, i.e., Y ~ N(ν, Σ_Y), and consequently Y_k ~ N(ν_k, λ_k), k = 1, 2, …, m. We denote by Ŷ the output variable corresponding to Y, with mean vector $\hat\nu = (\hat\nu_1, \hat\nu_2, \ldots, \hat\nu_m)^\top$ and covariance matrix eigenvalues $\hat\lambda_k$, k = 1, 2, …, m. As mentioned above, autoencoders mainly recover the principal components of a dataset, so it is reasonable to make the following assumption.

Assumption 2. For each element Ŷ_k of the output variable Ŷ, its distribution family is the same as that of the corresponding input element Y_k. Since Y_k ~ N(ν_k, λ_k), we have Ŷ_k ~ N(ν̂_k, λ̂_k), k = 1, 2, …, m. Moreover, whether each eigenvalue after reconstruction is close to 0 matches its corresponding original eigenvalue.

In the following, we theoretically analyze the limitations of MSE in detecting HLP. First, we derive, for each principal component direction, the relationship between the deviation of the reconstructed data from its mean and the deviation of the original data from its mean. The MSE loss can be formulated as

$$L_{MSE}(\omega, b) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat y_i)^\top (y_i - \hat y_i) = \mathrm{E}\big[Y^\top Y - 2 Y^\top \hat Y + \hat Y^\top \hat Y\big] = \sum_{i=1}^{m} \mathrm{E}\big[Y_i^2\big] - 2\sum_{i=1}^{m} \mathrm{E}\big[Y_i \hat Y_i\big] + \sum_{i=1}^{m} \mathrm{E}\big[\hat Y_i^2\big], \tag{9}$$

where y_i and ŷ_i represent the i-th sample point of the input and output dataset in the new coordinate system, respectively.
First, assuming the autoencoder can properly reconstruct the input dataset, we analyze the mean vector of the output variable $\hat{Y}$. Based on Assumption 2 and Eq. ([167]9), we have

$$L_{\mathrm{MSE}}(\omega, b) = \sum_{i=1}^{m} \lambda_i + \sum_{i=1}^{m} \hat{\lambda}_i + \sum_{i=1}^{m} \nu_i^2 + \sum_{i=1}^{m} \hat{\nu}_i^2 - 2 \sum_{i=1}^{m} \nu_i \hat{\nu}_i - 2 \sum_{i=1}^{m} \mathrm{COV}(Y_i, \hat{Y}_i) \geq \sum_{i=1}^{m} (\nu_i - \hat{\nu}_i)^2 + \sum_{i=1}^{m} \left(\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i}\right)^2. \quad (10)$$

In order to reach the minimum of $L_{\mathrm{MSE}}(\omega, b)$, the mean vector $\hat{\nu}$ of the output variable must satisfy $\hat{\nu}_k = \nu_k$, $k = 1, 2, \ldots, m$. That is to say, the mean vector of the output data is the same as that of the input data.

Next, we analyze the reconstruction error of each data point. According to Assumptions 1 and 2, the input and output variables satisfy $Y_k \sim N(\nu_k, \lambda_k)$ and $\hat{Y}_k \sim N(\nu_k, \hat{\lambda}_k)$, $k = 1, 2, \ldots, m$. Denote $R(Y) = (R_1(Y), R_2(Y), \ldots, R_m(Y))$ and $\hat{R}(Y) = (\hat{R}_1(Y), \hat{R}_2(Y), \ldots, \hat{R}_m(Y))$ as the differences between the variables $Y$ and $\hat{Y}$, respectively, and their common mean vector $\nu$, i.e., $R(Y) = Y - \nu$ and $\hat{R}(Y) = \hat{Y} - \nu$. It is easy to see that $R_k(Y) \sim N(0, \lambda_k)$ and $\hat{R}_k(Y) \sim N(0, \hat{\lambda}_k)$. Then the following equation can be obtained:

$$L_{\mathrm{MSE}}(\omega, b) = E\left[(Y - \hat{Y})^\top (Y - \hat{Y})\right] = E\left[(R(Y) - \hat{R}(Y))^\top (R(Y) - \hat{R}(Y))\right] = \sum_{i=1}^{m} E[R_i^2(Y)] + \sum_{i=1}^{m} E[\hat{R}_i^2(Y)] - 2 \sum_{i=1}^{m} E[R_i(Y)] E[\hat{R}_i(Y)] - 2 \sum_{i=1}^{m} \mathrm{COV}(R_i(Y), \hat{R}_i(Y)) = \sum_{i=1}^{m} \lambda_i + \sum_{i=1}^{m} \hat{\lambda}_i - 2 \sum_{i=1}^{m} \rho_i \sqrt{\lambda_i \hat{\lambda}_i}, \quad (11)$$

where $\rho_k$ represents the correlation coefficient of $R_k(Y)$ and $\hat{R}_k(Y)$, with $-1 \leq \rho_k \leq 1$, $k = 1, 2, \ldots, m$. In Eq. ([168]11), $L_{\mathrm{MSE}}(\omega, b)$ reaches its minimum only when $\rho_k = 1$. In this case, there is a positive linear correlation between $R_k(Y)$ and $\hat{R}_k(Y)$; specifically, there exist constants $a_k$ and $c_k$ such that $\hat{R}_k(Y) = a_k R_k(Y) + c_k$, so $E[\hat{R}_k(Y)] = a_k E[R_k(Y)] + c_k$ and $D[\hat{R}_k(Y)] = a_k^2 D[R_k(Y)]$. Since $E[\hat{R}_k(Y)] = E[R_k(Y)] = 0$, $D[R_k(Y)] = \lambda_k$, and $D[\hat{R}_k(Y)] = \hat{\lambda}_k$, we obtain $c_k = 0$ and $a_k = \sqrt{\hat{\lambda}_k / \lambda_k}$. Therefore, we can obtain the following proposition.

Proposition 2 Under Assumptions 1 and 2, in each principal component direction the difference between the reconstructed data and its mean is proportional to the difference between the original data and its mean. The specific relationship is

$$\hat{R}_k(Y) = \sqrt{\frac{\hat{\lambda}_k}{\lambda_k}} R_k(Y), \quad (12)$$

where $R_k(Y) = Y_k - \nu_k$ and $\hat{R}_k(Y) = \hat{Y}_k - \nu_k$.

Based on Eq. ([169]12), we can further determine the specific reconstruction error of each input data point.

Proposition 3 The reconstruction error of each input data point measured by MSE can be formulated as

$$W(Y) = (Y - \hat{Y})^\top (Y - \hat{Y}) = \sum_{i=1}^{m} \left(\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i}\right)^2 \frac{R_i^2(Y)}{\lambda_i}. \quad (13)$$

Ideally, we would like each principal component to contribute equally to the reconstruction error of any input data point. Recalling the definition of $R(Y)$, it is easy to see that $R_k(Y)/\sqrt{\lambda_k} \sim N(0, 1)$, so Eq. ([170]13) indicates that how well the autoencoder recovers each principal component determines that component's share of the reconstruction error. This brings considerable uncertainty to HLP detection.

Proposition 4 HLP are difficult to identify as outliers if the input dataset is completely reconstructed.

When the input dataset is completely reconstructed, we have $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = 0$, $k = 1, 2, \ldots, m$, in Eq. ([171]13). Then the reconstruction error of every data point in the training dataset is 0, and as a result every data point in the training dataset is considered normal. However, HLP exist in any dataset that follows an $m$-dimensional Gaussian distribution.
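To make Eq. (13) concrete, the identity can be checked numerically: under the scaling of Proposition 2, the per-point squared error computed directly agrees with Eq. (13), and it vanishes entirely under perfect recovery (Proposition 4). The eigenvalues below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Principal-component coordinates Y with diagonal covariance (Proposition 1).
lam = np.array([4.0, 1.0, 0.25])      # original eigenvalues (illustrative)
nu = np.zeros(3)
Y = rng.normal(nu, np.sqrt(lam), size=(10_000, 3))

# Under Proposition 2, reconstruction with recovered eigenvalues lam_hat is
# Y_hat = nu + sqrt(lam_hat / lam) * (Y - nu).
lam_hat = np.array([3.9, 0.9, 0.01])  # hypothetical recovery quality
Y_hat = nu + np.sqrt(lam_hat / lam) * (Y - nu)

# The direct per-point squared error equals Eq. (13).
W_direct = np.sum((Y - Y_hat) ** 2, axis=1)
W_eq13 = np.sum((np.sqrt(lam) - np.sqrt(lam_hat)) ** 2 * (Y - nu) ** 2 / lam, axis=1)
assert np.allclose(W_direct, W_eq13)

# Proposition 4: with perfect recovery (lam_hat == lam) every error is zero,
# so HLP inside the Gaussian cloud leave no reconstruction-error signal.
assert np.allclose(np.sum((Y - (nu + (Y - nu))) ** 2, axis=1), 0.0)
```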
Generally, the larger the Mahalanobis distance, the more abnormal the data point. Therefore, although completely reconstructing the training dataset is beneficial for IP detection, it neglects the detection of HLP.

Proposition 5 HLP whose values in each dimension are within the normal range cannot be identified as outliers. Besides, most HLP will be detected in the direction corresponding to the worst-recovered principal component, while in the directions of well-recovered principal components the anomalies are often ignored.

From Eq. ([172]13), we know that $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k}$ acts as the weight of the $k$th principal component direction's reconstruction error within each data point's total reconstruction error. Therefore, if the data reconstruction in the $i$th principal component direction is better than that in the $j$th principal component direction, then $\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i} < \sqrt{\lambda_j} - \sqrt{\hat{\lambda}_j}$, $i \neq j$. As a result, the detected outliers are concentrated in the $j$th principal component direction.

Proposition 6 If we ensure that the differences between the eigenvalues of the covariance matrix of the original dataset and their corresponding reconstructed results are equal in the direction of each principal component, the reconstruction error of each data point will be proportional to its Mahalanobis distance.

If we suppress complete reconstruction by the autoencoder, then $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} > 0$, $k = 1, 2, \ldots, m$. Based on Eq. ([173]12), we can obtain

$$(Y_k - \hat{Y}_k)^2 = \frac{\left(\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k}\right)^2}{\lambda_k} (Y_k - \nu_k)^2. \quad (14)$$

This means that in each principal component direction of the input data there is a reconstruction error, and the further a value lies from the mean, the larger that error. Therefore, in each principal component direction, the outliers detected by the reconstruction error coincide with the outliers defined by the Gaussian-distributed data. Besides, if $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = \beta > 0$, $k = 1, 2, \ldots, m$, then according to Eq. ([174]13) the reconstruction error of each input data point is

$$W(Y) = (Y - \hat{Y})^\top (Y - \hat{Y}) = \beta^2 \sum_{i=1}^{m} \frac{R_i^2(Y)}{\lambda_i}. \quad (15)$$

Since $\Sigma_Y = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$, it is easy to obtain $\Sigma_Y^{-1} = \mathrm{diag}(1/\lambda_1, 1/\lambda_2, \ldots, 1/\lambda_m)$. So the Mahalanobis distance of the input variable $Y$ is

$$M(Y) = (Y - \nu)^\top \Sigma_Y^{-1} (Y - \nu) = \sum_{i=1}^{m} \frac{R_i^2(Y)}{\lambda_i}. \quad (16)$$

This indicates that the reconstruction error of each data point is proportional to its Mahalanobis distance. Therefore, if we control $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = \beta > 0$, $k = 1, 2, \ldots, m$, almost all HLP are converted to IP and cannot be well reconstructed; as a result, the detected HLP are evenly distributed across the principal component directions.

Based on the above discussion, we propose a new loss function that adds an appropriate penalty term to MSE to balance the reconstruction of the autoencoder. In fact, if the intrinsic dimension of the dataset is $l$, then there are $l$ eigenvalues that are not close to 0. We denote these $l$ eigenvalues as $\lambda_k$ and their corresponding reconstruction results as $\hat{\lambda}_k$, $k = 1, 2, \ldots, l$; according to Assumption 2, $\hat{\lambda}_k$ is also not close to 0. Then, we consider two losses in our training loss function. One is $L_{\mathrm{MSE}}(\omega, b)$, which aims to reconstruct the input dataset well. In addition, we define the other loss,

$$L_{\mathrm{EIG}}(\omega, b) = \sum_{i=1}^{l} \left(\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i} - \beta\right)^2,$$

which prevents the autoencoder from completely reconstructing the input dataset. Here, $\beta > 0$ is a hyperparameter that adjusts the degree of data reconstruction.
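One way the $L_{\mathrm{EIG}}$ term could be estimated from a mini-batch is sketched below in PyTorch; the batch-covariance eigenvalue estimates and the clamping are our illustrative choices, not the released implementation. The combination of the two terms follows in Eq. (17).

```python
import torch

def mse_term(x, x_hat):
    # L_MSE: mean over samples of the per-sample squared reconstruction error.
    return torch.mean(torch.sum((x - x_hat) ** 2, dim=1))

def eig_term(x, x_hat, beta, l):
    """L_EIG: push the gap between the square roots of the top-l eigenvalues
    of the input batch and of its reconstruction toward beta > 0."""
    def top_sqrt_eigs(z):
        zc = z - z.mean(dim=0, keepdim=True)
        cov = zc.T @ zc / (z.shape[0] - 1)        # batch covariance
        eig = torch.linalg.eigvalsh(cov)          # ascending eigenvalues
        return torch.sqrt(torch.clamp(eig[-l:], min=1e-12))
    gap = top_sqrt_eigs(x) - top_sqrt_eigs(x_hat)
    return torch.sum((gap - beta) ** 2)
```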
The final training loss function is a combination of the two:

$$L(\omega, b) = \theta_1 L_{\mathrm{MSE}}(\omega, b) + \theta_2 L_{\mathrm{EIG}}(\omega, b). \quad (17)$$

Here, $\theta_1, \theta_2 > 0$ are hyperparameters that need to be predetermined. In practice, desirable results are usually obtained with $\theta_1 = 0.008$ and $\theta_2 = 1$.

Specification of outlier ratio

ODBAE identifies outliers based on reconstruction error, with larger reconstruction errors indicating greater abnormality. In the implementation of ODBAE, we provide two strategies for determining the reconstruction error threshold. Given a dataset of $N$ samples, when a user-defined outlier proportion $\delta$ is specified, the model classifies the top $N \times \delta$ samples with the highest reconstruction errors as outliers. Alternatively, when the threshold is determined automatically, the model assesses the distribution of reconstruction errors: if the errors approximately follow a Gaussian distribution, outliers are identified using the 3σ rule; otherwise, kernel density estimation is employed to derive the probability density function of the reconstruction errors, and data points with reconstruction errors exceeding the 95th percentile are considered anomalous.

Anomaly explanation of ODBAE

For each outlier detected by ODBAE, it is necessary to identify its abnormal parameters. ODBAE derives abnormal parameters from the per-parameter reconstruction errors and SHAP values. Specifically, for each outlier, we first sorted its parameters in descending order of reconstruction error. If the sum of the reconstruction errors of the first $n$ parameters exceeded 50% of the outlier's total reconstruction error, these $n$ parameters were considered potentially anomalous. If $n = 1$, we used SHAP values to identify the additional parameter with the greatest impact on this anomalous parameter; if the corresponding SHAP value exceeded the set threshold, both parameters were considered abnormal, and otherwise only the parameter with the highest reconstruction error was considered abnormal. If $n > 1$, we used SHAP values to obtain the additional parameter with the greatest impact on the parameter that had the highest reconstruction error; if the corresponding SHAP value exceeded the set threshold, these two parameters were considered abnormal, and otherwise the top two parameters with the highest reconstruction errors were considered anomalous. In our work, the threshold for SHAP values was set to the mean of the SHAP values obtained during anomaly explanation across all outliers.

Identification of outliers using z-score method

In the developmental and metabolism-related datasets, we first identified mice with anomalies using the z-score method. For example, there were 14 metabolic parameters in the metabolism-related datasets. The z-score of each mouse for each metabolic parameter was calculated as

$$Z_{ij} = \frac{X_{ij} - \mu_i}{\sigma_i},$$

where $Z_{ij}$ is the z-score of the $j$th mouse for the $i$th metabolic parameter, $X_{ij}$ is the value of the $i$th metabolic parameter for the $j$th mouse, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the $i$th metabolic parameter, respectively. The larger the absolute value of $Z_{ij}$, the more the $i$th metabolic parameter of the $j$th mouse deviates from the mean. Therefore, for each metabolic parameter, mice with large absolute z-scores were labeled as outliers.
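The two labeling procedures just described, the automatic reconstruction-error threshold and the per-parameter z-score rule, can be sketched as follows. The normality test and the z-score cutoff of 3.0 are illustrative choices on our part; the text specifies only the 3σ rule, the KDE-based 95th percentile, and "large" absolute z-scores.

```python
import numpy as np
from scipy import stats

def auto_threshold(errors):
    """Automatic threshold on reconstruction errors: the 3-sigma rule when
    the errors look Gaussian, otherwise the 95th percentile of a KDE fit."""
    if stats.normaltest(errors).pvalue > 0.05:    # plausibly Gaussian
        return errors.mean() + 3 * errors.std()
    kde = stats.gaussian_kde(errors)
    grid = np.linspace(errors.min(), errors.max(), 2048)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]
    return grid[np.searchsorted(cdf, 0.95)]       # KDE-based 95th percentile

def zscore_outliers(values, z_cut=3.0):
    """Z-score labeling for one metabolic parameter (z_cut is illustrative)."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > z_cut
```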
Identification of mice with low BMI values

We calculated BMI values for all knockout mice; the mean BMI of the abnormal mice after Ckb knockout in females was lower than that of 97.14% of the mice. Thus only a small number of mice had comparably low BMIs: of 264 single-gene knockout mouse strains in total, only 13 (4.92%) had more than 50% of mice with BMI values below the mean BMI of the Ckb knockout mice. Among these 13 genes, 7 were identified as important genes by ODBAE, and for the remaining 6 genes, a portion of the mice from each corresponding knockout strain were detected as outliers.

Validation of sexually dimorphic genes

When applied to the IMPC metabolic dataset, ODBAE identified a total of 203 metabolism-associated genes, with a marked divergence observed between male- and female-specific genes (Fig. [175]5b). To further validate the biological relevance of these findings, we first screened for genes exhibiting significant sexual dimorphism. In our framework, a gene was considered significant if the proportion of outliers among its corresponding knockout strains exceeded 50%. To ensure that the observed abnormalities were attributable to the gene rather than to random fluctuations, we initially restricted our analysis to genes with sufficient sample sizes in both sexes. Given that the majority of knockout strains in both the male and female datasets comprised seven individuals (Supplementary Fig. [176]11a, b), we retained genes with at least seven samples per sex. Subsequently, to identify genes with pronounced sexual dimorphism, we focused on genes for which the outlier ratio exceeded 50% in only one sex and was zero in the other. This yielded a final set of 18 genes showing strong evidence of sex-specific metabolic effects. To evaluate whether the sexual dimorphism observed in these 18 genes had been previously reported, we conducted a literature search in PubMed using each gene name in combination with the keywords "sexual dimorphism", "sexual", "sex", "male", or "female". This search revealed that 10 of the 18 genes had been previously documented as exhibiting sexual dimorphism.

Comprehensive performance evaluation

We compared ODBAE with MSE-AE, MAE-AE, and DAGMM on the detection of IP and HLP, computing AUC and AP scores with IP and HLP regarded as positives. We implemented DAGMM with minimal modification so that it adapts to our datasets, and we kept its number of layers and the dimension of each layer consistent with ODBAE, MSE-AE, and MAE-AE. For each experiment, the structure of the autoencoder was as follows: the dimension of the hidden layer was the intrinsic dimension of the input dataset, its activation function was ReLU, and the activation function of the output layer was sigmoid. All datasets were normalized before outlier detection. During autoencoder training, Adam^[177]49 was used to optimize the loss function with a learning rate of 10^-3. To evaluate the performance of the various outlier detection methods across a range of predefined outlier ratios, each method was executed 10 times per ratio; we recorded the AUC and AP scores of each run, and the final comparison was based on the mean AUC and AP scores across the ten replicates for each method.
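The scoring protocol itself is straightforward; a minimal sketch with scikit-learn metrics, where reconstruction errors serve directly as anomaly scores and IP/HLP form the positive class:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate_runs(scores_per_run, labels):
    """Mean AUC and AP over repeated runs.

    scores_per_run: list of (n_samples,) reconstruction-error arrays, one per run.
    labels:         (n_samples,) array with 1 for IP/HLP and 0 for normal points.
    """
    aucs = [roc_auc_score(labels, s) for s in scores_per_run]
    aps = [average_precision_score(labels, s) for s in scores_per_run]
    return float(np.mean(aucs)), float(np.mean(aps))
```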
Selection of hyperparameters $\theta_1$, $\theta_2$ in training loss function

Our new training loss function is formulated as $L(\omega, b) = \theta_1 L_{\mathrm{MSE}}(\omega, b) + \theta_2 L_{\mathrm{EIG}}(\omega, b)$; it contains two loss terms. The function of $L_{\mathrm{MSE}}(\omega, b)$ is to drive the autoencoder to reconstruct the input dataset as well as possible, while $L_{\mathrm{EIG}}(\omega, b)$ suppresses complete reconstruction of the dataset and balances the degree of reconstruction along each principal component direction of the input dataset. Obviously, when the ratio $\theta_1 : \theta_2$ is large, the autoencoder reconstructs the input dataset almost perfectly, which adversely affects the detection of HLP. Conversely, if $\theta_1 : \theta_2$ is very small, the autoencoder reconstructs the input dataset poorly, which is disadvantageous for IP detection. It is therefore necessary to choose a reasonable value of $\theta_1 : \theta_2$ so that the autoencoder detects both HLP and IP well. In our work, we found that $\theta_1 : \theta_2 = 0.008 : 1$ yielded the desired results in most experiments. To study the sensitivity of this ratio, we varied its base to see how different bases affect the accuracy of outlier detection; specifically, for a base value $c$, we set $\theta_1$ and $\theta_2$ to $0.008c$ and $c$, respectively. We conducted experiments on the Dry Bean dataset. For a fixed base $c$, we assessed the outlier detection performance of ODBAE across different outlier ratios $\delta$ ($\delta \in \{0.05, 0.1, 0.15, 0.2, 0.25\}$) and then computed the average score across all ratios. The results show that the AUC and AP scores obtained by ODBAE varied only marginally with changes in the base value (Supplementary Table [178]1), indicating that the detection performance is insensitive to the base of $\theta_1$ and $\theta_2$.
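The base scan can be organized as a simple loop; `train_and_score` below is a hypothetical placeholder for one full ODBAE train/evaluate cycle (it is not part of any released API), so the sketch shows only the bookkeeping.

```python
import numpy as np

def train_and_score(theta1, theta2, delta):
    # Hypothetical placeholder for a full ODBAE run returning (AUC, AP).
    raise NotImplementedError

deltas = [0.05, 0.10, 0.15, 0.20, 0.25]
for c in [0.1, 0.5, 1.0, 2.0, 10.0]:           # candidate base values
    theta1, theta2 = 0.008 * c, c              # ratio held fixed at 0.008 : 1
    scores = [train_and_score(theta1, theta2, d) for d in deltas]
    print(c, np.mean(scores, axis=0))          # average AUC and AP over ratios
```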
Determination of hyperparameter β

Although the hyperparameter $\beta > 0$ in Eq. ([179]17) is beneficial for HLP detection, as $\beta$ increases the data reconstruction ability of the autoencoder deteriorates, analogous to an underfitted model in regression analysis, and this adversely affects the detection of IP. Therefore, we must determine how to choose an appropriate value of $\beta$. If the intrinsic and actual dimensions of the dataset are equal, i.e., $l = m$, then most of the outliers in the dataset are HLP. In this case, considering that $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} > 0$ and $\sqrt{\hat{\lambda}_k} > 0$, $k = 1, 2, \ldots, m$, the hyperparameter $\beta$ should satisfy $0 < \beta < \min_{1 \leq i \leq m} \sqrt{\lambda_i}$. According to the previous analysis, as long as $\beta > 0$, the detection of HLP can be improved; meanwhile, to maintain the detection of IP, $\beta$ must be small enough. According to Eq. ([180]12), if $\beta = 0$, then $\hat{R}_k(Y) = R_k(Y)$, i.e., $\hat{Y}_k = Y_k$, $k = 1, 2, \ldots, m$. However, we usually add nonlinear activation functions to the autoencoder to learn nonlinear features of the input dataset; for example, in our work we added a sigmoid activation function to the output layer. Since commonly used nonlinear activation functions saturate, when the input values are very large or very small the corresponding output values hardly change with the input. As a result, for each dimension of the input data, very large or very small values are reconstructed poorly, while the remaining values are reconstructed well. Consequently, most HLP detected by the autoencoder are anomalous in a single dimension, whereas HLP whose values in each dimension are within the normal range are ignored.

It follows that when $\beta$ is too small, it not only adversely affects the detection of HLP but also brings little improvement to the detection of IP. That is to say, there exists $\xi > 0$ such that when $\beta < \xi$, there is little improvement in the detection of IP. If we determine the value of $\xi$ and set $\beta = \xi$, then on the one hand the detection of IP is maintained, and on the other hand we balance the reconstruction of the autoencoder while avoiding the influence of activation-function saturation on data reconstruction, which improves the detection of HLP. In our work, the structure of the autoencoder and the settings of the hyperparameters are as described above. For each dimension of any dataset, input values between 0.15 and 0.85 are hardly affected by the saturation of the nonlinear activation function. Therefore, we choose $\beta$ so that the output values of each dimension lie between 0.15 and 0.85; as a result, for each dimension, the interval length of the output values is 0.7 times that of the input values. Before outlier detection, the input dataset is normalized, so the input values of each dimension lie between 0 and 1. Recalling that $Y_k \sim N(\nu_k, \lambda_k)$ and $\hat{Y}_k \sim N(\nu_k, \hat{\lambda}_k)$, $k = 1, 2, \ldots, m$, the probability that $Y_k$ falls in $(\nu_k - 3\sqrt{\lambda_k}, \nu_k + 3\sqrt{\lambda_k})$ is 0.9974, so the length of its distribution interval is approximately $6\sqrt{\lambda_k}$; similarly, the distribution interval length of $\hat{Y}_k$ is approximately $6\sqrt{\hat{\lambda}_k}$. We can therefore control $\sqrt{\hat{\lambda}_k} = 0.7\sqrt{\lambda_k}$, which gives $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = 0.3\sqrt{\lambda_k}$, $k = 1, 2, \ldots, m$.

Based on the above discussion, we can summarize how to determine the value of $\beta$ when $l = m$: if $\max_{1 \leq i \leq m}(0.3\sqrt{\lambda_i}) \leq \min_{1 \leq i \leq m} \sqrt{\lambda_i}$, we set $\beta = \max_{1 \leq i \leq m}(0.3\sqrt{\lambda_i})$; otherwise, we set $\beta = \min_{1 \leq i \leq m} \sqrt{\lambda_i}$. In practical applications, we often encounter datasets whose intrinsic dimension $l$ is smaller than the actual dimension $m$. In this case, only $l$ eigenvalues of the covariance matrix of the dataset are not close to 0; we denote them as $\lambda_k$, $k = 1, 2, \ldots, l$. In the training process, we only need to control these $l$ eigenvalues so that $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = 0.3\sqrt{\lambda_k}$, $k = 1, 2, \ldots, l$. Besides, since $\sqrt{\lambda_k} > 0$, $\beta$ should satisfy $0 \leq \beta \leq \min_{1 \leq i \leq l} \sqrt{\lambda_i}$. In fact, if $l < m$, there will be some data points that have a large impact on the reconstruction ability of the autoencoder (i.e., IP), so it is important not only to improve the detection of HLP but also to maintain the detection of IP. We therefore ensure that $\beta$ is small enough but not equal to 0; for convenience, we also set $0 < \beta < \min_{1 \leq i \leq l} \sqrt{\lambda_i}$ in this case. Finally, the determination scheme for the value of $\beta$ is the same as when $l = m$.
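Under the reading above, the selection rule reduces to a single expression; a sketch, where `eigenvalues` stands for the $l$ eigenvalues of the data covariance matrix that are not close to 0:

```python
import numpy as np

def choose_beta(eigenvalues, frac=0.3):
    """Beta selection sketch: beta = max(0.3*sqrt(lam)) when that bound stays
    below min(sqrt(lam)), and min(sqrt(lam)) otherwise. frac = 0.3 reflects
    controlling sqrt(lam_hat) = 0.7 * sqrt(lam)."""
    root = np.sqrt(np.asarray(eigenvalues, dtype=float))
    return min(frac * root.max(), root.min())
```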
Statistics and reproducibility

All data analyses were performed using Python 3.7.7 and R 3.4.3. To ensure the reproducibility of our findings, we provide detailed descriptions of the data generation process, including the sample sizes used in the synthetic data experiments. For experiments involving real-world datasets, we report the sources of the data, the preprocessing procedures, and all relevant analysis details. When comparing the outlier detection performance of ODBAE against other models, we ran each model 10 times under fixed outlier ratios and recorded the AUC score of each run; the final performance comparisons are based on the average AUC scores across runs to ensure robustness and reproducibility. Similarly, to assess the impact of dataset dimensionality, noise level, and model architecture on ODBAE's performance, we report the average AUC scores obtained from 10 independent runs under each condition.

Ethics statement

The gene knockout mouse data used in this study were obtained from the International Mouse Phenotyping Consortium (IMPC), and no experiments were conducted here. All IMPC research centers make every effort to minimize animal suffering through careful housing and husbandry, with detailed information available at the IMPC portal: [181]https://www.mousephenotype.org/about-impc/arrive-guidelines. Ethics statements from individual research centers are provided in Supplementary Data [182]4.

Reporting summary

Further information on research design is available in the [183]Nature Portfolio Reporting Summary linked to this article.

Supplementary information

[184]Supplementary Information (PDF)
[185]Description of Additional Supplementary Files (PDF)
[186]Supplementary Data 1 (XLSX)
[187]Supplementary Data 2 (XLSX)
[188]Supplementary Data 3 (XLSX)
[189]Supplementary Data 4 (XLSX)
[190]Reporting Summary (PDF)

Acknowledgements