Abstract
Identifying complex phenotypes from high-dimensional biological data is
challenging due to the intricate interdependencies among different
physiological indicators. Traditional approaches often focus on
detecting outliers in single variables, overlooking the broader network
of interactions that contribute to phenotype emergence. Here, we
introduce ODBAE (Outlier Detection using Balanced Autoencoders), a
machine learning method designed to uncover both subtle and extreme
outliers by capturing latent relationships among multiple physiological
parameters. ODBAE’s revised loss function enhances its ability to
detect two key types of outliers: influential points (IP), which
disrupt latent correlations between dimensions, and high leverage
points (HLP), which deviate from the norm but go undetected by
traditional autoencoder-based methods. Using data from the
International Mouse Phenotyping Consortium (IMPC), we show that ODBAE
can identify knockout mice with complex, multi-indicator
phenotypes—normal in individual traits, but abnormal when considered
together. In addition, this method reveals novel metabolism-related
genes and uncovers coordinated abnormalities across metabolic
indicators. Our results highlight the utility of ODBAE in detecting
joint abnormalities and advancing our understanding of homeostatic
perturbations in biological systems.
Subject terms: Software, Bioinformatics
__________________________________________________________________
ODBAE offers a powerful approach for detecting complex anomalies and
characterizing unknown phenotypes within biological systems.
Introduction
Phenotypes, including symptoms, defined as the observable
characteristics or traits of an organism, are the result of complex
interactions between genotype and environmental factors^[34]1,[35]2.
These traits are often quantified using physiological or pathological
indicators that serve as proxies for the underlying biological or
pathological processes. Recent advances in phenotypic research,
particularly through genetic manipulation, have deepened our
understanding of gene function and the mechanisms underlying
pathological conditions across a wide range of organisms, including
plants and animals^[36]2. Traditionally, gene-related phenotypic
analyses have focused on identifying abnormalities in individual
physiological indicators^[37]3–[38]5. However, findings from the
International Mouse Phenotyping Consortium (IMPC) demonstrate strong
correlations between physiological indicators, suggesting that many
traits and diseases result not from isolated abnormalities, but from
coordinated disruptions across multiple physiological indicators^[39]6.
Our current understanding of phenotypes or diseases often emphasizes
the detection of outliers in individual physiological indicators,
typically those outside the normal physiological range. However, the
emergence of disease in an organism is a more complex process.
Homeostasis, the dynamic balance maintained by biological systems, can be
perturbed at multiple levels before any single indicator deviates outside
the normal range^[40]7. This suggests that phenotypic abnormalities may
manifest as imbalances between correlated indicators, even when each
individual measure remains within its expected range. Such imbalances
may represent early warning signs of disease or dysfunction that are
missed by traditional univariate analysis.
We hypothesize that the coordinated perturbation of physiological
indicators reflects an emerging phenotype or disease state, even when
individual measures appear normal. By examining these subtle
interdependencies, we can gain insight into how homeostasis is
perturbed in knockout mouse models. Furthermore, for genes whose
knockout does not result in obvious outliers in individual parameters,
disruption of homeostatic balance may still occur through correlated
indicators. This approach challenges the notion that the absence of a
detectable abnormal phenotype in knockout mice indicates no functional
impact, and instead suggests that our focus on single-indicator
abnormalities may obscure more complex systemic effects.
Existing methods, such as linear regression^[41]6,[42]8 and basic
bioinformatics tools^[43]9–[44]11, have provided initial insights into
these relationships. However, these techniques struggle to capture
non-linear and complex relationships between multiple physiological
indicators, especially in high-dimensional biological datasets. Machine
learning methods, particularly autoencoders, have emerged as powerful
tools for detecting outliers by learning complex, latent patterns
within data^[45]12–[46]18. By compressing and reconstructing data,
autoencoders can capture subtle variations and relationships that may
indicate phenotypic abnormalities across multiple indicators. Their
ability to learn from high-dimensional data makes them well suited for
analyzing the intricate dynamics involved in phenotypic
expression^[47]19,[48]20. In regression analysis, outliers can be
categorized as influential points (IP), which exert a substantial
influence on the model’s fitting results, and high leverage points (HLP),
which deviate from the center of the dataset but do not necessarily
impact the model fit^[49]21. Since the reconstruction of any dataset by
an autoencoder can be regarded as a regression process, IP are easily
detected when autoencoders are used for outlier detection. However,
considerable uncertainty remains about how well autoencoders detect HLP.
In this study, we introduce Outlier Detection using Balanced
Autoencoders (ODBAE), a machine learning method designed to detect
outliers by capturing the relationships between physiological
indicators. ODBAE improves on traditional autoencoders by balancing the
reconstruction error across principal component directions, enhancing
its ability to detect both IP, which deviate from the expected
relationships between indicators, and HLP, which lie far from the data
center. We apply ODBAE to developmental and metabolic datasets from
IMPC to explore how coordinated perturbations in physiological
parameters can reveal previously unidentified phenotypes, even in cases
where knockout mice show no abnormalities in individual traits. By
using ODBAE, we aim to identify novel gene functions and phenotypic
complexities that have been missed by single-indicator screening
methods. Collectively, these findings establish ODBAE as a powerful
tool for identifying complex anomalies and unknown phenotypes within
biological systems.
Results
Outlier detection in ODBAE
ODBAE takes tabular datasets from various sources as input, such as the
gene knockout mouse dataset from the IMPC and patient health examination
data. Each row in the data represents a record, and each column
represents an attribute. The goal of ODBAE is to maintain the
effectiveness of the autoencoder in detecting IP while further
improving the detection performance for HLP. This is a comprehensive
approach to outlier detection in high-dimensional tabular datasets.
Furthermore, ODBAE also provides an anomaly explanation for each
outlier, indicating which parameter or parameters are associated with
the anomaly. For gene knockout mouse datasets, ODBAE can identify genes
that, when knocked out, lead to anomalies, as well as the abnormal
parameter pairs associated with the gene knockout.
Overview of ODBAE
To address the challenge of outlier detection in high-dimensional
tabular datasets, we construct an improved autoencoder model (ODBAE)
with a refined training loss function. Our model leverages the strong
representation learning and feature extraction capabilities of
traditional autoencoders for IP, while mitigating their limitations in
HLP detection (“Methods”). The core principle underlying ODBAE-based
outlier detection is that inliers can be well reconstructed, whereas
outliers, including both HLP and IP, may generate significant
reconstruction errors. ODBAE performs unsupervised outlier detection on
high-dimensional input datasets. During the training process, ODBAE
learns as much intrinsic information as possible from the training set
and reconstructs it by minimizing the loss function. The training
dataset selection strategy depends on the initial outlier prevalence
within the dataset. For datasets with few outliers, the entire dataset
is used for both training and testing, as with the wild-type mouse data
from the IMPC. However, if the overall proportion of outliers is
unclear, a subset of the data exhibiting fewer anomalies is chosen for
model training. Then, the entire dataset, or another subset, can be used
for outlier detection. Subsequently, the trained model is applied to the
test dataset, generating reconstruction errors for all sample points.
Finally, sample points with reconstruction errors greater than a
predefined threshold are considered outliers. As shown in Fig. [50]1a,
outlier detection from biological datasets by ODBAE includes three
steps: (1) given a training dataset as input, the intrinsic information
of normal data points is learned; (2) the trained model is used to
reconstruct the test dataset, and outliers are identified according to
their large reconstruction errors (“Methods”); (3) the detected
outliers, including HLP and IP, are explained using the highest
reconstruction errors and kernel-SHAP to identify the abnormal
parameters^[51]22.
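To make this workflow concrete, a minimal PyTorch sketch of the three steps is shown below. The network shape, optimizer settings, and quantile threshold are illustrative assumptions rather than the exact configuration, whose details are given in “Methods”; the revised loss and the explanation step are sketched separately below.

```python
# Minimal sketch of the ODBAE workflow (train -> reconstruct -> threshold).
# Architecture and the 98th-percentile threshold are illustrative assumptions.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features: int, n_hidden: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def fit(model, train_x, loss_fn, n_epochs=200, lr=1e-3):
    """Step 1: learn the intrinsic structure of the (mostly normal) training set."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = loss_fn(train_x, model(train_x))  # ODBAE swaps in its revised loss
        loss.backward()
        opt.step()

def detect(model, test_x, quantile=0.98):
    """Step 2: reconstruct the test set; high-error points are flagged as outliers."""
    with torch.no_grad():
        err = ((test_x - model(test_x)) ** 2).sum(dim=1)  # per-sample error
    return err, err > torch.quantile(err, quantile)
```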
Fig. 1. ODBAE is an approach for identifying and explaining outliers in
tabular datasets; it can discover complex phenotypes and identify novel genes
when applied to knockout mouse datasets.
a Process for discovering complex phenotypes. ODBAE first detects
outliers based on a balanced autoencoder, which balances the
reconstruction of the training dataset to detect both HLP and IP, then
performs anomaly explanation to identify abnormal parameters. Finally,
some outliers are identified due to joint anomalies of multiple
parameters. b Anomaly explanation in ODBAE. The detected outliers are
explained in terms of SHAP values. c For both males and females,
abnormal mice are first identified according to the presence of
metabolic parameters that deviate too far from the mean, and the same
abnormal proportion is then used to identify abnormal mice with ODBAE.
Finally, some important genes are obtained only by ODBAE. d Genes for
which more than 50% of mice in the corresponding single-gene knockout
strain are abnormal are considered important genes by ODBAE. If an
important gene has no known metabolic links in the MGI database, it is
considered a novel gene and further validated by SNP analysis of human
orthologues.
Based on mathematical analysis, ODBAE uses a revised training loss
function that adds an appropriate penalty term to the Mean Square Error
(MSE), balancing the reconstruction by properly suppressing complete
reconstruction by the autoencoder (“Methods”). Specifically, the penalty
term ensures an equal eigenvalue difference between the training and
reconstructed datasets along each principal component direction.
Therefore, the penalty term enhances the detection of HLP, while the MSE
term maintains the identification performance for IP.
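A hedged sketch of such a loss is given below: the MSE term is kept, and an added penalty compares the variance of the input and reconstruction along each principal component direction, pushing the per-direction eigenvalue gaps toward a common value. The exact penalty used by ODBAE is defined in “Methods”; this form and the weight `alpha` are assumptions for illustration.

```python
# Illustrative balanced loss: MSE plus a penalty that equalizes the
# eigenvalue (variance) gap between input and reconstruction across all
# principal component directions. The exact ODBAE penalty is in "Methods".
import torch

def balanced_loss(x, x_hat, alpha=1.0):
    mse = ((x - x_hat) ** 2).sum(dim=1).mean()

    # Principal directions of the input batch.
    xc = x - x.mean(dim=0)
    cov_x = xc.T @ xc / (x.shape[0] - 1)
    _, eigvecs = torch.linalg.eigh(cov_x)

    # Per-direction variance of input and reconstruction.
    var_x = (xc @ eigvecs).var(dim=0)
    var_hat = ((x_hat - x_hat.mean(dim=0)) @ eigvecs).var(dim=0)

    # Penalize unequal eigenvalue gaps so no direction is reconstructed
    # disproportionately well (which would hide HLP along it).
    gap = var_x - var_hat
    penalty = ((gap - gap.mean()) ** 2).sum()
    return mse + alpha * penalty
```

In this sketch the penalty vanishes exactly when every principal direction loses the same amount of variance after reconstruction, which is the balancing behavior described above.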
To explain each outlier detected by ODBAE, we first identify the top
features that contribute the most to the reconstruction error, and then
apply kernel-SHAP to obtain the features that have the greatest impact
on them (Fig. [54]1b and “Methods”). Finally, ODBAE outputs
instance-based outliers and their explanations. For outliers grouped
into categories, if the anomaly rate of a category exceeds the set
threshold, ODBAE provides an anomaly explanation for that category
according to the mean value of each feature.
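The sketch below illustrates this two-stage explanation using the `shap` package; the background-sample choice and helper names are assumptions about how a trained model would be wired up.

```python
# Sketch of the two-stage anomaly explanation, assuming a trained `model`
# and the `shap` package. Helper and variable names are hypothetical.
import numpy as np
import shap
import torch

def explain_outlier(model, background, outlier):
    """background: (k, m) inlier sample; outlier: (1, m) detected outlier."""
    with torch.no_grad():
        recon = model(torch.as_tensor(outlier, dtype=torch.float32)).numpy()
    # Stage 1: the feature contributing most to the reconstruction error.
    top = int(np.argmax(((outlier - recon) ** 2)[0]))

    # Stage 2: kernel-SHAP attributes that feature's error to the inputs.
    def top_feature_error(x):
        with torch.no_grad():
            xh = model(torch.as_tensor(x, dtype=torch.float32)).numpy()
        return (x[:, top] - xh[:, top]) ** 2

    explainer = shap.KernelExplainer(top_feature_error, background)
    return top, explainer.shap_values(outlier)
```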
ODBAE’s comprehensive benefits
The power of ODBAE lies in its ability to efficiently detect both HLP
and IP in complex, high-dimensional biological datasets. HLP are
outliers that deviate from the central distribution, while IP
significantly influence model results. To evaluate the performance of
ODBAE in detecting IP, we compared it to principal component
analysis^[55]23, a common outlier detection method. ODBAE demonstrated
superior performance, particularly in identifying IP, due to its
enhanced ability to capture nonlinear relationships between data points
(Supplementary Fig. [56]1a–d).
We then applied ODBAE to phenotypic data from the IMPC to identify
complex phenotypes in knockout mouse models. Using a threshold of the
top 2% of absolute z-scores^[57]24 for any given physiological
parameter, we flagged mice as potential outliers (“Methods”). We then
applied ODBAE with a predefined abnormality ratio (Fig. [58]1c). If
more than 50% of mice from a single-gene knockout strain were
classified as outliers, the corresponding gene was identified as
significant and further analyzed (Fig. [59]1d). ODBAE was able to
identify key mutant strains with complex phenotypes, even in cases
where individual physiological indicators appeared normal.
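A pandas sketch of this screening rule is shown below; the `gene` column name and DataFrame layout are assumptions about how the IMPC records are organized.

```python
# Sketch of the strain-level screen: per-parameter z-scores, a top-2% cut,
# and a >50% abnormal fraction per knockout strain. Column names are
# hypothetical.
import pandas as pd

def significant_genes(df: pd.DataFrame, params, z_quantile=0.98):
    z = (df[params] - df[params].mean()) / df[params].std()
    cut = z.abs().quantile(z_quantile)            # per-parameter cutoff
    is_outlier = (z.abs() > cut).any(axis=1)      # flagged on any parameter
    frac = is_outlier.groupby(df["gene"]).mean()  # abnormal fraction per strain
    return frac[frac > 0.5].index.tolist()
```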
To illustrate, we analyzed eight developmental parameters: Body length
(BL), Body weight (BW), Bone Area (BA), Bone Mineral Density (excluding
skull), Distance travelled total, Forelimb and hindlimb grip strength
measurement mean, heart rate, and Heart weight. Using wild-type mice as
the training set and knockout mice as the test set, we evaluated 1904
single-gene knockout mouse strains (“Methods”). Given the prevalence of
sexual dimorphism in disease phenotypes^[60]3,[61]25, males and females
were analyzed separately (Supplementary Data [62]1a–d). In the female
dataset, ODBAE identified Ckb null mice as outliers despite their
individual parameter values being within the normal range. These mice
had normal BL and BW, but their body mass index (BMI) was abnormally
low. Specifically, four of the eight Ckb null mice had extremely low
BMI values, calculated as BMI = BW (g)/BL^2 (cm)^[63]26–[64]28.
Analysis revealed that while BL and BW were within the expected ranges,
their relationship was abnormal, leading to the identification of these
mice as outliers (Fig. [65]2a). The average BMI of these Ckb-deficient
mice was lower than 97.14% of other mice (Fig. [66]2b), demonstrating
the sensitivity of ODBAE to complex, multivariate outliers. In
addition, Ckb has previously been implicated in developmental processes
and obesity, further strengthening the biological significance of these
findings^[67]29.
Fig. 2. ODBAE accurately identifies outliers with multi-dimensional joint
anomalies.
a Scatter plot of parameters BL and BW for the mouse developmental
dataset; the data points corresponding to mice with the gene Ckb
knocked out deviate significantly from most of the data points, but
their values of both BL and BW are within the normal range. The BL or BW
values of the data points in the shaded part are ranked in the top 2%
based on the absolute value of the z-score. b The distribution of
standardized BMI of all knockout mice in the mouse developmental
dataset; the mean BMI of Ckb knockout mice is significantly smaller.
c–h Outlier detection
results for 3 synthetic 2-dimensional Gaussian distribution datasets
when the outlier ratio is 0.05, including detection results of
MSE-trained autoencoder and ODBAE for 2-dimensional Gaussian
distribution dataset with diagonal covariance matrix and non-Gaussian
distributed noise (c, f), 2-dimensional Gaussian distribution dataset
with diagonal covariance matrix (d, g) and 2-dimensional Gaussian
distribution dataset with non-diagonal covariance matrix (e, h). The
orange line and grey line in (c) represent two principal directions. i,
j Outlier detection results of MSE-trained autoencoder (i) and ODBAE
(j) for 3-dimensional dataset with an intrinsic dimension of 2 when the
outlier ratio is 0.05, and the data points follow a 2-dimensional
Gaussian distribution with respect to Parameter1 and Parameter3.
ODBAE also identified previously unknown phenotypes, highlighting its
potential to uncover novel gene functions. In terms of the
identification of HLP, most studies choose MSE as the training loss
function and force the autoencoder to completely reconstruct the
training dataset, which brings two major limitations: (1) only HLP
along principal component directions with poor reconstruction can be
detected; (2) in the process of anomaly detection, each dimension of
the dataset is independently detected. To evaluate the performance
difference between our modified loss function and MSE in HLP detection,
we used these two trained autoencoders to detect outliers in several
datasets following multi-dimensional Gaussian distributions. For
example, the data in Fig. [70]2c–e were three 2-dimensional Gaussian
distribution datasets, and the data in Fig. [71]2i formed a
3-dimensional dataset whose intrinsic dimension was two; its data points
followed a 2-dimensional Gaussian distribution with respect to
Parameter1 and Parameter3 (“Methods”). The detection result in Fig. [72]2c shows that
most of the detected HLP lie in the direction of the principal
components indicated by the blue line, but few lie in the other
direction. In Fig. [73]2d, HLP in two dimensions of the dataset are
independently detected, making outliers that are anomalous in multiple
dimensions (outliers at the four corners) undetectable. Fig. [74]2i
demonstrates the same issue in three-dimensional space. Moreover, these
issues persisted when the activation function was replaced with another
function (Supplementary Figs. [75]2a–d and [76]3a–d). This is because
the saturating nature of the non-linear activation function causes the
autoencoder to have different reconstruction capabilities for data
points located at the boundaries of the dataset compared to those
located in the interior (Supplementary Fig. [77]4a, c and Supplementary
Note [78]1). Usually, the distribution of a dataset along various
principal component directions may not be balanced, and the number of
data points falling into unreconstructable regions may also be uneven
(“Methods”). This leads to different detection performances for HLP
along different principal component directions. Besides, the
distribution of the reconstructed dataset is independent across all
dimensions (Supplementary Fig. [79]4a, c). Therefore, the new loss
function in ODBAE is designed to ensure that the reconstructed dataset
follows a balanced joint distribution (Supplementary Fig. [80]4b, d).
Then, we used ODBAE to detect outliers in these datasets and found that
most outliers were identified (Fig. [81]2f–h, j).
We further evaluated ODBAE’s detection of both IP and HLP using a
synthetic 3-dimensional dataset (Supplementary Fig. [82]5a). Area Under
the Curve (AUC) and Average Precision (AP) scores were used to evaluate the
accuracy of outlier detection results (“Methods”). The generated
3-dimensional training dataset formed a 2-dimensional manifold, where
the subspace represented by Parameter1 and Parameter3 followed a
2-dimensional Gaussian distribution (“Methods”, Supplementary
Fig. [83]5a). Here, we assumed that the ratio of HLP and the ratio of
IP were equal. During the anomaly detection, we used the training
dataset as the test dataset. Then, we considered the top ⌊δn⌋ data
points in the subspace represented by Parameter1 and Parameter3, ranked
by Mahalanobis distance, as HLP, where δ was the outlier ratio, and n
was the number of data points in the training dataset. In addition, we
generated ⌊δn⌋ points that were not on the manifold of the
training dataset and treated them as IP (Supplementary Fig. [84]5b).
The results showed that ODBAE worked better than other anomaly
detection methods (Supplementary Fig. [85]5c, d).
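The evaluation protocol can be sketched as follows: ground-truth HLP labels come from the Mahalanobis ranking in the (Parameter1, Parameter3) subspace, and the reconstruction errors are scored with AUC and AP (scikit-learn scoring functions assumed).

```python
# Sketch of the HLP labelling and scoring used in the synthetic evaluations.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def hlp_labels(sub, delta):
    """Mark the top floor(delta * n) points by Mahalanobis distance as HLP."""
    diff = sub - sub.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(sub, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distances
    labels = np.zeros(len(sub), dtype=int)
    labels[np.argsort(d2)[-int(np.floor(delta * len(sub))):]] = 1
    return labels

# scores = per-sample reconstruction errors from the trained model, then:
# roc_auc_score(labels, scores), average_precision_score(labels, scores)
```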
In summary, ODBAE is a powerful tool for detecting complex phenotypes
in high-dimensional biological datasets, capturing both isolated and
coordinated abnormalities. This capability allows the identification of
novel gene functions and previously undetected phenotypes, particularly
in cases where traditional methods fail to detect subtle perturbations
in physiological homeostasis.
ODBAE outperforms other methods by a large margin
To evaluate the strengths of ODBAE, we performed outlier detection with
ODBAE on low-dimensional synthetic datasets, high-dimensional synthetic
datasets, and two benchmark datasets. ODBAE was compared with an
autoencoder trained with the MSE loss function (MSE-AE), an autoencoder
trained with the Mean Absolute Error loss function (MAE-AE), and the
Deep Autoencoding Gaussian Mixture Model (DAGMM)^[86]30. The comparison
results are presented as AUC and AP scores.
We generated three datasets with 2-dimensional Gaussian distributions,
whose intrinsic dimension was also 2. In one of the datasets, the
correlation between the two dimensions was small (Fig. [87]3a); in
another, the correlation between the two dimensions was relatively large
(Fig. [88]3b); in the last dataset, the correlation between the two
dimensions was also small and non-Gaussian distributed noise was present
(Fig. [89]3c). Therefore, the outliers in these three datasets were all
HLP. We then set different outlier ratios and
calculated the corresponding AUC and AP scores of the outlier detection
results. To be specific, if the outlier ratio is δ and the number of
sample points is n, then the input data is sorted according to their
Mahalanobis distance, and the first ⌊δn⌋ sample points are regarded as
positive. For each fixed anomaly ratio δ, outlier detection was
repeated 10 times on the dataset, and the average accuracy scores for
each method were recorded. Fig. [90]3d–f show the AUC scores of the test
results on these three datasets, and Fig. [91]3g–i show the
corresponding AP scores. ODBAE outperformed the other schemes in detecting HLP
in all three datasets at any outlier ratio (Supplementary
Data [92]2a–c).
Fig. 3. The outlier detection performance of ODBAE is generally better than
that of other schemes.
a–c Data point distribution plots for 3 synthetic datasets: a
2-dimensional Gaussian distribution dataset with diagonal covariance
matrix (a), a 2-dimensional Gaussian distribution dataset with
non-diagonal covariance matrix (b), and a 2-dimensional Gaussian
distribution dataset with diagonal covariance matrix and non-Gaussian
distributed noise (c). d–f AUC scores of the comparison results for the
datasets in (a), (b), and (c), respectively. g–i AP scores of the
comparison results for the datasets in (a), (b), and (c), respectively.
j, k Comparison results for the Dry Bean Dataset. l, m Comparison results for
the Breast Cancer Dataset. In (d–m), all methods were run 10 times under
different outlier ratios, and error bars indicate mean ± standard
error.
Since autoencoders are often used for outlier detection in
high-dimensional datasets, we also analyzed the outlier detection
results of ODBAE on two high-dimensional synthetic datasets. Both
datasets followed multi-dimensional Gaussian distributions with diagonal
covariance matrices, so the intrinsic dimension of each dataset was
equal to its actual dimension. One of the datasets was 50-dimensional
and the other was 100-dimensional. ODBAE still performed best in HLP detection
on high-dimensional datasets (Supplementary Fig. [95]6a–d).
To further evaluate the robustness of ODBAE under more realistic
conditions where data distributions deviate from Gaussian assumptions,
we applied the method to two additional synthetic datasets: one sampled
from a two-dimensional Laplace distribution and the other from a
two-dimensional uniform distribution. In both scenarios, ODBAE again
exhibited superior outlier detection performance (Supplementary
Fig. [96]7a–d).
Additionally, we tested ODBAE on two benchmark datasets: the Dry Bean
Dataset and the Breast Cancer Dataset. The Dry Bean Dataset contained
13611 instances and 17 attributes, with 7 classes in the original
dataset. In each experiment, the inliers were 2000 data points from a
given class, and δ × 2000 data points were randomly selected from the
remaining 6 classes as outliers, where δ was the outlier ratio and
δ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. For this dataset, we set the dimension of
the hidden layer of the autoencoder to 8 during the experiment. The
Breast Cancer Dataset contained 569 instances and 31 attributes. All
data points fell into two categories: benign and malignant. We randomly
selected 300 data points labeled benign as inliers and δ × 300 data
points from the malignant category as outliers. For this dataset, the
outlier ratio δ ∈ {0.02, 0.04, 0.06, 0.08, 0.1} and the dimension of the
hidden layer of the autoencoder was 15. Overall, due to the improvement in HLP
detection, ODBAE worked best on both benchmark datasets (Fig. [97]3j–m
and Supplementary Data [98]2d, e).
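The benchmark test sets can be assembled as in the sketch below (Dry Bean shown); the loading step and variable names are assumptions.

```python
# Sketch of benchmark construction: 2000 inliers from one class plus
# floor(delta * 2000) outliers drawn from the remaining classes.
import numpy as np

def make_benchmark(X, y, inlier_class, n_in=2000, delta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    inliers = X[y == inlier_class]
    others = X[y != inlier_class]
    n_out = int(delta * n_in)
    data = np.vstack([
        inliers[rng.choice(len(inliers), n_in, replace=False)],
        others[rng.choice(len(others), n_out, replace=False)],
    ])
    labels = np.r_[np.zeros(n_in, dtype=int), np.ones(n_out, dtype=int)]
    return data, labels
```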
In summary, ODBAE outperforms several existing autoencoder-based
methods in detecting anomalies in both synthetic and benchmark datasets,
indicating that ODBAE can identify anomalous situations more
comprehensively. Therefore, ODBAE provides strong support for finding
unknown anomalies and exploring unknown phenotypes in existing
biological data.
ODBAE exhibits a certain degree of robustness
The performance of ODBAE, like any machine learning framework, is
influenced by several hyperparameters and data conditions. We evaluated
the impact of dataset dimensionality, noise levels, and model
architecture on the outlier detection performance of ODBAE.
We generated Gaussian-distributed datasets of varying dimensionality to
train ODBAE, and evaluated its outlier detection performance by
measuring the AUC scores when the outlier ratio was set to 0.1. As
illustrated in Fig. [99]4a, the anomaly detection performance of ODBAE
remains largely stable as the dimensionality of the dataset increases.
This observation highlights the potential applicability of ODBAE to
outlier detection tasks in high-dimensional datasets.
Fig. 4. ODBAE exhibits a certain degree of robustness.
a Outlier detection performance of ODBAE across datasets with varying
dimensionalities (Supplementary Data [102]2f). For each
fixed-dimensional dataset, the model was trained 10 times, and the AUC
score for outlier detection was recorded after each training run. b
Impact of varying noise intensities on the performance of ODBAE
(Supplementary Data [103]2g). Laplace-distributed noise L(0, s) was
added to the dataset to simulate different noise levels. For each fixed
value of s, the model was trained 10 times on the noise-augmented data,
and the AUC score for outlier detection was recorded after each
training run. c Outlier detection accuracy of ODBAE with architecture
(L, C) on the Dry Bean Dataset (Supplementary Data [104]2h). For each
fixed architecture (L, C), the model was trained 10 times, and the
average AUC score across the 10 runs was recorded to evaluate
performance. Error bars indicate mean ± standard error.
Considering that real-world datasets frequently contain noise, we
further evaluated the impact of data perturbations on the anomaly
detection performance of ODBAE. A two-dimensional Gaussian-distributed
dataset V containing 10,000 samples was generated. To simulate varying
noise intensities, we added perturbation ϵ to each data point, where ϵ
was drawn from a Laplace distribution L(0, s), with larger s values
representing higher noise levels. We then evaluated the performance of
ODBAE trained on the noisy datasets in detecting outliers within the
original dataset V, with the outlier ratio set to 0.1. As shown in
Fig. [105]4b, ODBAE maintained high outlier detection accuracy under
moderate noise levels (s ≤ 0.1), highlighting its robustness in the
presence of noise typically observed in real-world data.
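The noise experiment follows the recipe sketched below; `train_odbae` and `evaluate_auc` are hypothetical stand-ins for the training and scoring routines described above.

```python
# Sketch of the noise-robustness experiment: train on V + Laplace(0, s)
# noise, then score the clean dataset V at an outlier ratio of 0.1.
# train_odbae / evaluate_auc are hypothetical helpers.
import numpy as np

rng = np.random.default_rng(0)
V = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=10_000)

for s in (0.01, 0.05, 0.1, 0.5):
    noisy = V + rng.laplace(loc=0.0, scale=s, size=V.shape)
    # model = train_odbae(noisy)
    # auc = evaluate_auc(model, V, outlier_ratio=0.1)
```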
In addition, we investigated the influence of model architecture on the
performance of ODBAE. We similarly employed the Dry Bean dataset,
randomly selecting 2000 samples from a single class to form the
training dataset. Additionally, 200 outlier samples were randomly drawn
from the remaining 6 classes. We then evaluated the outlier detection
accuracy of ODBAE with varying model architectures denoted as (L, C),
where the model consists of 2L + 1 hidden layers, with all layers
except the bottleneck layer having dimensionality C. The results
demonstrate that both the number of layers L and the dimensionality of
each layer C significantly influence the performance of ODBAE
(Fig. [106]4c). Therefore, in practical applications, tuning these
architectural parameters—specifically, the number of layers and the
dimensionality of each layer—can help optimize the outlier detection
performance of ODBAE.
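The (L, C) family can be built as in the sketch below; the activation function and bottleneck width are assumptions.

```python
# Sketch of the (L, C) architecture family: 2L + 1 hidden layers, all of
# width C except the central bottleneck. Activation choice is an assumption.
import torch.nn as nn

def build_odbae(n_features, L, C, bottleneck):
    widths = [n_features] + [C] * L + [bottleneck] + [C] * L + [n_features]
    layers = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
    return nn.Sequential(*layers[:-1])  # no activation on the output layer
```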
In summary, ODBAE maintains high outlier detection accuracy in
high-dimensional datasets and exhibits robustness to moderate levels of
noise. Furthermore, its performance can be further optimized in
practical applications by tuning the number of layers and the
dimensionality of each layer in the model architecture.
ODBAE identifies new metabolic genes
We applied ODBAE to metabolism-related datasets from IMPC to uncover
genetic elements involved in metabolic regulation. Fourteen key
metabolic parameters were selected for analysis, including Alanine
aminotransferase (ALA), Albumin (Alb), Alkaline phosphatase, Aspartate
aminotransferase (ASA), Calcium (Ca), Creatinine (Cre), Glucose (Glu),
HDL-cholesterol (HDLC), Phosphorus (Ph), Total bilirubin (TB), Total
cholesterol (TC), Total protein (TP), Triglycerides (TG), and Urea. In
total, we analyzed 45922 mice from 3064 single-gene knockout strains,
including 23024 males and 22898 females.
ODBAE was used to detect outliers in both the male and female datasets.
Initially, mice with absolute z-scores in the top 1.2% for at least one
metabolic parameter were flagged as outliers. Of the 23024 male mice,
2672 were flagged as abnormal, and of the 22898 female mice, 2553 were
flagged. To ensure robust detection, we adjusted the proportion of
anomalies and identified the top 10% of mice with the highest
reconstruction errors as outliers. Genes were considered significant if
more than 50% of mice from a given knockout strain were identified as
outliers, suggesting a strong association between these genes and
metabolism. In total, ODBAE identified 128 significant genes in males
(9.5%) and 111 significant genes in females (8.08%), for a total of 203
genes (Fig. [107]5a, b and Supplementary Data [108]1e–h).
Fig. 5. ODBAE identifies new metabolic genes from a metabolism-related
dataset of knockout mice.
a The proportion of important genes identified in the male and female
datasets. b ODBAE identified 128 important genes in the male dataset
and 111 important genes in the female dataset, for a total of 203
important genes. c Sexually dimorphic genes identified by ODBAE, with
previously reported sexually dimorphic genes highlighted in red. d
Visualization results of outliers corresponding to gene Crabp1 in the
subspace formed by abnormal parameters. The absolute z-score values of
Alb or HDLC of the data points in the shaded part are ranked in the top
1.2%. e Validation results of important genes in the MGI database.
78.8% of the genes are associated with metabolism, and 21.2% are newly
identified by ODBAE. Besides, 51.2% and 44.2% of the new metabolic genes
are identified in the male and female datasets, respectively, and 4.7%
are detected in both. f Histogram of KEGG pathway
enrichment for 43 novel genes in mice obtained through KOBAS.
Moreover, we observed that the metabolism-associated genes identified
by ODBAE exhibited pronounced sex-specific differences between male and
female mice (Fig. [111]5b). To evaluate the biological significance of
this sexual dimorphism, we focused on metabolism-related genes
exhibiting pronounced sexual dimorphism. Specifically, we selected
genes for which both male and female mouse strains included at least
seven individuals, and for which the proportion of knockout strains
displaying abnormal phenotypes exceeded 50% in only one sex (either
male or female), while remaining at 0% in the other. In total, 18 genes
were identified as showing significant sexual dimorphism, 10 of which
had been previously reported as sex-dimorphic in earlier studies
(Fig. [112]5c, “Methods” and Supplementary Data [113]3i). These
findings confirm the sexually dimorphic nature of these genes and
underscore the utility of ODBAE as a robust tool for uncovering
metabolism-associated genetic mechanisms.
Notably, 91 of these 203 genes could not be detected using traditional
z-score-based methods. This indicates that the metabolic abnormalities
associated with these genes involve complex, multi-parameter
phenotypes. ODBAE’s anomaly detection includes a detailed explanation
of outliers based on the highest reconstruction error and kernel-SHAP
analysis, which allows us to identify the specific metabolic parameters
driving the anomalies. For example, knockout of the Crabp1 gene
resulted in abnormal levels of Alb and HDLC, even though the individual
z-scores for these parameters were within normal ranges. Visualization
of these data (Fig. [114]5d) shows that while the z-scores for Alb and
HDLC appeared normal, the knockout mice deviated significantly from the
central distribution of all data points, highlighting the complex
metabolic perturbation caused by Crabp1 knockout. More examples are
shown in Supplementary Fig. [115]8a–d.
Further investigation using the Mouse Genome Informatics (MGI) database
revealed that 43 of the 203 genes (21.18%) identified by ODBAE had no
previously known association with metabolic phenotypes (Fig. [116]5e
and Supplementary Data [117]3a). To explore the potential relevance of
these newly identified genes, we analyzed their human orthologues
(Supplementary Data [118]3b). Single-nucleotide polymorphisms (SNPs)
within a ± 1 kb region of these orthologues were extracted from GWAS
Central^[119]31, resulting in a total of 804 SNPs (Supplementary
Data [120]3c). We then investigated their association with metabolic
diseases, focusing on type 2 diabetes (T2D). Specifically, for each
SNP, we evaluated the extent of association across T2D-related traits
using data from the DIAGRAM, GIANT, and GLGC consortia^[121]32–[122]34.
Cross-phenotype meta-analysis (CPMA)^[123]35 identified SNPs in four
gene regions—TWF2, TMED10, HOXA10, and NBAS—that were strongly linked
to T2D-related traits (CPMA p < 0.05) (Supplementary Data [124]3d).
Visualization of the corresponding outliers confirmed that these genes
exhibited distinct patterns of deviation across key metabolic
dimensions (Supplementary Fig. [125]9a–d). In addition to T2D, many of
these novel genes may be associated with other metabolic disorders.
KEGG pathway analysis (via KOBAS) for mouse datasets revealed
significant enrichment in metabolic pathways^[126]36, including
glycolysis, propanoate metabolism, and pyruvate metabolism
(Fig. [127]5f and Supplementary Data [128]3e). These results suggest
that ODBAE not only identifies genes involved in complex metabolic
phenotypes but also pinpoints pathways that are disrupted by genetic
perturbations.
Overall, ODBAE provides a powerful tool for uncovering novel metabolic
genes and phenotypes that are undetectable using traditional
z-score-based methods. It successfully identifies genes associated with
complex, multi-parameter metabolic abnormalities and reveals novel
genetic contributors to metabolic diseases, many of which were
previously unannotated in the MGI database.
ODBAE integrates novel metabolic phenotypes
ODBAE’s ability to detect outliers based on joint abnormalities, beyond
traditional z-score methods, highlights its strength in integrating
complex metabolic phenotypes. Here, we demonstrate how ODBAE can detect
metabolic abnormalities in gene knockout mice by identifying abnormal
relationships between multiple metabolic parameters, rather than
focusing on single-parameter outliers. Our analysis of
metabolism-related datasets from IMPC reveals that a large proportion
of outliers (89.14% of male and 92.75% of female mice) exhibit
abnormalities in multiple metabolic parameters simultaneously. These
findings suggest that joint abnormalities, or perturbations in the
correlations between parameters, are important for detecting phenotypic
changes in knockout mice. For example, knockout of the Ckb gene
disrupts the relationship between BW and BL, resulting in an abnormal
BMI, even though BW and BL are individually within normal ranges. This
highlights how gene knockouts often disrupt intrinsic correlations
between physiological parameters, resulting in complex metabolic
phenotypes.
To further explore these correlations, we identified parameter pairs
most likely to show joint abnormalities in both male and female mice.
In males, 36 different abnormal parameter pairs were observed, with the
most common being (Alb, Ph), (Alb, HDLC), and (ALA, HDLC)
(Fig. [129]6a, c). In females, 40 different abnormal parameter pairs
were identified, with the most common being (ASA, TP), (Alb, ASA), and
(TB, TP) (Fig. [130]6b, c). Their corresponding knockout genes are
shown in Fig. [131]6d, e. Notably, 22 abnormal parameter pairs were
found in both males and females, with the most common pairs being (ALA,
ASA), (ALA, HDLC), and (Alb, ASA). These common abnormalities suggest
conserved metabolic pathways that are affected by specific gene
knockouts in both sexes.
Fig. 6. ODBAE identifies metabolic parameter pairs that tend to show
simultaneous abnormalities.
a, b Chord diagrams plotting the inter-connectivity of metabolic
parameters for males (a) and females (b). The outer segments represent
the metabolic parameters; the size of the arcs connecting the
parameters is proportional to the number of associated outliers. c
Abnormal parameter pairs observed in males and females. d, e Links
between genes and simultaneously abnormal metabolic parameters in males
(d) and females (e).
We then investigated the biological significance of the three most
frequently perturbed parameter pairs. Previous studies have shown that
the ALA/HDLC ratio is associated with the risk of non-alcoholic fatty
liver disease (NAFLD) and diabetes^[134]37–[135]39. ODBAE identified
Lepr as one of the genes causing abnormalities in this pair
(Fig. [136]7a). When the ratio of ALA to HDLC was analyzed across all
mice, the outliers corresponding to the Lepr knockout had a
significantly lower ratio compared to the general population
(Fig. [137]7b), consistent with previous findings linking Lepr
mutations to T2D and NAFLD^[138]40,[139]41. KEGG pathway analysis of
genes associated with (ALA, HDLC) revealed enrichment in pathways
related to NAFLD and metabolic processes, including glycolysis
(Fig. [140]7c and Supplementary Data [141]3f). For the (Alb, ASA) pair,
studies suggest that the ASA/Alb ratio correlates with tumor
progression and prognosis in hepatocellular carcinoma
(HCC)^[142]42,[143]43. ODBAE identified Ap4m1 as a gene causing
abnormalities in this pair (Fig. [144]7d). The average ratio of ASA to
Alb in Ap4m1 knockout mice was significantly larger than in other mice
(Fig. [145]7e), consistent with the role of Ap4m1 in HCC^[146]44. KEGG
pathway enrichment further linked (Alb, ASA) disruptions to pathways
involved in pyruvate metabolism, glycolysis, and proximal tubule
bicarbonate reclamation (Fig. [147]7f and Supplementary Data [148]3g).
Finally, for genes corresponding to the parameter pair (ALA, ASA),
pathway enrichment analysis revealed significant enrichment in
Glycosaminoglycan biosynthesis, Starch and sucrose metabolism, and Type
I diabetes mellitus (Supplementary Fig. [149]10a and Supplementary
Data [150]3h), consistent with the previously reported association of
the ASA/ALA ratio with prediabetes^[151]45. In addition to
these well-characterized parameter pairs, ODBAE also identified novel
correlations worth exploring further. For instance, Taf8 was
highlighted as a gene causing abnormalities in the parameter pair
(HDLC, TP), though little is known about the specific relationship
between these two parameters (Supplementary Fig. [152]10b).
Supplementary Fig. [153]10c shows that there is a linear relationship
between these two parameters, and the knockout of the gene Taf8
disrupts this relationship. ODBAE’s ability to uncover such
relationships without prior biological inference makes it a powerful
tool for discovering previously unknown metabolic interactions.
Fig. 7. Validation of the top three abnormal parameter pairs that occur in
both males and females and have the highest overall frequency.
a Visualization of outliers corresponding to gene Lepr based on
abnormal parameter pairs in the male dataset. b The distribution of the
standardized ratio of ALA to HDLC in the mouse metabolism-related
dataset; the average ratio of Lepr knockout mice is significantly
smaller. c Histogram of KEGG pathway enrichment for genes corresponding
to (ALA, HDLC) in humans. d Visualization of outliers corresponding to
gene Ap4m1 based on abnormal parameter pairs in the female dataset. e
The distribution of the standardized ratio of ASA to Alb in the mouse
metabolism-related dataset; the average ratio of Ap4m1 knockout mice is
significantly larger. f Histogram of KEGG pathway enrichment for genes
corresponding to (Alb, ASA) in humans.
Overall, ODBAE systematically identifies and explains complex
phenotypes by revealing intrinsic relationships between multiple
metabolic parameters, including both linear and non-linear
associations. These insights can serve as new indicators for biomedical
research and provide a more holistic view of metabolic dysfunction in
knockout mice. In addition, the ODBAE framework can be adapted to other
datasets to uncover complex phenotypes in different biological systems.
Discussion
In this study, we present ODBAE, a machine learning approach designed
to detect and explain complex phenotypes by identifying outliers in
high-dimensional biological datasets. The key advantage of ODBAE is its
ability to detect two types of outliers: IP, which deviate
significantly from expected relationships between variables, and HLP,
which are far from the data center. Unlike traditional outlier
detection methods that often focus on single abnormal indicators, ODBAE
identifies multi-dimensional joint abnormalities, providing a more
holistic view of phenotypic disruptions. This approach is particularly
powerful when applied to data from knockout mouse models, where
phenotypic changes are often subtle or involve complex interactions
between physiological parameters.
One of the most significant contributions of ODBAE is its potential to
identify unexpected phenotypes that may not be apparent when analyzing
individual physiological indicators. Traditional methods of phenotype
analysis often focus on abnormalities in a single trait^[156]3–[157]5,
potentially overlooking the coordinated disruptions across multiple
parameters that signal underlying biological imbalances. As our
understanding of disease and homeostasis evolves, it is increasingly
clear that pathological processes often involve system-wide
disturbances rather than isolated dysfunctions^[158]6. ODBAE leverages
these insights by integrating correlations between physiological
indicators, even when individual indicators remain within normal
ranges. This capability allows researchers and clinicians to better
understand the interplay of homeostatic mechanisms and how they are
disrupted in disease states. For example, when applied to
metabolism-related datasets from IMPC, ODBAE successfully identified
novel metabolic genes, including those associated with complex,
multi-parameter phenotypes. These phenotypes would have been difficult
to detect using conventional approaches that focus on single-parameter
abnormalities. Moreover, ODBAE successfully identified sex-dimorphic
genes, underscoring its utility in advancing the study of sexual
dimorphism. ODBAE also uncovered new parameter pairs that tend to
exhibit abnormalities together, providing insights into the intrinsic
relationships between metabolic pathways. This type of discovery is
essential for advancing our understanding of gene function and the
pathophysiology of metabolic diseases. Notably, the utility of ODBAE
extends beyond knockout mouse models. It can be broadly applied to
diverse tabular biological datasets—including disease cohorts, clinical
health examination records, and single-cell datasets—to identify
complex phenotypes and subtle, multidimensional anomalies.
Importantly, ODBAE offers not just detection, but also anomaly
explanation through the use of kernel-SHAP. This feature allows
researchers to pinpoint the exact physiological indicators driving the
detected abnormalities, facilitating deeper insights into the
biological processes involved. By identifying the most significant
reconstruction errors and the features contributing to them, ODBAE
enables a clearer interpretation of complex phenotypic data. This is
particularly relevant in clinical settings, where understanding the
root cause of a phenotype can guide diagnosis and treatment decisions.
Although ODBAE demonstrates considerable potential, there are areas
that warrant further exploration. First, the current framework assumes
a Gaussian distribution in the training data, which may limit its
effectiveness in datasets that do not follow this distribution. As
shown in Fig. [159]4b, ODBAE maintains robust performance under
moderate levels of noise in Gaussian-distributed datasets. However, its
anomaly detection capability progressively declines with increasing
noise intensity. Future work should focus on optimizing the model to
handle non-Gaussian data and improving the accuracy of anomaly
explanations, particularly in cases involving highly complex or subtle
phenotypes. Second, although ODBAE could potentially be extended to
other data types, such as time-series or imaging data, the current
manuscript focuses exclusively on tabular datasets. The model’s
performance in analyzing other types of data, such as longitudinal
time-series or medical imaging, has not yet been empirically tested.
Further exploration and validation in these areas are warranted to
assess ODBAE’s broader applicability. Third, the performance of ODBAE
is influenced by both the number of layers and the dimensionality of
each layer, suggesting that its architecture can be tuned to further
enhance performance in practical applications. However, determining the
optimal model configuration remains a challenging task. Moreover, since
ODBAE identifies anomalies based on the reconstruction error of
individual data points, the choice of threshold for this error is
critical to the accuracy of outlier detection. Further optimization of
threshold selection is therefore warranted. Finally, ODBAE’s use in
identifying phenotypes has primarily focused on metabolic parameters.
Its performance in detecting phenotypes related to other physiological
systems, such as neurological, cardiovascular, or developmental traits,
remains unexplored. Broader testing across a range of physiological
categories would demonstrate the model’s versatility and help identify
any system-specific limitations.
Despite these challenges, ODBAE represents a powerful tool for the
analysis of high-dimensional biomedical datasets. Its ability to detect
multi-dimensional outliers, explain anomalies, and integrate correlated
physiological indicators offers a new way to approach phenotypic
screening, particularly in gene knockout models. We believe that ODBAE
will facilitate the discovery of previously unrecognized phenotypes,
enhancing our understanding of homeostasis and disease processes. By
moving beyond the limitations of single-indicator analysis, ODBAE opens
up new possibilities for both basic research and clinical diagnostics,
allowing for a more comprehensive understanding of complex biological
systems.
Methods
Datasets
To demonstrate the comprehensive benefits of ODBAE, we first performed
outlier detection on developmental datasets obtained from the IMPC.
Subsequently, we assessed ODBAE’s ability to identify HLP using 3
synthetic two-dimensional Gaussian-distributed datasets
(Fig. [160]2c–e) and one three-dimensional dataset (Fig. [161]2i).
Finally, we evaluated the model’s overall detection accuracy for both
IP and HLP using an additional three-dimensional dataset (Supplementary
Fig. [162]5a). For outlier detection on the IMPC developmental dataset,
data from wild-type mice were used as the training dataset, while data
from knockout mice served as the test dataset. The training dataset
comprised 8327 wild-type samples (4126 males and 4201 females), and the
test dataset included 26706 knockout samples (13297 males and 13409
females). To evaluate the accuracy of ODBAE in detecting HLP, as well
as its overall detection performance for both IP and HLP, all synthetic
datasets consisted of 20000 samples each.
To evaluate the advantages of ODBAE over existing methods, we conducted
outlier detection using multiple approaches across five low-dimensional
synthetic datasets, two high-dimensional synthetic datasets, and two
benchmark datasets (the Dry Bean dataset and the Breast Cancer
dataset). In the low-dimensional synthetic dataset experiments, we
generated three two-dimensional datasets drawn from Gaussian
distributions, one from a two-dimensional Laplace distribution, and one
from a two-dimensional uniform distribution, each comprising 20000
samples. For the high-dimensional synthetic datasets, we constructed
one 50-dimensional and one 100-dimensional Gaussian-distributed
dataset, each also consisting of 20000 samples. The Dry Bean dataset
contains 13611 samples across seven distinct categories; we randomly
selected 2000 samples from a single class as the training dataset, with
samples from the remaining six classes designated as outliers. The
Breast Cancer dataset comprises 569 samples, categorized as benign or
malignant. We randomly selected 300 benign samples as the training
data, while malignant samples were treated as outliers.
The metabolism-related dataset used to identify novel metabolic genes
and phenotypes included 21234 wild-type mouse samples (10585 males and
10649 females) and 45922 knockout mouse samples (23024 males and 22898
females). Wild-type samples were employed for training, while knockout
samples served as the test dataset.
Dataset preprocessing
We used developmental and metabolism-related datasets from IMPC to
discover complex phenotypes and identify new genes. The developmental
dataset we used was integrated from 8 phenotyping centers (BCM, HMGU,
ICS, JAX, MRC Harwell, RBRC, TCP, and UC Davis), while the
metabolism-related dataset was from 11 phenotyping centers (BCM,
CCP-IMG, HMGU, ICS, KMPC, MARC, MRC Harwell, RBRC, TCP, UC Davis, and
WTSI). To eliminate experiment-specific variations, we standardized
each phenotyping center for each dataset to have unit variance and zero
mean. Then, Min-Max normalization was applied to normalize each
physiological parameter within each dataset, mitigating dimensional
effects across different physiological parameters. SNPs within a
±1 kb region of orthologues for the 41 newly identified metabolic
genes were downloaded from GWAS Central on September 23, 2023 and
filtered based on $-\log(p) \geq 2$.
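A sketch of this preprocessing is shown below; the `center` column and parameter list are assumptions about the table layout.

```python
# Sketch of the preprocessing pipeline: per-center z-scoring, then global
# Min-Max normalization of each parameter. Column names are hypothetical.
import pandas as pd

def preprocess(df: pd.DataFrame, params, center_col="center"):
    out = df.copy()
    # Remove center-specific location/scale effects (zero mean, unit variance).
    out[params] = out.groupby(center_col)[params].transform(
        lambda col: (col - col.mean()) / col.std()
    )
    # Min-Max normalize each parameter to mitigate dimensional effects.
    out[params] = (out[params] - out[params].min()) / (
        out[params].max() - out[params].min()
    )
    return out
```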
Definition of HLP and IP
Since the reconstruction of any dataset by the autoencoder can be
regarded as a complex regression process, we can divide the outliers in
each dataset into HLP and IP^[163]46. For a given dataset, we divide the
dimension indices into two disjoint sets $V_1$ and $V_2$. Without loss
of generality, we assume that $V_1 = \{1, 2, \ldots, p\}$ and
$V_2 = \{p+1, p+2, \ldots, m\}$. For $i, j \in V_1$, $i \neq j$, there is
no correlation between variables $X_i$ and $X_j$, while for each
$k \in V_2$ there is a map $\phi_k$ such that
$X_k = \phi_k(X_1, X_2, \ldots, X_p)$. Each $X_i$ ($i \in V_1$) is called
a factor and each $X_k$ ($k \in V_2$) is called a response variable.
Then we define the matrix $Q$,

$$Q = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}.$$
Besides, if we let $Q_i = (1, x_{i1}, \ldots, x_{ip})$, the leverage
value of the $i$th sample point is

$$h_i = Q_i \left(Q^\top Q\right)^{-1} Q_i^\top, \quad i = 1, 2, \ldots, n.$$
If we represent the mean of variable $X_k$ ($k \in V_1$) as $\mu_k$, and
define the mean vector $\tilde{\mu} = (\mu_1, \mu_2, \ldots, \mu_p)$, we
have

$$Q^\top Q = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{n1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} = n \begin{pmatrix} 1 & \tilde{\mu} \\ \tilde{\mu}^\top & C \end{pmatrix}, \tag{1}$$
where $C_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_{ij} x_{ik}$. Thus, we can
obtain the inverse of $Q^\top Q$:

$$\left(Q^\top Q\right)^{-1} = \frac{1}{n} \begin{pmatrix} 1 + \tilde{\mu} \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top & -\tilde{\mu} \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \\ -\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top & \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \end{pmatrix}. \tag{2}$$
Therefore, if we let $\tilde{x}_i = (x_{i1}, \ldots, x_{ip})$, the
leverage value of the $i$th sample point can also be formulated as

$$\begin{aligned} h_i &= \frac{1}{n} Q_i \begin{pmatrix} 1 + \tilde{\mu} \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top & -\tilde{\mu} \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \\ -\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top & \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \end{pmatrix} Q_i^\top \\ &= \frac{1}{n} \left( 1 + (\tilde{\mu} - \tilde{x}_i)\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top, \; (\tilde{x}_i - \tilde{\mu})\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \right) Q_i^\top \\ &= \frac{1}{n} \left( 1 + (\tilde{\mu} - \tilde{x}_i)\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top + (\tilde{x}_i - \tilde{\mu})\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{x}_i^\top \right) \\ &= \frac{1}{n} + \frac{1}{n} (\tilde{x}_i - \tilde{\mu})\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} (\tilde{x}_i - \tilde{\mu})^\top. \end{aligned} \tag{3}$$
If we define $A = C - \tilde{\mu}^\top \tilde{\mu}$, then

$$A_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_{ij} x_{ik} - \mu_j \mu_k = \frac{1}{n} \sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)\left(x_{ik} - \mu_k\right) = \frac{n-1}{n} \mathrm{COV}\left(X_j, X_k\right), \quad j, k \in V_1. \tag{4}$$
Furthermore, we have $A = \frac{n-1}{n} \Sigma_f$ and
$A^{-1} = \frac{n}{n-1} \Sigma_f^{-1}$, where $\Sigma_f$ is the
covariance matrix of the factor space. Then, based on Eq. ([164]3), we
get

$$h_i = \frac{1}{n} + \frac{1}{n-1} \left(\tilde{x}_i - \tilde{\mu}\right) \Sigma_f^{-1} \left(\tilde{x}_i - \tilde{\mu}\right)^\top. \tag{5}$$
Obviously, for each sample point, its leverage value is proportional to
its Mahalanobis distance in the factor space. In addition, the Cook
distance of the $i$th sample point based on the $k$th ($k \in V_2$)
dimension is

$$C_{ik} = \frac{(n-p)\left(x_{ik} - \hat{x}_{ik}\right)^2 h_i}{p \sum_{i=1}^{n} \left(x_{ik} - \hat{x}_{ik}\right)^2 \left(1 - h_i\right)^2}, \quad i = 1, 2, \ldots, n. \tag{6}$$
For each dataset, HLP are sample points with high leverage values, and
IP are sample points with large Cook distances. Since the leverage value
of the $i$th sample point is proportional to its Mahalanobis distance in
the factor space, HLP can also be identified by the Mahalanobis distance
in the factor space. It is noteworthy that for certain datasets, such as
those conforming to a multivariate Gaussian distribution, the set $V_2$
may be empty. In such instances, all outliers in the dataset are
identified as HLP.
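Eq. (5) makes the leverage cheap to compute directly from the factor-space covariance, as in the NumPy sketch below.

```python
# Numeric sketch of Eq. (5): leverage is 1/n plus a scaled squared
# Mahalanobis distance in the factor space, so HLP can be ranked by either.
import numpy as np

def leverage(F):
    """F: (n, p) matrix of factor variables (the dimensions in V1)."""
    n = len(F)
    diff = F - F.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(F, rowvar=False))  # Sigma_f^{-1}
    maha_sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return 1.0 / n + maha_sq / (n - 1)
```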
Outlier detection of ODBAE
The autoencoder model in ODBAE is formed of an encoder $f$ and a decoder
$g$. If we denote the input and output data points as variables $X$ and
$\hat{X}$, respectively, where $X = (X_1, X_2, \ldots, X_m)^\top$ and
$\hat{X} = (\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_m)^\top$, and the
number of sample points is $n$, then we have $\hat{X} = g \circ f(X)$.
Besides, $\mathcal{X}$ represents the whole input dataset and
$X \in \mathcal{X}$. $x_i$ represents the $i$th sample point of the
input dataset, with corresponding reconstruction $\hat{x}_i$, where
$x_i = (x_{i1}, x_{i2}, \ldots, x_{im})^\top$ and
$\hat{x}_i = (\hat{x}_{i1}, \hat{x}_{i2}, \ldots, \hat{x}_{im})^\top$.
Ideally, we would like to train the autoencoder on the training dataset
to minimize the training loss function; the generally used loss function
is the MSE, which can be represented as

$$L_{MSE}(\omega, b) = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \hat{x}_i\right)^\top \left(x_i - \hat{x}_i\right) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \left(x_{ij} - \hat{x}_{ij}\right)^2,$$
where $\omega$ represents the weights between the input layer and the
output layer and $b$ is the bias. The purpose of the training process is
to guarantee that the intrinsic information of the training dataset is
learned and that most normal sample points are well reconstructed. It is
important to note that MSE-trained autoencoders are insufficient for
outlier detection. Through rigorous theoretical analysis, we demonstrate
how this limitation can be addressed. The analysis is based on the
following assumption.
Assumption 1
We assume that the input variable $X$ follows an $m$-dimensional
Gaussian distribution.
As is known, autoencoders eventually focus mainly on recovering the
principal components of a dataset^47,48, so it is natural to use the
unit orthogonal eigenvectors as basis vectors when studying how the loss
function affects the autoencoder's outlier detection results. Under
Assumption 1, we write the mean vector of the variable $X$ as
$\mu = (\mu_1, \mu_2, \ldots, \mu_m)^{\top}$. Let the eigenvalues of the
covariance matrix $\Sigma_x$ be $\lambda_1, \lambda_2, \ldots, \lambda_m$,
with corresponding unit orthogonal eigenvectors (i.e., principal
directions) $\eta_1, \eta_2, \ldots, \eta_m$, and let
$Y = (Y_1, Y_2, \ldots, Y_m)^{\top}$ be the coordinates of $X$ when the
unit orthogonal eigenvectors are taken as basis vectors. Denote by $P$
the orthogonal matrix whose $i$th column is the unit eigenvector
$\eta_i$; then $Y = P^{-1}X = P^{\top}X$ and $Y_k = \eta_k^{\top}X$,
$k = 1, 2, \ldots, m$. Thus, in the new coordinate system, the input
variable $X$ is transformed into the variable $Y$. Let
$\nu = (\nu_1, \nu_2, \ldots, \nu_m)^{\top}$ and $\Sigma_Y$ denote the
mean vector and covariance matrix of $Y$; then $\nu = P^{\top}\mu$ and
$\nu_k = E(Y_k) = \eta_k^{\top}\mu$, $k = 1, 2, \ldots, m$. Furthermore,
the variance of the $k$th element of $Y$ is
$$D(Y_k) = \mathrm{COV}(Y_k, Y_k) = E\big[(Y_k - \nu_k)(Y_k - \nu_k)\big] = E\big[\eta_k^{\top}(X - E(X))\,\eta_k^{\top}(X - E(X))\big] = \eta_k^{\top}E\big[(X - E(X))(X - E(X))^{\top}\big]\eta_k = \eta_k^{\top}\Sigma_x\eta_k = \eta_k^{\top}\lambda_k\eta_k = \lambda_k, \quad (7)$$
and the covariance of $Y_i$ and $Y_j$ ($i \ne j$) is

$$\mathrm{COV}(Y_i, Y_j) = E\big[(Y_i - E(Y_i))(Y_j - E(Y_j))\big] = E\big[\eta_i^{\top}(X - E(X))\,\eta_j^{\top}(X - E(X))\big] = \eta_i^{\top}E\big[(X - E(X))(X - E(X))^{\top}\big]\eta_j = \eta_i^{\top}\Sigma_x\eta_j = \eta_i^{\top}\lambda_j\eta_j = 0. \quad (8)$$
Thus, the covariance matrix of variable $Y$ is
$\Sigma_Y = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$.
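A quick numerical check of this diagonalization (a sketch; the Gaussian parameters below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    # Sample an arbitrary 3-dimensional Gaussian (illustrative parameters)
    Sigma_x = np.array([[2.0, 0.8, 0.3], [0.8, 1.5, 0.5], [0.3, 0.5, 1.0]])
    X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5], cov=Sigma_x, size=100_000)

    # Eigendecomposition of the sample covariance: columns of P are eta_k
    lam, P = np.linalg.eigh(np.cov(X, rowvar=False))
    Y = X @ P                                # Y = P^T X, applied row-wise

    # Covariance of Y is (approximately) diag(lambda_1, ..., lambda_m)
    print(np.round(np.cov(Y, rowvar=False), 3))
    print(np.round(lam, 3))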
Proposition 1
The covariance matrix of variable $Y$ is
$\Sigma_Y = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$.
Denote $\nu = (\nu_1, \nu_2, \ldots, \nu_m)^{\top}$ as the mean vector of
the variable $Y$. Since $X$ follows an $m$-dimensional Gaussian
distribution, $Y$ also follows an $m$-dimensional Gaussian distribution,
i.e., $Y \sim N(\nu, \Sigma_Y)$; moreover, $Y_k \sim N(\nu_k, \lambda_k)$,
$k = 1, 2, \ldots, m$. We then denote by $\hat{Y}$ the output variable
corresponding to $Y$, whose mean vector and covariance eigenvalues are
$\hat{\nu} = (\hat{\nu}_1, \hat{\nu}_2, \ldots, \hat{\nu}_m)^{\top}$ and
$\hat{\lambda}_k$, $k = 1, 2, \ldots, m$, respectively. Since, as noted
above, autoencoders mainly recover the principal components of a dataset,
it is reasonable to make the following assumption.
Assumption 2
For each element $\hat{Y}_k$ of the output variable $\hat{Y}$, its
distribution has the same form as that of the corresponding input
element $Y_k$. Since $Y_k \sim N(\nu_k, \lambda_k)$, it follows that
$\hat{Y}_k \sim N(\hat{\nu}_k, \hat{\lambda}_k)$, $k = 1, 2, \ldots, m$.
In addition, a reconstructed eigenvalue is close to 0 if and only if its
corresponding original eigenvalue is close to 0.
In the following, we theoretically analyze the limitations of MSE in
detecting HLP. First, we derive, for each principal component direction,
the relationship between the deviation of the reconstructed data from
its mean and the deviation of the original data from its mean.
The MSE loss function can be reformulated as

$$L_{MSE}(\omega, b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^{\top}(y_i - \hat{y}_i) = E\big[Y^{\top}Y - 2Y^{\top}\hat{Y} + \hat{Y}^{\top}\hat{Y}\big] = \sum_{i=1}^{m}E\big[Y_i^2\big] - 2\sum_{i=1}^{m}E\big[Y_i\hat{Y}_i\big] + \sum_{i=1}^{m}E\big[\hat{Y}_i^2\big], \quad (9)$$

where $y_i$ and $\hat{y}_i$ represent the $i$th sample point of the input
and output datasets in the new coordinate system, respectively.
First, supposing that the autoencoder reconstructs the input dataset
properly, we analyze the mean vector of the output variable $\hat{Y}$.
Based on Assumption 2 and Eq. (9), and since
$\mathrm{COV}(Y_i, \hat{Y}_i) \le \sqrt{\lambda_i\hat{\lambda}_i}$, we have

$$L_{MSE}(\omega, b) = \sum_{i=1}^{m}\lambda_i + \sum_{i=1}^{m}\hat{\lambda}_i + \sum_{i=1}^{m}\nu_i^2 + \sum_{i=1}^{m}\hat{\nu}_i^2 - 2\sum_{i=1}^{m}\nu_i\hat{\nu}_i - 2\sum_{i=1}^{m}\mathrm{COV}\big(Y_i, \hat{Y}_i\big) \ge \sum_{i=1}^{m}\big(\nu_i - \hat{\nu}_i\big)^2 + \sum_{i=1}^{m}\Big(\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i}\Big)^2. \quad (10)$$

To reach the minimum of $L_{MSE}(\omega, b)$, the mean vector
$\hat{\nu}$ of the output variable must satisfy $\hat{\nu}_k = \nu_k$,
$k = 1, 2, \ldots, m$; that is, the mean vector of the output data
equals that of the input data.
Next, we analyze the reconstruction error of each data point. According
to Assumptions 1 and 2, the input and output variables satisfy
$Y_k \sim N(\nu_k, \lambda_k)$ and $\hat{Y}_k \sim N(\nu_k, \hat{\lambda}_k)$,
$k = 1, 2, \ldots, m$. Denote
$R(Y) = (R_1(Y), R_2(Y), \ldots, R_m(Y))^{\top}$ (respectively
$\hat{R}(Y) = (\hat{R}_1(Y), \hat{R}_2(Y), \ldots, \hat{R}_m(Y))^{\top}$)
as the difference between the variable $Y$ (respectively $\hat{Y}$) and
the mean vector $\nu$, i.e., $R(Y) = Y - \nu$ and
$\hat{R}(Y) = \hat{Y} - \nu$. It is easy to see that
$R_k(Y) \sim N(0, \lambda_k)$ and $\hat{R}_k(Y) \sim N(0, \hat{\lambda}_k)$.
Then the following equation can be obtained.
$$L_{MSE}(\omega, b) = E\big[(Y - \hat{Y})^{\top}(Y - \hat{Y})\big] = E\big[(R(Y) - \hat{R}(Y))^{\top}(R(Y) - \hat{R}(Y))\big] = \sum_{i=1}^{m}E\big[R_i^2(Y)\big] + \sum_{i=1}^{m}E\big[\hat{R}_i^2(Y)\big] - 2\sum_{i=1}^{m}E\big[R_i(Y)\big]E\big[\hat{R}_i(Y)\big] - 2\sum_{i=1}^{m}\mathrm{COV}\big(R_i(Y), \hat{R}_i(Y)\big) = \sum_{i=1}^{m}\lambda_i + \sum_{i=1}^{m}\hat{\lambda}_i - 2\sum_{i=1}^{m}\rho_i\sqrt{\lambda_i\hat{\lambda}_i}, \quad (11)$$
where $\rho_k$ represents the correlation coefficient of $R_k(Y)$ and
$\hat{R}_k(Y)$, with $-1 \le \rho_k \le 1$, $k = 1, 2, \ldots, m$. In
Eq. (11), $L_{MSE}(\omega, b)$ reaches its minimum only when
$\rho_k = 1$. In that case there is a positive linear relationship
between $R_k(Y)$ and $\hat{R}_k(Y)$; specifically, there exist constants
$a_k$ and $c_k$ such that $\hat{R}_k(Y) = a_k R_k(Y) + c_k$, whence
$E[\hat{R}_k(Y)] = a_k E[R_k(Y)] + c_k$ and
$D[\hat{R}_k(Y)] = a_k^2 D[R_k(Y)]$. Since
$E[\hat{R}_k(Y)] = E[R_k(Y)] = 0$, $D[R_k(Y)] = \lambda_k$, and
$D[\hat{R}_k(Y)] = \hat{\lambda}_k$, we obtain $c_k = 0$ and
$a_k = \sqrt{\hat{\lambda}_k/\lambda_k}$. This yields the following
proposition.
Proposition 2
Under Assumptions 1 and 2, in each principal component direction the
deviation of the reconstructed data from its mean is proportional to the
deviation of the original data from its mean. The specific relationship
is

$$\hat{R}_k(Y) = \sqrt{\frac{\hat{\lambda}_k}{\lambda_k}}\,R_k(Y), \quad (12)$$

where $R_k(Y) = Y_k - \nu_k$ and $\hat{R}_k(Y) = \hat{Y}_k - \hat{\nu}_k$.
Based on Eq. (12), we can further determine the specific reconstruction
error of each input data point.
Proposition 3
The reconstruction error of each input data point measured by MSE can be
formulated as

$$W(Y) = (Y - \hat{Y})^{\top}(Y - \hat{Y}) = \sum_{i=1}^{m}\Big(\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i}\Big)^2\,\frac{R_i^2(Y)}{\lambda_i}. \quad (13)$$
Ideally, each principal component should contribute equally to the
reconstruction error of any input data point. Recalling the definition
of $R(Y)$, it is easy to see that $R_k(Y)/\sqrt{\lambda_k} \sim N(0, 1)$,
so Eq. (13) indicates that the degree to which the autoencoder recovers
each principal component determines that component's share of the
reconstruction error. This introduces considerable uncertainty into HLP
detection.
Proposition 4
HLP are difficult to identify as outliers if the input dataset is
completely reconstructed.
When the input dataset is completely reconstructed, Eq. (13) gives
$\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = 0$, $k = 1, 2, \ldots, m$.
Then the reconstruction error of every data point in the training
dataset is 0, so every point is considered normal. However, HLP exist in
any dataset that follows an $m$-dimensional Gaussian distribution, and
in general, the larger a point's Mahalanobis distance, the more abnormal
it is. Therefore, although completely reconstructing the training
dataset is beneficial for IP detection, it neglects the detection of
HLP.
Proposition 5
HLP whose values are within the normal range in every dimension cannot
be identified as outliers. Moreover, most HLP are detected along the
worst-recovered principal component direction, while anomalies along
well-recovered principal component directions are often ignored.
From Eq. (13), $(\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k})^2$ is the
weight of the reconstruction error of the $k$th principal component
direction in each data point's total reconstruction error. Therefore, if
the data are reconstructed better along the $i$th principal component
direction than along the $j$th, then
$\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i} < \sqrt{\lambda_j} - \sqrt{\hat{\lambda}_j}$,
$i \ne j$. As a result, the detected outliers are concentrated along the
$j$th principal component direction.
Proposition 6
If the differences between the square roots of the eigenvalues of the
covariance matrix of the original dataset and the square roots of their
reconstructed counterparts are equal across all principal component
directions, then the reconstruction error of each data point is
proportional to its Mahalanobis distance.
If we suppress complete reconstruction by the autoencoder, then
$\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} > 0$, $k = 1, 2, \ldots, m$.
Based on Eq. (12), we obtain

$$(Y_k - \hat{Y}_k)^2 = \frac{\big(\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k}\big)^2}{\lambda_k}\,(Y_k - \nu_k)^2. \quad (14)$$

This means that every principal component direction of the input data
carries a reconstruction error, and the further a value lies from the
mean, the larger that error. Therefore, in each principal component
direction, the outliers detected by the reconstruction error coincide
with the outliers defined by the Gaussian distribution of the data.
Moreover, if $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = \beta > 0$,
$k = 1, 2, \ldots, m$, then according to Eq. (13) the reconstruction
error of each input data point is

$$W(Y) = (Y - \hat{Y})^{\top}(Y - \hat{Y}) = \beta^2\sum_{i=1}^{m}\frac{R_i^2(Y)}{\lambda_i}. \quad (15)$$
Since $\Sigma_Y = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$,
we have
$\Sigma_Y^{-1} = \mathrm{diag}(1/\lambda_1, 1/\lambda_2, \ldots, 1/\lambda_m)$,
so the Mahalanobis distance of the input variable $Y$ is

$$M(Y) = (Y - \nu)^{\top}\,\Sigma_Y^{-1}\,(Y - \nu) = \sum_{i=1}^{m}\frac{R_i^2(Y)}{\lambda_i}. \quad (16)$$

Hence the reconstruction error of each data point is proportional to its
Mahalanobis distance. Therefore, if we control
$\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = \beta > 0$,
$k = 1, 2, \ldots, m$, almost all HLP are converted into IP and cannot
be well reconstructed; the detected HLP are then evenly distributed
across the principal component directions.
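A small numerical illustration of Proposition 6 (a sketch with made-up eigenvalues): shrinking every $\sqrt{\lambda_k}$ by the same amount $\beta$ makes each point's reconstruction error an exact multiple of its Mahalanobis distance.

    import numpy as np

    rng = np.random.default_rng(1)
    lam = np.array([4.0, 1.0, 0.25])          # illustrative eigenvalues
    beta = 0.2                                 # uniform gap sqrt(lam) - sqrt(lam_hat)
    Y = rng.normal(0.0, np.sqrt(lam), size=(5, 3))   # centered data (nu = 0)

    # Eq. (12): reconstruction scales each direction by sqrt(lam_hat / lam)
    Y_hat = Y * (np.sqrt(lam) - beta) / np.sqrt(lam)

    W = ((Y - Y_hat) ** 2).sum(axis=1)         # Eq. (15): reconstruction error
    M = (Y ** 2 / lam).sum(axis=1)             # Eq. (16): Mahalanobis distance
    print(W / M)                               # constant, equal to beta**2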
Based on the above discussion, we propose a new loss function that adds
an appropriate penalty term to MSE to balance the reconstruction of the
autoencoder.
In fact, if the intrinsic dimension of the dataset is $l$, then there
are $l$ eigenvalues that are not close to 0. We denote these as
$\lambda'_k$, with corresponding reconstructed values $\hat{\lambda}'_k$,
$k = 1, 2, \ldots, l$. According to Assumption 2, $\hat{\lambda}'_k$ is
also not close to 0. Our training loss function then combines two terms.
One is $L_{MSE}(\omega, b)$, which aims to reconstruct the input dataset
well. The other is

$$L_{EIG}(\omega, b) = \sum_{i=1}^{l}\Big(\sqrt{\lambda'_i} - \sqrt{\hat{\lambda}'_i} - \beta\Big)^2,$$

which prevents the autoencoder from completely reconstructing the input
dataset; $\beta > 0$ is a hyperparameter that adjusts the degree of data
reconstruction. The final training loss function is a combination of the
two:

$$L(\omega, b) = \theta_1 L_{MSE}(\omega, b) + \theta_2 L_{EIG}(\omega, b). \quad (17)$$

Here, $\theta_1, \theta_2 > 0$ are hyperparameters that must be
predetermined. In practice, desirable results are usually obtained with
$\theta_1 = 0.008$ and $\theta_2 = 1$.
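A sketch of how this combined loss might be implemented (PyTorch; estimating eigenvalues from the batch covariance, pairing them by sorted order, and the cutoff `eps` for eigenvalues "not close to 0" are our simplifications, not details fixed by the text):

    import torch

    def odbae_loss(x, x_hat, beta, theta1=0.008, theta2=1.0, eps=1e-4):
        # Sketch of Eq. (17): L = theta1 * L_MSE + theta2 * L_EIG.
        l_mse = ((x - x_hat) ** 2).sum(dim=1).mean()          # L_MSE

        # Eigenvalues of the batch covariance of input and reconstruction
        lam = torch.linalg.eigvalsh(torch.cov(x.T))
        lam_hat = torch.linalg.eigvalsh(torch.cov(x_hat.T))

        # L_EIG: keep sqrt(lam') - sqrt(lam_hat') at the target gap beta,
        # restricted to the l non-negligible eigenvalues lambda'_k
        keep = lam > eps
        gap = torch.sqrt(lam[keep]) - torch.sqrt(lam_hat[keep].clamp(min=0.0))
        l_eig = ((gap - beta) ** 2).sum()

        return theta1 * l_mse + theta2 * l_eig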
Specification of outlier ratio
ODBAE identifies outliers based on reconstruction error, with larger
reconstruction errors indicating greater abnormality. In the
implementation of ODBAE, we provide two strategies for determining the
reconstruction error threshold. Given a dataset of N samples, when a
user-defined outlier proportion δ is specified, the model classifies
the top N × δ samples with the highest reconstruction errors as
outliers. Alternatively, when the threshold is determined
automatically, the model assesses the distribution of reconstruction
errors. If the errors approximately follow a Gaussian distribution,
outliers are identified based on the 3σ rule. Otherwise, kernel density
estimation is employed to derive the probability density function of
the reconstruction errors, and data points with reconstruction errors
exceeding the 95th percentile are considered anomalous.
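A sketch of the two thresholding strategies (the choice of Shapiro-Wilk as the Gaussianity check is our assumption; the text does not name a test):

    import numpy as np
    from scipy import stats

    def outlier_mask(errors, delta=None):
        # Strategy 1: user-specified outlier proportion delta
        if delta is not None:
            cutoff = np.quantile(errors, 1.0 - delta)   # top N * delta errors
            return errors > cutoff
        # Strategy 2: automatic threshold. A normality test (assumed here
        # to be Shapiro-Wilk) decides "approximately Gaussian".
        if stats.shapiro(errors).pvalue > 0.05:
            return errors > errors.mean() + 3 * errors.std()   # 3-sigma rule
        # Otherwise: kernel density estimate of the error distribution,
        # flagging errors beyond the 95th percentile of the fitted density
        kde = stats.gaussian_kde(errors)
        grid = np.linspace(errors.min(), errors.max(), 10_000)
        cdf = np.cumsum(kde(grid))
        cdf /= cdf[-1]
        cutoff = grid[np.searchsorted(cdf, 0.95)]
        return errors > cutoff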
Anomaly explanation of ODBAE
For each outlier detected by ODBAE, we need to identify its abnormal
parameters. ODBAE obtains abnormal parameters from the largest
reconstruction errors together with SHAP values. Specifically, for each
outlier, we first sort its parameters in descending order of
reconstruction error. If the sum of the reconstruction errors of the
first $n$ parameters exceeds 50% of the outlier's total reconstruction
error, these $n$ parameters are considered potentially anomalous. If
$n = 1$, we use SHAP values to identify the additional parameter with
the greatest impact on this anomalous parameter; if the SHAP value of
these two parameters exceeds the set threshold, both are considered
abnormal parameters, otherwise only the parameter with the highest
reconstruction error is. If $n > 1$, we use SHAP values to find the
additional parameter with the greatest impact on the parameter with the
highest reconstruction error; if the SHAP value of these two parameters
exceeds the set threshold, they are considered abnormal parameters,
otherwise the two parameters with the highest reconstruction errors are.
In our work, the SHAP threshold was set to the mean of the SHAP values
computed during anomaly explanation across all outliers.
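A sketch of the cumulative-error part of this procedure (the SHAP computation itself is abstracted behind `shap_top_partner`, a hypothetical helper returning the most influential partner parameter and its SHAP value):

    import numpy as np

    def candidate_parameters(per_param_errors, shap_top_partner, shap_threshold):
        # Sort parameters by reconstruction error, descending
        order = np.argsort(per_param_errors)[::-1]
        errs = per_param_errors[order]
        # Smallest n whose cumulative error exceeds 50% of the total
        n = int(np.searchsorted(np.cumsum(errs), 0.5 * errs.sum()) + 1)
        top = order[0]
        # shap_top_partner(top) -> (partner index, SHAP value); hypothetical
        partner, shap_value = shap_top_partner(top)
        if shap_value > shap_threshold:
            return [top, partner]
        return [top] if n == 1 else list(order[:2])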
Identification of outliers using z-score method
In developmental and metabolism-related datasets, we first identified
mice with anomalies using the z-score method. For example, the
metabolism-related datasets contain 14 metabolic parameters. The z-score
of each mouse for each metabolic parameter was calculated as

$$Z_{ij} = \frac{X_{ij} - \mu_i}{\sigma_i},$$

where $Z_{ij}$ is the z-score of the $j$th mouse for the $i$th metabolic
parameter, $X_{ij}$ is the value of the $i$th metabolic parameter for
the $j$th mouse, and $\mu_i$ and $\sigma_i$ are the mean and standard
deviation of the $i$th metabolic parameter, respectively. The larger the
absolute value of $Z_{ij}$, the further the $i$th metabolic parameter of
the $j$th mouse deviates from the mean. Therefore, for each metabolic
parameter, mice with large absolute z-scores were labeled as outliers.
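A minimal sketch of this computation for a parameters-by-mice matrix `X` (the cutoff on |Z| is an assumption; the text only says "large"):

    import numpy as np

    def zscore_outliers(X, cutoff=3.0):
        # Z_ij = (X_ij - mu_i) / sigma_i, computed per parameter (row)
        Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
        # Flag mice whose |Z| exceeds the cutoff for each parameter;
        # the cutoff value itself is an assumption, not given in the text
        return np.abs(Z) > cutoff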
Identification of mice with low BMI values
We calculated BMI values for all knockout mice; the mean BMI of the
abnormal mice after Ckb knockout in females was lower than that of
97.14% of all mice. Even so, a small number of mice from the 264
single-gene knockout strains had lower BMIs: only 13 strains (4.92%) had
more than 50% of mice with BMI values below the mean BMI of Ckb knockout
mice. Among these 13 genes, 7 were identified as important genes by
ODBAE, and for the remaining 6 genes, a portion of mice from each
corresponding knockout strain were detected as outliers.
Validation of sexually dimorphic genes
When applied to the IMPC metabolic dataset, ODBAE identified a total of
203 metabolism-associated genes, with a marked divergence observed
between male- and female-specific genes (Fig. 5b). To further
validate the biological relevance of these findings, we first screened
for genes exhibiting significant sexual dimorphism. In our framework, a
gene was considered significant if the proportion of outliers among its
corresponding knockout strains exceeded 50%. To ensure that the
observed abnormalities were indeed attributable to the gene and not
random fluctuations, we initially restricted our analysis to genes with
sufficient sample sizes in both sexes. Given that the majority of
knockout strains in both male and female datasets comprised seven
individuals (Supplementary Fig. 11a, b), we retained genes with at
least seven samples per sex. Subsequently, to identify genes with
pronounced sexual dimorphism, we focused on genes for which the outlier
ratio exceeded 50% in only one sex and was zero in the other. This
yielded a final set of 18 genes showing strong evidence of sex-specific
metabolic effects.
To further evaluate whether the sexual dimorphism observed in these 18
genes had been previously reported, we conducted a literature search in
PubMed using each gene name in combination with the keywords “sexual
dimorphism”, “sexual”, “sex”, “male”, or “female”. This search revealed
that 10 of the 18 genes had been previously documented as exhibiting
sexual dimorphism.
Comprehensive performance evaluation
We compared ODBAE with MSE-AE, MAE-AE, and DAGMM on the detection of IP
and HLP, computing AUC and AP scores with the IP and HLP regarded as
positives.
We implemented DAGMM with minimal modification so that it adapts to our
datasets, keeping its number of layers and the dimension of each layer
consistent with ODBAE, MSE-AE, and MAE-AE.
For each experiment, the structure of the autoencoder was as follows.
The dimension of the hidden layer was the intrinsic dimension of the
input dataset, and its activation function was ReLU. Besides, the
activation function of the output layer was sigmoid. All datasets were
normalized before outlier detection. During the autoencoder training
process, Adam^49 was used to optimize the loss function and the learning
rate was 10^−3.
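Putting the pieces together, a sketch of this training setup, reusing the `Autoencoder` and `odbae_loss` sketches above (full-batch training and the epoch count are illustrative choices, not settings stated in the text):

    import torch

    def train_odbae(X, l, beta, epochs=500):
        # X: normalized (n, m) data tensor; l: intrinsic dimension
        model = Autoencoder(X.shape[1], l)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr = 10^-3
        for _ in range(epochs):
            opt.zero_grad()
            loss = odbae_loss(X, model(X), beta)              # Eq. (17)
            loss.backward()
            opt.step()
        # Per-sample reconstruction errors used for outlier scoring
        with torch.no_grad():
            return ((X - model(X)) ** 2).sum(dim=1)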
To evaluate the performance of various outlier detection methods across
a range of predefined outlier ratios, each method was executed 10 times
per ratio. For each run, we recorded the AUC and AP scores. The final
comparison was based on the mean AUC and AP scores averaged across the
ten replicates for each method.
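The averaging protocol could look like the following sketch (scikit-learn metrics; `run_detector` is a hypothetical wrapper that trains one model and returns outlier scores):

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    def mean_scores(run_detector, X, labels, n_runs=10):
        # labels: 1 for IP/HLP (positives), 0 for normal points
        aucs, aps = [], []
        for _ in range(n_runs):
            scores = run_detector(X)          # one independent training run
            aucs.append(roc_auc_score(labels, scores))
            aps.append(average_precision_score(labels, scores))
        return np.mean(aucs), np.mean(aps)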
Selection of hyperparameters θ[1], θ[2] in training loss function
Our new training loss function is formulated as
L(ω, b) = θ[1]L[MSE](ω, b) + θ[2]L[EIG](ω, b). It can be seen that our
loss function contains two loss terms. The function of L[MSE](ω, b) is
to drive the autoencoder to reconstruct the input dataset as well as
possible, while L[EIG](ω, b) aims at suppressing complete
reconstruction of the dataset by the autoencoder and balancing the degree
of reconstruction in the direction of each principal component of the
input dataset. Clearly, when the ratio $\theta_1 : \theta_2$ is large,
the autoencoder reconstructs the input dataset nearly perfectly, which
adversely affects HLP detection. Conversely, if $\theta_1 : \theta_2$ is
very small, the autoencoder reconstructs the input dataset poorly, which
hampers IP detection. It is therefore essential to choose a reasonable
value of $\theta_1 : \theta_2$ so that the autoencoder detects both HLP
and IP well.
In our work, we found that $\theta_1 : \theta_2 = 0.008 : 1$ yielded the
desired results in most experiments. To study the sensitivity of this
ratio, we varied its base to see how it affects outlier detection
accuracy. Specifically, for a base value $c$, we set $\theta_1 = 0.008c$
and $\theta_2 = c$. We conducted experiments on the Dry Bean dataset:
for a fixed base $c$, we assessed the outlier detection performance of
ODBAE across different outlier ratios δ (δ ∈ {0.05, 0.1, 0.15, 0.2,
0.25}) and then computed the average score across all ratios. The AUC
and AP scores obtained by ODBAE varied only marginally with the base
value (Supplementary Table 1), indicating that $\theta_1$ and $\theta_2$
are insensitive to this choice.
Determination of hyperparameter β
Although the hyperparameter β > 0 in Eq. (17) benefits HLP detection,
the autoencoder's data reconstruction ability degrades as β increases,
analogous to an underfitted model in regression analysis, which
adversely affects IP detection. We therefore need to determine how to
choose an appropriate value of β.
If the intrinsic and actual dimensions of the dataset are equal, i.e.,
$l = m$, then most outliers in the dataset are HLP. In this case,
considering that $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} > 0$ and
$\hat{\lambda}_k > 0$, $k = 1, 2, \ldots, m$, the hyperparameter β
should satisfy $0 < \beta < \min_{1 \le i \le m}(\sqrt{\lambda_i})$.
According to the previous analysis, any β > 0 improves HLP detection;
meanwhile, to maintain IP detection, β must be small enough.
According to Eq. (12), if β = 0 then $\hat{R}_k(Y) = R_k(Y)$, i.e.,
$\hat{Y}_k = Y_k$, $k = 1, 2, \ldots, m$. However, autoencoders usually
include nonlinear activation functions to learn nonlinear features of
the input dataset; in our work, for example, the output layer uses a
sigmoid activation. Since commonly used nonlinear activation functions
saturate, the output hardly changes when the input values are very large
or very small. Consequently, for each dimension of the input data, very
large or very small values are reconstructed poorly, while the remaining
values are reconstructed well. Most HLP detected by the autoencoder are
then anomalous in a single dimension, while HLP whose values are within
the normal range in every dimension are ignored. Hence, when β is too
small, it not only harms HLP detection but also barely improves IP
detection; that is, there exists ξ > 0 such that for β < ξ, IP detection
improves little. If we determine the value of ξ and set β = ξ, then on
the one hand the autoencoder's IP detection is maintained; on the other
hand, we balance the autoencoder's reconstruction while avoiding the
influence of activation-function saturation on data reconstruction,
which improves HLP detection.
In our work, the structure of the autoencoder and the hyperparameter
settings are described above. For each dimension of any dataset, input
values between 0.15 and 0.85 are hardly affected by the saturation of
the nonlinear activation function. We therefore choose β so that the
output values of each dimension lie between 0.15 and 0.85; as a result,
for each dimension, the interval length of the output values is 0.7
times that of the input values. Before outlier detection the input
dataset is normalized, so the input values of each dimension lie between
0 and 1. Recall that $Y_k \sim N(\nu_k, \lambda_k)$ and
$\hat{Y}_k \sim N(\nu_k, \hat{\lambda}_k)$, $k = 1, 2, \ldots, m$. The
probability that $Y_k$ falls in
$(\nu_k - 3\sqrt{\lambda_k},\, \nu_k + 3\sqrt{\lambda_k})$ is 0.9974, so
its distribution interval length is approximately $6\sqrt{\lambda_k}$;
similarly, the distribution interval length of $\hat{Y}_k$ is
approximately $6\sqrt{\hat{\lambda}_k}$. We can therefore control
$\sqrt{\hat{\lambda}_k} = 0.7\sqrt{\lambda_k}$, which gives
$\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = 0.3\sqrt{\lambda_k}$,
$k = 1, 2, \ldots, m$.
Based on the above discussion, the value of the hyperparameter β when
$l = m$ is determined as follows. If
$\max_{1 \le i \le m}(0.3\sqrt{\lambda_i}) \le \min_{1 \le i \le m}(\sqrt{\lambda_i})$,
we set $\beta = \max_{1 \le i \le m}(0.3\sqrt{\lambda_i})$; otherwise,
we set $\beta = \min_{1 \le i \le m}(\sqrt{\lambda_i})$.
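A sketch of this rule (the cutoff `eps` separating eigenvalues "close to 0" is an assumption; excluding those eigenvalues also covers the l < m case discussed next):

    import numpy as np

    def choose_beta(X, eps=1e-4):
        # Eigenvalues of the covariance matrix of the normalized dataset
        lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
        sqrt_lam = np.sqrt(lam[lam > eps])   # keep eigenvalues not close to 0
        # beta = max(0.3 * sqrt(lambda_i)), capped at min(sqrt(lambda_i))
        return min(0.3 * sqrt_lam.max(), sqrt_lam.min())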
In practical applications, we often encounter datasets whose intrinsic
dimension $l$ is smaller than the actual dimension $m$. In this case,
only $l$ eigenvalues of the covariance matrix of the dataset are not
close to 0, which we denote as $\lambda'_k$, $k = 1, 2, \ldots, l$.
During training, we only need to control these $l$ eigenvalues so that
$\sqrt{\lambda'_k} - \sqrt{\hat{\lambda}'_k} = 0.3\sqrt{\lambda'_k}$,
$k = 1, 2, \ldots, l$. Moreover, since $\lambda'_k > 0$, β should
satisfy $0 < \beta < \min_{1 \le i \le l}(\sqrt{\lambda'_i})$. In fact,
if $l < m$, some data points have a large impact on the autoencoder's
reconstruction ability (i.e., IP), so it is important not only to
improve HLP detection but also to maintain IP detection; hence β must be
small enough but nonzero. For convenience, we impose the same form of
constraint, $0 < \beta < \min_{1 \le i \le l}(\sqrt{\lambda'_i})$, in
this case, and the scheme for determining the value of β is the same as
when $l = m$.
Statistics and reproducibility
All data analyses were performed using Python 3.7.7 and R 3.4.3
software. To ensure the reproducibility of our findings, we provide
detailed descriptions of the data generation process, including sample
sizes used in the synthetic data experiments. For experiments involving
real-world datasets, we report the sources of the data, the
preprocessing procedures, and all relevant analysis details. When
comparing the outlier detection performance of ODBAE against other
models, we ran each model 10 times under fixed outlier ratios and
recorded the AUC scores from each run. The final performance
comparisons are based on the average AUC scores across runs to ensure
robustness and reproducibility. Similarly, to assess the impact of
dataset dimensionality, noise level, and model architecture on ODBAE’s
performance, we report the average AUC scores obtained from 10
independent runs under each condition.
Ethics statement
The gene knockout mouse data used in this study were obtained from the
International Mouse Phenotyping Consortium (IMPC), and no experiments
were conducted here. All IMPC research centers make every effort to
minimize animal suffering through careful housing and husbandry, with
detailed information available at the IMPC portal:
https://www.mousephenotype.org/about-impc/arrive-guidelines.
Ethics statements from individual research centers are provided in
Supplementary Data 4.
Reporting summary
Further information on research design is available in the Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
Supplementary Information (5.7MB, pdf)
Description of Additional Supplementary Files (33KB, pdf)
Supplementary Data 1 (1.6MB, xlsx)
Supplementary Data 2 (59.2KB, xlsx)
Supplementary Data 3 (192.9KB, xlsx)
Supplementary Data 4 (10.7KB, xlsx)
Reporting Summary (2.8MB, pdf)
Acknowledgements