Abstract
Identifying complex phenotypes from high-dimensional biological data is
challenging due to the intricate interdependencies among different
physiological indicators. Traditional approaches often focus on
detecting outliers in single variables, overlooking the broader network
of interactions that contribute to phenotype emergence. Here, we
introduce ODBAE (Outlier Detection using Balanced Autoencoders), a
machine learning method designed to uncover both subtle and extreme
outliers by capturing latent relationships among multiple physiological
parameters. ODBAE’s revised loss function enhances its ability to
detect two key types of outliers: influential points (IP), which
disrupt latent correlations between dimensions, and high leverage
points (HLP), which deviate from the norm but go undetected by
traditional autoencoder-based methods. Using data from the
International Mouse Phenotyping Consortium (IMPC), we show that ODBAE
can identify knockout mice with complex, multi-indicator
phenotypes—normal in individual traits, but abnormal when considered
together. In addition, this method reveals novel metabolism-related
genes and uncovers coordinated abnormalities across metabolic
indicators. Our results highlight the utility of ODBAE in detecting
joint abnormalities and advancing our understanding of homeostatic
perturbations in biological systems.
Subject terms: Software, Bioinformatics
__________________________________________________________________
ODBAE offers a powerful approach for detecting complex anomalies and
characterizing unknown phenotypes within biological systems.
Introduction
Phenotypes, including symptoms, defined as the observable
characteristics or traits of an organism, are the result of complex
interactions between genotype and environmental factors^[34]1,[35]2.
These traits are often quantified using physiological or pathological
indicators that serve as proxies for the underlying biological or
pathological processes. Recent advances in phenotypic research,
particularly through genetic manipulation, have deepened our
understanding of gene function and the mechanisms underlying
pathological conditions across a wide range of organisms, including
plants and animals^[36]2. Traditionally, gene-related phenotypic
analyses have focused on identifying abnormalities in individual
physiological indicators^[37]3–[38]5. However, findings from the
International Mouse Phenotyping Consortium (IMPC) demonstrate strong
correlations between physiological indicators, suggesting that many
traits and diseases result not from isolated abnormalities, but from
coordinated disruptions across multiple physiological indicators^[39]6.
Our current understanding of phenotypes or diseases often emphasizes
the detection of outliers in individual physiological indicators,
typically those outside the normal physiological range. However, the
emergence of disease in an organism is a more complex process.
Homeostasis, the dynamic balance maintained by biological systems, can be
perturbed at multiple levels before any single indicator deviates outside
the normal range^[40]7. This suggests that phenotypic abnormalities may
manifest as imbalances between correlated indicators, even when each
individual measure remains within its expected range. Such imbalances
may represent early warning signs of disease or dysfunction that are
missed by traditional univariate analysis.
We hypothesize that the coordinated perturbation of physiological
indicators reflects an emerging phenotype or disease state, even when
individual measures appear normal. By examining these subtle
interdependencies, we can gain insight into how homeostasis is
perturbed in knockout mouse models. Furthermore, for genes whose
knockout does not result in obvious outliers in individual parameters,
disruption of homeostatic balance may still occur through correlated
indicators. This approach challenges the notion that the absence of a
detectable abnormal phenotype in knockout mice indicates no functional
impact, and instead suggests that our focus on single-indicator
abnormalities may obscure more complex systemic effects.
Existing methods, such as linear regression^[41]6,[42]8 and basic
bioinformatics tools^[43]9–[44]11, have provided initial insights into
these relationships. However, these techniques struggle to capture
non-linear and complex relationships between multiple physiological
indicators, especially in high-dimensional biological datasets. Machine
learning methods, particularly autoencoders, have emerged as powerful
tools for detecting outliers by learning complex, latent patterns
within data^[45]12–[46]18. By compressing and reconstructing data,
autoencoders can capture subtle variations and relationships that may
indicate phenotypic abnormalities across multiple indicators. Their
ability to learn from high-dimensional data makes them well suited for
analyzing the intricate dynamics involved in phenotypic
expression^[47]19,[48]20. In regression analysis, outliers can be
categorized as influential points (IP), which exert a substantial
influence on the model’s fitting results, and high leverage points (HLP),
which deviate from the center of the dataset but do not necessarily
impact the model fit^[49]21. Since the reconstruction of any dataset by
an autoencoder can be regarded as a regression process, IP are easily
detected when autoencoders are used for outlier detection. However,
considerable uncertainty remains about how well autoencoders detect HLP.
In this study, we introduce Outlier Detection using Balanced
Autoencoders (ODBAE), a machine learning method designed to detect
outliers by capturing the relationships between physiological
indicators. ODBAE improves on traditional autoencoders by balancing the
reconstruction error across principal component directions, enhancing
its ability to detect both IP, which deviate from the expected
relationships between indicators, and HLP, which lie far from the data
center. We apply ODBAE to developmental and metabolic datasets from
IMPC to explore how coordinated perturbations in physiological
parameters can reveal previously unidentified phenotypes, even in cases
where knockout mice show no abnormalities in individual traits. By
using ODBAE, we aim to identify novel gene functions and phenotypic
complexities that have been missed by single-indicator screening
methods. Collectively, these findings establish ODBAE as a powerful
tool for identifying complex anomalies and unknown phenotypes within
biological systems.
Results
Outlier detection in ODBAE
ODBAE takes tabular datasets from various sources as input, such as the
gene knockout mouse dataset from the IMPC and patient health examination
data. Each row in the data represents a record, and each column
represents an attribute. The goal of ODBAE is to maintain the
effectiveness of the autoencoder in detecting IP while further
improving the detection performance for HLP. This is a comprehensive
approach to outlier detection in high-dimensional tabular datasets.
Furthermore, ODBAE also provides an anomaly explanation for each
outlier, indicating which parameter or parameters are associated with
the anomaly. For gene knockout mouse datasets, ODBAE can identify genes
that, when knocked out, lead to anomalies, as well as the abnormal
parameter pairs associated with the gene knockout.
Overview of ODBAE
To address the challenge of outlier detection in high-dimensional
tabular datasets, we construct an improved autoencoder model (ODBAE)
with a refined training loss function. Our model leverages the strong
representation learning and feature extraction capabilities of
traditional autoencoders for IP, while mitigating their limitations in
HLP detection (“Methods”). The core principle underlying ODBAE-based
outlier detection is that inliers can be well reconstructed, whereas
outliers, including both HLP and IP, may generate significant
reconstruction errors. ODBAE performs unsupervised outlier detection on
high-dimensional input datasets. During the training process, ODBAE
learns as much intrinsic information as possible from the training set
and reconstructs it by minimizing the loss function. The training
dataset selection strategy depends on the initial outlier prevalence
within the dataset. For datasets with few outliers, the entire dataset
is used for both training and testing, as with the wild-type mouse data
from the IMPC. However, if the overall proportion of outliers is
unclear, a subset of the data exhibiting fewer anomalies is chosen for
model training. Then, the entire dataset, or another subset, can be used
for outlier detection. Subsequently, the trained model is applied to the
test dataset, generating reconstruction errors for all sample points.
Finally, sample points with reconstruction errors greater than a
predefined threshold are considered outliers. As shown in Fig. [50]1a,
outlier detection from biological datasets by ODBAE includes three
steps: (1) given a training dataset as input, the intrinsic information
of normal data points is learned; (2) the trained model is used to
reconstruct the test dataset, and outliers are identified according to
their large reconstruction errors (“Methods”); (3) the detected
outliers, including HLP and IP, are explained using the highest
reconstruction errors and kernel-SHAP to identify the abnormal
parameters^[51]22.
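To make this workflow concrete, a minimal PyTorch sketch of the three steps is shown below. The network shape, optimizer settings, and quantile threshold are illustrative assumptions rather than the exact configuration, whose details are given in “Methods”; the revised loss and the explanation step are sketched separately below.

```python
# Minimal sketch of the ODBAE workflow (train -> reconstruct -> threshold).
# Architecture and the 98th-percentile threshold are illustrative assumptions.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features: int, n_hidden: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def fit(model, train_x, loss_fn, n_epochs=200, lr=1e-3):
    """Step 1: learn the intrinsic structure of the (mostly normal) training set."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = loss_fn(train_x, model(train_x))  # ODBAE swaps in its revised loss
        loss.backward()
        opt.step()

def detect(model, test_x, quantile=0.98):
    """Step 2: reconstruct the test set; high-error points are flagged as outliers."""
    with torch.no_grad():
        err = ((test_x - model(test_x)) ** 2).sum(dim=1)  # per-sample error
    return err, err > torch.quantile(err, quantile)
```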
Fig. 1. ODBAE is an approach for identifying and explaining outliers in
tabular datasets; it can discover complex phenotypes and identify novel genes
when applied to knockout mouse datasets.
a Process for discovering complex phenotypes. ODBAE first detects
outliers based on a balanced autoencoder, which balances the
reconstruction of the training dataset to detect both HLP and IP, then
performs anomaly explanation to identify abnormal parameters. Finally,
some outliers are identified due to joint anomalies of multiple
parameters. b Anomaly explanation in ODBAE. The detected outliers are
explained in terms of SHAP values. c For both males and females,
abnormal mice are first identified according to the presence of
metabolic parameters that deviate too far from the mean, and the same
abnormal proportion is then used to identify abnormal mice with ODBAE.
Finally, some important genes are obtained only by ODBAE. d Genes for
which more than 50% of mice in the corresponding single-gene knockout
strain are abnormal are considered important genes by ODBAE. If an
important gene has no known metabolic links in the MGI database, it is
considered a novel gene and further validated by SNP analysis of human
orthologues.
Based on mathematical analysis, ODBAE uses a revised training loss
function that adds an appropriate penalty term to the Mean Square Error
(MSE), balancing the reconstruction by properly suppressing complete
reconstruction by the autoencoder (“Methods”). Specifically, the penalty
term ensures an equal eigenvalue difference between the training and
reconstructed datasets along each principal component direction.
Therefore, the penalty term enhances the detection of HLP, while the MSE
term maintains the identification performance for IP.
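A hedged sketch of such a loss is given below: the MSE term is kept, and an added penalty compares the variance of the input and reconstruction along each principal component direction, pushing the per-direction eigenvalue gaps toward a common value. The exact penalty used by ODBAE is defined in “Methods”; this form and the weight `alpha` are assumptions for illustration.

```python
# Illustrative balanced loss: MSE plus a penalty that equalizes the
# eigenvalue (variance) gap between input and reconstruction across all
# principal component directions. The exact ODBAE penalty is in "Methods".
import torch

def balanced_loss(x, x_hat, alpha=1.0):
    mse = ((x - x_hat) ** 2).sum(dim=1).mean()

    # Principal directions of the input batch.
    xc = x - x.mean(dim=0)
    cov_x = xc.T @ xc / (x.shape[0] - 1)
    _, eigvecs = torch.linalg.eigh(cov_x)

    # Per-direction variance of input and reconstruction.
    var_x = (xc @ eigvecs).var(dim=0)
    var_hat = ((x_hat - x_hat.mean(dim=0)) @ eigvecs).var(dim=0)

    # Penalize unequal eigenvalue gaps so no direction is reconstructed
    # disproportionately well (which would hide HLP along it).
    gap = var_x - var_hat
    penalty = ((gap - gap.mean()) ** 2).sum()
    return mse + alpha * penalty
```

In this sketch the penalty vanishes exactly when every principal direction loses the same amount of variance after reconstruction, which is the balancing behavior described above.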
To explain each outlier detected by ODBAE, we first identify the top
features that contribute the most to the reconstruction error, and then
apply kernel-SHAP to obtain the features that have the greatest impact
on them (Fig. [54]1b and “Methods”). Finally, ODBAE outputs
instance-based outliers and their explanations. For outliers grouped
into categories, if the anomaly rate of a category exceeds the set
threshold, ODBAE provides an anomaly explanation for that category
according to the mean value of each feature.
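The sketch below illustrates this two-stage explanation using the `shap` package; the background-sample choice and helper names are assumptions about how a trained model would be wired up.

```python
# Sketch of the two-stage anomaly explanation, assuming a trained `model`
# and the `shap` package. Helper and variable names are hypothetical.
import numpy as np
import shap
import torch

def explain_outlier(model, background, outlier):
    """background: (k, m) inlier sample; outlier: (1, m) detected outlier."""
    with torch.no_grad():
        recon = model(torch.as_tensor(outlier, dtype=torch.float32)).numpy()
    # Stage 1: the feature contributing most to the reconstruction error.
    top = int(np.argmax(((outlier - recon) ** 2)[0]))

    # Stage 2: kernel-SHAP attributes that feature's error to the inputs.
    def top_feature_error(x):
        with torch.no_grad():
            xh = model(torch.as_tensor(x, dtype=torch.float32)).numpy()
        return (x[:, top] - xh[:, top]) ** 2

    explainer = shap.KernelExplainer(top_feature_error, background)
    return top, explainer.shap_values(outlier)
```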
ODBAE’s comprehensive benefits
The power of ODBAE lies in its ability to efficiently detect both HLP
and IP in complex, high-dimensional biological datasets. HLP are
outliers that deviate from the central distribution, while IP
significantly influence model results. To evaluate the performance of
ODBAE in detecting IP, we compared it to principal component
analysis^[55]23, a common outlier detection method. ODBAE demonstrated
superior performance, particularly in identifying IP, due to its
enhanced ability to capture nonlinear relationships between data points
(Supplementary Fig. [56]1a–d).
We then applied ODBAE to phenotypic data from the IMPC to identify
complex phenotypes in knockout mouse models. Using a threshold of the
top 2% of absolute z-scores^[57]24 for any given physiological
parameter, we flagged mice as potential outliers (“Methods”). We then
applied ODBAE with a predefined abnormality ratio (Fig. [58]1c). If
more than 50% of mice from a single-gene knockout strain were
classified as outliers, the corresponding gene was identified as
significant and further analyzed (Fig. [59]1d). ODBAE was able to
identify key mutant strains with complex phenotypes, even in cases
where individual physiological indicators appeared normal.
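A pandas sketch of this screening rule is shown below; the `gene` column name and DataFrame layout are assumptions about how the IMPC records are organized.

```python
# Sketch of the strain-level screen: per-parameter z-scores, a top-2% cut,
# and a >50% abnormal fraction per knockout strain. Column names are
# hypothetical.
import pandas as pd

def significant_genes(df: pd.DataFrame, params, z_quantile=0.98):
    z = (df[params] - df[params].mean()) / df[params].std()
    cut = z.abs().quantile(z_quantile)            # per-parameter cutoff
    is_outlier = (z.abs() > cut).any(axis=1)      # flagged on any parameter
    frac = is_outlier.groupby(df["gene"]).mean()  # abnormal fraction per strain
    return frac[frac > 0.5].index.tolist()
```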
To illustrate, we analyzed eight developmental parameters: Body length
(BL), Body weight (BW), Bone Area (BA), Bone Mineral Density (excluding
skull), Distance travelled total, Forelimb and hindlimb grip strength
measurement mean, heart rate, and Heart weight. Using wild-type mice as
the training set and knockout mice as the test set, we evaluated 1904
single-gene knockout mouse strains (“Methods”). Given the prevalence of
sexual dimorphism in disease phenotypes^[60]3,[61]25, males and females
were analyzed separately (Supplementary Data [62]1a–d). In the female
dataset, ODBAE identified Ckb null mice as outliers despite their
individual parameter values being within the normal range. These mice
had normal BL and BW, but their body mass index (BMI) was abnormally
low. Specifically, four of the eight Ckb null mice had extremely low
BMI values, calculated as BMI = BW (g)/BL^2 (cm)^[63]26–[64]28.
Analysis revealed that while BL and BW were within the expected ranges,
their relationship was abnormal, leading to the identification of these
mice as outliers (Fig. [65]2a). The average BMI of these Ckb-deficient
mice was lower than 97.14% of other mice (Fig. [66]2b), demonstrating
the sensitivity of ODBAE to complex, multivariate outliers. In
addition, Ckb has previously been implicated in developmental processes
and obesity, further strengthening the biological significance of these
findings^[67]29.
Fig. 2. ODBAE accurately identifies outliers with multi-dimensional joint
anomalies.
a Scatter plot of parameters BL and BW for the mouse developmental
dataset; the data points corresponding to mice with the gene Ckb
knocked out deviate significantly from most of the data points, but
their values of both BL and BW are within the normal range. The BL or BW
values of the data points in the shaded part are ranked in the top 2%
based on the absolute value of the z-score. b The distribution of
standardized BMI of all knockout mice in the mouse developmental
dataset; the mean BMI of Ckb knockout mice is significantly smaller.
c–h Outlier detection
results for 3 synthetic 2-dimensional Gaussian distribution datasets
when the outlier ratio is 0.05, including detection results of
MSE-trained autoencoder and ODBAE for 2-dimensional Gaussian
distribution dataset with diagonal covariance matrix and non-Gaussian
distributed noise (c, f), 2-dimensional Gaussian distribution dataset
with diagonal covariance matrix (d, g) and 2-dimensional Gaussian
distribution dataset with non-diagonal covariance matrix (e, h). The
orange line and grey line in (c) represent two principal directions. i,
j Outlier detection results of MSE-trained autoencoder (i) and ODBAE
(j) for 3-dimensional dataset with an intrinsic dimension of 2 when the
outlier ratio is 0.05, and the data points follow a 2-dimensional
Gaussian distribution with respect to Parameter1 and Parameter3.
ODBAE also identified previously unknown phenotypes, highlighting its
potential to uncover novel gene functions. In terms of the
identification of HLP, most studies choose MSE as the training loss
function and force the autoencoder to completely reconstruct the
training dataset, which brings two major limitations: (1) only HLP
along principal component directions with poor reconstruction can be
detected; (2) in the process of anomaly detection, each dimension of
the dataset is independently detected. To evaluate the performance
difference between our modified loss function and MSE in HLP detection,
we used these two trained autoencoders to detect outliers in several
datasets following multi-dimensional Gaussian distributions. For
example, the data in Fig. [70]2c–e were three 2-dimensional Gaussian
distribution datasets, and the data in Fig. [71]2i formed a
3-dimensional dataset whose intrinsic dimension was two; its data points
followed a 2-dimensional Gaussian distribution with respect to
Parameter1 and Parameter3 (“Methods”). The detection result in Fig. [72]2c shows that
most of the detected HLP lie in the direction of the principal
components indicated by the blue line, but few lie in the other
direction. In Fig. [73]2d, HLP in two dimensions of the dataset are
independently detected, making outliers that are anomalous in multiple
dimensions (outliers at the four corners) undetectable. Fig. [74]2i
demonstrates the same issue in three-dimensional space. Moreover, these
issues persisted when the activation function was replaced with another
function (Supplementary Figs. [75]2a–d and [76]3a–d). This is because
the saturating nature of the non-linear activation function causes the
autoencoder to have different reconstruction capabilities for data
points located at the boundaries of the dataset compared to those
located in the interior (Supplementary Fig. [77]4a, c and Supplementary
Note [78]1). Usually, the distribution of a dataset along various
principal component directions may not be balanced, and the number of
data points falling into unreconstructable regions may also be uneven
(“Methods”). This leads to different detection performances for HLP
along different principal component directions. Besides, the
distribution of the reconstructed dataset is independent across all
dimensions (Supplementary Fig. [79]4a, c). Therefore, the new loss
function in ODBAE is designed to ensure that the reconstructed dataset
follows a balanced joint distribution (Supplementary Fig. [80]4b, d).
Then, we used ODBAE to detect outliers in these datasets and found that
most outliers were identified (Fig. [81]2f–h, j).
We further evaluated ODBAE’s detection of both IP and HLP using a
synthetic 3-dimensional dataset (Supplementary Fig. [82]5a). Area Under
the Curve (AUC) and Average Precision (AP) scores were used to evaluate the
accuracy of outlier detection results (“Methods”). The generated
3-dimensional training dataset formed a 2-dimensional manifold, where
the subspace represented by Parameter1 and Parameter3 followed a
2-dimensional Gaussian distribution (“Methods”, Supplementary
Fig. [83]5a). Here, we assumed that the ratio of HLP and the ratio of
IP were equal. During the anomaly detection, we used the training
dataset as the test dataset. Then, we considered the top ⌊δn⌋ data
points in the subspace represented by Parameter1 and Parameter3, ranked
by Mahalanobis distance, as HLP, where δ was the outlier ratio, and n
was the number of data points in the training dataset. In addition, we
generated ⌊δn⌋ points that were not on the manifold of the
training dataset and treated them as IP (Supplementary Fig. [84]5b).
The results showed that ODBAE worked better than other anomaly
detection methods (Supplementary Fig. [85]5c, d).
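The evaluation protocol can be sketched as follows: ground-truth HLP labels come from the Mahalanobis ranking in the (Parameter1, Parameter3) subspace, and the reconstruction errors are scored with AUC and AP (scikit-learn scoring functions assumed).

```python
# Sketch of the HLP labelling and scoring used in the synthetic evaluations.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def hlp_labels(sub, delta):
    """Mark the top floor(delta * n) points by Mahalanobis distance as HLP."""
    diff = sub - sub.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(sub, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distances
    labels = np.zeros(len(sub), dtype=int)
    labels[np.argsort(d2)[-int(np.floor(delta * len(sub))):]] = 1
    return labels

# scores = per-sample reconstruction errors from the trained model, then:
# roc_auc_score(labels, scores), average_precision_score(labels, scores)
```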
In summary, ODBAE is a powerful tool for detecting complex phenotypes
in high-dimensional biological datasets, capturing both isolated and
coordinated abnormalities. This capability allows the identification of
novel gene functions and previously undetected phenotypes, particularly
in cases where traditional methods fail to detect subtle perturbations
in physiological homeostasis.
ODBAE outperforms other methods by a large margin
To evaluate the strengths of ODBAE, we performed outlier detection with
ODBAE on low-dimensional synthetic datasets, high-dimensional synthetic
datasets, and two benchmark datasets. ODBAE was compared with an
autoencoder trained with the MSE loss function (MSE-AE), an autoencoder
trained with the Mean Absolute Error loss function (MAE-AE), and the
Deep Autoencoding Gaussian Mixture Model (DAGMM)^[86]30. The comparison
results are presented as AUC and AP scores.
We generated three datasets with 2-dimensional Gaussian distributions,
whose intrinsic dimension was also 2. In one of the datasets, the
correlation between the two dimensions was small (Fig. [87]3a); in
another, the correlation between the two dimensions was relatively large
(Fig. [88]3b); in the last dataset, the correlation between the two
dimensions was also small and non-Gaussian distributed noise was present
(Fig. [89]3c). Therefore, the outliers in these three datasets were all
HLP. We then set different outlier ratios and
calculated the corresponding AUC and AP scores of the outlier detection
results. To be specific, if the outlier ratio is δ and the number of
sample points is n, then the input data is sorted according to their
Mahalanobis distance, and the first ⌊δn⌋ sample points are regarded as
positive. For each fixed anomaly ratio δ, outlier detection was
repeated 10 times on the dataset, and the average accuracy scores for
each method were recorded. Fig. [90]3d–f show the AUC scores of the test
results on these three datasets, and Fig. [91]3g–i show the
corresponding AP scores. ODBAE outperformed the other schemes in detecting HLP
in all three datasets at any outlier ratio (Supplementary
Data [92]2a–c).
Fig. 3. The outlier detection performance of ODBAE is generally better than
that of other schemes.
a–c Data point distribution plots for 3 synthetic datasets: a
2-dimensional Gaussian distribution dataset with diagonal covariance
matrix (a), a 2-dimensional Gaussian distribution dataset with
non-diagonal covariance matrix (b), and a 2-dimensional Gaussian
distribution dataset with diagonal covariance matrix and non-Gaussian
distributed noise (c). d–f AUC scores of the comparison results for the
datasets in (a), (b), and (c), respectively. g–i AP scores of the
comparison results for the datasets in (a), (b), and (c), respectively.
j, k Comparison results for the Dry Bean Dataset. l, m Comparison results for
the Breast Cancer Dataset. In (d–m), all methods were run 10 times under
different outlier ratios, and error bars indicate mean ± standard
error.
Since autoencoders are often used for outlier detection in
high-dimensional datasets, we also analyzed the outlier detection
results of ODBAE on two high-dimensional synthetic datasets. Both
datasets followed multi-dimensional Gaussian distributions with diagonal
covariance matrices, so the intrinsic dimension of each dataset was
equal to its actual dimension. One of the datasets was 50-dimensional
and the other was 100-dimensional. ODBAE still performed best in HLP detection
on high-dimensional datasets (Supplementary Fig. [95]6a–d).
To further evaluate the robustness of ODBAE under more realistic
conditions where data distributions deviate from Gaussian assumptions,
we applied the method to two additional synthetic datasets: one sampled
from a two-dimensional Laplace distribution and the other from a
two-dimensional uniform distribution. In both scenarios, ODBAE again
exhibited superior outlier detection performance (Supplementary
Fig. [96]7a–d).
Additionally, we tested ODBAE on two benchmark datasets: the Dry Bean
Dataset and the Breast Cancer Dataset. The Dry Bean Dataset contained
13611 instances and 17 attributes, with 7 classes in the original
dataset. In each experiment, the inliers were 2000 data points from a
given class, and δ × 2000 data points were randomly selected from the
remaining 6 classes as outliers, where δ was the outlier ratio and
δ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. For this dataset, we set the dimension of
the hidden layer of the autoencoder to 8 during the experiment. The
Breast Cancer Dataset contained 569 instances and 31 attributes. All
data points fell into two categories: benign and malignant. We randomly
selected 300 data points labeled benign as inliers and δ × 300 data
points from the malignant category as outliers. For this dataset, the
outlier ratio δ ∈ {0.02, 0.04, 0.06, 0.08, 0.1} and the dimension of the
hidden layer of the autoencoder was 15. Overall, due to the improvement in HLP
detection, ODBAE worked best on both benchmark datasets (Fig. [97]3j–m
and Supplementary Data [98]2d, e).
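The benchmark test sets can be assembled as in the sketch below (Dry Bean shown); the loading step and variable names are assumptions.

```python
# Sketch of benchmark construction: 2000 inliers from one class plus
# floor(delta * 2000) outliers drawn from the remaining classes.
import numpy as np

def make_benchmark(X, y, inlier_class, n_in=2000, delta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    inliers = X[y == inlier_class]
    others = X[y != inlier_class]
    n_out = int(delta * n_in)
    data = np.vstack([
        inliers[rng.choice(len(inliers), n_in, replace=False)],
        others[rng.choice(len(others), n_out, replace=False)],
    ])
    labels = np.r_[np.zeros(n_in, dtype=int), np.ones(n_out, dtype=int)]
    return data, labels
```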
In summary, ODBAE outperforms several existing autoencoder-based
methods in detecting anomalies in both synthetic and benchmark datasets,
indicating that ODBAE can identify anomalous situations more
comprehensively. Therefore, ODBAE provides strong support for finding
unknown anomalies and exploring unknown phenotypes in existing
biological data.
ODBAE exhibits a certain degree of robustness
The performance of ODBAE, like any machine learning framework, is
influenced by several hyperparameters and data conditions. We evaluated
the impact of dataset dimensionality, noise levels, and model
architecture on the outlier detection performance of ODBAE.
We generated Gaussian-distributed datasets of varying dimensionality to
train ODBAE, and evaluated its outlier detection performance by
measuring the AUC scores when the outlier ratio was set to 0.1. As
illustrated in Fig. [99]4a, the anomaly detection performance of ODBAE
remains largely stable as the dimensionality of the dataset increases.
This observation highlights the potential applicability of ODBAE to
outlier detection tasks in high-dimensional datasets.
Fig. 4. ODBAE exhibits a certain degree of robustness.
a Outlier detection performance of ODBAE across datasets with varying
dimensionalities (Supplementary Data [102]2f). For each
fixed-dimensional dataset, the model was trained 10 times, and the AUC
score for outlier detection was recorded after each training run. b
Impact of varying noise intensities on the performance of ODBAE
(Supplementary Data [103]2g). Laplace-distributed noise L(0, s) was
added to the dataset to simulate different noise levels. For each fixed
value of s, the model was trained 10 times on the noise-augmented data,
and the AUC score for outlier detection was recorded after each
training run. c Outlier detection accuracy of ODBAE with architecture
(L, C) on the Dry Bean Dataset (Supplementary Data [104]2h). For each
fixed architecture (L, C), the model was trained 10 times, and the
average AUC score across the 10 runs was recorded to evaluate
performance. Error bars indicate mean ± standard error.
Considering that real-world datasets frequently contain noise, we
further evaluated the impact of data perturbations on the anomaly
detection performance of ODBAE. A two-dimensional Gaussian-distributed
dataset V containing 10,000 samples was generated. To simulate varying
noise intensities, we added perturbation ϵ to each data point, where ϵ
was drawn from a Laplace distribution L(0, s), with larger s values
representing higher noise levels. We then evaluated the performance of
ODBAE trained on the noisy datasets in detecting outliers within the
original dataset V, with the outlier ratio set to 0.1. As shown in
Fig. [105]4b, ODBAE maintained high outlier detection accuracy under
moderate noise levels (s ≤ 0.1), highlighting its robustness in the
presence of noise typically observed in real-world data.
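The noise experiment follows the recipe sketched below; `train_odbae` and `evaluate_auc` are hypothetical stand-ins for the training and scoring routines described above.

```python
# Sketch of the noise-robustness experiment: train on V + Laplace(0, s)
# noise, then score the clean dataset V at an outlier ratio of 0.1.
# train_odbae / evaluate_auc are hypothetical helpers.
import numpy as np

rng = np.random.default_rng(0)
V = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=10_000)

for s in (0.01, 0.05, 0.1, 0.5):
    noisy = V + rng.laplace(loc=0.0, scale=s, size=V.shape)
    # model = train_odbae(noisy)
    # auc = evaluate_auc(model, V, outlier_ratio=0.1)
```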
In addition, we investigated the influence of model architecture on the
performance of ODBAE. We similarly employed the Dry Bean dataset,
randomly selecting 2000 samples from a single class to form the
training dataset. Additionally, 200 outlier samples were randomly drawn
from the remaining 6 classes. We then evaluated the outlier detection
accuracy of ODBAE with varying model architectures denoted as (L, C),
where the model consists of 2L + 1 hidden layers, with all layers
except the bottleneck layer having dimensionality C. The results
demonstrate that both the number of layers L and the dimensionality of
each layer C significantly influence the performance of ODBAE
(Fig. [106]4c). Therefore, in practical applications, tuning these
architectural parameters—specifically, the number of layers and the
dimensionality of each layer—can help optimize the outlier detection
performance of ODBAE.
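The (L, C) family can be built as in the sketch below; the activation function and bottleneck width are assumptions.

```python
# Sketch of the (L, C) architecture family: 2L + 1 hidden layers, all of
# width C except the central bottleneck. Activation choice is an assumption.
import torch.nn as nn

def build_odbae(n_features, L, C, bottleneck):
    widths = [n_features] + [C] * L + [bottleneck] + [C] * L + [n_features]
    layers = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
    return nn.Sequential(*layers[:-1])  # no activation on the output layer
```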
In summary, ODBAE maintains high outlier detection accuracy in
high-dimensional datasets and exhibits robustness to moderate levels of
noise. Furthermore, its performance can be further optimized in
practical applications by tuning the number of layers and the
dimensionality of each layer in the model architecture.
ODBAE identifies new metabolic genes
We applied ODBAE to metabolism-related datasets from IMPC to uncover
genetic elements involved in metabolic regulation. Fourteen key
metabolic parameters were selected for analysis, including Alanine
aminotransferase (ALA), Albumin (Alb), Alkaline phosphatase, Aspartate
aminotransferase (ASA), Calcium (Ca), Creatinine (Cre), Glucose (Glu),
HDL-cholesterol (HDLC), Phosphorus (Ph), Total bilirubin (TB), Total
cholesterol (TC), Total protein (TP), Triglycerides (TG), and Urea. In
total, we analyzed 45922 mice from 3064 single-gene knockout strains,
including 23024 males and 22898 females.
ODBAE was used to detect outliers in both the male and female datasets.
Initially, mice with absolute z-scores in the top 1.2% for at least one
metabolic parameter were flagged as outliers. Of the 23024 male mice,
2672 were flagged as abnormal, and of the 22898 female mice, 2553 were
flagged. To ensure robust detection, we adjusted the proportion of
anomalies and identified the top 10% of mice with the highest
reconstruction errors as outliers. Genes were considered significant if
more than 50% of mice from a given knockout strain were identified as
outliers, suggesting a strong association between these genes and
metabolism. In total, ODBAE identified 128 significant genes in males
(9.5%) and 111 significant genes in females (8.08%), for a total of 203
genes (Fig. [107]5a, b and Supplementary Data [108]1e–h).
Fig. 5. ODBAE identifies new metabolic genes from a metabolism-related
dataset of knockout mice.
a The proportion of important genes identified in the male and female
datasets. b ODBAE identified 128 important genes in the male dataset
and 111 important genes in the female dataset, for a total of 203
important genes. c Sexually dimorphic genes identified by ODBAE, with
previously reported sexually dimorphic genes highlighted in red. d
Visualization results of outliers corresponding to gene Crabp1 in the
subspace formed by abnormal parameters. The absolute z-score values of
Alb or HDLC of the data points in the shaded part are ranked in the top
1.2%. e Validation results of important genes in the MGI database.
78.8% of the genes are associated with metabolism, and 21.2% are newly
identified by ODBAE. Besides, 51.2% and 44.2% of the new metabolic genes
are identified in the male and female datasets, respectively, and 4.7%
are detected in both. f Histogram of KEGG pathway
enrichment for 43 novel genes in mice obtained through KOBAS.
Moreover, we observed that the metabolism-associated genes identified
by ODBAE exhibited pronounced sex-specific differences between male and
female mice (Fig. [111]5b). To evaluate the biological significance of
this sexual dimorphism, we focused on metabolism-related genes
exhibiting pronounced sexual dimorphism. Specifically, we selected
genes for which both male and female mouse strains included at least
seven individuals, and for which the proportion of knockout strains
displaying abnormal phenotypes exceeded 50% in only one sex (either
male or female), while remaining at 0% in the other. In total, 18 genes
were identified as showing significant sexual dimorphism, 10 of which
had been previously reported as sex-dimorphic in earlier studies
(Fig. [112]5c, “Methods” and Supplementary Data [113]3i). These
findings confirm the sexually dimorphic nature of these genes and
underscore the utility of ODBAE as a robust tool for uncovering
metabolism-associated genetic mechanisms.
Notably, 91 of these 203 genes could not be detected using traditional
z-score-based methods. This indicates that the metabolic abnormalities
associated with these genes involve complex, multi-parameter
phenotypes. ODBAE’s anomaly detection includes a detailed explanation
of outliers based on the highest reconstruction error and kernel-SHAP
analysis, which allows us to identify the specific metabolic parameters
driving the anomalies. For example, knockout of the Crabp1 gene
resulted in abnormal levels of Alb and HDLC, even though the individual
z-scores for these parameters were within normal ranges. Visualization
of these data (Fig. [114]5d) shows that while the z-scores for Alb and
HDLC appeared normal, the knockout mice deviated significantly from the
central distribution of all data points, highlighting the complex
metabolic perturbation caused by Crabp1 knockout. More examples are
shown in Supplementary Fig. [115]8a–d.
Further investigation using the Mouse Genome Informatics (MGI) database
revealed that 43 of the 203 genes (21.18%) identified by ODBAE had no
previously known association with metabolic phenotypes (Fig. [116]5e
and Supplementary Data [117]3a). To explore the potential relevance of
these newly identified genes, we analyzed their human orthologues
(Supplementary Data [118]3b). Single-nucleotide polymorphisms (SNPs)
within a ± 1 kb region of these orthologues were extracted from GWAS
Central^[119]31, resulting in a total of 804 SNPs (Supplementary
Data [120]3c). We then investigated their association with metabolic
diseases, focusing on type 2 diabetes (T2D). Specifically, for each
SNP, we evaluated the extent of association across T2D-related traits
using data from the DIAGRAM, GIANT, and GLGC consortia^[121]32–[122]34.
Cross-phenotype meta-analysis (CPMA)^[123]35 identified SNPs in four
gene regions—TWF2, TMED10, HOXA10, and NBAS—that were strongly linked
to T2D-related traits (CPMA p < 0.05) (Supplementary Data [124]3d).
Visualization of the corresponding outliers confirmed that these genes
exhibited distinct patterns of deviation across key metabolic
dimensions (Supplementary Fig. [125]9a–d). In addition to T2D, many of
these novel genes may be associated with other metabolic disorders.
KEGG pathway analysis (via KOBAS) for mouse datasets revealed
significant enrichment in metabolic pathways^[126]36, including
glycolysis, propanoate metabolism, and pyruvate metabolism
(Fig. [127]5f and Supplementary Data [128]3e). These results suggest
that ODBAE not only identifies genes involved in complex metabolic
phenotypes but also pinpoints pathways that are disrupted by genetic
perturbations.
Overall, ODBAE provides a powerful tool for uncovering novel metabolic
genes and phenotypes that are undetectable using traditional
z-score-based methods. It successfully identifies genes associated with
complex, multi-parameter metabolic abnormalities and reveals novel
genetic contributors to metabolic diseases, many of which were
previously unannotated in the MGI database.
ODBAE integrates novel metabolic phenotypes
ODBAE’s ability to detect outliers based on joint abnormalities, beyond
traditional z-score methods, highlights its strength in integrating
complex metabolic phenotypes. Here, we demonstrate how ODBAE can detect
metabolic abnormalities in gene knockout mice by identifying abnormal
relationships between multiple metabolic parameters, rather than
focusing on single-parameter outliers. Our analysis of
metabolism-related datasets from IMPC reveals that a large proportion
of outliers (89.14% of male and 92.75% of female mice) exhibit
abnormalities in multiple metabolic parameters simultaneously. These
findings suggest that joint abnormalities, or perturbations in the
correlations between parameters, are important for detecting phenotypic
changes in knockout mice. For example, knockout of the Ckb gene
disrupts the relationship between BW and BL, resulting in an abnormal
BMI, even though BW and BL are individually within normal ranges. This
highlights how gene knockouts often disrupt intrinsic correlations
between physiological parameters, resulting in complex metabolic
phenotypes.
To further explore these correlations, we identified parameter pairs
most likely to show joint abnormalities in both male and female mice.
In males, 36 different abnormal parameter pairs were observed, with the
most common being (Alb, Ph), (Alb, HDLC), and (ALA, HDLC)
(Fig. [129]6a, c). In females, 40 different abnormal parameter pairs
were identified, with the most common being (ASA, TP), (Alb, ASA), and
(TB, TP) (Fig. [130]6b, c). Their corresponding knockout genes are
shown in Fig. [131]6d, e. Notably, 22 abnormal parameter pairs were
found in both males and females, with the most common pairs being (ALA,
ASA), (ALA, HDLC), and (Alb, ASA). These common abnormalities suggest
conserved metabolic pathways that are affected by specific gene
knockouts in both sexes.
Fig. 6. ODBAE identifies metabolic parameter pairs that tend to show
simultaneous abnormalities.
a, b Chord diagrams plotting the inter-connectivity of metabolic
parameters for males (a) and females (b). The outer segments represent
the metabolic parameters; the size of the arcs connecting the
parameters is proportional to the number of associated outliers. c
Abnormal parameter pairs observed in males and females. d, e Links
between genes and simultaneously abnormal metabolic parameters in males
(d) and females (e).
We then investigated the biological significance of the three most
frequently perturbed parameter pairs. Previous studies have shown that
the ALA/HDLC ratio is associated with the risk of non-alcoholic fatty
liver disease (NAFLD) and diabetes^[134]37–[135]39. ODBAE identified
Lepr as one of the genes causing abnormalities in this pair
(Fig. [136]7a). When the ratio of ALA to HDLC was analyzed across all
mice, the outliers corresponding to the Lepr knockout had a
significantly lower ratio compared to the general population
(Fig. [137]7b), consistent with previous findings linking Lepr
mutations to T2D and NAFLD^[138]40,[139]41. KEGG pathway analysis of
genes associated with (ALA, HDLC) revealed enrichment in pathways
related to NAFLD and metabolic processes, including glycolysis
(Fig. [140]7c and Supplementary Data [141]3f). For the (Alb, ASA) pair,
studies suggest that the ASA/Alb ratio correlates with tumor
progression and prognosis in hepatocellular carcinoma
(HCC)^[142]42,[143]43. ODBAE identified Ap4m1 as a gene causing
abnormalities in this pair (Fig. [144]7d). The average ratio of ASA to
Alb in Ap4m1 knockout mice was significantly larger than in other mice
(Fig. [145]7e), consistent with the role of Ap4m1 in HCC^[146]44. KEGG
pathway enrichment further linked (Alb, ASA) disruptions to pathways
involved in pyruvate metabolism, glycolysis, and proximal tubule
bicarbonate reclamation (Fig. [147]7f and Supplementary Data [148]3g).
Finally, for genes corresponding to the parameter pair (ALA, ASA),
pathway enrichment analysis revealed significant enrichment in
Glycosaminoglycan biosynthesis, Starch and sucrose metabolism, and Type
I diabetes mellitus (Supplementary Fig. [149]10a and Supplementary
Data [150]3h), consistent with the previously reported association of
the ASA/ALA ratio with prediabetes^[151]45. In addition to
these well-characterized parameter pairs, ODBAE also identified novel
correlations worth exploring further. For instance, Taf8 was
highlighted as a gene causing abnormalities in the parameter pair
(HDLC, TP), though little is known about the specific relationship
between these two parameters (Supplementary Fig. [152]10b).
Supplementary Fig. [153]10c shows that there is a linear relationship
between these two parameters, and the knockout of the gene Taf8
disrupts this relationship. ODBAE’s ability to uncover such
relationships without prior biological inference makes it a powerful
tool for discovering previously unknown metabolic interactions.
Fig. 7. Validation of the top three abnormal parameter pairs that occur in
both males and females and have the highest overall frequency.
a Visualization of outliers corresponding to gene Lepr based on
abnormal parameter pairs in the male dataset. b The distribution of the
standardized ratio of ALA to HDLC in the mouse metabolism-related
dataset; the average ratio of Lepr knockout mice is significantly
smaller. c Histogram of KEGG pathway enrichment for genes corresponding
to (ALA, HDLC) in humans. d Visualization of outliers corresponding to
gene Ap4m1 based on abnormal parameter pairs in the female dataset. e
The distribution of the standardized ratio of ASA to Alb in the mouse
metabolism-related dataset; the average ratio of Ap4m1 knockout mice is
significantly larger. f Histogram of KEGG pathway enrichment for genes
corresponding to (Alb, ASA) in humans.
Overall, ODBAE systematically identifies and explains complex
phenotypes by revealing intrinsic relationships between multiple
metabolic parameters, including both linear and non-linear
associations. These insights can serve as new indicators for biomedical
research and provide a more holistic view of metabolic dysfunction in
knockout mice. In addition, the ODBAE framework can be adapted to other
datasets to uncover complex phenotypes in different biological systems.
Discussion
In this study, we present ODBAE, a machine learning approach designed
to detect and explain complex phenotypes by identifying outliers in
high-dimensional biological datasets. The key advantage of ODBAE is its
ability to detect two types of outliers: IP, which deviate
significantly from expected relationships between variables, and HLP,
which are far from the data center. Unlike traditional outlier
detection methods that often focus on single abnormal indicators, ODBAE
identifies multi-dimensional joint abnormalities, providing a more
holistic view of phenotypic disruptions. This approach is particularly
powerful when applied to data from knockout mouse models, where
phenotypic changes are often subtle or involve complex interactions
between physiological parameters.
One of the most significant contributions of ODBAE is its potential to
identify unexpected phenotypes that may not be apparent when analyzing
individual physiological indicators. Traditional methods of phenotype
analysis often focus on abnormalities in a single trait^[156]3–[157]5,
potentially overlooking the coordinated disruptions across multiple
parameters that signal underlying biological imbalances. As our
understanding of disease and homeostasis evolves, it is increasingly
clear that pathological processes often involve system-wide
disturbances rather than isolated dysfunctions^[158]6. ODBAE leverages
these insights by integrating correlations between physiological
indicators, even when individual indicators remain within normal
ranges. This capability allows researchers and clinicians to better
understand the interplay of homeostatic mechanisms and how they are
disrupted in disease states. For example, when applied to
metabolism-related datasets from IMPC, ODBAE successfully identified
novel metabolic genes, including those associated with complex,
multi-parameter phenotypes. These phenotypes would have been difficult
to detect using conventional approaches that focus on single-parameter
abnormalities. Moreover, ODBAE successfully identified sex-dimorphic
genes, underscoring its utility in advancing the study of sexual
dimorphism. ODBAE also uncovered new parameter pairs that tend to
exhibit abnormalities together, providing insights into the intrinsic
relationships between metabolic pathways. This type of discovery is
essential for advancing our understanding of gene function and the
pathophysiology of metabolic diseases. Notably, the utility of ODBAE
extends beyond knockout mouse models. It can be broadly applied to
diverse tabular biological datasets—including disease cohorts, clinical
health examination records, and single-cell datasets—to identify
complex phenotypes and subtle, multidimensional anomalies.
Importantly, ODBAE offers not just detection, but also anomaly
explanation through the use of kernel-SHAP. This feature allows
researchers to pinpoint the exact physiological indicators driving the
detected abnormalities, facilitating deeper insights into the
biological processes involved. By identifying the most significant
reconstruction errors and the features contributing to them, ODBAE
enables a clearer interpretation of complex phenotypic data. This is
particularly relevant in clinical settings, where understanding the
root cause of a phenotype can guide diagnosis and treatment decisions.
Although ODBAE demonstrates considerable potential, there are areas
that warrant further exploration. First, the current framework assumes
a Gaussian distribution in the training data, which may limit its
effectiveness in datasets that do not follow this distribution. As
shown in Fig. [159]4b, ODBAE maintains robust performance under
moderate levels of noise in Gaussian-distributed datasets. However, its
anomaly detection capability progressively declines with increasing
noise intensity. Future work should focus on optimizing the model to
handle non-Gaussian data and improving the accuracy of anomaly
explanations, particularly in cases involving highly complex or subtle
phenotypes. Second, although ODBAE could potentially be extended to
other data types, such as time-series or imaging data, the current
manuscript focuses exclusively on tabular datasets. The model’s
performance in analyzing other types of data, such as longitudinal
time-series or medical imaging, has not yet been empirically tested.
Further exploration and validation in these areas are warranted to
assess ODBAE’s broader applicability. Third, the performance of ODBAE
is influenced by both the number of layers and the dimensionality of
each layer, suggesting that its architecture can be tuned to further
enhance performance in practical applications. However, determining the
optimal model configuration remains a challenging task. Moreover, since
ODBAE identifies anomalies based on the reconstruction error of
individual data points, the choice of threshold for this error is
critical to the accuracy of outlier detection. Further optimization of
threshold selection is therefore warranted. Finally, ODBAE’s use in
identifying phenotypes has primarily focused on metabolic parameters.
Its performance in detecting phenotypes related to other physiological
systems, such as neurological, cardiovascular, or developmental traits,
remains unexplored. Broader testing across a range of physiological
categories would demonstrate the model’s versatility and help identify
any system-specific limitations.
Despite these challenges, ODBAE represents a powerful tool for the
analysis of high-dimensional biomedical datasets. Its ability to detect
multi-dimensional outliers, explain anomalies, and integrate correlated
physiological indicators offers a new way to approach phenotypic
screening, particularly in gene knockout models. We believe that ODBAE
will facilitate the discovery of previously unrecognized phenotypes,
enhancing our understanding of homeostasis and disease processes. By
moving beyond the limitations of single-indicator analysis, ODBAE opens
up new possibilities for both basic research and clinical diagnostics,
allowing for a more comprehensive understanding of complex biological
systems.
Methods
Datasets
To demonstrate the comprehensive benefits of ODBAE, we first performed
outlier detection on developmental datasets obtained from the IMPC.
Subsequently, we assessed ODBAE’s ability to identify HLP using 3
synthetic two-dimensional Gaussian-distributed datasets
(Fig. [160]2c–e) and one three-dimensional dataset (Fig. [161]2i).
Finally, we evaluated the model’s overall detection accuracy for both
IP and HLP using an additional three-dimensional dataset (Supplementary
Fig. [162]5a). For outlier detection on the IMPC developmental dataset,
data from wild-type mice were used as the training dataset, while data
from knockout mice served as the test dataset. The training dataset
comprised 8327 wild-type samples (4126 males and 4201 females), and the
test dataset included 26706 knockout samples (13297 males and 13409
females). To evaluate the accuracy of ODBAE in detecting HLP, as well
as its overall detection performance for both IP and HLP, all synthetic
datasets consisted of 20000 samples each.
To evaluate the advantages of ODBAE over existing methods, we conducted
outlier detection using multiple approaches across five low-dimensional
synthetic datasets, two high-dimensional synthetic datasets, and two
benchmark datasets (the Dry Bean dataset and the Breast Cancer
dataset). In the low-dimensional synthetic dataset experiments, we
generated three two-dimensional datasets drawn from Gaussian
distributions, one from a two-dimensional Laplace distribution, and one
from a two-dimensional uniform distribution, each comprising 20000
samples. For the high-dimensional synthetic datasets, we constructed
one 50-dimensional and one 100-dimensional Gaussian-distributed
dataset, each also consisting of 20000 samples. The Dry Bean dataset
contains 13611 samples across seven distinct categories; we randomly
selected 2000 samples from a single class as the training dataset, with
samples from the remaining six classes designated as outliers. The
Breast Cancer dataset comprises 569 samples, categorized as benign or
malignant. We randomly selected 300 benign samples as the training
data, while malignant samples were treated as outliers.
The metabolism-related dataset used to identify novel metabolic genes
and phenotypes included 21234 wild-type mouse samples (10585 males and
10649 females) and 45922 knockout mouse samples (23024 males and 22898
females). Wild-type samples were employed for training, while knockout
samples served as the test dataset.
Dataset preprocessing
We used developmental and metabolism-related datasets from IMPC to
discover complex phenotypes and identify new genes. The developmental
dataset we used was integrated from 8 phenotyping centers (BCM, HMGU,
ICS, JAX, MRC Harwell, RBRC, TCP, and UC Davis), while the
metabolism-related dataset was from 11 phenotyping centers (BCM,
CCP-IMG, HMGU, ICS, KMPC, MARC, MRC Harwell, RBRC, TCP, UC Davis, and
WTSI). To eliminate experiment-specific variations, we standardized
each phenotyping center for each dataset to have unit variance and zero
mean. Then, Min-Max normalization was applied to normalize each
physiological parameter within each dataset, mitigating dimensional
effects across different physiological parameters. SNPs within a
±1 kb region of orthologues for the 41 newly identified metabolic
genes were downloaded from GWAS Central on September 23, 2023 and
filtered based on $-\log(p) \geq 2$.
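A sketch of this preprocessing is shown below; the `center` column and parameter list are assumptions about the table layout.

```python
# Sketch of the preprocessing pipeline: per-center z-scoring, then global
# Min-Max normalization of each parameter. Column names are hypothetical.
import pandas as pd

def preprocess(df: pd.DataFrame, params, center_col="center"):
    out = df.copy()
    # Remove center-specific location/scale effects (zero mean, unit variance).
    out[params] = out.groupby(center_col)[params].transform(
        lambda col: (col - col.mean()) / col.std()
    )
    # Min-Max normalize each parameter to mitigate dimensional effects.
    out[params] = (out[params] - out[params].min()) / (
        out[params].max() - out[params].min()
    )
    return out
```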
Definition of HLP and IP
Since the reconstruction of any dataset by the autoencoder can be
regarded as a complex regression process, we can divide the outliers in
each dataset into HLP and IP^[163]46. For a given dataset, we divide the
dimension indices into two disjoint sets $V_1$ and $V_2$. Without loss
of generality, we assume that $V_1 = \{1, 2, \ldots, p\}$ and
$V_2 = \{p+1, p+2, \ldots, m\}$. For $i, j \in V_1$, $i \neq j$, there is
no correlation between variables $X_i$ and $X_j$, while for each
$k \in V_2$ there is a map $\phi_k$ such that
$X_k = \phi_k(X_1, X_2, \ldots, X_p)$. Each $X_i$ ($i \in V_1$) is called
a factor and each $X_k$ ($k \in V_2$) is called a response variable.
Then we define the matrix $Q$,

$$Q = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}.$$
Besides, if we let $Q_i = (1, x_{i1}, \ldots, x_{ip})$, the leverage
value of the $i$th sample point is

$$h_i = Q_i \left(Q^\top Q\right)^{-1} Q_i^\top, \quad i = 1, 2, \ldots, n.$$
If we represent the mean of variable $X_k$ ($k \in V_1$) as $\mu_k$, and
define the mean vector $\tilde{\mu} = (\mu_1, \mu_2, \ldots, \mu_p)$, we
have

$$Q^\top Q = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{n1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} = n \begin{pmatrix} 1 & \tilde{\mu} \\ \tilde{\mu}^\top & C \end{pmatrix}, \tag{1}$$
where $C_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_{ij} x_{ik}$. Thus, we can
obtain the inverse of $Q^\top Q$:

$$\left(Q^\top Q\right)^{-1} = \frac{1}{n} \begin{pmatrix} 1 + \tilde{\mu} \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top & -\tilde{\mu} \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \\ -\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top & \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \end{pmatrix}. \tag{2}$$
Therefore, if we let $\tilde{x}_i = (x_{i1}, \ldots, x_{ip})$, the
leverage value of the $i$th sample point can also be formulated as

$$\begin{aligned} h_i &= \frac{1}{n} Q_i \begin{pmatrix} 1 + \tilde{\mu} \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top & -\tilde{\mu} \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \\ -\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top & \left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \end{pmatrix} Q_i^\top \\ &= \frac{1}{n} \left( 1 + (\tilde{\mu} - \tilde{x}_i)\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top, \; (\tilde{x}_i - \tilde{\mu})\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \right) Q_i^\top \\ &= \frac{1}{n} \left( 1 + (\tilde{\mu} - \tilde{x}_i)\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{\mu}^\top + (\tilde{x}_i - \tilde{\mu})\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} \tilde{x}_i^\top \right) \\ &= \frac{1}{n} + \frac{1}{n} (\tilde{x}_i - \tilde{\mu})\left(C - \tilde{\mu}^\top \tilde{\mu}\right)^{-1} (\tilde{x}_i - \tilde{\mu})^\top. \end{aligned} \tag{3}$$
If we define $A = C - \tilde{\mu}^\top \tilde{\mu}$, then

$$A_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_{ij} x_{ik} - \mu_j \mu_k = \frac{1}{n} \sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)\left(x_{ik} - \mu_k\right) = \frac{n-1}{n} \mathrm{COV}\left(X_j, X_k\right), \quad j, k \in V_1. \tag{4}$$
Furthermore, we have $A = \frac{n-1}{n} \Sigma_f$ and
$A^{-1} = \frac{n}{n-1} \Sigma_f^{-1}$, where $\Sigma_f$ is the
covariance matrix of the factor space. Then, based on Eq. ([164]3), we
get

$$h_i = \frac{1}{n} + \frac{1}{n-1} \left(\tilde{x}_i - \tilde{\mu}\right) \Sigma_f^{-1} \left(\tilde{x}_i - \tilde{\mu}\right)^\top. \tag{5}$$
Obviously, for each sample point, its leverage value is proportional to
its Mahalanobis distance in the factor space. In addition, the Cook
distance of the $i$th sample point based on the $k$th ($k \in V_2$)
dimension is

$$C_{ik} = \frac{(n-p)\left(x_{ik} - \hat{x}_{ik}\right)^2 h_i}{p \sum_{i=1}^{n} \left(x_{ik} - \hat{x}_{ik}\right)^2 \left(1 - h_i\right)^2}, \quad i = 1, 2, \ldots, n. \tag{6}$$
For each dataset, HLP are sample points with high leverage values, and
IP are sample points with large Cook distances. Since the leverage value
of the $i$th sample point is proportional to its Mahalanobis distance in
the factor space, HLP can also be identified by the Mahalanobis distance
in the factor space. It is noteworthy that for certain datasets, such as
those conforming to a multivariate Gaussian distribution, the set $V_2$
may be empty. In such instances, all outliers in the dataset are
identified as HLP.
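Eq. (5) makes the leverage cheap to compute directly from the factor-space covariance, as in the NumPy sketch below.

```python
# Numeric sketch of Eq. (5): leverage is 1/n plus a scaled squared
# Mahalanobis distance in the factor space, so HLP can be ranked by either.
import numpy as np

def leverage(F):
    """F: (n, p) matrix of factor variables (the dimensions in V1)."""
    n = len(F)
    diff = F - F.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(F, rowvar=False))  # Sigma_f^{-1}
    maha_sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return 1.0 / n + maha_sq / (n - 1)
```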
Outlier detection of ODBAE
The autoencoder model in ODBAE is formed of an encoder $f$ and a decoder
$g$. If we denote the input and output data points as variables $X$ and
$\hat{X}$, respectively, where $X = (X_1, X_2, \ldots, X_m)^\top$ and
$\hat{X} = (\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_m)^\top$, and the
number of sample points is $n$, then we have $\hat{X} = g \circ f(X)$.
Besides, $\mathcal{X}$ represents the whole input dataset and
$X \in \mathcal{X}$. $x_i$ represents the $i$th sample point of the
input dataset, with corresponding reconstruction $\hat{x}_i$, where
$x_i = (x_{i1}, x_{i2}, \ldots, x_{im})^\top$ and
$\hat{x}_i = (\hat{x}_{i1}, \hat{x}_{i2}, \ldots, \hat{x}_{im})^\top$.
Ideally, we would like to train the autoencoder on the training dataset
to minimize the training loss function; the generally used loss function
is the MSE, which can be represented as

$$L_{MSE}(\omega, b) = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \hat{x}_i\right)^\top \left(x_i - \hat{x}_i\right) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \left(x_{ij} - \hat{x}_{ij}\right)^2,$$
where $\omega$ represents the weights between the input layer and the
output layer and $b$ is the bias. The purpose of the training process is
to guarantee that the intrinsic information of the training dataset is
learned and that most normal sample points are well reconstructed. It is
important to note that MSE-trained autoencoders are insufficient for
outlier detection. Through rigorous theoretical analysis, we demonstrate
how this limitation can be addressed. The analysis is based on the
following assumption.
Assumption 1
We assume that the input variable $X$ follows an $m$-dimensional
Gaussian distribution.
As is known, autoencoders eventually focus mainly on recovering the
principal components of a dataset^47,48, so it is natural to use the
unit orthogonal eigenvectors as basis vectors when studying how the loss
function affects the autoencoder's outlier detection results. Under
Assumption 1, we write the mean vector of the variable $X$ as
$\mu = (\mu_1, \mu_2, \ldots, \mu_m)^{\top}$. Let the eigenvalues of the
covariance matrix $\Sigma_x$ be $\lambda_1, \lambda_2, \ldots, \lambda_m$,
with corresponding unit orthogonal eigenvectors (i.e., principal
directions) $\eta_1, \eta_2, \ldots, \eta_m$, and let
$Y = (Y_1, Y_2, \ldots, Y_m)^{\top}$ be the coordinates of $X$ when the
unit orthogonal eigenvectors are taken as basis vectors. Denote by $P$
the orthogonal matrix whose $i$th column is the unit eigenvector
$\eta_i$; then $Y = P^{-1}X = P^{\top}X$ and $Y_k = \eta_k^{\top}X$,
$k = 1, 2, \ldots, m$. Thus, in the new coordinate system, the input
variable $X$ is transformed into the variable $Y$. Let
$\nu = (\nu_1, \nu_2, \ldots, \nu_m)^{\top}$ and $\Sigma_Y$ denote the
mean vector and covariance matrix of $Y$; then $\nu = P^{\top}\mu$ and
$\nu_k = E(Y_k) = \eta_k^{\top}\mu$, $k = 1, 2, \ldots, m$. Furthermore,
the variance of the $k$th element of $Y$ is
$$D(Y_k) = \mathrm{COV}(Y_k, Y_k) = E\big[(Y_k - \nu_k)(Y_k - \nu_k)\big] = E\big[\eta_k^{\top}(X - E(X))\,\eta_k^{\top}(X - E(X))\big] = \eta_k^{\top}E\big[(X - E(X))(X - E(X))^{\top}\big]\eta_k = \eta_k^{\top}\Sigma_x\eta_k = \eta_k^{\top}\lambda_k\eta_k = \lambda_k, \quad (7)$$
and the covariance of $Y_i$ and $Y_j$ ($i \ne j$) is

$$\mathrm{COV}(Y_i, Y_j) = E\big[(Y_i - E(Y_i))(Y_j - E(Y_j))\big] = E\big[\eta_i^{\top}(X - E(X))\,\eta_j^{\top}(X - E(X))\big] = \eta_i^{\top}E\big[(X - E(X))(X - E(X))^{\top}\big]\eta_j = \eta_i^{\top}\Sigma_x\eta_j = \eta_i^{\top}\lambda_j\eta_j = 0. \quad (8)$$
Thus, the covariance matrix of variable $Y$ is
$\Sigma_Y = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$.
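A quick numerical check of this diagonalization (a sketch; the Gaussian parameters below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    # Sample an arbitrary 3-dimensional Gaussian (illustrative parameters)
    Sigma_x = np.array([[2.0, 0.8, 0.3], [0.8, 1.5, 0.5], [0.3, 0.5, 1.0]])
    X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5], cov=Sigma_x, size=100_000)

    # Eigendecomposition of the sample covariance: columns of P are eta_k
    lam, P = np.linalg.eigh(np.cov(X, rowvar=False))
    Y = X @ P                                # Y = P^T X, applied row-wise

    # Covariance of Y is (approximately) diag(lambda_1, ..., lambda_m)
    print(np.round(np.cov(Y, rowvar=False), 3))
    print(np.round(lam, 3))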
Proposition 1
The covariance matrix of variable $Y$ is
$\Sigma_Y = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$.
Denote $\nu = (\nu_1, \nu_2, \ldots, \nu_m)^{\top}$ as the mean vector of
the variable $Y$. Since $X$ follows an $m$-dimensional Gaussian
distribution, $Y$ also follows an $m$-dimensional Gaussian distribution,
i.e., $Y \sim N(\nu, \Sigma_Y)$; moreover, $Y_k \sim N(\nu_k, \lambda_k)$,
$k = 1, 2, \ldots, m$. We then denote by $\hat{Y}$ the output variable
corresponding to $Y$, whose mean vector and covariance eigenvalues are
$\hat{\nu} = (\hat{\nu}_1, \hat{\nu}_2, \ldots, \hat{\nu}_m)^{\top}$ and
$\hat{\lambda}_k$, $k = 1, 2, \ldots, m$, respectively. Since, as noted
above, autoencoders mainly recover the principal components of a dataset,
it is reasonable to make the following assumption.
Assumption 2
For each element $\hat{Y}_k$ of the output variable $\hat{Y}$, its
distribution has the same form as that of the corresponding input
element $Y_k$. Since $Y_k \sim N(\nu_k, \lambda_k)$, it follows that
$\hat{Y}_k \sim N(\hat{\nu}_k, \hat{\lambda}_k)$, $k = 1, 2, \ldots, m$.
In addition, a reconstructed eigenvalue is close to 0 if and only if its
corresponding original eigenvalue is close to 0.
In the following, we theoretically analyze the limitations of MSE in
detecting HLP. First, we derive, for each principal component direction,
the relationship between the deviation of the reconstructed data from
its mean and the deviation of the original data from its mean.
The MSE loss function can be reformulated as

$$L_{MSE}(\omega, b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^{\top}(y_i - \hat{y}_i) = E\big[Y^{\top}Y - 2Y^{\top}\hat{Y} + \hat{Y}^{\top}\hat{Y}\big] = \sum_{i=1}^{m}E\big[Y_i^2\big] - 2\sum_{i=1}^{m}E\big[Y_i\hat{Y}_i\big] + \sum_{i=1}^{m}E\big[\hat{Y}_i^2\big], \quad (9)$$

where $y_i$ and $\hat{y}_i$ represent the $i$th sample point of the input
and output datasets in the new coordinate system, respectively.
First, supposing that the autoencoder reconstructs the input dataset
properly, we analyze the mean vector of the output variable $\hat{Y}$.
Based on Assumption 2 and Eq. (9), and since
$\mathrm{COV}(Y_i, \hat{Y}_i) \le \sqrt{\lambda_i\hat{\lambda}_i}$, we have

$$L_{MSE}(\omega, b) = \sum_{i=1}^{m}\lambda_i + \sum_{i=1}^{m}\hat{\lambda}_i + \sum_{i=1}^{m}\nu_i^2 + \sum_{i=1}^{m}\hat{\nu}_i^2 - 2\sum_{i=1}^{m}\nu_i\hat{\nu}_i - 2\sum_{i=1}^{m}\mathrm{COV}\big(Y_i, \hat{Y}_i\big) \ge \sum_{i=1}^{m}\big(\nu_i - \hat{\nu}_i\big)^2 + \sum_{i=1}^{m}\Big(\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i}\Big)^2. \quad (10)$$

To reach the minimum of $L_{MSE}(\omega, b)$, the mean vector
$\hat{\nu}$ of the output variable must satisfy $\hat{\nu}_k = \nu_k$,
$k = 1, 2, \ldots, m$; that is, the mean vector of the output data
equals that of the input data.
Next, we analyze the reconstruction error of each data point. According
to Assumptions 1 and 2, the input and output variables satisfy
$Y_k \sim N(\nu_k, \lambda_k)$ and $\hat{Y}_k \sim N(\nu_k, \hat{\lambda}_k)$,
$k = 1, 2, \ldots, m$. Denote
$R(Y) = (R_1(Y), R_2(Y), \ldots, R_m(Y))^{\top}$ (respectively
$\hat{R}(Y) = (\hat{R}_1(Y), \hat{R}_2(Y), \ldots, \hat{R}_m(Y))^{\top}$)
as the difference between the variable $Y$ (respectively $\hat{Y}$) and
the mean vector $\nu$, i.e., $R(Y) = Y - \nu$ and
$\hat{R}(Y) = \hat{Y} - \nu$. It is easy to see that
$R_k(Y) \sim N(0, \lambda_k)$ and $\hat{R}_k(Y) \sim N(0, \hat{\lambda}_k)$.
Then the following equation can be obtained.
$$L_{MSE}(\omega, b) = E\big[(Y - \hat{Y})^{\top}(Y - \hat{Y})\big] = E\big[(R(Y) - \hat{R}(Y))^{\top}(R(Y) - \hat{R}(Y))\big] = \sum_{i=1}^{m}E\big[R_i^2(Y)\big] + \sum_{i=1}^{m}E\big[\hat{R}_i^2(Y)\big] - 2\sum_{i=1}^{m}E\big[R_i(Y)\big]E\big[\hat{R}_i(Y)\big] - 2\sum_{i=1}^{m}\mathrm{COV}\big(R_i(Y), \hat{R}_i(Y)\big) = \sum_{i=1}^{m}\lambda_i + \sum_{i=1}^{m}\hat{\lambda}_i - 2\sum_{i=1}^{m}\rho_i\sqrt{\lambda_i\hat{\lambda}_i}, \quad (11)$$
where $\rho_k$ represents the correlation coefficient of $R_k(Y)$ and
$\hat{R}_k(Y)$, with $-1 \le \rho_k \le 1$, $k = 1, 2, \ldots, m$. In
Eq. (11), $L_{MSE}(\omega, b)$ reaches its minimum only when
$\rho_k = 1$. In that case there is a positive linear relationship
between $R_k(Y)$ and $\hat{R}_k(Y)$; specifically, there exist constants
$a_k$ and $c_k$ such that $\hat{R}_k(Y) = a_k R_k(Y) + c_k$, whence
$E[\hat{R}_k(Y)] = a_k E[R_k(Y)] + c_k$ and
$D[\hat{R}_k(Y)] = a_k^2 D[R_k(Y)]$. Since
$E[\hat{R}_k(Y)] = E[R_k(Y)] = 0$, $D[R_k(Y)] = \lambda_k$, and
$D[\hat{R}_k(Y)] = \hat{\lambda}_k$, we obtain $c_k = 0$ and
$a_k = \sqrt{\hat{\lambda}_k/\lambda_k}$. This yields the following
proposition.
Proposition 2
Under Assumptions 1 and 2, in each principal component direction the
deviation of the reconstructed data from its mean is proportional to the
deviation of the original data from its mean. The specific relationship
is

$$\hat{R}_k(Y) = \sqrt{\frac{\hat{\lambda}_k}{\lambda_k}}\,R_k(Y), \quad (12)$$

where $R_k(Y) = Y_k - \nu_k$ and $\hat{R}_k(Y) = \hat{Y}_k - \hat{\nu}_k$.
Based on Eq. (12), we can further determine the specific reconstruction
error of each input data point.
Proposition 3
The reconstruction error of each input data point measured by MSE can be
formulated as

$$W(Y) = (Y - \hat{Y})^{\top}(Y - \hat{Y}) = \sum_{i=1}^{m}\Big(\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i}\Big)^2\,\frac{R_i^2(Y)}{\lambda_i}. \quad (13)$$
Ideally, each principal component should contribute equally to the
reconstruction error of any input data point. Recalling the definition
of $R(Y)$, it is easy to see that $R_k(Y)/\sqrt{\lambda_k} \sim N(0, 1)$,
so Eq. (13) indicates that the degree to which the autoencoder recovers
each principal component determines that component's share of the
reconstruction error. This introduces considerable uncertainty into HLP
detection.
Proposition 4
HLP are difficult to identify as outliers if the input dataset is
completely reconstructed.
When the input dataset is completely reconstructed, Eq. (13) gives
$\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = 0$, $k = 1, 2, \ldots, m$.
Then the reconstruction error of every data point in the training
dataset is 0, so every point is considered normal. However, HLP exist in
any dataset that follows an $m$-dimensional Gaussian distribution, and
in general, the larger a point's Mahalanobis distance, the more abnormal
it is. Therefore, although completely reconstructing the training
dataset is beneficial for IP detection, it neglects the detection of
HLP.
Proposition 5
HLP whose values are within the normal range in every dimension cannot
be identified as outliers. Moreover, most HLP are detected along the
worst-recovered principal component direction, while anomalies along
well-recovered principal component directions are often ignored.
From Eq. (13), $(\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k})^2$ is the
weight of the reconstruction error of the $k$th principal component
direction in each data point's total reconstruction error. Therefore, if
the data are reconstructed better along the $i$th principal component
direction than along the $j$th, then
$\sqrt{\lambda_i} - \sqrt{\hat{\lambda}_i} < \sqrt{\lambda_j} - \sqrt{\hat{\lambda}_j}$,
$i \ne j$. As a result, the detected outliers are concentrated along the
$j$th principal component direction.
Proposition 6
If the differences between the square roots of the eigenvalues of the
covariance matrix of the original dataset and the square roots of their
reconstructed counterparts are equal across all principal component
directions, then the reconstruction error of each data point is
proportional to its Mahalanobis distance.
If we suppress complete reconstruction by the autoencoder, then
$\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} > 0$, $k = 1, 2, \ldots, m$.
Based on Eq. (12), we obtain

$$(Y_k - \hat{Y}_k)^2 = \frac{\big(\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k}\big)^2}{\lambda_k}\,(Y_k - \nu_k)^2. \quad (14)$$

This means that every principal component direction of the input data
carries a reconstruction error, and the further a value lies from the
mean, the larger that error. Therefore, in each principal component
direction, the outliers detected by the reconstruction error coincide
with the outliers defined by the Gaussian distribution of the data.
Moreover, if $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = \beta > 0$,
$k = 1, 2, \ldots, m$, then according to Eq. (13) the reconstruction
error of each input data point is

$$W(Y) = (Y - \hat{Y})^{\top}(Y - \hat{Y}) = \beta^2\sum_{i=1}^{m}\frac{R_i^2(Y)}{\lambda_i}. \quad (15)$$
Since $\Sigma_Y = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$,
we have
$\Sigma_Y^{-1} = \mathrm{diag}(1/\lambda_1, 1/\lambda_2, \ldots, 1/\lambda_m)$,
so the Mahalanobis distance of the input variable $Y$ is

$$M(Y) = (Y - \nu)^{\top}\,\Sigma_Y^{-1}\,(Y - \nu) = \sum_{i=1}^{m}\frac{R_i^2(Y)}{\lambda_i}. \quad (16)$$

Hence the reconstruction error of each data point is proportional to its
Mahalanobis distance. Therefore, if we control
$\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = \beta > 0$,
$k = 1, 2, \ldots, m$, almost all HLP are converted into IP and cannot
be well reconstructed; the detected HLP are then evenly distributed
across the principal component directions.
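A small numerical illustration of Proposition 6 (a sketch with made-up eigenvalues): shrinking every $\sqrt{\lambda_k}$ by the same amount $\beta$ makes each point's reconstruction error an exact multiple of its Mahalanobis distance.

    import numpy as np

    rng = np.random.default_rng(1)
    lam = np.array([4.0, 1.0, 0.25])          # illustrative eigenvalues
    beta = 0.2                                 # uniform gap sqrt(lam) - sqrt(lam_hat)
    Y = rng.normal(0.0, np.sqrt(lam), size=(5, 3))   # centered data (nu = 0)

    # Eq. (12): reconstruction scales each direction by sqrt(lam_hat / lam)
    Y_hat = Y * (np.sqrt(lam) - beta) / np.sqrt(lam)

    W = ((Y - Y_hat) ** 2).sum(axis=1)         # Eq. (15): reconstruction error
    M = (Y ** 2 / lam).sum(axis=1)             # Eq. (16): Mahalanobis distance
    print(W / M)                               # constant, equal to beta**2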
Based on the above discussion, we propose a new loss function that adds
an appropriate penalty term to MSE to balance the reconstruction of the
autoencoder.
In fact, if the intrinsic dimension of the dataset is $l$, then there
are $l$ eigenvalues that are not close to 0. We denote these as
$\lambda'_k$, with corresponding reconstructed values $\hat{\lambda}'_k$,
$k = 1, 2, \ldots, l$. According to Assumption 2, $\hat{\lambda}'_k$ is
also not close to 0. Our training loss function then combines two terms.
One is $L_{MSE}(\omega, b)$, which aims to reconstruct the input dataset
well. The other is

$$L_{EIG}(\omega, b) = \sum_{i=1}^{l}\Big(\sqrt{\lambda'_i} - \sqrt{\hat{\lambda}'_i} - \beta\Big)^2,$$

which prevents the autoencoder from completely reconstructing the input
dataset; $\beta > 0$ is a hyperparameter that adjusts the degree of data
reconstruction. The final training loss function is a combination of the
two:

$$L(\omega, b) = \theta_1 L_{MSE}(\omega, b) + \theta_2 L_{EIG}(\omega, b). \quad (17)$$

Here, $\theta_1, \theta_2 > 0$ are hyperparameters that must be
predetermined. In practice, desirable results are usually obtained with
$\theta_1 = 0.008$ and $\theta_2 = 1$.
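A sketch of how this combined loss might be implemented (PyTorch; estimating eigenvalues from the batch covariance, pairing them by sorted order, and the cutoff `eps` for eigenvalues "not close to 0" are our simplifications, not details fixed by the text):

    import torch

    def odbae_loss(x, x_hat, beta, theta1=0.008, theta2=1.0, eps=1e-4):
        # Sketch of Eq. (17): L = theta1 * L_MSE + theta2 * L_EIG.
        l_mse = ((x - x_hat) ** 2).sum(dim=1).mean()          # L_MSE

        # Eigenvalues of the batch covariance of input and reconstruction
        lam = torch.linalg.eigvalsh(torch.cov(x.T))
        lam_hat = torch.linalg.eigvalsh(torch.cov(x_hat.T))

        # L_EIG: keep sqrt(lam') - sqrt(lam_hat') at the target gap beta,
        # restricted to the l non-negligible eigenvalues lambda'_k
        keep = lam > eps
        gap = torch.sqrt(lam[keep]) - torch.sqrt(lam_hat[keep].clamp(min=0.0))
        l_eig = ((gap - beta) ** 2).sum()

        return theta1 * l_mse + theta2 * l_eig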
Specification of outlier ratio
ODBAE identifies outliers based on reconstruction error, with larger
reconstruction errors indicating greater abnormality. In the
implementation of ODBAE, we provide two strategies for determining the
reconstruction error threshold. Given a dataset of N samples, when a
user-defined outlier proportion δ is specified, the model classifies
the top N × δ samples with the highest reconstruction errors as
outliers. Alternatively, when the threshold is determined
automatically, the model assesses the distribution of reconstruction
errors. If the errors approximately follow a Gaussian distribution,
outliers are identified based on the 3σ rule. Otherwise, kernel density
estimation is employed to derive the probability density function of
the reconstruction errors, and data points with reconstruction errors
exceeding the 95th percentile are considered anomalous.
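A sketch of the two thresholding strategies (the choice of Shapiro-Wilk as the Gaussianity check is our assumption; the text does not name a test):

    import numpy as np
    from scipy import stats

    def outlier_mask(errors, delta=None):
        # Strategy 1: user-specified outlier proportion delta
        if delta is not None:
            cutoff = np.quantile(errors, 1.0 - delta)   # top N * delta errors
            return errors > cutoff
        # Strategy 2: automatic threshold. A normality test (assumed here
        # to be Shapiro-Wilk) decides "approximately Gaussian".
        if stats.shapiro(errors).pvalue > 0.05:
            return errors > errors.mean() + 3 * errors.std()   # 3-sigma rule
        # Otherwise: kernel density estimate of the error distribution,
        # flagging errors beyond the 95th percentile of the fitted density
        kde = stats.gaussian_kde(errors)
        grid = np.linspace(errors.min(), errors.max(), 10_000)
        cdf = np.cumsum(kde(grid))
        cdf /= cdf[-1]
        cutoff = grid[np.searchsorted(cdf, 0.95)]
        return errors > cutoff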
Anomaly explanation of ODBAE
For each outlier detected by ODBAE, we need to identify its abnormal
parameters. ODBAE obtains abnormal parameters from the largest
reconstruction errors together with SHAP values. Specifically, for each
outlier, we first sort its parameters in descending order of
reconstruction error. If the sum of the reconstruction errors of the
first $n$ parameters exceeds 50% of the outlier's total reconstruction
error, these $n$ parameters are considered potentially anomalous. If
$n = 1$, we use SHAP values to identify the additional parameter with
the greatest impact on this anomalous parameter; if the SHAP value of
these two parameters exceeds the set threshold, both are considered
abnormal parameters, otherwise only the parameter with the highest
reconstruction error is. If $n > 1$, we use SHAP values to find the
additional parameter with the greatest impact on the parameter with the
highest reconstruction error; if the SHAP value of these two parameters
exceeds the set threshold, they are considered abnormal parameters,
otherwise the two parameters with the highest reconstruction errors are.
In our work, the SHAP threshold was set to the mean of the SHAP values
computed during anomaly explanation across all outliers.
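A sketch of the cumulative-error part of this procedure (the SHAP computation itself is abstracted behind `shap_top_partner`, a hypothetical helper returning the most influential partner parameter and its SHAP value):

    import numpy as np

    def candidate_parameters(per_param_errors, shap_top_partner, shap_threshold):
        # Sort parameters by reconstruction error, descending
        order = np.argsort(per_param_errors)[::-1]
        errs = per_param_errors[order]
        # Smallest n whose cumulative error exceeds 50% of the total
        n = int(np.searchsorted(np.cumsum(errs), 0.5 * errs.sum()) + 1)
        top = order[0]
        # shap_top_partner(top) -> (partner index, SHAP value); hypothetical
        partner, shap_value = shap_top_partner(top)
        if shap_value > shap_threshold:
            return [top, partner]
        return [top] if n == 1 else list(order[:2])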
Identification of outliers using z-score method
In developmental and metabolism-related datasets, we first identified
mice with anomalies using the z-score method. For example, the
metabolism-related datasets contain 14 metabolic parameters. The z-score
of each mouse for each metabolic parameter was calculated as

$$Z_{ij} = \frac{X_{ij} - \mu_i}{\sigma_i},$$

where $Z_{ij}$ is the z-score of the $j$th mouse for the $i$th metabolic
parameter, $X_{ij}$ is the value of the $i$th metabolic parameter for
the $j$th mouse, and $\mu_i$ and $\sigma_i$ are the mean and standard
deviation of the $i$th metabolic parameter, respectively. The larger the
absolute value of $Z_{ij}$, the further the $i$th metabolic parameter of
the $j$th mouse deviates from the mean. Therefore, for each metabolic
parameter, mice with large absolute z-scores were labeled as outliers.
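A minimal sketch of this computation for a parameters-by-mice matrix `X` (the cutoff on |Z| is an assumption; the text only says "large"):

    import numpy as np

    def zscore_outliers(X, cutoff=3.0):
        # Z_ij = (X_ij - mu_i) / sigma_i, computed per parameter (row)
        Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
        # Flag mice whose |Z| exceeds the cutoff for each parameter;
        # the cutoff value itself is an assumption, not given in the text
        return np.abs(Z) > cutoff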
Identification of mice with low BMI values
We calculated BMI values for all knockout mice; the mean BMI of the
abnormal mice after Ckb knockout in females was lower than that of
97.14% of all mice. Even so, a small number of mice from the 264
single-gene knockout strains had lower BMIs: only 13 strains (4.92%) had
more than 50% of mice with BMI values below the mean BMI of Ckb knockout
mice. Among these 13 genes, 7 were identified as important genes by
ODBAE, and for the remaining 6 genes, a portion of mice from each
corresponding knockout strain were detected as outliers.
Validation of sexually dimorphic genes
When applied to the IMPC metabolic dataset, ODBAE identified a total of
203 metabolism-associated genes, with a marked divergence observed
between male- and female-specific genes (Fig. 5b). To further
validate the biological relevance of these findings, we first screened
for genes exhibiting significant sexual dimorphism. In our framework, a
gene was considered significant if the proportion of outliers among its
corresponding knockout strains exceeded 50%. To ensure that the
observed abnormalities were indeed attributable to the gene and not
random fluctuations, we initially restricted our analysis to genes with
sufficient sample sizes in both sexes. Given that the majority of
knockout strains in both male and female datasets comprised seven
individuals (Supplementary Fig. 11a, b), we retained genes with at
least seven samples per sex. Subsequently, to identify genes with
pronounced sexual dimorphism, we focused on genes for which the outlier
ratio exceeded 50% in only one sex and was zero in the other. This
yielded a final set of 18 genes showing strong evidence of sex-specific
metabolic effects.
To further evaluate whether the sexual dimorphism observed in these 18
genes had been previously reported, we conducted a literature search in
PubMed using each gene name in combination with the keywords “sexual
dimorphism”, “sexual”, “sex”, “male”, or “female”. This search revealed
that 10 of the 18 genes had been previously documented as exhibiting
sexual dimorphism.
Comprehensive performance evaluation
We compared ODBAE with MSE-AE, MAE-AE, and DAGMM on the detection of IP
and HLP, computing AUC and AP scores with the IP and HLP regarded as
positives.
We implemented DAGMM with minimal modification so that it adapts to our
datasets, keeping its number of layers and the dimension of each layer
consistent with ODBAE, MSE-AE, and MAE-AE.
For each experiment, the structure of the autoencoder was as follows.
The dimension of the hidden layer was the intrinsic dimension of the
input dataset, and its activation function was ReLU. Besides, the
activation function of the output layer was sigmoid. All datasets were
normalized before outlier detection. During the autoencoder training
process, Adam^49 was used to optimize the loss function and the learning
rate was 10^−3.
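Putting the pieces together, a sketch of this training setup, reusing the `Autoencoder` and `odbae_loss` sketches above (full-batch training and the epoch count are illustrative choices, not settings stated in the text):

    import torch

    def train_odbae(X, l, beta, epochs=500):
        # X: normalized (n, m) data tensor; l: intrinsic dimension
        model = Autoencoder(X.shape[1], l)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr = 10^-3
        for _ in range(epochs):
            opt.zero_grad()
            loss = odbae_loss(X, model(X), beta)              # Eq. (17)
            loss.backward()
            opt.step()
        # Per-sample reconstruction errors used for outlier scoring
        with torch.no_grad():
            return ((X - model(X)) ** 2).sum(dim=1)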
To evaluate the performance of various outlier detection methods across
a range of predefined outlier ratios, each method was executed 10 times
per ratio. For each run, we recorded the AUC and AP scores. The final
comparison was based on the mean AUC and AP scores averaged across the
ten replicates for each method.
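The averaging protocol could look like the following sketch (scikit-learn metrics; `run_detector` is a hypothetical wrapper that trains one model and returns outlier scores):

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    def mean_scores(run_detector, X, labels, n_runs=10):
        # labels: 1 for IP/HLP (positives), 0 for normal points
        aucs, aps = [], []
        for _ in range(n_runs):
            scores = run_detector(X)          # one independent training run
            aucs.append(roc_auc_score(labels, scores))
            aps.append(average_precision_score(labels, scores))
        return np.mean(aucs), np.mean(aps)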
Selection of hyperparameters θ[1], θ[2] in training loss function
Our new training loss function is formulated as
L(ω, b) = θ[1]L[MSE](ω, b) + θ[2]L[EIG](ω, b). It can be seen that our
loss function contains two loss terms. The function of L[MSE](ω, b) is
to drive the autoencoder to reconstruct the input dataset as well as
possible, while L[EIG](ω, b) aims at suppressing complete
reconstruction of the dataset by the autoencoder and balancing the degree
of reconstruction in the direction of each principal component of the
input dataset. Clearly, when the ratio $\theta_1 : \theta_2$ is large,
the autoencoder reconstructs the input dataset nearly perfectly, which
adversely affects HLP detection. Conversely, if $\theta_1 : \theta_2$ is
very small, the autoencoder reconstructs the input dataset poorly, which
hampers IP detection. It is therefore essential to choose a reasonable
value of $\theta_1 : \theta_2$ so that the autoencoder detects both HLP
and IP well.
In our work, we found that $\theta_1 : \theta_2 = 0.008 : 1$ yielded the
desired results in most experiments. To study the sensitivity of this
ratio, we varied its base to see how it affects outlier detection
accuracy. Specifically, for a base value $c$, we set $\theta_1 = 0.008c$
and $\theta_2 = c$. We conducted experiments on the Dry Bean dataset:
for a fixed base $c$, we assessed the outlier detection performance of
ODBAE across different outlier ratios δ (δ ∈ {0.05, 0.1, 0.15, 0.2,
0.25}) and then computed the average score across all ratios. The AUC
and AP scores obtained by ODBAE varied only marginally with the base
value (Supplementary Table 1), indicating that $\theta_1$ and $\theta_2$
are insensitive to this choice.
Determination of hyperparameter β
Although the hyperparameter β > 0 in Eq. (17) benefits HLP detection,
the autoencoder's data reconstruction ability degrades as β increases,
analogous to an underfitted model in regression analysis, which
adversely affects IP detection. We therefore need to determine how to
choose an appropriate value of β.
If the intrinsic and actual dimensions of the dataset are equal, i.e.,
$l = m$, then most outliers in the dataset are HLP. In this case,
considering that $\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} > 0$ and
$\hat{\lambda}_k > 0$, $k = 1, 2, \ldots, m$, the hyperparameter β
should satisfy $0 < \beta < \min_{1 \le i \le m}(\sqrt{\lambda_i})$.
According to the previous analysis, any β > 0 improves HLP detection;
meanwhile, to maintain IP detection, β must be small enough.
According to Eq. (12), if β = 0 then $\hat{R}_k(Y) = R_k(Y)$, i.e.,
$\hat{Y}_k = Y_k$, $k = 1, 2, \ldots, m$. However, autoencoders usually
include nonlinear activation functions to learn nonlinear features of
the input dataset; in our work, for example, the output layer uses a
sigmoid activation. Since commonly used nonlinear activation functions
saturate, the output hardly changes when the input values are very large
or very small. Consequently, for each dimension of the input data, very
large or very small values are reconstructed poorly, while the remaining
values are reconstructed well. Most HLP detected by the autoencoder are
then anomalous in a single dimension, while HLP whose values are within
the normal range in every dimension are ignored. Hence, when β is too
small, it not only harms HLP detection but also barely improves IP
detection; that is, there exists ξ > 0 such that for β < ξ, IP detection
improves little. If we determine the value of ξ and set β = ξ, then on
the one hand the autoencoder's IP detection is maintained; on the other
hand, we balance the autoencoder's reconstruction while avoiding the
influence of activation-function saturation on data reconstruction,
which improves HLP detection.
In our work, the structure of the autoencoder and the hyperparameter
settings are described above. For each dimension of any dataset, input
values between 0.15 and 0.85 are hardly affected by the saturation of
the nonlinear activation function. We therefore choose β so that the
output values of each dimension lie between 0.15 and 0.85; as a result,
for each dimension, the interval length of the output values is 0.7
times that of the input values. Before outlier detection the input
dataset is normalized, so the input values of each dimension lie between
0 and 1. Recall that $Y_k \sim N(\nu_k, \lambda_k)$ and
$\hat{Y}_k \sim N(\nu_k, \hat{\lambda}_k)$, $k = 1, 2, \ldots, m$. The
probability that $Y_k$ falls in
$(\nu_k - 3\sqrt{\lambda_k},\, \nu_k + 3\sqrt{\lambda_k})$ is 0.9974, so
its distribution interval length is approximately $6\sqrt{\lambda_k}$;
similarly, the distribution interval length of $\hat{Y}_k$ is
approximately $6\sqrt{\hat{\lambda}_k}$. We can therefore control
$\sqrt{\hat{\lambda}_k} = 0.7\sqrt{\lambda_k}$, which gives
$\sqrt{\lambda_k} - \sqrt{\hat{\lambda}_k} = 0.3\sqrt{\lambda_k}$,
$k = 1, 2, \ldots, m$.
Based on the above discussion, the value of the hyperparameter β when
$l = m$ is determined as follows. If
$\max_{1 \le i \le m}(0.3\sqrt{\lambda_i}) \le \min_{1 \le i \le m}(\sqrt{\lambda_i})$,
we set $\beta = \max_{1 \le i \le m}(0.3\sqrt{\lambda_i})$; otherwise,
we set $\beta = \min_{1 \le i \le m}(\sqrt{\lambda_i})$.
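A sketch of this rule (the cutoff `eps` separating eigenvalues "close to 0" is an assumption; excluding those eigenvalues also covers the l < m case discussed next):

    import numpy as np

    def choose_beta(X, eps=1e-4):
        # Eigenvalues of the covariance matrix of the normalized dataset
        lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
        sqrt_lam = np.sqrt(lam[lam > eps])   # keep eigenvalues not close to 0
        # beta = max(0.3 * sqrt(lambda_i)), capped at min(sqrt(lambda_i))
        return min(0.3 * sqrt_lam.max(), sqrt_lam.min())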
In practical applications, we often encounter datasets whose intrinsic
dimension $l$ is smaller than the actual dimension $m$. In this case,
only $l$ eigenvalues of the covariance matrix of the dataset are not
close to 0, which we denote as $\lambda'_k$, $k = 1, 2, \ldots, l$.
During training, we only need to control these $l$ eigenvalues so that
$\sqrt{\lambda'_k} - \sqrt{\hat{\lambda}'_k} = 0.3\sqrt{\lambda'_k}$,
$k = 1, 2, \ldots, l$. Moreover, since $\lambda'_k > 0$, β should
satisfy $0 < \beta < \min_{1 \le i \le l}(\sqrt{\lambda'_i})$. In fact,
if $l < m$, some data points have a large impact on the autoencoder's
reconstruction ability (i.e., IP), so it is important not only to
improve HLP detection but also to maintain IP detection; hence β must be
small enough but nonzero. For convenience, we impose the same form of
constraint, $0 < \beta < \min_{1 \le i \le l}(\sqrt{\lambda'_i})$, in
this case, and the scheme for determining the value of β is the same as
when $l = m$.
Statistics and reproducibility
All data analyses were performed using Python 3.7.7 and R 3.4.3
software. To ensure the reproducibility of our findings, we provide
detailed descriptions of the data generation process, including sample
sizes used in the synthetic data experiments. For experiments involving
real-world datasets, we report the sources of the data, the
preprocessing procedures, and all relevant analysis details. When
comparing the outlier detection performance of ODBAE against other
models, we ran each model 10 times under fixed outlier ratios and
recorded the AUC scores from each run. The final performance
comparisons are based on the average AUC scores across runs to ensure
robustness and reproducibility. Similarly, to assess the impact of
dataset dimensionality, noise level, and model architecture on ODBAE’s
performance, we report the average AUC scores obtained from 10
independent runs under each condition.
Ethics statement
The gene knockout mouse data used in this study were obtained from the
International Mouse Phenotyping Consortium (IMPC), and no experiments
were conducted here. All IMPC research centers make every effort to
minimize animal suffering through careful housing and husbandry, with
detailed information available at the IMPC portal:
https://www.mousephenotype.org/about-impc/arrive-guidelines.
Ethics statements from individual research centers are provided in
Supplementary Data 4.
Reporting summary
Further information on research design is available in the Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
Supplementary Information (5.7MB, pdf)
Description of Additional Supplementary Files (33KB, pdf)
Supplementary Data 1 (1.6MB, xlsx)
Supplementary Data 2 (59.2KB, xlsx)
Supplementary Data 3 (192.9KB, xlsx)
Supplementary Data 4 (10.7KB, xlsx)
Reporting Summary (2.8MB, pdf)
Acknowledgements