Abstract

   Cell lines and patient-derived xenografts are essential to cancer
   research; however, the results derived from such models often lack
   clinical translatability, as they do not fully recapitulate the complex
   cancer biology. Identifying preclinical models that sufficiently
   resemble the biological characteristics of clinical tumors across
   different cancers is critically important. Here, we developed MOBER,
   Multi-Origin Batch Effect Remover method, to simultaneously extract
   biologically meaningful embeddings while removing confounder
   information. Applying MOBER on 932 cancer cell lines, 434
   patient-derived tumor xenografts, and 11,159 clinical tumors, we
   identified preclinical models with greatest transcriptional fidelity to
   clinical tumors and models that are transcriptionally unrepresentative
   of their respective clinical tumors. MOBER allows for transformation of
   transcriptional profiles of preclinical models to resemble the ones of
   clinical tumors and, therefore, can be used to improve the clinical
   translation of insights gained from preclinical models. MOBER is a
   versatile batch effect removal method applicable to diverse
   transcriptomic datasets, enabling integration of multiple datasets
   simultaneously.
     __________________________________________________________________

   A proposed deep learning method helps to bridge the gap between
   preclinical cancer models and clinical tumors.

INTRODUCTION

   Cancer cell lines and patient-derived tumor xenograft (PTX) models
   continue to play a critical role in preclinical cancer research and
   drug discovery ([38]1–[39]3). Thousands of cancer models have been
   established and propagated in vitro and in vivo in different
   laboratories, where they have been extensively used in preclinical
   settings to study the biology of cancer ([40]4, [41]5), to explore the
   vulnerabilities of cancer cells ([42]6, [43]7), to identify new
   biomarkers ([44]8), and to test the efficacy of anticancer compounds
   ([45]2, [46]9). Enormous knowledge in cancer biology has been derived
   from the various experiments conducted on cancer models. Still, many
   findings from preclinical cancer research are not reproducible in
   clinical trials ([47]10, [48]11), and oncology drugs have the highest
   failure rate compared to compounds used in other disease areas
   ([49]12). One of the major reasons to this lack of translatability is
   that the cancer models are not perfect, and because of their
   propagation and differences in growing conditions, they have altered
   over time, and it is not known how well they represent the biology of
   the tumors from which they were derived. In addition, many cancer
   models lack accurate clinical annotations and histopathological
   classification that are crucial for their utility in cancer research
   ([50]13). For greater clinical translatability, identification of
   models that sufficiently resemble the biological characteristics and
   drug responses of patient tumors is of critical importance.

   Large collections of molecular data from patient tumors and cancer
   models have been generated across different cancer types. The
   Broad-Novartis Cancer Cell Line Encyclopedia (CCLE) ([51]2) contains
   molecular profiles of around 1000 cancer cell lines, which are
   extensively used as preclinical models for various tumor types in drug
   discovery studies. In addition, gene expression profiles of >400 PTX
   models are available via the Novartis Institutes for Biomedical
   Research Patient-derived Tumor Xenograft Encyclopedia ([52]9).
   Comprehensive molecular characterization of primary and metastatic
   tumors along with clinical data from >11,000 patients are available
   from the The Cancer Genome Atlas (TCGA) ([53]14), MET500 ([54]15), and
   Count Me In (CMI) ([55]16) projects. These efforts provide a powerful
   opportunity to unravel the systematic differences between cancer cell
   lines, xenograft models, and patient tumors and to identify the cancer
   models that sufficiently recapitulate the biology of patient tumors
   without relying on clinical annotations.

   Gene expression profiling accurately reproduces histopathological
   classification of tumors and is a useful technique for resolving tumor
   subtypes ([56]17–[57]20). However, large-scale integration of molecular
   data from cancer cell lines, xenograft models, and patient tumors is
   challenging due to the mixture of intrinsic biological signals and
   technical artifacts. One key challenge is that gene expression
   measurements from bulk patient biopsy samples are confounded by the
   presence of human stromal and immune cell populations that are not
   present in cancer models. In addition, large public datasets can be
   confounded by hidden technical variables, even when they come from the
   same source type [e.g., RNA sequencing (RNA-seq) from different patient
   cohorts]. Existing approaches for removing batch effects do not account
   for other systematic differences between cancer models or patient
   tumors or assume that the cell line and tumor datasets have the same
   subtype composition ([58]21, [59]22). Previous studies analyzing the
   differences between cell lines and patient tumors based on
   transcriptomics profiles have primarily focused on selected cancer
   types ([60]23–[61]25). Existing global analysis with the Celligner
   method ([62]19), which leverages a computational approach developed for
   batch correction of single-cell RNA-seq data, has compared cancer cell
   lines and patient tumors. However, this method is limited to aligning
   only two datasets simultaneously, and the Celligner alignment does not
   consider PTXs.

   In recent years, a multitude of studies have used deep learning
   techniques to transcriptomics data analysis. Particularly, the focus
   has been on single-cell RNA-seq data, where different autoencoder-based
   architectures have been proposed and successfully used for data
   harmonization and mitigating confounding technical effects
   ([63]26–[64]29). Inspired by these results, we applied deep learning
   techniques to explore the fidelity of preclinical models as
   representatives of patient tumors.

   Here, we developed a deep learning–based method, MOBER (Multi-Origin
   Batch Effect Remover), that performs biologically relevant integration
   of pan-cancer gene expression profiles from cancer cell lines, PTXs,
   and patient tumors simultaneously. MOBER can be used to guide the
   selection of cell lines and patient-derived xenografts and identify
   models that more closely resemble patient tumors. We applied it to
   integrate transcriptomics data from 932 cancer cell lines, 434 PTXs,
   and 11,159 patient tumors from TCGA, MET500, and CMI, without relying
   on cancer type labels. We developed a web application that provides a
   valuable resource to help researchers select preclinical models with
   greatest transcriptional fidelity to clinical tumors. MOBER can be
   broadly applied as a batch effect removal tool for any transcriptomics
   datasets, and we made the method available as an open-source Python
   package.

RESULTS

The MOBER method

   MOBER is an adversarial conditional variational autoencoder (VAE) that
   generates biologically informative gene expression embeddings robust to
   confounders ([65]Fig. 1). MOBER consists of two neural networks (see
   Materials and Methods for more details). The first is a conditional VAE
   ([66]30, [67]31) that is optimized to generate embeddings that can
   reconstruct the original input. The second is an adversarial neural
   network (aNN) ([68]32) that takes the embedding generated by the VAE as
   an input and tries to predict the origin of the input data. The VAE
   consists of an encoder that takes as an input a gene expression profile
   of a sample and encodes it as a distribution into a low dimensional
   latent space and a decoder that takes an embedding sampled from that
   distribution and the origin sample labels to reconstruct the gene
   expression profile from it. The goal is to learn an embedding space
   that encodes as much information as possible on the input samples while
   not encoding any information on the origin of the sample. To achieve
   this, we train the VAE and aNN simultaneously. The VAE tries to
   successfully reconstruct the data while also preventing the aNN from
   accurately predicting the data source. This way, during training, the
   VAE and the aNN will converge and reach an equilibrium, such that the
   VAE will generate an embedding space that is optimally successful at
   input reconstruction and the aNN will only randomly predict the origin
   of the input data from this embedding. In other words, the VAE will
   converge to generating an embedding that contains no information about
   the origin of the input data, while the aNN will converge to a random
   prediction performance. During the MOBER training process, we use
   one-hot representation of the origin of the input samples (either TCGA,
   CCLE, PTX, MET500, or CMI) and provide that information to the decoder.
   This enables the VAE to decode the data embeddings conditionally and
   reconstruct the expression data as if the sample was coming from
   another source type. To project a transcriptomics profile from one
   origin (e.g., preclinical) into another (e.g., clinical), after the
   model is trained, we can pass that transcriptomics profile through the
   VAE and simply change the one-hot vector informing on the sample origin
   to the desired one (e.g., CCLE to TCGA to decode cell lines as if they
   were TCGA tumors).

Fig. 1. MOBER architecture.

   [69]Fig. 1.
   [70]Open in a new tab

   MOBER consists of two neural networks: one conditional variational
   autoencoder (VAE) and one source discriminator neural network that is
   trained in adversarial fashion. The encoder takes as an input a gene
   expression profile and encodes it into a latent space, and the decoder
   takes a sampling from the latent space and reconstructs the gene
   expression profile from it. The source discriminator adversary neural
   network takes the sampling from the latent space and tries to predict
   the origin of the input data. Parts of this figure were created with
   [71]BioRender.com.

Global pan-cancer alignment of transcriptional profiles from cancer cell
lines, PTXs, and patient tumors

   We analyzed the transcriptomics profiles from 932 CCLE cell lines, 434
   PTXs, 10,550 patient tumors from TCGA, 406 metastatic tumors from
   MET500 ([72]33), and 203 breast tumors from CMI ([73]16). Integrating
   these datasets by performing dimensionality reduction with
   two-dimensional Uniform Manifold Approximation and Projection (UMAP)
   reveals a clear separation of samples based on their origin ([74]Fig.
   2A). As expected, there is a global separation between cell lines,
   xenografts, and patient tumors. In addition, there are still strong
   batch effects between CMI, TCGA, and MET500 datasets, despite these
   samples all being derived from patient biopsies.

Fig. 2. Global pan-cancer alignment of preclinical and clinical
transcriptomes.

   [75]Fig. 2.
   [76]Open in a new tab

   (A) Integration of transcriptional profiles from models and patient
   tumors by performing Uniform Manifold Approximation and Projection
   (UMAP) dimensionality reduction, each dot is a sample. (B) Integration
   of transcriptional profiles from models and patient tumors using MOBER,
   the color corresponds to the sample origin. (C) MOBER alignment, where
   each tumor sample is colored on the basis of cancer indication. (D) The
   proportion of cancer cell lines that are classified as each tumor type
   using MOBER-aligned data. (E) The proportion of PTXs that are
   classified as each tumor type using MOBER-aligned data. The x axis on
   (D) and (E) shows the TCGA tumor types, and the y axis shows the CCLE
   annotation label (D) and PTX annotation label (E), accordingly.

   Aligning the CCLE, PTX, TCGA, MET500, and CMI datasets with MOBER
   resulted in a well-integrated dataset of transcriptional profiles from
   cell lines, xenografts, and patient tumors that have been corrected for
   multiple sources of systematic dataset-specific differences. The UMAP
   plots using the MOBER-aligned data ([77]Fig. 2, B and C) reveal a map
   of cancer transcriptional profiles where cell lines, xenografts, and
   patient tumor samples are largely intermixed, while the biological
   differences across known cancer types are still preserved ([78]Fig.
   2C). The strong batch effects between cell lines, xenografts, and
   patient tumors were not addressed by applying other widely used batch
   correction methods, such as ComBat ([79]21, [80]22), Harmony ([81]34),
   Batch Mean Centering ([82]35), and the Regress_Out algorithm as
   implemented in scanpy (fig. S1) ([83]36). Comparisons using simulated
   data further highlight the superiority of MOBER in data integration,
   particularly in presence of strong batch effects (see Supplementary
   Text and figs. S2 to S7).

   As illustrated in [84]Fig. 2B, MOBER removes the systematic differences
   between patient tumors and cancer models, as well as the technical
   artifacts present in patient tumors coming from different sources (CMI,
   MET500, and TCGA), producing an integrated cancer expression space with
   clear clusters composed of mixture of cell lines, xenografts, and
   patient tumor samples. [85]Figure 2C shows that the aligned expression
   profiles largely cluster together by disease type, although MOBER does
   the alignment in a completely unsupervised manner, without relying on
   any sample annotations such as disease type. We quantified this by
   classifying the most similar tumor type for each cell line and
   xenograft model, based on its nearest neighbors among the TCGA tumor
   samples (see Materials and Methods). We found that, for 73% of the PTX
   models, the inferred disease type matches the annotated PTX tumor type,
   while this number goes down to 53% for CCLE models.

   A key advantage of MOBER is that it does not assume that any two
   datasets are necessarily similar to each other, and it does not rely on
   clinical annotations of individual tumor samples, independently of
   whether they come from patient biopsies or preclinical models. As a
   result, the MOBER-aligned expression data can be used to identify which
   preclinical models have the greatest transcriptional fidelity to
   clinical tumors and which models are transcriptionally unrepresentative
   of their respective clinical tumors. Although a high proportion of
   preclinical models clusters with tumors of the same cancer type, not
   all cell lines and xenograft models align well with patient tumor
   samples. [86]Figure 2 (D and E) shows that a significant proportion of
   cancer models (both PTXs and CCLEs), derived from skin, colorectal, and
   breast cancer, are faithful representatives of patient tumors, while
   many models derived from liver, esophagus, and bile duct tend to align
   with other cancer types. This observation is in agreement with
   Celligner results on cancer cell lines ([87]19). PTX models derived
   from brain tumors align very well with brain cancer patient biopsies;
   however, brain cell line models cluster closely to soft tissue patient
   tumors, but not to clinical brain tumors. This is in line with previous
   studies that show that in vitro medium conditions cause genomic
   alterations in brain cell lines that were not present in the original
   tumors, thus altering their phenotypes ([88]37, [89]38).

   Metastatic tumors from the MET500 dataset tend to cluster together with
   their corresponding primary tumors from TCGA, although the tissue of
   biopsy of MET500 tumors is different from the primary site (fig. S8).
   In 63% of MET500 samples, the inferred disease type matches the
   annotated primary tumor type. We note that, for 88 samples from the
   MET500 dataset, the primary site annotation is missing (such samples
   are shown in black squares in fig. S8). However, the majority of these
   samples align nicely within TCGA clusters, indicating that the
   unsupervised pan-cancer alignment with MOBER can be used to infer the
   primary site for tumors of unknown origin when the transcriptomics
   profiles are available.

   Another key feature of MOBER is that it allows for populations that are
   only present in one dataset to be aligned correctly in an unsupervised
   manner. For example, the CMI dataset contains only transcriptional
   profiles of patient biopsies with metastatic breast cancer. MOBER
   integrated this dataset very nicely with the other breast patient
   tumors from TCGA and MET500, as well as breast cancer cell lines and
   xenograft models ([90]Fig. 2B).

Preservation of biological subtype relationships in the MOBER alignment

   We next sought to determine whether the MOBER alignment keeps known
   biological differences between more granular cancer subtypes. Focusing
   on the breast cancer samples, we determined subtype annotations with
   the PAM50 method (see Materials and Methods). [91]Figure 3A shows that
   breast cancer patient biopsies, breast cell lines, and xenograft models
   primarily cluster together by breast cancer subtype [LumA, LumB,
   normal, human epidermal growth factor receptor 2 (HER2)–enriched, or
   basal], with only a few PTX basal subtype models clustering elsewhere
   (see also fig. S9).

Fig. 3. Alignment of breast cancer subtypes.

   [92]Fig. 3.
   [93]Open in a new tab

   (A) UMAP two-dimensional projection of the MOBER-alignment highlighting
   breast tumor samples: LumA (green), LumB (orange), normal (purple),
   HER2-enriched (blue), and basal (red). All other non–breast tumor
   samples are in gray. (B) Genes that were most significantly
   up-regulated in silico after the alignment of breast cancer cell lines
   to breast cancer patient biopsies (x axis), along with top enriched
   biological pathways involving the 100 most up-regulated genes (y axis).
   (C) Expression values (log[2] counts per million) of a selected gene,
   Lumican (LUM), in breast cancer patient tumors (blue), breast cancer
   cell lines before the alignment (bright red), breast cancer cell lines
   after the MOBER alignment (dark red), breast xenograft models before
   the alignment (yellow), and breast xenograft models after the MOBER
   alignment (orange).

   MOBER is fully interpretable by design, which allows us to study the
   changes in the transcriptomics profiles of preclinical models after
   their in silico transformation to clinical tumors. In this respect, we
   examined the genes that were significantly changed when breast cancer
   cell lines were aligned to patient biopsies. [94]Figure 3B shows the
   genes that were most significantly up-regulated after the alignment,
   along with enriched biological pathways involving these genes. Almost
   all top enriched biological pathways (e.g., extracellular matrix
   organization, collagen fibril organization, and connective tissue
   development) are related to the high presence of stromal tumor
   microenvironment in breast cancer patient biopsies, which is missing in
   cell lines (see also fig. S9). [95]Figure 3C illustrates the changes in
   the expression values of a selected gene, Lumican (LUM), in preclinical
   models before and after their in silico transformation. Lumican is
   involved in extracellular matrix organization and connective tissue
   development processes (among others) and is highly expressed in stromal
   components in breast cancer patient biopsies, while it has a low
   expression in preclinical models as they are missing the human stromal
   component. After transforming the breast preclinical models to resemble
   breast patient biopsies, we see that the LUM expression was increased
   in the transformed data. Similarly, when transforming blood cancer cell
   lines to patient biopsies, the most significantly up-regulated genes
   after the alignment are related to the presence of human immune
   components in blood cancer patient biopsies that are missing in cell
   lines (fig. S10, B and C). Together, these results highlight that,
   during the projection of preclinical models to clinical biopsies, MOBER
   does not correct genes at random. Instead, it up-regulates in silico
   the genes that are expressed in the human tumor microenvironment, which
   is present in clinical biopsies but missing in preclinical models,
   while still preserving the key biological information on tumor subtypes
   at a very granular level.

Information transfer between cell line and patient tumor datasets

   Next, we demonstrate how the MOBER-transformed gene expression profiles
   of preclinical models into clinical tumors can be used in other studies
   where we seek to translate preclinical biomarkers to patients. The
   Broad Institute recently published a large metastasis map dataset
   (MetMap) ([96]15), determining the metastatic potential of ~500 human
   cancer cell lines, thus enabling the metastatic patterns of cell lines
   to be associated with their genomic features. Here, we sought to
   identify transcriptomics features that are associated with high or low
   metastatic potential in human cancer cell lines. We built machine
   learning (ML) models that take as input gene expression profiles of
   cancer cell lines and try to predict their average metastatic potential
   toward five different organs, as provided by the MetMap dataset
   ([97]Fig. 4A; see Materials and Methods). Next, we used the models
   trained on cell line expression data to predict the metastatic
   potential scores of patient tumors from TCGA. We examined whether there
   is a difference in the survival between patients for which we predict
   high metastatic potential versus low metastatic potential. In addition,
   we determined whether there is any correlation between the predicted
   metastatic scores and the clinical stage of patient tumors. Using the
   ML models that are trained on original gene expression profiles of CCLE
   cell lines, we see that there is a significant difference in the
   survival of TCGA patient tumors for which we predict very high
   metastatic potential (top 25%) versus low metastatic potential (bottom
   25%) (P = 1.1 × 10^−16) ([98]Fig. 4B). However, there is no significant
   association with clinical stage of TCGA patient tumors ([99]Fig. 4C, P
   value of Spearman’s correlation is 0.1). Then, we built new ML models
   that can predict the metastatic potential scores, but, this time, we
   trained them on gene expression profiles of cell lines that are
   transformed with MOBER to resemble TCGA tumors. Applying these ML
   models on TCGA patient tumors, we achieve more significant survival
   stratification of TCGA patient tumors for which we predict very high
   metastatic potential (top 25%) versus low metastatic potential (bottom
   25%) (P = 6.2× 10^−29) ([100]Fig. 4D). In addition, we note that such
   models predict higher metastatic potential of the late-stage tumors,
   compared to early-stage tumors ([101]Fig. 4E, Spearman’s correlation r
   = 0.90, P = 0.037). The same analyses performed separately for each
   disease type confirm the improved translatability of the ML models when
   they are trained on MOBER-transformed cell line transcriptomes to
   resemble patient tumors (fig. S11). The ML models that we used here
   might be too simple to faithfully infer the metastatic potential of
   tumors; however, our results demonstrate the utility of using the
   MOBER-transformed gene expression profiles in finding biomarkers that
   are better translatable to patients.

Fig. 4. Associating biomarkers of high/low metastatic potential in human
cancer cell lines from MetMap and translating them to patients.

   [102]Fig. 4.
   [103]Open in a new tab

   (A) Using gene expression profiles of cancer cell lines and their
   experimentally derived metastatic potential scores from the MetMap
   dataset, we built machine learning (ML) models to predict the
   metastatic potential of cell lines based on expression data. Then, we
   translated these models trained on cell lines to patients, trying to
   predict the metastatic potential of TCGA patient tumors using patients’
   transcriptomes. Patients whose tumors are predicted to have higher
   metastatic potential are expected to have worse survival and a more
   advanced stage of the disease. (B and C) The association of the
   predicted metastatic potential with patients’ survival and disease
   stage when ML models are trained on original cell line expression
   profiles from the CCLE. (B) Difference in survival of TCGA patient
   tumors for which we predict very high metastatic potential (top 25%,
   blue) versus low metastatic potential (bottom 25%, orange) with ML
   models trained on original cell line expression profiles. P values are
   derived from the log-rank test, and shaded areas indicate 90% of
   confidence intervals. (C) Predicted metastatic potential of TCGA tumors
   for different clinical stages. No correlation is observed. (D and E)
   The same as (B) and (C) but with ML models trained on MOBER-transformed
   cell line expression profiles to resemble TCGA patients. These models
   translate better to patient tumors, as evidenced by the improved
   survival stratification and significant positive correlation between
   the predicted metastatic potential and disease stage (Spearman’s r =
   0.90, P = 0.037)

DISCUSSION

   Preclinical cancer research critically relies on in vitro and in vivo
   tumor models, such as cell lines and PTXs. However, because of the
   differences in the growing conditions and the absence of the human
   stromal microenvironment, these models are often not predictive of the
   drug response in clinical tumors and do not follow the same pathways of
   drug resistance ([104]10, [105]39). Therefore, the identification of
   the best models for a given cancer type, without relying on annotated
   disease labels, is critically important. To address this, we developed
   MOBER, a deep learning–based method that performs biologically relevant
   integration of transcriptional profiles from various preclinical models
   and clinical tumors. MOBER can be used to guide the selection of cancer
   cell lines and patient-derived xenografts and identify models that more
   closely resemble clinical tumors. We integrated gene expression data
   from 932 cancer cell lines, 434 PTXs, and 11,159 clinical tumors
   simultaneously and demonstrate that MOBER can conserve the inherent
   biological signals while removing confounder information.

   We identified pronounced differences across cancer models in how well
   they recapitulate the transcriptional profiles of their corresponding
   tumors in patients. Certain cancer types, such as the ones in skin,
   breast, colorectal, kidney, and uterus, exhibit greater transcriptional
   similarity between models and patients, while models of cancers of the
   bile duct, liver, and esophagus have transcriptional profiles that are
   unlike their patient tumors. There are also notable differences between
   CCLE and PTX as cancer model systems. While brain and soft tissue CCLEs
   appear to have diverged from their corresponding patient tumors, brain
   and soft tissue PTX models have high transcriptional fidelity. Notably,
   among the PTX models, urinary tract xenografts attain greatest
   transcriptional similarity to corresponding patient tumors; however,
   many of the urinary tract CCLEs have drifted away from their labeled
   tumors. As pointed in previous studies ([106]19, [107]40), many CCLEs
   are not classified as their annotated labels. We report that PTX
   models, on average, show greater transcriptional similarity to patient
   tumors, as compared to their CCLE counterparts. This could suggest that
   the lack of immune component is not the main CCLE confounder, but CCLEs
   likely undergo transcriptional divergence due to the culture condition,
   high number of passages, and genetic instability ([108]41, [109]42). In
   addition, cancer models (both PTXs and CCLEs) could be misannotated
   because of inaccurate assignment based on unclear anatomical features
   or mismatch during sampling ([110]40, [111]43). Our pan-cancer global
   alignment with MOBER could be used to identify cancers that are
   underserved by adequate preclinical models and to determine the primary
   site label for models and clinical tumors where such annotation is
   missing.

   MOBER is interpretable by design, therefore allowing drug hunters to
   better understand the underlying biological differences between models
   and patients that are responsible for the observed lack of clinical
   translatability. The observed differences vary across disease types,
   emphasizing the importance of using unsupervised nonlinear approach
   that enables identification of disease-type-specific variations.

   As a batch effect removal method, MOBER offers several advantages
   compared to previously published methods ([112]19, [113]21, [114]44):
   It (i) supports integration of multiple datasets simultaneously, (ii)
   enables transformation of one dataset into another, and (iii) does not
   make any assumption on the datasets composition. We demonstrate that
   MOBER can remove batch effects between gene expression datasets even
   when the cell population representation across datasets is different.

   We note that, when transforming preclinical models to resemble clinical
   tumors, MOBER corrects not only for hidden technical differences but
   also for differences due to the absence of the tumor microenvironment
   in preclinical models. While this correction is crucial for aligning
   preclinical models with clinical tumors and identification of outlier
   models, this may not be desirable in studies aimed at investigating the
   inherent biological disparities between models and patient tumors. With
   the in silico addition of the tumor microenvironment to cell lines,
   some components that are inherent to cell culture (e.g., extracellular
   matrix genes in stromal reach tumors) would be affected. However, the
   differences in these cell components between the different cell lines
   are still maintained after their in silico transformation to clinical
   tumors, as evidenced by the meaningful alignment of disease types and
   subtypes in a completely unsupervised way.

   To facilitate the use of MOBER, we made the source code available at
   [115]https://github.com/Novartis/MOBER and developed an interactive web
   app available at [116]https://mober.pythonanywhere.com to allow users
   to explore the MOBER aligned expression profiles coming from cancer
   models and clinical tumors. In addition, the web app enables the
   identification of preclinical models that best represent the
   transcriptional features of a tumor type or even a particular tumor
   subtype of interest. Future version of MOBER that integrates genetic
   and epigenetic features of models and patient tumors could potentially
   enable even more detailed analysis between models and patients.

MATERIALS AND METHODS

The MOBER method and model training details

   Each gene expression profile of a sample
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   is a vector
   [MATH: <msub><mi mathvariant="bold">x</mi><mi>i</mi></msub> :MATH]
   with length equal to the total number of genes. The input gene
   expression data
   [MATH: <mrow><mi>x</mi></mrow> :MATH]
   is run through a VAE that uses variational inference to reconstruct the
   original data in a conditional manner. The encoder estimates the
   probability density function of the input expression data
   [MATH: <mrow><mi>Q</mi><mrow><mo
   stretchy="false">(</mo><mi>z</mi><mo>∣</mo><mi>x</mi><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   . Then, a latent vector
   [MATH: <mrow><mi>z</mi></mrow> :MATH]
   is sampled from
   [MATH: <mrow><mi>Q</mi><mrow><mo
   stretchy="false">(</mo><mi>z</mi><mo>∣</mo><mi>x</mi><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   . The decoder decodes
   [MATH: <mrow><mi>z</mi></mrow> :MATH]
   into an output, learning the parameters of the distribution
   [MATH: <mrow><mi>P</mi><mrow><mo
   stretchy="false">(</mo><mi>x</mi><mo>∣</mo><mi>z</mi><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   . The loss function is then given by
   [MATH:
   <mrow><msub><mtext>Loss</mtext><mtext>VAE</mtext></msub><mo>=</mo><mund
   er><munder
   accentunder="true"><mrow><mo>−</mo><msub><mi>E</mi><mrow><mi>z</mi><mro
   w><mo>∼</mo></mrow><mi>Q</mi><mrow><mo
   stretchy="false">(</mo><mi>z</mi><mo>∣</mo><mi>x</mi><mo
   stretchy="false">)</mo></mrow></mrow></msub><mrow><mo
   stretchy="false">[</mo><mtext>log</mtext><mo> </mo><mi>P</mi><mrow><mo
   stretchy="false">(</mo><mi>x</mi><mo>∣</mo><mi>z</mi><mo
   stretchy="false">)</mo></mrow><mo
   stretchy="false">]</mo></mrow></mrow><mo>⏟</mo></munder><mtext>reconstr
   uction error</mtext></munder><mo>+</mo><munder><munder
   accentunder="true"><mrow><msub><mi>w</mi><mtext>KL</mtext></msub><mo>*<
   /mo><mtext>KL</mtext><mrow><mo
   stretchy="false">[</mo><mi>Q</mi><mrow><mo
   stretchy="false">(</mo><mi>z</mi><mo>∣</mo><mi>x</mi><mo
   stretchy="false">)</mo></mrow><mo>∥</mo><mi>P</mi><mrow><mo
   stretchy="false">(</mo><mi>z</mi><mo stretchy="false">)</mo></mrow><mo
   stretchy="false">]</mo></mrow></mrow><mo>⏟</mo></munder><mtext>regulari
   zation</mtext></munder></mrow> :MATH]

   where
   [MATH: <mi mathvariant="bold">x</mi> :MATH]
   and
   [MATH: <mrow><mi>z</mi></mrow> :MATH]
   indicate the gene expression data and latent space, respectively;
   [MATH: <mrow><mi>Q</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>P</mi></mrow> :MATH]
   are the estimated probability distributions;
   [MATH: <mrow><mi>E</mi></mrow> :MATH]
   denotes an expectation value; KL is the Kullback-Leibler ([117]45)
   divergence; and
   [MATH: <mrow><mi>w</mi></mrow> :MATH]
   is a weight parameter that determines the importance given to the
   Kullback-Leibler divergence in the VAE loss function. The first term in
   the loss function is the reconstruction error (i.e., expected negative
   log-likelihood of the data sample), and the second term is the
   Kullback-Leibler divergence between the encoder’s distribution
   [MATH: <mrow><mi>Q</mi><mrow><mo
   stretchy="false">(</mo><mi>z</mi><mo>∣</mo><mi>x</mi><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   and
   [MATH: <mrow><mi>P</mi><mrow><mo stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   . In addition to the input expression data, we provide to the decoder
   an information about the origin of the input sample (in our case CCLE,
   PTX, TCGA, MET500, or CMI) transformed into a one-hot encoding vector
   [MATH: <mi mathvariant="bold">s</mi> :MATH]
   , that consists of 0’s in all cells and a single 1 in a cell used to
   uniquely identify the input source. This allows for reconstruction of
   the latent vector
   [MATH: <mrow><mi>z</mi></mrow> :MATH]
   by the decoder in a conditional manner, and it enables projection of
   one dataset into another.

   In addition to training the VAE, we simultaneously train an aNN that
   acts as a source discriminator. It takes as an input an embedding
   vector
   [MATH: <mrow><mi>z</mi></mrow> :MATH]
   sampled from the latent space, and tries to predict the source label of
   the input data (
   [MATH: <mrow><mi>s</mi></mrow> :MATH]
   ). This is a multi-class fully connected neural network with negative
   log-likelihood loss function as given by
   [MATH:
   <mrow><msub><mtext>Loss</mtext><mtext>aNN</mtext></msub><mo>=</mo><mo>−
   </mo><msub><mi>E</mi><mrow><mi>Q</mi><mrow><mo
   stretchy="false">(</mo><mi>z</mi><mo>∣</mo><mi>x</mi><mo
   stretchy="false">)</mo></mrow></mrow></msub><mrow><mo
   stretchy="false">[</mo><mtext>log</mtext><mi>p</mi><mrow><mo
   stretchy="false">(</mo><mi>s</mi><mo>∣</mo><mi>z</mi><mo
   stretchy="false">)</mo></mrow><mo stretchy="false">]</mo></mrow></mrow>
   :MATH]

   The joint loss is computed as
   [MATH:
   <mrow><msub><mtext>Loss</mtext><mtext>MOBER</mtext></msub><mo>=</mo><ms
   ub><mtext>Loss</mtext><mtext>VAE</mtext></msub><mo>−</mo><mi
   mathvariant="normal">λ</mi><mo>*</mo><msub><mtext>Loss</mtext><mtext>aN
   N</mtext></msub></mrow> :MATH]

   where λ is a coefficient that determines the weight that the model
   gives to the adversarial loss.

   In our study, the encoder, decoder, and aNN are designed as fully
   connected neural networks each with three layers. Each layer of the
   encoder (and decoder respectively) consisted of 256, 128, and 64 nodes
   each. We used Scaled Exponential Linear Unit ([118]46) activation
   function between two hidden layers, except the last decoder layer,
   where we applied Rectified Linear activation Unit ([119]47). The last
   aNN layer had five hidden nodes corresponding to the number of data
   source classes and softmax activation.

   We implemented MOBER using PyTorch ([120]48). We set the minibatch size
   to 1600 and trained it with Adam optimizer ([121]49) using a learning
   rate of 1 × 10^−3. The weight for the KL loss of the VAE was set to 11
   × 10^−6, and the weight for the source adversary loss was set to 11 ×
   10^−2. The best hyperparameter set from numerous possibilities was
   chosen from a grid search that minimized the joint loss and maximized
   the clustering performance of models to patients.

Datasets

   The RNA-seq gene expression counts for TCGA samples were downloaded
   from the TCGA portal ([122]https://tcga-data.nci.nih.gov/tcga)
   ([123]14) and the CMI gene expression counts from the Genomic Data
   Commons portal ([124]50), and CCLE ([125]2), PTX ([126]9), and MET500
   ([127]33) data were obtained from the corresponding publications. All
   gene expression data were normalized with Trimmed Mean of M-values
   (TMM) method using EdgeR ([128]51) and then transformed to log[2]
   counts per million using the edgeR function “cpm,” with a pseudocount
   of 1 added. Gene expression data were subset to 17,167 protein-coding
   genes that were present in all datasets. The MetMap ([129]15) dataset
   was downloaded from the DepMap portal ([130]https://depmap.org/metmap).
   We excluded the indications that have less than five samples.

Alignment evaluation

   We projected each sample from the CCLE, PTX, CMI, and MET500 datasets
   to TCGA, by changing the one-hot encoded source information and setting
   it to TCGA. Then, we decoded the expression data with MOBER.

   To evaluate the alignment of preclinical samples to TCGA samples, for
   each CCLE and PTX sample, we identified the 25 TCGA nearest neighbors
   in 70-dimensional principal components analysis space. Each preclinical
   sample was classified as a tumor type by identifying the most
   frequently occurring tumor type within these 25 nearest neighbors.

   The identification of breast cancer molecular subtypes was done using
   the PAM50 classifier as implemented in the geneFu R package ([131]52).
   The identification of differentially expressed genes comparing the
   transcriptional profiles before and after their transformation to
   patient tumors was done with Seurat v3.6 ([132]53) using the t-test
   method. Pathway enrichment analysis was done for the top 100 most
   differentially expressed genes ordered by their fold change and with an
   adjusted P value of <0.01 using the clusterProfiler ([133]54) package.

Analyses of metastatic potential using MetMap data

   We trained random forest models to predict the metastatic potential
   scores using gene expression values as input, using the scikit-learn
   ([134]55) Python package (0.23.2). The hyperparameters were optimized
   with grid search strategy using threefold cross validations. Then, the
   final model was trained using the optimized hyperparameters. The
   hyperparameter optimized in the model is “max_features.” One thousand
   trees were used, and all other hyperparameters were set as default.

   Two different models were trained, using either the original
   transcriptome or the projected transcriptomes of CCLEs to TCGA patients
   to predict mean metastatic potential scores of cell lines across five
   organs (metp500.all5). Then, with each trained model, we predicted the
   metastatic potential scores for patient tumors from TCGA using the
   original transcriptome as input. For the survival analysis, the top 25%
   of samples with highest predicted scores and bottom 25% with the lowest
   predicted scores were compared. The Kaplan-Meier survival analysis
   ([135]56) was done with the lifelines ([136]57) Python package v0.25.4.

Acknowledgments