Abstract

   The genetic component of many common traits is associated with the gene
   expression and several variants act as expression quantitative loci,
   regulating the gene expression in a tissue specific manner. In this
   work, we applied tissue-specific cis-eQTL gene expression prediction
   models on the genotype of 808 samples including controls, subjects with
   mild cognitive impairment, and patients with Alzheimer's Disease. We
   then dissected the imputed transcriptomic profiles by means of
   different unsupervised and supervised machine learning approaches to
   identify potential biological associations. Our analysis suggests that
   unsupervised and supervised methods can provide complementary
   information, which can be integrated for a better characterization of
   the underlying biological system. In particular, a variational
   autoencoder representation of the transcriptomic profiles, followed by
   a support vector machine classification, has been used for
   tissue-specific gene prioritizations. Interestingly, the achieved gene
   prioritizations can be efficiently integrated as a feature selection
   step for improving the accuracy of deep learning classifier networks.
   The identified gene-tissue information suggests a potential role for
   inflammatory and regulatory processes in gut-brain axis related
   tissues. In line with the expected low heritability that can be
   apportioned to eQTL variants, we were able to achieve only relatively
   low prediction capability with deep learning classification models.
   However, our analysis revealed that the classification power strongly
   depends on the network structure, with recurrent neural networks being
   the best performing network class. Interestingly, cross-tissue analysis
   suggests a potentially greater role of models trained in brain tissues
   also by considering dementia-related endophenotypes. Overall, the
   present analysis suggests that the combination of supervised and
   unsupervised machine learning techniques can be used for the evaluation
   of high dimensional omics data.

   Keywords: eQTL, gene expression imputation, GTEx, variational
   autoencoder, support vector machine, deep learning, recurrent neural
   networks, Alzheimer’s

Introduction

   Nowadays researchers can access omics data at different levels, such as
   genomics (e.g., dbGaP[37] ^1 ), transcriptomics (e.g., GEO
   expression[38] ^2 ) and also at multi-omics levels (e.g., GTEx[39] ^3 ,
   Encode[40] ^4 ). Given the advancement of high-throughput technologies,
   the increasing availability of omics data can be expected over time.
   This will allow researchers to better analyze complex systems
   characterized by many interacting features as the biological systems.

   Traditional analytical methods on omics data, such as Genome-wide
   association study (GWAS) and differential expression analysis, usually
   rely on univariate approaches with specific statistical modelling
   ([41]Visscher et al., 2017; [42]McDermaid et al., 2018). These
   approaches, despite being robust, are limited in detecting potential
   combinatorial effects in the underlying biological system. Indeed,
   biological networks can be highly complex with many feedback regulatory
   loops ([43]Franco and Galloway, 2015). A comprehensive analysis of
   interaction effects is not feasible with traditional approaches due to
   the combinatorial explosion of the input factor space ([44]Berger et
   al., 2013).

   On the other hand, machine learning methods have proved to be efficient
   for the analysis of high dimensional complex systems, although the
   application of machine learning methods in omics data is still
   relatively uncommon due to the limited interpretability of the outcome
   of machine learning frameworks ([45]Li et al., 2016). In this work, we
   investigate the applicability of different machine learning methods on
   omics data using, as a case study, matrices of tissue-specific
   predicted transcriptomic profiles in Alzheimer’s disease (AD). AD is a
   progressive neurodegenerative disorder, representing the predominant
   form of dementia ([46]Wang et al., 2017), and is characterized by
   progressive deterioration of memory and cognitive functions that can be
   tested with different clinical tests ([47]Kirsebom et al., 2017). The
   pathophysiology of AD involves the formation of the characteristic
   extracellular amyloid plaques and intracellular neurofibrillary tangles
   ([48]Kuznetsov and Kuznetsov, 2018).

   A lot of research has been done in order to identify the genetics
   factor contributing to AD. In cases of specific familiar forms of AD,
   which are recurrent among family members and are characterized by early
   onset (i.e., age < 65), disease causing mutations in specific genes
   have been identified, namely amyloid precursor protein (APP),
   Presenilin 1 PSEN1 and Presenilin 2 PSEN2 ([49]Piaceri et al., 2013).
   This is not the case of the most common sporadic AD forms,
   characterized by late onset (age > 65), representing about 95% of AD
   cases ([50]Bali et al., 2012), for which the “4 allele of
   Apolipoprotein E (APOE) is the only strong identified genetic risk
   factor ([51]Dorszewska et al., 2016).

   However, the relatively high heritability also of sporadic AD,
   estimated to be around 60% to 80% ([52]Van Cauwenberghe et al., 2016),
   combined with the identification of a number of genetic risk loci from
   GWAS, suggests the presence of a polygenic component in late onset AD
   ([53]Escott-Price et al., 2015). Indeed, GWAS hits can be associated
   with different biological pathways, such as cholesterol and lipid
   metabolism, immune system, inflammatory response, and endosomal vesicle
   cycling ([54]Lambert et al., 2013). Moreover, several susceptibility
   loci are localized in gene-dense regions, but it remains unknown which
   genes of these regions are responsible for the association ([55]Van
   Cauwenberghe et al., 2016). In fact, identifying the functional role of
   variants in intergenic regions is not a trivial process, since the
   related genes might not be the closest to the loci (e.g., chromatin 3D
   structure can place in proximity relatively distant region in the
   primary DNA sequence) ([56]Dekker et al., 2013). Moreover, many complex
   phenotypes have a polygenic architecture, in which many variants have
   minor effects over a phenotype, and polygenic risk score modeling is
   capable of finding significant genetic associations for traits with no
   monogenic causes, but with relatively high heritability ([57]Chatterjee
   et al., 2016).

   Different works show a co-localization between Expression Quantitative
   Loci (eQTL) and GWAS hits indicating that the biological effect of
   non-coding variants can be exerted through the regulation of gene
   expression ([58]Hormozdiari et al., 2016; [59]Wen et al., 2017), that
   is a polygenic trait in which many variants may be involved. Indeed,
   different tools model the combined effect of eQTL signals, considering
   both strong functional SNP effects and additive effects for
   modest-strength signals ([60]Gamazon et al., 2015; [61]Gusev et al.,
   2016). Conducting gene association on the basis of the genetic
   component of gene expression regulation, also called Transcription Wide
   Association Study (TWAS), proved to be particularly efficient in
   finding associations with many traits ([62]Gusev et al., 2016).

   There are many advantages in testing the genetic component of gene
   expression rather than evaluating the nominal variant GWAS association:
   I) the aggregation of multiple eQTL into one gene can boost the
   association by including additive effect among variants; II) genes are
   more interpretable biological unit in comparison with variants; III)
   the statistical power is increased due to the reduction of
   multiple-comparison tests from hundreds of thousands/million variants
   (before/after imputation) to the order of thousands of genes (after
   filtering for gene expression heritability); IV) eQTL are tissue
   specific and therefore it is possible to perform gene association
   analysis in the target tissue for the phenotypes and also in secondary
   tissues for potential peripheral biomarkers (e.g., blood).

   Noteworthy, the evaluation of the solely genetic component of gene
   expression is less comprehensive than the actual gene expression
   analysis, but has the advantage to focus only on the genetic/heritable
   component, avoiding environmental confounding effects ([63]Gamazon et
   al., 2015). Since polygenic effects can be expected also at gene
   expression level, given the complexity of biochemical systems,
   performing multi-gene evaluation can provide greater insights
   concerning potential biological associations ([64]Marigorta et al.,
   2017). Therefore, machine learning and deep learning methodologies have
   proved to be efficient at identifying transcriptomic profiles
   associated with specific phenotypes, considering different input data,
   such as measured RNA-seq data ([65]Wang et al., 2018), single cell
   expression ([66]Hu et al., 2016), and also imputed transcriptomic data
   ([67]Gottlieb et al., 2017).

   In this work, we tested multiple machine learning and deep learning
   approaches to study multi-tissue imputed transcriptomic profiles in the
   Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort ([68]Weiner
   et al., 2013). Noteworthy, the analysis of imputed transcriptomic
   profiles on ADNI data has been already performed at single gene level
   identifying, suggesting potential specific gene-tissue associations
   with amyloid deposition ([69]Hohman et al., 2017). In the following
   sections we introduce the supervised and unsupervised methods we
   exploited in this work, the results achieved combining these
   approaches, and a discussion of the achieved outcomes.

Methods

Machine Learning Methods in Bioinformatics

   Machine Learning (ML) algorithms have proved to be particularly useful
   for the analysis of complex big biological data ([70]Olson et al.,
   2017). For instance ML has been applied to detect epistasis within the
   human genome ([71]McKinney et al., 2006) suggesting that ML can reveal
   non-linear behavior in biological systems. In the same direction, more
   recent deep learning approaches have been profitably exploited to
   analyze genotype/phenotype associations ([72]Min et al., 2017) as well
   as to extract relevant information from many data modalities, including
   text, images, and sounds ([73]Li et al., 2019).

   Deep learning methods follow a data-driven approach and are therefore
   well-designed to detect nonlinear-behaviors, which are relatively
   common in natural systems ([74]Tang et al., 2019). Networks can vary
   depending on the number of layers and type of nodes and not all of them
   perform equally well on different data typology. Convolutional Neural
   Networks (CNN) are generally applied to recognize objects in a pattern,
   Recurrent Neural Networks (RNN) to analyze temporal data, but it is not
   mandatory to use any kind of network only for a specific task. For
   instance, CNNs were successfully used to predict the enhancer-promoter
   interactions with DNA sequences ([75]Zhuang et al., 2019) and for
   accurate clustering of sequences ([76]Aoki and Sakakibara, 2018). RNNs
   were used instead for predicting transcription factor binding sites
   ([77]Shen et al., 2018) and to dissect the regulation of mRNA to
   protein-coding translation ([78]Hill et al., 2018).

   Noteworthy, also variational autoencoders (VAEs) showed good
   performance in capturing biologically relevant feature in gene
   expression data analysis ([79]Way and Greene, 2017a). VAEs are part of
   a large branch of deep learning architectures, the so called generative
   models ([80]Goodfellow, 2016). These architectures are based on an
   encoding-decoding approach and, differently from the standard
   autoencoders, they assume a stochasticity in the modelling of the data.
   The original input matrices of features are compressed in a lower
   dimensional space, the so called encoding phase, and are reconstructed
   back in a second step, called decoding phase. Both phases are composed
   by neural networks. VAEs have seen increasing success in many different
   applications in the last few years, among the unsupervised
   methodologies recently developed, and they are widely used in different
   types of data such as time series, images or gene expressions
   ([81]Goodfellow, 2016; [82]Goodfellow et al., 2016; [83]Way and Greene,
   2017b).

Tissue Specific Gene Expression Imputation

   Data used for the preparation of this article were obtained from the
   Alzheimer’s Disease Neuroimaging Initiative (ADNI) database
   (adni.loni.usc.edu). ADNI was launched in 2003 as a public-private
   partnership led by Michael W. Weiner, MD. The primary goal of ADNI has
   been to test whether serial Magnetic Resonance Imaging (MRI), Positron
   Emission Tomography (PET), other biological markers, clinical and
   neuropsychological assessment can be combined to measure the
   progression of mild cognitive impairment (MCI) and early Alzheimer’s
   disease (AD). In the present work, we analyzed the ADNI1-GWAS dataset
   including gene array genotyping data for 808 samples available on ADNI
   portal.

   Rigorous quality control has been performed. Namely, samples have been
   checked for sex, missing genotype rates lower than 0.05 and
   heterozygosity levels F < 0.2, while variants with Hardy–Weinberg
   p-value < 1e – 10 have been removed. Then, using the tool by W.
   Rayner[84] ^5 we checked SNPs for strand consistency, allele names,
   position, Ref/Alt assignments and minor allele frequency (MAF) in
   comparison to the reference panel. In order to increase the available
   genetic information, we imputed our data using Sanger Imputation
   Server[85] ^6 exploiting Eagle2 for phasing ([86]Loh et al., 2016) and
   Positional Burrows–Wheeler Transform ([87]Durbin, 2014), considering
   Haplotype Reference Consortium version 1.1 ([88]McCarthy et al., 2016)
   as reference panel. As a postimputation quality control, we removed
   variants with info quality level < 0.6. Genotype calls with posterior
   probability < 0.9 were set to missing. Post-QC imputed data was used to
   estimate gene expression regulation across the different samples.

   In order to predict the genetic component of gene expression, we used
   PrediXcan that evaluates the aggregate effects of cis-regulatory
   variants (within 1MB upstream or downstream of genes of interest) on
   gene expression via an elastic net regression method ([89]Gamazon et
   al., 2015). PediXcan needs a reference dataset in which both genome
   variation and gene expression levels have been measured to build
   prediction models for gene expression. We exploited already available
   models trained on GTEX data[90] ^7 to impute tissues specific
   transcriptomic profiles in a total of 42 tissues (we excluded sex
   specific tissues, e.g., prostate, ovary, etc.). The imputed
   transcriptomic profiles were subsequently analyzed using different
   machine learning approaches ([91] Figure 1 ). On the one hand,
   unsupervised machine learning methods were used to analyze data
   structure, on the other hand, supervised methods were used to test for
   the presence of “signal” compared to AD related phenotypes.

Figure 1.

   [92]Figure 1
   [93]Open in a new tab

   Framework of integrative analysis of multi-tissues expression profiles.
   Starting from genotyping data (m individuals per n variants) we imputed
   tissues specific transcriptomic profiles (for any tissue T [i], where i
   = 1‚…‚ k) by means of cis-eQTL PrediXcan models trained on GTEx data.
   Variational autoencoder followed by support vector machine (SVM) latent
   dimension-tissue match on the imputed gene expression matrices (m
   individuals per z genes) is used as a feature selection to identify the
   most relevant genes per tissue (T [i] = gene [1]‚…‚ gene [s] where i is
   the i [th] tissue and s in the number of prioritized genes) to provide
   as input of the recurrent neural network classifier.

Gene Prioritization

   Gene prioritization was performed considering as input the predicted
   transcriptomic matrices from ADNI1-GWAS (excluding sex-specific
   tissues) resulting in a total of 42 tissues with 808 samples each (42 ×
   808 = 33, 936 samples overall). We performed an independent analysis
   involving 528 “cases”, that included people affected by dementia and/or
   with cognitive dysfunction (AD and MCI) for a total of 528 × 42 = 22,
   176 input data, and 280 controls including healthy subjects for a total
   of 280 × 42 = 11, 760 input data. Each sample was comprised of 24, 203
   genes in total.

   To identify relevant genes we used variational autoencoders (VAEs) with
   a single hidden layer with a dimension of 42 units, hence matching the
   number of tissues. We adapted the code publicly provided by [94]Way and
   Greene (2017b) to implement our VAE’s architecture. In the encoding
   phase, the network inputs are the original dataset features
   representation
   [MATH: <mover accent="true"><mi>x</mi><mo>→</mo></mover> :MATH]
   . These are transformed by means of non-linear activation functions in
   a hidden representation that we denominated
   [MATH: <mover accent="true"><mi>z</mi><mo>→</mo></mover> :MATH]
   and that we assume being characterized by a Gaussian probability
   density function. In this phase the 2 latent representations of μ and λ
   of the distribution are learned.

   The second part of the architecture that we denoted as the decoder is
   again built as a neural network. The input this time is the vector
   [MATH: <mover accent="true"><mi>z</mi><mo>→</mo></mover> :MATH]
   i.e. the latent stochastic representation of the input dataset and the
   output will be the reconstructed representation
   [MATH: <mover
   accent="true"><msup><mi>x</mi><mo>′</mo></msup><mo>→</mo></mover>
   :MATH]
   of the original input vector
   [MATH: <mover accent="true"><mi>x</mi><mo>→</mo></mover> :MATH]
   . A representation of the VAE architecture can be seen in [95]Figure 1
   . The loss function of the VAE consists of two parts: the first part
   being the reconstruction loss (negative log-likelihood) and the second
   part being the function expressing the Kullback–Leibler (KL) divergence
   considering the learned hidden distribution and a priori Gaussian
   distribution ([96]Wetzel, 2017).

   The first term of the loss function is considered over the encoder
   distribution of the hidden representation and it “encourages” the
   decoding phase to correctly reconstruct the input data ([97]Altosaar,
   2019). KL divergence is used to enforce the similarity between the
   distribution of the latent representation and the normal distribution.

   We used separate VAEs to encode the gene expression of the cases and
   healthy classes. Original data include positive (upregulated genes) and
   negative values (downregulated genes). In order to compute VAE
   analysis, input data have been scaled between 0 and 1. Noteworthy,
   different genes can be present in different tissues while VAE pipeline
   requires an equal number of gene as input, thus NaN (non-existent/Not a
   Number) values during VAE input preprocessing were set to 0. The input
   samples were randomly split in training (80%) and test sets (20%) using
   a stratified approach to maintain the same proportion of samples per
   tissue. We used the Adam optimizer ([98]Kingma and Ba, 2014) with a
   learning rate of 0:001 over 75 epochs over the data, rectified linear
   units during the encoding stage, sigmoid activation during the decoding
   stage, batch size of 500, and warmup (ĸ) of 1. Hyperparameters were
   manually selected using a VAE that was not used further in the
   analysis, to achieve optimal reconstruction performance without
   overfitting. The entire autoencoding procedure was repeated 75 times
   separately for the healthy and AD classes in order to study the
   repeatability of results.

   The main goal of the unsupervised analysis was to identify the up or
   down-regulation of certain genes in specific tissue types in cases and
   healthy samples. We used a two-step procedure to achieve this
   association: we identified the tissue(s) encoded in each latent
   dimension unit of the VAE models, and then we identified the genes most
   strongly connected to the given latent dimension unit.

   In order to identify the tissue(s) encoded in each latent dimension, we
   used the activations of the hidden layer in the VAEs as an input
   feature to 42 binary Support Vector Machine (SVM) classifiers, one for
   each tissue. We trained each SVM classifier to predict whether the
   input sample to the VAE belonged to a specific tissue relying on the
   activation value of a single unit from the embedded latent dimension of
   the VAE. We repeated this tissue-latent unit association procedure for
   each tissue and each unit in the hidden VAE layer. We performed a
   5-fold stratified cross-validation using a linear SVM (C = 1 with class
   weight balance), thus running a total of 5 × 42 × 42 SVM classifiers
   for each VAE (a 5-fold cross validation procedure, for 42 binary
   classifiers, for each one of the 42 hidden layer’s unit). We considered
   a given latent VAE unit to be predictive of a specific tissue type,
   hence associated with it, if the F1 score was greater than 0.8. We
   found that some hidden units encode more than 1 tissue type.

   It is noteworthy to mention that we tried other unsuccessful
   approaches. Firstly, we tried to use a single VAE with both cases and
   controls, trying to find subclusters besides the tissues which cluster
   very well (see [99]Figure 2 ) in the VAE’s latent dimension as well as
   in the original data. We also tried to use a single VAE for each tissue
   in separate. No obvious structures were found when trying to match the
   results of t-SNE algorithm with all the available phenotypes, including
   case/control status. Filtering the input for genes within each tissue
   that show nominal significance for case/control status using standard
   simple univariate tests did not improve the results. Filtering genes
   with R ^2 > 0.15 of expression prediction using the same threshold as
   in Hohman et al.ʹs work ([100]Hohman et al., 2017) did not improve the
   results as well. In order to understand the features important for
   classification, we also implemented a saliency map approach. This
   method is able to detect where the attention of the network (VAEs in
   our study) is focused ([101]Itti et al., 1998), which can be seen as a
   sensitivity analysis approach. Saliency maps are generally applied in
   computer vision but, they can be used in other areas. In our case, the
   maps were computed on the encoder part of the VAEs and the information
   extracted is the importance of each gene in the analysis, which is
   coded as an rgb color code. From this analysis we were not able to
   identify significative patterns in the input data.

Figure 2.

   [102]Figure 2
   [103]Open in a new tab

   t-SNE embedding of tissue genes, run using the 42 activations on the
   latent dimension of a VAE to check the embedded structure of all
   samples. It is obvious that the latent activations are encoding
   information about each tissue.

   Considering the VAE used in this work, the association of the genes
   with the latent dimension units can be performed solely relying on the
   magnitude of the corresponding network weights. Given that each VAE has
   a single hidden layer, each latent dimension unit is connected directly
   to every output unit, i.e. reconstructed gene, via a linear
   transformation. Since each reconstructed gene is a summation of the
   weighted contribution of each latent unit, we could rank the relative
   importance of the units in the hidden layer relying on the magnitude of
   the weights. Thus, we selected the 100 most positive and 100 most
   negative weights for each latent unit encoding a given tissue. This
   resulted in a set of 100 upregulated and 100 downregulated genes,
   respectively for each of the trained VAEs. The entire association
   procedure was performed for the 75 VAEs from healthy and AD samples. We
   counted the total number of times a given gene was considered up or
   downregulated by our association procedure and kept it if it appeared
   more than three times overall. As a result, we produced a list of up or
   down regulated genes associated with each of the 42 types of tissues.
   We used this list as an input for pathway enrichment analysis.

   In order to perform enrichment analysis, we used Fast Gene Set
   Enrichment Analysis (FGSEA), a tool developed by Sergushichev et al.
   ([104]Sergushichev, 2016). The approach implemented by FGSEA deals with
   quantitative data having inherently directionality like gene
   expression. The model is based on gene statistic array S = Si‚…Sn where
   N is the number of samples and Si > 0 represent over-expression of gene
   i while Si < 0 represent down-expression. The absolute value of Si
   represents a magnitude of the change. The list of gene sets P of length
   m usually contains groups of genes that are commonly regulated in
   certain biological process. To quantify a co-regulation of genes in a
   gene set p [105]Subramanian et al. (2005) introduced a gene set
   enrichment score function sr(p) that uses gene rankings (values of S).
   Given a gene set p the more positive is the value of sr(p) the more
   enriched the gene set is in positively-regulated genes g with Sg > 0,
   accordingly, negative sr(p) corresponds to enrichment of negatively
   regulated genes. To deal with multiple-comparison issues an empirical
   p-value is computed by randomly sampling gene sets of the same size of
   p.

   The lists of downregulated and upregulated genes per tissue (referred
   as List-unsupervised) have been considered also as a feature selection
   step to build prediction models. We also tested other approaches to
   identify the most relevant genes as considering: I) nominal
   significantly associated genes from logistic association test between
   predicted gene expression levels and phenotype status (referred as
   List-PrediXcan), II) nominal associated genes derived by the
   combination of single tissue-trait association using generalized
   Berk–Jones test (referred as List-UTMOST) obtained with UTMOST tool
   ([106]Hu et al., 2019).

Phenotype Prediction Models From Imputed Transcriptomic Matrices

   Several supervised analysis techniques were tested in order to
   understand which one could achieve better results in identifying cases
   and controls from the transcriptomic profiles: Logistic Regression
   (LR), Support Vector Machine (SVM), Random Forest (RF) and Deep
   Learning networks. The latter are known to achieve better results
   compared to other machine learning methods, especially when the
   relationships between the observed features is not supposed to be
   linear ([107]LeCun et al., 2015).

   Since we imputed data according to specific tissues, we searched the
   model that would perform better among them. For this reason, we
   randomly selected 6 of the 42 tissues (Adipose Subcutaneous, Artery
   Aorta, Brain Spinal, Colon Transverse, Thyroid, Whole Blood) and
   trained the models on 600 of the 808 samples from ADNI1-GWAS,
   considering that the dataset is slightly unbalanced, as it contains
   more AD samples (528) than controls (280). SVM, RF and LR were not
   capable of learning how to classify cases and controls, since they
   assigned the samples only to the majority class. Concerning Deep
   Learning, the first accomplishment was understanding the appropriate
   architecture to elaborate transcriptomic data: we tested two Dense
   Neural Networks (DNN), two CNNs and an RNN.

   The first DNN (DNN-1) consisted of 6 layers with respectively 800, 500,
   400, 200, 40 and 2 nodes (called neurons). The second DNN (DNN-2)
   tested consisted of only three layers with 800, 200 and 2 neurons. The
   first CNN (CNN-1) had 6 layers: a convolutional layer of 10 filters, a
   convolutional layer of 5 filters after which a dropout regularization
   was applied, another convolutional layer of 5 filters, a dense layer of
   200 neurons with a dropout, and two dense layers of 100 and 2 neurons
   in the end. The second CNN (CNN-2) was a pure convolutional network of
   two convolutional layers of 10 and 5 filters, with a dropout
   regularization applied between them, and a dense layer with 2 neurons
   as the output layer. The RNN had 3 layers: two Long Short-Term Memory
   cells (LSTM) with output dimension of 30 and 20 and a final dense layer
   of 2 neurons.

   Looking at the preliminary training results ([108] Table 3 ) we decided
   to select and optimize the RNN, manually searching the optimal
   network’s size and then identifying the hyperparameters with the Grid
   Search algorithm (batch size = 100, epochs = 100). The final
   architecture consisted of the input and output layers and two hidden
   LSTM layers of 150 and 10 output dimensions. After every hidden layer a
   batch normalization was applied to maintain the mean activation close
   to 0 and the activation standard deviation close to 1. The input layer
   dimension was equal to the number of genes characterizing the tissue
   transcriptomic profile, while the output layer was a dense layer of
   dimension two to make possible the classification of the samples in AD
   and not-AD.

Table 3.

   Preliminary networks training performance on six sample tissues:
   accuracy (Acc) and area under the curve (AUC).
   Network Adipose subcutaneous Artery aorta Brain spinal Colon transverse
   Thyroid Whole blood
   Acc AUC Acc AUC Acc AUC Acc AUC Acc AUC Acc AUC
   DNN-1 37.50 0.513 37.33 0.503 64.00 0.538 87.67 0.862 64.50 0.503 39.83
   0.516
   DNN-2 64.50 0.5 64.50 0.5 90.17 0.892 64.50 0.5 64.50 0.5 64.50 0.5
   CNN-1 63.00 0.5 76.92 0.721 77.50 0.901 78.50 0.770 64.08 0.5 0.491
   CNN-2 95.83 0.948 64.50 0.5 94.83 0.943 64.50 0.5 96.00 0.95 95.67
   0.947
   RNN 96.17 0.953 95.67 0.951 94.67 0.942 95.33 0.946 95.33 0.946 94.67
   0.939
   [109]Open in a new tab

   Considering all the 42 tissues, we had the chance to perform two types
   of analysis: a tissue-specific analysis and a cross-tissue analysis. In
   the tissue-specific analysis, we trained models on transcriptomic data
   specific for each tissue. Therefore, we implemented predictive models
   that could impute the case/control condition on new transcriptomic data
   related to the same tissue. The input dimensions of the networks were
   in the order of thousands, but different for every tissue: the minimum
   was 2,041 characterizing the Brain Substantia Nigra tissue, and the
   maximum was reached by the Thyroid tissue with 9,655.

   The aim of the cross-tissue analysis was, on the other hand, to observe
   the similarities between tissues in relationship with the Alzheimer’s
   disease. Models were trained on each single tissue, taking as input the
   genes shared by all the 42 tissue transcriptomic profiles (24, 203).
   The column reporting the information for a gene was filled with zeros
   if it was not possible to impute the transcriptomic profile of that
   gene in a specific tissue. Comparing the maximum number of genes
   imputed for the tissues and the total number of genes identified in all
   the analysis, it was clear that the new arranged matrices of 24, 203
   genes for 808 samples were particularly sparse. The models were then
   used to impute the case/control condition on tissues different to the
   one used for the training.

   Both in single tissue and cross-tissue analyses all the models were
   trained on 600 samples from ADNI1-GWAS and the tests were performed on
   the remaining 208 samples. The network architecture was in all cases
   the one in [110]Figure 1 , adjusting the input dimension according to
   the different analysis. A 10-fold cross validation was applied and
   models compiled with the Adam optimizer and the binary cross-entropy as
   the optimization score function. The monitored scores were the
   accuracy, area under the curve (AUC), precision, recall, and F1. The
   saliency map was applied in the first LSTM layer, therefore we could
   observe if some samples were more informative than other for the
   classification purpose. Keras[111] ^8 and Scikit-learn[112] ^9 Python
   libraries were used, built on top of TensorFlow[113] ^10 to implement
   the networks.

   We then worked on features selection to find groups of genes that were
   likely to improve the model performance regarding the samples partition
   in case/control, both in the single-tissue and cross-tissues
   approaches. The identification of such groups in single-tissue analysis
   can bring to the determination of tissue specific markers, on the other
   hand in the cross-tissues section we could focus on the set of genes
   that explained the relationship between tissues. We used three
   different filter lists: List-unsupervised, List-PrediXcan and
   List-UTMOST (see [114]Supplementary Materials Section 3 ). Using these
   lists the input dimensions for all the tissues decreased: the number of
   unique genes identified by the List-unsupervised was 2,016, 4,984 with
   List-PrediXcan. List-UTMOST (649 genes) was used only in the
   cross-tissue analysis as it doesn’t provide tissue-specific
   information.

   All the steps described above (except the architecture selection and
   saliency map) were also performed considering Cognitive Decline over
   time rather than diagnosis at screening. This dataset consisted of 528
   samples (some samples did not have this information), 281 controls and
   247 cases. Cognitive Decline has been calculated by considering the
   difference between the Mini-Mental State Examination (MMSE) score 4
   years after recruitment and the MMSE score at recruitment. Then,
   regardless of the original recruitment diagnosis, we classified the
   samples into two groups: one group showing no cognitive decline
   (difference equal or greater than 0) and the other showing a cognitive
   decline (difference minor than 0). The genes imputed for each tissue
   were therefore the same in ADNI1-GWAS dataset and Cognitive Decline
   dataset. To consider the effect of AD related variables, we also
   performed the same analyses by stratifying by sex and early/late onset
   for dementia and AD [using 65 years of age as a cutoff ([115]Roberts
   and Petersen, 2014)] as well as for carrier and noncarrier of APOE ∊4
   isoform.

Results

   We predicted the genetic component of gene expression across 42
   non-sex-specific tissues for all the samples included in ADNI1-GWAS
   dataset. We exploited tissues specific eQTL models available on
   precictDB[116] ^11 and used PrediXcan tool[117] ^12 to derive tissue
   specific matrices representing individual levels of the genetic
   component of gene expression. The gene levels obtained by these sample
   matrices represent transcriptomic profiles based on eQTL across tissues
   in the analyzed dataset.

   In the present work the matrices of imputed expression were analyzed
   using several machine learning strategies to identify potential tissue
   specific transcriptomic profiles associated with cognitive decline in
   Alzheimer’s.

Gene-Based Results Per Tissue

   We runned t-SNE ([118]Maaten and Hinton, 2008) using the 42 activations
   on each latent dimension of a VAE to check the embedded structure of
   all samples, whose result can be seen in [119]Figure 2 . Although
   interpretations of Euclidean distances between points in a t-SNE plot
   is not straightforward ([120]Wattenberg et al., 2016), it is clear from
   the clusters that information about tissues are being encoded. Indeed,
   we were able to identify associations between latent dimensions of VAE
   and tissue.

   The evaluation of the weights associated with the latent dimension (see
   Methods) allow us to rank gene importance per tissue considering
   case/control status. [121]Table 1 shows the most upregulated and
   downregulated genes from Brain Nucleus. Check [122]Supplementary Table
   S1 for complete information over all 42 tissues.

Table 1.

   Most upregulated and downregulated genes from the brain nucleus.
                       Downregulated               Upregulated
                      AD-MCI         CTR      AD-MCI           CTR
   Brain nucleus ENSG00000230850.3 ABHD14A ENSAP2        [123]AL356475.1
                 GMPR2             ATP2B4  KLF1          F2
                 C1QC              BDKRB2  EEF1A1P19     NRIP2
                 SUN3              C1QC    RP5-1068B5.3  RP11-704J17.5
                 RP11-662J14.1     PXN     RP11-321A17.3 RP11-321A17.3
   [124]Open in a new tab

   The saliency map implementation returned not useful information. If
   taken individually, genes don’t have much impact: it is evident also
   with this result that the AD phenotype is due to a combination of many
   genes and environmental factors.

   In order to investigate the presence of specific gene expression
   regulation associated with case/control status we considered the lists
   of tissue-specific up and down regulated genes derived by VAE analysis.
   Additionally, for each tissue we considered the genes that were
   differentially regulated in cases but not in controls, that is
   representing a disease-specific signature. The enrichment analysis have
   been performed considering Gene ontology[125] ^13 , KEGG[126] ^14 and
   reactome[127] ^15 and pathway databases ([128]Croft et al., 2013;
   [129]Kanehisa et al., 2016). Complete enrichment analysis results are
   available as supplementary files (see [130]Supplementary Materials
   Section 1 ) while significant enrichment tissues specific pathways
   after FDR correction are shown in [131]Table 2 .

Table 2.

   Significant tissue-pathways enrichment analysis using Reactome
   database.
   Tissue Pathway pval padj ES NES Genes
   Colon sigmoid Immune system 3.8E–04 1.2E–02 –5.4E–01 –2.3E+00 CAP1
   RASGRP4
   RASGRP4
   YES1
   SIGLEC8
   CD47
   SELL
   CALM1 FBXO21
   CLEC7A
   CLEC7A
   SEC61A1
   IL13
   HLA-DPB1
   KIF11
   Brain nucleus Generic transcription pathway 3.0E–03 1.8E–02 7.2E–01
   2.1E+00 ZNF688
   ZKSCAN8
   ZNF445 RRAGC
   ZNF697
   CASP6
   Brain nucleus RNA polymerase II transcription 3.0E–03 1.8E–02 7.2E–01
   2.1E+00 ZNF688
   ZKSCAN8
   ZNF445 RRAGC
   ZNF697
   CASP6
   Brain nucleus Gene expression (transcription) 3.0E–03 1.8E–02 7.2E–01
   2.1E+00 ZNF688
   ZKSCAN8
   ZNF445 RRAGC
   ZNF697
   CASP6
   Colon sigmoid Adaptive immune system 3.0Ev03 3.3E–02 –6.1E–01 –2.1E+00
   FBXO21
   SEC61A1
   HLA-DPB1
   KIF11 YES1
   SIGLEC8
   SELL
   CALM1
   Colon sigmoid Innate immune system 2.5E–03 3.3E–02 –6.5E–01 –2.0E+00
   CAP1
   CLEC7A
   CD47
   CALM1 RASGRP4
   YES1
   SELL
   [132]Open in a new tab

   Interestingly enrichment analysis shows the presence of tissue specific
   signal in a specific brain tissue (i.e., brain nucleus) concerning
   pathways involved in gene expression regulation and in immune-related
   pathways in colon ([133] Figure S2 ). The most significant alterations
   in brain pathways concern the brain nucleus accumbens (basal ganglia)
   region. Interestingly, this region has been found to be associated with
   AD ([134]Nie et al., 2017; [135]Nobili et al., 2017; [136]Li et al.,
   2018). Instead, the detected downregulation of immune system pathways
   in cases in comparison to controls could indicate a higher level of
   inflammation in dementia. This is in line with the association observed
   between inflammatory bowel diseases and AD ([137]McCaulley and Grush,
   2015; [138]Sochocka et al., 2019). Given the pivotal role of APOE
   ([139]Liu et al., 2013) in AD a specific evaluation was performed to
   evaluate the effect of APOE related genes.

   APOE gene expression is not predicted by gene expression imputation
   GTEx based models, due to the absence of eQTL explaining a relevant
   fraction of APOE expression level. However, AD susceptibility due to
   APOE isoforms (∫2, ∫3 and ∫4), which are well known to confer a
   different risk for AD depending on the presence of missense coding
   variants, are associated with APOE gene functionality and can be
   independent from the genetic component of gene expression regulation.
   We investigated if other genes directly interacting with APOE,
   according to string functional database[140] ^16 , have a significant
   association in our analysis (see [141]Supplementary Materials Section
   S3 ).

   One of the 11 genes identified, namely APOC2 ([142]Shao et al., 2018),
   is among the top differentially regulated genes from variational
   autoencoder gene prioritization list in brain putamen, an area of the
   brain associated with AD ([143]Coupé et al., 2019). Interestingly, the
   same gene is also the only one (among the 11 APOE interacting genes)
   significantly associated with AD according to a transcription wide
   association analysis performed according to a GWAS on AD in UK Biobank
   dataset ([144]Marioni et al., 2018) and public available on TWAS
   hub[145] ^17 . This suggests a potential role for APOC2 associated with
   the gene expression regulation and, interestingly, a recent work showed
   that the methylation profile in such a gene (which in turn affect gene
   expression) is associated with AD ([146]Shao et al., 2018).

Tissue-Specific and Cross-Tissues Classification

   To understand which network performs better on different tissues, we
   tested five models on six sample tissues. In [147]Table 3 , accuracy
   and AUC obtained during their preliminary 10 cross-validation training
   on 600 of 808 samples are reported: although all methods could perform
   well at least on one tissue during the training, in that phase only the
   RNN was capable of reaching an accuracy higher than 90% for all of
   them. Therefore we decided to optimize the RNN and obtained the network
   structure described in Phenotype Prediction Models From Imputed
   Transcriptomic Matrices, which was then applied for the single-tissue
   and cross-tissue analysis on ADNI1-GWAS and Cognitive Decline dataset.

   Without the feature selection, we observed a great performance during
   the training in terms of AUC, accuracy, precision, recall and F1 scores
   (see [148]Supplementary Materials Section 2 ) on both datasets. On test
   set (composed of 208 samples for tissue for ADNI1-GWAS and 128 for
   Cognitive Decline) the metrics reached values below expectations, with
   AUCs near 0:5 especially for ADNI1-GWAS.

   On ADNI1-GWAS ([149] Figure 3 ), models trained for single-tissue
   analysis improved their AUCs thanks to the List-unsupervised and
   List-PrediXcan feature selection: when the AUCs were below 0:5, the
   filters application returned a score above the threshold for at least
   one list. We did not observe a major impact of a list in this phase but
   the t-test confirmed a significant improvement compared to the no
   filter approach (p-value = 0.001474 for List-unsupervised and p-value =
   2.693e – 06 for List-PrediXcan). Models trained for the cross-tissue
   analysis instead had a less evident improvement with the lists filter:
   only the List unsupervised returned a slightly significant improvement
   (p-value = 0.04084). List-UTMOST did not give any improvement and, as
   we could not use it on single-tissue models, we decided not to further
   analyze it.

Figure 3.

   [150]Figure 3
   [151]Open in a new tab

   ADNI1-GWAS feature selection evaluation. The single-tissue models (top
   panel) significantly improved their ability to classify case/control
   condition thanks to both List-unsupervised (blue) and List-PrediXcan
   (red) compared to the no filtering approach (black). On cross-tissue
   models (bottom panel), where there is also the performance with the
   List-UTMOST (green), the improvement was less evident.

   Cognitive Decline models performed better than ADNI1-GWAS, both in
   single-tissue and cross-tissue analysis ([152] Figure 4 ). The lists
   application on Cognitive Decline models also led to an improvement for
   tissues with borderline or below the threshold performance ([153]
   Figure S5 ), reaching AUCs between 0:51 and 0:6. On cross-tissue models
   we obtained a significant p-value = 0.008766 for List-unsupervised and
   p-value = 0.04346 for List-PrediXcan.

Figure 4.

   [154]Figure 4
   [155]Open in a new tab

   ADNI1-GWAS and Cognitive Decline comparison: Cognitive Decline (red
   boxes) returns higher AUCs on test sets than ADNI1-GWAS (blue boxes)
   both in cross-tissue models (left) and in single-tissue models (right).

   Comparing the two lists on ADNI1-GWAS, List-unsupervised showed the
   bigger improvement on cross-tissue models: the t-test returned a
   p-value of 0:009123, but on single-tissue the difference was not
   significant. Also on Cognitive Decline we observed a slightly major
   impact of List-unsupervised both for the single-tissue and cross-tissue
   models. In [156]Figure 5 , a focus on the improvement achieved with the
   filter on the Brain tissue is shown in both datasets, in [157]Figure S4
   the evaluation for all tissues is shown.

Figure 5.

   [158]Figure 5
   [159]Open in a new tab

   Brain tissues analysis. In green the AUCs on test sets for the no
   filter application are reported, in red for List-unsupervised and in
   yellow for List-PrediXcan. The top two panels report respectively the
   cross-tissue and single-tissue models performance on ADNI1-GWAS
   dataset, the third and fourth panels on Cognitive Decline. In both
   datasets, feature filtering improved the classification in almost all
   the Brain tissues.

   [160]Figure 6 reports, by columns, the AUC achieved by ADNI1-GWAS
   cross-tissue models when they were applied on other tissues from the
   same dataset. The top heatmap describes the relationships between
   tissue when no filter is applied: we could observe that models trained
   on Brain tissues, if they were able to correctly identify the AD
   subjects on a non-Brain tissue, they could do the same on all the other
   non-Brain tissues. Instead, models trained on non-Brain tissue could
   identify AD-MCI/CTRL subjects only on a subset of tissues. We performed
   the same analysis on ADNI1-GWAS models filtered by List-PrediXcan and
   List-unsupervised, respectively the middle and bottom heatmaps of
   [161]Figure 6 : List-unsupervised removed all the information of
   cross-tissue relationships, when instead List-PrediXcan mitigate them,
   pointing out the non-Brain models relationships.

Figure 6.

   [162]Figure 6
   [163]Open in a new tab

   ADNI1-GWAS cross-tissues performance. By column we can observe how much
   a model trained on a tissue is able to recognize without mistakes (AUC)
   AD/non-AD subjects from data related to different tissues. On the
   diagonal for each tissue the AUC obtained for that model during the
   training is reported. The top panel reports the cross-tissue results
   without any filter application, the middle panel when using
   List-PrediXcan and the bottom using List-unsupervised.

   We also tested the stratification for sex, age, APOE effect, and AD
   condition on ADNI1-GWAS dataset for single-tissue and cross-tissue
   analysis. It returned no considerable variation in the performance. The
   saliency map application was also not informative: each sample has the
   same importance. Lastly, we performed the filter analyses on Cognitive
   Decline, pointing out the same results ([164] Figure S6 ).

Discussion

   In the present work we dissected the tissue specific genetic component
   of gene expression in association to AD related cognitive decline. Our
   analysis consisted on the imputation of tissue specific gene expression
   profiles by using a TWAS-like approach ([165]Mancuso et al., 2017).
   However, contrary to the standard TWAS analysis, we did not
   specifically focus on univariate analysis (e.g., gene association based
   on logistic or linear regression). Instead, we dissected individual
   transcriptomic levels using different machine learning approaches. We
   believe that our approach can be of particular interest since is
   capable of capturing data structure and non-linear behaviour in the
   system. In fact, it is well known that gene expression levels are not
   independent, since many genes are actually correlated in terms of
   regulation ([166]Michalopoulos et al., 2012) and functionality, which
   means that also epistatic interactions can play a major role in the
   regulation of biochemical pathways ([167]Sameith et al., 2015).

   Interestingly, we observed that a combination of unsupervised and
   supervised machine learning methods on matrices of predicted expression
   provided complementary information that can be integrated in order to
   get new insights in gene expression regulation. On one hand, the VAE
   combined with enrichment analysis suggests the presence of a specific
   biochemical pathways alteration in dementia occurring in a specific
   brain area and in the gut. The identified alteration occur in brain
   nucleus, a brain region found to be associated with AD by several
   studies ([168]Cho et al., 2014; [169]Wang et al., 2014; [170]Kuhn et
   al., 2015; [171]Liu et al., 2015).

   This alteration seems to be related to the regulation of gene
   expression and 436 therefore possibly associated to tissue-specific
   pathways regulation. Instead, the enriched pathways in gut are related
   to immune systems and noteworthy, it is well established that immune
   system dysfunctions can lead to a greater increase of inflammation in
   AD ([172]Serpente et al., 2014; [173]Heppner et al., 2015; [174]Le Page
   et al., 2018). These results suggest that our analytical approach can
   identify relevant biological alterations occurring in AD. Noteworthy,
   enrichment analysis identified alteration in biological pathway
   specifically in a brain area and gut, which is in line with the
   presence of a gut-brain axis dysfunction in AD. Indeed, several
   researchers pointed out that brain-gut axis can be associated with many
   neurological disorders ([175]Giau et al., 2018; [176]La Rosa et al.,
   2018).

   In the present work, APOE genotype has not been directly included as
   covariate in prediction models since our aim was to identify other
   genetic factors that can explain part of the missing heritability on
   the established polygenic component in AD ([177]Escott-Price et al.,
   2017; [178]Tosto et al., 2017). However, APOE is expected to be by far
   the most influencing risk factor for late onset AD. Though estimation
   of APOE contribution on the heritability component of AD is still not
   well defined, ranging from 10% to 28% of the overall genetic
   heritability ([179]Van Cauwenberghe et al., 2016; [180]Stocker et al.,
   2018). Moreover, in the present work, gene-expression derived genetic
   signals neglect not-eQTL effects and therefore we have limited
   analytical power. This justifies the relatively low AUC values in
   comparison to other prediction models in AD, including the complete
   genome-wide polygenic signal and using APOE as a covariate
   ([181]Escott-Price et al., 2017; [182]Tosto et al., 2017). Our aim was
   indeed to test whether or not there is a genetic signal associated with
   AD that could be apportioned to tissue specific gene-expression
   regulation rather than identify a prediction model. It is also known
   that genetics is just one of the component involved in AD
   susceptibility and therefore the use of multimodal data (e.g., imaging
   data, clinical features, metabolomic, and environmental factors) should
   be taken into account in order to build a reliable classifier in term
   of translational application ([183]Sapkota et al., 2018). Despite that,
   our classification models were still capable of finding a signal
   between cases and controls (overall AUC > 0:5) suggesting that part of
   the genetic signal in AD related dementia can be associated with
   tissue-specific gene expression regulation. Moreover, we observed that
   feature selection can play a major role in the performance of deep
   learning networks classification.

   We are aware that our work presents some limitations. We performed a
   genetic association with dementia by considering ADNI data evaluating
   the solely genetic component of gene expression, which neglects other
   potential genetics effect not related to gene-expression regulation.
   Our models are also limited by the current version of GTEx data, which
   has a relatively small size, therefore it is expected that over time
   new models will optimize eQTL estimation leading to more precise
   analyses of the genetic component of gene expression. We also focused
   on non-sex specific tissues, since we wanted to study general potential
   alterations not involving sex-specific organs, but this could also be a
   limitation given the different prevalence of AD in females and males
   ([184]Mazure and Swendsen, 2016).

Conclusion

   In the present work, we performed an analysis of the predicted genetic
   component of gene expression in ADNI1-GWAS dataset in association with
   AD cognitive decline. We dissected the predicted tissue specific gene
   expression by means of different supervised and unsupervised machine
   learning approaches. Our results suggest that a framework including
   unsupervised and supervised methods in data-analysis can provide
   complementary information and thus leading to better insights into the
   underlying system.

   In particular, variational autoencoder pre-processing of input data
   proved to be efficient for features selection prior to the
   implementation of deep learning classification models. However, the
   limited AUC prediction performance of the developed models suggests
   that the evaluation of the solely genetic component of gene expression
   by exploiting up to date available GTEx models is currently
   under-powered in comparison to genome-wide polygenic risk score
   modeling.

   This is not surprising since we are neglecting the effect of non-eQTL
   variants. On the other hand, we can disclose tissue specific effects
   and reveal potential biological mechanisms associated with a given
   phenotype. In this regard, our analysis showed that brain tissues are
   more associated with dementia status and that inflammatory processes in
   brain-gut axis can play a role in AD.

Author’s Note

   Data used in preparing this article were obtained from the Alzheimer’s
   Disease Neuroimaging Initiative (ADNI) database
   ([185]adni.loni.usc.edu). As such, many investigators within the ADNI
   contributed to the design and implementation of ADNI and/or provided
   data but did not participate in analysis or writing of this report. A
   complete listing of ADNI investigators can be found at:
   [186]http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Ackn
   owledgement_List.pdf

Data Availability

   Publicly available datasets were analyzed in this study. This data can
   be found here: [187]http://adni.loni.usc.edu/. Supplementary data and
   the code used in this work is available at
   [188]https://github.com/imerelli/DeepNeuro.

Author Contributions

   CM, TA and VG equally contributed to the work. They conceived the idea
   and developed the algorithms. OB, GD, and SS contributed to data
   analysis. PL and IM supervised the whole study. All authors contributed
   to final revision of the manuscript.

Conflict of Interest Statement

   The authors declare that the research was conducted in the absence of
   any commercial or financial relationships that could be construed as a
   potential conflict of interest.

Acknowledgments