Abstract

   Cardiovascular disease (CVD) and venous thromboembolism (VTE) figure
   among the main causes of morbidity and mortality in modern societies.
   Although the two conditions are associated with distinct pathogenic
   mechanisms, epidemiological, experimental and clinical trial data
   suggest that the mechanisms responsible for arterial and venous
   thrombosis at least partially overlap. Herein we aimed to explore
   shared and discordant
   pathways involved in the pathogenesis of VTE and CVD at the
   transcriptomic level and to validate the results in independent
   cohorts. Five public datasets of gene expression data from VTE and CVD
   (myocardial infarction, peripheral arterial occlusive disease and
   stroke) patients were analyzed using an integrative bioinformatic
   strategy. A machine/statistical learning method was used to derive
   classifiers for the discrimination of VTE and CVD, which were then
   tested in independent datasets. Two sets of genes that were commonly
   (n = 472) or
   divergently (n = 124) expressed in CVD and VTE were identified. Genes
   and pathways associated with innate immune function were
   over-represented in both conditions, along with pathways associated
   with complement and hemostasis. Pathways associated with neutrophil
   activation and with IL-1 signaling were also enriched in CVD compared
   to VTE. The gene expression signature of VTE more closely resembled the
   pattern of cardioembolic stroke than the patterns of acute myocardial
   infarction, ischemic stroke and peripheral arterial occlusive disease.
   Classifiers derived from these gene lists accurately discriminated
   patients with VTE and CVD from independent cohorts. In conclusion, our
   results add a new set of transcriptomic data for future studies of
   the relationship between arterial and venous thrombosis.

Strengths and limitations of this study

     * Our results represent the first comparison of venous and arterial
       thrombosis at the transcriptomic level.
     * Our main result was the demonstration that immunothrombosis
       pathways are important to the pathophysiology of these conditions,
       also at the transcriptomic level.
     * A specific signature for venous and arterial thrombosis was
       described, and validated in independent cohorts.
     * The limited number of public repositories with gene expression data
       from patients with venous thromboembolism limits the representation
       of these patients in our analyses.
     * In order to gather a meaningful number of studies with gene
       expression data, we had to include patients at different time
       points after the index thrombotic event, which might have
       increased the heterogeneity of our population.

Introduction

   CVD is a generic term that encompasses conditions caused by arterial
   thrombosis such as myocardial infarction (MI), ischemic stroke (IS) and
   peripheral arterial occlusive disease (PAOD), with the first two
   representing the most frequent causes of years of life lost in most
   regions of the world [[30]1, [31]2]. Venous thromboembolism (VTE)
   encompasses deep vein thrombosis (DVT) and pulmonary embolism (PE),
   which together represent the third leading cause of vascular disease in
   the world [[32]3]. Although it has long been recognized that the
   pathogenesis of these two conditions is based on distinct cellular and
   molecular pathways, the existence of common pathogenic pathways
   contributing to both CVD and VTE is suggested by (i) the sharing of
   risk factors such as obesity, smoking, hypertriglyceridemia [[33]4];
   (ii) the epidemiological association between CVD and VTE illustrated by
   the higher prevalence of CVD in patients with VTE even years after the
   venous event [[34]5–[35]7]; (iii) the fact that some inflammatory
   diseases such as sickle cell disease and antiphospholipid syndrome
   (APS) increase the risk of both conditions [[36]8, [37]9]; and, (iv)
   more recently, the demonstration that treatment strategies classically
   used for CVD can also benefit patients with VTE [[38]10, [39]11], and
   vice versa [[40]12]. In this context, much remains to be learned about
   their shared and independent pathological mechanisms, whose
   characterization could contribute to the identification of new
   therapeutic targets for both VTE and CVD [[41]7, [42]13, [43]14].

   Three major frameworks have been used to address differences and
   similarities between CVD and VTE: (i) studies in animal models, (ii)
   histopathological analyses of thrombi, and (iii) epidemiological data.
   Studies in animal models identified proteins and cells that contribute
   to VTE or CVD [[44]2, [45]15–[46]17] allowing the development of
   important therapeutic targets for each condition. However, these
   studies have not focused on the relative contribution of these pathways
   to CVD, VTE or both conditions in human disease. While
   histopathological studies of human thrombi initially supported the
   classical paradigm of white (platelet-rich) or red (fibrin- and red
   blood cell-rich) thrombi in CVD and VTE respectively, these conclusions
   were later challenged by several studies showing a much more complex
   picture, as recently reviewed [[47]13]. Lastly, epidemiological studies
   have been instrumental in providing insights into the association of
   venous and arterial thrombosis, and clearly demonstrated that VTE and
   CVD are indeed associated conditions [[48]18, [49]19]. However, these
   studies have not yet been able to clearly define the mechanism of this
   association, whether causal (i.e. atherosclerosis leads to VTE) or
   driven by common pathogenic mechanisms [[50]7].

   In recent years, the availability of large databases of genomic data,
   along with bioinformatics and machine learning tools capable of
   performing integrative and functional analyses of these datasets, has
   enabled new strategies for research into the molecular and cellular
   pathogenesis of complex conditions. In particular, publicly available
   datasets from gene expression studies, originally generated to define
   specific disease signatures, can now be compared, grouped and
   meta-analyzed, allowing biases and artefacts to be canceled out between
   datasets, so that true relationships are more likely to stand out
   [[51]20–[52]26]. Herein we used a panel of bioinformatics and machine
   learning tools to explore the differences and similarities between VTE
   and CVD, thus contributing a new layer of data to our understanding of
   the common and divergent pathogenic mechanisms of two conditions of
   high epidemiological relevance.

Methods

Identification of eligible studies and datasets

   Gene expression datasets from microarray studies including human
   patients with CVD or VTE were searched for in the public repository
   Gene Expression Omnibus (GEO) [[53]27], maintained by the NCBI, up to
   May 2018. The search was conducted using the terms “venous
   thrombosis”, “venous thromboembolism”, “myocardial infarction”,
   “stroke”, “coronary ischemia”, “angina”, “atherosclerosis”,
   “peripheral arterial disease” or “thrombosis”. Datasets were included
   if they met all the following
   inclusion criteria: (1) microarray data obtained from human samples
   using the same microarray platform; (2) RNA source restricted to whole
   blood or populations of circulating blood cells; (3) studies including
   both affected patients and healthy controls, so that the differential
   expression of each gene was evaluated under the same experimental
   conditions; (4) availability of metadata allowing the separation of
   venous from arterial events; and (5) datasets from studies published in
   peer-reviewed journals. In the course of our study, we also restricted
   our analysis to studies using the same microarray platform, so as to
   limit heterogeneity.

Patient and public involvement

   No patients were involved.

Meta-analysis of gene expression studies

Pre-processing

   Microarray raw data were pre-processed using the Robust Multichip
   Average (RMA) method [[54]28] implemented in the oligo package
   [[55]29]. For each dataset, the algorithm first performs background
   correction, minimizing the effects of optical noise and non-specific
   binding on the estimation of relative gene expression. Quantile
   normalization is then applied, mitigating the effects of technical
   variables by estimating a common intensity distribution across
   samples. Finally, a median-polish step summarizes the multiple probe
   intensity measurements into a single log-expression value per
   probeset for the downstream meta-analysis. Using the biomaRt package
   [[56]30], we annotated
   the probesets with their respective Ensembl Gene IDs.
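
   The quantile normalization step described above can be illustrated
   with a minimal sketch. It is written in plain Python for illustration
   only and is not the oligo implementation used in the study; the
   function name and the toy intensity values are hypothetical, and ties
   are broken by position rather than averaged.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length intensity lists (one list per array).

    Illustrative only: ties are broken by position, which may differ from
    the tie handling of production RMA implementations.
    """
    n_probes = len(samples[0])
    # Sort each array, then average across arrays at each rank to obtain
    # the common reference distribution.
    sorted_cols = [sorted(col) for col in samples]
    reference = [sum(col[i] for col in sorted_cols) / len(samples)
                 for i in range(n_probes)]
    normalized = []
    for col in samples:
        # Order of probes within this array, from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            # Each probe receives the reference value for its rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities: three arrays, four probes each.
arrays = [[5.0, 2.0, 3.0, 4.0],
          [4.0, 1.0, 4.5, 2.0],
          [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
```

   After this step every array shares the same set of intensity values,
   while the ordering of probes within each array is preserved.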

Meta-analysis

   To perform the meta-analysis, expression data were organized by
   pre-defined class and study of origin. The meta-analysis was
   performed with the RankProd package [[57]31], which adapts the rank
   product method, originally designed for single-experiment analysis,
   to integrate data from multiple studies. It is a non-parametric
   method that detects genes consistently ranked as differentially
   expressed (DE) when patients are compared with healthy controls
   across datasets. One hundred permutations were performed to compute
   the p-value and the false discovery rate (FDR). The gene list was
   further filtered to include only genes that were up- or
   down-regulated in the same direction in all five studies, with an
   FDR < 0.05.
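
   The idea behind the rank product statistic and its permutation-based
   p-value can be sketched as follows. This is a simplified illustration
   in plain Python, not the RankProd package itself: it assumes one log
   fold change per gene per study as input, whereas the package derives
   ranks from the underlying expression data, and all names and values
   below are hypothetical.

```python
import math
import random


def rank_products(fold_changes):
    """Geometric mean rank of each gene across studies.

    fold_changes: list of studies, each a list of per-gene log fold changes.
    Rank 1 is assigned to the most up-regulated gene in a study.
    """
    n_genes = len(fold_changes[0])
    log_sum = [0.0] * n_genes
    for study in fold_changes:
        order = sorted(range(n_genes), key=lambda g: -study[g])
        for rank, gene in enumerate(order, start=1):
            log_sum[gene] += math.log(rank)
    return [math.exp(s / len(fold_changes)) for s in log_sum]


def permutation_pvalues(fold_changes, n_perm=100, seed=0):
    """Estimate p-values by shuffling gene labels within each study."""
    rng = random.Random(seed)
    observed = rank_products(fold_changes)
    n_genes = len(observed)
    counts = [0] * n_genes
    for _ in range(n_perm):
        shuffled = [rng.sample(study, len(study)) for study in fold_changes]
        null = rank_products(shuffled)
        for g in range(n_genes):
            # Count null rank products at least as extreme as the observed one.
            counts[g] += sum(1 for v in null if v <= observed[g] + 1e-12)
    return [c / (n_perm * n_genes) for c in counts]


# Hypothetical log fold changes: three studies, six genes.
studies = [[2.1, 0.3, -0.2, 0.1, -1.5, 0.0],
           [1.8, -0.1, 0.4, 0.2, -1.2, 0.1],
           [2.5, 0.2, 0.0, -0.3, -1.9, 0.3]]
rp = rank_products(studies)
pvals = permutation_pvalues(studies, n_perm=100)
```

   A gene that ranks near the top in every study obtains a small rank
   product and therefore a small permutation p-value; repeating the
   procedure with the ranking reversed identifies consistently
   down-regulated genes.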

Correlation analysis of gene expression levels in CVD and VTE

   The correlation between VTE and CVD in the expression levels of the
   genes identified in the meta-analysis was quantified using Pearson’s
   correlation coefficient and represented in graphical format. Unless
   otherwise stated, all analyses were performed in the statistical
   computing environment R version 3.4.4 [[58]32].
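
   For reference, Pearson’s coefficient for paired expression changes
   can be computed as in the sketch below. The study itself used R; this
   plain Python version is for illustration only, and the gene-level
   values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical log fold changes of the same five genes in VTE and in CVD.
vte_lfc = [0.9, 0.4, -0.3, 1.2, -0.8]
cvd_lfc = [0.7, 0.5, -0.1, 0.9, -1.0]
r = pearson_r(vte_lfc, cvd_lfc)
```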

Functional gene set analyses

   To facilitate the interpretation of the biological significance of the
   gene list identified by the meta-analysis, a functional gene set
   analysis (GSA) was performed using EnrichR, a web-based bioinformatics
   tool that includes several curated GSA libraries encompassing pathway
   enrichment analysis (e.g. KEGG, Reactome, and 18 other libraries) and
   gene ontologies (cellular component, biological process and molecular
   function), among others. From the list of enriched terms identified by
   EnrichR, only pathways that were (i) listed among the 10 most
   significant (based on the p-value) for each library, and (ii)
   identified in at least two libraries from the same category were
   considered. For gene ontology terms, the top 5 terms with an adjusted
   p-value < 0.0001 were included.
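
   The two pathway selection rules above can be expressed as a short
   filter. The sketch below is an assumption-laden illustration in plain
   Python: it presumes that enrichment results have already been
   exported per library as (term, p-value) pairs and that equivalent
   terms carry the same name across libraries, which in practice may
   require manual harmonization. The library names and values are
   hypothetical.

```python
from collections import defaultdict


def select_terms(results, top_n=10, min_libraries=2):
    """Keep terms ranked in the top_n of at least min_libraries libraries.

    results: {library_name: [(term, p_value), ...]} for libraries that
    belong to the same category (for example, pathway libraries).
    """
    hits = defaultdict(set)
    for library, terms in results.items():
        # Rule (i): the top_n most significant terms of this library.
        top = sorted(terms, key=lambda t: t[1])[:top_n]
        for term, _ in top:
            hits[term.lower()].add(library)
    # Rule (ii): the term must appear in enough libraries.
    return sorted(t for t, libs in hits.items() if len(libs) >= min_libraries)


# Hypothetical results; top_n is reduced to 2 to keep the example small.
toy = {
    "LibraryA": [("Complement cascade", 1e-6), ("Hemostasis", 1e-4),
                 ("Cell cycle", 0.2)],
    "LibraryB": [("hemostasis", 1e-5), ("Complement cascade", 1e-3),
                 ("Apoptosis", 0.01)],
    "LibraryC": [("Apoptosis", 1e-3), ("Cell cycle", 0.3)],
}
selected = select_terms(toy, top_n=2)
```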

Evaluation of genes with divergent expression between VTE and CVD

   A list of genes with divergent expression between VTE and three CVD
   datasets (IS, AMI and PAOD) was obtained by selecting all genes with a
   fold-change higher than 1.5 that were up-regulated in VTE and
   down-regulated in IS, AMI and PAOD, as well as genes with a
   fold-change lower than 0.8 that were down-regulated in VTE and
   up-regulated in IS, AMI and PAOD. The cutoff values correspond to the
   25th percentile (0.8) and the 75th percentile (1.49, rounded to 1.5)
   of the fold change, chosen to prioritize the most down- and
   up-regulated genes, respectively. A similar filtering approach has
   been used previously to avoid the definition of arbitrary thresholds
   [[59]33–[60]35].
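
   The selection rule can be written as a simple filter, sketched below
   in plain Python. The text does not state whether the 1.5 and 0.8
   cutoffs were also applied to the CVD datasets, so this sketch assumes
   that they apply to VTE only and that a fold change below or above 1
   defines down- or up-regulation in CVD. Gene names and values are
   hypothetical.

```python
def divergent_genes(fold_change, up_cutoff=1.5, down_cutoff=0.8):
    """Split genes into those up in VTE/down in CVD and the reverse.

    fold_change: {gene: {"VTE": fc, "IS": fc, "AMI": fc, "PAOD": fc}},
    with fold changes on the linear scale (1.0 means no change).
    """
    cvd = ("IS", "AMI", "PAOD")
    up_in_vte, down_in_vte = [], []
    for gene, fc in fold_change.items():
        # Assumption: CVD direction is judged against a fold change of 1.
        if fc["VTE"] > up_cutoff and all(fc[d] < 1.0 for d in cvd):
            up_in_vte.append(gene)
        elif fc["VTE"] < down_cutoff and all(fc[d] > 1.0 for d in cvd):
            down_in_vte.append(gene)
    return up_in_vte, down_in_vte


# Hypothetical fold changes versus healthy controls.
toy_fc = {
    "GENE_A": {"VTE": 1.8, "IS": 0.7, "AMI": 0.9, "PAOD": 0.6},
    "GENE_B": {"VTE": 0.6, "IS": 1.3, "AMI": 1.2, "PAOD": 1.6},
    "GENE_C": {"VTE": 1.7, "IS": 1.4, "AMI": 0.9, "PAOD": 0.8},
    "GENE_D": {"VTE": 1.1, "IS": 0.7, "AMI": 0.8, "PAOD": 0.9},
}
up_in_vte, down_in_vte = divergent_genes(toy_fc)
```

   In this toy input GENE_C is excluded because it is up-regulated in
   one CVD dataset, and GENE_D because its change in VTE does not reach
   the cutoff.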

   These gene lists were used for an additional functional analysis based
   on FAIME (Functional Analysis of Individual Microarray/RNAseq
   Expression) scores. The FAIME algorithm is implemented in the
   seq2pathway package [[61]36] and computes the cumulative quantitative
   effects of genes inside differentiated Gene Ontology terms using the
   log2 gene expression of each individual sample. The results were
   clustered by similarity of their gene patterns using Euclidean
   distances and plotted as a FAIME score heat map.

Validation of gene expression signatures associated with VTE and CVD

   In order to validate our results in independent cohorts, we first used
   Support Vector Machine (SVM)-based methods to identify two subsets of
   genes capable of more accurately separating VTE from CVD (validation
   1), and VTE (validation 2) as well as AMI and IS (validations 3 and
   4) from controls. SVM-based models are grounded in statistical
   learning theory
   [[62]37], and are normally used to optimize the discriminatory power of
   complex datasets by identifying subsets of data with higher
   discriminatory potential (classifiers) [[63]38, [64]39]. For validation
   1, SVM was applied to the list of genes that were divergently expressed
   between VTE and CVD, using the VTE and AMI patients’ datasets employed
   for our meta-analysis as training cohorts. The list of classifiers was
   then tested in three additional cohorts (validation cohorts) that were
   not used in the meta-analysis, comprising patients with VTE
   ([65]GSE48000) [[66]40], AMI ([67]GSE59867) [[68]41]. For validation 2,
   the training cohort consisted of the dataset of VTE patients used in
   our meta-analysis ([69]GSE19151), and the validation cohort consisted
   of a different dataset of VTE patients ([70]GSE48000) [[71]40].
   Finally, the training cohorts for validations 3 and 4 consisted of
   healthy controls and patients from the AMI and IS datasets
   ([72]GSE59867 and [73]GSE22255, respectively). Results were then
   validated in cohorts from additional AMI and IS datasets
   ([74]GSE141512 and [75]GSE16561, respectively) and presented as a
   heat map [[76]42] of normalized expression.
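
   The train-then-validate logic described above can be illustrated with
   a minimal linear SVM. The section does not specify the SVM software,
   kernel or feature selection procedure, so the sketch below is not the
   authors' method: it is a plain Python linear SVM trained by batch
   subgradient descent on the hinge loss, with hypothetical two-gene
   expression values standing in for the training and validation
   cohorts.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit w, b minimizing lam/2*||w||^2 + mean hinge loss (labels +1/-1)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Only samples inside the margin contribute to the subgradient.
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical normalized expression of two classifier genes.
# Label +1 stands for VTE, label -1 for CVD.
train_X = [[2.0, 0.5], [1.8, 0.7], [2.2, 0.4],
           [0.6, 1.9], [0.4, 2.1], [0.7, 1.7]]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(train_X, train_y)

# Samples from a separate (validation) cohort are only used for prediction.
validation_X = [[1.9, 0.6], [0.5, 2.0]]
validation_pred = [predict(w, b, x) for x in validation_X]
```

   Keeping the validation samples out of the fitting step is what makes
   the resulting accuracy an estimate of performance in independent
   cohorts.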

Results

Studies included in the meta-analysis

   Five studies fulfilled the inclusion and exclusion criteria described
   in the Methods section and were included in the meta-analysis. These
   studies included data from 163 adult patients and 145 healthy controls.
   [77]Table 1 provides the details of each study. As shown in [78]Table
   1, only one study included patients with VTE, comparing gene
   expression levels in patients with single or recurrent VTE
   ([79]GSE19151) with those of healthy controls [[80]43]. The remaining
   four studies involved CVD and reported gene expression levels of
   patients with PAOD ([81]GSE27034) [[82]44], AMI ([83]GSE48060)
   [[84]45], cardioembolic stroke ([85]GSE58294) [[86]46], and IS
   ([87]GSE22255) [[88]47]. All of them included appropriate
   study-specific paired healthy controls.

Table 1. Characteristics of individual studies included in our analyses.

   GEO access number | Sample characteristics: characteristics of patients/disease included in each dataset | Size (Pt:Ctl) | RNA source
   [89]GSE19151 | Adults with one or more prior VTE or warfarin; APS and cancer excluded | 70:63 | Whole blood
   [90]GSE27034 | Peripheral arterial occlusive disease, defined as ankle:brachial index < 0.9 | 19:18 | PBMC
   [91]GSE48060 | Adults with first-time acute myocardial infarction[92]*; inflammatory diseases and cancer excluded | 31:21 | Whole blood
   [93]GSE58294 | Adults with cardioembolic stroke (i.e. at least one source of cardiac embolus and exclusion of strokes from other etiologies)[94]† | 23:23 | Whole blood
   [95]GSE22255 | Adults with history of one ischemic stroke more than 6 months prior to sample collection; anemia and allergies excluded | 20:20 | PBMC

   GEO: Gene Expression Omnibus, Pt:Ctl: patients:controls; PBMC:
   peripheral blood mononuclear cells; WBC: white blood cells; VTE: Venous
   thromboembolism; APS: antiphospholipid syndrome.

   * Samples were collected within 48 h of the acute event

   † subset of patients recruited for the Clear Stroke Trial [[97]48];
   samples were collected within 3h from the acute event, prior to any
   pharmacological treatment. All studies used the Affymetrix Human Genome
   U133 Plus 2.0 as a microarray platform. References for published