Abstract

Cardiovascular disease (CVD) and venous thromboembolism (VTE) are among the main causes of morbidity and mortality in modern societies. Although the two conditions are associated with distinct pathogenic mechanisms, epidemiological, experimental and clinical trial data suggest that the mechanisms responsible for arterial and venous thrombosis at least partially overlap. Herein we aimed to explore shared and discordant pathways involved in the pathogenesis of VTE and CVD at the transcriptomic level and to validate the results in independent cohorts. Five public gene expression datasets from patients with VTE and CVD (myocardial infarction, peripheral arterial occlusive disease and stroke) were analyzed using an integrative bioinformatic strategy. A machine/statistical learning method was used to derive classifiers for the discrimination of VTE and CVD, which were then tested in independent datasets. Two sets of genes that were commonly (n = 472) or divergently (n = 124) expressed in CVD and VTE were identified. Genes and pathways associated with innate immune function were over-represented in both conditions, along with pathways associated with complement and hemostasis. Pathways associated with neutrophil activation and with IL-1 signaling were also enriched in CVD compared to VTE. The gene expression signature of VTE more closely resembled the pattern of cardioembolic stroke than the patterns of acute myocardial infarction, ischemic stroke and peripheral arterial occlusive disease. Classifiers derived from these gene lists accurately discriminated patients with VTE and CVD in independent cohorts. In conclusion, our results add a new set of transcriptomic data for future studies on the relationship between arterial and venous thrombosis.

Strengths and limitations of this study

* Our results represent the first comparison of venous and arterial thrombosis at the transcriptomic level.
* Our main result was the demonstration that immunothrombosis pathways are important to the pathophysiology of these conditions, also at the transcriptomic level.
* A specific signature for venous and arterial thrombosis was described, and validated in independent cohorts.
* The limited number of public repositories with gene expression data from patients with venous thromboembolism limits the representation of these patients in our analyses.
* In order to gather a meaningful number of studies with gene expression data, we had to include patients at different time points after the index thrombotic event, which might have increased the heterogeneity of our population.

Introduction

CVD is a generic term that encompasses conditions caused by arterial thrombosis, such as myocardial infarction (MI), ischemic stroke (IS) and peripheral arterial occlusive disease (PAOD), with the former two representing the most frequent causes of years of life lost in most regions of the world [1, 2]. Venous thromboembolism (VTE) encompasses deep vein thrombosis (DVT) and pulmonary embolism (PE), which together represent the third leading cause of vascular disease in the world [3].
Although it has long been recognized that the pathogenesis of these two conditions is based on distinct cellular and molecular pathways, the existence of common pathogenic pathways contributing to both CVD and VTE is suggested by (i) the sharing of risk factors such as obesity, smoking and hypertriglyceridemia [4]; (ii) the epidemiological association between CVD and VTE, illustrated by the higher prevalence of CVD in patients with VTE even years after the venous event [5–7]; (iii) the fact that some inflammatory diseases, such as sickle cell disease and antiphospholipid syndrome (APS), increase the risk of both conditions [8, 9]; and (iv) more recently, the demonstration that treatment strategies classically used for CVD can also benefit patients with VTE [10, 11], and vice versa [12]. In this context, much remains to be learned about their shared and independent pathological mechanisms, whose characterization could contribute to the identification of new therapeutic targets for both VTE and CVD [7, 13, 14].

Three major frameworks have been used to address differences and similarities between CVD and VTE: (i) studies in animal models, (ii) histopathological analyses of thrombi, and (iii) epidemiological data. Studies in animal models identified proteins and cells that contribute to VTE or CVD [2, 15–17], allowing the development of important therapeutic targets for each condition. However, these studies have not focused on the relative contribution of these pathways to CVD, VTE or both conditions in human disease. While histopathological studies of human thrombi initially supported the classical paradigm of white (platelet-rich) or red (fibrin- and red blood cell-rich) thrombi in CVD and VTE respectively, these conclusions were later challenged by several studies showing a much more complex picture, as recently reviewed [13].
Lastly, epidemiological studies have been instrumental in providing insights into the association of venous and arterial thrombosis, and clearly demonstrated that VTE and CVD are indeed associated conditions [18, 19]. However, these studies have not yet been able to clearly define the mechanism of this association, whether causal (i.e. atherosclerosis leads to VTE) or driven by common pathogenic mechanisms [7].

In recent years, the availability of large databases of genomic data, along with bioinformatics and machine learning tools capable of performing integrative and functional analyses of these datasets, has enabled new strategies for research into the molecular and cellular pathogenesis of complex conditions. In particular, publicly available datasets from gene expression studies, once performed to define specific disease signatures, can now be compared, grouped and meta-analyzed, allowing biases and artefacts to be canceled out between datasets, so that true relationships are more likely to stand out [20–26]. Herein we used a panel of bioinformatics and machine learning tools to explore the differences and similarities between VTE and CVD, thereby contributing a new layer of data to our understanding of the common and divergent pathogenic mechanisms of two conditions of high epidemiological relevance.

Methods

Identification of eligible studies and datasets

Gene expression datasets from microarray studies including human patients with CVD or VTE were searched in the public repository Gene Expression Omnibus (GEO) [27], maintained by the NCBI, up to May 2018. The search was conducted using the terms “venous thrombosis”, “venous thromboembolism”, “myocardial infarction”, “stroke”, “coronary ischemia”, “angina”, “atherosclerosis”, “peripheral arterial disease” or “thrombosis”.
Datasets were included if they met all the following inclusion criteria: (1) microarray data obtained from human samples using the same microarray platform; (2) RNA source restricted to whole blood or populations of circulating blood cells; (3) studies including both affected patients and healthy controls, so that the differential expression of each gene was evaluated under the same experimental conditions; (4) availability of metadata allowing the separation of venous from arterial events; and (5) datasets from studies published in peer-reviewed journals. In the course of our study, we also restricted our analysis to studies using the same microarray platform, so as to limit heterogeneity.

Patient and public involvement

No patients were involved.

Meta-analysis of gene expression studies

Pre-processing

Microarray raw data were pre-processed using the Robust Multichip Average (RMA) method [28] implemented in the oligo package [29]. For each dataset, the algorithm performs background subtraction, minimizing the effects of optical noise and non-specific binding on the estimation of relative gene expression parameters. Next, quantile normalization was applied, mitigating the effects of technical variables through the estimation of a common intensity distribution across samples. This stage was followed by a median-polish step, which summarized the several probe intensity measurements into a single probeset log-expression value for the downstream meta-analysis step. Using the biomaRt package [30], we annotated the probesets with their respective Ensembl Gene IDs.

Meta-analysis

To perform the meta-analysis, expression data were organized according to their pre-defined classes and study of origin. Meta-analysis was performed with the RankProd package [31]. The algorithm of this package adapts the rank product method, originally designed for single-experiment analysis, to integrate data from multiple studies.
It is a non-parametric method that detects genes consistently ranked as differentially expressed (DE) by comparing patients to healthy controls across datasets. One hundred permutations were performed to compute the p-value and the false discovery rate (FDR). The gene list was further filtered to include only genes that were up- or down-regulated in the same direction in all five studies at an FDR < 0.05.

Correlation analysis of gene expression levels in CVD and VTE

The correlation between VTE and CVD in the expression levels of genes identified in the meta-analysis was quantified using Pearson’s correlation coefficient and represented graphically. Unless otherwise stated, all analyses were performed in the statistical computing environment R version 3.4.4 [32].

Functional gene set analyses

To facilitate the interpretation of the biological significance of the gene list identified by the meta-analysis, a functional gene set analysis (GSA) was performed using EnrichR, a web-based bioinformatics tool that includes several curated GSA libraries encompassing pathway enrichment analysis (e.g. KEGG, Reactome, and 18 other libraries) and gene ontologies (for cellular components, biological process and molecular function), among others. Of the enriched terms identified by EnrichR, only pathways that were (i) listed among the 10 most significant (based on the p-value) for each library, and (ii) identified in at least two libraries from the same category were considered. For gene ontology terms, the top 5 terms with an adjusted p-value < 0.0001 were included.
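The rank product statistic used in the meta-analysis step can be illustrated with a minimal Python sketch. This is not the RankProd implementation: it assumes one log fold-change per gene per study (RankProd works from pairwise sample comparisons), and the gene names, values, and the function names `rank_products` and `permutation_pvalues` are hypothetical. Only the 100 permutations are taken from the text.

```python
import math
import random


def rank_products(studies):
    """Geometric mean, across studies, of each gene's within-study rank.

    `studies` is a list of dicts mapping gene -> log fold-change
    (patients vs controls). Rank 1 is the most up-regulated gene.
    """
    genes = sorted(studies[0])
    log_sums = {g: 0.0 for g in genes}
    for fc in studies:
        ordered = sorted(genes, key=lambda g: fc[g], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            log_sums[g] += math.log(rank)
    return {g: math.exp(log_sums[g] / len(studies)) for g in genes}


def permutation_pvalues(studies, n_perm=100, seed=0):
    """Fraction of permuted rank products that are <= the observed value."""
    rng = random.Random(seed)
    observed = rank_products(studies)
    genes = sorted(observed)
    null = []
    for _ in range(n_perm):
        shuffled = []
        for fc in studies:
            values = [fc[g] for g in genes]
            rng.shuffle(values)  # break the gene-to-value link within a study
            shuffled.append(dict(zip(genes, values)))
        null.extend(rank_products(shuffled).values())
    return {g: sum(v <= observed[g] for v in null) / len(null) for g in genes}


# Hypothetical log fold-changes for four genes in three studies.
toy_studies = [
    {"G1": 2.0, "G2": 0.5, "G3": -1.0, "G4": 0.1},
    {"G1": 1.5, "G2": 0.2, "G3": -0.5, "G4": 0.4},
    {"G1": 1.8, "G2": -0.1, "G3": -1.2, "G4": 0.3},
]
rp = rank_products(toy_studies)
pvals = permutation_pvalues(toy_studies)
for gene in sorted(rp, key=rp.get):
    print(gene, round(rp[gene], 3), round(pvals[gene], 3))
```

A gene that ranks near the top in every study gets a small rank product, which is why the statistic favors consistency across datasets over a large effect in a single one. Down-regulated genes would be handled the same way with the ranking reversed.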
Evaluation of genes with divergent expression between VTE and CVD

A list of genes with divergent expression between VTE and three CVD datasets (IS, acute myocardial infarction [AMI] and PAOD) was obtained by selecting all genes with a fold-change higher than 1.5 that were up-regulated in VTE and down-regulated in IS, AMI and PAOD, as well as genes with a fold-change lower than 0.8 that were down-regulated in VTE and up-regulated in IS, AMI and PAOD. The cutoff values correspond to the 25th (0.8) and 75th (1.49, rounded to 1.5) percentiles of the fold-change distribution, and were chosen to prioritize the most down- and up-regulated genes, respectively. A similar filtering approach has been used to avoid defining arbitrary thresholds [33–35]. These gene lists were used for an additional functional analysis based on FAIME (Functional Analysis of Individual Microarray/RNAseq Expression) scores. The FAIME algorithm is implemented in the seq2pathway package [36] and computes the cumulative quantitative effects of genes within differentiated Gene Ontology terms using the log2 gene expression of each individual sample. The resulting scores were clustered by similarity using Euclidean distances and plotted as a FAIME score heat map.

Validation of gene expression signatures associated with VTE and CVD

In order to validate our results in independent cohorts, we first used Support Vector Machine (SVM) based methods to identify two subsets of genes capable of more accurately separating VTE from CVD (validation 1), and VTE (validation 2), AMI and IS (validations 3 and 4) from controls. SVM-based models are based on statistical learning theory [37], and are normally used to optimize the discriminatory power of complex datasets by identifying subsets of data with higher discriminatory potential (classifiers) [38, 39].
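The text does not specify the SVM implementation, kernel, or feature-selection procedure, so the following Python sketch only illustrates the general idea of training on one cohort and testing on another. It assumes a linear SVM fitted by stochastic subgradient descent on the L2-regularized hinge loss. The function names, hyperparameters, and expression values are all hypothetical.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on the hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Shrinkage step that comes from the L2 penalty on w.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:  # sample is misclassified or inside the margin
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical centred log2 expression of two genes; +1 = VTE, -1 = CVD.
train_x = [[2.0, -1.0], [1.5, -0.5], [1.8, -1.2],
           [-1.0, 2.0], [-0.5, 1.5], [-1.2, 1.8]]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(train_x, train_y)

# Samples held out from training play the role of a validation cohort.
held_out = [[1.7, -0.8], [-0.9, 1.6]]
print([predict(w, b, x) for x in held_out])
```

In a linear model of this kind, the magnitude of each weight gives a rough indication of how much the corresponding gene contributes to the separation, which is one common basis for selecting a smaller subset of genes as classifiers.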
For validation 1, SVM was applied to the list of genes that were divergently expressed between VTE and CVD, using the VTE and AMI patients’ datasets employed for our meta-analysis as training cohorts. The list of classifiers was then tested in three additional cohorts (validation cohorts) that were not used in the meta-analysis, consisting of patients with VTE (GSE48000) [40] and AMI (GSE59867) [41]. For validation 2, the training cohort consisted of the dataset of VTE patients used in our meta-analysis (GSE19151), and the validation cohort consisted of a different dataset of VTE patients (GSE48000) [40]. Finally, the training cohorts for validations 3 and 4 consisted of healthy controls and patients from the AMI and IS datasets (GSE59867 and GSE22255, respectively). Results were then validated using cohorts from additional AMI and IS datasets (GSE141512 and GSE16561, respectively) and presented as heat maps [42] of normalized expression.

Results

Studies included in the meta-analysis

Five studies fulfilled the inclusion and exclusion criteria described in the Methods section, and were included in the meta-analysis. These studies included data from 163 adult patients and 145 healthy controls. Table 1 provides the details of each study. As shown in Table 1, only one study included patients with VTE, comparing gene expression levels in patients with single or recurrent VTE (GSE19151) with those of healthy controls [43]. The four remaining studies involved CVD. These CVD studies report gene expression levels of patients with PAOD (GSE27034) [44], AMI (GSE48060) [45], cardioembolic stroke (GSE58294) [46], and IS (GSE22255) [47]. All of them have appropriate study-specific paired healthy controls.

Table 1. Characteristics of individual studies included in our analyses.
GEO accession number | Characteristics of patients/disease included in each dataset | Size (Pt:Ctl) | RNA source
GSE19151 | Adults with one or more prior VTE or warfarin; APS and cancer excluded | 70:63 | Whole blood
GSE27034 | Peripheral arterial occlusive disease, defined as ankle:brachial index < 0.9 | 19:18 | PBMC
GSE48060 | Adults with first-time acute myocardial infarction*; inflammatory diseases and cancer excluded | 31:21 | Whole blood
GSE58294 | Adults with cardioembolic stroke (i.e. at least one source of cardiac embolus and exclusion of strokes from other etiologies)† | 23:23 | Whole blood
GSE22255 | Adults with history of one ischemic stroke more than 6 months prior to sample collection; anemia and allergies excluded | 20:20 | PBMC

GEO: Gene Expression Omnibus; Pt:Ctl: patients:controls; PBMC: peripheral blood mononuclear cells; WBC: white blood cells; VTE: venous thromboembolism; APS: antiphospholipid syndrome.
* Samples were collected within 48 h of the acute event.
† Subset of patients recruited for the Clear Stroke Trial [48]; samples were collected within 3 h of the acute event, prior to any pharmacological treatment.
All studies used the Affymetrix Human Genome U133 Plus 2.0 microarray platform. References for published