Abstract
We employed a multifaceted computational strategy to identify the
genetic factors contributing to increased risk of severe COVID-19
infection from a Whole Exome Sequencing (WES) dataset of a cohort of
2000 Italian patients. We coupled a stratified k-fold screening, to
rank variants more associated with severity, with the training of
multiple supervised classifiers, to predict severity based on screened
features. Feature importance analysis from tree-based models allowed us
to identify 16 variants with the highest support which, together with
age and gender covariates, were found to be most predictive of COVID-19
severity. When tested on a follow-up cohort, our ensemble of models
predicted severity with high accuracy (ACC = 81.88%; AUCROC = 96%;
MCC = 61.55%). Our model recapitulated a vast literature of emerging
molecular mechanisms and genetic factors linked to COVID-19 response
and extends previous landmark Genome-Wide Association Studies (GWAS).
It revealed a network of interplaying genetic signatures converging on
established immune system and inflammatory processes linked to viral
infection response. It also identified additional processes
cross-talking with immune pathways, such as GPCR signaling, which might
offer additional opportunities for therapeutic intervention and patient
stratification. Publicly available PheWAS datasets revealed that
several variants were significantly associated with phenotypic traits
such as “Respiratory or thoracic disease”, supporting their link with
COVID-19 severity outcome.
Subject terms: Machine learning, Cellular signalling networks,
Genomics, Prognostic markers
__________________________________________________________________
A multifaceted computational strategy identifies 16 genetic variants
contributing to increased risk of severe COVID-19 infection from a
Whole Exome Sequencing dataset of a cohort of Italian patients.
Introduction
The coronavirus disease 2019 (COVID-19) pandemic, caused by the
infection with severe acute respiratory syndrome coronavirus 2
(SARS-CoV-2), is challenging health, economical and societal systems
worldwide at an unprecedented level. The SARS-CoV-2 infection is
characterized by a large variation in consequence ranging from
asymptomatic to life-threatening conditions such as viral pneumonia and
acute respiratory distress syndrome (ARDS). ARDS is caused by an
exaggerated host immune response leading to lung injury, which starts
at the epithelial–interstitium–endothelial interface with increased
vascular permeability and extravasation of immune cells, mostly
macrophages, and granulocytes. Infected epithelial cells and debris
bind immune cell receptors, triggering the release of inflammatory
cytokines (predominantly IL-6, IL-1, and TNF-α) and activating
fibroblasts, resulting in a cytokine release syndrome^[42]1.
Established host risk factors for disease severity, such as increasing
age, male gender, and higher body mass index^[43]2, do not explain all
the variability in disease severity observed across individuals.
Genetic factors contributing to COVID-19 susceptibility and severity
may provide novel biological insights into disease pathogenesis
mechanisms, new drug targets as well as new means for patient
stratification. It is important to consider that, despite the recent
development of vaccines, treating the disease remains an important goal
in the clinics. The first genetic factors described to contribute to
COVID-19 severity were rare loss-of-function variants in genes involved
in type I interferon (IFN) responses^[44]3–[45]7. At the same time,
several GWAS projects investigating the contribution of common genetic
variation^[46]8,[47]9 to COVID-19 have provided robust support for the
involvement of various genomic loci associated with COVID-19 severity
and susceptibility, with the strongest finding for severity being
located on chromosome 3. Until now, the Italian GEN-COVID Multicenter
Study contributed to the identification of rare variants ^[48]6,[49]10
and common polymorphisms^[50]11–[51]13 associated with COVID-19
severity through the collection of more than two thousand biospecimens
and clinical data from SARS-CoV-2-positive individuals^[52]14 and whole
exome sequencing (WES) analysis. The COVID-19 Host Genetics Initiative
(COVID-19 HGI) ([53]https://www.COVID-19hg.org) has recently provided
the most comprehensive picture of host genetic factors linked to
COVID-19 severity through meta-analyses of tens of studies from 19
countries^[54]15.
While GWAS studies provide solid evidence of the host genetic factors
individually associated with COVID-19 severity, they most often fail to
provide an organic picture about their interplay. By learning
(non-)linear patterns from data in a human interpretable fashion,
explainable machine learning algorithms might help in understanding the
multifactorial nature of the interactions between host genetics and
COVID-19, at the same time providing effective tools for risk and
severity forecasting.
In 2020, the Italian GEN-COVID Multicenter Study started to investigate
how the combination of common and rare variants could determine
COVID-19 severity in a pilot study including WES data of a first small
cohort of hospitalized patients^[55]16. Previous and ongoing efforts
entailed machine learning techniques (i.e. LASSO logistic regression
models) in combination with a boolean representation of genetic
variants to identify the most informative features associated with the
severity which were used to compile an Integrated PolyGenic Score for
COVID-19 severity predictions^[56]17,[57]18. In this study, we combined
variant case-control screening, supervised binary classifiers training,
feature importance analysis, and dimensionality reduction techniques
with pathway enrichment and phenotype association studies to identify a
few dozens genetic variants contributing to increased risk of severe
COVID-19 infection from a Whole Exome Sequencing (WES) dataset of a
cohort of Italian patients.
Results
Comparing genetic variation in severe and asymptomatic individuals
We considered the Whole Exome Sequencing (WES) dataset of germline
variants from 1982 European descent patients provided by the GEN-COVID
Multicenter Study group^[58]14. All subjects were classified according
to the grading scheme by the World Health Organization (WHO), refined
based on an ordinal logistic model using age as input feature for
sex-stratified patients^[59]17. Demographic (sex, age, and ethnicity)
and clinical data (family history, pre-existing chronic conditions, and
SARS-CoV-2 related symptoms) were also collected (Fig. [60]1a; see
“Methods”).
Fig. 1. Patient cohort and workflow of the computational pipeline.
[61]Fig. 1
[62]Open in a new tab
a piechart with the fraction of sequenced patients for each grading
group; b stacked bar-charts with distribution of patients in the two
groups (severe=5 + 4 + 3; asymptomatic=0), and their gender
composition, whose variants were used for screening, training and
initial testing; c stacked bar-charts with distribution of patients in
the two groups (severe=5 + 4 + 3; asymptomatic=0), and their gender
composition, from a follow-up cohort used for final testing of the
model; d workflow of the bottom-up computational strategy to identify
and interpret variants linked to COVID-19 severity.
We started our analysis from a total of 1.057 M simple variants which
were screened to identify mutations associated with severe patients,
likely representing risk factors, from those associated with
asymptomatic patients, more likely contributing to protection. We
employed log odds ratio statistics, using an additive model, to screen
variants significantly associated with either severe or asymptomatic
groups (Fig. [63]1a, b; see “Methods”). We performed the screening on
the majority portion (training set) of a randomly split dataset
(keeping 80% of the samples for training and 20% for testing), to find
a set of variants to be used as features set for downstream ML and
pathways analysis. To ensure robustness, we repeated the splitting
procedure five times, employing a stratified five fold cross-validation
scheme, by performing the screening on the training set and finally
retaining those variants found to be significantly enriched in each of
the five splits (Fig. [64]1d; see “Methods”). We found on average 1130
variants significantly enriched across the five folds (Data [65]S1).
Genetic variants predict severity through supervised ML classifiers
We embedded the stratified five fold screening within a supervised
classifier training procedure (Fig. [66]1d; see “Methods”). For each
random split of the dataset, we trained the model by considering the
variants screened in the training set (80% of the original dataset),and
tested it on the corresponding held-out portion (20% of the dataset) of
the same split. For each screened random split, we trained multiple
models using a stratified five fold Cross-validation (five fold CV)
grid search to estimate optimal hyperparameters for supervised
classifiers training (Fig. [67]1d; see “Methods”). XGBoost was the
algorithm that displayed the smallest drops between training and
testing accuracies, achieving the best average performance during
testing across the five folds (Fig. [68]2a; Data [69]S2). In more
details, the best XGBoost model had the following performances:
Precision=77.27%, Recall=83.33%, MCC = 46.69%, AUCROC = 80,
Accuracy=75%, F1 = 80.2% (Data [70]S2).
Fig. 2. Performances of the supervised classifier for prediction of COVID-19
severity.
[71]Fig. 2
[72]Open in a new tab
a Distribution of performance metrics of different algorithms during
testing on the five folds. The horizontal line inside each box
represents the median value, and the height (whiskers) of each of the
boxes depict the standard error (variability) of a particular
performance metrics under consideration as scored across the five fold
CVs by the employed supervised ML algorithms. The dotted points above
and below the individual box-and-whisker lines are potential outliers
that are above or below the 25th percentile, and the 75th percentile; b
feature importance distribution for features with non-zero importance
across the five folds. The characteristics of each box-plot are as in
Fig. 2a; c log-odds ratio of the 16 variants with full support in
XGBoost trained models; d performances of the predictors with 16
variants plus covariates (age and gender; orange), only co-variates
(green), all screened variants plus covariates (blue) in the held-out
test set (samples n = 168); e performances of the predictors with 16
variants plus covariates (age and gender; orange), only co-variates
(green),all screened variants plus covariates (blue) in a follow-up
testing set cohort (new samples n = 618).
Overall, we found that 3217 unique variants (out of a total of 3258
unique, screened variants), corresponding to 2546 unique genes, had
non-zero coefficients in at least one of the five, tree-based models
(i.e. RF or XGBoost). However, the XGBoost classifier led to a sharper
reduction of relevant variants (1086, corresponding to 1049 genes, with
non-zero feature importance in at least one model), consisting of a
subset of those identified with the RF models. As expected, clinical
covariates such as age and gender were found among the features with
the highest median of the distribution of importance coefficients
collected from XGBoost models (Fig. [73]2b). Among this shortlist, only
16 variants (and corresponding genes; Data [74]S1) consistently
received non-zero coefficients in all tree-based models, out of which 9
variants were found to be enriched among severe patients (Fig. [75]2b,
c). To confirm the predictive performance of these variants, we
re-trained the models by considering only this subset of variants, plus
age and gender covariates, and we calculated aggregated performances by
considering the median of the probabilities outputted by each model for
each sample in the testing set (see “Methods”). While age and gender
covariates alone retained high predictive power (AUCROC = 80%), the
addition of these most informative genetic features led to an increase
of performances (AUCROC = 86%, best model AUCROC = 91%; Fig. [76]2d;
Data [77]S3).
We observed a high level of performance when we tested the ensemble of
models trained with only informative variants on a follow-up cohort of
618 individuals (122 asymptomatic, 496 severe; Fig. [78]1c), either at
the individual model level or at the ensemble one (Data [79]S4). In
fact, when computing aggregated metrics by considering the median of
the probability distribution collected from the ensemble of models
(Data [80]S4, [81]5; see “Methods”), we identified severe patients with
good accuracy (ACC = 81.88%; AUCROC = 96%), performing considerably
better than the ones obtained by training with only covariates or
variants (Fig. [82]2e; Data [83]S4). The model also showed good
performances on an additional validation set comprising a total of 375
samples excluded from both training and testing due to inconsistent
classification from the WHO grading and the ordinal logistic model
adjusted by age (ACC = 85.34%, MCC = 67.8%, AUCROC = 91.4%;
Fig. [84]S1; Data [85]S6).
Risk and protective genetic factors impinge on modular, interconnected
networks underlying distinct biological processes
We analyzed the subset of variants receiving non-zero feature
importance in at least one XGBoost model to provide a mechanistic
explanation for their potential interaction with COVID-19 infection. We
performed pathway analysis by mapping mutated genes in a functional
interaction (FI) network (i.e., Reactome FI network; see Methods). We
built a general FI network (Fig. [86]3b), as well as networks specific
for clinical groups, by grouping variants and genes enriched in severe
and asymptomatic patients (Fig. [87]3a). Pathway analysis on
group-specific networks revealed patterns of significantly enriched
processes connected to either risk or protection (Fig. [88]3a).
Fig. 3. network analysis and pathway enrichment.
[89]Fig. 3
[90]Open in a new tab
a Pathways overrepresented among variants with non-zero feature in at
least one XGB model and enriched in either severe(red) or asymptomatic
(blue); b reactome FI network of genes affected by variants with
non-zero feature importance from XGBoost. Node diameter is proportional
to the number of variants with non-zero coefficients in any tree based
models. Node color is instead proportional to the LOR with the highest
absolute value among the variants associated to a given gene. The top 3
modules identified within the network are highlighted and corresponding
enriched processes displayed as barcharts colored with cluster specific
corresponding colors; c FI network zoomed representation of the 2nd
largest cluster.
In severe patients we found significant processes associated with
cardiomyopathies, e.g. Arrhythmogenic right ventricular cardiomyopathy
(FDR = 4.03E–05), Calcium signaling pathways (FDR = 4.22E–02), and
immune response such as C-type leptin receptors (CLRs) (FDR = 5.67E–02)
(Fig. [91]3a; Data [92]S7). Asymptomatic patients were instead
characterized by distinct processes, including Fanconi anemia pathway
(FDR = 7.89E–04) and DNA repair processes such as HDR through HRR or
SSA (FDR = 4.84E–03), (Fig. [93]3a; Data [94]S8).
The general FI network comprised a total of 344 mutated genes and 630
functional interactions, marking a high degree of interconnection
between affected genes, which participate in different, cross-talking
biological processes. Cluster analysis on the general FI network
revealed modules characterized by specific pathways. Intriguingly, we
found out that no cluster exclusively contained variants enriched in
severe or asymptomatic patients. In detail, the largest cluster (i.e.
Module 1; 43 nodes) encompassed Fanconi anemia pathway (FDR = 2.46E–07)
and DNA repair processes such as HDR through HRR or SSA
(FDR = 4.51E–06) or Homologous recombination (FDR = 1.76E–03)
(Fig. [95]3b). In this cluster, we found that the gene characterized by
the variant with the strongest model support (ms) (i.e. fraction of
tree-based models assigning non-zero feature importance; see Methods)
is MYBBP1A rs117615621, which is enriched in asymptomatic patients
(Odds Ratio OR) = 0.26; p value = 6.5E–03; ms = 90%; (Data [96]S1,
[97]S8).
The second-largest module (Module 2; 42 nodes) involves genes mediating
signal transduction cascades such as Ras GTPases, e.g. Rap1 signaling
pathway (FDR = 1.01E–04) or MAP kinases, e.g. MAPK signaling pathway
(FDR = 5.95E-04) (Fig. [98]3b, c). We also found processes more
directly linked to the immune and inflammatory response to the virus,
such as the JAK-STAT signaling pathway (FDR = 1.11E-03),
Cytokine-cytokine receptor interaction (FDR = 1.92E–03), and
Interleukin-6 family signaling (FDR = 1.92E–03) (Fig. [99]3b, c). All
three pathways include the CNTFR gene, which codes for the alpha
subunit of the receptor for the ciliary neurotrophic factor, and is
affected by a novel variant (chr9:34557898:A: T) enriched in severe
patients (lor=1.230663067; p val=2.2E–04; Data [100]S1). Intriguingly
this variant was ranked in the top 20 genes with the highest median
importance (Fig. [101]2b) and received 100% model support
(Fig. [102]2c). Another variant with 100% support within the same
cluster is rs150021157, which is significantly enriched among severe
patients (OR = 3.95; p val=1.9E–03; Data [103]S1,[104]S8), and it
affects the PCSK5 gene, a serine endoprotease which processes various
proteins including cytokines, NGF, renin and which has been reported to
regulate the viral life cycle^[105]19.
The third-largest module (Module 3; 38 nodes) is characterized by the
Regulation of nuclear SMAD2/3 signaling pathway (FDR = 1.95E–03) as the
most enriched pathway, therefore being tightly interconnected with
cluster 2. The variant SMAD3 rs897912452 (OR = 0.31; p val=5.1E–4) and
the novel ZMIZ1 10:79307376:-:GGGGGGGGGG (OR = 0.27; p val=6.18E–05)
have the highest support (ms=90%) and are found enriched in
asymptomatic patients. Additionally, the latter gene ZMIZ1 participates
in another significant pathway, Coregulation of Androgen receptor
activity (FDR = 0.01), which also entails AR, which carries several
mutations which, depending on the specific genic locus, can be found
enriched either in severe or asymptomatic patients with variable
support (Fig. [106]3, S2; Data [107]S1,[108] S8).
We found additional potentially relevant pathways in the remaining
modules. Module 4 (33 nodes) contains genes involved in
Deubiquitination (FDR = 1.15E-05), a process frequently modified by
viral infection^[109]20, as well as several other pathways mediating
innate immune response such as the TNF receptor signaling pathway
(FDR = 1.15E–05) (Fig. [110]S3; Data [111]S9). Within this module we
found the PLEC gene, affected by the variant rs140300753 (OR = 3.2, p
val=2.8E-03, ms=100%), which is enriched in severity and received 100%
support from tree-based models (Data [112]S1).
In the remaining clusters we found additional processes with high
translational and therapeutic potential. For instance, we found several
GPCR-signaling instances significantly enriched in Modules 6 (e.g. G
alpha (i) signaling events, FDR = 3.69E–04) and 8, which exclusively
entails GPCR-downstream signaling pathways and where again the G alpha
(i) signaling events (FDR = 2.56E–09) and G alpha (q) signaling events
(FDR = 4.83E–08) are the two downstream pathways most significantly
over-represented (Fig. [113]S4; Data [114]S9).
We also found that a few genes whose variants have been identified
through our pipeline are among the ones carrying top associations to
severity as assessed from studies of the COVID-19 HGI
([115]https://app.COVID-19hg.org/variants)^[116]15. In detail, variants
of 9 out of the 43 genes identified from GWAS studies are also present
in our list, including: ABO, ARL17A, ARL17B, DPP9, LRRC37A, LRRC37A2,
RAVER1, TMEM65, ZBTB11 (Data [117]S1).
Severe patients tend to cluster together using only more informative variants
We applied unsupervised clustering and dimensionality reduction
techniques (i.e. Principal Component Analysis (PCA)) to group patients
based on the genetic distance calculated by considering the most
informative variants selected after screening and supervised machine
learning procedure. By projecting the patients on the first two PCs
followed by k-means clustering (see “Methods”), we detected three
groups of patients in the original cohort (Fig. [118]4a–c). The two
largest clusters were separated by PC1. The largest one, 515 patients,
was characterized by a majority of severe cases (78% of the total). The
second cluster was instead characterized by a prevalence of
asymptomatic patients (70% of the total). Finally, a third small
cluster was identified through the combined usage of PC1 and PC2 and it
was characterized almost exclusively by severe patients (95% of 24
patients in total). Notably, the severity of this cluster is only
partially explained on the basis of either gender (59% males and 37%
females; Fig. [119]4c) or age (Fig. [120]S5a). This cluster was
characterized by peculiar genetic features, with a smaller number of
variants and a neat prevalence of risk over protection factors
(Fig. [121]S5b). Remarkably, a total of 7 (out of 9 overall enriched in
severe patients) variants with 100% support from XGB models were also
found in this cluster (Data [122]S10). Network analysis of the mutated
genes in this predominantly severe cluster highlighted several common
processes as well as candidates for drug targeting. In particular,
several GPCRs (ADRB2, ADRA1, GRM6), ion channels (GRIN1, CACNA1G),
(receptor tyrosine) kinases (NTRK1, CSF1R, GAK) and nuclear hormone
receptors (AR, THRB) participate to this network and can be readily
targeted by approved drugs (Fig. [123]4c; Data [124]S11).
Fig. 4. detection of distinct clinical groups via PCA and clustering.
[125]Fig. 4
[126]Open in a new tab
a Projection of samples (n = 841) along the 1st and 2nd principal
components and coloring based on severity (up) or clusters identified
via k-means (bottom); b gender and clinical group composition of the
clusters detected via k-means on the 1st and 2nd PCA components; c FI
network constructed using mutated genes on the cluster of more severe
patients and approved drugs available for any of these genes.
Important variants are associated with disease traits linked to COVID-19
severe phenotypes
To provide further evidence of a functional relationships between our
variants and COVID-19 severe phenotypes, we checked available
open-access integrative resources (i.e. Open Target Genetics
initiative^[127]21) which aggregate human GWAS and functional genomics
data to link between GWAS-associated loci, variants, and likely causal
genes. In particular, we considered Phenome Wide Association Study
(PheWAS) analysis considering a wide range of diseases and traits to
identify the phenotypes associated with our variants (see Methods).
Intriguingly, we found that many identified variants are associated
with traits or phenotypes which might be linked with either risk or
protection from severe consequences to the viral infection.
For example, by considering variants with non-zero importance in at
least one XGB model, we found that those enriched in severe patients
were 70% of the total associated with the category “respiratory or
thoracic diseases” (see Fig. [128]5A). Among the specific traits with
strong associations to more supported variants, we found instances such
as “Doctor diagnosed emphysema” (ITPKA, rs41277684; LTK, rs35932273),
the latter variant associated also to “Other alveolar and
parietoalveolar pneumopathy”, “Respiratory disorders in diseases
classified elsewhere” (KCNB1, rs34467662), “Chronic
bronchitis/emphysema” (C12orf43;HNF1A, rs11065390; SLC47A2,
rs34399035), “Acute sinusitis” (SHANK2, rs146204677), “Pleural plaque”
(CFAP74, rs141833643), “Allergic asthma” (SYTL2, rs61740616 and
rs35751209), “Symptoms and signs involving the circulatory and
respiratory systems” (PCSK5, rs150021157) (Fig. [129]5b). Although more
weakly associated and supported by our models, we also found several
associations with chronic obstructive pulmonary disease (COPD) both in
“respiratory or thoracic diseases” and in “infectious disease”
categories (Data [130]S12). Other disease categories displaying a net
prevalence of phenotypic associations for variants enriched among
severe were “immune system disease”, with multiple variants associated
with specific traits such as “Autoimmune diseases” “Immunodeficiency
with predominantly antibody defects” or “Noninfectious disorders of
lymphatic channels”, and “pancreatic disease” (Fig. [131]5a;
Data [132]S12).
Fig. 5. PheWAS analysis of most important variants.
[133]Fig. 5
[134]Open in a new tab
a Phenotype categories displaying the greatest fraction of specific
trait associations with variants enriched in severe versus asymptomatic
patients; b scatter plot showing variant-specific traits associated
within the “Respiratory or thoracic disease category”. Dot diameter is
proportional to the model support for each variant. The color is
proportional to the log-odds ratio of the variant in the two groups of
the cohort. Labels are printed only associations with PheWAS P value
<0.001 and PheWAS oddsratio >2.5 or for variants having non-zero
coefficients in at least one XGBoost model.
Two of the variants enriched among severe patients which had highest
importance in all our models (i.e. PCSK5 rs150021157 and PLEC
rs140300753) were significantly associated with the “Abnormalities of
breathing” phenotype (p val = 4E–06 and p val = 1.6E–04, respectively),
suggesting that patients carrying these variants might be at higher
risk due to pre-existing difficulties of breathing (Fig. [135]S6;
Data [136]S12).
Other general categories of traits that might be linked to severe
COVID-19, such as “Cardiovascular disease” or “Infectious disease”
showed similar distributions of associations of risk or mitigation
factors (Fig. [137]S7). Interestingly other categories, such as
“Integumentary system disease” showed instead a prevalence of
associations with mitigation factors (Fig. [138]S7).
Discussion
In this study, we have set up a multifaceted computational strategy to
dissect patient genetic variants which might interplay with the
SARS-Cov-2 virus to increase the risk of, or to protect from, a severe
response to infection.
We integrated into a stratified k-fold scheme a pipeline to perform
variant features screening followed by machine learning model training
and testing to robustly identify variants associated with severe
response to COVID-19 infection. Our pipeline allowed a drastic
reduction of the initial number of variants by several orders of
magnitudes: from an initial set of approximately 1 M unique variants
derived from WES to 1k variants receiving non-zero feature importance
in at least one of the tree-based models. By only considering the
variants with full support, i.e. always found to have non-zero feature
importance in all the tree-based models, we further reduced the pool to
only 16 variants. Models retrained with only full-support variants
(plus age and gender as covariates) achieved superior performances
(median AUCROC = 86%, best model AUCROC = 91%). Although models trained
with only patients age and gender already showed good performances in
severity prediction (median AUCROC = 80%), confirming the predictive
power of these covariates, the increase in performance followed by the
inclusion of curated genetic information provides the foundation for
integrated tools for COVID-19 severity forecast and patient
stratification. When tested on a follow-up cohort of more than 600 our
models achieved remarkable performances in identifying severe patients
with good accuracy (ACC = 81.88% and AUCROC = 96%), performing
considerably better than the ones obtained by training with only
covariates or variants (Fig. [139]2e; Data [140]S4).
The interpretability of our models allowed us to shed new light on the
complex landscape of genetic interactions with virus genetics which
contributed to a severe response to COVID-19 in an Italian cohort.
Among the 16 variants with 100% support, only 6 genes (37%) were
annotated in the largest pathway knowledgebase, i.e. Reactome^[141]22,
suggesting that unannotated variants might modulate the interaction
with the virus through yet-to-be-discovered biological mechanisms.
Intriguingly, we found that two of these highly supported variants,
i.e. chr9:34557898:A:T (CNTFR) and rs150021157 (PCSK5) interact within
the second-largest module identified on the interaction network of the
genes affected by mutations within our study. This cluster, which is
moreover the only one characterized by two fully supported variants, is
highly enriched in pathways linked to immune response and inflammation,
such as the such as JAK-STAT signaling pathway, Cytokine-cytokine
receptor interaction, and Interleukin-6 family signaling. The third
cluster, which cross-talks with the second one, involves processes
related to SMAD and TGF-β signaling, which were previously shown to be
modulated by SARS nucleocapsid proteins^[142]23.
We found that variants enriched in severe patients are involved in
cardiomyopathies processes, supporting the established notion that
patients with heart disease or its risk factors are at greater risk of
severe consequences following COVID-19 infection, including
hospitalization, ventilation, or death^[143]24. Additional processes
significantly enriched among severe mutations was ECM, whose importance
in mediating the interaction with viral particles have been highlighted
by affinity-purification proteomics experiments^[144]25. Recent
experiments also confirmed a role for integrins in binding to
UV-inactivated viral particles, through which outside-inside signaling
is elicited via binding to Gα13^[145]26. Vesicle-mediated transport,
such as clathrin-mediated endocytosis, has been shown to mediate a key
entry point for SARS^[146]27. The latter pathway has also been
confirmed to drive a chronic immune response in severe
COVID-19^[147]28. Moreover, C-type leptin receptors have been shown to
engage with the virus inducing robust pro-inflammatory responses in
myeloid cells that correlated with COVID-19 severity^[148]29.
On the other hand, some of the processes that we found significantly
enriched among asymptomatic patients have been previously put in
connection to SARS viral infection. For example, members of the
machinery for DNA damage response have been shown to interact and
affect the response to several DNA and RNA viruses^[149]30 and it has
been recently demonstrated that these pathways are also triggered by
SARS-CoV-2 in vitro cellular models^[150]31. The Fanconi anemia pathway
is tightly linked to DNA repair processes involving homologous
recombination and genome integrity^[151]32. We therefore speculate that
patients carrying variants on these pathways might differently interact
with the virus, modulating a milder response to viral infection.
Several identified processes offer druggable options for therapeutic
treatment. Androgen receptor signaling and its genetic variability have
been already linked to COVID-19 severity^[152]11,[153]33 and its
inhibition proposed as a therapeutic strategy (e.g^[154]34.). We found
several GPCR signaling instances significantly enriched in our network,
in particular those related to G[i] and G[q] signaling, which mediate
vascular inflammation. In particular, the G[q] pathway contributes to
regulating calcium signaling, which is one of the most enriched
processes in our dataset and which leads to endothelial barrier
disruption via adherens junction disassembly^[155]35. On the other
hand, G[q] signaling might also contribute to transactivate JAK-STAT
pathway via (ERK)1/2 signaling^[156]35, the latter in turn also
activated by G[i] signaling^[157]36. It has also been recently shown
that the C5a–C5aR1 axis, which also signals intracellularly through
G[q], plays a key role in the pathophysiology of ARDS associated with
COVID-19 by starting and maintaining several inflammatory responses
through the recruitment and activation of neutrophils and
monocytes^[158]37. Hence, similarly to what we and others previously
described in cancer^[159]38, genetic factors converging on modulating
common GPCR downstream signaling pathways might also contribute to the
onset of the inflammatory response related to COVID-19, at the same
time offering new therapeutic intervention options for patients with
severe forms of COVID-19. The recent finding that autoantibodies
targeting GPCRs are associated with COVID-19 severity^[160]39, further
strengthens these receptors as therapeutic candidates.
We found multiple, recurrent disease traits associated with the
variants identified. The variants rs150021157 and rs140300753,
characterized by full support during supervised learning, also provide
an example of associations to phenotypes that might play a role in
COVID-19 severity, such as “Abnormalities of breathing phenotype”. Some
categories show a prevalence of associations with risk factors, such as
“respiratory or thoracic disease”, including specific traits such as
chronic bronchitis, emphysema or COPD (the latter also found in the
“infectious disease” category). Other categories enriched for
associations with variants enriched in severe patients are “immune
system disorders”, including traits such as immunodeficiency with
antibody defects, or “pancreas disease”, including several instances
mainly associated to Type 2 diabetes, which is a known risk factor for
severe COVID-19^[161]40 and whose molecular connection to cytokine
storm inflammatory response has now begun to emerge^[162]17,[163]41.
Taken together, these results further corroborate our analysis.
Our model is complementary to previous and ongoing efforts entailing
machine learning techniques (i.e. LASSO logistic regression models) and
a boolean representation of genetic variants to identify the most
informative features associated to severity to compile an Integrated
PolyGenic Score for COVID-19 severity predictions^[164]17,[165]18.
While we expect that some of the variants identified in this study
might be specific for the Italian population, we believe that our
approach could be readily trained on different cohorts to identify
additional biomarkers for patient stratification in the clinics. Our
capability to understand and forecast the genetic factors contributing
to COVID-19 disease severity will certainly benefit from the
availability of larger sequencing cohorts, the usage of more advanced
methods for case-control associations in WES studies, new
methodological advancement in the explainable AI field, as well as on
our prior- or data-driven knowledge of biological mechanisms linking
genetic variants to disease phenotypes.
Methods
Dataset and pre-processing
We used the whole-exome sequencing (WES) dataset of 1982 European
descent patients collected from the GEN-COVID Multicenter Study group
coordinated by the University of Siena
([166]https://clinicaltrials.gov/ct2/show/NCT04549831)^[167]14.
Briefly, the GEN-COVID Multicenter Study includes a network of 22
Italian hospitals as well as local healthcare units and departments of
preventative medicine
([168]https://sites.google.com/dbm.unisi.it/gen-covid). It started its
activity on March 16, 2020, following approval by the Ethical Review
Board of the Promoter Center, University of Siena (Protocol n. 16917,
approval dated March 16, 2020). Written informed consent was obtained
from all individuals who contributed samples and data. Detailed
clinical and laboratory characteristics (data), specifically related to
COVID-19, were collected for all subjects.
Specifically, the WES dataset contained a total of 1.057 M unique
simple variants. Patients were classified according to the grading
scheme by the World Health Organization (WHO). The grading
classification contained the following categories: 0=not hospitalized
(a- or pauci-symptomatic); 1=hospitalized without respiratory support;
2=hospitalized O2 supplementation; 3=hospitalized CPAP-biPAP; 4=
hospitalized intubated; 5=dead. We considered patients from more severe
groups, i.e. 3,4, and 5, as cases, and asymptomatic patients from group
0, as controls, for a total of 1078 patients. We further refined the
grading classification based on an ordinal logistic model which uses
age as input feature for sex-stratified patients^[169]17 and we
retained only those patients whose grading classification was
concordant with the one adjusted by age. This yielded a final set of
841 samples for downstream analysis.
Statistics and reproducibility
We employed the cohort of 841 patients to identify variants most
associated to COVID-19 severity which we used, along with clinical
co-variates such as age and sex, to train and test supervised binary
classifiers of severity. We finally tested our ensemble of predictors
on two unseen cohorts of patients: 618 individuals (122 asymptomatic,
496 severe), from a follow-up cohort of sequenced patients, and a set
of 375 unique patients that were excluded from the original as well as
the follow-up cohort due to inconsistencies between the original WHO
grading classification and the one outputted by an ordinal logistic
regression adjusted by age^[170]17.
We detail below the statistical procedure employed.
Stratified K-fold split of sample cohort into train and test sets
We embedded a strategy for variant screening into a stratified
five fold cross-validation scheme (using the StratifiedKFold function
from the scikit-learn library [171]https://scikit-learn.org/) to
generate 5 random training and testing set splits of the original
dataset. Each fold was constituted by a training set, corresponding to
80% of the dataset, which was also employed for variant screening and a
remaining 20% for the testing set. The variants in the test set were
curated from the variants screened in the training set. Through the
stratified fivefold approach, we made sure that all the samples of the
dataset were employed for testing.
Variant screening
GATK best-practices for germline variant calling pipeline were
employed, as described in our previous work aimed at characterizing
common, low-frequency, rare, and ultra-rare coding variants contribute
to COVID-19 severity^[172]17. We employed a Log-Odds Ratio (LOR)
statistics calculated on a 2×2 contingency Data to perform case-control
association and to screen variants associated with either severe or
asymptomatic patients in each of the training sets for each of the
five folds generated. We grouped severe patients from clinical groups
5, 4, and 3 which were contrasted against the asymptomatic ones,
considered as controls (group 0). We defined a contingency Data to
measure the enrichment of reference (Ref) or alternative (Alt) alleles
in either severe or control groups by employing an additive model,
whereby homozygous genotype (1/1) has twice the risk (or protection) of
the heterozygous type (0/1 or 1/0). We employed the Data2x2 function
from the statsmodels library
([173]https://www.statsmodels.org/sData/index.html) to calculate LORs
values and associated p-values and confidence intervals from the the
contingency Data in Fig. [174]S8, respectively employing the functions
log_oddsratio, log_oddsratio_pvalue() and log_oddsratio_confint(). We
filtered variants with the following characteristics:
[MATH:
p−value<
mo><0.05 :MATH]
and
[MATH: LOR≥1. :MATH]
Variants with LOR > 1 are enriched among severe, while those with LOR <
-1 are enriched among asymptomatics.
Feature matrix generation
For each split, we generated a feature matrix for the training set by
assigning the allele counts of each screened variant for each sample of
the training: i.e. 0 for genotype 0/0, 1 for genotypes 1/0 or 0/1, 2
for genotype 1/1. The feature matrix for the test set was defined by
considering only variants identified as significant after screening the
training set of the corresponding split and by assigning the allele
count of each sample of the test set. We also included as additional
features age, which was normalized, and gender, which was binarized by
setting males to 0 and females to 1. Severe patients from group
“3 + 4 + 5” were given the classification label “1”, the asymptomatic
patients from group 0 were given the label “0”.
Feature selection (removal of multicollinearity)
We employed feature selection techniques to further reduce the number
of considered features initially screened through the Log-Odds-Ratio
statistics. We tried several approaches, including Lasso, ElasticNet
and Multicollinearity, in combination with supervised training
approaches (see below). After training several classifiers with the
variants selected with each of these methods on a smaller cohort of
1200 samples, we found that removing multicollinearity from features by
considering variant allele counts with correlation coefficients
(corr.≤|0.8 | ) gave the best results. The screened features with
little or no effects of multicollinearity formed the final 80% training
sets in each fold and the final 20% corresponding validation sets used
for training the supervised machine learning models.
Supervised binary classification
We trained supervised learning models for binary classification tasks
by employing several algorithms, i.e. Support Vector Machine, Logistic
Regression, Random Forest, and Extreme Gradient Boosting classifiers,
available within the scikit-learn python library
([175]https://scikit-learn.org/).
Support Vector Classifier (SVC): a popular machine learning method that
classifies data points utilizing the concept of hyper-plan and kernel
tricks to find fits that best separate the data cloud. In this study,
we used the popular Jupyter notebook and scikit-learn python package to
import the “sklearn.svm” SVC classifier model. We first set the SVC
default regularization parameter “C” to 1, the class weight to
“balanced” in order to account for imbalanced classification problems
in the dataset. The default linear kernel was used first with the
prediction probability set to true. The GridSearchCV was used to select
the best hyperparameter values for the estimator “C”, “gamma”, and the
kernel (Linear, Radial Basis Function (RBF), and polynomial) that are
critical to the performance of the SVC classifier. The best
GridSearchCV estimator hyperparameter values that were used to train
our dataset were identified as the RBF kernel, C = 10, and gamma set to
0.1.
Logistic Regression: a binary classification regression model that uses
the logistic function to estimate the parameters of the logistic model.
We import from the scikit-learn package the “sklearn.linear_model” the
Logistic Regression model function. We first set the default logistic
model classifier parameters; “class weight = balanced”, C = 0.3 and
solver = sag. The best GirdSearchCV estimator values used to train our
dataset uses the regularization penalty of l1 (Lasso), C = 0.7, and
solver = saga.
Random Forest (RF): an ensemble learning method that employs a bagging
strategy. Multiple decision trees are trained using the same learning
algorithm, and then predictions are aggregated from the individual
decision tree. From the “sklearn.ensemble” library, we import the
Random Forest Classifier function. The RF default model parameters use
a class weight set to “balanced”, maximum depth (max_depth) of the
decision trees was set to 80, the number of features (max_features) was
set to 2, minimum samples (min_samples_leaf) leaf of 3, minimum samples
split (min_samples_split) of 10, and the number of trees (n_estimators)
in the forest was set to 300. The GridSearchCV best model estimator
parameters were “bootstrap = True”, “max_depth” = 110, “max_features” =
2, “min_samples_leaf” = 5, “min_samples_split” = 10, and “n_estimators”
= 100.
Extreme Gradient Boosted Trees classifier (XGBoost): an ensemble
learning classifier family that utilizes boosting strategy to combine a
set of weak learners and delivers improved prediction accuracy. We
import from the XGBoost package “xgboost” library and xgboost function.
We defined the data matrix (training feature set and classification
label). We set the default XGBoost classifier model parameters class
weight to “balanced”, learning objective to “binary logistic”. The best
GridSearchCV estimator parameters values we used to train the dataset
were “learning_rate” = 0.01, “max_depth” = 3, “n_estimators” = 140.
In summary, for each of the four ML models, we performed a parameter
optimization through grid search (GridSearchCV), using the
accuracy_score during grid search as a scoring method. We performed a
fivefolds cross-validation, by splitting 80% for training and 20% for
validation in each fold, repeated three times, using the
StratifiedKFold function with n_splits = 5 and n_repeats = 3. We also
set the class weight parameter to “balanced” in each of the ML
algorithms employed. Both model training and hyperparameters
optimization was done with a Python Jupyter notebook interactive
web-based development environment using the scikit-learn and the
xgboost packages. Model performances on the testing set were evaluated
through the following metrics: Accuracy, F1, Precision, Recall, Matthew
correlation coefficient (MCC), AUCROC.
A consensus voting approach was used to aggregate validation prediction
probability scores of the four ML algorithms (SVC, Logistic Regression,
Random Forest, and XGBoost classifiers) from each of the (20%) testing
sets from each fold by considering the median of the probability
distribution collected from the ensemble of models. The features
(variants) that received non-zero weight during training of the
supervised ML methods (Random Forest and XGBoost classifiers) in each
fold were combined across the fivefold for further interpretation.
We performed a randomization test (i.e. Salzberg’s test) to assess
over-fitting (Salzberg, 1997), where we replace the original phenotypic
labels of the training matrix with randomly assigned labels while
preserving the ratio of the number of positive (severe) and negative
(asymptomatic) patients (Data [176]S13).
Feature importance scores
The feature importance assigns weight scores to individual features
that interact to predict a particular event in the model. Feature
importance for RandomForest and XGBoost models were calculated as the
mean decrease in impurity for the feature using the feature importances
function from xgboost. The feature importance (weights) scores assigned
from these models’ predictions were aggregated across the fivefolds to
prioritize variants according to their consensus importance across
folds for further downstream analysis. In particular we defined the
model support (ms) of a given variant as the fraction of tree-based
models assigning non-zero feature importance during the training of the
model.
Final testing on a follow-up cohort
We tested the best performing models trained using most supported
variants with and without covariates on a followup cohort of sequenced,
Italian patients. An initial set of 838 samples corresponding to
grading groups 0, 3, 4 and 5 were refined by applying the same ordered
logistic regression classification adjusted_by_age, which yielded a
final set of 618 individuals (122 asymptomatic, 496 severe). We
generated an additional testing test by considering all the samples
that were previously excluded due to inconsistency between the original
WHO grading classification and the one outputted by an ordinal logistic
regression adjusted by age classifier^[177]17. In details, in the
original cohort that we used for training the model, there were 237
samples from either asymptomatic (grading 0) or severe (grading
3 + 4 + 5) patients that were excluded due to classification
inconsistencies, while in the follow-up cohort used for final testing
of the model, 220 more individuals were excluded according to the same
criteria. After removing patients with missing values, we obtained an
aggregated list of 375 unique patients. We curated the allele counts of
the 16 most informative variants, identified in the first stage of the
analysis and model training, from this new set of patients and we used
them, together with age and gender, as features for the testing. We
evaluated the performances of the ensemble of the 20 models both on an
individual as well as on an aggregated level, by calculating aggregated
metrics obtained from the median of the probability distribution
outputted by the ensemble of the 20 models on the testing samples.
Principal component analysis (PCA) and clustering
The variants with non-zero weights from best performing tree-based
models were remapped back into the feature space to form a new feature
count matrix covering 100% of the samples (i.e. 841 individuals). This
reduced feature matrix was analyzed using Principal Component Analysis
(PCA) techniques to reduce the dimensional space. In order for us to do
this, we utilized the “sklearn.decomposition” library to import the PCA
function. We standardized the feature count matrix using the
“sklearn.preprocessing” library to import the Standard Scaler function.
We transformed the normal feature count matrix considering the 1st and
2nd PCA components. We further employ the K-means clustering technique
(using the “sklearn.cluster” library to import the “KMeans” function)
to visualize and cluster the 2D PCA components (1^st and 2^nd
dimensions). We set the default cluster size to 3, the maximum
iteration (max_iter=1000), and a tolerance value (tol=1E–04). Clusters
of patients that express interesting severity patterns were further
analyzed using the pathway enrichment for biological interpretations
and implications.
Pathway enrichment analysis
The pathway enrichment analysis was done using the ReactomeFIViz
plugin^[178]42 available in Cytoscape^[179]43. The genes corresponding
to variants with non-zero feature importance from XGBoost were used to
construct a Functional Interaction (FI) network. The general FI network
comprised all the genes affected by variants with non-zero feature
importances in both patient groups. Node diameter is proportional to
the number of variants with non-zero coefficients in any tree-based
models. Node color is instead proportional to the LOR with the highest
absolute value among the variants associated with a given gene. Modules
within the network were identified through spectral partition
clustering^[180]44. Reactome pathways over-representation analysis
(FDR<0.1) was calculated on either the whole network or for each
individual module. We also generated group-specific networks by keeping
separated genes with variants enriched in severity from those enriched
in asymptomatic and performed pathway over-representation analysis (FDR
< 0.1) on the distinct networks.
Retrieving associations between variants and disease traits or phenotypes
We retrieved associations among the variants identified in our study
and disease traits or phenotypes through the Open Targets Genetics
platform^[181]21. We interrogated the database using the GraphQL query
language embedded in a python script and by inputting the variant
coordinates (given by chromosome nr, position, Ref, and Alt allele).
For each PheWAS association, we retrieved the following data: eaf,
beta, se, nTotal, nCases, oddsRatio, studyId and pval. Only PheWAS with
oddsRatio > 1 and p val <0.001 were considered. The statistics were
done only for the variants with non-zero feature importance from
XGBoost models.
All the analyses were performed using customized Python (v3.8) scripts,
with the following libraries: scipy (v1.2.0), numpy (v1.19.4),
scikit-learn (v0.23.2.), statsmodels (v0.11.0) and matplotlib (v3.2.1).
Reporting summary
Further information on research design is available in the [182]Nature
Research Reporting Summary linked to this article.
Supplementary information
[183]Peer Review File^ (1.1MB, pdf)
[184]Supplementary Information^ (8.1MB, pdf)
[185]42003_2022_4073_MOESM3_ESM.pdf^ (486KB, pdf)
Description of Additional Supplementary Files
[186]Supplementary Data 1-13^ (2MB, xlsx)
[187]Reporting Summary^ (3.1MB, pdf)
Acknowledgements