Abstract

   There are only a few platforms that integrate multiple omics data
   types, bioinformatics tools, and interfaces for integrative analyses
   and visualization that do not require programming skills. Here we
   present iLINCS ([74]http://ilincs.org), an integrative web-based
   platform for analysis of omics data and signatures of cellular
   perturbations. The platform facilitates mining and re-analysis of the
   large collection of omics datasets (>34,000), pre-computed signatures
   (>200,000), and their connections, as well as the analysis of
   user-submitted omics signatures of diseases and cellular perturbations.
   iLINCS analysis workflows integrate vast omics data resources and a
   range of analytics and interactive visualization tools into a
   comprehensive platform for analysis of omics signatures. iLINCS
   user-friendly interfaces enable execution of sophisticated analyses of
   omics signatures, mechanism of action analysis, and signature-driven
   drug repositioning. We illustrate the utility of iLINCS with three use
   cases involving analysis of cancer proteogenomic signatures, COVID-19
   transcriptomic signatures, and mTOR signaling.

   Subject terms: Computational platforms and environments, Data mining,
   Target identification, Drug development, Translational research

Introduction

   Transcriptomics and proteomics (omics) signatures in response to
   cellular perturbations consist of changes in gene or protein expression
   levels after the perturbation. An omics signature is a high-dimensional
   readout of cellular state change that provides information about the
   biological processes affected by the perturbation and
   perturbation-induced phenotypic changes of the cell. The signature on
   its own provides information, although not always directly discernible,
   about the molecular mechanisms by which the perturbation causes
   observed changes. If we consider a disease to be a perturbation of the
   homeostatic biological system under normal physiology, then the omics
   signature of a disease consists of the differences in gene/protein
   expression levels between diseased and non-diseased tissue samples.

   The low cost and effectiveness of transcriptomics assays^[75]1–[76]4
   have resulted in an abundance of transcriptomics datasets and
   signatures. Recent advances in the field of high-throughput proteomics
   made the generation of large numbers of proteomics signatures a
   reality^[77]5,[78]6. Several recent efforts were directed at the
   systematic generation of omics signatures of cellular
   perturbations^[79]7 and at generating libraries of signatures by
   re-analyzing public domain omics datasets^[80]8,[81]9. The recently
   released library of integrated network-based cellular signatures
   (LINCS)^[82]7 L1000 dataset generated transcriptomic signatures at an
   unprecedented scale^[83]2. The availability of resulting libraries of
   signatures opens exciting new avenues for learning about the mechanisms
   of diseases and the search for effective therapeutics^[84]10.

   The analysis and interpretation of omics signatures has been intensely
   researched. Numerous methods and tools have been developed for
   identifying changes in molecular phenotypes implicated by
   transcriptional signatures based on gene set enrichment, pathway, and
   network analysis approaches^[85]11–[86]13. Directly matching
   transcriptional signatures of a disease with negatively correlated
   transcriptional signatures of chemical perturbations (CP) underlies the
   Connectivity Map (CMAP) approach to identifying potential drug
   candidates^[87]10,[88]14,[89]15. Similarly, correlating signatures of
   chemical perturbagens with genetic perturbations of specific genes has
   been used to identify putative targets of drugs and chemical
   perturbagens^[90]2.
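
   The signature-matching idea described above can be sketched in a few
   lines of Python. This is a minimal illustration under stated
   assumptions, not the iLINCS or CMAP implementation: it uses a plain
   Pearson correlation over shared genes as a stand-in similarity
   measure, and all gene names, perturbagen names, and values are made up
   for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_by_connectivity(query, library, min_shared=3):
    """Correlate a query signature with each library signature.

    Signatures are dicts mapping gene -> differential expression.
    Returns (name, correlation) pairs sorted from most negative
    (candidate reversers) to most positive (candidate mimics).
    """
    scores = {}
    for name, signature in library.items():
        shared = sorted(set(query) & set(signature))
        if len(shared) < min_shared:
            continue  # too few shared genes to correlate
        scores[name] = pearson(
            [query[g] for g in shared], [signature[g] for g in shared]
        )
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical disease signature and perturbation library (toy values).
disease = {"G1": 2.0, "G2": -1.5, "G3": 0.5, "G4": -0.8}
library = {
    "drug_A": {"G1": -1.8, "G2": 1.2, "G3": -0.4, "G4": 0.9},
    "drug_B": {"G1": 1.9, "G2": -1.4, "G3": 0.6, "G4": -0.7},
    "drug_C": {"G1": 0.1, "G2": 0.2, "G3": -0.1, "G4": 0.0},
}
ranking = rank_by_connectivity(disease, library)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

   In this toy example the signature of drug_A is the most negatively
   correlated with the disease signature, which is the pattern the CMAP
   approach looks for when proposing drug candidates, whereas drug_B
   mimics the disease signature.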

   To fully exploit the information contained within omics signature
   libraries and within the countless omics signatures generated
   continually by investigators around the world, new user-friendly
   integrative tools, accessible to a large segment of the biomedical
   research community, are needed to bring these data together. The
   integrative LINCS (iLINCS) portal brings together libraries of
   precomputed signatures, formatted datasets, and connections between
   signatures, and integrates them with a bioinformatics analysis engine
   and streamlined user interfaces into a powerful system for omics
   signature analysis.

Results

   iLINCS (available at [91]http://ilincs.org) is an integrative
   user-friendly web platform for the analysis of omics (transcriptomic
   and proteomic) datasets and signatures of cellular perturbations. The
   key components of iLINCS are: interactive and interconnected analytical
   workflows for the creation and analysis of omics signatures; the large
   collection of datasets, precomputed signatures, and their connections;
   and user-friendly graphical interfaces for executing analytical tasks
   and workflows.

   The central concept in iLINCS is the omics signature, which can be
   retrieved from the precomputed signature libraries within the iLINCS
   database, submitted by the user, or constructed using one of the iLINCS
   datasets (Fig. [92]1a). The signatures in iLINCS consist of the
   differential gene or protein expression levels and associated P values
   between perturbed and baseline samples for all, or any subset of
   measured genes/proteins. Signatures submitted by the user can also be
   in the form of a list of genes/proteins, or a list of up- and
   downregulated genes/proteins. The iLINCS backend database contains
   >34,000 processed omics datasets, >220,000 omics signatures and >10^9
   statistically significant “connections” between signatures. Omics
   signatures include transcriptomic signatures of more than 15,000
   chemicals and genetic perturbations of more than 4400 genes
   (Fig. [93]1a). Omics datasets available for analysis and signature
   creation cover a wide range of diseases and include transcriptomic
   (RNA-seq and microarray) and proteomic (Reverse Phase Protein
   Arrays^[94]16 and LINCS-targeted mass spectrometry proteomics^[95]5)
   datasets. Dataset collections include a nearly complete collection of
   GEO RNA-seq datasets and various other dataset collections, such as The
   Cancer Genome Atlas (TCGA) and GEO GDS microarray datasets^[96]17. A
   detailed description of iLINCS omics signatures and datasets is
   provided in “Methods”. Analysis of 8942 iLINCS datasets from GEO,
   annotated by MeSH terms^[97]18, shows a wide range of disease coverage
   (Fig. [98]1a).
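
   The two signature forms described above (full differential expression
   values with P values, and simple up/down gene lists) can be related
   with a short sketch. The class and function names, the P value cutoff,
   and the toy values below are assumptions made for illustration; they
   are not the iLINCS data model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SignatureEntry:
    """One row of a signature: a gene/protein with its change and P value."""
    gene: str
    diff_expression: float  # perturbed minus baseline (e.g., log2 scale)
    p_value: float


def to_up_down_lists(
    signature: List[SignatureEntry], p_cutoff: float = 0.05
) -> Tuple[List[str], List[str]]:
    """Reduce a full signature to lists of up- and downregulated genes."""
    significant = [e for e in signature if e.p_value < p_cutoff]
    up = [e.gene for e in significant if e.diff_expression > 0]
    down = [e.gene for e in significant if e.diff_expression < 0]
    return up, down


# Hypothetical signature with made-up values.
toy_signature = [
    SignatureEntry("GENE_A", 1.8, 0.001),
    SignatureEntry("GENE_B", -2.1, 0.0004),
    SignatureEntry("GENE_C", 0.2, 0.60),
    SignatureEntry("GENE_D", -0.9, 0.03),
]
up_genes, down_genes = to_up_down_lists(toy_signature)
print("up:", up_genes)
print("down:", down_genes)
```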

Fig. 1. Integrative omics signature analysis in iLINCS.

   a A signature can be selected by querying the iLINCS database,
   submitted by the user, or constructed by analyzing an iLINCS omics
   dataset. Signatures in the database include chemical and genetic
   perturbation, and a wide range of disease-related signatures. The
   datasets cover a wide range of human diseases. b The signature can be
   analyzed using a range of systems biology methods (gene set enrichment,
   pathway and network analyses). c Signature “connectivity” analyses can
   be applied to identify cellular perturbations and biological states of
   similar signatures. d The analysis of connected signatures, as well as
   the identity of the perturbed genes and proteins leading to the
   connected signatures, can be used to elucidate mechanisms of action. e
   Ultimately, the results of the analyses lead to insights and hypotheses
   about potential therapeutic targets and therapeutic agents.

   iLINCS analytical workflows facilitate systems biology interpretation
   of the signature (Fig. [101]1b) and the connectivity analysis of the
   signature with all iLINCS precomputed signatures (Fig. [102]1c).
   Connected signatures can further be analyzed in terms of the patterns
   of gene/protein expression level changes that underlie the connectivity
   with the query signature, or through the analysis of gene/protein
   targets of connected perturbagens (Fig. [103]1d). Ultimately, the
   multi-layered systems biology analyses and the connectivity analyses
   lead to biological insights and to the identification of therapeutic
   targets and putative therapeutic agents (Fig. [104]1e).

   Interactive analytical workflows in iLINCS facilitate signature
   construction through differential expression analysis as well as
   clustering, dimensionality reduction, functional enrichment, signature
   connectivity analysis, pathway and network analysis, and integrative
   interactive visualization. Visualizations include interactive scatter
   plots, volcano and GSEA plots, heatmaps, and node-and-stick diagrams of
   pathways and networks (Supplemental Fig. [105]1). Users can download raw
   data and signatures, analysis results, and publication-ready graphics.
   The iLINCS internal analysis and visualization engine uses R^[106]19
   and open-source visualization tools. iLINCS also facilitates seamless
   integration with a wide range of task-specific online bioinformatics
   and systems biology tools and resources including Enrichr^[107]20,
   DAVID^[108]21, ToppGene^[109]22, Reactome^[110]23, KEGG^[111]24,
   GeneMania^[112]25, X2K Web^[113]26, L1000FWD^[114]27, STITCH^[115]28,
   Clustergrammer^[116]29, piNET^[117]30, LINCS Data Portal^[118]31,
   ScrubChem^[119]32, PubChem^[120]33 and GEO^[121]34. Programmatic access
   to iLINCS data, workflows and visualizations is facilitated by calls to
   the iLINCS API, which is documented with the OpenAPI community
   standard. Examples of utilizing the iLINCS API within data analysis
   scripts are provided on GitHub
   ([122]https://github.com/uc-bd2k/ilincsAPI). The iLINCS software
   architecture is described in Supplemental Fig. [123]2.

Use cases

   iLINCS workflows facilitate a wide range of possible use cases.
   Querying iLINCS with user-submitted external signatures enables
   identification of connected perturbation signatures, and answering
   in-depth questions about expression patterns of individual genes or
   gene lists of interest in specific datasets, or across classes of
   cellular perturbations. Querying iLINCS with individual genes or
   proteins can identify sets of perturbations that significantly affect
   their expression. Such analysis leads to a set of chemicals, or genetic
   perturbations, that can be applied to modulate the expression and
   activity of the corresponding proteins. Queries with lists of genes
   representing a hallmark of a specific biological state or
   process^[124]35 can identify a set of perturbations that may
   accordingly modify cellular phenotype. iLINCS implements complete
   systematic polypharmacology and drug repurposing^[125]36,[126]37
   workflows, and has been listed as a bioinformatics resource for cancer
   immunotherapy studies^[127]38 and multi-omics computational
   oncology^[128]39. Most recently, iLINCS has been used in a drug
   repurposing workflow that combines the search for drug repurposing
   candidates via CMAP analysis with validation through the analysis of
   electronic health records^[129]40. Finally, iLINCS removes technical
   barriers to re-using any of the more than 34,000 preprocessed omics
   datasets, enabling users to construct and analyze new omics signatures
   without any data generation and with only a few mouse clicks.

   Here, we illustrate the use of iLINCS in detecting and modulating
   aberrant mTOR pathway signaling, in the analysis of proteogenomic
   signatures in breast cancer, and in the search for COVID-19
   therapeutics. It is important to emphasize that all analyses were
   performed by navigating the iLINCS GUI within a web browser, and each
   use case can be completed in less than five minutes. Step-by-step
   instructions are provided in the Supplemental Materials (Supplemental
   Workflows [130]1, [131]2, and [132]3). In addition, links to
   instructional videos that demonstrate how to perform these analyses are
   provided on the landing page of iLINCS at ilincs.org. The same analyses
   can also be performed programmatically using the iLINCS API. R
   notebooks demonstrating this can be found on GitHub
   ([133]https://github.com/uc-bd2k/ilincsAPI).

Use case 1: detecting and modulating aberrant mTOR pathway signaling

   Aberrant mTOR signaling underlies a wide range of human
   diseases^[134]41. It is associated with age-related diseases such as
   Alzheimer’s disease^[135]42 and the aging process itself^[136]41. mTOR
   inhibitors are currently the only pharmacological treatment shown to
   extend lifespan in model organisms^[137]43, and numerous efforts in
   designing drugs that modulate the activity of mTOR signaling are under
   way^[138]41. We use mTOR signaling as the prototypical example to
   demonstrate the utility of iLINCS in identifying chemical perturbagens
   capable of modulating a known signaling pathway driving the disease
   process, in establishing the mechanism of action (MOA) of a chemical
   perturbagen, and in detecting aberrant signaling in diseased tissue.
   Detecting changes in mTOR signaling activity in transcriptomic data is
   complicated by the fact that such changes are not reflected in changes
   in expression of mTOR pathway genes, and standard pathway analysis
   methods are not effective^[139]44. We show that the CMAP analysis
   approach, facilitated by iLINCS, is essential for the success of these
   analyses. Step-by-step instructions for performing this analysis in
   iLINCS are provided in Supplemental Workflow SW[140]1.

   Identifying chemicals that can modulate the activity of a specific
   pathway or a protein in a specific biological context is often the
   first step in translating insights about disease mechanisms into
   therapies that can reverse disease processes. Here we demonstrate the
   use of iLINCS in identifying chemicals that can inhibit mTOR activity.
   We use the Consensus Genes Signature (CGS) of the CRISPR mTOR genetic
   loss-of-function perturbation in the MCF-7 cell line as the query
   signature. The CMAP analysis identifies 258 LINCS CGSes and 831 CP
   signatures with statistically significant correlation with the query
   signature. The top 100 most connected CGSes are dominated by the
   signatures of genetic perturbations of the mTOR and PIK3CA genes
   (Fig. [141]2a), whereas all of the top 5 most frequent inhibition
   targets of CPs among the top 100 most connected CP signatures are mTOR
   and PIK3 proteins (Fig. [142]2b). The results clearly indicate that the
   query mTOR CGS is highly specific and sensitive to perturbation of the
   mTOR pathway and effectively identifies chemical perturbagens capable
   of inhibiting mTOR signaling. The full list of connected signatures is
   shown in Supplemental Data SD[143]1. The connected CP signatures also
   include several chemical perturbagens with highly connected signatures
   that have not been known to target mTOR signaling, providing new
   candidate inhibitors.

Fig. 2. Analysis of LINCS L1000 signatures of genetic and chemical
perturbations.

   a Most frequently perturbed genes among the Consensus Genes Signatures
   (CGS) connected to the mTOR knockdown CGS. b Most frequent inhibition
   targets of chemical perturbagens with signatures connected to the mTOR
   CGS signature. c Most enriched biological pathways for the everolimus
   signature. d Most frequently perturbed genes among CGSes connected with
   everolimus signature, and pathways most enriched by the perturbed
   genes. e Most frequent inhibition targets of chemical perturbagens with
   signatures connected to the everolimus signature and the pathways most
   enriched by the genes of the targeted proteins.

   Identifying proteins and pathways directly targeted by a bioactive
   chemical using its transcriptional signature is a difficult problem.
   Transcriptional signatures of a chemical perturbation often carry only
   an echo of such effects since the proteins directly targeted by a
   chemical and the interacting signaling proteins are not
   transcriptionally changed. iLINCS offers a solution for this problem by
   connecting the CP signatures to LINCS CGSes and facilitating a
   follow-up systems biology analysis of genes whose CGSes are highly
   correlated with the CP signature. This is illustrated by the analysis
   of the perturbation signature of the mTOR inhibitor drug everolimus
   (Fig. [146]2c–e). Traditional pathway enrichment analysis of this CP
   signature via iLINCS connection to Enrichr (Fig. [147]2c) fails to
   identify the mTOR pathway as being affected. In the next step, we first
   connect the CP signature to LINCS CGSes and then perform pathway
   enrichment analysis of genes with correlated CGSes. This analysis
   correctly identifies the mTOR signaling pathway as the top affected
   pathway (Fig. [148]2d). Similarly, connectivity analysis with other CP
   signatures followed by the enrichment analysis of protein targets of
   the top 100 most connected CPs again identifies the PI3K-Akt signaling
   pathway as one of the most enriched (Fig. [149]2e). In conclusion, both
   pathway analysis of differentially expressed genes in the everolimus
   signature and pathway analysis of connected genetic and chemical
   perturbagens provide us with important information about the effects of
   everolimus. However, only the analysis of connected perturbagens
   correctly pinpoints the direct mechanism of action of everolimus,
   which is the inhibition of mTOR signaling.
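
   The second step of the connectivity-based analysis described above,
   testing whether the genes behind connected signatures are
   over-represented in a pathway, can be sketched as follows. This is a
   minimal illustration that assumes a one-sided Fisher exact
   (hypergeometric) test; the gene universe, the two pathways, and the
   list of connected genes are hypothetical toy data, and the code is not
   the procedure used by iLINCS or Enrichr.

```python
from math import comb


def enrichment_p_value(universe, pathway, selected):
    """One-sided Fisher exact (hypergeometric) P value.

    Probability of seeing at least the observed overlap between
    `selected` and `pathway` when `selected` is drawn at random
    from `universe`.
    """
    universe = set(universe)
    pathway = set(pathway) & universe
    selected = set(selected) & universe
    n, k, m = len(universe), len(pathway), len(selected)
    overlap = len(pathway & selected)
    tail = sum(
        comb(k, i) * comb(n - k, m - i)
        for i in range(overlap, min(k, m) + 1)
    )
    return tail / comb(n, m)


# Hypothetical universe of 20 genes whose knockdown signatures exist.
universe = [f"G{i}" for i in range(1, 21)]
pathways = {
    "mTOR-like (toy)": {"G1", "G2", "G3", "G4", "G5"},
    "unrelated (toy)": {"G10", "G11", "G12", "G13", "G14"},
}
# Genes whose knockdown signatures correlate with the drug signature.
connected_genes = {"G1", "G2", "G3", "G10"}

p_values = {
    name: enrichment_p_value(universe, members, connected_genes)
    for name, members in pathways.items()
}
ranked = sorted(p_values.items(), key=lambda item: item[1])
for name, p in ranked:
    print(f"{name}: P = {p:.4f}")
```

   The point of the design is that the enrichment test is applied to the
   perturbed genes of connected signatures rather than to the
   differentially expressed genes of the drug signature itself.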

   The connectivity-based pathway analysis shares methodological
   shortcomings with the standard enrichment/pathway analyses of lists of
   differentially expressed genes, such as, for example, overlapping
   pathways. While the mTOR pathway shows the strongest association with
   everolimus, other pathways were also significantly enriched. A closer
   examination of the results indicates that this is due to the core mTOR
   signaling cascade being included as a component of other pathways:
   many of the genes that drive the associations with the other four most
   enriched pathways are shared with the mTOR pathway (Fig. [150]3).

Fig. 3. Perturbation gene targets in enriched pathways.

   Yellow squares indicate the membership of the target gene (rows) in the
   corresponding pathway (columns).
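
   A membership matrix of this kind, and the degree to which one
   pathway's association is driven by genes it shares with another, can
   be computed with a short sketch. The pathway names, gene identifiers,
   and target list below are hypothetical placeholders, not the data
   behind the figure.

```python
def membership_matrix(pathways, genes):
    """Map each gene (row) to its membership in each pathway (column)."""
    return {
        gene: {name: gene in members for name, members in pathways.items()}
        for gene in genes
    }


def shared_fraction(pathways, reference, other, driver_genes):
    """Fraction of driver genes in `other` that also belong to `reference`."""
    drivers_in_other = set(driver_genes) & pathways[other]
    if not drivers_in_other:
        return 0.0
    return len(drivers_in_other & pathways[reference]) / len(drivers_in_other)


# Hypothetical pathways and perturbation targets (toy data).
pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4"},
    "pathway_B": {"g2", "g3", "g4", "g5", "g6"},
}
targets = ["g2", "g3", "g5"]

matrix = membership_matrix(pathways, targets)
fraction = shared_fraction(pathways, "pathway_A", "pathway_B", targets)
for gene, row in matrix.items():
    print(gene, row)
print(f"share of pathway_B drivers also in pathway_A: {fraction:.2f}")
```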

   One caveat in the results presented above is that the LINCS signatures
   based on the L1000 platform provide a reduced representation of the
   global transcriptome consisting of expression levels of about 1000
   “landmark” genes^[152]2. The landmark genes are selected in such a way
   that they jointly capture patterns of expression of the majority of
   genes in the genome, and computational predictions of expression of an
   additional 12,000 genes are also made. The relatively low number of
   measured genes could sometimes adversely affect the gene expression
   enrichment analysis of poorly represented pathways. To establish that
   this is not the case for mTOR signaling, we repeated the MOA analysis
   using the whole-genome transcriptional signature of the mTOR inhibitor
   sirolimus from the original CMAP dataset^[153]15, which is also
   included in the iLINCS signature collection. The results of these
   analyses closely resemble the results with the L1000 everolimus
   signature, with connectivity analysis clearly pinpointing the mTOR
   pathway and enrichment analysis of differentially expressed genes
   failing to do so (Supplemental Results [154]4).

   To verify that mTOR signaling modulation is also detectable in complex
   tissues, we used iLINCS to re-analyze the effect of rapamycin in aged
   rat livers^[155]45 (GEO dataset [156]GSE108978). The rapamycin
   signature was constructed by comparing expression profiles of livers in
   eight rapamycin-treated rats to the nine vehicle controls at 24 months
   of age (Fig. [157]4, heatmap). The signature correlated strongly with
   CP signatures of chemicals targeting mTOR pathway genes (Fig. [158]4,
   bar plot).

Fig. 4. CMAP analysis of rapamycin (RAD001) signature in rat livers.

   The heatmap shows the centered expression levels of differentially
   expressed genes and the bar plot shows the numbers of connected
   chemical perturbation signatures for the top five targets.

Use case 2: proteo-genomics analysis of cancer driver events in breast cancer

   Contrasting transcriptional and proteomic profiles of different
   molecular cancer subtypes has long been a hallmark of cancer omics data
   analysis when seeking targets for intervention^[161]46. Constructing
   signatures by comparing cancer with normal tissue controls usually
   results in a vast array of differences that are characteristic of any
   cancer (proliferation, invasion, etc.)^[162]47 and are not specific to
   the driver mechanisms of the cancer samples at hand. On the other hand,
   comparison of different cancer subtypes, as illustrated here, is
   effective in eliciting key driver mechanisms by factoring out generic
   molecular properties of a cancer^[163]48. Here, we demonstrate the use
   of matched preprocessed proteomic (RPPA) and transcriptomic (RNA-seq)
   breast cancer datasets to identify driver events which can serve as
   targets for pharmacological intervention in two different breast cancer
   subtypes. The analysis of proteomic data can directly identify affected
   signaling pathways by assessing differences in the abundance of
   activated (e.g., phosphorylated) signaling proteins. By contrasting
   proteomic and transcriptomic signatures of the same biological
   samples, we can distinguish between transcriptionally and
   post-translationally regulated proteins, and the transcriptional
   signatures facilitate pathway enrichment and CMAP analysis. Step-by-step
   instructions for performing this analysis in iLINCS are provided in
   Supplemental Workflow SW[164]2.

   We analyzed TCGA breast cancer RNA-seq and RPPA data using the iLINCS
   “Datasets” workflow to construct the differential gene and protein
   expression signatures contrasting 174 Luminal-A and 50 Her2-enriched
   (Her2E) breast tumors. The results of the iLINCS analysis track closely
   the original analysis performed by the TCGA consortium^[165]48. The
   protein expression signature immediately implicated known driver
   events, which are also the canonical therapeutic targets for the two
   subtypes: the abnormal activity of the estrogen receptor in Luminal-A
   tumors, and the increased expression and activity of the Her2 protein
   in Her2E tumors (Fig. [166]5a). To further validate the strategy of
   directly comparing two subtypes of tumors, we compared our results with
   the results obtained when the subtypes are compared to normal breast
   tissue controls (Supplemental Results [167]5). The results of these
   analyses are much more equivocal, with Her2 and ERalpha proteins now
   superseded in significance by several more generic cancer-related
   proteome alterations, common to both subtypes (Supplemental
   Results [168]5).

Fig. 5. Proteo-genomics analysis of cancer driver events in breast cancer.

   a Most differentially expressed proteins in the proteomics signatures
   constructed by comparing RPPA profiles of Her2E and Luminal-A BRC
   samples. b Gene expression profile of the genes corresponding to
   proteins in (a) based on RNA-seq data. c The transcriptional signature
   consisting of all highly differentially expressed genes (unadjusted,
   two-tailed P value < 10^−10). d Enrichment analysis of genes upregulated
   in Luminal A, and upregulated in Her2E tumors via Enrichr (unadjusted
   Fisher Exact Test P values).

   The corresponding Luminal A vs Her2E RNA-seq signature, constructed by
   differential gene expression analysis between 201 Luminal-A and 47
   Her2E samples, showed similar patterns of expression of key genes
   (Fig. [171]5b). All genes were differentially expressed (Bonferroni
   adjusted P value < 0.01) except for EGFR, indicating that the
   difference in expression levels of the EGFR protein with the
   phosphorylated Y1068 tyrosine residue (EGFR_pY1068) may be a
   consequence of post-translational modifications instead of increased
   transcription rates of the EGFR gene.
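
   For readers unfamiliar with the Bonferroni adjustment used above, the
   following minimal sketch shows the standard calculation: each P value
   is multiplied by the number of tests and capped at 1. The gene labels
   and P values are made up for illustration and are not taken from the
   TCGA analysis.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each P value by the number of tests."""
    m = len(p_values)
    return {gene: min(1.0, p * m) for gene, p in p_values.items()}


# Hypothetical unadjusted P values for four genes.
unadjusted = {"GENE_1": 1e-6, "GENE_2": 0.004, "GENE_3": 0.2, "GENE_4": 0.5}
adjusted = bonferroni(unadjusted)

# Genes that remain significant at the 0.01 level after adjustment.
significant = [gene for gene, p in adjusted.items() if p < 0.01]
print(adjusted)
print("significant after adjustment:", significant)
```

   Note that GENE_2 passes an unadjusted threshold of 0.01 but not the
   adjusted one, which is the kind of distinction the adjustment is
   designed to make when many genes are tested at once.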

   Continuing down the iLINCS workflow, the pathway analysis of the 734
   most significantly upregulated genes in Luminal-A tumors (P value <
   10^−10) (Fig. [172]5c) identified the Hallmark gene sets^[173]35
   indicative of Estrogen Response as the most significantly enriched
   (Fig. [174]5d) (see Supplemental Data SD[175]2 for all results).
   Conversely, the enrichment analysis of the 665 genes upregulated in
   Her2E tumors identified the Hallmark gene sets of proliferation (E2F
   Targets, G2-M Checkpoint) and the markers of increased mTOR signaling
   (mTORC1 signaling). This reflects the known increased proliferation of
   Her2E tumors in comparison to Luminal-A tumors^[176]49. The increase in
   mTOR signaling is consistent with the increased levels of the
   phosphorylated 4E-BP protein, a common marker of mTOR
   signaling^[177]50.

   The CMAP analysis of the RNA-seq signature with LINCS CP signatures
   (Fig. [178]6) shows that treating several different cancer cell lines
   with inhibitors of PI3K, mTOR, and CDK, and with inhibitors of some
   other more generic proliferation targets (e.g., TOP21, AURKA) (see
   Supplemental Data SD[179]3 for complete results), produces signatures
   that are positively correlated with the RNA-seq Luminal A vs Her2E
   signature, suggesting that such treatments may counteract the Her2E
   tumor driving events.

Fig. 6. Connectivity map analysis of Luminal A vs Her2E signatures.

   a Top 100 connected CP signatures. b Signatures enriched for genes in
   the gray box. The GSEA plot for the most significantly enriched
   signature and the summary of targets for top 100 most enriched
   signatures. c Chemical perturbagens and their targets for CP signatures
   in (a).

   The detailed analysis of the 100 most connected CP signatures showed
   that all signatures reflected proliferation inhibition, as indicated by
   the enrichment of the genes in the KEGG Cell cycle pathway among the
   genes downregulated across all signatures (Fig. [182]6a). However, the
   analysis also showed that a subset of the signatures selectively
   inhibited expression of the mTORC1 signaling Hallmark gene set, and the
   same set of signatures exhibited increased upregulation of Apoptosis
   gene sets in comparison to the rest of the signatures. This indicates
   that the increased proliferation in Her2E tumors may be partly driven
   by the upregulation of mTOR signaling.

   We also used iLINCS to identify de novo all signatures enriched for the
   mTOR-associated genes from Fig. [183]6a. The most enriched signatures
   (top 100) were completely dominated by signatures of mTOR inhibitors
   (Fig. [184]6b). The most highly enriched signature was generated by
   WYE-125132, a highly specific and potent mTOR inhibitor^[185]51. Using
   the iLINCS signature group analysis workflow we also summarized the
   drug-target relationships for the top 100 signatures (Fig. [186]6c)
   which recapitulate the dominance of mTOR inhibitors along with
   proliferation inhibitors targeting CDK proteins (Palbociclib and
   Milciclib).

Use case 3: drug repurposing for COVID-19

   The ongoing COVID-19 pandemic has underscored the importance of rapid
   drug discovery and repurposing to treat and prevent emerging novel
   pathogens, such as SARS-CoV-2. As part of the community-wide efforts to
   identify novel targets and treatment options, the transcriptional
   landscape of SARS-CoV-2 infections has been characterized extensively,
   including the identification of transcriptional signatures from
   patients as well as model systems^[187]52,[188]53. The CMAP approach
   has been used extensively to explore the space of potential therapeutic
   agents: a Google Scholar search for covid AND “connectivity map” lists
   662 studies. In iLINCS, 105 COVID-19-related datasets are organized
   into a COVID-19 collection, facilitating signature connectivity-based
   drug discovery and repurposing in this context.

   We used iLINCS to construct a SARS-CoV-2 infection signature by
   re-analyzing the dataset profiling the response of various in vitro
   models to SARS-CoV-2 infection^[189]52 (GEO dataset [190]GSE147507).
   The use of multiple models, which respond differently to the virus
   infection, would make the signature created by direct comparisons of
   all “Infected” vs all “Mock infected” samples too noisy. The main
   mechanism implemented in iLINCS for dealing with various confounding
   factors is filtering samples by levels of possible confounding factors,
   which is the approach most commonly used in omics data analysis. In
   this case, we filtered samples to construct a signature by differential
   gene expression analysis of infected vs mock-infected A549 cells,
   which were genetically modified to express the ACE2 gene to facilitate
   viral entry into the cell. This left us with the comparison of three
   “Infected” and three “Mock infected” samples. Filtering of samples and
   the analysis using the iLINCS GUI are demonstrated in the Supplemental
   Workflow SW[191]3.
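
   The filtering strategy described above can be sketched in Python.
   This is a simplified illustration under stated assumptions: the
   sample annotations, cell line labels, gene names, and expression
   values are hypothetical, expression is assumed to be on a log2 scale,
   and the signature is reduced to a difference of group means, whereas a
   real differential expression analysis would also compute test
   statistics and P values.

```python
from statistics import mean


def filter_samples(samples, **levels):
    """Keep samples whose annotations match every requested factor level."""
    return [s for s in samples if all(s.get(k) == v for k, v in levels.items())]


def mean_differences(samples, factor, treated, control):
    """Per-gene difference of mean log2 expression, treated minus control."""
    treated_group = [s for s in samples if s[factor] == treated]
    control_group = [s for s in samples if s[factor] == control]
    genes = treated_group[0]["expr"].keys()
    return {
        g: mean(s["expr"][g] for s in treated_group)
        - mean(s["expr"][g] for s in control_group)
        for g in genes
    }


def sample(cell_line, condition, gene_x, gene_y):
    """Build one hypothetical sample record (toy values)."""
    return {
        "cell_line": cell_line,
        "condition": condition,
        "expr": {"GENE_X": gene_x, "GENE_Y": gene_y},
    }


# Toy dataset mixing two in vitro models that respond differently.
samples = [
    sample("A549-ACE2", "Infected", 9.0, 12.0),
    sample("A549-ACE2", "Infected", 9.5, 12.0),
    sample("A549-ACE2", "Infected", 10.0, 12.0),
    sample("A549-ACE2", "Mock infected", 6.0, 12.0),
    sample("A549-ACE2", "Mock infected", 6.5, 12.0),
    sample("A549-ACE2", "Mock infected", 7.0, 12.0),
    sample("OTHER_MODEL", "Infected", 5.0, 11.0),
    sample("OTHER_MODEL", "Mock infected", 5.0, 11.5),
]

# Remove the confounding factor by restricting to one cell line.
subset = filter_samples(samples, cell_line="A549-ACE2")
signature = mean_differences(subset, "condition", "Infected", "Mock infected")
print(len(subset), "samples retained")
print(signature)
```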

   The resulting signature comprises many upregulated chemokines and other
   immune system-related genes, including the EGR1 transcription factor
   that regulates inflammation and immune system response^[192]54,[193]55,
   and the pathway analysis implicates TNF signaling and NF-kappa B
   signaling as the two pathways most enriched for upregulated genes
   (Fig. [194]7a). CMAP analysis against the LINCS gene overexpression
   signatures in the A549 cell line identified the signature of the LYN
   tyrosine kinase as the most positively correlated with the SARS-CoV-2
   infection signature (Fig. [195]7b). LYN, a member of the SRC/FYN family
   of tyrosine kinases, has been shown to be required for effective
   MERS-CoV replication^[196]56. The enrichment analysis of genes with
   positively correlated overexpression signatures in the A549 cell line
   identified the NF-kappa B signaling pathway as the most enriched
   (Fig. [197]7b), confirming mechanistically the role of NF-kappa B
   signaling in inducing the infection signature. Finally, CMAP analysis
   identified CDK inhibitors and the drug Alvocidib as potential
   therapeutic strategies based on their ability to reverse the infection
   signature (Fig. [198]7c). Alvocidib is a CDK9 inhibitor with broad
   antiviral activity, and it has been suggested as a potential candidate
   for COVID-19 drug repurposing^[199]57.

Fig. 7. SARS-CoV-2 infection of A549 cells expressing ACE2.

   a Upregulated genes (unadjusted, two-tailed P value < 10^−10) in top
   two enriched KEGG pathways (unadjusted Fisher exact test P values shown
   in the table). b Top KEGG pathway in the enrichment analysis
   (unadjusted Fisher exact test P values shown in the table) of
   signatures of gene overexpression mimicking infection in the A549 cell
   line. The list of six most positively correlated overexpression
   signatures (unadjusted, two-tailed weighted correlation P values are
   shown in the table) and the scatter plot of the LYN overexpression
   signature against the SARS-CoV-2 infection signature. c Chemicals
   reversing the infection signatures and their protein targets.

   These results agree with a previous study that used iLINCS to
   prioritize candidate FDA-approved or investigative drugs for COVID-19
   treatment^[202]58. Of the top 20 candidates identified in that study as
   reversing SARS-CoV-2 transcriptome signatures, 8 were already under
   trial for the treatment of COVID-19, while the remaining 12 had
   antiviral properties and 6 had antiviral efficacy against coronaviruses
   specifically. Our analysis illustrates the ease with which iLINCS can
   be used to quickly provide credible drug candidates for an emerging
   disease.

Discussion

   iLINCS is a unique integrated platform for the analysis of omics
   signatures. Several canonical use cases described here only scratch the
   surface of the wide range of possible analyses facilitated by the
   interconnected analytical workflows and the large collections of omics
   datasets, signatures, and their connections. All presented use cases
   were executed using only a mouse to navigate the iLINCS GUI. Each use
   case can be completed in less than 5 min, as illustrated in the online
   help and video tutorials. Studies published to date have used iLINCS in
   many different ways and to study a wide range of diseases (Supplemental
   Results [203]3).

   In addition to facilitating standard analyses, iLINCS implements
   innovative workflows for biological interpretation of omics signatures
   via CMAP analysis. In Use case 1, we show how CMAP analysis coupled
   with pathway and gene set enrichment analysis can implicate the
   mechanism of action of a chemical perturbagen when standard enrichment
   analysis applied to the differentially expressed genes fails to recover
   the targeted signaling pathways. In a similar vein, iLINCS has been
   successfully used to identify putative therapeutic agents by connecting
   changes in proteomics profiles in neurons from patients with
   schizophrenia, first with the LINCS CGSes of the corresponding genes,
   and then with LINCS CP signatures^[204]59. These analyses led to the
   identification of PPAR agonists as promising therapeutic agents capable
   of reversing the bioenergetic signature of schizophrenia, which were
   subsequently shown to modulate behavioral phenotypes in a rat model of
   schizophrenia^[205]60.

   The iLINCS platform was built with the flexibility to incorporate
   future extensions in mind. The combination of an optimized database
   representation and the R analysis engine provides ample opportunities
   to implement additional analysis workflows. At the same time, collections
   of omics datasets and signatures can be extended by simply adding data
   to backend databases. One of the important directions for improving
   iLINCS functionality will be the development of workflows for fully
   integrated analysis of multiple datasets and different omics data
   types. In terms of integrative analysis of matched transcriptomic and
   proteomic data, iLINCS facilitates integration in which the results
   from one omics dataset inform the set of genes/proteins analyzed in the
   other dataset (Use case 2). However, the direct integrative analysis of both
   data types may result in more informative signatures^[206]61. At the
   same time, addition of more proteomics datasets, such as the Clinical
   Proteomic Tumor Analysis Consortium (CPTAC) collection^[207]62, will
   extend the scope of such integrative analyses.

   Many complex diseases, including cancer, consist of multiple,
   molecularly distinct, subtypes^[208]63–[209]65. Accounting for these
   differences is essential for constructing effective disease signatures
   for CMAP analysis. In Use case 2, we demonstrate how to use iLINCS in
   contrasting molecular subtypes when the information about the subtypes
   is included in sample metadata. An iLINCS extension that allows for de
   novo creation of molecular subtypes using cluster analysis, as is the
   common practice in analysis of cancer samples^[210]63, is currently
   under development. Another future extension under development is the
   workflow for constructing disease signatures using single cell
   datasets^[211]66. iLINCS contains a number of single cell RNA-seq
   (scRNA-seq) datasets, but their analysis is currently handled in the
   same way as the bulk RNA-seq data. A specialized workflow for
   extracting disease signatures from scRNA-seq data will lead to more
   precise signatures and more powerful CMAP analysis^[212]66.

   With many signatures used in CMAP analysis, and a large number of genes
   perturbed by either genetic or chemical perturbations, one has to
   carefully scrutinize the results of CMAP-based pathway analyses to avoid
   false positive results and identify the most relevant affected pathways.
   Limitations of standard gene enrichment pathway analysis related to
   overlapping pathways are important to keep in mind in the
   signature-similarity-based pathway analysis, as discussed in Use case
   1. In addition, the hierarchical nature of gene expression regulation
   may lead to similar transcriptional signatures being generated by
   perturbing genes at different levels of the regulatory programs (e.g.,
   signaling proteins vs transcription factors). Perturbations of
   distinct signaling pathways lead to modulation of the proliferation
   rates in cancer cell lines, and it is expected that the resulting
   transcriptional signatures share some similarities related to up- and
   down-regulation of proliferation drivers and markers. At the same time,
   signatures corresponding to perturbation of proteins regulating the
   same sets of biological processes are likely to exhibit a higher level
   of similarity. The analysis of the top 100 chemical perturbagen
   signatures negatively correlated with the Her2E breast cancer signature
   in Use case 2 reveals that they all contain the “proliferation” component.
   However, a subset of the most highly correlated signatures is more
   specifically associated with mTOR inhibition, indicating that the
   proliferation is affected in part by modulating mTOR signaling. The
   association with perturbations that modulate cellular proliferation,
   while real, could also be considered spurious as it is relatively
   non-specific. The association with mTOR signaling is more specific and
   provides a higher-level mechanistic explanation for differences in
   proliferation rates. iLINCS provides mechanisms for scrutinizing the
   expression profiles of genes in signatures identified in CMAP analysis
   (Use case 2). Such scrutiny is required for assessing these fine points
   and is essential for interpreting the results of a CMAP analysis.

   Several online tools have been developed for the analysis and mining of
   LINCS L1000 signature libraries. They facilitate online queries of
   L1000 signatures^[213]67–[214]69 and the construction of scripted
   pipelines for in-depth analysis of transcriptomics data and
   signatures^[215]70. The LINCS Transcriptomic Center at the Broad
   Institute developed the clue.io query tool, deployed by the Broad
   Connectivity Map team, which facilitates connectivity analysis of
   user-submitted signatures^[216]2. iLINCS replicates the connectivity
   analysis functionality, and indeed, the equivalent queries of the two
   systems may return qualitatively similar results (see Supplemental
   Results [217]1 for a use case comparison). However, the scope of iLINCS
   is much broader. It provides connectivity analysis with signatures
   beyond Connectivity Map datasets and provides many primary omics
   datasets for users to construct their own signatures. Furthermore,
   analytical workflows in iLINCS facilitate deep systems biology analysis
   and knowledge discovery of both omics signatures and the genes and
   protein targets identified through connectivity analysis. A comparison
   with several other web resources that partially cover different aspects
   of iLINCS functionality is summarized in Supplemental Results [218]2.

   iLINCS removes technical roadblocks to the re-use of publicly available
   omics datasets and signatures by users without a programming
   background. The user interfaces are streamlined and strive to be
   self-explanatory to most scientists with a conceptual understanding of
   omics data analysis. Recent efforts in terms of standardizing^[219]71
   and indexing^[220]72 are improving findability and re-usability of
   public domain omics data. iLINCS is taking the next logical step in
   integrating public domain data and signatures with a user-friendly
   analysis toolbox. Furthermore, all analysis steps behind the iLINCS GUI
   are driven by an API, which can be used within computational pipelines
   based on scripting languages^[221]73, such as R, Python and JavaScript,
   and to power the functionality of other web analysis
   tools^[222]30,[223]69. This makes iLINCS a natural tool for analysis
   and interpretation of omics signatures for scientists preferring
   point-and-click GUIs as well as data scientists using scripted
   analytical pipelines.

Methods

Statistics

   All differential expression and signature creation analyses are
   performed on measurements obtained from distinct samples. All P values
   are calculated using two-sided hypothesis tests. The specific tests
   depend on the data type and are described in detail in the rest of the
   methods and the Supplemental Methods document. The accuracy of the
   iLINCS datasets, signatures and analysis procedures was ascertained as
   described in the Supplemental Quality Control document. The versions of
   all R packages utilized by iLINCS are provided in the Supplemental Data
   SD[224]4.

Perturbation signatures

   All precomputed perturbation signatures in iLINCS, as well as
   signatures created using an iLINCS dataset, consist of two vectors: the
   vector of log-scale differential expressions between the perturbed
   samples and baseline samples d = (d[1],…,d[N]), and the vector of
   associated P values p = (p[1],…,p[N]), where N is the number of genes
   or proteins in the signature. Signatures submitted by the user can also
   consist of only log-scale differential expressions without P values,
   lists of up- and downregulated genes, and a single list of genes.

Signature connectivity analysis

   Depending on the exact type of the query signature, the connectivity
   analysis with libraries of precomputed iLINCS signatures is computed
   using different connectivity metrics. The choice of the similarity
   metric to be used in different contexts was driven by benchmarking six
   different methods (Supplementary Result [225]2).

   If the query signature is selected from iLINCS libraries of precomputed
   signatures, the connectivity with all other iLINCS signatures is
   precomputed using the extreme Pearson’s correlation^[226]74,[227]75 of
   signed significances of all genes. The signed significance of the ith
   gene is defined as
   \[
   s_i = \operatorname{sign}(d_i) \cdot \left( -\log_{10}(p_i) \right),
   \quad \text{for } i = 1, \ldots, N, \tag{1}
   \]

   and the signed significance signature is s = (s[1],…,s[N]). The extreme
   signed signature e = (e[1],…,e[N]) is then constructed by setting the
   signed significances of all genes other than the top 100 and bottom 100
   to zero:
   \[
   e_i =
   \begin{cases}
   s_i, & \text{if } s_i \ge s^{100} \text{ or } s_i \le s^{-100} \\
   0, & \text{otherwise}
   \end{cases} \tag{2}
   \]

   where s^100 is the 100th most positive s[i] and s^−100 is the 100th
   most negative s[i]. The extreme Pearson correlation between two
   signatures is then calculated as the standard Pearson’s correlation
   between the extreme signed significance signatures.
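
   The construction in Eqs. 1 and 2 can be sketched in a few lines of
   Python. This is a minimal illustration rather than the iLINCS
   implementation: the function names are hypothetical, the two signatures
   are assumed to cover the same genes in the same order, all P values are
   assumed to be greater than zero, and ties at the cutoffs are simply
   kept.

```python
import math


def signed_significance(d, p):
    """Eq. 1: s_i = sign(d_i) * (-log10(p_i)) for each gene."""
    def sign(x):
        return (x > 0) - (x < 0)
    return [sign(di) * -math.log10(pi) for di, pi in zip(d, p)]


def extreme_signature(s, k=100):
    """Eq. 2: keep the k most positive and k most negative values, zero the rest."""
    if len(s) < 2 * k:
        raise ValueError("need at least 2*k genes")
    ranked = sorted(s)
    s_neg_k = ranked[k - 1]  # k-th most negative value (s^-k)
    s_pos_k = ranked[-k]     # k-th most positive value (s^k)
    return [si if (si >= s_pos_k or si <= s_neg_k) else 0.0 for si in s]


def pearson(x, y):
    """Standard Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def extreme_pearson(d1, p1, d2, p2, k=100):
    """Pearson correlation between the two extreme signed significance signatures."""
    e1 = extreme_signature(signed_significance(d1, p1), k)
    e2 = extreme_signature(signed_significance(d2, p2), k)
    return pearson(e1, e2)
```

   With k = 100 this corresponds to the top 100 and bottom 100 genes
   described above; a smaller k is convenient for toy inputs.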

   If the query signature is created from an iLINCS dataset, or directly
   uploaded by the user, the connectivity with all iLINCS signatures is
   calculated as the weighted correlation between the two vectors of
   log-differential expressions and the vector of weights equal to
   [-log10(P value of the query) −log10(P value of the iLINCS
   signature)]^[228]76. When the user-uploaded signature consists of only
   log-differential expression levels without P values, the weight for the
   correlation is based only on the P values of the iLINCS signatures
   [−log10(P values of the iLINCS signatures)].
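
   The weighted correlation described above can be sketched as follows.
   The function names are hypothetical and the code is an illustration
   under stated assumptions, not the iLINCS implementation: both
   signatures are assumed to be aligned on the same genes, and the weight
   of each gene is the sum of the −log10 P values of the query and the
   iLINCS signature, or only the latter when the query has no P values.

```python
import math


def connectivity_weights(p_library, p_query=None):
    """Per-gene weights: -log10(p_query) - log10(p_library), or library only."""
    if p_query is None:
        return [-math.log10(pl) for pl in p_library]
    return [-math.log10(pq) - math.log10(pl)
            for pq, pl in zip(p_query, p_library)]


def weighted_correlation(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return sxy / math.sqrt(sxx * syy)
```

   With equal weights the result reduces to the ordinary Pearson
   correlation, and genes with P values near 1 contribute almost nothing.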

   If the query signature uploaded by the user consists of the lists of
   up- and downregulated genes, connectivity is calculated by assigning −1
   to downregulated and +1 to upregulated genes and calculating Pearson’s
   correlation between this vector and iLINCS signatures. The calculated
   statistical significance of the correlation in this case is equivalent
   to the t test for the difference between differential expression
   measures of iLINCS signatures between up- and downregulated genes.
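
   The equivalence with the t test can be checked numerically. The sketch
   below uses hypothetical function names and assumes that only genes
   present in both the submitted lists and the iLINCS signature enter the
   calculation, and that the t test in question is the equal-variance
   (pooled) two-sample test. Under those assumptions, the statistic
   t = r * sqrt((n − 2) / (1 − r^2)) derived from the correlation r
   matches the pooled t statistic.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def updown_connectivity(up_genes, down_genes, library_signature):
    """Correlate a +1/-1 query vector with library differential expressions.

    library_signature maps gene -> log differential expression.
    Returns (r, t); t is undefined (division by zero) when |r| == 1.
    """
    labels, values = [], []
    for gene in up_genes:
        if gene in library_signature:
            labels.append(1.0)
            values.append(library_signature[gene])
    for gene in down_genes:
        if gene in library_signature:
            labels.append(-1.0)
            values.append(library_signature[gene])
    r = pearson(labels, values)
    n = len(labels)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t


def pooled_t(a, b):
    """Equal-variance two-sample t statistic for mean(a) - mean(b)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
```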

   If the query signature is uploaded by the user in a form of a gene
   list, the connectivity with iLINCS signatures is calculated as the
   enrichment of highly significant differential expression levels in
   iLINCS signature within the submitted gene list using the Random Set
   analysis^[229]77.

Perturbagen connectivity analysis

   The connectivity between a query signature and a “perturbagen” is
   established using the enrichment analysis of individual connectivity
   scores between the query signature and set of all L1000 signatures of
   the perturbagen (for all cell lines, time points, and concentrations).
   The analysis establishes whether the connectivity scores as a set are
   “unusually” high based on the Random Set analysis^[230]77.
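
   As a rough illustration of the idea, a random-set style z-score compares
   the mean score of a set with the mean expected for a randomly drawn set
   of the same size, using the variance of a mean sampled without
   replacement. The sketch below is an assumption-laden example, not the
   iLINCS code: the function name is hypothetical, and the exact scores
   and any standardization used by iLINCS are not specified here.

```python
import math


def random_set_z(all_scores, member_indices):
    """z-score of the mean score of a set versus random sets of the same size.

    all_scores: connectivity scores for all G signatures.
    member_indices: positions of the m signatures belonging to the set
    (for example, all signatures of one perturbagen).
    """
    G = len(all_scores)
    m = len(member_indices)
    if not 0 < m < G:
        raise ValueError("set must be a non-empty proper subset")
    mean_all = sum(all_scores) / G
    var_all = sum(s * s for s in all_scores) / G - mean_all ** 2
    set_mean = sum(all_scores[i] for i in member_indices) / m
    # Variance of the mean of m values drawn without replacement from G.
    var_set_mean = (var_all / m) * ((G - m) / (G - 1))
    return (set_mean - mean_all) / math.sqrt(var_set_mean)
```

   A large positive z-score indicates that the set's scores are higher
   than expected by chance; a normal approximation can then be used to
   obtain a P value.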

iLINCS signature libraries

   LINCS L1000 signature libraries (Consensus gene knockdown signatures
   (CGS), Overexpression gene signatures and Chemical perturbation
   signatures): for all LINCS L1000 signature libraries, the signatures
   are constructed by combining the Level 4, population control signature
   replicates from two released GEO datasets ([231]GSE92742 and
   [232]GSE70138) into the Level 5 moderated Z scores (MODZ) by
   calculating weighted averages as described in the primary publication
   for the L1000 Connectivity Map dataset^[233]2. For CP signatures, only
   signatures showing evidence of being reproducible by having the 75th
   quantile of pairwise Spearman correlations of Level 4 replicates (Broad
   Institute distil_cc_q75 quality control metric^[234]2) greater than 0.2
   are included. The corresponding P values were calculated by comparing
   MODZ of each gene to zero using the Empirical Bayes weighted t test
   with the same weights used for calculating MODZs. The shRNA and CRISPR
   knockdown signatures targeting the same gene were further aggregated
   into Consensus gene signatures (CGSes)^[235]2 by the same procedure
   used to calculate MODZs and associated P values.

LINCS-targeted proteomics signatures

   Signatures of chemical perturbations assayed by the quantitative
   targeted mass spectrometry proteomics P100 assay, which measures levels
   of 96 phosphopeptides, and the GCP assay against ~60 probes that monitor
   combinations of post-translational modifications on histones^[236]5.

Disease-related signatures

   Transcriptional signatures constructed by comparing sample groups
   within the collection of curated public domain transcriptional datasets
   (GEO DataSets collection)^[237]34. Each signature consists of
   differential expressions and associated P values for all genes
   calculated using Empirical Bayes linear model implemented in the limma
   package.

ENCODE transcription factor-binding signatures

   Genome-wide transcription factor (TF) binding signatures constructed by
   applying the TREG methodology to ENCODE ChIP-seq^[238]78. Each
   signature consists of scores and probabilities of regulation by the
   given TF in the specific context (cell line and treatment) for each
   gene in the genome.

Connectivity map signatures

   Transcriptional signatures of perturbagen activity constructed based on
   the version 2 of the original Connectivity Map dataset using Affymetrix
   expression arrays^[239]17. Each signature consists of differential
   expressions and associated P values for all genes when comparing
   perturbagen-treated cell lines with appropriate controls.

DrugMatrix signatures

   Toxicogenomic signatures of over 600 different compounds^[240]79
   maintained by the National Toxicology Program^[241]80 consisting of
   genome-wide differential gene expression levels and associated P
   values.

Transcriptional signatures from EBI Expression Atlas

   All mouse, rat and human differential expression signatures and
   associated P values from manually curated comparisons in the Expression
   Atlas^[242]8.

Cancer therapeutics response signatures

   These signatures were created by combining transcriptional data with
   drug sensitivity data from the Cancer Therapeutics Response Portal
   (CTRP) project^[243]81. Signatures were created separately for each
   tissue/cell lineage in the dataset by comparing gene expression between
   the five cell lines of that lineage that were most sensitive and the
   five that were least sensitive to a given drug, as measured by the area
   under the concentration-response curve (AUC), using a two-sample t test.
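
   The selection-and-comparison step can be sketched as below. Several
   details are assumptions made for illustration only: the function names
   are hypothetical, a lower AUC is taken to mean higher sensitivity, the
   contrast is sensitive minus resistant, and the equal-variance form of
   the two-sample t test is used. Only t statistics are returned; P
   values would require the t distribution.

```python
import math


def pooled_t(a, b):
    """Equal-variance two-sample t statistic for mean(a) - mean(b).

    Raises ZeroDivisionError if both groups have zero variance.
    """
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def sensitivity_signature(auc, expression, n_extreme=5):
    """Per-gene t statistics contrasting the most and least sensitive lines.

    auc: dict cell_line -> AUC for one drug within one lineage.
    expression: dict gene -> dict cell_line -> expression value.
    """
    ranked = sorted(auc, key=auc.get)
    if len(ranked) < 2 * n_extreme:
        raise ValueError("not enough cell lines in this lineage")
    sensitive = ranked[:n_extreme]    # lowest AUC (assumed most sensitive)
    resistant = ranked[-n_extreme:]   # highest AUC (assumed least sensitive)
    return {
        gene: pooled_t([vals[c] for c in sensitive],
                       [vals[c] for c in resistant])
        for gene, vals in expression.items()
    }
```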

Pharmacogenomics transcriptional signatures

   These signatures were created by calculating differential gene
   expression levels and associated P values between cell lines treated
   with anti-cancer drugs and the corresponding controls in two separate
   projects: The NCI Transcriptional Pharmacodynamics Workbench
   (NCI-TPW)^[244]82 and the Plate-seq project dataset^[245]4.

Constructing signatures from iLINCS datasets

   The transcriptomics or proteomics signature is constructed by comparing
   expression levels of two groups of samples (treatment group and
   baseline group) using Empirical Bayes linear model implemented in the
   limma package^[246]83. For the GREIN collection of GEO RNA-seq
   datasets^[247]84, the signatures are constructed using the negative
   binomial generalized linear model as implemented in the edgeR
   package^[248]85.

Analytical tools, web applications, and web resources

   Signature analytics in iLINCS is facilitated via native R, Java,
   JavaScript, and Shiny applications, and via API connections to external
   web applications and services. A brief listing of analysis and
   visualization tools is provided here. The overall structure of iLINCS
   is described in Supplemental Fig. [249]2.

   Gene list enrichment analysis is facilitated by directly submitting
   lists of genes to any of three prominent enrichment analysis web
   tools: Enrichr^[250]20, DAVID^[251]21, and ToppGene^[252]22. The
   manipulation and selection of lists of signature genes is facilitated
   via an interactive volcano plot JavaScript application.

   Pathway analysis is facilitated through general-purpose enrichment
   tools (Enrichr, DAVID, ToppGene), the enrichment analysis of Reactome
   pathways via Reactome online tool^[253]23, and internal R routines for
   SPIA analysis^[254]86 of KEGG pathways and general visualization of
   signatures in the context of KEGG pathways using the KEGG API^[255]24.

   Network analysis is facilitated by submitting lists of genes to
   Genemania^[256]25 and by internal iLINCS Shiny Signature Network
   Analysis (SigNetA) application.

   Heatmap visualizations are facilitated by native iLINCS applications
   (Java-based FTreeView^[257]87, a modified version of the
   JavaScript-based Morpheus^[258]88, and a Shiny-based HeatMap
   application) and by connection to the web application
   Clustergrammer^[259]29.

   Dimensionality reduction analysis (PCA and t-SNE^[260]89) and
   visualization of high-dimensional relationships via interactive 2D and
   3D scatter plots are facilitated via internal iLINCS Shiny applications.

   Interactive boxplots, scatter plots, GSEA plots, bar charts, and pie
   charts used throughout iLINCS are implemented using R ggplot^[261]90
   and plotly^[262]91.

   Additional analysis is provided by connecting to X2K Web^[263]26 (to
   identify upstream regulatory networks from signature genes),
   L1000FWD^[264]27 (to connect signatures with signatures constructed
   using the Characteristic Dimension methodology^[265]92), STITCH^[266]28
   (for visualization of drug-target networks), and piNET^[267]30 (for
   visualization of gene-to-pathway relationships for signature genes).

   Additional information about drugs, genes, and proteins is provided by
   links to the LINCS Data Portal^[268]31, ScrubChem^[269]32,
   PubChem^[270]33, Harmonizome^[271]93, GeneCards^[272]94, and several
   other databases.

Gene and protein expression dataset collections

   iLINCS backend databases provide access to more than 34,000
   preprocessed gene and protein expression datasets that can be used to
   create and analyze gene and protein expression signatures. Datasets are
   thematically organized into eight collections with some datasets
   assigned to multiple collections. Users can search all datasets or
   browse datasets by collection.

LINCS collection

   Datasets generated by the LINCS data and signature generation
   centers^[273]7.

TCGA collection

   Gene expression (RNASeqV2), protein expression (RPPA), and copy number
   variation data generated by TCGA project^[274]63.

GDS collection

   A curated collection of GEO Gene Datasets (GDS)^[275]34.

Cancer collection

   An ad hoc collection of cancer-related genomics and proteomic datasets.

Toxicogenomics collection

   An ad hoc collection of toxicogenomics datasets.

RPPA collection

   An ad hoc collection of proteomic datasets generated by Reverse Phase
   Protein Array assay^[276]95.

GREIN collection

   Complete collection of preprocessed human, mouse, and rat RNA-seq data
   in GEO provided by the GEO RNA-seq Experiments Interactive Navigator
   (GREIN)^[277]84.

Reference collection

   An ad hoc collection of important gene expression datasets.

Reporting summary

   Further information on research design is available in the [278]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [279]Supplementary Information^ (27.6MB, pdf)
   [280]Reporting Summary^ (368.4KB, pdf)
   [281]41467_2022_32205_MOESM3_ESM.pdf^ (86.2KB, pdf)

   Description of Additional Supplementary Files
   [282]Supplementary Data SD1^ (96.4KB, xlsx)
   [283]Supplementary Data SD2^ (54.9KB, xlsx)
   [284]Supplementary Data SD3^ (57KB, xlsx)
   [285]Supplementary Data SD4^ (20.4KB, xlsx)
   [286]Software 1^ (3.9MB, zip)
   [287]Peer Review File^ (3.5MB, pdf)

Acknowledgements