Abstract

   The Library of Integrated Cellular Signatures (LINCS) project provides
   comprehensive transcriptome profiling of human cell lines before and
   after chemical and genetic perturbations. Its L1000 platform utilizes
   978 landmark genes to infer the transcript levels of 14,292 genes
   computationally. Here we conducted the L1000 data quality control
   analysis by using MCF7, PC3, and A375 cell lines as representative
   examples. Before perturbation, a promising 80% correlation in
   transcriptome was observed between the L1000 and Affymetrix HU133A
   platforms. After library‐based shRNA perturbations, a moderate 30% of
   differentially expressed genes overlapped between any two selected
   control viral vectors on the L1000 platform. The
   mitogen‐activated protein kinase, vascular endothelial growth factor,
   and T‐cell receptor pathways were identified as the most significantly
   shared pathways between chemical and genetic perturbations in cancer
   cells. In conclusion, the L1000 platform is reliable in assessing the
   transcriptome before perturbation, but its response to perturbagens
   needs to be interpreted with caution. A quality control analysis
   pipeline for L1000 data is recommended before addressing biological
   questions.
     __________________________________________________________________

Study Highlights.

   WHAT IS THE CURRENT KNOWLEDGE ON THE TOPIC?

   ☑ The Library of Integrated Cellular Signatures (LINCS) project
   provides comprehensive transcriptome profiling of human cell lines
   before and after chemical and genetic perturbations. Its L1000 platform
   utilizes 978 landmark genes to computationally infer the expression of
   the other 14,292 genes. However, there has been no quality control data
   analysis of the reproducibility of the L1000 gene expression platform.

   WHAT QUESTION DID THIS STUDY ADDRESS?

   ☑ For the first time, this study conducted quality control analysis on
   the LINCS L1000 gene expression platform.

   WHAT THIS STUDY ADDS TO OUR KNOWLEDGE

   ☑ It shows a promising 80% correlation in transcriptome between L1000
   and Affymetrix HU133A for MCF7 breast cancer, A375 melanoma, and PC3
   prostate cancer cells. L1000 reproducibility analyses show that a
   moderate 30% of differentially expressed genes overlapped between any
   two selected control viral vectors in the genetic perturbation
   screening. The MAPK, VEGF, and T‐cell receptor pathways are identified
   as the pathways most significantly connected between chemical and
   genetic perturbations in breast cancer cells. A quality control
   pipeline for L1000 data is recommended before addressing biological
   questions.

   HOW MIGHT THIS CHANGE DRUG DISCOVERY, DEVELOPMENT, AND/OR THERAPEUTICS?

   ☑ LINCS data provide the ability to establish a systems pharmacology
   view of drug effects on the transcriptome. This landmark database
   provides a molecular basis for further study of single‐drug and/or
   drug‐combination effects at the cell level. LINCS data will eventually
   lead to more rational drug selection for patients.

   The combination of “omics” technologies and cell‐based drug screening
   tools has enabled us to evaluate cellular responses to drug
   perturbations and explore drug targets and their mechanisms.
   Large‐scale drug screens for anticancer projects, such as the National
   Cancer Institute 60 (NCI60) human tumor cell line panel,[26]1
   Connectivity Map (CMAP),[27]2, [28]3 Cancer Cell Line Encyclopedia
   (CCLE),[29]4 Genomics of Drug Sensitivity in Cancer (GDSC),[30]5 and
   the cancer therapeutics response portal (CTRP)[31]6 all used cancer
   cell baseline genomes and/or transcriptomes to predict cell responses
   to drugs. Cell‐based pooled short hairpin RNA (shRNA) screening is
   another important strategy for systematically identifying essential
   genes, and, eventually, therapeutic drug targets. The Achilles Project
   and DPSC‐Cancer shRNAs interference detection[32]7, [33]8 provided
   genome‐wide shRNA dropout signature profiles for identifying cell
   vulnerabilities associated with genetic alterations. These databases
   attempted to address various aspects of the associations between
   molecular profiles, genetic and chemical perturbations (perturbagens),
   and cell responses to the perturbagens.[34]9 However, using these data
   sources it is difficult to form an integrated picture between cancer
   cell molecular profiles and their responses to perturbagens, because
   these data were generated through different experimental
   platforms.[35]11 By combining chemical compounds and RNAi, the
   Library of Integrated Network‐based Cellular Signatures (LINCS) project
   uses a novel transcriptome platform, L1000, to assess cell responses to
   perturbagens (lincs.hms.harvard.edu).[36]9, [37]12 It allows us to
   integrate transcriptomes, perturbagens, and cell responses to drugs at
   the same time.[38]12, [39]13, [40]14

   A cost‐effective bead‐based assay, the L1000 platform, is the LINCS
   primary technology that measures transcriptomic responses to
   perturbagens.[41]10 The L1000 platform directly measures transcripts of
   978 landmark genes, from which the transcript levels of 14,292 genes
   are computationally inferred based on gene expression data from the
   Gene Expression Omnibus (GEO).[42]10, [43]12 The relatively low cost of
   the L1000
   allows the LINCS project to assess transcriptomic responses to 20,413
   small molecule compounds and 22,119 genetic interference perturbagens,
   under more than four million different conditions (100‐fold larger than
   other existing screening studies) (support.lincscloud.org). However,
   the enormous impact of the L1000 technology on chemical and genetic
   perturbagen screening depends heavily on its data quality.

   The existing quality control analysis of the L1000 platform[44]10
   focused only on 90 differentially expressed “landmark” genes in a
   validation experiment. It showed a high correlation for the 90 landmark
   genes, 0.92, between their gene expressions in the L1000 platform and
   their reverse‐transcription polymerase chain reaction (RT‐PCR)
   validations. However, the quality control analysis did not include the
   14,292 L1000‐inferred genes. Hence, a valuable and ideal evaluation
   would be a whole‐genome gene expression comparison between the L1000
   and other established whole‐genome microarray platforms. Many
   cell‐based molecular profiling datasets in public domain databases
   (e.g., the GEO) allow us to compare L1000 data quality with that of
   other gene expression platforms. As with all inhibitory RNA (shRNA)
   libraries, L1000 RNA interference studies are complicated by the
   sequence design of the perturbation shRNAs, which affects silencing
   efficiency.[45]15, [46]31 Thus, the effectiveness of genetic
   perturbagens on nonlandmark genes should also be investigated.
   In L1000, a number of perturbagens and a number of experimental
   conditions are used to explore perturbagen‐induced changes in the
   transcriptome. These include different cellular backgrounds, multiple
   chemical dosing concentrations, multiple timepoints,[47]12 as well as
   different empty control vectors (i.e., GFP, RFP, Luciferase, lacZ, and
   PGW) for single‐gene knockdown or overexpression. These experimental
   conditions provide an opportunity to broadly explore perturbagen
   effects on cells. In this study, we are particularly interested in
   whether the selection of controls influences the effects of shRNA
   knockdown and gene overexpression on the cells and their transcriptome
   response.

   To address the challenges in LINCS data quality control, more than
   6,975 chemicals, 3,827 single‐gene knockdowns, and 2,281 single‐gene
   overexpressions were investigated in three cancer cell lines: MCF7 (an
   estrogen receptor‐positive luminal breast cancer cell[48]16), A375
   (human skin cell with malignant melanoma[49]17), and PC3 (prostate
   metastatic cell). L1000 data were analyzed from four different
   aspects for the first time: 1) the cancer cell baseline transcriptome
   (i.e., untreated) in the L1000 platform was compared to the
   corresponding transcriptome in the Affymetrix HU133A platform; 2)
   transcriptomes were compared between chemical‐treated groups and
   controls at multiple timepoints for three cell lines; 3) RNAi
   experimental variation and its sensitivity to different control groups
   were explored; and 4) connectivity between genetic and chemical
   perturbations was investigated. Finally, guidance on how to use the
   L1000 dataset is provided.

METHODS

Materials general

   In this study, level 3 data (the normalized profiles) are used for the
   quality control data analysis. These data are described in great detail
   in the supplemental materials. L1000 adopts the practice of storing
   data annotations (metadata) and datasets separately.[50]12, [51]13 The
   InstInfo file describes L1000 signature profiles under different
   experimental conditions (Supplementary Figure 1 visualizes the
   experimental design). Each expression profile is assigned a unique
   identifier, i.e., signature ID (or “distil_id”), which connects the
   data with its metadata in InstInfo. Table [52]1 lists the samples for
   L1000 data quality control analysis and platform comparison with the
   Affymetrix HU133A. Supplementary Table 2 shows the numbers of gene
   expression profiles and chemicals before and after compound treatment
   at 6 hours (H6) and 24 hours (H24) in three cell lines. Supplementary
   Table 4 lists the numbers of records and genes for the three cell
   lines at 96 hours (H96) and 144 hours (H144) for knockdown and
   overexpression experiments against different control vectors.
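
   The separation of metadata and data described above can be
   illustrated with a minimal sketch. Only the "distil_id" key comes from
   the text; the other field names (cell_id, pert_type, pert_time), the
   identifiers, and the toy values are hypothetical and stand in for
   whatever the InstInfo file actually provides.

```python
# Hypothetical miniature of the InstInfo metadata: one record per profile.
# Only "distil_id" is named in the text; the other fields are assumptions.
inst_info = [
    {"distil_id": "sig_001", "cell_id": "MCF7", "pert_type": "compound", "pert_time": 6},
    {"distil_id": "sig_002", "cell_id": "MCF7", "pert_type": "control", "pert_time": 6},
    {"distil_id": "sig_003", "cell_id": "PC3", "pert_type": "compound", "pert_time": 24},
]

# Hypothetical expression data stored separately, keyed by the same identifier.
expression = {
    "sig_001": [7.2, 5.1, 9.8],
    "sig_002": [7.0, 5.3, 9.1],
    "sig_003": [6.4, 4.9, 8.7],
}


def select_profiles(metadata, data, **criteria):
    """Return the profiles whose metadata record matches every criterion."""
    selected = {}
    for record in metadata:
        if all(record.get(key) == value for key, value in criteria.items()):
            sig_id = record["distil_id"]
            if sig_id in data:  # skip metadata rows without a stored profile
                selected[sig_id] = data[sig_id]
    return selected


mcf7_h6 = select_profiles(inst_info, expression, cell_id="MCF7", pert_time=6)
```

   Here mcf7_h6 holds the two toy MCF7 profiles measured at 6 hours; the
   same lookup pattern would apply to any combination of experimental
   conditions recorded in the metadata.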

Table 1.

   Sample numbers for L1000 data quality control and platform comparison
   between the Affymetrix HT‐HG‐U133A and L1000 across 22,268 probe sets
   Platform                Timepoints  Experiment Types                             Cell lines
                                                                                    MCF7   PC3    A375
   L1000                   H6          Compounds         #Gene expression profiles  43862  39605  27428
                                                         #Chemical compounds        5434   4737   3083
                           H24         Compounds         #Gene expression profiles  57475  57380  28601
                                                         #Chemical compounds        6976   5845   2169
                           H96         Knockdown         #Gene expression profiles  36023  41414  40640
                                                         #Genes                     3472   3824   3827
                                       Overexpression    #Gene expression profiles  9220   10271  10109
                                                         #Genes                     2160   2281   2281
                           H144        Knockdown         #Gene expression profiles  20204  20414  /
                                                         #Genes                     1838   1726   /
   L1000                   Base Line                     #Gene expression profiles  2922   27     24
   Affymetrix HT‐HG‐U133A  Base Line                     #Gene expression profiles  56     8      16

Methods

   The flow of data processing and analysis is shown in Figure [54]1.

Figure 1.

   The overall L1000 quality control data analysis procedure. The figure
   shows the data analysis process and its associated methods in the
   study. The data analysis procedure can be divided into four steps.
   First, L1000 data are reannotated, retrieved, and extracted. Second,
   differentially expressed gene (DEG) analysis is performed before vs.
   after perturbation (both chemical and genetic) using a two‐sample
   equal‐variance Student's t‐test. Third, L1000 data quality is
   analyzed. This includes the mRNA correlation analysis between
   different platforms and the genetic perturbation comparisons using
   different control vectors through R‐square; the shRNA interference
   scale and gene overexpression scale are calculated to evaluate the
   reliability of genetic perturbation experiments. Fourth, connectivity
   analysis is performed between chemical and genetic perturbations by GO
   enrichment and KEGG pathway analysis.
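
   As a rough illustration of the third step, the sketch below computes
   an R‐square value between two expression profiles and an overlap
   fraction between two DEG lists. The use of the squared Pearson
   correlation and of the Jaccard index (shared genes divided by the
   union) are assumptions made for this example; the text does not state
   the exact definitions, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def r_squared(x, y):
    """Squared Pearson correlation (assumed meaning of R-square here)."""
    return pearson_r(x, y) ** 2


def deg_overlap(degs_a, degs_b):
    """Jaccard overlap of two DEG sets (an assumed overlap definition)."""
    a, b = set(degs_a), set(degs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy values only: two profiles over the same genes, and two DEG lists
# obtained against two different (hypothetical) control vectors.
platform_a = [1.0, 2.0, 3.0, 4.0]
platform_b = [2.0, 4.0, 6.0, 8.0]
agreement = r_squared(platform_a, platform_b)
shared = deg_overlap({"GENE_A", "GENE_B", "GENE_C"}, {"GENE_B", "GENE_C", "GENE_D"})
```

   Other overlap definitions (for example, shared genes divided by the
   size of the smaller list) would give different percentages, so the
   definition should be fixed before comparing control vectors.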

Differentially expressed gene analysis

   An unpaired two‐tailed Student's t‐test was used to evaluate the gene
   expression difference between two groups in the following
   comparisons: before and after chemical treatment (the gene profiles of
   chemical‐treated cells vs. those incubated with dimethyl sulfoxide
   (DMSO) control) in MCF7, PC3, and A375 cells at 6 hours and 24 hours;
   and before and after shRNA/overexpression perturbation (the gene
   knockdown/overexpression group vs. each of its control groups, such as
   empty vector GFP,[56]18 eGFP,[57]19 Luciferase, HcRed,[58]20 or
   lacZ[59]21). The difference was considered significant if
   P < 0.01.

   Plate‐related batch effects are removed by quantile normalization. In
   the LINCS L1000 experimental design, each plate has its own control
   samples (i.e., DMSO) and perturbagen‐treated samples. Our two‐sample
   t‐tests use perturbagen‐treated and control samples from the same
   plate to analyze differentially expressed genes (DEGs). To make the
   t‐test less sensitive to outliers or small variance, a minimum sample
   size of three was required for each group.
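
   The test described above can be sketched for a single gene as
   follows. This is a minimal, standard‐library illustration rather than
   the pipeline used in the study: the function names are hypothetical,
   the two input lists are assumed to hold expression values of one gene
   from treated and control wells of the same plate, and the P value is
   obtained by numerically integrating the t density.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df):
    """Two-tailed P value via Simpson's rule on [0, |t|].

    Accurate enough for thresholding at 0.01; very small P values lose
    precision because they are computed as 1 minus the central area.
    """
    a = abs(t)
    if a == 0.0:
        return 1.0
    steps = 2 * max(1000, math.ceil(a * 50))  # even count, step size <= 0.01
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def deg_test(treated, control, alpha=0.01, min_n=3):
    """Equal-variance two-sample t-test for one gene.

    Returns (t, p, significant), or None when a group has fewer than
    min_n samples or the pooled variance is zero (t is undefined).
    """
    n1, n2 = len(treated), len(control)
    if n1 < min_n or n2 < min_n:
        return None
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(control)) / df
    if pooled == 0.0:
        return None
    t = (mean(treated) - mean(control)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    p = two_sided_p(t, df)
    return t, p, p < alpha


# Toy values: one gene, three treated wells vs. three DMSO wells on one plate.
result = deg_test([10.1, 10.3, 9.9], [5.0, 5.2, 4.8])
```

   Returning None for undersized groups mirrors the minimum sample size
   of three stated above; how zero‐variance genes were handled in the
   study is not stated, so skipping them is an assumption of this sketch.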

shRNAs interference and gene overexpression scale calculation

   The shRNA knockdown scale quantifies the gene knockdown accuracy.
   First, the expression of the targeted gene is normalized by the
   housekeeping gene expression. Then, the knockdown scale quantifies the
   change in expression of the interfered gene relative to the
   uninterfered gene (i.e., control) in a cell. Denoting the control
   group as ctr and the shRNA group as exp, the knockdown scale is
   calculated as follows:
     * Gene_ctr = shRNA target gene expression in control
       group/housekeeping gene expression in control group;
     * Gene_exp = shRNA target gene expression in shRNA experimental
       group/housekeeping gene expression in the shRNA experimental group;
     * shRNA Knockdown Scale: siRi = 1 − (Gene_exp/Gene_ctr);
     * In the article, if siRi > 0, the experiment is defined as a
       success; otherwise it is a failure.

   The gene overexpression scale is calculated with a formula similar to
   that of the shRNA interference scale:
     * Gene_ctr = overexpressed gene expression in control
       group/housekeeping gene expression in control group
     * Gene_exp = overexpressed gene expression in overexpression
       experiment group/housekeeping gene expression in overexpression
       experiment group
     * Gene Overexpression scale: oeRi = (Gene_exp/Gene_ctr) − 1
     * When oeRi > 0, the experiment is deemed successful; otherwise the
       experiment is deemed a failure.
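
   The two scales defined above can be written as a short worked
   example. The function and argument names are hypothetical, the
   numbers are toy values, and expression is assumed to be on a linear
   (not log) scale so that the ratios are meaningful.

```python
def normalized_expression(target, housekeeping):
    """Target gene expression divided by housekeeping gene expression."""
    if housekeeping <= 0:
        raise ValueError("housekeeping expression must be positive")
    return target / housekeeping


def knockdown_scale(target_ctr, hk_ctr, target_exp, hk_exp):
    """siRi = 1 - (Gene_exp / Gene_ctr); siRi > 0 counts as a success."""
    gene_ctr = normalized_expression(target_ctr, hk_ctr)
    gene_exp = normalized_expression(target_exp, hk_exp)
    if gene_ctr == 0:
        raise ValueError("control expression of the target gene is zero")
    return 1.0 - gene_exp / gene_ctr


def overexpression_scale(target_ctr, hk_ctr, target_exp, hk_exp):
    """oeRi = (Gene_exp / Gene_ctr) - 1; oeRi > 0 counts as a success."""
    gene_ctr = normalized_expression(target_ctr, hk_ctr)
    gene_exp = normalized_expression(target_exp, hk_exp)
    if gene_ctr == 0:
        raise ValueError("control expression of the target gene is zero")
    return gene_exp / gene_ctr - 1.0


# Knockdown: Gene_ctr = 8/4 = 2.0, Gene_exp = 2/4 = 0.5, siRi = 1 - 0.25 = 0.75
siri = knockdown_scale(target_ctr=8.0, hk_ctr=4.0, target_exp=2.0, hk_exp=4.0)

# Overexpression: Gene_ctr = 2/4 = 0.5, Gene_exp = 6/4 = 1.5, oeRi = 3 - 1 = 2.0
oeri = overexpression_scale(target_ctr=2.0, hk_ctr=4.0, target_exp=6.0, hk_exp=4.0)

knockdown_success = siri > 0
overexpression_success = oeri > 0
```

   In this toy case the knockdown removes 75% of the normalized
   expression and the overexpression triples it, so both experiments
   would be counted as successes under the criteria above.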

   According to the housekeeping gene list in the references,[60]22,