Abstract

The Library of Integrated Network‐based Cellular Signatures (LINCS) project provides comprehensive transcriptome profiling of human cell lines before and after chemical and genetic perturbations. Its L1000 platform measures 978 landmark genes and uses them to computationally infer the transcript levels of 14,292 genes. Here we conducted a quality control analysis of L1000 data, using the MCF7, PC3, and A375 cell lines as representative examples. Before perturbation, a promising 80% correlation in transcriptome was observed between the L1000 and Affymetrix HU133A platforms. After library‐based shRNA perturbations, a moderate 30% of differentially expressed genes overlapped between any two selected control viral vectors on the L1000 platform. The mitogen‐activated protein kinase, vascular endothelial growth factor, and T‐cell receptor pathways were identified as the most significantly shared pathways between chemical and genetic perturbations in cancer cells. In conclusion, the L1000 platform is reliable in assessing the transcriptome before perturbation, but its response to perturbagens needs to be interpreted with caution. A quality control analysis pipeline for L1000 is recommended before addressing biological questions.

Study Highlights

WHAT IS THE CURRENT KNOWLEDGE ON THE TOPIC?
☑ The Library of Integrated Network‐based Cellular Signatures (LINCS) project provides comprehensive transcriptome profiling of human cell lines before and after chemical and genetic perturbations. Its L1000 platform uses 978 landmark genes to computationally infer the expression of another 14,292 genes. However, no quality control data analysis has yet addressed the reproducibility of the L1000 gene expression platform.

WHAT QUESTION DID THIS STUDY ADDRESS?
☑ For the first time, this study conducted a quality control analysis of the LINCS L1000 gene expression platform.
WHAT THIS STUDY ADDS TO OUR KNOWLEDGE
☑ It shows a promising 80% correlation in transcriptome between L1000 and Affymetrix HU133A for MCF7 breast cancer, A375 melanoma, and PC3 prostate cancer cells. L1000 reproducibility analyses show that a moderate 30% of differentially expressed genes overlapped between any two selected control viral vectors in the genetic perturbation screening. The MAPK, VEGF, and T‐cell receptor pathways are identified as the pathways most significantly connecting chemical and genetic perturbations in breast cancer cells. A quality control pipeline for L1000 data is recommended before addressing biological questions.

HOW MIGHT THIS CHANGE DRUG DISCOVERY, DEVELOPMENT, AND/OR THERAPEUTICS?
☑ LINCS data provide the ability to establish a systems pharmacology view of drug effects on the transcriptome. This landmark database provides a molecular basis for further study of single‐drug and/or drug combination effects at the cell level. LINCS data will eventually lead to more rational drug selection for patients.

The combination of “omics” technologies and cell‐based drug screening tools has enabled us to evaluate cellular responses to drug perturbations and to explore drug targets and their mechanisms. Large‐scale anticancer drug screening projects, such as the National Cancer Institute 60 (NCI60) human tumor cell line panel,[1] Connectivity Map (CMAP),[2, 3] Cancer Cell Line Encyclopedia (CCLE),[4] Genomics of Drug Sensitivity in Cancer (GDSC),[5] and the Cancer Therapeutics Response Portal (CTRP),[6] all used cancer cell baseline genomes and/or transcriptomes to predict cellular drug responses. Cell‐based pooled short hairpin RNA (shRNA) screening is another important strategy for systematically identifying essential genes and, eventually, therapeutic drug targets.
The Achilles Project and DPSC‐Cancer shRNA interference detection[7, 8] provided genome‐wide shRNA dropout signature profiles for identifying cell vulnerabilities associated with genetic alterations. These databases attempted to address various aspects of the associations between molecular profiles, genetic and chemical perturbations (perturbagens), and cell responses to the perturbagens.[9] However, it is difficult to form an integrated picture of cancer cell molecular profiles and their responses to perturbagens from these data sources, because the data were generated on different experimental platforms.[11]

By combining chemical compounds and RNAi, the Library of Integrated Network‐based Cellular Signatures (LINCS) project uses a novel transcriptome platform, L1000, to assess cell responses to perturbagens (lincs.hms.harvard.ed).[9, 12] It allows transcriptomes, perturbagens, and cell responses to drugs to be integrated at the same time.[12, 13, 14]

The L1000 platform, a cost‐effective bead‐based assay, is the primary LINCS technology for measuring transcriptomic responses to perturbagens.[10] It directly measures transcripts of 978 landmark genes, from which the transcript levels of 14,292 genes are computationally inferred on the basis of Gene Expression Omnibus (GEO) gene expression data.[10, 12] The relatively low cost of L1000 allows the LINCS project to assess transcriptomic responses to 20,413 small‐molecule compounds and 22,119 genetic interference perturbagens under more than four million different conditions (100‐fold larger than other existing screening studies) (support.lincscloud.org). However, the enormous impact of the L1000 technology on chemical and genetic perturbagen screening depends heavily on its data quality. The existing quality control analysis of the L1000 platform[10] focused only on 90 differentially expressed “landmark” genes in a validation experiment.
It showed a high correlation, 0.92, for the 90 landmark genes between their expression on the L1000 platform and their reverse‐transcription polymerase chain reaction (RT‐PCR) validations. However, that quality control analysis did not include the 14,292 L1000‐inferred genes. Hence, a more valuable evaluation would be a whole‐genome gene expression comparison between L1000 and other established whole‐genome microarray platforms. The many cell‐based molecular profiling datasets in public domain databases (e.g., GEO) allow us to compare L1000 data quality with that of other gene expression platforms.

As with all shRNA libraries, L1000 RNA interference studies are complicated by the sequence design of the perturbation shRNAs, which affects silencing efficiency.[15, 31] Thus, the effectiveness of genetic perturbagens on nonlandmark genes should also be investigated further.

In L1000, many perturbagens and experimental conditions are used to explore perturbagen‐induced changes in the transcriptome. These include different cellular backgrounds, multiple chemical dosing concentrations, multiple timepoints,[12] and different empty control vectors (i.e., GFP, RFP, Luciferase, lacZ, and PGW) for single‐gene knockdown or overexpression. These experimental conditions provide an opportunity to broadly explore perturbagen effects on cells. In this study, we are particularly interested in whether the selection of controls influences the effects of shRNAs and gene overexpression on cells and their transcriptome response.

To address the challenges in LINCS data quality control, more than 6,975 chemicals, 3,827 single‐gene knockdowns, and 2,281 single‐gene overexpressions were investigated in three cancer cell lines: MCF7 (an estrogen receptor‐positive luminal breast cancer cell line[16]), A375 (a human malignant melanoma cell line[17]), and PC3 (a metastatic prostate cancer cell line).
L1000 data were analyzed from four different aspects for the first time: 1) the baseline (i.e., untreated) transcriptome of the cancer cells on the L1000 platform was compared with the corresponding transcriptome on the Affymetrix HU133A platform; 2) transcriptomes were compared between chemically treated groups and controls at multiple timepoints for the three cell lines; 3) RNAi experimental variation and its sensitivity to different control groups were explored; and 4) connectivity between genetic and chemical perturbations was investigated. Finally, guidance on how to use the L1000 dataset is provided.

METHODS

Materials general

In this study, level 3 data (the normalized profiles) are used for the quality control data analysis. These data are described in detail in the supplemental materials. L1000 adopts the practice of storing data annotations (metadata) and datasets separately.[12, 13] The InstInfo file describes L1000 signature profiles under different experimental conditions (Supplementary Figure 1 visualizes the experimental design). Each expression profile is assigned a unique identifier, the signature ID (or “distil_id”), which connects the data with its metadata in InstInfo. Table 1 lists the sample numbers for L1000 data quality control analysis and platform comparison with the Affymetrix HU133A. Supplementary Table 2 shows the numbers of gene expression profiles and chemicals before and after compound treatment at 6 hours (H6) and 24 hours (H24) in the three cell lines. Supplementary Table 4 lists the numbers of records and genes for the three cell types at 96 hours (H96) and 144 hours (H144) for knockdown and overexpression experiments against different control vectors.

Table 1.
Sample numbers for L1000 data quality control and platform comparison between the Affymetrix HT‐HG‐U133A and L1000 in 22,268 probe sets

Platform                Timepoint  Experiment type  Measure                    MCF7   PC3    A375
L1000                   H6         Compounds        #Gene expression profiles  43862  39605  27428
L1000                   H6         Compounds        #Chemical compounds        5434   4737   3083
L1000                   H24        Compounds        #Gene expression profiles  57475  57380  28601
L1000                   H24        Compounds        #Chemical compounds        6976   5845   2169
L1000                   H96        Knockdown        #Gene expression profiles  36023  41414  40640
L1000                   H96        Knockdown        #Genes                     3472   3824   3827
L1000                   H96        Overexpression   #Gene expression profiles  9220   10271  10109
L1000                   H96        Overexpression   #Genes                     2160   2281   2281
L1000                   H144       Knockdown        #Gene expression profiles  20204  20414  /
L1000                   H144       Knockdown        #Genes                     1838   1726   /
L1000                   Base Line  -                #Gene expression profiles  2922   27     24
Affymetrix HT‐HG‐U133A  Base Line  -                #Gene expression profiles  56     8      16

Methods

The flow of data processing and analysis is shown in Figure 1.

Figure 1. The overall L1000 quality control data analysis procedure. The figure shows the data analysis process and its associated methods in the study. The data analysis procedure can be divided into four steps. First, L1000 data are reannotated, retrieved, and extracted. Second, differentially expressed gene (DEG) analysis is performed before vs. after perturbation (both chemical and genetic) using a two‐sample equal‐variance Student's t‐test. Third, L1000 data quality is analyzed. This includes the mRNA correlation analysis between platforms and the comparison of genetic perturbations across different control vectors through R‐square; the shRNA interference scale and the gene overexpression scale are also calculated to evaluate the reliability of the genetic perturbation experiments. Fourth, connectivity analysis is performed between chemical and genetic perturbations by GO enrichment and KEGG pathway analysis.
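The second step of the procedure, the two‐sample equal‐variance Student's t‐test, can be sketched for a single gene as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the p‐value is obtained by simple numerical integration of the t density so that the example needs only the Python standard library, and the P < 0.01 threshold and minimum group size of three are taken from the differentially expressed gene analysis described in this study. In practice a library routine such as `scipy.stats.ttest_ind` with `equal_var=True` would normally be used instead.

```python
import math
from statistics import mean, variance


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t_stat, df):
    """Two-tailed p-value: integrate the density over [0, |t|] with Simpson's rule."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    steps = 2 * max(1000, int(upper * 50))  # always an even number of intervals
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central_area)


def pooled_t_test(treated, control, min_n=3, alpha=0.01):
    """Equal-variance two-sample t-test for one gene measured on one plate.

    Returns (t, p, significant), or None when a group has fewer than
    min_n samples or the pooled variance is zero.
    """
    n1, n2 = len(treated), len(control)
    if n1 < min_n or n2 < min_n:
        return None
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(control)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    if std_err == 0:
        return None
    t_stat = (mean(treated) - mean(control)) / std_err
    p_value = two_tailed_p(t_stat, df)
    return t_stat, p_value, p_value < alpha


# Made-up expression values for one gene: perturbagen-treated vs. DMSO wells.
result = pooled_t_test([8.1, 8.3, 8.2], [6.0, 6.2, 6.1])
```

Applying the function gene by gene to treated and control samples from the same plate mirrors the plate‐matched comparison used here; a gene would be flagged as a DEG when the third element of the returned tuple is true.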
Differentially expressed gene analysis

An unpaired two‐tailed Student's t‐test was used to evaluate the difference in gene expression between two groups: before and after chemical treatment (the gene profiles of chemically treated cells vs. those incubated with the dimethyl sulfoxide (DMSO) control) in MCF7, PC3, and A375 cells at 6 hours and 24 hours, respectively; and before and after shRNA/overexpression perturbation (the gene knockdown/overexpression group vs. each of its control groups, such as the empty vectors GFP,[18] eGFP,[19] Luciferase, HcRed,[20] or lacZ,[21] respectively). The difference was considered significant if P < 0.01. Batch effects due to the plate are removed by the quartile normalization. In the LINCS L1000 experimental design, each plate has its own control samples (i.e., DMSO) and perturbagen‐treated samples. Our two‐sample t‐tests use perturbagen‐treated and control samples from the same plate to identify differentially expressed genes (DEGs). To make the t‐test less sensitive to outliers or small variance, a minimum sample size of three was required for each group.

shRNA interference and gene overexpression scale calculation

The shRNA knockdown scale quantifies the gene knockdown accuracy. First, the expression of a gene is normalized by the housekeeping gene expression. Then, the knockdown scale quantifies the change in expression of the interfered gene relative to its uninterfered expression (i.e., control) in a cell. Denoting the control group as ctr and the shRNA group as exp, the knockdown scale is calculated as follows:
* Gene_ctr = expression of the shRNA‐targeted gene in the control group / housekeeping gene expression in the control group;
* Gene_exp = expression of the shRNA‐targeted gene in the shRNA experimental group / housekeeping gene expression in the shRNA experimental group;
* shRNA knockdown scale: siRi = 1 − (Gene_exp / Gene_ctr);
* In this article, if siRi > 0, the experiment is defined as a success; otherwise it is a failure.
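As a worked illustration of the knockdown scale defined above, the following sketch computes siRi from four expression values. The function and argument names are hypothetical and the numbers are made up; only the formula itself comes from the text.

```python
def knockdown_scale(target_exp, housekeeping_exp, target_ctr, housekeeping_ctr):
    """shRNA knockdown scale: siRi = 1 - (Gene_exp / Gene_ctr).

    Each target-gene value is first normalized by the housekeeping
    gene expression measured in the same group.
    """
    gene_exp = target_exp / housekeeping_exp  # shRNA experimental group
    gene_ctr = target_ctr / housekeeping_ctr  # control vector group
    return 1 - gene_exp / gene_ctr


def is_successful_knockdown(si_ri):
    """A knockdown counts as a success when siRi > 0."""
    return si_ri > 0


# Made-up values: the target gene drops from 8.0 to 2.0 while the
# housekeeping gene stays at 10.0, so siRi is approximately 0.75.
si_ri = knockdown_scale(target_exp=2.0, housekeeping_exp=10.0,
                        target_ctr=8.0, housekeeping_ctr=10.0)
success = is_successful_knockdown(si_ri)
```

A value near 1 indicates almost complete silencing of the target gene, a value near 0 indicates no change relative to the control, and a negative value means the target gene was expressed more strongly after shRNA treatment than in the control.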
The gene overexpression scale is calculated with a formula similar to that of the shRNA interference scale:
* Gene_ctr = expression of the overexpressed gene in the control group / housekeeping gene expression in the control group;
* Gene_exp = expression of the overexpressed gene in the overexpression experiment group / housekeeping gene expression in the overexpression experiment group;
* Gene overexpression scale: oeRi = (Gene_exp / Gene_ctr) − 1;
* When oeRi > 0, the experiment is deemed a success; otherwise it is deemed a failure.
According to the housekeeping gene list in the references,[22],