Abstract Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) is a powerful genomic technology that is used for the global mapping and analysis of open chromatin regions. However, for users to process and analyze such data they either have to use a number of complicated bioinformatic tools or attempt to use the currently available ATAC-seq analysis software, which are not very user friendly and lack visualization of the ATAC-seq results. Because of these issues, biologists with minimal bioinformatics background who wish to process and analyze their own ATAC-seq data by themselves will find these tasks difficult and ultimately will need to seek help from bioinformatics experts. Moreover, none of the available tools provide complete solution for ATAC-seq data analysis. Therefore, to enable non-programming researchers to analyze ATAC-seq data on their own, we developed a tool called Graphical User interface for the Analysis and Visualization of ATAC-seq data (GUAVA). GUAVA is a standalone software that provides users with a seamless solution from beginning to end including adapter trimming, read mapping, the identification and differential analysis of ATAC-seq peaks, functional annotation, and the visualization of ATAC-seq results. We believe GUAVA will be a highly useful and time-saving tool for analyzing ATAC-seq data for biologists with minimal or no bioinformatics background. Since GUAVA can also operate through command-line, it can easily be integrated into existing pipelines, thus providing flexibility to users with computational experience. Keywords: ATAC-seq data analysis, GUI, bioinformatic tool, ATAC-seq, NGS data analysis Introduction Eukaryotic DNA is packaged together with histones to form nucleosomes, the basic unit of chromatin, which is important in determining the structure of the genome and the regulation of biological processes such as transcription. The regulatory information within chromatin can be uncovered through high-throughput sequencing methods for assaying chromatin accessibility, nucleosome positioning, histone modifications, and transcription factor (TF) occupancy. Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) is a simple, yet highly robust and sensitive genome-wide technique that was recently developed for identifying nucleosome-bound and nucleosome-free regions of chromatin as well as TF occupancy ([25]Buenrostro et al., 2013). Briefly, ATAC-seq accomplishes this by simultaneously fragmenting and tagging genomic DNA with sequencing adaptors using the hyperactive Tn5 transposase enzyme ([26]Buenrostro et al., 2013). In contrast to similar technologies such as DNase I hypersensitive sites sequencing (DNase-seq) ([27]Song and Crawford, 2010), ATAC-seq requires less cells ([28]Buenrostro et al., 2013), meaning that ATAC-seq can be applied in many more experimental situations. In comparison to formaldehyde-assisted isolation of regulatory elements with sequencing (FAIRE-seq) ([29]Giresi and Lieb, 2009), another global chromatin accessibility technology, ATAC-seq data has a lower background and therefore better signal-to-noise ratio ([30]Buenrostro et al., 2013). Finally, ATAC-seq, unlike DNase-seq and FAIRE-seq, is much simpler and does not involve loss-prone steps like adapter ligation, gel purification, and crosslink reversal ([31]Buenrostro et al., 2013). Therefore, due to its ease of use, speed, low starting material requirement, reliability and multiplexing potential, ATAC-seq has been widely adopted and will most likely be the preferred method of choice in comparison to FAIRE-seq and DNase-seq for global chromatin accessibility analysis. Currently, several bioinformatic tools are available for processing or analyzing ATAC-seq data including, NucleoATAC ([32]Schep et al., 2015), ATACseqQC ([33]Ou et al., 2018), ATAC-Seq/DNase-Seq Pipeline^[34]1, and I-ATAC ([35]Ahmed and Ucar, 2017). However, these tools are either command-line based or provide only partial solutions, but more importantly, most of them are complicated to use especially for biologists with little bioinformatics training, command line, and scripting experience. Moreover, these tools lack differential analysis, functional annotation, and visualization of ATAC-seq results (Table [36]1). Thus, we developed Graphical User interface for the Analysis and Visualization of ATAC-seq data (GUAVA), a simple user-friendly GUI-based application for processing, analyzing and visualizing ATAC-seq data. GUAVA accomplishes this by performing the following main tasks: (i) pre-process raw sequencing reads, (ii) map sequencing reads to a reference genome, (iii) filter and shift the mapped reads, (iv) identify and annotate the ATAC-seq peaks, (v) normalize and visualize the ATAC-seq data tracks, (vi) identify differentially enriched peaks, and (vii) identify over-represented gene ontology terms and pathways. GUAVA operates as a standalone program on computers with either Linux or Mac OS. To aid users, GUAVA package contains a complete reference manual. Table 1. Comparison between GUAVA and available tools for ATAC-seq data analysis. Features GUAVA NucleoATAC I-ATAC ATACseqQC ATAC-Seq/DNase-Seq Pipeline Graphical user interface Yes No Yes No No Adapter trimming Yes No Yes No Yes Alignment Yes No Yes No Yes Alignment filtering Yes No Yes No Yes Alignment shifting Yes Yes Yes Yes Yes Fragment size distribution plot Yes Yes Yes Yes No Peak calling Yes No Yes No Yes ATAC-seq differential analysis Yes No No No No Visualization of ATAC-seq data tracks Yes No No Yes No Annotation and function analysis Yes No No No No Nucleosome positing No Yes No Yes No Transcription factor foot printing No Yes No Yes No [37]Open in a new tab A comparison between GUAVA and currently available tools for ATAC-seq data analysis based on features such read mapping, peak calling, and differential analysis, etc. Materials and Methods Design and Implementation Graphical User interface for the Analysis and Visualization of ATAC-seq data is an open-source software that was implemented using Java programming language. GUAVA has been designed to be a fully automated software for analyzing and visualizing of ATAC-seq data. GUAVA depends on several different tools and methods such as Bowtie ([38]Langmead et al., 2009) for mapping sequencing reads, MACS2 ([39]Feng et al., 2012) for calling peaks, and DESeq2 ([40]Love et al., 2014) for differential analysis. GUAVA contains two programs: (1) ATAC-seq data analysis (to process individual ATAC-seq samples or replicates) and (2) ATAC-seq differential analysis (to compare ATAC-seq samples from two different conditions). In addition, GUAVA has a genome index builder for non-programming researchers to create a genome index (Supplementary Figure [41]S1). A detailed workflow for GUAVA is shown in Figure [42]1. The GUI to select the desired GUAVA program and upload the input files is shown in Figure [43]2. FIGURE 1. [44]FIGURE 1 [45]Open in a new tab ATAC-seq data and differential analysis workflow for GUAVA. A diagram showing the GUAVA workflow for analyzing ATAC-seq data. The workflow has two parts (1) ATAC-seq data analysis: to process a ATAC-seq sample and (2) ATAC-seq differential analysis: to identify differentially enriched ATAC-seq signals between two conditions. FIGURE 2. [46]FIGURE 2 [47]Open in a new tab Design of the GUAVA input interface. (A) The main GUAVA window where users select the program they wish to use (ATAC-seq data analysis or ATAC-seq differential analysis). (B) Window for using ATAC-seq data analysis. To perform ATAC-seq data analysis, users load the sequencing data using the R1 fastq and R2 fastq buttons. Next, they choose between Bowtie and Bowtie2 from the drop-down menu in the “Alignment Parameters” section. If required, users can change the other parameters from the default value to another value. For example, the default RAM used for GUAVA is 1 GB but it can be changed if desired. Finally, to run the analysis users simply click the “Start Analysis” button. (C) Window for using ATAC-seq differential analysis. To perform ATAC-seq differential analysis, users set the project name, load the bam and bed files and then set the differential analysis parameters in (D) before selecting the “Start” button. Part 1: ATAC-seq Data Analysis To use the ATAC-seq data analysis program, the user simply needs to (1) upload the ATAC-seq paired-end fastq reads and (2) provide the Bowtie or Bowtie2 ([48]Langmead and Salzberg, 2012) index reference genome using the input interface of the ATAC-seq data analysis program (Figure [49]2B). If the data has not been adapter trimmed, the user can select the trimming option. The user has the option to change several parameters such as the maximum insert size, the minimum mapping quality and the p/q value cut-off. Finally, after setting the output folder for storing the result files, the user can select “Start Analysis.” GUAVA will then automatically process and analyze the data as described below. Once GUAVA has finished the analysis, results such as alignment statistics, fragment size distribution, ATAC-seq peaks, and their functional annotations are shown in the output interface (Figure [50]3). The user can also find an excel file containing the same results in the output folder (Supplementary Data Sheet [51]S1). FIGURE 3. FIGURE 3 [52]Open in a new tab Output interface for GUAVA ATAC-seq data analysis. (A) Input summary and alignment statistics. (B) Read filtering and peak calling summary. (C) Peak annotation table with sorting and filtering functionality. Easy access to IGV for visualizing peaks and automatically generated normalized ATAC-seq signal by GUAVA. (D) Visualization of ATAC-seq peaks with IGV. (E) Graph showing the fragment size distribution. (F) Bar chart showing the percentage of peaks in various genomic locations such as the promoter, intron, exon, and UTR, etc. (G) Plot showing the percentage of the peaks upstream and downstream of the TSS of the nearest genes. Different colors indicate different ranges of distances from the TSS. (H) Over-represented KEGG pathways obtained using the ChIPpeakAnno bioconductor package. Pre-processing of the Raw Reads In many instances, the raw sequenced reads used for ATAC-seq analysis will contain adapter sequences. Therefore, before mapping the raw reads to the genome, the adapter sequences should be trimmed off, otherwise any reads with adapter sequences will fail to map. For adapter trimming, GUAVA uses cutadapt ([53]Martin, 2011). If adapter sequences have already been trimmed, this step can be omitted. After adapter trimming, the reads will be saved in fastq format. GUAVA will then generate a fastq quality report using the FastQC tool^[54]2. The FastQC report is stored inside the folder with the suffix FastQC_OUTPUT within the GUAVA output folder. In the FastQC report, “per base sequence quality” should be more than 20 for each nucleotide in the sequencing read and “per base N content” should not exceed 5%. If these minimum parameters are not met, this is an indication the sequencing reads are of poor quality. It should be noted that poor quality or sequencing errors often leads to low mapping rate and therefore, it is important to check the quality of the sequencing reads before mapping them to the genome. Read Mapping For mapping ATAC-seq reads to the genome, the user can choose between Bowtie (for ungapped alignment) or Bowtie2 (for gapped alignment). In general, Bowtie2 is faster, more sensitive, and memory efficient for reads longer than 50 bp, whereas Bowtie sometimes performs better for relatively short reads (less than 50 bp) ([55]Langmead and Salzberg, 2012). The default parameters for these programs produce satisfactory alignment results, but the user can change them if they wish to achieve a higher alignment rate. The output alignment file is initially saved as a sequence alignment/map (SAM) format and then converted into a sorted BAM (binary version of SAM) format to save storage space and to speed up subsequent data processing. Alignment Filtering In the ATAC-seq assay, many of the aligned reads cannot be used for downstream data analysis due to mitochondrial DNA contamination, duplicates, and alignment in the blacklist regions of the genome (locations in the genome that tends to show artificially high read signals) ([56]Consortium, 2012). In GUAVA, duplicate reads are removed by Picard MarkDuplicates^[57]3, while reads aligning to the mitochondria and/or other chromosomes (if required) and blacklist regions are filtered using SAMtools ([58]Li et al., 2009). Finally, only reads which are aligned in proper pair and passed alignment filtering are retained for further processing. Alignment Shifting When the Tn5 transposase cuts open chromatin regions it introduces two cuts that are separated by 9 bp ([59]Buenrostro et al., 2013). Therefore, ATAC-seq reads aligning to the positive and negative strand need to be adjusted by +4 bp and -5 bp, respectively, to represent the center of the transposase binding site ([60]Buenrostro et al., 2013). In GUAVA, we have included an in-house program to perform the alignment shifting. The output file (a sorted BAM file) from this step is saved with the suffix “ATACseq.bam.” Users can use the alignment shifted bam file as an input bam file for differential analysis. Fragment Size Distribution Graph Tn5 transposase cuts open chromatin regions and also linker DNA. Therefore, the fragment size distribution graph of a good quality ATAC-seq library has two sharp peaks at <100 bp (open chromatin) and ∼200 bp (mono-nucleosome) and smaller peaks representing di-nucleosomes and tri-nucleosomes. GUAVA uses Picard CollectInsertSizeMetrics^3 to compute the fragment sizes on alignment shifted bam files and R to plot fragment size distribution graphs. Peak Calling To identify enriched chromatin regions in ATAC-seq libraries, MACS2 is used for peak calling with the parameters nomodel and nolambda. For differential analysis, users can use the narrowPeak file from MACS2 output. Annotation and Functional Analysis Graphical User interface for the Analysis and Visualization of ATAC-seq data uses the ChIPpeakAnno ([61]Zhu et al., 2010) bioconductor package for annotation and functional analysis. First, ATAC-seq peaks are categorized into different groups based on their genomic location (i.e., promoter, untranslated regions (UTRs), intron, and exon). Second, peaks that are within 5 kb upstream and 3 kb downstream of the TSS are associated to the nearest genes. Finally, these genes are then analyzed for over-represented gene ontology terms and KEGG pathways ([62]Kanehisa et al., 2017) using ChIPpeakAnno’s getEnrichedGO and getEnrichedPATH functions, respectively. Part 2: ATAC-seq Differential Analysis To use the ATAC-seq differential analysis program in GUAVA, users need to load the processed (filtered and shifted) ATAC-seq alignment bam file and narrowPeak file (hereafter called peak file) for each sample (Figure [63]2C). These files can be found in the output folder of the “ATAC-seq data analysis” program. Next, the user selects the “Next” button to set the fold change and p-value cut-off as well as select the output folder to store the results (Figure [64]2D). Finally, the user selects the “start” button to compare ATAC-seq signal from the two conditions. GUAVA will perform the tasks as described below to identify and annotate differentially enriched ATAC-seq peaks. Results such as the differentially enriched peaks, a volcano plot showing gained-open and gained-closed regions between two conditions, over-represented gene ontology and pathways are displayed in the output interface of the “ATAC-seq differential analysis” program (Supplementary Figure [65]S2). Generation of Unique Peak List and Read Count Matrix To prepare the input data for differential analysis, overlapping peaks from biological replicates are first merged to create a peak list for each condition (e.g., control and treatment). Next, the list of peaks from each condition are then combined to create a final list of unique peaks in which differential analysis is performed. GUAVA then creates a read count matrix by counting the reads in each unique peak for each input sample. To achieve this, GUAVA uses the Rsubread bioconductor package ([66]Liao et al., 2013). Differential ATAC-seq Analysis To identify differentially enriched ATAC-seq regions, GUAVA uses the DESeq2 bioconductor package. Users can set the fold change and p-value cut-off for defining the chromatin accessibility gained (gained-open) and reduced (gained-closed) peaks. Results are displayed in a tabular format in the GUAVA GUI as list of differentially enriched peaks and a volcano plot summarizing the significantly differentially enriched regions. Functional Analysis of Differentially Enriched Peaks To perform functional analysis on the differentially enriched peaks, they are first categorized into different groups based on their genomic location (i.e., promoter, UTRs, intron, and exon). Next, differentially enriched peaks are annotated and associated to the nearest genes if they occur within the user defined distance from the TSS. Finally, to identify over-represented gene ontology terms and KEGG pathways, GUAVA uses the ChIPpeakAnno bioconductor package. Normalization of ATAC-seq Signals for Visualization on the Integrated Genome Viewer It is important to normalize ATAC-seq data before visualizing them on the IGV ([67]Thorvaldsdottir et al., 2013) because most likely each sample was sequenced at different depths. Hence, the raw signals are not comparable. GUAVA uses the following method to produce normalized ATAC-seq data tracks. ATAC-seq signals are normalized as reads per million (RPM) so that they can be compared with other biological samples. Normalized data tracks are saved in bigwig format. Output interface of GUAVA allows users to easily view these normalized ATAC-seq signals on the IGV. With one click, the ATAC-seq data track, peaks and genome will be loaded onto the IGV. Multi-threading is used to open multiple instances on the IGV. Results and Discussion In this work, we have developed GUAVA, a fully automated GUI tool, to help researchers, in particular biologists with no computational knowledge, the ability to process and analyze ATAC-seq data on their own. To illustrate the use of GUAVA, we processed and analyzed an ATAC-seq dataset (accession number [68]GSE84515 ) that was derived from a study examining chromatin accessibility changes in breast cancer cells in response to PI3K inhibitors ([69]Toska et al., 2017). Specifically, ATAC-seq was performed on T47D cells that were treated with either DMSO (vehicle) or the PI3K inhibitor, BYL719. Overall, this dataset consists of four samples with two replicates for each condition (Supplementary Table [70]S1). Case Study Demonstrating GUAVA We first processed the four ATAC-seq samples individually using the ATAC-seq data analysis program. We selected the adapter trimming option with default parameters to remove the sequencing adapters from the raw reads. For read mapping, we used Bowtie2 and the hg19 genome a 2,000 bp insert size and a minimum mapping quality of 30. Since the test dataset is from breast cancer cells, we filtered reads aligning to the Y chromosome as well as the mitochondria by using the “show chromosome” button. For peak calling, a 0.001 q-value cut-off was used. In general, the average mapping rate of the four samples that we processed was >75%. Moreover, over half of the mapped reads passed alignment filtering. The detailed alignment and alignment filtering results for the four samples can be found in Supplementary Table [71]S2. We also observed a fragment size distribution interval of approximately 200 bp for the ATAC-seq peaks representing mono-, di-, and tri-nucleosomes (Supplementary Figure [72]S3). This result indicates the ATAC-seq libraries were of good quality. The ATAC-seq peaks for all the samples were mainly located in the promoter, intergenic, and intronic regions (Supplementary Figure [73]S4). Gene ontology analysis of the individual ATAC-seq samples showed over-represented terms for cell cycle, DNA repair, and the apoptotic signaling pathway (Supplementary Table [74]S3). In pathway analysis of the individual ATAC-seq samples, control samples showed over-represented terms for DNA replication, RNA transport, cell cycle, pathways in cancer, and signaling pathways, while BYL719 treated samples were over-represented in terms for ErbB signaling pathways, p53 signaling pathway, and phosphatidylinositol signaling system (Supplementary Table [75]S4). Next, we performed differential analysis on the above processed samples with GUAVA’s ATAC-seq differential analysis program using a twofold change and a 0.001 p-value cut-off. For annotation and functional analysis, we used a 5 kb upstream and a 3 kb downstream region for associating peaks with genes. Overall, we identified 11,067 chromatin accessibility locations that were reduced (gained closed) and 18,683 locations with increased chromatin accessibility (gained open) signals upon BYL719 treatment (Figure [76]4A). However, the reproducibility between the replicates in the control samples was less as compared with the replicates of the BYL719 treated samples (Figure [77]4B). Consistent with the known biological effects of BYL719, gene ontology analysis of the differential peaks showed over-represented terms for regulation of cell proliferation, inositol phosphate metabolic process, negative regulation of extrinsic apoptotic signaling pathway, and regulation of extrinsic apoptotic signaling pathway via death domain receptors (Figure [78]5A and Supplementary Data Sheet [79]S2). Similarly, in pathway enrichment analysis, the changes in chromatin accessibility induced by BYL719 associated with processes such as inositol phosphate metabolism, phosphatidylinositol signaling system, p53 signaling pathway, and pathways in cancer (Figure [80]5B and Supplementary Data Sheet [81]S2). FIGURE 4. [82]FIGURE 4 [83]Open in a new tab BYL719 induces differentially enriched ATAC-seq peaks in T47D cells. (A) A volcano plot showing the differentially enriched peaks between BLY719 and DMSO-treated T47D cells. In total, 29,750 regions were found with differentially enriched ATAC-seq signals. Out of which 18,683 regions were gained-open (increased in chromatin accessibility) and 11,067 were gained-closed regions (decreased in chromatin accessibility) before and after BLY719 treatment of T47D cells. (B) Principal component analysis plot for the control and treatment replicates. FIGURE 5. [84]FIGURE 5 [85]Open in a new tab Gene ontology (GO) and pathway analysis of differential enriched peaks. Differentially enriched peaks from Figure [86]4 were annotated with nearest downstream genes. Then these genes are used to perform (A) gene ontology and (B) pathway analysis. Here, we have presented GUAVA, a simple fully automated GUI software for ATAC-seq analysis. Our main objective behind developing GUAVA was to enable non-programming scientist to analyze ATAC-seq data immediately with no learning curve. The graphs and tables generated by GUAVA will help users to interpret and understand their data. We have created GUAVA with an easy-to-use interface that not only enables users to analyze their ATAC-seq data but also help them to visualize their results. Therefore, GUAVA is a useful user-friendly time saving tool for ATAC-seq analysis. The current version of GUAVA does not perform nucleosome positioning or TF foot-printing analyses and thus, one of our future goals for improving GUAVA will be to include both types of analyses. In addition, future versions of GUAVA will also incorporate modules for assessment of reproducibility of the ATAC-seq signals between the replicates. Availability and Future Directions Graphical User interface for the Analysis and Visualization of ATAC-seq data, together with a complete reference manual, is available at GitHub^[87]4 and the source code of GUAVA^[88]5. The sample data and documentation^[89]6 is available for the demonstration of GUAVA. Furthermore, a video guide ^[90]7 is available. Besides incorporating the additional features as mentioned above, we will also continue to make GUAVA as user-friendly as possible by improving the GUI as well as including modules for batch processing samples, and additional differential analysis methods. Author Contributions MD and EC conceived of the described tool. MD contributed to the source code and performed the analyses. ED made intellectual contributions to the development and improvement of the tool. MD and EC contributed to the writing of the manuscript and the online manual. Conflict of Interest Statement The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Acknowledgments