Abstract

Background

   Genome-wide Association Studies (GWAS) have proved invaluable for the
   identification of disease susceptibility genes. However, the
   prioritization of candidate genes and regions for follow-up studies
   often proves difficult due to false-positive associations caused by
   statistical noise and multiple-testing. In order to address this issue,
   we propose the novel GWAS noise reduction (GWAS-NR) method as a way to
   increase the power to detect true associations in GWAS, particularly in
   complex diseases such as autism.

Methods

   GWAS-NR utilizes a linear filter to identify genomic regions
   demonstrating correlation among association signals in multiple
   datasets. We used computer simulations to assess the ability of GWAS-NR
   to detect association against the commonly used joint analysis and
   Fisher's methods. Furthermore, we applied GWAS-NR to a family-based
   autism GWAS of 597 families and a second existing autism GWAS of 696
   families from the Autism Genetic Resource Exchange (AGRE) to arrive at
   a compendium of autism candidate genes. These genes were manually
   annotated and classified by a literature review and functional grouping
   in order to reveal biological pathways which might contribute to autism
   aetiology.

Results

   Computer simulations indicate that GWAS-NR achieves a significantly
   higher classification rate for true positive association signals than
   either the joint analysis or Fisher's methods and that it can also
   achieve this when there is imperfect marker overlap across datasets or
   when the closest disease-related polymorphism is not directly typed. In
   two autism datasets, GWAS-NR analysis resulted in 1535 significant
   linkage disequilibrium (LD) blocks overlapping 431 unique reference
   sequencing (RefSeq) genes. Moreover, we identified the nearest RefSeq
   gene to the non-gene overlapping LD blocks, producing a final candidate
   set of 860 genes. Functional categorization of these implicated genes
   indicates that a significant proportion of them cooperate in a coherent
   pathway that regulates the directional protrusion of axons and
   dendrites to their appropriate synaptic targets.

Conclusions

   As statistical noise is likely to particularly affect studies of
   complex disorders, where genetic heterogeneity or interaction between
   genes may confound the ability to detect association, GWAS-NR offers a
   powerful method for prioritizing regions for follow-up studies.
   Applying this method to autism datasets, GWAS-NR analysis indicates
   that a large subset of genes involved in the outgrowth and guidance of
   axons and dendrites is implicated in the aetiology of autism.

Background

   Genome-wide association studies (GWAS) have provided a powerful tool
   for identifying disease susceptibility genes. However, analysis of GWAS
   data has been focused on single-point tests, such as the traditional
   allele-based chi-squared test or the Cochran-Armitage Trend test
   [[50]1], which proceed by testing each single nucleotide polymorphism
   (SNP) independently. As it is likely that the disease variants have not
   been directly genotyped in a GWAS, tests that account for multiple
   flanking SNPs in linkage disequilibrium (LD) with the disease variants
   may increase the power to detect association [[51]2].

   Several approaches have been proposed in order to test for association
   based on multiple markers, which include the haplotype-based approach
   [[52]3-[53]5] and the multivariate approach [[54]6,[55]7]. Akey et al.
   [[56]8] used analytical approaches to demonstrate that multilocus
   haplotype tests can be more powerful than single-marker tests. For the
   multivariate approach, tests such as Hotelling's T^2 test are often
   used to account for multiple markers jointly [[57]6,[58]9]. Although
   statistical power can be increased by such multi-marker approaches, it
   is not a straightforward operation to select markers for testing.
   Including all markers in a gene or region may not be feasible since it
   greatly increases the degrees of freedom in the test, which can reduce
   the power.

   Follow-up studies, such as fine mapping and sequencing, are necessary
   in order to validate association signals and they are also challenging
   [[59]2]. Prioritization of genes or regions for follow-up studies is
   often decided by a threshold of P-values or ranking for significant
   markers [[60]10,[61]11]. However, many false positives can still exist
   in the markers classified as significant for follow-up as a result of
   statistical noise and genome-wide multiple testing. Joint and/or
   meta-analysis of GWAS data can achieve greater power if these data or
   P-values are available from different datasets. If P-values from
   individual and joint analyses are available, it is possible to further
   increase the power by assigning more weight to markers with replicated
   association signals in several datasets or to markers that have
   flanking markers with an association signal.

   We propose the use of the GWAS noise reduction (GWAS-NR) approach which
   uses P-values from individual analyses, as well as joint analysis of
   multiple datasets, and which accounts for association signals from
   surrounding markers in LD. GWAS-NR is a novel approach to extending the
   power of GWAS studies to detect association. Noise reduction is
   achieved by applying a linear filter within a sliding window in order
   to identify genomic regions demonstrating correlated profiles of
   association across multiple datasets. As noise reduction (NR)
   techniques are widely used to boost signal identification in
   applications such as speech recognition, data transmission and image
   enhancement, we expect that GWAS-NR may complement other GWAS analysis
   methods in identifying candidate loci that may then be prioritized for
   follow-up analysis or analysed in the context of biological pathways.

   Enhancing statistical power is particularly important in the study of
   complex diseases such as autism. There is overwhelming evidence from
   twin and family studies for a strong genetic component to autism, with
   estimates of heritability greater than 80% [[62]12-[63]14]. Autism is
   generally diagnosed before the age of 4, based on marked qualitative
   differences in social and communication skills, often accompanied by
   unusual patterns of behaviour (for example, repetitive, restricted,
   stereotyped) [[64]15]. Altered sensitivity to sensory stimuli and
   difficulties of motor initiation and coordination also are frequently
   present. Identifying the underlying genes and characterizing the
   molecular mechanisms of autism will provide immensely useful guidance
   in the development of effective clinical interventions.

   Numerous autism candidate genes have been reported based on association
   evidence, expression analysis, copy number variation (CNV), and
   cytogenetic screening. These genes involve processes including cell
   adhesion (NLGN3, NLGN4 [[65]16], NRXN1 [[66]17], CDH9/CDH10
   [[67]18,[68]19]), axon guidance (SEMA5A [[69]20]), synaptic scaffolding
   (SHANK2, DLGAP2 [[70]21], SHANK3 [[71]22]), phosphatidylinositol
   signalling (PTEN [[72]23], PIK3CG [[73]24]), cytoskeletal regulation
   (TSC1/TSC2 [[74]24,[75]25], EPAC2/RAPGEF4 [[76]26], SYNGAP1 [[77]21]),
   transcriptional regulation (MECP2 [[78]27], EN2 [[79]28]) and
   excitatory/inhibitory balance (GRIN2A [[80]29], GABRA4, GABRB1
   [[81]30]). However, aside from rare mutations and 'syndromic' autism
   secondary to known genetic disorders, the identification of specific
   genetic mechanisms in autism has remained elusive.

   Over the past decade, the vast majority of genetic studies of autism
   (both linkage and focused candidate gene studies) have failed to
   broadly replicate suspected genetic variations. For this reason, the
   assumption that autism is governed by strong and pervasive genetic
   variations has given way to the view that autism may involve numerous
   genetic variants, each having a small effect size at the population
   level. This may arise from common variations having small individual
   effects in a large number of individuals (the common disease-common
   variant [CDCV] hypothesis) or rare variations having large individual
   effects in smaller subsets of individuals (the rare variant [RV]
   hypothesis).

   Given the potential genetic heterogeneity among individuals with autism
   and the likely involvement of numerous genes of small effect at the
   population level, we expected that the GWAS-NR could improve the power
   to identify candidate genes for follow-up analysis. We applied GWAS-NR
   to autism GWAS data from multiple sources and conducted simulation
   studies in order to compare the performance of GWAS-NR with traditional
   joint and meta-analysis approaches. These data demonstrate that GWAS-NR
   is a useful tool for prioritizing regions for follow-up studies such as
   next-generation sequencing.

Methods

GWAS-NR

   The GWAS-NR algorithm produces a set of weighted P-values for use in
   prioritizing genomic regions for follow-up study. Roeder and Wasserman
   [[82]31] characterize the statistical properties of such weighting
   approaches in GWAS, observing that informative weights can improve
   power substantially, while the loss in power is usually small even if
   the weights are uninformative. The GWAS-NR algorithm computes a weight
   at each locus based on the strength and correlation of association
   signals at surrounding markers and in multiple datasets, without
   relying on prior information or scientific hypotheses. The weights are
   applied to the P-values derived from joint analysis of the complete
   data and the resulting weighted P-values are then used to prioritize
   regions for follow-up analysis.

   Noise reduction methods are frequently applied when extracting a common
   signal from multiple sensors. The filter used by GWAS-NR is similar to
   the method proposed by de Cheveigné and Simon [[83]32] for sensor noise
   suppression in magneto- and electro-encephalograph recordings. Each
   sensor is projected onto the other sensors and the fitted values from
   these regressions are used in place of the original values. The fitted
   values of such regressions retain sources of interest that are common
   to multiple sensors. As the regression residuals are orthogonal to the
   fitted values, uncorrelated components are suppressed.

   In a genomic context, the 'sensors' take the form of probit-transformed
   P-values derived from independent datasets, as well as P-values derived
   from joint analysis of the full dataset. The filter inherently
   highlights cross-validating associations, by preserving signals that
   jointly occur in a given genomic region and attenuating spikes that are
   not correlated across subsets of the data. However, GWAS-NR can achieve
   no advantage over simple joint analysis when an association signal is
   restricted to a single marker and flanking markers provide no
   supplementary information.

   We estimate ordinary least-squares regressions of the form
   [MATH:
   <mrow><msub><mi>Z</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>=</mo
   ><msub><mi>α</mi><mrow><mi>j</mi><mi>k</mi></mrow></msub><mo>+</mo><msu
   b><mi>β</mi><mrow><mi>j</mi><mi>k</mi></mrow></msub><msub><mi>Z</mi><mr
   ow><mi>i</mi><mi>k</mi></mrow></msub><mo>+</mo><msub><mi>v</mi><mrow><m
   i>j</mi><mi>k</mi></mrow></msub></mrow> :MATH]

   and compute projections
   [MATH: <mrow><mover
   accent="true"><mrow><msub><mi>Z</mi><mrow><mi>i</mi><mi>j</mi></mrow></
   msub></mrow><mo
   stretchy="true">^</mo></mover><mo>=</mo><msub><mi>α</mi><mrow><mi>j</mi
   ><mi>k</mi></mrow></msub><mo>+</mo><msub><mi>β</mi><mrow><mi>j</mi><mi>
   k</mi></mrow></msub><msub><mi>Z</mi><mrow><mi>i</mi><mi>k</mi></mrow></
   msub></mrow> :MATH]

   where Z[i ]and Z[ik ]are the probits Φ^-1(1 - p) of the P-values at
   locus i in two datasets j and k. Φ^-1(⋅) denotes the inverse of the
   cumulative standard normal distribution. The estimates are computed
   within a centred sliding window of w markers and β[jk ]are constrained
   to be nonnegative which sets
   [MATH: <mrow><mover
   accent="true"><mrow><msub><mi>Z</mi><mrow><mi>i</mi><mi>j</mi></mrow></
   msub></mrow><mo stretchy="true">^</mo></mover></mrow> :MATH]
   to the mean
   [MATH: <mrow><mover
   accent="true"><mrow><msub><mi>Z</mi><mrow><mi>i</mi><mi>j</mi></mrow></
   msub></mrow><mo stretchy="true">¯</mo></mover></mrow> :MATH]
   in regions having zero or negative correlation across sensors. As β[jk
   ]is driven by the covariance between probits in datasets j and k,
   probits that demonstrate positive local correlation will tend to be
   preserved, while probits demonstrating weak local correlation will be
   attenuated. One local regression is computed for each locus and is used
   to compute a single fitted value
   [MATH: <mrow><mover
   accent="true"><mrow><msub><mi>Z</mi><mrow><mi>i</mi><mi>j</mi></mrow></
   msub></mrow><mo stretchy="true">^</mo></mover></mrow> :MATH]
   for that locus. The same method is used to compute projections
   [MATH: <mrow><mover
   accent="true"><mrow><msub><mi>Z</mi><mrow><mi>i</mi><mi>k</mi></mrow></
   msub></mrow><mo stretchy="true">^</mo></mover></mrow> :MATH]
   .

   In order to capture association signals at adjacent loci in different
   datasets without estimating numerous parameters, the regressor at each
   locus is taken to be the probit of the lowest P-value among that locus
   and its two immediate neighbours. Quality control (QC) failure or
   different genotyping platforms can cause SNP genotypes to be missing in
   different datasets. Missing genotypes for a locus having no immediately
   flanking neighbours are assigned a probit of zero. The window width w
   is calculated as w = 2h + 1, where h is the lag at which the
   autocorrelation of the probits declines below a pre-defined threshold.
   In practice, we estimate the autocorrelation profile for each series of
   probits and use the average value of h with an autocorrelation
   threshold of 0.20.

   After computing the projections of
   [MATH:
   <mrow><msub><mrow><mover><mi>Z</mi><mo>∧</mo></mover></mrow><mi>j</mi><
   /msub></mrow> :MATH]
   and
   [MATH:
   <mrow><msub><mrow><mover><mi>Z</mi><mo>∧</mo></mover></mrow><mi>k</mi><
   /msub></mrow> :MATH]
   , the resulting values are converted back to P-values and a set of
   filtered P-values is computed from these projections using Fisher's
   method. The same algorithm is executed again, this time using the
   probits of the filtered P-values and the P-values obtained from the
   joint association analysis of the complete data. The resulting Fisher
   P-values are then treated as weighting factors and are multiplied by
   the corresponding raw P-values from the joint analysis, producing a set
   of weighted P-values. To aid interpretation, we apply a monotonic
   transformation to these weighted P-values, placing them between 0 and 1
   by fitting parameters of an extreme value distribution. The GWAS-NR
   algorithm was executed as a Matlab script.

Simulations

   Although noise reduction has been shown to be useful in other
   biomedical applications [[84]32], understanding its properties for
   identifying the true positives in disease association studies is also
   important. We used computer simulations to compare the performance of
   GWAS-NR with the joint association in the presence of linkage (APL)
   analysis and Fisher's method under a variety of disease models. We used
   genomeSIMLA [[85]33] to simulate LD structures based on the Affymetrix
   5.0 chip and performed the sliding-window haplotype APL [[86]34] test
   to measure association. Detailed descriptions for the simulation
   settings are provided in Additional File [87]1 and detailed haplotype
   configurations can be found in Additional File [88]2.

   An important goal for the proposed approach is to help prioritize
   candidate regions for follow-up studies such as next-generation
   sequencing. Top regions or genes ranked by their P-values are often
   considered priority regions for follow-up studies. In order to
   investigate the proportion of true positives that occur in the top
   regions, we treated the association tests as binary classifiers. The
   markers were ranked by their P-values and markers that occurred in the
   top k ranking were classified as significant, where k was pre-specified
   as a cut-off threshold. The markers that were not in the top k ranking
   were classified as non-significant. We then compared the sensitivity
   and specificity of GWAS-NR with the joint and Fisher's tests. The
   sensitivity was calculated based on the proportion of the three markers
   associated with the disease that were correctly classified as
   significant. The specificity was calculated based on the proportion of
   markers not associated with the disease that were correctly classified
   as non-significant. The sensitivity and specificity were averaged over
   1000 replicates.

Ascertainment and sample description

   We ascertained autism patients and their affected and unaffected family
   members through the Hussman Institute for Human Genomics (HIHG,
   University of Miami Miller School of Medicine, FL, USA), and the
   Vanderbilt Center for Human Genetics Research (CHGR, Vanderbilt
   University Medical Center, Tennessee, USA; UM/VU). Participating
   families were enrolled through a multi-site study of autism genetics
   and recruited via support groups, advertisements and clinical and
   educational settings. All participants and families were ascertained
   using a standard protocol. These protocols were approved by appropriate
   Institutional Review Boards. Written informed consent was obtained from
   parents, as well as from minors who were able to give informed consent;
   in individuals unable to give assent due to age or developmental
   problems, assent was obtained whenever possible.

   The core inclusion criteria were as follows: (1) chronological age
   between 3 and 21 years of age; (2) presumptive clinical diagnosis of
   autism; and (3) expert clinical determination of autism diagnosis using
   Diagnostic and Statistical Manual of Mental Disorders (DSM)-IV criteria
   supported by the Autism Diagnostic Interview-Revised (ADI-R) in the
   majority of cases and all available clinical information. The ADI-R is
   a semi-structured diagnostic interview which provides diagnostic
   algorithms for classification of autism [[89]35]. All ADI-R interviews
   were conducted by formally trained interviewers who have achieved
   reliability according to established methods. Thirty-eight individuals
   did not have an ADI-R and, for those cases, we implemented a
   best-estimate procedure to determine a final diagnosis using all
   available information from the research record and data from other
   assessment procedures. This information was reviewed by a clinical
   panel led by an experienced clinical psychologist and included two
   other psychologists and a paediatric medical geneticist - all of whom
   were experienced in autism. Following a review of case material, the
   panel discussed the case until a consensus diagnosis was obtained. Only
   those cases in which a consensus diagnosis of autism was reached were
   included. (4) The final criterion was a minimal developmental level of
   18 months as determined by the Vineland Adaptive Behavior Scale (VABS)
   [[90]36] or the VABS-II [[91]37] or intelligence quotient equivalent
   >35. These minimal developmental levels assure that ADI-R results are
   valid and reduce the likelihood of including individuals with severe
   mental retardation only. We excluded participants with severe sensory
   problems (for example, visual impairment or hearing loss), significant
   motor impairments (for example, failure to sit by 12 months or walk by
   24 months) or identified metabolic, genetic or progressive neurological
   disorders.

   A total of 597 Caucasian families (707 individuals with autism) were
   genotyped at HIHG. This dataset consisted of 99 multiplex families
   (more than one affected individual) and 498 singleton (parent-child
   trio) families. A subset of these data had been previously reported
   [[92]19]. In addition, GWAS data were obtained from the Autism Genetic
   Resource Exchange (AGRE) [[93]35] as an additional dataset for
   analysis. The full AGRE dataset is publicly available and contains
   families with the full spectrum of autism spectrum disorders. From
   AGRE, we selected only families with one or more individuals diagnosed
   with autism (using DSM-IV and ADI-R); affected individuals with
   non-autism diagnosis within these families were excluded from the
   analysis. This resulted in a dataset of 696 multiplex families (1240
   individuals with autism) from AGRE [[94]35].

Genotyping and quality control and population stratification

   We extracted DNA for individuals from whole blood by using Puregene
   chemistry (QIAGEN, MD, USA). We performed genotyping using the Illumina
   Beadstation and the Illumina Infinium Human 1 M beadchip following the
   recommended protocol, only with a more stringent GenCall score
   threshold of 0.25. Genotyping efficiency was greater than 99%, and
   quality assurance was achieved by the inclusion of one CEPH control per
   96-well plate that was genotyped multiple times. Technicians were
   blinded to affection status and quality-control samples. The AGRE data
   were genotyped using the Illumina HumanHap550 BeadChip with over
   550,000 SNP markers. All samples and SNPs underwent stringent GWAS
   quality control measures as previously described in detail in Ma et al.
   [[95]19].

   Although population substructure does not cause a type I error in
   family-based association tests, multiple founder effects could result
   in a reduced power to detect an association in a heterogeneous disease
   such as autism. Thus, we conducted EIGENSTRAT [[96]38] analysis on all
   parents from analysed families for evidence of population substructure
   using the overlapping SNPs genotyped in both the UM/VU and AGRE
   datasets. In order to ensure the most homogeneous groups for
   association screening and replication, we excluded all families with
   outliers defined by EIGENSTRAT [[97]38] out of four standard deviations
   of principal components 1 and 2.

Haplotype block definition

   We used haplotype blocks to define regions of interest. Significant
   regions can be used for follow-up analysis such as next-generation
   sequencing. We applied the haplotype block definition method proposed
   by Gabriel et al. [[98]39] to the UM/VU dataset. We performed GWAS-NR
   based on single-marker APL P-values from UM/VU, AGRE and joint tests.
   We also performed GWAS-NR on P-values obtained from sliding-window
   haplotype tests with a haplotype length of three markers for the UM/VU,
   AGRE and joint datasets. Since the true haplotype length is not known,
   we chose a fixed length of three markers across the genome and used
   GWAS-NR to sort out true signals from the P-values. Blocks containing
   the top 5000 markers, as ranked by the minimum values (MIN_NR) of the
   GWAS-NR P-values obtained from single-marker tests, and the GWAS-NR
   P-values obtained from tests of three-marker haplotypes, were selected
   for further analysis.

Combined P-values for haplotype block scoring

   In order to test for the significance of the haplotype blocks, we
   calculated the combined P-value for each block using a modified version
   of the Truncated Product Method (TPM) [[99]40]. TPM has been shown to
   have correct type I error rates and more power than other methods
   combining P-values [[100]40] under different simulation models.
   Briefly, a combined score was calculated from the markers in each
   block, based on the product of MIN_NR that were below a threshold of
   0.05. We used the Monte Carlo algorithm [[101]40] with a slight
   modification to test the significance of the combined score.
   Specifically, a correlation matrix was applied to account for
   correlation among P-values for the markers in the same block. The null
   hypothesis is that none of the markers in the haplotype block are
   associated with the disease. In order to simulate the null distribution
   for the combined score, we generated two correlated sets of L uniform
   numbers based on the correlation of 0.67 for CAPL and HAPL P-values,
   where L denotes the number of tests in the block. The minimum values
   were selected from each pair in the two sets, which resulted in a
   vector of L minimum values. Then the correlation matrix was applied to
   the vector of L minimum values and a null combined GWAS-NR score was
   calculated for the haplotype block.

Functional analysis

   In order to investigate functional relationships among genes in the
   candidate set, each candidate was manually annotated and
   cross-referenced, based on a review of current literature, with
   attention to common functions, directly interacting proteins and
   binding domains. Supplementary functional annotations were obtained
   using DAVID (The Database for Annotation, Visualization and Integrated
   Discovery) version 6.7 [[102]41-[103]43].

Results

Simulations

   We present the simulation results for the three-marker haplotype
   disease models in Figures [104]1 and [105]2. Figure [106]1 presents
   receiver operating characteristic (ROC) curves to show the sensitivity
   and specificity of GWAS-NR, the joint APL analysis and Fisher's tests,
   based on varying cut-off values of ranking for significance. The
   Fisher's test to combine P-values was used here as a standard
   meta-analysis approach. The performance of a classification model can
   be judged based on the area under the ROC curve (AUC). For scenario 1
   (identical marker coverage in each dataset), GWAS-NR produced a greater
   AUC than the joint and Fisher's tests. It can also be observed from the
   figure that, given the same specificity, GWAS-NR achieved a higher
   sensitivity for classifying true positives as significant as the joint
   and Fisher's tests.

Figure 1.

   [107]Figure 1
   [108]Open in a new tab

   Comparative classification rates for genome-wide association studies
   noise reduction (GWAS-NR), joint analysis and Fisher's test. GWAS-NR
   has area under the curve (AUC) of 0.703 and the joint and Fisher's
   tests have AUC of 0.64 and 0.615, respectively, for the recessive
   model. Also GWAS-NR has AUC of 0.899 and the joint and Fisher's tests
   have AUC of 0.795 and 0.777, respectively, for the multiplicative
   model. For the dominant model, AUC for GWAS-NR, the joint and Fisher's
   tests are 0.981, 0.880 and 0.867, respectively. For the additive model,
   AUC for GWAS-NR, the joint and Fisher's tests are 0.932, 0.822, and
   0.807, respectively.

Figure 2.

   [109]Figure 2
   [110]Open in a new tab

   Comparative classification rates for genome-wide association studies
   noise reduction noise reduction (GWAS-NR), joint analysis and Fisher's
   test with 20% and 50% missing markers. GWAS-NR has area under the curve
   (AUC) of 0.689 and the joint and Fisher's tests have AUC of 0.622 and
   0.598, respectively, for the recessive model. Also GWAS-NR has AUC of
   0.883 and the joint and Fisher's tests have AUC of 0.776 and 0.760,
   respectively, for the multiplicative model. For the dominant model, AUC
   for GWAS-NR, the joint and Fisher's tests are 0.961, 0.852 and 0.844,
   respectively. For the additive model, AUC for GWAS-NR, the joint and
   Fisher's tests are 0.895, 0.785, and 0.775, respectively.

   As independent datasets may have an imperfect overlap of markers, which
   is true of the UM/VU and AGRE autism data, and the omission of the
   closest disease-related polymorphism from the data can have substantial
   negative impact on the power of GWAS [[111]44], we also compared the
   performance of GWAS-NR with the joint APL tests and Fisher's tests
   under a range of missing marker scenarios: 20% of the simulated markers
   in one dataset were randomly omitted for the recessive and
   multiplicative models and 50% of the simulated markers were randomly
   omitted in one dataset for the dominant and additive models. This
   performance is shown in Figure [112]2. Again, the GWAS-NR produced a
   greater AUC than the joint and Fisher's tests and achieved a higher
   sensitivity for classifying true positives at each level of
   specificity.

   The results for the two-marker haplotype disease models are shown in
   Additional File [113]3. The same pattern is also observed in Additional
   File [114]3 that GWAS-NR produced greater AUC than the joint and
   Fisher's tests.

   We also evaluated the type I error rates of the modified TPM for
   identifying significant LD blocks using a truncation threshold of 0.05.
   For the scenario assuming full marker coverage as described in
   Additional File [115]1, the modified TPM had type I error rates of
   0.035 and 0.004 at the significance levels of 0.05 and 0.01,
   respectively. For the missing-marker scenario, the type I error rates
   for the modified TPM were 0.046 and 0.007 at the significance levels of
   0.05 and 0.01, respectively.

Autism GWAS-NR results

   We applied the GWAS-NR in autism data using UM/VU, AGRE and the joint
   (UM/VU)/AGRE datasets. A flow diagram (Additional File [116]4) for the
   data analysis process is found in the supplemental data. The selection
   of haplotype blocks based on Gabriel's definition resulted in a total
   of 2680 blocks based on the top 5000 markers. Moreover, 141 markers out
   of the 5000 markers which are not in any blocks were also selected.
   Blocks of LD were scored based on the truncated product of P-values
   below a threshold of 0.05 and a P-value for each block was obtained
   through Monte Carlo simulation. The P-values for 141 markers not in any
   blocks were also calculated using the Monte Carlo algorithm to account
   for the minimum statistics. All of the 141 markers had P-values less
   than 0.05 and were selected. 725 LD blocks achieved a significance
   threshold of P < = 0.01, and an additional 810 blocks achieved a
   threshold of P < = 0.05. A complete list of these blocks is presented
   in Additional File [117]5.

   In order to determine what genes reside within the 1535 significant LD
   blocks, we used the University of California Santa Cruz (UCSC) Genome
   Browser Table Browser. The 1535 regions were converted into start and
   end positions based on the SNP positions in the March 2006
   (NCBI36/hg18) human genome assembly. These start and end positions were
   used to define regions in the UCSC Table Browser. We searched each
   region for overlap with the RefSeq annotation track in the UCSC
   Browser. This search resulted in 431 unique genes which mapped back to
   646 significant LD blocks and 50 single markers. These genes are
   presented in Additional File [118]6. For the remaining 839 LD blocks
   that did not overlap a RefSeq gene, we identified the nearest RefSeq
   gene using Galaxy [[119]45]. The distance to these nearest genes
   averaged 417,377 bp with a range from 5296 to 5,547,466 bp. These
   nearest genes include candidate genes for which strong proximal
   associations with autism have previously been reported, such as CDH9
   [[120]18,[121]19] and SEMA5A [[122]20]. We considered these genes for
   follow-up because GWAS-NR, by construction, may capture association
   information from nearby regions that may not be in strict LD with a
   given SNP and because these proximal locations may also incorporate
   regulatory elements. These genes are presented in Additional File
   [123]7. Combining these sets resulted in a candidate set of 860 unique
   genes (presented in Additional File [124]8). For genes assigned to more
   than one significant LD block, the lowest P-value among these blocks is
   used for sorting and discussion purposes.

   The most significant LD block we identified is located at 2p24.1 (ch2
   204444539-20446116; P = 1.8E-06) proximal to PUM2. One LD block located
   within the PUM2 exon also had nominally significant association (P =
   0.024). Additional top-ranking candidates, in order of significance,
   include CACNA1I (P = 1.8E-05), EDEM1 (P = 1.8E-05), DNER (P = 2.7E-05),
   A2BP1 (P = 3.6E-05), ZNF622 (P = 8.11E-05), SEMA4D (P = 9.09E-05) and
   CDH8 (P = 9.09E-05). Gene ontology classifications and InterPro binding
   domains reported by DAVID [[125]41-[126]43] to be most enriched in the
   candidate gene set are presented in Tables [127]1 and [128]2,
   respectively, providing a broad functional characterization of the
   candidate genes identified by the GWAS-NR in autism.

Table 1.

   Common functions of autism candidate genes identified by genome-wide
   association studies-noise reduction (GWAS-NR)
   Gene ontology (GO) term No. of genes GO term identification P-value^1
   Examples
   Cell adhesion 76 0007155 6.29E-13 CDH8, NCAM2
     __________________________________________________________________

   Biological adhesion 76 0022610 6.64E-13 CDH2, CTNNB1
     __________________________________________________________________

   Cell-cell adhesion 35 0016337 6.24E-08 CTNNA2, AMIGO2
     __________________________________________________________________

   Homophilic cell adhesion 21 0007156 1.21E-06 PTPRM, FAT1
     __________________________________________________________________

   Cell motion 44 0006928 6.65E-06 SEMA5A, FYN
     __________________________________________________________________

   Neuron differentiation 41 0030182 1.14E-05 EN2, NRXN1
     __________________________________________________________________

   Enzyme linked receptor protein signalling pathway 33 0007167 5.40E-05
   NCK2, FGFR2
     __________________________________________________________________

   Neuron development 32 0048666 1.07E-04 ROBO2, RTN4R
     __________________________________________________________________

   Negative regulation of gene expression 42 0010629 1.27E-04 SIX3, CUX2
     __________________________________________________________________

   Axonogenesis 22 0007409 1.31E-04 SEMA6A, SLITRK5
     __________________________________________________________________

   Cell morphogenesis involved in differentiation 25 0000904 2.16E-04
   PRKCA, PTK2
     __________________________________________________________________

   Cell motility 29 0048870 2.40E-04 DNER, PPAP2B
     __________________________________________________________________

   Localization of cell 29 0051674 2.40E-04 PTEN, NRP2
     __________________________________________________________________

   Negative regulation of transcription 38 0016481 3.19E-04 RBPJ, MEIS2
     __________________________________________________________________

   Cell morphogenesis involved in neuron differentiation 22 0048667
   3.94E-04 PARD3, KALRN
     __________________________________________________________________

   Transmembrane receptor protein tyrosine kinase signalling 23 0007169
   3.98E-04 SOCS2, DOK5
     __________________________________________________________________

   Neuron projection development 25 0031175 4.40E-04 RTN4R, NGF
     __________________________________________________________________

   Neuron projection morphogenesis 22 0048812 5.07E-04 PVRL1, CDH4
     __________________________________________________________________

   Regulation of cell projection organization 13 0031344 5.33E-04 SEMA4D,
   CDC42EP4
     __________________________________________________________________

   Negative regulation of nucleobase, nucleoside, nucleotide, and nucleic
   acid metabolic process 40 0045934 6.79E-04 BCL6, ZHX2
   [129]Open in a new tab

Table 2.

   Common binding domains of autism candidate genes identified by
   genome-wide association studies-noise reduction (GWAS-NR).
   INTERPRO term No. of genes INTERPRO identification P-value^1
   Immunoglobulin I-set 20 IPR013098 8.97E-06
     __________________________________________________________________

   Cadherin 16 IPR002126 6.98E-05
     __________________________________________________________________

   Cadherin cytoplasmic region 7 IPR000233 1.14E-04
     __________________________________________________________________

   Pleckstrin homology 26 IPR001849 5.03E-04
     __________________________________________________________________

   Immunoglobulin 21 IPR013151 5.61E-04
     __________________________________________________________________

   Immunoglobulin subtype 2 21 IPR003598 6.77E-04
     __________________________________________________________________

   Fibronectin, type III-like fold 19 IPR008957 1.19E-03
     __________________________________________________________________

   Fibronectin, type III 19 IPR003961 1.72E-03
     __________________________________________________________________

   Epidermal growth factor (EGF) 14 IPR006209 3.71E-03
     __________________________________________________________________

   Meprin/A5-protein/PTPmu (MAM) 5 IPR000998 6.78E-03
     __________________________________________________________________

   Protein-tyrosine phosphatase, receptor/non-receptor type 7 IPR000242
   7.36E-03
     __________________________________________________________________

   Pleckstrin homology-type 24 IPR001993 7.41E-03
     __________________________________________________________________

   von Willebrand factor, type A 10 IPR002035 7.41E-03
     __________________________________________________________________

   Immunoglobulin-like 35 IPR007110 7.57E-03
   [130]Open in a new tab

   Cell adhesion represented the most common functional annotation
   reported for the candidate gene set, with a second set of common
   functional annotations relating to neuronal morphogenesis and motility,
   including axonogenesis and neuron projection development. Given the
   enrichment scores reported by DAVID [[131]41-[132]43] implicating
   neurite development and motility, and because numerous cell adhesion
   molecules are known to regulate axonal and dendritic projections
   [[133]46,[134]47], we examined the known functional roles of the
   individual candidate genes responsible for these enrichment scores. A
   total of 183 candidate genes were represented among the top 20
   functional classifications reported by DAVID [[135]41-[136]43]. Based
   on annotations manually curated from a review of current literature, we
   observed that 76 (41.5%) of these genes have established roles in the
   regulation of neurite outgrowth and guidance. These include 39 (51.3%)
   of the candidate genes contained in the cell adhesion, biological
   adhesion, cell-cell adhesion and homophilic cell adhesion pathways.

   Gene ontology [[137]48] specifically associates two pathways with the
   narrow synonym 'neurite outgrowth': the neuron projection development
   (pathway 0031175); and the transmembrane receptor protein tyrosine
   kinase activity (pathway 0004714). To further test for functional
   enrichment of genes related to neurite outgrowth, we formed a
   restricted composite of these two pathways. Enrichment analysis using
   the EASE function of DAVID [[138]41-[139]43] rejected the hypothesis
   that this composite pathway is randomly associated with the autism
   candidate set (P = 2.07E-05).

   Although many of the candidate genes identified by the GWAS-NR remain
   uncharacterized or have no known neurological function, we identified
   125 genes within the full candidate set having established and
   interconnected roles in the regulation of neurite outgrowth and
   guidance. These genes are involved in diverse sub-processes including
   cell adhesion, axon guidance, phosphatidylinositol signalling,
   establishment of cell polarity, Rho-GTPase signalling, cytoskeletal
   regulation and transcription. Table [140]3 presents a summary of these
   genes by functional category. Additional File [141]9 presents
   annotations for these 125 candidates. Additional File [142]10 presents
   104 additional candidates which have suggestive roles in neurite
   regulation based on putative biological function or homology to known
   neurite regulators but where we did not find evidence specific to
   neurite outgrowth and guidance in the current literature.

Table 3.

   Autism candidate genes with known roles in neurite outgrowth and
   guidance.
   Function Candidate gene (by lowest P-value)
   Cadherin-catenin function CDH8, CDH2, CDH11, CTNNB1, CTNNA2, PKP4,
   CTNND2, CDH4, CTNND1, CTNNA3
     __________________________________________________________________

   Cell adhesion NCAM2, CNTN3, OPCML, ODZ4, NID1, CNTN5, F3, PVRL1, PTPRG,
   PARVA, FLRT2, ODZ2, NRXN1, ITGA9, ELMO1, FUT9, AMIGO2, KIRREL3,
   CNTNAP2, NTM
     __________________________________________________________________

   Ion channel CACNA1I, CACNA1G
     __________________________________________________________________

   Axon guidance SEMA4D, RTN4R, ROBO2, SEMA5A, PLXDC2, SLITRK5, SEMA6A,
   RGMA, UNC5D, ALCAM, NTNG2, RTN4RL1, PLXNC1, NRP2
     __________________________________________________________________

   Vesicle transport STX2, STX16, STXBP5, SYT6
     __________________________________________________________________

   Post-synaptic scaffold DLGAP2, MAGI1, MAGI2
     __________________________________________________________________

   Signal transduction DNER, SPRY4, FRK, PRKCA, DOK6, PDE3A, FER, IRS2,
   SOCS2, SPRY2, FRS3, DOK5, FYN, LZTS1, PTPRD, FGFR2, NRG3, PPP2R2B ALK,
   RYR2, PALM2-AKAP2, MAP3K7, NTRK3, NGF, PPM1H, GDNF, CXCR4, PTK2, NEDD9,
   PTPN1, LEPR
     __________________________________________________________________

   Phosphatidylinositol signalling PLA2G6, PIK3C2B, PTEN, PLA2G4A
     __________________________________________________________________

   Cell polarity FAT1, PARD3, PARD6G, DCHS2
     __________________________________________________________________

   Rho-GTPase signalling NCK2, DOCK1, PREX1, CDC42EP4, RND3, RGNEF, DOCK8,
   CIT, SRGAP3, KALRN, IQGAP2
     __________________________________________________________________

   Cytoskeletal regulation SGK1, MYLK, GPR56, APBB1IP, PTPRM, WIPF3,
   PTPRT, MAP3K8, MICAL2, DGKG, COBL, CALD1
     __________________________________________________________________

   Transcription PUM2, A2BP1, NKX6-1, SOX14, EN2, EBF1, MAP3K1, FOXG1,
   NFIC, BCL11A
   [143]Open in a new tab

   Outside of functions relating to neuritogenesis, the most significant
   functional annotation reported by DAVID for the candidate gene set
   relates to transmission of nerve impulses (p = 9.02E-04). We identified
   40 genes in the candidate set related to neurotransmission
   (synaptogenesis, neuronal excitability, synaptic plasticity, and
   vesicle exocytosis) which did not have overlapping roles in neurite
   regulation. Table [144]4 presents a summary of these genes by
   functional category.

Table 4.

   Autism candidate genes with roles in synaptic function.
   Function Candidate gene (by lowest P-value)
   Synaptogenesis LRRTM4, SYN3
     __________________________________________________________________

   Excitatory/inhibitory balance KCNIP1, KCNQ1, KCNQ5, KCNJ4, SLC6A13,
   IQCF1, GABBR2, GRIK4, OAT, KCNN3, GRM3, GCOM1, CACNA2D1, GRM7, ADRB2,
   KCNH7, KCNIP4, GRIK2, CACNG2, KCNMA1, KCNG1
     __________________________________________________________________

   Synaptic plasticity RIMS1, PTGER2, SLC24A2, NETO1, PTGS2
     __________________________________________________________________

   Vesicle exocytosis PTPRN2, AMPH, RAB11B, SYNPR
     __________________________________________________________________

   Other TPH2, CHRNA9, RIMBP2, ATXN1, CHRNB4, NOVA1, SNCAIP, CHRM3
   [145]Open in a new tab

   In order to investigate how the GWAS-NR results compared with the joint
   APL tests and Fisher's tests, we examined the lists of top 5000 markers
   selected based on GWAS-NR, joint APL test and Fisher's test P-values. A
   total of 3328 of the markers are overlapped between the lists for the
   GWAS-NR and joint APL tests, while 1951 of the markers are overlapped
   between the lists for the GWAS-NR and Fisher's tests. Thus, GWAS-NR had
   a higher concordance with the joint APL tests than the Fisher's tests.
   The results suggested that Fisher's test may have the lowest
   sensitivity to identify the true positives, which is consistent with
   our simulation results. Moreover, 120 markers that are not overlapped
   between Illumina Infinium Human 1M beadchip and Illumina HumanHap550
   BeadChip were among the top 5000 markers selected based on GWAS-NR.
   Some of the 120 markers are in the significant genes identified by
   haplotype blocks such as PUM2, A2BP1, DNER and SEMA4D.

   In order to similarly investigate the overlap of candidate genes
   indentified by GWAS-NR and joint APL tests, we repeated the haplotype
   block scoring method with the top 5000 markers as identified by joint
   APL: this analysis resulted in 1924 significant LD blocks. Of these,
   1257 overlapped with the blocks selected by GWAS-NR analysis.
   Identification of the RefSeq genes within with these 1257 shared
   regions showed that 380 potential candidate genes were shared by the
   two methods. In addition, GWAS-NR analysis produced 53 non-overlapping
   genes while the joint APL analysis produced 349 non-overlapping genes.

   As GWAS-NR amplifies association signals that are replicated in
   multiple flanking markers and across data sets, the method can be
   expected to produce a reduced list of higher confidence candidate
   regions for follow-up, compared with standard single-locus methods. At
   the same time, GWAS-NR does not generate a large number of significant
   candidates in regions that would otherwise be ranked as insignificant.
   While it is not possible to exclude a role in autism for the 349
   additional candidate genes produced by the joint APL analysis, it is
   notable that among the top 20 gene ontology pathways reported by DAVID
   [[146]41-[147]43] for this set of genes, not one is specific to
   neuronal function (data not shown). This analysis highlights the
   utility of GWAS-NR to narrow and prioritize follow-up gene lists.

Discussion

   We propose the use of GWAS-NR, a noise-reduction method for genome-wide
   association studies which aims to enhance the power to detect true
   positive associations for follow-up analysis. Our results demonstrate
   that GWAS-NR is a powerful method for the enhancement of the detection
   of genetic associations. Simulation evidence using a variety of disease
   models indicates that, when markers are ranked by P-values and
   candidates are selected based on a threshold rank, GWAS-NR achieves
   higher classification rates than the use of joint P-values or Fisher's
   method. In simulated data, the GWAS-NR also achieves strong performance
   when there is imperfect marker overlap across datasets and when the
   closest disease-related polymorphism is not typed. As Müller-Myhsok and
   Abel have observed, when less-than-maximum LD exists between a disease
   locus and the closest biallelic marker, the required sample size to
   achieve a given level of power may increase dramatically, particularly
   if there is a substantial difference in allele frequencies at the
   disease marker and the analysed marker [[148]49].

   In the context of allelic association, noise can be viewed as observed
   but random association evidence (for example, false positives) that is
   not the result of true LD with a susceptibility or causative variant.
   Such noise is likely to confound studies of complex disorders, where
   genetic heterogeneity among affected individuals or complex
   interactions among multiple genes may result in modest association
   signals that are difficult to detect. The influence of positive noise
   components is also likely to contribute to the so-called 'winner's
   curse' phenomenon, whereby the estimated effect of a putatively
   associated marker is often exaggerated in the initial findings,
   compared with estimated effects in follow-up studies [[149]50]. GWAS-NR
   appears to be a promising approach to address these challenges.

   By amplifying signals in regions where association evidence is locally
   correlated across datasets, the GWAS-NR captures information that may
   be omitted or underutilized in single-marker analysis. However, the
   GWAS-NR can achieve no advantage over simple joint analysis when
   flanking markers provide no supplementary information. This is likely
   to be true when a true risk locus is typed directly and a single-marker
   association method is used or when a true risk haplotype is typed
   directly and the number of markers examined in a haplotype-based
   analysis is of the same length.

   Joint analysis generally has more power than individual tests due to
   the increase of sample size. Therefore, GWAS-NR, which uses P-values
   from individual analyses as well as joint analysis of multiple
   datasets, is expected to have more power than individual tests.
   However, if there are subpopulations in the sample and the association
   is specific to a subpopulation, joint analysis may not be as powerful
   as an individual test for the subpopulation with the association
   signal. If samples from multiple populations are analysed jointly, test
   results for individual datasets should also be carefully examined with
   the GWAS-NR results.

   It is common for linear filters to include a large set of estimated
   parameters to capture cross-correlations in the data at multiple leads
   and lags. However, in a genomic context, the potentially uneven spacing
   of markers and varying strength of linkage disequilibrium between
   markers encouraged us to apply a parsimonious representation that would
   be robust to data structure. We expect that a larger, well-regularized
   parameterization may enhance the performance of the noise filter,
   particularly if the filter is constructed to adapt to varying linkage
   disequilibrium across the genome. This is a subject of further
   research.

   Our simulation results indicate that applying the modified TPM to
   select LD blocks based on GWAS-NR can have conservative type I error
   rates. The original TPM reported by Zaykin et al. [[150]40] produced
   the expected level of type I error, as a known correlation matrix was
   used in the simulations to account for correlation among P-values due
   to LD among markers. However, the true correlation is unknown in real
   datasets. Accordingly, we estimated correlations in our simulations and
   analysis by bootstrapping replicates of samples, as well as using the
   sample correlation between P-values obtained though single marker APL
   and sliding window haplotype analysis. It is possible that the use of
   estimated correlations may introduce extra variations in the
   Monte-Carlo simulations of TPM, which may contribute to conservative
   type I error rates. As we have demonstrated that GWAS-NR achieves
   higher sensitivity at each level of specificity, the resulting regions
   with top rankings can be expected to be enriched for true associations
   when such associations are actually present in the data, even if the LD
   block selection procedure is conservative. Overall, the simulation
   results suggest that GWAS-NR can be expected to produce a condensed set
   of higher confidence follow-up regions, and that this prioritization
   strategy can control the number of false positives at or below the
   expected number in analysis.

Autism

   Our data identify potential candidate genes for autism that encode a
   large subset of proteins involved in the outgrowth and guidance of
   axons and dendrites to their appropriate synaptic targets. Our results
   also suggest secondary involvement of genes involved in synaptogenesis
   and neurotransmission which further contribute to the assembly and
   function of neural circuitry. Taken together, these findings augment
   existing genetic, epigenetic and neuropathological evidence suggestive
   of altered neurite morphology, cell migration, synaptogenesis and
   excitatory-inhibitory balance in autism [[151]49].

   Altered dendritic structure is among the most consistent
   neuroanatomical findings in autism [[152]51,[153]52] and several other
   neurodevelopmental syndromes including Down, Rett and fragile-X
   [[154]53,[155]54]. Recent neuroanatomical findings include evidence of
   subcortical, periventricular, hippocampal and cerebellar heterotopia
   [[156]55] and altered microarchitecture of cortical minicolumns
   [[157]56], suggestive of dysregulated neuronal migration and guidance.
   In recent years, evidence from neuroanatomical and neuroimaging studies
   has led a number of researchers to propose models of altered cortical
   networks in autism, emphasizing the possible disruption of long-range
   connectivity and a developmental bias toward the formation of
   short-range connections [[158]57,[159]58].

   Neurite regulation is a common function of numerous top-ranking
   candidates. PUM2 codes for pumilio homolog 2, which regulates dendritic
   outgrowth, arborization, spine formation and filopodial extension of
   developing and mature neurons [[160]59]. DNER regulates the
   morphogenesis of cerebellar Purkinje cells [[161]60] and acts as an
   inhibitor to retinoic-acid induced neurite outgrowth [[162]61]. A2BP1
   binds with ATXN2 (SCA2), a dosage-sensitive regulator of actin filament
   formation that is suggested to mediate the loss of
   cytoskeleton-dependent dendritic structure [[163]62]. SEMA4D induces
   axonal growth cone collapse [[164]63] and promotes dendritic branching
   and complexity in later stages of development [[165]64,[166]65]. CDH8
   regulates hippocampal mossy fibre axon fasciculation and targeting,
   complementing N-cadherin (CDH2) in the assembly of synaptic circuits
   [[167]66].

   Neurite outgrowth and guidance can be conceptualized as a process
   whereby extracellular signals are transduced to cytoplasmic signalling
   molecules which, in turn, regulate membrane protrusion and neuronal
   growth cone navigation by reorganizing the architecture of the neuronal
   cytoskeleton. In general, neurite extension is dependent on microtubule
   organization, while the extension and retraction of finger-like
   filopodia and web-like lamellipodia from the neuronal growth cone is
   dependent on actin dynamics. Gordon-Weeks [[168]67] and Bagnard
   [[169]68] provide excellent overviews relating to growth cone
   regulation and axon guidance. Figure [170]3 provides a simplified
   overview of some of these molecular interactions.

Figure 3.

   [171]Figure 3
   [172]Open in a new tab

   Simplified schematic illustrating molecular mechanisms of neurite
   regulation. Extracellular events such as cell contact [[173]79],
   guidance cues [[174]64], neurotransmitter release [[175]80], and
   interactions with extracellular matrix components [[176]46] are
   detected by receptors and cell adhesion molecules at the membrane
   surface and are transduced via cytoplasmic terminals and multidomain
   scaffolding proteins [[177]47] to downstream signalling molecules
   [[178]81-[179]83]. Polarity and directional navigation is achieved by
   coordinating local calcium concentration [[180]84], Src family kinases
   [[181]85], cyclic nucleotide activation (cAMP and cGMP) [[182]86], and
   phosphoinositide signalling molecules which affect the spatial
   distribution and membrane recruitment of proteins that regulate the
   neuronal cytoskeleton [[183]87]. Chief among these regulators are the
   small Rho family GTPases RhoA, Rac and Cdc42, which serve as molecular
   'switches' to activate downstream effectors of cytoskeletal remodelling
   [[184]88]. In developed neurons, this pathway further regulates the
   formation of actin-dependent microarchitecture such as mushroom-like
   dendritic spines at the postsynaptic terminals of excitatory and
   inhibitory synapses [[185]89]. This simplified schematic presents
   components in an exploded format for tractability, and includes an
   abridged set of interactions. Additional File [186]9 presents autism
   candidate genes identified by GWAS-NR having known roles in neurite
   regulation. RPTP (receptor protein tyrosine phosphatase); EphR (Eph
   receptor); FGFR (fibroblast growth factor receptor); EphR (Eph
   receptor); PLXN (plexin); NRP (neuropilin); Trk (neurotrophin
   receptor); ECM (extracellular matrix); NetR (netrin receptor); NMDAR
   (NMDA receptor); mGluR (metabotropic glutamate receptor); AA
   (arachidonic acid); PLCγ (phospholipase C, gamma); MAGI (membrane
   associated guanylate kinase homolog); IP3 (inositol
   1,4,5-trisphosphate); DAG (diacylglycerol); PIP2 (phosphatidylinositol
   4,5-bisphosphate); PIP3 (phosphatidylinositol 3,4,5-trisphosphate);
   PI3K (phosphoinositide-3-kinase); nNOS (neuronal nitric oxide
   synthase); NO (nitric oxide); IP3R (inositol trisphosphate receptor);
   RyR (ryanodine receptor); GEF (guanine exchange factor); GAP (GTPase
   activating protein); MAPK (mitogen-activated protein kinase); and JNK
   (c-Jun N-terminal kinase).

   The autism gene candidates identified by GWAS-NR show functional
   enrichment in processes, including adhesion, cell motility,
   axonogenesis, cell morphogenesis and neuron projection development.
   Notably, a recent analysis of rare CNVs in autism by the Autism Genome
   Project Consortium indicates similar functional enrichment in the
   processes of neuronal projection, motility, proliferation, and Rho/Ras
   GTPase signalling [[187]21].

   We propose that, in autism, these processes are not distinct functional
   classifications but instead cooperate as interacting parts of a
   coherent molecular pathway regulating the outgrowth and guidance of
   axons and dendrites. Consistent with this view, the candidate set is
   enriched for numerous binding domains commonly found in proteins that
   govern neuritogenesis. These include immunoglobulin, cadherin,
   pleckstrin homology, MAM, fibronectin type-III and protein tyrosine
   phosphatase (PTP) domains [[188]69-[189]71].

   The cytoskeletal dynamics of extending neurites are largely governed by
   the activity of Rho-GTPases, which act as molecular switches to induce
   actin remodelling. Molecular evidence suggests that disassociation of
   catenin from cadherin promotes the activation of Rho-family GTPases Rac
   and Cdc42, resulting in cytoskeletal rearrangement [[190]72]. Guanine
   nucleotide exchange factors (GEFs) such as DOCK1 [[191]73] and KALRN
   [[192]74] activate Rho-GTPases by exchanging bound guanosine
   diphosphate (GDP) for guanosine triphosphate (GTP), while GTPase
   activating proteins (GAPs) such as SRGAP3 [[193]75] increase the rate
   of intrinsic GTP hydrolysis to inactivate GTPases. Pleckstrin homology
   domains, characteristic of several GEFs and GAPs, bind to
   phosphoinositides to establish membrane localization and also may play
   a signalling role in GTPase function [[194]76]. Certain GTPases outside
   of the Rho family, particularly Rap and Ras, also exert an influence on
   cytoskeletal dynamics and neurite differentiation [[195]77,[196]76].

   Several genes in the candidate set with established roles in neurite
   formation and guidance have been previously implicated in autism. These
   include A2BP1 (P = 3.60E-05), ROBO2 (2.00E-03), SEMA5A (2.30E-03), EN2
   (4.00E-03), CACNA1G (6.00E-03), PTEN (8.00E-03), NRXN1 (1.10E-02), FUT9
   (1.80E-02), DOCK8 (2.10E-02), NRP2 (2.60E-02) and CNTNAP2 (2.70E-02).
   Other previously reported autism candidate genes with suggestive roles
   in neurite regulation include PCDH9 (1.76E-03), CDH9 (6.00E-03) and
   CSMD3 (2.10E-02).

   The enriched presence of transcription factors in the candidate set is
   intriguing, as many of these candidates, including CUX2, SIX3, MEIS2
   and ZFHX1B have established roles in the specification of GABAergic
   cortical interneurons [[197]76]. Many guidance mechanisms in the
   neuritogenic pathway, such as Slit-Robo, semaphorin-neuropilin, and
   CXCR4 signalling also direct the migration and regional patterning of
   interneurons during development. Proper targeting of interneurons is
   vital to the organization of cortical circuitry, including minicolumnar
   architecture which is reported to be altered in autism [[198]78]. Thus,
   the functional roles of the candidate genes we identify may embrace
   additional forms of neuronal motility and targeting.

Conclusions

   We proposed a noise-reduction methodology, GWAS-NR, to enhance the
   ability to detect associations in GWAS data. By amplifying signals in
   regions where association evidence is locally correlated across
   datasets, the GWAS-NR captures information that may be omitted or
   underutilized in single-marker analysis. Simulation evidence
   demonstrates that under a variety of disease models, GWAS-NR achieves
   higher classification rates for true positive associations, compared
   with the use of joint p-values or Fisher's method.

   The GWAS-NR method was applied to autism data, with the objective of
   prioritizing regions of association for follow-up analysis. Gene set
   analysis was conducted in order to examine if the identified autism
   candidate genes were over-represented in any biological pathway
   relative to the background genes. The significance of a given pathway
   suggests that the pathway may be associated with autism due to the
   enrichment of autism candidate genes in that pathway. We find that many
   of the implicated genes cooperate within a coherent molecular
   mechanism. This neuritogenic pathway regulates the transduction of
   membrane-associated signals to downstream cytoskeletal effectors that
   induce the directional protrusion of axons and dendrites. This
   mechanism provides a framework that embraces numerous genetic findings
   in autism to date, and is consistent with neuroanatomical evidence.
   While confirmation of this pathway will require additional evidence
   such as the identification of functional variants, our results suggest
   that autistic pathology may be mediated by the dynamic regulation of
   the neuronal cytoskeleton, with resulting alterations in dendritic and
   axonal connectivity.

Abbreviations

   ADI-R: Autism Diagnostic Interview - Revised; AGRE: Autism Genetic
   Resource Exchange; APL: association in the presence of linkage; AUC:
   area under the curve; CNV: copy number variation; DAVID: Database for
   Annotation, Visualization and Discovery; GTP: guanosine triphosphate;
   LD: linkage disequilibrium; GWAS: Genome-wide association studies; NR:
   noise reduction; RefSeq: Reference Sequence; ROC: receiver operating
   characteristic; SNP: single nucleotide polymorphism; TPM: truncated
   product method.

Competing interests

   The authors declare that they have no competing interests.

Authors' contributions

   All co-authors contributed to writing the manuscript. JPH was the
   primary author of the manuscript, developed the statistical methods and
   the design for their implementation and contributed pathway analysis of
   candidate genes. RHC contributed to the development of the statistical
   methods and study design and also conducted statistical analyses. AJG
   contributed to molecular analysis and interpretation. JMJ, DS and DM
   conducted statistical analyses. IK and PLW performed molecular analysis
   and interpretation. JMV performed molecular analysis and contributed to
   the study design. ERM provided input into study design, methods
   development, statistical analyses, and interpretation of findings. MLC
   analysed clinical data and contributed to the study design. JRG
   performed molecular analysis, interpreted data and contributed to the
   study design. JLH provided input to study design and statistical
   analyses. MPV contributed to the design of the study, development of
   methods, coordination of statistical and molecular analysis and
   interpretation of data. She is also the primary investigator on the
   parent study.

Supplementary Material

   Additional File 1

   Appendix
   [199]Click here for file^ (27KB, DOC)
   Additional File 2

   Table S7: Haplotype configuration Association configuration for the
   power simulations.
   [200]Click here for file^ (31KB, DOC)
   Additional File 3

   Comparative classification rates for genome-wide association studies -
   noise reduction (GWAS-NR), Joint analysis and Fisher's Test. GWAS-NR
   has an area under the curve (AUC) of 0.679 and the joint and Fisher's
   tests have AUC of 0.624 and 0.604, respectively, for the recessive
   model. Also GWAS-NR has AUC of 0.855 and the joint and Fisher's tests
   have AUC of 0.781 and 0.751, respectively, for the multiplicative
   model. For the dominant model, AUC for GWAS-NR, the joint and Fisher's
   tests are 0.964, 0.871 and 0.853, respectively. For the additive model,
   AUC for GWAS-NR, the joint and Fisher's tests are 0.893, 0.806 and
   0.771, respectively.
   [201]Click here for file^ (152KB, DOC)
   Additional File 4

   Flow Chart: GWAS-NR analysis workflow in autism datasets. A flow chart
   demonstrating the data analysis and candidate gene selection of the
   autism datasets presented. HIHG: Hussman Institute for Human Genomics
   dataset, AGRE: Autism Genetic Resource Exchange dataset, APL:
   Association in the Presence of Linkage, GWAS-NR: Genome-wide
   Association Study - Noise Reduction, DAVID: Database for Annotation,
   Visualization and Integrated Discovery.
   [202]Click here for file^ (89.3KB, PDF)
   Additional File 5

   Table S1: linkage disequilibrium (LD) blocks identified by Genome-wide
   Association Study - Noise Reduction (GWAS-NR). Every LD block
   identified by GWAS-NR and haplotype analysis with a P-value < 0.05 is
   listed with the chromosome start and stop position, the length in
   basepairs of the LD block, and the minimum GWAS-NR P-value of the
   block.
   [203]Click here for file^ (191KB, XLS)
   Additional File 6

   Table S2: RefSeq genes overlapping linkage disequilibrium (LD) blocks
   identified by Genome-wide Association Study - Noise Reduction
   (GWAS-NR). Every LD block identified by GWAS-NR and haplotype analysis
   with a P-value < 0.05 and that overlaps a gene in the RefSeq database
   is listed with the chromosome start and stop position, the length in
   basepairs of the LD block, the minimum GWAS-NR P-value of the block,
   and the RefSeq name of the gene(s) that overlap the block.
   [204]Click here for file^ (77KB, XLS)
   Additional File 7

   Table S3: RefSeq genes nearest to linkage disequilibrium (LD) blocks
   identified by Genome-wide Association Study - Noise Reduction
   (GWAS-NR). Every LD block identified by GWAS-NR and haplotype analysis
   with a P-value < 0.05 that does not overlap with a gene in the
   reference sequence (RefSeq) database is listed with the chromosome
   start and stop position, the length in basepairs of the LD block, the
   minimum GWAS-NR P-value of the block and the RefSeq name of the gene(s)
   that is nearest to the block.
   [205]Click here for file^ (131KB, XLS)
   Additional File 8

   Table S4: Autism candidate genes identified by Genome-wide Association
   Study - Noise Reduction (GWAS-NR). A complete list of reference
   sequence (RefSeq) genes either overlapping or nearest to every LD
   blocks with the P-value of either the overlapping or nearest block.
   [206]Click here for file^ (81.5KB, XLS)
   Additional File 9

   Table S5: Autism candidate genes [Genome-wide Association Study - Noise
   Reduction (GWAS-NR)] having known roles in neurite outgrowth and
   guidance. A list of autism candidate genes with known roles in neurite
   outgrowth and axon guidance followed by a comment on molecular function
   and PubMed identifications of supporting literature.
   [207]Click here for file^ (78.5KB, XLS)
   Additional File 10

   Table S6: autism candidate genes [Genome-wide Association Study - Noise
   Reduction (GWAS-NR)] having suggestive roles in neurite outgrowth and
   guidance. A list of autism candidate genes with presumptive roles in
   neurite outgrowth and axon guidance followed by a comment on molecular
   function and PubMed identifications of supporting literature.
   [208]Click here for file^ (38.5KB, XLS)

Contributor Information

   John P Hussman, Email: hussman@hussmanfoundation.org.

   Ren-Hua Chung, Email: RChung@med.miami.edu.

   Anthony J Griswold, Email: agriswold@med.miami.edu.

   James M Jaworski, Email: jjaworski@med.miami.edu.

   Daria Salyakina, Email: DSalyakina@med.miami.edu.

   Deqiong Ma, Email: dma@med.miami.edu.

   Ioanna Konidari, Email: IKonidari@med.miami.edu.

   Patrice L Whitehead, Email: PWhitehead@med.miami.edu.

   Jeffery M Vance, Email: JVance@med.miami.edu.

   Eden R Martin, Email: EMartin1@med.miami.edu.

   Michael L Cuccaro, Email: mcuccaro@med.miami.edu.

   John R Gilbert, Email: jgilbert@med.miami.edu.

   Jonathan L Haines, Email: jonathan@chgr.mc.vanderbilt.edu.

   Margaret A Pericak-Vance, Email: mpericak@med.miami.edu.

Acknowledgements