- Technical Note
- Open Access
Comparisons of seven algorithms for pathway analysis using the WTCCC Crohn's Disease dataset
BMC Research Notes volume 4, Article number: 386 (2011)
Though rooted in genomic expression studies, pathway analysis for genome-wide association studies (GWAS) has gained increasing popularity, since it has the potential to discover hidden disease pathogenic mechanisms by combining statistical methods with biological knowledge. Generally, algorithms or programs proposed recently can be categorized by different types of input data, null hypothesis or counts of analysis stages. Due to complexity caused by SNP, gene and pathway relationships, re-sampling strategies like permutation are always utilized to derive an empirical distribution for test statistics for evaluating the significance of candidate pathways. However, evaluation of these algorithms on real GWAS datasets and real biological pathway databases needs to be addressed before we apply them widely with confidence.
Two algorithms which use summary statistics from GWAS as input were implemented in KGG, a novel and user-friendly software tool for GWAS pathway analysis. Comparisons of these two algorithms as well as the other five selected algorithms were conducted by analyzing the WTCCC Crohn's Disease dataset utilizing the MsigDB canonical pathways. As a result of using permutation to obtain empirical p-value, most of these methods could control Type I error rate well, although some are conservative. However, the methods varied greatly in terms of power and running time, with the PLINK truncated set-based test being the most powerful and KGG being the fastest.
Raw data-based algorithms, such as those implemented in PLINK, are preferable for GWAS pathway analysis as long as computational capacity is available. It may be worthwhile to apply two or more pathway analysis algorithms on the same GWAS dataset, since the methods differ greatly in their outputs and might provide complementary findings for the studied complex disease.
Simple single-marker tests used in genome-wide association studies (GWAS) have contributed to the discovery of many loci responsible for the variation observed in complex traits or disorders [1–3]; nevertheless, they are also criticized for their stringent significance threshold  and disregard of prior knowledge, which might lead to Type II errors, that is, not detecting real effects. Recently, a number of complementary approaches have been developed to prioritize susceptibility genes of complex diseases and increase power, of which meta-analysis, epistasis analysis and GWAS pathway analysis (GWASPA) are typical ones and already widely applied .
Biological pathways, which are actually series of actions among molecules in a cell that lead to a certain product or a change in a cell , are usually identified by experimental approaches [7, 8] and then revised with bioinformatical mining tools . Shared across different organisms, pathways play an important role in metabolism, gene regulation and signal transduction . With genetic or environmental perturbations, some normal pathways might become dysfunctional and then contribute to complex diseases . Pathway analysis for complex disease, also known as gene-set analysis, originated with genomic expression studies ; one of the two classical methods in that field is over/under representation analysis by hypergeometric test, while the other is gene-set enrichment analysis (GSEA) using Kolmogorov-Smirnov-like test statistics [13, 14]. Since Wang et al.  first applied GSEA to GWAS data, more and more algorithms for performing genome-wide pathway analysis for SNP-chip datasets have emerged [16–19]. The PLINK set-based test utilizes average test statistics of groups of independent and/or truncated SNPs to provide a pathway-level test ; gene set ridge regression in association studies (GRASS) assesses joint association of pre-selected Eigen-SNPs for each gene in a candidate pathway with disease ; improved GSEA and association list Gene-ontology (GO) annotator (ALIGATOR) are two algorithms which utilize SNP-level test statistics or p-values in order to reduce computation cost over raw data-based methods [19, 20]. These different algorithms could be categorized according to type of input data (raw data or summary statistics), counts of analysis stages (one or two stage), or basic hypothetical tests (competitive or self-contained) . The difference between GWAS pathway analysis and classical pathway analysis in genomic expression studies lies in whether a gene-based score is directly provided or not. Previous research on multi-allelic association tests or gene-based tests, which always produce gene-level scores indirectly, provide a good bridge for pathway analysis on individual SNPs [21–23]. Among them, GATES is a rapid and powerful procedure for getting gene-based statistics which sometimes serve as a prerequisite for further advanced analysis .
Though several methods have already been proposed, there is no consensus as to the best method for conducting a GWASPA, especially when the underlying causal mechanism for disease at the functional level is not yet clear [16, 24]. Results discrepancies among different methods on the same dataset might arise due to different mapping strategies from SNPs to pathways, or an algorithm's power of detecting susceptibility pathways, as well as incompatible pathway databases used . Wang et al.  discussed the basic issues, main procedures and challenges involved in method development for GWASPA, but did not give a guideline of how to apply these methods from a practical perspective. Chen et al.  did compare the performance of different methods when evaluating the GRASS algorithm, but the simulation scenarios were designed for only one candidate pathway, and therefore need to be extended by examining both comprehensive simulated and real datasets on a genome-wide scale. Ballard et al.  illustrated the advantage of the random set method over the hypergeometric test for pathway analysis by analyzing three GWAS samples for Crohn's Disease (CD); however, they did not include any summary statistics-based algorithms which are becoming more and more popular .
As one of those complex diseases investigated at the earlier stage of GWAS , evidence for disease-causal variants, genes or even pathways related to CD are increasingly provided [3, 25, 27–29]. Meanwhile, the CD dataset from the Wellcome Trust Case Control Consortium (WTCCC) , which is openly available, has been repeatedly utilized and extensively explored by a variety of approaches [29, 31, 32]. This relatively abundant knowledge makes the WTCCC CD dataset a good testing sample for evaluating the performance of newly developed methods. Therefore, in this study, we implemented two summary statistics-based algorithms in an open-source tool named Knowledge-Based Mining System for Genome-Wide Genetic Studies (KGG) ; and then compared the performance of these two methods with another five existing methods using the WTCCC CD dataset, to evaluate the characteristics of the various methods, including Type 1 error, power and running time.
KGG is an open-source Java package developed for whole genome gene-based analysis, pathway analysis and protein-protein network analysis. Currently, it contains two classical algorithms for performing downstream pathway analysis after getting gene-level statistics by GATES , a novel gene-based method previously implemented in KGG. Formula 1 showed core idea of GATES, while formula 2 and 3 were basics of Simes' test  and hypergeometric test ; KGG would finally produce corresponding pathway-level results after running GATES-Simes or GATES-Hyper.
Note: P (j) are the ordered jth p-values (j from 1 to M) of the individual SNPs mapped to gene G; me is the effective number of independent SNP p-values among all M SNPs, after accounting for the LD structure among these M SNPs; me(j) is the effective number of independent SNP p-values among the top j SNPs, after accounting for the LD structure among these j SNPs.
Note: Assume k genes mapped to pathway A, PG (i) calculated by GATES, is the ith ordered gene p-value (i from 1 to k) among all k genes in pathway A.
Note: N is the total number of genes in the whole gene list; Q is the number of those N genes in the pathway A; n is the total number of genes passing gene p-value threshold; and q is number of genes in pathway A out of those n genes.
Among the other five selected algorithms, only Aligator takes SNP-level summary statistics as input as GATES-Simes and GATES-Hyper does; for Aligator, a gene is treated as significant if it contains at least one SNP with p-value below predefined threshold. Each pathway is then tested for whether it contains more significant genes than expected by chance . GSEAforGWAS selects the maximum SNP-level test statistic to represent a gene-level score, and then applies a weighted Kolmogorov-Smirnov running sum of competitive pathways to evaluate whether a particular pathway is enriched with top ranking genes or not . GRASS calculates gene scores by combining regularized beta coefficients of pre-selected Eigen SNPs, following by getting the pathway-level score from standardized gene-level scores . Both PLINK-Ave and PLINK-Max adopt the idea of a "set-based test", which computes a pathway level score directly from averaging SNP-level test statistics. After pruning SNPs in high LD, PLINK-Max only uses the top SNP in a pathway, but PLINK-Ave selects a few SNPs (default setting is up to the top 5 SNPs with p-value smaller than predefined threshold) . All of the above five algorithms need to evaluate significance for pathway association with disease by permutation, and we permuted 1000 times for all methods.
880 canonical pathways (originated from KEGG , BioCarta  and Reactome ) which been manually curated by biology experts, were collected from the MsigDB database . In comparison with Gene Ontology (GO) , which collects more broad functional categories in a hierarchical pattern, these canonical pathways represent relatively well-defined known biological pathways . To further reduce pathway-level heterogeneity, we only included pathways which contained between 10 and 300 genes, which filtered out 27 pathways. Another potential CD causal pathway "IL12-IL23" [16, 29] which was not in the database was also included, thus making the total number of selected candidate pathways to be 854. A survey of pathway size distribution, gene membership and overlapping ratio per pair-wise pathways was conducted by simple calculation using the 'Table' function in R. The WTCCC Crohn's Disease GWAS dataset was downloaded from the WTCCC website. Adopting the same quality control procedures as in the flagship publication , we included 1,748 CD patients as cases and 1,480 healthy individuals from the 1958 Birth Cohort  as controls. All the samples were genotyped with the Affymetrix 500 K chips. In total, 391,422 SNPs remained after quality control procedures for markers (minor allele frequency (MAF) > 0.05, Hardy-Weinberg equilibrium (HWE) p > 0.001, etc). Further selection of SNPs in or near (within 5 kb upstream or downstream) genes in the 854 canonical pathways resulted in 61,340 SNPs for subsequent pathway analysis. In order to run those summary statistics-based algorithms, we used PLINK (logistic regression model) to calculate SNP-level summary statistics (beta coefficient, odds ratio, p-value, etc.) with a general genomic control as population stratification adjustment. Due to lack of benchmark pathways for Crohn's Disease, we chose a subset of pathways from the 854 by enrichment analysis with GeneTrial , setting 93 candidate genes  previously reported for CD as input. This subset of pathways which were potentially associated with Crohn's disease was then recorded as list 1.
Comparison of performance
We created a permuted dataset by assigning randomly shuffled case/control labels to original genotypes from the WTCCC CD dataset. Four algorithms (PLINK-AVE, PLINK-MAX, GSEAforGWAS and GRASS) were adopted to perform pathway analysis on this raw permuted data when the other three algorithms (GATES-Simes, GATES-Hyper and Aligator) on SNP level summary statistics from logistic regression analysis of raw simulated data. Each algorithm would produce a set of pathway level p-values. Since the null hypotheses assumed that no pathways were enriched by disease-susceptibility SNPs or genes for this simulated dataset, we estimated Type I error for each algorithm empirically by calculating the proportion of pathways with nominal p-values smaller than the critical threshold (set at 0.05) out of all 854 pathways. One sample Kolmogorov-Smirnov tests were performed to investigate whether observed pathway p-values followed a (0, 1) uniform distribution or not. Then algorithms which had an appropriate type I error rate were applied to the original CD dataset so as to prioritize potential causal pathways. In order to conduct these comparisons fairly, the same value was set when a predefined threshold was needed. False discovery rates (FDR)  were computed from pathway p-values produced by different algorithms, and pathways with FDR smaller than 0.05 were treated as significantly associated with CD and then marked as pathway list 2. Hypergeometric tests were conducted to check whether the number of overlapping pathways between list 1 and list 2 was greater than expected by chance. Nominal p-values and FDR for the IL12-IL23 pathway from different algorithms were also recorded and compared. Therefore, power of detecting real associated pathways for all 7 algorithms could be quantified by three indices: number of significant pathways, p-values from the hypergeometric test, and significance level for IL12-IL23 pathway. In addition, impact of varying p-value threshold and LD pruning cut-off on power were investigated with GATES-Hyper and PLINK-AVE, since such impact for other algorithms was either not necessary or addressed before [16, 18, 20].
Survey on MsigDB canonical pathways
Table 1 presents characteristics of the 854 selected candidate pathways. Most of these pathways contain between 10 and 100 genes. Overlapping genes among pairs of pathways was less than 1 percent, allowing us to assume that the pathways were effectively independent, as required for GWASPA. However, investigation of genes mapping to pathways (also Table 1) showed that less than 30% of total pathway-included genes were unique to one pathway, while a few genes can even be covered by more than 100 different pathways. Pathway enrichment analysis for 93 candidate genes of CD revealed that 28 candidate pathways might be involved in causation of this disease (Additional File 1: Table S1). Literature evidence supported that 18 of these 28 pathways have good potential to associate with CD (also see Additional File 1: Table S1), though not all validated by experiments yet.
Type I error rate comparison
Quantile-quantile (QQ) plots  for original SNP p-values and GATES-produced gene p-values in a permutated dataset are shown in Figure 1. No SNPs or genes mapping to selected pathways were apparently deviated from the theoretical straight line of a uniform distribution for this permuted dataset, indicating that the tests behave correctly under the null and the permutation procedure has eliminated all effects. Table 2 contains estimates of Type I error (at family wise error rate 0.05) for seven different GWASPA algorithms. Most of them were conservative since Type I error rates were below 0.05, two even smaller than 0.02 (GRASS, GATES-Hyper). This tendency was verified by one sample K-S test for (0, 1) uniform distribution (also Table 2) and QQ plot for pathway p-values (Figure 2).
Power and CD susceptibility pathways
A QQ plot of SNP-based test statistics from the WTCCC Crohn's Disease dataset shows that a bunch of individual SNPs and a few genes were significantly associated with CD (Figure 3). Table 3 presents three indicators of power of different algorithms for detecting hidden susceptibility pathways. Consistently, PLINK-Ave appears to be the most powerful algorithm, as it produced more significant pathways that overlapped with previously known pathways than any other algorithm (Table 3). However, it takes more consideration when running PLINK-Ave, which is affected by flexible setting of LD and p-value truncation cut-offs (Additional File 1: Table S2). In general, summary statistics-based algorithms (GATES-Hyper, GATES-Simes and Aligator) had less power than those raw data-based algorithms, and use of average statistics (PLINK-Ave and GRASS) was more powerful than relying on the top statistics within a given pathway (PLINK-Max and GSEAforGWAS). Only one pathway was detected in common by PLINK-Ave, GRASS and GSEAforGWAS, but there might be around 80 pathways in total possibly related to CD (Additional File 1: Table S3).
Summary of running time and computing platform
Table 4 presents a summary of running times for all seven algorithms. Generally, algorithms with summary statistics as input were much faster than those utilizing raw data. Pathway analysis by GATES-Simes and GATES-Hyper could complete within one hour on a typical desktop computer. The time spent on computation was only several minutes as integrating marker LD information by KGG took most of the hour. Aligator and GRASS are both implemented in the R-SNPath package on a multi-core cluster, taking advantage of parallel computation. GSEAforGWAS was executed by the GenGen program following suggestion for parallel computation on their website--distributing permutations on four different nodes (each 250 times).
It has been suggested that multiple genes in immune system functional pathways, especially those containing different Interleukin factors, might be involved in causation of Crohn's Disease [28, 29]. Our findings of CD pathways across different algorithms also fall mostly into the category of immune response related pathways, though the number of significant pathways varied greatly for each individual algorithm. Likely, these differences are due to differences in four factors. The first is null hypothesis differences between self-contained tests and competitive tests [44, 45]. PLINK-Ave, PLINK-Max, GRASS and GATES-Simes, which are self-contained tests (see table 4), assume that a pathway does not contain any significant SNPs or genes. But GATES-Hyper, Aligator and GSEAforGWAS are all based on competitive tests (see table 4), which aims to test whether genes in one pathway are enriched with a greater number of associated SNPs. The second contributor to the differences observed is differences in test statistic construction. PLINK-AVE and GRASS both applies an "average" concept which combines evidence of selected SNPs or genes to identify pathways with more overall association signal, but PLINK-Max, GATES-Simes, GATES-Hyper, Aligator and GSEAforGWAS all use a "maximum" concept when computing pathway score from SNP scores directly or indirectly, relying on the most associated SNP in a pathway. It is likely that the underlying causal mechanism for CD involves more pathways covering moderate effect SNPs than simply a few pathways with top SNP hits only , implying that many more SNPs than we can identify individually contribute to disease. A third contributor to the differences observed is whether genes are treated as an intermediate bridge from SNPs and pathways in a two-stage approach. Though it has been suggested that two stage pathway analysis was more robust and powerful than one stage pathway analysis , our analyses don't support this, with the PLINK set-based test found to be superior to other gene-based algorithms. Finally, different strategies of handling LD structure between SNPs can also explain part of the variation in performance. Pruning SNPs in high LD (PLINK), choosing Eigen SNPs by principal component analysis (GRASS and KGG), or focusing only on the top SNP (Aligator and GSEAforGWAS), are all effective ways of reducing inflation caused by SNP dependency, but may perform differently.
Due to complexity caused by SNP, gene, and pathway relationships, re-sampling strategies like permutation are always utilized to derive an empirical distribution for test statistics to evaluate the significance of candidate pathways. However, it comes with a cost of time and memory. Usually two approaches are used to lighten the computational burden, either a revised adaptive or optimal permutation of raw genotype/phenotype data [46, 47] or a permutation of SNP or gene labels in summary statistics data . We showed that GATES-Hyper and GATES-Simes implemented in KGG were much faster than those permutation-based algorithms since both of them used approximation for p-value distribution. Clearly, the advantages of GWASPA over genomic expression studies or traditional laboratory works (such as knock out mouse and cultured cells) [7, 8] are apparent in terms of time and economy and can be used to guide these further studies.
There are also some limitations of our study. While the WTCCC CD dataset is an ideal model GWAS, the performance of GWASPA algorithms may not be as good with other datasets, where studies are smaller and less is known regarding function at the outset. We recommend applying two or more algorithms from different categories (self-contained versus competitive, average versus maximum and one-stage versus two-stage) in practice, especially when we cannot readily distinguish the scenario in which various susceptibility variants confer moderate risk to disease versus the scenario in which a major effect variant in a pathway plays a dominant role in complex diseases [15, 20]. In addition, we set a fixed threshold for SNP or gene p-value truncation when including them for pathway analysis by PLINK-Ave, Aligator, and GATES-Hyper, but this cut-off is somewhat arbitrary and the choice made can influence results dramatically. Moreover, the FDR method was applied to correct for multiple testing and determine significance on pathway level, since almost no pathway p-values would survive Bonferroni adjustment . Nevertheless, we think it was still reasonable for comparing performances of different algorithms, since we applied the same standards to all methods tested . Lastly, none of these algorithms took differences in pathway structure stored in the original BioPAX or SBML format into consideration , but treated them as plain text containing independent gene symbols, when in fact any gene can be a member of multiple pathways. This drawback is handling by more advanced statistical methods like mixed-effect models and Bayesian networks, which facilitates modelling of gene-gene overlapping, interaction and correlation as well as net gene effect within the same pathway [51, 52].
GWAS pathway analysis, which prioritizes candidate pathways associated with complex disorders, could serve as an important complement to individual SNP analysis and gene-based analysis. Though all algorithms selected in this study do not have inflated Type I error rates, they vary greatly in terms of power and running time. The PLINK truncated set-based test was the most powerful, but the two summary statistics-based algorithms implemented in KGG were the fastest. However, raw data-based algorithms should be preferred for GWAS pathway analysis as long as computation capacity is available, since they preserve the intact data structure and tend to be more powerful than summary statistics-based algorithms. When underlying disease causal mechanism is ambiguous, which is common for complex diseases, it is worthwhile to apply two or more pathway analysis algorithms on the same GWAS dataset.
Availability and Requirements
KGG is implemented using Java; therefore a Java Runtime Environment (JRE) is required to run KGG. Currently, installation of JRE and KGG are supported for Windows, Mac OS × and Linux. Three command files (run.win.bat, run.mac.sh and run.linux.sh) are provided for users to run KGG easily. A graphical user interface will automatically appear once initiating the command file. Documentation, source code, and precompiled binaries can be downloaded from http://bioinfo.hku.hk:13080/kggweb/home.htm.
Frayling TM: Genome-wide association studies provide new insights into type 2 diabetes aetiology. Nat Rev Genet. 2007, 8: 657-662.
Houlston RS, Cheadle J, Dobbins SE, Tenesa A, Jones AM, Howarth K, Spain SL, Broderick P, Domingo E, Farrington S, et al: Meta-analysis of three genome-wide association studies identifies susceptibility loci for colorectal cancer at 1q41, 3q26.2, 12q13.13 and 20q13.33. Nat Genet. 2010, 42: 973-977. 10.1038/ng.670.
Franke A, McGovern DP, Barrett JC, Wang K, Radford-Smith GL, Ahmad T, Lees CW, Balschun T, Lee J, Roberts R, et al: Genome-wide meta-analysis increases to 71 the number of confirmed Crohn's disease susceptibility loci. Nat Genet. 2010, 42: 1118-1125. 10.1038/ng.717.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9: 356-369. 10.1038/nrg2344.
Cantor RM, Lange K, Sinsheimer JS: Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am J Hum Genet. 2010, 86: 6-22. 10.1016/j.ajhg.2009.11.017.
Biological Pathways Fact Sheet. [http://www.genome.gov/27530687]
Akira S, Yamamoto M, Takeda K: Role of adapters in Toll-like receptor signalling. Biochem Soc Trans. 2003, 31: 637-642.
De Bellard ME, Ching W, Gossler A, Bronner-Fraser M: Disruption of segmental neural crest migration and ephrin expression in delta-1 null mice. Dev Biol. 2002, 249: 121-130. 10.1006/dbio.2002.0756.
Viswanathan GA, Seto J, Patil S, Nudelman G, Sealfon SC: Getting started in biological pathway construction and analysis. PLoS Comput Biol. 2008, 4: e16-10.1371/journal.pcbi.0040016.
Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008, 36: D480-484.
Schadt EE: Molecular networks as sensors and drivers of common human diseases. Nature. 2009, 461: 218-223. 10.1038/nature08454.
Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, et al: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003, 34: 267-273. 10.1038/ng1180.
Khatri P, Draghici S, Ostermeier GC, Krawetz SA: Profiling gene expression using onto-express. Genomics. 2002, 79: 266-270. 10.1006/geno.2002.6698.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.
Wang K, Li M, Bucan M: Pathway-Based Approaches for Analysis of Genomewide Association Studies. Am J Hum Genet. 2007, 81:
Wang K, Li M, Hakonarson H: Analysing biological pathways in genome-wide association studies. Nat Rev Genet. 2010, 11: 843-854. 10.1038/nrg2884.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
Chen LS, Hutter CM, Potter JD, Liu Y, Prentice RL, Peters U, Hsu L: Insights into Colon Cancer Etiology via a Regularized Approach to Gene Set Analysis of GWAS Data. Am J Hum Genet. 2010, 86: 860-871. 10.1016/j.ajhg.2010.04.014.
Zhang K, Cui S, Chang S, Zhang L, Wang J: i-GSEA4GWAS: a web server for identification of pathways/gene sets associated with traits by applying an improved gene set enrichment analysis to genome-wide association study. Nucleic Acids Res. 2010, 38: W90-95. 10.1093/nar/gkq324.
Holmans P, Green EK, Pahwa JS, Ferreira MA, Purcell SM, Sklar P, Wellcome Trust Case-Control C, Owen MJ, O'Donovan MC, Craddock N: Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am J Hum Genet. 2009, 85: 13-24. 10.1016/j.ajhg.2009.05.011.
Ballard DH, Cho J, Zhao H: Comparisons of multi-marker association methods to detect association between a candidate region and disease. Genet Epidemiol. 2010, 34: 201-212. 10.1002/gepi.20448.
Li MX, Gui HS, Kwan JS, Sham PC: GATES: A Rapid and Powerful Gene-Based Association Test Using Extended Simes Procedure. Am J Hum Genet. 2011, 88: 283-293. 10.1016/j.ajhg.2011.01.019.
Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, Investigators A, Hayward NK, Montgomery GW, Visscher PM, et al: A versatile gene-based test for genome-wide association studies. Am J Hum Genet. 2010, 87: 139-145. 10.1016/j.ajhg.2010.06.009.
Elbers CC, van Eijk KR, Franke L, Mulder F, van der Schouw YT, Wijmenga C, Onland-Moret NC: Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genet Epidemiol. 2009, 33: 419-431. 10.1002/gepi.20395.
Ballard D, Abraham C, Cho J, Zhao H: Pathway analysis comparison using Crohn's disease genome wide association studies. BMC Med Genomics. 2010, 3: 25-10.1186/1755-8794-3-25.
Peng G, Luo L, Siu H, Zhu Y, Hu P, Hong S, Zhao J, Zhou X, Reveille JD, Jin L, et al: Gene and pathway-based second-wave analysis of genome-wide association studies. Eur J Hum Genet. 2010, 18: 111-117. 10.1038/ejhg.2009.115.
Wellcome Trust Case Control C: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661-678. 10.1038/nature05911.
Eleftherohorinou H, Wright V, Hoggart C, Hartikainen AL, Jarvelin MR, Balding D, Coin L, Levin M: Pathway analysis of GWAS provides new insights into genetic susceptibility to 3 inflammatory diseases. PLoS One. 2009, 4: e8068-10.1371/journal.pone.0008068.
Wang K, Zhang H, Kugathasan S, Annese V, Bradfield JP, Russell RK, Sleiman PM, Imielinski M, Glessner J, Hou C, et al: Diverse genome-wide association studies associate the IL12/IL23 pathway with Crohn Disease. Am J Hum Genet. 2009, 84: 399-405. 10.1016/j.ajhg.2009.01.026.
Wellcome Trust Case Control Consortium. [http://www.wtccc.org.uk/]
De la Cruz O, Wen X, Ke B, Song M, Nicolae DL: Gene, region and pathway level analyses in whole-genome studies. Genet Epidemiol. 2010, 34: 222-231.
Feng T, Zhu X: Genome-wide searching of rare genetic variants in WTCCC data. Hum Genet. 2010, 128: 269-280. 10.1007/s00439-010-0849-9.
Simes RJ: An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986, 73: 751-754. 10.1093/biomet/73.3.751.
KEGG Pathway Database. [http://www.genome.jp/kegg/pathway.html#disease]
BIOCARTA Pathways. [http://www.biocarta.com/genes/index.asp]
Molecular Signatures Database. [http://www.broadinstitute.org/gsea/msigdb/index.jsp]
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
Zhong H, Yang X, Kaplan LM, Molony C, Schadt EE: Integrating pathway analysis and genetics of gene expression for genome-wide association studies. Am J Hum Genet. 2010, 86: 581-591. 10.1016/j.ajhg.2010.02.020.
Backes C, Keller A, Kuentzer J, Kneissl B, Comtesse N, Elnakady YA, Muller R, Meese E, Lenhof HP: GeneTrail--advanced gene set enrichment analysis. Nucleic Acids Res. 2007, 35: W186-192. 10.1093/nar/gkm323.
Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2010, 29: 1165-1188.
Balding DJ: A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006, 7: 781-791. 10.1038/nrg1916.
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Gene-set analysis and reduction. Brief Bioinform. 2009, 10: 24-34.
Fridley BL, Jenkins GD, Biernacka JM: Self-contained gene-set analysis of expression data: an evaluation of existing and novel methods. PLoS One. 2010, 5:
Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, Kraft P, Chatterjee N: Pathway analysis by adaptive combination of P-values. Genet Epidemiol. 2009, 33: 700-709. 10.1002/gepi.20422.
Pahl R, Schafer H: PERMORY: an LD-exploiting permutation test algorithm for powerful genome-wide association testing. Bioinformatics. 2010, 26: 2093-2100. 10.1093/bioinformatics/btq399.
Guo YF, Li J, Chen Y, Zhang LS, Deng HW: A new permutation strategy of pathway-based approach for genome-wide association study. BMC Bioinformatics. 2009, 10: 429-10.1186/1471-2105-10-429.
Dunn OJ: Multiple comparisons among means. J Am Stat Assoc. 1961, 56: 52-64. 10.2307/2282330.
Bauer-Mehren A, Furlong LI, Sanz F: Pathway databases and tools for their exploitation: benefits, current limitations and challenges. Mol Syst Biol. 2009, 5: 290-
Wang L, Jia P, Wolfinger RD, Chen X, Grayson BL, Aune TM, Zhao Z: An efficient hierarchical generalized linear mixed model for pathway analysis of genome-wide association studies. Bioinformatics. 2011, 27: 686-692. 10.1093/bioinformatics/btq728.
Baurley JW, Conti DV, Gauderman WJ, Thomas DC: Discovery of complex pathways from observational data. Stat Med. 2010, 29: 1998-2011. 10.1002/sim.3962.
The authors would like to thank the WTCCC for allowing us to use their CD dataset. The authors also would like to thank Dr. Johnny Kwan for helpful suggestions on the initial stage of this study, and Dr. Lina Chen (University of Chicago) for giving useful tips on running the SNPath package. HG was supported by a University Postgraduate Fellowship from The University of Hong Kong.
The authors declare that they have no competing interests.
HG devised the study, performed simulation and data analysis, and drafted the manuscript. ML developed the KGG package and implemented GATES-Simes and GATES-Hyper algorithms. SC and PS contributed to study design and revised the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1:Potential CD-associated pathways by different methods. Table S1. Positive pathways for Crohn's disease from enrichment analysis of 93 CD susceptibility genes. This table contains 28 enriched pathways (FDR < 0.05, hypergeometric test) from GeneTrail analysis. These pathways are treated as positive pathways for CD in this study. Table S2. Number of significant pathways (FDR < 0.05) detected by PLINK-Ave and GATES-Hyper algorithm with different LD and p-value threshold setting. This table shows impact of varying LD pruning and p-value threshold on detecting significant pathways for Crohn's disease. Table S3. Overlapping between pathways detected by different raw data based algorithms. This table presents significant pathways (FDR < 0.05) detected by PLINK-AVE, GRASS and GSEAforGWAS. The overlap between each pair of algorithms and also with positive pathways is illustrated by 'Yes' or 'No'. (XLS 50 KB)
About this article
Cite this article
Gui, H., Li, M., Sham, P.C. et al. Comparisons of seven algorithms for pathway analysis using the WTCCC Crohn's Disease dataset. BMC Res Notes 4, 386 (2011) doi:10.1186/1756-0500-4-386
- Pathway Analysis
- Hypergeometric Test
- Wellcome Trust Case Control Consortium
- Candidate Pathway
- GWAS Dataset