Tutorial

1. Input data

The input should be a plain-text file with exactly two tab-separated columns and no header line. Gzip-compressed text files are also supported. Two types of input data are supported:

1.1 SNP association data

The first column is the SNP ID and the second column is the association value: -log (P-value), a test statistic, or an odds ratio. If your input contains raw P-values, the server can convert them to -log (P-value): simply tick the "-logarithm transformation" option (necessary ONLY for P-value data).

The format is as follows (SNP ID, -log (P-value)):

rs1000000 	0.49471432586
rs10000010	0.51215487989
rs10000023	1.11367851344
rs10000030	0.35713994742
rs10000041	0.20210951694
rs1000007 	0.04436034698
rs10000081	0.37110043558
rs10000092	0.40197592767
rs10000121	0.43937612545
rs1000014 	0.45892023222
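
If you prefer to transform P-values yourself before uploading, the step can be sketched as follows. This is a minimal illustration, not the server's code: the function names are hypothetical, the P-values are made up, and base 10 is an assumption, since the tutorial only says "-log".

```python
import math

def to_neg_log10(p_value: float) -> float:
    """Return -log10(p). Base 10 is an assumption, not stated in the tutorial."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError(f"P-value out of range (0, 1]: {p_value}")
    return -math.log10(p_value)

def format_rows(snp_pvalues):
    """Build tab-separated lines (SNP ID, -log10 P) with no header line."""
    return [f"{snp}\t{to_neg_log10(p):.11f}" for snp, p in snp_pvalues]

# Made-up P-values, for illustration only.
rows = format_rows([("rs1000000", 0.32), ("rs10000023", 0.077)])
print("\n".join(rows))
```

Each resulting line has exactly two tab-separated fields, matching the input format described above.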

1.2 Gene association data

The first column is the HUGO gene symbol (http://www.genenames.org/) and the second column is the association value, e.g. -log (P-value), a test statistic, or an odds ratio. The format is as follows (gene symbol, maximum -log (P-value) of the SNPs mapped to the gene):

    
GDA	1.947306
SCN3A	1.6901569
SCN3B	1.5979106
RPLP2	0.5395532
BTBD1	0.87419355
BTBD2	1.6567885
BTBD3	1.7276942
RPLP1	1.4337983
ACAA2	2.0501711
TMEFF2	1.7416022
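
The gene-level value in the example above is the maximum -log (P-value) over the SNPs mapped to each gene. A minimal sketch of that reduction is shown below. The function name, the SNP-to-gene dictionary, and the scores are hypothetical and only illustrate the idea.

```python
def gene_scores(snp_scores, snp_to_genes):
    """Return {gene: max score over the SNPs mapped to that gene}.

    snp_scores:   {snp_id: -log(P-value)}
    snp_to_genes: {snp_id: iterable of gene symbols}  (hypothetical mapping)
    """
    best = {}
    for snp, score in snp_scores.items():
        for gene in snp_to_genes.get(snp, ()):
            if gene not in best or score > best[gene]:
                best[gene] = score
    return best

# Made-up scores and mapping, for illustration only.
scores = {"rs1": 0.5, "rs2": 1.9, "rs3": 1.2}
mapping = {"rs1": ["GDA"], "rs2": ["GDA", "SCN3A"], "rs3": ["SCN3A"]}
print(gene_scores(scores, mapping))
```

SNPs that map to no gene are ignored, and a SNP that maps to several genes contributes to each of them.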

2. Options

2.1 Optional multiple-level broad-to-narrow SNPs->genes mapping rules

Several SNPs->genes mapping rules are available: "500 kb upstream and downstream range of gene", "100 kb upstream and downstream range of gene", "5 kb upstream and downstream range of gene", "within gene", and "functional SNPs". They are ordered from broad to narrow, that is, from coarse to precise. The SNPs->genes mapping is built from SNP and gene annotations in the Ensembl BioMart database (Release 56, 15 September 2009, http://www.ensembl.org/biomart/martview). Only one rule can be chosen per run, and the option applies only to SNP data.

Figure 2.1 Option of SNPs->genes mapping rules.
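
The distance-based rules can be pictured as a window around each gene. The sketch below is an illustration under stated assumptions, not the server's implementation: the `Gene` record, the function name, and the coordinates are hypothetical, and the "functional SNPs" rule is not covered because it depends on SNP annotation rather than distance.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int  # gene start position in bp (hypothetical coordinates)
    end: int    # gene end position in bp

def map_snp_to_genes(chrom, pos, genes, flank_bp):
    """Return symbols of genes whose span, extended by flank_bp on both
    sides, contains the SNP. flank_bp=0 corresponds to "within gene"."""
    return [
        g.symbol
        for g in genes
        if g.chrom == chrom and g.start - flank_bp <= pos <= g.end + flank_bp
    ]

# Made-up gene annotation, for illustration only.
genes = [Gene("GENE_A", "1", 100_000, 120_000), Gene("GENE_B", "1", 700_000, 710_000)]
print(map_snp_to_genes("1", 96_000, genes, 5_000))    # 5 kb rule
print(map_snp_to_genes("1", 96_000, genes, 500_000))  # 500 kb rule
```

A broader window maps more SNPs to each gene, which is why the rules run from broad to narrow.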

2.2 Choose gene set database

Figure 2.2

2.2.1 Canonical pathways

The canonical pathways are taken from MSigDB v2.5, which integrates and curates pathways from the following online resources:
KEGG
BioCarta
Signaling pathway database
Signaling gateway
Signal transduction knowledge environment
Human protein reference database
GenMAPP
Gene ontology
Sigma-Aldrich pathways
Gene arrays, BioScience Corp
Human cancer genome anatomy consortium
NetAffx

2.2.2 Curated gene ontology (GO) terms

GO biological process, GO molecular function, and GO cellular component gene sets are from MSigDB v2.5. Only GO terms with one of the evidence codes IDA, IPI, IMP, IGI, IEP, ISS, or TAS, and belonging to reasonable categories, are included. The reasonable categories are defined by MSigDB as: "GO gene sets for very broad categories, such as Biological Process, have been omitted from MSigDB. GO gene sets with fewer than 10 genes have also been omitted. Gene sets with the same members have been resolved based on the GO tree structure: if a parent term has only one child term and their gene sets have the same members, the child gene set is omitted; if the gene sets of sibling terms have the same members, the sibling gene sets are omitted".

2.2.3 Customized gene sets

Additionally, users can upload their own gene sets. The file must meet the following format requirements: 1) it is a plain-text file with no header line; 2) it contains one gene set per line, with tab-separated fields; 3) the first column is the gene set ID, the second column is the gene set description (use "na" or leave it blank if not available), and the remaining columns are HUGO gene symbols.

GO0045726          GO0045726       NOX1	P61812	Q9Y5S8	TGFB2
GO0016045          GO0016045       CD1D	NLRC4	NOD1	NOD2	O75594	P15813	PARG	PGLYRP1	PGLYRP2
GO0048536          GO0048536       BCL3	JARID2	NFKB2	NKX3-2	P20749	P31314	P78367
GO0010460          GO0010460       ADRA1A	ADRA1B	ADRB1	B1N7G2	B1N7G7	CHRNA7	CHRNA7-2
GO0035090          GO0035090       A0PJG1	A7MBM7	ANK1	LLGL1	P16157	Q15334
GO0050982          GO0050982       A2A3D9	A9Z1W1	GRIN2B	MKKS	MYC	O15273	P01106	P48431	P55011	P98161	P98161-2
GO0007346          GO0007346       A6NDV4	AFAP1L2	APBB1	APBB2	ATM	BCL6	BLM	BRCA2
GO0001890          GO0001890       AKT1	ANG	ARNT	BIRC2	CDX2	CDX4	CEBPB	CITED1
GO0016189          GO0016189       EEA1	Q15075
GO0008406          GO0008406       A6NKD2	ACVR2A	AMH	ANKRD7	AR	BAX	BRCA2	CSDE1	DMRT1	DMRT2
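
Before uploading a customized gene set file, it can help to check each line against the format requirements above. The following is a minimal sketch with a hypothetical function name. It assumes the fields are tab-separated, as the requirements state.

```python
def parse_gene_set_line(line):
    """Split one line into (set_id, description, [gene symbols]).

    Raises ValueError if the line has no ID or no gene symbols.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3 or not fields[0].strip():
        raise ValueError(f"expected ID, description, and at least one gene: {line!r}")
    set_id, description = fields[0].strip(), fields[1].strip()
    genes = [g.strip() for g in fields[2:] if g.strip()]
    if not genes:
        raise ValueError(f"gene set {set_id} has no gene symbols")
    return set_id, description, genes

print(parse_gene_set_line("GO0016189\tGO0016189\tEEA1\tQ15075\n"))
```

A description of "na" or an empty second field is accepted as is, matching requirement 3.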

2.2.4 MHC/xMHC region masking for gene sets

If "Mask MHC/xMHC region" is selected, all genes in the MHC/xMHC (major histocompatibility complex / extended major histocompatibility complex) region are removed from the selected gene set database. The list of genes in the MHC/xMHC region is taken from Horton R, et al., Nature Reviews Genetics 2004 5, 889-899.

Figure 2.2.4 The option of masking genes in the MHC/xMHC region.
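
Conceptually, masking removes every listed region gene from every gene set. A minimal sketch is given below. The function name is hypothetical, and the two-gene mask is only a stand-in for the full MHC/xMHC gene list from Horton et al.

```python
def mask_genes(gene_sets, masked_genes):
    """Return a copy of gene_sets with all masked genes removed.

    gene_sets:    {set_id: list of gene symbols}
    masked_genes: iterable of gene symbols to remove
    """
    masked = set(masked_genes)
    return {
        set_id: [g for g in genes if g not in masked]
        for set_id, genes in gene_sets.items()
    }

# Illustrative stand-in for the real MHC/xMHC gene list.
example_mask = {"HLA-A", "HLA-B"}
example_sets = {"SET_1": ["HLA-A", "TGFB2", "NOX1"], "SET_2": ["EEA1"]}
print(mask_genes(example_sets, example_mask))
```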

2.2.5 Filter gene sets by set size

Gene set size can be restricted to avoid overly narrow or overly broad functional categories. The default minimum and maximum numbers of genes per gene set are 20 and 200, respectively (Wang et al., 2007 Am J Hum Genet 81 (6) 1278-1283; Fellay et al., 2009 PLoS Genet 5(12) e1000791).

Figure 2.2.5 The option of filtering gene sets by set size.
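
The size filter can be sketched as follows. The function name is hypothetical, and two details are assumptions rather than documented behavior: the bounds are treated as inclusive, and size is counted over the unique genes listed in the set.

```python
def filter_by_size(gene_sets, min_size=20, max_size=200):
    """Keep gene sets whose number of unique genes lies in [min_size, max_size]."""
    return {
        set_id: genes
        for set_id, genes in gene_sets.items()
        if min_size <= len(set(genes)) <= max_size
    }

# Made-up gene sets of different sizes, for illustration only.
demo_sets = {
    "TOO_SMALL": [f"G{i}" for i in range(5)],
    "KEPT": [f"G{i}" for i in range(50)],
    "TOO_LARGE": [f"G{i}" for i in range(300)],
}
print(sorted(filter_by_size(demo_sets)))
```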

3. Output and display (example of result page)

The output page contains a download link, from which all results (text and figures) can be downloaded, and a summary table listing the pathways/gene sets with FDR < 0.25 in order of increasing FDR. An FDR < 0.25 is regarded as 'possible' or 'hypothesis'-level confidence, while an FDR < 0.05 is regarded as 'high confidence' or 'with statistical significance'. You can visit http://gsea4gwas.psych.ac.cn/getResult.do?result=9DA3BCD71BDB4CC5DEC84F64927C20EE.s3_1265892314763 to see an example result.

Figure 3 The result page.
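
If you post-process the downloaded results yourself, the selection and ordering used by the summary table can be reproduced along these lines. This is a sketch only: the function name, the tuple layout, and the FDR values are hypothetical and do not reflect the format of the downloadable files.

```python
def summarize(results, cutoff=0.25, high_confidence=0.05):
    """Keep gene sets with FDR < cutoff, sorted by increasing FDR,
    and label each with the confidence level described in the text.

    results: list of (gene_set_name, fdr) tuples
    """
    kept = sorted((r for r in results if r[1] < cutoff), key=lambda r: r[1])
    return [
        (name, fdr, "high confidence" if fdr < high_confidence else "possible")
        for name, fdr in kept
    ]

# Made-up FDR values, for illustration only.
demo_results = [("SET_A", 0.30), ("SET_B", 0.12), ("SET_C", 0.01)]
for row in summarize(demo_results):
    print(row)
```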

3.1 Manhattan plot of pathway/gene set

A Manhattan plot is a type of scatter plot, usually used to display data with a large number of data points, many of non-zero amplitude and with a distribution of higher-magnitude values, for instance in genome-wide association studies (http://en.wikipedia.org/wiki/Manhattan_plot). In a GWAS Manhattan plot, the x-axis is divided into blocks, one per chromosome, and the y-axis shows the association data (typically -log (P-value)). The plot thus maps the association test results to chromosomal locations.

Here, the gene set Manhattan plot uses the GWAS Manhattan plot as background and highlights the association test results for a given pathway/gene set. It lets users compare the association results of that pathway/gene set graphically against the genome-wide data, and provides an interactive panel for viewing information about genes of interest in the pathway/gene set.

Figure 3.1 Gene set Manhattan plot.

3.2 The number of Significant genes/Selected genes/All genes

Significant genes: genes to which at least one of the top 5% of all SNPs is mapped.
Selected genes: genes included in the i-GSEA analysis.
All genes: all genes of the gene set.

These numbers give users a clear overview of each pathway/gene set: how many genes it contains, how many of them are included in the i-GSEA analysis, and how many are significant.
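
The three counts can be illustrated with the sketch below. It rests on assumptions that the tutorial does not spell out: "top 5% of all SNPs" is taken as the highest-scoring 5% of input SNPs (rounded up), and "selected genes" is taken as the gene set members that have at least one mapped input SNP. All names and data are hypothetical.

```python
import math

def count_genes(gene_set, snp_scores, snp_to_genes, top_fraction=0.05):
    """Return (significant, selected, all) gene counts for one gene set."""
    # Rank SNPs by score, highest first, and take the top fraction (assumption).
    ranked = sorted(snp_scores, key=snp_scores.get, reverse=True)
    n_top = max(1, math.ceil(len(ranked) * top_fraction))
    top_snps = set(ranked[:n_top])

    mapped, significant = set(), set()
    for snp, genes in snp_to_genes.items():
        if snp not in snp_scores:
            continue  # SNP absent from the input data
        mapped.update(genes)
        if snp in top_snps:
            significant.update(genes)

    members = set(gene_set)
    return len(members & significant), len(members & mapped), len(members)

# Made-up data: ten SNPs, so the top 5% is the single best-scoring SNP.
demo_scores = {f"rs{i}": float(i) for i in range(1, 11)}
demo_mapping = {"rs10": ["A"], "rs1": ["B"], "rs2": ["B", "C"]}
print(count_genes(["A", "B", "D"], demo_scores, demo_mapping))
```

In this example gene D belongs to the gene set but has no mapped SNP, so it counts toward "All genes" only.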