## page was renamed from CSCBiostatService/GSEA
A Summary of GSEA (Gene Set Enrichment Analysis)
GSEA goal
The description below is quoted directly from Subramanian et al. (2005):
- The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the ranked gene list L, in which case the gene set is correlated with the phenotypic class distinction.
- Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.
GSEA methods
- Three key elements
  - Calculation of an enrichment score (ES)
    - Walk down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not.
    - The magnitude of the increment depends on the correlation of the gene with the phenotype (that is, on the absolute value of the ranking metric).
    - The ES is the maximum deviation from zero encountered in the random walk.
  - Estimation of the significance level of the ES (nominal p-value)
  - Adjustment for multiple hypothesis testing (FDR)
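The running-sum walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the GSEA implementation: the function name `enrichment_score`, its arguments, and the toy data are assumptions made for this example. Genes in the set step the sum up in proportion to the absolute ranking metric raised to the power `p`, and every other gene steps it down by a constant amount.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set.

    ranked_genes: gene identifiers sorted by descending ranking metric.
    scores:       ranking metric for each gene, in the same order.
    gene_set:     collection of gene identifiers (the set S).
    p:            weighting exponent; p = 0 gives equal-sized hit steps.
    Hypothetical helper written for illustration only.
    """
    members = set(gene_set)
    hits = [g in members for g in ranked_genes]
    n_miss = hits.count(False)
    # Total weight of the genes in the set, used to normalise hit steps.
    total_weight = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if n_miss == 0 or total_weight == 0:
        raise ValueError("need hits with non-zero weight and at least one miss")

    running, es, walk = 0.0, 0.0, []
    for s, h in zip(scores, hits):
        # Step up on a hit (weighted), step down on a miss (constant).
        running += abs(s) ** p / total_weight if h else -1.0 / n_miss
        walk.append(running)
        if abs(running) > abs(es):
            es = running  # keep the signed maximum deviation from zero
    return es, walk


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
    metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
    es, walk = enrichment_score(genes, metric, {"g1", "g2"})
    print(es, walk)
```

The walk always returns to zero at the end of the list, because the hit steps sum to 1 and the miss steps sum to -1.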
- Mathematical description
  - Enrichment score (ES)
    - ES is the maximum deviation from zero of Phit – Pmiss.
      - Pmiss is the empirical distribution function of the genes not in the gene set S, evaluated along the ranked gene list L.
      - Phit is the weighted cumulative distribution function of the genes in S, evaluated along the ranked gene list L, in which each gene is weighted by the absolute value of its ranking metric raised to the power p.
    - ES corresponds to a weighted Kolmogorov–Smirnov-like statistic.
      - When p = 0, ES reduces to the standard Kolmogorov–Smirnov statistic.
        - Phit is the (unweighted) empirical distribution function of the genes in S, evaluated along the ranked gene list L.
        - ES = sup{|Phit – Pmiss|}, used to test whether the two underlying probability distributions differ.
        - Under the null hypothesis, the suitably scaled statistic asymptotically follows the Kolmogorov distribution.
      - When p = 1, the null distribution of ES has no known closed form and is estimated by a permutation approach.
  - Significance level of a gene set (nominal p-value)
  - Significance level for multiple gene sets (FWER and/or FDR)
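For reference, the definitions above can be written out explicitly. The following restates them in the notation of Subramanian et al. (2005); the symbols N (number of genes in L), N_H (number of genes in S), N_R (normalising constant), and r_j (ranking metric of gene g_j) do not appear in the summary above and are taken from that paper.

```latex
\begin{align}
N_R &= \sum_{g_j \in S} |r_j|^{p}, \\
P_{\mathrm{hit}}(S, i) &= \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^{p}}{N_R}, \\
P_{\mathrm{miss}}(S, i) &= \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H}, \\
\mathrm{ES}(S) &= P_{\mathrm{hit}}(S, i^{*}) - P_{\mathrm{miss}}(S, i^{*}),
\qquad i^{*} = \arg\max_{i} \bigl| P_{\mathrm{hit}}(S, i) - P_{\mathrm{miss}}(S, i) \bigr| .
\end{align}
```

With p = 0 every term of P_hit equals 1/N_H, so both functions are ordinary empirical distribution functions and ES is the two-sample Kolmogorov–Smirnov statistic with its sign retained.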
GSEA software
- Start GSEA in one of the following ways:
  - launch the javaGSEA desktop application
  - download and run the javaGSEA Java JAR file
  - download and run the R-GSEA R script to explore the GSEA method
- GSEA input data formats
- GSEA analyses
  - Running GSEA (gene set enrichment analysis)
  - Running leading edge analysis
    - After running the gene set enrichment analysis, use the leading edge analysis to examine the genes that are in the leading-edge subsets of the enriched gene sets.
    - The leading-edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero.
    - A gene that is in many of the leading-edge subsets is more likely to be of interest than a gene that is in only a few of them.
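The definition of the leading-edge subset can be illustrated with a short Python sketch. The function name `leading_edge` and the toy data are hypothetical, and the treatment of a negative ES (taking the set members that follow the peak rather than precede it) is an assumption of this sketch; the definition above covers only the case of a positive peak.

```python
def leading_edge(ranked_genes, scores, gene_set, p=1.0):
    """Return the gene-set members in the leading-edge subset.

    Hypothetical helper for illustration; not part of the GSEA software.
    """
    members = set(gene_set)
    hits = [g in members for g in ranked_genes]
    n_miss = hits.count(False)
    total_weight = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if n_miss == 0 or total_weight == 0:
        raise ValueError("need hits with non-zero weight and at least one miss")

    # Rebuild the running sum used for the enrichment score.
    running, walk = 0.0, []
    for s, h in zip(scores, hits):
        running += abs(s) ** p / total_weight if h else -1.0 / n_miss
        walk.append(running)

    # Position of the maximum deviation from zero (first one on ties).
    peak = max(range(len(walk)), key=lambda i: abs(walk[i]))
    if walk[peak] >= 0:
        positions = range(0, peak + 1)          # at or before the peak
    else:
        positions = range(peak + 1, len(walk))  # assumption: after the peak
    return [ranked_genes[i] for i in positions if hits[i]]


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
    metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
    print(leading_edge(genes, metric, {"g1", "g2", "g5"}))
```

Counting how often each gene appears across the leading-edge subsets of several enriched gene sets then gives the "in many subsets" ranking mentioned above.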
 
  - Running GSEAPreranked
    - Run GSEA on a pre-ranked gene list, created by
      - converting the features (probe identifiers) to human gene symbols, provided that the list contains no duplicate features (one-to-one correspondence with human gene symbols)
      - choosing an appropriate ranking metric (make sure that the data do not include duplicate ranking values)
      - sorting the gene list in descending numerical order
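The three preparation steps can be sketched as follows. The helper names `build_ranked_list` and `to_rnk_text` and the example values are assumptions for illustration; the output is written as two tab-separated columns (gene symbol, ranking value), which is the layout of a ranked-list (RNK) file, but the GSEA documentation on input data formats should be checked before relying on it.

```python
def build_ranked_list(gene_scores):
    """Validate and sort (gene_symbol, ranking_value) pairs.

    Hypothetical helper: rejects duplicate symbols and duplicate ranking
    values, then sorts in descending numerical order.
    """
    genes = [gene for gene, _ in gene_scores]
    if len(set(genes)) != len(genes):
        raise ValueError("duplicate gene symbols in the list")
    values = [value for _, value in gene_scores]
    if len(set(values)) != len(values):
        raise ValueError("duplicate ranking values in the list")
    return sorted(gene_scores, key=lambda pair: pair[1], reverse=True)


def to_rnk_text(ranked):
    """Format the ranked list as tab-separated 'symbol<TAB>value' lines."""
    return "".join(f"{gene}\t{value}\n" for gene, value in ranked)


if __name__ == "__main__":
    # Toy ranking values, made up for this example.
    ranked = build_ranked_list([("TP53", 1.5), ("BRCA1", -0.7), ("MYC", 2.1)])
    print(to_rnk_text(ranked), end="")
```

Raising an error on duplicates, rather than silently dropping them, forces the choice of which probe or value to keep to be made explicitly upstream.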
  - Running CollapseDataset
    - Collapse the dataset from probes to symbols
      - creates a new dataset by collapsing all probe set values for a gene into a single value for the gene
      - collapsing mode for the probe set: maximum or median expression value for the probe set
    - Can be run as part of GSEA analyses
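The collapsing step can be pictured with a small Python sketch. It assumes that the maximum or median is taken separately for each sample across the probes of a gene; the function name `collapse_probes`, the dictionary layout, and the toy data are hypothetical and do not reflect the GSEA file formats.

```python
from statistics import median


def collapse_probes(expression, probe_to_gene, mode="max"):
    """Collapse probe-level rows into one row per gene.

    expression:    dict mapping probe id -> list of per-sample values.
    probe_to_gene: dict mapping probe id -> gene symbol.
    mode:          "max" or "median", applied per sample (an assumption).
    Probes without a gene symbol are dropped.
    """
    if mode not in ("max", "median"):
        raise ValueError("mode must be 'max' or 'median'")
    aggregate = max if mode == "max" else median

    # Group the probe rows by gene symbol.
    rows_by_gene = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            rows_by_gene.setdefault(gene, []).append(values)

    # zip(*rows) iterates over samples, so each sample is collapsed on its own.
    return {
        gene: [aggregate(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


if __name__ == "__main__":
    expr = {"p1": [1.0, 5.0], "p2": [3.0, 2.0], "p3": [4.0, 4.0]}
    mapping = {"p1": "A", "p2": "A", "p3": "B"}
    print(collapse_probes(expr, mapping, mode="max"))
```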
 
- Viewing and interpreting GSEA Results
- Notes
  - Use only one probe per gene (collapsing the set of probes for a gene into a single gene-level entry)
    - after differential expression analysis: collapse probes by running GSEA CollapseDataset
    - before differential expression analysis: filter out all probes for a gene except the one with the maximum MAD (median absolute deviation, defined as the median of the absolute deviations from the data's median)
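The MAD-based filter in the note above can be sketched as follows. The names `mad` and `pick_max_mad_probe`, the dictionary layout, and the toy data are assumptions for illustration; ties are resolved in favour of the probe encountered first.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def pick_max_mad_probe(expression, probe_to_gene):
    """Return {gene: probe}, keeping the probe with the largest MAD per gene.

    expression:    dict mapping probe id -> list of per-sample values.
    probe_to_gene: dict mapping probe id -> gene symbol.
    Hypothetical helper; probes without a gene symbol are ignored.
    """
    best = {}  # gene -> (probe, MAD of that probe)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        score = mad(values)
        if gene not in best or score > best[gene][1]:
            best[gene] = (probe, score)
    return {gene: probe for gene, (probe, _) in best.items()}


if __name__ == "__main__":
    expr = {"p1": [1.0, 2.0, 3.0], "p2": [1.0, 5.0, 9.0], "p3": [2.0, 2.0, 2.0]}
    mapping = {"p1": "A", "p2": "A", "p3": "B"}
    print(pick_max_mad_probe(expr, mapping))
```

Unlike collapsing by maximum or median, this keeps the measured values of one real probe per gene, so the retained rows can go straight into a differential expression analysis.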
References
Subramanian et al. (2005) 
 GSEA user guide 
 GSEA documentation 
 
