Differences between revisions 5 and 9 (spanning 4 versions)

A Summary of GSEA (Gene Set Enrichment Analysis)

GSEA goal

The description below is quoted directly from Subramanian et al. (2005):

The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the ranked gene list L, in which case the gene set is correlated with the phenotypic class distinction.
Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.

GSEA methods

Three key elements
1. Calculation of an enrichment score (ES)
  1. Walk down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not.
  2. The magnitude of the increment depends on the correlation of the gene with the phenotype (or absolute value of the ranking metric).
  3. The ES is the maximum deviation from zero encountered in the random walk.
2. Estimation of significance level of ES (nominal p-value)
3. Adjustment for multiple hypothesis testing (FDR)
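The running-sum walk in element 1 can be sketched as follows. This is a minimal illustration, not the GSEA software's implementation: the function name `enrichment_score` and its arguments are hypothetical, and the weighting exponent `p` is assumed to apply to the absolute value of the ranking metric.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set.

    ranked_genes: gene identifiers, already sorted by ranking metric (descending).
    scores:       ranking metric of each gene, in the same order.
    gene_set:     identifiers of the genes in the set S.
    p:            weighting exponent (assumption: p=0 gives equal-sized steps).
    """
    members = set(gene_set)
    in_set = [g in members for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    # Step size for each hit: |score|^p, normalised so the hit steps sum to 1.
    hit_weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, in_set)]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0:
        raise ValueError("all genes in the set have a zero ranking metric")

    running, running_sum = 0.0, []
    for weight, hit in zip(hit_weights, in_set):
        # Increase on a hit, decrease by a constant step on a miss.
        running += weight / total_hit_weight if hit else -1.0 / n_miss
        running_sum.append(running)

    # ES is the signed value with the largest deviation from zero.
    es = max(running_sum, key=abs)
    return es, running_sum
```

The function returns the whole running sum as well as the ES, because the position of the peak is needed later to identify the leading-edge subset.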
Mathematical description
- Enrichment score (ES)
  1. ES is the maximum deviation from zero of Phit – Pmiss.
    - Pmiss is the empirical distribution function of the genes not in the gene set S, extended into the ranked gene list L.
    - Phit is the weighted cumulative distribution function of the genes in S, in which each gene contributes in proportion to |r|^p (the absolute value of its ranking metric r raised to the power p), extended into the ranked gene list L.
  2. ES corresponds to a weighted Kolmogorov–Smirnov-like statistic
    - When p = 0, ES reduces to the standard Kolmogorov–Smirnov statistic.
      - Phit is the empirical distribution function of the genes in S, extended into the ranked gene list L.
      - ES = sup{|Phit - Pmiss|}, used to test whether the two underlying probability distributions differ.
      - Under the null hypothesis, the suitably scaled ES asymptotically follows the Kolmogorov distribution.
    - When p = 1, the null distribution of ES has no known closed form and is estimated by a permutation approach.
- Significance level of a gene set (nominal p-value)
- Significance level for multiple gene sets (FWER and/or FDR)
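A permutation estimate of the nominal p-value can be sketched as below. Note the assumption: Subramanian et al. (2005) permute the phenotype labels, whereas this sketch permutes gene-set membership (random sets of the same size), which is simpler to show on a ranked list alone. The names `es` and `nominal_p_value` are hypothetical, and the ranking metrics are assumed to be nonzero.

```python
import random


def es(ranked_scores, hit_positions, p=1.0):
    """Signed maximum deviation of the running sum Phit - Pmiss."""
    hits = set(hit_positions)
    n_miss = len(ranked_scores) - len(hits)
    norm = sum(abs(ranked_scores[i]) ** p for i in hits)
    running, best = 0.0, 0.0
    for i, score in enumerate(ranked_scores):
        running += abs(score) ** p / norm if i in hits else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def nominal_p_value(ranked_scores, hit_positions, n_perm=1000, p=1.0, seed=0):
    """Return (observed ES, nominal p-value) from random gene sets of equal size."""
    rng = random.Random(seed)
    observed = es(ranked_scores, hit_positions, p)
    positions = range(len(ranked_scores))
    set_size = len(set(hit_positions))
    null = [es(ranked_scores, rng.sample(positions, set_size), p)
            for _ in range(n_perm)]

    # Use only the part of the null distribution with the same sign as the
    # observed ES, then count the null values at least as extreme.
    if observed >= 0:
        same_sign = [x for x in null if x >= 0]
        extreme = sum(x >= observed for x in same_sign)
    else:
        same_sign = [x for x in null if x < 0]
        extreme = sum(x <= observed for x in same_sign)
    if not same_sign:
        return observed, float("nan")  # no comparable null values were drawn
    return observed, extreme / len(same_sign)
```

With few permutations the estimate is coarse: the smallest nonzero p-value it can report is roughly one over the number of same-sign null values.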

GSEA software

Start GSEA in one of the following ways:
- launch javaGSEA desktop application
- download and run javaGSEA Java Jar file
- download and run R-GSEA R Script to explore GSEA method
GSEA input data formats
GSEA analyses
1. Running GSEA (gene set enrichment analysis)
2. Running leading edge analysis
  - After running the gene set enrichment analysis, use the leading edge analysis to examine the genes that are in the leading-edge subsets of the enriched gene sets.
  - The leading-edge subset in a gene set are those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero.
  - A gene that is in many of the leading-edge subsets is more likely to be of interest than a gene that is in only a few of the leading-edge subsets.
3. Running GSEAPreranked
  - Run GSEA on a pre-ranked gene list, created by:
    - converting the features (probe identifiers) to human gene symbols as long as there are no duplicate features in the list (one-to-one correspondence to human gene symbols)
    - choosing the right ranking metric (make sure that the data do not include duplicate ranking values)
    - sorting the gene list in descending numerical order
4. Running CollapseDataset
  - collapse dataset from probes to symbols
    - creating a new dataset by collapsing all probe set values for a gene into a single value for the gene.
    - collapsing mode for the probe set: maximum or median expression value for the probe set
  - can be run as part of GSEA analyses
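The preparation steps listed for GSEAPreranked (unique gene symbols, no duplicate ranking values, descending sort) can be sketched as follows. The function names are hypothetical, and the two-column tab-delimited output is an assumption about the ranked-list file layout; check the GSEA documentation for the exact format it expects.

```python
def build_ranked_list(gene_scores):
    """Validate and sort (gene_symbol, ranking_value) pairs for a pre-ranked run."""
    pairs = list(gene_scores)
    symbols = [gene for gene, _ in pairs]
    values = [value for _, value in pairs]
    # One-to-one correspondence: each gene symbol may appear only once.
    if len(set(symbols)) != len(symbols):
        raise ValueError("duplicate gene symbols in the list")
    # Tied ranking values make the order of the tied genes arbitrary.
    if len(set(values)) != len(values):
        raise ValueError("duplicate ranking values in the list")
    # Sort in descending numerical order of the ranking metric.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def to_rnk_text(ranked):
    """Render the ranked list as tab-delimited text, one gene per line."""
    return "".join(f"{gene}\t{value}\n" for gene, value in ranked)
```

Raising an error on duplicates, rather than silently dropping them, forces an explicit decision about which probe or value should represent a gene.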
Viewing and interpreting GSEA Results
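When interpreting results, the leading-edge subset described under the leading edge analysis can be extracted from a running sum as sketched below. The function names are hypothetical. The handling of a negative ES (taking the genes at or after the peak) is an assumption that mirrors the positive case; the text above only describes the "at or before" case.

```python
from collections import Counter


def leading_edge(ranked_genes, running_sum, gene_set):
    """Return the gene-set members that lie on the leading edge."""
    members = set(gene_set)
    # Position where the running sum deviates most from zero.
    peak = max(range(len(running_sum)), key=lambda i: abs(running_sum[i]))
    if running_sum[peak] >= 0:
        window = ranked_genes[:peak + 1]   # at or before the peak
    else:
        window = ranked_genes[peak:]       # assumption: at or after the peak
    return [gene for gene in window if gene in members]


def leading_edge_counts(subsets):
    """Count in how many leading-edge subsets each gene appears."""
    return Counter(gene for subset in subsets for gene in subset)
```

Genes with the highest counts from `leading_edge_counts` are the ones shared by many enriched gene sets.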
Notes
- Use only one probe per gene (collapsing the set of probes for a gene into a single gene-level value)
  - after differential expression analysis: collapsing probes by running GSEA CollapseDataset
  - before differential expression analysis: filtering out all probes except the one with the maximum MAD (median absolute deviation, defined as the median of the absolute deviations from the data's median)
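Both routes to one probe per gene can be sketched in a few lines. This is an illustration under stated assumptions, not the CollapseDataset implementation: the function names and the dictionary-based data layout are hypothetical, the maximum or median is assumed to be taken per sample across a gene's probes, and probes without a gene symbol are assumed to be dropped.

```python
from statistics import median


def collapse_probes(expression, probe_to_symbol, mode="max"):
    """Collapse probe-level rows into one row per gene symbol.

    expression:      dict of probe id -> list of per-sample values.
    probe_to_symbol: dict of probe id -> gene symbol.
    mode:            "max" or "median", applied per sample across probes.
    """
    reducer = {"max": max, "median": median}[mode]
    grouped = {}
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            continue  # assumption: unannotated probes are dropped
        grouped.setdefault(symbol, []).append(values)
    return {symbol: [reducer(column) for column in zip(*rows)]
            for symbol, rows in grouped.items()}


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    return median(abs(v - center) for v in values)


def pick_max_mad_probe(expression, probe_to_symbol):
    """For each gene symbol, keep the probe with the largest MAD."""
    best = {}
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            continue
        score = mad(values)
        if symbol not in best or score > best[symbol][1]:
            best[symbol] = (probe, score)
    return {symbol: probe for symbol, (probe, _) in best.items()}
```

`collapse_probes` builds a new gene-level value that may mix several probes, while `pick_max_mad_probe` keeps one original probe unchanged, which is why the latter suits filtering before differential expression analysis.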

References

Subramanian et al. (2005)
GSEA user guide
GSEA documentation

-  ⇤ ← Revision 5 as of 2014-11-27 18:50:05 → 
  Size: 2734
  Editor: ChangjiangXu
  Comment:
+   ← Revision 9 as of 2014-12-12 20:34:15 → ⇥
  Size: 5157
  Editor: ChangjiangXu
  Comment: organize the material
 Line 1:
+## page was renamed from CSCBiostatService/GSEA
-Line 30:
+Line 31:
+   * Significance level of a gene set (nominal p-value)  
   * Significance level for multiple gene sets (FWER and/or FDR)

== GSEA software ==
 * Starting GSEA using one of the multiple ways:
   * launch [[http://www.broadinstitute.org/gsea/downloads.jsp | javaGSEA desktop application]]
   * download and run [[http://www.broadinstitute.org/gsea/downloads.jsp | javaGSEA Java Jar file]] 
   * download and run [[http://www.broadinstitute.org/gsea/downloads.jsp | R-GSEA R Script]] to explore GSEA method
 * GSEA input data formats
 * GSEA analyses
   1. Running GSEA (gene set enrichment analysis)
   1. Running leading edge analysis
     * After running the gene set enrichment analysis, use the leading edge analysis to examine the genes that are in the leading-edge subsets of the enriched gene sets.
     * The leading-edge subset in a gene set are those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero.
     * A gene that is in many of the leading-edge subsets is more likely to be of interest than a gene that is in only a few of the leading-edge subsets.
   1. Running GSEAPreranked
     * Run GSEA on a pre-ranked gene list, created as,
       * converting the features (probe identifiers) to human gene symbols as long as there are no duplicate features in the list (one-to-one correspondence to human gene symbols) 
       * choosing the right ranking metric (make sure that the data do not include duplicate ranking values)
       * sorting the gene list in descending numerical order 
   1. Running CollapseDataset
     * collapse dataset from probes to symbols
       * creating a new dataset by collapsing all probe set values for a gene into a single value for the gene.
       * collapsing mode for the probe set: maximum or median expression value for the probe set
     * can be run as part of GSEA analyses
 * Viewing and interpreting GSEA Results
 * Notes
   * Use only one probe per gene (collapsing the set of probes for a gene into a single gene-level value)
     * after differential expression analysis: collapsing probes by running GSEA CollapseDataset
     * before differential expression analysis: filtering out all probes except the one with the maximum MAD (median absolute deviation, defined as the median of the absolute deviations from the data's median)