GSEA (Gene Set Enrichment Analysis)

GSEA goal

The description below is quoted directly from Subramanian et al. (2005):

The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the ranked gene list L, in which case the gene set is correlated with the phenotypic class distinction.
Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.

GSEA methods

Three key elements
1. Calculation of an enrichment score (ES)
  1. Walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not.
  2. The magnitude of the increment depends on the correlation of the gene with the phenotype (or absolute value of the ranking metric).
  3. The ES is the maximum deviation from zero encountered in the random walk.
2. Estimation of significance level of ES (nominal p-value)
3. Adjustment for multiple hypothesis testing (FDR)
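The running-sum calculation in element 1 can be sketched in Python as follows. This is a minimal illustration, not the GSEA software's implementation: the function name `enrichment_score` and its arguments are hypothetical, the gene list is assumed to be sorted by decreasing ranking metric already, and `p` is the exponent applied to the absolute ranking metric when weighting each hit (the weighting described in Subramanian et al., 2005).

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set (illustrative sketch).

    ranked_genes: gene identifiers, sorted by decreasing ranking metric.
    scores: ranking metric of each gene, in the same order.
    gene_set: identifiers belonging to the gene set S.
    p: exponent on |score| for hit steps; p=0 gives equal-sized steps.
    """
    members = set(gene_set)
    hits = [g in members for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("S must contain some, but not all, ranked genes")

    # Normalise so that all hit steps sum to +1 and all miss steps to -1.
    hit_total = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if hit_total == 0:
        raise ValueError("all genes in S have a zero ranking metric")
    miss_step = 1.0 / n_miss

    running, path = 0.0, []
    for s, h in zip(scores, hits):
        running += abs(s) ** p / hit_total if h else -miss_step
        path.append(running)

    # ES is the signed value of the walk where it is farthest from zero.
    es = max(path, key=abs)
    return es, path


# Toy example with made-up gene names and scores.
genes = ["A", "B", "C", "D", "E"]
metric = [5.0, 4.0, 3.0, 2.0, 1.0]
es, path = enrichment_score(genes, metric, {"A", "B"})
print(round(es, 3), [round(x, 3) for x in path])
```

Returning the whole running sum alongside the ES makes it easy to draw the familiar enrichment plot or to locate the peak position later.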
Mathematical description
- Enrichment score (ES)
  1. ES is the maximum deviation from zero of Phit – Pmiss, evaluated at each position i of the ranked gene list L.
    - Pmiss(i) is the empirical distribution function of the genes not in the gene set S: the fraction of those genes found at or before position i in L.
    - Phit(i) is the weighted cumulative distribution function of the genes in S: the sum of |r_j|^p over the genes of S found at or before position i, divided by the same sum over all genes in S, where r_j is the ranking metric of gene j and p is the weighting exponent.
  2. ES corresponds to a weighted Kolmogorov–Smirnov-like statistic.
    - When p = 0, ES reduces to the standard Kolmogorov–Smirnov statistic.
      - Phit is the (unweighted) empirical distribution function of the genes in S, extended into the ranked gene list L.
      - ES = sup{|Phit - Pmiss|}, used to test whether the two underlying probability distributions differ.
      - The null distribution of ES follows the Kolmogorov distribution.
    - When p = 1, the null distribution of ES is unknown and is estimated by a permutation approach.
- Significance level of a gene set (nominal p-value)
- Significance level for multiple gene sets (FWER and/or FDR)
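The permutation estimate of the nominal p-value can be sketched as below. This sketch rests on several assumptions: it shuffles gene-set membership across the ranked list (a gene-set permutation) rather than permuting phenotype labels, it compares the observed ES only against null scores of the same sign, and the helper names `running_sum_es` and `nominal_p_value` are hypothetical. The scores must already be sorted in descending order.

```python
import random


def running_sum_es(scores, in_set, p=1.0):
    """Signed maximum deviation from zero of Phit - Pmiss."""
    n_miss = len(scores) - sum(in_set)
    hit_total = sum(abs(s) ** p for s, h in zip(scores, in_set) if h)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("need at least one weighted hit and one miss")
    miss_step = 1.0 / n_miss
    running, best = 0.0, 0.0
    for s, h in zip(scores, in_set):
        running += abs(s) ** p / hit_total if h else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


def nominal_p_value(scores, in_set, n_perm=1000, p=1.0, seed=0):
    """Return (observed ES, permutation p-value) for one gene set."""
    rng = random.Random(seed)
    observed = running_sum_es(scores, in_set, p)
    labels = list(in_set)
    null = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # random gene set of the same size
        null.append(running_sum_es(scores, labels, p))
    # Use only the side of the null distribution matching the observed sign.
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan")
    extreme = sum(1 for es in same_sign if abs(es) >= abs(observed))
    return observed, extreme / len(same_sign)


# Toy example: 20 genes, the gene set occupies the top four positions.
scores = [float(x) for x in range(20, 0, -1)]
in_set = [i < 4 for i in range(20)]
print(nominal_p_value(scores, in_set, n_perm=500, seed=1))
```

Repeating this for many gene sets yields one nominal p-value per set, which is the input to the FWER or FDR adjustment listed in the last item.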

GSEA software

Start GSEA in one of the following ways:
- launch the javaGSEA desktop application
  - choose a memory setting of 2 GB or higher;
  - update to a compatible Java version if needed.
- download and run the javaGSEA Java JAR file
  - macOS: java -Xmx2G -jar ~/Documents/gsea2-2.2.0.jar
  - Windows: java -Xmx2G -jar gsea2-2.2.0.jar, or java -Xmx4G -jar gsea2-2.2.3.jar
- download and run the R-GSEA R script to explore the GSEA method
GSEA input data formats
- gene set files in GMT format (.gmt), downloaded from Enrichment Map Gene Sets
- rank file (.rnk), created from differential expression analysis
- ...
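To make the two formats listed above concrete, here is a small Python sketch that parses them from text. It assumes the usual tab-delimited layouts: a GMT line holds the gene set name, a description, then the member genes, and a rank file holds a gene identifier and a numeric score per line. Skipping lines that start with `#` is an assumption of this sketch, and the function names are hypothetical.

```python
def parse_gmt(text):
    """Map gene-set name -> set of genes from GMT-formatted text."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank lines or lines without any gene
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def parse_rnk(text):
    """Return [(gene, score), ...] from two-column, tab-separated text."""
    ranked = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        gene, score = line.split("\t")[:2]
        ranked.append((gene, float(score)))
    return ranked


# Made-up content, for illustration only.
gmt_text = "SET_A\tdemo set\tTP53\tBRCA1\nSET_B\tdemo set\tMYC\n"
rnk_text = "# gene\tscore\nTP53\t2.5\nMYC\t-1.0\n"
print(parse_gmt(gmt_text))
print(parse_rnk(rnk_text))
```

For files on disk, the same functions can be applied to the result of `open(path).read()`.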
GSEA analyses
1. Running GSEA (gene set enrichment analysis)
2. Running leading edge analysis
  - After running the gene set enrichment analysis, use the leading edge analysis to examine the genes that are in the leading-edge subsets of the enriched gene sets.
  - The leading-edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero.
  - A gene that is in many of the leading-edge subsets is more likely to be of interest than a gene that is in only a few of the leading-edge subsets.
3. Running GseaPreranked
  - load gene sets, i.e., a gmt file
  - load a pre-ranked gene list, created by
    - converting the features (probe identifiers) to human gene symbols, provided the list contains no duplicate features (each feature corresponds to exactly one human gene symbol)
    - choosing the right ranking metric (make sure that the data do not include duplicate ranking values)
    - sorting the gene list in descending numerical order
  - click GseaPreranked in the Tools tab
  - set the required fields and click Run
4. Running CollapseDataset
  - collapses a dataset from probes to gene symbols
    - creates a new dataset by collapsing all probe set values for a gene into a single value for that gene
    - collapsing mode: maximum or median expression value of the probe set
  - can be run as part of GSEA analyses
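The leading-edge definition given under step 2 can be sketched in Python as follows. The function names are hypothetical and the weighted running sum is recomputed inside the function so the example stands alone. The positive-ES branch follows the definition above (genes at or before the peak); taking the genes after the peak when the ES is negative is an assumption of this sketch, as is counting how often each gene appears across leading-edge subsets.

```python
from collections import Counter


def leading_edge(ranked_genes, scores, gene_set, p=1.0):
    """Genes of the set on the leading side of the running-sum peak."""
    members = set(gene_set)
    hits = [g in members for g in ranked_genes]
    hit_total = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    miss_step = 1.0 / (len(hits) - sum(hits))
    running, best, peak = 0.0, 0.0, -1
    for i, (s, h) in enumerate(zip(scores, hits)):
        running += abs(s) ** p / hit_total if h else -miss_step
        if abs(running) > abs(best):
            best, peak = running, i
    # Positive ES: genes up to the peak. Negative ES: genes after it.
    window = ranked_genes[: peak + 1] if best >= 0 else ranked_genes[peak + 1:]
    return [g for g in window if g in members]


def leading_edge_counts(ranked_genes, scores, gene_sets, p=1.0):
    """Count, per gene, the leading-edge subsets in which it appears."""
    counts = Counter()
    for genes in gene_sets.values():
        counts.update(leading_edge(ranked_genes, scores, genes, p))
    return counts


# Toy ranked list with made-up gene names and scores.
ranked = list("ABCDEFGH")
metric = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
sets = {"S1": {"A", "B", "G"}, "S2": {"A", "C"}}
print(leading_edge_counts(ranked, metric, sets).most_common())
```

Genes with the highest counts are the ones shared by many leading-edge subsets, which is the criterion for interest mentioned in step 2.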
Viewing and interpreting GSEA Results
Notes
- Use only one probe per gene (collapsing the set of probes for a gene into a single gene-level entry)
  - after differential expression analysis: collapse probes by running GSEA CollapseDataset
  - before differential expression analysis: filter out all probes for a gene except the one with the maximum MAD (median absolute deviation, defined as the median of the absolute deviations from the data's median)
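The MAD-based filter described in the last note can be sketched with the Python standard library. The data layout (a dictionary from probe identifier to expression values across samples, plus a probe-to-gene mapping) and the function names are assumptions made for illustration. Ties are resolved in favour of the probe encountered first.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def select_max_mad_probes(expression, probe_to_gene):
    """For each gene, keep the probe whose expression has the largest MAD.

    expression: probe id -> list of expression values across samples.
    probe_to_gene: probe id -> gene symbol (unmapped probes are dropped).
    """
    best = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        score = mad(values)
        if gene not in best or score > best[gene][1]:
            best[gene] = (probe, score)
    return {gene: probe for gene, (probe, _score) in best.items()}


# Made-up probes and values, for illustration only.
expression = {
    "probe_1": [1, 2, 3, 4, 5],
    "probe_2": [1, 1, 1, 1, 10],
    "probe_3": [0, 5, 10],
}
probe_to_gene = {"probe_1": "GENE_X", "probe_2": "GENE_X", "probe_3": "GENE_Y"}
print(select_max_mad_probes(expression, probe_to_gene))
```

In the toy data, `probe_2` has a single outlying value but a MAD of zero, which shows why MAD is a more robust variability measure than the range.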

Cytoscape

Download and install Cytoscape from http://www.cytoscape.org
Launch Cytoscape
- Install Apps: click on Apps - App Manager - install the EnrichmentMap, clusterMaker2, WordCloud and AutoAnnotate apps
- Create Enrichment map: click on Apps - EnrichmentMap - Create Enrichment map
  1. select Analysis Type “GSEA”
  2. load GMT file used in GSEA enrichment analysis
  3. load GSEA enrichment results from rpt file
  4. select parameters: e.g., using FDR
    - P-value Cutoff 1
    - Q-value Cutoff 0.05
    - Similarity Cutoff Jaccard Coefficient 0.25
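The Jaccard coefficient used for the similarity cutoff is the size of the intersection of two gene sets divided by the size of their union. The sketch below illustrates how a cutoff of 0.25 would decide which pairs of gene sets are connected. The function names are hypothetical, and treating a similarity equal to the cutoff as passing is an assumption of this sketch rather than documented EnrichmentMap behaviour.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two gene collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(gene_sets, cutoff=0.25):
    """Return (set1, set2, similarity) for pairs at or above the cutoff."""
    edges = []
    for name1, name2 in combinations(sorted(gene_sets), 2):
        sim = jaccard(gene_sets[name1], gene_sets[name2])
        if sim >= cutoff:
            edges.append((name1, name2, sim))
    return edges


# Made-up gene sets, for illustration only.
sets = {
    "S1": {"A", "B", "C"},
    "S2": {"B", "C", "D"},
    "S3": {"C", "E", "F", "G", "H"},
}
print(similarity_edges(sets))
```

Raising the cutoff keeps only strongly overlapping gene sets connected, which gives a sparser map with tighter clusters.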

ssGSEA

Single-sample GSEA (ssGSEA), an extension of Gene Set Enrichment Analysis (GSEA), calculates a separate enrichment score for each pairing of a sample and a gene set.

References

Subramanian et al. (2005)
GSEA user guide
GSEA documentation
Barbie et al. (2009) Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1, Nature. 2009;462(7269):108-112. doi:10.1038/nature08460 (ssGSEA)
Verhaak et al. (2013) Prognostically relevant gene signatures of high-grade serous ovarian carcinoma, J Clin Invest. 2013;123(1):517-525. doi:10.1172/JCI65833