GSEA (Gene Set Enrichment Analysis)
GSEA goal
The description below is quoted directly from Subramanian et al. (2005):
- The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the ranked gene list L, in which case the gene set is correlated with the phenotypic class distinction.
- Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.
GSEA methods
- Three key elements
- Calculation of an enrichment score (ES)
- Walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not.
- The magnitude of the increment depends on the correlation of the gene with the phenotype (or absolute value of the ranking metric).
- The ES is the maximum deviation from zero encountered in the random walk.
- Estimation of significance level of ES (nominal p-value)
- Adjustment for multiple hypothesis testing (FDR)
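The running-sum walk described in the first element can be sketched in a few lines of Python. This is a minimal illustration, not the GSEA implementation: the function name `enrichment_score`, its parameters, and the toy gene names are hypothetical, and the weight exponent `p` is assumed to act on the absolute value of the ranking metric.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set.

    ranked_genes: gene names sorted by ranking metric, descending.
    scores: ranking metric values in the same order as ranked_genes.
    gene_set: collection of gene names (the set S).
    p: weight exponent; p=0 gives every hit the same step size.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    # Step size for a hit: |ranking metric|^p, normalised so all hits sum to 1.
    hit_weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, in_set)]
    total_hit_weight = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("gene set needs weighted hits and at least one miss")

    running_sum = []
    current = 0.0
    for weight, hit in zip(hit_weights, in_set):
        if hit:
            current += weight / total_hit_weight  # increase on a hit
        else:
            current -= 1.0 / n_miss               # decrease on a miss
        running_sum.append(current)

    # ES is the signed value of the largest deviation from zero.
    es = max(running_sum, key=abs)
    return es, running_sum


genes = ["A", "B", "C", "D", "E", "F"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top, walk_top = enrichment_score(genes, scores, {"A", "B"})
es_bottom, walk_bottom = enrichment_score(genes, scores, {"E", "F"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list yields a positive ES, and a set concentrated at the bottom yields a negative ES.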
- Calculation of an enrichment score (ES)
- Mathematical description
- Enrichment score (ES)
- ES is the maximum deviation from zero of Phit – Pmiss.
- Pmiss is the empirical distribution function of the genes not in the gene set S, which is extended into the ranked gene list L.
- Phit is the cumulative distribution function of the genes in S, weighted by the absolute value of each gene's ranking metric raised to a power p, extended into the ranked gene list L.
- ES corresponds to a weighted Kolmogorov–Smirnov-like statistic
- When p = 0, ES reduces to the standard Kolmogorov–Smirnov statistic.
- In this case, Phit is the unweighted empirical distribution function of the genes in S, extended into the ranked gene list L.
- ES = sup{|Phit – Pmiss|}, used to test whether the two underlying probability distributions differ.
- Under the null hypothesis, the suitably scaled statistic asymptotically follows the Kolmogorov distribution.
- When p = 1, the null distribution of ES has no known closed form and is estimated by a permutation approach.
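For reference, the definitions above can be written out following the notation of Subramanian et al. (2005). The symbols r_j (ranking metric of gene g_j), N (number of genes in L), N_H (number of genes in S), N_R, and i* are not used elsewhere in these notes and are introduced here only for the formulas.

```latex
\begin{aligned}
P_{\mathrm{hit}}(S, i) &= \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^{p}}{N_R},
\qquad N_R = \sum_{g_j \in S} |r_j|^{p} \\
P_{\mathrm{miss}}(S, i) &= \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H} \\
ES(S) &= P_{\mathrm{hit}}(S, i^{*}) - P_{\mathrm{miss}}(S, i^{*}),
\qquad i^{*} = \arg\max_{i} \left| P_{\mathrm{hit}}(S, i) - P_{\mathrm{miss}}(S, i) \right|
\end{aligned}
```

With p = 0 every gene in S contributes 1/N_H to Phit, which is the unweighted empirical distribution function.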
- Significance level of a gene set (nominal p-value)
- Significance level for multiple gene sets (FWER and/or FDR)
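The permutation estimate of the nominal p-value can be sketched as follows. This is a simplified illustration under stated assumptions: it permutes gene set membership (random sets of the same size), whereas Subramanian et al. (2005) describe permuting phenotype labels, which requires the full expression matrix. The function names, the default of 1000 permutations, and the toy ranking are hypothetical. The p-value is taken from the part of the null distribution that has the same sign as the observed ES.

```python
import random


def enrichment_score(scores, in_set, p=1.0):
    """Signed maximum deviation of the running sum for one membership mask."""
    total = sum(abs(s) ** p for s, hit in zip(scores, in_set) if hit)
    n_miss = len(scores) - sum(in_set)
    if total == 0 or n_miss == 0:
        raise ValueError("gene set needs weighted hits and at least one miss")
    current, best = 0.0, 0.0
    for s, hit in zip(scores, in_set):
        current += abs(s) ** p / total if hit else -1.0 / n_miss
        if abs(current) > abs(best):
            best = current
    return best


def nominal_p_value(scores, set_size, observed_es, n_perm=1000, p=1.0, seed=0):
    """Estimate a nominal p-value from random gene sets of the same size."""
    rng = random.Random(seed)
    n = len(scores)
    null = []
    for _ in range(n_perm):
        members = set(rng.sample(range(n), set_size))
        null.append(enrichment_score(scores, [i in members for i in range(n)], p))
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [es for es in null if (es >= 0) == (observed_es >= 0)]
    if not same_sign:
        return None  # no usable null scores; increase n_perm
    extreme = sum(1 for es in same_sign if abs(es) >= abs(observed_es))
    return extreme / len(same_sign)


# Toy ranking: 200 genes with strictly decreasing scores.
scores = [100.0 - i for i in range(200)]
top_ten = [i < 10 for i in range(200)]
observed = enrichment_score(scores, top_ten)
p_value = nominal_p_value(scores, 10, observed, n_perm=200, seed=1)
print(observed, p_value)
```

An estimate of zero only means that no permutation reached the observed ES, so the p-value is below the resolution of the chosen number of permutations.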
GSEA software
- Start GSEA in one of the following ways:
launch the javaGSEA desktop application
- choose a memory setting of 2GB or higher;
- update to a compatible Java version if needed.
download and run the javaGSEA Java JAR file
- Mac: java -Xmx2G -jar ~/Documents/gsea2-2.2.0.jar
- Windows: java -Xmx2G -jar gsea2-2.2.0.jar, or java -Xmx4G -jar gsea2-2.2.3.jar
download and run the R-GSEA R script to explore the GSEA method
- GSEA input data formats
gene set files in GMT format (.gmt), downloaded from Enrichment Map Gene Sets
- rank file (.rnk), created from differential expression analysis
- ...
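To make the two input formats concrete, here is a minimal Python sketch that parses them. It assumes the usual tab-delimited layouts: each GMT line holds a gene set name, a description, and then the member genes; each rank file line holds a gene and its score. The function names and example gene sets are hypothetical, and skipping blank or `#`-prefixed lines in the rank file is an assumption of this sketch.

```python
def parse_gmt(text):
    """Map gene set name -> set of member genes from GMT-formatted text."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # need a name, a description, and at least one gene
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


def parse_rnk(text):
    """Return a list of (gene, score) pairs from rank-file text."""
    ranked = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # assumption: blank and comment lines are ignored
        gene, score = line.split("\t")[:2]
        ranked.append((gene, float(score)))
    return ranked


gmt_text = "PATHWAY_A\tdescription\tTP53\tBRCA1\nPATHWAY_B\tna\tEGFR\tKRAS\tTP53\n"
rnk_text = "TP53\t2.5\nEGFR\t-1.2\n"
gene_sets = parse_gmt(gmt_text)
ranked = parse_rnk(rnk_text)
print(sorted(gene_sets), ranked)
```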
- GSEA analyses
- Running GSEA (gene set enrichment analysis)
- Running leading edge analysis
- After running the gene set enrichment analysis, use the leading edge analysis to examine the genes that are in the leading-edge subsets of the enriched gene sets.
- The leading-edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero.
- A gene that is in many of the leading-edge subsets is more likely to be of interest than a gene that is in only a few of the leading-edge subsets.
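The leading-edge definition above can be sketched in Python as follows. The function name and toy data are hypothetical. For a negative ES the sketch takes the set members at or after the extreme point, which is how the GSEA user guide treats that case; the text above only describes the positive case. The final lines count how often each gene appears across leading-edge subsets.

```python
from collections import Counter


def leading_edge(ranked_genes, scores, gene_set, p=1.0):
    """Return the leading-edge genes of gene_set in ranked order."""
    in_set = [gene in gene_set for gene in ranked_genes]
    total = sum(abs(s) ** p for s, hit in zip(scores, in_set) if hit)
    n_miss = len(ranked_genes) - sum(in_set)
    if total == 0 or n_miss == 0:
        raise ValueError("gene set needs weighted hits and at least one miss")

    running, current = [], 0.0
    for s, hit in zip(scores, in_set):
        current += abs(s) ** p / total if hit else -1.0 / n_miss
        running.append(current)

    # Position of the maximum deviation from zero.
    peak = max(range(len(running)), key=lambda i: abs(running[i]))
    if running[peak] >= 0:
        candidates = ranked_genes[: peak + 1]  # at or before the peak
    else:
        candidates = ranked_genes[peak:]       # at or after the trough
    return [gene for gene in candidates if gene in gene_set]


genes = ["A", "B", "C", "D", "E", "F"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
sets = {"up1": {"A", "B", "F"}, "up2": {"A", "B", "C"}, "down": {"A", "E", "F"}}
edges = {name: leading_edge(genes, scores, members) for name, members in sets.items()}
counts = Counter(gene for subset in edges.values() for gene in subset)
print(edges, counts.most_common())
```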
Running GseaPreranked
- load gene sets, i.e., gmt file
- load a pre-ranked gene list, created by:
- converting the features (probe identifiers) to human gene symbols, making sure the list contains no duplicate features (one-to-one correspondence to human gene symbols)
- choosing an appropriate ranking metric (make sure that the data do not include duplicate ranking values)
- sorting the gene list in descending numerical order
click GseaPreranked in the Tools tab
- set the required fields and click Run
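The steps for creating the pre-ranked list can be sketched in Python. The function names and the example scores are hypothetical; the sketch simply enforces the two checks mentioned above (no duplicate gene symbols, no duplicate ranking values) and sorts in descending order before writing tab-delimited gene and score lines.

```python
def build_ranked_list(gene_scores):
    """Validate and sort (gene_symbol, score) pairs in descending score order."""
    pairs = list(gene_scores)
    genes = [gene for gene, _ in pairs]
    values = [score for _, score in pairs]
    if len(set(genes)) != len(genes):
        raise ValueError("duplicate gene symbols in ranked list")
    if len(set(values)) != len(values):
        raise ValueError("duplicate ranking values in ranked list")
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def to_rnk(ranked):
    """Format ranked pairs as tab-delimited text, one gene per line."""
    return "".join(f"{gene}\t{score}\n" for gene, score in ranked)


ranked = build_ranked_list([("EGFR", -1.2), ("TP53", 2.5), ("KRAS", 0.3)])
rnk_text = to_rnk(ranked)
print(rnk_text)
```

Raising an error on ties is a deliberate choice in this sketch: it forces the ranking metric to be revisited rather than silently ordering tied genes arbitrarily.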
Running CollapseDataset
- collapse dataset from probes to symbols
- creating a new dataset by collapsing all probe set values for a gene into a single value for the gene.
- collapsing mode for the probe set: maximum or median expression value for the probe set
- can be run as part of GSEA analyses
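The collapsing step can be illustrated with a short Python sketch. It is not the CollapseDataset implementation: the function name, the dictionary-based data layout, and the probe identifiers are hypothetical. The sketch assumes that, for each sample, the probe values belonging to one gene are reduced with either the maximum or the median.

```python
from statistics import median


def collapse_probes(expression, probe_to_gene, mode="max"):
    """Collapse probe-level rows into one row per gene.

    expression: dict probe_id -> list of values, one per sample.
    probe_to_gene: dict probe_id -> gene symbol.
    mode: "max" or "median", applied per sample across a gene's probes.
    """
    if mode not in ("max", "median"):
        raise ValueError("mode must be 'max' or 'median'")
    reducer = max if mode == "max" else median

    rows_by_gene = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a gene symbol are dropped
        rows_by_gene.setdefault(gene, []).append(values)

    # zip(*rows) iterates over samples, giving all probe values for that sample.
    return {
        gene: [reducer(sample_values) for sample_values in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


expression = {"p1": [1.0, 5.0], "p2": [3.0, 2.0], "p3": [4.0, 4.0]}
probe_to_gene = {"p1": "TP53", "p2": "TP53", "p3": "EGFR"}
by_max = collapse_probes(expression, probe_to_gene, mode="max")
by_median = collapse_probes(expression, probe_to_gene, mode="median")
print(by_max, by_median)
```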
- Viewing and interpreting GSEA Results
- Notes
- Use only one probe per gene (collapsing the set of probes for a gene into a single gene-level entry)
after differential expression analysis: collapse probes by running GSEA CollapseDataset
- before differential expression analysis: for each gene, keep only the probe with the maximum MAD (median absolute deviation, defined as the median of the absolute deviations from the data's median) and filter out the others
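The MAD-based choice of one probe per gene can be sketched in Python. The function names, data layout, and probe identifiers are hypothetical. MAD is computed exactly as defined above, and ties are resolved in favour of the first probe encountered, which is an arbitrary choice of this sketch.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    centre = median(values)
    return median(abs(value - centre) for value in values)


def select_max_mad_probe(expression, probe_to_gene):
    """Return dict gene -> probe_id with the largest MAD across samples."""
    best = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a gene symbol are ignored
        score = mad(values)
        if gene not in best or score > best[gene][1]:
            best[gene] = (probe, score)
    return {gene: probe for gene, (probe, _score) in best.items()}


expression = {
    "p1": [1.0, 2.0, 3.0, 4.0, 5.0],   # MAD = 1.0
    "p2": [1.0, 1.0, 1.0, 1.0, 10.0],  # MAD = 0.0
    "p3": [2.0, 4.0, 6.0, 8.0, 10.0],  # MAD = 2.0
}
probe_to_gene = {"p1": "TP53", "p2": "TP53", "p3": "EGFR"}
selected = select_max_mad_probe(expression, probe_to_gene)
print(selected)
```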
Cytoscape
Download and install Cytoscape from http://www.cytoscape.org
- Launch Cytoscape
Install Apps: click on Apps - App Manager - install the EnrichmentMap, clusterMaker2, WordCloud and AutoAnnotate apps
Create Enrichment Map: click on Apps - EnrichmentMap - Create Enrichment Map
- select Analysis Type “GSEA”
- load GMT file used in GSEA enrichment analysis
- load GSEA enrichment results from rpt file
- select parameters: e.g., using FDR
- P-value Cutoff 1
- Q-value Cutoff 0.05
- Similarity Cutoff Jaccard Coefficient 0.25
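To show what the similarity cutoff does, here is a minimal Python sketch of the Jaccard coefficient between gene sets and the resulting edge list. The function names and example gene sets are hypothetical, and treating the cutoff as inclusive (similarity >= 0.25) is an assumption of this sketch rather than a statement about the EnrichmentMap app.

```python
def jaccard(set_a, set_b):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(gene_sets, cutoff=0.25):
    """Return (set_1, set_2, similarity) for every pair passing the cutoff."""
    names = sorted(gene_sets)
    edges = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            similarity = jaccard(gene_sets[first], gene_sets[second])
            if similarity >= cutoff:  # assumption: cutoff is inclusive
                edges.append((first, second, similarity))
    return edges


gene_sets = {
    "SET_A": {"TP53", "BRCA1", "EGFR", "KRAS"},
    "SET_B": {"EGFR", "KRAS", "MYC", "PTEN"},
    "SET_C": {"GAPDH", "ACTB"},
}
edges = similarity_edges(gene_sets, cutoff=0.25)
print(edges)
```

SET_A and SET_B share two of six distinct genes, so they are connected; SET_C shares no genes with either and stays unconnected.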
ssGSEA
- Single-sample GSEA (ssGSEA) is an extension of Gene Set Enrichment Analysis (GSEA).
- It calculates a separate enrichment score for each pairing of a sample and gene set.
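The shape of the output, one score per pairing of a sample and gene set, can be illustrated with a short Python sketch. This is not the ssGSEA statistic of Barbie et al. (2009): as a stand-in, it reuses the GSEA running-sum score on each sample's own expression ranking. All function names, sample names, and values are hypothetical.

```python
def running_sum_es(ranked_genes, weights, gene_set):
    """Signed maximum deviation of a weighted running sum (stand-in score)."""
    in_set = [gene in gene_set for gene in ranked_genes]
    total = sum(w for w, hit in zip(weights, in_set) if hit)
    n_miss = len(ranked_genes) - sum(in_set)
    if total == 0 or n_miss == 0:
        raise ValueError("gene set needs weighted hits and at least one miss")
    current, best = 0.0, 0.0
    for w, hit in zip(weights, in_set):
        current += w / total if hit else -1.0 / n_miss
        if abs(current) > abs(best):
            best = current
    return best


def per_sample_scores(expression, gene_sets):
    """Return dict sample -> dict gene_set_name -> score.

    expression: dict sample -> dict gene -> expression value.
    """
    result = {}
    for sample, profile in expression.items():
        # Rank genes within this sample only, highest expression first.
        ranked = sorted(profile, key=profile.get, reverse=True)
        weights = [abs(profile[gene]) for gene in ranked]
        result[sample] = {
            name: running_sum_es(ranked, weights, members)
            for name, members in gene_sets.items()
        }
    return result


expression = {
    "sample_1": {"A": 9.0, "B": 7.0, "C": 2.0, "D": 1.0},
    "sample_2": {"A": 1.0, "B": 2.0, "C": 8.0, "D": 9.0},
}
gene_sets = {"AB": {"A", "B"}}
sample_scores = per_sample_scores(expression, gene_sets)
print(sample_scores)
```

Because each sample is ranked independently, the same gene set can score positive in one sample and negative in another.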
References
Subramanian et al. (2005)
GSEA user guide
GSEA documentation
ssGSEA: Barbie et al. (2009) Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1, Nature. 2009;462(7269):108-112. doi:10.1038/nature08460
Verhaak et al. (2013) Prognostically relevant gene signatures of high-grade serous ovarian carcinoma, J Clin Invest. 2013;123(1):517-525. doi:10.1172/JCI65833