GSEA (Gene Set Enrichment Analysis)
GSEA goal
The description below is quoted directly from Subramanian et al. (2005):
- The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the ranked gene list L, in which case the gene set is correlated with the phenotypic class distinction.
- Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.
GSEA methods
- Three key elements
  - Calculation of an enrichment score (ES)
    - Walk down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not.
    - The magnitude of the increment depends on the correlation of the gene with the phenotype (or the absolute value of the ranking metric).
    - The ES is the maximum deviation from zero encountered in the random walk.
  - Estimation of significance level of ES (nominal p-value)
  - Adjustment for multiple hypothesis testing (FDR)
- Mathematical description
  - Enrichment score (ES)
    - ES is the maximum deviation from zero of Phit – Pmiss, evaluated at each position of the ranked gene list L.
      - Pmiss is the empirical distribution function of the genes not in the gene set S, evaluated along L.
      - Phit is the cumulative distribution function of the genes in S, evaluated along L, with each gene weighted by the absolute value of its ranking metric raised to the power p.
    - ES corresponds to a weighted Kolmogorov–Smirnov-like statistic.
      - When p = 0, ES reduces to the standard Kolmogorov–Smirnov statistic.
        - Phit is the unweighted empirical distribution function of the genes in S, evaluated along L.
        - ES = sup{|Phit - Pmiss|}, used to test whether the two underlying probability distributions differ.
        - The null distribution of ES follows the Kolmogorov distribution (asymptotically, after scaling).
      - When p = 1, the null distribution of ES is unknown and is estimated by a permutation approach.
  - Significance level of a gene set (nominal p-value)
  - Significance level for multiple gene sets (FWER and/or FDR)
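The running-sum statistic and a permutation-based nominal p-value described above can be sketched in Python as follows. This is a minimal illustration, not the Broad Institute implementation: the function names, the `metric` dictionary, and the `(extreme + 1) / (n + 1)` smoothing are assumptions made for this example. It permutes gene set membership (as GSEAPreranked does) rather than phenotype labels, and compares the observed ES only against null scores of the same sign.

```python
import random

def enrichment_score(ranked_genes, metric, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set.

    ranked_genes: gene names sorted by ranking metric, top of list first.
    metric: dict mapping gene name -> ranking metric value.
    p: weight exponent; p=0 gives the unweighted Kolmogorov-Smirnov walk.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # Phit increments: |metric|^p for genes in the set, normalised to sum to 1.
    weights = [abs(metric[g]) ** p if h else 0.0
               for g, h in zip(ranked_genes, hits)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all genes in the set have zero weight")
    running, walk = 0.0, []
    for w, h in zip(weights, hits):
        # Step up on a hit (Phit), step down by 1/n_miss on a miss (Pmiss).
        running += w / total_weight if h else -1.0 / n_miss
        walk.append(running)
    es = max(walk, key=abs)  # maximum deviation from zero, sign preserved
    return es, walk

def nominal_p_value(ranked_genes, metric, gene_set, p=1.0, n_perm=1000, seed=0):
    """Estimate a nominal p-value by drawing random gene sets of the same size."""
    rng = random.Random(seed)
    observed, _walk = enrichment_score(ranked_genes, metric, gene_set, p)
    size = sum(g in gene_set for g in ranked_genes)
    null = []
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        null.append(enrichment_score(ranked_genes, metric, random_set, p)[0])
    # Use only the part of the null distribution with the same sign as the ES.
    same_sign = [e for e in null if (e >= 0) == (observed >= 0)]
    extreme = sum(abs(e) >= abs(observed) for e in same_sign)
    return (extreme + 1) / (len(same_sign) + 1)  # smoothing is an assumption
```

Calling `enrichment_score(genes, metric, gene_set, p=0)` gives the unweighted walk, so the same function covers both the p = 0 and p = 1 cases listed above.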
GSEA software
- Start GSEA in one of the following ways:
  - launch the javaGSEA desktop application (http://www.broadinstitute.org/gsea/downloads.jsp)
    - choose a memory setting of 2GB or higher;
    - update to a compatible Java version.
  - download and run the javaGSEA Java Jar file
    - Mac: java -Xmx2G -jar ~/Documents/gsea2-2.2.0.jar
    - Windows: java -Xmx2G -jar gsea2-2.2.0.jar, or java -Xmx4G -jar gsea2-2.2.3.jar
  - download and run the R-GSEA R Script to explore the GSEA method
 
- GSEA input data formats
  - gene set gmt files, downloaded from Enrichment Map Gene Sets (http://baderlab.org/GeneSets)
  - rank file, created from differential expression analysis
  - ...
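To make the two input formats above concrete, here is a minimal Python sketch that reads them. It assumes the usual tab-separated layouts: a gmt line holds a gene set name, a description, and then the member genes; a rank file line holds a gene and its score. The function names are hypothetical, and skipping lines that start with `#` in the rank file is an assumption of this example.

```python
def parse_gmt(lines):
    """Parse gmt lines into {gene_set_name: set_of_genes}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets

def parse_rnk(lines):
    """Parse rank-file lines into a list of (gene, score) pairs."""
    ranked = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: '#' marks a header or comment line
        gene, score = line.split("\t")[:2]
        ranked.append((gene, float(score)))
    return ranked
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, for example `parse_gmt(open(path))` with a path of your own.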
 
- GSEA analyses
  - Running GSEA (gene set enrichment analysis)
  - Running leading edge analysis
    - After running the gene set enrichment analysis, use the leading edge analysis to examine the genes that are in the leading-edge subsets of the enriched gene sets.
    - The leading-edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero.
    - A gene that is in many of the leading-edge subsets is more likely to be of interest than a gene that is in only a few of the leading-edge subsets.
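The leading-edge definition above can be sketched in Python as follows. The function names are hypothetical, and the running sum is taken as an input (one value per position of the ranked list). The text above describes the case of a positive enrichment score; for a negative score this sketch takes the set members that appear after the peak, which is an assumption based on the GSEA documentation rather than on this page.

```python
from collections import Counter

def leading_edge(ranked_genes, running_sum, gene_set):
    """Return the gene set members that make up the leading-edge subset."""
    # Position of the maximum deviation from zero.
    peak = max(range(len(running_sum)), key=lambda i: abs(running_sum[i]))
    if running_sum[peak] >= 0:
        window = ranked_genes[: peak + 1]  # at or before the peak
    else:
        window = ranked_genes[peak:]       # assumption: after a negative peak
    return [g for g in window if g in gene_set]

def leading_edge_counts(edges_by_set):
    """Count in how many leading-edge subsets each gene appears."""
    return Counter(g for edge in edges_by_set.values() for g in edge)
```

Genes with the highest counts from `leading_edge_counts` are the ones shared by many leading-edge subsets.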
 
  - Running GSEAPreranked
    - load gene sets, i.e., a gmt file
    - load a pre-ranked gene list, created by
      - converting the features (probe identifiers) to human gene symbols, provided the list contains no duplicate features (one-to-one correspondence with human gene symbols)
      - choosing the right ranking metric (make sure that the data do not include duplicate ranking values)
      - sorting the gene list in descending numerical order
    - click GseaPreranked in the Tools tab
    - set the required fields and click Run
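The steps for preparing a pre-ranked gene list can be sketched in Python as below. The function names are hypothetical. The sketch assumes the features have already been converted to gene symbols, then enforces the checks listed above (no duplicate genes, no duplicate ranking values) before sorting in descending order. Rejecting ties with an error is a choice made for this example; pass `allow_ties=True` to relax it.

```python
def build_ranked_list(pairs, allow_ties=False):
    """Validate (gene_symbol, score) pairs and sort them by descending score."""
    genes = [g for g, _ in pairs]
    if len(set(genes)) != len(genes):
        raise ValueError("duplicate gene symbols in the list")
    scores = [s for _, s in pairs]
    if not allow_ties and len(set(scores)) != len(scores):
        raise ValueError("duplicate ranking values in the list")
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def format_rnk(ranked):
    """Render the ranked list as tab-separated text, one gene per line."""
    return "".join("%s\t%s\n" % (gene, score) for gene, score in ranked)
```

The text returned by `format_rnk` can be written to a file and loaded in GSEA as the pre-ranked gene list.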
 
  - Running CollapseDataset
    - collapse dataset from probes to symbols
      - creates a new dataset by collapsing all probe set values for a gene into a single value for the gene.
      - collapsing mode for the probe set: maximum or median expression value for the probe set
    - can be run as part of GSEA analyses
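What collapsing does can be illustrated with a short Python sketch. This is not the CollapseDataset tool itself: the function name and the dictionary-based data layout are assumptions for this example, and the maximum or median is taken separately for each sample across the probes of a gene.

```python
from statistics import median

def collapse_probes(expression, probe_to_gene, mode="max"):
    """Collapse probe-level values to one value per gene and sample.

    expression: dict mapping probe id -> list of values, one per sample.
    probe_to_gene: dict mapping probe id -> gene symbol.
    mode: "max" or "median".
    """
    if mode not in ("max", "median"):
        raise ValueError("mode must be 'max' or 'median'")
    reducer = max if mode == "max" else median
    grouped = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a gene symbol are dropped
        grouped.setdefault(gene, []).append(values)
    # zip(*rows) walks the samples; each column holds one value per probe.
    return {gene: [reducer(column) for column in zip(*rows)]
            for gene, rows in grouped.items()}
```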
 
- Viewing and interpreting GSEA Results
- Notes
  - Use only one probe per gene (collapsing a set of probes for a gene into a single value for that gene)
    - after differential expression analysis: collapse probes by running GSEA CollapseDataset
    - before differential expression analysis: filter out all probes except the one with the maximum MAD (median absolute deviation, defined as the median of the absolute deviations from the data's median)
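The MAD-based filter mentioned in the notes can be sketched in Python as follows. The function names and the dictionary layout are assumptions for this example; when two probes of a gene have the same MAD, the first one encountered is kept.

```python
from statistics import median

def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    return median(abs(v - center) for v in values)

def select_max_mad_probe(expression, probe_to_gene):
    """For each gene, keep the probe whose values have the largest MAD.

    expression: dict mapping probe id -> list of values across samples.
    probe_to_gene: dict mapping probe id -> gene symbol.
    Returns a dict mapping gene symbol -> selected probe id.
    """
    best = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a gene symbol are ignored
        score = mad(values)
        if gene not in best or score > best[gene][1]:
            best[gene] = (probe, score)
    return {gene: probe for gene, (probe, _score) in best.items()}
```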
ssGSEA
- Single-sample GSEA (ssGSEA), an extension of Gene Set Enrichment Analysis (GSEA)
  - calculates separate enrichment scores for each pairing of a sample and gene set.
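The idea of one score per sample and gene set pairing can be illustrated with the Python sketch below. It is a simplified reading of the approach in Barbie et al. (2009), not the reference implementation: genes are ranked by expression within a single sample, hits are weighted by rank raised to a power `alpha`, and the score sums the difference between the two cumulative distributions along the list instead of taking its maximum. The function names and the default `alpha=0.25` are assumptions of this example.

```python
def ssgsea_score(sample_expression, gene_set, alpha=0.25):
    """Score one gene set in one sample.

    sample_expression: dict mapping gene -> expression value in ONE sample.
    """
    ranked = sorted(sample_expression, key=sample_expression.get, reverse=True)
    n = len(ranked)
    rank = {g: n - i for i, g in enumerate(ranked)}  # highest expression -> n
    hits = [g in gene_set for g in ranked]
    n_hit = sum(hits)
    n_miss = n - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, genes")
    total = sum(rank[g] ** alpha for g, h in zip(ranked, hits) if h)
    p_hit = p_miss = score = 0.0
    for g, h in zip(ranked, hits):
        if h:
            p_hit += rank[g] ** alpha / total
        else:
            p_miss += 1.0 / n_miss
        score += p_hit - p_miss  # accumulate the difference along the list
    return score

def ssgsea(expression_by_sample, gene_sets, alpha=0.25):
    """Return {sample: {gene_set_name: score}} for every pairing."""
    return {sample: {name: ssgsea_score(expr, members, alpha)
                     for name, members in gene_sets.items()}
            for sample, expr in expression_by_sample.items()}
```

The result has one entry per sample and, within it, one score per gene set, which is the sample-by-gene-set layout that ssGSEA produces.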
References
Subramanian et al. (2005)
GSEA user guide
GSEA documentation
ssGSEA: Barbie et al. (2009) Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1, Nature. 2009 Nov 5;462(7269):108-12. doi: 10.1038/nature08460. Epub 2009 Oct 21. http://www.ncbi.nlm.nih.gov/pubmed/19847166
Verhaak et al. (2013) Prognostically relevant gene signatures of high-grade serous ovarian carcinoma, J Clin Invest. 2013;123(1):517-525. doi:10.1172/JCI65833 http://www.jci.org/articles/view/65833
 
