GSEA (Gene Set Enrichment Analysis)
GSEA goal
The description below is quoted directly from Subramanian et al. (2005):
- The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the ranked gene list L, in which case the gene set is correlated with the phenotypic class distinction.
- Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.
GSEA methods
- Three key elements
  - Calculation of an enrichment score (ES)
    - Walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not.
    - The magnitude of the increment depends on the correlation of the gene with the phenotype (or absolute value of the ranking metric).
    - The ES is the maximum deviation from zero encountered in the random walk.
  - Estimation of significance level of ES (nominal p-value)
  - Adjustment for multiple hypothesis testing (FDR)
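The running-sum walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the Broad Institute implementation: the function name `enrichment_score`, its arguments, and the toy gene names are assumptions made for the example. The exponent `p` weights each hit by the absolute value of its ranking metric.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set.

    ranked_genes: gene identifiers, already sorted by descending ranking metric.
    scores: ranking metric for each gene, in the same order.
    gene_set: set of gene identifiers (the set S).
    p: weighting exponent; p = 0 gives every hit the same increment.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = hits.count(False)
    # Increment for a hit is |score|^p, normalised so that all hits sum to 1.
    hit_weights = [abs(s) ** p if h else 0.0 for s, h in zip(scores, hits)]
    total_hit = sum(hit_weights)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, current = [], 0.0
    for w, h in zip(hit_weights, hits):
        # Step up on a hit, step down by a constant amount on a miss.
        current += w / total_hit if h else -1.0 / n_miss
        running.append(current)

    # ES is the value of the walk at its maximum deviation from zero.
    es = max(running, key=abs)
    return es, running


# Toy example with hypothetical gene names.
genes = ["A", "B", "C", "D", "E", "F"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top, walk = enrichment_score(genes, metric, {"A", "B"})
es_bottom, _ = enrichment_score(genes, metric, {"E", "F"})
```

A set concentrated at the top of the list yields a positive ES, and a set concentrated at the bottom yields a negative ES; the walk always returns to zero at the end of the list.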
- Mathematical description
  - Enrichment score (ES)
    - ES is the maximum deviation from zero of Phit – Pmiss.
      - Pmiss is the empirical distribution function of the genes not in the gene set S, evaluated along the ranked gene list L.
      - Phit is the cumulative distribution function of the genes in S, with each gene weighted by the absolute value of its ranking metric raised to the power p, evaluated along the ranked gene list L.
    - ES corresponds to a weighted Kolmogorov–Smirnov-like statistic.
      - When p = 0, ES reduces to the standard Kolmogorov–Smirnov statistic.
        - Phit is the empirical distribution function of the genes in S, evaluated along the ranked gene list L.
        - ES = sup{|Phit - Pmiss|}, used to test whether the two underlying probability distributions differ.
        - Under the null hypothesis, the suitably scaled ES asymptotically follows the Kolmogorov distribution.
      - When p = 1, the null distribution of ES is not known in closed form and is estimated by a permutation approach.
  - Significance level of a gene set (nominal p-value)
  - Significance level for multiple gene sets (FWER and/or FDR)
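As a rough illustration of the permutation approach mentioned above, the sketch below estimates a nominal p-value by drawing random gene sets of the same size from the ranked list. This is an assumption made for the example: it is only one possible permutation scheme (permuting gene sets rather than phenotype labels), and the function names, the default of 1000 permutations, and the toy data are hypothetical. The p-value is taken from the part of the null distribution that has the same sign as the observed ES.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted running-sum ES (maximum deviation from zero)."""
    hits = [g in gene_set for g in ranked_genes]
    weights = [abs(s) ** p if h else 0.0 for s, h in zip(scores, hits)]
    total, n_miss = sum(weights), hits.count(False)
    if total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")
    current, best = 0.0, 0.0
    for w, h in zip(weights, hits):
        current += w / total if h else -1.0 / n_miss
        if abs(current) > abs(best):
            best = current
    return best


def nominal_p_value(ranked_genes, scores, gene_set, n_perm=1000, p=1.0, seed=0):
    """Return (observed ES, nominal p-value) using random gene sets as the null."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set, p)
    size = sum(g in gene_set for g in ranked_genes)
    null = [
        enrichment_score(ranked_genes, scores, set(rng.sample(ranked_genes, size)), p)
        for _ in range(n_perm)
    ]
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [x for x in null if (x >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, 0.0  # no same-sign null scores were drawn
    extreme = sum(abs(x) >= abs(observed) for x in same_sign)
    return observed, extreme / len(same_sign)


# Toy example: 20 hypothetical genes, the set sits entirely at the top of the list.
genes = [f"g{i}" for i in range(20)]
metric = [10.0 - i for i in range(20)]
obs_es, p_val = nominal_p_value(genes, metric, {"g0", "g1", "g2"}, n_perm=200, seed=1)
```

With many gene sets, the nominal p-values would still need an FWER or FDR adjustment, which this sketch does not attempt.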
GSEA software
- Start GSEA in one of the following ways:
  - Launch the javaGSEA desktop application.
    - Choose the 2GB (or higher) memory option.
    - Update to a compatible Java version.
  - Download and run the javaGSEA Java Jar file.
    - Mac: java -Xmx2G -jar ~/Documents/gsea2-2.2.0.jar
    - Windows: java -Xmx2G -jar gsea2-2.2.0.jar, or java -Xmx4G -jar gsea2-2.2.3.jar
  - Download and run the R-GSEA R script to explore the GSEA method.

- GSEA input data formats
  - Gene sets: gmt files, downloaded from Enrichment Map Gene Sets
  - Rank file, created from differential expression analysis
  - ...
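For orientation, the two input files listed above can be read with a few lines of Python. This is a sketch under stated assumptions: a gmt file is taken to be tab-delimited with the set name, a description, and then the member genes on each line; a rank file is taken to be tab-delimited with a gene and a numeric score per line, no header row, and `#` marking comment lines. The function names and the sample contents are hypothetical.

```python
import io


def read_gmt(handle):
    """Parse gmt lines into {set_name: set_of_genes}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def read_rnk(handle):
    """Parse rank-file lines into [(gene, score)], sorted by descending score."""
    ranked = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        gene, score = line.split("\t")[:2]
        ranked.append((gene, float(score)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# In-memory stand-ins for real files (hypothetical contents).
gmt_text = "SET_A\tdescription\tTP53\tBRCA1\nSET_B\tna\tMYC\n"
rnk_text = "MYC\t-1.5\nTP53\t2.0\nBRCA1\t0.3\n"

gene_sets = read_gmt(io.StringIO(gmt_text))
ranked = read_rnk(io.StringIO(rnk_text))
```

With real files, the `io.StringIO` objects would be replaced by `open(path)` handles.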
 
- GSEA analyses
  - Running GSEA (gene set enrichment analysis)
  - Running leading edge analysis
    - After running the gene set enrichment analysis, use the leading edge analysis to examine the genes that are in the leading-edge subsets of the enriched gene sets.
    - The leading-edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero.
    - A gene that is in many of the leading-edge subsets is more likely to be of interest than a gene that is in only a few of the leading-edge subsets.
  - Running GSEAPreranked
    - Load gene sets, i.e., a gmt file.
    - Load a pre-ranked gene list, created by:
      - converting the features (probe identifiers) to human gene symbols, provided there are no duplicate features in the list (one-to-one correspondence to human gene symbols)
      - choosing the right ranking metric (make sure that the data do not include duplicate ranking values)
      - sorting the gene list in descending numerical order
    - Click GseaPreranked in the Tools tab.
    - Set the required fields and click Run.
  - Running CollapseDataset
    - Collapse the dataset from probes to symbols.
      - This creates a new dataset by collapsing all probe set values for a gene into a single value for the gene.
      - Collapsing mode for the probe set: maximum or median expression value for the probe set
    - Can be run as part of GSEA analyses.
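The leading-edge definition given above can be illustrated with a short Python sketch. The function name and toy data are hypothetical. One assumption goes beyond the text above: for a gene set with a negative ES, the sketch takes the set members at or after the peak rather than before it, which mirrors the definition by symmetry.

```python
def leading_edge(ranked_genes, scores, gene_set, p=1.0):
    """Return the leading-edge members of gene_set, in ranked order."""
    hits = [g in gene_set for g in ranked_genes]
    weights = [abs(s) ** p if h else 0.0 for s, h in zip(scores, hits)]
    total, n_miss = sum(weights), hits.count(False)
    if total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    # Rebuild the running sum so the position of the peak is known.
    running, current = [], 0.0
    for w, h in zip(weights, hits):
        current += w / total if h else -1.0 / n_miss
        running.append(current)
    peak = max(range(len(running)), key=lambda i: abs(running[i]))

    if running[peak] >= 0:
        # Positive ES: set members at or before the peak.
        return [g for g, h in zip(ranked_genes[: peak + 1], hits[: peak + 1]) if h]
    # Negative ES (assumption): set members at or after the peak.
    return [g for g, h in zip(ranked_genes[peak:], hits[peak:]) if h]


# Toy example with hypothetical gene names.
genes = ["A", "B", "C", "D", "E", "F"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
edge_up = leading_edge(genes, metric, {"A", "B", "E"})
edge_down = leading_edge(genes, metric, {"B", "E", "F"})
```

Counting how often each gene appears across the leading-edge subsets of several enriched sets then gives the kind of overlap summary described above.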
 
- Viewing and interpreting GSEA results
- Notes
  - Use only one probe per gene (collapsing the set of probes for a gene into a single gene-level entry).
    - After differential expression analysis: collapse probes by running GSEA CollapseDataset.
    - Before differential expression analysis: filter out all probes for a gene except the one with the maximum MAD (median absolute deviation, defined as the median of the absolute deviations from the data's median).
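The MAD-based filtering mentioned in the notes can be sketched as follows. This is an illustrative assumption of how such a filter might look, not the procedure used by any particular tool; the function names, the probe identifiers, and the expression values are hypothetical.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def keep_max_mad_probe(expression, probe_to_gene):
    """Pick, for each gene, the probe with the largest MAD across samples.

    expression: {probe_id: [expression values across samples]}
    probe_to_gene: {probe_id: gene_symbol}; unmapped probes are dropped.
    Returns {gene_symbol: probe_id}.
    """
    best = {}
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        score = mad(values)
        # Keep the first probe seen on ties.
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, probe)
    return {gene: probe for gene, (_score, probe) in best.items()}


# Toy example with hypothetical probes and values.
expression = {
    "p1": [1.0, 1.0, 1.0, 1.0],
    "p2": [1.0, 5.0, 2.0, 9.0],
    "p3": [3.0, 4.0, 5.0, 6.0],
}
probe_to_gene = {"p1": "G1", "p2": "G1", "p3": "G2"}
selected = keep_max_mad_probe(expression, probe_to_gene)
```

Here the flat probe `p1` is discarded in favour of the more variable `p2` for gene `G1`, so each gene enters the differential expression analysis with a single probe.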
ssGSEA
- Single-sample GSEA (ssGSEA) is an extension of Gene Set Enrichment Analysis (GSEA).
- It calculates a separate enrichment score for each pairing of a sample and a gene set.
References
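Subramanian et al. (2005): http://www.pnas.org/content/102/43/15545.full
GSEA user guide: http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html
GSEA documentation: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page
ssGSEA: Barbie et al. (2009) Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1, Nature. 2009 Nov 5;462(7269):108-12. doi: 10.1038/nature08460. Epub 2009 Oct 21. http://www.ncbi.nlm.nih.gov/pubmed/19847166
Verhaak et al. (2013) Prognostically relevant gene signatures of high-grade serous ovarian carcinoma, J Clin Invest. 2013;123(1):517-525. doi:10.1172/JCI65833. http://www.jci.org/articles/view/65833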
 
