A Summary of GSEA (Gene Set Enrichment Analysis)
GSEA goal
The description below is quoted directly from Subramanian et al. (2005):
- The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the ranked gene list L, in which case the gene set is correlated with the phenotypic class distinction.
- Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.
GSEA methods
- Three key elements
- Calculation of an enrichment score (ES)
- Walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not.
- The magnitude of the increment depends on the correlation of the gene with the phenotype (or absolute value of the ranking metric).
- The ES is the maximum deviation from zero encountered in the random walk.
- Estimation of significance level of ES (nominal p-value)
- Adjustment for multiple hypothesis testing (FDR)
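The running-sum calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the GSEA implementation: the function name `enrichment_score` and its arguments are hypothetical, and the exponent `p` (weight on the ranking metric) is assumed to default to 1.

```python
def enrichment_score(ranked_genes, metric, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set.

    ranked_genes: gene identifiers, already sorted by descending metric.
    metric:       ranking metric values, in the same order as ranked_genes.
    gene_set:     set of gene identifiers (the set S).
    p:            exponent applied to |metric| for hits (assumed default 1).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = hits.count(False)
    # Normalizing constant: sum of |metric|^p over the genes in the set.
    norm = sum(abs(r) ** p for r, h in zip(metric, hits) if h)
    if n_miss == 0 or norm == 0:
        raise ValueError("need at least one miss and a non-zero hit weight")

    running_sum = []
    total = 0.0
    for r, h in zip(metric, hits):
        # Increase on a hit (weighted by the metric), decrease on a miss.
        total += abs(r) ** p / norm if h else -1.0 / n_miss
        running_sum.append(total)

    # ES is the maximum deviation from zero, with its sign kept.
    es = max(running_sum, key=abs)
    return es, running_sum
```

For four genes with metrics 4, 3, 2, 1 where the set contains the first two genes, the running sum reaches 1 after the second gene and returns to 0 at the end of the list, so the ES is 1.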
- Mathematical description
- Enrichment score (ES)
- ES is the maximum deviation from zero of Phit – Pmiss.
- Pmiss is the empirical distribution function of the genes not in the gene set S, which is extended into the ranked gene list L.
- Phit is the weighted cumulative distribution function of the genes in S, with each gene weighted by the absolute value of its ranking metric raised to the power p, extended into the ranked gene list L.
- ES corresponds to a weighted Kolmogorov–Smirnov-like statistic
- When p = 0, ES reduces to the standard Kolmogorov–Smirnov statistic.
- Phit is the empirical distribution function of the genes in S, extended into the ranked gene list L.
- ES = sup{|Phit - Pmiss|}, used to test whether the two underlying probability distributions differ.
- Under the null hypothesis, the (suitably scaled) ES asymptotically follows the Kolmogorov distribution.
- When p = 1, the null distribution of ES is unknown and is estimated by permutation.
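Written out, the quantities above take the following form. The notation follows Subramanian et al. (2005) and is not defined elsewhere on this page: N is the number of genes in L, N_H the number of genes in S, r_j the ranking metric of gene g_j, and i a position in L.

```latex
\begin{align}
P_{\mathrm{hit}}(S, i)  &= \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^{p}}{N_R},
  \qquad N_R = \sum_{g_j \in S} |r_j|^{p} \\
P_{\mathrm{miss}}(S, i) &= \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H} \\
ES(S) &= P_{\mathrm{hit}}(S, i^{*}) - P_{\mathrm{miss}}(S, i^{*}),
  \qquad i^{*} = \arg\max_{i} \left| P_{\mathrm{hit}}(S, i) - P_{\mathrm{miss}}(S, i) \right|
\end{align}
```

With p = 0 every hit contributes 1/N_H, so Phit becomes the ordinary empirical distribution function and the statistic is the standard Kolmogorov–Smirnov form.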
- Significance level of a gene set (nominal p-value)
- Significance level for multiple gene sets (FWER and/or FDR)
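A permutation estimate of the nominal p-value might look like the sketch below. It rests on two assumptions. First, it shuffles gene-set membership along the ranked list, which needs only the ranked list; Subramanian et al. (2005) permute phenotype labels instead, which requires the full expression data. Second, it compares the observed ES only against null scores of the same sign. The function names and the `n_perm` and `seed` parameters are hypothetical.

```python
import random


def enrichment_score(metric, in_set, p=1.0):
    """ES for a list ranked by descending metric; in_set[i] flags set membership."""
    n_miss = in_set.count(False)
    norm = sum(abs(r) ** p for r, h in zip(metric, in_set) if h)
    total, best = 0.0, 0.0
    for r, h in zip(metric, in_set):
        total += abs(r) ** p / norm if h else -1.0 / n_miss
        if abs(total) > abs(best):
            best = total
    return best


def nominal_p_value(metric, in_set, n_perm=1000, p=1.0, seed=0):
    """Return (observed ES, nominal p-value) using shuffled set membership."""
    rng = random.Random(seed)
    observed = enrichment_score(metric, in_set, p)
    flags = list(in_set)
    same_sign = 0
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(flags)  # random gene set of the same size (assumption)
        null_es = enrichment_score(metric, flags, p)
        # Use only the part of the null distribution with the observed sign.
        if (null_es >= 0) == (observed >= 0):
            same_sign += 1
            if abs(null_es) >= abs(observed):
                as_extreme += 1
    p_value = as_extreme / same_sign if same_sign else float("nan")
    return observed, p_value
```

A gene set whose members all sit at the top of the list should receive an ES close to 1 and a small nominal p-value. Correcting for multiple gene sets (FWER or FDR) is a separate step and is not covered by this sketch.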
GSEA software
- Start GSEA in one of the following ways (all available from http://www.broadinstitute.org/gsea/downloads.jsp):
- launch the javaGSEA desktop application
- download and run the javaGSEA Java Jar file
- download and run the R-GSEA R script to explore the GSEA method
- GSEA input data formats
- GSEA analyses
- Running GSEA (gene set enrichment analysis)
- Running leading edge analysis
- After running the gene set enrichment analysis, use the leading edge analysis to examine the genes that are in the leading-edge subsets of the enriched gene sets.
- The leading-edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero.
- A gene that is in many of the leading-edge subsets is more likely to be of interest than a gene that is in only a few of the leading-edge subsets.
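The leading-edge definition above can be illustrated with a short Python sketch. The function name `leading_edge` is hypothetical. For a positive ES it follows the definition given here. For a negative ES it takes the set members after the peak, which is an assumption about the usual convention rather than something stated on this page.

```python
def leading_edge(ranked_genes, metric, gene_set, p=1.0):
    """Return the leading-edge subset of gene_set for a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = hits.count(False)
    norm = sum(abs(r) ** p for r, h in zip(metric, hits) if h)

    total, es, peak = 0.0, 0.0, -1
    for i, (r, h) in enumerate(zip(metric, hits)):
        total += abs(r) ** p / norm if h else -1.0 / n_miss
        if abs(total) > abs(es):
            es, peak = total, i  # position of the maximum deviation so far

    if es >= 0:
        # Set members at or before the peak of the running sum.
        return [g for g, h in zip(ranked_genes[:peak + 1], hits[:peak + 1]) if h]
    # Negative ES (assumed convention): set members after the peak.
    return [g for g, h in zip(ranked_genes[peak + 1:], hits[peak + 1:]) if h]
```

Counting how often each gene appears across the leading-edge subsets of several enriched gene sets then gives a simple way to find the genes shared by many of them.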
- Running GSEAPreranked
- Run GSEA on a pre-ranked gene list, created by:
- converting the features (probe identifiers) to human gene symbols as long as there are no duplicate features in the list (one-to-one correspondence to human gene symbols)
- choosing the right ranking metric (make sure that the data do not include duplicate ranking values)
- sorting the gene list in descending numerical order
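The three preparation steps for a pre-ranked list can be sketched as follows. The function name `build_ranked_list` and the dictionary inputs are hypothetical. Dropping probes that have no gene symbol is an assumption, as is raising an error (rather than resolving the conflict) when duplicate symbols or duplicate ranking values are found.

```python
def build_ranked_list(probe_scores, probe_to_symbol):
    """Return [(gene_symbol, score), ...] sorted by descending score.

    probe_scores:    dict mapping probe identifier -> ranking metric.
    probe_to_symbol: dict mapping probe identifier -> human gene symbol.
    """
    symbol_scores = {}
    for probe, score in probe_scores.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            continue  # unmapped probes are dropped (assumption)
        if symbol in symbol_scores:
            # The mapping must be one-to-one; collapse probes first otherwise.
            raise ValueError(f"duplicate gene symbol: {symbol}")
        symbol_scores[symbol] = score

    values = list(symbol_scores.values())
    if len(set(values)) != len(values):
        raise ValueError("duplicate ranking values in the list")

    # Sort in descending numerical order of the ranking metric.
    return sorted(symbol_scores.items(), key=lambda item: item[1], reverse=True)
```

Failing loudly on duplicates keeps the decision about which probe or value to keep with the analyst instead of hiding it in the sort order.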
- Running CollapseDataset
- collapse dataset from probes to symbols
- creating a new dataset by collapsing all probe set values for a gene into a single value for the gene.
- collapsing mode for the probe set: maximum or median expression value for the probe set
- can be run as part of GSEA analyses
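Collapsing probes to gene symbols with the maximum or median mode can be illustrated like this. It is a sketch of the idea, not the CollapseDataset tool itself: the function name `collapse_dataset`, the `mode` values "max" and "median", and the choice to apply the reduction separately for each sample are all assumptions.

```python
from statistics import median


def collapse_dataset(expression, probe_to_symbol, mode="max"):
    """Collapse probe-level rows into one row per gene symbol.

    expression:      dict mapping probe identifier -> list of per-sample values.
    probe_to_symbol: dict mapping probe identifier -> gene symbol.
    mode:            "max" or "median" (hypothetical option names).
    """
    if mode not in ("max", "median"):
        raise ValueError("mode must be 'max' or 'median'")
    reducer = max if mode == "max" else median

    # Group the probe rows that map to the same gene symbol.
    grouped = {}
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is not None:
            grouped.setdefault(symbol, []).append(values)

    # For each sample (column), reduce the probe values to a single value.
    return {
        symbol: [reducer(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }
```

A gene measured by a single probe passes through unchanged in either mode.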
- Viewing and interpreting GSEA Results
- Notes
- Use only one probe per gene (collapsing the set of probes for a gene into a single gene)
- after differential expression analysis: collapse probes by running GSEA CollapseDataset
- before differential expression analysis: filter out all probes except the one with the maximum MAD (median absolute deviation, defined as the median of the absolute deviations from the data's median)
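The MAD-based filter mentioned in the notes can be sketched as below. The function names `mad` and `keep_max_mad_probe` are hypothetical, and keeping the first probe encountered when two probes tie is an assumption.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    return median([abs(v - center) for v in values])


def keep_max_mad_probe(expression, probe_to_symbol):
    """Return dict gene symbol -> the probe with the largest MAD for that gene.

    expression:      dict mapping probe identifier -> list of per-sample values.
    probe_to_symbol: dict mapping probe identifier -> gene symbol.
    """
    best = {}  # symbol -> (MAD, probe)
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            continue  # probes without a gene symbol are ignored (assumption)
        score = mad(values)
        # Strict comparison keeps the first probe seen on ties (assumption).
        if symbol not in best or score > best[symbol][0]:
            best[symbol] = (score, probe)
    return {symbol: probe for symbol, (_, probe) in best.items()}
```

Because the MAD is computed from the expression values alone, this filter does not use the phenotype labels and can be applied before the differential expression analysis.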
References
Subramanian et al. (2005)
GSEA user guide
GSEA documentation