GSEA: Gene Set Enrichment Analysis (www.broadinstitute.org/gsea)

Summary

Introduction

Method

GSEA first ranks the genes based on a measure of each gene's differential expression with respect to the two phenotypes (for example, tumor versus normal using a t-test) or correlation with a continuous phenotype. Then the entire ranked list is used to assess how the genes of each gene set are distributed across the ranked list. To do this, GSEA walks down the ranked list of genes, increasing a running-sum statistic when a gene belongs to the set and decreasing it when the gene does not.
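
The ranking step can be sketched as follows. This is only a minimal illustration, assuming a Welch t-statistic as the ranking metric and a tiny made-up expression table; the gene names, the values and the helper names `welch_t` and `rank_genes` are hypothetical and are not part of GSEA.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for two groups of expression values.

    Assumes each group has at least two samples and non-zero variance overall.
    """
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expression, labels, group_a, group_b):
    """Score every gene, then sort from 'highest in group_a' to 'highest in group_b'."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == group_a]
        b = [v for v, lab in zip(values, labels) if lab == group_b]
        scores[gene] = welch_t(a, b)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


# Hypothetical toy data: three tumor and three normal samples.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "GENE_A": [9.0, 8.5, 9.5, 5.0, 5.5, 4.5],  # higher in tumor
    "GENE_B": [5.0, 5.2, 4.8, 5.1, 4.9, 5.0],  # essentially unchanged
    "GENE_C": [3.0, 3.5, 2.5, 7.0, 7.5, 6.5],  # higher in normal
}
ranked, scores = rank_genes(expression, labels, "tumor", "normal")
print(ranked)
```

Genes that are higher in tumor end up at the top of the list, genes that are higher in normal at the bottom, and unchanged genes in the middle.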

A simplified example is shown in the following figure.

The enrichment score (ES) is the maximum deviation from zero encountered during that walk. The ES reflects the degree to which the genes in a gene set are overrepresented at the top or bottom of the entire ranked list of genes. A set that is not enriched will have its genes spread more or less uniformly through the ranked list. An enriched set, on the other hand, will have a larger portion of its genes at one or the other end of the ranked list. The extent of enrichment is captured mathematically as the ES statistic.
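
The walk and the resulting ES can be sketched in a few lines. This is a simplified version, assuming unweighted steps: every gene in the set raises the running sum by the same amount and every other gene lowers it by the same amount, so the sum returns to zero at the end of the list. GSEA can also weight the steps by the ranking metric, which is left out here. The function name and gene identifiers are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum walk. Returns (ES, list of running sums)."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    up = 1.0 / n_hit      # step taken when the gene belongs to the set
    down = 1.0 / n_miss   # step taken when it does not
    running, total = [], 0.0
    for gene in ranked_genes:
        total += up if gene in members else -down
        running.append(total)
    # The ES is the running-sum value that lies farthest from zero.
    es = max(running, key=abs)
    return es, running


ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
es_top, walk = enrichment_score(ranked, {"g1", "g2", "g4"})
es_bottom, _ = enrichment_score(ranked, {"g6", "g7", "g8"})
print(round(es_top, 2), round(es_bottom, 2))
```

A set concentrated at the top of the list gives a positive ES, and a set concentrated at the bottom gives a negative ES.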

Next, GSEA estimates the statistical significance of the ES by a permutation test. To do this, GSEA creates a version of the data set with phenotype labels randomly scrambled, produces the corresponding ranked list, and recomputes the ES of the gene set for this permuted data set. GSEA repeats this many times (1000 is the default) and produces an empirical null distribution of ES scores. Alternatively, permutations may be generated by creating “random” gene sets (genes randomly selected from those in the expression dataset) of equal size to the gene set under analysis.
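
The gene-set variant of the permutation test (random gene sets of equal size) is the easier one to sketch, because it only needs the ranked list. The example below assumes the simplified, unweighted ES and a hypothetical ranked list of 100 genes; the function names, the `seed` parameter and the data are illustrative assumptions. The phenotype-label variant would instead shuffle the sample labels and re-rank the genes in every permutation.

```python
import random


def enrichment_score(ranked_genes, members):
    """Simplified, unweighted running-sum ES."""
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    total, best = 0.0, 0.0
    for gene in ranked_genes:
        total += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(total) > abs(best):
            best = total
    return best


def null_distribution(ranked_genes, set_size, n_perm=1000, seed=0):
    """ES values of random gene sets with the same size as the tested set."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        enrichment_score(ranked_genes, set(rng.sample(ranked_genes, set_size)))
        for _ in range(n_perm)
    ]


ranked = [f"g{i}" for i in range(1, 101)]  # hypothetical ranked list
null_es = null_distribution(ranked, set_size=10, n_perm=1000)
print(len(null_es), min(null_es), max(null_es))
```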

The nominal p-value estimates the statistical significance of a single gene set's enrichment score. It is the probability, under the permutation-generated null distribution, of obtaining an ES at least as strong as the one observed in your experiment.
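
As a rough sketch, the nominal p-value can be computed as the fraction of null ES values that are at least as extreme as the observed ES. The version below compares the observed ES only with null values of the same sign, as described in the published GSEA method; the function name and the null values are hypothetical.

```python
def nominal_p_value(observed_es, null_es):
    """Fraction of same-sign null ES values at least as extreme as the observed ES."""
    if observed_es >= 0:
        same_sign = [x for x in null_es if x >= 0]
        extreme = [x for x in same_sign if x >= observed_es]
    else:
        same_sign = [x for x in null_es if x < 0]
        extreme = [x for x in same_sign if x <= observed_es]
    if not same_sign:
        return float("nan")  # no null values on this side to compare against
    return len(extreme) / len(same_sign)


# Hypothetical null distribution; a real one would hold about 1000 values.
null_es = [-0.5, -0.3, -0.2, 0.1, 0.2, 0.3, 0.4, 0.6]
p_pos = nominal_p_value(0.35, null_es)   # 2 of the 5 positive null values are >= 0.35
p_neg = nominal_p_value(-0.4, null_es)   # 1 of the 3 negative null values is <= -0.4
print(p_pos, p_neg)
```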

Typically, GSEA is run with a large number of gene sets. For example, the MSigDB collection and subcollections each contain hundreds to thousands of gene sets. This has implications when comparing enrichment results for the many sets:

The ES must be adjusted to account for differences in the gene set sizes and in correlations between gene sets and the expression data set. The resulting normalized enrichment scores (NES) allow you to compare the analysis results across gene sets.

The nominal p-values need to be corrected to adjust for multiple hypothesis testing. For a large number of sets (rule of thumb: more than 30), we recommend paying attention to the False Discovery Rate (FDR) q-values: consider a set significantly enriched if its NES has an FDR q-value below 0.25.
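
The normalization behind the NES can be sketched as dividing the observed ES by the mean of the null ES values that have the same sign, so that sets of different sizes become comparable. This is a minimal illustration with hypothetical numbers and a hypothetical function name; the FDR computation is not shown.

```python
from statistics import mean


def normalized_es(observed_es, null_es):
    """Divide the ES by the mean magnitude of the same-sign null ES values."""
    if observed_es >= 0:
        side = [x for x in null_es if x >= 0]
    else:
        side = [-x for x in null_es if x < 0]  # magnitudes of the negative side
    if not side or mean(side) == 0:
        return float("nan")  # cannot normalize without same-sign null values
    return observed_es / mean(side)


null_es = [-0.4, -0.2, 0.2, 0.4]  # hypothetical null ES values for one gene set
nes_pos = normalized_es(0.6, null_es)
nes_neg = normalized_es(-0.6, null_es)
print(round(nes_pos, 2), round(nes_neg, 2))
```

The sign of the NES follows the sign of the ES, and an NES near 1 or -1 means the observed score is about as large as a typical score obtained by chance.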

GSEA enrichment scores and statistics

[Figure: GSEAplot.gif]

|| number of gene-sets tested || 1 gene-set || few gene-sets (3-10) || a lot of gene-sets (>3000) ||
|| ES || (./) || not recommended || not recommended ||
|| NES || not informative || (./) || (./) ||
|| nom p-value || (./) || better to use FDR || better to use FDR ||
|| FDR || not informative || (./) || (./) (*) ||

(*) FDR values will be pessimistic when using the BaderLab gene-set files, because the large number of gene-sets tested requires a strong multiple-testing adjustment of the p-values.

GSEA formats (.GCT, .CLS, .RNK, .GMT):

Tips on how to install GSEA locally and launch it from the command line:

GSEA parameters

GSEA tutorial

single sample GSEA (ssGSEA)

References

Answers to questions

Troubleshooting

CancerStemCellProject/VeroniqueVoisin/AdditionalResources/GSEA (last edited 2016-01-04 15:36:11 by VeroniqueVoisin)
