A to Z protocol to create an EnrichmentMap from gene expression data using GSEA (Gene Set Enrichment Analysis)
- System requirements to run this workflow:
- At least 2GB of free memory is needed to run GSEA or to navigate through an Enrichment Map, so at least 4GB of system memory is recommended (you can check it in System Properties on a Windows machine and under About This Mac on a Mac).
- A 64-bit system (you can check it in the same places).
- The goal of this analysis is to perform a gene set enrichment analysis: we look at the genes that are differentially expressed between the 2 (or more) conditions under study and see whether some of these genes belong to the same biological function or process. It can be a way to rapidly identify the major altered biological functions. Working at the pathway level is also an efficient way to overcome noise in a dataset: if the differential expression values of a few genes are only borderline significant because of noise in the data, but more of these genes belong to the same biological function than would be expected by chance alone, then taking the expression values of all these genes into account could reveal that the pathway is significantly perturbed. The results are represented as a network on which different layers of information can be added on top of each other, making the enrichment results informative.
- The specific goal of this workflow example is to start from gene expression data and to show step by step how to construct and interpret an enrichment map. By following this workflow, we would like to find all pathways that could be altered between the 2 (or more) conditions that we are testing. The aim is a global and comprehensive view of what is happening in the cells, a snapshot of the entire cell at the moment the RNA was extracted.
- Description of the steps: this section briefly explains the steps that will be followed to run the gene set enrichment workflow, as well as the files needed for it.
- The first steps are to download the data from GEO and process them (or adapt this protocol to your own processed gene expression data). The first step of the enrichment analysis itself is to run GSEA (Gene Set Enrichment Analysis, a tool from the Broad Institute): to do so, we need to create a file called a rank file from the array data and use it with a pathway database file that can be downloaded from the Baderlab website. We will then create a network called an enrichment map using the Cytoscape software; for that we will need to create an expression file and also use the results of the GSEA run.
- How to create a rank file (.rnk)
- The rank file contains only 2 columns: the gene identifiers (official gene symbols in this workflow) in the first column and the differential expression value for each gene in the second column. In this protocol, we will use the t value from a moderated Student's t-test. Headers (column names) should be removed. The format should be tab delimited (meaning that the columns are separated by tabs) and the file extension should be .rnk.
- The rank file format is described in the GSEA documentation: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
- The rank file will be used to run the gene set enrichment analysis (GSEA).
- Warning: the rank file must contain all the genes; do not filter it down to the differentially expressed genes only. The genes are ranked from the most up-regulated to the most down-regulated. Genes that do not vary, and are therefore not of interest, fall in the middle of the list. GSEA will not look for significance in the middle of the list, but it requires all genes to be listed in order to calculate the statistics correctly.
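As an illustration of the rank file format described above, here is a minimal Python sketch. The function name `write_rank_file`, the gene names and the t values are all made up for the example; in the real workflow the t values would come from the moderated t-test run during preprocessing. Sorting the genes is done here only to mirror the ranking described above.

```python
import csv


def write_rank_file(t_values, path):
    """Write {gene_symbol: t_value} as a headerless, tab-delimited .rnk file.

    Hypothetical helper for illustration; not part of GSEA itself.
    """
    # Keep every gene (no filtering) and order them from the most
    # up-regulated (largest t) to the most down-regulated (smallest t).
    ranked = sorted(t_values.items(), key=lambda item: item[1], reverse=True)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # No header row: the file holds only gene symbol and score.
        writer.writerows(ranked)
    return ranked


if __name__ == "__main__":
    # Made-up t values for three placeholder genes.
    example = {"GENE_B": -3.1, "GENE_A": 4.2, "GENE_C": 0.05}
    write_rank_file(example, "example.rnk")
```

The resulting file has one gene per line, the symbol and the score separated by a single tab, and no column names.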
- How to get the pathway database file (.gmt)
- The pathway database file contains all known and curated biological functions that we are going to test in this pathway enrichment analysis. For each of these functions, the names of the genes known to be implicated in it are listed beside it. The gene set enrichment analysis checks whether the top differentially expressed genes are included in some of these pathways. It also assesses whether this enrichment could have happened by chance alone.
- In this protocol we are going to use a file that includes pathways from different sources (e.g. Gene Ontology, Reactome, KEGG, ...). We observed that having the most comprehensive set of pathways gave more sensitive results. Although the databases included in this file can overlap, they are not 100% identical, and the clusters that the enrichment map creates from these different sources add confidence that a given biological function is perturbed.
- The database file compiled by the BaderLab and updated monthly can be found at: http://download.baderlab.org/EM_Genesets/ (look for the current release at the bottom of the list). A description of how the file is created is at: http://baderlab.org/GeneSets
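To make the structure of a .gmt file more concrete, here is a minimal Python sketch that reads one into a dictionary. It assumes the usual GMT layout from the GSEA data formats page (one gene set per line, tab delimited: set name, description, then the gene symbols). The function name `read_gmt` is hypothetical.

```python
def read_gmt(path):
    """Read a .gmt file into {gene_set_name: set_of_gene_symbols}.

    Assumes each line is: name <tab> description <tab> gene1 <tab> gene2 ...
    """
    gene_sets = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                # Skip blank lines and sets that list no genes.
                continue
            name, _description, *genes = fields
            # Drop empty fields left by trailing tabs.
            gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets
```

With the gene sets loaded this way, it is easy to check how many pathways the file contains or which pathways include a gene of interest before running GSEA.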
- How to run GSEA
- GSEA can be downloaded from http://www.broadinstitute.org/gsea/index.jsp
- You need to enter a valid e-mail address before going to the download section.
- For this workflow example, we are going to use the Java Web Start option to run GSEA: choose 'launch with 2GB' in the 'javaGSEA Desktop Application' box.
- You may be able to save the icon on your computer (a file called gsea.jnlp). Each time you want to run GSEA, you just need to double-click on the icon. Each time you do so, GSEA checks whether a new version is available and installs the software in a temporary location on your computer (thus you need a working internet connection to launch GSEA this way).
- The first step when the application is open is to load the data by browsing or by dragging and dropping the files: the .rnk and the .gmt files. The next step is to open the GSEAPreranked window: menu bar --> Tools --> GseaPreranked. From this window, we select the .rnk and .gmt files that were loaded (the .gmt file is located in the 'Genematrix (local gmx/gmt)' tab). The number of permutations is set to 2000 here (either 1000 or 2000 can be used) and the other parameters can be left at their defaults.
- How to create an expression file
- Although only 2 files were needed to run GSEA, an additional file, called the expression file, needs to be prepared to create an enrichment map. For this workflow example, the expression file will contain the official gene symbol as the first column and the full name of the gene (sometimes called the definition) as the second column, followed by columns containing the normalized data for each of the samples included in the study.
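The column layout of the expression file can be sketched in Python as follows. This is only an illustration: the function name `write_expression_file`, the header labels "Name" and "Description", the tab-delimited format and the example values are assumptions, not something prescribed by the text above.

```python
import csv


def write_expression_file(path, sample_names, rows):
    """Write a tab-delimited expression file.

    rows is an iterable of (gene_symbol, gene_full_name, [one value per sample]).
    The header labels used here are an assumption made for this sketch.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["Name", "Description", *sample_names])
        for symbol, full_name, values in rows:
            if len(values) != len(sample_names):
                raise ValueError(f"{symbol}: expected {len(sample_names)} values")
            # Column 1: symbol, column 2: full gene name, then normalized data.
            writer.writerow([symbol, full_name, *values])


if __name__ == "__main__":
    # Made-up normalized values for two placeholder genes and two samples.
    write_expression_file(
        "example_expression.txt",
        ["sample_1", "sample_2"],
        [("GENE_A", "made-up gene A", [7.1, 9.4]),
         ("GENE_B", "made-up gene B", [8.0, 5.2])],
    )
```

The check on the number of values per gene is there to catch rows that would otherwise shift the sample columns out of alignment.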
- How to create a map
- What is the next step, how to use the map
- How to create a figure
- How to interpret the results
- What next
- (How to preprocess the data using R)
- (How to preprocess the data using Excel)
FIRST EXAMPLE WITH AFFYMETRIX MICROARRAY DATA
- description of the data
Download the data from GEO
Installation
1) install R (http://www.r-project.org/)
2) install RStudio (http://www.rstudio.com/)
3) Go through an online R tutorial (e.g. this one: http://www.cyclismo.org/tutorial/R/)
How to preprocess the data (normalization, QC, differential expression)
How to update the annotations
How to create a rank file
How to create an expression file
How to run GSEA
How to create a map