Differences between revisions 49 and 54 (spanning 5 versions)

Enrichment Map Gene Sets

EnrichmentMap is a Cytoscape plugin developed in the Baderlab to help visualize, navigate and analyze functional enrichment results generated by programs such as Gene Set Enrichment Analysis (GSEA), BiNGO, or DAVID. Some enrichment programs, such as GSEA, allow users to search against their own gene set database. Because annotation (gene set) sources are regularly updated as new information is discovered, we set up an automated system that updates our gene set collections so that we are always using the most up-to-date annotations.

If you use these gene sets, please cite our Enrichment Map paper.

Important Note - Geneset files from December 2011, January 2012, February 2012, and March 2012 had an error in the up-propagation of GO. Up-propagation only followed the is-a relationship and did not follow the part-of relationship, which translates into missing annotations. This primarily affects genesets in GO cellular component.
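
To illustrate the error described in the note above, here is a minimal sketch of up-propagation over a toy three-term hierarchy. The hierarchy, the gene name GENE_A, and the function names are assumptions made for illustration; this is not the actual GO graph or the code used to build the files. It only shows why ignoring part-of edges leaves ancestor terms without annotations.

```python
from collections import defaultdict

# Toy hierarchy (child -> list of (relation, parent)). Simplified for
# illustration only; these are not the real GO edges.
PARENTS = {
    "nucleolus": [("part_of", "nucleus")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [],
}


def ancestors(term, relations):
    """Return every ancestor of term reachable through the allowed relations."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in PARENTS.get(current, []):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def up_propagate(direct, relations):
    """Map each term to the genes annotated to it or to any of its descendants."""
    propagated = defaultdict(set)
    for term, genes in direct.items():
        for target in {term} | ancestors(term, relations):
            propagated[target] |= genes
    return dict(propagated)


# GENE_A is a made-up gene annotated directly to the most specific term.
direct = {"nucleolus": {"GENE_A"}}
is_a_only = up_propagate(direct, {"is_a"})
both = up_propagate(direct, {"is_a", "part_of"})

print(sorted(is_a_only))  # terms that receive GENE_A when only is-a is followed
print(sorted(both))       # terms that receive GENE_A when part-of is followed too
```

In this toy case, following only is-a leaves the annotation on the most specific term, while following both relationships carries it up to every ancestor.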

Summary

Gene Set files can be downloaded from: http://download.baderlab.org/EM_Genesets/
Enrichment Map Gene Sets are a collection of Gene Set files in GMT format (compatible with GSEA), updated monthly from the original source locations and available with:
1. Entrez gene ids
2. UniProt accessions
3. Gene symbols
The GMT File format contains one Gene Set per line. Each line contains:
- Name (tab) Description (tab) Gene (tab) Gene (tab) ...
- In our format:
  - Name = Gene Set Name % Gene Set Source % Gene Set Source identifier
    - Example --> ATP-dependent protein binding%GO%GO:0043008 OR arginine biosynthesis IV%HUMANCYC%ARGININE-SYN4-PWY
  - Description = Gene Set Name
    - Example --> ATP-dependent protein binding OR arginine biosynthesis IV
  - Gene = identified by one of the three possible identifiers (Entrez gene id, UniProt accession or gene symbols)
  - IMPORTANT NOTE: Originally we used the "|" to separate information in the Name field but we came across issues with this separator in GSEA so we changed to "%". The "%" was used as of the December 2011 build.
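
The line layout described above can be read with a few lines of Python. This is a minimal sketch rather than an official parser: the function name `parse_gmt_line` and the two gene identifiers in the example line are placeholders chosen for illustration, and it assumes the "%" separator used as of the December 2011 build.

```python
def parse_gmt_line(line):
    """Split one GMT line into its Name parts, Description, and gene list."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a GMT line needs at least a Name and a Description")
    name, description = fields[0], fields[1]
    genes = [gene for gene in fields[2:] if gene]

    # Name = Gene Set Name % Gene Set Source % Gene Set Source identifier.
    # Splitting from the right keeps any "%" inside the gene set name intact.
    parts = name.rsplit("%", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected Name field: {name!r}")
    set_name, source, source_id = parts

    return {
        "set_name": set_name,
        "source": source,
        "source_id": source_id,
        "description": description,
        "genes": genes,
    }


# The gene ids 1234 and 5678 are placeholders, not real members of this set.
example = (
    "ATP-dependent protein binding%GO%GO:0043008"
    "\tATP-dependent protein binding\t1234\t5678\n"
)
record = parse_gmt_line(example)
print(record["source"], record["source_id"], record["genes"])
```

Files built before December 2011 use "|" in the Name field, so the separator would need to be changed to read them.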

Current Stats

Sources

Human

Source	File Origin	File Type	ID extracted	Frequency source is updated	Number of pathways
KEGG (1)	KEGG ftp site (July 2011)	GMT	Symbol	static as of July 1, 2011	236
Msigdb - c2 (2) (other + Biocarta)	manual download from Msigdb	GMT	Entrez gene	sporadically	Biocarta - 217 Other - 47
NCI (3)	scripted download of zipped release from website	BioPAX	Entrez gene	sporadically	219 pathways
Institute of Bioinformatics (IOB)	received directly from IOB - static (July 2011)	BioPAX	Entrez gene	sporadically	35 pathways - 10 are the same as CellMap, 1 is the same as NetPath
NetPath (4) [also from IOB]	scripted download of files numbered 1-25	BioPAX	Entrez gene	static	25 pathways - 12 are cancer pathways (10 are CellMap) 13 are immunity pathways
HumanCyc (5)	scripted download of zipped release from password protected website.	BioPAX	UniProt	updated periodically	249 Pathways
Reactome (6)	scripted download of zipped release from website	BioPAX	UniProt	updated release	1117 pathways (release 37)
GO (7)	scripted download from EBI ftp site (human)	GAF	UniProt	released once a month	13,034 no GO IEA 15,181 with GO IEA
Msigdb - c3 (2) Specialty GMTs mirs, transcription factors	manual download from Msigdb	GMT	Entrez gene	sporadically	221 miRs 616 TFs
Panther (8)	scripted download of biopax archive	BioPAX	UniProt	updated periodically	307 Pathways

Mouse

Source	File Origin	File Type	ID extracted	Frequency source is updated	Number of pathways
Reactome (6)	scripted download of zipped release from website	BioPAX	UniProt	updated release	946 pathways (release 37)
GO (7)	scripted download from MGI ftp site (mouse)	GAF	MGI	released once a month	14,563 no GO IEA 15,041 with GO IEA
KEGG (1)	translated from Human using Homologene	GMT	Entrez gene	static as of July 1, 2011	236
Msigdb - c2 (2) (other + Biocarta)	translated from Human using Homologene	GMT	Entrez gene	sporadically	total 880: Kegg -186 Reactome - 430 Biocarta - 217 Other - 47
NCI (3)	translated from Human using Homologene	GMT	Entrez gene	sporadically	219 pathways
Institute of Bioinformatics (IOB)	translated from Human using Homologene	GMT	Entrez gene	sporadically	35 pathways - 10 are the same as CellMap, 1 is the same as NetPath
NetPath (4) [also from IOB]	translated from Human using Homologene	GMT	Entrez gene	static	25 pathways - 12 are cancer pathways (10 are CellMap) 13 are immunity pathways
HumanCyc (5)	translated from Human using Homologene	GMT	Entrez gene	updated periodically	249 Pathways
Panther (8)	translated from Human using Homologene	BioPAX	UniProt	updated periodically	307 Pathways
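
The rows marked "translated from Human using Homologene" map each human gene in a set to its mouse homolog. The sketch below shows one way such a translation could work. The mapping table, the identifiers (H1, M1, ...), and the choice to drop genes that have no homolog are all assumptions made for illustration; they do not describe the actual build scripts or real Homologene content.

```python
def translate_gene_set(genes, homologs):
    """Translate a list of gene ids through a homolog table.

    homologs maps a source gene id to a list of target gene ids.
    Genes without an entry are dropped (an assumption of this sketch),
    and each target id is kept only once, in first-seen order.
    """
    translated = []
    seen = set()
    for gene in genes:
        for target in homologs.get(gene, []):
            if target not in seen:
                seen.add(target)
                translated.append(target)
    return translated


# Hypothetical homolog table: H* are human ids, M* are mouse ids.
homologs = {
    "H1": ["M1"],
    "H2": ["M2", "M3"],  # one human gene with two mouse homologs
    "H3": ["M1"],        # two human genes sharing one mouse homolog
}

human_set = ["H1", "H2", "H3", "H4"]  # H4 has no homolog in the table
mouse_set = translate_gene_set(human_set, homologs)
print(mouse_set)
```

Under these assumptions a translated set can be smaller or larger than the human original, which is worth keeping in mind when comparing set sizes across species.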

Specialty Gene Sets

The bulk of our genesets are groupings from similar biological processes, pathways and functional annotations but there are a few additional collections of sets that we don't group with them. They include:
1. miRs - sets consisting of all the targets for a given microRNA.
  - miR genesets are retrieved from Msigdb c3 collection.
2. Transcription Factors - sets consisting of all the targets for a given transcription factor.
  - TF genesets are retrieved from Msigdb c3 collection.
3. Disease Phenotype - sets consisting of all known proteins associated with the given disease.
  - Disease phenotype genesets are retrieved from the Human Phenotype Ontology. Genes associated with a particular disease are annotated to it. In addition, in the same style as the Gene Ontology, the relationships between diseases are stored, creating an ontology of diseases. Annotations are up-propagated to related disease terms.
4. Drug Targets - sets consisting of all the known or predicted targets for a given drug.
  - Drug target information is retrieved from DrugBank. DrugBank is a resource containing 6711 drug entries, including 1447 FDA-approved small molecule drugs, 131 FDA-approved biotech (protein/peptide) drugs, 85 nutraceuticals and 5080 experimental drugs. In addition to the compilation of all drugs contained in DrugBank, geneset files are also created for each of the defined drug categories, including approved, experimental, illicit, nutraceutical, and small molecule.

File Structure

< > denotes directory

<Release> - directory is named according to date sets were updated.
- <Species>
  - <Identifier> - (either Entrez gene, UniProt, Gene symbol)
    - <GO>
      - BP = biological process
      - MF = molecular function
      - CC = Cellular component
      - All = BP + MF + CC
      - no_GO_IEA - indicates that the file excludes GO annotations with evidence codes - 'IEA' (inferred from electronic annotation), 'ND' (No biological data available), 'RCA' (inferred from reviewed computational analysis)
      - with_GO_IEA - indicates that the file includes GO annotations with evidence codes - 'IEA' (inferred from electronic annotation), 'ND' (No biological data available), 'RCA' (inferred from reviewed computational analysis)
    - <Pathways>
    - <miRs>
    - <TF>
    - <Disease phenotypes>
In each <Identifier> directory there are amalgamated Gene Set files:
- AllPathways - contains all pathway sources in the Pathways directory
- GOPathways - contains all GO (MF, BP, CC) and all Pathway sources in the Pathways directory.
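
The difference between the no_GO_IEA and with_GO_IEA files listed above comes down to a filter on the evidence code of each annotation. The following is a minimal sketch of that filter. The tuple layout, the gene names, and the function name are assumptions made for illustration, not the code that produces the files.

```python
# Evidence codes excluded from the no_GO_IEA files, as listed above.
EXCLUDED_CODES = {"IEA", "ND", "RCA"}


def filter_annotations(annotations, include_iea):
    """Return the annotations that belong in a with_GO_IEA or no_GO_IEA file.

    annotations is an iterable of (gene, go_id, evidence_code) tuples.
    """
    if include_iea:
        return list(annotations)
    return [entry for entry in annotations if entry[2] not in EXCLUDED_CODES]


# Made-up annotations; GENE_A to GENE_D are placeholder gene names.
annotations = [
    ("GENE_A", "GO:0043008", "IDA"),
    ("GENE_B", "GO:0043008", "IEA"),
    ("GENE_C", "GO:0043008", "RCA"),
    ("GENE_D", "GO:0043008", "ND"),
]

no_iea = filter_annotations(annotations, include_iea=False)
with_iea = filter_annotations(annotations, include_iea=True)
print(len(no_iea), len(with_iea))
```

This is consistent with the Current Stats tables, where the with_GO_IEA counts are always at least as large as the no_GO_IEA counts.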

Creating customized Gene Sets

Download the gene set files you would like to use in your customized set and concatenate them.
For example, to combine Human_IOB_Entrezgene.gmt and Human_NetPath_Entrezgene.gmt, you can use the following Linux command:

   cat Human_IOB_Entrezgene.gmt Human_NetPath_Entrezgene.gmt > MyCustomizedSet.gmt
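
Plain concatenation keeps every line, so a gene set that appears in more than one input (for example, when an amalgamated file such as AllPathways is combined with one of the files it already contains) ends up in the output twice. The sketch below is one possible alternative that skips lines whose Name field has already been seen. The function name `merge_gmt_lines` and the sample lines are assumptions made for illustration.

```python
def merge_gmt_lines(*sources):
    """Merge GMT lines from several sources, keeping the first line per Name."""
    seen = set()
    merged = []
    for lines in sources:
        for line in lines:
            if not line.strip():
                continue  # ignore blank lines
            name = line.split("\t", 1)[0]
            if name in seen:
                continue  # exact same Name already written
            seen.add(name)
            merged.append(line.rstrip("\n"))
    return merged


# Made-up sample lines; SET_A, SET_B, SRC and the gene ids are placeholders.
first = ["SET_A%SRC%1\tSET_A\t11\t12\n", "SET_B%SRC%2\tSET_B\t13\n"]
second = ["SET_B%SRC%2\tSET_B\t13\n", "\n"]
merged = merge_gmt_lines(first, second)
print(len(merged))

# With real files, open file objects can be passed directly, for example:
#   with open("Human_IOB_Entrezgene.gmt") as a, \
#        open("Human_NetPath_Entrezgene.gmt") as b, \
#        open("MyCustomizedSet.gmt", "w") as out:
#       out.write("\n".join(merge_gmt_lines(a, b)) + "\n")
```

Only exact repeats of the Name field are treated as duplicates. The same pathway obtained from two different sources carries a different source in its Name and is kept twice.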

References

1. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2011 Nov 10. PMID: 22080510
2. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43):15545-50. PMID: 16199517
3. Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. PID: the Pathway Interaction Database. Nucleic Acids Res. 2009 Jan;37(Database issue):D674-9. PMID: 18832364
4. Kandasamy K, et al. NetPath: a public resource of curated signal transduction pathways. Genome Biol. 2010 Jan 12;11(1):R3. PMID: 20067622
5. Romero P, Wagg J, Green ML, Kaiser D, Krummenacker M, Karp PD. Computational prediction of human metabolic pathways from the complete human genome. Genome Biol. 2005;6(1):R2. Epub 2004 Dec 22. PMID: 15642094
6. Croft D, O'Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B, Jupe S, Kalatskaya I, Mahajan S, May B, Ndegwa N, Schmidt E, Shamovsky V, Yung C, Birney E, Hermjakob H, D'Eustachio P, Stein L. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011 Jan;39(Database issue):D691-7. PMID: 21067998
7. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000 May;25(1):25-9. PMID: 10802651
8. Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D284-8. PMID: 15608197

- Revision 49 as of 2012-04-16 16:37:08
  Size: 11943
  Editor: RuthIsserlin
  Comment:
+ Revision 54 as of 2014-02-24 15:10:43
  Size: 13109
  Editor: RuthIsserlin
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 8:
-<<BR>>
'''Important Note - Genesets files from December 2011, January 2012, Februrary 2012, and March 2012 had an error in the up-propagation of GO.  Up-propagation only followed the ''is-a'' relationship and did not follow the ''part-of'' relationship.  This primarily effects genesets in GO cellular compartment.'''
+<<BR>><<BR>>
'''Important Note - Genesets files from December 2011, January 2012, Februrary 2012, and March 2012 had an error in the up-propagation of GO.  Up-propagation only followed the ''is-a'' relationship and did not follow the ''part-of'' relationship which translates into missing annotations.  This primarily effects genesets in GO cellular compartment.'''
 Line 20:
-      * Name = Gene Set Name | Gene Set Source | Gene Set Source identifier
        * Example --> ATP-dependent protein binding|GO|GO:0043008  '''OR'''    arginine biosynthesis IV|HUMANCYC|ARGININE-SYN4-PWY
+      * Name = Gene Set Name % Gene Set Source % Gene Set Source identifier
        * Example --> ATP-dependent protein binding%GO%GO:0043008  '''OR'''    arginine biosynthesis IV%HUMANCYC%ARGININE-SYN4-PWY
 Line 25:
+      * ''' IMPORTANT NOTE: Originally we used the "|" to separate information in the Name field but we came across issues with this separator in GSEA so we changed to "%".  The "%" was used as of the December 2011 build.'''
-Line 42:
+Line 43:
+|| [[http://www.pantherdb.org/pathway/|Panther]] ([[#ref8|8]]) || scripted download of biopax archive || BioPAX || !UniProt || updated periodically || 307 Pathways  ||
-Line 53:
+Line 55:
+|| [[http://www.pantherdb.org/pathway/|Panther]] ([[#ref8|8]]) || ''translated from Human using Homologene'' || BioPAX || !UniProt || updated periodically || 307 Pathways  ||
-Line 63:
+Line 66:
-      * Drug target information is retrieved from [[http://stitch.embl.de/|STITCH]].  STITCH is a resource containing the amalgamation of many different databases.  A score is attached to each protein - chemical interaction.  '''For the purpose of our genesets we only include protein - chemical interactions that have a combined score greater than 900'''
+      * Drug target information is retrieved from [[http://www.drugbank.ca/downloads|drugbank]].  Drugbank is a resource containing 6711 drug entries including 1447 FDA-approved small molecule drugs, 131 FDA-approved biotech (protein/peptide) drugs, 85 nutraceuticals and 5080 experimental drugs.  In addition to the compilation of all drugs contained in drugbank geneset files are also created for each of the defined  drug categories including approved, experimental, illicit, nutraceutical, and small molecule.
-Line 101:
+Line 104:
+. <<Anchor(ref8)>> Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD. '''The PANTHER database of protein families, subfamilies, functions and pathways.''' Nucleic Acids Res. 2005 Jan 1;33(Database issue):D284-8. PubMed PMID: 15608197 <<BR>> [[http://www.ncbi.nlm.nih.gov/pubmed/15608197|Pubmed]]