GeneMania DataWarehouse (DW)
Identifier Mapping Tables:
1. Eight tab-delimited files (can be imported into Excel or a similar package, as well as parsed). The files represent IDs from the following species: S. cerevisiae, C. elegans, A. thaliana, R. norvegicus, M. musculus, H. sapiens, D. melanogaster, E. coli.
2. For an Ensembl-based sheet, we will have the following columns: (example from human)
   GMID: 3425 (required column; add a 'GM' prefix, giving GM3425, if you prefer that)
   Ensembl Gene ID: ENSG00000198888
   Protein Coding: True/False (cannot be 'N/A')
   Gene Name: ND1
   Ensembl Transcript ID: ENST00000361390
   Ensembl Protein ID: ENSP00000354687
   Uniprot ID: P03886
   Entrez Gene ID: 4535
   RefSeq mRNA ID: N/A (but could have been something like NM_001088)
   RefSeq Protein ID: AP_000639
   Synonyms: MTND1;NAD1;ND1
   Definition: NADH-ubiquinone oxidoreductase chain 1 (EC 1.6.5.3) (NADH dehydrogenase subunit 1). Source: Uniprot/SWISSPROT P03886
3. For an Entrez-based sheet, we will have the following columns: (example from Cress)
   GMID: 144382
   Entrez Gene ID: 2745418
   Protein Coding: True
   Gene Name: AT2G01175
   Uniprot ID: Q3EC92
   TAIR Locus ID: AT2G01175
   RefSeq mRNA ID: NM_201659
   RefSeq Protein ID: NP_973388
   Synonyms: N/A
   Definition: hypothetical protein
4. First row in each file will be the headers row.
5. In general, when there is no data available for a specific cell in a particular column, the term 'N/A' (or a similar standard term) will be used instead. In cases when there is more than one entry per cell, the entries will be separated by a ';'.
6. As a general rule for any importing system, it is better to treat IDs as alphanumeric rather than numeric, since they can be either, depending on the source database referenced.
7. The GMID is an internal identifier that is unique per species, and is stable within a build/release of the IDMapping. It is not stable between different releases and is not unique across different species. The DW itself does not use or reference this ID.
8. The source of the ID mapping information will be the first resource listed in each file. In other words, the reference point is Ensembl for all species, except for A. thaliana and E. coli, where the reference point is Entrez. Exceptions:
- Synonyms are accumulated from both resources, then filtered to provide a unique list of synonyms per gene. If the gene name from the secondary source is not listed as one of the synonyms of the primary source, it is added to the filtered list as a synonym as well.
- RefSeq info always comes from Entrez.
- LeftOver entries, as described in the ID validation process.
9. Synonyms are case-sensitive, so (example from human) ChM1L (for Ensembl ENSG00000000005) and CHM1L (for the matching Entrez 64102) are listed as two different synonyms.
10. Note that the version number for an identifier (RefSeq mRNA and protein IDs) is ignored/truncated. So, NM_001088.1 and NM_001088.2 will be listed as NM_001088 (once). This approach is followed in some other bioinformatics tools when version numbers are irrelevant.
11. Ensembl and Entrez are the resources used for all species, except for A. thaliana (where the resources are Entrez and TAIR) and E. coli (just Entrez).
12. For the Ensembl-based mapping files, the Uniprot ID column may hold the curated Uniprot/Swissprot IDs, but not the Uniprot/TrEMBL IDs. It includes both the Uniprot ID (aka entry name) and the Uniprot primary accession for a protein. For the Entrez-based mapping files, only the Uniprot primary accession is offered, but it covers both Uniprot/Swissprot and Uniprot/TrEMBL.
13. For the mouse ID mapping table, there is an additional column holding MGI identifiers.
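The file conventions in items 4 to 10 can be sketched as a small Python reader. This is a minimal sketch under stated assumptions, not part of the DW: the function names (split_cell, read_mapping_table, strip_version, merge_synonyms) are hypothetical, the placeholder is assumed to be exactly 'N/A', and the sample table is a shortened version of the cress example with only a few of its columns.

```python
import csv
import io
import re

MISSING = "N/A"    # assumed placeholder for an empty cell (item 5)
SEPARATOR = ";"    # separator between multiple entries in one cell (item 5)


def split_cell(cell):
    """Return the entries of one cell as a list of strings; 'N/A' gives []."""
    cell = cell.strip()
    if not cell or cell == MISSING:
        return []
    # Entries stay strings: IDs may be numeric or alphanumeric (item 6).
    return [part.strip() for part in cell.split(SEPARATOR) if part.strip()]


def read_mapping_table(handle):
    """Parse a tab-delimited table whose first row is the header row (item 4)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        rows.append({column: split_cell(value or "")
                     for column, value in row.items()
                     if column is not None})  # ignore cells beyond the header
    return rows


def strip_version(refseq_id):
    """Drop a trailing version number, e.g. NM_001088.2 -> NM_001088 (item 10)."""
    return re.sub(r"\.\d+$", "", refseq_id)


def merge_synonyms(primary, secondary, secondary_gene_name=None):
    """Case-sensitive, order-preserving union of synonym lists (items 8 and 9)."""
    merged = []
    for name in list(primary) + list(secondary):
        if name not in merged:
            merged.append(name)
    # The secondary gene name is added only if it is not already listed.
    if secondary_gene_name and secondary_gene_name not in merged:
        merged.append(secondary_gene_name)
    return merged


# Shortened sample built from the cress example (header row first).
sample = (
    "GMID\tEntrez Gene ID\tGene Name\tRefSeq mRNA ID\tSynonyms\n"
    "144382\t2745418\tAT2G01175\tNM_201659\tN/A\n"
)
rows = read_mapping_table(io.StringIO(sample))
print(rows[0]["Entrez Gene ID"])             # ['2745418']
print(rows[0]["Synonyms"])                   # []
print(merge_synonyms(["ChM1L"], ["CHM1L"]))  # ['ChM1L', 'CHM1L']
```

Representing every cell as a list of strings keeps single-valued and multi-valued columns uniform, at the cost of an extra indexing step for columns that only ever hold one value.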
Bonus ID Mapping
The term 'bonus' ID mapping refers to mapping tables that represent the opposite view of the ID mapping tables mentioned above (and that were part of the GM 'requirements'). The tables are based on a Resource2_Resource1 mapping, where the info is derived from Resource2 (with the exceptions mentioned earlier).
For example, for an Entrez_Ensembl mapping, the table will have the following columns: Entrez Gene ID, Protein Coding, Entrez Gene Name, Uniprot ID, Ensembl Gene ID, RefSeq mRNA ID, RefSeq Protein ID, Synonyms, Definition. In the case of a TAIR_Entrez mapping, the table will have the following columns: TAIR Locus ID, Protein Coding, TAIR Locus Name, Uniprot ID, Entrez Gene ID, RefSeq mRNA ID, RefSeq Protein ID, Synonyms, Definition.
The same procedure is followed for these tables, except for the generation of GMIDs, which is disabled. These mappings can be used as a general reference, and are one of the side benefits of the flexible design adopted.
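To illustrate what the 'opposite view' means for a consumer of the files, the sketch below builds a lookup from Entrez Gene ID back to Ensembl Gene ID out of already-parsed rows. It is only an illustration: the bonus tables themselves are derived from Resource2 directly, not by inverting the primary table, and the function name reverse_index and the row layout (a dict of column name to list of strings) are assumptions made for this example.

```python
from collections import defaultdict


def reverse_index(rows, key_column, target_column):
    """Map each ID in key_column to the unique IDs found in target_column."""
    index = defaultdict(list)
    for row in rows:
        for key in row.get(key_column, []):
            for target in row.get(target_column, []):
                if target not in index[key]:
                    index[key].append(target)
    return dict(index)


# Hypothetical parsed rows; the ID pairs are the human examples quoted above.
rows = [
    {"Ensembl Gene ID": ["ENSG00000198888"], "Entrez Gene ID": ["4535"]},
    {"Ensembl Gene ID": ["ENSG00000000005"], "Entrez Gene ID": ["64102"]},
]
entrez_to_ensembl = reverse_index(rows, "Entrez Gene ID", "Ensembl Gene ID")
print(entrez_to_ensembl["4535"])   # ['ENSG00000198888']
```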
Linking IDs to Resources:
This section describes the hyperlinking of identifiers, from the ID mapping files, to their external resources. The IDs can be plugged into these URLs, as follows:
1. Ensembl
Ensembl Gene ID: http://www.ensembl.org/SpeciesName/geneview?gene=EnsemblGeneID
Ensembl Transcript ID: http://www.ensembl.org/SpeciesName/transview?transcript=TranscriptID
Ensembl Protein ID: http://www.ensembl.org/SpeciesName/protview?peptide=ProteinID
The SpeciesName can be any one of the following:
Hs: Homo_sapiens
Mm: Mus_musculus
Rn: Rattus_norvegicus
Dm: Drosophila_melanogaster
Sc: Saccharomyces_cerevisiae
Ce: Caenorhabditis_elegans
Examples:
http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000000003
http://www.ensembl.org/Homo_sapiens/transview?transcript=ENST00000342929
http://www.ensembl.org/Mus_musculus/protview?peptide=ENSMUSP00000045693
2. Entrez
Entrez Gene ID: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=EntrezGeneID
RefSeq mRNA: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=RefSeq_mRNA_ID
RefSeq Protein: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=RefSeqProteinID
Examples:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=4232
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_002402
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_002393
3. TAIR
TAIR Locus ID: http://arabidopsis.org/servlets/TairObject?type=locus&name=TAIR_LocusID
Example:
http://arabidopsis.org/servlets/TairObject?type=locus&name=AT2G01175
4. Uniprot
Uniprot Primary Accession: http://www.uniprot.org/uniprot/UniprotAccession
Uniprot Entry name (aka Uniprot ID): http://www.uniprot.org/uniprot/UniprotID
Examples:
http://www.uniprot.org/uniprot/Q5EB52
http://www.uniprot.org/uniprot/MEST_HUMAN (redirected to the primary accession URL)
5. Ensembl gene names, Entrez gene names, and TAIR locus names are all linked to the pages of the corresponding Ensembl gene ID, Entrez gene ID, and TAIR locus ID, respectively.
6. We currently save both Uniprot primary accessions and Uniprot IDs in the Ensembl-based ID mapping files, in the same column. However, linking to Uniprot by the Uniprot primary accession is better than linking by the Uniprot ID, since the former is a stable and unique identifier for a Uniprot entry, while the latter might change between different Uniprot releases.
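The URL patterns in this section can be collected into a small lookup, sketched below in Python. The templates and species names are copied from the lists above; everything else is an assumption for illustration: the key names (ensembl_gene, refseq, and so on), the functions build_url and preferred_uniprot, and the heuristic that a Uniprot entry name contains an underscore (as in MEST_HUMAN) while a primary accession does not.

```python
from urllib.parse import quote

# Species codes and Ensembl species names, as listed in this section.
ENSEMBL_SPECIES = {
    "Hs": "Homo_sapiens",
    "Mm": "Mus_musculus",
    "Rn": "Rattus_norvegicus",
    "Dm": "Drosophila_melanogaster",
    "Sc": "Saccharomyces_cerevisiae",
    "Ce": "Caenorhabditis_elegans",
}

# URL templates from this section; the dictionary keys are hypothetical names.
URL_TEMPLATES = {
    "ensembl_gene": "http://www.ensembl.org/{species}/geneview?gene={id}",
    "ensembl_transcript": "http://www.ensembl.org/{species}/transview?transcript={id}",
    "ensembl_protein": "http://www.ensembl.org/{species}/protview?peptide={id}",
    "entrez_gene": "http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term={id}",
    "refseq": "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val={id}",
    "tair_locus": "http://arabidopsis.org/servlets/TairObject?type=locus&name={id}",
    "uniprot": "http://www.uniprot.org/uniprot/{id}",
}


def build_url(kind, identifier, species_code=None):
    """Plug an identifier (and, for Ensembl, a species) into its URL template."""
    template = URL_TEMPLATES[kind]
    safe_id = quote(str(identifier), safe="")
    if "{species}" in template:
        # Ensembl links need a species; an unknown code raises KeyError.
        return template.format(species=ENSEMBL_SPECIES[species_code], id=safe_id)
    return template.format(id=safe_id)


def preferred_uniprot(entries):
    """Prefer a primary accession over an entry name when both are present."""
    for entry in entries:
        if "_" not in entry:   # heuristic: entry names look like MEST_HUMAN
            return entry
    return entries[0] if entries else None


print(build_url("ensembl_gene", "ENSG00000000003", "Hs"))
print(build_url("uniprot", preferred_uniprot(["MEST_HUMAN", "Q5EB52"])))
```

Choosing the accession before building the link follows the recommendation in item 6, since the entry name may change between Uniprot releases.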