#acl GeneManiaGroup:read,write,revert

== GeneMania DataWarehouse (DW) ==

=== DW Resources - Entrez: ===


   The Entrez database, by NCBI, is released in two main formats.

   1. Flat file format: This is composed of 11 main flat files, in a tab delimited format. The files are:
{{{
gene2accession
gene2go
gene2pubmed
gene2refseq
gene2sts
gene2unigene
gene_history
gene_info
mim2gene
gene_refseq_uniprotkb_collab
interactions
}}}

   Most of these files are useful for cross-referencing and matching identifiers between databases within NCBI and elsewhere: for example, matching a gene ID to the appropriate RNA/protein sequence IDs (gene2accession), to published journal references (gene2pubmed), or to associated human genetic diseases (mim2gene). Some files carry more substantive content, such as gene2go (matching genes to GO terms), gene_info, and interactions (which lists interactions from BIND, BioGrid, EcoCyc, and HPRD).


   The local Entrez mirror is currently based on this format. The table and column names were deliberately matched to the file and header names for ease of use (except where this would cause technical problems, such as dots or spaces in column names). Note that no columns in these tables contain NULL values; the source files usually use a hyphen '-' (and sometimes a '?') as a placeholder instead.
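
   As a minimal sketch of how a consumer might handle these placeholders when reading one of the flat files: the function name `read_entrez_rows`, the assumption that the first line is a tab-delimited header, and the sample rows are all illustrative and are not taken from the DW code.

```python
import csv
import io

# Values the Entrez flat files use where a database would store NULL.
PLACEHOLDERS = {"-", "?"}


def read_entrez_rows(handle):
    """Yield one dict per data row, mapping placeholder values to None.

    Assumes the first line is a tab-delimited header, optionally
    prefixed with '#'.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    header[0] = header[0].lstrip("#")
    for fields in reader:
        yield {
            name: (None if value in PLACEHOLDERS else value)
            for name, value in zip(header, fields)
        }


# Illustrative sample only (a small subset of gene_info-style columns).
sample = io.StringIO(
    "#tax_id\tGeneID\tSymbol\tLocusTag\n"
    "9606\t4535\tND1\t-\n"
    "9606\t4536\tND2\t?\n"
)
rows = list(read_entrez_rows(sample))
```

   Identifiers are deliberately left as strings here; only the placeholder cells are converted.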


   The advantages of this format are ease of use and the fact that the files cover all species available from NCBI. Local views for subsets of interest can be created as well.
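
   A species-specific view might look like the sketch below. It uses an in-memory SQLite database purely as a stand-in for the local mirror; the view name `gene_info_human`, the reduced column set, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the local Entrez mirror (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_info (tax_id TEXT, GeneID TEXT, Symbol TEXT)")
conn.executemany(
    "INSERT INTO gene_info VALUES (?, ?, ?)",
    [
        ("9606", "4535", "ND1"),       # illustrative human row
        ("10090", "17716", "mt-Nd1"),  # illustrative mouse row
    ],
)

# A view restricted to one species (9606 is the NCBI taxonomy ID for human).
conn.execute(
    "CREATE VIEW gene_info_human AS "
    "SELECT * FROM gene_info WHERE tax_id = '9606'"
)

human_symbols = [
    row[0] for row in conn.execute("SELECT Symbol FROM gene_info_human")
]
```

   The all-species table stays untouched, so the same pattern can be repeated for each species of interest.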


   2. ASN.1 binary format: This compressed format can be converted to XML using a program distributed with the Entrez release. The resulting XML is bulky, and its structure is defined by NCBI DTDs rather than by an XML schema. The files are mainly split on a per-species (or per-class) basis, but all-inclusive files exist as well. The content is comparable to the gene_info table listed above.

   3. Other released files are either less significant or represent different subsets of the data listed above, such as the HIV interactions, which are released as a separate file.

   4. For more information, visit the Entrez FTP site at: ftp://ftp.ncbi.nlm.nih.gov/gene/

=== DW Resources - Ensembl: ===

   1. Ensembl offers its data in many formats, including GenBank files, FASTA files, MySQL databases and others. The DW includes a selective mirror of the MySQL databases, namely the 'Core Database', for each species of interest where available (not available for A. thaliana and E. coli).

   2. Ensembl databases are on a per-species basis. Generally speaking, Ensembl covers fewer species than Entrez, but provides a richer (and more complex) view of those species.

   3. For more information, visit the Ensembl FTP site at: ftp://ftp.ensembl.org/

=== DW Resources - TAIR: ===

   1. The DW includes a selective mirror of TAIR (Arabidopsis Information Resource). The mirror represents a subset of the latest data released by TAIR.


   2. TAIR (from a technical perspective) is of significantly lower quality than the major bioinformatics resources. This is partly because some of its data is contributed by other parties. Full automation is not really an option with this resource. Some of the shortcomings/errors observed (a number of which were reported to, and fixed by, TAIR) that might resurface in future releases:

      * Wrong number of tabs in a tab-delimited file (can break the upload). 
      * Some of the flat files are missing a header column. 
      * Inconsistent treatment of an empty 'cell' in a table (e.g. empty cell vs. a default value), and occasional stray control characters. 
      * HTML tags in names (e.g. pathways, compounds) in the Aracyc files. 
      * Inconsistency in naming the different release files. 
      * Some released files with incomplete/missing README files. 
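
   Several of the problems above can be caught before upload with a simple pre-check. The following is a minimal sketch, not the procedure actually used for the DW: the function names, the expected column count, and the sample lines are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # simple HTML tags such as <i> or </sub>
CONTROL_RE = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")  # control chars except tab/newline


def clean_field(value):
    """Strip HTML tags, control characters and surrounding whitespace."""
    return CONTROL_RE.sub("", TAG_RE.sub("", value)).strip()


def check_tair_lines(lines, expected_columns):
    """Split lines on tabs and return (clean_rows, problems).

    problems holds (line_number, field_count) for every line whose
    field count differs from expected_columns; those lines are skipped.
    """
    rows, problems = [], []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != expected_columns:
            problems.append((number, len(fields)))
            continue
        rows.append([clean_field(field) for field in fields])
    return rows, problems


# Illustrative input: an HTML tag, an extra tab, and a control character.
lines = [
    "AT1G01010\t<i>NAC001</i>\n",
    "AT1G01020\tARV1\textra\n",
    "AT1G01030\tNGA3\x07\n",
]
rows, problems = check_tair_lines(lines, expected_columns=2)
```

   Lines reported in `problems` still need manual review, which matches the point above that full automation is not realistic for this resource. Note that the tag pattern is deliberately crude and would also remove legitimate text that happens to sit between '<' and '>'.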

   
The files/tables loaded are: 
{{{
Genes: 
TAIR8_functional_descriptions 
TAIR8_NCBI_GENEID 
gene_aliases.20080716 

Proteins: 
Quick_interactome2.0 
TAIR8_all.domains 
TairProteinInteraction.20071002 
TargetP_analysis.tair8 

Pathways: 
aracyc_compounds.20080611 
aracyc_dump.20080611
}}}

=== Identifier Mapping Tables: ===

1. Eight tab-delimited files (these can be imported into Excel or a similar package, or parsed directly). The files contain IDs for the following species: S. cerevisiae, C. elegans, A. thaliana, R. norvegicus, M. musculus, H. sapiens, D. melanogaster, E. coli.

2. Each file has the following columns (example from human):

{{{
GMID: 3425 (add a 'GM' prefix: GM3425, if you prefer that)
Ensembl Gene ID: ENSG00000198888
Protein Coding: True/False (cannot be 'N/A')
Gene Name: ND1
Ensembl Transcript ID(s): ENST00000361390
Ensembl Protein ID(s): ENSP00000354687
Uniprot ID(s): P03886
Entrez Gene ID(s): 4535
RefSeq mRNA ID(s): N/A (but could have been something like NM_001088)
RefSeq Protein ID(s): AP_000639
Synonyms: MTND1; NAD1;ND1
Definition: NADH-ubiquinone oxidoreductase chain 1 (EC 1.6.5.3) (NADH dehydrogenase subunit 1). Source: Uniprot/SWISSPROT P03886
}}}

3. The first row in each file is the header row.

4. In general, when no data is available for a specific cell in a particular column, the term 'N/A' (or a similar standard term) is used instead. When there is more than one entry per cell, the entries are separated by a ';'.

5. As a general rule for any importing system, it's better to treat IDs as alphanumeric rather than numeric, since they can be either, depending on the source database referenced.

6. The GMID is an internal identifier that is unique per species, and is stable within a build/release of the IDMapping. It is not stable between different releases and is not unique across different species. The DW itself does not use or reference this ID.

7. The source of the ID Mapping information is the first resource listed in each file. In other words, the reference point is Ensembl for all species, except for A. thaliana and E. coli, where the reference point is Entrez. Exceptions:

   a) Synonyms are accumulated from both resources, then filtered to provide a unique list of synonyms per gene.
   b) RefSeq info always comes from Entrez.

8. Synonyms are case-sensitive, so (example from human) ChM1L (for Ensembl ENSG00000000005) and CHM1L (for the matching Entrez 64102) are listed as two different synonyms.


9. Note that the version number for an identifier (RefSeq mRNA and protein IDs) is ignored/truncated. So, NM_001088.1 and NM_001088.2 will be listed as NM_001088 (once). This approach is followed in some other bioinformatics tools when version numbers are irrelevant.

10. Ensembl and Entrez are the resources used for all species, except for A. thaliana (where the resources are Entrez and TAIR) and E. coli (just Entrez).
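
The cell conventions in points 4, 5, 7, 8 and 9 can be sketched as a few small helpers. This is an illustration under stated assumptions, not the code that builds the mapping files: the helper names are hypothetical, and the second synonym list is invented for the example.

```python
def split_multi(cell):
    """Split a ';'-separated cell; 'N/A' or an empty cell gives an empty list."""
    cell = cell.strip()
    if not cell or cell == "N/A":
        return []
    return [part.strip() for part in cell.split(";") if part.strip()]


def strip_version(refseq_id):
    """Drop a trailing '.N' version suffix: 'NM_001088.2' -> 'NM_001088'."""
    base, dot, version = refseq_id.rpartition(".")
    return base if dot and version.isdigit() else refseq_id


def unique_in_order(values):
    """Remove exact duplicates, keeping first-seen order and letter case."""
    return list(dict.fromkeys(values))


# Synonyms accumulated from two resources, then filtered to a unique list.
ensembl_synonyms = split_multi("MTND1; NAD1;ND1")
entrez_synonyms = split_multi("ND1;MTND1")  # invented for illustration
merged = unique_in_order(ensembl_synonyms + entrez_synonyms)

# Case-sensitive comparison keeps both spellings.
case_variants = unique_in_order(["ChM1L", "CHM1L"])

# Versioned RefSeq IDs collapse to a single unversioned entry.
refseq = unique_in_order(strip_version(i) for i in ["NM_001088.1", "NM_001088.2"])

# IDs stay as strings, even when they look numeric.
entrez_ids = split_multi("4535")
```

All values stay as strings throughout, so an importer built this way never has to guess whether a given ID column is numeric.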


=== Identifier Validation: ===

   1. Duplicate Gene Symbols: For some of the species in the identifier mapping tables (e.g. A. thaliana), the gene name (aka gene symbol) is not unique among all the genes for that species. In other words, a gene symbol might be shared between two or more genes within the same species. Possible solutions:
         a. Delete the troublesome gene symbol from the IDMapping file for all of the affected genes, so that it never matches a search query in the first place. 
         a. Delete all the affected gene entries. This would lead to the loss of useful (and reliable) information. 
         a. Keep all. The GM front end and the GM engine should, when faced with such a case, ask the user which gene they are referring to. 
         a. Keep the symbol for one gene and delete it from all the others, on the assumption (still to be confirmed) that genes sharing a symbol are likely to be very similar in features. 

   2. Duplicate Uniprot IDs: This is the case where the same Uniprot ID is shared between two or more genes. This should be handled in a similar fashion to the previous case. 

   3. Generally speaking, the mapping between the identifiers from the two resources (e.g. Ensembl and Entrez) is not perfect. There will therefore be some 'left over' identifiers from the second resource (e.g. Entrez) that have to be captured separately and added to the same ID Mapping file. For the Ensembl/Entrez example, the columns for these entries will be similar to the ones listed above, minus any Ensembl-specific information (Ensembl Gene ID, Ensembl Transcript ID, Ensembl Protein ID). This requires filtering the IDs from the two resources against each other and filling in the missing information when possible, for example by taking the gene definition line from Entrez instead of Ensembl. 

   4. There will always be mismatches between the two resources. For the Ensembl/Entrez example, Ensembl might reference Entrez gene IDs that no longer exist in Entrez. To handle that, the ID validation modules should check for the presence of every Entrez gene ID and, if it is missing, drop it from the ID Mapping file. This built-in mismatch problem will affect any ID cross-referencing process.
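
   The checks in points 1, 2 and 4 can be sketched as follows. This is only an illustration of the first option under point 1 and of the stale-ID check; the record layout (`gmid`, `gene_name`, `entrez_ids`), the function names, and the sample records are hypothetical.

```python
from collections import defaultdict


def duplicated_values(records, key):
    """Return {value: [GMIDs]} for values of `key` shared by 2+ records."""
    owners = defaultdict(list)
    for record in records:
        value = record.get(key)
        if value:
            owners[value].append(record["gmid"])
    return {value: ids for value, ids in owners.items() if len(ids) > 1}


def blank_duplicates(records, key):
    """Remove a shared value from every record that carries it."""
    shared = duplicated_values(records, key)
    for record in records:
        if record.get(key) in shared:
            record[key] = None
    return shared


def drop_stale_entrez_ids(records, known_entrez_ids):
    """Keep only Entrez gene IDs present in the current Entrez set."""
    for record in records:
        record["entrez_ids"] = [
            gene_id
            for gene_id in record["entrez_ids"]
            if gene_id in known_entrez_ids
        ]


# Made-up records: two genes share a symbol, one references a stale ID.
records = [
    {"gmid": "1", "gene_name": "ABC1", "entrez_ids": ["100"]},
    {"gmid": "2", "gene_name": "ABC1", "entrez_ids": ["200", "999"]},
    {"gmid": "3", "gene_name": "XYZ2", "entrez_ids": []},
]
shared = blank_duplicates(records, "gene_name")
drop_stale_entrez_ids(records, known_entrez_ids={"100", "200"})
```

   The same `blank_duplicates` call with a different key would cover the duplicate Uniprot ID case, provided that column holds a single value per record. The returned `shared` mapping can be logged so that the affected genes remain traceable.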

=== More DW Documents: ===

 * [:../GeneManiaDataCollection: GeneMANIA Data Collection]