GeneMania DataWarehouse (DW)
DW Resources - Entrez:
- The Entrez database, by NCBI, is released in two main formats.
- Flat file format: This consists of 11 main tab-delimited flat files, listed below together with the taxonomy names file (names.dmp):
gene2accession, gene2go, gene2pubmed, gene2refseq, gene2sts, gene2unigene, gene_history, gene_info, mim2gene, gene_refseq_uniprotkb_collab, interactions, names.dmp (taxonomy names)
Most of these files are useful for cross-referencing and matching identifiers between different databases, within NCBI and elsewhere. For example, they match a gene ID to the appropriate RNA/protein sequence IDs (gene2accession), to the published journal references (gene2pubmed), or to the associated human genetic diseases (mim2gene). Some files carry more substantive content: gene2go (matches genes to GO terms), gene_info, and interactions (lists interactions from BIND, BioGRID, EcoCyc, and HPRD).
The local Entrez mirror is currently based on this format. The table and column names were deliberately matched to the file and header names for ease of use, except where this would cause technical problems (such as dots or spaces in column names). Note that these tables contain no NULL values: where a value is missing, the source files usually use a hyphen '-' (and sometimes a '?') instead.
The advantages of this format are its ease of use and the fact that the files include all species information available from NCBI. Local views for the subsets of interest can be created as well.
- ASN.1 binary format: This compressed format can be transformed to XML using a program distributed with the Entrez release. The structure of the bulky XML produced is defined by NCBI DTDs rather than by an XML schema. The data is mainly split on a per-species (or class) basis, but there are all-inclusive files as well. Its content is comparable to the gene_info table listed above.
- Other released files are of less significance or represent different subsets of the data listed above, such as the HIV interactions, which are released as a separate file.
For more information, visit the Entrez FTP site at: ftp://ftp.ncbi.nlm.nih.gov/gene/
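The loading convention described above (table and column names taken from the file header, placeholders kept as-is, local views for subsets) can be sketched as follows. This is a minimal illustration, not the DW loader itself: it uses an in-memory SQLite database instead of the real mirror, the sample rows are invented, only four gene_info-style columns are shown, and the function and view names are hypothetical.

```python
import csv
import io
import sqlite3

# Invented sample rows in the style of gene_info; the real file has more columns.
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tSynonyms\n"
    "9606\t1001\tGENE_A\tGA1|GA2\n"
    "9606\t1002\tGENE_B\t-\n"
    "3702\t2001\tGENE_C\t?\n"
)


def load_table(conn, table, handle):
    """Load a tab-delimited file verbatim, naming columns after the header."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#") for name in next(reader)]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    return header


def create_species_view(conn, table, view, header, tax_id):
    """Create a per-species view that maps '-' and '?' placeholders to NULL."""
    selected = ", ".join(
        f"NULLIF(NULLIF(\"{name}\", '-'), '?') AS \"{name}\"" for name in header
    )
    # SQLite views cannot take bound parameters, so the tax ID is
    # validated as an integer before being written into the statement.
    conn.execute(
        f'CREATE VIEW "{view}" AS SELECT {selected} FROM "{table}" '
        f"WHERE \"tax_id\" = '{int(tax_id)}'"
    )


conn = sqlite3.connect(":memory:")
header = load_table(conn, "gene_info", io.StringIO(SAMPLE))
create_species_view(conn, "gene_info", "gene_info_human", header, 9606)
```

The base table keeps the hyphen and question mark exactly as the source files provide them, while the view presents them as NULL for consumers that prefer that. Whether to normalize in a view or leave it to the caller is a design choice, not something the mirror prescribes.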
DW Resources - Ensembl:
Ensembl offers its data in many formats, including GenBank files, FASTA files, MySQL databases and others. The DW includes a selective mirror of the MySQL databases, namely the 'Core Database', for each of the species of interest, if available (not available for A. thaliana and E. coli).
- Ensembl databases are on a per-species basis. Generally speaking, Ensembl covers fewer species than Entrez, but provides a richer (and more complex) view of those species.
For more information, visit the Ensembl FTP site at: ftp://ftp.ensembl.org/
DW Resources - TAIR:
- The DW includes a selective mirror of TAIR (Arabidopsis Information Resource). The mirror represents a subset of the latest data released by TAIR.
- From a technical perspective, TAIR is of significantly lower quality than the major bioinformatics resources. This is partly because some of its data is contributed by other parties. Full automation is not really an option with this resource. Some of the shortcomings/errors observed (some of which were reported to or fixed by TAIR) that might resurface in future releases:
- Wrong number of tabs in a tab-delimited file (can break the upload).
- Some of the flat files missing a header column.
- Inconsistency in treating an empty 'cell' in a table (e.g. empty cell vs. a default value). Extra control characters also appear occasionally.
- HTML tags in names (e.g. pathways, compounds) in the Aracyc files.
- Inconsistency in naming the different release files.
- Some released files with incomplete/missing README files.
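Several of the problems listed above can be caught with a pre-load check. The following is a rough sketch only, with hypothetical names and invented sample lines: it assumes the first line defines the expected column count, treats anything between '<' and '>' as an HTML tag, and does not try to detect a missing header or reconcile empty cells with default values.

```python
import re

# Anything between angle brackets is treated as an HTML tag (a simplification).
TAG_RE = re.compile(r"<[^>]+>")
# Control characters other than tab (the field delimiter) and newline.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def check_tab_file(lines, expected_columns=None):
    """Return (clean_rows, problems) for an iterable of tab-delimited lines."""
    clean_rows, problems = [], []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if CONTROL_RE.search(line):
            problems.append((number, "control characters removed"))
            line = CONTROL_RE.sub("", line)
        if TAG_RE.search(line):
            problems.append((number, "HTML tags removed"))
            line = TAG_RE.sub("", line)
        fields = line.split("\t")
        if expected_columns is None:
            # Assume the first line has the correct number of columns.
            expected_columns = len(fields)
        if len(fields) != expected_columns:
            problems.append(
                (number, f"expected {expected_columns} columns, found {len(fields)}")
            )
            continue  # leave malformed rows out of the upload
        clean_rows.append(fields)
    return clean_rows, problems


# Invented lines that imitate the defects described above.
SAMPLE = [
    "pathway\tcompound\n",
    "pathway one\t<i>cis</i>-compound\n",
    "pathway two\tcompound\x01 two\n",
    "pathway three\n",
]
rows, problems = check_tab_file(SAMPLE)
```

Rows with the wrong number of tabs are reported and skipped rather than repaired, since the correct fix usually needs a manual look at the source file.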
The files/tables loaded are:
- Genes: TAIR8_functional_descriptions, TAIR8_NCBI_GENEID, gene_aliases.20080716
- Proteins: Quick_interactome2.0, TAIR8_all.domains, TairProteinInteraction.20071002, TargetP_analysis.tair8
- Pathways: aracyc_compounds.20080611, aracyc_dump.20080611
DW Resources - Uniprot:
This resource is used by the ID mapping service needed for the BIND Translation project. The resource and its associated classes focus on the ID mapping portion of UniProt, which is released in conjunction with Entrez/RefSeq.
The files/tables loaded are:
- IDMapGen (matches the idmapping.dat file)
- IDMap (matches the idmapping_selected.tab file)
These map UniProt accessions to a wide range of identifiers. Our focus is on Ensembl, Entrez Gene ID, RefSeq protein, and GI.
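As an illustration of restricting the mapping to the identifier types of interest, the sketch below filters lines in the three-column layout of idmapping.dat (accession, ID type, identifier). It is not the DW's mapping service: the function name is hypothetical, the ID type labels are assumed from the public UniProt ID mapping README and should be checked against the release in use, and the accessions and identifiers in the sample are invented placeholders.

```python
from collections import defaultdict

# ID type labels assumed from the UniProt ID mapping README; verify before use.
WANTED_TYPES = {"Ensembl", "GeneID", "RefSeq", "GI"}


def collect_mappings(lines, wanted=WANTED_TYPES):
    """Group the selected identifier types by UniProt accession."""
    mappings = defaultdict(lambda: defaultdict(list))
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not have exactly three columns
        accession, id_type, identifier = parts
        if id_type in wanted:
            mappings[accession][id_type].append(identifier)
    return mappings


# Invented placeholder lines; none of these are real accessions or IDs.
SAMPLE = [
    "X00001\tGeneID\t1001\n",
    "X00001\tRefSeq\tNP_FAKE1.1\n",
    "X00001\tPDB\t1ABC\n",
    "X00002\tEnsembl\tENSG_FAKE_0001\n",
]
mappings = collect_mappings(SAMPLE)
```

Each accession can map to several identifiers of the same type, so the values are kept as lists rather than single strings.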