GeneMania Data Warehouse (DW)
OBO Stuff
1. As part of the BIND Translation project, some work was done on handling OBO files locally. This was neither the target nor the focus of that project, but early on it was deemed a better way to integrate the PSI-MI ontology terms (released as an OBO file) 'directly' into the in-house translation process. Although this approach was later dropped, the classes developed for it remain, and are part of the DW source I posted (DW_src_20091218). The changes introduced to the class hierarchy have also added significant flexibility to the architecture.
2. The OBO Reader is fed an ontology in OBO format and holds the terms as objects in memory. As a proof of concept, it 'flattens the terms' and dumps them in a special tab format, which is handy when the focus is not on the hierarchy itself. For an example, see OboStuff.xls.
3. The reader was written with the PSI-MI file in mind, but it also processes GO, and it coped fairly well with a random bunch of OBO files from various sources (from http://www.obofoundry.org/). Limitations: the reader does not track replacements for obsolete terms, and does not support the created_by/creation_date, replaced_by, alt_id, and consider tags. Anyone is welcome to use or refactor it if interested.
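The 'flatten the terms' idea can be illustrated with a minimal Python sketch. This is not the DW reader itself: the function names parse_obo_terms and flatten_terms are hypothetical, the column layout (id, name, pipe-joined is_a parents) is an assumption since the actual tab format is only available in OboStuff.xls, and the sample stanzas are a hand-written illustration.

```python
from collections import defaultdict


def parse_obo_terms(lines):
    """Collect each [Term] stanza as a dict mapping tag -> list of values."""
    terms = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept (typedefs are skipped).
            current = defaultdict(list) if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ":" in line:
            tag, _, value = line.partition(":")
            # Drop the trailing '! comment' text that follows a referenced ID.
            current[tag.strip()].append(value.split(" ! ")[0].strip())
    return terms


def flatten_terms(terms):
    """Yield one tab-delimited row per term (assumed layout: id, name, parents)."""
    for term in terms:
        yield "\t".join([
            term["id"][0],
            term["name"][0] if term["name"] else "",
            "|".join(term["is_a"]),
        ])


# Illustrative input only; a real run would read the lines of an OBO file.
sample = """format-version: 1.2

[Term]
id: MI:0000
name: molecular interaction

[Term]
id: MI:0001
name: interaction detection method
is_a: MI:0000 ! molecular interaction

[Typedef]
id: part_of
name: part of
""".splitlines()

rows = list(flatten_terms(parse_obo_terms(sample)))
```

Like the reader described above, this sketch ignores replaced_by, alt_id, and consider tags when flattening.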
Course Project
This segment was written to support a student's project (Winter 2009). It can serve as a general background on ontology support (namely GO).
1. Input files:
- Gene Association File (aka GO Annotation File):
http://www.geneontology.org/GO.current.annotations.shtml?all
The supported species are: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Homo sapiens, Mus musculus, Rattus norvegicus, Saccharomyces cerevisiae. Fields of interest are: DB & DB_Object_ID (this combo specifies the source for the subject gene/protein), Qualifier, GO ID, Evidence code (controlled vocabulary), DB_Object_Type.
- OBO File:
http://www.geneontology.org/GO.downloads.ontology.shtml
All entries are terms except the typedefs at the bottom. Fields of interest are: id (i.e. the GO ID, cross-referenced with the association file), name, namespace (controlled vocabulary: biological_process, cellular_component, or molecular_function; the association file abbreviates these as P, C, and F in its Aspect column), relationship (capture all), is_obsolete (when present and set to 'true', the term is skipped). As a reminder, a node may have more than one parent and more than one child. Note that the is_a tag defines a parent-child relationship.
2. Output file: A tab delimited flat file, of the format: Gene ID <tab> GO ID <tab> Evidence Code.
- As a reminder, the mapping between Gene ID and GO ID is many-to-many. To produce this file, we up-propagate annotations (and their evidence codes) from the more specific categories to the more general ones, as implied by the "true path rule". The up-propagation step is identical for each species' association file. There is one output file per species.
- The 'Gene ID' column can be of the format: Gene_Source:Gene_ID.
- There is no concatenation of terms/evidence codes in an output file's entry.
- No redundancies, i.e. no repeated gene/GO ID/evidence code entries in the output file.
- As an example:
The protein Q9UNF0-2 from UniprotKB is matched to GO term GO:0045806, and this term is a subclass of (among others) GO:0030100. So, the output file will include:
UniprotKB:Q9UNF0-2 GO:0045806 P IEA
UniprotKB:Q9UNF0-2 GO:0030100 P IEA
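The up-propagation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the names ancestors and propagate are hypothetical, the parents map is assumed to have been built from the is_a tags of the OBO file, only is_a links are followed, qualifiers are ignored, and the sketch emits the three columns named in the output format. The column positions (DB, DB_Object_ID, GO ID, and Evidence Code as columns 1, 2, 5, and 7) follow the GAF layout; the unused columns in the sample line are left empty rather than filled with invented values.

```python
def ancestors(go_id, parents, cache):
    """Return every term reachable from go_id by following is_a links upward."""
    if go_id not in cache:
        found = set()
        for parent in parents.get(go_id, ()):
            found.add(parent)
            found |= ancestors(parent, parents, cache)
        cache[go_id] = found
    return cache[go_id]


def propagate(gaf_lines, parents):
    """Return sorted, de-duplicated (gene, GO ID, evidence) rows after up-propagation."""
    rows = set()  # a set removes repeated gene/GO ID/evidence entries
    cache = {}
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        db, obj_id, go_id, evidence = cols[0], cols[1], cols[4], cols[6]
        gene = f"{db}:{obj_id}"  # Gene_Source:Gene_ID
        for term in {go_id} | ancestors(go_id, parents, cache):
            rows.add((gene, term, evidence))
    return sorted(rows)


# Fragment of the hierarchy from the example above; real data has more parents.
parents = {"GO:0045806": ["GO:0030100"]}

# A 15-column association line with only the fields used here filled in.
fields = ["UniprotKB", "Q9UNF0-2", "", "", "GO:0045806", "", "IEA",
          "", "P", "", "", "protein", "", "", ""]
gaf_line = "\t".join(fields)

# The duplicated input line shows that repeated entries collapse to one row.
output_rows = propagate(["!gaf-version: 2.0", gaf_line, gaf_line], parents)
output_text = "\n".join("\t".join(row) for row in output_rows)
```

Caching the ancestor sets keeps repeated lookups cheap, since many genes are annotated to the same terms.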
Comments
- As a technical tip, it's better to keep using the same released files until you get the input/output pipeline working, so that you have a stable set of data to compare against. After that, you can switch to the latest OBO file and a more recent association file, if any.
- Please refer to the GO documentation for more info about the file formats, controlled vocabulary lists (e.g. evidence codes), and publicly available tools (if interested).
- Other issues:
- Listing the subject type (protein, gene, etc.). Matching a non-gene subject type to the respective gene. Matching the source type to one of the major resources the DW currently supports.
- Listing the qualifier 'NOT' next to the GO term, e.g. 'NOT:GO:0030100'.
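One way the 'NOT' listing could look, as a small Python sketch. The helper name go_label is hypothetical, and the sketch assumes the Qualifier column holds pipe-separated flags, as in the GAF format.

```python
def go_label(qualifier, go_id):
    """Prefix the GO ID with 'NOT:' when the qualifier field contains the NOT flag."""
    flags = qualifier.split("|") if qualifier else []
    return f"NOT:{go_id}" if "NOT" in flags else go_id


negated = go_label("NOT", "GO:0030100")
plain = go_label("", "GO:0030100")
```

Whether a negated annotation should also be up-propagated is a separate decision that this sketch does not address.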