GeneMania Data Warehouse (DW)
OBO Stuff
Course Project
This segment was written to support a student's project (Winter 2009). It can serve as a general background on ontology support (namely GO).
1. Input files:
- Gene Association File (aka GO Annotation File):
  - http://www.geneontology.org/GO.current.annotations.shtml?all
  - The supported species are: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Homo sapiens, Mus musculus, Rattus norvegicus, Saccharomyces cerevisiae.
  - Fields of interest are: DB & DB_Object_ID (this combination specifies the source for the subject gene/protein), Qualifier, GO ID, Evidence code (controlled vocabulary), DB_Object_Type.
 
- OBO File:
  - http://www.geneontology.org/GO.downloads.ontology.shtml
  - All entries are terms except the typedefs at the bottom.
  - Fields of interest are: id (i.e. GO ID, cross-referenced with the association file), name, namespace (controlled vocabulary: biological_process, cellular_component, or molecular_function, which correspond to the aspect codes P, C, and F in the association file), relationship (capture all), is_obsolete (when present and set to 'true', the term is skipped).
  - As a reminder, a node might have more than one parent and more than one child. Note that the is_a relationship defines a parent-child relationship.
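As a rough illustration of reading the two input files, here is a minimal Python sketch. It assumes the standard tab-delimited association file layout (comment lines start with '!', and DB, DB_Object_ID, Qualifier, GO ID, Evidence code and DB_Object_Type are columns 1, 2, 4, 5, 7 and 12) and the usual OBO stanza layout ([Term] / [Typedef] headers followed by "tag: value" lines). The names parse_gaf, parse_obo and Annotation are hypothetical helpers made up for this sketch, not part of any GO tool.

```python
from collections import namedtuple

# Fields of interest from one association-file row (names are this sketch's own).
Annotation = namedtuple(
    "Annotation", "db object_id qualifier go_id evidence object_type"
)


def parse_gaf(lines):
    """Yield an Annotation for each data row, skipping '!' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:  # DB_Object_Type is the 12th column
            continue
        yield Annotation(cols[0], cols[1], cols[3], cols[4], cols[6], cols[11])


def parse_obo(lines):
    """Return {GO ID: term dict} for every non-obsolete [Term] stanza."""
    terms = {}
    current = None  # the [Term] stanza being read; None inside other stanzas

    def finish(term):
        # Keep the stanza only if it is a term with an id and is not obsolete.
        if term is not None and term["id"] and not term["obsolete"]:
            terms[term["id"]] = term

    for raw in lines:
        line = raw.strip()
        if line.startswith("["):  # a new stanza begins, e.g. [Term] or [Typedef]
            finish(current)
            if line == "[Term]":
                current = {"id": None, "name": "", "namespace": "",
                           "parents": [], "obsolete": False}
            else:
                current = None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag in ("id", "name", "namespace"):
                current[tag] = value.strip()
            elif tag == "is_a":
                # Drop the trailing "! term name" comment.
                current["parents"].append(("is_a", value.split("!")[0].strip()))
            elif tag == "relationship":
                # "relationship: <type> <GO ID> ! <name>"; capture every type.
                parts = value.split("!")[0].split()
                if len(parts) >= 2:
                    current["parents"].append((parts[0], parts[1]))
            elif tag == "is_obsolete":
                current["obsolete"] = value.strip() == "true"
    finish(current)
    return terms
```

Both functions accept any iterable of lines, so an open file handle can be passed directly. Storing every relationship as a (type, target) pair keeps the "capture all" requirement and leaves the choice of which types to follow to a later step.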
 
2. Output file: A tab-delimited flat file, of the format: Gene ID <tab> GO ID <tab> Evidence Code.
- As a reminder, the mapping between Gene ID and GO ID is many-to-many. To produce this file, we propagate annotations (and their evidence codes) upward from the more specific categories to the more general ones, as implied by the "true path rule". The upward propagation step is identical for each species' association file. There is one output file per species.
- The 'Gene ID' column can be of the format: Gene_Source:Gene_ID.
- There is no concatenation of terms/evidence codes in an output file's entry: each line holds exactly one GO term and one evidence code.
- No redundancies, i.e. no repeated gene-GO-Evidence code entries in the output file.
- As an example:
The protein Q9UNF0-2 from UniprotKB is matched to GO term GO:0045806, and this term is a subclass of (among others) GO:0030100. So, the output file will include:
  UniprotKB:Q9UNF0-2 GO:0045806 P IEA
  UniprotKB:Q9UNF0-2 GO:0030100 P IEA
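The upward propagation and de-duplication described above can be sketched as follows. This is only an illustration under stated assumptions: the ontology is given as a dictionary mapping each GO ID to a list of (relationship type, parent GO ID) pairs, only is_a links are followed by default (which relationship types to follow is a project decision), and annotations to terms missing from the dictionary (e.g. obsolete ones) are dropped. The function names ancestors, propagate and write_rows are hypothetical. The sketch writes the three columns named in the output format; the namespace letter shown in the example lines is not emitted.

```python
def ancestors(go_id, parents, follow=("is_a",)):
    """Return every term reachable from go_id through the followed relations."""
    seen = set()
    stack = [go_id]
    while stack:
        term = stack.pop()
        for rel, parent in parents.get(term, ()):
            # The 'seen' set handles terms with several parents and
            # guards against revisiting a node.
            if rel in follow and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents, follow=("is_a",)):
    """Expand (gene, GO ID, evidence) triples to all ancestors, without repeats."""
    rows = set()  # a set removes redundant gene-GO-evidence entries
    for gene, go_id, evidence in annotations:
        if go_id not in parents:  # unknown or obsolete term: skip (assumption)
            continue
        rows.add((gene, go_id, evidence))
        for ancestor in ancestors(go_id, parents, follow):
            rows.add((gene, ancestor, evidence))
    return sorted(rows)


def write_rows(rows, out):
    """Write one tab-delimited line per (gene, GO ID, evidence) triple."""
    for row in rows:
        out.write("\t".join(row) + "\n")


# Toy ontology holding only the link from the example above.
toy_parents = {
    "GO:0045806": [("is_a", "GO:0030100")],
    "GO:0030100": [],
}
toy_rows = propagate([("UniprotKB:Q9UNF0-2", "GO:0045806", "IEA")], toy_parents)
```

Because the same functions work on any parent dictionary and any list of triples, the step can be run unchanged against each species' association file, producing one output file per species.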
Comments
- As a technical tip, it's better to keep using the same released files until you get the input/output pipeline working, so that you have a stable set of data to compare against. After that, you can switch to the latest OBO file and a more recent association file, if any.
- Please refer to the GO documentation for more info about the file formats, controlled vocabulary lists (e.g. evidence codes), and publicly available tools (if interested).
- Other issues:
  - Listing the subject type (protein, gene, etc.).
  - Matching a non-gene subject type to the respective gene.
  - Matching the source type to one of the major resources the DW currently supports.
  - Listing the qualifier 'NOT' next to the GO term, i.e. 'NOT:GO:0030100'.
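The Gene_Source:Gene_ID key and the 'NOT' prefix could be produced with two small helpers like the ones below. This is a sketch only: gene_key and go_label are hypothetical names, and the code assumes that the Qualifier column may hold several values separated by '|' (for example 'NOT|contributes_to'), as in the association file format.

```python
def gene_key(db, object_id):
    """Build the 'Gene_Source:Gene_ID' form used in the Gene ID column."""
    return db + ":" + object_id


def go_label(go_id, qualifier):
    """Prefix the GO ID with 'NOT:' when the Qualifier column contains NOT."""
    flags = [q.strip().upper() for q in qualifier.split("|") if q.strip()]
    return "NOT:" + go_id if "NOT" in flags else go_id
```

For instance, go_label("GO:0030100", "NOT") gives 'NOT:GO:0030100', while an empty qualifier leaves the GO ID unchanged.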
 
