#acl RashadBadrawi:read,write,revert
#acl All:read

== BIND To PSI-MI 2.5 Translation ==

=== Introduction ===
BIND (the Biomolecular Interaction Network Database) is one of the major freely available biological interaction resources. It was built over several years through manual curation. BIND curated, in detail, over 200,000 molecular interactions and over 3,750 biological complexes involving over 1,000 species. Although BIND is still accessible through the 'BOND' site (http://bond.unleashedinformatics.com/), the data has been dormant for a few years and, unfortunately, the wealth of knowledge it carries has been underutilized by the scientific community, mainly because the data is not available in a current 'community' standard format.

=== Objective ===
This project aims to resolve this problem by translating BIND to the PSI-MI 2.5 (molecular interactions) standard and, consequently, making it available to interaction repositories like PathwayCommons (PC). The project covers ''almost'' every interaction, for every species, available in BIND, but not every detail about every interaction.

=== BIND In Entrez ===
Entrez provides some interaction data as part of its regular release, which partially covers BIND. For each BIND interaction in Entrez, the following information is listed (if available): the interactants' organism, Entrez gene ID, Entrez protein accession, and name; the PubMed ID(s); and a textual description of the interaction. The total number of BIND interactions captured is close to 12,500, spread over around 30 species. Most of the records have not been modified since 2005. There are no records for BIND complexes.

=== Other BIND Translations ===
A prototype for a BIND to PSI-MI 1.0 translation was developed in the summer of 2006 at the University of Toronto, using XSLT. To the best of my knowledge, this translation was never put into real use.
=== BIND Analysis ===
This analysis is not necessarily limited to the BIND elements that were translated. Please refer to the attached Excel file for the referenced spreadsheets.

[[attachment:BINDAnalysis.xls]]

 1. The BIND repository is available as a set of XML files. The repository has a total of 1,591 files, with a total size of approximately 16 gigabytes. The actual repository used in the translation has 1,621 files, since seven of the bulky XML files were split into smaller chunks for easier handling. These files are: taxid4932.1.xml, taxid4932.2.xml, taxid562.1.xml, taxid7227.1.xml, taxid9606.1.xml, taxid9606.2.xml, taxid10090.1.xml. Content was not altered during this process. A species is represented by one or more files. The timestamps in all files go back to 2006 (or earlier).
 2. The value of BIND lies in its manually curated content more than in its 'automatically' generated data (interactants' synonyms, for example). The latter can be acquired by separate processes, and the BIND version of it might be out of date.
 3. The BIND XML files should be thought of as a database dump rather than a refined selection of interaction data that is ready to be translated.
 4. Some of the heavily used BIND elements are data types of an internal, 'operational' nature, meaningful mainly to the BIND team itself. These bear little significance at this point. Such elements include the update history of an interaction record in the BIND database, curator information, some timestamps, and others.
 5. Some heavily used elements are focused on structural display and graphical representation of a complex. These are not of much interest, either.
 6. The numbers in the element survey (below) are biased against biological complexes, since the survey does not differentiate between interaction and complex counts, and complex entries are vastly outnumbered by interactions in BIND.
 7. Comment: One possible limitation we might face in the future is that the PSI-MI 2.5 to BioPAX converter does not support 'attribute lists'.
 These are the extra storage areas in the PSI-MI 2.5 schema where additional details about an interaction (those that do not fit under any of the explicit PSI-MI 2.5 elements) can be stored.

==== Element Survey ====
 1. The attached spreadsheets are the product of a 'stats generator'. They aim to show the following:
  a. The XML elements (and their distinct paths) that actually appear in the BIND repository (vs. the ones that are part of the BIND schema but are never referenced, if any). This is a general indicator of how much of a data model is exercised by its 'instances'.
  a. The 'abundance' of each element, i.e. in how many files a specific element is used and the total number of times it appears in all files combined.
  a. The distinct XML elements (more accurately, paths) that bear data and can be translated. For the BIND schema, this was defined as an element that either carries attributes or is a leaf element, usually carrying content (some general exceptions below). This analysis approach is specifically useful for BIND and may or may not be informative for other XML schemas.
 2. The spreadsheet 'BINDElementsUnique' lists the element names (648 elements in total) used in the whole BIND repository.
 3. The spreadsheet 'BINDElementStats' lists all the BIND elements, the unique XML path to each element, its 'abundance', and its 'significance' (i.e. whether it bears content or not). The columns used are:
  a. Element Name: The BIND XML schema element name; not unique.
  a. Full Path: The unique path to the XML element. This column is unique. In some cases, different paths to the same element name differ in context (and not just because a complex type is reused).
  a. Total Files: The total number of BIND XML models/files in which this element appears.
  a. Total Count: The total number of times this element is used across all of the files.
  a. HasChildren/HasAttributes/HasText: These reflect whether an element bears useful content or not, as described above. A column value is set to 'true' if the element satisfies that condition at least once, in any entry, in any BIND file.
 4. There are 3,300 unique paths, of which 1,403 bear content (total usage: 72,706,810 times), including the elements with attributes (273 of them). As with many other data models, most of the content is concentrated in a small subset of elements that may or may not be of biological value. About 580 of the 1,403 data-carrying XML elements are used fewer than 100 times across 1,620 files, which is negligible.
 5. As a ''rough'' guideline for 'abundance' (and ignoring the complexes), a data element that appears in every interaction (e.g. an interactant's name) should appear 212,031 times or more. There is always the chance that such an element is set to a dummy value (as described later), but that is not taken into consideration in this context.
 6. The spreadsheet 'BINDElementStatsBrief' offers a different view of the same data.
 7. To put the BIND data into perspective, the same stats generator was run against another interaction database: IntAct (over 172,000 interactions), released in PSI-MI 2.5 XML format (a total of 435 files, 1.33 gigabytes in size, organized mainly by species). The two spreadsheets named 'IntActElementsUnique' and 'IntActElementsStats' display the results of this comparative analysis. The number of XML elements used across IntAct is about 60, the number of unique paths is 216, and about 128 of those carry data. This shows the difference in data complexity. It also indirectly shows the value of standardizing a data representation.
 8. A comprehensive translation, however, should focus on the scientific value of a component, in addition to its 'abundance'. The former cannot be captured in this survey.
 In addition, the survey does not take into account the size of the content of an element, but based on my knowledge of the BIND model, this is not a major concern.
 9. Phases I-III (see project phases below) of this project handle the core translation for ''all'' interactions/complexes. Our approach is to cover every interaction possible without capturing all the details available, since capturing them would require more resources. Those additional details are left for future enhancements, and this survey will serve as a partial guideline for that process as well.

==== BIND XML Schema/Data Model ====
The BIND data model was originally represented in ASN.1 notation. It makes extensive use of earlier standard NCBI ASN.1 components (about 20) for representing biological entities like publications, sequences, and others. The BIND XML schema was generated automatically from the ASN.1 format.

The BIND XML schema/data model is significantly complicated. This is partially due to the richness of the resource itself, but also because of the manner in which the XML schema was built. The model heavily favors 'subelements' over attributes, values, and customized complex types. A user sifts through a lot of XML elements to get to a useful piece of information, in addition to what I would describe as issues of over-defining and naming schemes.
For example, to represent one type of PubMed ID for the experimental condition of an interaction in BIND, starting from the root XML element:
{{{
BIND-Submit.BIND-Submit_interactions.BIND-Interaction-set.BIND-Interaction-set_interactions.BIND-Interaction.BIND-Interaction_descr.BIND-descr.BIND-descr_cond.BIND-condition-set.BIND-condition-set_conditions.BIND-condition.BIND-condition_source.BIND-pub-set.BIND-pub-set_pubs.BIND-pub-object.BIND-pub-object_pub.Pub.Pub_medline.Medline-entry.Medline-entry_pmid.PubMedId
}}}
To represent a similar element using PSI-MI 2.5:
{{{
entrySet.entry.interactionList.interaction.experimentList.experimentDescription.bibref.xref.primaryRef[@id:]
}}}
Another example is the definition of an interactant's type (e.g. protein, RNA, etc.). It is more natural to define this as an attribute of the referenced object (in this case, 'interactant') and not as a series of subelements (aside from the issue of handling the identifier format needed):
{{{
BIND-Submit.BIND-Submit_interactions.BIND-Interaction-set.BIND-Interaction-set_interactions.BIND-Interaction.BIND-Interaction_a.BIND-object.BIND-object_id.BIND-object-type-id.BIND-object-type-id_protein
}}}
Such issues, in addition to the sheer size of the data itself, significantly complicated the implementation and debugging of the translation.

==== BIND Data Quality ====
====== Schema Validation ======
Generally speaking, the BIND models are semantically valid. All BIND files were validated against the BIND schema and its associated schemas (mainly earlier NCBI schemas), and most of them checked out fine, with minor exceptions:

 1. Some BIND files did not stick to their own controlled vocabulary for the experimental method description (i.e. the 'BIND-experimental-system' element). It is set to 'microarray' in the human files about 406 times, and to 'synthetic-lethal-sick-test' in the yeast files about 474 times.
 These two types are not part of the BIND controlled vocabulary for describing an experiment and thus failed the schema validation.
 2. Based on my understanding of BIND, the experimental method value for these cases should have been set to 'other', with a free-text description added separately under the element 'BIND-condition_descr'. This error, however, should not interfere with the translation, since our focus is on using PSI-MI's CVs and not BIND's CVs (see milestones section).

====== Data Issues/Bad data ======
However, the BIND models do suffer from missing data at times, and poor data representation at others. Here are some examples:

 1. The external reference object in BIND carries a database name and one of two identifiers (either a string or an integer). This could be modeled in a regular XML schema by using a 'choice' type to enforce such a structure. However, BIND treats all of these elements as required and, as such, sets one of them to a valid value and the other to a dummy (which varies as well). This is a case of poor design rather than bad data, and it is misleading. For example (from human, taxid9606.1.xml):
 {{{
 ... Klotho -1 KLM0000304
 }}}
 In this example, the first identifier value (set to '-1' here, but '0' in many other cases) should be ignored, and the second identifier should be used. In other cases it is the other way around (the second identifier type is set to 'NULL'). An even more interesting example (from taxid6239.1.xml, listing a WormBase x-ref):
 {{{
 ... WP:NULL ...
 }}}
 2. There are, however, a few exceptions (occurring 30 times in total) to this x-ref data representation in BIND, namely when both the string and integer fields are used to represent an external reference. All of these cases are limited to identifiers for the Merck Index or Gene Ontology. The BIND translator keeps both values. For example (from the files taxid10090_PSIMI25.xml and taxid9606_PSIMI25.xml, respectively):
 {{{
 ... ...
 }}}
 {{{
 ... ...
 }}}
 3. The following excerpts are examples of completely erroneous x-ref entries:
 {{{
 ... DBR1 ... ... LocusLink 0 0 .... SGD 0 ...
 }}}
 Another example (for an OMIM x-ref, in the Homo sapiens file):
 {{{
 ... MCM10 ... OMIM 0 - ...
 }}}
 4. Different elements can be set to 'NULL' or zero. These are either ignored during the translation, if not mandated by the PSI-MI 2.5 schema, or set to a default ('NO_VALUE'). Future phases of this project might be able to fill in some of these cases, such as interactant names/descriptions, via the external references listed, if any.
 5. In mouse (from taxid10090.1.xml, also in the file taxid0.1), the interactant names/descriptions are listed as 'NULL' 154 times:
 {{{
 ... NULL ... NULL ...
 }}}
 This is misleading and unnecessary, even if the information is missing (and required by the XML schema), since the BIND model allows setting values to a standard default term ('unknown').
 6. Sometimes an interactant's organism identifier is missing. This information was not filled in automatically, since doing so would rely on the assumption that every interactant belongs to the species/file it is listed under.
 7. Some BIND complexes reference BIND interactions with a negative BIND interaction ID. For example (file: taxid9606.2.xml.part2):
 {{{
 -52023
 }}}
 8. A few interactions (e.g. interaction ID 18988 on the BOND web portal) have an erroneous 'BIND Division Type': ''"BIND Taxroot|Record where all molecules are from any organism except those within TaxID 4751 or TaxID 33208"''. This turns out to be a copy-and-paste from the BIND readme.txt file and is not a valid BIND division type. It is translated as ''"BIND Taxroot"''.
 9. In a few cases, the SGD x-ref for an interactant is listed not as an integer but in the format ''"SGD: XXX"'' (e.g. ''"SGD: S000003663"'').
 10. In a few cases, there is an inconsistency between the manner in which the MGI identifier is displayed and what it represents (example from the file taxid10090_PSIMI25.xml, shown in its wrong PSI-MI format):
 {{{
 ... ...
 }}}
 11. A few BIND interactants list RefSeq as an external reference. However, the identifiers listed are not RefSeq identifiers; they are, in fact, GI identifiers. This is fixed during translation. Examples include (from the file taxid2336.1.xml, shown in its wrong PSI-MI format):
 {{{
 ... ... ...
 }}}
 12. There are rare byte sequences in the BIND XML files that are not valid UTF-8. This occurs in some textual fields (such as descriptions).
 13. A significant number of the Entrez GI identifiers (GI is the primary identifier in BIND) are outdated or deprecated. See the BIND Refresh section.
 14. The names of biological resources used as external references lack a consistent, unified representation. See the CV Mapping section.

====== Filtering BIND's 'Bad' Entries ======
The BIND translator attempts to 'fix' data within an entry without removing the entry itself, if possible. The 'Data Issues/Bad data' section lists examples of such erroneous or dummy data. However, some interaction entries may be missing information so vital that the interaction is deemed 'incomplete' without it (regardless of the reason behind the missing elements, e.g. biologically unavailable, source data issue, etc.). This type of information includes: interaction names/titles, interactant names, complex names, complex subunit names, and all references to these interactions from complexes. Such interactions are dropped from the translated files but are tracked in a log file (see the file BAD_ENTRIES.txt in the release's folder):

 1. The log file lists all dropped interactions in PSI-MI (XML) interaction format. This keeps them consistent with how an interaction is listed in the production files. The file itself is not a valid PSI-MI file.
 It is generated per run/build of the translation process and covers whichever species are included in that build.
 2. For most entries in this file, there is usually a 'NO_VALUE' set somewhere for a vital element/attribute. This is the translator's unified representation for a whole variety of bad/missing data representations in BIND (such as 'Unknown', 'NULL', 'unknown', etc.).
 3. For each such entry, an attribute is added to its attribute list specifying the name of the file it came from. The file name is given as the full path to its current location. For example:
 {{{
 .... BIND Taxroot /Volumes/Groups/PathwayCommons/BIND/BIND_Data/ taxid83333.1.xml
 }}}

=== Support for BIND Complexes ===
The translation included a first release of BIND complexes. Some of the choices might be revised later. Attached is a 'ComplexesInventory' spreadsheet that specifies the number of complexes in each of the input, as well as the output, files:

[[attachment:ComplexesInventory.xls]]

 1. BIND complexes are modeled as PSI-MI interaction elements (which accommodate representing complexes). PSI-MI does not provide explicit support for complexes as separate entities (i.e. there is no element called 'complex'). Importing systems, or converters to other standards, should distinguish a BIND interaction from a BIND complex by capturing one of the unique BIND complex attributes, such as the 'order flag' for subunits (see the 'attribute list' example below).
 2. The core elements translated (when available) for complexes include: BIND complex ID, total number of components, order flag, complex ref PubMed ID, and interaction ID references. For each of the complex's subunits: the interactionRef for the subunit, the subunit number, and whether it is interactant A or B (in the BIND interaction referenced), in addition to the same details translated for an interaction's interactant. All subunit types are supported (i.e. protein, small molecule, etc.).
 3. The BIND 'subunit source' elements, if present, are modeled as PSI-MI's ''participant.interactionRef''. If not present, the usual ''participant.interactant'' details are taken from the BIND complex subunit. We could have included both (i.e. the subunit details and its interactionRef), but the PSI-MI schema does not allow that (it is either-or).
 4. To save the publication references, we have to add a 'dummy' experimental description under the PSI-MI interaction to hold the bibliographical references (bibref).
 5. In general, the interaction ID for an interaction in the PSI-MI format is a sequentially generated number that is unique across species/files within a BIND translation build. Each PSI-MI interaction saves the original BIND interaction ID as a primary reference. BIND complex entries reference interactions by their BIND interaction ID, so a mapping between the two identifiers was needed. The 'BIND interaction ID' referenced by a complex subunit is mapped to the respective auto-generated PSI-MI interaction ID in the same PSI-MI file (within the same species), and the latter is used instead (otherwise, the PSI-MI file would not be valid). However, all the BIND interaction IDs listed for a complex are still listed separately for that interaction (as an additional PSI-MI interaction attribute).
 6. Note that there might be interactions in a complex's interaction list that are not referenced by any of the individual complex subunits.
 7. If a complex points to a 'bad' interaction (within the same species), that complex is labelled as 'bad' as well and is treated as such. Otherwise, we would have pointers to interaction IDs that do not exist for that species. This may have resulted in a large number of 'bad' complexes, an outcome that might be revised.
 8. In some cases, complexes may reference interactions in other species. There is no way of verifying the validity of these references in the same build of the BIND translation.
 Those references are translated as is, and the complexes are labelled as 'bad data'.
 9. The translated complex elements that are not explicitly supported in the PSI-MI schema are added as optional interaction attributes. PSI-MI does not seem to define detailed CV attributes for complexes (except a very general 'complex attribute'). We define our own attribute names, which are used consistently. For example (file: taxid3702_PSIMI25.xml), at the participant (aka interactant) level:
 {{{
 37313 BIND Interactant B 181913 2
 }}}
 At the interaction (i.e. complex) level:
 {{{
 37313;37311 181913;181895 false 3 ...
 }}}

=== BIND Refresh: ID Mapping ===
 1. BIND's primary interactant reference is the GI. Entrez gene IDs are its secondary reference. GIs are used about 346,142 times vs. 168,630 times for Entrez gene IDs. Both may be available for the same interactant. Besides the fact that GIs may not be the most popular identifier around, some of the IDs used are out of date or have been deprecated.
 2. The purpose of the ID mapping step in the BIND translation process was to map GIs (and later on Entrez gene IDs) to UniProt accession(s), since these are more common. This facilitates ID matching and 'unification' of interactants (within the same resource or across different interaction/pathway resources) by importing systems.
 3. The mapping involved the following steps:
  a. Using the ID mapping files released by UniProt (release 15.5) to map GIs/Entrez IDs. The system searched for matching UniProt entries within the same species the interactant was listed under in BIND. If none was found, the search was conducted against any species. This uncovered some 'species mismatch' cases (likely another outdated-data issue).
  a. For the unmatched GIs, we used the Batch Entrez site (http://www.ncbi.nlm.nih.gov/sites/batchentrez), a batch lookup tool by NCBI. Searches were conducted against both the nucleotide and the protein databases.
  a. UniProt identifiers were added (in a non-redundant fashion) as secondary x-refs to the translated files, as a supplement to GIs/Entrez IDs.
 4. The relationship between UniProt and GI identifiers is zero-to-many from both ends. All the mapped UniProt IDs are listed as secondary x-refs of reference type 'identity' (we could make them of type 'see-also', but that would defeat the purpose).
 5. After running the first mapping step against all BIND interactions, 43,841 GIs were matched in UniProt and 16,196 were not. Note that running the same mapping again may give a slightly different count, due to using different file builds from UniProt.
 6. After running the second mapping step, most of the remaining GIs were found in Batch Entrez. Some had been renamed/replaced, and the mapping retrieved the current GI for that interactant and mapped it to RefSeq whenever possible. 2,158 GIs were not matched to anything at all and are most likely deprecated. The original (even if replaced/deprecated) GIs were kept in the XML output file, to allow unification with data repositories that rely on BIND GIs.
 7. PSI-MI reftype for GIs: During our translation, we do not assume that all GIs used for protein interactants in BIND point to protein sequences. After attempting to map a GI identifier, the CV mapping for the GI is as follows:
  a. If a GI is matched to a UniProt accession, or was retrieved from the 'protein database' via Batch Entrez (regardless of whether it points to a RefSeq sequence or not), it is matched to the PSI-MI CV: genbank_protein_gi (MI:0851)
  a. If a GI was retrieved from the 'nucleotide database' via Batch Entrez, it is matched to the CV: genbank_nucl_gi (MI:0852)
  a. Otherwise, it is matched to the more general/parent CV: genbank identifier (MI:0860). This is the case for deprecated GIs, etc.

=== BIND To PSI-MI 2.5 CV Mapping ===
[[attachment:CV_Mapping.xls]]

 1. The attached spreadsheet covers most of the controlled vocabulary mapping between BIND and PSI-MI for 3 CV categories: interactant external references (x-ref), interactor type, and interaction detection method.
 2. For each CV term listed, the spreadsheet includes the following columns:
  a. CV Type: Category of the controlled vocabulary mapped.
  a. Source Term: The term as it is used in the source (in this case, BIND).
  a. Source ID: The unique identifier, if any, for this term in the source data model.
  a. Target Term: The term as it is used in the target (in this case, PSI-MI).
  a. Target ID: The unique identifier, if any, for this term in the target data model.
 3. For x-refs, the spreadsheet has additional columns for each of the interactant types in BIND, specifying the equivalent PSI-MI reference type (refType) for that particular x-ref and interactant type combination. The last two interactant type columns ('unspecified', 'photon') are not referenced by any of the interactant x-refs. A photon is not recognized as an interactant type in PSI-MI; it is used in about 291 interactions in BIND.
 4. An 'N/A' in the interactant type columns means that the particular x-ref/interactant type combination was not present in the source data.
 5. Unfortunately, there is no CV for x-refs in BIND. As a first step, we produced a comprehensive list of all the interactant external references used in BIND, before mapping them to their PSI-MI counterparts. Although the list reflects the richness of the x-refs BIND uses, it also shows the poor/redundant data representation (e.g. ''Merck Index'', ''Merck Index #'', ''MerckIndex'': 3 representations for the same data source).
 6. As elaborated elsewhere, there are two interaction detection methods used in BIND that are not part of its CV for that category.
 7. Whenever the 'generic term' for a CV category is used as a target term (for a non-generic source term), it indicates that there is currently no suitable match for that term in PSI-MI. For example, ''"participant xref;dictyBase"'' indicates that PSI-MI has no CV term for the external bio-source dictyBase and, as such, it is listed as is when translated (i.e. dictyBase). We will request updating the PSI-MI CVs to support all of these terms.
 8. A slightly modified version of this spreadsheet is fed into the translation process. This design for the CV mapping accommodates 1-to-1 mappings between the source and the target. We could have invested more in the design of the CV mapping, as a new XML schema or something similar, but that would have been overhead. For more complicated CV mappings (e.g. the reftypes for the x-ref WormBase), the detailed business rules are handled at the object level.
 9. Handling WormBase reference types (best available options):
  a. If the interactant is a protein (regardless of whether the x-ref has the 'WormBase:' prefix or is free-style, like 'ZK858.4'), use 'gene product'. If the x-ref has the prefix 'WP:' or 'CE:', use 'identity'.
  a. If the interactant is of type DNA (x-ref always has the 'WormBase:' prefix), use 'identity'.
  a. If the interactant is of type RNA (x-ref always has the 'WormBase:' prefix), use 'gene product'.
  a. If the interactant is a gene (x-ref always has the 'WP:' prefix), use 'gene product'.
 10. Reference types are also set for the x-refs added during the translation process (UniProt IDs, RefSeqs, ...). Those x-refs have the *_IDMAP suffix.
 11. Some assumptions were made when setting the reference types for the GI-nucleotide x-ref. We set the reference type to 'identity' if the interactor (if it existed) was a DNA or a gene, assuming that the DNA/gene is encoding (otherwise, it would have been 'see-also').
 12. We include both the CV term and its respective PSI-MI unique ID, in case some importing systems read one but not the other. The BIND and PubMed entries in the CV mapping spreadsheet are listed under the x-ref section for convenience, although neither of them is considered an interactant x-ref in BIND.

=== Translation Notes ===
Various translation notes:

 1. '''BIND Division''': We translate the 'BIND Division Type' for an interaction, as applicable. The following list covers all the different division types used in BIND:
 {{{
 MGI BIND-3DBP BIND Taxroot BIND-3DSM MIPS BIND Metazoa HDRES RefBIND Taxroot HIV-HPID BIND-3DFI BIND Fungi SGD FlyBase
 }}}
 Since there is no suitable mapping for this information in the PSI-MI schema, it is translated as an (additional) PSI-MI interaction attribute, e.g.:
 {{{
 ... MGI ...
 }}}
 2. '''BIND Species''': [[attachment:BINDSpeciesNames.xls]]
  a. As mentioned earlier, BIND interaction files are organized by the species they reference. Their file names (for the most part) reflect the species ID. The attached spreadsheet lists the names of all the BIND files, the species ID (of the species to which the bulk of the interactants in that file belong), and the species' scientific and common names.
  b. A small number of taxonomy IDs (76 in total) do not have a match in Entrez Taxonomy. This is due to one of the following reasons:
   - Dummy taxid in the file name, like taxid_0.
   - The species ID has been renamed (possibly merged) or deprecated. For example, the species ID 11489 used in BIND has been changed to 132504, 36377 is the same as 362651, and so on.
  This issue is very limited in scope and is mostly related to some micro-organisms. It is another example of the consequences of not keeping a data repository up to date, and it is one of the 'BIND data' issues to be addressed later.
  c. BIND files do include inter-species interactions.
  For example, the human interactions file may include interactions where the first interactant is from Homo sapiens and the second is from some micro-organism. The translation process does not double-check whether the corresponding interaction file for that micro-organism (if any) lists the same interaction already listed in the human file. It is the responsibility of any importing system to check for that, if the issue matters for its internal data repository.
 3. '''Revised Batch Process''': A few species in BIND are represented by more than one XML file. Furthermore, we split the bulky BIND files early on to allow processing with the SAX parser. The translation process, however, bundles the translated output of the BIND files that belong to the same species (judging by the file name 'taxid...') and generates a single output file per species. This is neater and is a prerequisite for supporting BIND complexes, so that a complex never references an interaction that actually falls in a different file (for the same species).
 4. '''PSI-MI Validation''': The BIND PSI-MI files are validated against their schema after each release, and this validation succeeds. In addition, PSI-MI offers a semantic validator for PSI-MI models. It can be accessed on the web (http://www.ebi.ac.uk/intact/validator/start.xhtml) or incorporated into one's program. This validator is still in beta. The online version (which has a 10 MB size limit) was used with a sample of our BIND PSI-MI files to tighten our implementation of PSI-MI controlled vocabularies. Our files did receive a warning and an error, both stemming from implementation choices made on purpose for lack of better options at the moment. These are explained below for future reference:
  a. Not all BIND interactions contain a PubMed ID reference. Since such a reference is required by the PSI-MI schema, we chose to use the BIND interaction ID for that interaction as a (general) reference.
This causes the validator to warn about the fact that the reference we use does not stem from pubmed. a. The PSI-MI CV does not satisfy all the needs for BIND terms. In the absence of a suitable PSI-MI CV term for an experimental method, we list the BIND experimental method detection ID number (BIND exp. methods IDs: 24 and 32), instead of the equivalent PSI-MI ID number, since no equivalent exists. Unfortunately, the validator only recognizes exp. methods stemming from PSI-MI CVs, and not any others, which is an unnecessary limitation. We could have remedied that by mapping such BIND terms to the general mapping (i.e. 'experimental interaction detection'), but that may lead importing systems to miss some more specific info/details. === BIND 'Mining' === 1. Since BIND represents one of the major resources of biological interactions, it makes sense to prototype some basic mining for the BIND data produced, before/independent of any importing system's repository. This allows generating pre-defined reports and/or stats about the BIND data. It leads to a deeper understanding of 'data trends', particularly if future efforts in BIND Translation focus on more biochemical details of interactions. It is also useful in general data validation and sanity checks. 2. The prototyped BIND miner selected a handful of attributes to track and report on. They were chosen because they come from different levels in the PSI-MI model, and because some of them benefit from the use of CVs. The attributes tracked are: interactor types, interaction detection methods used in interactions, interactor references to external databases, and the variety in the number of subunits in complexes. The stats for these attributes within the 8 favorite/selected species were generated. The sample results are displayed in the following spreadsheet (although future reports may use a graphical representation, instead): [[attachment:BIND_DataAnalysis_SelectedSpecies.xls]] 3.
The first spreadsheet shows the total counts for each attribute, while the rest show them on a per-species/file basis. 4. The x-ref report also included x-refs added during the translation process itself. The stats for an x-ref do not count unique occurrences, so an x-ref is counted every time it appears (which may happen often, if the interactant it references appears often). 5. The report does not take into consideration the 'bad interactions' which were filtered out. === Design and Technical Notes === ==== Design And WorkFlow ==== 1. The overall theme is to abstract the design, but to keep the implementation down to earth by supporting all the specifics and unpredictable special cases in data. This allows handling issues at hand now, while facilitating software reuse later on. 2. Technical 'Byproducts': Several functionalities were developed for supporting the BIND translation process. These include: a. XML element survey: A standalone program that generates stats about each XML element in an XML file. It was very handy in BIND translation. Example outputs for both BIND and IntAct are in an attached spreadsheet (see 'BIND Analysis' section). a. ID mapping: These are listed under the data warehouse documentation. a. BIND Miner: A standalone program that generates specific stats about BIND models that are supplied in PSI-MI XML format. The user has the option of generating the reports using the PSI-MI java API (object model representation of the interactions) or plain XML API (JDOM objects - elements and attributes). 3. Design bits: The translation framework is implemented as a small class hierarchy, to support type control for potential future PSI-MI translators. A small factory pattern/method is used to select the appropriate translator, specified by the user at the command line. This is a 'placeholder' for the future, since only one translator is currently available (i.e. BIND to PSI-MI 2.5).
The BIND-complex translator class was implemented as a subclass of the main BIND translator class, and that might serve as a good example to follow, if translating other major and well-defined components in BIND. The BIND Miner prototype is implemented as a small class hierarchy with polymorphic calls. There are partial singleton implementations (e.g. in container-like classes), as well as utility design for other classes, like the tags and util classes. 4. The steps required for handling GIs in BIND that were not readily mapped to Uniprot involved querying the Batch Entrez tool and re-mapping those GIs, if feasible. That is the major, one-time, partially automated step in this translation. Otherwise, running the BIND Translator now is an automated process. 5. The narrated 'activity workflow' for the translation process is listed below. Note that this is a high level representation that does not delve into the details/special cases of data mapping: a. Load, from a local file, the x-path/element names (handlers) for the BIND elements of interest and make them available throughout the rest of the translation. a. Load the current CV mapping between BIND and PSI-MI and make it available throughout the rest of the translation. Then, for each of the BIND files translated: a. Initialize the ID mapping process with Uniprot. This process loads the Uniprot GI/Entrez gene mappings for the species/file being translated at the moment. The species ID is strictly expected to be part of the input BIND file name (format: taxidXXX.foo, where 'XXX' is the species ID). a. Begin the translation of interactions. Traverse each BIND interaction tree while building the PSI-MI XML tree. a. When a 'supported' CV term element is encountered, the CV Mapper is called upon to provide the correct mapping (if any). a. When a GI/Entrez gene ID is encountered (for an interactant), the ID mapping service is called.
If no match is available within the same species, it is called upon across different species (see comment on 'species mismatch' elsewhere). If none matched, it is called upon to look into the supplementary mappings (i.e. output from mappings done through Batch Entrez, mentioned earlier). a. If an interaction is missing 'vital info', label it as 'bad'. a. If the user requested translating complexes, do those as well. Add 'bad' labels if needed. a. Dump output. Dump bad interactions separately. ==== Technologies ==== All the source code for this project is written in Java, is 'home grown', and is available through these wiki pages. Other tools and third party packages used include: 1. XML Spy - Professional Edition (version 2009, SP1) from Altova, Inc (http://www.altova.com/): An XML IDE that was used mostly for visualizing/analyzing XML schemas, and for preliminary XML file validation. 2. AltovaXML from Altova, Inc (http://www.altova.com): A freely available command line tool, with various XML related functionalities. It was used for batch XML file validation at the MS-DOS prompt. 3. NetBeans (version 6.7), a freely available Java IDE sponsored by Sun Microsystems (http://www.sun.com), used for Java (1.6) development. 4. JDOM (version 1.1) was used as an XML API and Xerces (version 2.0) as the XML parser of choice. 5. Miscellaneous commands (at the prompt) were helpful, in addition to minor use of the PSI-MI java API. ==== Code Notes ==== 1. The BIND Translator, and its associated classes, are divided over two packages: a. Translator package: Responsible for the actual translation, and sets the framework for such translators in general. a. Translation Utility package: Satellite functionalities related to translations. 2. There are 13 classes (app. 3,250 lines of code): a. Translator package: Trans/PSIMITrans, BINDTags, PSIMITags, CVMapEntry, BINDToPSIMITrans, BINDToPSIMITransComplex. a.
Translation Utility package: XMLUtil, XMLStats, XMLElementInfo, PSIMIMiner, PSIMIMinerXML, PSIMIMinerModel. 3. More Technical Documents: a. PC Translator source (18/12/2009): [[attachment:BINDTrans_src_20091218.zip]] === Project Releases/Milestones === 1. All the BIND Translation related documents, including the different releases of the translation, are accessible from the Bader Lab domain, under the folder: /Volumes/Groups/PathwayCommons/BIND/BINDTranslation. 2. Note that in some releases, some of the PSI-MI XML files may not contain any interactions. This is due to the incremental nature of the different project phases, for one or more of the following reasons: - The equivalent source files happen to include complexes only, which we did not support at the time of that release. And/or - Those files do not carry any interactions where both interactants are proteins, and the release was focused on that. And/or - All the interactions in these files are deemed 'bad interactions' (i.e. missing vital info) and were removed from the production file and logged. However, these 'empty' files are kept for inventory tracking purposes. 3. The following describes the folders for the different releases for BIND Translation. ==== Phase I ==== 1. Created a BIND data/element analysis, for future reference. 2. Provided a BIND to PSI-MI 2.5 translation for the core interaction elements. The analysis/requirements for this translation were reverse-engineered from the earlier XSLT prototype, scaling from version 1.0 to 2.5, when necessary, with some modifications as well. This core translation covered, at least: interaction related identifiers/description, interactants' names and descriptions, interactants' external references (e.g. GI IDs, Entrez Gene IDs, etc.), experimental condition details, species info, and publication identifiers. 3. Only interactions where both sides are defined as proteins were covered. 4.
The translation targeted eight species: human, mouse, rat, yeast, E. coli, thale cress, fruit fly, and worm. 5. The PSI-MI 2.5 files were converted to BIOPAX-Level 2, using the available converter, and loaded in a local PC instance. This indirectly involved installing/setting up this instance. 6. Initial support for PSI-MI controlled vocabularies was added later. ''Released Files'': - PhaseI_a: PSIMI and BIOPAX files for BIND Interactions for the selected species. - PhaseI_b: Same content as above, but with the initial support of PSI-MI CVs and various enhancements/fixes. ==== Phase II ==== 1. Addressed some issues and hanging threads from phase I. 2. Translated non-protein interaction types (i.e., RNA, DNA, gene, small molecule, complex, and photon). 3. Translated all species (more accurately, all 1,620 BIND files). Converted all to BIOPAX and imported into local PC instance. 4. More detailed PSI-MI CV support. ''Released Files'': - The PSIMI25 files are split into two folders: one for the 'selected species' (the eight favorites listed above), and one for all the other species. - The BIOPAX folder has two sub-folders: BIOPAX_20090626 (BIOPAX files that match all the PSIMI files for this release) and BIOPAX_20090702 (a special BIOPAX file release to test fixes to the PSIMI-Biopax converter) ==== Phase III ==== 1. More focused and expanded releases. 2. BIND - Expand & Refresh: BIND interactants were mapped to Uniprot accessions using both GIs and Entrez Gene IDs (whenever available) through an ID mapping process. ''Released files'': - PhaseIII_a: Focused on humans only. The purpose was to tighten the screws, refine CV mappings, handle some BIND 'data issues', and introduce new elements. - PhaseIII_b: Included all protein-protein interactions for all species/files. First release of BIND ID-mapping - BIND Refresh. Output species files are split over two folders: one for the 'selected species' (listed above), and one for all the other species.
- PhaseIII_c: Second and expanded release of BIND ID-mapping. - PhaseIII_d: An all-inclusive release. This release included a revised release for all non-protein-protein interactions, in addition to a first release of all BIND complexes, plus other enhancements. There is only one file in this release that does not have any interactions in it: taxid64091.1.xml (Halobacterium). The original BIND file has one interaction and one complex for Hop (HR) tetramer, but both the interaction and the complex fail our filtering policy, and are therefore logged in the bad entries file instead. The PSI-MI output file is kept for record keeping, though. - PhaseIII_e: Similar to the previous release, but splits the build into different flavors, exercising the different available build options for the translation process: 'AllInclusive_NoDataCleaning' and 'ProteinsOnlyNoComplexes' folders. The local instance of cpath was refreshed. This is a transient, experimental instance to demonstrate interoperability and is accessible within the Bader lab domain: [[http://192.168.81.140:8080/cpath/ | BIND in PC]]. ==== Beyond Phase III ==== 1. Maintenance and handling more of BIND's 'data issues'. 2. Translating more interaction categories and elements, with the appropriate resource allocation (internal or through outside collaboration).
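
For readers who want to reproduce the per-species bundling described under 'Revised Batch Process', the file-name convention (taxidXXX.foo, where 'XXX' is the species ID) can be sketched roughly as follows. This is only an illustration and is not taken from the released translator source: the class name `TaxidGrouping`, its method names, and the exact regular expression are assumptions made for this example. It also uses newer Java syntax than the Java 1.6 the project was developed with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the released BIND translator classes.
final class TaxidGrouping {

    // Assumed file-name shape: "taxid" + digits + "." + anything, e.g. taxid9606.1.xml
    private static final Pattern TAXID_NAME = Pattern.compile("^taxid(\\d+)\\..+$");

    // Returns the species ID embedded in the file name, or empty if the
    // name does not follow the assumed convention (e.g. the dummy taxid_0).
    static Optional<String> speciesId(String fileName) {
        Matcher m = TAXID_NAME.matcher(fileName);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Groups input file names by species ID, so that all chunks of one
    // species can be translated into a single output file.
    static Map<String, List<String>> groupBySpecies(List<String> fileNames) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : fileNames) {
            speciesId(name).ifPresent(id ->
                groups.computeIfAbsent(id, k -> new ArrayList<>()).add(name));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "taxid9606.1.xml", "taxid9606.2.xml", "taxid562.1.xml", "taxid_0.xml");
        // Names that do not match the convention are skipped in this sketch.
        System.out.println(groupBySpecies(files));
    }
}
```

In this sketch, names that do not carry a numeric species ID are silently skipped. A real batch run would more likely log them, in the same spirit as the separate log of 'bad' interactions.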