#acl RashadBadrawi:read,write,revert
#acl All:read

== BIND To PSI-MI 2.5 Translation ==

=== Introduction ===
BIND (the Biomolecular Interaction Network Database) is one of the major freely available resources for biological interactions. It was built over several years through a manual curation process. BIND curated, in detail, over 200,000 molecular interactions and over 3750 biological complexes for hundreds of species. Although BIND is still accessible through the 'BOND' site (http://bond.unleashedinformatics.com/), the data has been dormant for a few years and, unfortunately, the wealth of knowledge it carries has been underutilized by the scientific community, mainly because the data is not available in a current 'community' standard format.

=== Objective ===
This project aims to resolve this problem by translating BIND to the PSI-MI 2.5 (molecular interactions) standard and, consequently, making it available through PathwayCommons (PC). The project will cover ''almost'' every interaction, for every species, available in BIND, but not every detail about every interaction.

=== BIND In Entrez ===
Entrez provides some interaction data as part of its regular release, which partially covers BIND. For each BIND interaction in Entrez, the following information is listed (if available): the interactants' organism, Entrez gene ID, Entrez protein accession, and name; the pubmed ID(s); and a textual description of the interaction. The total number of BIND interactions captured is close to 12,500, spread over around 30 species. Most of the records have not been modified since 2005. There are no records for BIND complexes.

=== Other BIND Translations ===
A prototype BIND to PSI-MI 1.0 translation was developed at the University of Toronto in the summer of 2006, using XSLT. To the best of my knowledge, this translation was never put into real use.
=== BIND Analysis ===
This analysis is not necessarily limited to the BIND elements that were translated. Please refer to the attached Excel file for the referenced spreadsheets.

[[attachment:BINDAnalysis.xls]]

 1. The BIND repository is available in the form of XML files. The repository has a total of 1,591 files, with a total size of approximately 16 gigabytes. The actual repository used in the translation has 1,621 files, since seven of the bulky XML files were split into smaller chunks for easier handling. These files are: taxid4932.1.xml, taxid4932.2.xml, taxid562.1.xml, taxid7227.1.xml, taxid9606.1.xml, taxid9606.2.xml, taxid10090.1.xml. Content was not altered during this process. A species is represented by one or more files. The timestamps in all files date back to 2006 or earlier.
 2. The value of BIND lies in its manually curated content more than in its 'automatically' generated data (interactants' synonyms, for example). The latter can be acquired by separate processes, and the BIND version of it might be out of date.
 3. The BIND XML files should be thought of as a database dump rather than a refined selection of interaction data that is ready to be translated.
 4. Some of the heavily used BIND elements are data types of an internal, 'operational' nature, meaningful mainly to the BIND team itself. These bear little significance at this point. Such elements include the update history of an interaction record in the BIND database, curator information, some timestamps, and others.
 5. Some heavily used elements are focused on the structural display and graphical representation of a complex. These are not of much interest, either.
 6. The numbers in the element survey (below) are biased against biological complexes, since the survey does not differentiate between interaction and complex counts, and complexes are far outnumbered by interactions in BIND.
 7. Comment: One possible limitation we might face in the future is that the PSI-MI 2.5 to Biopax converter does not support 'attribute Lists'.
These are the extra storage areas in the PSI-MI 2.5 schema where additional details about an interaction (those that do not fit under any of the explicit PSI-MI 2.5 elements) can be stored.

==== Element Survey ====
 1. The attached spreadsheets are the product of a 'stats generator'. They aim to show the following:
   a. The XML elements (and their distinct paths) that actually appear in the BIND repository (as opposed to the ones that are part of the BIND schema but are never referenced, if any). This is a general indicator of how much of a data model is exercised by its 'instances'.
   a. The 'abundance' of each element, i.e. the number of files in which a specific element is used and the total number of times it appears in all files combined.
   a. The distinct XML elements (more accurately, paths) that bear data and can be translated. In the case of the BIND schema, this was defined as an element that either carries attributes or is a leaf element, usually carrying content (some general exceptions below). This analysis approach is specifically useful for BIND and may or may not be very informative for other XML schemas.
 2. The spreadsheet 'BINDElementsUnique' lists the element names (648 elements in total) used across the whole BIND repository.
 3. The spreadsheet 'BINDElementStats' lists all the BIND elements, the unique XML path to each element, its 'abundance', and its 'significance' (i.e. whether it bears content or not). The columns used are:
   a. Element Name: The BIND XML schema element name. Not unique.
   a. Full Path: The unique path to the XML element. Unique column. In some cases, different paths to the same element name differ in context (and not just because a complex type is reused).
   a. Total Files: The total number of BIND XML models/files in which this element appears.
   a. Total Count: The total number of times this element is used in all of the files.
   a.
HasChildren/HasAttributes/HasText: These reflect whether an element bears useful content, as described above. A column value is set to 'true' if the element satisfies that condition at least once, in any entry, in any BIND file.
 4. There are 3300 unique paths, of which 1403 bear content (total usage: 72,706,810 times), including the elements with attributes (273 of them). As with many other data models, most of the content is concentrated in a small subset of elements, which may or may not be of biological value. About 580 of the 1403 data-carrying XML elements are used fewer than 100 times across 1620 files, which is negligible.
 5. As a ''rough'' guideline for 'abundance' (and ignoring the complexes), a data element that appears in every interaction (e.g. an interactant's name) should appear 212,031 times or more. There is always the chance that such an element is set to a dummy value (as described later), but that is not taken into consideration in this context.
 6. The spreadsheet 'BINDElementStatsBrief' offers a different view of the same data.
 7. To put the BIND data into perspective, the same stats generator was run against another interaction database: the IntAct database (over 172,000 interactions), released in PSI-MI 2.5 XML format (a total of 435 files, 1.33 gigabytes in size, organized mainly by species). The two spreadsheets named 'IntActElementsUnique' and 'IntActElementsStats' display the results of this comparative analysis. IntAct uses about 60 XML elements in total and has 216 unique paths, about 128 of which carry data. This shows the difference in data complexity. It also shows, indirectly, the value of standardizing a data representation.
 8. A comprehensive translation, however, should focus on the scientific value of a component, in addition to its 'abundance'. The former cannot be captured in this survey.
In addition, the survey does not take into account the size of an element's content but, based on my knowledge of the BIND model, this is not a major concern.
 9. Phases I & II (see project phases below) of this project handle the core translation for ''all'' interactions/complexes. Our approach is to cover every interaction possible without capturing all the details available, since capturing them all would need more resources. Those additional details are left for future enhancements, and this survey will serve as a partial guideline for that process as well.

==== BIND XML Schema/Data Model ====
The BIND data model was originally represented in ASN.1 notation. It makes extensive use of earlier standard NCBI ASN.1 components (about 20) for representing biological entities such as publications and sequences. The BIND XML schema was generated automatically from the ASN.1 format.

The BIND XML schema/data model is significantly complicated. This is partially due to the richness of the resource itself, but also to the manner in which the XML schema was built. The model heavily favors 'subelements' over attributes, values, and customized complex types. A user has to sift through many XML elements to reach a useful piece of information, on top of what I would describe as issues of over-definition and naming schemes.
For example, to represent one type of pubmed ID for the experimental condition of an interaction in BIND, starting from the root XML element:
{{{
BIND-Submit.BIND-Submit_interactions.BIND-Interaction-set.BIND-Interaction-set_interactions.BIND-Interaction.BIND-Interaction_descr.BIND-descr.BIND-descr_cond.BIND-condition-set.BIND-condition-set_conditions.BIND-condition.BIND-condition_source.BIND-pub-set.BIND-pub-set_pubs.BIND-pub-object.BIND-pub-object_pub.Pub.Pub_medline.Medline-entry.Medline-entry_pmid.PubMedId
}}}
To represent a similar element, using PSI-MI 2.5:
{{{
entrySet.entry.interactionList.interaction.experimentList.experimentDescription.bibref.xref.primaryRef[@id:]
}}}
Another example is the definition of an interactant's type (e.g. protein, RNA, etc.). It is more natural to define this as an attribute of the referenced object (in this case, 'interactant') and not as a series of subelements (aside from the issue of handling the identifier format needed):
{{{
BIND-Submit.BIND-Submit_interactions.BIND-Interaction-set.BIND-Interaction-set_interactions.BIND-Interaction.BIND-Interaction_a.BIND-object.BIND-object_id.BIND-object-type-id.BIND-object-type-id_protein
}}}
Such issues, in addition to the sheer size of the data itself, significantly complicated the implementation and debugging of the translation.

==== BIND Data Quality ====
====== Schema Validation ======
Generally speaking, the BIND models are semantically valid. All BIND files were validated against the BIND schema and its associated schemas (mainly earlier NCBI schemas), and most of them checked out fine, with minor exceptions:
 1. Some BIND files did not stick to their own controlled vocabulary for the experimental method description (i.e. the 'BIND-experimental-system' element). It is set to 'microarray' in the human files, about 406 times; it is set to 'synthetic-lethal-sick-test' in the yeast files, about 474 times.
These two types are not part of the BIND controlled vocabulary for describing an experiment and thus failed the schema validation.
 2. Based on my understanding of BIND, the experimental method value for these cases should have been set to 'other', with a free-text description added separately under the element 'BIND-condition_descr'. This error, however, should not interfere with the translation, since our focus is to use PSI-MI's CVs and not BIND's CVs (see milestones section).

====== Data Issues/Bad data ======
However, the BIND models do suffer from missing data at times, and from poor data representation at others. Here are some examples:
 1. The external reference object in BIND carries a database name and one of two identifiers (either a string or an integer data type). This could be modeled in a regular XML schema with a 'choice' type that semantically enforces such a structure. However, BIND treats all of these elements as required and, as such, sets one of them to a valid value and the other to a dummy value (which also varies). This is a case of poor design rather than bad data, and it is misleading. For example (from human, taxid9606.1.xml): {{{ ... Klotho -1 KLM0000304 }}} In this example, the first identifier value (set to '-1' in this case, but to '0' in many others) should be ignored, and the second identifier should be used. In other cases it is the other way around (the second identifier type is set to 'NULL'). An even more interesting example (from taxid6239.1.xml, listing a Wormbase xref): {{{ ... WP:NULL ... }}}
 2. Different elements can be set to 'NULL' or zero. During the translation these are either ignored, if not mandated by the PSI-MI 2.5 schema, or set to a default ('NO_VALUE'). Future phases of this project might be able to fill in some of these cases, such as interactant names/descriptions, via the external references listed, if any. There are also cases where the interaction ID (a 'primary key') is set to zero.
 3. In mouse (from taxid10090.1.xml, also in the file taxid0.1), the interactant names/descriptions are listed as 'NULL' 154 times: {{{ ... NULL ... NULL ... }}} This is misleading and unnecessary, even if the information is missing (and required by the XML schema), since the BIND model allows values to be set to a standard default term ('unknown').
 4. Sometimes an interactant's organism identifiers are missing. This information was not filled in automatically; that is, the translation did not assume that every interactant belongs to the species/file under which it is listed.
 5. Some BIND complexes reference BIND interactions with a negative BIND Interaction ID. For example (file: taxid9606.2.xml.part2): {{{ -52023 }}}

====== Filtering BIND's 'Bad' Entries ======

=== Support for BIND Complexes ===
The translation included a first release of BIND complexes. Some of the choices might be revised later. Attached is a 'ComplexesInventory' spreadsheet that specifies the number of complexes in each of the input and output files:

[[attachment:ComplexesInventory.xls]]

 1. BIND complexes are modeled as PSI-MI interaction elements (which accommodate the representation of complexes).
 2. The core elements translated (when available) for complexes include: BIND complex ID, total number of components, order flag, complex ref pubmed ID, and interaction ID references. For each of the complex's subunits: the interactionRef for the subunit, the subunit number, and whether it is interactant A or B (in the BIND interaction referenced), in addition to the same details translated for an interaction's interactant. All subunit types are supported (i.e. protein, small molecule, etc.).
 3. The BIND 'subunit source' elements, if present, are modeled as PSI-MI's ''participant.interactionRef''. If not present, the usual ''participant.interactant'' details are taken from the BIND complex subunit. We could have included both (i.e.
the subunit details and its interactionRef), but that is not allowed in the PSI-MI schema (i.e. it is either-or).
 4. To save the publication references, we have to add a 'dummy' experimental description under the PSI-MI interaction, which takes bibliographical references (bib-ref).
 5. In general, the interaction ID for an interaction in the PSI-MI format is a sequentially generated number that is unique across species/files within a BIND Translation build. Each PSI-MI interaction saves the original BIND interaction ID as a primary reference. BIND complex entries reference interactions by their BIND interaction ID, so a mapping between the two identifiers was needed. The 'BIND interaction ID' referenced by a complex subunit is mapped to the respective auto-generated PSI-MI interaction ID in the same PSI-MI file (within the same species), and the latter is used instead (otherwise, the PSI-MI file would not be valid). However, all the BIND interaction IDs listed for a complex are still listed separately for that interaction (as an additional PSI-MI interaction attribute).
 6. Note that there might be interactions in a complex's interaction list that are not referenced by any of the individual complex subunits.
 7. If a complex points to a 'bad' interaction (within the same species), that complex is labelled as 'bad' as well, and is treated as such. Otherwise, we would have pointers to interaction IDs that do not exist for that species. This may have resulted in a large number of 'bad' complexes, an outcome that might be revised.
 8. In some cases, complexes may reference interactions in other species. There is no way of verifying the validity of these references within the same build of the BIND translation. Those references are translated as is, and the complexes are kept in the release for now.
 9. The translated complex elements that are not explicitly supported in the PSI-MI schema are added as optional interaction attributes.
PSI-MI does not seem to define detailed CV attributes for complexes (except a very general 'complex attribute'). We define our own attribute names and use them consistently. For example (file: taxid3702_PSIMI25.xml): At the participant (aka interactant) level: {{{ 37313 BIND Interactant B 181913 2 }}} At the interaction (i.e. complex) level: {{{ 37313;37311 181913;181895 false 3 BIND Taxroot }}}

=== BIND Refresh: ID Mapping ===

=== Project Releases/Milestones ===
 1. All the documents related to the BIND Translation, including the different releases of the translation, are accessible from the Bader Lab domain, under the folder: /Volumes/Groups/PathwayCommons/BIND/BINDTranslation.
 2. Note that in some releases, some of the PSI-MI XML files may not contain any interactions. This is due to the incremental nature of the different project phases, for one or more of the following reasons:
   - The equivalent source files happen to include complexes only, which we did not support at the time of that release; and/or
   - Those files do not carry any interactions where both interactants are proteins, and the release was focused on those; and/or
   - All the interactions in these files were deemed 'bad interactions' (i.e. missing vital information) and were removed from the production file and logged.
 However, these 'empty' files are kept for inventory tracking purposes.
 3. The following describes the folders for the different releases of the BIND Translation.

==== Phase I ====
 1. Created a BIND data/element analysis, for future reference.
 2. Provided a BIND to PSI-MI 2.5 translation for the core interaction elements. The analysis/requirements for this translation were reverse-engineered from the earlier XSLT prototype, scaling from version 1.0 to 2.5 when necessary, with some modifications as well. This core translation covered, at least: interaction-related identifiers/description, interactants' names and descriptions, interactants' external references (e.g.
GI IDs, Entrez Gene IDs, etc.), experimental condition details, species information, and publication identifiers.
 3. Only interactions where both sides are defined as proteins were covered.
 4. The translation targeted eight species: human, mouse, rat, yeast, E. coli, thale cress, fruit fly, and worm.
 5. The PSI-MI 2.5 files were converted to BIOPAX Level 2, using the available converter, and loaded into a local PC instance. This indirectly involved installing and setting up that instance.
 6. Initial support for PSI-MI controlled vocabularies was added later.

''Released Files'':
 - PhaseI_a: PSIMI and BIOPAX files for BIND interactions for the selected species.
 - PhaseI_b: Same content as above, but with the initial support for PSI-MI CVs and various enhancements/fixes.

==== Phase II ====
 1. Addressed some issues and loose ends from Phase I.
 2. Translated non-protein interaction types (i.e., RNA, DNA, gene, small molecule, complex, and photon).
 3. Translated all species (more accurately, all 1,620 BIND files). Converted all to BIOPAX and imported them into the local PC instance.
 4. More detailed PSI-MI CV support.

''Released Files'':
 - The PSIMI25 files are split into two folders: one for the 'selected species' (the eight favorites listed above), and one for all the other species.
 - The BIOPAX folder has two sub-folders: BIOPAX_20090626 (BIOPAX files that match all the PSIMI files for this release) and BIOPAX_20090702 (a special release of BIOPAX files to test fixes to the PSIMI-Biopax converter).

==== Phase III ====
 1. More focused and expanded releases.
 2. BIND - Expand & Refresh: BIND interactants were mapped to Uniprot accessions using both GIs and Entrez Gene IDs (whenever available) through an ID mapping process.

''Released files'':
 - PhaseIII_a: Focused on human only. The purpose was to tighten the screws, refine CV mappings, handle some BIND 'data issues', and introduce new elements.
 - PhaseIII_b: Includes all protein-protein interactions for all species/files.
First release of BIND ID-mapping (BIND Refresh). Species are split into two folders: one for the 'selected species' (listed above), and one for all the other species.
 - PhaseIII_c: Second and expanded release of BIND ID-mapping.
 - PhaseIII_d: An all-inclusive release. This release included a revised release of all non-protein-protein interactions, in addition to a first release of all BIND complexes, plus other enhancements. Only one file in this release does not have any interactions in it: taxid64091.1.xml (Halobacterium). The original BIND file has one interaction and one complex for the Hop (HR) tetramer, but both the interaction and the complex fail our filtering policy and are therefore logged into the bad entries file instead.

==== Beyond Phase III ====
 1. Maintenance and handling more of BIND's 'data issues'.
 2. Translating more interaction categories, with the appropriate resource allocation (internal or through outside collaboration).
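
As an illustration of the kind of 'data issue' handling mentioned above, the dummy-identifier rule described under 'Data Issues/Bad data' (ignore placeholder values such as '-1', '0', and 'NULL', and fall back to 'NO_VALUE' only when a value is mandatory) could be sketched roughly as follows. This is a minimal Python sketch and not the translator's actual code: the function names, the treatment of a 'WP:NULL' style value as a dummy, and the order in which the two identifiers are tried are all assumptions made for illustration.

```python
from typing import Optional

# Placeholder values reported on this page for unused identifier slots.
# Treating every "0" as a dummy is an assumption of this sketch.
DUMMY_VALUES = {"-1", "0", "NULL", ""}

# Default the page says is used when the PSI-MI 2.5 schema mandates a value.
DEFAULT_VALUE = "NO_VALUE"


def is_dummy(value: Optional[str]) -> bool:
    """Return True if the value looks like one of BIND's placeholder values."""
    if value is None:
        return True
    text = value.strip().upper()
    # Assumption: a prefixed placeholder such as 'WP:NULL' is also a dummy.
    return text in DUMMY_VALUES or text.endswith(":NULL")


def pick_identifier(int_id: Optional[str], str_id: Optional[str],
                    required: bool = False) -> Optional[str]:
    """Pick the usable identifier out of BIND's integer/string pair.

    Hypothetical helper: tries the string identifier first, then the
    integer one. Returns DEFAULT_VALUE only when the target element is
    mandatory, and None when the element can simply be omitted.
    """
    for candidate in (str_id, int_id):
        if not is_dummy(candidate):
            return candidate.strip()
    return DEFAULT_VALUE if required else None
```

With the 'Klotho' example from taxid9606.1.xml, `pick_identifier("-1", "KLM0000304")` would select the string identifier, while a pair in which both slots hold placeholders yields either nothing or 'NO_VALUE', depending on whether the target element is required.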