BIND To PSI-MI 2.5 Translation
Introduction
BIND (the Biomolecular Interaction Network Database) is one of the major freely available biological interaction resources. It was built over several years through a manual curation process. BIND curated, in detail, over 200,000 molecular interactions and over 3,750 biological complexes involving over 1,000 species. Although BIND is still accessible online (www.bind.ca), the resource has not been updated in a few years and is not available in the current community standard, PSI-MI 2.5. Unfortunately, this makes its use and incorporation into bioinformatics pipelines difficult.
Objective
This project aims to resolve this problem by translating BIND to the PSI-MI 2.5 (molecular interactions) standard (both XML and MITAB formats) and, consequently, by making it available through the interaction web service PSICQUIC (http://code.google.com/p/psicquic/). This project covers most of the details outlined in the interactions, for every species available in BIND (but not every detail about every interaction in BIND).
Other BIND Translations/Data Availability
1. BINDPlus: As mentioned above, BIND is available online through the BOND web portal at: http://bond.unleashedinformatics.com/, owned and run by Thomson Reuters' Life Sciences Division. The site does not offer any batch download for the data. It does allow saving the data in the PSI-MI 2.0 format, which, even if accurate, is not a data format that was adopted by the scientific community. That website, however, remains the most comprehensive search engine for BIND data, despite the fact that it carries some of the 'bad data' observed during the course of this project.
2. Databases that incorporate a specific subset of BIND data include:
Human Annotated and Predicted Protein Interaction(HAPPI) (http://discern.uits.iu.edu:8340/HAPPI/index.html)
Human Protein Interaction Database(HPID) (http://165.246.44.48/hpid/webforms/intro.aspx)
pSTIINg (http://pstiing.licr.org/)
InnateDB (http://www.innatedb.ca/)
Interaction Reference Index (iRefIndex) (http://irefindex.uio.no/wiki/iRefIndex)
Agile Protein Interaction DataAnalyzer(APID) (http://bioinfow.dep.usal.es/apid/index.htm)
Michigan molecular interactions (MiMI) (http://mimi.ncibi.org/MimiWeb/main-page.jsp)
STRING (http://string-db.org/)
Human Protein-Protein Interaction Prediction (PIPS) (http://www.compbio.dundee.ac.uk/www-pips/)
Interlogous Interaction Database (I2D) (http://ophid.utoronto.ca/ophidv2.201/ )
3. A prototype for a BIND to PSI-MI 1.0 translation was developed, in the summer of 2006, at the University of Toronto, using XSLT technology. The project described in this document, however, is independent of that earlier work at all levels (including scope, technologies used). (http://tap.med.utoronto.ca/bind/)
BIND Analysis
This analysis is not necessarily limited to BIND elements that were translated. Please refer to the attached Excel file for the referenced spreadsheets.
1. The BIND repository is available in the form of XML files. The repository has a total of 1,591 files, with a total size of approximately 16 gigabytes. The actual repository used in the translation has 1,621 files, since seven of the bulky XML files were split into smaller chunks for easier handling. These files are: taxid4932.1.xml, taxid4932.2.xml, taxid562.1.xml, taxid7227.1.xml, taxid9606.1.xml, taxid9606.2.xml, taxid10090.1.xml.
Content was not altered during this process. A species is represented by one or more files. The timestamps in all files go back to 2006 (or earlier).
2. The value of BIND lies in its manually curated content, including experimental forms, cellular localizations, and detailed experimental information extracted from figures in the associated publications, more than in its 'automatically' generated data (interactor synonyms, for example). The automatically generated data can be updated by separate processes, as opposed to translating data that might already be out of date.
3. The BIND XML files should be thought of as a database dump, rather than a refined selection of interaction data, ready to be translated.
4. Some of the heavily used BIND elements are data types of internal and 'operational' nature to the BIND team itself. These bear little significance at this point. Such elements include: the update history of an interaction record in the BIND database, curator's info, timestamps, and other elements.
Element Survey
1. The attached spreadsheets are the product of a 'stats generator'. They aim at showing the following:
- The XML elements (and their distinct paths) that actually appear in the BIND repository (vs. the ones that are just part of the BIND schema, but are never referenced, if any). This is a general indicator of how much of a data model is exercised by its 'instances'.
- The 'abundance' of each element is listed; i.e. in how many files is a specific element used and the total number of times it appears in all files combined.
- The distinct XML elements (more accurately, paths) that bear data and can be translated. In the case of the BIND schema, this was defined as an element that either carries attributes, or is a leaf element, usually carrying content (some general exceptions below). This analysis approach is specifically useful to BIND and may or may not be very informative for other XML schemas.
- For categorical elements (for example, the element 'support', which relates to publications and takes the values TRUE or FALSE), the count associated with each individual value is also recorded. An element that is always present could nevertheless always be set to some default value and so carry no actual information. These counts were calculated to highlight the categorical variables that were actually used.
2. The spreadsheet 'BINDElementsUnique' lists the element names (648 elements total) used in all the BIND repository.
3. The spreadsheet 'BINDElementStats' lists all the BIND elements, the unique XML path to each element, its 'abundance', and its 'significance' (i.e. whether it bears content or not). The columns used are:
- Element Name: BIND XML schema element name, not unique.
- Full Path: The unique path to the XML element. Unique column. In some cases, different paths to the same element name are different in context (and not just because of reusing a complex type).
- Value: For categorical variables the value of the attribute.
- Total Files: The total number of BIND xml models/files this element has appeared in.
- Total Count: The total number of times this element has been used in all of the files.
- HasChildren/HasAttributes/HasText: These reflect whether an element bears useful content or not, as described above. A column value is set to 'true' if it satisfies that condition, at least once, in any entry, in any BIND file.
4. There are 3,300 unique paths, of which 1,403 bear content (total usage: 72,706,810 times), including the elements with attributes (273 of them). As is the case with many other data models, most of the content is concentrated in a small subset of elements that may or may not be of biological value. About 580 of the 1,403 data-carrying XML elements are used fewer than 100 times in 1,620 files, which is negligible.
5. As a rough guideline for 'abundance' (and ignoring the complexes), a data element that appears in every interaction (e.g. interactant's name) should appear 212,031 times or more. There is always the chance that such an element is set to a dummy (as described later), but that is not taken into consideration in this context.
6. The spreadsheet BINDElementStatsBrief offers a different view for the same data.
7. A comprehensive translation, however, should focus on the scientific value of a component, in addition to its 'abundance'. The former cannot be captured in this survey.
8. Phases I-III (see project phases below) of this project handle the core translation for all interactions/complexes.
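The element survey described in this section can be sketched in a few lines of Python. This is only an illustration of the counting approach, not the actual 'stats generator': the function name `survey_paths` is hypothetical, and 'data-bearing' is taken to mean 'carries attributes or is a leaf element', following the definition given above. The sample document in the demo is a simplified fragment, not real BIND nesting.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def survey_paths(xml_source):
    """Return (path -> count, set of data-bearing paths) for one XML file."""
    counts = Counter()
    data_paths = set()
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            counts[".".join(stack)] += 1
        else:
            # Data-bearing: carries attributes, or is a leaf element.
            if elem.attrib or len(elem) == 0:
                data_paths.add(".".join(stack))
            stack.pop()
            elem.clear()  # drop the finished element's content to save memory
    return counts, data_paths


if __name__ == "__main__":
    # Simplified fragment for demonstration only.
    sample = io.BytesIO(
        b"<BIND-Submit><BIND-Interaction>"
        b"<BIND-object_short-label>DBR1</BIND-object_short-label>"
        b"<BIND-object_short-label>MCM10</BIND-object_short-label>"
        b"</BIND-Interaction></BIND-Submit>"
    )
    counts, data_paths = survey_paths(sample)
    for path, total in counts.items():
        print(path, total, path in data_paths)
```

Streaming with `iterparse` rather than loading a whole tree is a deliberate choice here, since some of the BIND files are large; per-file results would then be summed to obtain the 'Total Files' and 'Total Count' columns.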
BIND XML Schema/Data Model
The BIND data model was originally represented in the ASN.1 notation. It makes extensive use of earlier standard NCBI ASN.1 components (about 20) for representing biological entities like publications, sequences, and others. The BIND XML schema was generated using parsers from the ASN.1 format.
The BIND XML schema/data model is considerably complex. This is partially due to the richness of the resource itself, but also because of the manner in which the XML schema was built. The model heavily favors the use of 'subelements' instead of attributes, values, and customized complex types. A user sifts through a lot of XML elements to get to a useful piece of information. For example, to represent one type of PubMed ID for the experimental condition of an interaction in BIND, starting from the root XML element:
BIND-Submit.BIND-Submit_interactions.BIND-Interaction-set.BIND-Interaction-set_interactions.BIND-Interaction.BIND-Interaction_descr.BIND-descr.BIND-descr_cond.BIND-condition-set.BIND-condition-set_conditions.BIND-condition.BIND-condition_source.BIND-pub-set.BIND-pub-set_pubs.BIND-pub-object.BIND-pub-object_pub.Pub.Pub_medline.Medline-entry.Medline-entry_pmid.PubMedId
To represent a similar element, using PSI-MI 2.5:
entrySet.entry.interactionList.interaction.experimentList.experimentDescription.bibref.xref.primaryRef[@id:]
Another example of this is in defining the type of an interactor (e.g. protein, RNA, etc.). It is more natural to define this as an attribute of the referenced object (in this case, 'interactor') and not as a series of subelements:
BIND-Submit.BIND-Submit_interactions.BIND-Interaction-set.BIND-Interaction-set_interactions.BIND-Interaction.BIND-Interaction_a.BIND-object.BIND-object_id.BIND-object-type-id.BIND-object-type-id_protein
Such issues, in addition to the mere size of the data itself, significantly complicated the implementation and debugging of the translation.
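To make the nesting depth concrete, the following sketch resolves the PubMed ID path quoted above against an in-memory tree. It is a minimal illustration, not the translator's code: `find_by_dotted_path` and `build_chain` are hypothetical helpers, the demo tree is a single synthetic branch, and the PubMed ID used is a placeholder.

```python
import xml.etree.ElementTree as ET

# The root-anchored BIND path quoted in the text above.
BIND_PMID_PATH = (
    "BIND-Submit.BIND-Submit_interactions.BIND-Interaction-set."
    "BIND-Interaction-set_interactions.BIND-Interaction."
    "BIND-Interaction_descr.BIND-descr.BIND-descr_cond.BIND-condition-set."
    "BIND-condition-set_conditions.BIND-condition.BIND-condition_source."
    "BIND-pub-set.BIND-pub-set_pubs.BIND-pub-object.BIND-pub-object_pub."
    "Pub.Pub_medline.Medline-entry.Medline-entry_pmid.PubMedId"
)


def find_by_dotted_path(root, dotted_path):
    """Return all elements reached by a root-anchored, dot-separated path."""
    head, *rest = dotted_path.split(".")
    if root.tag != head:
        return []
    return root.findall("/".join(rest)) if rest else [root]


def build_chain(dotted_path, text):
    """Build a single-branch tree for the path (demo data only)."""
    tags = dotted_path.split(".")
    root = node = ET.Element(tags[0])
    for tag in tags[1:]:
        node = ET.SubElement(node, tag)
    node.text = text
    return root


if __name__ == "__main__":
    doc = build_chain(BIND_PMID_PATH, "12345")  # placeholder, not a real PMID
    print(len(BIND_PMID_PATH.split(".")), "levels in the BIND path")
    print([e.text for e in find_by_dotted_path(doc, BIND_PMID_PATH)])
```

In a real translation the same lookup would typically be done inside a streaming parser, since whole-tree loading is impractical for the larger files.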
BIND Data Quality
Schema Validation
Generally speaking, the BIND models are semantically valid. All BIND files were validated against the BIND schema, and its associated schemas (mainly earlier NCBI schemas), and most of them checked out fine, with minor exceptions:
1. Some BIND files did not stick to their own controlled vocabulary for experimental method description (i.e. the 'BIND-experimental-system' element). It is set to 'microarray' in the human files, about 406 times; it is set to 'synthetic-lethal-sick-test' in the yeast files, about 474 times. These two types are not part of the BIND controlled vocabulary for describing an experiment, and thus, failed the schema validation.
2. Based on my understanding of BIND, the experimental method value, for these cases, should have been set to 'other', and a free style description separately added, under the element 'BIND-condition_descr'. This error, however, should not interfere with the translation since our focus is to use PSI-MI's CVs and not BIND CVs (see milestones section).
Data Issues/Bad data
However, the BIND models do suffer from missing data at times, and poor data representation at others. Here are some examples:
1. The external reference object in BIND carries a database name, and one of two identifiers (either a string or an integer data type). This could be modeled in a regular XML schema by using a 'choice' type to enforce such a structure. However, BIND considers all of these elements as required, and as such, sets one of them to a valid value, and the other to a dummy (that varies, as well). This is a case of poor design, not bad data, and is misleading. For example (from human, taxid9606.1.xml):
... <BIND-other-db> <BIND-other-db_dbname>Klotho</BIND-other-db_dbname> <BIND-other-db_intp>-1</BIND-other-db_intp> <BIND-other-db_strp>KLM0000304</BIND-other-db_strp> </BIND-other-db>
In this example, the first identifier value (set to '-1' in this case, but '0' in many others) should be ignored, and the second identifier should be used. In other cases, it would be the other way around (where the second identifier type is set to 'NULL').
An even more interesting example (from taxid6239.1.xml, listing a Wormbase xref):
... <BIND-other-db_strp>WP:NULL</BIND-other-db_strp> ...
2. There are, however, very few exceptions (occurring 30 times in total) to this x-ref data representation in BIND, namely when both the string and integer fields are used to represent an external reference. All these cases are limited to representing identifiers for Merck Index or Gene Ontology. The BIND translator keeps both values. For example (from the files taxid10090_PSIMI25.xml and taxid9606_PSIMI25.xml, respectively):
... <secondaryRef db="merck index" id="9861(12)" refTypeAc="MI:0356" refType="identity"></secondaryRef> ...
... <secondaryRef db="go" dbAc="MI:0448" id="IDA(5680)" refTypeAc="MI:0448" refType="gene ontology"></secondaryRef> ...
3. The following excerpts are examples of completely erroneous x-ref entries:
... <BIND-object_short-label>DBR1</BIND-object_short-label> ... <BIND-object_extref> ... <BIND-other-db> <BIND-other-db_dbname>LocusLink</BIND-other-db_dbname> <BIND-other-db_intp>0</BIND-other-db_intp> <BIND-other-db_strp>0</BIND-other-db_strp> </BIND-other-db> </BIND-object_extref> .... <BIND-object_extref> <BIND-other-db> <BIND-other-db_dbname>SGD</BIND-other-db_dbname> <BIND-other-db_intp>0</BIND-other-db_intp> <BIND-other-db_strp/> </BIND-other-db> </BIND-object_extref> ...
Another example (for an OMIM x-ref, in the human taxid9606.1.xml file):
... <BIND-object_short-label>MCM10</BIND-object_short-label> ... <BIND-object_extref> <BIND-other-db> <BIND-other-db_dbname>OMIM</BIND-other-db_dbname> <BIND-other-db_intp>0</BIND-other-db_intp> <BIND-other-db_strp>-</BIND-other-db_strp> </BIND-other-db> ...
4. Different elements can be set to 'NULL' or zero. These are either ignored during the translation, if not mandated by the PSI-MI 2.5 schema, or set to a default ('NO_VALUE'). There is a chance that future phases of this project might be able to fill in for some of these cases, like interactor names/descriptions, via external references listed, if any.
5. In mouse (from taxid10090.1.xml, also in the file taxid0.1), the interactor names/descriptions are listed as 'NULL' 154 times:
... <BIND-object_descr>NULL</BIND-object_descr> ... <BIND-object_short-label>NULL</BIND-object_short-label> ...
This is misleading and unnecessary, even if this information is missing (and required by the XML schema), since the BIND model allows setting values to a standard default term ('unknown').
6. Sometimes an interactor's organism identifier was missing. In those cases it was automatically filled in with the undefined-species value (taxid:36244), rather than assuming that every interactor belongs to the species of the file in which it is listed.
7. Some BIND complexes reference BIND interactions with a negative BIND Interaction ID. For example (file: taxid9606.2.xml.part2):
<BIND-mol-object-source_a> <Interaction-id>-52023</Interaction-id> </BIND-mol-object-source_a>
8. A few interactions (e.g. check interaction ID: 18988 on the BOND web portal) have an erroneous 'BIND Division Type':
"BIND Taxroot|Record where all molecules are from any organism except those within TaxID 4751 or TaxID 33208".
This turns out to be a copy and paste from the BIND readme.txt file, and is not a valid BIND division type. It is translated as "BIND Taxroot".
9. In a few cases, the SGD x-ref for an interactant is listed, not as an integer, but in the format "SGD: XXX" (e.g. "SGD: S000003663").
10. In a few cases, there is inconsistency between the manner in which the MGI identifier is displayed and what it represents (example from the file taxid10090_PSIMI25.xml, shown in its wrong PSI-MI format):
... <secondaryRef db="mgd/mgi" dbAc="MI:0479" id="107427" refTypeAc="MI:0251" refType="gene product"></secondaryRef> <secondaryRef db="mgd/mgi" dbAc="MI:0479" id="MGI:1890695" refTypeAc="MI:0251" refType="gene product"></secondaryRef> ...
11. A few BIND interactors list RefSeq as an external reference. However, the identifiers listed are not RefSeq identifiers; they are, in fact, GI identifiers. This is fixed during translation. Examples include (from the file taxid2336.1.xml, shown in its wrong PSI-MI format):
... <secondaryRef db="REFSEQ" id="15643805"></secondaryRef> ... <secondaryRef db="REFSEQ" id="15644490"></secondaryRef> ...
12. There are rare byte sequences in the BIND XML files that are not UTF-8 compliant. They occur in some textual fields (such as descriptions) and were found in the taxid10090.1.xml file. This was corrected on the command line using the following command: iconv -f ISO8859-1 -t UTF-8 file-to-process > new-file
13. There are some dummy (negative) PubMed IDs used. This seems to be limited to the yeast file, where it is repeated 68 times. For example (in PSI-MI XML format):
<experimentDescription ...> ... <primaryRef db="pubmed" dbAc="MI:0446" id="-2" refTypeAc="MI:0685" ...
14. A significant number of the Entrez GI identifiers (GI is the primary identifier in BIND) are outdated or deprecated. See the BIND Refresh section.
15. There is a lack of consistency and of a unified representation for the names of biological resources used as external references. See the CV Mapping section.
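The dummy-value handling for external references (items 1 to 3 above) can be sketched as follows. This is an illustration under stated assumptions, not the translator's actual logic: the function names are hypothetical, the set of placeholder values is limited to those visible in the excerpts above, treating any value ending in ':NULL' as a dummy is a guess, and the combined 'string(integer)' form simply mirrors the shape of the translated examples such as 'IDA(5680)'.

```python
import xml.etree.ElementTree as ET

# Placeholder values seen in the excerpts above; an assumption, and likely
# incomplete with respect to the full repository.
DUMMY_VALUES = {"", "-", "0", "-1", "NULL"}


def clean(value):
    """Return the stripped value, or None if it looks like a placeholder."""
    value = (value or "").strip()
    if value in DUMMY_VALUES or value.endswith(":NULL"):
        return None
    return value


def resolve_xref(other_db):
    """Return (dbname, identifier) for a BIND-other-db element, or None."""
    dbname = other_db.findtext("BIND-other-db_dbname")
    intp = clean(other_db.findtext("BIND-other-db_intp"))
    strp = clean(other_db.findtext("BIND-other-db_strp"))
    if intp and strp:
        # Both fields carry data (the rare Merck Index / GO case): keep both.
        identifier = f"{strp}({intp})"
    else:
        identifier = strp or intp
    return (dbname, identifier) if identifier else None


if __name__ == "__main__":
    klotho = ET.fromstring(
        "<BIND-other-db>"
        "<BIND-other-db_dbname>Klotho</BIND-other-db_dbname>"
        "<BIND-other-db_intp>-1</BIND-other-db_intp>"
        "<BIND-other-db_strp>KLM0000304</BIND-other-db_strp>"
        "</BIND-other-db>"
    )
    print(resolve_xref(klotho))
```

Returning None for an entry with no usable identifier leaves the decision (skip the x-ref, or filter the whole interaction) to the caller.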
Filtering BIND Entries (‘Filtered Entries’)
The BIND translator attempts to 'fix' data within an entry, without removing the entry itself, if possible. The 'BIND Filtered data' section lists examples of such erroneous or dummy data. However, some interaction entries are missing information so vital (as defined in the MIMIx interaction standard, http://www.nature.com/nbt/journal/v25/n8/full/nbt1324.html) that the interaction is deemed 'incomplete' without it, regardless of the reason the elements are missing (biologically unavailable, source data issue, etc.). This type of information includes: interactor name, interactor external references, experimental methods, publication, complex names, complex subunit names, and all references to these interactions from complexes. Such interactions are excluded from the translated files, but are tracked in a log file (see the file Filtered_ENTRIES.txt, in the release's folder):
1. The log file lists all excluded interactions, in PSI-MI (XML) interaction format. This allows consistency with how an interaction is listed in the production files. The file itself is not a valid PSI-MI file. It is generated per run/build of the translation process, and covers whichever species are included in that build.
2. For most entries in this file, there is usually a 'NO_VALUE' set somewhere, for a vital element/attribute. This is a unified representation by the translator for a whole variety of bad/missing data representations in BIND (like: 'Unknown', 'NULL', 'unknown', etc.).
3. For each such entry, an attribute is added to its attribute list, specifying the name of the file it came from. The file name is given as the full path to its current location. For example:
<interaction.....> .... <attributeList> <attribute name="BIND Interaction Division">BIND Taxroot</attribute> <attribute name="BAD DATA FILE">/Volumes/Groups/PathwayCommons/BIND/BIND_Data/ taxid83333.1.xml</attribute> </attributeList> </interaction>
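The filtering rule described above can be sketched as follows. This is a simplified illustration, not the translator itself: entries are modeled as plain dictionaries, the field names in `VITAL_FIELDS` are hypothetical stand-ins for a subset of the vital elements listed in the text, and the set of missing-value markers covers only the ones named above.

```python
NO_VALUE = "NO_VALUE"
# Lower-cased markers for missing data named in the text; an assumption that
# this list is sufficient.
MISSING_MARKERS = {"", "null", "unknown"}
# Hypothetical field names standing in for some of the vital elements.
VITAL_FIELDS = ("interactor_name", "interactor_xref", "detection_method", "pubmed_id")


def normalize(value):
    """Map every missing-data spelling to the single NO_VALUE marker."""
    if value is None or value.strip().lower() in MISSING_MARKERS:
        return NO_VALUE
    return value.strip()


def split_entries(entries, source_file):
    """Separate complete entries from those missing any vital field."""
    kept, filtered = [], []
    for entry in entries:
        clean = {field: normalize(entry.get(field)) for field in VITAL_FIELDS}
        clean["bind_id"] = entry["bind_id"]
        if NO_VALUE in (clean[field] for field in VITAL_FIELDS):
            clean["BAD DATA FILE"] = source_file  # provenance for the log
            filtered.append(clean)
        else:
            kept.append(clean)
    return kept, filtered


if __name__ == "__main__":
    demo = [
        {"bind_id": 1, "interactor_name": "DBR1", "interactor_xref": "X1",
         "detection_method": "other", "pubmed_id": "100"},
        {"bind_id": 2, "interactor_name": "NULL", "interactor_xref": "X2",
         "detection_method": "other", "pubmed_id": "100"},
    ]
    kept, filtered = split_entries(demo, "taxid83333.1.xml")
    print(len(kept), "kept;", len(filtered), "filtered")
```

Collapsing every missing-data spelling into one marker before testing is what lets a single membership check decide whether an entry is complete.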
Support for BIND Complexes
The translation included a first release for BIND complexes. Some of the choices might be revised later. Attached is a 'ComplexesInventory' spreadsheet that specifies the number of complexes in each of the input, as well as the output, files:
1. BIND complexes are modeled as PSI-MI interaction elements (which accommodate representing complexes). There are multiple ways to represent a complex in PSI-MI. In the first representation, a complex consists of a list of interactors (with no indication of how the complex is formed). This format is used for representing experimental pull-downs, and one of the interactors has to be defined as the bait. The second representation contains a list of interaction references and specifies the topology of the complex. A complex can also be represented as a list of interactors with the suspected topology represented as inferredInteractions in PSI-MI. For each of the BIND complexes, we extracted all the members of the complex and listed the topology in inferredInteractions.
Importing systems, or converters to other standards, can quickly distinguish a BIND interaction from a BIND complex by capturing one of the unique BIND complex attributes like the 'order flag' for subunits (see the 'attribute list' example below).
2. The core elements translated (when available) for complexes include: BIND complex ID, total number of components, order flag, complex ref PubMed ID, interaction ID references. For each of the complex's subunits: the interactionRef for the subunit, the subunit number, and whether it is interactant A or B (in the BIND interaction referenced), in addition to the same details translated for an interaction's interactant. All different subunit types are supported (i.e. protein, small molecule, etc.).
3. The BIND 'subunit source' elements, if present, are modeled as PSI-MI's participant.interactionRef. If not present, the usual participant.interactor details are taken from the BIND complex subunit. We could have included both (i.e. the subunit details and its interactionRef), but that is not allowed in the PSI-MI schema (i.e. it is either-or).
4. To save the pub references, we have to add a 'dummy' experimental description under the PSI-MI interaction, that takes bibliographical references (bib-ref).
5. In general, the interaction ID for an interaction in the PSI-MI format is a sequentially generated number that is unique across species/files within a BIND Translation build. Each PSI-MI interaction contains the original BIND interaction ID as a primary reference. BIND complex entries reference interactions by their BIND interaction ID. Therefore, there was a need for mapping between the two identifiers. The 'BIND interaction ID' referenced by a complex subunit is mapped to the respective auto-generated PSI-MI interaction ID in the same PSI-MI file (within the same species), and the latter is used instead (otherwise, the PSI-MI file is not valid). However, all the BIND interaction IDs listed for a complex are still listed separately for that interaction (as an additional PSI-MI interaction attribute).
6. Note that there might be interactions in a complex's interaction list that are not referenced by any of the individual complex subunits.
7. If a complex points to a 'filtered' interaction (within the same species), that complex is labeled as 'filtered' as well, and is treated as such. Otherwise, we will have pointers to interaction IDs that do not exist for that species. This may have resulted in a large number of 'bad' complexes, an outcome that might be revised.
8. In some cases, complexes may reference interactions in other species. There is no way of verifying the validity of these references in the same build of the BIND translation (in the current implementation). Those references are translated as is, and the complexes are labeled as 'filtered data'. (This will be updated in a subsequent build.)
9. The translated complex elements that are not explicitly supported in the PSI-MI schema are added as optional interaction attributes. PSI-MI does not seem to define detailed CV attributes for complexes (except a very general 'complex attribute'). We define our own attribute names that are used consistently. For example (file: taxid3702_PSIMI25.xml):
At the participant (aka interactor) level:
<participant id="77735"> <interactionRef>37313</interactionRef> <attributeList> <attribute name="Complex Interactant Source"> BIND Interactant B </attribute> <attribute name="Interaction Ref - BIND ID">181913</attribute> <attribute name="Complex Subunit Number">2</attribute> </attributeList> </participant>
At the interaction (i.e. complex) level:
<attributeList> <attribute name="Complex Interactions List">37313;37311</attribute> <attribute name="Complex Interactions List - BIND IDs"> 181913;181895 </attribute> <attribute name="Complex Interactions List Ordered">false</attribute> <attribute name="Complex Number of Subunits">3</attribute> ... </attributeList>
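The identifier mapping for complexes (items 5 and 7 above) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, interactions and complexes are reduced to lists of integer IDs, and only the two list attributes from the example above are produced.

```python
from itertools import count


def assign_psimi_ids(bind_ids, counter):
    """Give each kept BIND interaction the next sequential PSI-MI ID."""
    return {bind_id: next(counter) for bind_id in bind_ids}


def complex_attributes(bind_refs, bind_to_psimi):
    """Map a complex's BIND interaction references to PSI-MI IDs.

    Returns None when any reference cannot be resolved in this species,
    which is the case where the complex itself is treated as filtered.
    """
    if any(ref not in bind_to_psimi for ref in bind_refs):
        return None
    return {
        "Complex Interactions List": ";".join(
            str(bind_to_psimi[ref]) for ref in bind_refs
        ),
        "Complex Interactions List - BIND IDs": ";".join(
            str(ref) for ref in bind_refs
        ),
    }


if __name__ == "__main__":
    counter = count(1)  # one shared counter keeps IDs unique across a build
    ids = assign_psimi_ids([181895, 181913], counter)
    print(complex_attributes([181913, 181895], ids))
    print(complex_attributes([181913, -52023], ids))  # unresolved reference
```

Sharing one counter across all species files is what keeps the generated IDs unique within a build, while the per-species dictionary keeps complex references confined to the same output file.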
BIND Refresh: ID Mapping
1. BIND's primary interactor reference is the GI. Entrez Gene IDs are its secondary reference (labeled as BIND DI). GIs are used 346,142 times vs. 168,630 times for Entrez Gene IDs. Both may be available for the same interactor. GIs are not the most popular identifiers, as they often get replaced and are difficult to update.
2. The purpose of the ID mapping step in the BIND translation process was to map GIs (and Entrez gene IDs) to accession(s), since these are easier to track, and this will facilitate the ID matching and 'unification' of interactors (within the same resource or across different interaction/pathway resources) by importing systems.
3. The mapping involved the following steps:
- Using id1_fetch, released as a command-line tool by NCBI, we gathered information pertaining to each GI, including:
- Status
- Updated GI
- Accession associated with the original GI.
- For all but 340 GIs we were able to associate an accession.
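Once the per-GI information has been gathered, applying it is a simple table lookup. The sketch below assumes the gathered data has already been parsed into records; the `GiRecord` layout, the status labels, and the accession strings are all hypothetical placeholders, and no id1_fetch invocation or output format is shown or implied.

```python
from typing import NamedTuple, Optional


class GiRecord(NamedTuple):
    """One gathered row per GI (hypothetical layout)."""
    gi: int
    status: str                  # status label; the values used are assumptions
    updated_gi: Optional[int]    # replacement GI, if any
    accession: Optional[str]     # accession for the original GI, if found


def build_accession_map(records):
    """Return (gi -> accession, list of GIs with no accession)."""
    mapped, unmapped = {}, []
    for record in records:
        if record.accession:
            mapped[record.gi] = record.accession
        else:
            unmapped.append(record.gi)
    return mapped, unmapped


if __name__ == "__main__":
    demo = [
        GiRecord(15643805, "live", None, "ACC_PLACEHOLDER_1"),
        GiRecord(15644490, "replaced", None, None),
    ]
    mapped, unmapped = build_accession_map(demo)
    print(len(mapped), "mapped;", len(unmapped), "without an accession")
```

Keeping the unmapped GIs in a separate list makes it easy to report the residue that could not be associated with an accession.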
BIND To PSI-MI 2.5 CV Mapping
1. The attached spreadsheet covers most of the controlled vocabulary mapping between BIND and PSI-MI for 3 CV categories: Interactor external references (x-ref), interactor type, and interaction detection method.
2. For each CV term listed, the spreadsheet includes the following columns:
- CV Type: Category of the controlled vocabulary mapped.
- Source Term: The source term as it is used in the source (in this case, BIND).
- Source ID: The unique identifier, if any, for this term in the source data model.
- Target Term: The term as it is used in the target (in this case, PSI-MI).
- Target ID: The unique identifier, if any, for this term in the target data model.
3. For x-refs, the spreadsheets have additional columns for each of the interactor types in BIND, to specify the equivalent PSI-MI reference type (refType) for that particular x-ref and interactor type combination. The last two interactor type columns ('unspecified', 'photon') are, obviously, not referenced by any of the interactor x-refs. A photon is not recognized as an interactor type in PSI-MI. It is used in about 291 interactions in BIND.
4. Whenever there is an 'N/A' in the interactor type columns, it means that the particular x-ref/interactor type combination was not present in the source data.
5. Unfortunately, there is no CV for x-refs in BIND. We produced a comprehensive list of all the interactor external references used in BIND, as a first step, before mapping them to their PSI-MI counterparts. Although the list reflects the richness in the x-refs BIND uses, it shows the poor/redundant data representation as well (e.g.: Merck Index, Merck Index #, MerckIndex - 3 representations for the same data source).
6. As elaborated elsewhere, there are two interaction detection methods used in BIND that are not part of their CV for that category.
7. Whenever the 'generic term' for a CV category is used as a target term (for a non-generic source term), it indicates that there is currently no suitable match for that term in PSI-MI. For example, "participant xref;dictyBase" indicates that there is no support for a CV for the external bio-source dictyBase in PSI-MI, and as such, it is listed as is when translated (i.e.: dictyBase). We will request updating the PSI-MI CVs to support all of these terms.
8. A slightly modified version of this spreadsheet is fed into the translation process. This design for the CV mapping accommodates 1-to-1 mappings between the source and the target. We could have invested more in the design of the CV mapping, as a new XML schema or something similar, but that would have been unnecessary overhead. For more complicated CV mappings (e.g. the reftypes for the x-ref WormBase), the detailed business rules are handled at the object level.
9. Handling WormBase reference types (best available options):
- If interactor is a protein (regardless of whether the x-ref has the 'WormBase:' prefix, or is free-style, like 'ZK858.4'), use 'gene product'. If the x-ref has the prefix 'WP:' or 'CE:', use 'identity'.
- If interactor is of type DNA (x-ref always has the 'WormBase:' prefix), use 'identity'.
- If interactor is of type RNA (x-ref always has the 'WormBase:' prefix), use 'gene product'.
- If interactor is a gene (x-ref always has the 'WP:' prefix), use 'gene product'.
10. Reference types are also set for the x-refs added during the translation process (Uniprot IDs, Refseqs, ...). Those x-refs have the *_IDMAP suffix.
11. Some assumptions were made when setting the reference types for the GI-nucleotide x-ref. We set the reference type to 'identity' if the interactor (where one existed) was a DNA or a gene, assuming that the DNA/gene is encoding (otherwise, it would have been a 'see-also').
12. We include both the CV term and its respective PSI-MI unique ID, in case some importing systems read one but not the other. The BIND and PubMed entries in the CV mapping spreadsheet are listed under the x-ref section for convenience, although neither of them is considered an interactor x-ref in BIND. Photons are not a supported interactor type in PSI-MI.
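The spreadsheet-driven 1-to-1 mapping and the object-level WormBase rule described above can be sketched together. This is an illustration only: the table rows are a tiny made-up sample using terms mentioned in the text (target IDs are left blank rather than guessed), the function names are hypothetical, and the WormBase function encodes one reading of item 9, in which the 'WP:'/'CE:' exception applies to protein interactors only.

```python
import csv
import io

# Illustrative rows only; the real spreadsheet is not reproduced here.
CV_TABLE = """\
CV Type,Source Term,Source ID,Target Term,Target ID
x-ref,Merck Index,,merck index,
x-ref,Merck Index #,,merck index,
x-ref,MerckIndex,,merck index,
x-ref,dictyBase,,participant xref,
"""
GENERIC_TERMS = {"x-ref": "participant xref"}


def load_cv_map(handle):
    """Read the mapping: (CV type, source term) -> (target term, target ID)."""
    return {
        (row["CV Type"], row["Source Term"]):
            (row["Target Term"], row["Target ID"] or None)
        for row in csv.DictReader(handle)
    }


def map_term(cv_map, cv_type, source_term):
    """Map a term; keep the source term when only the generic term applies."""
    generic = GENERIC_TERMS.get(cv_type)
    target, target_id = cv_map.get((cv_type, source_term), (generic, None))
    if target == generic:
        return source_term, None
    return target, target_id


def wormbase_ref_type(interactor_type, xref):
    """Reference type for a WormBase x-ref (one reading of the rules above)."""
    if interactor_type == "protein":
        return "identity" if xref.startswith(("WP:", "CE:")) else "gene product"
    if interactor_type == "dna":
        return "identity"
    if interactor_type in ("rna", "gene"):
        return "gene product"
    raise ValueError(f"no WormBase rule for interactor type {interactor_type!r}")


if __name__ == "__main__":
    cv_map = load_cv_map(io.StringIO(CV_TABLE))
    print(map_term(cv_map, "x-ref", "MerckIndex"))
    print(map_term(cv_map, "x-ref", "dictyBase"))
    print(wormbase_ref_type("protein", "ZK858.4"))
```

Keeping the simple mappings in a flat table and the exceptions in code mirrors the split described in item 8: the table stays easy to review, and only the irregular cases need dedicated logic.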
Translation Notes
Various translation notes:
1. BIND Division: We do translate the 'BIND Division Type' for an interaction, as applicable. The following list encapsulates all the different division types used in BIND:
MGI BIND-3DBP BIND Taxroot BIND-3DSM MIPS BIND Metazoa HDRES RefBIND Taxroot HIV-HPID BIND-3DFI BIND Fungi SGD FlyBase
Since there is no suitable mapping for this information in the PSI-MI schema, it is translated as an (additional) PSI-MI interaction attribute, e.g.:
... <attributeList> <attribute name="BIND Interaction Division">MGI</attribute> </attributeList> ...
2. BIND Species:
a. As mentioned earlier, BIND interactions files are organized by the species they reference. Their file names (for the most part) reflect the species ID. The attached spreadsheet lists the names of all the BIND files, the species ID (of the species where the bulk of the interactors in that file exist), and the species scientific and common names.
b. A small number of taxonomy IDs (76 total) do not have a match in Entrez Taxonomy. This is due to one of the following reasons:
- Dummy taxid in the file name, like taxid_0.
- The species ID has been renamed (possibly merged) or deprecated. For example, the species ID 11489 used in BIND has been changed to 132504, 36377 is the same as 362651, and so on.
This issue is very limited in scope and is mostly related to some micro-organisms. It is another example of the consequences of not keeping a data repository up to date, and it is one of the 'BIND data' issues to be addressed later.
c. BIND files do include inter-species interactions. For example, the human interactions file may include interactions where the first interactor is in homo sapiens, and the second is in some micro-organism. The translation process does not double check if the corresponding interaction file for that micro-organism (if any) lists the same interaction already listed in the human file. It is the responsibility of any importing system to check for that, if this issue matters in their internal data repository.
3. Revised Batch Process: A few species in BIND are represented by more than one XML file. Furthermore, we split the bulky BIND files early on to allow processing with the SAX parser. The translation process, however, bundles the translated output of the BIND files that belong to the same species (judging by the file name 'taxid...') and generates a single output file per species. This is neater and is a prerequisite for supporting BIND complexes, so that a complex will never reference an interaction that actually falls in a different file (for the same species).
4. PSI-MI Validation: BIND PSI-MI files are validated against their schema, and this validation has passed for each release. In addition, PSI-MI offers a semantic validator for PSI-MI models. It can be accessed on the web (http://www.ebi.ac.uk/intact/validator/start.xhtml) or incorporated into one's own program. The validator was downloaded, compiled, and incorporated into our pipeline, and was run on each PSI-MI XML file created, prior to translation to the MITAB format. The only recurrent error was associated with invalid taxids that we have not updated yet.
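The per-species bundling described under 'Revised Batch Process' can be sketched as a grouping step on file names. This is an illustration, not the actual batch code: the function names are hypothetical, the regular expression assumes names of the form 'taxid<digits>.<rest>', and the output name simply follows the pattern of the translated files cited in this document (e.g. taxid9606_PSIMI25.xml).

```python
import re
from collections import defaultdict

# Assumed input naming pattern: 'taxid' + digits + '.' + anything.
TAXID_RE = re.compile(r"^taxid(\d+)\.")


def group_by_species(file_names):
    """Return (taxid -> sorted input files, names that did not match)."""
    groups = defaultdict(list)
    unmatched = []
    for name in sorted(file_names):
        match = TAXID_RE.match(name)
        if match:
            groups[match.group(1)].append(name)
        else:
            unmatched.append(name)
    return dict(groups), unmatched


def output_name(taxid):
    """Single output file per species, following the names cited above."""
    return f"taxid{taxid}_PSIMI25.xml"


if __name__ == "__main__":
    names = ["taxid9606.2.xml", "taxid4932.1.xml", "taxid9606.1.xml",
             "taxid4932.2.xml", "taxid_0"]
    groups, unmatched = group_by_species(names)
    for taxid, inputs in groups.items():
        print(output_name(taxid), "<-", inputs)
    print("unmatched:", unmatched)
```

Names that do not fit the pattern (such as a dummy 'taxid_0') are reported rather than silently dropped, so they can be handled by hand.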
BIND 'Mining'
Since BIND represents one of the major resources of biological interactions, it makes sense to prototype some basic mining of the BIND data produced, before and independently of any importing system's repository. This allows generating pre-defined reports and/or stats about the BIND data. It leads to a deeper understanding of 'data trends', particularly if future efforts in BIND Translation focus on more biochemical details of interactions. It is also useful for general data validation and sanity checks.
General Translation Stats
1. The following spreadsheet includes some general purpose translation stats/insights:
2. The first table encapsulates the unique counts for the translated interactions, interactors, and pubmed references in the All Inclusive build (i.e. all species, all interactor types, and including complexes), in the protein-protein interactions build for the 8 selected species, and also lists the same counts for the protein-protein interactions in the 8 species individually.
In summary, there are 183,495 unique interactions/complexes, 66,754 unique interactors (based on their primary reference), and 16,519 unique pubmed reference IDs. These numbers can be compared to the earlier implementation/uses of BIND data and to other available interaction databases. These figures do not include what was deemed as 'filtered' interactions/complexes.
The second 'sum' table for protein-protein interactions displays why/how the sum of the interaction numbers on a per species basis can be more than the interaction count in the 'group of species' count, due to the existence of redundant inter-species interactions (listed under each species).
3. Spreadsheets 'InterSpeciesIntAllInclusive' and 'InterSpeciesIntProtein_8': These exemplify the per-release mapping between redundant BIND interaction IDs and their matching PSI-MI interaction IDs. There are usually two PSI-MI IDs in each entry, belonging to separate species. These spreadsheets are of great value, and can be used by importing systems, as a table of their own, as a quality assurance measure for tracking inter-species BIND interactions that are listed in multiple species/files. This approach keeps the original output PSI-MI files intact, and allows users of a species file, like human, to still have all the interactions in which a human protein was involved, while highlighting redundancies separately.
4. Spreadsheets 'PubMedvsInt(All)' and 'PubMedvsInt(Protein_8)': These compare the number of interactions referenced per pubmed ID. These can give more insight into what curators list as a reference when curating an interaction, how often a reference is used, and how much value would be added to this product by matching these pubmed IDs with a new data dimension like MeSH terms.
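The relation between the per-species sums and the 'group of species' count, and the kind of redundancy table described in item 3, can be illustrated with a minimal sketch. The class InterSpeciesTally, its method names, and the input shape (BIND interaction IDs keyed by species file) are assumptions made for illustration; they are not taken from the BIND Miner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical tally over BIND interaction IDs grouped by species file.
final class InterSpeciesTally {
    /** Sum of per-file counts: an inter-species interaction is counted once per file listing it. */
    static int sumPerFile(Map<String, Set<String>> idsByFile) {
        int sum = 0;
        for (Set<String> ids : idsByFile.values()) {
            sum += ids.size();
        }
        return sum;
    }

    /** Unique interaction IDs across all files (the 'group of species' count is its size). */
    static Set<String> uniqueIds(Map<String, Set<String>> idsByFile) {
        Set<String> all = new TreeSet<>();
        for (Set<String> ids : idsByFile.values()) {
            all.addAll(ids);
        }
        return all;
    }

    /** IDs listed in more than one file, mapped to the files that list them. */
    static Map<String, List<String>> redundantIds(Map<String, Set<String>> idsByFile) {
        Map<String, List<String>> filesById = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : idsByFile.entrySet()) {
            for (String id : e.getValue()) {
                filesById.computeIfAbsent(id, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        filesById.values().removeIf(files -> files.size() < 2);
        return filesById;
    }
}
```

If a human file lists two interactions and a mouse file lists two, one of which is shared, the per-file sum is 4 while the unique count is 3, and the shared ID is the single entry in the redundancy map.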
Interaction Attributes Stats
1. The prototyped BIND miner selected a handful of attributes to track and report on. They were chosen because they come from different levels in the PSI-MI model, in addition to reaping the fruits of using CVs with some of them. The attributes tracked are: Interactor types, Interaction detection methods used in interactions, interactor references to external databases, and the variety in the number of subunits in complexes. The stats for these attributes within the 8 favorite/selected species were generated. The sample results are displayed in the following spreadsheet (although future reports may use a graphical representation, instead):
2. The first spreadsheet shows the total counts for each attribute, while the rest show them on a per-species/file basis.
3. The x-ref report also includes x-refs added during the translation process itself. The stats for an x-ref do not count unique occurrences: it is counted every time it appears (which may happen often, if the interactor it references appears often).
4. The report does not take into consideration the 'filtered interactions' which were excluded.
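The counting rule in item 3 (every appearance counts, not only the first) can be shown with a short sketch. The class XrefTally and the input shape are hypothetical: each inner list stands for the x-ref database names attached to one appearance of an interactor in an interaction.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical occurrence counter for x-ref database names.
final class XrefTally {
    /** Counts each database name once per appearance, without de-duplicating interactors. */
    static Map<String, Integer> countByDatabase(List<List<String>> xrefDbsPerAppearance) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> dbs : xrefDbsPerAppearance) {
            for (String db : dbs) {
                counts.merge(db, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

An interactor that carries one x-ref and takes part in three interactions therefore contributes three to that database's total.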
Design and Technical Notes
Design And WorkFlow
1. The overall theme is to abstract the design, but to keep the implementation down to earth by supporting all the specifics and unpredictable special cases in data. This allows handling issues at hand now, while facilitating software reuse later on.
2. Technical 'Byproducts': Several functionalities were developed for supporting the BIND translation process. These include:
- XML element survey: A standalone program that generates stats about each XML element in an XML file. It was very handy in BIND translation. Example outputs for both BIND and IntAct are in an attached spreadsheet (see 'BIND Analysis' section).
- ID mapping: These are listed under the data warehouse documentation.
- BIND Miner: A standalone program that generates specific stats about BIND models that are fed in PSI-MI xml format. The user has the option of generating the reports using the PSI-MI java API (object model representation of the interactions) or plain XML API (JDOM objects - elements and attributes).
3. Design bits: The translation framework is implemented as a small class hierarchy, to support type control for potential future PSI-MI translators. A small factory pattern/method is used to select the appropriate translator, specified by the user at the command line. This is a 'placeholder' for the future, since only one translator is currently available (i.e. BIND to PSI-MI 2.5). The BIND-complex translator class was implemented as a subclass of the main BIND translator class, and that might serve as a good example to follow, if translating other major and well-defined components in BIND. The BIND Miner prototype is implemented as a small class hierarchy with polymorphic calls. There are partial singleton implementations (e.g. in container-like classes), as well as utility design for other classes, like the tags and util classes.
4. The steps required for handling GIs in BIND that were not readily mapped to Uniprot involved querying the Batch Entrez tool and re-mapping those GIs, if feasible. That is the major, one-time, partially automated step in this translation. Otherwise, running the BIND Translator now is an automated process.
5. The narrated 'activity workflow' for the translation process is listed below. Note that this is a high level representation that does not delve into the details/special cases of data mapping:
- Load, from local file, the x-path/element names (handlers) for the BIND elements of interest and make it available throughout the rest of the translation.
- Load the current CV mapping between BIND and PSI-MI and make it available throughout the rest of the translation. Then, for each of the BIND files translated:
- Initialize the ID mapping process with Uniprot. This process loads the uniprot GI/Entrez gene mappings, for the species/file being translated at the moment. The species ID is strictly expected to be part of the input BIND file name (format: taxidXXX.foo, where 'XXX' is the species ID).
- Begin the translation of interactions. Traverse each BIND interaction tree while building the PSI-MI XML tree.
- When a 'supported' CV term element is encountered, the CV Mapper is called upon to provide the correct mapping (if any).
- When a GI/Entrez gene ID is encountered (for an interactor), the ID mapping service is called. If no match is available within the same species, it is called upon across different species (see comment on 'species mismatch' elsewhere). If none matches, it is called upon to look into the supplementary mappings (i.e. output from mappings done through Batch Entrez, mentioned earlier).
- If interaction is missing 'vital info', label as 'bad'.
- If user requested translating complexes, do those as well. Add 'bad' labels if needed.
- Dump output. Dump bad interactions separately.
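The three-step ID lookup in the workflow above (same species, then other species, then the supplementary Batch Entrez mappings) can be sketched as a simple fallback chain. This is a sketch under assumptions: the class IdMappingChain, its constructor, and the use of in-memory maps are hypothetical, and the keys and accessions used with it would be placeholders rather than real mappings.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical fallback chain for mapping a GI/Entrez Gene ID to a Uniprot accession.
final class IdMappingChain {
    private final Map<String, String> sameSpecies;    // mappings for the species being translated
    private final Map<String, String> otherSpecies;   // mappings from all other species
    private final Map<String, String> supplementary;  // mappings recovered through Batch Entrez

    IdMappingChain(Map<String, String> sameSpecies,
                   Map<String, String> otherSpecies,
                   Map<String, String> supplementary) {
        this.sameSpecies = sameSpecies;
        this.otherSpecies = otherSpecies;
        this.supplementary = supplementary;
    }

    /** Tries each source in order and returns the first accession found, if any. */
    Optional<String> toUniprot(String sourceId) {
        String acc = sameSpecies.get(sourceId);
        if (acc == null) {
            acc = otherSpecies.get(sourceId);
        }
        if (acc == null) {
            acc = supplementary.get(sourceId);
        }
        return Optional.ofNullable(acc);
    }
}
```

An empty result here corresponds to an interactor that could not be mapped; whether such an interaction is then labelled 'bad' depends on the filtering policy of the release.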
Technologies
All the source code for this project is written in java, is 'home grown', and is available through these wiki pages. Other tools and third party packages used include:
1. XML Spy - Professional Edition (version 2009, SP1) from Altova, Inc (http://www.altova.com/): An XML IDE that was used mostly for visualizing/analyzing xml schemas, and for preliminary xml file validation.
2. AltovaXML from Altova, Inc (http://www.altova.com): A freely available command line tool, with various XML related functionalities. It was used for batch XML file validation at the MS-DOS prompt.
3. NetBeans (version 6.7), a freely available java IDE sponsored by Sun Microsystems (http://www.sun.com), used for java (1.6) development.
4. JDOM (version 1.1) was used as an XML API and Xerces (version 2.0) as the XML parser of choice.
5. Miscellaneous commands (at the prompt) were helpful, in addition to minor use of the PSI-MI java API.
Code Notes
1. The BIND Translator, and its associated classes, are divided over two packages:
- Translator package: Responsible for the actual translation; it also sets the framework for such translators in general.
- Translation Utility package: Satellite functionalities related to translations.
2. There are 13 classes (approx. 3,250 lines of code):
- Translator package: Trans/PSIMITrans, BINDTags, PSIMITags, CVMapEntry, BINDToPSIMITrans, BINDToPSIMITransComplex.
- Translation Utility package: XMLUtil, XMLStats, XMLElementInfo, PSIMIMiner, PSIMIMinerXML, PSIMIMinerModel.
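The translator hierarchy and the small factory described under 'Design bits' can be sketched as follows. All names below (SketchTranslator and the others, the "bind" key, the handlesComplexes method) are hypothetical and deliberately differ from the project's class names listed above; the sketch only shows the shape of the design, not the actual source.

```java
// Hypothetical base type for PSI-MI translators.
abstract class SketchTranslator {
    abstract String description();
}

// Hypothetical main BIND translator (interactions only).
class SketchBindTranslator extends SketchTranslator {
    @Override
    String description() {
        return "BIND to PSI-MI 2.5";
    }

    boolean handlesComplexes() {
        return false;
    }
}

// Hypothetical complex-aware translator, a subclass of the main BIND translator.
class SketchBindComplexTranslator extends SketchBindTranslator {
    @Override
    boolean handlesComplexes() {
        return true;
    }
}

// Hypothetical factory that picks a translator from a command-line style key.
final class SketchTranslatorFactory {
    static SketchTranslator create(String key, boolean withComplexes) {
        if ("bind".equalsIgnoreCase(key)) {
            return withComplexes ? new SketchBindComplexTranslator() : new SketchBindTranslator();
        }
        throw new IllegalArgumentException("Unknown translator: " + key);
    }
}
```

Adding another translator later would mean one more subclass and one more branch in the factory, which is the 'placeholder' role the factory plays while only the BIND translator exists.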
Attachments, Additional Documents, And Links
6. BIND_DataAnalysis_SelectedSpecies.xls
7. BIND Translator source: BINDTrans_src_20091218.zip --> an updated version of the source code can be found at http://download.baderlab.org/BINDTranslation/
8. BIND in PSICQUIC: PSICQUIC service link
9. BIND in PSI-MI 2.5 - Download page
10. 'BIND in PSI-MI 2.5' paper, submitted to Database (http://database.oxfordjournals.org/): BINDTrans_Manuscript.doc
Project Releases/Milestones
1. All the BIND Translation related documents, including the different releases of the translation, are accessible from the Bader Lab domain, under the folder: /Volumes/Groups/PathwayCommons/BIND/BINDTranslation.
2. Note that in some releases, some of the PSI-MI XML files may not contain any interactions in them. This is due to the incremental nature of the different project phases, for one or more of the following reasons:
- The equivalent source files happen to include complexes only, which we did not support at the time of that release. And/or
- Those files do not carry any interactions where both interactors are proteins, and the release was focused on that. And/or
- All the interactions in these files are deemed as 'Filtered interactions' (i.e. missing vital info) and were removed from the production file and logged.
However, these 'empty' files are kept for inventory tracking purposes.
3. The following describes the folders for the different releases for BIND Translation.
Phase I
1. Created a BIND data/element analysis, for future reference.
2. Provided a BIND to PSI-MI 2.5 translation for the core interaction elements. The analysis/requirements for this translation were reverse-engineered from the earlier XSLT prototype, scaling from version 1.0 to 2.5, when necessary, with some modifications as well. This core translation covered, at least: interaction related identifiers/description, interactants' names and descriptions, interactants' external references (e.g. GI IDs, Entrez Gene IDs,...etc), experimental condition details, species info, and publication identifiers.
3. Only interactions where both sides are defined as proteins were covered.
4. The translation targeted eight species: human, mouse, rat, yeast, E. coli, thale cress, fruit fly, and worm.
5. The PSI-MI 2.5 files were converted to BIOPAX-Level 2, using the available converter, and loaded in a local PC instance. This indirectly involved installing/setting up this instance.
6. Initial support for PSI-MI controlled vocabularies was added later.
Released Files:
- PhaseI_a: PSIMI and BIOPAX files for BIND Interactions for the selected species.
- PhaseI_b: Same content as above, but with the initial support of PSI-MI CVs and various enhancements/fixes.
Phase II
1. Addressed some issues and hanging threads from phase I.
2. Translated non-protein interaction types (i.e., RNA, DNA, gene, small molecule, complex, and photon).
3. Translated all species (more accurately, all 1,620 BIND files). Converted all to BIOPAX and imported into local PC instance.
4. More detailed PSI-MI CV support.
Released Files:
- The PSIMI25 files are split into two folders: one for the 'selected species' (the eight favorites listed above), and one for all the other species.
- The BIOPAX folder has two sub-folders: BIOPAX_20090626 (BIOPAX files that match all the PSIMI files for this release) and BIOPAX_20090702 (A special BIOPAX files release to test fixes to the PSIMI-Biopax converter)
Phase III
1. More focused and expanded releases.
2. BIND - Expand & Refresh: BIND interactants were mapped to Uniprot accns using both GIs and Entrez Gene IDs (whenever available) through an ID mapping process.
Released files:
- PhaseIII_a: Focused on humans only. Purpose was to tighten the screws, refine CV mappings, handle some BIND 'data issues', and introduce new elements.
- PhaseIII_b: Included all protein-protein interactions for all species/files. First release of BIND ID-mapping - BIND Refresh. Output species files are split over two folders: one for the 'selected species' (listed above), and one for all the other species.
- PhaseIII_c: Second and expanded release of BIND ID-mapping.
- PhaseIII_d: An all-inclusive release. This release included a revised release for all non-protein-protein interactions, in addition to a first release of all BIND complexes, plus other enhancements. There is only one file in this release that does not have any interactions in it: taxid64091.1.xml (Halobacterium). The original BIND file has one interaction and one complex for Hop (HR) tetramer, but both the interaction and the complex fail our filtering policy, and are as such logged into the bad entries file, instead. The PSI-MI output file is kept for record keeping, though.
- PhaseIII_e: Similar to the previous release, but splits the build into different flavors, exercising the different available build options for the translation process: 'AllInclusive_NoDataCleaning' and 'ProteinsOnlyNoComplexes' folders.
Phase IV
- Phase IV_a: All inclusive release reflecting a revised data filtering policy, whereby only interactions with missing primary refs for their interactors are filtered, in addition to complexes referencing them, if any.
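The revised policy can be read as a two-pass filter, sketched below. This is an interpretation, not the project's code: the class FilterPolicySketch and the map-based inputs are hypothetical, and the sketch assumes that a complex is filtered when it references at least one filtered interaction.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical two-pass filter: interactions first, then complexes that reference them.
final class FilterPolicySketch {
    /** Interactions with at least one interactor whose primary reference is missing or blank. */
    static Set<String> filteredInteractions(Map<String, List<String>> primaryRefsByInteraction) {
        Set<String> filtered = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : primaryRefsByInteraction.entrySet()) {
            for (String ref : e.getValue()) {
                if (ref == null || ref.trim().isEmpty()) {
                    filtered.add(e.getKey());
                    break;
                }
            }
        }
        return filtered;
    }

    /** Complexes that reference at least one filtered interaction. */
    static Set<String> filteredComplexes(Map<String, List<String>> interactionIdsByComplex,
                                         Set<String> filteredInteractions) {
        Set<String> filtered = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : interactionIdsByComplex.entrySet()) {
            for (String id : e.getValue()) {
                if (filteredInteractions.contains(id)) {
                    filtered.add(e.getKey());
                    break;
                }
            }
        }
        return filtered;
    }
}
```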
Phase V
- Phase V: Expansion of the translation to include interaction places and experimental forms.
Future Releases
1. Maintenance and handling more of BIND's 'data issues'.
2. Translating more interaction categories and elements.