GeneMania Data Warehouse (DW)
This document elaborates on the architecture of the GeneMania Data Warehouse subsystem.
General Comments:
- The DW encapsulates a mosaic of bioinformatics resources, with the list of resources growing over time. It also encapsulates basic data processing and admin functionalities.
- The general technical preference is to follow a 'one system, one technology' rule. In this case, that technology would be Java, and much of the work would be home-grown source code.
- We may rely on third-party bioinformatics software for some specific tasks. In a multi-year project of this scale, however, we should be aware that too much of this results in more dependence, less flexibility and, in some cases, a maze of different pieces of software. So, this has to be done wisely and only as needed.
- As a general remark, the DW has to 'think' in terms of detailed biological/genomic content, because this is its daily 'bread and butter'. For example, it has to 'think' in terms of an Ensembl gene and a TAIR gene, and how their contents differ or relate to each other. The other GM subsystems, however, may be able to define things in a way that is less bound to biology.
- The DW has to balance the available manpower, the need for regular data releases/processing, and the goal of building a professional software subsystem. The only way to achieve that is to add 'another brick to the wall' with each project. For example, the ID mapping project brought with it the gene-based entity classes, and so on.
Some of my DW classes are not related to GeneMania's ID mapping process; they were developed to help with the BIND Translation project. They are nonetheless part of this class set/hierarchy because this is where they naturally belong. The code posted below (DW_src_20091218) should be considered a separate thread: it should not be used in the aforementioned ID mapping process, nor is it likely to benefit it.
DW Architecture:
At a system level, the DW will be composed of the following layers/components:
- Loaders Layer: These are the classes (one or more packages) that manage the loading of data from the different external bioinformatics resources into the DW.
- Data Warehouse: This is a collection of the different database schemas that the data warehouse accumulates on a regular or occasional basis.
- Database Layer: This is the layer that encapsulates any database-related functionality. It generally connects the entity layer with the different schemas, hiding the details of the latter. Most of the SQL resides here, with varying levels of complexity, from single-table queries to queries that join six tables (e.g. to retrieve the Entrez IDs matching Ensembl genes in the Ensembl core database).
- Entity Layer: These can be described as JavaBeans with added functionality. They are the main ingredients of the DW.
- Tools: These vary in complexity from simple wrapper/parser classes to admin tool classes.
- General utilities: As the name indicates, this component holds the general utility classes. These support both GM ID mapping and BIND Translation.
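As a rough illustration of how the Database Layer can hide schema details from the Entity Layer, the sketch below defines a small lookup interface and an in-memory stand-in for it. The names EntrezXrefLookup and InMemoryEntrezXrefLookup are hypothetical and do not appear in the DW code base; a schema-backed implementation would keep its JDBC connection and its multi-table join behind the same interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract seen by the entity layer: a question about genes,
// with no table names or SQL exposed.
interface EntrezXrefLookup {
    List<String> entrezIdsFor(String ensemblGeneId);
}

// Hypothetical stand-in for a schema-backed implementation. A real one would
// run the join against the mirror schema; this one reads from a map.
class InMemoryEntrezXrefLookup implements EntrezXrefLookup {
    private final Map<String, List<String>> xrefs = new HashMap<>();

    // Registers the Entrez IDs known for one Ensembl gene (defensive copy).
    void put(String ensemblGeneId, List<String> entrezIds) {
        xrefs.put(ensemblGeneId, new ArrayList<>(entrezIds));
    }

    // Returns the matching Entrez IDs, or an empty list for an unknown gene.
    @Override
    public List<String> entrezIdsFor(String ensemblGeneId) {
        List<String> ids = xrefs.get(ensemblGeneId);
        return ids == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(ids);
    }
}
```

Callers that depend only on the interface do not need to change when the underlying schema or query changes, which is the point of keeping the SQL in one layer.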
Design Notes:
The root entity is the ExtResource class, a high-level abstraction that should, hopefully, also cover non-gene-based external resources. The current and future gene-based entity classes share a common hierarchy (all inherit from the ExtResourceGene class, an extension of ExtResource). In addition to type control and the general benefits of OO inheritance, this structure allows flexibility when these classes reference each other, which comes in handy when generating the ID mapping tables.
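The hierarchy described above can be sketched roughly as follows. Only the class names ExtResource, ExtResourceGene, EnsemblGene, EntrezGene and TAIRGene come from this document; every field, constructor and method shown here (including mappingRow and the EntityHierarchyDemo class) is an assumption made for illustration and is not taken from the actual source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Root of the hierarchy: any external resource record, gene-based or not.
// Fields and accessors are assumed for this sketch.
abstract class ExtResource {
    private final String source; // e.g. "Ensembl", "Entrez", "TAIR"
    private final String id;     // identifier as used by that source

    protected ExtResource(String source, String id) {
        this.source = source;
        this.id = id;
    }

    String getSource() { return source; }
    String getId() { return id; }
}

// Common parent of all gene-based entities; holds references to other genes.
abstract class ExtResourceGene extends ExtResource {
    private final String species;
    private final List<ExtResourceGene> crossReferences = new ArrayList<>();

    protected ExtResourceGene(String source, String id, String species) {
        super(source, id);
        this.species = species;
    }

    String getSpecies() { return species; }

    // Any gene type can reference any other gene type through the shared parent.
    void addCrossReference(ExtResourceGene other) { crossReferences.add(other); }

    List<ExtResourceGene> getCrossReferences() {
        return Collections.unmodifiableList(crossReferences);
    }
}

final class EnsemblGene extends ExtResourceGene {
    EnsemblGene(String id, String species) { super("Ensembl", id, species); }
}

final class EntrezGene extends ExtResourceGene {
    EntrezGene(String id, String species) { super("Entrez", id, species); }
}

final class TAIRGene extends ExtResourceGene {
    TAIRGene(String id, String species) { super("TAIR", id, species); }
}

class EntityHierarchyDemo {
    // Builds one row of a mapping table: the gene itself, then its cross-references.
    static List<String> mappingRow(ExtResourceGene gene) {
        List<String> row = new ArrayList<>();
        row.add(gene.getSource() + ":" + gene.getId());
        for (ExtResourceGene xref : gene.getCrossReferences()) {
            row.add(xref.getSource() + ":" + xref.getId());
        }
        return row;
    }

    public static void main(String[] args) {
        EnsemblGene gene = new EnsemblGene("ENSG00000139618", "Homo sapiens");
        gene.addCrossReference(new EntrezGene("675", "Homo sapiens"));
        System.out.println(mappingRow(gene));
    }
}
```

Because mappingRow accepts the common parent type, the same code handles an Ensembl gene that references an Entrez gene, a TAIR gene that references an Ensembl gene, and any gene type added later.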
- As of the date of this writing (16/12/09), the DW code base is made up of 25 classes distributed over 4 packages (approx. 6250 lines of code). The classes listed below illustrate the architecture described above:
Database Layer: DBUtil, EnsemblMirrorTables, EntrezMirrorTables, TAIRMirrorTables, UniprotMirrorTables, ExtResourceGeneTable.
Entity Layer: ExtResource, ExtResourceGene, EnsemblGene, EntrezGene, TAIRGene, Uniprot, OBOTerm.
Tools Layer: IdentifierMapper, IdentifierValidator, IVReportSummary, IVReportDetail, FFColumns, IdentifierMapperService, BatchEntrezReader, OBOReader, OBOContainer.
The current (11/02/09) ID mapping process includes a prototype that saves a partial snapshot of the generated ID mapping tables locally, in the ExtResourceGene table. When the saveLocal option is set to true, summary info about each gene is saved into that table (Source, Species, Gene ID, in addition to the status/timestamp columns), and the auto-incremented primary key of that table is used as the GMID. No other info is saved, and the table is truncated regularly. This demonstrates one approach to saving ID mapping tables locally, should that be needed in the future. Since local saving is not a requirement, this step can be dropped if performance needs to be optimized, and replaced by simply generating unique integers as GMIDs.
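The alternative mentioned above, generating unique integers as GMIDs without a local table, could look something like the following minimal sketch. The class GmidGenerator and its methods are hypothetical and not part of the DW code base; the sketch assumes that a (source, species, gene ID) triple identifies a gene, mirroring the summary columns listed above, and that IDs only need to be unique within a single run.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory replacement for the auto-incremented primary key.
class GmidGenerator {
    // Maps (source, species, geneId) to the GMID already handed out for it.
    private final Map<List<String>, Integer> assigned = new HashMap<>();
    private int nextGmid = 1;

    // Returns the GMID for a gene, allocating a new integer on first sight.
    synchronized int gmidFor(String source, String species, String geneId) {
        List<String> key = Arrays.asList(source, species, geneId);
        Integer existing = assigned.get(key);
        if (existing != null) {
            return existing;
        }
        int gmid = nextGmid++;
        assigned.put(key, gmid);
        return gmid;
    }

    // Number of distinct genes that have received a GMID so far.
    synchronized int assignedCount() {
        return assigned.size();
    }
}
```

Unlike the table-based prototype, nothing here survives a restart, so this only fits the case where GMIDs are regenerated with each mapping run.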
- Another optimization approach would be to create 'sub-tables' for the Entrez mirror that include only the species of interest to the DW, and to query those instead of the all-inclusive tables.
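One way to build such a sub-table is a CREATE TABLE ... AS SELECT statement filtered by taxonomy ID. The helper below only assembles the SQL text; the class name EntrezSubTableSql, the table names passed to it, and the tax_id column name are assumptions for illustration, since the actual mirror schema is not described here.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that builds the SQL for a species-restricted copy
// of a mirror table. It does not execute anything.
class EntrezSubTableSql {

    // Assumes the source table has an integer column named tax_id.
    static String createSubTable(String sourceTable, String subTable, List<Integer> taxIds) {
        if (taxIds.isEmpty()) {
            throw new IllegalArgumentException("at least one taxonomy ID is required");
        }
        requireIdentifier(sourceTable);
        requireIdentifier(subTable);
        String idList = taxIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(", "));
        return "CREATE TABLE " + subTable
                + " AS SELECT * FROM " + sourceTable
                + " WHERE tax_id IN (" + idList + ")";
    }

    // Table names are concatenated into the SQL, so restrict them to plain identifiers.
    private static void requireIdentifier(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("not a plain SQL identifier: " + name);
        }
    }
}
```

For example, passing the NCBI taxonomy IDs 9606 (Homo sapiens) and 3702 (Arabidopsis thaliana) yields a statement that copies only the rows for those two species. Any indexes on the source table would need to be recreated on the copy separately.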
More Technical Documents:
DW Source: DW_src_20091218.zip
Some useful diagrams (slightly outdated): GMDW_diagrams_20090217.zip