DW ID Mapping: README
The DW subsystem includes a command-line tool that ties the different DW components together to generate and validate ID mapping files (tables), based on the specs described elsewhere on the GM wiki. To run the ID mapping tool, do the following:
- Unzip the file GMDW.zip.
- Change directory to GMDW ('cd GMDW').
- Run the program from the command line prompt, using any of the available options. For example, run the command:
- java -Xms120m -Xmx550m -jar ./dist/GeneMania.jar ENSEMBL_ENTREZ Hs false
Generating ID Mapping/Validation Files
The tool can be used in two modes:
1. One-by-one mode: In this mode, ID mapping/validation files are generated for a specific species and a specific mapping type as follows:
- java -Xms120m -Xmx550m -jar ./dist/GeneMania.jar mappingType species saveLocal [outputFileName]
- Mapping Type: can be any of the following: ENSEMBL_ENTREZ, ENTREZ_TAIR, ENTREZ_ECOLI, ENTREZ_ENSEMBL, TAIR_ENTREZ.
- Species: can be any of the following: Hs (human), Mm (mouse), Rn (rat), Ce (worm), Sc (yeast), Dm (fruit fly), Ec (E. coli), At (cress).
- Save Local: either 'true' or 'false'. For practical purposes, treat the 'saveLocal' option as an instruction to the program to generate unique GMIDs for all the genes in the output files.
- Output File Name: Optionally, you can specify the name of the output ID mapping file.
2. Bulk (All-in-one) mode: In this mode, ID mapping/validation files are generated for all species and all 'required' mappings:
- java -Xms120m -Xmx550m -jar ./dist/GeneMania.jar [saveLocal]
- In this mode, the tool runs all the required mappings for all the species, one species at a time. The mappings are: ENSEMBL_ENTREZ, ENTREZ_TAIR, ENTREZ_ECOLI.
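The two invocation forms above can be wrapped in a small shell helper. What follows is a minimal sketch and not part of the GMDW distribution: the function names 'gm_is_member' and 'gm_command' are hypothetical, and the helper only prints the command line instead of running it. It checks the mapping type and species against the lists given above. It does not check whether a particular species can be combined with a particular mapping type.

```shell
#!/bin/sh
# Hypothetical helper (not part of GMDW): validates the arguments and
# prints the java command line for either mode. It never runs the tool.

MAPPING_TYPES="ENSEMBL_ENTREZ ENTREZ_TAIR ENTREZ_ECOLI ENTREZ_ENSEMBL TAIR_ENTREZ"
SPECIES="Hs Mm Rn Ce Sc Dm Ec At"
JAVA_CMD="java -Xms120m -Xmx550m -jar ./dist/GeneMania.jar"

# Succeeds if the first argument equals one of the remaining arguments.
gm_is_member() {
    needle=$1
    shift
    for item in "$@"; do
        if [ "$item" = "$needle" ]; then
            return 0
        fi
    done
    return 1
}

# Usage:
#   gm_command [saveLocal]                                (bulk mode)
#   gm_command mappingType species saveLocal [outputFileName]  (one-by-one mode)
gm_command() {
    if [ "$#" -le 1 ]; then
        # Bulk mode: the only (optional) argument is saveLocal.
        if [ "$#" -eq 1 ] && ! gm_is_member "$1" true false; then
            echo "saveLocal must be 'true' or 'false'" >&2
            return 1
        fi
        echo "$JAVA_CMD${1:+ $1}"
        return 0
    fi
    if [ "$#" -lt 3 ] || [ "$#" -gt 4 ]; then
        echo "expected: mappingType species saveLocal [outputFileName]" >&2
        return 1
    fi
    # One-by-one mode: check each argument against the documented values.
    # The unquoted list variables are split into words on purpose.
    if ! gm_is_member "$1" $MAPPING_TYPES; then
        echo "unknown mapping type: $1" >&2
        return 1
    fi
    if ! gm_is_member "$2" $SPECIES; then
        echo "unknown species: $2" >&2
        return 1
    fi
    if ! gm_is_member "$3" true false; then
        echo "saveLocal must be 'true' or 'false'" >&2
        return 1
    fi
    echo "$JAVA_CMD $*"
}
```

For example, 'gm_command ENSEMBL_ENTREZ Hs false' prints the same command line as the example near the top of this README. The printed line is meant to be run from the GMDW directory.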
Comments
1. The command line mechanism uses preset default parameters if none are provided at the command line prompt. These defaults can be set in the DW.properties file, separately for each mode.
2. Use the DW.properties file to customize general properties as needed. For example, to generate a copy of the ID Mapping tables, without the changes the validation step introduces, but still generate the validation reports, set the DefIVFix property to 'false' (default setting is 'true'). You can also use it to specify the version of the local mirrors to use in the ID mapping.
3. The default location for the output files is under the DWTools directory.
4. The default name for a mapping file is mappingType + '_' + speciesName. This file name cannot be customized when running in bulk mode.
5. In general, it's faster to 'test drive' the ID mapping first without the GMID generating mechanism (i.e. with the 'saveLocal' command line option set to 'false'), and then to rerun it while generating IDs.
6. When running the program in the one-by-one mode, you cannot just combine any species with any mapping type. Refer to the wiki for more details on this.
7. The IVReports file names can be partially customized in the DW.properties file.
8. The tool could be packaged in a better way, as part of a 'global' build process, and hooked up to the 'logging machinery' of that build process. The performance can probably be improved as well, and there are some tips on that (in and out of the code) for the future, but there is no urgency for this at the moment.
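The 'test drive first, then rerun' workflow from comment 5 could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names 'gm_run' and 'gm_two_pass' and the DRY_RUN variable are hypothetical, and the script assumes the tool exits with a non-zero status when a run fails, which this README does not confirm. By default the commands are only printed, so the sketch can be tried without the GMDW jar in place.

```shell
#!/bin/sh
# Hypothetical sketch (not part of GMDW) of the two-pass workflow.

# Runs a command, or only prints it when DRY_RUN is 1 (the default here).
# Set DRY_RUN=0 to actually execute the commands from the GMDW directory.
gm_run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Pass 1: test drive without GMID generation (saveLocal=false).
# Pass 2: rerun with GMID generation (saveLocal=true).
# Assumption: a failed run returns a non-zero exit status, so pass 2
# is skipped when pass 1 fails.
gm_two_pass() {
    mapping_type=$1
    species=$2
    gm_run java -Xms120m -Xmx550m -jar ./dist/GeneMania.jar "$mapping_type" "$species" false &&
    gm_run java -Xms120m -Xmx550m -jar ./dist/GeneMania.jar "$mapping_type" "$species" true
}
```

For example, 'gm_two_pass ENSEMBL_ENTREZ Hs' prints the two command lines in order, and 'DRY_RUN=0 gm_two_pass ENSEMBL_ENTREZ Hs' would run them.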