GeneMania DataWarehouse (DW)
Identifier Validation:
Shared Identifier: This is the general case that covers all incidents where an identifier is shared between two or more genes. It applies to any identifier that is listed in the ID Mapping table, namely: gene name/symbol, Ensembl transcript ID, Ensembl protein ID, Entrez gene ID, Uniprot ID (irrespective of whether it is a Uniprot ID or a Uniprot accession), RefSeq mRNA ID, RefSeq protein ID, and TAIR locus ID (for Entrez/TAIR). Possible solutions:
- Delete the shared gene symbol from the ID Mapping file for all of the affected genes, so that it will not match any search query to begin with. This is the solution of choice for now.
- Delete all the affected gene entries, but that would lead to the loss of useful (and reliable) information.
- Keep all. If faced with such a case, the GM front end and the GM engine should ask the user which gene they are specifically referring to.
- Since the genes that share the same symbol are likely to be very similar in features (?), keep that symbol for one, and delete it from all the others. Unfortunately, there is no automated method of deciding on which gene to select for this.
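As a rough illustration of the solution of choice (deleting a shared identifier from every affected gene), the following is a minimal Python sketch. The in-memory layout of the mapping (a dict of gene ID to identifier lists) and the function names `find_shared` and `drop_shared` are assumptions made for this example, not the DW implementation; `fooGene`, `geneA` and `geneB` are placeholder names, and `barGene` is a made-up unshared symbol.

```python
from collections import defaultdict


def find_shared(mapping, id_type):
    """Return {identifier: sorted gene IDs} for identifiers used by 2+ genes."""
    genes_by_identifier = defaultdict(set)
    for gene_id, record in mapping.items():
        for identifier in record.get(id_type, ()):
            genes_by_identifier[identifier].add(gene_id)
    return {
        identifier: sorted(genes)
        for identifier, genes in genes_by_identifier.items()
        if len(genes) > 1
    }


def drop_shared(mapping, id_type):
    """Delete shared identifiers of one type from all affected genes.

    Returns what was dropped, so it can be written to a validation report.
    """
    shared = find_shared(mapping, id_type)
    for record in mapping.values():
        if id_type in record:
            record[id_type] = [i for i in record[id_type] if i not in shared]
    return shared


# Hypothetical mapping: gene ID -> {identifier type: [identifiers]}
mapping = {
    "geneA": {"Gene Name": ["fooGene"]},
    "geneB": {"Gene Name": ["fooGene", "barGene"]},
}
dropped = drop_shared(mapping, "Gene Name")
```

Only identifiers of the same type are compared, and the gene entries themselves are kept; only the shared identifier is removed from them.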
LeftOver Gene: Generally speaking, the mapping between the genes from two resources (e.g. Ensembl and Entrez) is not perfect. So, there will be some 'leftover' genes from the second resource (e.g. Entrez) that have to be captured separately and added to the respective ID Mapping file. For the Ensembl/Entrez example, the entries in that section of the file will be similar to the earlier ones, minus any Ensembl-specific information (Ensembl Gene ID, Ensembl Transcript ID, Ensembl Protein ID). This validation step requires filtering the IDs from the two resources against each other, and filling in the missing information when possible. For example, for an Ensembl/Entrez leftover gene, the gene definition line has to be taken from Entrez instead of Ensembl.
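The filtering step described above amounts to a set difference between the secondary-source IDs and the IDs referenced by the primary source. A minimal sketch follows, assuming both sources have already been loaded into dicts; the function name `find_leftovers`, the field names, and the definition strings are hypothetical and only illustrate the idea.

```python
def find_leftovers(primary_records, secondary_records):
    """Build mapping entries for secondary-source genes the primary source never references.

    primary_records:   {ensembl_gene_id: {"entrez_ids": [...]}}
    secondary_records: {entrez_gene_id: {"definition": "..."}}
    """
    referenced = {
        entrez_id
        for record in primary_records.values()
        for entrez_id in record["entrez_ids"]
    }
    leftovers = []
    for entrez_id in sorted(set(secondary_records) - referenced):
        leftovers.append({
            # Ensembl-specific fields stay empty for a leftover gene.
            "Ensembl Gene ID": None,
            "Ensembl Transcript ID": None,
            "Ensembl Protein ID": None,
            "Entrez Gene ID": entrez_id,
            # The definition line comes from Entrez instead of Ensembl.
            "Definition": secondary_records[entrez_id]["definition"],
        })
    return leftovers


# Illustrative data only; the definitions are placeholders.
primary = {"ENSG00000034063": {"entrez_ids": ["728688"]}}
secondary = {
    "728688": {"definition": "placeholder definition 1"},
    "100008587": {"definition": "placeholder definition 2"},
    "100008588": {"definition": "placeholder definition 3"},
}
leftovers = find_leftovers(primary, secondary)
```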
Deprecated Identifier: Another reason for mismatches between two sources is the use of old identifiers. Following the Ensembl/Entrez example, Ensembl might reference Entrez gene IDs that no longer exist in Entrez. To handle that, the ID validation modules should check that every referenced Entrez gene ID is present in Entrez and, if one is missing, do the following:
- Check whether the deprecated Entrez gene has been replaced by one or more other Entrez genes.
- If so, replace the deprecated Entrez ID with the new one(s) in the ID Mapping table, together with the associated information.
- If not, drop the deprecated Entrez ID from the ID Mapping table. It should be reported in the IVReports as well.
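The three steps above could be sketched as follows. This is an illustration under assumptions, not the DW code: the function name `resolve_deprecated` is hypothetical, the replacement table is assumed to be available as a dict of old ID to new IDs, the replacement IDs are assumed to be current, and joining several new IDs with ';' is a guess modelled on the shared identifier report.

```python
def resolve_deprecated(referenced_ids, current_ids, replacements):
    """Replace or drop Entrez gene IDs that no longer exist in Entrez.

    referenced_ids: Entrez IDs referenced by one primary-source gene
    current_ids:    set of IDs currently present in Entrez
    replacements:   {old_id: [new_id, ...]} for deprecated IDs that were replaced
    Returns (ids_to_keep, report_rows), where each report row is (old_id, new_ids or "N/A").
    """
    kept = []
    report_rows = []
    for entrez_id in referenced_ids:
        if entrez_id in current_ids:
            if entrez_id not in kept:
                kept.append(entrez_id)
            continue
        new_ids = replacements.get(entrez_id, [])
        for new_id in new_ids:
            if new_id not in kept:
                kept.append(new_id)  # replaced: keep the new ID(s)
        # Replaced or dropped, the deprecated ID is reported either way.
        report_rows.append((entrez_id, ";".join(new_ids) if new_ids else "N/A"))
    return kept, report_rows


current = {"728688"}
replacements = {"100133565": ["728688"]}

replaced = resolve_deprecated(["100133565"], current, replacements)
dropped = resolve_deprecated(["641992"], current, replacements)
```

The two calls mirror the two rows of the deprecated identifier report example: one old ID that has a replacement and one that does not.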
Identifier Validation Report (IVReport):
- The Identifier Validation process generates two types of reports (tab-delimited format) that specify the details of the 'problems' to handle. These reports are particularly useful in cases where the 'problematic' identifier is deleted from the ID mapping tables. They will allow history tracking (audit trail) at the DW level. The GM front end can also benefit from them. For example, it will be able to inform the end user that 'gene symbol fooGene was dropped from their query because it is shared between two genes: geneA and geneB'.
- Summary Report: This report indicates how widespread a validation 'problem' is. The columns are: Source, Species, Validation Type, Identifier Type (if applicable), Number of Occurences, Total Number. Examples:
Source: Ensembl   Species: Hs (9606)   Validation Type: LeftOver Gene   Identifier Type: N/A   Number of Occurences: 20164   Total Number: 36582
Source: Ensembl   Species: Hs (9606)   Validation Type: Deprecated Identifier   Identifier Type: N/A   Number of Occurences: 58   Total Number: 21566
Source: Ensembl   Species: Hs (9606)   Validation Type: Shared Identifier   Identifier Type: Ensembl Gene Name   Number of Occurences: 2601   Total Number: 31340
- Detailed Report: This will list all the problematic identifiers, along with all the gene IDs they relate to. There is a detailed report for each type of identifier validation.
- Detailed Report - LeftOvers: This simply lists the Source, Species, and the leftover ID. Refer to the respective ID mapping table for more details on the leftover gene. Example:
Source   Species    Gene ID
Ensembl  Hs (9606)  100008587
Ensembl  Hs (9606)  100008588
- Detailed Report - Deprecated: Source, Species, Gene ID (of the referring primary source gene), Old Gene ID (from the referenced source), New Gene ID (from the referenced source, if any).
Source   Species    Gene ID          Old Gene ID  New Gene ID
Ensembl  Hs (9606)  ENSG00000034063  100133565    728688
Ensembl  Hs (9606)  ENSG00000070831  641992       N/A
- Detailed Report - Shared Identifier: Source, Species, Identifier Type, Identifier, Gene ID (i.e. the IDs referencing this identifier, directly or indirectly).
Source: Ensembl   Species: Hs (9606)   Identifier Type: Uniprot ID   Identifier: RGPD7_HUMAN   Gene ID: ENSG00000015568;ENSG00000183054
Source: Ensembl   Species: Hs (9606)   Identifier Type: Ensembl Gene Name   Identifier: AL117336.22   Gene ID: ENSG00000200097;ENSG00000209753
- The Source column represents the primary source of the information, such as Ensembl or Entrez. It has the same meaning in all types of validation files.
- The species is listed in the abbreviated format, and can be one of: At, Hs, Mm, Rn, Ce, Sc, Dm, Ec. The NCBI taxonomy ID is listed in parentheses as well.
- Validation Type comes from a controlled vocabulary list, and can be one of: "Shared Identifier", "LeftOver Gene", "Deprecated Identifier".
- Identifier Type: The type of the problematic identifier (Entrez Gene ID, Uniprot ID, ...etc). Used when validating for shared identifiers.
- The 'Number of Occurences' column lists how many times the problem occurs. If gene symbol 'foo' is shared between gene1, gene2, and gene3, this counts as 1 occurrence of the problem (not 3).
- The 'Total Number' column should be interpreted as follows: When validating for 'Shared Identifier', it is the total count of all the unique identifiers of a specific type. When validating for 'LeftOver Gene', it is the total number of primary source genes (e.g. Ensembl, in the case of Ensembl/Entrez). When validating for 'Deprecated Identifier', it is the total number of unique external reference identifiers (e.g. the total number of Entrez gene IDs that are referenced by Ensembl, including the deprecated ones, if any).
- Note that the detailed report for deprecated IDs lists only one of the referencing IDs, and not all of them (if this happens to be a shared external reference case). Shared identifiers are always listed in their respective report.
- The leftover genes in an ID mapping table are also validated for shared identifiers, and the results are listed in the detailed report as well. They are not listed in the summary report (since they are not related to the primary source).
- There is no validation for shared identifiers between the primary source genes and the leftover genes in an ID mapping table (that would be cross-source comparison). There is also no validation across different identifier types.
- All forms of validations mentioned above are irrespective of the type of the gene (i.e. protein coding, micro-RNA, ...etc).
- If an Entrez gene is shared between two different Ensembl genes, any information loaded from that Entrez gene (e.g. RefSeq IDs) will be shared as well. However, the RefSeq-related information is still listed in the validation reports for convenience.
- The DW identifier validation is geared towards the primary source used in an ID mapping table. For that reason, the shared identifier validation is done last in the validation process. This may result in some interesting scenarios. First example: a secondary source gene may not be referenced by the primary source, and not be listed as a leftover, although it exists in the Entrez database. That is because it is simply an Entrez gene shared between more than one Ensembl gene, and was dropped. However, it can still be tracked in the respective validation report. Second example: the replacement gene ID (for a deprecated one) is listed in the deprecated IDs report, but it does not show up in the ID mapping table. Again, it was dropped because the old gene ID (that it replaced) was a shared one. It can be tracked in the shared identifier report as well.
- Comment on shared identifier validation when mapping back and forth (i.e. required and bonus mappings): As an example, when running the ENSEMBL_ENTREZ mapping, the shared identifier validation for the leftover Entrez genes reveals 12 cases of shared gene names. When running the ENTREZ_ENSEMBL mapping, the same validation type for Entrez (now the primary resource) reveals 24 cases instead. The former list of genes should be a subset of the latter. The reason for having more entries in the latter case is simply that not every Entrez gene was referenced by Ensembl in the first place, and that can be verified by looking at the leftover/shared reports of the respective ENSEMBL_ENTREZ mapping.
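To make the counting rules for the summary report concrete, here is a minimal sketch that computes one 'Shared Identifier' row and writes it as tab-delimited text. The function names, the input layout, and the presence of a header row in the file are assumptions for illustration; the column names are taken from the column list given for the summary report, and the sample data reuses the 'foo' example with a made-up unshared symbol 'bar'.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["Source", "Species", "Validation Type", "Identifier Type",
           "Number of Occurences", "Total Number"]


def shared_identifier_summary(source, species, id_type, identifiers_by_gene):
    """Build one summary row for the 'Shared Identifier' validation type."""
    genes_by_identifier = defaultdict(set)
    for gene_id, identifiers in identifiers_by_gene.items():
        for identifier in identifiers:
            genes_by_identifier[identifier].add(gene_id)
    # One occurrence per shared identifier, however many genes share it.
    occurrences = sum(1 for genes in genes_by_identifier.values() if len(genes) > 1)
    # Total Number: count of unique identifiers of this type.
    total = len(genes_by_identifier)
    return [source, species, "Shared Identifier", id_type, occurrences, total]


def write_summary(rows):
    """Serialize summary rows as tab-delimited text (header row is an assumption)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


identifiers_by_gene = {"gene1": ["foo"], "gene2": ["foo", "bar"], "gene3": ["foo"]}
row = shared_identifier_summary("Ensembl", "Hs (9606)", "Ensembl Gene Name",
                                identifiers_by_gene)
report = write_summary([row])
```

Here 'foo' is shared by three genes but contributes a single occurrence, and the total counts the two unique gene names 'foo' and 'bar'.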