Distances: my personal experience
Just a few suggestions based on my own experience. For a more formal and detailed treatment, go to Wikipedia or a maths manual.
Quantitative Data Arrays
For quantitative data (e.g. transcription signals from microarray experiments) you can use:
Euclidean distance, which is D = √((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 + ...)
It is the geometric distance you are used to calculating in 2 dimensions, just extended over N dimensions. Be careful: I'd suggest using it only when:
you are sure that the data are on the same scale of magnitude
you do not want to consider anti-correlated patterns similar
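The N-dimensional Euclidean distance above can be sketched in a few lines of plain Python (the function name `euclidean` is mine):

```python
import math

def euclidean(a, b):
    """N-dimensional Euclidean distance between two equal-length
    numeric profiles (e.g. expression signals)."""
    assert len(a) == len(b), "profiles must have the same length"
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The 2-D case reduces to the familiar Pythagorean distance:
print(euclidean([0, 0], [3, 4]))  # -> 5.0
```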
Correlation-based distance
This distance is ideal when you want to group objects with inter-dependent behavioral trends. For instance, with D = 1 - Correlation Index: similar = whenever A goes up, B always goes up as well; dissimilar = whenever A goes up, B always goes down. Whereas with D = 1 - Abs(Correlation Index): similar = whenever A goes up, B always goes up as well, AND ALSO whenever A goes up, B always goes down (i.e. anti-correlated patterns count as similar too). For these reasons, I usually prefer Pearson-based distance when clustering genes by expression signals (Affymetrix arrays).
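A minimal sketch of the two correlation-based distances, assuming Pearson's r as the Correlation Index (function names are mine):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient r of two equal-length profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def corr_dist(a, b):
    """D = 1 - r: anti-correlated profiles come out maximally distant."""
    return 1 - pearson(a, b)

def abs_corr_dist(a, b):
    """D = 1 - |r|: anti-correlated profiles come out close to 0."""
    return 1 - abs(pearson(a, b))

up, down = [1, 2, 3, 4], [4, 3, 2, 1]
print(corr_dist(up, down))      # close to 2: dissimilar under 1 - r
print(abs_corr_dist(up, down))  # close to 0: similar under 1 - |r|
```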
Binary Arrays
Hamming = number of discordant array elements (e.g. 1010 vs 1100 => D = 2)
This distance was developed to measure transmission errors in binary-coded strings.
Jaccard-based = 1 - (number of concordant 1s in the two arrays) / (number of positions with a 1 in either of the two arrays)
(e.g. 1010 vs 1100 => D = 1 - 1/3 = 2/3)
This distance is useful when array positions correspond to elements which can belong (value = 1) or not belong (value = 0) to the set associated with the array.
(e.g. given a list of Transcription Factors, you can associate to every gene a binary array expressing whether it is regulated by each TF or not)
(e.g. given a list of interaction partners, you can associate to every protein a binary array expressing whether it interacts with each partner or not)
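Both binary-array distances follow directly from the definitions above; a sketch in Python, with '0'/'1' strings standing for the binary arrays (names are mine):

```python
def hamming(a, b):
    """Number of discordant positions between two binary strings."""
    return sum(x != y for x, y in zip(a, b))

def jaccard_dist(a, b):
    """1 - (concordant 1s) / (positions with a 1 in either string)."""
    shared = sum(x == "1" and y == "1" for x, y in zip(a, b))
    union = sum(x == "1" or y == "1" for x, y in zip(a, b))
    return 1 - shared / union

print(hamming("1010", "1100"))       # -> 2
print(jaccard_dist("1010", "1100"))  # D = 1 - 1/3 = 2/3
```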
Hamming distance is not good, in my opinion, when the strings compared have very unequal 1/0 content and the meaning of 1s and 0s is related to set membership (as in the examples above). E.g., consider these strings, and the choices made by Hamming and Jaccard:
11 000 000
is more similar to 10 000 001 according to Hamming (DH = 2, DJ = 2/3)
is more similar to 11 011 100 according to Jaccard (DH = 3, DJ = 3/5)
110 100 000
is more similar to 011 000 000 according to Hamming (DH = 3, DJ = 3/4)
is more similar to 110 111 101 according to Jaccard (DH = 4, DJ = 4/7)
Mapping the first problem to interactions, let's say:
A1 interacts with B, C
A2 interacts with B, D
A3 interacts with B, C, E, F, G
Is A1's neighborhood more similar to A2's (Hamming's choice) or to A3's (Jaccard's choice)? This is particularly important, as some computational guys would probably go for Hamming as the first choice, without checking its validity in the context of use.
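To make the interaction example concrete, here is a quick check, assuming the partner ordering B, C, D, E, F, G (the ordering itself is arbitrary; only consistency matters):

```python
PARTNERS = ["B", "C", "D", "E", "F", "G"]  # assumed fixed ordering

def profile(partners):
    """Binary interaction profile of a protein over PARTNERS."""
    return [1 if p in partners else 0 for p in PARTNERS]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def jaccard_dist(a, b):
    shared = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1 - shared / union

A1 = profile({"B", "C"})
A2 = profile({"B", "D"})
A3 = profile({"B", "C", "E", "F", "G"})

# Hamming says A2 is the closer neighbour of A1 (2 < 3)...
print(hamming(A1, A2), hamming(A1, A3))  # -> 2 3
# ...while Jaccard says A3 is (3/5 < 2/3)
print(jaccard_dist(A1, A2), jaccard_dist(A1, A3))
```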