Distances: my personal experience
Just a few suggestions based on my own experience. For a more formal and detailed treatment, see Wikipedia or a maths textbook.
Quantitative Data Arrays
For quantitative data (e.g. transcription signals from microarray experiments) you can use:
Euclidean distance, which is D = √((x1 - x2)² + (y1 - y2)² + (z1 - z2)² + ...)
It is the geometric distance you are used to calculating in 2 dimensions, just extended over N dimensions.
Be careful: I'd suggest using it only when:
you are sure that the data are in the same magnitude scale
you do not want to consider anti-correlated patterns similar
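As a minimal sketch of the formula above (the page contains no code; the function name is mine), the N-dimensional Euclidean distance can be written as:

```python
import math

def euclidean(a, b):
    """Geometric distance between two points, extended over N dimensions."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 2-D sanity check with the familiar 3-4-5 right triangle
print(euclidean([0, 0], [3, 4]))  # -> 5.0
```

Note that this makes the magnitude-scale caveat concrete: a single coordinate measured on a much larger scale dominates the sum of squares.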
Correlation-based: this distance is ideal when you want to group objects with inter-dependent behavioral trends.
For instance, with D = 1 - Correlation Index: similar = whenever A goes up, B always goes up as well; dissimilar = whenever A goes up, B always goes down.
Whereas, with D = 1 - Abs(Correlation Index), both cases count as similar: whenever A goes up, B always goes up, and also whenever A goes up, B always goes down (anti-correlated patterns are grouped together).
For these reasons, I usually prefer Pearson-based distance when clustering genes by expression signals (Affymetrix arrays).
Binary Arrays
Hamming = number of discordant array elements (e.g. 1010 vs 1100 => D = 2).
This distance was developed to measure transmission errors in binary-coded strings.
Jaccard-based = 1 - (number of concordant 1s in the two arrays)/(number of positions with 1 in either of the two arrays)
(e.g. 1010 vs 1100 => D = 1 - 1/3 = 2/3).
This distance is useful when array positions correspond to elements which can belong (value = 1) or not belong (value = 0) to the set associated with the array
(e.g. given a list of Transcription Factors, you can associate to every gene a binary array expressing whether it is regulated by each TF or not;
e.g. given a list of interaction partners, you can associate to every protein a binary array expressing whether it interacts with each partner or not).
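The two definitions above can be sketched as follows (my own minimal implementations, reproducing the 1010 vs 1100 example):

```python
def hamming(a, b):
    """Number of discordant positions."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """1 - (shared 1s) / (positions with a 1 in either array)."""
    shared = sum(x == y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return 1 - shared / either

a, b = [1, 0, 1, 0], [1, 1, 0, 0]
print(hamming(a, b))  # -> 2
print(jaccard(a, b))  # -> 0.666... (= 2/3)
```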
Hamming distance is not good, in my opinion, when the strings compared have a very unequal 1/0 content and the meaning of 1s and 0s is related to set membership (as in the examples above).
E.g., consider these strings, and the choice made by Hamming and Jaccard:
11 000 000
is more similar to 10 000 001 according to Hamming (D_H = 2, D_J = 2/3)
is more similar to 11 111 000 according to Jaccard (D_H = 3, D_J = 3/5)
110 100 000
is more similar to 011 000 000 according to Hamming (D_H = 3, D_J = 3/4)
is more similar to 110 111 101 according to Jaccard (D_H = 4, D_J = 4/7)
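The second example above can be checked mechanically; this sketch (variable names are mine) uses exact fractions so the Jaccard values match the text:

```python
from fractions import Fraction

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    shared = sum(x == y == '1' for x, y in zip(a, b))
    either = sum('1' in (x, y) for x, y in zip(a, b))
    return 1 - Fraction(shared, either)

s      = '110100000'
h_pick = '011000000'   # Hamming's choice
j_pick = '110111101'   # Jaccard's choice

print(hamming(s, h_pick), jaccard(s, h_pick))  # -> 3 3/4
print(hamming(s, j_pick), jaccard(s, j_pick))  # -> 4 4/7
```

Hamming prefers h_pick (3 < 4), while Jaccard prefers j_pick (4/7 < 3/4): the two distances rank the same candidates in opposite order.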
Mapping the first problem to interactions, let's say
A1 interacts with B, C
A2 interacts with B, D
A3 interacts with B, C, E, F, G
Is A1's neighborhood more similar to A2's (Hamming's choice) or A3's (Jaccard's choice)?

In addition, neither of these two distances is suited when some array positions are noisy whereas others are information-rich, as in the following fictional examples:
110 010 001
110 100 000
110 000 110
000 101 100
001 101 011
000 111 000
Clearly, the 1st and 2nd positions are always co-conserved, as well as the 4th and 6th; the co-conservation of the other positions is irregular; therefore, we may think that the co-conserved positions hold more information than the others.
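For set-membership data like the interaction example, the Jaccard distance can also be computed directly on sets of partners, without building the binary arrays at all (a sketch; the function name is mine):

```python
def jaccard_sets(a, b):
    """1 - |intersection| / |union| over two sets of interaction partners."""
    return 1 - len(a & b) / len(a | b)

A1 = {'B', 'C'}
A2 = {'B', 'D'}
A3 = {'B', 'C', 'E', 'F', 'G'}

print(jaccard_sets(A1, A2))  # ~ 0.667 (= 2/3): one shared partner out of three
print(jaccard_sets(A1, A3))  # -> 0.6   (= 3/5): two shared partners out of five
```

So Jaccard ranks A3 as the closer neighborhood to A1, despite A3's larger Hamming distance.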