Distances: my personal experience
Just a few suggestions based on my own experience. For a more formal and detailed treatment, see Wikipedia or a maths textbook.
For a very rich directory of string-similarity measures, see http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Quantitative Data Arrays
For quantitative data (e.g. transcription signals from microarray experiments) you can use:
Euclidean distance, which is D = √((x1 - x2)² + (y1 - y2)² + (z1 - z2)² + ...)
it is the geometric distance you are used to calculating in 2 dimensions, just extended to N dimensions
be careful: I'd suggest using it only when
you are sure that the data are on the same magnitude scale, and
you do not want to consider anti-correlated patterns similar
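A minimal sketch of the N-dimensional case in plain Python (the function name and the example vectors are mine, purely for illustration):

{{{#!python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# two fictional profiles, already on the same magnitude scale
print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
}}}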
Correlation-based
this distance is ideal when you want to group objects with inter-dependent behavioral trends
for instance, with D = 1 - Correlation Index => similar = whenever A goes up, B always goes up as well; dissimilar = whenever A goes up, B always goes down
whereas, with D = 1 - Abs(Correlation Index) => similar = both pairs that always move together and pairs that always move in opposite directions (anti-correlated patterns count as similar too)
for these reasons, I usually prefer a Pearson-based distance when clustering genes by expression signals (Affymetrix arrays)
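A sketch of both variants, assuming scipy is available; scipy.stats.pearsonr returns the correlation index r together with a p-value, and the profiles below are made up:

{{{#!python
from scipy.stats import pearsonr

def pearson_distance(a, b, absolute=False):
    """D = 1 - r, or D = 1 - |r| if anti-correlation should count as similar."""
    r, _ = pearsonr(a, b)
    return 1 - abs(r) if absolute else 1 - r

up   = [1.0, 2.0, 3.0, 4.0]
down = [4.0, 3.0, 2.0, 1.0]   # perfectly anti-correlated with `up`
print(pearson_distance(up, down))                 # 2.0 -> maximally dissimilar
print(pearson_distance(up, down, absolute=True))  # 0.0 -> similar
}}}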
Binary Arrays
Hamming = number of discordant array elements (e.g. 1010 vs 1100 => D = 2)
this distance was developed to measure transmission errors in binary-coded strings
Jaccard-based = 1 - (number of concordant 1s in the two arrays)/(number of positions with 1 in either of the two arrays)
(e.g. 1010 vs 1100 => D = 1 - 1/3 = 2/3)
this distance is useful when array positions correspond to elements which can belong (value = 1) or not belong (value = 0) to the set associated with the array
(e.g. given a list of Transcription Factors, you can associate with every gene a binary array expressing whether it is regulated by each TF or not)
(e.g. given a list of interaction partners, you can associate with every protein a binary array expressing whether it interacts with each partner or not)
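Both distances in a minimal plain-Python sketch (the function names are mine), reproducing the 1010 vs 1100 example above:

{{{#!python
def hamming(a, b):
    """Number of discordant positions."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """1 - (concordant 1s) / (positions with a 1 in either array)."""
    concordant = sum(x == y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return 1 - concordant / either

a, b = [1, 0, 1, 0], [1, 1, 0, 0]
print(hamming(a, b))  # 2
print(jaccard(a, b))  # 0.666... = 2/3
}}}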
Hamming distance is not good, in my opinion, when the strings compared have a very unequal 1/0 content, and the meaning of 1s and 0s is related to set-membership (as in the examples above).
E.g., consider these strings, and the choice made by Hamming and Jaccard:
11 000 000
is more similar to 10 000 001 according to Hamming (D_H = 2, D_J = 2/3)
is more similar to 11 111 000 according to Jaccard (D_H = 3, D_J = 3/5)
110 100 000
is more similar to 011 000 000 according to Hamming (D_H = 3, D_J = 3/4)
is more similar to 110 111 101 according to Jaccard (D_H = 4, D_J = 4/7)
Mapping the first problem to interactions, let's say
A1 interacts with B, C
A2 interacts with B, D
A3 interacts with B, C, E, F, G
is A1's neighborhood more similar to A2's (Hamming's choice) or to A3's (Jaccard's choice)?
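Encoding the three neighborhoods as binary membership arrays over the partner list B, C, D, E, F, G makes the disagreement explicit (a sketch; the encoding helper is mine, and hamming/jaccard are redefined as above):

{{{#!python
def hamming(a, b):   # as in the previous sketch
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):   # as in the previous sketch
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return 1 - sum(x == y == 1 for x, y in zip(a, b)) / either

partners = ["B", "C", "D", "E", "F", "G"]

def membership(neighbours):
    """0/1 array: does the protein interact with each partner?"""
    return [1 if p in neighbours else 0 for p in partners]

A1 = membership({"B", "C"})
A2 = membership({"B", "D"})
A3 = membership({"B", "C", "E", "F", "G"})

print(hamming(A1, A2), hamming(A1, A3))  # 2 3          -> Hamming picks A2
print(jaccard(A1, A2), jaccard(A1, A3))  # 0.666... 0.6 -> Jaccard picks A3
}}}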
In addition, neither of these two distances is suited when some array positions are noisy, whereas others are information-rich, as in the following fictional examples:
110 010 001
110 100 000
110 000 110
000 101 100
001 101 011
000 111 000
clearly, the 1st and 2nd positions are always co-conserved, and the 4th and 6th almost always (they disagree only in the second array); the co-conservation of the other positions is irregular; therefore, we may think that the co-conserved positions hold more information than the others.
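One way to make this visible numerically (a sketch; the agreement measure is simply the fraction of arrays in which two positions carry the same value):

{{{#!python
rows = ["110010001", "110100000", "110000110",
        "000101100", "001101011", "000111000"]
cols = list(zip(*rows))   # one tuple of values per position

def agreement(i, j):
    """Fraction of arrays in which positions i and j (0-based) agree."""
    return sum(a == b for a, b in zip(cols[i], cols[j])) / len(rows)

print(agreement(0, 1))  # 1.0      -> positions 1 and 2 perfectly co-conserved
print(agreement(3, 5))  # 0.833... -> positions 4 and 6 agree in 5 of 6 arrays
print(agreement(0, 5))  # 0.0      -> position 6 perfectly anti-conserved vs 1
}}}

One could then, for instance, use such agreement scores as per-position weights in a weighted Hamming or Jaccard distance, downweighting the noisy positions.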