Diff for "DanieleMerico/HowtoDirectory/Distances" - Bader Lab @ The University of Toronto

Differences between revisions 1 and 2

Distances: my personal experience

Just a few suggestions based on my own experience. For a more formal and detailed treatment, see Wikipedia or a mathematics textbook.

Quantitative Data Arrays

For quantitative data (e.g. transcription signals from microarray experiments) you can use

Euclidean distance, which is D = √((x₁ - x₂)² + (y₁ - y₂)² + (z₁ - z₂)² + ...)
it is the geometric distance you are used to calculating in 2 dimensions, just extended to N dimensions
be careful: I'd suggest using it only when:
1. you are sure that the data are in the same magnitude scale
2. you do not want to consider anti-correlated patterns similar
Correlation-based distance
this distance is ideal when you want to group objects whose behavior follows inter-dependent trends
for instance, with D = 1 - correlation coefficient => similar = whenever A goes up, B always goes up as well; dissimilar = whenever A goes up, B always goes down
whereas with D = 1 - Abs(correlation coefficient) => similar = whenever A goes up, B always goes up as well, AND ALSO whenever A goes up, B always goes down
for these reasons, I usually prefer a Pearson-based distance when clustering genes by expression signals (Affymetrix arrays)
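
To make the difference between the two families concrete, here is a minimal Python sketch. The three profiles are made-up numbers (not real array data), and the function names `euclidean`, `pearson` and `correlation_distance` are my own choices for illustration, not taken from any particular library.

```python
import math


def euclidean(a, b):
    """Geometric distance between two equal-length numeric profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation coefficient (assumes neither profile is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a, b, absolute=False):
    """D = 1 - r, or D = 1 - |r| when absolute=True."""
    r = pearson(a, b)
    return 1 - abs(r) if absolute else 1 - r


# Hypothetical expression profiles, for illustration only.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same trend as gene_a, ten times the scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # opposite trend, same scale as gene_a

# Euclidean puts gene_a closer to the anti-correlated gene_c than to gene_b,
# simply because gene_b lives on a different magnitude scale.
print(euclidean(gene_a, gene_b), euclidean(gene_a, gene_c))

# 1 - r: gene_b is as close as possible, gene_c as far as possible.
print(correlation_distance(gene_a, gene_b), correlation_distance(gene_a, gene_c))

# 1 - |r|: the anti-correlated gene_c now counts as similar too.
print(correlation_distance(gene_a, gene_c, absolute=True))
```

With these toy numbers, Euclidean distance is dominated by the scale difference, while the correlation-based distances only look at the trend.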

Binary Arrays

Hamming = number of discordant array elements (e.g. 1010 vs 1100 => D = 2)
this distance was developed to measure transmission errors in binary-coded strings
Jaccard-based = 1 - (number of concordant 1s in the two arrays)/(number of positions with 1 in either of the two arrays)
(e.g. 1010 vs 1100 => D = 1 - 1/3 = 2/3)
this distance is useful when array positions correspond to elements which can belong (value = 1) or not belong (value = 0) to the set associated with the array
(e.g. given a list of Transcription Factors, you can associate with every gene a binary array expressing whether or not it is regulated by each TF)
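
The two definitions can be sketched in a few lines of Python. This is only an illustration: the arrays are written as strings of 0s and 1s, the function names are my own, and returning 0 for two all-zero arrays is a convention I am assuming, since the definition above leaves that case undefined.

```python
from fractions import Fraction


def hamming(a, b):
    """Number of positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return sum(x != y for x, y in zip(a, b))


def jaccard(a, b):
    """1 - (positions where both are 1) / (positions where either is 1)."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    if either == 0:
        # Assumed convention: two all-zero arrays are treated as identical.
        return Fraction(0)
    return 1 - Fraction(both, either)


print(hamming("1010", "1100"))  # 2
print(jaccard("1010", "1100"))  # 2/3
```

Exact fractions are used so the result can be compared directly with the hand calculation.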

Hamming distance is not good in my opinion when the strings compared have a very unequal 1/0 content, and the meaning of 1s and 0s is related to set-membership (as in the example above).
E.g., consider these strings, and the choice made by Hamming and Jaccard:

11 000 000
- is more similar to 10 000 001 according to Hamming (D_H = 2, D_J = 2/3)
- is more similar to 11 111 000 according to Jaccard (D_H = 3, D_J = 3/5)
110 100 000
- is more similar to 011 000 000 according to Hamming (D_H = 3, D_J = 3/4)
- is more similar to 110 111 101 according to Jaccard (D_H = 4, D_J = 4/7)
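
The disagreement can be checked with a short Python sketch that treats each string as the set of positions holding a 1, which is the set-membership reading discussed above. The helper names (`as_set`, `nearest`) are hypothetical, and the code assumes the strings have equal length and are not both all-zero.

```python
from fractions import Fraction


def as_set(bits):
    """Positions holding a 1, ignoring the spaces used for readability."""
    return {i for i, bit in enumerate(bits.replace(" ", "")) if bit == "1"}


def hamming(a, b):
    """Discordant positions = size of the symmetric difference of the sets."""
    return len(as_set(a) ^ as_set(b))


def jaccard(a, b):
    """1 - |intersection| / |union| (fails if both strings are all-zero)."""
    set_a, set_b = as_set(a), as_set(b)
    return 1 - Fraction(len(set_a & set_b), len(set_a | set_b))


def nearest(query, candidates, distance):
    """Candidate with the smallest distance to the query."""
    return min(candidates, key=lambda c: distance(query, c))


query = "110 100 000"
candidates = ["011 000 000", "110 111 101"]

for c in candidates:
    print(c, "D_H =", hamming(query, c), "D_J =", jaccard(query, c))

print("Hamming picks:", nearest(query, candidates, hamming))
print("Jaccard picks:", nearest(query, candidates, jaccard))
```

Hamming rewards the many shared 0s of the sparse candidate, whereas Jaccard only counts positions where at least one string has a 1, so it favors the candidate that contains all three 1s of the query.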

This is particularly important because some computational guys would probably pick Hamming as their first choice, without checking whether it is valid in the context of use.

-  Revision 1 as of 2007-11-15 23:25:31
  Size: 3166
  Editor: DanieleMerico
  Comment:
+  Revision 2 as of 2007-11-15 23:25:56
  Size: 3175
  Editor: DanieleMerico
  Comment:
 Line 12:
-=== Quantitative Data ===
+=== Quantitative Data Arrays ===
 Line 25:
-=== Binary Data ===
+=== Binary Arrays ===