#acl DanieleMerico:admin,read,write,delete,revert All:read
#format wiki
#language en
== Distances: my personal experience ==
Just a few suggestions based on my own experience. For a more formal and detailed treatment, go to Wikipedia or a maths textbook.<<BR>>
For a very rich directory of string-similarity metrics, see [[http://www.dcs.shef.ac.uk/~sam/stringmetrics.html|this link]].
=== Quantitative Data Arrays ===
For quantitative data (e.g. transcription signals from microarray experiments) you can use
* '''Euclidean distance''': D = √((x,,1,, - x,,2,,)^2^ + (y,,1,, - y,,2,,)^2^ + (z,,1,, - z,,2,,)^2^ + ...)<<BR>>
it is the geometric distance you are used to calculating in 2 dimensions, simply extended to N dimensions<<BR>>
''Be careful: I'd suggest using it only when:''
a. ''you are sure that the data are on the same magnitude scale''
a. ''you do not want to consider anti-correlated patterns as similar''
* '''Correlation-based'''<<BR>>
this distance is ideal when you want to group objects with inter-dependent behavioral trends<<BR>>
for instance, with D = 1 - Correlation Index => similar = whenever A goes up, B always goes up as well; dissimilar = whenever A goes up, B always goes down<<BR>>
whereas, with D = 1 - Abs(Correlation Index) => similar = whenever A goes up, B always goes up as well, OR whenever A goes up, B always goes down (anti-correlated patterns also count as similar)<<BR>>
''for these reasons, I usually prefer Pearson-based distance when clustering genes by expression signals (Affymetrix arrays)''
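As a minimal sketch of the two options above (the helper names are my own, not from any particular library; plain Python, no dependencies):

```python
from math import sqrt

def euclidean_distance(x, y):
    # Geometric distance, extended from 2 to N dimensions
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pearson_distance(x, y, absolute=False):
    # D = 1 - r (or D = 1 - |r| when absolute=True), r = Pearson correlation
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    r = cov / (sx * sy)
    return 1.0 - abs(r) if absolute else 1.0 - r

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]   # same trend as a (r = 1)
c = [4.0, 3.0, 2.0, 1.0]   # anti-correlated with a (r = -1)

print(pearson_distance(a, b))                  # 0.0 -> similar
print(pearson_distance(a, c))                  # 2.0 -> dissimilar
print(pearson_distance(a, c, absolute=True))   # 0.0 -> similar again
```

Note how the Abs variant makes the anti-correlated profile `c` come out as similar to `a`, which is exactly the behaviour discussed above.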
=== Binary Arrays ===
* '''Hamming''' = number of discordant array elements (e.g. 1010 vs 1100 => D = 2)<<BR>>
this distance was originally developed to measure transmission errors in binary-coded strings
* '''Jaccard-based''' = 1 - (number of concordant 1s in the two arrays)/(number of positions with a 1 in either of the two arrays)<<BR>>
(e.g. 1010 vs 1100 => D = 1 - 1/3 = 2/3)<<BR>>
this distance is useful when array positions correspond to elements which can belong (value = 1) or not belong (value = 0) to the set associated with the array<<BR>>
(e.g. given a list of transcription factors, you can associate with every gene a binary array expressing whether the gene is regulated by each TF or not)<<BR>>
(e.g. given a list of interaction partners, you can associate with every protein a binary array expressing whether the protein interacts with each partner or not)
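Both distances are a few lines of code; this sketch (helper names are my own) reproduces the 1010 vs 1100 example above:

```python
def hamming_distance(a, b):
    # Number of discordant positions
    return sum(ai != bi for ai, bi in zip(a, b))

def jaccard_distance(a, b):
    # 1 - (concordant 1s) / (positions with a 1 in either array)
    both = sum(ai == 1 and bi == 1 for ai, bi in zip(a, b))
    either = sum(ai == 1 or bi == 1 for ai, bi in zip(a, b))
    return 1.0 - both / either

x = [1, 0, 1, 0]   # 1010
y = [1, 1, 0, 0]   # 1100
print(hamming_distance(x, y))   # 2
print(jaccard_distance(x, y))   # 2/3
```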
~-''Hamming distance is not a good choice, in my opinion, when the strings compared have a very unequal 1/0 content and the meaning of 1s and 0s is related to set membership (as in the examples above).''-~<<BR>>
~-''E.g., consider these strings, and the choices made by Hamming and Jaccard:''-~
* ~-{{{11 000 000}}}-~
 * ~-''is more similar to'' {{{10 000 001}}} ''according to Hamming (D,,H,, = 2, D,,J,, = 2/3)''-~
 * ~-''is more similar to'' {{{11 011 100}}} ''according to Jaccard (D,,H,, = 3, D,,J,, = 3/5)''-~
* ~-{{{110 100 000}}}-~
 * ~-''is more similar to'' {{{011 000 000}}} ''according to Hamming (D,,H,, = 3, D,,J,, = 3/4)''-~
 * ~-''is more similar to'' {{{110 111 101}}} ''according to Jaccard (D,,H,, = 4, D,,J,, = 4/7)''-~
~-Mapping the first problem to interactions, let's say:<<BR>>
{{{A1 interacts with B, C}}}<<BR>>
{{{A2 interacts with B, D}}}<<BR>>
{{{A3 interacts with B, C, E, F, G}}}<<BR>>
is the A1 neighborhood more similar to A2 (Hamming's choice) or to A3 (Jaccard's choice)?-~
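The interaction example can be replayed on binary arrays over a hypothetical partner list B..I (one bit per partner; helper names and the partner-list ordering are my own assumptions):

```python
def hamming(a, b):
    # Number of discordant positions
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    # 1 - (concordant 1s) / (positions with a 1 in either array)
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return 1.0 - both / either

# Partner list B, C, D, E, F, G, H, I -> one bit per partner
A1 = [1, 1, 0, 0, 0, 0, 0, 0]  # interacts with B, C
A2 = [1, 0, 1, 0, 0, 0, 0, 0]  # interacts with B, D
A3 = [1, 1, 0, 1, 1, 1, 0, 0]  # interacts with B, C, E, F, G

print(hamming(A1, A2), hamming(A1, A3))   # 2 and 3 -> Hamming picks A2
print(jaccard(A1, A2), jaccard(A1, A3))   # 2/3 and 3/5 -> Jaccard picks A3
```

Jaccard prefers A3 because it ignores the shared 0s, which dominate the comparison whenever the partner list is long and interactions are sparse.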
In addition, neither of these two distances is well suited when some array positions are noisy, whereas others are information-rich, as in the following fictional example:
{{{
110 010 001
110 100 000
110 000 110
000 101 100
001 101 011
000 111 000
}}}
Clearly, the 1st and 2nd positions are always co-conserved, and the 4th and 6th almost always; the co-conservation of the other positions is irregular. We may therefore think that the co-conserved positions hold more information than the other ones.
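One simple way to spot such co-conserved positions is to score, for every pair of columns, the fraction of strings in which the two columns agree (a sketch with a hypothetical agreement score; positions are 0-based in the code):

```python
strings = [
    "110010001",
    "110100000",
    "110000110",
    "000101100",
    "001101011",
    "000111000",
]

def column(strings, j):
    # j-th character of every string, as ints
    return [int(s[j]) for s in strings]

def agreement(strings, i, j):
    # Fraction of strings in which positions i and j carry the same value
    ci, cj = column(strings, i), column(strings, j)
    return sum(a == b for a, b in zip(ci, cj)) / len(strings)

print(agreement(strings, 0, 1))   # 1.0 -> 1st and 2nd positions always agree
print(agreement(strings, 3, 5))   # 5/6 -> 4th and 6th agree in all strings but one
```

Pairs with an agreement score near 1 are candidates for a higher weight in a position-weighted distance; noisy positions would score closer to 0.5.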