#acl DanieleMerico:admin,read,write,delete,revert All:read
#format wiki
#language en

== Distances: my personal experience ==

Just a few suggestions based on my own experience. For a more formal and detailed treatment, go to Wikipedia or a maths manual.

=== Quantitative Data Arrays ===

For quantitative data (e.g. transcription signals from microarray experiments) you can use

 * '''Euclidean distance''', which is D = √((x,,1,, - x,,2,,)^2^ + (y,,1,, - y,,2,,)^2^ + (z,,1,, - z,,2,,)^2^ + ...)[[BR]]
 it is the geometric distance you are used to calculating in 2 dimensions, just extended to N dimensions[[BR]]
 ''be careful: I'd suggest using it only when:''
  a. ''you are sure that the data are on the same scale of magnitude''
  a. ''you do not want to treat anti-correlated patterns as similar''
 * '''Correlation-based'''[[BR]]
 this distance is ideal when you want to group objects whose trends are inter-dependent[[BR]]
 for instance, with D = 1 - Correlation Index => similar = whenever A goes up, B always goes up as well; dissimilar = whenever A goes up, B always goes down[[BR]]
 whereas, with D = 1 - Abs(Correlation Index) => similar = whenever A goes up, B always goes up as well, AND ALSO whenever A goes up, B always goes down[[BR]]
 ''for these reasons, I usually prefer a Pearson-based distance when clustering genes by expression signals (Affymetrix arrays)''

=== Binary Arrays ===

 * '''Hamming''' = number of discordant array elements (e.g. 1010 vs 1100 => D = 2)[[BR]]
 this distance was developed to measure transmission errors in binary-coded strings
 * '''Jaccard-based''' = 1 - (number of positions with 1 in both arrays)/(number of positions with 1 in either of the two arrays)[[BR]]
 (e.g.
 1010 vs 1100 => D = 1 - 1/3 = 2/3)[[BR]]
 this distance is useful when array positions correspond to elements which can belong (value = 1) or not belong (value = 0) to the set associated with the array[[BR]]
 (e.g. given a list of Transcription Factors, you can associate with every gene a binary array expressing whether it is regulated by each TF or not)[[BR]]
 ''In my opinion, Hamming distance is not a good choice when the strings compared have a very unequal 1/0 content and the meaning of 1s and 0s is related to set membership (as in the example above).''[[BR]]
 ~-''E.g., consider these strings, and the choice made by Hamming and Jaccard:''-~
  * ~-11 000 000-~
   * ~-is more similar to 10 000 001 according to Hamming (D,,H,, = 2, D,,J,, = 2/3)-~
   * ~-is more similar to 11 111 000 according to Jaccard (D,,H,, = 3, D,,J,, = 3/5)-~
  * ~-110 100 000-~
   * ~-is more similar to 011 000 000 according to Hamming (D,,H,, = 3, D,,J,, = 3/4)-~
   * ~-is more similar to 110 111 101 according to Jaccard (D,,H,, = 4, D,,J,, = 4/7)-~

''This is particularly important, as some computationally minded people would probably pick Hamming as their first choice without checking its validity in the context of use.''
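
If you want to play with these distances yourself, here is a minimal Python sketch of the four of them. It is only an illustration under a few assumptions of mine: the function names (`euclidean`, `pearson`, `correlation_distance`, `hamming`, `jaccard_distance`, `bits`) are made up for this page, the "Correlation Index" is taken to be the Pearson correlation, and binary arrays are written as strings of 0s and 1s in which spaces are ignored.

```python
import math
from fractions import Fraction


def euclidean(a, b):
    """Geometric distance between two numeric arrays of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation of two equal-length numeric arrays (assumed index)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Fails with ZeroDivisionError if either array is constant.
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a, b, absolute=False):
    """D = 1 - r, or D = 1 - |r| when anti-correlated patterns count as similar."""
    r = pearson(a, b)
    return 1 - abs(r) if absolute else 1 - r


def bits(s):
    """Hypothetical helper: drop the spaces used to group digits, e.g. '110 100 000'."""
    return s.replace(" ", "")


def hamming(a, b):
    """Number of positions at which the two binary strings differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a, b):
    """1 - (positions with 1 in both) / (positions with 1 in either), as an exact fraction."""
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    # Fails with ZeroDivisionError if neither string contains a 1.
    return 1 - Fraction(both, either)


if __name__ == "__main__":
    # Perfectly anti-correlated profiles: far apart for 1 - r, identical for 1 - |r|.
    up, down = [1, 2, 3, 4], [4, 3, 2, 1]
    print(euclidean(up, down))
    print(correlation_distance(up, down), correlation_distance(up, down, absolute=True))

    # Hamming and Jaccard disagree on which candidate is closer to the reference.
    ref = bits("110 100 000")
    for candidate in ("011 000 000", "110 111 101"):
        other = bits(candidate)
        print(candidate, hamming(ref, other), jaccard_distance(ref, other))
```

Returning the Jaccard distance as a `Fraction` is just a convenience, so that the values can be compared directly with the hand-computed ones above (3/4 and 4/7); in real work you would probably use floats or a numerical library.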