#acl All:read DanieleMerico:write,delete,revert
= Google Summer of Code 2010: Semantic Network Summary =
== Short Description (from GenMAPP wiki) ==
'''Goal''' <<BR>> 
Develop a visual summary of a set of node attributes

'''Description''' <<BR>> 
When biological networks are investigated, it is common to look for clusters, i.e. sets of nodes that are highly inter-connected. To figure out the "biological meaning" of a cluster, the user has to sift through the long textual annotations that are associated with biological entities. We are interested in producing a graphical summary of such annotations. Word frequency in annotations is a good starting point, and it can be visualized as a "tag cloud". In addition, the word layout can reflect similarity relations among words (e.g. co-occurrence in the same annotations).

This functionality can be applied to many networks outside biology, whenever nodes are associated with verbose textual annotations. An example is professional social networks (e.g. LinkedIn), where individuals are "annotated" by a short CV.

'''Language and Skills''' <<BR>> 
Java, basic statistics 
== Longer Description ==
=== Problem Description ===
Biological networks can be visualized and analyzed using Cytoscape. Since biological networks have a large number of nodes (a whole cell protein network has up to 20k nodes), it is common to summarize networks by clustering, i.e. identifying groups of highly inter-connected nodes. Clusters can be identified algorithmically, or hand-picked by experts (with the help of the network layout).

Once clusters have been identified, however, it is not trivial to summarize their meaning. Bio-entities typically have rich semantics, which are encoded by long string attributes.

The purpose of the Semantic Network Summary module will be to generate concise summaries of these attributes.
==== Input/Output Description ====
The module will receive as input a pair (S, A): a set of nodes S = {s1, s2, ..., sn} together with their string attributes A = {a1, a2, ..., an}.

For every input (S, A), the module must generate a graphical summary of the string attributes.

Attributes A are typically stored in biological databases. They can be free-text descriptions, or controlled-vocabulary terms (e.g. Gene Ontology). 
=== Available Solutions: Word Frequency ===
A first simple solution we have implemented:
 * break down {a1, a2, ..., an} into single words
 * count word frequencies
 * score each word with a coefficient based on information theory, or with the p-value of a statistical test
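
The first two steps above can be sketched in a few lines of Java. This is only an illustration, not the code we have implemented: the class name `WordFrequencySketch`, its method names, and the rule used to split words (any run of characters that are not letters or digits is a separator) are assumptions made for the example. The scoring step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: names and tokenization rule are assumptions.
class WordFrequencySketch {

    // Break one attribute string into lower-case words. Any run of
    // characters that are neither letters nor digits acts as a separator.
    static List<String> tokenize(String attribute) {
        List<String> words = new ArrayList<>();
        String lower = attribute.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // split() can yield a leading empty token
                words.add(token);
            }
        }
        return words;
    }

    // Count how often each word occurs across all attributes {a1, ..., an}.
    static Map<String, Integer> countWords(List<String> attributes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String attribute : attributes) {
            for (String word : tokenize(attribute)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up annotations, for illustration only.
        List<String> attributes = Arrays.asList(
                "DNA repair protein",
                "DNA replication; repair of DNA");
        System.out.println(countWords(attributes));
    }
}
```

A map from word to count is enough to drive a tag cloud, where the font size of each word grows with its count.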

This simple idea can be improved by:
 * removing commonplace words (e.g. "of", "by")
 * dividing the word frequencies in A by the word frequencies in the full network (i.e. all nodes)
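
A minimal sketch of these two improvements follows. It assumes that word counts for the selected nodes and for the full network are already available as maps, and that "frequency" means the count divided by the total number of non-stop words, so that the result is a ratio of relative frequencies. The class name `WordEnrichmentSketch`, the method names and the very short stop-word list are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names, stop-word list and normalization are assumptions.
class WordEnrichmentSketch {

    // Deliberately tiny list; a real one would be much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "by", "the", "and", "in", "to", "a"));

    // Total number of word occurrences, ignoring stop words.
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (!STOP_WORDS.contains(entry.getKey())) {
                sum += entry.getValue();
            }
        }
        return sum;
    }

    // For each non-stop word in the cluster, divide its relative frequency
    // in the cluster by its relative frequency in the full network.
    // Words missing from the network counts are skipped.
    static Map<String, Double> enrichment(Map<String, Integer> clusterCounts,
                                          Map<String, Integer> networkCounts) {
        double clusterTotal = total(clusterCounts);
        double networkTotal = total(networkCounts);
        Map<String, Double> ratios = new HashMap<>();
        for (Map.Entry<String, Integer> entry : clusterCounts.entrySet()) {
            String word = entry.getKey();
            Integer networkCount = networkCounts.get(word);
            if (STOP_WORDS.contains(word) || networkCount == null || networkCount == 0) {
                continue;
            }
            double clusterFrequency = entry.getValue() / clusterTotal;
            double networkFrequency = networkCount / networkTotal;
            ratios.put(word, clusterFrequency / networkFrequency);
        }
        return ratios;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        Map<String, Integer> cluster = new HashMap<>();
        cluster.put("dna", 2);
        cluster.put("repair", 2);
        cluster.put("of", 1);
        Map<String, Integer> network = new HashMap<>();
        network.put("dna", 4);
        network.put("repair", 2);
        network.put("protein", 10);
        network.put("of", 20);
        System.out.println(enrichment(cluster, network));
    }
}
```

A ratio above 1 means the word is more frequent in the selected nodes than in the network as a whole, so it is a better candidate for the summary than a word that is simply common everywhere.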

[[http://www.wordle.net|Wordle]] is a cool graphical representation based on word frequency. 
=== Going Beyond Simple Solutions ===
We would like applicants to be creative, and come up with good ideas on how to improve the frequency-based semantic summary.

We think that taking into account relations between words would make the semantic summary richer and more informative. In fact, breaking down descriptions into isolated words can make it harder to grasp the original meaning of the string attributes.
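
One simple relation between words is co-occurrence in the same annotation. The sketch below counts, for every unordered pair of distinct words, the number of annotations that contain both. It is meant only as a possible starting point, not as the expected solution: the class name `CoOccurrenceSketch`, the `word1|word2` key format and the tokenization rule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: names, key format and tokenization are assumptions.
class CoOccurrenceSketch {

    // Returns a map from "word1|word2" (word1 < word2 alphabetically)
    // to the number of annotations in which both words appear.
    static Map<String, Integer> countPairs(List<String> attributes) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (String attribute : attributes) {
            // Distinct words of this annotation, in sorted order.
            TreeSet<String> distinct = new TreeSet<>();
            String lower = attribute.toLowerCase(Locale.ROOT);
            for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    distinct.add(token);
                }
            }
            List<String> words = new ArrayList<>(distinct);
            // Each pair is counted once per annotation.
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    String key = words.get(i) + "|" + words.get(j);
                    pairCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        return pairCounts;
    }

    public static void main(String[] args) {
        // Made-up annotations, for illustration only.
        List<String> attributes = Arrays.asList(
                "DNA repair",
                "DNA repair protein",
                "protein kinase");
        System.out.println(countPairs(attributes));
    }
}
```

Pair counts of this kind could, for instance, be used to place frequently co-occurring words close to each other in the tag cloud.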
=== Environment: Cytoscape ===
 * We want the semantic network summary to be implemented as a Cytoscape plugin
 * Cytoscape plugins are coded in Java using the Cytoscape API
== About ==
This project was started by:
 * [[RuthIsserlin|Ruth Isserlin]]
 * [[DanieleMerico|Daniele Merico]]

We are part of [[Home|Gary Bader's lab]] at the University of Toronto - CCBR (Toronto, ON, Canada). Our lab is strongly engaged in biological network research. Feel free to have a look at our [[Home|home page]] for more details on the lab's research areas, and at our personal home pages for our own research interests. [[Software/EnrichmentMap|Here]] is a Cytoscape plugin that we have recently developed.