Google Summer of Code 2010: Semantic Network Summary
Short Description (from GenMAPP wiki)
Goal
Develop a visual summary of a set of node attributes
Description
When biological networks are investigated, it is common to look for clusters, i.e. sets of nodes that are highly inter-connected. To figure out the "biological meaning" of a cluster, the user has to sift through the long textual annotations that are associated with biological entities. We are interested in producing a graphical summary of such annotations. Word frequency in annotations is a good starting point. This can be visualized as a "tag cloud". In addition, the word layout can reflect similarity relations among words (e.g. co-occurrence in the same annotations).
This functionality can be applied to several networks outside biology, whenever nodes are associated with verbose textual annotations. An example is professional social networks (e.g. LinkedIn), where individuals are "annotated" by a short CV.
Language and Skills
Java, basic statistics
Longer Description
Problem Description
Biological networks can be visualized and analyzed using Cytoscape. Since biological networks have a large number of nodes (a network consisting of all the proteins in a cell can have up to 20k nodes), one of the common approaches to summarize a network is by clustering, i.e. identifying groups of highly inter-connected nodes. Clusters can be identified algorithmically, or hand-picked by experts (with the help of the network layout).
Once clusters have been identified, however, it is not trivial to summarize their meaning. Bio-entities typically have rich semantics, which are encoded by long string attributes.
The purpose of the Semantic Network Summary module will be to generate more concise summaries.
Input/Output Description
The module will receive as input (S, A): a set of nodes S = {s1, s2, ..., sn} together with their set of attributes A = {a1, a2, ..., an}.
The user can choose one or more of the attributes to be used to summarize the set of nodes (for example, the user might want to use both the name and the description of each node to extract a summary).
For every input (S, A), a graphical or visual summary of the attributes will have to be generated.
Attributes A can be any type of attribute associated with a Cytoscape network, node or edge (e.g. String, Integer, Double, Boolean, list).
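As a rough illustration of the input (S, A), the sketch below models nodes with typed attributes and builds one text string per node from the attributes the user selected. The classes `AnnotatedNode` and `AttributeTextExtractor` are hypothetical stand-ins invented for this example; they are not part of the Cytoscape API, and a real plugin would read attributes through Cytoscape instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a network node; NOT a Cytoscape API class. */
final class AnnotatedNode {
    final String id;
    // Attribute name -> value; a value may be a String, Integer, Double, Boolean or List.
    final Map<String, Object> attributes = new LinkedHashMap<>();

    AnnotatedNode(String id) {
        this.id = id;
    }

    /** Adds one attribute and returns this node, so calls can be chained. */
    AnnotatedNode with(String name, Object value) {
        attributes.put(name, value);
        return this;
    }
}

/** Hypothetical helper: turns the user-selected attributes of each node into text. */
final class AttributeTextExtractor {
    static List<String> extract(List<AnnotatedNode> nodes, List<String> selected) {
        List<String> texts = new ArrayList<>();
        for (AnnotatedNode node : nodes) {
            StringBuilder sb = new StringBuilder();
            for (String name : selected) {
                Object value = node.attributes.get(name);
                if (value == null) {
                    continue; // this node does not carry the attribute
                }
                if (value instanceof List) {
                    // List attributes contribute each element in order.
                    for (Object item : (List<?>) value) {
                        append(sb, item);
                    }
                } else {
                    append(sb, value);
                }
            }
            texts.add(sb.toString());
        }
        return texts;
    }

    private static void append(StringBuilder sb, Object value) {
        if (sb.length() > 0) {
            sb.append(' ');
        }
        sb.append(value);
    }
}
```

Attributes that were not selected are simply ignored, so the same node set can be summarized in different ways by changing the `selected` list.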
Available Solutions: Word Frequency
A first simple solution we have implemented:
- break down {a1, a2, ..., an} into single words
- count word frequencies
- score each word using a coefficient based on information theory, or a statistical test p-value
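The first two steps above can be sketched in a few lines of Java. This is a minimal illustration, not the implementation referred to on this page: the class name `WordFrequency` is made up, and the tokenization rule (a word is a run of letters or digits, compared case-insensitively) is an assumption. The scoring step is left out because the page does not say which coefficient or test is used.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: counts how often each word occurs in a set of attribute values. */
final class WordFrequency {
    static Map<String, Integer> count(Iterable<String> attributeValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : attributeValues) {
            // Assumption: anything that is not a letter or a digit separates words.
            String[] words = value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
            for (String word : words) {
                if (word.isEmpty()) {
                    continue; // split() yields an empty token when the text starts with a separator
                }
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling `WordFrequency.count(...)` on the attribute values {a1, a2, ..., an} of a cluster gives the raw counts that a tag cloud can be built from.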
This simple idea can be improved by:
- removing commonplace words (e.g. "of", "by")
- dividing the word frequencies in A by the word frequencies in the full network (i.e. all nodes)
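Both improvements can be combined in one scoring step, sketched below. The class `RelativeWordScore` is hypothetical, and the sketch assumes that "frequency" means relative frequency (a word's count divided by the total number of non-stop words), so a score above 1 indicates a word that is over-represented in the selected nodes compared with the full network. The stop-word list is supplied by the caller.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: scores cluster words against the full-network background. */
final class RelativeWordScore {
    /**
     * @param cluster   word counts for the selected nodes
     * @param network   word counts for all nodes in the network
     * @param stopWords words to ignore entirely (e.g. "of", "by")
     * @return word -> (relative frequency in cluster) / (relative frequency in network)
     */
    static Map<String, Double> score(Map<String, Integer> cluster,
                                     Map<String, Integer> network,
                                     Set<String> stopWords) {
        double clusterTotal = total(cluster, stopWords);
        double networkTotal = total(network, stopWords);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> entry : cluster.entrySet()) {
            String word = entry.getKey();
            Integer inNetwork = network.get(word);
            // Skip stop words; also skip words missing from the background to avoid dividing by zero.
            if (stopWords.contains(word) || inNetwork == null || inNetwork == 0) {
                continue;
            }
            double clusterFrequency = entry.getValue() / clusterTotal;
            double networkFrequency = inNetwork / networkTotal;
            scores.put(word, clusterFrequency / networkFrequency);
        }
        return scores;
    }

    /** Sums the counts of all words that are not stop words. */
    private static double total(Map<String, Integer> counts, Set<String> stopWords) {
        double sum = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (!stopWords.contains(entry.getKey())) {
                sum += entry.getValue();
            }
        }
        return sum;
    }
}
```

Dividing raw counts instead of relative frequencies would give the same ranking of words; using relative frequencies only makes the scores comparable across clusters of different sizes.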
Wordle is a cool graphical representation based on word frequency.
Here's how we used wordle to generate a semantic summary in one of our papers (string attributes are displayed as node labels).
Ref: Isserlin R, Merico D, Alikhani-Koupaei R, Gramolini A, Bader GD, Emili A. Pathway analysis of dilated cardiomyopathy using global proteomic profiling and enrichment maps. Proteomics. 2010 Feb 1. PMID: 20127684
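To turn word weights into a tag cloud, each weight has to be mapped to a font size. The sketch below uses a simple linear mapping between a minimum and a maximum point size. This is only an assumption made for illustration; it is not a description of how Wordle scales its words, and `TagCloudScale` is a made-up name.

```java
/** Hypothetical helper: maps a word weight to a font size for a tag cloud. */
final class TagCloudScale {
    /**
     * Linearly interpolates between minPt and maxPt.
     * Weights outside [minWeight, maxWeight] are clamped to the nearest bound.
     */
    static int fontSize(double weight, double minWeight, double maxWeight,
                        int minPt, int maxPt) {
        if (maxWeight <= minWeight) {
            return minPt; // all words carry the same weight
        }
        double t = (weight - minWeight) / (maxWeight - minWeight);
        t = Math.max(0.0, Math.min(1.0, t));
        return (int) Math.round(minPt + t * (maxPt - minPt));
    }
}
```

A logarithmic mapping may be a better fit when a few words have much larger weights than the rest, since a linear scale would then leave most words at the minimum size.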
Going Beyond Simple Solutions
We would like applicants to be creative, and come up with new or different ways to improve the frequency-based semantic summary.
We think taking into account relations between words would be very useful to make the semantic summary richer and more informative. In fact, breaking descriptions down into single words can make it harder to grasp the original meaning of the string attributes.
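One simple relation between words is co-occurrence in the same annotation. The sketch below counts, for every pair of distinct words, how many annotations contain both. It is only a possible starting point, not a proposed design: the class `WordCooccurrence`, the `first|second` key format, and the tokenization rule are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper: counts in how many annotations each pair of words appears together. */
final class WordCooccurrence {
    /** Keys have the form "first|second", with the two words in alphabetical order. */
    static Map<String, Integer> count(Iterable<String> annotations) {
        Map<String, Integer> pairs = new HashMap<>();
        for (String annotation : annotations) {
            // A sorted set removes repeated words and fixes the order within each pair.
            TreeSet<String> words = new TreeSet<>();
            for (String word : annotation.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
            List<String> sorted = new ArrayList<>(words);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    pairs.merge(sorted.get(i) + "|" + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return pairs;
    }
}
```

Pairs with high counts could be drawn close together in the word layout, or merged into a single phrase so that some of the original meaning of the annotation is preserved.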
Environment: Cytoscape
- We want the semantic network summary to be implemented as a Cytoscape plugin
- Cytoscape plugins are coded in Java using the Cytoscape API
About
This project was started by
We are part of Gary Bader's lab at the University of Toronto - CCBR (Toronto, ON, Canada). Our lab is strongly engaged in biological network research. Feel free to have a look at our home page for more details on the lab's research areas, and at our home pages for our own research interests. Here is also a Cytoscape plugin we have recently developed.