Diff for "DomainSpecificityPredictionProject" - Bader Lab @ The University of Toronto

Differences between revisions 53 and 88 (spanning 35 versions)

Proteome scanning of PDZ domain interactions using support vector machines

Motivation

PDZ domains mediate important biological processes through the recognition of short linear motifs. Two recent, independent high-throughput experiments, one using protein microarrays and one using phage display, have been used to detect PDZ domain interactions. Several computational predictors of PDZ domain interactions have also been developed; however, they are trained using only protein microarray data or focus on limited subsets of PDZ domains. An accurate predictor of genomic PDZ domain interactions would allow the proteomes of organisms to be scanned for potential binders. Such an application would require a predictor that is not only accurate but also precise, because a given proteome contains thousands of possible interactors. Once validated, these predictions would increase the coverage of current PDZ domain interaction networks and further our understanding of the biological processes they mediate.

Results

We developed a PDZ domain interaction predictor using SVMs trained with both protein microarray and phage display data. Because the phage display data consisted of positive interactions only, we developed a method to deterministically generate artificial negative interactions so that the data could be used for training. Through extensive blind testing we showed that the SVM could predict interactions in different organisms. We then used the SVM to scan the proteomes of different organisms to predict binders for several PDZ domains. Predictions were validated using PDZBase or protein microarray data, and a comparison of F1 measures and false positive rates (FPRs) between the SVM and published or commonly used predictors demonstrated the SVM’s improved accuracy and precision.
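
For readers unfamiliar with the two evaluation measures named above, the following is a minimal sketch of how an F1 measure and a false positive rate can be computed from binary predictions. It is not taken from the PDZSVM source code: the names `ConfusionCounts`, `count_outcomes`, `f1_measure` and `false_positive_rate`, and the toy labels, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of the four possible prediction outcomes (hypothetical helper)."""
    tp: int  # predicted binder, validated binder
    fp: int  # predicted binder, validated non-binder
    tn: int  # predicted non-binder, validated non-binder
    fn: int  # predicted non-binder, validated binder


def count_outcomes(predicted, actual):
    """Tally outcomes from two equal-length sequences of booleans."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif not p and a:
            fn += 1
        else:
            tn += 1
    return ConfusionCounts(tp=tp, fp=fp, tn=tn, fn=fn)


def f1_measure(c):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(c):
    """Fraction of true non-binders that were predicted as binders."""
    return c.fp / (c.fp + c.tn) if (c.fp + c.tn) else 0.0


# Toy labels, invented for illustration (not project data).
predicted = [True, True, False, True, False, False]
actual = [True, False, False, True, True, False]
counts = count_outcomes(predicted, actual)
print(counts)
print(f"F1 = {f1_measure(counts):.3f}, FPR = {false_positive_rate(counts):.3f}")
```

The FPR matters for proteome scanning because it is multiplied by the number of non-binders scanned: as a purely hypothetical illustration, an FPR of 0.01 applied to 20,000 non-binding candidates would still yield about 200 false positives.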

Supplementary Data

Supplementary Document (Link)
PDZSVM Data Files PDZSVMData.zip
- Models
  - Chen model parameter and binding site encoding files
  - Stiffler model parameter files
- Proteomes
  - Ensembl proteome files for Human, Worm and Fly
- Experiment Interaction files (in peptide file format)
  - Fly files from Chen
  - Human files from Sidhu
  - Mouse files from Stiffler
  - Worm files from Chen
- Curated Interaction files (flat files)
  - PDZBase for Human (Worm and Fly included, but not used)
  - Human Protein Reference Database
- Phage codon bias files

Java Implementation

Source code
Binary jar file
Dependencies PDZSVMDep.zip
- jfreechart 1.0.12 (and dependencies)
- weka 3.9.1
- auc calculator (Davis & Goadrich, 2006)
- BioJava 1.5
- iText 2.1.3
- jmatio
- BRAIN 1.0.5 (pdzsvm)
- libSVM 2.8.9 (pdzsvm)

Team

Shirley Hui
Gary Bader

CategoryProject

-  Revision 53 as of 2008-03-02 15:58:47
  Size: 8172
  Editor: ShirleyHui
  Comment:
+  Revision 88 as of 2010-04-27 01:21:31
  Size: 8138
  Editor: ShirleyHui
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-== Table of Contents ==
[[TableOfContents()]]
+== Proteome scanning of PDZ domain interactions using support vector machines ==
-Line 6:
+Line 5:
-== Goals ==
 * Predict specificity of peptide recognition domain from the primary amino acid sequence.
 * Analyze PDZ, WW and then SH3 domains
+## == Table of Contents ==
## <<TableOfContents>>
-Line 10:
+Line 8:
-== Strategy ==
## [wiki:/Strategy Strategy Log]
+== Motivation ==
PDZ domains mediate important biological processes through the recognition of short linear motifs. Two recent, independent high-throughput experiments, one using protein microarrays and one using phage display, have been used to detect PDZ domain interactions. Several computational predictors of PDZ domain interactions have also been developed; however, they are trained using only protein microarray data or focus on limited subsets of PDZ domains. An accurate predictor of genomic PDZ domain interactions would allow the proteomes of organisms to be scanned for potential binders. Such an application would require a predictor that is not only accurate but also precise, because a given proteome contains thousands of possible interactors. Once validated, these predictions would increase the coverage of current PDZ domain interaction networks and further our understanding of the biological processes they mediate.
-Line 13:
+Line 11:
-== Status ==
 * [wiki:/Log Status Log]
+== Results ==
We developed a PDZ domain interaction predictor using SVMs trained with both protein microarray and phage display data. Because the phage display data consisted of positive interactions only, we developed a method to deterministically generate artificial negative interactions so that the data could be used for training. Through extensive blind testing we showed that the SVM could predict interactions in different organisms. We then used the SVM to scan the proteomes of different organisms to predict binders for several PDZ domains. Predictions were validated using PDZBase or protein microarray data, and a comparison of F1 measures and false positive rates (FPRs) between the SVM and published or commonly used predictors demonstrated the SVM’s improved accuracy and precision.

== Supplementary Data ==
 a. Supplementary Document (Link)
 a. PDZSVM Data Files [[attachment:PDZSVMData.zip]]
  * Models
    * Chen model parameter and binding site encoding files
    * Stiffler model parameter files
  * Proteomes
    * Ensembl proteome files for Human, Worm and Fly
  * Experiment Interaction files (in peptide file format)
    * Fly files from Chen
    * Human files from Sidhu
    * Mouse files from Stiffler
    * Worm files from Chen
  * Curated Interaction files (flat files)
    * PDZBase for Human (Worm and Fly included, but not used)
    * Human Protein Reference Database
  * Phage codon bias files

== Java Implementation ==
 a. Source code
 a. Binary jar file 
 a. Dependencies [[attachment:PDZSVMDep.zip]] 
   * jfreechart 1.0.12 (and dependencies)
   * weka 3.9.1
   * auc calculator (Davis & Goadrich, 2006)
   * !BioJava 1.5
   * iText 2.1.3
   * jmatio
   * BRAIN 1.0.5 (pdzsvm)
   * libSVM 2.8.9 (pdzsvm)

## == Goals ==
## * Computationally predict specificity of peptide recognition domain from the primary amino acid sequences
## * Analyze PDZ, WW and then SH3 domains

## == Background ==
## * [[/PDZ|PDZ Domains]]
## * [[/MachineLearning|Machine Learning]]

## == Strategy ==
## * [[/Strategy|Strategy]]

## == Ideas ==
## * [[/Ideas|Ideas]]

## == Data ==
## * [[/PDZData|PDZ Data]]

## == Experiments ==
## * [[/Experiments|Experiments and Results]]

## == Status ==
##  * [[/Log|Status]]
-Line 67:
+Line 119:
-##  * [wiki:/Meeting Notes]
+## * [[/Meeting|Notes]]

## == Tools/Resources ==
## * [[/ToolsResources|Tools and Resources]]

## == Reading Notes ==
## * [[/../ShirleyHui/MBCReadings|Molecular Biology of the Cell]]
## * [[/../ShirleyHui/PPIReadings|Protein-protein Interaction Detection]]
## * Support Vector Machines

## == Related Literature ==
##  * [[http://www.connotea.org/rss/user/s2hui?download=view|Literature List on Connotea]]
## * [[http://www.baderlab.org/DomainSpecificityPredictionProject/Reading|Molecular Biology of the Cell]]
-Line 73:
+Line 137:
-== Tools/Resources ==

=== Domains ===
 * [wiki:/PDZ PDZ Domain]

=== Databases ===
 * [http://www.ensembl.org/ Ensembl]
   * Software system which produces and maintains automatic annotation on selected eukaryotic genomes.
 * [http://www.ebi.ac.uk/interpro/ InterPro]
   * Database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
 * [http://www.biomart.org/ BioMart]
   * Query-oriented data management system that simplifies the task of creation and maintenance of advanced query interfaces backed by a relational database.  It is particularly suited for providing the 'data mining' like searches of complex descriptive (e.g. biological) data.

=== Sequence Alignment ===

==== Multiple ====
===== Hierarchical Methods =====
 * [http://www.compbio.dundee.ac.uk/Software/Amps/amps.html/ AMPS] 1990
   * Calculates Z-scores through pairwise sequence comparison with randomization
   * Generates alignments without having to generate trees
 * [http://www.ebi.ac.uk/clustalw/ ClustalW] 1997
   * Uses a series of different pair-score matrices
   * Biases location of gaps based on secondary structure mask
   * Allows for realigning to refine the alignment
   * Can infer phylogeny
   * Problems:
     * Time required to complete the initial all-against-all comparison needed to create the guide tree
 * [http://www.drive5.com/muscle/ MUSCLE] 2004
   * MUltiple Sequence Comparison by Log-Expectation
   * Uses a quick hashing comparison based on identical matches 
 * [http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/ MAFFT] 2005
   * Calculates the guide tree faster by using a fast Fourier transform method on amino acid properties to identify regions of similarity
   * Uses these regions to guide dynamic programming alignment of the sequences
 
===== Non Hierarchical Methods =====

 * [http://www.ncbi.nlm.nih.gov/BLAST/ PSI-BLAST] 1997
   * Searches a database with a single sequence
   * High scoring sequences are built into a multiple alignment which is used to derive a search profile for subsequent search of the database
   * Repeat until no new sequences are added to the profile or a specified number of iterations have been performed
 * [http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-Coffee] 2000
   * Builds a library of pairwise alignments for the sequences of interest
   * Uses library to inform hierarchical method to find a multiple alignment that preserves consistency between the pairwise alignments
   * Can align sequences of varying lengths
 * [http://baboon.math.berkeley.edu/amap/ AMAP] 2007
   * Multiple sequence alignment by sequence annealing

===== Probabilistic Methods =====
 * [http://probcons.stanford.edu/ Probcons] 2005
 * [http://probalign.njit.edu/probalign/login ProbAlign] 2006
   * Estimates amino acid posterior probabilities using a partition function of the alignments.
   * Computes the maximum expected accuracy alignment after applying the probability consistency transformation of Probcons.
   * Improvements best seen with datasets of variable and long length sequences.

=== Viewers ===
 * [http://www.jalview.org/ JalView]
   * Multiple alignment viewer/editor written in Java

== Background Literature ==

[http://www.connotea.org/rss/user/s2hui?download=view Literature List on Connotea]
 
=== Textbook ===
 * [http://www.baderlab.org/DomainSpecificityPredictionProject/Reading Molecular Biology of the Cell]

=== Other ===
 * http://proteinkeys.org