Diff for "DomainSpecificityPredictionProject" - Bader Lab @ The University of Toronto

Differences between revisions 11 and 94 (spanning 83 versions)

Proteome scanning of PDZ domain interactions using support vector machines

Motivation

PDZ domains mediate important biological processes through the recognition of short linear motifs. Two recent, independent high-throughput experiments, one using protein microarrays and one using phage display, have been used to detect PDZ domain interactions. Several computational predictors of PDZ domain interactions have also been developed; however, they are trained using only protein microarray data or focus on limited subsets of PDZ domains. An accurate predictor of genomic PDZ domain interactions would allow the proteomes of organisms to be scanned for potential binders. Such an application would require a predictor that is not only accurate but also precise, because a given proteome contains thousands of possible interactors. However, once validated, these predictions would increase the coverage of current PDZ domain interaction networks and further our understanding of the biological processes they mediate.

Results

We developed a PDZ domain interaction predictor using SVMs trained with both protein microarray and phage display data. Because the phage display data consisted of positive interactions only, we developed a method to deterministically generate artificial negative interactions so that the data could be used for training. Through extensive blind testing, we showed that the SVM could predict interactions in different organisms. We then used the SVM to scan the proteomes of different organisms to predict binders for several PDZ domains. Predictions were validated using PDZBase or protein microarray data, and a comparison of F1 measures and false positive rates (FPRs) between the SVM and published or commonly used predictors demonstrated the SVM’s improved accuracy and precision.
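
The two measures used in the comparison above can be illustrated with a minimal sketch. This is not taken from the PDZSVM source code: the function name `f1_and_fpr` and the toy labels are hypothetical, and the standard definitions F1 = 2PR / (P + R) and FPR = FP / (FP + TN) are assumed.

```python
def f1_and_fpr(y_true, y_pred):
    """Return (F1, FPR) for binary labels, where 1 = binder and 0 = non-binder.

    Hypothetical helper for illustration only; not part of PDZSVM.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, fpr


if __name__ == "__main__":
    # Toy example (made-up labels): 3 true binders, 5 true non-binders.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
    f1, fpr = f1_and_fpr(y_true, y_pred)
    print(f"F1 = {f1:.3f}, FPR = {fpr:.3f}")
```

The FPR matters for proteome scanning because it is multiplied by the number of non-binders: with a hypothetical 20,000 non-binding candidates, an FPR of 0.01 would still yield about 200 false positives.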

Supplementary Data

Supplementary Document (Link)
PDZSVM Data Files PDZSVMData.zip
- Models
  - Chen model parameter and binding site encoding files
  - Stiffler model parameter files
- Proteomes
  - Ensembl proteome files for Human, Worm and Fly
- Experiment Interaction files (in peptide file format)
  - Fly files from Chen
  - Human files from Sidhu
  - Mouse files from Stiffler
  - Worm files from Chen
- Curated Interaction files (flat files)
  - PDZBase for Human (Worm and Fly included, but not used)
  - Human Protein Reference Database
- Phage codon bias files

Java Implementation

LICENSE PDZSVM_LICENSE.txt
Source code PDZSVM_1.0_src.zip
Jar file PDZSVM_1.0.jar
Dependencies PDZSVMDep.zip
- jfreechart 1.0.12 (and dependencies)
- weka 3.9.1
- auc calculator (Davis & Goadrich, 2006)
- BioJava 1.5
- iText 2.1.3
- jmatio
- BRAIN 1.0.5 (pdzsvm)
- libSVM 2.8.9 (pdzsvm)

Team

Shirley Hui
Gary Bader

CategoryProject

- Revision 11 as of 2007-05-11 18:19:18
  Size: 3309
  Editor: GaryBader
+ Revision 94 as of 2010-04-27 02:33:49
  Size: 8241
  Editor: ShirleyHui
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-== Goals ==
 * Predict specificity of peptide recognition domain from the primary amino acid sequence.
 * Analyze PDZ, WW and then SH3 domains
+== Proteome scanning of PDZ domain interactions using support vector machines ==
-Line 7:
+Line 5:
-== Strategy ==
+## == Table of Contents ==
## <<TableOfContents>>
-Line 9:
+Line 8:
-== Status ==
 * [wiki:/Log Status Log]
+== Motivation ==
PDZ domains mediate important biological processes through the recognition of short linear motifs. Two recent, independent high-throughput experiments, one using protein microarrays and one using phage display, have been used to detect PDZ domain interactions. Several computational predictors of PDZ domain interactions have also been developed; however, they are trained using only protein microarray data or focus on limited subsets of PDZ domains. An accurate predictor of genomic PDZ domain interactions would allow the proteomes of organisms to be scanned for potential binders. Such an application would require a predictor that is not only accurate but also precise, because a given proteome contains thousands of possible interactors. However, once validated, these predictions would increase the coverage of current PDZ domain interaction networks and further our understanding of the biological processes they mediate.
-Line 12:
+Line 11:
-== Tasks ==
+== Results ==
We developed a PDZ domain interaction predictor using SVMs trained with both protein microarray and phage display data. Because the phage display data consisted of positive interactions only, we developed a method to deterministically generate artificial negative interactions so that the data could be used for training. Through extensive blind testing, we showed that the SVM could predict interactions in different organisms. We then used the SVM to scan the proteomes of different organisms to predict binders for several PDZ domains. Predictions were validated using PDZBase or protein microarray data, and a comparison of F1 measures and false positive rates (FPRs) between the SVM and published or commonly used predictors demonstrated the SVM’s improved accuracy and precision.
 Line 14:
-. Learn SVN, Brain code (ResidueResidueCorrelation)
 1. Literature review related to domain specificity (background activity)
 1. Run ResidueResidue correlation analysis on PDZ domain data: 1-1 version + try others e.g. 1-2  (Requires: PDZ profiles from Gary)
 1. Try different multiple sequence alignment algorithms (MSA) on the PDZ domain sequences to see if they affect the correlation results.
 1. Implement new feature: amino acid groups (learn amino acid groups) + run on PDZ data
 1. Think about new PDZ domain features that can be used for prediction.
+== Supplementary Data ==
 a. Supplementary Document (Link)
 a. PDZSVM Data Files [[attachment:PDZSVMData.zip]]
  * Models
    * Chen model parameter and binding site encoding files
    * Stiffler model parameter files
  * Proteomes
    * Ensembl proteome files for Human, Worm and Fly
  * Experiment Interaction files (in peptide file format)
    * Fly files from Chen
    * Human files from Sidhu
    * Mouse files from Stiffler
    * Worm files from Chen
  * Curated Interaction files (flat files)
    * PDZBase for Human (Worm and Fly included, but not used)
    * Human Protein Reference Database
  * Phage codon bias files
-Line 21:
+Line 32:
-== Ideas ==
 * Use of structural data (PDZ domain structures) (may require homology modeling)
 * Use of machine learning methods (SVM for classification and boosting decision tree for interpretable learning model)
 * Analysis of correlation within domain and peptide (inter-residue correlation) maybe correspondence analysis
+== Java Implementation ==
 a. LICENSE [[attachment:PDZSVM_LICENSE.txt]]
 a. Source code [[attachment:PDZSVM_1.0_src.zip]]
 a. Jar file [[attachment:PDZSVM_1.0.jar]]
 a. Dependencies [[attachment:PDZSVMDep.zip]] 
   * jfreechart 1.0.12 (and dependencies)
   * weka 3.9.1
   * auc calculator (Davis & Goadrich, 2006)
   * !BioJava 1.5
   * iText 2.1.3
   * jmatio
   * BRAIN 1.0.5 (pdzsvm)
   * libSVM 2.8.9 (pdzsvm)

## == Goals ==
## * Computationally predict specificity of peptide recognition domain from the primary amino acid sequences
## * Analyze PDZ, WW and then SH3 domains

## == Background ==
## * [[/PDZ|PDZ Domains]]
## * [[/MachineLearning|Machine Learning]]

## == Strategy ==
## * [[/Strategy|Strategy]]

## == Ideas ==
## * [[/Ideas|Ideas]]

## == Data ==
## * [[/PDZData|PDZ Data]]

## == Experiments ==
## * [[/Experiments|Experiments and Results]]

## == Status ==
##  * [[/Log|Status]]

## == Tasks ==
## 
##  1. --(Learn SVN, Brain code (!ResidueResidueCorrelation))--
##  1. Literature review related to domain specificity (background activity), PDZ domains (from Ioana's project)
##  1. --(Run !ResidueResidue correlation analysis on PDZ domain data: 1-1 version + try others e.g. 1-2  (Requires: PDZ profiles from Gary))--
##  1. MSA subproject
##   1. --(Learn basics of multiple sequence alignment (Baxevanis, chapter 12))--
##   1. Find and evaluate MSA algorithms (compare notes with Stacy) + evaluate Superfamily, PFAM databases of protein family alignments
##   1. Try different multiple sequence alignment algorithms (MSA) on the PDZ domain sequences to see if they affect the correlation results.
##  1. Benchmark/validate correlation subproject
##   1. We know H (PDZ), T @-2 (peptide) correlation
##   1. Look at structures (e.g. 1N7T and 1BE9) to see if correlated residues/positions are close to each other and compatible (physicochemically). We need to focus on PDZ structures that have bound peptides (search in PDB)
##   1. Build set of known true and false correlations for use in evaluating prediction algorithm (Note: also ask Dev Sidhu, when available). See [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=10871264 Baldi et al. review]
## 1. Amino acid group subproject
##   1. Learn about amino acid groups
##   1. Define an initial aa grouping (reasonable grouping from Levy paper)
##   1. Add new feature to !ResidueResidueCorrelation class so it considers grouping + run on PDZ data. This involves implementing the groups as a reduced alphabet (amino acids in a group are considered equivalent)
##   1. Try all groupings to see how it affects the results (from Levy paper)
##   1. See if we can incorporate aa similarity defined by substitution matrix approach (e.g. BLOSUM, PAM, GONNET) into our method, instead of grouping
##   1. Similarly, evaluate aa similarity defined by factor analysis (Atchley et al paper)
##  1. Think about new PDZ domain features that can be used for prediction.

## == Ideas ==
##  * [wiki:/MachineLearning Machine Learning Page]
##  * With current correlation counting calculation, Weight calculation by how many peptides are in the peptides file (i.e. normalize the correlation calculation in some way)
##  * Build tools to help interpret correlations in the context of multiple sequence alignments (and later structures).
##  * Use of structural data (PDZ domain structures) (may require homology modeling)
##  * Use of machine learning methods (SVM for classification and boosting decision tree for interpretable learning model)
##  * Analysis of correlation within domain and peptide (inter-residue correlation) maybe correspondence analysis
##  * Analysis of SNPs and how they affect domain binding (including correlations between SNPs)
##  * Define the binding site of the PDZ domain based on phage display data.  Given that identical binding sites between two PDZ domains should correspond to identical binding specificities, find the set of PDZ domain sites that correlate perfectly with binding specificity.

## == Courses ==

## === Biology ===
##  * [http://bio250y.chass.utoronto.ca/ BIO250] - Cell and Molecular Biology
##   * Classes: Tues/Thurs - 1-2 PM (Convocation Hall) OR Mon - 6-8 PM (MC 102-Mechanical Engineering Building)
##   * Textbook: [http://www.amazon.com/Molecular-Biology-Fourth-Bruce-Alberts/dp/0815332181/ref=pd_sim_b_1/105-5132391-0345258?ie=UTF8&qid=1188913552&sr=1-4 Molecular Biology of the Cell 4th Ed.] Alberts et al.
## === Protein Structure ===
##  * BCH340H1 - Proteins: from Structure to Proteomics
##   * Classes: Winter 2008
##   * Textbook: ?
##   * Previous Course Web Pages:
##     * [http://arrhenius.med.utoronto.ca/~chan/bch340h04-outline.html 2004 Chan]
##     * [http://xtal.uhnres.utoronto.ca/prive/BCH340/ 2006 Prive]
## === Machine Learning ===
##  * CSC2515 - Machine Learning
##    * Previous Course Web Pages:
##      * [http://www.cs.toronto.edu/~roweis/csc2515/ 2003-2006 Roweis]

## == Committee Meetings ==
## * [[/Meeting|Notes]]

## == Tools/Resources ==
## * [[/ToolsResources|Tools and Resources]]

## == Reading Notes ==
## * [[/../ShirleyHui/MBCReadings|Molecular Biology of the Cell]]
## * [[/../ShirleyHui/PPIReadings|Protein-protein Interaction Detection]]
## * Support Vector Machines

## == Related Literature ==
##  * [[http://www.connotea.org/rss/user/s2hui?download=view|Literature List on Connotea]]
## * [[http://www.baderlab.org/DomainSpecificityPredictionProject/Reading|Molecular Biology of the Cell]]
-Line 30:
+Line 138:
-== Documents ==

== Background Literature ==

=== More General ===
 * Domains, motifs, and scaffolds: the role of modular interactions in the evolution and wiring of cell signaling circuits, Bhattacharyya RP, Remenyi A, Yeh BJ, Lim WA., Annu Rev Biochem. 2006;75:655-80.
   * attachment:Domains_motifs_scaffolds_Bhattacharyya_et_al_2006.pdf
 * The Structure and Function of Proline Recognition Domains, Zarrinpar A, Bhattacharyya RP, Lim WA., Sci STKE. 2003 Apr 22;2003(179):RE8.
   * attachment:Structure_Function_Pro_Recog_Domains_Zarrinpar_et_al_2003.pdf
 * Can we infer peptide recognition specificity mediated by SH3 domains?, Cesareni G, Panni S, Nardelli G, Castagnoli L., FEBS Lett. 2002 Feb 20;513(1):38-44. 
   * attachment:Can_we_infer_PR_specificity_med_by_SH3_Cesareni_et_al_2002.pdf

=== Amino Acid Alphabets ===
 * Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices, Cannata N, Toppo S, Romualdi C, Valle G, Bioinformatics. 2002 Aug;18(8):1102-8.
   * attachment:Simplifying_AA_alphabets_branch_bound_substit_matrices_Cannata_2002.pdf
 * Simplified amino acid alphabets for protein fold recognition and implications for folding, Murphy LR, Wallqvist A, Levy RM, Protein Eng. 2000 Mar;13(3):149-52.
   * attachment:Simplified_AA_alphabets_Murphy_2000.pdf
 * Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases, Wallqvist A, Fukunishi Y, Murphy LR, Fadel A, Levy RM, Bioinformatics. 2000 Nov;16(11):988-1002.
   * attachment:Iterative_structure_search_for_protein_homologs_Wallqvist_2000.pdf

=== PDZ Related ===
 * PDZ domains-glue and guide., van Ham M, Hendriks W., Mol Biol Rep. 2003 Jun;30(2):69-82.
   * attachment:PDZ_Domains_Glue_and_Guide_2003.pdf
 * PDZ domains: structural modules for protein complex assembly., Hung AY, Sheng M., J Biol Chem. 2002 Feb 22;277(8):5699-702. Epub 2001 Dec 10.
   * attachment:PDZ_Domains_Structural_Modules_2001.pdf