Diff for "DomainSpecificityPredictionProject" - Bader Lab @ The University of Toronto

Differences between revisions 40 and 102 (spanning 62 versions)

Peptide Recognition Domain Interaction Prediction

Background

The human genome contains approximately 26,000 protein-coding genes, which through alternative splicing can direct the synthesis of thousands of different proteins. The majority of these proteins interact with other proteins to coordinate a variety of cellular processes including DNA replication, cell cycle control, and signal transduction. The ability to accurately detect these interactions enables the assembly of protein interaction networks which can be used to better understand and study the biochemistry of the cell.

Computational PPI Prediction

Computational methods to predict protein-protein interactions (PPIs) have been developed and can be used to support or prioritize experiments. Such methods range from physics-based to statistics-based approaches, but all of them face challenges. For physics-based prediction methods, the structures of the proteins are often unavailable, or protein flexibility is not taken into consideration. Sequence-based methods such as position weight matrices (PWMs) can only represent short binding motifs and often do not account for interdependencies between residues and positions. In general, the computational prediction of PPIs is considered an extremely difficult problem that is not fully addressed by any existing method.
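The PWM limitation mentioned above can be illustrated with a minimal sketch. The binder peptides, the uniform background, and the pseudocount are assumptions made for illustration; none of them come from the project data. Because the score is a sum of independent per-position terms, swapping a residue between two peptides leaves their combined score unchanged, so the model cannot express a preference for particular residue combinations.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(peptides, pseudocount=1.0):
    """Build a log-odds PWM from equal-length peptides (uniform background assumed)."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        pwm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm


def score(pwm, peptide):
    """Sum the per-position log-odds; each position is scored independently."""
    if len(peptide) != len(pwm):
        raise ValueError("peptide length must match the PWM length")
    return sum(column[aa] for column, aa in zip(pwm, peptide))


# Hypothetical binder peptides, used only to exercise the code.
binders = ["ETSV", "QTSV", "ETDV", "KTSL"]
pwm = build_pwm(binders)

print(score(pwm, "ETSV"), score(pwm, "GGGG"))
```

A model that captured interdependencies between positions would need terms over pairs (or larger sets) of positions, which a single matrix of this form does not have.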

Many PPIs are mediated by peptide recognition domains (PRDs), which are evolutionarily conserved modular interaction domains often found combined in different ways to form larger proteins. Proteins containing PRDs are used by the cell for numerous processes such as the co-localization of proteins, regulation of signaling processes, or recognition of protein post-translational modifications. Interactions usually occur through the recognition of short linear sequences in the target protein, such as proline-rich or C-terminal motifs. Because of their simpler binding sites and straightforward modes of target recognition, peptide-PRD interactions are easier to predict computationally than PPIs in general.
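As a rough sketch of what "short linear sequences" means in practice, the snippet below scans a protein sequence for two simplified patterns. Both regular expressions are assumptions chosen for illustration: the C-terminal pattern is loosely modeled on the serine/threonine, any residue, hydrophobic residue ending often described for PDZ ligands, and `P..P` stands in for a generic proline-rich core. Real motif definitions vary by domain and are not specified on this page.

```python
import re

# Assumed, simplified patterns for illustration only.
C_TERMINAL_PATTERN = re.compile(r"[ST].[VLIF]$")   # anchored at the C terminus
PROLINE_RICH_PATTERN = re.compile(r"(?=P..P)")     # lookahead allows overlapping hits


def find_linear_motifs(sequence):
    """Return 0-based start positions of the assumed motifs in a protein sequence."""
    c_term = C_TERMINAL_PATTERN.search(sequence)
    return {
        "c_terminal": c_term.start() if c_term else None,
        "proline_rich": [m.start() for m in PROLINE_RICH_PATTERN.finditer(sequence)],
    }


# Hypothetical sequence, not taken from any database.
print(find_linear_motifs("MAPPLPRRKETSV"))
```

Pattern matching of this kind only flags candidate sites; whether a given domain actually binds a candidate is the prediction problem described in the following sections.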

Computational Prediction of PDZ Domain Interactions

The PSD95/DlgA/Zo-1 (PDZ) domain is an ideal model for studying the computational prediction of peptide-PRD interactions: PDZ domains have important biological roles, are well studied, and have one of the simplest binding sites among PRDs. PDZ domains are found in bacteria, yeast, plants, and metazoans, with 250 found in humans. They often interact with ion channels, adhesion molecules, and neurotransmitter receptors in signaling and scaffolding proteins. Their biological roles include maintaining cell polarity, facilitating signal coupling, and regulating synaptic development. Their importance is underscored by the fact that mutations of the PDZ domain in different proteins have been associated with various diseases.

Sequence Based Prediction

Recently, two high-throughput experiments have been performed to study different PDZ domains. These have enabled the development of computational predictors of PDZ domain interactions. My current project focuses on using a machine learning method, the support vector machine (SVM), to computationally predict PDZ domain interactions directly from a given proteome. [Read More]
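An SVM operates on fixed-length numeric vectors, so a domain-peptide pair has to be encoded before training or prediction. The sketch below shows one common choice, one-hot encoding of the residues, concatenated for the domain and the peptide. This page does not state which encoding the project uses, so the scheme, the function names, and the example strings are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a sequence as a flat list with 20 binary features per residue."""
    vector = []
    for aa in sequence:
        bits = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:  # unknown residues (e.g. 'X') stay all-zero
            bits[AA_INDEX[aa]] = 1
        vector.extend(bits)
    return vector


def encode_pair(domain_site, peptide):
    """Concatenate the encodings of domain binding-site residues and a peptide."""
    return one_hot(domain_site) + one_hot(peptide)


# Hypothetical binding-site residues and C-terminal peptide.
features = encode_pair("GLGF", "ETSV")
print(len(features), sum(features))
```

Vectors of this form, each labeled as interacting or non-interacting, could be passed to any SVM implementation. Scanning a proteome would then amount to encoding the C-terminal peptide of every protein against each domain of interest.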

Sequence and Structure Based Prediction

Work in progress

Team

Shirley Hui
Gary Bader

CategoryProject

-  ⇤ ← Revision 40 as of 2007-09-04 14:14:51 → 
  Size: 11347
  Editor: ShirleyHui
  Comment:
+   ← Revision 102 as of 2010-07-09 16:16:15 → ⇥
  Size: 8763
  Editor: ShirleyHui
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-== Goals ==
 * Predict specificity of peptide recognition domain from the primary amino acid sequence.
 * Analyze PDZ, WW and then SH3 domains
+== Peptide Recognition Domain Interaction Prediction ==
-Line 7:
+Line 5:
-== Strategy ==
 * [wiki:/Strategy Strategy Log]
+== Table of Contents ==
<<TableOfContents>>
-Line 10:
+Line 8:
-== Status ==
 * [wiki:/Log Status Log]
+=== Background ===
The human genome contains approximately 26,000 protein-coding genes, which through alternative splicing can direct the synthesis of thousands of different proteins. The majority of these proteins interact with other proteins to coordinate a variety of cellular processes including DNA replication, cell cycle control, and signal transduction. The ability to accurately detect these interactions enables the assembly of protein interaction networks which can be used to better understand and study the biochemistry of the cell.
-Line 13:
+Line 11:
-== Tasks ==
+=== Computational PPI Prediction ===
Computational methods to predict protein-protein interactions (PPIs) have been developed and can be used to support or prioritize experiments. Such methods range from physics-based to statistics-based approaches, but all of them face challenges. For physics-based prediction methods, the structures of the proteins are often unavailable, or protein flexibility is not taken into consideration. Sequence-based methods such as position weight matrices (PWMs) can only represent short binding motifs and often do not account for interdependencies between residues and positions. In general, the computational prediction of PPIs is considered an extremely difficult problem that is not fully addressed by any existing method.
-Line 15:
+Line 14:
-. --(Learn SVN, Brain code (!ResidueResidueCorrelation))--
 1. Literature review related to domain specificity (background activity), PDZ domains (from Ioana's project)
 1. --(Run !ResidueResidue correlation analysis on PDZ domain data: 1-1 version + try others e.g. 1-2  (Requires: PDZ profiles from Gary))--
 1. MSA subproject
  1. --(Learn basics of multiple sequence alignment (Baxevanis, chapter 12))--
  1. Find and evaluate MSA algorithms (compare notes with Stacy) + evaluate Superfamily, PFAM databases of protein family alignments
  1. Try different multiple sequence alignment algorithms (MSA) on the PDZ domain sequences to see if they affect the correlation results.
 1. Benchmark/validate correlation subproject
  1. We know H (PDZ), T @-2 (peptide) correlation
  1. Look at structures (e.g. 1N7T and 1BE9) to see if correlated residues/positions are close to each other and compatible (physicochemically). We need to focus on PDZ structures that have bound peptides (search in PDB)
  1. Build set of known true and false correlations for use in evaluating prediction algorithm (Note: also ask Dev Sidhu, when available). See [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=10871264 Baldi et al. review]
 1. Amino acid group subproject
  1. Learn about amino acid groups
  1. Define an initial aa grouping (reasonable grouping from Levy paper)
  1. Add new feature to !ResidueResidueCorrelation class so it considers grouping + run on PDZ data. This involves implementing the groups as a reduced alphabet (amino acids in a group are considered equivalent)
  1. Try all groupings to see how it affects the results (from Levy paper)
  1. See if we can incorporate aa similarity defined by substitution matrix approach (e.g. BLOSUM, PAM, GONNET) into our method, instead of grouping
  1. Similarly, evaluate aa similarity defined by factor analysis (Atchley et al paper)
 1. Think about new PDZ domain features that can be used for prediction.
+Many PPIs are mediated by peptide recognition domains (PRDs), which are evolutionarily conserved modular interaction domains often found combined in different ways to form larger proteins. Proteins containing PRDs are used by the cell for numerous processes such as the co-localization of proteins, regulation of signaling processes, or recognition of protein post-translational modifications. Interactions usually occur through the recognition of short linear sequences in the target protein, such as proline-rich or C-terminal motifs. Because of their simpler binding sites and straightforward modes of target recognition, peptide-PRD interactions are easier to predict computationally than PPIs in general.
-Line 35:
+Line 16:
-== Ideas ==
 * With current correlation counting calculation, Weight calculation by how many peptides are in the peptides file (i.e. normalize the correlation calculation in some way)
 * Build tools to help interpret correlations in the context of multiple sequence alignments (and later structures).
 * Use of structural data (PDZ domain structures) (may require homology modeling)
 * Use of machine learning methods (SVM for classification and boosting decision tree for interpretable learning model)
 * Analysis of correlation within domain and peptide (inter-residue correlation) maybe correspondence analysis
 * Analysis of SNPs and how they affect domain binding (including correlations between SNPs)
 * Define the binding site of the PDZ domain based on phage display data.  Given that identical binding sites between two PDZ domains should correspond to identical binding specificities, find the set of PDZ domain sites that correlate perfectly with binding specificity.
+=== Computational Prediction of PDZ Domain Interactions ===
The PSD95/DlgA/Zo-1 (PDZ) domain is an ideal model for studying the computational prediction of peptide-PRD interactions: PDZ domains have important biological roles, are well studied, and have one of the simplest binding sites among PRDs. PDZ domains are found in bacteria, yeast, plants, and metazoans, with 250 found in humans. They often interact with ion channels, adhesion molecules, and neurotransmitter receptors in signaling and scaffolding proteins. Their biological roles include maintaining cell polarity, facilitating signal coupling, and regulating synaptic development. Their importance is underscored by the fact that mutations of the PDZ domain in different proteins have been associated with various diseases.
-Line 44:
+Line 19:
-== Courses ==
 * [http://bio250y.chass.utoronto.ca/ BIO250] - Cell and Molecular Biology
  * Classes: Tues/Thurs - 1-2 PM (Convocation Hall) OR Mon - 6-8 PM (MC 102-Mechanical Engineering Building)
  * Textbook: [http://www.amazon.com/Molecular-Biology-Fourth-Bruce-Alberts/dp/0815332181/ref=pd_sim_b_1/105-5132391-0345258?ie=UTF8&qid=1188913552&sr=1-4 Molecular Biology of the Cell 4th Ed.] Alberts et al.
 * BCH340H1 - Proteins: from Structure to Proteomics
  * Classes: Winter 2008
  * Textbook: ?
  * Old Course web pages:
    * [http://arrhenius.med.utoronto.ca/~chan/bch340h04-outline.html 2004 Chan]
    * [http://xtal.uhnres.utoronto.ca/prive/BCH340/ 2006 Prive]
+==== Sequence Based Prediction ====
Recently, two high-throughput experiments have been performed to study different PDZ domains. These have enabled the development of computational predictors of PDZ domain interactions. My current project focuses on using a machine learning method, the support vector machine (SVM), to computationally predict PDZ domain interactions directly from a given proteome. [[Data/PDZProteomeScanning|[Read More]]]
-Line 55:
+Line 22:
-== Committee Meetings ==
 * [wiki:/Meeting Notes]
+==== Sequence and Structure Based Prediction ====
 [[/Strategy|Work in progress]]

## == Goals ==
## * Computationally predict specificity of peptide recognition domain from the primary amino acid sequences
## * Analyze PDZ, WW and then SH3 domains

## == Background ==
## * [[/PDZ|PDZ Domains]]
## * [[/MachineLearning|Machine Learning]]

## == Strategy ==
## * [[/Strategy|Strategy]]

## == Ideas ==
## * [[/Ideas|Ideas]]

## == Data ==
## * [[/PDZData|PDZ Data]]

## == Experiments ==
## * [[/Experiments|Experiments and Results]]

## == Status ==
##  * [[/Log|Status]]

## == Tasks ==
## 
##  1. --(Learn SVN, Brain code (!ResidueResidueCorrelation))--
##  1. Literature review related to domain specificity (background activity), PDZ domains (from Ioana's project)
##  1. --(Run !ResidueResidue correlation analysis on PDZ domain data: 1-1 version + try others e.g. 1-2  (Requires: PDZ profiles from Gary))--
##  1. MSA subproject
##   1. --(Learn basics of multiple sequence alignment (Baxevanis, chapter 12))--
##   1. Find and evaluate MSA algorithms (compare notes with Stacy) + evaluate Superfamily, PFAM databases of protein family alignments
##   1. Try different multiple sequence alignment algorithms (MSA) on the PDZ domain sequences to see if they affect the correlation results.
##  1. Benchmark/validate correlation subproject
##   1. We know H (PDZ), T @-2 (peptide) correlation
##   1. Look at structures (e.g. 1N7T and 1BE9) to see if correlated residues/positions are close to each other and compatible (physicochemically). We need to focus on PDZ structures that have bound peptides (search in PDB)
##   1. Build set of known true and false correlations for use in evaluating prediction algorithm (Note: also ask Dev Sidhu, when available). See [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=10871264 Baldi et al. review]
## 1. Amino acid group subproject
##   1. Learn about amino acid groups
##   1. Define an initial aa grouping (reasonable grouping from Levy paper)
##   1. Add new feature to !ResidueResidueCorrelation class so it considers grouping + run on PDZ data. This involves implementing the groups as a reduced alphabet (amino acids in a group are considered equivalent)
##   1. Try all groupings to see how it affects the results (from Levy paper)
##   1. See if we can incorporate aa similarity defined by substitution matrix approach (e.g. BLOSUM, PAM, GONNET) into our method, instead of grouping
##   1. Similarly, evaluate aa similarity defined by factor analysis (Atchley et al paper)
##  1. Think about new PDZ domain features that can be used for prediction.

## == Ideas ==
##  * [wiki:/MachineLearning Machine Learning Page]
##  * With current correlation counting calculation, Weight calculation by how many peptides are in the peptides file (i.e. normalize the correlation calculation in some way)
##  * Build tools to help interpret correlations in the context of multiple sequence alignments (and later structures).
##  * Use of structural data (PDZ domain structures) (may require homology modeling)
##  * Use of machine learning methods (SVM for classification and boosting decision tree for interpretable learning model)
##  * Analysis of correlation within domain and peptide (inter-residue correlation) maybe correspondence analysis
##  * Analysis of SNPs and how they affect domain binding (including correlations between SNPs)
##  * Define the binding site of the PDZ domain based on phage display data.  Given that identical binding sites between two PDZ domains should correspond to identical binding specificities, find the set of PDZ domain sites that correlate perfectly with binding specificity.

## == Courses ==

## === Biology ===
##  * [http://bio250y.chass.utoronto.ca/ BIO250] - Cell and Molecular Biology
##   * Classes: Tues/Thurs - 1-2 PM (Convocation Hall) OR Mon - 6-8 PM (MC 102-Mechanical Engineering Building)
##   * Textbook: [http://www.amazon.com/Molecular-Biology-Fourth-Bruce-Alberts/dp/0815332181/ref=pd_sim_b_1/105-5132391-0345258?ie=UTF8&qid=1188913552&sr=1-4 Molecular Biology of the Cell 4th Ed.] Alberts et al.
## === Protein Structure ===
##  * BCH340H1 - Proteins: from Structure to Proteomics
##   * Classes: Winter 2008
##   * Textbook: ?
##   * Previous Course Web Pages:
##     * [http://arrhenius.med.utoronto.ca/~chan/bch340h04-outline.html 2004 Chan]
##     * [http://xtal.uhnres.utoronto.ca/prive/BCH340/ 2006 Prive]
## === Machine Learning ===
##  * CSC2515 - Machine Learning
##    * Previous Course Web Pages:
##      * [http://www.cs.toronto.edu/~roweis/csc2515/ 2003-2006 Roweis]

## == Committee Meetings ==
## * [[/Meeting|Notes]]

## == Tools/Resources ==
## * [[/ToolsResources|Tools and Resources]]

## == Reading Notes ==
## * [[/../ShirleyHui/MBCReadings|Molecular Biology of the Cell]]
## * [[/../ShirleyHui/PPIReadings|Protein-protein Interaction Detection]]
## * Support Vector Machines

## == Related Literature ==
##  * [[http://www.connotea.org/rss/user/s2hui?download=view|Literature List on Connotea]]
## * [[http://www.baderlab.org/DomainSpecificityPredictionProject/Reading|Molecular Biology of the Cell]]
-Line 62:
+Line 117:
-== Tools/Resources ==

=== Databases ===
 * [http://www.ensembl.org/ Ensembl]
   * Software system which produces and maintains automatic annotation on selected eukaryotic genomes.
 * [http://www.ebi.ac.uk/interpro/ InterPro]
   * Database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
 * [http://www.biomart.org/ BioMart]
   * Query-oriented data management system that simplifies the task of creation and maintenance of advanced query interfaces backed by a relational database.  It is particularly suited for providing the 'data mining' like searches of complex descriptive (e.g. biological) data.

=== Sequence Alignment ===

==== Multiple ====
===== Hierarchical Methods =====
 * [http://www.compbio.dundee.ac.uk/Software/Amps/amps.html/ AMPS] 1990
   * Calculates Z-scores through pairwise sequences comparison with randomization
   * Generates alignments without having to generate trees
 * [http://www.ebi.ac.uk/clustalw/ ClustalW] 1997
   * Uses a series of different pair-score matrices
   * Biases location of gaps based on secondary structure mask
   * Allows for realigning to refine the alignment
   * Can infer phylogeny
   * Problems:
     * Time required to complete first all against all comparison to create guide tree
 * [http://www.drive5.com/muscle/ MUSCLE] 2004
   * MUltiple Sequence Comparison by Log-Expectation
   * Uses a quick hashing comparison based on identical matches 
 * [http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/ MAFFT] 2005
   * Calculates guide tree faster by using fast Fourier transform method on AA properties to identify regions of similarity
   * Uses these regions to guide dynamic programming alignment of the sequences
 
===== Non Hierarchical Methods =====

 * [http://www.ncbi.nlm.nih.gov/BLAST/ PSI-BLAST] 1997
   * Searches a database with a single sequence
   * High scoring sequences are built into a multiple alignment which is used to derive a search profile for subsequent search of the database
   * Repeat until no new sequences are added to the profile or a specified number of iterations have been performed
 * [http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-Coffee] 2000
   * Builds a library of pairwise alignments for the sequences of interest
   * Uses library to inform hierarchical method to find a multiple alignment that preserves consistency between the pairwise alignments
   * Can align sequences of varying lengths
 * [http://baboon.math.berkeley.edu/amap/ AMAP] 2007
   * Multiple sequence alignment by sequence annealing

===== Probabilistic Methods =====
 * [http://probcons.stanford.edu/ Probcons] 2005
 * [http://probalign.njit.edu/probalign/login ProbAlign] 2006
   * Estimates amino acid posterior probabilities using a partition function of the alignments.
   * Computes the maximum expected accuracy alignment after applying the probability consistency transformation of Probcons.
   * Improvements best seen with datasets of variable and long length sequences.

=== Viewers ===
 * [http://www.jalview.org/ JalView]
   * Multiple alignment viewer/editor written in Java

== Background Literature ==

=== More General ===
 * The Human and Mouse Complement of SH2 Domain Proteins-Establishing the Boundaries of Phosphotyrosine Signaling, Liu BA, Jablonowski K, Raina M, Arce M, Pawson T and Nash PD, Mol Cell, 2006  Jun 23; 22(6):851-68.
   * attachment:Human_mouse_complement_of_SH2_domain_proteins_Liu_2006.pdf
 * Domains, motifs, and scaffolds: the role of modular interactions in the evolution and wiring of cell signaling circuits, Bhattacharyya RP, Remenyi A, Yeh BJ, Lim WA., Annu Rev Biochem. 2006;75:655-80.
   * attachment:Domains_motifs_scaffolds_Bhattacharyya_et_al_2006.pdf
 * The Structure and Function of Proline Recognition Domains, Zarrinpar A, Bhattacharyya RP, Lim WA., Sci STKE. 2003 Apr 22;2003(179):RE8.
   * attachment:Structure_Function_Pro_Recog_Domains_Zarrinpar_et_al_2003.pdf

=== Substitution Matrices ===
 * Empirical codon substitution matrix., Schneider A, Cannarozzi GM, Gonnet GH., BMC Bioinformatics. 2005 Jun 1;6:134.
   * attachment:Empirical_condo_substitution_matrix_Schneider_2005.pdf
 * Analysis of amino acid substitution during divergent evolution: the 400 by 400 dipeptide substitution matrix., Gonnet GH, Cohen MA, Benner SA., Biochem Biophys Res Commun. 1994 Mar 15;199(2):489-96.
   * attachment:Analysis_AA_Substitution_During_Divergent_Evolution_Gonnet_1994.pdf

=== Specificity Prediction/Inference ===
 * A novel structure-based encoding for machine-learning applied to the inference of SH3 domain specificity., Ferraro E., Via A., Ausiello G., Helmer-Citterich M., Bioinformatics 22(19): 2333-2339 
   * attachment:A_novel_structure_based_encoding_for_machine_leanring_applied_to_inference_of_SH3_domain_specificity_Ferraro_2006.pdf
 * Ab initio prediction of transcription factor targets using structural knowledge, Kaplan T, Friedman N, Margalit H, PLoS Comput Biol. 2005 Jun;1(1):e1.
   * attachment:Ab_Initio_Prediction_Transcription_Factor_Targets_Using_Structural_Knowlegde_Kaplan_2005.pdf
 * Specificity and robustness in transcription control networks, Sengupta AM, Djordjevic M, Shraiman BI, Proc Natl Acad Sci U S A. 2002 Feb 19;99(4):2072-7.
   * attachment:Specificity_robustness_transcription_control_networks_Sengupta_2002.pdf
 * Can we infer peptide recognition specificity mediated by SH3 domains?, Cesareni G, Panni S, Nardelli G, Castagnoli L., FEBS Lett. 2002 Feb 20;513(1):38-44. 
   * attachment:Can_we_infer_PR_specificity_med_by_SH3_Cesareni_et_al_2002.pdf

=== Amino Acid Alphabets ===
 * Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices, Cannata N, Toppo S, Romualdi C, Valle G, Bioinformatics. 2002 Aug;18(8):1102-8.
   * attachment:Simplifying_AA_alphabets_branch_bound_substit_matrices_Cannata_2002.pdf
 * Simplified amino acid alphabets for protein fold recognition and implications for folding, Murphy LR, Wallqvist A, Levy RM, Protein Eng. 2000 Mar;13(3):149-52.
   * attachment:Simplified_AA_alphabets_Murphy_2000.pdf
 * Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases, Wallqvist A, Fukunishi Y, Murphy LR, Fadel A, Levy RM, Bioinformatics. 2000 Nov;16(11):988-1002.
   * attachment:Iterative_structure_search_for_protein_homologs_Wallqvist_2000.pdf

=== PDZ Related ===
 * Functional dynamics of PDZ binding domains: a normal-mode analysis., De Los Rios P et al., Biophys J. 2005 Jul;89(1):14-21.
   * attachment:Functional_Dynamics_PDZ_Domain_Normal_Mode_Rios_2005.pdf
 * PDZ domains-glue and guide., van Ham M, Hendriks W., Mol Biol Rep. 2003 Jun;30(2):69-82.
   * attachment:PDZ_Domains_Glue_and_Guide_2003.pdf
 * PDZ domains: structural modules for protein complex assembly., Hung AY, Sheng M., J Biol Chem. 2002 Feb 22;277(8):5699-702.
   * attachment:PDZ_Domains_Structural_Modules_2001.pdf

= Links =
 * http://proteinkeys.org