Diff for "DomainSpecificityPredictionProject" - Bader Lab @ The University of Toronto

Differences between revisions 1 and 21 (spanning 20 versions)

Goals

Predict the specificity of a peptide recognition domain from its primary amino acid sequence.
Analyze PDZ, WW and then SH3 domains

Strategy

Status

[wiki:/Log Status Log]

Tasks

Learn SVN, Brain code (ResidueResidueCorrelation)
Literature review related to domain specificity (background activity), PDZ domains (from Ioana's project)
Run ResidueResidue correlation analysis on PDZ domain data: 1-1 version + try others e.g. 1-2 (Requires: PDZ profiles from Gary)
MSA subproject
1. Learn basics of multiple sequence alignment (Baxevanis, chapter 12)
2. Find and evaluate MSA algorithms (compare notes with Stacy) + evaluate Superfamily, PFAM databases of protein family alignments
3. Try different multiple sequence alignment algorithms (MSA) on the PDZ domain sequences to see if they affect the correlation results.
Benchmark/validate correlation subproject
1. We already know of one correlation: H (PDZ) with T at position -2 (peptide)
2. Look at structures (e.g. 1N7T and 1BE9) to see if correlated residues/positions are close to each other and compatible (physicochemically)
3. Build set of known true and false correlations for use in evaluating prediction algorithm (Note: also ask Dev Sidhu, when available). See [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=10871264 Baldi et al. review]
Amino acid group subproject
1. Learn about amino acid groups
2. Define an initial aa grouping (reasonable grouping from Levy paper)
3. Add new feature to ResidueResidueCorrelation class so it considers grouping + run on PDZ data
4. Try all groupings to see how they affect the results (from Levy paper)
5. See if we can incorporate aa similarity defined by substitution matrix approach (e.g. BLOSUM, PAM, GONNET) into our method, instead of grouping
6. Similarly, evaluate aa similarity defined by factor analysis (Atchley et al paper)
Think about new PDZ domain features that can be used for prediction.
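
The correlation and amino acid grouping tasks above can be illustrated with a small sketch. The ResidueResidueCorrelation class itself is not shown on this page, so this is not that code. It assumes that correlation is measured as mutual information between one aligned domain column and one peptide position, and that grouping means mapping residues to a reduced alphabet before counting. The `GROUPS` partition, the function names and the toy columns are all hypothetical; the grouping is not the one from the Levy paper.

```python
from collections import Counter
from math import log2

# Illustrative reduced alphabet (an assumption, not the Levy paper grouping).
GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}
AA_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def reduce_alphabet(residues):
    """Map each residue to its group label; unknown symbols (e.g. gaps) are kept."""
    return [AA_TO_GROUP.get(r, r) for r in residues]


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length lists of symbols."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Toy paired data: one domain alignment column and peptide position -2,
# where row i of each list comes from the same domain/peptide pair.
domain_col = list("HHHHYYYY")
peptide_col = list("TSTSVIVL")

raw = mutual_information(domain_col, peptide_col)
grouped = mutual_information(reduce_alphabet(domain_col), reduce_alphabet(peptide_col))
print(f"MI on 20-letter alphabet: {raw:.3f} bits")
print(f"MI on reduced alphabet:   {grouped:.3f} bits")
```

Running every domain column against every peptide position gives a score matrix that can be compared across groupings or across alignments. With few sequences, mutual information on the full alphabet tends to be inflated, which is one motivation for trying a reduced alphabet.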

Ideas

Use of structural data (PDZ domain structures) (may require homology modeling)
Use of machine learning methods (SVM for classification and boosted decision trees for an interpretable learning model)
Analysis of correlation within the domain and within the peptide (inter-residue correlation), possibly using correspondence analysis
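
As a sketch of how structural data could be used to check a predicted correlation, the code below reads ATOM records in the fixed-column PDB format and reports the smallest atom-atom distance between two residues. The function names, the 4.0 Å contact cutoff and the toy coordinates are assumptions for illustration; the coordinates are made up and are not taken from 1N7T or 1BE9.

```python
from math import dist


def atom_line(serial, name, resname, chain, resnum, x, y, z):
    """Build a toy ATOM record with PDB fixed-column positions (test data only)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resnum:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_atoms(pdb_lines):
    """Yield (chain, residue_number, (x, y, z)) for each ATOM record."""
    for line in pdb_lines:
        if line.startswith("ATOM"):
            yield (
                line[21],
                int(line[22:26]),
                (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            )


def min_distance(pdb_lines, res_a, res_b):
    """Smallest atom-atom distance between two residues given as (chain, number)."""
    atoms = list(parse_atoms(pdb_lines))
    coords_a = [xyz for chain, num, xyz in atoms if (chain, num) == res_a]
    coords_b = [xyz for chain, num, xyz in atoms if (chain, num) == res_b]
    if not coords_a or not coords_b:
        raise ValueError("residue not found in ATOM records")
    return min(dist(a, b) for a in coords_a for b in coords_b)


def in_contact(pdb_lines, res_a, res_b, cutoff=4.0):
    """True if the two residues have any pair of atoms within the cutoff."""
    return min_distance(pdb_lines, res_a, res_b) <= cutoff


# Made-up coordinates: a domain residue on chain A and a peptide residue on chain B.
toy = [
    atom_line(1, "CA", "HIS", "A", 10, 0.0, 0.0, 0.0),
    atom_line(2, "NE2", "HIS", "A", 10, 1.5, 0.0, 0.0),
    atom_line(3, "OG1", "THR", "B", 5, 3.0, 0.0, 0.0),
    atom_line(4, "CA", "THR", "B", 5, 10.0, 0.0, 0.0),
]
print(min_distance(toy, ("A", 10), ("B", 5)), in_contact(toy, ("A", 10), ("B", 5)))
```

For a real structure the lines would come from the downloaded PDB file, and the residue numbers would have to be mapped from alignment columns to the numbering used in that file.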

Team

Shirley Hui
Gary Bader

Tools/Resources

Databases

[http://www.ensembl.org/ Ensembl]
- Software system which produces and maintains automatic annotation on selected eukaryotic genomes.
[http://www.ebi.ac.uk/interpro/ InterPro]
- Database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
[http://www.biomart.org/ BioMart]
- Query-oriented data management system that simplifies the task of creation and maintenance of advanced query interfaces backed by a relational database. It is particularly suited for providing the 'data mining' like searches of complex descriptive (e.g. biological) data.

Sequence Alignment

Multiple

Hierarchical Methods

[http://www.compbio.dundee.ac.uk/Software/Amps/amps.html/ AMPS] 1990
- Calculates Z-scores through pairwise sequence comparison with randomization
- Generates alignments without having to generate trees
[http://www.ebi.ac.uk/clustalw/ ClustalW] 1997
- Uses a series of different pair-score matrices
- Biases location of gaps based on secondary structure mask
- Allows for realigning to refine the alignment
- Can infer phylogeny
- Problems:
  - Time required to complete the initial all-against-all comparison used to create the guide tree
[http://www.drive5.com/muscle/ MUSCLE] 2004
- MUltiple Sequence Comparison by Log-Expectation
- Uses a quick hashing comparison based on identical matches
[http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/ MAFFT] 2005
- Calculates guide tree faster by using fast Fourier transform method on AA properties to identify regions of similarity
- Uses these regions to guide dynamic programming alignment of the sequences
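
The idea of a Z-score from pairwise comparison with randomization, noted for AMPS above, can be sketched as follows. This is a minimal illustration and not the AMPS implementation: it assumes a Needleman-Wunsch global alignment with simple match/mismatch scores and a linear gap penalty, and it shuffles only the second sequence. The function names and scoring parameters are hypothetical.

```python
import random
from statistics import mean, stdev


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def alignment_z_score(a, b, n_shuffles=200, seed=0):
    """Z-score of the real alignment score against scores for shuffled copies of b."""
    rng = random.Random(seed)
    observed = global_alignment_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        shuffled_scores.append(global_alignment_score(a, "".join(letters)))
    sd = stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean(shuffled_scores)) / sd


seq = "ACDEFGHIKLMNPQRSTVWY"
print(alignment_z_score(seq, seq))
```

Because shuffling preserves composition, a high Z-score indicates similarity that depends on residue order rather than on amino acid content alone.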

Non-Hierarchical Methods

[http://www.ncbi.nlm.nih.gov/BLAST/ PSI-BLAST] 1997
- Searches a database with a single sequence
- High scoring sequences are built into a multiple alignment which is used to derive a search profile for subsequent search of the database
- Repeat until no new sequences are added to the profile or a specified number of iterations have been performed
[http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-Coffee] 2000
- Builds a library of pairwise alignments for the sequences of interest
- Uses library to inform hierarchical method to find a multiple alignment that preserves consistency between the pairwise alignments
- Can align sequences of varying lengths
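
The iterate-until-convergence loop described for PSI-BLAST above can be sketched with a toy profile search. This is a heavily simplified illustration, not the PSI-BLAST algorithm: it assumes ungapped sequences of equal length, a uniform background, a fixed pseudocount and a raw score threshold instead of E-values. All names and the toy database are hypothetical.

```python
from math import log2

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seqs, pseudocount=1.0):
    """Position-specific log-odds scores against a uniform background."""
    background = 1.0 / len(ALPHABET)
    total = len(seqs) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(len(seqs[0])):
        column = {}
        for aa in ALPHABET:
            count = sum(1 for s in seqs if s[i] == aa)
            column[aa] = log2(((count + pseudocount) / total) / background)
        profile.append(column)
    return profile


def score(profile, seq):
    """Sum of per-position log-odds scores for an ungapped sequence."""
    return sum(col[aa] for col, aa in zip(profile, seq))


def iterative_search(query, database, threshold, max_iterations=5):
    """Rebuild the profile from the hits until the hit set stops changing."""
    included = [query]
    for _ in range(max_iterations):
        profile = build_profile(included)
        hits = [s for s in database if score(profile, s) >= threshold]
        new_included = [query] + [s for s in hits if s != query]
        if set(new_included) == set(included):
            break  # no new sequences were added to the profile
        included = new_included
    return included


database = ["ACDE", "ACDF", "ACGF", "WWWW"]
print(iterative_search("ACDE", database, threshold=2.0))
```

In this toy database, `ACGF` scores below the threshold against the query alone and is only picked up once `ACDF` has been added to the profile, which is the effect the iteration is meant to achieve.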

Viewers

[http://www.jalview.org/ Jalview]
- Multiple alignment viewer/editor written in Java

Background Literature

More General

The Human and Mouse Complement of SH2 Domain Proteins-Establishing the Boundaries of Phosphotyrosine Signaling, Liu BA, Jablonowski K, Raina M, Arce M, Pawson T and Nash PD, Mol Cell, 2006 Jun 23; 22(6):851-68.
- attachment:Human_mouse_complement_of_SH2_domain_proteins_Liu_2006.pdf
Domains, motifs, and scaffolds: the role of modular interactions in the evolution and wiring of cell signaling circuits, Bhattacharyya RP, Remenyi A, Yeh BJ, Lim WA., Annu Rev Biochem. 2006;75:655-80.
- attachment:Domains_motifs_scaffolds_Bhattacharyya_et_al_2006.pdf
The Structure and Function of Proline Recognition Domains, Zarrinpar A, Bhattacharyya RP, Lim WA., Sci STKE. 2003 Apr 22;2003(179):RE8.
- attachment:Structure_Function_Pro_Recog_Domains_Zarrinpar_et_al_2003.pdf

Specificity Prediction/Inference

Ab initio prediction of transcription factor targets using structural knowledge, Kaplan T, Friedman N, Margalit H, PLoS Comput Biol. 2005 Jun;1(1):e1. Epub 2005 Jun 24.
- attachment:Ab_Initio_Prediction_Transcription_Factor_Targets_Using_Structural_Knowlegde_Kaplan_2005.pdf
Specificity and robustness in transcription control networks, Sengupta AM, Djordjevic M, Shraiman BI, Proc Natl Acad Sci U S A. 2002 Feb 19;99(4):2072-7.
- attachment:Specificity_robustness_transcription_control_networks_Sengupta_2002.pdf
Can we infer peptide recognition specificity mediated by SH3 domains?, Cesareni G, Panni S, Nardelli G, Castagnoli L., FEBS Lett. 2002 Feb 20;513(1):38-44.
- attachment:Can_we_infer_PR_specificity_med_by_SH3_Cesareni_et_al_2002.pdf

Amino Acid Alphabets

Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices, Cannata N, Toppo S, Romualdi C, Valle G, Bioinformatics. 2002 Aug;18(8):1102-8.
- attachment:Simplifying_AA_alphabets_branch_bound_substit_matrices_Cannata_2002.pdf
Simplified amino acid alphabets for protein fold recognition and implications for folding, Murphy LR, Wallqvist A, Levy RM, Protein Eng. 2000 Mar;13(3):149-52.
- attachment:Simplified_AA_alphabets_Murphy_2000.pdf
Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases, Wallqvist A, Fukunishi Y, Murphy LR, Fadel A, Levy RM, Bioinformatics. 2000 Nov;16(11):988-1002.
- attachment:Iterative_structure_search_for_protein_homologs_Wallqvist_2000.pdf

PDZ Related

PDZ domains-glue and guide., van Ham M, Hendriks W., Mol Biol Rep. 2003 Jun;30(2):69-82.
- attachment:PDZ_Domains_Glue_and_Guide_2003.pdf
PDZ domains: structural modules for protein complex assembly., Hung AY, Sheng M., J Biol Chem. 2002 Feb 22;277(8):5699-702. Epub 2001 Dec 10.
- attachment:PDZ_Domains_Structural_Modules_2001.pdf

CategoryProject

- Revision 1 as of 2007-05-07 14:49:12
  Size: 203
  Editor: GaryBader
  Comment:
+ Revision 21 as of 2007-05-18 16:36:42
  Size: 8111
  Editor: ShirleyHui
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 4:
+ * Predict the specificity of a peptide recognition domain from its primary amino acid sequence.
 * Analyze PDZ, WW and then SH3 domains
-Line 8:
+Line 10:
+ * [wiki:/Log Status Log]
-Line 11:
+Line 14:
+. --(Learn SVN, Brain code (!ResidueResidueCorrelation))--
 1. Literature review related to domain specificity (background activity), PDZ domains (from Ioana's project)
 1. --(Run !ResidueResidue correlation analysis on PDZ domain data: 1-1 version + try others e.g. 1-2  (Requires: PDZ profiles from Gary))--
 1. MSA subproject
  1. --(Learn basics of multiple sequence alignment (Baxevanis, chapter 12))--
  1. Find and evaluate MSA algorithms (compare notes with Stacy) + evaluate Superfamily, PFAM databases of protein family alignments
  1. Try different multiple sequence alignment algorithms (MSA) on the PDZ domain sequences to see if they affect the correlation results.
 1. Benchmark/validate correlation subproject
  1. We already know of one correlation: H (PDZ) with T at position -2 (peptide)
  1. Look at structures (e.g. 1N7T and 1BE9) to see if correlated residues/positions are close to each other and compatible (physicochemically)
  1. Build set of known true and false correlations for use in evaluating prediction algorithm (Note: also ask Dev Sidhu, when available). See [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=10871264 Baldi et al. review]
 1. Amino acid group subproject
  1. Learn about amino acid groups
  1. Define an initial aa grouping (reasonable grouping from Levy paper)
  1. Add new feature to !ResidueResidueCorrelation class so it considers grouping + run on PDZ data
  1. Try all groupings to see how they affect the results (from Levy paper)
  1. See if we can incorporate aa similarity defined by substitution matrix approach (e.g. BLOSUM, PAM, GONNET) into our method, instead of grouping
  1. Similarly, evaluate aa similarity defined by factor analysis (Atchley et al paper)
 1. Think about new PDZ domain features that can be used for prediction.

== Ideas ==
 * Use of structural data (PDZ domain structures) (may require homology modeling)
 * Use of machine learning methods (SVM for classification and boosted decision trees for an interpretable learning model)
 * Analysis of correlation within the domain and within the peptide (inter-residue correlation), possibly using correspondence analysis
-Line 12:
+Line 40:
+ * Shirley Hui
 * Gary Bader
-Line 13:
+Line 43:
-== Documents ==
+== Tools/Resources ==

=== Databases ===
 * [http://www.ensembl.org/ Ensembl]
   * Software system which produces and maintains automatic annotation on selected eukaryotic genomes.
 * [http://www.ebi.ac.uk/interpro/ InterPro]
   * Database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
 * [http://www.biomart.org/ BioMart]
   * Query-oriented data management system that simplifies the task of creation and maintenance of advanced query interfaces backed by a relational database.  It is particularly suited for providing the 'data mining' like searches of complex descriptive (e.g. biological) data.

=== Sequence Alignment ===

==== Multiple ====
===== Hierarchical Methods =====
 * [http://www.compbio.dundee.ac.uk/Software/Amps/amps.html/ AMPS] 1990
   * Calculates Z-scores through pairwise sequence comparison with randomization
   * Generates alignments without having to generate trees
 * [http://www.ebi.ac.uk/clustalw/ ClustalW] 1997
   * Uses a series of different pair-score matrices
   * Biases location of gaps based on secondary structure mask
   * Allows for realigning to refine the alignment
   * Can infer phylogeny
   * Problems:
     * Time required to complete the initial all-against-all comparison used to create the guide tree
 * [http://www.drive5.com/muscle/ MUSCLE] 2004
   * MUltiple Sequence Comparison by Log-Expectation
   * Uses a quick hashing comparison based on identical matches 
 * [http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/ MAFFT] 2005
   * Calculates guide tree faster by using fast Fourier transform method on AA properties to identify regions of similarity
   * Uses these regions to guide dynamic programming alignment of the sequences
 
===== Non-Hierarchical Methods =====

 * [http://www.ncbi.nlm.nih.gov/BLAST/ PSI-BLAST] 1997
   * Searches a database with a single sequence
   * High scoring sequences are built into a multiple alignment which is used to derive a search profile for subsequent search of the database
   * Repeat until no new sequences are added to the profile or a specified number of iterations have been performed
 * [http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-Coffee] 2000
   * Builds a library of pairwise alignments for the sequences of interest
   * Uses library to inform hierarchical method to find a multiple alignment that preserves consistency between the pairwise alignments
   * Can align sequences of varying lengths

=== Viewers ===
 * [http://www.jalview.org/ Jalview]
   * Multiple alignment viewer/editor written in Java
-Line 17:
+Line 91:
+=== More General ===
 * The Human and Mouse Complement of SH2 Domain Proteins-Establishing the Boundaries of Phosphotyrosine Signaling, Liu BA, Jablonowski K, Raina M, Arce M, Pawson T and Nash PD, Mol Cell, 2006  Jun 23; 22(6):851-68.
   * attachment:Human_mouse_complement_of_SH2_domain_proteins_Liu_2006.pdf
 * Domains, motifs, and scaffolds: the role of modular interactions in the evolution and wiring of cell signaling circuits, Bhattacharyya RP, Remenyi A, Yeh BJ, Lim WA., Annu Rev Biochem. 2006;75:655-80.
   * attachment:Domains_motifs_scaffolds_Bhattacharyya_et_al_2006.pdf
 * The Structure and Function of Proline Recognition Domains, Zarrinpar A, Bhattacharyya RP, Lim WA., Sci STKE. 2003 Apr 22;2003(179):RE8.
   * attachment:Structure_Function_Pro_Recog_Domains_Zarrinpar_et_al_2003.pdf

=== Specificity Prediction/Inference ===

 * Ab initio prediction of transcription factor targets using structural knowledge, Kaplan T, Friedman N, Margalit H, PLoS Comput Biol. 2005 Jun;1(1):e1. Epub 2005 Jun 24.
   * attachment:Ab_Initio_Prediction_Transcription_Factor_Targets_Using_Structural_Knowlegde_Kaplan_2005.pdf
 * Specificity and robustness in transcription control networks, Sengupta AM, Djordjevic M, Shraiman BI, Proc Natl Acad Sci U S A. 2002 Feb 19;99(4):2072-7.
   * attachment:Specificity_robustness_transcription_control_networks_Sengupta_2002.pdf
 * Can we infer peptide recognition specificity mediated by SH3 domains?, Cesareni G, Panni S, Nardelli G, Castagnoli L., FEBS Lett. 2002 Feb 20;513(1):38-44. 
   * attachment:Can_we_infer_PR_specificity_med_by_SH3_Cesareni_et_al_2002.pdf

=== Amino Acid Alphabets ===
 * Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices, Cannata N, Toppo S, Romualdi C, Valle G, Bioinformatics. 2002 Aug;18(8):1102-8.
   * attachment:Simplifying_AA_alphabets_branch_bound_substit_matrices_Cannata_2002.pdf
 * Simplified amino acid alphabets for protein fold recognition and implications for folding, Murphy LR, Wallqvist A, Levy RM, Protein Eng. 2000 Mar;13(3):149-52.
   * attachment:Simplified_AA_alphabets_Murphy_2000.pdf
 * Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases, Wallqvist A, Fukunishi Y, Murphy LR, Fadel A, Levy RM, Bioinformatics. 2000 Nov;16(11):988-1002.
   * attachment:Iterative_structure_search_for_protein_homologs_Wallqvist_2000.pdf

=== PDZ Related ===
 * PDZ domains-glue and guide., van Ham M, Hendriks W., Mol Biol Rep. 2003 Jun;30(2):69-82.
   * attachment:PDZ_Domains_Glue_and_Guide_2003.pdf
 * PDZ domains: structural modules for protein complex assembly., Hung AY, Sheng M., J Biol Chem. 2002 Feb 22;277(8):5699-702. Epub 2001 Dec 10.
   * attachment:PDZ_Domains_Structural_Modules_2001.pdf