Links
- http://proteinkeys.org
Goals
- Predict the specificity of peptide recognition domains from their primary amino acid sequence.
- Analyze PDZ, WW and then SH3 domains
Strategy
Status
- [wiki:/Log Status Log]
Tasks
- Learn SVN, Brain code (ResidueResidueCorrelation) 
- Literature review related to domain specificity (background activity), PDZ domains (from Ioana's project)
- Run ResidueResidue correlation analysis on PDZ domain data: 1-1 version + try others e.g. 1-2 (Requires: PDZ profiles from Gary) 
- MSA subproject - Learn basics of multiple sequence alignment (Baxevanis, chapter 12) 
- Find and evaluate MSA algorithms (compare notes with Stacy) + evaluate Superfamily, PFAM databases of protein family alignments
- Try different multiple sequence alignment (MSA) algorithms on the PDZ domain sequences to see whether the choice of algorithm affects the correlation results.
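
To make the correlation tasks above concrete, here is a minimal sketch of one way to score how strongly a domain alignment column and a peptide position co-vary. It assumes mutual information as the measure, which is a stand-in and not necessarily what the ResidueResidueCorrelation class computes. The function name and the toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two equally long residue columns."""
    if len(column_a) != len(column_b):
        raise ValueError("columns must have the same number of rows")
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), joint in count_ab.items():
        p_ab = joint / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy data: one domain position and one peptide position,
# one row per domain-peptide pair (not real PDZ data).
domain_column = list("HHHHYYYY")
peptide_column = list("TTTTVVVV")
unrelated_column = list("TVTVTVTV")

coupled = mutual_information(domain_column, peptide_column)
independent = mutual_information(domain_column, unrelated_column)
print(coupled, independent)
```

Re-running the same scoring on alignments produced by different MSA programs would show whether column assignments, and therefore the scores, shift between them.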
 
- Benchmark/validate correlation subproject - We already know of one correlation: H (PDZ domain) with T at position -2 (peptide)
- Look at structures (e.g. 1N7T and 1BE9) to see if correlated residues/positions are close to each other and compatible (physicochemically). We need to focus on PDZ structures that have bound peptides (search in PDB)
- Build set of known true and false correlations for use in evaluating prediction algorithm (Note: also ask Dev Sidhu, when available). See [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=10871264 Baldi et al. review] 
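
Once a set of known true and false correlations exists, predicted pairs can be scored against it. The sketch below is one possible way to do that, assuming each correlation is written as a (domain position, peptide position) pair. The position labels and the function name are hypothetical placeholders, not curated data.

```python
def evaluate_predictions(predicted, known_true, known_false):
    """Score predicted position pairs against curated true/false sets.

    Predicted pairs that appear in neither set are ignored, since their
    status is unknown.
    """
    predicted = set(predicted)
    tp = len(predicted & known_true)    # predicted and known to be real
    fp = len(predicted & known_false)   # predicted but known to be false
    fn = len(known_true - predicted)    # real correlations that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Hypothetical benchmark: (domain position label, peptide position).
known_true = {("D1", -2), ("D2", 0)}
known_false = {("D3", -1), ("D4", -3)}
predicted = [("D1", -2), ("D3", -1), ("D5", -2)]

result = evaluate_predictions(predicted, known_true, known_false)
print(result)
```

Keeping an explicit false set, rather than treating every unlisted pair as false, avoids penalising predictions that have simply not been checked yet.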
 
- Amino acid group subproject - Learn about amino acid groups
- Define an initial aa grouping (reasonable grouping from Levy paper)
- Add new feature to ResidueResidueCorrelation class so it considers grouping + run on PDZ data. This involves implementing the groups as a reduced alphabet (amino acids in a group are considered equivalent) 
- Try all groupings (from Levy paper) to see how they affect the results
- See if we can incorporate aa similarity defined by substitution matrix approach (e.g. BLOSUM, PAM, GONNET) into our method, instead of grouping
- Similarly, evaluate aa similarity defined by factor analysis (Atchley et al. paper)
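
The reduced-alphabet idea in this subproject can be sketched as a simple residue-to-group mapping applied before the correlation analysis. The grouping below is an illustrative physicochemical one chosen for this example; it is an assumption and is not taken from the Levy paper. All names are hypothetical.

```python
# Illustrative grouping only (assumption, not the grouping from the Levy paper).
EXAMPLE_GROUPS = {
    "hydrophobic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}


def build_alphabet_map(groups):
    """Map every residue to a single letter that represents its group."""
    mapping = {}
    for residues in groups.values():
        label = residues[0]  # the first residue stands for the whole group
        for residue in residues:
            if residue in mapping:
                raise ValueError(f"residue {residue} is in more than one group")
            mapping[residue] = label
    return mapping


def reduce_sequence(sequence, mapping):
    """Rewrite a sequence in the reduced alphabet; gaps and unknowns pass through."""
    return "".join(mapping.get(residue, residue) for residue in sequence)


alphabet_map = build_alphabet_map(EXAMPLE_GROUPS)
print(reduce_sequence("KRDE-ST", alphabet_map))
```

Because residues in a group collapse to one symbol, any existing column-based analysis can run unchanged on the reduced sequences, and swapping in another grouping only requires a different dictionary.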
 
- Think about new PDZ domain features that can be used for prediction.
Ideas
- Build tools to help interpret correlations in the context of multiple sequence alignments (and later structures).
- Use of structural data (PDZ domain structures) (may require homology modeling)
- Use of machine learning methods (SVM for classification and boosting decision tree for interpretable learning model)
- Analysis of correlations within the domain and within the peptide (inter-residue correlation), possibly using correspondence analysis
- Analysis of SNPs and how they affect domain binding (including correlations between SNPs)
Team
- Shirley Hui
- Gary Bader
Tools/Resources
Databases
- [http://www.ensembl.org/ Ensembl] - Software system which produces and maintains automatic annotation on selected eukaryotic genomes.
 
- [http://www.ebi.ac.uk/interpro/ InterPro] - Database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
 
- [http://www.biomart.org/ BioMart] - Query-oriented data management system that simplifies the task of creation and maintenance of advanced query interfaces backed by a relational database. It is particularly suited for providing the 'data mining' like searches of complex descriptive (e.g. biological) data.
 
Sequence Alignment
Multiple
Hierarchical Methods
- [http://www.compbio.dundee.ac.uk/Software/Amps/amps.html/ AMPS] 1990 - Calculates Z-scores through pairwise sequence comparison with randomization
- Generates alignments without having to generate trees
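
The randomization Z-score mentioned for AMPS can be illustrated with a small sketch: score the real pair, score the first sequence against shuffled copies of the second, and express the real score in standard deviations above the shuffled mean. This is a simplified illustration and not the AMPS implementation. The match/mismatch/gap values, the shuffle count, and all function names are assumptions; a substitution matrix would normally replace the simple match/mismatch scoring.

```python
import random
from statistics import mean, stdev


def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty (score only, no traceback)."""
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def alignment_z_score(seq_a, seq_b, shuffles=100, seed=0):
    """Z-score of the real alignment score against shuffled versions of seq_b."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = global_alignment_score(seq_a, seq_b)
    background = []
    for _ in range(shuffles):
        shuffled = list(seq_b)
        rng.shuffle(shuffled)  # keeps composition, destroys residue order
        background.append(global_alignment_score(seq_a, "".join(shuffled)))
    spread = stdev(background)
    if spread == 0:
        return 0.0
    return (observed - mean(background)) / spread


# Hypothetical sequence, used only to exercise the functions.
example = "MKTAYIAKQRQISFVKSHFSRQ"
print(alignment_z_score(example, example))
```

Shuffling preserves length and amino acid composition, so the Z-score reflects similarity in residue order and not just similar composition.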
 
- [http://www.ebi.ac.uk/clustalw/ ClustalW] 1997 - Uses a series of different pair-score matrices
- Biases location of gaps based on secondary structure mask
- Allows for realigning to refine the alignment
- Can infer phylogeny
- Problems: the initial all-against-all pairwise comparison needed to build the guide tree is time-consuming
 
 
- [http://www.drive5.com/muscle/ MUSCLE] 2004 - MUltiple Sequence Comparison by Log-Expectation
- Uses a quick hashing comparison based on identical matches
 
- [http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/ MAFFT] 2005 - Calculates guide tree faster by using fast Fourier transform method on AA properties to identify regions of similarity
- Uses these regions to guide dynamic programming alignment of the sequences
 
Non-Hierarchical Methods
- [http://www.ncbi.nlm.nih.gov/BLAST/ PSI-BLAST] 1997 - Searches a database with a single sequence
- High scoring sequences are built into a multiple alignment which is used to derive a search profile for subsequent search of the database
- Repeat until no new sequences are added to the profile or a specified number of iterations have been performed
 
- [http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi T-Coffee] 2000 - Builds a library of pairwise alignments for the sequences of interest
- Uses library to inform hierarchical method to find a multiple alignment that preserves consistency between the pairwise alignments
- Can align sequences of varying lengths
 
- [http://baboon.math.berkeley.edu/amap/ AMAP] 2007 - Multiple sequence alignment by sequence annealing
 
Probabilistic Methods
- [http://probcons.stanford.edu/ Probcons] 2005 
- [http://probalign.njit.edu/probalign/login ProbAlign] 2006 - Estimates amino acid posterior probabilities using a partition function of the alignments.
- Computes the maximum expected accuracy alignment after applying the probability consistency transformation of Probcons.
- Improvements are best seen on datasets of long sequences with variable lengths.
 
Viewers
- [http://www.jalview.org/ JalView] - Multiple alignment viewer/editor written in Java
 
Background Literature
More General
- The Human and Mouse Complement of SH2 Domain Proteins-Establishing the Boundaries of Phosphotyrosine Signaling, Liu BA, Jablonowski K, Raina M, Arce M, Pawson T and Nash PD, Mol Cell, 2006  Jun 23; 22(6):851-68. - attachment:Human_mouse_complement_of_SH2_domain_proteins_Liu_2006.pdf
 
- Domains, motifs, and scaffolds: the role of modular interactions in the evolution and wiring of cell signaling circuits, Bhattacharyya RP, Remenyi A, Yeh BJ, Lim WA., Annu Rev Biochem. 2006;75:655-80. - attachment:Domains_motifs_scaffolds_Bhattacharyya_et_al_2006.pdf
 
- The Structure and Function of Proline Recognition Domains, Zarrinpar A, Bhattacharyya RP, Lim WA., Sci STKE. 2003 Apr 22;2003(179):RE8. - attachment:Structure_Function_Pro_Recog_Domains_Zarrinpar_et_al_2003.pdf
 
Substitution Matrices
- Empirical codon substitution matrix., Schneider A, Cannarozzi GM, Gonnet GH., BMC Bioinformatics. 2005 Jun 1;6:134. - attachment:Empirical_condo_substitution_matrix_Schneider_2005.pdf
 
- Analysis of amino acid substitution during divergent evolution: the 400 by 400 dipeptide substitution matrix., Gonnet GH, Cohen MA, Benner SA., Biochem Biophys Res Commun. 1994 Mar 15;199(2):489-96. - attachment:Analysis_AA_Substitution_During_Divergent_Evolution_Gonnet_1994.pdf
 
Specificity Prediction/Inference
- A novel structure-based encoding for machine-learning applied to the inference of SH3 domain specificity., Ferraro E., Via A., Ausiello G., Helmer-Citterich M., Bioinformatics 22(19): 2333-2339  - attachment:A_novel_structure_based_encoding_for_machine_leanring_applied_to_inference_of_SH3_domain_specificity_Ferraro_2006.pdf
 
- Ab initio prediction of transcription factor targets using structural knowledge, Kaplan T, Friedman N, Margalit H, PLoS Comput Biol. 2005 Jun;1(1):e1. - attachment:Ab_Initio_Prediction_Transcription_Factor_Targets_Using_Structural_Knowlegde_Kaplan_2005.pdf
 
- Specificity and robustness in transcription control networks, Sengupta AM, Djordjevic M, Shraiman BI, Proc Natl Acad Sci U S A. 2002 Feb 19;99(4):2072-7. - attachment:Specificity_robustness_transcription_control_networks_Sengupta_2002.pdf
 
- Can we infer peptide recognition specificity mediated by SH3 domains?, Cesareni G, Panni S, Nardelli G, Castagnoli L., FEBS Lett. 2002 Feb 20;513(1):38-44.  - attachment:Can_we_infer_PR_specificity_med_by_SH3_Cesareni_et_al_2002.pdf
 
Amino Acid Alphabets
- Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices, Cannata N, Toppo S, Romualdi C, Valle G, Bioinformatics. 2002 Aug;18(8):1102-8. - attachment:Simplifying_AA_alphabets_branch_bound_substit_matrices_Cannata_2002.pdf
 
- Simplified amino acid alphabets for protein fold recognition and implications for folding, Murphy LR, Wallqvist A, Levy RM, Protein Eng. 2000 Mar;13(3):149-52. - attachment:Simplified_AA_alphabets_Murphy_2000.pdf
 
- Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases, Wallqvist A, Fukunishi Y, Murphy LR, Fadel A, Levy RM, Bioinformatics. 2000 Nov;16(11):988-1002. - attachment:Iterative_structure_search_for_protein_homologs_Wallqvist_2000.pdf
 
PDZ Related
- Functional dynamics of PDZ binding domains: a normal-mode analysis., De Los Rios P et al., Biophys J. 2005 Jul;89(1):14-21. - attachment:Functional_Dynamics_PDZ_Domain_Normal_Mode_Rios_2005.pdf
 
- PDZ domains-glue and guide., van Ham M, Hendriks W., Mol Biol Rep. 2003 Jun;30(2):69-82. - attachment:PDZ_Domains_Glue_and_Guide_2003.pdf
 
- PDZ domains: structural modules for protein complex assembly., Hung AY, Sheng M., J Biol Chem. 2002 Feb 22;277(8):5699-702. - attachment:PDZ_Domains_Structural_Modules_2001.pdf
 
