#acl QuaidMorris:read,write WyethWasserman:read,write All:read

This page includes supplementary information for the Canadian Bioinformatics Workshop [[http://bioinformatics.ca/workshops/beyond|Interpreting Gene Lists from Omics Studies]]

== Module 1 - Introduction to Gene Lists ==

 * Lab gene list file: [[attachment:module1YeastGenes.txt]]
 * Lab URLs:
  * Synergizer - http://llama.med.harvard.edu/cgi/synergizer/translate
  * Ensembl !BioMart - http://www.ensembl.org/index.html
 * Questions:
  * Synergizer
   * How many genes could be converted to Entrez Gene?
   * How many genes could not?
   * How many genes were not recognized?
   * Could you find conversions elsewhere? (hint: try the Entrez Gene web site)
   * Why could YDL023C not be converted to Entrez Gene?
   * Why was YER056CA not recognized?
   * How many genes had multiple Entrez Gene IDs? Why?
  * !BioMart
   * How many Entrez Gene IDs did you input?
   * How many Ensembl genes did you get back?
   * How many GO annotations did you retrieve?
   * What was the most frequent GO term?
   * What was the most frequent evidence code?

Finished early?
 * Try gathering additional data from !BioMart
 * Have you tried to collect data for your own gene list?

Additional resources
 * Challenging ID mapping problems, such as ID lists that span multiple species, may require several ID mapping services to solve. Additional ID mapping services are linked from http://baderlab.org/IdentifierMapping

Additional presentation notes
 * Gene Ontology species coverage bar chart from [[http://bib.oxfordjournals.org/cgi/reprint/6/3/298|Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.]]

== Module 2 : Gene Set Analysis ==

Here are some more details for the protocol described on the module slides.

A. Download the yeast gene list from module 1, if you don't already have it
 1. Get the cleaned gene list here: [[attachment:clean_list.txt]]
 1. Get the dirty gene list here: [[attachment:dirty_list.txt]]
 1. Save it on your Desktop

B. Download the list of all yeast genes for your background set
 1. Go to [[http://www.biomart.org|BioMart]]
 1. Choose Ensembl database
 1. On right, choose "Ensembl Genes (release 49)" database
 1. On right, choose "Saccharomyces cerevisiae genes (SGD1.01)"
 1. On mid-left, choose "Attributes"
 1. On right, expand "Gene" by clicking on "+"
 1. Uncheck "Ensembl Gene ID" and "Ensembl Transcript ID"
 1. Check "Gene name"
 1. Press "Results" button on top row
 1. Check "Unique results only" box on top right
 1. Press "Go" button on right, near top
 1. Save "mart_export.txt" file to your desktop
 1. In case it didn't work for you, here's the list: [[attachment:yeast_gene_list.txt]]

C. Translate gene list names if necessary using Synergizer
 1. See protocol for module 1

D. Upload the background set list and the gene list to [[http://discover.nci.nih.gov/gominer/|GoMiner]] and run the analysis.
 1. See Module 2 slides for protocol

E. Wait for email

F. Browse results
 1. Have fun!

== Module 3 - Laboratory 3: Gene Regulation and Promoter Analysis ==

Wyeth W. Wasserman, UBC

Key Concepts
 * Accurate identification of transcription start sites is a challenge
 * CpG islands tend to associate with promoters
 * Evolutionary conservation (phylogenetic footprinting) helps to delineate regulatory regions
 * Identification of potential mediating transcription factors by searching for shared regulatory motifs in genes with similar expression patterns

What you will be able to do at the end of this section
 * Explore the regulation of a single gene using the UCSC Genome Browser
 * Use ConSite to identify putative regulatory regions and transcription factor binding sites
 * Analyze co-expressed gene sets to identify transcription factors that may contribute to their regulation
 * Run a motif discovery program and test the resulting profile against a database of profiles

=== Introduction ===

Analysis of regulatory sequences has changed significantly in recent years.
Many analyses are now pre-computed and available via genome browsers. New systems are emerging which facilitate the discovery of regulatory control mechanisms governing the expression of regulons.

We will start the lab with the exploration of regulation of a single gene. In this section of the lab, your goal is to identify likely regulatory regions within the human genomic sequence. You will need to assess the position of the promoter (or promoters) and consider diverse forms of annotation. In the second step, you will predict TF binding sites near the gene and filter the predictions using phylogenetic footprinting. In the third phase of the laboratory exercise, you will analyze the regulation of a set of genes which are co-expressed with our gene. You will use one of a recent set of software tools that analyze the frequency of binding sites for characterized transcription factors. Finally, you will perform motif discovery to find potentially novel sequence patterns over-represented in the promoters of a gene list and compare the pattern against a database of patterns.

=== Lab 3.1 Learn about the Stat5a gene ===

The human gene we will study is currently known as "signal transducer and activator of transcription 5a" (STAT5A). Like many genes, it has many synonymous names to create the confusion that keeps bioinformaticians gainfully employed.

Step 1: (NOT DONE IN CLASS TODAY) Look up STAT5A in a gene catalog to learn some basic characteristics. Suitable catalogs would include GeneCards and EntrezGene. If you note nothing else, you should recognize that this gene encodes a DNA-binding transcription factor.

Step 2: Now proceed to the UCSC Genome Browser and examine the human STAT5A gene.
 * Did you find the right gene (you could check the chromosome against the annotation in the gene catalog)?
 * What genes are nearby?

Now we will examine the gene to ascertain as much as possible about the regulatory control sequences. Let's start with the transcription start site(s).
 * Where is the promoter for this gene?
 * Are there multiple promoters? (To address these questions, you should consider the indications provided by the mappings of known transcripts and ESTs for Stat5a.)

CpG islands are highly associated with promoters.
 * Are there CpG islands annotated within or proximal to the gene?
 * What other promoter prediction annotations are available? Look at the FirstEF annotation line.
 * What is the relationship between CpG islands and FirstEF? Ask your instructor or a TA if needed.

Not all regulatory control sequences are situated immediately adjacent to a transcription start site. Let's examine the gene to determine some regions that might be likely to harbour regulatory elements.
 * Look at the conservation plots to determine regions conserved over evolution. Where are the peaks?
 * Take a quick look at the conserved TFBS annotations. What do you observe?

Digest your observations and form some hypotheses about potential locations of regulatory regions within or adjoining the Stat5a gene.

=== Lab 3.2 Phylogenetic Footprinting ===

 * Lab gene list file: [[attachment:STAT5A_HUMAN.txt]]
 * Lab gene list file: [[attachment:Stat5a_MOUSE.txt]]

There is widespread expectation that cross-species comparisons will reveal regulatory regions. Based on the hypothesis that non-coding genomic sequences with a sequence-specific function will be preferentially maintained over moderate periods of evolution, human-mouse sequence comparisons are now commonly performed to identify likely regulatory regions. Within the conserved regions, it is possible to screen for individual transcription factor binding sites (TFBS). Try the ConSite service to study some predicted binding sites.
 * ConSite http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/
 * Input the human STAT5A and mouse Stat5a genomic sequences into the system.
 * Select ALL profiles exceeding 14 bits of information.
  * What does this mean?
  * Be careful to click the button indicating "all" and not the lower button indicating a selected set.
  * The processing can take a few minutes while the TFBS are reconciled.
  * What do you observe?
   * Try reducing the thresholds for conservation and/or motif scoring.
 * One of the primary challenges in studying TFBS is overcoming the specificity problem (recall the futility theorem). How does ConSite address this issue?

=== Lab 3.3 TFBS Over-Representation of Co-expressed Genes ===

 * Lab gene list file: [[attachment:Mod3_GENELIST_fixed.txt]]

You’ve just received a list of co-expressed genes from your local bioinformatics specialist. These are the long-awaited results of your experiment with C2C12 myoblasts. You replaced the fetal calf serum in the culture with horse serum, resulting in the differentiation of the myoblasts into muscle-like myotubes. RNA taken from both states was analyzed to reveal co-expressed genes in the myotubes. Download the list from the website.
 * Try to see if there are any types of TFBS that are over-represented in the conserved non-coding regions of these genes. Several tools allow users to find over-represented conserved TF binding sites across a set of genes. Try oPOSSUM (from the list at http://www.cisreg.ca) to see which TFs are associated with your list of genes.
  * What do you find?

=== Lab 3.4 ===

 * Lab gene list file: [[attachment:Mod3_SEQUENCES.txt]]

Take the promoter regions corresponding to the gene list and perform motif discovery using the MEME software (http://meme.sdsc.edu). Take one of the resulting profiles, convert the matrix format to a PFM (position frequency matrix), and use the STAMP service (http://www.benoslab.pitt.edu/stamp/index.php) to compare the pattern against the database of known TF binding motifs.
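
The matrix conversion in Lab 3.4 can be sketched in Python. This is a minimal sketch, not part of the workshop material. It assumes that the MEME text output contains a block headed by a line such as `letter-probability matrix: alength= 4 w= 3 nsites= 10 E= ...`, followed by one row of A, C, G, T probabilities per motif position. The function names and the example motif below are hypothetical and used only for illustration.

```python
import re

# Hypothetical example block in the style of MEME text output (made-up motif).
EXAMPLE_MEME_BLOCK = """
letter-probability matrix: alength= 4 w= 3 nsites= 10 E= 1.0e-003
 0.800000  0.100000  0.100000  0.000000
 0.000000  0.000000  1.000000  0.000000
 0.500000  0.500000  0.000000  0.000000
"""


def parse_probability_block(text):
    """Return (nsites, rows) for the first letter-probability matrix in text.

    rows holds one list of floats per motif position, in A, C, G, T order
    (assumed column order).
    """
    lines = text.strip().splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("letter-probability matrix"):
            width = int(re.search(r"\bw=\s*(\d+)", line).group(1))
            nsites = int(re.search(r"\bnsites=\s*(\d+)", line).group(1))
            rows = [
                [float(value) for value in row.split()]
                for row in lines[i + 1 : i + 1 + width]
            ]
            return nsites, rows
    raise ValueError("no letter-probability matrix found")


def probabilities_to_pfm(nsites, rows, alphabet="ACGT"):
    """Scale probabilities by the number of sites to get approximate counts.

    Returns a dict mapping each base to a list of counts, one per position.
    Rounding means a column may not sum exactly to nsites.
    """
    pfm = {base: [] for base in alphabet}
    for row in rows:
        for base, probability in zip(alphabet, row):
            pfm[base].append(round(probability * nsites))
    return pfm


if __name__ == "__main__":
    nsites, rows = parse_probability_block(EXAMPLE_MEME_BLOCK)
    pfm = probabilities_to_pfm(nsites, rows)
    # Print one row per base; adapt the layout to whatever the target tool expects.
    for base in "ACGT":
        print(base, " ".join(str(count) for count in pfm[base]))
```

The counts are reconstructed from probabilities, so they are approximate. Check the input format that STAMP expects before pasting the matrix, since the row and column layout printed here is only one possible arrangement.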