Diff for "CSCPathwayAnalysisService/Data" - Bader Lab @ The University of Toronto

Differences between revisions 1 and 12 (spanning 11 versions)

Data Input Requirement

Your data should have been statistically analyzed:
- Data should have been normalized.
- Quality control plots should have been produced:
  - Box-plot of intensity (before and after normalization)
    - Looking at the distribution of probe intensities across all arrays at once can, for example, reveal that one array is unlike the others. Normalization corrects data heterogeneity, so plots after normalization should be more homogeneous.
  - Principal Component Analysis (2D-PCA)
    - PCA is recommended as an exploratory tool to uncover unknown trends in the data. When applied to samples, PCA will help you explore correlations between samples.
  - Unsupervised hierarchical clustering of samples and genes (performed on the whole data set)
    - Clustering is a useful exploratory technique for gene expression data. It groups genes and samples that have similar gene expression patterns.
  - Please provide a PowerPoint presentation with a figure for each analysis.
- An appropriate statistical test testing your hypothesis (your biological question) should have been performed, for example: moderated t-test, paired t-test, ANOVA, ...
- If you need support for your statistical analyses, please contact Shaheena Bashir (Ph.D. in Statistics) at sbashir@uhnres.utoronto.ca.
  - Located on the 15th floor of MaRS TMDT, Shaheena Bashir offers free statistical consultation for the Cancer Stem Cell program (https://sites.google.com/site/biostatisticscancerstemcell/). Your data will be analyzed and output in the correct format for subsequent pathway and network analyses. You are encouraged to contact Shaheena as soon as you start planning your experiment: genomics technologies can be very sensitive to noise, so a well-designed experiment and statistical consultation at the design stage are crucial for data quality and results.
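The per-array distributions that a box-plot displays can also be checked numerically. The following is a minimal sketch, not part of the service itself: the function name `five_number_summary` is hypothetical, and the values are the five normalized values per sample from the example table on this page (a real check would use every probe on each array).

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the numbers a box-plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Normalized values copied from the example table, one list per sample.
samples = {
    "sample1": [9.13084, 8.58977, 8.94519, 10.574, 8.25088],
    "sample2": [9.7166, 9.29698, 9.56513, 10.7741, 8.4121],
    "sample3": [8.76638, 8.80844, 8.38612, 10.5401, 8.2783],
}

# Print one summary per array; after normalization they should look alike.
for name, values in samples.items():
    print(name, five_number_summary(values))
```

An array whose median or quartiles sit far from those of the other arrays is a candidate for closer inspection before any statistical test is run.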
You need to provide us with 1 file (.txt) for enrichment analysis:
- Name your file as follows: yourname_date_PIname.txt (example: veronique_March21_BADER.txt)
- Please rename your file with a new date if you resubmit your file
- Please follow the format description:
  - the first column corresponds to Entrez Gene ID.
    - An Entrez Gene ID is a numerical value that uniquely identifies genes.
    - For example the Entrez Gene ID for Myc (myelocytomatosis oncogene [ Mus musculus ]) is 17869: http://www.ncbi.nlm.nih.gov/gene/17869.
    - You can convert many types of gene identifiers and symbols to Entrez Gene ID using Synergizer or other similar tools.
  - the second column corresponds to a unique array identifier (ProbesetID for Affymetrix and sampleID for Illumina).
  - the third column corresponds to gene name (official gene symbol).
  - the fourth column corresponds to the gene description (full gene name).
  - the fifth and sixth columns contain the statistical values:
    - the statistical values are the ones that tell you whether a gene is significantly differentially expressed; for example, the t value and the p-value if you applied a t-test.
    - the whole table is ranked on the basis of one statistical value, preferably the t value.
  - the additional columns contain the transformed (log2 for example) and normalized (RMA or quantile normalization for example) values for each sample (= each chip if gene expression data).
  - Example:

Entrez ID	Probeset ID	Gene Name	Gene Description	t value	p value	sample1	sample2	sample3
17218	10572906	Mcm5	minichromosome maintenance deficient 5, cell division cycle	44.0079	0.001	9.13084	9.7166	8.76638
27279	10448307	Tnfrsf12a	tumor necrosis factor receptor superfamily, member 12a	-41.815	0.001	8.58977	9.29698	8.80844
13215	10582809	Tk1	thymidine kinase 1	39.9456	0.001	8.94519	9.56513	8.38612
12937	10384145	H2afv	H2A histone family, member V	-33.6475	0.001	10.574	10.7741	10.5401
207277	10526848	A430033K04Rik	A430033K04Rik	33.3352	0.001	8.25088	8.4121	8.2783
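Before submitting, the layout described above can be checked with a short script. This is only a sketch under stated assumptions: the function name `check_table` is hypothetical, the file is assumed to be tab-delimited as in the example, and the ranking of the table is not verified.

```python
import csv
import io

FIXED_COLUMNS = 6  # Entrez ID, array ID, gene name, description, two statistics

def check_table(text):
    """Return a list of problems found in a tab-delimited table (empty if none)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    if len(header) <= FIXED_COLUMNS:
        problems.append("header has no sample columns")
    for number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            problems.append(f"line {number}: expected {len(header)} columns, found {len(row)}")
            continue
        if not row[0].isdigit():
            problems.append(f"line {number}: Entrez Gene ID {row[0]!r} is not numeric")
        # Columns five onward (statistics and sample values) must be numbers.
        for value in row[4:]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {number}: {value!r} is not a number")
    return problems

# Header and first two rows of the example table.
example = (
    "Entrez ID\tProbeset ID\tGene Name\tGene Description\tt value\tp value\tsample1\tsample2\tsample3\n"
    "17218\t10572906\tMcm5\tminichromosome maintenance deficient 5, cell division cycle\t44.0079\t0.001\t9.13084\t9.7166\t8.76638\n"
    "27279\t10448307\tTnfrsf12a\ttumor necrosis factor receptor superfamily, member 12a\t-41.815\t0.001\t8.58977\t9.29698\t8.80844\n"
)
print(check_table(example))
```

Quoting is disabled so that quotation marks inside a gene description are read literally rather than as field delimiters.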

Note:
- Each row of the table should correspond to a different gene. If several rows correspond to the same gene (same Entrez Gene ID), there are two ways to remove the redundancy:
  - for a given gene, only the row with the most extreme t-value is kept
  - for a given gene, the normalized values of the different rows are averaged before the t-test is applied
  - the choice must be made before the statistical analyses are performed. We can discuss it during the initial meeting.
- Include all your data (even data with non-significant p-values).
- The data input requirement has been defined for gene expression data and may be different for other omics experiments. This will be discussed during the Analysis Planning meeting.
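The first option in the note, keeping the row with the most extreme t-value for each gene, could be sketched as follows. The function name `keep_most_extreme` and the two duplicate rows (array identifiers `probeset_B` and `probeset_C`, with their statistics) are invented for illustration; column positions follow the format description, and "most extreme" is assumed to mean largest in absolute value.

```python
def keep_most_extreme(rows, id_index=0, t_index=4):
    """Keep, for each Entrez Gene ID, the row whose t value is largest in absolute value."""
    best = {}
    for row in rows:
        gene_id = row[id_index]
        kept = best.get(gene_id)
        # On a tie the row seen first is kept.
        if kept is None or abs(float(row[t_index])) > abs(float(kept[t_index])):
            best[gene_id] = row
    return list(best.values())

rows = [
    ["17218", "10572906", "Mcm5", "minichromosome maintenance deficient 5, cell division cycle", "44.0079", "0.001"],
    ["17218", "probeset_B", "Mcm5", "minichromosome maintenance deficient 5, cell division cycle", "-12.5", "0.01"],  # invented duplicate
    ["27279", "probeset_C", "Tnfrsf12a", "tumor necrosis factor receptor superfamily, member 12a", "30.0", "0.002"],  # invented duplicate
    ["27279", "10448307", "Tnfrsf12a", "tumor necrosis factor receptor superfamily, member 12a", "-41.815", "0.001"],
]

for row in keep_most_extreme(rows):
    print("\t".join(row))
```

The second option (averaging the normalized values) has to happen before the test is computed, so it cannot be applied to a finished table in this way.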


- Revision 1 as of 2011-04-04 20:02:53
  Size: 82
  Editor: VeroniqueVoisin
  Comment:
+ Revision 12 as of 2011-04-28 20:58:36
  Size: 5457
  Editor: VeroniqueVoisin
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-Describe CancerStemCellProject/VeroniqueVoisin/PathwayAnalysisService/Data here.
+## page was renamed from CancerStemCellProject/VeroniqueVoisin/PathwayAnalysisService/Data
#acl All:read

== Data Input Requirement ==

 * '''Your data should have been statistically analyzed''':

   * '''Data should have been normalized.'''
   * '''Quality control plots should have been produced:'''
     * '''Box-plot of intensity''' (before and after normalization)
       * Looking at the distribution of probe intensities across all arrays at once can, for example, reveal that one array is unlike the others. Normalization corrects data heterogeneity, so plots after normalization should be more homogeneous.
     * '''Principal Component Analysis''' (2D-PCA)
       * PCA is recommended as an exploratory tool to uncover unknown trends in the data. When applied to samples, PCA will help you explore correlations between samples.
     * '''Unsupervised hierarchical clustering''' of samples and genes (performed on the whole data set)
       * Clustering is a useful exploratory technique for gene expression data. It groups genes and samples that have similar gene expression patterns.
     * Please provide a PowerPoint presentation with a figure for each analysis.

   * '''An appropriate statistical test testing your hypothesis''' (your biological question) should have been performed, for example: moderated t-test, paired t-test, ANOVA, ...
   * '''If you need support for your statistical analyses, please contact Shaheena Bashir (Ph.D. in Statistics) at sbashir@uhnres.utoronto.ca'''.
    * Located on the 15th floor of MaRS TMDT, Shaheena Bashir offers free statistical consultation for the Cancer Stem Cell program (https://sites.google.com/site/biostatisticscancerstemcell/). Your data will be analyzed and output in the correct format for subsequent pathway and network analyses. You are encouraged to contact Shaheena as soon as you start planning your experiment: genomics technologies can be very sensitive to noise, so a well-designed experiment and statistical consultation at the design stage are crucial for data quality and results.

 * '''You need to provide us with 1 file (.txt) for enrichment analysis''':
     * Name your file as follows: yourname_date_PIname.txt (example: veronique_March21_BADER.txt)
     * Please rename your file with a new date if you resubmit your file
     * Please follow the format description:
          * the first column corresponds to Entrez Gene ID.
             * An Entrez Gene ID is a numerical value that uniquely identifies genes.
             * For example the Entrez Gene ID for Myc (myelocytomatosis oncogene [ Mus musculus ]) is 17869: http://www.ncbi.nlm.nih.gov/gene/17869.
             * You can convert many types of gene identifiers and symbols to Entrez Gene ID using [[http://llama.mshri.on.ca/synergizer/doc/|Synergizer]] or other similar tools.
          * the second column corresponds to a unique array identifier (ProbesetID for Affymetrix and sampleID for Illumina).
          * the third column corresponds to gene name (official gene symbol).
          * the fourth column corresponds to the gene description (full gene name).
          * the fifth and sixth columns contain the statistical values:
              * the statistical values are the ones that tell you whether a gene is significantly differentially expressed; for example, the t value and the p-value if you applied a t-test.
              * the whole table is ranked on the basis of one statistical value, preferably the t value.
          * the additional columns contain the transformed (log2 for example) and normalized (RMA or quantile normalization for example) values for each sample (= each chip if gene expression data).
           
      * '''Example''':
||Entrez ID||Probeset ID||Gene Name||Gene Description||t value||p value||sample1||sample2||sample3||
||17218||10572906||Mcm5|| minichromosome maintenance deficient 5, cell division cycle||44.0079||0.001||9.13084||9.7166||8.76638||
||27279||10448307||Tnfrsf12a||tumor necrosis factor receptor superfamily, member 12a||-41.815||0.001||8.58977||9.29698||8.80844||
||13215||10582809||Tk1||thymidine kinase 1||39.9456||0.001||8.94519||9.56513||8.38612||
||12937||10384145||H2afv||H2A histone family, member V||-33.6475||0.001||10.574||10.7741||10.5401||
||207277||10526848||A430033K04Rik||A430033K04Rik||33.3352||0.001||8.25088||8.4121||8.2783||


        * '''Note''':
         * Each row of the table should correspond to a different gene. If several rows correspond to the same gene (same Entrez Gene ID), there are two ways to remove the redundancy:
          * for a given gene, only the row with the most extreme t-value is kept
          * for a given gene, the normalized values of the different rows are averaged before the t-test is applied
          * the choice must be made before the statistical analyses are performed. We can discuss it during the initial meeting.
         * Include all your data (even data with non-significant p-values).
         * The data input requirement has been defined for gene expression data and may be different for other omics experiments. This will be discussed during the Analysis Planning meeting.

----
-----
[[CSCPathwayAnalysisService/SOP | BACK TO STANDARD OPERATING PROCEDURES (SOP) ]]  

[[CSCPathwayAnalysisService| BACK TO HOME PAGE ]]