CSCPathwayAnalysisService/Data - Bader Lab @ The University of Toronto

Data Input Requirement

Your data should have been statistically analyzed:
- Data should have been normalized.
- PLEASE PROVIDE ANY QUALITY CONTROL PLOTS that you may have produced, especially PCA and CLUSTERING PLOTS:
  - Box-plot of intensity (before and after normalization)
    - Looking at the distribution of probe intensities across all arrays at once can, for example, show that one array is not like the others. Normalization corrects data heterogeneity, so plots after normalization should be more homogeneous.
  - Principal Component Analysis (2D-PCA)
    - PCA is recommended as an exploratory tool to uncover unknown trends in the data. When applied to samples, PCA will help you explore correlations between samples.
  - Unsupervised hierarchical clustering of samples and genes (performed on the whole data set)
    - Clustering is a useful exploratory technique for gene expression data. It groups genes and samples that have similar gene expression patterns.
  - Please provide a PowerPoint presentation with a figure for each analysis.
- PLEASE PROVIDE ANY REPORT ACCOMPANYING THE STATISTICAL ANALYSIS that describes how the analysis has been done.
- A statistical test appropriate for your hypothesis (your biological question) should have been performed, for example a moderated t-test, paired t-test, or ANOVA.
- If you need support for your statistical analyses, please contact Shaheena Bashir (Ph.D. in Statistics) at sbashir@uhnres.utoronto.ca.
  - Located at MaRS TMDT, 15th floor, Shaheena Bashir offers free consultation on statistical analyses for the Cancer Stem Cell program (https://sites.google.com/site/biostatisticscancerstemcell/). Your data will be analyzed and output in the correct format for subsequent pathway and network analyses. You are encouraged to contact Shaheena as soon as you plan your experiment: genomics technologies can be very sensitive to noise, and a well-designed experiment is very important for best results. Statistical consultation at the design stage is crucial for improved data quality and results.
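
As a rough illustration of the box-plot check described above, the numbers behind a box-plot can be computed per array and compared side by side. This is only a sketch in Python: the function names and the intensity values are hypothetical, and a real analysis would normally draw the plots with a statistics package.

```python
from statistics import median, quantiles


def boxplot_summary(values):
    """Five-number summary of one array's intensities (what a box-plot displays)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    return {"min": min(values), "q1": q1, "median": median(values),
            "q3": q3, "max": max(values)}


def summarize_arrays(intensities):
    """Map each sample name to its summary so the arrays can be compared."""
    return {sample: boxplot_summary(vals) for sample, vals in intensities.items()}


# Hypothetical log2 intensities, for illustration only (not real data).
intensities = {
    "array_A": [7.9, 8.4, 9.1, 9.6, 10.2],
    "array_B": [8.0, 8.5, 9.0, 9.7, 10.1],
    "array_C": [10.9, 11.6, 12.2, 12.8, 13.5],  # shifted relative to the others
}
for sample, summary in summarize_arrays(intensities).items():
    print(sample, summary)
```

An array whose median and quartiles sit far from those of the other arrays, as array_C does here, is the kind of outlier the box-plot is meant to reveal.
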
PLEASE PROVIDE A TAB-DELIMITED TEXT FILE (.txt) CONTAINING THE DATA FOR THE PATHWAY AND NETWORK ANALYSIS (or alternatively a .csv file):
- Name your file as follows: yourname_date_PIname.txt (example: veronique_March21_BADER.txt)
- Please rename your file with a new date if you resubmit your file
- Please follow the format description:
  - the first column corresponds to Entrez Gene ID.
    - An Entrez Gene ID is a numerical value that uniquely identifies genes.
    - For example the Entrez Gene ID for Myc (myelocytomatosis oncogene [ Mus musculus ]) is 17869: http://www.ncbi.nlm.nih.gov/gene/17869.
    - You can convert many types of gene identifiers and symbols to Entrez Gene ID using Synergizer or other similar tools.
  - the second column corresponds to a unique array identifier (ProbesetID for Affymetrix and sampleID for Illumina).
  - the third column corresponds to gene name (official gene symbol).
  - the fourth column corresponds to the gene description (full gene name).
  - the fifth and sixth columns contain the statistical values:
    - the statistical values are the ones that let you tell whether a gene is significantly differentially expressed, for example the t value and the p-value if you applied a t-test.
    - the whole table is ranked on the basis of one statistical value, preferably the t value.
  - the additional columns contain the transformed (log2 for example) and normalized (RMA or quantile normalization for example) values for each sample (= each chip if gene expression data).
DATA INPUT EXAMPLE:

Entrez ID	Probeset ID	Gene Name	Gene Description	log2foldchange	t value	q value (FDR)	sample1	sample2	sample3
17218	10572906	Mcm5	minichromosome maintenance deficient 5, cell division cycle	10.2	44.0079	0.001	9.13084	9.7166	8.76638
27279	10448307	Tnfrsf12a	tumor necrosis factor receptor superfamily, member 12a	-9.8	-41.815	0.001	8.58977	9.29698	8.80844
13215	10582809	Tk1	thymidine kinase 1	8.7	39.9456	0.001	8.94519	9.56513	8.38612
12937	10384145	H2afv	H2A histone family, member V	-7.4	-33.6475	0.001	10.574	10.7741	10.5401
207277	10526848	A430033K04Rik	A430033K04Rik	7	33.3352	0.001	8.25088	8.4121	8.2783

Note: it is OK if your file contains additional columns (identifiers or numeric values) in addition to the required columns.
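
Before submitting, the format described above can be checked with a short script. The following is only a sketch in Python: the function name `check_input_table` and the specific checks (at least six columns, a numeric Entrez Gene ID in the first column, numeric values in the fifth and sixth columns) are assumptions drawn from the format description, not an official validator.

```python
import csv
import io


def check_input_table(handle):
    """Return a list of problems found in a tab-delimited input table.

    Assumed checks (taken from the format description, not an official
    validator): at least 6 columns, numeric Entrez Gene ID in column 1,
    numeric statistical values in columns 5 and 6.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if len(header) < 6:
        return [f"expected at least 6 columns, found {len(header)}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):
        if not any(cell.strip() for cell in row):
            continue  # ignore blank lines
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: {len(row)} fields, header has {len(header)}")
            continue
        if not row[0].strip().isdigit():
            problems.append(
                f"line {line_no}: Entrez Gene ID {row[0]!r} is not numeric")
        for col in (4, 5):  # fifth and sixth columns (0-based indices)
            try:
                float(row[col])
            except ValueError:
                problems.append(
                    f"line {line_no}: column {col + 1} value {row[col]!r} "
                    "is not a number")
    return problems


# Header and first data row of the example table, rebuilt with explicit tabs.
header = ["Entrez ID", "Probeset ID", "Gene Name", "Gene Description",
          "log2foldchange", "t value", "q value (FDR)",
          "sample1", "sample2", "sample3"]
row = ["17218", "10572906", "Mcm5",
       "minichromosome maintenance deficient 5, cell division cycle",
       "10.2", "44.0079", "0.001", "9.13084", "9.7166", "8.76638"]
example = "\t".join(header) + "\n" + "\t".join(row) + "\n"
print(check_input_table(io.StringIO(example)))  # prints [] for this row
```

To check a real file, pass an open file handle instead, for example `open("veronique_March21_BADER.txt", newline="")`. Extra columns after the sixth are not inspected, since additional identifier or numeric columns are allowed.
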
Note:
- Each row of the table can be at the probe level or at the gene level.
  - If several rows correspond to the same gene (same Entrez Gene ID), there are 2 ways to remove the redundancy:
    - for a given gene, only the row with the most extreme t-value is kept
    - for a given gene, the average of the different normalized values is calculated before the t-test is applied
  - The choice must be made before the statistical data analyses are performed. We can discuss it during the initial meeting.
- Include all your data (even data with non-significant p-values).
- The data input requirement has been defined for gene expression data and may be different for other omics experiments. This will be discussed during the Analysis Planning meeting.
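
The two ways of removing redundant rows for the same gene can be sketched as follows. This is a minimal illustration in Python, not the service's own procedure: the function names, the default column positions (Entrez Gene ID first, t value sixth, as in the example table) and the duplicate probe values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def keep_most_extreme(rows, gene_col=0, t_col=5):
    """Option 1: per Entrez Gene ID, keep the row with the largest |t value|.

    Rows are returned in order of each gene's first appearance;
    re-rank the table afterwards if needed.
    """
    best = {}
    for row in rows:
        gene = row[gene_col]
        if gene not in best or abs(float(row[t_col])) > abs(float(best[gene][t_col])):
            best[gene] = row
    return list(best.values())


def average_per_gene(probe_values):
    """Option 2: average the normalized values of a gene's probes, sample by sample.

    `probe_values` is a list of (entrez_id, [value per sample]) pairs.
    The t-test is then applied to the averaged values.
    """
    grouped = defaultdict(list)
    for gene, values in probe_values:
        grouped[gene].append(values)
    return {gene: [mean(col) for col in zip(*probes)]
            for gene, probes in grouped.items()}


# Hypothetical rows: two probes for gene 17218, one for gene 13215.
rows = [
    ["17218", "probe_1", "Mcm5", "description", "10.2", "44.0"],
    ["17218", "probe_2", "Mcm5", "description", "-3.0", "-50.0"],
    ["13215", "probe_3", "Tk1", "description", "8.7", "39.9"],
]
print(keep_most_extreme(rows))

# Hypothetical normalized values for two samples per probe.
print(average_per_gene([("17218", [8.0, 10.0]),
                        ("17218", [10.0, 12.0]),
                        ("13215", [9.0, 9.5])]))
```

The first option works on the finished table, while the second has to run before the statistical test, which is why the choice needs to be settled before the analysis starts.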

CSCPathwayAnalysisService/Data (last edited 2012-03-16 16:28:14 by VeroniqueVoisin)