GUIDELINES FOR DATA INPUT REQUIREMENTS FOR PATHWAY AND NETWORK ANALYSIS
Your data should have been statistically analyzed
Data should have been normalized.
PLEASE PROVIDE ANY QUALITY CONTROL PLOTS THAT YOU HAVE PRODUCED, ESPECIALLY PCA AND CLUSTERING PLOTS
Box-plot of intensity (before and after normalization)
- Viewing the distribution of probe intensities across all arrays at once can reveal, for example, that one array differs from the others. Normalization corrects heterogeneity between arrays, so the distributions should look more homogeneous after normalization.
Principal Component Analysis (2D-PCA)
- PCA is recommended as an exploratory tool to uncover unknown trends in the data. When applied to samples, PCA helps you explore correlations between samples.
Unsupervised hierarchical clustering of samples and genes (performed on whole data)
- Clustering is a useful exploratory technique for gene expression data. It groups genes and samples that have similar expression patterns.
- If possible, please provide a PowerPoint presentation with one figure for each analysis.
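As a rough numeric companion to the box plots described above, the following sketch computes the five values a box plot draws (minimum, quartiles, maximum) for each array, so that an array whose distribution differs from the others stands out. It is only an illustration: the function names and the intensity values are hypothetical, and the inclusive method of `statistics.quantiles` is just one of several quartile conventions.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the statistics a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

def summarize_arrays(intensities):
    """Map each array (sample) name to its five-number summary."""
    return {name: five_number_summary(vals) for name, vals in intensities.items()}

# Hypothetical log2 intensities for three arrays; sample3 is shifted upward.
example = {
    "sample1": [7.9, 8.4, 9.1, 9.7, 10.2],
    "sample2": [8.0, 8.5, 9.0, 9.6, 10.1],
    "sample3": [10.9, 11.4, 12.1, 12.7, 13.2],
}

for name, summary in summarize_arrays(example).items():
    print(name, summary)
```

Running the same summary on the values before and after normalization gives a quick check that the arrays have become more comparable; it does not replace the plots themselves.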
A statistical test appropriate to your hypothesis (your biological question) should have been performed, for example a moderated t-test, a paired t-test, or ANOVA.
PLEASE PROVIDE ANY REPORT ACCOMPANYING THE STATISTICAL ANALYSIS THAT DESCRIBES HOW THE STATISTICAL ANALYSIS HAS BEEN DONE.
If you need support for your statistical analyses, please contact our BIOSTATISTICS SERVICE.
Dr. ChangJiang Xu (changjiang.xu@utoronto.ca) offers free consultation for statistical analyses. Your data will be analyzed and output in the correct format for subsequent pathway and network analyses. You are encouraged to contact ChangJiang or Veronique as soon as you plan your experiment: genomics technologies can be very sensitive to noise, and a well-designed experiment is essential for the best results (randomization of the samples, a balanced design, and standardized protocols to reduce potential noise).
PLEASE PROVIDE A TAB-DELIMITED TEXT FILE (.txt) CONTAINING THE DATA FOR THE PATHWAY AND NETWORK ANALYSIS (or alternatively a .csv file):
- Name your file as follows: yourname_date_PIname_treated_vs_control_comparison.txt (example: veronique_March21_BADER_treated_vs_control.txt)
- Please rename your file with a new date if you resubmit your file
- Please follow the format description:
the first column corresponds to ENTREZ GENE ID.
- An Entrez Gene ID is a numerical value that uniquely identifies genes.
For example the Entrez Gene ID for Myc (myelocytomatosis oncogene [ Mus musculus ]) is 17869: http://www.ncbi.nlm.nih.gov/gene/17869.
You can convert many types of gene identifiers and symbols to Entrez Gene ID using Synergizer or other similar tools.
the second column corresponds to a UNIQUE ARRAY IDENTIFIER (PROBESET ID for Affymetrix and PROBE ID for Illumina).
the third column corresponds to GENE NAME (official gene symbol).
the fourth column corresponds to the GENE DESCRIPTION (full gene name).
the fifth column corresponds to the log2 FOLD CHANGE.
the sixth and seventh columns contain the STATISTICAL VALUES:
- the statistical values are those that indicate whether a gene is significantly differentially expressed, for example the t value and the p-value if you applied a t-test.
- the whole table is ranked by one statistical value, preferably the t value.
the additional columns contain the transformed (log2, for example) and normalized (RMA or quantile normalization, for example) values for each sample (each chip, for gene expression data): RAW NORMALIZED DATA.
- Please provide a sample label description file.
- ! Include all your data (even data with non-significant p-values).
- Please provide the origin of the annotation.
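Before submitting, the column layout described above can be checked with a few lines of code. The sketch below is a minimal example, assuming a header line and the column order given in the format description; the function name `check_table` and its messages are hypothetical and are not part of any tool used by the facility.

```python
import csv
import io

MIN_COLUMNS = 7              # five descriptive columns plus two statistical values
NUMERIC_COLUMNS = (4, 5, 6)  # 0-based: log2 fold change and the two statistics

def check_table(handle):
    """Return a list of format problems found in a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if len(header) < MIN_COLUMNS:
        return [f"expected at least {MIN_COLUMNS} columns, found {len(header)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: {len(row)} fields, header has {len(header)}")
            continue
        # The first column must hold a numeric Entrez Gene ID.
        if not row[0].strip().isdigit():
            problems.append(f"line {line_no}: Entrez Gene ID '{row[0]}' is not numeric")
        for col in NUMERIC_COLUMNS:
            try:
                float(row[col])
            except ValueError:
                problems.append(f"line {line_no}: '{row[col]}' in column {col + 1} is not a number")
    return problems

header = ["Entrez ID", "Probeset ID", "Gene Name", "Gene Description",
          "log2foldchange", "t value", "q value (FDR)", "sample1", "sample2", "sample3"]
row = ["17218", "10572906", "Mcm5",
       "minichromosome maintenance deficient 5, cell division cycle",
       "10.2", "44.0079", "0.001", "9.13084", "9.7166", "8.76638"]
text = "\t".join(header) + "\n" + "\t".join(row) + "\n"
print(check_table(io.StringIO(text)))  # an empty list means no problem was found
```

To check a real file, pass an open file handle instead of the in-memory text, for example `check_table(open("yourfile.txt", newline=""))`.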
DATA INPUT EXAMPLE:
Entrez ID | Probeset ID | Gene Name | Gene Description | log2foldchange | t value | q value (FDR) | sample1 | sample2 | sample3 | ...
17218 | 10572906 | Mcm5 | minichromosome maintenance deficient 5, cell division cycle | 10.2 | 44.0079 | 0.001 | 9.13084 | 9.7166 | 8.76638 | ...
27279 | 10448307 | Tnfrsf12a | tumor necrosis factor receptor superfamily, member 12a | -9.8 | -41.815 | 0.001 | 8.58977 | 9.29698 | 8.80844 | ...
13215 | 10582809 | Tk1 | thymidine kinase 1 | 8.7 | 39.9456 | 0.001 | 8.94519 | 9.56513 | 8.38612 | ...
12937 | 10384145 | H2afv | H2A histone family, member V | -7.4 | -33.6475 | 0.001 | 10.574 | 10.7741 | 10.5401 | ...
207277 | 10526848 |  |  | 7 | 33.3352 | 0.001 | 8.25088 | 8.4121 | 8.2783 | ...
- Note: it is OK if your file contains additional columns (identifiers or numeric values) besides the required columns.
Note:
- Each row of the table can be at the probe level or at the gene level.
- If several rows correspond to the same gene (same EntrezGeneID), there are two options for removing the redundancy:
- for a given gene, only the row with the most extreme t value is kept
- for a given gene, the average or the median of the different normalized values is calculated before the t-test is applied
- the choice must be made before the statistical analyses are performed. We can discuss it during the initial meeting.
- the data input requirement has been defined for gene expression data and may be different for other omics experiments. This will be discussed during the Analysis Planning meeting.
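The two options for removing redundant rows mentioned in the notes above can be sketched as follows. This is a minimal illustration, not the facility's procedure: the dictionary keys (`entrez_id`, `probe_id`, `t_value`, `samples`), the probe names, and the numbers are hypothetical, and the second function shows the average only (the median would use `statistics.median` instead).

```python
import statistics

def keep_most_extreme(rows):
    """Option 1: per Entrez Gene ID, keep the row with the largest |t value|."""
    best = {}
    for row in rows:
        gene = row["entrez_id"]
        if gene not in best or abs(row["t_value"]) > abs(best[gene]["t_value"]):
            best[gene] = row
    return list(best.values())

def average_per_gene(rows):
    """Option 2: per Entrez Gene ID, average the normalized values sample by sample."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["entrez_id"], []).append(row["samples"])
    # zip(*values) pairs up the values of the same sample across probes.
    return {gene: [statistics.mean(col) for col in zip(*values)]
            for gene, values in grouped.items()}

# Hypothetical probe-level rows: two probes map to Entrez Gene ID 17218.
probe_rows = [
    {"entrez_id": 17218, "probe_id": "probe_A", "t_value": 44.0, "samples": [9.0, 9.5]},
    {"entrez_id": 17218, "probe_id": "probe_B", "t_value": -12.5, "samples": [8.0, 8.5]},
    {"entrez_id": 13215, "probe_id": "probe_C", "t_value": 39.9, "samples": [8.9, 9.6]},
]

print([r["probe_id"] for r in keep_most_extreme(probe_rows)])
print(average_per_gene(probe_rows))
```

With the second option, the gene-level values returned here would be the input to the statistical test, since the averaging happens before the t-test is applied.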