#acl All:read

= Summary of Affymetrix Microarray Data Analysis =

 1. '''Microarray Experimental Designs'''
  * Biological and technical replicates
  * Pooling (biological averaging), blocking, randomization
  * Sample size determination
 1. '''Affymetrix Microarray Data'''
  * CEL files: contain probe intensity values; higher intensity indicates higher transcript abundance (more active genes)
  * CDF (chip description file) files: specify the probe and probe set to which each cell belongs
  * Terms:
   * Probe: an oligonucleotide 25 bases long, used to probe RNA targets
   * Probe pair: a unit composed of a perfect match (PM) probe and its mismatch (MM) probe
   * Probe pair set: the PMs and MMs related to a common affyID (a group of probe pairs that corresponds to a particular gene or a fraction of a gene; some genes are represented by more than one probe set)
   * affyID: an identifier for a probe set (which can be a gene or a fraction of a gene) represented on the array
   * Probe level data: Affy``Batch object created from CEL files using the affy package function Read``Affy
   * Expression data: Expression``Set object, summarizing probe set values into one expression measure
 1. '''Data Exploration'''
  * MA plots
   * M values are log fold changes, M = log2(T/C) = log2(T) - log2(C), where T and C are the intensities on the two arrays being compared
   * A values are average log intensities between two arrays, A = (log2(T) + log2(C))/2
  * Images, residual images, histograms, boxplots, RNA degradation plots
  * R/Bioconductor functions: pData, phenoData, exprs, pm, mm, probeNames, sampleNames, geneNames, MAplot, image, hist, boxplot, plotAffyRNAdeg, ...
 1. '''Data Preprocessing'''
  * Purpose: converting probe level data to expression values
  * Steps: background correction, normalization, PM correction, and summarization
   a. Background correction methods:
    * rma: robust multiarray average method (Irizarry et al. 2003)
    * mas: Affymetrix Microarray Suite background correction method (2002)
    * GCRMA: a modification of RMA that estimates nonspecific binding (Wu et al. 2004)
   a. Normalization methods:
    * quantile, contrast and loess: discussed and compared by Bolstad et al. (2003)
    * constant (scaling): the approach taken by Affymetrix, usually done after summarization
    * invariantset: used in the dChip software (Li and Wong 2001)
    * qspline: normalizes by fitting splines to the quantiles (Workman et al. 2002)
   a. PM correction methods:
    * mas: an ideal mismatch is subtracted from PM (Affymetrix 2002)
    * pmonly: no adjustment to the PM values
    * subtractmm: MM is subtracted from PM (Affymetrix MAS 4.0 1999)
   a. Summarization methods:
    * avgdiff: the average (Affymetrix MAS 4.0 1999)
    * mas: Tukey biweight on log2(PM-CM) (Affymetrix MAS 5.0 2002)
    * liwong: model-based expression index (MBEI) (Li and Wong 2001), fitting one of the following multi-chip models to each probe set, where theta_i is the expression index and phi_j the probe sensitivity:
     * y_ij = theta_i * phi_j + epsilon_ij, where y_ij = PM_ij - MM_ij
     * y_ij = mu_i + theta_i * phi_j + epsilon_ij, where y_ij = PM_ij
    * medianpolish: used in the RMA expression summary (Irizarry et al. 2003). A multichip linear model is fit to the data from each probe set:
     * y_ij = alpha_i + beta_j + epsilon_ij, where y_ij are the background-adjusted, normalized, and log-transformed PM intensities
    * playerout: Lazaridis et al. (2002)
  * Popular methods:
   || '''Methods''' || '''Background correction''' || '''Normalization''' || '''PM correction''' || '''Summarization''' ||
   || RMA || rma || quantile || pmonly || medianpolish (log2 scale) ||
   || MAS5 || mas || constant || mas || mas (log2 scale) ||
   || MBEI || PM only || invariantset || pmonly or subtractmm || liwong ||
  * Comparison of methods: compare the performance (e.g., power and FDR) of the methods using data sets where the truth is known (Seo and Hoffman, 2006)
  * R function expresso: combines the preprocessing steps in one call, but not every combination of methods is valid.
   * rma background correction should only be used in conjunction with the pmonly PM correction.
   * subtractmm PM correction should not be used in conjunction with the mas and medianpolish summarization methods, because the subtraction is likely to produce negative values, which cannot be log-transformed.
 1. '''Analysis of Differentially Expressed Genes'''
  * Approaches
   * Parametric test: t-test
   * Non-parametric tests: Wilcoxon signed-rank/rank-sum tests
   * Linear models for microarray data (limma package):
    * Linear models and design matrix
    * Contrasts and contrast matrix
   * Surrogate variable analysis (sva package): surrogate variables constructed directly from high-dimensional data can be used in subsequent analyses to adjust for unknown, unmodeled, or latent sources of noise (Leek and Storey, 2007 and 2008).
   * ANOVA and MANOVA
  * Multiple testing (p-value adjustments):
   * FWER: Bonferroni
   * FDR: Benjamini-Hochberg
 1. '''Clustering of Differentially Expressed Genes'''
  * Annotation; Gene ontology
  * Venn diagrams; clustering; classification
  * Diagnostics
 1. '''Multiple Probe Sets per Gene'''
  * The unique probe sets that were initially designed may turn out to represent subclusters, so that multiple probe sets correspond to a single gene.
  * The multiple probe sets for one gene may give different results:
   * understand the reasons using the resources available on the [[http://www.affymetrix.com/analysis/index.affx | NetAffx™ Analysis Center]]
   * a case study, highlighting the need for care when assessing whether groups of probe sets all measure the same transcript (Stalteri and Harrison, 2007)
  * Choosing one representative probe set:
   * based on mean intensity across the experiment: the probe set with the highest mean intensity is assumed to be the most accurate.
   * based on statistical significance: the probe set with the most significant fold change is kept.
   * scoring methods (Li et al. 2011)

'''References'''

[[http://www.bioconductor.org/packages/release/bioc/vignettes/affy/inst/doc/affy.pdf | Gautier et al. (2014) Description of affy]]
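
'''Worked example: M and A values'''

The M and A definitions given under Data Exploration can be checked on a toy example. The sketch below is plain Python, not the Bioconductor MAplot function; the function name `ma_values` and the intensities are made up for illustration, and it assumes strictly positive intensities on the raw (not yet log-transformed) scale.

```python
import math


def ma_values(treatment, control):
    """Return (M, A) lists for paired, strictly positive intensities.

    Hypothetical helper for illustration only; not part of any package.
    """
    if len(treatment) != len(control):
        raise ValueError("both arrays must have the same number of probes")
    m_values, a_values = [], []
    for t, c in zip(treatment, control):
        log_t, log_c = math.log2(t), math.log2(c)
        m_values.append(log_t - log_c)        # M = log2(T) - log2(C)
        a_values.append((log_t + log_c) / 2)  # A = (log2(T) + log2(C)) / 2
    return m_values, a_values


# Made-up intensities for three probes on two arrays.
m, a = ma_values([1024.0, 256.0, 64.0], [256.0, 256.0, 256.0])
print(m)  # log fold changes
print(a)  # average log intensities
```

A probe with M = 0 has the same intensity on both arrays; M = 2 corresponds to a four-fold higher intensity on the first array.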

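
'''Worked example: p-value adjustments'''

The two multiple testing adjustments listed under Analysis of Differentially Expressed Genes can be sketched in a few lines. This is a minimal plain-Python illustration with made-up p-values and hypothetical function names; in an R analysis one would normally rely on the built-in p.adjust function instead.

```python
def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR control: Benjamini-Hochberg adjusted p-values, in input order."""
    n = len(p_values)
    # Visit the p-values from largest to smallest.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # 1-based rank among the ascending p-values
        # Scale by n / rank and enforce monotonicity with a running minimum.
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four genes.
p = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p))
print(benjamini_hochberg(p))
```

Bonferroni is the more conservative of the two: every adjusted value it returns is at least as large as the corresponding Benjamini-Hochberg value.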