#acl All:read

= Summary of Affymetrix Microarray Data Analysis =

 1. '''Microarray Experimental Designs'''
  * Biological and technical replicates
  * Pooling (biological averaging), blocking, randomization
  * Sample size determination

 1. '''Affymetrix Microarray Data'''
  * CEL files: contain probe-level intensity values; a higher intensity indicates higher transcript abundance, i.e. a more actively expressed gene
  * CDF (chip description file) files: specify the probe and probe set to which each cell belongs
  * Terms:
   * Probe: a 25-base oligonucleotide (25-mer) used to probe RNA targets
   * Probe pair: a unit composed of a perfect match (PM) probe and its mismatch (MM) probe
   * Probe pair set: the PMs and MMs related to a common affyID. A group of probe pairs corresponds to a particular gene or a fraction of a gene, and some genes are represented by more than one probe set.
   * affyID: an identification for a probe set (which can be a gene or a fraction of a gene) represented on the array
  * Probe level data: Affy``Batch object created from CEL files using affy package function Read``Affy
  * Expression data: Expression``Set object, in which the probe-level values of each probe set are summarized into one expression measure

 1. '''Data Exploration'''
  * MA plots
   * M values are log fold changes, M=log2(T/C)=log2(T)-log2(C), where T and C are the intensities on the two arrays being compared (e.g., treatment and control)
   * A values are average log intensities of the two arrays, A=(log2(T)+log2(C))/2
  * Images, residual images, histograms, boxplots, RNA degradation plots
  * R/Bioconductor functions: pData, phenoData, exprs, pm, mm, probeNames, sampleNames, geneNames, MAplot, image, hist, boxplot, plotAffyRNAdeg, ...
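
The M and A definitions above can be checked with a small calculation. The following is a minimal sketch in plain Python, not the affy `MAplot` function; the function name `ma_values` and the intensities are hypothetical and chosen as powers of two so the logs are easy to verify by hand.

```python
import math

def ma_values(treatment, control):
    """Return one (M, A) pair per probe for two arrays of positive intensities."""
    if len(treatment) != len(control):
        raise ValueError("both arrays must contain the same number of probes")
    pairs = []
    for t, c in zip(treatment, control):
        log_t, log_c = math.log2(t), math.log2(c)
        m = log_t - log_c          # M: log fold change, log2(T/C)
        a = (log_t + log_c) / 2.0  # A: average log intensity
        pairs.append((m, a))
    return pairs

# Hypothetical intensities for three probes on two arrays.
treatment = [1024.0, 256.0, 64.0]
control = [256.0, 256.0, 128.0]
ma = ma_values(treatment, control)
for m, a in ma:
    print(f"M = {m:+.1f}, A = {a:.1f}")
```

Plotting M against A for all probes gives the MA plot; a probe with M = 0 has the same intensity on both arrays.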


 1. '''Data Preprocessing'''
  * Purpose: converting probe level data to expression values
  * Approaches: background correction, normalization, PM correction, and summarization
   a. Background correction methods:  
     * rma: robust multiarray average method (Irizarry et al. 2003)
     * mas: Affymetrix Microarray Suite background correction method (Affymetrix 2002)
     * GCRMA: a modification of RMA that uses probe sequence information to estimate nonspecific binding (Wu et al. 2004)
   a. Normalization methods:
     * quantile, contrast and loess: discussed and compared by Bolstad et al. (2003)
     * constant (scaling): the approach used by Affymetrix, usually applied after summarization
     * invariantset: used in the dChip software (Li and Wong 2001)
     * qspline: normalized by fitting splines to the quantiles (Workman et al. 2002).
   a. PM correction methods:
     * mas: subtracts an ideal mismatch from PM (Affymetrix 2002)
     * pmonly: no adjustment to the PM values.
     * subtractmm: subtract MM from PM (Affymetrix MAS 4.0  1999)
   a. Summarization methods:
     * avgdiff: the average of the PM-corrected values, i.e. the average difference (Affymetrix MAS 4.0 1999)
     * mas: Tukey biweight on log2(PM-CM), where CM denotes the adjusted MM value (the ideal mismatch of the mas PM correction) (Affymetrix MAS 5.0 2002)
     * liwong: model-based expression index (MBEI) (Li and Wong 2001), fitting one of the following multi-chip models to each probeset, where theta_i is the expression index on array i and phi_j is the sensitivity of probe j:
       * PM-MM difference model: y_ij = theta_i * phi_j + epsilon_ij, where y_ij = PM_ij - MM_ij
       * PM-only model: y_ij = mu_i + theta_i * phi_j + epsilon_ij, where y_ij = PM_ij
     * medianpolish: used in the RMA expression summary (Irizarry et al. 2003). A multichip linear model is fit robustly, by median polish, to the data from each probeset:
       * y_ij = alpha_i + beta_j + epsilon_ij, where y_ij are the background-adjusted, normalized, and log-transformed PM intensities, alpha_i is the log2-scale expression on array i, and beta_j is the effect of probe j
     * playerout: Lazaridis et al. (2002)

  * Popular methods
  || '''Methods''' || '''Background correction''' || '''Normalization''' || '''PM correction''' || '''Summarization''' ||
  || RMA || rma || quantile || pmonly || medianpolish (log2 scale)||
  || MAS5 || mas || constant || mas || mas (log2 scale)||
  || MBEI || PM only || invariantset || pmonly or subtractmm || liwong ||

  * Comparison of methods: compare the performance (e.g., power and FDR) of the methods on data for which the truth is known (Seo and Hoffman 2006)

  * R function expresso: chains the preprocessing steps together, but not every combination of methods is valid.
    * rma background correction should only be used in conjunction with the pmonly PM correction.
    * subtractmm PM correction should not be used in conjunction with the mas and medianpolish summarization methods: PM-MM is often negative, and both methods work on the log scale.
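
The medianpolish model y_ij = alpha_i + beta_j + epsilon_ij can be illustrated on a single probeset. This is a simplified sketch of Tukey's median polish in plain Python, not the code used by the affy package. It assumes rows are arrays (i) and columns are probes (j). The function name `median_polish` and the values are hypothetical, and the values are taken to be already background-corrected, normalized, and log2-transformed.

```python
from statistics import median

def median_polish(y, max_iter=10, tol=1e-8):
    """Fit y[i][j] ~ overall + row_eff[i] + col_eff[j] by alternating median sweeps."""
    n_rows, n_cols = len(y), len(y[0])
    resid = [list(row) for row in y]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the row (array) medians out of the residuals.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        # Move the median of the column effects into the overall term.
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep the column (probe) medians out of the residuals.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Move the median of the row effects into the overall term.
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Hypothetical log2 PM values: 3 arrays (rows) x 3 probes (columns).
# The value 14.0 is a deliberate outlier; without it the data are exactly additive.
y = [[7.0, 8.0, 14.0],
     [9.0, 10.0, 11.0],
     [8.0, 9.0, 10.0]]
overall, row_eff, col_eff, resid = median_polish(y)
expression = [overall + r for r in row_eff]  # one summary value per array
print(expression)
```

In this toy example the outlier ends up in the residuals instead of shifting the summary for the first array, which is the reason a median-based fit is preferred over a mean-based one here.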

 1. '''Analysis of Differentially Expressed Genes'''
  * Approaches
    * Parametric test: t-test
    * Non-parametric tests: Wilcoxon signed-rank/rank-sum tests
  * Linear models for microarray data (limma package; Smyth 2004):
    * linear models and design matrix
    * Contrasts and contrasts matrix 
    * Surrogate variable analysis (sva package): surrogate variables constructed directly from the high-dimensional data can be included in subsequent analyses to adjust for unknown, unmodeled, or latent sources of noise (Leek and Storey 2007, 2008).
  * ANOVA and MANOVA
  * Multiple testing (p-value adjustments):
    * FWER (family-wise error rate): Bonferroni
    * FDR (false discovery rate): Benjamini-Hochberg
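
The two p-value adjustments listed above can be sketched in a few lines. This is a minimal plain-Python illustration (in R the same adjustments are available through `p.adjust`); the function names and the p-values are hypothetical.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """FDR control: step-up adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        # p * m / rank, forced to be monotone by taking the running minimum.
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.5]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print(bonf)
print(bh)
```

Bonferroni is the more conservative of the two: every adjusted value it returns is at least as large as the corresponding Benjamini-Hochberg value.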

 1. '''Clustering of Differentially Expressed Genes'''
  * Annotation; Gene ontology
  * Venn diagrams; clustering; classification
  * Diagnostics

 1. '''Multiple Probesets per Gene'''
  * Probe sets that were originally designed to be unique may later turn out to represent subclusters of the same gene, so that multiple probe sets correspond to a single gene.
  * The multiple probe sets for a gene may give different results:
    * understand the reasons using the resources available on the [[http://www.affymetrix.com/analysis/index.affx | NetAffx™ Analysis Center]]
    * a case study, highlighting the need for care when assessing whether groups of probe sets all measure the same transcript (Stalteri and Harrison, 2007)
  * Choosing one representative probeset: 
    * based on mean intensity across the experiment. The probeset with the highest mean intensity is assumed to be the most reliable.
    * based on statistical significance. The probeset with the most significant fold change would be kept.
    * scoring methods, e.g. jetset (Li et al. 2011)
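
The mean-intensity rule for choosing one representative probeset can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the probeset IDs, gene names, and expression values are all hypothetical, and the function `pick_by_mean_intensity` is not part of any package.

```python
from statistics import mean

def pick_by_mean_intensity(expression, probeset_to_gene):
    """For each gene, keep the probeset with the highest mean expression."""
    best = {}  # gene -> (probeset, mean expression)
    for probeset, values in expression.items():
        gene = probeset_to_gene[probeset]
        avg = mean(values)
        if gene not in best or avg > best[gene][1]:
            best[gene] = (probeset, avg)
    return {gene: probeset for gene, (probeset, _) in best.items()}

# Hypothetical log2 expression values across three arrays.
expression = {
    "ps1_at": [7.1, 7.3, 6.9],
    "ps2_s_at": [9.0, 9.2, 8.8],
    "ps3_at": [5.5, 5.4, 5.6],
}
# Hypothetical annotation: two probesets map to GENE_A, one to GENE_B.
probeset_to_gene = {"ps1_at": "GENE_A", "ps2_s_at": "GENE_A", "ps3_at": "GENE_B"}

representative = pick_by_mean_intensity(expression, probeset_to_gene)
print(representative)
```

The significance-based rule would have the same shape, with the mean replaced by a per-probeset test statistic or p-value.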


'''References'''

[[http://www.bioconductor.org/packages/release/bioc/vignettes/affy/inst/doc/affy.pdf | Gautier et al. (2014) Description of affy]] <<BR>>
[[http://www.bioconductor.org/packages/release/bioc/vignettes/affy/inst/doc/builtinMethods.pdf | B. Bolstad (2014) Built-in Processing Methods]] <<BR>>
[[http://media.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf | Affymetrix (2002) Statistical algorithms description document.]] <<BR>>
[[http://www.biomedcentral.com/1471-2105/12/474 | Li et al. (2011) Jetset: selecting the optimal microarray probe set to represent a gene, BMC Bioinformatics.]] <<BR>>
[[http://www.pnas.org/content/early/2008/11/24/0808709105 | Leek and Storey (2008) A general framework for multiple testing dependence, PNAS]] <<BR>>
[[http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.0030161 | Leek and Storey (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis, PLoS Genetics]] <<BR>>
[[http://www.biomedcentral.com/qc/1471-2105/8/13 | Stalteri and Harrison (2007) Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips, BMC Bioinformatics.]] <<BR>>
[[http://www.biomedcentral.com/1471-2105/7/395 | Seo and Hoffman (2006) Probe set algorithms: is there a rational best bet? BMC Bioinformatics]] <<BR>>
[[http://www.ncbi.nlm.nih.gov/pubmed/16646809 | Smyth (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments, Stat Appl Genet Mol Biol.]] <<BR>>
[[http://amstat.tandfonline.com/doi/abs/10.1198/016214504000000683#.U7yutPm-30s | Wu et al. (2004) A model-based background adjustment for oligonucleotide expression arrays, JASA.]] <<BR>>
[[http://bioinformatics.oxfordjournals.org/content/19/2/185.full.pdf | Bolstad et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics.]] <<BR>>
[[http://biostatistics.oxfordjournals.org/content/4/2/249.full.pdf+html | Irizarry et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics.]] <<BR>> 
[[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC126873/pdf/gb-2002-3-9-research0048.pdf | Workman et al. (2002) A new non-linear normalization method for reducing variability in DNA microarray experiments, Genome Biol.]] <<BR>>
[[http://www.ncbi.nlm.nih.gov/pubmed/11867083 | Lazaridis et al. (2002) A simple method to improve probe set estimates from oligonucleotide arrays, Math Biosci.]] <<BR>>
[[http://www.pnas.org/content/98/1/31.long | Li and Wong (2001) Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection, PNAS.]] <<BR>>
[[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC55329/ | Li and Wong (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application, Genome Biology]] <<BR>>