Differences between revisions 1 and 22 (spanning 21 versions)

Summary of Affymetrix Microarray Data Analysis

Microarray Experimental Designs
- Biological and technical replicates
- Pooling (biological averaging), blocking, randomization
- Sample size determination
Affymetrix Microarray Data
- CEL files: contain probe-level intensity values; a higher intensity indicates higher transcript abundance (a more active gene)
- CDF (chip description file) files: specify the probe and probe set to which each cell belongs
- Terms:
  - Probe: an oligonucleotide 25 bases long (a 25-base sequence) used to probe RNA targets
  - Probe pair: a unit composed of a perfect match (PM) probe and its mismatch (MM) probe
  - Probe pair set: the PMs and MMs related to a common affyID. A group of probe pairs corresponds to a particular gene or a fraction of a gene; some genes are represented by more than one probe set.
  - affyID: an identifier for a probe set (which can represent a gene or a fraction of a gene) on the array
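- AffyBatch object: created by reading CEL files with the affy package function ReadAffy
- Expression set: an AffyBatch object is converted to an ExpressionSet by a preprocessing function such as rma, mas5, or expresso (affy package); the Biobase package function exprs then extracts the matrix of expression values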
Data Exploration
- MA plots
  - M values are log fold changes, M=log2(T/C)=log2(T)-log2(C)
  - A values are average log intensities between two arrays, A=(log2(T)+log2(C))/2
- Images, residual images
- Histograms, boxplots
- RNA degradation plots
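The M and A definitions given for MA plots can be checked with a few lines of code. The sketch below is plain Python, not an affy call; the function name `ma_values` and the intensities are made up for illustration.

```python
import math

def ma_values(treatment, control):
    """Return (M, A) lists for paired intensities from two arrays."""
    m_vals, a_vals = [], []
    for t, c in zip(treatment, control):
        log_t, log_c = math.log2(t), math.log2(c)
        m_vals.append(log_t - log_c)          # M = log2(T) - log2(C)
        a_vals.append((log_t + log_c) / 2.0)  # A = (log2(T) + log2(C)) / 2
    return m_vals, a_vals

# Hypothetical intensities for three probes on a treatment (T) and control (C) array.
T = [800.0, 200.0, 64.0]
C = [200.0, 200.0, 256.0]
M, A = ma_values(T, C)
```

A probe with equal intensity on both arrays has M = 0, so points far from the horizontal line M = 0 are the candidates for differential expression.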

Data Preprocessing

Approaches: background correction, normalization, PM correction, and summarization
1. Background correction methods:
  - rma: robust multiarray average method (Irizarry et al. 2003)
  - mas: Affymetrix Microarray Suite background correction method (2002)
  - GCRMA: modified RMA to estimate nonspecific binding (Wu et al. 2004)
2. Normalization methods:
  - quantile, contrast and loess: discussed and compared by Bolstad et al. (2003)
  - constant (scaling): used by Affymetrix, usually applied after summarization
  - invariantset: used in the dChip software (Li and Wong 2001)
  - qspline: normalized by fitting splines to the quantiles (Workman et al. 2002).
3. PM correction methods:
  - mas: an ideal mismatch subtracted from PM (Affymetrix 2002)
  - pmonly: no adjustment to the PM values.
  - subtractmm: subtract MM from PM (Affymetrix MAS 4.0 1999)
4. Summarization methods:
  - avgdiff: the average of the PM-corrected probe intensities (average difference, Affymetrix MAS 4.0 1999)
  - mas: Tukey biweight on log2(PM-CM) (Affymetrix MAS 5.0 2002)
  - liwong: model-based expression index (MBEI) (Li and Wong 2001), fitting the following multi-chip model to each probeset:
    - y_ij = theta_i * phi_j + epsilon_ij, where y_ij = PM_ij - MM_ij
    - y_ij = mu_i + theta_i * phi_j + epsilon_ij, where y_ij = PM_ij
  - medianpolish: used in the RMA expression summary (Irizarry et al. 2003). A multichip linear model is fit to data from each probeset
    - y_ij = alpha_i + beta_j + epsilon_ij, where y_ij are the background-adjusted, normalized, and log-transformed PM intensities
  - playerout: Lazaridis et al. (2002)
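The additive model behind medianpolish can be fitted by alternately sweeping out row and column medians. The sketch below is a plain-Python version of Tukey's median polish, not the affy implementation. It assumes rows are probes and columns are arrays (the outline does not say which index is which), and the names `median_polish`, `max_iter` and `tol` are illustrative.

```python
from statistics import median

def median_polish(y, max_iter=10, tol=1e-8):
    """Fit y[i][j] ~ overall + row_eff[i] + col_eff[j] by median polish."""
    n_rows, n_cols = len(y), len(y[0])
    resid = [list(row) for row in y]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the median of each row out of the residuals.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            for j in range(n_cols):
                resid[i][j] -= row_med[i]
        # Move the median of the column effects into the overall term.
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep the median of each column out of the residuals.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Move the median of the row effects into the overall term.
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop once a full sweep no longer changes anything.
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Hypothetical log2 PM values: 3 probes (rows) x 3 arrays (columns).
log_pm = [[8.0, 10.0, 9.0],
          [9.0, 11.0, 10.0],
          [10.0, 12.0, 11.0]]
overall, probe_eff, array_eff, resid = median_polish(log_pm)
# Under the row/column assumption above, one summary value per array:
expression = [overall + c for c in array_eff]
```

Medians are used instead of means so that a single outlying probe has little influence on the per-array summary.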

Popular methods

Methods	Background correction	Normalization	PM correction	Summarization
RMA	rma	quantile	pmonly	medianpolish (log2 scale)
MAS5	mas	constant	mas	mas (log2 scale)
MBEI	PM only	invariantset	pmonly or subtractmm	liwong
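The quantile normalization listed for RMA forces every array to share the same intensity distribution. The sketch below shows the idea in plain Python; it is not the affy implementation. Ties are broken by position rather than averaged, which is a simplification, and `quantile_normalize` is an illustrative name.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length intensity lists, one list per array."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays)
                 for k in range(n)]
    normalized = []
    for a in arrays:
        # Positions of this array's values in ascending order of intensity.
        order = sorted(range(n), key=lambda idx: a[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by the reference quantile
        normalized.append(out)
    return normalized

# Two hypothetical arrays with three probes each.
raw = [[2.0, 4.0, 6.0], [30.0, 10.0, 20.0]]
norm = quantile_normalize(raw)
```

After normalization each array contains the same set of values; only the order within an array, which reflects the ranking of its probes, is kept from the raw data.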

Comparison of methods: compare the performance (e.g., power and FDR) of the methods on data sets where the truth is known (Seo and Hoffman, 2006)
R function expresso: combines the preprocessing steps in a single call, but not all methods can be combined.
- rma background correction should only be used in conjunction with the pmonly PM correction.
- subtractmm PM correction should not be used in conjunction with the mas and medianpolish summarization methods, because PM - MM is likely to be negative for some probes and both summaries work on the log scale.
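The two combination rules can be written down as a small check. This is a hypothetical helper in plain Python, not part of affy; the function name and its three arguments are assumptions that only mirror the method names used in this summary.

```python
def check_combination(bgcorrect, pmcorrect, summary):
    """Return a list of problems for a chosen combination of methods."""
    problems = []
    # Rule 1: rma background correction goes with pmonly only.
    if bgcorrect == "rma" and pmcorrect != "pmonly":
        problems.append("rma background correction requires pmonly PM correction")
    # Rule 2: subtractmm can yield negative values, which log-scale summaries cannot use.
    if pmcorrect == "subtractmm" and summary in ("mas", "medianpolish"):
        problems.append("subtractmm should not be combined with " + summary)
    return problems

# The RMA row of the table passes; a mixed choice breaks both rules.
ok = check_combination("rma", "pmonly", "medianpolish")
bad = check_combination("rma", "subtractmm", "medianpolish")
```

An empty list means the combination does not violate either rule stated here; it says nothing about other restrictions the software may impose.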

Analysis of Differentially Expressed Genes
- Approaches
  - Parametric test: t-test
  - Non-parametric tests: Wilcoxon signed-rank/rank-sum tests
- Linear models for microarray data (limma package):
  - Linear models and design matrix
  - Contrasts and contrast matrix
- ANOVA and MANOVA
- Multiple testing (p-value adjustments):
  - FWER: Bonferroni
  - FDR: Benjamini-Hochberg
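The two p-value adjustments can be sketched in a few lines. The code below is plain Python written for illustration (in R the same adjustments are available through p.adjust); the function names and the example p-values are made up.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """FDR control: step-up adjustment p * m / rank, made monotone."""
    m = len(pvalues)
    # Visit p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
p = [0.01, 0.04, 0.03, 0.005]
p_bonf = bonferroni(p)
p_bh = benjamini_hochberg(p)
```

The Benjamini-Hochberg values are never larger than the Bonferroni values, which is why FDR control usually keeps more genes at the same cutoff.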
Clustering of Differentially Expressed Genes
- Annotation; Gene ontology
- Venn diagrams; clustering; classification
- Diagnostics
Multiple Probesets per Gene
- Probe sets that were initially designed to be unique may turn out to represent subclusters, so that multiple probe sets correspond to a single gene.
- The multiple probe sets of a gene may give different results:
  - understand the reasons using the resources available on the NetAffx™ Analysis Center
  - see the case study by Stalteri and Harrison (2007), which highlights the need for care when assessing whether groups of probe sets all measure the same transcript
- Choosing one representative probeset:
  - based on mean intensity across the experiment: the probeset with the highest intensity is assumed to be the most accurate.
  - based on statistical significance: the probeset with the most significant fold change is kept.
  - scoring methods (Li et al. 2011)
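The mean-intensity rule for choosing a representative probeset can be sketched as follows. This is plain Python for illustration only; the function name, the probeset identifiers and the gene names are hypothetical, and the mapping from probesets to genes is assumed to come from an annotation source.

```python
from statistics import mean

def pick_by_mean_intensity(expression, probeset_to_gene):
    """Keep, for each gene, the probeset with the highest mean expression.

    expression: dict mapping probeset ID to a list of values, one per array.
    probeset_to_gene: dict mapping probeset ID to gene name.
    """
    best = {}  # gene -> (probeset, mean intensity)
    for probeset, values in expression.items():
        gene = probeset_to_gene[probeset]
        avg = mean(values)
        if gene not in best or avg > best[gene][1]:
            best[gene] = (probeset, avg)
    return {gene: probeset for gene, (probeset, _) in best.items()}

# Hypothetical log2 expression values on three arrays.
expr = {"ps_1": [6.1, 6.3, 5.9], "ps_2": [9.0, 9.4, 9.2], "ps_3": [7.5, 7.7, 7.4]}
gene_of = {"ps_1": "GENE_A", "ps_2": "GENE_A", "ps_3": "GENE_B"}
chosen = pick_by_mean_intensity(expr, gene_of)
```

The significance-based rule has the same shape: replace the mean by the test statistic or p-value of each probeset and keep the most significant one.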

References

Gautier et al. (2014) Description of affy
B. Bolstad (2014) Built-in Processing Methods
Affymetrix, Statistical algorithms description document, 2002.
Li et al. (2011) Jetset: selecting the optimal microarray probe set to represent a gene, BMC Bioinformatics.
Stalteri and Harrison (2007) Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips, BMC Bioinformatics.
Seo and Hoffman (2006) Probe set algorithms: is there a rational best bet? BMC Bioinformatics
Smyth (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments, Stat Appl Genet Mol Biol.
Wu et al. (2004) A model-based background adjustment for oligonucleotide expression arrays, JASA.
Bolstad et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics.
Irizarry et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics.
Workman et al. (2002) A new non-linear normalization method for reducing variability in DNA microarray experiments, Genome Biol.
Lazaridis et al. (2002) A simple method to improve probe set estimates from oligonucleotide arrays, Math Biosci.
Li and Wong (2001) Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection, PNAS.
Li and Wong (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application, Genome Biology

- Revision 1 as of 2014-07-07 22:19:20
  Size: 1217
  Editor: ChangjiangXu
  Comment:
+ Revision 22 as of 2014-07-15 18:37:56
  Size: 8110
  Editor: ChangjiangXu
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-Affymetrix Microarray Data Analysis - Concepts, Methods, and R functions
Affymetrix Microarray Data
CEL files: contain processed intensity values, higher intensity (transcript abundance) more active genes
CDF (chip description file) files: specify the probe and probe set to which each cell belongs
Terms:
    Probe: oligonucleotides of 25 base (pair) length used to probe RNA targets (25 base sequence)
    probe pair: a unit composed of a perfect match (PM) and its mismatch (MM)
    Probe pair set: PMs and MMs related to a common affyID (a group of probe pairs corresponds to a particular gene or a fraction of a gene. Some genes are represented by more than one probe set.)
    affyID: an identification for a probe set (which can be a gene or a fraction of a gene) represented on the array
Microarray Experimental Designs
Biological and Technical Replicates
Pooling, Blocking, Sample size determination
the RNAs average out when pooled (biological averaging)
Data Exploration
MA plots:
M values are log fold changes, M=log2(T/C)=log2(T)-log2(C)
A values are average log intensities between two arrays, A=(log2(T)+log2(C))/2
Histograms
Images, residual images
Boxplots
RNA degradation plots
+#acl All:read

= Summary of Affymetrix Microarray Data Analysis =

 1. '''Microarray Experimental Designs'''
  * Biological and technical replicates
  * Pooling (biological averaging), blocking, randomization
  * Sample size determination

 1. '''Affymetrix Microarray Data'''
  * CEL files: contain probe-level intensity values; a higher intensity indicates higher transcript abundance (a more active gene)
  * CDF (chip description file) files: specify the probe and probe set to which each cell belongs
  * Terms:
   * Probe: an oligonucleotide 25 bases long (a 25-base sequence) used to probe RNA targets
   * Probe pair: a unit composed of a perfect match (PM) probe and its mismatch (MM) probe
   * Probe pair set: the PMs and MMs related to a common affyID. A group of probe pairs corresponds to a particular gene or a fraction of a gene; some genes are represented by more than one probe set.
   * affyID: an identifier for a probe set (which can represent a gene or a fraction of a gene) on the array
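  * AffyBatch object: created by reading CEL files with the affy package function Read``Affy
  * Expression set: an AffyBatch object is converted to an Expression``Set by a preprocessing function such as rma, mas5, or expresso (affy package); the Biobase package function exprs then extracts the matrix of expression values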

 1. '''Data Exploration'''
  * MA plots
   * M values are log fold changes, M=log2(T/C)=log2(T)-log2(C)
   * A values are average log intensities between two arrays, A=(log2(T)+log2(C))/2
  * Images, residual images
  * Histograms, boxplots
  * RNA degradation plots

 1. '''Data Preprocessing'''
  * Approaches: background correction, normalization, PM correction, and summarization
   a. Background correction methods:  
     * rma: robust multiarray average method (Irizarry et al. 2003)
     * mas: Affymetrix Microarray Suite background correction method (2002)
     * GCRMA: modified RMA to estimate nonspecific binding (Wu et al. 2004)
   a. Normalization methods:
     * quantile, contrast and loess: discussed and compared by Bolstad et al. (2003)
     * constant (scaling): used by Affymetrix, usually applied after summarization
     * invariantset: used in the dChip software (Li and Wong 2001)
     * qspline: normalized by fitting splines to the quantiles (Workman et al. 2002).
   a. PM correction methods:
     * mas: an ideal mismatch subtracted from PM (Affymetrix 2002)
     * pmonly: no adjustment to the PM values.
     * subtractmm: subtract MM from PM (Affymetrix MAS 4.0  1999)
   a. Summarization methods:
     * avgdiff: the average of the PM-corrected probe intensities (average difference, Affymetrix MAS 4.0 1999)
     * mas: Tukey biweight on log2(PM-CM) (Affymetrix MAS 5.0  2002)
     * liwong: model-based expression index (MBEI) (Li and Wong 2001), fitting the following multi-chip model to each probeset:
       * y_ij = theta_i * phi_j + epsilon_ij, where y_ij = PM_ij - MM_ij
       * y_ij = mu_i + theta_i * phi_j + epsilon_ij, where y_ij = PM_ij 
     * medianpolish: used in the RMA expression summary (Irizarry et al. 2003). A multichip linear model is fit to data from each probeset
       * y_ij = alpha_i + beta_j + epsilon_ij, where y_ij are the background-adjusted, normalized, and log-transformed PM intensities
     * playerout: Lazaridis et al. (2002)

  * Popular methods
  || '''Methods''' || '''Background correction''' || '''Normalization''' || '''PM correction''' || '''Summarization''' ||
  || RMA || rma || quantile || pmonly || medianpolish (log2 scale)||
  || MAS5 || mas || constant || mas || mas (log2 scale)||
  || MBEI || PM only || invariantset || pmonly or subtractmm || liwong ||

  * Comparison of methods: compare the performance (e.g., power and FDR) of the methods on data sets where the truth is known (Seo and Hoffman, 2006)

  * R function expresso: combines the preprocessing steps in a single call, but not all methods can be combined.
    * rma background correction should only be used in conjunction with the pmonly PM correction.
    * subtractmm PM correction should not be used in conjunction with the mas and medianpolish summarization methods, because PM - MM is likely to be negative for some probes and both summaries work on the log scale.

 1. '''Analysis of Differentially Expressed Genes'''
  * Approaches
    * Parametric test: t-test
    * Non-parametric tests: Wilcoxon signed-rank/rank-sum tests
  * Linear models for microarray data (limma package):
    * Linear models and design matrix
    * Contrasts and contrast matrix
  * ANOVA and MANOVA
  * Multiple testing (p-value adjustments):
    * FWER: Bonferroni
    * FDR: Benjamini-Hochberg

 1. '''Clustering of Differentially Expressed Genes'''
  * Annotation; Gene ontology
  * Venn diagrams; clustering; classification
  * Diagnostics

 1. '''Multiple Probesets per Gene'''
  * Probe sets that were initially designed to be unique may turn out to represent subclusters, so that multiple probe sets correspond to a single gene.
  * The multiple probe sets of a gene may give different results:
    * understand the reasons using the resources available on the [[http://www.affymetrix.com/analysis/index.affx | NetAffx™ Analysis Center]]
    * see the case study by Stalteri and Harrison (2007), which highlights the need for care when assessing whether groups of probe sets all measure the same transcript
  * Choosing one representative probeset:
    * based on mean intensity across the experiment: the probeset with the highest intensity is assumed to be the most accurate.
    * based on statistical significance: the probeset with the most significant fold change is kept.
    * scoring methods (Li et al. 2011)


'''References'''

[[http://www.bioconductor.org/packages/release/bioc/vignettes/affy/inst/doc/affy.pdf | Gautier et al. (2014) Description of affy]] <<BR>>
[[http://www.bioconductor.org/packages/release/bioc/vignettes/affy/inst/doc/builtinMethods.pdf | B. Bolstad (2014) Built-in Processing Methods]] <<BR>>
[[http://media.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf | Affymetrix, Statistical algorithms description document, 2002.]] <<BR>>
[[http://www.biomedcentral.com/1471-2105/12/474 | Li et al. (2011) Jetset: selecting the optimal microarray probe set to represent a gene, BMC Bioinformatics.]] <<BR>>
[[http://www.biomedcentral.com/qc/1471-2105/8/13 | Stalteri and Harrison (2007) Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips, BMC Bioinformatics.]] <<BR>>
[[http://www.biomedcentral.com/1471-2105/7/395 | Seo and Hoffman (2006) Probe set algorithms: is there a rational best bet? BMC Bioinformatics]] <<BR>>
[[http://www.ncbi.nlm.nih.gov/pubmed/16646809 | Smyth (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments, Stat Appl Genet Mol Biol.]] <<BR>>
[[http://amstat.tandfonline.com/doi/abs/10.1198/016214504000000683#.U7yutPm-30s | Wu et al. (2004) A model-based background adjustment for oligonucleotide expression arrays, JASA.]] <<BR>>
[[http://bioinformatics.oxfordjournals.org/content/19/2/185.full.pdf | Bolstad et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics.]] <<BR>>
[[http://biostatistics.oxfordjournals.org/content/4/2/249.full.pdf+html | Irizarry et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics.]] <<BR>> 
[[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC126873/pdf/gb-2002-3-9-research0048.pdf | Workman et al. (2002) A new non-linear normalization method for reducing variability in DNA microarray experiments, Genome Biol.]] <<BR>>
[[http://www.ncbi.nlm.nih.gov/pubmed/11867083 | Lazaridis et al. (2002) A simple method to improve probe set estimates from oligonucleotide arrays, Math Biosci.]] <<BR>>
[[http://www.pnas.org/content/98/1/31.long | Li and Wong (2001) Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection, PNAS.]] <<BR>>
[[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC55329/ | Li and Wong (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application, Genome Biology]] <<BR>>