## page was renamed from CSCBiostatService/IlluminChipDataAnalysis
## page was renamed from CSCBiostatService/IlluminBeadchipDataAnalysis
#acl All:read

= Illumina Gene Expression Array Data Analysis using R =

 1. '''Experimental design and data'''
   * Platform: Illumina Bead``Chips
   * Design: patients, groups (markers), and chips
   * Files (txt files)
     * raw data: each gene corresponds to one row.
     * sample names and array barcodes
     * annotation file

 1. '''Data preprocessing using lumi package'''
   i. Data input: using function lumiR or lumiR.batch
   i. Preprocessing
     * using encapsulating function lumiExpresso
     * or step by step, using functions lumiB (background correction), lumiT (variance-stabilizing transform), lumiN (normalization), and lumiQ (quality control)
   i. Filtering
     * remove the undetectable (unexpressed) genes based on a detection p-value threshold given by
       a. a quantile of all p-values, e.g., the 50% quantile if half of all probes are not detectable
       a. a false positive rate, e.g., threshold = 0.10 (p-values follow a uniform distribution under the null hypothesis)
     * remove technical replicates and/or irrelevant patients
   i. Visualizing
     * using function plot with the argument what set to 'density', 'boxplot', 'MAplot', 'pair', or 'sampleRelation'. See help("plot-methods") for details.
     * boxplot and density plot of both raw and normalized intensities on log2 scale
   i. Clustering
     * Using function plotSampleRelation: estimates the sample relations from selected probes (those with a large coefficient of variation, i.e., standard deviation / mean). Two methods can be used: MDS (multidimensional scaling) or hierarchical clustering.
     Example: plot(lumi.data.object, what='sampleRelation', cv.Th = 0.10)
     * Detecting outliers: the current outlier detection is based on the distance from each sample to the center (the average of all samples after removing the 10 percent of samples farthest from the center).
     Example: temp <- detectOutlier(lumi.data.object, ifPlot=FALSE); any(temp) # FALSE means that no outlier was detected; use ifPlot=TRUE to draw the clustering plot instead of returning the logical vector.
     * Using function hclust (cluster samples using Euclidean distance)
     Example: X <- exprs(lumi.data.object); temp <- hclust(dist(t(X)), method="average"); plot(temp)
     * Using principal component analysis (PCA)
     Example: X <- exprs(lumi.data.object); temp <- prcomp(t(X), scale.=TRUE); groupColors <- rainbow(nlevels(group)) # group: a factor giving the group of each sample
       a. Clusters using two components:
       plot(temp$x[, 1:2], col=groupColors[group], pch=19, main="PCA"); legend("topright", levels(group), col=groupColors, pch=19)
       a. Clusters using three components:
       library(scatterplot3d); scatterplot3d(temp$x[, 1:3], color=groupColors[group], pch=19, main="PCA"); legend("topleft", levels(group), col=groupColors, pch=19)
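
The filtering and clustering steps above can be tried without a real chip. The following is a minimal base-R sketch on simulated data: the matrices X and detP, the factor group, and the rule "detected in at least one sample" are all assumptions made for illustration, not part of lumi. With a real LumiBatch object, the expression and detection p-value matrices would come from the object itself.

```r
# Simulated stand-ins for expression values and detection p-values
# (hypothetical data; rows are probes, columns are samples).
set.seed(1)
n.probes <- 1000
n.samples <- 6
ids <- list(paste0("probe", seq_len(n.probes)), paste0("S", seq_len(n.samples)))
X <- matrix(rnorm(n.probes * n.samples, mean = 8), n.probes, n.samples,
            dimnames = ids)
detP <- matrix(runif(n.probes * n.samples), n.probes, n.samples,
               dimnames = ids)
group <- factor(rep(c("A", "B", "C"), each = 2))

# Filtering: keep probes whose detection p-value is below the threshold
# in at least one sample (the "at least one" rule is an assumption).
p.threshold <- 0.10
detected <- rowSums(detP < p.threshold) >= 1
X.filtered <- X[detected, , drop = FALSE]

# Probe selection by coefficient of variation (standard deviation / mean).
cv <- apply(X.filtered, 1, sd) / rowMeans(X.filtered)
X.selected <- X.filtered[cv > 0.10, , drop = FALSE]

# Hierarchical clustering of samples using Euclidean distance.
hc <- hclust(dist(t(X.selected)), method = "average")

# PCA of samples, with one color per group level.
pca <- prcomp(t(X.selected), scale. = TRUE)
groupColors <- rainbow(nlevels(group))
```

Calling plot(hc) draws the dendrogram, and plot(pca$x[, 1:2], col=groupColors[group], pch=19) draws the first two principal components colored by group.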

 1. '''Statistical analysis of differential expressions using limma package'''
   i. Model design matrix generated using function model.matrix
      * define three factor variables: patient, marker (or group), and chip
      * unpaired design: design <- model.matrix(~ 0 + marker + chip)
      * paired design: when each patient (or sample) is measured twice or more, include patient as a factor so that patient-specific effects are accounted for.
        design <- model.matrix(~ 0 + marker + chip + patient)
   i. Fitting linear models
      * fit <- lmFit(X, design)
      * X: a matrix of gene expressions, each row consists of expressions of one gene.
      * For gene i, fitting a linear model: x_i= design * b_i + e_i
   i. Fitting contrasts (e.g., 3 contrasts)
      * contrasts <- c("marker3-marker1", "marker3-marker2", "marker2-marker1")
      * contrast.matrix <- makeContrasts(contrasts = contrasts, levels=design)
      * fit1 <- contrasts.fit(fit, contrast.matrix)
   i. Empirical Bayes
      * fit2 <- eBayes(fit1)
   i. Generating a top table with adjusted p-values and combining with annotations of interest
      * topfit based on F-statistic
        topfit <- topTable(fit2, number=nrow(X), adjust="BH")
      * topfit based on t-statistic for each contrast (e.g., contrast k)
        topfit <- topTable(fit2, number=nrow(X), adjust="BH", coef=k)
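      * combining with annotations and mean expressions (topTable sorts its rows by significance, so align the rows first, e.g., by calling topTable with sort.by="none")
        cbind(annotations, mean.expressions, topfit)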
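
To make the per-gene model x_i = design * b_i + e_i and the three contrasts concrete, here is a minimal base-R sketch on simulated data. It uses ordinary least squares only and is not limma itself: it does not reproduce the moderated t- and F-statistics that eBayes adds. The factor levels, the simulated matrix X, and the hand-built contrast matrix are assumptions for illustration.

```r
# Hypothetical unpaired design: 3 marker groups, 2 chips, 12 samples.
set.seed(2)
marker <- factor(rep(c("1", "2", "3"), each = 4))
chip <- factor(rep(c("a", "b"), times = 6))
design <- model.matrix(~ 0 + marker + chip)
# Columns of design: marker1, marker2, marker3, chipb

# Simulated expression matrix: one row per gene, one column per sample.
X <- matrix(rnorm(50 * 12, mean = 8), nrow = 50, ncol = 12)
X[1, marker == "3"] <- X[1, marker == "3"] + 2  # gene 1 shifted up in marker 3

# Least-squares fit for all genes at once; row i of coefs is b_i.
coefs <- t(qr.coef(qr(design), t(X)))
colnames(coefs) <- colnames(design)

# Contrast matrix written by hand: one column per contrast,
# one row per design coefficient.
contrast.matrix <- cbind(
  "marker3-marker1" = c(-1,  0, 1, 0),
  "marker3-marker2" = c( 0, -1, 1, 0),
  "marker2-marker1" = c(-1,  1, 0, 0)
)
rownames(contrast.matrix) <- colnames(design)

# Estimated contrasts for every gene (genes in rows, contrasts in columns).
contrast.estimates <- coefs %*% contrast.matrix
```

Each column of contrast.estimates is the estimated difference between two marker groups after adjusting for chip. In limma, eBayes then moderates the gene-wise variances before ranking genes.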

'''References'''

[[http://www.ncbi.nlm.nih.gov/pubmed/18467348 | Du et al. (2008) lumi: a pipeline for processing Illumina microarray, Bioinformatics.]] <<BR>>
[[http://www.bioconductor.org/packages/release/bioc/vignettes/lumi/inst/doc/lumi.pdf | Du et al. (2014) Using lumi, a package processing Illumina microarray]] <<BR>>
[[http://www.bioconductor.org/packages/release/bioc/vignettes/lumi/inst/doc/lumi_VST_evaluation.pdf | Du et al. (2014) Evaluation of VST algorithm in lumi package]] <<BR>>
[[http://www.ncbi.nlm.nih.gov/pubmed/18178591 | Lin et al. (2008) Model-based variance-stabilizing transformation for Illumina microarray data, Nucleic Acids Res.]] <<BR>>
[[http://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf | Smyth et al. (2014) limma: linear models for microarray data user’s guide]] <<BR>>
[[http://www.ncbi.nlm.nih.gov/pubmed/16646809 | Smyth (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments, Stat Appl Genet Mol Biol.]] <<BR>>