IMPORT AFFYMETRIX .CEL FILES AND CALCULATE CALLS / SIGNALS WITH R
CEL files are the output of Affymetrix software for raw microarray image treatment. They are useful as they are the starting point to generate whatever metrics. Unlike published data tables (e.g. NCBI GEO), they have not already undergone normalization or other arbitrary computational treatments.
We will see how to:
- import CEL files into the affybatch object
- use the affybatch object to calculate calls (according to open-source clone of MAS-5 algorithm)
- use the affybatch object to calculate rma signals
By "sample" we will refer to each ensemble of gene transcription values (technically, a vector), corresponding to a single run of a certain biological sample on a microarray
1 - IMPORT CEL FILES INTO AFFYBATCH
In this more extensive version of the How-To, we will deal in details with the treatment of the information attached to the samples (phenotypic profiles, experimental conditions, etc...). In R jargon, these information dimensions are referred to as covariates, whereas their descriptions are referred to as meta-data. They are both stored in the object phenoData, which, together with a matrix of signals or other values is a slot of the exprSet object.
# first of all, load the < affy > package
library (affy)
# in case < affy > was not previously installed, you can use the following commands:
# source("http://bioconductor.org/biocLite.R")
# biocLite(affy)
# or refer to bioconductor (http://www.bioconductor.org)
# let's suppose we have the data in the folder
# C:\Documents\RData
# first, we set the working directory to that folder
# ("directory" can be regarded as another name for "folder"; as a MS-Win user, it's quite DOS-fashioned)
setwd ("C:/Documents/RData")
# and now we define a phenoData to store the phenotypic profiles 
# and all the experimental information regarding the samples
# the commands are set with six samples
# assuming that we have 3 time-points (early, middle, late), and case vs control as covariates
# the sample-names will be stored as rownames of the data.frame
# first we generate the covariate values as factors (< rep > is the vector replication command)
x.covarTime.fct  <- as.factor (  c ("early", "early", "middle", "middle", "late", "late"))
x.covarState.fct <- as.factor (rep (x = c ("cont", "case"), times = 3))
x.df <- data.frame (Time  = x.covarTime.fct,
                    State = x.covarState.fct,
                    row.names = paste (rep (x = "sample.", times = 6), 
                                       as.character (x.covarTime.fct), 
                                       as.character (x.covarState.fct), sep ="."))
x.metaData <- data.frame (labelDescription = c ("Time", 
                                                "State"))
                       
new ("AnnotatedDataFrame")
new ("AnnotatedDataFrame", data = x.df)
new ("AnnotatedDataFrame", data = x.df, varMetadata = x.metaData)
as (x.df, "AnnotatedDataFrame")
x.phenoData <- new ("AnnotatedDataFrame")
pData (x.phenoData) <- x.df
varMetadata (x.phenoData) <- x.metaData
# if everything went ok, the result of this command should be positive
validObject(x.phenoData)
# and now we actually read the .CEL files and stuff them into the appropriate R object (affybatch)
# it is necessary to put the actual filenames instead of "file1.CEL", etc...
x.affy <- read.affybatch (filename = c ("file1.CEL", "file2.CEL", "file3.CEL"), phenoData = x.phenoData)
2 - CALCULATE MAS5 CALLS
MAS5 calls enable to evaluate whether a gene is transcribed (present, P) or not transcribed (absent, A) in certain sample
x.MAS5calls.ExprSet <- mas5calls.AffyBatch (x.affy) # - system answer will be: # Getting probe level data... # Computing p-values # Making P/M/A Calls # - the outuput is an ExprSet object. To extract the calls matrix, do: x.MAS5calls.mx <- exprs (x.MAS5calls.ExprSet)
3 - CALCULATE MAS5 CALLS
rma can be regarded as the best state-of-the-art algorithm for signal calculation. Remember that rma includes a step of sample-normalization; therefore it is not necessary to further normalize data, unless the rma signals were calculated separatedly.
# rma, like MAS5calls, outputs an ExprSet object x.rma.ExprSet <- rma (x.affy) x.rma.mx <- exprs (x.rma.ExprSet) # the signal matrix (a.k.a. expression matrix) is stored by < x.rma.mx >
