IMPORT AFFYMETRIX .CEL FILES AND CALCULATE CALLS / SIGNALS WITH R

CEL files are produced by the Affymetrix software that processes the raw microarray image. They are useful because they are the starting point for computing any downstream metric. Unlike published data tables (e.g. in NCBI GEO), they have not yet undergone normalization or other arbitrary computational treatment.

We will see how to:

  1. import CEL files into an AffyBatch object
  2. use the AffyBatch object to calculate calls (using an open-source clone of the MAS5 algorithm)
  3. use the AffyBatch object to calculate rma signals

By "sample" we mean the set of gene transcription values (technically, a vector) obtained from a single run of a given biological sample on a microarray.

1 - IMPORT CEL FILES INTO AFFYBATCH

# first of all, load the < affy > package

library (affy)

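# in case < affy > was not previously installed, you can use the following commands:
# install.packages ("BiocManager")
# BiocManager::install ("affy")
# (older Bioconductor releases used source("http://bioconductor.org/biocLite.R") followed by biocLite("affy"))
# or refer to bioconductor (http://www.bioconductor.org)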

# let's suppose we have the data in the folder
# C:\Documents\RData
# first, we set the working directory to that folder
# ("directory" is just another name for "folder"; note that R wants forward slashes in the path, even on MS Windows)

setwd ("C:/Documents/RData")

# the big nuisance is that we have to define a data table describing the phenotypic profile of the samples
# of course this is of great help if the experiment is complex
# but for now, we will assume it's a nuisance and just go for the easiest solution

# the commands are set up for three samples
#  if there are more samples, it is sufficient to change the range 1:3 in both places below
#  if more classification parameters are required, add them to the data frame as columns beside "x"
#  and extend the value of < labelDescription > to a character vector with one entry per column

x.df <- data.frame (x = c (1:3),
                    row.names = paste ("sample.", c (1:3), sep =""))

x.metaData <-
  data.frame (labelDescription = "Numbers")
                       
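# the following four commands only show alternative ways to build an AnnotatedDataFrame;
# their result is printed but not stored, so they can be skipped
new ("AnnotatedDataFrame")
new ("AnnotatedDataFrame", data = x.df)
new ("AnnotatedDataFrame", data = x.df, varMetadata = x.metaData)

as (x.df, "AnnotatedDataFrame")

# here we build the object step by step and store it in < x.phenoData >
x.phenoData <- new ("AnnotatedDataFrame")
pData (x.phenoData) <- x.df
varMetadata (x.phenoData) <- x.metaData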

# if everything went ok, this command should complete without raising an error

validObject(x.phenoData)
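
When more than one classification parameter is needed, the two data frames can simply be widened. The following is a minimal sketch in base R: the second column < group >, its levels and the names < x2.df > and < x2.metaData > are invented for illustration. The resulting data frames would then be passed to new ("AnnotatedDataFrame", ...) in the same way as < x.df > and < x.metaData >.

```r
# hypothetical example: three samples, two phenotypic variables
# (the column < group > and its levels are invented for illustration)
n.samples <- 3

x2.df <- data.frame (x = seq_len (n.samples),
                     group = factor (c ("control", "treated", "treated")),
                     row.names = paste ("sample.", seq_len (n.samples), sep = ""))

# one description per column of < x2.df >, in the same order
x2.metaData <- data.frame (labelDescription = c ("Numbers", "Treatment group"),
                           row.names = colnames (x2.df))

# sanity check: one metadata row per phenotypic variable
stopifnot (nrow (x2.metaData) == ncol (x2.df))
```

Only < n.samples > and the < group > vector need to change when the number of samples changes.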

# and now we actually read the .CEL files and store them in the appropriate R object (AffyBatch)
# replace "file1.CEL", etc. with the actual file names, one per row of the phenotypic data table

x.affy <- read.affybatch (filenames = c ("file1.CEL", "file2.CEL", "file3.CEL"), phenoData = x.phenoData)
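
Typing the file names by hand becomes error-prone when there are many samples. A minimal sketch, assuming all the .CEL files of the experiment sit in a single folder: list.files () collects them, and the same vector can be used to size the phenotypic data table. To keep the example runnable without real data, it creates empty placeholder files in a temporary folder; the names < cel.dir >, < cel.files > and < cel.df > are hypothetical, and with real data < cel.dir > would point to the actual folder.

```r
# placeholder folder with empty dummy files, for illustration only
cel.dir <- file.path (tempdir (), "cel_demo")
dir.create (cel.dir, showWarnings = FALSE)
invisible (file.create (file.path (cel.dir,
                                   c ("file1.CEL", "file2.CEL", "file3.CEL", "notes.txt"))))

# collect every file whose name ends in .CEL (case-insensitive), sorted alphabetically
cel.files <- list.files (cel.dir, pattern = "\\.CEL$", ignore.case = TRUE)

# one row per file; sample names derived from the file names
cel.df <- data.frame (x = seq_along (cel.files),
                      row.names = sub ("\\.CEL$", "", cel.files, ignore.case = TRUE))

# with real files and the affy package loaded, the vector would then be passed as
# read.affybatch (filenames = file.path (cel.dir, cel.files), phenoData = ...)
```

Deriving the row names from the file names keeps the phenotypic table and the files in the same order.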

2 - CALCULATE MAS5 CALLS

MAS5 calls make it possible to evaluate whether a gene is transcribed (present, P) or not transcribed (absent, A) in a given sample; an intermediate call (marginal, M) is also possible.
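
Each call is derived from a detection p-value computed for the probe set. The mapping from p-value to call can be sketched as follows; the thresholds 0.04 and 0.06 should correspond to the default values of the arguments alpha1 and alpha2 of mas5calls in affy, while the p-values and probe set names are made up for illustration. M stands for marginal, the intermediate call.

```r
# made-up detection p-values for five hypothetical probe sets
p.values <- c (ps1 = 0.001, ps2 = 0.039, ps3 = 0.05, ps4 = 0.06, ps5 = 0.7)

alpha1 <- 0.04   # below this threshold: present (P)
alpha2 <- 0.06   # from alpha1 up to (not including) this threshold: marginal (M)

# everything at or above alpha2 is called absent (A)
calls <- ifelse (p.values < alpha1, "P",
                 ifelse (p.values < alpha2, "M", "A"))
calls
```

This only illustrates the thresholding; the p-values themselves are computed by the package from the probe-level data.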

x.MAS5calls.ExprSet <- mas5calls.AffyBatch (x.affy)
# - the system will answer:
#   Getting probe level data...
#   Computing p-values
#   Making P/M/A Calls
# - the output is an ExpressionSet object (exprSet in old Biobase versions). To extract the calls matrix, do:

x.MAS5calls.mx <- exprs (x.MAS5calls.ExprSet) 
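
Once the calls matrix is available, it is an ordinary character matrix and can be summarized with base R. A minimal sketch on an invented toy matrix (rows are probe sets, columns are samples); the threshold of two samples and all the names are arbitrary choices for illustration.

```r
# toy calls matrix: rows = probe sets, columns = samples (invented values)
toy.calls.mx <- matrix (c ("P", "P", "A",
                           "A", "A", "A",
                           "P", "M", "P",
                           "M", "A", "P"),
                        nrow = 4, byrow = TRUE,
                        dimnames = list (paste ("probeset.", 1:4, sep = ""),
                                         paste ("sample.", 1:3, sep = "")))

# number of present calls per sample
present.per.sample <- colSums (toy.calls.mx == "P")

# probe sets called present in at least two samples
n.present <- rowSums (toy.calls.mx == "P")
expressed.ids <- rownames (toy.calls.mx)[n.present >= 2]
```

The same commands should work unchanged on < x.MAS5calls.mx >.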

3 - CALCULATE rma SIGNALS

RMA (Robust Multi-array Average) is one of the most widely used algorithms for signal calculation. Remember that rma includes a normalization step across samples; therefore it is not necessary to further normalize the data, unless rma was run separately on different groups of samples. Also remember that rma signals are returned on the log2 scale; the log2 values are convenient for checking distributions and are what many downstream methods (e.g. linear models for differential expression) expect, while operations that need linear-scale values can use the back-transformed data (2^x).
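
The normalization step built into rma is quantile normalization, which forces all samples to share the same distribution of values. The idea can be sketched in a few lines of base R; this is an illustration on a made-up matrix, not the implementation used by affy, and the function name < qn.sketch > is hypothetical.

```r
qn.sketch <- function (mx) {
  # rank of every value within its own sample (column); ties broken by position
  ranks <- apply (mx, 2, rank, ties.method = "first")
  # reference distribution: mean of the k-th smallest values across samples
  reference <- rowMeans (apply (mx, 2, sort))
  # replace every value by the reference value of the same rank
  out <- apply (ranks, 2, function (r) reference[r])
  dimnames (out) <- dimnames (mx)
  out
}

# toy signal matrix, filled column by column (rows = probes, columns = samples)
toy.mx <- matrix (c (5, 2, 3, 4,
                     4, 1, 4, 2,
                     3, 4, 6, 8),
                  nrow = 4,
                  dimnames = list (paste ("probe.", 1:4, sep = ""),
                                   paste ("sample.", 1:3, sep = "")))

toy.norm.mx <- qn.sketch (toy.mx)

# after normalization every sample contains the same set of values
apply (toy.norm.mx, 2, sort)
```

Because the reference distribution depends on all the samples in the matrix, signals computed in separate runs are not normalized with respect to each other.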

# rma, like mas5calls, outputs an ExpressionSet object (exprSet in old Biobase versions)
x.rma.ExprSet <- rma (x.affy)
x.rma.mx <- exprs (x.rma.ExprSet) 

# the signal matrix (a.k.a. expression matrix) is stored in < x.rma.mx >
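
Since the signals are on the log2 scale, moving to the linear scale is a matter of raising 2 to the power of each value. A minimal sketch on an invented toy matrix (all names and values are made up); the same commands should apply to < x.rma.mx >.

```r
# toy log2 signal matrix (invented values), rows = probe sets, columns = samples
toy.log2.mx <- matrix (c (7.0, 10.0, 12.5,
                          7.5,  9.0, 13.0),
                       nrow = 3,
                       dimnames = list (paste ("probeset.", 1:3, sep = ""),
                                        c ("sample.1", "sample.2")))

# back-transform from log2 to linear scale
toy.linear.mx <- 2 ^ toy.log2.mx

# a difference of log2 signals is the log2 of the linear fold change
log2.diff <- toy.log2.mx[, "sample.2"] - toy.log2.mx[, "sample.1"]
fold.change <- toy.linear.mx[, "sample.2"] / toy.linear.mx[, "sample.1"]

# per-sample distribution summaries on the log2 scale
apply (toy.log2.mx, 2, quantile)
```

A difference of 1 on the log2 scale corresponds to a two-fold change on the linear scale.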

DanieleMerico/HowtoDirectory/AffyCelCalSig (last edited 2009-07-07 00:48:07 by localhost)
