IMPORT AFFYMETRIX .CEL FILES AND CALCULATE CALLS / SIGNALS WITH R

CEL files are the output of Affymetrix software for raw microarray image treatment. They are useful as they are the starting point to generate whatever metrics. Unlike published data tables (e.g. NCBI GEO), they have not already undergone normalization or other arbitrary computational treatments.

We will see how to:

  1. import CEL files into the affybatch object
  2. use the affybatch object to calculate calls (according to open-source clone of MAS-5 algorithm)
  3. use the affybatch object to calculate rma signals

By "sample" we will refer to each ensemble of gene transcription values (technically, a vector), corresponding to a single run of a certain biological sample on a microarray

1 - IMPORT CEL FILES INTO AFFYBATCH

In this more extensive version of the How-To, we will deal in details with the treatment of the information attached to the samples (phenotypic profiles, experimental conditions, etc...). In R jargon, these information dimensions are referred to as covariates, whereas their descriptions are referred to as meta-data. They are both stored in the object phenoData, which, together with a matrix of signals or other values is a slot of the exprSet object.

# first of all, load the < affy > package

library (affy)

# in case < affy > was not previously installed, you can use the following commands:
# source("http://bioconductor.org/biocLite.R")
# biocLite(affy)
# or refer to bioconductor (http://www.bioconductor.org)

# let's suppose we have the data in the folder
# C:\Documents\RData
# first, we set the working directory to that folder
# ("directory" can be regarded as another name for "folder"; as a MS-Win user, it's quite DOS-fashioned)

setwd ("C:/Documents/RData")

# and now we define a phenoData to store the phenotypic profiles 
# and all the experimental information regarding the samples

# the commands are set with six samples
# assuming that we have 3 time-points (early, middle, late), and case vs control as covariates
# the sample-names will be stored as rownames of the data.frame

# first we generate the covariate values as factors (< rep > is the vector replication command)

x.covarTime.fct  <- as.factor (  c ("early", "early", "middle", "middle", "late", "late"))
x.covarState.fct <- as.factor (rep (x = c ("cont", "case"), times = 3))

x.df <- data.frame (Time  = x.covarTime.fct,
                    State = x.covarState.fct,
                    row.names = paste (rep (x = "sample", times = 6), 
                                       as.character (x.covarTime.fct), 
                                       as.character (x.covarState.fct), 
                                       rep (x = 1, times = 6), sep ="."))

x.metaData <- data.frame (labelDescription = c ("Time", 
                                                "State"))
                       
new ("AnnotatedDataFrame")
new ("AnnotatedDataFrame", data = x.df)
new ("AnnotatedDataFrame", data = x.df, varMetadata = x.metaData)

as (x.df, "AnnotatedDataFrame")

x.phenoData <- new ("AnnotatedDataFrame")
pData (x.phenoData) <- x.df
varMetadata (x.phenoData) <- x.metaData

# if everything went ok, the result of this command should be positive:

validObject (x.phenoData)

# you can access the covariate table by the command:

pData (x.phenoData)

# and now we actually read the .CEL files and stuff them into the appropriate R object (affybatch)
# it is necessary to put the actual filenames instead of "file1.CEL", etc...

x.affy <- read.affybatch (filename = c ("file1.CEL", "file2.CEL", "file3.CEL", "file4.CEL", "file5.CEL", "file6.CEL"), phenoData = x.phenoData)

2 - CALCULATE MAS5 CALLS

MAS5 calls enable to evaluate whether a gene is transcribed (present, P) or not transcribed (absent, A) in certain sample

x.MAS5calls.ExprSet <- mas5calls.AffyBatch (x.affy)
# - system answer will be:
#   Getting probe level data...
#   Computing p-values
#   Making P/M/A Calls
# - the outuput is an ExprSet object. To extract the calls matrix, do:

x.MAS5calls.mx <- exprs (x.MAS5calls.ExprSet) 

3 - CALCULATE rma SIGNALS

rma can be regarded as the best state-of-the-art algorithm for signal calculation. Remember that rma includes a step of sample-normalization; therefore it is not necessary to further normalize data, unless the rma signals were calculated separately. Also remember that rma signals are generated as log2; usually the log2 version is useful for checking distributions, but all other operations (e.g. search for differential expression) are often carried out on linear-transformed data.

# rma, like MAS5calls, outputs an ExprSet object
x.rma.ExprSet <- rma (x.affy)
x.rma.mx <- exprs (x.rma.ExprSet) 

# the signal matrix (a.k.a. expression matrix) is stored by < x.rma.mx >

DanieleMerico/HowtoDirectory/ExprSet (last edited 2007-12-04 22:46:06 by DanieleMerico)

MoinMoin Appliance - Powered by TurnKey Linux