IMPORT AFFYMETRIX .CEL FILES AND CALCULATE CALLS / SIGNALS WITH R
CEL files are the output of Affymetrix software for raw microarray image treatment. They are useful as they are the starting point to generate whatever metrics. Unlike published data tables (e.g. NCBI GEO), they have not already undergone normalization or other arbitrary computational treatments.
We will see how to:
- import CEL files into the affybatch object
- use the affybatch object to calculate calls (according to open-source clone of MAS-5 algorithm)
- use the affybatch object to calculate rma signals
By "sample" we will refer to each ensemble of gene transcription values (technically, a vector), corresponding to a single run of a certain biological sample on a microarray
1 - IMPORT CEL FILES INTO AFFYBATCH
In this more extensive version of the How-To, we will deal in details with the treatment of the information attached to the samples (phenotypic profiles, experimental conditions, etc...). In R jargon, these information dimensions are referred to as covariates, whereas their descriptions are referred to as meta-data. They are both stored in the object phenoData, which, together with a matrix of signals or other values is a slot of the exprSet object.
# first of all, load the < affy > package
library (affy)
# in case 'affy' was not previously installed,
# do the following to install the package:
# (or refer to http://www.bioconductor.org)
# source("http://bioconductor.org/biocLite.R")
# biocLite(affy)
# let's suppose we have the data in the folder
# 'RData' that has the following path:
# 'C:\Documents\RData'
# first of all, we set the working directory to that folder
# ("directory" is just another name for "folder")
setwd ("C:/Documents/RData")
# and now we define a phenoData to store the phenotypes / conditions
# and all the experimental information regarding the samples
# the commands are set with six samples
# assuming that we have the following conditions (covariates)
# . 3 time-points (early, middle, late),
# . case vs control as covariates
# the sample-names will be stored as rownames of the data.frame
# first we generate the covariate values as factors (< rep > is the vector replication command)
x.covarTime.fct <- as.factor ( c ("early", "early", "middle", "middle", "late", "late"))
x.covarState.fct <- as.factor (rep (x = c ("cont", "case"), times = 3))
x.df <- data.frame (Time = x.covarTime.fct,
State = x.covarState.fct,
row.names = paste (rep (x = "sample", times = 6),
as.character (x.covarTime.fct),
as.character (x.covarState.fct),
rep (x = 1, times = 6), sep ="."))
x.metaData <- data.frame (labelDescription = c ("Time",
"State"))
x.phenoData <- new (
"AnnotatedDataFrame",
data = x.df,
varMetadata = x.metaData
)
# if everything went ok, the result of this command should be positive:
validObject (x.phenoData)
# you can access the covariate table by the command:
pData (x.phenoData)
# and now we actually read the .CEL files and stuff them into the appropriate R object (affybatch)
# it is necessary to put the actual filenames instead of "file1.CEL", etc...
x.affy <- read.affybatch (filename = c ("file1.CEL", "file2.CEL", "file3.CEL", "file4.CEL", "file5.CEL", "file6.CEL"), phenoData = x.phenoData)
2 - CALCULATE MAS5 CALLS
MAS5 calls enable to evaluate whether a gene is transcribed (present, P) or not transcribed (absent, A) in certain sample
x.MAS5calls.ExprSet <- mas5calls.AffyBatch (x.affy) # - system answer will be: # Getting probe level data... # Computing p-values # Making P/M/A Calls # - the outuput is an ExprSet object. To extract the calls matrix, do: x.MAS5calls.mx <- exprs (x.MAS5calls.ExprSet)
3 - CALCULATE rma SIGNALS
rma can be regarded as the best state-of-the-art algorithm for signal calculation. Remember that rma includes a step of sample-normalization; therefore it is not necessary to further normalize data, unless the rma signals were calculated separately. Also remember that rma signals are generated as log2; usually the log2 version is useful for checking distributions, but all other operations (e.g. search for differential expression) are often carried out on linear-transformed data.
# rma, like MAS5calls, outputs an ExprSet object x.rma.ExprSet <- rma (x.affy) x.rma.mx <- exprs (x.rma.ExprSet) # the signal matrix (a.k.a. expression matrix) is stored by 'x.rma.mx'
4 - CONVERT FROM PROBE-SET ID TO OTHER ID
Depending on your Affymetrix platform, you can use different Bioconductor packages.
For instance, for the most recent human platform, HG-U133 Plus 2.0, use library hgu133plus2.db
library (hgu133plus2.db) # Assuming your signals are stored in x.mx and its ID in the rownames, # we transform the matrix into a data.frame to allow for additional columns # with other identifiers and description x.df <- as.data.frame (x.mx) x.df$AffyID <- rownames (x.mx) x.df$GeneID <- unlist (as.list (hgu133plus2ENTREZID))[x.df$AffyID] x.df$GeneSymbol <- unlist (as.list (hgu133plus2GENENAME))[x.df$AffyID] x.df$GeneName <- unlist (as.list (hgu133plus2SYMBOL))[x.df$AffyID]
Mind that Entrez-Gene identifier will not be unique, as different Affymetrix probe-sets can map to the same Entrez-Gene ID. Although there is no elegant solution to solve this redundancy problems, two empirical strategies are usually used:
- pick the probe-set with highest average intensity
- pick the prbe-set with highest significance for differentiality (depends on how you defined differentiality in your design)
Removal of redundancy is required for gene-set enrichment methods like GSEA, although GSEA includes a functionality to do that automatically.