Format an Expression Matrix for GSEA

Outline

When an expression matrix is submitted to GSEA, two string attribute columns (NAME, DESCRIPTION) must precede the sample columns that hold the expression values. My recommendation is to use Entrez Gene IDs for NAME, and a DESCRIPTION composed of the gene symbol and the full gene name. This requires converting Affy probe-set IDs to Entrez Gene IDs; since several probe sets can map to the same Entrez Gene ID, there is also a redundancy-reduction problem, which I address here by keeping, for each gene, the probe set with the highest average value (other solutions can be used).
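As a minimal sketch of the redundancy-reduction idea, the toy example below keeps, for each gene, the probe set with the highest mean signal. The probe names, gene IDs and values are made up for illustration and do not come from a real annotation package.

```r
# Toy expression matrix: three hypothetical probe sets, two samples
toy.mx <- matrix (
        c (10, 12,
           50, 54,
            7,  9),
        nrow = 3, byrow = TRUE,
        dimnames = list (c ("probe_a", "probe_b", "probe_c"), c ("S1", "S2")))

# Made-up probe-set-to-gene map: probe_a and probe_b hit the same gene
toy.map <- c (probe_a = "1001", probe_b = "1001", probe_c = "2002")

# Rank probe sets by mean signal, highest first
toy.avg <- sort (rowMeans (toy.mx), decreasing = TRUE)

# Keep the first (highest-ranked) probe set seen for each gene ID
toy.eg  <- toy.map[names (toy.avg)]
toy.sel <- names (toy.avg)[!duplicated (toy.eg) & !is.na (toy.eg)]
```

Because the vector is sorted before `duplicated` is applied, the first occurrence of each gene ID is always its highest-expressed probe set.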

It is also possible to keep the Affy IDs and use the GSEA libraries to convert them to gene symbols. However, gene symbols are poorer identifiers than Entrez Gene IDs, since they get renamed over time and can be ambiguous. This approach is compatible with MSigDB.

Here we will use the simplest expression-matrix format accepted by GSEA: tab-delimited text (.txt).
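To make the target layout concrete, here is a small sketch that writes a two-gene table in that shape to a temporary file and reads the raw lines back. The gene IDs, descriptions and sample names are hypothetical placeholders.

```r
# Hypothetical two-gene table: NAME, DESCRIPTION, then one column per sample
fmt.df <- data.frame (
        NAME        = c ("1001", "2002"),
        DESCRIPTION = c ("GENE1 :: hypothetical gene 1",
                         "GENE2 :: hypothetical gene 2"),
        S1          = c (50, 7),
        S2          = c (54, 9),
        stringsAsFactors = FALSE)

# Write as tab-delimited text, with a header row and no row names or quotes
fmt.file <- tempfile (fileext = ".txt")
write.table (
        fmt.df,
        file      = fmt.file,
        col.names = TRUE,
        row.names = FALSE,
        quote     = FALSE,
        sep       = "\t")

# Inspect the raw lines: one header line, then one line per gene
fmt.lines <- readLines (fmt.file)
```

The first element of `fmt.lines` is the header (NAME, DESCRIPTION and the sample names, separated by tabs); each following line is one gene.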

Code

Notes:

expr.mx is the initial object
- rownames correspond to Affy identifiers,
- values correspond to rma signals
  - make sure they are in linear scale and not log2 scale (rma() returns log2 values, so back-transform them with 2^x); log2-scale signals typically range from about 2 to 16
this example is for the Affy array HG-U133 Plus 2.0 (hgu133plus2.db library); use other libraries where appropriate (e.g. hgu133a.db, mouse430a2.db, etc.)
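The scale check in the notes above can be sketched as follows. This is only a heuristic, not part of the procedure itself: the cutoff of 25 is an assumption, chosen because linear-scale signals usually reach far higher values than log2-scale ones, and `log2.mx` is a made-up stand-in for expr.mx.

```r
# Hypothetical log2-scale matrix standing in for expr.mx
log2.mx <- matrix (
        c (3.2, 8.5, 13.1, 4.0, 9.7, 12.4),
        nrow = 3,
        dimnames = list (c ("p1", "p2", "p3"), c ("S1", "S2")))

# Heuristic (assumed cutoff): a small maximum suggests log2-scale data
looks.log2 <- max (log2.mx, na.rm = TRUE) < 25

# Back-transform to linear scale only if the data look log2-transformed
linear.mx <- if (looks.log2) 2 ^ log2.mx else log2.mx
```

It is still worth looking at `range (expr.mx)` by eye, since a fixed cutoff can be fooled by unusual data.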

# 1) CREATE OBJECT WITH ANNOTATIONS DATA

# 1.1) Load libraries
library (hgu133plus2.db)
library (org.Hs.eg.db)

# 1.2) Affy to entrez-gene (eg)
hgu133plus2ENTREZID.chv <- unlist (as.list (hgu133plus2ENTREZID))

# 1.3) eg to full gene name
Ann_GeneName.chv        <- unlist (as.list (org.Hs.egGENENAME))

# 1.4) eg to symbol
Ann_GeneSymbol.chv      <- unlist (as.list (org.Hs.egSYMBOL))

# 2) MAP UNIQUE AFFYs TO ENTREZ-GENE
#    (picking only the probe-set with max value)

# 2.1) Sort by average expression

avg.nv <- rowMeans (expr.mx)
avg.nv <- sort (avg.nv, decreasing = TRUE)

# 2.2) Pick probe-sets that are mapped to eg
#      and have maximal average expression, 
#      if more than one probe-set is available per eg

all.eg   <- hgu133plus2ENTREZID.chv[names (avg.nv)]
sel.Affy <- names (avg.nv)[
                        !duplicated (all.eg) & 
                        !is.na (all.eg)]
sel.eg   <- hgu133plus2ENTREZID.chv[sel.Affy]

# 2.3) Create data-frame, adding also symbols and full names
#      in the description field

GSEA.1.df <- data.frame (
                NAME        = sel.eg,
                DESCRIPTION = paste (
                        Ann_GeneSymbol.chv[sel.eg], 
                        Ann_GeneName.chv[sel.eg], 
                        sep = " :: "
                        ),
                stringsAsFactors = FALSE)

# 2.4) Import rma values into the data-frame

# drop = FALSE keeps the matrix shape even when there is a single sample column
GSEA.2.df <- as.data.frame (expr.mx[sel.Affy, , drop = FALSE])

# 3) WRITE FINAL OBJECT

GSEA.df <- cbind (GSEA.1.df, GSEA.2.df)

write.table (
        GSEA.df, 
        file      = "Expr.txt", 
        col.names = TRUE,
        row.names = FALSE, 
        quote     = FALSE, 
        sep       = "\t")
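
Before loading the file into GSEA, it can help to sanity-check the table. The helper below is a hypothetical sketch (`check.gsea.df` is not part of any package): it verifies that the first two columns are NAME and DESCRIPTION, that NAME has no missing or duplicated entries, and that all remaining columns are numeric. It is shown on a made-up table; in practice it would be called on GSEA.df.

```r
# Hypothetical helper: basic structural checks on a GSEA-style data frame
check.gsea.df <- function (df) {
        stopifnot (
                identical (colnames (df)[1:2], c ("NAME", "DESCRIPTION")),
                !anyNA (df$NAME),
                !any (duplicated (df$NAME)),
                all (sapply (df[, -(1:2), drop = FALSE], is.numeric)))
        invisible (TRUE)
}

# Made-up table that passes the checks
ok.df <- data.frame (
        NAME        = c ("1001", "2002"),
        DESCRIPTION = c ("GENE1 :: hypothetical gene 1",
                         "GENE2 :: hypothetical gene 2"),
        S1          = c (50, 7),
        stringsAsFactors = FALSE)
ok.res <- check.gsea.df (ok.df)

# Made-up table with a duplicated NAME, which the helper rejects
bad.df      <- ok.df
bad.df$NAME <- c ("1001", "1001")
bad.res     <- try (check.gsea.df (bad.df), silent = TRUE)
```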