Format an Expression Matrix for GSEA
When an expression matrix is submitted to GSEA, two string attribute columns (NAME, DESCRIPTION) must precede the sample columns holding the expression values. My recommendation is to use Entrez-Gene IDs as identifiers for NAME, and a DESCRIPTION composed of the gene symbol and the full gene name. This requires converting the Affy IDs to Entrez-Gene IDs; since different Affy probe-sets can map to the same Entrez-Gene ID, there is also a redundancy reduction problem, which I address here by keeping, for each gene, the Affy probe-set with the highest average value (other solutions can be used).
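The redundancy reduction described above can be sketched on a toy example. This is only an illustration: the probe-set names, Entrez-Gene IDs and expression values below are made up, and toy.mx and toy.map are hypothetical stand-ins for the real expression matrix and annotation vector.

```r
# Toy expression matrix: 4 probe-sets x 3 samples (all values are made up).
toy.mx <- matrix(
  c(10, 12, 11,
    50, 55, 52,
    30, 28, 29,
     5,  6,  7),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("p1_at", "p2_at", "p3_at", "p4_at"),
                  c("S1", "S2", "S3")))

# Hypothetical probe-set -> Entrez-Gene mapping; p4_at is unmapped (NA).
toy.map <- c(p1_at = "1001", p2_at = "1001", p3_at = "2002", p4_at = NA)

# Rank probe-sets by average expression, highest first.
avg <- sort(rowMeans(toy.mx), decreasing = TRUE)

# Walk down the ranking and keep the first probe-set seen for each gene,
# dropping probe-sets without an Entrez-Gene ID.
eg   <- toy.map[names(avg)]
keep <- names(avg)[!duplicated(eg) & !is.na(eg)]
keep
```

Here p1_at and p2_at both map to gene 1001, and only p2_at survives because its average is higher; p4_at is dropped because it has no mapping.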
It is also possible to keep the Affy IDs and use the GSEA libraries to convert them to gene symbols. However, symbols make poorer identifiers than Entrez-Gene IDs. That approach is compatible with MSig-DB.
Also, here we will use the simplest format accepted by GSEA for expression matrices, .txt.
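As an illustration of this layout, the sketch below writes a tiny tab-delimited table with the NAME and DESCRIPTION columns followed by two sample columns, then prints the raw lines of the file. The identifiers, descriptions, sample names and values are all made up, and the file is written to a temporary location.

```r
# Toy table in the layout described above (identifiers and values are made up).
toy.df <- data.frame(
  NAME        = c("1001", "2002"),
  DESCRIPTION = c("SYMA :: gene A full name", "SYMB :: gene B full name"),
  S1 = c(50, 30),
  S2 = c(55, 28),
  stringsAsFactors = FALSE)

# Write it tab-delimited, with a header row and without row names or quoting.
out.file <- tempfile(fileext = ".txt")
write.table(toy.df, file = out.file, sep = "\t",
            col.names = TRUE, row.names = FALSE, quote = FALSE)

# Show the raw lines of the file: one header line, then one line per gene.
file.lines <- readLines(out.file)
writeLines(file.lines)
```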
expr.mx is the initial object
rownames correspond to Affy identifiers,
values correspond to rma signals
- make sure they are in linear scale and not log2 scale; log2-scale RMA values typically fall between about 2 and 16, whereas linear-scale values reach into the thousands
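this example is for the Affy array HG-U133 Plus 2.0 (hgu133plus2.db library); use other libraries where appropriate (e.g. hgu133a.db, mouse430a2.db, etc.), together with the matching organism library (e.g. org.Mm.eg.db instead of org.Hs.eg.db for mouse arrays)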
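A quick way to check the scale of the values before going further is sketched below. It assumes that log2-scale intensities do not exceed about 16, so a larger maximum suggests linear values; looks.log2 is a hypothetical helper, the cutoff is an assumption, and the matrix is made-up toy data.

```r
# looks.log2() is a hypothetical helper: it guesses that a matrix is on the
# log2 scale when its largest value does not exceed the cutoff (an assumption).
looks.log2 <- function(mx, cutoff = 16) {
  max(mx, na.rm = TRUE) <= cutoff
}

# Toy log2-scale matrix (made-up values).
toy.log2 <- matrix(c(3.2, 7.5, 12.1, 4.8), nrow = 2,
                   dimnames = list(c("p1_at", "p2_at"), c("S1", "S2")))

# Back-transform to linear scale only when the values look log2.
toy.linear <- if (looks.log2(toy.log2)) 2^toy.log2 else toy.log2
range(toy.linear)
```

The heuristic can be fooled by unusual data, so it is worth looking at the range of the real matrix by eye as well.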
# 1) CREATE OBJECT WITH ANNOTATIONS DATA

# 1.1) Load libraries
library (hgu133plus2.db)
library (org.Hs.eg.db)

# 1.2) Affy to entrez-gene (eg)
hgu133plus2ENTREZID.chv <- unlist (as.list (hgu133plus2ENTREZID))

# 1.3) eg to full gene name
Ann_GeneName.chv <- unlist (as.list (org.Hs.egGENENAME))

# 1.4) eg to symbol
Ann_GeneSymbol.chv <- unlist (as.list (org.Hs.egSYMBOL))

# 2) MAP UNIQUE AFFYs TO ENTREZ-GENE
# (picking only the probe-set with max value)

# 2.1) Sort by average expression
avg.nv <- apply (expr.mx, 1, mean)
avg.nv <- sort (avg.nv, decreasing = TRUE)

# 2.2) Pick probe-sets that are mapped to eg
# and have maximal average expression,
# if more than one probe-set is available per eg
all.eg <- hgu133plus2ENTREZID.chv[names (avg.nv)]
sel.Affy <- names (avg.nv)[ !duplicated (all.eg) & !is.na (all.eg)]
sel.eg <- hgu133plus2ENTREZID.chv[sel.Affy]

# 2.3) Create data-frame, adding also symbols and full names
# in the description field
GSEA.1.df <- data.frame (
  NAME = sel.eg,
  DESCRIPTION = paste (
    Ann_GeneSymbol.chv[sel.eg],
    Ann_GeneName.chv[sel.eg],
    sep = " :: "
  ),
  stringsAsFactors = FALSE)

# 2.4) Import rma values into the data-frame
GSEA.2.df <- as.data.frame (expr.mx[sel.Affy, ])

# 3) WRITE FINAL OBJECT
GSEA.df <- cbind (GSEA.1.df, GSEA.2.df)
write.table (
  GSEA.df,
  file = "Expr.txt",
  col.names = TRUE,
  row.names = FALSE,
  quote = FALSE,
  sep = "\t")
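Before writing the file, it can be worth checking that the final table has the expected shape. The sketch below is one possible sanity check, not part of GSEA or Bioconductor: check.gsea.table is a hypothetical helper, and the two toy data frames contain made-up values. It checks that the first two columns are NAME and DESCRIPTION, that NAME has no missing or repeated identifiers, and that all remaining columns are numeric.

```r
# check.gsea.table() is a hypothetical helper; it stops with an error
# as soon as one of the conditions below does not hold.
check.gsea.table <- function(df) {
  stopifnot(
    identical(colnames(df)[1:2], c("NAME", "DESCRIPTION")),
    !anyNA(df$NAME),
    !anyDuplicated(df$NAME),
    all(vapply(df[-(1:2)], is.numeric, logical(1))))
  invisible(TRUE)
}

# Toy table that passes the check (made-up values).
toy.ok <- data.frame(
  NAME = c("1001", "2002"),
  DESCRIPTION = c("SYMA :: gene A", "SYMB :: gene B"),
  S1 = c(50, 30),
  stringsAsFactors = FALSE)
check.gsea.table(toy.ok)

# Toy table with a repeated identifier; the check fails on it.
toy.bad <- toy.ok
toy.bad$NAME <- c("1001", "1001")
bad.result <- try(check.gsea.table(toy.bad), silent = TRUE)
inherits(bad.result, "try-error")
```

On the real data, the same call would be check.gsea.table (GSEA.df) just before write.table.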