Format an Expression Matrix for GSEA

Outline

When an expression matrix is submitted to GSEA, two string attribute columns (NAME, DESCRIPTION) must precede the sample columns that hold the expression values. My recommendation is to use Entrez-Gene IDs as identifiers for NAME, and a DESCRIPTION composed of the gene symbol and the gene full name. This requires converting the Affy probe-set IDs to Entrez-Gene IDs; since different probe sets can map to the same Entrez-Gene ID, there is also a redundancy reduction problem, which I address by keeping, for each gene, the Affy probe set with the highest average value (other solutions can be used).
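The redundancy reduction step described above can be sketched in base R as follows. This is only an illustration on toy data: the probe-set IDs, the Entrez-Gene IDs, and the mapping vector `affy2eg` are all made up, and in practice the mapping would come from an annotation source for the specific array. The sketch also assumes that `expr.mx` is a numeric matrix with probe-set IDs as row names, and that probe sets without an Entrez-Gene ID are simply dropped.

```r
# Toy expression matrix: rows are Affy probe sets, columns are samples.
# All IDs and values are invented for illustration.
expr.mx <- matrix(
  c(5.0, 6.0,
    7.0, 9.0,
    3.0, 4.0,
    2.0, 2.5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ps1_at", "ps2_at", "ps3_at", "ps4_at"),
                  c("S1", "S2"))
)

# Hypothetical probe-set-to-Entrez-Gene mapping (named character vector).
# ps1_at and ps2_at map to the same gene; ps4_at has no mapping.
affy2eg <- c(ps1_at = "101", ps2_at = "101", ps3_at = "102", ps4_at = NA)

# Drop probe sets without an Entrez-Gene ID (an assumption of this sketch).
mapped <- !is.na(affy2eg[rownames(expr.mx)])
expr.mapped <- expr.mx[mapped, , drop = FALSE]
eg <- unname(affy2eg[rownames(expr.mapped)])

# Average value of each probe set across samples.
avg <- rowMeans(expr.mapped)

# For each Entrez-Gene ID, keep the row index of the probe set
# with the highest average value.
best <- tapply(seq_along(avg), eg, function(i) i[which.max(avg[i])])

# Resulting matrix has one row per Entrez-Gene ID.
expr.eg <- expr.mapped[best, , drop = FALSE]
rownames(expr.eg) <- names(best)
```

With ties, `which.max` keeps the first probe set encountered; a different tie-breaking rule may be preferable for real data.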

It is also possible to keep the Affy IDs and use the GSEA libraries to convert them to gene symbols; this approach is compatible with MSigDB. However, symbols are poorer identifiers than Entrez-Gene IDs.

Here we use the simplest expression-matrix format accepted by GSEA: tab-delimited text (.txt).
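As a rough sketch of what writing such a file might look like in R, the example below builds the NAME and DESCRIPTION columns and writes a tab-delimited .txt file. The matrix `expr.eg`, the annotation vectors `symbol` and `fullname`, the ": " separator in DESCRIPTION, and the output file name are all assumptions made for illustration, not part of the original code.

```r
# Toy gene-level matrix: rows are Entrez-Gene IDs, columns are samples.
# IDs, annotations, and values are invented for illustration.
expr.eg <- matrix(
  c(7, 9,
    3, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("101", "102"), c("S1", "S2"))
)

# Hypothetical annotation vectors, named by Entrez-Gene ID.
symbol   <- c("101" = "GENEA", "102" = "GENEB")
fullname <- c("101" = "hypothetical gene A", "102" = "hypothetical gene B")

# Make data-frame: NAME and DESCRIPTION first, then the sample columns.
ids <- rownames(expr.eg)
gsea.df <- data.frame(
  NAME        = ids,
  DESCRIPTION = paste(symbol[ids], fullname[ids], sep = ": "),
  expr.eg,
  stringsAsFactors = FALSE,
  check.names = FALSE   # keep sample names exactly as they are
)

# Write a tab-delimited file with a header line and no row names or quotes.
out.file <- file.path(tempdir(), "expression_gsea.txt")
write.table(gsea.df, file = out.file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)
```

The file is written to a temporary directory here only so that the sketch does not overwrite anything; any path can be used instead.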

Code

Notes:

#

# Make data-frame

# Compute vector of average values
avg <- rowMeans(expr.mx)  # mean of each probe set across samples; assumes expr.mx is a numeric matrix

DanieleMerico/Code/Affy2GSEA (last edited 2010-03-31 16:49:50 by DanieleMerico)
