#acl All:read DanieleMerico:write,delete,revert

= Format an Expression Matrix for GSEA =

== Outline ==

When an expression matrix is submitted to GSEA, the sample columns with expression values must be preceded by two string attribute columns: NAME and DESCRIPTION. I recommend using Entrez-Gene IDs for NAME, and a DESCRIPTION composed of the gene symbol and the full gene name.

This requires converting the identifiers from Affy IDs to Entrez-Gene IDs. Since different Affy probe-sets can map to the same Entrez-Gene ID, there is also a redundancy reduction problem, which I address by keeping the Affy probe-set with the highest average value (other solutions can be used).

It is also possible to keep the Affy IDs and use GSEA libraries to convert them to gene symbols. However, symbols are less reliable identifiers. This is compatible with MSigDB.

Here we will use the simplest format accepted by GSEA for expression matrices, `.txt`.

== Code ==

Notes:
 * `expr.mx` is the initial object
  * ''rownames'' correspond to Affy identifiers
  * ''values'' correspond to RMA signals
   * make sure they are in linear scale and not in log2 scale; log2-scale RMA values typically range from about 2 to 16
 * this example is for the Affy array HG-U133 Plus 2.0 (`hgu133plus2.db` library); use other libraries where appropriate (e.g. `hgu133a.db`, `mouse430a2.db`, etc.)

{{{
#!rscript numbers=off
#
# Make data-frame

# Compute vector of average values (one per probe-set, across samples)
avg <- rowMeans(expr.mx)
}}}
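
The steps described in the Outline (ID conversion, redundancy reduction, writing the `.txt` file) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the original code of this page: the toy `expr.mx`, the probe-set names, the Entrez-Gene IDs, the `annot` table and the output file name are all made-up placeholders. With the real array, a table like `annot` could be obtained from the annotation library, e.g. with `AnnotationDbi::select()`; note that a probe-set can then map to more than one Entrez-Gene ID, which this sketch does not handle.

{{{
#!rscript numbers=off
# Toy stand-in for expr.mx (hypothetical values, linear scale):
# rows are Affy probe-sets, columns are samples
expr.mx <- matrix(
  c(120, 130, 110,
    900, 950, 870,
     40,  35,  50,
    300, 310, 290),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("p1_at", "p2_at", "p3_at", "p4_at"),
                  c("S1", "S2", "S3"))
)

# Hypothetical annotation table (placeholder IDs and names).
# With the real array, something like this could build it instead:
#   library(hgu133plus2.db)
#   annot <- AnnotationDbi::select(hgu133plus2.db, keys = rownames(expr.mx),
#              columns = c("ENTREZID", "SYMBOL", "GENENAME"),
#              keytype = "PROBEID")
annot <- data.frame(
  PROBEID  = c("p1_at", "p2_at", "p3_at", "p4_at"),
  ENTREZID = c("1001", "1001", "1002", NA),
  SYMBOL   = c("GENEA", "GENEA", "GENEB", NA),
  GENENAME = c("gene A full name", "gene A full name",
               "gene B full name", NA),
  stringsAsFactors = FALSE
)

# Average signal of each probe-set across samples
avg <- rowMeans(expr.mx)
annot$avg <- avg[annot$PROBEID]

# Drop probe-sets without an Entrez-Gene ID
annot <- annot[!is.na(annot$ENTREZID), ]

# Redundancy reduction: for each Entrez-Gene ID keep the probe-set
# with the highest average value
annot <- annot[order(annot$ENTREZID, -annot$avg), ]
annot <- annot[!duplicated(annot$ENTREZID), ]

# Assemble the GSEA table: NAME, DESCRIPTION, then the sample columns
gsea.df <- data.frame(
  NAME        = annot$ENTREZID,
  DESCRIPTION = paste(annot$SYMBOL, annot$GENENAME, sep = ": "),
  expr.mx[annot$PROBEID, , drop = FALSE],
  stringsAsFactors = FALSE, check.names = FALSE
)
rownames(gsea.df) <- NULL

# Write a tab-delimited .txt file (written to a temporary directory here)
out.file <- file.path(tempdir(), "expr_gsea.txt")
write.table(gsea.df, file = out.file, sep = "\t",
            quote = FALSE, row.names = FALSE)
}}}

In this toy example `p1_at` and `p2_at` map to the same placeholder gene, so only `p2_at` (the one with the higher average) is kept; `p4_at` has no Entrez-Gene ID and is dropped.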