PCA
PCA is a dimensionality reduction technique that projects a multidimensional data-set (e.g. a microarray expression matrix) into a new space whose dimensions are orthogonal and maximize some statistical index; in the case of PCA, the index is the explained variance. Since the new space is "optimized" in this sense, it is possible to consider only a limited number of dimensions (i.e. PCA components). As a consequence, the first PCA components (from the 1st to the i-th) are regarded as the most informative. The eigenvalues are used to evaluate the information associated with each component.
PCA can be regarded as more selective and noise-cleaning than clustering: if only the first i PCA components are used, some of the information present in the original data-set is discarded, hopefully the less relevant part. However, since PCA does not provide groups, it is necessary either to explore the space manually (considering only 2 or 3 components at a time, in 2D or 3D plots) or to define some computational criterion to group the data. If the data are particularly rich, many PCA components may be necessary to account for all relevant features. The information content of a single component is evaluated by its eigenvalue; the cumulative fraction of information explained by several components is the sum of their eigenvalues divided by the sum of all eigenvalues.
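The eigenvalue bookkeeping described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the data are simulated, the names eig.nv, frac.nv, cum.nv and n.keep are made up for the example, the 80% threshold is arbitrary, and base R's prcomp() stands in for the PCA routine so that no extra package is needed.

```r
# Simulated stand-in for an expression matrix:
# 200 transcripts (rows) x 8 samples (columns); not real data
set.seed(1)
x.df <- as.data.frame(matrix(rnorm(200 * 8), nrow = 200, ncol = 8))

# prcomp() is an assumption made to keep the sketch dependency-free;
# its squared standard deviations are the eigenvalues of the PCA
eig.nv <- prcomp(x.df, center = TRUE, scale. = TRUE)$sdev^2

# fraction of the total variance explained by each single component
frac.nv <- eig.nv / sum(eig.nv)

# cumulative fraction explained by the first i components
cum.nv <- cumsum(eig.nv) / sum(eig.nv)

# smallest number of components reaching the (arbitrary) 80% threshold
n.keep <- which(cum.nv >= 0.80)[1]

print(round(frac.nv, 3))
print(round(cum.nv, 3))
print(n.keep)
```

The same two lines (division by the total and cumsum) apply unchanged to any vector of eigenvalues, whichever PCA routine produced it.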
In transcriptomics, PCA is usually used to project the samples from the N-transcript space into a new space; the spatial relations between the samples in the new space should agree with the experimental design.
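As a rough illustration of this idea, the sketch below projects simulated samples and compares their positions with a hypothetical case/control design. Everything in it is an assumption made for the example: the data, the design factor group.f, the size of the simulated effect, and the use of base R's prcomp() (which expects the samples in rows, hence the transpose).

```r
set.seed(2)
n.tr <- 300                                   # number of transcripts
# hypothetical design: 5 controls followed by 5 cases
group.f <- factor(rep(c("control", "case"), each = 5))

# simulated expression matrix: transcripts in rows, samples in columns
x.m <- matrix(rnorm(n.tr * 10), nrow = n.tr,
              dimnames = list(NULL, paste0("s", 1:10)))
# add a simulated group effect to the first 50 transcripts of the cases
x.m[1:50, group.f == "case"] <- x.m[1:50, group.f == "case"] + 3

# samples must be the rows for prcomp(), so the matrix is transposed
pca <- prcomp(t(x.m), center = TRUE, scale. = FALSE)
scores.m <- pca$x[, 1:2]                      # sample coordinates on PC 1 and PC 2

# mean position of each group along PC 1; well separated means
# suggest that PC 1 agrees with the design
pc1.means <- tapply(scores.m[, 1], group.f, mean)
print(pc1.means)

# a visual check could be done with:
# plot(scores.m, col = as.integer(group.f), pch = 19,
#      xlab = "PC 1", ylab = "PC 2")
```

The sign of a component is arbitrary, so only the separation between the groups matters, not which group falls on the positive side.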
    # install the package < ade4 > from R packages/install packages
    library (ade4)

    # compute PCA, keeping all the components
    x.pca <- dudi.pca (df = x.df, scannf = FALSE, nf = ncol (x.df))

    # plot the eigenvalues, which are stored in a vector in the < $eig > slot of < x.pca >
    # the amount of each eigenvalue can be regarded as a measure of the explained variance
    barplot (height = x.pca$eig, col = "aquamarine4", border = "aquamarine4",
             space = 0, xlab = "PCA - eigenvalues")

    # open a new graphics device (dev.new() is the portable equivalent of x11())
    dev.new ()

    # plot the column coordinates (< $co > slot) on the first two components
    plot (x = x.pca$co[, 1], y = x.pca$co[, 2],
          xlab = "PC 1", ylab = "PC 2", main = "PCA biplot")
The attached function ("dudi-pca-plot-01.R") generates the bi-plots and the eigenvalue plots.
The inputs are:
- dudi pca object (dudi.pca)
- data-set name (title.ch)
- components to be printed (cp.nv); the eigenvalue plot starts from the first of these components
- point color (dot.col)
f.PCA.Dudi.plot (dudi.pca = x.pca, title.ch = "X", cp.nv = c (1, 2))
In microarray data, the first PCA component is often associated with an overwhelming eigenvalue yet explains only noisy variability (that is, variability that is not biologically relevant). This is likely to be the case whenever the sample patterns displayed by the 1st component do not match any of the patterns expected from the experimental design (e.g. separation of cases and controls in a dichotomous design). In such cases, it is better to consider only the second component onwards. An independent check can be performed by applying other dimensionality reduction methods (e.g. Correspondence Analysis, Multidimensional Scaling) and comparing the patterns obtained.
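One possible way to screen for this situation is sketched below on simulated data. All of it is assumed for illustration: a balanced dichotomous design (group.nv), an artificial nuisance factor (batch.nv) that is made stronger than the biological signal, and base R's prcomp() as the PCA routine. The correlation between the sample scores and the design is used here as a crude matching criterion; it is not a method prescribed by the text.

```r
set.seed(3)
n.tr <- 300                                    # number of transcripts
group.nv <- rep(c(0, 1), each = 6)             # hypothetical design: 6 controls, 6 cases
batch.nv <- rep(c(0, 1), times = 6)            # hypothetical nuisance factor, balanced across groups

# simulated expression matrix: transcripts in rows, 12 samples in columns
x.m <- matrix(rnorm(n.tr * 12), nrow = n.tr)
# strong nuisance effect on transcripts 51..300
x.m[51:n.tr, batch.nv == 1] <- x.m[51:n.tr, batch.nv == 1] + 2
# weaker biological effect on transcripts 1..50
x.m[1:50, group.nv == 1] <- x.m[1:50, group.nv == 1] + 3

# PCA with samples as rows
pca <- prcomp(t(x.m), center = TRUE, scale. = FALSE)
eig.nv <- pca$sdev^2

# absolute correlation of the first two components with the design
r.pc1 <- abs(cor(pca$x[, 1], group.nv))
r.pc2 <- abs(cor(pca$x[, 2], group.nv))
print(c(PC1 = r.pc1, PC2 = r.pc2))

# if PC 1 does not follow the design, consider the 2nd component onwards
# and re-express the eigenvalues as fractions of the remaining variance
eig.rest.nv <- eig.nv[-1]
frac.rest.nv <- eig.rest.nv / sum(eig.rest.nv)
print(round(frac.rest.nv[1:3], 3))
```

In this simulation the largest eigenvalue belongs to the nuisance factor, while the design shows up on the second component; with real data the mismatch would have to be judged against whatever patterns the experimental design predicts.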