DanieleMerico/HowtoDirectory/PCA_ade4_dudipca - Bader Lab @ The University of Toronto

PCA (principal component analysis) is a dimensionality reduction technique that projects a multidimensional data set (e.g. a microarray expression matrix) into a new space whose dimensions are orthogonal and chosen to maximize some statistical index; in the case of PCA, the explained variance. Because the new dimensions (the PCA components) are ordered by the amount of variance they explain, it is possible to consider only a limited number of them: the first components (from the 1st to the i-th) are regarded as the most informative. The eigenvalues are used to evaluate the information associated with each component.

PCA can be regarded as more selective and noise-cleaning than clustering: if only the first i PCA components are used, some of the information present in the original data set is discarded, hopefully the less relevant part. However, since PCA does not provide groups, it is necessary either to explore the space manually (considering only 2 or 3 components at a time, in 2D or 3D plots) or to define some computational criterion to group the data. If the data are particularly rich, many PCA components may be necessary to account for all relevant features. To evaluate the information content of a single component, the corresponding eigenvalue is used; to evaluate the cumulative fraction of information explained by several components, their eigenvalues are summed and divided by the total sum of eigenvalues.
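
The eigenvalue arithmetic described above can be sketched in a few lines of R. The eigenvalues below are made-up illustrative values, not results from any data set, and the 0.8 threshold is an arbitrary assumption; with a result of dudi.pca, the vector stored in < $eig > would take the place of < eig >.

```r
# Hypothetical eigenvalues in decreasing order (illustrative values only)
eig <- c (4, 2, 1, 0.5, 0.25, 0.25)

# Fraction of the total variance explained by each single component
explained <- eig / sum (eig)

# Cumulative fraction explained by the first i components
cum.explained <- cumsum (eig) / sum (eig)

# Smallest number of components whose cumulative fraction reaches
# an arbitrary threshold (0.8 is an assumption, not a recommendation)
n.keep <- which (cum.explained >= 0.8)[1]

print (round (cum.explained, 3))
print (n.keep)
```

With these values the first three components are enough to reach the threshold; a different threshold, or a flatter eigenvalue profile, would of course change that number.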

In transcriptomics, PCA is usually used to project the samples from the N-transcript space into a new space; the spatial relations between the samples in the new space should agree with the experimental design.
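
As a minimal sketch of this idea, the following simulates a small expression matrix (genes in rows, samples in columns) with two hypothetical groups and projects the samples with base R's prcomp rather than ade4. All names (< expr >, < groups >, < treated >) and the simulated effect sizes are assumptions made for illustration only.

```r
set.seed (1)

# Simulated design: 4 control and 4 treated samples (hypothetical)
groups <- rep (c ("control", "treated"), each = 4)
n.genes <- 200

# Pure-noise expression matrix: genes in rows, samples in columns
expr <- matrix (rnorm (n.genes * length (groups)), nrow = n.genes)
colnames (expr) <- paste0 (groups, rep (1:4, times = 2))

# Add a group effect: the first 50 genes are shifted upwards in treated samples
treated <- groups == "treated"
expr[1:50, treated] <- expr[1:50, treated] + 3

# prcomp expects observations in rows, so transpose to samples x genes
pca <- prcomp (t (expr), center = TRUE, scale. = FALSE)

# Sample coordinates on the first two components
scores <- pca$x[, 1:2]
print (round (scores, 2))

# Fraction of the total variance explained by each component
print (round (pca$sdev^2 / sum (pca$sdev^2), 3))
```

If the projection agrees with the design, the control and treated samples should fall on opposite sides of PC 1; a call such as plot (scores, col = ifelse (treated, "red", "black")) would show this graphically.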

# install the package < ade4 > from the R menu (Packages / Install package(s))
# or with install.packages ("ade4")

library (ade4)

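# compute PCA
# < x.df > is the input data frame (assumed to be already loaded);
# scannf = FALSE skips the interactive eigenvalue plot, nf is the number of components to keep

x.pca <- dudi.pca (df = x.df, scannf = FALSE, nf = ncol (x.df))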

# plot the eigenvalues, which are stored as a vector in the < $eig > element of < x.pca >
# the magnitude of each eigenvalue can be regarded as a measure of the variance
# explained by the corresponding component

barplot (height = x.pca$eig, 
          col = "aquamarine4", border = "aquamarine4" , space = 0,
          xlab = "PCA - eigenvalues")

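# open a new graphics window and plot the first two components;
# < $co > holds the coordinates of the columns of < x.df >

dev.new ()
plot (x = x.pca$co[, 1], 
      y = x.pca$co[, 2],
      xlab = "PC 1",
      ylab = "PC 2",
      main = "PCA - PC 1 vs PC 2")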

# often, in microarray data, the first PCA component 
# is associated with an overwhelming eigenvalue
# but only explains noise (that is, not biologically relevant variation);
# this could be the case whenever the sample patterns

# in such cases, it is better to take a look at the other components...

dev.new ()
barplot (height = x.pca$eig[-1], 
          col = "aquamarine4", border = "aquamarine4" , space = 0,
          xlab = "PCA - eigenvalues (first component excluded)")
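
The situation described in the comments above can be reproduced on simulated data. This is only a sketch under stated assumptions: the "noise" is modelled as a hypothetical sample-wide intensity offset unrelated to the design, the PCA is computed with base R's prcomp instead of dudi.pca, and every name and effect size is invented for illustration.

```r
set.seed (42)

# Hypothetical design: 4 control (0) and 4 treated (1) samples
group <- rep (c (0, 1), each = 4)

# Hypothetical technical offset of each sample, balanced across the two groups
offset <- rep (c (-3, -1, 1, 3), times = 2)

# Noise matrix: 300 genes in rows, 8 samples in columns
n.genes <- 300
expr <- matrix (rnorm (n.genes * 8), nrow = n.genes)

# The offset shifts every gene of a sample by the same amount
expr <- sweep (expr, 2, offset, "+")

# Group effect: 50 genes go up and 50 genes go down in treated samples
expr[1:50, group == 1] <- expr[1:50, group == 1] + 3
expr[51:100, group == 1] <- expr[51:100, group == 1] - 3

# prcomp expects samples in rows
pca <- prcomp (t (expr), center = TRUE, scale. = FALSE)
eig <- pca$sdev^2

# Fraction of variance per component: the first eigenvalue dominates
print (round (eig / sum (eig), 3))

# Correlation of PC 1 with the technical offset and of PC 2 with the design
print (round (cor (pca$x[, 1], offset), 2))
print (round (cor (pca$x[, 2], group), 2))
```

In this simulation the dominant first component tracks the technical offset, while the separation between control and treated samples only appears on the second component, which is why inspecting the remaining eigenvalues and components can be worthwhile.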