Correspondence Analysis
I provide here a concise summary of the technique; see refs. [2]
and [3] for a thorough exposition. An informal, intuitive
description will be given below. The aim is to embed both rows (genes)
and columns (hybridizations) of a matrix in the same space, the first
two or three coordinates of which contain most of the information.
Let I genes and J hybridizations be collected into the
To reduce dimensionality, only the first two or three coordinates
of the new space are plotted. The loss of information associated with
this dimension reduction is quantified in terms of the proportion
of the so-called total inertia
The above summary is intended to provide all the information needed to implement a simple CA algorithm. This can easily be done with nested for loops. A much shorter implementation without loops is possible in any programming language that supports matrix multiplication and provides a routine for singular value decomposition, e.g. MATLAB (Appendix B).
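As an illustration of such a loop-free implementation, the following is a minimal NumPy sketch and not the MATLAB code of Appendix B. It assumes the conventional CA formulation (correspondence matrix, row and column masses, singular value decomposition of the standardized residuals); the function name `correspondence_analysis` and all variable names are hypothetical, and every row and column of N is assumed to have a positive sum.

```python
import numpy as np

def correspondence_analysis(N):
    """Simple CA of a nonnegative I x J matrix N (rows: genes, columns: hybridizations).

    Returns principal coordinates of rows (F) and columns (G), the singular
    values, and the proportion of total inertia carried by each axis.
    Assumes every row and column sum of N is positive.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    expected = np.outer(r, c)             # independence model r_i * c_j
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, lam, Vt = np.linalg.svd(S, full_matrices=False)
    F = U * lam / np.sqrt(r)[:, None]     # principal coordinates of genes
    G = Vt.T * lam / np.sqrt(c)[:, None]  # principal coordinates of hybridizations
    inertia = lam ** 2                    # principal inertias; their sum is the total inertia
    explained = inertia / inertia.sum()   # share of total inertia per axis
    return F, G, lam, explained
```

Plotting the first two or three columns of `F` and `G` in one diagram gives the reduced embedding; `explained[:2].sum()` then quantifies how much of the total inertia a two-dimensional plot retains. The last axis returned by this sketch always has (numerically) zero inertia and can be ignored.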
Standard coordinates as an aid in visualization
Correspondence analysis attempts to separate dissimilar objects (genes
or hybridizations) from each other; similar objects are clustered
together resulting in small distances. In contrast, the distance between
a gene and a hybridization cannot be directly interpreted. To visualize
between-variable association, the plot includes virtual genes
whose entire intensity is concentrated in a single hybridization [3].
The coordinates of such a gene are called standard coordinates of
the hybridization where this gene is expressed. Likewise, one could
introduce standard coordinates for genes. The standard coordinates
for the genes are computed as
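In the conventional CA notation, standard coordinates are the principal coordinates divided by the singular value of the respective axis. The symbols below are assumptions taken from the general CA literature rather than from this text: f_{ik} and g_{jk} are the principal coordinates of gene i and hybridization j on axis k, u_{ik} and v_{jk} the corresponding left and right singular vector entries, \lambda_k the singular value, and r_i and c_j the row and column masses.

```latex
\[
  \phi_{ik} \;=\; \frac{f_{ik}}{\lambda_k} \;=\; \frac{u_{ik}}{\sqrt{r_i}},
  \qquad
  \gamma_{jk} \;=\; \frac{g_{jk}}{\lambda_k} \;=\; \frac{v_{jk}}{\sqrt{c_j}} .
\]
```

Under these assumptions, a virtual gene whose whole intensity lies in hybridization j has the row profile e_j, and projecting it as a point without mass gives \sum_{j'} (e_j)_{j'} \gamma_{j'k} = \gamma_{jk}, i.e. exactly the standard coordinates of that hybridization.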
Medians and replicate hybridizations in correspondence analysis
Typically, replicate hybridizations are performed for each condition under study, leading to several values for each gene/condition pair. The number of such repeated hybridizations is often small. I therefore represent these values by their gene-wise median rather than their gene-wise average, because the median is less sensitive to outliers. The need remains, though, to visualize the original data as well as the median, since they contain valuable information about experimental variance and the quality of individual hybridizations. CA can reflect both aspects. To this end, CA is first performed on the gene-wise medians, which determines the coordinate system; the original hybridization intensities are then embedded in it. These data points are referred to as supplementary points or points without mass. The noise associated with an experimental condition is thus shown by the spread of its hybridizations around the median. Because the dimensions of the data are reduced by using medians of hybridizations per experimental condition, I refer to this strategy as hybridization-median determined scaling (HMS).
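The HMS strategy can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the author's implementation: the function name `hms_embedding` and its arguments are hypothetical, the supplementary points are placed with the standard CA transition formula (column profile times the standard coordinates of the genes), and all row and column sums are assumed positive.

```python
import numpy as np

def hms_embedding(X, condition):
    """Hybridization-median determined scaling (illustrative sketch).

    X         : I x H intensities (genes x individual hybridizations)
    condition : length-H labels assigning each hybridization to a condition
    Returns the sorted condition labels, gene coordinates F, median
    coordinates G, and supplementary coordinates G_sup (one row per
    original hybridization, in the column order of X).
    """
    X = np.asarray(X, dtype=float)
    condition = np.asarray(condition)
    labels = np.unique(condition)
    # Gene-wise median over the replicate hybridizations of each condition.
    N = np.column_stack(
        [np.median(X[:, condition == lab], axis=1) for lab in labels]
    )
    # CA of the medians defines the coordinate system.
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)
    U, lam, Vt = np.linalg.svd(S, full_matrices=False)
    keep = lam > 1e-12 * lam[0]           # drop axes with numerically zero inertia
    U, lam, Vt = U[:, keep], lam[keep], Vt[keep]
    F = U * lam / np.sqrt(r)[:, None]     # genes, principal coordinates
    G = Vt.T * lam / np.sqrt(c)[:, None]  # condition medians, principal coordinates
    Phi = U / np.sqrt(r)[:, None]         # genes, standard coordinates
    # Original hybridizations enter as points without mass: each column is
    # reduced to its profile and projected with the transition formula.
    profiles = X / X.sum(axis=0)
    G_sup = profiles.T @ Phi
    return labels, F, G, G_sup
```

Plotting `G_sup` together with `G` shows each hybridization scattered around the median of its condition; if all replicates of a condition were identical, their supplementary points would coincide with the median point.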
The embedding for hybridizations without mass is computed as follows.
Let the matrix N contain only the hybridization medians and
let N
In our own data sets, a single hybridization consists of two corresponding spot sets because each cDNA was spotted twice on the array. I refer to these spot sets as primary and secondary spots. They tend to show a higher correlation than hybridizations belonging to the same experimental condition. Plotting them separately (doubling the number of supplementary points) provides an atomic unit of distance in the biplot, where no units are assigned to the axes. The intensity unit cancels out when calculating the correspondence matrix P.
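The cancellation of the intensity unit can be made explicit with a one-line check. It assumes the conventional definition of the correspondence matrix, p_{ij} = n_{ij} / n_{++} with n_{++} the grand total of N; the scale factor a > 0 stands for an arbitrary change of intensity unit and is introduced here only for illustration.

```latex
\[
  p_{ij}(aN) \;=\; \frac{a\,n_{ij}}{\sum_{i',j'} a\,n_{i'j'}}
            \;=\; \frac{a\,n_{ij}}{a\,n_{++}}
            \;=\; \frac{n_{ij}}{n_{++}}
            \;=\; p_{ij}(N) .
\]
```

Since every subsequent quantity is computed from P alone, the coordinates in the biplot do not depend on the unit in which intensities were recorded.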
Bibliography