Correspondence Analysis

I provide here a concise summary of the technique; see refs. [2] and [3] for a thorough exposition. An informal, intuitive description will be given below. The aim is to embed both rows (genes) and columns (hybridizations) of a matrix in the same space, the first two or three coordinates of which contain most of the information. Let $I$ genes and $J$ hybridizations be collected into the $I\times J$ matrix $N$ with elements $n_{ij}$. Let $n_{i+}$ and $n_{+j}$ denote the sum of the $i$th row and $j$th column, respectively, and let $n_{++}$ denote the grand total of $N$. The mass of the $j$th column is defined as $c_{j}=n_{+j}/n_{++}$, and likewise the mass of the $i$th row is $r_{i}=n_{i+}/n_{++}$. The basis for the calculation is the correspondence matrix $P$ with elements $p_{ij}=n_{ij}/n_{++}$, from which the matrix $S$ with elements $s_{ij}=(p_{ij}-r_{i}c_{j})/\sqrt{r_{i}c_{j}}$ is derived. $S$ is submitted to singular value decomposition [1], i.e., it is decomposed into the product of three matrices: $S=U\Lambda V^{T}$. $\Lambda$ is a diagonal matrix, and its diagonal elements are referred to as the singular values of $S$. I take them to be sorted from largest to smallest and denote them by $\lambda_{k}$. The coordinates for gene $i$ in the new space are then given by $f_{ik}=\lambda_{k}u_{ik}/\sqrt{r_{i}}$, for $k=1,\ldots,J$. Hybridizations are viewed in the same space, with hybridization $j$ given coordinates $g_{jk}=\lambda_{k}v_{jk}/\sqrt{c_{j}}$, for $k=1,\ldots,J$. These coordinates are called principal coordinates.

To reduce dimensionality, only the first two or three coordinates of the new space are plotted. The loss of information associated with this dimension reduction is quantified in terms of the proportion of the so-called total inertia $\sum_{k}\lambda_{k}^{2}$ that is explained by the axes displayed. Total inertia is proportional to the value of the $\chi^{2}$ statistic, and thus the amount of information represented in, e.g., a planar embedding, $(\lambda_{1}^{2}+\lambda_{2}^{2})/\sum_{k}\lambda_{k}^{2}$, corresponds to the proportion of the $\chi^{2}$ statistic explained by the embedding.
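The relation between total inertia and the $\chi^{2}$ statistic can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the intensity matrix is made up for illustration and the variable names (`total_inertia`, `planar_share`) are not from the original text. It computes total inertia from the singular values of $S$, compares it with $\chi^{2}/n_{++}$, and reports the share of inertia captured by the first two axes.

```python
import numpy as np

# Made-up intensity matrix: 5 genes (rows) x 4 hybridizations (columns).
N = np.array([
    [10.0, 2.0, 3.0, 6.0],
    [4.0, 8.0, 1.0, 2.0],
    [2.0, 3.0, 9.0, 4.0],
    [5.0, 5.0, 5.0, 1.0],
    [1.0, 7.0, 2.0, 8.0],
])

n_total = N.sum()                      # grand total n_++
P = N / n_total                        # correspondence matrix
r = P.sum(axis=1)                      # row masses r_i
c = P.sum(axis=0)                      # column masses c_j
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Singular values only; NumPy returns them sorted from largest to smallest.
lam = np.linalg.svd(S, compute_uv=False)
total_inertia = np.sum(lam ** 2)

# Classical chi-square statistic of the table N.
expected = n_total * np.outer(r, c)
chi2 = np.sum((N - expected) ** 2 / expected)

# Share of total inertia shown by a planar embedding (first two axes).
planar_share = (lam[0] ** 2 + lam[1] ** 2) / total_inertia

print("total inertia:", total_inertia)
print("chi2 / n_++  :", chi2 / n_total)
print("planar share :", planar_share)
```

In this sketch the proportionality constant is the grand total: total inertia equals $\chi^{2}/n_{++}$.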

The above summary is intended to provide all the information needed to implement a simple CA algorithm. This can easily be done using nested for loops. A much shorter implementation without loops is possible in any programming language that supports matrix multiplication and provides a routine for singular value decomposition, e.g., MATLAB (Appendix B).
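As an illustration of such a loop-free implementation, here is a minimal sketch in Python with NumPy. It is not the MATLAB code of Appendix B; the function name `correspondence_analysis` and the example matrix are assumptions made for this sketch, and it presumes that every row and column of $N$ has a positive sum so that all masses are nonzero.

```python
import numpy as np

def correspondence_analysis(N):
    """Sketch of CA as summarized above.

    Returns principal coordinates of rows (F) and columns (G),
    the singular values (lam), and the row and column masses (r, c).
    Assumes all row and column sums of N are positive.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                          # p_ij = n_ij / n_++
    r = P.sum(axis=1)                        # r_i = n_i+ / n_++
    c = P.sum(axis=0)                        # c_j = n_+j / n_++
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    # Reduced SVD: S = U diag(lam) Vt, lam sorted from largest to smallest.
    U, lam, Vt = np.linalg.svd(S, full_matrices=False)
    F = U * lam / np.sqrt(r)[:, None]        # f_ik = lam_k u_ik / sqrt(r_i)
    G = Vt.T * lam / np.sqrt(c)[:, None]     # g_jk = lam_k v_jk / sqrt(c_j)
    return F, G, lam, r, c

# Made-up example: 5 genes x 4 hybridizations.
N = np.array([
    [10.0, 2.0, 3.0, 6.0],
    [4.0, 8.0, 1.0, 2.0],
    [2.0, 3.0, 9.0, 4.0],
    [5.0, 5.0, 5.0, 1.0],
    [1.0, 7.0, 2.0, 8.0],
])
F, G, lam, r, c = correspondence_analysis(N)
print(F[:, :2])   # gene coordinates in the plane
print(G[:, :2])   # hybridization coordinates in the plane
```

Because $S$ is centered with respect to the masses, its smallest singular value is numerically zero, so at most $J-1$ of the $J$ coordinates carry information when $I\geq J$.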

Standard coordinates as an aid in visualization

Correspondence analysis attempts to separate dissimilar objects (genes or hybridizations) from each other; similar objects are clustered together, resulting in small distances. In contrast, the distance between a gene and a hybridization cannot be directly interpreted. To visualize between-variable association in the plot, one includes virtual genes which have all their intensity focused in one hybridization [3]. The coordinates of such a gene are called the standard coordinates of the hybridization in which this gene is expressed. Likewise, one could introduce standard coordinates for genes. The standard coordinates for the genes are computed as $u_{ik}/\sqrt{r_{i}}$ and for the hybridizations as $v_{jk}/\sqrt{c_{j}}$. In practice, the spread of the set of real genes and hybridizations is much smaller than the spread introduced when including these virtual genes and hybridizations via their standard coordinates. As a consequence, the real points would shrink to a tiny area, so I depict the direction from the centroid of the data to the standard coordinates rather than the standard coordinates themselves.
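The link between virtual genes and standard coordinates can be sketched as follows, assuming NumPy and a made-up intensity matrix; names such as `std_hyb`, `virtual_genes` and `directions` are introduced here for illustration only. The sketch projects virtual genes (all intensity in a single hybridization) into the plane as points without mass and confirms that they land on the standard coordinates of that hybridization. It then reduces each standard coordinate to a unit direction from the centroid, which lies at the origin of the principal axes.

```python
import numpy as np

# Made-up intensity matrix: 5 genes x 4 hybridizations.
N = np.array([
    [10.0, 2.0, 3.0, 6.0],
    [4.0, 8.0, 1.0, 2.0],
    [2.0, 3.0, 9.0, 4.0],
    [5.0, 5.0, 5.0, 1.0],
    [1.0, 7.0, 2.0, 8.0],
])
K = 2                                            # planar embedding

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, lam, Vt = np.linalg.svd(S, full_matrices=False)

G = (Vt.T * lam / np.sqrt(c)[:, None])[:, :K]    # principal coords of hybridizations
std_hyb = (Vt.T / np.sqrt(c)[:, None])[:, :K]    # standard coords v_jk / sqrt(c_j)

# Virtual genes: row j has all its intensity (arbitrarily 7 units) in hybridization j.
virtual = 7.0 * np.eye(N.shape[1])
profiles = virtual / virtual.sum(axis=1, keepdims=True)   # row profiles, each sums to 1
virtual_genes = profiles @ G / lam[:K]           # projected as points without mass

# Unit directions from the centroid (origin) towards the standard coordinates.
directions = std_hyb / np.linalg.norm(std_hyb, axis=1, keepdims=True)

print(np.allclose(virtual_genes, std_hyb))
print(directions)
```

For plotting, each direction could be drawn as a line from the origin, scaled to the extent of the real points; the scaling rule is a presentation choice and not prescribed by the text.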

Medians and replicate hybridizations in correspondence analysis

Typically, replicate hybridizations are performed for each condition under study, leading to several values for one gene/condition pair. The number of such repeated hybridizations is often small. I therefore represent these values by their gene-wise median rather than their gene-wise average because the median is less sensitive to outliers. The need remains, though, to visualize the original data as well as the median, since they contain valuable information about experimental variance and the quality of individual hybridizations. In fact, CA offers the possibility to reflect both aspects. To this end, CA is first performed on the gene-wise medians, which determines the coordinate system in which the original hybridization intensities are then embedded. These data points are referred to as supplementary points or points without mass. Thus the share of noise belonging to an experimental condition is shown by the spread of its hybridizations around the median. As the dimensions of the data are reduced by using medians of hybridizations per experimental condition, I refer to this strategy as hybridization-median determined scaling (HMS).
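The first step of HMS, collapsing replicate hybridizations into gene-wise medians per condition, might look like the following sketch. It assumes NumPy; the helper `condition_medians`, the condition labels and the intensity values are all hypothetical and chosen so that one replicate contains an obvious outlier.

```python
import numpy as np

def condition_medians(N_star, conditions):
    """Collapse replicate columns into one gene-wise median per condition.

    N_star     : genes x hybridizations matrix with all replicates
    conditions : one label per column of N_star (hypothetical layout)
    """
    cond = np.asarray(conditions)
    labels = list(dict.fromkeys(conditions))     # unique labels, original order
    medians = [np.median(N_star[:, cond == label], axis=1) for label in labels]
    return np.column_stack(medians), labels

# Made-up data: 2 genes, 2 conditions with 3 replicate hybridizations each.
N_star = np.array([
    [10.0, 12.0, 50.0, 3.0, 4.0, 5.0],    # third control replicate is an outlier
    [7.0, 6.0, 8.0, 20.0, 22.0, 21.0],
])
conditions = ["control", "control", "control", "treated", "treated", "treated"]

N, labels = condition_medians(N_star, conditions)
print(labels)
print(N)   # gene 0, control: median 12.0, whereas the mean would be 24.0
```

The resulting matrix of medians plays the role of $N$ in the CA, while the full replicate matrix is kept for later projection as points without mass.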

The embedding for hybridizations without mass is computed as follows. Let the matrix $N$ contain only the hybridization medians, and let $N^{\star}$ with elements $n^{\star}_{ij'}$ be the original data matrix containing all the hybridizations. $N$ is submitted to CA. Let $P^{\star}$ have elements $p^{\star}_{ij'}=n^{\star}_{ij'}/n^{\star}_{++}$. The principal coordinates for the supplementary hybridizations from the correspondence matrix $P^{\star}$ are then calculated as

$$g_{j'k}^{\star}=\frac{1}{\sum\limits_{i}p^{\star}_{ij'}}\sum_{i}\frac{p^{\star}_{ij'}f_{ik}}{\lambda_{k}}.$$
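The projection formula above can be sketched in a few lines, assuming NumPy. The helper names `ca` and `supplementary_columns`, the random test data, and the column layout (three conditions with three consecutive replicate columns each) are assumptions of this sketch. As a sanity check, projecting the median matrix itself as supplementary columns should reproduce its own principal coordinates.

```python
import numpy as np

def ca(N):
    """CA of N: principal coordinates F (rows), G (columns), singular values lam."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, lam, Vt = np.linalg.svd(S, full_matrices=False)
    F = U * lam / np.sqrt(r)[:, None]
    G = Vt.T * lam / np.sqrt(c)[:, None]
    return F, G, lam

def supplementary_columns(N_star, F, lam, K=2):
    """Principal coordinates g*_{j'k} of columns without mass, first K axes."""
    P_star = N_star / N_star.sum()               # p*_ij' = n*_ij' / n*_++
    profiles = P_star / P_star.sum(axis=0)       # divide column j' by sum_i p*_ij'
    return profiles.T @ F[:, :K] / lam[:K]       # sum_i profile_ij' f_ik / lam_k

# Made-up data: 6 genes, 3 conditions x 3 replicate hybridizations.
rng = np.random.default_rng(0)
N_star = rng.uniform(1.0, 10.0, size=(6, 9))
N = np.median(N_star.reshape(6, 3, 3), axis=2)   # gene-wise median per condition

F, G, lam = ca(N)                                # coordinate system from medians
G_star = supplementary_columns(N_star, F, lam)   # replicates as supplementary points

print(G[:, :2])    # one point per condition
print(G_star)      # nine points scattered around the condition medians
```

Only axes with nonzero singular values can be used here, since the formula divides by $\lambda_{k}$; with three conditions that leaves exactly the two axes of the planar embedding.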

In our own data sets, a single hybridization consists of two corresponding spot sets because each cDNA was spotted twice on the array. I refer to these spot sets as primary and secondary spots. They tend to show a higher correlation than hybridizations belonging to the same experimental condition. Plotting them separately (which doubles the number of supplementary points) provides an atomic unit of distance in the biplot, where no units are assigned to the axes. The intensity unit cancels out when calculating the correspondence matrix $P$.

Bibliography

1: G. H. Golub and C. Reinsch.
Singular value decomposition and least squares solutions.
Numer. Math., 14:403-420, 1970.
2: M. J. Greenacre.
Theory and Applications of Correspondence Analysis, page 223.
Academic Press, London, 1st edition, 1984.
3: M. J. Greenacre.
Correspondence Analysis in Practice, pages 181-183 and 36.
Academic Press, London, 1st edition, 1993.