M-CHIPS Manual

All experiment descriptions should be directly accessible to statistical analysis. This can be achieved easily when data are not entered as free text but in a categorized, queryable form. This allows for application of multivariate procedures for correlating expression data and annotations.

To make all experiment descriptions directly accessible to statistical analysis, we permit only two types of experiment annotation: numbers with a predefined unit, or values from predefined lists. For example, if the annotation 'growth phase' is an enumeration-type variable comprising the defined values 'exponential', 'stationary' and 'pseudo-hyphal', the occurrences of the value 'exponential' can be counted within a set of hybridizations clustered by their expression profiles and compared with its overall frequency to determine whether it is characteristic, i.e. either over- or under-represented in the cluster.
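The counting step can be illustrated with a short sketch. The hybridization counts below are invented for illustration, and the hypergeometric tail probability is only one possible way to judge over- or under-representation; the manual itself does not prescribe a particular test.

```python
from math import comb


def hypergeom_pmf(k, total, total_with_value, cluster_size):
    """Probability of exactly k hits when drawing cluster_size items without replacement."""
    return (comb(total_with_value, k)
            * comb(total - total_with_value, cluster_size - k)
            / comb(total, cluster_size))


def representation(cluster_values, all_values, value):
    """Compare the frequency of `value` in a cluster with its overall frequency."""
    total = len(all_values)
    hits_total = all_values.count(value)
    size = len(cluster_values)
    hits = cluster_values.count(value)
    lowest = max(0, size - (total - hits_total))   # fewest hits possible
    highest = min(size, hits_total)                # most hits possible
    p_over = sum(hypergeom_pmf(i, total, hits_total, size)
                 for i in range(hits, highest + 1))
    p_under = sum(hypergeom_pmf(i, total, hits_total, size)
                  for i in range(lowest, hits + 1))
    return {
        "cluster_frequency": hits / size,
        "overall_frequency": hits_total / total,
        "p_over": p_over,    # chance of at least this many hits
        "p_under": p_under,  # chance of at most this many hits
    }


# Hypothetical data: 20 hybridizations annotated with 'growth phase',
# five of which fall into one expression-profile cluster.
all_values = ["exponential"] * 8 + ["stationary"] * 8 + ["pseudo-hyphal"] * 4
cluster_values = ["exponential"] * 4 + ["stationary"]

result = representation(cluster_values, all_values, "exponential")
print(result)
```

A small `p_over` suggests the value is over-represented in the cluster, a small `p_under` that it is under-represented. None of this is possible with free text, because the values could not be counted reliably.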

While the number of occurrences of a value cannot be counted directly in free text descriptions, dispensing with free text also causes problems. An arbitrary-length free text field can hold any possible value and any number of such atomic pieces of information. In contrast, the type of annotation described above is restricted to predefined values. New annotations and/or new values for existing annotations have to be added constantly as new experiments are designed. This requires the ability to define new annotations rapidly without altering the database schema, i.e. during normal database operation. The absence of highly flexible free text annotations has to be compensated for by increased flexibility in database storage.

To give statistical methods direct access to all experiment descriptions, the descriptions have been dissected into atomic items that can be represented either by numbers with a predefined unit or by values from predefined lists. To meet the flexibility requirements described above, the annotation definitions are stored as table contents rather than in the database structure itself. The web-based annotation process (described below) involves reading these definition tables and recording the entered numbers or selected values in annotation tables.
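As a rough sketch of what checking an entry against the definitions could look like, the following assumes that the definitions have already been read into a dictionary. The dictionary layout and the function name `is_valid_entry` are hypothetical and not part of M-CHIPS; the annotation names and values are taken from the yeast example.

```python
# Hypothetical in-memory form of the definition tables:
# a set of allowed values for enumeration-type annotations,
# a unit string in square brackets for numeric annotations.
definitions = {
    "array_support": {"nylon", "polypropylene", "glass"},
    "spotted_material": {"PCR", "colonies", "DNA-oligo", "PNA-oligo"},
    "array_series": "[]",  # dimensionless number
}


def is_valid_entry(annotation, entry, definitions):
    """Accept only predefined values or something that parses as a number."""
    allowed = definitions[annotation]
    if isinstance(allowed, (set, frozenset)):
        return entry in allowed
    try:
        float(entry)
    except ValueError:
        return False
    return True


print(is_valid_entry("array_support", "glass", definitions))
print(is_valid_entry("array_support", "paper", definitions))
print(is_valid_entry("array_series", "12", definitions))
```

Because the definitions are plain data, adding a new annotation or value only changes the dictionary contents (in the database: the table contents), not the checking code.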

Definition tables. A separate database is maintained for each organism or field; each contains the definitions of experiment annotations appropriate for the samples studied. We provide annotation definitions for S. cerevisiae, A. thaliana, human tumor biopsies, T. brucei and N. crassa on our web page. Each database comes with a set of 'organism-specific' experiment-annotation definitions. However, some 'common' annotations, mostly related to the hybridization protocol, are used in all databases. To facilitate future inter-field analyses, we try to keep this share as large as possible. New common annotations are added to all databases automatically by means of administration scripts.

Each annotation has a unique identification number. Annotations are stored as a linked list: each entry includes an attribute pointing to the ID of the annotation next in sequence. This structure allows annotations to be inserted at arbitrary positions by linking the desired predecessor to a new element that in turn points to the ID of the element that follows in the list. In a similar manner, the whole set of defined values is stored as a second linked list within the same table. The hierarchical structure of the annotation ontology is recorded in a second table. The following table gives an example, listing the first part of the common annotations.

**Table:** Definition of experimental annotations (table contents)
`yeast=> select * from annotationheadings order by heading1no, heading2no, heading3no;`

| heading1no | heading1 | heading2no | heading2 | heading3no | heading3 |
|---|---|---|---|---|---|
| 1 | common_annotations | 1 | array | 1 | - |
| 1 | common_annotations | 2 | hybridisation | 2 | RNA_preparation |
| 1 | common_annotations | 2 | hybridisation | 3 | labeling |
| 1 | common_annotations | 2 | hybridisation | 4 | hybridisation_conditions |
| 1 | common_annotations | 2 | hybridisation | 5 | stringency_wash |
| 1 | common_annotations | 2 | hybridisation | 6 | detection |
| 1 | common_annotations | 3 | sample | 7 | - |
| 1 | common_annotations | 4 | submission | 8 | - |
| 2 | organism_specific_annotations | 5 | genotype | 9 | - |

`... skipping ...`

`yeast=> select * from annotations order by lastheadingno, ano, vno;`

| lastheadingno | ano | nextano | annotation | vno | nextvno | value |
|---|---|---|---|---|---|---|
| 1 | 1 | 2 | array_source | 10 | 11 | self_made |
| 1 | 1 | 2 | array_source | 11 | 12 | genome_systems |
| 1 | 1 | 2 | array_source | 12 | 13 | clontech |
| 1 | 1 | 2 | array_source | 13 | 14 | research_genetics |
| 1 | 2 | 3 | array_series | 0 | 0 | [] |
| 1 | 3 | 4 | array_individual | 0 | 0 | [] |
| 1 | 4 | 5 | array_support | 14 | 15 | nylon |
| 1 | 4 | 5 | array_support | 15 | 16 | polypropylene |
| 1 | 4 | 5 | array_support | 16 | 17 | glass |
| 1 | 5 | 6 | spotted_material | 17 | 18 | PCR |
| 1 | 5 | 6 | spotted_material | 18 | 19 | colonies |
| 1 | 5 | 6 | spotted_material | 19 | 20 | DNA-oligo |
| 1 | 5 | 6 | spotted_material | 20 | 21 | PNA-oligo |
| 1 | 6 | 7 | readfile | 0 | 0 | [] |
| 1 | 7 | 8 | array_hybridisation | 0 | 0 | [] |
| 2 | 8 | 9 | material_source | 21 | 22 | fresh |
| 2 | 8 | 9 | material_source | 22 | 23 | frozen |

`... skipping ...`
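The linked-list storage via `ano` and `nextano` can be sketched in a few lines. The first three rows are taken from the table above; the inserted annotation `array_batch`, its ID 100 and both function names are hypothetical. How the real database marks the end of the list is not stated here, so the sketch simply stops when `nextano` points to an ID it does not know.

```python
def in_order(rows, first_ano):
    """Follow the nextano pointers and return annotation names in list order."""
    by_id = {row["ano"]: row for row in rows}
    names, seen, ano = [], set(), first_ano
    while ano in by_id and ano not in seen:  # 'seen' guards against cycles
        seen.add(ano)
        names.append(by_id[ano]["annotation"])
        ano = by_id[ano]["nextano"]
    return names


def insert_after(rows, predecessor_ano, new_ano, name):
    """Link a new annotation between a predecessor and its former successor."""
    by_id = {row["ano"]: row for row in rows}
    if new_ano in by_id:
        raise ValueError("annotation IDs must be unique")
    predecessor = by_id[predecessor_ano]
    # The new element takes over the predecessor's old pointer ...
    rows.append({"ano": new_ano, "nextano": predecessor["nextano"],
                 "annotation": name})
    # ... and the predecessor now points to the new element.
    predecessor["nextano"] = new_ano


rows = [
    {"ano": 1, "nextano": 2, "annotation": "array_source"},
    {"ano": 2, "nextano": 3, "annotation": "array_series"},
    {"ano": 3, "nextano": 4, "annotation": "array_individual"},
]

before = in_order(rows, 1)
insert_after(rows, 1, 100, "array_batch")
after = in_order(rows, 1)
print(before)
print(after)
```

Only one existing row is modified by the insertion, and no existing ID changes, which is what makes adding annotations at arbitrary positions cheap during normal database operation.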

The complete set of common annotations can be found in the first part of each annotation definition list on our web site, e.g. in the yeast list (HTML, text). The actual experiment annotations which are entered via similar HTML forms are stored elsewhere.

The structure of both tables is denormalized for visual clarity. Since the normalized form, consisting of separate tables, would be queried exclusively by joining them, we implemented the joins directly as database tables. Such redundancies, though uncommon in database design, are frequently used in data warehousing.
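To illustrate the idea of storing the join itself, the following sketch builds a possible normalized form in an in-memory SQLite database and materializes the join as a table. The table names `annotation_def` and `value_def` and their layout are assumptions made for this example only; the normalized schema is not given in this manual, and numeric annotations are left out for brevity.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical normalized form: one row per annotation, one row per value.
    CREATE TABLE annotation_def (
        ano INTEGER PRIMARY KEY, nextano INTEGER,
        annotation TEXT, lastheadingno INTEGER);
    CREATE TABLE value_def (
        vno INTEGER PRIMARY KEY, nextvno INTEGER,
        ano INTEGER REFERENCES annotation_def(ano), value TEXT);

    INSERT INTO annotation_def VALUES
        (4, 5, 'array_support', 1),
        (5, 6, 'spotted_material', 1);
    INSERT INTO value_def VALUES
        (14, 15, 4, 'nylon'),
        (15, 16, 4, 'polypropylene'),
        (16, 17, 4, 'glass'),
        (17, 18, 5, 'PCR');

    -- Denormalized form: the join is stored as a table of its own.
    CREATE TABLE annotations AS
        SELECT a.lastheadingno, a.ano, a.nextano, a.annotation,
               v.vno, v.nextvno, v.value
        FROM annotation_def AS a
        JOIN value_def AS v ON v.ano = a.ano;
""")

joined_rows = con.execute(
    "SELECT * FROM annotations ORDER BY lastheadingno, ano, vno").fetchall()
for row in joined_rows:
    print(row)
```

The annotation name and its pointers are repeated on every value row, which is the redundancy referred to above; in exchange, the table can be read and displayed without any join.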

Assembling a definition list. Experiment annotations are defined in the form of tab-delimited lists. These lists can be recombined and edited in a spreadsheet program. The list format resembles the format of the database tables above.

Please avoid any special characters and replace any spaces within annotations, values or units with underscores. Special characters and spaces are not allowed because they may interfere with database functions or analysis algorithms.
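A list can be checked for offending names before submission with a small helper like the one below. Which characters count as 'special' is not spelled out here; the sketch assumes that letters, digits, underscores and hyphens are acceptable, since values such as DNA-oligo appear in the existing lists. Both function names are hypothetical.

```python
import re

# Assumption: only these characters are considered safe in names.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def to_identifier(text):
    """Trim the text and replace each run of whitespace with one underscore."""
    return re.sub(r"\s+", "_", text.strip())


def is_clean(text):
    """True if the text contains only characters assumed to be safe."""
    return _SAFE_NAME.fullmatch(text) is not None


for name in ["growth phase", "DNA-oligo", "37°C"]:
    candidate = to_identifier(name)
    print(name, "->", candidate, "ok" if is_clean(candidate) else "needs editing")
```

Names that still fail the check after replacing spaces have to be rephrased by hand, for example by spelling out a symbol.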

First table. The first table provides a hierarchy of headings for the annotations. In principle, these headings may have arbitrary nesting depth; in practice, they have never been tested with anything other than three levels. You may leave the numbering to me, except for the numbers of the lowest (rightmost) level, which correspond to the numbers in the 'lastheadingno' column of the second table.

Second table. Here, these numbers have to be given to indicate which heading a particular annotation belongs to. You may leave out the numbers in columns 'ano', 'nextano', 'vno' and 'nextvno' and instead represent the sequence of annotations and their values by their order in the list.

For enumeration-type annotations, repeat the annotation name in column 'annotation' and define the allowed values in column 'value', one line per value. Annotations that take a number are defined by a single line. Please mark these with zeros in columns 'vno' and 'nextvno' and provide a unit within square brackets in column 'value'. For dimensionless numbers, enter empty brackets. Please stick to the units already defined in the sample lists or follow their style of textual representation. Remember not to use special characters or spaces.
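The following sketch shows how the omitted numbers could be derived from the order of a tab-delimited list with the three columns 'lastheadingno', 'annotation' and 'value'. It is an illustration of the rules above, not the actual import script; the function name `assign_numbers` and the starting numbers are assumptions, chosen so that the result matches the first rows of the yeast table.

```python
import csv
import io


def assign_numbers(tsv_text, first_ano=1, first_vno=1):
    """Derive ano/nextano/vno/nextvno from the order of the lines."""
    rows = []
    ano = first_ano - 1
    vno = first_vno
    previous = None
    for record in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not record:
            continue  # skip blank lines
        lastheadingno, annotation, value = record
        if annotation != previous:  # a new annotation starts here
            ano += 1
            previous = annotation
        if value.startswith("[") and value.endswith("]"):
            # Numeric annotation: unit in square brackets, zeros for vno/nextvno.
            rows.append((int(lastheadingno), ano, ano + 1, annotation, 0, 0, value))
        else:
            # Enumeration value: numbered consecutively across all annotations.
            rows.append((int(lastheadingno), ano, ano + 1, annotation,
                         vno, vno + 1, value))
            vno += 1
    return rows


lines = [
    ("1", "array_source", "self_made"),
    ("1", "array_source", "genome_systems"),
    ("1", "array_source", "clontech"),
    ("1", "array_source", "research_genetics"),
    ("1", "array_series", "[]"),
    ("1", "array_individual", "[]"),
    ("1", "array_support", "nylon"),
]
tsv_text = "\n".join("\t".join(line) for line in lines) + "\n"

numbered = assign_numbers(tsv_text, first_ano=1, first_vno=10)
for row in numbered:
    print(row)
```

Since the numbers follow from the line order alone, reordering lines in a spreadsheet is enough to change the sequence of annotations or values.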