Given the ongoing development of analysis tools for array data, it would be unwise to store any processed form of the original data, because it becomes outdated as soon as the calculation methods change. In order to perform our calculations on the fly on transcription data as raw as feasible, we decided to store the signal intensities derived from a hybridisation image by imaging software. All recent releases of these software packages require human interaction to verify and adapt a semi-automatically performed spot recognition, which impedes on-the-fly calculations on the image data themselves.
Storing intensity information appears to be easy. A hybridisation yields a huge amount of uniform data comprising, in our case, two intensities and two background values per gene or EST (each being spotted in duplicate). Performance considerations would suggest hybridisation-wise storage in tab-delimited files or as array tuples in a database, which gives fast access to whole hybridisations at the cost of selective retrieval of particular values: specified subsets of spots are not easily accessible in a hybridisation data file or in an array. However, it should be possible to selectively retrieve intensities above a certain threshold or within a specified interval, so the values for every gene/EST have to be stored as separate tuples in a database relation. In this form, indices can be built to perform fast score-dependent queries that exploit the B-tree search capability of the database. When hybridisation databases become too large to be loaded into computer memory, it will be essential to perform tuple selections as well as simple calculations at the database level before loading the compressed results into memory for visualization.
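The per-spot tuple layout and the indexed, score-dependent retrieval described above can be sketched as follows. This is only an illustration under stated assumptions: SQLite (via Python's standard `sqlite3` module) stands in for the actual database so that the example is self-contained, and the relation name `intensity`, its column names, the helper functions and all values are hypothetical, not taken from the implementation.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real server.
conn = sqlite3.connect(":memory:")

# One tuple per spotted replicate instead of one array per hybridisation.
# Table and column names are hypothetical.
conn.execute("""
    CREATE TABLE intensity (
        hybridisation_id INTEGER NOT NULL,
        spot_id          INTEGER NOT NULL,
        replicate        INTEGER NOT NULL,  -- 1 or 2: spotted in duplicate
        signal           REAL    NOT NULL,
        background       REAL    NOT NULL,
        PRIMARY KEY (hybridisation_id, spot_id, replicate)
    )
""")

# A B-tree index on the signal column supports threshold and interval queries.
conn.execute("CREATE INDEX intensity_signal_idx ON intensity (signal)")

# Made-up example values, for illustration only.
rows = [
    (1, 101, 1, 1520.0, 110.0),
    (1, 101, 2, 1480.0, 105.0),
    (1, 102, 1, 90.0, 100.0),
    (1, 102, 2, 95.0, 98.0),
    (1, 103, 1, 640.0, 120.0),
    (1, 103, 2, 655.0, 115.0),
]
conn.executemany("INSERT INTO intensity VALUES (?, ?, ?, ?, ?)", rows)


def spots_in_interval(conn, low, high):
    """Return the spots having at least one signal within [low, high]."""
    cur = conn.execute(
        "SELECT DISTINCT spot_id FROM intensity "
        "WHERE signal BETWEEN ? AND ? ORDER BY spot_id",
        (low, high),
    )
    return [row[0] for row in cur]


def mean_net_signal(conn, hybridisation_id):
    """Average background-subtracted signal per spot, computed in the database."""
    cur = conn.execute(
        "SELECT spot_id, AVG(signal - background) FROM intensity "
        "WHERE hybridisation_id = ? GROUP BY spot_id ORDER BY spot_id",
        (hybridisation_id,),
    )
    return dict(cur.fetchall())
```

With this layout, `spots_in_interval(conn, 600, 2000)` selects only the matching tuples, and `mean_net_signal(conn, 1)` returns one aggregated value per spot, so that only the reduced result has to be transferred into memory.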
Gene annotations may consist of clone numbers, accession numbers and different kinds of entries describing the spotted sequence or the encoded protein, such as chromosomal location, enzyme categorization number or protein structure. Here we only include identifiers serving as keys to connect to databases containing gene information, short variable-length free-text descriptions of the protein and its functional category, and the spot location. Moreover, it turned out to be necessary to explicitly keep track of the array on which a spot is located whenever the spot set comprises more than one array and each of them has been hybridized separately.
Because complex sequence annotations or enzyme properties are found in linked gene databases, the gene annotations may be stored in only one relation containing attributes for the above values, and every spotted element (gene or EST) can be described by one tuple.
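A single relation of this kind might look like the following sketch. It is an assumption-laden illustration rather than the actual schema: SQLite again stands in for the database, and the relation name `spotted_element`, its columns, the helper functions and the placeholder identifiers are all hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real server.
conn = sqlite3.connect(":memory:")

# One tuple per spotted element (gene or EST); all names are hypothetical.
conn.execute("""
    CREATE TABLE spotted_element (
        spot_id             INTEGER PRIMARY KEY,
        array_no            INTEGER NOT NULL,  -- which array of the spot set
        grid_row            INTEGER NOT NULL,  -- spot location on that array
        grid_col            INTEGER NOT NULL,
        clone_id            TEXT,              -- key into external databases
        accession           TEXT,              -- key into external databases
        description         TEXT,              -- short free text
        functional_category TEXT,
        UNIQUE (array_no, grid_row, grid_col)
    )
""")

# Placeholder identifiers and descriptions, for illustration only.
conn.executemany(
    "INSERT INTO spotted_element VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 1, "CLONE-A1", "ACC0001", "example protein A", "metabolism"),
        (2, 1, 1, 2, "CLONE-A2", "ACC0002", "example protein B", "transport"),
        (3, 2, 1, 1, "CLONE-B1", "ACC0003", "example protein C", "metabolism"),
    ],
)


def elements_on_array(conn, array_no):
    """Return the spot identifiers located on one array of the spot set."""
    cur = conn.execute(
        "SELECT spot_id FROM spotted_element WHERE array_no = ? ORDER BY spot_id",
        (array_no,),
    )
    return [row[0] for row in cur]


def external_key(conn, spot_id):
    """Return the accession used to look the element up in a gene database."""
    row = conn.execute(
        "SELECT accession FROM spotted_element WHERE spot_id = ?", (spot_id,)
    ).fetchone()
    return None if row is None else row[0]
```

Detailed sequence or enzyme information is deliberately absent here; only the key returned by `external_key` would be used to follow the link into a gene database.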
Experimental annotations comprise, to give some examples, the description of environmental conditions, genotype, patient data, information about surgery, type of tissue (including the estimated degree of contamination by other cell types), the sampling method, and annotations related to the hybridisation protocol, the properties of the individual array or the imaging process. They fall into the two realms of
Experimental annotations are set up by the biologists working in the field. They tend to grow with every new type of experiment performed. To account for this, an implementation of any concept will be useless if it does not allow new annotations to be added, or values for already defined annotations to be completed, easily and quickly without altering the database schema or the analysis algorithms. If an annotating experimenter finds something he or she forgot to define before uploading an experiment, the database should be flexible enough to incorporate a new annotation or value within a minute.
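One common way to meet this requirement is to store annotation definitions and annotation values as rows rather than as columns, so that a new annotation is an `INSERT` and never an `ALTER TABLE`. The sketch below illustrates that idea only; it is not claimed to be the design used here. SQLite stands in for the database, and the relations `annotation_def` and `annotation_value`, the helper functions and the example annotations are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Definitions of annotations; a new annotation is a new row here.
    CREATE TABLE annotation_def (
        annotation_id INTEGER PRIMARY KEY,
        category      TEXT NOT NULL,
        name          TEXT NOT NULL,
        UNIQUE (category, name)
    );
    -- Values of annotations for individual hybridisations.
    CREATE TABLE annotation_value (
        hybridisation_id INTEGER NOT NULL,
        annotation_id    INTEGER NOT NULL
                         REFERENCES annotation_def (annotation_id),
        value            TEXT NOT NULL,
        PRIMARY KEY (hybridisation_id, annotation_id)
    );
""")


def define_annotation(conn, category, name):
    """Register a new annotation without touching the schema."""
    cur = conn.execute(
        "INSERT INTO annotation_def (category, name) VALUES (?, ?)",
        (category, name),
    )
    return cur.lastrowid


def annotate(conn, hybridisation_id, annotation_id, value):
    """Set (or overwrite) the value of an annotation for a hybridisation."""
    conn.execute(
        "INSERT OR REPLACE INTO annotation_value VALUES (?, ?, ?)",
        (hybridisation_id, annotation_id, value),
    )


def annotations_for(conn, hybridisation_id):
    """Return all annotations of a hybridisation as a name -> value mapping."""
    cur = conn.execute(
        "SELECT d.name, v.value FROM annotation_value v "
        "JOIN annotation_def d USING (annotation_id) "
        "WHERE v.hybridisation_id = ?",
        (hybridisation_id,),
    )
    return dict(cur.fetchall())


def schema(conn):
    """Return the current schema, used to show that it does not change."""
    return conn.execute(
        "SELECT name, sql FROM sqlite_master ORDER BY name"
    ).fetchall()


schema_before = schema(conn)
tissue = define_annotation(conn, "sample", "tissue type")
annotate(conn, 1, tissue, "liver")
# An annotation that was forgotten at upload time is added later as plain rows.
temperature = define_annotation(conn, "environment", "growth temperature")
annotate(conn, 1, temperature, "30 C")
schema_after = schema(conn)
```

Because analysis code reads annotations through a generic lookup such as `annotations_for`, neither the schema nor the algorithms need to change when an experimenter introduces a new annotation.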
Gene and experimental annotations taken together account for less than 0.35% of the storage space of the yeast database (May 2000). Since the share of data entered directly by human beings will in any case be far too small to be relevant for query performance, flexibility remains the only time-saving aspect related to experimental annotations.
When designing structures for data storage, one might prefer formats that support a wider range of analytical access to the data over others. Let the experimental annotations, though ordered into various categories and subcategories, be text fields containing a free-text description of the annotation value, e.g. the yeast-specific annotation
Our implementation currently handles 33 yeast-specific, 70 Arabidopsis-specific, 54 human tumor-specific, 41 Trypanosoma-specific, and 76 common (technical) highly categorized experimental annotations. They were set up by biologists working in these fields and enable statistical analysis of the descriptions of nearly 700 hybridisations stored in a PostgreSQL database. The following, more practical aspects deal with the realization of such a database in a multiuser setting.