Next: 2. Realization concepts Up: M-CHIPS Database Report Previous: Contents

1. General Considerations concerning Data Storage

1.1 Storing transcription intensities

1.1.1 What to store

Given the ongoing development of analysis tools for array data, it would be unwise to store any processed form of the original data, because it would become outdated as soon as the calculation methods change. In order to perform our calculations on the fly on transcription data as raw as feasible, we decided to store the signal intensities derived from a hybridisation image by imaging software. All recent releases of these software packages require human interaction to verify and adapt a semi-automatically performed spot recognition, which rules out on-the-fly calculations on the image data themselves.

1.1.2 How to store it

Storing intensity information appears to be easy. A hybridisation yields a huge amount of uniform data comprising, in our case, two intensities and two background values per gene or EST (each being spotted in duplicate). Performance considerations would suggest hybridisation-wise storage, either in tab-delimited files or as array tuples in a database. This gives fast access to whole hybridisations but dispenses with selective retrieval of particular values: specified subsets of spots are not easily accessible in a hybridisation data file or in an array. However, it should be possible to selectively retrieve intensities above a certain threshold or within a specified interval. It is therefore necessary to store the values for every gene/EST as separate tuples in a database relation. In this form, indices can be built to perform fast score-dependent queries using the B-tree search capability of the database. When hybridisation databases eventually become too large to be loaded into computer memory, it will be essential to perform tuple selections as well as simple calculations at the database level before loading the condensed results into memory for visualization.
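The tuple-wise storage described above can be sketched as follows. This is a minimal illustration, not the M-CHIPS schema: the relation name `spot_intensity`, its column names, the helper functions and the sample values are all assumptions made for the example, and an in-memory SQLite database stands in for a full database server.

```python
import sqlite3

# Assumption: in-memory SQLite as a stand-in for the real database server.
conn = sqlite3.connect(":memory:")

# Hypothetical relation: one tuple per gene/EST and hybridisation, holding
# the two intensities and two background values of the duplicate spots.
conn.execute(
    """
    CREATE TABLE spot_intensity (
        hybridisation_id INTEGER NOT NULL,
        spot_id          INTEGER NOT NULL,
        intensity_1      REAL NOT NULL,
        background_1     REAL NOT NULL,
        intensity_2      REAL NOT NULL,
        background_2     REAL NOT NULL,
        PRIMARY KEY (hybridisation_id, spot_id)
    )
    """
)
# B-tree indices on the intensities support threshold and interval queries.
conn.execute("CREATE INDEX idx_intensity_1 ON spot_intensity (intensity_1)")
conn.execute("CREATE INDEX idx_intensity_2 ON spot_intensity (intensity_2)")

# Made-up sample values, for illustration only.
conn.executemany(
    "INSERT INTO spot_intensity VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 101, 1520.0, 80.0, 1490.0, 75.0),
        (1, 102, 210.0, 90.0, 230.0, 85.0),
        (2, 101, 3300.0, 70.0, 3100.0, 72.0),
        (2, 102, 950.0, 60.0, 1010.0, 65.0),
    ],
)


def spots_in_interval(conn, low, high):
    """Return (hybridisation_id, spot_id) where both duplicates lie in [low, high]."""
    cur = conn.execute(
        "SELECT hybridisation_id, spot_id FROM spot_intensity "
        "WHERE intensity_1 BETWEEN ? AND ? AND intensity_2 BETWEEN ? AND ? "
        "ORDER BY hybridisation_id, spot_id",
        (low, high, low, high),
    )
    return cur.fetchall()


def corrected_means(conn, hybridisation_id):
    """Compute the mean background-subtracted signal per spot inside the database."""
    cur = conn.execute(
        "SELECT spot_id, "
        "((intensity_1 - background_1) + (intensity_2 - background_2)) / 2.0 "
        "FROM spot_intensity WHERE hybridisation_id = ? ORDER BY spot_id",
        (hybridisation_id,),
    )
    return cur.fetchall()
```

Calling `spots_in_interval(conn, 900.0, 2000.0)` selects tuples at the database level, and `corrected_means(conn, 1)` shows a simple calculation performed before any data are loaded into memory. Whether one tuple should hold both duplicates or one duplicate each is a design choice the text leaves open.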

1.2 Storing gene annotations

1.2.1 What to store

Gene annotations may consist of clone numbers, accession numbers and different kinds of entries describing the spotted sequence or the encoded protein, such as chromosomal location, enzyme categorization number or protein structure. Here we include only identifiers serving as keys to connect to databases containing gene information, short variable-length free-text descriptions of the protein and its functional category, and the spot location. Moreover, it turned out to be necessary to explicitly keep track of the array on which the spot is located, in cases where the spot set comprises more than one array and each of them has been hybridised separately.

1.2.2 How to store it

Because complex sequence annotations or enzyme properties are found in the linked gene databases, the gene annotations can be stored in a single relation containing attributes for the above values, so that every spotted element (gene or EST) is described by one tuple.
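A single gene annotation relation of this kind might look like the sketch below. The relation name `gene_annotation`, all column names, the helper `spots_on_array` and the placeholder rows are assumptions for illustration; an in-memory SQLite database again stands in for the real system.

```python
import sqlite3

# Assumption: in-memory SQLite as a stand-in for the real database server.
conn = sqlite3.connect(":memory:")

# Hypothetical single relation: one tuple per spotted element (gene or EST).
conn.execute(
    """
    CREATE TABLE gene_annotation (
        spot_id             INTEGER PRIMARY KEY,
        external_id         TEXT NOT NULL,     -- key into a linked gene database
        description         TEXT,              -- short free-text protein description
        functional_category TEXT,
        array_no            INTEGER NOT NULL,  -- which array of the spot set
        grid_row            INTEGER NOT NULL,  -- spot location on that array
        grid_column         INTEGER NOT NULL
    )
    """
)

# Placeholder rows; identifiers and descriptions are invented.
conn.executemany(
    "INSERT INTO gene_annotation VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (101, "EXT-0001", "placeholder protein A", "placeholder category", 1, 3, 7),
        (102, "EXT-0002", "placeholder protein B", "placeholder category", 2, 1, 4),
    ],
)


def spots_on_array(conn, array_no):
    """List (spot_id, external_id) for all spots located on one array."""
    cur = conn.execute(
        "SELECT spot_id, external_id FROM gene_annotation "
        "WHERE array_no = ? ORDER BY spot_id",
        (array_no,),
    )
    return cur.fetchall()
```

The `array_no` attribute reflects the need, noted above, to keep track of the array a spot belongs to when a spot set spans several separately hybridised arrays.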

1.3 Storing experimental annotations

1.3.1 What to store

Experimental annotations comprise, to give some examples, the description of environmental conditions, genotype, patient data, information about surgery, type of tissue (including the estimated degree of contamination by other cell types), the sampling method, and annotations related to the hybridisation protocol, the properties of the individual array or the imaging process. They fall into two realms:

1.: Organism-specific annotations reflecting the needs of the specific research area, e.g. 'transgene' and 'growth phase' for yeast or 'tumor type' and 'metastasis location' for human biopsies.
2.: Common annotations that are useful for all fields of interest. These technique-related properties, such as array characteristics, the description of labelling, hybridisation and washing conditions, or the detection of the signals, are annotated by all users.

1.3.2 How to store it

1.3.2.1 Flexibility

Experimental annotations are set up by the biologists working in the field. They tend to grow with every new type of experiment performed. To account for this, any implementation will be useless unless it enables easy and quick addition of new annotations, or of further values for already defined annotations, without altering the database schema or the analysis algorithms. If an annotating experimenter finds something he or she forgot to define before uploading an experiment, the database should be flexible enough to incorporate a new annotation or value within a minute.
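One common way to meet this flexibility requirement is to treat annotation definitions and their permitted values as data rather than as columns, so that adding either is an ordinary insert. The sketch below illustrates the idea only; it is not claimed to be the M-CHIPS design, and all relation, column and function names are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite as a stand-in for the real database server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the references below
conn.executescript(
    """
    CREATE TABLE annotation (
        name  TEXT PRIMARY KEY,
        realm TEXT NOT NULL            -- e.g. 'yeast' or 'common'
    );
    CREATE TABLE allowed_value (
        annotation TEXT NOT NULL REFERENCES annotation(name),
        value      TEXT NOT NULL,
        PRIMARY KEY (annotation, value)
    );
    CREATE TABLE experiment_annotation (
        hybridisation_id INTEGER NOT NULL,
        annotation       TEXT NOT NULL,
        value            TEXT NOT NULL,
        PRIMARY KEY (hybridisation_id, annotation),
        FOREIGN KEY (annotation, value)
            REFERENCES allowed_value(annotation, value)
    );
    """
)


def define_annotation(conn, name, realm, values):
    """Add a new annotation with its initial values; no schema change needed."""
    with conn:
        conn.execute("INSERT INTO annotation VALUES (?, ?)", (name, realm))
        conn.executemany(
            "INSERT INTO allowed_value VALUES (?, ?)",
            [(name, v) for v in values],
        )


def add_value(conn, annotation, value):
    """Extend an already defined annotation by one further value."""
    with conn:
        conn.execute("INSERT INTO allowed_value VALUES (?, ?)", (annotation, value))


def annotate(conn, hybridisation_id, annotation, value):
    """Annotate a hybridisation; raises sqlite3.IntegrityError for undefined values."""
    with conn:
        conn.execute(
            "INSERT INTO experiment_annotation VALUES (?, ?, ?)",
            (hybridisation_id, annotation, value),
        )


define_annotation(conn, "growth phase", "yeast", ["exponential", "stationary"])
annotate(conn, 1, "growth phase", "exponential")
```

An experimenter who later needs the value 'pseudo-hyphal' would call `add_value(conn, "growth phase", "pseudo-hyphal")` and could then use it immediately, while analysis code that reads from these three relations stays unchanged.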

1.3.2.2 Performance

Gene and experimental annotations taken together account for less than 0.35% of the storage space of the yeast database (May 2000). Since data entered directly by human beings will in any case make up far too small a share to be relevant for query performance, flexibility remains the only time-saving aspect related to experimental annotations.

1.3.2.3 Analysis aspects

When conceptualizing structures for data storage, one might prefer formats that support a wider range of analytical access to the data than others. Suppose the experimental annotations, though ordered into various categories and subcategories, were text fields containing a free-text description of the annotation value, e.g. the yeast-specific annotation

growth phase - value: 'exponential'.

From people querying sequence databases in a high-throughput manner, one can learn that free text causes severe problems, such as misspellings, different words having the same meaning and various types of abbreviations, which make it hard to analyse the contents of a text field across a large number of datasets. On the other hand, one would expect the number of tuples (hybridisations / multiconditional experiments) in a public expression database, once established, to grow quite fast. People might cluster these tuples by the expression behavior of a set of genes and would want to know which growth conditions, experimental settings, genotypes or environmental conditions of the organism correspond to a particular cluster. In other words: which properties are common to hybridisations that share similar expression patterns? This question cannot be answered by visual inspection alone when looking at hundreds of hybridisations with huge numbers of sample properties.

Sample descriptions are therefore preferable if they can be included in the process of algorithmic analysis. To make them accessible to statistical analysis, the values of an experimental annotation should be directly comparable among the datasets. If, for example, we let the above annotation 'growth phase' be an enumeration-type variable comprising the defined values 'exponential', 'stationary' and 'pseudo-hyphal', the occurrence of the value 'exponential' can be counted within the cluster and compared with its overall occurrence to determine whether it is characteristic of the cluster (either over- or underrepresented).

The prerequisite is that the annotation values are enumerable. Apart from the enumeration-type annotations already mentioned, floating point numbers can be made enumerable by mapping them to a set of bins, e.g. such that each bin covers an equally spaced range of values, or in another manner that seems suitable for the particular annotation in terms of biological relevance.
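The counting and binning described above can be sketched in a few lines. The function names, the dictionary layout and the sample annotations are assumptions for illustration; the comparison shown is a plain ratio of frequencies, not a significance test, and a real analysis would likely add one.

```python
from collections import Counter


def representation(values, cluster, value):
    """Compare how often `value` occurs inside a cluster and overall.

    values:  dict mapping hybridisation id -> enumerated annotation value
    cluster: set of hybridisation ids forming the cluster
    Returns (fraction within cluster, fraction overall).
    """
    overall = Counter(values.values())
    inside = Counter(values[h] for h in cluster)
    return inside[value] / len(cluster), overall[value] / len(values)


def equal_width_bin(x, low, high, n_bins):
    """Map a floating point value in [low, high] to a bin index 0..n_bins-1."""
    if not low <= x <= high:
        raise ValueError("value outside the binned range")
    width = (high - low) / n_bins
    # The upper boundary itself is assigned to the last bin.
    return min(int((x - low) / width), n_bins - 1)


# Made-up 'growth phase' annotations for eight hybridisations.
growth_phase = {
    1: "exponential", 2: "exponential", 3: "stationary", 4: "stationary",
    5: "pseudo-hyphal", 6: "exponential", 7: "stationary", 8: "stationary",
}
cluster = {1, 2, 3, 6}  # hypothetical cluster of similar expression patterns

in_cluster, overall = representation(growth_phase, cluster, "exponential")
```

Here 'exponential' accounts for 3 of the 4 hybridisations in the cluster but only 3 of 8 overall, so it is overrepresented in the cluster. A numeric annotation, for instance a hypothetical growth temperature between 20.0 and 40.0, becomes enumerable via `equal_width_bin(30.0, 20.0, 40.0, 4)`, after which its bin indices can be counted in the same way.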

1.3.2.4 M-CHIPS storage

Our implementation currently handles 33 yeast-specific, 70 Arabidopsis-specific, 54 human tumor-specific, 41 Trypanosoma-specific and 76 common (technical) highly categorized experimental annotations. They were set up by biologists working in these fields and enable statistical analysis of the descriptions of nearly 700 hybridisations stored in a PostgreSQL database. The following, more practical aspects deal with the realization of such a database in a multiuser setting.

Kurt Fellenberg
2001-10-24