Next: 3. Implementation Up: M-CHIPS Database Report Previous: 1. General Considerations concerning

2. Realization concepts

M-CHIPS provides a storage concept for unified analytical access to microarray experiments from different fields of research. Each instance of this concept is a field-specific (organism-specific) database. Although these databases adopt different ontologies for experiment annotation, they can be accessed by the very same analysis algorithms. They are designed to be used by the people who generate the data. To meet the requirements of these users, they have to allow for multiuser access, including safe management of simultaneous write access, short waiting periods and privacy (protection against unauthorized access).

2.1 Safety considerations

2.1.1 Global accidents

2.1.1.1 Transactions

DBMSs that can administer more than one version of a database at the same time (such as Oracle or PostgreSQL) protect the integrity of the stored data by means of transactions. Transactions give databases an all-or-nothing capability when making modifications. A transaction can comprise one or several queries: all of the changes become valid upon successful execution of the whole transaction, and none of them in case of an error. At the same time, all other users are insulated from seeing the partially completed transaction until the very moment it is committed, which prevents database consistency from being damaged by simultaneous write access. Although transaction-based database management slows down access, we recommend using a transaction-based DBMS.
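The all-or-nothing behaviour can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module only as a convenient stand-in for a transaction-based DBMS; the table `intensity`, its columns and the helper `insert_all_or_nothing` are hypothetical names chosen for this illustration and are not part of M-CHIPS.

```python
import sqlite3

# SQLite stands in for the transaction-based DBMS; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE intensity (spot_id INTEGER PRIMARY KEY, value REAL NOT NULL)"
)
conn.commit()


def insert_all_or_nothing(conn, rows):
    """Insert every row in one transaction; keep none of them if any row fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO intensity (spot_id, value) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.Error:
        return False


# First batch succeeds as a whole.
ok = insert_all_or_nothing(conn, [(1, 10.5), (2, 11.0)])

# The last row of the second batch repeats spot_id 1 and violates the
# primary key, so the rows with spot_id 3 and 4 are rolled back as well.
failed = insert_all_or_nothing(conn, [(3, 12.0), (4, 13.0), (1, 99.0)])

row_count = conn.execute("SELECT COUNT(*) FROM intensity").fetchone()[0]
```

After both calls the table still holds only the two rows of the first batch, because the failed batch left no partial changes behind.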

2.1.1.2 Global backups

Although the choice of a transaction-based DBMS ensures a great amount of safety for the data, absolute security cannot be guaranteed. In case of a disk head crash or a failure of the server's power supply while important system catalogues are being updated, the integrity of all databases managed by the server may well be destroyed at the same time. In such a case we restore the whole database system from tape backup to its state of the previous night.

2.1.2 Private errors

If hybridisations are accidentally deleted from a single database, it would be inappropriate to reset the whole system to the state of the night before. To be prepared for such a case, SQL dumps are performed separately for each database overnight. They consist of SQL queries that can be used to restore data subsets ranging from a whole database down to a single tuple of a particular table.
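As a rough illustration of restoring a single tuple from an SQL dump, the following sketch again uses Python's sqlite3 module as a stand-in. The table `hybridisation`, its columns and its contents are hypothetical, and replaying the dump into a scratch database is just one possible restore procedure, not necessarily the one used in M-CHIPS.

```python
import sqlite3

# Hypothetical per-field database with two hybridisations.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE hybridisation (id INTEGER PRIMARY KEY, label TEXT)")
live.executemany(
    "INSERT INTO hybridisation VALUES (?, ?)", [(1, "hyb_A"), (2, "hyb_B")]
)
live.commit()

# Overnight job: dump this one database as a sequence of SQL statements.
dump_sql = "\n".join(live.iterdump())

# The next day a hybridisation is deleted by accident.
live.execute("DELETE FROM hybridisation WHERE id = 2")
live.commit()

# Restore only the lost tuple: replay the dump into a scratch database,
# look up the missing row there and copy it back into the live database.
scratch = sqlite3.connect(":memory:")
scratch.executescript(dump_sql)
lost = scratch.execute(
    "SELECT id, label FROM hybridisation WHERE id = 2"
).fetchone()
live.execute("INSERT INTO hybridisation VALUES (?, ?)", lost)
live.commit()

restored = live.execute(
    "SELECT id, label FROM hybridisation ORDER BY id"
).fetchall()
```

The other databases on the server, and all other tuples of this database, are left untouched by the restore.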

2.1.3 Unauthorized access

To ensure that data (which may be unpublished) can be neither altered nor read by unauthorized individuals, update and/or read permissions can be granted on any database table to a particular user. Granting such permissions to user groups rather than separately to each user is a common way of avoiding the need to change permissions for each database table whenever a new user is registered. In our implementation nearly all the relations inherit from a few parental tables and are accessed via their parental table only. Permission inheritance enables the administrator to quickly grant, e.g., read access to a new user by changing permissions for a few parental tables instead of dealing with many tables or user groups. However, the main reason for access via parental tables is to enable pooling of tuples from hybridisation tables into large blocks without altering the syntax of the accessing queries (described in 2.2.2).

2.2 Performance considerations

Since the overall extent of data referring to gene descriptions and experimental annotations is minimal (see 1.3.2.2), performance considerations are related only to hybridisation intensities.

2.2.1 Minimizing query space

Dividing the entirety of spots into subsets that match the types of queries performed is already quite effective. Most of the analysis queries target genes rather than empty or control spots, so we recommend storing at least the genes separately from the rest. In our implementation the spots are kept in tables belonging to (and inheriting from) 5 different parental tables comprising

genes (genes / ESTs - incl. housekeeping)
empty spots (no DNA has been spotted)
heterologous DNA (e.g. guide spots)
heterologous DNA with known concentration (external control spots for 'spiking', i.e. assaying standard RNA aliquots added before the labelling step)
reference spots (reserved for a novel category of control spots).

2.2.2 Separate or block-wise storage

As already mentioned in 1.1.2, fast querying of tuples is mediated by indices. If each of the above categories held all hybridisations stored so far as one big block table, adding a new hybridisation would be quite slow because of the time needed to recompute the indices. For this reason, every new hybridisation is inserted as 5 new separate relations, and indices are computed only for the new tuples.

However, querying for certain values slows down as the number of separate tables increases, because there is no global index guiding the search directly to the table containing the tuple. This structure, while enabling high performance for write / delete operations, impedes fast read access. In order to optimize for both writing and reading operations, we

write / delete hybridisations as separate tables, but
read from large blocks,

which are produced by overnight jobs that join those tables (hybridisations) that are no longer to be altered or deleted. Thus, large indices are computed at times of low traffic (as an investment in query performance).

While storing hybridisations in blocks alters the database structure (decreasing the number of tables), this remains totally insulated from and invisible to the accessing software (algorithmic layer). Since every access to the intensity tables is directed via one of the five parental tables listed in 2.2.1, query syntax does not change with the 'assembly' of a set of tables into one block: the block will be a child of the same parental table as were the tables it collects.
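The idea that queries stay unchanged when tables are assembled into a block can be sketched as follows. Note the assumptions: SQLite has no table inheritance, so a view named `genes` built with UNION ALL stands in for the parental table and has to be redefined by the overnight job, whereas with true inheritance the parental table would cover its children automatically. All table, column and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The accessing software only ever issues this query against the "parent".
QUERY = "SELECT hyb_id, spot_id, value FROM genes ORDER BY hyb_id, spot_id"


def make_child(conn, name, rows):
    """Store one hybridisation as its own table with its own small index."""
    conn.execute(
        f"CREATE TABLE {name} (hyb_id INTEGER, spot_id INTEGER, value REAL)"
    )
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
    conn.execute(f"CREATE INDEX {name}_idx ON {name} (spot_id)")


def define_parent(conn, children):
    """Stand-in for inheritance: a view that unions all child tables."""
    conn.execute("DROP VIEW IF EXISTS genes")
    union = " UNION ALL ".join(f"SELECT * FROM {child}" for child in children)
    conn.execute(f"CREATE VIEW genes AS {union}")


# Daytime: each new hybridisation is written as a separate table.
make_child(conn, "genes_hyb1", [(1, 1, 0.5), (1, 2, 1.5)])
make_child(conn, "genes_hyb2", [(2, 1, 0.7), (2, 2, 1.1)])
define_parent(conn, ["genes_hyb1", "genes_hyb2"])
before = conn.execute(QUERY).fetchall()

# Overnight: assemble the finished hybridisations into one indexed block.
conn.execute(
    "CREATE TABLE genes_block1 AS "
    "SELECT * FROM genes_hyb1 UNION ALL SELECT * FROM genes_hyb2"
)
conn.execute("CREATE INDEX genes_block1_idx ON genes_block1 (spot_id)")
define_parent(conn, ["genes_block1"])
conn.execute("DROP TABLE genes_hyb1")
conn.execute("DROP TABLE genes_hyb2")

# The very same query text returns the same tuples after the assembly.
after = conn.execute(QUERY).fetchall()
```

The number of tables drops from two to one, while `QUERY` is never edited; only the block's single large index is built during the overnight step.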

2.3 Flexibility considerations

To meet the requirements described in 1.3.2.1, the categorization of experimental annotations should be kept in definition tables rather than mapped to the database structure itself. In our concept, annotations along with their defined values are stored in a definition table. Each annotation has a unique identification number (ID). The annotations are stored as a linked list, each including an attribute that points to the ID of the annotation next in sequence. The ID serves as a key for querying the annotations, while the defined sequence provides a clear list structure that facilitates the annotation process. The annotations are structured by a set of headings and subheadings with an arbitrary nesting depth, which are stored in a second table. The linked list structure allows a new annotation to be added at an arbitrary position by linking the desired predecessor to a new element that points to the ID of the element following in the list. In a similar manner, the whole set of defined values is numbered sequentially to enable rapid queries and is stored as a linked list in the same table as the annotations.
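The linked-list insertion described above can be sketched in a few lines. This is only an in-memory model under stated assumptions: the class `AnnotationTable`, its attribute names and the example annotation names are hypothetical, and a dictionary keyed by ID stands in for the definition table.

```python
class AnnotationTable:
    """Rows keyed by ID; each row stores the ID of the next annotation."""

    def __init__(self):
        self.rows = {}           # id -> {"name": str, "next_id": int or None}
        self.head_id = None      # ID of the first annotation in sequence
        self._next_free_id = 1   # IDs are assigned in order of creation

    def insert_after(self, predecessor_id, name):
        """Add an annotation behind predecessor_id (None means at the front)."""
        new_id = self._next_free_id
        self._next_free_id += 1
        if predecessor_id is None:
            self.rows[new_id] = {"name": name, "next_id": self.head_id}
            self.head_id = new_id
        else:
            predecessor = self.rows[predecessor_id]
            # The new element points to the former successor ...
            self.rows[new_id] = {"name": name, "next_id": predecessor["next_id"]}
            # ... and the predecessor is re-linked to the new element.
            predecessor["next_id"] = new_id
        return new_id

    def in_sequence(self):
        """Follow the next_id pointers and return the names in list order."""
        names, current = [], self.head_id
        while current is not None:
            names.append(self.rows[current]["name"])
            current = self.rows[current]["next_id"]
        return names


table = AnnotationTable()
first = table.insert_after(None, "organism")
last = table.insert_after(first, "temperature")
# A later addition is placed between the two existing annotations.
middle = table.insert_after(first, "strain")
order = table.in_sequence()
```

The IDs reflect the order of creation (1, 2, 3), while the sequence shown to the annotator follows the pointers, so existing IDs never have to be renumbered when an annotation is inserted.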

To prepare for the administration of experiments related to a new field of research, it is sufficient to generate an empty database with definition tables containing the up-to-date list of common annotations, along with a new second half of both the annotation definition table and the heading table containing the 'organism-specific' annotations for the new field of experiments. A growing number of already assembled definition lists makes it easier to compile new ones, since they serve as templates for the description of similar experimental procedures.

2.4 Considerations related to Analysis

As described in 1.3.2.3, the annotation values should be categorized down to an enumerable level, either directly by creating an enumeration type annotation, or by storing a floating point number. These numbers are stored along with a unit if one is required to give the value an unambiguous meaning. They do not necessarily have to be non-integer. Discretizing numbers is reasonable in cases where

similar values are expected to have the very same meaning in terms of their biological impact and
the probability that those equivalent values match the very same number is low because of measurement errors.
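A minimal sketch of such a discretization is given below. The function `discretize`, the bin width and the example measurements are assumptions made for illustration; the report does not prescribe a particular binning scheme.

```python
import math


def discretize(value, bin_width):
    """Round a measurement to the nearest multiple of bin_width."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # floor(x + 0.5) rounds halves upwards, unlike Python's built-in round().
    return math.floor(value / bin_width + 0.5) * bin_width


# Hypothetical temperature readings (degrees Celsius) with measurement noise.
measurements = [36.8, 37.1, 37.4, 41.9]
levels = [discretize(m, 1.0) for m in measurements]
```

The first three readings, which would rarely coincide as raw floating point numbers, fall onto the same level and can then be treated as one enumerable annotation value, while the fourth remains distinct.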

Kurt Fellenberg
2001-10-24