Filtering the data
Why filter your data?
Correspondence analysis (CA) associates a gene with a specific condition
more strongly the larger the share of that gene's total signal intensity
(summed over all conditions) that falls on that specific condition. Please
note that only the relative values matter for the tightness of the
association: the intensity level at which the association is observed is
not relevant!
This makes sense, because you would like to see associations, but it also
implies a strong need for filtering. Ratios calculated from genes with low
intensity values (background level) tend to be extremely high as well as
very inconsistent, which in turn can result in very strong relationships in CA.
To avoid this, one should filter out genes that are not expressed in the analysed conditions.
Thus, the most important filter for any data (irrespective of the particular
platform they come from) is intensity filtering.
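To illustrate why the intensity level does not matter to the association, here is a minimal sketch with made-up intensities. The function name `relative_profile` and the numbers are assumptions for illustration only; this is not M-CHiPS code and not a full correspondence analysis.

```python
def relative_profile(intensities):
    """Share of a gene's total signal that falls on each condition."""
    total = sum(intensities)
    return [value / total for value in intensities]

# Hypothetical intensities for (control, condition1).
expressed_gene = [200.0, 4000.0]   # clearly above background
background_gene = [2.0, 40.0]      # at background level, 20-fold ratio by chance

# Both genes put the same share of their signal on condition1,
# so they look equally strongly associated with it.
print(relative_profile(expressed_gene))
print(relative_profile(background_gene))
```

Both profiles are the same although the second gene is not expressed at any relevant level, which is the situation an intensity filter is meant to remove.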
In the case of multichannel platforms, the original intensities do not directly
reflect mRNA abundances (or whatever is measured) because the channels
compete: a low red channel may reflect either a low abundance of the
red mRNA or a great excess of green mRNA that took all the binding
sites of the spot. From a low signal we cannot deduce a low mRNA abundance.
In order to estimate proper intensities to be used for filtering, M-CHiPS
takes the ratios and multiplies them by the gene-wise median of the
absolute intensities of the control channels (whenever processing multichannel data).
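The estimate described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `estimate_intensities`, the argument layout (one ratio and one control-channel intensity per hybridization of a single gene), and the numbers are hypothetical and not taken from M-CHiPS.

```python
from statistics import median

def estimate_intensities(ratios, control_intensities):
    """Estimate filterable intensities for one gene.

    ratios: condition/control ratio of this gene, one per hybridization.
    control_intensities: absolute control-channel intensities of this gene.
    """
    reference = median(control_intensities)  # gene-wise median of the control channels
    return [ratio * reference for ratio in ratios]

# Hypothetical gene measured in three hybridizations.
print(estimate_intensities([0.5, 2.0, 4.0], [100.0, 120.0, 80.0]))
# -> [50.0, 200.0, 400.0]
```

Scaling every ratio by the same gene-wise reference keeps the ratios between conditions intact while placing them on an absolute intensity scale.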
How to operate the filters?
Select "Edit" in the main menu, then select "Filter".
There are three filters (filter1 to filter3). Only genes meeting the requirements of all three filters survive the filtering.
Click on any of the three filters:
a) Selected Genes: >=
b) Threshold:
c) Filter by:
d) Filter the: Max (Or)
e) of control (not for quality!): No
f) of condition1: Yes
g) of ...
For each gene, one value is compared to (smaller or greater than, (a)) a
certain threshold (b). To compute this one value per gene, you define a set of
conditions ((e) to (g); exclude conditions by "no"). Assuming your set contains 3
conditions (3x "yes"), you have 3 values per gene. If you choose "max" (d), the
maximum of the three values is compared to the threshold (b).
If that maximum is greater than the threshold (b) (and you have selected greater
than (a)), the gene survives the filter. This is equivalent to "or" (d) in the
sense that the gene survives if the control or condition1 or
condition2 is greater than the threshold. This is suitable, e.g., in
combination with "fitted intensities" (c), in order to let a gene survive if, and
only if, it is transcribed to a certain intensity level in at least one of the
conditions under study (discarding all genes that stay below the
threshold throughout the experiment).
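The logic of a single filter with "max", and of combining several filters, can be sketched like this. It is a minimal illustration, not the M-CHiPS implementation: the names `passes_filter` and `survives_all`, the dictionary layout, and the example values are assumptions, and for simplicity every filter looks at the same kind of value.

```python
import operator

COMPARISONS = {">=": operator.ge, "<=": operator.le}

def passes_filter(values, selected, threshold, comparison=">="):
    """values: condition name -> value for one gene (e.g. fitted intensity).
    selected: condition name -> True ("yes") or False ("no")."""
    chosen = [values[name] for name, use in selected.items() if use]
    # "max": only the largest selected value is compared to the threshold.
    return COMPARISONS[comparison](max(chosen), threshold)

def survives_all(values, filters):
    """A gene survives only if it meets the requirements of every filter."""
    return all(passes_filter(values, **settings) for settings in filters)

gene = {"control": 120.0, "condition1": 900.0, "condition2": 80.0}
all_yes = {"control": True, "condition1": True, "condition2": True}
without_condition1 = {"control": True, "condition1": False, "condition2": True}

print(passes_filter(gene, all_yes, threshold=500.0))             # True
print(passes_filter(gene, without_condition1, threshold=500.0))  # False
```

With all conditions selected, the gene survives because condition1 exceeds the threshold; once condition1 is excluded, no remaining value does.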
Free combination of the above components makes it possible to filter out, for example:
genes never above the detection limit (as above),
genes affected by saturation (>=, big threshold, raw intensities, max, all "yes" or only the conditions affected),
genes below a certain fold-change compared to the control condition (>=, 2,
"by ratios" will account both for >=2x and <=0.5x, max, all "yes"),
genes showing low quality (reproducibility).
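The fold-change example treats a 2-fold increase and a 2-fold decrease alike. One way to sketch such a symmetric comparison is shown below. How M-CHiPS computes this internally is not stated here, so taking the larger of the ratio and its reciprocal is an assumption, as are the function names and values.

```python
def symmetric_fold(ratio):
    """Fold-change regardless of direction: 0.5 and 2.0 both count as 2-fold."""
    return max(ratio, 1.0 / ratio)

def passes_fold_filter(ratios, threshold=2.0):
    """ratios: condition/control ratios of one gene for the selected conditions.
    The gene survives if its largest fold-change reaches the threshold."""
    return max(symmetric_fold(r) for r in ratios) >= threshold

print(passes_fold_filter([0.5, 1.1]))  # True: 2-fold down in one condition
print(passes_fold_filter([1.5, 0.7]))  # False: never reaches 2-fold
```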
Which filtering to apply?
Saturation (or not)
Saturation effects are visible in the scatterplots displayed during
normalization. If some dots look as if they had been shoved into a straight
border line by a snowplough instead of thinning out in a natural fashion,
that is a saturation effect.
If you never see such a thing, you can skip the next paragraph because you
do not need a saturation filter.
If the snowplough seems to have pushed from the right side (vertical border
line), there is a saturation effect (i.e. a certain intensity that cannot be
exceeded for technical reasons) on the x-axis, i.e. in the control. If the
snowplough came from above (horizontal border line), the intensity that cannot
be exceeded is on the y-axis, i.e. in the non-control condition. The
intensity that cannot be exceeded can be read from the scatterplot. It
should give you a good idea of a threshold for saturation filtering.
Saturation intensities may differ from hybridization to hybridization.
If the saturation effects are limited to one or two conditions, it seems advisable
to deselect all other conditions in this filter, in order not to kill any
genes that exceed the threshold only in conditions unaffected by saturation.
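The effect of restricting the saturation filter to the affected conditions can be sketched as follows. The function name `is_saturated`, the condition names, and the threshold of 60000 are assumptions for illustration; in practice the threshold would be read from the scatterplot.

```python
def is_saturated(raw, affected_conditions, saturation_threshold):
    """raw: condition name -> raw intensity of one gene.
    Only the conditions listed as affected are checked."""
    return max(raw[name] for name in affected_conditions) >= saturation_threshold

# Hypothetical gene: very bright in the control, moderate in condition1.
gene = {"control": 65000.0, "condition1": 30000.0}

# Saturation was only seen in condition1, so only that condition is selected.
print(is_saturated(gene, ["condition1"], 60000.0))             # False: gene is kept
# Selecting all conditions would remove the gene although the control is unaffected.
print(is_saturated(gene, ["control", "condition1"], 60000.0))  # True
```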
Apart from filtering out genes that are not expressed at relevant levels
(intensity filters), it is also possible to filter by reproducibility
measures. Here the repetitions of the conditions are taken into account.
Choose a quality measure according to the number of repetitions you have:
6 or more:
Permutation-based tests are best. If you have 6 or more repetitions of each condition,
export to SAM (FileI/O, common exports, log2SAM) and re-import the q-values
(Edit, p-values, import from SAM). You can also determine a set of genes
with a reasonable false discovery rate and import this set back for
filtering.
4 or more:
If you have between 4 and 6 repetitions, you can use eBayes-computed
p-values (Edit, p-values, then filter by p-value).
2 to 3:
If you have extremely low numbers of repetitions, use
minmax separation (explained below).
The x's represent the repetitions of one gene in conditionA, and the o's
represent the repetitions of the same gene in (for instance) the control
condition. The difference between max(o) and min(x) is the
Minmax-Separation.
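A minimal sketch of this measure is given below. The text describes the case where the x's lie above the o's; handling the opposite direction symmetrically, returning a negative value when the two groups overlap, and the function name itself are assumptions for illustration.

```python
def minmax_separation(condition_values, control_values):
    """Gap between the two groups of repetitions for one gene.

    Positive: the groups are fully separated by that distance.
    Negative: the ranges of the two groups overlap.
    """
    gap_if_condition_higher = min(condition_values) - max(control_values)
    gap_if_control_higher = min(control_values) - max(condition_values)
    return max(gap_if_condition_higher, gap_if_control_higher)

# Hypothetical repetitions of one gene.
print(minmax_separation([5.1, 5.4, 5.0], [3.2, 3.9, 3.5]))  # about 1.1: separated
print(minmax_separation([4.0, 5.0], [4.5, 3.0]))            # -0.5: overlapping
```

Because only the two closest repetitions determine the result, a single outlier is enough to make the separation vanish, which is why the measure becomes very strict as the number of repetitions grows.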
The Std. deviation separation should be applied when a high number of
repetitions is available. This method is less restrictive than the
Minmax-Separation and accounts for the higher chance of obtaining outliers (as in
the figure) when increasing the number of repetitions.
KEEP IN MIND: When applying this filter you will most likely see well
separated clusters in CA, BUT this is not necessarily the 'natural'
information that was in your data, since you filtered out all genes that did
not show any separation from your control condition!
These measures are motivated by having only low numbers of repeated
measurements at hand (in comparison to experiments e.g. in physics). For details, see
/mchipsocgi/faqdb?attrib1=keywords&constr1=filtering
Beißbarth, T., Fellenberg, K., Brors, B., Arribas-Prat, R., Boer, J. M.,
Hauser, N. C., Scheideler, M., Hoheisel, J. D., Schütz, G., Poustka, A., &
Vingron, M. (2000) Bioinformatics, 16, 1014-1022.
or e.g.
resultinfo.html#cclist
http://kups.ub.uni-koeln.de/volltexte/2003/364/pdf/11w1296.pdf
on pages 35-38.
Filter strategies
1. The most important filter is an intensity filter! Its purpose is to
discard all genes that remain below the detection limit under all conditions
under study. Some of these have highly significant p-values, most of them
large ratios, and so on. That is why there is no substitute for intensity
filtering.
2. Next, if any saturation effects are visible during normalization, see to it
that you get rid of them.
3. Perform a CA without HMS to identify (and delete: Edit - Drop - measurement)
outlying hybridizations. A single bad hyb can mess up your whole filtering
process! In many cases, outliers are already recognizable by low correlation
coefficients (during normalization).
Until now, no bias has been introduced! Please keep in mind that the following
measures favor false positives over false negatives!
4. To increase the clarity of the whole picture, you may apply a quality filter.
4.1 A ratio filter is largely redundant: the quality filter will already have
killed all the genes that do not change. E.g. for CA, they would be in the
middle of the plot where you do not see them anyway. You may still want to apply
it to produce color-coded lists that exclusively include genes with the magic
2-fold change, though >;-)) (sorry for being bitchy).
For a recipe of how to find thresholds, what to start with, and what to do
next, please have a look at our workflow suggestion.