
As you may know, analytic methods like PCA or LDA are used everywhere in science. In my case, I'm a chemist, and I use it to differentiate molecules.

I have 4 different systems (let's call them A, B, C and D), and each of them gives a particular response to a given molecule. Let's say I have 2 molecules, Mol 1 and Mol 2 (this is an imaginary situation, but you get the idea).

I have the following responses:

|       | A        | B        | C        | D        |
|-------|----------|----------|----------|----------|
| Mol 1 | 42,41,43 | 45,44,46 | 20,19,21 | 12,11,13 |
| Mol 2 | 3,2,4    | 25,24,26 | 42,41,43 | 28,27,29 |

I have 3 measurements for each system, and for each molecule.

I used Python and the LDA class (`LinearDiscriminantAnalysis`) from the scikit-learn module to draw a 2D plot (this is a real plot, from real experiments):

[my plot: 2D scatter of the LDA scores, one colour per molecule]

I asked the LDA class to reduce the problem to 2 dimensions, to be able to draw a 2D plot. Each colour represents a molecule, i.e. a class. I have 3 points for each class. The crosses represent the means of the classes.
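For reference, that projection step looks roughly like this in scikit-learn. Note that LDA yields at most n_classes − 1 discriminant axes, so the two-molecule table above would give only one dimension; the Mol 3 row and the small jitter below are made up purely so a 2D example works:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Rows are replicate measurements, columns are the systems A, B, C, D.
X = np.array([
    [42, 45, 20, 12], [41, 44, 19, 11], [43, 46, 21, 13],  # Mol 1
    [ 3, 25, 42, 28], [ 2, 24, 41, 27], [ 4, 26, 43, 29],  # Mol 2
    [30, 10, 35, 40], [31, 11, 34, 41], [29,  9, 36, 39],  # Mol 3 (hypothetical)
], dtype=float)
y = np.array([0] * 3 + [1] * 3 + [2] * 3)  # class label per row

# Tiny jitter so the replicates are not perfectly collinear.
rng = np.random.default_rng(0)
X += rng.normal(scale=0.1, size=X.shape)

# Project onto the 2 discriminant axes (at most n_classes - 1 = 2 here).
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)  # shape (9, 2): one 2D point per replicate
```

Each row of `X_2d` is then one point on the plot, coloured by its entry in `y`.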

Each ellipse represents a 99% confidence interval for the group. I drew them myself; the width is ~5 times the x standard deviation and the height ~5 times the y standard deviation.
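A sketch of how such an ellipse can be drawn with matplotlib, using made-up 2D scores for one class (note this per-axis construction ignores any correlation between the two discriminant axes; a covariance-based ellipse would follow the cloud's orientation better):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, for scripting
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

# Hypothetical LDA scores for one class (3 replicates, 2 discriminant axes).
scores = np.array([[1.0, 0.5], [1.2, 0.4], [0.9, 0.7]])

mean = scores.mean(axis=0)          # class mean (the cross on the plot)
std = scores.std(axis=0, ddof=1)    # per-axis sample standard deviation

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1])
ax.scatter(*mean, marker="x")
# Width/height ~5 standard deviations, as described above.
ax.add_patch(Ellipse(xy=mean, width=5 * std[0], height=5 * std[1],
                     fill=False, edgecolor="brown"))
```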

Now for the questions.

  • You can see in the picture that the brown ellipse overlaps the blue, the purple and the yellow ellipses. So if the point for an unknown molecule falls inside the brown ellipse, can I say which molecule it is, or not?

  • Second question. Let's say I perform a Tukey HSD analysis on the measurements of Mol 1 and Mol 2, at the same confidence level. If the test says Mol 1 and Mol 2 are not statistically different for each of the 4 systems, will their ellipses superimpose on the LDA plot?

EDIT:

Following @EdM's answer, I'd like to add a few things.

I would like to compare the Tukey HSD and the LDA. @EdM wrote:

> Second, confidence intervals for individual variables and tests for the differences between two variables have complicated interrelations. This issue is extensively discussed on this Cross Validated page for 1-dimensional distinctions by t-tests. Under reasonable assumptions, lack of overlap of 99% confidence intervals is about equivalent to p < 0.0002, not p < 0.01, for a difference between the mean values. Requiring non-overlap of confidence intervals is much more conservative than it seems at first glance.

And:

> 1) If the values for a new sample happen to fall within a particular confidence region, you may have some confidence that it belongs to the same class, but there will always be some error rate. Because there will always be some error, you have to decide on the relative costs of different errors. There are also issues of overfitting, and of resulting overconfidence in your classification, if all you did was analyze a set of data without separating it into training and test sets. These issues are covered by books like An Introduction to Statistical Learning, whose section 4.4.3 covers LDA with multiple predictors.

> 2) The complicated relations between confidence intervals for individual variables and confidence intervals for differences between variables make it difficult to give a clear answer. In principle it seems that you would also need to know how the directions of the differences in the 4 individual measures map onto your 2D plot. It's probably more important to know what the actual decision boundaries between classes are from the LDA, and the classification error rates near those boundaries.

So I understand that non-overlapping confidence intervals clearly mean two mean values are different (at least in the t-test case mentioned).

I also understand it is not the same thing in my LDA case: it is hard to clearly identify an unknown molecule, even if its measurements fall within a confidence interval.

However, the purpose of the Tukey HSD test is to tell whether two means are statistically different. Would it be possible/interesting to perform a Tukey HSD test after an LDA dimension reduction? It would, for example, compare the means of two groups and tell whether they are statistically different (I think that would be enough to say two molecules are different).
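For the record, the mechanics of a per-system Tukey HSD are straightforward; a sketch with the toy system-A numbers from the table, using `scipy.stats.tukey_hsd` (available in recent SciPy versions; with only two groups it reduces to a single pairwise comparison):

```python
from scipy.stats import tukey_hsd

# Three replicate responses of system A for each molecule (from the table).
mol1_A = [42, 41, 43]
mol2_A = [3, 2, 4]

res = tukey_hsd(mol1_A, mol2_A)
# res.pvalue is a 2x2 matrix; res.pvalue[0, 1] compares Mol 1 vs Mol 2 on system A.
print(res.pvalue[0, 1])
```

With more molecules, each group is simply passed as an extra argument and `res.pvalue` grows to a k×k matrix of pairwise comparisons.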

  • A difficulty, which you evidently understand, is that from these types of data you can't definitively "say which molecule it is." The best you can do is state probabilities; for practical application, you need to specify the relative costs of different types of mis-classifications. Also, note that the 99% confidence intervals you drew for individual molecules are not the same as 99% confidence intervals for distinguishing 2 molecules from each other.
    – EdM
    Commented Jun 8, 2015 at 13:26
  • Thanks for your answer. However, I'm not sure I entirely understand. What do you mean by "from these types of data you can't definitively 'say which molecule it is'"? Do you mean when 2 ellipses overlap, or in general? As for the second part of your answer, "the 99% confidence intervals you drew for individual molecules are not the same as 99% confidence intervals for distinguishing 2 molecules from each other", I have trouble understanding that. If the 99% confidence intervals of 2 groups don't overlap, shouldn't that mean you can distinguish them with 99% certainty?
    – Rififi
    Commented Jun 8, 2015 at 19:59

1 Answer


There are a couple of issues here about confidence intervals that need to be clarified.

First, remember that confidence intervals refer to the reliability of the estimation procedure, not to the probability that a value found within the interval will be a member of a particular class. As the Wikipedia page puts it:

"A 95% confidence interval does not mean that for a given realised interval calculated from sample data there is a 95% probability the population parameter lies within the interval, nor that there is a 95% probability that the interval covers the population parameter. Once an experiment is done and an interval calculated, this interval either covers the parameter value or it does not, it is no longer a matter of probability. The 95% probability relates to the reliability of the estimation procedure, not to a specific calculated interval."

In your diagram of 99% confidence regions, this means that if you repeated the sampling/measurement process multiple times and calculated the confidence regions the same way, 99% of those confidence regions would cover the true population mean value. That's not the same thing as being 99% sure that a value within the particular confidence region that you drew came from a member of the associated class.
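A quick simulation illustrates that coverage interpretation (the numbers here are arbitrary; 9.925 is the two-sided 99% t critical value for 2 degrees of freedom, matching n = 3 replicates):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n, trials = 42.0, 1.0, 3, 2000
t_crit = 9.925  # t critical value at 0.995, df = 2, for a 99% interval with n = 3

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sigma, size=n)
    m, s = sample.mean(), sample.std(ddof=1)
    half = t_crit * s / np.sqrt(n)          # half-width of the 99% t-interval
    if m - half <= true_mean <= m + half:   # does this interval cover the truth?
        covered += 1

coverage = covered / trials  # close to 0.99: ~99% of the intervals cover the true mean
```

The 99% is a property of the repeated procedure, not a membership probability for any one new point falling inside one particular drawn region.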

Second, confidence intervals for individual variables and tests for the differences between two variables have complicated interrelations. This issue is extensively discussed on this Cross Validated page for 1-dimensional distinctions by t-tests. Under reasonable assumptions, lack of overlap of 99% confidence intervals is about equivalent to p < 0.0002, not p < 0.01, for a difference between the mean values. Requiring non-overlap of confidence intervals is much more conservative than it seems at first glance.

So for your questions:

1) If the values for a new sample happen to fall within a particular confidence region, you may have some confidence that it belongs to the same class, but there will always be some error rate. Because there will always be some error, you have to decide on the relative costs of different errors. There are also issues of overfitting, and of resulting overconfidence in your classification, if all you did was analyze a set of data without separating it into training and test sets. These issues are covered by books like An Introduction to Statistical Learning, whose section 4.4.3 covers LDA with multiple predictors.

2) The complicated relations between confidence intervals for individual variables and confidence intervals for differences between variables make it difficult to give a clear answer. In principle it seems that you would also need to know how the directions of the differences in the 4 individual measures map onto your 2D plot. It's probably more important to know what the actual decision boundaries between classes are from the LDA, and the classification error rates near those boundaries.
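To make point 1) concrete: scikit-learn's LDA can report posterior class probabilities rather than a hard label, which is the form of answer these data can actually support. A sketch with made-up training data (the class means and noise level are hypothetical):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical training data: 3 molecules x 10 replicates x 4 systems.
means = np.array([[42, 45, 20, 12],
                  [ 3, 25, 42, 28],
                  [30, 10, 35, 40]], dtype=float)
X = np.vstack([m + rng.normal(scale=1.0, size=(10, 4)) for m in means])
y = np.repeat([0, 1, 2], 10)

lda = LinearDiscriminantAnalysis().fit(X, y)

# For an unknown molecule, report posterior probabilities, not a definitive answer.
unknown = np.array([[40, 44, 21, 13]])
probs = lda.predict_proba(unknown)  # one probability per class, summing to 1
label = lda.predict(unknown)        # the most probable class under equal costs
```

Unequal mis-classification costs would then shift where you draw the line between "call it Mol 1" and "call it ambiguous".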

Added in response to edited question:

I don't think that the Tukey HSD would be a good test to use here. That's designed for a different situation, such as when several different treatments are being compared with respect to their effects on an outcome variable, there are no pre-planned comparisons among the treatments, and you want a reliable way to compare all treatments against each other. The test assumes that "The observations being tested are independent within and among the groups."

It isn't an appropriate test for seeing if there are differences among 4 separate tests run on the same set of molecules: once a molecule is specified, the result of any one test provides much information about the results of each of the other 3 tests, so the observations from the multiple tests aren't independent.

What you have is a classification problem, and you are probably best off using the extensive tools already developed for classification, such as LDA.
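As a sketch of that approach, an honest error-rate estimate comes from cross-validation rather than from re-scoring the classifier on its own training data (the data below are again made up):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Hypothetical responses: 3 molecules x 10 replicates x 4 systems.
means = np.array([[42, 45, 20, 12],
                  [ 3, 25, 42, 28],
                  [30, 10, 35, 40]], dtype=float)
X = np.vstack([m + rng.normal(scale=1.0, size=(10, 4)) for m in means])
y = np.repeat([0, 1, 2], 10)

# 5-fold cross-validated accuracy: each fold is scored on data it never saw.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
```

The mean of `scores` (and its spread across folds) is a far less optimistic picture of classification performance than the fit on the full data set.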

  • Thank you for your very detailed answer. I edited my original question to add more questions :) There are still a few things I don't understand. Could you have a look, if you have time?
    – Rififi
    Commented Jun 11, 2015 at 8:40
