As you may know, analytic methods like PCA or LDA are used everywhere in science. I'm a chemist, and I use them to differentiate molecules.
I have 4 different systems (let's call them A, B, C and D), and each of them gives a particular response to a certain molecule. Let's say I have 2 molecules, Mol 1 and Mol 2 (this is an imaginary situation, but you get the idea).
I have the following responses:
| | A | B | C | D |
|---|---|---|---|---|
| Mol 1 | 42,41,43 | 45,44,46 | 20,19,21 | 12,11,13 |
| Mol 2 | 3,2,4 | 25,24,26 | 42,41,43 | 28,27,29 |
I have 3 measurements for each system, and for each molecule.
I used Python and the LDA class from the scikit-learn module to draw a 2D plot (this is a real plot, from real experiments):
I asked the LDA class to reduce the problem to 2 dimensions so that I could draw a 2D plot. Each colour represents a molecule, i.e. a class. I have 3 points for each class. The crosses represent the class means.
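For readers who want to reproduce this kind of projection, here is a minimal sketch of the scikit-learn call. It is not the code behind the real plot. It assumes simulated replicates scattered around the means of the imaginary table above (the noise level, the seed and the variable names are all hypothetical). Note that LDA yields at most `min(n_classes - 1, n_features)` discriminant axes, so two molecules give a single axis; a 2D plot like mine needs at least three classes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Mean responses of systems A, B, C, D from the imaginary table above.
class_means = np.array([
    [42.0, 45.0, 20.0, 12.0],  # Mol 1
    [3.0, 25.0, 42.0, 28.0],   # Mol 2
])

# ASSUMPTION: simulated replicates (3 per molecule, noise sd = 1), not real data.
rng = np.random.default_rng(0)
n_rep = 3
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_rep, 4)) for m in class_means])
y = np.repeat([0, 1], n_rep)  # 0 = Mol 1, 1 = Mol 2

# With the default n_components, LDA keeps min(n_classes - 1, n_features) axes.
lda = LinearDiscriminantAnalysis()
scores = lda.fit_transform(X, y)

print(scores.shape)    # (n_samples, 1): one discriminant axis for two classes
print(lda.predict(X))  # predicted class of each training sample
```

With three or more molecules, passing `n_components=2` would return the two columns used for a 2D scatter plot.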
Each ellipse represents a 99% confidence interval for its group. I drew them myself: the width is ~5 times the x standard deviation, and the height is ~5 times the y standard deviation.
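For comparison, here is a sketch of one common alternative construction, not the one I used: an ellipse built from the sample covariance of the 2D scores, sized with the chi-square quantile for 2 degrees of freedom. It assumes the scores of a class are bivariate normal, and it describes where 99% of points from such a class would fall, ignoring the uncertainty of estimating a covariance from only 3 points. The function names and the demo points are hypothetical.

```python
import math
import numpy as np

def coverage_radius(level=0.99):
    """Mahalanobis radius enclosing `level` of a bivariate normal.

    For 2 degrees of freedom the chi-square quantile has the closed
    form -2 * ln(1 - level).
    """
    return math.sqrt(-2.0 * math.log(1.0 - level))

def coverage_ellipse(points, level=0.99):
    """Return (center, width, height, angle_deg) of a covariance ellipse."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)      # 2x2 sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    half_axes = coverage_radius(level) * np.sqrt(np.clip(eigvals, 0.0, None))
    major = eigvecs[:, 1]                   # direction of largest variance
    angle = math.degrees(math.atan2(major[1], major[0]))
    # width follows the major axis, height the minor axis
    return center, 2.0 * half_axes[1], 2.0 * half_axes[0], angle

# Hypothetical LDA scores for one class (three replicates).
demo = [[1.0, 0.5], [1.4, 0.9], [0.7, 0.4]]
center, width, height, angle = coverage_ellipse(demo)
print(coverage_radius(0.99))  # about 3.03 standard deviations per half-axis
print(center, width, height, angle)
```

Under these assumptions the full axes are about 6 standard deviations long and are tilted along the principal directions, rather than ~5 standard deviations along x and y. The returned values match the `xy`, `width`, `height` and `angle` arguments of `matplotlib.patches.Ellipse`.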
Now for the questions.
You can see in the picture that the brown ellipse overlaps the blue, purple and yellow ellipses. So if the point for an unknown molecule falls inside the brown ellipse, can I say which molecule it is or not?
Second question: let's say I perform a Tukey HSD analysis on the measurements of Mol 1 and Mol 2, with the same confidence level. If the test says Mol 1 and Mol 2 are not statistically different for each of the 4 systems, will their ellipses superimpose on the LDA plot?
After the answer of @EdM, I'd like to add a few things.
I would like to compare the Tukey HSD and the LDA.
Second, confidence intervals for individual variables and tests for the differences between two variables have complicated interrelations. This issue is extensively discussed on this Cross Validated page for 1-dimensional distinctions by t-tests. Under reasonable assumptions, lack of overlap of 99% confidence intervals is about equivalent to p < 0.0002, not p < 0.01, for a difference between the mean values. Requiring non-overlap of confidence intervals is much more conservative than it seems at first glance.
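The order of magnitude of that figure can be checked with a short calculation. This is a sketch under simplifying assumptions that I am adding myself: two independent means with equal standard errors, and a normal (z) approximation instead of a t distribution. The function name is hypothetical. Two 99% intervals just touch when the means differ by twice the interval half-width; dividing by the standard error of the difference (√2 times the individual standard error) gives the z statistic of the difference.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

def p_value_at_touching_cis(level=0.99):
    """Two-sided p-value for the difference of two means whose
    `level` confidence intervals just touch (equal SEs, z approximation)."""
    z_ci = std_normal.inv_cdf(0.5 + level / 2.0)  # CI half-width in SE units
    z_diff = 2.0 * z_ci / math.sqrt(2.0)          # gap / SE of the difference
    return 2.0 * (1.0 - std_normal.cdf(z_diff))

print(p_value_at_touching_cis(0.99))  # a few ten-thousandths
print(p_value_at_touching_cis(0.95))  # well below 0.05
```

Under these assumptions the result for 99% intervals is of the same order as the value quoted above; with only 3 replicates per group and t-based intervals the exact number would differ.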
1) If values for a new sample happen to fall within a particular confidence region, you may have some confidence that it belongs to the same class, but there will always be some error rate. Because there will always be some error, you have to decide on the relative costs of different errors. There are also issues of overfitting, and the resulting overconfidence in your classification, if all you did was analyze a set of data without attempts to separate it into training and test sets. These issues are covered by books like An Introduction to Statistical Learning, whose section 4.4.3 covers LDA with multiple predictors.
2) The complicated relations between confidence intervals for individual variables and confidence intervals for differences between variables make it difficult to give a clear answer. In principle it seems that you would also need to know how the directions of the differences in the 4 individual measures map onto your 2D plot. It's probably more important to know what the actual decision boundaries between classes are from the LDA, and the classification error rates near those boundaries.
So I understand that non-overlapping confidence intervals clearly mean that two mean values are different (at least in the t-test case mentioned).
I also understand that this does not carry over directly to my LDA case: it is hard to identify an unknown molecule with certainty, even if its measurements fall inside a confidence interval.
However, the purpose of the Tukey HSD test is to tell whether two means are statistically different. Would it be possible, or interesting, to perform a Tukey HSD test after an LDA dimension reduction? It would, for example, compare the means of two groups and tell whether they are statistically different (I think that would be enough to say whether two molecules are different).
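To make the idea concrete, here is a sketch of what I have in mind, using `scipy.stats.tukey_hsd` (available in recent SciPy versions) on the first discriminant score. The data are simulated replicates around the means of the imaginary table above, so the noise level, seed and variable names are hypothetical. One caveat I am aware of: the LDA axis is chosen to separate the classes on these same data, so a test on the scores is likely to be optimistic. With only two groups, Tukey HSD is equivalent to a pooled two-sample t-test; it becomes more useful with three or more molecules.

```python
import numpy as np
from scipy.stats import tukey_hsd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Mean responses of systems A, B, C, D from the imaginary table above.
class_means = np.array([
    [42.0, 45.0, 20.0, 12.0],  # Mol 1
    [3.0, 25.0, 42.0, 28.0],   # Mol 2
])

# ASSUMPTION: simulated replicates (3 per molecule, noise sd = 1), not real data.
rng = np.random.default_rng(0)
n_rep = 3
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_rep, 4)) for m in class_means])
y = np.repeat([0, 1], n_rep)  # 0 = Mol 1, 1 = Mol 2

# Project onto the first (here the only) discriminant axis.
ld1 = LinearDiscriminantAnalysis().fit_transform(X, y)[:, 0]

# One group of scores per molecule; the result holds pairwise comparisons.
result = tukey_hsd(ld1[y == 0], ld1[y == 1])
print(result.pvalue)  # matrix of pairwise p-values, one row/column per group
```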