0
$\begingroup$

I would like to know whether there is any statistical justification for grouping data in the following way:

  1. find the maximum ($\max$), minimum ($\min$), mean ($m$) and standard deviation ($sd$) of a sample;

  2. group the data into the following 4 subintervals: $(\min, m-sd)$, $(m-sd, m)$, $(m, m+sd)$, $(m+sd, \max)$.

Note that there are no outliers in my dataset, and we assume that the data are normally distributed. The values in my dataset are a performance index computed for 30 countries. This index is based on economic, social and institutional aspects (GDP, employment rate, industry, services, national debt, government size, etc.).

I know that, under normality, approximately 68% of the data falls within $(m-sd, m+sd)$. I also know that another way to group data is via quartiles (we would also have 4 subintervals: ($\min$, $Q_1$), ($Q_1$, $Q_2$), ($Q_2$, $Q_3$), ($Q_3$, $\max$), where $Q_1, Q_2, Q_3$ are the quartiles). So, is there any particular advantage (one that can be explained statistically) to grouping data via the first method?

$\endgroup$
7
  • 2
    $\begingroup$ Whether a procedure is useful is easier to address if we know the larger goal. So what is the purpose of the grouping or analysis? $\endgroup$
    – mkt
    Commented Aug 1, 2023 at 14:45
  • $\begingroup$ The main purpose is to group about 30 countries into 4 subintervals based on their performance in various areas such as industry, services, GDP, employment rates, etc. $\endgroup$
    – Alchimist
    Commented Aug 1, 2023 at 14:47
  • $\begingroup$ Edit it into the question, please. And include more detail if possible. $\endgroup$
    – mkt
    Commented Aug 1, 2023 at 14:50
  • 2
    $\begingroup$ I take issue with some of the premises on which the question is based. $\,$ 1. the "68%" is true of normal populations, but isn't necessarily true for other distributions, and in particular isn't necessarily true of data; the percentage could be quite some way from 68%. You can get 0%, or very close to 100%. $\,$ 2. Some of the subintervals you define might not even exist. Note that mean - sd could be smaller than the minimum and/or mean + sd could be larger than the maximum. What do you do then? $\endgroup$
    – Glen_b
    Commented Aug 1, 2023 at 16:17
  • $\begingroup$ Yes, you are right. That's why I said that we assume that the data is normally distributed. I also thought about what you said at 2., but the subintervals are well-defined in my case. $\endgroup$
    – Alchimist
    Commented Aug 1, 2023 at 16:30

3 Answers 3

1
$\begingroup$

There is no particular advantage; it may or may not be beneficial. You're discretizing continuous data, effectively treating all members of each group as identical. A priori, there isn't any reason to expect that splitting the data into groups of roughly 16%, 34%, 34%, and 16% (the shares implied by normality) will be more meaningful than splitting them into equally sized groups of 25% each. There may be situations where it is beneficial, for example if you're trying to characterize a smaller, more extreme group than a quartile, and situations where it is not appropriate, for example if there truly are 4 distinct, equally sized groups in your data.
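To see how the two schemes differ on a sample of 30, here is a minimal sketch. The data are simulated (the `index` array is a hypothetical stand-in for the real performance index, drawn as roughly normal), so the counts it prints are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 30 index values (assumed roughly normal).
index = rng.normal(loc=50.0, scale=10.0, size=30)

m, sd = index.mean(), index.std(ddof=1)

# Method 1: cut points at m - sd, m, m + sd.
sd_edges = np.array([m - sd, m, m + sd])
sd_group = np.digitize(index, sd_edges)   # labels 0..3

# Method 2: cut points at the three quartiles.
q_edges = np.quantile(index, [0.25, 0.5, 0.75])
q_group = np.digitize(index, q_edges)     # labels 0..3

sd_counts = np.bincount(sd_group, minlength=4)
q_counts = np.bincount(q_group, minlength=4)
print("mean/sd groups :", sd_counts)
print("quartile groups:", q_counts)
```

The quartile groups are equal in size by construction (up to rounding), while the mean/sd group sizes depend on the sample and, with only 30 points, can drift noticeably from the nominal 16/34/34/16 split.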

$\endgroup$
1
$\begingroup$

Short answer: No.

Longer answer: In general, there is nothing to be gained and much to be lost by any grouping of a continuous variable.

  • It throws away information.

  • If you then use the grouped variable in regression or other inferential statistics, it can increase both the type I and type II error rates.

  • It limits what you can say.

  • It assumes that something special happens right at the cutpoints and that you have somehow picked those points perfectly, or, at least, very well.

There are some cases where grouping (or some more complex scheme) can help, but these are cases where there is a substantive reason for the split. For instance, if you were looking at "amount of alcohol consumed" as a function of age, it might make sense to split age at whatever the legal drinking age is in your population.

And it can sometimes make sense to present the data for certain groups. But for the analysis? Unless you have a strong reason, don't categorize.
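The "throws away information" point can be illustrated with a small simulation. This is only a sketch on made-up data (`x`, `y` and the noise level are assumptions), and it shows attenuation of a correlation, not the error-rate claim itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)   # linear relation plus noise (assumed for illustration)

# Replace x by its quartile group label (0..3).
x_binned = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))

r_cont = np.corrcoef(x, y)[0, 1]
r_binned = np.corrcoef(x_binned, y)[0, 1]
print(f"correlation with continuous x: {r_cont:.3f}")
print(f"correlation with binned x:     {r_binned:.3f}")
```

The binned version recovers a weaker association because every observation within a group is treated as having the same value of `x`.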

$\endgroup$
0
$\begingroup$

Grouping by quartiles yields equally sized groups, but for "grouping" countries on the basis of some index, this does not seem to make sense. It is up to you, of course, what you call a "group", but to match the common understanding of groups, differences within each group should be smaller than differences between groups. This is the basic problem of "clustering", and there are many algorithms devised for precisely this aim.

One particular approach would be k-means, which is also known as "vector quantization" because it minimizes the within-cluster sum of squared distances to the cluster centers. Another simple algorithm is (complete-link) hierarchical clustering.
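For a one-dimensional index, k-means is simple enough to sketch directly. The function below is a minimal Lloyd's-algorithm sketch, not a library implementation; `kmeans_1d`, the quantile-based initialisation and the toy `index` values are all assumptions made for illustration.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm for one-dimensional data; returns (labels, centers)."""
    values = np.asarray(values, dtype=float)
    # Initialise centers at evenly spaced quantiles (a deterministic, arbitrary choice).
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical index values with two visibly separated groups.
index = np.array([1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1])
labels, centers = kmeans_1d(index, k=2)
print(labels, centers)
```

Unlike the mean/sd or quartile cut points, the boundaries here are placed where the data themselves separate, so group sizes are not fixed in advance.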

Then there is the question of whether there are actually four groups, or whether a different number of groups is more reasonable. As you only have a one-dimensional criterion (the performance index), I would recommend determining the number of groups with kernel density estimates and the mode tree, as described in

Minnotte, Scott: "The Mode Tree: A Tool for Visualization of Nonparametric Density Features." Journal of Computational and Graphical Statistics 2(1), pp. 51-68 (1993)

There is a ready-to-run visualization function modetree in the R package multimode.
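The idea behind the mode tree (tracking how the number of density modes changes with the bandwidth) can be sketched without the R package. The snippet below is not the `modetree` function; `count_modes`, the grid size and the toy data are assumptions, and it only counts modes at a couple of bandwidths.

```python
import numpy as np

def count_modes(values, bandwidth, grid_size=2000):
    """Count local maxima of a Gaussian kernel density estimate on a grid."""
    values = np.asarray(values, dtype=float)
    grid = np.linspace(values.min() - 3 * bandwidth,
                       values.max() + 3 * bandwidth, grid_size)
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z**2).sum(axis=1)   # unnormalised; fine for locating modes
    inner = density[1:-1]
    # A grid point is a peak if it exceeds its left neighbour and is not below its right one.
    is_peak = (inner > density[:-2]) & (inner >= density[2:])
    return int(is_peak.sum())

# Hypothetical index values forming two clumps.
index = np.concatenate([np.linspace(-0.5, 0.5, 15), np.linspace(4.5, 5.5, 15)])
for h in (0.5, 5.0):
    print(f"bandwidth {h}: {count_modes(index, h)} mode(s)")
```

Modes that persist over a wide range of bandwidths are candidates for genuine groups; modes that appear only at very small bandwidths are more likely noise.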

If the data do not show any grouping, this might also be because the one-dimensional mapping onto the "performance index" does not represent the data structure. Finding a mapping from high to low dimensions that approximately preserves the (relative) distances between data points is called "multidimensional scaling", and by using these techniques you might find a better-suited index. This is tricky, however, because you must somehow normalize the scales of all variables; otherwise the distances will simply be dominated by the variable with the greatest range.
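As a rough sketch of that last point, the snippet below z-scores each variable before computing distances and then applies classical (Torgerson) multidimensional scaling. Z-scoring is just one possible normalization, and `classical_mds` and the simulated indicator matrix `X` are assumptions made for illustration.

```python
import numpy as np

def classical_mds(X, n_components=1):
    """Classical (Torgerson) MDS on z-scored columns of X."""
    X = np.asarray(X, dtype=float)
    # Standardise each variable so that no single scale dominates the distances.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    n = Z.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ sq_dist @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

rng = np.random.default_rng(2)
# Hypothetical indicators on very different scales (e.g. GDP vs. a rate in [0, 1]).
X = np.column_stack([
    rng.normal(30_000, 8_000, size=30),
    rng.uniform(0.5, 0.8, size=30),
    rng.normal(60, 20, size=30),
])
coords = classical_mds(X, n_components=1)
print(coords.shape)   # (30, 1)
```

Without the standardization step, the first column would dominate every pairwise distance and the resulting one-dimensional coordinate would essentially reproduce that variable alone.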

$\endgroup$
