
I would like to know how to interpret a difference of F-measure values. I know that the F-measure is a balanced (harmonic) mean of precision and recall, but I am asking about the practical meaning of a difference in F-measures.

For example, if a classifier C1 has an accuracy of 0.4 and another classifier C2 an accuracy of 0.8, then we can say that C2 has correctly classified twice as many test examples as C1. However, if a classifier C1 has an F-measure of 0.4 for a certain class and another classifier C2 an F-measure of 0.8, what can we state about the difference in performance of the two classifiers? Can we say that C2 has classified X more instances correctly than C1?

  • I'm not sure you can say much, since the F-measure is a function of both precision and recall: en.wikipedia.org/wiki/F1_score. You can do the math, though, and hold one (either precision or recall) constant and say something about the other.
    – Nick
    Commented Feb 4, 2013 at 16:35

10 Answers


I cannot think of an intuitive meaning of the F-measure, because it's just a combined metric. What's more intuitive than the F-measure, of course, is precision and recall.

But using two values, we often cannot determine if one algorithm is superior to another. For example, if one algorithm has higher precision but lower recall than the other, how can you tell which algorithm is better?

If you have a specific goal in mind like 'precision is king; I don't care much about recall', then there's no problem: higher precision is better. But if you don't have such a strong goal, you will want a combined metric. That's the F-measure. By using it, you compare some of precision and some of recall.
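For a concrete illustration (a minimal sketch with invented toy labels and scikit-learn calls, not part of the original answer): below, classifier C1 wins on precision, C2 wins on recall, and the F-measure is one way to break the tie.

```
# A minimal sketch with made-up toy labels: two classifiers where neither
# precision nor recall alone settles which is better, but F1 combines them.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # ground truth
y_c1   = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # C1: cautious -> precision 1.0, recall 0.4
y_c2   = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # C2: liberal  -> precision ~0.71, recall 1.0

for name, y_pred in [("C1", y_c1), ("C2", y_c2)]:
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    f = f1_score(y_true, y_pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```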

ROC curves are often discussed alongside the F-measure. You may find this article interesting, as it explains several such measures, including ROC curves: http://binf.gmu.edu/mmasso/ROC101.pdf


The importance of the F1 score differs based on the distribution of the target variable. Let's assume the target variable is a binary label.

  • Balanced classes: In this situation, the F1 score can effectively be ignored; the misclassification rate is key.
  • Unbalanced classes, but both classes are important: If the class distribution is highly skewed (such as 80:20 or 90:10), then a classifier can get a low misclassification rate simply by always choosing the majority class. In such a situation, I would choose the classifier that gets high F1 scores on both classes, as well as a low misclassification rate. A classifier that gets low F1 scores should be passed over.
  • Unbalanced classes, and one class is more important than the other: For example, in fraud detection, it is more important to correctly label fraudulent instances than non-fraudulent ones. In this case, I would pick the classifier that has a good F1 score only on the important class. Recall that F1 scores are available per class (see the sketch after this list).
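As a sketch of the unbalanced case (the 90:10 toy data below is invented for illustration), scikit-learn's classification_report shows precision, recall, and F1 for each class separately, which is where the majority-class shortcut becomes visible:

```
from sklearn.metrics import classification_report

# 90 "normal" (0) instances and 10 "fraud" (1) instances
y_true = [0] * 90 + [1] * 10
# A classifier that almost always predicts the majority class:
# it is 91% accurate overall, yet nearly useless on the minority class.
y_pred = [0] * 90 + [0] * 9 + [1]

print(classification_report(y_true, y_pred, target_names=["normal", "fraud"]))
```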

The F-measure has an intuitive meaning. It tells you how precise your classifier is (how many of the instances it labels as positive are classified correctly), as well as how robust it is (it does not miss a significant number of positive instances).

With high precision but low recall, your classifier is extremely accurate, but it misses a significant number of instances that are difficult to classify. This is not very useful.

Take a look at this histogram of classifier scores (ignore its original purpose).

Towards the right, you get high precision but low recall. If I only select instances with a score above 0.9, my classified instances will be extremely precise, but I will have missed a significant number of instances. Experiments indicate that the sweet spot here is around 0.76, where the F-measure is 0.87.
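To make the "sweet spot" idea concrete, here is a sketch (with synthetic scores I generated, not the data behind the histogram above) that sweeps the decision threshold and reports the one maximizing the F-measure:

```
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# synthetic scores: positives tend to score higher than negatives
y_true = np.array([1] * 50 + [0] * 50)
scores = np.concatenate([rng.beta(5, 2, 50), rng.beta(2, 5, 50)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])           # the last precision/recall pair has no threshold
print(f"best threshold = {thresholds[best]:.2f}, F1 = {f1[best]:.2f}")
```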

  • The last paragraph is misleading. There is no concept of a "good" or "bad" score without the context of where we are applying this. In certain settings maybe 60% is the state of the art; in other settings 95% might be unacceptably low.
    – usεr11852
    Commented Sep 13, 2020 at 0:03
  • Side note: precision and robust(ness) are also used with a very different meaning in validation and verification (precision referring to [low] variance-type error, and robustness referring to predictions that do not differ much under some influencing factor). So thanks for spelling it out.
    – cbeleites
    Commented Aug 16, 2022 at 11:55

The F-measure is the harmonic mean of your precision and recall. In most situations, you have a trade-off between precision and recall: if you optimize your classifier to increase one and disfavor the other, the harmonic mean quickly decreases. It is greatest, however, when both precision and recall are equal.

Given F-measures of 0.4 and 0.8 for your classifiers, you can expect that these were the maximum values achieved when weighing precision against recall.

For visual reference, take a look at this figure from Wikipedia:

[Figure: geometric construction of the harmonic mean H of two segments]

The F-measure is H, and A and B are recall and precision. You can increase one, but then the other decreases.
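A tiny sketch of that behaviour (the numbers are my own): holding the arithmetic mean of precision and recall fixed at 0.5, the harmonic mean (F1) is largest when the two are equal and falls off as they diverge.

```
def f1(p, r):
    return 2 * p * r / (p + r)    # harmonic mean of p and r

for p, r in [(0.5, 0.5), (0.6, 0.4), (0.8, 0.2), (0.9, 0.1)]:
    print(f"p={p}, r={r}: arithmetic mean={(p + r) / 2:.2f}, F1={f1(p, r):.2f}")
```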

  • I found the "Crossed Ladders" visualization to be a bit more straightforward; for me, it makes it more intuitive that equality A = B results in the greatest H.
    – Coruscate5
    Commented Jul 17, 2017 at 17:18
  • There is no B in that illustration; did you mean b?
    – Schütze
    Commented Apr 18, 2021 at 22:18

With precision on the y-axis and recall on the x-axis, the slope of the level curve of $F_{\beta}$ at (1, 1) is $-\beta^2$.

Given $$P = \frac{TP}{TP+FP}$$ and $$R = \frac{TP}{TP+FN},$$ let $\alpha$ be the ratio of the cost of a false negative to that of a false positive. Then the total cost of errors is proportional to $$\alpha\, \frac{1-R}{R} + \frac{1-P}{P},$$ and the slope of a constant-cost curve at (1, 1) is $-\alpha$. Matching the two slopes, for good models (those operating near (1, 1)), using $F_{\beta}$ implies you consider false negatives $\beta^2$ times more costly than false positives.
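A quick numerical sanity check of that slope (my own sketch; `fbeta` below is simply the standard $F_\beta$ formula):

```
# Estimate the slope dP/dR of an F_beta level curve at (P, R) = (1, 1)
# as -(dF/dR) / (dF/dP), using central finite differences.
def fbeta(p, r, beta):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

h = 1e-6
for beta in (0.5, 1.0, 2.0):
    dF_dP = (fbeta(1 + h, 1, beta) - fbeta(1 - h, 1, beta)) / (2 * h)
    dF_dR = (fbeta(1, 1 + h, beta) - fbeta(1, 1 - h, beta)) / (2 * h)
    slope = -dF_dR / dF_dP
    print(f"beta={beta}: slope = {slope:.4f}, -beta^2 = {-beta**2}")
```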


I just want to note the following paper, published this year, that proposes "a simple transformation of the F-measure, which [the authors] call $F^*$ (F-star), which has an immediate practical interpretation." It even cited this very discussion on Cross Validated.

Specifically, $F^* = F/(2-F)$ "is the proportion of the relevant classifications which are correct, where a relevant classification is one which is either really class 1 or classified as class 1".
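To see that interpretation in terms of counts (the confusion-matrix numbers below are invented): since $F = \frac{2TP}{2TP + FP + FN}$, a little algebra gives $F^* = F/(2-F) = \frac{TP}{TP + FP + FN}$, which is exactly the proportion described in the quote.

```
tp, fp, fn = 60, 20, 30            # arbitrary toy counts

f1 = 2 * tp / (2 * tp + fp + fn)
f_star = f1 / (2 - f1)
print(f1)                          # 0.705...
print(f_star)                      # 0.545...
print(tp / (tp + fp + fn))         # 0.545... -- the same number
```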

REFERENCES:

Hand, D. J., Christen, P., & Kirielle, N. (2021). F*: an interpretable transformation of the F-measure. Machine Learning, 110, 451–456.


The formula for the F-measure (F1, with beta = 1) is the same as the formula for the equivalent resistance of two resistors placed in parallel in physics (up to the factor 2).

This could give you a possible interpretation, and you can think of either electrical or thermal resistances. The analogy defines the F-measure as the equivalent resistance formed by sensitivity and precision placed in parallel.

For the F-measure, the maximum possible value is 1, and you lose "resistance" as soon as one of the two loses resistance as well (that is to say, gets a value below 1). If you want to understand this quantity and its dynamics better, think about the physical phenomenon. For example, it appears that the F-measure ≤ max(sensitivity, precision).
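A small sketch of the analogy (the numbers are mine): F1 is exactly twice the equivalent resistance of precision and recall "in parallel", and, as noted, never exceeds max(sensitivity, precision).

```
def parallel(a, b):
    return a * b / (a + b)         # two resistors in parallel

def f1(p, r):
    return 2 * p * r / (p + r)

for p, r in [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1)]:
    print(f"p={p}, r={r}: 2*parallel={2 * parallel(p, r):.2f}, "
          f"F1={f1(p, r):.2f}, max={max(p, r)}")
```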


The closest intuitive meaning of the F1 score is that it is a kind of mean of recall and precision. Let's make this clear:

In a classification task, you may be planning to build a classifier with high precision AND recall. For example, a classifier that tells if a person is honest or not.

For precision, you usually care about telling accurately how many people in a given group are honest. In this case, when caring about high precision, you accept that you may misclassify a liar as honest, but not often. In other words, here you are trying to separate liars from honest people at the level of the whole group.

However, for recall, you will be really concerned if you take a liar to be honest. For you, this would be a great loss and a big mistake, and you don't want to do it again. Also, it's okay if you classify an honest person as a liar, but your model should never (or almost never) claim that a liar is honest. In other words, here you are focusing on a specific class, and you are trying not to make a mistake about it.

Now, take the case where you want your model to (1) precisely distinguish honest people from liars (precision) and (2) identify every person of each class (recall). This means that you will select the model that performs well on both metrics.

Your model selection decision will then evaluate each model based on a mean of the two metrics. The F-score is the measure that describes this. Let's have a look at the formulas:

$$\text{Recall: } r=\frac{tp}{tp+fn}$$

$$\text{Precision: } p=\frac{tp}{tp+fp}$$

$$\text{F-score: } F_1 = \frac{2}{\frac{1}{r}+\frac{1}{p}}$$

As you can see, the higher both recall and precision are, the higher the F-score.
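Here is the honest/liar example worked through with made-up counts (my own sketch), taking "honest" as the positive class:

```
tp = 40   # honest people correctly labelled honest
fn = 10   # honest people wrongly labelled liars
fp = 5    # liars wrongly labelled honest

r = tp / (tp + fn)                 # recall    = 0.80
p = tp / (tp + fp)                 # precision ~ 0.89
f1 = 2 / (1 / r + 1 / p)           # harmonic mean ~ 0.84
print(f"recall={r:.2f}, precision={p:.2f}, F1={f1:.2f}")
```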


You can write the F-measure equation $$F_\beta=\frac{(1+\beta^2)\,p\,r}{\beta^2\,p+r}$$ in another way, as $$F_\beta=\frac{1}{\dfrac{\beta^2}{\beta^2+1}\cdot\dfrac{1}{r}+\dfrac{1}{\beta^2+1}\cdot\dfrac{1}{p}},$$ so when $\beta^2<1$, the weight on $1/p$ is larger, i.e. $p$ is more important (it must be larger to get a higher $F_\beta$).
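A quick check of that claim (the numbers are invented): with $\beta = 0.5$, so $\beta^2 < 1$, a precision-heavy classifier scores higher than a recall-heavy one with the same two values swapped.

```
def fbeta(p, r, beta):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(fbeta(p=0.9, r=0.5, beta=0.5))   # ~0.78, precision-heavy
print(fbeta(p=0.5, r=0.9, beta=0.5))   # ~0.55, recall-heavy
```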


Knowing that the F1 score is the harmonic mean of precision and recall, below is a little brief about each of them.

I would say recall is more about false negatives, i.e., having a higher recall means there are fewer false negatives.

$$\text{Recall}=\frac{tp}{tp+fn}$$

The fewer false negatives (ideally zero), the better your model's predictions.

Whereas having higher precision means there are fewer false positives: $$\text{Precision}=\frac{tp}{tp+fp}$$

Likewise, fewer (or zero) false positives means the model's predictions are better.
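A small sketch (counts invented) of how the two error types pull on different metrics: false negatives lower recall but leave precision untouched, while false positives do the opposite.

```
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp = 80
print(precision(tp, fp=0),  recall(tp, fn=20))   # only FNs: precision 1.0, recall 0.8
print(precision(tp, fp=20), recall(tp, fn=0))    # only FPs: precision 0.8, recall 1.0
```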

