If you have ever tried solving a Classification task using a Machine Learning (ML) algorithm, you might have heard of the well-known F-beta score metric. On this page, we will:
Cover the logic behind the metric (both for the binary and multiclass cases);
Check out the metric’s formula;
Find out how to interpret the F-beta score value;
Calculate F-beta on simple examples;
Dive a bit deeper into the Micro and Macro F-beta scores;
And see how to work with the F-beta score using Python.
Let’s jump in.
Precision and Recall are valuable metrics that are widely used across the industry. Still, they have a significant drawback: they produce two values that must first be analyzed separately and then together (check out the PR-curve page to learn more about the trade-off between these metrics) to better understand an algorithm’s performance. To overcome this disadvantage, Data Scientists came up with a way to combine Precision and Recall into a single aggregate quality metric.
To define the term, in Machine Learning, the F-beta score (or F-measure) is a Classification metric defined as a weighted harmonic mean of Precision and Recall. To evaluate a Classification model using the F-beta score, you need to have:
The ground truth classes;
And the model’s predictions.
Unfortunately, the F-measure is not an intuitive metric, especially when considering its formula. Moreover, the F-beta score does not have a direct 'physical' meaning. For example, Accuracy is the fraction of predictions that a model got right, whereas the F-measure has no such simple explanation, which makes it harder to grasp. So, we suggest you trust the Data Scientists who use this metric daily that the F-measure is helpful and trustworthy, and dive deeper only if you are really interested.
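The F-beta score formula is as follows:
F-beta score = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
As you can see, there is a beta parameter in the formula. To clarify, it determines the weight of Recall relative to Precision: Recall is treated as beta times as important as Precision. In general, there are three most common values for the beta parameter: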
F0.5 score (beta = 0.5): Such a beta makes the Precision value more important than the Recall one. In other words, it focuses more on minimizing False Positives than on minimizing False Negatives;
F1 score (beta = 1): The true harmonic mean of Precision and Recall. In the best-case scenario, if Precision and Recall are equal to 1, the F1 score will also be equal to 1;
F2 score (beta = 2): Such a beta makes the Recall value more important than the Precision one. In other words, it focuses more on minimizing False Negatives than on minimizing False Positives.
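To make the effect of beta more tangible, here is a minimal sketch that applies the standard F-beta formula to one hypothetical pair of Precision and Recall values. The function name and the numbers 0.8 and 0.4 are assumptions chosen for illustration only.

```python
def fbeta_from_precision_recall(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of Precision and Recall (standard F-beta formula)."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        # Convention assumed in this sketch: return 0 when both inputs are 0.
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


# Hypothetical model: quite precise, but it misses many positives.
precision, recall = 0.8, 0.4

scores = {beta: fbeta_from_precision_recall(precision, recall, beta) for beta in (0.5, 1, 2)}
for beta, score in scores.items():
    print(f"beta = {beta}: F-beta = {score:.3f}")
```

Because Precision is higher than Recall in this example, the score drops as beta grows: the more weight Recall gets, the more the weak Recall is penalized.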
Of these three cases, the F1 score is the most popular, as it is the easiest to interpret. That is why the F1 score is the only F-measure with a dedicated sklearn function (f1_score), while other beta values go through the general fbeta_score function. Anyway, as you can see, the F-measure can be easily described using Precision and Recall. The F-beta score algorithm for the binary Classification task is as follows:
Get predictions from your model;
Pick your beta parameter value;
Calculate the Precision and Recall scores;
Use the F-beta score formula (do not forget about the beta value you picked);
And analyze the obtained value.
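The binary steps above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not a library API: the function name, the toy labels, and the choice of returning 0 when a denominator is 0 are all hypothetical.

```python
def binary_fbeta(y_true, y_pred, beta=1.0, positive=1):
    """Return (precision, recall, f_beta) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Assumed convention: a metric with an empty denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    denominator = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_beta


# Step 1: predictions from a (hypothetical) model next to the ground truth.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Steps 2-4: pick beta = 2 (Recall matters more) and calculate the scores.
precision, recall, f2 = binary_fbeta(y_true, y_pred, beta=2)
print(f"Precision = {precision:.3f}, Recall = {recall:.3f}, F2 = {f2:.3f}")
```

Here the model finds only half of the positives, so the Recall-oriented F2 score ends up closer to Recall than to Precision.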
For the binary case, the workflow is straightforward. However, multiclass use cases are a bit trickier, as there are various approaches you can take when calculating the F-measure. The sklearn F-beta score metric function offers at least three different averaging options:
Micro;
Macro;
And Weighted.
Each of these approaches is solid and can be very helpful in model evaluation. Also, in real life, you will likely calculate the metric value using all of them to get a more comprehensive view of a problem. Please check out the micro and macro F-measure calculation examples below or the scikit-learn documentation page if you want to learn more.
So, the F-beta score algorithm for the multiclass Classification task is as follows:
Get predictions from your model;
Pick your beta parameter value;
Identify the multiclass calculation approach you feel is the best for your task;
Use a Machine Learning library (for example, sklearn) to do the calculations for you;
And analyze the obtained value while considering the approach you used to get it and the parameter value you picked.
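In practice a library does this for you, but a plain-Python sketch may help show what the three averaging approaches actually compute. All names below are hypothetical. The sketch assumes that "macro" means the unweighted mean of per-class F-beta scores and "weighted" means the mean weighted by the number of ground truth samples per class, which is how sklearn describes these options.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (TP, FP, FN)} using a one-vs-rest view of each class."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta written directly in terms of TP, FP, and FN."""
    b2 = beta**2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def multiclass_fbeta(y_true, y_pred, beta=1.0, average="macro"):
    counts = per_class_counts(y_true, y_pred)
    if average == "micro":
        # Pool the counts of all classes, then apply the formula once.
        tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
        return fbeta_from_counts(tp, fp, fn, beta)
    scores = {label: fbeta_from_counts(*c, beta) for label, c in counts.items()}
    if average == "macro":
        return sum(scores.values()) / len(scores)
    if average == "weighted":
        support = {label: tp + fn for label, (tp, fp, fn) in counts.items()}
        return sum(scores[l] * support[l] for l in scores) / sum(support.values())
    raise ValueError(f"Unknown average: {average}")


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
for average in ("micro", "macro", "weighted"):
    print(average, round(multiclass_fbeta(y_true, y_pred, beta=1.0, average=average), 3))
```

On this toy data, macro and weighted coincide because every class has the same number of ground truth samples.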
In the F-beta score case, the metric value interpretation is straightforward. The more samples you classify correctly, the higher the Precision and Recall scores and, therefore, the higher the F-measure value. The higher the value, the better. For any beta parameter, the best possible value for the F-measure is 1, and the worst is 0.
From our experience, for both multiclass and binary use cases, you should consider an F-beta score > 0.85 as excellent, an F-beta score > 0.7 as good, and any other score as poor. Still, you can set your own thresholds, as your logic, beta parameter, and task might differ significantly from ours.
Also, please always try to see the bigger picture when analyzing the obtained value. Having an excellent F-measure value is fantastic, but can you say the same about the Precision and Recall scores? Additionally, please remember that the beta parameter significantly impacts the metric. Ideally, you pick the beta parameter to match your task so that the F-measure produces meaningful case-specific results. For example, use beta = 1 if Precision and Recall are equally significant to you, use beta = 0.5 if your emphasis is on Precision, or use beta = 2 if Recall is more critical for you.
Let’s say we have a binary Classification task. For example, you are trying to determine whether a cat or a dog is on an image. You have a model and want to evaluate its performance using the F1 score. You pass 15 pictures with a cat and 20 images with a dog to the model. From the given 15 cat images, the algorithm predicts 9 pictures as the dog ones, and from the 20 dog images, it predicts 6 pictures as the cat ones. First, let’s build a Confusion matrix with Cat as the positive class (you can check the detailed calculation on the Confusion matrix page).

| | Ground truth Cat | Ground truth Dog |
|---|---|---|
| Predicted Cat | TP = 6 | FP = 6 |
| Predicted Dog | FN = 9 | TN = 14 |
Excellent, now let’s calculate the F1 score using the formula for the binary Classification use case (TP and TN are the numbers of correct predictions, whereas FP and FN are the numbers of incorrect ones).
Precision = (TP) / (TP + FP) = (6) / (6 + 6) ~ 0.5
Recall = (TP) / (TP + FN) = (6) / (6 + 9) ~ 0.4
F1 score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.5 * 0.4) / (0.5 + 0.4) ~ 0.44
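As a quick sanity check, the same arithmetic can be reproduced in a few lines of Python. The counts come from the cat/dog example above. The F0.5 and F2 values are an extra illustration of how beta would change the result and are not part of the original calculation.

```python
# Counts from the cat/dog Confusion matrix (Cat is the positive class).
tp, fp, fn, tn = 6, 6, 9, 14

precision = tp / (tp + fp)
recall = tp / (tp + fn)


def fbeta(precision, recall, beta):
    """Standard F-beta formula; assumes precision + recall > 0."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


f1 = fbeta(precision, recall, 1)
f05 = fbeta(precision, recall, 0.5)
f2 = fbeta(precision, recall, 2)

print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")
print(f"F0.5 = {f05:.3f}, F1 = {f1:.3f}, F2 = {f2:.3f}")
```

Note that TN does not appear anywhere in the calculation: the F-beta score ignores correctly rejected negatives.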
Ok, great. Let’s expand the task and add another class, for example, the bird one. You pass 15 pictures with a cat, 20 images with a dog, and 12 pictures with a bird to the model. The predictions are as follows:
15 cat images: 9 dog pictures, 3 bird ones, and 15 - 9 - 3 = 3 cat images;
20 dog images: 6 cat pictures, 4 bird ones, and 20 - 6 - 4 = 10 dog images;
12 bird images: 4 dog pictures, 2 cat ones, and 12 - 4 - 2 = 6 bird images.
Let’s build the matrix.

| | Ground truth Dog | Ground truth Bird | Ground truth Cat |
|---|---|---|---|
| Predicted Dog | 10 | 4 | 9 |
| Predicted Bird | 4 | 6 | 3 |
| Predicted Cat | 6 | 2 | 3 |
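The Macro F1 score is a way to study the classification as a whole. To calculate the Macro F1 score, you need to compute Macro Precision and Macro Recall and then use the F1 score formula. The Macro approach treats all the classes equally, regardless of how many samples each class has, as it aims to see the bigger picture and evaluate the algorithm’s performance across all the classes in one value. Note that some libraries, including sklearn, define the Macro F1 score as the mean of the per-class F1 scores instead, which can give a slightly different value.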
Let’s calculate the Precision value for each class. To do so, we need to go row by row (the diagonal cell holds the True Positives for a specific class, whereas the other cells in its row are False Positives):
Dog Precision: 10 / (4 + 9 + 10) ~ 0.435
Bird Precision: 6 / (4 + 3 + 6) ~ 0.462
Cat Precision: 3 / (6 + 2 + 3) ~ 0.273
Macro Precision score: (Dog Precision + Bird Precision + Cat Precision) / 3 = (0.435 + 0.462 + 0.273) / 3 ~ 0.390
Let’s calculate the Recall value for each class. To do so, we need to go column by column (the diagonal cell holds the True Positives for a specific class, whereas the other cells in its column are False Negatives):
Dog Recall: 10 / (4 + 6 + 10) = 0.5
Bird Recall: 6 / (4 + 2 + 6) = 0.5
Cat Recall: 3 / (9 + 3 + 3) = 0.2
Macro Recall score: (Dog Recall + Bird Recall + Cat Recall) / 3 = (0.5 + 0.5 + 0.2) / 3 = 0.4
In this case, the Macro F1 score will be:
Macro F1 score = 2 * (Macro Precision * Macro Recall) / (Macro Precision + Macro Recall) = 2 * (0.390 * 0.4) / (0.390 + 0.4) ~ 0.395
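The Macro calculation can be reproduced from the matrix with a short sketch. It also computes the alternative definition (the mean of per-class F1 scores) for comparison. Variable names are hypothetical, and the matrix layout (rows are predictions, columns are ground truth) follows the table above.

```python
# Rows: predicted Dog, Bird, Cat. Columns: ground truth Dog, Bird, Cat.
matrix = [
    [10, 4, 9],
    [4, 6, 3],
    [6, 2, 3],
]
n = len(matrix)

# Precision goes row by row, Recall goes column by column.
precisions = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
recalls = [matrix[i][i] / sum(matrix[r][i] for r in range(n)) for i in range(n)]

macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n

# Definition used on this page: F1 of the macro-averaged Precision and Recall.
macro_f1_from_averages = (
    2 * macro_precision * macro_recall / (macro_precision + macro_recall)
)

# Alternative definition: the mean of the per-class F1 scores.
per_class_f1 = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]
macro_f1_mean_of_f1 = sum(per_class_f1) / n

print(f"Macro Precision = {macro_precision:.3f}, Macro Recall = {macro_recall:.3f}")
print(f"Macro F1 (from averages) = {macro_f1_from_averages:.3f}")
print(f"Macro F1 (mean of per-class F1) = {macro_f1_mean_of_f1:.3f}")
```

The two definitions give close but not identical values, so it is worth checking which one your tool uses before comparing numbers.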
On the other hand, the Micro F1 score works at the level of individual predictions. To calculate it, you need to compute Micro Precision and Micro Recall and then use the F1 score formula. The Micro approach sums up the True Positives, False Positives, and False Negatives of all classes before calculating the metric, so classes with more samples contribute more to the result.
Let’s calculate the Micro F1 score value for our use case:
Micro Precision score: (TP Dog + TP Bird + TP Cat) / ((TP + FP) Dog + (TP + FP) Bird + (TP + FP) Cat) = (10 + 6 + 3) / ((4 + 9 + 10) + (4 + 3 + 6) + (6 + 2 + 3)) = 19 / 47 ~ 0.404
Micro Recall score: (TP Dog + TP Bird + TP Cat) / ((TP + FN) Dog + (TP + FN) Bird + (TP + FN) Cat) = (10 + 6 + 3) / ((4 + 6 + 10) + (4 + 2 + 6) + (9 + 3 + 3)) = 19 / 47 ~ 0.404
Micro F1 score = 2 * (Micro Precision * Micro Recall) / (Micro Precision + Micro Recall) = 2 * (0.404 * 0.404) / (0.404 + 0.404) ~ 0.404
Note that when every sample belongs to exactly one class, Micro Precision, Micro Recall, and the Micro F1 score are all equal to Accuracy (here, 19 correct predictions out of 47).
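The Micro calculation can be checked the same way. This sketch pools the counts from the matrix and compares the result with plain Accuracy. Variable names are hypothetical, and the matrix layout (rows are predictions, columns are ground truth) is the one used above.

```python
# Rows: predicted Dog, Bird, Cat. Columns: ground truth Dog, Bird, Cat.
matrix = [
    [10, 4, 9],
    [4, 6, 3],
    [6, 2, 3],
]
n = len(matrix)

true_positives = sum(matrix[i][i] for i in range(n))

# Sum of (TP + FP) over classes: every row total added together.
predicted_total = sum(sum(matrix[i]) for i in range(n))
# Sum of (TP + FN) over classes: every column total added together.
actual_total = sum(sum(matrix[r][c] for r in range(n)) for c in range(n))

micro_precision = true_positives / predicted_total
micro_recall = true_positives / actual_total
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

accuracy = true_positives / sum(map(sum, matrix))

print(f"Micro Precision = {micro_precision:.3f}, Micro Recall = {micro_recall:.3f}")
print(f"Micro F1 = {micro_f1:.3f}, Accuracy = {accuracy:.3f}")
```

Both denominators count every sample exactly once, which is why the Micro scores collapse to Accuracy in a single-label multiclass task.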
The F-measure (primarily the F1 score) is widely used in the industry, so the major Machine and Deep Learning libraries have their own implementations of this metric. For this page, we prepared three code blocks that calculate the F-beta score in Python. In detail, you can check out:
F-measure in Scikit-learn (Sklearn);
F-measure in TensorFlow;
F-measure in PyTorch.
Scikit-learn is the most popular Python library for classical Machine Learning. From our experience, Sklearn is the tool you will likely use the most to calculate the F-measure (especially if you are working with tabular data). Fortunately, you can do it in the blink of an eye.
# F-measure in Scikit-learn
# Importing the functions
from sklearn.metrics import f1_score, fbeta_score
# Initializing the lists
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
# Calculating the metrics and printing the result for F1 and F0.5 scores
print(f1_score(y_true, y_pred, average='macro'))
print(fbeta_score(y_true, y_pred, average='macro', beta=0.5))
# F-measure in TensorFlow (via the TensorFlow Addons package)
# Importing the libraries and functions
import numpy as np
from tensorflow_addons.metrics import FBetaScore, F1Score
# Initializing the F-beta score metrics
metric = F1Score(num_classes=3, threshold=0.5)
metric_beta = FBetaScore(num_classes=3, threshold=0.5, beta=0.5)
# Initializing the arrays (multi-label targets and predicted probabilities)
y_true = np.array([[1, 1, 1],
                   [1, 0, 0],
                   [1, 1, 0]], np.int32)
y_pred = np.array([[0.2, 0.6, 0.7],
                   [0.2, 0.6, 0.6],
                   [0.6, 0.8, 0.0]], np.float32)
# Calculating the metric and printing the per-class result for F1 score
metric.update_state(y_true, y_pred)
result = metric.result()
print(result.numpy())
# Calculating the metric and printing the per-class result for F0.5 score
metric_beta.update_state(y_true, y_pred)
result = metric_beta.result()
print(result.numpy())
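# F-measure in PyTorch (via the TorchMetrics package)
# Installing the library (notebook syntax):
# !pip install torchmetrics
# Importing the libraries and functions
import torch
from torchmetrics import FBetaScore
from torchmetrics.functional import f1_score
# Initializing the tensors
target = torch.tensor([0, 1, 2, 0, 1, 2])
preds = torch.tensor([0, 2, 1, 0, 0, 1])
# Calculating the metrics and printing the result for F1 and F0.5 scores
# (recent torchmetrics versions require the task argument)
print(f1_score(preds, target, task="multiclass", num_classes=3))
f_beta = FBetaScore(task="multiclass", num_classes=3, beta=0.5)
print(f_beta(preds, target))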