Understanding this relationship is vital for any company. Without it, you may end up with an assortment of ML and business metrics and no clue whether they correlate with one another. Even if analysts predict an increase in profit margin, you need to know whether your new, fancy ML integration is the reason for it or whether other factors are responsible. Finally, improving a specific ML metric does not necessarily result in more profit: you need to understand how different aspects of your AI model’s performance impact your bottom line.
Set a goal but be flexible
Before starting development, you want to have a hypothesis in place for how AI will impact your business. It can be something simple. You might just want to say that with your new AI-powered feature, 20% of users that have tried it will be upsold to a larger subscription.
Then, you need to set the initial goals for your model. This can be hard, but you need something to aim for. However, you might not pick the right ML metric when you get started. As you don’t have any business data, you don’t know how different ML metrics correlate to your business. In short, you are in for an exploration stage that every ML project must go through.
Which machine learning metric(s) to pick?
There’s a reason that there are so many ML metrics. You might ask yourself: Why do we even need such diversity? Isn’t it better, faster, and easier to use a single metric and skip the added complexity of choosing?
Well, no. Machine Learning metrics are imperfect and sometimes misleading. A single metric can let you down in several ways, for example:
- The metric value might be good, but the model does not have any predictive power (the metric does not reflect the actual model performance);
- Your model might overfit, yet the metric will be good until you test the model in the production environment;
- The model might not overfit and perform fine in the development environment but still underperform in production because of the data shift (when developing a model, the metric value is good, but in production, it significantly decreases).
As you see, there are many non-obvious cases where relying heavily on a single ML metric is ineffective. To test and improve an ML model, you need to choose metrics that adequately reflect the quality of the model from different perspectives. Still, as you can imagine, choosing the right metrics is a challenging task since you need to keep many nuances in mind.
For example, the accuracy metric can be misleading when there is a class imbalance in your data. Imagine you have trained a vision AI closed-eye detector. As a test set, you want to use 110 images: 100 with an open eye and 10 with a closed eye. Let’s assume that the closed eye class is the positive one.
So, when evaluating your model, 90 out of the 100 open eye images are identified correctly (True Negative = 90, False Positive = 10), and 5 out of 10 closed eye images are also labeled right (True Positive = 5, False Negative = 5). In this case, accuracy will be:
accuracy = (5 + 90) / (90 + 10 + 5 + 5) = 95 / 110 ≈ 0.86
It might seem like a good result. However, if you simply predict the open eye class for every image of your test set, you will get a higher accuracy (True Positive = 0, False Negative = 10, True Negative = 100, False Positive = 0):
accuracy = (0 + 100) / (0 + 10 + 100 + 0) = 100 / 110 ≈ 0.91
This phenomenon is called the accuracy paradox, and it happens because of a heavy class imbalance in the data. So, if you have such a dataset, you should consider metrics that are more robust to imbalance, such as the F1-score.
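To make the arithmetic easy to check, here is a minimal Python sketch that recomputes both accuracy values from the confusion-matrix counts in the closed eye example and adds precision, recall, and F1-score for the positive class. The function name `confusion_metrics` is ours, and treating an undefined precision or recall (division by zero) as 0.0 is an assumption made for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 when a ratio is undefined (no positive predictions or no positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the closed eye example (closed eye = positive class).
trained = confusion_metrics(tp=5, fp=10, tn=90, fn=5)
# Baseline that predicts "open eye" for every image.
always_open = confusion_metrics(tp=0, fp=0, tn=100, fn=10)

for name, metrics in [("trained model", trained), ("always open", always_open)]:
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

With these counts, the always-open baseline wins on accuracy (about 0.91 versus 0.86) but gets an F1-score of 0, while the trained model reaches 0.4.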
Understanding and adapting false negatives and positives to your use case
Understanding model performance strongly depends on the task. Let’s give a simple example. Imagine you are performing a cancer screening test and need to choose whether you want to try and avoid False Positive or False Negative results. Sure, there is always a specific tradeoff between FN and FP, and the question itself might seem too philosophical and even unethical. Still, it is up to you to decide what is more important for your task because this use case might be viewed from two different perspectives. On the one hand, if you test a healthy person positive and start treating him as if he was sick, for example, giving him chemotherapy, you will badly damage his health. On the other hand, a False Negative cancer screening test result can cost a life.
Finding the right threshold
As mentioned in the previous section, ML models usually predict the probability of an element corresponding to a particular class - not the class itself. In such a case, you face the problem of choosing the correct threshold used to assign elements to various classes. Let’s back it up with another example.
Imagine having a vision AI plant pest detector trained to predict whether you have a pest problem based on visual signs of pests, such as holes in leaves or signs of rot, as well as images of the pests themselves. You have trained the model to give you a confidence score, basically the probability of pests being in your field. For example, if there are some holes in the leaves, the model might be 20% confident that you have pests, whereas if a pest is visible in an image, the probability will be 100%. Still, you need a threshold for assigning each prediction to a class (pests present or pests absent). You might stick with a standard 50% threshold or decide that if your model is 60% or more confident that there are pests in your field, it is time to do something about it.
Unfortunately, it is tough to come up with an optimal threshold right away. However, even without a fixed threshold, you can still assess your model by plotting how its metrics change across different threshold values and measuring the area under the resulting curve. This is what AUC-ROC (area under the ROC curve) and AUC-PR (area under the precision-recall curve) are for.
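As a rough illustration of how the threshold changes the picture, the sketch below computes precision and recall at a few thresholds and a threshold-free AUC-ROC. The scores and labels are made up for this example, the helper names are ours, and AUC-ROC is computed through its rank interpretation (the probability that a random positive scores higher than a random negative) rather than with any particular library.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # Assumption: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def roc_auc(scores, labels):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical pest-detector confidences and ground truth (1 = pests present).
scores = [0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.5, 0.6):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")

print(f"AUC-ROC={roc_auc(scores, labels):.4f}")
```

On this toy data, raising the threshold from 50% to 60% drops recall from 0.75 to 0.5, while AUC-ROC (0.8125) summarizes ranking quality without committing to any threshold.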
Going beyond image classification
So far, we have been dealing with image classification metrics as we find them the easiest to understand for experienced data scientists and newcomers alike.
Nevertheless, not all ML results you want to evaluate have class labels or probabilities. For example, you might have bounding boxes produced by Object Detectors or segmentation masks. In these cases, standard Classification or Regression metrics will not help you. You need something developed explicitly for such a task, and that is where metrics such as IoU (Intersection over Union) come in.
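For readers who have not met IoU before, here is a minimal sketch for two axis-aligned bounding boxes. The `(x_min, y_min, x_max, y_max)` box format, the function name, and the sample coordinates are assumptions chosen for illustration; real detectors may use other box conventions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth boxes, in pixels.
predicted = (10, 10, 50, 50)
ground_truth = (30, 30, 70, 70)
print(f"IoU = {iou(predicted, ground_truth):.3f}")
```

An IoU of 1.0 means the boxes coincide and 0.0 means they do not overlap at all; the two boxes above share a 20 by 20 pixel region, which gives 400 / 2800, or roughly 0.143.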
Many ML metrics are a necessary pain in the ass
So, that is it. Now you understand why we need so many ML metrics. It is not about complicating the task. It is about choosing proper metrics that will help you adequately evaluate your Machine Learning model from different points of view. However, in the end, it is up to you to decide which metrics to use because you are the one who best understands what your task needs.
Business metrics and computer vision
Direct and intermediate metrics
In some cases, when talking about business metrics in Artificial Intelligence, the connection between the money earned and the AI solution is direct. This is especially true when customers pay to get access to your solution. If that is your case, you are lucky, as you can directly calculate and analyze values that translate into revenue. For instance, you can find out:
- The number of users that pay for your AI-powered app;
- The number of customers that pay extra for additional AI capabilities of your core product;
- The reduction in customer returns caused by quality issues;
- The increase in LTV (lifetime value) among users that use AI compared with those that don’t.
Still, many companies will struggle to draw a direct connection between money and an AI solution. Therefore, you must apply intermediate metrics to bridge the gap between business and the ML model. For example, you can calculate:
- The number of license plates detected when a car exceeds the speed limit;
- The number of clients that got into the shop and bought nothing;
- The duration of the goods’ stay in the warehouse;
- The amount of time saved in the production process;
- An increase in the successful diagnosis of a particular disease.
Intermediate metrics are the way to assess the effectiveness of an ML integration into the business. In other words, with them, you can explain to stakeholders the benefits of using AI-powered solutions. Such metrics will almost always connect to either lower costs or higher profits. This is, after all, the main reason for companies to do most technical projects.
Optimizing ML models using business KPIs
Unfortunately, ML models are best optimized using metrics that are difficult for the business to understand. For example, how do you connect the area under the ROC curve in a binary image classification task to money? You cannot do it directly; even the question itself seems weird. So, companies face two challenges: how to measure the effectiveness of their Machine Learning application and how to maximize it.
Here, measuring is easy if you have something to compare your newest model with, such as historical data or a previous ML or non-ML solution. You are essentially comparing your new model with an earlier solution. This is a good approach as it can be used even when you cannot directly connect an AI solution with business KPIs or develop useful intermediate metrics. Let’s take a look at an example. Imagine you had a simple OpenCV quality control system, but your Data Science department developed a brand new solution that addressed the flaws of the previous one. In such a case, you can compare the two systems by running them alongside each other for some time on separate production lines. Then, you check the number of returns you’ve had for each production line. This will give you a clear understanding of how AI impacts quality and reduces cost.
As for business optimization using Machine Learning, the best approach you might take is to translate the business metric used in a specific field into an ML metric that an ML model will optimize for when training. However, it will take a lot of effort and time to solve this problem because of various obstacles and the necessity for Data Scientists to dive deeply into the business. Still, if you manage to pull it off and get a metric that perfectly corresponds to your needs, the quality of an ML model from the business perspective will immediately rise.
For example, imagine your ML model predicts whether a customer will leave. In this case, as a business metric, you can use a graph whose vertical axis shows the number of customers the model predicted to leave and whose horizontal axis shows the total estimated funds of these customers. The business picks an optimal point on this curve, and Data Scientists use linear transformations to turn the graph into a PR curve. Thus, optimizing the AUC-PR metric will simultaneously improve the business metric.
Also, in many cases, ML integration into business means building two models: a predictive one and an optimization one. The predictive model is the more complicated and essential of the two because the optimization model consumes its results, so every prediction mistake reduces what the optimization model can achieve. For example, if your prediction model underperforms when predicting whether customers will leave, your optimization model will underperform as well, and you will retain fewer customers than you could have, which is bad for business. So, there is pretty much no room for error. You need to be sure your AI solution will benefit the company because otherwise, it might be a disaster.
Fortunately, many ML metrics correlate strongly with business metrics once you put some sensible business logic on top of their results.
Let’s back it up with an example. When working on a Classification task, you can assign dollar values to False Positive and False Negative predictions. Thus, you get a price for each mistake your model makes. With that, you can build any business logic on top of the confusion matrix results. The model’s output is now connected to a business KPI, and you can either associate it with traditional business metrics or create a unique business metric for your solution. Fortunately, many Classification ML metrics are based on the confusion matrix, so you should not face significant obstacles when building business logic on top of them. To find out more, check out the metric Airbnb uses to measure its fraud prediction algorithm.
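A minimal sketch of this idea, reusing the error counts from the closed eye example earlier in the article, could look as follows. The dollar values per mistake are purely hypothetical and the function name is ours; in practice, these costs have to come from the business side.

```python
def total_error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given a price per False Positive and per False Negative."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical prices: a false alarm is cheap, a missed closed eye is expensive.
COST_FP = 5.0
COST_FN = 100.0

# Error counts from the closed eye example.
trained_cost = total_error_cost(fp=10, fn=5, cost_fp=COST_FP, cost_fn=COST_FN)
always_open_cost = total_error_cost(fp=0, fn=10, cost_fp=COST_FP, cost_fn=COST_FN)

print(f"trained model: ${trained_cost:.2f}")
print(f"always open:   ${always_open_cost:.2f}")
```

Under these assumed prices, the trained model costs $550 in mistakes versus $1,000 for the always-open baseline, so the cost-based metric ranks the two models in the opposite order from accuracy.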
This is not an easy task. Data Scientists are used to operating with specific metrics, whereas the business side uses others. Still, without understanding the connection between them, it is impossible to build an ML solution that is successful from all points of view.
So, when working on your next ML project, try to study the business field your model will operate in and figure out the business metrics used in it. Then, you should identify whether the ML metrics you want to use can be connected to your business metrics. It is crucial to take this step early to avoid misunderstandings between you and the business.
A great way of identifying the relation among metrics is to get feedback from the business directly. This can help uncover edge cases where ML metrics look fine, but the solution does not work as intended. Going forward, you should choose ML metrics that will help you assess your model from various perspectives. As we have already mentioned, there are many metrics to choose from, and which ones you should use depends on your use case, their relation to the business, and the business logic you want to build on top of them.
Thanks for reading!