Your model’s success directly depends on the quality of your annotations. However, in reality, most datasets consist of raw unlabeled data. As a result, whole teams and departments end up spending hundreds of hours on data annotation, even for the simplest tasks.
Hasty already provides an environment with automated tools to speed up your annotation process. However, we wanted to directly address the need to label a vast amount of data.
To do so, we added the Active Learning feature to the platform. In short, it offers the user to label only the images the model is most unsure about instead of the whole dataset. This makes the annotation process more efficient and potentially improves your model’s performance – so you work smarter, not harder.
On this page, we will explore:
Let’s jump in!
To understand how Active Learning works, let’s first grasp the difference between Active Learning and the conventional approach.
In standard data annotation, the pipeline looks as follows:
As you see, a massive shortcoming of this method is that it requires us to label the entire dataset, which becomes painful and time-consuming as the size of the dataset grows.
This leads us to a question: do you really need to label the entire dataset? Most likely, you will have many raw images that will not improve your model significantly if labeled.
That’s why, with Hasty, you can not only apply Automated labeling to save time but also take advantage of the Active Learning (AL) feature. It uses an iterative algorithm that suggests the best images for the user to label (in the image annotation case). This might save tons of resources and time while having the potential to produce better results.
There are different ways to use Active Learning. The Unlabeled Pool scenario is one of the most common ones. This is how it works:
You probably wonder: how exactly does the model understand which images are more informative than the others?
In Hasty, we employ the Uncertainty Sampling technique to detect the most informative (uncertain) instances. The logic behind it is simple: the more uncertain the model is about some image’s class, the more informative it would be to label it.
Think of this classroom example. The teacher gave you a task on a math topic you are familiar with, but there is one formula you do not know. Two scenarios may follow:
Probably, it is more efficient when only the information you are unsure about is explained (Scenario 1) – and this is roughly how Uncertainty Sampling works.
The algorithm of Uncertainty Sampling is the following:
As an option, users can just label any image they want. This would be close to Random Sampling (assuming the user has no bias in picking images with certain content). This is also a valid heuristic, and oftentimes, one that is very hard to beat.
However, there are cases when Active Learning produces better results than Random Sampling. To illustrate the idea, let’s observe a toy example. Suppose we need a model that distinguishes between two classes – Cats and Milkshakes.
Let’s say the real class distribution looks as follows:
Based on that, we might suppose that the separation line should look somewhat like a slightly tilted straight line:
Suppose we can label only 28 instances to train the model. Which images should we choose?
In Random Sampling, as the name suggests, the user will label the images randomly. Here is what the separation line may look like:
The separation line we received is too tilted and does not reflect the actual distribution of classes perfectly. As you can see, random sampling does not guarantee high model performance.
In the Active Learning approach, we would try to label only the most informative, i.e. uncertain images. Labeling instances which are clearly cats or milkshakes is not very useful from the model’s perspective.
It would be dishonest not to mention one big caveat here: the problem of balancing Exploration and Exploitation in Active Learning.
Let’s take the case of self-driving cars. Imagine having around 1 million images in the dataset. How many should you rank?
There are two extremes you could go for:
This is one of the most challenging problems to balance. So far, Hasty uses predefined parameters to change the Explore vs. Exploit ratio for ranking. Currently, we run ranking on 100 images. However, this technique heavily under-explores if the number of unlabeled images is much greater than that number.
This will change as the feature matures. We plan to make these settings adjustable so you will be able to rank as many images as you need at a time. Moreover, you can simply label the 100 suggested images and run Active Learning again to receive new suggestions for what to label next.
There is yet another issue. How do you know which Uncertainty Sampling method works best? Should you go for Margin, Entropy, Variance, or nothing at all?
While the right heuristic on the right data can yield significant improvement, the downside is that there is no way to know this in advance.
This is why Hasty retrains models in quick successions. This way, you can get almost immediate feedback on your choice of heuristic and adjust it if you do not notice improvements.
Active Learning is a technique that tries to choose the most useful data to label while omitting uninformative samples. The advantages of using Active Learning in Machine Learning are as follows:
Potential drawbacks include:
In Hasty, we have integrated the Active Learning feature to deliver you the best annotation experience. The feature analyzes the dataset and suggests you the best images to label, saving your time and nerves.
To check out our Active Learning tutorial, click here.