The first end-to-end vision AI platform with truly automated data cleaning capabilities

Clean your data 35x cheaper using Confident Learning instead of Consensus Scoring.

By Alex Wennman
Published on August 3rd, 2021

Did you know that up to 70% of all work to create a data asset for vision AI is spent on quality control?

We didn’t when we started Hasty. We thought that by automating annotation, we could deliver significant savings and help anyone complete their vision AI projects with half the resources in a quarter of the time. This turned out to be partly true. We did manage to bring automation percentages for vision projects up to 85–95%. But when we talked to our users, they told us that although this was very good, they still had to spend considerable time and money on quality control.

When we researched quality control in more detail, we found that the main problem was not fixing errors but finding them. We had many requests for consensus scoring and annotator performance comparison, but those approaches didn’t sit well with us.

After all, we are a company focused on bringing automation to the whole of the vision AI process, so why not try to automate the process of finding potential errors by using AI? We spent a large part of 2021 on this, and now we are releasing our creatively named “Error Finder” to the world.

With Error Finder, you can use AI to find any type of annotation mistake in your dataset. It is a state-of-the-art solution that pushes the boundaries of what has been technologically possible before. But more importantly, we think that this enables you to create and maintain a data asset much faster and cheaper than ever before.

In this blog post, we first summarize the current status quo of data quality control in vision AI, then show you how Error Finder changes this process. We also explain how we calculated the 35x cost savings figure. Finally, we’ll dive into how Error Finder works and what’s under the hood.

The current status of quality control in vision AI

Before we go into how Error Finder works and how you can use it, let’s first look at consensus scoring — the gold standard for quality control today.

Essentially, you get multiple annotators to annotate the same image and then compare the results. If the annotators produce similar annotations, you can be confident that they are aligned and that the data is of good quality. If there are outliers or differences in the output, you can review the images manually and see what is going on.

These “agreement metrics” do their job well but come with built-in redundancy. You’ll have at least two annotators annotate every image in your dataset. Over time, you can optimize the process so that two or more annotators annotate only every fifth or tenth image. Still, you will spend considerable resources on doing the same job twice or more. The result is an increase in overall data asset creation costs by 2–3x at least.

Using AI to help with quality control is a game-changer

Though consensus scoring has been the gold standard for the past ten years, we thought that there had to be another, better way to do QC on your data.

Following the notion of ‘using AI to train AI’, we spent the past six months researching ways to automate the process and found the answer in Confident Learning. With this, we’re able to reduce your QC tooling costs by a factor of 35.

Here’s a concrete example that illustrates the benefits compared to doing consensus scoring with one of the more established tools on the market, Amazon SageMaker:

  • Say you labeled 10,000 images with five objects each.
  • Three annotators label every fifth image in a consensus scoring workflow (we saw teams with 6–7 annotators per image to start with, so that’s a conservative estimate).
  • One annotation costs you 84 cents.
  • As a result, your extra QC costs for tooling are $16,800, not including time and cost for manual work and overhead.

With Hasty:

  • Running Error Finder on all 50,000 annotations costs 30,000 credits.
  • Assuming a relatively high error rate of 15%, it’d cost 22,500 credits to fix all errors (3 credits per corrected error for segmentation models).
  • Resulting in overall QC costs for tooling of $476 (assuming that you’re on our builder plan).

As you can see, automating the QC process decreases the cost for tooling alone by a factor of 35, not including your time savings and reduced workforce costs. This result, to us (and hopefully to you), is incredible. But it is also a logical step when you use machines to do work for you instead of doing it completely manually.
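The arithmetic behind the 35x figure can be reproduced in a few lines of Python. This is only a sketch built from the numbers in the example above. The variable names are illustrative, and it assumes that the "extra" consensus cost covers the two redundant annotators on every fifth image, which is the reading that yields $16,800. The $476 total is taken as quoted for the builder plan rather than derived from a credit price.

```python
# Figures from the example above; variable names are illustrative.
images = 10_000
objects_per_image = 5
annotators_per_image = 3
consensus_every_nth_image = 5
cost_per_annotation_cents = 84

# Consensus scoring. Assumption: only the redundant annotators count as
# extra QC cost, because the first annotator's pass is needed anyway.
consensus_images = images // consensus_every_nth_image
extra_annotations = consensus_images * objects_per_image * (annotators_per_image - 1)
consensus_cost_usd = extra_annotations * cost_per_annotation_cents / 100

# Error Finder, using the credit figures quoted in the post.
total_annotations = images * objects_per_image
error_finder_credits = 30_000
error_rate_percent = 15
credits_per_fix = 3
errors_to_fix = total_annotations * error_rate_percent // 100
total_credits = error_finder_credits + errors_to_fix * credits_per_fix
hasty_cost_usd = 476  # as quoted for the builder plan, not derived here

savings_factor = consensus_cost_usd / hasty_cost_usd
print(f"Consensus scoring: ${consensus_cost_usd:,.0f}")
print(f"Error Finder: {total_credits:,} credits, about ${hasty_cost_usd}")
print(f"Savings factor: about {savings_factor:.0f}x")
```

Swapping in your own image count, error rate, or annotator count shows how sensitive the ratio is to those inputs.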

If you want to estimate your overall project costs using Hasty, you can use this spreadsheet.

What Error Finder can do for you

  • Automatically detect wrong classes and highlight the largest discrepancies between what our model sees and what your annotators did.
  • Automatically detect “missing labels” (not annotated) on allegedly completed images.
  • Automatically find artifact annotations that our model predicts are mistakes.
  • Automatically detect annotations with a different shape from what the model expects.

In short, our new feature can help with any type of quality control that concerns image classification, object detection, and segmentation.

How it works

First, we would like to thank the team around Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang for their work. Their research on Confident Learning laid the groundwork for our Error Finder. Research based on their algorithm showed that widely used benchmark datasets, including popular vision datasets, have an average label error rate of 3.4%.

The contribution by Northcutt et al. outperformed prior work as they were the first to follow a data-centric approach:

“Advances in learning with noisy labels and weak supervision usually introduce a new model or loss function. Often this model-centric approach band-aids the real question: which data is mislabeled?”

Without going too much into detail, the main idea is to use a given model to estimate the joint distribution between the noisy (given) and uncorrupted (unknown) labels. Where the given label and the estimated true label differ, the example is flagged as a potential error. Candidates whose likelihood of being an error exceeds a certain threshold are then returned to the user, sorted by that likelihood. We recommend reading the whole paper for a more detailed explanation.
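To make the idea more tangible, here is a minimal, simplified sketch of the confident-joint step in plain Python. It is not Hasty's implementation. The function name `flag_label_issues`, the toy data, and the ranking by probability margin are assumptions made for illustration, and the sketch assumes every class has at least one labeled example.

```python
def flag_label_issues(labels, pred_probs):
    """Return ([(index, suggested_class)], confident_joint), most suspicious first.

    labels:     given (possibly noisy) class index per example
    pred_probs: per-example list of predicted class probabilities
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the model's average confidence in class k
    # over the examples that were labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    # confident_joint[given][likely_true] counts examples the model
    # confidently assigns to each (given label, likely true label) pair.
    confident_joint = [[0] * n_classes for _ in range(n_classes)]
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        likely_true = max(confident, key=lambda k: p[k])
        confident_joint[y][likely_true] += 1
        if likely_true != y:
            # Assumed ranking score: how much more probable the
            # suggested class is than the given label.
            issues.append((p[likely_true] - p[y], i, likely_true))

    issues.sort(reverse=True)
    return [(i, k) for _, i, k in issues], confident_joint


# Toy data: examples 2 and 5 carry labels the model disagrees with.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
              [0.2, 0.8], [0.1, 0.9], [0.7, 0.3]]
issues, joint = flag_label_issues(labels, pred_probs)
print(issues)
print(joint)
```

The off-diagonal entries of the confident joint are the candidate errors. Normalizing the matrix gives an estimate of the joint distribution described above.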

For finding classification errors, we implemented the approach almost 1:1. As explained above, the algorithm builds on top of an already trained model, so Error Finder’s performance relies heavily on the performance of that underlying model. This puts us in a great position: we have already mastered automatically training custom and robust models at scale for our annotation automation.

To automatically detect missing labels, artifacts, or labels with a wrong shape, we developed a two-step algorithm. First, we match predicted annotations with the labeled data using IoU thresholds. Then, we perform surgery on Detectron to obtain not only the most confident prediction but the probabilities for all predictions as well. We then run these probability vectors through the same algorithm that we use for the classification Error Finder.
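The first step, matching predictions to labels by IoU, might look roughly like the sketch below. It is an illustration under assumptions, not the production algorithm: the box format `(x_min, y_min, x_max, y_max)`, the greedy matching strategy, the 0.5 threshold, and the function names are all hypothetical. Reading unmatched predictions as possible missing labels and unmatched labels as possible artifacts is also an interpretation. The second step (extracting per-class probabilities from the detector) is not shown.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(labeled, predicted, threshold=0.5):
    """Greedily pair labeled and predicted boxes, highest IoU first."""
    pairs = sorted(
        ((iou(l, p), li, pi)
         for li, l in enumerate(labeled)
         for pi, p in enumerate(predicted)),
        reverse=True,
    )
    matches, used_labels, used_preds = [], set(), set()
    for score, li, pi in pairs:
        if score < threshold:
            break  # all remaining pairs overlap too little
        if li in used_labels or pi in used_preds:
            continue
        matches.append((li, pi, score))
        used_labels.add(li)
        used_preds.add(pi)
    unmatched_labels = [i for i in range(len(labeled)) if i not in used_labels]
    unmatched_preds = [i for i in range(len(predicted)) if i not in used_preds]
    return matches, unmatched_labels, unmatched_preds


labeled = [(0, 0, 10, 10), (50, 50, 60, 60)]
predicted = [(1, 1, 11, 11), (100, 100, 110, 110)]
matches, unmatched_labels, unmatched_preds = match_boxes(labeled, predicted)
print(matches)           # matched (label index, prediction index, IoU)
print(unmatched_labels)  # labels with no prediction: possible artifacts
print(unmatched_preds)   # predictions with no label: possible missing labels
```

For the matched pairs, the per-class probabilities of the prediction can then be compared against the given label in the same way as for classification.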

But even the best model is, of course, not always right. That’s why we never fix issues for you. We just find them and sort them from the most likely error to the least. What you have to do then is go through and accept or reject the suggestions.

By doing so, our models get better, and your automation percentages will go up for both annotation and quality control. The result is a self-improving system.

Try it out and help us make it better

The best thing is: you can see for yourself and try it for free. Error Finder is available for all users in Hasty starting today.

If you are interested in trying it out, everything you need to know can be found here.

As usual, you can get in touch with me directly at alex(at)hasty.ai for any questions, praise, or general niceness. Feel free to share your complaints, too: I have a two-month-old baby at home who speaks in CAPS-LOCK all day, so I can take it!

In conclusion — give it a test, let us know what you think, and help us make consensus scoring and enormous QA budgets a thing of the past.