2021.05.08

Uncovering hidden biases in your data

Tobias Schaffrath Rosario

This post is a hands-on guide on how you can use Hasty's tooling to de-bias your data.

The other day, I was chatting with one of the members of our community. We started talking about the challenges she's facing with her project right now, and our exchange boiled down to the following question: "How can I find hidden biases in my data? Can you create a guide on how to do this with Hasty?"

So here we go: a hands-on guide to finding and fixing those biases with Hasty's tooling.

Bias in AI is a tricky thing; after all, when you train an AI, the algorithm picks up whatever biases are present in your dataset. So, how can you find the harmful biases in your data that cause your models to break and may cause substantial damage?

The answer: you, as an ML engineer, need to understand how your models interact with the data and hunt down the biases yourself. We're exploring some exciting approaches in the domain of active learning that might help in the future. But even these approaches can only assist humans, never replace them. So I'm sorry; you need to put in some work.

To truly understand how your models interact with your data, you need to test, test, and test your models under real-world conditions. With Hasty, you can do this for vision AI applications without long set-up processes or MLOps hassle.

From model-centric to data-centric AI

A slide by Andrej Karpathy. In real-world applications, performance improvements mostly come from understanding your data better, not from working on the algorithm itself.

Not so long ago, Andrew Ng gave a great talk on this topic. One of the insights he shared was from a QA project in manufacturing. Andrew and his team were stuck at an accuracy of 76.2%.

He then split the team in two. One group kept the model constant and added new data and improved data quality to de-bias it, while the other kept the data constant and tried to improve the model.

The team working on the data was able to boost the accuracy to 93.1%, whereas the other team couldn’t improve the performance at all.

This led Andrew to advocate for a new way of building ML models: moving from model-centric to data-centric AI. His teams now generally build a model first, leave it unchanged, and iterate on the data before tweaking the model. This approach is producing great results.
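To make the idea concrete, here is a minimal, self-contained sketch of such an experiment. Everything in it is invented for illustration: the "model" is a one-parameter threshold classifier and the data points are synthetic, so the effect of label quality is easy to see. The point is the workflow, not the numbers from Andrew's project.

```python
# Toy data-centric experiment: same fixed model, better data wins.

def fit_threshold(data):
    """Fit the 'model': the midpoint between the two class means."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(threshold, data):
    """Fraction of points classified correctly by the threshold."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

# Ground truth: class 0 lives below x = 0.5, class 1 above.
clean = [(i / 20, int(i >= 10)) for i in range(20)]

# Simulated annotation errors: a chunk of class-1 points mislabeled as 0.
noisy = [(x, 0 if 0.5 <= x < 0.7 else y) for x, y in clean]

print(f"trained on noisy labels: {accuracy(fit_threshold(noisy), clean):.2f}")
print(f"trained on clean labels: {accuracy(fit_threshold(clean), clean):.2f}")
```

With these synthetic points, the model trained on noisy labels scores 0.90 on the clean data while the identical model trained on clean labels scores 1.00: the accuracy gap comes entirely from data quality, which is exactly the kind of result the data-centric group saw.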

Hasty is built for data-centric experiments, without writing a single line of code

Running experiments like Andrew's teams do is super easy in Hasty, without writing a single line of code. When you start annotating your data in Hasty, or upload already-annotated data, we immediately train a model for you in the background. The model is then used to make predictions for the next image to be annotated.

This is how our AI assistants can help you de-bias a dataset:

Let’s assume you’re working on an autonomous driving use case and only have raw data to start with. You start annotating, and after a few hundred images, the assistants work great for images taken in the sun. Then you try to annotate an image taken in the rain, and the assistants fail. This tells you that you should add more rainy images to your dataset to mitigate harmful biases in your data.

Consequently, our assistants not only speed up the annotation process by 10x but also give you visual feedback on the model’s performance early in your project, without writing a single line of code.

In Hasty, we offer you predefined architectures and hyperparameter configurations that have proven to work well for most use cases. Initially, the only decision you need to make on the model-building side is whether you want to do object detection, instance segmentation, or semantic segmentation, leaving you time to focus on the data work.

Check out this post to find the best annotation strategy for your use case. The graphic is under a free license, so you can go ahead and share it.
