The real value of Machine Learning (ML) is created once ML models are deployed to production and applied to a real use case. However, most past academic work has focused on theoretical aspects using pre-curated datasets like COCO or ImageNet. Real-world applications of ML usually require custom datasets, which brings many new challenges with it.
Quite often, ML models that were tuned to perfection under lab conditions fail in real-world conditions. This stems from a mismatch between the data used for training, testing, and validation (i.e., during model development) and the data the model encounters in the real world. This phenomenon is known as 'data shift'.
Almost every ML practitioner faces this problem once their models are deployed to production. It leads to ML teams spending their time collecting new data to align lab with real-world conditions, because the consequences of deploying a model suffering from data shift can be hazardous.
To keep this a light read, we will avoid jumping too deep into theoretical aspects. We recommend this Medium article for a more formal introduction to data shift.
One last side note before we start: data shift is sometimes also referred to as 'domain shift', 'concept shift' or 'concept drift', 'classification changes', 'changing environments', 'contrast mining in classification learning', 'fracture points', and 'fractures between data'. For this article, we will stick to 'data shift', though.
Data shift causes the best models to fail.
Data shift is a problem that even some of the most advanced ML teams have to deal with. In this talk, Andrej Karpathy (head of the computer vision team at Tesla) shares the story of how his team encountered it. The team tried to automatically turn on the windshield wipers when it rains, using cameras pointing at the windshield. The model's job was to predict whether the windshield is wet or not.
In production, the model broke in rare driving situations, e.g., when a car entered a tunnel. As Andrej put it, "the wipers were really excited about tunnels and would go like mad inside tunnels". The reason for this: Tesla didn't have enough images of tunnels in their dataset, and the model confused the light reflections on the windshield inside a tunnel with raindrops.
Figure 2: The above driving situations caused Tesla's wipers to go off as the distortions on the windshield were detected as putative raindrops. This happened because these driving situations were under-represented in the dataset used to develop the model. The image is taken from Andrej Karpathy.
The impact of data shift grows dramatically for imbalanced distributions and rare features. Even the smallest differences between the real-world distribution and the distribution of the data used for model development can cause devastating mispredictions.
Figure 3: For imbalanced distributions, even a small data shift can have a huge impact.
Of course, the consequences in the windshield example are not severe: the wipers could simply be turned off manually. Also, Tesla seems to have fixed the issue. But we don't even want to imagine what would happen if something like this occurred in the models steering the car on autopilot.
One of the main reasons for data shift is selection bias, which is hard to overcome.
If a model did not suffer from any data shift after being deployed for the first time, this would implicitly mean that the ML team successfully identified all relevant edge cases a priori. In the example above, it would mean that the Tesla team had thought of all the cases in figure 2 before starting data selection. This is highly improbable for any ML team.
Most people who have developed an ML model for a practical use case have probably experienced the model encountering real-world situations that were not thought of in the design phase. This can be described as an implicit selection bias.
Selection bias occurs especially often in non-stationary environments, in which the relationship between input and output changes over time.
Figure 4: a stationary vs. a non-stationary variable. For non-stationary variables, statistical properties change over time. If this is not taken into account during data collection, implicit selection bias will occur. Source
Without knowing how Tesla actually sampled their data, an example of the influence of a non-stationary environment would be if Tesla had collected the data during the Californian summer, so that samples of iced windshields were missing. When the model then encountered ice in the real world, it broke (see the first image in figure 2).
Building a data flywheel to detect and overcome data shift
Most of the research regarding detecting data shift is located in the domain of unsupervised learning.
One of the main concepts is calculating the statistical distance between a feature's distribution in the data used for model development and its distribution in the data the model receives as input in the real world. One such method is to look at the histogram intersection between lab conditions and the real world for a feature: the smaller the intersection, the bigger the data shift.
Figure 5: The smaller the histogram intersection between lab- and real-world conditions for a feature, the bigger the data shift. This method breaks for multi-dimensional problems like computer vision, though. Source
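As a sketch of this idea, the histogram intersection for a single one-dimensional feature can be computed in a few lines of NumPy. The function name and the binning scheme are our own choices, not a standard API:

```python
import numpy as np

def histogram_intersection(dev_samples, prod_samples, bins=30):
    """Estimate data shift for one feature by comparing its distribution
    during model development with its distribution in production.
    Returns a value in [0, 1]: 1 means identical histograms (no shift),
    values near 0 mean strong shift."""
    # Use a shared binning for both sample sets so the bars are comparable.
    lo = min(dev_samples.min(), prod_samples.min())
    hi = max(dev_samples.max(), prod_samples.max())
    dev_hist, edges = np.histogram(dev_samples, bins=bins, range=(lo, hi))
    prod_hist, _ = np.histogram(prod_samples, bins=edges)
    # Normalize to probabilities, then sum the per-bin minima.
    dev_p = dev_hist / dev_hist.sum()
    prod_p = prod_hist / prod_hist.sum()
    return float(np.minimum(dev_p, prod_p).sum())

# Example: production data drifted to a higher mean.
rng = np.random.default_rng(0)
dev = rng.normal(0.0, 1.0, 10_000)
prod = rng.normal(1.5, 1.0, 10_000)
same = histogram_intersection(dev, dev)       # identical data: intersection is 1
shifted = histogram_intersection(dev, prod)   # drifted data: noticeably smaller
```

Note that this only works per feature and per dimension, which is exactly the limitation discussed next.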
However, this approach soon reaches its limitations, as it has difficulties with sparse features. But as we stated above, data shift causes the biggest problems when dealing with imbalanced distributions, i.e., sparse features. Also, calculating statistical distances for multi-dimensional features quickly becomes hard, making the approach unsuitable for more complex tasks like computer vision.
Another often-used approach is novelty detection. The main idea here is to train a second model to detect how likely it is that the original model's input was drawn from the development distribution. In contrast to the statistical-distance approach, this one is also suitable for complex problems like computer vision. The biggest caveat, however, is that it detects that data shift exists but not in which features.
Figure 6: Novelty detection helps identify inputs that seem abnormal relative to the data used for model development. The abnormalities are often caused by data shift, but novelty detection cannot recognize where the data shift occurred. Source
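As one possible illustration (not any particular team's method), scikit-learn's IsolationForest can serve as such a novelty detector: it is fit on the development data only and then flags production inputs that look unlikely under that distribution, without telling us which feature shifted. The data here is synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Development data: a well-behaved two-feature distribution.
dev_data = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))

# Fit the novelty detector on the development distribution only.
detector = IsolationForest(random_state=0).fit(dev_data)

# Production batch: mostly in-distribution, plus a few strongly shifted points.
in_dist = rng.normal(0.0, 1.0, size=(95, 2))
drifted = rng.normal(6.0, 1.0, size=(5, 2))
prod_batch = np.vstack([in_dist, drifted])

# predict() returns +1 for inliers and -1 for novelties.
labels = detector.predict(prod_batch)
novelty_rate = (labels == -1).mean()
```

A rising novelty rate on production traffic is a signal of data shift, but to find out which situations are under-represented, someone still has to inspect the flagged samples.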
To overcome these limitations, we propose a more practical approach: building a data flywheel. Conceptually, it is much simpler and more straightforward than the computational approaches above. At the same time, it offers a way to identify and fix data shift simultaneously in practical projects.
The main idea is to break up the traditionally linear process of developing an ML model and integrate the model development process (including training) with the model in production, creating an accelerating feedback loop that mitigates mispredictions.
Mispredictions get flagged by the people using or testing the model in production and are then sent back to the team labeling the data, who expand the dataset to reduce the data shift. The model is then retrained on the updated data, keeping architecture and hyper-parameters the same. Alternatively, a new model is trained using only the new data to ascertain how it compares to the previous model (ceteris paribus).
Figure 7: Building the data flywheel allows ML teams to create a virtuous feedback loop between the model in production and the model development process, mitigating mispredictions.
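To make the loop concrete, here is a toy sketch of one flywheel iteration. A hypothetical 'winter' feature stands in for the iced-windshield case; the model, the features, and the labeling rule are all illustrative, not a real pipeline:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def true_label(x, winter):
    # Hypothetical ground truth: the rule flips in winter
    # (think: iced windshields the summer dataset never saw).
    wet = (x > 0.5).astype(int)
    return np.where(winter == 1, 1 - wet, wet)

# Lab data was collected only in summer (winter flag is always 0),
# an implicit selection bias.
X_lab = np.column_stack([rng.uniform(0, 1, 1000), np.zeros(1000)])
y_lab = true_label(X_lab[:, 0], X_lab[:, 1])
model = DecisionTreeClassifier(random_state=0).fit(X_lab, y_lab)

# Production traffic sees both seasons; winter predictions fail.
X_prod = np.column_stack([rng.uniform(0, 1, 400), rng.integers(0, 2, 400)])
y_prod = true_label(X_prod[:, 0], X_prod[:, 1])
acc_before = (model.predict(X_prod) == y_prod).mean()

# Flywheel step: flagged mispredictions are labeled and folded back into
# the training set; the model is retrained with the same architecture
# and hyper-parameters, only the data changes.
flagged = model.predict(X_prod) != y_prod
X_new = np.vstack([X_lab, X_prod[flagged]])
y_new = np.concatenate([y_lab, y_prod[flagged]])
retrained = DecisionTreeClassifier(random_state=0).fit(X_new, y_new)

# Evaluate on a fresh production sample.
X_fresh = np.column_stack([rng.uniform(0, 1, 400), rng.integers(0, 2, 400)])
y_fresh = true_label(X_fresh[:, 0], X_fresh[:, 1])
acc_after = (retrained.predict(X_fresh) == y_fresh).mean()
```

The retrained model recovers on the previously unseen regime because the flagged samples filled the gap in the dataset, which is exactly what the flywheel is meant to do.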
Building a data flywheel is the solution many big players bet on to deal with data shift. To name only two of many examples: in the same talk cited above, Andrej Karpathy mentioned that Tesla is working on something like this, without disclosing further details. And Amazon's research department published this paper describing their efforts in the same direction (to be fair, the paper goes beyond data shift, but data shift is one of the core problems it deals with).
The data flywheel can only work with the right model development platform.
As simple as the data flywheel is conceptually, it brings some challenges when applied to a practical ML project.
- Data horizon: How quickly should the most recent data points be used for retraining? Generally speaking, the newer the data point, the better. But this assumption might break in some use cases, e.g., for seasonal features. Here, it might be the better call to use data from the same time period last year.
- The cadence of retraining: When should models be retrained? Ideally, whenever the distribution of the updated data shifts to an extent that influences the model's predictions. Identifying this point can get tricky in practice.
- Costs: Constantly labeling new data points and retraining the model can get expensive quickly. How can we minimize costs to make the data flywheel economically feasible?
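For the retraining cadence, one minimal sketch is to watch a rolling statistic of a monitored feature and fire a retraining trigger when it drifts too far from its development-time value. The class name and threshold below are illustrative, not a standard API:

```python
import numpy as np
from collections import deque

class DriftTrigger:
    """Toy retraining trigger: signal a retrain when the rolling mean of a
    monitored feature drifts more than `threshold` development-time standard
    deviations away from the development-time mean."""

    def __init__(self, dev_samples, window=200, threshold=3.0):
        self.dev_mean = float(np.mean(dev_samples))
        self.dev_std = float(np.std(dev_samples))
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Feed one production value; return True if a retrain is due."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough production data yet
        drift = abs(np.mean(self.window) - self.dev_mean) / self.dev_std
        return drift > self.threshold

rng = np.random.default_rng(1)
trigger = DriftTrigger(rng.normal(0, 1, 5000), window=200, threshold=3.0)

# In-distribution traffic does not fire the trigger...
fired_in = any(trigger.observe(v) for v in rng.normal(0, 1, 500))
# ...but a sustained shift in the feature does.
fired_shifted = any(trigger.observe(v) for v in rng.normal(4, 1, 500))
```

The window size trades off reaction speed against false alarms, which is precisely the data-horizon question again; both knobs have to be tuned per use case.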
The answers to the first two questions will be different for every use case; it is the ML team's job to figure them out. To do so, they must understand the data and get a feeling for it, a piece of general advice for every ML practitioner. To achieve this, ML teams should not just outsource the labeling work or delegate it to an intern. They should annotate the data on their own, but do it smartly by using one of the many tools out there to speed up the annotation process and save costs and time.
Keeping the overall costs of the data flywheel under control imposes challenges on an architectural level. Real-world feedback should be integrated into the model development process with minimal effort, and the relabeling and retraining processes should require little to no manual work.
To build a data flywheel, the architecture used for the model development process must fulfill the following requirements:
- It should have endpoints that can be easily integrated with the model in production.
- It should make the processes of labeling, training, testing, and validating as smooth as possible.
- It should automate as many tasks as possible along the way.
Here at Hasty.ai, we're building a tool that does exactly that. With our API endpoints, images can be sent back to the annotation tool with one line of code. Our AI assistants then automate the annotation process by 70%. You also get visual feedback to understand what kind of data causes the models to fail. The model can then be retrained with one click (including hyper-parameter optimization). Finally, the retrained model can be pushed back to production seamlessly through the API.
While this is a shameless plug, let us also point out that Hasty is completely free to try, so you don't need to take our word for it; you can test it yourself.
Much more work is needed to be done to bridge the proof-of-concept to production gap. We should think more systematically about the full cycle of ML-projects. - Andrew Ng
Data shift is a problem that causes many models to break and that most ML teams working on real-world applications will face. With the data flywheel, we propose a framework to overcome data shift. In contrast to most frameworks, it does not rely on complex statistical or mathematical considerations. It is rather a new, practical approach to building models.
The ML community is quite advanced regarding the mathematical theory behind models. But frameworks like the data flywheel are still too rarely applied in practice.
What are your thoughts on the problem of data-shift and the data flywheel as our proposed solution? Join the discussion in our community and leave your comment.
We're a Berlin-based startup building the next-gen annotation tool for computer vision. We have custom AI assistants that observe you while you annotate and then take over the annotation work for you. They allow you to annotate data 10x faster and provide rapid feedback so you can validate and adapt your models as you work.
What content are you interested in?
We just started our blog. We want to continue sharing our experience of working with computer vision projects on a daily basis. To make sure that we provide content relevant to you, please fill out this survey. It takes less than 1 minute, we promise!