The Frankensuite problem in vision AI

Maulik Chetri

Today, many tools and platforms aim to assist AI teams in various ways. Many of them are great. However, an ML engineer can run into trouble when combining several of them in one pipeline, as the tools don't always play nicely with each other. In this article, we try three tools (Labelbox, FiftyOne, and Weights and Biases) to see how difficult it is to use all three in a project.

As projects related to machine learning and vision AI have gotten more complex and ambitious in the past decade, there's a burgeoning scene of new tools and software specifically designed to assist these projects. The software can be everything from a simple labeling tool, to data curation software, all the way to software that helps you build and run models. In short, for whatever you are trying to do throughout an ML project, there's software that can help.

However, that software often comes in two different flavors.

First, you have tools dedicated to a specific step in the process - for example, Labelbox for annotation, FiftyOne for data curation, and Weights and Biases for model experiment tracking.

Secondly, you have tools like Hasty, which is software that covers the whole ML lifecycle.

We often find ourselves talking with customers and users about the "Frankensuite" environment that exists today in terms of available software. At the core of what we mean by the "Frankensuite" problem is this: all three tools mentioned above can help you, but how do they play with each other? How much "stitching" do you have to do yourself to make these different tools work in your pipeline?

In our previous experience (2 years ago), there were considerable difficulties integrating different tools in a pipeline. Essentially, you would stitch different software together in various hacky ways (like a Frankenstein monster). For this article, then, we set out to see if that was still the case. Would it be a considerable pain or an afternoon of using pre-existing integrations?


We worked with three tools: Labelbox, FiftyOne, and Weights and Biases. We used Labelbox to annotate the images and assign their classes, FiftyOne to check the labeled data for potential errors, and Weights and Biases to keep track of our model training performance.

We used a simple dataset composed of sharks for this article and aimed to run a classifier on it. We wanted to keep it as simple as possible, as our interest was mainly in getting the tools to work with each other. However, if you find yourself in the planning process of a larger project, expect considerably more difficulties than we found, as some of the things we do will not scale nicely.

An important caveat is that the two authors of this article are machine learning students fluent in Python. That may be slightly less experience than most teams trying to integrate three solutions would have. Nevertheless, with the official documentation at hand and having picked three very commonly used tools, we assumed that we should be able to combine all three and make them work with each other.

Expected result

For a smooth integration between the different tools, the expected outcome was the following:

  1. We can annotate data in Labelbox
  2. We can bring that data into FiftyOne and find issues with our annotations
  3. We can quickly feed back from FiftyOne to Labelbox which images and annotations need to be corrected
  4. We can get the fixed data from Labelbox into Weights and Biases
  5. We can take the model we trained - using Weights and Biases to monitor performance - and deploy the same model to Labelbox for faster annotating (AI assistance)
  6. We can deploy the same model to FiftyOne (the tool built by Voxel51) to check for errors
  7. Additionally, we want to see whether we can easily automate our data pipeline so that we don't have to repeat the same manual tasks over and over again

Essentially, what we are looking for is a way of doing quick iterations using three different tools. We want to create and curate data quickly, use that for training, then use our (hopefully) performant new AI model to improve speed and accuracy for data labeling and curation. Then, rinse and repeat.

Obtained results

Labeling data in Labelbox

Firstly, we started by labeling images of sharks in Labelbox. This process was relatively straightforward, and we could easily assign the classes to the different labeled sharks. We also introduced some manufactured errors when annotating.

Moving data from Labelbox to FiftyOne

Then, we had to switch to FiftyOne (the tool built by Voxel51) to find the errors in the labeled data.

When you use popular public datasets in your ML workflow, the process is reasonably pain-free. If you want to see and remove errors, these datasets can be loaded directly from the FiftyOne dataset zoo.

When the dataset lives in Labelbox and you want to explore and review it for annotation mistakes in FiftyOne, you have to use the integration provided by FiftyOne and carry out some additional steps. Both the Labelbox and FiftyOne libraries are installed with pip.

FiftyOne includes a utility for working with Labelbox datasets. With it, users can import a dataset from Labelbox into FiftyOne. The details of the dataset can then be printed using the following command:

import fiftyone as fo
print(fo.load_dataset("sharks"))  # load the existing dataset by name and print its summary

The Classification field called "tiger_shark_or_white_shark" holds the ground-truth annotations from Labelbox.
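A minimal sketch of the import step, assuming the Labelbox project has been exported to a local JSON file. The function name and its arguments are placeholders of ours; import_from_labelbox is the helper in FiftyOne's Labelbox utilities, and its behavior may differ between FiftyOne releases and Labelbox export versions.

```python
def load_sharks_from_labelbox(json_path, download_dir, name="sharks"):
    """Create a FiftyOne dataset and fill it from a Labelbox JSON export."""
    # Imported lazily so this module can be loaded without FiftyOne installed
    import fiftyone as fo
    import fiftyone.utils.labelbox as foul

    dataset = fo.Dataset(name=name)
    # Media for samples not yet in the dataset is downloaded into download_dir
    foul.import_from_labelbox(dataset, json_path, download_dir=download_dir)
    return dataset
```

Calling load_sharks_from_labelbox("labelbox_export.json", "/tmp/sharks") with your own paths should then give you a dataset you can print or open in the app.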

Then, to test the "mistakenness" method in FiftyOne, you'll need to add some model predictions. This method helps you analyze your dataset to identify unusual and problematic samples, as well as probable annotation errors, in classification and object detection datasets.

We used a model pre-trained on ImageNet, which added the predictions field, as reflected in the updated dataset view.

The step to compute "mistakenness" was pretty straightforward once we had the model predictions, after which we explored the dataset using the FiftyOne App.
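Roughly, the prediction and mistakenness steps can be sketched as below. The zoo model name and the "predictions" field name are assumptions for illustration, not necessarily what we ran, and the function name is ours. The calls come from FiftyOne's zoo and brain packages, so check them against the version you have installed.

```python
def add_predictions_and_mistakenness(dataset, label_field,
                                     model_name="resnet50-imagenet-torch"):
    """Attach zoo-model predictions, then score likely annotation mistakes."""
    # Imported lazily so this module can be loaded without FiftyOne installed
    import fiftyone.brain as fob
    import fiftyone.zoo as foz

    model = foz.load_zoo_model(model_name)
    # Logits are stored because the mistakenness computation can use them
    dataset.apply_model(model, label_field="predictions", store_logits=True)
    # Writes a per-sample "mistakenness" score onto the dataset
    fob.compute_mistakenness(dataset, "predictions", label_field=label_field)
    return dataset
```

For our dataset, label_field would be "tiger_shark_or_white_shark". Afterwards, fo.launch_app(dataset) opens the app so you can sort and browse by the new score.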

With that, the app was started locally, and we could see actual predictions:

This helped in detecting possible annotation mistakes in classifying the two classes of sharks, as depicted above. The predicted values are produced by the pre-trained model, whereas the ground truth is the label values from Labelbox.

As you can see, because we didn't have our own model helping us with predictions, we got predictions for classes we don't have (hammerhead). There's also a difference in naming (white_shark in our labels instead of great white shark). So here we are dealing with a slightly imperfect method: we can use a pre-trained model that contains some of the same classes as the model we want to use, but it also has classes that are not relevant to us.
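One way to paper over the naming mismatch is a small lookup table that translates the pre-trained model's class names into ours and drops everything else. This is only a sketch: the mapping below is hypothetical (in particular, "tiger_shark" is our guess at the second class name), and the helper names are ours.

```python
# Hypothetical mapping from the pre-trained model's class names to ours
IMAGENET_TO_OURS = {
    "great white shark": "white_shark",
    "tiger shark": "tiger_shark",
}


def to_our_label(predicted_label, mapping=IMAGENET_TO_OURS):
    """Return our class name, or None when the prediction is irrelevant."""
    return mapping.get(predicted_label.strip().lower())


def remap_predictions(predictions, mapping=IMAGENET_TO_OURS):
    """Keep only predictions that map onto one of our classes."""
    remapped = {}
    for image_id, label in predictions.items():
        ours = to_our_label(label, mapping)
        if ours is not None:  # e.g. "hammerhead" is dropped
            remapped[image_id] = ours
    return remapped
```

Dropping irrelevant predictions is a design choice. You could also keep them as a signal that an image may not show either of your classes.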

With the pre-trained model, FiftyOne was able to produce some predictions of the labels. Now, to correct the labels, we have two possible solutions.

  • The first solution is to write a Python script using the FiftyOne library to modify the labels in the dataset. The labels could be changed based on a certain "mistakenness" threshold or any other rule. Then, with the export_to_labelbox function from FiftyOne's Labelbox utility, you can export the labels in Labelbox format and upload them to Labelbox.
  • Or, if the labeled project already exists in the Labelbox project tab, the only option is to collect the IDs of the images that require correction or re-annotation and then make the corrections manually.
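Both options start from the same question: which samples look suspicious enough to revisit? A plain-Python sketch of that selection, assuming you have pulled each sample's Labelbox ID and mistakenness score into a list of dictionaries. The record keys, the 0.9 threshold, and the function name are all assumptions for illustration.

```python
def samples_to_reannotate(records, threshold=0.9):
    """Return Labelbox IDs whose score exceeds the threshold, worst first."""
    flagged = [
        r for r in records
        if r.get("mistakenness") is not None and r["mistakenness"] > threshold
    ]
    # Most suspicious samples first, so reviewers start where it matters
    flagged.sort(key=lambda r: r["mistakenness"], reverse=True)
    return [r["labelbox_id"] for r in flagged]
```

The resulting list of IDs is what you would hand to whoever fixes the annotations in Labelbox, or feed into a script that rewrites the labels.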

Training a model and using Weights and Biases for monitoring

The next step then is to train a model using our cleaned-up data. The first hurdle is that data exported from Labelbox has a proprietary format. To get your data working in WandB, you need to convert it to a standard format like COCO.
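To give a feel for the conversion, here is a sketch that turns flattened classification rows into a COCO-style dictionary. The input keys "file_name" and "label" are placeholders: a real Labelbox export nests this information differently depending on the export version, so you would first have to flatten it yourself. The output is also a classification-only simplification (no bounding boxes or segmentations), not a full COCO detection file.

```python
def rows_to_coco(rows):
    """Convert flattened (file_name, label) rows to a COCO-style dict."""
    labels = sorted({row["label"] for row in rows})
    category_ids = {label: i + 1 for i, label in enumerate(labels)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [
            {"id": cid, "name": name} for name, cid in category_ids.items()
        ],
    }
    for image_id, row in enumerate(rows, start=1):
        coco["images"].append({"id": image_id, "file_name": row["file_name"]})
        coco["annotations"].append({
            "id": image_id,  # one label per image, so the IDs can coincide
            "image_id": image_id,
            "category_id": category_ids[row["label"]],
        })
    return coco
```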

With that done, we can start training our model with the WandB library. The library can be used with different frameworks like Keras, TensorFlow, PyTorch, etc. With the WandB library, users can initialize a project, configure hyperparameters for the model, and log various metrics to visualize the model's performance.

While working with local data through the API, a few simple additions to the training code were enough to log data to WandB, which proved useful for querying experiment metadata, keeping an eye on our neural networks, and plotting their outputs.
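The "simple adjustments" look roughly like the sketch below: initialize a run with your hyperparameters, log metrics every epoch, and close the run. The project name, the hyperparameters, and the train_one_epoch callable are placeholders. The training logic itself is whatever your framework provides.

```python
def train_with_tracking(train_one_epoch, epochs=5, learning_rate=1e-3,
                        project="shark-classifier"):
    """Run a training loop and log per-epoch metrics to Weights and Biases."""
    # Imported lazily so this module can be loaded without wandb installed
    import wandb

    run = wandb.init(
        project=project,
        config={"epochs": epochs, "learning_rate": learning_rate},
    )
    for epoch in range(epochs):
        # train_one_epoch is user-supplied and returns (loss, accuracy)
        loss, accuracy = train_one_epoch(epoch, learning_rate)
        wandb.log({"epoch": epoch, "loss": loss, "accuracy": accuracy})
    run.finish()
```

Passing the training step in as a callable keeps the tracking code independent of Keras, TensorFlow, or PyTorch.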

Using the new model in Labelbox and FiftyOne

So far, we haven't had too many issues. We successfully managed to bring our data from Labelbox to FiftyOne. We then used that data to train a model with the help of WandB. Along the way, we had some difficulties with data formats and moving data between our different tools, but we think that is manageable.

However, let's say our new model is performing exceptionally well. The next logical step is to take that model, use it in Labelbox to automate some percentage of the annotation process, and use it in FiftyOne to make better predictions.

How do we do that? With Labelbox, we have to run our model outside of the app and then upload predictions to Labelbox. Essentially, this is a pre-labeling approach. Of course, predictions need to be understandable by Labelbox, so we might have to convert the data again to be compatible. We also have to decide which predictions we upload and which we ignore. If we want to do this on a production-level project, we need to build enough automation to deploy models and generate predictions on the fly, which would be a significant pain point.
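Deciding what to upload can be as simple as a confidence cut-off. The sketch below filters predictions and serializes the survivors as newline-delimited JSON. The 0.8 threshold and the keys "image_id", "label", and "confidence" are hypothetical and are not Labelbox's import schema, so you would still need to map them to whatever format Labelbox expects.

```python
import json


def select_prelabels(predictions, min_confidence=0.8):
    """Keep only predictions confident enough to upload as pre-labels."""
    return [p for p in predictions if p["confidence"] >= min_confidence]


def to_ndjson(predictions):
    """Serialize predictions as newline-delimited JSON, one per line."""
    return "\n".join(
        json.dumps({"image_id": p["image_id"], "label": p["label"]})
        for p in predictions
    )
```

A higher threshold means fewer pre-labels but also fewer model mistakes leaking into your data.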

We are also leaving ourselves open to risk. Although our new model might be working well, it is doubtful that it is perfect. Therefore, we might create a lot of new errors in our data using predictions from our model. The more images we get predictions for, the worse our data issues can get. Worst case, we will have to manually review every image to figure out what's there that shouldn't be there. This can take as long as doing annotations manually.

Moving on to FiftyOne, we are still unsure how to bring our model into their application. As we understand it, in theory, it's possible. But we haven't managed to get it to work after two weeks. This can be a significant hurdle as you will have to rely on public models trained on public data for your data curation, which is not something we want to do when we have a better model available.

Of course, it might be us missing some core functionality of FiftyOne here. If we could figure out how to bring our model into their environment, it would be a tremendous help with data curation, as we would have a model trained on the actual data doing the heavy lifting for us. Once again though, even if we could have gotten it to work, building an automated or even semi-automated integration between our model training and FiftyOne would take considerable effort.
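One route that may be worth trying, and which we have not verified end to end in our setup, is to keep the model outside FiftyOne entirely and only attach its outputs to the samples as Classification labels. In the sketch below, the predict callable, the "custom_model" field name, and the function name are placeholders of ours.

```python
def attach_custom_predictions(dataset, predict, field="custom_model"):
    """Run our own model over each sample and store its output on the sample."""
    # Imported lazily so this module can be loaded without FiftyOne installed
    import fiftyone as fo

    for sample in dataset:
        # predict is user-supplied: image path -> (label, confidence)
        label, confidence = predict(sample.filepath)
        sample[field] = fo.Classification(label=label, confidence=confidence)
        sample.save()  # persist the new field on this sample
    return dataset
```

This still leaves the automation question open: something has to rerun the model and refresh the field every time it is retrained.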

Actual result

You can get all three tools to work in concert, but it is a considerable pain, and some steps lack documentation. For example, after two weeks, we are still unsure how we can bring a custom model into FiftyOne.

That leads to a question: What is the risk of working on an actual applied project and deciding to use these tools?

In short, you are taking a significant risk. Some of the obstacles are acceptable, but if you want to use these tools for a larger scale project, you will have to do a lot more engineering than you might expect.

Additionally, some of the steps we had to go through are not easy to automate, so building a smooth and efficient pipeline is not as easy as you might think. In this test, a lot of the computational work was done locally. For larger projects, this might not be possible. FiftyOne in particular might then struggle, as it is mainly intended to run in local environments.

This is not restricted to this configuration of tools. We are also doing a similar experiment with Supervisely, Aquarium, and Neptune. Our findings there are, so far, similar.

That leads us to a question - what use is it to have all these excellent tools if they won't work together without considerable effort? If you buy a software solution, you do so because it should be less work than building something internally.

Furthermore, when using all these tools, how do we know which one holds the most up-to-date version of our data? There's no apparent source of truth for us to fall back on. This, for most organizations, will lead to hacked-together spreadsheets, additional back-and-forth communication, and easy-to-avoid mistakes.

To the final point, then. Of course, we are not writing this (only) for the greater good of the ML community. We have an agenda here. Hasty as a tool offers functionality comparable to all three of the tools used in this article. For data preparation, we have a state-of-the-art annotation tool. For data curation, we have out-of-the-box AI quality control. For experimentation, we provide a fully-fledged model building and experiment tracking solution.

The benefit to you? Because we have all functionality in one application, you don't need to worry about building integrations, looking through docs, or keeping track of what has been deployed where.

For decision-makers, the decision usually comes down to this question of ease-of-use versus specific functionalities. We understand that specialist tools can have great functionality and sometimes have some cool stuff we don't. That's the benefit of developing a very focused application. However, when working on a real-life project, we've found that ease-of-use almost always trumps specific functionalities as long as the basics are in place.

Most of the people we talk to are tired of spending time building (and rebuilding) integrations and manually triggering the same steps over and over instead of concentrating on what matters: better data and models.

Additionally, with Hasty, you can quickly move data and models between the different stages in the project. Use your data to train models - use your models to automate some percentage of the annotation or curation process. With us, you can do so in a couple of clicks. And it works for huge projects.

If you are interested in testing us out, there's a free trial. Also, if you are looking for the right tool but are unsure if Hasty is suitable for you, don't hesitate to reach out to [email protected]. He can give you a demo of how we can help you and answer any questions you may have about how to get started with AI projects.
