A more holistic understanding of scenes for Computer Vision
About a quarter of our Hasty users are part of the research community. We reached out to them to learn more about their research. We learned so much during these calls that we asked some researchers to write a bit about their work for our blog. This time Juan Lagos Benitez, who is doing his Ph.D. at Tampere University, shares his thoughts on panoptic segmentation.
If you want to learn more about the project, please get in touch with Juan on LinkedIn. And don’t forget to check his GitHub profile, where he publishes extensions to the most famous architecture for Panoptic Segmentation.
And yes, the dog in the image is Juan’s puppy. It’s super cute, isn’t it?
Computer vision and scene understanding have become game-changers in today’s world. As we give machines autonomous capabilities to perform tasks the way humans do, understanding the surroundings, the objects in them, and the scene as a whole becomes pivotal. As humans, we do not merely register visual stimuli; we comprehend what we see. Our brain gives meaning to what our eyes capture. We also unconsciously assign attributes to the things we see, such as distance, density, number of objects, speed, dangerousness, texture, and even temperature. We may not be very accurate in each of these instinctive measurements. Still, when we see something, we recognize hundreds of patterns that allow us to perform different tasks effectively, e.g., sports, driving, walking, playing video games, etc.
It is essential to start with the bigger picture. The logical question is: what is Image Segmentation in Machine Learning?
Well, Segmentation is a well-known term in business and marketing. In short, it defines the process of splitting customers (or a whole market) into separate groups based on specific patterns in their behavior. Fortunately, such a definition is close to what we refer to when saying Image Segmentation in ML.
Image Segmentation in Machine Learning is a part of the vision AI field that incorporates different methods of dividing visual data (for example, an image) into segments featuring specific, similar, and significant information of the same class label.
For example, if you have a picture from your prom, you can use Image Segmentation to find each person in the image and locate their boundaries.
As of today, corporate Data Science regularly solves Image Segmentation challenges in various spheres. In Hasty, we see that the demand for high-quality Segmentation solutions has rapidly grown over the past couple of years. It also applies to Data Scientists who specialize in the Image Segmentation field. As a result, the industry is developing and growing, bringing new SOTAs, solution techniques, and challenges.
Nowadays, researchers say that the Image Segmentation field consists of three vision AI tasks. These are:
Semantic Segmentation;
Instance Segmentation;
Panoptic Segmentation.
In this post, we are diving into the Panoptic Segmentation concept. Panoptic Segmentation combines Instance Segmentation and Semantic Segmentation to provide a more holistic understanding of a given scene than either task alone. I will walk you through the philosophy behind Panoptic Segmentation and show you how it helps machines view the world the way we see it. I will also briefly review a novel approach to Panoptic Segmentation known as EfficientPS, a deep Convolutional Neural Network for Panoptic Segmentation that uses EfficientNet as a backbone for extracting features. Additionally, you will learn how to obtain ground truth annotations for a Panoptic Segmentation task using Hasty.
Semantic Segmentation refers to the Computer Vision task of classifying pixels in an image. It is done by predefining some target classes, e.g., “car”, “vegetation”, “road”, “sky”, “sidewalk”, or “background”, where “background” is in most cases a default class. Then, each pixel in the image is assigned to one of those classes. Here’s an example:
As you can see in the example, every pixel in the image was colored depending on its class; hence, every pixel belonging to a car is masked in blue, and the same goes for the sidewalk, the vegetation, the road, and the sky.
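To make the idea concrete, here is a minimal sketch of how a semantic mask is commonly represented in code: a 2D array with one class ID per pixel. The class names and the tiny toy "image" below are illustrative, not from any real model.

```python
import numpy as np

# Hypothetical class IDs for a street scene like the one above.
CLASSES = {0: "background", 1: "car", 2: "road",
           3: "sidewalk", 4: "vegetation", 5: "sky"}

# A semantic mask is just a 2D array of class IDs, one per pixel.
# A tiny 4x6 "image" stands in for a real prediction:
semantic_mask = np.array([
    [5, 5, 5, 5, 5, 5],   # sky
    [4, 4, 1, 1, 4, 4],   # vegetation with a car in front
    [3, 3, 1, 1, 3, 3],   # sidewalk
    [2, 2, 2, 2, 2, 2],   # road
])

# Colorizing the mask means mapping each class ID to a color.
# Counting pixels per class is a simple histogram:
pixels_per_class = {CLASSES[c]: int((semantic_mask == c).sum())
                    for c in np.unique(semantic_mask)}
print(pixels_per_class)
```

Note that the mask tells us *which* pixels are "car", but nothing about how many cars there are, which is exactly the limitation discussed next.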
To learn more about Semantic Segmentation, please refer to our definitive guide on Semantic Segmentation.
So far, so good. But what if we want to dig deeper into the type of information we can extract? For example, suppose we want to know how many cars are in the picture. Semantic Segmentation is no help here, as all we get is a pixel-wise classification. For such a task, we need to introduce the concepts of Object Detection and Instance Segmentation.
When we do Object Detection, we aim to identify bounded regions of interest within the image, inside of which is an object of a target class. Such objects are countable things such as cars, people, pets, etc. It does not apply to classes such as “sky” or “vegetation”: these are usually spread across different regions of the image and cannot be counted one by one, since there is only one “sky”, not multiple.
It is very common to use bounding boxes to indicate the region within which we will find a given object. Here’s an example:
In the previous image, there are three bounding boxes, one for each car in the image. In other words, we are detecting cars and can now say how many of them are in the picture.
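A detection output like the one described above is often stored as a list of box coordinates plus a class label and a confidence score. The coordinates and scores below are made up for illustration:

```python
# A detection is commonly stored as (x_min, y_min, x_max, y_max, class, score).
# Three hypothetical "car" detections, as in the street scene above:
detections = [
    (12,  40,  88, 110, "car", 0.97),
    (130, 52, 205, 118, "car", 0.91),
    (220, 60, 290, 125, "car", 0.88),
]

# Counting objects of a class is now trivial -- something
# semantic segmentation alone cannot give us:
num_cars = sum(1 for *_box, cls, _score in detections if cls == "car")
print(num_cars)  # 3
```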
To learn more about Object Detection, please refer to our definitive guide on Object Detection.
Now, only some of the pixels inside those bounding boxes correspond to a car. Some of those pixels are part of the road; others of the sidewalk or the vegetation. If we want richer information from Object Detection, we can identify what pixels specifically belong to the same class assigned to the bounding box. That is what is called Instance Segmentation. Strictly speaking, we perform pixel-wise segmentation for every instance (bounding box in our case) we detected. This is what it looks like:
So we went from a rough detection with a bounding box to a more accurate detection in which we can also identify instances and count the number of objects of a given class. In addition, we know exactly what pixels belong to an object.
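In code, an instance segmentation result can be sketched as a binary mask per detected instance. The tiny masks below are hypothetical stand-ins for real predictions:

```python
import numpy as np

# With Instance Segmentation, each detected object carries a binary
# mask telling us exactly which pixels belong to it.
# A tiny 4x6 example with two hypothetical car instances:
instance_masks = {
    ("car", 1): np.array([[0, 1, 1, 0, 0, 0],
                          [0, 1, 1, 0, 0, 0],
                          [0, 0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 0, 0]], dtype=bool),
    ("car", 2): np.array([[0, 0, 0, 0, 1, 1],
                          [0, 0, 0, 0, 1, 1],
                          [0, 0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 0, 0]], dtype=bool),
}

# We can now count instances *and* measure each one in pixels:
num_cars = sum(1 for (cls, _i) in instance_masks if cls == "car")
areas = {inst: int(m.sum()) for inst, m in instance_masks.items()}
print(num_cars, areas)
```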
Sounds very good, but still, we have no information about all the other non-instance classes, such as “road”, “vegetation”, or “sidewalk”, as we did have it in semantic segmentation. That is when Panoptic Segmentation comes into play!
As mentioned in the introduction of this post, Panoptic Segmentation is a combination of Semantic Segmentation and Instance Segmentation. To put it another way, Panoptic Segmentation can obtain information such as the number of objects for every instance class (countable objects), bounding boxes, and Instance Segmentation masks. As a bonus, we get to know what class every pixel in the image belongs to from Semantic Segmentation. As a whole, this certainly provides a more holistic understanding of a scene.
Following our example, Panoptic Segmentation would look like this:
We have now managed to get a representation of the original image in such a way that it provides rich information about both Semantic and Instance classes altogether.
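A common way to encode this combined representation (used, for instance, in the COCO panoptic format) is to give every pixel a pair of labels: a semantic class ID and an instance ID, where uncountable "stuff" classes get instance ID 0. The toy arrays below are illustrative:

```python
import numpy as np

# Panoptic labeling: every pixel gets (semantic class ID, instance ID).
# "Stuff" classes such as road or sky have no instances, so their
# instance ID is 0 by convention. Tiny 4x6 scene: sky, two cars, road.
semantic = np.array([
    [5, 5, 5, 5, 5, 5],
    [1, 1, 2, 1, 1, 2],
    [1, 1, 2, 1, 1, 2],
    [2, 2, 2, 2, 2, 2],
])
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 2, 2, 0],
    [1, 1, 0, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
])

# Every pixel is labeled (as in Semantic Segmentation) AND countable
# objects are separated (as in Instance Segmentation):
cars = {int(i) for i in np.unique(instance[semantic == 1]) if i != 0}
print(len(cars))  # 2
```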
Now that we have covered the basics, let’s put this into practice. How can we do Panoptic Segmentation with a Deep Learning model? At this point, I would like to introduce a model developed at the University of Freiburg called EfficientPS.
EfficientPS is a deep learning model that makes panoptic predictions at a low computational cost by using a backbone built upon EfficientNet architecture. It consists of:
A backbone network for feature extraction;
Two output branches: one for Semantic Segmentation and one for Instance Segmentation;
A fusion block that combines the outputs from both output branches.
Here’s a diagram of the entire network:
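The modular layout can also be sketched in plain Python, with simple callables standing in for the real sub-networks. All names here are illustrative, not the actual EfficientPS implementation:

```python
# Structural sketch of the EfficientPS layout described above:
# a shared backbone, two output branches, and a parameter-free
# fusion step combining their outputs.
class PanopticNet:
    def __init__(self, backbone, semantic_head, instance_head, fuse):
        self.backbone = backbone            # shared feature extractor
        self.semantic_head = semantic_head  # per-pixel class logits
        self.instance_head = instance_head  # boxes, classes, masks
        self.fuse = fuse                    # heuristic, not learned

    def predict(self, image):
        features = self.backbone(image)     # one pass, shared by both branches
        sem = self.semantic_head(features)
        inst = self.instance_head(features)
        return self.fuse(sem, inst)         # combined panoptic output

# Wire it up with trivial stand-ins just to show the data flow:
net = PanopticNet(
    backbone=lambda img: f"features({img})",
    semantic_head=lambda f: f"sem({f})",
    instance_head=lambda f: f"inst({f})",
    fuse=lambda s, i: (s, i),
)
print(net.predict("img"))
```

The key design point is that both branches consume the same backbone features, so the expensive feature extraction happens only once per image.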
Let’s take a closer look at its main modules. First, there’s the backbone network that produces four different outputs, each with a different spatial resolution, thus obtaining global context and localized features.
The backbone scales three parameters uniformly for better efficiency: width, depth, and input resolution. Here, width refers to the number of channels used in its building blocks, depth refers to the number of repeated building blocks, and input resolution refers to the resolution of the first layer, a.k.a. the input layer.
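This uniform scaling is EfficientNet's compound scaling: a single coefficient phi scales depth, width, and resolution together. The sketch below uses the base constants reported for EfficientNet; the exact values used inside EfficientPS may differ.

```python
# EfficientNet compound scaling: one coefficient phi scales depth,
# width, and input resolution together.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution factors

def compound_scale(phi: int):
    """Return (depth, width, resolution) multipliers for scale phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# The constraint alpha * beta^2 * gamma^2 ~= 2 means each +1 in phi
# roughly doubles the network's FLOPs:
d, w, r = compound_scale(2)
print(round(d, 2), round(w, 2), round(r, 2))  # 1.44 1.21 1.32
```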
Then there’s the Semantic Segmentation output branch:
This branch is a much smaller network compared to the backbone. It attempts to fulfill three requirements: capturing fine features efficiently (large-scale), capturing long-range context (small-scale), and mitigating the mismatch between large-scale and small-scale features.
As output, this branch returns N layers with logits, where N is the number of classes, including “background”. In parallel to this branch, we have the Instance Segmentation output branch:
The Instance Segmentation branch resembles the Mask R-CNN architecture. It consists of a region proposal network (RPN) connected to two sub-networks. One sub-network returns bounding boxes and their corresponding class predictions, while the other returns the corresponding mask logits.
Finally, there is the fusion module. It is not a parametrized network but a heuristic series of steps that, first, threshold, filter, scale, and pad the outputs of the Instance and Semantic Segmentation branches and, second, combine the logits by computing their Hadamard product, as shown in the following diagram:
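The core of that combination step can be sketched as follows. This is a simplified illustration of the fusion idea, not the full EfficientPS heuristic: given one instance's mask logits from the instance branch and the matching class logits from the semantic branch (cropped to the instance's box), the two are combined so that pixels where both branches agree are reinforced.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 2x2 logit crops for one detected instance:
ml_a = np.array([[2.0, -1.0], [0.5, 3.0]])   # instance-branch mask logits
ml_b = np.array([[1.5, -0.5], [-2.0, 2.5]])  # semantic-branch logits

# Element-wise (Hadamard) combination of the two branches:
fused = (sigmoid(ml_a) + sigmoid(ml_b)) * (ml_a + ml_b)

# Where both branches are confident (top-left, bottom-right) the fused
# logit exceeds either input; where they disagree it is damped.
print(fused.round(2))
```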
As you might know, data annotation might be a bottleneck for AI startups as the conventional labeling approach is both costly and time-consuming. Hasty’s data-centric ML platform addresses the pain and automates 90% of the work needed to build and optimize your dataset for the most advanced use cases with our self-learning assistants using AI to train AI.
The primary focus of Hasty is the vision AI field. Therefore, Hasty is a perfect Panoptic Segmentation annotation tool as it implements all the necessary instruments to help you with your Panoptic Segmentation task.
To streamline your Panoptic Segmentation annotation experience, Hasty offers:
As of today, these are the key options Hasty has for the Panoptic Segmentation cases. If you want a more detailed overview, please check out the further resources or book a demo to get deeper into Hasty with the help of developers.
Panoptic Segmentation sets a milestone in scene understanding and Computer Vision. It gives more meaning and context to what a machine is “seeing”, leading to better decision-making in the case of autonomous machines. EfficientPS is a flexible network thanks to its modularity (backbone, semantic output branch, instance output branch, and fusion module). I have been working on a model based on this architecture myself. I aim to extend this network with other output branches as well as other backbone networks, like ResNet, which could work better depending on the case. Please take a look at my GitHub repository to see the latest updates.
Only 13% of vision AI projects make it to production. With Hasty, we boost that number to 100%.
Our comprehensive vision AI platform is the only one you need to go from raw data to a production-ready model. We can help you with:
Labeling 10x faster with our AI Assistants.
Automating quality control, making it 35x faster, with our AI Consensus Scoring feature.
Training models in our no-code Model Playground, which can then be used to improve labeling and QA automation even further.
All while keeping you in control and your data safe.
All the data and models you create always belong to you and can be exported and used outside of Hasty at any given time entirely for free.
You can try Hasty by signing up for free here. If you are looking for additional services like help with ML engineering, we also offer that. Check out our service offerings here to learn more about how we can help.