10.01.2023 — Nursulu Sagimbayeva

Deep dive into the public Instance Segmentation datasets

We took a deep dive into the field of Instance Segmentation datasets and are ready to share our findings. Read on to discover the Top 6 Instance Segmentation datasets for various industries that are publicly available or can be acquired for free.


As you know, to achieve the best performance of your model, it is crucial to have a rich and properly labeled dataset. However, it is not always possible — or necessary — to gather your own dataset from scratch, as doing so requires massive effort and cost. Luckily, there are large publicly available datasets that will save you time and provide you with high-quality, topic-specific images.

In this article, we will:

- Define Instance Segmentation;
- Go over its real-life applications;
- Review the Top 6 publicly available Instance Segmentation datasets.

Let’s jump in!

What is Instance Segmentation?

Instance Segmentation (IS) in Computer Vision (CV) refers to the task that combines the main principles behind Object Detection (OD) and Semantic Segmentation (SemSeg). The model receives images as input and returns a pixel-precise mask and a label for each object in the image.

Source
Please check out our definitive guide on Instance Segmentation if you want to learn more.
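
To make the input and output concrete, here is a minimal sketch of running a pretrained Mask R-CNN from torchvision on a single image. The model choice and tensor sizes are our assumptions for illustration - any off-the-shelf Instance Segmentation model returns similar per-instance masks, labels, and scores:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Load a Mask R-CNN pretrained on COCO (torchvision >= 0.13;
# older versions use pretrained=True instead of weights="DEFAULT").
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Stand-in for a real RGB image: a 3 x H x W tensor with values in [0, 1].
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])  # the model takes a list of images

# One dict per input image with per-instance results.
pred = predictions[0]
print(pred["labels"])         # (N,) class id for each detected instance
print(pred["scores"])         # (N,) confidence for each instance
print(pred["masks"].shape)    # (N, 1, H, W) soft masks with values in [0, 1]
hard_masks = pred["masks"][:, 0] > 0.5  # threshold to binary per-instance masks
```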

Real-life applications of Instance Segmentation

The Instance Segmentation task is used across many industries and domains. Below, we list a few of the most common examples.

1. Medicine
Using Instance Segmentation in medicine can help researchers and medical workers study human anatomical structures, identify and localize tumors and abnormalities in body tissues, and address many other practical and research-oriented tasks.

In this example, you can see 1) the original image; 2) the teeth and jaws annotated using Semantic Segmentation; 3) each tooth and jaw annotated using Instance Segmentation.
Source

2. Biology
Biology can benefit from Instance Segmentation in many ways; for example, one can perform the IS task on bacteria colonies to understand their morphology, distribution, growth, and diversity.

Instance Segmentation of different bacteria types in a Petri dish.
Source

3. Video Segmentation

Instance Segmentation can be used in videos for various purposes - from building self-driving vehicles to analyzing sports matches and teaching robots to perform surgeries.

Source

4. Satellite imagery

Governments and businesses use satellite imagery for a wide variety of tasks - from weather forecasting and environmental assessment to warfare and urban planning. Instance Segmentation makes it possible to detect and label the finest details in an image - for instance, each individual house, tree, or car.

Source

5. Product classification

Shops and stores might utilize Instance Segmentation to review their stock more easily or to improve the customer experience by automatically recognizing the items in a shopping cart, to name a few examples.

Different types and instances of beverages annotated separately with different colors.
Source

Publicly available datasets on Instance Segmentation

1. COCO (Microsoft Common Objects in Context)

COCO is a large-scale Object Detection, Segmentation, and Captioning dataset. It was introduced in 2014 in a paper by Lin et al. and has served as a benchmark for state-of-the-art Object Recognition ever since. The dataset is maintained by a team of contributors, and its sponsors include CVDF, Microsoft, Facebook, and Mighty AI.

To date, the COCO dataset contains around 330K images, which encompass:

- 1.5 million object instances;
- 80 object ("thing") categories;
- 91 "stuff" categories;
- 5 captions per image;
- 250,000 people with keypoints.

Object ("thing") categories have a specific size and shape, whereas "stuff" categories are usually background materials with homogeneous or repetitive patterns but no particular form. Stuff classes are also important since they occupy large parts of the image and might help explain its context, relations between objects, and other significant properties.
Stuff categories example

Annotations in the COCO dataset

You can explore different object categories on COCO’s website. If you select specific classes, you can view all the images that contain all the specified labels - for example, cat and computer mouse.

Source
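
The same kind of query can be reproduced locally with the official pycocotools API. A minimal sketch - the annotation file path is an assumption, so point it at your own COCO download:

```python
from pycocotools.coco import COCO

# Assumed local path to the COCO 2017 validation annotations.
coco = COCO("annotations/instances_val2017.json")

# Find images that contain BOTH categories, as in the website demo.
cat_ids = coco.getCatIds(catNms=["cat", "mouse"])
img_ids = coco.getImgIds(catIds=cat_ids)
print(f"{len(img_ids)} images contain all requested categories")

# Load the instance annotations for the first such image...
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids, iscrowd=None)
anns = coco.loadAnns(ann_ids)

# ...and convert a polygon/RLE annotation into a binary pixel mask.
mask = coco.annToMask(anns[0])  # numpy array of shape (H, W), values 0/1
```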

The images presented in COCO are mostly non-iconic and provide information about contextual relationships between objects. Such images are harder for a model to train on, since they naturally contain a lot of noise, clutter, occlusions, and other complicating factors. In their paper, the authors illustrate the difference between iconic and non-iconic images.

The difference between iconic and non-iconic images.
Source

Besides Instance Segmentation, you might want to use the COCO dataset to train or evaluate your model for other CV tasks as well, including:

- Object Detection;
- Keypoint Detection;
- Stuff Segmentation;
- Panoptic Segmentation;
- DensePose estimation;
- Image Captioning.

SOTA models evaluated on COCO

The COCO dataset serves as a benchmark that helps to assess the performance of various CV models. Thus, it becomes easier to compare models with one another and evaluate the improvement of each given model over time.

If you want to check out the state-of-the-art models for Instance Segmentation evaluated on the COCO dataset, you can follow this page and track changes.

Source

2. Cityscapes

Cityscapes is a large-scale dataset focusing on the semantic understanding of urban street scenes. It was presented in 2015 in the paper by Cordts et al. and has been extended by various contributors since then.

The dataset consists of 5,000 images with fine annotations and 20,000 images with coarse annotations. The difference between fine and coarse annotations is illustrated below.

An example of a fine annotation, Stuttgart
An example of a coarse annotation, Saarbrücken

Annotations in the Cityscapes dataset

The annotations in Cityscapes are divided into 8 categories with 30 classes. For example, group “human” includes classes “person” and “rider,” and group “flat” includes classes “road,” “sidewalk,” and so on.

Source
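
If you work in PyTorch, torchvision ships a ready-made Cityscapes loader. A minimal sketch, assuming you have registered on the official website and unpacked the archives to a local ./cityscapes folder:

```python
from torchvision.datasets import Cityscapes

# mode="fine" selects the 5,000 finely annotated images;
# target_type="instance" returns the instance-id map for each image.
dataset = Cityscapes(
    root="./cityscapes",      # assumed local path to the unpacked archives
    split="train",
    mode="fine",
    target_type="instance",
)

image, target = dataset[0]    # a PIL image and a PIL instance-id map
print(image.size, target.size)
```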

To ensure sufficient diversity, the images were gathered from 50 cities under varying conditions: season, time of day, and weather. Initially, the dataset was recorded as video, so only frames that met certain criteria (a large number of dynamic objects, varying scene layout and background) were selected.

If you are looking for images with foggy weather conditions specifically, check out the Foggy Cityscapes dataset, which is an extension of Cityscapes that contains images augmented with fog and rain.

The images in Cityscapes come with some metadata that might be of interest to you:

- preceding and trailing video frames for each annotated image;
- the corresponding right stereo views;
- GPS coordinates;
- ego-motion data from vehicle odometry;
- outside temperature from the vehicle sensor.

Apart from Instance Segmentation, with Cityscapes, you can train and evaluate your model for the following CV tasks:

- Semantic Segmentation;
- Panoptic Segmentation;
- 3D Vehicle Detection.

SOTA models evaluated on Cityscapes

The Cityscapes dataset serves as a benchmark that helps to assess the performance of various CV models. Thus, it becomes easier to compare the models and evaluate the improvement of each given model over time.

If you want to check out the state-of-the-art models for Instance Segmentation evaluated on the Cityscapes dataset, you can follow this page and track changes.

Source

3. PASCAL VOC (PASCAL Visual Object Classes Challenge)

The first PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning) VOC (Visual Object Classes) challenge took place in 2005 and featured two competitions: classification and detection. The results of the challenge were presented in the paper "The 2005 PASCAL Visual Object Classes Challenge" by Everingham et al. Back then, the final dataset contained only 4 classes: bicycles, cars, motorbikes, and people.

Since then, the challenge was held annually up to 2012. The competition categories expanded extensively, and data was gathered for the following tasks:

- Classification;
- Detection;
- Segmentation;
- Action Classification;
- Person Layout.

Currently, the training and validation sets have 11,530 images containing 27,450 ROI annotated objects and 6,929 segmentations.

A region of interest (ROI) is a subset of the image selected for further processing. This concept is useful when you are interested not in the whole image but in certain parts of it.

Annotations in the PASCAL VOC dataset

To date, the PASCAL VOC 2012 dataset contains 20 object categories, including:

- Person;
- Animals: bird, cat, cow, dog, horse, sheep;
- Vehicles: aeroplane, bicycle, boat, bus, car, motorbike, train;
- Indoor objects: bottle, chair, dining table, potted plant, sofa, TV/monitor.

An example of annotations for Instance and Semantic Segmentation.
Source
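
In the VOC folder layout, per-instance masks live in the SegmentationObject directory as indexed PNGs: index 0 is background, 255 marks void boundary pixels, and every other index is one object instance. A minimal sketch of splitting such a mask into per-instance binary masks (the file path and name are assumptions):

```python
import numpy as np
from PIL import Image

# Assumed path and example file name inside an unpacked VOC2012 archive.
mask_path = "VOCdevkit/VOC2012/SegmentationObject/2007_000032.png"
mask = np.array(Image.open(mask_path))  # (H, W) array of palette indices

# 0 = background, 255 = void/boundary; every other index is one instance.
instance_ids = [i for i in np.unique(mask) if i not in (0, 255)]
binary_masks = {i: mask == i for i in instance_ids}

for i, m in binary_masks.items():
    print(f"instance {i}: {m.sum()} pixels")
```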

SOTA models evaluated on PASCAL VOC

The PASCAL VOC dataset serves as a benchmark that helps to assess the performance of various CV models. Thus, it becomes easier to compare models with one another and to evaluate the improvement of each given model over time.

If you want to check out the state-of-the-art models for Instance Segmentation evaluated on the PASCAL VOC dataset, you can follow this page and track changes.

Source

4. LVIS: A Dataset for Large Vocabulary Instance Segmentation

LVIS is a dataset for Large Vocabulary Instance Segmentation. It was presented in the paper “LVIS: A Dataset for Large Vocabulary Instance Segmentation” in 2019 by Gupta et al.

The LVIS dataset uses the COCO 2017 train, validation, and test image sets and adds its own annotations to them. The data split is the following:

- Train: ~100K images;
- Validation: ~20K images;
- Test: ~20K images.

As the authors note, due to the Zipfian distribution of categories in natural images, datasets are usually dominated by a few of the most common categories. At the same time, there is a long tail of categories that appear rarely and have only a few training samples - not enough for a model to learn from. Hence, the LVIS dataset aims to provide exhaustive annotations for underrepresented object categories.

The Zipfian distribution is based on Zipf's law - an empirical law stating that for some types of data, frequency is roughly inversely proportional to rank. For example, in natural languages, a word's frequency drops abruptly as its rank increases. Thus, on average, the most common word, "the," appears around 1/10th of the time, the next most common word, "of," appears around 1/20th of the time, and so on.
Zipfian distribution of word frequencies by rank in English text corpus.
Source
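
You can observe this long tail directly by counting instance annotations per category in the LVIS JSON file. A minimal sketch using only the standard library (the file path is an assumption):

```python
import json
from collections import Counter

# Assumed local path to the LVIS v1.0 training annotations.
with open("lvis_v1_train.json") as f:
    lvis = json.load(f)

# Count how many instance annotations each category receives.
counts = Counter(ann["category_id"] for ann in lvis["annotations"])
names = {cat["id"]: cat["name"] for cat in lvis["categories"]}

# Sorting by frequency exposes the Zipf-like head and the long tail.
ranked = counts.most_common()
print("head:", [(names[c], n) for c, n in ranked[:3]])
print("tail:", [(names[c], n) for c, n in ranked[-3:]])
```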

Annotations in the LVIS dataset

The LVIS dataset contains almost 2 million instance annotations and 1203 object categories - from general and common ones, like cat, book, bus, to more refined ones, like earring, flower arrangement, and knee pad.

On the LVIS website, you can access the images with specific labels from a specific set (training or validation).

The annotation for LVIS is performed without prior knowledge of the categories that will be labeled. This allows the annotators to naturally uncover the long tail of categories that appear in the images.

Apart from Instance Segmentation, you might want to use the LVIS dataset for the Object Detection task as well.

SOTA models evaluated on LVIS

The LVIS dataset serves as a benchmark that helps to assess the performance of various CV models. Thus, it becomes easier to compare models with one another and to evaluate the improvement of each given model over time.

If you want to check out the state-of-the-art models for Instance Segmentation evaluated on the LVIS dataset, you can follow this page and track changes.

Source

5. NYUv2 (NYU-Depth V2)

The NYU-Depth V2 dataset consists of video sequences from various indoor scenes recorded by both the RGB and Depth cameras from the Microsoft Kinect. It was presented in 2012 in the paper “Indoor Segmentation and Support Inference from RGBD Images” by Silberman et al.

The dataset aimed to enable CV models to explore physical relationships between the objects in the images, possible actions that can be performed with them, and the geometric structure of the scene.

The images contain scenes of offices, stores, and rooms of houses with many occluded and unevenly lit objects. Each object is labeled with a class and an instance number (chair1, chair2, chair3, and so on).

Overall, there are:

- 1,449 densely labeled pairs of aligned RGB and depth images;
- 464 scenes taken from 3 cities;
- 407,024 unlabeled frames.

A depth map (depth image) is an image that contains information about the distance of the surfaces of scene objects from a viewpoint. In the heatmaps below (2nd column), the closest surfaces have colder colors, and the farthest surfaces are depicted with warmer colors.
Examples of the annotations

Annotations in the NYUv2 dataset

Among others, the NYU-Depth V2 dataset contains annotations of large planar surfaces, like floors, walls, and table tops. Hence, many objects can be interpreted in relation to those surfaces.

Understanding object interactions and positions in space is important because, in real life, we cannot ignore relations between objects. For example, imagine you have a task to drink a cup of coffee and read a book. First, you need to parse the scene around you and detect these two objects. If the cup is standing on the book (the book is the supporting surface of the cup), then you should pick up the cup first. Even such a simple task requires an understanding of complex scenes and support relations between objects.

Objects are also classified into structural classes that reflect their physical role in the scene:

- ground (e.g., floor, carpet);
- structure (permanent structures such as walls, ceilings, and columns);
- furniture (large furniture items that cannot be easily moved, like tables and cabinets);
- props (small, easily movable objects).

The images are divided into the labeled and raw datasets.

The raw dataset file weighs approximately 428GB, so if you do not want to download the entire dataset in a single file, you can download individual scene categories instead.

Output from the RGB camera (left), preprocessed depth (center), and a set of labels (right) for the image.
Source
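
The labeled subset ships as a single MATLAB v7.3 file that can be read with h5py. A minimal sketch - the field names follow the official release, while the file path and exact array axis order are assumptions worth double-checking:

```python
import h5py

# Assumed local path to the labeled subset (a MATLAB v7.3 file).
with h5py.File("nyu_depth_v2_labeled.mat", "r") as f:
    images = f["images"]        # RGB frames
    depths = f["depths"]        # per-pixel depth maps in meters
    labels = f["labels"]        # semantic class id per pixel
    instances = f["instances"]  # instance number per pixel

    # A (class id, instance number) pair identifies one object instance,
    # e.g., chair1, chair2, chair3 within a single image.
    print(images.shape, depths.shape, labels.shape, instances.shape)
```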

Apart from Instance Segmentation, with NYUv2, you can train and evaluate your model for the following CV tasks:

- Semantic Segmentation;
- Monocular Depth Estimation;
- Depth Completion;
- Surface Normal Estimation, and so on.

SOTA models evaluated on NYUv2

The NYUv2 dataset serves as a benchmark that helps to assess the performance of various CV models. Thus, it becomes easier to compare models with one another and to evaluate the improvement of each given model over time.

If you want to check out the state-of-the-art models for Instance Segmentation evaluated on the NYUv2 dataset, you can follow this page and track changes.

Source

6. YouTubeVIS

YouTubeVIS is a large-scale dataset for Video Instance Segmentation. It is based on the YouTube-VOS (Video Object Segmentation) dataset and was presented in 2019 in a paper by Yang et al.

Video Instance Segmentation extends the image-level Instance Segmentation task to the video domain. The new challenges that come with it are the simultaneous detection, segmentation, and tracking of object instances in videos. Instance masks should be labeled and associated across frames so that the same object keeps a consistent identity.
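
The annotations follow a COCO-style JSON extended along the time axis: each instance record carries one segmentation entry per frame, with null wherever the object is absent, which is what associates the same object across frames. A minimal parsing sketch (the file name is an assumption):

```python
import json

# Assumed local path to a YouTubeVIS-style annotation file.
with open("youtube_vis_train.json") as f:
    data = json.load(f)

videos = {v["id"]: v for v in data["videos"]}

ann = data["annotations"][0]    # one object instance in one video
video = videos[ann["video_id"]]

# One segmentation entry per frame; None means the instance is absent.
segs = ann["segmentations"]
visible = sum(s is not None for s in segs)
print(f"category {ann['category_id']}: visible in {visible}/{len(segs)} frames")
```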

As of the 2022 version, the dataset contains:

Apart from Video Instance Segmentation, you can also train your CV model to perform the Video Semantic Segmentation task on the YouTubeVIS dataset.

SOTA models evaluated on YouTubeVIS

The YouTubeVIS dataset serves as a benchmark that helps to assess the performance of various CV models. Thus, it becomes easier to compare models with one another and to evaluate the improvement of each given model over time.

If you want to check out the state-of-the-art models for Instance Segmentation evaluated on the YouTubeVIS dataset, you can follow this page and track changes.

Source

A shameless plug: how to use Hasty to solve an Instance Segmentation task?

As you might know, data annotation can be a bottleneck for AI startups, as the conventional labeling approach is both costly and time-consuming. Hasty's data-centric ML platform addresses this pain and automates 90% of the work needed to build and optimize your dataset for the most advanced use cases, with our self-learning assistants using AI to train AI.

The primary focus of Hasty is the Computer Vision field. Therefore, Hasty is a perfect Instance Segmentation annotation tool as it implements all the necessary instruments to help you with your Instance Segmentation task.

Let’s go through the available options step-by-step. To streamline your Instance Segmentation annotation experience, Hasty offers:

As for the annotation quality control process, Hasty has you covered with its AI Consensus Scoring feature that has a separate Instance Segmentation review option. With the help of AI CS, you can find missing labels, extra labels, and different artifacts. Also, you can better understand how a machine sees your data, which might be valuable for your annotation strategy.

When it comes to model building, Hasty’s Model Playground supports many modern neural network architectures. For Instance Segmentation, these are:

As a backbone for these architectures, Hasty offers:

As a Machine Learning metric for the Instance Segmentation case, Hasty implements mask mean Average Precision (mask mAP).
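
Under the hood, mask mAP builds on plain mask IoU: predicted instances are matched to ground-truth instances at a range of IoU thresholds, and precision is averaged over recall levels, thresholds, and classes. A minimal sketch of the IoU building block (our own illustration, not Hasty's implementation):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union else 0.0

# Toy example: two overlapping 4x4 squares on a 10x10 canvas.
a = np.zeros((10, 10)); a[2:6, 2:6] = 1
b = np.zeros((10, 10)); b[4:8, 4:8] = 1
print(mask_iou(a, b))  # 4 / 28 ≈ 0.143
```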

Hasty also has a YouTube channel featuring video tutorials for each vision AI task. Here is one about Instance Segmentation.

How to streamline your Instance Segmentation labeling experience with Hasty.ai

As of today, these are the key options Hasty has for the Instance Segmentation cases. If you want a more detailed overview, please check out the further resources or book a demo to get deeper into Hasty with our help.

Further Resources
