17.01.2023 — Nursulu Sagimbayeva

Deep dive into the public Semantic Segmentation datasets

We took a deep dive into the field of Semantic Segmentation datasets and are ready to share our findings. Read on to discover the Top 6 Semantic Segmentation datasets for various industries that are publicly available or can be acquired for free.


To achieve the best performance of your model, it is crucial to have a rich and properly labeled dataset. However, it is not always possible — or necessary — for you to gather your own dataset from scratch, as it requires massive effort and costs. Luckily, large publicly available datasets will save you time and provide you with high-quality topic-specific images.

In this article, we will:

- define what Semantic Segmentation is;
- look at some real-life applications of Semantic Segmentation;
- review six publicly available Semantic Segmentation datasets;
- see how Hasty can help you solve a Semantic Segmentation task.

Let's jump in!

What is Semantic Segmentation?

Semantic Segmentation is a Computer Vision task that focuses on classifying every pixel in an image to produce a pixel-precise segmentation map. Each pixel is assigned to some class, as in the example below:

Example
Please check out our definitive guide on Semantic Segmentation if you want to learn more.
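To make the "one class per pixel" idea concrete, here is a minimal sketch of running a pretrained model and turning its output into a segmentation map. It assumes a recent torchvision build; the model choice and the street.jpg file name are illustrative, not tied to any particular dataset in this article.

```python
# A minimal sketch of producing a per-pixel segmentation map with a
# pretrained model (assumes torch and a recent torchvision are installed).
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("street.jpg").convert("RGB")      # any RGB image
batch = preprocess(image).unsqueeze(0)               # shape: (1, 3, H, W)

with torch.no_grad():
    logits = model(batch)["out"]                     # shape: (1, C, H, W)

# argmax over the class dimension -> one class index per pixel
segmentation_map = logits.argmax(dim=1).squeeze(0)   # shape: (H, W)
print(segmentation_map.shape, segmentation_map.unique())
```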

Real-life applications of Semantic Segmentation

1. Biology
Semantic Segmentation is used for various tasks in biology - for instance, detecting boundaries of cells and multicellular structures.

Semantic Segmentation of the boundaries of muscle fibers.
Source

2. Medicine
Using Semantic Segmentation can help researchers and medical workers study human anatomical structures, identify and localize tumors and abnormalities in body tissues, and address many other practical and research-oriented tasks.

Semantic Segmentation of retinal arteries in the eyes.
Source

3. Satellite imagery

Governments and businesses use satellite imagery for a wide variety of tasks - from weather forecasting and environmental assessment to warfare and urban planning. Semantic Segmentation makes it possible to detect and label various classes in the image, for instance, buildings, roads, crops, and so on.

Source

4. Video segmentation

Semantic Segmentation can be used in videos for various purposes - from building self-driving vehicles to analyzing road traffic and teaching robots to prevent product defects in the factory.

Source

5. Precision Agriculture

With Semantic Segmentation of fields and crops, you can reduce manual monitoring of agriculture and deploy robots that regularly spray the required amount of herbicides and perform weeding. Moreover, you can detect anomalies and issues with the crops and react in a timely manner.

Source

Publicly available datasets on Semantic Segmentation

1. COCO-Stuff: Thing and Stuff Classes in Context

You have probably heard of COCO, a large-scale Object Detection, Segmentation, and Captioning dataset. COCO-Stuff is an extension of the original COCO dataset developed for scene understanding, which largely involves the Semantic Segmentation task. It was presented in 2018 in a paper by Caesar et al.

The dataset contains:

The data is divided into the following splits:

Thing (object) categories have a specific size and shape, whereas stuff categories are usually background materials with homogeneous or repetitive patterns and no particular form. Stuff classes are important as well since they occupy large parts of the image and might help explain the context and other significant properties of an image.
Stuff categories example

Annotations in the COCO-Stuff dataset

As mentioned before, semantic classes can be either things (objects) or stuff (background). The authors highlight that many classification and detection works focus on thing classes, while stuff classes receive much less attention.

Stuff classes are crucial in explaining important aspects of an image, such as:

and so on.

Examples of annotations in COCO-Stuff

The COCO-Stuff dataset is compatible with COCO. The distribution of classes is presented in the image below.

Some of the classes, like desk, door, mirror, and window, could be either stuff or things and therefore appear in both COCO and COCO-Stuff. Visit this page to see the full list of classes and their descriptions.
Source
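Since COCO-Stuff keeps the COCO annotation format, the familiar pycocotools workflow applies. Below is a small sketch that rasterizes the stuff annotations of one image into a per-pixel class map; the annotation file name is assumed from the official COCO-Stuff release, so adjust the paths to your local copy.

```python
# A minimal sketch of turning COCO-Stuff annotations into a per-pixel mask
# with pycocotools. The annotation file path is an assumption; adjust it.
import numpy as np
from pycocotools.coco import COCO

coco = COCO("annotations/stuff_train2017.json")

img_id = coco.getImgIds()[0]                          # pick any image
img_info = coco.loadImgs(img_id)[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))

# Start from 0 ("unlabeled") and paint each annotation's category id in.
mask = np.zeros((img_info["height"], img_info["width"]), dtype=np.int32)
for ann in anns:
    mask[coco.annToMask(ann) == 1] = ann["category_id"]

print(img_info["file_name"], np.unique(mask))         # class ids in this image
```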

SOTA models evaluated on COCO-Stuff

The COCO-Stuff dataset serves as a benchmark for assessing the performance of various CV models. Thus, it becomes easier to compare models against one another and to track how each model improves over time.

If you want to check out the state-of-the-art models for Semantic Segmentation evaluated on the COCO-Stuff dataset, you can follow this page and track changes.

Source

2. NYUv2 (NYU-Depth V2)

The NYU-Depth V2 dataset consists of video sequences from various indoor scenes recorded with both the RGB and depth cameras of the Microsoft Kinect. It was presented in 2012 in the paper “Indoor Segmentation and Support Inference from RGBD Images” by Silberman et al.

The aim of the dataset was to enable CV models to explore physical relationships between the objects in the images, possible actions that can be performed with them, and the geometric structure of the scene.

The images contain scenes of offices, stores, and rooms of houses with many occluded and unevenly lit objects. Each object is labeled with a class and an instance number (e.g., chair1, chair2, chair3).

Overall, there are:

A depth map (depth image) is an image that contains information about the distance of the surfaces of scene objects from a viewpoint. In the heatmaps below (2nd column), the closest surfaces have colder colors, and the farthest surfaces have warmer colors.
Examples of the images

Annotations in the NYUv2 dataset

Among other things, the NYU-Depth V2 dataset contains annotations of large planar surfaces, such as floors, walls, and tabletops. Hence, many objects can be interpreted in relation to those surfaces.

Understanding interactions between objects and their positions in space is important since, in real life, we cannot ignore the relations between objects.

For example, imagine you have a task to drink a cup of coffee and read a book. First, you need to parse the scene around you and detect these two objects. If the cup is on the book (the book is the supporting surface of the cup), then you should pick up the cup first. Such a simple task requires an understanding of complex scenes and support relations between objects.

Objects are also classified into structural classes that reflect their physical role in the scene:

The images are divided into a labeled dataset and a raw dataset.

The full file is approximately 428 GB, so if you do not want to download the entire dataset in a single file, you can choose individual categories instead.

Output from the RGB camera (left), preprocessed depth (center) and a set of labels (right) for the image.
Source
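If you go for the labeled subset, it ships as a single MATLAB v7.3 file that can be opened with h5py. Here is a minimal sketch, assuming the official nyu_depth_v2_labeled.mat file and its documented field names (images, depths, labels); the axis order may need transposing depending on how you want the arrays laid out.

```python
# A minimal sketch of reading the labeled NYUv2 subset with h5py.
import h5py
import numpy as np

with h5py.File("nyu_depth_v2_labeled.mat", "r") as f:
    rgb = np.array(f["images"][0])     # one RGB frame
    depth = np.array(f["depths"][0])   # per-pixel distance from the camera
    labels = np.array(f["labels"][0])  # per-pixel semantic class indices

print(rgb.shape, depth.shape, labels.shape)
print("classes in this frame:", np.unique(labels))
```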

Apart from Semantic Segmentation, with NYUv2, you can train or evaluate your model for the following CV tasks:

and so on.

SOTA models evaluated on NYUv2

The NYUv2 dataset serves as a benchmark for assessing the performance of various CV models. Thus, it becomes easier to compare models against one another and to track how each model improves over time.

If you want to check out the state-of-the-art models for Semantic Segmentation evaluated on the NYUv2 dataset, you can follow this page and track changes.

Source

3. ADE20K

ADE20K is a Semantic Segmentation dataset presented in a paper by Zhou et al. in 2017. It focuses on scene parsing and contains more than 25K images with dense pixel-level annotations.

Annotations in the ADE20K dataset

The images contain 150 semantic categories, including both stuff categories (sky, roads, grass) and individual objects (person, house, book). Many objects also have annotations of their parts. For example, an object “house” has parts like “balcony”, “door”, “column”, and “column” has parts “base”, “capital”, and so on.

Source

To give an example of how it looks in practice, in the image below, the first row shows the original images, the second row illustrates the annotation of objects, and the third row shows the annotation of object parts.

Source

The dataset also provides information about different attributes of objects, for instance, whether they are occluded or cropped, and so on. The advantage of the dataset is that the images are densely annotated, and the classes were defined during the annotation process, not beforehand, which makes it possible to label even the finest details in the images.
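For the widely used scene-parsing release of ADE20K (ADEChallengeData2016), the annotations are grayscale PNGs in which 0 means "unlabeled" and 1-150 are the semantic class indices. Below is a small sketch of inspecting one mask; the file path is illustrative.

```python
# A minimal sketch of reading an ADE20K scene-parsing annotation mask and
# reporting how much of the image each class covers.
import numpy as np
from PIL import Image

mask = np.array(Image.open(
    "ADEChallengeData2016/annotations/training/ADE_train_00000001.png"))

class_ids, counts = np.unique(mask, return_counts=True)
for cid, cnt in zip(class_ids, counts):
    share = cnt / mask.size
    print(f"class {cid:3d}: {share:.1%} of the pixels")
```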

SOTA models evaluated on ADE20K

The ADE20K dataset serves as a benchmark for assessing the performance of various CV models. Thus, it becomes easier to compare models against one another and to track how each model improves over time.

If you want to check out the state-of-the-art models for Semantic Segmentation evaluated on the ADE20K dataset, you can follow this page and track changes.

Source

4. PASCAL VOC (PASCAL Visual Object Classes Challenge)

The first PASCAL (Pattern Analysis, Statistical Modeling and Computational Learning) VOC (Visual Object Classes) challenge took place in 2005 and featured two competitions: classification and detection. The results of the challenge were presented in the paper "The 2005 PASCAL Visual Object Classes Challenge" by Everingham et al. Back then, the final dataset contained only 4 classes: bicycles, cars, motorbikes, and people.

Since then, the challenge was held every year until 2012. The competition categories expanded considerably, and data was gathered for the following tasks:

Currently, the training and validation sets have 11,530 images containing 27,450 ROI annotated objects and 6,929 segmentations.

A region of interest (ROI) is a subset of the image proposed for further processing. The concept is useful when you are interested not in the whole image but only in certain parts of it.
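In code, an ROI is often nothing more than a rectangular crop of the image array, as in this tiny illustration with made-up coordinates:

```python
# A tiny illustration of an ROI: a rectangular crop of the image array that
# downstream processing focuses on. Coordinates here are made up.
import numpy as np

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a real image
x0, y0, x1, y1 = 100, 50, 300, 200                # hypothetical ROI corners
roi = image[y0:y1, x0:x1]                         # rows = y, columns = x
print(roi.shape)                                  # (150, 200, 3)
```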

Annotations in the PASCAL VOC dataset

To date, the PASCAL VOC 2012 dataset contains 20 object categories, including:

An example of annotations for Instance and Semantic Segmentation.
Source
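If you want to work with the Semantic Segmentation masks directly, note that the PNGs in the SegmentationClass folder are palette images, so reading them yields class indices rather than colors. A minimal sketch (the file name is illustrative):

```python
# A minimal sketch of reading a PASCAL VOC 2012 Semantic Segmentation mask.
# Converting the palette PNG to a NumPy array yields class indices directly
# (0 = background, 1..20 = object classes, 255 = void/boundary pixels).
import numpy as np
from PIL import Image

mask = np.array(Image.open(
    "VOCdevkit/VOC2012/SegmentationClass/2007_000032.png"))

present = [c for c in np.unique(mask) if c not in (0, 255)]
print("object classes present:", present)
```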

SOTA models evaluated on PASCAL VOC

The PASCAL VOC dataset serves as a benchmark for assessing the performance of various CV models. Thus, it becomes easier to compare models against one another and to track how each model improves over time.

If you want to check out the state-of-the-art models for Semantic Segmentation evaluated on the PASCAL VOC dataset, you can follow this page and track changes.

Source

5. Cityscapes

Cityscapes is a large-scale dataset that focuses on semantic understanding of urban street scenes. It was presented in 2015 in a paper by Cordts et al. and has been extended by various contributors since then.

The dataset consists of 5,000 images with fine annotations and 20,000 images with coarse annotations. The difference between fine and coarse annotations is illustrated below.

An example of a fine annotation, Stuttgart
An example of a coarse annotation, Saarbrücken

Annotations in the Cityscapes dataset

The annotations are divided into 8 categories with 30 classes. For example, group “human” includes classes “person” and “rider”, group “flat” includes classes “road”, “sidewalk”, and so on.
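The official cityscapesScripts package ships the full definition of these classes, including the mapping from raw label ids to the reduced set of classes typically used for training. Here is a minimal sketch, with an illustrative gtFine mask path:

```python
# A minimal sketch of mapping Cityscapes label ids to train ids using the
# official helper (pip install cityscapesscripts). The mask path follows the
# gtFine naming convention and is illustrative.
import numpy as np
from PIL import Image
from cityscapesscripts.helpers.labels import labels  # list of Label tuples

# Build a lookup table: raw label id -> train id (255 marks ignored classes).
id_to_train = np.full(256, 255, dtype=np.uint8)
for label in labels:
    if label.id >= 0:
        id_to_train[label.id] = label.trainId if label.trainId >= 0 else 255

mask = np.array(Image.open(
    "gtFine/train/stuttgart/stuttgart_000000_000019_gtFine_labelIds.png"))
train_mask = id_to_train[mask]          # per-pixel train ids, ready for training
print(np.unique(train_mask))
```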

To ensure sufficient diversity, the images were gathered from 50 cities under different conditions, such as season, time of day, and weather. Initially, the dataset was recorded as video, so only frames that met certain criteria (a large number of dynamic objects, varying scene layout and background) were selected.

If you are looking specifically for images with foggy weather conditions, check out the Foggy Cityscapes dataset, an extension of Cityscapes that contains images augmented with fog and rain.

The images in Cityscapes come with some metadata that might be of interest to you:

Apart from Semantic Segmentation, with Cityscapes, you can train and evaluate your model for the following CV tasks:

SOTA models evaluated on Cityscapes

The Cityscapes dataset serves as a benchmark for assessing the performance of various CV models. Thus, it becomes easier to compare models against one another and to track how each model improves over time.

If you want to check out the state-of-the-art models for Semantic Segmentation evaluated on the Cityscapes dataset, you can follow this page and track changes.

Source

6. DAVIS: Densely Annotated Video Segmentation dataset

The Densely Annotated Video Segmentation (DAVIS) dataset is a Video Segmentation dataset that contains 50 densely annotated, high-resolution Full HD video sequences. It was presented in 2016 in the paper “A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation” by F. Perazzi et al.

Overall, there are 3455 densely annotated frames in DAVIS with the following data split:

Each video in the dataset is accompanied by a densely annotated, pixel-accurate, per-frame ground-truth segmentation. The videos last about 2-4 seconds each but encompass the major challenges usually present in longer video sequences.

Examples of annotations in DAVIS.
Source
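Since the ground truth is pixel-accurate and per-frame, a common way to score a method on a sequence is the per-frame region similarity (Jaccard index, i.e., IoU). The sketch below evaluates the naive "copy the first frame's mask" baseline against one sequence's annotations; the paths are illustrative and follow the layout of the public release.

```python
# A minimal sketch of per-frame region similarity (Jaccard index / IoU)
# against the ground-truth masks of one DAVIS sequence, scoring the naive
# baseline of propagating the first frame's mask unchanged.
import glob
import numpy as np
from PIL import Image

def jaccard(pred, gt):
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

gt_paths = sorted(glob.glob("DAVIS/Annotations/480p/breakdance/*.png"))
masks = [np.array(Image.open(p)) > 0 for p in gt_paths]   # binarized GT masks

baseline = masks[0]                                        # "copy first frame" prediction
scores = [jaccard(baseline, gt) for gt in masks[1:]]
print("mean J over the sequence:", np.mean(scores))
```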

Annotations in the DAVIS dataset

The segmentation in the DAVIS dataset is pixel-accurate. Each video is additionally annotated with specific attributes such as occlusions, fast-motion, non-linear deformation and motion-blur.

List of video attributes and corresponding descriptions.
Source

To ensure content diversity, the classes in the dataset are distributed evenly. The categories available include:

On the dataset’s webpage, you can explore in detail various video sequences, including break-dance, cows, car turn, and many others.

Apart from Semantic Segmentation, with DAVIS, you can train and evaluate your model for the following CV tasks:

SOTA models evaluated on DAVIS

The DAVIS dataset serves as a benchmark for assessing the performance of various CV models. Thus, it becomes easier to compare models against one another and to track how each model improves over time.

If you want to check out the state-of-the-art models for Semantic Segmentation evaluated on the DAVIS dataset, you can follow this page and track changes.

Source

A shameless plug: how to use Hasty to solve a Semantic Segmentation task?

As you might know, data annotation can be a bottleneck for AI startups, as the conventional labeling approach is both costly and time-consuming. Hasty’s data-centric ML platform addresses this pain and automates 90% of the work needed to build and optimize your dataset for the most advanced use cases, with self-learning assistants that use AI to train AI.

The primary focus of Hasty is the vision AI field. Therefore, Hasty is a perfect Semantic Segmentation annotation tool, as it offers all the instruments needed to help you with your Semantic Segmentation task.

Let’s go through the available options step-by-step. To streamline your Semantic Segmentation annotation experience, Hasty offers:

When it comes to model building, Hasty’s Model Playground supports many modern neural network architectures. For Semantic Segmentation, these are:

As a backbone for these architectures, Hasty offers:

As a Machine Learning metric for the Semantic Segmentation case, Hasty implements mean Intersection over Union.
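For reference, mean IoU averages the per-class overlap between the predicted and ground-truth segmentation maps. Here is a small, library-agnostic sketch of the metric (not Hasty's internal implementation):

```python
# A minimal sketch of mean Intersection over Union (mIoU) between a predicted
# and a ground-truth label map. Class count and arrays are illustrative.
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:                 # class absent from both -> skip it
            continue
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 3, size=(4, 4))
gt = np.random.randint(0, 3, size=(4, 4))
print(mean_iou(pred, gt, num_classes=3))
```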

As of today, these are the key options Hasty has for the Semantic Segmentation cases. If you want a more detailed overview, please check out the further resources or book a demo to get deeper into Hasty with our help.

Further Resources
