We took a deep dive into the field of Computer Vision datasets and are ready to share our findings. Below you will find the best free Computer Vision datasets across various industries worth paying attention to in 2023.
Computer Vision (CV) is a rapidly growing field that focuses on developing algorithms and systems that can interpret and analyze visual information from the world around us. One of the critical components of developing effective CV models is access to high-quality, diverse datasets that can be used for training and testing.
Fortunately, there are many publicly available Computer Vision datasets on a wide range of topics, from object recognition and scene understanding to facial recognition and autonomous driving. These datasets can be invaluable resources for researchers and developers looking to build and refine CV models.
This post will explore some of the most popular and valuable publicly available Computer Vision datasets across different domains and applications.
Cityscapes is a large-scale dataset focusing on the semantic understanding of urban street scenes. The dataset consists of 5,000 images with fine annotations and 20,000 images with coarse annotations. The annotations in Cityscapes are divided into 8 categories with 30 classes. For example, group “human” includes classes “person” and “rider,” group “flat” includes classes “road,” “sidewalk,” and so on.
Foggy Cityscapes is an extension of Cityscapes in which every image is augmented with synthetic fog and rain, simulating visibility ranges of 600, 300, and 150 meters.
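Semantic segmentation benchmarks like Cityscapes are typically scored with mean Intersection-over-Union (mIoU). A minimal NumPy sketch of the metric, run here on toy 2×2 label maps rather than real Cityscapes annotations:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over the classes present in either map."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:          # class absent from both masks: skip it
            continue
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 label maps with 2 classes standing in for real segmentation masks
pred   = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
score = mean_iou(pred, target, num_classes=2)   # (0.5 + 2/3) / 2
```

The Cityscapes benchmark itself averages IoU over its evaluation classes in exactly this spirit, though the official scripts add extra bookkeeping (ignore labels, instance-level metrics).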
GTA5 (Grand Theft Auto 5)
The GTA5 dataset contains dense pixel-level semantic annotations for 25,000 images rendered by the photorealistic game Grand Theft Auto 5. The images feature street views from the car’s perspective.
SVHN (Street View House Numbers)
Street View House Numbers (SVHN) is a digit classification dataset that contains more than 600k labeled images of house numbers taken from Google Street View. It consists of 10 classes, where each digit from 0 to 9 corresponds to one class.
GTSRB (German Traffic Sign Recognition Benchmark)
The GTSRB is a multi-class, single-image classification dataset featuring German traffic signs. It contains 43 classes of traffic signs, with 39,209 images in the training set and 12,630 images in the test set. The dataset captures variations in lighting, background, weather conditions, and partial occlusion.
KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute)
KITTI consists of hours of traffic scenes captured by driving around the mid-size city of Karlsruhe, both in rural areas and on highways. The data was recorded with various technologies, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner.
Waymo Open Dataset
The Waymo Open Dataset is composed of two datasets – the perception dataset with high-resolution sensor data and labels for 2,030 segments and the motion dataset with object trajectories and corresponding 3D maps for 103,354 segments.
CelebA (CelebFaces Attributes Dataset)
CelebA dataset is a large-scale face attributes dataset with over 200k celebrity images obtained from 10,177 celebrities. Each image is labeled with 40 binary attributes (like age, gender, and hairstyle).
FFHQ image dataset consists of 70k high-quality PNG images of human faces. It contains considerable variation in age, ethnicity, image background, and accessories such as eyeglasses, sunglasses, hats, etc.
LFW (Labeled Faces in the Wild)
Labeled Faces in the Wild is a public benchmark for face verification, also known as pair matching.
It contains more than 13k face images collected from the web, covering more than 5.7k identities, 1,680 of whom have two or more images.
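Pair matching on LFW usually comes down to comparing face embeddings and thresholding their similarity. A minimal sketch with toy vectors standing in for a real face-recognition model's output (the 0.5 threshold is illustrative, not a tuned value):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(emb_a, emb_b, threshold=0.5):
    """Declare a match when the two embeddings are similar enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy embeddings: in practice these come from a face-recognition network
anchor = np.array([1.0, 0.0, 0.0])
close  = np.array([0.9, 0.1, 0.0])   # same identity
far    = np.array([0.0, 1.0, 0.0])   # different identity
```

In a real LFW evaluation, the threshold is chosen by cross-validation over the benchmark's predefined folds.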
MORPH is a dataset for facial age estimation. It contains more than 55k unique images of more than 13k individuals from 2003 to late 2007. Ages range from 16 to 77, with a median age of 33.
VGGFace2 (Visual Geometry Group Face2)
VGGFace2 is a large-scale face dataset that contains 3.31 million images of 9,131 subjects, with an average of 362.6 images for each subject. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity, and profession (e.g., actors, athletes, politicians).
FaceForensics++ is a forensics dataset comprising 1000 original video sequences manipulated with four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap, and NeuralTextures. It was meant to provide a benchmark for facial manipulation detection.
The Kinetics-700 dataset covers 700 human action classes, including human-object interactions (e.g., playing instruments) and human-human interactions (e.g., shaking hands). Each action class has at least 700 video clips that last around 10 seconds.
CUHK03 (Chinese University of Hong Kong Re-identification)
The CUHK03 dataset focuses on the person re-identification task of matching pedestrian images from separate cameras. It consists of 13,164 images of 1,360 different identities, each observed by two disjoint camera views. The annotations come in two types: manually labeled and automatically detected bounding boxes.
The Market-1501 dataset contains 32,668 annotated bounding boxes of 1,501 identities. It was collected in front of a supermarket at Tsinghua University using 1 low-resolution and 5 high-resolution cameras.
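Re-identification benchmarks like CUHK03 and Market-1501 are commonly evaluated with rank-1 accuracy: for each query image, does the nearest gallery feature belong to the same identity? A minimal sketch on toy 2-D features (real systems use learned embeddings):

```python
import numpy as np

def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Fraction of queries whose nearest gallery feature shares their identity."""
    hits = 0
    for feat, qid in zip(query_feats, query_ids):
        dists = np.linalg.norm(gallery_feats - feat, axis=1)
        nearest = int(np.argmin(dists))
        hits += int(gallery_ids[nearest] == qid)
    return hits / len(query_ids)

# Toy features: two queries matched against a gallery of three identities
gallery = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
g_ids = [10, 11, 12]
queries = np.array([[0.1, 0.0], [1.9, 2.1]])
q_ids = [10, 12]
acc = rank1_accuracy(queries, q_ids, gallery, g_ids)
```

The full CMC curve generalizes this to rank-k; the official benchmarks also report mean average precision (mAP).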
The Human3.6M dataset is a large-scale motion capture dataset comprising 3.6 million 3D human poses and corresponding images. You will find activities performed by 11 different actors in 17 scenarios, including discussion, smoking, taking photos, talking on the phone, and so on.
MPII Human Pose
MPII Human Pose dataset is a benchmark for evaluating articulated human pose estimation. It includes around 25K images containing over 40K people with annotated body joints. Overall, the dataset covers 410 human activities.
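Pose estimation on benchmarks like MPII is often scored with PCK (Percentage of Correct Keypoints): a predicted joint counts as correct when it lands within a distance threshold of the ground truth. A minimal sketch on toy 2-D joint coordinates (the 5-pixel threshold is illustrative; MPII's PCKh normalizes by head size):

```python
import numpy as np

def pck(pred_joints, gt_joints, threshold):
    """Fraction of predicted joints within `threshold` pixels of the ground truth."""
    dists = np.linalg.norm(pred_joints - gt_joints, axis=1)
    return float(np.mean(dists <= threshold))

gt   = np.array([[10.0, 10.0], [50.0, 50.0], [90.0, 10.0]])
pred = np.array([[12.0, 10.0], [50.0, 80.0], [90.0, 11.0]])
score = pck(pred, gt, threshold=5.0)   # middle joint is 30 px off
```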
ShanghaiTech Dataset was developed for crowd count estimation. It includes 1198 images with more than 330,000 heads annotated. Part A of the dataset contains 482 images and features photos from the Internet, and Part B contains 716 images taken from the busy streets of Shanghai.
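Crowd-counting models trained on ShanghaiTech-style head annotations typically regress a density map whose integral equals the head count. A common way to build the target map is to splat each head point and blur it with a Gaussian; a sketch using SciPy (the fixed sigma is a simplification; some pipelines adapt it to local crowd density):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, shape, sigma=4.0):
    """Turn annotated (x, y) head coordinates into a density map
    whose sum approximates the head count."""
    dmap = np.zeros(shape, dtype=np.float64)
    for x, y in head_points:
        dmap[int(y), int(x)] += 1.0
    return gaussian_filter(dmap, sigma=sigma, mode='constant')

heads = [(20, 30), (40, 10), (55, 55)]
dmap = density_map(heads, shape=(64, 64))   # dmap.sum() is close to 3
```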
DensePose-COCO is a large-scale ground-truth dataset developed for dense human pose estimation – mapping all human pixels of an RGB image to the 3D surface of the human body. Image-to-surface correspondences in the dataset were manually annotated on 50K COCO images.
COCO-Stuff: Thing and Stuff Classes in Context
COCO-Stuff is an extension of the original COCO dataset developed for scene understanding. It contains 164K complex images from COCO with dense pixel-level annotations, 80 thing classes (animals, vehicles, food, etc.), and 91 stuff classes (grass, mountains, walls, etc.).
The Places dataset focuses on scene recognition – a computer vision task that allows for defining a context for object recognition. It contains 205 scene categories and 2.5 million images with a category label.
SYNTHIA (SYNTHetic Collection of Imagery and Annotations)
SYNTHIA dataset was generated to aid semantic segmentation and scene understanding in the context of driving scenarios. It consists of more than 200k photo-realistic frames from a virtual city and contains 13 classes, including sky, building, road, car, pedestrian, cyclist, etc.
NYUv2 (NYU-Depth V2)
The NYU-Depth V2 dataset consists of video sequences from various indoor scenes recorded by both the RGB and Depth cameras from the Microsoft Kinect. It contains 1449 densely labeled pairs of aligned RGB and depth images, 464 new scenes taken from 3 cities, and 407,024 new unlabeled frames.
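A common first step with RGB-D data like NYUv2 is back-projecting the depth map into a 3D point cloud using the pinhole camera intrinsics. A minimal sketch on a synthetic depth map (the intrinsics below are placeholders, not the Kinect's calibrated values):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (meters) to camera-frame 3D points
    via the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)   # shape (h, w, 3)

depth = np.full((4, 4), 2.0)              # synthetic flat wall 2 m away
pts = depth_to_points(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
```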
ScanNet is an RGB-D video dataset containing 2.5 million views in more than 1,500 scans, annotated with 3D camera poses, surface reconstructions, and instance-level semantic segmentations.
SUN RGBD (Scene Understanding)
The dataset is captured by four different sensors and contains 10,000 RGB-D images. The whole dataset is densely annotated and includes 146,617 2D polygons and 58,657 3D bounding boxes with accurate object orientations, as well as a 3D room layout and category for scenes.
The Replica Dataset contains precise reproductions of various indoor environments. Each reconstruction has clean dense geometry, high resolution and high dynamic range textures, glass and mirror surface information, and planar, semantic, and instance segmentation.
USPS is a database that contains digital images of approximately 5,000 city names, 5,000 state names, 10,000 ZIP Codes, and 50,000 alphanumeric characters scanned from envelopes by the U.S. Postal Service. It consists of 7,291 train and 2,007 test images with 16x16 grayscale pixels.
IAM Handwriting Database
The IAM Handwriting Database contains 13,353 images of handwritten English text by 657 writers. The texts were transcribed from the Lancaster-Oslo/Bergen Corpus of British English. The images are labeled at the sentence, line, and word levels.
The Total-Text dataset consists of 1,555 images featuring three different text orientations: horizontal, multi-oriented, and curved.
COCO (Microsoft Common Objects in Context)
COCO is a large-scale object detection, segmentation, and captioning dataset. It contains around 330k images encompassing 80 object (“thing”) categories (e.g., animals, vehicles, food) and 91 “stuff” categories (e.g., grass, mountains, walls).
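COCO annotations ship as a single JSON file with `images`, `categories`, and `annotations` lists, where each bounding box is given as `[x, y, width, height]`. A sketch that parses a minimal COCO-style blob (the file contents below are a made-up miniature, not real COCO data):

```python
import json

# A minimal COCO-style annotation file
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 18, "name": "dog"}],
    "annotations": [{"id": 7, "image_id": 1, "category_id": 18,
                     "bbox": [100.0, 120.0, 200.0, 150.0]}],
}
blob = json.loads(json.dumps(coco))       # round-trip, as if read from disk

# Group (category name, bbox) pairs by image id
cat_names = {c["id"]: c["name"] for c in blob["categories"]}
boxes_per_image = {}
for ann in blob["annotations"]:
    boxes_per_image.setdefault(ann["image_id"], []).append(
        (cat_names[ann["category_id"]], ann["bbox"]))
```

In practice the `pycocotools` package handles this indexing (plus masks and evaluation) for you; the sketch just shows what the format looks like.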
ImageNet is a large-scale visual database containing over 14 million labeled images of objects, scenes, and people. The images are organized according to the WordNet hierarchy into more than 20K categories, focusing on everyday objects.
PASCAL VOC (PASCAL Visual Object Classes Challenge)
PASCAL (Pattern Analysis, Statistical Modeling, and Computational Learning) VOC challenges have been conducted from 2005 to 2012. The most recent dataset contains 11,530 images in the training and validation sets. It features 20 object categories: animals, vehicles, households, people, etc.
ShapeNet is a richly-annotated, large-scale repository of shapes represented by 3D CAD models of objects. It contains 3D models from many semantic categories and organizes them under the WordNet taxonomy. The ShapeNetCore subset covers 55 common object categories, whereas the ShapeNetSem subset covers 270 categories.
LVIS: A Dataset for Large Vocabulary Instance Segmentation
LVIS dataset uses the COCO 2017 train, validation, and test image sets and adds its own annotations to it. The train-validation-test split is 100,170/19,809/19,822 images. The dataset aims at providing an exhaustive annotation of underrepresented object categories.
The ADE20K is a semantic segmentation dataset focused on scene-parsing. It contains over 25k densely pixel-level annotated images and 150 semantic categories, including the stuff categories (sky, roads, grass) and individual objects (person, house, book).
CIFAR-10 and CIFAR-100 Datasets (Canadian Institute for Advanced Research, 10/100 classes)
The CIFAR-10 dataset consists of 60k 32x32 color images in 10 classes, with 6,000 images per class. The train/test split is 50k/10k. The images are labeled with one of 10 mutually exclusive classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
The CIFAR-100 dataset is similar to CIFAR-10, but it has 100 classes containing 600 images each. Every image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
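The CIFAR batch files store each image as a flat row of 3,072 values: 1,024 red, then 1,024 green, then 1,024 blue, each plane in row-major order. A sketch of turning such a row back into a 32×32×3 image (using a synthetic row instead of a real batch):

```python
import numpy as np

def row_to_image(row):
    """Convert one CIFAR row (3072 values, channel-planar RGB)
    into a 32x32x3 height-width-channel array."""
    return row.reshape(3, 32, 32).transpose(1, 2, 0)

fake_row = np.arange(3072)        # stand-in for a real uint8 batch row
img = row_to_image(fake_row)
# Pixel (0, 0) takes its R/G/B values from offsets 0, 1024, and 2048
```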
Visual Question Answering (VQA) v2.0
VQA is a dataset containing open-ended questions about images. It provides 265,016 images with at least 3 questions (5.4 questions on average) per image, 10 ground-truth answers, and 3 plausible (but likely incorrect) answers per question.
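The VQA benchmark's accuracy metric accounts for those 10 human answers: an answer counts as fully correct when at least three annotators gave it. A simplified sketch of the formula (the official evaluation additionally averages over all subsets of nine annotators and normalizes answer strings):

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(#matching annotators / 3, 1)."""
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

# 10 human answers for one question (toy data)
humans = ["red", "red", "red", "red", "maroon",
          "red", "red", "dark red", "red", "red"]
```

Predicting “red” here scores 1.0, while “maroon” scores only 1/3, since a single annotator agreed with it.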
CLEVR (Compositional Language and Elementary Visual Reasoning)
CLEVR is a diagnostic dataset that tests a range of visual reasoning abilities of VQA systems. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. The train-validation-test split is 70k/15k/15k images.
The GQA dataset is devoted to visual question answering. It leverages Visual Genome scene graph structures to create 22 million diverse reasoning questions for 113k images. It measures reasoning skills such as object and attribute recognition, transitive relation tracking, spatial reasoning, logical inference, and comparisons.
Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. Google's Conceptual Captions dataset has more than 3 million images paired with natural-language captions.
CUB-200-2011 (Caltech-UCSD Birds-200-2011)
CUB-200-2011 is a widely-used dataset for fine-grained visual categorization tasks. It contains 11,788 images and 200 categories. Each image is annotated with 15 Part Locations, 312 Binary Attributes, 1 Bounding Box, and is accompanied by ten single-sentence descriptions.
Oxford 102 Flower (102 Category Flower Dataset)
The Oxford 102 Flower dataset features images of flowers commonly occurring in the United Kingdom. Overall, there are 102 flower categories, with each class consisting of between 40 and 258 images.
The iNaturalist dataset is a large-scale dataset of images and annotations of living organisms. It consists of 579,184 training images and 95,986 validation images. There are a total of 5,089 categories in the dataset, united into 13 super-categories: Insecta (insects), Aves (birds), Reptilia, and so on.
AwA2 (Animals with Attributes 2)
The AwA2 dataset was created for benchmarking transfer-learning algorithms. It consists of 37,322 images of 50 animal classes with pre-extracted feature representations for each image. Each class has 85 numeric attribute values. It is possible to transfer information between different classes using shared attributes.
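The attribute transfer idea behind AwA2 can be sketched in a few lines: describe each class by an attribute vector, predict attributes for a test image, and assign the class whose signature is closest. The classes and attributes below are illustrative toys, not the actual AwA2 matrix:

```python
import numpy as np

# Toy class-attribute matrix: rows are classes, columns are binary attributes
# (e.g. "striped", "aquatic", "flies") -- illustrative, not AwA2's 85 attributes
class_attrs = np.array([
    [1, 0, 0],   # zebra:   striped
    [0, 1, 0],   # dolphin: aquatic
    [0, 0, 1],   # eagle:   flies
], dtype=float)
class_names = ["zebra", "dolphin", "eagle"]

def predict_class(predicted_attrs):
    """Assign the class whose attribute signature is nearest to the prediction."""
    dists = np.linalg.norm(class_attrs - predicted_attrs, axis=1)
    return class_names[int(np.argmin(dists))]

label = predict_class(np.array([0.9, 0.1, 0.0]))   # -> "zebra"
```

Because unseen classes can be described by the same attributes, this scheme supports zero-shot transfer, which is exactly what AwA2 benchmarks.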
CheXpert is a large dataset of chest X-rays and competition for automated chest x-ray interpretation, which features uncertainty labels and radiologist-labeled reference standard evaluation sets. It contains 224,316 chest radiographs of 65,240 patients, where both frontal and lateral views are available.
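CheXpert encodes each finding as positive (1), negative (0), or uncertain (-1), and the accompanying paper studies policies for resolving the uncertain labels before training, such as U-Ones (treat as positive) and U-Zeros (treat as negative). A sketch of that preprocessing step:

```python
import numpy as np

def apply_uncertainty_policy(labels, policy="U-Ones"):
    """Resolve CheXpert's uncertain labels (-1) before training.
    'U-Ones' maps them to positive, 'U-Zeros' to negative."""
    labels = np.asarray(labels, dtype=float)
    fill = 1.0 if policy == "U-Ones" else 0.0
    return np.where(labels == -1, fill, labels)

raw = [1, 0, -1, -1, 1]   # toy label vector for one finding across 5 studies
```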
The fastMRI dataset contains MRI scans of the knees and the brain. The knees data was obtained from more than 1,500 fully sampled knee MRIs, whereas the brain data was obtained from 6,970 fully sampled MRIs on 3 and 1.5 Tesla magnets.
LIDC-IDRI (Lung Image Database Consortium and Image Database Resource Initiative)
LIDC-IDRI is a reference database of lung nodules on CT scans with marked-up annotated lesions. It contains 1,018 cases from 1,010 lung patients.
The ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients with text-mined labels for 14 diseases (each image can carry multiple labels). The labels were mined from the associated radiological reports using natural language processing.
BraTS (Brain Tumor Segmentation)
The BraTS dataset contains multimodal 3D MRI scans of patients with various types of brain tumors, including gliomas and meningiomas, as well as scans of healthy individuals. The MRI scans include T1-weighted, T1-weighted contrast-enhanced, T2-weighted, and Fluid-attenuated inversion recovery (FLAIR) sequences.
The DomainNet dataset features common objects in six domains – Clipart, Infograph, Painting, Sketch, Quickdraw (drawings collected from players of the game “Quick, Draw!” worldwide), and Real (photos and real-world images). It contains about 0.6 million images distributed among 345 categories.
The Office-Home dataset was built for the domain adaptation task. It consists of 15,500 images from 4 domains: Artistic Images, Clip Art, Product images, and Real-World images. Each domain has 65 object categories found typically in Office and Home settings.
The Sketch dataset is a collection of hand-drawn sketches of everyday objects gathered via crowdsourcing. It contains 20,000 unique sketches evenly distributed over 250 object categories, such as animals, vehicles, furniture, etc.
Manga109 comprises 109 manga volumes produced by professional Japanese manga artists. It contains ground-truth annotations for text and panel regions, character and speech bubble bounding boxes, and speech bubble text.
EuroSAT is a dataset for land use and land cover classification. It contains 27,000 labeled and geo-referenced images that consist of 10 classes. The Sentinel-2 satellite images were taken from the Earth observation program Copernicus.
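Because EuroSAT images come from Sentinel-2, their spectral bands support standard remote-sensing indices such as NDVI, computed from the near-infrared (B8) and red (B4) bands. A sketch on toy reflectance values:

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index: (NIR - red) / (NIR + red).
    For Sentinel-2 that means bands B8 and B4; eps avoids division by zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

# Toy reflectance values: a vegetated pixel and a bare-soil pixel
nir = np.array([[0.8, 0.2]])
red = np.array([[0.2, 0.2]])
index = ndvi(nir, red)   # high for vegetation, near zero for bare soil
```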
DOTA (Dataset for Object deTection in Aerial Images)
DOTA is a large-scale dataset for object detection in aerial images. There are 18 common categories, 11,268 images, and 1,793,658 instances in DOTA-v2.0. Each image ranges from 800 × 800 to 20,000 × 20,000 pixels and contains objects exhibiting various scales, orientations, and shapes.
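DOTA annotates each object with four corner points, since aerial objects appear at arbitrary rotations; tooling often converts between that format and a (center, width, height, angle) parameterization. A sketch of one direction of that conversion:

```python
import math

def obb_to_corners(cx, cy, w, h, angle_rad):
    """Convert a (center, size, rotation) oriented bounding box to its
    four corner points, ordered from the top-left corner clockwise
    in the box's own frame."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    corners = []
    for dx, dy in [(-w/2, -h/2), (w/2, -h/2), (w/2, h/2), (-w/2, h/2)]:
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners

# A 4x2 box centered at (10, 10), rotated 90 degrees
box = obb_to_corners(10.0, 10.0, 4.0, 2.0, math.pi / 2)
```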
PASTIS (Panoptic Agricultural Satellite Time Series)
PASTIS is a benchmark dataset for panoptic and semantic segmentation of agricultural parcels from satellite time series. It contains 2,433 patches within the French metropolitan territory with panoptic annotations (instance index + semantic label for each pixel). Each patch is a Sentinel-2 multispectral image time series of variable length.
Only 13% of vision AI projects make it to production. With Hasty, we boost that number to 100%.
Our comprehensive vision AI platform is the only one you need to go from raw data to a production-ready model.
All the data and models you create always belong to you and can be exported and used outside of Hasty at any given time entirely for free.
You can try Hasty by signing up for free here. If you are looking for additional services like help with ML engineering, we also offer that. Check out our service offerings here to learn more about how we can help.
Thanks for reading, and happy training!