Nowadays, one of the fastest-growing areas in Artificial Intelligence (AI) is agriculture because AI algorithms greatly benefit agro-businesses from various perspectives. For example, Machine Learning models can optimize the workflows, make employees' lives easier, automate some tasks, reduce costs, and increase profits.
However, any AI solution is only as good as the data the Machine Learning algorithm was trained on. That is why you must be accurate and precise when gathering and labeling data assets. Fortunately, in many cases you do not have to do everything yourself because there are datasets: collections of annotated data assets. Throughout AI history, enthusiasts and researchers have collected many datasets for various tasks, so when starting a Machine Learning project, you should always check for freely available datasets that fit your task.
Although a dataset might not be a perfect fit, looking at what is publicly available is also a great way to understand how you can build your own data asset. It also gives you good data to experiment on and a first idea of how you want to go about building your model.
At Hasty, we work with loads of data on a daily basis and often see teams struggling to pick a proper dataset for their task. From our experience, the most common agricultural use-cases are:
That is why in this post, we will cover some free datasets that are useful for these Machine Learning tasks. Of course, there are more agricultural use-cases, but we will talk about only the four mentioned above since they are the most vital in the agricultural vision AI sphere. As you might know, in vision AI, you need a big dataset to build a successful solution. That is why, to narrow the search, we will look only for the datasets with at least 7000 images.
Note: In this post, we will mention only the datasets relevant when solving the vision AI tasks such as Classification, Object Detection, Instance Segmentation, and Semantic Segmentation.
This dataset was created for Weed Detection in Soybean Crops in 2017. The researchers used an unpiloted aerial vehicle flying at a height of 4 meters above the soybean field to capture the images. The captured pictures were automatically segmented into patches via a clustering algorithm based on superpixels and then manually annotated into four classes:
- Soil (3249 samples);
- Soybean (7376 samples);
- Grass (3520 samples);
- Broadleaf weeds (1191 samples).
Thus, in the dataset, you will find 15 336 image patches that should be an excellent addition to your weed control dataset or can be used at the start to test some primary hypotheses.
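The class counts above are uneven, so it can help to quantify the imbalance before training. Below is a minimal sketch that sums the patch counts and derives inverse-frequency class weights. The weighting formula is a common heuristic chosen for illustration, not something prescribed by the dataset authors, and the dictionary keys are shorthand names invented for this example.

```python
# Patch counts per class, as listed above.
counts = {"soil": 3249, "soybean": 7376, "grass": 3520, "broadleaf": 1191}

total = sum(counts.values())  # 15336 patches in total

# Share of each class in the dataset.
shares = {name: n / total for name, n in counts.items()}

# Inverse-frequency weights: total / (num_classes * class_count).
# A perfectly balanced dataset would give every class a weight of 1.0.
weights = {name: total / (len(counts) * n) for name, n in counts.items()}

for name in counts:
    print(f"{name:10s} share={shares[name]:.3f} weight={weights[name]:.2f}")
```

Rare classes such as broadleaf weeds end up with the largest weight, which is one simple way to keep the loss from being dominated by soybean patches.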
The Sugar Beets 2016 is a popular agricultural dataset widely used in building robotic crop and weed detection AI solutions. It was created on a sugar beet farm using a robot carrying a four-channel multispectral camera and an RGB-D sensor to capture the most detailed information possible. The dataset consists of three major parts:
- Navigation data for a robot (it was not particularly relevant for us, but if you are working in agricultural robotics, you should check it out);
- 283 multi-class images annotated on a pixel level with sugar beet and nine different types of weeds making up the classes;
- 12 340 images with pixel-level annotations consisting of three classes: crop, weed, and background.
To summarize, Sugar Beets is a large-scale agricultural robot dataset that can be used for many tasks and has no publicly available analog. That is why it is more than worth your attention. Please check the initial paper to learn more.
As of today, the Weed Map dataset is believed to be the largest publicly available multispectral aerial dataset for sugar beet/weed segmentation. It was created in 2018 using two unpiloted aerial vehicles carrying multichannel and multispectral cameras and flying about 10 meters above sugar beet fields to collect the images. As a result, the dataset consists of eight sets of high-resolution orthomosaic maps with pixel-level annotations for three classes: crop, weed, and background.
To avoid technical difficulties such as being unable to fit these large maps into GPU memory, the researchers split the maps into 10 196 images using a sliding window. To summarize, the Weed Map dataset is of very high quality, which is essential for solving supervised vision AI tasks, such as pixel-level semantic classification.
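If you want to tile your own large maps in a similar way, the idea can be sketched as follows. This is an illustrative sliding-window helper, not the procedure used by the Weed Map authors. The function name, the map size, the tile size, and the stride are all hypothetical values picked for the example.

```python
def sliding_windows(height, width, tile, stride):
    """Yield (top, left, bottom, right) boxes that cover a height x width map.

    Assumes tile <= height and tile <= width. The last row and column of
    windows are shifted back so they end exactly at the map border.
    """
    tops = list(range(0, height - tile + 1, stride))
    lefts = list(range(0, width - tile + 1, stride))
    if tops[-1] + tile < height:
        tops.append(height - tile)
    if lefts[-1] + tile < width:
        lefts.append(width - tile)
    for top in tops:
        for left in lefts:
            yield top, left, top + tile, left + tile


# Hypothetical 1000 x 1500 pixel map cut into 480 x 480 tiles.
boxes = list(sliding_windows(height=1000, width=1500, tile=480, stride=480))
print(len(boxes), boxes[0], boxes[-1])
```

Each box can then be used to crop both the image and its annotation mask, so every tile keeps its pixel-level labels. Setting the stride smaller than the tile size produces overlapping tiles.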
The DeepWeeds dataset consists of 17 509 labeled RGB images (256 x 256 pixels) of eight nationally significant weed species native to eight locations across northern Australia.
The data assets were automatically collected using a customized ground weed control robot in natural field conditions. Initially, the dataset was built to solve weed classification tasks, so it only provides image tags for each picture (image-level annotations). That is why you cannot use DeepWeeds for weed segmentation and localization out of the box. However, there are at least a thousand images per class, and there is no significant class imbalance in the data, which is beneficial for solving vision AI classification tasks. To summarize, you should check DeepWeeds if you are looking for a weed classification dataset.
As you might have noticed when reading about the previous datasets, some of them consist of separate parts that can be used for different applications. The Date Fruit dataset is yet another example. The first subset is the classification one and has 8079 RGB images (224 x 224 pixels). The authors highlight the wide intra-class variation of the pictures due to varied fruit maturity, image angles and scales, lighting, and fruit bagging states (some date bunches are covered by various types of bags). This subset is fully labeled according to fruit variety and maturity. In the initial article, this comprehensive annotation was used for evaluating Deep Learning models for fruit classification.
The second subset consists of images, videos, and weight measurements (in the metadata) of date bunches acquired during the harvesting period. With such data, this subset can be used for yield estimation tasks.
To summarize, Date Fruit is a comprehensive dataset for date fruit pre-harvesting and harvesting applications that you can apply to various Machine Learning tasks.
The MinneApple dataset’s name is a portmanteau of Minnesota and apple. The dataset was gathered in an unusual manner: the authors captured videos of apple trees in natural conditions with their smartphones and then extracted the images from the video sequences. The obtained dataset was divided into two parts for different applications. The first subset is the detection/segmentation one, as it has 670 images with pixel-level annotated fruits (41 325 objects) and 311 unlabeled pictures. The image size in the first subset is 1280 × 720 pixels.
The second subset is called the counting set and has more than 66 000 images with ground-truth fruit counts.
As you might know, the counting and detection tasks are two different subproblems of the larger problem: yield estimation. To estimate yield accurately, you need to detect the fruit first and then use a separate counting algorithm since fruits can be segmented together. That is why MinneApple is a strong yield estimation dataset - it provides data to develop and test algorithms for both subproblems, which is beneficial for building chained model pipelines. So, if that is something you want to experiment with, MinneApple is a great dataset for you.
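To see why counting needs its own step, consider a toy example. The sketch below counts connected regions in a small binary mask, which is a deliberately naive way of counting fruits. The mask and the helper function are made up for illustration and are not part of MinneApple or its tooling.

```python
def count_components(mask):
    """Count 4-connected groups of 1s in a binary mask given as a list of lists."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                components += 1
                stack = [(r, c)]
                seen[r][c] = True
                # Flood fill everything reachable from this pixel.
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return components


# Toy mask: two touching fruits on the left, one isolated fruit on the right.
mask = [
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1],
]
print(count_components(mask))  # 2 regions, although the toy scene holds 3 fruits
```

The two touching fruits merge into a single region, so the naive count is too low. A dedicated counting model trained on ground-truth counts is one way to close that gap.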
The Capsicum Annuum dataset was created in 2018 using an innovative approach of synthesizing agricultural data for Computer Vision tasks. In other words, the images from Capsicum Annuum were generated synthetically, not acquired by hand. To do so, the researchers procedurally generated plant models with randomized plant parameters (geometric parameters, color, texture, etc.) based on 21 empirically measured plant properties. They then rendered scenes with the obtained models. As a result, the synthetic Capsicum Annuum dataset consists of 10 500 images with pixel-level ground truth segmentation of eight plant-part classes: background, leaves, peppers, peduncles, stems, shoots and leaf stems, wires, and cuts.
To summarize, the synthetic nature of Capsicum Annuum does not allow it to be directly used in real-life use cases. However, this dataset is a good starting point for semantic segmentation tasks. You can train a benchmark model on Capsicum Annuum and then fine-tune it on some real images.
The Maize Disease dataset was created to solve the subproblem of crop monitoring - disease detection. The dataset focuses on the Northern corn leaf blight, a common foliar disease of corn. The researchers collected RGB images on an infected cornfield in three different ways and split the dataset into three subsets:
- Handheld set (authors took pictures via a handheld camera) - 1787 images;
- Boom set (researchers mounted a camera on a boom) - 8766 images;
- Drone set (the authors used a camera-equipped UAV flying at a height of 6 meters) - 7669 images.
As for the labels, data annotation was done by agricultural experts who drew lines down the main axis of each Northern corn leaf blight lesion visible in the image, stretching down the entire length of the lesion.
So, initially, the Maize Disease dataset did not have pixel-level annotations segmenting the lesion margins. However, in more recent research, the authors crowdsourced the pixel-level lesion annotation task for the Drone set. That is why, as of today, a part of the Maize Disease dataset has segmentation mask labels.
To summarize, Maize Disease is a comprehensive dataset with accurate line and segmentation labels made by expert annotators and through crowdsourcing, respectively.
The Open Plant Phenotype Database was created to become a common test set: a publicly available annotated reference dataset that can be used to compare the performance of different Machine Learning algorithms. The dataset contains 7590 RGB images of 47 plant species. Moreover, each species was cultivated under three different growth conditions: ideal, drought, and natural. This was done to provide a wide intra-class variety of plants in terms of visual appearance. As for annotations, the dataset has image-level annotations of plant classes and bounding boxes for the plants in the image.
Thus, you can use the Open Plant Phenotype Database to evaluate plant classification and object detection models. To summarize, you can either use this dataset as a test set when validating your model or try to solve the Phenotyping task with it.
As you can see, there are many existing datasets that can jumpstart your agricultural AI efforts. The same is true for the majority of AI fields and tasks. Of course, the publicly available datasets might not be ideal for your task, but it is better to have at least some data rather than none. So, when planning your next Machine Learning project, you should spare some time to research available datasets that can be acquired for free. Here are some tips that should make your life easier:
- Pay attention to the number of images and annotations. In Machine Learning, usually the larger, the better;
- Check the annotation type and see if it matches your task;
- Double-check who annotated the dataset: is it crowdsourced or manually labeled by experts? Although there are good crowdsourced datasets, they sometimes have quality issues. Moreover, some annotation tasks require a level of expertise that crowdsourcing cannot fully provide;
- Try to find a suitable dataset with a corresponding article and some research with hypotheses testing around it.
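The first two tips are easy to turn into a quick shortlist script. The sketch below is one possible way to do it. The image counts come from this post, while the `DatasetInfo` class, the `shortlist` function, and the simplified annotation labels are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    name: str
    images: int
    annotation: str  # simplified label chosen for this sketch


# A few of the datasets covered above, with the image counts quoted in this post.
CATALOG = [
    DatasetInfo("Sugar Beets 2016", 12340, "semantic segmentation"),
    DatasetInfo("Weed Map", 10196, "semantic segmentation"),
    DatasetInfo("DeepWeeds", 17509, "classification"),
    DatasetInfo("Date Fruit (classification subset)", 8079, "classification"),
    DatasetInfo("Capsicum Annuum (synthetic)", 10500, "semantic segmentation"),
    DatasetInfo("Open Plant Phenotype Database", 7590, "object detection"),
]


def shortlist(catalog, annotation, min_images=7000):
    """Return datasets with the wanted annotation type, largest first."""
    matches = [d for d in catalog
               if d.annotation == annotation and d.images >= min_images]
    return sorted(matches, key=lambda d: d.images, reverse=True)


for d in shortlist(CATALOG, "semantic segmentation"):
    print(d.name, d.images)
```

A shortlist like this only covers size and annotation type. Who labeled the data and whether there is supporting research still need a manual check.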
To tell the truth, if you spend some time researching the available options, you will likely find something suitable. Unfortunately, as you might have noticed throughout this post, many datasets (especially in the agriculture sphere) are highly specific. So, the dataset might not be the perfect long-term solution despite being relevant for you. Therefore, at some point in the project, most AI teams decide to create a dataset that will 100% satisfy them. Sure, such an approach is time-consuming and expensive, but the payoff and potential are massive as well. Nevertheless, do not worry if you decide to go that way because Hasty will always back you up and save your time and nerves during the Data Annotation process. Try us out and see for yourself!
A shameless plug
Hasty is a vision AI platform that helps you throughout the ML lifecycle. To date, we can help you with:
- Automate up to 90% of all annotation
- Make quality control 35x faster
- Train models directly on your data using our low-code model builder
- Take any custom models trained in Hasty and deploy them back to the annotation environment in one click
- Export any models you create in commonly used formats
- Or host any model in our cloud
- Monitor inferences made in production
- Most importantly, we offer all this through an API for easy integration.
In short, we take care of a lot of the MLOps so you don’t have to. Book a demo if you want to know more.