Dataset Description

Controlled Noisy Web Labels is a collection of ~212,000 image URLs in which every image is carefully annotated by 3-5 labeling professionals through the Google Cloud Data Labeling Service. Using these annotations, we establish the first benchmark of controlled real-world label noise from the web.

The benchmark provides the following training sets:
Blue Mini-ImageNet (synthetic noise)
Red Mini-ImageNet (real-world web noise)
Blue Stanford Cars (symmetric noise)
Red Stanford Cars (real-world web noise)

The Mini-ImageNet dataset is for coarse classification and the Stanford Cars dataset is for fine-grained classification.
Each of the training sets above has one of ten noise levels p, ranging from 0% to 80%.
The validation set has clean labels and is shared across all noisy training sets.
Details of the dataset construction and analysis can be found in our paper, published at ICML 2020.
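To make the noise level p concrete, here is a minimal sketch of how symmetric label noise at level p could be injected into a clean label list. It is an illustration only, not the procedure used to build the Blue datasets: the function name `inject_symmetric_noise`, the seeding scheme, and the choice to always flip to a different class (rather than sampling over all classes, including the true one) are assumptions made for this example.

```python
import random


def inject_symmetric_noise(labels, num_classes, p, seed=0):
    """Return a copy of `labels` in which each label is flipped with probability p.

    Assumption for this sketch: a flipped label is drawn uniformly from the
    other num_classes - 1 classes, so a flip always changes the label.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if num_classes < 2:
        raise ValueError("need at least two classes to flip a label")

    rng = random.Random(seed)  # local generator keeps the result reproducible
    noisy = []
    for y in labels:
        if rng.random() < p:
            # Draw from the num_classes - 1 classes other than y by
            # sampling an index and skipping over y itself.
            other = rng.randrange(num_classes - 1)
            noisy.append(other if other < y else other + 1)
        else:
            noisy.append(y)
    return noisy


if __name__ == "__main__":
    clean = [i % 5 for i in range(20)]
    print(inject_symmetric_noise(clean, num_classes=5, p=0.4))
```

With p = 0 the labels are returned unchanged, and the expected fraction of flipped labels is p under the assumption above.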

Examples

Mini-ImageNet: triceratops

[Three example images. Left and center: clean ("image/class/label/is_clean": 1). Right: not clean ("image/class/label/is_clean": 0).]

Stanford Cars: AM General Hummer SUV 2000

[Three example images. Left and center: clean ("image/class/label/is_clean": 1). Right: not clean ("image/class/label/is_clean": 0).]
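The "image/class/label/is_clean" field shown in the examples can be used to measure how noisy a set of records is or to keep only the clean ones. The sketch below assumes each record has already been parsed into a plain Python dict holding that field as an integer; the record layout, the "id" field, and the helper names are hypothetical and only serve the illustration.

```python
IS_CLEAN_KEY = "image/class/label/is_clean"


def observed_noise_level(records):
    """Return the fraction of records whose is_clean flag is 0."""
    if not records:
        raise ValueError("records must not be empty")
    noisy = sum(1 for r in records if r[IS_CLEAN_KEY] == 0)
    return noisy / len(records)


def keep_clean(records):
    """Return only the records whose is_clean flag is 1."""
    return [r for r in records if r[IS_CLEAN_KEY] == 1]


if __name__ == "__main__":
    # Hypothetical records mirroring one example row: two clean, one not clean.
    sample = [
        {"id": "a", IS_CLEAN_KEY: 1},
        {"id": "b", IS_CLEAN_KEY: 1},
        {"id": "c", IS_CLEAN_KEY: 0},
    ]
    print(observed_noise_level(sample))
    print([r["id"] for r in keep_clean(sample)])
```

Comparing the observed fraction against the nominal noise level p of a training set is a quick sanity check after loading the data.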

License

The annotations are licensed by Google under the CC BY 4.0 license. The images we use are under the CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image, and you should verify the license for each image yourself.

Reference

If you use this dataset, please cite the following paper:

Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels.
