Dataset Description
Controlled Noisy Web Labels is a collection of ~212,000 URLs to images in which every image is carefully annotated by 3-5 labeling professionals from the Google Cloud Data Labeling Service.
Using these annotations, we establish the first benchmark of controlled real-world label noise from the web. It comprises the following datasets:
- Blue Mini-ImageNet (synthetic noise)
- Red Mini-ImageNet (real-world web noise)
- Blue Stanford Cars (symmetric noise)
- Red Stanford Cars (real-world web noise)
The Mini-ImageNet dataset is for coarse classification and the Stanford Cars dataset is for fine-grained classification.
Each of the training sets above is provided at ten noise levels p, ranging from 0% to 80%.
The validation set has clean labels and is shared across all noisy training sets.
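To make the noise level p concrete, here is a minimal sketch of how symmetric synthetic noise could be injected into a clean label list. It is an illustration only, not the procedure used to build the Blue datasets: the function name `inject_symmetric_noise`, the per-example flip with probability p, and the uniform choice among the other classes are all assumptions.

```python
import random


def inject_symmetric_noise(labels, num_classes, p, seed=0):
    """Flip each label with probability p to a different, uniformly chosen class.

    Hypothetical helper for illustration; returns (noisy_labels, is_clean),
    where is_clean[i] is 1 if label i was kept and 0 if it was flipped.
    """
    rng = random.Random(seed)
    noisy_labels, is_clean = [], []
    for y in labels:
        if rng.random() < p:
            # Draw from the num_classes - 1 other classes, skipping y itself.
            wrong = rng.randrange(num_classes - 1)
            if wrong >= y:
                wrong += 1
            noisy_labels.append(wrong)
            is_clean.append(0)
        else:
            noisy_labels.append(y)
            is_clean.append(1)
    return noisy_labels, is_clean


# Toy data: 10,000 labels spread over 100 classes (assumed sizes).
labels = [i % 100 for i in range(10000)]
noisy, is_clean = inject_symmetric_noise(labels, num_classes=100, p=0.4)
observed = 1 - sum(is_clean) / len(is_clean)
print(f"observed noise level: {observed:.3f}")
```

Because each label is flipped independently, the observed noise level only approximates p; a construction that flips an exact fraction of the examples would hit p precisely.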
The details of dataset construction and analysis can be found in our paper, published at ICML 2020.
Examples
Mini-ImageNet: triceratops
[Three example images: two Clean ("image/class/label/is_clean": 1) and one Not Clean ("image/class/label/is_clean": 0)]
Stanford Cars: AM General Hummer SUV 2000
[Three example images: two Clean ("image/class/label/is_clean": 1) and one Not Clean ("image/class/label/is_clean": 0)]
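The "image/class/label/is_clean" flag in the examples above can be used to separate clean from mislabeled records and to check the noise level of a training set. The sketch below assumes the records have already been decoded into Python dictionaries keyed by feature name; the `url` field and its values are hypothetical placeholders, not part of the dataset.

```python
IS_CLEAN_KEY = "image/class/label/is_clean"


def split_by_cleanliness(records):
    """Return (clean, not_clean) lists based on the is_clean flag."""
    clean = [r for r in records if r[IS_CLEAN_KEY] == 1]
    not_clean = [r for r in records if r[IS_CLEAN_KEY] == 0]
    return clean, not_clean


def noise_level(records):
    """Fraction of records whose label was judged incorrect by the annotators."""
    return sum(1 for r in records if r[IS_CLEAN_KEY] == 0) / len(records)


# Toy records mirroring the example rows: two clean, one not clean.
records = [
    {"url": "placeholder-1", IS_CLEAN_KEY: 1},
    {"url": "placeholder-2", IS_CLEAN_KEY: 1},
    {"url": "placeholder-3", IS_CLEAN_KEY: 0},
]

clean, not_clean = split_by_cleanliness(records)
print(len(clean), len(not_clean))
print(f"noise level: {noise_level(records):.2f}")
```

Keeping the flag alongside each record, rather than dropping mislabeled images up front, lets the same file serve both noisy-label training and clean-subset analysis.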
License
The annotations are licensed by Google under the CC BY 4.0 license.
The images we use are under the CC BY 2.0 license.
Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make
no representations or warranties regarding the license status of each image and you should verify the license
for each image yourself.
Reference
If you use this dataset, please cite the following paper:
Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels.