Connecting Vision and Language with Localized Narratives


Publication

Connecting Vision and Language with Localized Narratives
Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari
ECCV (Spotlight), 2020
[PDF] [BibTeX] [1'30'' video] [10' video]
@inproceedings{PontTuset_eccv2020,
  author    = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
  title     = {Connecting Vision and Language with Localized Narratives},
  booktitle = {ECCV},
  year      = {2020}
}

Abstract

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.

Explore Localized Narratives

Explore some images and play the Localized Narrative annotation: synchronized voice, caption, and mouse trace. Don't forget to turn the sound on!

License

All the annotations available through this website are released under a CC BY 4.0 license. You are free to redistribute and modify the annotations, but we ask that you keep the original attribution to our paper.

Code

Python Data Loader and Helpers
Visit the GitHub repository for the code to download and work with Localized Narratives.
Here is the documentation for the file formats used.
Alternatively, you can manually download the data below.
From Traces to Boxes
This colab demonstrates how we get from a trace segment to a bounding box.
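As a rough illustration of the idea (not the code from the Colab itself), a trace segment can be enclosed in the box spanned by its points. In this minimal sketch the function name and the clamping to the [0, 1] image range are our own choices; the actual notebook may apply additional filtering or smoothing.

def trace_segment_to_box(segment):
  """Return (x_min, y_min, x_max, y_max) enclosing a trace segment.

  `segment` is a list of {'x', 'y', 't'} points in normalized image
  coordinates (see the file formats section below). Coordinates are
  clamped to [0, 1] since traces can go slightly beyond the image.
  """
  xs = [min(max(p['x'], 0.0), 1.0) for p in segment]
  ys = [min(max(p['y'], 0.0), 1.0) for p in segment]
  return min(xs), min(ys), max(xs), max(ys)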

Downloads

Full Localized Narratives
Here you can download the full set of Localized Narratives (format description).
Large files are split into shards (a list of them will appear when you click below).
In parentheses is the number of Localized Narratives in each split. Please note that some images have more than one Localized Narrative annotation, e.g. 5k images in COCO are annotated 5 times.

File formats

The annotations are in JSON Lines format, that is, each line of the file is an independent valid JSON-encoded object. The largest files are split into smaller sub-files (shards) for ease of download. Since each line of the file is independent, the whole file can be reconstructed by simply concatenating the contents of the shards.
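As a rough sketch of reading the data back, each shard can be parsed line by line and the results concatenated; the glob pattern below is a placeholder for the shard filenames you actually download.

import json
from pathlib import Path

# Placeholder pattern: substitute the shard filenames you downloaded.
shards = sorted(Path('.').glob('localized_narratives-*.jsonl'))

annotations = []
for shard in shards:
  with open(shard) as f:
    for line in f:  # each line is one independent JSON-encoded annotation
      annotations.append(json.loads(line))

print(len(annotations), 'Localized Narratives loaded')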

Each line represents one Localized Narrative annotation on one image by one annotator and has the following fields:

  • dataset_id String identifying the dataset and split where the image belongs, e.g. mscoco_val2017.
  • image_id String identifier of the image, as specified on each dataset.
  • annotator_id Integer number uniquely identifying each annotator.
  • caption Image caption as a string of characters.
  • timed_caption List of timed utterances, i.e. {utterance, start_time, end_time} where utterance is a word (or group of words) and (start_time, end_time) is the time during which it was spoken, with respect to the start of the recording.
  • traces List of trace segments, one for each interval between the mouse pointer entering the image and leaving it. Each trace segment is represented as a list of timed points, i.e. {x, y, t}, where x and y are the normalized image coordinates (with origin at the top-left corner of the image) and t is the time in seconds since the start of the recording. Please note that the coordinates can go a bit beyond the image, i.e. below 0 or above 1, as we recorded the mouse traces including a small band around the image.
  • voice_recording Relative URL path, with respect to https://storage.googleapis.com/localized-narratives/voice-recordings, of the voice recording (in OGG format) for that particular image.

Below is a sample of one Localized Narrative in this format:

{
  "dataset_id": "mscoco_val2017",
  "image_id": "137576",
  "annotator_id": 93,
  "caption": "In this image there are group of cows standing and eating th...",
  "timed_caption": [{"utterance": "In this", "start_time": 0.0, "end_time": 0.4}, ...],
  "traces": [[{"x": 0.2086, "y": -0.0533, "t": 0.022}, ...], ...],
  "voice_recording": "coco_val/coco_val_137576_93.ogg"
}
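As an illustration of how the timed caption and the mouse traces fit together (this is not part of the official tooling), the sketch below collects, for each utterance, the trace points recorded while it was being spoken; the filename is a placeholder for any of the annotation files above.

import json

def points_for_utterance(annotation, utterance_entry):
  """Return the trace points recorded while this utterance was spoken.

  `annotation` is one parsed JSON line in the format above;
  `utterance_entry` is one element of its `timed_caption` list.
  """
  start, end = utterance_entry['start_time'], utterance_entry['end_time']
  points = []
  for segment in annotation['traces']:
    points.extend(p for p in segment if start <= p['t'] <= end)
  return points

with open('localized_narratives.jsonl') as f:  # placeholder filename
  annotation = json.loads(f.readline())
  for entry in annotation['timed_caption']:
    print(entry['utterance'], len(points_for_utterance(annotation, entry)))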
Textual captions only
To facilitate download, below are the annotations for the same images as above, but containing only the textual captions, in case that is the only part of Localized Narratives you are interested in.
Automatic speech-to-text transcriptions

Below you can download the automatic speech-to-text transcriptions of the voice recordings. The format is a list of text chunks, each of which is a list of ten alternative transcriptions along with their confidence scores.

Please note: the final caption text of Localized Narratives is given manually by the annotators. The automatic transcriptions below are only used to temporally align the manual transcription to the mouse traces. The timestamps used for this alignment, however, were not stored, so the alignment process cannot be reproduced. To obtain timestamps, you would need to re-run Google's speech-to-text transcription (here is the code we used). Given that the API is constantly evolving, though, the transcription will likely not match the one stored below.

