Connecting Vision and Language with Localized Narratives
Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari
ECCV (Spotlight), 2020
[PDF] [BibTeX] [1'30'' video] [10' video]
  author    = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
  title     = {Connecting Vision and Language with Localized Narratives},
  booktitle = {ECCV},
  year      = {2020}


We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.

Explore Localized Narratives

Explore some images and play the Localized Narrative annotation: synchronized voice, caption, and mouse trace. Don't forget to turn the sound on!


Python Data Loader and Helpers
Visit the GitHub repository to view the code to download and work with Localized Narratives.
Here is the documentation about the file formats used.
Alternatively, you can directly download the data below.
Full Localized Narratives
Here you can download the full 873,107 Localized Narratives (format description).
Large files are split into shards (a list of them will appear when you click below).
The number of Localized Narratives in each split is given in parentheses.

File formats

The annotations are in JSON Lines format, that is, each line of the file is an independent valid JSON-encoded object. The largest files are split into smaller sub-files (shards) for ease of download. Since each line of the file is independent, the whole file can be reconstructed by simply concatenating the contents of the shards.
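Because every line is an independent JSON object, the shards can also be read one after another without first concatenating them. A minimal sketch in Python, assuming the shards have already been downloaded locally; the function name iter_annotations is hypothetical and is not part of the official data loader:

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator


def iter_annotations(shard_paths: Iterable[Path]) -> Iterator[Dict]:
    """Yield one annotation dict per line, reading the shards in the given order.

    Hypothetical helper for illustration; not the official loader.
    """
    for path in shard_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # tolerate blank lines, e.g. a trailing newline
                    yield json.loads(line)
```

Reading lazily keeps memory use low, since only one annotation is held at a time; passing the shard paths in sorted order reproduces the order of the concatenated file.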

Each line represents one Localized Narrative annotation on one image by one annotator and has the following fields:

  • dataset_id String identifying the dataset and split where the image belongs, e.g. mscoco_val2017.
  • image_id String identifier of the image, as specified on each dataset.
  • annotator_id Integer number uniquely identifying each annotator.
  • caption Image caption as a string of characters.
  • timed_caption List of timed utterances, i.e. {utterance, start_time, end_time} where utterance is a word (or group of words) and (start_time, end_time) is the time during which it was spoken, with respect to the start of the recording.
  • traces List of trace segments, one for each interval between the mouse pointer entering the image and leaving it. Each trace segment is represented as a list of timed points, i.e. {x, y, t}, where x and y are the normalized image coordinates and t is the time in seconds since the start of the recording. Please note that the coordinates can go slightly beyond the image, i.e. <0 or >1, as we recorded the mouse traces including a small band around the image.
  • voice_recording Relative URL path, with respect to the base URL of the voice recordings, where the voice recording (in OGG format) for that particular image can be found.
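As an illustration of the traces field, the sketch below converts one trace segment from normalized coordinates to pixel coordinates. It assumes that x is normalized by the image width and y by the image height, and it optionally clips points recorded in the band around the image. The function name trace_to_pixels and its parameters are hypothetical:

```python
from typing import Dict, List, Tuple


def trace_to_pixels(
    segment: List[Dict[str, float]], width: int, height: int, clip: bool = True
) -> List[Tuple[float, float, float]]:
    """Convert one trace segment of {x, y, t} points to (px, py, t) tuples.

    Assumes x is normalized by image width and y by image height.
    """
    points = []
    for p in segment:
        x, y = p["x"], p["y"]
        if clip:
            # Points recorded in the band around the image fall outside [0, 1].
            x = min(max(x, 0.0), 1.0)
            y = min(max(y, 0.0), 1.0)
        points.append((x * width, y * height, p["t"]))
    return points
```

Clipping is a design choice: it keeps every point drawable on the image, whereas clip=False preserves the raw positions, including those just outside the image border.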

Below is a sample of one Localized Narrative in this format:

{
  dataset_id: 'mscoco_val2017',
  image_id: '137576',
  annotator_id: 93,
  caption: 'In this image there are group of cows standing and eating th...',
  timed_caption: [{'utterance': 'In this', 'start_time': 0.0, 'end_time': 0.4}, ...],
  traces: [[{'x': 0.2086, 'y': -0.0533, 't': 0.022}, ...], ...],
  voice_recording: 'coco_val/coco_val_137576_93.ogg'
}
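Since timed_caption and traces share the same clock (seconds since the start of the recording), each utterance can be paired with the mouse trace points recorded while it was spoken. The sketch below uses a plain time-window selection; it is only an illustration under that assumption, not necessarily the procedure used in the paper, and the function name points_per_utterance is hypothetical:

```python
from typing import Dict, List


def points_per_utterance(annotation: Dict) -> List[Dict]:
    """Attach to each timed utterance the trace points recorded while it was spoken."""
    # Flatten all trace segments into a single list of {x, y, t} points.
    all_points = [p for segment in annotation["traces"] for p in segment]
    result = []
    for utt in annotation["timed_caption"]:
        # Both interval ends are inclusive, so a point that falls exactly on a
        # boundary is assigned to both neighbouring utterances.
        inside = [
            p for p in all_points
            if utt["start_time"] <= p["t"] <= utt["end_time"]
        ]
        result.append({"utterance": utt["utterance"], "points": inside})
    return result
```

An utterance spoken while the pointer was outside the recorded area simply gets an empty list of points.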
Textual captions only
To facilitate download, the files below contain the annotations for the same images as above, but with only the textual caption, for those interested only in this part of Localized Narratives.
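A caption-only record can also be derived from a full annotation by dropping the timing, trace, and audio fields. A minimal sketch follows; which fields the caption-only files actually retain is an assumption here, and the function name to_caption_only is hypothetical:

```python
from typing import Dict, Sequence

# Assumed set of fields kept in a caption-only record (not confirmed by the
# format description above).
CAPTION_ONLY_FIELDS = ("dataset_id", "image_id", "annotator_id", "caption")


def to_caption_only(
    annotation: Dict, keep: Sequence[str] = CAPTION_ONLY_FIELDS
) -> Dict:
    """Return a copy of the annotation restricted to the fields in `keep`."""
    return {key: annotation[key] for key in keep if key in annotation}
```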
