Connecting Vision and Language with

Video Localized Narratives

Paul Voigtlaender
Google Research
Soravit (Beer) Changpinyo
Google Research
Jordi Pont-Tuset
Google Research
Radu Soricut
Google Research
Vittorio Ferrari
Google Research

Description

Video Localized Narratives are a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment. However, this is challenging on a video. Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. We annotated 50k videos of the OVIS, UVO, Oops, and Kinetics datasets, totalling 3.5M words. Based on this data, we also construct new benchmarks for video narrative grounding and video question-answering tasks, and provide reference results from strong baseline models.

Explore

A video localized narrative annotation example
Open Annotation Visualizer

Video


Code

Visit the GitHub repository to view the code for working with Video Localized Narratives.
Here is the documentation of the file format used for Video Localized Narratives.

Publication

Connecting Vision and Language with Video Localized Narratives
Paul Voigtlaender, Soravit Changpinyo, Jordi Pont-Tuset, Radu Soricut, and Vittorio Ferrari
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
[arXiv] [BibTeX]
Note: We added the annotations for Kinetics recently; hence, the paper only mentions the annotations on OVIS, UVO, and Oops.
@inproceedings{Voigtlaender23CVPR,
  author        = {Paul Voigtlaender and Soravit Changpinyo and Jordi Pont-Tuset and Radu Soricut and Vittorio Ferrari},
  title         = {{Connecting Vision and Language with Video Localized Narratives}},
  booktitle     = {IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year          = {2023}
}

Downloads

Videos and Frames
We do not provide the videos or their frames. Please download the raw datasets from their respective websites: OVIS, UVOv1.0, Oops, and Kinetics. You can use the script provided by the UVO dataset to extract frames.
Video Localized Narratives
Here you can download the full set of Video Localized Narrative Annotations (format description).
Please note that some videos have more than one Video Localized Narrative annotation. The original UVO dataset has subsets with sparse and dense annotations; we kept this split and provide separate downloads for the sparse and dense subsets.

File formats

The annotations are in JSON Lines format, that is, each line of the file is an independent valid JSON-encoded object.

Each line represents one Video Localized Narrative annotation on one video by one annotator and has the following fields:

  • vidln_id Integer identifying the Video Localized Narrative, e.g. 42.
  • dataset_id String identifying the dataset and split where the video belongs to, e.g. "UVO_sparse_train".
  • video_id String identifier of the video, e.g. "c1a40349".
  • annotator_id Integer number uniquely identifying each annotator, e.g. 5.
  • keyframe_names Names of the keyframes for the video (list of strings), e.g. ["img_0000005", "img_0000012", "img_0000019", "img_0000026"].
  • actor_narratives A list of dictionaries, one per actor; each dictionary contains the annotations for that actor with the following items:
    • actor_name The name of the actor (string), e.g. "Man one".
    • keyframe_selection_indices A list of integers, specifying which of the keyframes (given by 'keyframe_names') are selected for this actor.
    • recording_start_time_ms_since_epoch Integer describing the time stamp when the audio recording for this actor has started (milliseconds elapsed since Jan 1, 1970), e.g. 1646807046669.
    • traces_start_time_ms_since_epoch Integer describing the time stamp when the annotation of traces for this actor has started (milliseconds elapsed since Jan 1, 1970), e.g. 1646807046265.
    • caption The manual transcription of what the annotator said when describing the actor (string), e.g. "A brown tiger with black stripes is fighting with the other tiger."
    • noun_segments The positions of words in the caption that are automatically tagged as nouns. A list of two-element lists, e.g. [[8, 13], [25, 32], [60, 65]], where [8,13] refers to caption[8:13], which for the example above is "tiger".
    • recording_filename The filename (string) of the voice recording for this actor, in webm format, e.g. "recordings/OVIS_train/1_0.webm".
    • time_alignment Provides start and end timestamps for words of the caption (when they were spoken). This is used to connect the words to mouse trace segments. The format is a list of dictionaries, one dictionary for each word, e.g. {'end_ms': 4720, 'referenced_word': 'brown', 'referenced_word_end_idx': 7, 'referenced_word_start_idx': 2, 'start_ms': 4180}. Here, 'start_ms' and 'end_ms' are time-stamps in milliseconds relative to the start of the audio file. 'referenced_word' is the spoken word of the caption and 'referenced_word_start_idx' and 'referenced_word_end_idx' provide indices into the caption which correspond to the word (caption[referenced_word_start_idx:referenced_word_end_idx] corresponds to 'referenced_word').
    • traces The mouse traces encoded as a list of mouse trace segments, where each mouse trace segment is a list of dictionaries, e.g. {'x': 0.53791, 'y': 0.41684, 'time_ms_since_epoch': 1646374113797, 'kf_idx': 6}. 'kf_idx' is the integer index of the keyframe (it indexes into 'keyframe_names'). 'time_ms_since_epoch' is the time stamp when this mouse position occurred (integer, in milliseconds since Jan 1, 1970), and 'x' and 'y' are normalized coordinates (floats) in the key-frame that specify the position of the mouse.
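As an illustration, here is a minimal Python sketch that reads such a .jsonl file and links each spoken word to the mouse trace points recorded while it was spoken. The file name is a placeholder and the timestamp-based word/trace matching follows our reading of the fields above; it is not official tooling.

import json

# Minimal sketch: parse a Video Localized Narratives .jsonl file and link each
# spoken word to mouse trace points via timestamps. The file name below is a
# placeholder; the matching logic follows the field descriptions above.
with open("OVIS_train_vidlns.jsonl") as f:
    for line in f:
        vidln = json.loads(line)
        print(vidln["dataset_id"], vidln["video_id"])
        for actor in vidln["actor_narratives"]:
            caption = actor["caption"]
            # Recover the automatically tagged nouns by slicing the caption.
            nouns = [caption[s:e] for s, e in actor["noun_segments"]]
            print(" ", actor["actor_name"], "nouns:", nouns)

            # Convert each word's timestamps (relative to the audio file) to
            # absolute time and collect the trace points recorded in that window.
            rec_start = actor["recording_start_time_ms_since_epoch"]
            points = [p for segment in actor["traces"] for p in segment]
            for word in actor["time_alignment"]:
                t0 = rec_start + word["start_ms"]
                t1 = rec_start + word["end_ms"]
                word_points = [p for p in points
                               if t0 <= p["time_ms_since_epoch"] <= t1]
                if word_points:
                    # Each point has normalized 'x', 'y' and a keyframe index
                    # 'kf_idx' into 'keyframe_names'.
                    print("   ", word["referenced_word"], "->",
                          len(word_points), "trace points")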
Audio Recordings
Here you can download the full set of Audio recordings of the Video Localized Narratives, in webm format.
Video Narrative Grounding
Here you can download the full set of Video Narrative Grounding Annotations (format description).

File formats

The annotations consist of two JSON files per (sub-)dataset:

  • meta_expressions.json
    • videos A dictionary mapping from the name of a video to the annotations for that video, which is again a dictionary with the following fields:
      • frames A list of all frames of the video, e.g. ['img_0000001', 'img_0000002', 'img_0000003'].
      • actor_narratives A list of dictionaries, each representing the narrative of one actor, with the following fields:
        • actor_idx An integer id representing the actor, e.g. 0.
        • actor_name The name of the actor (string): e.g. "Big cat".
        • description The description of the actor (string, same as 'caption' for the raw VideoLN data), e.g. "A big gray colored cat is playing with the little kitten in the cage."
      • expressions A dictionary mapping from expression ids (integers encoded as string keys, e.g. '0' or '1') to the data of the expression, again given by a dictionary, with the following fields:
        • narrative_actor_idx The index of the actor this expression belongs to (integer), e.g. 0.
        • noun_phrase_start_idx and noun_phrase_end_idx Integer indices (e.g. 0 and 18) into the description of the actor such that description[noun_phrase_start_idx:noun_phrase_end_idx] is the noun phrase that needs to be localized for the VNG task.
        • obj_id An integer identifier for the object which is used to look up the mask in either the original json file with annotations of the corresponding dataset, or in the provided extra_masks.json file, e.g. 285.
  • extra_masks.json
    • info A dictionary mapping from the key 'description' to a very short description (string, e.g. "OVIS-VNG test set extra mask annotations").
    • videos A list of dictionaries, each providing meta-information about one video, with the following fields:
      • width The width of the frames of the video (integer), e.g. 1920.
      • height The height of the frames of the video (integer), e.g. 1080.
      • length The number of frames of the video (integer), e.g. 25.
      • file_names a list of names of the frames of the video, e.g. ['25153e58/img_0000001.jpg', '25153e58/img_0000002.jpg', '25153e58/img_0000003.jpg', '25153e58/img_0000004.jpg'].
      • id An integer identifier of the video, e.g. 548.
    • annotations A list of dictionaries, each providing the segmentation masks for one object, with the following items:
      • id An integer identifier of the object, e.g. 4850600790259722026. Note that the meta_expressions.json data refers to these ids with 'expressions' -> 'obj_id'.
      • video_id An integer identifier of the video, e.g. 548.
      • height The height of the frames of the video (integer), e.g. 994.
      • width The width of the frames of the video (integer), e.g. 1920.
      • video_name The name of the video this annotation belongs to (string), e.g. "3a6e6ace".
      • segmentations A list with one entry per frame; each entry is either null (no mask for that frame) or a dictionary providing the segmentation mask for that frame, with the following items:
        • size A list of two integers, e.g. [994, 1920], specifying the height (994) and width (1920) of the mask annotation.
        • counts A run-length encoded representation of the segmentation mask (string), that can be decoded to a mask using pycocotools.
    Additionally, we provide a train.txt and a test.txt file for OVIS, which define our split of the OVIS dataset into training and test subsets by listing the names of the videos in each set.
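As an illustration, here is a minimal Python sketch that resolves each expression to its noun phrase and decodes the corresponding masks with pycocotools. The file paths are placeholders; looking up the actor via 'actor_idx' and the masks via 'obj_id', and treating 'annotations' as a list of per-object entries, follows our reading of the descriptions above.

import json
from pycocotools import mask as mask_utils

def decode_rle(seg):
    # Ensure the RLE 'counts' is bytes, as expected by pycocotools.
    rle = dict(seg)
    if isinstance(rle["counts"], str):
        rle["counts"] = rle["counts"].encode("utf-8")
    return mask_utils.decode(rle)

# Minimal sketch: file paths below are placeholders.
with open("meta_expressions.json") as f:
    meta = json.load(f)
with open("extra_masks.json") as f:
    extra = json.load(f)

# Index the extra masks by object id; 'obj_id' in meta_expressions.json refers
# to these ids. Objects not listed here are annotated in the original dataset.
masks_by_obj_id = {ann["id"]: ann for ann in extra["annotations"]}

for video_name, video in meta["videos"].items():
    for expr_id, expr in video["expressions"].items():
        actor = next(a for a in video["actor_narratives"]
                     if a["actor_idx"] == expr["narrative_actor_idx"])
        phrase = actor["description"][
            expr["noun_phrase_start_idx"]:expr["noun_phrase_end_idx"]]
        ann = masks_by_obj_id.get(expr["obj_id"])
        if ann is None:
            continue  # mask is provided by the original dataset instead
        # Decode the per-frame masks; entries can be null for missing frames.
        masks = [decode_rle(seg) if seg is not None else None
                 for seg in ann["segmentations"]]
        print(video_name, expr_id, repr(phrase),
              sum(m is not None for m in masks), "annotated frames")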
All datasets
OVIS VNG + UVO VNG (14.9MB)*
Video Question-Answering
Here you can download the Oops-QA benchmark annotations including text-output and location-output questions and answers for the Oops dataset (format description).

File formats

The annotations consist of two JSON files per (sub-)dataset:

  • qa_text_output.json Provides a dictionary for text-output questions with the following fields:
    • dataset The name of the dataset (string), e.g. "Oops".
    • split The split of the dataset (string), e.g. "train" or "val".
    • annotations Provides VideoQA text-output annotations as a list of dictionaries with the following fields:
      • video_name The name of the video this annotation belongs to (string), e.g. "train/False Start! - Best Fails of the Week! (May 2018) _ FailArmy31"
      • qa_pairs A list of dictionaries of questions and answers for this video, with the following fields:
        • question_id ID for this question (string), e.g. "train/False Start! - Best Fails of the Week! (May 2018) _ FailArmy31_question_train_0".
        • question Processed question used in our experiments (string), e.g. "who throws the egg at the man".
        • answer Processed answer used in our experiments (string), e.g. "baby girl".
        • raw_question Raw question (string), e.g. "Who throws the egg at the man?". In almost all cases you should use the "question" field instead.
        • raw_answer Raw answer (string), e.g. "Baby girl". In almost all cases you should use the "answer" field instead.
  • qa_location_output.json Provides a dictionary for location-output questions that maps from video names (strings) to a list of VideoQA location-output question annotations for the video, where each question is a dictionary with the following fields:
    • vidln_id An integer identifier referring to the localized narrative on which this question is based.
    • actor_idx An integer id representing the actor from which the question was generated, e.g. 0.
    • question_hash A unique string identifier of the question, e.g. "69e69b3a8b29d92dd94be66c79f82d69"
    • question The location-output question as a string, e.g. "Where is the man that is wearing gray pants?"
    • video_name The name of the video this annotation belongs to (string), e.g. "Best Fails of the Week 3 May 2016 _ FailArmy29".
    • trace_frame The name of the frame for which the trace is annotated (string), e.g. "000009.png"
    • txt_slice_indices A list of two integers (e.g. [2, 5]), that define the start and end index with which the description of the actor can be sliced to obtain the object of interest.
    • txt_slice The name of the object of interest as a string (e.g. "man"); the same as slicing the actor's caption: caption[txt_slice_indices[0]:txt_slice_indices[1]].
    • trace A dictionary encoding the mouse trace as a segmentation mask with the following items:
      • size A list of two integers (e.g. [720, 538]) specifying the height (720) and width (538) of the mask annotation.
      • counts A run-length encoded representation of the segmentation mask (string), that can be decoded to a mask using pycocotools.
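As an illustration, here is a minimal Python sketch that iterates over the text-output QA pairs and decodes the trace mask of each location-output question with pycocotools. The file paths are placeholders based on the descriptions above.

import json
from pycocotools import mask as mask_utils

# Minimal sketch: file paths below are placeholders.
with open("qa_text_output.json") as f:
    text_qa = json.load(f)
print(text_qa["dataset"], text_qa["split"])
for ann in text_qa["annotations"]:
    for qa in ann["qa_pairs"]:
        print(qa["question_id"], "|", qa["question"], "->", qa["answer"])

with open("qa_location_output.json") as f:
    loc_qa = json.load(f)
for video_name, questions in loc_qa.items():
    for q in questions:
        rle = dict(q["trace"])
        if isinstance(rle["counts"], str):
            # Ensure the RLE 'counts' is bytes, as expected by pycocotools.
            rle["counts"] = rle["counts"].encode("utf-8")
        trace_mask = mask_utils.decode(rle)  # binary mask on frame q["trace_frame"]
        print(video_name, "|", q["question"], "| mask shape:", trace_mask.shape)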
Oops
Location-output and text-output (3.8MB)*

* The annotations are licensed by Google LLC under CC BY 4.0 license.
