Overview

Figure 1: Our VNLI-Critique model operating as a Critic and within the Critic-and-Revise pipeline.
Abstract
Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors because they were designed for shorter texts or lack datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark of 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) with 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all judged within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique generalizes robustly, achieving state-of-the-art performance on the M-HalDetect benchmark and strong results on CHOCOLATE claim verification. (2) A VNLI-Critique-driven AutoRater for DOCCI-Critique provides reliable VLM rankings, aligning closely with human factuality judgments (e.g., 0.98 Spearman correlation). (3) A Critic-and-Revise pipeline, in which critiques from VNLI-Critique guide LLM-based corrections, substantially improves caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work contributes a challenging benchmark and practical tools that raise the bar for fine-grained evaluation and help improve VLM image understanding.
Key Contributions
- 📊 DOCCI-Critique Benchmark: A new, challenging dataset for fine-grained evaluation of detailed captions.
- 🤖 VNLI-Critique Model: An "Explaining AutoRater" for automated factuality assessment and critique generation.
- ✍️ Critic-and-Revise Pipeline: An automated pipeline to correct factual errors in captions using VNLI-Critique and an LLM.
Visuals & Examples
VNLI-Critique Assessment Example

VNLI-Critique's fine-grained, human-aligned fact-checking compared to zero-shot VLMs (from Figure 2 in the paper).
Critic-and-Revise Pipeline Example

Qualitative example of the Critic-and-Revise pipeline correcting a caption.
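Conceptually, the pipeline critiques a caption sentence by sentence and rewrites only the sentences flagged as non-factual. The sketch below illustrates this flow; it is a minimal sketch in which the `vnli_critique` and `revise_with_llm` helpers are hypothetical placeholders for the model and LLM calls, not a released API.

```python
# Minimal sketch of the Critic-and-Revise loop (helper names are hypothetical,
# not the paper's released API).

def critic_and_revise(image, caption, vnli_critique, revise_with_llm):
    """Return a revised caption in which flagged sentences are rewritten."""
    # Naive sentence split for illustration only.
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    revised = []
    for sentence in sentences:
        # VNLI-Critique judges each sentence in the context of the full caption
        # and returns a factuality label plus a textual critique.
        verdict = vnli_critique(image=image, caption=caption, sentence=sentence)
        if verdict["label"] == "factual":
            revised.append(sentence)
        else:
            # The LLM corrects only the flagged sentence, guided by the critique.
            revised.append(revise_with_llm(sentence=sentence, critique=verdict["critique"]))
    return ". ".join(revised) + "."
```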
DOCCI-Critique Benchmark
DOCCI-Critique is a novel benchmark specifically designed for fine-grained factuality assessment of paragraph-level image descriptions. It comprises 1,400 paragraph captions generated by 14 VLMs for 100 diverse, high-resolution images, with 10,216 sentence-level human annotations. The factuality of each sentence was judged by five annotators, who also provided detailed textual rationales for any identified errors.
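For intuition, each benchmark record can be viewed as a caption split into sentences, where every sentence carries five annotator labels and their rationales. Below is a minimal loading sketch; the file name and field names are illustrative assumptions, not the released schema.

```python
import json
from collections import Counter

# Illustrative record layout (field names are assumptions, not the released schema):
# {
#   "image_id": "...", "model": "...", "caption": "...",
#   "sentences": [
#     {"text": "...", "labels": ["factual", "non-factual", ...], "rationales": ["...", ...]},
#     ...
#   ]
# }

def sentence_accuracy(record):
    """Fraction of sentences whose majority vote (out of five annotators) is 'factual'."""
    majority = [Counter(s["labels"]).most_common(1)[0][0] for s in record["sentences"]]
    return sum(label == "factual" for label in majority) / len(majority)

# Hypothetical file name for the annotations.
with open("docci_critique.jsonl") as f:
    records = [json.loads(line) for line in f]

print(sum(sentence_accuracy(r) for r in records) / len(records))
```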
DOCCI-Critique Annotation Example

Illustrative example from the DOCCI-Critique benchmark showing sentence-level annotations and rationales (similar to Table 1 in the paper).
Benchmark Statistics Overview

Key statistics of the DOCCI-Critique benchmark, detailing per-model description lengths, factual accuracy, and lexical diversity (similar to Table 2 in the paper).
Experimental Results
AutoRater Performance on DOCCI-Critique
We evaluated VNLI-Critique as an AutoRater on our DOCCI-Critique benchmark by comparing its VLM rankings against human judgments across three factuality criteria. VNLI-Critique demonstrates exceptional alignment with human assessments.
AutoRater Correlation with Human Judgments

Correlation (Spearman's $\rho$, Kendall's $\tau$) between model-based rankings (including VNLI-Critique) and human judgments of VLM factuality on DOCCI-Critique (similar to Table 3 in the paper). VNLI-Critique achieves up to 0.981 Spearman correlation.
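These alignment numbers are standard rank correlations between per-VLM factuality scores from humans and from the AutoRater. A minimal SciPy sketch is shown below; the scores are toy values, not results from the paper.

```python
from scipy.stats import spearmanr, kendalltau

# Per-VLM factuality scores (toy numbers for illustration):
# one score per VLM from human annotation and one from the AutoRater.
human_scores     = [0.81, 0.74, 0.69, 0.88, 0.77]
autorater_scores = [0.79, 0.72, 0.65, 0.90, 0.75]

rho, _ = spearmanr(human_scores, autorater_scores)
tau, _ = kendalltau(human_scores, autorater_scores)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```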
Further Results (External Benchmarks & Pipeline Performance)
Beyond DOCCI-Critique, our methods show strong performance on external datasets and in improving caption factuality through our pipeline:
VNLI-Critique on External Benchmarks

Performance of VNLI-Critique on external benchmarks like M-HalDetect and CHOCOLATE, demonstrating SOTA results (e.g., 0.76 Macro-F1 on M-HalDetect) and strong generalization (similar to Table 4 in the paper).
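For reference, Macro-F1 averages the per-class F1 scores of the sentence-level factuality classification, so the minority class counts as much as the majority class. A minimal scikit-learn sketch, assuming a binary factual / non-factual split and toy labels rather than benchmark data:

```python
from sklearn.metrics import f1_score

# Toy sentence-level predictions vs. human labels (illustrative only).
y_true = ["factual", "factual", "non-factual", "factual", "non-factual"]
y_pred = ["factual", "non-factual", "non-factual", "factual", "non-factual"]

# Macro averaging weights the "factual" and "non-factual" classes equally.
print(f1_score(y_true, y_pred, average="macro"))
```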
Critic-and-Revise Pipeline Performance

Factuality improvements achieved by the Critic-and-Revise pipeline on datasets like DetailCaps-4870 and PixelProse, showing significant gains (e.g., 46% on DetailCaps-4870) as confirmed by human evaluation (similar to Table 6 in the paper).
For full experimental details, please refer to our paper (coming soon!).
The DOCCI-Critique benchmark annotations and associated DOCCI images used in this project are licensed by Google LLC under the CC BY 4.0 license. The code is licensed under Apache 2.0.