VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions

EMNLP 2025 Main Conference
Kazuki Matsuda, Yuiga Wada, Shinnosuke Hirano, Seitaro Otsuki, Komei Sugiura
[Figure: VELA eye-catch image]

VELA is an automatic evaluation metric for long and detailed image captions, designed within a novel LLM-Hybrid-as-a-Judge framework.

Abstract

In this study, we focus on the automatic evaluation of long and detailed image captions generated by Multimodal Large Language Models (MLLMs). Most existing automatic evaluation metrics for image captioning are primarily designed for short captions and are not suitable for evaluating long captions. Moreover, recent LLM-as-a-Judge approaches suffer from slow inference due to their reliance on autoregressive inference and early fusion of visual information. To address these limitations, we propose VELA, an automatic evaluation metric for long captions developed within a novel LLM-Hybrid-as-a-Judge framework. Furthermore, we propose LongCap-Arena, a benchmark specifically designed for evaluating metrics for long captions. This benchmark comprises 7,805 images, corresponding human-provided long reference captions and long candidate captions, and 32,246 human judgments from three distinct perspectives: Descriptiveness, Relevance, and Fluency. We demonstrated that VELA outperformed existing metrics and achieved superhuman performance on LongCap-Arena.

Quick Start (Score Computation)

Requirements

  • Python 3.10+
  • PyTorch >= 2.0

Installation

git clone --recursive git@github.com:Ka2ukiMatsuda/VELA.git
cd VELA
pip install -e .

Example Usage

from vela import load_pretrained_model

# Each sample pairs a candidate caption ("cand") with one or more reference
# captions ("refs") for the image file named by "imgid" under img_dir.
samples = [
    {
        "imgid": "sa_1545038.jpg",
        "cand": "A damaged facade of a building with bent railings and...",
        "refs": ["This is the side of a building. It shows mostly one floor and a little of the floor..."],
    },
    {
        "imgid": "sa_1545118.jpg",
        "cand": "The image captures the majestic Krasnoyarsk Fortress in Russia...",
        "refs": ["A monolithic monument surrounded by four pinkish stones on each side..."],
    },
]

# Load the pretrained VELA metric and score every candidate caption.
vela = load_pretrained_model()
scores = vela.predict(samples=samples, img_dir="data/images")
print(scores)

For complete setup instructions and advanced usage, see the GitHub repository.

Key Features

LongCap-Arena Benchmark (🤗 Download)

We introduce LongCap-Arena, a new benchmark tailored for long caption evaluation.

LongCap-Arena consists of:

  • 7,805 images
  • Corresponding human-provided long reference captions and long candidate captions
  • 32,246 human judgments from three distinct perspectives: Descriptiveness, Relevance, and Fluency

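For illustration, a single LongCap-Arena entry can be pictured roughly as in the sketch below. The "imgid", "cand", and "refs" fields mirror the Quick Start example above, while the "human_judgments" field and its rating values are assumptions made here for exposition, not the released schema.

# Illustrative sketch of one LongCap-Arena entry. The "human_judgments" field and
# its rating values are assumptions for exposition; consult the released dataset
# for the actual schema.
example_entry = {
    "imgid": "sa_1545038.jpg",                       # image identifier
    "cand": "A damaged facade of a building ...",    # long candidate caption to be judged
    "refs": ["This is the side of a building ..."],  # human-provided long reference captions
    "human_judgments": {                             # the three evaluation perspectives
        "descriptiveness": 4,
        "relevance": 5,
        "fluency": 5,
    },
}
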
Quick and Human-aligned Evaluation

VELA enables automatic evaluation of long and detailed image captions with high alignment to human judgments.

Leveraging a non-autoregressive LLM and late fusion of visual information, VELA achieves approximately 5× faster inference compared to conventional LLM-as-a-Judge approaches.
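
To check this speed difference on your own hardware, a minimal timing harness around the Quick Start API is enough. The snippet below simply wraps vela.predict with a wall-clock timer and reuses one sample from Example Usage; it is a sketch, not a rigorous benchmark.

import time

from vela import load_pretrained_model

# One sample from the Quick Start example above.
samples = [
    {
        "imgid": "sa_1545038.jpg",
        "cand": "A damaged facade of a building with bent railings and...",
        "refs": ["This is the side of a building. It shows mostly one floor and a little of the floor..."],
    },
]

vela = load_pretrained_model()

start = time.perf_counter()
scores = vela.predict(samples=samples, img_dir="data/images")
elapsed = time.perf_counter() - start

# Average wall-clock time per caption; the first call includes warm-up, so time a
# second run for a steadier estimate before comparing against an LLM-as-a-Judge
# baseline on the same samples.
print(f"{1000 * elapsed / len(samples):.1f} ms per caption")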

[Figure: VELA eye-catch]

Model Architecture

VELA consists of two main branches: the R2C-LLM branch and the I2C-Align branch.

The R2C-LLM branch evaluates the semantic relationship between the candidate and the reference captions using a lightweight LLM in a non-autoregressive manner. The I2C-Align branch computes image–text similarity with Long-CLIP, so visual information enters through late fusion rather than early fusion.

[Figure: VELA model architecture]
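
The listing below is a minimal PyTorch-style sketch of this two-branch, late-fusion design, written under assumptions rather than taken from the released implementation: pooled LLM features stand in for the R2C-LLM branch, a Long-CLIP-style cosine similarity stands in for the I2C-Align branch, and a small fusion layer combines the two signals into one score. All module names and dimensions are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VELASketch(nn.Module):
    """Conceptual sketch of a two-branch, late-fusion judge (not the official VELA code)."""

    def __init__(self, text_dim: int = 4096):
        super().__init__()
        # R2C-LLM branch: regression head on pooled features from a non-autoregressive
        # LLM encoding of the (candidate, references) text pair.
        self.r2c_head = nn.Sequential(nn.Linear(text_dim, 512), nn.GELU(), nn.Linear(512, 1))
        # Late fusion: the visual signal is combined with the textual signal only here,
        # after both branches, instead of feeding visual tokens into the LLM.
        self.fusion = nn.Linear(2, 1)

    def forward(self, llm_feat, img_emb, cand_emb):
        # llm_feat: pooled LLM features of candidate + references, shape (B, text_dim)
        # img_emb / cand_emb: Long-CLIP image and candidate-caption embeddings, shape (B, D)
        r2c_score = self.r2c_head(llm_feat)                       # (B, 1)
        i2c_sim = F.cosine_similarity(img_emb, cand_emb, dim=-1)  # (B,)  I2C-Align branch
        fused = torch.cat([r2c_score, i2c_sim.unsqueeze(-1)], dim=-1)
        return self.fusion(fused).squeeze(-1)                     # (B,) caption scores

# Random tensors stand in for real encoder outputs.
model = VELASketch()
print(model(torch.randn(2, 4096), torch.randn(2, 768), torch.randn(2, 768)).shape)  # torch.Size([2])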

Quantitative Results

VELA achieves superior correlation with human judgments on LongCap-Arena compared to existing metrics.

Quantitative comparison with baseline metrics. Bold font indicates the best, and underlined font indicates the second best. The table includes metrics based on CLIP with ViT-L as the backbone, except for Polos, which does not have a ViT-L backbone variant.

Metrics | TestA Desc.↑ | TestA Rel.↑ | TestA Flu.↑ | TestB Desc.↑ | TestB Rel.↑ | TestB Flu.↑ | Inference time [ms]↓
(Desc./Rel./Flu. columns report Kendall's τc; ↑ higher is better, ↓ lower is better.)

Image captioning metrics
  BLEU         | 28.6 | 2.4  | 25.5 | 32.0 | -10.1 | -3.5 | 0.46
  CIDEr        | -7.0 | 6.7  | 4.4  | 4.0  | -3.4  | 1.9  | 1.3
  CLIP-S       | 24.5 | 18.6 | 25.5 | 27.3 | 22.5  | 24.5 | 26
  CLIP-Savg    | -8.6 | 11.5 | 3.2  | 12.8 | 27.5  | 28.4 | 200
  RefCLIP-S    | 13.4 | 7.3  | 9.5  | 21.2 | 10.3  | 10.9 | 33
  PAC-S        | 24.8 | 14.7 | 23.6 | 27.6 | 25.7  | 23.0 | 48
  PAC-Savg     | -7.4 | 14.6 | 6.2  | 6.6  | 29.2  | 28.4 | 360
  RefPAC-S     | 22.6 | 19.1 | 24.9 | 40.7 | 29.2  | 27.9 | 52
  Polos        | 28.5 | 18.1 | 30.6 | 41.1 | 22.4  | 20.0 | 33
  DENEB        | 10.3 | 18.4 | 22.2 | 31.3 | 35.7  | 32.6 | 47
  PAC-S++      | 29.7 | 21.4 | 34.2 | 28.1 | 21.9  | 21.1 | 36
  PAC-S++avg   | -7.2 | 19.4 | 6.0  | 14.1 | 32.4  | 30.3 | 270
  RefPAC-S++   | 25.4 | 23.3 | 28.9 | 40.3 | 22.2  | 24.2 | 40

LLM-as-a-Judge
  FLEUR                | 17.3 | 2.6  | 0.5  | 12.6 | 10.6 | -3.1 | 1,300
  RefFLEUR             | 21.3 | 10.3 | 7.2  | 28.1 | 12.3 | 17.5 | 1,400
  G-VEval              | 28.3 | 22.5 | 18.2 | 38.1 | 22.2 | 19.2 | 1,800
  GPT4o w/o references | 54.1±1.0 | 36.8±6.3 | 20.9±1.0 | 43.6±2.0 | 37.3±3.4 | 25.2±1.0 | 1,900
  GPT4o w/ references  | 47.0±1.1 | 26.2±2.2 | 35.4±2.9 | 46.9±2.6 | 30.4±2.3 | 25.1±4.3 | 2,000

LLM-Hybrid-as-a-Judge
  VELA                 | 56.4±1.3 | 40.0±1.1 | 57.4±1.3 | 54.0±0.4 | 52.3±1.1 | 39.0±2.3 | 260

Human performance      | 56.1 | 46.6 | 24.5 | 48.9 | 52.6 | 24.4 | ---
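
The agreement numbers above are Kendall's τc correlations between metric scores and human judgments. As a sketch of how such a value can be computed for any metric, given paired lists of metric scores and human ratings (both hypothetical below), scipy exposes the τc variant directly:

from scipy.stats import kendalltau

# Hypothetical paired data: one metric score and one human rating per caption.
metric_scores = [0.82, 0.41, 0.67, 0.15, 0.93, 0.58]
human_ratings = [5, 2, 4, 1, 5, 3]

# Kendall's tau-c is suited to pairs whose two variables take different numbers
# of distinct values (continuous metric scores vs. discrete human ratings).
tau_c, p_value = kendalltau(metric_scores, human_ratings, variant="c")
print(f"Kendall's tau-c: {tau_c:.3f} (p = {p_value:.3f})")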

Qualitative Results

BibTeX

@inproceedings{matsuda2025vela,
  title={{VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions}},
  author={Matsuda, Kazuki and Wada, Yuiga and Hirano, Shinnosuke and Otsuki, Seitaro and Sugiura, Komei},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025}
}