VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions

EMNLP 2025 Main Conference
Kazuki Matsuda, Yuiga Wada, Shinnosuke Hirano, Seitaro Otsuki, Komei Sugiura
[Figure: VELA eye-catch image]

VELA is an automatic evaluation metric for long and detailed image captions, designed within a novel LLM-Hybrid-as-a-Judge framework.

Abstract

In this study, we focus on the automatic evaluation of long and detailed image captions generated by Multimodal Large Language Models (MLLMs). Most existing automatic evaluation metrics for image captioning are primarily designed for short captions and are not suitable for evaluating long captions. Moreover, recent LLM-as-a-Judge approaches suffer from slow inference due to their reliance on autoregressive inference and early fusion of visual information. To address these limitations, we propose VELA, an automatic evaluation metric for long captions developed within a novel LLM-Hybrid-as-a-Judge framework. Furthermore, we propose LongCap-Arena, a benchmark specifically designed for evaluating metrics for long captions. This benchmark comprises 7,805 images, corresponding human-provided long reference captions and long candidate captions, and 32,246 human judgments from three distinct perspectives: Descriptiveness, Relevance, and Fluency. We demonstrated that VELA outperformed existing metrics and achieved superhuman performance on LongCap-Arena.

Quick Start (Score Computation)

Requirements

  • Python 3.10+
  • PyTorch >= 2.0

Installation

git clone --recursive git@github.com:Ka2ukiMatsuda/VELA.git
cd VELA
pip install -e .

Example Usage

from vela import load_pretrained_model

# Each sample pairs a candidate caption ("cand") with one or more reference
# captions ("refs") for the image file named by "imgid" under img_dir.
samples = [
    {
        "imgid": "sa_1545038.jpg",
        "cand": "A damaged facade of a building with bent railings and...",
        "refs": ["This is the side of a building. It shows mostly one floor and a little of the floor..."],
    },
    {
        "imgid": "sa_1545118.jpg",
        "cand": "The image captures the majestic Krasnoyarsk Fortress in Russia...",
        "refs": ["A monolithic monument surrounded by four pinkish stones on each side..."],
    },
]

# Load the pretrained VELA metric and score every candidate caption.
vela = load_pretrained_model()
scores = vela.predict(samples=samples, img_dir="data/images")
print(scores)

For complete setup instructions and advanced usage, see the GitHub repository.

Key Features

LongCap-Arena Benchmark (🤗 Download)

We introduce LongCap-Arena, a new benchmark tailored for long caption evaluation.

LongCap-Arena consists of:

  • 7,805 images
  • Corresponding human-provided long reference captions and long candidate captions
  • 32,246 human judgments from three distinct perspectives: Descriptiveness, Relevance, and Fluency

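For illustration, a single LongCap-Arena entry can be pictured roughly as in the sketch below. The "imgid", "cand", and "refs" fields mirror the Quick Start example above, while the "human_judgments" field and its rating values are assumptions made here for exposition, not the released schema.

# Illustrative sketch of one LongCap-Arena entry. The "human_judgments" field and
# its rating values are assumptions for exposition; consult the released dataset
# for the actual schema.
example_entry = {
    "imgid": "sa_1545038.jpg",                       # image identifier
    "cand": "A damaged facade of a building ...",    # long candidate caption to be judged
    "refs": ["This is the side of a building ..."],  # human-provided long reference captions
    "human_judgments": {                             # the three evaluation perspectives
        "descriptiveness": 4,
        "relevance": 5,
        "fluency": 5,
    },
}
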
Quick and Human-aligned Evaluation

VELA enables automatic evaluation of long and detailed image captions with high alignment to human judgments.

Leveraging a non-autoregressive LLM and late fusion of visual information, VELA achieves approximately 5× faster inference compared to conventional LLM-as-a-Judge approaches.
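
To check this speed difference on your own hardware, a minimal timing harness around the Quick Start API is enough. The snippet below simply wraps vela.predict with a wall-clock timer and reuses one sample from Example Usage; it is a sketch, not a rigorous benchmark.

import time

from vela import load_pretrained_model

# One sample from the Quick Start example above.
samples = [
    {
        "imgid": "sa_1545038.jpg",
        "cand": "A damaged facade of a building with bent railings and...",
        "refs": ["This is the side of a building. It shows mostly one floor and a little of the floor..."],
    },
]

vela = load_pretrained_model()

start = time.perf_counter()
scores = vela.predict(samples=samples, img_dir="data/images")
elapsed = time.perf_counter() - start

# Average wall-clock time per caption; the first call includes warm-up, so time a
# second run for a steadier estimate before comparing against an LLM-as-a-Judge
# baseline on the same samples.
print(f"{1000 * elapsed / len(samples):.1f} ms per caption")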

[Figure: VELA eye-catch]

Model Architecture

VELA consists of two main branches: the R2C-LLM branch and the I2C-Align branch.

The R2C-LLM branch evaluates the semantic relationship between the candidate and the reference captions using a lightweight LLM in a non-autoregressive manner. The I2C-Align branch computes image–text similarity with Long-CLIP, so visual information enters through late fusion rather than early fusion.

[Figure: VELA model architecture]
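
The listing below is a minimal PyTorch-style sketch of this two-branch, late-fusion design, written under assumptions rather than taken from the released implementation: pooled LLM features stand in for the R2C-LLM branch, a Long-CLIP-style cosine similarity stands in for the I2C-Align branch, and a small fusion layer combines the two signals into one score. All module names and dimensions are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VELASketch(nn.Module):
    """Conceptual sketch of a two-branch, late-fusion judge (not the official VELA code)."""

    def __init__(self, text_dim: int = 4096):
        super().__init__()
        # R2C-LLM branch: regression head on pooled features from a non-autoregressive
        # LLM encoding of the (candidate, references) text pair.
        self.r2c_head = nn.Sequential(nn.Linear(text_dim, 512), nn.GELU(), nn.Linear(512, 1))
        # Late fusion: the visual signal is combined with the textual signal only here,
        # after both branches, instead of feeding visual tokens into the LLM.
        self.fusion = nn.Linear(2, 1)

    def forward(self, llm_feat, img_emb, cand_emb):
        # llm_feat: pooled LLM features of candidate + references, shape (B, text_dim)
        # img_emb / cand_emb: Long-CLIP image and candidate-caption embeddings, shape (B, D)
        r2c_score = self.r2c_head(llm_feat)                       # (B, 1)
        i2c_sim = F.cosine_similarity(img_emb, cand_emb, dim=-1)  # (B,)  I2C-Align branch
        fused = torch.cat([r2c_score, i2c_sim.unsqueeze(-1)], dim=-1)
        return self.fusion(fused).squeeze(-1)                     # (B,) caption scores

# Random tensors stand in for real encoder outputs.
model = VELASketch()
print(model(torch.randn(2, 4096), torch.randn(2, 768), torch.randn(2, 768)).shape)  # torch.Size([2])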

Quantitative Results

VELA achieves superior correlation with human judgments on LongCap-Arena compared to existing metrics.

Quantitative comparison with baseline metrics. Bold font indicates the best, and underlined font indicates the second best. The table includes metrics based on CLIP with ViT-L as the backbone, except for Polos, which does not have a ViT-L backbone variant.

Metrics | TestA Desc.↑ | TestA Rel.↑ | TestA Flu.↑ | TestB Desc.↑ | TestB Rel.↑ | TestB Flu.↑ | Inference time [ms]↓
(Desc./Rel./Flu. columns report Kendall's τc; ↑ higher is better, ↓ lower is better.)

Image captioning metrics
  BLEU         | 28.6 | 2.4  | 25.5 | 32.0 | -10.1 | -3.5 | 0.46
  CIDEr        | -7.0 | 6.7  | 4.4  | 4.0  | -3.4  | 1.9  | 1.3
  CLIP-S       | 24.5 | 18.6 | 25.5 | 27.3 | 22.5  | 24.5 | 26
  CLIP-Savg    | -8.6 | 11.5 | 3.2  | 12.8 | 27.5  | 28.4 | 200
  RefCLIP-S    | 13.4 | 7.3  | 9.5  | 21.2 | 10.3  | 10.9 | 33
  PAC-S        | 24.8 | 14.7 | 23.6 | 27.6 | 25.7  | 23.0 | 48
  PAC-Savg     | -7.4 | 14.6 | 6.2  | 6.6  | 29.2  | 28.4 | 360
  RefPAC-S     | 22.6 | 19.1 | 24.9 | 40.7 | 29.2  | 27.9 | 52
  Polos        | 28.5 | 18.1 | 30.6 | 41.1 | 22.4  | 20.0 | 33
  DENEB        | 10.3 | 18.4 | 22.2 | 31.3 | 35.7  | 32.6 | 47
  PAC-S++      | 29.7 | 21.4 | 34.2 | 28.1 | 21.9  | 21.1 | 36
  PAC-S++avg   | -7.2 | 19.4 | 6.0  | 14.1 | 32.4  | 30.3 | 270
  RefPAC-S++   | 25.4 | 23.3 | 28.9 | 40.3 | 22.2  | 24.2 | 40

LLM-as-a-Judge
  FLEUR                | 17.3 | 2.6  | 0.5  | 12.6 | 10.6 | -3.1 | 1,300
  RefFLEUR             | 21.3 | 10.3 | 7.2  | 28.1 | 12.3 | 17.5 | 1,400
  G-VEval              | 28.3 | 22.5 | 18.2 | 38.1 | 22.2 | 19.2 | 1,800
  GPT4o w/o references | 54.1±1.0 | 36.8±6.3 | 20.9±1.0 | 43.6±2.0 | 37.3±3.4 | 25.2±1.0 | 1,900
  GPT4o w/ references  | 47.0±1.1 | 26.2±2.2 | 35.4±2.9 | 46.9±2.6 | 30.4±2.3 | 25.1±4.3 | 2,000

LLM-Hybrid-as-a-Judge
  VELA                 | 56.4±1.3 | 40.0±1.1 | 57.4±1.3 | 54.0±0.4 | 52.3±1.1 | 39.0±2.3 | 260

Human performance      | 56.1 | 46.6 | 24.5 | 48.9 | 52.6 | 24.4 | ---
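
The agreement numbers above are Kendall's τc correlations between metric scores and human judgments. As a sketch of how such a value can be computed for any metric, given paired lists of metric scores and human ratings (both hypothetical below), scipy exposes the τc variant directly:

from scipy.stats import kendalltau

# Hypothetical paired data: one metric score and one human rating per caption.
metric_scores = [0.82, 0.41, 0.67, 0.15, 0.93, 0.58]
human_ratings = [5, 2, 4, 1, 5, 3]

# Kendall's tau-c is suited to pairs whose two variables take different numbers
# of distinct values (continuous metric scores vs. discrete human ratings).
tau_c, p_value = kendalltau(metric_scores, human_ratings, variant="c")
print(f"Kendall's tau-c: {tau_c:.3f} (p = {p_value:.3f})")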

Qualitative Results

BibTeX

@inproceedings{matsuda2025vela,
  title={{VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions}},
  author={Matsuda, Kazuki and Wada, Yuiga and Hirano, Shinnosuke and Otsuki, Seitaro and Sugiura, Komei},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025}
}