Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

Padlewski, Piotr; Bain, Max; Henderson, Matthew; Zhu, Zhongkai; Relan, Nishant; Pham, Hai; Ong, Donovan; Aleksiev, Kaloyan; Ormazabal, Aitor; Phua, Samuel; Yeo, Ethan; Lamprecht, Eugenie; Liu, Qi; Wang, Yuqi; Chen, Eric; Fu, Deyu; Li, Lei; Zheng, Che; d'Autume, Cyprien de Masson; Yogatama, Dani; Artetxe, Mikel; Tay, Yi

Computer Science > Computation and Language

arXiv:2405.02287 (cs)

[Submitted on 3 May 2024]

Title: Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

Abstract: We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging, with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, more than 50% of the questions in our hard set are answered incorrectly by all frontier models. We explore the nuances of designing, evaluating, and ranking models on ultra-challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core correlates roughly with human judgment. We offer free API access for the purpose of lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on Vibe-Eval's automatic scores. We release the evaluation code and data; see this https URL
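The abstract's claim that automatic scores roughly track human judgment can be made concrete with a rank correlation. The sketch below is not the paper's evaluation code: it computes a Spearman rank correlation between two lists of per-model scores using only the standard library. The function names (`ranks`, `pearson`, `spearman`) and the example scores are hypothetical placeholders chosen for illustration, not results from Vibe-Eval.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores for five models (NOT data from the paper).
automatic_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
human_scores = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman(automatic_scores, human_scores)
```

A value of `rho` near 1 would mean the automatic evaluator ranks models in nearly the same order as human raters. The sketch assumes each score list has some variation; a constant list would cause a division by zero in `pearson`.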

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.02287 [cs.CL]
	(or arXiv:2405.02287v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.02287

Submission history

From: Max Bain
[v1] Fri, 3 May 2024 17:59:55 UTC (2,142 KB)
