MMSearch

Benchmarking the potential of large models as multi-modal search engines

1CUHK MMLab, 2ByteDance, 3CUHK MiuLar Lab, 4Shanghai AI Laboratory,
5Peking University, 6Stanford University, 7SenseTime Research

Introduction

The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question.

To this end, we first design a carefully constructed pipeline, MMSearch-Engine, to empower any LMM with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, with no overlap with current LMMs' training data, ensuring that the correct answers can only be obtained through search. Using MMSearch-Engine, LMMs are evaluated on three individual tasks (requery, rerank, and summarization) and one challenging end-to-end task covering the complete search process.
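To make the three-stage flow concrete, here is a minimal Python sketch of such a search loop. This is an illustrative assumption, not the released MMSearch-Engine code: `lmm`, `search`, and `fetch` stand for any LMM inference wrapper, web search API, and page fetcher, respectively.

```python
# Illustrative sketch of a requery -> rerank -> summarize loop.
# Assumed callables (NOT the released MMSearch-Engine API):
#   lmm(prompt, image=None) -> str          # any LMM inference wrapper
#   search(query, top_k)    -> list[dict]   # results with "title"/"snippet"/"url"
#   fetch(url)              -> str          # full page text

def multimodal_search(lmm, search, fetch, question, image=None, top_k=8):
    # Stage 1 -- requery: rewrite the multimodal query into a plain-text
    # query suitable for a conventional search engine.
    requery = lmm(f"Rewrite this question as a web search query: {question}", image)

    # Retrieve candidate websites for the rewritten query.
    results = search(requery, top_k)

    # Stage 2 -- rerank: ask the LMM to pick the most relevant result.
    listing = "\n".join(
        f"[{i}] {r['title']}: {r['snippet']}" for i, r in enumerate(results)
    )
    idx = int(lmm(
        f"Question: {question}\n"
        f"Reply with only the index of the most relevant result:\n{listing}"
    ).strip("[] \n"))

    # Stage 3 -- summarize: answer from the content of the chosen page.
    page_text = fetch(results[idx]["url"])
    return lmm(
        f"Question: {question}\nWebpage content:\n{page_text}\nAnswer concisely:",
        image,
    )
```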

We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, surpassing the commercial product Perplexity Pro on the end-to-end task and demonstrating the effectiveness of our proposed pipeline. We further present an error analysis revealing that current LMMs still struggle to fully grasp multimodal search tasks, and conduct an ablation study indicating the potential of scaling test-time computation for AI search engines. We hope MMSearch can provide unique insights to guide the future development of multimodal AI search engines.
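As a generic illustration of scaling test-time computation (not necessarily the exact setup of the paper's ablation), one simple recipe is best-of-N with self-consistency voting: run the full search pipeline several times with sampling enabled and keep the most frequent answer. `run_search` below stands for any complete pipeline, such as the sketch above.

```python
from collections import Counter

def scaled_answer(run_search, question, n=5):
    """Generic test-time scaling via self-consistency: run the full
    search pipeline n times and return the most frequent answer.
    Illustrative only -- the paper's ablation may differ."""
    answers = [run_search(question) for _ in range(n)]
    return Counter(a.strip().lower() for a in answers).most_common(1)[0][0]
```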

Leaderboard

Final scores and per-task scores on MMSearch.

# Model Date All (Avg/News/Know.) End-to-end (Avg/News/Know.) Requery (Avg/News/Know.) Rerank (Avg/News/Know.) Summarize (Avg/News/Know.)
1 GPT-4o 🥇 2024-05 62.3 61.2 65.3 60.4 59.0 64.5 46.8 49.9 38.0 83.0 82.4 84.8 63.1 62.2 65.6
2 GPT-4V 🥈 2023-09 55.0 55.0 55.3 52.1 52.2 51.9 45.7 49.2 35.8 79.3 76.9 86.1 57.4 56.7 59.4
3 Claude 3.5 Sonnet 2024-06 53.5 53.1 54.7 49.9 49.3 51.6 42.0 43.6 37.7 80.2 78.7 84.2 59.4 60.3 57.0
4 Qwen2-VL-AnyRes-72B 2024-08 52.7 52.0 54.5 49.1 48.8 49.8 44.7 47.2 37.6 76.7 72.9 87.3 59.6 57.5 65.7
5 LLaVA-OneVision-72B 2024-08 50.1 50.1 50.2 44.9 45.1 44.1 42.9 45.9 34.3 82.2 80.5 86.7 61.4 59.1 67.7
6 Qwen2-VL-AnyRes 2024-08 45.3 44.1 48.7 40.3 39.2 43.5 39.0 41.9 30.8 76.7 73.8 84.8 54.7 52.7 60.4
7 LLaVA-OneVision 2024-08 36.6 39.4 28.9 29.6 33.1 19.7 35.8 40.3 23.2 72.8 73.5 70.9 53.5 51.8 58.5
8 Idefics3 2024-08 36.2 38.5 29.6 29.3 32.3 20.8 31.0 37.3 13.5 76.5 73.3 85.4 50.3 51.1 48.1
9 Idefics3-AnyRes 2024-08 35.7 38.2 28.7 30.1 32.9 22.3 27.2 32.2 13.2 72.7 73.1 71.5 45.2 46.3 42.1
10 InternVL2 2024-07 34.3 35.7 30.2 30.9 32.5 26.2 32.3 36.1 21.6 46.5 49.3 38.6 48.5 45.9 55.8
11 InternVL2-AnyRes 2024-07 34.1 34.2 33.7 30.0 30.3 29.0 31.4 35.5 19.7 53.2 52.3 55.7 46.9 44.4 53.7
12 mPlug-Owl3-AnyRes 2024-08 33.9 35.5 29.3 27.3 29.4 21.2 31.8 36.1 19.9 74.5 72.9 79.1 43.9 43.6 44.6
13 mPlug-Owl3 2024-08 32.1 34.8 24.4 24.6 28.1 14.9 32.6 36.7 21.2 74.3 73.5 76.6 45.6 45.6 45.4
14 LLaVA-NeXT-Interleave 2024-06 28.3 29.2 25.6 23.0 23.8 20.5 26.2 30.7 13.5 55.3 58.6 46.2 42.5 40.0 49.3
15 InternLM-XC2.5-AnyRes 2024-07 22.3 23.9 17.5 23.2 25.4 16.9 21.7 19.8 26.9 0.0 0.0 0.0 37.7 38.6 35.1
16 InternLM-XC2.5 2024-07 22.2 22.8 20.5 22.9 23.6 20.8 25.0 24.3 27.0 0.0 0.0 0.0 37.7 38.6 35.0
17 Mantis 2024-05 18.7 19.8 15.9 15.8 16.4 14.3 20.1 24.6 7.4 39.7 41.0 36.1 19.2 22.0 11.5
- Human - 69.2 69.6 68.1 68.2 68.6 67.1 43.7 45.0 40.1 85.7 87.3 81.2 72.8 71.4 76.7
All open-source LMMs are 7B unless otherwise specified.
Human performance is the average performance of the annotators, who are college students.

MMSearch Dataset

Overview

MMSearch is a multimodal search benchmark crafted to evaluate the potential of LMMs to function as multimodal AI search engines. It encompasses a meticulously collected dataset of 300 queries spanning 14 subfields. To ensure the challenging nature of the benchmark, we categorize the search queries into two primary areas: News and Knowledge. The News area covers the latest news at the time of data collection (August 2024), which guarantees that the answers to these queries are not present in the training data of LMMs. For the Knowledge area, we collect queries requiring rare knowledge and keep only those that current SoTA LMMs, such as GPT-4o and Claude 3.5 Sonnet, cannot answer.

You can download the dataset from Hugging Face Datasets.
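For reference, loading the benchmark with the `datasets` library might look like the following. The repository id is our assumption of the hosted name, and the field layout is not spelled out here, so verify both against the actual Hugging Face dataset page.

```python
from datasets import load_dataset

# Repository id is an assumption -- verify it on the Hugging Face page.
data = load_dataset("CaraJ/MMSearch")

# Inspect the available splits, then peek at one example (fields such as
# the query, its area, and the ground-truth answer; exact names may differ).
print(data)
first_split = next(iter(data))
print(data[first_split][0])
```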


Key statistics of MMSearch.


Category distribution of MMSearch.

Evaluation Strategy and Data Example

Overview of the evaluation strategy.

Example input, LMM output, and ground truth for the four evaluation tasks.

Experiment Results

Error Analysis

Visualization Examples