MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks

Guiyao Tie1, Xueyang Zhou1, Tianhe Gu1, Ruihang Zhang1, Chaoran Hu1, Sizhe Zhang1, Mengqu Sun2, Yan Zhang1, Pan Zhou1, Lichao Sun2
1Huazhong University of Science and Technology, 2Lehigh University

Figure 1: Comparison of Accuracy Across Different Models and Methods on the MMMR Dataset. The bar chart shows validation accuracy (%) for baselines (Heuristic & Expert), MLLMs without thinking (e.g., LLaMA-3.2-11B), and MLLMs with thinking (e.g., Gemini-2.5 Pro, Dual) across the six task types; MLLMs reach an average accuracy of 38.5%.

Introduction

Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. MMMR comprises (1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands, and (2) a modular Reasoning Trace Evaluation Pipeline (RTEP) that assesses reasoning quality beyond accuracy via metrics such as relevance and consistency, together with structured error annotations. Empirical results show that MLLMs-T overall outperform their non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.
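The RTEP itself is not reproduced on this page, but its trace-level metrics can be illustrated with a minimal sketch. The Python code below is a simplified, hypothetical stand-in: it scores each reasoning trace for relevance (how many thinking steps connect back to the question) and consistency (whether the final step supports the predicted answer) using plain lexical overlap, whereas the actual pipeline presumably relies on stronger judges and structured error annotations. All class names, field names, and heuristics here are illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass

@dataclass
class TraceSample:
    """One model output on an MMMR-style item (hypothetical schema)."""
    question: str
    trace_steps: list[str]   # intermediate "thinking" steps
    prediction: str          # final answer text
    gold: str                # ground-truth answer

def _tokens(text: str) -> set[str]:
    """Lowercased content tokens with surrounding punctuation stripped."""
    toks = (t.strip(".,:;!?()").lower() for t in text.split())
    return {t for t in toks if t}

def relevance(sample: TraceSample) -> float:
    """Fraction of trace steps sharing at least one token with the question."""
    q = _tokens(sample.question)
    if not sample.trace_steps:
        return 0.0
    hits = sum(1 for step in sample.trace_steps if _tokens(step) & q)
    return hits / len(sample.trace_steps)

def consistency(sample: TraceSample) -> float:
    """Crude check that the final step supports the predicted answer."""
    if not sample.trace_steps:
        return 0.0
    last = _tokens(sample.trace_steps[-1])
    pred = _tokens(sample.prediction)
    return len(last & pred) / max(len(pred), 1)

def evaluate(samples: list[TraceSample]) -> dict[str, float]:
    """Aggregate accuracy plus trace-quality scores over a set of samples."""
    n = len(samples)
    return {
        "accuracy": sum(s.prediction.strip().lower() == s.gold.strip().lower()
                        for s in samples) / n,
        "relevance": sum(relevance(s) for s in samples) / n,
        "consistency": sum(consistency(s) for s in samples) / n,
    }

if __name__ == "__main__":
    demo = TraceSample(
        question="If the museum is two blocks north of the station, which direction is the station from the museum?",
        trace_steps=["The museum lies north of the station.",
                     "Therefore the station lies south of the museum."],
        prediction="south",
        gold="south",
    )
    print(evaluate([demo]))
```

In the real pipeline the accuracy, relevance, and consistency scores would of course come from the benchmark's own judges; the point of the sketch is only the shape of a trace-aware evaluation loop that reports reasoning quality alongside final-answer accuracy.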

MMMR Dataset

MMMR is a comprehensive reasoning dataset for evaluating the multi-modal reasoning capabilities of MLLMs-T. It covers six task types: Logical Reasoning, Mathematical Reasoning, Spatio-Temporal Understanding, Coding, Map Navigation, and Scientific Reasoning, with subcategories such as deductive reasoning, algebraic problem-solving, temporal analysis, code generation, path planning, and scientific hypothesis testing. The dataset integrates diverse modalities, including textual descriptions, images, and structured data (e.g., tables and graphs), with a balanced distribution across difficulty levels, and is partitioned into validation and test sets to ensure a thorough and reproducible assessment of model performance.

Figure 2: Dataset Presentation of MMMR, showcasing representative examples of multi-modal reasoning tasks. Each example highlights interleaved inputs, illustrating the dataset's complexity and diversity in evaluating unified reasoning capabilities.
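The page does not spell out the dataset's release format, so the snippet below only sketches one plausible way to load and filter MMMR-style items, assuming a JSONL file in which each record carries a task type, split, difficulty, question text, image paths, and answer. The file name and every field name are illustrative assumptions rather than the official schema.

```python
import json
from collections import Counter
from pathlib import Path

# The six task types named in the dataset description above.
TASK_TYPES = {
    "logical_reasoning", "mathematical_reasoning", "spatio_temporal_understanding",
    "coding", "map_navigation", "scientific_reasoning",
}

def load_items(path: str, split: str = "validation"):
    """Yield records for one split from a JSONL file (hypothetical format)."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        # Illustrative fields: task_type, split, difficulty, question, images, answer.
        if item.get("split") == split and item.get("task_type") in TASK_TYPES:
            yield item

def summarize(path: str, split: str = "validation") -> Counter:
    """Count items per task type, mirroring the per-column counts in Table 1."""
    return Counter(item["task_type"] for item in load_items(path, split))

if __name__ == "__main__":
    data_file = "mmmr.jsonl"  # placeholder path, not an official filename
    if Path(data_file).exists():
        print(summarize(data_file, split="test"))
```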

Leaderboard

Table 1: Performance of Different Models on the MMMR Dataset (Accuracy %)
| Models | Validation (106) | Test (977) | Logic (182) | Math (212) | Space-Time (200) | Code (141) | Map (150) | Science (198) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | | | | | | | | |
| Random Choice | 22.10 | 23.62 | 24.18 | 24.06 | 21.50 | 25.53 | 22.67 | 23.74 |
| Frequent Choice | 26.80 | 26.58 | 26.92 | 26.42 | 24.00 | 24.82 | 25.33 | 29.80 |
| Expert (Human only) | 29.23 | - | - | - | - | - | - | - |
| Expert (Human + GPT-4o) | 52.85 | - | - | - | - | - | - | - |
| MLLMs without thinking | | | | | | | | |
| LLaMA-3.2-11B-Vision-Instruct | 24.53 | 23.92 | 18.68 | 31.13 | 28.00 | 13.48 | 22.67 | 22.73 |
| LLaMA-3.2-90B-Vision-Instruct | 30.19 | 27.65 | 21.43 | 34.91 | 35.00 | 17.73 | 25.33 | 21.72 |
| Qwen2.5-VL-32B-Instruct | 34.86 | 34.90 | 25.27 | 45.28 | 45.00 | 32.62 | 36.67 | 21.72 |
| Qwen2.5-VL-72B-Instruct | 36.95 | 37.18 | 24.18 | 46.70 | 47.50 | 41.84 | 42.67 | 31.31 |
| Qwen-VL-max | 35.13 | 35.55 | 24.18 | 47.17 | 46.00 | 39.01 | 35.33 | 28.28 |
| Gemma-3-27B-IT | 30.87 | 29.01 | 22.53 | 42.45 | 33.50 | 34.75 | 26.67 | 27.27 |
| Gemini-1.5 Flash | 32.18 | 29.61 | 28.57 | 37.74 | 37.00 | 18.44 | 24.67 | 32.83 |
| GPT-4 Vision | 37.59 | 38.05 | 28.02 | 35.85 | 49.00 | 28.37 | 32.00 | 41.92 |
| LLaMA-4-Maverick | 40.68 | 41.82 | 30.77 | 44.81 | 46.00 | 37.59 | 30.67 | 38.38 |
| MLLMs with thinking | | | | | | | | |
| QVQ-72B-Preview | 30.94 | 32.09 | 26.37 | 38.21 | 42.00 | 32.62 | 31.33 | 32.83 |
| Gemini-2.0 Flash | 37.63 | 37.89 | 35.16 | 50.47 | 49.50 | 28.37 | 30.67 | 41.41 |
| Gemini-2.5 Pro | 42.45 | 42.36 | 39.56 | 41.51 | 44.50 | 36.17 | 37.33 | 46.46 |
| Claude-3.7-Sonnet | 38.28 | 37.72 | 35.71 | 45.75 | 51.00 | 21.28 | 34.00 | 43.94 |
| o4-mini | 38.64 | 37.58 | 34.62 | 46.23 | 47.50 | 19.86 | 29.33 | 41.41 |
| Dual (GPT-4V + DeepSeek-R1) | 41.26 | 41.00 | 35.71 | 47.64 | 48.00 | 22.70 | 22.67 | 45.45 |
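A useful sanity check when reading Table 1 is to distinguish the macro average (unweighted mean of the six per-task accuracies) from the micro average (weighted by the question counts in the column headers). The helper below computes both for any row; note that the per-task counts sum to 1,083 (182 + 212 + 200 + 141 + 150 + 198), i.e. they appear to cover validation and test combined, so neither average is expected to reproduce the Test column exactly. The function and variable names are ours, not part of the benchmark's tooling.

```python
# Question counts taken from the Table 1 column headers.
TASK_COUNTS = {
    "Logic": 182, "Math": 212, "Space-Time": 200,
    "Code": 141, "Map": 150, "Science": 198,
}

def macro_micro(per_task_acc: dict[str, float]) -> tuple[float, float]:
    """Return (macro, micro) averages of per-task accuracies, in percent."""
    macro = sum(per_task_acc.values()) / len(per_task_acc)
    total = sum(TASK_COUNTS[t] for t in per_task_acc)
    micro = sum(per_task_acc[t] * TASK_COUNTS[t] for t in per_task_acc) / total
    return macro, micro

if __name__ == "__main__":
    # Per-task accuracies for the Gemini-2.5 Pro row of Table 1.
    gemini_25_pro = {
        "Logic": 39.56, "Math": 41.51, "Space-Time": 44.50,
        "Code": 36.17, "Map": 37.33, "Science": 46.46,
    }
    macro, micro = macro_micro(gemini_25_pro)
    print(f"macro={macro:.2f}%, micro={micro:.2f}%")
```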

Statistics


Examples

BibTeX

@inproceedings{tie2025mmmr,
  title={MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks},
  author={Guiyao Tie and Xueyang Zhou and Tianhe Gu and Ruihang Zhang and Chaoran Hu and Sizhe Zhang and Mengqu Sun and Yan Zhang and Pan Zhou and Lichao Sun},
  booktitle={Proceedings of the NeurIPS 2025 Datasets and Benchmarks Track},
  year={2025},
}