MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks

Guiyao Tie1, Xueyang Zhou1, Tianhe Gu1, Ruihang Zhang1, Chaoran Hu1, Sizhe Zhang1, Mengqu Sun2, Yan Zhang1, Pan Zhou1, Lichao Sun2
1Huazhong University of Science and Technology, 2Lehigh University

Figure 1: Comparison of Accuracy Across Different Models and Methods on the MMMR Dataset. The bar chart shows validation accuracy (%) for baselines (Heuristic & Expert), MLLMs without thinking (e.g., LLaMA-3.2-11B), and MLLMs with thinking (e.g., Gemini-2.5 Pro, Dual) across the six task types; MLLMs reach an average accuracy of 38.5%.

Introduction

Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. MMMR comprises (1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands, and (2) a modular Reasoning Trace Evaluation Pipeline (RTEP) that assesses reasoning quality beyond accuracy via metrics such as relevance and consistency, together with structured error annotations. Empirical results show that MLLMs-T overall outperform their non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.
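The RTEP itself is not reproduced on this page, but its trace-level metrics can be illustrated with a minimal sketch. The Python code below is a simplified, hypothetical stand-in: it scores each reasoning trace for relevance (how many thinking steps connect back to the question) and consistency (whether the final step supports the predicted answer) using plain lexical overlap, whereas the actual pipeline presumably relies on stronger judges and structured error annotations. All class names, field names, and heuristics here are illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass

@dataclass
class TraceSample:
    """One model output on an MMMR-style item (hypothetical schema)."""
    question: str
    trace_steps: list[str]   # intermediate "thinking" steps
    prediction: str          # final answer text
    gold: str                # ground-truth answer

def _tokens(text: str) -> set[str]:
    """Lowercased content tokens with surrounding punctuation stripped."""
    toks = (t.strip(".,:;!?()").lower() for t in text.split())
    return {t for t in toks if t}

def relevance(sample: TraceSample) -> float:
    """Fraction of trace steps sharing at least one token with the question."""
    q = _tokens(sample.question)
    if not sample.trace_steps:
        return 0.0
    hits = sum(1 for step in sample.trace_steps if _tokens(step) & q)
    return hits / len(sample.trace_steps)

def consistency(sample: TraceSample) -> float:
    """Crude check that the final step supports the predicted answer."""
    if not sample.trace_steps:
        return 0.0
    last = _tokens(sample.trace_steps[-1])
    pred = _tokens(sample.prediction)
    return len(last & pred) / max(len(pred), 1)

def evaluate(samples: list[TraceSample]) -> dict[str, float]:
    """Aggregate accuracy plus trace-quality scores over a set of samples."""
    n = len(samples)
    return {
        "accuracy": sum(s.prediction.strip().lower() == s.gold.strip().lower()
                        for s in samples) / n,
        "relevance": sum(relevance(s) for s in samples) / n,
        "consistency": sum(consistency(s) for s in samples) / n,
    }

if __name__ == "__main__":
    demo = TraceSample(
        question="If the museum is two blocks north of the station, which direction is the station from the museum?",
        trace_steps=["The museum lies north of the station.",
                     "Therefore the station lies south of the museum."],
        prediction="south",
        gold="south",
    )
    print(evaluate([demo]))
```

In the real pipeline the accuracy, relevance, and consistency scores would of course come from the benchmark's own judges; the point of the sketch is only the shape of a trace-aware evaluation loop that reports reasoning quality alongside final-answer accuracy.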

MMMR Dataset

MMMR is a comprehensive reasoning dataset for evaluating the multi-modal reasoning capabilities of MLLMs-T. It covers six task types: Logical Reasoning, Mathematical Reasoning, Spatio-Temporal Understanding, Coding, Map Navigation, and Scientific Reasoning, with subcategories such as deductive reasoning, algebraic problem-solving, temporal analysis, code generation, path planning, and scientific hypothesis testing. The dataset integrates diverse modalities, including textual descriptions, images, and structured data (e.g., tables and graphs), with a balanced distribution across difficulty levels, and is partitioned into validation and test sets to ensure a thorough and reproducible assessment of model performance.

Figure 2: Dataset Presentation of MMMR, showcasing representative examples of multi-modal reasoning tasks. Each example highlights interleaved inputs, illustrating the dataset's complexity and diversity in evaluating unified reasoning capabilities.
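The page does not spell out the dataset's release format, so the snippet below only sketches one plausible way to load and filter MMMR-style items, assuming a JSONL file in which each record carries a task type, split, difficulty, question text, image paths, and answer. The file name and every field name are illustrative assumptions rather than the official schema.

```python
import json
from collections import Counter
from pathlib import Path

# The six task types named in the dataset description above.
TASK_TYPES = {
    "logical_reasoning", "mathematical_reasoning", "spatio_temporal_understanding",
    "coding", "map_navigation", "scientific_reasoning",
}

def load_items(path: str, split: str = "validation"):
    """Yield records for one split from a JSONL file (hypothetical format)."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        # Illustrative fields: task_type, split, difficulty, question, images, answer.
        if item.get("split") == split and item.get("task_type") in TASK_TYPES:
            yield item

def summarize(path: str, split: str = "validation") -> Counter:
    """Count items per task type, mirroring the per-column counts in Table 1."""
    return Counter(item["task_type"] for item in load_items(path, split))

if __name__ == "__main__":
    data_file = "mmmr.jsonl"  # placeholder path, not an official filename
    if Path(data_file).exists():
        print(summarize(data_file, split="test"))
```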

Leaderboard

Table 1: Performance of Different Models on the MMMR Dataset (Accuracy %)
| Models | Validation (106) | Test (977) | Logic (182) | Math (212) | Space-Time (200) | Code (141) | Map (150) | Science (198) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | | | | | | | | |
| Random Choice | 22.10 | 23.62 | 24.18 | 24.06 | 21.50 | 25.53 | 22.67 | 23.74 |
| Frequent Choice | 26.80 | 26.58 | 26.92 | 26.42 | 24.00 | 24.82 | 25.33 | 29.80 |
| Expert (Human only) | 29.23 | - | - | - | - | - | - | - |
| Expert (Human + GPT-4o) | 52.85 | - | - | - | - | - | - | - |
| MLLMs without thinking | | | | | | | | |
| LLaMA-3.2-11B-Vision-Instruct | 24.53 | 23.92 | 18.68 | 31.13 | 28.00 | 13.48 | 22.67 | 22.73 |
| LLaMA-3.2-90B-Vision-Instruct | 30.19 | 27.65 | 21.43 | 34.91 | 35.00 | 17.73 | 25.33 | 21.72 |
| Qwen2.5-VL-32B-Instruct | 34.86 | 34.90 | 25.27 | 45.28 | 45.00 | 32.62 | 36.67 | 21.72 |
| Qwen2.5-VL-72B-Instruct | 36.95 | 37.18 | 24.18 | 46.70 | 47.50 | 41.84 | 42.67 | 31.31 |
| Qwen-VL-max | 35.13 | 35.55 | 24.18 | 47.17 | 46.00 | 39.01 | 35.33 | 28.28 |
| Gemma-3-27B-IT | 30.87 | 29.01 | 22.53 | 42.45 | 33.50 | 34.75 | 26.67 | 27.27 |
| Gemini-1.5 Flash | 32.18 | 29.61 | 28.57 | 37.74 | 37.00 | 18.44 | 24.67 | 32.83 |
| GPT-4 Vision | 37.59 | 38.05 | 28.02 | 35.85 | 49.00 | 28.37 | 32.00 | 41.92 |
| LLaMA-4-Maverick | 40.68 | 41.82 | 30.77 | 44.81 | 46.00 | 37.59 | 30.67 | 38.38 |
| MLLMs with thinking | | | | | | | | |
| QVQ-72B-Preview | 30.94 | 32.09 | 26.37 | 38.21 | 42.00 | 32.62 | 31.33 | 32.83 |
| Gemini-2.0 Flash | 37.63 | 37.89 | 35.16 | 50.47 | 49.50 | 28.37 | 30.67 | 41.41 |
| Gemini-2.5 Pro | 42.45 | 42.36 | 39.56 | 41.51 | 44.50 | 36.17 | 37.33 | 46.46 |
| Claude-3.7-Sonnet | 38.28 | 37.72 | 35.71 | 45.75 | 51.00 | 21.28 | 34.00 | 43.94 |
| o4-mini | 38.64 | 37.58 | 34.62 | 46.23 | 47.50 | 19.86 | 29.33 | 41.41 |
| Dual (GPT-4V + DeepSeek-R1) | 41.26 | 41.00 | 35.71 | 47.64 | 48.00 | 22.70 | 22.67 | 45.45 |
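A useful sanity check when reading Table 1 is to distinguish the macro average (unweighted mean of the six per-task accuracies) from the micro average (weighted by the question counts in the column headers). The helper below computes both for any row; note that the per-task counts sum to 1,083 (182 + 212 + 200 + 141 + 150 + 198), i.e. they appear to cover validation and test combined, so neither average is expected to reproduce the Test column exactly. The function and variable names are ours, not part of the benchmark's tooling.

```python
# Question counts taken from the Table 1 column headers.
TASK_COUNTS = {
    "Logic": 182, "Math": 212, "Space-Time": 200,
    "Code": 141, "Map": 150, "Science": 198,
}

def macro_micro(per_task_acc: dict[str, float]) -> tuple[float, float]:
    """Return (macro, micro) averages of per-task accuracies, in percent."""
    macro = sum(per_task_acc.values()) / len(per_task_acc)
    total = sum(TASK_COUNTS[t] for t in per_task_acc)
    micro = sum(per_task_acc[t] * TASK_COUNTS[t] for t in per_task_acc) / total
    return macro, micro

if __name__ == "__main__":
    # Per-task accuracies for the Gemini-2.5 Pro row of Table 1.
    gemini_25_pro = {
        "Logic": 39.56, "Math": 41.51, "Space-Time": 44.50,
        "Code": 36.17, "Map": 37.33, "Science": 46.46,
    }
    macro, micro = macro_micro(gemini_25_pro)
    print(f"macro={macro:.2f}%, micro={micro:.2f}%")
```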

Statistics


Examples

BibTeX

@inproceedings{tie2025mmmr,
  title={MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks},
  author={Guiyao Tie and Xueyang Zhou and Tianhe Gu and Ruihang Zhang and Chaoran Hu and Sizhe Zhang and Mengqu Sun and Yan Zhang and Pan Zhou and Lichao Sun},
  booktitle={Proceedings of the NeurIPS 2025 Datasets and Benchmarks Track},
  year={2025},
}