MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Run Luo; Haonan Zhang; Longze Chen; Ting-En Lin; Xiong Liu; Yuchuan Wu; Min Yang; Yongbin Li; Minzheng Wang; Pengpeng Zeng; Lianli Gao; Heng Tao Shen; Yunshui Li; Hamid Alinejad-Rokny; Xiaobo Xia; Jingkuan Song; Fei Huang

doi:10.18653/v1/2025.findings-acl.1009

Abstract

The development of Multimodal Large Language Models (MLLMs) has seen significant progress, driven by increasing demands across various fields (e.g., multimodal agents, embodied intelligence). While model-driven approaches aim to enhance MLLM capabilities through diverse architectures, their performance gains have become increasingly marginal. In contrast, data-driven methods, which scale up image-text instruction datasets, have proven more effective but face challenges related to limited data diversity and complexity. The absence of high-quality instruction data remains a major bottleneck in MLLM development. To address this issue, we propose MMEvol, a novel multimodal instruction data evolution framework. This framework iteratively enhances data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution, generating a more complex and diverse image-text instruction dataset that significantly improves MLLM capabilities. Starting with an initial dataset, SEED-163K, we employ MMEvol to systematically expand instruction diversity, extend visual reasoning steps to improve cognitive abilities, and extract fine-grained visual details to enhance understanding and robustness. To rigorously evaluate our approach, we conduct extensive qualitative analysis and quantitative experiments across 13 vision-language tasks. Compared to baseline models trained on the original seed dataset, our method achieves an average accuracy improvement of 3.1 percentage points. Moreover, our approach attains state-of-the-art (SOTA) performance in nine tasks while using significantly less data than existing state-of-the-art models.

Anthology ID: 2025.findings-acl.1009
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19655–19682
URL: https://aclanthology.org/2025.findings-acl.1009/
DOI: 10.18653/v1/2025.findings-acl.1009
Cite (ACL): Run Luo, Haonan Zhang, Longze Chen, Ting-En Lin, Xiong Liu, Yuchuan Wu, Min Yang, Yongbin Li, Minzheng Wang, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, Yunshui Li, Hamid Alinejad-Rokny, Xiaobo Xia, Jingkuan Song, and Fei Huang. 2025. MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19655–19682, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct (Luo et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.1009.pdf