Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
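This paper proposes a framework that equips multi-modal large language models with multi-frame spatial understanding by integrating fundamental spatial skills: depth perception, visual correspondence, and dynamic perception. The authors design a new data pipeline and collect the MultiSPA dataset of more than 27 million samples, which significantly improves model performance on multi-frame reasoning tasks in the physical world.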
arXiv:2505.17015v2 (Announce Type: replace)

Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples sp