Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

arXiv cs.CV, May 25, 2026

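This paper proposes a framework that equips multi-modal large language models (MLLMs) with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. The authors design a new data pipeline and collect the MultiSPA dataset of more than 27 million samples, which significantly improves model performance on multi-frame reasoning tasks in the physical world.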

arXiv:2505.17015v2 (announce type: replace)

Abstract: Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples sp