4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

arXiv cs.CV, May 25, 2026

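4DThinker is the first framework that lets vision-language models (VLMs) reason over dynamic 4D space, addressing the challenge of spatio-temporal reasoning in monocular video understanding. Rather than relying on external geometric modules or verbose text descriptions, it directly strengthens the model's intrinsic 4D spatio-temporal reasoning ability, improving both the accuracy and the efficiency with which physical-world dynamics are understood from video.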

arXiv:2605.05997v2 (announce type: replace)

Abstract: Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic