VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

arXiv cs.CV, May 25, 2026

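arXiv:2602.07801v4 proposes a new method for long-video understanding. Conventional uniform frame sampling often misses key visual evidence and is prone to hallucination, while existing "localize-clip-answer" pipelines, though an improvement, remain inefficient, localize weakly, and follow rigid workflows. The work aims to address these problems by improving both the capture of key evidence and the efficiency of reasoning.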
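
To make the contrast between the two sampling strategies concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names, the 25 fps frame rate, the one-hour video length, and the clip boundaries are all assumptions chosen for illustration, and the localization step (deciding which clip to look at) is taken as given rather than modeled.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly over the whole video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Use the midpoint of each equal-width bin.
    return [int(step * i + step / 2) for i in range(num_samples)]


def clip_frame_indices(start_s: float, end_s: float, fps: float,
                       num_samples: int, total_frames: int) -> list[int]:
    """Densely sample inside a localized clip [start_s, end_s] (seconds)."""
    first = max(0, int(start_s * fps))
    last = min(total_frames - 1, int(end_s * fps))
    if last < first or num_samples <= 0:
        return []
    span = last - first + 1
    num_samples = min(num_samples, span)
    step = span / num_samples
    return [first + int(step * i + step / 2) for i in range(num_samples)]


# Hypothetical setup: a one-hour video at 25 fps, with the relevant
# evidence assumed to lie between 1800 s and 1810 s.
FPS = 25.0
TOTAL_FRAMES = 90_000
CLIP_START_S, CLIP_END_S = 1800.0, 1810.0

uniform = uniform_frame_indices(TOTAL_FRAMES, 8)
dense = clip_frame_indices(CLIP_START_S, CLIP_END_S, FPS, 8, TOTAL_FRAMES)

clip_first = int(CLIP_START_S * FPS)
clip_last = int(CLIP_END_S * FPS)
uniform_hits = [i for i in uniform if clip_first <= i <= clip_last]

print("uniform samples:", uniform)
print("uniform samples inside the clip:", len(uniform_hits))
print("dense in-clip samples:", dense)
```

With the same budget of eight frames, the uniform sampler spaces its frames many minutes apart across the hour, whereas the clip sampler spends the whole budget inside the ten-second window. That is the intuition behind localizing first and sampling densely afterwards.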

arXiv:2602.07801v4. Abstract: In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these i