Visually-Guided Policy Optimization for Multimodal Reasoning

arXiv cs.CV, May 25, 2026

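Reinforcement learning with verifiable rewards (RLVR) has significantly improved the reasoning ability of vision-language models (VLMs). However, the inherently text-dominated nature of VLMs leads to sparse attention on visual input, and visual forgetting grows worse as reasoning proceeds step by step. To address this, the authors propose Visually-Guided Policy Optimization (VGPO), which compensates for the deficiency by reinforcing attention to visual information. The method improves visual faithfulness and offers a new approach to complex visual reasoning tasks.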

arXiv:2604.09349v2

Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual fo