ETCHR: Editing To Clarify and Harness Reasoning
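Multimodal large language models have made progress in visual reasoning, but a purely textual chain of thought is a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, yet existing methods are either limited by fixed toolkits or produce noisy intermediate images. This paper proposes a new approach that decouples a dedicated image editing model from the understanding model, while noting that off-the-shelf image editors are not up to this reasoning task. The method is expected to improve the accuracy and flexibility of complex visual reasoning.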
arXiv:2605.23897v1 Announce Type: new Abstract: Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed, predefined toolkits or, in unified multimodal methods, produce noisy intermediate images. We pursue a third option: using a dedicated image editing model and decoupling it from an understanding model. However, off-the-shelf image editors fail as reas