PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
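To address the limited fine-grained understanding of multimodal large language models, this work proposes the Procedurally Generated Tasks (PGT) framework. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision signals. These signals induce fine-grained visual understanding and also act as a low-cost diagnostic tool that separates visual grounding ability from deficits in semantic understanding. The method is simple and effective, and it can identify the root cause of perception failures.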
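The overlay idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `make_pgt_sample`, the colour palette, the choice of a filled square, and the quadrant question are all hypothetical, chosen only to show how a drawn primitive yields a question whose answer is known by construction.

```python
import random

# Hypothetical palette; the source does not list the primitives or colours it uses.
COLORS = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}


def make_pgt_sample(image, rng):
    """Overlay one filled square on a copy of `image`.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    Returns (new_image, question, answer); the answer follows directly
    from where the square was drawn, so no human labelling is needed.
    """
    height, width = len(image), len(image[0])
    shorter = min(height, width)
    # Square side between 1/8 and 1/4 of the shorter image dimension (assumed range).
    side = rng.randint(max(1, shorter // 8), max(1, shorter // 4))
    top = rng.randint(0, height - side)
    left = rng.randint(0, width - side)
    color_name = rng.choice(sorted(COLORS))

    # Copy the rows so the input image is left untouched.
    out = [list(row) for row in image]
    for y in range(top, top + side):
        for x in range(left, left + side):
            out[y][x] = COLORS[color_name]

    # Ground truth: the quadrant containing the centre of the square.
    center_y, center_x = top + side / 2, left + side / 2
    vertical = "top" if center_y < height / 2 else "bottom"
    horizontal = "left" if center_x < width / 2 else "right"
    question = f"In which quadrant of the image is the centre of the {color_name} square?"
    answer = f"{vertical}-{horizontal}"
    return out, question, answer


# Usage: a uniform grey 64x64 image stands in for a real photograph.
rng = random.Random(0)
background = [[(128, 128, 128)] * 64 for _ in range(64)]
overlaid, question, answer = make_pgt_sample(background, rng)
print(question, "->", answer)
```

Because the answer depends only on the drawn geometry and not on the image content, a wrong answer to such a question points to a grounding failure rather than a semantic one, which is the diagnostic use the summary describes.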
arXiv:2605.23883v1 Announce Type: new Abstract: Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic p