PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

arXiv cs.CV, May 25, 2026

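To address the limited fine-grained understanding of multimodal large language models (MLLMs), the authors propose the Procedurally Generated Tasks (PGT) framework. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision signals. These signals both induce fine-grained visual understanding and serve as a low-cost diagnostic tool that separates visual grounding capability from deficits in semantic understanding. The method is simple and effective, and it can identify the root cause of perception failures.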
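The overlay idea can be illustrated with a toy sketch. It is not the authors' implementation: the function name `make_pgt_sample`, the color palette, the square primitive, and the quadrant question are all hypothetical choices made for illustration, and the image is a plain nested list of RGB tuples rather than a real image format. It only shows how drawing a known shape at a known position yields a question whose answer is exact by construction.

```python
import random

# Hypothetical palette; the summary does not specify which colors or shapes PGT uses.
COLORS = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}


def make_pgt_sample(image, rng):
    """Overlay one filled square on a copy of `image` and build a grounding question.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    Returns a dict with the new image, the question, the answer, and the box.
    """
    height, width = len(image), len(image[0])
    side = max(2, min(height, width) // 4)

    # Sample the position and color of the primitive.
    top = rng.randrange(0, height - side + 1)
    left = rng.randrange(0, width - side + 1)
    color_name, rgb = rng.choice(sorted(COLORS.items()))

    # Draw on a copy so the source image stays untouched.
    overlaid = [row[:] for row in image]
    for y in range(top, top + side):
        for x in range(left, left + side):
            overlaid[y][x] = rgb

    # The label follows directly from the sampled geometry, so it is unambiguous.
    center_y, center_x = top + side / 2, left + side / 2
    vertical = "top" if center_y < height / 2 else "bottom"
    horizontal = "left" if center_x < width / 2 else "right"

    return {
        "image": overlaid,
        "question": f"In which quadrant is the center of the {color_name} square?",
        "answer": f"{vertical}-{horizontal}",
        "box": (left, top, left + side, top + side),  # (x0, y0, x1, y1), exclusive end
        "color": rgb,
    }


if __name__ == "__main__":
    gray = [[(128, 128, 128)] * 32 for _ in range(32)]
    sample = make_pgt_sample(gray, random.Random(0))
    print(sample["question"], "->", sample["answer"])
```

Because the answer depends only on the sampled geometry and not on what the underlying image depicts, an error on such a question points at localization rather than at semantic knowledge, which is the diagnostic use the summary describes.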

arXiv:2605.23883v1 Announce Type: new Abstract: Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic p