DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
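Large vision-language models (VLMs) offer strong spatial grounding but are costly to run, while compact VLMs are efficient but suffer degraded localization. DocVAL is a framework that uses validated chain-of-thought distillation to transfer the spatial grounding ability of large models to compact ones, significantly improving localization accuracy in document VQA without adding inference overhead.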
arXiv:2511.22521v3 Announce Type: replace Abstract: Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfer