DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

arXiv cs.CV, May 25, 2026

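Large vision-language models (VLMs) offer strong spatial grounding but are costly, while compact VLMs are efficient but suffer degraded localization. DocVAL addresses this gap with validated chain-of-thought distillation: it transfers the spatial grounding ability of large models to compact ones, significantly improving localization accuracy in document VQA without adding inference overhead.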

arXiv:2511.22521v3 (announce type: replace)

Abstract: Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfer