Vision Transformers Need Better Token Interaction

arXiv cs.CV, May 25, 2026

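Vision Transformers (ViTs) show dense degradation during prolonged training. This study characterizes the mechanism as "semantic diffusion": global semantic information spreads into local patches beyond what is locally justified, so the degradation is not caused by high-norm artifacts alone. The analysis shows that dense representation quality is not captured by locality metrics. The finding has direct implications for optimizing ViTs for dense prediction tasks.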
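The two candidate explanations named above can be made concrete with two per-patch diagnostics: the norm of each patch token (for high-norm artifacts) and the cosine similarity between each patch token and a global image-level vector (a rough proxy for how much global content a patch carries). The sketch below is an illustration only and is not taken from the paper. The function names, the use of the mean patch token as the global vector, and the `norm_factor` threshold are all assumptions. Neither number is claimed to measure dense representation quality.

```python
import math
import statistics


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = l2_norm(u), l2_norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def patch_diagnostics(patch_tokens, norm_factor=3.0):
    """Compute two hypothetical per-patch diagnostics.

    patch_tokens: list of equal-length float lists, one per patch.
    norm_factor:  assumed threshold; a patch is flagged as high-norm
                  when its norm exceeds norm_factor * median norm.
    """
    norms = [l2_norm(t) for t in patch_tokens]
    median_norm = statistics.median(norms)

    # Assumption: the mean patch token stands in for the global
    # image-level representation (a CLS token could be used instead).
    dim = len(patch_tokens[0])
    count = len(patch_tokens)
    global_vec = [sum(t[d] for t in patch_tokens) / count for d in range(dim)]

    sims = [cosine(t, global_vec) for t in patch_tokens]
    high_norm = [i for i, n in enumerate(norms) if n > norm_factor * median_norm]

    return {
        "norms": norms,
        "high_norm": high_norm,
        "global_similarity": sims,
        "mean_global_similarity": sum(sims) / count,
    }


if __name__ == "__main__":
    # Toy 2-D tokens: three ordinary patches and one with a large norm.
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 0.0]]
    report = patch_diagnostics(tokens)
    print(report["high_norm"])
    print(report["global_similarity"])
```

Tracking both quantities across training checkpoints would show whether patch-to-global similarity rises even when few patches are flagged as high-norm, which is the kind of pattern the summary describes. A high similarity on its own is not evidence of semantic diffusion, since a uniform image would also produce it.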

arXiv:2605.23868v1 Announce Type: new

Abstract: Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high-norm artifacts alone. Instead, we characterize semantic diffusion: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality al