Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
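Summary: Visual geometry transformers perform well on multi-view 3D reconstruction, but their global attention makes computational cost grow quadratically with the input sequence length, which limits scalability and efficiency. This work proposes a simple, general strategy: restrict the number of key/value tokens that each query interacts with during global attention. According to the summary, this lowers complexity and improves scalability and efficiency.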
arXiv:2605.23892v1. Abstract: Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achiev
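
To make the idea of restricting key/value tokens concrete, here is a minimal, dependency-free sketch in which each query attends only to its k highest-scoring keys. The abstract is cut off before it says how tokens are chosen, so the top-k-by-score rule, the function name `topk_restricted_attention`, and the parameter `k` are all assumptions made for illustration and should not be read as the paper's method.

```python
import math


def topk_restricted_attention(queries, keys, values, k):
    """Attention in which each query uses only its k highest-scoring keys.

    queries: list of query vectors; keys/values: lists of equal length.
    ASSUMPTION: top-k by dot-product score is an illustrative selection rule;
    the source abstract does not specify how tokens are selected.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        # Indices of the k largest scores; all other keys are ignored.
        kept = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Numerically stable softmax over the kept scores only.
        top = max(scores[j] for j in kept)
        weights = [math.exp(scores[j] - top) for j in kept]
        total = sum(weights)
        # Weighted average of the kept value vectors.
        outputs.append([
            sum(w * values[j][c] for w, j in zip(weights, kept)) / total
            for c in range(value_dim)
        ])
    return outputs


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(topk_restricted_attention(queries, keys, values, k=1))  # [[1.0, 2.0]]
```

With k equal to the number of keys, the function reduces to ordinary full attention. Note that this toy version still scores every key in order to pick the top k, so it illustrates the restriction rather than the savings: the softmax and value aggregation shrink from N to k terms per query, but an actual speedup would need a selection rule cheaper than full scoring.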