Sparser Block-Sparse Attention via Token Permutation

arXiv cs.CV, May 25, 2026

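Extending the context length of large language models runs into the O(N²) cost of self-attention. Attention matrices for long sequences are often sparse, and block-sparse attention exploits this: it partitions the sequence into blocks and skips computation on blocks deemed irrelevant, which substantially reduces memory use and latency.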
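To make the block-skipping idea concrete, here is a minimal pure-Python sketch. It makes several assumptions: self-attention (so `Q`, `K` and `V` have the same length), a fixed `block_size`, and a precomputed boolean `block_mask` in which `block_mask[i][j]` says whether query block `i` attends to key block `j`. The function names and the mask format are illustrative and do not come from the paper. How the mask is chosen, which is the hard part in practice, is left out.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query vector over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def dense_attention(Q, K, V):
    """Reference O(N^2) attention: every query scores every key."""
    return [attend(q, K, V) for q in Q]

def block_sparse_attention(Q, K, V, block_size, block_mask):
    """Query block i attends only to key blocks j where block_mask[i][j] is True.

    Scores for masked-out blocks are never computed, which is the source of
    the savings. Assumes self-attention, i.e. len(Q) == len(K) == len(V).
    """
    n = len(Q)
    num_blocks = math.ceil(n / block_size)
    out = []
    for qi, q in enumerate(Q):
        bi = qi // block_size  # query block containing this query
        keys, vals = [], []
        for bj in range(num_blocks):
            if block_mask[bi][bj]:
                lo, hi = bj * block_size, min((bj + 1) * block_size, n)
                keys.extend(K[lo:hi])
                vals.extend(V[lo:hi])
        if not keys:
            raise ValueError(f"query block {bi} has no active key blocks")
        out.append(attend(q, keys, vals))
    return out
```

With an all-True mask the result equals dense attention. Each False entry removes up to a `block_size × block_size` tile of score computations. A block-diagonal mask, for example, keeps only local attention within each block.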

arXiv:2510.21270v2 Announce Type: replace-cross

Abstract: Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computa
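The title attributes the extra sparsity to token permutation. The toy below illustrates why reordering tokens can help, but it is not the authors' algorithm. It relies on a standard property: softmax attention over a set of keys is unchanged if the keys and values are permuted consistently. This holds under two assumptions: positional information is already encoded in the vectors, and no causal mask is involved. `cluster_permutation` is a hypothetical heuristic that moves the keys one query block needs next to each other, so they occupy fewer key blocks.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over a list of key/value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def blocks_touched(positions, block_size):
    """Number of distinct key blocks that contain the given key positions."""
    return len({p // block_size for p in positions})

def cluster_permutation(n, important):
    """Hypothetical heuristic: important keys first, then the rest, each group
    keeping its original relative order. Entry new_index holds the old index."""
    important = set(important)
    return sorted(important) + [i for i in range(n) if i not in important]

# Toy setup: 8 keys in blocks of 2; one query block needs keys 0, 3, 5, 7.
important = [0, 3, 5, 7]
perm = cluster_permutation(8, important)
new_positions = [perm.index(i) for i in important]
print(blocks_touched(important, 2))      # 4: every key block must be computed
print(blocks_touched(new_positions, 2))  # 2: the other two key blocks could be skipped
```

The permutation changes only where keys sit, not what a query attends to, so the output is preserved while the set of key blocks that must be computed shrinks. A practical scheme would also have to pick the permutation cheaply. If queries are reordered as well, the outputs must be mapped back to the original order. This sketch does neither.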