Sparser Block-Sparse Attention via Token Permutation
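Summary: When large language models scale to longer contexts, self-attention becomes an O(N²) bottleneck in sequence length. Because the attention matrix is often sparse for long sequences, block-sparse attention partitions the sequence into blocks and skips the computation for irrelevant blocks, which substantially reduces memory use and latency.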
arXiv:2510.21270v2 Announce Type: replace-cross Abstract: Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computa
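To make the block-skipping idea concrete, here is a minimal pure-Python sketch of generic block-sparse attention for a single head. It is not the paper's method: the abstract is cut off before the permutation scheme is described, so nothing here models it. The names `attend`, `dense_attention`, `block_sparse_attention`, and the `keep` block mask are hypothetical, chosen only for illustration, and the mask is assumed to be given rather than learned or estimated.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values)) / total
            for d in range(dim)]

def dense_attention(q, k, v):
    """Reference: every query scores every key, so cost grows as N * N."""
    return [attend(qi, k, v) for qi in q]

def block_sparse_attention(q, k, v, block_size, keep):
    """Score only the key blocks listed in keep[query_block].

    keep is an assumed, externally supplied block mask: keep[i] lists the
    key-block indices that query block i attends to. Unlisted blocks are
    never touched, which is where the savings come from.
    """
    n = len(k)
    out = []
    for i, qi in enumerate(q):
        key_blocks = keep[i // block_size]
        if not key_blocks:
            raise ValueError("each query block must keep at least one key block")
        # Expand the kept block indices into token positions.
        cols = [j for b in key_blocks
                for j in range(b * block_size, min((b + 1) * block_size, n))]
        out.append(attend(qi, [k[j] for j in cols], [v[j] for j in cols]))
    return out

# Usage: 4 tokens, blocks of 2; query block 0 sees only key block 0.
q = [[0.1 * i, 1.0] for i in range(4)]
k = [[0.2 * i, -1.0] for i in range(4)]
v = [[float(i), float(-i)] for i in range(4)]
out = block_sparse_attention(q, k, v, block_size=2, keep=[[0], [0, 1]])
```

With this mask, three of the four query/key block pairs are computed. In general the work scales with the number of kept block pairs rather than with N², so a sparser mask means proportionally less computation.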