EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

arXiv cs.CV, May 25, 2026

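Multi-shot video generation must keep the appearance of entities consistent, but full-frame storage mixes persistent entity information with transient scene context, which causes information leakage and high computational cost. This paper proposes an entity-centric memory, an entity-indexed bank of latent patches, and introduces sparse token conditioning, which separates the two kinds of information and reduces overhead. The method improves entity consistency in multi-shot video generation and is highly practical.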
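To make the idea of an entity-indexed bank concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the class name `EntityMemoryBank`, the per-entity FIFO cap, the string entity ids, and the use of float lists in place of latent patch tokens are all assumptions made for illustration. It only shows the general pattern of storing patches per entity and retrieving a sparse set of tokens for the entities named in the next shot.

```python
from collections import defaultdict, deque


class EntityMemoryBank:
    """Toy entity-indexed bank of latent patches (hypothetical API, not the paper's)."""

    def __init__(self, max_patches_per_entity=4):
        # Assumption: each entity keeps a bounded FIFO queue of its patches.
        self._bank = defaultdict(lambda: deque(maxlen=max_patches_per_entity))

    def write(self, entity_id, patch):
        # `patch` stands in for a latent patch token (here a plain list of floats).
        self._bank[entity_id].append(list(patch))

    def gather(self, entity_ids):
        # Return only the patches of entities requested for the current shot,
        # so nothing stored for other entities is retrieved.
        tokens = []
        for entity_id in entity_ids:
            if entity_id in self._bank:
                tokens.extend(self._bank[entity_id])
        return tokens


bank = EntityMemoryBank(max_patches_per_entity=2)
bank.write("alice", [0.1, 0.2])
bank.write("alice", [0.3, 0.4])
bank.write("alice", [0.5, 0.6])  # evicts the oldest "alice" patch
bank.write("dog", [0.9, 0.8])

# The next shot mentions only "alice", so only her patches are gathered.
conditioning = bank.gather(["alice"])
print(len(conditioning))
```

In this sketch the number of conditioning tokens depends on how many entities appear in the shot and on the per-entity cap, not on the size of earlier frames, which is the intuition behind the reduced overhead described above.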

arXiv:2605.23610v1 Announce Type: new Abstract: Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible w