EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation
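Multi-shot video generation must keep the appearance of entities consistent, but full-frame storage mixes persistent entity information with transient scene context, which causes information leakage and high computational cost. The paper proposes an entity-centric memory, an entity-indexed bank of latent patches, and introduces sparse token conditioning, which separates the two kinds of information and reduces overhead. The method improves entity consistency in multi-shot video generation and is highly practical.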
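To make the idea of an entity-indexed bank of latent patches concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation: the class name `EntityMemory`, the `write`/`read` methods, the per-entity boolean masks, and the latent shapes are all assumptions chosen for illustration. It only shows how storing the patches that belong to an entity, rather than whole frames, keeps the conditioning set small and free of unrelated scene content.

```python
import numpy as np


class EntityMemory:
    """Toy entity-indexed bank of latent patch tokens (hypothetical design)."""

    def __init__(self, channels):
        self.channels = channels
        self.bank = {}  # entity name -> array of shape (num_tokens, channels)

    def write(self, entity, latent, mask):
        """Store only the latent patches covered by the entity's mask.

        latent: (H, W, C) latent grid of one generated frame.
        mask:   (H, W) boolean array marking patches that belong to the entity
                (assumed to come from some external segmentation step).
        """
        tokens = latent[mask]  # boolean indexing -> (num_selected, C)
        if entity in self.bank:
            tokens = np.concatenate([self.bank[entity], tokens], axis=0)
        self.bank[entity] = tokens

    def read(self, entities):
        """Return conditioning tokens for the entities named in the next shot."""
        found = [self.bank[e] for e in entities if e in self.bank]
        if not found:
            return np.zeros((0, self.channels))
        return np.concatenate(found, axis=0)


# Example: a 4x4 latent grid with 8 channels and two entities.
rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 4, 8))

cat_mask = np.zeros((4, 4), dtype=bool)
cat_mask[0, :3] = True  # 3 patches belong to "cat"
dog_mask = np.zeros((4, 4), dtype=bool)
dog_mask[2, 1:3] = True  # 2 patches belong to "dog"

memory = EntityMemory(channels=8)
memory.write("cat", latent, cat_mask)
memory.write("dog", latent, dog_mask)

cat_tokens = memory.read(["cat"])         # only the cat's patches
both_tokens = memory.read(["cat", "dog"]) # cat patches followed by dog patches
full_frame_tokens = latent.reshape(-1, 8) # what full-frame storage would keep
```

In this toy setup, a shot that mentions only the cat is conditioned on 3 tokens instead of the 16 a full frame would contribute, and none of those tokens carry background patches from the earlier shot.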
arXiv:2605.23610v1 Announce Type: new Abstract: Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible w