CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

arXiv cs.CV, May 25, 2026

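CoMoGen is a controllable video generation framework that produces realistic interactive dynamics from an input image and a single binary mask sequence. Its lightweight MaskAdapter encodes the mask sequence into a latent residual signal, which is injected into a Multi-Modal Diffusion Transformer (MMDiT) through a cosine-weighted schedule, overcoming the limitations of the hierarchical injection used with traditional UNet architectures. The method achieves precise temporal control and offers a new approach to interaction simulation and video editing.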
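To make the idea of a cosine-weighted residual injection concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The summary does not say whether the schedule runs over transformer blocks or over denoising steps, nor what its exact form is, so the half-cosine decay over block index, and the names `cosine_weights` and `inject_residual`, are assumptions made purely for illustration.

```python
import math


def cosine_weights(num_blocks: int) -> list[float]:
    """Hypothetical schedule: one weight per block, decaying from 1 to 0
    along a half cosine. The real CoMoGen schedule may differ."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if num_blocks == 1:
        return [1.0]
    return [
        0.5 * (1.0 + math.cos(math.pi * i / (num_blocks - 1)))
        for i in range(num_blocks)
    ]


def inject_residual(hidden: list[float], residual: list[float], weight: float) -> list[float]:
    """Add a weighted residual to a block's hidden state, element by element."""
    if len(hidden) != len(residual):
        raise ValueError("hidden and residual must have the same length")
    return [h + weight * r for h, r in zip(hidden, residual)]


if __name__ == "__main__":
    # Toy stand-ins for a hidden state and an adapter-produced mask residual.
    hidden = [1.0, 2.0]
    mask_residual = [10.0, -10.0]
    for block_index, w in enumerate(cosine_weights(5)):
        print(block_index, round(w, 3), inject_residual(hidden, mask_residual, w))
```

Under this assumed schedule the mask signal has the most influence in the first block and fades smoothly to no influence in the last one. A real model would apply the same arithmetic to latent tensors rather than Python lists.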

arXiv:2605.22996v1 Announce Type: new

Abstract: We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify whic