CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
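CoMoGen is a controllable video generation framework that produces realistic interactive dynamics from an input image and a single binary mask sequence. Its lightweight MaskAdapter encodes the mask sequence into a latent residual signal, which is injected into the Multi Modal Diffusion Transformer (MMDiT) through a cosine-weighted schedule. This design suits MMDiT's uniform transformer blocks, which lack the coarse-to-fine hierarchy that guides layer-wise injection in UNet architectures. The method offers precise temporal control and a new approach to interaction simulation and video editing.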
arXiv:2605.22996v1 Announce Type: new Abstract: We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify whic
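The injection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say whether the cosine schedule runs over transformer blocks or over diffusion timesteps, nor in which direction it decays. The sketch assumes a schedule over block depth that decays from 1 at the first block to 0 at the last. The names `cosine_weights`, `inject`, and `run_blocks` are hypothetical, the MaskAdapter encoder itself is not modeled, and the mask residual is stood in for by a plain list of floats.

```python
import math
from typing import Callable, List

Vector = List[float]


def cosine_weights(num_blocks: int) -> List[float]:
    """Hypothetical per-block weights following half a cosine period.

    Assumption: the weight is 1.0 at the first block and 0.0 at the last.
    The actual schedule used by CoMoGen may differ.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if num_blocks == 1:
        return [1.0]
    return [
        0.5 * (1.0 + math.cos(math.pi * i / (num_blocks - 1)))
        for i in range(num_blocks)
    ]


def inject(hidden: Vector, residual: Vector, weight: float) -> Vector:
    """Add the weighted mask residual to the hidden state, element by element."""
    if len(hidden) != len(residual):
        raise ValueError("hidden and residual must have the same length")
    return [h + weight * r for h, r in zip(hidden, residual)]


def run_blocks(
    hidden: Vector,
    residual: Vector,
    blocks: List[Callable[[Vector], Vector]],
) -> Vector:
    """Apply each block after injecting the residual with its scheduled weight."""
    weights = cosine_weights(len(blocks))
    for block, weight in zip(blocks, weights):
        hidden = block(inject(hidden, residual, weight))
    return hidden


def identity(x: Vector) -> Vector:
    """Stand-in for a transformer block; returns its input unchanged."""
    return x


if __name__ == "__main__":
    # Three identity blocks so that only the injected residual changes the state.
    out = run_blocks([0.0, 0.0], [1.0, 2.0], [identity] * 3)
    print(out)
```

With identity blocks, the output is the input plus the residual scaled by the sum of the weights, which makes the effect of the schedule easy to inspect before swapping in real transformer blocks.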