Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes

arXiv cs.CV, May 25, 2026

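This paper proposes a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. The model jointly predicts a 2D pose visualization image and an RGB image so that structure and appearance co-evolve, addressing the limited semantic diversity, repetitive layouts, and distorted interactions seen when generating multi-person interaction scenes. The method improves the compositional accuracy and interaction plausibility of the generated images.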
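To make the "2D pose visualization image" half of the dual representation concrete, here is a minimal sketch of how per-person keypoints could be rasterized into a single pose image. Everything in it is an assumption for illustration: the five-joint skeleton, the `BONES` list, the function names, and the choice to encode each person as an integer label are hypothetical and are not taken from the paper, which does not describe its rendering procedure in this listing.

```python
# Hypothetical illustration only: the skeleton layout, names, and label
# encoding below are assumptions, not the paper's actual rendering scheme.
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates

# Assumed toy skeleton with 5 joints:
# 0 = head, 1 = neck, 2 = hip, 3 = left hand, 4 = right hand.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]


def draw_line(canvas: List[List[int]], p0: Point, p1: Point, value: int) -> None:
    """Rasterize a segment with Bresenham's algorithm, clipping to the canvas."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= y0 < len(canvas) and 0 <= x0 < len(canvas[0]):
            canvas[y0][x0] = value
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def render_pose_image(
    people: List[List[Point]], height: int, width: int
) -> List[List[int]]:
    """Draw every person's skeleton onto one canvas.

    Background pixels are 0; person k (1-based) is drawn with value k,
    so individuals stay distinguishable in the structural image.
    """
    canvas = [[0] * width for _ in range(height)]
    for person_id, joints in enumerate(people, start=1):
        for a, b in BONES:
            draw_line(canvas, joints[a], joints[b], person_id)
    return canvas


if __name__ == "__main__":
    # Two people standing side by side on a 10 x 16 canvas.
    people = [
        [(4, 1), (4, 3), (4, 7), (1, 4), (7, 4)],
        [(12, 1), (12, 3), (12, 7), (9, 4), (15, 4)],
    ]
    for row in render_pose_image(people, 10, 16):
        print("".join(".12"[v] for v in row))
```

In a setup like the one summarized above, an image of this kind would be the structural target predicted alongside the RGB image; a real pipeline would more likely use a standard keypoint layout and color-coded limbs rather than integer labels.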

arXiv:2605.23178v1 (announce type: new)

Abstract: Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolv