Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
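DiNa-LRM proposes a diffusion-native latent reward model that addresses the high computational cost and domain mismatch that arise when VLMs serve as reward functions. The method optimizes preferences directly in latent space, improving alignment efficiency and reducing resource consumption, and is suited to practical alignment of diffusion and flow-matching models.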
arXiv:2602.11146v2 Announce Type: replace
Abstract: Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent