Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

arXiv cs.CV, May 25, 2026

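DiNa-LRM proposes a diffusion-native latent reward model that addresses the high computational cost and domain mismatch that arise when VLMs serve as reward functions. The method optimizes preferences directly in latent space, which improves alignment efficiency and reduces resource consumption, making it suitable for practical alignment of diffusion and flow-matching models.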

arXiv:2602.11146v2 | Announce Type: replace

Abstract: Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent