Multimodal Distribution Matching for Vision-Language Dataset Distillation

arXiv cs.CV, May 25, 2026

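Proposes Multimodal Distribution Matching (MDM), a geometry-aware framework that efficiently compresses vision-language training sets while preserving representation quality and cross-modal alignment under limited compute and memory. It targets two weaknesses of existing methods: heavy computational cost and neglect of the correlations between modalities.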

arXiv:2605.23482v1

Abstract: Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook cross-modal correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillat