Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance

arXiv cs.CV · May 25, 2026

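This study proposes a multi-label temporal event detection framework for video capsule endoscopy. It addresses the extreme class imbalance of the Galar dataset with an angular separation loss on class prototypes and a Biological State Machine temporal decoder. The framework builds on the BiomedCLIP foundation model and adds a Local Differencing Attention module that fuses three consecutive frames, suppressing static background to enhance transient pathological signals. The summary reports significantly improved detection of rare abnormalities.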
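The text available here does not give the exact form of the angular separation loss, so the following is only a minimal sketch of one common way to push class prototypes apart by angle. The function names, the hinge form, and the `min_angle_deg` parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math
from itertools import combinations


def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosines."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def angular_separation_penalty(prototypes, min_angle_deg=90.0):
    """Mean hinge penalty over prototype pairs closer than min_angle_deg.

    ASSUMPTION: this hinge-on-cosine form is illustrative only; the
    source does not specify how the paper defines its loss.
    """
    if len(prototypes) < 2:
        return 0.0  # nothing to separate
    unit = [l2_normalize(p) for p in prototypes]
    cos_limit = math.cos(math.radians(min_angle_deg))
    penalties = []
    for a, b in combinations(unit, 2):
        cos_ab = sum(x * y for x, y in zip(a, b))
        # Penalize only pairs whose angle is smaller than the target,
        # i.e. whose cosine exceeds the cosine of the target angle.
        penalties.append(max(0.0, cos_ab - cos_limit))
    return sum(penalties) / len(penalties)


# Hypothetical 2-D prototypes: two nearly collinear, one orthogonal.
example = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
penalty = angular_separation_penalty(example, min_angle_deg=60.0)
```

In a training loop, a term like this would typically be added to the multi-label classification loss with a small weight, so that rare-class prototypes are not pulled toward those of frequent classes. That weighting is also an assumption here.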

arXiv:2603.17879v2 (announce type: replace)

Abstract: This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset by combining two principal contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. The backbone remains BiomedCLIP, a biomedical vision-language foundation model. Three consecutive frames are fused through a Local Differencing Attention module that amplifies transient pathological signals by suppressing static tempor