Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition
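Recent open-vocabulary action recognition (OVAR) methods typically aggregate visual features into a single global representation, which discards local detail. This paper proposes SimVA, a framework that builds a dense 4D spatio-temporal similarity volume from patch-level visual-text similarities and uses class sampling to ensure similarity alignment. The method preserves fine-grained spatio-temporal cues and improves the accuracy and generalization of action recognition.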
arXiv:2605.23288v1. Abstract: Recent Open-Vocabulary Action Recognition (OVAR) methods typically aggregate visual features into a global representation before computing text alignment, a process that obscures local patch information and fine-grained spatio-temporal cues. We propose Similarity Volume Aggregation (SimVA), a framework that constructs a dense 4D spatio-temporal similarity volume from patch-level visual-text similarities. SimVA constructs a spatio-temporal similarity volume over local video tokens and action classes, and employs class sampling to ensure similarity
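
To make the contrast in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It compares a global-pooling baseline (one similarity score per class) with a dense volume of patch-level visual-text cosine similarities. The function names, the `(T, H, W, C)` axis layout, and the random stand-in features are all assumptions made for illustration; the class sampling step is not shown because the abstract does not describe it in enough detail.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def similarity_volume(patch_feats, text_feats):
    """Cosine similarity of every video patch token with every class embedding.

    patch_feats: (T, H, W, D) patch-level visual features (hypothetical layout)
    text_feats:  (C, D) one text embedding per action class
    returns:     (T, H, W, C) dense 4D similarity volume
    """
    v = l2_normalize(patch_feats)
    t = l2_normalize(text_feats)
    # Dot product over the feature axis d for each (t, h, w, c) combination.
    return np.einsum("thwd,cd->thwc", v, t)


def global_pool_logits(patch_feats, text_feats):
    """Baseline: average all patches into one vector, then align with text."""
    g = l2_normalize(patch_feats.mean(axis=(0, 1, 2)))
    return l2_normalize(text_feats) @ g


# Random stand-ins for encoder outputs (assumption: real features would come
# from a vision-language model's video and text encoders).
rng = np.random.default_rng(0)
T, H, W, D, C = 4, 7, 7, 32, 5
patch_feats = rng.standard_normal((T, H, W, D))
text_feats = rng.standard_normal((C, D))

volume = similarity_volume(patch_feats, text_feats)
logits = global_pool_logits(patch_feats, text_feats)

print(volume.shape)  # (4, 7, 7, 5)
print(logits.shape)  # (5,)
```

The baseline collapses the video to `C` numbers, whereas the volume keeps one score per frame, spatial location, and class, which is the local information a later aggregation stage could draw on.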