Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
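Proposes ToolMerge, a keyframe selection method for long-video question answering. An LLM-based planner decomposes the query into multiple tool calls and specifies the region and weight each tool applies to; a merging strategy then selects the frames that carry the strongest evidence. Compared with single-query scoring or fixed-schema decomposition, ToolMerge adapts to what each query requires and improves keyframe retrieval accuracy.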
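The decompose-then-merge idea summarized above can be illustrated with a small sketch. The listing does not specify the planner's output format or the fusion rule, so everything here is an assumption made for illustration: the `ToolCall` structure, the tool names, the frame-index regions, and the weighted-sum fusion are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToolCall:
    """One planned tool call (hypothetical structure, not from the paper)."""
    tool: str                  # name of a visual tool, e.g. "object_detector"
    region: Tuple[int, int]    # frame index range [start, end) the tool applies to
    weight: float              # planner-assigned importance of this call


def merge_scores(num_frames: int,
                 calls: List[ToolCall],
                 tool_scores: Dict[str, List[float]]) -> List[float]:
    """Fuse per-tool frame scores with a weighted sum (an assumed fusion rule)."""
    merged = [0.0] * num_frames
    for call in calls:
        start, end = call.region
        scores = tool_scores[call.tool]
        # Only frames inside the call's region receive that tool's contribution.
        for i in range(max(start, 0), min(end, num_frames)):
            merged[i] += call.weight * scores[i]
    return merged


def select_keyframes(merged: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order."""
    top = sorted(range(len(merged)), key=lambda i: merged[i], reverse=True)[:k]
    return sorted(top)


# Toy example: two hypothetical tools scored over a 6-frame clip.
calls = [
    ToolCall("object_detector", (0, 4), 0.7),
    ToolCall("ocr", (2, 6), 0.3),
]
tool_scores = {
    "object_detector": [0.1, 0.9, 0.2, 0.4, 0.0, 0.0],
    "ocr": [0.0, 0.0, 0.5, 1.0, 0.8, 0.1],
}
merged = merge_scores(6, calls, tool_scores)
print(select_keyframes(merged, k=2))  # [1, 3]
```

In this toy setup frame 1 is selected on the strength of the detector alone, while frame 3 is selected because both tools agree on it inside their overlapping regions. A real system would replace the hard-coded scores with outputs from actual visual tools and the hand-written plan with the planner's output.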
arXiv:2605.23826v1 Announce Type: new Abstract: Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a Large Language Model (LLM)-based planner decomposes the query into tool calls and specifies