Journal of Computer Applications
HE Li¹, LI Zhiqiang², SONG Yuhao¹, WANG Xiao¹
Abstract: The video question answering task aims to understand video content in depth and answer natural-language questions about it. Current approaches based on multimodal large language models struggle with long videos: under the dual constraints of context length and computational complexity, they cannot effectively model global semantics and temporal dependencies across scenes, which degrades reasoning performance. To address this issue, a Scene-Driven Adaptive Keyframe Sampling (SD-AKS) method and a Question-Driven Video Summarization (QD-VS) method are proposed. First, an iterative video scene clustering scheme based on K-Means is introduced, and a Scene Separation Score (SSS) is designed to quantitatively evaluate feature differences between scenes, improving the accuracy and robustness of scene segmentation. Second, a question-semantics-guided adaptive keyframe sampling strategy is designed, which selects keyframes dynamically according to the quantified information density of each scene and thus finely covers information-rich regions. Furthermore, building on the keyframe method, a large language model is employed to summarize the set of questions associated with the video, generating task-oriented textual summaries that improve global reasoning capability. Experimental results demonstrate that the proposed methods achieve prediction accuracies of 62.6% and 85.1% on the EgoSchema and NExT-QA datasets, respectively, exceeding the best-performing baseline model, LLaVA-Video, by 5.3 and 1.8 percentage points and validating the effectiveness of the proposed methods across datasets.
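The scene-clustering step described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the plain Lloyd's-algorithm K-Means, the SSS definition used here (mean inter-centroid distance divided by mean intra-scene spread), and the candidate scene counts are all assumptions made for the example.

```python
# Illustrative sketch of iterative K-Means scene clustering scored by a
# Scene Separation Score (SSS). The SSS formula below (inter-centroid
# spread / intra-scene spread) is an assumption; the paper's exact
# definition is not given in the abstract.
import numpy as np

def kmeans(feats, k, iters=50, seed=0):
    """Plain Lloyd's algorithm over per-frame feature vectors."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each frame to its nearest centroid
        dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned frames
        for j in range(k):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    return labels, centers

def scene_separation_score(feats, labels, centers):
    """Higher when scenes are far apart and internally compact."""
    k = len(centers)
    inter = np.mean([np.linalg.norm(centers[i] - centers[j])
                     for i in range(k) for j in range(i + 1, k)])
    intra = np.mean(np.linalg.norm(feats - centers[labels], axis=1))
    return inter / (intra + 1e-8)

def cluster_scenes(feats, ks=(2, 3, 4, 5), seeds=(0, 1, 2)):
    """Iterate over candidate scene counts and keep the best-scoring split."""
    best = (-np.inf, None, None)  # (sss, k, labels)
    for k in ks:
        for seed in seeds:
            labels, centers = kmeans(feats, k, seed=seed)
            sss = scene_separation_score(feats, labels, centers)
            if sss > best[0]:
                best = (sss, k, labels)
    return best[1], best[2]
```

A well-separated clustering scores strictly higher than one that merges two visually distinct scenes, which is what lets the iteration pick the scene count.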
Key words: Video Question Answering (VideoQA), Multimodal Large Language Model (MLLM), keyframe, multimodal reasoning, video summarization
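The adaptive sampling step can likewise be illustrated as a budget-allocation problem: given a per-scene information-density score (however the paper quantifies it), keyframes are allotted to scenes in proportion to that density, so dense scenes receive finer coverage. The function names, the minimum-one-frame rule, and the uniform within-scene placement are illustrative assumptions, not the paper's specification.

```python
# Hypothetical sketch of density-proportional keyframe allocation: scenes
# with higher question-relevant information density get more of the fixed
# keyframe budget.
import numpy as np

def allocate_keyframes(densities, budget):
    """Split a keyframe budget across scenes in proportion to density."""
    densities = np.asarray(densities, dtype=float)
    raw = densities / densities.sum() * budget
    # floor, but guarantee every scene keeps at least one keyframe
    alloc = np.maximum(np.floor(raw).astype(int), 1)
    # hand leftover frames to the scenes with the largest remainders
    while alloc.sum() < budget:
        alloc[np.argmax(raw - alloc)] += 1
    # if the minimum-one rule overshot, trim the largest allocation
    while alloc.sum() > budget:
        alloc[np.argmax(alloc)] -= 1
    return alloc

def sample_keyframes(scene_bounds, alloc):
    """Place each scene's allotted frames uniformly inside its interval."""
    idx = []
    for (start, end), n in zip(scene_bounds, alloc):
        idx.extend(np.linspace(start, end - 1, n).round().astype(int))
    return sorted(set(idx))
```

For example, with densities 0.6/0.3/0.1 and a budget of 10 frames, the three scenes receive 6, 3, and 1 keyframes respectively.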
CLC Number: TP391
HE Li, LI Zhiqiang, SONG Yuhao, WANG Xiao. Video question answering method based on keyframes and summarization [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025080995.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2025080995