Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (7): 1908-1914. DOI: 10.11772/j.issn.1001-9081.2020091512

Special Issue: Artificial Intelligence

• Artificial intelligence •

Video summarization generation model based on improved bi-directional long short-term memory network

WU Guangli1,2, LI Leiting1, GUO Zhenzhou1, WANG Chengxiang1   

  1. School of Cyberspace Security, Gansu University of Political Science and Law, Lanzhou, Gansu 730070, China;
    2. Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education (Northwest Minzu University), Lanzhou, Gansu 730030, China
  • Received: 2020-09-28  Revised: 2020-12-22  Online: 2021-07-10  Published: 2020-12-31
  • Supported by:
    This work is partially supported by the Natural Science Foundation of Gansu Province (17JR5RA161), the Youth Science and Technology Program of Gansu Province (18JR3RA193), the Lanzhou Talent Innovation and Entrepreneurship Project (2020-RC-27), the Colleges and Universities of Gansu Province Innovation Ability Improvement Project (2020B-167), and the Longyuan Youth Innovation and Entrepreneurship Talent Project (2021LQGR20).

  • Corresponding author: WU Guangli
  • About the authors: WU Guangli, born in 1981 in Weifang, Shandong, is a professor, Ph.D., and CCF member; his research interests include video content understanding and artificial intelligence. LI Leiting, born in 1996 in Jining, Shandong, is a master's candidate; his research interests include video content understanding and artificial intelligence. GUO Zhenzhou, born in 1995 in Puyang, Henan, is a master's candidate; his research interests include video content understanding and artificial intelligence. WANG Chengxiang, born in 1995 in Kaifeng, Henan, is a master's candidate; his research interests include video content understanding and artificial intelligence.

Abstract: Traditional video summarization methods often fail to consider temporal information, and the video features they extract are overly complex and prone to overfitting. To solve these problems, a video summarization generation model based on an improved Bi-directional Long Short-Term Memory (BiLSTM) network was proposed. Firstly, the deep features of the video frames were extracted by a Convolutional Neural Network (CNN), and, in order to make the generated video summarization more diverse, the BiLSTM network was adopted to convert the deep feature recognition task into a sequence feature annotation task over the video frames, so that the model was able to obtain more context information. Secondly, considering that the generated video summarization should be representative, max pooling was fused into the model to reduce the feature dimension while highlighting the key information and weakening the redundant information, so that the model was able to learn representative features; the reduced feature dimension also lowered the number of parameters required in the fully connected layer, which avoided the overfitting problem. Finally, the importance scores of the video frames were predicted and converted into shot scores, which were used to select the key shots and generate the video summarization. Experimental results on the two standard datasets TVSum and SumMe show that the improved model improves the accuracy of video summarization generation, with its F1-score improved by 1.4 and 0.3 percentage points respectively compared with DPPLSTM (Determinantal Point Process Long Short-Term Memory), an existing video summarization model based on the Long Short-Term Memory (LSTM) network.
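The pipeline described above (CNN frame features, BiLSTM sequence labelling, max-pooling fusion, and a fully connected scoring layer) can be illustrated with a minimal Python (PyTorch) sketch. The code below is not the authors' released implementation; the 1024-dimensional frame features, the hidden size of 256, and the pooling window of 4 are assumptions made purely for illustration.

import torch
import torch.nn as nn

class BiLSTMSummarizer(nn.Module):
    # Minimal sketch of the improved BiLSTM summarizer described in the abstract.
    # Assumed sizes: 1024-d CNN frame features, 256 hidden units per direction,
    # max pooling with window 4 over the feature axis (all illustrative choices).
    def __init__(self, feat_dim=1024, hidden=256, pool_size=4):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Max pooling over the feature dimension shrinks 2*hidden to 2*hidden/pool_size,
        # which in turn shrinks the fully connected layer and its parameter count.
        self.pool = nn.MaxPool1d(pool_size)
        self.fc = nn.Linear(2 * hidden // pool_size, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, n_frames, feat_dim) CNN features, one row per sampled frame
        h, _ = self.bilstm(frame_feats)        # (batch, n_frames, 2*hidden)
        h = self.pool(h)                       # (batch, n_frames, 2*hidden // pool_size)
        scores = torch.sigmoid(self.fc(h))     # (batch, n_frames, 1)
        return scores.squeeze(-1)              # frame importance scores in [0, 1]

# Example: score 300 sampled frames of one video
model = BiLSTMSummarizer()
feats = torch.randn(1, 300, 1024)
frame_scores = model(feats)                    # shape (1, 300)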

Key words: video summarization, Convolutional Neural Network (CNN), Bi-directional Long Short-Term Memory (BiLSTM) network, max pooling
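The abstract states that the predicted frame importance scores are converted into shot scores, from which key shots are selected, but it does not spell out the selection rule. A common convention in this line of work, assumed here only for illustration, is to average the frame scores within each shot and then choose shots with a 0/1 knapsack under a summary-length budget (often about 15% of the video length). A small Python sketch of that reading:

from typing import List, Tuple

def shot_scores(frame_scores: List[float],
                shots: List[Tuple[int, int]]) -> List[float]:
    # Average the frame importance scores inside each (start, end) shot span.
    return [sum(frame_scores[s:e]) / max(e - s, 1) for s, e in shots]

def select_key_shots(scores: List[float],
                     lengths: List[int],
                     budget: int) -> List[int]:
    # 0/1 knapsack over shots: maximise total shot score within a frame budget.
    # A budget of roughly 15% of the video length is a common convention,
    # assumed here rather than stated in the abstract.
    n = len(scores)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if lengths[i - 1] <= b:
                cand = dp[i - 1][b - lengths[i - 1]] + scores[i - 1]
                if cand > dp[i][b]:
                    dp[i][b] = cand
    # Backtrack to recover the indices of the chosen shots.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)

Shot boundaries themselves would come from a separate temporal segmentation step, which the abstract does not describe.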

