Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 963-971. DOI: 10.11772/j.issn.1001-9081.2024040443

• Multimedia Computing and Computer Simulation •


Weakly supervised action localization based on temporal and global contextual feature enhancement

Weichao DANG, Yinghao FAN(), Gaimei GAO, Chunxia LIU   

  1. College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, Shanxi 030024, China
  • Received:2024-04-12 Revised:2024-06-25 Accepted:2024-06-28 Online:2025-03-17 Published:2025-03-10
  • Contact: Yinghao FAN
  • About author: DANG Weichao, born in 1974 in Yuncheng, Shanxi, Ph.D., associate professor, CCF member. His research interests include intelligent computing and software reliability.
    GAO Gaimei, born in 1978 in Lüliang, Shanxi, Ph.D., associate professor, CCF member. Her research interests include network security and cryptography.
    LIU Chunxia, born in 1977 in Datong, Shanxi, M.S., associate professor, CCF member. Her research interests include software engineering and databases.
  • Supported by:
    Shanxi Provincial Natural Science Foundation (202203021211194); Doctoral Research Start-up Fund of Taiyuan University of Science and Technology (20202063); Graduate Education Innovation Project of Taiyuan University of Science and Technology (SY2022063)


Abstract:

To address the inaccurate action classification and localization caused by treating video clips as independent action instances in existing weakly supervised action localization studies, a weakly supervised action localization method integrating temporal and global contextual feature enhancement was proposed. Firstly, a temporal feature enhancement branch was constructed, in which dilated convolution enlarged the receptive field and an attention mechanism was introduced to capture the temporal dependencies among video clips. Secondly, an Expectation-Maximization (EM) algorithm based on a Gaussian Mixture Model (GMM) was designed to capture the contextual information of videos, and global contextual feature enhancement was performed through binary walk propagation; as a result, high-quality Temporal Class Activation Maps (TCAMs) were generated as pseudo labels to supervise the temporal feature enhancement branch online. Thirdly, a momentum update network was used to obtain a cross-video dictionary reflecting action features across videos. Finally, cross-video contrastive learning was applied to improve the accuracy of action classification. Experimental results show that, at an Intersection-over-Union (IoU) of 0.5, the proposed method achieves mean Average Precision (mAP) of 42.0% and 42.2% on the THUMOS'14 and ActivityNet v1.3 datasets, respectively, improving mAP by 2.6 and 0.6 percentage points over the CCKEE (Cross-video Contextual Knowledge Exploration and Exploitation) method and verifying the effectiveness of the proposed method.
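To make the TCAM-as-pseudo-label idea concrete, the following is a minimal NumPy sketch of how a Temporal Class Activation Map is commonly computed in weakly supervised action localization and binarized into clip-level pseudo labels. It is an illustration of the general technique only, not the paper's implementation: the function names, the top-k pooling for video-level scores, and the thresholding scheme are all assumptions introduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tcam_and_video_scores(features, W, k=8):
    """Compute a TCAM and video-level class probabilities.

    features : (T, D) array of per-clip features (T clips, D dims).
    W        : (D, C) linear classifier weights for C action classes.
    Returns the TCAM of shape (T, C) (per-clip class logits) and the
    video-level class distribution from top-k mean pooling over time,
    a standard pooling choice under video-level (weak) supervision.
    """
    tcam = features @ W                       # per-clip class logits
    k = min(k, tcam.shape[0])
    topk = np.sort(tcam, axis=0)[-k:]         # top-k logits per class
    video_logits = topk.mean(axis=0)          # (C,)
    return tcam, softmax(video_logits)

def pseudo_labels(tcam, thresh=0.5):
    """Min-max normalize the TCAM per class and threshold it into
    binary clip-level pseudo labels for online supervision."""
    norm = (tcam - tcam.min(axis=0)) / (tcam.max(axis=0) - tcam.min(axis=0) + 1e-8)
    return (norm >= thresh).astype(np.float32)
```

In the method described above, such pseudo labels would be produced by the EM/GMM branch and used to supervise the temporal feature enhancement branch online; here the classifier `W` and threshold `thresh` are placeholders for whichever scoring and selection rules the actual model uses.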

Key words: weakly supervised action localization, Temporal Class Activation Map (TCAM), momentum update, pseudo label supervision, feature enhancement

CLC number: