Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (1): 240-246.DOI: 10.11772/j.issn.1001-9081.2024010104

• Multimedia computing and computer simulation •

Weakly supervised video anomaly detection with local-global temporal dependency

Pengcheng SONG, Lijun GUO, Rong ZHANG

  1. Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, Zhejiang 315211, China
  • Received: 2024-01-29  Revised: 2024-03-25  Accepted: 2024-03-25  Online: 2024-05-09  Published: 2025-01-10
  • Contact: Pengcheng SONG
  • About author: GUO Lijun, born in 1970 in Lingyuan, Liaoning, Ph. D., professor, CCF member. His research interests include computer vision, machine learning, and medical image analysis.
    ZHANG Rong, born in 1970 in Hebi, Henan, Ph. D., associate professor, CCF member. Her research interests include digital image forensics, computer vision, and medical image analysis.
  • Supported by:
    Natural Science Foundation of Zhejiang Province / Zhejiang Provincial Public Welfare Technology Research Project (LGF21F020008); Ningbo Public Welfare Science and Technology Plan Project (2022S134)

Abstract:

Weakly Supervised Video Anomaly Detection (WS-VAD) is of great significance to the field of intelligent security. Current WS-VAD methods face the following problems: they focus on discriminating individual video snippets while ignoring the local and global temporal dependencies among snippets; their loss functions ignore the temporal structure of anomalous events; and the large number of normal snippets in anomalous videos acts as noise that interferes with training convergence. Therefore, a WS-VAD method based on a Local-Global Temporal Dependency (LGTD) network was proposed. In this method, the LGTD network used a Multi-scale Temporal Feature Fusion (MTFF) module to capture the local temporal correlations of snippets over different time spans, and a Multi-Head Self-Attention (MHSA) module to integrate the information of all snippets in a video and model the temporal correlation of the whole video sequence. A Squeeze-and-Excitation (SE) channel-attention module was then used to re-weight the features within each snippet, so that the spatio-temporal features of the snippets were captured more accurately and detection performance was improved significantly. In addition, the existing loss function was improved by introducing a complementary K-maxmin inner-bag loss and a Top-K outer-bag loss, which increase the probability of selecting true anomalous snippets from anomalous videos for optimization during training. Experimental results show that the proposed method achieves average Area Under the Curve (AUC) values of 83.18% and 95.41% on the UCF-Crime and ShanghaiTech datasets, respectively, which are 0.08 and 7.21 percentage points higher than those of the Collaborative Normality Learning (CNL) method. These results indicate that the proposed method effectively improves detection performance.
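The Top-K selection idea behind the improved loss can be sketched as a standard Multiple Instance Learning (MIL) ranking loss: under weak (video-level) labels, the top-scoring snippets of an anomalous video stand in for anomalies, and the top-scoring snippets of a normal video serve as hard negatives. The sketch below is an illustrative reconstruction in plain Python, not the authors' implementation; the score values, `k`, and `margin` are hypothetical.

```python
def topk_mil_margin_loss(anomalous_scores, normal_scores, k=3, margin=1.0):
    """Illustrative MIL-style ranking loss under weak supervision:
    compare the mean of the top-k snippet scores of an anomalous video
    (likely anomalies) against the top-k scores of a normal video
    (hard negatives)."""
    top_anom = sorted(anomalous_scores, reverse=True)[:k]
    top_norm = sorted(normal_scores, reverse=True)[:k]
    mean_anom = sum(top_anom) / len(top_anom)
    mean_norm = sum(top_norm) / len(top_norm)
    # Hinge: the anomalous bag should outscore the normal bag by the margin.
    return max(0.0, margin - mean_anom + mean_norm)

# Hypothetical per-snippet anomaly scores for one anomalous and one normal video.
loss = topk_mil_margin_loss([0.9, 0.8, 0.2, 0.1], [0.3, 0.2, 0.1, 0.05], k=2)
```

Averaging over the top k snippets rather than taking only the single maximum makes the selection more robust to one spuriously high score, which is the intuition the abstract's K-maxmin/Top-K formulation builds on.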

Key words: Video Anomaly Detection (VAD), weakly supervised learning, Multiple Instance Learning (MIL), multi-scale feature fusion, Multi-Head Self-Attention (MHSA) mechanism
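As a rough illustration of the SE channel-attention step mentioned in the abstract, the following is a generic Squeeze-and-Excitation block applied to per-snippet feature channels: squeeze by global averaging, excite through two small dense layers, then rescale each channel by a sigmoid gate. All shapes, weights, and the reduction ratio are assumptions for the sketch, not the paper's settings.

```python
import numpy as np

def se_reweight(features, w1, w2):
    """Generic Squeeze-and-Excitation reweighting (illustrative).
    features: (num_snippets, channels); w1: (channels, channels // r);
    w2: (channels // r, channels), where r is the reduction ratio."""
    squeezed = features.mean(axis=0)              # squeeze: (channels,)
    hidden = np.maximum(squeezed @ w1, 0.0)       # excite, ReLU
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))  # sigmoid gates in (0, 1)
    return features * gates                       # channel-wise rescale

# Toy usage with hypothetical shapes: 4 snippets, 8 channels, r = 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = se_reweight(x, rng.normal(size=(8, 2)), rng.normal(size=(2, 8)))
```

Because every gate lies in (0, 1), the block can only attenuate channels, letting the network emphasize informative feature channels relative to the rest.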

CLC Number: