Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (1): 240-246.DOI: 10.11772/j.issn.1001-9081.2024010104

• Multimedia computing and computer simulation •

Weakly supervised video anomaly detection with local-global temporal dependency

Pengcheng SONG, Lijun GUO, Rong ZHANG

  1. Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, Zhejiang 315211, China
  • Received: 2024-01-29  Revised: 2024-03-25  Accepted: 2024-03-25  Online: 2024-05-09  Published: 2025-01-10
  • Contact: Pengcheng SONG
  • About author: GUO Lijun, born in 1970 in Lingyuan, Liaoning, Ph. D., professor, CCF member. His research interests include computer vision, machine learning, and medical image analysis.
    ZHANG Rong, born in 1970 in Hebi, Henan, Ph. D., associate professor, CCF member. Her research interests include digital image forensics, computer vision, and medical image analysis.
  • Supported by:
    Natural Science Foundation of Zhejiang Province / Zhejiang Provincial Public Welfare Technology Research Project (LGF21F020008); Ningbo Public Welfare Science and Technology Plan Project (2022S134)

Abstract:

Weakly Supervised Video Anomaly Detection (WS-VAD) is of great significance to the field of intelligent security. Current WS-VAD methods face the following problems: they focus on discriminating individual video snippets while ignoring the local and global temporal dependencies among snippets; their loss functions ignore the temporal structure of anomalous events; and the large number of normal snippets in anomalous videos acts as noise that interferes with training convergence. Therefore, a WS-VAD method based on a Local-Global Temporal Dependency (LGTD) network was proposed. In this method, the LGTD network used a Multi-scale Temporal Feature Fusion (MTFF) module to capture the local temporal correlations of snippets over different time spans, and a Multi-Head Self-Attention (MHSA) module to integrate the information of all snippets in a video and model the temporal correlation of the whole video sequence. A Squeeze-and-Excitation (SE) channel-attention module was then used to re-weight the features within each snippet, so that the spatio-temporal features of the snippets were captured more accurately and detection performance was improved significantly. In addition, the existing loss function was improved by introducing a complementary K-maxmin inner-bag loss and a Top-K outer-bag loss, which increase the probability of selecting true anomalous snippets from anomalous videos for optimization during training. Experimental results show that the proposed method achieves average Area Under the Curve (AUC) values of 83.18% and 95.41% on the UCF-Crime and ShanghaiTech datasets, respectively, which are 0.08 and 7.21 percentage points higher than those of the Collaborative Normality Learning (CNL) method. These results indicate that the proposed method effectively improves detection performance.
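The Top-K selection idea behind the improved loss can be sketched as a standard Multiple Instance Learning (MIL) ranking loss: under weak (video-level) labels, the top-scoring snippets of an anomalous video stand in for anomalies, and the top-scoring snippets of a normal video serve as hard negatives. The sketch below is an illustrative reconstruction in plain Python, not the authors' implementation; the score values, `k`, and `margin` are hypothetical.

```python
def topk_mil_margin_loss(anomalous_scores, normal_scores, k=3, margin=1.0):
    """Illustrative MIL-style ranking loss under weak supervision:
    compare the mean of the top-k snippet scores of an anomalous video
    (likely anomalies) against the top-k scores of a normal video
    (hard negatives)."""
    top_anom = sorted(anomalous_scores, reverse=True)[:k]
    top_norm = sorted(normal_scores, reverse=True)[:k]
    mean_anom = sum(top_anom) / len(top_anom)
    mean_norm = sum(top_norm) / len(top_norm)
    # Hinge: the anomalous bag should outscore the normal bag by the margin.
    return max(0.0, margin - mean_anom + mean_norm)

# Hypothetical per-snippet anomaly scores for one anomalous and one normal video.
loss = topk_mil_margin_loss([0.9, 0.8, 0.2, 0.1], [0.3, 0.2, 0.1, 0.05], k=2)
```

Averaging over the top k snippets rather than taking only the single maximum makes the selection more robust to one spuriously high score, which is the intuition the abstract's K-maxmin/Top-K formulation builds on.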

Key words: Video Anomaly Detection (VAD), weakly supervised learning, Multiple Instance Learning (MIL), multi-scale feature fusion, Multi-Head Self-Attention (MHSA) mechanism
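As a rough illustration of the SE channel-attention step mentioned in the abstract, the following is a generic Squeeze-and-Excitation block applied to per-snippet feature channels: squeeze by global averaging, excite through two small dense layers, then rescale each channel by a sigmoid gate. All shapes, weights, and the reduction ratio are assumptions for the sketch, not the paper's settings.

```python
import numpy as np

def se_reweight(features, w1, w2):
    """Generic Squeeze-and-Excitation reweighting (illustrative).
    features: (num_snippets, channels); w1: (channels, channels // r);
    w2: (channels // r, channels), where r is the reduction ratio."""
    squeezed = features.mean(axis=0)              # squeeze: (channels,)
    hidden = np.maximum(squeezed @ w1, 0.0)       # excite, ReLU
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))  # sigmoid gates in (0, 1)
    return features * gates                       # channel-wise rescale

# Toy usage with hypothetical shapes: 4 snippets, 8 channels, r = 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = se_reweight(x, rng.normal(size=(8, 2)), rng.normal(size=(2, 8)))
```

Because every gate lies in (0, 1), the block can only attenuate channels, letting the network emphasize informative feature channels relative to the rest.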

CLC Number: