Video-based person re-identification method based on graph convolution network and self-attention graph pooling

doi:10.11772/j.issn.1001-9081.2022010034

Abstract

To address the poor performance of video-based person re-identification caused by occlusion, spatial misalignment, and background clutter in cross-camera network videos, a video-based person re-identification method based on Graph Convolutional Network (GCN) and Self-Attention Graph Pooling (SAGP) was proposed. Firstly, patch relation graph modeling was used to mine the correlations among different regions across the frames of a video, and the region features of each frame were refined with GCN to alleviate problems such as occlusion and misalignment. Then, regions contributing little to the person features were removed by the SAGP mechanism to avoid interference from background clutter. Finally, a weighted loss function strategy was proposed, in which center loss was used to optimize the classification learning results, and Online soft mining and Class-aware attention Loss (OCL) was used to address the insufficient use of available samples during hard sample mining. Experimental results on the MARS dataset show that, compared with the second-best method, Attribute-aware Identity-hard Triplet Loss (AITL), the proposed method improves mean Average Precision (mAP) and Rank-1 by 1.3 and 2.0 percentage points, respectively. The proposed method makes better use of the spatial-temporal information in videos to extract more discriminative person features and improves person re-identification performance.
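
The GCN refinement and SAGP pruning steps named in the abstract can be illustrated with a minimal Python sketch. It assumes the standard graph convolution of Kipf and Welling and the self-attention pooling score of Lee et al. The 4-patch chain graph, the fixed weights, the pooling ratio of 0.5, and the helper names (`normalized_adjacency`, `gcn_layer`, `sag_pool`) are illustrative assumptions, not the authors' implementation, in which the weights are learned and the graph comes from patch relation modeling.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a square adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # never zero because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    out = matmul(matmul(normalized_adjacency(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def sag_pool(adj, feats, att_weight, ratio=0.5):
    """Keep the ceil(ratio * N) nodes with the highest self-attention score."""
    raw = matmul(matmul(normalized_adjacency(adj), feats), att_weight)
    scores = [math.tanh(row[0]) for row in raw]
    k = math.ceil(ratio * len(feats))
    ranked = sorted(range(len(feats)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    # Gate the surviving features by their scores and cut the graph down to them.
    pooled_feats = [[scores[i] * v for v in feats[i]] for i in keep]
    pooled_adj = [[adj[i][j] for j in keep] for i in keep]
    return keep, pooled_feats, pooled_adj


# Hypothetical toy input: 4 patches in a chain; patches 0-1 resemble the person,
# patches 2-3 resemble background (feature values are made up for illustration).
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
feats = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.8], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]   # fixed identity instead of a learned W
att_weight = [[1.0], [-1.0]]        # fixed instead of a learned attention vector

refined = gcn_layer(adj, feats, weight)
keep, pooled_feats, pooled_adj = sag_pool(adj, refined, att_weight, ratio=0.5)
print("kept patches:", keep)
```

In this toy setting the low-scoring, background-like patches are dropped and only the sub-graph over the remaining patches is passed on, which is the role the abstract assigns to SAGP.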

Key words: video-based person re-identification, Graph Convolutional Network (GCN), Self-Attention Graph Pooling (SAGP), weighted loss function strategy, center loss

CLC Number: TP391.4

Yingmao YAO, Xiaoyan JIANG. Video-based person re-identification method based on graph convolution network and self-attention graph pooling[J]. Journal of Computer Applications, 2023, 43(3): 728-735.

Method	MARS (%)				DukeMTMC-VideoReID (%)
	mAP	R1	R5	R20	mAP	R1	R5	R20
CNN+CQDA	47.6	65.3	82.0	89.0	—	—	—	—
TAM+SRM	50.7	70.6	90.0	97.6	—	—	—	—
SSA+CASE	76.1	86.3	94.7	98.2	—	—	—	—
3DCNN+NLA	79.5	88.6	96.4	98.8	93.7	95.5	99.3	99.7
COSAM	79.9	84.9	95.5	97.9	94.1	95.4	99.3	99.8
STA	80.8	86.3	95.7	98.1	94.6	96.2	99.3	99.6
MA	80.9	87.3	—	—	94.8	96.7	—	—
STE-NVAN	81.2	88.9	—	—	93.5	95.2	—	—
VKD	83.1	89.4	96.8	—	93.5	95.2	98.6	—
AITL	84.4	88.2	96.5	98.4	95.3	95.4	99.6	99.9
Proposed method	85.7	90.2	96.7	98.1	95.8	96.7	99.3	99.9
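
The improvement over AITL quoted in the abstract can be checked directly from the MARS columns of the table above. The short Python sketch below only repeats that arithmetic; the dictionary layout and the helper name `gain_in_points` are assumptions made for illustration.

```python
# mAP and Rank-1 (%) on MARS, copied from the comparison table above.
results = {
    "AITL": {"mAP": 84.4, "R1": 88.2},
    "Proposed": {"mAP": 85.7, "R1": 90.2},
}


def gain_in_points(ours, baseline, digits=1):
    """Difference in percentage points, rounded to the table's precision."""
    return {metric: round(ours[metric] - baseline[metric], digits) for metric in ours}


gains = gain_in_points(results["Proposed"], results["AITL"])
print(gains)
```

The differences are absolute percentage points, not relative improvements.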

Model	mAP (%)	R1 (%)	R5 (%)	R20 (%)
Baseline	84.2	88.7	96.0	97.7
Baseline+GCN	85.3	88.7	96.6	98.3
Baseline+GCN+SAGP	85.4	89.2	96.4	98.2
Baseline+CL+OCL	85.1	88.9	96.7	98.3
Baseline+GCN+SAGP+CL+OCL	85.7	90.2	96.7	98.1

Number of patches	mAP (%)	R1 (%)	R5 (%)	R20 (%)
2	84.8	88.8	96.1	98.1
4	85.7	90.2	96.7	98.1
8	85.2	89.0	96.1	98.4