Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (9): 2652-2658.DOI: 10.11772/j.issn.1001-9081.2021071201
Special Issue: Artificial Intelligence
• Artificial Intelligence •
Received: 2021-07-12
Revised: 2021-09-06
Accepted: 2021-09-08
Online: 2021-09-14
Published: 2022-09-10
Contact: Jianliang LI (李建良)
About author: CAI Chunhao (蔡淳豪), born in 1997 in Wuxi, Jiangsu, M. S. candidate. His research interests include model distillation, deep learning, and image recognition.
CLC Number:
Chunhao CAI, Jianliang LI. Model distillation model based on training weak teacher networks about few-shots[J]. Journal of Computer Applications, 2022, 42(9): 2652-2658.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021071201
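The paper builds on knowledge distillation, in which a small student network is trained to match the temperature-softened outputs of a teacher. As background, here is a minimal NumPy sketch of the classic soft-label distillation loss of Hinton et al.; it is illustrative only and is not the authors' weak-teacher ensemble method, and the function names, default temperature, and weighting are assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: higher T produces a softer distribution.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, labels_onehot,
                      T=4.0, alpha=0.5):
    """Classic soft-label distillation loss: a weighted sum of
    (a) KL divergence between the teacher's and student's
    temperature-softened outputs and (b) ordinary cross-entropy
    against the hard labels."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))  # KL(teacher || student)
    ce = -np.sum(np.asarray(labels_onehot) * np.log(softmax(student_logits)))
    # The T^2 factor rescales soft-target gradients so the two terms
    # stay comparable in magnitude, as in the original formulation.
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

When the student already matches the teacher exactly, the KL term vanishes and only the hard-label cross-entropy remains; the `alpha` weight trades off imitation of the teacher against fitting the labels.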
| Model | | | | | | |
|---|---|---|---|---|---|---|
| Meta-learning model | 1 | 1 | 0 | 2 | 0 | 1 |
| Ensemble distillation model + Boosting | 1 | 20 | 3 | 0 | 1 | 0 |
| Ensemble distillation model | 1 | 20 | 3 | 0 | 1 | 0 |
| Ensemble distillation model + external network | 2 | 21 | 1 | 1 | 1 | 1 |

Tab. 1 Hyperparameter settings of different models on CUB200 dataset
| Model | Accuracy/% | Running time/h |
|---|---|---|
| Classical model distillation | 42.15±0.75 | 5.75 |
| Meta-learning model | 65.05±1.19 | 5.35 |
| Ensemble distillation model + Boosting | 69.37±0.66 | 32.72 |
| Ensemble distillation model | 58.07±0.73 | 5.68 |
| Ensemble distillation model + external network | 69.21±0.82 | 6.55 |

Tab. 2 Accuracy and computing time comparison of different models on CUB200 dataset
| Model | Number of training images per class | | | | |
|---|---|---|---|---|---|
| | 100 | 200 | 400 | 700 | 1 000 |
| Classical model | 51.90±0.58 | 58.58±0.86 | 67.34±0.97 | 77.53±1.30 | 81.08±1.27 |
| Attention model | 62.32±0.91 | 67.97±1.43 | 77.43±0.49 | 81.44±1.76 | 83.17±0.79 |
| Meta-learning model | 67.95±1.26 | 71.91±1.31 | 79.42±1.20 | 83.10±0.32 | 85.46±0.72 |
| Ensemble distillation model | 72.56±1.13 | 81.57±0.73 | 81.57±0.73 | 84.60±1.19 | 86.62±1.52 |

Tab. 3 Accuracies of different models on CIFAR-10 dataset's images with different scales