Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (9): 2652-2658. DOI: 10.11772/j.issn.1001-9081.2021071201

• Artificial Intelligence •


Model distillation model based on training weak teacher networks for few-shot problems

Chunhao CAI, Jianliang LI

  1. School of Science, Nanjing University of Science and Technology, Nanjing, Jiangsu 210094, China
  • Received: 2021-07-12  Revised: 2021-09-06  Accepted: 2021-09-08  Online: 2021-09-14  Published: 2022-09-10
  • Contact: Jianliang LI
  • About author: CAI Chunhao, born in 1997 in Wuxi, Jiangsu, M. S. candidate. His research interests include model distillation, deep learning and image recognition.
  • Supported by:
    Equipment Pre-Research and CETC Joint Fund (6141B08231109)


Abstract:

Aiming at the lack of training data for deep neural networks in image recognition, as well as the loss of detailed features and the heavy distillation computation in multi-model distillation, a model distillation model based on training weak teacher networks for few-shot problems was proposed. Firstly, a set of weak teacher networks was trained through the Bootstrap aggregating (Bagging) algorithm from ensemble learning, which retained the detailed features of the image dataset while allowing parallel computation to improve the efficiency of network generation. Then, a knowledge merging algorithm was incorporated, and a single high-quality, high-complexity teacher network was formed from the feature maps of the weak teacher networks, thereby obtaining image feature maps with more prominent details. Finally, on the basis of current advanced model distillation methods, an ensemble distillation model was proposed in which the meta-network was improved for the combined feature maps; this model reduced the computation of meta-network training while enabling the target network to be trained on few-shot datasets. Experimental results show that the proposed model achieved a 6.39% relative improvement in accuracy over the distillation scheme that simply uses a high-quality network as the teacher network. Comparing the accuracy of the model obtained by training teacher networks with the Adaptive Boosting (AdaBoost) algorithm and then distilling them against that of the ensemble distillation model, the difference is within the given error range, while the network generation rate of the ensemble distillation model was increased by 4.76 times compared with that of the AdaBoost algorithm. Therefore, the proposed model can effectively improve the accuracy and training efficiency of the target model for few-shot problems.
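
The pipeline described in the abstract (weak teachers trained on Bagging resamples of a small dataset, their feature maps merged into one richer representation, and a target network distilled from the result) can be pictured with a minimal PyTorch sketch. Everything below is an assumption made for illustration only: the tiny convolutional backbones, the concatenation-plus-1×1-convolution stand-in for the knowledge-merging step, and the temperature and loss weights are not the paper's actual networks or its improved meta-network.

```python
# Illustrative sketch only: Bagging-trained weak teachers, a simple feature-map
# merge, and logit/feature distillation into a student. Architectures, the 1x1-conv
# merge, and loss weights are assumptions for this example, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeakTeacher(nn.Module):
    """A deliberately weak backbone (assumed, for illustration)."""
    def __init__(self, out_ch=16, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.head = nn.Linear(out_ch * 4 * 4, n_classes)

    def forward(self, x):
        f = self.features(x)                      # feature map kept for merging
        return self.head(f.flatten(1)), f

# --- few-shot data stand-in (random tensors instead of a real dataset) ---
N, n_classes = 64, 10
X = torch.randn(N, 3, 32, 32)
y = torch.randint(0, n_classes, (N,))

# 1) Bagging: train each weak teacher on a bootstrap resample of the small set.
teachers = []
for _ in range(3):
    idx = torch.randint(0, N, (N,))               # sample with replacement
    t = WeakTeacher()
    opt = torch.optim.Adam(t.parameters(), lr=1e-3)
    for _ in range(5):                            # a few steps, just to illustrate
        logits, _ = t(X[idx])
        loss = F.cross_entropy(logits, y[idx])
        opt.zero_grad(); loss.backward(); opt.step()
    teachers.append(t.eval())

# 2) Merge teacher feature maps into one combined map (simple concat + 1x1 conv;
#    the paper's knowledge-merging step is richer than this).
merge = nn.Conv2d(16 * len(teachers), 16, kernel_size=1)

# 3) Distill: the student matches the merged feature map and averaged soft logits.
student = WeakTeacher()
opt = torch.optim.Adam(list(student.parameters()) + list(merge.parameters()), lr=1e-3)
T = 4.0                                           # softmax temperature (assumed)
for _ in range(5):
    with torch.no_grad():
        outs = [t(X) for t in teachers]
        soft = torch.stack([o[0] for o in outs]).mean(0) / T
    merged = merge(torch.cat([o[1] for o in outs], dim=1))
    s_logits, s_feat = student(X)
    loss = (F.kl_div(F.log_softmax(s_logits / T, dim=1),
                     F.softmax(soft, dim=1), reduction="batchmean") * T * T
            + F.mse_loss(s_feat, merged)
            + F.cross_entropy(s_logits, y))
    opt.zero_grad(); loss.backward(); opt.step()
```

In the paper, the merging of teacher knowledge and the meta-network that guides distillation are learned, more elaborate components; the sketch only shows where those steps sit in the overall flow.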

Key words: few-shot, model distillation, Ensemble Learning (EL), meta learning, feature merging

CLC Number: