Co-training algorithm with combination of active learning and density peak clustering

Journal of Computer Applications, 2019, Vol. 39, Issue (8): 2297-2301. DOI: 10.11772/j.issn.1001-9081.2019010075

GONG Yanlu (龚彦鹭)^1,2, LYU Jia (吕佳)^1,2

1. College of Computer and Information Sciences, Chongqing Normal University, Chongqing 401331, China;
2. Chongqing Center of Engineering Technology Research on Digital Agriculture Service, Chongqing Normal University, Chongqing 401331, China

Received: 2019-01-11 Revised: 2019-03-20 Online: 2019-04-15 Published: 2019-08-10
Corresponding author: LYU Jia
About the authors: GONG Yanlu (born 1995), female, from Chongqing, is a master's student; her main research interests are machine learning and data mining. LYU Jia (born 1978), female, from Dazhou, Sichuan, is a professor with a Ph.D. and a CCF member; her main research interests are machine learning and data mining.
Supported by:
This work is partially supported by the Natural Science Foundation of Chongqing (cstc2014jcyjA40011), the Science and Technology Project of Chongqing Education Commission (KJ1400513), and the Scientific Research Project of Chongqing Normal University (YKC17001, YKC19018).

Abstract

Co-training algorithms tend to mislabel highly ambiguous samples, which lowers classifier accuracy, and the unlabeled samples added in each iteration do not carry enough useful hidden information. To address these problems, a co-training algorithm combining active learning and density peak clustering was proposed. Before each iteration, unlabeled samples with high ambiguity were selected, actively labeled, and added to the labeled sample set; density peak clustering was then applied to the unlabeled samples to obtain the density and relative distance of each one. During the iteration, unlabeled samples with higher density and larger relative distance were selected and handed to a Naive Bayes (NB) classifier for classification. This process was repeated until the termination condition was satisfied. Actively labeling the highly ambiguous samples mitigates the classifier's mislabeling problem, and density peak clustering selects samples that better represent the structure of the data space. Experimental results on 8 UCI datasets and the pima dataset from Kaggle show that, compared with the SSLNBCA (Semi-Supervised Learning combining NB Co-training with Active learning) algorithm, the proposed algorithm improves accuracy by up to 6.67 percentage points, and by 1.46 percentage points on average.

Key words: co-training, active learning, density peak, Naive Bayes (NB), view
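
The two per-sample quantities named in the abstract, density and relative distance, together with an ambiguity score, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the Gaussian kernel, the cutoff distance `d_c`, and the specific fuzziness formula are assumptions, since this page does not state them. The density and relative distance follow the standard density peak clustering definitions (local density from pairwise distances; relative distance as the distance to the nearest sample of higher density).

```python
import numpy as np

def density_peak_scores(X, d_c):
    """Return (rho, delta) for each sample: local density and relative distance."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Local density with a Gaussian kernel (assumed choice);
    # subtract 1 to remove each sample's own contribution exp(0).
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        if higher.any():
            # Distance to the nearest sample with strictly higher density.
            delta[i] = dist[i, higher].min()
        else:
            # Densest sample: by convention, its largest distance to any sample.
            delta[i] = dist[i].max()
    return rho, delta

def fuzziness(proba):
    """Ambiguity of each row of class-membership probabilities (assumed measure).

    Larger values mean the classifier is less certain about the sample.
    """
    p = np.clip(np.asarray(proba, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p)).mean(axis=1)

# Toy data: a small cluster near the origin plus one distant outlier.
X = [[0, 0], [0, 1], [1, 0], [0.3, 0.3], [10, 10]]
rho, delta = density_peak_scores(X, d_c=1.0)
scores = fuzziness([[0.5, 0.5], [0.9, 0.1]])
```

In a setting like the one described, samples with the largest `fuzziness` values would be the candidates sent for manual labeling, and samples with both high `rho` and large `delta` would be the candidates passed to the classifier; how the two quantities are combined into one ranking is left open here.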

CLC Number: TP181

GONG Yanlu, LYU Jia. Co-training algorithm with combination of active learning and density peak clustering[J]. Journal of Computer Applications, 2019, 39(8): 2297-2301.


