Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (8): 2297-2301. DOI: 10.11772/j.issn.1001-9081.2019010075

• Data Science and Technology •

Co-training algorithm with combination of active learning and density peak clustering

GONG Yanlu1,2, LYU Jia1,2

  1. College of Computer and Information Sciences, Chongqing Normal University, Chongqing 401331, China;
  2. Chongqing Center of Engineering Technology Research on Digital Agriculture Service, Chongqing Normal University, Chongqing 401331, China
  • Received:2019-01-11 Revised:2019-03-20 Online:2019-08-10 Published:2019-04-15
  • Corresponding author: LYU Jia
  • About the authors: GONG Yanlu (born 1995), female, from Chongqing, M.S. candidate; main research interests: machine learning, data mining. LYU Jia (born 1978), female, from Dazhou, Sichuan, professor, Ph.D., CCF member; main research interests: machine learning, data mining.
  • Supported by:
    This work is partially supported by the Natural Science Foundation of Chongqing (cstc2014jcyjA40011), the Science and Technology Project of Chongqing Education Commission (KJ1400513), and the Scientific Research Project of Chongqing Normal University (YKC17001, YKC19018).

Abstract: Co-training algorithms tend to mislabel samples with high ambiguity, which lowers classifier accuracy, and the unlabeled samples added in each iteration do not carry enough useful hidden information. To address these problems, a co-training algorithm combining active learning and density peak clustering is proposed. Before each iteration, unlabeled samples with high ambiguity are selected, actively labeled, and added to the labeled sample set; density peak clustering is then applied to the unlabeled samples to obtain the density and relative distance of each one. During iteration, unlabeled samples with higher density and larger relative distance are selected and passed to a Naive Bayes (NB) classifier for labeling. This process is repeated until the termination condition is satisfied. Actively labeling high-ambiguity samples mitigates the classifier's mislabeling problem, and density peak clustering selects samples that better reflect the structure of the data space. Experimental results on 8 UCI datasets and the pima dataset from Kaggle show that, compared with the SSLNBCA (Semi-Supervised Learning combining NB Co-training with Active learning) algorithm, the proposed algorithm improves accuracy by up to 6.67 percentage points, with an average improvement of 1.46 percentage points.

Key words: co-training, active learning, density peak, Naive Bayes (NB), view
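
The two selection steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define "ambiguity", so the sketch assumes it is the entropy of a sample's class posterior. Density and relative distance follow the standard density peak clustering definitions with a Gaussian kernel and a cutoff distance `d_c`. Ranking candidates by the product of density and relative distance is likewise an assumption, and the paper's actual rule may differ. All function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-posterior vector (assumed ambiguity measure)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_ambiguous(posteriors, k):
    """Indices of the k samples with the highest posterior entropy.

    These are the samples that would be sent for active (manual) labeling.
    """
    order = sorted(range(len(posteriors)),
                   key=lambda i: entropy(posteriors[i]), reverse=True)
    return order[:k]


def density_peak_scores(points, d_c):
    """Return (rho, delta) per point using standard density peak definitions.

    rho[i]:   Gaussian-kernel local density with cutoff distance d_c.
    delta[i]: distance to the nearest point of higher density; for the point
              of highest density, the largest distance to any other point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def select_representative(points, d_c, k):
    """Indices of the k points with the largest rho * delta (assumed ranking)."""
    rho, delta = density_peak_scores(points, d_c)
    order = sorted(range(len(points)),
                   key=lambda i: rho[i] * delta[i], reverse=True)
    return order[:k]
```

In a co-training loop of the kind the abstract describes, `most_ambiguous` would be applied to the classifier's posteriors over the unlabeled pool before each iteration. `select_representative` would then pick the unlabeled samples handed to the NB classifier. The value of `d_c` is a tuning parameter that the abstract does not specify.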

CLC Number: