Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (1): 110-115. DOI: 10.11772/j.issn.1001-9081.2017071721

• Artificial Intelligence •

Self-training method based on semi-supervised clustering and data editing

LYU Jia, LI Junnan

  1. College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
  • Received: 2017-07-13 Revised: 2017-09-02 Online: 2018-01-10 Published: 2018-01-22
  • Corresponding author: LYU Jia
  • About the authors: LYU Jia (born 1978), female, from Chongqing, professor, Ph.D., CCF member; research interests: machine learning and data mining. LI Junnan (born 1992), male, from Chongqing, master's student; research interests: machine learning and data mining.
  • Supported by:
    Chongqing Natural Science Foundation of China (cstc2014jcyjA40011), Science and Technology Project of Chongqing Municipal Education Commission (KJ1400513), Chongqing Scientific Research Project (CYS17176), Chongqing Normal University Research Project (YKC17001).

Abstract: Self-training has two problems: the high-confidence unlabeled samples it selects in each iteration carry little information, and it easily mislabels unlabeled samples. To address both, a Naive Bayes self-training method combining semi-supervised clustering and data editing was proposed. In each iteration, semi-supervised clustering was first performed on a small number of labeled samples together with a large number of unlabeled samples, and the unlabeled samples with high cluster membership were selected and classified by Naive Bayes. A data editing technique was then used to filter out the unlabeled samples that had high cluster membership but were misclassified by Naive Bayes. Because this data editing technique uses information from both labeled and unlabeled samples to filter noise, it avoids the performance degradation that traditional data editing may suffer when labeled samples are scarce. Comparative experiments on UCI datasets verified the effectiveness of the proposed algorithm.

Key words: self-training, semi-supervised learning, semi-supervised clustering, data editing, nearest neighbor

CLC number:
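
The iteration described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the clustering algorithm or the editing rule, so the sketch assumes a seeded k-means with inverse-distance membership as the semi-supervised clustering step, a Gaussian Naive Bayes classifier, and a k-nearest-neighbor majority vote over labeled samples plus the other candidates as the data editing step. All function names and the parameters `threshold`, `k`, `n_iter` and `max_iter` are hypothetical.

```python
import math
from collections import Counter

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroids(X, y):
    """Mean vector of each class."""
    out = {}
    for c in set(y):
        rows = [x for x, t in zip(X, y) if t == c]
        out[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return out

def memberships(X_lab, y_lab, X_unl, n_iter=5):
    """Assumed stand-in for semi-supervised clustering: k-means seeded by the
    labeled class means. Returns the highest normalized inverse-distance
    membership of each unlabeled sample."""
    cents = centroids(X_lab, y_lab)
    for _ in range(n_iter):
        assign = [min(cents, key=lambda c: dist(x, cents[c])) for x in X_unl]
        cents = centroids(X_lab + X_unl, y_lab + assign)
    result = []
    for x in X_unl:
        inv = [1.0 / (dist(x, m) + 1e-9) for m in cents.values()]
        result.append(max(inv) / sum(inv))
    return result

def fit_nb(X, y):
    """Gaussian Naive Bayes: per-class log prior, feature means and variances."""
    model = {}
    for c in set(y):
        rows = [x for x, t in zip(X, y) if t == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        varis = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-3
                 for col, m in zip(zip(*rows), means)]
        model[c] = (math.log(len(rows) / len(y)), means, varis)
    return model

def predict_nb(model, x):
    def score(c):
        prior, means, varis = model[c]
        return prior + sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                           for xi, m, v in zip(x, means, varis))
    return max(model, key=score)

def edit(cands, X_lab, y_lab, k=3):
    """Assumed editing rule: keep a candidate (x, label) only if the majority
    of its k nearest neighbors, drawn from the labeled samples and the other
    candidates, carries the same label."""
    keep = []
    for i, (x, lab) in enumerate(cands):
        pool = list(zip(X_lab, y_lab)) + [c for j, c in enumerate(cands) if j != i]
        nearest = sorted(pool, key=lambda p: dist(x, p[0]))[:k]
        votes = Counter(label for _, label in nearest)
        keep.append(votes.most_common(1)[0][0] == lab)
    return keep

def self_train(X_lab, y_lab, X_unl, threshold=0.8, max_iter=10):
    X_lab, y_lab, X_unl = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_iter):
        if not X_unl:
            break
        model = fit_nb(X_lab, y_lab)
        # Step 1: select unlabeled samples with high cluster membership.
        mem = memberships(X_lab, y_lab, X_unl)
        picked = [i for i, m in enumerate(mem) if m >= threshold]
        # Step 2: label the selected samples with Naive Bayes.
        cands = [(X_unl[i], predict_nb(model, X_unl[i])) for i in picked]
        # Step 3: data editing filters out likely mislabeled candidates.
        keep = edit(cands, X_lab, y_lab)
        accepted = {i for i, ok in zip(picked, keep) if ok}
        if not accepted:
            break
        for (x, lab), ok in zip(cands, keep):
            if ok:
                X_lab.append(x)
                y_lab.append(lab)
        X_unl = [x for i, x in enumerate(X_unl) if i not in accepted]
    return fit_nb(X_lab, y_lab), X_lab, y_lab
```

Calling `self_train(X_lab, y_lab, X_unl)` with lists of numeric feature vectors returns the final Naive Bayes model together with the enlarged labeled set. Selecting candidates by cluster membership rather than by classifier confidence is what lets the loop pick up samples that the current classifier is not already sure about, while the editing step guards against the extra labeling noise this can introduce.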