Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (10): 2822-2828. DOI: 10.11772/j.issn.1001-9081.2019040606

• Artificial Intelligence •

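Semi-supervised self-training positive and unlabeled learning based on new spy technology

LI Tingting1,2, LYU Jia1,2, FAN Weiya1,2

  1. College of Computer and Information Sciences, Chongqing Normal University, Chongqing 401331, China;
  2. Chongqing Digital Agricultural Service Engineering Technology Research Center (Chongqing Normal University), Chongqing 401331, China
  • Received: 2019-04-12 Revised: 2019-06-09 Online: 2019-10-10 Published: 2019-10-14
  • Corresponding author: LYU Jia
  • About the authors: LI Tingting (born 1994), female, from Chongqing, M.S. candidate; research interests: machine learning, data mining. LYU Jia (born 1978), female, from Dazhou, Sichuan, Ph.D., professor, CCF member; research interests: machine learning, data mining. FAN Weiya (born 1993), male, from Kaifeng, Henan, M.S. candidate; research interests: data mining, deep learning.
  • Supported by:
    Natural Science Foundation of Chongqing (cstc2014jcyjA40011); Science and Technology Project of Chongqing Education Commission (KJ1400513); Scientific Research Project of Chongqing Normal University (YKC19018).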

Abstract: Spy technology in Positive and Unlabeled (PU) learning is highly susceptible to noise and outliers, which makes the extracted reliable positive instances impure, and selecting spy instances at random from the initial positive instances tends to make the extraction of reliable negative instances inefficient. To address these problems, a PU learning framework combining new spy technology with semi-supervised self-training was proposed. Firstly, the initial labeled instances were clustered, and the instances closest to the cluster centers were selected in place of randomly chosen spy instances. These instances reflect the distribution structure of the unlabeled instances effectively and therefore better assist the selection of reliable negative instances. Then, the reliable positive instances produced by spy technology were purified by self-training, and a second round of training recovered the reliable negative instances that had been misclassified as positive. The proposed framework can thus address the problem that the classification performance of traditional spy technology in PU learning is sensitive to the data distribution and to random spy instances. Experiments on nine standard data sets show that the average classification accuracy and F-measure of the proposed framework are higher than those of the basic PU learning algorithm (Basic_PU), the PU learning algorithm based on spy technology (SPY), the self-training PU learning algorithm based on Naive Bayes (NBST), and the PU learning algorithm based on iterative pruning (Pruning).

Key words: Positive and Unlabeled (PU) learning, spy technology, semi-supervised self-training, clustering, reliable negative instance, reliable positive instance
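
The first step described in the abstract, choosing spies near cluster centers and then using them to extract reliable negatives, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names (`kmeans`, `select_spies`, `reliable_negatives`), the parameters `k` and `ratio`, the toy data `P` and `U`, and the centroid-distance scorer (a stand-in for whatever base classifier the paper actually uses) are all hypothetical. The self-training purification step is not covered.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(points, k, iters=20):
    """Tiny deterministic k-means: the first k points seed the centers."""
    centers = [tuple(p) for p in points[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

def select_spies(positives, k=1, ratio=0.3):
    """In each cluster, pick the positives closest to the cluster center."""
    centers, groups = kmeans(positives, k)
    spies = []
    for center, group in zip(centers, groups):
        if not group:
            continue
        ranked = sorted(group, key=lambda p: math.dist(p, center))
        spies.extend(ranked[:max(1, round(ratio * len(group)))])
    return spies

def reliable_negatives(positives, unlabeled, k=1, ratio=0.3):
    """Spy step: hide spies among the unlabeled set, then threshold on them."""
    spies = select_spies(positives, k, ratio)
    kept = [p for p in positives if p not in spies]
    # Assumed stand-in scorer: compare distances to the two class centroids.
    pos_center = centroid(kept)
    neg_center = centroid(unlabeled + spies)

    def score(x):
        # Higher means "looks more positive" under the stand-in scorer.
        return math.dist(x, neg_center) - math.dist(x, pos_center)

    # Unlabeled instances scoring below every spy are taken as reliable negatives.
    threshold = min(score(s) for s in spies)
    return [u for u in unlabeled if score(u) < threshold]

# Toy data (hypothetical): positives near the origin, unlabeled is a mixture.
P = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2), (1.0, 1.0), (-0.3, -0.1)]
U = [(0.1, 0.1), (-0.2, 0.0), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]

print(select_spies(P))
print(reliable_negatives(P, U))
```

Because the spies are drawn from the dense core of the positive class rather than at random, an outlier such as `(1.0, 1.0)` is unlikely to be chosen as a spy and so does not drag the threshold down. The sketch assumes enough positives remain after the spies are removed; with very small clusters, `kept` could become empty.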
