Semi-supervised self-training positive and unlabeled learning based on new spy technology

Journal of Computer Applications, 2019, Vol. 39, Issue (10): 2822-2828. DOI: 10.11772/j.issn.1001-9081.2019040606

LI Tingting^1,2, LYU Jia^1,2, FAN Weiya^1,2

1. College of Computer and Information Sciences, Chongqing Normal University, Chongqing 401331, China;
2. Chongqing Digital Agricultural Service Engineering Technology Research Center (Chongqing Normal University), Chongqing 401331, China

Received: 2019-04-12 Revised: 2019-06-09 Online: 2019-10-10 Published: 2019-10-14
Corresponding author: LYU Jia
About the authors: LI Tingting (born 1994), female, from Chongqing, M.S. candidate; research interests: machine learning, data mining. LYU Jia (born 1978), female, from Dazhou, Sichuan, professor, Ph.D., CCF member; research interests: machine learning, data mining. FAN Weiya (born 1993), male, from Kaifeng, Henan, M.S. candidate; research interests: data mining, deep learning.
Supported by:
This work is partially supported by the Natural Science Foundation of Chongqing (cstc2014jcyjA40011); the Science and Technology Project of Chongqing Education Commission (KJ1400513); the Scientific Research Project of Chongqing Normal University (YKC19018).

Abstract

The spy technique in Positive and Unlabeled (PU) learning is highly susceptible to noise and outliers, which makes the reliable positive instances it identifies impure. In addition, selecting spy instances at random from the initial positive instances tends to make the identification of reliable negative instances inefficient. To address these problems, a PU learning framework combining a new spy technique with semi-supervised self-training was proposed. Firstly, the initial labeled instances were clustered, and the instances closest to the cluster centers were selected in place of randomly chosen spy instances. These instances reflect the distribution structure of the unlabeled instances effectively and therefore better assist the selection of reliable negative instances. Then, the reliable positive instances identified by the spy technique were purified by self-training, and a second round of training was used to recover the reliable negative instances that had been mistakenly classified as positive. The proposed framework addresses the problem that the classification performance of the traditional spy technique in PU learning is easily affected by the data distribution and by randomly chosen spy instances. Simulation experiments on nine standard datasets show that the average classification accuracy and F-measure of the proposed framework are higher than those of the basic PU learning algorithm (Basic_PU), the PU learning algorithm based on spy technology (SPY), the self-training PU learning algorithm based on Naive Bayes (NBST) and the PU learning algorithm based on iterative pruning (Pruning).

Key words: Positive and Unlabeled (PU) learning, spy technology, semi-supervised self-training, clustering, reliable negative instance, reliable positive instance
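The first step described in the abstract (cluster the labeled positives, use the instances nearest the cluster centers as spies, then use the spies to pick reliable negatives) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, the base classifier or the threshold rule, so everything concrete here is an assumption: plain k-means, a Gaussian naive Bayes scorer, a threshold equal to the lowest spy posterior, and the hypothetical helper names `select_spies` and `reliable_negatives`. The self-training purification step is not shown.

```python
import math
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def assign(points, centers):
    """Group point indices by their nearest center."""
    groups = [[] for _ in centers]
    for i, p in enumerate(points):
        j = min(range(len(centers)), key=lambda c: dist2(p, centers[c]))
        groups[j].append(i)
    return groups

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations (assumed); an empty cluster keeps its old center."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = assign(points, centers)
        centers = [mean([points[i] for i in g]) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, assign(points, centers)

def select_spies(positives, k=2, ratio=0.2, seed=0):
    """Indices of the positives closest to each cluster center (hypothetical helper)."""
    centers, groups = kmeans(positives, k, seed=seed)
    spies = []
    for c, g in zip(centers, groups):
        n = max(1, round(ratio * len(g))) if g else 0
        spies += sorted(g, key=lambda i: dist2(positives[i], c))[:n]
    return spies

def fit_gnb(pos, neg):
    """Gaussian naive Bayes parameters (log prior, means, variances) per class."""
    total = len(pos) + len(neg)
    model = []
    for data in (pos, neg):
        mu = mean(data)
        var = [sum((x - m) ** 2 for x in col) / len(data) + 1e-6
               for col, m in zip(zip(*data), mu)]
        model.append((math.log(len(data) / total), mu, var))
    return model

def posterior_pos(model, x):
    """Posterior probability of the positive class for point x."""
    logs = []
    for prior, mu, var in model:
        logs.append(prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, mu, var)))
    top = max(logs)
    pos, neg = (math.exp(l - top) for l in logs)
    return pos / (pos + neg)

def reliable_negatives(positives, unlabeled, k=2, ratio=0.2, seed=0):
    """Unlabeled points scoring below every spy are taken as reliable negatives."""
    spy_idx = set(select_spies(positives, k, ratio, seed))
    spies = [positives[i] for i in spy_idx]
    rest = [p for i, p in enumerate(positives) if i not in spy_idx]
    # Spies are mixed into the unlabeled set, which is treated as negative.
    model = fit_gnb(rest, unlabeled + spies)
    threshold = min(posterior_pos(model, s) for s in spies)
    return [u for u in unlabeled if posterior_pos(model, u) < threshold]

# Toy data (made up for illustration): two positive clusters, and an
# unlabeled set holding two hidden positives plus four distant negatives.
positives = [[0.0, 0.0], [0.3, 0.1], [-0.2, 0.2], [0.1, -0.3], [-0.1, -0.1],
             [0.0, 4.0], [0.2, 4.2], [-0.3, 3.9], [0.1, 3.7], [-0.2, 4.1]]
hidden_pos = [[0.1, 0.1], [-0.1, 4.0]]
true_neg = [[9.0, 9.5], [10.0, 10.0], [10.5, 9.0], [9.5, 10.5]]
unlabeled = hidden_pos + true_neg
rn = reliable_negatives(positives, unlabeled)
print(rn)
```

Choosing spies near the cluster centers, rather than at random, keeps outlying positives from setting the threshold. In this sketch an outlying spy would lower the minimum spy posterior and so shrink the set of reliable negatives.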

CLC Number: TP181

LI Tingting, LYU Jia, FAN Weiya. Semi-supervised self-training positive and unlabeled learning based on new spy technology[J]. Journal of Computer Applications, 2019, 39(10): 2822-2828.

