Self-training method based on semi-supervised clustering and data editing

Journal of Computer Applications, 2018, Vol. 38, Issue (1): 110-115. DOI: 10.11772/j.issn.1001-9081.2017071721

LYU Jia, LI Junnan

College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China

Received: 2017-07-13 Revised: 2017-09-02 Online: 2018-01-22 Published: 2018-01-10
Corresponding author: LYU Jia
About the authors: LYU Jia (born 1978), female, from Chongqing, professor, Ph.D., CCF member; research interests: machine learning and data mining. LI Junnan (born 1992), male, from Chongqing, M.S. candidate; research interests: machine learning and data mining.
Supported by:
Chongqing Natural Science Foundation of China (cstc2014jcyjA40011), Science and Technology Project of Chongqing Municipal Education Commission (KJ1400513), Chongqing Scientific Research Project (CYS17176), Chongqing Normal University Research Project (YKC17001).

Abstract

Self-training has two weaknesses: the high-confidence unlabeled samples it selects in each iteration carry little information, and it easily mislabels unlabeled samples. To address both, a Naive Bayes self-training method combining semi-supervised clustering and data editing was proposed. In each iteration, semi-supervised clustering was first performed on a small number of labeled samples together with a large number of unlabeled samples, and the unlabeled samples with high cluster membership were selected and classified by Naive Bayes. A data editing technique was then used to filter out the samples that had high cluster membership but were misclassified by Naive Bayes. Because this data editing technique filters noise using information from both labeled and unlabeled samples, it avoids the performance degradation that traditional data editing may suffer when labeled samples are scarce. Comparative experiments on UCI datasets verified the effectiveness of the proposed algorithm.

Key words: self-training, semi-supervised learning, semi-supervised clustering, data editing, nearest neighbor

CLC number: TP181
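
The loop described in the abstract can be illustrated roughly as follows. The abstract does not specify the clustering algorithm, the membership threshold, or the editing rule, so this is only a minimal sketch under stated assumptions: a seeded fuzzy c-means style clustering with fuzzifier 2, a Gaussian Naive Bayes, and a Wilson-style k-nearest-neighbor filter that votes over labeled plus pseudo-labeled samples. Every function name, every parameter (`threshold`, `k`, `max_iter`, `var_floor`), and the toy data are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def class_means(X, y):
    means = {}
    for c in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == c]
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def memberships(x, centers, eps=1e-12):
    """Fuzzy c-means style membership (fuzzifier m = 2 is an assumption)."""
    inv = {c: 1.0 / (dist2(x, mu) + eps) for c, mu in centers.items()}
    total = sum(inv.values())
    return {c: v / total for c, v in inv.items()}

def seeded_centers(X_l, y_l, X_u, iters=5):
    """Assumed semi-supervised clustering: centers start at labeled class
    means; labeled points keep membership 1, unlabeled points weigh u^2."""
    centers = class_means(X_l, y_l)
    for _ in range(iters):
        num = {c: [0.0] * len(X_l[0]) for c in centers}
        den = {c: 0.0 for c in centers}
        for x, t in zip(X_l, y_l):
            den[t] += 1.0
            num[t] = [a + b for a, b in zip(num[t], x)]
        for x in X_u:
            for c, u in memberships(x, centers).items():
                den[c] += u * u
                num[c] = [a + u * u * b for a, b in zip(num[c], x)]
        centers = {c: [a / den[c] for a in num[c]] for c in centers}
    return centers

def fit_nb(X, y, var_floor=1e-3):
    """Gaussian Naive Bayes: log prior, per-feature mean and variance."""
    model = {}
    for c, mu in class_means(X, y).items():
        rows = [x for x, t in zip(X, y) if t == c]
        var = [sum((v - m) ** 2 for v in col) / len(rows) + var_floor
               for col, m in zip(zip(*rows), mu)]
        model[c] = (math.log(len(rows) / len(X)), mu, var)
    return model

def predict_nb(model, x):
    def log_post(c):
        prior, mu, var = model[c]
        return prior - 0.5 * sum(math.log(2 * math.pi * s) + (v - m) ** 2 / s
                                 for v, m, s in zip(x, mu, var))
    return max(model, key=log_post)

def knn_agrees(x, label, pool, k):
    """Editing rule (assumed): keep x if the majority of its k nearest
    neighbors in pool carries the same label."""
    nearest = sorted(pool, key=lambda p: dist2(x, p[0]))[:k]
    votes = Counter(lbl for _, lbl in nearest)
    return votes.most_common(1)[0][0] == label

def self_train(X_l, y_l, X_u, threshold=0.8, k=3, max_iter=10):
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    for _ in range(max_iter):
        if not X_u:
            break
        # 1. Select unlabeled samples with high cluster membership.
        centers = seeded_centers(X_l, y_l, X_u)
        picked = [i for i, x in enumerate(X_u)
                  if max(memberships(x, centers).values()) >= threshold]
        # 2. Label the selected samples with Naive Bayes.
        model = fit_nb(X_l, y_l)
        pseudo = [(X_u[i], predict_nb(model, X_u[i])) for i in picked]
        # 3. Edit: neighbors come from labeled AND pseudo-labeled samples.
        labeled = list(zip(X_l, y_l))
        kept = []
        for j, (x, lbl) in enumerate(pseudo):
            others = labeled + pseudo[:j] + pseudo[j + 1:]
            if knn_agrees(x, lbl, others, k):
                kept.append((picked[j], lbl))
        if not kept:
            break
        for i, lbl in kept:
            X_l.append(X_u[i])
            y_l.append(lbl)
        drop = {i for i, _ in kept}
        X_u = [x for i, x in enumerate(X_u) if i not in drop]
    return fit_nb(X_l, y_l), X_l, y_l

# Toy data (hypothetical): two tight groups and one ambiguous midpoint.
X_lab = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8)]
y_lab = [0, 0, 1, 1]
X_unl = [(0.1, 0.3), (-0.2, 0.1), (0.3, -0.1),
         (4.9, 5.2), (5.3, 5.0), (4.7, 4.9), (2.5, 2.6)]
model, X_aug, y_aug = self_train(X_lab, y_lab, X_unl)
```

In this toy run the six unlabeled points near the two groups pass both the membership threshold and the neighbor vote, while the midpoint `(2.5, 2.6)` has membership close to 0.5 in each cluster and is never added to the training set. Selecting by cluster membership rather than by classifier confidence is what lets the sketch pick samples independently of the Naive Bayes posterior, which is the motivation the abstract gives.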

LYU Jia, LI Junnan. Self-training method based on semi-supervised clustering and data editing[J]. Journal of Computer Applications, 2018, 38(1): 110-115.

