Journal of Computer Applications, 2025, Vol. 45, Issue (7): 2101-2112. DOI: 10.11772/j.issn.1001-9081.2024070953

• The 39th CCF National Conference of Computer Applications (CCF NCCA 2024) •


Labeling certainty enhancement-oriented positive and unlabeled learning algorithm

Yulin HE1,2, Peng HE1,2, Zhexue HUANG1,2, Weicheng XIE2, Philippe FOURNIER-VIGER2

  1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, Guangdong 518107, China
  2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Received: 2024-07-09; Revised: 2024-09-26; Accepted: 2024-09-26; Online: 2025-07-10; Published: 2025-07-10
  • Contact: Yulin HE (yulinhe@gml.ac.cn)
  • About the authors: HE Yulin, born in 1982, from Hengshui, Hebei, Ph.D., research fellow, CCF member. His research interests include big data system computing, multi-sample statistical analysis, data mining, and machine learning.
    HE Peng, born in 2001, from Nanchang, Jiangxi, M.S. candidate. Her research interests include distributed computing of big data, data mining, and machine learning.
    HUANG Zhexue, born in 1959, from Harbin, Heilongjiang, Ph.D., professor, CCF member. His research interests include intelligent computing for new computing power networks, approximate computing for big data, data mining, and machine learning.
    XIE Weicheng, born in 1983, from Jingzhou, Hubei, Ph.D., associate professor. His research interests include data mining, pattern recognition, and machine learning.
    FOURNIER-VIGER Philippe, born in 1980, from Montreal, Canada, Ph.D., professor. His research interests include data mining, pattern recognition, and artificial intelligence.
  • Supported by:
    Project of the Guangdong-Shenzhen Joint Fund of the Guangdong Basic and Applied Basic Research Foundation (2023B1515120020); Natural Science Foundation of Guangdong Province (2023A1515011667); Basic Research Foundation of Shenzhen (JCYJ20210324093609026); Science and Technology Major Project of Shenzhen (KJZD20230923114809020)

Abstract:

Positive and Unlabeled Learning (PUL) trains a classifier with practically acceptable performance from a small number of known positive samples and a large number of unlabeled samples when negative samples are unknown. Existing PUL algorithms share a common flaw: high uncertainty in labeling unlabeled samples, which leads the classifier to learn inaccurate classification boundaries and limits its generalization to new data. To address this issue, an unlabeled sample Labeling Certainty Enhancement-oriented PUL (LCE-PUL) algorithm was proposed. Firstly, reliable positive samples were selected using the posterior probability mean on the validation set and the similarity to the center point of the positive sample set, and the labeling process was refined gradually over multiple rounds of iteration, so as to increase the accuracy of the preliminary category judgments of unlabeled samples and thereby improve the certainty of labeling them. Secondly, these reliable positive samples were merged with the original positive sample set to form a new positive sample set and were removed from the unlabeled sample set. Thirdly, the new unlabeled sample set was traversed, and reliable positive samples were selected again based on the similarity between each sample and several of its neighboring points, so as to infer the potential labels of unlabeled samples more accurately, thereby reducing mislabeling and enhancing the certainty of labeling. Finally, the positive sample set was updated, and the unlabeled samples that were not selected were treated as negative samples. The feasibility, rationality, and effectiveness of the LCE-PUL algorithm were validated on representative datasets. As the number of iterations increases, the training of the LCE-PUL algorithm converges. When the proportion of positive samples is 40%, 35%, and 30%, the test accuracy of the classifier constructed by the LCE-PUL algorithm is up to 5.8, 8.8, and 7.6 percentage points higher, respectively, than that of five representative comparison algorithms, including the Biased Support Vector Machine based on a specific cost function (Biased-SVM) algorithm, the Dijkstra-based Label Propagation for PUL (LP-PUL) algorithm, and the PUL by Label Propagation (PU-LP) algorithm. Experimental results show that LCE-PUL is an effective machine learning algorithm for handling PUL problems.

Key words: Positive and Unlabeled Learning (PUL), labeling certainty enhancement, posterior probability, Bayesian classifier, two-step method
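
The four steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the classifier details, thresholds, or neighbor rule, so everything concrete here is an assumption made for illustration. In particular, the sketch assumes a small Gaussian naive Bayes classifier, a posterior threshold equal to the mean posterior of held-out validation positives, a centroid test based on the mean distance of the current positives to their center, and a majority vote among `k` nearest neighbors. The names `lce_pul_sketch`, `fit_gnb`, `posterior_positive`, `rounds`, and `k` are hypothetical.

```python
import math


def centroid(rows):
    """Component-wise mean of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_gnb(pos, neg):
    """Fit a tiny Gaussian naive Bayes model: {label: (prior, means, variances)}."""
    total = len(pos) + len(neg)
    model = {}
    for label, rows in ((1, pos), (0, neg)):
        mu = centroid(rows)
        var = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-6  # smoothed variance
               for col, m in zip(zip(*rows), mu)]
        model[label] = (len(rows) / total, mu, var)
    return model


def posterior_positive(model, x):
    """P(y = 1 | x) under the fitted model, evaluated in log space."""
    logp = {}
    for label, (prior, mu, var) in model.items():
        logp[label] = math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, mu, var))
    top = max(logp.values())
    w = {label: math.exp(lp - top) for label, lp in logp.items()}
    return w[1] / (w[0] + w[1])


def lce_pul_sketch(pos, unl, val_pos, rounds=3, k=3):
    """Return (positive set, negative set) after two screening stages."""
    pos, unl = list(pos), list(unl)

    # Stage 1 (assumed rule): iterative screening. Unlabeled samples are
    # provisionally treated as negatives when fitting the classifier.
    for _ in range(rounds):
        if not unl:
            break
        model = fit_gnb(pos, unl)
        threshold = sum(posterior_positive(model, v) for v in val_pos) / len(val_pos)
        center = centroid(pos)
        radius = sum(math.dist(p, center) for p in pos) / len(pos)
        chosen = {i for i, x in enumerate(unl)
                  if posterior_positive(model, x) >= threshold
                  and math.dist(x, center) <= radius}
        if not chosen:
            break
        # Merge the reliable positives and remove them from the unlabeled set.
        pos += [unl[i] for i in sorted(chosen)]
        unl = [x for i, x in enumerate(unl) if i not in chosen]

    # Stage 2 (assumed rule): majority vote among the k nearest neighbors,
    # taken against a snapshot of the current labels.
    labeled = [(x, 1) for x in pos] + [(x, 0) for x in unl]
    promoted = set()
    for i, x in enumerate(unl):
        others = labeled[:len(pos) + i] + labeled[len(pos) + i + 1:]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
        if sum(label for _, label in nearest) > k / 2:
            promoted.add(i)

    # Final update: promoted samples join the positives; the rest become negatives.
    new_pos = pos + [unl[i] for i in sorted(promoted)]
    negatives = [x for i, x in enumerate(unl) if i not in promoted]
    return new_pos, negatives


# Toy data (made up for illustration): one cluster near the origin, one far away.
positives = [(0.0, 0.0), (0.5, 0.2), (-0.3, 0.4), (0.2, -0.5)]
validation = [(0.1, 0.1), (-0.2, 0.3)]
unlabeled = [(0.3, 0.1), (-0.1, -0.2), (0.4, 0.4),
             (10.0, 10.0), (10.5, 9.8), (9.7, 10.3), (10.2, 10.1)]
new_pos, negatives = lce_pul_sketch(positives, unlabeled, validation)
print(len(new_pos), "positives;", len(negatives), "negatives")
```

The resulting positive and negative sets could then be used to train any ordinary binary classifier, which is the usual second half of a two-step PUL method. Taking the neighbor vote against a snapshot of labels keeps the result independent of traversal order; whether the published algorithm updates labels during the traversal is not stated in the abstract.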

CLC number: