Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (9): 2473-2480. DOI: 10.11772/j.issn.1001-9081.2020111872

Special topic: Artificial Intelligence

基于权值多样性的半监督分类算法

毛铭泽, 曹芮浩, 闫春钢   

  1. 同济大学 电子与信息工程学院, 上海 201804
  • 收稿日期:2020-11-30 修回日期:2021-03-04 出版日期:2021-09-10 发布日期:2021-09-15
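  • Corresponding author: YAN Chungang
  • About the authors: MAO Mingze (born 1996), male, from Shanghai, M.S. candidate; research interests: data mining, semi-supervised learning. CAO Ruihao (born 1990), male, from Changzhi, Shanxi, Ph.D. candidate; research interests: data mining, online transaction risk control. YAN Chungang (born 1963), female, from Shuangyashan, Heilongjiang, professor and doctoral supervisor; research interests: trusted computing, intelligent computing.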
  • 基金资助:
    国家重点研发计划项目(2017YFB1001804)。

Semi-supervised classification algorithm based on weight diversity

MAO Mingze, CAO Ruihao, YAN Chungang   

  1. College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
  • Received:2020-11-30 Revised:2021-03-04 Online:2021-09-10 Published:2021-09-15
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2017YFB1001804).

摘要: 在实际生活中,可以很容易地获得大量系统数据样本,却只能获得很小一部分的准确标签。为了获得更好的分类学习模型,引入半监督学习的处理方式,对基于未标注数据强化集成多样性(UDEED)算法进行改进,提出了UDEED+——一种基于权值多样性的半监督分类算法。UDEED+主要的思路是在基学习器对未标注数据的预测分歧的基础上提出权值多样性损失,通过引入基学习器权值的余弦相似度来表示基学习器之间的分歧,并且从损失函数的不同角度充分扩展模型的多样性,使用未标注数据在模型训练过程中鼓励集成学习器的多样性的表示,以此达到提升分类学习模型性能和泛化性的目的。在8个UCI公开数据集上,与UDEED算法、S4VM(Safe Semi-Supervised Support Vector Machine)和SSWL(Semi-Supervised Weak-Label)半监督算法进行了对比,相较于UDEED算法,UDEED+在正确率和F1分数上分别提升了1.4个百分点和1.1个百分点;相较于S4VM,UDEED+在正确率和F1分数上分别提升了1.3个百分点和3.1个百分点;相较于SSWL,UDEED+在正确率和F1分数上分别提升了0.7个百分点和1.5个百分点。实验结果表明,权值多样性的提升可以改善UDEED+算法的分类性能,验证了其对所提算法UDEED+的分类性能提升的正向效果。

关键词: 分类机器学习, 未标注数据, 半监督学习, 集成学习, 多样性

Abstract: In real-world applications, large numbers of system data samples can be obtained easily, but accurate labels are available for only a small fraction of them. To obtain a better classification model, semi-supervised learning was introduced and the Unlabeled Data to Enhance Ensemble Diversity (UDEED) algorithm was improved, yielding UDEED+, a semi-supervised classification algorithm based on weight diversity. In UDEED+, a weight diversity loss was proposed on top of the base learners' prediction disagreement on unlabeled data: the disagreement between base learners was represented by the cosine similarity of their weights. Model diversity was thereby expanded from different perspectives of the loss function, and unlabeled data were used during training to encourage diversity among the learners in the ensemble, so as to improve the performance and generalization of the classification model. Comparative experiments were conducted on 8 UCI public datasets against the semi-supervised algorithms UDEED, Safe Semi-Supervised Support Vector Machine (S4VM) and Semi-Supervised Weak-Label (SSWL). Compared with UDEED, UDEED+ improved accuracy and F1 score by 1.4 and 1.1 percentage points, respectively; compared with S4VM, by 1.3 and 3.1 percentage points; compared with SSWL, by 0.7 and 1.5 percentage points. Experimental results show that increasing weight diversity improves the classification performance of UDEED+.

Key words: classification machine learning, unlabeled data, semi-supervised learning, ensemble learning, diversity
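
The abstract describes the disagreement between base learners as the cosine similarity of their weights. The minimal sketch below illustrates that idea only: it averages the pairwise cosine similarity over all pairs of base learners, so a lower value means more diverse weights. The function name `weight_diversity_loss`, the one-row-per-learner weight matrix, and the use of a plain mean over pairs are assumptions made for illustration; the abstract does not give the exact form of the loss used in UDEED+.

```python
import numpy as np


def weight_diversity_loss(weights):
    """Mean pairwise cosine similarity between base-learner weight vectors.

    `weights` has shape (m, d): one row per base learner (assumed layout).
    Returns a value in [-1, 1]; smaller means the learners' weights point
    in more different directions. Hypothetical helper, not the paper's code.
    """
    W = np.asarray(weights, dtype=float)
    m = W.shape[0]
    if m < 2:
        return 0.0  # a single learner has no pair to compare

    # Normalize each row to unit length, guarding against zero vectors.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    unit = W / np.clip(norms, 1e-12, None)

    # Entry (i, j) of the Gram matrix is the cosine similarity of learners i and j.
    cos = unit @ unit.T

    # Average over the strict upper triangle, i.e. each unordered pair once.
    rows, cols = np.triu_indices(m, k=1)
    return float(cos[rows, cols].mean())


if __name__ == "__main__":
    identical = [[1.0, 0.0], [1.0, 0.0]]
    orthogonal = [[1.0, 0.0], [0.0, 1.0]]
    print(weight_diversity_loss(identical))
    print(weight_diversity_loss(orthogonal))
```

In a training loop, a term like this would typically be added to the supervised loss and the prediction-disagreement term with a trade-off coefficient, so that minimizing the total pushes the learners' weights apart. That combination is likewise an assumption here and is not specified in the abstract.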

CLC number: