Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (11): 3119-3126. DOI: 10.11772/j.issn.1001-9081.2018041220

• The 7th China Conference on Data Mining (CCDM 2018)

Semi-supervised classification method combining an adjusted cluster assumption with pairwise constraints

黄华, 郑佳敏, 钱鹏江   

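  1. School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2018-04-28  Revised: 2018-06-12  Online: 2018-11-10  Published: 2018-11-10
  • Corresponding author: QIAN Pengjiang
  • About the authors: HUANG Hua (born 1994), male, from Kuitun, Xinjiang, master's student, CCF member; research interests: pattern recognition, intelligent computing, machine learning. ZHENG Jiamin (born 1995), female, from Taizhou, Zhejiang, master's student; research interests: pattern recognition, intelligent computing, machine learning, medical image processing. QIAN Pengjiang (born 1979), male, from Jingjiang, Jiangsu, professor, Ph.D.; research interests: pattern recognition, bioinformatics, medical image processing.
  • Funding:
    National Natural Science Foundation of China (61772241, 61702225); Key Class-A Project of the Fundamental Research Funds for the Central Universities (JUSRP51614A); Qing Lan Project of Jiangsu Province; 2016 Jiangsu Province "Six Talent Peaks" High-Level Talent Project (2016-XYDXXJS-014).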

Adjusted cluster assumption and pairwise constraints jointly based semi-supervised classification method

HUANG Hua, ZHENG Jiamin, QIAN Pengjiang   

  1. School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2018-04-28  Revised: 2018-06-12  Online: 2018-11-10  Published: 2018-11-10
  • Supported by:
    This work was partially supported by the National Natural Science Foundation of China (61772241, 61702225), the Fundamental Research Funds for the Central Universities (JUSRP51614A), the Qing Lan Project of Jiangsu Province, and the 2016 Six Talent Peaks Project of Jiangsu Province (2016-XYDXXJS-014).

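Abstract: When samples of different classes overlap heavily at the classification boundary, the cluster assumption fails to reflect the true data distribution, and semi-supervised classification methods built on it may perform worse than their supervised counterparts. To address this unsafe semi-supervised classification problem, a semi-supervised classification method combining an adjusted cluster assumption with pairwise constraints (ACA-JPC-S3VM) is proposed. On the one hand, it incorporates the distance from each unlabeled sample to the data distribution boundary into model learning, which mitigates the performance degradation in such cases to some extent; on the other hand, it introduces pairwise constraint information to compensate for the model's insufficient use of supervision information. Experimental results on UCI datasets show that the performance of ACA-JPC-S3VM is never lower than that of the Support Vector Machine (SVM), and that its average accuracy is 5 percentage points higher than SVM's when the number of labeled samples is 10. Experimental results on image classification datasets show that semi-supervised classification methods such as the Transductive Support Vector Machine (TSVM) exhibit unsafe learning to varying degrees (that is, performance close to or below that of SVM), whereas ACA-JPC-S3VM learns safely. ACA-JPC-S3VM therefore offers better safety and correctness.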

Keywords: semi-supervised learning, classification, cluster assumption, adjusted cluster assumption, pairwise constraint

Abstract: When samples from different classes overlap heavily at the classification boundary, the cluster assumption may not reflect the real data distribution well, so semi-supervised classification methods based on the cluster assumption may perform even worse than their supervised counterparts. To address this unsafe semi-supervised classification problem, an Adjusted Cluster Assumption and Pairwise Constraints Jointly based Semi-Supervised Support Vector Machine classification method (ACA-JPC-S3VM) was proposed. On the one hand, the distances from individual unlabeled instances to the distribution boundary were taken into account during learning, which alleviated the performance degradation in such cases to some extent. On the other hand, pairwise constraint information was introduced to make up for the model's insufficient use of supervision information. Experimental results on the UCI datasets show that the performance of ACA-JPC-S3VM was never lower than that of the Support Vector Machine (SVM), and that its average accuracy was 5 percentage points higher than that of SVM when the number of labeled samples was 10. Experimental results on the image classification datasets show that semi-supervised classification methods such as the Transductive SVM (TSVM) exhibited unsafe learning to varying degrees (performance similar to or worse than that of SVM), whereas ACA-JPC-S3VM learned safely. Therefore, ACA-JPC-S3VM offers better safety and correctness.

Key words: semi-supervised learning, classification, cluster assumption, adjusted cluster assumption, pairwise constraint
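
The pairwise constraints mentioned in the abstract can be pictured with a small sketch. The abstract does not say how the constraints are built, so the code below assumes one common convention: two labeled samples with the same label form a must-link pair, and two with different labels form a cannot-link pair. The function name `pairwise_constraints` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_constraints(labels):
    """Split all index pairs of labeled samples into must-link and cannot-link lists.

    Assumption (not from the paper): same label -> must-link,
    different label -> cannot-link.
    """
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link


# Toy example: three labeled samples, two positive and one negative.
must_link, cannot_link = pairwise_constraints([+1, +1, -1])
print(must_link)    # [(0, 1)]
print(cannot_link)  # [(0, 2), (1, 2)]
```

With n labeled samples this yields n(n-1)/2 pairs, which stays small in the low-label setting the abstract reports (for example, 10 labeled samples give 45 pairs).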

CLC number: