Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (11): 3139-3143.DOI: 10.11772/j.issn.1001-9081.2018041251

Previous Articles     Next Articles

Semi-supervised K-means clustering algorithm based on active learning priors

CHAI Bianfang, LYU Feng, LI Wenbin, WANG Yao   

  1. School of Information Engineering, Hebei GEO University, Shijiazhuang Hebei 050031, China
  • Received:2018-04-23 Revised:2018-06-04 Online:2018-11-10 Published:2018-11-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61503260), the Teaching Reform Fund for Hebei GEO University (2017J04).

基于主动学习先验的半监督K-means聚类算法

柴变芳, 吕峰, 李文斌, 王垚   

  1. 河北地质大学 信息工程学院, 石家庄 050031
  • 通讯作者: 李文斌
  • 作者简介:柴变芳(1979-),女,山西运城人,副教授,博士,主要研究方向:复杂网络分析、概率图模型、机器学习、社区发现;吕峰(1991-),男,河北石家庄人,硕士研究生,主要研究方向:机器学习、复杂网络;李文斌(1974-),男,江西南昌人,教授,博士,CCF高级会员,主要研究方向:机器学习、复杂网络;王垚(1992-),男,河南开封人,硕士研究生,主要研究方向:机器学习、复杂网络。
  • 基金资助:
    国家自然科学基金资助项目(61503260);河北地质大学教改项目(2017J04)。

Abstract: Iteration-based Active Semi-Supervised Clustering Framework (IASSCF) is a popular semi-supervised clustering framework. There are two problems in this framework. The initial prior information is too less, which leads to poor clustering results in the initial iteration and infects the subsequent clustering. In addition, in each iteration only the sample with the largest information is selected to label, which results in a slow speed and improvement of the performance. Aiming to the existing problems, a semi-supervised K-means clustering algorithm based on active learning priors was designed, which consisted of initializing phase and iterating phase. In the initializing phase, the representative samples were selected actively to build an initial neighborhood set and a constraint set. Each iteration in iterating phase includes three steps:1) Pairwise Constrained K-means (PCK-means) was used to cluster data based on the current constraints. 2) Unlabeled samples with the largest information in each cluster were selected based on the clustering results. 3) The selected samples were extended into the neighborhood set and the constraint set. The iterating phase ends until the convergence thresholds were reached. The experimental results show that the proposed algorithm runs faster and has better performance than the algorithm based on the original IASSCF framework.

Key words: iterative framework, active learning, semi-supervised clustering, node priors, constraint priors

摘要: 基于迭代框架的主动半监督聚类框架(IASSCF)是一个流行的半监督聚类框架。该框架存在两个问题:其一,初始先验信息较少导致迭代初期聚类效果不佳,进而影响后续聚类结果;其二,每次迭代只选择信息量最大的一个样本标记,导致运行速度慢、性能提升慢。针对这两个问题,设计了一种基于主动学习先验的半监督K-means聚类算法。该方法包含初始化阶段和迭代阶段。初始化阶段主动选择代表性较高的节点集合,并基于代表节点集合构建各类的先验节点集合和约束先验集合。迭代阶段,每次迭代包含三步:1)基于当前约束先验集合,利用约束半监督聚类算法PCK-means对数据进行聚类;2)依据当前聚类结果,主动选择每个簇中最具价值信息的未标注样本点;3)利用选择样本点扩充先验节点集合及约束集合。迭代此过程至达到收敛阈值。实验结果表明,与基于原IASSCF框架的半监督K-means聚类算法相比,所提算法运行速度更快,性能更优。

关键词: 迭代框架, 主动学习, 半监督聚类, 节点先验, 约束先验

CLC Number: