Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (3): 686-693. DOI: 10.11772/j.issn.1001-9081.2020071095

Special Issue: Artificial Intelligence

• Artificial intelligence •

Co-training algorithm combining improved density peak clustering and shared subspace

LYU Jia1,2, XIAN Yan1,2   

  1. College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China;
    2. Chongqing Center of Engineering Technology Research on Digital Agriculture Service, Chongqing Normal University, Chongqing 401331, China
  • Received:2020-07-24 Revised:2020-10-06 Online:2021-03-10 Published:2020-11-12
  • Supported by:
    This work is partially supported by the Major Project of the National Natural Science Foundation of China (11991024), the Program of Chongqing University Innovation Research Group (CXQT20015), and the Chongqing Graduate Research and Innovation Project (CYS20241).

  • Corresponding author: LYU Jia
  • About the authors: LYU Jia, born in 1978 in Meishan, Sichuan, is a professor with a Ph.D. and a CCF member; her main research interests include machine learning and data mining. XIAN Yan, born in 1995 in Fuling, Chongqing, is an M.S. candidate; her main research interests include machine learning and data mining.

Abstract: During the iterations of the co-training algorithm, the newly added unlabeled samples may lack useful information, and the labels assigned to the same sample by multiple classifiers may be inconsistent, both of which cause classification errors to accumulate. To solve these problems, a co-training algorithm combining improved density peak clustering and shared subspace was proposed. Firstly, two base classifiers were obtained from complementary attribute sets. Secondly, improved density peak clustering was performed based on the siphon balance rule, and, starting from the cluster centers, the unlabeled samples with high mutual neighbor degree were selected progressively and labeled by the two base classifiers. Finally, the final categories of the samples with inconsistent labels were determined by the shared subspace obtained with the multi-view non-negative matrix factorization algorithm. In the proposed algorithm, the unlabeled samples that better represent the spatial structure were selected by the improved density peak clustering and the mutual neighbor degree, and the samples given inconsistent labels were revised via the shared subspace, which addresses the low classification accuracy caused by sample misclassification. The algorithm was validated by multiple comparison experiments on 9 UCI datasets, and the experimental results show that the proposed algorithm achieves the highest classification accuracy on 7 datasets and the second highest on the other 2.
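Since the abstract describes the workflow only at a high level, the following Python sketch is a rough, non-authoritative illustration of that workflow rather than the paper's method: it substitutes the standard Rodriguez-Laio density peak score for the siphon-balance-rule clustering and mutual neighbor degree, averages per-view single-view NMF factorizations as a crude stand-in for the multi-view non-negative matrix factorization shared subspace, and uses decision trees as arbitrary base classifiers. All names and parameters (select_by_density_peaks, shared_subspace, co_train, dc_quantile, batch, and so on) are illustrative assumptions, not from the paper.

```python
# Illustrative sketch only: standard density-peak scoring and averaged single-view NMF
# stand in for the paper's siphon-balance clustering and multi-view NMF shared subspace.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import pairwise_distances
from sklearn.tree import DecisionTreeClassifier

def select_by_density_peaks(X, n_select, dc_quantile=0.02):
    """Rank samples by the density peak score rho * delta and return the top n_select indices."""
    d = pairwise_distances(X)
    dc = np.quantile(d[d > 0], dc_quantile)        # cutoff distance
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1   # Gaussian local density (minus self)
    delta = np.empty_like(rho)                     # distance to nearest denser sample
    for i in range(len(rho)):
        denser = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if denser.size == 0 else d[i, denser].min()
    return np.argsort(rho * delta)[::-1][:n_select]

def shared_subspace(X_view1, X_view2, n_components=2):
    """Crude stand-in for multi-view NMF: factorize each view separately (after shifting it
    to be non-negative) and average the per-sample coefficient matrices."""
    W = np.zeros((len(X_view1), n_components))
    for V in (X_view1, X_view2):
        V = V - V.min()
        W += NMF(n_components=n_components, init="nndsvda", max_iter=500).fit_transform(V)
    return W / 2

def co_train(X, y, labeled_idx, view1_cols, view2_cols, rounds=10, batch=20):
    """Two classifiers on complementary attribute sets label density-peak-selected samples;
    a disagreement is resolved by the nearest labeled neighbor in the shared subspace.
    y must hold placeholder values (e.g. -1) at unlabeled positions."""
    labeled = set(int(i) for i in labeled_idx)
    c1, c2 = DecisionTreeClassifier(), DecisionTreeClassifier()
    for _ in range(rounds):
        lab = np.array(sorted(labeled))
        unl = np.array([i for i in range(len(X)) if i not in labeled])
        if unl.size == 0:
            break
        c1.fit(X[lab][:, view1_cols], y[lab])
        c2.fit(X[lab][:, view2_cols], y[lab])
        picked = unl[select_by_density_peaks(X[unl], min(batch, unl.size))]
        p1 = c1.predict(X[picked][:, view1_cols])
        p2 = c2.predict(X[picked][:, view2_cols])
        both = np.concatenate([lab, picked])
        W = shared_subspace(X[both][:, view1_cols], X[both][:, view2_cols])
        W_lab, W_pick = W[:lab.size], W[lab.size:]
        for k, i in enumerate(picked):
            if p1[k] == p2[k]:
                y[i] = p1[k]      # consistent label: accept it
            else:                 # inconsistent label: revise via the shared subspace
                y[i] = y[lab[np.linalg.norm(W_lab - W_pick[k], axis=1).argmin()]]
            labeled.add(int(i))
    return c1, c2
```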

Key words: co-training, density peak clustering, siphon balance rule, shared subspace, mutual neighbor degree


CLC Number: