Journal of Computer Applications (official website), 2022, Vol. 42, Issue 9: 2701-2712. DOI: 10.11772/j.issn.1001-9081.2021081371

• Data Science and Technology •

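Attribute reduction algorithm based on cluster granulation and divergence among clusters

Yan LI1,2,3, Bin FAN1,2, Jie GUO1,2

  1. College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China
  2. Hebei Key Laboratory of Machine Learning and Computational Intelligence (Hebei University), Baoding, Hebei 071002, China
  3. Research Center for Applied Mathematics and Interdisciplinary Sciences, Beijing Normal University at Zhuhai, Zhuhai, Guangdong 519087, China
  • Received: 2021-08-02 Revised: 2021-11-09 Accepted: 2021-11-20 Online: 2022-01-07 Published: 2022-09-10
  • Corresponding author: Bin FAN
  • About the authors: LI Yan (born 1976), female, from Hengshui, Hebei; professor, Ph.D., CCF member. Research interests: machine learning, uncertain information processing.
    GUO Jie (born 1995), male, from Handan, Hebei; M.S. candidate. Research interests: machine learning, uncertain information processing.
  • Supported by:
    National Natural Science Foundation of China (61976141)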

Abstract:

Attribute reduction is an active research topic in rough set theory. Most attribute reduction algorithms for continuous-valued data are based on dominance relations or neighborhood relations. However, the attributes of a continuous-valued dataset do not necessarily admit a dominance relation. Algorithms based on neighborhood relations can adjust the degree of granulation through the neighborhood radius, but because the attributes are measured on different scales and the radius is a continuous-valued parameter, a single radius is hard to settle on, and searching over it makes the granulation process computationally expensive. To address this problem, a multi-granularity attribute reduction strategy based on cluster granulation is proposed. First, similar samples are grouped by a clustering method, and cluster-based notions of approximation sets, relative positive region and positive-region reduction are defined. Second, Jensen-Shannon (JS) divergence is used to measure how much the data distribution of each attribute differs among clusters, and representative features are selected to distinguish the clusters. Finally, an attribute reduction algorithm is designed using a discernibility matrix. The proposed algorithm does not require the attributes to be ordered, and, unlike the neighborhood radius, the clustering parameter is discrete; adjusting it partitions the dataset at different degrees of granulation. Experimental results on UCI and Kent Ridge datasets show that the algorithm can handle continuous-valued data directly and that, by adjusting the clustering parameter discretely within a small range, it removes redundant features while maintaining or even improving classification accuracy.

Key words: continuous data, rough set, attribute reduction, cluster granulation, Jensen-Shannon divergence
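
The JS-divergence step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the per-cluster attribute distributions are estimated, so the sketch assumes equal-width histograms over a shared range, compares only two clusters, and uses base-2 logarithms (so the divergence lies in [0, 1]). The function names `shared_histograms`, `js_divergence` and `rank_attributes`, and the toy data, are hypothetical and are not taken from the paper.

```python
import math


def shared_histograms(a, b, bins=4):
    """Normalized equal-width histograms of a and b over their shared range.

    Assumption: histogram binning stands in for whatever density estimate
    the paper actually uses.
    """
    lo, hi = min(a + b), max(a + b)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)  # top edge goes in last bin
            counts[idx] += 1
        return [c / len(values) for c in counts]

    return hist(a), hist(b)


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_attributes(cluster_a, cluster_b, bins=4):
    """Rank attribute indices by how strongly they separate two clusters."""
    scores = []
    for j in range(len(cluster_a[0])):
        col_a = [row[j] for row in cluster_a]
        col_b = [row[j] for row in cluster_b]
        p, q = shared_histograms(col_a, col_b, bins)
        scores.append((j, js_divergence(p, q)))
    return sorted(scores, key=lambda s: -s[1])


# Toy data (hypothetical): attribute 0 separates the clusters, attribute 1 barely does.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 3.0)]
cluster_b = [(8.0, 1.0), (8.5, 2.0), (9.0, 4.0)]

for index, score in rank_attributes(cluster_a, cluster_b):
    print(f"attribute {index}: JS = {score:.3f}")
# attribute 0: JS = 1.000
# attribute 1: JS = 0.333
```

An attribute whose distribution is nearly identical across clusters scores close to 0 and is a candidate for removal, while a score close to 1 marks an attribute that distinguishes the clusters. How such scores feed into the discernibility matrix is specific to the paper and is not reproduced here.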

CLC Number: