Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (5): 1467-1472. DOI: 10.11772/j.issn.1001-9081.2022081154

• Data science and technology •

Attribute reduction for high-dimensional data based on bi-view of similarity and difference

Yuanjiang LI, Jinsheng QUAN, Yangyi TAN, Tian YANG

  1. Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing (Hunan Normal University), Changsha, Hunan 410081, China
  • Received: 2022-07-19 Revised: 2022-09-06 Accepted: 2022-10-12 Online: 2023-05-08 Published: 2023-05-10
  • Contact: Tian YANG
  • About the authors: LI Yuanjiang, born in 1999, M. S. candidate. His research interests include data mining, rough set theory, and machine learning.
    QUAN Jinsheng, born in 2003. His research interests include machine learning.
    TAN Yangyi, born in 2002. Her research interests include rough sets.
    YANG Tian, born in 1984, Ph. D., associate professor. Her research interests include granular computing and intelligent information processing, rough sets, fuzzy set theory, and topology.
  • Supported by:
    Outstanding Youth Program of Natural Science Foundation of Hunan Province (2021JJ20037); Training Program for Excellent Young Innovators of Changsha (kq1905031)

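Attribute reduction for high-dimensional data based on the dual perspective of similarity and difference

LI Yuanjiang, QUAN Jinsheng, TAN Yangyi, YANG Tian

  1. Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing (Hunan Normal University), Changsha 410081
  • Corresponding author: YANG Tian
  • About the authors: LI Yuanjiang (born 1999), male, from Yichang, Hubei, M. S. candidate. Research interests: data mining, rough set theory, machine learning.
    QUAN Jinsheng (born 2003), male, from Xuzhou, Jiangsu. Research interests: machine learning.
    TAN Yangyi (born 2002), female, from Zhuzhou, Hunan. Research interests: rough sets.
    YANG Tian (born 1984), female, from Changsha, Hunan, associate professor, Ph. D. Research interests: granular computing and intelligent information processing, rough sets, fuzzy set theory, topology. math_yangtian@126.com
  • Funding:
    Outstanding Youth Program of Natural Science Foundation of Hunan Province (2021JJ20037); Training Program for Excellent Young Innovators of Changsha (kq1905031)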

Abstract:

To address the curse of dimensionality caused by excessively high data dimensionality and redundant information, a high-dimensional Attribute Reduction algorithm based on Similarity and Difference Matrix (ARSDM) was proposed. Building on the discernibility matrix, the algorithm adds a similarity measure for samples of the same class, so that all sample pairs, not only those from different classes, contribute to the evaluation. Firstly, the distances between samples under each attribute were calculated, and the same-class similarity and the different-class difference were derived from these distances. Secondly, a similarity and difference matrix was built to evaluate the entire dataset. Finally, attribute reduction was performed: the columns of the similarity and difference matrix were summed, the feature with the largest sum was added to the reduct, the row vectors of the corresponding sample pairs were set to zero vectors, and the selection was repeated. Experimental results show that, compared with the classical attribute reduction algorithms DMG (Discernibility Matrix based on Graph theory), FFRS (Fitting Fuzzy Rough Sets) and GBNRS (Granular Ball Neighborhood Rough Sets), ARSDM increases the average classification accuracy by 1.07, 6.48, and 8.92 percentage points respectively under the Classification And Regression Tree (CART) classifier, and by 1.96, 11.96, and 12.39 percentage points respectively under the Support Vector Machine (SVM) classifier. ARSDM also runs faster than GBNRS and FFRS. These results indicate that ARSDM can effectively remove redundant information and improve classification accuracy.

Key words: similarity and difference matrix, discernibility matrix, attribute reduction, rough set, granular computing, data mining
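
The three steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the distance function, how similarity and difference are scored, or which sample pairs count as "corresponding" to a selected feature. The sketch assumes attributes scaled to [0, 1], the absolute difference as the per-attribute distance, a score of 1 - distance for same-class pairs and distance for different-class pairs, and a hypothetical threshold `tau` that decides which rows are zeroed after a feature is selected. The function name `arsdm_sketch` is likewise illustrative.

```python
import numpy as np

def arsdm_sketch(X, y, tau=0.5):
    """Greedy attribute selection over a similarity-and-difference matrix.

    X   : (n_samples, n_features) array, assumed scaled to [0, 1]
    y   : (n_samples,) class labels
    tau : hypothetical threshold (assumption, not from the paper); a sample
          pair counts as handled by a feature when its entry is >= tau
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Step 1: per-attribute distance for every unordered sample pair.
    i, j = np.triu_indices(len(y), k=1)
    dist = np.abs(X[i] - X[j])               # one row per pair, one column per attribute

    # Step 2: similarity-and-difference matrix (assumed scoring):
    # same-class pairs score similarity, different-class pairs score difference.
    same = (y[i] == y[j])[:, None]
    M = np.where(same, 1.0 - dist, dist)

    # Step 3: repeatedly pick the column with the largest sum and
    # zero out the rows of the pairs that feature handles.
    selected = []
    while M.any():
        best = int(np.argmax(M.sum(axis=0)))
        covered = M[:, best] >= tau
        if not covered.any():                # no remaining pair is handled; stop
            break
        selected.append(best)
        M[covered] = 0.0
    return selected


if __name__ == "__main__":
    # Feature 0 separates the two classes; feature 1 is constant.
    X = np.array([[0.0, 0.5], [0.1, 0.5], [0.9, 0.5], [1.0, 0.5]])
    y = np.array([0, 0, 1, 1])
    print(arsdm_sketch(X, y))
```

Storing one row per sample pair makes the matrix grow quadratically with the number of samples, so a sketch like this is only practical for small datasets; the threshold controls how aggressively pairs are retired and therefore how many features end up in the reduct.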

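Abstract (Chinese version, translated):

To address the curse of dimensionality caused by excessively high data dimensionality and too much redundant information, a high-dimensional attribute reduction algorithm based on the similarity and difference matrix (ARSDM) is proposed. Building on the discernibility matrix, the algorithm adds a similarity measure for same-class samples, forming a comprehensive evaluation of all samples. First, the distances between samples under each attribute are calculated, and the same-class similarity and different-class difference are obtained from these distances. Second, the similarity and difference matrix is built to evaluate the entire dataset. Finally, attribute reduction is performed: each column of the similarity and difference matrix is summed, the feature with the largest value is selected into the reduct in turn, and the row vectors of the corresponding sample pairs are set to zero vectors. Experimental results show that, compared with the classical attribute reduction algorithms DMG (Discernibility Matrix based on Graph theory), FFRS (Fitting Fuzzy Rough Sets) and GBNRS (Granular Ball Neighborhood Rough Sets), ARSDM improves the average classification accuracy by 1.07, 6.48, and 8.92 percentage points respectively under the Classification And Regression Tree (CART) classifier, and by 1.96, 11.96, and 12.39 percentage points respectively under the Support Vector Machine (SVM) classifier; in running efficiency, ARSDM outperforms GBNRS and FFRS. ARSDM can therefore effectively remove redundant information and improve classification accuracy.

Key words: similarity and difference matrix, discernibility matrix, attribute reduction, rough set, granular computing, data mining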

CLC Number: