Attribute reduction for high-dimensional data based on bi-view of similarity and difference

doi:10.11772/j.issn.1001-9081.2022081154

Journal of Computer Applications, 2023, Vol. 43, Issue (5): 1467-1472. DOI: 10.11772/j.issn.1001-9081.2022081154

• Data science and technology •

Yuanjiang LI, Jinsheng QUAN, Yangyi TAN, Tian YANG

Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing (Hunan Normal University), Changsha, Hunan 410081, China

Received: 2022-07-19 Revised: 2022-09-06 Accepted: 2022-10-12 Online: 2023-05-08 Published: 2023-05-10
Contact: Tian YANG (math_yangtian@126.com)
About the authors: LI Yuanjiang, born in 1999, from Yichang, Hubei, M.S. candidate. His research interests include data mining, rough set theory, and machine learning.
QUAN Jinsheng, born in 2003, from Xuzhou, Jiangsu. His research interests include machine learning.
TAN Yangyi, born in 2002, from Zhuzhou, Hunan. Her research interests include rough sets.
YANG Tian, born in 1984, from Changsha, Hunan, Ph.D., associate professor. Her research interests include granular computing and intelligent information processing, rough sets, fuzzy set theory, and topology.
Supported by: Outstanding Youth Program of Natural Science Foundation of Hunan Province (2021JJ20037); Training Program for Excellent Young Innovators of Changsha (kq1905031)

Abstract

To address the curse of dimensionality caused by excessively high data dimensionality and redundant information, a high-dimensional Attribute Reduction algorithm based on a Similarity and Difference Matrix (ARSDM) was proposed. Building on the discernibility matrix, the algorithm adds a similarity measure for samples in the same class, forming a comprehensive evaluation of all samples. Firstly, the distances between samples under each attribute were calculated, and the similarity within the same class and the difference between different classes were obtained from these distances. Secondly, a similarity and difference matrix was established to form an evaluation of the entire dataset. Finally, attribute reduction was performed: each column of the similarity and difference matrix was summed, the feature with the largest value was selected into the reduct in turn, and the row vectors of the corresponding sample pairs were set to zero vectors. Experimental results show that, compared with the classical attribute reduction algorithms DMG (Discernibility Matrix based on Graph theory), FFRS (Fitting Fuzzy Rough Sets) and GBNRS (Granular Ball Neighborhood Rough Sets), the average classification accuracy of ARSDM is higher by 1.07, 6.48 and 8.92 percentage points respectively under the Classification And Regression Tree (CART) classifier, and by 1.96, 11.96 and 12.39 percentage points respectively under the Support Vector Machine (SVM) classifier. ARSDM also outperforms GBNRS and FFRS in running efficiency. These results indicate that ARSDM can effectively remove redundant information and improve classification accuracy.
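
The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact similarity and difference formulas or the rule for deciding which sample pairs are zeroed, so the scoring (distance d for different-class pairs, 1 - d for same-class pairs, on attribute values scaled to [0, 1]), the threshold `delta`, and the function names `build_sd_matrix` and `greedy_reduct` are all assumptions made for illustration.

```python
from itertools import combinations


def build_sd_matrix(X, y):
    """Build one row per sample pair and one column per attribute.

    Assumed scoring (not taken from the paper): with attribute values scaled
    to [0, 1], a different-class pair scores its distance d (difference) and
    a same-class pair scores 1 - d (similarity).
    """
    rows = []
    for i, j in combinations(range(len(X)), 2):
        dists = [abs(a - b) for a, b in zip(X[i], X[j])]
        if y[i] == y[j]:
            rows.append([1.0 - d for d in dists])
        else:
            rows.append(dists)
    return rows


def greedy_reduct(matrix, delta=0.5):
    """Select attributes greedily by column sum.

    `delta` is a hypothetical threshold: a pair counts as handled by the
    chosen attribute when its entry is at least delta, and its row is then
    set to the zero vector.
    """
    matrix = [row[:] for row in matrix]  # work on a copy
    n_attrs = len(matrix[0]) if matrix else 0
    reduct = []
    while True:
        candidates = [a for a in range(n_attrs) if a not in reduct]
        if not candidates:
            break
        sums = [sum(row[a] for row in matrix) for a in range(n_attrs)]
        best = max(candidates, key=lambda a: sums[a])
        covered = [r for r, row in enumerate(matrix) if row[best] >= delta]
        if not covered:
            break  # the top-scoring attribute handles no remaining pair
        reduct.append(best)
        for r in covered:
            matrix[r] = [0.0] * n_attrs
    return reduct


# Toy data: attribute 0 separates the two classes, attribute 1 is constant.
X = [[0.0, 0.5], [0.1, 0.5], [0.9, 0.5], [1.0, 0.5]]
y = [0, 0, 1, 1]
print(greedy_reduct(build_sd_matrix(X, y)))
```

On this toy data the sketch keeps only attribute 0. The matrix has one row per sample pair, so with n samples and m attributes it holds n(n-1)/2 rows of m entries.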

Key words: similarity and difference matrix, discernibility matrix, attribute reduction, rough set, granular computing, data mining

CLC Number: TP181

Yuanjiang LI, Jinsheng QUAN, Yangyi TAN, Tian YANG. Attribute reduction for high-dimensional data based on bi-view of similarity and difference[J]. Journal of Computer Applications, 2023, 43(5): 1467-1472.

Figures/Tables 5

Tab. 1 Dataset information

Dataset	Samples	Features	Classes
Sonar	208	60	2
SCADI	70	205	7
Heart	270	13	2
Allaml	72	129	2
Lung	203	3 312	2
GLI85	85	22 283	2
Biodeg	1 055	41	2
ORL	400	1 024	40
Pageblock	5 472	10	5
Messidor	1 151	19	2
Cane	1 080	856	9

Tab. 2 Comparison of time/space complexity

Algorithm	Time complexity	Space complexity
DMG	$O(mn^2)$	$O(mn^2)$
FFRS	$O(m^2n^2)$	$O(mn^2)$
GBNRS	$O(nm^2)$	$O(n^2)$
ARSDM	$O(mn^2)$	$O(mn^2)$

Tab. 3 Comparison of classification accuracy (%) on reduced data under CART classifier

Dataset	RAW	DMG	FFRS	GBNRS	ARSDM
Average	79.33	82.97	77.56	75.12	84.04
Sonar	73.45	75.00	76.00	75.52	75.50
SCADI	77.08	90.00	92.50	76.37	97.50
Heart	75.19	81.48	81.85	79.63	80.00
Allaml	85.36	98.33	93.33	75.42	100.00
Lung	90.27	88.33	94.44	79.45	95.00
GLI85	81.67	94.29	90.00	67.78	90.00
Biodeg	81.30	81.11	76.06	81.32	83.37
ORL	60.54	54.64	35.50	41.25	53.25
Pageblock	99.31	99.42	99.47	99.32	99.48
Messidor	61.84	62.37	64.43	63.51	62.43
Cane	86.67	87.69	49.63	86.76	87.87

Tab. 4 Comparison of classification accuracy (%) on reduced data under SVM classifier

Dataset	RAW	DMG	FFRS	GBNRS	ARSDM
Average	70.34	89.81	79.81	79.38	91.77
Sonar	87.49	89.50	82.50	80.71	92.50
SCADI	41.73	95.00	90.00	80.96	100.00
Heart	79.26	82.22	82.96	79.63	84.07
Allaml	65.42	95.00	100.00	84.52	98.33
Lung	68.57	92.22	97.22	80.43	92.78
GLI85	69.64	94.29	94.29	60.97	95.71
Biodeg	88.36	88.58	80.96	86.27	89.81
ORL	30.50	91.01	35.70	63.00	93.50
Pageblock	97.58	97.65	98.00	97.59	97.74
Messidor	73.84	73.85	66.69	74.89	75.65
Cane	71.39	88.61	49.54	84.26	89.35
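
The percentage-point gains quoted in the abstract can be checked against the Average rows of Tab. 3 and Tab. 4. The short Python sketch below only repeats that subtraction; the dictionary layout and the helper name `gains` are illustrative choices, and the numbers are copied from the two tables.

```python
# Average rows copied from Tab. 3 (CART) and Tab. 4 (SVM).
averages = {
    "CART": {"DMG": 82.97, "FFRS": 77.56, "GBNRS": 75.12, "ARSDM": 84.04},
    "SVM": {"DMG": 89.81, "FFRS": 79.81, "GBNRS": 79.38, "ARSDM": 91.77},
}


def gains(avg_row):
    """Percentage-point gain of ARSDM over each baseline, to two decimals."""
    return {
        name: round(avg_row["ARSDM"] - acc, 2)
        for name, acc in avg_row.items()
        if name != "ARSDM"
    }


for clf, row in averages.items():
    print(clf, gains(row))
```

These are differences between accuracies expressed in percent, so they are percentage points rather than relative improvements.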

Tab. 5 Comparison of reduction time of four algorithms

Dataset	DMG	FFRS	GBNRS	ARSDM
Average	107.94	2 016.07	7 246.94	156.09
Sonar	0.85	21.02	57.55	3.16
SCADI	0.26	33.82	86.37	0.41
Heart	0.40	7.82	6.88	0.88
Allaml	6.15	702.26	5 843.18	13.18
Lung	25.78	1 829.63	4 517.38	52.91
GLI85	41.23	4 410.49	58 040.21	98.00
Biodeg	17.57	395.04	134.85	41.87
ORL	96.52	1 330.05	3 326.97	116.74
Pageblock	72.24	2 405.74	18.35	273.76
Messidor	10.76	205.45	103.27	22.68
Cane	915.60	10 835.40	7 581.36	1 093.38
