Imputation algorithm for hybrid information system of incomplete data analysis approach based on rough set theory

doi:10.11772/j.issn.1001-9081.2020060894

Journal of Computer Applications, 2021, Vol. 41, Issue 3: 677-685. DOI: 10.11772/j.issn.1001-9081.2020060894

Special Issue: Artificial intelligence

PENG Li¹, ZHANG Haiqing¹, LI Daiwei¹,², TANG Dan¹, YU Xi³, HE Lei¹

1. School of Software Engineering, Chengdu University of Information Technology, Chengdu Sichuan 610225, China;
2. School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 611756, China;
3. College of Computer Science, Chengdu University, Chengdu Sichuan 610106, China

Received: 2020-06-28 Revised: 2020-10-15 Online: 2021-01-15 Published: 2021-03-10
Supported by:
This work is partially supported by the Youth Science Foundation Project of the National Natural Science Foundation of China (61602064) and the International Erasmus+ Capacity Building in Higher Education Project (598649-EPP-1-2018-1-FR-EPPKA2-CBHE-JP).

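Corresponding author: LI Daiwei
About the authors: PENG Li, born in 1996, female, from Chengdu, Sichuan, M.S. candidate, CCF student member; her research interests include data mining, machine learning, and data integration and visualization. ZHANG Haiqing, born in 1986, female, from Liaocheng, Shandong, Ph.D., associate research fellow; her research interests include machine learning, data mining, knowledge discovery, big data, artificial intelligence, and rough sets. LI Daiwei, born in 1976, male, from Daxian, Sichuan, M.S., associate professor, CCF member; his research interests include data mining, knowledge discovery, machine learning, artificial intelligence, rough sets, and data integration and visualization. TANG Dan, born in 1982, male, from Chengdu, Sichuan, Ph.D., professor; his research interests include coding theory and intelligent services. YU Xi, born in 1973, male, from Changchun, Jilin, Ph.D., professor; his research interests include decision systems, neural networks, and deep learning. HE Lei, born in 1978, male, from Taikang, Henan, Ph.D., lecturer; his research interests include microwave remote sensing, climate forecasting, and deep learning.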
Abstract

Abstract: The ROUgh Set Theory based Incomplete Data Analysis Approach (ROUSTIDA) imputes poorly on the Hybrid Information Systems (HIS) found in real-world applications, which contain discrete attributes (e.g., integer, string, and enumeration), continuous attributes (e.g., floating-point), and attributes with missing values. To address this problem, a Rough Set Theory based Hybrid Information System for Missing Data Imputation Approach (RSHISMIA) was proposed. Firstly, the HIS was partitioned according to the equivalence classes of the decision attribute, which resolves the decision rule conflicts that might occur after imputation. Secondly, a hybrid distance matrix was defined to reasonably quantify the similarity between objects, so that samples with imputation capability could be selected and ROUSTIDA's inability to handle continuous attributes could be overcome. Thirdly, the nearest-neighbor idea was incorporated to handle a case that ROUSTIDA cannot: imputing data missing on the same attribute when the attribute values of the indiscernible objects conflict. Finally, experiments were conducted on 10 UCI datasets, and the proposed method was compared with classical algorithms including ROUSTIDA, K Nearest Neighbor Imputation (KNNI), Random Forest Imputation (RFI), and Matrix Factorization (MF). Experimental results show that, compared with ROUSTIDA, the proposed method improves recall by 81% on average and precision by 5% to 53%, and reduces the Normalized Root Mean Square Error (NRMSE) by up to 0.12. In addition, the classification accuracy of the proposed method is 7% higher on average than that of ROUSTIDA, and is also better than those of the imputation algorithms KNNI, RFI, and MF.
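The three steps named in the abstract (partition by decision class, hybrid distance, nearest-neighbor filling) can be illustrated with a minimal Python sketch. This is not the authors' RSHISMIA implementation: the abstract does not give the definition of the hybrid distance matrix, so the sketch assumes a Gower-style distance (0/1 mismatch for discrete attributes, range-scaled absolute difference for continuous ones). The names `MISSING`, `hybrid_distance`, and `impute` are hypothetical.

```python
import math

MISSING = None  # assumed marker for a missing attribute value


def hybrid_distance(x, y, numeric, ranges):
    """Assumed Gower-style distance over attributes observed in both objects.

    Discrete attributes contribute 0 (equal) or 1 (different); continuous
    attributes contribute |x - y| scaled by the attribute's observed range.
    Returns math.inf when the two objects share no observed attribute.
    """
    total, used = 0.0, 0
    for a, (u, v) in enumerate(zip(x, y)):
        if u is MISSING or v is MISSING:
            continue
        if a in numeric:
            total += abs(u - v) / ranges[a] if ranges[a] else 0.0
        else:
            total += 0.0 if u == v else 1.0
        used += 1
    return total / used if used else math.inf


def impute(rows, decisions, numeric):
    """Fill each missing value from the nearest same-class object that has it.

    rows: list of attribute lists; decisions: one decision label per row;
    numeric: set of column indices holding continuous attributes.
    Values with no usable donor in the same decision class stay missing.
    """
    ranges = {}
    for a in numeric:
        vals = [r[a] for r in rows if r[a] is not MISSING]
        ranges[a] = (max(vals) - min(vals)) if vals else 0.0

    filled = [list(r) for r in rows]  # work on a copy, keep the input intact
    for i, row in enumerate(rows):
        for a in range(len(row)):
            if row[a] is not MISSING:
                continue
            best, best_d = None, math.inf
            for j, other in enumerate(rows):
                # Donors must share the decision class and observe attribute a.
                if j == i or decisions[j] != decisions[i] or other[a] is MISSING:
                    continue
                d = hybrid_distance(row, other, numeric, ranges)
                if d < best_d:
                    best, best_d = other[a], d
            if best is not None:
                filled[i][a] = best
    return filled


# Toy hybrid table: column 0 is discrete, column 1 is continuous.
rows = [["red", 1.0], ["red", None], ["blue", 9.0], [None, 9.5], ["green", 9.4]]
decisions = ["yes", "yes", "yes", "no", "no"]
filled = impute(rows, decisions, numeric={1})
```

Restricting donors to the same decision class is what keeps a filled value from being copied across classes, which is the conflict the partition step is meant to avoid. Distances here are computed against the original rows rather than the progressively filled ones, which is a simplification chosen for the sketch.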

Key words: ROUgh Set Theory based Incomplete Data Analysis Approach (ROUSTIDA), Hybrid Information System (HIS), missing data imputation, hybrid distance, nearest-neighbor

CLC Number:

TP181

PENG Li, ZHANG Haiqing, LI Daiwei, TANG Dan, YU Xi, HE Lei. Imputation algorithm for hybrid information system of incomplete data analysis approach based on rough set theory[J]. Journal of Computer Applications, 2021, 41(3): 677-685.
