Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (3): 677-685. DOI: 10.11772/j.issn.1001-9081.2020060894

Special Issue: Artificial intelligence

• Artificial intelligence •

Imputation algorithm for hybrid information system of incomplete data analysis approach based on rough set theory

PENG Li1, ZHANG Haiqing1, LI Daiwei1,2, TANG Dan1, YU Xi3, HE Lei1   

  1. School of Software Engineering, Chengdu University of Information Technology, Chengdu Sichuan 610225, China;
    2. School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 611756, China;
    3. College of Computer Science, Chengdu University, Chengdu Sichuan 610106, China
  • Received:2020-06-28 Revised:2020-10-15 Online:2021-03-10 Published:2021-01-15
  • Supported by:
    This work is partially supported by the Youth Science Foundation Project of the National Natural Science Foundation of China (61602064), the International Erasmus+Capacity Building in Higher Education Project Research Funds for the Central University (598649-EPP-1-2018-1-FR-EPPKA2-CBHE-JP).

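  • Corresponding author: LI Daiwei
  • About the authors: PENG Li (1996-), female, born in Chengdu, Sichuan, M.S. candidate, CCF student member; research interests: data mining, machine learning, data integration and visualization. ZHANG Haiqing (1986-), female, born in Liaocheng, Shandong, associate research fellow, Ph.D.; research interests: machine learning, data mining, knowledge discovery, big data, artificial intelligence, rough sets. LI Daiwei (1976-), male, born in Daxian, Sichuan, associate professor, M.S., CCF member; research interests: data mining, knowledge discovery, machine learning, artificial intelligence, rough sets, data integration and visualization. TANG Dan (1982-), male, born in Chengdu, Sichuan, professor, Ph.D.; research interests: coding theory, intelligent services. YU Xi (1973-), male, born in Changchun, Jilin, professor, Ph.D.; research interests: decision systems, neural networks, deep learning. HE Lei (1978-), male, born in Taikang, Henan, lecturer, Ph.D.; research interests: microwave remote sensing, climate prediction, deep learning.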

Abstract: The ROUgh Set Theory based Incomplete Data Analysis Approach (ROUSTIDA) imputes poorly on a real-world Hybrid Information System (HIS), which contains discrete (e.g., integer, string, and enumeration), continuous (e.g., floating-point), and missing attributes. To address this, a Rough Set Theory based Hybrid Information System Missing Data Imputation Approach (RSHISMIA) was proposed. Firstly, the HIS was partitioned by decision attribute equivalence class to resolve the decision rule conflicts that might occur after imputation. Secondly, a hybrid distance matrix was defined to reasonably quantify the similarity between objects, so as to select the samples capable of imputation and to overcome ROUSTIDA's inability to handle continuous attributes. Thirdly, the nearest-neighbor idea was incorporated so that data with the same missing attribute could still be imputed when the attribute values of non-discriminant objects conflict, a case that ROUSTIDA cannot handle. Finally, experiments were conducted on 10 UCI datasets, and the proposed method was compared with classical algorithms including ROUSTIDA, K Nearest Neighbor Imputation (KNNI), Random Forest Imputation (RFI), and Matrix Factorization (MF). Experimental results show that the proposed method outperforms ROUSTIDA by 81% in recall on average and by 5% to 53% in precision. Meanwhile, it reduces the Normalized Root Mean Square Error (NRMSE) by up to 0.12 compared with ROUSTIDA. In addition, its classification accuracy is 7% higher on average than that of ROUSTIDA, and is also higher than those of the imputation algorithms KNNI, RFI and MF.

Key words: ROUgh Set Theory based Incomplete Data Analysis Approach (ROUSTIDA), Hybrid Information System (HIS), missing data imputation, hybrid distance, nearest-neighbor
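
The three steps named in the abstract (partition by decision attribute, hybrid distance, nearest-neighbor filling) can be illustrated with a minimal Python sketch. It is not the authors' RSHISMIA implementation: the abstract does not give the distance formula, so a HEOM-style measure is assumed here (0/1 mismatch for discrete attributes, range-normalized absolute difference for continuous ones, and missing values treated as non-discriminating). The names `hybrid_distance`, `attribute_ranges`, and `impute` are hypothetical, and the rough-set discernibility-matrix machinery of ROUSTIDA is omitted.

```python
from collections import defaultdict

MISSING = None  # assumption: missing values are encoded as None


def attribute_ranges(rows, is_continuous):
    """Return max - min of observed values for each continuous column (0.0 otherwise)."""
    ranges = []
    for j, cont in enumerate(is_continuous):
        vals = [r[j] for r in rows if r[j] is not MISSING] if cont else []
        ranges.append(max(vals) - min(vals) if vals else 0.0)
    return ranges


def hybrid_distance(x, y, is_continuous, ranges):
    """Assumed HEOM-style distance; NOT the paper's exact definition."""
    total = 0.0
    for a, b, cont, rng in zip(x, y, is_continuous, ranges):
        if a is MISSING or b is MISSING:
            continue  # assumption: a missing value does not discriminate
        if cont:
            total += abs(a - b) / rng if rng > 0 else 0.0
        else:
            total += 0.0 if a == b else 1.0
    return total


def impute(rows, labels, is_continuous):
    """Fill each missing cell from the nearest same-class object that has a value there."""
    ranges = attribute_ranges(rows, is_continuous)
    # Step 1: partition objects by decision attribute, so a donor always
    # shares the decision value of the object being filled.
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    filled = [list(r) for r in rows]  # copy; the input is left unchanged
    for members in groups.values():
        for i in members:
            for j, value in enumerate(rows[i]):
                if value is not MISSING:
                    continue
                donors = [k for k in members
                          if k != i and rows[k][j] is not MISSING]
                if not donors:
                    continue  # no same-class donor: the cell stays missing
                # Steps 2 and 3: hybrid distance picks the nearest donor.
                nearest = min(
                    donors,
                    key=lambda k: hybrid_distance(rows[i], rows[k],
                                                  is_continuous, ranges),
                )
                filled[i][j] = rows[nearest][j]
    return filled


rows = [
    ["red", 1.0, None],
    ["red", 1.2, 5.0],
    ["blue", 3.0, 9.0],
    ["blue", None, 2.0],
    ["blue", 4.0, 2.5],
]
labels = ["A", "A", "A", "B", "B"]
filled = impute(rows, labels, is_continuous=[False, True, True])
```

In this toy table the first object takes its missing third attribute from the second object, its nearest neighbor within class A, and the fourth object takes its missing second attribute from the only other member of class B. Distances are computed on the original rows rather than on partially filled ones, a simplification chosen so that the result does not depend on fill order.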

