Journal of Computer Applications, 2017, Vol. 37, Issue (4): 1014-1020. DOI: 10.11772/j.issn.1001-9081.2017.04.1014

• Data Science and Technology •

Data cleaning method based on dynamic configurable rules

ZHU Huijuan1,2,3, JIANG Tonghai1,3, ZHOU Xi1,3, CHENG Li1,3, ZHAO Fan1,3, MA Bo1,3

  1. Research Center for Multilingual Information Technology, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, Xinjiang 830011, China;
  2. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
  3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, Xinjiang 830011, China
  • Received: 2016-09-20 Revised: 2016-12-22 Online: 2017-04-10 Published: 2017-04-19
  • Corresponding author: CHENG Li
  • About the authors: ZHU Huijuan (born 1984), female, from Luoyang, Henan, Ph.D. candidate; research interests: data cleaning, data fusion, data analysis. JIANG Tonghai (born 1963), male, from Fuhai, Xinjiang, research fellow, doctoral supervisor, Ph.D.; research interests: data fusion, data analysis, data mining. ZHOU Xi (born 1978), male, from Shuangfeng, Hunan, research fellow, Ph.D.; research interests: data cleaning, data fusion, data analysis. CHENG Li (born 1973), male, from Urumqi, Xinjiang, research fellow, doctoral supervisor, Ph.D., CCF member; research interests: cloud computing, grid computing, high-performance computing, data fusion. ZHAO Fan (born 1980), male, from Jiexiu, Shanxi, associate research fellow, Ph.D. candidate; research interests: bilingual teaching, data visualization analysis. MA Bo (born 1984), male, from Anshan, Liaoning, associate research fellow, Ph.D.; research interests: semantic retrieval, data mining, knowledge discovery, data analysis.
  • Supported by:
    This work is partially supported by the Xinjiang High-Tech R&D Program (201512103), the West Light Foundation of the Chinese Academy of Sciences (XBBS201313), and the Xinjiang Young Scholar Support Program (2014721033).

Abstract: Traditional data cleaning approaches usually hard-code the cleaning rules dictated by business requirements, which results in poor reusability, scalability and flexibility. To address these issues, a Dynamic Rule-based Data Cleaning Method (DRDCM) was proposed. It supports complex logical operations between rules of various types, as well as three kinds of dirty-data repair behavior. It integrates data detection, data repair and data transformation in one system, and is domain-independent, reusable, configurable and extensible. Firstly, the concepts, implementation steps and algorithms of data detection and data repair in DRDCM were described. Secondly, the rule types supported by DRDCM and their configuration were presented. Finally, DRDCM was implemented and evaluated on data sets from real projects. The results show that, in dirty-data repair, the discard repair behavior achieves high accuracy; for attributes that must comply with statutory coding rules (such as ID card numbers), the accuracy can reach 100%. The experimental results also indicate that the implemented system can seamlessly integrate dynamic configurable rules across multiple data sources and application domains, and that its performance does not decrease sharply as the number of rules increases, which further validates that DRDCM is practical in real-world scenarios.

Key words: big data, data quality, data cleaning, dynamic configurable rules, data preprocessing
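
To make the idea of rules held as configuration rather than hard-coded logic more concrete, the following is a minimal Python sketch. It is not the DRDCM implementation: the rule schema, the names `CHECKS`, `evaluate`, `clean`, `CONFIG`, and the `set_default` action are all hypothetical and chosen only for illustration. Only the `discard` repair behavior and the ID card example come from the abstract. The check-digit routine follows the published weighting scheme for 18-character resident ID numbers, and the sample records are made-up test values.

```python
import re

# Weights and check characters used for 18-character resident ID numbers.
_WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
_CHECK_CHARS = "10X98765432"


def id_checksum_ok(value):
    """Return True if value is 17 digits plus a correct check character."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", value or ""):
        return False
    total = sum(int(d) * w for d, w in zip(value[:17], _WEIGHTS))
    return _CHECK_CHARS[total % 11] == value[17].upper()


# Hypothetical registry of leaf rule types: name -> check(rule_config, value).
CHECKS = {
    "not_empty": lambda cfg, v: v is not None and str(v).strip() != "",
    "regex": lambda cfg, v: re.fullmatch(cfg["pattern"], str(v)) is not None,
    "id_checksum": lambda cfg, v: id_checksum_ok(str(v)),
}


def evaluate(rule, value):
    """Evaluate a leaf rule, or an and/or/not combination of rules."""
    kind = rule["type"]
    if kind == "and":
        return all(evaluate(r, value) for r in rule["rules"])
    if kind == "or":
        return any(evaluate(r, value) for r in rule["rules"])
    if kind == "not":
        return not evaluate(rule["rule"], value)
    return CHECKS[kind](rule, value)


def clean(records, config):
    """Apply each configured rule; on violation run the configured repair."""
    kept = []
    for rec in records:
        rec = dict(rec)  # work on a copy so the input is not mutated
        discarded = False
        for entry in config:
            field = entry["field"]
            if evaluate(entry["rule"], rec.get(field)):
                continue
            if entry["on_violation"] == "discard":
                discarded = True  # drop the whole record
                break
            if entry["on_violation"] == "set_default":  # assumed action
                rec[field] = entry["default"]
        if not discarded:
            kept.append(rec)
    return kept


# Rules are plain data, so they could be loaded from JSON or a database.
CONFIG = [
    {"field": "id_number",
     "rule": {"type": "and",
              "rules": [{"type": "not_empty"}, {"type": "id_checksum"}]},
     "on_violation": "discard"},
    {"field": "gender",
     "rule": {"type": "regex", "pattern": "[MF]"},
     "on_violation": "set_default", "default": "U"},
]

RECORDS = [
    {"id_number": "11010519491231002X", "gender": "F"},
    {"id_number": "110105194912310021", "gender": "M"},  # bad check digit
    {"id_number": "11010519491231002X", "gender": "?"},
]

cleaned = clean(RECORDS, CONFIG)
print(cleaned)
```

Because the rules live in `CONFIG` rather than in code, adding a rule or changing a repair behavior in this sketch means editing data only. A deterministic check such as the ID check digit also suggests why discarding on violation can be fully accurate for attributes governed by a statutory coding rule.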

CLC Number: