Data cleaning method based on dynamic configurable rules

doi:10.11772/j.issn.1001-9081.2017.04.1014

Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (4): 1014-1020. DOI: 10.11772/j.issn.1001-9081.2017.04.1014

ZHU Huijuan^1,2,3, JIANG Tonghai^1,3, ZHOU Xi^1,3, CHENG Li^1,3, ZHAO Fan^1,3, MA Bo^1,3

1. Research Center for Multilingual Information Technology, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi Xinjiang 830011, China;
2. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi Xinjiang 830011, China

Received: 2016-09-20 Revised: 2016-12-22 Online: 2017-04-10 Published: 2017-04-19
Corresponding author: CHENG Li
About the authors: ZHU Huijuan (born 1984), female, from Luoyang, Henan, Ph.D. candidate; research interests: data cleaning, data fusion, data analysis. JIANG Tonghai (born 1963), male, from Fuhai, Xinjiang, research fellow, doctoral supervisor, Ph.D.; research interests: data fusion, data analysis, data mining. ZHOU Xi (born 1978), male, from Shuangfeng, Hunan, research fellow, Ph.D.; research interests: data cleaning, data fusion, data analysis. CHENG Li (born 1973), male, from Urumqi, Xinjiang, research fellow, doctoral supervisor, Ph.D., CCF member; research interests: cloud computing, grid computing, high-performance computing, data fusion. ZHAO Fan (born 1980), male, from Jiexiu, Shanxi, associate research fellow, Ph.D. candidate; research interests: bilingual teaching, data visualization analysis. MA Bo (born 1984), male, from Anshan, Liaoning, associate research fellow, Ph.D.; research interests: semantic retrieval, data mining, knowledge discovery, data analysis.
Supported by:
This work is partially supported by the Xinjiang High-Tech R&D Program (201512103), the West Light Foundation of the Chinese Academy of Sciences (XBBS201313), and the Xinjiang Young Scholar Support Program (2014721033).

Abstract

Traditional data cleaning approaches usually hard-code the cleaning rules specified by business requirements, which leads to poor reusability, extensibility and flexibility. To address these issues, a Dynamic Rule-based Data Cleaning Method (DRDCM) was proposed. It supports complex logical operations between rules of various types and three kinds of dirty-data repair behavior, and it integrates data detection, data repair and data transformation in one system. The method is cross-domain, reusable, configurable and extensible. First, the concepts, procedures and algorithms of data detection and data repair in DRDCM were described. Second, the rule types supported by DRDCM and their configuration were presented. Finally, DRDCM was implemented and evaluated on data sets from real projects. The results show that the discard repair behavior achieves high accuracy; for attributes that must comply with statutory coding rules (such as ID card numbers), the accuracy reaches 100%. The results also indicate that the implemented system can integrate dynamic configurable rules seamlessly across multiple data sources and application domains, and that its performance does not decrease sharply as the number of rules grows, which further validates that DRDCM is practical in real-world settings.

Key words: big data, data quality, data cleaning, dynamic configurable rules, data preprocessing
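As a rough illustration of the idea summarized in the abstract, the sketch below keeps cleaning rules as plain data rather than hard-coded logic, combines them with AND/OR, and attaches a repair behavior to each rule. It is not the DRDCM implementation: the rule types (`not_null`, `regex`, `id_number`), the field names, and the `set_default` action are hypothetical and chosen only for this example; the abstract names only the discard behavior. The ID number check assumes the 18-digit format with a weighted mod-11 check character.

```python
import re
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Check = Callable[[Record], bool]

# Weights and check characters of the 18-digit ID number (mod-11 scheme).
_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_CHECK_CHARS = "10X98765432"


def valid_id_number(value: str) -> bool:
    """Return True if value is 17 digits plus a correct check character."""
    if not re.fullmatch(r"\d{17}[\dX]", value):
        return False
    total = sum(int(d) * w for d, w in zip(value[:17], _WEIGHTS))
    return _CHECK_CHARS[total % 11] == value[17]


def build_check(spec: Dict[str, Any]) -> Check:
    """Turn a rule specification (plain data) into a predicate on a record."""
    kind = spec["type"]
    if kind in ("and", "or"):
        parts = [build_check(s) for s in spec["rules"]]
        combine = all if kind == "and" else any
        return lambda rec: combine(p(rec) for p in parts)
    field = spec["field"]
    if kind == "not_null":
        return lambda rec: rec.get(field) not in (None, "")
    if kind == "regex":
        pattern = re.compile(spec["pattern"])
        return lambda rec: pattern.fullmatch(str(rec.get(field, ""))) is not None
    if kind == "id_number":
        return lambda rec: valid_id_number(str(rec.get(field, "")))
    raise ValueError(f"unknown rule type: {kind}")


def clean(records: List[Record], rules: List[Dict[str, Any]]) -> List[Record]:
    """Apply every rule to every record; repair or discard on violation."""
    compiled = [(build_check(r["check"]), r) for r in rules]
    cleaned = []
    for original in records:
        rec = dict(original)  # never mutate the caller's data
        keep = True
        for check, rule in compiled:
            if check(rec):
                continue
            if rule["on_fail"] == "discard":
                keep = False
                break
            if rule["on_fail"] == "set_default":  # hypothetical repair action
                rec[rule["field"]] = rule["default"]
        if keep:
            cleaned.append(rec)
    return cleaned


# Hypothetical rule configuration; in practice this could be loaded from a file.
RULES = [
    {"check": {"type": "id_number", "field": "id_no"}, "on_fail": "discard"},
    {
        "check": {
            "type": "or",
            "rules": [
                {"type": "regex", "field": "phone", "pattern": r"1\d{10}"},
                {"type": "regex", "field": "phone", "pattern": r"0\d{2,3}-\d{7,8}"},
            ],
        },
        "on_fail": "set_default",
        "field": "phone",
        "default": None,
    },
]
```

Because the rules are data, a new source or domain only needs a different `RULES` list; `clean` itself stays unchanged, which is the reusability argument the abstract makes against hard-coding.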

CLC number: TP311.11

ZHU Huijuan, JIANG Tonghai, ZHOU Xi, CHENG Li, ZHAO Fan, MA Bo. Data cleaning method based on dynamic configurable rules[J]. Journal of Computer Applications, 2017, 37(4): 1014-1020.


