Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (10): 2942-2947. DOI: 10.11772/j.issn.1001-9081.2019030492

• Data Science and Technology •

Mass data clean system based on regular expression

CHANG Zheng, LYU Yong

  1. China Academy of Electronics and Information Technology, China Electronics Technology Group Corporation, Beijing 100086, China
  • Received: 2019-03-25 Revised: 2019-06-14 Online: 2019-07-15 Published: 2019-10-10
  • Corresponding author: CHANG Zheng
  • About the authors: CHANG Zheng (born 1986), male, from Chongqing, engineer, Ph.D.; main research interests: pattern recognition, software engineering, databases. LYU Yong (born 1981), male, from Shenyang, Liaoning, engineer, M.S.; main research interests: computer operations and maintenance, systems engineering.
  • Supported by:
    This work is partially supported by the National Science and Technology Major Project (2017ZX01013201).

Abstract: To address the shortcomings of current mainstream Extract, Transform, Load (ETL) tools and of some applications in restricted environments, and to meet the specific requirements of restricted application scenarios, a Regular Expression Mass-data Cleaning System (REMCS) is proposed. REMCS first identifies the data characteristics behind six typical problems, including ultra-long erroneous data, batch fusion of data source files, and automatic sorting of data source files. It then sets suitable regular expressions and pre-processing algorithms according to those characteristics, and uses the resulting algorithm model to remove errors from the data and complete the pre-processing. The logical structure of REMCS, the common problems, the corresponding algorithms, and the code implementation scheme are described in detail. Finally, comparison experiments on several common data processing problems evaluate four aspects: compatible data source file formats, solvable problem types, problem processing time, and data processing limit. Compared with traditional ETL tools, REMCS supports nine typical file formats, such as csv, json, and dump, handles all six common problems, takes less processing time, and supports a larger data limit. The experimental results show that REMCS has good applicability and accuracy for common data processing problems in restricted application scenarios.

Key words: regular expression, data cleaning, mass data, Extract Transform Load (ETL) tool
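
The abstract does not give the regular expressions or algorithms that REMCS uses, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a three-field comma-separated record, a hypothetical field-length limit `MAX_FIELD_LEN`, and two hypothetical content signatures in `FORMAT_RULES`. It illustrates how a regular expression might reject ultra-long erroneous records and how the first line of a source file might be used to sort files by format.

```python
import re

# Hypothetical limit: a field longer than this is treated as ultra-long error data.
MAX_FIELD_LEN = 64

# Assumed record shape: exactly three non-empty, comma-free fields,
# each at most MAX_FIELD_LEN characters long.
RECORD_RE = re.compile(
    r"[^,\n]{1,%d}(?:,[^,\n]{1,%d}){2}" % (MAX_FIELD_LEN, MAX_FIELD_LEN)
)

# Hypothetical content signatures for sorting source files by format.
# Rules are tried in order; the first match wins.
FORMAT_RULES = [
    ("json", re.compile(r"\s*[\[{]")),              # starts with { or [
    ("csv", re.compile(r"[^,\n]+(?:,[^,\n]*)+")),   # has at least one comma
]


def clean_lines(lines):
    """Split lines into (kept, rejected) according to RECORD_RE."""
    kept, rejected = [], []
    for line in lines:
        text = line.rstrip("\r\n")
        if RECORD_RE.fullmatch(text):
            kept.append(text)
        else:
            rejected.append(text)
    return kept, rejected


def sniff_format(first_line):
    """Return the label of the first matching signature, else 'unknown'."""
    for label, pattern in FORMAT_RULES:
        if pattern.match(first_line):
            return label
    return "unknown"
```

In this sketch a record with an over-long field or the wrong number of fields goes to the rejected list rather than being dropped, so that rejected data can be inspected later; whether REMCS behaves this way is not stated in the abstract.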

CLC Number: