计算机应用 ›› 2016, Vol. 36 ›› Issue (1): 21-26.DOI: 10.11772/j.issn.1001-9081.2016.01.0021

• 第32届中国数据库学术会议(NDBC 2015) • 上一篇    下一篇

压缩数据上的关系代数操作算法

丁鑫哲1, 张兆功1, 李建中2, 谭龙1, 刘勇1   

  1. 1. 黑龙江大学 计算机科学技术学院, 哈尔滨 150080;
    2. 哈尔滨工业大学 计算机科学技术学院, 哈尔滨 150010
  • 收稿日期:2015-07-27 修回日期:2015-08-04 出版日期:2016-01-10 发布日期:2016-01-09
  • 通讯作者: 张兆功(1963-),男,黑龙江哈尔滨人,教授,博士,CCF会员,主要研究方向:海量数据挖掘、生物信息学
  • 作者简介:丁鑫哲(1990-),男,福建泉州人,硕士研究生,主要研究方向:大数据压缩;李建中(1950-),男,黑龙江哈尔滨人,教授,CCF会员,主要研究方向:海量数据管理与计算、无线传感器网络;谭龙(1971-),男,黑龙江哈尔滨人,副教授,博士,CCF会员,主要研究方向:无线认知网络、数据挖掘;刘勇(1990-),男,黑龙江七台河人,硕士研究生,主要研究方向:关键字检索。
  • 基金资助:
    国家自然科学基金资助项目(81273649);黑龙江省自然科学基金资助项目(F201434)。

Relational algebraic operation algorithm on compressed data

DING Xinzhe1, ZHANG Zhaogong1, LI Jianzhong2, TAN Long1, LIU Yong1   

  1. 1. College of Computer Science and Technology, Heilongjiang University, Harbin Heilongjiang 150080, China;
    2. College of Computer Science and Technology, Harbin Institute of Technology, Harbin Heilongjiang 150010, China
  • Received:2015-07-27 Revised:2015-08-04 Online:2016-01-10 Published:2016-01-09
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (81273649), the Natural Science Foundation of Heilongjiang Province (F201434).

摘要: 针对在大数据管理中,在压缩的数据上无需解压即可进行相关操作的问题,在数据服从正态分布的前提下,根据列数据存储的特点,提出了一种新的面向列存储的压缩方法——CCA。首先,通过对列数据的长度进行归类;然后,采用抽样的方法获得重复度较高的前缀;最后,使用字典编码进行压缩,提出了列索引(CI)和列实体(CR)作为数据压缩结构来降低大数据存储的空间需求,从而直接有效地在压缩数据上支持选择、投影、连接等基本操作,并实现了基于CCA的数据库原型系统——D-DBMS。理论分析和在1 TB数据上的实验结果表明,该压缩算法能够显著提高大数据的存储效率和数据操作性能,与BAP和TIDC压缩方法相比,在压缩率分别提高了51%、14%;在执行速度上提高了47%、42%。

关键词: 大数据压缩, 列索引, 列实体, 关系代数操作

Abstract: Since in the massive data management, the compressed data can be done some operations without decompressing first, under the condition of normal distribution, according to features of column data storage, a new compression algorithm which oriented column storage, called CCA (Column Compression Algorithm), was proposed. Firstly, the length of data was classified; secondly, the sampling method was used to get more repetitive prefix; finally the dictionary coding was utilized to compress, meanwhile the Column Index (CI) and Column Reality (CR) were acted as data compression structure to reduce storage requirement of massive data storage, thus the basic relational algebraic operations such as select, project and join were directly and effectively supported. A prototype database system based on CCA, called D-DBMS (Ding-Database Management System), was implemented. The theoretical analyses and the results of experiment on 1 TB data show that the proposed compression algorithm can significantly improve the performance of massive data storage efficiency and data manipulation. Compared to BAP (Bit Address Physical) and TIDC (TupleID Center) method, the compression rate of CCA was improved by 51% and 14%, and its running speed was improved by 47% and 42%.

Key words: massive data compression, Column Index (CI), Column Reality (CR), relational algebraic operation

中图分类号: