计算机应用 ›› 2021, Vol. 41 ›› Issue (3): 727-732.DOI: 10.11772/j.issn.1001-9081.2020060874

所属专题: 数据科学与技术

• 数据科学与技术 • 上一篇    下一篇

基于有向无环图的倒排链等字长划分压缩算法

姜琨1,2, 刘征1,2, 朱磊1,2, 李晓星3   

  1. 1. 西安理工大学 计算机科学与工程学院, 西安 710048;
    2. 陕西省网络计算与安全技术重点实验室(西安理工大学), 西安 710048;
    3. 中国人民解放军 63785部队, 西安 710043
  • 收稿日期:2020-06-15 修回日期:2020-11-05 出版日期:2021-03-10 发布日期:2021-01-15
  • 通讯作者: 姜琨
  • 作者简介:姜琨(1984-),男,陕西子洲人,讲师,博士,CCF会员,主要研究方向:搜索引擎索引压缩、top-k查询算法;刘征(1984-),男,江西南昌人,讲师,博士,CCF会员,主要研究方向:搜索引擎索引压缩、知识图谱;朱磊(1983-),男,陕西咸阳人,副研究员,博士,CCF会员,主要研究方向:知识图谱、图挖掘;李晓星(1985-),男,山西晋城人,工程师,硕士,CCF会员,主要研究方向:数据挖掘、知识图谱。
  • 基金资助:
    国家自然科学基金资助项目(61602374);陕西省自然科学基础研究计划项目(2016JQ6041)。

Fixed word-aligned partition compression algorithm of inverted list based on directed acyclic graph

JIANG Kun1,2, LIU Zheng1,2, ZHU Lei1,2, LI Xiaoxing3   

  1. 1. Faculty of Computer Science and Engineering, Xi'an University of Technology, Xi'an Shaanxi 710048, China;
    2. Shaanxi Key Laboratory for Network Computing and Security Technology(Xi'an University of Technology), Xi'an Shaanxi 710048, China;
    3. PLA 63785 Troop, Xi'an Shaanxi 710043, China
  • Received:2020-06-15 Revised:2020-11-05 Online:2021-03-10 Published:2021-01-15
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61602374), the Natural Science Basic Research Plan in Shaanxi Province (2016JQ6041).

摘要: 在搜索引擎的倒排索引等字长(FWA)类型压缩算法中,倒排链的“贪心”分块划分策略和码字信息的交错存储使算法难以达到最优的压缩效果。针对上述问题,提出了一种基于有向无环图(DAG)的FWA划分压缩算法。首先,考虑到互联网网页聚类特性带来的倒排链小数字信息,设计了一种数据区为64位分块的新型FWA压缩格式。该压缩格式通过4位的指示区将数据区划分为16种适合于连续小数字压缩的存储模式,并将倒排链每个分块的指示位和数据位分类存储,从而保证了较好的批量解压性能。其次,在新压缩格式的基础上提出一种基于DAG描述的倒排链FWA划分压缩方法——固定字对齐划分(WAP)算法。该算法利用DAG将倒排链分块划分问题归结为单源最短路径(SSSP)问题,并考虑FWA压缩格式中数据区存储模式的限制条件来确定SSSP问题的结构形式和递归定义。然后,给出了采用动态规划求解SSSP问题并形成最优划分向量的伪码和算法复杂度,并对S9、S16、S8b等传统FWA算法的原有存储模式进行了基于DAG的划分优化,把优化前后的算法的计算复杂度进行比较分析。最后,使用仿真整数序列数据和文本检索会议(TREC) GOV2网页索引数据进行压缩性能实验。实验结果表明,相较于传统FWA类型算法,基于DAG的FWA划分算法在通过批量解压和划分优化技术提升算法的压缩率和解压速度同时,对连续小数字整数序列进行压缩时能够获得比传统参照框架(FOR)类型算法更高的压缩率。

关键词: 倒排索引, 等字长压缩算法, 有向无环图, 最优划分, 动态规划

Abstract: In Fixed Word-Aligned (FWA) inverted index compression algorithms of Web search engines, due to the "greedy" block partition strategy of the inverted list and the interleaved storage of the codeword information, it is difficult for the algorithm to achieve the optimal compression performance. To solve the above problem, an FWA partition compression algorithm based on Directed Acyclic Graph (DAG) was proposed. Firstly, considering the small integer information in the inverted list brought by the clustering characteristics of Web pages, a novel FWA compression format with data area of 64-bit blocks was designed. In this compression format, the data area was divided into 16 storage modes suitable for continuous small integer compression through 4-bit selector area, and the selector area and data area in each block of the inverted list were stored separately, so as to ensure good batch decompression performance. Secondly, a DAG described Word-Aligned Partition (WAP) algorithm was proposed on the basis of the new compression format. In the algorithm, the inverted list block partitioning problem was regarded as a Single-Source Shortest-Path (SSSP) problem by DAG, and by considering the constraints of various storage modes of data area in FWA compression format, the structure and recursive definition of the SSSP problem were determined. Thirdly, the dynamic programming technique was used to solve the problem of SSSP and generate the pseudo-code and algorithm complexity of the optimal partition vector. The original storage modes of traditional FWA algorithms such as S9, S16 and S8b were partitioned and optimized based on DAG, and the computational complexities of the algorithms before and after optimization were compared and analyzed. Finally, the compression performance experiments were conducted with simulation integer sequence data and Text REtrieval Conference (TREC) GOV2 Web page index data. Experimental results show that, compared with the traditional FWA algorithms, the DAG based FWA partition algorithm can improve the compression ratio and decompression speed by batch decompression and partition optimization technology. At the same time, it can obtain a higher compression ratio than the traditional Frame Of Reference (FOR) algorithms for the compression of continuous small integer sequence.

Key words: inverted index, Fixed Word-Aligned (FWA) compression algorithm, Directed Acyclic Graph (DAG), optimal partitioning, dynamic programming

中图分类号: