Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (3): 636-642.DOI: 10.11772/j.issn.1001-9081.2020091473

Special Issue: 第37届CCF中国数据库学术会议(NDBC 2020)

• The 37th CCF National Database Conference (NDBC 2020) • Previous Articles     Next Articles

AAC-Hunter: efficient algorithm for discovering aggregation algebraic constraints in relational databases

ZHANG Xiaowei1, JIANG Dawei1,2, CHEN Ke1,2, CHEN Gang1,2   

  1. 1. College of Computer Science and Technology, Zhejiang University, Hangzhou Zhejiang 310027, China;
    2. Key Laboratory of Big Data Intelligent Computing of Zhejiang Province(Zhejiang University), Hangzhou Zhejiang 310027, China
  • Received:2020-09-07 Revised:2020-10-30 Online:2021-03-10 Published:2020-11-10
  • Supported by:
    This work is partially supported by the Youth Program of National Natural Science Foundation of China (61702449).


张效伟1, 江大伟1,2, 陈珂1,2, 陈刚1,2   

  1. 1. 浙江大学 计算机科学与技术学院, 杭州 310027;
    2. 浙江省大数据智能计算重点实验室(浙江大学), 杭州 310027
  • 通讯作者: 陈珂
  • 作者简介:张效伟(1996-),男,浙江嘉兴人,硕士研究生,主要研究方向:数据库约束发现;江大伟(1982-),男,安徽蚌埠人,研究员,博士,主要研究方向:分布式数据库、大数据管理、区块链;陈珂(1977-),女,河南开封人,副教授,博士,CCF会员,主要研究方向:时空数据库、数据挖掘、数据隐私保护;陈刚(1973-),男,浙江奉化人,教授,博士,CCF会员,主要研究方向:大数据管理。
  • 基金资助:

Abstract: In order to better maintain the data integrity and help auditors find anomalous reimbursement records in relational databases, the algorithm AAC-Hunter (Aggregation Algebraic Constraints Hunter), which discovered Aggregation Algebraic Constraints (AACs) automatically, was proposed. An AAC is a fuzzy constraint defined between the aggregation results of two columns in the database and acts on most but not all records. Firstly, joining, grouping and algebraic expressions were enumerated to generate candidate AACs. Secondly, the value range sets of these candidate AACs were calculated. Finally, the AAC results were output. However, this method was not able to face the performance challenges caused by massive data, so that a set of heuristic rules were applied to decrease the size of candidate constraint space and the optimization strategies based on intermediate results reuse and trivial candidate AACs elimination were employed to speed up the value range set calculation for candidate AACs. Experimental results on TPC-H and European Soccer datasets show that AAC-Hunter reduces the constraint discovery space by 95.68% and 99.94% respectively, and shortens running time by 96.58% and 92.51% respectively, compared with the baseline algorithm without heuristic rules or optimization strategies. As the effectiveness of AAC-Hunter is verified, it can be seen that AAC-Hunter can improve the efficiency and capability of auditing application.

Key words: constraints discovery, Aggregation Algebraic Constraint (AAC), relational database, data-driven, auditing

摘要: 针对如何更好地维护关系数据库的数据完整性以及帮助审计员找出违规的报销记录的问题,提出了自动发现聚合代数约束(AAC)的算法AAC-Hunter。AAC是一种定义在数据库中两列的聚合结果之间的模糊约束,作用于大多数而非全部记录上。AAC-Hunter首先枚举连接、分组和代数表达式来产生候选AAC,然后分别计算这些候选AAC的值域集合,最后输出AAC结果。但该方法无法应对海量数据带来的性能挑战,因此AAC-Hunter提出了一套启发式规则减小候选约束空间规模以及基于中间结果复用和消除平凡候选AAC的两个优化策略来加速候选AAC的值域集合计算。实验结果表明了对比不使用启发式规则和优化策略的基线算法,AAC-Hunter在TPC-H和European Soccer数据集上分别减小了95.68%和99.94%的约束发现空间,分别缩短了96.58%和92.51%的运行时间。可见AAC-Hunter具备有效性,能够提升审计应用的效率和能力。

关键词: 约束发现, 聚合代数约束, 关系数据库, 数据驱动, 审计

CLC Number: