Efficient block-based sampling algorithm for aggregation query processing on duplicate charged records

doi:10.11772/j.issn.1001-9081.2017112632

Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (6): 1596-1600. DOI: 10.11772/j.issn.1001-9081.2017112632

PAN Mingyu¹, ZHANG Lu¹, LONG Guobiao¹, LI Xianglong¹, MA Dongxue¹, XU Liang²

1. State Grid Beijing Electric Power Company, Beijing 100075, China;
2. NARI Group, Beijing 102299, China

Received: 2017-11-06 Revised: 2018-02-05 Online: 2018-06-13 Published: 2018-06-10
Corresponding author: PAN Mingyu
About the authors: PAN Mingyu (born 1985), male, native of Beijing, senior engineer, M.S.; main research interest: charging and battery-swap operation data analysis. ZHANG Lu (born 1984), male, native of Beijing, senior engineer, Ph.D.; main research interest: charging and battery-swap operation data analysis. LONG Guobiao (born 1967), male, native of Beijing, senior engineer; main research interest: charging and battery-swap operation data analysis. LI Xianglong (born 1980), male, native of Shijiazhuang, Hebei, senior engineer, M.S.; main research interest: charging and battery-swap operation data analysis. MA Dongxue (born 1989), female, native of Beijing, intermediate economist, M.S.; main research interest: charging and battery-swap operation consulting. XU Liang (born 1981), male, native of Shenyang, Liaoning, senior engineer, M.S.; main research interest: charging and battery-swap operation data analysis.
Supported by:
This work is partially supported by the Science and Technology Project of State Grid Corporation (52020116000j).

Abstract

Abstract: Existing query analysis methods usually treat entity resolution as an offline preprocessing step that cleans the entire data set. As data volumes keep growing, however, this computationally expensive offline cleaning mode can hardly meet the needs of real-time analysis applications. To address aggregation queries over duplicate charging operation records, a method that combines approximate aggregation query processing with entity resolution is proposed. First, samples are collected with a block-based sampling strategy. Then, entity resolution is applied to the collected samples to identify duplicate entities. Finally, an unbiased estimate of the aggregation result is reconstructed from the entity resolution results. The proposed method avoids the time cost of resolving all entities and returns query results that meet user requirements by resolving only a small amount of sample data. Experimental results on real and synthetic data sets verify the efficiency and reliability of the proposed method.

Key words: big data, entity resolution, aggregation query, block sampling, distributed computing
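
The abstract names three steps (block-based sampling, entity resolution on the sample, reconstruction of an unbiased estimate) but does not give the estimator itself. The Python sketch below is one possible instance of that pipeline for a SUM query, not the authors' algorithm. It assumes that all copies of an entity fall inside the same block, that blocks are drawn by simple random sampling without replacement, and that entity resolution can be approximated by comparing a normalized key. The field names `card`, `start` and `kwh` and every function name are hypothetical.

```python
import random


def record_key(rec):
    # Toy entity resolution (assumption): two records are the same charging
    # event when the normalized card ID and the start time agree.
    return (rec["card"].strip().upper(), rec["start"])


def record_value(rec):
    return rec["kwh"]


def dedup_block_sum(block, key, value):
    """Resolve entities inside one block and sum one value per entity."""
    first_seen = {}
    for rec in block:
        first_seen.setdefault(key(rec), value(rec))
    return sum(first_seen.values())


def scaled_estimate(sampled_blocks, total_blocks, key, value):
    """Scale the deduplicated sum of the sampled blocks up to all blocks.

    Each block is included with probability len(sampled_blocks) / total_blocks,
    so dividing by that probability gives an unbiased estimate of the
    deduplicated total (assuming duplicates never span two blocks).
    """
    scale = total_blocks / len(sampled_blocks)
    return scale * sum(dedup_block_sum(b, key, value) for b in sampled_blocks)


def estimate_sum(blocks, n_sample, key, value, rng=None):
    """Draw n_sample blocks without replacement and estimate the total."""
    if not 1 <= n_sample <= len(blocks):
        raise ValueError("n_sample must be between 1 and len(blocks)")
    rng = rng or random.Random()
    sampled = rng.sample(blocks, n_sample)
    return scaled_estimate(sampled, len(blocks), key, value)


# Illustrative data only: three blocks, two of which contain a duplicate.
blocks = [
    [
        {"card": "a01", "start": "2017-01-01T08:00", "kwh": 10.0},
        {"card": "A01 ", "start": "2017-01-01T08:00", "kwh": 10.0},
        {"card": "B02", "start": "2017-01-01T09:00", "kwh": 5.0},
    ],
    [
        {"card": "C03", "start": "2017-01-01T10:00", "kwh": 20.0},
    ],
    [
        {"card": "D04", "start": "2017-01-01T11:00", "kwh": 7.0},
        {"card": "d04", "start": "2017-01-01T11:00", "kwh": 7.0},
    ],
]

# Sampling every block reproduces the exact deduplicated total.
print(estimate_sum(blocks, 3, record_key, record_value))  # 42.0
# Sampling two of the three blocks gives an estimate that varies by draw.
print(estimate_sum(blocks, 2, record_key, record_value, random.Random(0)))
```

In this toy data a naive sum over all six records gives 59.0, while the deduplicated total is 42.0. Only the sampled blocks are resolved, which is where the time saving described in the abstract would come from. If duplicates can span blocks, the scaling step above is no longer unbiased and a different correction would be needed.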

CLC number:

TP311

PAN Mingyu, ZHANG Lu, LONG Guobiao, LI Xianglong, MA Dongxue, XU Liang. Efficient block-based sampling algorithm for aggregation query processing on duplicate charged records[J]. Journal of Computer Applications, 2018, 38(6): 1596-1600.
