Accelerating parallel searching similar multiple patterns from data streams by using MapReduce

doi:10.11772/j.issn.1001-9081.2017.01.0037

Journal of Computer Applications, 2017, Vol. 37, Issue (1): 37-41. DOI: 10.11772/j.issn.1001-9081.2017.01.0037

• Papers from the 2016 National Academic Conference on Open Distributed and Parallel Computing (DPCS2016) •

FU Chen¹, ZHONG Cheng¹, YE Bo²

1. School of Computer, Electronics and Information, Guangxi University, Nanning Guangxi 530004, China;
2. Guangxi Scientific and Technological Information Center, Nanning Guangxi 530012, China

Received: 2016-08-25 Revised: 2016-09-03 Online: 2017-01-10 Published: 2017-01-09
Corresponding author: ZHONG Cheng
About the authors: FU Chen (born 1988), female, from Ruichang, Jiangxi, M.S.; research interests: network software engineering, high-performance computing for big data. ZHONG Cheng (born 1964), male, from Guiping, Guangxi, professor, Ph.D., CCF senior member; research interests: parallel computing, high-performance computing for big data, network software engineering. YE Bo (born 1974), male, from Nanning, Guangxi, professor-level senior engineer, Ph.D.; research interests: network software engineering, management information systems.
Supported by:
This work is supported by the Natural Science Foundation of Guangxi (2014GXNSFAA118396).

Abstract

Abstract: An effective storage scheme for time series data in the Hadoop Distributed File System (HDFS) was designed, and the subsequences were distributed to the compute nodes of a Hadoop cluster using the Distributed Cache tool. The dynamic time warping (DTW) distance matrix was partitioned into multiple sub-matrices, and the sub-matrices on each anti-diagonal were computed in parallel, iterating over the anti-diagonals. On this basis, DTW distances between time series were computed efficiently in parallel under the MapReduce programming model, and a parallel algorithm for multi-pattern similarity search over data streams was designed and implemented by improving the method for pruning redundant computation. Experimental results on the long-term snow depth time series data set of China show that when each time series has length 5000 or more, parallel computation of the DTW distance takes less time than serial computation; when each time series has length 9000 or more, the more cluster nodes take part in the computation, the less time the parallel computation takes; and when the pattern length reaches 4000 and five or more cluster nodes take part, searching the data stream in parallel for subsequences similar to the patterns takes about 20% of the time of the serial search.

Key words: time series, data stream, dynamic time warping distance, pattern searching, Hadoop
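The anti-diagonal scheme described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' MapReduce implementation: the function names `dtw_serial` and `dtw_blocked`, the `block` and `workers` parameters, the absolute-difference local cost, and the use of a thread pool as a stand-in for map tasks are all assumptions made for illustration. The sketch only shows why sub-matrices on the same anti-diagonal can be filled independently: each one depends solely on sub-matrices from earlier anti-diagonals.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def dtw_serial(x, y):
    """Reference DTW distance using |a - b| as the (assumed) local cost."""
    n, m = len(x), len(y)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def dtw_blocked(x, y, block=4, workers=4):
    """DTW distance computed block by block along anti-diagonals.

    The thread pool is a hypothetical stand-in for the cluster's map tasks.
    """
    n, m = len(x), len(y)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    rows = math.ceil(n / block)  # number of block rows
    cols = math.ceil(m / block)  # number of block columns

    def fill(bi, bj):
        # Fill one sub-matrix; its top, left, and top-left neighbours
        # lie on earlier anti-diagonals and are already complete.
        for i in range(bi * block + 1, min((bi + 1) * block, n) + 1):
            for j in range(bj * block + 1, min((bj + 1) * block, m) + 1):
                cost = abs(x[i - 1] - y[j - 1])
                d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for k in range(rows + cols - 1):  # one iteration per anti-diagonal
            blocks = [(bi, k - bi)
                      for bi in range(max(0, k - cols + 1), min(rows, k + 1))]
            # Blocks on the same anti-diagonal are mutually independent;
            # list() waits for all of them before the next iteration.
            list(pool.map(lambda b: fill(*b), blocks))
    return d[n][m]
```

For any block size, `dtw_blocked(x, y)` should return the same value as `dtw_serial(x, y)`; only the order in which cells are filled changes. In CPython the thread pool gives no real speedup for this pure-Python loop, so the sketch demonstrates the dependency structure rather than the performance reported in the abstract.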


FU Chen, ZHONG Cheng, YE Bo. Accelerating parallel searching similar multiple patterns from data streams by using MapReduce[J]. Journal of Computer Applications, 2017, 37(1): 37-41.

