Implementation of decision tree algorithm dealing with massive noisy data based on Hadoop

Journal of Computer Applications, 2015, Vol. 35, Issue (4): 1143-1147. DOI: 10.11772/j.issn.1001-9081.2015.04.1143

LIU Yaqiu^1,2, LI Haitao^1,2, JING Weipeng^1,2

1. College of Information and Computer Engineering, Northeast Forestry University, Harbin Heilongjiang 150040, China;
2. Heilongjiang Province Engineering Technology Research Center for Forestry Ecological Big Data Storage and High Performance Computing (Cloud Computing), Harbin Heilongjiang 150040, China

Received: 2014-11-15 Revised: 2014-12-23 Online: 2015-04-10 Published: 2015-04-08
Corresponding author: JING Weipeng
About the authors: LIU Yaqiu (born 1971), male, from Faku, Liaoning, professor, Ph.D.; research interests: high performance computing, embedded computing. LI Haitao (born 1989), male, from Pulandian, Liaoning, master's student; research interests: parallel and distributed computing, high performance computing. JING Weipeng (born 1979), male, from Hegang, Heilongjiang, associate professor, Ph.D.; research interests: distributed computing, fault-tolerant computing.
Funding: National Natural Science Foundation of China (31370565); Harbin Science and Technology Innovation Talent Research Special Fund (2013RFXXJ089).

Abstract

Current decision tree algorithms seldom consider how the noise level of the training set affects the model, and traditional memory-resident algorithms have difficulty processing massive data. To address both problems, an imprecise probability C4.5 algorithm, named IP-C4.5, was proposed on the Hadoop platform. During training, IP-C4.5 treats the training set used to build the tree as unreliable and uses an information gain ratio based on imprecise probability as the criterion for selecting the split attribute, which reduces the influence of noisy training data on the model. To strengthen its ability to handle massive data, IP-C4.5 was implemented on Hadoop as a MapReduce program based on file splitting. Comparative experiments against C4.5 and the Complete Credal Decision Tree (CCDT) show that when the training set is noisy, IP-C4.5 achieves higher accuracy, and its advantage is most pronounced when the noise level exceeds 10%. The Hadoop-based parallel IP-C4.5 is also able to process massive data.

Key words: Hadoop, C4.5, imprecise probability, noisy data, parallelization
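The abstract does not give the formula for the imprecise-probability information gain ratio, so the following is only a minimal sketch of how such a split criterion could be computed. It assumes the Imprecise Dirichlet Model with parameter s = 1 and takes the maximum-entropy distribution of the resulting credal set as the class distribution. For simplicity it weights branches by their empirical frequencies. All function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def max_entropy_counts(counts, s=1.0):
    """Spread the extra mass s over the smallest counts (water-filling).

    Assumption: this yields the maximum-entropy member of the IDM credal set.
    """
    if not counts:
        return []
    vals = sorted(counts)
    level, filled, remaining = vals[0], 1, s
    while filled < len(vals):
        # Mass needed to lift the `filled` lowest counts to the next value.
        need = (vals[filled] - level) * filled
        if need >= remaining:
            break
        remaining -= need
        level = vals[filled]
        filled += 1
    level += remaining / filled
    return [max(c, level) for c in counts]


def imprecise_entropy(counts, s=1.0):
    """Shannon entropy (bits) of the maximum-entropy adjusted distribution."""
    adjusted = max_entropy_counts(counts, s)
    total = sum(adjusted)
    return -sum(a / total * math.log2(a / total) for a in adjusted if a > 0)


def imprecise_gain_ratio(rows, attr, label, s=1.0):
    """Imprecise information gain divided by the usual C4.5 split information."""
    classes = sorted({r[label] for r in rows})

    def class_counts(subset):
        # Classes absent from a branch keep a zero count so they can absorb s.
        c = Counter(r[label] for r in subset)
        return [c.get(k, 0) for k in classes]

    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)

    n = len(rows)
    conditional = sum(
        len(g) / n * imprecise_entropy(class_counts(g), s) for g in groups.values()
    )
    gain = imprecise_entropy(class_counts(rows), s) - conditional
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: attribute "a" separates the classes, "b" does not.
rows = [
    {"a": 0, "b": 0, "y": 0},
    {"a": 0, "b": 1, "y": 0},
    {"a": 1, "b": 0, "y": 1},
    {"a": 1, "b": 1, "y": 1},
]
scores = {attr: imprecise_gain_ratio(rows, attr, "y") for attr in ("a", "b")}
best_attr = max(scores, key=scores.get)
```

On this toy data the sketch ranks attribute "a" above "b". In a MapReduce setting the class counts per attribute value would presumably be accumulated per file split by the mappers and summed by the reducers before the criterion is evaluated, but that wiring is not shown here.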

CLC number: TP181

LIU Yaqiu, LI Haitao, JING Weipeng. Implementation of decision tree algorithm dealing with massive noisy data based on Hadoop[J]. Journal of Computer Applications, 2015, 35(4): 1143-1147.


