Journal of Computer Applications, 2015, Vol. 35, Issue (4): 1143-1147. DOI: 10.11772/j.issn.1001-9081.2015.04.1143

• Data Technology •

Implementation of decision tree algorithm dealing with massive noisy data based on Hadoop

LIU Yaqiu1,2, LI Haitao1,2, JING Weipeng1,2

  1. College of Information and Computer Engineering, Northeast Forestry University, Harbin Heilongjiang 150040, China;
  2. Heilongjiang Province Engineering Technology Research Center for Forestry Ecological Big Data Storage and High Performance Computing (Cloud Computing), Harbin Heilongjiang 150040, China
  • Received: 2014-11-15 Revised: 2014-12-23 Online: 2015-04-10 Published: 2015-04-08
  • Corresponding author: JING Weipeng
  • About the authors: LIU Yaqiu (born 1971), male, from Faku, Liaoning; professor, Ph.D.; research interests: high-performance computing, embedded computing. LI Haitao (born 1989), male, from Pulandian, Liaoning; master's student; research interests: parallel and distributed computing, high-performance computing. JING Weipeng (born 1979), male, from Hegang, Heilongjiang; associate professor, Ph.D.; research interests: distributed computing, fault-tolerant computing.
  • Supported by:

    the National Natural Science Foundation of China (31370565); the Harbin Science and Technology Innovation Talent Research Special Fund (2013RFXXJ089).

Abstract:

Current decision tree algorithms seldom consider how the noise level of the training set affects the model, and traditional memory-resident algorithms have difficulty processing massive data. To address both problems, an Imprecise Probability C4.5 algorithm based on the Hadoop platform, named IP-C4.5, was proposed. When training the model, IP-C4.5 treats the training set used to build the tree as unreliable and uses an information gain ratio based on imprecise probability as the criterion for selecting the split attribute, which reduces the influence of noise in the training set on the model. To strengthen its ability to handle massive data, IP-C4.5 was implemented on Hadoop as a MapReduce program based on file splitting. Comparative experiments against C4.5 and the complete credal decision tree (CCDT) algorithm show that IP-C4.5 achieves relatively higher accuracy when the training set is noisy, particularly when the noise level exceeds 10%, and that the Hadoop-based parallel IP-C4.5 is able to process massive data.

Key words: Hadoop, C4.5, imprecise probability, noisy data, parallelization
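
The abstract does not spell out how the imprecise-probability gain ratio is computed. The following is a minimal sketch of one common formulation from the credal decision tree literature, not the authors' implementation. It assumes the Imprecise Dirichlet Model (IDM) with a hypothetical parameter `s`, takes the maximum-entropy distribution within the resulting credal set, and weights branches by plain empirical frequencies. All function names, the parameter `s`, and the toy data are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def max_entropy_probs(counts, s=1.0):
    """Maximum-entropy distribution in the IDM credal set (assumed model).

    The extra mass s is poured onto the smallest class counts until they
    level out ("water-filling"); larger counts are left unchanged.
    """
    levels = sorted(counts)
    remaining, level, k = s, levels[0], 1
    while k < len(levels):
        need = (levels[k] - level) * k  # mass needed to reach the next count
        if need >= remaining:
            break
        remaining -= need
        level = levels[k]
        k += 1
    level += remaining / k
    total = sum(counts) + s
    return [max(c, level) / total for c in counts]

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def imprecise_gain_ratio(rows, attr, label, classes, s=1.0):
    """Imprecise information gain divided by the usual C4.5 split information."""
    n = len(rows)
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[attr]][r[label]] += 1
    overall = Counter(r[label] for r in rows)
    h_root = entropy(max_entropy_probs([overall[c] for c in classes], s))
    h_cond, weights = 0.0, []
    for cnt in by_value.values():
        w = sum(cnt.values()) / n  # empirical branch weight (an assumption)
        weights.append(w)
        h_cond += w * entropy(max_entropy_probs([cnt[c] for c in classes], s))
    split_info = entropy(weights)
    return (h_root - h_cond) / split_info if split_info > 0 else 0.0

# Toy data (hypothetical): attribute "a" separates the classes, "b" does not.
rows = [{"a": "x" if i < 4 else "y",
         "b": "p" if i % 2 == 0 else "q",
         "cls": "yes" if i < 4 else "no"} for i in range(8)]
classes = ["yes", "no"]
best = max(["a", "b"],
           key=lambda t: imprecise_gain_ratio(rows, t, "cls", classes))
```

Under these assumptions, a pure branch is never assigned zero entropy, because the mass `s` is always spread onto the unseen class. This is what makes the criterion more cautious than the standard C4.5 gain ratio when labels may be noisy. In a MapReduce setting, the per-branch class counts in `by_value` are the kind of quantity that could be accumulated per file split and summed in a reducer, although the abstract gives no detail on how the authors partition this work.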

CLC number: