Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (09): 2463-2465. DOI: 10.3724/SP.J.1087.2012.02463

• Advanced computing •

Parallelization of decision tree algorithm based on MapReduce

LU Qiu*, CHENG Xiao-hui

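  1. School of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
  • Received: 2012-02-23 Revised: 2012-04-18 Online: 2012-09-01 Published: 2012-09-01
  • Corresponding author: LU Qiu
  • About the authors: LU Qiu (born 1979), female, from Qinzhou, Guangxi, lecturer, M.S.; research interests: databases, computer networks. CHENG Xiao-hui (born 1961), male, from Zhangshu, Jiangxi, professor; research interests: databases, embedded systems, computer networks.
  • Supported by:

    National Natural Science Foundation of China (61063001/F020207); State Key Laboratory of Industrial Control Technology, Zhejiang University (ICT1109)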

Parallelization of decision tree algorithm based on MapReduce

LU Qiu*, CHENG Xiao-hui

  1. School of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China
  • Received: 2012-02-23 Revised: 2012-04-18 Online: 2012-09-01 Published: 2012-09-01
  • Contact: Qiu Lu
  • Supported by:

    the National Natural Science Foundation of China

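Abstract: Traditional decision tree algorithms cannot handle massive-data mining, and the ID3 algorithm suffers from a multi-value bias. To address both problems, a parallel decision tree classification algorithm based on the MapReduce architecture was designed and implemented. The algorithm uses attribute similarity as the criterion for selecting test attributes, which avoids the multi-value bias of ID3, and uses the MapReduce model to handle massive-data mining. Experimental results on a Hadoop cluster built from ordinary PCs show that the MapReduce-based decision tree algorithm can handle large-scale data classification, scales well, and achieves near-linear speedup while maintaining classification accuracy.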

Key words: MapReduce, attribute similarity, Hadoop, decision tree, ID3 algorithm

Abstract: To address the inability of traditional decision tree algorithms to handle massive-data mining, as well as the multi-value bias problem of the ID3 algorithm, this paper designs and implements a parallel decision tree classification algorithm based on the MapReduce framework. The algorithm adopts attribute similarity as the criterion for selecting test attributes, which avoids the multi-value bias of ID3, and uses the MapReduce model to handle massive-data mining. Experiments on a Hadoop cluster built from ordinary PCs show that the MapReduce-based decision tree algorithm can handle large-scale data classification. Moreover, the algorithm scales well and achieves near-linear speedup while maintaining classification accuracy.

Key words: MapReduce, attribute similarity, Hadoop, decision tree, ID3 algorithm
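
The multi-value bias mentioned in the abstract can be illustrated with a small sketch. The abstract does not define the attribute-similarity measure, so it is not reproduced here; the code below only shows the problem it is meant to avoid, using the standard ID3 information gain. The counting is written in a map/reduce style in plain Python as an assumption about how per-attribute statistics might be gathered; it is not the authors' Hadoop implementation, and the records, function names, and the `id` and `outlook` attributes are hypothetical.

```python
import math
from collections import defaultdict

# Toy records (hypothetical data): (attribute dict, class label).
RECORDS = [
    ({"id": "r1", "outlook": "sunny"}, "no"),
    ({"id": "r2", "outlook": "sunny"}, "no"),
    ({"id": "r3", "outlook": "rain"}, "yes"),
    ({"id": "r4", "outlook": "rain"}, "no"),
    ({"id": "r5", "outlook": "overcast"}, "yes"),
    ({"id": "r6", "outlook": "overcast"}, "yes"),
]


def map_record(record):
    """Map step: emit ((attribute, value, label), 1) for every attribute."""
    attrs, label = record
    for name, value in attrs.items():
        yield (name, value, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts that share the same key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def entropy(counts):
    """Shannon entropy (base 2) of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(counts, attribute):
    """Standard ID3 information gain, computed from the reduced counts."""
    by_value = defaultdict(lambda: defaultdict(int))
    by_label = defaultdict(int)
    for (name, value, label), n in counts.items():
        if name == attribute:
            by_value[value][label] += n
            by_label[label] += n
    total = sum(by_label.values())
    remainder = sum(
        sum(labels.values()) / total * entropy(labels.values())
        for labels in by_value.values()
    )
    return entropy(by_label.values()) - remainder


def run():
    """Run the map and reduce steps, then score both attributes."""
    pairs = [kv for record in RECORDS for kv in map_record(record)]
    counts = reduce_counts(pairs)
    return {a: information_gain(counts, a) for a in ("id", "outlook")}


if __name__ == "__main__":
    for attribute, gain in run().items():
        print(f"{attribute}: {gain:.3f}")
```

In this toy data the `id` attribute takes a distinct value in every record, so each of its partitions is pure and it receives the highest possible gain even though it has no predictive value. This is the bias that an alternative selection criterion, such as the attribute similarity named in the abstract, is intended to avoid.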

CLC Number: