Implementation of decision tree algorithm dealing with massive noisy data based on Hadoop
LIU Yaqiu1,2, LI Haitao1,2, JING Weipeng1,2
1. College of Information and Computer Engineering, Northeast Forestry University, Harbin Heilongjiang 150040, China;
2. Heilongjiang Province Engineering Technology Research Center for Forestry Ecological Big Data Storage and High Performance Computing (Cloud Computing), Harbin Heilongjiang 150040, China
Concerning that current decision tree algorithms seldom consider the influence of the level of noise in the training set on the model, and traditional algorithms of resident memory have difficulty in processing massive data, an Imprecise Probability C4.5 algorithm named IP-C4.5 was proposed based on Hadoop. When training model, IP-C4.5 algorithm considered that the training set used to design decision trees is not reliable, and used imprecise probability information gain rate as selecting split criterion to reduce the influence of the noisy data on the model. To enhance the ability of dealing with massive data, IP-C4.5 was implemented on Hadoop by MapReduce programming based on file split. The experimental results show that when the training set is noisy, the accuracy of IP-C4.5 algorithm is higher than that of C4.5 and Complete CDT (CCDT), especially when the data noise degree is more than 10%, it has outstanding performance; and IP-C4.5 algorithm with parallelization based on Hadoop has the ability of dealing with massive data.
刘亚秋, 李海涛, 景维鹏. 基于Hadoop的海量嘈杂数据决策树算法的实现[J]. 计算机应用, 2015, 35(4): 1143-1147.
LIU Yaqiu, LI Haitao, JING Weipeng. Implementation of decision tree algorithm dealing with massive noisy data based on Hadoop. Journal of Computer Applications, 2015, 35(4): 1143-1147.
[1] GANTZ J, REINSEL D. The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east — United States [EB/OL].[2010-10-10]. http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf. [2] QUINLAN J R. C4.5: programs for machine learning[M]. Burlington: Morgan Kaufmann Publishers, 1993: 17-42. [3] QUINLAN J R. Induction of decision trees[J]. Machine Learning, 1986, 1(1): 81-106. [4] WALLEY P. Inferences from multinomial data: learning about a bag of marbles [J]. Journal of the Royal Statistical Society, Series B: Methodological, 1996,58(1): 3-57. [5] ABELLAN J, MORAL S. Building classification trees using the total uncertainty criterion [J]. International Journal of Intelligent Systems, 2003, 18(12): 1215-1225. [6] ABELLAN J, MASEGOSA A R. An experimental study about simple decision trees for bagging ensemble on datasets with classification noise [C]// Proceedings of the 10th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, LNCS 5590. Berlin: Springer-Verlag, 2009: 446-456. [7] ABELLAN J, MASEGOSA A R. Bagging schemes on the presence of class noise in classification [J]. Expert Systems with Applications, 2012, 39(8): 6827-6837. [8] MANTAS C J, ABELLAN J. Analysis and extension of decision trees based on imprecise probabilities: application on noisy data [J]. Expert Systems with Applications, 2014, 41(5): 2514-2525. [9] LI Y, JIANG D, LI F. The application of generating fuzzy ID3 algorithm in performance evaluation [J]. Procedia Engineering, 2012, 29: 229-234. [10] JIN C, LI F, LI Y. A generalized fuzzy ID3 algorithm using generalized information entropy [J]. Knowledge-Based Systems, 2014, 64: 13-21. [11] ARMBRUST M, FOX A, GRIFFITH R, et al. A view of cloud computing [J]. Communications of the ACM, 2010, 53(4): 50-58. [12] GHEMAWAT S, GOBIOFF H, LEUNG S T. The Google file system [C]// SOSP 2003: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. New York: ACM Press, 2003: 29-43. [13] DEAN J, GHEMAWAT S. MapReduce: simplified data processing on large clusters [J]. Communications of the ACM, 2008, 51(1): 107-113. [14] ZHANG J, WONG J, LI T, et al. A comparison of parallel large-scale knowledge acquisition using rough set theory on different MapReduce runtime systems [J]. International Journal of Approximate Reasoning, 2014, 55(3): 896-907. [15] DAI W, JI W. A MapReduce implementation of C4.5 decision tree algorithm [J]. International Journal of Database Theory and Application, 2014, 7(1):49-60. [16] WANG R, HE Y-L, CHOW C-Y, et al. Learning ELM-tree from big data based on uncertainty reduction [J]. Fuzzy Sets and Systems, 2015,258:79-100. [17] WITTEN I H, FRANK E. Data mining: practical machine learning tools and techniques [M]. Burlington: Morgan Kaufmann Publishers, 2005. [18] HETTICH S, BLAKE C L, MERZ C J. UCI repository of machine learning database [EB/OL]. [2014-04-04].http://archive.ics.uci.edu/ml/#/.