Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (11): 3044-3049. DOI: 10.11772/j.issn.1001-9081.2016.11.3044

• Advanced Computing •

Data-driven parallel incremental SVM learning algorithm based on the Hadoop framework

PI Wenjun1,2, GONG Xiujun1,2   

  1. School of Computer Science and Technology, Tianjin University, Tianjin 300350, China;
    2. Tianjin Key Laboratory of Cognitive Computing and Application (Tianjin University), Tianjin 300350, China
  • Received: 2016-05-03 Revised: 2016-06-25 Online: 2016-11-10 Published: 2016-11-12
  • Corresponding author: PI Wenjun
  • About the authors: PI Wenjun (1992-), female, born in Tianjin, M.S. candidate; her research interests include distributed data mining and high-performance computing. GONG Xiujun (1972-), male, born in Chifeng, Inner Mongolia, Ph.D., associate professor; his research interests include artificial intelligence, data mining and bioinformatics.
  • Supported by:
    National Natural Science Foundation of China (61170177); Key Project of the National High Technology Research and Development Program (863 Program) of China (2015AA020101); National Basic Research Program (973 Program) of China (2013CB32930X).

Data-driven parallel incremental support vector machine learning algorithm based on the Hadoop framework

PI Wenjun1,2, GONG Xiujun1,2   

  1. School of Computer Science and Technology, Tianjin University, Tianjin 300350, China;
    2. Tianjin Key Laboratory of Cognitive Computing and Application (Tianjin University), Tianjin 300350, China
  • Received: 2016-05-03 Revised: 2016-06-25 Online: 2016-11-10 Published: 2016-11-12
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61170177), the Key Project of the National High Technology Research and Development Program (863 Program) of China (2015AA020101), and the National Basic Research Program (973 Program) of China (2013CB32930X).

Abstract: To address the difficulty that the traditional Support Vector Machine (SVM) algorithm has in handling large-scale training data, a data-driven parallel incremental Adaboost-SVM algorithm (PIASVM) based on Hadoop was proposed. With an ensemble learning strategy, each local classifier processes one partition of the data, and the classification results are fused to obtain a combined classifier. During incremental learning, weights are used to characterize the spatial distribution of the samples, which are iteratively reweighted, and a forgetting factor is used to select newly added samples and eliminate historical ones. A controller component based on HBase is employed to schedule the iterative procedure, persist intermediate results, and reduce the bandwidth pressure that iteration places on the original MapReduce framework. Results of multiple experiments show that the proposed algorithm achieves good speedup, sizeup and scaleup, and improves the capacity of SVM to process large-scale data while guaranteeing classification accuracy.

Key words: Hadoop, HBase, Support Vector Machine (SVM), incremental learning, ensemble learning, forgetting factor, controller component

Abstract: Since the traditional Support Vector Machine (SVM) algorithm can hardly deal with large-scale training data, an efficient data-driven Parallel Incremental Adaboost-SVM (PIASVM) learning algorithm based on Hadoop was proposed. An ensemble system was used to make each classifier process a partition of the data, and the classification results were then integrated to obtain the combined classifier. Weights were used to depict the spatial distribution properties of the samples, which were iteratively reweighted during the incremental training stage, and a forgetting factor was applied to select new samples and eliminate historical ones. In addition, a controller component based on HBase was used to schedule the iterative procedure, persist intermediate results, and reduce the bandwidth pressure of iterative MapReduce. The experimental results on multiple data sets demonstrate that the proposed algorithm performs well in terms of speedup, sizeup and scaleup, and offers high processing capacity for large-scale data while guaranteeing high accuracy.
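The paper itself is not reproduced on this page, so the exact update rule is unavailable; as a rough single-machine illustration of the forgetting-factor reweighting described in the abstract (the function name, parameters, and threshold are hypothetical, not taken from the paper), the weight update for one incremental round might look like:

```python
import numpy as np

def update_weights(old_w, n_new, forget=0.9, drop_below=1e-3):
    """Sketch of forgetting-factor reweighting: decay the weights of
    historical samples, give newly added samples the pre-decay average
    weight, renormalize, and flag samples whose weight has fallen below
    a threshold for elimination. All names are illustrative."""
    old_w = np.asarray(old_w, dtype=float)
    base = old_w.mean() if old_w.size else 1.0   # reference weight for new samples
    faded = old_w * forget                       # fade historical samples
    new_w = np.full(n_new, base)                 # admit new samples
    w = np.concatenate([faded, new_w])
    w /= w.sum()                                 # keep weights a distribution
    keep = w >= drop_below                       # mark stale samples for removal
    return w, keep
```

With `forget < 1`, each round shifts weight from historical samples toward new ones, so repeatedly unselected samples eventually drop below the threshold and are eliminated, which matches the sample-selection behavior the abstract attributes to the forgetting factor.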

Key words: Hadoop, HBase, Support Vector Machine (SVM), incremental learning, ensemble learning, forgetting factor, controller component
