Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (11): 3044-3049. DOI: 10.11772/j.issn.1001-9081.2016.11.3044

• Advanced Computing •

Data-driven parallel incremental SVM learning algorithm based on the Hadoop framework

PI Wenjun1,2, GONG Xiujun1,2   

  1. School of Computer Science and Technology, Tianjin University, Tianjin 300350, China;
    2. Tianjin Key Laboratory of Cognitive Computing and Application (Tianjin University), Tianjin 300350, China
  • Received: 2016-05-03 Revised: 2016-06-25 Online: 2016-11-10 Published: 2016-11-12
  • Corresponding author: PI Wenjun
  • About the authors: PI Wenjun (1992-), female, born in Tianjin, M.S. candidate; her research interests include distributed data mining and high-performance computing. GONG Xiujun (1972-), male, born in Chifeng, Inner Mongolia, Ph.D., associate professor; his research interests include artificial intelligence, data mining and bioinformatics.
  • Supported by:
    National Natural Science Foundation of China (61170177); Key Project of the National High Technology Research and Development Program (863 Program) of China (2015AA020101); National Basic Research Program (973 Program) of China (2013CB32930X).

Data-driven parallel incremental support vector machine learning algorithm based on the Hadoop framework

PI Wenjun1,2, GONG Xiujun1,2   

  1. School of Computer Science and Technology, Tianjin University, Tianjin 300350, China;
    2. Tianjin Key Laboratory of Cognitive Computing and Application (Tianjin University), Tianjin 300350, China
  • Received: 2016-05-03 Revised: 2016-06-25 Online: 2016-11-10 Published: 2016-11-12
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61170177), the Key Project of the National High Technology Research and Development Program (863 Program) of China (2015AA020101), and the National Basic Research Program (973 Program) of China (2013CB32930X).

Abstract: To address the difficulty that the traditional Support Vector Machine (SVM) algorithm has in handling large-scale training data, a data-driven parallel incremental Adaboost-SVM algorithm (PIASVM) based on Hadoop was proposed. With an ensemble learning strategy, each local classifier processes one partition of the data, and the classification results are fused to obtain a combined classifier. During incremental learning, weights are used to characterize the spatial distribution of the samples, which are iteratively reweighted, and a forgetting factor is used to select newly added samples and eliminate historical ones. A controller component based on HBase is employed to schedule the iterative procedure, persist intermediate results, and reduce the bandwidth pressure that iteration places on the original MapReduce framework. Results of multiple experiments show that the proposed algorithm achieves good speedup, sizeup and scaleup, and improves the capacity of SVM to process large-scale data while guaranteeing classification accuracy.

Key words: Hadoop, HBase, Support Vector Machine (SVM), incremental learning, ensemble learning, forgetting factor, controller component

Abstract: Since the traditional Support Vector Machine (SVM) algorithm can hardly deal with large-scale training data, an efficient data-driven Parallel Incremental Adaboost-SVM (PIASVM) learning algorithm based on Hadoop was proposed. An ensemble system was used to make each classifier process a partition of the data, and the classification results were then integrated to obtain the combined classifier. Weights were used to depict the spatial distribution properties of the samples, which were iteratively reweighted during the incremental training stage, and a forgetting factor was applied to select new samples and eliminate historical ones. In addition, a controller component based on HBase was used to schedule the iterative procedure, persist intermediate results, and reduce the bandwidth pressure of iterative MapReduce. The experimental results on multiple data sets demonstrate that the proposed algorithm performs well in terms of speedup, sizeup and scaleup, and offers high processing capacity for large-scale data while guaranteeing high accuracy.
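The paper itself is not reproduced on this page, so the exact update rule is unavailable; as a rough single-machine illustration of the forgetting-factor reweighting described in the abstract (the function name, parameters, and threshold are hypothetical, not taken from the paper), the weight update for one incremental round might look like:

```python
import numpy as np

def update_weights(old_w, n_new, forget=0.9, drop_below=1e-3):
    """Sketch of forgetting-factor reweighting: decay the weights of
    historical samples, give newly added samples the pre-decay average
    weight, renormalize, and flag samples whose weight has fallen below
    a threshold for elimination. All names are illustrative."""
    old_w = np.asarray(old_w, dtype=float)
    base = old_w.mean() if old_w.size else 1.0   # reference weight for new samples
    faded = old_w * forget                       # fade historical samples
    new_w = np.full(n_new, base)                 # admit new samples
    w = np.concatenate([faded, new_w])
    w /= w.sum()                                 # keep weights a distribution
    keep = w >= drop_below                       # mark stale samples for removal
    return w, keep
```

With `forget < 1`, each round shifts weight from historical samples toward new ones, so repeatedly unselected samples eventually drop below the threshold and are eliminated, which matches the sample-selection behavior the abstract attributes to the forgetting factor.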

Key words: Hadoop, HBase, Support Vector Machine (SVM), incremental learning, ensemble learning, forgetting factor, controller component
