Big data active learning based on MapReduce

doi:10.11772/j.issn.1001-9081.2018041141

Journal of Computer Applications, 2018, Vol. 38, Issue (10): 2759-2763. DOI: 10.11772/j.issn.1001-9081.2018041141

• Paper from the 2018 China Conference on Granular Computing and Knowledge Discovery (CGCKD 2018) •

ZHAI Junhai^1,2, ZHANG Sufang^3, WANG Cong^1,2, SHEN Chu^1,2, LIU Xiaomeng^1,2

1. College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China;
2. Key Laboratory of Machine Learning and Computational Intelligence (Hebei University), Baoding, Hebei 071002, China;
3. Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding, Hebei 071002, China

Received: 2018-04-06 Revised: 2018-06-01 Online: 2018-10-10 Published: 2018-10-13
Corresponding author: ZHANG Sufang
About the authors: ZHAI Junhai (born 1964), male, from Yixian, Hebei; professor, Ph.D., CCF member; research interests: machine learning, cloud computing, big data. ZHANG Sufang (born 1966), female, from Lixian, Hebei; associate professor, M.S.; research interests: machine learning. WANG Cong (born 1991), male, from Pingshan, Hebei; M.S. candidate; research interests: cloud computing, big data. SHEN Chu (born 1993), male, from Guantao, Hebei; M.S. candidate; research interests: cloud computing, big data. LIU Xiaomeng (born 1987), female, from Baoding, Hebei; M.S. candidate; research interests: machine learning.
Supported by:
This work is partially supported by the Natural Science Foundation of Hebei Province (F2017201026), the Natural Science Foundation of Hebei University (799207217071), and the Graduate Innovation Foundation of Hebei University (hbu2018ss47).

Abstract

Traditional active learning algorithms can only handle small and medium-sized data sets. To address this problem, a big data active learning algorithm based on MapReduce is proposed. First, a classifier is trained with the Extreme Learning Machine (ELM) algorithm on a labeled initial training set, and its outputs are transformed into a posterior probability distribution by the softmax function. Second, the unlabeled big data set is partitioned into l subsets, which are deployed to l cloud computing nodes. On each node, the trained classifier is used to compute, in parallel, the information entropy of every instance in that node's subset, and the q instances with the largest information entropy are selected for labeling; the resulting l×q labeled instances are then added to the labeled training set. These steps are repeated until a predefined stopping criterion is satisfied. The proposed algorithm was compared with an ELM-based active learning algorithm on four data sets: Artificial, Skin, Statlog and Poker. The proposed algorithm completed active instance selection on all four data sets, whereas the ELM-based active learning algorithm completed it only on the smallest one. The experimental results indicate that the proposed algorithm outperforms the ELM-based active learning algorithm.

Key words: big data, active learning, uncertainty, Extreme Learning Machine (ELM), instance selection
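
The selection step described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names, the round-robin partitioning, and `toy_scores` (a hypothetical stand-in for the outputs of the trained ELM classifier) are assumptions made for illustration, and the loop over subsets only imitates what would run in parallel as map tasks on l nodes.

```python
import math
from heapq import nlargest


def softmax(scores):
    """Map raw classifier outputs to a posterior probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Information (Shannon) entropy of a distribution, natural logarithm."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def partition(instances, l):
    """Split the unlabeled pool into l subsets (round-robin; an assumption)."""
    return [instances[i::l] for i in range(l)]


def map_select(subset, score_fn, q):
    """Work done on one node: keep the q instances with the largest entropy."""
    return nlargest(q, subset, key=lambda x: entropy(softmax(score_fn(x))))


def select_batch(unlabeled, score_fn, l, q):
    """Collect the per-node selections: at most l * q instances to label."""
    selected = []
    for subset in partition(unlabeled, l):  # one subset per (simulated) node
        selected.extend(map_select(subset, score_fn, q))
    return selected


def toy_scores(x):
    """Hypothetical two-class scorer standing in for the trained ELM."""
    return [x, -x]


pool = [-3.0, -0.1, 2.0, 0.05, 1.5, -0.2, 4.0, 0.3]
batch = select_batch(pool, toy_scores, l=2, q=2)
```

With `toy_scores`, the posterior is closest to uniform when an instance is near zero, so each simulated node returns the two instances of its subset that lie closest to zero. In the full procedure the selected instances would be labeled, added to the training set, and the classifier retrained before the next round.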

CLC Number: TP181

ZHAI Junhai, ZHANG Sufang, WANG Cong, SHEN Chu, LIU Xiaomeng. Big data active learning based on MapReduce[J]. Journal of Computer Applications, 2018, 38(10): 2759-2763.
