Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (10): 2759-2763. DOI: 10.11772/j.issn.1001-9081.2018041141

• Papers from the 2018 China Conference on Granular Computing and Knowledge Discovery (CGCKD 2018) •

Big data active learning based on MapReduce

ZHAI Junhai1,2, ZHANG Sufang3, WANG Cong1,2, SHEN Chu1,2, LIU Xiaomeng1,2

  1. College of Mathematics and Information Science, Hebei University, Baoding Hebei 071002, China;
  2. Key Laboratory of Machine Learning and Computational Intelligence (Hebei University), Baoding Hebei 071002, China;
  3. Hebei Branch of China Meteorological Administration Training Centre, China Meteorological Administration, Baoding Hebei 071002, China
  • Received: 2018-04-06 Revised: 2018-06-01 Online: 2018-10-10 Published: 2018-10-13
  • Corresponding author: ZHANG Sufang
  • About the authors: ZHAI Junhai (born 1964), male, from Yixian, Hebei; professor, Ph.D., CCF member; research interests: machine learning, cloud computing, big data. ZHANG Sufang (born 1966), female, from Lixian, Hebei; associate professor, M.S.; research interests: machine learning. WANG Cong (born 1991), male, from Pingshan, Hebei; M.S. candidate; research interests: cloud computing, big data. SHEN Chu (born 1993), male, from Guantao, Hebei; M.S. candidate; research interests: cloud computing, big data. LIU Xiaomeng (born 1987), female, from Baoding, Hebei; M.S. candidate; research interests: machine learning.
  • Supported by:
    This work is partially supported by the Natural Science Foundation of Hebei Province (F2017201026), the Natural Science Foundation of Hebei University (799207217071), and the Graduate Innovation Foundation of Hebei University (hbu2018ss47).

Abstract: To address the problem that traditional active learning algorithms can only handle small and medium-sized data sets, a big data active learning algorithm based on MapReduce was proposed. Firstly, a classifier was trained with the Extreme Learning Machine (ELM) algorithm on an initial labeled training set, and the outputs of the classifier were transformed into a posterior probability distribution by the softmax function. Secondly, the unlabeled big data set was partitioned into l subsets, which were deployed to l nodes of a cloud computing platform. On each node, the information entropy of every instance in the local subset was calculated in parallel with the trained classifier, and the q instances with the largest information entropy were selected for labeling; the resulting l×q labeled instances were then added to the labeled training set. These steps were repeated until the predefined termination criterion was satisfied. Comparative experiments against an ELM-based active learning algorithm were conducted on four data sets: Artificial, Skin, Statlog and Poker. The proposed algorithm completed active instance selection on all four data sets, while the ELM-based active learning algorithm completed it only on the smallest data set. The experimental results indicate that the proposed algorithm outperforms the ELM-based active learning algorithm.

Key words: big data, active learning, uncertainty, Extreme Learning Machine (ELM), instance selection

CLC number:
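The selection step described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the MapReduce job is imitated by a plain loop over l index partitions, the ELM is a basic variant (random sigmoid hidden layer, pseudo-inverse output weights), and all names (`train_elm`, `posterior`, `entropy`, `select_top_q`, `active_learning_round`, `n_hidden`) are hypothetical. Labeling by an oracle and retraining until the stopping criterion is met are left out.

```python
import numpy as np

def train_elm(X, y, n_classes, n_hidden=50, seed=0):
    """Fit a basic ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))       # sigmoid hidden-layer outputs (assumed activation)
    T = np.eye(n_classes)[y]                     # one-hot class targets
    beta = np.linalg.pinv(H) @ T                 # output weights via the pseudo-inverse
    return W, b, beta

def posterior(model, X):
    """Turn ELM outputs into class probabilities with a softmax."""
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    z = H @ beta
    z = z - z.max(axis=1, keepdims=True)         # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(P, eps=1e-12):
    """Shannon entropy (natural log) of each row of P."""
    return -(P * np.log(P + eps)).sum(axis=1)

def select_top_q(model, X_part, q):
    """'Map' step for one subset: local indices of the q highest-entropy instances."""
    h = entropy(posterior(model, X_part))
    return np.argsort(h)[::-1][:q]

def active_learning_round(model, X_pool, l, q):
    """Split the pool into l subsets and collect the l*q selected pool indices."""
    parts = np.array_split(np.arange(len(X_pool)), l)
    chosen = [part[select_top_q(model, X_pool[part], q)] for part in parts]
    return np.concatenate(chosen)
```

In a full loop, the instances at the returned indices would be labeled, moved from the pool into the training set, and the ELM retrained before the next round. Each subset must hold at least q instances for the round to return exactly l×q indices.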