Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (8): 2221-2226. DOI: 10.11772/j.issn.1001-9081.2015.08.2221

• Artificial Intelligence •

面向贯序不均衡数据的混合采样极限学习机

毛文涛1,2, 王金婉1, 何玲1, 袁培燕1,2   

  1. 河南师范大学 计算机与信息工程学院, 河南 新乡 453007;
    2. 智慧商务与物联网技术河南省工程实验室, 河南 新乡 453007
  • 收稿日期:2015-03-25 修回日期:2015-05-12 出版日期:2015-08-10 发布日期:2015-08-14
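  • Corresponding author: MAO Wentao (born 1980), male, from Xinxiang, Henan, associate professor, Ph.D., CCF member; research interests: machine learning and weak signal detection; maowt.mail@gmail.com
  • About the authors: WANG Jinwan (born 1991), female, from Jiyuan, Henan, master's degree candidate, CCF member; research interests: machine learning and pattern recognition. HE Ling (born 1990), female, from Hebi, Henan, master's degree candidate; research interests: generalization theory. YUAN Peiyan (born 1978), male, from Dengzhou, Henan, associate professor; research interests: mobile computing.
  • Supported by:

    National Natural Science Foundation of China (U1204609); China Postdoctoral Science Foundation (2014M550508); Program for Science and Technology Innovation Talents in Universities of Henan Province (15HASTIT022); Foundation for University Young Key Teachers of Henan Province (2014GGJS-046).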

Hybrid sampling extreme learning machine for sequential imbalanced data

MAO Wentao1,2, WANG Jinwan1, HE Ling1, YUAN Peiyan1,2   

  1. College of Computer and Information Engineering, Henan Normal University, Xinxiang Henan 453007, China;
    2. Engineering Laboratory of Intellectual Business and Internet of Things Technologies, Henan Province, Xinxiang Henan 453007, China
  • Received: 2015-03-25 Revised: 2015-05-12 Online: 2015-08-10 Published: 2015-08-14

摘要:

针对现有机器学习算法难以有效提高贯序不均衡数据分类问题中少类样本分类精度的问题,提出一种基于混合采样策略的在线贯序极限学习机。该算法可在提高少类样本分类精度的前提下,减少多类样本的分类精度损失,主要包括离线和在线两个阶段:离线阶段采用均衡采样策略,利用主曲线分别构建多类和少类样本的可信区域,在不改变样本分布特性的前提下,利用可信区域扩充少类样本和削减多类样本,进而得到均衡的离线样本集,建立初始模型;在线阶段仅对贯序到达的多类数据进行欠采样,根据样本重要度挑选最具价值的多类样本,进而动态更新网络权值。通过理论分析证明所提算法在理论上存在损失信息上界。采用UCI标准数据集和实际的澳门空气污染预报数据进行仿真实验,结果表明,与现有在线贯序极限学习机(OS-ELM)、极限学习机(ELM)和元认知在线贯序极限学习机(MCOS-ELM)算法相比,所提算法对少类样本的预测精度更高,且数值稳定性良好。

关键词: 极限学习机, 在线贯序数据, 不均衡分类, 主曲线

Abstract:

Existing machine learning methods tend to produce biased classifiers on sequential imbalanced data, which lowers the classification accuracy of the minority class. To address this problem, an online sequential extreme learning machine based on a hybrid sampling strategy is proposed. The algorithm improves the classification accuracy of the minority class while limiting the loss of accuracy on the majority class, and consists of an offline stage and an online stage. The offline stage adopts a balanced sampling strategy: principal curves are used to construct confidence regions for the majority and minority classes separately, and, without changing the distribution of the samples, these regions are used to over-sample the minority class and under-sample the majority class. The resulting balanced offline sample set is used to build the initial model. In the online stage, only the sequentially arriving majority-class data are under-sampled: the most valuable majority-class samples are selected according to sample importance, and the network weights are then updated dynamically. Theoretical analysis shows that the information loss of the proposed algorithm has an upper bound. Experiments were conducted on two UCI datasets and a real-world air pollutant forecasting dataset from Macao. The results show that, compared with the existing Online Sequential Extreme Learning Machine (OS-ELM), Extreme Learning Machine (ELM) and Meta-Cognitive Online Sequential Extreme Learning Machine (MCOS-ELM), the proposed method achieves higher prediction accuracy on the minority class and good numerical stability.

Key words: Extreme Learning Machine (ELM), online sequential data, imbalanced data classification, principal curve
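
The online stage described in the abstract can be illustrated with a minimal sketch. It assumes the standard OS-ELM recursive least-squares update with one sample per chunk, and it uses the residual magnitude as a stand-in for "sample importance", since the abstract does not define that measure. The principal-curve offline stage is omitted entirely. The names `OnlineELM`, `online_step` and `threshold` are hypothetical and are not taken from the paper.

```python
import math
import random


class OnlineELM:
    """Single-output ELM updated one sample at a time by recursive least squares."""

    def __init__(self, n_inputs, n_hidden=20, reg=100.0, seed=0):
        rng = random.Random(seed)
        # Random input weights and biases stay fixed, as in a standard ELM.
        self.W = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.beta = [0.0] * n_hidden
        # P starts as reg * I, which acts like ridge regularization of 1 / reg.
        self.P = [[reg if i == j else 0.0 for j in range(n_hidden)]
                  for i in range(n_hidden)]

    def hidden(self, x):
        """Sigmoid hidden-layer output for one input vector."""
        return [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + bj)))
                for row, bj in zip(self.W, self.b)]

    def score(self, x):
        """Network output; the sign gives the predicted class."""
        return sum(h * bt for h, bt in zip(self.hidden(x), self.beta))

    def update(self, x, t):
        """One recursive least-squares step (Sherman-Morrison form)."""
        h = self.hidden(x)
        Ph = [sum(p * hj for p, hj in zip(row, h)) for row in self.P]
        denom = 1.0 + sum(hi * phi for hi, phi in zip(h, Ph))
        err = t - sum(hi * bi for hi, bi in zip(h, self.beta))
        for i in range(len(h)):
            self.beta[i] += Ph[i] / denom * err
            for j in range(len(h)):
                self.P[i][j] -= Ph[i] * Ph[j] / denom


def online_step(model, x, t, threshold=0.5):
    """Update on every minority sample (t > 0); under-sample the majority class.

    ASSUMPTION: importance is the absolute residual. The paper's actual
    importance measure is not given in the abstract.
    """
    if t < 0 and abs(t - model.score(x)) < threshold:
        return False  # low-importance majority sample is skipped
    model.update(x, t)
    return True


# Synthetic imbalanced stream: about 10% minority (+1), 90% majority (-1).
rng = random.Random(1)
model = OnlineELM(n_inputs=2)
used = 0
for _ in range(300):
    t = 1.0 if rng.random() < 0.1 else -1.0
    x = [rng.gauss(t, 0.5), rng.gauss(t, 0.5)]
    used += online_step(model, x, t)
print("samples used for updates:", used, "of 300")
```

Skipping well-fitted majority samples keeps the update set closer to balanced as the stream arrives, which is the intent of the online under-sampling step. A faithful implementation would replace the residual test with the paper's importance criterion and start from the balanced offline model.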

CLC Number: