Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (11): 3122-3125. DOI: 10.11772/j.issn.1001-9081.2015.11.3122

• DPCS 2015 Paper

Improved MIMLBoost algorithm based on importance evaluation of labels

HAO Ning1, XIA Shixiong1, NIU Qiang1, ZHAO Zhijun2   

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou Jiangsu 221116, China;
  2. Ministry of Transport of Dinghai District, Zhoushan Zhejiang 316000, China
  • Received: 2015-06-17 Revised: 2015-07-09 Published: 2015-11-13

Improved MIMLBoost algorithm based on class importance

郝宁1, 夏士雄1, 牛强1, 赵志军2   

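  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China;
  2. Transportation Construction Affairs Center of Dinghai District, Zhoushan, Zhejiang 316000, China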
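  • Corresponding author: XIA Shixiong (born 1961), male, from Fushun, Liaoning; professor, Ph.D. Research interests: intelligent control, data mining, industrial communication networks.
  • About the authors: HAO Ning (born 1990), male, from Xuzhou, Jiangsu; M.S. candidate. Research interests: artificial intelligence, intelligent information processing. NIU Qiang (born 1974), male, from Nanyang, Henan; professor, Ph.D. Research interests: data mining, intelligent optimization algorithms, intelligent information processing. ZHAO Zhijun (born 1966), male, from Yuyao, Zhejiang; political work officer. Research interests: artificial intelligence, sensor networks.
  • Supported by:
    Prospective Joint Research Project of the Jiangsu Province Industry-University-Research Joint Innovation Fund (BY2014028-09); Open Fund of the Key Laboratory of Digital Ocean Science and Technology, State Oceanic Administration (KLDO201304); Scientific Research Program of the Zhejiang Provincial Department of Transportation (2014T25).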

Abstract: To address the class imbalance caused by the original degradation method in the MIMLBoost algorithm, this paper introduces class importance into the original algorithm and proposes an improved degradation method based on the evaluation of category labels. First, the proposed method uses a clustering algorithm to cluster all bags into groups. Each group can be treated as a concept in the multi-instance bags, and every class label is quantified within each group. Next, the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is used to compute the importance of each label in each group. Finally, for each group, the label with the lowest importance is removed, because such a label tends to generate many negative samples when MIML (Multi-Instance Multi-Label) samples are transformed into multi-instance samples. The experimental results show that the new degradation method is effective and that the improved algorithm outperforms the original algorithm, especially in terms of Hamming loss, coverage and ranking loss. This indicates that the new algorithm can effectively reduce the classification error rate and improve the precision of the algorithm.

Key words: Multi-Instance Multi-Label (MIML), MIMLBoost algorithm, Term Frequency-Inverse Document Frequency (TF-IDF) algorithm, clustering, class imbalance
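
The label-evaluation steps summarized in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the exact weighting formula, so the sketch assumes a textbook TF-IDF analogy in which a cluster plays the role of a document and a class label the role of a term. The function names `label_importance` and `retained_labels` are hypothetical, and the clustering of bags is assumed to have been done already.

```python
import math
from collections import Counter, defaultdict


def label_importance(bag_labels, bag_cluster):
    """Score each label in each cluster with a TF-IDF style weight.

    bag_labels:  list of label sets, one per bag
    bag_cluster: list of cluster ids, one per bag (clustering assumed done)
    Returns {cluster_id: {label: score}}.
    Assumption: tf = share of the cluster's label occurrences,
                idf = log(number of clusters / clusters containing the label).
    """
    # Count how often each label occurs among the bags of each cluster.
    counts = defaultdict(Counter)
    for labels, cluster in zip(bag_labels, bag_cluster):
        counts[cluster].update(labels)

    n_clusters = len(counts)
    # Cluster frequency: in how many clusters does a label occur at all?
    cluster_freq = Counter(label for c in counts.values() for label in c)

    scores = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        scores[cluster] = {
            label: (n / total) * math.log(n_clusters / cluster_freq[label])
            for label, n in counter.items()
        }
    return scores


def retained_labels(scores):
    """Drop the single lowest-scoring label of every cluster."""
    kept = {}
    for cluster, label_scores in scores.items():
        if not label_scores:  # cluster whose bags carry no labels
            kept[cluster] = set()
            continue
        # Ties are broken by label name so the result is deterministic.
        worst = min(label_scores, key=lambda l: (label_scores[l], l))
        kept[cluster] = set(label_scores) - {worst}
    return kept
```

Under this assumed weighting, a label that occurs in every cluster receives a score of zero and is therefore the first candidate for removal; a different smoothing of the inverse-frequency term would change that behaviour.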

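Abstract (Chinese version): To address the class imbalance caused by the degradation process in the multi-instance multi-label learning algorithm MIMLBoost, an improved degradation method based on class-label evaluation is proposed, which applies the idea of artificial undersampling and introduces class importance. The method clusters the bags in the instance space and quantifies the labels in the label space onto the resulting clusters. Then, taking each cluster as a unit, it uses the TF-IDF algorithm to evaluate and screen the importance of each class label, removes the labels of low importance, and concatenates the bags in the cluster with the remaining class labels. This reduces the number of majority-class samples and completes the transformation of multi-instance multi-label samples into multi-instance single-label samples. Experiments on natural datasets show that the improved algorithm outperforms the original algorithm overall, most notably on the three evaluation metrics Hamming loss, coverage and ranking loss, which indicates that the proposed algorithm can effectively reduce the classification error rate and improve the precision and classification efficiency of the algorithm.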

Key words (Chinese version): multi-instance multi-label, MIMLBoost algorithm, TF-IDF algorithm, clustering, class imbalance
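
The degradation step mentioned in the abstract, in which bags are paired with class labels to form multi-instance single-label samples, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the paper's code: the function name `degrade` and the `kept_labels` mapping (cluster id to the labels retained after importance screening) are hypothetical, and each sample is represented simply as a `((bag, label), target)` tuple with target +1 when the bag carries the label and -1 otherwise.

```python
def degrade(bags, bag_labels, bag_cluster, kept_labels):
    """Turn MIML examples into multi-instance single-label examples.

    bags:        list of bags (any objects; their contents are not inspected)
    bag_labels:  list of label sets, one per bag
    bag_cluster: list of cluster ids, one per bag
    kept_labels: {cluster_id: set of labels retained for that cluster}
                 (hypothetical input, e.g. the result of importance screening)

    Each bag is paired only with the labels retained for its cluster; the
    target is +1 if the bag carries that label and -1 otherwise.
    """
    samples = []
    for bag, labels, cluster in zip(bags, bag_labels, bag_cluster):
        # sorted() only makes the output order reproducible.
        for label in sorted(kept_labels[cluster]):
            target = 1 if label in labels else -1
            samples.append(((bag, label), target))
    return samples


def count_negatives(samples):
    """Number of samples whose target is -1."""
    return sum(1 for _, target in samples if target == -1)
```

Passing the full label set for every cluster pairs each bag with every label, which corresponds to an unfiltered degradation; comparing `count_negatives` for the two cases gives a quick view of how many negative samples the screening avoids on a given dataset.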
