Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (5): 1212-1215.DOI: 10.11772/j.issn.1001-9081.2016.05.1212


Improved automatic classification algorithm of software bug report in cloud environment

HUANG Wei, LIN Jie, JIANG Yu'e   

  1. Faculty of Software, Fujian Normal University, Fuzhou Fujian 350108, China
  • Received: 2015-11-17 Revised: 2016-01-11 Online: 2016-05-10 Published: 2016-05-09
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61472082) and the Natural Science Foundation of Fujian Province (2014J01220).


黄伟, 林劼, 江育娥   

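  1. Faculty of Software, Fujian Normal University, Fuzhou 350108, China
  • Corresponding author: JIANG Yu'e
  • About the authors: HUANG Wei (born 1991), male, from Minhou, Fujian, master's student; main research interests: text mining and big data mining. LIN Jie (born 1972), male, from Sanming, Fujian, associate professor, Ph.D.; main research interest: data mining. JIANG Yu'e (born 1970), female, from Gutian, Fujian, professor, Ph.D.; main research interest: data mining.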

Abstract: User-submitted software bug reports tend to be arbitrary, subjective, and short, so the accuracy of automatic classification is low and considerable manual intervention is required. As the number of submitted bug reports keeps growing, improving classification accuracy on massive data sets has become an urgent problem. An improved multinomial Naive Bayes (NB) algorithm for bug report data sets was proposed, based on a modified Term Frequency-Inverse Document Frequency (TF-IDF) weighting that considers how a term is distributed both across classes and within a class. A distributed version of the algorithm was also implemented with the MapReduce computing model on the Hadoop platform. The experimental results show that the improved algorithm raises the F1 measure to 71%, which is 27 percentage points higher than the original algorithm, and that on massive data the running time can be shortened by adding computing nodes, giving good execution efficiency.
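
As a rough illustration of the general idea, the sketch below trains a multinomial Naive Bayes classifier on TF-IDF weights instead of raw term counts. It is a minimal single-machine sketch, not the authors' algorithm: it uses the textbook weighting log(N/df) + 1, whereas the paper's modified between-class and within-class weighting is not given in the abstract. The function names `train_tfidf_nb` and `predict`, the smoothing parameter `alpha`, and the toy labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_tfidf_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on TF-IDF weights (textbook IDF, assumed)."""
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    # Standard IDF with +1 so terms present in every document keep some weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

    # weight[c][t] = summed TF-IDF weight of term t over documents of class c.
    weight = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term, tf in Counter(tokens).items():
            weight[label][term] += tf * idf[term]

    vocab_size = len(idf)
    model = {"idf": idf, "log_prior": {}, "log_like": {}}
    for label, n_c in Counter(labels).items():
        # Additive (Laplace-style) smoothing over the whole vocabulary.
        denom = sum(weight[label].values()) + alpha * vocab_size
        model["log_prior"][label] = math.log(n_c / n_docs)
        model["log_like"][label] = {
            term: math.log((weight[label][term] + alpha) / denom) for term in idf
        }
    return model


def predict(model, tokens):
    """Return the class with the highest posterior log-score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        log_like = model["log_like"][label]
        score = log_prior
        for term, tf in Counter(tokens).items():
            if term in log_like:  # out-of-vocabulary terms are ignored
                score += tf * model["idf"][term] * log_like[term]
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data with hypothetical labels, for illustration only.
train_docs = [
    ["crash", "null", "pointer"],
    ["crash", "segfault"],
    ["button", "color", "ui"],
    ["ui", "layout", "button"],
]
train_labels = ["bug", "bug", "ui", "ui"]
model = train_tfidf_nb(train_docs, train_labels)
print(predict(model, ["crash", "pointer"]))
```

The document frequencies and the per-class weight sums are plain additive counts, which is the property that makes this kind of training step expressible as map and reduce phases; the sketch itself does not use Hadoop.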

Key words: multinomial Naive Bayes, bug report, automatic text classification, Term Frequency-Inverse Document Frequency (TF-IDF), cloud computing

