Improved automatic classification algorithm of software bug report in cloud environment

doi:10.11772/j.issn.1001-9081.2016.05.1212

Journal of Computer Applications, 2016, Vol. 36, Issue (5): 1212-1215. DOI: 10.11772/j.issn.1001-9081.2016.05.1212

HUANG Wei, LIN Jie, JIANG Yu'e

Faculty of Software, Fujian Normal University, Fuzhou Fujian 350108, China

Received: 2015-11-17 Revised: 2016-01-11 Online: 2016-05-10 Published: 2016-05-09
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61472082) and the Natural Science Foundation of Fujian Province (2014J01220).

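Corresponding author: JIANG Yu'e
About the authors: HUANG Wei (born 1991), male, from Minhou, Fujian, M.S. candidate; main research interests: text mining and big data mining. LIN Jie (born 1972), male, from Sanming, Fujian, associate professor, Ph.D.; main research interest: data mining. JIANG Yu'e (born 1970), female, from Gutian, Fujian, professor, Ph.D.; main research interest: data mining.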
Abstract

Abstract: User-submitted bug reports are arbitrary and subjective, so the accuracy of their automatic classification is far from ideal and a great deal of manual intervention is required. As bug report databases keep growing, improving the accuracy of automatic classification has become urgent. A Naive Bayes (NB) algorithm based on TF-IDF (Term Frequency-Inverse Document Frequency) was proposed. It considers not only how a term is distributed across different classes but also how it is distributed within a class. The algorithm was also implemented as a distributed, parallel program using the MapReduce model on the Hadoop platform. The experimental results show that the proposed Naive Bayes algorithm raises the F1 measure to 71%, which is 27 percentage points higher than the original algorithm. On massive data sets, the running time can be shortened by adding computing nodes, and the algorithm shows good execution efficiency.
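
The abstract does not give the exact weighting formula, so the following is only a minimal sketch of the general idea of feeding TF-IDF weights into a multinomial Naive Bayes classifier. The function names `train` and `predict`, the smoothed IDF form `log(N / df) + 1`, and the Laplace smoothing are assumptions made for illustration; they are not the authors' improved weighting, which additionally accounts for how a term is distributed between and within classes.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """Fit multinomial Naive Bayes on TF-IDF-weighted term counts.

    docs   -- list of token lists (one per bug report)
    labels -- list of class labels, same length as docs
    """
    n_docs = len(docs)
    # Document frequency: number of reports that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant (not taken from the paper).
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}

    class_docs = Counter(labels)
    weights = defaultdict(Counter)  # class -> term -> summed TF-IDF weight
    for doc, label in zip(docs, labels):
        for term, tf in Counter(doc).items():
            weights[label][term] += tf * idf[term]

    vocab = set(df)
    model = {}
    for c in class_docs:
        total = sum(weights[c].values())
        # Laplace (add-one) smoothing over the whole vocabulary.
        log_like = {
            t: math.log((weights[c][t] + 1.0) / (total + len(vocab)))
            for t in vocab
        }
        model[c] = (math.log(class_docs[c] / n_docs), log_like)
    return model


def predict(model, doc):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in doc if t in log_like)

    return max(model, key=score)
```

Replacing raw term counts with TF-IDF weights in the class-conditional estimates is the simplest way to let rare, discriminative terms influence the decision more than frequent, uninformative ones.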

Key words: multinomial Naive Bayes, bug report, automatic text classification, Term Frequency-Inverse Document Frequency (TF-IDF), cloud computing

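Abstract (translated from Chinese): User-submitted software bug reports are arbitrary, highly subjective and short, which leads to low automatic classification accuracy and requires a great deal of time for manual intervention. With the rapid development of the Internet, the number of bug reports submitted by users keeps increasing, and how to improve the accuracy of automatic classification on massive data is receiving more and more attention. By improving Term Frequency-Inverse Document Frequency (TF-IDF) to take into account how the occurrence of a term between classes and within a class affects text classification, an improved multinomial Naive Bayes algorithm based on software bug report data sets is proposed, and a distributed version of the algorithm is implemented with the MapReduce computing model on the Hadoop platform. Experimental results show that the improved multinomial Naive Bayes algorithm raises the F1 value to 71%, 27 percentage points higher than the original algorithm, and that on massive data the running time can be shortened by adding nodes, giving good execution efficiency.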
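
The abstract states only that the algorithm was implemented with the MapReduce model on Hadoop and gives no job layout. As a rough sketch, the per-class term counting that Naive Bayes training needs can be expressed as a map step and a reduce step. The code below imitates that flow in a single Python process and does not use any Hadoop API; the names `map_report`, `shuffle`, `reduce_counts` and `count_terms` are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_report(label, tokens):
    """Map step: emit ((class, term), 1) for every token of one report."""
    return [((label, term), 1) for term in tokens]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the partial counts for one (class, term) key."""
    return key, sum(values)


def count_terms(reports):
    """Run map, shuffle and reduce over an iterable of (label, tokens) pairs."""
    mapped = chain.from_iterable(
        map_report(label, tokens) for label, tokens in reports
    )
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

Because each report is mapped independently and each key is reduced independently, the same two functions could in principle be distributed across nodes, which is the property the abstract relies on when it reports shorter running times after adding nodes.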

CLC Number: TP311

HUANG Wei, LIN Jie, JIANG Yu'e. Improved automatic classification algorithm of software bug report in cloud environment[J]. Journal of Computer Applications, 2016, 36(5): 1212-1215.
