Journal of Computer Applications, 2016, Vol. 36, Issue (12): 3448-3453. DOI: 10.11772/j.issn.1001-9081.2016.12.3448

• Computer software technology •

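Source code comments quality assessment method based on aggregation of classification algorithms

YU Hai1,2,3, LI Bin2,3,4, WANG Peixia2,3,4, JIA Di3, WANG Yongji1,4

  1. Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
  2. University of Chinese Academy of Sciences, Beijing 100190, China;
  3. General Department, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
  4. National Engineering Research Center of Fundamental Software, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2016-06-08 Revised: 2016-06-20 Online: 2016-12-10 Published: 2016-12-08
  • Corresponding author: WANG Yongji
  • About the authors: YU Hai (1989-), male, born in Xinyang, Henan, M.S. candidate; research interests: operating systems, machine learning. LI Bin (1985-), male, born in Tianshui, Gansu, engineer, Ph.D. candidate; research interests: operating systems, code analysis. WANG Peixia (1981-), female, born in Weifang, Shandong, senior engineer, Ph.D. candidate; research interests: information retrieval, natural language processing. JIA Di (1989-), female, born in Beijing, assistant engineer, M.S.; research interests: operating systems, data processing. WANG Yongji (1963-), male, born in Yingkou, Liaoning, research professor, Ph.D., senior member of CCF; research interests: virtualization technology, covert channels, real-time systems, artificial intelligence, data mining, software engineering.
  • Supported by:
    This work is partially supported by the National Science and Technology Major Project (2014ZX01029101-002).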

Abstract: Source code comments are an important part of software, and researchers often need to generate or analyze them by manual or automated means. The quality of such comments has usually been assessed manually, which is inefficient and subjective. To address this problem, an assessment criterion was built that considers four aspects of a comment: its format, its language form, its content, and its relevance to the code. Based on this criterion, a comment quality assessment method based on an aggregation of classification algorithms was proposed. The method introduces machine learning and natural language processing techniques into comment quality assessment and uses classification algorithms to assign each comment to one of four levels: unqualified, qualified, good, or excellent. Aggregating the basic classification algorithms further improved the assessment results: the precision and F1 measure of the aggregated classification algorithm were about 20 percentage points higher than with a single classification algorithm used alone, and every index except the macro-averaged F1 measure exceeded 70%. The experimental results show that the proposed method can be applied effectively to comment quality assessment.

Key words: source code comments, quality assessment, text classification, aggregation algorithm, natural language processing
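
The abstract does not say how the basic classifiers are aggregated or which classifiers are used. As a rough illustration only, the sketch below assumes simple majority voting over the labels predicted by several base classifiers, and computes the macro-averaged F1 measure over the four quality levels named in the abstract. The function names `majority_vote` and `macro_f1` and the tie-breaking rule are hypothetical and are not taken from the paper.

```python
from collections import Counter

# The four quality levels named in the abstract.
LEVELS = ["unqualified", "qualified", "good", "excellent"]


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per comment.

    predictions: list of lists, one inner list per base classifier,
    all of the same length (one label per comment).
    Assumption: ties go to the classifier listed first.
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote, in classifier order, that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def macro_f1(y_true, y_pred, labels=LEVELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because the macro average weights all four levels equally, a level with few examples can pull the score down even when overall accuracy is high, which may be one reason the macro-averaged F1 measure is the index that lags in the reported results.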

CLC Number: