Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (7): 1999-2003. DOI: 10.11772/j.issn.1001-9081.2015.07.1999

• Computer Software Technology •

Source code summarization technology based on syntactic analysis

WANG Jinshui1, XUE Xingsi1, WENG Wei2

  1. College of Information Science and Engineering, Fujian University of Technology, Fuzhou Fujian 350108, China;
  2. College of Computer and Information Engineering, Xiamen University of Technology, Xiamen Fujian 361024, China
  • Received: 2015-01-21 Revised: 2015-03-27 Online: 2015-07-10 Published: 2015-07-17
  • Corresponding author: WANG Jinshui (born 1981), male, from Zhangzhou, Fujian; lecturer, Ph.D., CCF member; main research interests: software engineering and big data analysis; wangjinshui@fjut.edu.cn
  • About the authors: XUE Xingsi (born 1981), male, from Fuqing, Fujian; lecturer, Ph.D., CCF member; main research interest: evolutionary algorithms. WENG Wei (born 1979), male, from Hengyang, Hunan; lecturer, M.S., CCF member; main research interest: data mining.
  • Supported by:

    the National Natural Science Foundation of China (61402108); the Research Project for Young and Middle-aged Teachers of Fujian Province (JA14221); the Scientific Research Start-up Fund of Fujian University of Technology (GY-Z13113; GY-Z14068).

Abstract:

To address the bag-of-words model's disregard for the semantic relationships between terms and for concept structure, a source code summarization technique based on syntactic analysis is proposed. First, part-of-speech tagging is used to identify the keywords most likely to reflect the characteristics of the code. Second, chunk parsing is used to correct errors that part-of-speech tagging may introduce. Third, noise reduction is applied to the identified keywords to lessen the adverse effect of text noise. Finally, the keywords with the highest weights are selected to form the code summary. Experimental results show that, compared with source code summarization techniques based on TF-IDF (Term Frequency-Inverse Document Frequency) and on extended TF-IDF, the overlap between the summaries generated by the proposed technique and the reference (golden set) summaries is improved by at least 9% and 6%, respectively, indicating that the proposed technique generates more accurate code summaries.

Key words: source code summarization, text summarization, syntactic analysis, natural language processing, program comprehension
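The four steps named in the abstract (part-of-speech tagging, chunk-based correction, noise reduction, and selection of the highest-weighted keywords) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy lexicon `TOY_POS`, the noise list `NOISE`, the single correction rule in `fix_with_chunks`, and the use of raw term frequency as the weight are all assumptions made for illustration, since the abstract does not specify them. The `overlap` function uses the standard set overlap coefficient, which may differ from the exact definition used in the paper.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; a real system would use a trained POS tagger.
TOY_POS = {
    "get": "VERB", "set": "VERB", "parse": "VERB", "load": "VERB",
    "file": "NOUN", "name": "NOUN", "user": "NOUN", "config": "NOUN",
    "the": "DET", "a": "DET", "new": "ADJ", "temp": "ADJ",
}
# Assumed list of low-information terms treated as text noise.
NOISE = {"temp", "tmp", "foo", "bar", "obj", "impl"}


def split_identifiers(code):
    """Split source text into lowercase words (camelCase and snake_case aware)."""
    words = []
    for ident in re.findall(r"[A-Za-z]+", code):
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", ident)
        words.extend(p.lower() for p in parts)
    return words


def pos_tag(words):
    """Step 1: tag each word; unknown words default to NOUN (an assumption)."""
    return [(w, TOY_POS.get(w, "NOUN")) for w in words]


def fix_with_chunks(tagged):
    """Step 2: one illustrative chunk rule. A word tagged VERB that directly
    follows a determiner or adjective sits inside a noun phrase, so retag it."""
    fixed = []
    for i, (word, tag) in enumerate(tagged):
        prev_tag = fixed[i - 1][1] if i > 0 else None
        if tag == "VERB" and prev_tag in {"DET", "ADJ"}:
            tag = "NOUN"
        fixed.append((word, tag))
    return fixed


def summarize(code, k=3):
    """Steps 3 and 4: drop noise terms, then keep the k most frequent keywords.
    Ties are broken alphabetically so the result is deterministic."""
    tagged = fix_with_chunks(pos_tag(split_identifiers(code)))
    keywords = [w for w, t in tagged
                if t in {"NOUN", "VERB"} and w not in NOISE and len(w) > 1]
    ranked = sorted(Counter(keywords).items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


def overlap(summary, reference):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|); 0.0 if either set is empty."""
    a, b = set(summary), set(reference)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    sample = "loadUserConfig(fileName) parseUser(fileName) userName"
    summary = summarize(sample, k=3)
    print(summary, overlap(summary, ["user", "config", "file"]))
```

Unlike a pure bag-of-words ranking, the tagging and correction steps decide which words are eligible as keywords before any weighting takes place, which is the distinction the abstract draws against the TF-IDF baselines.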

CLC Number: