Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (12): 3896-3902. DOI: 10.11772/j.issn.1001-9081.2022111783

• Computer software technology •

Code search model based on collaborative fusion network

Qihong SONG1,2, Jianxun LIU1,2, Haize HU1,2, Xiangping ZHANG1,2

  1. Hunan Key Laboratory of Service Computing and New Software Service Technology (Hunan University of Science and Technology), Xiangtan Hunan 411201, China
  2. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan Hunan 411201, China
  • Received: 2022-11-29 Revised: 2023-03-25 Accepted: 2023-03-28 Online: 2023-05-08 Published: 2023-12-10
  • Contact: Jianxun LIU
  • About author: SONG Qihong, born in 1998 in Baoji, Shaanxi, M. S. candidate, CCF member. His research interests include code search and code completion.
    LIU Jianxun, born in 1970 in Hengyang, Hunan, Ph. D., professor, CCF distinguished member. His research interests include big data, service computing, and cloud computing. Email: 904500672@qq.com
    HU Haize, born in 1989 in Shaoyang, Hunan, Ph. D. candidate, lecturer. His research interests include data mining and code search.
    ZHANG Xiangping, born in 1993 in Sanming, Fujian, Ph. D. candidate. His research interests include code representation and code clone detection.
  • Supported by:
    National Natural Science Foundation of China (61872139)

Abstract:

Searching for and reusing relevant code can significantly improve software development efficiency. Deep learning-based code search models usually embed code snippets and query statements into the same vector space, then match and return the relevant code snippets by computing cosine similarity; however, most of these models ignore the collaborative information between code snippets and query statements. To represent semantic information more comprehensively, a collaborative fusion-based code search model named BofeCS is proposed. Firstly, a BERT (Bidirectional Encoder Representations from Transformers) model is used to extract the semantic information of the input sequences and represent it as vectors. Secondly, a collaborative fusion network is constructed to extract the token-level collaborative information between code snippets and query statements. Finally, a residual network is built to alleviate the loss of semantic information during representation. To verify the effectiveness of BofeCS, experiments were carried out on the multilingual dataset CodeSearchNet. Experimental results show that, compared with the baseline models UNIF (embedding UNIFication), TabCS (Two-stage Attention-Based model for Code Search), and MRCS (Multimodal Representation for neural Code Search), BofeCS achieves significantly higher Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and top-k success hit rate (SR@k), with MRR improved by 95.94%, 52.32%, and 16.95%, respectively.
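The retrieval step and the three reported metrics can be illustrated with a minimal Python sketch. It is not the BofeCS implementation. It assumes that each query has exactly one correct code snippet, and it uses made-up 2-dimensional vectors in place of BERT embeddings. The function names (`cosine`, `rank_of_correct`, `mrr`, `success_rate_at_k`, `ndcg`) are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_of_correct(query_vec, code_vecs, correct_idx):
    """1-based rank of the correct snippet when candidates are sorted by similarity."""
    scores = [cosine(query_vec, c) for c in code_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(correct_idx) + 1

def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def success_rate_at_k(ranks, k):
    """SR@k: fraction of queries whose correct snippet is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def ndcg(ranks):
    """NDCG under the one-relevant-snippet assumption.

    With a single relevant item of gain 1, the ideal DCG is 1, so the
    per-query NDCG reduces to 1 / log2(rank + 1).
    """
    return sum(1.0 / math.log2(r + 1) for r in ranks) / len(ranks)

# Toy stand-ins for embedded code snippets (assumption: real vectors come from an encoder).
code_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each entry is (embedded query, index of its correct snippet).
queries = [([1.0, 0.1], 0), ([0.2, 1.0], 1), ([1.0, 0.9], 0)]

ranks = [rank_of_correct(q, code_vecs, idx) for q, idx in queries]
print("ranks:", ranks)
print("MRR:", round(mrr(ranks), 4))
print("SR@1:", round(success_rate_at_k(ranks, 1), 4))
print("NDCG:", round(ndcg(ranks), 4))
```

In this toy data the third query is closer to the snippet `[1.0, 1.0]` than to its correct snippet, so its rank is 2 and every metric falls below 1.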

Key words: software development, code search, collaborative fusion, BERT (Bidirectional Encoder Representations from Transformers), residual network

CLC number: