Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (12): 3896-3902. DOI: 10.11772/j.issn.1001-9081.2022111783
Code search model based on collaborative fusion network

Qihong SONG1,2, Jianxun LIU1,2, Haize HU1,2, Xiangping ZHANG1,2

Received: 2022-11-29
Revised: 2023-03-25
Accepted: 2023-03-28
Online: 2023-05-08
Published: 2023-12-10
Contact: Jianxun LIU
About author: SONG Qihong, born in 1998 in Baoji, Shaanxi, M. S. candidate, CCF member. His research interests include code search and code completion.
Supported by:
Abstract: Searching for and reusing relevant code can effectively improve software development efficiency. Deep learning based code search models usually embed code snippets and queries into the same vector space and match them by cosine similarity to return the corresponding code snippets; however, most of these models ignore the collaborative information between code snippets and queries. To characterize semantic information more comprehensively, a code search model based on collaborative fusion, named BofeCS, was proposed. First, the BERT (Bidirectional Encoder Representations from Transformers) model was used to extract the semantic information of the input sequences and represent it as vectors. Second, a collaborative fusion network was built to extract the token-level collaborative information between code snippets and queries. Finally, a residual network was constructed to alleviate the loss of semantic information during representation. Experiments were carried out on the multilingual dataset CodeSearchNet to verify the effectiveness of BofeCS. The results show that, compared with the baseline models UNIF (embedding UNIFication), TabCS (Two-stage attention-based model for Code Search) and MRCS (Multimodal Representation for neural Code Search), BofeCS achieves significant improvements in Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG) and top-k Success Rate (SR@k), with the MRR value increased by 95.94%, 52.32% and 16.95% respectively.
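As a minimal illustration of the matching step described in the abstract, a query embedding can be scored against candidate code embeddings by cosine similarity and the candidates ranked by score. This is a sketch only: the function names and plain-list vectors are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors, as used to
    # match code snippets against a query in a shared vector space.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_snippets(query_vec, snippet_vecs):
    # Return snippet indices sorted by descending similarity to the query.
    scores = [cosine_similarity(query_vec, s) for s in snippet_vecs]
    return sorted(range(len(snippet_vecs)), key=lambda i: -scores[i])
```

In the actual model the vectors would come from the BERT-based encoders; here they are stand-ins to show the retrieval mechanics.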
CLC number:
Qihong SONG, Jianxun LIU, Haize HU, Xiangping ZHANG. Code search model based on collaborative fusion network[J]. Journal of Computer Applications, 2023, 43(12): 3896-3902.
| Programming language | Training set | Validation set | Test set | Total |
|---|---|---|---|---|
| Overall | 1 880 853 | 89 154 | 100 529 | 2 070 536 |
| Python | 412 178 | 23 107 | 22 176 | 457 461 |
| JavaScript | 123 889 | 8 253 | 6 483 | 138 625 |
| Ruby | 48 791 | 2 209 | 2 279 | 53 279 |
| Go | 317 832 | 14 242 | 14 291 | 346 365 |
| Java | 454 451 | 15 328 | 26 909 | 496 688 |
| PHP | 523 712 | 26 015 | 28 391 | 578 118 |

Tab.1 Details about CodeSearchNet corpus (numbers of code-query pairs)
| Model | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| UNIF | 0.420 | 0.556 | 0.624 | 0.419 | 0.451 |
| TabCS | 0.547 | 0.683 | 0.748 | 0.539 | 0.569 |
| MRCS | 0.719 | 0.828 | 0.871 | 0.702 | 0.741 |
| BofeCS | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |

Tab.2 Comparison experiment results of four models on code search task
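The SR@k and MRR metrics reported above can be computed from the 1-based rank at which the correct snippet appears for each query. A minimal sketch, assuming one relevant snippet per query (function names are illustrative, not the paper's evaluation code):

```python
def mrr(ranks):
    # Mean Reciprocal Rank: average of 1/rank over all queries,
    # where rank is the 1-based position of the correct result.
    return sum(1.0 / r for r in ranks) / len(ranks)

def success_rate_at_k(ranks, k):
    # SR@k: fraction of queries whose correct result is in the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

For example, correct results at ranks 1, 2 and 4 give an MRR of (1 + 1/2 + 1/4)/3 ≈ 0.583 and an SR@2 of 2/3.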
| Model input | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| T | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |
| T + SBT | 0.844 | 0.935 | 0.966 | 0.820 | 0.855 |
| T + LCRS | 0.517 | 0.719 | 0.812 | 0.509 | 0.579 |
| T + RootPath | 0.769 | 0.883 | 0.925 | 0.748 | 0.790 |
| T + LeafPath | 0.492 | 0.698 | 0.797 | 0.488 | 0.559 |

Tab.3 Influence of tree sequences on BofeCS performance
| Loss function | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| S | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |
| M | 0.825 | 0.931 | 0.967 | 0.712 | 0.774 |

Tab.4 Performance of two loss functions
| Method | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| Max pooling | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |
| Average pooling | 0.449 | 0.811 | 0.960 | 0.476 | 0.587 |

Tab.5 Performance of two pooling operations
| Model | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| BofeCS | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |
| BofeCS w/o collaborative fusion network | 0.745 | 0.924 | 0.969 | 0.743 | 0.785 |
| BofeCS w/o residual structure | 0.716 | 0.927 | 0.979 | 0.668 | 0.743 |
| BofeCS w/o Dropout structure | 0.351 | 0.521 | 0.655 | 0.363 | 0.426 |

Tab.6 Results of ablation experiments
| Programming language | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| Python | 0.815 | 0.850 | 0.961 | 0.986 | 0.857 |
| JavaScript | 0.811 | 0.847 | 0.961 | 0.987 | 0.854 |
| Ruby | 0.677 | 0.692 | 0.838 | 0.901 | 0.729 |
| Go | 0.861 | 0.869 | 0.933 | 0.973 | 0.887 |
| Java | 0.821 | 0.848 | 0.942 | 0.972 | 0.857 |
| PHP | 0.899 | 0.917 | 0.971 | 0.987 | 0.919 |

Tab.7 Performance of BofeCS on six programming languages
References

[1] YAO Z, PEDDAMAIL J R, SUN H. CoaCor: code annotation for code retrieval with reinforcement learning[C]// Proceedings of the 2019 World Wide Web Conference. Republic and Canton of Geneva: International World Wide Web Conferences Steering Committee, 2019: 2203-2214. DOI: 10.1145/3308558.3313632.
[2] WAN Y, SHU J, SUI Y, et al. Multi-modal attention network learning for semantic source code retrieval[C]// Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering. Piscataway: IEEE, 2019: 13-25. DOI: 10.1109/ase.2019.00012.
[3] GU X, ZHANG H, KIM S. Deep code search[C]// Proceedings of the ACM/IEEE 40th International Conference on Software Engineering. New York: ACM, 2018: 933-944. DOI: 10.1145/3180155.3180167.
[4] YU Z, YU J, XIANG C, et al. Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(12): 5947-5959. DOI: 10.1109/tnnls.2018.2817340.
[5] LI L, DONG R, CHEN L. Context-aware co-attention neural network for service recommendations[C]// Proceedings of the IEEE 35th International Conference on Data Engineering Workshops. Piscataway: IEEE, 2019: 201-208. DOI: 10.1109/icdew.2019.00-11.
[6] LI B, SUN Z, LI Q, et al. Group-wise deep object co-segmentation with co-attention recurrent neural network[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8518-8527. DOI: 10.1109/iccv.2019.00861.
[7] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. DOI: 10.1109/cvpr.2016.90.
[8] HUSAIN H, WU H H, GAZIT T, et al. CodeSearchNet challenge: evaluating the state of semantic code search[EB/OL]. [2022-09-12].
[9] CAMBRONERO J, LI H, KIM S, et al. When deep learning met code search[C]// Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2019: 964-974. DOI: 10.1145/3338906.3340458.
[10] XU L, YANG H, LIU C, et al. Two-stage attention-based model for code search with textual and structural features[C]// Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering. Piscataway: IEEE, 2021: 342-353. DOI: 10.1109/saner50967.2021.00039.
[11] GU J, CHEN Z, MONPERRUS M. Multimodal representation for neural code search[C]// Proceedings of the 2021 IEEE International Conference on Software Maintenance and Evolution. Piscataway: IEEE, 2021: 483-494. DOI: 10.1109/icsme52107.2021.00049.
[12] LV F, ZHANG H, LOU J G, et al. CodeHow: effective code search based on API understanding and extended Boolean model (E)[C]// Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering. Piscataway: IEEE, 2015: 260-270. DOI: 10.1109/ase.2015.42.
[13] LU M, SUN X, WANG S, et al. Query expansion via WordNet for effective code search[C]// Proceedings of the IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering. Piscataway: IEEE, 2015: 545-549. DOI: 10.1109/saner.2015.7081874.
[14] LEMOS O A L, DE PAULA A C, ZANICHELLI F C, et al. Thesaurus-based automatic query expansion for interface-driven code search[C]// Proceedings of the 11th Working Conference on Mining Software Repositories. New York: ACM, 2014: 212-221. DOI: 10.1145/2597073.2597087.
[15] LIU J, KIM S, MURALI V, et al. Neural query expansion for code search[C]// Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. New York: ACM, 2019: 29-37. DOI: 10.1145/3315508.3329975.
[16] WANG C, NONG Z, GAO C, et al. Enriching query semantics for code search with reinforcement learning[J]. Neural Networks, 2022, 145: 22-32. DOI: 10.1016/j.neunet.2021.09.025.
[17] ZOU Q, ZHANG C. Query expansion via learning change sequences[J]. International Journal of Knowledge-based and Intelligent Engineering Systems, 2020, 24(2): 95-105. DOI: 10.3233/kes-200033.
[18] HU G, PENG M, ZHANG Y, et al. Unsupervised software repositories mining and its application to code search[J]. Software: Practice and Experience, 2020, 50(3): 299-322. DOI: 10.1002/spe.2760.
[19] WU H, YANG Y. Code search based on alteration intent[J]. IEEE Access, 2019, 7: 56796-56802. DOI: 10.1109/access.2019.2913560.
[20] WANG H, ZHANG J, XIA Y, et al. COSEA: convolutional code search with layer-wise attention[EB/OL]. [2022-09-12]. DOI: 10.48550/arXiv.2010.09520.
[21] LING X, WU L, WANG S, et al. Deep graph matching and searching for semantic code retrieval[J]. ACM Transactions on Knowledge Discovery from Data, 2021, 15(5): No.88. DOI: 10.1145/3447571.
[22] WANG W, LI G, MA B, et al. Detecting code clones with graph neural network and flow-augmented abstract syntax tree[C]// Proceedings of the IEEE 27th International Conference on Software Analysis, Evolution and Reengineering. Piscataway: IEEE, 2020: 261-271. DOI: 10.1109/saner48275.2020.9054857.
[23] XIA B, PANG J M, ZHOU X, et al. Research progress on binary code similarity search[J]. Journal of Computer Applications, 2022, 42(4): 985-998. DOI: 10.11772/j.issn.1001-9081.2021071267.
[24] ZHANG J, WANG X, ZHANG H, et al. A novel neural source code representation based on abstract syntax tree[C]// Proceedings of the IEEE/ACM 41st International Conference on Software Engineering. Piscataway: IEEE, 2019: 783-794. DOI: 10.1109/icse.2019.00086.
[25] LING C, LIN Z, ZOU Y, et al. Adaptive deep code search[C]// Proceedings of the 28th International Conference on Program Comprehension. New York: ACM, 2020: 48-59. DOI: 10.1145/3387904.3389278.
[26] MA H, LI Y, JI X, et al. MsCoa: multi-step co-attention model for multi-label classification[J]. IEEE Access, 2019, 7: 109635-109645. DOI: 10.1109/access.2019.2933042.
[27] ZHANG P, ZHU H, XIONG T, et al. Co-attention network and low-rank bilinear pooling for aspect based sentiment analysis[C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2019: 6725-6729. DOI: 10.1109/icassp.2019.8682248.
[28] SHUAI J, XU L, LIU C, et al. Improving code search with co-attentive representation learning[C]// Proceedings of the 28th International Conference on Program Comprehension. New York: ACM, 2020: 196-207. DOI: 10.1145/3387904.3389269.
[29] SHWARTZ-ZIV R, TISHBY N. Opening the black box of deep neural networks via information[EB/OL]. [2022-09-12].
[30] BELGHAZI M I, BARATIN A, RAJESWAR S, et al. Mutual information neural estimation[C]// Proceedings of the 35th International Conference on Machine Learning. New York: JMLR.org, 2018: 531-540.