Code search model based on collaborative fusion network

Journal of Computer Applications, 2023, Vol. 43, Issue (12): 3896-3902. DOI: 10.11772/j.issn.1001-9081.2022111783

Qihong SONG¹,², Jianxun LIU¹,² (corresponding author), Haize HU¹,², Xiangping ZHANG¹,²

1. Hunan Key Laboratory of Service Computing and New Software Service Technology (Hunan University of Science and Technology), Xiangtan Hunan 411201, China
2. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan Hunan 411201, China

Received: 2022-11-29 Revised: 2023-03-25 Accepted: 2023-03-28 Online: 2023-05-08 Published: 2023-12-10
Contact: Jianxun LIU
About author: SONG Qihong, born in 1998 in Baoji, Shaanxi, M. S. candidate, CCF member. His research interests include code search and code completion.
LIU Jianxun, born in 1970 in Hengyang, Hunan, Ph. D., professor, CCF distinguished member. His research interests include big data, service computing, and cloud computing. Email: 904500672@qq.com
HU Haize, born in 1989 in Shaoyang, Hunan, Ph. D. candidate, lecturer. His research interests include data mining and code search.
ZHANG Xiangping, born in 1993 in Sanming, Fujian, Ph. D. candidate. His research interests include code representation and code clone detection.
Supported by:
National Natural Science Foundation of China (61872139)

Abstract

Searching for and reusing relevant code can significantly improve software development efficiency. Deep learning-based code search models usually embed code snippets and query statements into the same vector space, then match and return the relevant code snippets by computing cosine similarity; however, most of these models ignore the collaborative information between code snippets and query statements. To represent semantic information more comprehensively, a collaborative fusion-based code search model named BofeCS was proposed. Firstly, a BERT (Bidirectional Encoder Representations from Transformers) model was used to extract the semantic information of the input sequences and represent it as vectors. Secondly, a collaborative fusion network was constructed to extract token-level collaborative information between code snippets and query statements. Finally, a residual network was built to alleviate the loss of semantic information during representation. To evaluate the effectiveness of BofeCS, experiments were carried out on the multilingual dataset CodeSearchNet. Experimental results show that, compared with the baseline models UNIF (embedding UNIFication), TabCS (Two-stage Attention-Based model for Code Search), and MRCS (Multimodal Representation for neural Code Search), BofeCS achieves significantly higher Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and top-k Success hit Rate (SR@k), with MRR improved by 95.94%, 52.32%, and 16.95%, respectively.

Key words: software development, code search, collaborative fusion, BERT (Bidirectional Encoder Representations from Transformers), residual network
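
The abstract describes the common retrieval setup in which code snippets and queries share one vector space and are matched by cosine similarity. The following is a minimal sketch of that matching step only, not of BofeCS itself. The function names and the three-dimensional toy vectors are hypothetical; a real system would obtain the embeddings from a trained encoder such as BERT.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def rank_snippets(query_vec, snippet_vecs):
    """Return snippet ids ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), sid)
              for sid, vec in snippet_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sid for _, sid in scored]


# Hand-made toy embeddings (hypothetical values, for illustration only).
query = [1.0, 0.0, 1.0]
snippets = {
    "read_file": [0.9, 0.1, 0.8],
    "sort_list": [0.0, 1.0, 0.1],
    "open_socket": [0.5, 0.5, 0.0],
}
ranking = rank_snippets(query, snippets)
print(ranking)
```

With these toy vectors, `read_file` points in almost the same direction as the query and is ranked first, while `sort_list` is nearly orthogonal to it and is ranked last.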

CLC number: TP311.5

Qihong SONG, Jianxun LIU, Haize HU, Xiangping ZHANG. Code search model based on collaborative fusion network[J]. Journal of Computer Applications, 2023, 43(12): 3896-3902.

Number of code-query pairs by programming language:
Programming language	Training set	Validation set	Test set	Total
Total	1 880 853	89 154	100 529	2 070 536
Python	412 178	23 107	22 176	457 461
JavaScript	123 889	8 253	6 483	138 625
Ruby	48 791	2 209	2 279	53 279
Go	317 832	14 242	14 291	346 365
Java	454 451	15 328	26 909	496 688
PHP	523 712	26 015	28 391	578 118

Model	SR@1	SR@5	SR@10	MRR	NDCG
UNIF	0.420	0.556	0.624	0.419	0.451
TabCS	0.547	0.683	0.748	0.539	0.569
MRCS	0.719	0.828	0.871	0.702	0.741
BofeCS	0.848	0.942	0.972	0.821	0.857
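
The MRR improvements quoted in the abstract (95.94%, 52.32%, and 16.95%) can be reproduced from the MRR column of the table above as relative gains over each baseline. The sketch below shows that arithmetic; the helper name `relative_improvement` is introduced here for illustration.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# MRR values copied from the comparison table above.
bofecs_mrr = 0.821
baseline_mrr = {"UNIF": 0.419, "TabCS": 0.539, "MRCS": 0.702}

improvements = {
    name: round(relative_improvement(bofecs_mrr, mrr), 2)
    for name, mrr in baseline_mrr.items()
}
print(improvements)  # {'UNIF': 95.94, 'TabCS': 52.32, 'MRCS': 16.95}
```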

Model input	SR@1	SR@5	SR@10	MRR	NDCG
T	0.848	0.942	0.972	0.821	0.857
T + SBT	0.844	0.935	0.966	0.820	0.855
T + LCRS	0.517	0.719	0.812	0.509	0.579
T + RootPath	0.769	0.883	0.925	0.748	0.790
T + LeafPath	0.492	0.698	0.797	0.488	0.559

Method	SR@1	SR@5	SR@10	MRR	NDCG
Max pooling	0.848	0.942	0.972	0.821	0.857
Average pooling	0.449	0.811	0.960	0.476	0.587
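
The table above compares two ways of pooling. As an illustration of the difference, the sketch below builds a token-level affinity matrix between query tokens and code tokens using dot products, then reduces each row with max pooling or average pooling. This is an assumed, generic co-attention-style setup: the page does not describe where BofeCS applies pooling or how its collaborative fusion network computes interactions, and all names and vectors here are hypothetical.

```python
def affinity_matrix(query_tokens, code_tokens):
    """Dot-product affinity: one row per query token, one column per code token."""
    return [
        [sum(q * c for q, c in zip(q_vec, c_vec)) for c_vec in code_tokens]
        for q_vec in query_tokens
    ]


def max_pool(matrix):
    """Keep the strongest interaction in each row."""
    return [max(row) for row in matrix]


def average_pool(matrix):
    """Average all interactions in each row."""
    return [sum(row) / len(row) for row in matrix]


# Toy token embeddings (hypothetical values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
code_tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.25]]

scores = affinity_matrix(query_tokens, code_tokens)
pooled_max = max_pool(scores)
pooled_avg = average_pool(scores)
print(scores, pooled_max, pooled_avg)
```

Max pooling keeps only the best-matching code token for each query token, whereas average pooling dilutes a strong match with the weak ones in the same row.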

Programming language	SR@1	SR@5	SR@10	MRR	NDCG
Python	0.850	0.961	0.986	0.815	0.857
JavaScript	0.847	0.961	0.987	0.811	0.854
Ruby	0.692	0.838	0.901	0.677	0.729
Go	0.869	0.933	0.973	0.861	0.887
Java	0.848	0.942	0.972	0.821	0.857
PHP	0.917	0.971	0.987	0.899	0.919
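
The tables report SR@k, MRR, and NDCG. The sketch below gives the common textbook form of these metrics when each query has exactly one correct snippet and `ranks` holds the 1-based position at which it was retrieved. The page does not give the authors' exact formulas, and the reported values (MRR slightly below SR@1) suggest their protocol differs from this textbook form, so the functions are illustrative only.

```python
import math


def success_rate_at_k(ranks, k):
    """Fraction of queries whose correct snippet appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def mean_ndcg(ranks):
    """NDCG with one relevant item per query: DCG = 1/log2(1 + rank), ideal DCG = 1."""
    return sum(1.0 / math.log2(1 + r) for r in ranks) / len(ranks)


# Hypothetical ranks of the correct snippet for four queries.
ranks = [1, 2, 1, 4]
sr1 = success_rate_at_k(ranks, 1)
sr5 = success_rate_at_k(ranks, 5)
mrr = mean_reciprocal_rank(ranks)
ndcg = mean_ndcg(ranks)
print(sr1, sr5, mrr, round(ndcg, 4))
```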

Model	SR@1	SR@5	SR@10	MRR	NDCG
BofeCS	0.848	0.942	0.972	0.821	0.857
BofeCS without collaborative fusion network	0.745	0.924	0.969	0.743	0.785
BofeCS without residual structure	0.716	0.927	0.979	0.668	0.743
BofeCS without Dropout structure	0.351	0.521	0.655	0.363	0.426
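
One row of the ablation above removes the residual structure. As a reminder of what a residual (skip) connection does in general, the sketch below adds a block's input back onto its transformed output, so the original representation is carried forward even when the transformation discards part of it. The `halve` function is a hypothetical stand-in for a learned layer; this is not the BofeCS architecture, which the page does not detail.

```python
def residual_block(x, transform):
    """Return x + transform(x), element-wise (a generic skip connection)."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the vector length")
    return [a + b for a, b in zip(x, fx)]


def halve(vector):
    """Hypothetical stand-in for a learned layer."""
    return [0.5 * a for a in vector]


output = residual_block([2.0, -4.0], halve)
print(output)  # [3.0, -6.0]
```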
