Code search model based on collaborative fusion network

doi:10.11772/j.issn.1001-9081.2022111783

Journal of Computer Applications, 2023, Vol. 43, Issue (12): 3896-3902. DOI: 10.11772/j.issn.1001-9081.2022111783

• Computer software technology •

Qihong SONG^1,2, Jianxun LIU^1,2, Haize HU^1,2, Xiangping ZHANG^1,2

^1. Hunan Key Laboratory of Service Computing and New Software Service Technology (Hunan University of Science and Technology), Xiangtan, Hunan 411201, China
^2. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China

Received: 2022-11-29; Revised: 2023-03-25; Accepted: 2023-03-28; Online: 2023-05-08; Published: 2023-12-10
Contact: Jianxun LIU (Email: 904500672@qq.com)
About the authors:
SONG Qihong, born in 1998 in Baoji, Shaanxi, M.S. candidate, CCF member. His research interests include code search and code completion.
LIU Jianxun, born in 1970 in Hengyang, Hunan, Ph.D., professor, CCF distinguished member. His research interests include big data, service computing, and cloud computing.
HU Haize, born in 1989 in Shaoyang, Hunan, Ph.D. candidate, lecturer. His research interests include data mining and code search.
ZHANG Xiangping, born in 1993 in Sanming, Fujian, Ph.D. candidate. His research interests include code representation and code clone detection.
Supported by: National Natural Science Foundation of China (61872139)

Abstract

Searching for and reusing relevant code can significantly improve software development efficiency. Deep learning-based code search models usually embed code snippets and query statements into the same vector space, then match and return the relevant code by computing cosine similarity; however, most of these models ignore the collaborative information between code snippets and query statements. To represent semantic information more fully, a collaborative fusion-based code search model named BofeCS was proposed. Firstly, the BERT (Bidirectional Encoder Representations from Transformers) model was used to extract the semantic information of the input sequences and represent it as vectors. Secondly, a collaborative fusion network was constructed to extract the token-level collaborative information between code snippets and query statements. Finally, a residual network was built to alleviate the loss of semantic information during representation. To evaluate the effectiveness of BofeCS, experiments were carried out on the multi-lingual dataset CodeSearchNet. Experimental results show that BofeCS significantly improves the accuracy of code search and outperforms the baseline models UNIF (embedding UNIFication), TabCS (Two-stage Attention-Based model for Code Search), and MRCS (Multimodal Representation for neural Code Search) in Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and top-k Success hit Rate (SR@k), with MRR improvements of 95.94%, 52.32%, and 16.95%, respectively.
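
The matching and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the BofeCS implementation: the toy vectors stand in for BERT embeddings, all function names are hypothetical, query i is assumed to be paired with code snippet i, and the metrics follow their common single-relevant-item definitions, which may differ in detail from the paper's evaluation protocol.

```python
import numpy as np


def cosine_similarity_matrix(queries: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Cosine similarity between every query vector and every code vector."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    return q @ c.T  # shape: (n_queries, n_codes)


def ranks_of_correct_code(sim: np.ndarray) -> np.ndarray:
    """1-based rank of the paired snippet for each query (assumes query i <-> code i)."""
    correct = np.diag(sim)[:, None]
    # Rank = 1 + number of candidates scored strictly higher than the paired snippet.
    return 1 + (sim > correct).sum(axis=1)


def success_rate_at_k(ranks: np.ndarray, k: int) -> float:
    """SR@k: fraction of queries whose paired snippet appears in the top k."""
    return float(np.mean(ranks <= k))


def mean_reciprocal_rank(ranks: np.ndarray) -> float:
    """MRR: mean of 1 / rank over all queries."""
    return float(np.mean(1.0 / ranks))


def ndcg(ranks: np.ndarray) -> float:
    """NDCG with one relevant item per query: mean of 1 / log2(rank + 1)."""
    return float(np.mean(1.0 / np.log2(ranks + 1)))


# Toy 2-dimensional "embeddings" (assumed values, for illustration only).
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
codes = np.array([[1.0, 0.1], [1.0, 1.0], [0.2, 1.0]])

ranks = ranks_of_correct_code(cosine_similarity_matrix(queries, codes))
print("ranks:", ranks)
print("SR@1:", success_rate_at_k(ranks, 1))
print("MRR:", mean_reciprocal_rank(ranks))
print("NDCG:", ndcg(ranks))
```

With real data, `queries` and `codes` would hold the encoder outputs for the test queries and the candidate snippets, and the same three functions would be applied to the resulting ranks.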

Key words: software development, code search, collaborative fusion, BERT (Bidirectional Encoder Representations from Transformers), residual network

CLC Number: TP311.5

Qihong SONG, Jianxun LIU, Haize HU, Xiangping ZHANG. Code search model based on collaborative fusion network[J]. Journal of Computer Applications, 2023, 43(12): 3896-3902.

Number of code-query pairs by programming language
Programming language	Training set	Validation set	Test set	Total
All languages	1 880 853	89 154	100 529	2 070 536
Python	412 178	23 107	22 176	457 461
JavaScript	123 889	8 253	6 483	138 625
Ruby	48 791	2 209	2 279	53 279
Go	317 832	14 242	14 291	346 365
Java	454 451	15 328	26 909	496 688
PHP	523 712	26 015	28 391	578 118
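
As a quick consistency check on the table above, the per-language counts can be summed and compared with the reported totals. The sketch below only re-adds the published numbers; the dictionary layout and variable names are illustrative.

```python
# (training, validation, test) pair counts copied from the table above.
splits = {
    "Python": (412_178, 23_107, 22_176),
    "JavaScript": (123_889, 8_253, 6_483),
    "Ruby": (48_791, 2_209, 2_279),
    "Go": (317_832, 14_242, 14_291),
    "Java": (454_451, 15_328, 26_909),
    "PHP": (523_712, 26_015, 28_391),
}

# Row totals: one sum per language.
row_totals = {lang: sum(counts) for lang, counts in splits.items()}

# Column totals: sums over all languages for each split.
column_totals = tuple(sum(column) for column in zip(*splits.values()))

grand_total = sum(row_totals.values())

# Share of the whole corpus contributed by each language, in percent.
shares = {lang: round(100.0 * total / grand_total, 2) for lang, total in row_totals.items()}

print(row_totals)
print(column_totals, grand_total)
print(shares)
```

The sums reproduce the totals row, and the shares make the imbalance between languages explicit (Ruby is by far the smallest subset).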

Model	SR@1	SR@5	SR@10	MRR	NDCG
UNIF	0.420	0.556	0.624	0.419	0.451
TabCS	0.547	0.683	0.748	0.539	0.569
MRCS	0.719	0.828	0.871	0.702	0.741
BofeCS	0.848	0.942	0.972	0.821	0.857
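
The MRR gains quoted in the abstract (95.94%, 52.32%, and 16.95%) follow from the MRR column of this comparison as relative improvements over each baseline. A short worked calculation, with a hypothetical helper name:

```python
# MRR values copied from the model comparison table.
mrr = {"UNIF": 0.419, "TabCS": 0.539, "MRCS": 0.702, "BofeCS": 0.821}


def relative_improvement(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Improvement of BofeCS over each baseline, rounded to two decimals.
improvements = {
    name: round(relative_improvement(mrr["BofeCS"], value), 2)
    for name, value in mrr.items()
    if name != "BofeCS"
}
print(improvements)  # {'UNIF': 95.94, 'TabCS': 52.32, 'MRCS': 16.95}
```

Note that these are relative gains, not differences in absolute MRR: the absolute gap to UNIF, for example, is 0.821 - 0.419 = 0.402.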

Model input	SR@1	SR@5	SR@10	MRR	NDCG
T	0.848	0.942	0.972	0.821	0.857
T + SBT	0.844	0.935	0.966	0.820	0.855
T + LCRS	0.517	0.719	0.812	0.509	0.579
T + RootPath	0.769	0.883	0.925	0.748	0.790
T + LeafPath	0.492	0.698	0.797	0.488	0.559

Method	SR@1	SR@5	SR@10	MRR	NDCG
Max pooling	0.848	0.942	0.972	0.821	0.857
Average pooling	0.449	0.811	0.960	0.476	0.587
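
To make the two pooling options concrete, the sketch below applies a generic co-attention step to token-level vectors and then reduces the result to a single vector with either max or average pooling. This is an assumption-laden illustration, not the paper's collaborative fusion network: the dot-product affinity, the softmax weighting, the random token vectors, and all function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def co_attention(code_tokens: np.ndarray, query_tokens: np.ndarray):
    """Generic co-attention between code tokens (n_code, d) and query tokens (n_query, d)."""
    affinity = code_tokens @ query_tokens.T  # (n_code, n_query) token-level affinities
    # Each code token becomes a weighted mix of query tokens, and vice versa.
    code_context = softmax(affinity, axis=1) @ query_tokens    # (n_code, d)
    query_context = softmax(affinity, axis=0).T @ code_tokens  # (n_query, d)
    return code_context, query_context


def pool(tokens: np.ndarray, mode: str = "max") -> np.ndarray:
    """Reduce a (n_tokens, d) matrix to one d-dimensional vector."""
    if mode == "max":
        return tokens.max(axis=0)   # keep the strongest activation per dimension
    if mode == "mean":
        return tokens.mean(axis=0)  # average all tokens per dimension
    raise ValueError(f"unknown pooling mode: {mode}")


rng = np.random.default_rng(0)
code_tokens = rng.normal(size=(5, 4))   # 5 code tokens, 4-dim vectors (toy sizes)
query_tokens = rng.normal(size=(3, 4))  # 3 query tokens

code_context, query_context = co_attention(code_tokens, query_tokens)
code_vector_max = pool(code_context, "max")
code_vector_mean = pool(code_context, "mean")
print(code_vector_max.shape, code_vector_mean.shape)
```

Max pooling keeps only the most salient token per dimension, whereas average pooling lets every token, including uninformative ones, contribute equally; the table reports which choice worked better in the authors' experiments.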

Programming language	MRR	SR@1	SR@5	SR@10	NDCG
Python	0.815	0.850	0.961	0.986	0.857
JavaScript	0.811	0.847	0.961	0.987	0.854
Ruby	0.677	0.692	0.838	0.901	0.729
Go	0.861	0.869	0.933	0.973	0.887
Java	0.821	0.848	0.942	0.972	0.857
PHP	0.899	0.917	0.971	0.987	0.919

Model	SR@1	SR@5	SR@10	MRR	NDCG
BofeCS	0.848	0.942	0.972	0.821	0.857
BofeCS w/o collaborative fusion network	0.745	0.924	0.969	0.743	0.785
BofeCS w/o residual structure	0.716	0.927	0.979	0.668	0.743
BofeCS w/o Dropout structure	0.351	0.521	0.655	0.363	0.426
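
Two of the ablated components, the residual structure and Dropout, are standard building blocks. The sketch below shows a generic residual connection wrapped around a single dense layer with inverted dropout. It assumes nothing about the actual BofeCS layer layout, which this page does not describe; the layer shape, the tanh activation, and the function names are illustrative.

```python
import numpy as np


def dropout(x: np.ndarray, rate: float, rng: np.random.Generator, training: bool = True) -> np.ndarray:
    """Inverted dropout: zero a fraction `rate` of entries and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


def residual_block(x, weight, bias, rate, rng, training=True):
    """y = x + Dropout(tanh(x W + b)); the skip path carries x through unchanged."""
    transformed = np.tanh(x @ weight + bias)
    transformed = dropout(transformed, rate, rng, training)
    return x + transformed


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))             # 2 toy representation vectors of size 4
weight = rng.normal(size=(4, 4)) * 0.1  # square weight so shapes match the skip path
bias = np.zeros(4)

y_train = residual_block(x, weight, bias, rate=0.5, rng=rng, training=True)
y_eval = residual_block(x, weight, bias, rate=0.5, rng=rng, training=False)
print(y_train.shape, y_eval.shape)
```

Because the input is added back to the transformed output, information in `x` survives even when the transformation (or dropout) suppresses it, which is the usual motivation for using residual connections to limit information loss.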
