Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (12): 3896-3902. DOI: 10.11772/j.issn.1001-9081.2022111783
Code search model based on collaborative fusion network

Qihong SONG1,2, Jianxun LIU1,2, Haize HU1,2, Xiangping ZHANG1,2

Received: 2022-11-29
Revised: 2023-03-25
Accepted: 2023-03-28
Online: 2023-05-08
Published: 2023-12-10
Contact: Jianxun LIU
About author: SONG Qihong, born in 1998 in Baoji, Shaanxi, M. S. candidate, CCF member. His research interests include code search and code completion.
Supported by:
Abstract: Searching for and reusing relevant code can effectively improve software development efficiency. Deep learning based code search models usually embed code snippets and queries into the same vector space and match them by cosine similarity to return the corresponding code snippets; however, most of these models ignore the collaborative information between code snippets and queries. To characterize semantic information more comprehensively, a code search model based on collaborative fusion, named BofeCS, was proposed. First, the BERT (Bidirectional Encoder Representations from Transformers) model was used to extract the semantic information of the input sequences and represent it as vectors. Second, a collaborative fusion network was built to extract the token-level collaborative information between code snippets and queries. Finally, a residual network was constructed to alleviate the loss of semantic information during representation. Experiments were carried out on the multilingual dataset CodeSearchNet to verify the effectiveness of BofeCS. The results show that, compared with the baseline models UNIF (embedding UNIFication), TabCS (Two-stage attention-based model for Code Search) and MRCS (Multimodal Representation for neural Code Search), BofeCS achieves significant improvements in Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG) and top-k Success Rate (SR@k), with the MRR value increased by 95.94%, 52.32% and 16.95% respectively.
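As a minimal illustration of the matching step described in the abstract, a query embedding can be scored against candidate code embeddings by cosine similarity and the candidates ranked by score. This is a sketch only: the function names and plain-list vectors are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors, as used to
    # match code snippets against a query in a shared vector space.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_snippets(query_vec, snippet_vecs):
    # Return snippet indices sorted by descending similarity to the query.
    scores = [cosine_similarity(query_vec, s) for s in snippet_vecs]
    return sorted(range(len(snippet_vecs)), key=lambda i: -scores[i])
```

In the actual model the vectors would come from the BERT-based encoders; here they are stand-ins to show the retrieval mechanics.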
CLC number:
Qihong SONG, Jianxun LIU, Haize HU, Xiangping ZHANG. Code search model based on collaborative fusion network[J]. Journal of Computer Applications, 2023, 43(12): 3896-3902.
| Programming language | Training set | Validation set | Test set | Total |
|---|---|---|---|---|
| Overall | 1 880 853 | 89 154 | 100 529 | 2 070 536 |
| Python | 412 178 | 23 107 | 22 176 | 457 461 |
| JavaScript | 123 889 | 8 253 | 6 483 | 138 625 |
| Ruby | 48 791 | 2 209 | 2 279 | 53 279 |
| Go | 317 832 | 14 242 | 14 291 | 346 365 |
| Java | 454 451 | 15 328 | 26 909 | 496 688 |
| PHP | 523 712 | 26 015 | 28 391 | 578 118 |

Tab.1 Details about CodeSearchNet corpus (numbers of code-query pairs)
| Model | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| UNIF | 0.420 | 0.556 | 0.624 | 0.419 | 0.451 |
| TabCS | 0.547 | 0.683 | 0.748 | 0.539 | 0.569 |
| MRCS | 0.719 | 0.828 | 0.871 | 0.702 | 0.741 |
| BofeCS | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |

Tab.2 Comparison experiment results of four models on code search task
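The SR@k and MRR metrics reported above can be computed from the 1-based rank at which the correct snippet appears for each query. A minimal sketch, assuming one relevant snippet per query (function names are illustrative, not the paper's evaluation code):

```python
def mrr(ranks):
    # Mean Reciprocal Rank: average of 1/rank over all queries,
    # where rank is the 1-based position of the correct result.
    return sum(1.0 / r for r in ranks) / len(ranks)

def success_rate_at_k(ranks, k):
    # SR@k: fraction of queries whose correct result is in the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

For example, correct results at ranks 1, 2 and 4 give an MRR of (1 + 1/2 + 1/4)/3 ≈ 0.583 and an SR@2 of 2/3.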
| Model input | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| T | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |
| T + SBT | 0.844 | 0.935 | 0.966 | 0.820 | 0.855 |
| T + LCRS | 0.517 | 0.719 | 0.812 | 0.509 | 0.579 |
| T + RootPath | 0.769 | 0.883 | 0.925 | 0.748 | 0.790 |
| T + LeafPath | 0.492 | 0.698 | 0.797 | 0.488 | 0.559 |

Tab.3 Influence of tree sequences on BofeCS performance
| Loss function | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| S | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |
| M | 0.825 | 0.931 | 0.967 | 0.712 | 0.774 |

Tab.4 Performance of two loss functions
| Method | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| Max pooling | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |
| Average pooling | 0.449 | 0.811 | 0.960 | 0.476 | 0.587 |

Tab.5 Performance of two pooling operations
| Model | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| BofeCS | 0.848 | 0.942 | 0.972 | 0.821 | 0.857 |
| BofeCS w/o collaborative fusion network | 0.745 | 0.924 | 0.969 | 0.743 | 0.785 |
| BofeCS w/o residual structure | 0.716 | 0.927 | 0.979 | 0.668 | 0.743 |
| BofeCS w/o Dropout structure | 0.351 | 0.521 | 0.655 | 0.363 | 0.426 |

Tab.6 Results of ablation experiments
| Programming language | SR@1 | SR@5 | SR@10 | MRR | NDCG |
|---|---|---|---|---|---|
| Python | 0.815 | 0.850 | 0.961 | 0.986 | 0.857 |
| JavaScript | 0.811 | 0.847 | 0.961 | 0.987 | 0.854 |
| Ruby | 0.677 | 0.692 | 0.838 | 0.901 | 0.729 |
| Go | 0.861 | 0.869 | 0.933 | 0.973 | 0.887 |
| Java | 0.821 | 0.848 | 0.942 | 0.972 | 0.857 |
| PHP | 0.899 | 0.917 | 0.971 | 0.987 | 0.919 |

Tab.7 Performance of BofeCS on six programming languages
References

[1] YAO Z, PEDDAMAIL J R, SUN H. CoaCor: code annotation for code retrieval with reinforcement learning[C]// Proceedings of the 2019 World Wide Web Conference. Republic and Canton of Geneva: International World Wide Web Conferences Steering Committee, 2019: 2203-2214. DOI: 10.1145/3308558.3313632.
[2] WAN Y, SHU J, SUI Y, et al. Multi-modal attention network learning for semantic source code retrieval[C]// Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering. Piscataway: IEEE, 2019: 13-25. DOI: 10.1109/ase.2019.00012.
[3] GU X, ZHANG H, KIM S. Deep code search[C]// Proceedings of the ACM/IEEE 40th International Conference on Software Engineering. New York: ACM, 2018: 933-944. DOI: 10.1145/3180155.3180167.
[4] YU Z, YU J, XIANG C, et al. Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(12): 5947-5959. DOI: 10.1109/tnnls.2018.2817340.
[5] LI L, DONG R, CHEN L. Context-aware co-attention neural network for service recommendations[C]// Proceedings of the IEEE 35th International Conference on Data Engineering Workshops. Piscataway: IEEE, 2019: 201-208. DOI: 10.1109/icdew.2019.00-11.
[6] LI B, SUN Z, LI Q, et al. Group-wise deep object co-segmentation with co-attention recurrent neural network[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8518-8527. DOI: 10.1109/iccv.2019.00861.
[7] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. DOI: 10.1109/cvpr.2016.90.
[8] HUSAIN H, WU H H, GAZIT T, et al. CodeSearchNet challenge: evaluating the state of semantic code search[EB/OL]. [2022-09-12].
[9] CAMBRONERO J, LI H, KIM S, et al. When deep learning met code search[C]// Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2019: 964-974. DOI: 10.1145/3338906.3340458.
[10] XU L, YANG H, LIU C, et al. Two-stage attention-based model for code search with textual and structural features[C]// Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering. Piscataway: IEEE, 2021: 342-353. DOI: 10.1109/saner50967.2021.00039.
[11] GU J, CHEN Z, MONPERRUS M. Multimodal representation for neural code search[C]// Proceedings of the 2021 IEEE International Conference on Software Maintenance and Evolution. Piscataway: IEEE, 2021: 483-494. DOI: 10.1109/icsme52107.2021.00049.
[12] LV F, ZHANG H, LOU J G, et al. CodeHow: effective code search based on API understanding and extended Boolean model (E)[C]// Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering. Piscataway: IEEE, 2015: 260-270. DOI: 10.1109/ase.2015.42.
[13] LU M, SUN X, WANG S, et al. Query expansion via WordNet for effective code search[C]// Proceedings of the IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering. Piscataway: IEEE, 2015: 545-549. DOI: 10.1109/saner.2015.7081874.
[14] LEMOS O A L, DE PAULA A C, ZANICHELLI F C, et al. Thesaurus-based automatic query expansion for interface-driven code search[C]// Proceedings of the 11th Working Conference on Mining Software Repositories. New York: ACM, 2014: 212-221. DOI: 10.1145/2597073.2597087.
[15] LIU J, KIM S, MURALI V, et al. Neural query expansion for code search[C]// Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. New York: ACM, 2019: 29-37. DOI: 10.1145/3315508.3329975.
[16] WANG C, NONG Z, GAO C, et al. Enriching query semantics for code search with reinforcement learning[J]. Neural Networks, 2022, 145: 22-32. DOI: 10.1016/j.neunet.2021.09.025.
[17] ZOU Q, ZHANG C. Query expansion via learning change sequences[J]. International Journal of Knowledge-based and Intelligent Engineering Systems, 2020, 24(2): 95-105. DOI: 10.3233/kes-200033.
[18] HU G, PENG M, ZHANG Y, et al. Unsupervised software repositories mining and its application to code search[J]. Software: Practice and Experience, 2020, 50(3): 299-322. DOI: 10.1002/spe.2760.
[19] WU H, YANG Y. Code search based on alteration intent[J]. IEEE Access, 2019, 7: 56796-56802. DOI: 10.1109/access.2019.2913560.
[20] WANG H, ZHANG J, XIA Y, et al. COSEA: convolutional code search with layer-wise attention[EB/OL]. [2022-09-12]. DOI: 10.48550/arXiv.2010.09520.
[21] LING X, WU L, WANG S, et al. Deep graph matching and searching for semantic code retrieval[J]. ACM Transactions on Knowledge Discovery from Data, 2021, 15(5): No.88. DOI: 10.1145/3447571.
[22] WANG W, LI G, MA B, et al. Detecting code clones with graph neural network and flow-augmented abstract syntax tree[C]// Proceedings of the IEEE 27th International Conference on Software Analysis, Evolution and Reengineering. Piscataway: IEEE, 2020: 261-271. DOI: 10.1109/saner48275.2020.9054857.
[23] XIA B, PANG J M, ZHOU X, et al. Research progress on binary code similarity search[J]. Journal of Computer Applications, 2022, 42(4): 985-998. DOI: 10.11772/j.issn.1001-9081.2021071267.
[24] ZHANG J, WANG X, ZHANG H, et al. A novel neural source code representation based on abstract syntax tree[C]// Proceedings of the IEEE/ACM 41st International Conference on Software Engineering. Piscataway: IEEE, 2019: 783-794. DOI: 10.1109/icse.2019.00086.
[25] LING C, LIN Z, ZOU Y, et al. Adaptive deep code search[C]// Proceedings of the 28th International Conference on Program Comprehension. New York: ACM, 2020: 48-59. DOI: 10.1145/3387904.3389278.
[26] MA H, LI Y, JI X, et al. MsCoa: multi-step co-attention model for multi-label classification[J]. IEEE Access, 2019, 7: 109635-109645. DOI: 10.1109/access.2019.2933042.
[27] ZHANG P, ZHU H, XIONG T, et al. Co-attention network and low-rank bilinear pooling for aspect based sentiment analysis[C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2019: 6725-6729. DOI: 10.1109/icassp.2019.8682248.
[28] SHUAI J, XU L, LIU C, et al. Improving code search with co-attentive representation learning[C]// Proceedings of the 28th International Conference on Program Comprehension. New York: ACM, 2020: 196-207. DOI: 10.1145/3387904.3389269.
[29] SHWARTZ-ZIV R, TISHBY N. Opening the black box of deep neural networks via information[EB/OL]. [2022-09-12].
[30] BELGHAZI M I, BARATIN A, RAJESWAR S, et al. Mutual information neural estimation[C]// Proceedings of the 35th International Conference on Machine Learning. New York: JMLR.org, 2018: 531-540.