Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (10): 3070-3076. DOI: 10.11772/j.issn.1001-9081.2022101495
Special Issue: Artificial Intelligence
Received: 2022-10-12
Revised: 2022-11-29
Accepted: 2022-12-02
Online: 2023-10-07
Published: 2023-10-10
Contact: Yang XIANG
About author: WANG Jinyun, born in 1998, a native of Shangrao, Jiangxi, M. S. candidate. His research interests include natural language processing, machine learning, and big data.
Jinyun WANG, Yang XIANG. Text semantic de-duplication algorithm based on keyword graph representation[J]. Journal of Computer Applications, 2023, 43(10): 3070-3076.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022101495
Dataset | Positive samples | Negative samples | Training set | Validation set | Test set |
---|---|---|---|---|---|
CNSE | 12 865 | 16 198 | 17 438 | 5 813 | 5 812 |
CNSS | 16 887 | 16 616 | 20 102 | 6 701 | 6 700 |
Tab. 1 Information of datasets
Algorithm | CNSE Accuracy (%) | CNSE F1 (%) | CNSS Accuracy (%) | CNSS F1 (%) |
---|---|---|---|---|
Simhash | 53.64 | 58.55 | 53.83 | 59.20 |
BM25 | 69.63 | 66.60 | 67.77 | 70.40 |
LDA | 63.81 | 62.44 | 62.98 | 69.11 |
SimNet | 71.05 | 69.26 | 70.78 | 74.50 |
DSSM | 58.08 | 64.68 | 61.09 | 70.58 |
C-DSSM | 60.17 | 48.57 | 52.96 | 56.75 |
CIG | 84.64 | 82.75 | 89.77 | 90.07 |
BERT fine-tuning | 81.30 | 79.20 | 86.64 | 87.08 |
Proposed algorithm | 85.75 | 84.65 | 89.93 | 90.76 |
Tab. 2 Experimental results of different algorithms on CNSE and CNSS datasets
Algorithm | CNSE F1 (%) | CNSS F1 (%) |
---|---|---|
Proposed algorithm-Siam | 74.41 | 79.04 |
Proposed algorithm-Siam-GAT | 74.22 | 80.80 |
Proposed algorithm-cd-Siam-GAT | 72.95 | 79.11 |
Proposed algorithm-Sim | 74.98 | 83.84 |
Proposed algorithm-Sim-GAT | 82.71 | 88.52 |
Proposed algorithm-cd-Sim-GAT | 82.23 | 88.11 |
Proposed algorithm-Sim&Siam-GAT | 84.65 | 90.76 |
Tab. 3 Ablation experimental results on CNSE and CNSS datasets