Text semantic de-duplication algorithm based on keyword graph representation

doi:10.11772/j.issn.1001-9081.2022101495

Abstract

Redundant texts with identical or similar semantics are abundant on the Internet. Text de-duplication reduces the storage space wasted by redundant texts and avoids unnecessary computation in information extraction tasks. Traditional text de-duplication algorithms rely on literal overlap and make little use of the semantic information of texts; they also cannot capture the interactions between sentences that are far apart in a long text, so their de-duplication performance is unsatisfactory. To address the problem of text semantic de-duplication, a long-text de-duplication algorithm based on keyword graph representation was proposed. Firstly, semantic keyword phrases were extracted from a text pair, and the text pair was represented as a graph with the keyword phrases as vertices. Secondly, the nodes were encoded in several ways, and a Graph Attention Network (GAT) was used to learn the relationships between nodes, yielding a vector representation of the text-pair graph that was used to judge whether the two texts were semantically similar. Finally, de-duplication was performed according to the semantic similarity of the text pairs. Compared with traditional methods, the proposed algorithm uses the semantic information of texts effectively, and its graph structure connects distant sentences in a long text through the co-occurrence of keyword phrases, which increases the semantic interaction between different sentences. Experimental results show that the proposed algorithm performs better than traditional algorithms such as Simhash, BERT (Bidirectional Encoder Representations from Transformers) fine-tuning and Concept Interaction Graph (CIG) on both public datasets, CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story). Specifically, the F1 score of the proposed algorithm reaches 84.65% on the CNSE dataset and 90.76% on the CNSS dataset, which indicates that the proposed algorithm can effectively improve text de-duplication.
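The graph construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the keyword phrases have already been extracted (the abstract does not say how), uses a naive punctuation-based sentence splitter, and the names `split_sentences` and `build_keyword_graph` are hypothetical. Each keyword becomes a vertex holding the sentences that mention it, and two vertices are linked whenever they co-occur in a sentence.

```python
import re
from collections import defaultdict
from itertools import combinations


def split_sentences(text):
    """Naive splitter on common Chinese and English sentence terminators."""
    parts = re.split(r"[。！？.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def build_keyword_graph(text_a, text_b, keywords):
    """Build a keyword graph for a text pair (illustrative assumption).

    Each keyword is a vertex that stores the sentences mentioning it,
    kept separately for text A and text B. Two vertices are linked when
    they co-occur in one sentence; the edge weight counts co-occurrences.
    """
    keywords = set(keywords)
    nodes = {kw: {"a": [], "b": []} for kw in keywords}
    edges = defaultdict(int)
    for side, text in (("a", text_a), ("b", text_b)):
        for sentence in split_sentences(text):
            # Substring matching stands in for real keyword-phrase matching.
            present = sorted(kw for kw in keywords if kw in sentence)
            for kw in present:
                nodes[kw][side].append(sentence)
            for u, v in combinations(present, 2):
                edges[(u, v)] += 1
    return nodes, dict(edges)


text_a = ("The storm hit the coast. Officials closed the port. "
          "The port reopened after the storm passed.")
text_b = "A storm forced the port to close."
nodes, edges = build_keyword_graph(text_a, text_b, {"storm", "port", "coast"})
print(edges)
```

In this toy pair, the first and third sentences of text A are far apart but both attach to the `storm` vertex, which is the kind of long-range connection the abstract attributes to the graph structure. Node encoding and the GAT itself are outside the scope of this sketch.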

Key words: text semantic de-duplication, keyword extraction, text matching, graph representation, Graph Attention Network (GAT)

CLC Number:

TP391.1

Jinyun WANG, Yang XIANG. Text semantic de-duplication algorithm based on keyword graph representation[J]. Journal of Computer Applications, 2023, 43(10): 3070-3076.

Dataset	Positive samples	Negative samples	Training set	Validation set	Test set
CNSE	12 865	16 198	17 438	5 813	5 812
CNSS	16 887	16 616	20 102	6 701	6 700

Algorithm	CNSE accuracy (%)	CNSE F1 (%)	CNSS accuracy (%)	CNSS F1 (%)
Simhash	53.64	58.55	53.83	59.20
BM25	69.63	66.60	67.77	70.40
LDA	63.81	62.44	62.98	69.11
SimNet	71.05	69.26	70.78	74.50
DSSM	58.08	64.68	61.09	70.58
C-DSSM	60.17	48.57	52.96	56.75
CIG	84.64	82.75	89.77	90.07
BERT fine-tuning	81.30	79.20	86.64	87.08
Proposed algorithm	85.75	84.65	89.93	90.76
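Simhash, the weakest baseline in the table above, fingerprints a text from its literal tokens, which is why it cannot recognize paraphrases. The following is a minimal sketch of the general technique, not the configuration used in the paper: it assumes unweighted word tokens and uses the first 64 bits of an MD5 digest as a stand-in token hash, and the function names are hypothetical.

```python
import hashlib
import re

BITS = 64  # fingerprint width; matches the 8 digest bytes used below


def simhash(text):
    """Return a 64-bit Simhash fingerprint of the word tokens in `text`."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position (unit weights
        # are an assumption; real systems often use term weights).
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


d = hamming_distance(simhash("the port closed after the storm"),
                     simhash("the harbour shut down following the gale"))
print(d)
```

Two texts are usually treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. Because the fingerprint depends only on surface tokens, two texts that describe the same event in different words share few votes, which is consistent with the low Simhash scores reported above.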

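Algorithm	CNSE F1 (%)	CNSS F1 (%)
Proposed algorithm-Siam	74.41	79.04
Proposed algorithm-Siam-GAT	74.22	80.80
Proposed algorithm-cd-Siam-GAT	72.95	79.11
Proposed algorithm-Sim	74.98	83.84
Proposed algorithm-Sim-GAT	82.71	88.52
Proposed algorithm-cd-Sim-GAT	82.23	88.11
Proposed algorithm-Sim&Siam-GAT	84.65	90.76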
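For reference, accuracy and F1 for a binary duplicate/non-duplicate decision can be computed as in the sketch below. It assumes the standard definitions with label 1 (semantically similar pair) as the positive class; the page does not state which averaging the authors used, and the function name is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and positive-class F1 for binary labels (1 = similar pair)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, f1


acc, f1 = accuracy_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(f"accuracy={acc:.2%}, F1={f1:.2%}")
```

Because F1 ignores true negatives, accuracy and F1 can rank two algorithms differently, so the two columns of the comparison table are best read together.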