Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (10): 3070-3076. DOI: 10.11772/j.issn.1001-9081.2022101495

• Artificial Intelligence •

Text semantic de-duplication algorithm based on keyword graph representation

Jinyun WANG, Yang XIANG

  1. College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
  • Received: 2022-10-12 Revised: 2022-11-29 Accepted: 2022-12-02 Online: 2023-10-07 Published: 2023-10-10
  • Contact: Yang XIANG
  • About author: WANG Jinyun, born in 1998, is a native of Shangrao, Jiangxi and an M.S. candidate. His research interests include natural language processing, machine learning and big data.
  • Supported by:
    National Natural Science Foundation of China (72071145)

Abstract:

The Internet contains a large number of redundant texts with identical or similar semantics. Text de-duplication addresses the storage space wasted by redundant texts and reduces unnecessary overhead for information extraction tasks. Traditional text de-duplication algorithms rely on literal overlap and make little use of the semantic information of texts; they also fail to capture interactions between sentences that are far apart in a long text, so their de-duplication performance is unsatisfactory. To address text semantic de-duplication, a long-text de-duplication algorithm based on keyword graph representation is proposed. Firstly, semantic keyword phrases are extracted from a text pair, and the text pair is represented as a graph with the keyword phrases as nodes. Secondly, the nodes are encoded in several ways, and a Graph Attention Network (GAT) is used to learn the relationships between nodes, yielding a vector representation of the text-pair graph that is used to judge whether the two texts are semantically similar. Finally, de-duplication is performed according to the semantic similarity of the text pairs. Compared with traditional algorithms, the proposed algorithm makes effective use of the semantic information of texts, and its graph structure connects distant sentences in a long text through the co-occurrence of keyword phrases, which increases the semantic interaction between different sentences. Experimental results show that the proposed algorithm outperforms existing algorithms such as Simhash, BERT (Bidirectional Encoder Representations from Transformers) fine-tuning and Concept Interaction Graph (CIG) on two public datasets, CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story). Specifically, its F1 score reaches 84.65% on the CNSE dataset and 90.76% on the CNSS dataset, indicating that the proposed algorithm can effectively improve text de-duplication.
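
The graph construction and the final de-duplication step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: keyword extraction is replaced by a given keyword list with plain substring matching, the GAT-based similarity judgment is replaced by a hypothetical `is_duplicate` callback, and the greedy keep-first strategy is an assumption, since the abstract does not say how texts are removed once pairwise similarity is known. The names `build_keyword_graph`, `deduplicate`, `toy_is_duplicate` and `KEYWORDS` are all hypothetical.

```python
from itertools import combinations


def build_keyword_graph(sentences_a, sentences_b, keywords):
    """Represent a text pair as a graph whose nodes are keywords.

    Each node stores the sentences of text A and text B that mention the
    keyword. Two keywords are linked when they co-occur in a sentence.
    Assumption: keywords are matched by simple substring search.
    """
    nodes = {kw: {"a": [], "b": []} for kw in keywords}
    edges = set()
    for side, sentences in (("a", sentences_a), ("b", sentences_b)):
        for sent in sentences:
            present = [kw for kw in keywords if kw in sent]
            for kw in present:
                nodes[kw][side].append(sent)
            # Sort so that each undirected edge has one canonical form.
            for u, v in combinations(sorted(present), 2):
                edges.add((u, v))
    return nodes, edges


def deduplicate(texts, is_duplicate):
    """Greedy pass: keep a text only if it duplicates no text kept so far.

    Assumption: `is_duplicate` stands in for the trained pair classifier.
    """
    kept = []
    for text in texts:
        if not any(is_duplicate(text, k) for k in kept):
            kept.append(text)
    return kept


KEYWORDS = ["storm", "coast", "rescue"]  # hypothetical keyword list


def toy_is_duplicate(text_a, text_b):
    """Toy stand-in for the classifier: the texts share two or more keywords."""
    shared = [kw for kw in KEYWORDS if kw in text_a and kw in text_b]
    return len(shared) >= 2


text_a = ["storm hits coast", "rescue teams arrive"]
text_b = ["rescue teams reach coast after storm"]
nodes, edges = build_keyword_graph(text_a, text_b, KEYWORDS)
print(sorted(edges))

corpus = ["storm hits coast", "coast hit by storm", "rescue teams arrive"]
print(deduplicate(corpus, toy_is_duplicate))
```

In this toy pair, the two sentences of the first text share no keyword, yet their keywords become connected through the sentence of the second text. This is the kind of link between distant sentences that the abstract attributes to the graph structure. A real system would feed encoded node features and the edges into a graph attention network instead of using the keyword-overlap rule.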

Key words: text semantic de-duplication, keyword extraction, text matching, graph representation, Graph Attention Network (GAT)

CLC Number: