Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (S1): 33-38. DOI: 10.11772/j.issn.1001-9081.2022111740

• Artificial Intelligence •

TP-TM: two-phase text matching model for long-form texts

Jiarui WANG1,2, Cheng PENG1, Min FAN1

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610041, China
    2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2022-11-24; Revised: 2023-02-22; Accepted: 2023-02-23; Online: 2023-07-04; Published: 2023-06-30
  • Corresponding author: Cheng PENG (pengcheng@casit.com.cn)
  • About the authors: Jiarui WANG (1998—), female, born in Chengdu, Sichuan, M.S. candidate. Her research interests include natural language processing.
    Cheng PENG (1976—), male, born in Chengdu, Sichuan, senior engineer. His research interests include software engineering and intelligent recognition.
    Min FAN (1984—), female, born in Meishan, Sichuan, engineer, M.S. Her research interests include communication engineering and software testing.
  • Funding: Science and Technology Program of Sichuan Province (2022ZHCG0007)


Abstract:

Traditional text matching methods cannot learn deep semantic matching features between texts, while deep short-text matching models struggle to capture the fine-grained matching signals of long texts. To address these problems, a two-phase text matching model for long-form texts, TP-TM (Two-Phase Text Matching), was proposed. First, a sentence-level filter was used to remove noisy sentences and extract key sentences. The key sentences were then fed into a word-level filter, which used a BERT (Bidirectional Encoder Representations from Transformers) model incorporating an improved pruning strategy to mine deep interaction features between texts and to perform word-level noise filtering and fine-grained matching on the key sentences. Finally, the relationship between a text pair was predicted by concatenating features from different positions of BERT. Experiments on the public Chinese long-text news datasets CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story) show that, compared with the baseline model, TP-TM improves accuracy by 0.99 and 1.55 percentage points and F1 score by 0.98 and 1.46 percentage points on the two datasets respectively, effectively improving the accuracy of long-form text matching tasks.
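The two-phase pipeline summarized in the abstract can be sketched as follows. This is a minimal illustrative sketch only: the character-overlap score used for the sentence-level filter, the Jaccard similarity standing in for the BERT word-level phase, and the 0.5 decision threshold are all assumptions made here for the example, not components specified by the paper.

```python
import re


def split_sentences(text):
    """Naive splitter on Chinese/Latin sentence-ending punctuation."""
    return [s for s in re.split(r"[。！？.!?]\s*", text) if s]


def sentence_filter(doc_a, doc_b, top_k=2):
    """Phase 1 (sketch): keep the top-k sentences of doc_a sharing the most
    character overlap with doc_b, as a stand-in for the paper's
    sentence-level noise filter that extracts key sentences."""
    chars_b = set(doc_b)
    scored = sorted(split_sentences(doc_a),
                    key=lambda s: len(set(s) & chars_b), reverse=True)
    return scored[:top_k]


def word_filter_and_match(key_a, key_b):
    """Phase 2 (sketch): character-level Jaccard similarity over the key
    sentences, standing in for BERT interaction features with pruning."""
    a, b = set("".join(key_a)), set("".join(key_b))
    return len(a & b) / len(a | b) if a | b else 0.0


def tp_tm_predict(doc_a, doc_b, threshold=0.5):
    """Full two-phase flow: filter key sentences in each direction,
    then score the filtered pair and threshold the score."""
    key_a = sentence_filter(doc_a, doc_b)
    key_b = sentence_filter(doc_b, doc_a)
    return word_filter_and_match(key_a, key_b) >= threshold
```

In the actual model, `word_filter_and_match` would be replaced by a fine-tuned BERT encoder over the concatenated key-sentence pair, with classification over features concatenated from different encoder positions.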

Keywords: text matching, long-form text, BERT (Bidirectional Encoder Representations from Transformers), filter, feature pruning



CLC number: