《计算机应用》 (Journal of Computer Applications) ›› 2022, Vol. 42 ›› Issue (12): 3692-3699. DOI: 10.11772/j.issn.1001-9081.2021101768

• 人工智能 (Artificial Intelligence) •

基于图卷积网络的文本分割模型

杜雨奇, 郑津, 王杨, 黄诚, 李平

  1. 西南石油大学 计算机科学学院,成都 610500
  • 收稿日期:2021-10-14 修回日期:2022-01-07 接受日期:2022-01-24 发布日期:2022-03-04 出版日期:2022-12-10
  • 通讯作者: 郑津
  • 作者简介:杜雨奇(1998—),女,四川南充人,硕士研究生,主要研究方向:深度学习、自然语言处理
    王杨(1995—),男,重庆人,硕士研究生,主要研究方向:深度学习、图神经网络
    黄诚(1970—),男,四川南充人,副教授,硕士,主要研究方向:油气田开发智能化、数据挖掘
    李平(1977—),女,四川成都人,教授,博士,主要研究方向:机器学习、自然语言处理、复杂网络。
  • 基金资助:
    国家杰出青年科学基金资助项目(61625204);西南石油大学科研创新能力提升计划“启航”项目(2019QHZ016)

Text segmentation model based on graph convolutional network

Yuqi DU, Jin ZHENG, Yang WANG, Cheng HUANG, Ping LI

  1. School of Computer Science, Southwest Petroleum University, Chengdu, Sichuan 610500, China
  • Received: 2021-10-14 Revised: 2022-01-07 Accepted: 2022-01-24 Online: 2022-03-04 Published: 2022-12-10
  • Contact: Jin ZHENG
  • About author: DU Yuqi, born in 1998, M. S. candidate. Her research interests include deep learning and natural language processing.
    WANG Yang, born in 1995, M. S. candidate. His research interests include deep learning and graph neural networks.
    HUANG Cheng, born in 1970, M. S., associate professor. His research interests include intelligent oil and gas field development and data mining.
    LI Ping, born in 1977, Ph. D., professor. Her research interests include machine learning, natural language processing, and complex networks.
  • Supported by:
    National Science Foundation for Distinguished Young Scholars of China (61625204); Scientific Research Starting Project of Southwest Petroleum University (2019QHZ016)

摘要:

文本分割的主要任务是将文本按照主题相关的原则划分为若干个相对独立的文本块。针对现有文本分割模型提取文本段落结构信息、语义相关性及上下文交互等细粒度特征的不足,提出了一种基于图卷积网络(GCN)的文本分割模型TS-GCN。首先,基于文本段落的结构信息与语义逻辑构建出文本图;然后,引入语义相似性注意力来捕获文本段落节点间的细粒度相关性,并借助GCN实现文本段落节点高阶邻域间的信息传递,以此增强模型多粒度提取文本段落主题特征表达的能力。将所提模型与目前常用作文本分割任务基准的代表模型CATS及其基础模型TLT-TS进行对比。实验结果表明,在Wikicities数据集上,TS-GCN在未增加任何辅助模块的情况下比TLT-TS的评价指标Pk值下降了0.08个百分点;在Wikielements数据集上,相较于CATS和TLT-TS,所提模型的Pk值分别下降了0.38个百分点和2.30个百分点,可见TS-GCN取得了较好的分割效果。

关键词: 文本分割, 图卷积网络, 注意力, 自然语言处理, 深度学习

Abstract:

The main task of text segmentation is to divide a text into several relatively independent text blocks according to topic relevance. To address the shortcomings of existing text segmentation models in extracting fine-grained features such as paragraph structural information, semantic correlation, and context interaction, a text segmentation model based on Graph Convolutional Network (GCN), named TS-GCN (Text Segmentation-Graph Convolutional Network), was proposed. Firstly, a text graph was constructed from the structural information and semantic logic of the text paragraphs. Then, semantic similarity attention was introduced to capture the fine-grained correlations between paragraph nodes, and GCN was used to pass information between the high-order neighborhoods of paragraph nodes, thereby enhancing the model's ability to extract paragraph topic feature representations at multiple granularities. The proposed model was compared with the representative model CATS (Coherence-Aware Text Segmentation) and its base model TLT-TS (Two-Level Transformer model for Text Segmentation), both of which are commonly used as benchmarks for the text segmentation task. Experimental results show that on the Wikicities dataset, TS-GCN, without any auxiliary module, achieves a Pk value 0.08 percentage points lower than that of TLT-TS. On the Wikielements dataset, the Pk value of the proposed model is 0.38 and 2.30 percentage points lower than those of CATS and TLT-TS, respectively. Because Pk is an error measure for which lower is better, these results indicate that TS-GCN achieves good segmentation performance.
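The two steps named in the abstract (similarity-based attention over a paragraph graph, followed by GCN-style propagation) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify how the text graph is built or how the attention is computed, so the sliding-window adjacency, the softmax over cosine similarities, the random paragraph embeddings, and every function name below are assumptions made for illustration only.

```python
import math
import random


def build_text_graph(num_paragraphs, window=1):
    """Assumed graph: link each paragraph to itself and to neighbours within `window` positions."""
    return [[1.0 if abs(i - j) <= window else 0.0 for j in range(num_paragraphs)]
            for i in range(num_paragraphs)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_attention(features, adjacency):
    """Softmax of cosine similarity, restricted to each node's graph neighbours."""
    attention = []
    for i, row in enumerate(adjacency):
        exps = [math.exp(cosine(features[i], features[j])) if linked else 0.0
                for j, linked in enumerate(row)]
        total = sum(exps)  # never zero: every node is linked to itself
        attention.append([e / total for e in exps])
    return attention


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(features, attention, weight):
    """One propagation step: aggregate neighbour features, project, apply ReLU."""
    aggregated = matmul(attention, features)
    return [[max(v, 0.0) for v in row] for row in matmul(aggregated, weight)]


random.seed(0)
# Placeholder paragraph embeddings: 6 paragraphs, 8 dimensions each.
paragraphs = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
adjacency = build_text_graph(6, window=1)
attention = similarity_attention(paragraphs, adjacency)

# Two stacked layers let each node receive information from two-hop neighbours.
w1 = [[random.gauss(0, 0.1) for _ in range(8)] for _ in range(8)]
w2 = [[random.gauss(0, 0.1) for _ in range(4)] for _ in range(8)]
hidden = gcn_layer(paragraphs, attention, w1)
output = gcn_layer(hidden, attention, w2)
print(len(output), len(output[0]))
```

Stacking two layers is what gives a node access to its second-order neighbourhood here: paragraph 0 has no direct edge to paragraph 2, yet after two propagation steps its representation depends on it.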

Key words: text segmentation, Graph Convolutional Network (GCN), attention, Natural Language Processing (NLP), deep learning
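The Pk values quoted in the abstract refer to a standard segmentation error measure: a window of width k slides over the text, and an error is counted whenever the reference and the predicted segmentation disagree on whether the two ends of the window fall in the same segment. The sketch below follows that usual definition, with k defaulting to half the average reference segment length. The function name, the per-unit segment-label input format, and the rounding of k are assumptions for illustration, and the paper's own evaluation script may differ.

```python
def pk(reference, hypothesis, k=None):
    """Pk error rate for two segmentations given as per-unit segment labels.

    `reference` and `hypothesis` are equal-length lists in which each entry is
    the segment label of one unit (for example, one paragraph or sentence).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of units")
    n = len(reference)
    if k is None:
        # Half the average reference segment length, at least 1.
        num_segments = 1 + sum(a != b for a, b in zip(reference, reference[1:]))
        k = max(1, round(n / num_segments / 2))
    if n <= k:
        raise ValueError("text is too short for the chosen window size")
    errors = 0
    for i in range(n - k):
        same_in_reference = reference[i] == reference[i + k]
        same_in_hypothesis = hypothesis[i] == hypothesis[i + k]
        errors += same_in_reference != same_in_hypothesis
    return errors / (n - k)


reference = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
no_boundaries = [0] * 12
print(pk(reference, reference))      # identical segmentations give 0.0
print(pk(reference, no_boundaries))  # missing both boundaries gives 0.4
```

The function returns a fraction between 0 and 1. Results stated in percentage points, as in the abstract, would correspond to this fraction multiplied by 100, which is assumed here rather than confirmed by the source.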

中图分类号: