Text segmentation model based on graph convolutional network

doi:10.11772/j.issn.1001-9081.2021101768

Journal of Computer Applications, 2022, Vol. 42, Issue (12): 3692-3699.

• Artificial Intelligence •

Yuqi DU, Jin ZHENG, Yang WANG, Cheng HUANG, Ping LI

School of Computer Science, Southwest Petroleum University, Chengdu Sichuan 610500, China

Received: 2021-10-14 Revised: 2022-01-07 Accepted: 2022-01-24 Online: 2022-03-04 Published: 2022-12-10
Contact: Jin ZHENG
About author: DU Yuqi, born in 1998 in Nanchong, Sichuan, M. S. candidate. Her research interests include deep learning and natural language processing.
WANG Yang, born in 1995 in Chongqing, M. S. candidate. His research interests include deep learning and graph neural networks.
HUANG Cheng, born in 1970 in Nanchong, Sichuan, M. S., associate professor. His research interests include intellectualization of oil and gas field development and data mining.
LI Ping, born in 1977 in Chengdu, Sichuan, Ph. D., professor. Her research interests include machine learning, natural language processing, and complex networks.
Supported by:
National Science Foundation for Distinguished Young Scholars of China (61625204); Scientific Research Starting Project of Southwest Petroleum University (2019QHZ016)

Abstract:

Text segmentation divides a text into several relatively independent text blocks according to topic relevance. Existing text segmentation models fall short in extracting fine-grained features such as the structural information of text paragraphs, semantic correlation, and context interaction. To address this, a text segmentation model based on Graph Convolutional Network (GCN), named TS-GCN (Text Segmentation-Graph Convolutional Network), was proposed. First, a text graph was constructed from the structural information and semantic logic of the text paragraphs. Then, semantic similarity attention was introduced to capture the fine-grained correlation between text paragraph nodes, and GCN was used to pass information between the high-order neighborhoods of these nodes, which strengthened the model's ability to extract topic feature representations of text paragraphs at multiple granularities. The proposed model was compared with CATS (Coherence-Aware Text Segmentation), a representative model commonly used as a benchmark for the text segmentation task, and with its base model TLT-TS (Two-Level Transformer model for Text Segmentation). Experimental results show that on the Wikicities dataset, TS-GCN, without any auxiliary module, achieves a P_k value 0.08 percentage points lower than that of TLT-TS (P_k is an error measure, so lower is better). On the Wikielements dataset, the P_k value of the proposed model is 0.38 and 2.30 percentage points lower than those of CATS and TLT-TS, respectively. These results indicate that TS-GCN achieves good segmentation performance.

Key words: text segmentation, Graph Convolutional Network (GCN), attention, Natural Language Processing (NLP), deep learning
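
The abstract describes GCN passing information between high-order neighborhoods of paragraph nodes. The sketch below illustrates that idea with the standard GCN propagation rule of reference 22, H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W). It is not the authors' implementation: the chain-shaped text graph (`chain_graph`), the one-hot node features, and the identity weight matrix are assumptions chosen only to make the information flow visible.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(a_hat, h, w):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def chain_graph(n):
    """Hypothetical text graph: each paragraph is linked to its neighbours."""
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1.0
    return adj


n = 4
a_hat = normalized_adjacency(chain_graph(n))
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

h0 = identity                        # one-hot features, one per paragraph
h1 = gcn_layer(a_hat, h0, identity)  # after 1 layer: 1-hop information
h2 = gcn_layer(a_hat, h1, identity)  # after 2 layers: 2-hop information

# Paragraph 0 sees paragraph 2 only after the second layer.
print(h1[0][2], h2[0][2] > 0.0)
```

Stacking layers widens the neighborhood each paragraph node can draw on by one hop per layer, which is why the number of GCN layers is a tunable setting.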

CLC number:

TP391.1

Yuqi DU, Jin ZHENG, Yang WANG, Cheng HUANG, Ping LI. Text segmentation model based on graph convolutional network[J]. Journal of Computer Applications, 2022, 42(12): 3692-3699.

Figures/Tables 12

Fig. 1 Comparison between original GCN and TS-GCN

Fig. 2 Overall process of TS-GCN

Fig. 3 Construction of edges

Tab. 1 Statistics of text segmentation datasets

Statistic	Wikicities	Wikielements
Number of documents	100	118
Number of paragraphs	6 670	2 810
Number of words	492 402	191 762
Text block length	5.15±4.57	3.33±3.05
Text blocks per document	12.2±2.79	6.82±2.57

Fig. 4 Three results of text segmentation

Tab. 2 Comparison of P_k values among different models for text segmentation task (%)

Model	Learning method	Wikicities	Wikielements
Random	Unsupervised	47.14	50.08
Model of Ref. [23]	Unsupervised	22.10	20.10
GraphSeg	Unsupervised	39.95	49.12
WIKI-727K	Supervised	19.68	41.63
TLT-TS	Supervised	19.21	20.33
CATS	Supervised	16.85	18.41
TS-GCN	Supervised	19.13	18.03
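
The values in this table are P_k scores, the segmentation error measure introduced in reference 24. As a rough illustration of how such a score is obtained, the sketch below slides a window of width k over the text and counts the positions where the reference and the predicted segmentation disagree on whether the two window ends fall in the same block. The function names and the label-per-unit input format are assumptions for illustration, not the evaluation code used in the paper.

```python
def count_segments(labels):
    """Count contiguous runs of equal labels."""
    return 1 + sum(a != b for a, b in zip(labels, labels[1:]))


def p_k(reference, hypothesis, k=None):
    """P_k error between two segmentations of the same text.

    Each argument holds one segment label per unit (for example per
    paragraph), such as [0, 0, 0, 1, 1, 1]. Labels are assumed to be
    unique per segment. By convention k defaults to half of the average
    reference segment length.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same units")
    n = len(reference)
    if k is None:
        k = max(1, round(n / count_segments(reference) / 2))
    if not 0 < k < n:
        raise ValueError("window size k must satisfy 0 < k < len(reference)")
    errors = 0
    for i in range(n - k):
        same_in_ref = reference[i] == reference[i + k]
        same_in_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_in_ref != same_in_hyp
    return errors / (n - k)


reference = [0, 0, 0, 1, 1, 1]        # one boundary after the third unit
under_segmented = [0, 0, 0, 0, 0, 0]  # prediction with no boundary at all

print(p_k(reference, reference, k=2))        # 0.0, perfect agreement
print(p_k(reference, under_segmented, k=2))  # 0.5, half the windows disagree
```

Multiplying the result by 100 gives a percentage comparable in form to the table entries; lower values mean fewer disagreements with the reference segmentation.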

Fig. 5 Comparison of the segmentation results of the TS-GCN and WIKI-727K models with manual annotation results for Wikicities text fragments

Fig. 6 The third, fourth and fifth paragraphs of the Jinan document in Wikicities

Fig. 7 The eleventh and twelfth paragraphs of the Jinan document in Wikicities

Tab. 3 Segmentation results under different pre-trained word vectors (%)

Pre-trained word vectors	Wikicities	Wikielements
GloVe-300d	19.62	18.60
crawl-300d	19.90	18.45
wiki-news-300d	19.13	18.03

Fig. 8 Comparison of segmentation results with different numbers of GCN layers

Tab. 4 Segmentation results of different attention calculation methods (%)

Attention calculation method	Wikicities	Wikielements
No attention	22.07	19.60
Euclidean distance attention	20.45	18.60
Semantic similarity attention	19.13	18.03
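
The three rows above differ in how a paragraph node weights its neighbours. This page does not give the formulas, so the sketch below is only one plausible reading: uniform weights for no attention, a softmax over negative Euclidean distances, and a softmax over cosine similarities as a stand-in for semantic similarity. The function `attention_weights`, its `kind` argument, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(node, neighbours, kind):
    """Weights over `neighbours` for one node (assumed scoring schemes)."""
    if kind == "none":
        return [1.0 / len(neighbours)] * len(neighbours)
    if kind == "euclidean":
        # Closer neighbours get larger weights.
        return softmax([-euclidean_distance(node, nb) for nb in neighbours])
    if kind == "cosine":
        # Neighbours pointing in a similar direction get larger weights.
        return softmax([cosine_similarity(node, nb) for nb in neighbours])
    raise ValueError(f"unknown attention kind: {kind}")


node = [1.0, 0.0]
neighbours = [[3.0, 0.0],   # same direction as the node, but far away
              [0.9, 0.5]]   # nearby, but pointing in a different direction

uniform = attention_weights(node, neighbours, "none")
by_distance = attention_weights(node, neighbours, "euclidean")
by_similarity = attention_weights(node, neighbours, "cosine")
print(uniform, by_distance, by_similarity)
```

In this toy case the distance-based weights favour the second neighbour while the cosine-based weights favour the first, because cosine similarity ignores vector magnitude. That is one way the two schemes can rank the same neighbours differently.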

References 25

1	HEARST M A. TextTiling： segmenting text into multi-paragraph subtopic passages［J］. Computational Linguistics， 1997， 23（1）： 33-64.
2	QIN B， LIU T， LI S. Survey of multi-document summarization［J］. Journal of Chinese Information Processing， 2005， 19（6）： 13-20， 56. 10.3969/j.issn.1003-0077.2005.06.003
3	ANGHELUTA R， DE BUSSER R， MOENS M F. The use of topic segmentation for automatic summarization［C］// Proceedings of the Association for Computational Linguistics 2002 Post-Conference Workshop on Automatic Summarization. Stroudsburg， PA： Association for Computational Linguistics， 2002： 1421-1426.
4	HUANG X J， PENG F C， SCHUURMANS D， et al. Applying machine learning to text segmentation for information retrieval［J］. Information Retrieval， 2003， 6（3/4）： 333-362. 10.1023/a:1026028229881
5	SHTEKH G， KAZAKOVA P， NIKITINSKY N， et al. Exploring influence of topic segmentation on information retrieval quality［C］// Proceedings of the 2018 International Conference on Internet Science， LNCS 11193. Cham： Springer， 2018： 131-140.
6	MA C L， WANG T. Text sentiment analysis based on correlated topic model and multi-layer knowledge representation［J］. Journal of Zhengzhou University （Natural Science Edition）， 2021， 53（4）： 30-35.
7	ZIRN C， GLAVAŠ G， NANNI F， et al. Classifying topics and detecting topic shifts in political manifestos［C］// Proceedings of the 2016 International Conference on the Advances in Computational Analysis of Political Text. Zagreb： University of Zagreb， 2016： 88-93.
8	MANUVINAKURIKE R， PAETZEL M， QU C， et al. Toward incremental dialogue act segmentation in fast-paced interactive dialogue systems［C］// Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Stroudsburg， PA： Association for Computational Linguistics， 2016： 252-262. 10.18653/v1/w16-3632
9	ZHAO T Y， KAWAHARA T. Joint learning of dialog act segmentation and recognition in spoken dialog using neural networks［C］// Proceedings of the 18th International Joint Conference on Natural Language Processing （Volume 1： Long Papers）. ［S.l.］： Asian Federation of Natural Language Processing， 2017： 704-712. 10.18653/v1/w18-5021
10	VELIČKOVIĆ P， CUCURULL G， CASANOVA A， et al. Graph attention networks［EB/OL］. （2018-02-04）［2021-06-20］.
11	CHOI F Y Y. Advances in domain independent linear text segmentation［C］// Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics. Stroudsburg， PA： Association for Computational Linguistics， 2000： 26-33.
12	UTIYAMA M， ISAHARA H. A statistical model for domain-independent text segmentation［C］// Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. Stroudsburg， PA： Association for Computational Linguistics， 2001： 499-506. 10.3115/1073012.1073076
13	LI J， SUN A X， JOTY S. SegBot： a generic neural text segmentation model with pointer network［C］// Proceedings of the 27th International Joint Conference on Artificial Intelligence. California： ijcai.org， 2018： 4166-4172. 10.24963/ijcai.2018/579
14	KOSHOREK O， COHEN A， MOR N， et al. Text segmentation as a supervised learning task［C］// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 2 （Short Papers）. Stroudsburg， PA： Association for Computational Linguistics， 2018： 469-473. 10.18653/v1/n18-2075
15	ARNOLD S， SCHNEIDER R， CUDRÉ-MAUROUX P， et al. SECTOR： a neural model for coherent topic segmentation and classification［J］. Transactions of the Association for Computational Linguistics， 2019， 7： 169-184. 10.1162/tacl_a_00261
16	BARROW J， JAIN R， MORARIU V， et al. A joint model for document segmentation and segment labeling ［C］// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg， PA： Association for Computational Linguistics， 2020： 313-322. 10.18653/v1/2020.acl-main.29
17	LUKASIK M， DADACHEV B， PAPINENI K， et al. Text segmentation by cross segment attention［C］// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg， PA： Association for Computational Linguistics， 2020： 4707-4716. 10.18653/v1/2020.emnlp-main.380
18	XING L Z， HACKINEN B， CARENINI G， et al. Improving context modeling in neural topic segmentation ［C］// Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Stroudsburg， PA： Association for Computational Linguistics， 2020： 626-636.
19	WU B， WEI B F， LIU J， et al. Faceted text segmentation via multitask learning［J］. IEEE Transactions on Neural Networks and Learning Systems， 2021， 32（9）： 3846-3857. 10.1109/tnnls.2020.3015996
20	GLAVAŠ G， NANNI F， PONZETTO S P. Unsupervised text segmentation using semantic relatedness graphs［C］// Proceedings of the 5th Joint Conference on Lexical and Computational Semantics. Stroudsburg， PA： Association for Computational Linguistics， 2016： 125-130. 10.18653/v1/s16-2016
21	YAO L， MAO C S， LUO Y. Graph convolutional networks for text classification［C］// Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Palo Alto， CA： AAAI Press， 2019： 7370-7377. 10.1609/aaai.v33i01.33017370
22	KIPF T N， WELLING M. Semi-supervised classification with graph convolutional networks［EB/OL］. （2017-02-22）［2021-06-20］. 10.48550/arXiv.1609.02907
23	CHEN H， BRANAVAN S R K， BARZILAY R， et al. Global models of document structure using latent permutations［C］// Proceedings of Human Language Technologies： The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg， PA： Association for Computational Linguistics， 2009： 371-379. 10.3115/1620754.1620808
24	BEEFERMAN D， BERGER A， LAFFERTY J. Statistical models for text segmentation［J］. Machine Learning， 1999， 34（1/2/3）： 177-210. 10.1023/a:1007506220214
25	GLAVAŠ G， SOMASUNDARAN S. Two-level transformer and auxiliary coherence modeling for improved text segmentation［C］// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto， CA： AAAI Press， 2020：7797-7804. 10.1609/aaai.v34i05.6284
