Journal of Computer Applications, 2022, Vol. 42, Issue (12): 3692-3699. DOI: 10.11772/j.issn.1001-9081.2021101768
• Artificial Intelligence •
Yuqi DU, Jin ZHENG, Yang WANG, Cheng HUANG, Ping LI
Received: 2021-10-14
Revised: 2022-01-07
Accepted: 2022-01-24
Online: 2022-03-04
Published: 2022-12-10
Contact: Jin ZHENG
About author: DU Yuqi, born in 1998, native of Nanchong, Sichuan, M. S. candidate. Her research interests include deep learning and natural language processing.
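Abstract:
The main task of text segmentation is to divide a text into several relatively independent text blocks according to topic relevance. Existing text segmentation models are weak at extracting fine-grained features such as paragraph structure information, semantic correlation, and context interaction. To address this, a text segmentation model based on Graph Convolutional Network (GCN), named TS-GCN, was proposed. Firstly, a text graph was constructed from the structural information and semantic logic of the text paragraphs. Then, semantic similarity attention was introduced to capture the fine-grained correlation between paragraph nodes, and GCN was used to pass information between the high-order neighborhoods of paragraph nodes, strengthening the model's ability to extract topic feature representations of paragraphs at multiple granularities. The proposed model was compared with CATS, a representative model commonly used as a benchmark for text segmentation, and with its base model TLT-TS. Experimental results show that on the Wikicities dataset, without any auxiliary module, TS-GCN lowers the evaluation metric Pk by 0.08 percentage points relative to TLT-TS (19.13% versus 19.21%); on the Wikielements dataset, its Pk is 0.38 and 2.30 percentage points lower than those of CATS and TLT-TS, respectively (18.03% versus 18.41% and 20.33%). These results show that TS-GCN achieves good segmentation performance.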
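The propagation step described in the abstract (paragraphs as graph nodes, information passed between high-order neighborhoods by a GCN) can be illustrated with a minimal NumPy sketch. The abstract does not state the graph construction rule or the layer sizes, so everything here is an assumption: edges link paragraphs within a fixed positional window, features are random toy vectors, and propagation uses the symmetric-normalized rule of Kipf and Welling. The names `build_adjacency`, `normalize_adjacency`, and `gcn_layer` are hypothetical and are not taken from the paper.

```python
import numpy as np

def build_adjacency(num_paragraphs: int, window: int = 1) -> np.ndarray:
    """Link each paragraph to its neighbours within `window` positions (assumed rule)."""
    idx = np.arange(num_paragraphs)
    dist = np.abs(idx[:, None] - idx[None, :])
    return ((dist <= window) & (dist > 0)).astype(float)

def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Compute D^{-1/2} (A + I) D^{-1/2}, the normalization used by Kipf and Welling."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm: np.ndarray, features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One propagation step: ReLU(A_norm @ H @ W)."""
    return np.maximum(a_norm @ features @ weight, 0.0)

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))          # 6 paragraphs, 8-dim toy embeddings
a_norm = normalize_adjacency(build_adjacency(6, window=1))
w1 = rng.normal(size=(8, 4))                # toy weights; a real model would learn them
w2 = rng.normal(size=(4, 2))
h1 = gcn_layer(a_norm, features, w1)        # each node sees its 1-hop neighbours
h2 = gcn_layer(a_norm, h1, w2)              # stacking a second layer reaches 2-hop neighbours
```

Stacking two layers is what lets paragraph 0 receive information from paragraph 2 even though no direct edge joins them in this toy graph.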
Yuqi DU, Jin ZHENG, Yang WANG, Cheng HUANG, Ping LI. Text segmentation model based on graph convolutional network[J]. Journal of Computer Applications, 2022, 42(12): 3692-3699.
Statistic | Wikicities | Wikielements |
---|---|---|
Documents | 100 | 118 |
Paragraphs | 6 670 | 2 810 |
Words | 492 402 | 191 762 |
Text block length | 3.33±3.05 | 5.15±4.57 |
Text blocks per document | 6.82±2.57 | 12.2±2.79 |
Tab. 1 Statistics of text segmentation datasets
Model | Learning type | Wikicities | Wikielements |
---|---|---|---|
Random | Unsupervised | 47.14 | 50.08 |
Ref. [ | Unsupervised | 22.10 | 20.10 |
GraphSeg | Unsupervised | 39.95 | 49.12 |
WIKI-727K | Supervised | 19.68 | 41.63 |
TLT-TS | Supervised | 19.21 | 20.33 |
CATS | Supervised | 16.85 | 18.41 |
TS-GCN | Supervised | 19.13 | 18.03 |
Tab. 2 Comparison of Pk value among different models for text segmentation task (%; lower is better)
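For readers unfamiliar with the metric, the following is a minimal sketch of Pk as defined by Beeferman et al. [24]: a window of width k slides over the text, and an error is counted whenever the reference and the hypothesis disagree on whether the two ends of the window lie in the same segment. The evaluation script used for the table is not given in the source, so the input encoding (one 0/1 boundary flag per unit) and the function names `segment_ids` and `pk` are assumptions. The default k follows the usual convention of half the mean reference segment length. The table reports this value multiplied by 100.

```python
from typing import List, Optional, Sequence

def segment_ids(boundaries: Sequence[int]) -> List[int]:
    """Map boundary flags (1 = a segment ends after this unit) to segment ids."""
    ids, current = [], 0
    for flag in boundaries:
        ids.append(current)
        current += flag
    return ids

def pk(reference: Sequence[int], hypothesis: Sequence[int],
       k: Optional[int] = None) -> float:
    """Fraction of width-k windows on which reference and hypothesis disagree."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    ref_ids = segment_ids(reference)
    hyp_ids = segment_ids(hypothesis)
    n = len(ref_ids)
    if k is None:
        num_segments = ref_ids[-1] + 1
        k = max(1, round(n / num_segments / 2))   # half the mean segment length
    if n <= k:
        raise ValueError("sequence is too short for the window size")
    errors = 0
    for i in range(n - k):
        same_in_ref = ref_ids[i] == ref_ids[i + k]
        same_in_hyp = hyp_ids[i] == hyp_ids[i + k]
        errors += same_in_ref != same_in_hyp
    return errors / (n - k)

# Toy example: 8 units, true boundary after unit 4, predicted boundary after unit 2.
reference = [0, 0, 0, 1, 0, 0, 0, 1]
hypothesis = [0, 1, 0, 0, 0, 0, 0, 1]
perfect_score = pk(reference, reference)
shifted_score = pk(reference, hypothesis)
```

A perfect segmentation scores 0, which is why a decrease in Pk counts as an improvement.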
Fig. 5 Comparison of the segmentation results of the TS-GCN and WIKI-727K models with manual annotation results for Wikicities text fragments
Pre-trained word vectors | Wikicities | Wikielements |
---|---|---|
GloVe-300d | 19.62 | 18.60 |
crawl-300d | 19.90 | 18.45 |
wiki-news-300d | 19.13 | 18.03 |
Tab. 3 Segmentation results under different pre-trained word vectors (%)
Attention calculation method | Wikicities | Wikielements |
---|---|---|
No attention | 22.07 | 19.60 |
Euclidean distance attention | 20.45 | 18.60 |
Semantic similarity attention | 19.13 | 18.03 |
Tab. 4 Segmentation results of different attention calculation methods (%)
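The table names two ways of weighting edges between paragraph nodes but this page does not give their formulas. The sketch below is therefore only one plausible reading, with every definition an assumption: Euclidean distance attention is taken as a softmax over negative pairwise distances, and semantic similarity attention as a softmax over cosine similarities, each restricted to existing edges by a mask. The function names are hypothetical.

```python
import numpy as np

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax over entries where mask is True; masked entries get weight 0.

    Every row of `mask` must contain at least one True (e.g. a self-loop).
    """
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=1, keepdims=True)

def euclidean_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Assumed form: closer paragraph vectors receive larger weights."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return masked_softmax(-dist, mask)

def cosine_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Assumed form: paragraph vectors pointing the same way receive larger weights."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return masked_softmax(unit @ unit.T, mask)

# Three toy paragraph vectors: 0 and 1 share a direction, 2 is orthogonal to both.
x = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
idx = np.arange(3)
mask = np.abs(idx[:, None] - idx[None, :]) <= 1   # chain graph with self-loops
att_cos = cosine_attention(x, mask)
att_euc = euclidean_attention(x, mask)
```

In this toy case the cosine variant ignores vector length, so paragraph 1 attends to paragraph 0 as strongly as to itself, while the Euclidean variant penalizes the difference in magnitude.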
1 | HEARST M A. TextTiling: segmenting text into multi-paragraph subtopic passages[J]. Computational Linguistics, 1997, 23(1): 33-64. |
2 | QIN B, LIU T, LI S. Survey of multi-document summarization[J]. Journal of Chinese Information Processing, 2005, 19(6): 13-20, 56. 10.3969/j.issn.1003-0077.2005.06.003 |
3 | ANGHELUTA R, DE BUSSER R, MOENS M F. The use of topic segmentation for automatic summarization[C]// Proceedings of the Association for Computational Linguistics 2002 Post-Conference Workshop on Automatic Summarization. Stroudsburg, PA: Association for Computational Linguistics, 2002: 1421-1426. |
4 | HUANG X J, PENG F C, SCHUURMANS D, et al. Applying machine learning to text segmentation for information retrieval[J]. Information Retrieval, 2003, 6(3/4): 333-362. 10.1023/a:1026028229881 |
5 | SHTEKH G, KAZAKOVA P, NIKITINSKY N, et al. Exploring influence of topic segmentation on information retrieval quality[C]// Proceedings of the 2018 International Conference on Internet Science, LNCS 11193. Cham: Springer, 2018: 131-140. |
6 | MA C L, WANG T. Text sentiment analysis based on correlated topic model and multi-layer knowledge representation[J]. Journal of Zhengzhou University (Natural Science Edition), 2021, 53(4): 30-35. |
7 | ZIRN C, GLAVAŠ G, NANNI F, et al. Classifying topics and detecting topic shifts in political manifestos[C]// Proceedings of the 2016 International Conference on the Advances in Computational Analysis of Political Text. Zagreb: University of Zagreb, 2016: 88-93. |
8 | MANUVINAKURIKE R, PAETZEL M, QU C, et al. Toward incremental dialogue act segmentation in fast-paced interactive dialogue systems[C]// Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Stroudsburg, PA: Association for Computational Linguistics, 2016: 252-262. 10.18653/v1/w16-3632 |
9 | ZHAO T Y, KAWAHARA T. Joint learning of dialog act segmentation and recognition in spoken dialog using neural networks[C]// Proceedings of the 18th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). [S.l.]: Asian Federation of Natural Language Processing, 2017: 704-712. 10.18653/v1/w18-5021 |
10 | VELIČKOVIĆ P, CUCURULL G, CASANOVA A, et al. Graph attention networks[EB/OL]. (2018-02-04) [2021-06-20]. |
11 | CHOI F Y Y. Advances in domain independent linear text segmentation[C]// Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2000: 26-33. |
12 | UTIYAMA M, ISAHARA H. A statistical model for domain-independent text segmentation[C]// Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2001: 499-506. 10.3115/1073012.1073076 |
13 | LI J, SUN A X, JOTY S. SegBot: a generic neural text segmentation model with pointer network[C]// Proceedings of the 27th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2018: 4166-4172. 10.24963/ijcai.2018/579 |
14 | KOSHOREK O, COHEN A, MOR N, et al. Text segmentation as a supervised learning task[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2018: 469-473. 10.18653/v1/n18-2075 |
15 | ARNOLD S, SCHNEIDER R, CUDRÉ-MAUROUX P, et al. SECTOR: a neural model for coherent topic segmentation and classification[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 169-184. 10.1162/tacl_a_00261 |
16 | BARROW J, JAIN R, MORARIU V, et al. A joint model for document segmentation and segment labeling[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2020: 313-322. 10.18653/v1/2020.acl-main.29 |
17 | LUKASIK M, DADACHEV B, PAPINENI K, et al. Text segmentation by cross segment attention[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2020: 4707-4716. 10.18653/v1/2020.emnlp-main.380 |
18 | XING L Z, HACKINEN B, CARENINI G, et al. Improving context modeling in neural topic segmentation[C]// Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2020: 626-636. |
19 | WU B, WEI B F, LIU J, et al. Faceted text segmentation via multitask learning[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(9): 3846-3857. 10.1109/tnnls.2020.3015996 |
20 | GLAVAŠ G, NANNI F, PONZETTO S P. Unsupervised text segmentation using semantic relatedness graphs[C]// Proceedings of the 5th Joint Conference on Lexical and Computational Semantics. Stroudsburg, PA: Association for Computational Linguistics, 2016: 125-130. 10.18653/v1/s16-2016 |
21 | YAO L, MAO C S, LUO Y. Graph convolutional networks for text classification[C]// Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2019: 7370-7377. 10.1609/aaai.v33i01.33017370 |
22 | KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks[EB/OL]. (2017-02-22) [2021-06-20]. 10.48550/arXiv.1609.02907 |
23 | CHEN H, BRANAVAN S R K, BARZILAY R, et al. Global models of document structure using latent permutations[C]// Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2009: 371-379. 10.3115/1620754.1620808 |
24 | BEEFERMAN D, BERGER A, LAFFERTY J. Statistical models for text segmentation[J]. Machine Learning, 1999, 34(1/2/3): 177-210. 10.1023/a:1007506220214 |
25 | GLAVAŠ G, SOMASUNDARAN S. Two-level transformer and auxiliary coherence modeling for improved text segmentation[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 7797-7804. 10.1609/aaai.v34i05.6284 |