Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (1): 127-132.DOI: 10.11772/j.issn.1001-9081.2020060920

Special Issue: The 8th China Conference on Data Mining (CCDM 2020)

• China Conference on Data Mining 2020 (CCDM 2020) •

Automatic summary generation of Chinese news text based on BERT-PGN model

TAN Jinyuan1, DIAO Yufeng1, QI Ruihua2, LIN Hongfei1   

1. School of Computer Science and Technology, Dalian University of Technology, Dalian Liaoning 116024, China;
    2. Language Intelligence Research Center, Dalian University of Foreign Languages, Dalian Liaoning 116024, China
  • Received:2020-05-31 Revised:2020-07-07 Online:2021-01-10 Published:2020-09-02
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2019YFC1200302), the Key Project of National Natural Science Foundation of China (61632011).


  • Corresponding author: LIN Hongfei
  • About the authors: TAN Jinyuan (born 1997), male, from Dalian, Liaoning, M.S. candidate; research interest: natural language processing. DIAO Yufeng (born 1987), female, from Shenyang, Liaoning, Ph.D. candidate; research interest: natural language processing. QI Ruihua (born 1974), female, from Xiangfan, Hubei, professor, Ph.D.; research interest: natural language processing. LIN Hongfei (born 1962), male, from Dalian, Liaoning, professor, Ph.D.; research interest: natural language processing.

Abstract: To address the problems that abstractive summarization models in automatic text summarization do not fully understand sentence context and tend to generate duplicate content, an abstractive summarization model for Chinese news text was proposed based on BERT (Bidirectional Encoder Representations from Transformers) and the Pointer Generator Network (PGN), namely BERT-PGN. Firstly, the BERT pre-trained language model, combined with multi-dimensional semantic features, was used to obtain word vectors, yielding a finer-grained representation of the text context. Then, the PGN model extracted words from the vocabulary or the original text to form the summary. Finally, a coverage mechanism was applied to reduce the generation of duplicate content and produce the final summary. Experimental results on the single-document Chinese news summarization evaluation dataset of the 2017 CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC2017) show that, compared with models such as PGN and Long Short-Term Memory with attention mechanism (LSTM-attention), the BERT-PGN model with multi-dimensional semantic features understands the source text more fully, generates richer and more comprehensive summaries, effectively reduces duplicate and redundant content, and improves the Rouge-2 and Rouge-4 scores by 1.5% and 1.2% respectively.
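The two mechanisms the abstract relies on, copying from the source via a pointer-generator and suppressing repetition via coverage, can be illustrated with a minimal sketch of the standard pointer-generator formulation that BERT-PGN builds on. This is not the paper's implementation; the function names, toy vocabulary ids, and plain-list "tensors" below are illustrative assumptions.

```python
def final_distribution(p_gen, p_vocab, attention, src_ids):
    """Mix the generation distribution with the copy distribution.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on
    source positions where token w occurs.
    p_vocab   -- generation probabilities over the fixed vocabulary
    attention -- attention weight a_i for each source position
    src_ids   -- vocabulary id of the token at each source position
    """
    p_final = [p_gen * p for p in p_vocab]
    for a_i, w in zip(attention, src_ids):
        # Route copy probability mass to the source token's vocab slot.
        p_final[w] += (1.0 - p_gen) * a_i
    return p_final


def coverage_penalty(attention, coverage):
    """Per-step coverage loss: sum_i min(a_i, c_i).

    coverage c_i accumulates attention paid to position i at earlier
    decoding steps, so re-attending to an already-covered position
    (a likely source of duplicate output) is penalized.
    """
    return sum(min(a, c) for a, c in zip(attention, coverage))


# Toy example: 3-word vocabulary, 2 source tokens with ids 1 and 2.
p = final_distribution(p_gen=0.8,
                       p_vocab=[0.5, 0.3, 0.2],
                       attention=[0.6, 0.4],
                       src_ids=[1, 2])
loss = coverage_penalty(attention=[0.6, 0.4], coverage=[0.2, 0.5])
```

Because both the generation and copy distributions sum to one, the mixed distribution does too, which is what lets the model smoothly interpolate between generating a vocabulary word and copying a source word.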

Key words: abstractive summarization model, pre-trained language model, multi-dimensional semantic feature, Pointer Generator Network (PGN), coverage mechanism

