Journal of Computer Applications, 2021, Vol. 41, Issue (12): 3527-3533. DOI: 10.11772/j.issn.1001-9081.2021060899

• The 18th China Conference on Machine Learning (CCML 2021)

Extractive and abstractive summarization model based on pointer-generator network

Wei CHEN, Yan YANG

  1. School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
  • Received: 2021-05-12; Revised: 2021-06-24; Accepted: 2021-06-24; Online: 2021-08-20; Published: 2021-12-10
  • Contact: Yan YANG
  • About author: CHEN Wei, born in 1996 in Neijiang, Sichuan, M.S. candidate. His research interests include natural language processing.
  • Supported by:
    General Program of the National Natural Science Foundation of China (61976247)

Abstract:

Summarization is an active research problem in natural language processing. Abstractive methods based on the Seq2Seq (Sequence-to-Sequence) model have achieved good results, while extractive methods can mine effective features and extract the important sentences of an article, so using extractive methods to improve abstractive ones is a promising research direction. To this end, a model fusing the abstractive and extractive methods was proposed. Firstly, the TextRank algorithm, combined with topic similarity, was used to extract the significant sentences from the article. Then, an abstractive framework based on the Seq2Seq model and integrating the semantics of the extracted information was designed to perform the summarization task; at the same time, a pointer-generator network was introduced to handle Out-Of-Vocabulary (OOV) words during model training. The final summary was obtained through the above steps, and the model was verified on the CNN/Daily Mail dataset. The results show that the proposed model outperforms the traditional TextRank algorithm on all three metrics, ROUGE-1, ROUGE-2 and ROUGE-L, and confirm the effectiveness of fusing extractive and abstractive methods for summarization.

Key words: extractive and abstractive summarization, TextRank algorithm, Seq2Seq (Sequence-to-Sequence) model, pointer-generator network, semantic fusion
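
The extraction step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how topic similarity is fused into TextRank, so the linear interpolation, the Jaccard topic similarity, the `topic_weight` parameter and all function names below are hypothetical choices made for the example.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def overlap_similarity(a, b):
    """Word-overlap similarity in the style of the original TextRank paper."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    # Two one-word sentences give log(1) + log(1) = 0, so guard the division.
    return len(a & b) / denom if denom > 0 else 0.0


def rank_sentences(sentences, topic, damping=0.85, topic_weight=0.5, iterations=50):
    """Score sentences with TextRank, then blend in similarity to a topic string.

    ASSUMPTION: the blend is a plain linear interpolation; the paper may
    combine the two signals differently.
    """
    if not sentences:
        return []
    tokens = [tokenize(s) for s in sentences]
    topic_tokens = tokenize(topic)
    n = len(sentences)

    # Weighted, undirected sentence graph (no self-loops).
    weights = [
        [overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    # Power iteration of the weighted PageRank update used by TextRank.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]

    # Hypothetical fusion: normalized TextRank score mixed with Jaccard
    # similarity between each sentence and the topic.
    top = max(scores)
    fused = []
    for i in range(n):
        union = tokens[i] | topic_tokens
        topic_sim = len(tokens[i] & topic_tokens) / len(union) if union else 0.0
        fused.append((1 - topic_weight) * scores[i] / top + topic_weight * topic_sim)
    return fused


def extract_top_k(sentences, topic, k=2):
    """Return the k best-scoring sentences, kept in their original order."""
    fused = rank_sentences(sentences, topic)
    best = sorted(range(len(sentences)), key=lambda i: fused[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]
```

In a pipeline like the one the abstract outlines, the sentences returned by `extract_top_k` would then be passed, together with the source article, to the Seq2Seq generator; how the extracted semantics are fused there is not specified in the abstract and is not modeled in this sketch.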

CLC Number: