Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (2): 311-315. DOI: 10.11772/j.issn.1001-9081.2018081958

• Artificial Intelligence •

Automatic text summarization scheme based on deep learning

ZHANG Kejun1,2, LI Weinan2, QIAN Rong1, SHI Taimeng1, JIAO Meng1

  1. Department of Computer Science and Technology, Beijing Electronic Science and Technology Institute, Beijing 100070, China;
  2. School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, China
  • Received: 2018-09-20  Revised: 2018-11-14  Online: 2019-02-10  Published: 2019-02-15
  • Corresponding author: LI Weinan
  • About the authors: ZHANG Kejun, born in 1972 in Linyi, Shandong, is an associate professor with a Ph.D. and a CCF member; his research interests include information security and intelligent information processing. LI Weinan, born in 1994 in Xi'an, Shaanxi, is a master's candidate whose research interest is automatic summarization. QIAN Rong, born in 1970 in Jinan, Shandong, is an associate professor with a Ph.D. and a CCF member; his research interests include complex networks and data mining. SHI Taimeng, born in 1995 in Hengshui, Hebei, is a master's candidate whose research interest is text classification. JIAO Meng, born in 1994 in Shijiazhuang, Hebei, is a master's candidate whose research interest is text topic mining.
  • Supported by:
    This work is partially supported by the National Key R&D Program of China (2018YFB1004101).

Abstract: Aiming at the problems of inadequate semantic understanding, poor sentence fluency and insufficient summary accuracy in the field of abstractive automatic summarization in Natural Language Processing (NLP), a new abstractive automatic summarization solution was proposed, consisting of an improved word vector generation technique and an abstractive automatic summarization model. The improved word vector generation technique took the word vectors produced by the Skip-Gram method as its basis and, in view of the characteristics of summaries, introduced three word features: part of speech, word frequency and inverse document frequency, which effectively improved the understanding of words. The proposed Bi-MulRnn+ abstractive automatic summarization model was built on the sequence-to-sequence (seq2seq) framework with an autoencoder structure, and introduced an attention mechanism, the Gated Recurrent Unit (GRU) structure, Bi-directional Recurrent Neural Network (BiRnn), Multi-layer Recurrent Neural Network (MultiRnn) and beam search, improving the accuracy and sentence fluency of the generated summaries. Experimental results on the Large-Scale Chinese Short Text Summarization (LCSTS) dataset show that the proposed scheme can effectively solve the abstractive summarization problem for short texts, and it performs well under the Rouge evaluation standard, improving summary accuracy and sentence fluency.
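
To make the components of the scheme concrete, the sketches below illustrate the ideas in Python. They are reconstructions from the abstract alone, not the authors' code; the paper does not specify a framework or implementation details, so all names, sizes and design choices here are assumptions.

The first sketch shows one plausible way to enrich a Skip-Gram word vector with the three features named above (part of speech, word frequency and inverse document frequency), here by simple concatenation of a one-hot POS code and two scalar statistics; the corpus, the POS inventory and the smoothed IDF formula are all hypothetical.

```python
# Hedged sketch: augmenting a Skip-Gram vector with POS/TF/IDF features.
# Everything below (corpus, tags, concatenation) is an illustrative assumption.
import math
from collections import Counter

import numpy as np

# Hypothetical corpus: each document is a list of (word, POS-tag) pairs.
docs = [
    [("summary", "NN"), ("model", "NN"), ("works", "VB")],
    [("model", "NN"), ("generates", "VB"), ("summary", "NN")],
]

# Corpus-level term frequency and per-word document frequency.
tf = Counter(w for doc in docs for w, _ in doc)
df = Counter(w for doc in docs for w in {w for w, _ in doc})
n_docs = len(docs)

pos_tags = ["NN", "VB"]  # small closed POS inventory (assumption)

def augmented_vector(word, pos, base_vec):
    """Concatenate the Skip-Gram vector with [one-hot POS, TF, IDF]."""
    pos_onehot = np.zeros(len(pos_tags))
    pos_onehot[pos_tags.index(pos)] = 1.0
    idf = math.log((1 + n_docs) / (1 + df[word])) + 1  # smoothed IDF
    return np.concatenate([base_vec, pos_onehot, [tf[word], idf]])

# Stand-in for a trained Skip-Gram embedding table.
rng = np.random.default_rng(0)
skipgram = {w: rng.normal(size=8) for w in tf}

print(augmented_vector("summary", "NN", skipgram["summary"]).shape)  # (12,)
```

The second sketch assembles two of the named building blocks of the Bi-MulRnn+ model, a multi-layer bidirectional GRU encoder and an attention step over its states, in PyTorch. This is a minimal stand-in, not the published Bi-MulRnn+; the dot-product attention variant and all dimensions are assumed.

```python
# Hedged sketch: multi-layer bidirectional GRU encoding plus one
# dot-product attention step; sizes and attention variant are assumptions.
import torch
import torch.nn as nn

emb_dim, hid_dim, vocab = 12, 16, 100

embed = nn.Embedding(vocab, emb_dim)
encoder = nn.GRU(emb_dim, hid_dim, num_layers=2,
                 bidirectional=True, batch_first=True)

src = torch.randint(0, vocab, (1, 7))           # one toy source sequence
enc_out, _ = encoder(embed(src))                # (1, 7, 2 * hid_dim)

dec_state = torch.zeros(1, 1, 2 * hid_dim)      # assumed decoder query
scores = torch.bmm(dec_state, enc_out.transpose(1, 2))  # (1, 1, 7)
context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
print(context.shape)                            # context vector for decoding
```

Finally, a toy version of ROUGE-1 recall, the kind of n-gram overlap score behind the Rouge evaluation mentioned above (the real Rouge toolkit computes a family of n-gram and longest-common-subsequence scores):

```python
# Toy ROUGE-1 recall: unigram overlap divided by reference length.
from collections import Counter

def rouge1_recall(candidate, reference):
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / max(1, sum(ref.values()))

print(rouge1_recall("the model works well".split(),
                    "the model works very well".split()))  # 0.8
```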

Key words: Natural Language Processing (NLP), abstractive automatic text summarization, sequence-to-sequence (seq2seq), autoencoder, word vector, Recurrent Neural Network (RNN)
