Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (12): 3476-3481. DOI: 10.11772/j.issn.1001-9081.2019050800

• Artificial Intelligence •

Short text automatic summarization method based on dual encoder

DING Jianli, LI Yang, WANG Jialiang

  1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  • Received: 2019-05-13 Revised: 2019-07-16 Published: 2019-12-10 Online: 2019-07-23
  • Contact: LI Yang
  • About the authors: DING Jianli (born 1963), male, from Luoyang, Henan; professor, Ph.D., CCF member; main research interests: intelligent information processing for civil aviation, aviation Internet of Things. LI Yang (born 1995), male, from Jining, Shandong; master's student; main research interests: natural language processing, machine learning, deep learning. WANG Jialiang (born 1983), male, from Liaoyang, Liaoning; lecturer, Ph.D.; research interests: civil aviation information systems, embedded computing, pervasive computing.
  • Supported by:
    This work is partially supported by the Civil Aviation Science and Technology Major Special Fund (MHRD20150107, MHRD20160109), the Fundamental Research Funds for the Central Universities (3122018C025), and the Research Startup Fund of Civil Aviation University of China (2014QD13X).

Abstract: To address the insufficient use of semantic information and the limited summarization precision of current abstractive text summarization methods, a text summarization method based on a dual encoder is proposed. First, the dual encoder supplies richer semantic information to the Sequence to Sequence (Seq2Seq) architecture, and both the attention mechanism that incorporates dual-channel semantics and the decoder with an empirical distribution are optimized. Then, position embeddings and word embeddings are fused in the word embedding generation step, and Term Frequency-Inverse Document Frequency (TF-IDF), Part Of Speech (POS), and key score (Soc) features are added, optimizing the word embedding dimensions. By optimizing both the traditional Seq2Seq model and the word feature representation, the proposed method strengthens the model's semantic understanding and improves summary quality. Experimental results show that, under the ROUGE evaluation system, the proposed method improves on the traditional Recurrent Neural Network with attention (RNN+atten) and the multi-layer bidirectional Recurrent Neural Network with attention (Bi-MulRNN+atten) by 10 to 13 percentage points. Its summaries reflect more accurate semantic understanding and better generation quality, which gives the method better application prospects.

Key words: abstractive text summarization, Sequence to Sequence (Seq2Seq), dual encoder, empirical distribution, word feature representation
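
The word feature representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how position and word embeddings are fused, how POS tags are encoded, or how the key score (Soc) is computed. The sketch therefore assumes element-wise summation for the fusion, a one-hot POS encoding, and a key score supplied by the caller. The function names `tf_idf` and `augment_embedding` and all sample values are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: tf-idf} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            tok: (cnt / len(doc)) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return scores


def augment_embedding(word_vec, pos_vec, tfidf, pos_tag_id, n_pos_tags, key_score):
    """Fuse word and position embeddings, then append the extra features."""
    # Assumption: fusion by element-wise sum (the abstract does not say how).
    fused = [w + p for w, p in zip(word_vec, pos_vec)]
    # Assumption: the POS tag is encoded as a one-hot vector.
    pos_one_hot = [1.0 if i == pos_tag_id else 0.0 for i in range(n_pos_tags)]
    # Assumption: key_score is a precomputed scalar standing in for Soc.
    return fused + [tfidf] + pos_one_hot + [key_score]


# Hypothetical toy corpus and embedding values, for illustration only.
docs = [["flight", "delay", "notice"], ["flight", "schedule"]]
scores = tf_idf(docs)
vec = augment_embedding(
    word_vec=[0.1, 0.2],
    pos_vec=[0.01, 0.02],
    tfidf=scores[0]["delay"],
    pos_tag_id=1,
    n_pos_tags=3,
    key_score=0.5,
)
```

In this sketch each token vector grows from the embedding size to the embedding size plus one TF-IDF value, one slot per POS tag, and one key score. A token that appears in every document, such as "flight" here, receives a TF-IDF of zero.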
