Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (5): 1272-1277. DOI: 10.11772/j.issn.1001-9081.2017112652

• Artificial Intelligence •

Long text classification combined with attention mechanism

LU Ling, YANG Wu, WANG Yuanlun, LEI Zijian, LI Ying

  1. College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400050, China
  • Received: 2017-11-07 Revised: 2017-12-04 Online: 2018-05-10 Published: 2018-05-24
  • Corresponding author: YANG Wu
  • About the authors: LU Ling (born 1975), female, from Chongqing, associate professor, M.S., CCF member; main research interests: machine learning, natural language processing. YANG Wu (born 1965), male, from Chongqing, professor, M.S., CCF member; main research interests: machine learning, information retrieval. WANG Yuanlun (born 1996), male, from Chongqing; main research interests: machine learning, information retrieval. LEI Zijian (born 1997), male, from Chongqing; main research interests: machine learning, information retrieval. LI Ying (born 1997), female, from Chongqing; main research interests: machine learning, information retrieval.
  • Supported by:
    This work is partially supported by the West Project of the National Social Science Foundation of China (17XXW005) and the Scientific and Technological Research Project of Chongqing Municipal Education Commission (KJ1500903).

Abstract: News texts often contain tens to hundreds of sentences. Their length and the large amount of topic-irrelevant information they carry degrade classification performance. To address this problem, a long text classification method combined with an attention mechanism was proposed. First, each sentence of a text was represented as a paragraph vector, and a neural network attention model relating paragraph vectors to text categories was constructed to compute sentence attention. The mean square error of a sentence's attention vector was taken as its contribution to the categories and used to filter sentences. A classifier based on a Convolutional Neural Network (CNN) was then constructed, with the filtered text and its attention matrix each taken as network input. Max pooling was used for feature filtering, and random dropout was used to reduce over-fitting. Experiments were conducted on the data set of the Chinese news text classification task, one of the shared tasks of Natural Language Processing and Chinese Computing (NLP&CC) 2014. When the filtered text was 82.74% of the length of the text before filtering, the classification accuracy over 19 news categories was 80.39%, which is 2.1% higher than the accuracy on the text before filtering. The experimental results show that the proposed sentence filtering method and classification model, combined with the attention mechanism, can improve the accuracy of long text classification while filtering information at the sentence level.

Key words: attention mechanism, Convolutional Neural Network (CNN), Paragraph Vector (PV), information filtering, text classification
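The sentence filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each sentence already has an attention vector with one weight per category, the "mean square error" of that vector is read as its population variance (the mean squared deviation from its own mean), and the sentences with the largest values are kept. The names `sentence_contribution`, `filter_sentences`, and `keep_ratio` are hypothetical and do not come from the paper.

```python
from statistics import pvariance


def sentence_contribution(attention):
    """Score one sentence by the spread of its attention over categories.

    Assumption: the score is the population variance of the attention
    vector. A flat vector (equal attention to every category) scores 0.
    """
    return pvariance(attention)


def filter_sentences(sentences, attention_rows, keep_ratio=0.8):
    """Keep the highest-scoring sentences, preserving document order.

    keep_ratio is a hypothetical knob for this sketch; the paper's own
    filtering criterion may differ.
    """
    if len(sentences) != len(attention_rows):
        raise ValueError("need exactly one attention vector per sentence")
    scores = [sentence_contribution(row) for row in attention_rows]
    n_keep = max(1, round(len(sentences) * keep_ratio))
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore the original sentence order among the kept indices.
    kept = sorted(ranked[:n_keep])
    return [sentences[i] for i in kept], [attention_rows[i] for i in kept]


if __name__ == "__main__":
    # Toy data: 4 sentences, 3 categories (illustrative values only).
    sentences = ["s0", "s1", "s2", "s3"]
    attention = [
        [0.90, 0.05, 0.05],  # strongly points at one category
        [0.34, 0.33, 0.33],  # nearly flat, carries little category signal
        [0.10, 0.80, 0.10],
        [0.30, 0.40, 0.30],
    ]
    kept, kept_rows = filter_sentences(sentences, attention, keep_ratio=0.5)
    print(kept)
```

In this sketch the retained attention rows are returned alongside the retained sentences, since the abstract states that both the filtered text and its attention matrix are fed to the CNN classifier.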
