Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (12): 3498-3503. DOI: 10.11772/j.issn.1001-9081.2017.12.3498

• Artificial Intelligence •

Chinese short text classification method by combining semantic expansion and convolutional neural network

LU Ling, YANG Wu, YANG Youjun, CHEN Menghan

  1. College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400050, China
  • Received: 2017-06-16 Revised: 2017-08-29 Online: 2017-12-10 Published: 2017-12-18
  • Corresponding author: YANG Wu
  • About the authors: LU Ling (1975-), female, from Chongqing, associate professor, master's degree, CCF member; main research interests: machine learning, information retrieval. YANG Wu (1965-), male, from Chongqing, professor, master's degree, CCF member; main research interests: information retrieval, machine learning. YANG Youjun (1995-), male, from Chongqing, CCF member; main research interests: machine learning, natural language processing. CHEN Menghan (1998-), female, from Kaifeng, Henan, CCF member; main research interests: machine learning, information retrieval.
  • Supported by:
    This work is partially supported by the West Project of the National Social Science Foundation of China (17XXW005) and the Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ1500903).

Abstract: Chinese news titles usually contain from one to a few dozen words. Because they are short and their features are sparse, improving classification accuracy is difficult. To address this problem, a text semantic expansion method based on word embedding was proposed. Firstly, each news title was expanded into a triple of (title, subtitle, keywords): the subtitle was constructed from synonyms of the title words combined with part-of-speech filtering, and the keywords were extracted by semantic composition of the words within multi-scale sliding windows. Then, a Convolutional Neural Network (CNN) classification model was constructed for the expanded text, using max pooling for feature filtering and random dropout to prevent overfitting. Finally, the title and subtitle were concatenated into a double-word representation, and this representation and the multi-keyword set were fed into the model as separate inputs. Experiments were conducted on the news title classification dataset of the 2017 Natural Language Processing and Chinese Computing evaluation (NLP&CC2017). The experimental results show that the triple expansion combined with the corresponding CNN model achieves a classification accuracy of 79.42% on 18 categories of news titles, 9.5% higher than the CNN model without expansion, and that keyword expansion speeds up model convergence. These results verify the effectiveness of the triple expansion method and of the constructed CNN classification model.

Key words: news title classification, semantic expansion, Convolutional Neural Network (CNN), synonym, semantic composition
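
The abstract mentions max pooling for feature filtering and random dropout against overfitting. The following is a minimal pure-Python sketch of those two generic operations on a toy title, not the authors' model: the embeddings, the two filters, the ReLU activation, the dropout rate of 0.5, and all function names are illustrative assumptions that do not come from the paper.

```python
import random


def conv1d_feature_map(embeddings, kernel, bias=0.0):
    """Slide one filter over consecutive word vectors, applying ReLU per window."""
    width = len(kernel)  # number of consecutive words the filter spans
    feature_map = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        activation = bias + sum(
            w * x
            for kernel_row, vector in zip(kernel, window)
            for w, x in zip(kernel_row, vector)
        )
        feature_map.append(max(0.0, activation))
    return feature_map


def max_over_time(feature_map):
    """Max pooling: keep only the strongest response of a filter."""
    return max(feature_map)


def dropout(features, rate, rng, training=True):
    """Inverted dropout: zero each feature with probability `rate` while
    training and rescale the survivors; pass features through otherwise."""
    if not training or rate == 0.0:
        return list(features)
    keep = 1.0 - rate
    return [f / keep if rng.random() < keep else 0.0 for f in features]


# Toy input (made-up numbers): 4 words with 2-dimensional embeddings.
title = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]

# Two hypothetical filters, each spanning 2 consecutive words.
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
]

# One pooled feature per filter, then dropout as applied during training.
pooled = [max_over_time(conv1d_feature_map(title, k)) for k in filters]
dropped = dropout(pooled, rate=0.5, rng=random.Random(0))
```

Max pooling reduces each filter's variable-length feature map to a single value, so titles of different lengths yield feature vectors of the same size. At inference time, dropout would be called with `training=False`, which returns the features unchanged.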

CLC Number: