Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (5): 1278-1282. DOI: 10.11772/j.issn.1001-9081.2017112631

• Artificial Intelligence •

Chinese word segmentation and punctuation prediction based on improved multilayer BLSTM

LI Yakun, PAN Qing, Everett X. WANG

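  1. School of Information Engineering, Guangdong University of Technology, Guangzhou 510006
  • Received: 2017-11-06 Revised: 2017-12-01 Online: 2018-05-10 Published: 2018-05-24
  • Corresponding author: PAN Qing
  • About the authors: LI Yakun (1989-), male, born in Xinxiang, Henan, M.S. candidate; research interests: deep learning, natural language processing. PAN Qing (1975-), male, born in Yixing, Jiangsu, professor, Ph.D.; research interests: image processing, machine learning. Everett X. WANG (1961-), male, Chinese American, professor, Ph.D.; research interests: satellite navigation, optimal control and dynamic modeling of complex systems, machine learning.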

Joint Chinese word segmentation and punctuation prediction based on improved multilayer BLSTM network

LI Yakun, PAN Qing, WANG Feng   

  1. School of Information Engineering, Guangdong University of Technology, Guangzhou, Guangdong 510006, China
  • Received: 2017-11-06 Revised: 2017-12-01 Online: 2018-05-10 Published: 2018-05-24
  • Contact: PAN Qing

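Abstract: Mainstream approaches to sequence labeling are currently based on Recurrent Neural Networks (RNNs). Building on a study of RNNs and the sequence labeling problem, an improved multilayer Bidirectional Long Short-Term Memory (BLSTM) network is proposed in which the BLSTM at every layer performs an information fusion step, so that the output carries more contextual information. In addition, a joint-task method based on sequence labeling is presented that performs Chinese word segmentation and punctuation prediction in parallel. Experimental results on public datasets show that the improved multilayer BLSTM model achieves superior performance and raises the classification accuracy of both Chinese word segmentation and punctuation prediction; when both tasks are required, the joint-task method substantially reduces system complexity. The new model and the joint-task method built on it can also be applied to other sequence labeling tasks.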

Keywords: Chinese word segmentation, punctuation prediction, sequence labeling, bidirectional long short-term memory network

Abstract: Current mainstream approaches to sequence labeling are based on the Recurrent Neural Network (RNN). Based on a study of RNNs and sequence labeling, an improved multilayer Bidirectional Long Short-Term Memory (BLSTM) network for sequence labeling was proposed. Each BLSTM layer performs an information fusion operation, so the output contains more contextual information. In addition, a method for performing Chinese word segmentation and punctuation prediction jointly was proposed. Experiments on public datasets show that the improved multilayer BLSTM network model improves the classification accuracy of Chinese word segmentation and punctuation prediction. When both tasks need to be accomplished, the joint task method greatly reduces system complexity. The new model and the joint task method can also be applied to other sequence labeling problems.

Key words: Chinese word segmentation, punctuation prediction, sequence labeling, Bidirectional Long Short-Term Memory (BLSTM) network
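
The abstract describes casting word segmentation and punctuation prediction as a single sequence labeling task but does not spell out the label set. The sketch below shows one way such a joint label could be built, purely as an illustration: the BMES segmentation tags, the `SEG|PUNCT` tag format, the punctuation inventory and the function names `seg_tags`, `encode` and `decode` are all assumptions made here, not the authors' scheme.

```python
# Hypothetical joint labeling scheme (an assumption, not taken from the paper).
# Each character receives one tag "SEG|PUNCT":
#   SEG   is a BMES word-boundary tag (Begin, Middle, End, Single);
#   PUNCT is the punctuation mark that follows the character, or "O" for none.

PUNCT = {",", "。", "?", "!", "、", ";", ":"}  # assumed punctuation inventory
NO_PUNCT = "O"


def seg_tags(word):
    """Return the BMES tags for the characters of one word."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def encode(tokens):
    """Convert segmented, punctuated tokens into (chars, joint_tags).

    Simplifications: a punctuation mark is attached to the character before
    it, leading punctuation is dropped, and of several consecutive marks
    only the last one is kept.
    """
    chars, tags = [], []
    for tok in tokens:
        if tok in PUNCT:
            if tags:  # attach the mark to the previous character
                seg, _ = tags[-1].split("|")
                tags[-1] = f"{seg}|{tok}"
            continue
        for ch, seg in zip(tok, seg_tags(tok)):
            chars.append(ch)
            tags.append(f"{seg}|{NO_PUNCT}")
    return chars, tags


def decode(chars, tags):
    """Rebuild the token list (words and punctuation) from joint tags."""
    tokens, word = [], ""
    for ch, tag in zip(chars, tags):
        seg, punct = tag.split("|")
        word += ch
        if seg in ("E", "S"):  # the current word ends here
            tokens.append(word)
            word = ""
        if punct != NO_PUNCT:
            tokens.append(punct)
    if word:  # flush an unfinished word from a malformed tag sequence
        tokens.append(word)
    return tokens


tokens = ["今天", "天气", "很", "好", ",", "我们", "出去", "玩", "。"]
chars, tags = encode(tokens)
restored = decode(chars, tags)  # round trip back to the token list
```

With a label set of this kind, a single tagger predicts one tag per character and both the segmentation and the punctuation are read off the same output, which is one way the joint method could avoid running two separate models.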

CLC number: