Journal of Computer Applications, 2016, Vol. 36, Issue 10: 2794-2798. DOI: 10.11772/j.issn.1001-9081.2016.10.2794

Chinese word segmentation based on character representation learning

LIU Chunli, LI Xiaoge, LIU Rui, FAN Xian, DU Liping   

  1. College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an Shaanxi 710121, China
  • Received: 2016-03-24  Revised: 2016-06-21  Online: 2016-10-10  Published: 2016-10-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61373116), the Development Funds for the Key Subjects of the Universities in Shaanxi Province (112-1602), and the Graduate Innovative Foundation of Xi'an University of Posts & Telecommunications (ZL2013-30).

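Chinese word segmentation based on representation learning

LIU Chunli, LI Xiaoge, LIU Rui, FAN Xian, DU Liping

  1. School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an 710121, China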
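  • Corresponding author: LI Xiaoge, E-mail: lixg@xupt.edu.cn
  • About the authors: LIU Chunli (born 1990), female, from Linfen, Shanxi, M.S. candidate; research interests: natural language processing, text data mining. LI Xiaoge (born 1962), male, from Hefei, Anhui, professor, Ph.D.; research interests: natural language processing, machine learning, data mining. LIU Rui (born 1992), male, from Xianyang, Shaanxi, M.S. candidate; research interests: natural language processing, big data. FAN Xian (born 1991), female, from Xianyang, Shaanxi, M.S. candidate; research interests: sentiment analysis, big data. DU Liping (born 1987), female, from Baoji, Shaanxi, M.S. candidate; research interests: natural language processing, big data.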
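  • Funding:
    National Natural Science Foundation of China (61373116); Development Funds for the Key Subjects of the Universities in Shaanxi Province (112-1602); Graduate Innovative Foundation of Xi'an University of Posts & Telecommunications (ZL2013-30).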

Abstract: To improve the accuracy and the Out-Of-Vocabulary (OOV) recognition rate of Chinese word segmentation, a Chinese word segmentation system based on character representation learning was proposed. Firstly, each word in the text was mapped to a vector in a high-dimensional vector space using the Skip-gram model; then the K-means clustering algorithm was used to cluster the word vectors, and the clustering results were used as features for training a Conditional Random Fields (CRF) model. Finally, the CRF model was used for word segmentation and OOV recognition. The influences of the word vector dimension, the number of clusters and different clustering algorithms on word segmentation were analyzed. Experiments were conducted on the corpus of the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC2015). Experimental results show that the proposed system can effectively improve Chinese short text segmentation performance without using external knowledge: the F-value and the OOV recognition rate reach 95.67% and 94.78% respectively.
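
The pipeline in the abstract (vectors, then K-means clusters, then cluster IDs as CRF features) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the feature templates or the tagging scheme, so the B/M/E/S labels, the window of one character on each side, the treatment of units as single characters, and all function names here are assumptions. The 2-D toy vectors are made up and merely stand in for Skip-gram output; CRF training itself is omitted.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's K-means over {char: vector}; returns {char: cluster_id}."""
    rng = random.Random(seed)
    chars = sorted(vectors)
    centers = [list(vectors[c]) for c in rng.sample(chars, k)]
    assign = {}
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for c in chars:
            v = vectors[c]
            assign[c] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centers[j])),
            )
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [vectors[c] for c in chars if assign[c] == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def bmes_tags(words):
    """Convert a segmented sentence into per-character B/M/E/S labels (assumed scheme)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def char_features(sentence, clusters, i):
    """Feature dict for character i: the character plus cluster IDs in a +/-1 window."""
    feats = {"char": sentence[i]}
    for off in (-1, 0, 1):
        j = i + off
        if 0 <= j < len(sentence):
            # Characters without a vector get a placeholder cluster label.
            feats[f"cluster[{off}]"] = str(clusters.get(sentence[j], "UNK"))
    return feats

# Toy 2-D vectors standing in for Skip-gram embeddings (made-up values).
toy_vectors = {
    "我": (0.9, 0.1), "你": (1.0, 0.0), "他": (0.8, 0.2),
    "北": (0.1, 0.9), "京": (0.0, 1.0), "海": (0.2, 0.8),
}
clusters = kmeans(toy_vectors, k=2)

words = ["我", "爱", "北京"]          # one hand-segmented training sentence
sentence = "".join(words)
X = [char_features(sentence, clusters, i) for i in range(len(sentence))]
y = bmes_tags(words)                  # labels a sequence model would be trained on
```

Each element of `X` is a per-character feature dictionary and `y` holds the matching labels; pairs of this shape are what a sequence labeller such as a CRF would typically be trained on. Because the cluster ID is shared by characters with similar vectors, a character unseen in the labelled data can still contribute a known feature value, which is one plausible reading of why cluster features help OOV recognition.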

Key words: representation learning, word vector, clustering, Conditional Random Field (CRF), Chinese word segmentation

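Abstract: To improve the accuracy of Chinese word segmentation and the recognition rate of out-of-vocabulary (OOV) words, a Chinese word segmentation system based on character representation learning is proposed. First, the Skip-gram model is used to map the words in the text to vectors in a high-dimensional vector space; next, the K-means clustering algorithm is used to cluster the word vectors, and the clustering results are used as features for training a Conditional Random Field (CRF) model; finally, word segmentation and OOV recognition are performed with this model. The effects of the word vector dimension, the number of clusters, and different clustering algorithms on segmentation are analyzed. Tests on the microblog evaluation corpus provided by the 4th Conference on Natural Language Processing and Chinese Computing (NLPCC2015) show that, without using external knowledge, the segmentation F-value and the OOV recognition rate reach 95.67% and 94.78% respectively, demonstrating that adding character cluster features to the CRF model can effectively improve segmentation performance on Chinese short text.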

Key words: representation learning, word vector, clustering, Conditional Random Field (CRF), Chinese word segmentation
