Chinese word segment based on character representation learning

Journal of Computer Applications, 2016, Vol. 36, Issue 10: 2794-2798. DOI: 10.11772/j.issn.1001-9081.2016.10.2794

LIU Chunli, LI Xiaoge, LIU Rui, FAN Xian, DU Liping

College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an Shaanxi 710121, China

Received: 2016-03-24 Revised: 2016-06-21 Online: 2016-10-10 Published: 2016-10-10
Corresponding author: LI Xiaoge, E-mail: lixg@xupt.edu.cn
About the authors: LIU Chunli (born 1990), female, from Linfen, Shanxi, M.S. candidate; research interests: natural language processing, text data mining. LI Xiaoge (born 1962), male, from Hefei, Anhui, professor, Ph.D.; research interests: natural language processing, machine learning, data mining. LIU Rui (born 1992), male, from Xianyang, Shaanxi, M.S. candidate; research interests: natural language processing, big data. FAN Xian (born 1991), female, from Xianyang, Shaanxi, M.S. candidate; research interests: sentiment analysis, big data. DU Liping (born 1987), female, from Baoji, Shaanxi, M.S. candidate; research interests: natural language processing, big data.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61373116), the Development Funds for the Key Subjects of the Universities in Shaanxi Province (112-1602), and the Graduate Innovative Foundation of Xi'an University of Posts & Telecommunications (ZL2013-30).

Abstract: To improve the accuracy and the Out-Of-Vocabulary (OOV) recognition rate of Chinese word segmentation, a Chinese word segmentation system based on character representation learning was proposed. Firstly, the Skip-gram model was used to map the words in the text to vectors in a high-dimensional vector space; then the K-means clustering algorithm was used to cluster the word vectors, and the clustering results were used as features for training a Conditional Random Field (CRF) model. Finally, the trained CRF model was used for word segmentation and OOV recognition. The influences of the word vector dimension, the number of clusters and different clustering algorithms on word segmentation were analyzed. Experiments were conducted on the micro-blog evaluation corpus provided by the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC2015). Experimental results show that, without using external knowledge, the F-value and the OOV recognition rate reach 95.67% and 94.78% respectively, which demonstrates that adding character clustering features to the CRF model can effectively improve segmentation performance on Chinese short texts.
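The pipeline described in the abstract (embed, cluster, use cluster IDs as CRF features) can be illustrated with a minimal sketch. This is not the authors' implementation. The 2-D vectors are made-up toy numbers standing in for Skip-gram output, the B/M/E/S tag set and the tab-separated column layout (in the style of CRF++ training files) are assumptions, and `kmeans`, `bmes_tags`, and `crf_rows` are hypothetical helper names.

```python
import math
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-means; returns one cluster id per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def bmes_tags(word):
    """Tag each character of a word as Begin/Middle/End, or Single (assumed tag set)."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]

def crf_rows(segmented_sentence, cluster_of):
    """Build 'char<TAB>cluster<TAB>tag' rows, one per character."""
    rows = []
    for word in segmented_sentence:
        for ch, tag in zip(word, bmes_tags(word)):
            rows.append(f"{ch}\tC{cluster_of.get(ch, -1)}\t{tag}")
    return rows

# Toy character vectors (assumption: real ones would come from a Skip-gram model).
char_vectors = {
    "我": [0.9, 0.1], "你": [1.0, 0.0], "爱": [0.8, 0.3],
    "北": [0.1, 0.9], "京": [0.0, 1.0],
}
chars = list(char_vectors)
ids = kmeans([char_vectors[c] for c in chars], k=2)
cluster_of = dict(zip(chars, ids))

# One gold-segmented training sentence turned into feature rows.
rows = crf_rows(["我", "爱", "北京"], cluster_of)
print("\n".join(rows))
```

The cluster column gives the CRF a coarse, discrete view of each character's embedding, so characters that are rare in the labeled data can share statistics with similar, more frequent ones.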

Key words: representation learning, word vector, clustering, Conditional Random Field (CRF), Chinese word segmentation
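The abstract reports an F-value and an OOV recognition rate without defining them. The sketch below shows how such scores are commonly computed for word segmentation, assuming span-based precision and recall and reading "OOV recognition rate" as recall over gold words missing from the training vocabulary. These definitions, the function names, and the toy data are assumptions, not details taken from the paper.

```python
def word_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def evaluate(gold_sents, pred_sents, vocabulary):
    """Return (precision, recall, F-value, OOV recall) over parallel sentences."""
    correct = gold_total = pred_total = 0
    oov_total = oov_correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_spans, pred_spans = word_spans(gold), word_spans(pred)
        # A predicted word is correct only if both boundaries match the gold word.
        correct += len(gold_spans & pred_spans)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
        start = 0
        for w in gold:
            span = (start, start + len(w))
            if w not in vocabulary:
                oov_total += 1
                oov_correct += span in pred_spans
            start += len(w)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    oov_recall = oov_correct / oov_total if oov_total else 0.0
    return precision, recall, f_value, oov_recall

# Toy example: the segmenter splits the unseen word "天安门" incorrectly.
gold = [["我", "爱", "北京", "天安门"]]
pred = [["我", "爱", "北京", "天安", "门"]]
train_vocab = {"我", "爱", "北京"}
p, r, f, oov = evaluate(gold, pred, train_vocab)
print(f"P={p:.3f} R={r:.3f} F={f:.3f} OOV recall={oov:.3f}")
```

In this toy case three of the five predicted words match the four gold words, so precision is 3/5 and recall is 3/4, and the single OOV word is missed.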

CLC number: TP391.1

LIU Chunli, LI Xiaoge, LIU Rui, FAN Xian, DU Liping. Chinese word segment based on character representation learning[J]. Journal of Computer Applications, 2016, 36(10): 2794-2798.

