Journal of Computer Applications, 2017, Vol. 37, Issue (4): 1044-1050. DOI: 10.11772/j.issn.1001-9081.2017.04.1044


New words detection method for microblog text based on integrating of rules and statistics

ZHOU Shuangshuang, XU Jin'an, CHEN Yufeng, ZHANG Yujie   

  1. College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Received: 2016-09-25 Revised: 2016-10-10 Online: 2017-04-10 Published: 2017-04-19
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61370130, 61473294), the Fundamental Research Funds for the Central Universities (2014RC040), and the International Science and Technology Cooperation Program of China (2014DFA11350).


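  • Corresponding author: XU Jin'an (徐金安)
  • About the authors: ZHOU Shuangshuang (周霜霜), born in 1991, female, from Huludao, Liaoning, M.S. candidate; research interests: natural language processing, information extraction. XU Jin'an (徐金安), born in 1970, male, from Kaifeng, Henan, associate professor, Ph.D., CCF member; research interests: natural language processing, machine translation. CHEN Yufeng (陈钰枫), born in 1981, female, from Nanping, Fujian, associate professor, Ph.D.; research interests: natural language processing, artificial intelligence. ZHANG Yujie (张玉洁), born in 1961, female, from Anyang, Henan, professor, Ph.D.; research interests: natural language processing, machine translation.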

Abstract: New words in microblogs are formed under highly flexible and extremely complex rules. When the traditional C/NC-value method is applied to them, the boundaries of the identified new words are relatively inaccurate, and low-frequency new words are often not detected correctly. To address these problems, a new-word detection method was proposed that integrates heuristic rules, a modified C/NC-value method, and a Conditional Random Field (CRF) model. On the one hand, the heuristic rules were summarized manually by observing a large number of microblog documents: microblog new words were classified and generalized, and formation rules were designed in terms of Part Of Speech (POS), character types, and symbols. On the other hand, to improve boundary accuracy and the detection accuracy for low-frequency new words, the traditional C/NC-value method was modified by incorporating word frequency, branch entropy, mutual information, and other statistical features to reconstruct the NC-value objective function. Finally, a CRF model was trained and used to detect new words. Experimental results show that, compared with traditional methods, the proposed method effectively improves the F value of microblog new-word detection.

Key words: microblog new word, formation rule, statistical feature, C/NC-value method, Conditional Random Field (CRF) model
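
The statistical features named in the abstract (frequency, branch entropy, mutual information) can be illustrated with a minimal sketch. This is not the authors' modified objective function, which the abstract does not specify. It only shows the standard definitions on character n-grams. The function names (`ngram_counts`, `branch_entropy`, `min_pmi`) are hypothetical, and probabilities are estimated crudely as count divided by the total number of characters.

```python
import math
from collections import Counter


def ngram_counts(corpus, max_len):
    """Count every character n-gram of length 1..max_len in the corpus."""
    counts = Counter()
    for sent in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return counts


def _entropy(neighbors):
    """Shannon entropy (base 2) of a Counter of neighboring characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def branch_entropy(corpus, cand):
    """Return (left, right) branch entropy of the candidate string.

    High entropy on both sides suggests the candidate occurs in varied
    contexts, i.e. its boundaries are likely to be real word boundaries.
    """
    left, right = Counter(), Counter()
    for sent in corpus:
        start = sent.find(cand)
        while start != -1:
            if start > 0:
                left[sent[start - 1]] += 1
            end = start + len(cand)
            if end < len(sent):
                right[sent[end]] += 1
            start = sent.find(cand, start + 1)
    return _entropy(left), _entropy(right)


def min_pmi(counts, total_chars, cand):
    """Minimum pointwise mutual information over all binary splits of cand.

    Assumption: p(w) is approximated as counts[w] / total_chars, and counts
    must contain n-grams up to len(cand). A high value suggests the parts
    of the candidate co-occur more often than chance (internal cohesion).
    """
    if len(cand) < 2:
        raise ValueError("candidate must have at least two characters")
    if counts[cand] == 0:
        return float("-inf")
    p_cand = counts[cand] / total_chars
    best = float("inf")
    for k in range(1, len(cand)):
        p_left = counts[cand[:k]] / total_chars
        p_right = counts[cand[k:]] / total_chars
        best = min(best, math.log2(p_cand / (p_left * p_right)))
    return best
```

In a pipeline like the one the abstract describes, values of this kind would plausibly be supplied as features alongside the rule-based ones. How they are weighted in the reconstructed objective function is left open here.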

