New words detection method for microblog text based on integrating of rules and statistics

ZHOU Shuangshuang, XU Jin'an, CHEN Yufeng, ZHANG Yujie

College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China

Journal of Computer Applications, 2017, Vol. 37, Issue (4): 1044-1050. DOI: 10.11772/j.issn.1001-9081.2017.04.1044

Received: 2016-09-25 Revised: 2016-10-10 Online: 2017-04-19 Published: 2017-04-10
Corresponding author: XU Jin'an
About the authors: ZHOU Shuangshuang (born 1991), female, from Huludao, Liaoning, M.S. candidate; research interests: natural language processing, information extraction. XU Jin'an (born 1970), male, from Kaifeng, Henan, associate professor, Ph.D., CCF member; research interests: natural language processing, machine translation. CHEN Yufeng (born 1981), female, from Nanping, Fujian, associate professor, Ph.D.; research interests: natural language processing, artificial intelligence. ZHANG Yujie (born 1961), female, from Anyang, Henan, professor, Ph.D.; research interests: natural language processing, machine translation.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61370130, 61473294), the Fundamental Research Funds for the Central Universities (2014RC040), and the International Science and Technology Cooperation Program of the Ministry of Science and Technology of China (listed as K11F100010 in the Chinese record and as 2014DFA11350 in the English record).

Abstract

Abstract: Microblog new words follow loose and highly complex formation rules. The traditional C/NC-value method identifies the boundaries of such words with low accuracy and fails to detect new words that occur with low frequency. To address both problems, a microblog new word extraction method was proposed that combines hand-crafted heuristic rules, a modified C/NC-value algorithm, and a Conditional Random Field (CRF) model. On the one hand, the heuristic rules were obtained by classifying and summarizing microblog new words observed in a large number of microblog documents; they describe word formation in terms of Part Of Speech (POS), character types, and ideographic symbols. On the other hand, the modified C/NC-value method reconstructs the NC-value objective function by introducing statistics such as word frequency, branch entropy (adjacency entropy), and mutual information, and a CRF model is then trained and used to identify new words. Together these steps aim to improve both the accuracy of new word boundary identification and the detection precision for low-frequency new words. The experimental results show that, compared with traditional methods, the proposed method effectively improves the F value of microblog new word detection.

Key words: microblog new word, formation rule, statistical feature, C/NC-value method, Conditional Random Field (CRF) model
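
The statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the reconstructed NC-value objective function, so the code only computes the individual quantities (branch entropy, pointwise mutual information, and the standard C-value of Frantzi et al.) on a toy corpus. The function names, the character-level counting, and the corpus `TOY_CORPUS` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus of microblog-like snippets (not from the paper).
TOY_CORPUS = ["今天很给力", "给力啊", "真给力吧"]


def occurrences(corpus, s):
    """Yield (text, start) for every, possibly overlapping, occurrence of s."""
    for text in corpus:
        start = text.find(s)
        while start != -1:
            yield text, start
            start = text.find(s, start + 1)


def branch_entropy(corpus, candidate, side):
    """Entropy in bits of the characters adjacent to candidate.

    side is "left" or "right". High entropy on both sides suggests that
    the candidate has clear boundaries.
    """
    neighbors = Counter()
    for text, start in occurrences(corpus, candidate):
        idx = start - 1 if side == "left" else start + len(candidate)
        if 0 <= idx < len(text):  # skip occurrences at the text edge
            neighbors[text[idx]] += 1
    total = sum(neighbors.values())
    return sum(((n / total) * math.log2(total / n)
                for n in neighbors.values()), 0.0)


def pmi(corpus, left, right):
    """Pointwise mutual information (bits) between the two parts of left+right.

    Assumption: probabilities are approximated by counts divided by the
    total number of characters in the corpus.
    """
    total = sum(len(text) for text in corpus)
    f_xy = sum(1 for _ in occurrences(corpus, left + right))
    f_x = sum(1 for _ in occurrences(corpus, left))
    f_y = sum(1 for _ in occurrences(corpus, right))
    if f_xy == 0:
        return float("-inf")
    return math.log2((f_xy / total) / ((f_x / total) * (f_y / total)))


def c_value(length, freq, container_freqs):
    """Standard C-value of a candidate string.

    length: number of units in the candidate (assumed here to be characters)
    freq: frequency of the candidate
    container_freqs: frequencies of longer candidates that contain it
    """
    if not container_freqs:
        return math.log2(length) * freq
    nested_penalty = sum(container_freqs) / len(container_freqs)
    return math.log2(length) * (freq - nested_penalty)


if __name__ == "__main__":
    print(branch_entropy(TOY_CORPUS, "给力", "left"))
    print(branch_entropy(TOY_CORPUS, "给力", "right"))
    print(pmi(TOY_CORPUS, "给", "力"))
    print(c_value(2, 3, []))
```

In this toy corpus the candidate 给力 is preceded and followed by varying characters, while 给 alone is always followed by 力, so the entropy of the full candidate is higher than that of its prefix. How such scores are weighted, combined with the heuristic rules, and passed to the CRF model as features is specific to the paper and is not reproduced here.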

CLC Number: TP391.1

Cite this article: ZHOU Shuangshuang, XU Jin'an, CHEN Yufeng, ZHANG Yujie. New words detection method for microblog text based on integrating of rules and statistics[J]. Journal of Computer Applications, 2017, 37(4): 1044-1050.

