Twitter text normalization based on unsupervised learning algorithm

doi:10.11772/j.issn.1001-9081.2016.07.1887

Journal of Computer Applications, 2016, Vol. 36, Issue (7): 1887-1892.

DENG Jiayuan, JI Donghong, FEI Chaoqun, REN Yafeng

School of Computer, Wuhan University, Wuhan Hubei 430072, China

Received: 2016-01-25 Revised: 2016-03-14 Online: 2016-07-14 Published: 2016-07-10
Corresponding author: DENG Jiayuan
About the authors: DENG Jiayuan (1991-), male, from Ningde, Fujian, M.S. candidate; research interest: natural language processing. JI Donghong (1966-), male, from Zhumadian, Henan, professor, Ph.D., CCF member; research interests: natural language processing, data mining, machine learning. FEI Chaoqun (1992-), male, from Zhumadian, Henan, M.S. candidate; research interest: natural language processing. REN Yafeng (1986-), male, from Jiaozuo, Henan, Ph.D., CCF member; research interests: natural language processing, machine learning.
Supported by:
This work is partially supported by the State Key Program of National Natural Science Foundation of China (61133012), the National Philosophy Social Science Major Bidding Project of China (11&ZD189), and the National Natural Science Foundation of China (61173062).

Abstract

Abstract: Twitter messages contain a large number of nonstandard tokens that users create intentionally or unintentionally. Normalizing these tokens is a necessary preprocessing step for many natural language processing tasks. To address the poor performance of existing normalization systems, a novel unsupervised text normalization system is proposed. First, a constructed dictionary of standard words is used to decide whether a tweet needs to be normalized. Second, each nonstandard token is assigned to 1-to-1 or 1-to-N recovery according to its characteristics; for 1-to-N recovery, forward and backward search computes all possible multi-word combinations of the token. Third, for the nonstandard words that remain inside those combinations, normalization candidates are generated by combining a random walk on a bipartite graph with a spelling checker. Finally, a context-based n-gram language model selects the most suitable standard words. On a manually constructed dataset, the proposed approach obtains an F-score of 86.4%, which is 10 percentage points higher than that of the current best graph-based random walk algorithm.

Key words: normalization, unsupervised learning, bipartite graph, random walk, spelling checker
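The abstract does not spell out how the forward and backward search works, so the following Python sketch is only one plausible reading of the 1-to-N step. It assumes a small in-memory set of standard words, and the names `forward_splits`, `backward_splits`, `candidate_splits`, and the `min_len` threshold are hypothetical. The forward pass peels dictionary words off the front of a token, the backward pass peels them off the end, and any leftover piece is kept as a nonstandard word for later candidate generation.

```python
def forward_splits(token, vocab, min_len=2):
    """Scan left to right: peel off every dictionary prefix, then recurse on the rest."""
    results = []
    for end in range(min_len, len(token)):
        head, rest = token[:end], token[end:]
        if head in vocab:
            # The remainder may itself be nonstandard; keep it as one piece.
            results.append([head, rest])
            results.extend([head] + tail for tail in forward_splits(rest, vocab, min_len))
    return results


def backward_splits(token, vocab, min_len=2):
    """Scan right to left: peel off every dictionary suffix, then recurse on the rest."""
    results = []
    for start in range(len(token) - min_len, 0, -1):
        rest, tail = token[:start], token[start:]
        if tail in vocab:
            results.append([rest, tail])
            results.extend(head + [tail] for head in backward_splits(rest, vocab, min_len))
    return results


def candidate_splits(token, vocab, min_len=2):
    """Merge both passes and drop duplicate segmentations, preserving order."""
    merged = []
    for split in forward_splits(token, vocab, min_len) + backward_splits(token, vocab, min_len):
        if split not in merged:
            merged.append(split)
    return merged


if __name__ == "__main__":
    vocab = {"see", "you", "to", "go"}  # toy dictionary, an assumption for illustration
    print(candidate_splits("seeyou", vocab))
    print(candidate_splits("seeu", vocab))
```

In this sketch a token such as "seeu" yields a combination whose last piece is still nonstandard, which matches the abstract's remark that nonstandard words inside a multi-word combination go on to candidate generation.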

CLC Number:

TP391
TP18
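For the candidate-generation step named in the abstract, a random walk on a bipartite graph can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes one side of the graph holds words and the other holds the contexts they occur in, with edge weights given by co-occurrence counts, and the function names, the `rounds` parameter, and the toy data are all hypothetical. Each round moves probability mass from words to contexts and back, and dictionary words are ranked by the mass they accumulate.

```python
from collections import defaultdict


def build_bipartite(observations):
    """Turn (word, context) pairs into weighted edges in both directions."""
    word_to_ctx = defaultdict(lambda: defaultdict(int))
    ctx_to_word = defaultdict(lambda: defaultdict(int))
    for word, ctx in observations:
        word_to_ctx[word][ctx] += 1
        ctx_to_word[ctx][word] += 1
    return word_to_ctx, ctx_to_word


def step(dist, edges):
    """Push a probability distribution across one layer of weighted edges."""
    out = defaultdict(float)
    for node, prob in dist.items():
        neighbours = edges.get(node, {})
        total = sum(neighbours.values())
        for nbr, weight in neighbours.items():
            out[nbr] += prob * weight / total
    return out


def rank_candidates(noisy, word_to_ctx, ctx_to_word, vocab, rounds=2):
    """Walk word -> context -> word for a few rounds; score dictionary words by mass."""
    dist = {noisy: 1.0}
    scores = defaultdict(float)
    for _ in range(rounds):
        dist = step(step(dist, word_to_ctx), ctx_to_word)
        for word, prob in dist.items():
            if word in vocab:
                scores[word] += prob
    return sorted(scores.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    # Toy (word, context) pairs; "_" marks the slot the word fills.
    pairs = [
        ("2morrow", "see you _"), ("2morrow", "until _ night"),
        ("tomorrow", "see you _"), ("tomorrow", "until _ night"),
        ("later", "see you _"),
    ]
    w2c, c2w = build_bipartite(pairs)
    print(rank_candidates("2morrow", w2c, c2w, vocab={"tomorrow", "later"}))
```

Words that share more contexts with the noisy token collect more probability mass, so they rank higher. In the described system these candidates would be combined with spelling-checker suggestions and then scored by the language model.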

DENG Jiayuan, JI Donghong, FEI Chaoqun, REN Yafeng. Twitter text normalization based on unsupervised learning algorithm[J]. Journal of Computer Applications, 2016, 36(7): 1887-1892.

