Chinese word sense induction based on improved k-means algorithm

Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (05): 1332-1334.

ZHANG Yi-hao¹,², JIN Peng¹,², SUN Rui¹,²

1. Laboratory of Intelligent Information Processing and Application, Leshan Teachers' College, Leshan Sichuan 614004, China
2. School of Computer Science, Leshan Teachers' College, Leshan Sichuan 614004, China

Received: 2011-11-16 Revised: 2012-01-09 Online: 2012-05-01 Published: 2012-05-01
Contact: ZHANG Yi-hao
About the authors: ZHANG Yi-hao (born 1982), male, from Xinyang, Henan; lecturer, Ph.D. candidate; main research interest: natural language processing. JIN Peng (born 1977), male, from Kaifeng, Henan; associate professor, Ph.D.; main research interest: natural language processing. SUN Rui (born 1977), male, from Meishan, Sichuan; lecturer, M.S.; main research interest: natural language processing.
Supported by: National Natural Science Foundation of China (61003206); Scientific Research Project of the Sichuan Provincial Department of Education (10ZB025)

Abstract: Polysemy is pervasive in Chinese. Word sense induction groups together the uses of words that carry the same sense in different contexts, which is essentially a clustering problem, and unsupervised clustering methods are now widely used to study it. This paper proposes an improved k-means algorithm that modifies two steps, the selection of the initial cluster centers and the calculation of the cluster means, and thereby reduces, to some extent, the sensitivity of k-means to noise and outliers. For feature representation, the category codes of words in Tongyici Cilin are used to reduce the feature dimensionality. Experiments show that the improved k-means algorithm performs considerably better, reaching an F-Score of 75.8%.

Key words: word sense induction, k-means algorithm, clustering, Tongyici Cilin
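The abstract names the two modified steps and the feature reduction but does not say how they are carried out. The following is only a minimal sketch under stated assumptions, not the authors' method: initial centers are assumed to be picked from dense regions that lie far apart, cluster means are assumed to be trimmed means that ignore the farthest members, and `code_features`, `density_init`, `trimmed_mean`, `robust_kmeans` and the `trim` parameter are hypothetical names. The category codes in the usage note are placeholders, not real Tongyici Cilin entries.

```python
import math


def code_features(context_words, word_to_code, codes):
    """Count category codes among the context words (one dimension per code, not per word)."""
    counts = {c: 0 for c in codes}
    for w in context_words:
        c = word_to_code.get(w)
        if c in counts:
            counts[c] += 1
    return [counts[c] for c in codes]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def density_init(points, k):
    """Assumed heuristic: pick k centers that are dense and far from each other."""
    n = len(points)
    pair = [[dist(p, q) for q in points] for p in points]
    radius = sum(map(sum, pair)) / (n * n)  # average pairwise distance
    density = [sum(1 for d in row if d <= radius) for row in pair]
    chosen = [max(range(n), key=lambda i: density[i])]
    while len(chosen) < k:
        # Isolated points have low density, so they score poorly even if far away.
        best = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: density[i] * min(pair[i][c] for c in chosen),
        )
        chosen.append(best)
    return [list(points[i]) for i in chosen]


def trimmed_mean(members, center, trim):
    """Average the members after dropping the `trim` fraction farthest from `center`."""
    ranked = sorted(members, key=lambda p: dist(p, center))
    keep = max(1, len(ranked) - int(len(ranked) * trim))
    return mean(ranked[:keep])


def robust_kmeans(points, k, trim=0.2, max_iter=100):
    """k-means with the two assumed modifications; returns (labels, centers)."""
    centers = density_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = trimmed_mean(members, centers[c], trim)
    return labels, centers
```

As a usage illustration, `code_features(["loan", "deposit", "river"], {"loan": "FINANCE", "deposit": "FINANCE", "river": "NATURE"}, ["FINANCE", "NATURE"])` maps three context words onto two code dimensions. Calling `robust_kmeans` on two tight groups of points plus one distant point keeps that point from dragging a cluster mean toward it, which is the kind of robustness to isolated points the abstract describes.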

CLC Number: TP391

ZHANG Yi-hao, JIN Peng, SUN Rui. Chinese word sense induction based on improved k-means algorithm[J]. Journal of Computer Applications, 2012, 32(05): 1332-1334.

