Improved suffix tree clustering for Uyghur text

doi:10.3724/SP.J.1087.2012.01078

Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (04): 1078-1081. DOI: 10.3724/SP.J.1087.2012.01078

ZHAI Xian-min¹, TIAN Sheng-wei², YU Long³, FENG Guan-jun⁴

1. College of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China
2. College of Software, Xinjiang University, Urumqi, Xinjiang 830008, China
3. Network Center, Xinjiang University, Urumqi, Xinjiang 830046, China
4. College of Humanities, Xinjiang University, Urumqi, Xinjiang 830046, China

Received: 2011-09-28 Revised: 2011-11-16 Online: 2012-04-20 Published: 2012-04-01
About the authors: ZHAI Xian-min (1988-), male, from Tai'an, Shandong, master's student; main research interest: computer intelligence.
TIAN Sheng-wei (1973-), male, from Urumqi, Xinjiang, associate professor, Ph.D.; main research interests: computer intelligence, natural language processing.
YU Long (1974-), female, from Urumqi, Xinjiang, associate professor; main research interests: computer intelligence, computer networks.
FENG Guan-jun (1972-), male, from Urumqi, Xinjiang, associate professor, Ph.D.; main research interest: Uyghur language and literature.
Funding:
National Natural Science Foundation of China; National Social Science Foundation of China (10BTQ045, 11XTQ007); Doctoral Foundation of Xinjiang University

Abstract: When Suffix Tree Clustering (STC) selects base classes, the resulting base class phrases can be non-standard, repeated, or redundant. To address these problems, an improved suffix tree clustering algorithm is proposed. First, a phrase mutual information algorithm improves base class selection by keeping only base class phrases that obey Uyghur grammar rules. Second, a phrase reduction algorithm merges the repeated base class phrases among those selected. Third, building on the first two steps, a phrase redundancy-removal algorithm eliminates redundant base class phrases. Experimental results show that the improved algorithm achieves higher recall and precision than traditional STC, which indicates that it effectively improves clustering performance.

Key words: Uyghur, Suffix Tree (ST), Mutual Information (MI), reduction, redundancy
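The three steps named in the abstract (mutual-information selection, merging of repeated phrases, redundancy removal) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas or its Uyghur grammar rules, so everything here is an assumption made for illustration: phrases are scored by the lowest pointwise mutual information of their adjacent word pairs, "repeated" phrases are those that coincide after a placeholder `normalize` function, and a phrase counts as "redundant" when it is contained in a longer phrase covering exactly the same documents. All function names are hypothetical, and the toy data is English rather than Uyghur.

```python
import math
from collections import Counter

def count_ngrams(docs):
    """Count single words and adjacent word pairs over tokenized documents."""
    unigrams, bigrams = Counter(), Counter()
    for words in docs:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def min_adjacent_pmi(phrase, unigrams, bigrams):
    """Lowest pointwise mutual information among adjacent pairs (assumed score)."""
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = []
    for a, b in zip(phrase, phrase[1:]):
        if bigrams[(a, b)] == 0:
            return float("-inf")  # the pair never occurs in the corpus
        p_ab = bigrams[(a, b)] / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        scores.append(math.log2(p_ab / (p_a * p_b)))
    return min(scores, default=float("inf"))  # single words pass trivially

def select_by_mi(clusters, docs, threshold=0.0):
    """Step 1: keep base class phrases whose score reaches the threshold."""
    unigrams, bigrams = count_ngrams(docs)
    return {p: ids for p, ids in clusters.items()
            if min_adjacent_pmi(p, unigrams, bigrams) >= threshold}

def merge_repeated(clusters, normalize=lambda p: tuple(w.lower() for w in p)):
    """Step 2: merge phrases that coincide after normalization.

    Lowercasing is only a stand-in for the grammar-based rule in the paper.
    """
    merged = {}
    for phrase, ids in clusters.items():
        merged.setdefault(normalize(phrase), set()).update(ids)
    return merged

def is_subphrase(short, long):
    """True if `short` occurs contiguously inside the strictly longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1))

def remove_redundant(clusters):
    """Step 3: drop a phrase contained in a longer one with the same documents."""
    return {p: ids for p, ids in clusters.items()
            if not any(is_subphrase(p, q) and ids == clusters[q]
                       for q in clusters)}

# Toy usage: base clusters map a phrase (tuple of words) to a set of document ids.
docs = [["suffix", "tree", "clustering"],
        ["suffix", "tree", "index"],
        ["tree", "index", "suffix"]]
clusters = {("index",): {1, 2}, ("suffix", "tree"): {0, 1},
            ("tree", "index"): {1, 2}, ("index", "suffix"): {2}}
result = remove_redundant(merge_repeated(select_by_mi(clusters, docs, 1.5)))
print(result)
```

The order of the calls follows the abstract: selection first, then merging, then redundancy removal on the merged result. The threshold of 1.5 is arbitrary and only serves the toy data.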

ZHAI Xian-min, TIAN Sheng-wei, YU Long, FENG Guan-jun. Improved suffix tree clustering for Uyghur text[J]. Journal of Computer Applications, 2012, 32(04): 1078-1081.
