Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (04): 1078-1081. DOI: 10.3724/SP.J.1087.2012.01078

• Database Technology •

Improved suffix tree clustering for Uyghur text

ZHAI Xian-min1, TIAN Sheng-wei2, YU Long3, FENG Guan-jun4

  1. College of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830046, China
    2. College of Software, Xinjiang University, Urumqi Xinjiang 830008, China
    3. Network Center, Xinjiang University, Urumqi Xinjiang 830046, China
    4. College of Humanities, Xinjiang University, Urumqi Xinjiang 830046, China
  • Received: 2011-09-28 Revised: 2011-11-16 Online: 2012-04-20 Published: 2012-04-01
  • About the authors: ZHAI Xian-min (1988-), male, from Tai'an, Shandong, master's student; research interests: computer intelligence;
    TIAN Sheng-wei (1973-), male, from Urumqi, Xinjiang, associate professor, Ph.D.; research interests: computer intelligence, natural language processing;
    YU Long (1974-), female, from Urumqi, Xinjiang, associate professor; research interests: computer intelligence, computer networks;
    FENG Guan-jun (1972-), male, from Urumqi, Xinjiang, associate professor, Ph.D.; research interests: Uyghur language and literature.
  • Supported by:
    National Natural Science Foundation of China; National Social Science Foundation of China (10BTQ045, 11XTQ007); Doctoral Foundation of Xinjiang University

Abstract: When Suffix Tree Clustering (STC) selects base classes, the base class phrases can be non-standard, repeated, or redundant. To address these problems, an improved STC method is proposed. First, a phrase mutual information algorithm improves base class selection by choosing base class phrases that abide by Uyghur grammar. Second, a phrase reduction algorithm based on Uyghur grammar merges the repeated base class phrases. Third, building on the first two steps, a phrase redundancy algorithm based on Uyghur grammar removes the redundant base class phrases. Experimental results show that the method improves both recall and precision compared with traditional STC, which indicates that the improved algorithm enhances clustering performance effectively.

Key words: Uyghur, Suffix Tree (ST), Mutual Information (MI), reduction, redundancy
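
The abstract does not state the formula behind the phrase mutual information step. As a rough illustration only, the sketch below assumes the standard pointwise mutual information, MI(x, y) = log2(p(x, y) / (p(x) p(y))), computed over adjacent word pairs. The function names, the threshold value, and the placeholder tokens are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def bigram_pmi(documents):
    """Return {(w1, w2): PMI} for adjacent word pairs in tokenized documents.

    Assumption: standard pointwise mutual information with maximum-likelihood
    probability estimates; the paper's own scoring may differ.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in documents:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / total_bigrams
        p_x = unigrams[w1] / total_unigrams
        p_y = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def select_phrases(documents, threshold=1.0):
    """Keep word pairs whose PMI exceeds a (hypothetical) threshold."""
    return {pair for pair, score in bigram_pmi(documents).items()
            if score > threshold}


# Placeholder tokens stand in for segmented Uyghur words.
docs = [["a", "b", "c"], ["a", "b", "d"], ["c", "d", "a"]]
scores = bigram_pmi(docs)
candidates = select_phrases(docs, threshold=2.0)
```

In this toy input the pair `("a", "b")` occurs twice and scores highest, so it is the only candidate that survives a threshold of 2.0. A real pipeline would still need the merging and redundancy-removal steps that the abstract describes.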