Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (06): 1676-1678.

• Software Process Technology and Chinese Information Processing •

Kazak text classification based on SVM

Wang Hua1, Gulila Altenbek2, Wu Shouyong3

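  1. School of Information Science and Engineering, Xinjiang University
  2.
  3. School of Information Science and Engineering, Xinjiang University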
  • Received: 2009-12-11 Revised: 2010-04-02 Online: 2010-06-01 Published: 2010-06-01
  • Corresponding author: Wang Hua
  • Supported by:
    National Natural Science Foundation of China

Study on Kazak text categorization based on SVM

  • Received:2009-12-11 Revised:2010-04-02 Online:2010-06-01 Published:2010-06-01

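Abstract: This paper presents the ideas behind the Support Vector Machine (SVM) and k-Nearest Neighbor (kNN) classification algorithms, together with two feature extraction methods for Kazak. Experiments applying the SVM, kNN, and Bayes algorithms to Kazak text classification are compared. The results show that SVM classifies Kazak text better than kNN and Bayes. Because of the morphemic and inflectional characteristics of Kazak words, segmenting Kazak affixes lowers the precision and recall of text classification.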

Keywords: Kazak text classification, SVM, feature selection, kNN
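
The kNN idea named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the term-frequency vectors, the cosine similarity measure, the value of k, the function names, and the toy corpus (placeholder English tokens standing in for Kazak words) are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(train, tokens, k=3):
    """Label `tokens` by majority vote among its k most similar training docs.

    `train` is a list of (token_list, label) pairs.
    """
    query = Counter(tokens)
    scored = [(cosine(query, Counter(doc)), label) for doc, label in train]
    # Most similar documents first; the sort is stable for equal scores.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus; placeholder tokens stand in for Kazak words.
train_docs = [
    (["match", "team", "goal"], "sports"),
    (["team", "coach", "goal"], "sports"),
    (["market", "price", "bank"], "economy"),
    (["bank", "loan", "price"], "economy"),
]

predicted = knn_classify(train_docs, ["goal", "team", "price"], k=3)
```

Whether affixes are segmented before counting changes which tokens enter these vectors, which is one plausible route by which segmentation could affect the reported results.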

Abstract: This paper introduces the basic theory of the Support Vector Machine (SVM) and k-Nearest Neighbor (kNN) algorithms and two feature selection methods for the Kazak language. An empirical study compares the SVM, kNN, and Bayes algorithms on Kazak text categorization. The experimental results show that SVM categorizes Kazak text better than kNN and Bayes. Owing to the morphemic and inflectional characteristics of Kazak words, segmenting affixes from words lowers precision and recall.

Key words: Kazak text categorization, SVM, feature selection, kNN
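
The abstract reports its comparison in terms of precision and recall. The sketch below shows how these two measures are commonly computed for one category, assuming single-label classification. The function name and the label lists are invented for illustration and are not the paper's data.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Return (precision, recall) for one category.

    precision = correct assignments to the category / all assignments to it
    recall    = correct assignments to the category / all documents truly in it
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    # Guard against empty denominators (category never predicted or never present).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels, invented for illustration only.
y_true = ["sports", "sports", "economy", "economy", "sports"]
y_pred = ["sports", "economy", "economy", "sports", "sports"]

p, r = precision_recall(y_true, y_pred, "sports")
```

Running the same function once per category and per classifier gives the kind of side-by-side figures an SVM, kNN, and Bayes comparison relies on.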