Journal of Computer Applications ›› 2005, Vol. 25 ›› Issue (01): 11-13. DOI: 10.3724/SP.J.1087.2005.00011

• Artificial Intelligence •

Text classification based on N-gram language model

ZHOU Xin-dong, WANG Ting

  1. School of Computer Science, National University of Defense Technology
  • Online: 2005-01-01  Published: 2011-04-22
  • Supported by:

    National High-Tech R&D Program of China (863 Program) (2001AA114110)

Abstract: Text classification has become a research focus in natural language processing in recent years. After a review of traditional text classification models, a method using N-gram language models to classify Chinese text is presented. Instead of the traditional bag-of-words representation, the model treats a document as a random observation sequence of words. Based on this approach, a word-level bigram classifier was designed and implemented. Comparing the N-gram model classifier with traditional models (Vector Space Model and Naive Bayes), experimental results show that the N-gram model classifier achieves better accuracy and stability.

Key words: text classification, N-gram language model, parameter smoothing
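The classifier the abstract describes can be sketched as follows: one bigram language model is trained per class, and a document is assigned to the class whose model gives its word sequence the highest likelihood. This is a minimal illustration, not the authors' implementation: the paper classifies Chinese word sequences and does not specify its smoothing scheme, so whitespace tokenization and add-one (Laplace) smoothing are used here as stand-in assumptions.

```python
import math
from collections import defaultdict

class BigramClassifier:
    """Per-class bigram language model text classifier (sketch).

    Each class gets its own bigram model; a document is labeled with the
    class under which its word sequence is most probable. Add-one
    smoothing is assumed here in place of whatever parameter-smoothing
    method the paper actually used.
    """

    def __init__(self):
        # class -> (w1, w2) -> count, and class -> w1 -> count for the denominator
        self.bigram = defaultdict(lambda: defaultdict(int))
        self.unigram = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, label, tokens):
        tokens = ["<s>"] + tokens  # sentence-start marker
        for w1, w2 in zip(tokens, tokens[1:]):
            self.bigram[label][(w1, w2)] += 1
            self.unigram[label][w1] += 1
            self.vocab.update((w1, w2))

    def log_prob(self, label, tokens):
        # Sum of log P(w2 | w1) with add-one smoothing over the vocabulary
        tokens = ["<s>"] + tokens
        v = len(self.vocab)
        total = 0.0
        for w1, w2 in zip(tokens, tokens[1:]):
            num = self.bigram[label][(w1, w2)] + 1
            den = self.unigram[label][w1] + v
            total += math.log(num / den)
        return total

    def classify(self, tokens):
        # Pick the class whose bigram model assigns the highest log-likelihood
        return max(self.bigram, key=lambda c: self.log_prob(c, tokens))

clf = BigramClassifier()
clf.train("sports", "the team won the game".split())
clf.train("finance", "the bank raised the rate".split())
print(clf.classify("the team lost the game".split()))  # prints: sports
```

Smoothing is essential here: without it, any unseen bigram would assign a document zero probability under every class, which is why the paper lists parameter smoothing as a key component.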
