Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (9): 2348-2350.

• Database and Knowledge Engineering •

Text classification algorithm based on hidden Markov model

YANG Jian1, WANG Haihang2

  1. Dali University
  2. School of Electronics and Information Engineering, Tongji University
  • Received: 2010-03-08 Revised: 2010-04-27 Online: 2010-09-03 Published: 2010-09-01
  • Corresponding author: Jian Yang
  • Supported by:
    Science and Technology Support Program of the Science and Technology Commission of Shanghai Municipality

Abstract: Several mature automatic text classification algorithms have emerged in recent years, but they rely mainly on probabilistic and statistical models and do not connect to the syntax and semantics of the text itself. This paper proposes an automatic text classification algorithm based on the hidden Markov model (HMM). A set of feature words is first constructed to represent each document category, and the sequence of feature words for a category serves as the observation sequence of that category's HMM classifier. The state transition sequence of each HMM implicitly represents how the content of documents in that category forms and evolves. At classification time, the label of the HMM classifier with the highest generation probability for the test document is taken as the classification result. The resulting classifier model reflects, to some extent, the syntactic and semantic features of the different document categories, supports multi-category automatic text classification, and achieves relatively high classification efficiency.

Key words: text classification, hidden Markov model (HMM), information gain, χ² test, term frequency-inverse document frequency (TF-IDF)
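
The decision rule described in the abstract (one HMM per category, with the test document assigned to the category whose HMM gives the highest generation probability) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the model topology, the training procedure, or the feature selection step, so the two-state models, the four-word vocabulary `VOCAB`, and every probability in `MODELS` below are hypothetical values chosen only for illustration. The generation probability is computed with the standard scaled forward algorithm.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: return log P(obs | HMM).

    obs   : list of observation indices (feature-word ids)
    start : start[s]    = P(first state is s)
    trans : trans[r][s] = P(next state is s | current state is r)
    emit  : emit[s][o]  = P(observing o | state s)
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transitions, then emit obs[t].
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        # Rescale to avoid underflow and accumulate the log of the scale factor.
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob


def classify(words, vocab, models):
    """Return (best_label, scores) for a tokenised document."""
    # Keep only feature words; words outside the feature set are ignored.
    obs = [vocab[w] for w in words if w in vocab]
    if not obs:
        raise ValueError("document contains no feature words")
    scores = {
        label: log_likelihood(obs, *params) for label, params in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical feature-word vocabulary and per-category HMM parameters
# (start, trans, emit). None of these values come from the paper.
VOCAB = {"goal": 0, "match": 1, "stock": 2, "market": 3}
MODELS = {
    "sports": (
        [0.6, 0.4],
        [[0.7, 0.3], [0.4, 0.6]],
        [[0.5, 0.3, 0.1, 0.1], [0.3, 0.5, 0.1, 0.1]],
    ),
    "finance": (
        [0.5, 0.5],
        [[0.6, 0.4], [0.5, 0.5]],
        [[0.1, 0.1, 0.5, 0.3], [0.1, 0.1, 0.3, 0.5]],
    ),
}

if __name__ == "__main__":
    label, scores = classify(["goal", "match", "goal"], VOCAB, MODELS)
    print(label, scores)
```

In practice the per-category parameters would be estimated from training documents rather than written by hand, and the vocabulary would come from a feature selection step such as those named in the key words. The sketch only shows how the maximum-likelihood comparison across category models works once those pieces exist.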

CLC number: