Journal of Computer Applications, 2015, Vol. 35, Issue (8): 2210-2214. DOI: 10.11772/j.issn.1001-9081.2015.08.2210


W-POS language model and its selecting and matching algorithms

QIU Yunfei1, LIU Shixing1, WEI Haichao1, SHAO Liangshan2   

  1. School of Software, Liaoning Technical University, Huludao Liaoning 125105, China;
  2. System Engineering Institute, Liaoning Technical University, Huludao Liaoning 125105, China
  • Received:2015-03-16 Revised:2015-04-29 Online:2015-08-10 Published:2015-08-14


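  • Corresponding author: LIU Shixing (1990-), male, born in Dandong, Liaoning, M.S. candidate; research interests: data mining, feature selection.
  • About the authors: QIU Yunfei (1976-), male, born in Fuxin, Liaoning, professor, Ph.D., CCF member; research interests: data mining, sentiment analysis. WEI Haichao (1993-), male, born in Zhangjiakou, Hebei; research interests: data mining. SHAO Liangshan (1961-), born in Lingyuan, Liaoning, professor, Ph.D.; research interests: data mining, sentiment analysis.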



The n-grams language model uses text features composed of several consecutive words to train a classifier. However, it contains many redundant words, and a large amount of sparse data is generated when n-grams are matched against or used to quantify the test data, which seriously degrades classification precision and limits the model's application. Therefore, an improved language model named W-POS (Word-Parts of Speech) was proposed based on the n-grams language model. After word segmentation, parts of speech were used to replace the rarely occurring and redundant words, so that the W-POS language model was composed of both words and parts of speech. The selection rules, selecting algorithm and matching algorithm of the W-POS language model were also put forward. The experimental results on the Fudan University Chinese Corpus and 20Newsgroups show that the W-POS language model not only inherits the advantages of n-grams, including reducing the number of features and carrying part of the semantics, but also overcomes the shortcomings of producing large amounts of sparse data and containing redundant words. The experiments also verify the effectiveness and feasibility of the selecting and matching algorithms.
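
The replacement idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' selecting or matching algorithm, which the abstract does not specify. It assumes the input is already segmented and POS-tagged. The frequency threshold `min_count`, the set `redundant_pos`, the function names `to_wpos` and `ngrams`, and the toy data are all hypothetical choices made for illustration. Replacing a word when it is rare or carries a listed tag is likewise only one possible reading of "rarely occurring and redundant words".

```python
from collections import Counter


def to_wpos(tagged_docs, min_count=2, redundant_pos=frozenset()):
    """Replace rare or 'redundant' words with their POS tag.

    tagged_docs: list of documents, each a list of (word, pos) pairs.
    min_count and redundant_pos are hypothetical knobs, not the paper's rules.
    """
    # Corpus-wide word frequencies decide which words count as rare.
    counts = Counter(word for doc in tagged_docs for word, _ in doc)
    return [
        [pos if counts[word] < min_count or pos in redundant_pos else word
         for word, pos in doc]
        for doc in tagged_docs
    ]


def ngrams(tokens, n=2):
    """Return the contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy, hand-tagged corpus (illustrative only).
docs = [
    [("the", "DT"), ("model", "NN"), ("reduces", "VBZ"),
     ("sparse", "JJ"), ("data", "NN")],
    [("the", "DT"), ("model", "NN"), ("replaces", "VBZ"),
     ("rare", "JJ"), ("words", "NNS")],
]

wpos_docs = to_wpos(docs, min_count=2, redundant_pos=frozenset({"DT"}))
bigrams = ngrams(wpos_docs[0], n=2)
print(wpos_docs[0])
print(bigrams)
```

In this toy corpus only "model" survives as a word. Every other token collapses to its tag, so mixed word/POS bigrams such as ("model", "VBZ") occur in both documents, whereas the corresponding word-only bigrams would each occur once.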

Key words: n-grams language model, parts of speech, redundancy, sparse data, feature selection



