W-POS language model and its selecting and matching algorithms

doi:10.11772/j.issn.1001-9081.2015.08.2210

Journal of Computer Applications, 2015, Vol. 35, Issue (8): 2210-2214. DOI: 10.11772/j.issn.1001-9081.2015.08.2210

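QIU Yunfei¹, LIU Shixing¹, WEI Haichao¹, SHAO Liangshan²

1. School of Software, Liaoning Technical University, Huludao Liaoning 125105, China;
2. System Engineering Institute, Liaoning Technical University, Huludao Liaoning 125105, China

Received: 2015-03-16 Revised: 2015-04-29 Online: 2015-08-14 Published: 2015-08-10
Corresponding author: LIU Shixing (born 1990), male, from Dandong, Liaoning, M.S. candidate; main research interests: data mining and feature selection; 494784913@qq.com
About the authors: QIU Yunfei (born 1976), male, from Fuxin, Liaoning, professor, Ph.D., CCF member; main research interests: data mining and sentiment analysis. WEI Haichao (born 1993), male, from Zhangjiakou, Hebei; main research interest: data mining. SHAO Liangshan (born 1961), from Lingyuan, Liaoning, professor, Ph.D.; main research interests: data mining and sentiment analysis.
Supported by:
National Natural Science Foundation of China (70971059); Liaoning Province Innovation Team Project (2009T045); Outstanding Young Scholars Growth Plan of Liaoning Provincial Higher Education Institutions (LJQ2012027).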

Abstract:

The n-grams language model uses combinations of several words as text features for training a text classifier. However, n-grams contain redundant words, and a large amount of sparse data is produced when they are matched against the data and quantified, which seriously degrades classification accuracy and limits the model's range of application. To address this, an improved language model named W-POS (Word-Parts of Speech) is proposed on the basis of the n-grams language model. After word segmentation, words that occur with low probability and redundant words are replaced by their parts of speech, yielding a W-POS language model composed of irregular arrangements of words and parts of speech. Selection rules, a selecting algorithm, and an algorithm for matching against the test set are also proposed for this model. Experimental results on the Fudan University Chinese corpus and the English corpus 20Newsgroups show that the W-POS language model inherits the advantages of n-grams (fewer features, partial semantic information, and improved precision) while overcoming their drawbacks of producing large amounts of sparse data and containing redundant words. The experiments also verify the effectiveness and feasibility of the selecting and matching algorithms.

Key words: n-grams language model, parts of speech, redundancy, sparse data, feature selection
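The substitution idea in the abstract can be illustrated with a minimal sketch. It is not the authors' selecting or matching algorithm, whose rules are not given on this page. The function names (`to_wpos`, `ngrams`, `matches`), the `min_count` frequency threshold, the `redundant` word set, and the rule that a part-of-speech slot matches any word carrying that tag are all assumptions made for illustration.

```python
from collections import Counter


def to_wpos(tagged_docs, min_count=2, redundant=frozenset()):
    """Replace rare or redundant words with their part-of-speech tag.

    tagged_docs: list of documents, each a list of (word, pos) pairs.
    min_count and redundant are illustrative assumptions, not values
    taken from the paper.
    """
    counts = Counter(word for doc in tagged_docs for word, _ in doc)
    return [
        [pos if counts[word] < min_count or word in redundant else word
         for word, pos in doc]
        for doc in tagged_docs
    ]


def ngrams(tokens, n=2):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def matches(feature, tagged_window):
    """Assumed matching rule: each slot equals the word or the word's tag."""
    return len(feature) == len(tagged_window) and all(
        slot == word or slot == pos
        for slot, (word, pos) in zip(feature, tagged_window)
    )


# Toy training data: already segmented and tagged (word, pos) pairs.
docs = [
    [("stock", "NN"), ("prices", "NNS"), ("rose", "VBD")],
    [("stock", "NN"), ("prices", "NNS"), ("fell", "VBD")],
]
wpos_docs = to_wpos(docs)
features = ngrams(wpos_docs[0], n=3)

# A test window containing a verb never seen in training.
unseen = [("stock", "NN"), ("prices", "NNS"), ("climbed", "VBD")]
print(features, matches(features[0], unseen))
```

In this toy data, "rose" and "fell" each occur once, so both collapse to the tag `VBD`. The two documents then share a single feature, and a test window containing an unseen verb still matches it, which is the kind of sparsity reduction the abstract describes.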

CLC Number:

TP18
TP301.6

QIU Yunfei, LIU Shixing, WEI Haichao, SHAO Liangshan. W-POS language model and its selecting and matching algorithms[J]. Journal of Computer Applications, 2015, 35(8): 2210-2214.
