Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (10): 2869-2873. DOI: 10.11772/j.issn.1001-9081.2014.10.2869

• Artificial Intelligence •

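Microblog emotion classification based on feature extraction from part-of-speech tagging sequences

LU Weisheng, GUO Gongde, CHEN Lifei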

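  1. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007
  • Received: 2014-04-28 Revised: 2014-06-12 Online: 2014-10-01 Published: 2014-10-30
  • Corresponding author: GUO Gongde
  • About the authors: LU Weisheng (born 1990), male, from Zhangzhou, Fujian, master's student; main research interests: data mining, artificial intelligence;
    GUO Gongde (born 1965), male, from Longyan, Fujian, professor, Ph.D.; main research interests: data mining, machine learning;
    CHEN Lifei (born 1972), male, from Changle, Fujian, associate professor, Ph.D.; main research interests: data mining, pattern recognition.
  • Funding:

    Project supported by the National Natural Science Foundation of China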

Emotion classification with feature extraction based on part of speech tagging sequences in micro blog

LU Weisheng,GUO Gongde,CHEN Lifei   

  1. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, Fujian 350007, China
  • Received: 2014-04-28 Revised: 2014-06-12 Online: 2014-10-01 Published: 2014-10-30
  • Contact: GUO Gongde

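Abstract:

Traditional n-gram text feature extraction produces high-dimensional feature vectors, and high-dimensional data not only makes classification harder but also lengthens classification time. To address this problem, a feature extraction method based on part-of-speech (POS) tagging sequences is proposed. Exploiting the property that a POS sequence can represent a class of texts, the method uses groups of POS sequences as text features in order to reduce the feature dimension. In the experiments, POS sequence feature extraction improved classification accuracy by at least 9% over n-gram feature extraction and reduced the feature dimension by 4816. The experimental results show that the method is applicable to microblog emotion classification.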

Abstract:

Traditional n-gram feature extraction tends to produce high-dimensional feature vectors. High-dimensional data not only makes classification more difficult but also increases classification time. To address this problem, this paper presents a feature extraction method based on Part-of-Speech (POS) tagging sequences. Because a POS sequence can represent a class of texts, the method uses POS sequences as text features in order to reduce the feature dimension. In the experiments, compared with n-gram feature extraction, feature extraction based on POS sequences improved classification accuracy by at least 9% and reduced the feature dimension by 4816. The experimental results show that the method is suitable for emotion classification in microblogs.
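The dimensionality argument in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how POS sequences are selected or grouped, so the code below simply compares the vocabulary size of word bigrams with that of POS-tag bigrams. The toy corpus, its tags, and the helper names `ngrams`, `build_vocabulary`, and `vectorize` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, hand-tagged toy corpus: each post is a list of (word, POS tag) pairs.
# The sentences and tags are illustrative assumptions, not data from the paper.
POSTS = [
    [("I", "PRON"), ("really", "ADV"), ("love", "VERB"), ("this", "DET"), ("phone", "NOUN")],
    [("She", "PRON"), ("truly", "ADV"), ("hates", "VERB"), ("that", "DET"), ("movie", "NOUN")],
    [("We", "PRON"), ("really", "ADV"), ("enjoy", "VERB"), ("this", "DET"), ("song", "NOUN")],
]


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def build_vocabulary(sequences, n):
    """Map every distinct n-gram seen in the corpus to a feature index."""
    seen = sorted({gram for seq in sequences for gram in ngrams(seq, n)})
    return {gram: index for index, gram in enumerate(seen)}


def vectorize(sequence, vocabulary, n):
    """Build a count-based feature vector for one sequence over a fixed vocabulary."""
    counts = Counter(ngrams(sequence, n))
    vector = [0] * len(vocabulary)
    for gram, count in counts.items():
        if gram in vocabulary:
            vector[vocabulary[gram]] = count
    return vector


words = [[word for word, _ in post] for post in POSTS]
tags = [[tag for _, tag in post] for post in POSTS]

word_vocab = build_vocabulary(words, n=2)  # one feature per distinct word bigram
pos_vocab = build_vocabulary(tags, n=2)    # one feature per distinct POS bigram

print(len(word_vocab), len(pos_vocab))  # prints: 12 4
```

In this toy corpus the three posts share no word bigram, so the word-level vocabulary grows with every post, while all three map to the same four POS bigrams. That is the intuition behind using POS sequences as lower-dimensional features; whether the accuracy gain reported in the abstract carries over depends on the authors' actual sequence selection, which this sketch does not reproduce.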

CLC Number: