Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (6): 1613-1618. DOI: 10.11772/j.issn.1001-9081.2016.06.1613


Sentiment analysis research based on combination of naive Bayes and latent Dirichlet allocation

SU Ying1, ZHANG Yong2, HU Po2, TU Xinhui2   

  1. College of Information Science and Engineering, Wuchang Shouyi University, Wuhan Hubei 430064, China;
  2. School of Computer, Central China Normal University, Wuhan Hubei 430079, China
  • Received: 2015-11-30 Revised: 2016-02-23 Online: 2016-06-10 Published: 2016-06-08
  • Supported by:
    This work is partially supported by the Major Projects of the National Social Science Foundation of China (12&2D223), the National Natural Science Foundation of China (61402191, 61300144, 61572223), the Project of the State Language Commission (WT125-44), and the Self-Determined Research Funds of Central China Normal University (CCNU14A05014, CCNU14A05015).

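  • Corresponding author: ZHANG Yong
  • About the authors: SU Ying (born 1982), female, from Xinyang, Henan; lecturer, M.S.; research interests: text mining, machine learning. ZHANG Yong (born 1978), male, from Xiantao, Hubei; associate professor, Ph.D., CCF member; research interests: text mining, natural language processing. HU Po (born 1980), male, from Wuhan, Hubei; associate professor, Ph.D., CCF member; research interests: automatic summarization, machine learning. TU Xinhui (born 1979), male, from Yingcheng, Hubei; associate professor, Ph.D., CCF member; research interests: information retrieval, natural language processing, machine learning.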

Abstract: Manually labeled corpora are a critical resource for sentiment analysis, but annotating them is laborious. To avoid this annotation effort, an unsupervised hierarchical generative model for sentiment analysis is presented. The model, named NB-LDA, combines Naive Bayes (NB) with Latent Dirichlet Allocation (LDA). Given only a suitable sentiment lexicon, it analyzes the sentiment orientation of online reviews at the sentence level and the document level simultaneously, without requiring sentence-level or document-level annotations. Specifically, the model assumes that each sentence, rather than each word, has a latent sentiment label, and that this label generates the sentence's features independently of one another in the NB manner. Because of the NB assumption, the model can incorporate features from Natural Language Processing (NLP) techniques such as dependency parsing and syntactic parsing to improve the performance of unsupervised sentiment analysis. Experimental results on two sentiment corpora show that NB-LDA can automatically infer sentiment polarity at both the sentence level and the document level, and that its accuracy is significantly higher than that of other unsupervised methods. Moreover, although it is unsupervised, NB-LDA achieves performance comparable to that of some supervised and semi-supervised methods.
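
The generative assumption described in the abstract (one latent sentiment label per sentence, with features drawn independently given that label) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Dirichlet prior over a document-level sentiment mixture is an assumption borrowed from standard LDA, and the names `sample_dirichlet`, `generate_document`, `PHI`, and all probability values are hypothetical and chosen only for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_sentences, n_features, alpha, phi, rng):
    """Generate one document under the assumed sentence-level process.

    phi maps each sentiment label to a feature distribution
    (hypothetical values, not taken from the paper).
    """
    labels = list(phi)
    # Document-level sentiment mixture (assumed Dirichlet prior, as in LDA).
    theta = sample_dirichlet(alpha, rng)
    doc = []
    for _ in range(n_sentences):
        # One latent sentiment label per sentence, not per word.
        z = rng.choices(labels, weights=theta, k=1)[0]
        feats = list(phi[z])
        weights = [phi[z][f] for f in feats]
        # Naive Bayes step: features are drawn independently given the label.
        sentence = rng.choices(feats, weights=weights, k=n_features)
        doc.append((z, sentence))
    return theta, doc


# Hypothetical label-to-feature distributions for illustration only.
PHI = {
    "pos": {"good": 0.6, "great": 0.3, "bad": 0.1},
    "neg": {"bad": 0.6, "awful": 0.3, "good": 0.1},
}

rng = random.Random(0)
theta, doc = generate_document(3, 4, [1.0, 1.0], PHI, rng)
for z, sentence in doc:
    print(z, sentence)
```

In this sketch the features are plain words, but under the NB assumption they could equally be outputs of dependency or syntactic parsing, which is the flexibility the abstract points to. Inference (recovering the labels from unlabeled text) is not shown here.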

Key words: sentiment analysis, topic model, Latent Dirichlet Allocation (LDA), Naive Bayes (NB), opinion mining

