Microblog advertisement filtering method based on classification feature extension of latent Dirichlet allocation

doi:10.11772/j.issn.1001-9081.2016.08.2257

Journal of Computer Applications, 2016, Vol. 36, Issue (8): 2257-2261. DOI: 10.11772/j.issn.1001-9081.2016.08.2257

XING Jinbiao¹,², CUI Chaoyuan¹, SUN Bingyu¹, SONG Liangtu¹

1. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei Anhui 230031, China;
2. School of Information Science and Technology, University of Science and Technology of China, Hefei Anhui 230026, China

Received: 2016-01-15 Revised: 2016-03-11 Online: 2016-08-10 Published: 2016-08-10
Supported by:
This work is partially supported by the National Key Technology R&D Program (2014BAD10B08) and the Science and Technology Key Project of Anhui Province (1401032010).

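Corresponding author: XING Jinbiao
About the authors: XING Jinbiao (born 1990), male, from Fuyang, Anhui, M.S. candidate; research interests: recommender systems, natural language processing. CUI Chaoyuan (born 1972), male, from Hohhot, Inner Mongolia, associate research fellow, Ph.D.; research interests: virtualization, cloud computing. SUN Bingyu (born 1974), male, from Huaibei, Anhui, research fellow, Ph.D.; research interests: pattern recognition, intelligent decision-making. SONG Liangtu (born 1963), male, from Huoshan, Anhui, research fellow, Ph.D.; research interests: geographic information systems, information acquisition.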

Abstract

Abstract: Traditional microblog advertisement filtering methods neglect the effects of data sparseness, semantic information, and advertisement background characteristics. To address these issues, a new filtering method based on classification feature extension with Latent Dirichlet Allocation (LDA) was proposed. Firstly, microblogs were divided into normal microblogs and advertising microblogs, and a separate LDA topic model was built for each class to infer the topic distribution of each short text; the words in the topics served as the basis for feature extension. Secondly, background characteristics were extracted using text category information during extension, to reduce their impact on text classification. Finally, the extended feature vectors were used as the input to the classifier, and advertisements were filtered according to the Support Vector Machine (SVM) classification results. In comparison experiments with an existing method based only on short text classification, the precision of the proposed method increased by 4 percentage points on average. The results indicate that the proposed method can effectively extend text features and reduce the influence of background characteristics, making it better suited to filtering advertisements from large volumes of microblog data.

Key words: advertisement filtering, Latent Dirichlet Allocation (LDA), short text classification, Support Vector Machine (SVM), feature extension
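
The feature-extension step summarized in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy topic word lists, the hard-coded topic distribution (standing in for real LDA inference), the rule that words shared by both class-specific topic models count as "background" features, and the helper names `background_words`, `extend_features`, and `to_vector`. The sketch stops at the extended count vector, which in the described method would be passed to an SVM classifier.

```python
from collections import Counter

# Toy top-word lists for two class-specific topic models (hypothetical data,
# standing in for topics learned by LDA on normal and advertising microblogs).
normal_topics = [["weather", "today", "share"], ["movie", "friend", "share"]]
ad_topics = [["discount", "buy", "share"], ["shop", "free", "shipping"]]


def background_words(topics_a, topics_b):
    """Return words that occur in the top words of BOTH topic models.

    Treating shared words as background features is an assumption of this
    sketch, not a rule stated in the abstract.
    """
    vocab_a = {w for topic in topics_a for w in topic}
    vocab_b = {w for topic in topics_b for w in topic}
    return vocab_a & vocab_b


def extend_features(tokens, topic_dist, topics, background, n_words=3):
    """Append up to n_words top words of the most probable topic to a short text.

    Background words and words already present in the text are skipped.
    """
    best = max(range(len(topic_dist)), key=topic_dist.__getitem__)
    present = set(tokens)
    extra = [w for w in topics[best] if w not in background and w not in present]
    return list(tokens) + extra[:n_words]


def to_vector(tokens, vocabulary):
    """Map a token list to a term-count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


background = background_words(normal_topics, ad_topics)

# A short text and an assumed topic distribution under the advertising model.
tokens = ["buy", "now"]
topic_dist = [0.8, 0.2]  # placeholder for LDA inference output

extended = extend_features(tokens, topic_dist, ad_topics, background)
vector = to_vector(extended, ["buy", "discount", "now", "share"])
print(background, extended, vector)
```

In this toy run the short text gains the topic word "discount", while the shared word "share" is withheld as background. A real pipeline would replace the hard-coded lists and distribution with a trained LDA model and would typically weight the vector (for example with TF-IDF) before classification.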

CLC Number:

TP181

XING Jinbiao, CUI Chaoyuan, SUN Bingyu, SONG Liangtu. Microblog advertisement filtering method based on classification feature extension of latent Dirichlet allocation[J]. Journal of Computer Applications, 2016, 36(8): 2257-2261.
