Fake review detection model based on vertical ensemble Tri-training

doi:10.11772/j.issn.1001-9081.2019112046

Abstract

Fake reviews mislead users' preferences and cause them losses, while large-scale manual labeling of reviews is prohibitively expensive. To address these problems, a fake review detection model based on Vertical Ensemble Tri-Training (VETT) was proposed, which improves detection accuracy by reusing the classification models generated in previous iterations. The model extracts features by combining user behavior features with review text features. In the VETT algorithm, each iteration is divided into two parts: vertical ensemble within a group and horizontal ensemble between groups. The in-group ensemble combines the models that a classifier produced in earlier iterations into one original classifier. The inter-group ensemble then trains the three original classifiers through the traditional training process to obtain the second-generation classifiers for that iteration, which improves the accuracy of the assigned labels. Compared with Co-training, Tri-training, PU learning based on Area Under Curve optimization (PU-AUC) and Vertical Ensemble Co-Training (VECT), the VETT algorithm improved the F1 value by up to 6.5, 5.08, 4.27 and 4.23 percentage points respectively. The experimental results show that VETT achieves better classification performance than the compared algorithms.

Key words: fake review, vertical ensemble, Tri-training, iterative classifier, label accuracy
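The two-level iteration described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nearest-centroid base learner, majority voting as the ensemble rule, the stratified bootstrap initialization, and every name (`CentroidClassifier`, `vertical_predict`, `vett_sketch`, `vett_predict`) are assumptions made for illustration. The error-rate checks of standard Tri-training are omitted, and the feature vectors are plain numeric lists standing in for the combined text and behavior features.

```python
import random
from collections import Counter


class CentroidClassifier:
    """Tiny nearest-centroid learner; a stand-in for any base classifier."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, v in enumerate(xi):
                acc[j] += v
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=dist)


def majority(votes):
    return Counter(votes).most_common(1)[0][0]


def vertical_predict(history, x):
    """In-group (vertical) ensemble: vote over one classifier's past models."""
    return majority([m.predict_one(x) for m in history])


def vett_sketch(X_lab, y_lab, X_unl, rounds=3, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y_lab):
        by_class.setdefault(label, []).append(i)

    train_sets, histories = [], []
    for _ in range(3):
        # Stratified bootstrap (assumption): every model sees every class.
        idx = [rng.choice(members) for members in by_class.values() for _ in members]
        Xi, yi = [X_lab[i] for i in idx], [y_lab[i] for i in idx]
        train_sets.append((Xi, yi))
        histories.append([CentroidClassifier().fit(Xi, yi)])

    for _ in range(rounds):
        new_models = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            Xi, yi = list(train_sets[i][0]), list(train_sets[i][1])
            # Inter-group (horizontal) step: the other two vertical
            # ensembles pseudo-label the samples on which they agree.
            for x in X_unl:
                lj = vertical_predict(histories[j], x)
                lk = vertical_predict(histories[k], x)
                if lj == lk:
                    Xi.append(x)
                    yi.append(lj)
            new_models.append(CentroidClassifier().fit(Xi, yi))
        # Store the second-generation models for the next vertical ensemble.
        for i, model in enumerate(new_models):
            histories[i].append(model)
    return histories


def vett_predict(histories, x):
    """Final label: majority vote of the three vertical ensembles."""
    return majority([vertical_predict(h, x) for h in histories])


# Toy data (hypothetical): label 0 = genuine, label 1 = fake.
X_lab = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_lab = [0, 0, 0, 1, 1, 1]
X_unl = [[0.5, 0.5], [1, 1], [5.5, 5.5], [6, 6]]
histories = vett_sketch(X_lab, y_lab, X_unl)
```

Calling `vett_predict(histories, x)` on a new feature vector returns the label agreed on by most of the three vertical ensembles. Keeping each classifier's full model history is what distinguishes this sketch from plain Tri-training, where only the latest model of each classifier would vote.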

CLC number: TP393.0

YIN Chunyong, ZHU Yuhang. Fake review detection model based on vertical ensemble Tri-training[J]. Journal of Computer Applications, 2020, 40(8): 2194-2201.
