Journal of Computer Applications, 2020, Vol. 40, Issue (8): 2194-2201. DOI: 10.11772/j.issn.1001-9081.2019112046

• Artificial Intelligence •

Fake review detection model based on vertical ensemble Tri-training

YIN Chunyong, ZHU Yuhang

  1. School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing Jiangsu 210044, China
  • Received: 2019-12-02 Revised: 2020-04-18 Online: 2020-08-10 Published: 2020-06-29
  • Corresponding author: YIN Chunyong (1977-), male, born in Weifang, Shandong, professor, doctoral supervisor, Ph.D. Research interests: cyberspace security, big data mining, privacy protection, artificial intelligence, new computing. ycy@nuist.edu.cn
  • About the author: ZHU Yuhang (1994-), male, born in Yancheng, Jiangsu, master's student. Research interests: machine learning, data mining, privacy protection.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61772282).

Abstract: Fake reviews mislead users and harm their interests, and manually labeling reviews at scale is too costly. To address both problems, a fake review detection model based on Vertical Ensemble Tri-Training (VETT) was proposed, which improves detection accuracy by reusing the classification models generated in earlier iterations. The model extracts user behavior features in addition to review text features. In the VETT algorithm, each iteration is divided into two parts: vertical ensemble within a group and horizontal ensemble between groups. The in-group ensemble combines a classifier's models from previous iterations into an original classifier. The inter-group ensemble then trains the three original classifiers through the traditional process to obtain the second-generation classifiers for that iteration, thereby improving the accuracy of the assigned labels. Compared with Co-training, Tri-training, PU learning based on Area Under Curve optimization (PU-AUC) and Vertical Ensemble Co-training (VECT), the VETT algorithm raised the F1 value by up to 6.5, 5.08, 4.27 and 4.23 percentage points respectively. Experimental results show that the VETT algorithm has good classification performance.

Key words: fake review, vertical ensemble, Tri-training, iterative classifier, label accuracy

CLC number:
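
The two-part iteration described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nearest-centroid base learner, the majority-vote rule, the way each group's initial training subset is chosen, the toy two-dimensional features, and all names (`CentroidClassifier`, `vertical_vote`, `vett_sketch`, `predict`) are assumptions made for illustration, and the error-rate checks of the original Tri-training procedure are omitted.

```python
from collections import Counter


class CentroidClassifier:
    """Hypothetical stand-in base learner: predicts the class with the nearest mean."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for j, value in enumerate(features):
                acc[j] += value
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, features):
        def sq_dist(center):
            return sum((a - b) ** 2 for a, b in zip(features, center))
        return min(sorted(self.centroids), key=lambda c: sq_dist(self.centroids[c]))


def vertical_vote(history, features):
    """In-group (vertical) ensemble: majority vote over one group's past models."""
    votes = Counter(model.predict(features) for model in history)
    return votes.most_common(1)[0][0]


def vett_sketch(X_lab, y_lab, X_unlab, rounds=3):
    """Return three model histories, one per classifier group."""
    histories = []
    for g in range(3):
        # Assumption: each group starts from a different subset of the labeled
        # data (every third example left out) so the groups are not identical.
        keep = [i for i in range(len(X_lab)) if i % 3 != g]
        model = CentroidClassifier().fit([X_lab[i] for i in keep],
                                         [y_lab[i] for i in keep])
        histories.append([model])

    for _ in range(rounds):
        new_models = []
        for i in range(3):
            j, k = [g for g in range(3) if g != i]
            X_train, y_train = list(X_lab), list(y_lab)
            # Inter-group (horizontal) step: the other two groups' vertical
            # ensembles pseudo-label an unlabeled review when they agree.
            for features in X_unlab:
                label_j = vertical_vote(histories[j], features)
                label_k = vertical_vote(histories[k], features)
                if label_j == label_k:
                    X_train.append(features)
                    y_train.append(label_j)
            new_models.append(CentroidClassifier().fit(X_train, y_train))
        # Append after the loop so every group sees the same previous round.
        for i in range(3):
            histories[i].append(new_models[i])
    return histories


def predict(histories, features):
    """Final decision: majority vote across the three vertical ensembles."""
    votes = Counter(vertical_vote(h, features) for h in histories)
    return votes.most_common(1)[0][0]


# Toy data: two made-up numeric features per review (illustrative only).
X_lab = [(0, 0), (10, 10), (1, 0), (9, 10), (0, 1), (10, 9)]
y_lab = ["genuine", "fake", "genuine", "fake", "genuine", "fake"]
X_unlab = [(1, 1), (9, 9), (2, 1), (8, 9)]

histories = vett_sketch(X_lab, y_lab, X_unlab, rounds=3)
print(predict(histories, (0.5, 0.5)), predict(histories, (9.5, 9.5)))
```

Keeping every round's model in `histories` is what makes the ensemble "vertical": a pseudo-label is assigned by the accumulated models of a group rather than by its latest model alone, which is the mechanism the abstract credits for more accurate labels.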