Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (2): 360-368. DOI: 10.11772/j.issn.1001-9081.2023020219

• Artificial intelligence •

Fake review detection algorithm combining Gaussian mixture model and text graph convolutional network

Xing WANG1,2, Guijuan LIU1,2, Zhihao CHEN1,2

  1. Research Center for Applied Statistical Sciences, Renmin University of China, Beijing 100872, China
  2. School of Statistics, Renmin University of China, Beijing 100872, China
  • Received: 2023-03-03 Revised: 2023-05-22 Accepted: 2023-05-24 Online: 2023-08-14 Published: 2024-02-10
  • Contact: Xing WANG
  • About author:LIU Guijuan, born in 1986, M. S. candidate. Her research interests include natural language processing, deep learning.
    CHEN Zhihao, born in 1999, M. S. candidate. His research interests include network science, natural language processing.
  • Supported by:
    Key Project of the National Social Science Foundation of China (18ATJ004)

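Fake review identification algorithm combining Gaussian mixture model and text graph convolutional network (高斯混合模型与文本图卷积网络结合的虚假评论识别算法)

Xing WANG (王星)1,2, Guijuan LIU (刘贵娟)1,2, Zhihao CHEN (陈志豪)1,2

  1. Research Center for Applied Statistical Sciences, Renmin University of China, Beijing 100872
  2. School of Statistics, Renmin University of China, Beijing 100872
  • Corresponding author: Xing WANG
  • About the authors: LIU Guijuan (born 1986), female, from Heze, Shandong, M. S. candidate. Her main research interests are natural language processing and deep learning.
    CHEN Zhihao (born 1999), male, from Taizhou, Zhejiang, M. S. candidate. His main research interests are network science and natural language processing.
  • Funding:
    Key Project of the National Social Science Foundation of China (18ATJ004)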

Abstract:

To address the inadequate window-based edge-weight thresholding in Text Graph Convolutional Network (Text GCN), and thereby mine word association structures more accurately and improve prediction accuracy, a fake review detection algorithm combining Gaussian Mixture Model (GMM) and Text GCN, named F-Text GCN, was proposed. First, the ability of GMM to separate noisy edge-weight distributions was used to strengthen the edge signals of fake reviews, which are weak because fake reviews are fewer than normal reviews in the training data. Second, considering the diversity of information sources, the adjacency matrix was constructed by combining documents, words, reviews and non-text features. Finally, the fake review association structure of the adjacency matrix was extracted through the spectral decomposition of Text GCN. Validation experiments were performed on 126 086 real Chinese reviews collected from a large domestic e-commerce platform. Experimental results show that, for detecting fake reviews, the F1 value of F-Text GCN is 82.92%, which is 10.46% and 11.60% higher than those of BERT (Bidirectional Encoder Representations from Transformers) and Text CNN, respectively, and 2.94% higher than that of Text GCN. For highly imitative fake reviews, which are challenging to detect, F-Text GCN achieves an overall prediction accuracy of 94.71% through secondary detection on the samples that Support Vector Machine (SVM) finds difficult to identify, which is 2.91% and 14.54% higher than those of Text GCN and SVM, respectively. The study also finds that the second-order graph neighbor structure of fake reviews contains words that strongly interfere with consumer decision-making. This result indicates that the proposed algorithm is especially suitable for extracting long-range word collocation structures and global sentence feature pattern variations for fake review detection.
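The GMM-based edge filtering and graph propagation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-component mixture, the synthetic edge weights, and the function names `fit_gmm_1d`, `signal_mask` and `normalized_adjacency` are all assumptions made for illustration, since the abstract does not specify the number of components or how edge weights are computed. The sketch fits a one-dimensional mixture to the edge weights, keeps only edges attributed to the higher-mean ("signal") component, and then applies the symmetric normalization commonly used in GCN layers. Squaring the normalized matrix shows how two propagation steps reach second-order neighbors.

```python
import numpy as np

def component_densities(x, pi, mu, var):
    """Weighted Gaussian densities of each component, shape (len(x), 2)."""
    x = np.asarray(x, dtype=float)
    return pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def fit_gmm_1d(x, n_iter=200, tol=1e-8):
    """Fit a two-component 1-D Gaussian mixture with plain EM (illustrative)."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])      # start the means at the extremes
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each edge weight.
        dens = component_densities(x, pi, mu, var)
        total = dens.sum(axis=1, keepdims=True) + 1e-300
        resp = dens / total
        # M-step: re-estimate mixing weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, var

def signal_mask(x, pi, mu, var):
    """True where the higher-mean component is the likelier source of a weight."""
    return component_densities(x, pi, mu, var).argmax(axis=1) == int(np.argmax(mu))

def normalized_adjacency(weights):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    a = weights + np.eye(weights.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Synthetic data (assumption): many weak "noise" edges and a few strong ones.
rng = np.random.default_rng(0)
n = 30
iu = np.triu_indices(n, k=1)
w = rng.normal(0.1, 0.03, size=iu[0].size)
strong = rng.choice(iu[0].size, size=40, replace=False)
w[strong] = rng.normal(0.8, 0.05, size=40)
w = np.clip(w, 0.0, None)

pi, mu, var = fit_gmm_1d(w)
keep = signal_mask(w, pi, mu, var)

# Build the filtered symmetric adjacency matrix and propagate two hops.
adj = np.zeros((n, n))
adj[iu] = np.where(keep, w, 0.0)
adj = adj + adj.T
a_hat = normalized_adjacency(adj)
two_hop = a_hat @ a_hat   # nonzero entries include second-order neighbors

print("edges kept:", int(keep.sum()), "of", keep.size)
```

In a real pipeline the edge weights would come from the corpus (for example, word co-occurrence statistics), and the normalized matrix would feed learned GCN layers rather than a bare matrix product.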

Key words: Gaussian Mixture Model (GMM), fake review detection, Text Graph Convolutional Network (Text GCN), adjacency matrix, co-occurrence word network

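Abstract (Chinese version):

To address the inadequacy of the window-based edge-weight threshold strategy in the Text Graph Convolutional Network (Text GCN), and to mine the relevant word association structure more precisely and improve prediction accuracy, F-Text GCN, a fake review identification algorithm combining the Gaussian Mixture Model (GMM) with Text GCN, is proposed. First, the ability of GMM to separate noisy edge-weight distributions is used to strengthen the edge signal of fake reviews, which are under-represented relative to normal reviews in the training data. Then, considering the diversity of information sources, the adjacency matrix is constructed from documents, words, reviews and non-text features. Finally, the fake review association structure of the adjacency matrix is extracted through the spectral decomposition of Text GCN and used for prediction. An empirical study was carried out on 126 086 real Chinese reviews collected from a large domestic e-commerce platform. The experimental results show that F-Text GCN reaches an F1 value of 82.92% in identifying fake reviews, an improvement of 10.46% and 11.60% over the pre-trained representation model BERT and the text convolutional neural network, respectively, and of 2.94% over the Text GCN model that uses only review text as its information source. The prediction error rate on highly imitative fake reviews was also studied: in a second round of identification on the review samples that remain hard to identify after applying a Support Vector Machine (SVM), the overall prediction accuracy of F-Text GCN reaches 94.71%, which is 2.91% and 14.54% higher than that of Text GCN and SVM, respectively. The study finds that the second-order graph neighbor structure of fake reviews reveals words that strongly interfere with consumer decision-making, which indicates that the proposed algorithm is particularly suited to scenarios that require extracting long-range word collocation structures and global sentence feature pattern variations for fake review detection.

Key words (Chinese version): Gaussian mixture model, fake review identification, text graph convolutional neural network, adjacency matrix, word co-occurrence network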

CLC Number: