Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (8): 2189-2193. DOI: 10.11772/j.issn.1001-9081.2019122114

• Artificial Intelligence •

  • Corresponding author: SUN Guozi, born in 1972, male, native of Chuzhou, Anhui, professor, Ph.D. His research interests include cyberspace security and digital forensics. E-mail: sun@njupt.edu.cn
  • About the first author: HE Hansen, born in 1992, male, native of Chizhou, Anhui, M.S. candidate. His research interests include natural language processing and big data.

Fake news content detection model based on feature aggregation

HE Hansen, SUN Guozi   

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210023, China
  • Received:2019-12-17 Revised:2020-03-19 Online:2020-08-10 Published:2020-05-14
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61502247), the Open Project of State Key Laboratory of Mathematical Engineering and Advanced Computing (2017A10), the Open Project of Key Lab of Information Network Security of Ministry of Public Security (C17611).



Abstract: Concerning the problem that the detection performance and the generalization performance of classification models for fake news content detection cannot both be achieved at the same time, a fake news detection model based on feature aggregation, namely CCNN (Center-Cluster-Neural-Network), was proposed. Firstly, the global temporal features of the text were extracted by a Bidirectional Long Short-Term Memory (Bi-LSTM) network, and the word or phrase features within a window were extracted by a Convolutional Neural Network (CNN). Then, a feature aggregation layer trained with a dual-center loss was applied after the CNN pooling layer. Finally, the feature vectors of the Bi-LSTM and CNN branches were concatenated along the depth dimension into a single vector and fed to the fully connected layer, and the model, trained with the uniform loss function (uniform-sigmoid), output the final classification result. Experimental results show that the proposed model achieves an F1 score of 80.5%, with a gap of 1.3 percentage points between the training and validation sets. Compared with traditional models such as Support Vector Machine (SVM), Naïve Bayes (NB) and Random Forest (RF), the proposed model improves the F1 score by 9 to 14 percentage points; compared with neural network models such as Long Short-Term Memory (LSTM) and FastText, it improves the generalization performance by 1.3 to 2.5 percentage points. It can be seen that the proposed model improves the classification performance while maintaining a certain generalization ability, enhancing the overall performance.
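As a rough illustration (not the authors' code), the fusion step described in the abstract — concatenating the Bi-LSTM and CNN feature vectors along the depth dimension, passing the result through a fully connected layer with a sigmoid output, and pulling features toward per-class centers with a center-style loss — can be sketched with NumPy. All dimensions, names, and the exact form of the "dual-center" loss here are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative feature dimensions (assumptions, not from the paper).
batch, lstm_dim, cnn_dim = 4, 128, 96

rng = np.random.default_rng(0)
bilstm_feat = rng.standard_normal((batch, lstm_dim))  # global temporal features
cnn_feat = rng.standard_normal((batch, cnn_dim))      # windowed word/phrase features

# Concatenate the two branches along the depth (feature) dimension.
fused = np.concatenate([bilstm_feat, cnn_feat], axis=1)  # shape: (batch, 224)

# Fully connected layer followed by a sigmoid for the binary fake/real decision.
W = rng.standard_normal((lstm_dim + cnn_dim, 1)) * 0.01
b = np.zeros(1)
prob_fake = sigmoid(fused @ W + b)  # shape: (batch, 1), values in (0, 1)

# One interpretation of a "dual-center" loss: one learnable center per class
# (fake / real); each sample's features are pulled toward its class center.
centers = rng.standard_normal((2, lstm_dim + cnn_dim))
labels = np.array([0, 1, 0, 1])
center_loss = 0.5 * np.mean(np.sum((fused - centers[labels]) ** 2, axis=1))
```

In a real implementation the two feature extractors, the centers, and the fully connected layer would all be trained jointly by backpropagation; this sketch only shows the shapes and the direction of the computation.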

Key words: feature aggregation, convolutional network, recurrent network, uniform loss, generalization performance
