Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (9): 2469-2476. DOI: 10.11772/j.issn.1001-9081.2018030643

Spam messages recognizing method based on word embedding and convolutional neural network

LAI Wenhui, QIAO Yupeng   

  1. School of Automation Science and Engineering, South China University of Technology, Guangzhou Guangdong 510640, China
  • Received: 2018-03-29 Revised: 2018-04-23 Online: 2018-09-10 Published: 2018-09-06
  • Contact: LAI Wenhui

基于词向量和卷积神经网络的垃圾短信识别方法

赖文辉, 乔宇鹏   

  1. 华南理工大学 自动化科学与工程学院, 广州 510640
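  • Corresponding author: LAI Wenhui
  • About the authors: LAI Wenhui (born 1994), male, from Ganzhou, Jiangxi, M.S. candidate; main research interests: machine learning, natural language processing. QIAO Yupeng (born 1981), female, from Hailin, Heilongjiang, associate research fellow, Ph.D.; main research interests: Boolean networks, game theory, natural language processing.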

Abstract: Filtering and recognizing spam messages is of considerable social value. Traditional manually designed feature selection methods for short messages suffer from data sparseness, insufficient co-occurrence of feature information, and difficulty in feature extraction. To address these problems, a spam message recognition method based on word embedding and a Convolutional Neural Network (CNN) is proposed. First, the skip-gram model of word2vec is trained on the Chinese Wikipedia corpus to obtain the word embedding of each word in the short message dataset, and the embeddings of the words in a message are assembled into a two-dimensional feature matrix that represents the message. Then, the feature matrix is used as the input to the CNN: convolution kernels of different scales in the convolutional layer extract multi-scale message features, and a 1-max pooling strategy is used to obtain the locally optimal features. Finally, the locally optimal features are combined into a fused feature vector, which is passed to a softmax classifier to obtain the classification result. Experiments on 100,000 short messages show that, with the same feature extraction method, the CNN model reaches a recognition accuracy of 99.5%, which is 2.4% to 5.1% higher than that of traditional machine learning models, and the recognition accuracy of every model stays above 94%.

Key words: spam message, recognizing, word2vec, skip-gram, word embedding, Convolutional Neural Network (CNN)
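
The forward pass outlined in the abstract (feature matrix, multi-scale convolution, 1-max pooling, softmax) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. Everything concrete in it is an assumption made for illustration: the toy English vocabulary, the random vectors standing in for word2vec embeddings, the embedding size, the kernel heights (2, 3, 4), the ReLU activation, the zero padding, and the untrained random weights. All function names are hypothetical.

```python
import math
import random


def build_feature_matrix(tokens, embeddings, dim, max_len):
    """Stack one word vector per row; zero-pad or truncate to max_len rows."""
    zero = [0.0] * dim
    rows = [embeddings.get(tok, zero) for tok in tokens[:max_len]]
    rows += [zero] * (max_len - len(rows))
    return rows


def conv_one_max(matrix, kernel, bias):
    """Slide one kernel (h rows x dim columns) down the matrix, apply ReLU
    (an assumption), and keep only the largest response (1-max pooling)."""
    h = len(kernel)
    best = 0.0  # ReLU outputs are never negative, so 0.0 is a safe start
    for start in range(len(matrix) - h + 1):
        s = bias
        for i in range(h):
            for x, w in zip(matrix[start + i], kernel[i]):
                s += x * w
        best = max(best, max(0.0, s))
    return best


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(matrix, kernels, dense_w, dense_b):
    """Fuse the pooled features of all kernels, then classify with softmax."""
    fused = [conv_one_max(matrix, k, b) for k, b in kernels]
    logits = [sum(f * w for f, w in zip(fused, row)) + b
              for row, b in zip(dense_w, dense_b)]
    return softmax(logits)


# Toy setup: random vectors stand in for trained word2vec embeddings.
rng = random.Random(0)
DIM, MAX_LEN = 4, 6
vocab = ["free", "prize", "click", "meeting", "tomorrow"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

# Two kernels for each assumed height 2, 3 and 4 (the multi-scale part).
kernels = [([[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(h)], 0.0)
           for h in (2, 3, 4) for _ in range(2)]

# Untrained output layer with two classes (spam / not spam).
dense_w = [[rng.uniform(-1, 1) for _ in kernels] for _ in range(2)]
dense_b = [0.0, 0.0]

matrix = build_feature_matrix(["free", "prize", "click"], embeddings, DIM, MAX_LEN)
probs = forward(matrix, kernels, dense_w, dense_b)
print(probs)  # two class probabilities that sum to 1
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how the shapes fit together. Each kernel contributes exactly one pooled value, so the fused vector has one entry per kernel regardless of message length.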

摘要: 对垃圾短信进行过滤识别研究具有重要的社会价值和时代背景意义。针对传统的人工设计短信特征选择方法中存在数据稀疏、特征信息共现不足和特征提取困难的问题,提出一种基于词向量和卷积神经网络(CNN)的垃圾短信识别方法。首先,使用word2vec的skip-gram模型根据维基中文语料库训练出短信数据集中每个词的词向量,并将每条短信中各个词组所对应的词向量组成表示短信的二维特征矩阵;然后,把特征矩阵作为卷积神经网络的输入,通过卷积层的不同尺度卷积核提取多尺度短信特征,以及利用1-max pooling池化策略得到局部最优特征;最后,将局部最优特征组成融合特征向量放入softmax分类器中得出分类结果。在10万条短信数据上进行的实验结果表明,在特征提取方式相同的情况下,基于卷积神经网络模型的识别准确率能够达到99.5%,比传统的机器学习模型提高了2.4%~5.1%,且各模型的识别准确率均保持在94%以上。

关键词: 垃圾短信, 识别, word2vec, skip-gram, 词向量, 卷积神经网络
