Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (9): 2609-2614. DOI: 10.11772/j.issn.1001-9081.2020111837

Special Issue: Cyberspace Security

• Cyber security

Detection method of domains generated by dictionary-based domain generation algorithm

ZHANG Yongbin, CHANG Wenxin, SUN Lianshan, ZHANG Hang   

  1. School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an Shaanxi 710021, China
  • Received: 2020-11-24 Revised: 2021-02-24 Online: 2021-09-10 Published: 2021-05-12
  • Supported by:
    This work is partially supported by the Basic Research Program of Natural Science of Shaanxi Province (2019JM-354).

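  • Corresponding author: CHANG Wenxin
  • About the authors: ZHANG Yongbin (born 1976), male, from Hancheng, Shaanxi, lecturer, Ph.D., CCF member; research interests: network security, cloud computing. CHANG Wenxin (born 1994), female, from Xi'an, Shaanxi, M.S. candidate; research interests: network security, deep learning. SUN Lianshan (born 1977), male, from Jiamusi, Heilongjiang, associate professor, Ph.D., CCF member; research interests: software security engineering, data provenance security. ZHANG Hang (born 1996), male, from Xianyang, Shaanxi, M.S. candidate; research interests: network security, big data.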

Abstract: Domain names generated by dictionary-based Domain Generation Algorithms (DGAs) are composed much like benign domain names, which makes them difficult to detect effectively with existing techniques. To address this problem, a detection model named CL, which combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network, was proposed. The model consists of three parts: a character embedding layer, a feature extraction layer and a fully connected layer. Firstly, the characters of the input domain name were encoded by the character embedding layer. Then, the feature extraction layer connected the CNN and the LSTM in series to extract the features of the domain name: the CNN extracted the n-gram features of the domain name, and its output was fed to the LSTM to learn the contextual features between n-grams. Meanwhile, multiple CNN-LSTM pairs could be used together to learn the features of n-grams of different lengths. Finally, the fully connected layer classified and predicted dictionary-based DGA domain names according to the extracted features. Experimental results show that the proposed model achieves the best performance when the CNN convolution kernel sizes are 3 and 4. In experiments on four dictionary-based DGA families, the accuracy of the CL model is 2.20% higher than that of the CNN model, and the CL model remains more stable as the number of sample families increases.
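
The input side of the pipeline described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the vocabulary `ALPHABET`, the padding length `max_len`, and the helpers `toy_dictionary_dga`, `encode` and `ngram_windows` are hypothetical names chosen for illustration. The sketch shows how a dictionary-based DGA can produce benign-looking names by concatenating words, how characters might be mapped to integer indices before an embedding layer, and which character windows convolution kernels of width 3 and 4 would slide over.

```python
import random
import string

# Hypothetical character vocabulary (an assumption, not taken from the paper).
# Index 0 is reserved for padding and for characters outside the vocabulary.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def toy_dictionary_dga(words, seed, count=3, tld=".com"):
    """Build domains by concatenating two dictionary words (illustrative only)."""
    rng = random.Random(seed)  # seeded so the generated list is reproducible
    return [rng.choice(words) + rng.choice(words) + tld for _ in range(count)]


def encode(domain, max_len=32):
    """Map each character to an integer index, truncating or zero-padding to max_len."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in domain.lower()[:max_len]]
    return indices + [0] * (max_len - len(indices))


def ngram_windows(domain, n):
    """Return the character windows that a width-n convolution kernel slides over."""
    return [domain[i:i + n] for i in range(len(domain) - n + 1)]


if __name__ == "__main__":
    words = ["green", "river", "market", "cloud"]
    for name in toy_dictionary_dga(words, seed=7):
        print(name, encode(name, max_len=20))
        print("  3-grams:", ngram_windows(name, 3))
        print("  4-grams:", ngram_windows(name, 4))
```

In a full model, the integer sequence from `encode` would feed an embedding layer, each kernel width would yield one sequence of n-gram features, and each such sequence would be passed to its own LSTM before the fully connected classifier.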

Key words: Domain Generation Algorithm (DGA), dictionary-based DGA, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) network, domain name detection

