Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (1): 121-126. DOI: 10.11772/j.issn.1001-9081.2020060967

Special Issue: The 8th China Conference on Data Mining (CCDM 2020)

• China Conference on Data Mining 2020 (CCDM 2020) •

Hash learning based malicious SQL detection

LI Mingwei1, JIANG Qingyuan1, XIE Yinpeng1, HE Jindong2, WU Dan2   

  1. National Key Laboratory for Novel Software Technology (Nanjing University), Nanjing Jiangsu 210023, China;
    2. Electric Power Science Research Institute, State Grid Fujian Electric Power Company Limited, Fuzhou Fujian 350007, China
  • Received:2020-05-31 Revised:2020-08-03 Online:2021-01-10 Published:2021-01-16
  • Supported by:
    This work is partially supported by the Science and Technology Project of State Grid Corporation of China (SGGR0000XTJS1900448).


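  • Corresponding author: JIANG Qingyuan
  • About the authors: LI Mingwei, born in 1995 in Nantong, Jiangsu, M. S. candidate. His research interests include machine learning and Hash learning. JIANG Qingyuan, born in 1991 in Tengchong, Yunnan, Ph. D. candidate. His research interests include machine learning and Hash learning. XIE Yinpeng, born in 1997 in Xiangcheng, Henan, M. S. candidate. His research interests include distributed machine learning. HE Jindong, born in 1982 in Fuzhou, Fujian, M. S., senior engineer. His research interests include network security. WU Dan, born in 1987 in Changchun, Jilin, M. S., engineer. Her research interests include network security.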

Abstract: To address the high storage cost and low retrieval speed of the Nearest Neighbor (NN) method in malicious Structured Query Language (SQL) detection, a Hash learning based Malicious SQL Detection (HMSD) method was proposed. The method uses Hash learning to learn binary code representations of SQL statements. Firstly, the SQL statements were cleaned and deduplicated, and then represented as real-valued features. Secondly, isotropic hashing was used to learn the binary codes of the SQL statements. Lastly, retrieval was performed on the binary codes, which improved the detection speed. Experiments were conducted on the malicious SQL detection dataset Wafamole, randomly divided so that the training set contains 10 000 SQL statements and the test set contains 30 000 SQL statements. Experimental results show that, with a code length of 128 bits, compared with the NN method, the proposed method increased the detection accuracy by 1.3%, reduced the False Positive Rate (FPR) by 0.19%, reduced the False Negative Rate (FNR) by 2.41%, reduced the retrieval time by 94%, and reduced the storage cost by 97.5%; compared with the support vector machine method, the proposed method increased the detection accuracy by 0.17%. These results demonstrate that the proposed method can solve the problems of the NN method in malicious SQL detection.

Key words: malicious SQL detection, Nearest Neighbor (NN), binary coding representation, Hash learning, large-scale retrieval
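
The pipeline summarized in the abstract (clean and deduplicate, map to real-valued features, encode as binary codes, retrieve by nearest neighbor on the codes) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hashed character-trigram features, the helper names, and the toy statements are all assumptions, and random hyperplane projections are swapped in for the learned isotropic hashing step, which the abstract does not specify in detail.

```python
import random
import re
import zlib


def normalize(sql):
    """Lowercase and collapse whitespace (assumed stand-in for the cleaning step)."""
    return re.sub(r"\s+", " ", sql.strip().lower())


def featurize(sql, dim=64):
    """Hashed character-trigram counts: a hypothetical real-valued feature map."""
    vec = [0.0] * dim
    text = normalize(sql)
    for i in range(len(text) - 2):
        vec[zlib.crc32(text[i:i + 3].encode("utf-8")) % dim] += 1.0
    return vec


def make_hyperplanes(n_bits=128, dim=64, seed=0):
    """Random projections; NOT isotropic hashing, only a simple substitute."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def encode(vec, planes):
    """Pack the sign of each projection into one integer, one bit per hyperplane."""
    code = 0
    for plane in planes:
        dot = sum(w * x for w, x in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def build_index(labeled_sql, planes):
    """Deduplicate normalized statements, then keep only (code, label) pairs."""
    seen = {}
    for sql, label in labeled_sql:
        seen.setdefault(normalize(sql), label)
    return [(encode(featurize(s), planes), lab) for s, lab in seen.items()]


def detect(sql, index, planes):
    """Return the label of the training code nearest in Hamming distance."""
    query = encode(featurize(sql), planes)
    return min(index, key=lambda item: hamming(query, item[0]))[1]


# Toy data (hypothetical): label 1 = malicious, 0 = benign.
train = [
    ("SELECT name FROM users WHERE id = 1", 0),
    ("select  name from users where id = 1", 0),  # duplicate after cleaning
    ("SELECT name FROM users WHERE id = 1 OR 1=1 --", 1),
]
planes = make_hyperplanes()
index = build_index(train, planes)
prediction = detect("SELECT name FROM users WHERE id = 2 OR 1=1 --", index, planes)
```

Only one 128-bit code per training statement is stored, and a query is compared with XOR and bit counting rather than floating-point distances, which is the kind of storage and speed saving the abstract reports for binary codes.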

