Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (9): 2639-2644. DOI: 10.11772/j.issn.1001-9081.2014.09.2639

• Computer security •

Sensitive information detection approach for documents based on document smoothing and query expansion

SU Yingbin1,2, DU Xuehui1,2, XIA Chuntao1,2, LI Haihua3

  1. Information Engineering University, Zhengzhou Henan 450001, China
  2. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou Henan 450001, China
  3. Department of Computer Science and Technology, Henan Industry and Trade Vocational College, Zhengzhou Henan 450001, China
  • Received: 2014-04-09 Revised: 2014-05-28 Online: 2014-09-01 Published: 2014-09-30
  • Corresponding author: SU Yingbin
  • About the authors:
    SU Yingbin (1989-), male, born in Mianyang, Sichuan, M.S. candidate. Research interest: network security;
    DU Xuehui (1968-), female, born in Xinxiang, Henan, Ph.D., professor, doctoral supervisor. Research interest: network security;
    XIA Chuntao (1979-), male, born in Xuchang, Henan, M.S., lecturer. Research interest: network security;
    LI Haihua (1965-), male, born in Zhengzhou, Henan, M.S., associate professor. Research interest: information security.
  • Supported by:

    National High Technology Research and Development Program of China (863 Program)

Abstract:

Because office terminals are exposed to the risk of sensitive information leakage, detecting sensitive information in the documents stored on them is very important. Existing detection methods, however, rely on context-free indexes, which model documents imprecisely, and they expand query semantics inadequately. To address these problems, a context-based document index smoothing algorithm was first proposed to build an index that retains as much document information as possible. The query semantic expansion algorithm was then improved by incorporating the sensitivity of concepts in the domain ontology, which appropriately widens the detection range of sensitive information. Finally, document smoothing and query expansion were integrated into the language model, and a sensitive information detection approach for documents was built on that model. In comparative experiments on four approaches that use different index mechanisms, query keyword expansion algorithms and detection models, the proposed approach achieved a recall, precision and F-measure of 0.798, 0.786 and 0.792 respectively, clearly outperforming the compared algorithms on every indicator. The experimental results show that the proposed approach detects sensitive information more effectively.
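
The three reported figures can be cross-checked against each other. The short sketch below assumes that the reported F value is the balanced F-measure (the harmonic mean of precision and recall); the abstract does not state the weighting, so the default `beta = 1` and the helper name `f_measure` are illustrative assumptions, not part of the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


# Values reported in the abstract.
recall, precision = 0.798, 0.786

# Assumption: balanced F-measure (beta = 1).
f1 = f_measure(precision, recall)
print(round(f1, 3))  # 0.792
```

Under that assumption the computed value rounds to the reported 0.792, so the three figures are mutually consistent.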

CLC number: