Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (8): 2311-2316. DOI: 10.11772/j.issn.1001-9081.2014.08.2311

• Artificial intelligence •

Online data quality assessment of Web document content based on simulated annealing

HAN Jingyu, CHEN Kejia

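  1. College of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210003
  • Received: 2014-02-11 Revised: 2014-03-22 Online: 2014-08-01 Published: 2014-08-10
  • Corresponding author: HAN Jingyu
  • About the authors: HAN Jingyu (1976-), male, born in Baishan, Jilin, associate professor, Ph.D., CCF member; main research interests: data management and knowledge bases. CHEN Kejia (1980-), female, born in Huai'an, Jiangsu, associate professor, Ph.D.; main research interests: machine learning and information systems.
  • Supported by:

    National Natural Science Foundation of China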

Data quality assessment of Web article content based on simulated annealing

HAN Jingyu, CHEN Kejia

  1. College of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China
  • Received: 2014-02-11 Revised: 2014-03-22 Online: 2014-08-01 Published: 2014-08-10
  • Contact: HAN Jingyu

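摘要 (Abstract):

Web data quality assessment methods based on trained models or user interaction can neither respond online nor capture the factual meaning of content. To address this problem, an online data Quality Assessment method for Web document content based on Simulated Annealing (SA), named QASA, was proposed. First, topic-relevant documents were collected from the Web to construct the relevant space of the target document, and open information extraction was used to extract the facts contained in the documents. Then, SA was used to construct online the references for the two most important data quality dimensions, namely accuracy and completeness. Finally, the data quality dimensions were quantified by comparing the facts of the target document with those of the dimension references. The experimental results show that QASA returns near-optimal solutions in time while maintaining a precision equal to, or 10% higher than, that of offline algorithms. The method not only meets the requirement of real-time response but also achieves high assessment precision, and it can be applied to the online identification of high-quality Web documents.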

Abstract:

Existing Web data quality assessment approaches rely on trained models or user interaction, so they can neither respond online nor capture the semantics of Web content. To address this problem, a data Quality Assessment method based on Simulated Annealing (QASA) was proposed. Firstly, the relevant space of the target article was constructed by collecting topic-relevant articles on the Web, and open information extraction was employed to extract the facts of the Web articles. Secondly, Simulated Annealing (SA) was employed to construct the dimension baselines of the two most important quality dimensions, namely accuracy and completeness. Finally, the data quality dimensions were quantified by comparing the facts of the target article with those of the dimension baselines. The experimental results show that QASA finds near-optimal solutions within the time window while achieving accuracy comparable to, or even 10 percent higher than, that of related offline approaches. The QASA method assesses data quality precisely and in real time, which suits the online identification of high-quality Web articles.
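The pipeline summarized above (extract facts, build a baseline with SA, compare the target against it) can be illustrated with a minimal Python sketch. It is not the authors' QASA implementation: the abstract does not give the objective function, the neighbourhood move, or the cooling schedule, so the support-based objective, the single shared baseline for both dimensions, the function names (`support`, `baseline_score`, `anneal_baseline`, `assess`), and the toy facts are all assumptions made for illustration.

```python
import math
import random

# A fact is modelled as a (subject, relation, object) triple, as open
# information extraction might produce. All data below is made-up toy data.

def support(fact, relevant_docs):
    """Fraction of the relevant documents that state the fact."""
    return sum(fact in doc for doc in relevant_docs) / len(relevant_docs)

def baseline_score(baseline, relevant_docs, threshold=0.5):
    """Toy objective (an assumption): reward facts stated by more than
    `threshold` of the relevant documents, penalize the rest."""
    return sum(support(f, relevant_docs) - threshold for f in baseline)

def anneal_baseline(relevant_docs, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over subsets of the candidate facts."""
    rng = random.Random(seed)
    candidates = sorted(set().union(*relevant_docs))
    current, current_score = set(), 0.0
    best, best_score = set(), 0.0
    temperature = t0
    for _ in range(steps):
        # Neighbour move: toggle one randomly chosen fact in or out.
        neighbour = current ^ {rng.choice(candidates)}
        score = baseline_score(neighbour, relevant_docs)
        delta = score - current_score
        # Always accept improvements; accept worse moves with
        # probability exp(delta / T), which shrinks as T cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = neighbour, score
            if current_score > best_score:
                best, best_score = set(current), current_score
        temperature *= cooling
    return best

def assess(target_facts, baseline):
    """Accuracy: share of target facts confirmed by the baseline.
    Completeness: share of baseline facts covered by the target."""
    shared = target_facts & baseline
    accuracy = len(shared) / len(target_facts) if target_facts else 0.0
    completeness = len(shared) / len(baseline) if baseline else 0.0
    return accuracy, completeness

relevant_docs = [
    {("Nanjing", "located_in", "Jiangsu"), ("Nanjing", "located_in", "China")},
    {("Nanjing", "located_in", "Jiangsu"), ("Nanjing", "located_in", "China"),
     ("Nanjing", "located_in", "Zhejiang")},
    {("Nanjing", "located_in", "Jiangsu"), ("Nanjing", "has_postcode", "210003")},
]
target = {("Nanjing", "located_in", "Jiangsu"), ("Nanjing", "located_in", "Zhejiang")}

baseline = anneal_baseline(relevant_docs)
accuracy, completeness = assess(target, baseline)
print(sorted(baseline), accuracy, completeness)
```

With this toy objective the best subset could also be found greedily; SA is used here only to show the accept/reject loop and the cooling schedule. A bounded `steps` value is one simple way to model a response-time window.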

CLC number: