Data quality assessment of online Web article content based on simulated annealing

doi:10.11772/j.issn.1001-9081.2014.08.2311

Journal of Computer Applications, 2014, Vol. 34, Issue (8): 2311-2316. DOI: 10.11772/j.issn.1001-9081.2014.08.2311

HAN Jingyu, CHEN Kejia

College of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210003, China

Received: 2014-02-11 Revised: 2014-03-22 Online: 2014-08-01 Published: 2014-08-10
Corresponding author: HAN Jingyu
About the authors: HAN Jingyu (1976-), male, born in Baishan, Jilin; associate professor, Ph.D., CCF member; main research interests: data management, knowledge bases. CHEN Kejia (1980-), female, born in Huai'an, Jiangsu; associate professor, Ph.D.; main research interests: machine learning, information systems.
Supported by: National Natural Science Foundation of China

Abstract:

Existing Web data quality assessment approaches rely on trained models or user interaction, so they can neither respond online nor capture the factual content of Web articles. To address this problem, a data Quality Assessment based on Simulated Annealing (QASA) method for online Web article content was proposed. First, the relevant space of the target article was constructed by collecting topic-relevant articles on the Web, and open information extraction was employed to extract the facts stated in the articles. Then, Simulated Annealing (SA) was employed to construct, online, the dimension baselines of the two most important data quality dimensions, namely accuracy and completeness. Finally, the data quality dimensions were quantified by comparing the facts of the target article with those of the dimension baselines. The experimental results show that QASA returns near-optimal solutions within the time window while achieving accuracy comparable to, or even 10% higher than, that of offline algorithms. The method therefore meets the requirement of real-time response with high assessment accuracy, and can be applied to the online identification of high-quality Web articles.
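The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the objective function, the neighbourhood move, or the cooling schedule, so the majority-support objective (`baseline_score`), the single-fact flip, the geometric cooling, and every function and parameter name below are assumptions chosen only to make the idea concrete. Facts are modelled as (subject, relation, object) triples, as an open information extraction system might produce, and a fixed step budget stands in for the online time window.

```python
import math
import random
from typing import List, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object) triple


def support(fact: Fact, relevant_docs: List[Set[Fact]]) -> float:
    """Fraction of documents in the relevant space that state the fact."""
    return sum(fact in doc for doc in relevant_docs) / len(relevant_docs)


def baseline_score(baseline: Set[Fact], relevant_docs: List[Set[Fact]],
                   threshold: float = 0.5) -> float:
    """Toy objective (assumption): facts stated by more than `threshold`
    of the relevant space raise the score, weaker facts lower it."""
    return sum(support(f, relevant_docs) - threshold for f in baseline)


def anneal_baseline(relevant_docs: List[Set[Fact]], steps: int = 2000,
                    t0: float = 1.0, cooling: float = 0.995,
                    seed: int = 0) -> Set[Fact]:
    """Search for a high-scoring baseline fact set by simulated annealing."""
    rng = random.Random(seed)
    candidates = sorted(set().union(*relevant_docs))
    if not candidates:
        return set()
    current: Set[Fact] = set()
    current_score = baseline_score(current, relevant_docs)
    best, best_score = set(current), current_score
    temp = t0
    for _ in range(steps):  # step budget stands in for the time window
        fact = rng.choice(candidates)
        neighbour = current ^ {fact}  # add or drop one candidate fact
        new_score = baseline_score(neighbour, relevant_docs)
        delta = new_score - current_score
        # Always accept improvements; accept worse moves with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = neighbour, new_score
            if current_score > best_score:
                best, best_score = set(current), current_score
        temp *= cooling  # geometric cooling (assumption)
    return best


def assess(target_facts: Set[Fact], baseline: Set[Fact]) -> Tuple[float, float]:
    """Return (accuracy, completeness) of the target against the baseline."""
    matched = target_facts & baseline
    accuracy = len(matched) / len(target_facts) if target_facts else 0.0
    completeness = len(matched) / len(baseline) if baseline else 0.0
    return accuracy, completeness
```

In this sketch accuracy is the share of the target article's facts that the baseline confirms, and completeness is the share of baseline facts that the target article covers. Both definitions are simplifications, and a single shared baseline is used here although the abstract describes one baseline per dimension.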

HAN Jingyu, CHEN Kejia. Data quality assessment of Web article content based on simulated annealing[J]. Journal of Computer Applications, 2014, 34(8): 2311-2316.
