Journal of Computer Applications, 2014, Vol. 34, Issue (3): 733-737. DOI: 10.11772/j.issn.1001-9081.2014.03.0733

• Artificial intelligence •

Shopping information extraction method based on rapid construction of template

LI Ping¹, ZHU Jianbo², ZHOU Lixin¹, LIAO Bin²

  1. School of Software and Microelectronics, Peking University, Beijing 102600, China;
  2. School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China
  • Received: 2013-09-18; Revised: 2013-11-15; Online: 2014-03-01; Published: 2014-04-01
  • Contact: ZHOU Lixin

基于快速构建模板的购物信息抽取方法

李萍¹,朱建波²,周立新¹,廖彬²

  1. 北京大学 软件与微电子学院,北京102600
  2. 新疆大学 信息科学与工程学院,乌鲁木齐830046
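  • Corresponding author: ZHOU Lixin (周立新)
  • About the authors: LI Ping (born 1989), female, from Zhuzhou, Hunan, M.S. candidate; research interests: information retrieval and software engineering management. ZHU Jianbo (born 1987), male, from Quzhou, Zhejiang, M.S. candidate; research interests: cloud computing and distributed computing. ZHOU Lixin (born 1967), male, from Beijing, associate professor, Ph.D.; research interests: data mining and business analytics, Internet of Things software design, information retrieval, smart cities, and quantitative project management. LIAO Bin (born 1986), male, from Neijiang, Sichuan, Ph.D. candidate; research interests: databases, grid computing, and cloud computing.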

Abstract:

Shopping Web pages are typically generated from templates, carry a large amount of information, and have a complex structure. To address these characteristics, this paper proposes a method that extracts shopping information from template-generated Web pages without relying on complex learned rules. The paper defines the Web page template and the information extraction template, designs a template language for constructing templates rapidly, and presents an extraction model based on that language. On a standard test set of 450 Web pages, the recall rate of the proposed method was 12% higher than that of the Extraction problem Algorithm (EXALG). On a standard test set of 250 Web pages, the recall rate was 7.4% higher than that of the Visual information and Tag structure based wrapper generator (ViNTs) method and 0.2% higher than that of the Augmenting automatic information extraction with visual perceptions (ViPER) method, and the accuracy rate was 5.2% higher than that of ViNTs and 0.2% higher than that of ViPER. These gains in recall and accuracy improve both the accuracy of Web page analysis and the recall of information in shopping information retrieval and price comparison systems.
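
To make the idea of an extraction template concrete, the following is a minimal sketch, not the template language defined in the paper. It assumes a hypothetical slot syntax, `{{field}}`, embedded in literal markup, and compiles such a template into a regular expression that is applied to a template-generated page. The names `compile_template` and `extract`, the sample template, and the sample page are all illustrative assumptions.

```python
import re


def compile_template(template: str) -> re.Pattern:
    """Compile a template with {{field}} slots into a regex with named groups.

    The {{field}} syntax is a hypothetical stand-in for a template language;
    it is not the language defined in the paper.
    """
    # re.split with one capture group alternates: literal, field, literal, ...
    parts = re.split(r"\{\{(\w+)\}\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Literal markup: escape each token and tolerate any whitespace
            # (including none) between tokens.
            pattern += r"\s*".join(re.escape(tok) for tok in part.split())
        else:
            # Slot: capture lazily up to the next literal.
            pattern += rf"(?P<{part}>.*?)"
    return re.compile(pattern, re.DOTALL)


def extract(template: str, page: str) -> list[dict[str, str]]:
    """Return one dict of field values per record matched in the page."""
    regex = compile_template(template)
    return [
        {name: value.strip() for name, value in m.groupdict().items()}
        for m in regex.finditer(page)
    ]


# Illustrative template and page (both invented for this sketch).
TEMPLATE = '<h1 class="title">{{name}}</h1> <span class="price">{{price}}</span>'
PAGE = """
<div><h1 class="title">USB cable</h1>
<span class="price"> 9.90 </span></div>
<div><h1 class="title">Phone case</h1><span class="price">25.00</span></div>
"""

if __name__ == "__main__":
    for record in extract(TEMPLATE, PAGE):
        print(record)
```

Because the template is written by hand from the page's fixed markup, no rule learning step is needed. The trade-off in this sketch is that any change to the site's markup requires the template to be updated.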

Key words: template, electronic commerce, information extraction, shopping information, goods

