Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (1): 233-238. DOI: 10.11772/j.issn.1001-9081.2017.01.0233

• Artificial Intelligence •

Agriculture-related product name extraction and category labeling based on ontology and conditional random field

HUANG Nian'e1,2, HUANG He1, WANG Rujing1

  1. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei Anhui 230031, China;
  2. Hefei Institute of Physical Science, University of Science and Technology of China, Hefei Anhui 230027, China
  • Received: 2016-08-02 Revised: 2016-09-19 Online: 2017-01-10 Published: 2017-01-09
  • Corresponding author: HUANG He
  • About the authors: HUANG Nian'e (born 1991), female, from Anqing, Anhui, M.S. candidate; main research interests: information extraction, vertical search engines. HUANG He (born 1980), male, from Hefei, Anhui, associate research fellow, Ph.D.; main research interests: agricultural big data, agricultural intelligent systems. WANG Rujing (born 1964), male, from Bozhou, Anhui, research fellow, Ph.D.; main research interests: knowledge representation and visualization, knowledge acquisition.
  • Supported by:
    This work is partially supported by the National Science and Technology Support Program (2013BAD15B03), the Chinese Academy of Sciences Key Deployment Project (Y622A21291), and the Scientific and Technological Project of Anhui Province (1401032010).

Abstract: When applied to agriculture-related product name extraction and category labeling, traditional information extraction methods based on Conditional Random Field (CRF) require a large training corpus, entail a heavy manual labeling workload, and achieve low extraction precision. To address this problem, a method combining agricultural ontology with CRF was proposed, in which the automatic extraction and classification of agriculture-related product names is treated as a sequence labeling task. Firstly, the original data were segmented into words, and word, part-of-speech, geographical attribute and ontology concept features were selected. Then, the parameters of the CRF model were trained by an improved quasi-Newton algorithm, and decoding was implemented by the Viterbi algorithm. A total of four groups of comparative experiments were completed and seven categories were identified; CRF was also compared experimentally with the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM). Finally, CRF was applied to the supply and demand trend analysis of agricultural products. The experimental results show that, with an appropriate feature template, adding ontology concepts increased the overall precision, recall and F-score of the CRF open test by 10.20%, 59.78% and 37.17% respectively. This demonstrates the feasibility and effectiveness of combining ontology with CRF for extracting agriculture-related product names and categories, and shows that the method can help match supply and demand for agricultural products.

Key words: Conditional Random Field (CRF), agricultural ontology, agriculture-related product name, supply and demand trend, sequence labeling
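
The pipeline summarized in the abstract (per-token features, then Viterbi decoding over a linear-chain model) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy ontology `ONTOLOGY`, the place list `PLACES`, the label set `LABELS`, the example tokens, and the hand-set scores in `WEIGHTS` and `TRANSITIONS`. In the method described above, such weights would be learned by the quasi-Newton training step rather than written by hand.

```python
from typing import Dict, List, Tuple

# Hypothetical resources, for illustration only.
ONTOLOGY = {"Fuji": "Fruit", "apple": "Fruit"}   # word -> ontology concept
PLACES = {"Yantai"}                              # words with a geographical attribute
LABELS = ["O", "B-FRUIT", "I-FRUIT"]             # toy label set, not the paper's seven

# Hand-set scores standing in for trained CRF parameters.
WEIGHTS: Dict[Tuple[str, str], float] = {
    ("concept=None", "O"): 1.0,
    ("geo=True", "O"): 0.5,
    ("concept=Fruit", "B-FRUIT"): 2.0,
    ("concept=Fruit", "I-FRUIT"): 2.0,
}
TRANSITIONS: Dict[Tuple[str, str], float] = {
    ("O", "I-FRUIT"): -5.0,          # discourage I- directly after O
    ("B-FRUIT", "I-FRUIT"): 1.0,
    ("I-FRUIT", "I-FRUIT"): 0.5,
    ("B-FRUIT", "B-FRUIT"): -1.0,
    ("I-FRUIT", "B-FRUIT"): -1.0,
}


def token_features(word: str, pos: str) -> List[str]:
    """Word, part-of-speech, geographical and ontology-concept features."""
    return [
        f"word={word}",
        f"pos={pos}",
        f"geo={word in PLACES}",
        f"concept={ONTOLOGY.get(word, 'None')}",
    ]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence; missing scores count as 0."""
    if not emissions:
        return []
    # best[t][y]: best score of any label path ending in y at position t
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + transitions.get((prev, y), 0.0)
                         + emissions[t].get(y, 0.0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def label_sentence(tagged: List[Tuple[str, str]]) -> List[str]:
    """Score each (word, pos) token against every label, then decode."""
    emissions = [
        {y: sum(WEIGHTS.get((f, y), 0.0) for f in token_features(w, p))
         for y in LABELS}
        for w, p in tagged
    ]
    return viterbi(emissions, TRANSITIONS, LABELS)


sentence = [("sell", "v"), ("Yantai", "ns"), ("Fuji", "n"), ("apple", "n")]
print(list(zip((w for w, _ in sentence), label_sentence(sentence))))
```

In this toy setting the ontology concept feature is what separates product tokens from the rest, which mirrors the role the abstract attributes to ontology concepts; the actual feature templates and label categories of the paper are not reproduced here.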

CLC Number: