Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (5): 1351-1356. DOI: 10.11772/j.issn.1001-9081.2018112496

• Data Science and Technology •

成本约束下自适应众包标注的用户观点抽取

赵威1, 林煜明1, 黄涛贻1, 李优2   

  1. 广西可信软件重点实验室(桂林电子科技大学), 广西 桂林 541004;
    2. 广西自动检测技术与仪器重点实验室(桂林电子科技大学), 广西 桂林 541004
  • 收稿日期:2018-12-04 修回日期:2018-12-18 出版日期:2019-05-10 发布日期:2019-05-14
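  • Corresponding author: LIN Yuming
  • About the authors: ZHAO Wei (born 1995), male, from Shangqiu, Henan, master's student; main research interest: knowledge extraction and fusion. LIN Yuming (born 1978), male, from Hepu, Guangxi, associate research fellow, Ph.D., CCF member; main research interests: massive data management and knowledge graphs. HUANG Taoyi (born 1994), male, from Wuxi, Jiangsu, master's student; main research interest: knowledge extraction and fusion. LI You (born 1978), female, from Guoyang, Anhui, associate professor, M.S.; main research interest: text mining.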
  • 基金资助:
    国家自然科学基金资助项目(61562014,U1711263);广西自然科学基金重点项目(2018GXNSFDA281049);桂林电子科技大学研究生优秀学位论文培育项目(16YJPYSS15);桂林电子科技大学研究生教育创新计划项目(2018YJCX48);广西可信软件重点实验室研究课题(kx201916)。

User opinion extraction based on adaptive crowd labeling with cost constraint

ZHAO Wei1, LIN Yuming1, HUANG Taoyi1, LI You2   

  1. Guangxi Key Laboratory of Trusted Software (Guilin University of Electronic Technology), Guilin Guangxi 541004, China;
    2. Guangxi Key Laboratory of Automatic Detecting Technology and Instruments (Guilin University of Electronic Technology), Guilin Guangxi 541004, China
  • Received:2018-12-04 Revised:2018-12-18 Online:2019-05-10 Published:2019-05-14
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61562014, U1711263), the Guangxi Natural Science Foundation Key Project (2018GXNSFDA281049), the GUET Excellent Graduate Thesis Program (16YJPYSS15), the GUET Graduate Education Innovation Program (2018YJCX48), and the Guangxi Key Laboratory of Trusted Software Program (kx201916).

摘要: 用户评论包含了丰富的用户观点信息,对潜在的顾客和商家具有重要的参考价值。观点目标和观点词作为用户评论中的核心对象,它们的自动抽取是用户评论智能化应用的一项核心工作。目前主要采用有监督的抽取方法解决该问题,这些方法依赖于利用高质量的标注样本进行模型训练,而传统人工标注样本的方法不仅耗时费力,且标注成本高。众包计算为构建高质量训练样本集提供了一种有效途径,然而,众包工作者由于知识背景等因素使得标注结果的质量参差不齐。为了在有限的成本下获取高质量的标注样本,提出一种基于工作者专业水平评估的自适应众包标注方法,构建可靠的观点目标-观点词数据集。首先,通过小成本挖掘出高专业水平的工作者;然后,设计一种基于工作者可靠性的任务分发机制;最后,利用观点目标和观点词间的依赖关系设计了一种有效的标注结果融合算法,通过整合不同工作者的标注结果生成最终可靠的结果。在真实数据集上进行了一系列实验表明,与GLAD模型和多数投票(MV)算法方法相比,所提方法能够在成本预算较小的情况下将构建出的高质量观点目标-观点词数据集的可靠性提高10%左右。

关键词: 观点挖掘, 众包计算, 成本约束, 工作者检测, 数据整合

Abstract: User reviews contain a wealth of opinion information that is of great reference value to potential customers and merchants. Opinion targets and opinion words are the core objects of user reviews, so extracting them automatically is a key task for intelligent applications of user reviews. At present, this problem is mainly solved with supervised extraction methods, which depend on high-quality labeled samples for model training, while traditional manual labeling is time-consuming, laborious and costly. Crowdsourcing computing provides an effective way to build a high-quality training sample set. However, the quality of the labeling results is uneven because of factors such as the workers' knowledge backgrounds. To obtain high-quality labeled samples at a limited cost, an adaptive crowdsourcing labeling method based on evaluating workers' professional level was proposed to construct a reliable opinion target-opinion word dataset. Firstly, workers with a high professional level were identified at a small cost. Then, a task distribution mechanism based on worker reliability was designed. Finally, an effective fusion algorithm for labeling results was designed by using the dependency relationship between opinion targets and opinion words, and the final reliable results were generated by integrating the labeling results of different workers. A series of experiments on real datasets show that, compared with the GLAD (Generative model of Labels, Abilities, and Difficulties) model and the MV (Majority Vote) method, the proposed method improves the reliability of the constructed opinion target-opinion word dataset by about 10% when the cost budget is small.

Key words: opinion mining, crowdsourcing computing, cost constraint, worker measurement, data integration

CLC number: