User opinion extraction based on adaptive crowd labeling with cost constraint

Journal of Computer Applications, 2019, Vol. 39, Issue (5): 1351-1356. DOI: 10.11772/j.issn.1001-9081.2018112496

ZHAO Wei¹, LIN Yuming¹, HUANG Taoyi¹, LI You²

1. Guangxi Key Laboratory of Trusted Software (Guilin University of Electronic Technology), Guilin, Guangxi 541004, China;
2. Guangxi Key Laboratory of Automatic Detecting Technology and Instruments (Guilin University of Electronic Technology), Guilin, Guangxi 541004, China

Received: 2018-12-04 Revised: 2018-12-18 Online: 2019-05-14 Published: 2019-05-10
Corresponding author: LIN Yuming
About the authors: ZHAO Wei (born 1995), male, from Shangqiu, Henan, master's student; main research interests: knowledge extraction and fusion. LIN Yuming (born 1978), male, from Hepu, Guangxi, associate research fellow, Ph.D., CCF member; main research interests: massive data management and knowledge graphs. HUANG Taoyi (born 1994), male, from Wuxi, Jiangsu, master's student; main research interests: knowledge extraction and fusion. LI You (born 1978), female, from Guoyang, Anhui, associate professor, M.S.; main research interest: text mining.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61562014, U1711263), the Guangxi Natural Science Foundation Key Project (2018GXNSFDA281049), the GUET Excellent Graduate Thesis Program (16YJPYSS15), the GUET Graduate Education Innovation Program (2018YJCX48), and the Guangxi Key Laboratory of Trusted Software Program (kx201916).

Abstract

User reviews contain a wealth of opinion information that is valuable to both potential customers and merchants. Opinion targets and opinion words are the core objects in user reviews, so extracting them automatically is a key task for intelligent applications built on reviews. The problem is currently addressed mainly with supervised extraction methods, which depend on high-quality labeled samples for model training, yet traditional manual labeling is time-consuming, laborious and costly. Crowdsourcing offers an effective way to build a high-quality training set, but the quality of the labels is uneven because workers differ in factors such as background knowledge. To obtain high-quality labeled samples at limited cost, an adaptive crowdsourcing labeling method based on evaluating the professional level of workers is proposed to construct a reliable opinion target-opinion word dataset. Firstly, workers with a high professional level are identified at small cost. Then, a task distribution mechanism based on worker reliability is designed. Finally, an effective fusion algorithm for labeling results is designed that exploits the dependency between opinion targets and opinion words and integrates the labels of different workers to produce the final reliable result. A series of experiments on real datasets shows that, compared with the GLAD (Generative model of Labels, Abilities, and Difficulties) model and the Majority Vote (MV) method, the proposed method improves the reliability of the constructed opinion target-opinion word dataset by about 10% when the cost budget is small.

Key words: opinion mining, crowdsourcing, cost constraint, worker detection, data integration
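
The general idea outlined in the abstract (estimate worker reliability cheaply, then take that reliability into account when fusing labels) can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: the gold-question accuracy estimate, the log-odds weighting, and every name used here (`estimate_reliability`, `majority_vote`, `weighted_vote`, the toy workers and labels) are assumptions made purely for illustration, and the dependency between opinion targets and opinion words exploited by the proposed fusion step is not modelled.

```python
from collections import Counter, defaultdict
import math


def estimate_reliability(gold, answers, smoothing=1.0):
    """Estimate each worker's accuracy on a small set of gold questions.

    gold:    {task_id: true_label}
    answers: {worker_id: {task_id: label}}
    Laplace smoothing keeps the estimates strictly between 0 and 1.
    """
    reliability = {}
    for worker, labels in answers.items():
        graded = [t for t in labels if t in gold]
        correct = sum(labels[t] == gold[t] for t in graded)
        reliability[worker] = (correct + smoothing) / (len(graded) + 2 * smoothing)
    return reliability


def majority_vote(labels):
    """Baseline: every worker counts equally. labels: {worker_id: label}."""
    return Counter(labels.values()).most_common(1)[0][0]


def weighted_vote(labels, reliability):
    """Weight each vote by the log-odds of the worker's estimated accuracy.

    Workers estimated to be worse than chance count against the label
    they chose; workers with no estimate get weight zero.
    """
    scores = defaultdict(float)
    for worker, label in labels.items():
        p = reliability.get(worker, 0.5)
        scores[label] += math.log(p / (1.0 - p))
    return max(scores, key=scores.get)


# Hypothetical toy data: w1 answers all gold questions correctly,
# w2 and w3 each get only one of four right.
gold = {"g1": 1, "g2": 0, "g3": 1, "g4": 0}
gold_answers = {
    "w1": {"g1": 1, "g2": 0, "g3": 1, "g4": 0},
    "w2": {"g1": 0, "g2": 0, "g3": 0, "g4": 1},
    "w3": {"g1": 1, "g2": 1, "g3": 0, "g4": 1},
}
reliability = estimate_reliability(gold, gold_answers)

# One labeling task: is a candidate (opinion target, opinion word) pair
# correct? 1 = yes, 0 = no.
task_labels = {"w1": 1, "w2": 0, "w3": 0}
mv_result = majority_vote(task_labels)
wv_result = weighted_vote(task_labels, reliability)
```

On this toy input `majority_vote` returns 0, because the two unreliable workers outvote the reliable one, whereas `weighted_vote` returns 1. Spending a few gold questions per worker is one simple way to trade a small part of the budget for better fusion; the paper's own worker evaluation and task distribution mechanism may differ.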

CLC number: TP391

ZHAO Wei, LIN Yuming, HUANG Taoyi, LI You. User opinion extraction based on adaptive crowd labeling with cost constrain[J]. Journal of Computer Applications, 2019, 39(5): 1351-1356.

