Chinese cross document co-reference resolution based on SVM classification and semantics

Journal of Computer Applications, 2013, Vol. 33, Issue (04): 984-987. DOI: 10.3724/SP.J.1087.2013.00984

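ZHAO Zhiwei¹,², GU Jinghang¹,², HU Yanan¹,², QIAN Longhua¹,², ZHOU Guodong¹,²

1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
2. Laboratory of Natural Language Processing, Soochow University, Suzhou, Jiangsu 215006, China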

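Received: 2012-09-24 Revised: 2012-10-30 Online: 2013-04-23 Published: 2013-04-01
Corresponding author: QIAN Longhua
About the authors: ZHAO Zhiwei (born 1987), male, from Hangzhou, Zhejiang, master's student; main research interest: information extraction. GU Jinghang (born 1987), male, from Luoyang, Henan, master's student; main research interest: information extraction. HU Yanan (born 1989), female, from Bozhou, Anhui, master's student; main research interest: information extraction. QIAN Longhua (born 1966), male, from Suzhou, Jiangsu, associate professor, CCF member; main research interest: natural language processing. ZHOU Guodong (born 1967), male, from Liyang, Jiangsu, professor, doctoral supervisor, CCF senior member; main research interest: natural language processing.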
Supported by: National Natural Science Foundation of China (61172083); Major Natural Science Research Project of Jiangsu Higher Education Institutions (11KJA520003)


Abstract

Cross-Document Co-reference Resolution (CDCR) groups mentions that are spread across different documents but refer to the same entity into co-reference chains. Traditional CDCR research mainly uses clustering to address the name disambiguation problem that arises in information retrieval. This paper recasts the clustering problem as a classification problem and uses a Support Vector Machine (SVM) classifier to resolve both name disambiguation and variant consolidation, two problems that are prevalent in information extraction. The method effectively integrates the morphological and phonetic features of entity names with a variety of semantic features gathered both from within the text (the corpus) and from outside it (the Internet). Experiments on a Chinese cross-document co-reference corpus show that, compared with clustering methods, the classification method improves both precision and recall.

Key words: cross document co-reference resolution, information extraction, Support Vector Machine (SVM) classifier, semantics, name disambiguation, variant consolidation
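The recasting of clustering as classification described in the abstract can be illustrated with a small sketch: every pair of mentions becomes one training example, a linear SVM decides whether the pair is co-referent, and positive pairs are merged transitively into chains. Everything below is an assumption made for illustration and is not taken from the paper: the toy mentions, the two Jaccard-overlap features (a crude stand-in for the morphological and semantic features; phonetic features are omitted), the hand-written sub-gradient trainer (the paper's own SVM toolkit and settings are not shown here), and all function names.

```python
from itertools import combinations

# Hypothetical toy mentions: surface name, bag of context words, and a gold
# entity label that is used only to derive pairwise training labels.
MENTIONS = [
    {"name": "Li Ming", "context": {"physics", "university", "professor"}, "entity": "A"},
    {"name": "Li Ming", "context": {"physics", "professor", "laboratory"}, "entity": "A"},
    {"name": "Li Ming", "context": {"football", "club", "striker"}, "entity": "B"},
    {"name": "Li Ming", "context": {"football", "striker", "league"}, "entity": "B"},
    {"name": "IBM", "context": {"computer", "company", "mainframe"}, "entity": "C"},
    {"name": "International Business Machines",
     "context": {"computer", "company", "mainframe", "history"}, "entity": "C"},
]


def jaccard(a, b):
    """Jaccard overlap of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(m1, m2):
    """Feature vector for a mention pair (illustrative features only)."""
    return [
        jaccard(m1["name"].lower(), m2["name"].lower()),  # character overlap of names
        jaccard(m1["context"], m2["context"]),            # context-word overlap
        1.0,                                              # bias term
    ]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on the L2-regularised hinge loss; labels are +1/-1."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            for i, xi in enumerate(x):
                hinge_grad = -label * xi if margin < 1 else 0.0
                w[i] -= lr * (lam * w[i] + hinge_grad)
    return w


def build_chains(mentions, w):
    """Classify every pair and merge positive pairs with union-find."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        score = sum(wi * xi for wi, xi in zip(w, pair_features(mentions[i], mentions[j])))
        if score > 0:
            parent[find(i)] = find(j)

    chains = {}
    for i in range(len(mentions)):
        chains.setdefault(find(i), []).append(i)
    return sorted(chains.values())


# Build one training example per mention pair, train, and form chains.
pairs = list(combinations(range(len(MENTIONS)), 2))
X = [pair_features(MENTIONS[i], MENTIONS[j]) for i, j in pairs]
y = [1 if MENTIONS[i]["entity"] == MENTIONS[j]["entity"] else -1 for i, j in pairs]
weights = train_linear_svm(X, y)
print(build_chains(MENTIONS, weights))
```

In this toy set, the four "Li Ming" mentions exercise name disambiguation (one name, two entities), while "IBM" and "International Business Machines" exercise variant consolidation (two names, one entity). The gold chains are mentions 0-1, 2-3, and 4-5; whether the trained weights recover them depends on the features and hyperparameters chosen, which here are arbitrary.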

CLC number: TP391

ZHAO Zhiwei, GU Jinghang, HU Yanan, QIAN Longhua, ZHOU Guodong. Chinese cross document co-reference resolution based on SVM classification and semantics[J]. Journal of Computer Applications, 2013, 33(04): 984-987.

