Enterprise abbreviation prediction based on constitution pattern and conditional random field

Journal of Computer Applications, 2016, Vol. 36, Issue (2): 449-454. DOI: 10.11772/j.issn.1001-9081.2016.02.0449

The 3rd CCF Big Data Conference (CCF BigData 2015)

SUN Liping¹, GUO Yi¹,², TANG Wenwu¹, XU Yongbin¹

1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China;
2. College of Information Science and Technology, Shihezi University, Shihezi, Xinjiang 832007, China

Received: 2015-08-29  Revised: 2015-09-11  Online: 2016-02-03  Published: 2016-02-10
Corresponding author: GUO Yi (born 1975), male, from Wuxi, Jiangsu; professor, Ph.D.; main research interests: natural language processing, intelligent information processing, and ontology engineering.
About the authors: SUN Liping (born 1990), female, from Shangyu, Zhejiang; master's student; main research interest: natural language processing. TANG Wenwu (born 1992), male, from Fuding, Fujian; master's student; main research interest: natural language processing. XU Yongbin (born 1990), male, from Yangzhou, Jiangsu; master's student; main research interest: natural language processing.
Supported by:
the National Natural Science Foundation of China (61462073, 61272198).

Abstract

Abstract: As enterprise marketing deepens, enterprise abbreviations are widely used in news reports; as new (unknown) words, however, they are difficult to identify effectively. To address this problem, a method for predicting enterprise abbreviations based on constitution patterns and Conditional Random Field (CRF) was proposed. First, the constitution patterns of full enterprise names and their abbreviations were summarized from a linguistic perspective, and the Bi-gram algorithm was improved by combining a lexicon with rules, yielding the CBi-gram algorithm. CBi-gram performed structured segmentation of full enterprise names and improved the accuracy of core word recognition. Then, enterprise types were further subdivided according to the segmentation results, and abbreviation rule sets for the different enterprise types were built through manual summarization and rule self-learning. Finally, candidate abbreviation sets were generated from these type-specific rules, which reduced the noise that inapplicable rules introduce when abbreviations are generated for different types of enterprises. In addition, to compensate for the limitations of a purely rule-based approach on abbreviations that mix abbreviation of the full name with abbreviation of a shortened form, CRF was introduced to predict abbreviations statistically. Word, tone, and the position of the word within the components of the full name were selected as model features, so that the two methods complement each other. The experimental results show that the method achieves high accuracy and that the output abbreviation sets basically cover the commonly used abbreviations of enterprises.

Key words: enterprise abbreviation, constitution pattern, abbreviation prediction, core word recognition, Conditional Random Field (CRF)
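The rule-based step described in the abstract (segment the full name, pick the rule set for the enterprise type, generate candidate abbreviations) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the four-part name layout, the rule names, the type labels, and the example company name are all hypothetical, and the segmentation is supplied by hand rather than produced by CBi-gram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentedName:
    """A full enterprise name already split into components (assumed layout)."""
    region: str    # administrative region, e.g. "上海"
    core: str      # core word (trade name), e.g. "林原"
    industry: str  # industry description, e.g. "信息科技"
    org_form: str  # organisational form, e.g. "有限公司"


# Hypothetical rules: each maps a segmented name to one candidate abbreviation.
RULES = {
    "core_only": lambda n: n.core,
    "region_core": lambda n: n.region + n.core,
    "core_industry": lambda n: n.core + n.industry,
    "core_industry_head": lambda n: n.core + n.industry[:2],
}

# Hypothetical mapping from enterprise type to the rules considered applicable.
# Restricting rules per type is what keeps unsuitable rules from adding noise.
RULES_BY_TYPE = {
    "technology": ["core_only", "core_industry", "core_industry_head"],
    "default": ["core_only", "region_core"],
}


def candidate_abbreviations(name: SegmentedName, enterprise_type: str) -> list[str]:
    """Apply only the rules registered for this enterprise type, in order."""
    rule_names = RULES_BY_TYPE.get(enterprise_type, RULES_BY_TYPE["default"])
    candidates: list[str] = []
    for rule_name in rule_names:
        candidate = RULES[rule_name](name)
        # Skip empty results and duplicates while preserving rule order.
        if candidate and candidate not in candidates:
            candidates.append(candidate)
    return candidates


# Illustrative name, segmented by hand for the purpose of this sketch.
example = SegmentedName(region="上海", core="林原", industry="信息科技", org_form="有限公司")
tech_candidates = candidate_abbreviations(example, "technology")
default_candidates = candidate_abbreviations(example, "unknown-type")
```

In the approach the abstract describes, a candidate set of this kind would be complemented by CRF predictions for abbreviations that no rule covers; that statistical step is not shown here.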

CLC Number: TP393

SUN Liping, GUO Yi, TANG Wenwu, XU Yongbin. Enterprise abbreviation prediction based on constitution pattern and conditional random field[J]. Journal of Computer Applications, 2016, 36(2): 449-454.

