Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (2): 449-454. DOI: 10.11772/j.issn.1001-9081.2016.02.0449

• The 3rd CCF Academic Conference on Big Data (CCF BigData 2015) •

基于构成模式和条件随机场的企业简称预测

孙丽萍1, 过弋1,2, 唐文武1, 徐永斌1   

  1. 华东理工大学 信息科学与工程学院, 上海 200237;
    2. 石河子大学 信息科学与技术学院, 新疆 石河子 832007
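  • Received: 2015-08-29 Revised: 2015-09-11 Publication date: 2016-02-10 Release date: 2016-02-03
  • Corresponding author: GUO Yi (1975-), male, born in Wuxi, Jiangsu; professor, Ph.D. Main research interests: natural language processing, intelligent information processing, ontology engineering.
  • About the authors: SUN Liping (1990-), female, born in Shangyu, Zhejiang; master's candidate. Main research interest: natural language processing. TANG Wenwu (1992-), male, born in Fuding, Fujian; master's candidate. Main research interest: natural language processing. XU Yongbin (1990-), male, born in Yangzhou, Jiangsu; master's candidate. Main research interest: natural language processing.
  • Supported by:
    National Natural Science Foundation of China (61462073, 61272198).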

Enterprise abbreviation prediction based on constitution pattern and conditional random field

SUN Liping1, GUO Yi1,2, TANG Wenwu1, XU Yongbin1   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China;
    2. College of Information Science and Technology, Shihezi University, Shihezi, Xinjiang 832007, China
  • Received: 2015-08-29 Revised: 2015-09-11 Online: 2016-02-10 Published: 2016-02-03

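Abstract: As enterprise marketing deepens, enterprise abbreviations are widely used in news reports, yet as new words they are hard to recognize effectively. To address this problem, an enterprise abbreviation prediction method based on constitution patterns and Conditional Random Field (CRF) was proposed. First, the constitution patterns of enterprise full names and abbreviations were summarized from a linguistic perspective, and the Bi-gram algorithm was improved by combining a lexicon with rules, yielding the CBi-gram algorithm, which performs structured segmentation of enterprise full names and improves the accuracy of core word recognition within them. Then, enterprise types were further subdivided according to the segmentation results, and abbreviation rule sets for the different enterprise types were built through manual summarization and rule self-learning. Finally, candidate abbreviation sets were generated from these rules, which reduced the noise that inapplicable rules introduce when generating abbreviations for enterprises of different types. In addition, to compensate for the limitations of a purely rule-based approach in handling mixtures of full-name abbreviation and shortened-form abbreviation, CRF was introduced to predict abbreviations statistically; the word, its tone, and its position within the components of the full name were selected as model features for training, so that the two methods complement each other. The experimental results show that the method achieves high accuracy and that the output abbreviation sets largely cover the abbreviations commonly used for enterprises.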

Keywords: enterprise abbreviation, constitution pattern, abbreviation prediction, core word recognition, conditional random field

Abstract: With the continuous development of enterprise marketing, enterprise abbreviations have come into wide use. Nevertheless, as one of the main sources of unknown words, enterprise abbreviations cannot be identified effectively. A method for predicting enterprise abbreviations based on constitution patterns and Conditional Random Field (CRF) was proposed. First, the constitution patterns of enterprise names and abbreviations were summarized from a linguistic perspective, and the Bi-gram algorithm was improved by combining a lexicon with rules, yielding the CBi-gram algorithm. CBi-gram was used to segment enterprise names automatically and to improve the recognition accuracy of the core word of an enterprise name. Then the enterprise types were subdivided according to the CBi-gram segmentation, and abbreviation rule sets for each type were collected by manual summarization and a self-learning method, which reduced the noise caused by unsuitable rules. In addition, to compensate for the limitations of hand-built rules in handling names that mix different abbreviation styles, CRF was introduced to generate enterprise abbreviations statistically, with the word, its tone and its position used as features to train the model as a supplement to the rules. The experimental results show that the method performs well and that its output largely covers the range of commonly used enterprise abbreviations.

Key words: enterprise abbreviation, constitution pattern, abbreviation prediction, core word recognition, Conditional Random Field (CRF)
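
The rule-based step summarized in the abstract (segment the full name into components, then apply abbreviation rules to produce a candidate set) might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the component labels (`region`, `core`, `industry`, `org`), the four rules, and the sample name are hypothetical, and the paper's actual CBi-gram segmentation, type-specific rule sets and CRF model are not reproduced here.

```python
from typing import Dict, List

# Hypothetical component labels for an already-segmented full name.
# The abstract does not specify the paper's real segmentation scheme.
Components = Dict[str, str]


def candidate_abbreviations(parts: Components) -> List[str]:
    """Apply a few illustrative rules to build a candidate abbreviation set."""
    region = parts.get("region", "")
    core = parts.get("core", "")
    industry = parts.get("industry", "")
    candidates: List[str] = []

    def add(name: str) -> None:
        # Preserve insertion order; skip empty strings and duplicates.
        if name and name not in candidates:
            candidates.append(name)

    if core:
        add(core)                        # rule 1: core word alone
        add(region + core)               # rule 2: region + core word
        add(core + industry)             # rule 3: core word + industry
        if industry:
            add(core[0] + industry[0])   # rule 4: first characters of core and industry
    return candidates


# Made-up enterprise name, segmented by hand for the example.
example = {"region": "上海", "core": "华星", "industry": "科技", "org": "有限公司"}
print(candidate_abbreviations(example))
# ['华星', '上海华星', '华星科技', '华科']
```

In the approach the abstract describes, such a candidate set would be produced per enterprise type and complemented by CRF predictions; the sketch covers only the rule side and applies the same rules to every name.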
