Journal of Computer Applications (计算机应用), 2020, Vol. 40, Issue (3): 631-637. DOI: 10.11772/j.issn.1001-9081.2019071193

• Artificial Intelligence •

Novel bidirectional aggregation degree feature extraction method for patent new word discovery (专利新词发现的双向聚合度特征提取新方法)

CHEN Meijie (陈梅婕)1,2, XIE Zhenping (谢振平)1,2, CHEN Xiaoqi (陈晓琪)1,2, XU Peng (许鹏)3

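    1. College of Digital Media, Jiangnan University (江南大学), Wuxi, Jiangsu 214122, China;
    2. Jiangsu Key Laboratory of Media Design and Software Technology (Jiangnan University), Wuxi, Jiangsu 214122, China;
    3. Changzhou Baiteng Technology Company Limited (常州佰腾科技有限公司), Changzhou, Jiangsu 213164, China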
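  • Received: 2019-07-10  Revised: 2019-09-01  Publication date: 2020-03-10  Release date: 2019-09-11
  • Corresponding author: XIE Zhenping (谢振平)
  • About the authors: CHEN Meijie (born 1995), female, from Yixing, Jiangsu, master's student; main research interests: machine learning, natural language processing. XIE Zhenping (born 1979), male, from Changzhou, Jiangsu, professor, Ph.D., CCF member; main research interests: knowledge representation, cognitive learning. CHEN Xiaoqi (born 1994), female, from Wuhan, Hubei, master's student; main research interests: machine learning, data mining. XU Peng (born 1983), male, from Chongqing, master's degree; main research interest: patent big data mining.
  • Supported by:
    National Natural Science Foundation of China (61872166).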

Novel bidirectional aggregation degree feature extraction method for patent new word discovery

CHEN Meijie1,2, XIE Zhenping1,2, CHEN Xiaoqi1,2, XU Peng3   

    1. College of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China;
    2. Jiangsu Key Laboratory of Media Design and Software Technology (Jiangnan University), Wuxi, Jiangsu 214122, China;
    3. Changzhou Baiteng Technology Company Limited, Changzhou, Jiangsu 213164, China
  • Received: 2019-07-10  Revised: 2019-09-01  Online: 2020-03-10  Published: 2019-09-11
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61872166).

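Abstract: General-purpose new word discovery methods recognize long patent words poorly, part-of-speech collocation templates for patent terms are inflexible, and unsupervised methods for recognizing long Chinese patent words are lacking. To address these problems, a novel bidirectional aggregation degree feature extraction method for patent new word discovery is proposed. First, a bidirectional aggregation degree statistical feature on two-component words is constructed from the bidirectional conditional probability statistics of the components within a word. Second, this feature is extended into a word boundary filtering rule. Finally, patent new words are extracted using the new feature and the word boundary rule. Experimental results show that the new method improves the overall F-measure by 6.7 percentage points over a general-domain new word discovery method and by 19.2 and 17.2 percentage points, respectively, over two recent patent part-of-speech collocation template methods, and that it markedly improves the F-measure for patent new words of 4 to 8 characters. Overall, the proposed method improves patent new word discovery performance, extracts compound long words from patent text more effectively, and reduces dependence on pre-training and on additional complex rule bases, making it more practical.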

Keywords: new word discovery, bidirectional aggregation degree, patent new word, feature extraction, patent analysis

Abstract: To address the poor performance of general new word discovery methods on long patent words, the low flexibility of part-of-speech collocation templates for patent terminology, and the lack of unsupervised methods for Chinese patent long word recognition, a novel bidirectional aggregation degree feature extraction method for patent new word discovery was proposed. Firstly, a bidirectional aggregation degree statistical feature for two-word terms was constructed from the bidirectional conditional probabilities between the first and last words of the term. Secondly, a word boundary filtering rule was derived by extending this feature. Finally, new patent words were extracted by combining the aggregation degree feature and the word boundary filtering rule. Experimental results show that the new method improves the overall F-score by 6.7 percentage points compared with a new word discovery method for the general domain, improves the overall F-score by 19.2 and 17.2 percentage points respectively compared with two recent patent terminology part-of-speech collocation template methods, and significantly improves the F-score for the discovery of new words with 4 to 8 characters. In summary, the proposed method improves the performance of patent new word discovery and extracts highly compound long words in patent documents more effectively, while reducing the reliance on pre-training processes and extra complex rule bases, which gives it better practicality.

Key words: new word discovery, bidirectional aggregation degree, patent new word, feature extraction, patent analysis
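
The abstract describes the feature only at a high level, so the following Python sketch is an illustration rather than the authors' formulation. It estimates the two conditional probabilities for each adjacent token pair, P(right | left) and P(left | right), from pair counts. The function names, the `threshold` parameter, the toy corpus, and in particular the choice to combine the two probabilities with `min` are all assumptions made for this example; the paper's actual aggregation degree formula and boundary filtering rule are not given in the abstract.

```python
from collections import Counter


def bidirectional_scores(sentences):
    """Map each adjacent pair (left, right) to (P(right | left), P(left | right))."""
    pair_counts = Counter()
    left_counts = Counter()   # times a token occurs as the left member of a pair
    right_counts = Counter()  # times a token occurs as the right member of a pair
    for tokens in sentences:
        for left, right in zip(tokens, tokens[1:]):
            pair_counts[(left, right)] += 1
            left_counts[left] += 1
            right_counts[right] += 1
    scores = {}
    for (left, right), n in pair_counts.items():
        p_forward = n / left_counts[left]     # P(right | left)
        p_backward = n / right_counts[right]  # P(left | right)
        scores[(left, right)] = (p_forward, p_backward)
    return scores


def aggregation_degree(p_forward, p_backward):
    # Assumption for illustration only: a pair is cohesive when BOTH
    # directions are strong, so the weaker direction is taken as the score.
    return min(p_forward, p_backward)


def candidate_pairs(sentences, threshold=0.5):
    """Return the sorted pairs whose assumed aggregation degree meets the threshold."""
    scores = bidirectional_scores(sentences)
    return sorted(
        pair
        for pair, (p_fwd, p_bwd) in scores.items()
        if aggregation_degree(p_fwd, p_bwd) >= threshold
    )


# Hypothetical, already-segmented toy corpus.
corpus = [
    ["lithium", "battery", "pack"],
    ["lithium", "battery", "cell"],
    ["solar", "cell"],
]
print(candidate_pairs(corpus, threshold=0.6))
```

On this toy corpus only the pair ("lithium", "battery") is predicted in both directions, so it is the only one that passes a threshold of 0.6. Pairs that are strong in one direction only, such as ("solar", "cell"), are rejected under the `min` assumption.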

CLC number: