Novel bidirectional aggregation degree feature extraction method for patent new word discovery

doi:10.11772/j.issn.1001-9081.2019071193

Journal of Computer Applications, 2020, Vol. 40, Issue (3): 631-637. DOI: 10.11772/j.issn.1001-9081.2019071193

CHEN Meijie^1,2, XIE Zhenping^1,2, CHEN Xiaoqi^1,2, XU Peng^3

1. College of Digital Media, Jiangnan University, Wuxi Jiangsu 214122, China;
2. Jiangsu Key Laboratory of Media Design and Software Technology (Jiangnan University), Wuxi Jiangsu 214122, China;
3. Changzhou Baiteng Technology Company Limited, Changzhou Jiangsu 213164, China

Received: 2019-07-10 Revised: 2019-09-01 Online: 2019-09-11 Published: 2020-03-10
Corresponding author: XIE Zhenping
About the authors: CHEN Meijie (born 1995), female, from Yixing, Jiangsu, M.S. candidate; research interests: machine learning, natural language processing. XIE Zhenping (born 1979), male, from Changzhou, Jiangsu, professor, Ph.D., CCF member; research interests: knowledge representation, cognitive learning. CHEN Xiaoqi (born 1994), female, from Wuhan, Hubei, M.S. candidate; research interests: machine learning, data mining. XU Peng (born 1983), male, from Chongqing, M.S.; research interests: patent big data mining.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61872166).

Abstract

General-purpose new word discovery methods recognize long patent words poorly, part-of-speech collocation templates for patent terminology are inflexible, and unsupervised methods for recognizing long Chinese patent words are lacking. To address these problems, a novel bidirectional aggregation degree feature extraction method for patent new word discovery is proposed. First, a bidirectional aggregation degree statistical feature on two-component words is constructed from the bidirectional conditional probability statistics of the components within a word. Second, this feature is extended into a word boundary filtering rule. Finally, new patent words are extracted by combining the new feature with the word boundary rule. Experimental results show that the new method improves the overall F-measure by 6.7 percentage points compared with a general-domain new word discovery method, and by 19.2 and 17.2 percentage points respectively compared with two recent patent part-of-speech collocation template methods, and that it markedly improves the F-measure for discovering new patent words of 4 to 8 characters. Overall, the proposed method improves the performance of patent new word discovery, extracts compound-form long words from patent text more effectively, and reduces the reliance on pre-training and on additional complex rule bases, which makes it more practical.

Key words: new word discovery, bidirectional aggregation degree, patent new word, feature extraction, patent analysis
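
The first step of the method, scoring a two-component word by conditional probabilities taken in both directions, can be illustrated with a minimal sketch. The abstract does not give the actual formula, so the combination used here (the geometric mean of the forward and backward conditional probabilities) is an assumption, as are the function name `bidirectional_aggregation` and the toy token list. The boundary filtering rule and the final extraction step are not sketched, since the abstract does not specify them.

```python
from collections import Counter
from math import sqrt


def bidirectional_aggregation(tokens):
    """Score each adjacent token pair using conditional probabilities in both directions.

    Hypothetical illustration only: the geometric-mean combination is an
    assumption, not the formula defined in the paper.
    """
    left_counts = Counter(tokens[:-1])   # how often a token is the left component of a pair
    right_counts = Counter(tokens[1:])   # how often a token is the right component of a pair
    pair_counts = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (a, b), n in pair_counts.items():
        forward = n / left_counts[a]     # estimate of P(b follows | a)
        backward = n / right_counts[b]   # estimate of P(a precedes | b)
        scores[(a, b)] = sqrt(forward * backward)  # assumed way of combining the two
    return scores


# Toy, pre-segmented input (illustrative data, not from the paper).
tokens = ["lithium", "battery", "pack", "lithium", "battery", "cell"]
for pair, score in sorted(bidirectional_aggregation(tokens).items()):
    print(pair, round(score, 3))
```

In this toy input, "lithium" is always followed by "battery" and "battery" is always preceded by "lithium", so that pair scores 1.0, while ("battery", "pack") scores lower because "battery" is also followed by "cell". A pair seen only once can still score 1.0, so a real implementation would likely also need a minimum frequency threshold.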

CLC number: TP391.1

CHEN Meijie, XIE Zhenping, CHEN Xiaoqi, XU Peng. Novel bidirectional aggregation degree feature extraction method for patent new word discovery[J]. Journal of Computer Applications, 2020, 40(3): 631-637.

