Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (4): 1072-1079. DOI: 10.11772/j.issn.1001-9081.2023040532
• Artificial Intelligence •
Junjie ZHU1, Li YU2, Shengwen LI1, Changzheng ZHOU3
Received:
2023-05-04
Revised:
2023-10-12
Accepted:
2023-10-12
Online:
2024-04-22
Published:
2024-04-10
Contact:
Shengwen LI
About author:
ZHU Junjie, born in 1999 in Wuhan, Hubei, M. S. candidate. His research interests include named entity recognition.
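Abstract:
Technology terms are the terms used in science and technology to communicate information accurately. Recognizing them automatically helps experts and the public discover, understand and apply new technologies, and is therefore of great value. However, unsupervised methods for recognizing technology terms suffer from complex rules and poor adaptability. To improve the recognition of technology terms in text, a technology term recognition method that incorporates constituency parsing was proposed. Firstly, a syntactic structure tree was constructed by constituency parsing. Secondly, candidate technology terms were extracted from two perspectives, top-down and bottom-up. Finally, statistical frequency and semantic information were fused to select the optimal technology terms. In addition, a technology term dataset was constructed to verify the effectiveness of the proposed method. Experimental results on this dataset show that, compared with the dependency-based method, the proposed bottom-up method improves the F1 score by 4.55 percentage points. A case study in the 3D printing field found that the technology terms recognized by the proposed method are consistent with the development of the field, and that they can be used to trace the development history of technologies and depict their evolution paths, providing a reference for understanding, discovering and exploring future technologies in the field.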
Junjie ZHU, Li YU, Shengwen LI, Changzheng ZHOU. Technology term recognition with comprehensive constituency parsing[J]. Journal of Computer Applications, 2024, 44(4): 1072-1079.
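POS tag | Explanation | POS tag | Explanation |
---|---|---|---|
NN | Common noun | CD | Cardinal number |
NT | Temporal noun | JJ | Adjective |
VV | Verb | RB | Adverb |
VC | "是" (be) | IN | Preposition |
VE | "有" (have) | PU | Punctuation |
PT | Pronoun | | |
Tab. 1 Explanation of part-of-speech tags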
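Constituent label | Explanation |
---|---|
Root | Sentence being processed |
IP | Simple clause |
NP | Noun phrase |
VP | Verb phrase |
ADJP | Adjective phrase |
DNP | Phrase formed with "的" that expresses a possessive relation |
LCP | Localizer phrase |
PP | Prepositional phrase |
ADVP | Adverb phrase |
DP | Determiner phrase |
CP | Phrase formed with "的" that expresses a modifying relation |
Tab. 2 Explanation of constituent grammar structure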
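The labels in Tab. 2 can be used to pull noun-phrase candidates out of a constituency tree. The sketch below is only an illustration under stated assumptions: this page does not spell out the extraction rules, so "top-down" is read here as keeping the outermost NP on each path from the root, and "bottom-up" as keeping the innermost NPs. The bracketed parse is hand-written, not output from the authors' parser, and every function name is hypothetical.

```python
from typing import List


def parse_bracketed(text: str):
    """Parse a Penn-style bracketed string into (label, children) tuples.

    Leaf children are plain token strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(node) -> List[str]:
    """Collect the tokens under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [tok for child in node[1] for tok in leaves(child)]


def has_np(node) -> bool:
    """True if the node is an NP or contains one."""
    if isinstance(node, str):
        return False
    return node[0] == "NP" or any(has_np(c) for c in node[1])


def top_down_candidates(node) -> List[str]:
    """Assumed reading of top-down: stop at the first NP met from the root."""
    if isinstance(node, str):
        return []
    if node[0] == "NP":
        return [" ".join(leaves(node))]
    return [s for c in node[1] for s in top_down_candidates(c)]


def bottom_up_candidates(node) -> List[str]:
    """Assumed reading of bottom-up: keep NPs that contain no other NP."""
    if isinstance(node, str):
        return []
    if node[0] == "NP" and not any(has_np(c) for c in node[1]):
        return [" ".join(leaves(node))]
    return [s for c in node[1] for s in bottom_up_candidates(c)]


# Hand-written example parse (an assumption, for illustration only).
PARSE = (
    "(Root (IP (NP (NN team)) (VP (VV proposed) "
    "(NP (NP (NN hydrogel)) (NP (NN 3D-printing) (NN technology))))))"
)
tree = parse_bracketed(PARSE)
top_down = top_down_candidates(tree)
bottom_up = bottom_up_candidates(tree)
print(top_down)
print(bottom_up)
```

Under these assumed readings, the top-down pass returns longer, more specific phrases, while the bottom-up pass returns shorter building blocks that a later step could merge or filter.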
Domain | Number of articles |
---|---|
Advanced manufacturing | 2 437 |
New materials | 2 219 |
Information | 4 981 |
Ocean | 1 986 |
Space | 2 439 |
Aviation | 2 715 |
Energy | 3 354 |
Biology | 3 753 |
Tab. 3 Dataset introduction
Method | Precision | Recall | F1 |
---|---|---|---|
Regular expression | 65.00 | 19.85 | 30.41 |
Regular expression + POS tagging | 56.82 | 38.17 | 45.66 |
Dependency relations | 78.87 | 42.75 | 55.45 |
keyBERT | 30.40 | 39.69 | 34.43 |
Prompt learning | 34.31 | 44.27 | 38.66 |
GPT-NER | 48.74 | 72.93 | 58.43 |
Top-down | 71.25 | 43.51 | 54.03 |
Bottom-up | 74.16 | 50.38 | 60.00 |
Tab. 4 Experimental results of different methods (%)
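The F1 column in Tab. 4 follows from the precision and recall columns through the standard harmonic-mean formula. The short check below recomputes two rows and the 4.55-point gap quoted in the abstract; the function and variable names are illustrative, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, taken from Tab. 4.
rows = {
    "Dependency relations": (78.87, 42.75),
    "Bottom-up": (74.16, 50.38),
}

f1 = {name: round(f1_score(p, r), 2) for name, (p, r) in rows.items()}
gain = round(f1["Bottom-up"] - f1["Dependency relations"], 2)

for name, value in f1.items():
    print(f"{name}: F1 = {value:.2f}")
print(f"Gain of bottom-up over dependency relations: {gain:.2f} points")
```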
Method | Strategy | F1 |
---|---|---|
Top-down | w/o semantic information | 55.40 |
Top-down | w/o statistical frequency | 53.33 |
Bottom-up | w/o semantic information | 59.36 |
Bottom-up | w/o statistical frequency | 60.00 |
Tab. 5 Experimental results of different search strategies (%)
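The ablation in Tab. 5 switches off either the semantic signal or the frequency signal when choosing among candidates. This page does not give the fusion formula, so the sketch below is an assumption throughout: it combines a normalized frequency with a cosine similarity through a weighted sum, and the weight `alpha`, the toy vectors, the counts and all names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_term(
    vectors: Dict[str, Sequence[float]],
    counts: Dict[str, int],
    context: Sequence[float],
    alpha: float = 0.5,
    use_frequency: bool = True,
    use_semantics: bool = True,
) -> str:
    """Pick the candidate with the highest fused score (assumed weighted sum)."""
    max_count = max(max(counts.values()), 1)

    def score(term: str) -> float:
        freq = counts[term] / max_count if use_frequency else 0.0
        sem = cosine(vectors[term], context) if use_semantics else 0.0
        return alpha * freq + (1 - alpha) * sem

    return max(vectors, key=score)


# Toy inputs, invented for illustration only.
vectors = {
    "hydrogel 3D-printing technology": [0.9, 0.1],
    "team": [0.1, 0.9],
}
counts = {"hydrogel 3D-printing technology": 3, "team": 5}
context = [1.0, 0.0]  # stand-in for an embedding of the surrounding text

full = select_term(vectors, counts, context)
no_semantics = select_term(vectors, counts, context, use_semantics=False)
no_frequency = select_term(vectors, counts, context, use_frequency=False)
print(full, "|", no_semantics, "|", no_frequency)
```

With these toy inputs, dropping the semantic signal lets the more frequent but generic word win, which illustrates why the two signals might be combined rather than used alone.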
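Year | 3D printing technology terms |
---|---|
2015 | Technology for 3D printing skin and joints; 3D printing technology for capillaries; titanium alloy 3D printing technology of the Institute of Metal Research; copper alloy 3D printing technology; hydrogel 3D printing technology; ... |
2016 | 3D-printed titanium thumb bone transplant; brand-new metal 3D printing process; super-strong 3D-printed ceramic that withstands high temperatures of 1 400 degrees; 3D-printed drone that can take off from underwater; 3D printing bio-ink that needs no ultraviolet irradiation; ... |
2017 | A silicone artificial heart; XJET, a new jetting-based metal 3D printing technology; highly stretchable UV-curable 3D-printed elastomer; new technology for printing high-strength aluminum alloy; new desktop 3D printer that can increase printing speed tenfold; ... |
2018 | Low-temperature 3D printing technology that can print ultra-soft biological structures; 3D-printed composite foam material for deep-sea submarines; method for 3D printing highly conductive gallium alloy; laser melting technology; ... |
2019 | 3D-printed ceramic body armor; 3D-printed epoxy resin carbon fiber composite; high-strength steel for 3D-printed ammunition; high-speed nano 3D printing technology; metal 3D printer that can automatically remove defective layers; ... |
2020 | 3D printing technology for porous metal materials; handheld 3D bioprinter that can treat skeletal muscle injuries; customizable 3D-printed vascular stent; recyclable, self-healing polymer 3D printing material; new method for 3D printing gels and soft materials at the nanoscale; ... |
2021 | New ceramic-based ink; portable 3D-printed device that can harvest water from air; efficient technology for 3D printing biological tissue; multi-purpose 3D bioprinter; nanoscale 3D printing for creating high-resolution light field prints; ... |
Tab. 6 Technology terms of 3D printing in 2015-2021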