Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (9): 2686-2692. DOI: 10.11772/j.issn.1001-9081.2021071317
• Artificial Intelligence •
Xudong HOU, Fei TENG, Yi ZHANG
Received: 2021-07-22
Revised: 2021-10-22
Accepted: 2021-10-25
Online: 2022-09-19
Published: 2022-09-10
Contact: Fei TENG
About author: HOU Xudong, born in 1996 in Nanyang, Henan, M. S. candidate. His research interests include medical big data analysis.
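Abstract:
As networks deepen, deep-learning-based models for Medical Named Entity Recognition (MNER) show an imbalance between recognition accuracy and computing-power requirements. To address this problem, a medical named entity recognition model based on deep auto-encoding, CasSAttMNER, is proposed. First, a strategy that balances the depth difference between encoding and decoding is adopted: the distilled Transformer language model RBT6 serves as the encoder, which reduces the encoding depth and lowers the computing power needed for training and application. Then, a cascaded multi-task dual decoder is built from a Bidirectional Long Short-Term Memory (BiLSTM) network and a Conditional Random Field (CRF) to perform entity mention sequence labeling and entity class determination. Finally, based on the self-attention mechanism, the implicit decoding information extracted during entity mention labeling is added to the entity class decoding to optimize the model design. Experimental results show that CasSAttMNER reaches F values of 0.9439 and 0.9457 on two Chinese medical entity datasets, which are 3 and 8 percentage points higher than those of the baseline models, respectively, verifying that the model further improves decoder performance.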
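The way the two cascaded decoders' outputs fit together can be pictured with a minimal sketch in plain Python. It rests on assumptions that are not stated in the source: the first decoder is taken to emit class-agnostic B/I/O mention tags, the second to emit one class label per character, and a span is given the majority class of its characters. `mention_spans` and `cascade_decode` are hypothetical helper names, the example sentence is made up, and the trained RBT6, BiLSTM, CRF, and self-attention components are replaced by hard-coded outputs.

```python
from collections import Counter

def mention_spans(tags):
    """Turn class-agnostic B/I/O tags into (start, end) spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                      # a new mention begins here
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":                    # outside any mention
            if start is not None:
                spans.append((start, i))
            start = None
        elif tag == "I" and start is None:  # tolerate an I without a preceding B
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def cascade_decode(text, mention_tags, class_labels):
    """Combine both decoder outputs into (mention text, class, start, end)."""
    entities = []
    for start, end in mention_spans(mention_tags):
        # Assumed rule: majority vote over the per-character class labels.
        label = Counter(class_labels[start:end]).most_common(1)[0][0]
        entities.append((text[start:end], label, start, end))
    return entities

# Hard-coded stand-ins for the outputs of the two decoders (illustrative only).
text = "胃癌术后复查CT"
mention_tags = ["B", "I", "O", "O", "O", "O", "B", "I"]
class_labels = ["disease", "disease", "none", "none",
                "none", "none", "imaging", "imaging"]
entities = cascade_decode(text, mention_tags, class_labels)
print(entities)
```

Splitting boundary detection from class determination keeps the first tag set small (three tags instead of one B/I pair per class), which is one common motivation for cascaded designs; whether the authors chose it for that reason is not stated here.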
Xudong HOU, Fei TENG, Yi ZHANG. Medical named entity recognition model based on deep auto-encoding[J]. Journal of Computer Applications, 2022, 42(9): 2686-2692.
Dataset | Disease and diagnosis | Operation | Anatomical site | Drug | Imaging examination | Laboratory test | Total texts |
---|---|---|---|---|---|---|---|
CCKS-19 | 4212 | 1029 | 8426 | 1822 | 969 | 1195 | 1000 |
CCKS-20 | 4345 | 923 | 8811 | 1935 | 1002 | 1297 | 1050 |
Tab. 1 Entity class and quantity statistics in datasets
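The class balance implied by Tab. 1 can be checked with a short calculation. The sketch below copies the entity counts from the table and reports each dataset's total and largest class; the English class names and the helper `class_shares` are introduced here for illustration only.

```python
# Entity counts copied from Tab. 1 (class names are English renderings).
ENTITY_COUNTS = {
    "CCKS-19": {"disease and diagnosis": 4212, "operation": 1029,
                "anatomical site": 8426, "drug": 1822,
                "imaging examination": 969, "laboratory test": 1195},
    "CCKS-20": {"disease and diagnosis": 4345, "operation": 923,
                "anatomical site": 8811, "drug": 1935,
                "imaging examination": 1002, "laboratory test": 1297},
}

def class_shares(class_counts):
    """Return each class's fraction of all annotated entities."""
    total = sum(class_counts.values())
    return {name: count / total for name, count in class_counts.items()}

for dataset, class_counts in ENTITY_COUNTS.items():
    shares = class_shares(class_counts)
    largest = max(shares, key=shares.get)
    print(f"{dataset}: {sum(class_counts.values())} entities, "
          f"largest class = {largest} ({shares[largest]:.1%})")
```

In both datasets, anatomical site mentions make up close to half of all annotated entities, so aggregate scores are weighted heavily toward that class.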
Model | CCKS-19 | CCKS-20 |
---|---|---|
Reference [ | 0.8562 | |
Reference [ | 0.8516 | |
Model fusion + rules [ | | 0.9154 |
ChiEHRBert + entity fusion [ | | 0.9124 |
Ensemble [ | | 0.9051 |
CasSAttMNER | 0.9439 | 0.9457 |
Tab. 2 FE evaluation statistics of each model
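This page does not define how the scores in Tab. 2 are computed. As a rough illustration, the sketch below assumes the common strict-match convention, in which a predicted entity counts as correct only if its position and class both match a gold entity, and F is the harmonic mean of precision and recall. The function name `f_value` and the sample entities are made up for the example.

```python
def f_value(gold, predicted):
    """Strict-match precision, recall, and F over sets of entity tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Entities are (document id, start, end, class); the values are illustrative.
gold = [(0, 0, 2, "disease"), (0, 6, 8, "imaging"), (1, 3, 5, "drug")]
pred = [(0, 0, 2, "disease"), (0, 6, 8, "drug"), (1, 3, 5, "drug")]
precision, recall, f = f_value(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F={f:.4f}")
```

Under this convention the second prediction, which has the right boundaries but the wrong class, counts as both a false positive and a false negative.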
Dataset | Model | Disease and diagnosis | Laboratory test | Operation | Drug | Anatomical site | Imaging examination |
---|---|---|---|---|---|---|---|
CCKS-19 | Reference [ | 0.8429 | 0.7694 | 0.8333 | 0.9602 | 0.8618 | 0.8629 |
CCKS-19 | Reference [ | 0.8281 | 0.7565 | 0.8679 | 0.9449 | 0.8599 | 0.8801 |
CCKS-19 | CasSAttMNER | 0.9429 | 0.9306 | 0.9091 | 0.9129 | 0.9549 | 0.9741 |
CCKS-20 | Model fusion + rules [ | 0.9053 | 0.8350 | 0.9621 | 0.9375 | 0.9200 | 0.8847 |
CCKS-20 | Entity fusion [ | 0.9110 | 0.8571 | 0.9552 | 0.9293 | 0.9116 | 0.8862 |
CCKS-20 | Ensemble [ | 0.8992 | 0.8503 | 0.9375 | 0.9310 | 0.9043 | 0.8769 |
CCKS-20 | CasSAttMNER | 0.9262 | 0.9542 | 0.9322 | 0.9401 | 0.9565 | 0.9600 |
Tab. 3 Entity F value measure statistics of each model
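The per-class picture in Tab. 3 is easier to read as gaps in percentage points between CasSAttMNER and the best baseline for each class. The sketch below copies the F values from the table; the English class names and the helper `gaps_in_points` are introduced here for illustration.

```python
CLASSES = ["disease and diagnosis", "laboratory test", "operation",
           "drug", "anatomical site", "imaging examination"]

# F values copied from Tab. 3, in the order of CLASSES.
BASELINES = {
    "CCKS-19": [[0.8429, 0.7694, 0.8333, 0.9602, 0.8618, 0.8629],
                [0.8281, 0.7565, 0.8679, 0.9449, 0.8599, 0.8801]],
    "CCKS-20": [[0.9053, 0.8350, 0.9621, 0.9375, 0.9200, 0.8847],
                [0.9110, 0.8571, 0.9552, 0.9293, 0.9116, 0.8862],
                [0.8992, 0.8503, 0.9375, 0.9310, 0.9043, 0.8769]],
}
PROPOSED = {
    "CCKS-19": [0.9429, 0.9306, 0.9091, 0.9129, 0.9549, 0.9741],
    "CCKS-20": [0.9262, 0.9542, 0.9322, 0.9401, 0.9565, 0.9600],
}

def gaps_in_points(proposed, baselines):
    """Gap to the best baseline per class, in percentage points."""
    best = [max(column) for column in zip(*baselines)]
    return {name: (p - b) * 100 for name, p, b in zip(CLASSES, proposed, best)}

gaps = {d: gaps_in_points(PROPOSED[d], BASELINES[d]) for d in PROPOSED}
for dataset, per_class in gaps.items():
    for name, gap in per_class.items():
        print(f"{dataset} {name}: {gap:+.2f}")
```

On these numbers CasSAttMNER leads in five of six classes on each dataset; it trails the best baseline on drug for CCKS-19 and on operation for CCKS-20, and its largest gains are on laboratory test.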
1 | National Health and Family Planning Commission of the People’s Republic of China. Standard for basic data sets of electronic medical record: [S]. Beijing: China Standard Press, 2014. 10.3969/j.issn.1672-7185.2019.02.002 |
2 | General Office of the National Health Commission. Notice on printing and distributing the management standards for the application of electronic medical records (for trial implementation)[EB/OL]. (2017-02-23) [2021-05-14].. 10.31901/24566764.2014/05.02.02 |
3 | General Office of the National Health Commission. Notice on issuing the administrative measures (trial) and evaluation standards (trial) for the application level evaluation of the electronic medical record system[EB/OL]. (2018-12-09) [2021-05-14].. 10.37544/0720-5953-2018-09-12 |
4 | BODENREIDER O. The Unified Medical Language System (UMLS): integrating biomedical terminology[J]. Nucleic Acids Research, 2004, 32(S1): D267-D270. 10.1093/nar/gkh061 |
5 | PATRICK J, LI M. High accuracy information extraction of medication information from clinical notes: 2009 I2B2 medication extraction challenge[J]. Journal of the American Medical Informatics Association, 2010, 17(5): 524-527. 10.1136/jamia.2010.003939 |
6 | UZUNER Ö, SOUTH B R, SHEN S Y, et al. 2010 I2B2/VA challenge on concepts, assertions, and relations in clinical text[J]. Journal of the American Medical Informatics Association, 2011, 18(5): 552-556. 10.1136/amiajnl-2011-000203 |
7 | SUN W Y, RUMSHISKY A, UZUNER O. Evaluating temporal relations in clinical text: 2012 I2B2 challenge[J]. Journal of the American Medical Informatics Association, 2013, 20(5): 806-813. 10.1136/amiajnl-2013-001628 |
8 | STUBBS A, KOTFILA C, UZUNER Ö. Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 I2B2/UTHealth shared task Track 1[J]. Journal of Biomedical Informatics, 2015, 58(S): S11-S19. 10.1016/j.jbi.2015.06.007 |
9 | YANG J F, GUAN Y, HE B, et al. Corpus construction for named entities and entity relations on Chinese electronic medical records[J]. Journal of Software, 2016, 27(11): 2725-2746. 10.13328/j.cnki.jos.004880 |
10 | CUI Y M, CHENG W X, LIU T, et al. Revisiting pre-trained models for Chinese natural language processing[C]// Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg, PA: Association for Computational Linguistics, 2020:657-668. 10.18653/v1/2020.findings-emnlp.58 |
11 | COLLINS M, SINGER Y. Unsupervised models for named entity classification[C]// Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. Stroudsburg, PA: Association for Computational Linguistics, 1999:100-110. |
12 | TROTT P A. International classification of diseases for oncology[J]. Journal of Clinical Pathology, 1977, 30(8): 782-782. 10.1136/jcp.30.8.782-c |
13 | CORNET R, DE KEIZER N. Forty years of SNOMED: a literature review[J]. BMC Medical Informatics and Decision Making, 2008, 8(S1): No.S2. 10.1186/1472-6947-8-s1-s2 |
14 | FRIEDMAN C, ALDERSON P O, AUSTIN J H M, et al. A general natural-language text processor for clinical radiology[J]. Journal of the American Medical Informatics Association, 1994, 1(2): 161-174. 10.1136/jamia.1994.95236146 |
15 | CODEN A, SAVOVA G, SOMINSKY I, et al. Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model[J]. Journal of Biomedical Informatics, 2009, 42(5): 937-949. 10.1016/j.jbi.2008.12.005 |
16 | SAVOVA G K, MASANZ J J, OGREN P V, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications[J]. Journal of the American Medical Informatics Association, 2010, 17(5): 507-513. 10.1136/jamia.2009.001560 |
17 | LI D C, KIPPER-SCHULER K, SAVOVA G. Conditional random fields and support vector machines for disorder named entity recognition in clinical texts[C]// Proceedings of the 2008 Workshop on Current Trends in Biomedical Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2008: 94-95. 10.3115/1572306.1572326 |
18 | CLARK C, ABERDEEN J, COARR M, et al. MITRE system for clinical assertion status classification[J]. Journal of the American Medical Informatics Association, 2011, 18(5): 563-567. 10.1136/amiajnl-2011-000164 |
19 | JONNALAGADDA S, COHEN T, WU S, et al. Enhancing clinical concept extraction with distributional semantics[J]. Journal of Biomedical Informatics, 2012, 45(1): 129-140. 10.1016/j.jbi.2011.10.007 |
20 | WU Y H, JIANG M, LEI J B, et al. Named entity recognition in Chinese clinical text using deep neural network[J]. Studies in Health Technology and Informatics, 2015, 216: 624-628. 10.1136/amiajnl-2013-002381 |
21 | HUANG Z H, XU W, YU K. Bidirectional LSTM-CRF models for sequence tagging[EB/OL]. (2015-08-09) [2021-05-14].. |
22 | XU K, ZHOU Z F, HAO T Y, et al. A bidirectional LSTM and conditional random fields approach to medical named entity recognition[C]// Proceedings of the 2017 International Conference on Advanced Intelligent Systems and Informatics, AISC 639. Cham: Springer, 2017: 355-365. |
23 | JI B, LIU R, LI S S, et al. A hybrid approach for named entity recognition in Chinese electronic medical record[J]. BMC Medical Informatics and Decision Making, 2019, 19(S2): No.64. 10.1186/s12911-019-0767-2 |
24 | BAEVSKI A, EDUNOV S, LIU Y H, et al. Cloze-driven pretraining of self-attention networks[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2019: 5360-5369. 10.18653/v1/d19-1539 |
25 | LIU Y J, MENG F D, ZHANG J C, et al. GCDT: a global context enhanced deep transition architecture for sequence labeling[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Stroudsburg. Stroudsburg, PA: Association for Computational Linguistics, 2019: 2431-2441. 10.18653/v1/p19-1233 |
26 | LI J, YE D H, SHANG S. Adversarial transfer for named entity boundary detection with pointer networks[C]// Proceedings of the 28th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2019: 5053-5059. 10.24963/ijcai.2019/702 |
27 | BALDI P, SADOWSKI P. The dropout learning algorithm[J]. Artificial Intelligence, 2014, 210: 78-122. 10.1016/j.artint.2014.02.004 |
28 | SUTTON C, McCALLUM A. An introduction to conditional random fields for relational learning[M]// GETOOR L, TASKAR B. Introduction to Statistical Relational Learning. Cambridge: MIT Press, 2007: 93-127. 10.7551/mitpress/7432.003.0006 |
29 | PARIKH A, TÄCKSTRÖM O, DAS D, et al. A decomposable attention model for natural language inference[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2016: 2249-2255. 10.18653/v1/d16-1244 |
30 | Yidu Cloud. Yidu-S4K: Yidu Cloud structured 4K data set[DS/OL]. (2020-11-09) [2021-05-14].. |
31 | 2020 China Conference on Knowledge Graph and Semantic Computing. CCKS evaluation task CFP[EB/OL]. [2021-05-14].,2020. 10.1155/2021/8884282 |
32 | KINGMA D P, BA J L. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30) [2021-05-22].. |
33 | QIAO R, YANG X R, HUANG W K. Medical named entity recognition based on BERT and model fusion[EB/OL]. [2021-05-14].. 10.1145/3490322.3490336 |
34 | LI N, LUO L, DING Z Y, et al. DUTIR at the CCKS-2019 task1: improving Chinese clinical named entity recognition using stroke ELMo and transfer learning[EB/OL]. [2021-05-14].. |
35 | YAN Y T, ZHAO X Y, WU X. Medical named entity recognition based on BERT and character pattern and phonetic features[EB/OL]. [2021-05-14].. |
36 | YANG W M, BI J L, ZOU J L, et al. Medical named entity recognition based on ChiEHRBert and multi-model fusion[EB/OL]. [2021-05-14].. |
37 | ZHENG H Y, WEN R, CHEN X, et al. Medical named entity recognition using CRF-MT-Adapt and NER-MRC[EB/OL]. [2021-05-14].. 10.1109/cds52072.2021.00068 |