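Classification method for traditional Chinese medicine electronic medical records based on heterogeneous graph representation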

doi:10.11772/j.issn.1001-9081.2023030260

Journal of Computer Applications, 2024, Vol. 44, Issue (2): 411-417. DOI: 10.11772/j.issn.1001-9081.2023030260

Section: Artificial Intelligence

Kaitian WANG¹, Qing YE¹,² (corresponding author), Chunlei CHENG¹,²

1. College of Computer Science, Jiangxi University of Chinese Medicine, Nanchang, Jiangxi 330004, China
2. Jiangxi Province Key Research Laboratory of Artificial Intelligence in Traditional Chinese Medicine (Jiangxi University of Chinese Medicine), Nanchang, Jiangxi 330004, China

Received: 2023-03-16 Revised: 2023-05-25 Accepted: 2023-05-26 Online: 2023-07-05 Published: 2024-02-10
Contact: Qing YE
About the authors: WANG Kaitian, born in 1999 in Mudanjiang, Heilongjiang, M.S. candidate. His research interests include natural language processing and data mining.
CHENG Chunlei, born in 1976 in Nanchang, Jiangxi, Ph.D., associate professor. His research interests include machine learning, knowledge representation and learning, and knowledge graphs.
Supported by:
National Natural Science Foundation of China (82260988); Jiangxi Provincial Natural Science Foundation (20224BAB206102); Science and Technology Research Key Project of Jiangxi Provincial Department of Education (GJJ201204)

Abstract:

Traditional Chinese Medicine (TCM) electronic medical records are hard to mine, underused, and difficult to extract useful information from, because their structures are complex and varied and their diagnosis and treatment terminology is not standardized. To address these problems, a TCM electronic medical record classification model, TCM-GCN, was proposed. It combines the Linguistically-motivated bidirectional Encoder Representation from Transformer (LERT) pre-trained model with a Graph Convolutional Network (GCN) and represents the records as a heterogeneous graph, with the aim of improving the extraction of effective feature representations and the classification of TCM electronic medical records. Firstly, each medical record was converted into a sentence vector by the word embeddings of the LERT layer and integrated into the heterogeneous graph, supplying the overall semantic features of the record that the graph structure lacks. Next, to mitigate the negative impact of the records' structural characteristics on feature extraction, keywords were added to the heterogeneous graph as nodes, and the BM25 and Pointwise Mutual Information (PMI) algorithms were used to construct the "medical record - keyword" and "keyword - keyword" edges that express the features of the records. Finally, TCM-GCN aggregated and extracted the feature relationships between medical records over the heterogeneous graph built with LERT-BM25-PMI to complete the classification task. Experimental results on a TCM electronic medical record dataset show that, compared with the second-best model, LERT, TCM-GCN improves the weighted-average accuracy, recall, and F1 value by 2.24%, 2.38%, and 2.32%, respectively, which confirms the effectiveness of the algorithm in capturing implicit features among medical records and in classifying TCM electronic medical records.
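
The edge construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the BM25 constants `k1` and `b`, the sliding-window size used for PMI, the function names, and the toy keyword lists are all assumptions made for illustration, since this page does not give those details. Each record is assumed to be already segmented into a list of keywords; the LERT sentence vectors that serve as node features are not covered here.

```python
import math
from collections import Counter
from itertools import combinations


def bm25_edges(docs, k1=1.5, b=0.75):
    """Return {(record_index, keyword): weight} using the Okapi BM25 formula.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(w for d in docs for w in set(d))
    edges = {}
    for i, doc in enumerate(docs):
        for word, freq in Counter(doc).items():
            # The "+ 1.0" inside the log keeps the IDF positive.
            idf = math.log((n_docs - doc_freq[word] + 0.5) / (doc_freq[word] + 0.5) + 1.0)
            norm = freq + k1 * (1 - b + b * len(doc) / avg_len)
            edges[(i, word)] = idf * freq * (k1 + 1) / norm
    return edges


def pmi_edges(docs, window=3, threshold=0.0):
    """Return {(word_a, word_b): pmi} for co-occurring pairs with PMI above threshold."""
    windows = []
    for doc in docs:
        if len(doc) <= window:
            windows.append(doc)
        else:
            windows.extend(doc[i:i + window] for i in range(len(doc) - window + 1))
    n_windows = len(windows)
    word_count = Counter(w for win in windows for w in set(win))
    pair_count = Counter(
        pair for win in windows for pair in combinations(sorted(set(win)), 2)
    )
    edges = {}
    for (word_a, word_b), n_ab in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ), with probabilities counted over windows.
        pmi = math.log(n_ab * n_windows / (word_count[word_a] * word_count[word_b]))
        if pmi > threshold:
            edges[(word_a, word_b)] = pmi
    return edges


# Hypothetical keyword lists standing in for three segmented medical records.
records = [
    ["tinnitus", "slippery_pulse", "yellow_coating"],
    ["cough", "slippery_pulse", "yellow_coating"],
    ["tinnitus", "floating_pulse"],
]
record_keyword = bm25_edges(records)   # "medical record - keyword" edges
keyword_keyword = pmi_edges(records)   # "keyword - keyword" edges
```

In this toy input, pairs that co-occur more often than chance (such as `slippery_pulse` and `yellow_coating`) receive an edge, while pairs with non-positive PMI are dropped, which is the usual effect of a zero threshold.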

Key words: heterogeneous graph, Graph Convolutional Network (GCN), pre-training model, text classification, Natural Language Processing (NLP), Traditional Chinese Medicine (TCM) electronic medical record

CLC number: TP391.1

Cite this article: Kaitian WANG, Qing YE, Chunlei CHENG. Classification method for traditional Chinese medicine electronic medical records based on heterogeneous graph representation[J]. Journal of Computer Applications, 2024, 44(2): 411-417.

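Information category	Content
Sex	Female
Inspection	Complexion somewhat dull; slightly overweight build, calm expression, fairly rapid speech, talkative
Pulse diagnosis	Pulse slightly slippery; right pulse boundary indistinct; left guan position slightly strong; left cun position slightly floating
Tongue diagnosis	Tongue body somewhat dark and bluish; coating pale yellow and slightly thick
Physical examination	Follicles on the pharyngeal wall; copious secretions
Chief complaint	Tinnitus in the right ear for more than half a month
TCM diagnosis	Common cold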

Actual situation	Predicted as the disease	Predicted as not the disease
Prediction correct	TP	TN
Prediction incorrect	FP	FN
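
The counts in the table above feed the usual per-class scores. The sketch below computes precision, recall, and F1 from TP, FP, and FN, then averages them across classes weighted by class support (TP + FN). Support weighting is the common meaning of "weighted average" and is assumed here; the function names and the example counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_average(per_class):
    """Average the three scores over classes, weighted by support (tp + fn).

    per_class is a list of (tp, fp, fn) tuples, one per disease class.
    """
    total = sum(tp + fn for tp, fp, fn in per_class)
    sums = [0.0, 0.0, 0.0]
    for tp, fp, fn in per_class:
        scores = precision_recall_f1(tp, fp, fn)
        for k, score in enumerate(scores):
            sums[k] += score * (tp + fn) / total
    return tuple(sums)


# Hypothetical counts for two disease classes: (TP, FP, FN).
counts = [(8, 2, 2), (5, 5, 0)]
w_precision, w_recall, w_f1 = weighted_average(counts)
```

TN does not enter any of the three scores, which is why the functions take only TP, FP, and FN.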

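Parameter	Symbol	Value
GCN learning rate	gcn_lr	0.05
GCN weight decay	gcn_weight_decay	10^-5
GCN hidden-layer feature dimension	n_hidden	32
PMI threshold	p	0
LERT learning rate	lert_lr	10^-5
LERT weight decay	lert_weight_decay	10^-4
Batch size	batch_size	128
Weight	λ	0.5
Number of iterations	epoch	200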
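
To show where the hidden dimension in the table enters the model, here is a rough sketch of a two-layer GCN forward pass in plain Python, using the standard symmetric normalization D^{-1/2}(A + I)D^{-1/2}. Only `N_HIDDEN = 32` comes from the table; the toy graph, the one-hot features, the constant weights, and the two output classes are assumptions for illustration, and the sketch omits training and the LERT branch entirely.

```python
import math

N_HIDDEN = 32  # n_hidden from the parameter table


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense, non-negative adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def gcn_forward(adj, features, w1, w2):
    """Two-layer GCN logits: A_norm * ReLU(A_norm * X * W1) * W2 (no softmax)."""
    a_norm = normalize_adjacency(adj)
    hidden = relu(matmul(a_norm, matmul(features, w1)))
    return matmul(a_norm, matmul(hidden, w2))


# Toy graph with three nodes in a chain; all values below are placeholders.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w1 = [[0.1] * N_HIDDEN for _ in range(3)]   # input dim 3 -> hidden dim 32
w2 = [[0.1] * 2 for _ in range(N_HIDDEN)]   # hidden dim 32 -> 2 classes (assumed)
a_norm = normalize_adjacency(adj)
logits = gcn_forward(adj, features, w1, w2)
```

The added self-loops guarantee every node has a non-zero degree, so the normalization never divides by zero for non-negative edge weights such as BM25 scores or positive PMI values.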

Model	Precision	Recall	F1	AUC
LSTM	0.6665	0.6648	0.6631	0.8829
Text-GCN	0.7495	0.7509	0.7483	0.9307
LERT	0.7674	0.7636	0.7630	0.9152
Text_CNN	0.7240	0.7236	0.7212	0.9246
Text_RNN	0.6908	0.6855	0.6849	0.9159
FastText	0.7250	0.7145	0.7134	0.9373
TCM-GCN	0.7846	0.7818	0.7807	0.9272
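
The improvements quoted in the abstract (2.24%, 2.38%, 2.32%) can be reproduced from the LERT and TCM-GCN rows of the comparison table above if they are read as relative gains over LERT rather than absolute differences. The short check below uses the first three score columns of those two rows; the dictionary keys and the function name are chosen for this example only.

```python
# Scores copied from the LERT and TCM-GCN rows of the comparison table.
lert = {"precision": 0.7674, "recall": 0.7636, "f1": 0.7630}
tcm_gcn = {"precision": 0.7846, "recall": 0.7818, "f1": 0.7807}


def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100


gains = {name: round(relative_gain(tcm_gcn[name], lert[name]), 2) for name in lert}
print(gains)  # {'precision': 2.24, 'recall': 2.38, 'f1': 2.32}
```

The corresponding absolute differences are about 1.72, 1.82, and 1.77 percentage points, so the relative reading is the one consistent with the reported figures.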

Experiment	Graph construction method	Precision	Recall	F1	AUC
1	LERT+one-hot+PMI	0.7715	0.7709	0.7700	0.9162
2	LERT+BM25	0.7634	0.7618	0.7605	0.9062
3	BM25+PMI	0.7671	0.7655	0.7643	0.9133
4	LERT+TF-IDF+PMI	0.7741	0.7727	0.7714	0.9162
5	LERT+BM25+PMI	0.7846	0.7818	0.7807	0.9272
