Journal of Computer Applications, 2024, Vol. 44, Issue (2): 411-417. DOI: 10.11772/j.issn.1001-9081.2023030260

• Artificial Intelligence •

Classification method for traditional Chinese medicine electronic medical records based on heterogeneous graph representation

Kaitian WANG1, Qing YE1,2, Chunlei CHENG1,2

  1. College of Computer Science, Jiangxi University of Chinese Medicine, Nanchang, Jiangxi 330004, China
  2. Jiangxi Province Key Research Laboratory of Artificial Intelligence in Traditional Chinese Medicine (Jiangxi University of Chinese Medicine), Nanchang, Jiangxi 330004, China
  • Received: 2023-03-16 Revised: 2023-05-25 Accepted: 2023-05-26 Online: 2023-07-05 Published: 2024-02-10
  • Contact: Qing YE
  • About the authors: WANG Kaitian, born in 1999, male, from Mudanjiang, Heilongjiang, M.S. candidate. His research interests include natural language processing and data mining.
    CHENG Chunlei, born in 1976, male, from Nanchang, Jiangxi, Ph.D., associate professor. His research interests include machine learning, knowledge representation and learning, and knowledge graphs.
  • Supported by:
    National Natural Science Foundation of China (82260988); Jiangxi Provincial Natural Science Foundation (20224BAB206102); Science and Technology Research Key Project of Jiangxi Provincial Department of Education (GJJ201204)

Abstract:

Traditional Chinese Medicine (TCM) electronic medical records are difficult to mine, underutilized, and hard to extract useful information from, because their structures are complex and varied and their diagnosis and treatment terminology is not standardized. To address these problems, TCM-GCN, a classification model for TCM electronic medical records, is proposed. It combines the Linguistically-motivated bidirectional Encoder Representation from Transformer (LERT) pre-trained model with a Graph Convolutional Network (GCN) and represents the records as a heterogeneous graph, with the aim of improving the extraction of effective feature representations and the classification of TCM electronic medical records. First, the LERT layer converts each medical record into a sentence vector through word embedding, and this vector is integrated into the heterogeneous graph to supply the record-level semantic features that the graph structure lacks. Next, to mitigate the negative impact of the records' structural characteristics on feature extraction, keywords are added to the heterogeneous graph as nodes, and the BM25 and Pointwise Mutual Information (PMI) algorithms are used to construct the "medical record - keyword" and "keyword - keyword" edges that express the features of the records. Finally, TCM-GCN aggregates and extracts the feature relationships between medical records over the heterogeneous graph built with LERT-BM25-PMI to complete the classification task. Experimental results on a TCM electronic medical record dataset show that, compared with the second-best model, LERT, TCM-GCN improves the weighted-average accuracy, recall, and F1 value by 2.24%, 2.38%, and 2.32%, respectively, which verifies the effectiveness of the algorithm in capturing implicit features between medical records and in classifying TCM electronic medical records.
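The edge construction described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that BM25 weights the "medical record - keyword" edges and that PMI weights the "keyword - keyword" edges, which is the natural reading but is not stated explicitly. The function names `bm25_edges` and `pmi_edges`, the parameters `k1`, `b`, and `window`, the rule of keeping only positive-PMI pairs, and the toy keyword lists are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Toy, made-up keyword lists standing in for segmented medical records.
docs = [
    ["cough", "phlegm", "fever"],
    ["cough", "phlegm", "fatigue"],
    ["insomnia", "fatigue", "palpitation"],
]


def bm25_edges(docs, k1=1.5, b=0.75):
    """Weight (record index, keyword) edges with the Okapi BM25 score.

    k1 and b are common default values, not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of records that contain each keyword at least once.
    doc_freq = Counter(w for d in docs for w in set(d))
    edges = {}
    for i, doc in enumerate(docs):
        for word, freq in Counter(doc).items():
            # The "+ 1.0" inside the log keeps the IDF strictly positive.
            idf = math.log((n_docs - doc_freq[word] + 0.5) / (doc_freq[word] + 0.5) + 1.0)
            tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avg_len))
            edges[(i, word)] = idf * tf
    return edges


def pmi_edges(docs, window=3):
    """Weight (keyword, keyword) edges with PMI over sliding windows.

    Only pairs with positive PMI are kept (an assumption of this sketch).
    """
    windows = []
    for doc in docs:
        if len(doc) <= window:
            windows.append(set(doc))
        else:
            windows.extend(set(doc[s:s + window]) for s in range(len(doc) - window + 1))
    n_win = len(windows)
    single = Counter(w for win in windows for w in win)
    pair = Counter(p for win in windows for p in combinations(sorted(win), 2))
    edges = {}
    for (w1, w2), count in pair.items():
        # PMI = log( p(w1, w2) / (p(w1) * p(w2)) ), with window-level probabilities.
        pmi = math.log(count * n_win / (single[w1] * single[w2]))
        if pmi > 0:
            edges[(w1, w2)] = pmi
    return edges


record_keyword = bm25_edges(docs)
keyword_keyword = pmi_edges(docs)
for edge, weight in sorted(keyword_keyword.items()):
    print(edge, round(weight, 3))
```

In a full pipeline, these two weighted edge sets would fill the adjacency matrix of the heterogeneous graph. The record nodes would carry the LERT sentence vectors as features before the GCN aggregates over the graph.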

Key words: heterogeneous graph, Graph Convolutional Network (GCN), pre-training model, text classification, Natural Language Processing (NLP), Traditional Chinese Medicine (TCM) electronic medical record

CLC number: