Journal of Computer Applications, pp. 50-54. DOI: 10.11772/j.issn.1001-9081.2024040472

• Artificial intelligence •

Multilevel-component joint Chinese word embedding model and its application to core drug identification

Deyang KONG1,2,3, Yun ZHANG1,2,3, Weiguang WANG4, Zijie CHEN4, Yongguo LIU1,2,3

  1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu Sichuan 610054, China
  2. Knowledge and Data Engineering Laboratory of Chinese Medicine, University of Electronic Science and Technology of China, Chengdu Sichuan 610054, China
  3. Innovation Center for Electronic Information and Traditional Chinese Medicine, University of Electronic Science and Technology of China, Chengdu Sichuan 610054, China
  4. Institute of Traditional Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100029, China
  • Received:2024-04-22 Revised:2024-06-28 Accepted:2024-07-02 Online:2025-01-24 Published:2024-12-31
  • Contact: Yun ZHANG

多级组件融合中文词嵌入模型及核心药物识别应用

孔德阳1,2,3, 张云1,2,3, 王维广4, 陈子杰4, 刘勇国1,2,3

  1. 电子科技大学 信息与软件工程学院,成都 610054
  2. 电子科技大学 中医知识与数据工程实验室,成都 610054
  3. 电子科技大学 电子信息与中医药融合创新研究中心,成都 610054
  4. 北京中医药大学 中医学院,北京 100029
  • 通讯作者: 张云
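  • About the authors: Deyang KONG (born 2003), male, from Shijiazhuang, Hebei. Research interests: semantic analysis, artificial intelligence.
    Yun ZHANG (born 1991), male, from Neijiang, Sichuan, assistant research fellow, Ph.D. Research interests: digital healthcare, semantic analysis.
    Weiguang WANG (born 1987), male, from Qiqihar, Heilongjiang, assistant research fellow, Ph.D. Research interests: theoretical system of traditional Chinese medicine and knowledge graphs.
    Zijie CHEN (born 1978), male, from Beijing, professor, Ph.D. Research interests: theoretical system of traditional Chinese medicine.
    Yongguo LIU (born 1974), male, from Mianyang, Sichuan, professor, Ph.D. Research interests: digital healthcare, computational health, artificial intelligence.
  • Supported by:
    National Science and Technology Fundamental Resources Investigation Program of China (2022FY102002); Natural Science Foundation of Sichuan Province (2022NSFSC0958); Sichuan Science and Technology Program (2023YFS0325)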

Abstract:

Word embedding models map words into a low-dimensional vector space in which word semantics can be analyzed, providing an effective basis for machine understanding and text processing. Traditional Chinese word embedding models learn semantic information from the internal composition of Chinese words; however, existing models either underuse or overuse the information carried by Chinese characters and their components at different levels. To make better use of this multilevel component information and generate high-quality word embeddings, a Multilevel-component Joint Chinese Word Embedding (MJWE) model was proposed. The model integrates the features of words, Chinese characters, and multilevel components: it incorporates character embeddings with positional information and constructs multilevel-component embeddings from radicals and finer-grained components, so as to capture the internal composition of Chinese words more comprehensively. In addition, a non-compositional word list was constructed to prevent overuse of the internal information of Chinese words. Experimental results show that the MJWE model improves accuracy by 2.11% over the JWE (Joint learning Word Embeddings) model on the word similarity task "WS-295", by 2.52% over the Skip-Gram (SG) model on the word analogy task "state", by 6.58% over the CBOW (Continuous Bag Of Words) model on the word analogy task "family", by 0.71% over the JWE model on the two-class emotion classification task, and by 8.60% over the SG model on the seven-class emotion classification task. The MJWE model was also applied to the analysis of traditional Chinese medicine literature, specifically to core drug identification in traditional Chinese medicine formulae. These results indicate that MJWE generates Chinese word embeddings of good quality and that, combined with a community detection algorithm, it can identify the core drugs for treating different syndromes of chronic glomerulonephritis, which can assist traditional Chinese medicine doctors in clinical decision-making.
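
The abstract describes the model only at a high level, so the following is a minimal sketch of one way the joint features could be assembled, not the authors' implementation. It assumes a CBOW-style context representation formed by averaging each feature level (word, position-tagged character, radical, finer-grained component) and then averaging across levels; the decomposition table `DECOMP`, the word list `NON_COMPOSITIONAL`, the position tags, and all function names are hypothetical and chosen for illustration.

```python
import random

DIM = 4                      # toy embedding size (assumption)
rng = random.Random(0)

# Hypothetical decomposition table: character -> (radical, finer-grained components).
DECOMP = {
    "河": ("氵", ["可"]),
    "流": ("氵", ["㐬"]),
}

# Hypothetical non-compositional word list: the meaning of these words is not
# derived from their characters (e.g. the loanword for "sofa"), so only the
# word-level feature is used for them.
NON_COMPOSITIONAL = {"沙发"}

_vectors = {}

def embed(key):
    """Return the vector for a feature key, creating a random one on first use."""
    if key not in _vectors:
        _vectors[key] = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]
    return _vectors[key]

def position(i, n):
    """Tag a character by its position in the word: Single, Begin, Middle, End."""
    if n == 1:
        return "S"
    if i == 0:
        return "B"
    if i == n - 1:
        return "E"
    return "M"

def features(word):
    """Collect feature keys for one word, grouped by level."""
    feats = {"word": [("w", word)], "char": [], "radical": [], "component": []}
    if word in NON_COMPOSITIONAL:
        return feats             # skip all sub-word levels
    for i, ch in enumerate(word):
        feats["char"].append(("c", ch, position(i, len(word))))
        radical, comps = DECOMP.get(ch, (None, []))
        if radical is not None:
            feats["radical"].append(("r", radical))
        feats["component"].extend(("s", c) for c in comps)
    return feats

def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(DIM)]

def context_vector(context_words):
    """Average each level over the context, then average the non-empty levels."""
    by_level = {}
    for word in context_words:
        for level, keys in features(word).items():
            by_level.setdefault(level, []).extend(embed(k) for k in keys)
    level_means = [mean(vs) for vs in by_level.values() if vs]
    return mean(level_means)
```

For example, `features("河流")` yields position-tagged character keys for 河 (begin) and 流 (end) plus their shared radical 氵, whereas `features("沙发")` yields only the word-level key because the word is on the non-compositional list. A real model would train these vectors against a prediction objective; the random initialization here only stands in for learned parameters.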

Key words: word embedding, Chinese word, multilevel-component, Chinese medicinal formulae, core drug

摘要:

词嵌入模型可以将词语映射到低维向量空间以分析词语语义,为计算机理解和文本处理提供有效手段。传统中文词嵌入模型通过中文词语内部的组成信息学习语义信息,然而,对于汉字及其不同层级组件信息的利用程度,不同模型存在利用不够或过度的问题。为了更好地利用汉字不同层级组件信息生成高质量的词嵌入,提出多级组件融合中文词嵌入(MJWE)模型,综合考虑词语、汉字和多级组件的特征,融合带有位置信息的字嵌入,构建以偏旁、部首和更小粒度的组件构成的多级组件嵌入,从而更全面地捕捉中文词语内部语义信息。同时,构建非组合词词表防止词语内部信息的过度利用。实验结果表明,在词相似任务WS-295上,与JWE(Joint learning Word Embeddings)模型相比,MJWE模型的准确率提高了2.11%;在词类比任务state上,与跳元(SG)模型相比,MJWE模型的准确率提高了2.52%;在词类比任务family上,与连续词袋(CBOW)模型相比,MJWE模型的准确率提高了6.58%。在情感二分类任务上,与JWE模型相比,MJWE模型的准确率提高了0.71%;在情感七分类任务上,与SG模型相比,MJWE模型的准确率提高了8.60%。同时,将MJWE模型应用于中医文献分析,在方剂核心药物识别的任务中,MJWE可以识别治疗慢性肾小球肾炎不同证候的核心药物。可见,MJWE可以生成质量较好的中文词嵌入,结合社区检测算法可以识别治疗慢性肾小球肾炎不同证候的核心药物,有利于辅助中医医师临床决策。

关键词: 嵌入, 中文词语, 多级组件, 中药方剂, 核心药物
