Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (10): 2999-3005. DOI: 10.11772/j.issn.1001-9081.2017.10.2999

• Frontier, Interdisciplinary, and Comprehensive Applications •

Automatic hyponymy extracting method based on symptom components

WANG Ting1, WANG Qi1, HUANG Yueqi1, YIN Yichao2, GAO Ju2

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China;
  2. Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Shanghai 200021, China
  • Received: 2017-04-25 Revised: 2017-06-12 Online: 2017-10-10 Published: 2017-10-16
  • Corresponding author: WANG Ting, E-mail: wangting6524@163.com
  • About the authors: WANG Ting (born 1993), female, from Weifang, Shandong, M.S. candidate, CCF member; research interests: information extraction, knowledge graphs. WANG Qi (born 1993), male, from Suzhou, Jiangsu, M.S. candidate, CCF member; research interests: information extraction, knowledge graphs, machine translation. HUANG Yueqi (born 1993), male, from Shaoxing, Zhejiang, M.S. candidate, CCF member; research interests: knowledge graphs, natural language question answering. YIN Yichao (born 1983), male, from Shanghai, engineer, M.S.; research interests: hospital informatization. GAO Ju (born 1966), male, from Shanghai, chief physician, M.S.; research interests: hospital administration, integrated traditional Chinese and Western medicine treatment of hepatobiliary diseases.
  • Supported by:
    This work is partially supported by the National High Technology Research and Development Program (863 Program) of China (2015AA020107) and the National Key Technology Research and Development Program of the Ministry of Science and Technology of China (2015BAH12F01-05).

Abstract: Hyponymy (hypernym-hyponym) relations between symptoms are strongly structural, so an automatic hyponymy extraction method based on symptom components is proposed. First, inspection of symptom entities showed that a symptom can be segmented into eight kinds of components, such as atomic symptom words and modifiers, and that the sequence of components follows certain rules. Then, a lexical analysis system and a Conditional Random Field (CRF) model were used to segment symptoms and label their components. Finally, relation extraction between symptoms was treated as a classification problem: component-composition features, dictionary features, and general features were used as classifier input, models were trained with several classification algorithms, and each pair of symptoms was classified as hyponymy or non-hyponymy. The experimental results show that the Support Vector Machine (SVM) with all three feature types performed best, with precision, recall, and F1-measure reaching 82.68%, 82.13%, and 82.40%, respectively. Using the proposed extraction algorithm, 20619 hyponymy relations were extracted and a symptom knowledge base with hyponymy relations was built.

Key words: hyponymy, symptom component, Conditional Random Field (CRF), relationship classification, Support Vector Machine (SVM), decision tree, Naive Bayesian (NB)
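The abstract frames hyponymy extraction as classification over pairs of symptoms that have been segmented into labelled components. The sketch below is only an illustration of that framing, not the authors' feature set: the label names `ATOM` and `MOD`, the function `pair_features`, the four features, and the English example symptoms are all assumptions made for this example. The `f1_score` helper applies the standard F1 formula to check that the reported precision and recall agree with the reported F1.

```python
from typing import Dict, List, Tuple

# A segmented symptom is a list of (token, component_label) pairs.
# "ATOM" (atomic symptom word) and "MOD" (modifier) are hypothetical label
# names for two of the eight component types mentioned in the abstract.
Segmented = List[Tuple[str, str]]


def pair_features(hyper: Segmented, hypo: Segmented) -> Dict[str, float]:
    """Illustrative features for an ordered (candidate hypernym, candidate hyponym) pair.

    These features are assumptions for demonstration; the paper's actual
    component, dictionary, and general features are not given in the abstract.
    """
    hyper_tokens = {tok for tok, _ in hyper}
    hypo_tokens = {tok for tok, _ in hypo}
    hyper_atoms = {tok for tok, lab in hyper if lab == "ATOM"}
    hypo_atoms = {tok for tok, lab in hypo if lab == "ATOM"}
    # Labels of the tokens that appear only in the candidate hyponym.
    extra_labels = [lab for tok, lab in hypo if tok not in hyper_tokens]
    return {
        # Both symptoms share exactly the same atomic symptom words.
        "same_atoms": float(hyper_atoms == hypo_atoms),
        # Every token of the hypernym also occurs in the (longer) hyponym.
        "hyper_subset_of_hypo": float(hyper_tokens < hypo_tokens),
        # Number of additional modifiers carried by the hyponym.
        "extra_modifiers": float(sum(1 for lab in extra_labels if lab == "MOD")),
        # Difference in component count between the two symptoms.
        "length_diff": float(len(hypo) - len(hyper)),
    }


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example pair: "headache" is a candidate hypernym of "severe headache".
headache: Segmented = [("headache", "ATOM")]
severe_headache: Segmented = [("severe", "MOD"), ("headache", "ATOM")]
features = pair_features(headache, severe_headache)

# Consistency check on the figures reported in the abstract (in percent).
reported_f1 = f1_score(82.68, 82.13)
```

A feature dictionary like `features` could then be vectorized and passed to any classifier, such as an SVM. The value of `reported_f1` rounds to 82.40, which matches the F1-measure stated in the abstract.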

CLC Number: