Journal of Computer Applications, 2020, Vol. 40, Issue (10): 3088-3094. DOI: 10.11772/j.issn.1001-9081.2020030359

• Frontier, interdisciplinary, and comprehensive applications •

Early diagnosis and prediction of Parkinson's disease based on clustering medical text data

ZHANG Xiaobo1,2,3, YANG Yan1,2,3, LI Tianrui1,2,3, LU Fan1,2,3, PENG Lilan1,2,3

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 611756, China;
  2. Institute of Artificial Intelligence, Southwest Jiaotong University, Chengdu Sichuan 611756, China;
  3. National Engineering Laboratory of Integrated Transportation Big Data Application Technology (Southwest Jiaotong University), Chengdu Sichuan 611756, China
  • Received: 2020-03-26 Revised: 2020-05-29 Online: 2020-10-10 Published: 2020-06-10
  • Corresponding author: YANG Yan
  • About the authors: ZHANG Xiaobo (born 1985), male, from Yuncheng, Shanxi; assistant research fellow, Ph.D. candidate, CCF member; research interests: medical data mining, machine learning. YANG Yan (born 1964), female, from Hefei, Anhui; professor, Ph.D., CCF distinguished member; research interests: big data analysis and mining, multi-view learning, ensemble learning, semi-supervised learning. LI Tianrui (born 1969), male, from Putian, Fujian; professor, Ph.D., CCF distinguished member; research interests: big data, cloud computing, data mining, machine learning, granular computing, rough sets. LU Fan (born 1995), female, from Liangshan, Sichuan; M.S. candidate; research interests: deep learning, clustering. PENG Lilan (born 1993), female, from Chengdu, Sichuan; M.S. candidate; research interests: pattern recognition, clustering.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61976247) and the Key Research and Development Program of Sichuan Province (20ZDYF2837).

Abstract: To address the early intelligent diagnosis of Parkinson's Disease (PD), which occurs most commonly in the elderly, clustering techniques based on medical examination text data were proposed for the analysis and prediction of PD. Firstly, the original dataset was preprocessed to obtain effective feature information, and the original features were reduced by Principal Component Analysis (PCA) to eight feature spaces of different dimensions. Then, five classical clustering models and three clustering ensemble methods were each applied to the data in the eight feature spaces. Finally, four clustering performance indexes were used to assess the prediction of PD subjects with dopamine deficiency, healthy controls, and Scans Without Evidence of Dopamine Deficiency (SWEDD) PD subjects in the dataset. The simulation results show that the clustering accuracy of the Gaussian Mixture Model (GMM) reaches 89.12% when the PCA feature dimension is 30, the clustering accuracy of Spectral Clustering (SC) is 61.41% when the PCA feature dimension is 70, and the clustering accuracy of the Meta-CLustering Algorithm (MCLA) reaches 59.62% when the PCA feature dimension is 80. The comparative experiments show that, among the five classical clustering methods, GMM gives the best clustering results when the PCA feature dimension is less than 40, and that, among the three clustering ensemble methods, MCLA performs well across the different feature dimensions. These results provide technical and theoretical support for the early intelligent auxiliary diagnosis of PD.

Key words: Parkinson's Disease (PD), medical text data, Principal Component Analysis (PCA), clustering, clustering ensemble
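
The pipeline summarized in the abstract (PCA reduction, clustering, then scoring the clusters against the known subject groups) can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions: synthetic three-group data from `make_blobs` stands in for the medical text dataset, the function names `clustering_accuracy` and `pca_gmm_accuracy` are hypothetical, the GMM uses a diagonal covariance for numerical stability, and clustering accuracy is taken to be the common best-match definition (cluster ids mapped to class labels with the Hungarian algorithm), which the abstract does not specify. The dimensions 30, 70, and 80 are simply the values the abstract mentions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: map cluster ids to classes via Hungarian assignment."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes, true_idx = np.unique(y_true, return_inverse=True)
    clusters, pred_idx = np.unique(y_pred, return_inverse=True)
    # contingency[i, j] = number of samples in cluster i whose true class is j
    contingency = np.zeros((clusters.size, classes.size), dtype=np.int64)
    np.add.at(contingency, (pred_idx, true_idx), 1)
    # Choose the one-to-one cluster/class mapping that maximizes agreement
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / y_true.size


def pca_gmm_accuracy(X, y, n_components, n_clusters=3, seed=0):
    """Standardize, reduce with PCA, cluster with a GMM, and score the result."""
    X_scaled = StandardScaler().fit_transform(X)
    X_reduced = PCA(n_components=n_components, random_state=seed).fit_transform(X_scaled)
    # Diagonal covariance is an assumption made here, not a detail from the paper
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="diag",
                          random_state=seed)
    labels = gmm.fit_predict(X_reduced)
    return clustering_accuracy(y, labels)


# Synthetic stand-in for three subject groups (e.g. PD, healthy control, SWEDD)
X, y = make_blobs(n_samples=300, n_features=100, centers=3,
                  cluster_std=2.0, random_state=0)

results = {d: pca_gmm_accuracy(X, y, d) for d in (30, 70, 80)}
for d, acc in results.items():
    print(f"PCA dimension {d}: GMM clustering accuracy = {acc:.4f}")
```

Swapping `GaussianMixture` for another scikit-learn clusterer such as `SpectralClustering` would give a comparable sketch for the other classical methods. The accuracies printed for the synthetic data say nothing about the figures reported in the abstract.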

CLC Number: