Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (6): 1979-1986. DOI: 10.11772/j.issn.1001-9081.2022050727

• Frontier and Comprehensive Applications •

Circular RNA-disease association prediction via two-stage fusion on a graph auto-encoder

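Yi ZHANG (张奕)1,2, Zhenmei WANG (王真梅)1

  1. College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541006
  2. Guangxi Key Laboratory of Embedded Technology and Intelligent System (Guilin University of Technology), Guilin, Guangxi 541006
  • Received: 2022-05-23 Revised: 2022-09-01 Accepted: 2022-09-05 Online: 2022-09-23 Published: 2023-06-10
  • Corresponding author: Zhenmei WANG (王真梅)
  • About the authors: ZHANG Yi (born 1977), female, from Jiujiang, Jiangxi, professor, Ph.D.; main research interests: bioinformatics, machine learning, and service computing
    WANG Zhenmei (born 1996), female, from Yulin, Guangxi, master's student; main research interest: bioinformatics. Email: 1134573228@qq.com
  • Funding:
    National Natural Science Foundation of China (62166014); Guangxi Natural Science Foundation (2020GXNSFAA297255); Project of Guangxi Key Laboratory of Embedded Technology and Intelligent System (2019-01-16)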

circRNA-disease association prediction by two-stage fusion on graph auto-encoder

Yi ZHANG1,2, Zhenmei WANG1

  1. College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541006, China
  2. Guangxi Key Laboratory of Embedded Technology and Intelligent System (Guilin University of Technology), Guilin, Guangxi 541006, China
  • Received: 2022-05-23 Revised: 2022-09-01 Accepted: 2022-09-05 Online: 2022-09-23 Published: 2023-06-10
  • Contact: Zhenmei WANG
  • About the authors: ZHANG Yi, born in 1977, Ph. D., professor. Her research interests include bioinformatics, machine learning, and service computing.
    WANG Zhenmei, born in 1996, M. S. candidate. Her research interests include bioinformatics.
  • Supported by:
    National Natural Science Foundation of China (62166014); Guangxi Natural Science Foundation (2020GXNSFAA297255); Project of Guangxi Key Laboratory of Embedded Technology and Intelligent System (2019-01-16)

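Abstract:

Most existing computational models for predicting associations between circular RNAs (circRNAs) and diseases rely on biological knowledge, such as circRNA- and disease-related data, combined with known circRNA-disease association pairs to uncover potential associations. However, these models are affected by inherent problems such as the sparsity of the network formed by known associations and the shortage of negative samples, which leads to poor prediction performance. Therefore, inductive matrix completion and a self-attention mechanism were introduced on top of a graph auto-encoder for two-stage fusion to predict circRNA-disease associations; the resulting model is named GIS-CDA (Graph auto-encoder combining Inductive matrix complementation and Self-attention mechanism for predicting CircRNA-Disease Association). First, the integrated circRNA similarity and the integrated disease similarity were calculated, and the graph auto-encoder was used to learn latent features of circRNAs and diseases to obtain low-dimensional representations; next, the learned features were fed into inductive matrix completion to strengthen the similarity and dependence between nodes; then, the circRNA feature matrix and the disease feature matrix were integrated into a circRNA-disease feature matrix to enhance the stability and accuracy of prediction; finally, a self-attention mechanism was introduced to extract important features from the feature matrix and reduce the dependence on other biological information. Five-fold and ten-fold cross-validation results show that GIS-CDA achieves average Area Under the Receiver Operating Characteristic curve (AUROC) values of 0.930 3 and 0.939 3, respectively; the former is 13.19, 35.73, 13.28 and 5.01 percentage points higher than those of the KATZ measure-based human circRNA-disease association prediction model (KATZHCDA), the deep matrix factorization-based circRNA-disease association (DMFCDA) prediction model, RWR (Random Walk with Restart) and the speedup inductive matrix completion-based circRNA-disease association (SIMCCDA) prediction model, respectively. The Area Under the Precision-Recall curve (AUPR) values of GIS-CDA are 0.227 1 and 0.234 0, respectively; the former is 21.72, 22.43, 21.96 and 13.86 percentage points higher than those of the above comparison models, respectively. In addition, ablation experiments and case studies on the circRNADisease, circ2Disease and circR2Disease datasets further verify that GIS-CDA performs well in predicting potential circRNA-disease associations.

Key words: graph auto-encoder, inductive matrix completion, self-attention mechanism, circular RNA, circular RNA-disease association pair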

Abstract:

Most existing computational models for predicting associations between circular RNAs (circRNAs) and diseases use biological knowledge, such as circRNA- and disease-related data, together with known circRNA-disease association pairs to mine potential associations. However, these models suffer from inherent problems, such as the sparsity of the network formed by known associations and the scarcity of negative samples, which result in poor prediction performance. Therefore, inductive matrix completion and a self-attention mechanism were introduced on top of a graph auto-encoder for two-stage fusion to predict circRNA-disease associations; the resulting model is called GIS-CDA (Graph auto-encoder combining Inductive matrix complementation and Self-attention mechanism for predicting CircRNA-Disease Association). Firstly, the integrated circRNA similarity and the integrated disease similarity were calculated, and a graph auto-encoder was used to learn the latent features of circRNAs and diseases to obtain low-dimensional representations. Secondly, the learned features were fed into inductive matrix completion to strengthen the similarity and dependence between nodes. Thirdly, the circRNA feature matrix and the disease feature matrix were integrated into a circRNA-disease feature matrix to enhance the stability and accuracy of prediction. Finally, a self-attention mechanism was introduced to extract important features from the feature matrix and reduce the dependence on other biological information.
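
The four steps above can be illustrated with a minimal, untrained sketch. The abstract does not give the model's equations, so everything below is an assumption made for illustration: a single GCN-style encoder layer ReLU(ÂXW), a bilinear inductive-matrix-completion score sigmoid(Z_c W Z_d^T), pair features formed by concatenation, and scaled dot-product self-attention without learned projections. The function names, toy matrices and fixed weights are hypothetical and are not taken from GIS-CDA.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def normalize_adjacency(sim):
    """Add self-loops, then apply symmetric normalization D^-1/2 A D^-1/2."""
    n = len(sim)
    a = [[sim[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def encode(sim, features, weight):
    """One GCN-style encoder layer: ReLU(A_hat X W)."""
    hidden = matmul(matmul(normalize_adjacency(sim), features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

def imc_scores(z_circ, z_dis, w):
    """Bilinear IMC-style score sigmoid(Z_c W Z_d^T) for every circRNA-disease pair."""
    raw = matmul(matmul(z_circ, w), transpose(z_dis))
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in raw]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    out = []
    for q in x:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        peak = max(logits)                      # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        out.append([sum(e / total * row[j] for e, row in zip(exps, x)) for j in range(d)])
    return out

# Toy inputs invented for illustration: 3 circRNAs, 2 diseases.
circ_sim = [[1.0, 0.6, 0.1], [0.6, 1.0, 0.2], [0.1, 0.2, 1.0]]
dis_sim = [[1.0, 0.3], [0.3, 1.0]]
assoc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # known circRNA-disease associations

w_circ = [[0.5, -0.2], [0.1, 0.4]]              # encoder weights, fixed rather than trained
w_dis = [[0.3, 0.1], [-0.1, 0.5], [0.2, 0.2]]
w_imc = [[1.0, 0.0], [0.0, 1.0]]                # IMC projection, identity for simplicity

z_circ = encode(circ_sim, assoc, w_circ)            # 3 x 2 circRNA embeddings
z_dis = encode(dis_sim, transpose(assoc), w_dis)    # 2 x 2 disease embeddings
scores = imc_scores(z_circ, z_dis, w_imc)           # 3 x 2 association scores

# Concatenate each circRNA embedding with each disease embedding, then attend over pairs.
pairs = [zc + zd for zc in z_circ for zd in z_dis]  # 6 x 4 pair-feature matrix
attended = self_attention(pairs)
```

In a real model the weight matrices would be learned by minimizing a reconstruction loss on the known associations; here they are fixed only so that the shapes and data flow are visible.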
The results of five-fold and ten-fold cross-validation show that the average Area Under the Receiver Operating Characteristic curve (AUROC) values of GIS-CDA are 0.930 3 and 0.939 3, respectively; the former is 13.19, 35.73, 13.28 and 5.01 percentage points higher than those of the computational model of KATZ measures for Human CircRNA-Disease Association (KATZHCDA), Deep Matrix Factorization for CircRNA-Disease Association (DMFCDA), Random Walk with Restart (RWR) and Speedup Inductive Matrix Completion for CircRNA-Disease Associations (SIMCCDA), respectively. The Area Under the Precision-Recall curve (AUPR) values of GIS-CDA are 0.227 1 and 0.234 0, respectively; the former is 21.72, 22.43, 21.96 and 13.86 percentage points higher than those of the above comparison models, respectively. In addition, ablation experiments and case studies on the circRNADisease, circ2Disease and circR2Disease datasets further validate the good performance of GIS-CDA in predicting potential circRNA-disease associations.
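
For readers unfamiliar with the two metrics, the following minimal sketch shows one common way to compute them from predicted scores. It assumes binary labels with at least one positive and one negative sample, and it estimates AUPR by average precision, which is only one of several estimators; the abstract does not state which procedure the authors used. The helper names, the round-robin fold split and the toy labels and scores are hypothetical and unrelated to the reported results.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank    # precision at each rank where a positive is found
    return total / hits

def k_fold(indices, k):
    """Split indices into k nearly equal folds (round-robin, no shuffling)."""
    return [indices[i::k] for i in range(k)]

# Toy example: 2 known associations among 5 candidate pairs.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(round(auroc(labels, scores), 4))              # 0.8333
print(round(average_precision(labels, scores), 4))  # 0.8333

# Each fold is held out once; the metric is then averaged over the k runs.
folds = k_fold(list(range(10)), 5)
```

When known associations are rare among all candidate pairs, AUPR values tend to be far lower than AUROC values, which is consistent with the gap between the two metrics reported above.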

Key words: graph auto-encoder, inductive matrix completion, self-attention mechanism, circular RNA (circRNA), circRNA-disease association information pair

CLC number: