Journal of Computer Applications, 2022, Vol. 42, Issue (11): 3404-3412. DOI: 10.11772/j.issn.1001-9081.2021111956

• CCF Bigdata 2021

Deep fusion model for predicting differential gene expression by histone modification data

Xin LI, Tao JIA

  1. College of Computer and Information Science, Southwest University, Chongqing 400715, China
  • Received: 2021-11-17 Revised: 2021-11-23 Accepted: 2021-12-06 Online: 2021-12-31 Published: 2022-11-10
  • Contact: Tao JIA
  • About author: LI Xin, born in 1997, M. S. candidate. Her research interests include data mining, bioinformatics, and machine learning.
    JIA Tao, born in 1982, Ph. D., professor. His research interests include data science and complex networks.
  • Supported by:
    Industry-University-Research Innovation Fund for Universities of China, Ministry of Education (2021ALA03016)

Chinese title: 基于组蛋白修饰数据预测基因差异性表达的深度融合模型

LI Xin (李昕), JIA Tao (贾韬)

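  1. College of Computer and Information Science, Southwest University, Chongqing 400715
  • Corresponding author: JIA Tao
  • About the authors: LI Xin (born 1997), female, from Mianyang, Sichuan, M. S. candidate, CCF member. Her research interests include data mining, bioinformatics, and machine learning.
    JIA Tao (born 1982), male, from Chongqing, professor, Ph. D., CCF member. His research interests include data science and complex networks. tjia@swu.edu.cn
  • Supported by:
    Industry-University-Research Innovation Fund for Universities of China, Ministry of Education (2021ALA03016)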

Abstract:

To address the problems that Cell type-Specificity (CS) and the similarity and difference information between cell types are not properly used when predicting Differential Gene Expression (DGE) from large-scale Histone Modification (HM) data, and that the input volume is large and the computational cost is high, a deep learning-based method named dcsDiff was proposed. Firstly, multiple AutoEncoders (AEs) and Bi-directional Long Short-Term Memory (Bi-LSTM) networks were introduced to reduce the dimensionality of the HM signals and to model them, yielding an embedded representation. Then, multiple Convolutional Neural Networks (CNNs) were used to mine two kinds of information: the combined effects of HMs within each single cell type, and, between the two cell types, the similarity and difference information of each HM together with the joint effects of all HMs. Finally, the two kinds of information were fused to predict DGE between the two cell types. In comparison experiments with DeepDiff on 10 pairs of cell types from the REMC (Roadmap Epigenomics Mapping Consortium) database, the Pearson Correlation Coefficient (PCC) of dcsDiff in DGE prediction increased by up to 7.2% and by 3.9% on average, the number of differentially expressed genes accurately detected by dcsDiff increased by up to 36 and by 17.6 on average, and the running time of dcsDiff was reduced by 78.7%. The component analysis experiment demonstrated the validity of reasonably integrating the two kinds of information, and the parameters of dcsDiff were also determined experimentally. Experimental results show that the proposed dcsDiff can effectively improve the efficiency of DGE prediction.
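
The PCC used above as the evaluation metric can be illustrated with a minimal sketch. The function name `pearson_correlation` and the toy values are assumptions made for illustration only; they are not taken from the paper or from the dcsDiff implementation.

```python
import math
from typing import Sequence


def pearson_correlation(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Pearson Correlation Coefficient between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("at least two values are required")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    # Sums of squared deviations (unnormalized variances).
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)


# Hypothetical toy DGE values for five genes (illustrative, not real data).
observed_dge = [1.5, -0.5, 0.0, 2.0, -1.0]
predicted_dge = [1.2, -0.3, 0.1, 1.8, -0.7]
pcc = pearson_correlation(predicted_dge, observed_dge)
print(f"PCC = {pcc:.3f}")
```

A PCC closer to 1 means the predicted DGE values track the observed ones more closely, which is the sense in which a higher PCC indicates better prediction.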

Key words: Histone Modification (HM), Differential Gene Expression (DGE), Cell type-Specificity (CS), AutoEncoder (AE), Bi-directional Long Short-Term Memory (Bi-LSTM) network, information fusion, epigenetics

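Abstract (Chinese version, translated):

To address the problems that two kinds of information, Cell type-Specificity (CS) and the similarities and differences between cell types, are not reasonably used when predicting Differential Gene Expression (DGE) from large-scale Histone Modification (HM) data, and that the input scale is large and the computational cost is high, a deep learning method named dcsDiff is proposed. First, multiple AutoEncoders (AEs) and Bi-directional Long Short-Term Memory (Bi-LSTM) networks are used to reduce dimensionality and to model the HM signals, yielding embedded representations. Then, multiple Convolutional Neural Networks (CNNs) are used to mine, respectively, the combined HM effects within each cell type, and the similarity and difference information of each HM together with the joint influence of all HMs between the two cell types. Finally, the two kinds of information are fused to predict DGE between the two cell types. In experiments on 10 pairs of cell types from the REMC database, compared with DeepDiff, the Pearson Correlation Coefficient (PCC) of dcsDiff for DGE prediction increased by up to 7.2% and by 3.9% on average, the number of accurately detected differentially expressed genes increased by up to 36 and by 17.6 on average, and the running time was reduced by 78.7%. Further component analysis experiments demonstrated the effectiveness of reasonably integrating the two kinds of information, and the parameters of the algorithm were determined experimentally. The experimental results show that dcsDiff can effectively improve the efficiency of DGE prediction.

Key words (Chinese version, translated): histone modification, differential gene expression, cell type-specificity, autoencoder, bi-directional long short-term memory network, information fusion, epigenetics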

CLC Number: