Journal of Computer Applications, 2023, Vol. 43, Issue (10): 3309-3314. DOI: 10.11772/j.issn.1001-9081.2022091447

• Frontier and Comprehensive Applications •

Gene splice site identification based on BERT and CNN

ZUO Min1,2, WANG Hong1,2, YAN Wenjing1,2, ZHANG Qingchuan1,2

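  1. National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048
  2. School of E-Business and Logistics, Beijing Technology and Business University, Beijing 100048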
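  • Received: 2022-09-29 Revised: 2022-12-22 Accepted: 2023-01-03 Online: 2023-03-17 Published: 2023-10-10
  • Corresponding author: ZHANG Qingchuan
  • About the authors: ZUO Min (born 1973), male, from Tongling, Anhui, Ph.D., professor. Research interests: food big data, deep learning.
    WANG Hong (born 1997), female, from Datong, Shanxi, M.S. candidate. Research interests: natural language processing.
    YAN Wenjing (born 1985), female, from Huainan, Anhui, Ph.D., lecturer. Research interests: intelligent processing of biological information, deep learning, image recognition.
    ZHANG Qingchuan (born 1982), male, from Shijiazhuang, Hebei, Ph.D., associate professor. Research interests: natural language processing, deep learning, information extraction. Email: zqc1982@126.com
  • Funding:
    National Natural Science Foundation of China (61873027)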

Gene splice site identification based on BERT and CNN

Min ZUO1,2, Hong WANG1,2, Wenjing YAN1,2, Qingchuan ZHANG1,2

  1. National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China
  2. School of E-Business and Logistics, Beijing Technology and Business University, Beijing 100048, China
  • Received:2022-09-29 Revised:2022-12-22 Accepted:2023-01-03 Online:2023-03-17 Published:2023-10-10
  • Contact: Qingchuan ZHANG
  • About the authors: ZUO Min, born in 1973, Ph.D., professor. His research interests include food big data and deep learning.
    WANG Hong, born in 1997, M.S. candidate. Her research interests include natural language processing.
    YAN Wenjing, born in 1985, Ph.D., lecturer. Her research interests include intelligent processing of biological information, deep learning, and image recognition.
    ZHANG Qingchuan, born in 1982, Ph.D., associate professor. His research interests include natural language processing, deep learning, and information extraction.
  • Supported by:
    National Natural Science Foundation of China (61873027)

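Abstract:

With the development of high-throughput sequencing technology, massive genome sequence data provide a data foundation for understanding genome structure. Splice site identification is an important part of genomics research: it plays an important role in gene discovery and in determining gene structure, and it helps in understanding the expression of gene traits. To address the insufficient ability of existing models to extract high-dimensional features from deoxyribonucleic acid (DNA) sequences, a splice site prediction model named BERT-splice was constructed by combining BERT (Bidirectional Encoder Representations from Transformers) with parallel Convolutional Neural Networks (CNN). First, a DNA language model was trained with the BERT pre-training method to extract the contextual dynamic association features of DNA sequences, with DNA sequence features mapped to a high-dimensional matrix. Second, data from the human reference genome sequence hg19 were mapped to a high-dimensional matrix by the DNA language model and used as the input of the parallel CNN classifier for retraining. Finally, the splice site prediction model was built on this basis. Experimental results show that the BERT-splice model reaches a prediction accuracy of 96.55% on the donor set of DNA splice sites and 95.80% on the acceptor set, which is 1.55% and 1.72% higher, respectively, than BERT-RCNN, a prediction model built from BERT and a Recurrent Convolutional Neural Network (RCNN). In addition, the average False Positive Rate (FPR) of the proposed model for donor/acceptor splice sites, tested on 5 complete human gene sequences, is 4.74%. These results verify the effectiveness of the BERT-splice model for gene splice site prediction.

Keywords: splice site identification, BERT, convolutional neural network, deep learning, deoxyribonucleic acid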
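
The parallel CNN stage mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the kernel widths, the number of branches, or the pooling scheme, so everything below is an assumption for illustration: the function names `conv_max_pool` and `parallel_cnn_features` are hypothetical, the toy matrix stands in for the high-dimensional matrix produced by the DNA language model, and ReLU with max-over-time pooling is one common choice rather than the authors' confirmed design.

```python
def conv_max_pool(embeddings, kernel):
    """Slide one kernel (width x dim) along the sequence axis, apply ReLU,
    and keep the largest activation (max-over-time pooling).

    Returns 0.0 if the sequence is shorter than the kernel.
    """
    width = len(kernel)
    best = 0.0  # ReLU floor: negative activations are clipped to zero
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        activation = sum(
            w * x
            for kernel_row, emb_row in zip(kernel, window)
            for w, x in zip(kernel_row, emb_row)
        )
        best = max(best, activation)
    return best


def parallel_cnn_features(embeddings, branches):
    """Run several convolution branches side by side and concatenate results.

    `branches` is a list of branches; each branch is a list of kernels that
    share one width (hypothetical layout, not taken from the paper).
    """
    features = []
    for kernels in branches:
        features.extend(conv_max_pool(embeddings, k) for k in kernels)
    return features


# Toy input: 4 tokens, each with a 2-dimensional embedding (assumed values).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Two parallel branches with kernel widths 2 and 3 (assumed widths).
width2_kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
width3_kernels = [
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]

features = parallel_cnn_features(toy_embeddings, [width2_kernels, width3_kernels])
print(features)
```

In a full model the concatenated feature vector would be passed to a classification layer; that step is omitted here because the abstract does not describe it.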

Abstract:

With the development of high-throughput sequencing technology, massive genome sequence data provide a basis for understanding the structure of the genome. As an essential part of genomics research, splice site identification plays a vital role in gene discovery and the determination of gene structure, and is important for understanding the expression of gene traits. To address the problem that existing models cannot sufficiently extract high-dimensional features of DeoxyriboNucleic Acid (DNA) sequences, a splice site prediction model named BERT-splice, consisting of BERT (Bidirectional Encoder Representations from Transformers) and a parallel Convolutional Neural Network (CNN), was constructed. Firstly, a DNA language model was trained with the BERT pre-training method to extract the contextual dynamic association features of DNA sequences and to map DNA sequence features into a high-dimensional matrix. Then, the DNA language model was used to map the human reference genome sequence hg19 data into a high-dimensional matrix, which was used as the input of the parallel CNN classifier for retraining. Finally, the splice site prediction model was constructed on this basis. Experimental results show that the prediction accuracy of the BERT-splice model is 96.55% on the donor set of DNA splice sites and 95.80% on the acceptor set, improvements of 1.55% and 1.72%, respectively, over BERT-RCNN, a prediction model built from BERT and a Recurrent Convolutional Neural Network (RCNN). Meanwhile, the average False Positive Rate (FPR) for donor/acceptor splice sites, tested on five complete human gene sequences, is 4.74%. These results verify the effectiveness of the BERT-splice model for gene splice site prediction.

Key words: splice site identification, Bidirectional Encoder Representations from Transformers (BERT), Convolutional Neural Network (CNN), deep learning, DeoxyriboNucleic Acid (DNA)
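
The two metrics reported in the abstract, accuracy and False Positive Rate (FPR), can be sketched as follows. This assumes the standard definitions, accuracy = (TP + TN) / all samples and FPR = FP / (FP + TN); the abstract does not state the exact formulas used. The function names and the toy labels are hypothetical and are not data from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = splice site, 0 = not)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def false_positive_rate(y_true, y_pred):
    """Fraction of true negatives that were wrongly predicted as splice sites."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


# Toy labels for illustration only (not results from the paper).
toy_truth = [1, 0, 0, 1, 0, 0, 1, 0]
toy_preds = [1, 0, 1, 1, 0, 0, 0, 0]

print(accuracy(toy_truth, toy_preds))             # 0.75
print(false_positive_rate(toy_truth, toy_preds))  # 0.2
```

On whole gene sequences, true splice sites are rare among candidate positions, so FPR is reported alongside accuracy: it measures how often non-splice positions are flagged.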

CLC number: