Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (10): 3309-3314. DOI: 10.11772/j.issn.1001-9081.2022091447
Special topic: Frontier and Comprehensive Applications

Gene splice site identification based on BERT and CNN
Min ZUO1,2, Hong WANG1,2, Wenjing YAN1,2, Qingchuan ZHANG1,2

Received: 2022-09-29
Revised: 2022-12-22
Accepted: 2023-01-03
Online: 2023-03-17
Published: 2023-10-10
Contact: Qingchuan ZHANG
About author: ZUO Min, born in 1973 in Tongling, Anhui, Ph. D., professor. His research interests include food big data and deep learning.
Abstract:
With the development of high-throughput sequencing technology, massive genome sequence data provide a data foundation for understanding genome structure. Splice site identification is a key step in genomics research: it plays an important role in gene discovery and in determining gene structure, and it helps to understand the expression of gene traits. To address the insufficient ability of existing models to extract high-dimensional features of DeoxyriboNucleic Acid (DNA) sequences, a splice site prediction model, BERT-splice, was constructed by combining BERT (Bidirectional Encoder Representations from Transformers) with a parallel Convolutional Neural Network (CNN). First, a DNA language model was trained with the BERT pre-training method to extract the contextual, dynamically associated features of DNA sequences, and the DNA sequence features were represented as high-dimensional matrices. Then, the human reference genome hg19 was mapped to high-dimensional matrices by the DNA language model and used as the input of the parallel CNN classifier for re-training. Finally, the splice site prediction model was built on this basis. Experimental results show that the prediction accuracy of the BERT-splice model is 96.55% on the donor set and 95.80% on the acceptor set of DNA splice sites, improvements of 1.55% and 1.72%, respectively, over BERT-RCNN, a prediction model built from BERT and a Recurrent Convolutional Neural Network (RCNN). Meanwhile, the average False Positive Rate (FPR) of donor/acceptor splice sites obtained by testing the proposed model on five complete human gene sequences is 4.74%. These results verify the effectiveness of the BERT-splice model for gene splice site prediction.
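The abstract describes the BERT-splice pipeline only at a high level: a BERT-based DNA language model maps a sequence to a high-dimensional feature matrix, and a parallel CNN classifier reads that matrix to decide whether a splice site is present. The PyTorch sketch below illustrates one plausible wiring of such a model; the checkpoint name, hidden size, number of parallel branches, kernel sizes and filter counts are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a pretrained BERT encoder maps a
# tokenized DNA sequence to a (seq_len, hidden) matrix, and several parallel
# 1D-CNN branches with different kernel sizes read that matrix for binary
# splice-site classification. Checkpoint name, kernel sizes and channel
# counts are placeholders.
import torch
import torch.nn as nn
from transformers import BertModel

class BertParallelCNN(nn.Module):
    def __init__(self, bert_name="bert-base-uncased", n_filters=128,
                 kernel_sizes=(3, 5, 7), dropout=0.5):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)   # DNA language model in the paper
        hidden = self.bert.config.hidden_size              # 768 for BERT-base
        self.branches = nn.ModuleList(
            [nn.Conv1d(hidden, n_filters, k, padding=k // 2) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), 1)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden): contextual embeddings of the DNA tokens
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.transpose(1, 2)                               # (batch, hidden, seq_len)
        feats = [torch.relu(conv(h)).max(dim=2).values for conv in self.branches]
        logits = self.classifier(self.dropout(torch.cat(feats, dim=1)))
        return logits.squeeze(-1)                           # raw score; pair with BCEWithLogitsLoss
```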
Min ZUO, Hong WANG, Wenjing YAN, Qingchuan ZHANG. Gene splice site identification based on BERT and CNN[J]. Journal of Computer Applications, 2023, 43(10): 3309-3314.
Tab. 1 Statistics of dataset 2

| Splice site | Training set samples | Validation set samples | Independent test set samples |
|---|---|---|---|
| Donor | 118 704 | 14 840 | 14 840 |
| Acceptor | 129 124 | 16 142 | 16 142 |
Tab. 2 Model parameter settings

| Training parameter | Value |
|---|---|
| Batch size | 64 |
| Learning rate | 0.000 1 |
| Epochs | 20 |
| Loss function | Binary cross entropy |
| Update strategy (optimizer) | Adam |
| Dropout | 0.5 |
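Tab. 2 gives the training hyperparameters but no code. The loop below is a minimal sketch of how these settings (batch size 64, learning rate 0.000 1, 20 epochs, binary cross-entropy loss, Adam, dropout 0.5 inside the model) could be combined in PyTorch; `train_dataset`, the batch layout and the `BertParallelCNN` class from the sketch above are assumptions, not taken from the paper.

```python
# Minimal training sketch using the settings of Tab. 2; `train_dataset`
# (yielding input_ids, attention_mask, labels) is a placeholder.
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertParallelCNN().to(device)                              # dropout 0.5 inside the model
loader = DataLoader(train_dataset, batch_size=64, shuffle=True)   # batch size 64
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)         # Adam, learning rate 0.000 1
criterion = torch.nn.BCEWithLogitsLoss()                          # binary cross entropy

for epoch in range(20):                                           # 20 epochs
    model.train()
    for input_ids, attention_mask, labels in loader:
        input_ids, attention_mask, labels = (t.to(device) for t in (input_ids, attention_mask, labels))
        optimizer.zero_grad()
        loss = criterion(model(input_ids, attention_mask), labels.float())
        loss.backward()
        optimizer.step()
```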
Tab. 3 Accuracy and AUC of sequences with different input lengths on the independent test set

| Input length | Donor accuracy/% | Donor AUC | Acceptor accuracy/% | Acceptor AUC |
|---|---|---|---|---|
| 50 | 95.59 | 0.983 5 | 93.08 | 0.978 2 |
| 100 | 95.92 | 0.990 6 | 94.86 | 0.987 2 |
| 150 | 96.64 | 0.993 1 | 95.51 | 0.989 1 |
| 200 | 96.82 | 0.993 6 | 95.72 | 0.990 4 |
| 250 | 96.78 | 0.994 0 | 95.79 | 0.990 4 |
| 300 | 96.88 | 0.994 1 | 95.80 | 0.990 4 |
| 350 | 96.47 | 0.992 1 | 95.78 | 0.988 9 |
| 400 | 96.07 | 0.991 0 | 95.41 | 0.987 4 |
Tab. 4 Performance comparison of different models on the independent test set

| Model | Donor accuracy/% | Donor MCC | Donor sensitivity/% | Donor specificity/% | Donor AUC | Acceptor accuracy/% | Acceptor MCC | Acceptor sensitivity/% | Acceptor specificity/% | Acceptor AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Word2Vec | 80.82 | 0.62 | 75.64 | 84.37 | 0.889 7 | 78.49 | 0.57 | 81.17 | 77.03 | 0.869 6 |
| fastText | 82.46 | 0.65 | 78.81 | 85.01 | 0.904 5 | 79.04 | 0.58 | 78.21 | 79.53 | 0.875 5 |
| BERT | 96.55 | 0.93 | 97.29 | 95.88 | 0.991 8 | 95.80 | 0.92 | 96.64 | 95.04 | 0.990 4 |
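Tab. 4 and Tab. 5 report accuracy, MCC, sensitivity, specificity and AUC for donor and acceptor classification. As a reference for how such figures are derived from raw predictions, the snippet below computes the same metrics with scikit-learn; the 0.5 decision threshold and the placeholder arrays `y_true`/`y_score` are illustrative assumptions, not part of the paper.

```python
# Computing the evaluation metrics of Tab. 4/5 from predicted scores;
# y_true (0/1 labels) and y_score (model probabilities) are placeholders.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

def splice_metrics(y_true, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy_%": 100 * accuracy_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "sensitivity_%": 100 * tp / (tp + fn),   # true positive rate
        "specificity_%": 100 * tn / (tn + fp),   # true negative rate
        "AUC": roc_auc_score(y_true, y_score),
    }
```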
Tab. 5 Performance comparison of the proposed model and commonly used classification models on the independent test set

| Model | Donor accuracy/% | Donor MCC | Donor sensitivity/% | Donor specificity/% | Donor AUC | Acceptor accuracy/% | Acceptor MCC | Acceptor sensitivity/% | Acceptor specificity/% | Acceptor AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 93.56 | 0.87 | 95.25 | 92.13 | 0.981 6 | 93.30 | 0.87 | 94.11 | 92.60 | 0.979 4 |
| BERT-BiLSTM | 95.08 | 0.90 | 96.67 | 93.70 | 0.985 4 | 93.52 | 0.87 | 95.71 | 91.69 | 0.980 5 |
| BERT-RCNN | 95.08 | 0.90 | 96.27 | 94.04 | 0.984 2 | 94.18 | 0.88 | 94.24 | 94.12 | 0.982 9 |
| BERT-splice | 96.55 | 0.93 | 97.29 | 95.88 | 0.991 8 | 95.80 | 0.92 | 96.64 | 95.04 | 0.990 4 |
Tab. 6 Prediction results on human genes based on the BERT-splice model

| Gene | Length | Splice sites | Site type | GT/AG count in sequence | Top-50% accuracy/% | Predicted splice sites | FPR/% |
|---|---|---|---|---|---|---|---|
| uc002asa.2 | 93 235 | 8 | Donor | 5 066 | 100.00 | 264 | 5.06 |
| | | | Acceptor | 6 674 | 100.00 | 343 | 5.03 |
| uc003ulo.1 | 45 323 | 4 | Donor | 2 248 | 100.00 | 72 | 3.03 |
| | | | Acceptor | 3 210 | 100.00 | 104 | 3.12 |
| uc003qob.3 | 236 209 | 27 | Donor | 12 833 | 96.30 | 434 | 3.39 |
| | | | Acceptor | 16 825 | 96.30 | 822 | 4.89 |
| uc002vws.3 | 68 909 | 13 | Donor | 4 041 | 92.31 | 330 | 8.19 |
| | | | Acceptor | 5 688 | 100.00 | 393 | 6.93 |
| uc010wkx.1 | 26 779 | 16 | Donor | 1 554 | 75.00 | 51 | 3.32 |
| | | | Acceptor | 1 907 | 93.75 | 84 | 4.44 |
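Tab. 6 counts the GT (donor) and AG (acceptor) dinucleotides of each gene as candidate positions and reports how many of them the model predicts as splice sites. The sketch below shows one straightforward way such candidates could be enumerated and cut into fixed-length windows for classification; the 300 nt window (the best-performing length in Tab. 3), the centering convention and the function name are assumptions, not the authors' procedure.

```python
# Enumerating GT (candidate donor) or AG (candidate acceptor) positions in a
# gene sequence and extracting fixed-length windows around them; the window
# length and centering are illustrative choices.
def candidate_windows(sequence, motif="GT", window=300):
    half = window // 2
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == motif:
            start = max(0, i - half)
            yield i, seq[start:start + window]   # (position, window passed to the classifier)

# Example: count GT candidates as in the "GT/AG count in sequence" column of Tab. 6
# n_donor_candidates = sum(1 for _ in candidate_windows(gene_seq, "GT"))
```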