Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (10): 3309-3314. DOI: 10.11772/j.issn.1001-9081.2022091447
Special topic: Frontier and Comprehensive Applications

Gene splice site identification based on BERT and CNN
Min ZUO1,2, Hong WANG1,2, Wenjing YAN1,2, Qingchuan ZHANG1,2

Received: 2022-09-29
Revised: 2022-12-22
Accepted: 2023-01-03
Online: 2023-03-17
Published: 2023-10-10
Contact: Qingchuan ZHANG
About author: ZUO Min, born in 1973 in Tongling, Anhui, Ph. D., professor. His research interests include food big data and deep learning.
Abstract:
With the development of high-throughput sequencing technology, massive genome sequence data provide a data foundation for understanding genome structure. Splice site identification is a key step in genomics research: it plays an important role in gene discovery and in determining gene structure, and it helps to understand the expression of gene traits. To address the insufficient ability of existing models to extract high-dimensional features of DeoxyriboNucleic Acid (DNA) sequences, a splice site prediction model, BERT-splice, was constructed by combining BERT (Bidirectional Encoder Representations from Transformers) with a parallel Convolutional Neural Network (CNN). First, a DNA language model was trained with the BERT pre-training method to extract the contextual, dynamically associated features of DNA sequences, and the DNA sequence features were represented as high-dimensional matrices. Then, the human reference genome hg19 was mapped to high-dimensional matrices by the DNA language model and used as the input of the parallel CNN classifier for re-training. Finally, the splice site prediction model was built on this basis. Experimental results show that the prediction accuracy of the BERT-splice model is 96.55% on the donor set and 95.80% on the acceptor set of DNA splice sites, improvements of 1.55% and 1.72%, respectively, over BERT-RCNN, a prediction model built from BERT and a Recurrent Convolutional Neural Network (RCNN). Meanwhile, the average False Positive Rate (FPR) of donor/acceptor splice sites obtained by testing the proposed model on five complete human gene sequences is 4.74%. These results verify the effectiveness of the BERT-splice model for gene splice site prediction.
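The abstract describes the BERT-splice pipeline only at a high level: a BERT-based DNA language model maps a sequence to a high-dimensional feature matrix, and a parallel CNN classifier reads that matrix to decide whether a splice site is present. The PyTorch sketch below illustrates one plausible wiring of such a model; the checkpoint name, hidden size, number of parallel branches, kernel sizes and filter counts are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a pretrained BERT encoder maps a
# tokenized DNA sequence to a (seq_len, hidden) matrix, and several parallel
# 1D-CNN branches with different kernel sizes read that matrix for binary
# splice-site classification. Checkpoint name, kernel sizes and channel
# counts are placeholders.
import torch
import torch.nn as nn
from transformers import BertModel

class BertParallelCNN(nn.Module):
    def __init__(self, bert_name="bert-base-uncased", n_filters=128,
                 kernel_sizes=(3, 5, 7), dropout=0.5):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)   # DNA language model in the paper
        hidden = self.bert.config.hidden_size              # 768 for BERT-base
        self.branches = nn.ModuleList(
            [nn.Conv1d(hidden, n_filters, k, padding=k // 2) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), 1)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden): contextual embeddings of the DNA tokens
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.transpose(1, 2)                               # (batch, hidden, seq_len)
        feats = [torch.relu(conv(h)).max(dim=2).values for conv in self.branches]
        logits = self.classifier(self.dropout(torch.cat(feats, dim=1)))
        return logits.squeeze(-1)                           # raw score; pair with BCEWithLogitsLoss
```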
Min ZUO, Hong WANG, Wenjing YAN, Qingchuan ZHANG. Gene splice site identification based on BERT and CNN[J]. Journal of Computer Applications, 2023, 43(10): 3309-3314.
Tab. 1 Statistics of dataset 2

| Splice site | Training set samples | Validation set samples | Independent test set samples |
|---|---|---|---|
| Donor | 118 704 | 14 840 | 14 840 |
| Acceptor | 129 124 | 16 142 | 16 142 |
Tab. 2 Model parameter settings

| Training parameter | Value |
|---|---|
| Batch size | 64 |
| Learning rate | 0.000 1 |
| Epochs | 20 |
| Loss function | Binary cross entropy |
| Update strategy (optimizer) | Adam |
| Dropout | 0.5 |
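Tab. 2 gives the training hyperparameters but no code. The loop below is a minimal sketch of how these settings (batch size 64, learning rate 0.000 1, 20 epochs, binary cross-entropy loss, Adam, dropout 0.5 inside the model) could be combined in PyTorch; `train_dataset`, the batch layout and the `BertParallelCNN` class from the sketch above are assumptions, not taken from the paper.

```python
# Minimal training sketch using the settings of Tab. 2; `train_dataset`
# (yielding input_ids, attention_mask, labels) is a placeholder.
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertParallelCNN().to(device)                              # dropout 0.5 inside the model
loader = DataLoader(train_dataset, batch_size=64, shuffle=True)   # batch size 64
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)         # Adam, learning rate 0.000 1
criterion = torch.nn.BCEWithLogitsLoss()                          # binary cross entropy

for epoch in range(20):                                           # 20 epochs
    model.train()
    for input_ids, attention_mask, labels in loader:
        input_ids, attention_mask, labels = (t.to(device) for t in (input_ids, attention_mask, labels))
        optimizer.zero_grad()
        loss = criterion(model(input_ids, attention_mask), labels.float())
        loss.backward()
        optimizer.step()
```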
Tab. 3 Accuracy and AUC of sequences with different input lengths on the independent test set

| Input length | Donor accuracy/% | Donor AUC | Acceptor accuracy/% | Acceptor AUC |
|---|---|---|---|---|
| 50 | 95.59 | 0.983 5 | 93.08 | 0.978 2 |
| 100 | 95.92 | 0.990 6 | 94.86 | 0.987 2 |
| 150 | 96.64 | 0.993 1 | 95.51 | 0.989 1 |
| 200 | 96.82 | 0.993 6 | 95.72 | 0.990 4 |
| 250 | 96.78 | 0.994 0 | 95.79 | 0.990 4 |
| 300 | 96.88 | 0.994 1 | 95.80 | 0.990 4 |
| 350 | 96.47 | 0.992 1 | 95.78 | 0.988 9 |
| 400 | 96.07 | 0.991 0 | 95.41 | 0.987 4 |
Tab. 4 Performance comparison of different models on the independent test set

| Model | Donor accuracy/% | Donor MCC | Donor sensitivity/% | Donor specificity/% | Donor AUC | Acceptor accuracy/% | Acceptor MCC | Acceptor sensitivity/% | Acceptor specificity/% | Acceptor AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Word2Vec | 80.82 | 0.62 | 75.64 | 84.37 | 0.889 7 | 78.49 | 0.57 | 81.17 | 77.03 | 0.869 6 |
| fastText | 82.46 | 0.65 | 78.81 | 85.01 | 0.904 5 | 79.04 | 0.58 | 78.21 | 79.53 | 0.875 5 |
| BERT | 96.55 | 0.93 | 97.29 | 95.88 | 0.991 8 | 95.80 | 0.92 | 96.64 | 95.04 | 0.990 4 |
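Tab. 4 and Tab. 5 report accuracy, MCC, sensitivity, specificity and AUC for donor and acceptor classification. As a reference for how such figures are derived from raw predictions, the snippet below computes the same metrics with scikit-learn; the 0.5 decision threshold and the placeholder arrays `y_true`/`y_score` are illustrative assumptions, not part of the paper.

```python
# Computing the evaluation metrics of Tab. 4/5 from predicted scores;
# y_true (0/1 labels) and y_score (model probabilities) are placeholders.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

def splice_metrics(y_true, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy_%": 100 * accuracy_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "sensitivity_%": 100 * tp / (tp + fn),   # true positive rate
        "specificity_%": 100 * tn / (tn + fp),   # true negative rate
        "AUC": roc_auc_score(y_true, y_score),
    }
```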
Tab. 5 Performance comparison of the proposed model and commonly used classification models on the independent test set

| Model | Donor accuracy/% | Donor MCC | Donor sensitivity/% | Donor specificity/% | Donor AUC | Acceptor accuracy/% | Acceptor MCC | Acceptor sensitivity/% | Acceptor specificity/% | Acceptor AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 93.56 | 0.87 | 95.25 | 92.13 | 0.981 6 | 93.30 | 0.87 | 94.11 | 92.60 | 0.979 4 |
| BERT-BiLSTM | 95.08 | 0.90 | 96.67 | 93.70 | 0.985 4 | 93.52 | 0.87 | 95.71 | 91.69 | 0.980 5 |
| BERT-RCNN | 95.08 | 0.90 | 96.27 | 94.04 | 0.984 2 | 94.18 | 0.88 | 94.24 | 94.12 | 0.982 9 |
| BERT-splice | 96.55 | 0.93 | 97.29 | 95.88 | 0.991 8 | 95.80 | 0.92 | 96.64 | 95.04 | 0.990 4 |
Tab. 6 Prediction results on human genes based on the BERT-splice model

| Gene | Length | Splice sites | Site type | GT/AG count in sequence | Top-50% accuracy/% | Predicted splice sites | FPR/% |
|---|---|---|---|---|---|---|---|
| uc002asa.2 | 93 235 | 8 | Donor | 5 066 | 100.00 | 264 | 5.06 |
| | | | Acceptor | 6 674 | 100.00 | 343 | 5.03 |
| uc003ulo.1 | 45 323 | 4 | Donor | 2 248 | 100.00 | 72 | 3.03 |
| | | | Acceptor | 3 210 | 100.00 | 104 | 3.12 |
| uc003qob.3 | 236 209 | 27 | Donor | 12 833 | 96.30 | 434 | 3.39 |
| | | | Acceptor | 16 825 | 96.30 | 822 | 4.89 |
| uc002vws.3 | 68 909 | 13 | Donor | 4 041 | 92.31 | 330 | 8.19 |
| | | | Acceptor | 5 688 | 100.00 | 393 | 6.93 |
| uc010wkx.1 | 26 779 | 16 | Donor | 1 554 | 75.00 | 51 | 3.32 |
| | | | Acceptor | 1 907 | 93.75 | 84 | 4.44 |
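Tab. 6 counts the GT (donor) and AG (acceptor) dinucleotides of each gene as candidate positions and reports how many of them the model predicts as splice sites. The sketch below shows one straightforward way such candidates could be enumerated and cut into fixed-length windows for classification; the 300 nt window (the best-performing length in Tab. 3), the centering convention and the function name are assumptions, not the authors' procedure.

```python
# Enumerating GT (candidate donor) or AG (candidate acceptor) positions in a
# gene sequence and extracting fixed-length windows around them; the window
# length and centering are illustrative choices.
def candidate_windows(sequence, motif="GT", window=300):
    half = window // 2
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == motif:
            start = max(0, i - half)
            yield i, seq[start:start + window]   # (position, window passed to the classifier)

# Example: count GT candidates as in the "GT/AG count in sequence" column of Tab. 6
# n_donor_candidates = sum(1 for _ in candidate_windows(gene_seq, "GT"))
```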