Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (10): 2808-2812. DOI: 10.11772/j.issn.1001-9081.2015.10.2808

• Papers from the 15th China Conference on Machine Learning (CCML2015) •

Human promoter recognition based on single nucleotide statistics and support vector machine ensemble

XU Wenxuan1, ZHANG Li1,2

  1. School of Computer Science and Technology, Soochow University, Suzhou Jiangsu 215006, China;
  2. Provincial Key Laboratory for Computer Information Processing Technology (Soochow University), Suzhou Jiangsu 215006, China
  • Received: 2015-06-15 Revised: 2015-06-27 Online: 2015-10-10 Published: 2015-10-14
  • Corresponding author: ZHANG Li (born 1975), female, from Zhangjiagang, Jiangsu; professor, doctoral supervisor, Ph.D., CCF senior member; research interests: machine learning, pattern recognition. E-mail: zhangliml@suda.edu.cn
  • About the author: XU Wenxuan (born 1993), male, from Yancheng, Jiangsu; CCF member; research interests: machine learning, pattern recognition.
  • Supported by:
    National Natural Science Foundation of China (61373093); National Undergraduate Innovation and Entrepreneurship Training Program (201410285032); Natural Science Foundation of Jiangsu Province (BK20140008, BK201222725); Natural Science Research Project of Jiangsu Higher Education Institutions (13KJA520001); Jiangsu Province "Qing Lan Project"; Soochow University Undergraduate Extracurricular Academic Research Fund (KY2014687B, KY2015544B, KY2015818B); Soochow University Jingwen College "3I Project" (29).

Abstract: To discriminate promoters in the human genome efficiently, an algorithm for human promoter recognition based on single nucleotide statistics and a Support Vector Machine (SVM) ensemble was proposed. Firstly, a gene dataset was divided into two subsets, C-preferred and G-preferred, by using single nucleotide statistics. Secondly, the DNA rigidity feature, the word-frequency feature and the CpG-island feature were extracted from each subset. Finally, these three features were learned by an ensemble of multiple SVMs. Three ensemble schemes were discussed: single-layer SVM ensemble, double-layer SVM ensemble and cascaded SVM ensemble. The experimental results show that the proposed method can improve the sensitivity and specificity of human promoter recognition. In particular, the double-layer SVM ensemble achieves the highest sensitivity, 79.51%, while the cascaded SVM ensemble achieves the highest specificity, 84.58%.

Key words: CpG-island, DNA rigidity, human promoter recognition, Kullback-Leibler divergence, single nucleotide statistics, Support Vector Machine (SVM)
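
The first two steps named in the abstract (splitting sequences by single nucleotide statistics, then extracting per-sequence features) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the split rule (more C than G, ties going to C) is an assumption, and the CpG feature uses the commonly cited GC-content and observed/expected CpG ratio rather than whatever definition the paper adopts. DNA rigidity and the SVM ensembles are omitted because the abstract gives no parameters for them.

```python
from __future__ import annotations

from collections import Counter
from itertools import product


def nucleotide_preference(seq: str) -> str:
    """Label a sequence 'C' or 'G' by which of the two bases is more frequent.

    Assumption: ties go to 'C'; the paper's actual statistic may differ.
    """
    counts = Counter(seq.upper())
    return "C" if counts["C"] >= counts["G"] else "G"


def cpg_features(seq: str) -> tuple[float, float]:
    """Return (GC content, observed/expected CpG ratio) for one sequence."""
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def kmer_frequencies(seq: str, k: int = 2) -> list[float]:
    """Relative frequency of every DNA word of length k, in lexicographic order."""
    s = seq.upper()
    total = max(len(s) - k + 1, 0)
    counts = Counter(s[i:i + k] for i in range(total))
    return [
        counts["".join(word)] / total if total else 0.0
        for word in product("ACGT", repeat=k)
    ]


def split_by_preference(seqs: list[str]) -> dict[str, list[str]]:
    """Partition a dataset into C-preferred and G-preferred subsets."""
    subsets: dict[str, list[str]] = {"C": [], "G": []}
    for seq in seqs:
        subsets[nucleotide_preference(seq)].append(seq)
    return subsets
```

In a full pipeline, each subset returned by `split_by_preference` would get its own feature vectors and its own classifiers; how those classifiers are combined is what distinguishes the single-layer, double-layer and cascaded ensembles described above.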

CLC Number: