Journal of Computer Applications (计算机应用), 2011, Vol. 31, Issue (08): 2087-2091. DOI: 10.3724/SP.J.1087.2011.02087

• Artificial Intelligence •

Short coding sequence identification of human genes based on YKW graphical representation

Jia-wei LUO, Jun YAN, Hai-feng HE

  1. College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
  • Received: 2011-01-24 Revised: 2011-03-15 Online: 2011-08-01 Published: 2011-08-01
  • Corresponding author: Jia-wei LUO
  • About the authors: Jia-wei LUO (born 1964), female, from Changsha, Hunan; professor, doctoral supervisor, Ph.D.; research interests: bioinformatics and data mining. Jun YAN (born 1986), male, from Zhuzhou, Hunan; master's student; research interest: bioinformatics. Hai-feng HE (born 1986), male, from Fuqing, Fujian; master's student; research interest: bioinformatics.
  • Supported by: National Natural Science Foundation of China (60873184); Natural Science Foundation of Hunan Province (07JJ5086)

Abstract: To identify short coding sequences in human genes, a new graphical representation of gene sequences, the YKW graph, is proposed. It is based on base bias at the three codon positions and on a classification of bases by their physicochemical properties. Nine effective area-matrix features are extracted from the YKW graph. To improve the recognition rate, an incremental feature selection algorithm is used to add four statistical features, and Principal Component Analysis (PCA) is applied to reduce the dimensionality of the resulting 13 features. Finally, a Support Vector Machine (SVM) classifies short human sequences as coding or non-coding. Experimental results show that, compared with other methods, the proposed method achieves better recognition results with fewer features (seven or four).

Key words: graphical representation, short coding sequence identification, area matrix, gene sequence
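
The idea the abstract starts from, base bias at the three codon positions combined with chemical base classes, can be illustrated with a minimal Python sketch. It assumes that Y, K, and W carry their standard IUPAC meanings (Y = pyrimidine, C or T; K = keto, G or T; W = weak hydrogen bonding, A or T), which the abstract does not state explicitly. The function name and the example sequence are hypothetical. This is not the paper's YKW curve or its area-matrix features, whose construction the abstract does not define; it only tabulates how often each class occurs at each codon position.

```python
# Assumption: Y, K, W follow the IUPAC nucleotide codes; the abstract
# does not spell out the classification it uses.
IUPAC_CLASSES = {
    "Y": set("CT"),  # pyrimidines
    "K": set("GT"),  # keto bases
    "W": set("AT"),  # weak hydrogen bonding
}


def codon_position_class_frequencies(seq):
    """Return {class: [f1, f2, f3]}, the fraction of bases belonging to
    each class at codon positions 1, 2, and 3 (reading frame 0)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("need at least one complete codon")
    freqs = {}
    for name, members in IUPAC_CLASSES.items():
        row = []
        for pos in range(3):
            bases = seq[pos:usable:3]  # every base at this codon position
            row.append(sum(b in members for b in bases) / len(bases))
        freqs[name] = row
    return freqs


# Made-up 12-base sequence (codons ATG GCC TTT AAA), for illustration only.
table = codon_position_class_frequencies("ATGGCCTTTAAA")
for name, row in table.items():
    print(name, row)
```

In a coding region, such position-specific frequencies tend to differ across the three codon positions, which is the kind of bias a coding/non-coding classifier can exploit. How the paper turns this bias into a curve and into area-matrix features is described in the full text, not in the abstract.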
