Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (2): 427-432. DOI: 10.11772/j.issn.1001-9081.2017071767

• Artificial Intelligence •

Biological sequence classification algorithm based on density-aware patterns

HU Yaowei1, DUAN Lei1,2, LI Ling3, HAN Chao1

  1. College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China;
  2. West China School of Public Health, Sichuan University, Chengdu, Sichuan 610041, China;
  3. College of Life Science, Sichuan University, Chengdu, Sichuan 610041, China
  • Received: 2017-07-24 Revised: 2017-09-13 Online: 2018-02-10 Published: 2018-02-10
  • Corresponding author: DUAN Lei
  • About the authors: HU Yaowei (born 1992), male, from Handan, Hebei, M.S. candidate; main research interest: data mining. DUAN Lei (born 1981), male, from Chengdu, Sichuan, associate professor, Ph.D., CCF senior member; main research interest: data mining. LI Ling (born 1969), male, from Chengdu, Sichuan, professor, Ph.D.; main research interest: biomedicine. HAN Chao (born 1993), male, from Qingyang, Gansu, M.S. candidate, CCF member; main research interests: data mining and distributed computing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61572332, 81473446), the China Postdoctoral Science Foundation (2016T90850), and the Fundamental Research Funds for the Central Universities (2016SCU04A22).

Abstract: To address the unsatisfactory classification accuracy and long model training time of existing pattern-based sequence classification algorithms on biological sequences, the concept of a density-aware pattern was proposed, and a biological sequence classification algorithm based on density-aware patterns, named BSC (Biological Sequence Classifier), was designed. Firstly, frequent sequential patterns satisfying the density-aware concept were mined from the biological sequences. Then, the mined patterns were filtered and sorted to form classification rules. Finally, unlabeled sequences were classified with these rules. In experiments on four real biological sequence datasets, the influence of the BSC parameters on the results was analyzed and recommended parameter settings were provided. The classification results showed that, compared with four other pattern-based classification algorithms, BSC improved accuracy on the experimental datasets by at least 2.03 percentage points. The results indicate that BSC achieves high classification accuracy and execution efficiency on biological sequences.

Key words: biological sequence, sequence classification, sequential pattern, density-aware pattern, classification rule
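
The three steps named in the abstract (mine frequent patterns, filter and sort them into rules, classify with the rules) can be illustrated with a minimal sketch. The abstract does not define the density-aware pattern, so this sketch is not BSC: it swaps in plain contiguous k-mers with a per-class support threshold, and it ranks rules by confidence and then support, which is an assumption. The names `kmers`, `mine_rules`, and `classify` and the parameters `k`, `min_support`, and `default` are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of contiguous length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mine_rules(train, k=3, min_support=0.5):
    """Mine frequent k-mers per class and rank them as classification rules.

    train: list of (sequence, label) pairs.
    Returns (pattern, label, confidence, support) tuples, best rule first.
    NOTE: plain k-mers stand in for density-aware patterns (an assumption).
    """
    class_sizes = Counter(label for _, label in train)
    # occurrences[pattern][label] = number of sequences with that label
    # that contain the pattern at least once
    occurrences = defaultdict(Counter)
    for seq, label in train:
        for pattern in kmers(seq, k):
            occurrences[pattern][label] += 1

    rules = []
    for pattern, per_class in occurrences.items():
        total = sum(per_class.values())
        for label, count in per_class.items():
            support = count / class_sizes[label]  # fraction of the class covered
            if support >= min_support:            # filtering step
                confidence = count / total        # purity of the pattern
                rules.append((pattern, label, confidence, support))

    # Sorting step: higher confidence first, then higher support,
    # then the pattern text so the order is deterministic.
    rules.sort(key=lambda r: (-r[2], -r[3], r[0]))
    return rules


def classify(seq, rules, default):
    """Return the label of the first rule whose pattern occurs in seq."""
    for pattern, label, _, _ in rules:
        if pattern in seq:
            return label
    return default


# Toy data, made up for illustration only.
train = [("ACGTAC", "A"), ("ACGTTT", "A"), ("GGGCCC", "B"), ("TGGGCA", "B")]
rules = mine_rules(train, k=3, min_support=0.5)
print(classify("TTACGA", rules, default="unknown"))
print(classify("AAGGGA", rules, default="unknown"))
```

Using the first matching rule keeps prediction simple and makes the rule order decisive, which is why the sorting step matters in this kind of classifier.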

CLC Number: