Classification of symbolic sequences with multi-order Markov model

Journal of Computer Applications, 2017, Vol. 37, Issue (7): 1977-1982. DOI: 10.11772/j.issn.1001-9081.2017.07.1977

CHENG Lingfang¹, GUO Gongde², CHEN Lifei²

1. Jinshan College of Fujian Agriculture and Forestry University, Fuzhou Fujian 350002, China;
2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou Fujian 350117, China

Received: 2017-01-13; Revised: 2017-03-05; Online: 2017-07-18; Published: 2017-07-10
Supported by:
This work is supported by the National Natural Science Foundation of China (61672157).

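Corresponding author: CHEN Lifei
About the authors: CHENG Lingfang (1983-), female, born in Tengzhou, Shandong, lecturer, M.S.; research interests: machine learning and data mining. GUO Gongde (1965-), male, born in Longyan, Fujian, professor, Ph.D.; research interests: artificial intelligence and data mining. CHEN Lifei (1972-), male, born in Changle, Fujian, professor, Ph.D.; research interests: statistical machine learning, data mining, and pattern recognition.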

Abstract

Existing methods based on fixed-order Markov models cannot make full use of the structural features carried by subsequences of different orders. To address this problem, a new Bayesian method based on a multi-order Markov model was proposed for symbolic sequence classification. First, a Conditional Probability Distribution (CPD) model was built on the multi-order Markov model. Second, a suffix tree for n-order subsequences equipped with suffix tables was proposed, together with an efficient construction algorithm that learns the multi-order CPD models in a single scan of the sequence set. Finally, a Bayesian classifier was proposed for the classification task: its training algorithm learns the order-weights of the models of different orders by the Maximum Likelihood (ML) method, and its classification algorithm carries out Bayesian prediction using the weighted conditional probabilities of each order. A series of experiments was conducted on real-world sequence sets from three domains. The results show that the new classifier is insensitive to changes in the predefined model order. Compared with existing methods such as the support vector machine using a fixed-order model, the proposed method improves classification accuracy by more than 40% on both gene sequences and speech sequences, and it also outputs reference values for the optimal order of a Markov model on symbolic sequences.

Key words: symbolic sequence, Markov chain model, multi-order model, Bayesian classification, suffix tree
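
The pipeline summarized in the abstract (single-scan estimation of conditional probabilities for every order, maximum-likelihood order-weights, Bayesian prediction) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices here are assumptions: a hash map keyed by context string stands in for the paper's suffix tree with suffix tables, Laplace smoothing is added for unseen symbols, the order-weights are fitted by a plain EM iteration for mixture weights, and the names `MultiOrderMarkov` and `MultiOrderBayes` are hypothetical. Sequences are assumed to be strings of single-character symbols.

```python
import math
from collections import Counter, defaultdict


class MultiOrderMarkov:
    """Class-conditional model mixing Markov orders 0..max_order (illustrative)."""

    def __init__(self, max_order, alphabet):
        self.max_order = max_order
        self.alphabet_size = len(set(alphabet))
        # Assumption: a dict keyed by context string replaces the suffix tree.
        self.counts = defaultdict(Counter)  # context -> counts of next symbol
        self.weights = [1.0 / (max_order + 1)] * (max_order + 1)

    def cond_prob(self, context, sym):
        # Laplace-smoothed P(sym | context); the smoothing is an assumption.
        row = self.counts.get(context, Counter())
        return (row[sym] + 1) / (sum(row.values()) + self.alphabet_size)

    def _order_probs(self, seq, i):
        # Probability of seq[i] under each order; near the start of the
        # sequence an order falls back to the longest available context.
        return [self.cond_prob(seq[max(0, i - k):i], seq[i])
                for k in range(self.max_order + 1)]

    def fit(self, sequences, em_iters=20):
        # One scan: each position updates the counts of every order.
        for seq in sequences:
            for i, sym in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[seq[i - k:i]][sym] += 1
        # EM for the order-weights (maximum likelihood of a mixture).
        for _ in range(em_iters):
            resp_sum = [0.0] * (self.max_order + 1)
            n = 0
            for seq in sequences:
                for i in range(len(seq)):
                    joint = [w * p for w, p in
                             zip(self.weights, self._order_probs(seq, i))]
                    total = sum(joint)
                    for k, j in enumerate(joint):
                        resp_sum[k] += j / total
                    n += 1
            if n:
                self.weights = [r / n for r in resp_sum]
        return self

    def log_likelihood(self, seq):
        # Sum of log weighted conditional probabilities over all positions.
        return sum(
            math.log(sum(w * p for w, p in
                         zip(self.weights, self._order_probs(seq, i))))
            for i in range(len(seq)))


class MultiOrderBayes:
    """Bayesian classifier: argmax of log prior + class log-likelihood."""

    def __init__(self, max_order=2):
        self.max_order = max_order

    def fit(self, sequences, labels):
        alphabet = set("".join(sequences))
        by_class = defaultdict(list)
        for seq, label in zip(sequences, labels):
            by_class[label].append(seq)
        self.log_prior = {c: math.log(len(s) / len(sequences))
                          for c, s in by_class.items()}
        self.models = {c: MultiOrderMarkov(self.max_order, alphabet).fit(s)
                       for c, s in by_class.items()}
        return self

    def predict(self, seq):
        return max(self.models, key=lambda c: self.log_prior[c]
                   + self.models[c].log_likelihood(seq))
```

In this sketch the learned `weights` of each class model indicate which orders explain the training sequences best, which loosely mirrors the abstract's remark about reference values for the optimal order. Because the weights are fitted on the same sequences used for counting, they tend to favor higher orders; a held-out split would be a reasonable variation.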

CLC Number:

TP311
TP18

CHENG Lingfang, GUO Gongde, CHEN Lifei. Classification of symbolic sequences with multi-order Markov model[J]. Journal of Computer Applications, 2017, 37(7): 1977-1982.
