Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (7): 1977-1982.

Classification of symbolic sequences with multi-order Markov model

1. 1. Jinshan College of Fujian Agriculture and Forestry University, Fuzhou Fujian 350002, China;
2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou Fujian 350117, China
• Received:2017-01-13 Revised:2017-03-05 Online:2017-07-10 Published:2017-07-18
• Supported by:
This work is supported by the National Natural Science Foundation of China (61672157).

符号序列多阶Markov分类

1. 1. 福建农林大学 金山学院, 福州 350002;
2. 福建师范大学 数学与计算机科学学院, 福州 350117
• 通讯作者: 陈黎飞
• 作者简介:程铃钫(1983-),女,山东滕州人,讲师,硕士,主要研究方向:机器学习、数据挖掘;郭躬德(1965-),男,福建龙岩人,教授,博士,主要研究方向:人工智能、数据挖掘;陈黎飞(1972-),男,福建长乐人,教授,博士,主要研究方向:统计机器学习、数据挖掘、模式识别。
• 基金资助:
国家自然科学基金资助项目（61672157）。

Abstract: To solve the problem that the existing methods based on the fixed-order Markov models cannot make full use of the structural features involved in the subsequences of different orders, a new Bayesian method based on the multi-order Markov model was proposed for symbolic sequences classification. First, a Conditional Probability Distribution (CPD) model was built based on the multi-order Markov model. Second, a suffix tree for n-order subsequences with efficient suffix-tables and its efficient construction algorithm were proposed, where the algorithm could be used to learn the multi-order CPD models by scanning once the sequence set. A Bayesian classifier was finally proposed for the classification task. The training algorithm was designed to learn the order-weights for the models of different orders based on the Maximum Likelihood (ML) method, while the classification algorithm was defined to carry out the Bayesian prediction using the weighted conditional probabilities of each order. A series of experiments were conducted on real-world sequence sets from three domains and the results demonstrate that the new classifier is insensitive to the predefined order change of the model. Compared with the existing methods such as the support vector machine using the fixed-order model, the proposed method can achieve more than 40% improvement on both gene sequences and speech sequences in terms of classification accuracy, yielding reference values for the optimal order of a Markov model on symbolic sequences.

CLC Number: