Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (5): 1357-1362. DOI: 10.11772/j.issn.1001-9081.2017.05.1357

• Artificial Intelligence •

Automatic hierarchical approach of MAXQ based on action space partition

WANG Qi, QIN Jin

  1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Received: 2016-09-28  Revised: 2016-12-16  Online: 2017-05-10  Published: 2017-05-16
  • Corresponding author: WANG Qi
  • About the authors: WANG Qi, born in 1992 in Kaifeng, Henan, M. S. candidate. His research interests include machine learning. QIN Jin, born in 1978 in Qianxi, Guizhou, Ph. D., associate professor. His research interests include computational intelligence.
  • Supported by:
    National Natural Science Foundation of China (61562009); Scientific Research Foundation for Talent Introduction of Guizhou University (Grant No. (2012)028).

Automatic hierarchical approach of MAXQ based on action space partition

WANG Qi, QIN Jin   

  1. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China
  • Received:2016-09-28 Revised:2016-12-16 Online:2017-05-10 Published:2017-05-16
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61562009) and the Scientific Research Foundation for Talent Introduction of Guizhou University (2012028).

摘要 (Abstract in Chinese, translated): To address the problem that hierarchical reinforcement learning requires a manually specified hierarchy, and considering that automatic hierarchical approaches based on the state space perform poorly when the environment states contain no obvious subgoals, an automatic hierarchy-construction method based on the action space was proposed. First, the action set was partitioned into several disjoint subsets according to the state components affected by each action; then, the actions available to the Agent in different states were analyzed and bottleneck actions were identified; finally, the upper-lower relations among the action subsets were determined from the bottleneck actions and the execution order, and the hierarchy was constructed. In addition, the termination conditions of subtasks in the MAXQ method were modified, so that the MAXQ method can find the optimal strategy on the hierarchy constructed by the proposed algorithm. Experimental results show that the proposed algorithm can automatically construct the hierarchy without being disturbed by environmental changes. Compared with the Q-learning and Sarsa algorithms, the MAXQ method with this hierarchy obtains the optimal strategy in less time and achieves higher returns, which verifies that the proposed algorithm can effectively construct the MAXQ hierarchy automatically and make finding the optimal strategy more efficient.

关键词 (Key words in Chinese, translated): reinforcement learning, hierarchical reinforcement learning, automatic hierarchical approach, Markov Decision Process (MDP), subtask

Abstract: Since a hierarchy of a Markov Decision Process (MDP) needs to be constructed manually in hierarchical reinforcement learning, and some automatic hierarchical approaches based on the state space produce unsatisfactory results in environments without obvious subgoals, a new automatic hierarchical approach based on action space partition was proposed. Firstly, the set of actions was decomposed into several disjoint subsets according to the state components affected by each action. Then, bottleneck actions were identified by analyzing the executable actions of the Agent in different states. Finally, based on the bottleneck actions and the execution order of actions, the relationship between action subsets was determined and a hierarchy was constructed. Furthermore, the termination conditions of subtasks in the MAXQ method were modified, so that the optimal strategy could be found by the MAXQ method using the hierarchy constructed by the proposed algorithm. The experimental results show that the algorithm can automatically construct a hierarchy that is not affected by environmental changes. Compared with the Q-learning and Sarsa algorithms, the MAXQ method with the proposed hierarchy obtains the optimal strategy faster and gets higher returns, which verifies that the proposed algorithm can effectively construct the MAXQ hierarchy and make finding the optimal strategy more efficient.
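Editorial illustration (not code from the paper): the Python sketch below shows one plausible way to realize the first two steps described in the abstract, namely grouping actions by the state components they affect and flagging "bottleneck" actions that are executable in comparatively few states. The data structures (affected_components, available_actions) and the frequency threshold are assumptions made for illustration only; the paper's actual identification criterion may differ.

# Sketch under assumed interfaces; not the authors' implementation.
from collections import defaultdict

def partition_actions(actions, affected_components):
    """Group actions whose sets of affected state components overlap into one disjoint subset."""
    groups = []  # list of (component set, action set) pairs
    for a in actions:
        comps = set(affected_components[a])
        merged_actions = {a}
        kept = []
        for g_comps, g_actions in groups:
            if g_comps & comps:              # shared state component -> same subset
                comps |= g_comps
                merged_actions |= g_actions
            else:
                kept.append((g_comps, g_actions))
        kept.append((comps, merged_actions))
        groups = kept
    return [g_actions for _, g_actions in groups]

def find_bottleneck_actions(visited_states, available_actions, ratio=0.1):
    """Flag actions executable in only a small fraction of visited states (assumed criterion)."""
    counts = defaultdict(int)
    for s in visited_states:
        for a in available_actions(s):
            counts[a] += 1
    limit = ratio * len(visited_states)
    return {a for a, c in counts.items() if c <= limit}

if __name__ == "__main__":
    # Toy taxi-like example: navigation actions affect the position component,
    # while pickup/putdown affect the passenger component, so two subsets result.
    affected = {"north": ["pos"], "south": ["pos"], "east": ["pos"],
                "west": ["pos"], "pickup": ["passenger"], "putdown": ["passenger"]}
    print(partition_actions(affected.keys(), affected))

On such a toy domain the partition yields one subset of navigation actions and one of pickup/putdown actions, which is the kind of action-space structure the proposed method builds its hierarchy from.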

Key words: reinforcement learning, hierarchical reinforcement learning, automatic hierarchical approach, Markov Decision Process (MDP), subtask

CLC Number: