Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (5): 1357-1362. DOI: 10.11772/j.issn.1001-9081.2017.05.1357

• Artificial Intelligence •

Automatic hierarchical approach of MAXQ based on action space partition

WANG Qi, QIN Jin

  1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Received: 2016-09-28  Revised: 2016-12-16  Online: 2017-05-10  Published: 2017-05-16
  • Corresponding author: WANG Qi
  • About the authors: WANG Qi, born in 1992 in Kaifeng, Henan, M. S. candidate. His research interests include machine learning. QIN Jin, born in 1978 in Qianxi, Guizhou, Ph. D., associate professor. His research interests include computational intelligence.
  • Supported by:
    National Natural Science Foundation of China (61562009); Scientific Research Foundation for Talent Introduction of Guizhou University (Grant No. (2012)028).

Automatic hierarchical approach of MAXQ based on action space partition

WANG Qi, QIN Jin   

  1. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China
  • Received:2016-09-28 Revised:2016-12-16 Online:2017-05-10 Published:2017-05-16
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61562009) and the Scientific Research Foundation for Talent Introduction of Guizhou University (2012028).

摘要 (Abstract in Chinese, translated): To address the problem that hierarchical reinforcement learning requires a manually specified hierarchy, and considering that automatic hierarchical approaches based on the state space perform poorly when the environment states contain no obvious subgoals, an automatic hierarchy-construction method based on the action space was proposed. First, the action set was partitioned into several disjoint subsets according to the state components affected by each action; then, the actions available to the Agent in different states were analyzed and bottleneck actions were identified; finally, the upper-lower relations among the action subsets were determined from the bottleneck actions and the execution order, and the hierarchy was constructed. In addition, the termination conditions of subtasks in the MAXQ method were modified, so that the MAXQ method can find the optimal strategy on the hierarchy constructed by the proposed algorithm. Experimental results show that the proposed algorithm can automatically construct the hierarchy without being disturbed by environmental changes. Compared with the Q-learning and Sarsa algorithms, the MAXQ method with this hierarchy obtains the optimal strategy in less time and achieves higher returns, which verifies that the proposed algorithm can effectively construct the MAXQ hierarchy automatically and make finding the optimal strategy more efficient.

关键词 (Key words in Chinese, translated): reinforcement learning, hierarchical reinforcement learning, automatic hierarchical approach, Markov Decision Process (MDP), subtask

Abstract: Since a hierarchy of a Markov Decision Process (MDP) needs to be constructed manually in hierarchical reinforcement learning, and some automatic hierarchical approaches based on the state space produce unsatisfactory results in environments without obvious subgoals, a new automatic hierarchical approach based on action space partition was proposed. Firstly, the set of actions was decomposed into several disjoint subsets according to the state components affected by each action. Then, bottleneck actions were identified by analyzing the executable actions of the Agent in different states. Finally, based on the bottleneck actions and the execution order of actions, the relationship between action subsets was determined and a hierarchy was constructed. Furthermore, the termination conditions of subtasks in the MAXQ method were modified, so that the optimal strategy could be found by the MAXQ method using the hierarchy constructed by the proposed algorithm. The experimental results show that the algorithm can automatically construct a hierarchy that is not affected by environmental changes. Compared with the Q-learning and Sarsa algorithms, the MAXQ method with the proposed hierarchy obtains the optimal strategy faster and gets higher returns, which verifies that the proposed algorithm can effectively construct the MAXQ hierarchy and make finding the optimal strategy more efficient.
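Editorial illustration (not code from the paper): the Python sketch below shows one plausible way to realize the first two steps described in the abstract, namely grouping actions by the state components they affect and flagging "bottleneck" actions that are executable in comparatively few states. The data structures (affected_components, available_actions) and the frequency threshold are assumptions made for illustration only; the paper's actual identification criterion may differ.

# Sketch under assumed interfaces; not the authors' implementation.
from collections import defaultdict

def partition_actions(actions, affected_components):
    """Group actions whose sets of affected state components overlap into one disjoint subset."""
    groups = []  # list of (component set, action set) pairs
    for a in actions:
        comps = set(affected_components[a])
        merged_actions = {a}
        kept = []
        for g_comps, g_actions in groups:
            if g_comps & comps:              # shared state component -> same subset
                comps |= g_comps
                merged_actions |= g_actions
            else:
                kept.append((g_comps, g_actions))
        kept.append((comps, merged_actions))
        groups = kept
    return [g_actions for _, g_actions in groups]

def find_bottleneck_actions(visited_states, available_actions, ratio=0.1):
    """Flag actions executable in only a small fraction of visited states (assumed criterion)."""
    counts = defaultdict(int)
    for s in visited_states:
        for a in available_actions(s):
            counts[a] += 1
    limit = ratio * len(visited_states)
    return {a for a, c in counts.items() if c <= limit}

if __name__ == "__main__":
    # Toy taxi-like example: navigation actions affect the position component,
    # while pickup/putdown affect the passenger component, so two subsets result.
    affected = {"north": ["pos"], "south": ["pos"], "east": ["pos"],
                "west": ["pos"], "pickup": ["passenger"], "putdown": ["passenger"]}
    print(partition_actions(affected.keys(), affected))

On such a toy domain the partition yields one subset of navigation actions and one of pickup/putdown actions, which is the kind of action-space structure the proposed method builds its hierarchy from.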

Key words: reinforcement learning, hierarchical reinforcement learning, automatic hierarchical approach, Markov Decision Process (MDP), subtask

CLC Number: