Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (5): 1230-1238. DOI: 10.11772/j.issn.1001-9081.2017102531

• Artificial Intelligence •

基于Dyna框架的非参数化近似策略迭代增强学习

季挺, 张华   

  1. 南昌大学 江西省机器人与焊接自动化重点实验室, 南昌 330031
  • 收稿日期:2017-10-25 修回日期:2017-12-12 出版日期:2018-05-10 发布日期:2018-05-24
  • 通讯作者: 张华
  • About the authors: JI Ting (born 1982), male, from Nanchang, Jiangxi, Ph.D. candidate; his research interests include intelligent robots and intelligent control. ZHANG Hua (born 1964), male, from Harbin, Heilongjiang, professor, Ph.D.; his research interests include robotics, optical fiber sensing, and smart metal structures.
  • 基金资助:
    国家863计划项目(SS2013AA041003)。

Nonparametric approximation policy iteration reinforcement learning based on Dyna framework

JI Ting, ZHANG Hua   

  1. Key Laboratory of Robot & Welding Automation of Jiangxi Province, Nanchang University, Nanchang Jiangxi 330031, China
  • Received:2017-10-25 Revised:2017-12-12 Online:2018-05-10 Published:2018-05-24
  • Contact: ZHANG Hua
  • Supported by:
    This work is partially supported by the National High Technology Research and Development Program (863 Program) of China (SS2013AA041003).

摘要: 为解决当前近似策略迭代增强学习算法逼近器不能完全自动构建的问题,提出一种基于Dyna框架的非参数化近似策略迭代(NPAPI-Dyna)增强学习算法。引入采样缓存和采样变化率设计二级随机采样过程采集样本,基于轮廓指标、采用K均值聚类算法实现trial-and-error过程生成核心状态基函数,采用以样本完全覆盖为目标的估计方法生成Q值函数逼近器,采用贪心策略设计动作选择器,利用对状态基函数的访问频次描述环境拓扑特征并构建环境估计模型;而后基于Dyna框架的模型辨识思想,将学习和规划过程有机结合,进一步加快了增强学习速度。一级倒立摆平衡控制的仿真实验中,当增强学习误差率为0.01时,算法学习成功率为100%,学习成功的最小尝试次数仅为2,平均尝试次数仅为7.73,角度平均绝对偏差为3.0538°,角度平均振荡范围为2.759°;当增强学习误差率为0.1时进行100次独立仿真运算,相比Online-LSPI和BLSPI算法平均需要150次以上尝试才能学习得到控制策略,而NPAPI-Dyna基本可在50次尝试内学习成功。实验分析表明,NPAPI-Dyna能够完全自动地构建、调整增强学习结构,学习结果精度较高,同时较快收敛。

关键词: 增强学习, Dyna框架, 策略迭代, 非参数化近似策略, 倒立摆

Abstract: To solve the problem that the approximator in current approximate policy iteration reinforcement learning algorithms cannot be constructed fully automatically, a reinforcement learning algorithm of Nonparametric Approximation Policy Iteration based on the Dyna Framework (NPAPI-Dyna) was proposed. A sampling cache and a sampling change rate were introduced to design a two-stage random sampling process for collecting samples. Based on the silhouette index, K-means clustering was used in a trial-and-error process to generate core state basis functions. The Q-value function approximator was generated by an estimation method that takes complete coverage of the samples as its goal, a greedy strategy was applied to design the action selector, and the access frequency of the state basis functions was used to describe the topological features of the environment and construct an environment estimation model. Then, following the model identification idea of the Dyna framework, the learning and planning processes were combined organically to further accelerate reinforcement learning. In simulation experiments on balance control of a single inverted pendulum, when the reinforcement learning error rate was 0.01, the learning success rate of the algorithm reached 100%, the minimum number of attempts for successful learning was only 2, the average number of attempts was only 7.73, the mean absolute deviation of the angle was 3.0538°, and the average oscillation range of the angle was 2.759°. When the reinforcement learning error rate was 0.1 and 100 independent simulation runs were performed, Online-LSPI and BLSPI (Batch Least-Squares Policy Iteration) needed more than 150 attempts on average to learn a control policy, whereas NPAPI-Dyna generally succeeded within 50 attempts. The experimental results show that NPAPI-Dyna can construct and adjust the reinforcement learning structure fully automatically, achieves high learning accuracy, and converges quickly.
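As a rough illustration of the silhouette-guided K-means step summarized in the abstract, the sketch below clusters sampled states, keeps the clustering with the best silhouette index, and uses the resulting centers as Gaussian core state basis functions. It is a minimal sketch, not the paper's implementation: the function names, the candidate range of cluster counts, the RBF width, and the toy pendulum-like samples are all assumptions made for illustration.

```python
# Minimal sketch (not the paper's code): choose K-means centers for core state
# basis functions by scanning a range of cluster counts and keeping the
# clustering with the highest silhouette index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def select_core_state_centers(states, k_range=range(2, 16), seed=0):
    """Return the K-means centers whose clustering has the best silhouette index."""
    best_score, best_centers = -1.0, None
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(states)
        score = silhouette_score(states, km.labels_)
        if score > best_score:
            best_score, best_centers = score, km.cluster_centers_
    return best_centers


def rbf_features(state, centers, width=1.0):
    """Evaluate Gaussian basis functions of one state at the selected centers."""
    d2 = np.sum((centers - state) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))


if __name__ == "__main__":
    # Hypothetical pendulum-like states: (angle, angular velocity) pairs.
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(500, 2))
    centers = select_core_state_centers(samples)
    print(rbf_features(samples[0], centers))
```

Scoring each candidate clustering with the silhouette index is one common way to avoid fixing the number of basis functions by hand, which is in the spirit of the abstract's goal of building the approximator automatically.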

Key words: reinforcement learning, Dyna framework, policy iteration, nonparametric approximation policy, inverted pendulum
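For readers unfamiliar with the Dyna framework named in the keywords, the sketch below shows the generic Dyna-Q pattern of interleaving direct reinforcement learning with planning on a learned environment model. It is only an assumed tabular illustration of the framework, not NPAPI-Dyna itself (which plans with a nonparametric approximator and a frequency-based environment model); all names and parameters here are illustrative.

```python
# Generic Dyna-Q-style sketch: one real interaction, then several planning
# updates replayed from a memorised (deterministic) environment model.
import random
from collections import defaultdict


def dyna_q_step(q, model, state, env_step, actions,
                alpha=0.1, gamma=0.95, epsilon=0.1, planning_steps=10):
    """Perform one direct-RL update and `planning_steps` model-based updates."""
    # Epsilon-greedy action selection from the current Q-values.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: q[(state, a)])

    # Direct RL: act in the real environment and update Q from the outcome.
    next_state, reward = env_step(state, action)
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

    # Model learning: remember the observed transition.
    model[(state, action)] = (next_state, reward)

    # Planning: replay simulated transitions drawn from the learned model.
    for _ in range(planning_steps):
        (s, a), (s2, r) = random.choice(list(model.items()))
        best = max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best - q[(s, a)])

    return next_state


# Hypothetical wiring with an environment step function env_step(state, action):
# q = defaultdict(float); model = {}
# state = initial_state
# for _ in range(num_steps):
#     state = dyna_q_step(q, model, state, env_step, actions=[0, 1])
```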

CLC number: