Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (5): 1230-1238. DOI: 10.11772/j.issn.1001-9081.2017102531

• Artificial Intelligence •

基于Dyna框架的非参数化近似策略迭代增强学习

季挺, 张华   

  1. 南昌大学 江西省机器人与焊接自动化重点实验室, 南昌 330031
  • 收稿日期:2017-10-25 修回日期:2017-12-12 出版日期:2018-05-10 发布日期:2018-05-24
  • 通讯作者: 张华
  • About the authors: JI Ting (born 1982), male, from Nanchang, Jiangxi, Ph.D. candidate; his research interests include intelligent robots and intelligent control. ZHANG Hua (born 1964), male, from Harbin, Heilongjiang, professor, Ph.D.; his research interests include robotics, optical fiber sensing, and smart metal structures.
  • 基金资助:
    国家863计划项目(SS2013AA041003)。

Nonparametric approximation policy iteration reinforcement learning based on Dyna framework

JI Ting, ZHANG Hua   

  1. Key Laboratory of Robot & Welding Automation of Jiangxi Province, Nanchang University, Nanchang Jiangxi 330031, China
  • Received:2017-10-25 Revised:2017-12-12 Online:2018-05-10 Published:2018-05-24
  • Contact: ZHANG Hua
  • Supported by:
    This work is partially supported by the National High Technology Research and Development Program (863 Program) of China (SS2013AA041003).

摘要: 为解决当前近似策略迭代增强学习算法逼近器不能完全自动构建的问题,提出一种基于Dyna框架的非参数化近似策略迭代(NPAPI-Dyna)增强学习算法。引入采样缓存和采样变化率设计二级随机采样过程采集样本,基于轮廓指标、采用K均值聚类算法实现trial-and-error过程生成核心状态基函数,采用以样本完全覆盖为目标的估计方法生成Q值函数逼近器,采用贪心策略设计动作选择器,利用对状态基函数的访问频次描述环境拓扑特征并构建环境估计模型;而后基于Dyna框架的模型辨识思想,将学习和规划过程有机结合,进一步加快了增强学习速度。一级倒立摆平衡控制的仿真实验中,当增强学习误差率为0.01时,算法学习成功率为100%,学习成功的最小尝试次数仅为2,平均尝试次数仅为7.73,角度平均绝对偏差为3.0538°,角度平均振荡范围为2.759°;当增强学习误差率为0.1时进行100次独立仿真运算,相比Online-LSPI和BLSPI算法平均需要150次以上尝试才能学习得到控制策略,而NPAPI-Dyna基本可在50次尝试内学习成功。实验分析表明,NPAPI-Dyna能够完全自动地构建、调整增强学习结构,学习结果精度较高,同时较快收敛。

关键词: 增强学习, Dyna框架, 策略迭代, 非参数化近似策略, 倒立摆

Abstract: To solve the problem that the approximator in current approximate policy iteration reinforcement learning algorithms cannot be constructed fully automatically, a reinforcement learning algorithm of Nonparametric Approximation Policy Iteration based on the Dyna Framework (NPAPI-Dyna) was proposed. A sampling cache and a sampling change rate were introduced to design a two-stage random sampling process for collecting samples. Based on the silhouette index, K-means clustering was used in a trial-and-error process to generate core state basis functions. The Q-value function approximator was generated by an estimation method that takes complete coverage of the samples as its goal, a greedy strategy was applied to design the action selector, and the access frequency of the state basis functions was used to describe the topological features of the environment and construct an environment estimation model. Then, following the model identification idea of the Dyna framework, the learning and planning processes were combined organically to further accelerate reinforcement learning. In simulation experiments on balance control of a single inverted pendulum, when the reinforcement learning error rate was 0.01, the learning success rate of the algorithm reached 100%, the minimum number of attempts for successful learning was only 2, the average number of attempts was only 7.73, the mean absolute deviation of the angle was 3.0538°, and the average oscillation range of the angle was 2.759°. When the reinforcement learning error rate was 0.1 and 100 independent simulation runs were performed, Online-LSPI and BLSPI (Batch Least-Squares Policy Iteration) needed more than 150 attempts on average to learn a control policy, whereas NPAPI-Dyna generally succeeded within 50 attempts. The experimental results show that NPAPI-Dyna can construct and adjust the reinforcement learning structure fully automatically, achieves high learning accuracy, and converges quickly.
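As a rough illustration of the silhouette-guided K-means step summarized in the abstract, the sketch below clusters sampled states, keeps the clustering with the best silhouette index, and uses the resulting centers as Gaussian core state basis functions. It is a minimal sketch, not the paper's implementation: the function names, the candidate range of cluster counts, the RBF width, and the toy pendulum-like samples are all assumptions made for illustration.

```python
# Minimal sketch (not the paper's code): choose K-means centers for core state
# basis functions by scanning a range of cluster counts and keeping the
# clustering with the highest silhouette index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def select_core_state_centers(states, k_range=range(2, 16), seed=0):
    """Return the K-means centers whose clustering has the best silhouette index."""
    best_score, best_centers = -1.0, None
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(states)
        score = silhouette_score(states, km.labels_)
        if score > best_score:
            best_score, best_centers = score, km.cluster_centers_
    return best_centers


def rbf_features(state, centers, width=1.0):
    """Evaluate Gaussian basis functions of one state at the selected centers."""
    d2 = np.sum((centers - state) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))


if __name__ == "__main__":
    # Hypothetical pendulum-like states: (angle, angular velocity) pairs.
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(500, 2))
    centers = select_core_state_centers(samples)
    print(rbf_features(samples[0], centers))
```

Scoring each candidate clustering with the silhouette index is one common way to avoid fixing the number of basis functions by hand, which is in the spirit of the abstract's goal of building the approximator automatically.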

Key words: reinforcement learning, Dyna framework, policy iteration, nonparametric approximation policy, inverted pendulum
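For readers unfamiliar with the Dyna framework named in the keywords, the sketch below shows the generic Dyna-Q pattern of interleaving direct reinforcement learning with planning on a learned environment model. It is only an assumed tabular illustration of the framework, not NPAPI-Dyna itself (which plans with a nonparametric approximator and a frequency-based environment model); all names and parameters here are illustrative.

```python
# Generic Dyna-Q-style sketch: one real interaction, then several planning
# updates replayed from a memorised (deterministic) environment model.
import random
from collections import defaultdict


def dyna_q_step(q, model, state, env_step, actions,
                alpha=0.1, gamma=0.95, epsilon=0.1, planning_steps=10):
    """Perform one direct-RL update and `planning_steps` model-based updates."""
    # Epsilon-greedy action selection from the current Q-values.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: q[(state, a)])

    # Direct RL: act in the real environment and update Q from the outcome.
    next_state, reward = env_step(state, action)
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

    # Model learning: remember the observed transition.
    model[(state, action)] = (next_state, reward)

    # Planning: replay simulated transitions drawn from the learned model.
    for _ in range(planning_steps):
        (s, a), (s2, r) = random.choice(list(model.items()))
        best = max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best - q[(s, a)])

    return next_state


# Hypothetical wiring with an environment step function env_step(state, action):
# q = defaultdict(float); model = {}
# state = initial_state
# for _ in range(num_steps):
#     state = dyna_q_step(q, model, state, env_step, actions=[0, 1])
```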

CLC number: