Nonparametric approximation policy iteration reinforcement learning based on Dyna framework

doi:10.11772/j.issn.1001-9081.2017102531

Journal of Computer Applications, 2018, Vol. 38, Issue (5): 1230-1238. DOI: 10.11772/j.issn.1001-9081.2017102531

JI Ting, ZHANG Hua

Key Laboratory of Robot & Welding Automation of Jiangxi Province, Nanchang University, Nanchang Jiangxi 330031, China

Received: 2017-10-25; Revised: 2017-12-12; Online: 2018-05-10; Published: 2018-05-24
Contact: ZHANG Hua
Supported by:
This work is partially supported by the National High Technology Research and Development Program (863 Program) of China (SS2013AA041003).

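About the authors: JI Ting (born 1982), male, from Nanchang, Jiangxi, Ph.D. candidate; research interests: intelligent robots and intelligent control. ZHANG Hua (born 1964), male, from Harbin, Heilongjiang, professor, Ph.D.; research interests: robotics, optical fiber sensing, and intelligent metal structures.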

Abstract: To address the problem that the approximator in current approximate policy iteration reinforcement learning algorithms cannot be constructed fully automatically, a Nonparametric Approximation Policy Iteration reinforcement learning algorithm based on the Dyna framework (NPAPI-Dyna) was proposed. A sampling cache and a sampling change rate were introduced to design a two-stage random sampling process for collecting samples. Core state basis functions were generated by a trial-and-error process that applies K-means clustering guided by the silhouette index. The Q-value function approximator was generated by an estimation method whose target is complete coverage of the samples, and the action selector was designed with a greedy strategy. The access frequency of the state basis functions was used to describe the topological features of the environment and to construct an environment estimation model. Following the model identification idea of the Dyna framework, the learning and planning processes were then combined to further accelerate reinforcement learning. In simulation experiments on balance control of a single inverted pendulum, when the reinforcement learning error rate was 0.01, the learning success rate of the algorithm reached 100%, the minimum number of attempts needed for successful learning was only 2, the average number of attempts was only 7.73, the mean absolute deviation of the angle was 3.0538°, and the average oscillation range of the angle was 2.759°. When the reinforcement learning error rate was 0.1, 100 independent simulation runs were performed: Online-LSPI (Online Least-Squares Policy Iteration) and BLSPI (Batch Least-Squares Policy Iteration) needed more than 150 attempts on average to learn the control strategy, whereas NPAPI-Dyna generally succeeded within 50 attempts. The experimental results show that NPAPI-Dyna can construct and adjust the reinforcement learning structure fully automatically, with high learning accuracy and fast convergence.
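The way a Dyna-style agent interleaves learning from real experience with planning from a learned model can be illustrated with a generic tabular Dyna-Q sketch. This is not the NPAPI-Dyna algorithm itself: the 1-D chain environment, the lookup-table Q-function, the deterministic one-step model, the function name `dyna_q`, and all parameter values are assumptions chosen for illustration, whereas NPAPI-Dyna uses a basis-function approximator and an environment model built from access frequencies.

```python
import random


def dyna_q(n_states=6, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a hypothetical 1-D chain; the right-most state is the goal."""
    rng = random.Random(seed)
    actions = (-1, +1)          # move left / move right
    goal = n_states - 1
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}                  # learned model: (state, action) -> (reward, next_state)

    def step(s, a):
        # Assumed environment: walls at both ends, reward 1 on reaching the goal.
        s2 = min(max(s + a, 0), goal)
        return (1.0 if s2 == goal else 0.0), s2

    def greedy(s):
        # Greedy action selection with random tie-breaking.
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2):
        # One-step Q-learning backup.
        target = r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2 = step(s, a)
            update(s, a, r, s2)          # learning: backup from real experience
            model[(s, a)] = (r, s2)      # model identification: remember the transition
            for _ in range(planning_steps):
                # planning: replay transitions drawn from the learned model
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
    return q


if __name__ == "__main__":
    q_table = dyna_q()
    policy = [max((-1, +1), key=lambda a: q_table[(s, a)]) for s in range(5)]
    print(policy)
```

Each real step triggers `planning_steps` extra backups on remembered transitions, which is why a Dyna-style agent typically needs fewer real attempts than one that learns from real experience alone.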

Key words: reinforcement learning, Dyna framework, policy iteration, nonparametric approximation policy, inverted pendulum

CLC Number: TP181

JI Ting, ZHANG Hua. Nonparametric approximation policy iteration reinforcement learning based on Dyna framework[J]. Journal of Computer Applications, 2018, 38(5): 1230-1238.
