Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (8): 2252-2260. DOI: 10.11772/j.issn.1001-9081.2018112394
Corresponding author:
QIAN Yuhua
About the authors:
LIU Guoqing (born 1994), female, from Linfen, Shanxi, M.S. candidate; research interests: reinforcement learning. WANG Jieting (born 1991), female, from Linfen, Shanxi, Ph.D. candidate; research interests: statistical machine learning. HU Zhiguo (born 1977), male, from Lingshi, Shanxi, lecturer, Ph.D., CCF member; research interests: computer networks, distributed systems. QIAN Yuhua (born 1976), male, from Jincheng, Shanxi, professor, Ph.D., CCF member; research interests: data intelligence, machine learning, big data, complex networks.
LIU Guoqing1,2,3, WANG Jieting1,2,3, HU Zhiguo1,2,3, QIAN Yuhua1,2,3
Received:
2018-12-04
Revised:
2019-01-31
Online:
2019-08-14
Published:
2019-08-10
Abstract: Monte Carlo Tree Search (MCTS) has shown outstanding performance in board-game problems, but most existing studies consider only two kinds of feedback, win and loss, and thus assume that game outcomes follow a Bernoulli distribution. This setting ignores the frequently occurring draw, so the board state cannot be evaluated accurately and the optimal action may even be missed. To address this problem, a Ternary Multi-Arm Bandit (TMAB) model based on a ternary distribution was first constructed and a best-arm identification algorithm named TBBA was proposed. Then, TBBA was applied to the Ternary Maximin Sampling Tree (TMST), yielding two algorithms: TBBA_tree, which simply iterates TBBA, and the TMST best-action identification algorithm TTBA, which converts the tree structure into a TMAB. In the experiments, two arm spaces of different precision were built, and several comparable TMABs and TMSTs were constructed on them. The experimental results show that, compared with the uniform sampling algorithm, the accuracy of TBBA rises steadily and in some cases reaches 100%, while the accuracy of TTBA generally stays above 80%, with good generalization and stability and without outliers or fluctuation intervals.
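The ternary-reward bandit setting described in the abstract can be illustrated in code. The sketch below is not the paper's TBBA algorithm, whose details are not given here: it assumes each arm follows a trinomial win/draw/loss distribution, assumes the value mapping win = 1, draw = 0.5, loss = 0, and uses a generic successive-elimination routine as a stand-in for the best-arm identification step.

```python
import random

# Illustrative sketch only; the value mapping (win=1, draw=0.5, loss=0) and
# the elimination schedule are assumptions, not the paper's TBBA algorithm.

class TernaryArm:
    """An arm whose reward follows a trinomial (win/draw/loss) distribution."""
    def __init__(self, p_win, p_draw):
        assert p_win + p_draw <= 1.0
        self.p_win, self.p_draw = p_win, p_draw

    def pull(self):
        u = random.random()
        if u < self.p_win:
            return 1.0          # win
        if u < self.p_win + self.p_draw:
            return 0.5          # draw
        return 0.0              # loss

def successive_elimination(arms, rounds_per_phase=2000, phases=3):
    """Sample every surviving arm each phase, then drop the empirically worst."""
    alive = list(range(len(arms)))
    means = [0.0] * len(arms)
    counts = [0] * len(arms)
    for _ in range(phases):
        for i in alive:
            for _ in range(rounds_per_phase):
                r = arms[i].pull()
                counts[i] += 1
                means[i] += (r - means[i]) / counts[i]  # running mean
        if len(alive) > 1:
            alive.remove(min(alive, key=lambda i: means[i]))
    return max(alive, key=lambda i: means[i])

random.seed(0)
bandit = [TernaryArm(0.5, 0.3),   # expected value 0.65 under the mapping above
          TernaryArm(0.4, 0.4),   # expected value 0.60
          TernaryArm(0.3, 0.2)]   # expected value 0.40
best = successive_elimination(bandit)
print(best)
```

Under a pure Bernoulli model the second arm's draws would be folded into wins or losses, distorting its value relative to the first arm; scoring the three outcomes separately is what the ternary model buys.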
LIU Guoqing, WANG Jieting, HU Zhiguo, QIAN Yuhua. Best action identification of tree structure based on ternary multi-arm bandit[J]. Journal of Computer Applications, 2019, 39(8): 2252-2260.