Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (8): 2252-2260. DOI: 10.11772/j.issn.1001-9081.2018112394
Corresponding author:
QIAN Yuhua
About the authors:
LIU Guoqing (born 1994), female, from Linfen, Shanxi, M.S. candidate; research interests: reinforcement learning. WANG Jieting (born 1991), female, from Linfen, Shanxi, Ph.D. candidate; research interests: statistical machine learning. HU Zhiguo (born 1977), male, from Lingshi, Shanxi, lecturer, Ph.D., CCF member; research interests: computer networks, distributed systems. QIAN Yuhua (born 1976), male, from Jincheng, Shanxi, professor, Ph.D., CCF member; research interests: data intelligence, machine learning, big data, complex networks.
LIU Guoqing1,2,3, WANG Jieting1,2,3, HU Zhiguo1,2,3, QIAN Yuhua1,2,3
Received:
2018-12-04
Revised:
2019-01-31
Online:
2019-08-14
Published:
2019-08-10
Abstract: Monte Carlo Tree Search (MCTS) has shown outstanding performance in board-game problems, but most existing studies consider only two kinds of feedback, win and loss, and thus assume that game outcomes follow a Bernoulli distribution. This setting ignores the frequently occurring draw, so the board state cannot be evaluated accurately and the optimal action may even be missed. To address this problem, a Ternary Multi-Arm Bandit (TMAB) model based on a ternary distribution was first constructed and a best-arm identification algorithm named TBBA was proposed. Then, TBBA was applied to the Ternary Maximin Sampling Tree (TMST), yielding two algorithms: TBBA_tree, which simply iterates TBBA, and the TMST best-action identification algorithm TTBA, which converts the tree structure into a TMAB. In the experiments, two arm spaces of different precision were built, and several comparable TMABs and TMSTs were constructed on them. The experimental results show that, compared with the uniform sampling algorithm, the accuracy of TBBA rises steadily and in some cases reaches 100%, while the accuracy of TTBA generally stays above 80%, with good generalization and stability and without outliers or fluctuation intervals.
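The ternary-reward bandit setting described in the abstract can be illustrated in code. The sketch below is not the paper's TBBA algorithm, whose details are not given here: it assumes each arm follows a trinomial win/draw/loss distribution, assumes the value mapping win = 1, draw = 0.5, loss = 0, and uses a generic successive-elimination routine as a stand-in for the best-arm identification step.

```python
import random

# Illustrative sketch only; the value mapping (win=1, draw=0.5, loss=0) and
# the elimination schedule are assumptions, not the paper's TBBA algorithm.

class TernaryArm:
    """An arm whose reward follows a trinomial (win/draw/loss) distribution."""
    def __init__(self, p_win, p_draw):
        assert p_win + p_draw <= 1.0
        self.p_win, self.p_draw = p_win, p_draw

    def pull(self):
        u = random.random()
        if u < self.p_win:
            return 1.0          # win
        if u < self.p_win + self.p_draw:
            return 0.5          # draw
        return 0.0              # loss

def successive_elimination(arms, rounds_per_phase=2000, phases=3):
    """Sample every surviving arm each phase, then drop the empirically worst."""
    alive = list(range(len(arms)))
    means = [0.0] * len(arms)
    counts = [0] * len(arms)
    for _ in range(phases):
        for i in alive:
            for _ in range(rounds_per_phase):
                r = arms[i].pull()
                counts[i] += 1
                means[i] += (r - means[i]) / counts[i]  # running mean
        if len(alive) > 1:
            alive.remove(min(alive, key=lambda i: means[i]))
    return max(alive, key=lambda i: means[i])

random.seed(0)
bandit = [TernaryArm(0.5, 0.3),   # expected value 0.65 under the mapping above
          TernaryArm(0.4, 0.4),   # expected value 0.60
          TernaryArm(0.3, 0.2)]   # expected value 0.40
best = successive_elimination(bandit)
print(best)
```

Under a pure Bernoulli model the second arm's draws would be folded into wins or losses, distorting its value relative to the first arm; scoring the three outcomes separately is what the ternary model buys.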
LIU Guoqing, WANG Jieting, HU Zhiguo, QIAN Yuhua. Best action identification of tree structure based on ternary multi-arm bandit[J]. Journal of Computer Applications, 2019, 39(8): 2252-2260.