Best action identification of tree structure based on ternary multi-arm bandit

doi:10.11772/j.issn.1001-9081.2018112394

Journal of Computer Applications, 2019, Vol. 39, Issue (8): 2252-2260. DOI: 10.11772/j.issn.1001-9081.2018112394

• Artificial intelligence •

LIU Guoqing^1,2,3, WANG Jieting^1,2,3, HU Zhiguo^1,2,3, QIAN Yuhua^1,2,3

1. Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan Shanxi 030006, China;
2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education(Shanxi University), Taiyuan Shanxi 030006, China;
3. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China

Received: 2018-12-04 Revised: 2019-01-31 Online: 2019-08-14 Published: 2019-08-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672332, 61432011, U1435212) and the Natural Science Foundation of Shanxi Province (201701D121052).

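Corresponding author: QIAN Yuhua
About the authors: LIU Guoqing (born 1994), female, from Linfen, Shanxi, M.S. candidate; main research interest: reinforcement learning. WANG Jieting (born 1991), female, from Linfen, Shanxi, Ph.D. candidate; main research interest: statistical machine learning. HU Zhiguo (born 1977), male, from Lingshi, Shanxi, lecturer, Ph.D., CCF member; main research interests: computer networks, distributed systems. QIAN Yuhua (born 1976), male, from Jincheng, Shanxi, professor, Ph.D., CCF member; main research interests: data intelligence, machine learning, big data, complex networks.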

Abstract

Abstract: Monte Carlo Tree Search (MCTS) performs excellently in board games. Most existing studies, however, consider only win and loss feedback and assume that game results follow a Bernoulli distribution. This setting ignores draws, which occur frequently, so the board state is assessed inaccurately and the optimal action may be missed. To solve this problem, a Ternary Multi-Arm Bandit (TMAB) model was constructed and a best arm identification algorithm for TMAB (TBBA) was proposed. TBBA was then applied to the Ternary Minimax Sampling Tree (TMST), yielding two algorithms: TBBA_tree, which simply iterates TBBA, and Best Action identification of TMST (TTBA), which transforms the tree structure into a TMAB. In the experiments, two arm spaces with different precision were established, and several comparative TMABs and TMSTs were constructed on them. Experimental results show that, compared with the uniform sampling algorithm, the accuracy of the TBBA algorithm rises steadily and reaches 100% in some cases, and the accuracy of the TBBA algorithm mostly stays above 80%, with good generalization and stability and without outliers or fluctuation ranges.

Key words: Monte Carlo Tree Search (MCTS), Ternary Multi-Arm Bandit (TMAB), Best Arm Identification (BAI), sequential decision-making, pure exploration
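
The effect of ignoring draws, as described in the abstract, can be illustrated with a minimal Python sketch. It is not the TBBA, TBBA_tree or TTBA algorithm, whose details are not given on this page; it only implements a uniform sampling baseline over ternary arms. The arm probabilities, the names `ARMS`, `pull` and `uniform_sampling`, and the reward values (win = 1, draw = 0.5, loss = 0) are all assumptions made for illustration.

```python
import random

# Hypothetical arms, each given as (P(win), P(draw), P(loss)).
ARMS = {
    "a": (0.40, 0.00, 0.60),
    "b": (0.35, 0.40, 0.25),
}
OUTCOMES = ("win", "draw", "loss")


def pull(probs, rng):
    """Sample one ternary outcome from an arm."""
    return rng.choices(OUTCOMES, weights=probs, k=1)[0]


def uniform_sampling(arms, budget, rng, draw_value=0.5):
    """Split the budget evenly over the arms and return (best arm, empirical means).

    draw_value is the assumed reward of a draw: 0.5 keeps the ternary
    feedback, 0.0 collapses a draw into a loss (Bernoulli feedback).
    """
    per_arm = budget // len(arms)
    reward = {"win": 1.0, "draw": draw_value, "loss": 0.0}
    means = {}
    for name, probs in arms.items():
        total = sum(reward[pull(probs, rng)] for _ in range(per_arm))
        means[name] = total / per_arm
    return max(means, key=means.get), means


if __name__ == "__main__":
    rng = random.Random(0)
    print(uniform_sampling(ARMS, 20000, rng, draw_value=0.5))  # ternary feedback
    print(uniform_sampling(ARMS, 20000, rng, draw_value=0.0))  # draws counted as losses
```

Under these assumed values the expected rewards are 0.40 for arm "a" and 0.55 for arm "b" with ternary feedback, but 0.40 and 0.35 once draws are counted as losses, so the Bernoulli view would prefer the worse arm.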

CLC Number: TP181

LIU Guoqing, WANG Jieting, HU Zhiguo, QIAN Yuhua. Best action identification of tree structure based on ternary multi-arm bandit[J]. Journal of Computer Applications, 2019, 39(8): 2252-2260.
