Best action identification of tree structure based on ternary multi-arm bandit

doi:10.11772/j.issn.1001-9081.2018112394

Journal of Computer Applications, 2019, Vol. 39, Issue (8): 2252-2260. DOI: 10.11772/j.issn.1001-9081.2018112394

• Artificial intelligence •

LIU Guoqing^1,2,3, WANG Jieting^1,2,3, HU Zhiguo^1,2,3, QIAN Yuhua^1,2,3

1. Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan Shanxi 030006, China;
2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan Shanxi 030006, China;
3. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China

Received: 2018-12-04 Revised: 2019-01-31 Online: 2019-08-10 Published: 2019-08-14
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672332, 61432011, U1435212) and the Natural Science Foundation of Shanxi Province (201701D121052).

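Corresponding author: QIAN Yuhua
About the authors: LIU Guoqing (1994-), female, born in Linfen, Shanxi, M.S. candidate; research interest: reinforcement learning. WANG Jieting (1991-), female, born in Linfen, Shanxi, Ph.D. candidate; research interest: statistical machine learning. HU Zhiguo (1977-), male, born in Lingshi, Shanxi, lecturer, Ph.D., CCF member; research interests: computer networks, distributed systems. QIAN Yuhua (1976-), male, born in Jincheng, Shanxi, professor, Ph.D., CCF member; research interests: data intelligence, machine learning, big data, complex networks.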

Abstract

Abstract: Monte Carlo Tree Search (MCTS) performs excellently in board game problems. However, most existing studies consider only win and loss feedback and therefore assume that game results follow a Bernoulli distribution. This setting ignores the frequently occurring draw, which leads to inaccurate evaluation of the board state and can even cause the optimal action to be missed. To solve this problem, a Ternary Multi-Arm Bandit (TMAB) model was constructed and the Best Arm identification of TMAB (TBBA) algorithm was proposed. The TBBA algorithm was then applied to the Ternary Minimax Sampling Tree (TMST), and two algorithms were proposed: the TBBA_tree algorithm, which simply iterates TBBA, and the Best Action identification of TMST (TTBA) algorithm, which transforms the tree structure into a TMAB. In the experiments, two arm spaces with different precision were established, and several comparative TMABs and TMSTs were constructed on them. Experimental results show that, compared with the uniform sampling algorithm, the accuracy of the TBBA algorithm rises steadily and in some cases reaches 100%, and the accuracy of the TBBA algorithm stays mostly above 80%, with good generalization and stability and without outliers or fluctuation ranges.

Key words: Monte Carlo Tree Search (MCTS), Ternary Multi-Arm Bandit (TMAB), Best Arm Identification (BAI), sequential decision-making, pure exploration
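
To make the ternary setting concrete, the following is a minimal Python sketch of a ternary bandit arm together with the uniform sampling baseline mentioned in the abstract. It is not the TBBA algorithm, whose details the abstract does not give. The reward coding (win = 1, draw = 0.5, loss = 0), the function names, and the example arm probabilities are all assumptions made for illustration.

```python
import random

# Assumed reward coding (not stated in the abstract): win = 1, draw = 0.5, loss = 0.
REWARDS = (1.0, 0.5, 0.0)


def pull(arm, rng):
    """Sample one ternary outcome from an arm given as (p_win, p_draw, p_loss)."""
    return rng.choices(REWARDS, weights=arm, k=1)[0]


def expected_reward(arm):
    """True mean reward of an arm under the assumed reward coding."""
    return sum(r * p for r, p in zip(REWARDS, arm))


def uniform_sampling_best_arm(arms, budget, rng):
    """Baseline: pull arms round-robin, then return the index of the best empirical mean."""
    totals = [0.0] * len(arms)
    counts = [0] * len(arms)
    for t in range(budget):
        i = t % len(arms)
        totals[i] += pull(arms[i], rng)
        counts[i] += 1
    # Arms that were never pulled (budget < number of arms) rank last.
    means = [
        totals[i] / counts[i] if counts[i] else float("-inf")
        for i in range(len(arms))
    ]
    return max(range(len(arms)), key=means.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical arms: arms 0 and 1 share the same win probability
    # and differ only in how often they draw.
    arms = [(0.4, 0.1, 0.5), (0.4, 0.4, 0.2), (0.3, 0.2, 0.5)]
    print([expected_reward(a) for a in arms])
    print(uniform_sampling_best_arm(arms, budget=3000, rng=rng))
```

In this hypothetical example, arms 0 and 1 have the same win probability, so a model that records only win or not-win cannot separate them, whereas the ternary means under the assumed coding (0.45 versus 0.6) do. This is the kind of information the abstract says is lost when draws are ignored.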

CLC Number: TP181

LIU Guoqing, WANG Jieting, HU Zhiguo, QIAN Yuhua. Best action identification of tree structure based on ternary multi-arm bandit[J]. Journal of Computer Applications, 2019, 39(8): 2252-2260.
