Journal of Shandong University (Engineering Science), 2024, Vol. 54, Issue (4): 21-34. doi: 10.6040/j.issn.1672-3961.0.2023.148
CHEN Xingguo1,2, LÜ Yongzhou1, GONG Yu1, CHEN Yaoxiong3
Abstract: To address the problem of obtaining better fixed-point solutions in reinforcement learning, a generalized fixed-point solution model is proposed. The design extends fixed-point solutions via n-step bootstrapping and constructs fixed-point solutions by linear interpolation. Applying this design within the mature CBMPI algorithmic framework yields the generalized-fixed-point-based CBMPI(n, β) algorithm. To address how to represent and approximate the optimal solution, Bayesian-optimization-based parameter tuning of the generalized fixed-point solution and ensemble-learning-based higher-quality solutions are proposed. The effectiveness of the proposed algorithm is validated in the classic 10×10 Tetris environment. Experimental results show that the generalized fixed point constructed by linear interpolation outperforms the traditional n-step fixed point, and that its performance depends strongly on the step-size hyperparameter n and the interpolation parameter β. Over 100 Tetris games the average score reaches 4 388.3, indicating that Bayesian optimization can find multiple well-performing policies. Policy ensembling and value-function ensembling over four well-performing sets of generalized fixed-point policy parameters (obtained by Bayesian optimization) produce higher-quality solutions, with average scores of 4 526.29 and 4 579.74, respectively. These results show that generalized-fixed-point-based policy ensembling and value-function ensembling improve modestly over the generalized fixed point alone, confirming that higher-quality solutions can be found through ensemble learning.
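The abstract describes the generalized fixed-point construction only at a high level. The sketch below is a minimal illustration, assuming the linear interpolation takes the form of a β-weighted mixture of the 1-step and n-step bootstrapped targets and that value-function ensembling averages linear value estimates; the function names (n_step_return, interpolated_target, ensemble_value) and the toy trajectory are hypothetical and not taken from the paper.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step bootstrapped return: discounted rewards plus a bootstrapped tail value."""
    horizon = min(n, len(rewards) - t)
    g = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    if t + horizon < len(values):  # bootstrap from V(s_{t+n}) if that state's estimate exists
        g += gamma ** horizon * values[t + horizon]
    return g

def interpolated_target(rewards, values, t, n, beta, gamma=0.99):
    """Assumed generalized fixed-point target: linear interpolation between the
    1-step and n-step targets, controlled by the interpolation parameter beta."""
    g1 = n_step_return(rewards, values, t, 1, gamma)
    gn = n_step_return(rewards, values, t, n, gamma)
    return (1.0 - beta) * g1 + beta * gn

def ensemble_value(weight_vectors, state_features):
    """Value-function ensemble: average the predictions of several linear value functions."""
    return float(np.mean([w @ state_features for w in weight_vectors]))

# Toy usage: a 5-step trajectory with a value estimate for each visited state.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]
values = [0.5, 0.4, 0.6, 0.7, 0.3, 0.2]
print(interpolated_target(rewards, values, t=0, n=3, beta=0.7))

# Averaging two linear value functions over a 3-dimensional feature vector.
print(ensemble_value([np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.5, 0.1])],
                     np.array([1.0, 2.0, 0.5])))
```

In this reading, the pair (n, β) plays the role of the hyperparameters that Bayesian optimization searches over, and several of the best-performing parameter sets can then be combined by averaging as above.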