Journal of Shandong University (Engineering Science), 2024, Vol. 54, Issue (4): 21-34. doi: 10.6040/j.issn.1672-3961.0.2023.148
CHEN Xingguo1,2, LÜ Yongzhou1, GONG Yu1, CHEN Yaoxiong3
Abstract: To address the problem of obtaining better fixed-point solutions in reinforcement learning, a generalized fixed-point solution model is proposed. The design extends fixed-point solutions via n-step bootstrapping and constructs fixed-point solutions by linear interpolation. Applying this design within the mature CBMPI algorithmic framework yields the generalized-fixed-point-based CBMPI(n, β) algorithm. To address how to represent and approximate the optimal solution, Bayesian-optimization-based parameter tuning of the generalized fixed-point solution and ensemble-learning-based higher-quality solutions are proposed. The effectiveness of the proposed algorithm is validated in the classic 10×10 Tetris environment. Experimental results show that the generalized fixed point constructed by linear interpolation outperforms the traditional n-step fixed point, and that its performance depends strongly on the step-size hyperparameter n and the interpolation parameter β. Over 100 Tetris games the average score reaches 4 388.3, indicating that Bayesian optimization can find multiple well-performing policies. Policy ensembling and value-function ensembling over four well-performing sets of generalized fixed-point policy parameters (obtained by Bayesian optimization) produce higher-quality solutions, with average scores of 4 526.29 and 4 579.74, respectively. These results show that generalized-fixed-point-based policy ensembling and value-function ensembling improve modestly over the generalized fixed point alone, confirming that higher-quality solutions can be found through ensemble learning.
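The abstract describes the generalized fixed-point construction only at a high level. The sketch below is a minimal illustration, assuming the linear interpolation takes the form of a β-weighted mixture of the 1-step and n-step bootstrapped targets and that value-function ensembling averages linear value estimates; the function names (n_step_return, interpolated_target, ensemble_value) and the toy trajectory are hypothetical and not taken from the paper.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step bootstrapped return: discounted rewards plus a bootstrapped tail value."""
    horizon = min(n, len(rewards) - t)
    g = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    if t + horizon < len(values):  # bootstrap from V(s_{t+n}) if that state's estimate exists
        g += gamma ** horizon * values[t + horizon]
    return g

def interpolated_target(rewards, values, t, n, beta, gamma=0.99):
    """Assumed generalized fixed-point target: linear interpolation between the
    1-step and n-step targets, controlled by the interpolation parameter beta."""
    g1 = n_step_return(rewards, values, t, 1, gamma)
    gn = n_step_return(rewards, values, t, n, gamma)
    return (1.0 - beta) * g1 + beta * gn

def ensemble_value(weight_vectors, state_features):
    """Value-function ensemble: average the predictions of several linear value functions."""
    return float(np.mean([w @ state_features for w in weight_vectors]))

# Toy usage: a 5-step trajectory with a value estimate for each visited state.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]
values = [0.5, 0.4, 0.6, 0.7, 0.3, 0.2]
print(interpolated_target(rewards, values, t=0, n=3, beta=0.7))

# Averaging two linear value functions over a 3-dimensional feature vector.
print(ensemble_value([np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.5, 0.1])],
                     np.array([1.0, 2.0, 0.5])))
```

In this reading, the pair (n, β) plays the role of the hyperparameters that Bayesian optimization searches over, and several of the best-performing parameter sets can then be combined by averaging as above.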