Q-learning

Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. A policy is the rule the agent follows to select an action given the state it is in. Once the action-value function has been learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One strength of Q-learning is that it can compare the expected utility of the available actions without requiring a model of the environment. In addition, Q-learning can handle problems with stochastic transitions and rewards. For any finite MDP, Q-learning has been shown to find an optimal policy, i.e. one that maximizes the expected value of the total reward over all successive steps, starting from the current state.

Algorithm

The model of the MDP problem consists of an agent, a set of states S, and a set of actions A available in each state. By performing an action a \in A, the agent can move from one state to another. Each state provides the agent with a reward (a real or natural number). The goal of the agent is to maximize its total reward, which it does by learning the best action for each state.
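As a concrete illustration (a hypothetical example, not part of the original text), such a model can be written down directly as a small Python class: a five-state chain in which moving right eventually reaches a terminal state with reward 1. The reset()/step()/actions interface is an assumption of this sketch and is reused in the sketches further below.

    class ChainMDP:
        # Toy 5-state chain MDP: moving right eventually reaches a terminal
        # state with reward 1; moving left gives no reward.
        actions = ["left", "right"]

        def __init__(self, n_states=5):
            self.n_states = n_states
            self.state = 0

        def reset(self):
            self.state = 0
            return self.state

        def step(self, action):
            if action == "right":
                self.state += 1
            else:
                self.state = max(0, self.state - 1)
            done = self.state == self.n_states - 1
            reward = 1.0 if done else 0.0
            return self.state, reward, done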

The algorithm therefore has a function that computes the quality of a state-action combination (the origin of the Q):

Q: S \times A \to \mathbb{R}

Before learning begins, Q returns an (arbitrary) fixed value chosen by the designer. Then, each time the agent selects an action, it observes a reward and a new state, both of which may depend on the previous state and the action taken. The core of the algorithm is a simple value iteration update; it corrects the old value with the newly acquired information.

Q_{t+1}(s_{t},a_{t}) = \underbrace{Q_t(s_t,a_t)}_{\rm old~value} + \underbrace{\alpha_t(s_t,a_t)}_{\rm learning~rate} \times \left[ \overbrace{\underbrace{R_{t+1}}_{\rm reward} + \underbrace{\gamma}_{\rm discount~factor} \underbrace{\max_{a}Q_t(s_{t+1}, a)}_{\rm estimate~of~optimal~future~value}}^{\rm learned~value} - \underbrace{Q_t(s_t,a_t)}_{\rm old~value} \right]

Here Q_t(s_t,a_t) is the old value and \alpha_t(s_t,a_t) (0 < \alpha \le 1) is the learning rate, which may be the same for all pairs (s,a). R_{t+1} is the reward observed after performing a_{t} in s_{t}, and \max_{a}Q_t(s_{t+1},a) is the estimate of the optimal future value, so R_{t+1} + \gamma \max_{a}Q_t(s_{t+1},a) is the learned value. The discount factor \gamma (0 \le \gamma \le 1) trades off immediate rewards against delayed rewards.

An episode of the algorithm ends when s_{t+1} is a final (or "absorbing") state. However, Q-learning can also learn in non-episodic tasks. If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

Note that for every final state s_f, Q(s_f, a) is never updated and therefore keeps its initial value. In most cases, Q(s_f, a) can be taken to be zero.
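To make the update rule concrete, here is a minimal tabular Q-learning sketch in Python. The epsilon-greedy action selection and the env interface (reset()/step()/actions, as in the toy chain above) are illustrative assumptions rather than part of the algorithm itself.

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        # Tabular Q-learning; Q is keyed by (state, action) pairs and starts
        # at the arbitrary initial value 0.
        Q = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy selection over the current estimates
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda x: Q[(s, x)])
                s_next, reward, done = env.step(a)
                # value iteration update from the formula above; terminal
                # states are never updated and keep their initial value 0
                best_next = 0.0 if done else max(Q[(s_next, x)] for x in env.actions)
                Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q

For example, Q = q_learning(ChainMDP()) fills in values for the hypothetical chain above.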

Influence of variables on the algorithm

Learning rate

The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information. In a fully deterministic environment, a learning rate of 1 is optimal. When the problem is stochastic, the algorithm still converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, a constant learning rate is often used, such as \alpha_t(s,a) = 0.1 for all t.[1]
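As an illustration of the two choices just described, a constant rate and a rate that decays to zero, a hypothetical per-pair schedule might look as follows; the 1/N(s,a) visit-count decay is only one common way to satisfy the convergence conditions.

    from collections import defaultdict

    visit_count = defaultdict(int)  # N(s, a), the number of visits to the pair

    def constant_alpha(s, a, value=0.1):
        # alpha_t(s, a) = 0.1 for all t, as in the example above
        return value

    def decaying_alpha(s, a):
        # decreases to zero as the pair (s, a) is visited more often
        visit_count[(s, a)] += 1
        return 1.0 / visit_count[(s, a)]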

Discount factor

The discount factor determines the importance of future rewards. A factor of 0 makes the agent "myopic" by considering only current rewards, while a factor approaching 1 makes it strive for a high long-term reward. If the discount factor meets or exceeds 1, the action values may diverge.
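A short justification (not in the original article): with bounded rewards the discounted return is a convergent geometric series when \gamma < 1, which is why the action values stay finite, whereas for \gamma \ge 1 the series need not converge.

\left| \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \right| \le \sum_{k=0}^{\infty} \gamma^{k} R_{\max} = \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1, \quad |R_{t}| \le R_{\max}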

Initial conditions (Q(s_0,a_0))

Q-learning is an iterative algorithm; it implicitly assumes an initial condition before the first update occurs. A high initial value, also known as an "optimistic initial condition",[2] can encourage exploration: no matter which action is selected, the update rule will cause it to have a lower value than the other alternatives, thus increasing their choice probability. Recently, it was suggested that the first reward r could be used to reset the initial conditions. According to this idea, the first time an action is taken, the reward is used to set the value of Q. This allows immediate learning in the case of fixed deterministic rewards. Surprisingly, this resetting-of-initial-conditions (RIC) approach appears to be consistent with human behaviour in repeated binary choice experiments.[3]
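As a hypothetical Python illustration of optimistic initial conditions, the table can simply be initialized to a constant chosen to exceed any attainable return (the value 10.0 below is an arbitrary assumption):

    from collections import defaultdict

    OPTIMISTIC_VALUE = 10.0  # assumed to exceed the best attainable return

    # every unseen (state, action) pair starts out looking better than it
    # really is, so a greedy policy is driven to try each action at least once
    Q = defaultdict(lambda: OPTIMISTIC_VALUE)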

Implementation

Q-learning at its simplest uses tables to store data. This very quickly loses viability as the complexity of the system being monitored or controlled increases. One answer to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Tesauro in his temporal difference learning research on Backgammon.[4]

More generally, Q-learning can be combined with function approximation.[5] This makes it possible to apply the algorithm to larger problems, even when the state space is continuous, and therefore infinitely large. It may also speed up learning in finite problems, because the algorithm can generalize earlier experiences to previously unseen states.
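A minimal sketch of such a combination, assuming linear function approximation with a user-supplied feature map features(s, a) (a hypothetical helper returning a fixed-length NumPy vector); this is the semi-gradient form of the update, one common choice among several.

    import numpy as np

    def approx_q_update(w, features, s, a, reward, s_next, actions,
                        done, alpha=0.01, gamma=0.9):
        # One semi-gradient Q-learning step for the linear approximator
        # Q(s, a) ~ w . features(s, a).
        q_sa = np.dot(w, features(s, a))
        if done:
            target = reward
        else:
            target = reward + gamma * max(np.dot(w, features(s_next, b))
                                          for b in actions)
        # the gradient of the linear Q with respect to w is the feature vector
        w += alpha * (target - q_sa) * features(s, a)
        return w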

Early research

Q-learning was first introduced by Watkins[6] in 1989. A convergence proof was presented by Watkins and Dayan[7] in 1992.

Variants

Delayed Q-learning is an alternative implementation of the online Q-learning algorithm with probably approximately correct (PAC) learning guarantees.[8]

Because the maximum approximated action value is used in the Q-learning update, Q-learning can sometimes overestimate the action values in noisy environments, which slows the learning. A recent variant called Double Q-learning was proposed to correct this.[9]
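A sketch of one Double Q-learning step under the same assumed tabular setting: two tables are kept, one is picked at random to be updated, and the action is selected by the table being updated but evaluated by the other, which reduces the overestimation caused by the max operator.

    import random

    def double_q_update(QA, QB, s, a, reward, s_next, actions,
                        done, alpha=0.1, gamma=0.9):
        # QA and QB are tables keyed by (state, action), e.g. defaultdict(float)
        if random.random() < 0.5:
            QA, QB = QB, QA  # swap roles so each table is updated half the time
        if done:
            target = reward
        else:
            # action chosen by the updated table, value taken from the other
            a_star = max(actions, key=lambda b: QA[(s_next, b)])
            target = reward + gamma * QB[(s_next, a_star)]
        QA[(s, a)] += alpha * (target - QA[(s, a)])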

Greedy GQ is a variant of Q-learning for use in combination with (linear) function approximation.[10] The advantage of Greedy GQ is that convergence guarantees can be given even when function approximation is used to estimate the action values.

Recently, Google DeepMind presented an application that combines Q-learning with deep learning, named deep reinforcement learning or deep Q-networks (DQN). It succeeded in playing some early Atari 2600 games at an expert human level. Initial results were presented in 2014, followed by a paper in Nature in February 2015.

References

  1. ^ Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  2. ^ http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html
  3. ^ Shteingart H, Neiman T, Loewenstein Y. The Role of First Impression in Operant Learning. J Exp Psychol Gen. 2013 May; 142(2):476-88. doi: 10.1037/a0029550. Epub 2012 Aug 27.
  4. ^ Tesauro, Gerald. Temporal Difference Learning and TD-Gammon. Communications of the ACM, March 1995, 38(3).
  5. ^ Hado van Hasselt. Reinforcement Learning in Continuous State and Action Spaces. In: Reinforcement Learning: State of the Art, Springer, pages 207-251, 2012
  6. ^ Watkins, C.J.C.H., (1989), Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
  7. ^ Watkins, C.J.C.H. and Dayan, P. (1992), 'Q-learning', Machine Learning, 8:279-292.
  8. ^ Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proc. 23rd ICML 2006, pages 881–888, 2006.
  9. ^ Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems 23, pages 2613-2622, 2011.
  10. ^ Hamid Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, pages 719-726, 2010.