Q-learning

Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state. Once the action-value function has been learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it can compare the expected utility of the available actions without requiring a model of the environment. In addition, Q-learning can handle problems with stochastic transitions and rewards without requiring any adaptations.

Algorithm

The problem model, the MDP, consists of an agent, states S, and a set of actions A available in each state. By performing an action a \in A, the agent can move from one state to another. Each state yields a reward for the agent (a real or natural number). The goal of the agent is to maximize its total reward. It does this by learning which action is optimal in each state.

The algorithm therefore has a function that computes the quality (hence the "Q") of a state-action combination:

Q: S \times A \to \mathbb{R}

Before learning has started, Q returns an (arbitrary) fixed value chosen by the designer. Then, each time the agent selects an action and observes a reward and a new state, both of which may depend on the previous state and the action taken, Q is updated. The core of the algorithm is a simple value-iteration update, which corrects the old value using the newly acquired information.

Q_{t+1}(s_{t},a_{t}) = \underbrace{Q_t(s_t,a_t)}_{\rm old~value} + \underbrace{\alpha_t(s_t,a_t)}_{\rm learning~rate} \times \left[ \overbrace{\underbrace{R_{t+1}}_{\rm reward} + \underbrace{\gamma}_{\rm discount~factor} \underbrace{\max_{a}Q_t(s_{t+1}, a)}_{\rm estimate~of~optimal~future~value}}^{\rm learned~value} - \underbrace{Q_t(s_t,a_t)}_{\rm old~value} \right]

where:
  Q_t(s_t, a_t) :: old value
  \alpha_t(s_t, a_t) :: learning rate
  R_{t+1} :: reward
  \gamma :: discount factor
  \max_{a} Q_t(s_{t+1}, a) :: estimate of optimal future value
  R_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) :: learned value

R_{t+1} is the reward observed after performing a_{t} in s_{t}, and \alpha_t(s, a) (0 < \alpha \le 1) is the learning rate (which may be the same for all pairs (s, a)). The discount factor \gamma (0 \le \gamma \le 1) trades off the importance of immediate versus delayed rewards.

An episode of the algorithm ends when s_{t+1} is a final (or "absorbing") state. However, Q-learning can also learn in non-episodic tasks. If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

Note that for all final states s_f, Q(s_f, a) is never updated and thus retains its initial value. In most cases, Q(s_f, a) can be taken to be zero.
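
The following is a minimal Python sketch of the tabular algorithm described above. The environment interface (reset() and step() returning the next state, the reward, and a termination flag) and the epsilon-greedy exploration scheme are illustrative assumptions, not part of the update rule itself.

import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q starts at an arbitrary constant (here 0.0) for every (state, action) pair.
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (one common exploration scheme).
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # The value-iteration update from the formula above.
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in range(n_actions))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q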

Influence of variables on the algorithm

Learning rate

The learning rate determines to what extent the newly acquired information overrides the old information. A factor of 0 will make the agent learn nothing, while a factor of 1 makes it consider only the most recent information. In a fully deterministic environment, a learning rate of 1 is appropriate. When the problem is stochastic, the algorithm still converges under some technical conditions on the learning rate, which require it to decrease to zero. In practice, a constant learning rate is often used, such as \alpha_t(s,a) = 0.1 for all t.[1]
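
As an illustration of such a decreasing schedule (a hypothetical choice, not prescribed here), the learning rate can be set to the reciprocal of the number of visits to each state-action pair, which decreases to zero while still satisfying the usual stochastic-approximation conditions:

from collections import defaultdict

# Hypothetical decaying schedule: alpha_t(s, a) = 1 / (number of visits to (s, a)).
# Its sum diverges while the sum of its squares is finite.
visits = defaultdict(int)

def alpha(s, a):
    visits[(s, a)] += 1
    return 1.0 / visits[(s, a)]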

Discount factor

The discount factor determines the importance of future rewards. A factor of 0 makes the agent short-sighted by only considering current rewards, while a factor approaching 1 makes it strive for long-term reward. If the discount factor exceeds 1, the action values may diverge.
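
As an illustration, with a constant per-step reward r the discounted return is a geometric series,

\sum_{k=0}^{\infty} \gamma^k r = \frac{r}{1-\gamma},

which is finite only when \gamma < 1; for example, r = 1 and \gamma = 0.9 give a value of 10.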

Initial conditions (Q(s_0, a_0))

Q-learning is an iterative algorithm, so it implicitly assumes an initial condition before the first update occurs. A high initial value, also known as "optimistic initial conditions",[2] can encourage exploration: no matter which action is selected, the update rule will cause it to have lower values than the untried alternatives, thus increasing their selection probability. Recently, it was suggested that the first reward r could be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value of Q. This allows immediate learning in the case of fixed deterministic rewards. Surprisingly, this reset-of-initial-conditions (RIC) approach appears to be consistent with human behaviour in repeated binary choice experiments.[3]
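
A small Python sketch of both ideas follows; the optimistic constant, the table layout, and the helper name ric_update are illustrative assumptions rather than a prescribed implementation.

from collections import defaultdict

# Optimistic initialization: every Q(s, a) starts at a value assumed to be
# higher than any achievable return, so untried actions keep looking attractive.
OPTIMISTIC_VALUE = 10.0  # illustrative upper bound, not from the article
Q = defaultdict(lambda: OPTIMISTIC_VALUE)

# Reset-of-initial-conditions (RIC) variant: the first observed reward for a
# state-action pair replaces the initial value; later rewards use the normal update.
seen = set()

def ric_update(s, a, r, best_next, alpha=0.1, gamma=0.9):
    if (s, a) not in seen:
        seen.add((s, a))
        Q[(s, a)] = r  # the first reward resets the initial condition
    else:
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])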

Implementation

Q-learning at its simplest uses tables to store data. This quickly loses viability as the complexity of the system it is monitoring/controlling increases. One answer to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Tesauro in his temporal difference learning research on Backgammon.[4]

More generally, Q-learning can be combined with function approximation.[5] This makes it possible to apply the algorithm to larger problems, even when the state space is continuous, and therefore infinitely large. Additionally, it may speed up learning in finite problems, due to the fact that the algorithm can generalize earlier experiences to previously unseen states.
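
A minimal sketch of Q-learning combined with linear function approximation is shown below, where the action value is approximated as a dot product between a weight vector and an assumed feature function phi(s, a); this is a plain semi-gradient update for illustration, not the Greedy GQ algorithm cited later.

import numpy as np

def linear_q_update(w, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.9, done=False):
    # Q(s, a) is approximated as the dot product of w and phi(s, a).
    q_sa = np.dot(w, phi(s, a))
    best_next = 0.0 if done else max(np.dot(w, phi(s_next, b)) for b in actions)
    # Semi-gradient Q-learning step on the weight vector.
    td_error = r + gamma * best_next - q_sa
    return w + alpha * td_error * phi(s, a)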

Early study

Q-learning was first introduced by Watkins[6] in 1989. The convergence proof was presented later by Watkins and Dayan[7] in 1992.

Variants

Delayed Q-learning is an alternative implementation of the online Q-learning algorithm with probably approximately correct (PAC) learning guarantees.[8]

Because the maximum approximated action value is used in the Q-learning update, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A recent variant called Double Q-learning was proposed to correct this.[9]
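
A sketch of the Double Q-learning update: two tables are maintained, and on each step one of them selects the greedy action while the other evaluates it, which reduces the maximization bias. The table layout and helper name are illustrative assumptions.

import random
from collections import defaultdict

QA = defaultdict(float)
QB = defaultdict(float)

def double_q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    # Randomly pick which table to update; the other one evaluates the greedy action.
    first, second = (QA, QB) if random.random() < 0.5 else (QB, QA)
    if done:
        target = r
    else:
        a_star = max(actions, key=lambda b: first[(s_next, b)])
        target = r + gamma * second[(s_next, a_star)]
    first[(s, a)] += alpha * (target - first[(s, a)])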

Greedy GQ is a variant of Q-learning to use in combination with (linear) function approximation.[10] The advantage of Greedy GQ is that convergence guarantees can be given even when function approximation is used to estimate the action values.


References

  1. ^ Reinforcement Learning: An Introduction. Richard Sutton and Andrew Barto. MIT Press, 1998.
  2. ^ http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html
  3. ^ The Role of First Impression in Operant Learning. Shteingart H, Neiman T, Loewenstein Y. J Exp Psychol Gen. 2013 May; 142(2):476-88. doi: 10.1037/a0029550. Epub 2012 Aug 27.
  4. ^ Tesauro, Gerald. Temporal Difference Learning and TD-Gammon. Communications of the ACM. March 1995, 38 (3) [2010-02-08]. 
  5. ^ Hado van Hasselt. Reinforcement Learning in Continuous State and Action Spaces. In: Reinforcement Learning: State of the Art, Springer, pages 207-251, 2012
  6. ^ Watkins, C.J.C.H., (1989), Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
  7. ^ Watkins, C.J.C.H. and Dayan, P. (1992), 'Q-learning', Machine Learning, 8:279-292.
  8. ^ Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proc. 23rd ICML 2006, pages 881–888, 2006.
  9. ^ Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems 23, pages 2613-2622, 2011.
  10. ^ Hamid Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, pages 719-726, 2010.