Learning rate

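In machine learning and statistics, the learning rate is a tunable parameter of an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.^[1] Because it governs how far newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In adaptive control, the learning rate is commonly called the gain.^[2]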

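Setting the learning rate involves a trade-off between the rate of convergence and overshooting. The direction of each update is usually given by the negative gradient of the loss function, and the learning rate determines how large a step is taken in that direction. Too high a learning rate makes each step so large that it jumps over the minimum, whereas too low a learning rate either makes convergence slow or leaves the optimization stuck in an undesirable local minimum.^[3]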
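The trade-off can be seen on a toy problem. The following is a minimal sketch, assuming plain gradient descent on the one-dimensional loss $f(x) = x^2$; the function name `gradient_descent`, the starting point and the three learning rates are illustrative choices, not taken from the cited sources.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Minimise f(x) = x**2 (gradient 2*x) with a constant learning rate."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # gradient of the toy loss at the current point
        x = x - lr * grad   # step along the negative gradient
    return x


# Three hypothetical learning rates applied to the same problem.
too_small = gradient_descent(lr=0.001)
reasonable = gradient_descent(lr=0.1)
too_large = gradient_descent(lr=1.1)

print(too_small, reasonable, too_large)
```

With these assumed settings, the smallest rate barely moves away from the starting point, the moderate rate approaches the minimum at 0, and the largest rate overshoots on every step and ends up further from the minimum than where it started.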

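To speed up convergence and to avoid oscillation and getting stuck in undesirable local minima, the learning rate is often varied during training, either according to a learning rate schedule or adaptively.^[4]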

Learning rate schedule

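The initial learning rate can be left at the system default or chosen by other means. A learning rate schedule changes the learning rate during training, most often between epochs or iterations. The change is usually governed by two parameters: decay and momentum. Common learning rate schedules are time-based, step-based and exponential.^[4] Decay serves to settle the learning in a good place without oscillation (when a constant learning rate is too high, learning may jump back and forth around a minimum). The decay rate is usually controlled by a hyperparameter.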

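Momentum is analogous to a ball rolling down a hill: we want the ball to settle at the lowest point, which corresponds to the lowest error. When the gradient points in the same direction for a long time, momentum speeds up learning (in effect raising the learning rate), and it can also help the optimization get past local minima by rolling over them. Momentum is controlled by a hyperparameter analogous to the ball's mass, which must be chosen manually. If it is too high, the ball rolls past the minimum we wish to find; if it is too low, it has no useful effect. The formula for momentum is more complex than the one for decay, but it is already implemented in common deep learning libraries such as Keras.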
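The text above does not state the momentum formula. As a rough illustration, the sketch below uses one widely described formulation (classical, or "heavy-ball", momentum), in which a velocity term accumulates past gradients. The function name `descend`, the toy loss $f(x) = x^2$ and the values `lr=0.01` and `momentum=0.9` are assumptions made for this example only.

```python
def descend(lr, momentum, x0=1.0, steps=50):
    """Minimise f(x) = x**2 using gradient descent with a velocity term."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        grad = 2.0 * x
        # The velocity keeps a decaying memory of earlier gradient steps.
        velocity = momentum * velocity - lr * grad
        x = x + velocity
    return x


plain = descend(lr=0.01, momentum=0.0)          # ordinary gradient descent
with_momentum = descend(lr=0.01, momentum=0.9)  # same rate, with momentum

print(plain, with_momentum)
```

Because the gradient keeps pointing toward the minimum for many consecutive steps here, the run with momentum ends closer to 0 than the run without it, even though both use the same learning rate.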

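A time-based learning rate schedule changes the learning rate depending on the learning rate of the previous iteration. Factoring in the decay, the update formula for the learning rate is: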

$\eta _{n+1}={\frac {\eta _{n}}{1+dn}}$

where $\eta$ is the learning rate, $d$ is the decay parameter and $n$ is the iteration step.
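The time-based update can be applied step by step. This is a minimal sketch, assuming the iteration counter $n$ starts at 0; the function name `time_based_schedule` and the example values for $\eta_0$ and $d$ are hypothetical.

```python
def time_based_schedule(eta0, d, num_steps):
    """Return [eta_0, ..., eta_num_steps] using eta_{n+1} = eta_n / (1 + d*n)."""
    rates = [eta0]
    for n in range(num_steps):
        rates.append(rates[-1] / (1.0 + d * n))
    return rates


rates = time_based_schedule(eta0=0.1, d=0.01, num_steps=5)
print(rates)
```

Under the assumption that $n$ starts at 0, the first update leaves the rate unchanged, because the denominator is $1 + d \cdot 0 = 1$. After that the rate shrinks a little faster at every step.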

A step-based learning rate schedule changes the learning rate according to predefined steps. It is commonly defined as:

$\eta _{n}=\eta _{0}d^{\left\lfloor {\frac {1+n}{r}}\right\rfloor }$

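where $\eta _{n}$ is the learning rate at iteration $n$, $\eta _{0}$ is the initial learning rate, $d$ is how much the learning rate changes at each drop (0.5 corresponds to halving) and $r$ is the drop rate, i.e. how often the rate is dropped (10 corresponds to a drop every 10 iterations). The floor function ( $\lfloor \dots \rfloor$ ) rounds its argument down to an integer, so it yields 0 whenever the argument is smaller than 1.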
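The step-based formula translates directly into code. The sketch below assumes integer iteration indices, so integer division plays the role of the floor function; the function name `step_based_rate` and the example values (initial rate 0.1, halving every 10 iterations) are illustrative.

```python
def step_based_rate(eta0, d, r, n):
    """Learning rate at iteration n: eta0 * d ** floor((1 + n) / r)."""
    num_drops = (1 + n) // r  # integer division acts as the floor function
    return eta0 * d ** num_drops


# Halve an initial rate of 0.1 every 10 iterations.
schedule = [step_based_rate(eta0=0.1, d=0.5, r=10, n=n) for n in range(30)]
print(schedule[0], schedule[9], schedule[19])
```

With these values the rate stays at its initial value for iterations 0 to 8, drops to half at iteration 9 (where $(1 + n)/r$ first reaches 1), and halves again at iteration 19.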

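An exponential learning rate schedule is similar to the step-based one, but it uses a decreasing exponential function instead of steps. Factoring in the decay, the formula is: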

$\eta _{n}=\eta _{0}e^{-dn}$

where $d$ is the decay parameter.
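The exponential schedule can be sketched in the same way. The function name `exponential_rate` and the values chosen for $\eta_0$ and $d$ are assumptions for illustration.

```python
import math


def exponential_rate(eta0, d, n):
    """Learning rate at iteration n: eta0 * exp(-d * n)."""
    return eta0 * math.exp(-d * n)


exp_schedule = [exponential_rate(eta0=0.1, d=0.05, n=n) for n in range(5)]
print(exp_schedule)
```

Unlike the step-based schedule, the rate here decreases smoothly: each iteration multiplies it by the same factor $e^{-d}$.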

Adaptive learning rate

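The problem with learning rate schedules is that the way the learning rate changes depends on hyperparameters, which must be chosen manually. Many different adaptive gradient descent algorithms address this problem, such as Adagrad, Adadelta, RMSprop and Adam,^[5] and they are generally built into deep learning libraries.^[6]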
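To give a flavour of how such methods adapt the step size, here is a rough single-parameter sketch of the update usually attributed to Adagrad: the base rate is divided by the square root of the running sum of squared gradients. The text above does not give this formula, so the details are assumptions, as are the function name `adagrad`, the toy loss $f(x) = x^2$, the base rate 0.5 and the small constant `eps`.

```python
import math


def adagrad(lr=0.5, x0=1.0, steps=50, eps=1e-8):
    """Minimise f(x) = x**2 with an Adagrad-style per-step learning rate."""
    x, grad_sq_sum = x0, 0.0
    effective_rates = []
    for _ in range(steps):
        grad = 2.0 * x
        grad_sq_sum += grad * grad  # accumulate squared gradients
        # The effective rate shrinks as the accumulated sum grows.
        step_size = lr / (math.sqrt(grad_sq_sum) + eps)
        effective_rates.append(step_size)
        x -= step_size * grad
    return x, effective_rates


x_final, effective_rates = adagrad()
print(x_final, effective_rates[:3])
```

No decay schedule is specified here. The effective rate falls on its own as gradients accumulate, although a base rate still has to be supplied.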

References

[1] Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press. 2012: 247. ISBN 978-0-262-01802-9.
[2] Delyon, Bernard. Stochastic Approximation with Decreasing Gain: Convergence and Asymptotic Theory. Unpublished Lecture Notes (Université de Rennes). 2000. CiteSeerX 10.1.1.29.4428.
[3] Buduma, Nikhil; Locascio, Nicholas. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. O'Reilly. 2017: 21. ISBN 978-1-4919-2558-4.
[4] Patterson, Josh; Gibson, Adam. Understanding Learning Rates. Deep Learning: A Practitioner's Approach. O'Reilly. 2017: 258–263. ISBN 978-1-4919-1425-0.
[5] Murphy, Kevin. Probabilistic Machine Learning: An Introduction. MIT Press. 2021 [retrieved 10 April 2021]. (Archived from the original on 2021-04-11.)
[6] Brownlee, Jason. How to Configure the Learning Rate When Training Deep Learning Neural Networks. Machine Learning Mastery. 22 January 2019 [retrieved 4 January 2021]. (Archived from the original on 2019-06-20.)
