# Gradient boosting

## Informal introduction

(This section follows Li's [6] exposition of gradient boosting.)

Like other boosting methods, gradient boosting builds its model in stages. Suppose that after m stages we have an imperfect model ${\displaystyle F_{m}}$. To improve it, we add a new estimator ${\displaystyle h}$, ideally one that corrects the current model's errors exactly:

${\displaystyle F_{m+1}(x)=F_{m}(x)+h(x)=y,}$

or, equivalently,

${\displaystyle h(x)=y-F_{m}(x)}$.

Gradient boosting therefore fits each new learner ${\displaystyle h}$ to the residuals of the current model.
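The residual-fitting idea can be illustrated with a minimal sketch; the data, the split point, and the crude two-region learner standing in for ${\displaystyle h}$ are all hypothetical:

```python
import numpy as np

# Toy 1-D regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

# Current (imperfect) model F_m: here simply the constant mean of y.
F_m = np.full_like(y, y.mean())

# The residuals y - F_m(x) are exactly what the next learner h should predict.
residuals = y - F_m

# Fit a crude h: predict the mean residual on each half of the input space.
split = x < 5.0
h = np.where(split, residuals[split].mean(), residuals[~split].mean())

# Adding h moves the model toward y, so the squared error cannot increase.
F_next = F_m + h
assert np.mean((y - F_next) ** 2) <= np.mean((y - F_m) ** 2)
```

Because each region's prediction becomes the region's mean of y, the squared error is guaranteed not to increase; with a more expressive learner for h the improvement per stage is larger.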

## Algorithm

Gradient boosting seeks an approximation ${\displaystyle {\hat {F}}}$ that minimizes the expected value of a given loss function:

${\displaystyle {\hat {F}}={\underset {F}{\arg \min }}\,\mathbb {E} _{x,y}[L(y,F(x))]}$.

The approximation is sought in the form of a weighted sum of base learners ${\displaystyle h_{i}}$:

${\displaystyle {\hat {F}}(x)=\sum _{i=1}^{M}\gamma _{i}h_{i}(x)+{\mbox{const}}}$.

The model is built greedily: it starts from a constant function ${\displaystyle F_{0}}$ and, at each stage, adds the base learner that most reduces the empirical loss:

${\displaystyle F_{0}(x)={\underset {\gamma }{\arg \min }}{\sum _{i=1}^{n}{L(y_{i},\gamma )}}}$,
${\displaystyle F_{m}(x)=F_{m-1}(x)+{\underset {h_{m}\in {\mathcal {H}}}{\operatorname {arg\,min} }}\left[{\sum _{i=1}^{n}{L(y_{i},F_{m-1}(x_{i})+h_{m}(x_{i}))}}\right]}$.

Choosing the best function ${\displaystyle h_{m}}$ at each step is computationally infeasible in general, so the method instead takes a step in the direction of the negative gradient of the loss (steepest descent), with the step length ${\displaystyle \gamma _{m}}$ chosen by line search:

${\displaystyle F_{m}(x)=F_{m-1}(x)-\gamma _{m}\sum _{i=1}^{n}{\nabla _{F_{m-1}}L(y_{i},F_{m-1}(x_{i}))},}$
${\displaystyle \gamma _{m}={\underset {\gamma }{\arg \min }}{\sum _{i=1}^{n}{L\left(y_{i},F_{m-1}(x_{i})-\gamma \nabla _{F_{m-1}}L(y_{i},F_{m-1}(x_{i}))\right)}}.}$

Input: training set ${\displaystyle \{(x_{i},y_{i})\}_{i=1}^{n},}$ a differentiable loss function ${\displaystyle L(y,F(x)),}$ number of iterations ${\displaystyle M}$.

Algorithm:

1. Initialize the model with a constant value:
   ${\displaystyle F_{0}(x)={\underset {\gamma }{\arg \min }}\sum _{i=1}^{n}L(y_{i},\gamma ).}$
2. For m = 1 to M:
   1. Compute the pseudo-residuals:
      ${\displaystyle r_{im}=-\left[{\frac {\partial L(y_{i},F(x_{i}))}{\partial F(x_{i})}}\right]_{F(x)=F_{m-1}(x)}\quad {\mbox{for }}i=1,\ldots ,n.}$
   2. Fit a base learner (or weak learner, e.g. a tree) ${\displaystyle h_{m}(x)}$ to the pseudo-residuals, i.e. train it on the training set ${\displaystyle \{(x_{i},r_{im})\}_{i=1}^{n}}$.
   3. Compute the multiplier ${\displaystyle \gamma _{m}}$ by solving the following one-dimensional optimization problem:
      ${\displaystyle \gamma _{m}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L\left(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})\right).}$
   4. Update the model:
      ${\displaystyle F_{m}(x)=F_{m-1}(x)+\gamma _{m}h_{m}(x).}$
3. Output ${\displaystyle F_{M}(x).}$
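The algorithm above can be sketched in Python. Squared-error loss is assumed, so the pseudo-residuals reduce to ordinary residuals and the line search for ${\displaystyle \gamma _{m}}$ has an exact solution; a hand-rolled regression stump (a hypothetical stand-in, not part of any library) plays the role of the base learner:

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree: pick the split threshold minimizing squared
    error, predicting the mean of r on each side of the split."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

def gradient_boost(x, y, M=30):
    # Step 1: initialize with the constant minimizing squared loss (the mean).
    F0 = y.mean()
    pred = np.full_like(y, F0)
    learners = []
    for m in range(M):              # Step 2: for m = 1 to M
        r = y - pred                # 2.1 pseudo-residuals (= residuals here)
        h = fit_stump(x, r)         # 2.2 fit base learner to pseudo-residuals
        gamma = 1.0                 # 2.3 line search: exactly 1 for squared
                                    #     loss with mean-predicting leaves
        pred = pred + gamma * h(x)  # 2.4 update the model
        learners.append((gamma, h))
    return F0, learners             # Step 3: the final model F_M

def predict(model, z):
    F0, learners = model
    out = np.full(len(z), F0, dtype=float)
    for gamma, h in learners:
        out = out + gamma * h(z)
    return out
```

With a general differentiable loss, step 2.1 would evaluate the negative gradient of L and step 2.3 would require a numerical one-dimensional minimization; the squared-loss case is shown because both steps collapse to closed forms.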

## Gradient tree boosting

At the m-th step, gradient tree boosting fits a decision tree ${\displaystyle h_{m}(x)}$ to the pseudo-residuals. Let ${\displaystyle J_{m}}$ be the number of its leaves. The tree partitions the input space into ${\displaystyle J_{m}}$ disjoint regions ${\displaystyle R_{1m},\ldots ,R_{J_{m}m}}$ and predicts a constant value in each region. Using indicator notation, the output of ${\displaystyle h_{m}(x)}$ for an input x can be written as the sum:

${\displaystyle h_{m}(x)=\sum _{j=1}^{J_{m}}b_{jm}\mathbf {1} _{R_{jm}}(x),}$

Then the coefficients ${\displaystyle b_{jm}}$ are multiplied by a value ${\displaystyle \gamma _{m}}$ chosen by line search so as to minimize the loss, and the model is updated as in the generic algorithm:

${\displaystyle F_{m}(x)=F_{m-1}(x)+\gamma _{m}h_{m}(x),\quad \gamma _{m}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})).}$

Friedman proposes choosing a separate optimal value ${\displaystyle \gamma _{jm}}$ for each of the tree's regions, instead of a single ${\displaystyle \gamma _{m}}$ for the whole tree; he calls the modified algorithm "TreeBoost". The coefficients ${\displaystyle b_{jm}}$ from the tree-fitting procedure can then be discarded, and the update rule becomes:

${\displaystyle F_{m}(x)=F_{m-1}(x)+\sum _{j=1}^{J_{m}}\gamma _{jm}\mathbf {1} _{R_{jm}}(x),\quad \gamma _{jm}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{x_{i}\in R_{jm}}L(y_{i},F_{m-1}(x_{i})+\gamma ).}$
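The per-region line search can be sketched with a hypothetical NumPy-only helper. For squared loss each ${\displaystyle \gamma _{jm}}$ is simply the mean residual in its leaf; the sketch below instead shows absolute loss ${\displaystyle L(y,F)=|y-F|}$, for which each ${\displaystyle \gamma _{jm}}$ is the median of the residuals in that leaf:

```python
import numpy as np

def leaf_gammas_l1(y, F_prev, leaf_id):
    """Separate multiplier gamma_{jm} for each leaf region R_{jm} under
    absolute loss: the argmin over gamma of
    sum_{x_i in R_{jm}} |y_i - (F_{m-1}(x_i) + gamma)|
    is attained at the median of the residuals y_i - F_{m-1}(x_i)
    restricted to that leaf."""
    return {j: np.median(y[leaf_id == j] - F_prev[leaf_id == j])
            for j in np.unique(leaf_id)}
```

This is exactly what makes TreeBoost more than fitting the tree alone: the leaf values are re-optimized against the actual loss, not the squared loss used to grow the tree.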

### Size of trees

${\displaystyle J}$, the number of terminal nodes in a tree, is a parameter of the method that can be tuned for the data set at hand. It controls the maximum allowed level of interaction between variables in the model. With ${\displaystyle J=2}$ (decision stumps), no interaction between variables is allowed. With ${\displaystyle J=3}$ the model may include effects of interactions between up to two variables, and so on.

Hastie et al. [7] comment that typically ${\displaystyle 4\leq J\leq 8}$ works well for boosting and results are fairly insensitive to the choice of ${\displaystyle J}$ within this range, that ${\displaystyle J=2}$ is insufficient for many applications, and that ${\displaystyle J>10}$ is unlikely to be required.

## Regularization

### Shrinkage

An important regularization technique is shrinkage, which dampens each update by a "learning rate" ${\displaystyle \nu }$:

${\displaystyle F_{m}(x)=F_{m-1}(x)+\nu \cdot \gamma _{m}h_{m}(x),\quad 0<\nu \leq 1.}$

Empirically, small learning rates (such as ${\displaystyle \nu <0.1}$) yield dramatic improvements in generalization over gradient boosting without shrinking (${\displaystyle \nu =1}$), at the cost of requiring more iterations and hence longer training.
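As a minimal sketch (hypothetical names), the shrunken update is a one-line change to the model-update step of the generic algorithm:

```python
import numpy as np

def shrunken_update(F_prev, gamma_m, h_m_of_x, nu=0.1):
    """One damped boosting step: F_m(x) = F_{m-1}(x) + nu * gamma_m * h_m(x).
    nu is the learning rate, 0 < nu <= 1; nu = 1 recovers the unshrunk update."""
    assert 0.0 < nu <= 1.0
    return F_prev + nu * gamma_m * h_m_of_x
```

With shrinkage, no single stage can move the model far, so the fit is spread across many small corrections rather than a few aggressive ones.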

## Names

A popular open-source implementation for R calls it a "Generalized Boosting Model", [10] but packages extending this work use BRT ("boosted regression trees"). [17] A commercial implementation from Salford Systems uses the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked. [citation needed]

## References

1. Breiman, L. "Arcing The Edge" (PDF). Statistics Department, University of California, Berkeley. June 1997. Retrieved 2019-11-12. Archived (PDF) from the original on 2020-05-09.
2. Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine" (PDF). February 1999. Retrieved 2019-11-12. Archived (PDF) from the original on 2019-11-01.
3. Friedman, J. H. "Stochastic Gradient Boosting" (PDF). March 1999. Retrieved 2019-11-12. Archived (PDF) from the original on 2014-08-01.
4. (empty citation)
5. Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus. "Boosting Algorithms as Gradient Descent in Function Space" (PDF). May 1999. Retrieved 2019-11-12. Archived (PDF) from the original on 2018-12-22.
6. Cheng Li. "A Gentle Introduction to Gradient Boosting" (PDF). Retrieved 2019-11-12. Archived (PDF) from the original on 2018-10-24.
7. Hastie, T.; Tibshirani, R.; Friedman, J. H. "10. Boosting and Additive Trees". The Elements of Statistical Learning (2nd ed.). New York: Springer. 2009: 337–384. ISBN 978-0-387-84857-0. Archived from the original on 2009-11-10.
8. Note: in the case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient ${\displaystyle b_{jm}}$ for the region ${\displaystyle R_{jm}}$ is equal to the value of the output variable, averaged over all training instances in ${\displaystyle R_{jm}}$.
9. Note that this is different from bagging, which samples with replacement because it uses samples of the same size as the training set.
10. Ridgeway, Greg (2007). "Generalized Boosted Models: A guide to the gbm package". Archived at the Internet Archive.
11. "Learn Gradient Boosting Algorithm for better predictions (with codes in R)". Retrieved 2019-11-12. Archived from the original on 2019-08-13.
12. Tianqi Chen. "Introduction to Boosted Trees". Archived at the Internet Archive.
13. Cossock, David and Zhang, Tong (2008). "Statistical Analysis of Bayes Optimal Subset Ranking", page 14. Archived at the Internet Archive on 2010-08-07.
14. (empty citation)
15. Friedman, Jerome. "Multiple Additive Regression Trees with Application in Epidemiology". Statistics in Medicine. 2003, 22 (9): 1365–1381. PMID 12704603. doi:10.1002/sim.1501.
16. Elith, Jane. "A working guide to boosted regression trees". Journal of Animal Ecology. 2008, 77 (4): 802–813. PMID 18397250. doi:10.1111/j.1365-2656.2008.01390.x.
17. Elith, Jane. "Boosted Regression Trees for ecological modeling" (PDF). CRAN. Retrieved 31 August 2018. Archived (PDF) from the original on 2020-07-25.