Linear regression

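1. If the goal is prediction or mapping, linear regression can be used to fit a predictive model to the observed values of y and X. Once the model has been fitted, it can predict a value of y for a new value of X that has no observed y paired with it.
2. Given a variable y and a number of variables ${\displaystyle X_{1}}$, ..., ${\displaystyle X_{p}}$ that may be related to y, linear regression analysis can be used to quantify the strength of the relationship between y and each ${\displaystyle X_{j}}$, to identify which ${\displaystyle X_{j}}$ are unrelated to y, and to identify which subsets of the ${\displaystyle X_{j}}$ carry redundant information about y.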

Introduction

Theoretical model

${\displaystyle Y_{i}=\beta _{0}+\beta _{1}X_{i1}+\beta _{2}X_{i2}+\ldots +\beta _{p}X_{ip}+\varepsilon _{i},\qquad i=1,\ldots ,n}$

Data and estimation

${\displaystyle Y=X\beta +\varepsilon \,}$

${\displaystyle X={\begin{pmatrix}1&x_{11}&\cdots &x_{1p}\\1&x_{21}&\cdots &x_{2p}\\\vdots &\vdots &\ddots &\vdots \\1&x_{n1}&\cdots &x_{np}\end{pmatrix}}}$

X usually includes a constant term.
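A minimal sketch of how the constant term enters the matrix X: each observation is given a leading 1, so that the first coefficient acts as the intercept. The function name `design_matrix` and the sample values are assumptions made for this illustration.

```python
def design_matrix(rows: list[list[float]]) -> list[list[float]]:
    """Prepend a constant 1.0 to every observation (one row per observation)."""
    return [[1.0, *row] for row in rows]


# Three hypothetical observations of two explanatory variables.
observations = [[2.0, 5.0], [3.0, 1.0], [4.0, 7.0]]
X = design_matrix(observations)

for row in X:
    print(row)
```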

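Classical assumptions

• The sample is drawn at random from the population.
• The dependent variable Y is continuous on the real line.
• The error terms are independent and identically distributed (i.i.d.); that is, they are independent random draws, and they follow a Gaussian distribution.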

${\displaystyle {\mbox{E}}(Y_{i}\mid X_{i}=x_{i})=\alpha +\beta x_{i}\,}$

Least squares analysis

Least squares estimation

${\displaystyle {\hat {\beta }}=(X^{T}X)^{-1}X^{T}y\,}$
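The estimator above can be sketched in plain Python by solving the normal equations ${\displaystyle X^{T}X\beta =X^{T}y}$ directly, which gives the same result as forming the inverse explicitly. The helper names (`transpose`, `matmul`, `solve`, `ols`) and the data are assumptions for illustration. Production code would normally rely on a linear algebra library and a QR-based solver for numerical stability.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: the columns of X are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(X, y):
    """Return beta_hat solving (X^T X) beta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(u * v for u, v in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)


# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = [[1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0],
     [1.0, 2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]

beta_hat = ols(X, y)
print(beta_hat)  # close to [1.0, 2.0, 3.0]
```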

Regression inference

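${\displaystyle {\hat {\sigma }}^{2}={\frac {S}{n-p}},}$

Here S is the residual sum of squares, and p counts the estimated coefficients (the columns of X, constant term included), so a simple regression with an intercept and one slope has n − 2 degrees of freedom.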

${\displaystyle {\hat {\sigma }}^{2}\cdot {\frac {n-p}{\sigma ^{2}}}\sim \chi _{n-p}^{2}}$

${\displaystyle {\hat {\boldsymbol {\beta }}}=(\mathbf {X} ^{T}\mathbf {X} )^{-1}\mathbf {X} ^{T}\mathbf {y} .}$

${\displaystyle {\hat {\beta }}\sim N(\beta ,\sigma ^{2}(X^{T}X)^{-1})}$

${\displaystyle {\hat {\sigma }}_{j}={\sqrt {{\frac {S}{n-p}}\left[\mathbf {(X^{T}X)} ^{-1}\right]_{jj}}}.}$

${\displaystyle {\hat {\beta }}_{j}\pm t_{{\frac {\alpha }{2}},n-p}{\hat {\sigma }}_{j}.}$

${\displaystyle \mathbf {{\hat {r}}=y-X{\hat {\boldsymbol {\beta }}}=y-X(X^{T}X)^{-1}X^{T}y} .\,}$
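The inference formulas in this section can be sketched together in plain Python. All function names and data values are assumptions for illustration. The Python standard library has no Student t quantile function, so the critical value `t_crit` is passed in by the caller; the value 3.182 used below is the usual table value for a 95% two-sided interval with 3 degrees of freedom. In the code, `p` counts the columns of X, constant included.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]


def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_inference(X, y, t_crit):
    n, p = len(X), len(X[0])  # p = number of columns of X, constant included
    Xt = transpose(X)
    XtX_inv = inverse(matmul(Xt, X))
    Xty = [[sum(u * v for u, v in zip(col, y))] for col in Xt]
    beta = [row[0] for row in matmul(XtX_inv, Xty)]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]  # r_hat = y - X beta_hat
    S = sum(r * r for r in residuals)                   # residual sum of squares
    sigma2 = S / (n - p)                                # estimate of sigma^2
    std_err = [math.sqrt(sigma2 * XtX_inv[j][j]) for j in range(p)]
    intervals = [(b - t_crit * s, b + t_crit * s) for b, s in zip(beta, std_err)]
    return beta, residuals, sigma2, std_err, intervals


# Hypothetical data: a constant column plus one explanatory variable.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta, residuals, sigma2, std_err, intervals = ols_inference(X, y, t_crit=3.182)
print("beta_hat:", beta)
print("sigma2_hat:", sigma2)
print("standard errors:", std_err)
print("95% intervals:", intervals)
```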

Simple linear regression

${\displaystyle Y=\alpha +\beta X+\varepsilon }$

${\displaystyle \sum _{i=1}^{n}\varepsilon _{i}^{2}=\sum _{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}}$
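As a sketch of the intermediate step, and assuming the minimum is located by the standard calculus argument of setting both partial derivatives of the sum of squares to zero:

${\displaystyle {\frac {\partial }{\partial \alpha }}\sum _{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}=-2\sum _{i=1}^{n}(y_{i}-\alpha -\beta x_{i})=0}$

${\displaystyle {\frac {\partial }{\partial \beta }}\sum _{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}=-2\sum _{i=1}^{n}x_{i}(y_{i}-\alpha -\beta x_{i})=0}$

Dividing each equation by −2 and collecting the terms in α and β yields one linear equation per parameter.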

${\displaystyle \left\{{\begin{array}{lcl}n\ \alpha +\sum \limits _{i=1}^{n}x_{i}\ \beta =\sum \limits _{i=1}^{n}y_{i}\\\sum \limits _{i=1}^{n}x_{i}\ \alpha +\sum \limits _{i=1}^{n}x_{i}^{2}\ \beta =\sum \limits _{i=1}^{n}x_{i}y_{i}\end{array}}\right.}$

${\displaystyle {\hat {\beta }}={\frac {n\sum \limits _{i=1}^{n}x_{i}y_{i}-\sum \limits _{i=1}^{n}x_{i}\sum \limits _{i=1}^{n}y_{i}}{n\sum \limits _{i=1}^{n}x_{i}^{2}-\left(\sum \limits _{i=1}^{n}x_{i}\right)^{2}}}={\frac {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}\,}$
${\displaystyle {\hat {\alpha }}={\frac {\sum \limits _{i=1}^{n}x_{i}^{2}\sum \limits _{i=1}^{n}y_{i}-\sum \limits _{i=1}^{n}x_{i}\sum \limits _{i=1}^{n}x_{i}y_{i}}{n\sum \limits _{i=1}^{n}x_{i}^{2}-\left(\sum \limits _{i=1}^{n}x_{i}\right)^{2}}}={\bar {y}}-{\bar {x}}{\hat {\beta }}}$
${\displaystyle S=\sum \limits _{i=1}^{n}(y_{i}-{\hat {y}}_{i})^{2}=\sum \limits _{i=1}^{n}y_{i}^{2}-{\frac {n(\sum \limits _{i=1}^{n}x_{i}y_{i})^{2}+(\sum \limits _{i=1}^{n}y_{i})^{2}\sum \limits _{i=1}^{n}x_{i}^{2}-2\sum \limits _{i=1}^{n}x_{i}\sum \limits _{i=1}^{n}y_{i}\sum \limits _{i=1}^{n}x_{i}y_{i}}{n\sum \limits _{i=1}^{n}x_{i}^{2}-\left(\sum \limits _{i=1}^{n}x_{i}\right)^{2}}}}$
${\displaystyle {\hat {\sigma }}^{2}={\frac {S}{n-2}}.}$
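The closed-form expressions above translate directly into code. The following is a minimal sketch using the raw-sum forms of the estimators; the function name `simple_regression` and the data are hypothetical.

```python
def simple_regression(x, y):
    """Return (alpha_hat, beta_hat, S, sigma2_hat) for y = alpha + beta * x + error."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    denom = n * sxx - sx ** 2  # zero only if all x values are identical
    beta = (n * sxy - sx * sy) / denom
    alpha = (sxx * sy - sx * sxy) / denom
    # Residual sum of squares, computed from the fitted values.
    S = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = S / (n - 2)
    return alpha, beta, S, sigma2


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat, S, sigma2_hat = simple_regression(x, y)
print(alpha_hat, beta_hat, S, sigma2_hat)
```

The alternative form ${\displaystyle {\hat {\alpha }}={\bar {y}}-{\bar {x}}{\hat {\beta }}}$ gives the same intercept and is usually less prone to cancellation error than the raw-sum form when the x values are large.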

${\displaystyle (X^{T}X)^{-1}={\frac {1}{n\sum _{i=1}^{n}x_{i}^{2}-\left(\sum _{i=1}^{n}x_{i}\right)^{2}}}{\begin{pmatrix}\sum x_{i}^{2}&-\sum x_{i}\\-\sum x_{i}&n\end{pmatrix}}}$

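Confidence interval for the mean response at ${\displaystyle x_{d}}$ (the α in the subscript of t is the significance level, not the intercept):

${\displaystyle y_{d}=({\hat {\alpha }}+{\hat {\beta }}x_{d})\pm t_{{\frac {\alpha }{2}},n-2}{\hat {\sigma }}{\sqrt {{\frac {1}{n}}+{\frac {(x_{d}-{\bar {x}})^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}}}}$

Prediction interval for a single new response at ${\displaystyle x_{d}}$:

${\displaystyle y_{d}=({\hat {\alpha }}+{\hat {\beta }}x_{d})\pm t_{{\frac {\alpha }{2}},n-2}{\hat {\sigma }}{\sqrt {1+{\frac {1}{n}}+{\frac {(x_{d}-{\bar {x}})^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}}}}$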
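The two interval formulas differ only by the extra 1 under the square root, which accounts for the noise in a single new observation. A minimal sketch follows, with a hypothetical function name and data. The t critical value is supplied by the caller because the Python standard library does not provide it; 3.182 is the usual table value for 95% coverage with 3 degrees of freedom.

```python
import math


def intervals_at(x, y, x_d, t_crit):
    """Return (center, half_width_mean, half_width_single) at the point x_d."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    alpha = y_bar - beta * x_bar
    S = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(S / (n - 2))
    center = alpha + beta * x_d
    # Term under the square root that both intervals share.
    h = 1.0 / n + (x_d - x_bar) ** 2 / sxx
    half_mean = t_crit * sigma * math.sqrt(h)          # mean response
    half_single = t_crit * sigma * math.sqrt(1.0 + h)  # single new response
    return center, half_mean, half_single


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

center, half_mean, half_single = intervals_at(x, y, x_d=3.0, t_crit=3.182)
print(f"mean response: {center:.3f} ± {half_mean:.3f}")
print(f"single response: {center:.3f} ± {half_single:.3f}")
```

Both intervals are narrowest at ${\displaystyle x_{d}={\bar {x}}}$ and widen as ${\displaystyle x_{d}}$ moves away from the sample mean.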

Analysis of variance

${\displaystyle {\text{SST}}=\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}$, where ${\displaystyle {\bar {y}}={\frac {1}{n}}\sum _{i}y_{i}}$

${\displaystyle {\text{SST}}=\sum _{i=1}^{n}y_{i}^{2}-{\frac {1}{n}}\left(\sum _{i}y_{i}\right)^{2}}$

${\displaystyle {\text{SSReg}}=\sum \left({\hat {y}}_{i}-{\bar {y}}\right)^{2}={\hat {\boldsymbol {\beta }}}^{T}\mathbf {X} ^{T}\mathbf {y} -{\frac {1}{n}}\left(\mathbf {y^{T}uu^{T}y} \right),}$

where ${\displaystyle \mathbf {u} }$ is the n × 1 vector of ones.

${\displaystyle {\text{SSE}}=\sum _{i}{\left({y_{i}-{\hat {y}}_{i}}\right)^{2}}=\mathbf {y^{T}y-{\hat {\boldsymbol {\beta }}}^{T}X^{T}y} .}$

${\displaystyle {\text{SST}}=\sum _{i}\left(y_{i}-{\bar {y}}\right)^{2}=\mathbf {y^{T}y} -{\frac {1}{n}}\left(\mathbf {y^{T}uu^{T}y} \right)={\text{SSReg}}+{\text{SSE}}.}$

${\displaystyle R^{2}={\frac {\text{SSReg}}{\text{SST}}}=1-{\frac {\text{SSE}}{\text{SST}}}.}$
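The decomposition SST = SSReg + SSE and the resulting ${\displaystyle R^{2}}$ can be checked numerically. The sketch below does this for a simple regression with an intercept; the function name `anova_table` and the data are hypothetical.

```python
def anova_table(x, y):
    """Return (SST, SSReg, SSE, R2) for a simple regression with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    alpha = y_bar - beta * x_bar
    fitted = [alpha + beta * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)        # explained by the fit
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # left in the residuals
    return sst, ss_reg, sse, ss_reg / sst


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ss_reg, sse, r2 = anova_table(x, y)
print(f"SST={sst:.3f}  SSReg={ss_reg:.3f}  SSE={sse:.3f}  R^2={r2:.4f}")
```

The identity SST = SSReg + SSE relies on the model containing a constant term; without one, the two expressions for ${\displaystyle R^{2}}$ generally disagree.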
