斯皮尔曼等级相关系数

斯皮尔曼等级相关系数（简称等级相关系数，或称秩相关系数，英语：Spearman's rank correlation coefficient或Spearman's ρ），在统计学中，常以希腊字母 $\rho$ （rho）或以 $r_{s}$ 表示，这一相关系数以查尔斯·斯皮尔曼（英语：Charles Spearman）之名命名。它是衡量两个变量的相关性的非参数指标。它利用单调函数评价两个统计变量的相关性。若数据中没有重复值，且当两变量完全单调相关时，斯皮尔曼相关系数为+1或−1。

定义和计算

斯皮尔曼相关系数的定义为等级变量之间的皮尔逊相关系数。^[1]

对于样本容量为 $n$ 的样本，将 $n$ 个原始数据 $X_{i},Y_{i}$ 转换成等级数据 $\operatorname {R} ({X_{i}}),\operatorname {R} ({Y_{i}})$ ，则相关系数 $r_{s}$ 为

r_{s}=\rho _{\operatorname {R} (X),\operatorname {R} (Y)}={\frac {\operatorname {cov} (\operatorname {R} (X),\operatorname {R} (Y))}{\sigma _{\operatorname {R} (X)}\sigma _{\operatorname {R} (Y)}}},

其中

\rho

是皮尔逊积矩相关系数，但使用等级变量来计算，

\operatorname {cov} (\operatorname {R} (X),\operatorname {R} (Y))

为等级变量的协方差，

\sigma _{\operatorname {R} (X)}

和

\sigma _{\operatorname {R} (Y)}

为等级变量的标准差。

通常，对于数据中相同的值，其等级数等于它们按值升序排列的所处位置的平均值。^[2]如下表所示：

变量 $X_{i}$	升序位置（仅示意，不使用）	升序位置的平均等级数（使用）
18	1	1
2.3	2	2
1.2	3	${\frac {4+3}{2}}=3.5\$
1.2	4	${\frac {4+3}{2}}=3.5\$
0.8	5	5

当所有的等级数值都为整数时，可以通过以下简单的步骤计算等级相关系数：^[1]^[3]

r_{s}=1-{\frac {6\sum d_{i}^{2}}{n(n^{2}-1)}},

其中

d_{i}=\operatorname {R} (X_{i})-\operatorname {R} (Y_{i})

为每组观测中两个变量的等级差值，

n为观测数。

证明

考虑一个双变量样本 $(x_{i},y_{i}),\,i=1\dots ,n$ ，其相应的位次为 $(R(X_{i}),R(Y_{i}))=(R_{i},S_{i})$ 。则 $x,y$ 的斯皮尔曼等级相关系数为：

r_{s}={\frac {{\frac {1}{n}}\sum _{i=1}^{n}R_{i}S_{i}-{\overline {R}}\,{\overline {S}}}{{\sqrt {\sigma _{R}}}{\sqrt {\sigma _{S}}}}},

其中： ${\overline {R}}=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}R_{i}$ ， ${\overline {S}}=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}S_{i}$ ， $\sigma _{R}^{2}=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}(R_{i}-{\overline {R}})^{2}$ ， $\sigma _{S}^{2}=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}(S_{i}-{\overline {S}})^{2}$ ，

若假定样本中两变量均没有重复数值，则 $r_{s}$ 可只用 $d_{i}:=R_{i}-S_{i}$ 来给出。

在此假定下， $R,S$ 可视为随机变量，其分布类似于均匀分布随机变量， $U$ ，其自变量取值为 $\{1,2,\ldots ,n\}$ 。

因此 ${\overline {R}}={\overline {S}}=\mathbb {E} [U]$ 且 $\sigma _{R}^{2}=\sigma _{S}^{2}=\mathrm {Var} (U)=\mathbb {E} [U^{2}]-\mathbb {E} [U]^{2}$ ，其中 $\mathbb {E} [U]=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}i=\textstyle {\frac {(n+1)}{2}}$ ， $\mathbb {E} [U^{2}]=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}i^{2}=\textstyle {\frac {(n+1)(2n+1)}{6}}$ ，故有 $\mathrm {Var} (U)=\textstyle {\frac {(n+1)(2n+1)}{6}}-\left(\textstyle {\frac {(n+1)}{2}}\right)^{2}=\textstyle {\frac {n^{2}-1}{12}}$ 。（这些求和可以用三角形数和四角锥数的公式来计算，也可以用离散数学的基本求和结果来计算。）

既然

{\begin{aligned}{\frac {1}{n}}\sum _{i=1}^{n}R_{i}S_{i}-{\overline {R}}{\overline {S}}&={\frac {1}{n}}\sum _{i=1}^{n}{\frac {1}{2}}(R_{i}^{2}+S_{i}^{2}-d_{i}^{2})-{\overline {R}}^{2}\\&={\frac {1}{2}}{\frac {1}{n}}\sum _{i=1}^{n}R_{i}^{2}+{\frac {1}{2}}{\frac {1}{n}}\sum _{i=1}^{n}S_{i}^{2}-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}-{\overline {R}}^{2}\\&=({\frac {1}{n}}\sum _{i=1}^{n}R_{i}^{2}-{\overline {R}}^{2})-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}\\&=\sigma _{R}^{2}-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}\\&=\sigma _{R}\sigma _{S}-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}\\\end{aligned}}

则综上可得

r_{s}={\frac {\sigma _{R}\sigma _{S}-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}}{\sigma _{R}\sigma _{S}}}=1-{\frac {\sum _{i=1}^{n}d_{i}^{2}}{2n\cdot {\frac {n^{2}-1}{12}}}}=1-{\frac {6\sum _{i=1}^{n}d_{i}^{2}}{n(n^{2}-1)}}.

当数据中存在相等的数值时，使用该简化公式会得到错误结果：只有在两组变量中所有数值不重复时，才有 $\sigma _{\operatorname {R} (X)}\sigma _{\operatorname {R} (Y)}=\operatorname {Var} {(\operatorname {R} (X))}=\operatorname {Var} {(\operatorname {R} (Y))}=(n^{2}-1)/12$ （根据有偏方差计算）。第一个方程（通过标准差进行归一化）即使在排名标准化为[0, 1]（“相对排名”）的情况下仍可使用，因为它对平移和线性缩放都不敏感。

对于截取的数据也不应使用简化公式。即，当希望计算前X条记录的等级相关系数时，应当使用前述的皮尔逊积矩相关系数公式。^[4]

解释

斯皮尔曼相关系数的正负性的解读

正的斯皮尔曼相关系数反映两个变量

X

和

Y

之间单调递增的趋势。

负的斯皮尔曼相关系数反映两个变量

X

和

Y

之间单调递减的趋势。

斯皮尔曼相关系数表明 $X$ （自变量）和 $Y$ （因变量）的相关方向。如果当 $X$ 增加时， $Y$ 趋向于增加，则斯皮尔曼相关系数为正。如果当 $X$ 增加时， $Y$ 趋向于减少，则斯皮尔曼相关系数为负。斯皮尔曼相关系数为0表明当 $X$ 增加时 $Y$ 没有任何趋向性。当 $X$ 和 $Y$ 越来越接近完全的单调相关时，斯皮尔曼相关系数会在绝对值上增加。当 $X$ 和 $Y$ 完全单调相关时，斯皮尔曼相关系数的绝对值为1。完全的单调递增关系意味着对任意两对数据 $X i, Y i$ 和 $X j, Y j$ ，有 $X i - X j$ 和 $Y i - Y j$ 总是同号。完全的单调递减关系意味着对任意两对数据 $X i, Y i$ 和 $X j, Y j$ ，有 $X i - X j$ 和 $Y i - Y j$ 总是异号。

斯皮尔曼相关系数经常被称作“非参数”的，其中有两层含义。首先，当 $X$ 和 $Y$ 的关系由任意单调函数描述时，则它们是完全皮尔逊相关的。与此相应的，皮尔逊相关系数只能给出由线性方程描述的 $X$ 和 $Y$ 的相关性。其次，斯皮尔曼不需要先验知识（也就是说，知道其参数）便可以准确获取 $X$ 和 $Y$ 的采样概率分布。

示例

在此例中，我们要使用下表所给出的原始数据计算一个人的智商和其每周看电视的小时数的相关性（数据为虚构）。

智商, $X_{i}$	每周看电视小时数, $Y_{i}$
106	7
86	0
100	27
101	50
99	28
103	29
97	20
113	12
112	6
110	17

首先，我们必须根据以下步骤计算出 $d_{i}^{2}$ ，如下表所示。

排列第一列数据（ $X_{i}$ ）。创建新列 $x_{i}$ 并赋以等级值1、2、3……n。
然后，排列第二列数据（ $Y_{i}$ ）。创建第四列 $y_{i}$ 并相似地赋以等级值1、2、3……n。
创建第五列 $d_{i}$ ，填入两个等级列（ $x_{i}$ 和 $y_{i}$ ）的差值。
创建最后一列 $d_{i}^{2}$ 填入 $d_{i}$ 的平方。

智商, $X_{i}$	每周看电视小时数, $Y_{i}$	$x_{i}$ 的排名	$y_{i}$ 的排名	$d_{i}$	$d_{i}^{2}$
86	0	1	1	0	0
97	20	2	6	−4	16
99	28	3	8	−5	25
100	27	4	7	−3	9
101	50	5	10	−5	25
103	29	6	9	−3	9
106	7	7	3	4	16
110	17	8	5	3	9
112	6	9	2	7	49
113	12	10	4	6	36

根据 $d_{i}^{2}$ 计算 $\sum d_{i}^{2}=194$ 。样本容量 $n$ 为10。将这些值带入方程

\rho =1-{\frac {6\times 194}{10(10^{2}-1)}}

得ρ = −0.175757575...，p-value = 0.627188（使用t分布）

该数值接近0，表明尽管看电视时间和智商似乎呈负相关，但两个变量之间的关系很弱。在原始数据中存在相同数值的情况下，不应使用此公式，而应当用排名计算皮尔逊相关系数（如上文所述）。

显著性的确定

一种确定被观测数据的 $ρ$ 值是否显著不为零（ $r$ 总是有 $1 \geq r \geq -1$ ）的方法是计算它是否大于 $r$ 的概率，作为零假设，并使用排列检验。这种方法的优势在于它考虑了样本中的重复出现的数据个数，以及在计算等级相关性时处理它们的方式。

另一种方法是使用皮尔逊积矩中使用到的费雪变换。也就是， $ρ$ 的置信区间和假设检验可以通过费雪变换获得

F(r)={1 \over 2}\ln {1+r \over 1-r}=\operatorname {arctanh} (r).

如果 $F (r)$ 是 $r$ 的费雪变换，则

z={\sqrt {\frac {n-3}{1.06}}}F(r)

是 $r$ 的z-值，其中， $r$ 在统计独立性（ $ρ = 0$ ）^[7]^[8]的零假设下近似服从标准正态分布。

显著性为

t=r{\sqrt {\frac {n-2}{1-r^{2}}}}

其在零假设下近似服从自由度为 $n - 2$ 的t分布。^[9] A justification for this result relies on a permutation argument.^[10]

一般地，斯皮尔曼相关系数在有三个或更多条件的情况下是有用的。并且，它预测观测数据有一个特定的顺序。例如，在同一任务中，一系列的个体会被尝试多次，并预测在多次尝试过程中，性能会得到提升。在这种情况下，对条件间趋势的显著性检验由E. B. Page^[11]发展了，并通常称为给定序列下的Page趋势检验。

基于斯皮尔曼相关系数的一致性分析

经典的一致性分析（英语：Correspondence analysis）是一种统计方法，它给两个标称变量赋给一个分数。通过这种方法，两个变量间的皮尔逊相关系数被最大化了。

有一种被称为级别相关分析的等价方法，它能够最大化斯皮尔曼相关系数或肯德尔等级相关系数（英语：Kendall rank correlation coefficient）。^[12]

参见

肯德尔等级相关系数（英语：Kendall rank correlation coefficient）
等级相关（英语：Rank correlation）
切比雪夫总和不等式、排序不等式
皮尔逊积矩相关系数
图模式
马尔可夫链
马尔可夫逻辑网络

参考文献

^ ^1.0 ^1.1 Myers, Jerome L.; Well, Arnold D., Research Design and Statistical Analysis 2nd, Lawrence Erlbaum: 508, 2003, ISBN 0-8058-4037-0
^ Dodge, Yadolah. The Concise Encyclopedia of Statistics. Springer-Verlag New York. 2010: 502. ISBN 978-0-387-31742-7.
^ Maritz. J.S. (1981) Distribution-Free Statistical Methods, Chapman & Hall. ISBN 0-412-15940-6. (page 217)
^ Al Jaber, Ahmed Odeh; Elayyan, Haifaa Omar. Toward Quality Assurance and Excellence in Higher Education. River Publishers. 2018: 284. ISBN 978-87-93609-54-9.
^ Yule, G.U and Kendall, M.G. (1950), "An Introduction to the Theory of Statistics", 14th Edition (5th Impression 1968). Charles Griffin & Co. page 268
^ Piantadosi, J.; Howlett, P.; Boland, J. (2007) "Matching the grade correlation coefficient using a copula with maximum disorder", Journal of Industrial and Management Optimization, 3 (2), 305–312
^ Choi, S.C. (1977) Test of equality of dependent correlations. Biometrika, 64 (3), pp. 645–647
^ Fieller, E.C.; Hartley, H.O.; Pearson, E.S. (1957) Tests for rank correlation coefficients. I. Biometrika 44, pp. 470–481
^ Press, Vettering, Teukolsky, and Flannery (1992) Numerical Recipes in C: The Art of Scientific Computing, 2nd Edition, page 640
^ Kendall, M.G., Stuart, A. (1973)The Advanced Theory of Statistics, Volume 2: Inference and Relationship, Griffin. ISBN 0-85264-215-6 (Sections 31.19, 31.21)
^ Page, E. B. Ordered hypotheses for multiple treatments: A significance test for linear ranks. Journal of the American Statistical Association. 1963, 58 (301): 216–230. doi:10.2307/2282965.
^ Kowalczyk, T.; Pleszczyńska E. , Ruland F. (eds.). Grade Models and Methods for Data Analysis with Applications for the Analysis of Data Populations. Studies in Fuzziness and Soft Computing vol. 151. Berlin Heidelberg New York: Springer Verlag. 2004. ISBN 978-3-540-21120-4.

G.W. Corder, D.I. Foreman, "Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach", Wiley (2009)
C. Spearman, "The proof and measurement of association between two things" Amer. J. Psychol., 15 (1904) pp. 72–101
M.G. Kendall, "Rank correlation methods", Griffin (1962)
M. Hollander, D.A. Wolfe, "Nonparametric statistical methods", Wiley (1973)
J. C. Caruso, N. Cliff, "Empirical Size, Coverage, and Power of Confidence Intervals for Spearman's Rho", Ed. and Psy. Meas., 57 (1997) pp. 637–654

外部链接

"Understanding Correlation vs. Copulas in Excel" （页面存档备份，存于互联网档案馆） by Eric Torkia, Technology Partnerz 2011
Table of critical values of ρ for significance with small samples （页面存档备份，存于互联网档案馆）
A calculator that shows the working out for Spearman's correlation （页面存档备份，存于互联网档案馆）
Spearman's rank online calculator （页面存档备份，存于互联网档案馆）
Chapter 3 part 1 shows the formula to be used when there are ties
Spearman's rank correlation （页面存档备份，存于互联网档案馆）: Simple notes for students with an example of usage by biologists and a spreadsheet for Microsoft Excel for calculating it (a part of materials for a Research Methods in Biology course).

[myers2003-1] 1.0 ^1.1 Myers, Jerome L.; Well, Arnold D., Research Design and Statistical Analysis 2nd, Lawrence Erlbaum: 508, 2003, ISBN 0-8058-4037-0

[2] Dodge, Yadolah. The Concise Encyclopedia of Statistics. Springer-Verlag New York. 2010: 502. ISBN 978-0-387-31742-7.

[3] Maritz. J.S. (1981) Distribution-Free Statistical Methods, Chapman & Hall. ISBN 0-412-15940-6. (page 217)

[Jaber-4] Al Jaber, Ahmed Odeh; Elayyan, Haifaa Omar. Toward Quality Assurance and Excellence in Higher Education. River Publishers. 2018: 284. ISBN 978-87-93609-54-9.

[Yule_and_Kendall-5] Yule, G.U and Kendall, M.G. (1950), "An Introduction to the Theory of Statistics", 14th Edition (5th Impression 1968). Charles Griffin & Co. page 268

[6] Piantadosi, J.; Howlett, P.; Boland, J. (2007) "Matching the grade correlation coefficient using a copula with maximum disorder", Journal of Industrial and Management Optimization, 3 (2), 305–312

[7] Choi, S.C. (1977) Test of equality of dependent correlations. Biometrika, 64 (3), pp. 645–647

[8] Fieller, E.C.; Hartley, H.O.; Pearson, E.S. (1957) Tests for rank correlation coefficients. I. Biometrika 44, pp. 470–481

[9] Press, Vettering, Teukolsky, and Flannery (1992) Numerical Recipes in C: The Art of Scientific Computing, 2nd Edition, page 640

[10] Kendall, M.G., Stuart, A. (1973)The Advanced Theory of Statistics, Volume 2: Inference and Relationship, Griffin. ISBN 0-85264-215-6 (Sections 31.19, 31.21)

[11] Page, E. B. Ordered hypotheses for multiple treatments: A significance test for linear ranks. Journal of the American Statistical Association. 1963, 58 (301): 216–230. doi:10.2307/2282965.

[12] Kowalczyk, T.; Pleszczyńska E. , Ruland F. (eds.). Grade Models and Methods for Data Analysis with Applications for the Analysis of Data Populations. Studies in Fuzziness and Soft Computing vol. 151. Berlin Heidelberg New York: Springer Verlag. 2004. ISBN 978-3-540-21120-4.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

定义和计算

相关度量

解释

示例

显著性的确定

基于斯皮尔曼相关系数的一致性分析

参见

参考文献

外部链接