Spearman's rank correlation coefficient

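In statistics, Spearman's rank correlation coefficient (also called the rank correlation coefficient or Spearman's ρ), usually denoted by the Greek letter $\rho$ (rho) or by $r_{s}$, is named after Charles Spearman. It is a nonparametric measure of the correlation between two variables: it assesses how well the relationship between them can be described by a monotonic function. If the data contain no repeated values, the Spearman coefficient equals +1 or −1 exactly when each variable is a perfectly monotonic function of the other.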

Definition and calculation

The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables.^[1]

For a sample of size $n$, the $n$ raw observations $X_{i},Y_{i}$ are converted to ranks $\operatorname {R} ({X_{i}}),\operatorname {R} ({Y_{i}})$, and the correlation coefficient $r_{s}$ is computed as

r_{s}=\rho _{\operatorname {R} (X),\operatorname {R} (Y)}={\frac {\operatorname {cov} (\operatorname {R} (X),\operatorname {R} (Y))}{\sigma _{\operatorname {R} (X)}\sigma _{\operatorname {R} (Y)}}},

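where $\rho$ is the Pearson product-moment correlation coefficient, applied here to the rank variables, $\operatorname {cov} (\operatorname {R} (X),\operatorname {R} (Y))$ is the covariance of the rank variables, and $\sigma _{\operatorname {R} (X)}$ and $\sigma _{\operatorname {R} (Y)}$ are the standard deviations of the rank variables.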
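The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pearson` and the two rank vectors are hypothetical, and the population (1/n) forms of the covariance and standard deviation are used, which gives the same ratio as the sample forms.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation from the population covariance and standard deviations."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (sd_a * sd_b)  # undefined (division by zero) if either input is constant

# Hypothetical rank vectors R(X) and R(Y) for five observations.
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]
r_s = pearson(rank_x, rank_y)  # approximately 0.8
```

Because the inputs are already ranks, the value returned is the Spearman coefficient of the underlying data.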

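Identical values are usually each assigned a rank equal to the average of their positions in the ascending order of the values.^[2] For example:

Variable $X_{i}$	Position in ascending order (illustration only, not used)	Rank: average of tied positions (used)
0.8	1	1
1.2	2	${\frac {2+3}{2}}=2.5$
1.2	3	${\frac {2+3}{2}}=2.5$
2.3	4	4
18	5	5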
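The tie-handling rule can be sketched as follows. The helper name `average_ranks` and the sample values are assumptions made for illustration; ranks are 1-based and assigned in ascending order of value.

```python
def average_ranks(values):
    """Return 1-based ascending ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_position = (start + 1 + end + 1) / 2  # average of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = mean_position
        start = end + 1
    return ranks

sample = [30, 10, 20, 20]      # hypothetical data containing one tie
ranks = average_ranks(sample)  # [4.0, 1.0, 2.5, 2.5]
```

The two equal values occupy positions 2 and 3 in ascending order, so each receives the rank 2.5.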

When all $n$ ranks are distinct integers, the rank correlation coefficient can be computed with the following simpler formula:^[1]^[3]

r_{s}=1-{\frac {6\sum d_{i}^{2}}{n(n^{2}-1)}},

where $d_{i}=\operatorname {R} (X_{i})-\operatorname {R} (Y_{i})$ is the difference between the two ranks of each observation and $n$ is the number of observations.
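A minimal sketch of this shortcut formula, assuming the ranks are already available and contain no ties (the function name `spearman_simple` and the example ranks are hypothetical):

```python
def spearman_simple(rank_x, rank_y):
    """Shortcut formula 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("need two rank sequences of equal length >= 2")
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

same_order = spearman_simple([1, 2, 3, 4], [1, 2, 3, 4])      # 1.0
reversed_order = spearman_simple([1, 2, 3, 4], [4, 3, 2, 1])  # -1.0
```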

Proof

Consider a bivariate sample $(x_{i},y_{i}),\,i=1,\dots ,n$ with corresponding ranks $(R(X_{i}),R(Y_{i}))=(R_{i},S_{i})$. The Spearman rank correlation coefficient of $x,y$ is then

r_{s}={\frac {{\frac {1}{n}}\sum _{i=1}^{n}R_{i}S_{i}-{\overline {R}}\,{\overline {S}}}{\sigma _{R}\sigma _{S}}},

where ${\overline {R}}=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}R_{i}$, ${\overline {S}}=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}S_{i}$, $\sigma _{R}^{2}=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}(R_{i}-{\overline {R}})^{2}$, and $\sigma _{S}^{2}=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}(S_{i}-{\overline {S}})^{2}$.

If neither variable has repeated values in the sample, $r_{s}$ can be expressed purely in terms of $d_{i}:=R_{i}-S_{i}$.

Under this assumption, $R$ and $S$ can each be viewed as a random variable distributed like a discrete uniform random variable $U$ taking values in $\{1,2,\ldots ,n\}$.

Hence ${\overline {R}}={\overline {S}}=\mathbb {E} [U]$ and $\sigma _{R}^{2}=\sigma _{S}^{2}=\mathrm {Var} (U)=\mathbb {E} [U^{2}]-\mathbb {E} [U]^{2}$, where $\mathbb {E} [U]=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}i=\textstyle {\frac {(n+1)}{2}}$ and $\mathbb {E} [U^{2}]=\textstyle {\frac {1}{n}}\textstyle \sum _{i=1}^{n}i^{2}=\textstyle {\frac {(n+1)(2n+1)}{6}}$, so that $\mathrm {Var} (U)=\textstyle {\frac {(n+1)(2n+1)}{6}}-\left(\textstyle {\frac {(n+1)}{2}}\right)^{2}=\textstyle {\frac {n^{2}-1}{12}}$. (These sums can be evaluated with the formulas for the triangular numbers and the square pyramidal numbers, or with basic summation results from discrete mathematics.)
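The closed forms for the mean, second moment, and variance of $U$ can be checked numerically. The sketch below uses exact rational arithmetic; the helper name `uniform_moments` and the chosen values of $n$ are illustrative assumptions.

```python
from fractions import Fraction

def uniform_moments(n):
    """Exact E[U], E[U^2] and Var(U) for U uniform on {1, ..., n}."""
    values = range(1, n + 1)
    mean = Fraction(sum(values), n)
    mean_sq = Fraction(sum(v * v for v in values), n)
    return mean, mean_sq, mean_sq - mean ** 2

# Compare the brute-force sums with the closed forms used in the proof.
for n in (2, 5, 10, 37):
    mean, mean_sq, var = uniform_moments(n)
    assert mean == Fraction(n + 1, 2)
    assert mean_sq == Fraction((n + 1) * (2 * n + 1), 6)
    assert var == Fraction(n * n - 1, 12)
```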

Since

{\begin{aligned}{\frac {1}{n}}\sum _{i=1}^{n}R_{i}S_{i}-{\overline {R}}{\overline {S}}&={\frac {1}{n}}\sum _{i=1}^{n}{\frac {1}{2}}(R_{i}^{2}+S_{i}^{2}-d_{i}^{2})-{\overline {R}}^{2}\\&={\frac {1}{2}}{\frac {1}{n}}\sum _{i=1}^{n}R_{i}^{2}+{\frac {1}{2}}{\frac {1}{n}}\sum _{i=1}^{n}S_{i}^{2}-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}-{\overline {R}}^{2}\\&=({\frac {1}{n}}\sum _{i=1}^{n}R_{i}^{2}-{\overline {R}}^{2})-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}\\&=\sigma _{R}^{2}-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}\\&=\sigma _{R}\sigma _{S}-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}\\\end{aligned}}

it follows that

r_{s}={\frac {\sigma _{R}\sigma _{S}-{\frac {1}{2n}}\sum _{i=1}^{n}d_{i}^{2}}{\sigma _{R}\sigma _{S}}}=1-{\frac {\sum _{i=1}^{n}d_{i}^{2}}{2n\cdot {\frac {n^{2}-1}{12}}}}=1-{\frac {6\sum _{i=1}^{n}d_{i}^{2}}{n(n^{2}-1)}}.

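When the data contain equal values, the simplified formula gives an incorrect result: only when all values within each of the two variables are distinct does $\sigma _{\operatorname {R} (X)}\sigma _{\operatorname {R} (Y)}=\operatorname {Var} {(\operatorname {R} (X))}=\operatorname {Var} {(\operatorname {R} (Y))}=(n^{2}-1)/12$ hold (computed with the biased variance). The first equation, which normalizes by the standard deviations, can still be used even when the ranks are rescaled to [0, 1] ("relative ranks"), because it is insensitive to both translation and linear scaling.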
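A small numerical illustration of both points, using hypothetical ranks in which one variable has a tie. The function names are assumptions; `pearson` implements the first (normalized) equation and `spearman_simple` the shortcut.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation computed from centered sums."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sxx = sum((x - mean_a) ** 2 for x in a)
    syy = sum((y - mean_b) ** 2 for y in b)
    return sxy / sqrt(sxx * syy)

def spearman_simple(rank_x, rank_y):
    """Shortcut formula, which assumes no ties."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

rank_x = [1, 2.5, 2.5, 4]  # hypothetical ranks; the two middle values are tied
rank_y = [1, 2, 3, 4]

exact = pearson(rank_x, rank_y)             # about 0.9487
shortcut = spearman_simple(rank_x, rank_y)  # 0.95, which differs from the exact value

# Rescaling the ranks to relative ranks leaves the normalized form unchanged.
n = len(rank_x)
relative = pearson([r / n for r in rank_x], [r / n for r in rank_y])
```

The gap is small here, but it shows that the shortcut is not an identity once ties appear.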

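The simplified formula should also not be used for truncated data. That is, to compute the rank correlation coefficient for only the top X records, use the Pearson product-moment formula on the ranks given above.^[4]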

Interpretation

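Interpreting the sign of the Spearman correlation coefficient

A positive Spearman correlation coefficient corresponds to a monotonically increasing trend between $X$ and $Y$.

A negative Spearman correlation coefficient corresponds to a monotonically decreasing trend between $X$ and $Y$.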

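The sign of the Spearman correlation coefficient indicates the direction of association between $X$ (the independent variable) and $Y$ (the dependent variable). If $Y$ tends to increase when $X$ increases, the coefficient is positive. If $Y$ tends to decrease when $X$ increases, it is negative. A coefficient of 0 indicates that $Y$ shows no tendency either to increase or to decrease as $X$ increases. The coefficient grows in magnitude as $X$ and $Y$ come closer to being perfectly monotonic functions of each other, and its absolute value is 1 when they are perfectly monotonically related. A perfectly monotonically increasing relationship means that for any two pairs of data values $X_{i},Y_{i}$ and $X_{j},Y_{j}$, the differences $X_{i}-X_{j}$ and $Y_{i}-Y_{j}$ always have the same sign. A perfectly monotonically decreasing relationship means that these differences always have opposite signs.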
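The pairwise-sign characterization of a perfectly monotonic relationship translates directly into code. The function names and sample data below are hypothetical; a product of differences equal to zero (a tie in either variable) is treated as failing the strict condition.

```python
from itertools import combinations

def is_perfectly_increasing(xs, ys):
    """True if X_i - X_j and Y_i - Y_j have the same sign for every pair of observations."""
    return all((xi - xj) * (yi - yj) > 0
               for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2))

def is_perfectly_decreasing(xs, ys):
    """True if X_i - X_j and Y_i - Y_j have opposite signs for every pair of observations."""
    return all((xi - xj) * (yi - yj) < 0
               for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2))

increasing = is_perfectly_increasing([1, 2, 3], [10, 20, 25])  # True
decreasing = is_perfectly_decreasing([1, 2, 3], [9, 4, 1])     # True
mixed = is_perfectly_increasing([1, 2, 3], [10, 30, 20])       # False
```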

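The Spearman correlation coefficient is often described as "nonparametric", in two senses. First, a perfect Spearman correlation results whenever $X$ and $Y$ are related by any monotonic function. By contrast, the Pearson correlation coefficient gives a perfect value only when $X$ and $Y$ are related by a linear function. Second, its exact sampling distribution can be obtained without prior knowledge of (that is, without knowing the parameters of) the joint probability distribution of $X$ and $Y$.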
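The first sense can be illustrated with a monotonic but nonlinear relationship. In this sketch the data ($y=x^{3}$) and all function names are assumptions chosen for illustration; the data contain no ties, so the shortcut formula applies.

```python
from math import sqrt

def ranks_no_ties(values):
    """1-based ascending ranks for data without repeated values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def pearson(a, b):
    """Pearson correlation computed from centered sums."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sxx = sum((x - mean_a) ** 2 for x in a)
    syy = sum((y - mean_b) ** 2 for y in b)
    return sxy / sqrt(sxx * syy)

def spearman_no_ties(x, y):
    """Shortcut formula applied to the ranks of tie-free data."""
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]               # monotonic in x, but not linear

spearman_value = spearman_no_ties(x, y)  # 1.0
pearson_value = pearson(x, y)            # about 0.943, below 1
```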

Example

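In this example, the raw data in the table below are used to compute the correlation between a person's IQ and the number of hours spent watching television per week (the data are fictitious).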

IQ, $X_{i}$	Hours of TV per week, $Y_{i}$
106	7
86	0
100	27
101	50
99	28
103	29
97	20
113	12
112	6
110	17

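First, $d_{i}^{2}$ must be evaluated using the following steps, as shown in the table below.

Sort the data by the first column ($X_{i}$). Create a third column $x_{i}$ and assign it the ranks 1, 2, 3, ..., n.
Next, sort the data by the second column ($Y_{i}$). Create a fourth column $y_{i}$ and similarly assign it the ranks 1, 2, 3, ..., n.
Create a fifth column $d_{i}$ holding the difference between the two rank columns ($x_{i}$ and $y_{i}$).
Create a final column $d_{i}^{2}$ holding the square of $d_{i}$.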

IQ, $X_{i}$	Hours of TV per week, $Y_{i}$	Rank $x_{i}$	Rank $y_{i}$	$d_{i}$	$d_{i}^{2}$
86	0	1	1	0	0
97	20	2	6	−4	16
99	28	3	8	−5	25
100	27	4	7	−3	9
101	50	5	10	−5	25
103	29	6	9	−3	9
106	7	7	3	4	16
110	17	8	5	3	9
112	6	9	2	7	49
113	12	10	4	6	36

Summing the $d_{i}^{2}$ values gives $\sum d_{i}^{2}=194$. The sample size $n$ is 10. Substituting these values into the equation

\rho =1-{\frac {6\times 194}{10(10^{2}-1)}}

gives ρ = −0.175757575..., with p-value = 0.627188 (using the t-distribution).

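This value is close to zero, which shows that the relationship between the two variables is weak, even though the negative sign suggests that longer TV-watching time goes with lower IQ. When the original data contain tied values, this formula should not be used; instead, the Pearson correlation coefficient should be computed on the ranks (as described above).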
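The worked example can be reproduced with a short script. The data are the fictitious values from the table above; the variable and function names are assumptions, and the ranking helper relies on the fact that neither column contains ties.

```python
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv_hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]

def ranks_no_ties(values):
    """1-based ascending ranks for data without repeated values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

rank_iq = ranks_no_ties(iq)
rank_tv = ranks_no_ties(tv_hours)

# Rank differences and their squares, one per observation.
d_squared = [(a - b) ** 2 for a, b in zip(rank_iq, rank_tv)]
sum_d2 = sum(d_squared)

n = len(iq)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2)         # 194
print(round(rho, 6))  # -0.175758
```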

Determining significance

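One approach to testing whether an observed value of $\rho$ is significantly different from zero ($r$ always satisfies $1\geq r\geq -1$) is to compute the probability that it would be greater than or equal to the observed $r$ under the null hypothesis, using a permutation test. An advantage of this approach is that it automatically takes into account the number of tied data values in the sample and the way they are treated in computing the rank correlation.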
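A minimal sketch of an exact permutation test for very small samples follows. Several choices here are assumptions rather than part of the description above: the test is two-sided (it compares absolute values), it enumerates every permutation (feasible only for small $n$), and it uses the centered cross-product sum as the statistic, which orders permutations the same way as the correlation because the rank means and variances do not change when one rank vector is permuted.

```python
from itertools import permutations

def centered_cross_sum(a, b):
    """Sum of (a_i - mean(a)) * (b_i - mean(b)); proportional to the correlation."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))

def permutation_p_value(rank_x, rank_y):
    """Two-sided exact p-value: share of permutations at least as extreme as observed."""
    observed = abs(centered_cross_sum(rank_x, rank_y))
    hits = 0
    total = 0
    for shuffled in permutations(rank_y):
        total += 1
        if abs(centered_cross_sum(rank_x, shuffled)) >= observed:
            hits += 1
    return hits / total

# Hypothetical ranks in perfect agreement: only the identity and the full
# reversal are as extreme, so the p-value is 2/120.
p_value = permutation_p_value([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
```

For larger samples the full enumeration becomes impractical, and a random sample of permutations is a common substitute.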

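Another approach parallels the use of the Fisher transformation for the Pearson product-moment correlation coefficient. That is, confidence intervals and hypothesis tests relating to the population value $\rho$ can be carried out using the Fisher transformation: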

F(r)={1 \over 2}\ln {1+r \over 1-r}=\operatorname {arctanh} (r).

If $F(r)$ is the Fisher transformation of $r$, then

z={\sqrt {\frac {n-3}{1.06}}}F(r)

is a z-score for $r$, which approximately follows a standard normal distribution under the null hypothesis of statistical independence ($\rho =0$).^[7]^[8]
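The z-score can be computed as sketched below, using the values from the example above ($r\approx -0.1758$, $n=10$). The function names are assumptions, and the two-sided p-value from the normal approximation is an added illustration that uses the standard identity $p=\operatorname {erfc} (|z|/{\sqrt {2}})$.

```python
from math import atanh, erfc, sqrt

def fisher_z_score(r, n):
    """z = sqrt((n - 3) / 1.06) * arctanh(r), as in the formula above."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return sqrt((n - 3) / 1.06) * atanh(r)

def two_sided_normal_p(z):
    """Two-sided tail probability of a standard normal variable."""
    return erfc(abs(z) / sqrt(2))

r = 1 - 6 * 194 / (10 * (10 ** 2 - 1))  # value from the IQ / TV example
z = fisher_z_score(r, 10)               # about -0.456
p = two_sided_normal_p(z)               # normal-approximation p-value
```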

Significance can also be tested using

t=r{\sqrt {\frac {n-2}{1-r^{2}}}}

which under the null hypothesis is distributed approximately as Student's t-distribution with $n-2$ degrees of freedom.^[9] A justification for this result relies on a permutation argument.^[10]
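A short sketch of the t statistic, again using the values from the example above. The function name is an assumption. Converting $t$ to a p-value requires the cumulative distribution function of Student's t-distribution, which the Python standard library does not provide, so that step is left out here.

```python
from math import sqrt

def spearman_t_statistic(r, n):
    """Return (t, degrees of freedom) with t = r * sqrt((n - 2) / (1 - r^2))."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * sqrt((n - 2) / (1 - r * r)), n - 2

r = 1 - 6 * 194 / (10 * (10 ** 2 - 1))  # value from the IQ / TV example
t, dof = spearman_t_statistic(r, 10)    # t is about -0.505 with 8 degrees of freedom
```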

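A generalization of the Spearman coefficient is useful when there are three or more conditions, a number of subjects are all observed in each of them, and the observations are predicted to have a particular order. For example, a number of subjects might each be given several trials at the same task, with the prediction that performance will improve from trial to trial. A test of the significance of the trend between conditions in this situation was developed by E. B. Page^[11] and is usually referred to as Page's trend test for ordered alternatives.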

Correspondence analysis based on Spearman's ρ

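Classic correspondence analysis is a statistical method that assigns a score to every value of two nominal variables. In this way the Pearson correlation coefficient between them is maximized.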

There exists an equivalent of this method, called grade correspondence analysis, which maximizes Spearman's ρ or Kendall's τ.^[12]

See also

Kendall rank correlation coefficient
Rank correlation
Chebyshev's sum inequality, rearrangement inequality
Pearson product-moment correlation coefficient
Graphical model
Markov chain
Markov logic network

References

[1] Myers, Jerome L.; Well, Arnold D., Research Design and Statistical Analysis 2nd, Lawrence Erlbaum: 508, 2003, ISBN 0-8058-4037-0
[2] Dodge, Yadolah. The Concise Encyclopedia of Statistics. Springer-Verlag New York. 2010: 502. ISBN 978-0-387-31742-7.
[3] Maritz, J.S. (1981) Distribution-Free Statistical Methods, Chapman & Hall. ISBN 0-412-15940-6. (page 217)
[4] Al Jaber, Ahmed Odeh; Elayyan, Haifaa Omar. Toward Quality Assurance and Excellence in Higher Education. River Publishers. 2018: 284. ISBN 978-87-93609-54-9.
[5] Yule, G.U. and Kendall, M.G. (1950), "An Introduction to the Theory of Statistics", 14th Edition (5th Impression 1968). Charles Griffin & Co. page 268
[6] Piantadosi, J.; Howlett, P.; Boland, J. (2007) "Matching the grade correlation coefficient using a copula with maximum disorder", Journal of Industrial and Management Optimization, 3 (2), 305–312
[7] Choi, S.C. (1977) Test of equality of dependent correlations. Biometrika, 64 (3), pp. 645–647
[8] Fieller, E.C.; Hartley, H.O.; Pearson, E.S. (1957) Tests for rank correlation coefficients. I. Biometrika 44, pp. 470–481
[9] Press, Vetterling, Teukolsky, and Flannery (1992) Numerical Recipes in C: The Art of Scientific Computing, 2nd Edition, page 640
[10] Kendall, M.G., Stuart, A. (1973) The Advanced Theory of Statistics, Volume 2: Inference and Relationship, Griffin. ISBN 0-85264-215-6 (Sections 31.19, 31.21)
[11] Page, E. B. Ordered hypotheses for multiple treatments: A significance test for linear ranks. Journal of the American Statistical Association. 1963, 58 (301): 216–230. doi:10.2307/2282965.
[12] Kowalczyk, T.; Pleszczyńska, E.; Ruland, F. (eds.). Grade Models and Methods for Data Analysis with Applications for the Analysis of Data Populations. Studies in Fuzziness and Soft Computing vol. 151. Berlin Heidelberg New York: Springer Verlag. 2004. ISBN 978-3-540-21120-4.

G.W. Corder, D.I. Foreman, "Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach", Wiley (2009)
C. Spearman, "The proof and measurement of association between two things" Amer. J. Psychol., 15 (1904) pp. 72–101
M.G. Kendall, "Rank correlation methods", Griffin (1962)
M. Hollander, D.A. Wolfe, "Nonparametric statistical methods", Wiley (1973)
J. C. Caruso, N. Cliff, "Empirical Size, Coverage, and Power of Confidence Intervals for Spearman's Rho", Ed. and Psy. Meas., 57 (1997) pp. 637–654

External links

"Understanding Correlation vs. Copulas in Excel" by Eric Torkia, Technology Partnerz 2011
Table of critical values of ρ for significance with small samples
A calculator that shows the working out for Spearman's correlation
Spearman's rank online calculator
Chapter 3 part 1 shows the formula to be used when there are ties
Spearman's rank correlation: Simple notes for students with an example of usage by biologists and a spreadsheet for Microsoft Excel for calculating it (a part of materials for a Research Methods in Biology course).
