Dice coefficient

The Dice coefficient, also known as the Sørensen–Dice coefficient, is named after Thorvald Sørensen and Lee Raymond Dice.^[1] It is a set similarity measure, commonly used to compute the similarity of two samples:

s={\frac {2|X\cap Y|}{|X|+|Y|}}
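As an illustration of the set formula, the following is a minimal Python sketch. The function name `dice_coefficient` and the choice to return 1.0 for two empty sets (where the formula is 0/0) are assumptions made for this example, not part of the definition.

```python
def dice_coefficient(x: set, y: set) -> float:
    """Sørensen-Dice coefficient 2|X ∩ Y| / (|X| + |Y|) of two finite sets."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are treated as identical,
        # since the formula itself is undefined (0/0) here.
        return 1.0
    return 2 * len(x & y) / total


# Two sets of size 3 sharing 2 elements: 2 * 2 / (3 + 3), i.e. two thirds.
print(dice_coefficient({"a", "b", "c"}, {"b", "c", "d"}))
```

Disjoint sets give 0 and identical non-empty sets give 1, which matches the stated range of the coefficient.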

In form it differs little from the Jaccard index, but it has some different properties.

Like the Jaccard index, it ranges from 0 to 1. Unlike Jaccard, the corresponding difference function

d=1-{\frac {2|X\cap Y|}{|X|+|Y|}}

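is not a proper distance metric, because it does not satisfy the triangle inequality. For example, given {a}, {b}, and {a,b}, the distance between the first two sets is 1, while the distance between the third set and either of the other two is one third. Going from {a} to {b} through {a,b} therefore costs 1/3 + 1/3 = 2/3, which is less than the direct distance of 1.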
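The counterexample can be checked numerically. The sketch below uses exact fractions to avoid floating-point rounding; the helper name `dice_distance` is an assumption for this example, and it only handles non-empty sets.

```python
from fractions import Fraction


def dice_distance(x: set, y: set) -> Fraction:
    """d = 1 - 2|X ∩ Y| / (|X| + |Y|), for non-empty sets."""
    return 1 - Fraction(2 * len(x & y), len(x) + len(y))


a, b, ab = {"a"}, {"b"}, {"a", "b"}

direct = dice_distance(a, b)                          # distance from {a} to {b}
detour = dice_distance(a, ab) + dice_distance(ab, b)  # via {a, b}

# A metric would require direct <= detour; here that fails.
print(direct, detour, direct <= detour)  # prints: 1 2/3 False
```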

As with Jaccard, the set operations can be expressed as operations on two vectors A and B:

$s_{v}={\frac {2|A\cdot B|}{|A|^{2}+|B|^{2}}}$

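This gives the same result as the set formula when A and B are binary vectors, and it also provides a more general similarity measure between vectors. The Dice coefficient can also be used to compute the similarity of two strings: Dice(s1, s2) = 2 * comm(s1, s2) / (leng(s1) + leng(s2)), where comm(s1, s2) is the number of characters that s1 and s2 have in common, and leng(s1) and leng(s2) are the lengths of the strings s1 and s2.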
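For the vector form $s_v$, a minimal Python sketch might look as follows. The function name `dice_vector`, the length check, and the value returned for two all-zero vectors are assumptions made for this example. The binary vectors are indicator vectors of the sets {a, b, c} and {b, c, d} over the elements (a, b, c, d).

```python
from typing import Sequence


def dice_vector(a: Sequence[float], b: Sequence[float]) -> float:
    """s_v = 2|A · B| / (|A|^2 + |B|^2) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(a, b))
    norm_sq = sum(p * p for p in a) + sum(q * q for q in b)
    if norm_sq == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    return 2 * abs(dot) / norm_sq


x = [1, 1, 1, 0]  # indicator vector of {a, b, c}
y = [0, 1, 1, 1]  # indicator vector of {b, c, d}

# For binary vectors the dot product counts shared elements and each
# squared norm counts set size, so this is 2 * 2 / (3 + 3).
print(dice_vector(x, y))
```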

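In information retrieval, given keyword sets X and Y, the similarity is defined as twice the shared information (the overlap) divided by the sum of the cardinalities, that is, 2|X ∩ Y| / (|X| + |Y|).^[2]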

When used as a similarity measure between strings, the coefficient for two strings x and y can be calculated using bigrams as follows:^[3]

s={\frac {2n_{t}}{n_{x}+n_{y}}}

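where n_t is the number of bigrams shared by the two strings, n_x is the number of bigrams in x, and n_y is the number of bigrams in y. For example, to calculate the similarity between the following two strings: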

night

nacht

we first find the set of bigrams in each word:

{ni,ig,gh,ht}

{na,ac,ch,ht}

Each set has four elements, and the two sets have only one element in common: ht.

Substituting into the formula gives s = (2 · 1) / (4 + 4) = 0.25.
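The worked example can be reproduced with a short Python sketch. The helper names `bigrams` and `dice_bigram` are assumptions for this example. It treats the bigrams of each string as a set, as in the example, so a bigram that occurs several times in one string is counted once; the handling of strings shorter than two characters is also an assumption.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_bigram(x: str, y: str) -> float:
    """s = 2 * n_t / (n_x + n_y), computed over bigram sets."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Assumption: with no bigrams at all, fall back to exact equality.
        return 1.0 if x == y else 0.0
    return 2 * len(bx & by) / total


print(sorted(bigrams("night")))       # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))       # ['ac', 'ch', 'ht', 'na']
print(dice_bigram("night", "nacht"))  # 0.25
```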

See also

Jaccard index, which is equivalent: $D=2J/(1+J)$ and $J=D/(2-D)$
Tversky index
Levenshtein distance
Sørensen similarity index

References

[1] Dice, Lee R. Measures of the Amount of Ecologic Association Between Species. Ecology. 1945, 26 (3): 297–302. JSTOR 1932409. doi:10.2307/1932409.
[2] van Rijsbergen, Cornelis Joost. Information Retrieval. London: Butterworths. 1979 [2012-05-26]. ISBN 3-642-12274-4. (Archived from the original on 2005-04-06).
[3] Kondrak, Grzegorz; Marcu, Daniel; and Knight, Kevin. Cognates Can Improve Statistical Translation Models (PDF). Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: 46–48. 2003 [2012-05-26]. (Archived (PDF) from the original on 2016-03-04).
