最小哈希

在計算機科學領域，最小哈希（或最小哈希式獨立排列局部性敏感哈希（英語：locality sensitive hashing））方法是一種快速判斷兩個集合是否相似的技術。這種方法是由Andrei Broder （1997）,^[1]發明的，最初在AltaVista搜索引擎中用於在搜索結果中檢測並消除重複Web頁面。^[2]

它同樣也應用於大規模聚類問題，比如通過文檔間包含的詞語相似性進行聚類。^[1]

雅可比相似度與最小哈希值[編輯]

雅可比相似度係數通常用來表示兩個集合的相似度，定義 $U$ 是一個集合， $A$ 和 $B$ 是 $U$ 的子集。雅可比相似度定義如下：^[3]

J(A,B)={{|A\cap B|} \over {|A\cup B|}}.

它是一個0到1之間的數值上，當其為0時表示兩個集合不相交，當其為1時表示兩個集合相等，其他的情況則在0和1之間。它廣泛地用於兩集合間相似性的判斷：當雅可比係數趨向於1時，兩個集合更相似；反之，當雅可比係數趨向於0時，兩個集合更不相似。

定義h是一個將 $U$ 中的元素映射到一些不相交整數的哈希函數， $perm$ 是集合 $U$ 中元素的排列排列，對於任意集合S，定義 $h min (S)$ 為S集合中具有最小h(x)函數值的元素x，對應 $h \circ perm$ （集合 $S$ 的元素 $x$ 是 $h (perm (x))$ 的最小值）。將 $h min$ 應用於集合 $A$ 和 $B$ ，假定沒有發生哈希碰撞。有 $h min (A) = h min (B)$ ，當且僅當最小哈希值的併集 $A \cup B$ 依賴於交集 $A \cap B$ 時。因此，

Pr[h min (A) = h min (B)] = J (A, B).

也就是說， $h min (A) = h min (B)$ 是真的概率等於 $J (A, B)$ 另一方面來說，如果 $r$ 是一個當 $h min (A) = h min (B)$ 時值為1，其它情況下值為0的隨機變量，那麼 $r$ 可認為是 $J (A, B)$ 的無偏估計。儘管 $r$ 的方差很高時，不能很好的估計雅可比相似度，因為 $r$ 總是0或1。最小哈希思想通過以相同方式構造的幾個變量，將其平均在一起來減少這種方差

算法[編輯]

多哈希函數的變種[編輯]

最簡單的最小哈希方法是使用 $k$ 個不同的哈希函數，其中 $k$ 是固定的整數參數，使用這 $k$ 個函數所對應的 $k$ 個 $h min (S)$ 值來描述每個集合 $S$ 。使用這種最簡單的版本來判斷 $J (A, B)$ ，假定y是使得 $h min (A) = h min (B)$ 的哈希函數個數，使用 $y / k$ 作為估計。則此估計是 $k$ 個不同的0-1隨機變量的平均值，其中每個隨機變量當 $h min (A) = h min (B)$ 值為1，反之為0，並且是 $J (A, B)$ 的無偏估計。因此，該平均值同樣也是一個無偏估計，而且通過0-1隨機變量之和的標準Chernoff上界（英語：Chernoff bound）可得知，其期望誤差是 $O(1/\sqrt k)$ 。所以，針對任意給定的常數 $ε > 0$ ，存在另一常數 $k = O(1/ε 2)$ ，其估計的期望誤差不超過 $ε$ 。例如，使用400個哈希函數值來估計 $J (A, B)$ ，其期望誤差將小於或等於.05。

單一哈希函數的變種[編輯]

計算多個哈希函數的代價是相當昂貴的，因此有關最小哈希方法的另一種實現方法是僅使用單一的哈希函數來避免這個問題。對於每個集合，使用這個單一的哈希函數選出其中的多個值，而不是每個哈希函數選擇一個值。假定 $h$ 是一個哈希函數， $k$ 是一個固定整數。如果 $S$ 是 $h$ 域上 $k$ 或更多元素的集合，則定義 $h (k) (S)$ 為 $S$ 中具有最小 $h$ 值的 $k$ 個元素所組成的子集。該子集 $h (k) (S)$ 可用作集合 $S$ 的一個簽名，任意兩個集合間的相似度可通過比較它們的簽名來計算。

特別地，假定A and B為任意兩個集合， $X = h (k) (h (k) (A) \cup h (k) (B)) = h (k) (A \cup B)$ 是 $A \cup B$ 的k個元素的集合，如果h是隨機變量並且k個元素的任意子集等可能地被選擇。也就是說， $X$ 是 $A \cup B$ 的簡單隨機樣本（英語：simple random sample）。 $Y = X \cap h (k) (A) \cap h (k) (B)$ 是集合 $X$ 中屬於 $A \cap B$ 交集的元素。因此，| $Y$ |/ $k$ 是 $J (A, B)$ 的無偏估計。單一哈希函數的估計與多個哈希函數產生的估計的不同在於 $X$ 總是有 $k$ 個元素，而多個哈希函數由於兩個不同的哈希函數可能會產生相同的最小值，因此可能會產生更少的樣本元素。然而，當 $k$ 相對集合大小來說很小時，該區別可忽略不計。

通過不重複取樣的標準Chernoff上界（英語：Chernoff bound），該估計的期望誤差為 $O(1/\sqrt k)$ ，其性能與多個哈希函數方法相匹配。

耗時分析[編輯]

| $Y$ |/ $k$ 估計通過給定集合的兩個簽名能夠在 $O(k)$ 能夠計算出來，因此，當 $ε$ and $k$ 為常數時，從簽名中計算相似度估計的時間也為常數，這樣當眾多兩兩相似度需要計算時，該方法在運行時間上與每個集合中元素的完全比較相比，能夠有實質性的優化。

最小哈希式獨立排列[編輯]

為了實現上述的最小哈希方法，哈希函數 $h$ 需要定義 $n$ 元素上的一個隨機排列，這裡的 $n$ 是指待比較的所有集合併集中不相交元素的總數。但是由於存在 $n!$ 個不同的排列，僅僅指定一個真正隨機的排列就需要 $Ω(n log n)$ 位，即使 $n$ 一般時，這個數值也很大。基於這樣的事實，與全域哈希（英語：universal hashing）相類似的理論，有大量的研究工作尋找「最小哈希式獨立的」一簇排列，意指針對域的任意子集，任何元素都與其最小值是等可能的。已經證明，最小哈希式獨立的排列簇至少必須包含: $lcm(1,2,...,n)\geq e^{n-o(n)}$ 個不同的排列，因此它需要 $Ω(n)$ 位來指定一個排列，這個數值仍然很大。^[2]

由於實踐上不可行，引入了最小哈希式獨立的兩個變型概念：嚴格最小哈希式獨立排列簇和近似最小哈希式獨立排列簇。嚴格的最小哈希式獨立是指最小哈希式獨立屬性被限制在集合基數至多為 $k$ 的一些集合中。^[4] 近似最小哈希式獨立最多有一個固定的概率 $ε$ 變化為完全獨立。^[5]

應用[編輯]

最小哈希的最初應用包括在Web文檔中聚類並消除近似重複，這通過在那些文檔中出現的詞語集合來描述。^[1]^[2] 相似的技術也應用於其他類型數據的聚類和近似重複消除，如圖片：在圖片數據中，一張圖片可以通過分割用很多更小的子圖片集合或更多複雜圖片特徵的描述集合來表示。^[6]

Schleimer，Wilkerson & Aiken (2003)使用最小哈希技術作為數字文檔剽竊檢測方法的一部分，他們的方法將文檔表示成給定長度的子串集合，將文檔劃分成更大固定長度的窗口，然後使用子串的最小哈希值作為每個窗口的描述值。如果文本的拷貝部分比兩倍窗口尺寸還要長，則該描述值將肯定匹配保存在數據庫中眾多描述值中的一個，這樣那個窗口就可以用來檢查有多少內容是拷貝的。^[7]

在數據挖掘領域，Cohen 等人 (2001)使用最小哈希技術作為關聯規則學習的工具。給定一個數據庫，其中每一項都有多個屬性（可看作是每行為一個數據庫項，每列為一個屬性的0-1矩陣），他們將最小哈希的近似度方法應用於Jaccard係數，用來辨別頻繁共同出現的屬性候選對，然後僅計算這些候選對的確切係數值，以確定哪些項目共同出現的頻度低於一個給定的嚴格閾值。^[8]

參考文獻[編輯]

^ ^1.0 ^1.1 ^1.2 Broder, Andrei Z., On the resemblance and containment of documents, Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997, 電氣電子工程師學會: 21–29, 1997, doi:10.1109/SEQUEN.1997.666900 .
^ ^2.0 ^2.1 ^2.2 Broder, Andrei Z.; Charikar, Moses; Frieze, Alan M.; Mitzenmacher, Michael, Min-wise independent permutations, Proc. 30th ACM Symposium on Theory of Computing (STOC '98)（英語：Symposium on Theory of Computing）, New York, NY, USA: 計算機協會: 327–336, 1998, doi:10.1145/276698.276781 .
^ Jaccard, Paul, étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bulletin de la Société Vaudoise des Sciences Naturelles, 1901, 37: 547–579 .
^ Matou?ek, Ji?í; Stojakovi?, Milo?, On restricted min-wise independence of permutations, Random Structures and Algorithms, 2003, 23 (4): 397–408, doi:10.1002/rsa.10101 .
^ Saks, M.; Srinivasan, A.; Zhou, S.; Zuckerman, D., Low discrepancy sets yield approximate min-wise independent permutation families, Information Processing Letters（英語：Information Processing Letters）, 2000, 73 (1–2): 29–32, doi:10.1016/S0020-0190(99)00163-5 .
^ Chum, Ond?ej; Philbin, James; Isard, Michael; Zisserman, Andrew, Scalable near identical image and shot detection, Proceedings of the 6th ACM International Conference on Image and Cideo Retrieval (CIVR'07), 2007, doi:10.1145/1282280.1282359 ; Chum, Ond?ej; Philbin, James; Zisserman, Andrew, Near duplicate image detection: min-hash and tf-idf weighting, Proceedings of the British Machine Vision Conference (PDF) 3: 4, 2008 [2012-05-18], （原始內容存檔 (PDF)於2021-02-24） .
^ Schleimer, Saul; Wilkerson, Daniel S.; Aiken, Alex, Winnowing: local algorithms for document fingerprinting, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03): 76–85, 2003, doi:10.1145/872757.872770 .
^ Cohen, E.; Datar, M.; Fujiwara, S.; Gionis, A.; Indyk, P.; Motwani, R.; Ullman, J. D.; Yang, C., Finding interesting associations without support pruning, IEEE Transactions on Knowledge and Data Engineering, 2001, 13 (1): 64–78, doi:10.1109/69.908981 .
^ Andoni, Alexandr; Indyk, Piotr, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, ACM通訊, 2008, 51 (1): 117–122, doi:10.1145/1327452.1327494 .

[b97-1] 1.0 ^1.1 ^1.2 Broder, Andrei Z., On the resemblance and containment of documents, Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997, 電氣電子工程師學會: 21–29, 1997, doi:10.1109/SEQUEN.1997.666900 .

[bcfm-2] 2.0 ^2.1 ^2.2 Broder, Andrei Z.; Charikar, Moses; Frieze, Alan M.; Mitzenmacher, Michael, Min-wise independent permutations, Proc. 30th ACM Symposium on Theory of Computing (STOC '98)（英語：Symposium on Theory of Computing）, New York, NY, USA: 計算機協會: 327–336, 1998, doi:10.1145/276698.276781 .

[3] Jaccard, Paul, étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bulletin de la Société Vaudoise des Sciences Naturelles, 1901, 37: 547–579 .

[4] Matou?ek, Ji?í; Stojakovi?, Milo?, On restricted min-wise independence of permutations, Random Structures and Algorithms, 2003, 23 (4): 397–408, doi:10.1002/rsa.10101 .

[5] Saks, M.; Srinivasan, A.; Zhou, S.; Zuckerman, D., Low discrepancy sets yield approximate min-wise independent permutation families, Information Processing Letters（英語：Information Processing Letters）, 2000, 73 (1–2): 29–32, doi:10.1016/S0020-0190(99)00163-5 .

[6] Chum, Ond?ej; Philbin, James; Isard, Michael; Zisserman, Andrew, Scalable near identical image and shot detection, Proceedings of the 6th ACM International Conference on Image and Cideo Retrieval (CIVR'07), 2007, doi:10.1145/1282280.1282359 ; Chum, Ond?ej; Philbin, James; Zisserman, Andrew, Near duplicate image detection: min-hash and tf-idf weighting, Proceedings of the British Machine Vision Conference (PDF) 3: 4, 2008 [2012-05-18], （原始內容存檔 (PDF)於2021-02-24） .

[7] Schleimer, Saul; Wilkerson, Daniel S.; Aiken, Alex, Winnowing: local algorithms for document fingerprinting, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03): 76–85, 2003, doi:10.1145/872757.872770 .

[8] Cohen, E.; Datar, M.; Fujiwara, S.; Gionis, A.; Indyk, P.; Motwani, R.; Ullman, J. D.; Yang, C., Finding interesting associations without support pruning, IEEE Transactions on Knowledge and Data Engineering, 2001, 13 (1): 64–78, doi:10.1109/69.908981 .

[9] Andoni, Alexandr; Indyk, Piotr, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, ACM通訊, 2008, 51 (1): 117–122, doi:10.1145/1327452.1327494 .

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]