# 最小哈希

## 雅可比相似度与最小哈希值

${\displaystyle J(A,B)={{|A\cap B|} \over {|A\cup B|}}.}$

Pr[hmin(A) = hmin(B)] = J(A,B).

## 算法

### 耗时分析

|Y|/k估计通过给定集合的两个签名能够在O(k)能够计算出来，因此，当ε and k为常数时，从签名中计算相似度估计的时间也为常数，这样当众多两两相似度需要计算时，该方法在运行时间上与每个集合中元素的完全比较相比，能够有实质性的优化。

## 应用

Schleimer， Wilkerson & Aiken (2003)使用最小哈希技术作为数字文档剽窃检测方法的一部分，他们的方法将文档表示成给定长度的子串集合，将文档划分成更大固定长度的窗口，然后使用子串的最小哈希值作为每个窗口的描述值。如果文本的拷贝部分比两倍窗口尺寸还要长，则该描述值将肯定匹配保存在数据库中众多描述值中的一个，这样那个窗口就可以用来检查有多少内容是拷贝的。[7]

## 参考文献

1. Broder, Andrei Z., On the resemblance and containment of documents, Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997, IEEE: 21–29, 1997, doi:10.1109/SEQUEN.1997.666900.
2. Broder, Andrei Z.; Charikar, Moses; Frieze, Alan M.; Mitzenmacher, Michael, Min-wise independent permutations, Proc. 30th ACM Symposium on Theory of Computing (STOC '98), New York, NY, USA: Association for Computing Machinery: 327–336, 1998, doi:10.1145/276698.276781.
3. ^ Jaccard, Paul, étude comparative de la distribution florale dans une portion des Alpes et des Jura, Bulletin de la Société Vaudoise des Sciences Naturelles, 1901, 37: 547–579.
4. ^ Matou?ek, Ji?í; Stojakovi?, Milo?, On restricted min-wise independence of permutations, Random Structures and Algorithms, 2003, 23 (4): 397–408, doi:10.1002/rsa.10101.
5. ^ Saks, M.; Srinivasan, A.; Zhou, S.; Zuckerman, D., Low discrepancy sets yield approximate min-wise independent permutation families, Information Processing Letters, 2000, 73 (1–2): 29–32, doi:10.1016/S0020-0190(99)00163-5.
6. ^ Chum, Ond?ej; Philbin, James; Isard, Michael; Zisserman, Andrew, Scalable near identical image and shot detection, Proceedings of the 6th ACM International Conference on Image and Cideo Retrieval (CIVR'07), 2007, doi:10.1145/1282280.1282359; Chum, Ond?ej; Philbin, James; Zisserman, Andrew, Near duplicate image detection: min-hash and tf-idf weighting, Proceedings of the British Machine Vision Conference (PDF) 3: 4, 2008.
7. ^ Schleimer, Saul; Wilkerson, Daniel S.; Aiken, Alex, Winnowing: local algorithms for document fingerprinting, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03): 76–85, 2003, doi:10.1145/872757.872770.
8. ^ Cohen, E.; Datar, M.; Fujiwara, S.; Gionis, A.; Indyk, P.; Motwani, R.; Ullman, J. D.; Yang, C., Finding interesting associations without support pruning, IEEE Transactions on Knowledge and Data Engineering, 2001, 13 (1): 64–78, doi:10.1109/69.908981.
9. ^ Andoni, Alexandr; Indyk, Piotr, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Communications of the ACM, 2008, 51 (1): 117–122, doi:10.1145/1327452.1327494.