很久以前,我花1.25美元在便宜货桌上买了一本数据结构的书。在这篇文章中,哈希函数的解释说,由于“数学的本质”,它最终应该被一个质数mod。
你对一本1.25美元的书有什么期待?
不管怎么说,我花了很多年思考数学的本质,但还是没弄明白。
当有质数个桶时,数字的分布真的更均匀吗?
或者这是一个老程序员的故事,每个人都接受,因为其他人都接受?
很久以前,我花1.25美元在便宜货桌上买了一本数据结构的书。在这篇文章中,哈希函数的解释说,由于“数学的本质”,它最终应该被一个质数mod。
你对一本1.25美元的书有什么期待?
不管怎么说,我花了很多年思考数学的本质,但还是没弄明白。
当有质数个桶时,数字的分布真的更均匀吗?
或者这是一个老程序员的故事,每个人都接受,因为其他人都接受?
当前回答
只是把从答案中得到的一些想法写下来。
Hashing uses modulus so any value can fit into a given range We want to randomize collisions Randomize collision meaning there are no patterns as how collisions would happen, or, changing a small part in input would result a completely different hash value To randomize collision, avoid using the base (10 in decimal, 16 in hex) as modulus, because 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1, it shows a clear pattern of hash value distribution: value with same last digits will collide Avoid using powers of base (10^2, 10^3, 10^n) as modulus because it also creates a pattern: value with same last n digits matters will collide Actually, avoid using any thing that has factors other than itself and 1, because it creates a pattern: multiples of a factor will be hashed into selected values For example, 9 has 3 as factor, thus 3, 6, 9, ...999213 will always be hashed into 0, 3, 6 12 has 3 and 2 as factor, thus 2n will always be hashed into 0, 2, 4, 6, 8, 10, and 3n will always be hashed into 0, 3, 6, 9 This will be a problem if input is not evenly distributed, e.g. if many values are of 3n, then we only get 1/3 of all possible hash values and collision is high So by using a prime as a modulus, the only pattern is that multiple of the modulus will always hash into 0, otherwise hash values distributions are evenly spread
其他回答
http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
解释得很清楚,还有图片。
编辑:作为一个总结,使用质数是因为当数值乘以所选质数并将它们全部相加时,获得唯一值的可能性最大。例如,给定一个字符串,将每个字母的值与质数相乘,然后将它们全部相加,就会得到它的哈希值。
一个更好的问题是,为什么是数字31?
只是把从答案中得到的一些想法写下来。
Hashing uses modulus so any value can fit into a given range We want to randomize collisions Randomize collision meaning there are no patterns as how collisions would happen, or, changing a small part in input would result a completely different hash value To randomize collision, avoid using the base (10 in decimal, 16 in hex) as modulus, because 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1, it shows a clear pattern of hash value distribution: value with same last digits will collide Avoid using powers of base (10^2, 10^3, 10^n) as modulus because it also creates a pattern: value with same last n digits matters will collide Actually, avoid using any thing that has factors other than itself and 1, because it creates a pattern: multiples of a factor will be hashed into selected values For example, 9 has 3 as factor, thus 3, 6, 9, ...999213 will always be hashed into 0, 3, 6 12 has 3 and 2 as factor, thus 2n will always be hashed into 0, 2, 4, 6, 8, 10, and 3n will always be hashed into 0, 3, 6, 9 This will be a problem if input is not evenly distributed, e.g. if many values are of 3n, then we only get 1/3 of all possible hash values and collision is high So by using a prime as a modulus, the only pattern is that multiple of the modulus will always hash into 0, otherwise hash values distributions are evenly spread
对于一个哈希函数来说,重要的不仅仅是尽量减少冲突,而且是不可能在改变几个字节的同时保持相同的哈希。
假设你有一个方程: (x + y*z) % key = x且0<x<key且0<z<key。 如果key是一个质数n*y=key对于n中的每一个n为真,对于其他所有数为假。
一个key不是主要示例的例子: X =1, z=2, key=8 因为key/z=4仍然是一个自然数,4成为我们方程的一个解,在这种情况下(n/2)*y = key对于n中的每一个n都成立。这个方程的解的数量实际上翻了一番,因为8不是质数。
如果我们的攻击者已经知道8是方程的可能解,他可以将文件从产生8改为产生4,并且仍然得到相同的哈希值。
假设表的大小(或模数)是T = (B*C)。如果你输入的散列是(N*A*B) N可以是任何整数,那么你的输出就不会很好地分布。因为每次n变成C、2C、3C等,你的输出就会开始重复。也就是说,你的输出只会分布在C位。注意这里的C是(T / HCF(表大小,哈希))。
这个问题可以通过制造hcf1来消除。质数是很好的选择。
另一个有趣的现象是当T = 2^N时。这些将给出与所有输入哈希的低N位完全相同的输出。由于每个数都可以表示为2的幂,当我们对任意数取T的模时,我们将减去所有2的幂形式的数,即>= N,因此总能得到特定模式的数,取决于输入。这也是一个糟糕的选择。
类似地,T作为10^N也是不好的,因为类似的原因(模式是十进制数而不是二进制数)。
因此,质数往往会给出更好的分布结果,因此是表大小的好选择。
我想为Steve Jessop的回答补充一些东西(我不能评论,因为我没有足够的声誉)。但我找到了一些有用的材料。他的回答很有帮助,但他犯了一个错误:桶的大小不应该是2的幂。我引用Thomas Cormen, Charles Leisersen等人写的《算法导论》263页
When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p-1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.
希望能有所帮助。