为什么哈希函数应该使用质数模?

只是把从答案中得到的一些想法写下来。

Hashing uses modulus so any value can fit into a given range We want to randomize collisions Randomize collision meaning there are no patterns as how collisions would happen, or, changing a small part in input would result a completely different hash value To randomize collision, avoid using the base (10 in decimal, 16 in hex) as modulus, because 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1, it shows a clear pattern of hash value distribution: value with same last digits will collide Avoid using powers of base (10^2, 10^3, 10^n) as modulus because it also creates a pattern: value with same last n digits matters will collide Actually, avoid using any thing that has factors other than itself and 1, because it creates a pattern: multiples of a factor will be hashed into selected values For example, 9 has 3 as factor, thus 3, 6, 9, ...999213 will always be hashed into 0, 3, 6 12 has 3 and 2 as factor, thus 2n will always be hashed into 0, 2, 4, 6, 8, 10, and 3n will always be hashed into 0, 3, 6, 9 This will be a problem if input is not evenly distributed, e.g. if many values are of 3n, then we only get 1/3 of all possible hash values and collision is high So by using a prime as a modulus, the only pattern is that multiple of the modulus will always hash into 0, otherwise hash values distributions are evenly spread

2021-12-29 07:56:25

这取决于哈希函数的选择。

许多哈希函数通过将数据中的各种元素与一些因子相乘，再乘以与机器的字大小相对应的2的幂的模(这个模可以通过让计算溢出来释放)来组合数据中的各种元素。

您不希望在数据元素的乘数和哈希表的大小之间有任何公共因子，因为这样可能会发生改变数据元素不会将数据分散到整个表上的情况。如果你为表的大小选择一个质数，这样的公因数是极不可能的。

另一方面，这些因数通常由奇数质数组成，因此在哈希表中使用2的幂也应该是安全的(例如，Eclipse在生成Java hashCode()方法时使用31)。

2009-07-18 07:32:19

我想为Steve Jessop的回答补充一些东西(我不能评论，因为我没有足够的声誉)。但我找到了一些有用的材料。他的回答很有帮助，但他犯了一个错误:桶的大小不应该是2的幂。我引用Thomas Cormen, Charles Leisersen等人写的《算法导论》263页

When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p-1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.

希望能有所帮助。

2015-12-03 17:43:02

只是把从答案中得到的一些想法写下来。

Hashing uses modulus so any value can fit into a given range We want to randomize collisions Randomize collision meaning there are no patterns as how collisions would happen, or, changing a small part in input would result a completely different hash value To randomize collision, avoid using the base (10 in decimal, 16 in hex) as modulus, because 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1, it shows a clear pattern of hash value distribution: value with same last digits will collide Avoid using powers of base (10^2, 10^3, 10^n) as modulus because it also creates a pattern: value with same last n digits matters will collide Actually, avoid using any thing that has factors other than itself and 1, because it creates a pattern: multiples of a factor will be hashed into selected values For example, 9 has 3 as factor, thus 3, 6, 9, ...999213 will always be hashed into 0, 3, 6 12 has 3 and 2 as factor, thus 2n will always be hashed into 0, 2, 4, 6, 8, 10, and 3n will always be hashed into 0, 3, 6, 9 This will be a problem if input is not evenly distributed, e.g. if many values are of 3n, then we only get 1/3 of all possible hash values and collision is high So by using a prime as a modulus, the only pattern is that multiple of the modulus will always hash into 0, otherwise hash values distributions are evenly spread

2021-12-29 07:56:25

假设表的大小(或模数)是T = (B*C)。如果你输入的散列是(N*A*B) N可以是任何整数，那么你的输出就不会很好地分布。因为每次n变成C、2C、3C等，你的输出就会开始重复。也就是说，你的输出只会分布在C位。注意这里的C是(T / HCF(表大小，哈希))。

这个问题可以通过制造hcf1来消除。质数是很好的选择。

另一个有趣的现象是当T = 2^N时。这些将给出与所有输入哈希的低N位完全相同的输出。由于每个数都可以表示为2的幂，当我们对任意数取T的模时，我们将减去所有2的幂形式的数，即>= N，因此总能得到特定模式的数，取决于输入。这也是一个糟糕的选择。

类似地，T作为10^N也是不好的，因为类似的原因(模式是十进制数而不是二进制数)。

因此，质数往往会给出更好的分布结果，因此是表大小的好选择。

2016-09-06 04:16:23

Primes are unique numbers. They are unique in that, the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself of-course) due to the fact that a prime is used to compose it. This property is used in hashing functions. Given a string “Samuel”, you can generate a unique hash by multiply each of the constituent digits or letters with a prime number and adding them up. This is why primes are used. However using primes is an old technique. The key here to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about http://www.azillionmonkeys.com/qed/hash.html

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

2009-07-17 19:34:32

为什么哈希函数应该使用质数模?

推荐文章

最新文章

标签