Why should hash functions use a prime number modulus?

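The "nature of math" regarding prime power moduli is that they are one building block of a finite field. The other two building blocks are an addition and a multiplication operation. The special property of prime moduli is that they form a finite field with the "regular" addition and multiplication operations, simply taken modulo the prime. This means that multiplying by any fixed nonzero value maps distinct residues to distinct residues modulo the prime, and the same holds for adding any fixed value.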

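Prime moduli are advantageous because:

- They give the most freedom when choosing the secondary multiplier in secondary hashing: every multiplier except 0 ends up visiting all elements exactly once.
- If all hashes are less than the modulus, there are no collisions at all.
- Random primes mix better than power-of-two moduli and compress the information of all the bits, not just a subset.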
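To illustrate the first point, here is a minimal Python sketch of the kind of probe sequence used in secondary (double) hashing. The function name `slots_visited` and the table sizes 11 and 12 are illustrative assumptions, not taken from any particular hash table implementation.

```python
def slots_visited(start, step, size):
    """Return the set of slots touched by `size` probes of (start + i*step) % size."""
    return {(start + i * step) % size for i in range(size)}

# Prime table size: every nonzero step reaches all 11 slots.
prime_size = 11
full_coverage = all(
    len(slots_visited(0, step, prime_size)) == prime_size
    for step in range(1, prime_size)
)

# Composite table size: a step that shares a factor with the size
# keeps cycling through a small subset of the slots.
partial = slots_visited(0, 4, 12)  # only slots 0, 4 and 8 are ever probed
```

With a prime size you never have to check whether the step and the table size share a factor, which is the "freedom" mentioned above.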

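They do have a big downside, however: they require an integer division, which takes many (~15-40) cycles, even on a modern CPU. With around half that computation you can make sure the hash is mixed very well. Two multiplications and xorshift operations mix better than a prime modulus. We can then use whatever hash table size and hash reduction is fastest, giving 7 operations in total for power-of-2 table sizes and around 9 operations for arbitrary sizes.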
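As a rough sketch of what "two multiplications and xorshift operations" can look like, the Python below uses the finalizer constants and shifts from MurmurHash3 (`fmix64`). That choice is my assumption, not necessarily the mixer the answer had in mind, and the operation count differs slightly from the figures quoted above. The helper names `index_pow2` and `index_any` are hypothetical.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned wraparound in Python

def mix64(h):
    """Two multiplications interleaved with xor-shifts (MurmurHash3 fmix64 constants)."""
    h &= MASK64
    h ^= h >> 33
    h = (h * 0xFF51AFD7ED558CCD) & MASK64
    h ^= h >> 33
    h = (h * 0xC4CEB9FE1A85EC53) & MASK64
    h ^= h >> 33
    return h

def index_pow2(h, bits):
    """Reduce to a table of 2**bits slots with a single mask."""
    return mix64(h) & ((1 << bits) - 1)

def index_any(h, size):
    """Reduce to a table of arbitrary size with a multiply and a shift, no division."""
    return (mix64(h) * size) >> 64

# Keys whose low 10 bits are all zero: a plain mask puts them in one bucket,
# while mixing first spreads them over the table.
keys = [k * 1024 for k in range(1000)]
unmixed = {k & 255 for k in keys}
mixed = {index_pow2(k, 8) for k in keys}
```

Neither reduction needs an integer division, which is the cost the prime modulus was paying.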

I recently looked at many of the fastest hash table implementations, and most of them do not use prime moduli.

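The distribution of the hash table indices depends mainly on the hash function in use. A prime modulus cannot fix a bad hash function, and a good hash function does not benefit from a prime modulus. There are cases where a prime modulus can be advantageous, however. For example, it can mend a half-bad hash function.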

2019-09-15 21:53:12

Just to provide an alternate viewpoint, there's this site:

http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth

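It contends that you should use the largest number of buckets possible, as opposed to rounding down to a prime number of buckets. That seems like a reasonable possibility. Intuitively, I can certainly see how a larger number of buckets would be better, but I'm unable to make a mathematical argument for it.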

2009-07-17 19:44:13

Just putting down some thoughts gathered from the answers.

- Hashing uses a modulus so that any value fits into a given range.
- We want to randomize collisions.
  - Randomizing collisions means there is no pattern to how collisions happen; in other words, changing a small part of the input should produce a completely different hash value.
- To randomize collisions, avoid using the base (10 in decimal, 16 in hex) as the modulus, because 11 % 10 -> 1, 21 % 10 -> 1, 31 % 10 -> 1. This shows a clear pattern in the hash value distribution: values with the same last digit collide.
- Avoid using powers of the base (10^2, 10^3, 10^n) as the modulus, because that also creates a pattern: values with the same last n digits collide.
- In fact, avoid using anything that has factors other than itself and 1, because it creates a pattern: multiples of a factor are hashed into a restricted set of values.
  - For example, 9 has 3 as a factor, so 3, 6, 9, ..., 999213 are always hashed into 0, 3, 6.
  - 12 has 3 and 2 as factors, so 2n is always hashed into 0, 2, 4, 6, 8, 10, and 3n is always hashed into 0, 3, 6, 9.
- This is a problem if the input is not evenly distributed. For example, if many values are of the form 3n, then with modulus 9 we only use 1/3 of all possible hash values and collisions are frequent.
- So with a prime modulus, the only pattern is that multiples of the modulus always hash to 0; otherwise the hash values are evenly spread.
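The small examples in this list are easy to check directly. The following Python sketch does so; the helper name `residues` and the upper bound of 1000 are arbitrary choices for illustration.

```python
def residues(values, modulus):
    """Sorted list of the distinct values of v % modulus."""
    return sorted({v % modulus for v in values})

multiples_of_3 = range(3, 1000, 3)
even_numbers = range(0, 1000, 2)

mod9_threes = residues(multiples_of_3, 9)    # 9 shares the factor 3 with the input
mod12_evens = residues(even_numbers, 12)     # 12 shares the factor 2
mod12_threes = residues(multiples_of_3, 12)  # 12 shares the factor 3
mod11_threes = residues(multiples_of_3, 11)  # 11 is prime: no shared factor
```

Only the prime modulus 11 lets the multiples of 3 reach every bucket.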

2021-12-29 07:56:25

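Suppose your table size (or modulus) is T = (B*C). If the hashes of your inputs have the form (N*A*B), where N can be any integer, then your output will not be well distributed, because every time N reaches C, 2C, 3C, and so on, the output starts repeating. In other words, your output is spread over only C positions. Note that C here is (T / HCF(table size, hash)).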

This problem can be eliminated by making the HCF equal to 1. Prime numbers are a very good choice for that.
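A quick way to see the T / HCF relationship is to count the buckets actually hit. This is a minimal Python sketch; the function name `buckets_used` and the example sizes 64, 60 and 61 are assumptions chosen for illustration.

```python
from math import gcd

def buckets_used(table_size, stride, count=10_000):
    """Count distinct buckets hit by the hashes 0, stride, 2*stride, ..."""
    return len({(n * stride) % table_size for n in range(count)})

stride = 8  # e.g. hashes that are all multiples of 8

used_64 = buckets_used(64, stride)  # HCF(64, 8) = 8
used_60 = buckets_used(60, stride)  # HCF(60, 8) = 4
used_61 = buckets_used(61, stride)  # 61 is prime, so HCF(61, 8) = 1
```

In each case the count equals `table_size // gcd(table_size, stride)`, so only the prime size uses the whole table.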

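Another interesting case is T = 2^N. The output is then exactly the low N bits of the input hash. Since every number can be written as a sum of powers of 2, taking a number modulo T discards every power of 2 whose exponent is >= N, so the result always follows a specific pattern determined by the low bits of the input. This is also a bad choice.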
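The "low N bits" behaviour can be checked in a few lines of Python. The sample hash values below are made up for illustration; they differ only in their high bits.

```python
def low_bits(h, n_bits):
    """Keep only the lowest n_bits bits of a non-negative integer."""
    return h & ((1 << n_bits) - 1)

n_bits = 4
table_size = 1 << n_bits  # T = 2^4 = 16

# Hashes that share the same low 4 bits (all zero) but differ above them.
hashes = [0x10, 0x20, 0xAB30, 0xFFFF0]
buckets = [h % table_size for h in hashes]

# For non-negative integers, h % 2^N and the low N bits are the same thing.
modulo_equals_mask = all(h % table_size == low_bits(h, n_bits) for h in range(1000))
```

All four hashes land in bucket 0, because everything above bit 3 is ignored.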

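Similarly, T = 10^N is a bad choice for much the same reason (the pattern shows up in the decimal rather than the binary representation of the numbers).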

So prime numbers tend to give better-distributed results, and are therefore a good choice for the table size.

2016-09-06 04:16:23

Primes are special numbers. They are special in that the product of a prime with any other number has the best chance of being distinct (though not as distinctive as the prime itself, of course), because a prime was used to compose it. Hash functions make use of this property. Given a string such as "Samuel", you can generate a reasonably distinctive hash by multiplying each of its constituent digits or letters by a prime number and adding up the results. This is why primes are used. However, using primes is an old technique. The key point is that as long as you can generate sufficiently distinct keys, you can move to other hashing techniques too. For more on this topic, see http://www.azillionmonkeys.com/qed/hash.html

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/
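One common reading of "multiply by a prime and add" is a running polynomial hash. The Python sketch below follows that reading; the multiplier 31 and the 32-bit width are my assumptions rather than anything stated in the answer, and both function names are hypothetical. A plain character sum is included for comparison.

```python
def char_sum_hash(text):
    """Naive hash: just add up the character codes (ignores character order)."""
    return sum(ord(c) for c in text)

def prime_multiplier_hash(text, prime=31, bits=32):
    """Multiply the running hash by a prime before adding each character."""
    mask = (1 << bits) - 1
    h = 0
    for c in text:
        h = (h * prime + ord(c)) & mask
    return h

# "Samule" is "Samuel" with the last two letters swapped.
sum_collides = char_sum_hash("Samuel") == char_sum_hash("Samule")
prime_differs = prime_multiplier_hash("Samuel") != prime_multiplier_hash("Samule")
```

The plain sum cannot tell the two strings apart, while the multiplier makes each character's position matter. Neither version guarantees unique hashes; collisions are only made less regular.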

2009-07-17 19:34:32