我正在寻找一个关于哈希表如何工作的解释-用简单的英语为我这样的傻瓜!
例如,我知道它接受键,计算哈希(我正在寻找一个解释),然后执行某种模运算来计算出它在存储值的数组中的位置,但我的知识到此为止。
谁能解释一下过程吗?
编辑:我并不是特别问哈希码是如何计算的,而是对哈希表如何工作的一般概述。
我正在寻找一个关于哈希表如何工作的解释-用简单的英语为我这样的傻瓜!
例如,我知道它接受键,计算哈希(我正在寻找一个解释),然后执行某种模运算来计算出它在存储值的数组中的位置,但我的知识到此为止。
谁能解释一下过程吗?
编辑:我并不是特别问哈希码是如何计算的,而是对哈希表如何工作的一般概述。
当前回答
到目前为止,所有的答案都很好,并且从不同的方面了解了哈希表的工作方式。这里有一个简单的例子,可能会有帮助。假设我们想要存储一些带有小写字母字符串的项作为键。
正如simon所解释的,哈希函数用于从大空间映射到小空间。对于我们的例子,一个简单的哈希函数实现可以取字符串的第一个字母,并将其映射为一个整数,因此“短吻鳄”的哈希代码为0,“蜜蜂”的哈希代码为1,“斑马”的哈希代码为25,等等。
接下来,我们有一个包含26个存储桶的数组(在Java中可以是数组列表),我们将项放入与键的哈希码匹配的存储桶中。如果我们有不止一个元素键以相同字母开头,它们就会有相同的哈希码,所以它们都会进入存储桶中寻找那个哈希码所以必须在存储桶中进行线性搜索才能找到一个特定的元素。
在我们的例子中,如果我们只有几十个项目,键横跨字母表,它会工作得很好。然而,如果我们有一百万个条目,或者所有的键都以'a'或'b'开头,那么我们的哈希表就不是理想的。为了获得更好的性能,我们需要一个不同的哈希函数和/或更多的桶。
其他回答
哈希表完全基于这样一个事实,即实际计算遵循随机访问机模型,即内存中任何地址的值都可以在O(1)时间或常数时间内访问。
因此,如果我有一个键的宇宙(我可以在应用程序中使用的所有可能的键的集合,例如,滚动no。对于学生来说,如果它是4位,那么这个宇宙就是从1到9999的一组数字),并且一种将它们映射到有限大小的数字集的方法可以在我的系统中分配内存,理论上我的哈希表已经准备好了。
Generally, in applications the size of universe of keys is very large than number of elements I want to add to the hash table(I don't wanna waste a 1 GB memory to hash ,say, 10000 or 100000 integer values because they are 32 bit long in binary reprsentaion). So, we use this hashing. It's sort of a mixing kind of "mathematical" operation, which maps my large universe to a small set of values that I can accomodate in memory. In practical cases, often space of a hash table is of the same "order"(big-O) as the (number of elements *size of each element), So, we don't waste much memory.
现在,一个大集合映射到一个小集合,映射必须是多对一的。因此,不同的键将被分配相同的空间(?? ?不公平)。有几种方法可以解决这个问题,我只知道其中最流行的两种:
Use the space that was to be allocated to the value as a reference to a linked list. This linked list will store one or more values, that come to reside in same slot in many to one mapping. The linked list also contains keys to help someone who comes searching. It's like many people in same apartment, when a delivery-man comes, he goes to the room and asks specifically for the guy. Use a double hash function in an array which gives the same sequence of values every time rather than a single value. When I go to store a value, I see whether the required memory location is free or occupied. If it's free, I can store my value there, if it's occupied I take next value from the sequence and so on until I find a free location and I store my value there. When searching or retreiving the value, I go back on same path as given by the sequence and at each location ask for the vaue if it's there until I find it or search all possible locations in the array.
CLRS的《算法导论》对这个主题提供了非常好的见解。
其实比这更简单。
哈希表不过是一个包含键/值对的向量数组(通常是稀疏数组)。此数组的最大大小通常小于哈希表中存储的数据类型的可能值集中的项数。
哈希算法用于根据将存储在数组中的项的值生成该数组的索引。
This is where storing vectors of key/value pairs in the array come in. Because the set of values that can be indexes in the array is typically smaller than the number of all possible values that the type can have, it is possible that your hash algorithm is going to generate the same value for two separate keys. A good hash algorithm will prevent this as much as possible (which is why it is relegated to the type usually because it has specific information which a general hash algorithm can't possibly know), but it's impossible to prevent.
因此,您可以使用多个键来生成相同的散列代码。当这种情况发生时,将遍历向量中的项,并在向量中的键和正在查找的键之间进行直接比较。如果找到,则返回与该键关联的值,否则不返回任何值。
我的理解是这样的:
这里有一个例子:把整个表想象成一系列的桶。假设您有一个带有字母-数字哈希码的实现,并且每个字母都有一个存储桶。该实现将哈希码以特定字母开头的每个项放入相应的bucket中。
假设你有200个对象,但只有15个对象的哈希码以字母“B”开头。哈希表只需要查找和搜索'B' bucket中的15个对象,而不是所有200个对象。
至于计算哈希码,没有什么神奇的。目标只是让不同的对象返回不同的代码,对于相同的对象返回相同的代码。您可以编写一个类,它总是为所有实例返回相同的整数作为哈希代码,但这实际上会破坏哈希表的用处,因为它只会变成一个巨大的桶。
简短而甜蜜:
哈希表封装了一个数组,我们称之为internalArray。将项以如下方式插入数组:
let insert key value =
internalArray[hash(key) % internalArray.Length] <- (key, value)
//oversimplified for educational purposes
有时两个键会散列到数组中的同一个索引,而您希望保留这两个值。我喜欢把两个值都存储在同一个索引中,通过将internalArray作为一个链表数组来编码很简单:
let insert key value =
internalArray[hash(key) % internalArray.Length].AddLast(key, value)
所以,如果我想从哈希表中检索一个项,我可以这样写:
let get key =
let linkedList = internalArray[hash(key) % internalArray.Length]
for (testKey, value) in linkedList
if (testKey = key) then return value
return null
删除操作写起来也很简单。正如你所知道的,从我们的链表数组中插入、查找和删除几乎是O(1)。
当我们的internalArray太满时,可能在85%左右的容量,我们可以调整内部数组的大小,并将所有项目从旧数组移动到新数组中。
你们已经很接近完整地解释了这个问题,但是遗漏了一些东西。哈希表只是一个数组。数组本身将在每个槽中包含一些内容。至少要将哈希值或值本身存储在这个插槽中。除此之外,您还可以存储在此插槽上碰撞的值的链接/链表,或者您可以使用开放寻址方法。您还可以存储一个或多个指针,这些指针指向您希望从该槽中检索的其他数据。
It's important to note that the hashvalue itself generally does not indicate the slot into which to place the value. For example, a hashvalue might be a negative integer value. Obviously a negative number cannot point to an array location. Additionally, hash values will tend to many times be larger numbers than the slots available. Thus another calculation needs to be performed by the hashtable itself to figure out which slot the value should go into. This is done with a modulus math operation like:
uint slotIndex = hashValue % hashTableSize;
这个值是该值将要进入的槽。在开放寻址中,如果槽位已经被另一个哈希值和/或其他数据填充,将再次运行模运算来查找下一个槽:
slotIndex = (remainder + 1) % hashTableSize;
我想可能还有其他更高级的方法来确定槽索引,但这是我见过的最常见的方法……会对其他表现更好的公司感兴趣。
With the modulus method, if you have a table of say size 1000, any hashvalue that is between 1 and 1000 will go into the corresponding slot. Any Negative values, and any values greater than 1000 will be potentially colliding slot values. The chances of that happening depend both on your hashing method, as well as how many total items you add to the hash table. Generally, it's best practice to make the size of the hashtable such that the total number of values added to it is only equal to about 70% of its size. If your hash function does a good job of even distribution, you will generally encounter very few to no bucket/slot collisions and it will perform very quickly for both lookup and write operations. If the total number of values to add is not known in advance, make a good guesstimate using whatever means, and then resize your hashtable once the number of elements added to it reaches 70% of capacity.
我希望这对你有所帮助。
PS - In C# the GetHashCode() method is pretty slow and results in actual value collisions under a lot of conditions I've tested. For some real fun, build your own hashfunction and try to get it to NEVER collide on the specific data you are hashing, run faster than GetHashCode, and have a fairly even distribution. I've done this using long instead of int size hashcode values and it's worked quite well on up to 32 million entires hashvalues in the hashtable with 0 collisions. Unfortunately I can't share the code as it belongs to my employer... but I can reveal it is possible for certain data domains. When you can achieve this, the hashtable is VERY fast. :)