What would happen if I had a hash collision while using git?
For example, suppose I manage to commit two files with the same SHA-1 checksum: would git notice it, or would it corrupt one of the files?
Could git be improved to live with that, or would I have to change to a new hash algorithm?
(Please do not deflect this question by discussing how unlikely that is - thanks.)
Current answer
A hash collision is so unlikely that it is sheer mind blowing! Scientists all over the world are trying to achieve one, but so far they have not managed it. For some algorithms, such as MD5, they have succeeded, though.
What are the odds?
SHA-256 has 2^256 possible hash values, which is roughly 10^77. To put it more graphically, the chance of a collision is about
1 : 10^77 (a 1 followed by 77 zeros)
The chance of winning the lottery is about 1 : 14 million. A SHA-256 collision is therefore about as likely as winning the lottery on 11 consecutive days!
Mathematical explanation: 14 000 000 ^ 11 ≈ 2^256 (a rough order-of-magnitude approximation)
Moreover, the universe contains roughly 10^80 atoms, which is only about a thousand times the number of possible SHA-256 values.
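For anyone who wants to sanity-check these figures, here is a minimal back-of-the-envelope sketch in Python; the 1-in-14-million lottery odds and the 10^80 atom count are just the rough assumptions used above:

    # Rough sanity check of the numbers above (order-of-magnitude only).
    from math import log2, log10

    sha256_space = 2 ** 256               # number of possible SHA-256 values
    print(f"2^256 is about 10^{log10(sha256_space):.0f}")        # ~ 10^77

    lottery_odds = 14_000_000             # ~1 in 14 million, as assumed above
    eleven_wins = lottery_odds ** 11      # odds of winning 11 days in a row
    print(f"14e6^11 is about 2^{log2(eleven_wins):.0f}")         # ~ 2^261, same ballpark as 2^256

    atoms_in_universe = 10 ** 80          # commonly cited rough estimate
    print(f"atoms per hash value: ~{atoms_in_universe / sha256_space:.0f}")  # ~ 864, i.e. on the order of 1000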
A successful MD5 collision
Even for MD5 the odds are tiny. Nevertheless, mathematicians did manage to construct a collision:
d131dd02c5e6eec4 693d9a0698aff95c 2fcab58712467eab 4004583eb8fb7f89 55ad340609f4b302 83e488832571415a 085125e8f7cdc99f d91dbdf280373c5b d8823e3156348f5b ae6dacd436c919c6 dd53e2b487da03fd 02396306d248cda0 e99f33420f577ee8 ce54b67080a80d1e c69821bcb6a88393 96f9652b6ff72a70
has the same MD5 hash as
d131dd02c5e6eec4 693d9a0698aff95c 2fcab50712467eab 4004583eb8fb7f89 55ad340609f4b302 83e4888325f1415a 085125e8f7cdc99f d91dbd7280373c5b d8823e3156348f5b ae6dacd436c919c6 dd53e23487da03fd 02396306d248cda0 e99f33420f577ee8 ce54b67080280d1e c69821bcb6a88393 96f965ab6ff72a70
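If you want to check that for yourself, a short Python/hashlib sketch like the one below should confirm it, assuming the two hex blocks above were transcribed without errors:

    # Verify that the two 128-byte blocks above really collide under MD5.
    import hashlib

    block_a = bytes.fromhex(
        "d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89"
        "55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b"
        "d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0"
        "e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
    )
    block_b = bytes.fromhex(
        "d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89"
        "55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b"
        "d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0"
        "e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"
    )

    assert block_a != block_b                                              # different inputs ...
    assert hashlib.md5(block_a).digest() == hashlib.md5(block_b).digest()  # ... same MD5
    print(hashlib.md5(block_a).hexdigest())                                # the shared digest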
This does not mean that MD5 is unsafe against accidental collisions just because its collision resistance has been broken: you can construct MD5 collisions on purpose, but the chance of an accidental MD5 collision between two given files is still about 1 in 2^128, which is still enormous.
Conclusion
You do not have to worry about collisions at all. Hashing is the second-safest way to check whether two files are identical; the only absolutely safe way is a byte-for-byte binary comparison.
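As a footnote to that last point, a byte-for-byte comparison is trivial to do when you really need certainty, for example with Python's standard library (the file names here are just placeholders):

    # Byte-for-byte comparison of two files; no hash involved, so no collision risk.
    import filecmp

    identical = filecmp.cmp("a.bin", "b.bin", shallow=False)  # shallow=False forces content comparison
    print("identical" if identical else "different")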
Other answers
Could git be improved to live with that, or would I have to change to a new hash algorithm?
Collisions are possible for any hash algorithm, so changing the hash function does not rule the problem out; it only makes it less likely. So you would have to choose a really good hash function (SHA-1 already is one, but you asked not to be told that :)
If two files end up with the same hash in git, git treats their contents as identical. In the absolutely unlikely case that this ever happens, you could always go back one commit and change something in one of the files so that they no longer collide ...
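For context on what "the same hash" means here: git does not hash the raw file bytes alone, it hashes a small object header followed by the content. The sketch below mirrors how a blob id is computed in a classic SHA-1 repository, so two files would only collide if these full header-plus-content byte strings hashed to the same value:

    # Sketch: how git derives the SHA-1 id of a blob (before zlib compression).
    import hashlib

    def git_blob_id(content: bytes) -> str:
        # Git hashes "blob <size>\0" followed by the file content.
        header = b"blob " + str(len(content)).encode() + b"\0"
        return hashlib.sha1(header + content).hexdigest()

    print(git_blob_id(b"hello world\n"))
    # Should match: echo "hello world" | git hash-object --stdin

If a different file ever produced the same blob id, git would simply reuse the object it already stores under that id, which is the "treated as identical" behaviour described above.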
See Linus Torvalds' post "Starting to think about sha-256?" on the git mailing list.
Well, I guess we now know what would happen: you should expect your repository to become corrupted (source).
I recently found a posting from 2013-04-29 in a BSD discussion group:
http://openbsd-archive.7691.n7.nabble.com/Why-does-OpenBSD-use-CVS-td226952.html
There the poster claims:
I ran into a hash collision once, while using git rebase.
Unfortunately, he provides no evidence for his claim, but perhaps you could try contacting him and asking about this alleged incident.
On a more general level, because of the birthday paradox you would expect a SHA-1 collision only after roughly pow(2, 80) hashes have been generated.
That sounds like a lot, and it is certainly far more than the total number of versions of individual files present in all the git repositories of the world combined.
However, that only applies to the versions which actually remain in version history.
If a developer relies very much on rebasing, every time a rebase is run for a branch, all the commits in all the versions of that branch (or the rebased part of the branch) get new hashes. The same is true for every file modified with "git filter-branch". Therefore, "rebase" and "filter-branch" can be big multipliers for the number of hashes generated over time, even though not all of them are actually kept: frequently, after rebasing (especially for the purpose of "cleaning up" a branch), the original branch is thrown away.
But if a collision occurred during the rebase or filter-branch, it could still have adverse effects.
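To put the pow(2, 80) figure into perspective, here is a rough sketch using the standard birthday-bound approximation for a 160-bit hash (the hash counts are arbitrary example values):

    # Birthday-bound estimate for a 160-bit hash such as SHA-1.
    from math import expm1

    HASH_BITS = 160

    def collision_probability(n_hashes: float) -> float:
        # Standard approximation: p ~ 1 - exp(-n^2 / 2^(bits + 1))
        return -expm1(-(n_hashes ** 2) / 2 ** (HASH_BITS + 1))

    for exponent in (48, 60, 80):
        n = 2.0 ** exponent
        print(f"~2^{exponent} hashes -> collision probability ~ {collision_probability(n):.2g}")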
Another approach would be to estimate the total number of hashed entities in the world's git repositories and see how far that is from pow(2, 80).
Say we have about 8 billion people, and all of them run git and keep their stuff in 100 git repositories per person. Assume further that the average repository has 100 commits and 10 files, and that each commit changes only one of those files.
For every revision we then have at least a hash for the tree object and one for the commit object itself. Together with the changed file, that makes 3 hashes per revision, and therefore 300 hashes per repository.
For 100 repositories each for 8 billion people, that gives about 2.4 x 10^14 hashes, i.e. somewhere around pow(2, 48), which is still a long way from pow(2, 80).
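The arithmetic of that estimate, spelled out (all figures are the assumptions stated above, not measurements):

    # Back-of-the-envelope estimate of hashes produced, per the assumptions above.
    from math import log2

    people = 8_000_000_000        # ~8 billion git users (assumption)
    repos_per_person = 100
    commits_per_repo = 100
    hashes_per_commit = 3         # tree object + commit object + one changed blob

    total = people * repos_per_person * commits_per_repo * hashes_per_commit
    print(f"total hashes ~ {total:.1e} ~ 2^{log2(total):.1f}")    # ~2.4e14 ~ 2^47.8
    print(f"2^80 is still ~{2 ** 80 / total:.1e} times larger")   # ~5e9 times larger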
However, this does not include the multiplication effect assumed above, and I am not sure how to include it in this estimate. Maybe it increases the chance of a collision considerably, particularly for very large repositories with a long commit history (such as the Linux kernel) that many people rebase for small changes, which nevertheless creates different hashes for all the affected commits.
It is impossible to answer this question with the right "but" without also explaining why it is not a problem, and it is impossible to do that without a good grasp of what a hash really is. It is more complicated than the simple cases you might have seen in a CS program.
There is a basic misunderstanding of information theory here. If you reduce a large amount of information to a smaller amount by discarding some of it (i.e. a hash), there will be a chance of collision directly related to the length of the data. The shorter the data, the LESS likely a collision will be. Now, the vast majority of those collisions will be gibberish, making them that much less likely to actually happen (you would never check in gibberish... even a binary image is somewhat structured). In the end, the chances are remote.

To answer your question: yes, git will treat them as the same; changing the hash algorithm won't help, it would take a "second check" of some sort. But ultimately you would need as much "additional check" data as the length of the data itself to be 100% sure... keep in mind you would already be 99.99999 (to a really long string of digits) percent sure with a simple check like you describe.

SHA-x hashes are cryptographically strong, which means it is generally hard to intentionally create two source data sets that are both VERY SIMILAR to each other and have the same hash. One bit of change in the data should create more than one (preferably as many as possible) bits of change in the hash output, which also means it is very difficult (but not quite impossible) to work back from the hash to the complete set of collisions, and thereby to pull out the original message from that set: all but a few will be gibberish, and of the ones that aren't, there is still a huge number to sift through if the message has any significant length. The downside of a cryptographic hash is that it is slow to compute... in general.
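The "one bit of change in the data should create many bits of change in the hash" property (the avalanche effect) is easy to see experimentally; a quick sketch:

    # Flip a single input bit and count how many SHA-1 output bits change.
    import hashlib

    data = bytearray(b"the quick brown fox jumps over the lazy dog")
    before = hashlib.sha1(bytes(data)).digest()

    data[0] ^= 0x01                           # flip one bit of the first byte
    after = hashlib.sha1(bytes(data)).digest()

    changed = sum(bin(a ^ b).count("1") for a, b in zip(before, after))
    print(f"{changed} of 160 output bits changed")   # typically around 80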
So, what does all that mean for Git? Not much. The hashes are computed so rarely (relative to everything else) that their computational penalty is low overall. The chance of hitting a pair of collisions is so low that it is not a realistic chance to occur and go undetected (i.e. your code would most likely suddenly stop building), which lets the user fix the problem (back up a revision, make the change again, and you will almost certainly get a different hash, because the time change also feeds the hash in git). It is more likely to be a real problem for you if you are storing lots of arbitrary binaries in git, which is not really its primary use model. If you want to do that... you are probably better off using a traditional database.
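On the point that "the time change also feeds the hash in git": a commit id is the hash of the commit object, and that object contains the author and committer timestamps, so re-creating the same change at a different time gives a different commit id. A sketch, using made-up tree/parent ids and identities:

    # Sketch: what goes into a commit id. The tree/parent ids, identity, and
    # timestamp below are hypothetical placeholders, not from a real repository.
    import hashlib

    commit_body = (
        b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"    # example tree id
        b"parent 0000000000000000000000000000000000000000\n"  # placeholder parent
        b"author Jane Doe <jane@example.com> 1700000000 +0000\n"
        b"committer Jane Doe <jane@example.com> 1700000000 +0000\n"
        b"\n"
        b"same change, different timestamp -> different commit id\n"
    )
    header = b"commit " + str(len(commit_body)).encode() + b"\0"
    print(hashlib.sha1(header + commit_body).hexdigest())
    # Changing only the timestamps above changes the printed id.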
There is nothing wrong with thinking about this. It is a good question that many people simply wave away as "so unlikely it is not worth thinking about", but the reality is a bit more complicated than that. If it DID happen, it should be easily detectable; it would not be silent corruption in a normal workflow.