2024-04-09 05:00:00

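How does Git store files?

I just started learning Git, and to do so I started reading the Git Community Book. In this book they say that SVN and CVS store the differences between files, whereas Git stores a snapshot of all the files.

But I didn't really understand what they mean by snapshot. Does Git really make a copy of all the files in each commit? Because that's what I understood from their explanation.

PS: If anyone has a better source for learning Git, I would appreciate it.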


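Git does include a full copy of all the files for each commit, except that, for content already present in the Git repo, the snapshot simply points to that content rather than duplicating it. That also means that several files with the same content are stored only once.

So a snapshot is basically a commit, referring to the content of a directory structure.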
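To make the "point to existing content instead of copying it" idea concrete, here is a minimal Python sketch. It is a toy model, not Git's actual implementation: the `store` dictionary, the `put` and `snapshot` functions, and the file names are all made up for illustration, and a plain SHA-1 of the content stands in for Git's real object hashing.

```python
import hashlib

# Toy content store: key (SHA-1 hex of the content) -> content bytes.
# Hypothetical model for illustration; real Git also hashes a small header.
store = {}

def put(content: bytes) -> str:
    """Store content once and return its key."""
    key = hashlib.sha1(content).hexdigest()
    store.setdefault(key, content)  # content already present is not stored again
    return key

def snapshot(files: dict) -> dict:
    """Record a full snapshot: every path maps to the key of its content."""
    return {path: put(content) for path, content in files.items()}

# First commit: two files with identical content, plus one different file.
c0 = snapshot({"a.txt": b"same\n", "copy_of_a.txt": b"same\n", "b.txt": b"other\n"})
# Second commit: only b.txt changes.
c1 = snapshot({"a.txt": b"same\n", "copy_of_a.txt": b"same\n", "b.txt": b"changed\n"})

print(len(c0), len(c1))            # 3 3  (each snapshot lists every file)
print(len(store))                  # 3    (only distinct contents are stored)
print(c0["a.txt"] == c1["a.txt"])  # True (the unchanged file points to the same content)
```

Each snapshot is complete in the sense that it names every file, yet six file entries across two commits cost only three stored contents.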

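Some good references are:

git.github.io/git-reference: You tell Git you want to save a snapshot of your project with the git commit command, and it basically records what all of the files in your project look like at that point.
2020: "Commits in Git: Is it a snapshot/state/image or is it a change/diff/patch/delta?"
Git Immersion: Lab 12 illustrates how to get previous snapshots.
"You could have invented git (and maybe you already have!)"
What is a git "snapshot"?
Learn GitHub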


The progit book has a more comprehensive description of a snapshot:

The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they keep as a set of files and the changes made to each file over time. Git doesn’t think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini filesystem. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again—just a link to the previous identical file it has already stored. This is an important distinction between Git and nearly all other VCSs. It makes Git reconsider almost every aspect of version control that most other systems copied from the previous generation. This makes Git more like a mini filesystem with some incredibly powerful tools built on top of it, rather than simply a VCS.

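See also:

"If git functions off of snapshots of files, why doesn't .git/ become huge over time?"
What information is stored in the tree object contents of each git commit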


Jan Hudec adds this important comment:

While that's true and important on the conceptual level, it is NOT true at the storage level. Git does use deltas for storage. Not only that, but it's more efficient in it than any other system. Because it does not keep per-file history, when it wants to do delta compression, it takes each blob, selects some blobs that are likely to be similar (using heuristics that includes the closest approximation of previous version and some others), tries to generate the deltas and picks the smallest one. This way it can (often, depends on the heuristics) take advantage of other similar files or older versions that are more similar than the previous. The "pack window" parameter allows trading performance for delta compression quality. The default (10) generally gives decent results, but when space is limited or to speed up network transfers, git gc --aggressive uses value 250, which makes it run very slow, but provide extra compression for history data.

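Git logically stores each file under its SHA1. This means that if you have two files with exactly the same content in a repository (or if you rename a file), only one copy is stored.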
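As a rough illustration of "stored under its SHA1": for a repository using SHA-1 object ids, the id of a file's content (a blob) is the SHA-1 of a short header followed by the content. The sketch below reproduces that calculation in Python; the function name `git_blob_id` and the sample contents are made up for the example.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id of a blob holding this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# The file name is not part of the hash, so a renamed or duplicated
# file yields the same id and needs no new blob.
readme = b"hello\n"
copy_of_readme = b"hello\n"
print(git_blob_id(readme) == git_blob_id(copy_of_readme))  # True

# The empty blob has a well-known id.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The result can be compared against `git hash-object <file>` on a real file.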

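But this also means that when you modify a small part of a file and commit, another full copy of the file is stored. Git solves this with pack files. Once in a while, all the "loose" files (actually not just files, but also the objects holding commit and directory information) in a repo are gathered and packed into a pack file. The pack file is compressed with zlib, and similar files are also delta-compressed.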
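The delta idea can be sketched in a few lines of Python. This is only a toy: Git's real pack format is binary and chooses delta bases with its own heuristics, whereas this version borrows `difflib` from the standard library to find the common parts. The names `make_delta` and `apply_delta` and the sample data are hypothetical.

```python
import difflib
import zlib

def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length within base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal new bytes
        # "delete" needs no instruction: those base bytes are just not copied
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base plus the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

# Two versions of a file that differ in a single line.
old = b"".join(b"line %d\n" % i for i in range(50))
new = old.replace(b"line 25\n", b"line twenty-five\n")

delta = make_delta(old, new)
assert apply_delta(old, delta) == new

literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(new), literal_bytes)              # full size vs. new bytes the delta carries
print(len(zlib.compress(new)) < len(new))   # zlib alone already shrinks the object
```

Only the changed bytes have to be stored literally; everything else is expressed as "copy this range from the base version".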

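The same format is also used when pulling or pushing (at least with some protocols), so those files don't have to be recompressed.

The result is that a git repository, containing the whole uncompressed working copy, uncompressed recent files and compressed older files, is usually relatively small: less than twice the size of the working copy. That makes it smaller than an SVN repo with the same files, even though SVN doesn't store the history locally.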

OP: What does snapshot mean in Git? Does Git make a copy of all the files at each commit?

What does snapshot mean in Git?

In Git, all commits are immutable snapshots of your project (ignored files excluded) at a specific point in time. This means that each and every commit contains a unique representation of your entire project, not just the modified or added files (deltas), at the time of commit. Apart from references to the actual files, each commit is also infused with relevant metadata such as commit message, author (inc. timestamp), committer (inc. timestamp), and references to parent commit(s); all of which are immutable!

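Since a commit (or commit object, as it is formally called) is immutable in its entirety, it is impossible to modify any of its contents. Once created, a commit can never be tampered with or modified!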
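One way to see why a commit cannot be edited in place: its id is a hash over its own contents, including the ids of its parents. The Python sketch below builds a simplified commit body (it ignores optional headers such as signatures) and hashes it the way a SHA-1 repository would. The function name `commit_id`, the author string, and the messages are made-up values for illustration.

```python
import hashlib

def commit_id(tree: str, parents: list, author: str, committer: str, message: str) -> str:
    """Hash a simplified commit object: tree, parents, metadata, message."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# Hypothetical values, for illustration only.
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"   # id of the empty tree
who = "A U Thor <author@example.com> 1700000000 +0000"

original = commit_id(tree, [], who, who, "Initial commit")
reworded = commit_id(tree, [], who, who, "Initial commit (reworded)")
child = commit_id(tree, [original], who, who, "Second commit")
child_of_reworded = commit_id(tree, [reworded], who, who, "Second commit")

print(original != reworded)        # True: a different message is a different commit
print(child != child_of_reworded)  # True: every descendant gets a new id as well
```

So "changing" a commit really means creating a new commit object with a new id, which is why rewritten history shows up as different hashes.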

How Git stores files internally

From the Pro Git book we learn that:

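Git is a content-addressable filesystem. Great. What does that mean? It means that at the core of Git is a simple key-value data store. What this means is that you can insert any kind of content into a Git repository, for which Git will hand you back a unique key you can use later to retrieve that content.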
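The key-value behaviour described in the quote can be imitated with a small Python sketch. It follows the layout of loose objects (a zlib-compressed file named after the hash, inside a directory named after the first two hex digits), but it is a simplified stand-in: the temporary directory replaces `.git/objects`, and the function names `put` and `get` are invented for the example.

```python
import hashlib
import os
import tempfile
import zlib

def put(objects_dir: str, content: bytes) -> str:
    """Store content as a loose blob and return its key."""
    data = b"blob " + str(len(content)).encode() + b"\0" + content
    key = hashlib.sha1(data).hexdigest()
    path = os.path.join(objects_dir, key[:2], key[2:])
    if not os.path.exists(path):  # same content gives the same key: nothing to write
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(data))
    return key

def get(objects_dir: str, key: str) -> bytes:
    """Retrieve the content stored under key."""
    path = os.path.join(objects_dir, key[:2], key[2:])
    with open(path, "rb") as f:
        data = zlib.decompress(f.read())
    _header, _, content = data.partition(b"\0")
    return content

objects_dir = tempfile.mkdtemp()  # stand-in for .git/objects
key = put(objects_dir, b"what is up, doc?\n")
print(key)
print(get(objects_dir, key))
```

In a real repository the corresponding plumbing commands are `git hash-object -w` to insert content and `git cat-file -p <key>` to read it back.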

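So let's look at the illustration below to work out what the above statement really means, and how Git stores data (files in particular) internally.

A simple commit history containing three commits, including an overview of how the actual data (files and directories) is stored in Git. On the left the actual snapshots are shown, with the "delta change" compared to the previous commit highlighted in green. On the far right are the internal objects used for storage.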

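Git uses three main objects in its internal storage:

Commit object (high-level snapshot container)
Tree object (low-level filename/directory container)
Blob object (low-level file content container)

To store a file in the general sense (i.e. content + filename/directory) in Git, one blob and one tree are needed; the blob stores only the file content, and the tree stores the filename/directory referencing the blob. To construct nested directories, multiple trees are used; a tree can therefore hold references to both blobs and trees. From a high-level perspective you don't have to worry about blobs and trees, because Git creates them automatically as part of the commit process.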

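Note: Git computes all hashes (keys) bottom-up, starting with the blobs, moving through any subtrees, and finally arriving at the root tree, feeding each key as input to its direct parent. This process yields the structure shown in the illustration, which in mathematics and computer science is known as a directed acyclic graph (DAG), i.e. all references point in one direction only, without any cyclic dependencies.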
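The bottom-up hashing can be sketched in Python as follows. This is a simplified model that assumes a SHA-1 repository and handles only regular files (mode 100644) and directories (mode 40000); executables, symlinks and submodules are left out. The function names and the sample directory contents (including the `.gitignore` text) are made up for the example.

```python
import hashlib

def object_id(kind: bytes, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 of '<kind> <size>' + NUL + body."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).digest()

def blob_id(content: bytes) -> bytes:
    return object_id(b"blob", content)

def tree_id(entries: dict) -> bytes:
    """entries maps a name to bytes (file content) or to a dict (subdirectory)."""
    items = []
    for name, value in entries.items():
        raw = name.encode()
        if isinstance(value, dict):
            # The child tree is hashed first; its key becomes input to this tree.
            items.append((raw + b"/", b"40000", raw, tree_id(value)))
        else:
            items.append((raw, b"100644", raw, blob_id(value)))
    # Entries are sorted by name; directories sort as if their name ended in "/".
    body = b"".join(mode + b" " + raw + b"\0" + oid
                    for _, mode, raw, oid in sorted(items))
    return object_id(b"tree", body)

# Two snapshots: in the second one only .gitignore has new content.
c0 = {".gitignore": b"", "src": {"index.js": b""}}
c1 = {".gitignore": b"node_modules\n", "src": {"index.js": b""}}

print(blob_id(b"").hex()[:5])                    # e69de
print(tree_id({}).hex())                         # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
print(tree_id(c0["src"]) == tree_id(c1["src"]))  # True: the src subtree is unchanged
print(tree_id(c0) != tree_id(c1))                # True: the root tree gets a new key
```

Because a parent's key depends on its children's keys, a change anywhere in the project propagates upward to the root tree, while untouched subtrees keep their keys and are reused.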

Analyzing the visualized example further

By scrutinizing above history we see that for the initial C0 commit two empty files were added, src/index.js and .gitignore – but only one blob got created! That's because Git only stores unique content, and since the content of the two empty files obviously resulted in the same hash: e69de – only one entry was needed. However, as their filenames and paths differed two trees got created to keep track of this. Each tree returning a unique hash (key) computed based on the paths and blobs it's referencing.

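Moving up to the second commit, C1, we see that only the .gitignore file was updated, giving rise to a new blob (e51ac) containing that data. As far as the root tree goes, it still uses the same subtree reference for the src/index.js file. However, the root tree is also a completely new object with a new hash (key), simply because the underlying .gitignore reference changed.

In the final C2 commit, only the src/index.js file was updated, so a new blob (257cc) appeared, forcing the creation of a new subtree (5de32) and ultimately a new root tree (07eff).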

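In summary

Every time a new commit is created, a snapshot of your entire project is recorded and stored in the internal database following a DAG data structure. Whenever a commit is checked out, your working tree is reconstructed to reflect the same state as the underlying snapshot referenced through the root tree.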
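As a rough picture of how a checkout can rebuild the working tree from a root tree, here is a small in-memory Python sketch. It is a toy model and not Git's storage format: the `objects` dictionary, the key scheme, and the functions `put`, `write_tree` and `checkout` are all hypothetical, as are the file contents.

```python
import hashlib

# Toy object database: key -> ("blob", bytes) or ("tree", {name: key}).
objects = {}

def put(kind: str, value) -> str:
    """Store an object under a content-derived key (toy key, not Git's format)."""
    key = hashlib.sha1(repr((kind, value)).encode()).hexdigest()
    objects[key] = (kind, value)
    return key

def write_tree(directory: dict) -> str:
    """Store a nested dict of files bottom-up and return the root tree key."""
    entries = {}
    for name, value in sorted(directory.items()):
        if isinstance(value, dict):
            entries[name] = write_tree(value)
        else:
            entries[name] = put("blob", value)
    return put("tree", entries)

def checkout(tree_key: str, prefix: str = "") -> dict:
    """Rebuild the working tree (path -> content) by walking down from the root tree."""
    files = {}
    _, entries = objects[tree_key]
    for name, key in entries.items():
        kind, value = objects[key]
        if kind == "tree":
            files.update(checkout(key, prefix + name + "/"))
        else:
            files[prefix + name] = value
    return files

root_c0 = write_tree({".gitignore": b"", "src": {"index.js": b""}})
root_c2 = write_tree({".gitignore": b"dist/\n", "src": {"index.js": b"console.log(1)\n"}})

print(checkout(root_c0))                  # both files, empty, as in the first snapshot
print(checkout(root_c2)["src/index.js"])  # content as of the later snapshot
```

Given only a root tree key, the whole project state at that commit can be recovered, which is what makes each commit a complete snapshot.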

Source: The above is an excerpt from this full-length article on the topic: Immutable Snapshots - One of Git's Core Concepts