Deltas

One of the first things I said in this book is that a VCS repository contains every version of everything that has ever been checked in.

So how does the repository store all that stuff? Maybe it just keeps a full snapshot of every version of the tree. Disk space is cheap, right?

Well, it’s not that cheap. If version control data were stored that way, lots of teams would have repositories of 10 TB or more. Around this point, the common argument that “disk space is cheap” starts to break down.  The cost of dealing with 10 TB of important data is much greater than just the cost of the actual disk platters.

Fortunately, there is a huge amount of redundancy in version-controlled data. We observe that tree N is often not terribly different from tree N-1.  By definition, each version of the tree is derived from its predecessor.  A commit might be as simple as a one-line fix to a single file.  All of the other files are unchanged; we don’t really need to store another copy of them.

So we don’t want to store the full contents of the tree for every single change.  Instead, we want a way to store a tree in terms of the changes relative to another tree.  We call this a delta. Nearly every version control tool uses some form of the delta concept when storing repository data.

A tree is a hierarchy of directories and files.  A delta is the difference between two trees.  In theory, those two trees do not need to be related.  In practice, however, the only reason we calculate the difference between them is that one of them is derived from the other.  Some developer started with tree N and made one or more changes, resulting in tree N+1.

We can think of a delta as a list of changes that expresses the difference between two trees. This includes files or directories that have been added, modified, renamed, deleted, or moved.
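To make that list of changes concrete, here is a minimal sketch in Python. It assumes a tree can be modeled as a dictionary mapping each path to its file contents, which is a simplification of my own and not how any particular tool stores trees. The name tree_delta and the sample paths are hypothetical. Note that this sketch does not detect renames or moves; it would report them as a delete plus an add.

```python
def tree_delta(old, new):
    """Compare two trees, each given as a {path: content} dict.

    Returns the paths that were added, deleted, or modified.
    Renames and moves are NOT detected in this simplified sketch.
    """
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in old if p in new and old[p] != new[p])
    return {"added": added, "deleted": deleted, "modified": modified}


# Tree N and tree N+1 (hypothetical example data).
tree_n = {
    "README": b"hello\n",
    "src/main.c": b"int main() { return 0; }\n",
    "src/util.c": b"/* helpers */\n",
}
tree_n1 = {
    "README": b"hello\n",
    "src/main.c": b"int main() { return 1; }\n",
    "src/io.c": b"/* io */\n",
}

delta = tree_delta(tree_n, tree_n1)
print(delta)
```

The unchanged README does not appear in the delta at all, which is exactly the redundancy we are trying to avoid storing.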

The delta concept can be used for individual files as well. A file delta merely expresses the difference between two files.  Once again, the reason we calculate a file delta is that we believe it will be smaller than a full copy of the file, usually because one of the files is derived from the other.

Many modern version control tools use binary file deltas for repository storage.  One popular file delta algorithm is called vcdiff[39]. It outputs a list of byte ranges that have changed.  This means it can handle any kind of file, binary or text.  As an ancillary benefit, the vcdiff algorithm compresses the data at the same time.
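The idea of a byte-level file delta can be sketched in a few lines of Python. To be clear, this is not vcdiff. It is a rough illustration that uses the standard library's difflib.SequenceMatcher to find matching byte ranges, and the instruction names "copy" and "add", along with the functions make_delta and apply_delta, are my own assumptions for the example. SequenceMatcher is also far too slow for large files; a real tool would use something much more efficient.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Describe `new` as a list of instructions relative to `old`.

    ("copy", offset, length) reuses bytes that already exist in `old`.
    ("add", data) supplies bytes that appear only in `new`.
    """
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            ops.append(("add", new[j1:j2]))
        # "delete" needs no instruction: those bytes are simply not copied.
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new file from the old file plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += old[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


old = b"The quick brown fox jumps over the lazy dog."
new = b"The quick red fox jumps over the lazy dog!"
ops = make_delta(old, new)
print(ops)
print(apply_delta(old, ops) == new)
```

Because the delta works on raw bytes, it makes no difference whether the file is text or binary.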

Binary deltas are a helpful feature for some users of version control tools, especially in situations where the binary files are large.  Consider the case where a user checks out a 500 MB file, changes a few bytes, and commits it back in.  If the repository is using file deltas, it will grow by only a small amount.
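Here is a scaled-down version of that scenario as a Python sketch. It assumes a deliberately naive delta that compares the two versions block by block and stores only the blocks that differ. The block size, the function names changed_blocks and apply_blocks, and the 1 MB stand-in file are all hypothetical choices for illustration. This approach only works well when bytes are changed in place; an insertion near the start of the file would shift everything after it and defeat it.

```python
BLOCK = 4096  # hypothetical block size, chosen only for this sketch


def changed_blocks(old: bytes, new: bytes, block: int = BLOCK):
    """Return (length of new, [(offset, bytes), ...]) for blocks that differ."""
    patches = []
    for offset in range(0, len(new), block):
        chunk = new[offset:offset + block]
        if old[offset:offset + block] != chunk:
            patches.append((offset, chunk))
    return len(new), patches


def apply_blocks(old: bytes, delta) -> bytes:
    """Rebuild the new version from the old version plus the changed blocks."""
    new_length, patches = delta
    # Start from the old bytes, truncated or zero-padded to the new length.
    out = bytearray(old[:new_length].ljust(new_length, b"\0"))
    for offset, chunk in patches:
        out[offset:offset + len(chunk)] = chunk
    return bytes(out)


# A stand-in for a large binary file (about 1 MB instead of 500 MB).
version_n = bytes(range(256)) * 4000

# The user changes four bytes and commits.
edited = bytearray(version_n)
for i in range(123_456, 123_460):
    edited[i] ^= 0xFF
version_n1 = bytes(edited)

delta = changed_blocks(version_n, version_n1)
payload = sum(len(chunk) for _, chunk in delta[1])
print(f"full copy: {len(version_n1)} bytes, delta payload: {payload} bytes")
```

Even this crude delta stores a single block instead of a second full copy of the file.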

Some version control tools can also use binary deltas to improve performance over slow network links. If both sides of the network connection already have version N, then transferring version N+1 over the wire can be accomplished by sending just a delta. The increase in network performance for offsite users can be quite dramatic.
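The exchange might look something like the following Python sketch. Everything here is an assumption made for illustration: the client identifies the version it already has by a SHA-256 hash, the server answers with either a delta or the full file, and the delta itself is a very simple one that keeps the common prefix and suffix and sends only the bytes in between. No real version control protocol is being described, and the function names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_delta(old: bytes, new: bytes):
    """Simplest possible delta: shared prefix length, shared suffix length,
    and the bytes of `new` that lie between them."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]


def apply_delta(old: bytes, delta) -> bytes:
    prefix, suffix, middle = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


def server_response(server_versions, client_base_hash, wanted):
    """Send a delta if the server recognizes the client's base version,
    otherwise fall back to sending the whole file."""
    target = server_versions[wanted]
    for data in server_versions.values():
        if digest(data) == client_base_hash:
            return ("delta", make_delta(data, target))
    return ("full", target)


def client_receive(client_base: bytes, response) -> bytes:
    kind, payload = response
    return apply_delta(client_base, payload) if kind == "delta" else payload


# Both sides already have version N; the client asks for version N+1.
server_versions = {
    1: b"header " + b"x" * 10_000 + b" footer",
    2: b"header " + b"x" * 5_000 + b"CHANGED" + b"x" * 4_993 + b" footer",
}
client_copy = server_versions[1]

response = server_response(server_versions, digest(client_copy), wanted=2)
result = client_receive(client_copy, response)
print(response[0], len(response[1][2]), "bytes sent instead of", len(result))
```

If the client's hash is unknown to the server, the same code path simply sends the full file, so the delta is purely an optimization.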