Time and Space Tradeoffs in Version Control Storage
Storage is one of the most difficult challenges for a
version control system. For every file, we must store every version that has
ever existed. The logical size of a version control repository never shrinks.
It just keeps growing and growing, and every old version needs to remain
So, what is the best way to store every version of
As we look for the right scheme, let's remember three things
we consider to be important:
- Data integrity is paramount. In a version control tool,
nothing can be considered to be more important than guarding the safety of
- Performance is critical. Software developers have about
as much patience as a German Shepherd sitting in front of a pot roast.
- Space matters too. We're going to be storing lots of
data, much of which is being kept almost entirely for the purpose of
archiving history. We'd prefer to keep this archive as compact as
In this blog entry I will report the results of some
exploration I've been doing. I am experimenting with different ways of storing
the full history of one source code file. In this case, the file comes from
the source code for SourceGear Vault. It has been regularly edited for almost
seven years. There are 508 versions of this file.
As I describe the various things I have tried, a running
theme will be the classic
tradeoff of space vs. speed. In physics, we know that matter and energy
are interchangeable. In computer science, we know that time and space are
interchangeable. Usually, we can find a way to make things faster by using
more space, or make things smaller by taking more time.
As I said, I'll be storing 508 versions of the same file.
It's a C# source code file. For each attempt, I will report two things:
- The total amount of space required to store all 508
- The total amount of time required to retrieve (or
decompress or decode) all 508 versions, one at a time.
Before we get started, a few caveats:
- I realize that these experiments would yield different
results for a different kind of file. If you're storing source code,
there might be some things here you can apply. If you're storing JPEG
images, not so much.
- All these experiments were done on my Mac Book Pro laptop.
The CPU is a Core 2 Duo, which I consider to be decently fast. But like
most laptops, this machine has an I/O system which I consider to be
quasi-crappy. I would probably get somewhat different results if I were
running on a more serious piece of hardware.
OK, how should we store these 508 versions of the file?
No compression at all
As a first attempt, let's just store them. No compression
or funky encoding. Each of the 508 versions will be stored in full and
This is the starting point, even if it is not very
Size: 112,643 KB
Time: 2.5 s
Yes, dear reader, I admit that this file is far too long.
You can do the math. If the archive takes 112 MB and there
are 508 versions, then each one is 230 KB. That's pretty big for a source code
Actually, it's worse than you think. The 230 KB figure is
just the average. The first version of the file is around 90 KB. The latest
version is over 400 KB.
In our defense, I'd like to point out that this piece of code
needs to stay compatible with .NET 1.1, so the entire class must be in a single
file. However, I'd still have to answer to the charge of "First Degree Failure
to Refactor". Fine. I'll have my attorney contact you to plead out on a
lesser charge. I'm thinking maybe "Third Degree Contributing to the
Delinquency of an Intern", or something like that.
This "full and uncompressed" format uses an awful lot of
space, but it is also the fastest. We will find ways of making this smaller, but
all of those ways will be slower.
The relevant questions are:
- How much smaller?
- How much slower?
Some solutions will allow us to make this a lot smaller and
only a little slower. Those are interesting. Other possibilities will be only
a little smaller but a lot slower. Those are not so interesting.
OK, for our next idea, let's just compress every version
Size: 22,516 KB
Time: 4.0 sec
The results of this idea are surprisingly impressive. The
archive is over 80% smaller, and only about 60% slower. That's darn good,
considering that I didn't have to be terribly clever.
This tradeoff is probably worth it. In fact, it establishes
a new baseline that might be tough to beat.
How do we get better than this?
Instead of just compressing every file independently, we
could store things as deltas. Think of a delta as simply the difference
between one version and the next.
Compression with zlib takes one standalone thing and makes
an equivalent standalone thing which is smaller.
In contrast, a delta is a representation of the differences
between two files. Suppose that somebody takes file X and makes a few changes
to it, resulting in file Y. With a delta algorithm, we could calculate the
delta between X and Y, and call it D. Then, instead of storing Y, we can store
The nice thing here is that D will be approximately the size
of the edits, regardless of the size of the two files. If X was a 100 MB file
and Y was the same file with an extra 50 bytes appended to the end, then D will
be somewhere around 50 bytes,
A delta is a concept which might be implemented in a lot of
different ways. In my case, the delta algorithm I am using is VCDIFF, which is
described in RFC 3284. We
have our own implementation of VCDIFF. Other implementations include xdelta and open-vcdiff.
The important thing to remember about deltas for storage is
that you must have the reference item. D is a representation of Y, but only if
you have X handy. X is the reference.
OK, it should be obvious that this concept can be helpful in
storing a repository, but how do we set things up?
One big delta chain
As a first attempt, let's store all 508 versions as a big
chain of deltas. Every version is stored as a delta against the version just
before it. Version 1 is the reference, and is the only version that is not
stored as a delta.
Size: 7,682 KB
Time: Way too long to wait
Wow -- this is really small. It's over 93% smaller than the
full/uncompressed form. It'll be hard to find a general purpose approach that
is smaller than this.
But good grief this is slow. Fetching version 508 takes an
eternity, because first you have to construct a temporary version of 507. And to
construct version 507, you first have to construct a temporary version of 506.
And so on.
Let's try something else. The problem with the chaining
case above is that retrieving version 508 requires us to go all the way back to
version 1, which is incredibly inefficient. Instead, let's insert "key frames"
every 10 versions. We borrow this idea from the video world where compressed
video streams store every frame as a delta, but every 10 seconds they insert a
full, uncompressed frame of video.
By using key frames with chaining deltas, we can cut the
time required to fetch the average version of the file. For example, with a
key frame every 10 versions, we get most of the benefits of chaining, but in
the worst case, we only need 9 delta operations to retrieve any version.
Size: 18,024 KB
Time: 41.0 sec
This is better, but still not very good. The compression here
isn't much better than zlib, and the perf is still a lot worse. Compared to
zlib, we don't want to pay a 10x speed penalty just to get 20% better
All the key frames are stored as full and uncompressed
files, and they're taking up a lot of space. Maybe we should zlib those key frames?
Size: 9,092 KB
Time: 42.7 sec
Now at least the compression is starting to look
interesting. This is less than half the size of the zlib case, and 91.9%
smaller than the full form, which is a level of compression that is probably
worth the trouble. But the overall perf is still quite slow. In fact, it's
even slower here than plain chaining with key frames, because we have to
un-zlib the key frame.
The big problem here is that chains of deltas are killing
our performance. Chained deltas can be used to make things very small because
each delta matches up nicely with one set of user edits. But chained deltas
are slow because we need multiple operations to retrieve a given file.
Another approach would be to use each reference for more
than one delta. I call this the flower approach. With a flower, we deltify a
line of versions by picking one version (say, the first one) and using it as
the reference to make all the others into deltas.
Flower deltas should be much faster, since any file can be reconstructed
with just one undeltify operation.
So let's try to flower all 508 versions using version 1 as
the reference for all of them.
Size: 35,851 KB
Time: 10.9 sec
As expected, the performance here is much better.
But the overall space savings is lousy. Only version 2 was
based directly on version 1. Every version after that has less and less in
common with version 1, so the delta algorithm can't draw as much stuff from the
This particular approach isn't going to win. Plain zlib is both
smaller and faster.
Flowers with key frames
Maybe we should try the flower concept with key frames?
Like before, every 10 frames go together as a group. But instead
of chaining, we're going to run each group as a flower. The first version in
the group will serve as the reference for the other 9. We can reasonably
assume that the deltification of frame 10 won't be as good as frame 2, but
hopefully 10 and 1 still have enough in common to be worthwhile.
Size: 18,648 KB
Time: 12.2 sec
Wow. This looks a lot better than chaining. The space used
is about 17% smaller than zlib, but instead of being 10 times slower, it's only
3 times slower.
Of course, we can use the same trick we tried before. Let's
zlib all those key frames.
Size: 9,716 KB
Time: 13.6 sec
This seems like a potentially useful spot. It's less than
half the size of zlib. The perf still a lot slower than zlib, but at only
about 3X slower, the tradeoff is the best we've seen so far.
OK. So we've made a lot of progress on saving space, but 3X
slower than zlib still seems like a high price to pay. Do we really want to
make that trade? Do we have to?
Some things get retrieved more often than others
Let's look at the patterns for how this data is going to be
I've been reporting the total time required to fetch all 508
versions of the file. However, this benchmark doesn't reflect real usage very
well at all. In practice, the recent stuff gets retrieved a LOT more often
than the older stuff. Most of the time, developers are updating their working
copy to whatever is latest.
As a rough guess, I'm going to say that version 508 gets
retrieved twice as often as 507, which gets retrieved twice as often as 506,
and so on. A timing test based on that assumption gives us results something
One big flower 4.0
Flower with key frames 5.1
Chain with key frames 24.5
Not too surprising.
In the spirit of optimizing performance for the most common
operations, why not keep all the more recent versions in a faster form? We
could still use something more aggressive for the older stuff, but we can
probably get a nice performance boost if we just refuse to use deltification
for the most recent 10 versions of the file.
But how should we store those 10 versions? In full format?
Or zlib? This is an arbitrary choice with a clear tradeoff. For now, I choose
zlib. If we wanted a little more speed at the expense of using a little more
space, we could just keep those 10 versions in full form.
By choosing zlib for the most recent 10 revisions, now my "get
the recent stuff" benchmark runs in 1.7 seconds no matter what scheme I use.
But we still care about performance for the case where
somebody fetches an older version, even if that fetch doesn't happen as often.
That's the point of version control storage. Every version has to be
available. And when somebody does fetch version 495, we want our version
control system to still be reasonably fast.
Reversing the direction of the chains
Since the more recent versions are retrieved more often,
obviously, our chains are all going the wrong direction. If we had them go the
other way, then retrieval would get slower as the versions get older instead of
as the versions get newer.
But this approach doesn't lend itself well to the way
version control repositories naturally grow in the wild. In these tests, I
have mostly ignored the issues of constructing each storage scheme. I've
already got all 508 versions, so I'm just fiddling around with different
schemes of storing them all, comparing size and retrieval time.
In practice, those 508 versions happened one at a time, in
order. If we're going to store the versions with backward chains, then each
time we commit a new version of the file, we're going to need to re-encode something
that was previously stored. This makes the commit operation slower. It is
also a questionable idea from the perspective of data integrity. The safest
way to maintain data is to not touch it after it has been written. Once it's
there, leave it alone.
One case where we might want to be a bit more liberal
toward rewriting data is in a "pack" operation, such as the one Git has. It
wouldn't be terribly crazy to consider a standalone pack operation in a DVCS to
be better than rewriting data for each commit, for several reasons:
- It allows us to keep commit fast.
- Since pack would be done offline, its implementation can
be focused more on data integrity and space savings than on performance.
- Since the pack code can be separated from the commit code,
all the risky code can be kept together where it is easier to maintain.
- Since the pack operation is separate from commit, a user
that does not want to run pack does not have to.
- A pack operation in a DVCS is happening on just one
instance (clone) of the repository, not on the only copy.
Anyway, a pack operation would allow us to use storage
schemes that do not work well on the fly, incrementally updating as each
version comes in.
Visualizing the results
This plot makes it easier to see which schemes are better
In my experimentation, I actually did a lot more schemes.
For example, instead of key frames every 10 versions, I also tried every 5, 15
and 20. However, all those extra data points really cluttered the graph. So I
only included the most important ones here.
- In the lower right, we find "full". Very fast and very
- In the upper left, we find "chains". Very slow and very
- We can ignore any point which is both above AND to the
right of any other point. The "1flower" point is the one where I made one
big flower, using version 1 as the reference for every other version.
This scheme ends up being useless since zlib is better in both ways that
- All the other points represent possible tradeoffs which
might be interesting, depending upon our priorities
Intuitively, the schemes which are closer to the origin are
better. This graph makes it easy to see that "zlib" and "flowers" are probably
the two most interesting options I have discussed here.