2004-09-03 02:07:06

Chapter 4: Repositories

This is part of an online book called Source Control HOWTO, a best practices guide on source control, version control, and configuration management.

< Chapter 3

Chapter 5 >

Cars and clocks

In previous chapters I have mentioned the concept of a repository, but I haven't said much further about it. In this chapter, I want to provide a lot more detail. Please bear with me as I spend a little time talking about how an SCM tool works "under the hood". I am doing this because an SCM tool is more like a car than a clock.

An SCM tool is not like a clock. Clock users have no need to know how a clock works inside. We just want to know what time it is. Those who understand the inner workings of a clock cannot tell time any more skillfully than the rest of us.
An SCM tool is more like a car. Lots of people do use cars without knowing how they work. However, people who really understand cars tend to get better performance out of them.

Rest assured, that this book is still a "HOWTO". My goal here remains to create a practical explanation of how to do source control. However, I believe that you can use an SCM tool more effectively if you know a little bit about what's happening inside.

Repository = File System * Time

A repository is the official place where you store all your source code. It keeps track of all your files, as well as the layout of the directories in which they are stored. It resides on a server where it can be shared by all the members of your team.

But there has to be more. If the definition in the previous paragraph were the whole story, then an SCM repository would be no more than a network file system. A repository is much more than that. A repository contains history.

A file system is two-dimensional: its space is defined by directories and files. In contrast, a repository is three-dimensional: it exists in a continuum defined by directories, files and time. An SCM repository contains every version of your source code that has ever existed. The additional dimension creates some rather interesting challenges in the architecture of a repository and the decisions about how it manages data.

How do we store all those old versions of everything?

As a first guess, let's not be terribly clever. We need to store every version of the source tree. Why not just keep a complete copy of the entire tree for every change that has happened?

We obviously use Vault as the SCM tool for our own development of Vault. We began development of Vault in the fall of 2001. In the summer of 2002, we started "dogfooding". On October 25th, 2002, we abandoned our repository history and started a fresh repository for the core components of Vault. Since that day, this tree has been modified 4,686 times.

This repository contains approximately 40 MB of source code. If we chose to store the entire tree for every change, those 4,686 copies of the source tree would consume approximately 183 GB, without compression. At today's prices for disk space, this option is worth considering.

However, this particular repository is just not very large. We have several others as well, but the sum total of all the code we have ever written still doesn't qualify as "large". Many of our Vault customers have trees which are a lot bigger.

As an example, consider the source tree for OpenOffice.org. This tree is approximately 634 MB. Based on their claim of 270 developers and the fact that their repository is almost four years old, I'm going to conservatively estimate that they have made perhaps 20,000 checkins. So, if we used the dumb approach of storing a full copy of their tree for every change, we'll need around 12 TB of disk space. That's 12 terabytes.

At this point, the argument that "disk space is cheap" starts to break down. The disk space for 12 TB of data is cheaper than it has ever been in the history of the planet. But this is mission critical data. We have to consider things like performance and backups and RAID and administration. The cost of storing 12 TB of ultra-important data is more than just the cost of the actual disk platters.

So we actually do have an incentive to store this information a bit more efficiently. Fortunately, there is an obvious reason why this is going to be easy to do. We observe that tree N is often not terribly different from tree N-1. By definition, each version of the tree is derived from its predecessor. A checkin might be as simple as a one-line fix to a single file. All of the other files are unchanged, so we don't really need to store another copy of them.

So, we don't want to store the full contents of the tree for every single change. Instead, we want a way to store a tree represented as a set of changes to another tree. We call this a "delta".

Delta direction

As we decide to store our repositories using deltas, we must be concerned about performance. Retrieving a tree which is in a deltified representation requires more effort than retrieving one which is stored in full. For example, let's suppose that version 1 of the tree is stored in full, but every subsequent revision is represented as a delta from its predecessor. This means that in order to retrieve version 4,686, we must first retrieve version 1 and then apply 4,685 deltas. Obviously, this approach would mean that retrieving some versions will be faster than others. When using this approach we say that we are using "forward deltas", because each delta expresses the set of changes from one version to the next.

We observe that not all versions of the tree are equally likely to be retrieved. For example, version 83 of the Vault tree is not special in any way. It is likely that we have not retrieved that version in over a year. I suspect that we will never retrieve it again. However, we retrieve the latest version of the tree many times per day. In fact, as a broad generalization, we can say that at any given moment, the most recent version of the tree is probably the most likely one to be needed.

The simplistic use of forward deltas delivers its worst performance for the most common case. Not good.

Another idea is to use "reverse deltas". In this approach, we store the most recent tree in full. Every other tree N is represented as a set of differences from tree N+1. This approach delivers its best performance for the most common case, but it can still take an awfully long time to retrieve older trees.

Some SCM tools use some sort of a compromise design. In one approach, instead of storing just one full tree and representing every other tree as a delta, we sprinkle a few more full trees along the way. For example, suppose that we store a full tree for every 10th version. This approach uses more disk space, but the SCM server never has to apply more than 9 deltas to retrieve any tree.

What is a delta?

I've been throwing around this concept of deltas, but I haven't stopped to describe them.

A tree is a hierarchy of folders and files. A delta is the difference between two trees. In theory, those two trees do not need to be related. However, in practice, the only reason we calculate the difference between them is because one of them is derived from the other. Some developer started with tree N and made one or more changes, resulting in tree N+1.

We can think of the delta as a set of changes. In fact, many SCM tools use the term "changeset" for exactly this purpose. A changeset is merely a list of the changes which express the difference between two trees.

For example, let's suppose that Wilbur starts with tree N and makes the following changes:

He deletes $/top/subfolder/foo.c because it is no longer needed.
He edits $/top/subfolder/Makefile to remove foo.c from the list of file names
He edits $/top/bar.c to remove all the calls to the functions in foo.c
He renames $/top/hello.c and gives it the new name hola.c
He adds a new file called feature_creep.c to $/top/
He edits $/top/Makefile to add feature_creep.c to the list of filenames
He moves $/top/subfolder/readme.txt into $/top

At this point, he commits all of these changes to the repository as a single transaction. When the SCM server stores this delta, it must remember all of these changes.

For changeset item 1 above, the delete of foo.c is easily represented. We simply remember that foo.c existed in tree N but does not exist in tree N+1.

For changeset item 4, the rename of hello.c is a bit more complex. To handle renames, we need each object in the repository to have an identifier which never changes, even when the name or location of the item changes.

For changeset item 7, the move of readme.txt is another example of why repositories need IDs for each item. If we simply remember every item by its path, we cannot remember the occasions when that path changes.

Changeset item 5 is going to be a lot bulkier than some of the other items here. For this item we need to remember that tree N+1 has a file called feature_creep.c which was never present in tree N. However, a full representation of this changeset item needs to contain the entire contents of that file.

Changeset items 2, 3 and 6 represent situations where a file which already existed has been modified in some way. We could handle these items the same way as item 5, by storing the entire contents of the new version of the file. However, we will be happier if we can do deltas at the file level just as we are doing deltas at the tree level.

File deltas

A file delta merely expresses the difference between two files. Once again, the reason we calculate a file delta is because we believe it will be smaller than the file itself, usually because one of the files is derived from the other.

For text files, a well-known approach to the file delta problem is to compare line-by-line and output a list of lines which have been modified, inserted or changed. This is the same kind of results which are produced by the Unix 'diff' command. The bad news is that this approach only works for text files. The good news is that software developers and web developers have a lot of text files.

CVS and Perforce use this approach for repository storage. Text files are deltified using a line-oriented diff. Binary files are not deltified at all, although Perforce does reduce the penalty somewhat by compressing them.

Subversion and Vault are examples of tools which use binary file deltas for repository storage. Vault uses a file delta algorithm called VCDiff, as described in RFC 3284. This algorithm is byte-oriented, not line-oriented. It outputs a list of byte ranges which have been changed. This means it can handle any kind of file, binary or text. As an ancillary benefit, the VCDiff algorithm compresses the data at the same time.

Binary deltas are a critical feature for some SCM tool users, especially in situations where the binary files are large. Consider the case where a user checks out a 10 MB file, changes a few bytes, and checks it back in. In CVS, the size of the repository will increase by 10 MB. In Subversion and Vault, the repository will only grow by a small amount.

Deltas and diffs are different

Please note that I make a distinction between the terms "delta" and "diff".

A "delta" is the difference between two versions. If we have one full file and a delta, then we can construct the other full file. A delta is used primarily because it is smaller than the full file, not because it is useful for a human being to read. The purpose of a delta is efficiency. When deltas are done at the level of bytes instead of textual lines, that efficiency becomes available to all kinds of files, not just text files.
A "diff" is the human-readable difference between two versions of a text file. It is usually line-oriented, but really cool visual diff tools can also highlight the specific characters on a line which differ. The purpose of a diff is to show a developer exactly what has changed between two versions of a file. Diffs are really useful for text files, because human beings tend to read text files. Most human beings don't read binary files, and human-readable diffs of binary files are similarly uninteresting.

As mentioned above, some SCM tools use binary deltas for repository storage or to improve performance over slow network lines. However, those tools also support textual diffs. Deltas and diffs serve two distinct purposes, both of which are important. It is merely coincidence that some SCM tools use textual diffs as their repository deltas.

The evolution of source control technology

At this point I should admit that I have presented a somewhat idealized view of the world. Not all SCM tools work the way I have described. In fact, I have presented things exactly backwards, discussing tree-wide deltas before file deltas. That is not the way the history of the world unfolded.

Prehistoric ancestors of modern programmers had to live with extremely primitive tools. Early version control systems like RCS only handled file deltas. There was no way for the system to remember folder-level operations like add, renaming or deleting files.

Over time, the design of SCM tools matured. CVS is probably the most popular source control tool in the world today. It was originally developed as a set of wrappers around RCS which essentially provided support for some folder-level operations. Although CVS still has some important limitations, it was a big step forward.

Today, several modern source control systems are designed around the notion of tree-wide deltas. By accurately remembering every possible operation which can happen to a repository, these tools provide a truly complete history of a project.

What can be stored in a repository?

Best Practice: Checkin all the canonical stuff, and nothing else

Although you can store anything you want in a repository, that doesn't mean you should. The best practice here is to store everything which is necessary to do a build, and nothing else. I call this "the canonical stuff".

To put this another way, I recommend that you do not store any file which is automatically generated. Checkin your hand-edited source code. Don't checkin EXEs and DLLs. If you use a code generation tool, checkin the input file, not the generated code file. If you generate your product documentation in several different formats, checkin the original format, the one that you manually edit.

If you have two files, one of which is automatically generated from the other, then you just don't need to checkin both of them. You would in effect be managing two expressions of the same thing. If one of them gets out of sync with the other, then you have a problem.

People sometimes ask us what kind of things can be stored in a repository. In general, the answer is: "Any file". It is true that I am focusing on tools which are designed for software developers and web developers. However, those tools don't really care what kind of file you store inside them. Vault doesn't care. Perforce, Subversion and CVS don't care. Any of these tools will gratefully accept any file you want to store.

If you will be storing a lot of binary files, it is helpful to know how your SCM tool handles them. A tool which uses binary deltas in the repository may be a better choice.

If all of your files are binary, you may want to explore other solutions. Tools like Vault and Subversion were designed for programmers. These products contain features designed specifically for use with source code, including diff and automerge. You can use these systems to store all of your Excel spreadsheets, but they are probably not the best tool for the job. Consider exploring "document management" systems instead.

How is the repository itself stored?

We need to descend through one more layer of abstraction before we turn our attention back to more practical matters. So far I have been talking about how things are stored and managed within a repository, but I have not broached the subject of how the repository itself is stored.

A repository must store every version of every file. It must remember the hierarchy of files and folders for every version of the tree. It must remember metadata, information about every file and folder. It must remember checkin comments, explanations provided by the developer for each checkin. For large trees and trees with very many revisions, this can be a lot of data that needs to be managed efficiently and reliably. There are several different ways of approaching the problem.

RCS kept one archive file for every file being managed. If your file was called "foo.c" then the archive file was called "foo.c,v". Usually these archive files were kept in a subdirectory of the working directory, just one level down. RCS files were plain text, you could just look at them with any editor. Inside the file you would find a bunch of metadata and a full copy of the latest version of the file, plus a series of line-oriented file deltas, one for each previous version. (Please forgive me for speaking of RCS in the past tense. Despite all the fond memories, that particular phase of my life is over.)

CVS uses a similar design, albeit with a lot more capabilities. A CVS repository is distinct, completely separate from the working directory, but it still uses ",v" files just like RCS. The directory structure of a CVS repository contains some additional metadata.

When managing larger and larger source trees, it becomes clear that the storage challenges of a repository are exactly the same as the storage challenges of a database. For this reason, many SCM tools use an actual database as the backend data store. Subversion uses Berkeley DB. Vault uses SQL Server 2000. The benefit of this approach is enormous, especially for SCM tools which support atomic transactions. Microsoft has invested lots of time and money to ensure that SQL Server is a safe place to store important information. Data corruption simply doesn't happen. All of the ultra-tricky details of transactions are handled by the underlying database.

Perforce uses somewhat of a hybrid approach, storing all of the metadata in a database but keeping all of the actual file contents in RCS files. This approach trades some safety for speed. Since Perforce manages its own archive files, it has to take responsibility for all the strange things that threaten to corrupt them. On the other hand, writing a file is a bit faster than writing a blob into a SQL database. Perforce has the reputation of being one of the fastest SCM tools.

Managing repositories

Best Practice: Use separate repositories for things which are truly separate

Most SCM tools offer the ability to have multiple distinct repositories. Vault can even host multiple repositories on the same Vault server. People often ask us when this capability should be used.

In general, you should store related items in the same repository. Start a separate repository only in situations where the contents of the two are completely unrelated. In a small ISV, it may be quite logical to have only one repository which contains every project.

Creating a source control repository is kind of a special event. It's a little bit like adopting a cat. People often get a cat without realizing the animal is going to be around for 10-20 years. Your repository may have similar longevity, or even longer.

Shortly after SourceGear was founded in 1997, we created a SourceSafe repository. Over seven years later, that repository is still in use, almost every day. (Along with a whole bunch of legacy projects, it contains the source code for SourceOffSite. We never migrated that project to Vault because we wanted the SourceOffSite developers to continue eating their own dogfood.)

That repository is well over a gigabyte in size (which is actually rather small, but then SourceGear has never been a very big company). It contains thousands of files, thousands of checkins, and has been backed up thousands of times.

Treat your repository well and it will serve you well:

Obviously you should do regular backups. That repository contains everything your fussy and expensive programmers have ever created. Don't risk losing it.
Just for fun, take an hour this week and check your backup to see if it actually works. It's shocking how many people are doing daily backups that cannot actually be restored when they are needed.
Put your repository on a reliable server. If your repository goes down, your entire team is blocked from doing work. Disk drives like to fail, so use RAID. Power supplies like to fail, so get a server with redundant power supplies. The electrical grid likes to fail, so get a good Uninterruptible Power Supply (UPS).
Be conservative in the way your SCM server machine is managed. Don't put anything on that machine that doesn't need to be there. Don't feel the need to install every single Service Pack on the day it gets released. I've been shocked how many times one of our servers went south simply because we installed a service pack or hotfix from Windows Update. Obviously I want our machines to be kept current with the latest security fixes, but I've been burned too many times not to be cautious. Install those patches on some other machine before you put them on critical servers.
Keep your SCM server inside a firewall. If you need to allow your developers to access the repository from home, carefully poke a hole, but leave everything else as tight as you can. Make sure your developers are using some sort of bulk encryption. Vault uses SSL. Tools like Perforce, CVS and Subversion can be tunneled through ssh or something similar.

This brief list of tips is hardly a complete guide for administrators. I am merely trying to describe the level of care and caution which should be used for your SCM repository.

Undo

As I have mentioned, one of the best things about source control is that it contains your entire history. Every version of everything is stored. Nothing is ever deleted.

However, sometimes this benefit can be a real pain. What if I made a mistake and checked in something that should not be checked in? My history contains something I would rather forget. I want to pretend that it never happened. Isn't there some way to really delete from a repository?

In general, the recommended way to fix a problem is to checkin a new version which fixes it. Try not to worry about the fact that your repository contains a full history of the error. Your mistakes are a part of your past. Accept them and move on with your life.

However, most SCM tools do provide one or more ways of dealing with this situation. First, there is a command I call "rollback". This command is essentially an "undo" for revisions of a file. For example, let's say that a certain file is at version 7 and we want to go back to version 6. In Vault, we select version 6 and choose the Rollback command.

To be fair, I should admit that the rollback command is not always destructive. In some SCM tools, the rollback feature really does make version 7 disappear forever. Vault's rollback is non-destructive. It simply creates a version 8 which is identical to version 6. The designers of Vault are fanatical purists, or at the very least, one of them is.

As a concession to those who are less fanatical, Vault does support a way to truly destroy things in a repository. We call this feature "obliterate". I believe Subversion and Perforce use the same term. The obliterate command is the only way to delete something and make it truly gone forever.

Best Practice: Never obliterate anything that was real work

The purist in me wants to recommend that nothing should ever be obliterated. However, my pragmatist side prevails. There are situations where obliterate is not sinful.

However, obliterate should never be used to delete actual work. Don't obliterate a file simply because you discovered it to be a bad idea. Don't obliterate a file simply because you don't need it anymore. Obliterate is for situations where something in the repository should never have been there at all. For example, if you accidentally checkin a gigabyte of MP3s alongside your C++ include files, obliterate is a justifiable choice.

In my original spec for Vault, I had decided that we would not implement any form of destructive delete. We eventually decided to compromise and implement this command, but I really wanted to discourage its use. SourceSafe makes it far too easy to rewrite history and pretend that something never happened. In the Delete dialog box, SourceSafe includes a checkbox called "Destroy Permanently". This is an atrocious design decision, roughly equivalent to leaving a sledgehammer next to the server machine so that people can bash the hard disks with it every once in a while. This checkbox is almost irresistible. It simply begs to be checked, even though it is very rarely the right thing to do.

When we first designed the obliterate command for Vault, I wanted its user interface to somehow make the user feel guilty. I argued that the obliterate dialog box should include a photograph of a 75-year old catholic nun scowling and holding a yardstick.

The rest of the team agreed that we should discourage people from using this command, but in the end, we settled on a less graphical approach. In Vault, the obliterate command is available only in the Admin client, not the regular client people use every day. In effect, we made the obliterate command available, but inconvenient. People who really need to obliterate can find the command and get it done. Everyone else has to think twice before they try to rewrite history and pretend something never happened.

Kimchi again?

Recently when I asked my fifth grade daughter what she had learned in school, she proudly informed me that "everyone in Korea eats kimchi at every meal, every day". In the world of a ten-year-old, things are simpler. Rules don't have exceptions. Generalizations always apply.

This is how we learn. We understand the basic rules first and see the finer points later. First we learn that memory leaks are impossible in the CLR. Later, when our app consumes all available RAM, we learn more.

My habit as I write these chapters is to first present the basics in a "matter of fact" fashion, rarely acknowledging that there are exceptions to my broad generalizations. I did this during the chapter on checkins, failing to mention the "edit-merge-commit" until I had thoroughly explored "checkout-edit-checkin".

In this chapter, I have written everything from the perspective of just one specific architecture. SCM tools like Vault, Perforce, CVS and Subversion are based on the concept of a centralized server which hosts a single repository. Each client has a working folder. All clients contact the same server.

I confess that not all SCM tools work this way. Tools like BitKeeper and Arch are based on the concept of distributed repositories. Instead of one repository, there can be several, or even many. Things can be retrieved or committed to any repository at any time. The repositories are synchronized by migrating changesets from one repository to another. This results in a merge situation which is not altogether different from merging branches.

From the perspective of this SCM geek, distributed repositories are an attractive concept. Admittedly, they are advanced and complex, requiring a bit more of a learning curve on the part of the end user. But for the power user, this paradigm for source control is very cool.

Having no experience in the implementation of these systems, I will not be explaining their behavior in any detail. Suffice it to say that this approach is similar in some ways, but very different in others. This series of articles will continue to focus on the more mainstream architecture for source control.

Looking ahead

In this chapter, I discussed the details of repositories. In the next chapter, I'll go back over to the client side and dive into the details of working folders.

< Chapter 3

Chapter 5 >