Most DVCS tools, including Git, Mercurial, and Veracity, use cryptographic hashes.
A cryptographic hash is an algorithm which constructs a short digest from a sequence of bytes of any length. There are many such hash algorithms[40]. For the SHA-1[41] algorithm, the output digest is always 160 bits in length. Some hash algorithms, including SHA-2[42] and Skein[43], are capable of generating longer digests, at lengths of 256, 512, or even 1024 bits.
Let’s take a closer look at how a DVCS makes use of cryptographic hashes. I will be using Git for my examples in this section, but it applies to Veracity as well. Mercurial, on the other hand, does things a bit differently.
In this example, we want to use our VCS to store four text files. For the sake of keeping things simple, each file is just a few bytes long. (The example would be more realistic if the files were a lot bigger, but you get the idea.)
eric:hashes_example eric$ echo Eric > file1.txt eric:hashes_example eric$ echo Erik > file2.txt eric:hashes_example eric$ echo eric > file3.txt eric:hashes_example eric$ echo Eirc > file4.txt eric:hashes_example eric$ ls -l total 32 -rw-r--r-- 1 eric staff 5 Jun 20 10:29 file1.txt -rw-r--r-- 1 eric staff 5 Jun 20 10:29 file2.txt -rw-r--r-- 1 eric staff 5 Jun 20 10:29 file3.txt -rw-r--r-- 1 eric staff 5 Jun 20 10:29 file4.txt
Each of these files contains my first name or a slight misspelling thereof. Now I use Git to show me the SHA-1 hash for each of these files.[44]
eric:hashes_example eric$ git hash-object file1.txt 44bf09d0a2c36585aed1c34ba2e5d958a9379718 eric:hashes_example eric$ git hash-object file2.txt 63ae94dae6067d9683cc3a9cea87f8fb388c0e80 eric:hashes_example eric$ git hash-object file3.txt 782d09e3fbfd8cf1b5c13f3eb9621362f9089ed5 eric:hashes_example eric$ git hash-object file4.txt a627820d67e455a4f0dfa49c912fbddb88fca483
Note that even though all four of the input strings are similar, the resulting hash values are very different. As you’ll see later, this is important.
Git uses hashes in two important ways.
When you commit a file into your repository, Git calculates and remembers the hash of the contents of the file. When you later retrieve the file, Git can verify that the hash of the data being retrieved exactly matches the hash that was computed when it was stored. In this fashion, the hash serves as an integrity checksum, ensuring that the data has not been corrupted or altered.
For example, if somebody were to hack the DVCS repository such that
the contents of file2.txt
were changed to “Fred”, retrieval of that
file would cause an error because the software would detect that
the SHA-1 digest for “Fred” is not 63ae94dae606…
Git also uses hash digests as database keys for looking up files and data.
If you ask Git for the contents of file2.txt
, it
will first look up its previously computed digest for the contents of that file[45], which
is 63ae94dae606… Then it looks in the
repository for the data associated with that value and returns “Erik”
as the result. (For the moment, you should try to ignore the fact that
we just used a 40 character hex string as the database key for four characters
of data.)
Let’s assume that we now want to add another file,
file5.txt
, which happens to contain exactly the same string
as file2.txt
. So the hash of the
file contents will be exactly the same.
eric:hashes_example eric$ echo Erik > file5.txt eric:hashes_example eric$ git hash-object file5.txt 63ae94dae6067d9683cc3a9cea87f8fb388c0e80
When Git stores the contents of file5.txt
,
it will realize that it already has a copy of that data. There
is no need to store it again. Hooray! Git just saved us four
bytes of storage space! (Keep in mind that instead of “Erik”, these
two files could contain a gigabyte of video, which would imply a somewhat
more motivating space savings.) This process is called
deduplication.
This is deeply neato, but what would have happened if file5.txt
did not contain “Erik” but somehow happened to still have a SHA-1 hash of
63ae94dae606…?
According to the pigeonhole principle[46], this is theoretically possible. When a cryptographic hash algorithm generates the same digest for two different pieces of data, we call that a collision.
If a collision were to happen in this situation, we would have some pretty
big problems. When the DVCS is asked to store the contents of
file5.txt
(which does not contain “Erik” but which somehow does have a SHA-1 hash of 63ae94dae606…), it would
incorrectly
conclude that it already has a copy of that data. So the real contents of
file5.txt
would be discarded. Future attempts to
retrieve the contents of that file would erroneously return “Erik”.
Because of this, it is rather important that the DVCS never encounter two different pieces of data which have the same digest. Fortunately, good cryptographic hash functions are designed to make such collisions extremely unlikely.
And just how unlikely is that?
Your chances of winning the Powerball lottery are far better than finding a hash collision. After all, lotteries often have actual winners. The probability of a hash collision is more like a lottery that has been running since prehistoric times and has never had a winner and will probably not have a winner for billions of years.
It is no accident that “Eric”, “Erik”, “eric”, and “Eirc” have hash values that are so different. Cryptographic hash algorithms are intentionally designed to ensure that two similar pieces of data have digests which are not similar.
The likelihood of accidentally finding a collision is related to the bit length of the hash. Specifically, the average number of evaluations necessary to find a collision is 2(bit_length/2).[47] So, if we are trying to find two pieces of data which have the same SHA-1 hash, we could expect to be searching through 280 pieces of data. If we check one million hashes per second, we’ll probably find a collision in about 38 billion years.
Unsurprisingly, no one has ever found a SHA-1 collision.
Note that these probabilities apply to the situation where a hash collision is found accidentally, roughly equivalent to the notion of somebody who is just checking random combinations to see if a collision happens to show up. But what if somebody is being a bit more intentional, searching for a collision using a better method than just being random? Surely this search won’t take as long if we’re being smart about it, right?
Well, no. That’s part of the definition of a good cryptographic hash algorithm: There is no better method. If there were, then the hash would be considered “broken”.
This is fairly important for a DVCS. For example, consider the situation where somebody has access to a repository containing source code for a payroll system. Their goal is to alter the source code such that they will get extra money on payday.
If they can take a source file and then find an altered version of that file which has the same SHA-1 hash, they might be able to achieve their goal. Because the SHA-1 hash matches, it is quite likely that they could store their altered version in the repository without anyone noticing.
But with a strong cryptographic hash function, it is virtually impossible to find any string of bytes which have the same SHA-1 hash as the original file. And it is even less likely that they could find an altered version which accomplishes the goal of giving them more money, or even compiles without errors.
Incidentally, SHA-1 is actually considered broken. For security-oriented applications, it is obsolete and should generally not be used anymore. However, let me explain a bit more about what cryptographers mean when they say that SHA-1 is broken.
SHA-1 is considered broken because somebody found a smarter way to search for a collision, a method which is more effective than just trying random combinations over and over as fast as you can. But that doesn’t mean that finding a collision is easy. It simply means that the search for a collision in SHA-1 should take less time than it is theoretically supposed to take. Instead of the full 80 bits of strength that we would expect SHA-1 to have, it actually has about 51 bits of strength. That means that instead of 38 billion years, we should expect to find a collision in about 70 years.
But still, 70 years is a long time. It remains the case that nobody has ever found a collision in SHA-1.
Nonetheless, there are some who will feel safer using a stronger hash algorithm. This is why we decided to give Veracity support for SHA-2 and Skein, both of which allow for 256 bits or more and neither of which has been broken. At 256 bits, the search for a collision is going to take a long time. Instead of one million attempts per second, let’s do a trillion. And let’s assume that there are 6 billion people on Earth and every one of them has a computer and each of us are doing a trillion checks per second. At that rate, it should take us around 2 trillion years to find a collision.