2009-03-30 15:48:17
Why is Git so Fast?
In the DVCS world, Git has a reputation for being really fast. I am curious about how Git got this way.
When I started thinking about this question, seven different answers came to my mind. Some of those answers seem more interesting or correct than others.
One: Maybe Git is fast simply because it's a DVCS.
There's probably some truth here. One of the main benefits touted by the DVCS fanatics is the extra performance you get when everything is "local".
But this answer isn't enough. Maybe it explains why Git is faster than Subversion, but it doesn't explain why Git is so often described as being faster than the other DVCSs.
Two: Maybe Git is fast because Linus Torvalds is so smart.
This might very well be correct. But it's not interesting.
Fine. So Linus is smarter than all of us. But how did he use those smarts to make Git so fast? What are the details?
Three: Maybe Git is fast because it's written in C instead of one of those newfangled higher-level languages.
Nah, probably not. Lots of people have written fast software in C#, Java or Python.
And lots of people have written really slow software in traditional native languages like C/C++. Adobe writes most of their stuff in C++, and they don't have any trouble making sure that release N+1 is slower than release N.
Four: Maybe Git is fast because being fast is the primary goal for Git.
This is another one of those high-level answers that is probably correct but doesn't have the kind of details about which I am curious.
Still. Take some time to read through the archives of the Git developers mailing list. These people spend a LOT of time talking about performance issues.
Five: Maybe Git is fast because it does less.
One of my favorite recent blog entries is this piece which claims that the way to make code faster is to have it do less.
Predictably, people came out of the woodwork to say how wrong this guy was. That's what happens to almost any blog entry about performance tuning or optimization. Readers ignore anything correct in the article and quibble about little stuff.
But this guy was essentially correct. One way to make software faster is to make it do less.
For example, the way you get something into the Git index is the "git add" command. Git doesn't scan your working copy for changed files unless you explicitly tell it to. This can be a pretty big performance win for huge trees. Even when you use the "remember the timestamp" trick, detecting modified files in a really big tree can take a noticeable amount of time.
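To make the "remember the timestamp" trick concrete, here is a minimal sketch in Python. It is not Git's actual code, and the names snapshot and changed_files are made up for this illustration. The idea is to cache each file's modification time and size, then later compare fresh stat data against the cache instead of reading file contents.

```python
import os

def snapshot(root):
    """Record (mtime_ns, size) for every file under root.

    Hypothetical helper for illustration; not how Git stores its index.
    """
    cache = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            cache[path] = (st.st_mtime_ns, st.st_size)
    return cache

def changed_files(root, cache):
    """Return sorted paths whose stat data differs from the cache.

    File contents are never read, but every file still costs one
    stat() call. Deleted files are not reported by this sketch.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if cache.get(path) != (st.st_mtime_ns, st.st_size):
                changed.append(path)
    return sorted(changed)
```

Even in this cheap form, the scan walks the whole tree and calls stat() once per file, which is why it can still take a noticeable amount of time on a really big tree.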
Or maybe Git's shortcut for handling renames is faster than doing them more correctly like Bazaar does.
Six: Maybe Git is fast because it doesn't use much external code.
Very often, when you are facing a decision to use somebody else's code or write it yourself, there is a performance tradeoff. Not always, but often. Maybe the third party code is just slower than the code you could write yourself if you had time to do it. Or maybe there is an impedance mismatch between the API of the external library and your own architecture.
This can happen even when the library is very high quality. For example, consider libcurl. This is a great library. Tons of people use it. But it does have one characteristic that will cause performance problems for some users: When using libcurl to fetch an object, it wants to own the buffer. In some situations, this can end up forcing you to use extra memcpys or temporary files. Low-level calls like send() and recv() let the caller own the loop and the buffer because that is the best way to avoid making extra copies of the data on disk or in memory.
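Here is a rough sketch of what "the caller owns the loop and the buffer" looks like. It uses Python's socket.recv_into as a stand-in for recv(); it is not libcurl's API, and the function name read_into is an assumption made for this example. The caller allocates the buffer once and the data lands directly in it, with no intermediate copy.

```python
import socket

def read_into(sock, buf):
    """Fill the caller-owned buffer from sock; return the byte count.

    The caller decides where the bytes go (buf) and drives the loop,
    so no library-owned buffer sits in between.
    """
    view = memoryview(buf)      # slicing a memoryview does not copy
    total = 0
    while total < len(buf):
        n = sock.recv_into(view[total:])
        if n == 0:              # peer closed the connection
            break
        total += n
    return total
```

A caller might pass a bytearray it already plans to hand to the next stage of processing. With a library that owns the buffer, the same data would typically have to be copied out of the library's memory first.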
People make fun of those with NIH Syndrome, but my observation is that folks who suffer from this disorder tend to create faster software, even if they also tend to ship everything late. :-)
Maybe Git is fast because every time they faced one of these "buy vs. build" choices, they decided to just write it themselves.
Seven: Maybe Git isn't really that fast.
If there is one thing I've learned about version control it's that everybody's situation is different. It is quite likely that Git is a lot faster for some scenarios than it is for others.
How does Git handle really large trees? Git was designed primarily to support the efforts of the Linux kernel developers. A lot of people think the Linux kernel is a large tree, but it's really not. Many enterprise configuration management repositories are FAR bigger than the Linux kernel.
Final thoughts
This week's version control blog entry raises more questions than answers. I'm not a Git user, nor have I looked much at its code, so I don't really know why it's so fast. I'm just curious. If you have better answers than mine (and I admit that's a low hurdle), feel free to send them to me or post them in my comments.
But FWIW, I have decided it is time for me to become a Git user. When I was writing about Git a few weeks ago, a lot of Git users kept telling me I just don't get it. I've spent more time thinking about version control implementation and design than most folks, so I tend to think I actually do "get it". But my curiosity is piqued, and I hate to pass up an opportunity to learn something, so I'm going to give it a try. I've got a small project here at SourceGear that I work on part-time with a couple other people. We've decided to switch to Git and see how it goes. I'll let you know what I find out.