Veracity: DAGs and Data

Veracity is written in C (the core libraries) and JavaScript (the web applications). It is primarily a command-line application (vv) but also contains a built-in web server and web-based user interface.

I am using Veracity for version control as I write this book. So in the following examples, I’m just going to crawl through the guts of my book repository. A little information up-front:

The Veracity scripting interpreter is called vscript. The scripting language is JavaScript, extended with a bunch of hooks into the Veracity libraries.
The name of my repository instance is book2.
In general, Veracity stores everything in JSON.

DAGs and Blobs

A Veracity repository stores two kinds of things: DAGs and blobs. First let’s talk about DAGs.

A DAG is used to represent the version history of something. Each node of the DAG represents one version, with one or more arrows pointing to the version(s) from which that node was derived. A DAG has one root node.^[49] If a DAG has just one leaf node, then we know without ambiguity which version is the latest.

Veracity supports two kinds of DAGs:

A tree DAG keeps the version history of a directory structure from a filesystem. Each node of the DAG represents one version of the whole tree.
A database (or “db”) DAG keeps the version history of a database, or a list of records. Each node of the DAG represents one state of the complete database.

A repository can have many database DAGs, each with a different purpose, distinguished by a numeric ID we call a dagnum.

Here’s a vscript snippet which lists all the DAGs in a repository:

var r = sg.open_repo("book2");
var a = r.list_dags();
r.close();
print(sg.to_json__pretty_print(a));

When I run this script, I get:

eric:~ eric$ vscript list_dags.js
[
    "0000000010101042",
    "0000000010101052",
    "0000000010102062",
    "0000000010102072",
    "0000000010201001",
    "0000000010201011",
    "00000000102021c2",
    "00000000102021d2",
    "00000000102031c2",
    "00000000102031d2",
    "00000000102040c2",
    "00000000102040d2",
    "00000000102051c2",
    "00000000102051d2",
    "00000000102071c2",
    "00000000102071d2",
    "0000000010301002",
    "0000000010301012",
    "0000000010302002",
    "0000000010302012"
]

Well, that’s not very friendly, is it? All those hex numbers! And how can there be 20 DAGs in this repository, anyway?

Actually, there are only 10. Sort of. What we’ve got here are 10 “real” DAGs, each of which has an audit DAG.

For every changeset in every non-audit DAG, an audit record is added (to its audit DAG) containing the UTC timestamp (taken from the local machine’s clock) and the userid of whoever committed it.

If you look closely, the audit DAGs are evident here because they’re the ones where the second digit from the right is odd. These are hex digits, so d counts as odd and c as even.
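The odd-digit observation is easy to check mechanically. Here is a minimal sketch in plain JavaScript (run with Node, not vscript). The helper name isAuditDagnum is hypothetical, and the rule it encodes is only the pattern visible in the listing above, not a statement about how Veracity actually lays out the bits of a dagnum.

```javascript
// Dagnums copied from the listing above.
var dagnums = [
    "0000000010101042", "0000000010101052",
    "0000000010102062", "0000000010102072",
    "0000000010201001", "0000000010201011",
    "00000000102021c2", "00000000102021d2",
    "00000000102031c2", "00000000102031d2",
    "00000000102040c2", "00000000102040d2",
    "00000000102051c2", "00000000102051d2",
    "00000000102071c2", "00000000102071d2",
    "0000000010301002", "0000000010301012",
    "0000000010302002", "0000000010302012"
];

// Hypothetical helper: true when the second hex digit from the right is odd.
function isAuditDagnum(dagnum) {
    var digit = parseInt(dagnum.charAt(dagnum.length - 2), 16);
    return digit % 2 === 1;
}

var audit = dagnums.filter(isAuditDagnum);
var real = dagnums.filter(function (d) { return !isAuditDagnum(d); });

console.log(real.length + " real DAGs, " + audit.length + " audit DAGs");
// prints "10 real DAGs, 10 audit DAGs"
```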

The purpose of each DAG can be found by looking at the bits in the dagnum while reading a particularly tedious section of the Veracity source code. I’ll spare you the trouble. Here is a description of all 10 DAGs:

dagnum	Description
`0000000010101042`	Areas (db)
`0000000010102062`	Users (db)
`0000000010201001`	Version control (tree)
`00000000102021c2`	VC Comments (db)
`00000000102031c2`	VC Stamps (db)
`00000000102040c2`	VC Tags (db)
`00000000102051c2`	VC Named branches (db)
`00000000102071c2`	VC Hooks (db)
`0000000010301002`	Work items (db)
`0000000010302002`	Builds (db)

As you can see, the db DAGs have the tree DAG outnumbered, 9 to 1. In fact, those 10 audit DAGs are db DAGs as well. So we’ve got 19 db DAGs and 1 tree DAG. This is fairly typical for a Veracity repository. The source tree itself is filesystem-oriented data, but most other data fits better into a record-with-fields design. Veracity uses db DAGs to track lots of different stuff.

Six of the DAGs in this list are related to version control. There is the tree itself, and then we have one DAG each to keep track of comments, stamps, tags, named branches, and hooks.

The users DAG is used to keep track of user accounts. The areas DAG can be used to keep track of which DAGs logically go together. All six of the version control (VC) DAGs are in one area. Work items and builds are in another area.

Before we go on, we should tidy up a bit. We’ve got enough big long hex numbers around, so let’s get rid of the ones for the dagnums. The scripting API has defined constants for all the primary dagnums.

eric:~ eric$ vscript
vscript> print(sg.dagnum.VERSION_CONTROL)
0000000010201001
vscript> ^D

Now let’s dive into the version control DAG itself. The way a DAG works is that the most recent information is in the leaves. Here’s a little script to list all the leaf nodes for the version control tree DAG:

var r = sg.open_repo("book2");
var leaves = r.fetch_dag_leaves(sg.dagnum.VERSION_CONTROL);
r.close();
print(sg.to_json__pretty_print(leaves));

Running the script, I get one result, indicating that my repository has no branching going on:

eric:~ eric$ vscript fetch_dag_leaves.js
[
    "f10628e5792251dc886f600a6ae8610a38ac2204"
]

The ID of a dagnode is also the ID of its changeset blob. Which reminds me, let’s talk about blobs.

A blob is just a sequence of bytes. It can be empty, or it can have many gigabytes in it. The length of a blob is represented as a 64-bit integer, so Veracity can handle any size blob you’ve got.

A repository provides key-value storage for blobs. The key for each blob is the cryptographic hash of its contents. The repository in this example is configured to use SHA-1, the same hash function used by Mercurial and Git.

In the Veracity code, we use the word HID, short for “hash ID”, to refer to the hash of a blob.

Whenever you retrieve a blob (in full), the HID is verified.
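To make the key-value idea concrete, here is a small sketch of a content-addressed store in plain JavaScript for Node. BlobStore and hashOf are hypothetical names, the store lives in memory, and the sketch assumes the HID is simply the SHA-1 of the raw bytes. Veracity’s real storage layer may compute or store things differently.

```javascript
var crypto = require("crypto");

// Assumption: the HID is the plain SHA-1 of the blob's bytes, in hex.
function hashOf(bytes) {
    return crypto.createHash("sha1").update(bytes).digest("hex");
}

// Hypothetical in-memory blob store keyed by HID.
function BlobStore() {
    this.blobs = {};
}

// Store a blob and return its HID. Storing the same bytes twice is harmless.
BlobStore.prototype.store = function (bytes) {
    var hid = hashOf(bytes);
    this.blobs[hid] = Buffer.from(bytes);
    return hid;
};

// Fetch a blob in full, verifying that its contents still match the HID.
BlobStore.prototype.fetch = function (hid) {
    var bytes = this.blobs[hid];
    if (bytes === undefined) {
        throw new Error("no such blob: " + hid);
    }
    if (hashOf(bytes) !== hid) {
        throw new Error("blob is corrupt: " + hid);
    }
    return bytes;
};

var store = new BlobStore();
var hid = store.store(Buffer.from("hello"));
console.log(hid);                         // aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
console.log(store.fetch(hid).toString()); // hello
```

Because the key is derived from the contents, two identical blobs always land under the same key, and any corruption shows up as a mismatch on retrieval.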

There are two kinds of blobs.

User data. Every file you store under version control becomes a blob. Actually each version of that file becomes a blob.
Program data. Program data is used to store things that Veracity needs to remember, such as the contents of a directory, or database records, or changeset objects. All program data is stored as JSON.

When creating a new changeset in a DAG, we create a serialized changeset record. The HID of that record becomes the ID of the new dagnode.

Changesets

So, when we ask for the dagnode IDs for the leaf nodes, the resulting IDs can be used to retrieve the changeset blob. Here is what that changeset blob looks like:

eric:book2 eric$ vv dump_json f10628e5792251dc886f600a6ae8610a38ac2204
{
  "dagnum" : "0000000010201001",
  "generation" : 91,
  "parents" : 
  [
    "c821cfbc8964db9958d1278a5e4e2947462730e9"
  ],
  "tree" : 
  {
    "changes" : 
    {
      "c821cfbc8964db9958d1278a5e4e2947462730e9" : 
      {
        "g3a3b61269bea4392951a785dcf7efbde40e5331a56db11e0a84b60fb42f09aca" : 
        {
          "hid" : "40c1af01a8c0cea66ecb99529befbd8e7a004c42"
        },
        "g8a7471f886864c04a836d0c4621df781a2e67bbe572611e08f5d60fb42f09aca" : 
        {
          "hid" : "a3656282d8c467f00b21d83317d2de0374af761c"
        }
      }
    },
    "root" : "c86c077f1f0c165f90ca7715b4a41d8281fc5feb"
  },
  "ver" : 1
}

As I mentioned before, there are two kinds of DAGs, db and tree. The version control DAG is, of course, a tree DAG, so its changeset records have a “tree” section. The db changesets look a little different as you’ll see later.

dagnum identifies the DAG to which this changeset belongs.
generation is an integer which indicates the distance from this dagnode to the root. The root dagnode has a generation of 1. Every other node has a generation which is 1 + the maximum generation of its parents.
ver defines the version number of the format of the changeset record.
parents is an array of references to the parents of this dagnode.
tree.changes contains one entry for each parent. Each such entry contains a list of everything in this dagnode which has changed with respect to that parent.
tree.root contains the HID of the treenode for the root of the tree.
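The generation rule is easiest to see on a tiny DAG. The sketch below is plain JavaScript with a made-up five-node graph. The node names and the parentsOf table are hypothetical, and a real implementation would read the stored generation field rather than recompute it.

```javascript
// Hypothetical DAG: each key is a dagnode ID, each value lists its parents.
var parentsOf = {
    "a": [],          // the root
    "b": ["a"],
    "c": ["a"],
    "d": ["b"],
    "e": ["d", "c"]   // a merge of d and c
};

// The root has generation 1; every other node is 1 + the max of its parents.
function generation(id) {
    var parents = parentsOf[id];
    if (parents.length === 0) {
        return 1;
    }
    var max = 0;
    parents.forEach(function (p) {
        max = Math.max(max, generation(p));
    });
    return 1 + max;
}

console.log(generation("a")); // 1
console.log(generation("d")); // 3
console.log(generation("e")); // 4
```

The merge node e takes its generation from d, the longer of its two histories. By the same rule, the changeset shown above (generation 91, one parent) has a parent at generation 90.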

So, what’s a treenode?

Treenodes

In a version control tree, each of the user’s files is stored as a blob. But each directory is a treenode. Here’s one:

eric:book2 eric$ vv dump_json c86c077f1f0c165f90ca7715b4a41d8281fc5feb | expand -t 2
{
  "tne" : 
  {
    "g3a3b61269bea4392951a785dcf7efbde40e5331a56db11e0a84b60fb42f09aca" : 
    {
      "hid" : "40c1af01a8c0cea66ecb99529befbd8e7a004c42",
      "name" : "@",
      "type" : 2
    }
  },
  "ver" : 1
}

This treenode is actually what we call the “super-root”. It’s an extra level of tree hierarchy that the user never sees, so that we can record metadata about the user’s root. So let’s dive one level deeper.

eric:book2 eric$ vv dump_json 40c1af01a8c0cea66ecb99529befbd8e7a004c42 | expand -t 2
{
  "tne" : 
  {
    "g0ae054064de54d4b88db6d8b26ad4d79688421e0595811e0804960fb42f09aca" : 
    {
      "bits" : 1,
      "hid" : "56eedb1343e12183875d14a1ec3d1a4098d49a25",
      "name" : "g",
      "type" : 1
    },
    "g8a7471f886864c04a836d0c4621df781a2e67bbe572611e08f5d60fb42f09aca" : 
    {
      "hid" : "a3656282d8c467f00b21d83317d2de0374af761c",
      "name" : "version_control_howto.xml",
      "type" : 1
    },
    "g8e481f4af9d5450a83fc77cca7f0bc07a70fdfa466e511e0837160fb42f09aca" : 
    {
      "hid" : "9e65873dbc6d7c8579392a6acc9a856d25bb0c46",
      "name" : "docbook-xsl-1.76.1",
      "type" : 2
    },
    "gb45372a549bb4044b65b788212d0828af338a140580311e08ced60fb42f09aca" : 
    {
      "hid" : "85e06e062d72def73dce1897bdcef9531ec87526",
      "name" : "images",
      "type" : 2
    },
    "ge502a109a22e44c099d66014fb5ecd1d9477f9025d3b11e0b7a360fb42f09aca" : 
    {
      "hid" : "19ba6f1d215bfad27181c4113ce80985dae7fdeb",
      "name" : "custom_fo.xsl",
      "type" : 1
    }
  },
  "ver" : 1
}

This is a more illustrative treenode. Basically its tne object (short for tree node entry) contains a list of entries, one for each item in the directory.

This directory has five entries in it:

g is a bash script I use to generate a PDF.
version_control_howto.xml is the DocBook file containing all my content.
docbook-xsl-1.76.1 is a copy of the DocBook XSL stylesheets.
images is a subdirectory containing all the artwork for the book.
custom_fo.xsl is my XSL customization layer.

For each entry, the treenode knows the HID of the blob containing the contents of that item. In the case of a file, such as custom_fo.xsl, the HID refers to the blob that contains the actual contents of the file. In the case of a subdirectory like images, the HID refers to another treenode.
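Walking from the super-root down to a file is just a chain of lookups. Here is a sketch in plain JavaScript that uses the two treenodes dumped above as an in-memory table. The function names findEntry and resolve are hypothetical, and the meaning of the type values (2 for a directory, 1 for a file) is an assumption inferred from the dumps.

```javascript
var SUPER_ROOT = "c86c077f1f0c165f90ca7715b4a41d8281fc5feb";
var TYPE_DIRECTORY = 2; // assumption inferred from the dumps above

// Treenodes copied from the dumps above, keyed by HID.
var treenodes = {
    "c86c077f1f0c165f90ca7715b4a41d8281fc5feb": { tne: {
        "g3a3b61269bea4392951a785dcf7efbde40e5331a56db11e0a84b60fb42f09aca":
            { hid: "40c1af01a8c0cea66ecb99529befbd8e7a004c42", name: "@", type: 2 }
    }, ver: 1 },
    "40c1af01a8c0cea66ecb99529befbd8e7a004c42": { tne: {
        "g0ae054064de54d4b88db6d8b26ad4d79688421e0595811e0804960fb42f09aca":
            { bits: 1, hid: "56eedb1343e12183875d14a1ec3d1a4098d49a25", name: "g", type: 1 },
        "g8a7471f886864c04a836d0c4621df781a2e67bbe572611e08f5d60fb42f09aca":
            { hid: "a3656282d8c467f00b21d83317d2de0374af761c", name: "version_control_howto.xml", type: 1 },
        "g8e481f4af9d5450a83fc77cca7f0bc07a70fdfa466e511e0837160fb42f09aca":
            { hid: "9e65873dbc6d7c8579392a6acc9a856d25bb0c46", name: "docbook-xsl-1.76.1", type: 2 },
        "gb45372a549bb4044b65b788212d0828af338a140580311e08ced60fb42f09aca":
            { hid: "85e06e062d72def73dce1897bdcef9531ec87526", name: "images", type: 2 },
        "ge502a109a22e44c099d66014fb5ecd1d9477f9025d3b11e0b7a360fb42f09aca":
            { hid: "19ba6f1d215bfad27181c4113ce80985dae7fdeb", name: "custom_fo.xsl", type: 1 }
    }, ver: 1 }
};

// Find the entry with the given name inside one treenode, or null.
function findEntry(treenode, name) {
    var gids = Object.keys(treenode.tne);
    for (var i = 0; i < gids.length; i++) {
        if (treenode.tne[gids[i]].name === name) {
            return treenode.tne[gids[i]];
        }
    }
    return null;
}

// Resolve a path such as "@/custom_fo.xsl" to the HID of its blob.
function resolve(rootHid, path) {
    var hid = rootHid;
    var parts = path.split("/");
    for (var i = 0; i < parts.length; i++) {
        var treenode = treenodes[hid];
        if (!treenode) {
            throw new Error("treenode not loaded: " + hid);
        }
        var entry = findEntry(treenode, parts[i]);
        if (!entry) {
            throw new Error("not found: " + parts[i]);
        }
        if (i < parts.length - 1 && entry.type !== TYPE_DIRECTORY) {
            throw new Error("not a directory: " + parts[i]);
        }
        hid = entry.hid;
    }
    return hid;
}

console.log(resolve(SUPER_ROOT, "@/custom_fo.xsl"));
// 19ba6f1d215bfad27181c4113ce80985dae7fdeb
```

Only the two treenodes shown in this section are in the table, so resolving anything below images would stop with a "treenode not loaded" error.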

The blob a3656282d8c467f00b21d83317d2de0374af761c contains (one version of) the DocBook content of this book.

DB Records

So where’s the log message on this commit? For that we have to look in a different DAG. Using the same technique as above, we find that the leaf for the version control comments DAG is 053da8cbbd986b14dc06b3d8dab08be3388266ff. Let’s dump that changeset and see what it looks like.

eric:book2 eric$ vv dump_json 053da8cbbd986b14dc06b3d8dab08be3388266ff | expand -t 2
{
  "dagnum" : "00000000102021c2",
  "db" : 
  {
    "changes" : 
    {
      "9ff7c857361d30d6a51b9fcf9f5ddbff9940d4e1" : 
      {
        "add" : 
        {
          "fb96b2c70dcca6a82e6b8ee222c26395cccf4d42" : 0
        }
      }
    }
  },
  "generation" : 91,
  "parents" : 
  [
    "9ff7c857361d30d6a51b9fcf9f5ddbff9940d4e1"
  ],
  "ver" : 1
}

This is a db changeset instead of a tree changeset. It contains a “db” section, which, again, contains one delta against each parent. That delta indicates that one new record was added. Let’s dump the blob for the new record and see what it looks like.

eric:book2 eric$ vv dump_json fb96b2c70dcca6a82e6b8ee222c26395cccf4d42 | expand -t 2
{
  "csid" : "f10628e5792251dc886f600a6ae8610a38ac2204",
  "text" : "committing my changes before I continue writing"
}

And there’s the db record for the comment.^[50] Note that the csid field matches the changeset ID from the version control DAG.

What about the who and when? Once again, we need to check another DAG, the audit DAG for the version control DAG. Its dagnum is 0000000010201011. I grab its only leaf and dump the corresponding changeset record:

eric:book2 eric$ vv dump_json 15bc2d16081d6ad6baeb4c790821d8aeee864d34 | expand -t 2
{
  "dagnum" : "0000000010201011",
  "db" : 
  {
    "changes" : 
    {
      "3a4b6f6222d5ae761ad375eb1c7aa8a5f9ba0390" : 
      {
        "add" : 
        {
          "c52ff03833aeb8f180583ce2fc7ea7bbf7e392bf" : 0
        }
      }
    }
  },
  "generation" : 92,
  "parents" : 
  [
    "3a4b6f6222d5ae761ad375eb1c7aa8a5f9ba0390"
  ],
  "ver" : 1
}

Here is the new record:

eric:book2 eric$ vv dump_json c52ff03833aeb8f180583ce2fc7ea7bbf7e392bf | expand -t 2
{
  "csid" : "f10628e5792251dc886f600a6ae8610a38ac2204",
  "timestamp" : "1304457549322",
  "userid" : "gc580073ae5164a61bd92c3241bf3d9f457b0b01056db11e0995060fb42f09aca"
}

The value for userid isn’t very intuitive, is it? That is actually the record ID for the user record, located over in a separate DAG.

Here is a script to dump all user records:

eric:~ eric$ cat u.js
var repo = sg.open_repo("book2");
var zs = new zingdb(repo, sg.dagnum.USERS);
var recs = zs.query('user', ['*']);
repo.close();
print(sg.to_json__pretty_print(recs));

Running the script produces the following output:

eric:~ eric$ vscript u.js | expand -t 2
[
  {
    "name" : "eric",
    "prefix" : "X",
    "recid" : "gc580073ae5164a61bd92c3241bf3d9f457b0b01056db11e0995060fb42f09aca"
  }
]

So at last you can see that it was me who did the commit shown above.
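Putting the pieces together, a log entry is really a join across three DAGs on csid and userid. The sketch below does that join in plain JavaScript over the records dumped above. The function describeChangeset is hypothetical, and treating the timestamp as milliseconds since the Unix epoch is an assumption, although it agrees with the file dates shown elsewhere in this section.

```javascript
// Records copied from the dumps above.
var comments = [
    { csid: "f10628e5792251dc886f600a6ae8610a38ac2204",
      text: "committing my changes before I continue writing" }
];
var audits = [
    { csid: "f10628e5792251dc886f600a6ae8610a38ac2204",
      timestamp: "1304457549322",
      userid: "gc580073ae5164a61bd92c3241bf3d9f457b0b01056db11e0995060fb42f09aca" }
];
var users = [
    { name: "eric", prefix: "X",
      recid: "gc580073ae5164a61bd92c3241bf3d9f457b0b01056db11e0995060fb42f09aca" }
];

// Return the first element of list whose field equals value, or null.
function findBy(list, field, value) {
    for (var i = 0; i < list.length; i++) {
        if (list[i][field] === value) {
            return list[i];
        }
    }
    return null;
}

// Hypothetical helper: gather who, when, and why for one changeset.
function describeChangeset(csid) {
    var comment = findBy(comments, "csid", csid);
    var audit = findBy(audits, "csid", csid);
    var user = audit ? findBy(users, "recid", audit.userid) : null;
    return {
        csid: csid,
        who: user ? user.name : null,
        // Assumption: the timestamp is milliseconds since the Unix epoch.
        when: audit ? new Date(parseInt(audit.timestamp, 10)).toISOString() : null,
        comment: comment ? comment.text : null
    };
}

var entry = describeChangeset("f10628e5792251dc886f600a6ae8610a38ac2204");
console.log(entry.who + " at " + entry.when + ": " + entry.comment);
```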

Templates

Now let’s dive a bit deeper. A db DAG contains a “database”, or a set of records. These records must follow a template. That template is basically like a schema for the database. It describes one or more record types, specifying the fields for each record type.

Here is the template for the version control comments DAG:

{
    "version" : 1,
    "rectypes" :
    {
        "item" :
        {
            "fields" : 
            {
                "csid" :
                {
                    "datatype" : "string",
                    "constraints" :
                    {
                        "required" : true,
                        "index" : true
                    }
                },
                "text" :
                {
                    "datatype" : "string",
                    "constraints" :
                    {
                        "required" : true,
                        "maxlength" : 16384,
                        "full_text_search" : true
                    }
                }
            }
        }
    }
}

It is illegal to have a template where merge can fail. The template above satisfies that rule because it has no record ID, which means that records cannot be modified and that unique constraints are not allowed. This template is a rather simplistic example.
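To get a feel for what a template constrains, here is a rough validator in plain JavaScript that checks a record against the fields section of the comments template. The function validate is hypothetical, and it covers only required, datatype, and maxlength as their names suggest. How Veracity itself enforces templates is not shown here.

```javascript
// Field definitions copied from the comments template above.
var commentFields = {
    csid: {
        datatype: "string",
        constraints: { required: true, index: true }
    },
    text: {
        datatype: "string",
        constraints: { required: true, maxlength: 16384, full_text_search: true }
    }
};

// Hypothetical validator: returns a list of problems (empty means valid).
function validate(record, fields) {
    var problems = [];
    Object.keys(fields).forEach(function (name) {
        var def = fields[name];
        var constraints = def.constraints || {};
        var value = record[name];
        if (value === undefined) {
            if (constraints.required) {
                problems.push(name + " is required");
            }
            return;
        }
        if (def.datatype === "string" && typeof value !== "string") {
            problems.push(name + " must be a string");
            return;
        }
        if (constraints.maxlength !== undefined &&
                value.length > constraints.maxlength) {
            problems.push(name + " is longer than " + constraints.maxlength);
        }
    });
    return problems;
}

console.log(validate({
    csid: "f10628e5792251dc886f600a6ae8610a38ac2204",
    text: "committing my changes before I continue writing"
}, commentFields)); // []

console.log(validate({ csid: "f10628e5792251dc886f600a6ae8610a38ac2204" },
    commentFields)); // [ 'text is required' ]
```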

Here’s a slightly more complicated example, the template for version control tags:

{
    "version" : 1,
    "rectypes" :
    {
        "item" :
        {
            "merge" :
            {
                "merge_type" : "field",
                "auto" : 
                [
                    {
                        "op" : "most_recent"
                    }
                ]
            },
            "fields" : 
            {
                "csid" :
                {
                    "datatype" : "string",
                    "constraints" :
                    {
                        "required" : true,
                        "index" : true
                    }
                },
                "tag" :
                {
                    "datatype" : "string",
                    "constraints" :
                    {
                        "required" : true,
                        "index" : true,
                        "unique" : true,
                        "maxlength" : 256
                    },
                    "merge" :
                    {
                        "uniqify" : 
                        {
                            "op" : "append_userprefix_unique",
                            "num_digits" : 2,
                            "which" : "least_impact"
                        }
                    }
                }
            }
        }
    }
}

Like a comment, a tag has just two fields: the changeset ID to which it applies and a string. But for a tag, that string is required to be unique, which introduces the possibility that the unique constraint could be violated on a merge. So Veracity requires us to provide a way to uniqify, that is, to resolve the violation of the unique constraint automatically as the merge is happening.
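The general idea of uniqifying can be sketched in a few lines. This is purely illustrative: the text above does not say what append_userprefix_unique actually produces, so the function name uniqify, the underscore separator, and the numbering scheme below are all made up, and the "which" setting is not modeled at all.

```javascript
// Hypothetical: rename a colliding tag by appending the user's prefix and a
// zero-padded counter until the result is not already taken.
function uniqify(tag, existingTags, userPrefix, numDigits) {
    if (existingTags.indexOf(tag) === -1) {
        return tag; // no collision, nothing to do
    }
    var limit = Math.pow(10, numDigits);
    for (var n = 0; n < limit; n++) {
        var suffix = String(n);
        while (suffix.length < numDigits) {
            suffix = "0" + suffix;
        }
        var candidate = tag + "_" + userPrefix + suffix;
        if (existingTags.indexOf(candidate) === -1) {
            return candidate;
        }
    }
    throw new Error("could not uniqify " + tag);
}

console.log(uniqify("v1.0", ["v0.9"], "X", 2));             // v1.0
console.log(uniqify("v1.0", ["v1.0"], "X", 2));             // v1.0_X00
console.log(uniqify("v1.0", ["v1.0", "v1.0_X00"], "X", 2)); // v1.0_X01
```

The point is only that the rename is deterministic and needs no human input, which is what lets the merge always succeed.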

Repository Storage

Now let’s look at how all this stuff is actually stored.

The repository API presents an abstraction of a repository instance. Callers of the API remain unaware of certain details of exactly how dagnodes and blobs are being stored. These details are left to the storage implementation, thus allowing different tradeoffs to be used for different situations.

In Veracity 1.0, the only shipping implementation of this repository API is called FS3. The “FS” stands for “filesystem”, representing the fact that blobs are simply stored in files (although not one blob per file). The “3” simply means that it is the third incarnation—FS1 and FS2 did not survive the development process.

FS3 stores repositories in the “closet”, which by default is a directory in your home directory named .sgcloset.

eric:book2 eric$ cd ~/.sgcloset/

eric:.sgcloset eric$ ls -l
total 496
-rw-r--r--  1 eric  staff   60416 May  3 18:02 descriptors.jsondb
drwxr-xr-x  4 eric  staff     136 May  3 18:02 repo
-rw-r--r--  1 eric  staff  190464 Apr 24 19:35 settings.jsondb

eric:.sgcloset eric$ cd repo

eric:repo eric$ ls -l
total 0
drwxr-xr-x  22 eric  staff  748 May  3 15:04 alpo_858b
drwxr-xr-x  16 eric  staff  544 May  3 18:00 book2_d2a1

eric:repo eric$ cd book2_d2a1/

eric:book2_d2a1 eric$ ls -l
total 771928
-rw-r--r--   1 eric  staff      20480 Mar 25 07:28 0000000010101042.dbndx
-rw-r--r--   1 eric  staff      28672 Mar 25 07:28 0000000010102062.dbndx
-rw-r--r--   1 eric  staff    3390464 May  3 16:19 0000000010201001.treendx
-rw-r--r--   1 eric  staff      58368 May  3 16:19 0000000010201011.dbndx
-rw-r--r--   1 eric  staff     118784 May  3 16:19 00000000102021c2.dbndx
-rw-r--r--   1 eric  staff      19456 Mar 25 07:28 00000000102031c2.dbndx
-rw-r--r--   1 eric  staff      21504 Mar 25 07:28 00000000102040c2.dbndx
-rw-r--r--   1 eric  staff      75776 May  3 16:19 00000000102051c2.dbndx
-rw-r--r--   1 eric  staff      18432 Mar 25 07:28 00000000102071c2.dbndx
-rw-r--r--   1 eric  staff      99328 Mar 25 07:28 0000000010301002.dbndx
-rw-r--r--   1 eric  staff      58368 Mar 25 07:28 0000000010302002.dbndx
-rw-r--r--   1 eric  staff  390010297 May  3 16:19 000001
drwxr-xr-x  62 eric  staff       2108 May  3 16:19 f
-rw-r--r--   1 eric  staff    1283072 May  3 16:19 fs3.sqlite3

These files are my book repository. Actually, two of them matter more than the others.

All the blobs are stored in the file called 000001. FS3 stores blobs by appending them to this file. When the file gets to be a gigabyte, it starts a new file called 000002.

Reflecting a strong bias toward reliability, the FS3 data file is append-only. Once a blob has been appended, it is never altered. Furthermore, Veracity’s repository API has no way to remove a blob or a dagnode.

The other important file is fs3.sqlite3. As its name suggests, this is a SQLite^[51] database. It contains two things:

The list of blobs, and for each blob, the offset/length of where to find it in the data file.
The list of dagnodes.
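The split between an append-only data file and a separate offset/length index can be modeled in a few lines. This sketch is plain JavaScript for Node and keeps everything in memory. DataFile is a hypothetical name, the keys are placeholder strings rather than real HIDs, and neither the on-disk format nor the rollover to a second file is modeled.

```javascript
// Hypothetical in-memory model of an append-only data file plus its index.
function DataFile() {
    this.bytes = Buffer.alloc(0); // stands in for the file 000001
    this.index = {};              // stands in for the blob list in fs3.sqlite3
}

// Append a blob under its key. Existing bytes are never rewritten.
DataFile.prototype.append = function (hid, blob) {
    if (Object.prototype.hasOwnProperty.call(this.index, hid)) {
        return this.index[hid]; // already stored
    }
    var entry = { offset: this.bytes.length, length: blob.length };
    this.bytes = Buffer.concat([this.bytes, blob]);
    this.index[hid] = entry;
    return entry;
};

// Read a blob back using only its offset and length.
DataFile.prototype.read = function (hid) {
    if (!Object.prototype.hasOwnProperty.call(this.index, hid)) {
        throw new Error("no such blob: " + hid);
    }
    var entry = this.index[hid];
    return this.bytes.subarray(entry.offset, entry.offset + entry.length);
};

var file = new DataFile();
file.append("hid-1", Buffer.from("abc"));
var second = file.append("hid-2", Buffer.from("defgh"));

console.log(second);                          // { offset: 3, length: 5 }
console.log(file.read("hid-2").toString());   // defgh
```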

All of the other files in the repository directory are somewhat secondary.

Most of them are repository indexes, with file names ending in ndx. We can think of these in the same way that we think about indexes in a SQL database. They do not contain actual data; they exist simply to make certain operations faster. It is possible to delete all the repository indexes and reconstruct them using nothing more than the data file(s) and the fs3.sqlite3 file.

Note that in some situations it is legal for a Veracity repository instance to have no indexes at all. This capability is helpful for setting up a very scalable central server.

For Veracity 1.0, repository indexes are not transferred by clone, push, or pull. Each repository instance is responsible for maintaining its own indexes.

Blob Encodings

The Veracity repository API allows a blob to be stored in one of three “encodings”.

full — the exact bytes of the blob are all stored
zlib — the blob is stored compressed
vcdiff — the blob is stored as a vcdiff delta relative to another blob

For performance, FS3 stores all incoming new blobs in the zlib encoding.
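The difference between the full and zlib encodings is just whether the stored bytes have been compressed. Here is a small round trip using Node’s built-in zlib module, in plain JavaScript. The sample text is made up, and the exact framing FS3 uses around compressed blobs is not shown in this section. The vcdiff encoding is left out because Node has no built-in support for it.

```javascript
var zlib = require("zlib");

// A repetitive sample blob, so that compression has something to work with.
var full = Buffer.from("version control ".repeat(100));

// What a zlib-encoded copy of the blob might hold.
var stored = zlib.deflateSync(full);

// Decoding gives back the exact original bytes.
var restored = zlib.inflateSync(stored);

console.log(full.length);                 // 1600
console.log(stored.length < full.length); // true
console.log(restored.equals(full));       // true
```

Whatever the encoding, the caller gets the same bytes back, which is why an encoding can be swapped during a clone without anyone downstream noticing.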

Once the blob is stored in a given repository instance, its encoding cannot be changed. But its encoding can be altered in the course of a clone operation. While the clone command copies the blob from one instance of the repository to another, it can re-encode the blob as it passes through. For example, the following Veracity command produces a deltified copy of a repository by using the --pack option with the clone command.

~ harry$ vv clone --pack lottery lottery_deltified

And that reminds me that I should say a word or two about Veracity’s implementation of the communication between repository instances.

Similar to the repository API, another API is used to hide the details for clone, push, and pull. Veracity currently includes two implementations of this API, one for local operations and one which works over HTTP.

By default, clone, push, and pull always transfer blobs without changing the encoding. This means that if a blob is in deltified (vcdiff) form, it will be transferred over the network in that form, thus saving network traffic.

^[49]Git allows the DAG to have multiple root nodes. Veracity does not.

^[50]This brief, content-free log message was not a shining example of best practices.

^[51]http://www.sqlite.org/
