patch-based versus tree-based merging: hsenag

hsenag

patch-based versus tree-based merging

Sep 24, 2006 19:51

I don't normally post about deeply technical things I've been working on, but I've been thinking about this for a few days now and wanted somewhere public to record my conclusions.

Almost uniquely amongst distributed version control systems (SO6 being the exception, but I'm not aware of it being in wide use), darcs is patch-based rather than tree-based. But what does this mean in practice?

To get a better handle on this, we need to look at what happens during a merge, the key thing that a version control system needs to do.

All of the tree-based systems are based around three-way merges. Given two repositories A and B to be merged, they pick a common ancestor (more on how later), diff each of them against the ancestor, then adjust one diff for any offset changes implied by the other diff, and apply it to the other repository. (Things become a bit more complicated in the case of merging directory operations such as file moves and renames, but there's nothing conceptually hard and I haven't investigated the details of how this is handled. It's not really important.)

So, how is the ancestor chosen? Clearly it should be some repository state that did actually exist in the histories of A and B, or it would make no sense to use as a basis for the merge. It also should contain all the changes that have already been merged between A and B, because otherwise we will either get a spurious conflict, identical changes have to be silently merged, which can cause problems in the case where we really want a conflict from identical changes.

Unfortunately, it is not necessarily the case that a repository state will exist with such a property. http://revctrl.org/CrissCrossMerge gives some examples of when this can occur. It causes problems for most of the tree-based revision control systems.

In fact, the correct solution is to make up an appropriate ancestor. It should contain precisely the intersection of the sets of changes in A and B. One way to construct it is to merge all the possible LCAs (an LCA of A and B is an ancestor of both A and B that is not itself an ancestor of another common ancestor of A and B). This is what git does. However, I believe its solution goes wrong in the presence of conflicts. Suppose X, Y and Z are three such LCAs, and that they all conflict with each other. Git merges them in non-deterministic order, leaving the conflict markers in the merge results. The order in which they are merged will determine the order of conflict markers. This means that the contents of the base tree used for the merge are non-deterministic, which will make the merge results themselves non-deterministic. I haven't yet tried to actually construct an example of git going wrong, though. Another flaw with git is that if the conflict between two trees that need to be merged is at the directory level (e.g. between two different renamings of the same file), it just gives up.

On the other hand, one of the fundamental properties of darcs is that any repository with the same set of patches must behave the same. When darcs does a merge, it implicitly constructs the correct ancestor as described above (the precise mechanics are different, but the effect is the same). Because of this fundamental property, it is guaranteed that merges are reproduceable; merge the same two sets of patches and you'll get the same result. Darcs handles the conflict scenario described above by competely ignoring the effects of conflicting patches when merging two repositories. (This strategy has its own problems, but that's a different topic entirely.)

From a usability point of view, the fact that every merge in darcs is repeatable means that it doesn't need to track them as separate commits.

I'm happy to discuss this further with anyone interested, either on #revctrl or #darcs on freenode, or in the comments of this post. However, they are just my current tentative conclusions after some investigation, so please don't take them as gospel truth or flame me too hard for being wrong :-)

darcs