Optimizing Git’s Merge Machinery, #5

Published in the Palantir Blog · August 3, 2021

Editor’s note: This is the fifth post in a series by a Palantir Software Engineer on scaling Git’s merge and rename detection machinery. Click to read the first, second, third, and fourth blog posts.

The first post in this series included background information on how the merge machinery works and why I have worked on optimizing and rewriting it. That background will be critical to understanding this post. So, if you’re unaware that Git stores hashes for files and directories in addition to having hashes for commits, or that merging involves three-way content merges, or that each merge involves two sets of rename detections, you may want to go back and read that post first. If you know all that already, let’s keep going.

I’ll consider another optimization in this post, one that I summarized in my first post simply as “tree-level merging.” It sounds simple enough, but it’s only possible if we leverage some of our previous optimizations.

Switching Tracks

The new merge algorithm has three primary top-level functions that do the work of a merge:

  • collect_merge_info()
  • detect_and_process_renames()
  • process_entries()

The collect_merge_info() function recursively traverses into each subdirectory to gather information about each file and directory of all the commits involved in the three-way merge, collecting and storing relevant information along the way. The detect_and_process_renames() function handles rename detection (importantly allowing us to adjust paths and pair unpaired files for three-way content merging). The process_entries() function then iterates over all the paths, doing three-way content merging and handling a long list of special cases to determine what the merge result should be for each path.
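
To make that flow concrete, here is a structural sketch in C. The struct names, signatures, and the wrapper function below are invented for illustration and are not merge-ort's actual API; the real work inside each function is elided.

```c
/*
 * Structural sketch only: the types and signatures below are invented
 * for illustration and are not merge-ort's actual API.
 */
struct merge_options;   /* configuration for the merge            */
struct merge_result;    /* resulting tree plus conflict info      */
struct tree;            /* a Git tree object (directory snapshot) */

/* Walk the three root trees in parallel, recording info for every path. */
void collect_merge_info(struct merge_options *opt, struct tree *merge_base,
                        struct tree *side1, struct tree *side2)
{
    (void)opt; (void)merge_base; (void)side1; (void)side2;  /* body elided */
}

/* Detect renames and pair up paths so content merges line up correctly. */
void detect_and_process_renames(struct merge_options *opt)
{
    (void)opt;  /* body elided */
}

/* Resolve every recorded path: three-way content merges plus the long
 * tail of special cases. */
void process_entries(struct merge_options *opt, struct merge_result *result)
{
    (void)opt; (void)result;  /* body elided */
}

/* The overall flow: three passes over the gathered information. */
void merge_three_trees(struct merge_options *opt, struct tree *merge_base,
                       struct tree *side1, struct tree *side2,
                       struct merge_result *result)
{
    collect_merge_info(opt, merge_base, side1, side2);
    detect_and_process_renames(opt);
    process_entries(opt, result);
}
```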

Taking a look at the mega-renames test case (rebasing 35 patches across an upstream with ~26,000 renames), we can compare the overall timing with the time spent just on renames, both before any optimizations and now:

                                       Before      Now
Overall timing:                       5499.7s      10s
Just detect_and_process_renames():    5487.1s       1s

(As a side note, the starting point of 5499.7s was already 7.6% faster than the old merge algorithm.)

We’ve made impressive gains in improving the time needed for rename detection, but now we need to turn our sights elsewhere. collect_merge_info() and process_entries() are now the major culprits for time spent. So, in this post, we’ll dive a bit into those.

Trivial Merges: Modified on neither side

Git has a hash for every file (“blob”) and subdirectory (“tree”), in addition to a hash for every commit. A tree’s hash is derived from the hashes of the blobs and subtrees within it, and a commit’s hash is derived from a combination of its top-level tree and extra metadata like the commit author, date, and message. Since every merge is a three-way operation involving three commits (the commits at the tips of the two branches being merged, plus the merge base commit), and each commit has a top-level tree object, the merge operation works with three top-level trees. Every tree and blob has a hash, which means that when collect_merge_info() recurses into subdirectories, it knows the hash of each tree and blob object it encounters on each of the three “sides” of the merge.

Knowing these hashes allows some file merges to be performed without even looking at the contents of the files. For example, if the three hashes for a given file all match (meaning that neither side of history modified the file), then using that same hash for the merge result for that file gives the correct answer. The same is also true for directories — if the three hashes all match for that directory, the resulting directory will also have the same hash. This type of file or directory merge is referred to as a “trivial merge” because it can be performed without looking at the content of the file or directory.

In the case of a directory, a trivial merge is particularly beneficial: you not only avoid recursing into that directory, you also avoid handling every file and subdirectory underneath it, potentially saving significant amounts of time.
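
Concretely, the check boils down to comparing object IDs. Here is a minimal, self-contained sketch of the “neither side modified it” case, using an invented oid type and helper rather than Git's actual object_id API:

```c
#include <string.h>

#define OID_RAWSZ 20   /* raw bytes of a SHA-1 object ID (32 for SHA-256) */

struct oid { unsigned char hash[OID_RAWSZ]; };  /* stand-in for object_id */

int oids_equal(const struct oid *a, const struct oid *b)
{
    return memcmp(a->hash, b->hash, OID_RAWSZ) == 0;
}

/* If the merge base and both sides have the same hash for a path, the
 * result is that same hash -- no content needs to be read.  For a tree
 * (directory), returning 1 here also means we never have to recurse
 * into it, skipping every file and subdirectory underneath. */
int trivially_unmodified(const struct oid *base, const struct oid *side1,
                         const struct oid *side2, struct oid *result)
{
    if (oids_equal(base, side1) && oids_equal(base, side2)) {
        *result = *base;
        return 1;
    }
    return 0;
}
```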

But are there other types of trivial merges, and can we apply those?

Trivial Merges: Modified on only one side

If a particular file exists on all three sides of the merge and its hash on one side matches the merge base, we have another type of trivial merge. Because that side did not modify the file (as determined by the matching hashes), we can take the hash from the non-matching side as the merge result.
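
Reusing the oid helpers from the sketch above, the full rule for a single path might look like this. Keep in mind, as the rest of this post explains, that the one-sided cases cannot be blindly applied to directories when rename detection is in play:

```c
/* Reuses struct oid / oids_equal from the previous sketch (illustrative).
 * Returns 1 and fills *result when a path can be resolved without a
 * content merge; returns 0 when both sides modified it. */
int trivial_threeway_resolve(const struct oid *base, const struct oid *side1,
                             const struct oid *side2, struct oid *result)
{
    if (oids_equal(base, side1) && oids_equal(base, side2)) {
        *result = *base;      /* neither side modified the path */
        return 1;
    }
    if (oids_equal(base, side1)) {
        *result = *side2;     /* only side2 modified it: take side2 */
        return 1;
    }
    if (oids_equal(base, side2)) {
        *result = *side1;     /* only side1 modified it: take side1 */
        return 1;
    }
    return 0;                 /* modified on both sides: real merge needed */
}
```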

In the past, people have asked whether the old merge algorithm could do a similar thing with directories, but it was never a possibility. Applying these kinds of trivial directory merges would, in general, wreck both regular rename detection (the unmatched side can have new files that serve as potential destinations in rename detection) and directory rename detection (the unmatched side could have a new directory that was moved into it, and the matched side might be involved in a directory rename that we cannot ignore due to possible files added to the parent directory).

There’s also a second reason the old merge algorithm wouldn’t benefit much from such an optimization: it used a data structure referred to simply as “the index,” which must be populated with every file in the repository and thus would force us to recurse anyway. However, since the new merge algorithm does not use that data structure, this second reason is not an obstacle. We can focus just on the potential breaking of rename detection.

But what if there aren’t any renames?

If there are no renames, then rename detection is skipped, and in that case, we could also do trivial directory resolution for directories where one side matches the merge base. Unfortunately, there’s a bit of a problem: the only way to know if there are no renames is to first recurse into all subdirectories in collect_merge_info() and then run rename detection.

Are there other reasons that rename detection is skipped?

The third blog post in this series introduced the concept of a “relevant” rename and started skipping rename detection entirely if there were no “relevant” sources. There were two kinds of relevance: location (for directory rename detection) and content (for three-way file content merges). So, if we can somehow determine that there are no relevant sources anywhere, we can avoid recursing into directories that are unmodified on one side.

When a directory is unmodified on one side of history, meaning that the hash of the directory on that side matches the hash of the directory in the merge base, then the hashes of every subdirectory and file underneath that directory also have to match between that side and the merge base. If all those files match the merge base, then none of them are content-relevant. Further, if a directory is unmodified on one side of history, that directory will only be relevant if its parent directory was.

We still have the lingering problem of not knowing whether there are relevant sources elsewhere until we’ve processed all the paths. But we do know that the current directory won’t add any relevant sources, and that lets us implement a simple idea.

Defer recursing into directories modified on one side

If we save off information about directories modified on one side and defer recursing into them until the end, we can determine whether there are any relevant sources anywhere. If there are any, we can recurse into the deferred directories. If there aren’t any, we can do a trivial directory resolution by taking the non-matching side of history as the merge result.
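
A rough sketch of what that deferral might look like, reusing the oid type from the earlier sketch; the data structure, the bound, and the helper functions are all invented for illustration rather than taken from merge-ort:

```c
#include <stddef.h>

#define MAX_DEFERRED 65536          /* arbitrary bound for this sketch */

struct deferred_dir {
    const char *path;               /* directory unmodified on one side */
    struct oid  base, side1, side2; /* its three tree hashes            */
};

static struct deferred_dir deferred[MAX_DEFERRED];
static size_t num_deferred;

/* Assumed helpers (bodies elided): record the non-matching side's tree
 * hash as the result, or fall back to the normal per-entry traversal. */
void trivially_resolve_directory(struct deferred_dir *dir);
void recurse_into_directory(struct deferred_dir *dir);

/* During traversal: one side's tree hash matches the merge base, so do
 * not recurse yet -- just remember the directory for later. */
void defer_directory(const char *path, const struct oid *base,
                     const struct oid *side1, const struct oid *side2)
{
    if (num_deferred < MAX_DEFERRED) {
        struct deferred_dir *d = &deferred[num_deferred++];
        d->path  = path;
        d->base  = *base;
        d->side1 = *side1;
        d->side2 = *side2;
    }
}

/* After traversing everything else, we finally know whether any relevant
 * rename sources exist, and can decide what to do with each deferral. */
void handle_deferred_directories(int have_relevant_sources)
{
    for (size_t i = 0; i < num_deferred; i++) {
        if (!have_relevant_sources)
            trivially_resolve_directory(&deferred[i]);
        else
            recurse_into_directory(&deferred[i]);
    }
}
```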

Extending trivial directory resolution further

My fourth blog post introduced the concept of a cached rename when reapplying a long sequence of linear commits, such as during a cherry-pick or rebase operation. This provides us a way to enable trivial directory resolutions in even more cases. Suppose there are relevant renames, but they are cached from a previous cherry-pick or rebase. In that case, we already know how to match up the files and only need to recurse into the relevant source and target directories to get both sides of that rename while ignoring any sibling directories as we recurse.

This means that although the previous blog post provided speedups of “only” 13%, the optimizations from that post actually supercharge this one to make it effective even in cases with many renames when rebasing or cherry-picking.

Full speed ahead!

When there are obstacles in our path, but a significant reward at the end, it can be tempting to ignore the obstacles and proceed with gusto. To borrow a phrase, we want to move “full speed ahead.”

We have a similar situation here. When there are relevant renames that are not cached, the trivial directory resolution optimization does not help. Often, when the optimization does not apply, it is because there are only a handful of relevant sources, maybe even just one. It felt frustrating to need to recurse into potentially hundreds or even thousands of directories just for a single rename, but it was necessary for accuracy.

Further, I noted that process_entries() was the most expensive piece. It was bothersome that process_entries() had to handle each individual path when I knew that most directories were not involved in the renames; a trivial resolution of those directories would have removed the need to handle all the paths under them. This sparked an idea for a way to proceed safely while still gaining even more performance: change this calling sequence:

  • collect_merge_info()
  • detect_and_process_renames()
  • process_entries()

into

  • collect_merge_info()
  • detect_and_process_renames()
  • <cache all the renames, and restart>
  • collect_merge_info()
  • detect_and_process_renames()
  • process_entries()

That probably looks like more work, but there are a few things to note here: (1) although we run collect_merge_info() twice, the second time we get to employ trivial directory resolves, which can make it much faster, so the increased time in collect_merge_info() can be small; (2) while we run detect_and_process_renames() again, all renames are cached, so it’s nearly a no-op; (3) the big payoff comes from the fact that process_entries() can be much faster due to having far fewer entries to process.

A simple heuristic that counts directories and files in the first collect_merge_info() run determines if restarting the operation will provide enough savings to be worth the extra cost of restarting.
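
Putting the pieces together, the restart might be sketched roughly as follows. It builds on the structural sketch from earlier in this post, and both the extra helper functions and the specific ratio in the heuristic are made up for illustration; the real code's heuristic and names differ.

```c
/*
 * Builds on the structural sketch from earlier in the post; every name
 * here is illustrative, and helper bodies are elided.
 */
struct merge_options;
struct merge_result;
struct tree;

void collect_merge_info(struct merge_options *opt, struct tree *merge_base,
                        struct tree *side1, struct tree *side2);
void detect_and_process_renames(struct merge_options *opt);
void process_entries(struct merge_options *opt, struct merge_result *result);

struct traversal_stats {
    unsigned long dirs;    /* directories seen in the first traversal */
    unsigned long files;   /* files seen in the first traversal       */
};

/* Assumed helpers (bodies elided): cache detected renames, discard the
 * per-path data from the first pass, and report traversal counts. */
void cache_all_renames(struct merge_options *opt);
void discard_merge_info(struct merge_options *opt);
void get_traversal_stats(const struct merge_options *opt,
                         struct traversal_stats *out);

/* Made-up heuristic: restarting pays off when trivially resolving whole
 * directories would let process_entries() skip many individual paths. */
static int worth_restarting(const struct traversal_stats *s)
{
    return s->files > 8 * s->dirs;   /* illustrative ratio, not Git's */
}

void merge_with_optional_restart(struct merge_options *opt,
                                 struct tree *merge_base, struct tree *side1,
                                 struct tree *side2, struct merge_result *result)
{
    struct traversal_stats stats;

    collect_merge_info(opt, merge_base, side1, side2);
    detect_and_process_renames(opt);

    get_traversal_stats(opt, &stats);
    if (worth_restarting(&stats)) {
        cache_all_renames(opt);    /* remember every detected rename  */
        discard_merge_info(opt);   /* drop the per-path records       */

        /* Second pass: directories unmodified on one side can now be
         * trivially resolved, and rename detection is answered from
         * the cache, so both steps are much cheaper this time. */
        collect_merge_info(opt, merge_base, side1, side2);
        detect_and_process_renames(opt);
    }

    process_entries(opt, result);  /* far fewer entries to process */
}
```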

Results

For the particular test case I’ve been focusing on in this series, a test case involving rebasing 35 patches onto a branch that renamed ~26K files, this speeds up the overall runtime by about 8.7x:

                  Before               After
mega-renames:     9.419 s ± 0.107 s    1.076 s ± 0.015 s

This optimization is even more effective for cases with few or no renames — the upstream thread about this optimization included another test case that saw a 25x improvement.

This optimization will likely appear in Git 2.33 in August 2021.

The first five posts in this series explored clever ideas to reduce how much work is done, while still getting the same answer or a close enough approximation. The remaining post in this series will be devoted to simply trying to do the same amount of work faster using straightforward optimization techniques.

Author

Elijah Newren
