Screwtape's Notepad

Extracting libraries with git subtree

Programs written in C and C++ often don’t have access to a useful package manager, so including dependencies can be a pain. You can just copy the source-code into your repo, but then you wind up with multiple copies that can drift apart and diverge over time. Keeping the dependency in a separate repo can help you keep track of changes, but if you have old code-bases where divergence has already happened, how do you preserve those changes and their development history?

The project

I have become the current maintainer of bsnes and higan, game console emulators developed over the past fifteen years or so by a game preservationist known as byuu. Although bsnes and higan are now independent projects, they have a shared development history, and both are built on the same foundational libraries:

Sometimes a bsnes user submits a change to one of those shared libraries which higan would also benefit from, sometimes the other way around, and it’s a hassle trying to track which fixes are in each branch. I’d like to peel those libraries and their development history out of the bsnes and higan repositories, and store them in individual repos to centralise issue tracking and have a single source of truth for their code.

Finding the code

Today, bsnes and higan are developed in git, but it wasn’t always the case. Originally, they were developed without source control at all, and much of the older history was reconstructed from collected source tarballs. As a result, different commits often put the code for a given library at different locations within the source tree. In order to extract the development history of a library, we’ll need to locate it within each commit.

It turns out git’s rename detection does a pretty good job of detecting code movements from one place to another over history. git log --numstat reports file movements using a special syntax like this:

higan/processor/{lr35902 => sm83}/instruction.cpp

This line dates to the time we discovered the Game Boy’s CPU was actually a Sharp SM83, so higan renamed its CPU core from the more generic “lr35902” to the more specific “sm83”. Git detected the rename, and it reports the rename from A to B with a special syntax {A => B}.

There are three main cases we care about:

We can write a regex that matches each of these cases in the git log --numstat output:

#__parent rename__ ____rename to_____ _rename from__
.* => .*\}.*/nall/|\{nall => [^\}]*\}|\{.* => nall\}

Therefore, we can get a map of all the places the nall library has been kept by grepping for that regex in the git log --numstat output:

$ git log --numstat |
    grep -Eo '.* => .*\}.*/nall/|\{nall => [^\}]*\}|\{.* => nall\}' |
    cut -f3- |
    sort -u
{ananke/nall => nall}
{bsnes => higan}/nall/
{bsnes => src}/lib/nall/
{higan/nall => nall}
{nall => bsnes/nall}
{snesfilter => kaijuu}/nall/
{snespurify => purify}/nall/
{snespurify => purify/phoenix}/nall/
{src => bsnes}/lib/nall/
src/{lib => }/nall/
{src/nall => nall}

As you can see, the nall library has at various times been stored in ananke/nall, bsnes/lib/nall, bsnes/nall, higan/nall, kaijuu/nall, purify/nall, snesfilter/nall, snespurify/nall, src/lib/nall, src/nall, and plain nall.
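
If you'd rather not read that list off by eye, here is a minimal sketch of a helper that expands each {old => new} entry into the two directories it mentions. The name rename_dirs is made up for this example, and it assumes each entry has at most one pair of braces and no whitespace in its path components.

```shell
# rename_dirs is a hypothetical helper, not part of git. It reads
# "{old => new}" rename entries on stdin and prints every directory
# they mention, one per line, without duplicates.
rename_dirs() {
    while IFS= read -r entry; do
        # The side of the rename before " => ".
        printf '%s\n' "$entry" | sed 's/{\(.*\) => \(.*\)}/\1/'
        # The side of the rename after " => ".
        printf '%s\n' "$entry" | sed 's/{\(.*\) => \(.*\)}/\2/'
    done |
        # Collapse the "//" left by an empty side, drop the trailing slash.
        sed 's#//#/#g; s#/$##' |
        LC_ALL=C sort -u
}
```

Piping the output of the git log command above into rename_dirs should give one directory per line. It will also report nested locations such as purify/phoenix/nall, so treat the result as a starting point rather than a final answer.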

Collecting the code

In order to extract a consistent history, we need to move our library into a consistent place in each commit. Since for this project we only care about a single library, we can ignore the rest of the code and just make a copy of each commit rewritten to move our library into the correct position.

The tool to do this rewriting is git-filter-repo. As the name suggests, it’s built to rewrite an entire repo, but we don’t want to mess up our repo’s history; we just want to rewrite a single temporary branch.

First we need to create our temporary branch that we can safely rewrite:

git switch -c nall-history

…and then we can use git-filter-repo to do the rewriting:

git-filter-repo \
    --refs nall-history \
    --force \
    --path-rename bsnes/lib/nall:nall \
    --path-rename bsnes/nall:nall \
    --path-rename higan/nall:nall \
    --path-rename src/lib/nall:nall \
    --path-rename src/nall:nall

In the above command:

Compared to the list of nall paths found in the previous section, a number of prefixes are missing, including ananke/nall, kaijuu/nall, purify/nall, snesfilter/nall, and snespurify/nall. That’s because these are helper tools that have at times been included with bsnes or higan, with their own copies of the nall library. A commit with any of these directories is guaranteed to have one of the others, so we’re not missing out on any history.

The end result of this process is that the nall-history branch now contains the nall library at a consistent path in each commit.

Parsed 1833 commits
New history written in 1.66 seconds...
HEAD is now at 23e4991ee correctly set O_NONBLOCK in OSS
Completely finished after 1.71 seconds.
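
For a library that has lived in many more places, typing each --path-rename by hand gets tedious. As a rough sketch, a small helper can generate the options from a list of old locations. The name path_rename_args and the paths.txt file mentioned below are made up for this example, and it assumes none of the paths contain whitespace.

```shell
# path_rename_args is a hypothetical helper, not part of git-filter-repo.
# It reads old directory paths on stdin and prints one --path-rename
# option per line, mapping each path to the target directory given as $1.
path_rename_args() {
    target=$1
    while IFS= read -r old; do
        # Skip blank lines and the target itself (nothing to rename).
        [ -n "$old" ] && [ "$old" != "$target" ] || continue
        printf '%s %s:%s\n' --path-rename "$old" "$target"
    done
}
```

With the old locations saved one per line in paths.txt, the options can be spliced into the command with an unquoted $(path_rename_args nall < paths.txt) after --force.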

Extracting the code

git subtree is a competitor to the more well-known git submodule. Rather than committing a reference to an external git repo that needs to be cloned separately, git subtree merges the entire history of the upstream repository into the target repository. It can also do the opposite — taking a directory of a repository and “peeling off” just the commits that affect it.

Thanks to git-filter-repo, we now have a branch where our library is in a consistent location, perfectly positioned for git subtree to extract it.

The command looks like this:

git subtree split -P nall

In each commit in the history of the current branch, git subtree will ignore everything outside the “nall” prefix, then trim that prefix from all the files that remain, resulting in a commit history that only tracks the changes to the library we care about.

Afterward, it prints the commit ID of the tip of the new history. We can make a new branch for it, so we can clean it up without having to redo the previous steps.

git switch -c just-nall-history $(
    git subtree split -P nall
)
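
One quick sanity check at this point: git subtree split builds each new commit from the tree of the prefix directory, so the tip of the split branch should point at exactly the same tree object as the nall directory at the tip of the source branch. Here is a minimal sketch of that comparison. The function name subtree_matches is made up for this example, and the check only holds before any of the cleanup steps described later have rewritten the split branch.

```shell
# subtree_matches is a hypothetical helper. It succeeds when the root tree
# of the split branch ($1) is the same tree object as directory $3 at the
# tip of the source branch ($2). It returns 2 if either name can't be
# resolved.
subtree_matches() {
    split_tree=$(git rev-parse --verify --quiet "$1^{tree}") || return 2
    source_tree=$(git rev-parse --verify --quiet "$2:$3") || return 2
    [ "$split_tree" = "$source_tree" ]
}
```

For the branches in this article, that would be subtree_matches just-nall-history nall-history nall.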

Cleaning the code

The resulting history has some quirks.

Empty commits

For example, every commit in the original history is still present, even if it didn’t modify anything in our library. Luckily, git-filter-repo can clean that up for us quite easily.

git-filter-repo \
    --refs just-nall-history \
    --force \
    --prune-empty always

The --refs and --force options are as before, but --prune-empty always removes any commits that don’t change files. If your repo has many merges in it, you might also want to inspect the --prune-degenerate option, which tells git-filter-repo what to do when --prune-empty prunes a merge commit’s parents.
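
To confirm the pruning did what you expected, you can look for non-merge commits whose tree is identical to their parent's tree. This is only a sketch, the name list_empty_commits is made up for this example, and it runs a couple of git commands per commit, so it will be slow on a long history.

```shell
# list_empty_commits is a hypothetical helper. It prints the ID of every
# non-merge commit reachable from $1 whose tree is identical to the tree
# of its parent, i.e. a commit that changes no files.
list_empty_commits() {
    git rev-list --no-merges "$1" | while read -r commit; do
        # Root commits have no parent, so fall back to an empty string.
        parent_tree=$(git rev-parse --verify --quiet "$commit^^{tree}") ||
            parent_tree=
        if [ "$(git rev-parse "$commit^{tree}")" = "$parent_tree" ]; then
            printf '%s\n' "$commit"
        fi
    done
}
```

After pruning, list_empty_commits just-nall-history should print nothing.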

File permissions

This is somewhat specific to my repository, but at different times the code has been copied from a FAT32 file-system to a Linux machine. As a result, some commits have a bunch of files marked as executable, and other commits have them as non-executable, more or less at random. This clutters the history, and it would be good to clean up.

Luckily, the nall library should have no executable files at all ever (it’s all just C++ headers) so we can use git-filter-repo once again:

git-filter-repo \
    --refs just-nall-history \
    --commit-callback '
        for change in commit.file_changes:
            if change.mode == b"100755":
                change.mode = b"100644"
    '

The --commit-callback option takes a fragment of Python code which is evaluated for each commit that gets rewritten. The mode strings are git’s octal file modes: b"100755" is a regular file marked executable, and b"100644" is a regular file that is not.
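
To check the result, you can look for entries that are still marked executable. The sketch below filters the output of git ls-tree -r, which prints the mode, type and object ID of each entry, followed by a tab and the path. The function name executable_files is made up for this example, and it only inspects whatever tree listing you feed it, not the whole history.

```shell
# executable_files is a hypothetical helper. It reads `git ls-tree -r`
# output on stdin ("<mode> <type> <object>", a tab, then the path) and
# prints the path of every entry whose mode is 100755.
executable_files() {
    awk -F '\t' '{
        # The part before the tab holds mode, type and object ID.
        split($1, meta, " ")
        if (meta[1] == "100755") print $2
    }'
}
```

Running git ls-tree -r just-nall-history | executable_files should print nothing once the callback has done its job at the tip of the branch.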

Combining sources

If your library was kept manually synchronised between a bunch of different projects, you should repeat the above steps for each of them. Once you have a line of history for each repo, you can fetch those “identical” branches into a common repository and merge them together to find out just how wrong you were.
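
As a rough sketch of the fetching part, assuming each project's cleaned-up branch is called just-nall-history as above, a small wrapper around git fetch can pull each one into a local branch of the common repository. The function name import_history and the local branch names are made up for this example.

```shell
# import_history is a hypothetical helper. It fetches the cleaned-up
# branch (assumed to be named just-nall-history) from the repository at
# path or URL $1 into a local branch named $2.
import_history() {
    git fetch --quiet "$1" "just-nall-history:$2"
}
```

After something like import_history ../bsnes bsnes-nall and import_history ../higan higan-nall, an ordinary git merge will combine the two lines of history. If git refuses because the branches share no common ancestor, git merge --allow-unrelated-histories lets the merge proceed.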

Re-merging the code

If you start off with a standalone library repository, and copy it into your application with git subtree add, you can push changes back upstream with git subtree split and pull down changes with git subtree merge. However, if you start off with a manually copied library and split it out, it seems you can’t merge changes back with git subtree merge — or maybe the cleanup steps like removing empty commits and noisy permission changes break it.

Either way, the way I get the new standalone repository back in is to make one commit that removes the old library code, and a second commit that adds it back in with git subtree add. It’s a bit messy, and it has the potential to break git bisect, but I think it’s worth it to have an upstream repository with clean history and make it easy to pull changes back downstream.

As an alternative, you could add the new repository in place with git submodule, but I think you’d still have to remove the old code and add the new as separate commits.

If you do use git subtree, since there’s no standard marker in the repository saying where to find the upstream source (that’s kind of the point of git subtree), I like to leave a file in the repository root to automate the update process. It usually looks something like this:

# Because git subtree doesn't provide an easy way to automatically merge
# changes from upstream, this shell script will do the job instead. If you
# don't have a POSIX-compatible shell on your system, feel free to use this
# as a reference for what commands to run, rather than running it directly.

# Change to the directory containing this script, or exit with failure to
# prevent git subtree scrawling over some other repo.
cd "$(dirname "$0")" || exit 1

# Merge changes from the nall repository.
git subtree pull --prefix=nall master

And that’s it!