Programs written in C and C++ often don’t have access to a useful package manager, so including dependencies can be a pain. You can just copy the source-code into your repo, but then you wind up with multiple copies that can drift apart and diverge over time. Keeping the dependency in a separate repo can help you keep track of changes, but if you have old code-bases where divergence has already happened, how do you preserve those changes and their development history?
I have become the current maintainer of bsnes and higan, game console emulators developed over the past fifteen years or so by a game preservationist known as byuu. Although bsnes and higan are now independent projects, they have a shared development history, and both are built on the same foundational libraries:
Sometimes a bsnes user submits a change to one of those shared libraries which higan would also benefit from, sometimes the other way around, and it’s a hassle trying to track which fixes are in each branch. I’d like to peel those libraries and their development history out of the bsnes and higan repositories, and store them in individual repos to centralise issue tracking and have a single source of truth for their code.
Today, bsnes and higan are developed in git, but it wasn’t always the case. Originally, they were developed without source control at all, and much of the older history was reconstructed from collected source tarballs. As a result, different commits often put the code for a given library at different locations within the source tree. In order to extract the development history of a library, we’ll need to locate it within each commit.
It turns out git’s rename detection does a pretty good job of detecting code movements from one place to another over history. git log --numstat reports file movements using a special syntax like this:
higan/processor/{lr35902 => sm83}/instruction.cpp
This line dates to the time we discovered the Game Boy’s CPU was actually a Sharp SM83, so higan renamed its CPU core from the more generic “lr35902” to the more specific “sm83”. Git detected the rename, and it reports the rename from A to B with a special syntax {A => B}.
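To make the {A => B} notation concrete, here is a minimal Python sketch that expands one such path into its old and new forms. The function name expand_rename is my own invention for illustration, not part of git, and the sketch assumes the path contains no characters that git would quote.

```python
import re

# Matches the compact rename form: PREFIX{OLD => NEW}SUFFIX
_BRACE_RENAME = re.compile(r"^(.*)\{(.*) => (.*)\}(.*)$")


def expand_rename(path):
    """Return (old_path, new_path) for a path as printed by --numstat."""
    m = _BRACE_RENAME.match(path)
    if m:
        prefix, old, new, suffix = m.groups()
        # An empty side such as "src/{lib => }/nall" would leave "//" behind.
        old_path = (prefix + old + suffix).replace("//", "/")
        new_path = (prefix + new + suffix).replace("//", "/")
        return old_path, new_path
    if " => " in path:
        # With no common prefix or suffix, git prints a plain "old => new".
        old, new = path.split(" => ", 1)
        return old, new
    return path, path  # not a rename


old, new = expand_rename("higan/processor/{lr35902 => sm83}/instruction.cpp")
print(old)
print(new)
```

Running it on the example line prints the lr35902 path followed by the sm83 path.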
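There are three main cases we care about: a parent directory of the library gets renamed, the library directory itself gets renamed to something else, or something else gets renamed to become the library directory.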
We can write a regex that matches each of these cases in the git log --numstat output:
#__parent rename__ ____rename to_____ __rename from__
.* => .*\}.*/nall/|\{nall => [^\}]*\}|\{.* => nall\}
Therefore, we can get a map of all the places the nall library has been kept by grepping for that regex in the git log --numstat output:
$ git log --numstat |
    grep -E -o '.* => .*\}.*/nall/|\{nall => [^\}]*\}|\{.* => nall\}' |
    cut -f3- |
    sort -u
{ananke/nall => nall}
{bsnes => higan}/nall/
{bsnes => src}/lib/nall/
{higan/nall => nall}
{nall => bsnes/nall}
{snesfilter => kaijuu}/nall/
{snespurify => purify}/nall/
{snespurify => purify/phoenix}/nall/
{src => bsnes}/lib/nall/
src/{lib => }/nall/
{src/nall => nall}
As you can see, the nall library has at various times been stored in ananke/nall, bsnes/lib/nall, bsnes/nall, higan/nall, kaijuu/nall, purify/nall, snesfilter/nall, snespurify/nall, src/lib/nall, src/nall, and plain nall.
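If you would rather experiment with the pattern away from a real repository, the same filtering can be sketched in Python. This is only an approximation of the grep pipeline above: it keeps the first match per line, and Python’s regex engine picks between alternatives differently from POSIX tools, so edge cases may differ. The sample lines are made up for illustration, apart from the instruction.cpp path quoted earlier.

```python
import re

# The same three alternatives as the pattern used in the pipeline above.
NALL_RENAME = re.compile(r".* => .*\}.*/nall/|\{nall => [^\}]*\}|\{.* => nall\}")


def nall_rename_fragments(numstat_lines):
    """Return the sorted, de-duplicated rename fragments that mention nall."""
    found = set()
    for line in numstat_lines:
        m = NALL_RENAME.search(line)
        if m is None:
            continue
        fields = m.group(0).split("\t")
        # Like `cut -f3-`: drop the two count columns when they were matched.
        found.add("\t".join(fields[2:]) if len(fields) > 1 else fields[0])
    return sorted(found)


# Hypothetical --numstat lines; the counts and most file names are invented.
sample = [
    "12\t3\t{bsnes => higan}/nall/string.hpp",
    "0\t0\t{higan/nall => nall}/string.hpp",
    "5\t5\thigan/processor/{lr35902 => sm83}/instruction.cpp",
]
for fragment in nall_rename_fragments(sample):
    print(fragment)
```

The third sample line is a rename that has nothing to do with nall, so it is filtered out.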
In order to extract a consistent history, we need to move our library into a consistent place in each commit. Since for this project we only care about a single library, we can ignore the rest of the code and just make a copy of each commit rewritten to move our library into the correct position.
The tool to do this rewriting is git-filter-repo. As the name suggests, it’s built to rewrite an entire repo, but we don’t want to mess up our repo’s history; we just want to rewrite a single temporary branch.
First we need to create our temporary branch that we can safely rewrite:
git switch -c nall-history
…and then we can use git-filter-repo to do the rewriting:
git-filter-repo \
    --refs nall-history \
    --force \
    --path-rename bsnes/lib/nall:nall \
    --path-rename bsnes/nall:nall \
    --path-rename higan/nall:nall \
    --path-rename src/lib/nall:nall \
    --path-rename src/nall:nall
In the above command:
--refs nall-history means git-filter-repo will only rewrite the history of that specific branch, not the history of the repo.
--force disables a safety check: because git-filter-repo is designed to rewrite an entire repo, it normally refuses to run unless it’s in a freshly cloned repo. Because we’re only rewriting a single branch, and a freshly created, temporary branch at that, we’re not actually in danger.
--path-rename A:B does the renaming work: in each commit, anything at path A gets renamed to path B.
Compared to the list of nall paths found in the previous section, a number of prefixes are missing, including ananke/nall, kaijuu/nall, purify/nall, snesfilter/nall, and snespurify/nall. That’s because these are helper tools that have at times been included with bsnes or higan, with their own copies of the nall library. A commit with any of these directories is guaranteed to have one of the others, so we’re not missing out on any history.
The end result of this process is that the nall-history branch now contains the nall library at a consistent path in each commit.
Parsed 1833 commits
New history written in 1.66 seconds...
HEAD is now at 23e4991ee correctly set O_NONBLOCK in OSS
Completely finished after 1.71 seconds.
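As a way of picturing what those renames do to each file in each commit, here is a simplified Python model. It is not git-filter-repo’s actual code, and its matching rules may differ at the edges; the function name and the example file names are hypothetical.

```python
# The same old:new pairs that were passed to --path-rename above.
RENAMES = [
    ("bsnes/lib/nall", "nall"),
    ("bsnes/nall", "nall"),
    ("higan/nall", "nall"),
    ("src/lib/nall", "nall"),
    ("src/nall", "nall"),
]


def rename_path(path, renames=RENAMES):
    """Apply the first matching directory rename to one file path."""
    for old, new in renames:
        # Only match whole directory names, not partial ones.
        if path == old or path.startswith(old + "/"):
            return new + path[len(old):]
    return path


# Hypothetical file names, for illustration only.
for path in ["higan/nall/string.hpp", "src/lib/nall/string.hpp",
             "snespurify/nall/string.hpp", "higan/sfc/cpu.cpp"]:
    print(path, "->", rename_path(path))
```

In this model the copies that belong to the helper tools, like snespurify/nall, are deliberately left where they are, as is everything unrelated to the library.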
git subtree is a competitor to the better-known git submodule. Rather than committing a reference to an external git repo that needs to be cloned separately, git subtree merges the entire history of the upstream repository into the target repository. It can also do the opposite: taking a directory of a repository and “peeling off” just the commits that affect it.
Thanks to git-filter-repo, we now have a branch where our library is in a consistent location, perfectly positioned for git subtree to extract it.
The command looks like this:
git subtree split -P nall
In each commit in the history of the current branch, git subtree will ignore everything outside the “nall” prefix, then trim that prefix from all the files that remain, resulting in a commit history that only tracks the changes to the library we care about.
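The effect on a single commit’s files can be pictured with a toy Python model. This only illustrates the behaviour described above and is not how git subtree is implemented; the dict-of-paths representation and the file names are assumptions made for the example.

```python
def split_tree(files, prefix):
    """Keep only the files under prefix, with the prefix trimmed off."""
    root = prefix.rstrip("/") + "/"
    return {
        path[len(root):]: content
        for path, content in files.items()
        if path.startswith(root)
    }


# A hypothetical commit: file paths mapped to (abbreviated) contents.
commit_files = {
    "nall/string.hpp": "string code",
    "nall/vector.hpp": "vector code",
    "higan/sfc/cpu.cpp": "emulator code",
}
print(split_tree(commit_files, "nall"))
```

Only the two library headers survive, and they end up at the top level of the new tree.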
Afterward, it prints the commit ID of the tip of the new history. We can make a new branch for it, so we can clean it up without having to redo the previous steps.
git switch -c just-nall-history $(
    git subtree split -P nall
)
The resulting history has some quirks.
For example, every commit in the original history is still present, even if it didn’t modify anything in our library. Luckily, git-filter-repo can clean that up for us quite easily.
git-filter-repo \
    --refs just-nall-history \
    --force \
    --prune-empty always
The --refs and --force options are as before, but --prune-empty always removes any commits that don’t change files. If your repo has many merges in it, you might also want to inspect the --prune-degenerate option, which tells git-filter-repo what to do when --prune-empty prunes a merge commit’s parents.
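For a history with no merges, the idea behind pruning empty commits can be sketched in a few lines of Python. This is a toy model under that assumption, not git-filter-repo’s implementation, and it says nothing about how merges are handled; the commit messages and file contents are invented.

```python
def prune_empty(history):
    """Drop commits whose tree is identical to the previous kept commit's.

    history is a list of (message, tree) pairs, oldest first, where tree is a
    dict mapping file paths to contents.
    """
    kept = []
    previous_tree = {}  # an empty tree, so an empty first commit is dropped too
    for message, tree in history:
        if tree != previous_tree:
            kept.append((message, tree))
            previous_tree = tree
    return kept


# Hypothetical linear history after splitting out the library.
history = [
    ("add string class", {"string.hpp": "v1"}),
    ("emulator-only change", {"string.hpp": "v1"}),
    ("fix string class", {"string.hpp": "v2"}),
]
for message, _tree in prune_empty(history):
    print(message)
```

The middle commit never touched the library, so its tree matches its parent’s and it disappears from the result.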
This is somewhat specific to my repository, but at different times the code has been copied from a FAT32 file-system to a Linux machine. As a result, some commits have a bunch of files marked as executable, and other commits have them as non-executable, more or less at random. This clutters the history, and would be good to clean up.
Luckily, the nall library should have no executable files at all ever (it’s all just C++ headers) so we can use git-filter-repo once again:
git-filter-repo \
    --refs just-nall-history \
    --commit-callback '
        for change in commit.file_changes:
            if change.mode == b"100755":
                change.mode = b"100644"
    '
The --commit-callback option takes a fragment of Python code which is evaluated for each commit that gets rewritten. The mode strings are traditional octal POSIX file modes: 100755 is a regular file with the executable bits set, and 100644 is a regular file without them.
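To see how those mode strings break down, the standalone Python sketch below decodes one with the standard stat module and clears the executable bits. It is a generalisation for illustration, not what the callback above does (that only swaps one exact string for another), and the helper name clear_exec_bits is hypothetical.

```python
import stat


def clear_exec_bits(mode):
    """Return a git mode string (bytes) with the executable bits removed."""
    bits = int(mode, 8)  # b"100755" -> 0o100755
    if not stat.S_ISREG(bits):
        return mode  # leave symlinks (120000) and submodules (160000) alone
    bits &= ~0o111  # clear execute permission for user, group and other
    return format(bits, "o").encode("ascii")


bits = int(b"100755", 8)
print(stat.S_ISREG(bits))          # is it a regular file?
print(oct(stat.S_IMODE(bits)))     # just the permission part
print(clear_exec_bits(b"100755"))
```

The leading 100 is the file type (regular file) and the trailing three digits are the familiar rwx permissions.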
If your library was kept manually synchronised between a bunch of different projects, you should repeat the above steps for each of them. Once you have a line of history for each repo, you can fetch those “identical” branches into a common repository and merge them together to find out just how wrong you were.
If you start off with a standalone library repository, and copy it into your application with git subtree add, you can push changes back upstream with git subtree split and pull down changes with git subtree merge.
However, if you start off with a manually copied library and split it out, it seems you can’t merge changes back with git subtree merge, or maybe the cleanup steps like removing empty commits and noisy permission changes break it. Either way, the way I get the new standalone repository back in is to make one commit that removes the old library code, and a second commit that adds it back in with git subtree add.
It’s a bit messy, and it has the potential to break git bisect, but I think it’s worth it to have an upstream repository with clean history and make it easy to pull changes back downstream.
As an alternative, you could add the new repository in place with git submodule, but I think you’d still have to remove the old code and add the new as separate commits.
If you do use git subtree, since there’s no standard marker in the repository saying where to find the upstream source (that’s kind of the point of git subtree), I like to leave an update-subtrees.sh file in the repository root to automate the update process. It usually looks something like this:
#!/bin/sh
# Because git subtree doesn't provide an easy way to automatically merge
# changes from upstream, this shell script will do the job instead. If you
# don't have a POSIX-compatible shell on your system, feel free to use this
# as a reference for what commands to run, rather than running it directly.
# Change to the directory containing this script, or exit with failure to
# prevent git subtree scrawling over some other repo.
cd "$(dirname "$0")" || exit 1
# Merge changes from the nall repository.
git subtree pull --prefix=nall https://github.com/higan-emu/nall.git master
And that’s it!