
Moving files between Git repos while preserving history

written on June 23, 2021

categories: engineering

tags: git, tooling

I recently found myself having to move files from one Git repo to another while preserving their history. It's one of those tasks that come up so rarely you've probably never thought about them, but once you run into one you realize it must happen every now and again; that was my feeling, at least.

Actually, it's even weirder: at first, I just did what my monkey-programmer-brain thought made sense and, well, copied the files over; it was only afterwards that I realized all history on those files was now lost, as if they had come out of nowhere. In small, one-person hobby projects this may be totally unimportant, but in a repo with decades of history and lots of people working on it, losing all of that information is a huge downside. After all, preserving it is one of the main reasons we use Git.

So how do we do that? A quick internet search reveals three options (at the time of writing): git filter-branch, git subtree and git filter-repo.

Before I explain which one I consider to be the best for all cases, let's first address the fact that git filter-branch is actually no longer recommended. On its documentation page, we read:

WARNING

git filter-branch has a plethora of pitfalls that can produce non-obvious manglings of the intended history rewrite (and can leave you with little time to investigate such problems since it has such abysmal performance). These safety and performance issues cannot be backward compatibly fixed and as such, its use is not recommended. Please use an alternative history filtering tool such as git filter-repo. If you still need to use git filter-branch, please carefully read SAFETY (and PERFORMANCE) to learn about the land mines of filter-branch, and then vigilantly avoid as many of the hazards listed there as reasonably possible.

So basically, git filter-repo should be used instead. That only leaves us with two candidates. But before actually testing, we need something to test on!

Setting up the test repos

Okay, so I've created two repos:

$ tree .
.
├── new_repo
│   └── new_file.txt
└── old_repo
    └── old_file.txt

2 directories, 2 files

As you can see, they each contain one file. Let's see their history:

old_repo $ git log --oneline
73a826f (HEAD -> master) Second commit on the old repo
028467d First commit on the old repo

new_repo $ git log --oneline
6d07ab6 (HEAD -> master) Second commit on the new repo
8a2e099 First commit on the new repo

So now they both have two commits each. The objective here will be simple: move the old_file.txt to the new repo while preserving its history.
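If you'd like to follow along, the two test repos can be recreated with something like the sketch below. The scratch directory, the file contents and the throwaway Git identity are my own assumptions; only the repo names, file names and commit messages come from the setup above.

```shell
set -eu

# Throwaway identity so commits work even without a global Git config (assumption).
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

workdir="$(mktemp -d)"   # hypothetical scratch location

make_repo() {
  # $1 = repo directory name, $2 = file name, $3 = label used in commit messages
  git init -q "$workdir/$1"
  echo "first version" > "$workdir/$1/$2"
  git -C "$workdir/$1" add "$2"
  git -C "$workdir/$1" commit -q -m "First commit on the $3 repo"
  echo "second version" >> "$workdir/$1/$2"
  git -C "$workdir/$1" commit -q -am "Second commit on the $3 repo"
}

make_repo old_repo old_file.txt old
make_repo new_repo new_file.txt new

# Show the resulting histories (hashes will differ from the ones in this post).
git -C "$workdir/old_repo" log --oneline
git -C "$workdir/new_repo" log --oneline
```

The default branch name depends on your Git configuration, so it may not be master on your machine.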

Trying with git subtree

If you want to get a detailed idea of how subtrees work you can read the doc, but in a nutshell they're like minimal submodules that, unlike regular submodules, aren't tracked separately, which makes them very handy when you need to move stuff around.

The first thing we need to do is create the subtree we're after in the old_repo. In order to do that, we essentially split the subtree off of the main tree of the branch:

old_repo $ git subtree split --prefix . -b old-repo-export
assertion failed:  test old_file.txt = .
assertion failed:  test old_file.txt = .
No new revisions were found

Aaaand this is the first limitation of git subtree: it can't just handle a file. You need to be able to move at least a directory in order to get it done. Just for the sake of showing this tool off anyway, I'll add each file in a files folder this time (everything else stays the same).

Let's try again:

old_repo $ git subtree split --prefix files/ -b old-repo-export
Created branch 'old-repo-export'
9033262df3d1b04471838da43e603c46e4323732

Cool, it worked! So it created a new branch, called old-repo-export, and it put whatever was in files/ in it. Now we can hop over to the new_repo and pull that subtree:

new_repo $ git subtree pull --prefix files/ ../old_repo/ old-repo-export
warning: no common commits
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 459 bytes | 459.00 KiB/s, done.
From ../old_repo
 * branch            old-repo-export -> FETCH_HEAD
fatal: refusing to merge unrelated histories

Yet another limitation: it won't let us pull in unrelated histories. What does that mean? Well, old_file.txt was never tracked in new_repo/, so to Git its history is completely unrelated. There is a flag that allows this for git pull, called --allow-unrelated-histories (we'll actually see it later); however, git subtree doesn't support it (at least in version 2.32.0, which is the latest at the time of writing).

Let's try with git subtree add. Notice that, since files/ already exists, we can't use it as a prefix; we need to provide a different one, and merge them manually later.

new_repo $ git subtree add --prefix files-old/ ../old_repo/ old-repo-export
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 459 bytes | 459.00 KiB/s, done.
From ../old_repo
 * branch            old-repo-export -> FETCH_HEAD
Added dir 'files-old'

Well, at least now we get the complete history:

new_repo $ git log --oneline
247ccd2 (HEAD -> master) Add 'files-old/' from commit '9033262df3d1b04471838da43e603c46e4323732'
6d07ab6 Second commit on the new repo
8a2e099 First commit on the new repo
9033262 Second commit on the old repo
75cf0cf First commit on the old repo

However, to be realistic, we should move all of the files to where they belong:

new_repo $ rsync -r files-old/ files/
new_repo $ rm -rf files-old/
new_repo $ git add .
new_repo $ git commit -m "Move old_repo files to the files/ directory"

Okay, that was kind of a bumpy ride but at the end we got all the files in the right place along with their history. Right? Well...

new_repo $ git log --oneline files/old_file.txt
d322fc1 (HEAD -> master) Move old_repo files to the files/ directory

Even --follow doesn't work here:

new_repo $ git log --oneline --follow files/old_file.txt
d322fc1 (HEAD -> master) Move old_repo files to the files/ directory

To be honest, I really don't know why this doesn't work, since we have the full history if we run git log on the root of the repo. Anyway, let's do an interactive rebase to squash the last commit; this will clean up the history, as there will only be one commit regarding the import from the old_repo/.

new_repo $ git rebase -i HEAD~2
pick 75cf0cf First commit on the old repo
pick 9033262 Second commit on the old repo
pick d322fc1 Move old_repo files to the files/ directory

# Rebase 6d07ab6..d322fc1 onto 6d07ab6 (3 commands)

What? Why is it trying to rebase the old commits as well? Also, where is the commit adding the file from the old_repo/? My best guess here is that it's replacing that commit by the entire imported history, which is just bonkers to me. If I rebase now, I'll end up applying older commits on top of newer commits, which will be just incredibly confusing for anyone trying to get useful information out of the history. This case is of course trivial, but imagine if you had a history that was thousands of commits long: all of a sudden, the changes you made yesterday would be buried under years and years of old commits. What a mess.
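One likely explanation for that todo list: a plain git rebase does not replay merge commits. Instead, it lists every non-merge commit that is reachable from HEAD but not from the chosen base, and after merging an unrelated history that set includes the imported commits. The sketch below reproduces this on two throwaway repos (directory names, file contents and the "move" commit are stand-ins I made up) and counts the commits in that range.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

tmp="$(mktemp -d)"   # hypothetical scratch location

commit_file() {
  # $1 = repo, $2 = file name, $3 = commit message
  echo "$3" >> "$1/$2"
  git -C "$1" add "$2"
  git -C "$1" commit -q -m "$3"
}

git init -q "$tmp/old"
git init -q "$tmp/new"
commit_file "$tmp/old" old_file.txt "First commit on the old repo"
commit_file "$tmp/old" old_file.txt "Second commit on the old repo"
commit_file "$tmp/new" new_file.txt "First commit on the new repo"
commit_file "$tmp/new" new_file.txt "Second commit on the new repo"

# Merge the unrelated history, then add one more commit on top
# (a stand-in for the "Move old_repo files" commit).
git -C "$tmp/new" fetch -q "$tmp/old" HEAD
git -C "$tmp/new" merge -q --no-edit --allow-unrelated-histories FETCH_HEAD
commit_file "$tmp/new" moved.txt "Move old_repo files"

# Non-merge commits reachable from HEAD but not from HEAD~2:
# the two imported commits plus the move commit.
git -C "$tmp/new" log --oneline --no-merges HEAD~2..HEAD
todo_count="$(git -C "$tmp/new" rev-list --count --no-merges HEAD~2..HEAD)"
echo "commits in range: $todo_count"
```

Here HEAD~2 follows first parents, so it points at the last commit of the new repo, and everything that came in through the merge's second parent lands in the range. If you do want to rebase across such a merge, git rebase has a --rebase-merges option that keeps merge commits in the todo list, though I haven't tried it on this particular case.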

Actually solving the problem with git filter-repo

Thankfully, there is a solution to this problem. Granted, it's a bit more complicated, but at least it gets the job done, no matter the complexity of the repos or the type of stuff you want to move. I'll do a more complete demo later, but let's start from the same base as before.

In order to make this work, we first need to install git filter-repo; it's not a Git builtin, but rather an external package that can be installed via pip:

$ pip install git-filter-repo

Then, we can use it in the old_repo/. Make sure you use a "disposable" clone for this, as it rewrites the history of your local clone. You'll see what I mean.

old_repo $ git filter-repo --path files/ --force

We need to use --force here because the repo is not a fresh clone; if you're working with a real repo that has a remote, you can simply create a new clone and not use the --force switch.
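As a rough sketch, the disposable clone mentioned above could be prepared like this. The source repo is a stand-in created on the spot, and the old_repo_filtered name is hypothetical. The filter step only runs if git-filter-repo is found on the PATH; I use --no-local on the assumption that a plain local clone would not pass filter-repo's fresh-clone check.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the real repository: one wanted directory, one unwanted file.
src="$(mktemp -d)/old_repo"
git init -q "$src"
mkdir "$src/files"
echo a > "$src/files/old_file.txt"
echo b > "$src/other.txt"
git -C "$src" add .
git -C "$src" commit -q -m "Initial commit"

# Disposable clone; the original stays untouched whatever happens next.
scratch="$(mktemp -d)/old_repo_filtered"   # hypothetical name
git clone -q --no-local "$src" "$scratch"

if command -v git-filter-repo >/dev/null 2>&1; then
  # Fresh clone, so no --force should be needed here.
  git -C "$scratch" filter-repo --path files/
else
  echo "git-filter-repo not found; skipping the rewrite" >&2
fi

ls "$src"       # the original still has everything
ls "$scratch"   # the clone only keeps files/ if the rewrite ran
```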

old_repo $ ls
files

Okay, so what changed? Well, it's difficult to show here, but actually everything in the repo's history was thrown out, except for the files/ directory (that we passed through the --path option). This allows us to simply use this "stripped" version of the repo in a merge, in order to get all history in the same repo. Let's do that, then:

new_repo $ git remote add old-repo-import ../old_repo/
new_repo $ git pull old-repo-import master --allow-unrelated-histories

And that's it! We've now successfully imported all of our history. You can check it by looking at the log yet again:

new_repo $ git log --oneline files/old_file.txt
80b3123 (old-repo-import/master) Second commit on the old repo
128322a First commit on the old repo

By the way, at this point you can also get rid of the dummy remote, so that its remote-tracking branch doesn't clutter the log output:

new_repo $ git remote rm old-repo-import
new_repo $ git log --oneline files/old_file.txt
80b3123 Second commit on the old repo
128322a First commit on the old repo

Much better!
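The import steps above can also be spelled out as an explicit fetch followed by a merge, which is roughly what git pull does under the hood. The sketch below runs the whole sequence on throwaway repos (the scratch paths and file contents are assumptions, and the "old" repo is built directly in its already-filtered shape) and then counts the commits that touch the imported file.

```shell
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

tmp="$(mktemp -d)"   # hypothetical scratch location
old="$tmp/old_repo"
new="$tmp/new_repo"
git init -q "$old"
git init -q "$new"

# Stand-in for the filtered old repo: two commits touching files/old_file.txt.
mkdir "$old/files"
for n in First Second; do
  echo "$n" >> "$old/files/old_file.txt"
  git -C "$old" add files/old_file.txt
  git -C "$old" commit -q -m "$n commit on the old repo"
done

echo hello > "$new/new_file.txt"
git -C "$new" add new_file.txt
git -C "$new" commit -q -m "First commit on the new repo"

# The default branch name varies between Git setups, so look it up.
old_branch="$(git -C "$old" symbolic-ref --short HEAD)"

git -C "$new" remote add old-repo-import "$old"
git -C "$new" fetch -q old-repo-import
git -C "$new" merge -q --no-edit --allow-unrelated-histories "old-repo-import/$old_branch"
git -C "$new" remote remove old-repo-import

imported="$(git -C "$new" log --oneline -- files/old_file.txt | wc -l | tr -d ' ')"
echo "commits touching files/old_file.txt: $imported"
```

Splitting the pull in two also sidesteps a prompt you may hit on recent Git versions, where git pull can ask you to choose between merging and rebasing (for example with --no-rebase) when the branches have diverged.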

A non-trivial example

However, in reality things are often more complex than that. During these migrations, you might have to change some directory names, which will result in Git perceiving the files as "deleted" and "created"; this, in turn, will make the full history inaccessible unless you use --follow. Actually, it's even worse: without --follow, you'll get the history after the merge, and with it you'll get the history before the merge. That's pretty annoying to remember, and it's pretty much like having two places you need to look in every time you want to consult the history of the repo.

Thankfully, there is a way to do this in git filter-repo... but it doesn't scale well. Let's see an example.

I've created two new repos that are a bit more complex. Here's old_repo/:

old_repo $ tree .
.
└── old_files
    ├── build_stuff
    │   └── build_file.txt
    ├── doc_stuff
    │   └── doc_file.txt
    └── source_stuff
        └── source_file.txt

4 directories, 3 files

old_repo $ git log --oneline
5be5a3f (HEAD -> master) Add old doc file
3b3cea6 Add old build file
6e5c48c Add old source file

And here's new_repo/:

new_repo $ tree .
.
└── new_file.txt

0 directories, 1 file

new_repo $ git log --oneline
b607e66 (HEAD -> master) Modify new file
63315ee Add new file

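Now, we're going to make the migration a bit more complex as well: let's say we want to import all of the content of the old_files/ directory from the old_repo/, but we want to perform the following renames: old_files/build_stuff/ becomes files/build/, old_files/doc_stuff/ becomes files/doc/, old_files/source_stuff/ becomes files/src/, and old_files/ itself becomes files/.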

Now, the naive strategy here would be to just run git filter-repo on old_files/ and then perform the renames manually, but firstly there's no need since there is already functionality in git filter-repo to handle that, and secondly (and most importantly): we will lose history by doing that. Remember what I said about renames? We'll need to use --follow, we'll break production, the world will catch fire, the universe will explode and the aliens will make us rebuild it in JavaScript.

So, let's start in the exact same way: by fetching a fresh clone of the old repo (I don't have a remote in this case because it's a simple demo, but you get the point) and running git filter-repo on it. However, this time we will also specify the renames we wish to apply:

old_repo $ git filter-repo --path old_files/ \
    --path-rename old_files/build_stuff/:files/build/ \
    --path-rename old_files/doc_stuff/:files/doc/ \
    --path-rename old_files/source_stuff/:files/src/ \
    --path-rename old_files/:files/ \
    --force

Again, you don't need to use --force here if you have a fresh clone. Well, did that work?

old_repo $ tree .
.
└── files
    ├── build
    │   └── build_file.txt
    ├── doc
    │   └── doc_file.txt
    └── src
        └── source_file.txt

4 directories, 3 files

old_repo $ git log --oneline
d88ffa0 (HEAD -> master) Add old doc file
986e373 Add old build file
ed444b0 Add old source file

It most certainly did!

So, that rename command might not be as trivial as you think: the order is actually important. I made a very conscious choice when I added the "parent folder" rename (old_files/ in this case) to the end of the list, so that it gets applied last; if we put it first, then all of the other ones get ignored because there's no old_files/ folder anymore, so their paths are all wrong. I'm pretty sure it would also work if you put it first and then changed old_files/ to just files/ in the source paths of the other renames.
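To make the ordering issue concrete, here is a toy model of it in plain shell. It assumes that renames are prefix replacements applied one after the other, each to the result of the previous one, which matches the behaviour described above; it is not git filter-repo's actual code, and apply_renames is a helper I made up for illustration.

```shell
set -eu

# Toy model (assumption): apply "old:new" prefix renames in the given order,
# each one to the result of the previous one.
apply_renames() {
  # $1 = path, remaining arguments = rename rules in "old:new" form
  ar_path="$1"
  shift
  for ar_rule in "$@"; do
    ar_from="${ar_rule%%:*}"
    ar_to="${ar_rule#*:}"
    case "$ar_path" in
      "$ar_from"*) ar_path="$ar_to${ar_path#"$ar_from"}" ;;
    esac
  done
  printf '%s\n' "$ar_path"
}

file="old_files/build_stuff/build_file.txt"

# Parent folder last: the specific rule matches first.
good="$(apply_renames "$file" old_files/build_stuff/:files/build/ old_files/:files/)"

# Parent folder first: the specific rule no longer matches anything.
bad="$(apply_renames "$file" old_files/:files/ old_files/build_stuff/:files/build/)"

# Parent folder first, with the later rule adjusted to the already-renamed prefix.
fixed="$(apply_renames "$file" old_files/:files/ files/build_stuff/:files/build/)"

echo "parent last:            $good"
echo "parent first:           $bad"
echo "parent first, adjusted: $fixed"
```

In this model the first and third variants end up at files/build/build_file.txt, while the second one stays at files/build_stuff/build_file.txt.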

In any case, the rest of the process should be easy, as it's the same as before: add the filtered repo as a dummy remote, pull from it, remove it, and you're done.

new_repo $ git remote add old-repo-import ../old_repo
new_repo $ git pull old-repo-import master --allow-unrelated-histories
new_repo $ git remote rm old-repo-import

And as you can see, it totally works:

new_repo $ git log --oneline
ba16b00 (HEAD -> master) Modify new file
bff28a2 Add new file
d88ffa0 Add old doc file
986e373 Add old build file
ed444b0 Add old source file

new_repo $ git log --oneline files/doc/doc_file.txt
d88ffa0 Add old doc file

And that's it! Have fun importing stuff!

Conclusion

So, there actually is a solution to this problem — no matter the repo's structure and complexity — that can be done relatively easily. The only step that doesn't scale well is the path renaming one, and sure you could automate it (and I guess you probably should if you have to do this a gazillion times), but personally I think I'll always prefer doing these things manually just to be able to check every step of the way and make sure nothing went wrong.
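If you did want to automate the renaming step, one possible shape for it is to keep the mapping in a list and generate the command from it. The sketch below only prints the command instead of running it, so you can still check it by hand; the mapping format (one old:new pair per line, most specific first) is my own convention, not something git filter-repo prescribes.

```shell
set -eu

# Hypothetical mapping: one "old:new" pair per line, parent folder last.
renames='old_files/build_stuff/:files/build/
old_files/doc_stuff/:files/doc/
old_files/source_stuff/:files/src/
old_files/:files/'

cmd="git filter-repo --path old_files/"
rename_count=0
while IFS= read -r pair; do
  [ -n "$pair" ] || continue          # skip empty lines
  cmd="$cmd --path-rename $pair"
  rename_count=$((rename_count + 1))
done <<EOF
$renames
EOF

# Print instead of running, so the command can be reviewed first.
echo "$cmd"
```

This simple version would break on paths containing spaces, so treat it as a starting point only.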

To be honest, I really don't understand why there isn't a simple builtin command that lets you do this in Git; sure, it's not a problem people come across often, but it's not an extreme corner case either. I mean, even git filter-repo that does the job manually isn't built into Git at the end of the day, and Git clearly doesn't really suggest using its builtin (git filter-branch). Subtrees sound cool and could be very simple, but as we clearly saw, they just do not work at this point, at least as far as this task is concerned.

I hope this guide helped you — let's just say that transferring Git history between repos isn't exactly the definition of fun.

