Moving files between Git repos while preserving history
written on June 23, 2021
categories: engineering
I recently found myself having to move files from one Git repo to another while preserving their history. It's one of those things you do so rarely that you've probably never thought about it, but once you run into it you realize it must come up every now and again. That was my feeling, at least.
Actually, it's even weirder: at first, I just did what my monkey-programmer-brain thought made sense and, well, copied the files over; it was only afterwards that I realized that all history on those files was now lost, as if they had come out of nowhere. As with all things, in small, hobby, 1-person projects this may be totally unimportant, but if we're talking about a repo with decades of history and lots of people working on it, well, it's kind of a huge downside to not be able to keep all of that information. After all, this is one of the main reasons we use Git.
So how do we do that? A quick internet search reveals three options (at the time of writing):
git subtree
git filter-branch
git filter-repo
Before I explain which one I consider to be the best for all cases, let's first address the fact that git filter-branch is actually redundant. On their documentation page, we read:
WARNING
git filter-branch has a plethora of pitfalls that can produce non-obvious manglings of the intended history rewrite (and can leave you with little time to investigate such problems since it has such abysmal performance). These safety and performance issues cannot be backward compatibly fixed and as such, its use is not recommended. Please use an alternative history filtering tool such as git filter-repo. If you still need to use git filter-branch, please carefully read SAFETY (and PERFORMANCE) to learn about the land mines of filter-branch, and then vigilantly avoid as many of the hazards listed there as reasonably possible.
So basically, git filter-repo should be used instead. That only leaves us with two candidates. But before actually testing, we need something to test on!
Setting up the test repos
Okay, so I've created two repos:
$ tree .
.
├── new_repo
│   └── new_file.txt
└── old_repo
    └── old_file.txt
2 directories, 2 files
As you can see, they each contain one file. Let's see their history:
old_repo $ git log --oneline
73a826f (HEAD -> master) Second commit on the old repo
028467d First commit on the old repo
new_repo $ git log --oneline
6d07ab6 (HEAD -> master) Second commit on the new repo
8a2e099 First commit on the new repo
So now they both have two commits each. The objective here will be simple: move the old_file.txt to the new repo while preserving its history.
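If you want to follow along, a setup like the one above can be recreated with something along these lines. It's only a sketch: the temporary directory, the Demo identity and the file contents are made up for the example, and your commit hashes will differ from the ones shown here.

```shell
#!/bin/sh
# Sketch: build two throwaway repos, each with one file and two commits.
set -e
work=$(mktemp -d)

# Hypothetical identity, so committing works even without a global Git config.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

make_repo() {
    # $1 = short name (old or new); creates <name>_repo containing <name>_file.txt
    git -c init.defaultBranch=master init -q "$work/${1}_repo"
    for n in First Second; do
        echo "$n line" >> "$work/${1}_repo/${1}_file.txt"
        git -C "$work/${1}_repo" add "${1}_file.txt"
        git -C "$work/${1}_repo" commit -q -m "$n commit on the $1 repo"
    done
}

make_repo old
make_repo new

git -C "$work/old_repo" log --oneline
git -C "$work/new_repo" log --oneline
```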
Trying with git subtree
If you want to get a detailed idea of how subtrees work you can read the doc, but in a nutshell they're like minimal submodules that you can create, that are not otherwise tracked (like regular submodules are), so it's very handy for when you need to move stuff around.
The first thing we need to do is create the subtree we're after in the old_repo. In order to do that, we essentially split the subtree off of the main tree of the branch:
old_repo $ git subtree split --prefix . -b old-repo-export
assertion failed: test old_file.txt = .
assertion failed: test old_file.txt = .
No new revisions were found
Aaaand this is the first limitation of git subtree: it can't just handle a file. You need to be able to move at least a directory in order to get it done. Just for the sake of showing this tool off anyway, I'll add each file in a files folder this time (everything else stays the same).
Let's try again:
old_repo $ git subtree split --prefix files/ -b old-repo-export
Created branch 'old-repo-export'
9033262df3d1b04471838da43e603c46e4323732
Cool, it worked! So it created a new branch, called old-repo-export, and it put whatever was in files/ in it. Now we can hop over to the new_repo and pull that subtree:
new_repo $ git subtree pull --prefix files/ ../old_repo/ old-repo-export
warning: no common commits
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 459 bytes | 459.00 KiB/s, done.
From ../old_repo
* branch old-repo-export -> FETCH_HEAD
fatal: refusing to merge unrelated histories
Yet another limitation: it won't let us pull in unrelated histories. Why is that? Well, it's because old_file.txt was never tracked in new_repo/, so to Git it is completely unrelated. There is a flag that allows that for git pull, called --allow-unrelated-histories (we'll actually see it later); however, it's not supported in git subtree (at least for version 2.32.0, which is the latest at the time of writing).
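To see the "unrelated histories" refusal in isolation, here's a minimal sketch using plain git fetch and git merge rather than git subtree. The repos, file names and Demo identity are throwaway assumptions; the point is just that two repos with no common commit have no merge base, and that --allow-unrelated-histories is what lets the merge go through.

```shell
#!/bin/sh
# Sketch: merging two repos that share no commits, with and without the flag.
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

for name in old new; do
    git -c init.defaultBranch=master init -q "$work/${name}_repo"
    echo "$name" > "$work/${name}_repo/${name}_file.txt"
    git -C "$work/${name}_repo" add .
    git -C "$work/${name}_repo" commit -q -m "First commit on the $name repo"
done

cd "$work/new_repo"
git fetch -q ../old_repo master

# No common ancestor means merge-base prints nothing and exits non-zero.
if git merge-base HEAD FETCH_HEAD >/dev/null; then shared=yes; else shared=no; fi

# A plain merge is refused before anything is touched.
if git merge --no-edit FETCH_HEAD 2>/dev/null; then
    plain_merge=accepted
else
    plain_merge=refused
fi

# With the flag, Git joins the two root commits under one merge commit.
git merge -q --no-edit --allow-unrelated-histories FETCH_HEAD

echo "shared history: $shared, plain merge: $plain_merge"
git log --oneline
```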
Let's try with git subtree add. Notice that, since files/ already exists, we can't use it as a prefix; we need to provide a different one, and merge them manually later.
new_repo $ git subtree add --prefix files-old/ ../old_repo/ old-repo-export
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 459 bytes | 459.00 KiB/s, done.
From ../old_repo
* branch old-repo-export -> FETCH_HEAD
Added dir 'files-old'
Well, at least now we get the complete history:
new_repo $ git log --oneline
247ccd2 (HEAD -> master) Add 'files-old/' from commit '9033262df3d1b04471838da43e603c46e4323732'
6d07ab6 Second commit on the new repo
8a2e099 First commit on the new repo
9033262 Second commit on the old repo
75cf0cf First commit on the old repo
However, in order to be realistic, we should move all of the files where they should be:
new_repo $ rsync -r files-old/ files/
new_repo $ rm -rf files-old/
new_repo $ git add .
new_repo $ git commit -m "Move old_repo files to the files/ directory"
Okay, that was kind of a bumpy ride but at the end we got all the files in the right place along with their history. Right? Well...
new_repo $ git log --oneline files/old_file.txt
d322fc1 (HEAD -> master) Move old_repo files to the files/ directory
Even --follow doesn't work here:
new_repo $ git log --oneline --follow files/old_file.txt
d322fc1 (HEAD -> master) Move old_repo files to the files/ directory
To be honest, I really don't know why this doesn't work, since we have the full history if we run git log on the root of the repo. Anyway, let's do an interactive rebase to squash the last commit; this will clean up the history, as there will only be one commit regarding the import from the old_repo/.
new_repo $ git rebase -i HEAD~2
pick 75cf0cf First commit on the old repo
pick 9033262 Second commit on the old repo
pick d322fc1 Move old_repo files to the files/ directory
# Rebase 6d07ab6..d322fc1 onto 6d07ab6 (3 commands)
What? Why is it trying to rebase the old commits as well? Also, where is the commit adding the file from the old_repo/? My best guess here is that it's replacing that commit by the entire imported history, which is just bonkers to me. If I rebase now, I'll end up applying older commits on top of newer commits, which will be just incredibly confusing for anyone trying to get useful information out of the history. This case is of course trivial, but imagine if you had a history that was thousands of commits long: all of a sudden, the changes you made yesterday would be buried under years and years of old commits. What a mess.
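One way to make sense of that todo list: git subtree add (without --squash) records the import as a merge commit, and a plain git rebase drops merge commits and replays the commits they brought in as a linear sequence. The sketch below uses an ordinary merge of two throwaway repos as a stand-in for the subtree import (that substitution, the names and the Demo identity are all assumptions for the example), and lists the non-merge commits in the same range that git rebase -i HEAD~2 operates on.

```shell
#!/bin/sh
# Sketch: which commits fall in the range HEAD~2..HEAD after importing history via a merge.
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

make_repo() {
    # $1 = short name (old or new); two commits each, as in the test repos
    git -c init.defaultBranch=master init -q "$work/${1}_repo"
    for n in First Second; do
        echo "$n" >> "$work/${1}_repo/${1}_file.txt"
        git -C "$work/${1}_repo" add .
        git -C "$work/${1}_repo" commit -q -m "$n commit on the $1 repo"
    done
}
make_repo old
make_repo new

cd "$work/new_repo"
git fetch -q ../old_repo master
# Stand-in for the merge commit created by the subtree import.
git merge -q --no-edit --allow-unrelated-histories FETCH_HEAD
echo "more" >> new_file.txt
git commit -q -a -m "Follow-up commit after the import"

# HEAD~2 follows first parents only, so it lands on "Second commit on the new repo".
# Everything reachable through the merge's second parent is still inside the range.
git log --oneline --no-merges HEAD~2..HEAD
todo_count=$(git rev-list --count --no-merges HEAD~2..HEAD)
echo "$todo_count non-merge commits in the range"
```

If keeping the merge intact matters, git rebase has a --rebase-merges option that is worth reading about before trying this on a real repo.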
Actually solving the problem with git filter-repo
Thankfully, there is a solution to this problem. Granted, it's a bit more complicated, but at least it gets the job done, no matter the complexity of the repos or the type of stuff you want to move. I'll do a more complete demo later, but let's start from the same base as before.
In order to make this work, we first need to install git filter-repo; it's not a Git builtin, but rather an external package that can be installed via pip:
$ pip install git-filter-repo
Then, we can use it in the old_repo/. Make sure you use a "disposable" clone for this, as it will end up modifying your local clone. You'll see what I mean.
old_repo $ git filter-repo --path files/ --force
We need to use --force here because the repo is not a fresh clone; if you're working with a real repo that has a remote, you can simply create a new clone and not use the --force switch.
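Here's a rough sketch of what the "disposable clone" approach could look like when the source repo only lives on your disk. The repo layout, the old_repo_export name and the Demo identity are assumptions for the example. The clone is made with --no-local so that Git goes through its normal transport and produces a freshly packed copy, and the rewrite only runs if git-filter-repo is actually found on the PATH.

```shell
#!/bin/sh
# Sketch: rewrite a throwaway clone and leave the original repo alone.
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# A small source repo: one file under files/, one file outside of it.
git -c init.defaultBranch=master init -q "$work/old_repo"
mkdir "$work/old_repo/files"
echo "one" > "$work/old_repo/files/old_file.txt"
echo "notes" > "$work/old_repo/README.txt"
git -C "$work/old_repo" add .
git -C "$work/old_repo" commit -q -m "First commit on the old repo"

# Disposable copy; old_repo_export is a hypothetical name.
git clone -q --no-local "$work/old_repo" "$work/old_repo_export"

if command -v git-filter-repo >/dev/null 2>&1; then
    # Fresh clone, so no --force here.
    git -C "$work/old_repo_export" filter-repo --path files/
else
    echo "git-filter-repo not found on PATH; skipping the rewrite"
fi

# Whatever happened to the export, the original still has all of its files.
ls "$work/old_repo"
```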
old_repo $ ls
files
Okay, so what changed? Well, it's difficult to show here, but actually everything in the repo's history was thrown out, except for the files/ directory (that we passed through the --path option). This allows us to simply use this "stripped" version of the repo in a merge, in order to get all history in the same repo. Let's do that, then:
new_repo $ git remote add old-repo-import ../old_repo/
new_repo $ git pull old-repo-import master --allow-unrelated-histories
And that's it! We've now successfully imported all of our history. You can check it by looking at the log yet again:
new_repo $ git log --oneline files/old_file.txt
80b3123 (old-repo-import/master) Second commit on the old repo
128322a First commit on the old repo
By the way, at this point you can also get rid of the dummy remote, so that its remote-tracking branch (old-repo-import/master) no longer shows up in the log:
new_repo $ git remote rm old-repo-import
new_repo $ git log --oneline files/old_file.txt
80b3123 Second commit on the old repo
128322a First commit on the old repo
Much better!
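The import step itself only needs plain Git, so it can be tried without git filter-repo. In this sketch the old repo is created with its file already under files/, standing in for the filtered clone (an assumption, like the Demo identity and the file contents). I've also added --no-rebase and --no-edit to the pull: recent Git versions may refuse to pull divergent branches until you say how to reconcile them, and --no-edit keeps the merge from opening an editor.

```shell
#!/bin/sh
# Sketch: import history through a temporary remote, then check the per-file log.
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Stand-in for the filtered clone: it only contains files/.
git -c init.defaultBranch=master init -q "$work/old_repo"
mkdir "$work/old_repo/files"
for n in First Second; do
    echo "$n" >> "$work/old_repo/files/old_file.txt"
    git -C "$work/old_repo" add .
    git -C "$work/old_repo" commit -q -m "$n commit on the old repo"
done

git -c init.defaultBranch=master init -q "$work/new_repo"
echo "new" > "$work/new_repo/new_file.txt"
git -C "$work/new_repo" add .
git -C "$work/new_repo" commit -q -m "First commit on the new repo"

cd "$work/new_repo"
git remote add old-repo-import ../old_repo
git pull -q --no-rebase --no-edit --allow-unrelated-histories old-repo-import master

# Count the commits that touch the imported file, before and after dropping the remote.
before=$(git rev-list --count HEAD -- files/old_file.txt)
git remote rm old-repo-import
after=$(git rev-list --count HEAD -- files/old_file.txt)

echo "commits touching files/old_file.txt: $before before, $after after removing the remote"
```

Removing the remote only deletes the remote's configuration and its remote-tracking refs; the imported commits are reachable from master through the merge, so they stay.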
A non-trivial example
However, in reality things are often more complex than that. During these migrations, you might have to change some directory names, which will result in Git perceiving the files as "deleted" and "created"; this, in turn, will make the full history inaccessible unless you use --follow. Actually, it's even worse: without --follow, you'll get the history after the merge, and with it you'll get the history before the merge. That's pretty annoying to remember, and it's pretty much like having two places you need to look in every time you want to consult the history of the repo.
Thankfully, there is a way to do this in git filter-repo... but it doesn't scale well. Let's see an example.
I've created two new repos that are a bit more complex. Here's old_repo/:
old_repo $ tree .
.
└── old_files
    ├── build_stuff
    │   └── build_file.txt
    ├── doc_stuff
    │   └── doc_file.txt
    └── source_stuff
        └── source_file.txt
4 directories, 3 files
old_repo $ git log --oneline
5be5a3f (HEAD -> master) Add old doc file
3b3cea6 Add old build file
6e5c48c Add old source file
And here's new_repo/:
new_repo $ tree .
.
└── new_file.txt
0 directories, 1 file
new_repo $ git log --oneline
b607e66 (HEAD -> master) Modify new file
63315ee Add new file
Now, we're going to make the migration a bit more complex as well: let's say we want to import all of the content of the old_files/ directory from the old_repo/, but we want to perform the following renames:
old_files/ should become files/
build_stuff/ should become build/
doc_stuff/ should become doc/
source_stuff/ should become src/
Now, the naive strategy here would be to just run git filter-repo on old_files/ and then perform the renames manually, but firstly there's no need since there is already functionality in git filter-repo to handle that, and secondly (and most importantly): we will lose history by doing that. Remember what I said about renames? We'll need to use --follow, we'll break production, the world will catch fire, the universe will explode and the aliens will make us rebuild it in JavaScript.
So, let's start in the exact same way: by fetching a fresh clone of the old repo (I don't have a remote in this case because it's a simple demo, but you get the point) and running git filter-repo on it. However, this time we will also specify the renames we wish to apply:
old_repo $ git filter-repo --path old_files/ --path-rename old_files/build_stuff/:files/build/ --path-rename old_files/doc_stuff/:files/doc/ --path-rename old_files/source_stuff/:files/src/ --path-rename old_files/:files/ --force
Again, you don't need to use --force here if you have a fresh clone. Well, did that work?
old_repo $ tree .
.
└── files
    ├── build
    │   └── build_file.txt
    ├── doc
    │   └── doc_file.txt
    └── src
        └── source_file.txt
4 directories, 3 files
old_repo $ git log --oneline
d88ffa0 (HEAD -> master) Add old doc file
986e373 Add old build file
ed444b0 Add old source file
It most certainly did!
So, that rename command might not be as trivial as you think: the order is actually important. I made a very conscious choice when I added the "parent folder" (old_files/ in this case) rename to the end of the list, so that it gets applied last; if we put it first, then all of the other ones get ignored because there's no old_files/ folder anymore, so their paths are all wrong. I'm pretty sure it would work if you put it first, but then changed old_files/ to just files/ in the other paths though.
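To make the ordering argument concrete, here's a toy model in plain shell. It does not call git filter-repo at all; it simply assumes that each old:new rule is tried in the order given and replaces a matching path prefix, which is the behavior described above. The apply_renames function is a hypothetical helper written for this example.

```shell
#!/bin/sh
# Toy model: apply old_prefix:new_prefix rules to a path, one after the other.
apply_renames() {
    # $1 = path, remaining arguments = rules in the order they would be passed
    path=$1
    shift
    for rule in "$@"; do
        from=${rule%%:*}   # text before the first colon
        to=${rule#*:}      # text after the first colon
        case $path in
            "$from"*) path=$to${path#"$from"} ;;
        esac
    done
    printf '%s\n' "$path"
}

file=old_files/doc_stuff/doc_file.txt

# Specific rule first, parent folder last (the order used in the command above).
apply_renames "$file" old_files/doc_stuff/:files/doc/ old_files/:files/
# prints files/doc/doc_file.txt

# Parent folder first: the second rule no longer matches anything.
apply_renames "$file" old_files/:files/ old_files/doc_stuff/:files/doc/
# prints files/doc_stuff/doc_file.txt

# Parent folder first, with the later rule written against the new prefix.
apply_renames "$file" old_files/:files/ files/doc_stuff/:files/doc/
# prints files/doc/doc_file.txt
```

Under that assumption, the third variant matches the hunch that putting the parent rename first works as long as the other paths are adjusted. It's still worth confirming on a disposable clone before relying on it.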
In any case, the rest of the process should be easy, as it's the same as before: add the filtered repo as a dummy remote, pull from it, remove it, and you're done.
new_repo $ git remote add old-repo-import ../old_repo
new_repo $ git pull old-repo-import master --allow-unrelated-histories
new_repo $ git remote rm old-repo-import
And as you can see, it totally works:
new_repo $ git log --oneline
ba16b00 (HEAD -> master) Modify new file
bff28a2 Add new file
d88ffa0 Add old doc file
986e373 Add old build file
ed444b0 Add old source file
new_repo $ git log --oneline files/doc/doc_file.txt
d88ffa0 Add old doc file
And that's it! Have fun importing stuff!
Conclusion
So, there actually is a solution to this problem — no matter the repo's structure and complexity — that can be done relatively easily. The only step that doesn't scale well is the path renaming one, and sure you could automate it (and I guess you probably should if you have to do this a gazillion times), but personally I think I'll always prefer doing these things manually just to be able to check every step of the way and make sure nothing went wrong.
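For what it's worth, automating the rename step doesn't have to mean giving up the manual check. Here's a small sketch that builds the git filter-repo command from a list of old:new pairs and only prints it, so you can still read it before running anything. The renames variable and its contents are just the mapping from the example above; the parent folder is deliberately listed last.

```shell
#!/bin/sh
# Sketch: assemble the filter-repo invocation from a rename list, then print it for review.

# One old:new pair per line, most specific paths first, parent folder last.
renames='old_files/build_stuff/:files/build/
old_files/doc_stuff/:files/doc/
old_files/source_stuff/:files/src/
old_files/:files/'

# Use the positional parameters as the argument list being built.
set -- git filter-repo --path old_files/
while IFS= read -r pair; do
    [ -n "$pair" ] || continue          # skip empty lines
    set -- "$@" --path-rename "$pair"
done <<EOF
$renames
EOF

cmd="$*"
echo "$cmd"
```

Once the printed command looks right, running "$@" instead of echoing it (from inside the disposable clone) would execute exactly those arguments.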
To be honest, I really don't understand why there isn't a simple builtin command that lets you do this in Git; sure, it's not a problem people come across often, but it's not an extreme corner case either. I mean, even git filter-repo that does the job manually isn't built into Git at the end of the day, and Git clearly doesn't really suggest using its builtin (git filter-branch). Subtrees sound cool and could be very simple, but as we clearly saw, they just do not work at this point, at least as far as this task is concerned.
I hope this guide helped you — let's just say that transferring Git history between repos isn't exactly the definition of fun.