Moving files between Git repos while preserving history
written on June 23, 2021
I recently had to move files from one Git repo to another while preserving their history. It's one of those tasks that comes up so rarely that you may never have thought about it, but once you run into it you realize it must happen every now and again. That was my feeling, at least.
Actually, it's even weirder: at first, I just did what my monkey-programmer brain thought made sense and, well, copied the files over; it was only afterwards that I realized that all history on those files was now lost, as if they had come out of nowhere. As with all things, in small, hobby, one-person projects this may be totally unimportant, but if we're talking about a repo with decades of history and lots of people working on it, not being able to keep all of that information is a huge downside. After all, this is one of the main reasons we use Git.
So how do we do that? A quick web search reveals three options (at the time of writing):
- git subtree;
- git filter-branch;
- git filter-repo.
Before I explain which one I consider to be the best for all cases, let's first address the fact that git filter-branch is actually redundant/obsolete. On its documentation page, we read:
WARNING
git filter-branch has a plethora of pitfalls that can produce non-obvious manglings of the intended history rewrite (and can leave you with little time to investigate such problems since it has such abysmal performance). These safety and performance issues cannot be backward compatibly fixed and as such, its use is not recommended. Please use an alternative history filtering tool such as git filter-repo. If you still need to use git filter-branch, please carefully read SAFETY (and PERFORMANCE) to learn about the land mines of filter-branch, and then vigilantly avoid as many of the hazards listed there as reasonably possible.
So basically, git filter-repo should be used instead. That only leaves us with two candidates. But before actually testing, we need something to test on!
Setting up the test repos
Okay, so I've created two repos:
$ shell
$ tree .
.
├── new_repo
│   └── new_file.txt
└── old_repo
    └── old_file.txt
2 directories, 2 files
As you can see, they each contain one file. Let's see their history:
$ shell
old_repo $ git log --oneline
73a826f (HEAD -> master) Second commit on the old repo
028467d First commit on the old repo
new_repo $ git log --oneline
6d07ab6 (HEAD -> master) Second commit on the new repo
8a2e099 First commit on the new repo
So now they both have two commits each. The objective here will be simple: move old_file.txt to the new repo while preserving its history.
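For anyone who wants to follow along, here is a minimal sketch of how two such test repos could be created. The file contents, the throwaway identity, and the use of a temporary directory are my own assumptions; only the repo names, file names, and commit messages come from the listings above.

```shell
#!/bin/sh
set -e

# Throwaway identity so the commits work even without a global Git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)   # scratch directory holding both repos

for name in old new; do
    repo="$work/${name}_repo"
    git init -q "$repo"
    # Use the branch name shown in the article, whatever init.defaultBranch says
    git -C "$repo" symbolic-ref HEAD refs/heads/master

    echo "first version" > "$repo/${name}_file.txt"
    git -C "$repo" add .
    git -C "$repo" commit -q -m "First commit on the $name repo"

    echo "second version" >> "$repo/${name}_file.txt"
    git -C "$repo" commit -q -am "Second commit on the $name repo"
done

old_count=$(git -C "$work/old_repo" rev-list --count HEAD)
new_count=$(git -C "$work/new_repo" rev-list --count HEAD)
echo "old_repo: $old_count commits, new_repo: $new_count commits"
```

The commit hashes will of course differ from the ones shown in this post.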
Trying with git subtree
If you want to get a detailed idea of how subtrees work you can read the doc, but in a nutshell they're like minimal submodules that you can create, that are not otherwise tracked (like regular submodules are), so it's very handy for when you need to move stuff around.
The first thing we need to do is create the subtree we're after in old_repo. In order to do that, we essentially split the subtree off of the main tree of the branch:
$ shell
old_repo $ git subtree split --prefix . -b old-repo-export
assertion failed: test old_file.txt = .
assertion failed: test old_file.txt = .
No new revisions were found
Aaaand this is the first limitation of git subtree: it can't handle a single file on its own. You need to move at least a directory in order to get it done. Just for the sake of showing this tool off anyway, I'll put each file in a files/ directory this time (everything else stays the same).
Let's try again:
$ shell
old_repo $ git subtree split --prefix files/ -b old-repo-export
Created branch 'old-repo-export'
9033262df3d1b04471838da43e603c46e4323732
Cool, it worked! So it created a new branch, called old-repo-export, and it put whatever was in files/ in it. Now we can hop over to new_repo and pull that subtree:
$ shell
new_repo $ git subtree pull --prefix files/ ../old_repo/ old-repo-export
warning: no common commits
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 459 bytes | 459.00 KiB/s, done.
From ../old_repo
* branch old-repo-export -> FETCH_HEAD
fatal: refusing to merge unrelated histories
Yet another limitation: it won't let us pull in unrelated histories. What is that? Well, it's because old_file.txt was never tracked in new_repo/, so to Git it is completely unrelated. There is a flag that allows that for git pull, called --allow-unrelated-histories (we'll actually see it later); however, it's not supported in git subtree (at least for version 2.32.0, which is the latest at the time of writing).
Let's try with git subtree add. Notice that, since files/ already exists, we can't use it as a prefix; we need to provide a different one, and merge them manually later.
$ shell
new_repo $ git subtree add --prefix files-old/ ../old_repo/ old-repo-export
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 459 bytes | 459.00 KiB/s, done.
From ../old_repo
* branch old-repo-export -> FETCH_HEAD
Added dir 'files-old'
Well, at least now we get the complete history:
$ shell
new_repo $ git log --oneline
247ccd2 (HEAD -> master) Add 'files-old/' from commit '9033262df3d1b04471838da43e603c46e4323732'
6d07ab6 Second commit on the new repo
8a2e099 First commit on the new repo
9033262 Second commit on the old repo
75cf0cf First commit on the old repo
However, in order to be realistic, we should move all of the files where they should be:
$ shell
new_repo $ rsync -r files-old/ files/
new_repo $ rm -rf files-old/
new_repo $ git add .
new_repo $ git commit -m "Move old_repo files to the files/ directory"
Okay, that was kind of a bumpy ride but at the end we got all the files in the right place along with their history. Right? Well...
$ shell
new_repo $ git log --oneline files/old_file.txt
d322fc1 (HEAD -> master) Move old_repo files to the files/ directory
Even --follow doesn't work here:
$ shell
new_repo $ git log --oneline --follow files/old_file.txt
d322fc1 (HEAD -> master) Move old_repo files to the files/ directory
To be honest, I really don't know why this doesn't work, since we have the full history if we run git log at the root of the repo. Anyway, let's do an interactive rebase to squash the last commit; this will clean up the history, as there will be only one commit for the import from old_repo/.
$ shell
new_repo $ git rebase -i HEAD~2
pick 75cf0cf First commit on the old repo
pick 9033262 Second commit on the old repo
pick d322fc1 Move old_repo files to the files/ directory
# Rebase 6d07ab6..d322fc1 onto 6d07ab6 (3 commands)
What? Why is it trying to rebase the old commits as well? Also, where is the commit adding the files from old_repo/? My best guess here is that it's replacing that commit with the entire imported history, which is just bonkers to me. (That guess seems to be right: the commit created by git subtree add is a merge commit, and by default an interactive rebase drops merge commits and lists the commits they brought in instead, unless you pass --rebase-merges.) If I rebase now, I'll end up applying older commits on top of newer commits, which will be just incredibly confusing for anyone trying to get useful information out of the history. This case is of course trivial, but imagine if you had a history that was thousands of commits long: all of a sudden, the changes you made yesterday would be buried under years and years of old commits. What a mess.
Actually solving the problem with git filter-repo
Thankfully, there is a solution to this problem. Granted, it's a bit more complicated, but at least it gets the job done, no matter the complexity of the repos or the type of stuff you want to move. I'll do a more complete demo later, but let's start from the same base as before.
In order to make this work, we first need to install git filter-repo; it's not a Git builtin, but rather an external package that can be installed via pip:
$ shell
$ pip install git-filter-repo
Then, we can use it in old_repo/. Make sure you use a disposable clone for this, as it will end up modifying your local clone. You'll see what I mean.
$ shell
old_repo $ git filter-repo --path files/ --force
We need to use --force here because the repo is not a fresh clone; if you're working with a real repo that has a remote, you can simply create a new clone and not use the --force switch.
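As a sketch of what a disposable clone could look like when the original only exists locally: the snippet below builds a stand-in old_repo and clones it into a separate directory. The directory name old_repo_disposable is made up for the example. As far as I understand git filter-repo's fresh-clone check, a clone of a local path needs --no-local to count as fresh, so treat that flag as an assumption to verify against its documentation.

```shell
#!/bin/sh
set -e

# Throwaway identity so the commit works even without a global Git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# Stand-in for the real old_repo
git init -q "$work/old_repo"
mkdir "$work/old_repo/files"
echo "data" > "$work/old_repo/files/old_file.txt"
git -C "$work/old_repo" add .
git -C "$work/old_repo" commit -q -m "First commit on the old repo"

# --no-local makes Git go through its regular transport instead of
# hardlinking or copying the object files of the source repo
git clone -q --no-local "$work/old_repo" "$work/old_repo_disposable"

# The history rewrite would then be run inside the clone only, e.g.:
#   git -C "$work/old_repo_disposable" filter-repo --path files/

original_head=$(git -C "$work/old_repo" rev-parse HEAD)
clone_head=$(git -C "$work/old_repo_disposable" rev-parse HEAD)
echo "original: $original_head"
echo "clone:    $clone_head"
```

Whatever happens in the clone afterwards, the original repo keeps its history untouched.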
$ shell
old_repo $ ls
files
Okay, so what changed? Well, it's difficult to show here, but actually everything in the repo's history was thrown out, except for the files/ directory (that we passed through the --path option). This allows us to simply use this stripped version of the repo in a merge, in order to get all history in the same repo.
Let's do that, then:
$ shell
new_repo $ git remote add old-repo-import ../old_repo/
new_repo $ git pull old-repo-import master --allow-unrelated-histories
And that's it! We've now successfully imported all of our history. You can check it by looking at the log yet again:
$ shell
new_repo $ git log --oneline files/old_file.txt
80b3123 (old-repo-import/master) Second commit on the old repo
128322a First commit on the old repo
By the way, at this point you can also get rid of the dummy remote, so that its branch names don't clutter the log output:
$ shell
new_repo $ git remote rm old-repo-import
new_repo $ git log --oneline files/old_file.txt
80b3123 Second commit on the old repo
128322a First commit on the old repo
Much better!
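The merge step can be tried out with plain Git, without git filter-repo installed. In this sketch the make_repo helper and the file contents are hypothetical, and old_repo simply starts out containing nothing but files/, standing in for the filtered repo. I also added --no-rebase and --no-edit, which are not in the commands above: as far as I know, recent Git versions refuse to pull divergent branches unless a strategy is chosen, and --no-edit keeps the merge from opening an editor.

```shell
#!/bin/sh
set -e

# Throwaway identity so the commits work even without a global Git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# Hypothetical helper: make_repo <dir> <file> <label>
# Creates a repo with two commits that both touch <file>
make_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
    mkdir -p "$1/$(dirname "$2")"
    echo "one" > "$1/$2"
    git -C "$1" add .
    git -C "$1" commit -q -m "First commit on the $3 repo"
    echo "two" >> "$1/$2"
    git -C "$1" commit -q -am "Second commit on the $3 repo"
}

make_repo "$work/old_repo" files/old_file.txt old   # stands in for the filtered repo
make_repo "$work/new_repo" files/new_file.txt new

cd "$work/new_repo"
git remote add old-repo-import ../old_repo
# Merge the unrelated history, then drop the temporary remote
git pull -q --no-rebase --no-edit --allow-unrelated-histories old-repo-import master
git remote rm old-repo-import

imported=$(git log --oneline -- files/old_file.txt | wc -l | tr -d ' ')
total=$(git rev-list --count HEAD)
echo "commits touching files/old_file.txt: $imported, total commits: $total"
```

The total counts two commits from each repo plus the merge commit that joins them.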
A non-trivial example
However, in reality things are often more complex than that. During these migrations, you might have to change some directory names, which will result in Git perceiving the files as deleted and created; this, in turn, will make the full history inaccessible unless you use --follow. Actually, it's even worse: without --follow, you'll get the history after the merge, and with it you'll get the history before the merge. That's pretty annoying to have to remember, and it's pretty much like having two places you need to look in every time you want to consult the history of the repo.
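To see the basic effect in isolation, here is a small sketch with a plain directory rename in a single repo, without any merge involved. The directory and file names are made up for the example.

```shell
#!/bin/sh
set -e

# Throwaway identity so the commits work even without a global Git config
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

mkdir old_dir
printf 'line 1\nline 2\nline 3\n' > old_dir/file.txt
git add .
git commit -q -m "Add file"

# Rename the directory: for Git, the old path disappears and a new one shows up
git mv old_dir new_dir
git commit -q -m "Rename old_dir to new_dir"

echo "line 4" >> new_dir/file.txt
git commit -q -am "Modify file after the rename"

# A path-limited log only sees commits that touched this exact path
plain=$(git log --oneline -- new_dir/file.txt | wc -l | tr -d ' ')
# --follow additionally traces the file back through the rename
followed=$(git log --oneline --follow -- new_dir/file.txt | wc -l | tr -d ' ')
echo "without --follow: $plain commits, with --follow: $followed commits"
```

Without --follow the log stops at the rename commit, so the commit that originally added the file is missing from the listing.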
Thankfully, there is a way to do this in git filter-repo... but it doesn't scale well. Let's see an example.
I've created two new repos that are a bit more complex. Here's old_repo/:
$ shell
old_repo $ tree .
.
└── old_files
    ├── build_stuff
    │   └── build_file.txt
    ├── doc_stuff
    │   └── doc_file.txt
    └── source_stuff
        └── source_file.txt
4 directories, 3 files
old_repo $ git log --oneline
5be5a3f (HEAD -> master) Add old doc file
3b3cea6 Add old build file
6e5c48c Add old source file
And here's new_repo/:
$ shell
new_repo $ tree .
.
└── new_file.txt
0 directories, 1 file
new_repo $ git log --oneline
b607e66 (HEAD -> master) Modify new file
63315ee Add new file
Now, we're going to make the migration a bit more complex as well: let's say we want to import all of the content of the old_files/ directory from old_repo/, but we want to perform the following renames:
- old_files/ should become files/;
- build_stuff/ should become build/;
- doc_stuff/ should become doc/;
- source_stuff/ should become src/.
Now, the naive strategy here would be to just run git filter-repo on old_files/ and then perform the renames manually, but firstly there's no need since there is already functionality in git filter-repo to handle that, and secondly (and most importantly): we will lose history by doing that. Remember what I said about renames? We'll need to use --follow, we'll break prod, the world will catch fire, the universe will explode and the aliens will make us rebuild it in JavaScript.
So, let's start in the exact same way: by fetching a fresh clone of the old repo (I don't have a remote in this case because it's a simple demo, but you get the point) and running git filter-repo on it. However, this time we will also specify the renames we wish to apply:
$ shell
old_repo $ git filter-repo --path old_files/ --path-rename old_files/build_stuff/:files/build/ \
--path-rename old_files/doc_stuff/:files/doc/ \
--path-rename old_files/source_stuff/:files/src/ \
--path-rename old_files/:files/ --force
Again, you don't need to use --force here if you have a fresh clone. Well, did that work?
$ shell
old_repo $ tree .
.
└── files
    ├── build
    │   └── build_file.txt
    ├── doc
    │   └── doc_file.txt
    └── src
        └── source_file.txt
4 directories, 3 files
old_repo $ git log --oneline
d88ffa0 (HEAD -> master) Add old doc file
986e373 Add old build file
ed444b0 Add old source file
It most certainly did!
So, that rename command might not be as trivial as you think: the order is actually important. I made a very conscious choice when I added the parent directory rename (old_files/ in this case) to the end of the list, so that it gets applied last; if we put it first, then all of the other ones get ignored because there's no old_files/ directory anymore, so their paths are all wrong. I'm pretty sure it would work if you put it first but then changed old_files/ to just files/ in the other paths, though.
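The ordering issue can be illustrated with a toy model. This is not git filter-repo's code: apply_renames is a hypothetical function that assumes the rules are applied one after another as simple prefix substitutions, which is the behaviour described above.

```shell
#!/bin/sh
set -e

# Toy model (an assumption, not filter-repo's implementation):
# apply each "old_prefix:new_prefix" rule in order to a single path
apply_renames() {
    path=$1
    shift
    for rule in "$@"; do
        old=${rule%%:*}
        new=${rule#*:}
        case $path in
            "$old"*) path=$new${path#"$old"} ;;   # prefix matches: substitute it
        esac
    done
    printf '%s\n' "$path"
}

file=old_files/build_stuff/build_file.txt

# Parent rule last, as in the command used in this post
parent_last=$(apply_renames "$file" old_files/build_stuff/:files/build/ old_files/:files/)
# Parent rule first: the second rule no longer finds an old_files/ prefix
parent_first=$(apply_renames "$file" old_files/:files/ old_files/build_stuff/:files/build/)
# Parent rule first, with the second rule written against the already renamed path
adjusted=$(apply_renames "$file" old_files/:files/ files/build_stuff/:files/build/)

echo "parent last:  $parent_last"
echo "parent first: $parent_first"
echo "adjusted:     $adjusted"
```

Under this model, only the middle variant leaves the file in files/build_stuff/ instead of files/build/.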
In any case, the rest of the process should be easy, as it's the same as before: add the filtered repo as a dummy remote, pull from it, remove it, and you're done.
$ shell
new_repo $ git remote add old-repo-import ../old_repo
new_repo $ git pull old-repo-import master --allow-unrelated-histories
new_repo $ git remote rm old-repo-import
And as you can see, it totally works:
$ shell
new_repo $ git log --oneline
ba16b00 (HEAD -> master) Modify new file
bff28a2 Add new file
d88ffa0 Add old doc file
986e373 Add old build file
ed444b0 Add old source file
new_repo $ git log --oneline files/doc/doc_file.txt
d88ffa0 Add old doc file
And that's it! Have fun importing stuff!
Conclusion
So, there actually is a solution to this problem—seemingly, no matter the repo's structure and complexity—that can be done relatively easily. The only step that doesn't scale well is the path renaming one, and sure you could automate it (and I guess you probably should if you have to do this a gazillion times), but personally I think I'll always prefer doing these things manually just to be able to check every step of the way and make sure nothing went wrong.
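If you do want to automate the renaming step, one possible approach is to generate the command from a rename table. The sketch below is only an idea: the table format is made up, and it sorts the rules so that the longest old prefix comes first, which puts nested directories before their parent. It only prints the command, so you can still check it before running anything.

```shell
#!/bin/sh
set -e

# Hypothetical rename table: one "old_prefix:new_prefix" rule per line, in any order
rules='old_files/:files/
old_files/build_stuff/:files/build/
old_files/doc_stuff/:files/doc/
old_files/source_stuff/:files/src/'

# Sort by the length of the old prefix, longest first,
# so that nested directories come before their parent
sorted=$(printf '%s\n' "$rules" \
    | awk -F: '{ print length($1), $0 }' \
    | sort -rn \
    | cut -d' ' -f2-)

cmd='git filter-repo --path old_files/'
for rule in $sorted; do
    cmd="$cmd --path-rename $rule"
done

# Print the command instead of running it, so it can be reviewed first
echo "$cmd"
```

The loop relies on the paths containing no whitespace; a table with spaces in directory names would need more careful quoting.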
To be honest, I really don't understand why there isn't a simple builtin command that lets you do this in Git; sure, it's not a problem people come across often, but it's not an extreme corner case either. I mean, even git filter-repo, which does the job manually, isn't built into Git at the end of the day, and Git's own documentation advises against using its builtin (git filter-branch). Subtrees sound cool and could be very simple, but as we clearly saw, they just do not work at this point, at least as far as this task is concerned.
I hope this guide helped you—let's just say that transferring Git history between repos isn't exactly the definition of fun.