A few weeks ago at work, a new repository was created from a subset of an existing one. However, it was created by simply copying files from the original repository, losing the entire git history in the process.
Creating the new repository
Let’s call the new repository new-repo, and the first one src-repo.
The src-repo repository can be seen as a classic webapp with the following directory structure:
backend/
frontend/
images/
The images directory contains Docker files and scripts (for software and tools unrelated to backend and frontend), which were moved for convenience to the new-repo repository.
It was roughly created like this:
$ git init new-repo
$ cd new-repo
$ cp -a /path/to/src-repo/images images
$ git add .
$ git commit -m "init: copied files from src-repo/images"
Then the result was pushed to GitHub, and the new repository started living its life with new GitHub Actions workflows. A few days ago, the images directory was deleted from src-repo as everything had been migrated.
Note: I personally much prefer having everything in a single monolithic repository (up to a certain size), as it simplifies a lot of things. But due to a split between deployment and software teams, they wanted that directory to live in its own repository, as the two had different life cycles.
A bug appeared in production
new-repo started living its life.
A few days ago, changes to some core files were made and then deployed to production.
The bug was really important, but it was only triggered by an edge case that was not covered by integration tests. So that was a first big miss.
The changes refactored a part of the code that was there for historical reasons, without anyone knowing what those reasons were. As the git history was no longer available in new-repo, the rationale could not be looked up there. It was obvious deep in the git history of src-repo, but as the images directory had been deleted from it, nobody looked.
I am also pretty sure that the bug would have occurred even with the git history available, as the changes looked quite harmless.
Anyway, I decided to merge back the src-repo/images history into new-repo to have everything at hand for next time.
Importing Git history
Cleaning the first repository
In a previous company, I had already used git subtree to import the history of a repository into a new directory of another repository, which is the standard and easy use case.
I wanted to do a bit of the same thing here, but it had a few challenges:
- there is no sub-directory to import, we want to import the history for existing files
- new-repo had started living its own life, so files cannot simply be overwritten
Anyway, the first step was to extract the history of the src-repo/images cleanly to be reimported.
After some googling (well, LLM search), I found a new tool called git-filter-repo that did exactly what I needed.
Install was easy: brew install git-filter-repo.
I first cloned a new instance of src-repo to be able to work inside cleanly without fear of breaking anything on my development setup.
As the relevant directory had been removed in a previous commit, I needed to check out the commit just before the removal. Instead of using a new branch, I just ran git reset --hard to that commit, as I did not care about the rest of the history.
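One way to locate that commit, sketched on a throwaway repository: ask git for the last commit that touched images/ (assumed here to be the deletion itself) and take its first parent. The repository layout, file contents and commit messages are made up for the demo; only the two lookup commands matter.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Throwaway repository standing in for a fresh clone of src-repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
mkdir images backend
echo "FROM alpine" > images/Dockerfile
echo "print('hi')" > backend/app.py
git add .
git commit -q -m "add images and backend"
git rm -q -r images
git commit -q -m "remove images after migration"

# Last commit on main that touched images/ (assumption: it is the deletion).
deletion=$(git log -1 --format=%H main -- images)
# Its first parent is the last commit where images/ still existed.
COMMIT_BEFORE_DELETION=$(git rev-parse "${deletion}^")

git reset -q --hard "$COMMIT_BEFORE_DELETION"
ls images
```

If the directory was removed over several commits, or inside a merge, the lookup would need to be adapted by hand.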
I then did a bit of trial and error with git filter-repo and --subdirectory-filter, but that did not suit what I wanted:
- it removed the first directory (images here) that I wanted to keep, as it moves files to the root directory
- it kept the history of all branches, but I wanted to only keep main
I settled for a simple git filter-repo --path images --refs main which allowed me to keep the images directory of the main branch to match what was done in new-repo.
In the end, I had my src-repo with only the history of images, and all other files removed.
# before filtering
src-repo-main
A [images] --- B [backend] --- C [images] --- D [frontend]
|
| "git filter-repo --path images --refs main"
v
src-repo-main (Cleaned)
A' [images] ----------------- C' [images]
# after filtering:
# - commits B and D have been removed
# - history is now specific to "images" with A' and C' having new commit hashes
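A quick sanity check for the filtered result is to list every path that appears anywhere in the branch history and flag anything outside images/. The sketch below runs on a small unfiltered stand-in repository, so it reports the backend file; on a correctly filtered clone it should report nothing. The helper name paths_outside_images is made up for this example, and merge commits are not inspected by git log --name-only by default.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Unfiltered stand-in for src-repo: one images commit, one backend commit.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
mkdir images backend
echo "FROM alpine" > images/Dockerfile
git add images
git commit -q -m "A: images"
echo "print('hi')" > backend/app.py
git add backend
git commit -q -m "B: backend"

# Hypothetical helper: every path touched on a branch, minus those in images/.
paths_outside_images() {
    git log --format= --name-only "$1" | sort -u | grep -v -e '^images/' -e '^$' || true
}

leftovers=$(paths_outside_images main)
echo "paths outside images/: ${leftovers:-none}"
```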
Importing the history into the new repository
Now that src-repo had a “cleaned” importable state, I could import it into new-repo. However, as new-repo already had its own new history, I could not just overwrite everything, so I had to create a new “empty” branch with: git switch --orphan src-repo-main.
I could then easily import the history of src-repo by pulling it into the empty branch with: git pull /path/to/src-repo main. This imported the cleaned history into the src-repo-main branch of new-repo.
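The import can be tried end to end on two throwaway repositories. This is only a sketch: the "cleaned" repository is created by hand rather than with git filter-repo, and all file contents and commit messages are invented.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Stand-in for the filtered clone: two commits touching only images/.
git init -q "$work/src-repo-cleaned"
cd "$work/src-repo-cleaned"
git symbolic-ref HEAD refs/heads/main
mkdir images
echo "FROM alpine:3.18" > images/Dockerfile
git add .
git commit -q -m "A': add Dockerfile"
echo "RUN apk add curl" >> images/Dockerfile
git commit -q -am "C': install curl"

# Stand-in for new-repo, created by copying files without history.
git init -q "$work/new-repo"
cd "$work/new-repo"
git symbolic-ref HEAD refs/heads/main
mkdir images
cp "$work/src-repo-cleaned/images/Dockerfile" images/
git add .
git commit -q -m "X: init: copied files from src-repo/images"

# Empty branch with no parent, then pull the cleaned history into it.
git switch -q --orphan src-repo-main
git pull -q "$work/src-repo-cleaned" main

imported=$(git rev-list --count src-repo-main)
git log --oneline src-repo-main
```

The main branch is left untouched by this: the imported commits only exist on src-repo-main.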
Merging the two histories
At that point, there are two distinct branches in new-repo that do not share any commit between them:
- main, which is the original one which started to live its life
- src-repo-main, which is the one just imported
# imported from src-repo while "main" started its life
branch: src-repo-main branch: main
| |
A' X
| |
C' Y
| |
E' Z (HEAD)
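That the two branches really share no commit can be confirmed with git merge-base, which exits with a non-zero status when there is no common ancestor. A minimal sketch on a throwaway repository, with invented file contents:

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
echo "from main" > file.txt
git add .
git commit -q -m "X"

# Second root commit on a branch with no parent.
git switch -q --orphan src-repo-main
echo "from src-repo" > file.txt
git add .
git commit -q -m "A'"

# merge-base fails when the branches have no common ancestor.
if git merge-base main src-repo-main >/dev/null; then
    related=yes
else
    related=no
fi
echo "shared history: $related"
```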
Two options from here:
- merge the two histories with a merge commit
- rebase main on top of src-repo-main, which would become the new default branch
Using a merge commit
Using a merge commit:
- keeps the original histories separated
- does not mess with the original branch
- but the two histories are still disjoint: git blame will only show the last branch, and GitHub has trouble displaying the whole history of a single file (it works much better within IDEs like IntelliJ). It can be displayed via the CLI with git log --full-history -- path/to/file
A' --- C' --- E' (src-repo-main)
               \
                \   <-- merge commit (M)
                 \
X --- Y --- Z --- M (main)
Steps are simple, inside new-repo:
- merge main into the new branch: git merge origin/main --allow-unrelated-histories, files will be updated with the ones from main into src-repo-main
- push to origin: git push -u origin src-repo-main
- create a Pull Request in GitHub to merge back the new branch into main. At that point, the diff displayed in the Pull Request must be empty, as only the history is imported and no file is changed, added or deleted
- merge the Pull Request with a merge commit to keep the whole history
# init and filter old repository
$ brew install git-filter-repo
$ git clone src-repo src-repo-cleaned
$ cd src-repo-cleaned
$ git reset --hard ${COMMIT_BEFORE_DELETION}
$ git filter-repo --path images --refs main
# import the branch to the new-repo
$ cd path/to/new-repo
$ git switch --orphan src-repo-main
$ git pull /path/to/src-repo-cleaned main
# import using merge strategy
$ git merge origin/main --allow-unrelated-histories
$ git push -u origin src-repo-main
# create a Pull Request to target "main"
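Before opening the Pull Request, the "empty diff" expectation can be checked locally. The sketch below uses a single throwaway repository, so it merges main rather than origin/main. It also passes -X theirs, which is an assumption of this example and not part of the commands above: it resolves files present in both histories in favour of main, since a merge without a common ancestor treats them as added on both sides.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
mkdir images
echo "FROM alpine:3.18" > images/Dockerfile
git add .
git commit -q -m "X: copied files"
echo "FROM alpine:3.19" > images/Dockerfile
git commit -q -am "Y: bump alpine"

# Imported history, on a branch with no parent.
git switch -q --orphan src-repo-main
mkdir -p images
echo "FROM alpine:3.17" > images/Dockerfile
git add .
git commit -q -m "A': add Dockerfile"
echo "FROM alpine:3.18" > images/Dockerfile
git commit -q -am "C': bump alpine"

# Assumption: -X theirs keeps the version from main for overlapping files.
git merge -q --no-edit --allow-unrelated-histories -X theirs main

# The Pull Request diff is empty when both branches have the same content.
if git diff --quiet main src-repo-main; then pr_diff=empty; else pr_diff=non-empty; fi
echo "diff against main: $pr_diff"

# Both sides of the merge show up for a single file.
history=$(git log --full-history --format=%s -- images/Dockerfile)
echo "$history"
```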
Rebasing main on top of the new branch
Using a rebase strategy:
- keeps the history linear
- is incompatible with changes from main
- as it introduces new commits, hashes from Pull Requests are also lost
- all previously opened Pull Requests need to be redone from the start and target the new branch, as commits are all new
# before with 2 histories in different branches
A'--- C'--- E' (src-repo-main)
X --- Y --- Z (main)
# after "git rebase src-repo-main":
# ("src-repo-main") (rebased "main" commits)
A' --- C' --- E' --- X' --- Y' --- Z' (main-rebased)
Steps are also quite simple:
- create a new branch from main to be safe: git switch -c main-rebased
- rebase the changes from main on top of the new branch, only keeping changes from main: git rebase src-repo-main -X theirs. This will remove all merge commits and only keep commits from the original main
Note: using git rebase, the meanings of ours and theirs are swapped compared to a merge. Here, -X theirs tells Git to use the commits being replayed (the ones from main) over the upstream history if a conflict occurs.
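The effect of -X theirs during a rebase can be observed on a toy repository where both histories define the same file with different content. This is a sketch with invented contents; the point is that the replayed commits from main win and the final content matches main.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
echo "FROM alpine:3.18" > Dockerfile
git add .
git commit -q -m "X: copied files"
echo "FROM alpine:3.19" > Dockerfile
git commit -q -am "Y: bump alpine"

# Imported history with an older version of the same file.
git switch -q --orphan src-repo-main
echo "FROM alpine:3.17" > Dockerfile
git add .
git commit -q -m "A': add Dockerfile"

# Replay the commits of main on top of the imported history.
# "theirs" is the side being replayed, i.e. the commits from main.
git switch -q -c main-rebased main
git rebase -q src-repo-main -X theirs

final=$(cat Dockerfile)
echo "content after rebase: $final"
git log --oneline main-rebased
```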
With the above steps, a new history in main-rebased is available with the history from src-repo-main followed by the one from main.
From here, two choices:
- the main-rebased branch becomes the new default one, but that breaks some workflows and automation
- changes from main-rebased are force-pushed to main to overwrite it. Developers need to reset their changes to the new branch
# init and filter old repository
$ brew install git-filter-repo
$ git clone src-repo src-repo-cleaned
$ cd src-repo-cleaned
$ git reset --hard ${COMMIT_BEFORE_DELETION}
$ git filter-repo --path images --refs main
# import the branch to the new-repo
$ cd path/to/new-repo
$ git switch --orphan src-repo-main
$ git pull /path/to/src-repo-cleaned main
# rebase main on top of src-repo-main
$ git switch -c main-rebased origin/main
$ git rebase src-repo-main -X theirs
# overwrite "main" on origin
$ git checkout main
$ git reset --hard main-rebased
$ git push --force
Squashing after rebasing
In the case of this repository, there had only been a few Pull Requests since it was created. However, the history was still kind of a mess, as most of the Pull Requests were merged with merge commits.
I used the opportunity to clean up the history by rebasing and squashing commits that belonged to the same Pull Request. In the process, I renamed the first commit of each Pull Request to the same format as a squash merge: <Title> (#PR), so that it is then tracked correctly.
The operation is then:
$ git rebase -i <COMMIT_BEFORE_REBASE>
# reword first commit
# fixup following commits until the next ones
I looked at each Pull Request to determine the commits to squash together.
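For a single Pull Request, the same edit of the todo list can be scripted instead of done by hand. The sketch below is a non-interactive stand-in: GIT_SEQUENCE_EDITOR rewrites every "pick" after the first into "fixup", and git commit --amend plays the role of "reword". The commit messages, the title and the PR number #12 are hypothetical.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
echo base > file.txt
git add .
git commit -q -m "base"
COMMIT_BEFORE_REBASE=$(git rev-parse HEAD)

# Three commits that belong to the same (hypothetical) Pull Request.
for step in 1 2 3; do
    echo "change $step" >> file.txt
    git commit -q -am "wip $step"
done
tree_before=$(git rev-parse 'HEAD^{tree}')

# Keep the first "pick", turn every later "pick" into "fixup".
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2,\$s/^pick/fixup/'" \
    git rebase -q -i "$COMMIT_BEFORE_REBASE"

# Stand-in for "reword": rename the squashed commit like a squash merge.
git commit -q --amend -m "Add changes (#12)"

tree_after=$(git rev-parse 'HEAD^{tree}')
git log --oneline
```

With several Pull Requests in the range, the todo list still has to be edited per group, which is why it was done by hand here.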
This now gives a linear and clean history for new commits on the new branch.
I also looked at and tried git rebase -i --rebase-merges, which keeps track of all the merge commits and changes on the branch, but it was too complex for my use case, and just having the commits themselves was enough.
The changes were validated by diffing the two branches: git diff main main-rebased to ensure their content was the same.
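An equivalent check that is easy to script compares the tree hashes of the two branch tips: two commits have identical content exactly when they point at the same tree, even if their histories differ. A minimal sketch, where the helper name same_content and the repository contents are made up:

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
echo "FROM alpine" > Dockerfile
git add .
git commit -q -m "X"

# Different history, same files.
git switch -q --orphan main-rebased
echo "FROM alpine" > Dockerfile
git add .
git commit -q -m "X' (rewritten)"

# Hypothetical helper: true when both revisions point at the same tree.
same_content() {
    [ "$(git rev-parse "$1^{tree}")" = "$(git rev-parse "$2^{tree}")" ]
}

if same_content main main-rebased; then result=identical; else result=different; fi
echo "main vs main-rebased: $result"
```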
Update process and communication
Both strategies described above have their pros and cons. After discussion with the team, we decided to go with the rebase + squash strategy, as a single unified history was the best option in the long term.
To migrate the branch, we followed a few simple steps:
- force only squash merge on the repository
- communicate to the team not to merge new Pull Requests during a given time slot
- deactivate protected branches if any
- executing the git commands:
$ git switch main # original branch
$ git pull --rebase # ensure it is up to date
$ git reset --hard origin/main # ensure it is at the same version as origin
$ git switch -c main-old # create the "backup" branch
$ git push origin main-old # push it (the new branch has no upstream yet, so name it explicitly)
$ git switch main-rebased # switch to the rebased branch
$ git reset --hard origin/main-rebased # ensure it is up to date
$ git diff main main-rebased # make sure there is no difference
$ git push --force origin main-rebased:main # overwrite main with main-rebased
- check in GitHub that everything is good
- notify the users to update their repository:
$ git fetch origin
$ git reset --hard origin/main
- they can now work normally again
Existing Pull Requests
When doing the migration, the original main branch was kept and renamed to main-old. As there were a few open Pull Requests, using a rebase + squash strategy messes up their diff in GitHub.
So we:
- changed the target of the existing Pull Requests from main to main-old
- closed them with a message to recreate them by cherry-picking the new commits on the new branch
That way developers could still see their diff with the original main branch and ensure that the diff in their new Pull Request is identical.
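Recreating a Pull Request this way can be sketched on a toy repository: the feature commits are cherry-picked from the old branch onto a branch created from the rewritten main, and the two diffs are compared. The branch names feature and feature-new and all contents are hypothetical; the rewritten main is simulated with a new root commit holding the same files.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/main-old
echo one > file.txt
git add .
git commit -q -m "X"

# An open Pull Request based on the original history.
git switch -q -c feature
echo two >> file.txt
git commit -q -am "F: feature work"

# Simulated rewritten main: same content, different history.
git switch -q --orphan main
echo one > file.txt
git add .
git commit -q -m "X' (rewritten history)"

# Recreate the Pull Request branch on top of the new main.
git switch -q -c feature-new main
git cherry-pick main-old..feature >/dev/null

# The old and new Pull Request diffs should be identical.
git diff main-old feature > "$work/old.diff"
git diff main feature-new > "$work/new.diff"
if cmp -s "$work/old.diff" "$work/new.diff"; then same=yes; else same=no; fi
echo "diffs identical: $same"
```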
Conclusion
The migration went very smoothly, and there were not a lot of risks. No problems were encountered and users could re-create their Pull Requests easily.
For next time:
- the history from src-repo should have been imported at the beginning when creating new-repo
- a test to ensure such a regression is detected before production is of course needed!