Our Journey to Git
 

2016 marks the year we switched our development repositories from Subversion to Git. As I was responsible for large parts of this migration, I want to share some technical aspects of this journey, with a focus on the biggest obstacles we had to master: combining multiple Subversion repositories into one Git repository, shrinking the repository size, dealing with Subversion externals, and the resulting changes in our development process.

 

Combining multiple Subversion repositories

Our Subversion repositories

Our product Teamscale is based on the open source analysis toolkit ConQAT, which was formerly developed at the Competence Center Software Maintenance at Technische Universität München and is now maintained by us. The first commit in that Subversion repository dates back to January 2005 and concerns the ConQAT predecessor, SVAT. This repository still contains central parts of our source code and had 57,870 commits at the time of migration.

In addition, we developed proprietary analyses on top of ConQAT, and Teamscale itself, in a separate Subversion repository. This repository contains a total of 26,155 commits since 2009. Customer-specific analyses were developed in a third Subversion repository containing 12,405 commits.

In all three repositories we had the same branching strategy: ongoing development happened on trunk. Stable release branches were forked from trunk at the same time, and fixes were periodically merged back from the release branches into trunk. I guess I’m not telling you anything new here; this is the standard way to work with Subversion.

We found, however, that synchronizing branches across three repositories creates unnecessary overhead. Hence, we decided that the migration to Git should combine the code of all three repositories while maintaining the original branch layout and keeping all history.

Combining multiple repositories

The first step of our migration simply converted the Subversion repositories into Git repositories using the excellent git-svn bridge. The next step was to build a fresh Git repository from the converted repositories, combining the corresponding branches of each source repository into a single branch. For example, the commits on trunk in each repository are cherry-picked in date order onto a single trunk in the target repository. Sadly, no existing tool or script did this out of the box, so we had to write our own: the Git Repository Zipper.

I find the metaphor of a jacket zipper quite fitting for the algorithm. The following figures briefly outline how it works.

Repository A:

       A3----A5----A7 release-branch
      /             \
A1---A2---A4---A6---A8 master

Repository B:

       B3----B5----B7 release-branch
      /        \    \
B1---B2---B4---B6---B8 master

Combined Repository:

       B3-----A3----A5-----B5-------B7---A7 release-branch
      /                      \        \    \
B1---A1---A2---B2---B4---A4---B6---A6---B8---A8 master

This is also the right time to thank the awesome folks from libgit2sharp who made the Git API available for C#.
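
The interleaving shown in the figures can be illustrated with a minimal Python sketch. This is not our C# implementation: the `Commit` class, the `zip_branch` function, and the dates are hypothetical and chosen only to reproduce the first commits of the combined master branch. It covers just the date ordering of one linear branch; handling branch points and merges is left out.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for a converted commit (not a libgit2sharp type)."""
    repo: str
    name: str
    date: datetime


def zip_branch(*histories):
    """Interleave per-repository commit lists of one branch by commit date.

    Each input list must already be in chronological order, which is the
    case for the linear history of a single branch.
    """
    return list(heapq.merge(*histories, key=lambda commit: commit.date))


def day(n):
    # Illustrative dates only; real commits carry their original timestamps.
    return datetime(2016, 1, n)


trunk_a = [Commit("A", "A1", day(2)), Commit("A", "A2", day(3)), Commit("A", "A4", day(6))]
trunk_b = [Commit("B", "B1", day(1)), Commit("B", "B2", day(4)), Commit("B", "B4", day(5))]

combined = zip_branch(trunk_a, trunk_b)
print([commit.name for commit in combined])
```

In a real run, each commit in the merged sequence would then be cherry-picked onto the target branch in that order.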

Shrinking the repository size

At this point we had a single Git repository containing everything from our Subversion repositories, including branches, which allowed us to browse the complete history of files. The repository was, however, over 6 GB in size, which caused long clone times and also hampered the performance of regular Git operations. The reason was that Subversion had also been used to store documents and large binary data (e.g. releases, libraries, …). Git is not well suited to storing this kind of data, as the whole repository history has to be downloaded on the initial clone.

We took the following measures to reduce the overall repository size to a bit more than 1 GB:

  • Get rid of the largest (unused) files by examining the Git repository and then removing them from the complete Git history with git filter-branch --index-filter.
  • Remove whole folders that contained code or data not relevant for our development.
  • Partly introduce library dependency management with Gradle.
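
One way to find the largest files in the history, sketched here in Python, is to pipe the output of git rev-list --objects --all into git cat-file --batch-check and sort the blobs by size. The function names and the sample lines are hypothetical, and this is not necessarily how we examined the repository. Note that the reported sizes are uncompressed object sizes.

```python
import subprocess

# Format atoms understood by `git cat-file --batch-check`.
BATCH_FORMAT = "%(objecttype) %(objectname) %(objectsize) %(rest)"


def parse_blobs(lines):
    """Yield (size_in_bytes, path) for every blob line of the batch-check output."""
    for line in lines:
        parts = line.rstrip("\n").split(" ", 3)  # a path may contain spaces
        if len(parts) == 4 and parts[0] == "blob":
            yield int(parts[2]), parts[3]


def largest_blobs(lines, top=10):
    """Return the `top` largest blobs, biggest first."""
    return sorted(parse_blobs(lines), reverse=True)[:top]


def list_history_objects(repo_dir):
    """Collect batch-check lines from a real repository (not run in the sample below)."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    sizes = subprocess.run(
        ["git", "cat-file", f"--batch-check={BATCH_FORMAT}"],
        cwd=repo_dir, input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return sizes.splitlines()


# Made-up sample output, used here instead of a real repository.
sample = [
    "commit 1111111 250 ",
    "tree 2222222 70 lib",
    "blob 3333333 52428800 releases/old-release.zip",
    "blob 4444444 1200 src/Main.java",
    "blob 5555555 7340032 lib/some library.jar",
]
top_two = largest_blobs(sample, top=2)
print(top_two)
```

The reported paths could then be removed from every commit with an index filter such as git rm --cached --ignore-unmatch followed by the path.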

We also investigated using Git LFS to store large binary files, but at the time of migration, support for Git LFS in Eclipse, our main development IDE, was still limited. We also discarded the idea of splitting the history using git-replace, as the final size of about 1 GB was quite satisfactory for us. We still have these options if the repository continues to grow, but we take proactive measures to prevent this with a simple server-side commit hook that rejects commits containing files larger than 5 MB.
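
The decision logic of such a hook can be sketched in Python as follows. This is an illustration, not our actual hook: the function names are hypothetical, the limit is assumed to mean 5 MiB, and the list of pushed files is passed in directly. A real pre-receive hook would first read the old and new revision of each updated ref from standard input and look up the sizes of the newly pushed blobs.

```python
# Limit from the text, assumed here to mean 5 MiB.
MAX_FILE_SIZE = 5 * 1024 * 1024


def oversized_files(blobs, limit=MAX_FILE_SIZE):
    """Return the (path, size) pairs that are strictly larger than the limit."""
    return [(path, size) for path, size in blobs if size > limit]


def check_push(blobs, limit=MAX_FILE_SIZE):
    """Return (exit_code, messages); a non-zero exit code makes Git reject the push."""
    offenders = oversized_files(blobs, limit)
    messages = [
        f"rejected: {path} is {size} bytes (limit {limit})"
        for path, size in offenders
    ]
    return (1 if offenders else 0), messages


# Hypothetical push containing one small file and one 6 MiB file.
pushed = [("src/Main.java", 1200), ("lib/big.jar", 6 * 1024 * 1024)]
code, messages = check_push(pushed)
print(code, messages)
```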

Getting rid of svn:externals

The biggest obstacle holding back the migration to Git was our use of Subversion externals. We used externals as a means to share common code within the repository, e.g. between the Teamscale server and the IDE clients, to prevent redundancy. We also used externals to centrally manage JARs. There is a discussion on Stack Overflow regarding the anti-pattern of using Subversion externals for dependency management, with which I largely agree, so the migration to Git was the right time to get rid of them.

The core solution is the introduction of a Gradle task that has to be executed after cloning the Git repository and that bootstraps the development environment:

  • Where externals were used to manage JARs, we simply use Gradle dependencies instead. The bootstrap task downloads the JARs from Maven Central and places them in the lib folder of the Java projects.
  • Where externals were used to share code folders, the bootstrap task creates symlinks. This has the additional benefit that editing a shared file in a consuming project edits the file in its original place, making the modification visible in all places immediately. You may ask: "Wait, doesn’t Git support symlinks out of the box?" You are right, but this only holds for Unix-based platforms, and we have several contributors working on Windows machines.
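
The symlink part of the bootstrap can be illustrated with a small Python sketch. Our actual implementation is a Gradle task; the function name and the folder names below are hypothetical. On Windows, creating symlinks may additionally require elevated privileges or developer mode.

```python
import os
import tempfile
from pathlib import Path


def link_shared_folder(source: Path, link: Path) -> None:
    """Create (or refresh) a directory symlink at `link` that points to `source`."""
    if link.is_symlink():
        link.unlink()  # replace a stale link from an earlier bootstrap run
    link.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(source, link, target_is_directory=True)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shared = root / "shared-code"  # hypothetical shared folder
    shared.mkdir()
    (shared / "Util.java").write_text("class Util {}\n")

    link = root / "ide-client" / "shared-code"  # hypothetical consuming project
    link_shared_folder(shared, link)

    # Editing the file through the link changes the file in its original place.
    (link / "Util.java").write_text("class Util { /* edited */ }\n")
    edited = (shared / "Util.java").read_text()
    print(edited)
```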

In the long run, we want to replace the bootstrapping with real Gradle project dependencies. However, this heavily affects the ConQAT class loading, and our aim was not to change code because of the migration. Moreover, the bootstrapping is now also used to configure the Git repository with a consistent line-ending style and consistent Git merge/pull behavior for all contributors, as well as to copy shared Eclipse project settings.

Git is just the beginning…

In fact, the adoption of Git marks the beginning of further changes to our development process:

  • Together with the migration, we changed our branching strategy to develop solely on feature branches, which allows us to have a stable and (kind of) always release-ready master branch.
  • Feature branches require a more sophisticated build setup. We configured Jenkins jobs to build and test each pushed branch and are about to migrate to Jenkins 2 pipeline workflows instead of one monolithic, long-running build job.
  • Our code review process is affected as well: Git feature branches make additional tools for tracking the review state of files superfluous, as changes are committed on separate branches. Before merging a branch into master, we can simply identify the changed files and review them.

Summary

The migration to Git went (unexpectedly) smoothly: I had prepared all migration steps in a shell script and performed several dry runs before we set our Subversion repositories to read-only mode on a Friday afternoon. The whole migration took roughly a day, so on Monday morning the Git repository was available to all developers. Of course, there were several questions about how to do X in Git in the following weeks. In fact, I spent quite a lot of time assisting colleagues in the first week, and I strongly recommend taking this into account if you plan to switch from a centralized version control system like Subversion or Microsoft Team Foundation to Git.

I’m also keen to hear about your experience if you have already migrated your repository, or to answer your questions if you are planning such a migration. Feel free to comment directly on this post or reach out to me via mail or Twitter.