Migrating Version Histories to Git

Posted on 01/04/2018 by Dr. Andreas Göb

In recent years, version control systems have started to consolidate. While there were lots of proprietary systems in the past, the world now seems to agree that distributed systems such as Git are the way to go in most cases. At CQSE, we switched our own development infrastructure from SVN to Git in 2016. In this post, I will outline some important lessons I learned from migrating several code bases from proprietary solutions to Git over the last few years.

During the last year, I met lots of developers who told me they had already moved to Git some time ago or were planning to do so in the near future. This applies not only to languages like C, C# or Java, but also to, e.g., PL/SQL, where the code typically exists only in the database, and developers have been using external VCSs like CVS or SVN for many years.

Even enterprises like Microsoft and SAP are switching to Git. Microsoft moved the Windows source code to Git and now claims to have the world’s largest Git repository. Meanwhile, SAP is changing its programming models for partners to use Git for HANA and, more recently, even ABAP development.

In summary, Git seems to be a pretty safe bet when choosing a VCS to use for development in the coming years (or when choosing a VCS to support in your code-analysis tool).

Be Aware of Your Use Case

Before thinking about possible migration paths, you must be sure about your use case: Do you want your development to switch over gradually and allow a transition phase with both systems in use? Or can you pull off an immediate switch of your whole development infrastructure, like CQSE did in 2016? Or do you want to introduce a new tool (e.g., Teamscale) that cannot talk directly to your proprietary VCS, so that you need a mechanism to mirror the changes?

These use cases are radically different. In the first scenario, you can gradually switch over your developer machines as well as your build infrastructure, analysis tools and other parts of the development environment. This comes, of course, at the cost of having to maintain a two-way bridge between your old VCS and Git, including conflict resolution in cases where both receive conflicting changes at the same time.

In the second scenario, you only need to convert the repository once, and afterwards all development happens using the Git repository. For archiving and reporting purposes, you may keep a read-only interface to your old VCS around, but other than that the migration is a one-shot, one-way process that potentially requires a lot of accompanying infrastructure changes.

The third scenario is again very different and overall the easiest one, since the migration is strictly one-way, and all development continues to happen in the old VCS. Still, you must ensure you can incrementally mirror changes from there into the Git repository.

There are existing tools for all of the above scenarios. Therefore, it is worthwhile to have a close look at which tools and APIs are available.

Carefully Examine Existing APIs and Tools

Switching from SVN to Git seems to be the path most development teams take, which is why there are lots of sophisticated options available for that scenario. For example, GitHub offers an Importer for SVN repositories, and you can interact with GitHub using a Subversion client. This way you can start by migrating the repository, but continue to use all the existing infrastructure until you gradually switch to »native« Git workflows and tools. As an alternative, you may start by switching your developer workplaces to Git and use the Git client with your existing SVN repository. This way, developers can get used to the Git tooling and develop locally using Git, while the central server still uses SVN. Of course, all of these tools have caveats, e.g., git svn is reported to have problems on some Windows setups, where it can crash on every operation.

Some other Git Bridge tools for Perforce, ClearCase, CVS, Mercurial, and Rational Team Concert are listed in our Teamscale FAQ.

But even if you cannot find a Git Bridge, e.g., because you are using CA’s Harvest SCM, you may still find some helpful information online. In the case of Harvest SCM, this includes SDKs for both C++ and Java as well as documentation of the command-line tools, which you can use to write your own lightweight migration tool.

Last Resort: Write Your Own Tool

In case there is no migration tool for your use case, you may decide to create your own tool. The remainder of this post describes our own experience in doing so, using Harvest SCM as an example. My use case was migrating the complete version history and regularly mirroring changes from Harvest SCM to Git in order to analyze the source code using Teamscale’s Git connector, while development continues in Harvest SCM.

When we started looking into a solution for this problem, we had already completed several similar projects, including our own migration from SVN to Git using the libgit2sharp library in C#, as well as our Teamscale Connector for SAP NetWeaver ABAP systems, which uses JGit for building a Git repository from Java code.

This time, we took a different approach. We wanted to be able to open-source our tool at a later point in time, so we could not bundle it with the Harvest SCM SDK, which is only available to Harvest customers. Therefore, we decided to interface with Harvest SCM using command-line calls, and we used the same approach for Git, so that our migrator is just a thin layer of data conversion and command assembly in between. For this, we chose Groovy as a scripting language, since it runs on the JVM, which is needed anyway if you have a Teamscale installation.

Map VCS Concepts Appropriately

When it comes to interfacing between different VCSs, it is not always obvious how their respective concepts map onto each other. For example, in Git, every state of the repository is represented by a commit hash, while in SVN every state of the repository is denoted by an incrementing repository revision number. In Harvest SCM, however, files are versioned individually, meaning that every file that is added to the repository starts at its own version 0, and the version is incremented with every change to that file. This means that in Harvest, changes to every file are recorded individually, whereas in Git you typically have multiple changed files in a single commit.

In Git and SVN, you have commit messages, whereas in Harvest SCM you assign changes to a Package. While branches are available in all three systems, Harvest SCM adds the concept of views, which is not easily translated into any concept available in Git.

Given our use case, we decided to use the Package name as the commit message, and to group changes to individual files into one Git commit if author and package were the same and the modification times were reasonably close. Moreover, we decided to migrate just a single view to Git, which is specified as a parameter to the migration tool.
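The grouping rule above can be sketched in a few lines of Groovy. This is only an illustration under stated assumptions, not the code of our migrator: the map keys (path, author, pkg, time), the function name groupChanges, and the 60-second threshold for "reasonably close" are all made up for this example.

```groovy
// Groups file versions into commit candidates: same author, same package,
// and each change at most maxGapMillis after the previous one in its group.
// Each version is a map with the hypothetical keys path, author, pkg, time.
List<List<Map>> groupChanges(List<Map> versions, long maxGapMillis = 60000L) {
    def groups = []
    // Sort a copy so that candidates for the same commit become neighbors
    def sorted = versions.sort(false) { a, b ->
        a.author <=> b.author ?: a.pkg <=> b.pkg ?: a.time <=> b.time
    }
    for (v in sorted) {
        def last = groups ? groups[-1][-1] : null
        if (last && last.author == v.author && last.pkg == v.pkg &&
                v.time - last.time <= maxGapMillis) {
            groups[-1] << v   // close enough: same commit
        } else {
            groups << [v]     // otherwise start a new commit
        }
    }
    // Order the resulting commits chronologically by their first change
    return groups.sort(false) { it[0].time }
}

// Example input: times are epoch milliseconds (assumed representation)
def versions = [
    [path: 'src/A.java', author: 'alice', pkg: 'PKG-1', time: 1000L],
    [path: 'src/B.java', author: 'alice', pkg: 'PKG-1', time: 31000L],
    [path: 'src/A.java', author: 'bob',   pkg: 'PKG-2', time: 40000L],
    [path: 'src/C.java', author: 'alice', pkg: 'PKG-1', time: 500000L],
]
def commits = groupChanges(versions)
commits.each { c -> println "${c[0].author} ${c[0].pkg}: ${c*.path}" }
```

With this input, the first two changes by alice end up in one commit, while her later change and bob's change each get their own. A suitable threshold depends on how your team works with packages, so treat the default as a starting point.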

Migrate as Much as Needed and as Little as Possible

Based on our decision on how to translate Harvest SCM concepts to Git concepts, we can start to look into the command line documentation to find out how to get the data we need. We found the two commands hsv (Select Version) and hco (Check Out) to be sufficient for our needs to recreate the complete version history.


hsv -b broker -en project -usr user -pw password -vp viewPath 
  -s * -iv av -id sd from to

This instructs Harvest SCM to return a list of all file versions in the given view path that have been changed between the from and to dates. We can then parse this list, group the changes, and fetch the individual file versions.
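Since the migrator mainly assembles commands, the hsv call above can be built as an argument list. The following is a minimal sketch: the function name buildHsvCommand and the keys of the cfg map are hypothetical, the flags are exactly those shown above, and the from and to values are passed through as opaque strings because the expected date format is not covered here.

```groovy
// Assembles the hsv call as an argument list (one list element per argument).
// cfg keys (broker, project, user, password, viewPath) are placeholder names.
List<String> buildHsvCommand(Map cfg, String from, String to) {
    return ['hsv',
            '-b', cfg.broker, '-en', cfg.project,
            '-usr', cfg.user, '-pw', cfg.password,
            '-vp', cfg.viewPath,
            '-s', '*', '-iv', 'av', '-id', 'sd', from, to]*.toString()
}

def hsv = buildHsvCommand(
    [broker: 'broker', project: 'project', user: 'user',
     password: 'password', viewPath: 'viewPath'],
    'from', 'to')
println hsv.join(' ')

// To actually run it (requires the Harvest SCM command-line client):
// def listing = hsv.execute().text
```

Passing the arguments as a list means no shell is involved when the process is started, so the * pattern is not expanded by a shell before it reaches hsv.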

// Changes are grouped by author, package, and modification time
for (change in groupedChanges) {

  // Construct a valid Git author string
  def gitAuthor = "${change.key.author} <${change.key.author}@nomail>"

  // Checkout changed files, delete removed ones
  def processedTimes = processFileEntries(change.value)

  // Stage all changes and commit them to the Git repository
  commit(gitAuthor, change.key.package, change.key.time)

  // Record all modification times to skip them in the next run
  commitsFile.withWriterAppend { out -> 
    processedTimes.toSorted().each { out.println it } 
  }
}
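The commit step in the loop above boils down to two Git calls. Here is one way it could look, as a sketch rather than our actual implementation: the function names, the extra repoDir parameter, and the ISO 8601 date string are assumptions, while the Git options used (add --all, commit --author, --date, -m, and the GIT_COMMITTER_DATE environment variable) are standard Git.

```groovy
// Builds the two Git calls for one commit as argument lists.
// author must have the form "Name <email>", isoDate is an ISO 8601 timestamp.
List<List<String>> buildCommitCommands(String author, String message, String isoDate) {
    return [
        // Stage additions, modifications, and deletions in the working tree
        ['git', 'add', '--all'],
        // Record the commit with the original author and modification time
        ['git', 'commit', "--author=${author}".toString(),
         "--date=${isoDate}".toString(), '-m', message]
    ]
}

// Executes the commands inside the given Git working directory
void commit(File repoDir, String author, String message, String isoDate) {
    for (command in buildCommitCommands(author, message, isoDate)) {
        def builder = new ProcessBuilder(command)
                .directory(repoDir)
                .redirectErrorStream(true)
        // --date only sets the author date; this also sets the committer date
        builder.environment().put('GIT_COMMITTER_DATE', isoDate)
        def process = builder.start()
        def output = process.inputStream.text
        if (process.waitFor() != 0) {
            throw new IllegalStateException("${command.join(' ')} failed: ${output}")
        }
    }
}

// Example call (needs an existing Git repository at the given path):
// commit(new File('/path/to/repo'), 'alice <alice@nomail>', 'PKG-1',
//        '2018-01-04T10:15:30+01:00')
```

Note that git commit exits with a non-zero code when there is nothing to commit, so in this sketch a group that produces no actual file changes would raise an exception and should be skipped beforehand.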

When we find a deletion in the list, we just delete the file from the file system. Otherwise, we request the respective file version from Harvest SCM.


hco -b broker -en project -usr user -pw password -vp viewPath 
  -cp localPath -s fileName -vn fileVersion -br -r

This is, of course, only a rough outline of the process, omitting lots of details that can cause problems during migrations. Time formats are one of them, path separators are another. But there are established ways of dealing with these kinds of issues, so I omitted them here. Also, Harvest SCM records events that are not actual changes to the file, e.g., when a developer locks a file. These have to be filtered out during processing.

Conclusion

Writing your own VCS migration tool is hard, even if it just targets a very specific use case. So whenever there are ready-to-use tools available, you should invest some time to investigate whether they fit your needs.

If you need to write your own tool, make sure you start with the minimum set of features that covers your needs and rely on external tools as much as possible. Our Harvest SCM migrator has only about 200 lines of code.

Do you have more suggestions for VCS migration tools that we should include in our FAQ? Would you like to analyze your source code stored in Harvest SCM with Teamscale? Just leave a comment below or contact me.
