How we can learn from history...

As quality consultants, we mainly work together with our customers, but we are also actively involved in current research. In this post, we summarize our paper »Incremental Origin Analysis of Source Code Files«, which was recently accepted for publication at MSR, the Working Conference on Mining Software Repositories (May 31 to June 1, 2014, in Hyderabad, India).

Most of you have probably heard about it, and many of you use it on a daily basis: the version control system. It allows different developers to work on the same software system simultaneously, and it keeps track of every single change. It also helps you answer many questions during daily development. Assume the build breaks overnight because of newly failing test cases: What was the last change in this file? Which change broke the test? Or assume you have to work on a file you are not familiar with and need someone to help you out: Who was the last to edit the file? Who has edited the file over its entire history? And if you care about code quality, you are probably also interested in: How does the size of this file grow over time? How does the clone coverage of this file evolve?

Answers to all of these questions require the history of a source code file. Unfortunately, the history of a file is not always complete. When a file is renamed or moved from one location to another, it is likely that this move is not tracked in the repository. Well, to be more precise, if the rename is done with the specific command offered by the version control system, it will be tracked: for example, SVN offers the command svn move to move or rename a file. Sophisticated development environments such as Eclipse run this command in the background when users call the Eclipse refactoring to rename a class.

However, in many cases, moves are not recorded in the repository, for example if the move is done manually in the file system or if parts of the repository are moved across repository boundaries. Our research shows that up to 38% of files in open source systems have at least one move or rename in their history that is not recorded in the repository. Imprecise history information leads to imprecise results when analyzing software evolution. This is where it becomes problematic, both for you as a software developer and for us as a quality consulting company.

To analyze certain quality aspects of your software, we often analyze the entire history of a software system to illustrate how its quality evolves. This analysis relies on accurate tracking of files over time to produce precise results and to provide complete historical information for its users. For example, to illustrate how a specific file or method grows over time, we require the complete history of the source code file. Likewise, when analyzing which quality defects were introduced since the last release, we need to track quality defects over time, which, in turn, also relies on accurate file tracking.

In the current state of the art, the problem of file tracking has not been solved sufficiently. Existing approaches fail to detect copies and do not work incrementally: they can analyze only moves and renames between two versions of a system, but they cannot analyze the thousands of commits in a complete history in feasible time. Our approach, in contrast, works incrementally and is based on the following three steps:

  1. It extracts recorded information about moves and renames from the underlying version control system, for example from Subversion or Git. We call these moves »explicit« moves.
  2. It uses an efficient incremental clone detection to find moves and renames that have not been recorded in the version control system. We call these moves »implicit« moves. Further, the clone detection helps to also detect unrecorded copies.
  3. It uses a name-based comparison to find moves which are missed by the clone detection. It looks for files with the same name in different locations and uses a light-weight content comparison to detect moves and copies.
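
To make the three steps more tangible, here is a minimal Python sketch of how they could be chained for a single commit. It is an illustration under assumptions, not the implementation from the paper or from Teamscale: the function names, the dictionary-based inputs, and both thresholds are hypothetical, and a simple line-based similarity from Python's `difflib` stands in for the incremental clone detection.

```python
import difflib
import posixpath


def similarity(a: str, b: str) -> float:
    """Line-based similarity in [0, 1]; a simple stand-in for clone detection."""
    return difflib.SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()


def find_origins(previous, current, explicit_moves,
                 clone_threshold=0.8, name_threshold=0.5):
    """Map each newly added path to (origin_path, kind, step) or None.

    previous, current: dict of path -> file content before/after the commit.
    explicit_moves: dict of new_path -> old_path as recorded by the VCS.
    Both thresholds are hypothetical values chosen for illustration.
    """
    origins = {}
    for path in sorted(set(current) - set(previous)):
        content = current[path]
        origin, step = None, None

        # Step 1: explicit move recorded by the version control system.
        if path in explicit_moves:
            origin, step = explicit_moves[path], "explicit"

        # Step 2: content similarity against all files of the previous revision.
        if origin is None and previous:
            best = max(previous, key=lambda p: similarity(previous[p], content))
            if similarity(previous[best], content) >= clone_threshold:
                origin, step = best, "content"

        # Step 3: same file name in another location, weaker content check.
        if origin is None:
            name = posixpath.basename(path)
            for old in sorted(previous):
                if (posixpath.basename(old) == name
                        and similarity(previous[old], content) >= name_threshold):
                    origin, step = old, "name"
                    break

        if origin is None:
            origins[path] = None  # no origin found: treated as a new file
        else:
            # If the origin still exists after the commit, it was copied, not moved.
            kind = "copy" if origin in current else "move"
            origins[path] = (origin, kind, step)
    return origins
```

In this sketch, the order of the steps mirrors the list above: recorded information wins, content similarity comes next, and the name-based comparison only acts as a fallback with a weaker content check. Comparing every added file against every previous file is of course far too slow for a real history; that is exactly the part an incremental clone detection would replace.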

Steps two and three are heuristics: we have no guarantee that the detected implicit moves are correct. However, in a large scientific case study on open-source systems using SVN as repository, we evaluated the correctness and the completeness of these two heuristics. We even asked developers to confirm whether the moves, renames, and copies detected by our approach are correct. The evaluation showed that if the heuristics detect a move, the detection is correct in at least 97 of 100 cases (97% precision). It also showed that out of 100 moves that happened during the history, the heuristics detect at least 92 (92% recall).
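
For readers less familiar with the two measures, the following small Python sketch shows how precision and recall can be computed from a set of detected moves and a set of moves that actually happened. The function name and the representation of a move as an (old path, new path) pair are our assumptions for illustration; this is not the evaluation code from the study.

```python
def precision_recall(detected: set, actual: set):
    """Return (precision, recall) for detected moves against actual moves.

    Each move is represented as a hypothetical (old_path, new_path) tuple.
    """
    true_positives = len(detected & actual)
    # Precision: share of detected moves that are correct.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of actual moves that were detected.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

With these definitions, a precision of 97% means that 97 of 100 detected moves are real moves, and a recall of 92% means that 92 of 100 real moves are found.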

We have implemented this approach in our newest quality analysis tool, Teamscale. Overall, Teamscale reliably detects moves, renames, and copies in the history of a source file. Hence, we do not make mistakes when we try to learn from history.