Practical Guide to Code Clones (Part 1)
Dr. Benjamin Hummel
One well-known principle in software engineering is “don’t repeat yourself”, also known as the DRY principle. An obvious violation of DRY is using copy/paste to create duplicates of large portions of source code within the same code base. These duplicated pieces of code, also known as code clones, have been the subject of a great deal of research over the last two decades. In this two-part post I want to summarize the parts of the current knowledge that I find most relevant to practitioners, especially the impact of clones on software development and how to deal with them in terms of tools and processes.
What is a clone?
A clone is a fragment of source code that appears at least twice in a system. Typically, we also allow code snippets that are not exactly the same but have been modified slightly, for example through different formatting, variations in comments, renaming of variables, and so on. Additionally, we require a certain minimal length to exclude very short fragments that are similar only because of common patterns (e.g., loop constructs). Commonly used values are 7 or 10 statements. You can see an example of a clone from the Jenkins source code in the following image. The highlighted differences in the clone include different comments, variable renaming, fully qualified access of a field, and the usage of a method instead of a field in one place.
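To make this definition concrete, here is a minimal Python sketch of a clone detector. It is not how ConQAT or any other real tool works. It rests on several simplifying assumptions: every non-blank line counts as one statement, comments use `//`, and every identifier that is not in a small, hypothetical keyword list is renamed to `ID`. The names `normalize`, `find_clones`, and `KEYWORDS` are made up for this illustration.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny keyword list for a Java-like language.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "new", "null"}

def normalize(line):
    """Drop // comments, rename identifiers to ID, and remove whitespace."""
    line = re.sub(r"//.*", "", line)
    line = re.sub(
        r"[A-Za-z_]\w*",
        lambda m: m.group(0) if m.group(0) in KEYWORDS else "ID",
        line,
    )
    return re.sub(r"\s+", "", line)

def find_clones(source, min_length=7):
    """Group the start indices of identical windows of normalized statements."""
    statements = [s for s in (normalize(l) for l in source.splitlines()) if s]
    windows = defaultdict(list)
    for start in range(len(statements) - min_length + 1):
        key = tuple(statements[start:start + min_length])
        windows[key].append(start)
    # Only windows that occur at least twice are clones.
    return [starts for starts in windows.values() if len(starts) > 1]

code = """
int total = 0; // sum of items
for (int i = 0; i < items.length; i++)
    total += items[i];
return total;
int sum = 0;
for (int k = 0; k < values.length; k++)
    sum += values[k];   // accumulate
return sum;
"""
print(find_clones(code, min_length=4))  # [[0, 4]]
```

The two fragments differ in variable names, comments, and spacing, yet they are reported as one clone pair starting at statements 0 and 4. The `min_length` parameter plays the role of the minimal clone length mentioned above. A real detector would work on a proper token stream and would merge overlapping windows into maximal clones.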
Probably, it would be possible to extract a common method for both of these fragments. But even if this is not possible, it is still very likely that any changes to one of the two pieces of code should also be applied to the other one. So the developers should at least be aware of this duplication.
The definition of a clone used here is purely syntactical, and thus only captures code created via copy/paste/modify. Code that solves the same problem but looks completely different cannot, in general, be found this way.
Copy/paste: Who would do that?
When discussing cloning with developers, there are typically two reactions. Either they are convinced that nobody in their team would do something as bad as copy/paste, or they are sure that theirs is the system with the most clones ever. The truth is typically somewhere in between. One way to quantify the amount of cloning is the clone coverage, that is, the fraction of a system's statements that are part of at least one clone.
Using clone coverage, we can capture the extent of cloning in a single system. Typical values using our own detector are in the range between 5% and 15% (with a minimal clone length of 10 statements), although we have seen systems with 50% clone coverage and more. Some clone coverage values for open-source systems are shown in the following chart. Note that values below 5% clone coverage are very rare. As you can see from the low values for our own tool ConQAT, we eat our own dog food.
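As a rough illustration of the metric, the following Python sketch assumes that clone coverage is measured as the fraction of statements that belong to at least one clone. The function name `clone_coverage` and the `(start, length)` representation of a clone are assumptions made for this example, not the interface of any particular tool.

```python
def clone_coverage(total_statements, clones):
    """Fraction of statements that are part of at least one clone.

    `clones` is an iterable of (start, length) pairs given in statement
    indices. Overlapping clones are counted only once.
    """
    if total_statements == 0:
        return 0.0
    covered = set()
    for start, length in clones:
        covered.update(range(start, start + length))
    return len(covered) / total_statements

# A made-up system with 40 statements and three cloned regions,
# two of which overlap.
print(clone_coverage(40, [(0, 10), (5, 10), (20, 10)]))  # 0.625
```

Using a set means that a statement contained in several clones still counts only once, so the value never exceeds 100%.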
Cloning considered harmful?
Using suitable tools, we can easily determine that clones can be found in nearly every non-trivial piece of software. Our measurements of clone coverage also suggest that most systems have not only a few clones, but lots of them. But are clones actually a problem?
In practice, the answer is very simple: It depends. When building a prototype that will be thrown away soon, you should actually use lots of copy/paste, as this will help you implement features faster. The same holds if your employer pays you by lines of code. When your focus is long-term maintenance of your code base, clones will be a problem if you are not aware of them. Changing cloned code inconsistently often leads to bugs. But to prevent this, you might not have to eliminate all clones. It might be sufficient to have a tool that keeps you aware of clones in the code you are changing.
Personally, I find it tiresome to work with code that contains many clones. Having lots of déjà vu moments during development, because the code looks the same everywhere, does not fit with my image of being a software professional. While there are surely cases where eliminating a clone is too complicated, I try to keep my code as clone-free as possible. What about you? I would be happy to hear your opinion in the comments.
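A small, made-up Python example shows how an inconsistent change turns into a bug. Both functions and their names are hypothetical. The second was copied from the first, and only the original later received a fix for empty input.

```python
def average_order_value(orders):
    # Original fragment: later fixed to handle an empty list.
    if not orders:
        return 0.0
    return sum(o["amount"] for o in orders) / len(orders)

def average_refund_value(refunds):
    # Pasted copy: the empty-list fix was never applied here.
    return sum(r["amount"] for r in refunds) / len(refunds)

print(average_order_value([]))  # 0.0
try:
    average_refund_value([])
except ZeroDivisionError:
    print("the clone still fails on empty input")
```

A tool that flags the second function as a clone whenever the first one is edited would have pointed the developer at the missing fix.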
What is next?
This concludes the first part of our mini-series. I hope you got an idea of what code clones are about and whether they are relevant to your own development work. In the next part, we show you which tools you can use for detecting clones in your code base and how to deal with clones in the long term.