Facts & Fallacies of Software Measurement
Posted on 01/07/2014 by Dr. Florian Deißenböck
For almost as long as we have been developing software, we have tried to measure various aspects of it, among them size, complexity and quality. According to some counts, more than 1,000 software metrics have been proposed since the 1960s. Hence, one could hope that we have developed a sound understanding of how to use them effectively and efficiently in practice. Nevertheless, almost daily I encounter organizations that fail to employ software metrics to their benefit. In many cases, the inappropriate use of metrics actually does more harm than good in these organizations.
This article summarizes the pitfalls that I have observed in practice (and academia) over the course of ten years and emphasizes the best practices that are required to successfully employ software metrics.
What are Metrics?
Metrics are not exclusive to software engineering. On the contrary, they are prevalent in almost all disciplines. Consciously or not, we encounter metrics in many areas almost every day. Take, for example, your daily newspaper. In most sections, ranging from sports to economics, editors try to describe various topics quantitatively through the use of metrics. For example, metrics like ball possession, number of fouls and number of goal attempts are used to capture a football game. As the following figure shows (it depicts a game I prefer not to remember), these metrics are even visualized with bar charts. Dedicated newspapers and magazines take this a lot further by providing even more metrics and even fancier visualizations like heatmaps displaying the movements of individual players. In other sports, particularly baseball, game metrics have almost become a discipline in their own right.
Similarly, a few pages on, the Mobility section of your newspaper uses a host of metrics to quantify the characteristics of cars. Actually, the metrics shown in the figure below are only a subset of the metrics used in specialized motor magazines. They range from obvious things like displacement and performance to less obvious but surely important characteristics like interior noise. As most of these magazines are read purely voluntarily by car enthusiasts, I find it interesting to note that people are absolutely ready to deal with a great amount of measurement data if they are truly interested in a certain topic.
As pointed out before, more than 1,000 metrics have been proposed to measure different characteristics of software. Consequently, some structure is needed to categorize them. Many such structures, usually in the form of sophisticated quality models, have been proposed. For the sake of the following discussion, however, it’s sufficient to use the rather simplistic taxonomy shown below. It separates software metrics into process and product metrics, with the latter further separated into static and dynamic metrics. Static metrics can be measured by analyzing a software system's artifacts like source code, models, architecture diagrams or database schemata. To measure dynamic metrics, the system needs to be executed. While most of the following issues are valid for all types of software metrics, this article is written with static product metrics in mind, as these are the ones where I see the most challenges in practice.
Before going into the details, one should ask what a software metric actually is, i.e. how is it defined? When we look up the definition of “software metric” in the IEEE Standard Glossary of Software Engineering Terminology, we find that a metric is “a quantitative measure of the degree to which a system, component, or process possesses a given attribute.” So far, this describes a straightforward measurement. If we are, e.g., interested in the length (the attribute) of a pencil (the system), we can simply use a ruler to measure this length.
Fallacy #1: Measurement == Evaluation
Things get more complicated with the IEEE’s definition of the term quality metric: “A quantitative measure of the degree to which an item possesses a given quality attribute”. While this appears to be an innocent extension of the former definition, the little word “quality” makes all the difference. Suddenly, we are not interested in a purely objective measurement of the length of the pencil, say 24 cm, but want to know if 24 cm is a good or bad length for a pencil. To me, measurement (“how long is the pencil”) and evaluation (“does the pencil have a good length”) are two distinct activities that have decisively different characteristics and requirements. The fallacy I observe in practice is that these two activities are often not clearly separated. This causes a number of problems that are best illustrated with an example.
A metric I consider highly valuable is the so-called clone coverage. It measures the relative amount of code that is covered by code clones. A code clone is a piece of code that has been created by copying and pasting, and potentially modifying, an existing piece of code. While this is a very direct and often used form of reuse, it has severe negative consequences. First, it increases maintenance effort, as changes to a cloned piece of code often need to be replicated for all copies. Second, missing one of the copies creates an inconsistency between the clones that is, in most cases, undesired. The figure below shows an example where a null pointer exception was fixed in the code on the left, but the same fix was forgotten in the copied code on the right. As a result, the bug that caused the change on the left was fixed only halfway, as the same problem can arise in the code on the right. In 2009 we conducted a large-scale (and often cited) study that showed that these types of inconsistencies occur in practice and, if the inconsistency is unintentional, represent defects in about 50% of the cases. In the study alone, more than 100 defects were found in systems that had been live for multiple years.
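To make the definition of clone coverage concrete, here is a minimal sketch in Python. It is not the tool used in the study: it only finds exact duplicates of a fixed number of consecutive lines (after stripping whitespace), whereas real clone detectors also find modified copies. The function name `clone_coverage` and the `min_length` parameter are assumptions made for this illustration.

```python
from collections import defaultdict


def clone_coverage(files, min_length=3):
    """Return the fraction of lines that lie inside a duplicated window.

    files: dict mapping a file name to its list of source lines.
    min_length: number of consecutive lines that must match exactly
                (an assumption of this sketch, not a standard value).
    """
    # Strip whitespace so that pure indentation changes do not hide a clone.
    normalized = {name: [line.strip() for line in lines]
                  for name, lines in files.items()}

    # Index every window of min_length consecutive lines by its content.
    windows = defaultdict(list)
    for name, lines in normalized.items():
        for start in range(len(lines) - min_length + 1):
            key = tuple(lines[start:start + min_length])
            windows[key].append((name, start))

    # A line is covered if it belongs to a window that occurs more than once.
    covered = set()
    for occurrences in windows.values():
        if len(occurrences) > 1:
            for name, start in occurrences:
                covered.update((name, start + offset)
                               for offset in range(min_length))

    total = sum(len(lines) for lines in normalized.values())
    return len(covered) / total if total else 0.0


# Two tiny hypothetical files that share their first three lines.
example = {
    "a.py": ["x = 1", "y = 2", "z = 3", "print(x)"],
    "b.py": ["x = 1", "y = 2", "z = 3", "print(y)"],
}
coverage = clone_coverage(example)  # 6 of 8 lines are covered
```

Note that the function only measures. Whether the resulting number is good or bad for a given system is a separate question.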
To demonstrate the difference between measurement and evaluation, I ran our clone detection tool on itself. The results are visualized in the figure below, where each rectangle represents a source file and the colored areas show cloned code. One may rightfully ask why a tool that is meant to detect code clones, and that is developed by a team supposedly aware of the negative effects of cloning, exhibits such a high amount of clones.
This is answered by the next figure, which shows that most of the clones disappear if we use a gray color to depict code areas that are generated and never maintained manually. The rationale behind this is that clones in generated code do not cause the negative consequences discussed above. As the code is never touched manually (but always regenerated from an input specification), maintenance efforts are not affected and inconsistencies are impossible. What happened here? The clone detection tool considers two identical (or almost identical) pieces of code as clones no matter if they have been created by copy and paste or by a generator. In the first case, however, the result is relevant w.r.t. maintenance activities, while in the second case it is not.
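The effect can be illustrated with a small sketch. The numbers below are made up for illustration and are not the results of the analysis shown in the figures. The `FileMeasurement` type and the `generated` flag are assumptions as well: the flag stands for context that someone has to supply, because the code itself does not say whether it is maintained by hand.

```python
from dataclasses import dataclass


@dataclass
class FileMeasurement:
    path: str
    total_lines: int
    cloned_lines: int
    generated: bool  # context supplied by a human, not derived from the code


def coverage(measurements, include_generated=True):
    """Clone coverage over all files, or over manually maintained files only."""
    selected = [m for m in measurements
                if include_generated or not m.generated]
    total = sum(m.total_lines for m in selected)
    cloned = sum(m.cloned_lines for m in selected)
    return cloned / total if total else 0.0


# Hypothetical measurement results for three files.
measurements = [
    FileMeasurement("src/Detector.java", 400, 20, generated=False),
    FileMeasurement("src/Report.java", 600, 30, generated=False),
    FileMeasurement("gen/Parser.java", 1000, 750, generated=True),
]

raw = coverage(measurements)                                   # 800 / 2000
maintained = coverage(measurements, include_generated=False)   # 50 / 1000
```

The measurement is identical in both calls. Only the context used to interpret it differs, and that changes the picture from 40% to 5% clone coverage in this invented example.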
It is my opinion that this is not the fault of the clone detection tool (or any other static analysis tool). The problem lies in the difference between the measurement and evaluation activities. While the measurement (finding identical pieces of code) can be automated, the evaluation (deciding what consequences the clones have w.r.t. software quality) often cannot. The reason for this is that the evaluation requires a large amount of context information that is not available to the tool, as it is simply not present in the analyzed artifact, e.g. the source code.
Obviously, the analysis tool could be made smarter (and in practice it is) by telling it which pieces of code are generated so it can suppress clones found in these areas. Effectively, this means that we are pushing context information into the measurement activity. While this is absolutely reasonable and should be done, there are clear limits to this approach. For example, when running the clone detection in a project for the first time (even if generated code is excluded), developers will find many legitimate reasons why certain findings are not relevant for them. Examples are:
- Clones found in code that is about to be deleted.
- Clones between the analyzed system and third-party code the team is not responsible for.
- Clones that are of temporary nature and exist only for a certain time, e.g. during a restructuring phase.
Ultimately, the relevance of the findings also depends on the intended activities. If the team is interested in clones it could actually remove, clones with third-party code may not be relevant, as the team cannot change that code. If the team, however, simply wants to be aware of the clones so that it can avoid making inconsistent changes, these same clones may be highly relevant.
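The dependence on context and intent can be sketched as follows. Everything in this example is hypothetical: the `Goal` values, the flags on `CloneFinding` and the rules in `is_relevant` merely encode the examples from the text. In a real project, a person would have to set these flags, and the rules would differ from team to team.

```python
from dataclasses import dataclass
from enum import Enum


class Goal(Enum):
    REMOVE_CLONES = "remove clones"
    AVOID_INCONSISTENT_CHANGES = "avoid inconsistent changes"


@dataclass(frozen=True)
class CloneFinding:
    location: str
    in_code_to_be_deleted: bool = False
    involves_third_party: bool = False
    temporary: bool = False


def is_relevant(finding, goal):
    """Evaluate one finding for one goal, using human-supplied context."""
    # Code that is about to disappear, or clones that only exist during a
    # restructuring phase, are not worth acting on in this sketch.
    if finding.in_code_to_be_deleted or finding.temporary:
        return False
    # The team cannot remove clones with third-party code, but it may still
    # want to know about them before changing its own copy.
    if finding.involves_third_party:
        return goal is Goal.AVOID_INCONSISTENT_CHANGES
    return True


vendor_clone = CloneFinding("util/Strings.java", involves_third_party=True)
for_removal = is_relevant(vendor_clone, Goal.REMOVE_CLONES)
for_awareness = is_relevant(vendor_clone, Goal.AVOID_INCONSISTENT_CHANGES)
```

The same finding is irrelevant for one goal and relevant for the other, which is why a single automated verdict per finding falls short.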
It is my experience that it is rarely possible to push all the required context information into the measurement process. Instead, we have to accept that the successful application of software metrics requires an evaluation of measurement results that needs to be performed by humans. Up to now, only humans are capable of fully understanding the entire context information that is required to make an informed judgement about the relevance of findings generated by an automated measurement.
Fallacy #2: Measurable == Relevant