What Is Code Coverage?

Dr. Sven Amann

Our Test Gap analysis and Test Impact analysis automatically monitor the execution of your application to determine which code is covered in tests. As a result, they are objective with respect to what has been tested and, more importantly, what hasn’t. No personal judgement or gut feeling involved.

However, when we first set up the analyses with our customers, we often find that the measurements differ (significantly!) from their expectations. Often, this is because other coverage tools report different coverage numbers. This post explores the causes of such differences.

We usually refer to code coverage as a percentage, as in “we have 83.5% coverage on our software system”. To compute such a percentage, we divide the amount of code that has been executed in tests by the total amount of code in the system. When two coverage tools report different coverage, this means that they disagree with regard to either one or both of these amounts.

As we will see, this might be a disagreement in the absolute values as well as in the unit by which they measure the amount, e.g., number of statements vs. number of code branches.
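To make the ratio concrete, here is a minimal sketch in Python. The function name and all numbers are made up for illustration and do not come from any particular tool. It shows how the same test run can yield two different percentages simply because two tools count different units:

```python
def coverage_percent(covered_units: int, total_units: int) -> float:
    """Return coverage as a percentage of whatever unit the tool counts."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    return 100.0 * covered_units / total_units


# Same test run, two hypothetical tools that count different units.
statement_based = coverage_percent(167, 200)  # tool A counts statements
branch_based = coverage_percent(40, 64)       # tool B counts branches

print(statement_based)  # 83.5
print(branch_based)     # 62.5
```

Neither number is wrong. They just answer different questions.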

What Do You Measure?

Some coverage tools measure statement coverage, others line coverage, method coverage, file coverage, basic-block coverage, branch coverage, path coverage, or yet another type of coverage. These different types of coverage are not directly comparable, because they count different things.

For Test Gap analysis and Test Impact analysis, Teamscale considers method coverage. In the Tests perspective, it reports line coverage. So if Teamscale reports different coverage than your other coverage tool, first check whether you are comparing apples and oranges.

On a related note: It is a common misunderstanding that line coverage and statement coverage are the same thing. You can easily see the difference in a simple example program with just two method-call statements on the same line:

a.foo(); a.bar(); 

Assume that a test executes this program and foo() throws an exception, which means that bar() is not invoked. Line coverage would report that one line out of the one-line program was executed, i.e., we have 1/1 = 100% coverage. Statement coverage, on the other hand, would report that of the two call statements the first was executed and the second wasn’t, i.e., 1/2 = 50% coverage. A huge difference!
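The arithmetic can be sketched as follows. This is an illustrative model, not how any real coverage tool stores its data: each statement is represented as a pair of its line number and a flag saying whether it was executed.

```python
# The one-line program `a.foo(); a.bar();` modeled as two statements on line 1.
# foo() ran (and threw), so bar() never ran.
STATEMENTS = [
    (1, True),   # a.foo()
    (1, False),  # a.bar()
]


def statement_coverage(statements):
    """Fraction of statements that were executed."""
    executed = sum(1 for _, hit in statements if hit)
    return executed / len(statements)


def line_coverage(statements):
    """Fraction of lines on which at least one statement was executed."""
    all_lines = {line for line, _ in statements}
    hit_lines = {line for line, hit in statements if hit}
    return len(hit_lines) / len(all_lines)


print(line_coverage(STATEMENTS))       # 1.0
print(statement_coverage(STATEMENTS))  # 0.5
```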

What Is "All Your Code"?

We find that there are huge differences in how different coverage tools determine the number of all statements in the code. Most notably, some coverage tools count only those statements that were loaded during test execution. For example, dotCover (.NET) counts only statements in loaded assemblies. And coverage.py (Python) considers only files that are loaded, unless you explicitly specify the location of all your code via the --source parameter. Both tools, therefore, ignore any code that has not been loaded, although this code was obviously not covered in tests.

As a result, if you write an additional test covering a part of some code that was previously not loaded, the reported coverage may actually go down!
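A small worked example shows the effect. The file names and statement counts are hypothetical. The sketch compares a tool that only counts loaded files with one that counts all files:

```python
def coverage(files):
    """files maps a file name to (covered_statements, total_statements)."""
    covered = sum(c for c, _ in files.values())
    total = sum(t for _, t in files.values())
    return 100.0 * covered / total


# Before the new test, only a.py is ever loaded during testing.
loaded_before = {"a.py": (80, 100)}
# The new test loads b.py for the first time and covers 10 of its 100 statements.
loaded_after = {"a.py": (80, 100), "b.py": (10, 100)}

# A tool that only counts loaded files sees coverage drop.
print(coverage(loaded_before))  # 80.0
print(coverage(loaded_after))   # 45.0

# A tool that counts all code, loaded or not, sees coverage rise.
all_before = {"a.py": (80, 100), "b.py": (0, 100)}
print(coverage(all_before))     # 40.0
print(coverage(loaded_after))   # 45.0
```

The additional test covers strictly more code, yet the first way of counting reports a drop from 80% to 45%, because the denominator doubled.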

Also, many coverage tools simply count statements from all executed code. In practice, however, we usually want to distinguish certain types of code:

  • First, coverage of test code is usually irrelevant for quality-assurance purposes.
  • Second, coverage of generated code may be relatively low, because generators often generate more code than your application needs, which is then not covered in your tests.
  • Third, coverage of internal tools, such as migration scripts, is typically less critical, as defects are not customer visible.

Teamscale counts all statements in the source code as they appear in your version-control system, except for what you configure as an explicit exclude (test code, internal tools, generated code, etc.) in your project configuration. This is literally all the code that you declare relevant for testing, regardless of what gets loaded during test execution. Code that is not loaded during testing is consequently counted as not covered. Therefore, Teamscale may report much lower test coverage than other tools. However, with Teamscale, additional testing efforts will only ever increase your test coverage, which gives you a reliable measure to base your decisions on.
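The idea of "all code in version control minus configured excludes" can be sketched like this. It is only an illustration under assumed names: the exclude patterns, file paths, and statement counts are hypothetical, and the code is not Teamscale's implementation or configuration format.

```python
from fnmatch import fnmatchcase

# Hypothetical exclude patterns for test code, generated code, and internal tools.
EXCLUDES = ["tests/*", "generated/*", "tools/*"]


def relevant_files(all_files, excludes=EXCLUDES):
    """Keep every file in the repository that matches no exclude pattern."""
    return {
        path: statements
        for path, statements in all_files.items()
        if not any(fnmatchcase(path, pattern) for pattern in excludes)
    }


def coverage(all_files, covered, excludes=EXCLUDES):
    """Coverage over all relevant files; files never loaded count as 0 covered."""
    relevant = relevant_files(all_files, excludes)
    total = sum(relevant.values())
    hit = sum(covered.get(path, 0) for path in relevant)
    return 100.0 * hit / total


# Statement counts per file as found in version control (made-up numbers).
repo = {
    "src/orders.py": 100,
    "src/billing.py": 100,   # never loaded during testing
    "tests/test_orders.py": 50,
    "generated/api_stubs.py": 300,
    "tools/migrate.py": 40,
}
# Covered statements per file as reported by the test run.
covered = {"src/orders.py": 80, "tests/test_orders.py": 50}

print(coverage(repo, covered))  # 40.0
```

Because the denominator is fixed by the repository content and the excludes, covering more statements can only raise the result.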

What Code Is "Executable"?

Usually, coverage tools don’t simply include all code when computing coverage, but rather only executable code. The rationale is that code that isn’t executable cannot ever be covered in tests and should, therefore, be excluded when measuring coverage. Which code should be counted as “executable” is a hotly disputed topic, however. For example, some coverage tools count lines containing only curly braces as executable, while others exclude them. And some coverage tools for dynamic languages, such as Python, count lines containing class or method declarations as executable, while tools for other languages, such as Java or C#, typically exclude them.

This is not to say that one way to count is necessarily better or worse than another or even wrong. It just means that there are some degrees of freedom, which lead to (sometimes even huge) differences between almost any two coverage tools out there, as one tool will count a certain line that another tool excludes, and vice versa.
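To illustrate, the following sketch applies two definitions of "executable" to the same file and the same test run. The sample source, the execution flags, and the rule for spotting declarations are all assumptions made for this example:

```python
# One source file, modeled as (text, was_executed_in_tests) per line.
SOURCE = [
    ("class Greeter:", True),
    ("    def greet(self, name):", True),
    ("        if name:", True),
    ("            return 'Hello ' + name", True),
    ("        return 'Hello'", False),
]


def is_declaration(text):
    """Very rough check for class or method declaration lines."""
    stripped = text.strip()
    return stripped.startswith("class ") or stripped.startswith("def ")


def line_coverage(source, count_declarations):
    """Return (covered_lines, executable_lines) under the chosen definition."""
    executable = [
        hit for text, hit in source
        if count_declarations or not is_declaration(text)
    ]
    return sum(executable), len(executable)


print(line_coverage(SOURCE, count_declarations=True))   # (4, 5)
print(line_coverage(SOURCE, count_declarations=False))  # (2, 3)
```

The first definition yields 4/5 = 80%, the second 2/3, roughly 67%, for exactly the same test run.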

To see which lines Teamscale considers executable, you can enable the Annotate test coverage option in the right sidebar of the code perspective. Lines shown as red, green, or yellow are considered executable, while white lines are not.

Figure: Test coverage in code in Teamscale

What Is "Reachable"?

Code coverage may also be approximated statically, for example, with NDepend (.NET). The idea is to use your tests as entry points and compute which code is reachable from any of them. This approach is tempting, because it does not require any code execution and is, therefore, quite fast. Note, however, that there is a huge difference between which code is statically reachable and which code is actually executed in a test:

  • Since it is often not statically decidable whether some conditional branch will be executed, static analyses often simply assume that all branches are executed, which massively over-approximates actual coverage.
  • Since it is often unclear statically which concrete implementation will be invoked through a polymorphic call, static analyses often simply stop at polymorphic calls, which massively under-approximates actual coverage.

For these reasons, static reachability and coverage are simply two different measures that are not comparable.
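Both effects show up even on a tiny example. The sketch below uses a hypothetical call graph and a deliberately simple reachability analysis that follows every direct call (as if all branches were taken) and stops at polymorphic calls. It does not model NDepend or any other specific tool.

```python
# Hypothetical call graph: caller -> list of (callee, kind of call).
# checkout() calls ship() in one branch and refund() in the other.
CALLS = {
    "test_checkout": [("checkout", "direct")],
    "checkout": [
        ("ship", "direct"),
        ("refund", "direct"),
        ("Notifier.notify", "polymorphic"),
    ],
    "ship": [],
    "refund": [],
}

# What the test actually executed, e.g., as recorded by a profiler.
EXECUTED = {"checkout", "ship", "EmailNotifier.notify"}


def statically_reachable(entry, calls):
    """Follow all direct calls from the entry point; stop at polymorphic calls."""
    reachable, worklist = set(), [entry]
    while worklist:
        method = worklist.pop()
        for callee, kind in calls.get(method, []):
            if kind == "polymorphic":
                continue  # concrete target unknown statically
            if callee not in reachable:
                reachable.add(callee)
                worklist.append(callee)
    return reachable


static = statically_reachable("test_checkout", CALLS)
print(sorted(static - EXECUTED))  # ['refund']
print(sorted(EXECUTED - static))  # ['EmailNotifier.notify']
```

The first set is code the static analysis counts although it never ran; the second is code that ran although the analysis missed it.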

Choose Your Weapons Wisely

As we have seen, the seemingly simple term “coverage” stands for quite a variety of different measures in different flavors. Whether you use Teamscale or not, you should always be aware of what you measure and for what purpose you use these measurements.

For example, if you want to know exactly how well you’ve tested some piece of code, you might consider branch or even path coverage. Achieving high branch or path coverage is usually quite expensive in terms of test runtime though, because you need lots and lots of test cases that add up to a long-running test suite. Additionally, with many technologies, measuring these kinds of coverage itself causes a significant runtime overhead.

Often, a much cheaper approach to increase the overall effectiveness of your testing is to identify changes that were not tested at all. Teamscale’s Test Gap analysis can do this for you, based on relatively lightweight method-coverage measurements, and we’ve seen it reduce field defects by almost a quarter.
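At its core, the idea can be sketched as a set difference between changed methods and covered methods. This is a strong simplification with hypothetical method names, not Teamscale's actual algorithm:

```python
def find_test_gaps(changed_methods, covered_methods):
    """Return the changed methods that no test executed."""
    return sorted(set(changed_methods) - set(covered_methods))


# Methods changed since the last release (e.g., derived from version control).
changed = ["Order.total", "Order.apply_discount", "Invoice.render"]
# Methods executed by any test since they were changed (method coverage).
covered = ["Order.total", "Cart.add"]

print(find_test_gaps(changed, covered))
# ['Invoice.render', 'Order.apply_discount']
```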

Alternatively, or in addition, you may counter increasing test runtimes by selectively running only relevant tests. Using Teamscale’s Test Impact analysis, which also requires only method coverage, may uncover 90% of the failing tests in only 2% of the test runtime, for example.
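The basic selection step can be sketched as follows: given method coverage recorded per test, run only those tests that executed at least one changed method. Test names and method names are hypothetical, and this simplified sketch is not Teamscale's actual algorithm.

```python
def select_tests(coverage_by_test, changed_methods):
    """Return the tests that executed at least one changed method."""
    changed = set(changed_methods)
    return sorted(
        test for test, methods in coverage_by_test.items()
        if changed & set(methods)
    )


# Method coverage recorded per test in an earlier run.
coverage_by_test = {
    "test_order_total": ["Order.total", "Order.apply_discount"],
    "test_cart_add": ["Cart.add"],
    "test_invoice": ["Invoice.render", "Order.total"],
}

print(select_tests(coverage_by_test, ["Order.apply_discount"]))
# ['test_order_total']
print(select_tests(coverage_by_test, ["Order.total"]))
# ['test_invoice', 'test_order_total']
```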

Choose your weapons wisely.