Identifiers in Source Code: Just Because You Can Doesn’t Mean You Should

Posted on 10/15/2015 by Dr. Andreas Göb

As we all know, programmers spend a lot of their time reading code. The paper Concise and Consistent Naming shows that approximately 70% of a system’s source code consists of identifiers, i.e., names of procedures, methods, variables, constants, and so on. The paper concludes that identifiers should be chosen with care. In this post, I approach the topic of identifiers from a more technical perspective and illustrate some basic things that both programmers and tool vendors can easily stumble over.

Consider the following ABAP program:

REPORT *.
TYPES * TYPE I.
FORM *.
 DATA * TYPE *.
 * = * * *.
ENDFORM.

What do you think happens when you execute that program? The problem is not only that the chosen identifiers are far from being descriptive. The program is also hard to understand from a syntactical perspective. What does the asterisk refer to? Since there are no reserved words in ABAP, the asterisk is a valid identifier. You are even allowed to use it as a name for the program (report) itself, a form routine, a data type, and a local variable all at the same time, although it also serves as the multiplication operator.

Some time ago, Horst Keller posted an ABAP obfuscation riddle on SAP’s community network. The question was under which conditions the following code was syntactically correct:

INCLUDE
NOT. IF
NOT  NOT  NOT  NOT  NOT  NOT  NOT  NOT  NOT
NOT  NOT !NOT  OR   NOT  NOT  NOT  NOT  NOT
NOT  NOT  NOT  NOT  NOT  NOT  NOT  NOT  NOT
NOT  NOT  NOT=>NOT( NOT ) OR  NOT  NOT  NOT
NOT  NOT  NOT  NOT  NOT  NOT !NOT  ...  NOT.

In less than two hours, another forum member solved the riddle by providing all the definitions needed to make that code compile. The solution was that NOT is all of the following:

  • The NOT operator,
  • The name of a program include,
  • A local class in that include,
  • A static method in that class,
  • A select-option, i.e. a parameter to the program, and
  • A custom-defined macro that can be used as an alias for ENDIF.

While both examples were deliberately crafted to obfuscate code, they still illustrate that one can produce fairly unreadable code within the bounds of a language specification. More precisely, some programming languages do little to enforce sensible identifiers and leave the choice entirely to the developers.

If you like, you can get yourself into even more trouble by widening the character set you choose your identifiers from: languages like Java and C# let you choose identifiers not only from the set of ASCII characters or characters used in Western languages (Ä, ö, ß, Æ, î, Ø) but from the full Unicode spectrum, covering Arabic, Hebrew, Chinese, and many more languages. This makes it extremely important to consider the file encoding when dealing with source code files, as my colleague Florian pointed out in his post No Such Thing as Plain Text.
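To make the encoding issue concrete, here is a minimal sketch in Python, used only for illustration because it also accepts non-ASCII letters in identifiers. The identifier größe and the helper roundtrip are made up for this example; the sketch simulates one tool writing a source file in one encoding and another tool reading it in a different one.

```python
# Hypothetical identifier containing non-ASCII letters.
name = "größe"


def roundtrip(text, write_encoding, read_encoding):
    """Encode text as one tool would write it, then decode it as another tool would read it."""
    return text.encode(write_encoding).decode(read_encoding, errors="replace")


# The name is a valid identifier in Python 3, much as it would be in Java or C#.
is_valid = name.isidentifier()

# Same characters, different bytes on disk depending on the encoding.
utf8_bytes = name.encode("utf-8")      # ö and ß take two bytes each
latin1_bytes = name.encode("latin-1")  # one byte per character

# Reading UTF-8 bytes as Latin-1 raises no error but yields a different string.
garbled = roundtrip(name, "utf-8", "latin-1")

# Reading Latin-1 bytes as UTF-8 hits invalid byte sequences,
# which are replaced by U+FFFD here instead of raising an error.
broken = roundtrip(name, "latin-1", "utf-8")

print(is_valid, len(utf8_bytes), len(latin1_bytes))
print(repr(garbled), repr(broken))
```

In both mismatch cases the identifier a tool sees is no longer the one the developer wrote, which is why every tool in the chain has to agree on the file encoding.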

While emoticons and similar glyphs are currently disallowed in many programming languages, there is still a vast variety of characters that can be used in identifiers. Not only code editors and compilers, but all tools dealing with source code artifacts need to be aware of this. In particular, when detecting clones or renamings in the code, tools need to be very careful about what to treat as an identifier. Additionally, people with badly configured IDEs may accidentally change the file encoding and make the build fail, or generated API documentation may contain unreadable characters, to name just some of the possible effects.
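The following Python sketch shows two ways in which a tool can get identifiers wrong. The patterns, the sample source line, and the function name find_identifiers are assumptions made for this illustration. Python's \w only approximates the identifier rules of real languages; Java, for example, defines them via Character.isJavaIdentifierStart and Character.isJavaIdentifierPart.

```python
import re
import unicodedata

# Naive pattern: assumes identifiers consist of ASCII letters, digits, and underscores.
ASCII_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Unicode-aware pattern: for str patterns in Python 3, \w also matches
# letters and digits outside ASCII; [^\W\d] is "a word character that is not a digit".
UNICODE_IDENTIFIER = re.compile(r"[^\W\d]\w*")


def find_identifiers(pattern, source):
    """Return all substrings of source that the given pattern treats as identifiers."""
    return pattern.findall(source)


# Hypothetical source line with two non-ASCII identifiers.
source = "größe = länge * 2"

ascii_result = find_identifiers(ASCII_IDENTIFIER, source)      # splits the names into fragments
unicode_result = find_identifiers(UNICODE_IDENTIFIER, source)  # keeps the names intact

# Two spellings that render identically but consist of different code points.
composed = "caf\u00e9"     # é as a single precomposed character
decomposed = "cafe\u0301"  # e followed by a combining acute accent

same_raw = composed == decomposed
same_normalized = (
    unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
)

print(ascii_result, unicode_result, same_raw, same_normalized)
```

With the ASCII-only pattern, a clone or rename detector would see the fragments gr, e, l, and nge instead of two identifiers. Without Unicode normalization, it could also report a renaming between two names that look exactly the same on screen.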

Therefore, coding guidelines are extremely important in order to mitigate the problems mentioned above, especially for programming languages that are very permissive with regard to identifiers. From my experience in software development projects, I have learned that guidelines are only effective if they are combined with tool support for developers. Tools like Checkstyle for Java or StyleCop for C# can be used to operationalize coding guidelines to a large extent. Most of the time, these tools are integrated into the IDE and rely on the developer running them from time to time. Depending on the IDE, these tools may be configured to be executed as part of a local build, but this still requires configuration, in particular to ensure that the project- or company-wide set of rules is used.

At CQSE, we take a different approach: We use our quality analysis suite Teamscale to check for violations of coding guidelines (along with many other aspects of software quality and maintainability) centrally and in near real-time. Simple naming rules can be configured using regular expressions. For more advanced concepts, Teamscale offers a custom check API, which can be used to write checks in a few lines of Java code. Violations are detected immediately after a commit and can be shown both in a web browser and in various IDEs. This approach ensures that all the code is analyzed, that every team member uses the same set of rules, and that everybody can get results for the whole system without having to manually run an analysis tool. Certain rules can even be specified independently of the programming language to gain additional consistency across the whole multi-language codebase.
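To illustrate the general idea of naming rules expressed as regular expressions, here is a minimal sketch in Python. It is not Teamscale's configuration format or its check API; the rule set, the NAMING_RULES table, the find_violations function, and the sample identifiers are all assumptions made for this example.

```python
import re

# Hypothetical project rules: one regular expression per kind of identifier.
NAMING_RULES = {
    "class": re.compile(r"[A-Z][A-Za-z0-9]*"),               # UpperCamelCase
    "method": re.compile(r"[a-z][A-Za-z0-9]*"),              # lowerCamelCase
    "constant": re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*"),  # UPPER_SNAKE_CASE
}


def find_violations(identifiers):
    """Return the (kind, name) pairs whose name does not fully match the rule for its kind."""
    violations = []
    for kind, name in identifiers:
        rule = NAMING_RULES.get(kind)
        # Kinds without a configured rule are not checked.
        if rule is not None and rule.fullmatch(name) is None:
            violations.append((kind, name))
    return violations


# Hypothetical identifiers, e.g. as extracted from a commit.
sample = [
    ("class", "OrderService"),
    ("class", "Größe"),           # rejected: the rule only allows ASCII letters
    ("method", "computeTotal"),
    ("method", "Compute_total"),  # rejected: starts uppercase and contains an underscore
    ("constant", "MAX_RETRIES"),
    ("constant", "maxRetries"),   # rejected: not upper case
]

violations = find_violations(sample)
print(violations)
```

Because the character classes in these rules are restricted to ASCII, the same mechanism that enforces a naming style also narrows the character set to what the project has agreed on.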

Maybe you have not seen examples like the above in production code. In any case, I think every software development project needs some way of restricting what the programming language allows to a level that makes sense for that project.