No Such Thing As Plain Text
Dr. Florian Deißenböck
While there are numerous excellent articles, blog posts and books about the correct handling of character encodings in software systems, many systems still get it wrong because their architects and developers never understood what this is all about. I suspect that is due to the fact that many systems work more or less correctly even if you don’t care about character encodings.
Moreover, character encodings in general and Unicode in particular are a topic of overwhelming complexity (if you want to fully understand it). The combination of these two facts (it mostly works even if I don’t care, and really understanding the thing is hard) allows laziness to set in, resulting in software systems that handle text correctly only under the most trivial circumstances.
This blog post aims to remedy this by focusing on the single most important rule for developers while abstracting away everything that is not required to understand this rule.
Daily Character Encoding Fails
My name is Florian Deißenböck. As you may notice, my last name contains two characters (»ß« and »ö«) that are not plain Latin letters. While not plain Latin, these characters are quite common in the German language.
Still, I daily (literally daily!) receive e-mails and letters that scramble my name in various manners—only two examples are shown below. Please note that the two companies that send the e-mails below are both German and, hence, a name with a sharp s and an o-umlaut should be nothing exotic for them.
And, guess what, all these e-mails and letters are created by software systems (humans usually get my name right, at least in German-speaking countries). Presumably, these software systems are developed by people who don’t know a bit about character encodings or who use faulty libraries.
This suspicion is confirmed by my work as a software quality auditor: In almost every system we audit we find gross deficiencies in the way they deal with text. This concerns the code level as well as the architectural level—most architecture specifications don’t even mention the topic at all.
I asked myself why these problems are still so common in 2020. After all, the ASCII encoding was published more than 50 years ago and Unicode turned 25 a couple of years back.
There are tons of resources on the Internet and in the book shops about correct handling of text and most modern programming languages provide (more or less) proper implementations for character handling. At the latest, after Joel Spolsky published his seminal post »The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets« in 2003, there should be no more excuses to be ignorant about this topic.
However, most developers still are. Remembering my own naïve approaches to this topic (and some lengthy discussions with my colleagues), I came to the conclusion that these are the reasons for this omnipresent ignorance:
- Character encodings in general and Unicode in particular are a topic of great complexity. So, understanding it is hard and takes a lot of time (even Joel Spolsky’s »absolute minimum« article is longer than 7 pages).
- Even if you completely ignore the intricacies of character encodings, your software system will somehow work (as long as you stick to one platform and English text).
- Most programming languages and class libraries encourage encoding-ignorant handling of text (for example, equipping Java’s String class with a constructor that only takes a byte array as an argument was just a bad idea).
To remedy this, this post focuses on the most important misunderstandings and the one rule that is required to get character encodings right in your software systems. Consequently, this post brutally simplifies things, in some places up to a point where someone with a good understanding of character encodings will consider them wrong.
So, for everybody who is on first-name terms with abstract characters, grapheme clusters and the non-injectivity of the coded character: The point of this post is not to give a detailed explanation of character encodings but to create awareness of the topic in the first place.
Some Necessary Background on Encodings
When thinking about characters, I find it helpful to separate the following aspects. For example, when handling the letter »A«, you should separate:
- Glyph: The visual representation of the character, e.g., on the screen or on paper. This is determined by the font you use.
- Code Point: A numeric value that represents a character. For example, the code point of letter »A« is 65 (for almost all character sets). The code point depends on the character set you use, e.g., US-ASCII, ISO-8859-1, or Unicode. While the example letter »A« shares the same code point in most character sets, this is not the case for other characters, e.g., of the Japanese language.
- Code Unit: A binary representation of the code point, e.g., 0100 0001. The concrete binary representation is determined by the encoding scheme you use. While character sets like ISO 8859-1 have straightforward encoding schemes that use a single byte to represent each code point, Unicode provides different encoding schemes to map from the code point to the code unit. For example, the 8-bit encoding scheme UTF-8 encodes code point 65 (the letter »A«) as 0100 0001 whereas the 16-bit encoding schemes UTF-16BE and UTF-16LE encode the same code point as 0000 0000 0100 0001 and 0100 0001 0000 0000 respectively.
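The byte patterns quoted for the letter »A« can be reproduced with a few lines of Java. This is only a minimal sketch: the class name `CodeUnitsDemo` and the helper `toBinary` are made up for this illustration, and it relies on nothing but the standard `String.getBytes(Charset)` call.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class CodeUnitsDemo {

    // Encodes the text with the given charset and renders every resulting
    // byte as a group of eight binary digits, separated by spaces.
    static String toBinary(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad to eight digits.
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "A";
        // The code point is the same no matter how it is encoded later.
        System.out.println("code point: " + a.codePointAt(0));
        // The bytes differ depending on the encoding scheme.
        System.out.println("UTF-8:      " + toBinary(a, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + toBinary(a, StandardCharsets.UTF_16BE));
        System.out.println("UTF-16LE:   " + toBinary(a, StandardCharsets.UTF_16LE));
    }
}
```

The same code point 65 comes out as one byte in UTF-8 and as two bytes, in opposite order, in UTF-16BE and UTF-16LE.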
These terms and their relationships are illustrated with some examples in the figure below.
Example 1 shows that no matter what font is used to display a character, the code point and the code unit are the same as long as the same character set and encoding scheme are used.
Example 2 shows that Unicode and ISO 8859-1 share the same code points for many characters, even for more exotic ones like the u-umlaut. It also shows that the UTF-8 encoding scheme defines a code unit different from the straightforward one-byte-per-code-point encoding scheme.
Example 3 shows that the copyright sign is associated with two different code points and, consequently, with two different code units, for the character sets ISO 8859-1 and IBM 850.
In many cases the use of a specific encoding scheme also implies the use of a specific character set. For example, the use of UTF-8 or UTF-16 implies that Unicode is used. On the other hand, simple character sets like ISO 8859-1 that contain 256 characters or fewer only know the straightforward one-byte-per-code-point encoding scheme. Hence, this conceptually important distinction is often not made explicit. Instead, the combination of a character set and an encoding scheme is often referred to simply as the encoding (or, even more confusingly, as the character set).
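Examples 2 and 3 can be checked in code as well. The following sketch is illustrative only: the class `CharsetComparison` and its helper `toHex` are invented names. It assumes that the runtime may or may not ship the IBM850 charset, which is not among the charsets every Java runtime must provide, so that part is guarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class CharsetComparison {

    // Encodes the text with the given charset and renders the bytes as
    // two-digit hexadecimal values separated by spaces.
    static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this example independent of the encoding
        // of the source file itself.
        String uUmlaut = "\u00FC";
        String copyright = "\u00A9";

        // Example 2: same code point, different bytes per encoding scheme.
        System.out.println("u-umlaut, ISO-8859-1: " + toHex(uUmlaut, StandardCharsets.ISO_8859_1)); // FC
        System.out.println("u-umlaut, UTF-8:      " + toHex(uUmlaut, StandardCharsets.UTF_8));      // C3 BC

        // Example 3: same character, different values per character set.
        System.out.println("copyright, ISO-8859-1: " + toHex(copyright, StandardCharsets.ISO_8859_1)); // A9
        if (Charset.isSupported("IBM850")) {
            System.out.println("copyright, IBM 850:    " + toHex(copyright, Charset.forName("IBM850"))); // B8
        } else {
            System.out.println("IBM850 is not available on this runtime");
        }
    }
}
```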
The Single Most Important Rule
Even with this little (overly simplified, partially incorrect) background, it becomes obvious why Joel Spolsky is absolutely right when he reduces the whole complex character handling topic to one single rule:
»There Ain’t No Such Thing As Plain Text!«.
Without the exact knowledge about the character set and the encoding scheme, you simply cannot interpret a byte array as text (or convert text to a byte array). This is absolutely the only thing you have to remember!
If someone tells you that the report files written by his system are in plain text, you ask »which character set and which encoding scheme do you use?« As described above, you can narrow this down even more and simply ask »which encoding do you use?« as answers like UTF-8 will tell you about the encoding scheme as well as the character set.
If someone tells you that you can use his API by sending the data as plain text, you ask »which encoding?«. If someone tells you that his system can import all data from plain text files, you ask »which encoding?«. If you review code and find that it converts a byte array to text without knowledge of the encoding, ask the author what’s going on (and cancel the release that you planned for next week).
In fact, whenever a byte array needs to be treated as text (or text needs to be stored as a byte array), you ask »which encoding?«. If the only thing you get is a quizzical look, you know there is a problem!
Why Many Developers Don’t Care
If this rule is so fundamental (and simple), why do many developers not care? First, commonly used encodings like ISO-8859-1 and UTF-8 are similar enough to behave almost identically for simple Latin letters like »a«, »b«, »c« and many punctuation characters like »;«, »,«, »#«.
Consequently, you don’t even notice that there is a problem if you write a file as »ISO-8859-1« and read it as »UTF-8« as long as it doesn’t contain letters like the o-umlaut. However, things go south if the file contains such characters. For example, my name is rendered like »Dei?enb?ck« by most editors if it was saved in an ISO-8859-1 encoded file but read with the UTF-8 encoding. Does this look familiar? Please note that the question mark is not the actual glyph used for a corrupted character but the editor’s way of signaling that there is a problem. Word, for example, displays it like this:
Second, many programming languages (and their class libraries) make it very easy to be lazy. For example, Java lets you construct a String object from a byte array with the simple constructor
String(byte[] bytes). If the above rule is correct, this just shouldn’t be possible. How can a String object be constructed without knowledge about the encoding? The answer is given by the documentation of the constructor: »Constructs a new String by decoding the specified array of bytes using the platform’s default charset.« (where »charset« refers to the character set as well as the encoding scheme). So what’s happening here is that the Java virtual machine assumes one encoding to be the default encoding for each platform, e.g., UTF-8 on Linux and Windows-1252 on most Windows platforms (in Western countries), and uses this to decode the byte array. One need not be a pessimist to see that this will lead to problems if you run
String.getBytes() (which uses the default encoding, too) on Windows, store the result in a file and then load the bytes from the file on Linux and construct a String from them without specifying the encoding. Of course, you only discover these problems when your test cases contain non-ASCII characters.
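The effect of such a mismatch can be simulated on a single machine by naming the two charsets explicitly. The sketch below is an illustration under stated assumptions: ISO-8859-1 stands in for the writer’s platform default (on Windows it would more likely be Windows-1252), UTF-8 stands in for the reader’s default, and the class and method names are invented. Note also that newer Java runtimes (JDK 18 and later) use UTF-8 as the default charset on all platforms, so the implicit variant of this bug mainly affects older runtimes and data they produced.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class DefaultCharsetPitfall {

    // Stand-in for "written on platform A": encodes with ISO-8859-1,
    // as an assumed example of the writer's default encoding.
    static byte[] writeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Stand-in for "read on platform B" whose default is UTF-8.
    // Bytes that are not valid UTF-8 are replaced by U+FFFD.
    static String readAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // The fix: the reader uses the same encoding as the writer.
    static String readAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes avoid depending on the source file encoding.
        String name = "Dei\u00DFenb\u00F6ck";
        byte[] stored = writeLatin1(name);

        String garbled = readAsUtf8(stored);
        String intact = readAsLatin1(stored);

        System.out.println("mismatched charsets preserve the name: " + garbled.equals(name));
        System.out.println("contains replacement character U+FFFD: " + (garbled.indexOf('\uFFFD') >= 0));
        System.out.println("matching charsets preserve the name:   " + intact.equals(name));
    }
}
```

Replacing the name with pure ASCII text such as »Smith« makes all variants agree, which is exactly why the problem stays hidden in many test suites.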
What You Should Do
To prevent this type of problem, always think about textual data as a pair of binary information plus the information about the encoding. Become wary when your programming language of choice lets you treat binary data as text without making the encoding explicit. In Java, for example, stay away from the constructors of class
FileReader that do not take a charset (before Java 11, the class offered no way to specify one at all) and use
InputStreamReader with an explicit charset instead. If you consistently follow this advice, you will notice that in certain situations, you will actually be forced to query the information about the encoding from the end user, e.g., when you read a file provided by the user. Don’t hesitate to do so. Better make your user think about this issue (and document it properly) instead of making false assumptions (like the platform encoding) that ultimately lead to corrupt data.
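In code, this advice boils down to passing a charset wherever bytes and text meet. The following is a minimal sketch, not a definitive implementation: `ExplicitEncodingIo`, `writeText`, `readText` and `roundTrip` are hypothetical names chosen for this example, and the temporary file merely stands in for a file supplied by a user.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of any library.
public class ExplicitEncodingIo {

    // Writes text to a file; the caller must name the encoding.
    static void writeText(Path file, String text, Charset charset) throws IOException {
        try (Writer writer = new OutputStreamWriter(Files.newOutputStream(file), charset)) {
            writer.write(text);
        }
    }

    // Reads a whole file as text; the caller must name the encoding.
    static String readText(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Writes the text to a temporary file and reads it back with the
    // same charset, returning what was read.
    static String roundTrip(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                writeText(file, text, charset);
                return readText(file, charset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "Dei\u00DFenb\u00F6ck";
        String readBack = roundTrip(name, StandardCharsets.UTF_8);
        System.out.println("round trip preserved the name: " + readBack.equals(name));
    }
}
```

Neither helper has an overload without a `Charset` parameter, so a caller cannot fall back to the platform default by accident.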
And don’t use so-called encoding-guessing algorithms. They use heuristics that are known to be error-prone. If you yourself can make decisions about the encoding, e.g., when defining a file format, use UTF-8, which has a number of advantages and can be considered the de facto standard today.
Finally, be aware that this post focuses on the most important rule only. To handle characters correctly, you most probably need to learn a lot more about the intricate world of character encodings. Good starting points can be found here:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets: A must read for everybody.
- UTF-8 Everywhere: The authors of this manifesto make the case for the universal application of the UTF-8 encoding scheme and do a very good job at explaining some difficult concepts of Unicode.
- Java: A Rough Guide to Character Encoding: Very good summary of the intricacies of Java’s Unicode implementation.
- Multi-Language Character Sets: Detailed slide set on the topic.