Character encodings explained

Plain text files are the simplest files you can think of. They’re a sequence of bytes that store information in (mostly) human-readable form, without any formatting information attached. Of course, that information is stored in binary format, that is, as a sequence of zeroes and ones. For your computer to display a file’s content in a way humans can understand, that is, as actual letters, it needs to know which character encoding the file uses. Character encodings are sets of rules that tell the computer how to translate binary data into readable text.
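To make this concrete, here is a minimal Python sketch (Python is used purely for illustration) showing that the very same bytes only turn into readable letters once an encoding is chosen to interpret them:

    # Three raw bytes -- just numbers until an encoding gives them meaning.
    raw = bytes([0x48, 0x69, 0x21])
    print(list(raw))            # [72, 105, 33] -- the underlying values
    print(raw.decode("ascii"))  # "Hi!" -- ASCII maps those values to letters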

History of character encodings

When the home computer revolution began in the 1980s, most users spoke English as their first language. Manufacturers were mostly based either in the US or the UK, and their machines were sold in those same markets and used the ASCII encoding. The ASCII specification was originally published in 1963 and used 7 bits to represent each character. As long as computers only had to display the characters used in English, all was well. Those who didn’t live in English-speaking countries, however, had a hard time displaying text in their own languages. In some cases, they had to give up accented characters, a mild annoyance one could live with. In other cases, they were completely out of luck.

When Thai PhD candidate Van Suwannukul had to write his dissertation in his native language, he designed and manufactured a special graphics card, dubbed the Hercules Graphics Card. His idea was simple but revolutionary: instead of adding Thai text support to his computer, he created a monochrome graphics card that offered the bitmapped graphics previously found only on colour graphics cards. Each Thai character wasn’t an actual character, but an image (or sprite) of that particular character. It wasn’t perfect, but it worked.

The major downside to Suwannukul’s solution was that it bypassed encodings completely. As time went on, it became increasingly clear that the then-standard ASCII encoding no longer cut it, as the market for personal computers had expanded well beyond the American and British markets it was originally designed for. Microsoft, for instance, had to release a dedicated edition of Windows 3.1, named Windows 3.2, for the Chinese market.

ISO-8859 encodings

Around the same time, the International Organization for Standardization (ISO) took note of the need to standardise character encodings, so that users from different countries could read each other’s documents with minimal hassle. In 1987, the ISO 8859-1 standard was published, detailing an encoding for correctly displaying Western European languages and a few other languages that use the Latin alphabet. Fourteen more parts soon followed, each detailing the encoding for a specific region. To preserve compatibility, each part was built on top of the existing ASCII encoding.
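The catch is that the same byte means something different in each regional part. A short Python sketch (using Python’s names for these codecs) shows the divergence:

    # One byte, three ISO 8859 parts, three different characters.
    byte = b"\xe9"
    print(byte.decode("iso-8859-1"))  # é  (Part 1, Western European)
    print(byte.decode("iso-8859-5"))  # щ  (Part 5, Cyrillic)
    print(byte.decode("iso-8859-7"))  # ι  (Part 7, Greek)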

Adoption was left to developers, and not everyone complied. Microsoft used its own Windows-1252 encoding, commonly (if loosely) referred to as “ANSI” because it was based on a draft ANSI standard. Although it was more or less compatible with ISO 8859-1, it wasn’t 100% compatible: Windows-1252 places printable characters, such as curly quotes, in a range that ISO 8859-1 reserves for control codes. This made text files created on Windows slightly different from those created on other operating systems.
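The difference is easy to demonstrate; a small Python sketch, again for illustration only:

    # Windows-1252 puts printable characters where ISO 8859-1 has control codes.
    byte = b"\x93"
    print(byte.decode("cp1252"))             # “  (left curly quote, U+201C)
    print(repr(byte.decode("iso-8859-1")))   # '\x93' -- an invisible control code
    print(b"\x80".decode("cp1252"))          # €  (the euro sign)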

But it didn’t end there: each ISO encoding was still restricted to a specific geographic area. A British user could still have issues viewing a document created by a Pole, and couldn’t properly display content written in Japanese.

Enter Unicode

The same year the first ISO 8859 parts were released, work started on Unicode, a standard that aims to consistently encode and display the characters of every language. The Unicode standard allows for more than 1.1 million characters, or code points, of which more than 900,000 are available for public assignment. The code points are spread across 17 planes that contain 65,536 code points each. As of 2016, only 24% of all code points have been assigned, meaning that Unicode will continue being a standard for years to come.
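Every character is identified by a numeric code point, and dividing that number by 65,536 tells you which plane it lives in. A quick Python illustration:

    # Code points and the plane they belong to (plane = code point // 65,536).
    for ch in ("A", "é", "€", "😀"):
        cp = ord(ch)
        print(f"{ch}  U+{cp:04X}  plane {cp // 0x10000}")
    # A, é and € sit in plane 0 (the Basic Multilingual Plane); 😀 is in plane 1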

To be clear, Unicode itself is not a character encoding, but a standard: a theoretical description of how characters must be classified and how encodings should be built. There are four major encodings based on Unicode, one of which (UCS-2) is deprecated, while two others saw only limited adoption. The most commonly used is UTF-8. Compared to the other encodings, UTF-8 needs as little as one byte to encode a character, which means less wasted disk space. A UTF-8-encoded character can take up 8, 16, 24, or 32 bits (that is, 1 to 4 bytes), depending on its code point. By comparison, a UTF-16-encoded character takes up either 16 or 32 bits.
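You can see the variable width directly by encoding a few characters and counting the bytes; a short Python sketch:

    # UTF-8 spends only as many bytes as each character actually needs.
    for ch in ("A", "é", "€", "😀"):
        print(f"{ch}  UTF-8: {len(ch.encode('utf-8'))} byte(s)  "
              f"UTF-16: {len(ch.encode('utf-16-le'))} byte(s)")
    # A: 1 vs 2, é: 2 vs 2, €: 3 vs 2, 😀: 4 vs 4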

The first version of the Unicode standard was released in 1991 and, as of 2016, it’s reached its ninth revision. More than 87% of web pages use UTF-8. The strength of Unicode is also where it falls short, though. The possibility of storing more than 1 million characters means that no font can actually support all of them. In the words of Quivira’s creator: “Quivira will never provide every character defined in the Unicode standard. This would be technically impossible, because a font is limited to 65,536 characters, while Unicode already defines more than 100,000.”

Solving common problems related to character encodings

Garbled text

Garbled text in Notepad++.

Sometimes you may get garbled text when opening a file, typically because your text editor is interpreting it with the wrong encoding. To solve this, make sure the editor is using the correct one. In Notepad++, open the Encoding menu and select Encode in UTF-8. If this still doesn’t solve the issue, your file may be using an ANSI encoding with a different character set: open the Encoding menu again and browse the Character sets submenu until you find the correct option.
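If the damage has already propagated (for example, a UTF-8 file was read and re-saved as Windows-1252), the mix-up can often be reversed programmatically. A hedged Python sketch, assuming that specific pair of encodings; it only works when every byte of the original happens to be defined in the wrong codec:

    # A UTF-8 byte sequence read as Windows-1252 turns "è" into "Ã¨" (mojibake).
    correct = "caffè"
    garbled = correct.encode("utf-8").decode("cp1252")   # what the editor showed
    print(garbled)                                       # caffÃ¨
    # Undo the mistake: recover the original bytes, then decode them as UTF-8.
    repaired = garbled.encode("cp1252").decode("utf-8")
    print(repaired)                                      # caffè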

Missing characters

Encodings define how to interpret binary data, but they don’t tell the text editor what those characters actually look like. For that, text editors (and other programs) use fonts. Word processors are generally smart enough to fall back on a different font when they can’t display a character. However, if some characters are missing or replaced with a rectangle, try using a different font. Two of the most complete fonts available are Unifont and Quivira.
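If you’d rather check programmatically whether a font has a glyph for a given character, the fontTools library can read the font’s character map. A rough sketch under that assumption; the font path below is just a placeholder for any .ttf or .otf file you actually have:

    from fontTools.ttLib import TTFont

    def font_covers(font_path, text):
        """Return, for each character, whether the font has a glyph for it."""
        cmap = TTFont(font_path)["cmap"].getBestCmap()  # code point -> glyph name
        return {ch: ord(ch) in cmap for ch in text}

    # Placeholder path -- point it at a font installed on your system.
    print(font_covers("Quivira.ttf", "Aé€😀"))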

Did I miss something?

Let me know in the comments if I missed something in this article. I’ll do my best to answer your questions.
