I had previously attempted to write a short and simple intro to Unicode vs UTF, but it didn’t turn out too well (as in, it’s neither short nor simple lol)… So I thought I should try again!
1. Unicode is a giant mapping table
Simply put, Unicode is a giant mapping table that maps an integer value to a character.
|E.g. the number 65 -> “A”, while the number 19969 -> “丁”, and the number 12353 -> “ぁ”|
The integer value is also known as a “codepoint” in Unicode-speak.
Unicode is large enough to support almost all known languages, and each supported language is allocated specific codepoint ranges.
The full set of characters can be found on Unicode.org’s code charts.
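This mapping is easy to poke at in most languages; for instance, Python's built-in ord() and chr() convert between characters and codepoints (a quick sketch):

```python
# chr() maps a codepoint (integer) to its character; ord() is the inverse.
print(chr(65))       # → A
print(chr(19969))    # → 丁
print(chr(12353))    # → ぁ
print(ord("A"))      # → 65
```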
2. Unicode is NOT UTF – UTF defines how to represent a Unicode codepoint as a sequence of bytes
Unicode is NOT UTF! Unicode is NOT UTF! Unicode is NOT UTF!
This is a common misconception, and Unicode is often erroneously used synonymously with UTF.
While Unicode is a giant mapping table, UTF (Unicode Transformation Format) defines how to represent a Unicode codepoint as a sequence of bytes.
UTF provides several encoding options, but the more common ones are UTF-8, UTF-16 and UTF-32, with UTF-8 being the de-facto standard:
|UTF-8||Each character is encoded using 1 – 4 bytes.
This is the most efficient encoding form if the document contains mostly ASCII characters, as each ASCII character can be represented using only 1 byte. On the other hand, a Chinese character requires 3 bytes per character in UTF-8.
|UTF-16||Each character is encoded using 2 or 4 bytes.
This is generally more compact if the document contains mostly non-ASCII text in major languages such as Chinese and Arabic, as these characters can be represented using only 2 bytes each; in UTF-8, the same characters would require 3 bytes.
Comes in big-endian and little-endian forms.
|UTF-32||Each character is encoded using exactly 4 bytes.
The easiest encoding form to parse, as it represents the codepoints directly. However, it is obviously not space-efficient. Like UTF-16, it comes in big-endian and little-endian forms.
Note that the encoding algorithms are not straightforward – e.g. in UTF-8, the first few bits of the first byte indicate the number of bytes used to encode that particular character. More details can be found on Wikipedia.
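To make this concrete, here is a minimal Python sketch: encoding a Chinese character in UTF-8 yields 3 bytes, and the leading bits of the first byte announce the length of the sequence:

```python
encoded = "丁".encode("utf-8")   # 丁 is codepoint U+4E01
print(encoded)                   # b'\xe4\xb8\x81' – 3 bytes
# 0xE4 = 0b11100100: the leading '1110' marks a 3-byte sequence,
# and each continuation byte starts with '10'.
print(f"{encoded[0]:08b}")       # 11100100
```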
3. Some UTF-8/16 Examples
Some UTF-8/16 encoding examples. Note that the pure ASCII string (Hello World!) uses fewer bytes in UTF-8 than in UTF-16, while the reverse is true for the Japanese text.
|Original text||Encodings (Hex)|
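The same comparison can be reproduced in Python. The snippet below uses "Hello World!" and こんにちは (a stand-in Japanese string chosen for illustration, not necessarily the original example text):

```python
ascii_text = "Hello World!"
japanese_text = "こんにちは"   # illustrative stand-in for the Japanese example

for text in (ascii_text, japanese_text):
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # 'le' variant avoids the 2-byte BOM prefix
    print(f"{text!r}: UTF-8 = {len(utf8)} bytes, UTF-16 = {len(utf16)} bytes")
# → 'Hello World!': UTF-8 = 12 bytes, UTF-16 = 24 bytes
# → 'こんにちは': UTF-8 = 15 bytes, UTF-16 = 10 bytes
```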
4. Big-endian or Little-endian? Check the BOM (Byte-Order-Mark), aka FE FF
As mentioned earlier, UTF-16 and UTF-32 allow both big-endian and little-endian forms. So how can you tell whether a document is encoded in big- or little-endian? It turns out that UTF-encoded documents usually begin with what is known as a BOM (Byte-Order-Mark), which indicates the endianness. The codepoint of the BOM is U+FEFF: if a document begins with the byte 0xFE followed by 0xFF, it is big-endian; if it begins with 0xFF followed by 0xFE, it is little-endian.
Note that the BOM itself is encoded using the respective UTF-X scheme, i.e. it takes 2 bytes in UTF-16 and 4 bytes in UTF-32.
How about UTF-8? Strictly speaking, endianness does not apply to UTF-8; however, by convention a BOM is often included simply to indicate that the document is UTF-8. A Unicode-capable program would ignore it.
The BOMs for UTF-8/16/32:
|UTF-8||EF BB BF|
|UTF-16 Big-endian||FE FF|
|UTF-16 Little-endian||FF FE|
|UTF-32 Big-endian||00 00 FE FF|
|UTF-32 Little-endian||FF FE 00 00|
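A BOM-sniffing routine based on the table above could look like this Python sketch (the standard codecs module exposes these same BOM byte sequences as constants):

```python
import codecs

# Ordered longest-first, so the UTF-32 LE BOM (FF FE 00 00) is checked
# before the UTF-16 LE BOM (FF FE), which is its prefix.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8,     "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_encoding(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_encoding(b"\xfe\xff\x00A"))      # → utf-16-be
print(sniff_encoding(b"\xef\xbb\xbfHello"))  # → utf-8
```

Note that a missing BOM does not mean the document isn't UTF-encoded – the BOM is conventional, not required.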
Q: Does Unicode mean that no other character sets are in use now?
A: No, other character sets such as GB2312 for Chinese and TIS-620 for Thai are still in use. Unicode is a recommended standard, not a mandatory one.
Q: Why is UTF-8 the de-facto standard over UTF-16?