Unicode vs UTF


I had previously attempted to write a short and simple intro to Unicode vs UTF, but it didn’t turn out too well (as in, it’s neither short nor simple ;) lol)… So I thought I should try again!

1. Unicode is a giant mapping table

Simply put, Unicode is a giant mapping table that maps an integer value to a character.

E.g. the number 65 -> “A”, while the number 19969 -> “丁”, and the number 12353 -> “ぁ”

The integer value is also known as a “codepoint” in Unicode-speak.
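
To make the mapping concrete, here is a small sketch in Python (my choice of language for illustration; nothing here is Python-specific). It uses the built-in ord() and chr() functions to convert between a character and its codepoint, using the same three characters as above.

```python
# Codepoint <-> character, using Python's built-in ord() and chr().
examples = ["A", "丁", "ぁ"]

for ch in examples:
    codepoint = ord(ch)            # character -> integer codepoint
    assert chr(codepoint) == ch    # integer codepoint -> same character
    # Codepoints are conventionally written in hex, as U+XXXX.
    print(f"{ch!r}: decimal {codepoint}, U+{codepoint:04X}")
```

The hex form (U+0041, U+4E01, U+3041) is the one you will see in the Unicode code charts.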

Unicode is large enough to support almost all known writing systems, and each supported script is allocated specific codepoint ranges.
The full set of characters can be found in Unicode.org’s code charts.

[Figure: a screenshot of the Unicode CJK code chart]

2. Unicode is NOT UTF – UTF defines how to represent a Unicode codepoint as a sequence of bytes

Unicode is NOT UTF! Unicode is NOT UTF! Unicode is NOT UTF!

This is a common misconception, and “Unicode” is often erroneously used as a synonym for UTF.
While Unicode is a giant mapping table, a UTF (Unicode Transformation Format) defines how to represent a Unicode codepoint as a sequence of bytes.

UTF provides several encoding options, but the more common ones are UTF-8, UTF-16 and UTF-32, with UTF-8 being the de-facto standard:

Encoding: Description/Properties

UTF-8: Each character is encoded using 1 to 4 bytes.

This is the most space-efficient encoding if the document contains mostly ASCII characters, as each ASCII character is represented using only 1 byte. On the other hand, a Chinese character typically requires 3 bytes in UTF-8.

UTF-16: Each character is encoded using 2 or 4 bytes.

This is more compact than UTF-8 if the document contains mostly East Asian text such as Chinese or Japanese, as most of these characters take only 2 bytes in UTF-16, whereas they would require 3 bytes in UTF-8. Characters outside the Basic Multilingual Plane, such as emoji, take 4 bytes.

Comes in big-endian and little-endian forms.

UTF-32: Each character is encoded using exactly 4 bytes.

The easiest encoding to parse, as each codepoint is stored directly as a 32-bit integer. However, it is not a space-efficient method. Like UTF-16, it comes in big-endian and little-endian forms.
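
The byte counts above are easy to check. The sketch below is only an illustration: the sample characters and the helper name encoded_sizes are my own picks, and it relies on Python's standard "utf-8", "utf-16-be" and "utf-32-be" codecs (the "-be" variants are used so that no BOM is added to the output).

```python
# Number of bytes each encoding needs for a single character.
samples = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter",
    "\u4e01": "Chinese character",
    "\U0001F600": "emoji, outside the Basic Multilingual Plane",
}

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for one character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )

for ch, label in samples.items():
    u8, u16, u32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} ({label}): UTF-8={u8}, UTF-16={u16}, UTF-32={u32}")
```

The ASCII letter comes out as 1/2/4 bytes, the Chinese character as 3/2/4, and the emoji as 4/4/4, which matches the descriptions above.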

Note that the encoding algorithms are not entirely straightforward; for example, in UTF-8 the leading bits of the first byte indicate how many bytes are used to encode that particular character. More details can be found on Wikipedia.
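
For the curious, here is a rough hand-rolled sketch of the UTF-8 bit layout in Python. The function names utf8_encode and utf8_length_from_first_byte are made up for this post; in real code you would simply call str.encode("utf-8"), which the loop at the end uses as a cross-check.

```python
def utf8_encode(codepoint):
    """Encode one codepoint to UTF-8 by hand (illustrative sketch only)."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if codepoint < 0x80:            # 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])

def utf8_length_from_first_byte(first):
    """Read the sequence length from the leading bits of the first byte."""
    if first < 0x80:
        return 1                    # 0xxxxxxx
    if first >> 5 == 0b110:
        return 2                    # 110xxxxx
    if first >> 4 == 0b1110:
        return 3                    # 1110xxxx
    if first >> 3 == 0b11110:
        return 4                    # 11110xxx
    raise ValueError("continuation byte or invalid lead byte")

for ch in "A丁":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # cross-check against the built-in codec
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(ch, hex_bytes, utf8_length_from_first_byte(encoded[0]))
```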

3. Some UTF-8/16 Examples

Some UTF-8/16 encoding examples. Note that the pure ASCII string (Hello World!) uses fewer bytes in UTF-8 than in UTF-16 (12 vs 24), while the reverse is true for the Japanese text (24 vs 16).

Original text: Hello World!
    UTF-8                 48 65 6c 6c 6f 20 57 6f 72 6c 64 21
    UTF-16 big-endian     00 48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00 6f 00 72 00 6c 00 64 00 21
    UTF-16 little-endian  48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00 6f 00 72 00 6c 00 64 00 21 00

Original text: 私は海賊王になる
    UTF-8                 e7 a7 81 e3 81 af e6 b5 b7 e8 b3 8a e7 8e 8b e3 81 ab e3 81 aa e3 82 8b
    UTF-16 big-endian     79 c1 30 6f 6d 77 8c ca 73 8b 30 6b 30 6a 30 8b
    UTF-16 little-endian  c1 79 6f 30 77 6d ca 8c 8b 73 6b 30 6a 30 8b 30
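
If you want to reproduce these hex dumps yourself, something like the following Python sketch should do. The helper name hex_dump is my own; the codec names are Python's standard ones.

```python
def hex_dump(text, encoding):
    """Encode text and return the bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))

codecs_to_show = [
    ("UTF-8", "utf-8"),
    ("UTF-16 big-endian", "utf-16-be"),
    ("UTF-16 little-endian", "utf-16-le"),
]

for text in ["Hello World!", "私は海賊王になる"]:
    print(text)
    for label, codec in codecs_to_show:
        size = len(text.encode(codec))
        print(f"  {label} ({size} bytes): {hex_dump(text, codec)}")
```

I use the explicit "-be" and "-le" codecs here because Python's plain "utf-16" codec prepends a byte-order mark to the output.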

4. Big-endian or Little-endian? Check the BOM (Byte-Order-Mark), aka FE FF

As mentioned earlier, UTF-16 and UTF-32 allow both big-endian and little-endian forms. So how can you tell which one a document uses? It turns out that UTF-encoded documents often begin with what is known as a BOM (Byte-Order-Mark) that indicates the endianness. The codepoint for the BOM is 0xFEFF. If a UTF-16 document begins with the bytes 0xFE then 0xFF, it is big-endian; if it begins with 0xFF then 0xFE, it is little-endian.
Note that the BOM is itself encoded using the respective UTF-X scheme, i.e. it takes 2 bytes in UTF-16 and 4 bytes in UTF-32.

How about UTF-8? Strictly speaking, endianness is not applicable to UTF-8, since it is defined as a sequence of single bytes. However, some programs (notably on Windows) still write a BOM at the start of the file to signal that it is a UTF-8 document. A Unicode-aware program should skip the BOM rather than treat it as part of the text.
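
As a small illustration of skipping a UTF-8 BOM, here is how it looks in Python. This assumes Python's standard codecs: plain "utf-8" keeps the BOM as a leading U+FEFF character, while "utf-8-sig" removes it if it is there.

```python
import codecs

text = "Hello"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")   # EF BB BF + payload

# Plain "utf-8" decodes the BOM as an ordinary leading U+FEFF character...
kept = with_bom.decode("utf-8")
# ...while "utf-8-sig" strips it when present (and is harmless when absent).
stripped = with_bom.decode("utf-8-sig")

print(kept.startswith("\ufeff"))   # True
print(stripped == text)            # True
```

That stray U+FEFF is invisible in most editors, which is why a leftover BOM can be such a confusing bug.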

The BOMs for UTF-8/16/32:

Encoding               BOM (hex)
UTF-8                  EF BB BF
UTF-16 big-endian      FE FF
UTF-16 little-endian   FF FE
UTF-32 big-endian      00 00 FE FF
UTF-32 little-endian   FF FE 00 00
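
Putting the table to work, here is a minimal BOM sniffer in Python. The names sniff_bom and decode_with_bom are hypothetical helpers I made up for this post, and falling back to UTF-8 when there is no BOM is just my assumption; the BOM constants come from Python's standard codecs module.

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32 little-endian BOM (FF FE 00 00)
# starts with the UTF-16 little-endian BOM (FF FE), so order matters.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0

def decode_with_bom(data, default="utf-8"):
    """Decode bytes using the BOM if present, else the assumed default."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)

sample = codecs.BOM_UTF16_LE + "海賊王".encode("utf-16-le")
print(sniff_bom(sample))         # ('utf-16-le', 2)
print(decode_with_bom(sample))   # 海賊王
```

One caveat: a UTF-16 little-endian document whose first character after the BOM is U+0000 would look like UTF-32 little-endian to this sniffer. A BOM is a hint, not a guarantee.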

5. FAQ

Q: Does Unicode mean that no other character sets are in use now?
A: No. Other character sets, such as GB2312 for Chinese and TIS-620 for Thai, are still in use. Unicode is a recommended standard, not a mandatory one.
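
If you do run into one of these legacy character sets, converting to UTF-8 is usually a decode followed by an encode. A quick Python sketch, assuming you already know the source encoding (the transcode helper is my own name; "gb2312" and "tis-620" are codec names from Python's standard library):

```python
def transcode(data, source, target="utf-8"):
    """Convert bytes from one character encoding to another."""
    return data.decode(source).encode(target)

gb_bytes = "中".encode("gb2312")     # legacy Chinese encoding: 2 bytes
tis_bytes = "ก".encode("tis-620")    # legacy Thai encoding: 1 byte

# The same characters have completely different byte values in UTF-8.
for label, raw, source in [("GB2312", gb_bytes, "gb2312"),
                           ("TIS-620", tis_bytes, "tis-620")]:
    as_utf8 = transcode(raw, source)
    print(label, raw.hex(), "->", "UTF-8", as_utf8.hex())

# Decoding with the wrong character set fails (or, worse, silently garbles).
try:
    gb_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("GB2312 bytes are not valid UTF-8")
```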

Q: Why is UTF-8 the de-facto standard over UTF-16?
A: A large share of the content on the Internet is in English (ASCII), so it’s more efficient to encode these documents in UTF-8. Additionally, HTML tags, JavaScript code, etc. are written in ASCII as well, even when the page text itself is not. UTF-8 is also backward compatible with ASCII: a pure ASCII file is already a valid UTF-8 file.
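
Here is a tiny Python sketch of the markup effect. The HTML snippet is something I made up for illustration, so treat the numbers as an example rather than a measurement of real web pages.

```python
# A made-up page fragment: Japanese text wrapped in ASCII markup.
html = '<html><body><p class="quote">私は海賊王になる</p></body></html>'

utf8_size = len(html.encode("utf-8"))
utf16_size = len(html.encode("utf-16-le"))   # "-le" so that no BOM is counted

ascii_chars = sum(1 for ch in html if ord(ch) < 0x80)
print(f"{ascii_chars} of {len(html)} characters are ASCII markup")
print(f"UTF-8: {utf8_size} bytes, UTF-16: {utf16_size} bytes")
```

Even though the visible text is entirely Japanese, the surrounding tags tip the balance in favour of UTF-8 for this snippet.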

- Lem
