Lately, I have been doing a lot of work with UTF-8 character conversions. I spent a good deal of time scratching my head trying to figure out how to convert database tables that were stored in a latin1 charset, but contained UTF-8 encoded data, over to a UTF-8 charset. I ended up accomplishing this by following the method described in Converting Database Character Sets « WordPress Codex. Essentially, you convert all text-containing fields to their binary (BLOB) counterpart data type (which has no charset), then convert them back to their normal data type along with the desired charset. There is some nuance to this method if the charset of the table doesn’t match the charset of the data being saved to it. I won’t go into that here; just know that data loss can occur during those conversions if you are not careful. The article above describes how to avoid that issue.
While searching for that solution, I really got into the nuts and bolts of how UTF-8 encoding works. The purpose of this article is to illustrate how UTF-8 encoding works by breaking down a Unicode character. I’ll first share some notes on UTF-8 characters that will help us understand the relationship between single-byte and multi-byte characters. Then I’ll talk about how UTF-8 encoding uses the first byte to determine whether a character is an ASCII character and how many bytes the character is made up of. After providing a formula to convert bits to a Unicode character number, I will finish with an example, breaking down Unicode character 226 (â).
The relationship between single-byte characters (ASCII) and multi-byte UTF-8 characters
- Multi-byte UTF-8 characters can be boiled down to a grouping of single-byte characters: UTF-8 Encoded â represented in single byte characters: Ã¢
- UTF-8 is a variable-length encoding. This means that only as many bytes as are needed will be used when encoding a Unicode character using UTF-8. Unicode characters 0-127 are represented using one byte (ASCII), characters 128-2,047 are represented using two bytes, characters 2,048-65,535 are represented using three bytes, and characters 65,536-1,114,111 are represented using four bytes. Hence, UTF-8 can encode all 1,112,064 valid Unicode code points (the range above minus the 2,048 surrogate values reserved by UTF-16). Currently, there are only 110,187 characters defined in [Unicode 7.0](http://www.unicode.org/charts//PDF/Unicode-7.0/U70-1F300.pdf), so we have some room to grow! Interesting note: there are two camps on whether UTF-8 characters have a 4-byte or a 6-byte maximum: https://stijndewitt.com/2014/08/09/max-bytes-in-a-utf-8-char/. I assume a 4-byte maximum for this article.
- An 8-bit byte can store a number up to 255, but ASCII only defines characters 0-127 (7 bits). Characters 0-127 have the same encoding under ASCII and UTF-8, thus making UTF-8 a superset of ASCII. This will come into play when we talk about the bits of multi-byte characters in UTF-8.
- “UTF-8 treats numbers 0-127 as ASCII, 192-247 as Shift keys, and 128-192 as the key to be shifted. For instance, characters 208 and 209 shift you into the Cyrillic range. 208 followed by 175 is character 1071, the Cyrillic Я. Characters 224-239 are like a double shift. 226 followed by 190 and then 128 is character 12160: ⾀. 240 and over is a triple shift.” – https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/
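The notes above can be made concrete with a short Python 3 sketch (Python 3 `str` values are Unicode, so `encode`/`decode` expose the raw bytes):

```python
# Encode the single character "â" (Unicode 226) as UTF-8.
utf8_bytes = "â".encode("utf-8")
print(list(utf8_bytes))             # [195, 162] - two bytes

# Interpreting those same two bytes as latin1 yields two
# single-byte characters instead of one multi-byte character.
print(utf8_bytes.decode("latin1"))  # Ã¢

# UTF-8 is variable length: one byte for ASCII, more as needed.
print(len("a".encode("utf-8")))     # 1 byte  (ASCII range)
print(len("â".encode("utf-8")))     # 2 bytes (character 226)
print(len("⾀".encode("utf-8")))    # 3 bytes (character 12160)
```

This is exactly the latin1/UTF-8 mix-up from the database story above: the bytes 195 and 162 are one character under UTF-8 but two characters (Ã¢) under latin1.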
The First Byte Holds The Key
UTF-8 uses the first byte (8 bits) of any Unicode character to indicate if it is an ASCII character and also to indicate how many bytes the character is comprised of.
ASCII Character Example: The highest bit of the first byte indicates whether the character is an ASCII character. (0xxxxxxx = ASCII character – 1xxxxxxx = non-ASCII character).
Multi-Byte Character Example: Like the ASCII character example, the highest bit of the first byte indicates if the character is an ASCII character. Additionally, by looking at the number of non-zero high-order bits of the first byte in the sequence, you can immediately tell how long the sequence is. If the highest order bit is set to 0, we are dealing with an ASCII character. Otherwise, the number of non-zero high-order bits is equal to the total number of bytes in the sequence (e.g., 1110xxxx = 3 byte character).
- 110xxxxx -> The highest bit tells us that this is not an ASCII character, and the second-highest bit (also set to 1) tells us that one more byte belongs to the encoding of this Unicode character: a 2-byte character.
- 1110xxxx -> The three highest bits being set tells us that this is a 3-byte character.
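The first-byte rules above can be sketched as a small Python function (`sequence_length` is a hypothetical helper written for this article, not a standard library call):

```python
def sequence_length(first_byte: int) -> int:
    """Return the length of a UTF-8 sequence, judged from its first byte.

    A leading 0 bit means a single-byte ASCII character; otherwise the
    count of leading 1 bits equals the total number of bytes.
    """
    if first_byte < 0b10000000:    # 0xxxxxxx -> ASCII, 1 byte
        return 1
    if first_byte >= 0b11110000:   # 11110xxx -> 4-byte character
        return 4
    if first_byte >= 0b11100000:   # 1110xxxx -> 3-byte character
        return 3
    if first_byte >= 0b11000000:   # 110xxxxx -> 2-byte character
        return 2
    raise ValueError("10xxxxxx is a continuation byte, not a first byte")

print(sequence_length(0b01100001))  # 1 (ASCII 'a')
print(sequence_length(0b11000011))  # 2 (first byte of 'â')
print(sequence_length(0b11100010))  # 3 (first byte of '⾀')
```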
The table below shows the bit sequence for 1, 2, 3, and 4 byte characters:
|Bits used to represent a Unicode character|Bytes|Byte 1|Byte 2|Byte 3|Byte 4|
|---|---|---|---|---|---|
|7|1|0xxxxxxx| | | |
|11|2|110xxxxx|10xxxxxx| | |
|16|3|1110xxxx|10xxxxxx|10xxxxxx| |
|21|4|11110xxx|10xxxxxx|10xxxxxx|10xxxxxx|
Converting multi-byte characters to a Unicode number
If U represents a Unicode character number and C1, C2, C3 and C4 represent bytes in a UTF-8 byte sequence (in order), then a Unicode character number U can be calculated as follows:
If a sequence has one byte, then
U = C1
Else if a sequence has two bytes, then
U = (C1 – 192) * 64 + C2 – 128
Else if a sequence has three bytes, then
U = (C1 – 224) * 4,096 + (C2 – 128) * 64 + C3 – 128
Else if a sequence has four bytes, then
U = (C1 – 240) * 262,144 + (C2 – 128) * 4,096 + (C3 – 128) * 64 + C4 – 128
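Here is a minimal Python sketch of the formulas above (`unicode_number` is a hypothetical helper name chosen for this article):

```python
def unicode_number(seq):
    """Convert a UTF-8 byte sequence to its Unicode character number."""
    c = list(seq)
    if len(c) == 1:
        return c[0]
    if len(c) == 2:
        return (c[0] - 192) * 64 + c[1] - 128
    if len(c) == 3:
        return (c[0] - 224) * 4096 + (c[1] - 128) * 64 + c[2] - 128
    if len(c) == 4:
        return ((c[0] - 240) * 262144 + (c[1] - 128) * 4096
                + (c[2] - 128) * 64 + c[3] - 128)
    raise ValueError("UTF-8 sequences are 1-4 bytes long")

print(unicode_number([195, 162]))       # 226   -> â
print(unicode_number([208, 175]))       # 1071  -> Cyrillic Я
print(unicode_number([226, 190, 128]))  # 12160 -> ⾀
```

Note that the examples reproduce the "shift key" quote from earlier: 208 followed by 175 lands on Я, and 226, 190, 128 lands on ⾀.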
Let’s breâk down â châracter
Unicode number 226 = â (http://www.codetable.net/decimal/226)
UTF-8 Encoded â represented in single byte characters: Ã¢
Single-byte characters Ã¢ as bits = 1100001110100010 and as bytes = 11000011 10100010 (195, 162)
First byte: 11000011 or 195
– We know that this is NOT an ASCII character since the highest bit is set to 1
– We know that this is a two byte character since the highest two bits are set to 1
Calculation to a Unicode number using the formula above:
C1 (first byte) = 195
C2 (second byte) = 162
U = ((C1 – 192) * 64) + C2 – 128
U = ((195 – 192) * 64) + 162 – 128
U = (3 * 64) + 34
U = 226
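As a sanity check, the same arithmetic can be run in Python 3 and cross-checked against the built-in UTF-8 decoder:

```python
c1, c2 = 195, 162                 # the two bytes of our character

# The two-byte formula from above.
u = (c1 - 192) * 64 + c2 - 128
print(u)                          # 226
print(chr(u))                     # â

# Cross-check against Python's own UTF-8 decoder.
print(bytes([c1, c2]).decode("utf-8"))  # â
```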