Breaking down Unicode characters encoded in UTF-8

Lately, I have been doing a lot of work with UTF-8 character conversions. I spent a good deal of time scratching my head over how to convert database tables that are declared with a latin1 charset, but actually hold UTF-8 encoded data, to a UTF-8 charset. I ended up accomplishing this by following the method described in the WordPress Codex article "Converting Database Character Sets". Essentially, you convert every text field to its binary (BLOB) counterpart data type, which has no charset, and then convert it back to its original data type with the desired charset. There is some nuance to this method if the charset of the table doesn't match the charset of the data being saved to the table. I won't go into that here, but just know that data loss can occur during those conversions if you are not careful. The article above describes how to get around that issue.
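
To see why the detour through a binary type matters, here is a minimal sketch in Python that imitates the two conversion paths on raw bytes. It is not the Codex procedure itself and it does not touch a database. The function names are made up for illustration, and Python's latin-1 codec is only a stand-in for MySQL's latin1 charset, which is not byte-for-byte identical.

```python
# Hypothetical helpers that mimic, on plain bytes, what happens to a value
# during each kind of column conversion. Python's "latin-1" codec is used
# as an approximation of the database's latin1 charset (an assumption).

def convert_directly(stored: bytes) -> bytes:
    """Imitate a direct latin1 -> utf8 conversion: the stored bytes are
    read as latin1 characters and each character is re-encoded as UTF-8."""
    return stored.decode("latin-1").encode("utf-8")


def convert_via_binary(stored: bytes) -> bytes:
    """Imitate the BLOB detour: a binary column has no charset, so the
    bytes are left untouched and only the column's label changes."""
    return bytes(stored)


original = "caf\u00e9"             # "cafe" with an acute accent on the e
stored = original.encode("utf-8")  # UTF-8 bytes sitting in a latin1 column

print(convert_directly(stored))    # b'caf\xc3\x83\xc2\xa9' (double-encoded)
print(convert_via_binary(stored))  # b'caf\xc3\xa9' (still valid UTF-8)
```

In the direct path the two bytes of the accented character are each treated as a separate latin1 character and encoded again, which is the familiar garbled-text result. In the binary path the bytes come out exactly as they went in, so they decode correctly once the column is labeled UTF-8.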
