Tuesday, October 25, 2011

Character Sets: Get PHP, Perl, MySQL, and Unicode to Play Together

This post is a companion to Perl and Unicode in Brief, an attempt to cover the same ground more concisely.

This is an extended remix of my recent post on the subject, only less of a rambling story and more focused.  Again, I'll start with some background definitions.

I'll also assume that you're going to make everything UTF-8, because, as a US-centric American with the luxury of using English, UTF-8 is what makes the most sense for my systems.  However, if you understand everything written here, it should not be difficult to make everything UTF-16 or any other encoding you desire.



Terminology and History

First, character sets.  A character set defines a set of characters and the numbers used to represent them.  For instance, ASCII defines codes 0-127, including the Latin alphabet necessary for US English.  So, "A" is 65 and "a" is 97 in ASCII.  After this was defined, the 8-bit byte was standardized and computers went overseas.  ASCII wasn't suitable, so various locales took advantage of the 8th bit to add 128 more characters for their languages.  The most common of these in the Western world are the ISO-8859 family, which gave us latin-1, latin-7, and so forth.  These sorts of character sets are known as "8-bit" because they define 8 bits worth of space.

Meanwhile in Asia, character encodings like Big5 used multi-byte codes to cover scripts with far more than 256 symbols, and also allowed ASCII to be embedded for quoting English text.  However, Big5 did not fully specify what the embedded character set was.

Out of this mess, Unicode was born to define a single character set, once and for all.  Ideally, everything would be Unicode everywhere, and any user could see any text without having to figure out whether the document was (for example) CP1251 or KOI8-R.  The first encoding of Unicode, UCS-2, used 16 bits but otherwise worked like the older systems, in that the character codes (what Unicode calls code points) mapped directly to the output bytes.  UCS-2 proved to be too small, yet was considered bloated by US and Western European users with mostly-ASCII text, since it took twice the space for most of their characters.

UTF-8 solved this problem by dividing each byte into bits that describe the byte stream (such as "start of a 3-byte sequence" or "continuation byte") and bits of character data.  With a bit of cleverness I'll skip here, ASCII remains as-is, and higher characters are represented with multi-byte sequences.  Thus, the byte pair 0xCE 0xB1 is the specific UTF-8 encoding of the character U+03B1 (GREEK SMALL LETTER ALPHA).
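
As a quick sanity check of that byte layout, here's a minimal Perl sketch using the core Encode module (the hex-dump formatting is just for display):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # U+03B1 GREEK SMALL LETTER ALPHA as a Perl character string
    my $alpha = "\x{03b1}";

    # Encode it to UTF-8 bytes and print them in hex: "ce b1"
    my $bytes = encode('UTF-8', $alpha);
    print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes), "\n";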

In the US, we're lazy and still call UTF-8 a character set.  This is just a historical artifact of us not using Big5 or some other encoding that can have different character sets embedded in it, and it doesn't end up being too confusing since UTF-8 can contain only one character set, Unicode.

Symptoms of Character Encoding Problems

There are two obvious symptoms when trying to output UTF-8.  First is the double encoding problem, where you see things like "â€™" where there should be only one special character.  This typically indicates that UTF-8 encoded data was interpreted as if it were an 8-bit encoding, and converted to UTF-8.  Each UTF-8 byte has become a whole UTF-8 character.

The opposite symptom is a text littered with �, the "replacement character", instead of your special characters.  This usually means some 8-bit encoding is actually being interpreted as if it were UTF-8, and the bytes not following the rules of the latter encoding are replaced with the Dreaded Diamond Question Mark (or possibly Dreaded Box for some Windows machines).
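
Both symptoms are easy to reproduce with Perl's core Encode module.  Here's a minimal sketch (the specific characters are just examples; I use cp1252 as the stand-in 8-bit encoding):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(encode decode);
    binmode STDOUT, ':encoding(UTF-8)';

    # Symptom 1: UTF-8 bytes misread as an 8-bit encoding and re-encoded.
    # Each byte of the original character becomes a character of its own.
    my $quote = "\x{2019}";                   # RIGHT SINGLE QUOTATION MARK
    my $bytes = encode('UTF-8', $quote);      # the bytes 0xE2 0x80 0x99
    print decode('cp1252', $bytes), "\n";     # prints the three-character mess

    # Symptom 2: 8-bit bytes misread as UTF-8.  The invalid sequence is
    # replaced with U+FFFD, the replacement character.
    my $latin1_bytes = "\xe9";                # "e acute" encoded as Latin-1
    print decode('UTF-8', $latin1_bytes), "\n";   # prints the � character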

A more subtle symptom is when everything appears to work, but the labels are not entirely accurate: a Web page declares its character set to be UTF-8, yet the database is labeled as "latin1".  In this case, the database actually contains UTF-8 data, but the connection and column (or table or database/schema) are both labeled as latin1; no conversion happens, the raw bytes pass through untouched, and sending them in a UTF-8 page lets the client display them correctly.  But if the script that serves the Web page then tells the database to send results in UTF-8 mode, MySQL converts the already-UTF-8 bytes as if they were latin1, and suddenly the Web page appears double-encoded (our first symptom from above).
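
When you suspect this kind of mislabeling, it helps to ask MySQL what it thinks the labels are.  These are standard statements (the schema and table names below are placeholders):

    -- What the server and the current connection think the encodings are
    SHOW VARIABLES LIKE 'character_set%';

    -- The declared character set of a suspect table and its columns
    SHOW CREATE TABLE my_schema.my_table;

    -- Ask for results in UTF-8 on this connection; on a correctly labeled
    -- database this is harmless, but on a mislabeled one it triggers the
    -- double-encoding symptom described above
    SET NAMES utf8;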

Another symptom I have seen in the wild is that strings whose characters all fit in some particular 8-bit encoding (such as Latin-1) get littered with �, while strings that happen to contain any characters outside that range print as correct UTF-8, including the very characters that break in the other strings!  This one is Perl printing to a filehandle without a UTF-8 output encoding.  If you have warnings on, and check the appropriate error log, you may find "Wide character in print" warnings generated for the strings that seemed to print correctly.
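
That behavior is easy to reproduce.  A minimal sketch, run on a UTF-8 terminal with no output layer set on STDOUT:

    #!/usr/bin/perl
    use strict;
    use warnings;
    # Deliberately no binmode/:encoding layer on STDOUT.

    my $fits_latin1 = "na\x{ef}ve";        # U+00EF fits in Latin-1
    my $needs_more  = "\x{3b1}\x{3b2}";    # Greek alpha and beta do not

    # Written out as Latin-1 bytes; a UTF-8 terminal shows � instead.
    print $fits_latin1, "\n";

    # Written out as UTF-8 bytes with a "Wide character in print" warning,
    # yet it displays correctly on a UTF-8 terminal.
    print $needs_more, "\n";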

Fixing Things

As mentioned in my previous post, to fix the database mislabeling with MySQL, I could dump the data from MySQL and restore it.  I created two dumps: one with --no-data that I piped through sed -e s/latin1/utf8/g, and one with --no-create-info --no-create-db --default-character-set=latin1 in which I changed the SET NAMES line to utf8 in my favorite editor.  This gave me a database structure labeled UTF-8, which I then populated with UTF-8 data (as it was) also labeled as UTF-8 (because of the new SET NAMES).
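
Roughly, the dump-and-reload looked like this (a sketch; the database and file names are placeholders, and the sed is deliberately blunt, so eyeball the structure dump before loading it):

    # Structure only, with every "latin1" label rewritten to "utf8"
    mysqldump --no-data mydb | sed -e 's/latin1/utf8/g' > structure.sql

    # Data only, dumped without conversion; afterwards, edit the
    # "SET NAMES latin1" line near the top to say utf8 instead
    mysqldump --no-create-info --no-create-db \
        --default-character-set=latin1 mydb > data.sql

    # Reload, structure first, then data
    mysql mydb < structure.sql
    mysql mydb < data.sql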

I also took a moment to edit /etc/my.cnf, adding a [client] section with default-character-set=utf8, and I set the server's variables to utf8: character_set_X, where X is client, connection, database, results, and server.
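
The relevant pieces of /etc/my.cnf ended up looking roughly like this (a sketch; option names vary slightly between MySQL versions, so check your server's documentation):

    [client]
    default-character-set = utf8

    [mysqld]
    # Sets character_set_server (and, through it, the default for new
    # databases); the per-connection variables follow the client's
    # SET NAMES or default-character-set.
    character-set-server = utf8
    collation-server     = utf8_general_ci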

The PHP website was already issuing the appropriate header for UTF-8 output, so it displayed replacement characters until I changed the connection character set.  With PDO on PHP 5.3.6 and up, adding charset=utf8 to the DSN works; otherwise, I simply issued SET NAMES utf8 after connecting.  If I were using mysqli, I would call set_charset after connecting, so that the library itself would also know about the new encoding in use.
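
In code, those options look something like this (a sketch; the host, database name, and credentials are placeholders):

    <?php
    // PHP 5.3.6+: declare the connection character set in the DSN.
    $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8',
                   'user', 'password');

    // Older PHP/PDO: fall back to SET NAMES after connecting.
    $pdo->exec('SET NAMES utf8');

    // mysqli: use set_charset() so the client library itself knows the
    // connection encoding (its escaping functions depend on it).
    $mysqli = new mysqli('localhost', 'user', 'password', 'mydb');
    $mysqli->set_charset('utf8');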

Perl also displayed replacement characters after I added mysql_enable_utf8 => 1 to the DBI connection options.  To avoid breaking older programs, Perl considers everything to be in a "machine" encoding unless it's told otherwise.  I'm not entirely sure where that default comes from; it's certainly not the locale, because mine was UTF-8.  However, when faced with a UTF-8 string being printed on some filehandle that is not UTF-8, Perl tries to convert the string to the filehandle's encoding.  If it succeeds, it writes those 8-bit bytes out.  If it fails, it writes out the UTF-8 bytes and generates the "Wide character in print" warning.

The fix for Perl was to set the filehandle to UTF-8 output with binmode(STDOUT, ':encoding(UTF-8)'); before writing non-ASCII data to it.  If I had any non-ASCII characters in my source code, I would also need use utf8; to prevent Perl from double-encoding UTF-8 sequences in the program source (particularly, in string literals).  And if I were using Perl 5.14, I'd probably add use feature 'unicode_strings'; as well, as mentioned in The "Unicode Bug".
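
Putting the Perl pieces together (a sketch; the DSN, credentials, and the example query are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;                    # the source file itself is saved as UTF-8
    use DBI;

    # Decode text coming back from MySQL into Perl character strings.
    my $dbh = DBI->connect(
        'dbi:mysql:database=mydb;host=localhost',
        'user', 'password',
        { mysql_enable_utf8 => 1, RaiseError => 1 },
    );

    # Encode characters as UTF-8 on the way out.
    binmode STDOUT, ':encoding(UTF-8)';

    my ($name) = $dbh->selectrow_array('SELECT name FROM people WHERE id = 1');
    print "$name\n";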

I've Earned It, Now

I need to buy one of the I � Unicode items from this shop.  And if you want more perspective on this topic, Jeff Atwood collected a few links in his I � Unicode post.
