Tuesday, October 18, 2011

Character Sets, Encodings, MySQL, and your data

This post is a companion to Perl and Unicode in Brief, an attempt to cover similar ground more concisely.  And this post is a revised version of the one you're currently reading.

I'm currently moving data from a (relatively old now) MySQL 5.0 server into Amazon RDS.  I've been here before, when I was moving data from MySQL 4.x into 5.0 and mangling character sets.  This time, I want to make 100% sure everything comes across with maximum fidelity, and also get the character encoding as stored to be labeled correctly in MySQL.

First, a quick definition or two:
  • Character Set: a specific table to translate between characters and numbers.  Example: ASCII defines characters for numbers 0-127; "A" is 65.  This can also be described as "a set of characters, and their corresponding representation inside the computer."
  • Character Encoding: a means of "packing" numbers from the character set into a container.  Example: UTF-8.  The Unicode character 0x2013 becomes 0xE2,80,99. The "E" signifies "Part 1 of 3", and part of the remaining bytes simply indicate "Continued"; the 0x2013 is then divided up to fit in the parts of the bytes that aren't indicating their "Part 1" or "Continued" status.  In the specific case of UTF-8, the encoding is designed so that the ASCII range 0-127 (0x00-7F) is encoded without change: a leading 0-7 means "Part 1 of 1".
  • 8-bit character encoding: In older, simpler days, character sets defined only as many characters as could fit in 8 bits, and defined the encoding as simply the numbers.  Character number 181 would encode as a byte (8 bits) with value 181.
  • A character encoding implies the associated character set, because the encoding defines how numbers in its character set become individual bytes.  How characters in other sets would be encoded is left undefined and basically impossible.
This last point is why MySQL lets you set "character sets" to UTF-8, though the latter is an encoding.


Now, I'm staring at a database where the actual bits that make up the data are encoded in UTF-8, but MySQL thinks they are in Latin-1.  Whenever anything connects to the database, the character set is always Latin-1, so MySQL does no translation.  Nor does the code serving our web page, preferring instead to send it as-is to the user.  In the end, the HTML page is labeled as UTF-8, so the untranslated UTF-8 bytes display correctly in the browser.

The problem is, every database backup that I have taken might be secretly broken: mysqldump is Unicode-aware, and left to its own devices (no --default-character-set given), it sets the connection into UTF-8 mode and reads out the data.  This means the bits that end up in the file are double-encoded: when coming across a left-curly-quote, the server says, "Hmm, the client is asking for UTF-8 and I have Latin-1.  I'll need to translate these 0xE2,80,9C bytes here into UTF-8 and send it that way."

What comes out on disk?  I checked a hex dump, and it's 0xC3,A2,E2,82,AC,C5,93.  Viewed as UTF-8, that byte stream is A-circumflex, Euro, and lowercase OE ligature.  This is clearly impossible, since Latin-1 predated the creation of the Euro, and it's Latin-15 that contains it; it's double impossible because the Euro character in Latin-15 is 0xA4 which has nothing to do with any of the bytes we were trying to encode.

However, in Windows-1252, the 0x80 byte is the Euro symbol, which can only mean: when MySQL says "latin1", it is actually interpreting the data as Windows-1252.  But it's not so bad, because the entire transformation is completely reversible (as long as latin1 continues to mean the same thing)... but what about the right-curly-quote, which ends in 0x9D, whose value is not defined in Windows-1252?  The "9D" value is just assumed to map to Unicode character 9D, so 0xC5,93 is replaced by 0xC2,9D.

What character is that, anyway?  It's OSC, the Operating System Command, part of the C1 control set and a control sequence introducer.  Headdesk!

I think this still works okay for me, because the exact reverse happens going back into Windows-1252.  At least, on the same server version and whatever Unicode libraries it's using.  Nobody tries to interpret the OSC anyway, but it still looks odd to find a "?" showing where there actually is a valid UTF-8 sequence in the source.

Now that I knew all this, I could get a database dump out in an unconverted format, by including --default-character-set=latin1 on the  mysqldump command line.  The only problem was, this would give me a file which was exactly like the database: it contained UTF-8 data labeled as Latin-1.  Loading it into a MySQL server would recreate the same situation as before.

My solution, found nowhere on the Internet, and which would probably make a lot of purists scream at me, was to take two dumps: one with --no-data that I piped through sed -e s/latin1/utf8/g to change the labeling, and one with --no-create-info --no-create-db in which I edited the SET NAMES latin1 by hand (to utf8, of course), to avoid editing the "latin1" in actual data when I was talking about the problem in the bug database.

Then I did some additional sleuthing and fixed the non-binary non-UTF-8 data which was lying around in the dump, and it all imported into RDS fine.

As a possible point of interest, it took about 10 minutes and 11 seconds to load the 351 MB data-only dump into a small DB instance.  (Work has a symmetrical 8 Mbps connection, so it wasn't necessarily network-bound.)  It did take a noticeable amount of time to load the 1 MB of schema, but unfortunately I didn't expect that to be difficult, so I don't have timing for it.

No comments: