Wednesday, January 18, 2012

Perl and Unicode in Brief

Perl has a separate knob for every I/O channel, and expects you to set each one correctly yourself.  By default, they're all off (Unicode-unaware) for backwards compatibility.
  1. If you want to handle Unicode and avoid The Unicode Bug, in which your strings sometimes act like they aren't actually Unicode: in perl 5.12+, use feature 'unicode_strings';.  For older perl, see Unicode::Semantics, or use utf8::upgrade by hand.  The latter two achieve their task by forcing "the UTF-8 flag" on for the string; the feature instead applies Unicode semantics to string operations within its lexical scope, whatever the flag says.
  2. If you want non-ASCII strings in your source text: save it as a UTF-8 encoded file and use utf8;.  Or you can write Unicode code points with hex escapes, \xae → ®, or \x{30ab} → カ.  There are technically other options, which have additional drawbacks (UTF-16 breaks the #! line; latin-1 is restricted to latin-1 unless you decode it yourself).
  3. If you want to print to a UTF-8 aware environment like your terminal emulator or CGI STDOUT after issuing a Content-Type: text/html; charset=utf-8 header: setting UTF-8 on the filehandle with binmode(STDOUT, ':utf8') is the minimum, but :encoding(utf-8) instead of :utf8 makes stricter guarantees that real code points are coming out.
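  4. If you want to read a UTF-16 encoded document into a Unicode string with minimal fuss: open(FH, '< :encoding(utf-16)', $name).  Note that the document has to be correctly encoded.  You can use the Encode module's decode function if you need finer control over error behavior, but that's naturally more fuss: use Encode; open(FH, '<:raw', $name); $bytes = do { local $/; <FH> }; $text = decode('utf-16', $bytes, $POLICY); ...  Note that decode takes the encoding name first and the bytes second, and that the file is read whole: splitting raw UTF-16 bytes on "\n" cuts characters in half and leaves every line after the first without a BOM.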
  5. If you want to convert a Unicode string to a specific set of bytes for some encoding-unaware module to throw on the wire, use the encode function from the Encode module: use Encode; $message->attr('content-type.charset', 'utf-16'); $message->data(encode("UTF-16", $body));  (This example would be for MIME::Lite, if you're curious.)
  6. If you want to read a file encoded with charset X, into a string encoded with charset Y, I've found no instant way to do this.  It's probably best to pass the input-encoding along as the output-encoding if at all possible.  But you might find the Encode module's from_to(), or string-IO as in IO::File->new(\$out, '>:'), or maybe a whole PerlIO filter as in PerlIO::code helpful if you can't.
  7. If you see "Wide character in ..." warnings, then you passed a string with code points >=0x100 to something that expected a byte string of some sort: either really latin-1, or an encoded string.
  8. If you see longer strings of gibberish where you expected sensible non-ASCII characters, then you have probably double-encoded, either literally, or by printing an encoded string to a filehandle which does encoding.
  9. If you see the Unicode replacement character in a stream that should be UTF-8, you haven't encoded at all, such as printing a byte string on a raw filehandle in an environment expecting UTF-8.  Most likely, the filehandle should have an encoding set on it, per point #3 above, though that may cause #8 on other strings you've printed.
  10. If you are using modules, they each may or may not deal with Unicode.  DBD::mysql has the mysql_enable_utf8 option; Email::MIME accepts encoded strings via body, and decoded ones through body_str, but for the latter, you must also set the charset and encoding attributes (which correspond to the charset of Content-Type, and the Content-Transfer-Encoding, respectively.)  MIME::Lite does not handle decoded strings at all and hopes for the best.
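The decode-on-the-way-in, encode-on-the-way-out habit behind points 3 through 5, and the double-encoding symptom from point 8, can be sketched roughly as follows.  This is only an illustration: the variable names and the sample word are made up for the example, and an in-memory filehandle stands in for STDOUT or a real file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a UTF-8 file or socket: "é" is 0xC3 0xA9.
my $bytes = "caf\xc3\xa9";

# Decode once on the way in.  FB_CROAK dies on malformed input;
# LEAVE_SRC keeps decode from modifying $bytes in place.
my $text = decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Encode once on the way out, by hand (point 5) ...
my $out = encode('UTF-8', $text);

# ... or by letting a filehandle layer do it (point 3).
# The in-memory handle is a stand-in for STDOUT or a real file.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "open: $!";
print {$fh} $text;
close $fh;

# Encoding already-encoded bytes is the double-encoding of point 8:
# 0xC3 and 0xA9 are each treated as a character and become two bytes apiece.
my $twice = encode('UTF-8', $out);

printf "%d bytes in, %d characters, %d bytes out, %d bytes double-encoded\n",
    length($bytes), length($text), length($out), length($twice);
```

This prints "5 bytes in, 4 characters, 5 bytes out, 7 bytes double-encoded".  Printing $out rather than $text to the layered filehandle would produce the same seven-byte gibberish as $twice.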
The most difficult thing for me to come to terms with was that Perl doesn't have any notion of "the string's encoding", despite being Unicode-aware and having the UTF-8 flag.  A string is always a series of character code points; if it's an "encoded string" or a "byte string" then it's a series of character code points with values <= 0xFF.  The UTF-8 flag is almost irrelevant, except where it leaks out into point #1, because Unicode-unaware scripts are given the illusion that Perl is still single-byte.  Unicode-aware scripts get stuck dealing with all the usual Unicode issues, and also have to avoid falling into Unicode-unaware mode by accident.
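A small sketch of that leak, assuming a perl of 5.12 or later and no locale pragma in effect.  The variable names are made up for the example.  The two strings below hold the same single code point and compare equal; only the internal flag differs, yet uc treats them differently unless unicode_strings is in scope.

```perl
use strict;
use warnings;

my $plain    = "\xe9";       # one character, U+00E9, UTF-8 flag off
my $upgraded = "\xe9";
utf8::upgrade($upgraded);    # same character, UTF-8 flag now on

# The flag is not part of the string's value: the two compare equal.
my $same = ($plain eq $upgraded) ? "equal" : "different";

# Outside unicode_strings, uc leaves the unflagged copy alone ...
my $uc_plain    = uc $plain;
# ... but maps the flagged copy to U+00C9.
my $uc_upgraded = uc $upgraded;

my $uc_scoped;
{
    # Inside this block, Unicode rules apply whatever the flag says.
    use feature 'unicode_strings';
    $uc_scoped = uc $plain;
}

printf "%s; uc gives %02X, %02X, %02X\n",
    $same, ord($uc_plain), ord($uc_upgraded), ord($uc_scoped);
```

This prints "equal; uc gives E9, C9, C9".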
