Thursday, January 26, 2012

Plain Old Data

I’m coming to the conclusion that there’s actually no such thing as “plain data”; it always has some metadata attached.  If it doesn’t, it might be displayed incorrectly, and then a human needs to intervene to determine the correct metadata to apply to fix the problem.  (Example: View → Character Encoding in Firefox.)  Pushed to the extreme, even “just numbers” have metadata: they can be encoded as text, as a binary integer/float (IEEE 754 or otherwise) of some size/endianness, or in an ASN.1 encoding.
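To make the "just numbers" point concrete, here is a minimal sketch in Python that reads the same four bytes under several assumed encodings.  The function name and the sample bytes are made up for illustration; only the struct module and bytes.decode are standard.

```python
import struct

def interpretations(raw: bytes) -> dict:
    """Decode the same four bytes under different (assumed) metadata."""
    return {
        "uint32_big_endian": struct.unpack(">I", raw)[0],
        "uint32_little_endian": struct.unpack("<I", raw)[0],
        "float32_big_endian": struct.unpack(">f", raw)[0],
        "latin1_text": raw.decode("latin-1"),
    }

# Hypothetical sample: four bytes with no metadata attached.
raw = bytes([0x42, 0x28, 0x00, 0x00])
for label, value in interpretations(raw).items():
    print(f"{label:22} {value!r}")
```

Nothing in the bytes themselves says which of these readings is the right one; that knowledge has to arrive from somewhere else.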

Another conclusion I’m reaching is that HTTP conflates all kinds of metadata.  Coupled with the lack of self-contained metadata in file formats and filesystems, things start to accumulate hacks.

Wednesday, January 25, 2012

What If: Weak Memory Pages

Raymond Chen wrote about the "what if everybody did this?" problem of applications written to consume up to some threshold of memory and free some of it under pressure: if multiple applications have different thresholds that they're trying to maintain, then the one with the smallest-free threshold wins.  Of course, the extreme of this is a normal application that doesn't try to do anything fancy, which acts like it has a negative-infinity threshold.  If it never adjusts its allocations in response to free memory, then it always wins.
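The dynamic can be sketched with a toy simulation in Python.  This is an assumed model, not anything from the original article: memory comes in single units, the applications take turns, and each one allocates while free memory is above its threshold and releases while it is below.  All names and numbers are hypothetical.

```python
def simulate(total, thresholds, rounds=1000):
    """Toy model: each app tries to keep free memory at its own threshold."""
    held = {name: 0 for name in thresholds}
    for _ in range(rounds):
        for name, threshold in thresholds.items():
            free = total - sum(held.values())
            if free > max(threshold, 0):
                held[name] += 1      # plenty free by this app's standard: grow
            elif free < threshold and held[name] > 0:
                held[name] -= 1      # under pressure by this app's standard: shrink
    return held

# Two well-behaved apps with different thresholds.
print(simulate(100, {"polite": 30, "greedy": 10}))

# Add an app that never reacts to pressure (threshold of negative infinity).
print(simulate(100, {"polite": 30, "greedy": 10, "oblivious": float("-inf")}))
```

In this model the app with the lower threshold ends up holding everything above its own threshold, and the app that ignores pressure ends up holding everything.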

Some of the solutions batted around in the comment thread involve using mmap() or other tricks to try to get the OS to manage the cache, but this brings up its own problems.


Wednesday, January 18, 2012

Perl and Unicode in Brief

Perl requires a knob for every I/O, and expects you to set them all correctly yourself.  By default, they're all off (Unicode-unaware) for backwards compatibility.
  1. If you want to handle Unicode and avoid The Unicode Bug, in which your strings sometimes act like they aren't actually Unicode: in perl 5.12+, use feature 'unicode_strings';.  For older perl, see Unicode::Semantics, or use utf8::upgrade by hand.  The latter two achieve their task by forcing "the UTF-8 flag" on for the string.
  2. If you want strings in your source text with non-ASCII: save it as a utf-8 encoded file and use utf8;.  Or you can encode Unicode code points with hex-escapes, \xae → ®, or \x{30ab} → カ.  There are technically other options, which have additional drawbacks (utf-16 breaks the #! line; latin-1 is restricted to latin-1 unless you decode it yourself.)
  3. If you want to print to a UTF-8 aware environment like your terminal emulator or CGI STDOUT after issuing a Content-Type: text/html; charset=utf-8 header: setting UTF-8 on the filehandle with binmode(STDOUT, ':utf8') is the minimum, but :encoding(utf-8) instead of :utf8 makes stricter guarantees that real code points are coming out.
  4. If you want to read a UTF-16 encoded document into a Unicode string with minimal fuss: open(FH, '< :encoding(utf-16)', $name).  Note that the document has to be correctly encoded.  You can use the Encode module's decode function if you need finer control over error behavior, but that's naturally more fuss: use Encode; open(FH, '<:raw', $name); my $bytes = do { local $/; <FH> }; $text = decode('UTF-16', $bytes, $POLICY);  (The file is slurped whole because a UTF-16 newline is two bytes, so reading a raw handle line by line would split characters.)
  5. If you want to convert a Unicode string to a specific set of bytes for some encoding-unaware module to throw on the wire, use the encode function from the Encode module: use Encode; $message->attr('content-type.charset', 'utf-16'); $message->data(encode("UTF-16", $body));  (This example would be for MIME::Lite, if you're curious.)
  6. If you want to read a file encoded with charset X, into a string encoded with charset Y, I've found no instant way to do this.  It's probably best to pass the input-encoding along as the output-encoding if at all possible.  But you might find the Encode module's from_to(), or string-IO as in IO::File->new(\$out, '>:'), or maybe a whole PerlIO filter as in PerlIO::code helpful if you can't.
  7. If you see "Wide character in ..." warnings, then you passed a string with code points >=0x100 to something that expected a byte string of some sort: either really latin-1, or an encoded string.
  8. If you see longer strings of gibberish where you expected sensible non-ASCII characters, then you have probably double-encoded, either literally, or by printing an encoded string to a filehandle which does encoding.
  9. If you see the Unicode replacement character in a stream that should be UTF-8, you haven't encoded at all, such as printing a byte string on a raw filehandle in an environment expecting UTF-8.  Most likely, the filehandle should have an encoding set on it, per point #3 above, though that may cause #8 on other strings you've printed.
  10. If you are using modules, they each may or may not deal with Unicode.  DBD::mysql has the mysql_enable_utf8 option; Email::MIME accepts encoded strings via body, and decoded ones through body_str, but for the latter, you must also set the charset and encoding attributes (which correspond to the charset of Content-Type, and the Content-Transfer-Encoding, respectively.)  MIME::Lite does not handle decoded strings at all and hopes for the best.
The most difficult thing for me to come to terms with was that Perl doesn't have any notion of "the string's encoding" despite being Unicode-aware and having the UTF-8 flag.  A string is always a series of character code points; if it's an "encoded string" or a "byte string" then it's a series of character code points with values <= 0xFF.  The UTF-8 flag is almost irrelevant, except where it leaks out into point #1 because Unicode-unaware scripts are given the illusion that Perl is still single-byte.  Unicode-aware scripts get stuck dealing with all the usual Unicode issues, and also having to avoid falling into Unicode-unaware mode by accident.
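The symptoms in points 8 and 9 aren't specific to Perl, and they can be reproduced byte for byte.  Here is a minimal sketch in Python (chosen only because its bytes/str split makes each step explicit); the function names are made up for illustration.

```python
def double_encode(text: str) -> bytes:
    """Point 8: encode, mistake the bytes for latin-1 characters, encode again."""
    once = text.encode("utf-8")
    return once.decode("latin-1").encode("utf-8")

def undo_double_encode(data: bytes) -> str:
    """Reverse the extra round trip to recover the original text."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")

def read_unencoded(text: str) -> str:
    """Point 9: emit latin-1 bytes, then read them as if they were UTF-8."""
    raw = text.encode("latin-1")
    return raw.decode("utf-8", errors="replace")

mangled = double_encode("カ")
print(len("カ".encode("utf-8")), len(mangled))   # the byte count grows
print(undo_double_encode(mangled) == "カ")
print(read_unencoded("®") == "\ufffd")          # U+FFFD, the replacement character
```

Double-encoding is reversible as long as nothing in between dropped bytes, which is why the gibberish can often be repaired after the fact; the replacement character cannot, because the original byte is already gone by the time it is displayed.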

Friday, January 13, 2012

A nice vim highlighting hack

I wanted to highlight places where control flow could be redirected in my perl code, so I hacked up my personal colorscheme file to highlight Exceptions specifically:

hi Exception ctermfg=white ctermbg=blue

Now, I just needed to define the things I wanted highlighted as Exception*.  Thus, the newly added ~/.vim/after/syntax/perl.vim:

" flow control highlighting
syn keyword perlStatementCtlExit return die croak confess last next redo
syn keyword perlStatementWarn    warn carp cluck
hi link perlStatementCtlExit Exception
hi link perlStatementWarn    Statement

" and i'm tired of everything being yellow
hi link perlStatementStorage Define

The last line isn't related to the above, but it recolors my/local/our in Preprocessor Blue instead of Statement Yellow.  They do, after all, affect the state of the compiler at parse time.


* This means that I'm going to open a non-Perl file sometime and weird things will have Exception highlighting.  Nobody notices the subtle differences when it's all Statement colored by default.

Wednesday, January 11, 2012

Layer 7 Routing: HTTP Ate the Internet

In the beginning was TCP/IP, and the predominant model was that servers would listen for clients using a pre-established port number.  Then came Sun RPC, in which RPC servers were established dynamically, and listened on semi-random ports (still, one port per service provided); the problem was solved by baking the port mapper into the protocol.  The mapper listens on a pre-established port, and the client first connects there to inquire, "On what port shall I find service X?"

Then came HTTP, the layer 6 protocol masquerading as layer 7.