Thursday, January 26, 2012

Plain Old Data

I’m coming to the conclusion that there’s actually no such thing as “plain data;” it always has some metadata attached.  If it doesn’t, it might be displayed incorrectly, and then a human needs to intervene to determine the correct metadata to apply to fix the problem.  (Example: View → Character Encoding in Firefox.)  Pushed to the extreme, even “just numbers” have metadata: they can be encoded as text, as a binary integer/float (IEEE 754 or otherwise) of some size and endianness, or in an ASN.1 encoding.
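
To make that concrete, here’s a minimal Python sketch using the standard struct module: the same integer comes out as four different byte strings depending on which assumptions (text, integer width, endianness, floating point) travel with it.

    import struct

    n = 1234

    # The "same number," serialized under different metadata assumptions:
    as_text   = str(n).encode("ascii")       # b'1234' -- four ASCII digits
    as_le_u32 = struct.pack("<I", n)         # b'\xd2\x04\x00\x00' -- 32-bit unsigned, little-endian
    as_be_u32 = struct.pack(">I", n)         # b'\x00\x00\x04\xd2' -- same integer, big-endian
    as_double = struct.pack("<d", float(n))  # eight bytes of IEEE 754 double

    # Handed only the bytes, with no metadata, a reader cannot tell which of these it has.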

Another conclusion I’m reaching is that HTTP conflates all kinds of metadata.  Coupled with the lack of self-contained metadata in file formats and filesystems, things start to accumulate hacks.

In my ramblings about REST and RPC, I mentioned that the combination of ETag/Last-Modified (a point-in-time) with a URL (an identity) provides a specific state of that identity.  If an HTTP client wants to save that state to disk, how does it proceed?  Typically, it saves the entity-body and discards the headers.  Yet at least some of those headers carry important metadata, like the ETag, Cache-Control, Content-Location, and, optionally for text/* types, the character encoding.  (Other headers are specific to the transfer/message itself, such as Connection and the curiously named TE.  These are concerned with layer 6 rather than layer 7.)
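
If a client did want to keep that state, one possible approach (the URL and the sidecar-file convention here are my own invention, not anything HTTP prescribes) is to save the content-describing headers alongside the body:

    import json
    import urllib.request

    url = "http://example.com/resource"      # hypothetical resource

    with urllib.request.urlopen(url) as resp:
        body = resp.read()
        # Keep headers that describe the content; drop transfer-level ones like Connection and TE.
        keep = ("ETag", "Last-Modified", "Content-Type", "Content-Location", "Cache-Control")
        meta = {name: resp.headers[name] for name in keep if resp.headers[name] is not None}

    with open("resource.body", "wb") as f:
        f.write(body)
    with open("resource.meta.json", "w", encoding="utf-8") as f:
        json.dump({"url": url, "headers": meta}, f, indent=2)

Of course, nothing else on the system knows to look for the .meta.json file, which is rather the point.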

HTML, being in use worldwide on the WWW, rapidly hit problems with locale, so it has long had the capability to use <meta http-equiv> tags for specifying the charset.  (Amusingly, HTML 4 seems to define it as something the HTTP server is supposed to parse and add to the actual response headers.  If you need it after the transfer is done, it’s clearly not the transfer’s metadata, but the content’s.)  A lot of formats have gone Unicode-only, like Java source code, or had Unicode hacked into them at a later date to solve the depressingly familiar character-set issues.

The funny thing about specifying the character set in HTML is that you have to be able to interpret the characters in the first place in order to find the meta tag.  Thus, more hacking has ensued: if you have an ASCII-compatible encoding, your only requirement is to put the meta tag in the first 1K of the document; if it’s Unicode, you can start it with a BOM.  Otherwise, it must be specified through the HTTP header, or else it’s up to the user-agent to guess, fall back to Latin-1, or do something else.
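
A rough sketch of that guessing dance in Python (the precedence and the regex are simplifications of what browsers, and later the HTML5 encoding-sniffing algorithm, actually do):

    import re

    def sniff_charset(raw: bytes, http_charset=None):
        if raw.startswith(b"\xef\xbb\xbf"):     # a Unicode document can announce itself with a BOM
            return "utf-8"
        if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
            return "utf-16"
        if http_charset:                        # or the HTTP header can say
            return http_charset
        # otherwise, hope the encoding is ASCII-compatible and the meta tag is in the first 1K
        m = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', raw[:1024], re.IGNORECASE)
        if m:
            return m.group(1).decode("ascii", "replace")
        return "latin-1"                        # or else guess / fall back / do something else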

Meanwhile on Unix, everything is “just” a stream of bytes, including things that aren’t, and things that are more than that.  On one hand, this made Unix programs much more adaptable to handling multiple, and new, encodings; on the other, it was a massive effort to make every program handle new encodings, since there was never much support for it (especially for variable-length encodings) baked into C, and for compatibility, it had to be opt-in.  Also, when working on a remote system, the whole stack had to agree on a character encoding.

By pushing encoding issues up to the programs, Unix keeps knowledge of character encoding out of the kernel.  Yet it doesn’t provide much to allow applications to keep the encoding of text—or any other metadata, really—attached to the bytes.  Nobody relies on extended attributes (or file forks, or alternate streams), because they might be turned off, or a file might be transferred to a filesystem that doesn’t support them, such as the infamous FAT family.  To my knowledge, there’s no “metadata channel” attached to IPC, either.  In-band communication is the only reliable form.
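
Extended attributes do exist; they’re just not dependable.  A small sketch, assuming Linux with user xattrs enabled (and the attribute name user.charset is purely my own convention, which is exactly the trouble: nothing else will honor it):

    import os

    path = "notes.txt"
    with open(path, "wb") as f:
        f.write("héllo".encode("utf-8"))

    os.setxattr(path, "user.charset", b"utf-8")     # attach the encoding out of band
    print(os.getxattr(path, "user.charset"))        # b'utf-8' -- until the file is copied to FAT,
                                                    # mailed, or tarred without --xattrs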

This might be fairly sensible, though.  Some formats can store multiple kinds of data.  In particular, certain formats of ID3 tags come out on my pocket MP3 player as sequences of Hanzi and squares.  It seems that the encoding of the tags has changed between versions, and the poor old thing gets horribly confused when it doesn’t have its matching revision of tags.

Still, most of the time when I’m dealing with “data” in a programming language, I’m keeping track of both the data and the format.  Python still has some warts, such as the encoding of sys.stdout going missing when stdout is a pipe instead of a terminal, but it still seems to be the right approach, and is thus doomed to failure at the hands of dirtier alternatives.  If you try to double-encode in Perl or PHP, that string is going to get double-encoded, and you’ll have junk like €œ everywhere.  Python will throw an exception (“bytes object has no encode method” in 3.x, and “could not decode character” in 2.x when a non-ASCII byte is present in the string).  On the flip side, this makes dealing with damaged data trickier, because the “don’t compound common problems” attitude keeps you from simply double-decoding to undo the damage.
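
A quick illustration of Python declining to compound the problem (the exact exception text varies between versions, but the shape is the same):

    s = "naïve"

    data = s.encode("utf-8")        # the one, correct encode: now it's bytes
    try:
        data.encode("utf-8")        # attempting the second encode in Python 3...
    except AttributeError as e:
        print(e)                    # 'bytes' object has no attribute 'encode'

    # Python 2's equivalent mistake -- encoding an already-encoded byte string with
    # non-ASCII bytes in it -- trips an implicit ASCII decode and raises UnicodeDecodeError.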

The other interesting thing about PHP not understanding character sets is that addslashes/stripslashes can mangle Big5 and Shift-JIS text.  At one point, someone realized they could make magic_quotes cause a SQL injection by sending a lead byte followed by 0x27: adding the backslash (0x5c) changes the character, since the lead byte absorbs the backslash as its trail byte, and pushes the 0x27 out to become a real single quote.
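
A byte-level re-enactment in Python (simulating the escaping, not calling PHP; GBK is the textbook encoding for this trick, since its trail bytes include 0x5c, and Big5 shares the property):

    def naive_addslashes(raw: bytes) -> bytes:
        # escape quotes byte-by-byte, the way an encoding-unaware addslashes does
        return raw.replace(b"'", b"\\'")

    payload = b"\xbf'"                    # a lead byte that can open a double-byte character, then 0x27
    escaped = naive_addslashes(payload)   # 0xbf 0x5c 0x27
    print(escaped.decode("gbk"))          # 0xbf 0x5c decodes as one character; the 0x27 is a live quote again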

On the other hand, since PHP is so completely unaware of encodings, when you print a string and it comes out weird, you need hardly more than those bytes to find your problem.  Perl’s backwards-compatible idea that the world is single-byte unless specified means that when you try to print Unicode data without configuring the proper IO encoding on a handle, it tries to convert to its legacy encoding and print that.  If code points are out of range, then it prints all of them in UTF-8, with the lovely “wide character in print” warning.  So it may appear to work, in spite of being completely broken, if you happened to choose data that won’t break it and have warnings off.

Perl’s functions work the same way; if you try to pack a UTF-8 string (with the utf8 flag on) into hex codes, hoping to see what the bytes are before they get to the filehandle, it doesn’t work—they go through the same encoding process.  To have a fully Unicode pipeline in Perl or Python, you need to set up the appropriate encodings for every segment of the process before you can actually see the data that you have, so you can understand whether it is correct in the first place.
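
In Python 3, for example, one way to pin down a segment of the pipeline is to re-wrap stdout explicitly rather than trust whatever the environment guessed (setting PYTHONIOENCODING is the other common fix); the choice of UTF-8 here is just an assumption for the example:

    import io
    import sys

    # Force a known encoding on stdout, even when it's a pipe rather than a terminal.
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", line_buffering=True)

    text = "héllo, wörld"
    print(text)                     # safe to print now
    print(text.encode("utf-8"))     # and the underlying bytes can be inspected on demand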

</tangent>

The point is, there’s no such thing as “just text,” or “just data.”  MP3 files may contain different text encodings in their ID3 tags; the same is true of JPEG/EXIF comments; and MIME’s whole purpose in life is to stitch together randomly encoded things, including any of the above, into a parseable whole.  “Text”-based, of course.

Looking at HTTP once more, there’s a whole stack of data to deal with: data about the message itself, where a 204 No Content status implies that there will be no response body; about management of the ephemeral connection, with Connection and TE; about the response body when present, as in Content-Type and Last-Modified; and about the server, the common examples being Server and X-Powered-By.  Some modern standards groups are rushing to add security and privacy as well, through the DNT (do not track) header and the various “do stuff across origins” specifications.

And in the end, the response body might carry something that has its own metadata, like a MIME message with its parts, or an HTML document with its HEAD tags.  Formats which embraced “in-band communication is the only reliable form.”

Formats which don’t necessarily think all data has a self-evident format.


