Thursday, December 27, 2012

A Minimal, Working Perl FastCGI Example

Updated 2013-01-30

Please see the next version, which allows for newly-written FastCGI scripts that do not have a CGI equivalent.

Original post follows.

I couldn't really find one of these on the Internet, so I'm going to document what I have working so far. In the following examples, assume an application named "site" rooted at /home/site/web, with static files served out of public/, app-specific modules contained under lib/, and other perl module paths (vendor, local::lib, etc.) compiled into Perl itself.  Site::* references an app-specific module, naturally.

Apache configuration snippet, using mod_fcgid as the FastCGI process manager:
DocumentRoot /home/site/web/public
FcgidInitialEnv PERL5LIB /home/site/web/lib
FcgidWrapper /home/site/web/handler.fcgi
RewriteEngine on
RewriteCond /home/site/web/lib/Site/Entry/$1.pm -f
RewriteRule ^/(.*)\.pl$ - [QSA,L,H=fcgid-script]
Note that the left side of the RewriteRule is matched, and the $1 reference assigned, prior to evaluating the RewriteCond directives.  The rule means, "if the requested filename exists where the FastCGI wrapper will look for it, force it to be handled by FastCGI."  This interacts perfectly with DirectoryIndex: requesting / with a DirectoryIndex in effect invokes handler.fcgi, as long as Site::Entry::index exists.

Now, my handler.fcgi named in the Apache configuration for the FcgidWrapper looks like this:
use warnings;
use strict;
use CGI::Fast;
use FindBin;
use Site::Preloader ();
while (my $q = CGI::Fast->new) {
    my ($base, $mod) = ($ENV{SCRIPT_FILENAME});
    $base = substr($base, length $ENV{DOCUMENT_ROOT});
    $base =~ s/\.pl$//;
    $base =~ s#^/+##;
    $base =~ s#/+#::#g;
    $base ||= 'index';
    $mod = "Site::Entry::$base";
    my $r = eval {
        eval "require $mod;"
            and $mod->invoke($q);
    };
    warn "$mod => $@" unless defined $r;
}
This means that I can have models named like Site::Login and view-controllers for them under Site::Entry::login (handling /, naturally).  I still have to rewrite the site from vanilla CGI scripts into module form, but the RewriteRule work above means that the FastCGI versions can be picked up URL-by-URL.  It doesn't require a full-site conversion to a framework to gain any benefits.

There's one additional feature this wrapper has: by using SCRIPT_FILENAME and removing DOCUMENT_ROOT off the front, I can rewrite a pretty URL (prior to the last RewriteRule shown above) to one ending .pl and still have the wrapper work.  SCRIPT_NAME keeps the name of the script as it was in the original request, and does not receive the rewritten value.  (Only SCRIPT_FILENAME does, on my exact setup.)  So I did it this way, rather than re-applying all my rewrites in Perl.
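For clarity, the path-to-module translation from the wrapper can be pulled out into a standalone function and exercised on its own (path_to_module is my name for it here, not something that exists in the wrapper):

```perl
use warnings;
use strict;

# Map a SCRIPT_FILENAME under DOCUMENT_ROOT to a Site::Entry:: module name,
# mirroring the substitutions done inline in handler.fcgi.
sub path_to_module {
    my ($script_filename, $document_root) = @_;
    my $base = substr($script_filename, length $document_root);
    $base =~ s/\.pl$//;     # strip the CGI-era extension
    $base =~ s#^/+##;       # drop leading slashes
    $base =~ s#/+#::#g;     # path separators become package separators
    $base ||= 'index';      # "/" maps to the DirectoryIndex handler
    return "Site::Entry::$base";
}

print path_to_module('/home/site/web/public/login.pl', '/home/site/web/public'), "\n";
print path_to_module('/home/site/web/public/', '/home/site/web/public'), "\n";
```

Nested paths fall out naturally: /admin/users.pl becomes Site::Entry::admin::users.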

Saturday, December 22, 2012

mod_fcgid and PHP_FCGI_CHILDREN

While attempting to validate a configuration for some server benchmarking, I happened upon some curious pstree output which read in part:


I had expected at most 8 php-cgi children, having configured PHP_FCGI_CHILDREN=8 in the wrapper script, but mod_fcgid was clearly starting multiple instances of the wrapper, so I was getting an absurd number of children.

I discovered FcgidMaxProcesses and further benchmarks ensued:


The first two rows, in spite of the second only providing 40% of the available PHP children, are within error of each other in actual performance.  The third line produces a total of 20 children again, but now 10 of them instead of 4 are visible in mod_fcgid for handling requests.

Clearly, there are too many cooks in the kitchen: both mod_fcgid and php-cgi (which, I should mention, was built with --disable-fpm) are trying to act as master, so PHP ends up wasting $PHP_FCGI_CHILDREN - 1 processes because mod_fcgid won't pass concurrent requests for them to handle.

What happens if you don't configure PHP_FCGI_CHILDREN at all?  php-cgi handles the request directly, as if it were a child to begin with, and pstree looks more like this:


That looks rational.  How does it perform?


Concurrency was held the same at 16 throughout all tests today, so it looks like we've run into the limits of the 100 mbit network.  (Serious benchmarking will happen once my gigabit gear arrives.)  Still, the fact that performance keeps up with 10×2 children portends well; it means there's no penalty for unsetting PHP_FCGI_CHILDREN and mod_fcgid isn't broken without it.

Friday, December 14, 2012

The Router Trinity

There are technically two separate companies housed in the suite, with separate subnets set up for themselves, but they wanted to be able to exchange data at some point in the past.  The link between the two networks is accomplished by no less than three consumer-grade "cable/dsl routers."

Let's start with a picture so you have something to boggle at:

Saturday, December 1, 2012

Hairy Escaping Problems (Keep the Pieces 2)

I was just settling in to hack out a Smarty-like template system (or at least an interpreter for it) in a non-PHP language, when my brain went all meta on me. ‘How can I never, ever have to deal with careful manual control over output encoding, ever again?’

Tuesday, November 27, 2012

DynamoDB in the Trenches

Amazon's DynamoDB, as they're happy to tell you, is an SSD-backed NoSQL storage service with provisioned throughput.  Data consists of three basic types (number, string, and binary) in either scalar or set forms.  (A set contains any number of unique values of its parent type, in no particular order.)  All lookups are done by hash keys, optionally with a range as sub-key; the former effectively defines a root type, and the latter is something like a 1:Many relation.  The hash key is the foreign key to the parent object, and the range key defines an ordering of the collection.

But you knew all that; there are two additional points I want to add to the documentation.

1. Update is Upsert

DynamoDB's update operation actually behaves as upsert—if you update a nonexistent item, the attributes you updated will be created as the only attributes on the item.  If this would result in an invalid item as a whole, then you want to use the expected-value mechanism to make sure the item is really there before the update applies.
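To make the upsert behavior concrete, here's a toy in-memory model using plain hashes (not the real DynamoDB SDK), showing both the silent partial-item creation and how an expected-value guard prevents it:

```perl
use warnings;
use strict;

# Toy in-memory model of DynamoDB's update-as-upsert behavior; the real
# service is accessed through an SDK, not this hash.
my %table;   # key => { attribute => value }

sub update_item {
    my ($key, $attrs, $expected) = @_;
    if ($expected) {
        # Expected-value guard: refuse to touch a missing/mismatched item.
        for my $name (keys %$expected) {
            return 0 unless exists $table{$key}
                        and $table{$key}{$name} eq $expected->{$name};
        }
    }
    $table{$key}{$_} = $attrs->{$_} for keys %$attrs;   # upsert!
    return 1;
}

# Updating a nonexistent item silently creates a partial one...
update_item('user#42', { last_seen => '2012-11-27' });
# ...while the same update under an expected-value check fails instead.
my $ok = update_item('user#43', { last_seen => '2012-11-27' },
                     { name => 'anyone' });
print "created: ", (exists $table{'user#42'} ? 1 : 0), "\n";
print "guarded: $ok\n";
```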

2. No Attribute Indexing or Querying

NoSQL is at once the greatest draw to non-relational storage, and also its biggest drawback.  On DynamoDB, there's no query language.  You can get items by key cheaply, or scan the whole table expensively, and there's nothing in between.  There's no API for indexing any attributes, so there's no API for querying by index, either.  You can't even query on the range of a range key independently of the hash (e.g. "find all the posts today, regardless of topic" on a table keyed by a topic-hash and date-range.)

If you need lookup by attribute-equals more than you need consistent query performance, then you can use SimpleDB.  RDS could be a decent option, especially if you want ordered lookups (as in DELETE FROM sessions WHERE expires < NOW();—when the primary key is "id".)

A not-so-good option would be to add another DynamoDB table keyed by attribute-values and containing sets of your main table's hash keys—but you can't update multiple DynamoDB tables transactionally, so this is more prone to corruption than other methods.
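A sketch of that index-table pattern, with plain hashes standing in for the two DynamoDB tables; the comments mark the two separate writes that real DynamoDB cannot wrap in a transaction:

```perl
use warnings;
use strict;

# Two "tables": a main table keyed by id, and an index table mapping an
# attribute value to the set of ids carrying it.  In real DynamoDB these
# are two separate requests, so a crash between them leaves the index
# stale -- the corruption risk noted above.
my (%main, %by_topic);

sub put_post {
    my ($id, $topic) = @_;
    $main{$id} = { topic => $topic };   # write 1: main table
    $by_topic{$topic}{$id} = 1;         # write 2: index table (may never happen)
}

put_post('post1', 'perl');
put_post('post2', 'perl');
put_post('post3', 'aws');

# "Query by attribute" is now a cheap get on the index table.
my @perl_posts = sort keys %{ $by_topic{perl} };
print "@perl_posts\n";
```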

And if you want to pull together two sources of asynchronous events, and act on the result once both events have occurred, then the Simple Workflow service might work.  (I realized this about 98% of the way through a different solution, so I stuck with that.  I might have been able to store the half-complete event into the workflow state instead, no DynamoDB needed, but since I didn't walk that path, I can't vouch for it.)

Thursday, November 22, 2012

Perl non-CGI: The Missing Overview

After living in mod_php and Perl CGI for far too long, it was time to look at reworking our application to fit something else.  Although we had mod_perl installed and doing little more than wasting resources, I didn’t want to bind the app tightly to mod_perl in the way it was already tightly bound to CGI.  That meant surveying the landscape and trying to understand what modern Perl web development actually looks like.

But first, some history!

Wednesday, November 21, 2012

Because I Can: SMTPS + MIME::Lite monkeypatch

I had an app that calls MIME::Lite->send() to send an email, which until recently was using an SMTP server slated for decommissioning Real Soon Now.  It was my job to convert it to Amazon SES, and I figured it would be easier to tell MIME::Lite to use SES's SMTP interface instead of importing the web side's full Perl library tree just for one module out of it.

Ha ha!  SES requires SSL, and neither MIME::Lite nor Net::SMTP has any idea about that.  They were both written before the days of dependency injection, so I had to go to some length to achieve it.  And now, I've golfed it a bit for you:
package MyApp::Monkey::SMTPS;
use warnings;
use strict;
use parent 'IO::Socket::SSL';

# Substitute us for the vanilla INET socket
require Net::SMTP;
@Net::SMTP::ISA = map {
  s/IO::Socket::INET/MyApp::Monkey::SMTPS/; $_
}  @Net::SMTP::ISA;

our %INET_OPTS = qw(
  PeerPort smtps(465)
  SSL_version TLSv1
); # and more options, probably

# Override new() to provide SSL etc. parameters
sub new {
  my ($cls, %opts) = @_;
  $opts{$_} = $INET_OPTS{$_} foreach keys %INET_OPTS;
  return $cls->SUPER::new(%opts);
}

1;
PeerPort overrides the default of smtp(25) built in to Net::SMTP; I needed a port where the whole connection is wrapped in SSL instead of using STARTTLS, and 465 is the one suitable choice of the three that SES-SMTP supports.

The main caveat about this is that it breaks Net::SMTP for anyone else in-process who wants to send mail to a server that lacks a functional port 465.  But as you may have guessed, that's not a problem for my script, today.

Thursday, November 15, 2012

Some vim hacks

1. BlurSave

" Add ability to save named files when vim loses focus.
if exists("g:loaded_plugin_blursave")
 finish
endif
let g:loaded_plugin_blursave = 1
let s:active = 0

function BlurSaveAutocmdHook()
 if s:active
  silent! wa
 endif
endfunction

autocmd FocusLost * call BlurSaveAutocmdHook()
command BlurSaveOn let s:active = 1
command BlurSaveOff let s:active = 0
Save to ~/.vim/plugin/blursave.vim (or vimfiles\plugin\blursave.vim for Windows) and you now have a :BlurSaveOn command: every time your gvim (or Windows console vim) loses focus, named buffers will be saved.

My plan here is to develop a Mojolicious app in Windows gvim, with the files in a folder shared with a VirtualBox VM.  With blursave, when I Alt+Tab to the browser, vim saves and morbo reloads.

2. Graceful Fallback

The vim function exists() can test just about anything. I now have this stanza in my ~/.vim/syntax/after/mkd.vim:
" Engage UniCycle plugin, if loaded
if exists(":UniCycleOn")
 UniCycleOn
endif
Now, whenever I'm writing a new blog post for propaganda, I don't have to remember to run :UniCycleOn manually.

3. Extension remapping

Due to the disagreement on various systems as to what markdown should be called (Nocs for iOS offers just about every option except .mkd, while that happens to be the preferred extension for the syntax file I have for it—actually named mkd.vim), I also link .md to the mkd syntax in ~/.vimrc:
" .md => markdown
autocmd BufRead,BufNewFile *.md  setlocal filetype=mkd 
This lets me make Nocs-friendly Markdown files and still have vim highlight them.

Tuesday, November 13, 2012

Autoflush vs. Scope (CGI::Session)

CGI::Session writes its session data to disk when DESTROY is called, possibly during global destruction, but the order of global destruction is non-deterministic.  This generally works when CGI::Session is writing to files, since it doesn't depend on anything else to do that; but with other storage like memcached or a database, the connection to storage may have been cleaned up before CGI::Session can use it.  Then your session data is mysteriously lost, because it was never saved to begin with.

Another possible interaction between object lifetimes occurs when there are multiple CGI::Session objects: unless both hold identical data, whichever one is destroyed last wins.  At one point, I added an END {} block to a file which had my $session declared.  All of a sudden, that END block kept $session alive until global destruction, and the other CGI::Session instance, into which I had recorded that a user was in fact logged in, now flushed first.  Because the logged-in state was then overwritten by the session visible from the END block (even though the block itself never used it), nobody could log in!

Yet another problem happened when I pulled $session out of that code and stored it in a package.  The END block had finished its purpose and been deleted, so moving $session to a package once again extended its life to global destruction: a package variable stays around forever, because the package itself is a global resource.  However, since the login path had flush() calls carefully placed on it, what broke this time was logout.  The delete() call couldn't take effect because the storage was gone by the time the session was cleaned up.
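The defensive habit that falls out of all this is to flush explicitly while the storage handle is known to be alive, rather than trusting DESTROY.  A toy stand-in shows the difference (Toy::Session is hypothetical, not the real CGI::Session API):

```perl
use warnings;
use strict;

# Toy stand-in for CGI::Session: writes go to a backing store that, like a
# memcached or DB handle, can be torn down before global destruction runs.
package Toy::Session;
sub new   { bless { store => $_[1], data => {} }, $_[0] }
sub param { $_[0]{data}{$_[1]} = $_[2] }
sub flush {
    my $self = shift;
    return unless $self->{store}{alive};   # storage already gone: write is lost
    $self->{store}{saved}{$_} = $self->{data}{$_} for keys %{ $self->{data} };
}
sub DESTROY { $_[0]->flush }               # what CGI::Session relies on

package main;
my $store = { alive => 1, saved => {} };
our $leaked;                 # package variable: lives until global destruction
{
    my $session = Toy::Session->new($store);
    $session->param(logged_in => 1);
    $session->flush;         # explicit flush while storage is still connected

    $leaked = Toy::Session->new($store);
    $leaked->param(logged_out => 1);
    # no flush here: $leaked waits for DESTROY, which comes too late
}
$store->{alive} = 0;         # analog of the DB handle being destroyed first
undef $leaked;               # DESTROY fires now; its flush is a no-op
print exists $store->{saved}{logged_in}  ? "login saved\n"  : "login lost\n";
print exists $store->{saved}{logged_out} ? "logout saved\n" : "logout lost\n";
```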

Friday, November 9, 2012

CGI::Session and your HTTP headers

For CGI::Session to work, you must send the Set-Cookie header (via $session->header() or otherwise) when the session's is_new method returns true.  I discovered this by tripping over an awesome new failure mode today:
  1. Restart memcached (or otherwise create new session storage).
  2. Nothing stays saved in the session.  Can't log in.
When CGI::Session receives a session ID that doesn't exist in session storage, it changes the session ID to prevent session fixation attacks.  Which means that if you only send the header in the absence of a browser cookie, data is written to the new ID, but the browser will re-submit the old ID next request.
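A toy model of that feedback loop (the function and variable names here are mine, not CGI::Session's):

```perl
use warnings;
use strict;

# The server regenerates unknown session IDs (as CGI::Session does, to block
# fixation attacks).  If Set-Cookie is only sent when the browser had no
# cookie at all, the regenerated ID never reaches the browser, which keeps
# re-sending the dead one forever.
my %storage;
my $next_id = 1;

sub handle_request {
    my ($cookie_id, $always_send_cookie) = @_;
    my $id = (defined $cookie_id && exists $storage{$cookie_id})
           ? $cookie_id
           : 'sid' . $next_id++;          # unknown/stale ID => regenerated
    $storage{$id} //= {};
    # Buggy pattern: only send Set-Cookie when the browser had no cookie.
    return $always_send_cookie ? $id
         : defined $cookie_id  ? $cookie_id
         :                       $id;
}

# Browser holds a cookie for a session that memcached has since forgotten:
my $cookie = 'sid-stale';
$cookie = handle_request($cookie, 0);   # buggy: the cookie never updates
print $cookie eq 'sid-stale' ? "stuck with dead ID\n" : "recovered\n";
$cookie = handle_request($cookie, 1);   # fixed: always send the current ID
print exists $storage{$cookie} ? "session resumes\n" : "still lost\n";
```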

(It turns out my real problem was the stupidly simple error of 'trying to write to the wrong memcached server,' but the above did happen to my test page while I was trying to figure out why memcached wasn't saving anything.)

Tuesday, November 6, 2012

It's All Programming

Programming codifies a process into something that can be executed on a machine.  But this is psychologically no different than codifying any other process into a set of rules for any interpreter, not necessarily mechanical.

The link between programming code and law has been noted in the past: the laws try to leave no room for argument, so they become long, and subject to similar problems as computer code.  Particularly unintended consequences: witness the spate of sexting prosecutions that try to brand teens as sex offenders for a decade for sending nude—or sometimes even just swimsuit—pics to their significant other.

Laws writ small are the ordinary rules of everyday life.  Those Dilbert moments where you receive multiple conflicting rules?  Those are bugs.

Friday, November 2, 2012

The Pointlessness of sudo's Default Run-As User

Amazon Linux ships with the default configuration*:
ec2-user ALL = NOPASSWD: ALL
Which means, ec2-user is allowed to run any command, without providing a password, while logged in from any machine.  But only as root—since the Runas_Spec is missing, the default of (root) is assumed.

This is entirely pointless because it also ships with the common PAM configuration, in which /etc/pam.d/su contains:
auth sufficient pam_rootok.so
So the game of Simon Says, in order to bypass the root-only sudo restriction so you can run as any user, password-free, without touching files in /etc in advance, becomes:
sudo su -s /bin/bash $TARGET_USER
Normally, su uses the shell for the user as listed in /etc/passwd, but if we're interested in a /sbin/nologin account, then we can set any other shell listed in /etc/shells with the -s flag.

When you give an account root access, it effectively has the whole machine.  I'm not sure what sudo was hoping to accomplish by "limiting" the default Runas_Spec to root.

* It also ships with Defaults requiretty which means you actually need someone to allocate you a controlling terminal for sudo to work, even though ec2-user doesn't need a password, and visiblepw is disabled by default.

Thursday, November 1, 2012

Bugs in Production

The amount that a bug hitting production annoys me turns out to be proportional to log(affected_users / time) * stupidity_of(bug).  If nobody can use the core functionality of the app because of something that would have failed a perl -c check, that yields a lot more angst than "some non-critical task doesn't work for one (uniquely configured) client when the day of the month is 29 or more," even though the latter is often more difficult to diagnose.
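Transcribed literally (units arbitrary, and the sample numbers are invented):

```perl
use warnings;
use strict;

# The annoyance formula above: log(affected_users / time) * stupidity_of(bug).
sub annoyance {
    my ($affected_users, $time, $stupidity) = @_;
    return log($affected_users / $time) * $stupidity;
}

# A site-wide syntax error vs. a slow-burning one-client edge case:
printf "outage:    %.1f\n", annoyance(10_000, 1, 10);
printf "edge case: %.1f\n", annoyance(2, 1, 3);
```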

Yeah.  I crashed our site the other day over a trivial logging change, intended to gather debugging information for a rare condition of the latter sort.  It was so trivial it couldn't possibly go wrong, meaning stupidity_of(bug) was quite large.

Monday, October 29, 2012

When Layering Goes Bad

A lot of systems are built in layers.  Games are often split into engines and scripts.  Another classic is the "three tier" architecture with data storage, application model, and view-controllers split out onto individual machines.

But more often, I run into systems where code is taking a core sample instead of building on the layers.

Thursday, October 25, 2012

War Story: Apache, SSL, and name-based vhosts

Note: this post was written about a year ago, before we completed some major upgrades to our infrastructure.  I meant to post it as soon as we were done, but it got buried under too many other posts and drafts.  The original post follows, without edits for temporal accuracy.

You can do it The Right Way and use SNI if:
  1. You don't care about Internet Explorer (7 and 8) on Windows XP.
  2. You have Apache 2.2.12 or newer.
  3. You have openssl 0.9.8f or newer with TLS extensions; extensions are included by default in 1.0.
If all of these are okay for you, then setup is a matter of enabling NameVirtualHost *:443 and setting up TLS vhosts in much the same way as regular vhosts.

Otherwise, you have to try a bit harder.

Tuesday, October 23, 2012


I figured out my underlying problem with Yegge's liberal/conservative (libertarian/authoritarian) division of programming cultures.

People like looking down on those considered inferior.  "Conservative" adds another way to do just that.

Tuesday, October 2, 2012

Compile Time

You might have heard that in Lisp, the whole language is there all the time.  You can read while compiling, eval while reading, and so on.  This isn't necessarily exclusive to Lisp—Perl offers BEGIN/CHECK/UNITCHECK—but it isn't exactly common in mainstream languages.

At first, it sounds brilliant.  "I can use my whole language to {read the configuration | filter some code on-the-fly | whatever} for super fast run-time performance!"  But there's a consequence that nobody seems to realize until they've gone far down that path: if you have a compile-test switch like perl -c, you can no longer guarantee that using it is safe if you wrote code that runs during compilation.

This is almost a trivial statement: compile testing has to compile the code; you're running code at compile time; ergo, your compile-time code will run.  But beware of the details:
  1. If you read your configuration files and exit if something's wrong, then you must now have a valid configuration to run a compile test.
  2. Generalizing the previous: if you pull anything from an external service, your compile test depends on that service being up.  It may also depend on having your credentials for that service available.
  3. If you do a ton of work to prepare a cache for runtime, you have to wait for that—then the compile test finishes and throws it all away.
  4. If you have an infinite loop in compile-time code, the compilation test never completes.  Not a problem for a human at the keyboard, but could be difficult in a script (e.g. VCS commit hooks).
  5. If the language allows you to define reader macros or source filters at compile time, then you can't even syntax-check the source without running the compile-time code; the lex phase now depends on the execution state that accumulates during compilation.
  6. If your code assumes the underlying platform is Unix because that's what the server is, you can't compile test on Windows.  Or, you have to write your whole compile phase cross-platform.
If you want to execute expensive code or do sanity checks before run time, consider carefully where they would best be placed.  Perl's INIT can give you the same "run once, before runtime" behavior without affecting a compile test.   Separate, automated tests can be configured to interrupt neither your compile checks, nor the production system on failure.  (Sometimes, a 90%-working production system is desirable, compared to 0%.)
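A minimal demonstration of the phase ordering: BEGIN runs even under perl -c, while INIT waits for an actual run.

```perl
use warnings;
use strict;

# Phase ordering in Perl: BEGIN executes during compilation (so it also runs
# under perl -c), while INIT runs once, after compilation but before runtime.
# That is why expensive setup belongs in INIT, per the advice above.
our @phases;
BEGIN { push @phases, 'BEGIN' }   # executes while the file is being compiled
INIT  { push @phases, 'INIT' }    # skipped entirely by perl -c
push @phases, 'run';              # ordinary runtime code
print join('-', @phases), "\n";   # BEGIN-INIT-run
```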

Friday, September 28, 2012

An Exercise in Optimizing PHP

Last winter, I was optimizing a PHP reporting script for no real reason besides practicing optimization.

Tuesday, September 25, 2012

Radical Clojure

Apparently, last month I missed Yegge's post and followup regarding software liberals and conservatives.

One of the things that caught my eye, that's clearly a Nerd Trap but so powerful that I need to answer anyway, is this little quote:
But the reality is that Clojure is quite conservative.  As far as I'm concerned, if you can't set the value of a memory variable without writing code to ask for a transaction, it's... it's... all I can think of is Morgan Freeman in the Shawshank Redemption, working at the grocery store, complaining "Forty years I been asking permission to piss. I can't squeeze a drop without say-so."
Emphasis and vivid FUD in original.

There's just one missing word, though, that makes all the difference:
You can't set the value of a shared memory variable outside of a transaction.
Shared.  Global.  Possibly still in use by someone else.

Clojure is utterly pointless without understanding that time is explicit and everything is an immutable value (unless you have a Java native thing.)  Values last, unmodified, until a reader asks for an update, and so a writer must be forbidden from modifying (destroying) that memory.  There's a whole paradigm hiding there, which you can see again if you look at Datomic.  Clojure without immutable values might as well be JRuby.

The other trick up Clojure's sleeve is that 'transaction' implies something rather more heavyweight outside of Clojure than in it.  Again, because immutability is pervasive, Clojure's transactions don't have to do read tracking.  When anyone reads an old value, it's going to stay what they read.  It's like RCU, conceptually, except that every read is protected and nobody needs to copy because the messy details are taken care of on write.

And if it's not actually shared?  Then you should use a var, which only one thread can see, and therefore doesn't need to be updated in a transaction!

Clojure is an interesting language.  I'd still recommend it.

Added 26 Sept: I think that liberal/conservative as applied by Yegge divides the languages according to how much a language insists on its own philosophy.  Things like Erlang and Clojure get ranked "conservative" right in with Pascal because they have Ways to Do Things, even if those are progressive ideas (functional and parallel/async are strongly encouraged.)  Then Perl is more "liberal" because it's equally well-suited to OO, procedural, functional, and concatenative programming, i.e. just barely, and if it didn't have regexes as syntax, it would be long dead already.  Python is more conservative than Ruby despite sharing characteristics because "there should be one—and preferably only one—obvious way to do it."  That's a clear Conservative value, right in the middle of the Zen of Python.

In any case, the insight that everyone thinks they're liberal is accurate.  Upon some thinking, which is hard and time-consuming and therefore not generally applied on the Internet, "conservative" is a fair enough label for my tendencies at this point in time—because I got burned throwing around too much fire, then came off that job a few months later to take positions in more conservative cultures.  But I've never thought of myself as a stodgy dinosaur programmer.

Monday, September 24, 2012

AWS: as-* command parameters

It turns out that the auto scaling commands like as-create-auto-scaling-group are thin wrappers around the AWS SDK for Java; you can read all the commands placed in /opt/aws/apitools/as/bin, but they eventually just invoke Java with a classpath set.

Thus, the CLI commands are essentially documented by the Java API Reference.  In particular, the as-* family reflects operations on the auto scaling client class, which consume data types from the corresponding model package.

What I was specifically looking for was documentation on the --health-check and --grace-period options, and ....model.CreateAutoScalingGroupRequest finally has me covered.  Health check can be either 'EC2' or 'ELB': the former selects the standard (and presumably default) "is the host hardware alive? can we ping the VM?" instance checks, while the latter selects the "can we access this URL on the instance's HTTP server?" health check configured when setting up an elastic load balancer.

Likewise, --grace-period is the time delay between starting up a new instance, and when auto scaling is allowed to start asking for its health.

Friday, September 21, 2012

HTTP(S) Endpoints and SNS

If you subscribe an HTTP endpoint to SNS, the first thing AWS does is send a subscription confirmation to that endpoint, which should be ready to handle it.  (If not, you can manually re-subscribe it through the console when the service is up, and AWS re-sends the confirmation.)  It turns out that the actual messages as exchanged with endpoints are defined in the Getting Started Guide, Appendix D; not the API reference!  The API reference only contains the administrative calls that can be made to the API, such as ConfirmSubscription.

It also turns out that the signature of the message sent to the endpoint is message-specific; Notification messages use a different set of fields to be signed than SubscriptionConfirmation and UnsubscriptionConfirmation.  Again, the details (and examples) of this are not in the API documentation, but Appendix D of the Getting Started Guide.

Monday, September 10, 2012

Security is Hard

"We constantly find '0days' as part of pentests and use them against our customers. Just the other day, we used an 0day SQL injection bug in [popular manufacturer's name deleted] firewall to break into a customer."
—Rob Graham via Ars Technica

A firewall.  Had an SQL injection bug.

A firewall.  A security product.

With the most basic of web security bugs embedded.

Obviously, being a black hat these days is like shooting fish in a barrel.  With a cannon.

Wednesday, August 22, 2012

Think of the Olden Days!

My first Linux machine had 128 MB of RAM.  The bzip2 warnings that you needed at least 4 MB of RAM to decompress any archive seemed obsolete at the time (even our then-4.5-year-old budget computer had shipped with twice that for Windows 95 RTM) and downright comical now that I have 4,096 MB at my disposal.

I was compressing something the other day with xz, which was taking forever, so I opened up top and only one core was under heavy use.  Naturally.  In the man page is a -T<threads> option... that isn't implemented because won't someone think of the memory!

OK, sure.  It appears to be xz -6 based on the resident 93 MB; with four cores, it's still under 10% of RAM.  The only ways it could come close to hurting are to run at xz -9 which consumes 8 times the memory and would seriously undermine the "reasonable speed" goal even with four threads; to run with 44 cores but not more RAM; or to run it on a dual-thread system on 256 MB.  The concern seems to be nearly obsolete already... will we be reading the man page in 2024 and finding that there are no threads because they use memory?

The point of this little rant is this: someone has a bigger, better system than you.  Either one they paid a lot of money for and would like to see a return on investment, or one they got further into the future than yours.  If you tuned everything to work on your system today, left or right shift by 1, then you have a small window of adaptability that will soon be obsolete.  Especially pertinent here is that parallelizing compression does not add requirements to the decompressor.  A single-thread system will unpack just as well, it just takes longer; unlike the choice of per-thread memory which forces the decompressor to allocate enough to handle the compression settings.

(Like gzip and bzip2, there exist some parallel xz utilities.  But only pbzip2 has made it into the repository.)

Friday, August 17, 2012

Troubleshooting cloud-init on Amazon Linux

cloud-init drops files as it works under /var/lib/cloud/data – you'll find your user-data.txt there, and if it was processed as an include, you'll also have user-data.txt.i.

If you're using #include to run a file from s3 and it wasn't public (cloud-init has no support yet for IAM Roles, nor special handling for S3), then user-data.txt.i will contain some XML indicating "Access Denied".  Otherwise, you should see your included script wrapped in an email-ish structure, and an unwrapped (and executable) version under /var/lib/cloud/data/scripts.

Update 23 Aug: Per this thread, user data is run once per instance by default, so you can't test it by simple reboots unless you have edited /etc/init.d/cloud-init-user-scripts to change once-per-instance to always.  Or use your first boot to set up an init script for subsequent boots.  But this doesn't apply if you build an AMI—see the 1 Oct/8 Oct update below for notes on that.

Update 2 Sep: I ended up dropping an upstart job into /etc/init/sapphirepaw-boot from my script; the user data is just an #include of a URL, and the upstart job is a task script that runs curl | perl.  The fetched script is public, and knows how to get the IAM Role credentials from the instance data, then use them to pull a private second stage.  That, in turn, actually knows how to read the EC2 tags for the instance and customize it accordingly.  Finally, it ends up acting as the interpreter for scripts packed into a zip (some of them install further configuration files and such, so a zip is a nice, atomic unit to carry them all in).

Note that I have just duct-taped together my own ad-hoc, poorly specified clone of chef or puppet.  A smarter approach would have been to pack an AMI with one of those out-of-the-box, then have the boot script fetch the relevant recipe and use chef/puppet to apply it.  Another possibility would be creating an AMI per role, with no changes necessary on boot (aside, perhaps, from `git pull`) to minimize launch time.  That would prevent individual instances from serving multiple roles, but that could be a good thing at scale.

But now I'm just rambling; go forth and become awesome.

Update 1 Oct, 8 Oct: To cloud-init, "once per instance" means once per instance-id.  Building an AMI caches the initial boot script, and instances started from that AMI run the cached script, oblivious to whether the original has been updated in S3.  My scripts now actively destroy cloud-init's cached data.  Also, "the upstart job" I mentioned was replaced by a SysV style script because the SysV script I wanted to depend on is invisible to upstart: rc doesn't emit individual service events, only runlevel changes.

Monday, August 13, 2012

Muddy Waters of Crypto

After doing quite a bit of searching on the Internet, I've come up no clearer on what the state of bcrypt is relative to alternatives.  Although I did find a couple of odd complaints, which I'll take first.

Friday, August 3, 2012

Why Does Nobody Use SRP?

Aside from it being patented in part, the security goal of SRP doesn't quite fit the way we use the Internet these days: it uses a procedure similar to Diffie-Hellman to establish a secure channel based on the username presented.  Meanwhile, we have a standard for anonymous secure channels (TLS) over which we can exchange credentials without further crypto*, and using HTML forms means not being beholden to browser UI, such as HTTP Authorization's ugly modal dialogs with no logout feature.

* Although it would be nice to be able to do <input type="password" hashmode="pbkdf2;some-salt" ...> to enable the server to store something other than a cleartext password, without all the dangers of trying to do crypto in javascript.

Bonus chatter: Someone once asked why I would use Digest auth even over TLS.  "In case TLS is broken" didn't appease him, but since then, we've seen high-profile failures like DigiNotar and Comodo, and attacks like BEAST.

Monday, July 30, 2012

DynamoDB Performance

Things I learned this past week:
  1. AWS Signature V4 no longer requires temporary credentials.  If you aren’t caching your tokens, this can give you a nice speedup because it cuts IAM/STS out of the connection sequence.
  2. AWS service endpoints are SSL.  If you make a lot of fresh connections, you may pay a lot of overhead per connection.
  3. Net::Amazon::DynamoDB and CGI are terrible things to mix.
Read on for details.

Wednesday, July 18, 2012

Add Multiply Exponentiate Tetrate

A random thought occurred to me today:
  1. Multiplication is iterated addition.  (= (* 5 4) (reduce + (take 4 (repeat 5))))
  2. Exponentiation is iterated multiplication.  (= (expt 5 4) (reduce * (take 4 (repeat 5)))) ; if you've imported clojure.contrib.math/expt
  3. So what do you get if you iterate exponentiation?  Is it useful?
Another way to look at the question is: logarithms strength-reduce by one level, hence the log rules like (= (+ (log a) (log b)) (log (* a b))) and the definition of log as the inverse of exponentiation, just as division and subtraction invert multiplication and addition.  What, then, is the law for (expt (log a) (log b))?  Again, this should be the log of our mystery operation on a and b.
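Playing with the question concretely may help; here's a throwaway sketch (in Python, with a function name of my own invention) of iterated exponentiation, folding from the right because exponentiation is right-associative:

```python
def tetrate(a, n):
    """Compute a ** a ** ... ** a with n copies of a (right-associative)."""
    result = 1            # the empty power tower
    for _ in range(n):
        result = a ** result
    return result

# tetrate(2, 3) == 2 ** (2 ** 2) == 16
# tetrate(2, 4) == 2 ** (2 ** (2 ** 2)) == 65536
```

Note how fast it explodes: the height-4 tower of 3 already has over 3.6 trillion digits.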

It turns out Wikipedia has me covered.  A tiny little section on the Exponentiation page links off to tetration, Ackermann function, and Knuth's up-arrow notation.  There goes my night.

(Bonus: I finally have all the background to understand the third panel of xkcd 207. #latetotheparty)

Tuesday, July 17, 2012

find: arguments are not actually one big expression

I first learned about find somewhere around 12 years ago, so the documentation today might contradict me, but I’ve been carrying around in my head this false notion: that flags like -print0 participate in the conditions as a true value while they update a flag that sets the eventual output format.

They don’t, in fact, work that way.  find accepts -print0 -a «expr» syntax just to mess with you: -print0 is an always-true action that prints the filename the moment evaluation reaches it.  That means these two commands are equivalent:
find . -print0 -a «expr»
find . -print0
And in fact, if you offer multiple -print options, you'll get each filename printed out multiple times.

I was intending to do CRLF→LF translations only on text files, and the extension-matching came after the -print0.  Since find emitted no warning, I only noticed the damage when I deployed the website and looked at Firefox valiantly trying to make sense of all the broken images.

The correct way to write the command is actually:
find . «expr» -print0
This triggers the -print0 “action” only once the complete expression has matched.  And having an explicit action suppresses the regular/default -print, of course.

Thursday, July 12, 2012

Unintended Consequences (an API Design Rant)


  • Your package needs to understand encoding. It can’t just throw away structure and hope for the best.
  • If a package is overly simple, it’s likely to be too simple for real-world use, and thus likely to be reimplemented with a “slightly less awful” hack, because it wasn’t that big in the first place.
  • MIME won.  Email packages should understand/generate MIME by default where necessary, and only avoid MIME processing at the caller’s option.  If that.
  • Perl’s Unicode implementation adds too much complexity (and it doesn’t help that people don’t agree on terminology).  Now package authors get to write two APIs, one for Unicode and one for octets.


Consider the simple program:
use Email::MIME;
my $m = Email::MIME->create(
    header_str => [
        # the address list was truncated here; placeholders, except the last one
        To => join(', ', qw(alice@domain.example.net bob@domain.example.net
                            break@domain.example.net)),
    ],
);
print $m->as_string;

What happens if you run this on Perl 5.10?  The final email address is “break@domain.example .net”—note the space between “example” and “.net”!

As it turns out, such a thing may actually be legal per RFC 2822: the obs-domain syntax allows for embedded CFWS around the dots of the domain.  However, Amazon SES doesn’t support it, so some email was bouncing with the error message, “Domain contains control or whitespace.”

The rogue space explains the SES failure, but how did it get there in the first place?

Friday, July 6, 2012

Keep the Pieces

When a low-level function is going to write out a string, for instance the To header of an email, I often find myself tempted to “just” pass down strings.  Often, I find later that I would rather have the header in array form at some intermediate level, so that I can add recipients only if they’re not already present.  I’m then forced to parse the string in some manner, with that choice requiring some balance of performance and correctness.  (It’s tempting to make code deal only with the subset of the RFC you think you’ll need.)

If this happens more than once on the way down (“some errors should email us admins if we’re not already involved in this transaction”), it gets even worse: build original array, reify to string, {parse, modify, reify} × 2.  Whereas just handing the array down through the layers looks more like build, modify × 3, reify and send.

Letting the lowest layer put the data on the wire in the format the wire requires can also be more robust: if there are only Bcc recipients and the To address is “Undisclosed-recipients: ;” then checking whether To is empty loses its simplicity: it can have a non-empty value and yet not have real recipients.  Also, nobody at the higher layers has to care whether your addresses are actually separated by comma or by semicolon.

Finally, this lets you push down basic cleaning like calling array_unique() into the lowest level, meaning each modification along the way can quickly append and trust the result will be safe on the wire.  All those layers become more concise and readable.
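As a sketch of the shape (in Python, with made-up function names): layers above only ever append to the structured form, and the bottom layer dedupes once and produces the wire format:

```python
def add_recipients(recipients, *new):
    """Intermediate layers just append; no parsing, no string surgery."""
    recipients.extend(new)

def reify_to_header(recipients):
    """The lowest layer dedupes and reifies, right before the wire."""
    return ", ".join(dict.fromkeys(recipients))  # preserves first-seen order

to = ["alice@example.com"]
add_recipients(to, "admin@example.com")
add_recipients(to, "alice@example.com")   # duplicate: harmless until reify
header = reify_to_header(to)              # "alice@example.com, admin@example.com"
```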

Thursday, July 5, 2012

Auth doesn't belong in the session

PHP locks session access by default, from the time you call session_start() until session_commit(), or until the response is fully written if you didn't commit it earlier.  If you store your authenticated state ("user bob; expires at 12:30") inside the session, then you have to open the session any time you need to know who the user is.  If that makes you set up your app to open the session automatically and leave it open the whole request, then you're hurting parallelism if you have read-only operations.

If you store the auth info in a separate, MAC'd* cookie instead, then you can read the auth state without affecting the session.  Of course, the auth cookie is the most powerful one, so all possible protections should apply: HttpOnly and Secure, served over HTTPS.

* Don't let your users impersonate each other by editing their own cookies.
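A minimal sketch of the MAC'd cookie, in Python for illustration (PHP would use hash_hmac; the key and field names here are hypothetical):

```python
import hashlib
import hmac

SECRET = b"server-side key, never sent to clients"  # hypothetical key

def make_auth_cookie(state):
    """Append an HMAC so users can't forge auth state by editing the cookie."""
    mac = hmac.new(SECRET, state.encode(), hashlib.sha256).hexdigest()
    return state + "|" + mac

def read_auth_cookie(cookie):
    """Return the auth state if the MAC verifies, else None."""
    state, _, mac = cookie.rpartition("|")
    good = hmac.new(SECRET, state.encode(), hashlib.sha256).hexdigest()
    return state if hmac.compare_digest(mac, good) else None
```

Reading this cookie never touches the session, so read-only requests keep their parallelism.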

Tuesday, May 8, 2012

Are moose heavy? There's mouse.

I happened across Mouse in the dependency chart of an Amazon Route53 module, so I looked it up.
  • Mouse is meant to be lighter than Moose, and compile faster (for CLI and primordial CGI).
  • Mouse wants to let you do s/Mouse/Moose/g and have nothing break, if it turns out Mouse isn't heavy enough.
  • Mouse also exports warnings and strict for you when you use it.
  • Mouse basically doesn't want to have MouseX.
  • Any::Moose gives you Moose if that is already loaded, or Mouse otherwise.
So there you have it: Mouse is a lightweight Moose.  Without antlers.

Updated 2013 Feb 11: Apparently there's also Moo and it's preferred, at least for today.  Because one isn't enough in Perl.

Sunday, May 6, 2012

Python: Slicing in reverse, in the middle of a sequence

When slicing forwards, it's relatively simple to understand: s[7:9] returns a 2-item sequence of elements 7 and 8.  This works pretty much like any other half-open interval, in which one side (the 7) is included and the other (the 9) excluded.  The resulting length is simply the difference between the end and start indexes, 9-7=2.

What about backward? If you reverse the numbers and add a stride value, s[9:7:-1] gives you elements 9 and 8.  Since the interval is still half-open, now 9 is on the closed end and included, and 7 is open and excluded.  So s[8:6:-1] is the reverse of s[7:9].  You're getting two elements, starting at 8 and ending before 6, going backwards.

What happens if you want to get the reverse of s[0:5]?  The above math would suggest s[4:-1:-1] but negative indexes are way at the other end of the sequence, so this produces an empty result.  The correct answer is actually omitting the end index, as in s[4::-1].  That invokes the regular "all items remaining in sequence" meaning, that is also used in s[9:].
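All of the above is easy to verify at the REPL; a quick summary:

```python
s = list(range(10))

assert s[7:9] == [7, 8]             # half-open: 7 included, 9 excluded
assert s[9:7:-1] == [9, 8]          # reversed: 9 included, 7 excluded
assert s[8:6:-1] == [8, 7]          # the reverse of s[7:9]
assert s[4:-1:-1] == []             # -1 is the *last* index, so this is empty
assert s[4::-1] == [4, 3, 2, 1, 0]  # omit the end index to reach the front
```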

Wednesday, May 2, 2012

PPTP is legacy and insecure

Unlike IPSec and L2TP+IPSec, a PPTP VPN tunnel is carried over TCP, which means all packets traveling inside the tunnel are delivered reliably—including any tunneled TCP traffic.  Therefore, the inner TCP layer never sees packet loss, wreaking havoc with its congestion control mechanisms.

Although the RFC refers to the tunnel as "separate", that's only conceptual.  The traffic is carried inside of the single TCP connection.  For example, SSH application data delivered over PPTP is wrapped in its usual TCP/IP, then PPP+GRE (the tunnel/"user data" from PPTP's perspective), then forwarded over the TCP/IP control connection to the PPTP server.

L2TP produces a similar wrapping structure, but the outermost connection is UDP instead.  Its IP frames are wrapped in PPP and then L2TP, and delivered to the server over UDP/IP.

Consider L2TP+IPSec first.  (My experience with vanilla IPSec, at least, has been less than stellar.)

Update: In the news this year (2012), PPTP's key exchange is broken, and that's actually been known about for years.  There's even a cloud service to crack it for you if you're lazy.  Current advice is actually:
Never use PPTP.

Reference: Moxie Marlinspike's post detailing the 2^56 complexity of the attack: the MD4 hash of the password is split up for use as DES keys, which all encrypt the same known plaintext, so cracking a single DES key at a time recovers each segment of the hashed password.  The plaintext password is never needed by the protocol.  As a bonus, the final DES key is padded with five zero octets, since the 16-byte hash isn't long enough to fill three DES keys.

Monday, April 30, 2012

Uncontrolled Mutation: An Example of Inefficiency

I pruned /etc/apt/sources.list a bit tonight, because it annoys me when I have to wait for a lot of data just to answer the question, "So do I need to upgrade anything?"  Really, when a handful of updates are published, I should not have to get a fresh 4.6 MB copy of the entire multiverse package tree.

Since the scope of potential mutation is the entire index file, all clients only get file-level granularity for controlling the amount of data they download.

Tuesday, April 24, 2012

A New Way to Write Broken Perl

I called $cls->SUPER::new in a package constructor that was erroneously missing a base class, which issues the helpful diagnostic:
Can't locate object method "new" via package "The::Package::Name" at .../The/Package/ line 12.
So Perl says it can't find the method at the exact site of the definition, which is weird enough, but the real problem is triggered by the line with the SUPER call.  Once I added the use parent 'The::Parent'; line to the package, everything was fine.

This behavior was observed with Perl 5.10.1 as presently available through the repository for CentOS 6.2.

Monday, April 16, 2012

AmazonS3 'headers' and 'meta' options in the PHP SDK

When you're using create_object, or several other methods of the AmazonS3 class, the $opts parameter often allows for both headers and meta keys.  Although headers is documented as "HTTP headers to send along with the request", it turns out that they are returned with the object (i.e. in the response) when that object is requested from S3.  In contrast, keys in meta are canonicalized, prepended with x-amz-meta-, and returned that way.

That is, if you want to upload filenames like "$id-$md5.pdf" but deliver them to the user as "ContosoUserAgreement.pdf" in the Save-As dialog, then headers should contain a Content-Disposition key with a value of attachment; filename="ContosoUserAgreement.pdf".

If you put it in meta instead, then the HTTP headers on retrieval will contain an x-amz-meta-content-disposition header instead, which the browser will not honor.
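A toy illustration of the naming rule (not the SDK's actual code), showing what comes back on a later GET:

```python
def headers_on_get(headers, meta):
    """headers come back verbatim; meta keys are lowercased and prefixed."""
    result = dict(headers)
    for key, value in meta.items():
        result["x-amz-meta-" + key.lower()] = value
    return result

cd = 'attachment; filename="ContosoUserAgreement.pdf"'
# in `headers`: the browser sees a real Content-Disposition
assert headers_on_get({"Content-Disposition": cd}, {}) == {"Content-Disposition": cd}
# in `meta`: it comes back renamed, and the browser ignores it
assert headers_on_get({}, {"Content-Disposition": cd}) == {"x-amz-meta-content-disposition": cd}
```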

I found all this out by uploading something like 12,000 files with the wrong metadata.  I then wrote a script to fix it, which ran straight into the problem that update_object doesn't work, so you have to use copy_object instead.  Note that when using copy_object with 'metadataDirective' => 'REPLACE', you need to specify all the metadata you want, because it does what it says: it deletes all old metadata before adding the new metadata from the copy_object request, so that only the new metadata exists afterwards.

Tuesday, March 27, 2012

Triumphs of the Worse

I know I seem like an irrelevant dinosaur to you node.js hipsters, but there are some valuable lessons floating around there that the next+1 new hotness would be wise to consider.  They all stem from the questions: what were the popular web languages?  And why?

Wednesday, March 21, 2012

Language Driven

If templates weren’t languages, we wouldn’t call them template languages, nor have so many of them: HAML, HTML::Mason, TemplateToolkit, PHP, Smarty, pagelib and Output, and so forth.

(This is essentially a repost of something that used to be on my old blog.)

Saturday, March 10, 2012

An interesting recovery scenario

My computer didn't boot last night.  It failed waiting for the root filesystem, and dropped to a busybox (initramfs) prompt.  After nosing around a bit, trying to figure out what to do to make the "regular" boot process happen, I noticed that the root device was present, so I figured it couldn't hurt to try Ctrl-D.

It promptly booted up into recovery mode on the failure of a filesystem to mount, so I panicked and backed up /home while it was still there.  That made the apparent disappearance of sda even stranger, because that's where the active MBR, grub, and /boot live: the latter holding the kernel and initramfs image that were now running.  If the disk was gone, how did I boot from it?

Long, boring story short: the disk didn't vanish until something triggered a read of the whole structure of the disk.  grub only needs the first few sectors to get started booting, so it wasn't until the later search of filesystem UUIDs that the kernel somehow wedged the disk so hard that it needed a cold reboot to re-appear.  A warm reboot would hang the POST at drive detection.

It turns out that libata knows how to hard-reset a SATA drive, and that does work here, but it never fell back to the hard reset until I booted with libata.force=nosrst included on the kernel command line.  (This boot also happened to be from a USB stick, so that I could have a functional linux to examine the damage with.)  That let me get the drive working enough to do a fsck, which restored it to fully operational.

Now I have a backup of the "more reliable so I don't need to back it up, besides it'd take forever to dribble 15 GB out over usb2" volume.  I want to say that's the worst 15-minute savings ever, but OTOH, not having the backup meant it was worth trying to fix the problem instead of writing off the drive and its data as a loss.

Update: this happened again, so I added a drive, fixed the dying drive again, and migrated everything to the new disk.  I'm glad I set up lvm ages ago, because pvmove made it 95% easy.

Tuesday, March 6, 2012

UI Consistency: vim and emacs

I was adding a little more vim customization this morning, to work around the fact that the vim setup on the ec2 instance is completely silly about editing php code (php.vim includes html.vim, which acts like it owns the entire file, and thus calls set matchpairs+=<:> and set comments=…; this is okay for actual html, but php.vim doesn’t restore these settings, so you get error bells when writing $foo->bar() from the unbalanced ‘>’, and you can’t format your PHP comments anymore because they’re completely unrecognized.)

Anyway, this led to the realization that my vim setup, as large as it is growing, is built on a large batch of external code, which is in turn dependent on both vim version and the distribution.  Due to subtle changes to hundreds of defaults—especially complex values like the aforementioned matchpairs and comments settings—it’s impractical to predict how a given vimrc will behave in a random environment.

Thanks to site-lisp, I am guessing that emacs has essentially the same issue: your initialization may interact with site-lisp code to produce per-host variation in behavior.

For something I’m going to spend my day in, that’s a painful situation.  The variation interrupts the flow of expected results, pushing the editor back up into my consciousness.  But would I want an editor I couldn’t customize?

Friday, March 2, 2012

Notes on Sharing a Unix Account

After accidentally breaking an ec2 instance while trying to set up a separate user account with admin rights, I decided to keep using ec2-user and set it up to coexist with any other people who logged in on the account.  I don't expect my boss wants all my shell/vim/etc. customizations.

To achieve this, I took advantage of openssh's environment support (which required enabling PermitUserEnvironment yes in /etc/ssh/sshd_config) to set a variable when I log into the server with my key pair:

environment="VUSER=sapphirepaw" ssh-rsa ...

Next, a one-line change to ~/.bashrc:

[ -n "$VUSER" -a -r "$HOME/.$VUSER/bashrc" ] && . "$HOME/.$VUSER/bashrc"

That newly-sourced bashrc then takes care of setting up the rest of the world, with code like:

mydir="$HOME/.$VUSER"   # implied by the bashrc line above; everything lives under it
export SCREENRC="$mydir/screenrc"
export VIMINIT="source $mydir/ec2init.vim"
alias st="svn status -q"

Notice that vim doesn't support any sort of "find the vimrc here" environment variable, but it does allow for arbitrary Ex commands to run, so I used that instead.  (Hat tip to this helpful message.)  ec2init.vim then reads:

let s:rdir="/home/ec2-user/.sapphirepaw"
let &rtp=s:rdir . "/vimfiles," . &rtp . "," . s:rdir . "/vimfiles/after"
exec "source " . s:rdir . "/vimrc"

This expands all the variables soon enough to be useful, and also means that if I ever move/reconfigure the root directory name, I will have only one place to change it in vim.  And from there, all my settings are loaded.  Life is good again.

Tuesday, February 28, 2012

VirtualBox Webcam Activation

Perhaps your webcam turns on when starting a guest VM.  (thread 1, thread 2, thread 3, none with helpful info.  There's also at least one bug for it, only two months old but still silent at time of writing.)

It turns out that this is related to audio.  If I have a VM with the audio disabled—such as one provisioned through vagrant—then it doesn't turn on the webcam.  I conclude that VBox is actually attempting to use the camera's microphone as audio input.

(I'd report this on the bug, if they had OpenID or something.  But building a whole Oracle Account just for this? Meh.)

Monday, February 27, 2012

Broken By Default

This is why everything that uses openssl needs to configure a cipher list:
Mon 12:38 ~$ openssl version
OpenSSL 1.0.0g-fips 18 Jan 2012
Mon 12:38 ~$ openssl ciphers DEFAULT | sed -e 's/:/ /g'
I cut the stronger ciphers from the output, leaving weak ones: everything that is EXP (pre-2000 export strength, 40- or 56-bit keys) or DES.  I decided to let triple-DES slide even though it's legacy and limited to 112 bits of security.  I also let KRB5 and PSK slide, even though my understanding is that they're useless on the public Internet, due to needing to share a Kerberos setup or key (resp.) with the client in advance of the connection being made.

Due to the weak ciphers being included by default, everyone needs to specially configure their server to gain true security.  This means that all admins who want to do it "right" must keep up on all advancements in the field of cryptography, and distinguish real breaks from crackpot allegations.  All admins who want it to "work" will just search the web and paste in whatever cipher suite they find, potentially leaving them vulnerable to BEAST.  Meanwhile, that library we trusted to provide security is doing its best to avoid giving it to us.
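For illustration, doing that configuration yourself through Python's ssl module (which wraps OpenSSL; the cipher string is an example of the mechanism, not a vetted recommendation):

```python
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
# Start from HIGH and explicitly throw out export, DES, RC4, and
# unauthenticated suites, instead of trusting DEFAULT.
ctx.set_ciphers("HIGH:!aNULL:!eNULL:!EXP:!DES:!RC4")
names = [c["name"] for c in ctx.get_ciphers()]
```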

In other news: SSL Deployment Best Practices (PDF).

Saturday, February 25, 2012

War Story: The day I nearly invented server management

Long before I knew about puppet (or heard about chef via vagrant), I was working on a system with around 100 virtual hosts, nearly all of them configured from a small pool of standard features.  This made for a long, complicated vhosts.conf file, which we managed by keeping it in alphabetical order and using search a lot.  I realized that a lot of the duplication could be removed by generating it from a template file.
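The generation step amounts to something like this (a Python toy with a made-up two-field template; the real file was built from a small pool of standard feature blocks):

```python
# Hypothetical vhost template; the real one carried more standard features.
VHOST_TEMPLATE = """<VirtualHost *:80>
    ServerName {name}
    DocumentRoot /var/www/{name}/public
</VirtualHost>
"""

def render_vhosts(names):
    # Keep the generated file in alphabetical order, like the hand-managed one.
    return "".join(VHOST_TEMPLATE.format(name=n) for n in sorted(names))
```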

Thursday, February 16, 2012

Just a dream

Last night, I was dreaming that they took all the special variables out of perl because they’re too powerful… everything like $_ and $. and $[.  Basically, all of perlvar was completely removed from the language.

Except, if you knew where to look for it, this one was still there.  You assigned a function to it and it became a handle for modifying Reality itself.  I linked it to tr/[a-z]/[A-Z]/ and pushed the frame of it over the plate-glass window of the bank, and the gold-foil lettering announcing the name of the bank*, indeed, changed to uppercase.

Then I woke up, as I realized that they had left the power in there for their own ends, and would probably kill unauthorized users…

* It was a long, overdone, corporate-flavored name like “Actual Bank of the United States of America, A Global Services Marketplace Group Company” (where "Actual Bank" is the name of a real bank, but not Bank of America.)

Saturday, February 11, 2012

Layer Juggling

  1. vim windows (Ctrl+W{w, W, h, j, k, l, ...})
  2. vim tabs (gt, gT, :tab, ...)
  3. screen session (Ctrl+Z ...) [because I liked Ctrl+A as beginning-of-line]
  4. terminal window tabs (Ctrl+{PageUp, PageDown, Shift+PageUp, Shift+PageDown, ...})
  5. application windows, e.g. other terminals (Alt+`, Alt+Shift+`) [a distinction newly required in Unity and Gnome-Shell’s defaults]
  6. other applications (Alt+Tab, Alt+Shift+Tab) [may include all workspaces]
  7. other workspaces (Ctrl+Alt+{↑, ↓} for gnome-shell and Unity, additionally Ctrl+Alt+{←, →} for Unity; also with Shift to drag a window with you)
My setup of Windows at work doesn’t have layers 4, 5, or 7, and it’s still too many to handle effectively.  Even if the command is right for the intent, the wrong result can still happen, for instance when I go to change window in vim and firefox closes a tab or two because it had the keyboard focus.

I get layers 2 and 3 mixed up so frequently that I typically only have tabs open in vim for a wide-ranging interface change inside my code, where I need to update model, validation, and view/controller all at once.  Each of those scopes gets a tab, and the tab is split into windows for each affected file of that particular scope.  If I have to muck around in more than two different layers at once, it gets extremely error-prone.

I think this is the reason people try to do everything inside emacs: if it’s run within a single frame, which I boldly claim is the common case, it combines layers 1 through 4 into a common framework, and leaves only layer 6 as important on the desktop.  You don’t need workspaces to tame a sprawling collection of windows anymore, because most of them are inside emacs.


Thursday, February 9, 2012

TLS (nee SSL) and SSH: A Compact Comparison

TLS and SSH rely on basically the same math for their connections.  The connection is initiated with asymmetric cryptography, and part of that initial exchange establishes (encrypted, of course) a symmetric session key for faster encryption of the main session traffic.

The main difference comes in how that initial asymmetric key is determined to be the one that legitimately belongs to the server, rather than an attacker who is trying to intercept communications.

Sunday, February 5, 2012

Race conditions, they're everywhere

Ever since I added an SSD, once in a while, gnome2 won't be able to load a random panel applet on login.  (It was a dual-core system until the slow death of I/O interfaces on the motherboard finally consumed SATA, at which point I replaced it with a quad-core board.)  That means that every now and again, I get an error dialog naming the failed applet by its OAFIID and asking whether to delete it from the panel.

This OAFIID is actually relatively transparent.  Sometimes, I get ones that look more like a UUID and I ask the dialog, "How does that give me any information?"  There's not even any indication about what the error was.  (It is floating around in ~/.xsession-errors, with the helpful indication that the child didn't return an error.)

Fortunately, the answer is always the same: Don't Delete.  Things will most likely work next login.

A first impression of gnome-shell

Recently, I picked up a spare 8GB USB thumb drive so that I could test out various distributions.  I spent yesterday running Fedora 16 that way, to give gnome-shell a shake or two, since it was resisting my efforts to get it to run under 3D acceleration in VirtualBox.

It turns out that there's one detail I can't stand, and I'd especially not be able to stand if I were switching between gnome3 at home and MS Windows at work: they broke Alt+Tab by introducing Alt+Grave.  With two browser windows and Rhythmbox open, I kept getting Rhythmbox instead of "the last used window".  Also, the Alt+Tab switcher considers all workspaces, which defeats the point of them.

So the race for "what to do when Ubuntu 12.04 comes out" is down to Unity vs. KDE for me.  (I've been sticking to LTS versions since I no longer particularly like fixing things and adapting to gratuitous changes every six months.)

Thursday, February 2, 2012

Linux and BSD

A long time ago, someone characterized the difference between Linux and FreeBSD something like this: “Linux just hacks stuff in randomly.  BSD guys think about how to do it right, then proceed.”

Given that frame of reference, I think of myself as a BSD style programmer.  It may have taken me a ridiculous amount of time to get Amazon SES up and running, but that’s because I went through the existing open(SENDMAIL, "|$sendmail") style code and replaced it with building the email via Email::MIME, with correct encodings and charsets; now I’m more or less guaranteed to generate MIME compliant messages, without copying around boundary generation etc. through all the places that need to send mail.

And before all that, I had to understand how Perl handled Unicode so that I could understand how to make everything work, always.  For real this time.

Thursday, January 26, 2012

Plain Old Data

I’m coming to the conclusion that there’s actually no such thing as “plain data;” it always has some metadata attached.  If it doesn’t, it might be displayed incorrectly, and then a human needs to interfere to determine the correct metadata to apply to fix the problem.  (Example: View → Character Encoding in Firefox.)  Pushed to the extreme, even “just numbers” have metadata: they can be encoded as text, a binary integer/float (IEEE 754 or otherwise) of some size/endianness, or an ASN.1 encoding.

Another conclusion I’m reaching is that HTTP conflates all kinds of metadata.  Coupled with the lack of self-contained metadata in file formats and filesystems, things start to accumulate hacks.

Wednesday, January 25, 2012

What If: Weak Memory Pages

Raymond Chen wrote about the "what if everybody did this?" problem of applications written to consume up to some threshold of memory and free some of it under pressure: if multiple applications have different thresholds that they're trying to maintain, then the one with the smallest-free threshold wins.  Of course, the extreme of this is a normal application that doesn't try to do anything fancy, which acts like it has a negative-infinity threshold.  If it never adjusts its allocations in response to free memory, then it always wins.

Some of the solutions batted around in the comment thread involve using mmap() or other tricks to try to get the OS to manage the cache, but this brings up its own problems.

Wednesday, January 18, 2012

Perl and Unicode in Brief

Perl requires a knob for every I/O, and expects you to set them all correctly yourself.  By default, they're all off (Unicode-unaware) for backwards compatibility.
  1. If you want to handle Unicode and avoid The Unicode Bug, in which your strings sometimes act like they aren't actually Unicode: in perl 5.12+, use feature 'unicode_strings';.  For older perl, see Unicode::Semantics, or use utf8::upgrade by hand.  These methods achieve their task by forcing "the UTF-8 flag" on for the string.
  2. If you want strings in your source text with non-ASCII: save it as a utf-8 encoded file and use utf8;.  Or you can encode Unicode code points with hex-escapes, \xae → ®, or \x{30ab} → カ.  There are technically other options, which have additional drawbacks (utf-16 breaks the #! line; latin-1 is restricted to latin-1 unless you decode it yourself.)
  3. If you want to print to a UTF-8 aware environment like your terminal emulator or CGI STDOUT after issuing a Content-Type: text/html; charset=utf-8 header: setting UTF-8 on the filehandle with binmode(STDOUT, ':utf8') is the minimum, but :encoding(utf-8) instead of :utf8 makes stricter guarantees that real code points are coming out.
  4. If you want to read a UTF-16 encoded document into a Unicode string with minimal fuss: open(FH, '< :encoding(utf-16)', $name).  Note that the document has to be correctly encoded.  You can use the Encode module's decode function if you need finer control over error behavior, but that's naturally more fuss (note that decode takes the encoding name first): use Encode; open(FH, '<', $name); while (<FH>) { $line = decode('utf-16', $_, $POLICY); ... }
  5. If you want to convert a Unicode string to a specific set of bytes for some encoding-unaware module to throw on the wire, use the encode function from the Encode module: use Encode; $message->attr('content-type.charset', 'utf-16'); $message->data(encode("UTF-16", $body));  (This example would be for MIME::Lite, if you're curious.)
  6. If you want to read a file encoded with charset X, into a string encoded with charset Y, I've found no instant way to do this.  It's probably best to pass the input-encoding along as the output-encoding if at all possible.  But you might find the Encode module's from_to(), or string-IO as in IO::File->new(\$out, '>:'), or maybe a whole PerlIO filter as in PerlIO::code helpful if you can't.
  7. If you see "Wide character in ..." warnings, then you passed a string with code points >=0x100 to something that expected a byte string of some sort: either really latin-1, or an encoded string.
  8. If you see longer strings of gibberish where you expected sensible non-ASCII characters, then you have probably double-encoded, either literally, or by printing an encoded string to a filehandle which does encoding.
  9. If you see the Unicode replacement character in a stream that should be UTF-8, you haven't encoded at all, such as printing a byte string on a raw filehandle in an environment expecting UTF-8.  Most likely, the filehandle should have an encoding set on it, per point #3 above, though that may cause #8 on other strings you've printed.
  10. If you are using modules, each may or may not deal with Unicode.  DBD::mysql has the mysql_enable_utf8 option; Email::MIME accepts encoded strings via body, and decoded ones through body_str, but for the latter, you must also set the charset and encoding attributes (which correspond to the charset of Content-Type and the Content-Transfer-Encoding, respectively).  MIME::Lite does not handle decoded strings at all and hopes for the best.
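
Points #4 and #5 come down to Encode's two workhorse functions.  A minimal round-trip sketch (the string literal is just an example; note that the encoding name comes first in both calls):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "na\x{ef}ve";               # a 5-character Unicode string
my $bytes = encode('UTF-16', $chars);   # characters -> UTF-16 octets (BOM included)
my $back  = decode('UTF-16', $bytes);   # octets -> characters again

# The character count survives the round trip; the byte string is longer.
print length($chars), " ", length($bytes), " ", length($back), "\n";
```

The same pattern applies to any encoding Encode knows about; only the name in the first argument changes.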
The most difficult thing for me to come to terms with was that Perl has no notion of "the string's encoding," despite being Unicode-aware and having the UTF-8 flag.  A string is always a series of character code points; an "encoded string" or a "byte string" is simply one whose code points all have values <= 0xFF.  The UTF-8 flag is almost irrelevant, except where it leaks out into point #1 because Unicode-unaware scripts are given the illusion that Perl is still single-byte.  Unicode-aware scripts get stuck dealing with all the usual Unicode issues, plus having to avoid falling into Unicode-unaware mode by accident.
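
That irrelevance of the flag can be demonstrated directly: two strings holding identical code points compare equal no matter which internal representation Perl chose for them.  An illustrative sketch (variable names are mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $plain    = "caf\x{e9}";   # e9 fits in a byte, so Perl may store raw bytes
my $upgraded = $plain;
utf8::upgrade($upgraded);     # force the UTF-8 internal representation

# Comparison is by code point, so the internal flag is invisible here.
print $plain eq $upgraded ? "equal\n" : "different\n";   # equal
print utf8::is_utf8($plain)    ? 1 : 0, "\n";            # 0
print utf8::is_utf8($upgraded) ? 1 : 0, "\n";            # 1
```

utf8::is_utf8 reports only the internal storage format, which is exactly why it cannot tell you "the string's encoding."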

Friday, January 13, 2012

A nice vim highlighting hack

I wanted to highlight places where control flow could be redirected in my perl code, so I hacked up my personal colorscheme file to highlight Exceptions specifically:

hi Exception ctermfg=white ctermbg=blue

Now, I just needed to define the things I wanted highlighted as Exception*.  Thus, the newly added ~/.vim/after/syntax/perl.vim:

" flow control highlighting
syn keyword perlStatementCtlExit return die croak confess last next redo
syn keyword perlStatementWarn    warn carp cluck
hi link perlStatementCtlExit Exception
hi link perlStatementWarn    Statement

" and i'm tired of everything being yellow
hi link perlStatementStorage Define

The last line isn't related to the above, but it recolors my/local/our in Preprocessor Blue instead of Statement Yellow.  They do, after all, affect the state of the compiler at parse time.

* This means that I'm going to open a non-Perl file sometime and weird things will have Exception highlighting.  Nobody notices the subtle differences when it's all Statement colored by default.

Wednesday, January 11, 2012

Layer 7 Routing: HTTP Ate the Internet

In the beginning was TCP/IP, and the predominant model was that servers would listen for clients using a pre-established port number.  Then came Sun RPC, in which RPC servers were established dynamically, and listened on semi-random ports (still, one port per service provided); the problem was solved by baking the port mapper into the protocol.  The mapper listens on a pre-established port, and the client first connects there to inquire, "On what port shall I find service X?"

Then came HTTP, the layer 6 protocol masquerading as layer 7.