Monday, July 22, 2013

Bootstrapping an EC2 Instance, 2013 Edition

In the interests of "not implying old posts on my blog are still state-of-the-art", here's a quick overview of how I manage EC2 instances at work, modernized.

First, a general outline of how we handle our EC2 farm: we're doing B2B work and have a predictable daily cycle, so we leave one instance running overnight and spawn more to handle additional traffic during US business hours.  This is done with some scheduling rules in Auto Scaling, which mandates having an AMI to launch.  To minimize the time involved in daily startup, we have a "bootstrapping" stage where we start up an Amazon Linux instance, load our app code and config onto it, then bundle it into an AMI that becomes our auto-scaling AMI.

To allow us to hot-update code at any time, each instance that comes up also connects to an SNS topic to listen for deployment messages.  I may go into more detail on that in the future, but for now, know that this requirement affects the design because an instance of our AMI needs to register itself with SNS when Auto Scaling launches it.  Prior to that point, it doesn't know the address to register with SNS.

A Review of the Old Way

There existed an S3 bucket with a public first-stage boot script.  This script was initially launched by cloud-init and used the credentials from an IAM Role to access the private second-stage scripts on S3.  stage2 code could then have private credentials embedded, and was laid out as a script-runner and a tarball containing scripts to run.  There was also support for using tags to specify an alternate tarball in the bucket, for testing new stage2 scripts without disturbing the existing default set.

One of stage2's duties was to drop a "re-init" script and connect it to the boot sequence, so that on reboot (or launch of the built AMI) it would kick off the process of retrieving and executing stage1.

This ran into many problems with cloud-init.  Troubleshooting was difficult.  The straw that broke the camel's back, however, was that cloud-init would save the stage1 code into the AMI and execute that cached stage1 code when a new instance was started.  I had to add another stage2 script to nuke cloud-init's cache, or commit to never ever changing the stage1→stage2 hand-off.

The other major problem with this setup was that all of stage2 had to be careful not to accidentally repeat one-time actions, such as adding the limited user that would own our DocumentRoot.  It seemed like a lot of waste to run all these scripts on every boot, when they usually wouldn't do anything.
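To make the "don't double one-time actions" burden concrete, here is a minimal sketch of the kind of guard each such step needed.  It is an illustration only, not the original stage2 code: the function names, the `webowner` user name, and the use of `useradd --system` are all assumptions.

```python
import pwd
import subprocess


def user_exists(name):
    """Return True if the account is already in the password database."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def create_user(name):
    """Create the account (assumes a Linux host with useradd, run as root)."""
    subprocess.run(["useradd", "--system", name], check=True)


def ensure_user(name, exists=user_exists, create=create_user):
    """Create the user only if it is missing; return True if it was created.

    The exists/create hooks are injectable so the guard can be exercised
    without touching the real system.
    """
    if exists(name):
        return False  # a previous boot already did the work
    create(name)
    return True
```

Calling `ensure_user("webowner")` on every boot is safe, but after the first run it does nothing, which is exactly the waste described above.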

A minor pain point that never turned out to be a big deal was that this system was divided into three segments that ran in sequence: global setup, service setup, and global cleanup.  If multiple services needed Apache installed, then each of their files had to include httpd in its own package selection.  There wasn't any notion of 'inheritance' where a service could say it required Apache, and have Apache installed before it.

Finally, a theoretical problem: the dependence on AWS infrastructure such as cloud-init and IAM Roles meant that all this machinery was nearly useless for setting up a local virtual machine for development.  If we ever hired another developer, turning my VM into something they could use was going to be downright painful.

Enter the Modernization

Taking the above into account, I rewrote the stage2 runner to have a concept of "phases".  The first time through, to customize a fresh Amazon Linux instance, the init phase runs; it drops an init script that runs the update phase on every later boot.  Anything that needs to know about the live instance (as started by Auto Scaling) goes in update; everything else, including package selection and CPAN module builds, happens during init only.
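The phase split can be sketched roughly as follows.  This is not the actual runner: the marker file stands in for "the init script has been dropped", and the function and step names are hypothetical.

```python
from pathlib import Path


def choose_phase(marker):
    """Pick 'init' on the first run and 'update' on every later run."""
    return "update" if Path(marker).exists() else "init"


def run_boot(marker, init_steps, update_steps):
    """Run the steps for the current phase; return labels of what ran."""
    phase = choose_phase(marker)
    steps = init_steps if phase == "init" else update_steps
    ran = []
    for step in steps:
        step()
        ran.append(f"{phase}:{step.__name__}")
    if phase == "init":
        # Stand-in for installing the boot-time script: from now on,
        # every boot of this image takes the update path.
        Path(marker).touch()
    return ran
```

With a package-installing step in `init_steps` and an SNS-registering step in `update_steps`, the first call runs only the former and every later call runs only the latter.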

I also added some infrastructure to be able to start up an instance, upload a shell script to it, then log in and run it.  This goes with a new shell script that embeds AWS credentials for accessing the S3 bucket.  This means that technically, stage1 no longer has to be public.  It also means that the instance no longer needs any IAM Role.  These both simplify the launching of a new instance.

With a tiny bit more plumbing to connect this "Amazon Linux → configured company instance" step to scripting I already had in place for "configured company instance → AMI", I now have a one-touch command to build a new AMI from scratch.  All it needs is a Bourne-flavor shell, OpenSSH, and the AWS EC2 command-line utilities.

I also added some flourishes to the inheritance code, so that if the AMI is brought up with more services requested to be configured, the stage2 runner knows to go back and run init for the new service (and dependencies that weren't already present!) before running update for everything.  The main use for this is a "test" service which links some "" names between Route 53 and Apache, including building an X.509 certificate.  This gives me a way to reach the instance without mangling /etc/hosts and make sure our code actually works, before configuring the AMI as the instance to launch in Auto Scaling.
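One way to picture the "go back and run init for what's new, then update for everything" behavior is a depth-first walk over the service dependencies.  The sketch below is an assumption about how such a runner could work, not the real code; the service names other than "test" and the shape of the `requires` mapping are made up for illustration.

```python
def resolve(requested, requires, skip=()):
    """Return services in dependency-first order, omitting those in `skip`.

    A skipped service is assumed to already have its dependencies in place,
    so the walk does not descend into it.
    """
    order = []
    done = set(skip)
    visiting = set()

    def visit(service):
        if service in done:
            return
        if service in visiting:
            raise ValueError(f"dependency cycle at {service!r}")
        visiting.add(service)
        for dep in requires.get(service, ()):
            visit(dep)
        visiting.discard(service)
        done.add(service)
        order.append(service)

    for service in requested:
        visit(service)
    return order


def plan_boot(requested, requires, already_initialized):
    """Plan init for anything new, then update for every service in use."""
    to_init = resolve(requested, requires, skip=already_initialized)
    to_update = resolve(requested, requires)
    return [("init", s) for s in to_init] + [("update", s) for s in to_update]
```

For example, if the AMI was built with only a hypothetical `app` service (which requires `apache`) and is later launched with `test` also requested, only `test` and any dependency it brings along that was not already set up get an init pass; update then runs for everything.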

One minor benefit to this new approach is that it no longer subscribes to SNS during the init phase, because it expects that instance to be packaged and torn down.  That, in turn, means there's no reason for the instance running the init phase to be in the security group that allows SNS traffic.  In fact, the only group now required for the AMI building process is SSH access.

Net result: there's inheritance; no IAM Role is required; no user-data is required (as cloud-init is not used); fewer security groups are required for the AMI build process; building an AMI no longer requires the EC2 console; building an AMI is one command, with no "make the human wait for something before continuing" step.
