Sunday, May 14, 2017

Reflections on Work (as our codebase turns 10)

I’ve been at my current employer for over seven years.  We’ve been building and maintaining all the code around the business this whole time, and I’d like to think I’ve learned something worth sharing…



Repository handling

For legacy reasons, we have multiple repositories.  This is frequently a pain: how do you bring multiple repos (say, one with a website and one with a library of domain models) together?

  1. git submodule
  2. git subtree
  3. Vendoring

All of these have major drawbacks.  We started with submodules, which offered zero safety and frequently led to commits landing in the wrong project.  git gave no warning that “hey, this is a submodule,” because a submodule is a complete git repo in its own right.  This probably wouldn’t have been such a problem if there were a quick way to temporarily replace a submodule with a local copy of it for integration testing, without having to push the superproject.
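For the record, the by-hand version of that swap looks roughly like this; a hedged sketch, where the repo paths, submodule path, and branch name are all invented:

    cd ~/src/website/lib/models        # the submodule's working tree inside the superproject
    git fetch ~/src/models feature-x   # pull the unpushed work straight from a local clone
    git checkout FETCH_HEAD            # put the work-in-progress in place (detached HEAD)
    # ... run the integration tests from the superproject ...
    cd ~/src/website
    git submodule update lib/models    # afterwards: return to the recorded commit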

We ended up with a lot of low-quality commits that don’t really tell the story of the feature so much as “feature, bugfix, bugfix, bugfix, bugfix.”  And that history couldn’t be neatly rebased once pushed: because the superproject pins each submodule by commit hash, I disallowed force-pushing so that rewritten hashes wouldn’t break every project referencing them.

I looked at git subtree in my search for something better, but it did something even stranger.  I never built a solid mental model of how it works, and it didn’t seem any more intuitive than submodules, so we never rolled it out as a submodule replacement.

That leaves vendoring with a tool intended for the language: Composer, for PHP code.  Of course, our libraries are private, so we also had to set up a satis server to host them.  The main drawback here is that there’s no clean way to test library changes within downstream projects: either the upstream has to be pushed and satis run to regenerate the index before the downstream can update, or else, much like with submodules, one must poke around in vendor to temporarily point the installed directory at the unpublished code under test.
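To make that concrete, here’s a hedged sketch of both halves, with placeholder URLs and package names:

    # Downstream projects resolve our private packages from the satis index
    # (index URL and package name are placeholders):
    composer config repositories.internal composer https://satis.example.internal
    composer require "ourco/domain-models:^2.0"

    # The cleaner cousin of "poke around in vendor": a path repository symlinks
    # a local checkout over the installed package while testing, assuming that
    # checkout is sitting on the (hypothetical) feature branch.
    composer config repositories.local path ../domain-models
    composer require "ourco/domain-models:dev-feature-x"
    # ... test, then restore the original constraint and remove the path repo.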

In all of these cases, we can’t rebase any upstreams, because the downstreams reference them by commit hash: git, AFAIK, doesn’t have a “commit X rebased to Y” log that it can share during git pull, so rebasing upstream breaks all downstream project history.

If I were doing everything again, the subversion-to-git migration would yield one huge git repo from the incoming subversion repos.  We could then rebase at will, and test everything in-place.  As a bonus, it would encourage us to use a proper deployment pipeline.  Speaking of which…

Deployments

Our deployment pipeline has grown from “logging in and running svn update” to a script that makes a temporary copy of the repo, runs git pull && composer install && build.sh there, and, if it all works, rotates the new directory into place.  This deployment script is started via an SNS message.
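The whole thing fits in a few lines of shell; a hedged sketch, with invented paths:

    #!/usr/bin/env bash
    # Rough shape of the current deploy; directory names are made up.
    set -euo pipefail
    releases=/srv/site/releases
    new="$releases/$(date +%Y%m%d%H%M%S)"

    cp -a "$releases/current/." "$new"          # work in a temporary copy of the checkout
    cd "$new"
    git pull --ff-only
    composer install --no-dev --no-interaction
    ./build.sh

    # only if everything above succeeded: rotate the new directory into place
    ln -sfn "$new" "$releases/current.new" && mv -Tf "$releases/current.new" "$releases/current"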

There are a lot of secrets in the git repos, because that’s the easiest place to put them in this model.  This is widely considered to be bad practice, and one of my great shames.

I really want to switch to making “binary releases” out of our code.  The plan would be to build locally, upload the tarball to S3, and issue a deploy command to pull that tarball plus an overlay (configuration, i.e. secrets).  The server’s job would then be to download two tarballs, unpack them into a temporary directory, and rotate that directory into place.
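On the server side, that whole plan collapses to something like this hedged sketch (bucket, keys, and paths are all placeholders):

    #!/usr/bin/env bash
    set -euo pipefail
    rel="$1"                                        # release identifier chosen at build time
    tmp=$(mktemp -d /srv/site/releases/rel.XXXXXX)

    # download the two tarballs and unpack them into a temporary directory
    aws s3 cp "s3://example-releases/site-$rel.tar.gz" - | tar -xz -C "$tmp"
    aws s3 cp "s3://example-releases/overlay-prod.tar.gz" - | tar -xz -C "$tmp"

    # rotate the new directory into place
    ln -sfn "$tmp" /srv/site/current.new && mv -Tf /srv/site/current.new /srv/site/current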

Conveniently, that model seems to fit much better with AWS CodeDeploy.  We could use that.

In the meantime, I’m migrating the secrets into $HOME on the server AMIs and making the code use that (e.g. by passing a profile to the AWS SDK).
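On the AMI side that looks roughly like the following (the profile name is a placeholder); the code side is just a matter of handing the same profile name to the SDK:

    # Credentials live in the service account's $HOME on the AMI, not in git.
    mkdir -p ~/.aws && chmod 700 ~/.aws
    cat > ~/.aws/credentials <<'EOF'
    [site-prod]
    aws_access_key_id     = AKIA................
    aws_secret_access_key = ....................
    EOF
    chmod 600 ~/.aws/credentials
    # PHP then selects it, e.g. new S3Client(['profile' => 'site-prod', ...])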

Besides reducing the number of services that must be available when deploying, a binary release system would also mean the webserver no longer incidentally had the complete source history sitting on disk.  We could then delete the webserver rules that exist only to block access to .git directories and .git* files.

Having a single git repo would have increased the pressure to have a binary-based release process, because we wouldn’t want to check the whole thing out on every server that needs any part of it.

SQL everywhere

In the name of optimization, there’s almost no barrier between code and the database.  This lets any code run whatever queries it wants, as fast as it can get them, but it means the database structure is woven throughout the entire codebase.

Conservative changes (ones that don’t break the site for a random amount of time between a few seconds and “however long we’re scrambling to fix the bugs”) require a multi-stage rollout with two “edit all affected code” passes.  That would be true anywhere, but here “all affected code” isn’t restricted to any subset of our code: it can be any and all code, across all repositories.
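To make “multi-stage” concrete, here’s a hedged outline of a simple column rename, with invented table and column names; each numbered step is its own deploy, and steps 2 and 4 are the two “edit all affected code” passes:

    # 1. expand: add the new column alongside the old one
    mysql app -e "ALTER TABLE orders ADD COLUMN placed_at DATETIME NULL"
    # 2. first code pass: write both columns, keep reading the old one (deploy everywhere)
    # 3. backfill the historical rows
    mysql app -e "UPDATE orders SET placed_at = created WHERE placed_at IS NULL"
    # 4. second code pass: read and write only the new column (deploy everywhere)
    # 5. contract: drop the old column once nothing references it
    mysql app -e "ALTER TABLE orders DROP COLUMN created"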

We’re moving toward building things as “an API” that interacts with the database, and “apps” that consume the API for displaying the UI, doing reporting, or running periodic tasks (remote data integration and the like.) This at least separates “the database structure” from most of the callers, reducing “all affected code” to “the API code.”

Time will tell if that one works out.

Database engine concerns

We followed MySQL up to 5.6, where we got stuck.  The move to 5.7 started sucking up days of effort, and promised to stay exciting for weeks to come wherever code neither turned off strict mode nor handled the resulting data errors, so the update got pushed off indefinitely.  (We have a standard connection function that sets plenty of options by default, but in the name of simplicity, not everybody likes to use it.)
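The sticking point is easy to demonstrate with a throwaway table: the first command assumes our current, permissive sql_mode, the second assumes 5.7’s default of STRICT_TRANS_TABLES:

    # Pre-5.7 behaviour a lot of code quietly relies on: out-of-range data is
    # truncated with a warning rather than rejected.
    mysql test -e "SET sql_mode=''; CREATE TEMPORARY TABLE t (n TINYINT); INSERT INTO t VALUES (999); SHOW WARNINGS; SELECT n FROM t;"

    # With strict mode, the same INSERT is a hard error, and the request blows
    # up unless the connection disabled strict mode or the code handles it.
    mysql test -e "SET sql_mode='STRICT_TRANS_TABLES'; CREATE TEMPORARY TABLE t (n TINYINT); INSERT INTO t VALUES (999);"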

Our path forward looks like it will be Aurora at some point.  Otherwise, MariaDB may prove to be a viable candidate, if we don’t want to go back and support MySQL 5.7 and up.

There’s basically no chance of supporting Postgres.  We’re not distributing our software, we don’t have any Postgres knowledge in-house, and of course, all of the code assumes MySQL.

Dual Languages

The original code for the websites was written in Perl CGI.  We have two major sites—an “agent center” used by the salesmen to configure installations, and the “platform” that serves our actual clients.

When I started, the agent center did almost nothing but allow for downloading some PDF forms, and setting up users who could download PDF forms.  It was part of a slightly more expansive site marketing our product, which mainly pulled HTML out of the database and pasted it into <div id="content"></div>, outside of the agent center proper.

I ported the agent center/marketing stuff to PHP right away, and over time, this has proven to be one of the best decisions of my career, although at the time I thought I’d wasted a couple of weeks on it.  We added an enrollment system for configuring installations, replacing an Excel spreadsheet that was never filled out properly.  We added a little CRM, a trouble ticketing system, some performance metrics, and integrated a few things with the platform.

Mostly, data managed by PHP is for the enrollment/CRM/ticket systems and isn’t accessed by the platform, except during installations (when an enrollment is copied to live system data.) And vice versa: live system data is periodically compiled into a cache for the benefit of the performance metrics, and very little PHP actually accesses the Perl data.

But as integrations have deepened over time, we have ended up with a few places where the same business rules end up coded in Perl and in PHP.  This has been a major source of trouble to get right and keep right in the face of business changes.

In the wake of these effects, I’ve strongly resisted adding a third language to the ecosystem.  I want to kill Perl 100% before we ever think about moving out of PHP, or else we’ll have our same old problems… in triplicate.

Single Page

Because it was written in 2006 or so, when GMail was the coolest thing, the platform website (the one still backed by Perl) is written as a single-page app.  Every “navigation” fires off an ajax loader that replaces the content of <div id="main"></div>, and that // TODO: make the back button work comment is probably a decade old now.

I was able to move a chunk of the codebase from Perl CGI to Perl FastCGI (this has its own issues), but changing out the Javascript framework looks like an intractable problem.  We’re finally building a new API in PHP to cover everything Perl did, and creating an all-new (React) UI to go with it.

Whether the Perl UI will be migrated to the API (even partially), or just switched off, remains undetermined.

FastCGI

There are two parts to a FastCGI service: a “process manager” and “worker processes.”  I didn’t have a good understanding of this, so when I built the FastCGI bits, Apache’s mod_fcgid ended up as process manager without an explicit decision.  I’m not very happy with the results.

  1. There is no clean restart for mod_fcgid.  When Apache reloads, it unloads and reloads all its dynamic modules, which kills the process manager and frees its memory. Until Apache 2.4 (and even then, I don’t know if mod_fcgid ever took advantage of the change), there was no way for a module to be warned this was coming.
  2. There is no preforking for mod_fcgid.  When a request arrives, it is either assigned an idle worker, or a new worker is started. When the server is started/reloaded, there are 0 workers, and none are pre-warmed.

For a reliable website, #1 is a major pain; and #2 makes it difficult to offer a low-latency service when the Perl modules in use take significant time to load.

As such, I ended up rewriting our early API site in Dancer, which runs in an HTTP server of its own called Starman, with Apache proxying requests to it.  Starman can be gracefully restarted via Server::Starter without taking down open connections (its own, or any other website’s).
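The moving parts look roughly like this (port, paths, and worker count are invented): start_server supervises the Starman master, and on SIGHUP it brings up a fresh one before retiring the old, so in-flight requests finish on the old workers.

    # supervise Starman under Server::Starter; Apache proxies to 127.0.0.1:5000
    start_server --port 127.0.0.1:5000 --pid-file /run/api-starter.pid -- \
        starman --workers 8 --preload-app /srv/api/bin/app.psgi

    # graceful restart after a deploy: new master starts, old one drains and exits
    kill -HUP "$(cat /run/api-starter.pid)"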

Later additions to the API have been in PHP.  This is a lot easier than trying to migrate the platform would be, because we don’t have session data to worry about.  The API does not require its callers to support cookies.

Going back to the platform UI, it might have been much better to do the same (starman/Server::Starter proxying HTTP) instead of trying to go with FastCGI specifically.  But I didn’t know that yet.

Slow CGI

I never finished the FastCGI migration for the platform.  Somewhere around 30% of the pages, covering 80% of hits (and all of the difficult logic), were migrated.

I had collected hit stats from the logs, then proceeded from most-to-least hits.  This gave us the biggest wins quickly, but it also doomed us to rapidly diminishing returns.  I’m pretty sure this caused the project to be permanently ‘interrupted’ before it could be completed.

Sure, pages are now delivered in a quarter of the time to first byte, in half the time overall, and with half the CPU.  But once the major pages were done, there wasn’t the political will to wait another couple of weeks to finish converting everything.

Nothing Broken Is Ever Fixed

The harshest lesson: “temporary” solutions become distressingly permanent.  We are still using the Smoothness theme with jQueryUI, not because it’s what anyone wanted, but because it’s what I picked as a placeholder to code against, with the expectation that we’d choose or commission a different theme later.
