Tuesday, November 27, 2012

DynamoDB in the Trenches

Amazon's DynamoDB, as they're happy to tell you, is an SSD-backed NoSQL storage service with provisioned throughput.  Attribute values come in three basic types (number, string, and binary), each in either scalar or set form.  (A set contains any number of unique values of its parent type, in no particular order.)  All lookups are done by hash key, optionally with a range key as sub-key.  A hash key alone effectively defines a root type; adding a range key gives you something like a 1:many relation, where the hash key is the foreign key to the parent object and the range key defines an ordering of the collection.

But you knew all that; there are two additional points I want to add to the documentation.

1. Update is Upsert

DynamoDB's update operation actually behaves as an upsert: if you update a nonexistent item, the item is created, and the attributes you updated will be the only attributes on it besides the key.  If this would result in an invalid item as a whole, then you want to use the expected-value mechanism to make sure the item is really there before the update applies.
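To make the upsert behavior concrete, here is a minimal in-memory sketch.  It is a toy model, not the DynamoDB API: the ToyTable class, its update method, and the expected parameter are names I made up for illustration, and the real expected-value mechanism has more options than plain value matching.

```python
class ConditionFailed(Exception):
    """Raised when an expected attribute value does not match the stored item."""


class ToyTable:
    """In-memory stand-in for a table. Hypothetical; models upsert semantics only."""

    def __init__(self):
        self.items = {}  # key -> dict of attributes

    def update(self, key, changes, expected=None):
        # A missing item looks like an empty one for the purposes of the check.
        item = self.items.get(key, {})
        for name, value in (expected or {}).items():
            if item.get(name) != value:
                raise ConditionFailed(f"{name!r} is not {value!r}")
        # Upsert: a missing item is created holding only the updated attributes.
        self.items[key] = {**item, **changes}
        return self.items[key]


table = ToyTable()

# Unconditional update of an item that was never put: it springs into existence.
table.update("user-1", {"last_login": "2012-11-27"})
print(table.items["user-1"])

# Conditional update: refuse to create a half-formed item.
try:
    table.update("user-2", {"last_login": "2012-11-27"},
                 expected={"status": "active"})
except ConditionFailed:
    print("user-2 was not there, so nothing was written")
```

The first call leaves behind an item with nothing but last_login on it, which is exactly the kind of partial item the expected-value check is there to prevent.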

2. No Attribute Indexing or Querying

NoSQL is at once the greatest draw to non-relational storage, and also its biggest drawback.  On DynamoDB, there's no query language.  You can get items by key cheaply, or scan the whole table expensively, and there's nothing in between.  There's no API for indexing any attributes, so there's no API for querying by index, either.  You can't even query on the range key independently of the hash key (e.g. "find all the posts today, regardless of topic" on a table with topic as the hash key and date as the range key).
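The cost difference between the two access paths can be sketched with another toy model.  Again, none of this is the real API: ToyRangeTable, query, scan, and the items_read counter are hypothetical names, and items_read is only a rough stand-in for consumed read capacity.

```python
from collections import defaultdict


class ToyRangeTable:
    """In-memory stand-in for a hash-and-range table. Hypothetical, not the real API."""

    def __init__(self):
        self.partitions = defaultdict(dict)  # hash key -> {range key: item}
        self.items_read = 0                  # rough proxy for read cost

    def put(self, hash_key, range_key, item):
        self.partitions[hash_key][range_key] = item

    def query(self, hash_key, low, high):
        """Range condition under ONE hash key: only matching items are read."""
        rows = self.partitions.get(hash_key, {})
        found = [rows[r] for r in sorted(rows) if low <= r <= high]
        self.items_read += len(found)
        return found

    def scan(self, predicate):
        """Filter across all hash keys: every item in the table is read."""
        found = []
        for hash_key, rows in self.partitions.items():
            for range_key, item in rows.items():
                self.items_read += 1
                if predicate(hash_key, range_key, item):
                    found.append(item)
        return found


posts = ToyRangeTable()
posts.put("cats", "2012-11-26", "old cat post")
posts.put("cats", "2012-11-27", "new cat post")
posts.put("dogs", "2012-11-20", "old dog post")
posts.put("dogs", "2012-11-27", "new dog post")

today = "2012-11-27"

# "Today's posts about cats": the hash key is known, so this is a cheap query.
cats_today = posts.query("cats", today, today)
cost_of_query = posts.items_read

# "Today's posts, regardless of topic": no hash key, so the only option is a scan.
posts.items_read = 0
all_today = posts.scan(lambda topic, date, post: date == today)
cost_of_scan = posts.items_read

print(cats_today, cost_of_query)
print(all_today, cost_of_scan)
```

The query reads only the items it returns, while the scan reads the whole table to return half of it, and that gap grows with the table.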

If you need lookup by attribute-equals more than you need consistent query performance, then you can use SimpleDB.  RDS could be a decent option, especially if you want ordered lookups on a column other than the key (as in DELETE FROM sessions WHERE expires < NOW(); when the primary key is "id").

A not-so-good option would be to add another DynamoDB table keyed by attribute values and containing sets of your main table's hash keys.  But you can't update multiple DynamoDB tables transactionally, so a failure between the two writes leaves the index out of step with the main table, and this is more prone to corruption than other methods.
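Here is a small sketch of how that corruption happens, using two plain dicts in place of the two tables.  The names (main, by_tag, put_post, WriterCrashed) are invented for illustration, and the "crash" is simulated with a flag.

```python
main = {}    # hash key -> item (stands in for the main table)
by_tag = {}  # attribute value -> set of main-table hash keys (the index table)


class WriterCrashed(Exception):
    """Stands in for a process dying or a request failing between the two writes."""


def put_post(post_id, tag, crash_between_writes=False):
    main[post_id] = {"tag": tag}                # write 1: main table
    if crash_between_writes:
        raise WriterCrashed(post_id)            # write 2 never happens
    by_tag.setdefault(tag, set()).add(post_id)  # write 2: index table


def find_by_tag(tag):
    """Look up main-table keys through the hand-maintained index."""
    return sorted(by_tag.get(tag, set()))


put_post("post-1", "aws")
try:
    put_post("post-2", "aws", crash_between_writes=True)
except WriterCrashed:
    pass

print(find_by_tag("aws"))  # what the index says
print(sorted(main))        # what the main table actually holds
```

After the simulated failure, post-2 exists in the main table but is invisible to any lookup by tag.  Writing the index first just flips the problem around: the index would then point at an item that doesn't exist.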

And if you want to pull together two sources of asynchronous events, and act on the result once both events have occurred, then the Simple Workflow service might work.  (I realized this about 98% of the way through a different solution, so I stuck with that.  I might have been able to store the half-complete event into the workflow state instead, no DynamoDB needed, but since I didn't walk that path, I can't vouch for it.)
