Evading DevOps Work for Non-Real-Time Content Generation and Delivery Applications

July 11th, 2015

In my experience, a traditional development team rolling out applications to servers, virtual or otherwise, needs at least 20% of its people to be experienced in and focused on devops work. That means designing, building, and maintaining server stacks, the deployment system, QA infrastructure, other tools needed in the course of development, and so forth. Ten developers turning out code and one devops specialist doing all of the above is a recipe for a frazzled devops specialist and bottlenecks in the development process. I've been in that position; it isn't all that much fun, even if you are good at what you do. The system tends to be (eventually) self-correcting in a well-run team, as good developers are willing to learn and get things done outside their current specialty - where do new devops folk come from, after all? Still, it is better to start out with people who can both code well and set up a sound deployment infrastructure.

Unfortunately devops is a challenging hire. There just aren't many good people out there as a proportion of the software engineering population. The end result is that there are a lot of terrible deployment systems and hacked-together tools out there, string and baling wire that gets the job done for putting software on servers, assembled by developers who were learning the basics of Puppet or Chef or Nginx configuration or Jenkins best practices or how to set up Unix services at the time, and who no doubt wince whenever they look back at what was built. That is not even to think about keeping up with security patches, something that is becoming ever more vital these days, given the greater attention now paid to finding and fixing vulnerabilities in encryption implementations such as OpenSSL.

This is all the case for, as I said, the traditional approach of deploying software to servers. It looks very much like the future of a great deal of application development is going to involve abstracting away the server and the stack, however. This will mean writing code to run in a sandbox provided by a third party service, wherein you are effectively renting a specified runtime environment for a given number of processor cycles, and everything to do with setting up and maintaining that runtime environment is offloaded onto the service. Tools like Docker, as used in a cloud provider such as AWS, are a small step in that direction, but don't really eliminate enough of the devops work in practice, to my mind. AWS Lambda is a larger step, but its implementation is deliberately constrained to very small and very short programs. The future looks a lot more like Lambda than Docker, I think: an array of standards for various languages and runtime environments, rented by processor time, and applications split into separate logical components by usage characteristics, their operation and the passage of data coordinated by abstraction layer services for queues (e.g. SQS), notifications (e.g. SNS), and file storage (e.g. S3).

This is still a little in the future for support of arbitrary applications, but if you are creating a content generation and delivery application that doesn't require real-time interaction with the end user, then even today you may well be able to do away with nearly all of the devops work that would accompany a traditional deployment to servers. The following outline focuses on AWS as a provider, but other platforms exist now and more will follow later.

Generate Content in Node.js and Lambda

Write your application in Node.js and break up the architecture into discrete components that can work within the AWS Lambda limits on execution: concurrency, memory, time. Each one of these components then becomes a Lambda function and runs separately within AWS Lambda. This may or may not be plausible for a given design, but for most applications it should be possible to factor out operations into distinct Lambda functions that can reliably execute within the constraints. These modules typically call APIs to obtain their inputs and write generated content to S3.
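
As a rough sketch of what one such component might look like - assuming the aws-sdk module that the Lambda runtime provides, plus a hypothetical bucket name and event format - the pattern is simply to generate output and write it to S3:

    // Hypothetical content generation component, run as a Lambda function.
    var AWS = require('aws-sdk');
    var s3 = new AWS.S3();

    exports.handler = function (event, context) {
      // The event carries whatever inputs this component needs; assume
      // here a pageId and title passed in by whatever triggered it.
      var html = '<html><body><h1>' + event.title + '</h1></body></html>';

      s3.putObject({
        Bucket: 'example-generated-content', // hypothetical bucket
        Key: 'pages/' + event.pageId + '.html',
        Body: html,
        ContentType: 'text/html'
      }, function (err) {
        if (err) {
          context.fail(err);
        } else {
          context.succeed('Generated pages/' + event.pageId + '.html');
        }
      });
    };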

Serve Generated Content From S3 via CloudFront

S3 buckets can be used to serve static content over HTTP/S, and can also be the origin for a CDN. That CDN doesn't have to be AWS CloudFront, but since it is right there and very easy to set up, why not. Using S3 removes the need to maintain a web server stack, and the savings in cost and time there alone are reason enough to think carefully about whether or not any new application can fulfill its requirements as a pure static content generator, serving over HTTP/S. If there is no real-time functionality, the answer is often yes.
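
Flipping a bucket over to static website hosting is a single API call rather than a devops project. A minimal sketch, again assuming a hypothetical bucket name:

    // Enable static website hosting so the bucket serves content over
    // HTTP directly; CloudFront can then be pointed at it as an origin.
    var AWS = require('aws-sdk');
    var s3 = new AWS.S3();

    s3.putBucketWebsite({
      Bucket: 'example-generated-content', // hypothetical bucket
      WebsiteConfiguration: {
        IndexDocument: { Suffix: 'index.html' },
        ErrorDocument: { Key: 'error.html' }
      }
    }, function (err) {
      if (err) { console.error(err); }
    });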

Drive Content Generation from Message Queues

Message queues are a great tool for breaking up larger applications into their logical component parts. Anywhere you find yourself thinking "I want to ensure that this happens once," or "I now need to take an action, but it doesn't have to be synchronous with the flow of execution," or "I need to run these tasks with a given concurrency," you can drop in a queue. The queue takes a message with details, and then any other system can in principle pick up those messages and take the necessary actions.

You probably don't want to be the person responsible for setting up your own message queue implementation and keeping it happy, however. It's a complicated business with many subtle edge cases and failure modes that lead to lost data, and you really do need a good understanding in order to create a watertight design. Fortunately all of that can be outsourced to Amazon. AWS provides a robust, simple queue implementation in the form of SQS. Since it exposes metrics to CloudWatch, and monitoring and visualization services such as Datadog can consume CloudWatch data via APIs, an added bonus of breaking up application design with queues is that you gain useful insight into the operating state of the application along with points of alerting (e.g. queues backing up) for little additional cost.
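
To illustrate the basic lifecycle - send, receive, delete - here is a sketch using the Node.js SDK with a hypothetical queue URL and message format:

    var AWS = require('aws-sdk');
    var sqs = new AWS.SQS();
    // Hypothetical queue URL.
    var queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/example-queue';

    // Producer side: enqueue a unit of work.
    sqs.sendMessage({
      QueueUrl: queueUrl,
      MessageBody: JSON.stringify({ pageId: 'about', action: 'regenerate' })
    }, function (err) {
      if (err) { console.error(err); }
    });

    // Consumer side: pull a message, act on it, and delete it only on
    // success. A message that is never deleted becomes visible again
    // after the visibility timeout expires, which is how retry works.
    sqs.receiveMessage({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 10 // long polling
    }, function (err, data) {
      if (err || !data.Messages) { return; }
      var message = data.Messages[0];
      // ... carry out the work described by JSON.parse(message.Body) ...
      sqs.deleteMessage({
        QueueUrl: queueUrl,
        ReceiptHandle: message.ReceiptHandle
      }, function (err) {
        if (err) { console.error(err); }
      });
    });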

The challenge here is that a Lambda function cannot directly react to the existence of SQS messages, but you want Lambda function instances to spawn to consume messages from a queue at some defined maximum concurrency and frequency. Thus an additional mechanism is needed to manage triggering of Lambda functions; once a function is running, it can check a queue. The trick is to do this without any sort of server or other apparatus that itself needs deployment, stack design, and other devops support.

One option is to set up some form of perpetually cycling control Lambda function. Since Lambda functions can make arbitrary AWS API requests (the Node.js AWS SDK is pretty good), it is perfectly feasible to create a Lambda function that both respawns itself at intervals and spawns other Lambda functions in response to circumstances.
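
A sketch of that sort of control function follows, assuming hypothetical function names and queue URL, and eliding most error handling. It checks the queue backlog, spawns workers up to a cap, and then respawns itself:

    var AWS = require('aws-sdk');
    var sqs = new AWS.SQS();
    var lambda = new AWS.Lambda();
    // Hypothetical queue URL.
    var queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/example-queue';

    exports.handler = function (event, context) {
      // See how much work is waiting on the queue.
      sqs.getQueueAttributes({
        QueueUrl: queueUrl,
        AttributeNames: ['ApproximateNumberOfMessages']
      }, function (err, data) {
        var backlog = err ? 0 : parseInt(data.Attributes.ApproximateNumberOfMessages, 10);
        // Spawn consumer functions up to a modest concurrency cap.
        var workers = Math.min(backlog, 10);
        for (var i = 0; i < workers; i++) {
          lambda.invoke({
            FunctionName: 'exampleQueueConsumer', // hypothetical worker function
            InvocationType: 'Event' // asynchronous invocation
          }, function () {});
        }
        // Wait a little, then respawn this same function to continue the
        // cycle. The pause is billed as execution time, but it keeps the
        // loop from spinning too quickly.
        setTimeout(function () {
          lambda.invoke({
            FunctionName: context.functionName,
            InvocationType: 'Event'
          }, function (err) {
            if (err) { context.fail(err); } else { context.succeed('cycled'); }
          });
        }, 30000);
      });
    };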

It is also possible to stick with notification-based Lambda function triggers, such as SNS messages, S3 events, or direct invocation via other APIs, but the issue here is one of concurrency limits for the Lambda functions. Whatever systems generate notifications must manage concurrency and clean up or retry Lambda function failures, since the Lambda functions themselves can't do that if they fail due to resource limits. So you have potentially complicated additional requirements, all of which are unneeded if using queues, because queues handle failure and retry implicitly: a message that a failed consumer never deletes simply becomes visible again for another consumer to pick up. To my eyes that's enough extra work to make it well worth trying to assemble the system to be driven by SQS messages.

How About a Database for Stateful Applications?

There are always Amazon's DynamoDB and RDS services, both of which provide all of the creature comforts of a database without the need to manage a database stack. Little to no devops work is needed there for setup and ongoing maintenance, but you still need to manage a sensible offsite and replicated backup strategy.
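
For a small amount of application state, DynamoDB is a natural fit, and using it from Node.js is brief. A sketch, assuming a hypothetical table keyed on pageId:

    var AWS = require('aws-sdk');
    var dynamodb = new AWS.DynamoDB();

    // Store a small piece of application state; note that DynamoDB
    // items use typed attribute values rather than plain JSON.
    dynamodb.putItem({
      TableName: 'example-application-state', // hypothetical table
      Item: {
        pageId: { S: 'about' },
        lastGenerated: { N: String(Date.now()) }
      }
    }, function (err) {
      if (err) { console.error(err); }
    });

    // Read the state back by key.
    dynamodb.getItem({
      TableName: 'example-application-state',
      Key: { pageId: { S: 'about' } }
    }, function (err, data) {
      if (!err && data.Item) {
        console.log('Last generated at:', data.Item.lastGenerated.N);
      }
    });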

Considering the Dreaded Vendor Lock-In

If you build something in the form suggested above, then yes, you're locked in to AWS for the foreseeable future, though the degree of lock-in is nowhere near as bad as for, say, Google App Engine. You could pick up your Lambda code, since it's all Node.js, and run it in an entirely different unified server environment, migrate to a different database implementation, and use different queue mechanisms - even to the point of swapping in in-memory queues and running the whole thing on one server. That is definitely a harder migration than moving a server instance with a well-defined deployment process to a new hosting environment, of course, but the point of the exercise is to balance that potential cost against the actual savings of nearly eliminating the devops workload of a more traditional design.