Two Approaches to Elastic Load Balancers in CloudFormation Deployment

March 13th, 2016 Permalink

The core of the standard setup for a cluster of stateless web application servers deployed to Amazon's Elastic Compute Cloud (EC2) consists of an Elastic Load Balancer (ELB), a Route 53 DNS record pointing to the ELB, and one or more Auto Scaling Groups (ASG) attached to the ELB. Setting aside the details of the web service running on instances managed by the ASG, this core can then be extended with CloudWatch metrics and alarms, scaling policies, notifications, health checks on the ELB that trigger the ASG to remove unhealthy instances, security groups with varied rules on the ELB and instances, and so on. The stateless web application servers interact with APIs or databases external to the cluster, and all user state and other necessary state data is stored there.

For the purposes of deployment, this is all usually packaged up into one or more CloudFormation stacks, and the JSON stack definitions - or the scripts that generate those JSON definitions - sit at the heart of any system of AWS devops automation. There are several quite different approaches to using CloudFormation to deploy stateless web application clusters, and many of the differences hinge on the treatment of ELBs. I'll sketch two of them here, one more suited to low traffic applications, and the other necessary for high traffic applications, say tens of requests per second or more.

Low Traffic: One Stack, Disposable ELB

The simplest approach is to lump every stateless resource into one CloudFormation stack definition. Specifically:

  • ELB
  • ELB Security Group
  • ELB Route 53 DNS Entry: appname-version-buildnumber.stacks.example.com
  • ASG
  • ASG Launch Configuration
  • Instance Security Group
  • CloudFormation Wait Condition
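Since stack definitions of this sort are usually emitted by scripts rather than written by hand, the single-stack layout can be sketched as a small generator. This is a skeleton only: resource properties are elided, and all names are illustrative rather than taken from any real template.

```python
import json

def single_stack_template(app, version, build):
    """Sketch of a generator for the one-stack approach: every stateless
    resource, including the disposable ELB, lives in a single template.
    Resource properties are elided; names are illustrative."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "ELB": {"Type": "AWS::ElasticLoadBalancing::LoadBalancer"},
            "ELBSecurityGroup": {"Type": "AWS::EC2::SecurityGroup"},
            "DNSRecord": {
                "Type": "AWS::Route53::RecordSet",
                "Properties": {
                    # e.g. appname-version-buildnumber.stacks.example.com
                    "Name": "%s-%s-%s.stacks.example.com" % (app, version, build),
                    "Type": "CNAME",
                    "TTL": "60",
                },
            },
            "ASG": {"Type": "AWS::AutoScaling::AutoScalingGroup"},
            "LaunchConfiguration": {"Type": "AWS::AutoScaling::LaunchConfiguration"},
            "InstanceSecurityGroup": {"Type": "AWS::EC2::SecurityGroup"},
            "WaitCondition": {"Type": "AWS::CloudFormation::WaitCondition"},
        },
    }

# Emit the JSON stack definition for one deployment of one build.
print(json.dumps(single_stack_template("appname", "1-2-0", "17"), indent=2))
```

Because every resource is scoped to a single stack, deleting the stack cleans up the whole deployment, ELB included.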

This means that a new ELB is created for each deployment of a new version of the application, while the old ELB is destroyed when the old stack is destroyed. Seamless deployment of a new version without downtime works as follows:

  • Deploy the new CloudFormation stack.
  • Validate that the stack came up and is functioning correctly.
  • Switch the DNS CNAME entry for appname.stacks.example.com to point to the new appname-version-buildnumber.stacks.example.com entry.
  • Wait for the appname.stacks.example.com TTL to expire. Hopefully it is set to a reasonably short 60 seconds or so.
  • Make a decision on whether to roll back or not. This may involve waiting for some time.
  • If not rolling back, delete the old stack.
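The DNS switch in the middle of those steps is a single Route 53 call. A minimal sketch of building the change batch, in the shape accepted by the `ChangeResourceRecordSets` API, might look like the following; the hostnames are illustrative.

```python
def cname_switch_change_batch(alias, target, ttl=60):
    """Build a Route 53 change batch that repoints the stable CNAME
    (e.g. appname.stacks.example.com) at the new stack's DNS entry.
    UPSERT creates the record if absent or updates it in place."""
    return {
        "Comment": "Switch %s to %s" % (alias, target),
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": alias,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }],
    }

batch = cname_switch_change_batch(
    "appname.stacks.example.com",
    "appname-1-2-0-18.stacks.example.com",
)
```

The short TTL on the record is what bounds how long the switchover takes, which is why a value of around 60 seconds matters.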

My cloudformation-deploy NPM package implements this workflow. Given good instance health check endpoints it is straightforward to automate the entire process, with human intervention only needed when something goes wrong.

Unfortunately, ELBs are Not Infinitely Elastic

The single stack approach that creates a new ELB for each deployment is great for low traffic applications, but it doesn't work for high traffic applications. An ELB has no configuration setting to warn it of the level of traffic it will receive immediately upon creation, and behind the scenes ELBs are limited in their ability to rapidly scale up from their default initial capacity. New ELBs for high traffic applications must be created and then manually pre-warmed by AWS support before going live, or they will drop most of the incoming traffic until they scale up to handle the load, a process that can easily take tens of minutes to work its way through the underlying systems.

High Traffic: Two Stacks, Persistent ELB

The requirement for ELBs to be pre-warmed for high traffic applications means that we must deploy the ELB and the web application cluster in separate stacks. The ELB stack will only be deployed once for the lifetime of the application. The only reason to redeploy would be a move to a different AWS region, or to update a setting that cannot be updated manually via the EC2 console. The ELB stack contains the following items:

  • ELB
  • ELB Security Group
  • ELB Route 53 DNS Entry: app-elb-buildnumber.stacks.example.com

The application stack has the rest:

  • ASG
  • ASG Launch Configuration
  • Instance Security Group
  • CloudFormation Wait Condition
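The split between the two stacks can be sketched as a pair of template generators. Note the key structural difference from the single-stack version: the application stack's ASG is created without any load balancer, because attachment to the persistent ELB happens as a separate deployment step after validation. As before, properties are elided and names are illustrative.

```python
def elb_stack_template():
    """Persistent ELB stack, deployed once for the lifetime of the
    application. It exports the ELB name so that deployment tooling can
    attach application ASGs to it later. Properties elided."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "ELB": {"Type": "AWS::ElasticLoadBalancing::LoadBalancer"},
            "ELBSecurityGroup": {"Type": "AWS::EC2::SecurityGroup"},
            "DNSRecord": {"Type": "AWS::Route53::RecordSet"},
        },
        "Outputs": {
            # Ref on an ELB resource yields its name.
            "LoadBalancerName": {"Value": {"Ref": "ELB"}},
        },
    }

def application_stack_template():
    """Disposable application stack, deployed for every new version.
    Deliberately contains no ELB: the ASG is attached to the persistent
    ELB only after the stack has been validated."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "ASG": {"Type": "AWS::AutoScaling::AutoScalingGroup"},
            "LaunchConfiguration": {"Type": "AWS::AutoScaling::LaunchConfiguration"},
            "InstanceSecurityGroup": {"Type": "AWS::EC2::SecurityGroup"},
            "WaitCondition": {"Type": "AWS::CloudFormation::WaitCondition"},
        },
    }
```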

Deployment of a new application version runs as follows:

  • Deploy the new application CloudFormation stack.
  • Validate that the stack came up and its instances are functioning correctly.
  • Add the new application stack ASG to the existing persistent ELB.
  • Remove the old application stack ASG from the ELB.
  • Verify that everything is still working.
  • Make a decision on whether to roll back or not. This may involve waiting for some time.
  • If not rolling back, delete the old application stack.
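The ordering of the attach and detach steps is the important invariant: the new ASG must be attached and serving traffic before the old ASG is detached, so there is never a moment with no healthy instances behind the ELB. A sketch of deployment tooling encoding that ordering, with each entry mirroring the parameters of the Auto Scaling `AttachLoadBalancers` and `DetachLoadBalancers` API calls:

```python
def elb_swap_plan(elb_name, new_asg_name, old_asg_name):
    """Return the ordered load balancer operations for swapping ASGs on
    a persistent ELB during a deployment. Attach before detach, so the
    ELB is never left without instances. Names are illustrative."""
    return [
        ("attach-load-balancers", {
            "AutoScalingGroupName": new_asg_name,
            "LoadBalancerNames": [elb_name],
        }),
        ("detach-load-balancers", {
            "AutoScalingGroupName": old_asg_name,
            "LoadBalancerNames": [elb_name],
        }),
    ]
```

In practice there should also be a wait between the two operations, until the new instances pass the ELB health checks.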

Replacing a Persistent ELB

What if you need to replace the persistent ELB itself, however? This requires an extra layer of indirection in the DNS entries: a CNAME entry for app.stacks.example.com that points to app-elb-buildnumber.stacks.example.com. That allows a persistent ELB to be replaced seamlessly by taking the following steps:

  • Deploy the new ELB stack and verify it came up correctly.
  • Add the existing ASG to the new ELB, so that its instances accept traffic from both old and new ELBs.
  • Contact AWS support to carry out the necessary pre-warming. This can take a day or more to work its way through the support system, but the actual pre-warming doesn't take long.
  • Update app.stacks.example.com to point to the new app-elb-buildnumber.stacks.example.com DNS entry.
  • Wait for the app.stacks.example.com TTL to expire. Hopefully it is set to a reasonably short 60 seconds or so.
  • Delete the old ELB stack.
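The two-level indirection can be made concrete as the record sets involved: the stable name is a CNAME to a per-ELB name, which is in turn a CNAME to the DNS name AWS assigns to the ELB. Replacing the ELB then means re-pointing only the stable record. A sketch, with illustrative hostnames:

```python
def elb_replacement_records(app, build, elb_dns_name, ttl=60):
    """Build the pair of CNAME record sets forming the DNS indirection
    layer for a persistent ELB. The ELB DNS name here stands in for the
    name AWS assigns to the load balancer on creation."""
    per_elb_name = "%s-elb-%s.stacks.example.com" % (app, build)
    stable_name = "%s.stacks.example.com" % app
    return [
        # Per-ELB entry: points at the ELB itself, created with its stack.
        {"Name": per_elb_name, "Type": "CNAME", "TTL": ttl,
         "ResourceRecords": [{"Value": elb_dns_name}]},
        # Stable entry: the only record touched when the ELB is replaced.
        {"Name": stable_name, "Type": "CNAME", "TTL": ttl,
         "ResourceRecords": [{"Value": per_elb_name}]},
    ]
```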

Implications for Application Development and Versioning

In a stateless web application of the sort considered in this post, version updates to the application server cluster happen independently of version updates to the APIs and other data sources used by the application servers. The first thing to consider is that this requires careful coordination between application and data layer developers when the data layer updates: the application servers must first be updated to handle both the old and new versions of whichever API or data source is changing. This tends to be easier for well versioned APIs, since the application can decide which version to use for any given request. It is much harder for database schema changes and the like, which offer no way to keep using the old schema once the change is applied.
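For a well versioned API, the coordination reduces to version negotiation: during a data layer rollout the application picks the highest version that both it and the data layer support. A minimal sketch of that logic, using a hypothetical helper rather than any real SDK:

```python
def choose_api_version(client_supported, server_offered):
    """Pick the highest API version supported by both the application
    server and the data layer. During a rollout the server side offers
    both old and new versions, so clients upgraded to understand the new
    version switch over automatically. Hypothetical helper."""
    common = set(client_supported) & set(server_offered)
    if not common:
        raise ValueError("no mutually supported API version")
    return max(common)

# Mid-rollout: client understands v1 and v2, data layer offers v2 and v3.
version = choose_api_version([1, 2], [2, 3])
```

No equivalent trick exists for an unversioned database schema, which is why those changes need far more careful sequencing.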

Another thing to consider is that stateless must really mean stateless insofar as an application server is concerned. A user might be in the middle of a session, between two page views, when an application cluster version update occurs. If sessions are kept in an in-memory distributed cache on the application servers, to pick one way in which these servers might be slightly stateful even while using external databases, then sessions will be disrupted by a deployment. So all state must reside in data stores outside the scope of the application CloudFormation stack resources.
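The requirement can be illustrated with a minimal session store that keeps all state in a shared external backend. The plain dict here stands in for an external service such as Redis or DynamoDB; the point is only that two instances, from the old and new stacks, can serve the same session because neither holds it in its own memory.

```python
class ExternalSessionStore(object):
    """Sketch of session handling that keeps application servers truly
    stateless. The backend stands in for an external data store living
    outside the application CloudFormation stack; nothing session-related
    survives only in instance memory."""

    def __init__(self, backend):
        self.backend = backend  # shared external store, not per-instance

    def save(self, session_id, data):
        self.backend[session_id] = data

    def load(self, session_id):
        # Any instance in any stack can pick up the session mid-flight.
        return self.backend.get(session_id)

shared_store = {}  # stands in for Redis, DynamoDB, etc.
old_stack_instance = ExternalSessionStore(shared_store)
new_stack_instance = ExternalSessionStore(shared_store)

# A session written by an old-stack instance before a deployment...
old_stack_instance.save("s1", {"user": "alice", "page": 2})
```

A session written before a deployment is readable by instances in the replacement stack, so in-flight users see no disruption.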