Throwing Hardware at the Problem is a Perfectly Respectable Solution

October 22nd, 2016

Senior developers should think about cost in a holistic way. Junior developers should learn to think about cost in a holistic way. The job of software engineering isn't just a matter of designing and constructing technical solutions; it is to design and build technical solutions that strike an optimal balance between the expected quality of the solution and the expected cost of the solution. Or at least to make an assessment of the space of quality and cost, the outcomes for different paths, and then let those in charge of the purse strings make an educated choice based on that information. This doesn't just mean cost to build; it also means cost to maintain, and cost to upgrade and adapt across the lifetime of a solution. It isn't just hardware costs; it is also the equally important cost of paying for staffing.

Most organizations quickly develop at least adequate vision and control over cost of hardware and cloud infrastructure, and tend to optimize those costs. The numbers are right there in the billing statements, after all. Conversely, most organizations do poorly when it comes to vision and control over the time of developers: there is a tendency in this industry to optimize the more easily measured costs while spending developer time like water and letting dates slip because of it. There are only so many developers in an organization. When one or more systems or projects require continual babysitting and rescue, that ensures that less time is available to work on the roadmap for future products and features. It is a cost, and a sizable cost, but one that tends to be poorly managed.

The fact that hardware is easily tracked and analyzed, while developer time is less easily tracked and analyzed, leads quite readily to situations in which companies optimize the former in ways that cause them to spend much more on the latter. To pick an example, if you optimize an Elasticsearch cluster to the minimum functional size, you'll most likely wind up spending a lot more developer and support time on managing its failure modes than you will have saved in server costs. This same situation shows up to one degree or another in most applications and installations. This is an age of ever cheaper computation. It is thus ever more frequently the case that throwing hardware at a problem, or overprovisioning a server stack, is the right call to minimize the longer-term total cost of ownership.

To serve a major site or application today, you might be looking at anything from five to fifty larger cloud instances plus a CDN contract, with the number of instances and cost of the CDN scaling by the degree to which data and change are involved. This represents a middle tier web property in terms of traffic, 1,000 to 5,000 requests per second at the origin endpoints behind the CDN, something at that sort of size. With fewer servers, this might be a less frequently changed web application with heavy use of caching: high memory Varnish instances in front of a few webservers that largely serve documents, doing little in the way of business logic. With more servers, this is an application with significant business logic, tens of application servers, and an equally large data layer: core storage in SQL, access via Elasticsearch, and so forth. If we're talking about AWS reserved instances, then five to fifty instances probably sits in the range of a $20,000 to $200,000 yearly cost. Consider that a single senior developer in the mythical average US market will cost the company $150,000-$200,000 a year, depending on the additional costs you want to factor in beyond wages, benefits, and taxes, such as office space requirements. Given that, adding $20,000 in additional instances is roughly equivalent to consuming a month to a month and a half of time on the part of one member of the development team that supports the application. Thus if you have ongoing issues that cost numerous days of developer time every month, it becomes worth trying to solve them with the deployment of additional hardware or cloud instances.
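The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the dollar figures are the ranges quoted in the text, while the function names and the figure of 230 working days per year are assumptions introduced here, not anything from a real costing model.

```python
# A rough sketch of the hardware-versus-developer-time comparison.
# Dollar figures come from the ranges in the text; the 230 working days
# per year is an assumption made for this example only.

WORKING_DAYS_PER_YEAR = 230  # assumed, adjust for your organization


def hardware_in_developer_months(extra_hardware_per_year: float,
                                 developer_cost_per_year: float) -> float:
    """Express a yearly hardware spend as months of one developer's time."""
    developer_cost_per_month = developer_cost_per_year / 12
    return extra_hardware_per_year / developer_cost_per_month


def break_even_days_per_month(extra_hardware_per_year: float,
                              developer_cost_per_year: float,
                              working_days: int = WORKING_DAYS_PER_YEAR) -> float:
    """Developer days per month the extra hardware must save to pay for itself."""
    hardware_per_month = extra_hardware_per_year / 12
    developer_cost_per_day = developer_cost_per_year / working_days
    return hardware_per_month / developer_cost_per_day


if __name__ == "__main__":
    extra_hardware = 20_000  # additional instances, per year
    for developer_cost in (150_000, 200_000):
        months = hardware_in_developer_months(extra_hardware, developer_cost)
        days = break_even_days_per_month(extra_hardware, developer_cost)
        print(f"developer at ${developer_cost:,}/yr: "
              f"{months:.1f} developer-months, "
              f"break-even at {days:.1f} days saved per month")
```

With the figures from the text, $20,000 of extra instances works out to about 1.2 to 1.6 developer-months per year. Under the 230-day assumption, the hardware pays for itself if it saves somewhere around two to three developer days each month.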

To take a simple example, consider loss of data nodes or excessive merging or garbage collection events in Elasticsearch. This sort of thing will happen in any significant use of Elasticsearch, but diagnosing why it happens and how to minimize these issues at a given scale of cluster is a complex undertaking, one that can chew up a great deal of time and experimentation. As an alternative to that work, it is well worth considering the option of simply running a larger Elasticsearch cluster. Additional nodes in the cluster will make the cluster less likely to enter a state in which individual nodes experience issues that degrade overall performance. Throwing hardware at the issue compares favorably in this case, versus spending the time to work through the underlying problems, for all but the very largest of Elasticsearch clusters. This is especially true given that Elasticsearch experts are fairly thin on the ground and their time is expensive. This same principle applies to most systems, applications, and application components. The core of it is to ask yourself whether server failures are causing the team to spend a lot of unanticipated time on diagnosis and emergency response, and whether adding more redundant servers will be cheaper in the long run than the expected cost of developer time to chase down and fix underlying causes.

The point of this short discussion is that you have to take a broader view of the situation when assessing design, maintenance, and costs so as to decide on a course of action. Optimize for the entire expected cost, not just the part you're most familiar with, or the part that is easiest to measure.