Lessons from Working with Elasticsearch at Scale

September 5th, 2016

Elasticsearch is a great tool, but also possessed of the potential to become a pit into which large sums of money and developer time flow, never to be seen again. For the purposes of this post, let us assume that we are past the basics: the architecture of Elasticsearch is understood, writing queries flows naturally, interactions with the cluster via the HTTP API are easily accomplished, initial design of the structure of data and mappings inside a cluster is fairly straightforward, and there is an appreciation of the relationships between configuration and mapping choices on the one hand and CPU, RAM, and storage type on the other. Given that, what would be useful to know when embarking upon the use of Elasticsearch at scale, for high-traffic applications or very large datasets? I offer these few items as things that I would have liked to have known a year ago or so.

Successful Configurations for Other Applications are Not Directly Applicable

Different applications have very different Elasticsearch usage characteristics. The same application and same data with somewhat different choices of arrangement and indexing have very different Elasticsearch usage characteristics. The same configuration options applied to one arrangement of data versus another can produce wildly different outcomes. There are a fair number of good articles out there that discuss, to one degree or another, Elasticsearch configuration and optimization for a specific large-scale application. It is always helpful to read these articles, but only from a general perspective, as Elasticsearch isn't a tool for which one can pick out specific useful configuration tips from other people's efforts and apply them directly to one's own clusters with any confidence. It is far better to approach Elasticsearch configuration from the perspective of first principles and deliberate experimentation.

Updates are Painfully Expensive, and Not Just in the Obvious Way

Writing new data and editing existing data, which are much the same thing in Elasticsearch, are both expensive. This is something that is perhaps not as well emphasized in the Elasticsearch documentation as it might be, given the degree to which it dominates the character of an Elasticsearch cluster at scale. Computational cost, frequency of performance issues, and the knowledge required to establish a stable configuration are all as different as night and day when comparing (a) a cluster containing infrequently updated data and (b) a cluster containing data that must be continually updated at high frequency. In the former case, throwing a cluster together with any old setup will likely work just fine. In the latter case, the result is a high rate of merges and other background activity, large garbage collection events and response time spikes become a real threat, and especially in older Elasticsearch versions there is a whole set of fairly esoteric issues surrounding shard size and merging that can lead to a cluster rotting over time: loss of performance, growth of disk usage, and so on. In general, an application with a high rate of updates requires a much greater investment of time and talent to establish and manage, in addition to being much more prone to new and interesting failure modes.
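For a sense of what a steady stream of updates looks like at the API level, here is a minimal sketch assuming a cluster reachable at http://localhost:9200 and an invented index named items with document type item. An update rewrites the whole document internally, so batching through the _bulk endpoint only amortizes per-request overhead; it does nothing to reduce the downstream merge and garbage collection activity discussed above.

```python
import json
import requests

# Hypothetical endpoint and index/type names, used only for illustration.
ES_URL = "http://localhost:9200"
INDEX, DOC_TYPE = "items", "item"

def bulk_update(docs):
    """Send partial-document updates through the _bulk endpoint.

    Each entry in docs is (doc_id, partial_fields). Elasticsearch still
    rewrites the entire document internally, so batching saves HTTP
    overhead, not indexing or merge work.
    """
    lines = []
    for doc_id, fields in docs:
        lines.append(json.dumps({"update": {"_index": INDEX, "_type": DOC_TYPE, "_id": doc_id}}))
        lines.append(json.dumps({"doc": fields}))
    body = "\n".join(lines) + "\n"  # the bulk API requires a trailing newline
    resp = requests.post(ES_URL + "/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
    return resp.json()

# Example: update a handful of documents in one round trip.
print(bulk_update([("1", {"price": 10}), ("2", {"price": 12})]))
```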

Experimentation is the Only Sure Way to Determine Load Character

It is next to impossible to accurately predict in advance how large a cluster should be for a given data model and flow of updates, what types of server instances and storage should be used (considering factors such as SSD versus magnetic drives, network bandwidth, CPU and RAM), and how it will behave when under the load of a specific application. Building a cluster and running extended load tests against it is really the only good way to obtain useful answers, and that includes the important point of sizing to accommodate node failures. The load tests have to be extended because many problems, such as disk usage growth, excess merges, and large garbage collection events, either show up infrequently or emerge slowly over time.
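The load generator itself does not have to be sophisticated; what matters is that it replays a representative workload for hours and records latency percentiles rather than averages. The following is a rough sketch under assumed names (endpoint, index, and query are all invented); a real test would also mix in the application's update traffic.

```python
import json
import random
import time

import requests

# Hypothetical endpoint and a stand-in query; substitute a representative workload.
ES_URL = "http://localhost:9200"
SEARCH_URL = ES_URL + "/items/_search"
QUERY = {"query": {"match": {"title": "example"}}}

def run_load_test(duration_seconds=6 * 3600, pause=0.05):
    """Issue a steady stream of queries and report latency percentiles.

    The run is deliberately long: merge storms, garbage collection pauses,
    and disk usage growth rarely show up in a five-minute test.
    """
    latencies = []
    end = time.time() + duration_seconds
    while time.time() < end:
        start = time.time()
        requests.post(SEARCH_URL, data=json.dumps(QUERY),
                      headers={"Content-Type": "application/json"})
        latencies.append(time.time() - start)
        time.sleep(random.uniform(0, pause))  # crude pacing between requests
    latencies.sort()
    for p in (0.50, 0.95, 0.99):
        print("p%d latency: %.3fs" % (p * 100, latencies[int(p * (len(latencies) - 1))]))

if __name__ == "__main__":
    run_load_test()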

Optimizing Only for Cluster Size is Likely Disastrous

When considering total cost of ownership of an Elasticsearch cluster, it is very, very important to keep track of the costs of time spent managing it. It is all too easy to strive to optimize the easily calculated cost of server instances, reduce the cluster down to the smallest it can be, and as a result spend a great deal more than was saved on responding to and fixing issues that cause significant response time spikes or even outright outages. Developer time is expensive in comparison to server uptime, and all Elasticsearch clusters run more smoothly when they have sufficient excess capacity to soak up merges, garbage collection events, and node failure and subsequent reshuffling of data without causing performance degradation. A cluster optimized to the smallest number of server instances needed to run when everything is going fine will fail completely under load when one node is lost or temporarily inconvenienced - and the larger the cluster, the greater the odds per unit time that any one of its nodes will suffer such an event.

Deploying a New Cluster is a Very Slow Process

One very important determinant of a great many of the development and operations choices made when using Elasticsearch is that it takes a very long time to deploy a new or a replacement cluster, certainly much longer than people are used to for relational databases. For a reasonable set of data transformation rules and a reasonably sized cluster for the data set used, it can easily take a day or more to establish the cluster, load the data, and then run the necessary updates to bring it up to date. It may turn out to be faster to arrange matters such that the cluster loads, processes, and indexes data with a much larger set of server instances than it eventually runs on, as the combined time to load and then resize the cluster - reallocating all of the processed and indexed data to a new set of smaller server instances - can be significantly shorter than that of simply loading the data with the final configuration.
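One way to carry out that resize, sketched below with invented node addresses, is to bring up the final set of nodes alongside the temporary loading nodes and then use allocation filtering to drain the temporary nodes before shutting them down. This is only a sketch of the approach; whether the load-then-shrink route actually saves time is something to measure for the specific data set.

```python
import json
import requests

ES_URL = "http://localhost:9200"  # hypothetical cluster endpoint

def drain_nodes(ip_list):
    """Ask the cluster to move all shards off the listed nodes.

    Once /_cluster/health reports no relocating shards, the drained
    nodes can be shut down safely.
    """
    settings = {"transient": {"cluster.routing.allocation.exclude._ip": ",".join(ip_list)}}
    resp = requests.put(ES_URL + "/_cluster/settings", data=json.dumps(settings),
                        headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()

def relocation_finished():
    """True once shard reallocation has completed and the cluster is green."""
    health = requests.get(ES_URL + "/_cluster/health").json()
    return health["relocating_shards"] == 0 and health["status"] == "green"

# Example: drain two temporary loading nodes used for the initial import,
# then poll relocation_finished() before terminating them.
print(drain_nodes(["10.0.0.11", "10.0.0.12"]))
```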

That it may take a day, or at best a good number of hours, to replace a cluster drives most of the considerations of reliability and failover. It makes it necessary to maintain a larger capacity and to monitor the health of an Elasticsearch cluster much more carefully and proactively than would otherwise be the case. The existence of a failover cluster is essential for an application, given that any major outage would mean hours to days of downtime without it. Another effect of the time taken to deploy new Elasticsearch clusters is that it greatly increases the cost and turnaround time for necessary tests, experiments, and ongoing development: this is something to be aware of when planning a product roadmap.
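That proactive monitoring does not have to be elaborate to be useful. Even a simple poll of the cluster health endpoint, as in the sketch below (endpoint and expected node count are invented for the example), will surface a degrading cluster well before it becomes an outage when wired into whatever alerting system is already in place.

```python
import requests

ES_URL = "http://localhost:9200"  # hypothetical cluster endpoint

def check_cluster(expected_nodes=9):
    """Return a list of warning strings; an empty list means the cluster looks healthy."""
    health = requests.get(ES_URL + "/_cluster/health").json()
    warnings = []
    if health["status"] != "green":
        warnings.append("cluster status is %s" % health["status"])
    if health["number_of_nodes"] < expected_nodes:
        warnings.append("only %d of %d nodes present" % (health["number_of_nodes"], expected_nodes))
    if health["unassigned_shards"] > 0:
        warnings.append("%d unassigned shards" % health["unassigned_shards"])
    return warnings

# Run on a schedule, e.g. every minute, and alert on any warning returned.
for warning in check_cluster():
    print("WARNING:", warning)
```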

In AWS, Keeping All Instances in One Availability Zone Probably Makes Sense

Elasticsearch cluster nodes tend to transmit a lot of data between one another, especially for clusters that accept a high rate of data updates. When deploying in AWS it might initially sound like a good idea to split a cluster across availability zones, but in order for that to make sense as a robustness exercise in the event of availability zone failure, the cluster has to be at least half as big again as the minimum sensible size, and possibly more. The only way to calibrate that size for a specific usage is to run a cluster under load and then turn off a third of the nodes. The overall cost of data transfer for cluster operation in multiple availability zones will be much higher than that of a cluster in one availability zone, since zone-to-zone data transfer charges are not insignificant.

Given the exceedingly sparse history of availability zone failures over the past five years, it is useful to ask whether it makes sense to be resilient in this way at all, or to instead offer failover by maintaining a second Elasticsearch cluster in another location. Given that other components of a cluster and its supporting infrastructure can and will fail far more frequently than AWS availability zones, it is likely that most organizations will already plan to maintain a failover cluster. In this sort of situation, the uptime requirements for the application would have to be fairly stringent to justify both a failover and clusters large enough to take an availability zone outage in stride, especially with the further data transfer costs piled on top of the instance costs.

Splitting Data into Multiple Clusters can be Very Helpful

Splitting out data by its character - by usage patterns and, most especially, by rate of updates - can be very helpful. Put the data that must update frequently into one cluster and the more static data into another cluster, for example. Joining the two clusters with tribe nodes is an option, but often a problematic one given their limitations. Otherwise the more sensible choice is to make the application data layer smart enough to connect to multiple clusters for different document types. Once data is split into clusters in this way, optimizing for each use case becomes much easier - very different configurations, arrangements of data, and classes of server instance might be used, something that cannot be achieved with everything lumped into a single cluster.
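A minimal sketch of that smarter data layer might look like the following, with cluster endpoints, index names, and document types all invented for the example: a simple mapping from document type to cluster, behind which each cluster can then be sized and configured for its own workload.

```python
import json
import requests

# Hypothetical cluster endpoints and index names: one cluster tuned for a
# heavy update stream, another holding largely static reference data.
CLUSTERS = {
    "activity": ("http://es-frequent.internal:9200", "activity"),
    "catalog": ("http://es-static.internal:9200", "catalog"),
}

def _base_url(doc_type):
    cluster_url, index = CLUSTERS[doc_type]
    return "%s/%s/%s" % (cluster_url, index, doc_type)

def index_document(doc_type, doc_id, body):
    """Write a document to whichever cluster owns its type."""
    resp = requests.put("%s/%s" % (_base_url(doc_type), doc_id),
                        data=json.dumps(body),
                        headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()

def search(doc_type, query):
    """Query only the cluster that holds the given document type."""
    resp = requests.post("%s/_search" % _base_url(doc_type),
                         data=json.dumps(query),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()

# Frequently updated documents go to one cluster, static lookups to the other.
index_document("activity", "42", {"user": "alice", "action": "login"})
search("catalog", {"query": {"match_all": {}}})
```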