An Approach to the Optimization of an Expensive Elasticsearch Sorted Query
In any serious use of Elasticsearch, given a data set of reasonable size and complexity, it is usually possible to construct sorted queries that will bring the cluster to its knees under even moderate volume. One common cause of excessive load is a sorted query that hits too many indexes for the cluster servers to comfortably accommodate. Paging through sorted results tends to exacerbate the issue, as an end user or API client steps through thousands of items, ten items at a time, with each page rerunning the underlying expensive query against a large number of indexes and servers. Obviously, developers design an application and its Elasticsearch data layer to comfortably support the expected range of sorted, paged queries, but change over time tends to eventually sabotage the best of architectures and expectations.
As a trivial and contrived example, consider a database of purchases split into indexes by product and year. It was originally designed to support queries on various properties of purchases for a specific product, so in the worst case only a couple of indexes are involved, and the number of indexes searched can in any case be cut down by requiring each request to specify a date range and selecting indexes accordingly:
curl -XGET localhost:9200/product1_2017,product1_2016/_search -d '{
  "size": 10,
  "from": 200,
  "query": {
    "bool": {
      "must": { "term": { "store": "example store" } }
    }
  },
  "sort": [
    { "purchase_date": { "order": "asc" } },
    "_score"
  ]
}'
Then the requirements abruptly changed to include a new feature that gathers products together into product groups. The same queries that once ran against individual products now run over all products in a group, and as a consequence touch many indexes. These queries generate considerable additional load and are inefficient and slow in comparison to their predecessors:
curl -XGET localhost:9200/_all/_search -d '{
  "size": 10,
  "from": 200,
  "query": {
    "bool": {
      "must": [
        { "term": { "store": "example store" } },
        { "term": { "group": "example product group" } }
      ]
    }
  },
  "sort": [
    { "purchase_date": { "order": "asc" } },
    "_score"
  ]
}'
In reality this would be an obvious development, obvious enough that any good development team would see it coming well before it happened and avoid being blindsided. Conceptually similar situations can and will occur in more subtle ways, however, showing up in the corners of a complicated project as it changes and grows over the years, and some of them will be unexpected. This is why any project that uses Elasticsearch must monitor slow queries and periodically investigate and optimize the slowest of them. No project is static, and the work done in the last major update to keep things running smoothly will gradually erode over time.
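Elasticsearch's search slow log is the standard tool for this sort of monitoring, and its thresholds can be set dynamically per index. The following is a minimal sketch, assuming a cluster at localhost:9200 and the Python requests library; the threshold values are illustrative rather than recommendations:

import json
import requests

ES = "http://localhost:9200"

# Log the query and fetch phases of any search slower than these
# thresholds; entries appear in the slow log on the data nodes.
settings = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "1s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}

# Apply the settings dynamically to all existing indexes.
response = requests.put(
    "%s/_all/_settings" % ES,
    headers={"Content-Type": "application/json"},
    data=json.dumps(settings),
)
print(response.json())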
Given the discovery of a sorted query that is expensive because it touches many indexes, what to do about it? One possible approach is to replace it with the following: (a) obtain the _id values for all of the documents considered in the scope of the sorted query, then (b) run a query to sort and page them, referencing the documents directly by _id. This can work to alleviate the load of a query that touches many indexes, since the part of the task that involves obtaining _id values can be broken up into multiple queries, each of which only touches one or a few indexes. These can run in parallel to some degree. From the example above, queries for purchase _id values can run per product with a large number of items returned per page:
curl -XGET localhost:9200/product1_2017,product1_2016/_search -d '{
  "size": 10000,
  "from": 0,
  "fields": [ "_uid" ],
  "query": {
    "bool": {
      "must": { "term": { "store": "example store" } }
    }
  },
  "sort": [ "_uid" ]
}'
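Since each per-product query is self-contained, the set of them can be dispatched concurrently. The following sketch illustrates one way to do that with a thread pool, assuming the <product>_<year> index naming convention from the example, a cluster at localhost:9200, and the Python requests library; the product list and years are hypothetical placeholders:

import json
from concurrent.futures import ThreadPoolExecutor

import requests

ES = "http://localhost:9200"
HEADERS = {"Content-Type": "application/json"}

def fetch_ids(product, years):
    """Fetch the _id values of matching purchases for one product."""
    indexes = ",".join("%s_%d" % (product, year) for year in years)
    query = {
        "size": 10000,
        "from": 0,
        "fields": ["_uid"],
        "query": {
            "bool": {"must": {"term": {"store": "example store"}}}
        },
        "sort": ["_uid"],
    }
    response = requests.get(
        "%s/%s/_search" % (ES, indexes),
        headers=HEADERS,
        data=json.dumps(query),
    )
    hits = response.json()["hits"]["hits"]
    return [hit["_id"] for hit in hits]

# One modest query per product rather than one huge query per group.
products = ["product1", "product2", "product3"]
with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(lambda p: fetch_ids(p, [2016, 2017]), products)

ids_by_product = dict(zip(products, results))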
The sorting is then accomplished by running an ids query that includes all of those purchase document _id values. Breaking them up into blocks of at most a few hundred values appears to be more efficient in the clusters I've worked with, as is segregating document _id values into blocks by the index containing the document, if the information to do that is available:
curl -XGET localhost:9200/_all/_search -d '{
  "size": 10,
  "from": 200,
  "query": {
    "bool": {
      "should": [
        { "ids": { "type": "purchase", "values": [ "1", "2", ... ] } },
        { "ids": { "type": "purchase", "values": [ "301", "302", ... ] } },
        ...
      ]
    }
  },
  "sort": [
    { "purchase_date": { "order": "asc" } },
    "_score"
  ]
}'
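The client-side work of assembling that query is mechanical: split the gathered _id values into blocks and emit one ids clause per block. This sketch continues from the hypothetical ids_by_product mapping above, using blocks grouped by product as an approximation of grouping by index; the block size of 300 is one illustrative choice in the "at most a few hundred" range:

import json

import requests

ES = "http://localhost:9200"
HEADERS = {"Content-Type": "application/json"}

def blocks(values, size=300):
    """Split a list of _id values into blocks of at most `size`."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

def sorted_ids_query(ids_by_product, page=0, per_page=10):
    # One ids clause per block, with blocks segregated by product so
    # that each clause tends to reference documents in the same indexes.
    should = [
        {"ids": {"type": "purchase", "values": block}}
        for ids in ids_by_product.values()
        for block in blocks(ids)
    ]
    return {
        "size": per_page,
        "from": page * per_page,
        "query": {"bool": {"should": should}},
        "sort": [{"purchase_date": {"order": "asc"}}, "_score"],
    }

# Page 20 at ten items per page matches the "from": 200 example above.
query = sorted_ids_query(ids_by_product, page=20)
response = requests.get(
    "%s/_all/_search" % ES,
    headers=HEADERS,
    data=json.dumps(query),
)
print(response.json()["hits"]["hits"])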
This works well for a relatively low number of documents scattered across a relatively large number of indexes. It is a bad choice when there are only a few indexes, where a standard single sorted query should be used, or when there are a relatively large number of documents to consider. As the number of documents grows, the sorted ids query balloons into an enormous JSON document, and cluster load and query response time may or may not improve by adopting this strategy. A large document count is certainly an indicator that a different optimization strategy is needed.