Analyzing the Contents of Very Large S3 Buckets
Amazon's S3 buckets are a very convenient place to put data. Once an organization starts using AWS in any meaningful way, S3 usage will grow inexorably over time. Pressure from those responsible for paying the bills will tend to keep the worst excesses under control: typically, lifecycle rules will be put in place to trim the offending collections of files, or move them to Glacier storage. Nonetheless, given years of growth and turnover in an organization that handles any appreciable amount of data, newcomers will tend to find themselves inheriting buckets that contain terabytes of data and tens of millions of files, far from all of which are well understood by any single individual. Worse, since there was until comparatively recently a limit on the number of S3 buckets permitted per AWS account, many of the inherited buckets will contain a melange of data from teams and applications, both existing and obsolete.
As an exercise in programmer archaeology, producing a useful inventory of common data stores is a frustrating and slow task, gated by human interaction speeds. At some point, however, it has to be done if any meaningful cleanup is to proceed. If you don't want to find yourself in this situation, then try enforcing these simple rules from the outset:
- Create one S3 bucket per data-generating application. An application can use data in any bucket, but only writes data to its own bucket.
- All buckets are tagged to identify the responsible team or business unit in billing reports.
- Teams or business units pay for S3 usage from their own budgets.
This sort of approach ensures that those closest to unnecessary S3 usage, and who have the most contextual knowledge, also have the greatest incentive to do something about it before it becomes an impenetrable swamp.
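For example, the tagging rule is simple to apply when a bucket is created. The following is a minimal sketch using the Node.js SDK; the bucket name and tag keys are illustrative assumptions, and note that putBucketTagging replaces any existing tag set on the bucket.

var AWS = require('aws-sdk');

var s3Client = new AWS.S3();

// Tag the bucket so that its costs can be attributed in billing reports. The
// bucket name and tag keys are illustrative; use whatever cost allocation
// tags your organization has standardized on.
s3Client.putBucketTagging({
  Bucket: 'example-application-bucket',
  Tagging: {
    TagSet: [
      { Key: 'team', Value: 'data-engineering' },
      { Key: 'application', Value: 'example-application' }
    ]
  }
}, function (error) {
  if (error) {
    console.error(error);
  }
});

Tags only show up in billing reports once they have been activated as cost allocation tags in the billing console.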
If you have already inherited an impenetrable swamp, however, what can you do with it? S3 is a flat object store with few extras atop that bare-bones functionality. There is no getting around the fact that few built-in metrics are available on bucket contents, and none at all on subsections of those contents selected by key prefix. To figure out what is in there, you pretty much have to list and examine every object, or use a tool that, under the hood, lists and examines every object in order to calculate aggregate data. Once bucket file counts reach the hundreds of thousands, never mind millions, this becomes painfully slow. API calls are also, while very cheap, not free.
Using S3 Inventory
The first version of this post was written just prior to the introduction of S3 Inventory. This service provides a scheduled flat file dump that lists bucket contents and their metadata. S3 Inventory is the option you should use in order to mine and analyze bucket contents. The rest of this post should be taken as a lengthy illustration as to (a) why it is a far better choice, and (b) why Amazon should have provided the inventory service a long time ago.
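For reference, inventory generation can be configured programmatically as well as through the console. The following is a minimal sketch using the Node.js SDK's putBucketInventoryConfiguration operation; the bucket names, inventory ID, and prefix are illustrative assumptions rather than values from this post.

var AWS = require('aws-sdk');

var s3Client = new AWS.S3();

// Request a daily CSV inventory of example-bucket, delivered to a separate
// destination bucket. All names and the inventory ID here are illustrative.
s3Client.putBucketInventoryConfiguration({
  Bucket: 'example-bucket',
  Id: 'example-daily-inventory',
  InventoryConfiguration: {
    Id: 'example-daily-inventory',
    IsEnabled: true,
    IncludedObjectVersions: 'Current',
    Schedule: { Frequency: 'Daily' },
    OptionalFields: ['Size', 'LastModifiedDate', 'StorageClass'],
    Destination: {
      S3BucketDestination: {
        // The destination bucket is specified by ARN and must have a bucket
        // policy that allows S3 to write the inventory files to it.
        Bucket: 'arn:aws:s3:::example-inventory-destination',
        Format: 'CSV',
        Prefix: 'inventory'
      }
    }
  }
}, function (error) {
  if (error) {
    console.error(error);
  }
});

Note that the first inventory report can take up to 48 hours to arrive, so this is a tool for ongoing visibility rather than an immediate answer.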
Using the AWS CLI to List Objects
The AWS CLI tool offers a very useful JMESPath query function that, in conjunction with the s3api list-objects command, takes much of the drudgery out of asking simple questions about bucket contents. For a one-off investigation of straightforward metrics such as the aggregate size of files with a given key prefix, this is probably the best bet. For example, to obtain the total size and count of files found under s3://example-bucket/example-prefix/:
aws s3api list-objects \
  --bucket example-bucket \
  --prefix example-prefix/ \
  --output json \
  --query "[sum(Contents[].Size), length(Contents[])]"
This will take time, however. Under the hood it is listing all of those objects, sequentially, 1000 per HTTP request. If running this on an EC2 instance, expect it to take on the order of 5-10 minutes per million objects. If running it outside AWS, listing a million objects will take considerably longer. Just how much longer depends on the quality of the connection, round trip time to the API endpoint, and so on. My advice is not to bother finding out. Just spin up a t2.nano Amazon Linux instance and run it there, or make use of an existing instance if you have one to hand.
Listing Objects Concurrently by Walking a Tree
Fortunately, we are not completely restricted to listing objects in a linear, sequential fashion. The S3 API allows keys to be broken up by an arbitrary delimiter character, and thus a tree of "directories" can be built. Listing a particular "directory" returns the objects at that level, plus the next level of "directories". This is how the S3 console interface works, using the / character as its default delimiter. This is best illustrated by example, as the API documentation isn't terribly clear. Let us say that a bucket has the following keys:
alpha/file1.txt
alpha/seven/file/file300.txt
alpha/slope/file22.txt
beta/one/file10.txt
beta/subtext/file17.txt
When issuing a list objects request with a delimiter of / and a prefix of alpha/, the response includes (a) the S3 objects immediately under the prefix "directory" in response.Contents, and (b) the next level of "subdirectories" in response.CommonPrefixes:
var AWS = require('aws-sdk');

var client = new AWS.S3();

client.listObjects({
  Bucket: 'exampleBucket',
  Delimiter: '/',
  MaxKeys: 1000,
  Prefix: 'alpha/'
}, function (error, response) {
  if (error) {
    return console.error(error);
  }

  // Log the keys of the objects directly under the prefix.
  console.log(response.Contents.map(function (obj) {
    return obj.Key;
  }));

  // Log the common prefixes, the 'subdirectories'. Each entry in
  // response.CommonPrefixes is an object with a Prefix property.
  console.log(response.CommonPrefixes.map(function (prefixObject) {
    return prefixObject.Prefix;
  }));
});
[ 'alpha/file1.txt' ]
[ 'alpha/seven/', 'alpha/slope/' ]
Note that, under the hood, this still builds upon the process of listing all of the objects, and so interacts interestingly with the MaxKeys parameter. To obtain all of the objects and common prefixes in very large buckets, it may be necessary to page through multiple requests using either NextMarker or NextContinuationToken, depending on whether you are using the listObjects or listObjectsV2 API. If MaxKeys is much smaller than the number of files or common prefixes, then even for small buckets individual responses may not include all of the common prefixes, the "subdirectories". You might look at an older post here for a generic example of paged requests using the Node.js SDK, and you should certainly experiment with your own buckets.
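To make the paging concrete, here is a minimal sketch using listObjectsV2 and continuation tokens; the listAll helper is hypothetical, and the bucket and prefix are the illustrative values used above.

var AWS = require('aws-sdk');

var client = new AWS.S3();

// A hypothetical helper: accumulate all of the objects and common prefixes
// found under the given prefix, issuing as many paged requests as needed.
function listAll (bucket, prefix, callback) {
  var objects = [];
  var commonPrefixes = [];

  function listPage (continuationToken) {
    client.listObjectsV2({
      Bucket: bucket,
      Delimiter: '/',
      Prefix: prefix,
      ContinuationToken: continuationToken
    }, function (error, response) {
      if (error) {
        return callback(error);
      }

      objects = objects.concat(response.Contents);
      commonPrefixes = commonPrefixes.concat(response.CommonPrefixes);

      // If the response was truncated, request the next page.
      if (response.IsTruncated) {
        listPage(response.NextContinuationToken);
      } else {
        callback(null, objects, commonPrefixes);
      }
    });
  }

  listPage();
}

listAll('exampleBucket', 'alpha/', function (error, objects, commonPrefixes) {
  if (error) {
    return console.error(error);
  }

  console.log(objects.length, commonPrefixes.length);
});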
Since a bucket, or a subset of its objects selected by key prefix, can be broken up in this way into "directories", it is possible to traverse the contents of the bucket as a tree: start at the top, and descend in a way that is mostly but not completely breadth-first. At each level, list all of the objects and common prefixes, and recursively spawn a new task for each "subdirectory"; depending on the keys and bucket contents this may require multiple requests, and the listing of objects in a "directory" mixes with the discovery of its "subdirectories". In this way many concurrent requests can be made and the contents of a bucket read and assessed much more rapidly.
If using Node.js or similar, it isn't even necessary to spawn new threads for a low level of concurrency. For most simple applications, a single Node.js process can support 10-20 concurrent or near-concurrent HTTP requests before running into issues. So one approach is to create a processing queue with limited concurrency, then walk the S3 bucket and send the listing of each "subdirectory" as a task to the queue. An example implementation follows:
var async = require('async');
var AWS = require('aws-sdk');
var _ = require('lodash');

// An asynchronous queue to process listing tasks at a modest concurrency.
var queue = async.queue(listRecursively, 10);

// An S3 client, reading access tokens from the default locations.
var s3Client = new AWS.S3();

/**
 * Analyze an S3 object.
 *
 * @param {Object} s3Object An S3 object definition.
 */
function analyzeS3Object (s3Object) {
  // Here perform whatever analysis or assembly of running totals is required.
}

/**
 * Take action upon completion of the recursive bucket listing process.
 */
queue.drain = function () {
  console.info('Complete.');
  // Take other actions as desired here.
};

/**
 * List one page of the objects in a given 'directory' by common prefix,
 * along with the common prefixes for its 'subdirectories'.
 *
 * @param {Object} options
 * @param {String} options.bucket The bucket name.
 * @param {String} options.delimiter Used to find common prefixes to split out
 *   requests for object listing. Defaults to '/'.
 * @param {String} [options.prefix] If set, only return keys beginning with
 *   the prefix value.
 * @param {String} [options.continuationToken] If set, list only a paged set
 *   of keys, with the token showing the start point.
 * @param {Number} [options.maxKeys] Maximum number of keys to return per
 *   request. Defaults to 1000.
 * @param {Function} callback Callback of the form
 *   function (error, nextContinuationToken, s3Objects, commonPrefixes).
 */
function listPage (options, callback) {
  var params = {
    Bucket: options.bucket,
    Delimiter: options.delimiter,
    ContinuationToken: options.continuationToken,
    MaxKeys: options.maxKeys,
    Prefix: options.prefix
  };

  // S3 operations have a small but significant error rate, so wrap them in a
  // retry.
  async.retry(3, function (asyncCallback) {
    s3Client.listObjectsV2(params, asyncCallback);
  }, function (error, response) {
    var continuationToken;

    if (error) {
      return callback(error);
    }

    // Check to see if there are yet more objects to be obtained, and if so
    // return the continuation token for use in the next request.
    if (response.IsTruncated) {
      continuationToken = response.NextContinuationToken;
    }

    callback(
      null,
      continuationToken,
      response.Contents,
      _.map(response.CommonPrefixes, function (prefixObject) {
        return prefixObject.Prefix;
      })
    );
  });
}

/**
 * List the objects in a given 'directory' by common prefix, and spawn new
 * queue tasks to list all child 'directories'.
 *
 * All the objects found in this 'directory' are passed to analyzeS3Object.
 *
 * @param {Object} options
 * @param {String} options.bucket The bucket to list.
 * @param {String} options.delimiter Used to find common prefixes to split out
 *   requests for object listing. Defaults to '/'.
 * @param {Number} [options.maxKeys] Maximum number of keys to return per
 *   request. Defaults to 1000.
 * @param {String} [options.prefix] If present, only list objects with keys
 *   that match the prefix.
 * @param {Function} callback Of the form function (error).
 */
function listRecursively (options, callback) {
  /**
   * Recursively list pages of objects and common prefixes for this
   * 'directory'.
   *
   * @param {String} [continuationToken] A value provided by the S3 API to
   *   enable paging of large lists of keys. The result set requested starts
   *   from the token. If not provided, then the list starts from the first
   *   key.
   */
  function recurse (continuationToken) {
    options.continuationToken = continuationToken;

    listPage(options, function (error, nextContinuationToken, s3Objects, commonPrefixes) {
      if (error) {
        return callback(error);
      }

      // Spin up more jobs for the queue to look at the 'subdirectories', the
      // common prefixes under the present one. Common prefixes can turn up on
      // any page of a truncated listing, but each distinct prefix is only
      // returned once, so it is safe to queue them as they arrive.
      _.each(commonPrefixes, function (commonPrefix) {
        var subDirectoryOptions = _.chain({}).extend(options, {
          prefix: commonPrefix
        }).omit('continuationToken').value();

        queue.push(subDirectoryOptions);
      });

      // Send any S3 object definitions to be analyzed.
      _.each(s3Objects, function (s3Object) {
        analyzeS3Object(s3Object);
      });

      // If there are more objects, go get them.
      if (nextContinuationToken) {
        recurse(nextContinuationToken);
      } else {
        callback();
      }
    });
  }

  // Start the recursive listing at the beginning, with no continuation token.
  recurse();
}

// Start things going by adding the first task to the queue.
queue.push({
  bucket: 'exampleBucket',
  delimiter: '/'
});
Implemented as an NPM Package
An implementation similar to the example code above can be found in the NPM package s3-object-streams, which as the name might suggest uses the Node.js stream functionality for control of asynchronous flow. It is fairly simple, but good enough to list and assess metrics on the contents of buckets with tens of millions of files in a reasonable amount of time.
Build a Central Bucket Assessment Service
When people in your organization have the frequent need to examine the contents of large buckets, the best thing to do is buckle down and build a central web service that scans S3 buckets on a regular basis. It doesn't have to be complicated. Breaking things down by bucket and the top level or two of common prefixes ("subdirectories"), and then listing total file size and file count by S3 storage class is probably sufficient to answer most inquiries. Putting that into a web application is a small project given the existence of a tool such as the aforementioned s3-object-streams to generate the metrics.
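As a sketch of the sort of aggregation involved, and assuming objects with the shape of the Contents entries returned by the listing code earlier in this post, running totals might be assembled in the analyzeS3Object hook along the following lines; the totals structure here is illustrative, not part of any particular package.

// Running totals keyed by top level 'directory' and then by S3 storage class.
var totals = {};

// The s3Object argument is expected to have the shape of the Contents
// entries returned by the listObjects APIs: Key, Size, StorageClass, etc.
function analyzeS3Object (s3Object) {
  var topLevelPrefix = s3Object.Key.split('/')[0];
  var storageClass = s3Object.StorageClass || 'STANDARD';

  totals[topLevelPrefix] = totals[topLevelPrefix] || {};
  totals[topLevelPrefix][storageClass] = totals[topLevelPrefix][storageClass] || {
    count: 0,
    size: 0
  };

  totals[topLevelPrefix][storageClass].count += 1;
  totals[topLevelPrefix][storageClass].size += s3Object.Size;
}

Dumped to a database or even a flat JSON file on each scan, that is enough data to answer the majority of the questions people will ask about a bucket.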