Adding an S3 Inventory to Every Bucket in an AWS Account via the API
The S3 Inventory service is a way to request automatic generation of a complete or partial inventory of S3 bucket contents, written to CSV files that are also placed in S3. This service can save a lot of tedious work for those of us who are put in the position of performing archaeology on the contents of sprawling shared buckets that have existed for years without any form of pruning or gardening. It is definitely a tool to start using, but you'll find a number of strange design choices in the Amazon APIs and SDKs, and the new endpoints for the S3 Inventory service have their share. In my case it took some experimentation and careful rereading of the documentation to turn out functioning code.
For Example: ARN, Not Bucket Name
The item I spent the most time on can be found in the putBucketInventoryConfiguration endpoint. The configuration looks as follows in the SDK documentation. Note that there are two places in which an Id parameter appears, and also two places in which a Bucket parameter appears.
var params = {
  Bucket: 'STRING_VALUE', /* required */
  Id: 'STRING_VALUE', /* required */
  InventoryConfiguration: { /* required */
    Destination: { /* required */
      S3BucketDestination: { /* required */
        Bucket: 'STRING_VALUE', /* required */
        Format: 'CSV', /* required */
        AccountId: 'STRING_VALUE',
        Prefix: 'STRING_VALUE'
      }
    },
    Id: 'STRING_VALUE', /* required */
    IncludedObjectVersions: 'All | Current', /* required */
    IsEnabled: true || false, /* required */
    Schedule: { /* required */
      Frequency: 'Daily | Weekly' /* required */
    },
    Filter: {
      Prefix: 'STRING_VALUE' /* required */
    },
    OptionalFields: [
      'Size | LastModifiedDate | StorageClass | ETag | IsMultipartUploaded | ReplicationStatus',
      /* more items */
    ]
  }
};
Naively, one might leap in, as I did, and try the following code. The intent is to set up a configuration that adds inventory files for the exampleBucket contents to that same bucket, with a distinguishing prefix:
var AWS = require('aws-sdk');

var client = new AWS.S3();
var bucket = 'exampleBucket';
var id = 'exampleBucketInventory';

var params = {
  Bucket: bucket,
  Id: id,
  InventoryConfiguration: {
    Destination: {
      // Put the inventory files into the same bucket.
      S3BucketDestination: {
        Bucket: bucket,
        Format: 'CSV',
        // Not needed if everything is in the same account.
        // AccountId: '111222333444',
        // Do NOT add a trailing '/' to this prefix; it will be added
        // by S3 Inventory when creating the keys regardless, and so
        // the key will start with 'inventory//' if you add one here.
        Prefix: 'inventory'
      }
    },
    // Omit the filter to inventory all objects in the bucket.
    // Filter: {
    //   Prefix: 'examplePrefix'
    // },
    Id: id,
    IncludedObjectVersions: 'All',
    IsEnabled: true,
    Schedule: {
      Frequency: 'Daily'
    },
    // Add all the mentioned optional fields.
    OptionalFields: [
      'Size',
      'LastModifiedDate',
      'StorageClass',
      'ETag',
      'IsMultipartUploaded',
      'ReplicationStatus'
    ]
  }
};

client.putBucketInventoryConfiguration(params, function (error) {
  if (error) {
    console.log(error);
  }
});
Try this, however, and it will fail with the following very uninformative error: "MalformedXML: The XML you provided was not well-formed or did not validate against our published schema". On careful rereading of the SDK documentation, or checking the examples elsewhere in the Amazon API documentation, it turns out that the second Bucket parameter needs to be an ARN, such as arn:aws:s3:::exampleBucket. The working example then looks like this:
var AWS = require('aws-sdk');

var client = new AWS.S3();
var bucket = 'exampleBucket';
var id = 'exampleBucketInventory';

var params = {
  Bucket: bucket,
  Id: id,
  InventoryConfiguration: {
    Destination: {
      // Put the inventory files into the same bucket.
      S3BucketDestination: {
        // A bucket ARN rather than a bucket name.
        Bucket: 'arn:aws:s3:::' + bucket,
        Format: 'CSV',
        // Not needed if everything is in the same account.
        // AccountId: '111222333444',
        // Do NOT add a trailing '/' to this prefix; it will be added
        // by S3 Inventory when creating the keys regardless, and so
        // the key will start with 'inventory//' if you add one here.
        Prefix: 'inventory'
      }
    },
    // Omit the filter to inventory all objects in the bucket.
    // Filter: {
    //   Prefix: 'examplePrefix'
    // },
    Id: id,
    IncludedObjectVersions: 'All',
    IsEnabled: true,
    Schedule: {
      Frequency: 'Daily'
    },
    // Add all the mentioned optional fields.
    OptionalFields: [
      'Size',
      'LastModifiedDate',
      'StorageClass',
      'ETag',
      'IsMultipartUploaded',
      'ReplicationStatus'
    ]
  }
};

client.putBucketInventoryConfiguration(params, function (error) {
  if (error) {
    console.log(error);
  }
});
Ensuring an Inventory for Every S3 Bucket in an Account
It is simple enough to set up a service that runs every so often to ensure that every S3 bucket in an account is inventoried via the S3 Inventory service, and that the CSV inventory files are in a consistent location. The cost of doing this is minimal, and the benefits are concrete. No more lengthy uses of the listObjects endpoint to figure out the size or contents of the bucket, or of subsets of the bucket by key prefix, for example. The list of contents is much more readily available, and for very large buckets and goals such as assessing the size of sections by key prefix, the inventory delay of a day or two isn't important.
The following code is offered as an example, a Node.js module that adds an S3 inventory configuration to every bucket in an account. It could, for example, be set to run daily as a Lambda function to add the necessary S3 Inventory configuration to new buckets as they are created. Since the S3 Inventory processes run at most daily, there is no need to check more frequently.
/**
 * @fileOverview Add a consistent S3 Inventory configuration to all buckets.
 */

// NPM.
var async = require('async');
var AWS = require('aws-sdk');

/**
 * Obtain an S3 client with default configuration and permissions read from the
 * environment.
 *
 * @return {AWS.S3} A client instance.
 */
exports.getClient = function () {
  if (!exports.client) {
    exports.client = new AWS.S3();
  }

  return exports.client;
};

/**
 * Ensure that the specified bucket has an S3 inventory set up in a standard
 * way. Write the CSV inventory files to the same bucket.
 *
 * @param {String} bucket The bucket.
 * @param {Function} callback Of the form function (error).
 */
exports.ensureInventory = function (bucket, callback) {
  var inventoryId = bucket + '-inventory';
  var getParams = {
    Bucket: bucket,
    Id: inventoryId
  };
  var client = exports.getClient();

  client.getBucketInventoryConfiguration(getParams, function (error, data) {
    // If the configuration already exists then call back without taking
    // further action.
    if (!error) {
      return callback();
    }

    // If there is a failure for some reason other than the configuration not
    // existing, then call back with the error.
    if (error.code !== 'NoSuchConfiguration') {
      return callback(error);
    }

    // Otherwise, create the missing configuration.
    var putParams = {
      Bucket: bucket,
      Id: inventoryId,
      InventoryConfiguration: {
        Destination: {
          // Put the inventory files into the same bucket, under a defined key
          // prefix.
          S3BucketDestination: {
            // This has to be the ARN, not the bucket name, but can omit the
            // account ID portion of the ARN.
            Bucket: 'arn:aws:s3:::' + bucket,
            Format: 'CSV',
            // If everything is in the same account, this can be omitted.
            //AccountId: '111222333444',
            // Do NOT add a trailing '/' to this prefix; it will be added
            // by S3 Inventory when creating the keys regardless, and so
            // the key will start with 'inventory//' if you add one here.
            Prefix: 'inventory'
          }
        },
        // ID in both places? Apparently so.
        Id: inventoryId,
        IncludedObjectVersions: 'All',
        IsEnabled: true,
        Schedule: {
          Frequency: 'Daily'
        },
        /*
        // Omit the filter and all of the bucket contents are added to the
        // inventory.
        Filter: {
          Prefix: 'examplePrefix'
        },
        */
        OptionalFields: [
          'Size',
          'LastModifiedDate',
          'StorageClass',
          'ETag',
          'IsMultipartUploaded',
          'ReplicationStatus'
        ]
      }
    };

    client.putBucketInventoryConfiguration(putParams, callback);
  });
};

/**
 * Ensure that an S3 inventory is set up for all buckets in the account.
 *
 * @param {Function} callback Of the form function (error).
 */
exports.ensureInventoryForAllBuckets = function (callback) {
  var client = exports.getClient();
  var concurrency = 10;

  client.listBuckets(function (error, data) {
    if (error) {
      return callback(error);
    }

    // Now that the old low bucket limits are gone, an AWS account might have
    // thousands of buckets - or more.
    //
    // From 10-20 underlying HTTP operations in parallel in the same Node.js
    // thread is generally safe, depending on the connection speed.
    //
    // Note the use of eachLimit here: ensureInventory calls back with only an
    // error argument, so the truth-test semantics of everyLimit do not apply.
    async.eachLimit(
      data.Buckets.map(function (bucketObj) {
        return bucketObj.Name;
      }),
      concurrency,
      exports.ensureInventory,
      callback
    );
  });
};
var ensureInventory = require('./ensureInventory');

ensureInventory.ensureInventoryForAllBuckets(function (error) {
  if (error) {
    console.error(error);
  }
});
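If running this as a scheduled Lambda function, the handler need be little more than the following sketch. The daily schedule itself would be configured separately, for example via a CloudWatch Events scheduled rule, and the callback-style handler signature here assumes a Node.js Lambda runtime that supports it.

// A minimal sketch of a Lambda handler wrapping the module above. The
// schedule is configured outside the code, e.g. via a CloudWatch Events
// scheduled rule that invokes this function daily.
var ensureInventory = require('./ensureInventory');

exports.handler = function (event, context, callback) {
  ensureInventory.ensureInventoryForAllBuckets(callback);
};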
Destination Buckets Must Have a Suitable Policy
Every destination bucket must have a bucket policy that allows the S3 Inventory system to upload the inventory files; otherwise no files will be generated. This can be accomplished with the following policy statement, which allows an example-destination-bucket to hold inventory files under _inventory for an example-inventoried-bucket.
{ "Version": "2008-10-17", "Id": "BucketPolicy", "Statement": [ { "Sid": "Inventory", "Effect": "Allow", "Principal": { "Service": "s3.amazonaws.com" }, "Action": "s3:PutObject", "Resource": "arn:aws:s3:::example-destination-bucket/_inventory/*", "Condition": { "ArnLike": { "aws:SourceArn": "arn:aws:s3:::example-inventoried-bucket" } } } ] }
In the scenario in the code examples provided in this post, in which every bucket holds its own inventory files, the example-destination-bucket and example-inventoried-bucket would be the same bucket. It is, however, something of a pain to have to add the policy to every bucket. That can of course be automated, but then the tool set up to ensure that policies are correct would have to have permissions to change all bucket policies. That is acceptable in some circumstances, but should probably make you feel uncomfortable.
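For illustration only, a minimal sketch of that automation for the same-bucket scenario used in this post might look like the following. Note that putBucketPolicy replaces the entire existing policy on a bucket, so a real tool would have to fetch and merge any existing statements rather than blindly overwrite them.

var AWS = require('aws-sdk');

var client = new AWS.S3();
var bucket = 'example-inventoried-bucket';

// The same policy statement as above, with source and destination being the
// same bucket. Note that putBucketPolicy replaces the entire existing policy,
// so a real tool must first fetch and merge any existing statements.
var policy = {
  Version: '2008-10-17',
  Id: 'BucketPolicy',
  Statement: [
    {
      Sid: 'Inventory',
      Effect: 'Allow',
      Principal: { Service: 's3.amazonaws.com' },
      Action: 's3:PutObject',
      Resource: 'arn:aws:s3:::' + bucket + '/_inventory/*',
      Condition: {
        ArnLike: {
          'aws:SourceArn': 'arn:aws:s3:::' + bucket
        }
      }
    }
  ]
};

client.putBucketPolicy({
  Bucket: bucket,
  Policy: JSON.stringify(policy)
}, function (error) {
  if (error) {
    console.error(error);
  }
});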
Put a Lifecycle Rule on the Destination Buckets
S3 Inventory writes a new set of files for each new inventory, either daily or weekly. Unless there is a compelling reason to keep everything, it is a good idea to create a lifecycle rule to cover the destination buckets and key prefixes associated with inventory files, and use that to delete all but the latest inventory.
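As a minimal sketch, assuming the 'inventory' prefix used in the examples above, the following rule expires inventory files a week after creation. Lifecycle rules can only expire objects by age rather than keeping the latest N inventories, and putBucketLifecycleConfiguration replaces any existing rules on the bucket, so treat this as a starting point only.

var AWS = require('aws-sdk');

var client = new AWS.S3();

var params = {
  Bucket: 'exampleBucket',
  LifecycleConfiguration: {
    Rules: [
      {
        ID: 'expire-old-inventories',
        Status: 'Enabled',
        // Apply only to the inventory files. Note the trailing '/', since
        // S3 Inventory adds one to the configured prefix when writing keys.
        Filter: {
          Prefix: 'inventory/'
        },
        // Delete inventory files a week after creation, which in practice
        // retains only the most recent few inventories.
        Expiration: {
          Days: 7
        }
      }
    ]
  }
};

// Note that this replaces any existing lifecycle rules on the bucket.
client.putBucketLifecycleConfiguration(params, function (error) {
  if (error) {
    console.error(error);
  }
});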
Inventoried and Destination Buckets Must Be in the Same Region
The API will allow you to set a destination bucket in a different region from the inventoried bucket. This will not work, however: the two buckets must be in the same region. Thus if writing inventory data to somewhere other than the inventoried bucket itself, it is necessary to set up a destination bucket per region.
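For example, destination buckets might follow a hypothetical per-region naming scheme, with getBucketLocation used to select the right one. The bucket names in this sketch are assumptions, not anything mandated by the service.

// A minimal sketch, assuming hypothetical destination buckets named by
// region, such as example-inventory-us-east-1, already exist.
var AWS = require('aws-sdk');

var client = new AWS.S3();

client.getBucketLocation({ Bucket: 'exampleBucket' }, function (error, data) {
  if (error) {
    return console.error(error);
  }

  // An empty LocationConstraint indicates us-east-1.
  var region = data.LocationConstraint || 'us-east-1';
  var destinationBucket = 'example-inventory-' + region;

  console.log(destinationBucket);
});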
Format of S3 Inventory Files
The S3 Inventory system writes files to the destination bucket with the following directory structure, assuming that the prefix is _inventory, the source bucket is example-inventoried-bucket, and the inventory ID is inventory:
_inventory
  example-inventoried-bucket
    inventory
      2016-12-10T18-01Z
        manifest.checksum
        manifest.json
      data
        e99a3da6-1a7b-443b-8076-2727ab52b00a.csv.gz
        ...
The manifest.json file for each inventory, placed in a folder named for the timestamp of the inventory, has the following format:
{ "sourceBucket" : "example-inventoried-bucket", "destinationBucket" : "arn:aws:s3:::example-destination-bucket", "version" : "2016-11-30", "fileFormat" : "CSV", "fileSchema" : "Bucket, Key, VersionId, IsLatest, IsDeleteMarker, Size, LastModifiedDate, ETag, StorageClass, ReplicationStatus", "files" : [ { "key" : "_inventory/example-inventoried-bucket/inventory/data/e99a3da6-1a7b-443b-8076-2727ab52b00a.csv.gz", "size" : 13453068, "MD5checksum" : "519afba5c656d2496b29747384b837ee" } ] }
So if using automation to process the inventory, the first step is to read the latest manifest.json in order to find out (a) the ordering of columns in the CSV, and (b) which gzipped CSV files to download.
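A minimal sketch of that process, assuming the key of the latest manifest.json has already been determined, for example by listing the timestamped folders:

// A minimal sketch: load a known manifest.json, then download and decompress
// each of the inventory files it lists. The bucket and manifest key here are
// assumptions matching the examples above.
var zlib = require('zlib');
var async = require('async');
var AWS = require('aws-sdk');

var client = new AWS.S3();

var destinationBucket = 'example-destination-bucket';
var manifestKey = '_inventory/example-inventoried-bucket/inventory/' +
  '2016-12-10T18-01Z/manifest.json';

client.getObject({
  Bucket: destinationBucket,
  Key: manifestKey
}, function (error, result) {
  if (error) {
    return console.error(error);
  }

  var manifest = JSON.parse(result.Body.toString('utf8'));
  // The ordering of columns in the CSV rows.
  var columns = manifest.fileSchema.split(', ');

  // Download and decompress the gzipped CSV files one at a time.
  async.eachSeries(manifest.files, function (file, asyncCallback) {
    client.getObject({
      Bucket: destinationBucket,
      Key: file.key
    }, function (error, result) {
      if (error) {
        return asyncCallback(error);
      }

      zlib.gunzip(result.Body, function (error, buffer) {
        if (error) {
          return asyncCallback(error);
        }

        // Process the CSV contents here, using the column ordering obtained
        // from manifest.fileSchema.
        console.log(columns, buffer.toString('utf8').length);
        asyncCallback();
      });
    });
  }, function (error) {
    if (error) {
      console.error(error);
    }
  });
});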