Process Millions of Amazon S3 Objects

Building a decoupled workflow to process S3 objects at scale.

Michael Sambol
Jul 11, 2023

Introduction

When working with Amazon Simple Storage Service (S3), I often encounter scenarios where there’s a need to process millions of S3 objects. For example, data in your S3 data lake needs to be transformed and loaded into a different data store. Or perhaps your system is an event-driven architecture utilizing Amazon S3 Event Notifications, and you have pre-existing objects that were created before the event notification was enabled. For these scenarios, it’s essential to have a scalable solution that can process or “backfill” millions of objects in a timely manner.

In this blog, I’ll demonstrate a solution I’ve employed and tested on hundreds of millions of S3 objects. Note that the contents of this blog aren’t particularly innovative. I simply wanted to share code that’s helped me and hopefully helps you as well — feel free to copy/paste code! Also, there are a variety of methods to tackle this problem, but I prefer this approach for several reasons which I’ll elaborate on below.

Architecture

The general approach to my method of processing millions of S3 objects has three phases:

  1. List
  2. Load
  3. Work

I deploy a state machine with AWS Step Functions that manages the list and load phases. The logic is encapsulated in a loader AWS Lambda function. Pointers to S3 objects (not the actual data) are loaded into an Amazon Simple Queue Service (SQS) queue. Another worker Lambda function polls the queue, reads the data in each S3 object, and transforms and loads the data into a data store (or does whatever else you need with the data; the possibilities are endless). Using an SQS queue decouples the list and load operations from the work phase, and the queue acts as a work log of files that still need to be processed. Below is a diagram showing the workflow:

S3 Architecture

The state machine is as follows, with the “Load S3 objects into SQS queue” state being the loader Lambda function:

S3 Step Function

All failures in the state machine are sent to a separate failure queue in SQS with information about the error.
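To make the list and load phases concrete, below is a minimal sketch of what the loader logic might look like. The function name and message format are my own illustrative choices, not code pulled from the repository linked below; it uses a Boto3 ListObjectsV2 paginator and sends each page of keys to the SQS queue as a single message:

import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def load_objects(bucket: str, prefix: str, queue_url: str, page_size: int = 500) -> int:
    """List S3 objects under a prefix and enqueue each page of keys as one SQS message."""
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        PaginationConfig={"PageSize": page_size},
    )

    sent = 0
    for page in pages:
        keys = [obj["Key"] for obj in page.get("Contents", [])]
        if not keys:
            continue
        # Send pointers (bucket + keys) to the queue, not the object data itself.
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"bucket": bucket, "keys": keys}),
        )
        sent += len(keys)
    return sent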

Tech Stack

Infrastructure and application code are deployed via the AWS Cloud Development Kit (CDK). I prefer this toolkit because it allows me to write infrastructure as code, in this case in TypeScript, which I chose because CDK itself is written in TypeScript. My application code is written in Python because it’s my preferred backend language: it’s readable, easy to learn, and Boto3 (the AWS SDK for Python) is straightforward.

Design Choices

I chose AWS Step Functions and AWS Lambda because I prefer serverless options wherever possible as there is no infrastructure to manage. I especially like Lambda because there is a configuration setting that allows you to control function concurrency. This is useful if you’re calling an external API with transactions per second (TPS) thresholds in the worker function. If you find that you're over the allowed TPS limit, you can turn down concurrency and control the rate at which you're polling the SQS queue. In the same way, if you're writing to a database, you can do so in a controlled manner so as not to overwhelm the database with too many writes.
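For example, reserved concurrency can be set on the worker function with Boto3 (you could equally configure it in the CDK code). The function name and limit below are placeholders, not values from the repo:

import boto3

lambda_client = boto3.client("lambda")

# Cap the worker at 5 concurrent executions so a downstream API or
# database isn't overwhelmed (function name is a placeholder).
lambda_client.put_function_concurrency(
    FunctionName="s3-backfill-worker",
    ReservedConcurrentExecutions=5,
)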

Concurrent processing is possible because a message is hidden from further consumption once it’s received by the worker Lambda function. The amount of time SQS prevents other consumers (instances of the worker function in this case) from receiving the same message depends on the queue’s visibility timeout. Once the worker function successfully processes a message, it is deleted from the queue. If there is a processing failure, you can return an error, which means the message reappears in the queue after the visibility timeout has expired, or you can send the message to a dead-letter queue for further investigation (one is deployed by the CDK code). You can read more about using Lambda with SQS here.
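A worker handler might look roughly like the sketch below. The message format matches the loader sketch above, and process_object is a stand-in for whatever transformation and loading you need, not a function from the repo. Raising an exception is what sends the message back to the queue (or eventually to the dead-letter queue):

import json

import boto3

s3 = boto3.client("s3")

def process_object(bucket: str, key: str) -> None:
    # Stand-in for your real work: read the object, transform it, and
    # load it into the downstream data store.
    obj = s3.get_object(Bucket=bucket, Key=key)
    _data = obj["Body"].read()

def handler(event, context):
    """Worker Lambda: process each S3 key referenced in the polled SQS messages."""
    for record in event["Records"]:
        body = json.loads(record["body"])
        for key in body["keys"]:
            # If this raises, the message reappears on the queue after the
            # visibility timeout expires, or moves to the dead-letter queue
            # after repeated failures.
            process_object(body["bucket"], key)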

Another consumer option is to use an Amazon Elastic Compute Cloud (EC2) instance instead of a worker Lambda. This may prove to be more cost-effective, but it means you have to manage infrastructure, and you either lose some control over concurrency or need to handle parallelism yourself in the code running on the EC2 instance. You can mitigate cost concerns with Lambda by allocating just enough memory for the function, as Lambda bills based on the memory you allocate and the duration of each invocation. You can check how much memory the Lambda uses by looking at the Amazon CloudWatch Logs log group for the function:

REPORT RequestId: 31d79ea0-4ec6-5408-8e84-c27f12fa5f58
Duration: 528.76 ms Billed Duration: 529 ms
Memory Size: 256 MB Max Memory Used: 78 MB Init Duration: 419.21 ms

Code

I published a CDK repo that deploys the solution. You can invoke the state machine with the following payload, which loads pointers to S3 objects into the SQS queue:

{
"bucket": "<S3 BUCKET NAME>",
"prefix": "<S3_PREFIX_WITH_FORWARD_SLASH>",
"page_size": <Integer between 1 and 100>, // defaults to 500
"queue_url": "<SQS_QUEUE_URL>" // defaults to queue deployed by CDK code
}

The required attributes are bucket and prefix, which tell the loader function where to find the files that you want to process or backfill. page_size is an optional attribute that defaults to 500 (minimum 1, maximum 1000). This attribute tells the loader function how many S3 objects you want to list with one API call, and in turn how many are sent to the SQS queue in one message. You need to fine-tune this value, as there are SQS and Lambda limits to consider. First, the maximum message size that SQS supports is 256 KB. If your S3 object names are long, you may need to bring this value under 500. Second, the maximum duration of a Lambda function is 15 minutes. You need to ensure the worker function has enough time to process each S3 object in the message polled from the SQS queue. Be conservative at first, so you don't have errors sending to the SQS queue or timeouts in the Lambda function.
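For reference, one way to start an execution is with Boto3’s Step Functions client. The state machine ARN, bucket, and prefix below are placeholders; substitute the values output by the CDK deployment:

import json

import boto3

sfn = boto3.client("stepfunctions")

# The ARN, bucket, and prefix are placeholders; use your own values.
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:s3-backfill",
    input=json.dumps({
        "bucket": "my-data-lake-bucket",
        "prefix": "raw/2023/",
        "page_size": 500,
    }),
)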

Post-Processing

When the worker Lambda function has finished processing a key, I recommend moving the S3 object to a new prefix (supported in the code repo). This provides a natural status checkpoint for how many objects are left to process. Additionally, it helps with idempotency, as it reduces the chance you’ll process an object more than once. Note that this incurs S3 costs: since S3 has no native move operation, you copy the object to the new prefix and then delete the original. This should be done at the end of the worker logic, as the copy and delete signal successful processing.
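A minimal sketch of that move step, where the processed/ prefix and function name are my own placeholders rather than the naming used in the repo:

import boto3

s3 = boto3.client("s3")

def mark_processed(bucket: str, key: str, processed_prefix: str = "processed/") -> None:
    """Move an object under a 'processed' prefix once the worker is done with it."""
    new_key = processed_prefix + key
    # S3 has no rename: copy to the new key, then delete the original.
    s3.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    s3.delete_object(Bucket=bucket, Key=key)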

Conclusion

I hope this blog was informative. Drop me a note if you have questions or comments.
