Move Millions of Files from Amazon EC2 to Amazon S3 using AWS DataSync

Transferring data at scale using the NFS protocol and AWS DataSync.

Michael Sambol
7 min read · Jul 31, 2023

Introduction

In this blog I’ll show you how to rapidly transfer files from Amazon EC2 (EC2) to Amazon S3 (S3) using AWS DataSync (DataSync). This is useful if you have log or application files on an EC2 instance and want to move them to S3 for further processing. In many situations the AWS Command Line Interface (AWS CLI) is sufficient for this task, but I’ve seen scenarios with millions of small files on disk where the AWS CLI cannot transfer them in a timely manner. To make the files available to DataSync, we’ll turn the EC2 instance into a Network File System (NFS) server. The code for the solution is available here.
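For modest file counts, a single aws s3 sync is usually all you need (the bucket name and prefix below are placeholders):

$ aws s3 sync /home/ec2-user/datasync/ s3://<your-bucket>/datasync/

It’s when a command like this churns for hours over millions of tiny files that DataSync earns its keep.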

Note: Be careful if you follow along with this blog, as using these services at scale will result in AWS charges.

Architecture

We’ll deploy the architecture shown below. The EC2 instance in orange is our source server and contains the files we want to move to S3. If you’re following along, it’s likely you’ll start with only the source server and an S3 bucket — we’ll deploy the other resources shortly.

To reduce data transfer costs and keep data in the same Availability Zone, we’ll deploy the DataSync agent and the VPC endpoint in the same subnet as the source server.

Infrastructure Deployment

I’ve written a CloudFormation template to deploy the remaining infrastructure needed to facilitate file transfer. The template deploys the following resources:

DataSync agent

  • An EC2 instance with the latest DataSync AMI.

VPC endpoint

  • Used for control plane traffic between the agent and the DataSync service. The VPC endpoint allows traffic to stay on the Amazon network, meaning the agent does not need access to the public internet or to a NAT gateway. Read more on using DataSync with VPC endpoints here.

Three EC2 security groups

  1. DataSync agent
    – Allows traffic on port 80 from your machine for automatic agent activation. This rule can be removed once activation is complete; more on this below.
  2. VPC endpoint and DataSync task elastic network interfaces (ENIs)
    – The docs (ref1, ref2) are a little misleading on this, but you need to allow traffic on ports 1024–1064 for control plane operations and on port 443 for data transfer and for agent activation/creation within the DataSync service. DataSync uses this security group for both the VPC endpoint and the task ENIs.
    – I tested the VPC endpoint without port 443 open, and the agent fails to register. Port 443 is also required for data transfer, since this security group is attached to the task ENIs as well.
  3. Source server
    – Allows TCP/UDP traffic on port 2049 (NFS) from the DataSync agent. You need to attach this security group to the source server after deployment; a CLI sketch of this rule follows the list.
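If you’d rather add the source server rule by hand instead of relying on the template, here’s a sketch with the AWS CLI (the security group IDs are placeholders for the groups the stack creates):

# allow NFS over TCP and UDP (port 2049) from the agent's security group
$ aws ec2 authorize-security-group-ingress \
    --group-id <source-server-sg-id> \
    --protocol tcp --port 2049 \
    --source-group <datasync-agent-sg-id>
$ aws ec2 authorize-security-group-ingress \
    --group-id <source-server-sg-id> \
    --protocol udp --port 2049 \
    --source-group <datasync-agent-sg-id>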

You can deploy the CloudFormation template by clicking on the following link. Again, choose the subnet where the source server resides. Note that the elastic network interfaces shown in the architecture diagram are deployed when you create the DataSync task, which we’ll do below.
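If you prefer the command line to the launch link, here’s a sketch of the same deployment (the template filename and parameter name are assumptions, so check them against the repo before running):

$ aws cloudformation create-stack \
    --stack-name datasync-ec2-to-s3 \
    --template-body file://template.yaml \
    --parameters ParameterKey=SubnetId,ParameterValue=<subnet-id> \
    --capabilities CAPABILITY_NAMED_IAM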

Source Server

I didn’t have a server lying around with millions of files, so I created an Amazon Linux EC2 instance and wrote files to disk using the following script inside the /home/ec2-user/datasync/ directory:

# create one million zero-padded files: 0000001.txt through 1000000.txt
for i in {1..1000000}
do
  echo "$(printf "%07d" "$i")" > "$(printf "%07d" "$i").txt"
done

The result:

$ ls | wc -l
1000001

Note: You may need to adjust the commands if you’re using a different flavor of operating system than Amazon Linux.

NFS Server

The next step is to turn the source server into an NFS server. NFS is a distributed file system protocol that allows clients to access files over a network. In our case, the DataSync agent acts as the client and facilitates transfer between the source server and the DataSync service. To enable NFS on the source server, first ensure nfs-utils is installed:

$ sudo yum -y install nfs-utils

Add the line below to /etc/exports (you likely need to create the file). This file is the NFS server’s configuration: it lists the directories shared with NFS clients and the client IP addresses allowed to mount them. Change the directory to your source directory, and the IP address range to your VPC CIDR (or the IP address of your DataSync agent):

/home/ec2-user/datasync/  172.30.0.0/16(rw,sync,no_root_squash)

It may be necessary to update file permissions so the DataSync agent can read the files:

$ chmod -R 755 /home/ec2-user/datasync/

Finally, start the NFS server:

$ sudo systemctl start nfs-server
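A few optional sanity checks to confirm the export is live, and to keep the server running across reboots:

$ sudo systemctl enable nfs-server   # start NFS automatically at boot
$ sudo exportfs -v                   # confirm the directory is exported with your options
$ showmount -e localhost             # list the exports as a client would see them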

You can read more about DataSync and NFS here. Note: It’s also possible to use Amazon EFS as a DataSync source location, but if your files are on an EC2 instance rather than an Amazon EFS file system, moving them there first is just as much work.

DataSync Agent

The next step is to create and activate the agent in the DataSync service (we’ve already deployed the agent on EC2, but we need to register it with the service). Navigate to DataSync in the AWS Management Console. Choose the values below, specifying the DataSyncServiceSG security group created by the CloudFormation template:

Automatically fetch activation key

The DataSync service needs an activation key to register the agent. The AWS Management Console can retrieve the key automatically if the agent is reachable from your machine on port 80 (the default for the security group deployed by the stack). Depending on your network configuration, this may mean the agent needs a public IP address. If that’s the case for you, paste the agent’s IP address (public or private, whichever your machine can reach) into the form below and DataSync fetches the activation key:

Manually fetch activation key

To manually fetch the activation key, SSH to the DataSync agent with the following command (or use AWS Systems Manager Session Manager):

$ ssh -i <ssh_private_key> -c aes256-ctr -oKexAlgorithms=diffie-hellman-group14-sha1 admin@<ip_of_datasync_agent>

When you see the screen below, type “0” and press Enter:

Follow the prompts, and you’ll see the activation key:
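If you’d rather script the registration, the activation key also works with the CLI. This is a sketch, with the key and ARNs as placeholders (use the DataSyncServiceSG and the subnet from the stack):

$ aws datasync create-agent \
    --agent-name ec2-source-agent \
    --activation-key <activation-key> \
    --vpc-endpoint-id <vpc-endpoint-id> \
    --subnet-arns <subnet-arn> \
    --security-group-arns <datasync-service-sg-arn>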

DataSync Locations

We need to create two locations within the DataSync service: one for the source server (NFS) and another for S3. Below is a screenshot of creating the NFS location. For the IP address, use the private IP address of the source server:

Next we’ll create an S3 location to store the files transferred from the source server. Enter the destination S3 bucket and optionally a folder. The AWS Management Console will autogenerate a role with the required permissions for data transfer to the bucket:
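Both locations can also be created from the CLI; the ARNs, bucket, and prefix below are placeholders, and --subdirectory on the S3 location is optional:

$ aws datasync create-location-nfs \
    --server-hostname <source-server-private-ip> \
    --subdirectory /home/ec2-user/datasync/ \
    --on-prem-config AgentArns=<agent-arn>

$ aws datasync create-location-s3 \
    --s3-bucket-arn arn:aws:s3:::<destination-bucket> \
    --subdirectory <optional-prefix> \
    --s3-config BucketAccessRoleArn=<bucket-access-role-arn>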

DataSync Task

The final step is to create a DataSync task and start the file transfer. In the DataSync service, create a new task and specify the NFS source location we created:

For the destination location, choose the S3 location:

Start the task and files will begin to transfer from the source server to S3. Leave a comment if you have trouble with task execution, as I’ve seen a few different bugs and may have troubleshooting tips. Below are the files in S3 after a successful transfer:
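For completeness, the same steps from the CLI; the location and task ARNs are placeholders from the previous commands:

$ aws datasync create-task \
    --name ec2-to-s3 \
    --source-location-arn <nfs-location-arn> \
    --destination-location-arn <s3-location-arn>

$ aws datasync start-task-execution --task-arn <task-arn>

# poll progress and status of the transfer
$ aws datasync describe-task-execution \
    --task-execution-arn <task-execution-arn>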

Performance

Transfer performance is affected by the EC2 instance type and EBS volume type of the source server, as well as by the EC2 instance type of the DataSync agent. AWS recommends an agent instance size of m5.2xlarge for transfers of up to 20 million files. If your transfer speeds are slow, try increasing the instance size of your source server and agent, as network and EBS bandwidth generally increase with larger instance classes. I ran the tests below (I would expect more variance with a larger number of files).

Changing EBS volume type of source server

  • Both source server and agent were m5.large

Changing instance type of source server and DataSync agent

  • I used 1.6 million files for this comparison

Conclusion

In this blog, I showed you how to rapidly transfer millions of files from an EC2 instance to S3. This is useful when the number of files is too large for the AWS CLI. Drop me a note if you have questions or comments.
