Loading bulk data into DynamoDB

Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It's a fully managed, durable database with built-in security for internet-scale applications, which makes it a natural choice for applications that need low-latency data access.

We faced a use case with a web application requiring millisecond-scale reads from the database. We had around 500,000 records in an S3 bucket to be ingested into a DynamoDB table.

PutItem API

A provisioned DynamoDB table with default settings (5 RCU and 5 WCU) was created with ItemID as the partition key. When called with the put_item API via Lambda, the process ingested one record at a time, which is sufficient for smaller datasets. In our case the dataset was large, and ingestion into the provisioned table was very slow, often leading to throughput errors or the Lambda function timing out.
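For reference, a single-record ingest with put_item looks roughly like the sketch below; the table and attribute names mirror the ones used later in this post, and the values are purely illustrative.

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('personalize_item_id_mapping')

# One PutItem request per record: one network round trip each time,
# which quickly runs into the 5 WCU limit on a large dataset
table.put_item(
    Item={
        'item_id': 1001,
        'operator_name': 'Sample Operator',
        'source_name': 'Source City',
        'destination_name': 'Destination City'
    }
)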

Batch_writer()

With the DynamoDB.Table.batch_writer() operation we can speed up the process and reduce the number of write requests made to DynamoDB.

This method returns a handle to a batch writer object that automatically handles buffering and sending items in batches. In addition, the batch writer automatically handles any unprocessed items and resends them as needed. All you need to do is call put_item on the batch writer to ingest the data into the DynamoDB table.
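The basic pattern is a minimal sketch like the following; the full function we used appears later in this post, and the table name matches the one used there.

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('personalize_item_id_mapping')

# A couple of illustrative rows; the real data comes from the CSV file in S3
items = [
    {'item_id': 1, 'operator_name': 'Operator A'},
    {'item_id': 2, 'operator_name': 'Operator B'},
]

# The batch writer buffers items, sends them in batches,
# and retries any unprocessed items automatically
with table.batch_writer() as batch:
    for item in items:
        batch.put_item(Item=item)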

Configurations used for DynamoDB.Table.batch_writer():

  • We structured the input data so that the partition key (ItemID) is in the first column of the CSV file.
  • We created the DynamoDB table with On-Demand read/write capacity mode so that it scales automatically (see the sketch after this list).
  • We created a Lambda function with a timeout of 15 minutes, containing the code to export the CSV data to the DynamoDB table.
  • We ensured the IAM roles associated with the services were configured with the required permissions.
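Creating the table with On-Demand capacity can be done from the console or scripted; a rough boto3 sketch is shown below, with the table name and key attribute matching the Lambda code later in this post.

import boto3

dynamodb = boto3.client('dynamodb')

# PAY_PER_REQUEST enables On-Demand capacity, so no RCU/WCU need to be provisioned
dynamodb.create_table(
    TableName='personalize_item_id_mapping',
    AttributeDefinitions=[{'AttributeName': 'item_id', 'AttributeType': 'N'}],
    KeySchema=[{'AttributeName': 'item_id', 'KeyType': 'HASH'}],
    BillingMode='PAY_PER_REQUEST'
)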

Once configured, we tested the Lambda function: the records loaded successfully into the DynamoDB table, and the whole execution took only around five minutes.

Serverless and Event-Driven

The whole pipeline was serverless: the Lambda function was configured with an S3 event trigger filtered on the .csv suffix. The overall architecture and data flow is depicted below.

[Architecture diagram: CSV file in S3 → S3 event trigger → Lambda function → DynamoDB table]
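The S3 event trigger can be set up from the Lambda console; for reference, the equivalent notification configuration via boto3 might look roughly like this (the bucket name and function ARN are placeholders):

import boto3

s3 = boto3.client('s3')

# Invoke the Lambda function whenever a .csv object is created in the bucket
s3.put_bucket_notification_configuration(
    Bucket='your-input-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:region:account-id:function:csv-to-dynamodb',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {
                'Key': {
                    'FilterRules': [{'Name': 'suffix', 'Value': '.csv'}]
                }
            }
        }]
    }
)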

Note that the Lambda function will time out if the input files are very large. Break the CSV input files into smaller chunks and upload them to S3; the whole process is then event-driven, and the data is loaded into the DynamoDB table automatically.
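The splitting itself can be done with any tool; a simple Python sketch (chunk size and paths are illustrative) could look like this:

import csv

def split_csv(input_path, rows_per_chunk=50000):
    """Split a large CSV into smaller files, repeating the header in each chunk."""
    with open(input_path, newline='') as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                _write_chunk(input_path, part, header, chunk)
                chunk, part = [], part + 1
        if chunk:
            _write_chunk(input_path, part, header, chunk)

def _write_chunk(input_path, part, header, rows):
    # Write one chunk next to the source file, e.g. data_part0.csv, data_part1.csv, ...
    out_path = f"{input_path.rsplit('.', 1)[0]}_part{part}.csv"
    with open(out_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)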

The code snippet we used is shown below:

import json
import boto3
import os
import csv
import codecs
import sys

s3_client = boto3.resource('s3')
dynamoDB_client = boto3.resource('dynamodb')
tableName = 'personalize_item_id_mapping'

def lambda_handler(event, context):
    # Read the bucket and object key from the S3 event that triggered the function
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    obj = s3_client.Object(bucket, key).get()['Body']
    table = dynamoDB_client.Table(tableName)
    batch_size = 100
    batch = []
    # DictReader is a generator; the file is streamed, not stored in memory
    for row in csv.DictReader(codecs.getreader('utf-8')(obj)):
        if len(batch) >= batch_size:
            write_to_dynamo(batch)
            batch.clear()
        batch.append(row)
    if batch:
        write_to_dynamo(batch)

def write_to_dynamo(rows):
    table = dynamoDB_client.Table(tableName)
    try:
        # The batch writer buffers the items, sends them in batches,
        # and retries any unprocessed items automatically
        with table.batch_writer() as batch:
            for i in range(len(rows)):
                item_id = int(rows[i]['item_id'])
                operator_name = rows[i]['operatorname']
                source_name = rows[i]['sourcename']
                destination_name = rows[i]['destinationname']
                seat_type = rows[i]['seat_type']
                bus_type = rows[i]['bus_type']
                batch.put_item(
                    Item={
                        'item_id': item_id,
                        'operator_name': operator_name,
                        'source_name': source_name,
                        'destination_name': destination_name,
                        'seat_type': seat_type,
                        'bus_type': bus_type
                    }
                )
    except Exception as e:
        print(str(e))
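To test the function before wiring up the trigger, it can be invoked with a minimal S3-style test event along these lines (the bucket and key are placeholders):

# Minimal S3 put event for exercising the handler
test_event = {
    'Records': [{
        's3': {
            'bucket': {'name': 'your-input-bucket'},
            'object': {'key': 'data/items_part0.csv'}
        }
    }]
}

lambda_handler(test_event, None)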


After the bulk load, we switched the table back to Provisioned capacity mode with the RCU and WCU required by the application, to keep it cost-effective.
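The switch back to Provisioned mode can also be scripted; a rough sketch is below, where the RCU/WCU values are placeholders to be sized for your application. (Note that DynamoDB limits how often a table's billing mode can be changed, typically once per 24 hours.)

import boto3

dynamodb = boto3.client('dynamodb')

# Switch from On-Demand back to Provisioned capacity
# (the RCU/WCU values below are placeholders -- size them for your workload)
dynamodb.update_table(
    TableName='personalize_item_id_mapping',
    BillingMode='PROVISIONED',
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5
    }
)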


With the guidelines above, you can ingest large datasets into DynamoDB in an efficient, cost-effective, and straightforward manner. If you have any questions or suggestions, please reach out to us at contactus@1cloudhub.com


Written by: Dhivakar Sathya & Umashankar N
