Building a Data Lake with AWS Lake Formation

Modern businesses must organize an unprecedented amount of information. Find out how AWS Lake Formation can help your business manage and analyze high volumes of data.

With growing numbers of people accessing data, data platforms need to be flexible and scalable. Many organizations therefore move to the cloud to achieve these objectives without compromising on security. The key challenge in moving to a cloud-based data platform is ingesting the data quickly and securely, since most of it sits in on-premises relational databases (RDBMS). A cloud-based data lake opens up structured and unstructured data for more flexible analysis, and analysts and data scientists can then access it with the tools of their choice.


The conventional way of building a data lake involves setting up a large infrastructure and securing the data, which is time-consuming and not cost-effective. Even building a data lake in the cloud requires several steps:

  • Setting up storage.
  • Moving, cleaning, preparing the data.
  • Configuring and enforcing security policies for each service.
  • Manually granting access to users.

This process is tedious, error-prone, and hard to keep stable.
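
To give a feel for that manual effort, here is a rough boto3 sketch of those four steps. Every bucket, database, policy, and user name below is a placeholder for illustration, not something from the article.

```python
# A minimal sketch of the "manual" approach to a cloud data lake, using boto3.
# All names and the account ID are placeholders.
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")
iam = boto3.client("iam")

# 1. Set up storage.
s3.create_bucket(Bucket="example-data-lake-raw")

# 2. Register a catalog database for the prepared data (the ETL jobs that
#    move and clean the data still have to be written and scheduled separately).
glue.create_database(DatabaseInput={"Name": "example_lake_db"})

# 3. Security policies must be written per service (S3, Glue, Athena, ...).
iam.create_policy(
    PolicyName="example-analyst-s3-read",
    PolicyDocument="""{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-data-lake-raw/*"
      }]
    }""",
)

# 4. Access is then granted user by user (attach the policy to each analyst).
iam.attach_user_policy(
    UserName="analyst-1",
    PolicyArn="arn:aws:iam::123456789012:policy/example-analyst-s3-read",
)
```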

AWS Lake Formation

AWS Lake Formation is a managed service from Amazon Web Services that helps you build a secure data lake in a few steps. It has several advantages:

  1. Identify, ingest, clean and transform data: With Lake Formation, you can move, store and clean your data faster.
  2. Enforce security policies for users: Once the data source parameters are set, you define security policies through AWS Identity and Access Management, and they are enforced for all users accessing the data (see the sketch after this list).
  3. Increased productivity: With Lake Formation, you build a data catalog that makes users more productive by helping them find the right dataset to analyze.
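
Alongside IAM, Lake Formation also exposes its own grant/revoke permission API. Below is a minimal boto3 sketch, assuming a hypothetical analyst role and an example database and table.

```python
# A hedged sketch of centralized permissioning with the Lake Formation API.
# The principal ARN, database, and table names are illustrative.
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on a single table to one analyst role; Lake Formation then
# enforces this for the integrated services (Athena, Redshift Spectrum, ...).
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"
    },
    Resource={
        "Table": {"DatabaseName": "example_lake_db", "Name": "orders"}
    },
    Permissions=["SELECT"],
)
```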

With AWS Lake Formation, a data lake can be set up in just three steps.

Lake Formation makes it easy to ingest data from multiple sources via a feature called blueprints.

Blueprints support a one-time bulk database load as well as incremental loads into the data lake from MySQL, PostgreSQL, Oracle, and Microsoft SQL Server databases.

i] Database Snapshot:


Database Snapshot allows a one-time bulk load of data into the data lake. It's hassle-free: enter the import source details, configure the import target and the frequency, and you're done.
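
Behind the scenes the blueprint creates an AWS Glue workflow, so its progress can be followed from the Glue console or, for example, with a short boto3 check like the one below. The workflow name here is hypothetical; copy the real one from the console.

```python
# A small sketch for checking on the Glue workflow that a blueprint creates.
import boto3

glue = boto3.client("glue")

# List the runs of the (hypothetical) workflow and print their status.
runs = glue.get_workflow_runs(Name="lakeformation-snapshot-orders")
for run in runs["Runs"]:
    print(run["WorkflowRunId"], run["Status"], run["Statistics"])
```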


The import target configuration has a data format option where you can specify the format in which the data is written to your data lake. As of now, it supports CSV and Parquet.

The frequency can be set to run on demand, hourly, daily, weekly, or monthly, and you can also choose the days on which the workflow should be executed. The frequency setting is mostly useful for the incremental database option.
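
Scheduled runs in Glue are expressed as cron-style expressions. Purely as an illustration (this is not part of the blueprint setup itself, and the trigger, workflow, and job names are hypothetical), a daily 02:00 UTC schedule on a Glue workflow would look like this:

```python
# Illustrative only: a scheduled Glue trigger using a cron expression,
# comparable to choosing a "daily" frequency. All names are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="daily-incremental-load",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
    WorkflowName="example-incremental-workflow",
    Actions=[{"JobName": "example-incremental-job"}],
    StartOnCreation=True,
)
```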

ii] Incremental Database:

This blueprint adds change data capture on top of the snapshot load. The bookmark key is the column of the source table that is checked for the last imported value, so that only new rows are appended. Partitioning of the data can also be configured as an option.

Note:

Instead of using a database snapshot, you can use an incremental data import in the first step, specifying the bookmark key (a unique or primary-key column) and partitions if present. For the next incremental append, AWS Glue will already have the metadata from the previous run.
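
The bookmark behaviour described above is Glue's job-bookmark mechanism. Here is a minimal sketch of what an incremental job roughly does, assuming the standard Glue job-bookmark options; the database, table, and bookmark column names are illustrative, and the real blueprint-generated job will differ in detail.

```python
# Hedged sketch: an incremental read keyed on an "id" bookmark column.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Only rows with an "id" greater than the last bookmarked value are read
# on each run, which is what makes the append incremental.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="example_lake_db",
    table_name="orders",
    transformation_ctx="orders",
    additional_options={
        "jobBookmarkKeys": ["id"],
        "jobBookmarkKeysSortOrder": "asc",
    },
)

# Write the new rows to the data lake as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake-raw/orders/"},
    format="parquet",
)

job.commit()  # persists the bookmark state for the next run
```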

Since AWS Lake Formation executes AWS Glue jobs behind the scenes, a Glue crawler can be invoked to update the catalog, and the user can query the table directly in Athena to check the imported data.
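
For a quick sanity check of the imported rows, an Athena query can also be started from boto3. The database, table, and results bucket below are placeholders.

```python
# Count the imported rows via Athena; names are illustrative.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM orders",
    QueryExecutionContext={"Database": "example_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```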

It’s not perfect!

Although AWS Lake Formation is a low-effort approach, there are a few challenges:

  1. AWS Lake Formation at this point doesn't have any way to specify a WHERE clause for the source data (even though exclude patterns are available to skip specific tables).
  2. Partitioning on columns present in the source database is possible in AWS Lake Formation, but partitioning on custom fields that are not present in the source database at ingestion time is not.
  3. On creating a workflow, Lake Formation creates multiple jobs in the AWS Glue console, adding manual work to delete or track them.

Most common scenarios involving the quick ingestion of data are much easier with AWS Lake Formation. But when a little more customization is required, it's better to switch to a custom AWS Glue script, as sketched below.
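
As a sketch of that custom route, the Spark snippet below addresses the first two gaps: it pushes a WHERE clause down to the source via a JDBC subquery and derives a partition column that doesn't exist in the source table. All connection details, table, and column names are illustrative placeholders.

```python
# Hedged sketch of a custom Glue/Spark ingestion script with a source-side
# filter and a derived partition column. Names and credentials are placeholders.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Push a WHERE clause down to the source database via a JDBC subquery.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://example-host:3306/sales")
    .option("dbtable", "(SELECT * FROM orders WHERE order_date >= '2020-01-01') AS t")
    .option("user", "etl_user")
    .option("password", "placeholder")
    .load()
)

# 2. Derive a partition column that does not exist in the source table.
orders = orders.withColumn("ingest_year", F.year(F.col("order_date")))

# Write partitioned Parquet into the data lake.
(
    orders.write.mode("append")
    .partitionBy("ingest_year")
    .parquet("s3://example-data-lake-raw/orders/")
)
```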

Written by: Dhivakar Sathya & Umashankar N
