Challenges and Approaches in Data Profiling

Data processing and analysis cannot take place without data profiling: reviewing the source data for reliability and accuracy. As data volumes grow and infrastructure shifts to the cloud, data profiling becomes increasingly important. How do you profile big data with limited time and resources?

‘Data is the new oil’ reflects the fact that data stored in on-premises databases over a long period has immense potential to solve business challenges. Data quality, a measure of the accuracy, validity and completeness of data, is of the utmost importance for obtaining meaningful results from that data. In this blog, we highlight some of the challenges faced in improving data quality:

  • Duplicate records in the data
  • Null values and outliers in the data
  • Inconsistent attributes

To overcome these challenges, we leveraged Data Profiling.

Data Profiling is the process of examining data from an existing source and summarizing information about that data. A data profiler provides:

  • Details about the attributes, the distribution of the data, and missing values.
  • The maximum, minimum and average values of each attribute.
  • Relationships between the attributes, via a correlation matrix and histogram analysis, and whether each attribute is categorical or numerical.

The following sections show the insights Data Profiling provides on a sample dataset.

Different Approaches to Data Profiling

  • Data Profiling with pandas
  • Data Profiling with Spark
  • Data Profiling with pandas on Google Colab

Note: The following approaches were tested on a sample dataset of 250 MB with 40 columns.

Data Profiling with pandas

In pandas, the appropriate package for data profiling has to be installed and imported. Replace <df> with the data frame into which the dataset has been loaded.

The run time to execute the data profiler with pandas was the highest among the approaches tried.

Data Profiling with Spark

In Spark, the appropriate package for data profiling has to be installed and imported. Replace <df> with the data frame into which the dataset has been loaded.

The run time to execute the data profiler in Spark (on a local machine) was considerably faster than with pandas.

To make it more efficient, cache the data frame in Spark before profiling. The run time was then faster than with uncached Spark processing.

Data Profiling with pandas on Google Colab

Google Colab is a free cloud service from Google that provides, as standard, an environment with 12 GB of RAM and a GPU. It also allows us to mount Google Drive.

The appropriate package for data profiling has to be installed and imported. Mount Google Drive and replace <path-name> with the path where the file is located.
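A minimal sketch of the Colab workflow, assuming ydata-profiling (formerly pandas-profiling); drive.mount is the standard Colab API, and <path-name> is kept as a placeholder for your own Drive path. This only runs inside a Colab notebook:

```python
# Inside a Google Colab notebook:
# !pip install ydata-profiling
import pandas as pd
from ydata_profiling import ProfileReport
from google.colab import drive

# Mount Google Drive; Colab prompts for authorization
drive.mount("/content/drive")

# Replace <path-name> with the file's location in Drive
df = pd.read_csv("/content/drive/MyDrive/<path-name>")

profile = ProfileReport(df, title="Sample Dataset Profile")
profile.to_file("profile_report.html")
```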

The run time to execute the data profiler with pandas on Google Colab was around 14 seconds, which made it the most efficient of the approaches.

 

Difficulties faced with Google Colab  

  • External libraries have to be installed and imported afresh in every new session.
  • File-handling errors can occur while importing large datasets from Google Drive.

With the help of data profiling, we identified challenges such as null values and duplicate records in the data. While a data profiler is a good way to understand the details of a dataset, there are certain things it does not provide:

  1. Outliers in the data.
  2. Different plots to visualize different attributes.
  3. A detailed view of data inconsistencies.

 

Organizations can make better decisions with data they can trust, and data profiling is an essential first step on that journey. If you wish to explore Data Profiling further for your organization, do contact us to schedule a demo session.

 

Written by :

Sreekar Ippili & Umashankar N

Sharing is caring!