Exploratory Data Analysis and Data Imputation for Amazon Forecast

No matter what kind of data science project one is assigned to, making sense of the dataset and cleaning it are always critical to a good approach.

Problem Statement

We have hierarchical product data for a retail store, covering different categories across three states: California, Texas, and Arizona. From this data, we need to predict product sales for one month (30 days). The training data consists of individual sales for 305 days, and we use it to predict the days that follow.

Analyzing Dataset

Before ingesting the target time series data into Amazon Forecast, the first step is data analysis.

It helps us understand the data by identifying trends and data gaps. To start exploring the dataset, we use the Python pandas library to read the data and print the first few rows.
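A minimal sketch of this first step is shown here. The column names (`date`, `state`, `product`, `purchase_count`) and the sample rows are assumptions made for illustration, not the original dataset, and the inline CSV text stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real sales file; the column
# names and values are assumptions made for illustration only.
csv_text = """date,state,product,purchase_count
2019-11-01,California,Apple,12
2019-11-02,California,Apple,9
2019-11-04,California,Apple,15
2019-11-01,Texas,Apple,7
"""

# With a real file, a path would be passed to pd.read_csv instead.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Print the first few rows to get a feel for the columns and values.
print(df.head())
```

Parsing the date column while reading saves a separate conversion step later.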

Amazon Forecast requires target time series data that consists of a timestamp, an item_id, and a value. We can also add one or more dimensions to the schema.

In the above dataset, the “dat” attribute is treated as the timestamp, and the “Purchase count” attribute is the value to predict.
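As a rough sketch of that mapping, the columns can be renamed to the attribute names of a target time series schema. Several things here are assumptions: the source column names, the use of the product as `item_id`, the choice of `location` as the dimension name for the state, and the `target_value` attribute name, which applies to a custom-domain dataset.

```python
import pandas as pd

# Hypothetical sample rows; names and values are illustrative only.
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-11-01", "2019-11-02"]),
    "state": ["California", "California"],
    "product": ["Apple", "Apple"],
    "purchase_count": [12, 9],
})

# Assumed mapping from the source columns to Forecast attribute names.
column_map = {
    "date": "timestamp",
    "product": "item_id",
    "purchase_count": "target_value",
    "state": "location",  # extra dimension; the name is an assumption
}
tts = df.rename(columns=column_map)
tts = tts[["timestamp", "item_id", "target_value", "location"]]
tts["target_value"] = tts["target_value"].astype(float)

# Schema listing the attributes in the same order as the columns.
schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "item_id", "AttributeType": "string"},
        {"AttributeName": "target_value", "AttributeType": "float"},
        {"AttributeName": "location", "AttributeType": "string"},
    ]
}

print(tts)
```

Keeping the schema attributes in the same order as the data columns avoids a mismatch when the dataset is imported.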

Check the total entries and data types of the dataset
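One way to run this check with pandas is sketched here, again on hypothetical sample rows with assumed column names. The date column starts as plain strings so that the type conversion is visible.

```python
import pandas as pd

# Hypothetical sample rows; names and values are illustrative only.
df = pd.DataFrame({
    "date": ["2019-11-01", "2019-11-02", "2019-11-04"],
    "state": ["California", "California", "California"],
    "product": ["Apple", "Apple", "Apple"],
    "purchase_count": [12, 9, 15],
})

# Row count, non-null counts, and the dtype of every column.
df.info()

# Dates read as strings need converting before any date arithmetic.
df["date"] = pd.to_datetime(df["date"])

print(len(df))
print(df.dtypes)
```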

Set the timestamp attribute as the index
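A short sketch of this step, assuming the timestamp lives in a column named `date` (a hypothetical name) and using made-up rows:

```python
import pandas as pd

# Hypothetical sample rows, deliberately out of order.
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-11-02", "2019-11-01", "2019-11-04"]),
    "product": ["Apple", "Apple", "Apple"],
    "purchase_count": [9, 12, 15],
})

# Move the timestamp into the index and sort chronologically, so that
# date-based slicing and re-indexing behave as expected.
df = df.set_index("date").sort_index()

print(df)
```

With a `DatetimeIndex` in place, date ranges can be compared against the index directly in the later steps.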

Check for missing values by filtering on States and Product

Filtering the States attribute to California and the Product attribute to Apple shows, in the above image, that dates are missing between “2019-11-02” and “2019-11-04” and between “2019-11-05” and “2019-11-08”.
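The same check can be sketched in pandas by comparing the dates present for one state and product against a complete daily range. The column names and purchase counts are assumptions; the sample dates are chosen to mirror the gaps described above.

```python
import pandas as pd

# Hypothetical sample indexed by date; values are illustrative only.
df = pd.DataFrame(
    {
        "state": ["California"] * 5 + ["Texas"],
        "product": ["Apple"] * 6,
        "purchase_count": [12, 9, 15, 11, 3, 7],
    },
    index=pd.to_datetime([
        "2019-11-01", "2019-11-02", "2019-11-04",
        "2019-11-05", "2019-11-08", "2019-11-03",
    ]),
)

# Keep only the rows for one state and one product.
subset = df[(df["state"] == "California") & (df["product"] == "Apple")]

# Every calendar day between the first and last recorded dates.
full_range = pd.date_range(subset.index.min(), subset.index.max(), freq="D")

# Days in the full range that have no record in the subset.
missing = full_range.difference(subset.index)
print(missing)
```

Filtering first matters: the Texas row on 2019-11-03 would otherwise hide a gap in the California series.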

There are various scenarios for missing data. For example, the product may have had no sales at all in that state on those dates, it may have been unavailable, or the records for those dates may simply be missing. Missing values can significantly impact a model’s accuracy.

Re-indexing the date range and filling the missing dates with empty values
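A minimal sketch of the re-indexing, using a single hypothetical series for one state and product. With the full hierarchical data, the same step would be applied per state and product group.

```python
import pandas as pd

# Hypothetical daily sales for one state and product, with gaps.
sales = pd.Series(
    [12, 9, 15, 11, 3],
    index=pd.to_datetime([
        "2019-11-01", "2019-11-02", "2019-11-04",
        "2019-11-05", "2019-11-08",
    ]),
    name="purchase_count",
)

# Continuous daily range covering the whole recorded period.
full_range = pd.date_range(sales.index.min(), sales.index.max(), freq="D")

# Re-index onto the full range; dates with no record become NaN.
reindexed = sales.reindex(full_range)

print(reindexed)
```

Leaving the new rows as NaN keeps the gaps visible, so the fill logic can be chosen afterwards.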

From the above graph and data, the missing dates in the dataset are filled with empty values. Amazon Forecast provides a number of filling methods for handling missing values in your target time series and related time series datasets. Filling is the process of adding standardised values to missing entries in the dataset.

Amazon Forecast supports the following filling methods:

Middle filling – Fills any missing values between the start date and the end date of an item in the dataset.
Back filling – Fills any missing values between the last recorded data point of an item and the global end date of the dataset.
Future filling (related time series only) – Fills any missing values between the global end date and the end of the forecast horizon.
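As a sketch of how these methods are selected, the fill logic can be expressed as a featurization configuration, which to our understanding is passed as `FeaturizationConfig` when creating a predictor with the AWS SDK. The attribute name `target_value` assumes a custom-domain dataset, and the daily frequency is an assumption based on the daily sales data; the block only builds and prints the dictionary and makes no AWS call.

```python
import json

# Fill logic for the target time series: zeros for gaps inside an
# item's history (middlefill) and after its last record (backfill).
featurization_config = {
    "ForecastFrequency": "D",  # daily data (assumed)
    "Featurizations": [
        {
            "AttributeName": "target_value",  # assumed attribute name
            "FeaturizationPipeline": [
                {
                    "FeaturizationMethodName": "filling",
                    "FeaturizationMethodParameters": {
                        "middlefill": "zero",
                        "backfill": "zero",
                    },
                }
            ],
        }
    ],
}

print(json.dumps(featurization_config, indent=2))
```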

There are also some limitations and guidelines on which filling logic is accepted for target time series data and for related time series data. In the above use case, the business had no sales on the missing dates, so we used the filling methods to fill the missing values with 0.

Depending on the business use case, the gaps in the above graph can also be filled with other logic, such as 0, nan, mean, or median, for both target time series data and related time series data.
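To compare these options locally before choosing one, the same fill logic can be approximated in pandas. This is only an illustration on hypothetical values, not how Amazon Forecast applies filling internally.

```python
import pandas as pd

# Hypothetical daily series with three missing days (None becomes NaN).
gappy = pd.Series(
    [12, 9, None, 15, 11, None, None, 3],
    index=pd.date_range("2019-11-01", periods=8, freq="D"),
    dtype="float64",
)

# One candidate series per fill logic.
fills = {
    "zero": gappy.fillna(0),
    "mean": gappy.fillna(gappy.mean()),
    "median": gappy.fillna(gappy.median()),
    "nan": gappy,  # leave the gaps as missing
}

# Side-by-side view of how each logic fills the gaps.
print(pd.DataFrame(fills))
```

Zero suits the case where a gap really means no sales; mean or median may be a better fit when a gap means the record was lost.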

References:

Amazon Forecast Developer Guide (official AWS documentation)

Written by :

Sreekar Ippili & Umashankar N

Sreekar Ippili

Data Analyst, 1CloudHub
