Intelligent Document Digitization With Amazon Textract

In this blog, we explain how 1CloudHub (1CH) extended Amazon Textract with a custom workflow built on AWS Step Functions, AWS Lambda and Amazon SNS, making super-sized documents easier and quicker to process.

Extracting the required data from a single document by hand is simple enough. Doing the same for hundreds of documents every day is not: it takes significant time and effort, and it is prone to error.

With Amazon Textract, the data in these documents can be processed more quickly and efficiently.

BUSINESS CHALLENGE

Almost every business deals with hundreds of documents on a daily basis. In the customs business especially, there are many kinds of documents, such as invoices, bills of lading, container documents and packing lists, from which people extract data manually, which is a tedious process. Another challenge arises with super-sized documents: a single combined file has to be separated into distinct documents according to document type.

1CH addressed these challenges by extending Amazon Textract with a custom workflow that integrates AWS Step Functions, AWS Lambda and Amazon SNS, making super-sized documents easier and quicker to process.

AMAZON TEXTRACT

Amazon Textract is a machine learning (ML) service that uses Optical Character Recognition (OCR) to automatically extract text, handwriting, and data from scanned documents such as PDFs. Textract detects and extracts key-value pairs and structured data stored in tables, without manual intervention and with high accuracy.

AMAZON AUGMENTED AI

Amazon Augmented AI (A2I) is a machine learning service that makes it easy to build the workflows required for human review. A2I can route selected files to human reviewers to ensure accuracy.

STEP FUNCTIONS

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services and automate business processes, managing failures and retries without any human intervention.

SOLUTION APPROACH

The business challenge, extracting data from hundreds of PDF documents and splitting combined PDFs into distinct PDFs by document category, can be simplified by using the Amazon services described above.

SPLIT PDF DOCUMENTS

We established a pipeline leveraging S3, Amazon Textract and Lambda to split PDF documents accordingly.

The Amazon Textract StartDocumentTextDetection API is used to detect the text present in the document (PDF) along with its confidence level.
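As a rough illustration of this step, the sketch below starts an asynchronous text-detection job with boto3 and groups the detected lines by page. The function names and the sample blocks are our own illustrative assumptions, not the code used in the solution; only the boto3 call and the block fields (`BlockType`, `Page`, `Text`, `Confidence`) come from the Textract API.

```python
from collections import defaultdict


def start_text_detection(bucket, key):
    """Start an asynchronous Textract text-detection job for a PDF stored in S3."""
    # Imported lazily so the parsing helper below can run without AWS access.
    import boto3

    textract = boto3.client("textract")
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def lines_by_page(blocks):
    """Group LINE blocks by page number, keeping each line's text and confidence."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            pages[block.get("Page", 1)].append((block["Text"], block["Confidence"]))
    return dict(pages)


# Made-up blocks in the shape returned by GetDocumentTextDetection.
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "COMMERCIAL INVOICE", "Confidence": 99.1},
    {"BlockType": "WORD", "Page": 1, "Text": "COMMERCIAL", "Confidence": 99.3},
    {"BlockType": "LINE", "Page": 2, "Text": "PACKING LIST", "Confidence": 97.4},
]
print(lines_by_page(sample_blocks))
```

In a real pipeline the blocks would come from paging through `get_document_text_detection` with the returned job ID once the job has succeeded.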

AWS Lambda is used to split the document into distinct files using the PyPDF2 module, based on the document types that Amazon Textract detects within it.

The distinct PDF documents are then uploaded to S3.
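The splitting logic could look something like the sketch below. It assumes each page is classified by a simple keyword match on the text Textract detected, and that a page with no recognised heading continues the previous document. The keyword table, function names and output naming are hypothetical; the actual categorization rules used in the solution are not described here.

```python
# Hypothetical keyword table: document type -> heading text to look for.
DOCUMENT_KEYWORDS = {
    "invoice": "INVOICE",
    "packing_list": "PACKING LIST",
    "bill_of_lading": "BILL OF LADING",
}


def classify_page(text):
    """Return the first document type whose keyword appears in the page text."""
    upper = text.upper()
    for doc_type, keyword in DOCUMENT_KEYWORDS.items():
        if keyword in upper:
            return doc_type
    return None


def segment_pages(page_texts):
    """Group consecutive pages into (doc_type, [page indexes]) segments.

    A page with no recognised keyword is treated as a continuation of the
    previous segment. Two back-to-back documents of the same type are merged,
    which a real implementation would need to handle separately.
    """
    segments = []
    for index, text in enumerate(page_texts):
        doc_type = classify_page(text)
        if segments and doc_type in (None, segments[-1][0]):
            segments[-1][1].append(index)
        else:
            segments.append((doc_type or "unknown", [index]))
    return segments


def split_pdf(source_path, segments, output_prefix):
    """Write one PDF per segment with PyPDF2 and return the output paths."""
    # Imported lazily so the pure-Python helpers above run without PyPDF2.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(source_path)
    outputs = []
    for number, (doc_type, page_indexes) in enumerate(segments, start=1):
        writer = PdfWriter()
        for index in page_indexes:
            writer.add_page(reader.pages[index])
        output_path = f"{output_prefix}_{number}_{doc_type}.pdf"
        with open(output_path, "wb") as handle:
            writer.write(handle)
        outputs.append(output_path)
    return outputs


# Made-up page texts standing in for Textract output.
sample_pages = [
    "COMMERCIAL INVOICE\nNo. 1",
    "Item table continued",
    "PACKING LIST",
    "BILL OF LADING",
]
print(segment_pages(sample_pages))
```

Inside a Lambda function, the files written by `split_pdf` would go to a temporary directory before being uploaded to S3.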

Insights into the processed documents are visualized using Amazon QuickSight, a business intelligence (BI) tool. Insights such as processing status, the total number of splits made in a document, and the document types it was split into are charted to give a clear picture of each document.

DATA EXTRACTION

We established a pipeline leveraging S3, Amazon Textract, AWS Lambda, AWS Step Functions and Amazon Augmented AI (A2I) to extract data from PDF documents.

The Amazon Textract StartDocumentAnalysis API is used to scan the PDF document and extract the key-value pairs it contains, along with their confidence levels. Lambda functions process the key-value pairs with a confidence level above 90%, and the CSV file containing those key-value pairs is stored in S3.
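To make the confidence filter concrete, here is a minimal sketch of how key-value pairs might be pulled out of the blocks returned for a FORMS analysis and split at the 90% threshold. The block fields follow the Textract response format, but the helper names, the choice to score a pair by the lower of its key and value confidence, the treatment of exactly 90% as accepted, and the sample data are all assumptions made for illustration.

```python
import csv
import io

CONFIDENCE_THRESHOLD = 90.0


def _text_of(block, blocks_by_id):
    """Join the WORD children of a KEY or VALUE block into one string."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks):
    """Return a list of {key, value, confidence} dicts from KEY_VALUE_SET blocks."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs = []
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_blocks = [
            blocks_by_id[value_id]
            for relationship in block.get("Relationships", [])
            if relationship["Type"] == "VALUE"
            for value_id in relationship["Ids"]
        ]
        # Assumption: a pair is only as trustworthy as its weakest part.
        confidence = min([block["Confidence"]] + [v["Confidence"] for v in value_blocks])
        pairs.append({
            "key": _text_of(block, blocks_by_id),
            "value": " ".join(_text_of(v, blocks_by_id) for v in value_blocks),
            "confidence": confidence,
        })
    return pairs


def partition_by_confidence(pairs, threshold=CONFIDENCE_THRESHOLD):
    """Split pairs into (accepted, needs_review) around the threshold."""
    accepted = [p for p in pairs if p["confidence"] >= threshold]
    needs_review = [p for p in pairs if p["confidence"] < threshold]
    return accepted, needs_review


def to_csv(pairs):
    """Serialise pairs as CSV text, ready to be written to S3."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["key", "value", "confidence"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(pairs)
    return buffer.getvalue()


# Made-up blocks in the shape returned by GetDocumentAnalysis.
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"], "Confidence": 98.0,
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]}, {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"], "Confidence": 97.0,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "k2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"], "Confidence": 85.0,
     "Relationships": [{"Type": "VALUE", "Ids": ["v2"]}, {"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "v2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"], "Confidence": 88.0,
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice", "Confidence": 99.0},
    {"Id": "w2", "BlockType": "WORD", "Text": "No:", "Confidence": 99.0},
    {"Id": "w3", "BlockType": "WORD", "Text": "INV-001", "Confidence": 97.0},
    {"Id": "w4", "BlockType": "WORD", "Text": "Total:", "Confidence": 86.0},
    {"Id": "w5", "BlockType": "WORD", "Text": "1,250.00", "Confidence": 88.0},
]

accepted, needs_review = partition_by_confidence(extract_key_values(sample_blocks))
print(to_csv(accepted))
```

The `needs_review` list is what would be handed to the human review step described next.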

Key-value pairs with a confidence below 90% are subjected to human review. Amazon Augmented AI is used to create a human review task for any document containing such pairs. A human review HTML template is created to suit the requirement and is used by A2I as the worker task template.
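A human loop for the low-confidence pairs could be started along the lines of the sketch below. The `start_human_loop` call and its parameters belong to the boto3 `sagemaker-a2i-runtime` client, while the shape of the input content, the loop naming scheme, the function names and the placeholder ARN are assumptions: the input content must match whatever the worker task template expects.

```python
import json
import uuid


def build_human_loop_request(flow_definition_arn, document_key, low_confidence_pairs):
    """Build the keyword arguments for the A2I StartHumanLoop call."""
    return {
        # Loop names must be unique; a random suffix is one simple way to get that.
        "HumanLoopName": f"review-{uuid.uuid4().hex}",
        "FlowDefinitionArn": flow_definition_arn,
        "HumanLoopInput": {
            # Hypothetical payload shape; it has to match the HTML task template.
            "InputContent": json.dumps(
                {"document": document_key, "pairs": low_confidence_pairs}
            )
        },
    }


def start_review(request):
    """Send the request to A2I and return the ARN of the new human loop."""
    # Imported lazily so the request builder above runs without AWS access.
    import boto3

    client = boto3.client("sagemaker-a2i-runtime")
    return client.start_human_loop(**request)["HumanLoopArn"]


# Placeholder ARN and made-up data, for illustration only.
request = build_human_loop_request(
    "arn:aws:sagemaker:us-east-1:111122223333:flow-definition/example-review",
    "incoming/sample.pdf",
    [{"key": "Total:", "value": "1,250.00", "confidence": 85.0}],
)
print(request["HumanLoopName"])
```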

The Lambda functions are incorporated into Step Functions so that the workflow gives the human review worker time to review the documents that are subjected to human review. The human-reviewed documents are then processed, and the corresponding CSV files are stored in S3.

Insights into the processed documents are visualized using Amazon QuickSight, a business intelligence (BI) tool. Insights such as processing status, the key-value pairs extracted from a document, and the human-reviewed key-value pairs are charted to give a clear picture of each document.
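One way to give the reviewer time is a polling loop in the state machine: check the review status, wait, and check again until the review is complete. The sketch below builds such a definition in Amazon States Language as a Python dictionary. The state names, the `$.review.status` path, the five-minute wait and the placeholder Lambda ARNs are all assumptions; the actual workflow may well use a different pattern, such as a task-token callback.

```python
import json


def build_review_state_machine(check_status_arn, process_reviewed_arn, wait_seconds=300):
    """Return an Amazon States Language definition that polls for review completion."""
    return {
        "Comment": "Wait for a human review to finish, then process the result",
        "StartAt": "CheckReviewStatus",
        "States": {
            # Hypothetical Lambda that looks up the human loop and returns its status.
            "CheckReviewStatus": {
                "Type": "Task",
                "Resource": check_status_arn,
                "ResultPath": "$.review",
                "Next": "IsReviewComplete",
            },
            "IsReviewComplete": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.review.status",
                        "StringEquals": "Completed",
                        "Next": "ProcessReviewedDocument",
                    }
                ],
                "Default": "WaitForReviewer",
            },
            # Pause before polling again so the reviewer has time to work.
            "WaitForReviewer": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "CheckReviewStatus",
            },
            # Hypothetical Lambda that writes the reviewed pairs to a CSV file in S3.
            "ProcessReviewedDocument": {
                "Type": "Task",
                "Resource": process_reviewed_arn,
                "End": True,
            },
        },
    }


# Placeholder ARNs, for illustration only.
definition = build_review_state_machine(
    "arn:aws:lambda:us-east-1:111122223333:function:check-review-status",
    "arn:aws:lambda:us-east-1:111122223333:function:process-reviewed-document",
)
print(json.dumps(definition, indent=2))
```

A production definition would also need a path for failed or stopped reviews and an overall timeout so the loop cannot run indefinitely.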

VISUALIZATION OF RESULTS

Key business decisions depend on the quality of the data available. Being able to interpret the processed data and derive meaningful insights from it therefore helps in understanding status and goals more clearly.

Amazon QuickSight adds a presentation layer on top of the processed data so that actionable information can be derived from it, which ultimately leads to more effective decision making.

BENEFITS

Since the solution consists of serverless components, the architecture scales with demand and incurs no additional cost when there is no load (except for storage).

 

MANUAL
  • Humans make errors during visual processing, which lowers accuracy.
  • Super-sized files take longer to process.
  • A large human workforce is required.

TEXTRACT
  • Extraction is automated, which reduces errors and gives higher accuracy.
  • Super-sized files are processed in very little time.
  • Human intervention is needed only for low-confidence fields sent for review.

If you have any questions or suggestions, please reach out to us at contactus@1cloudhub.com

Written by: Surya Prakash K, Sripranav P & Umashankar N
