Working with Amazon Textract (Part 1)

What Is Amazon Textract?

Amazon Textract makes it easy to add document text detection and analysis to your applications. The Amazon Textract Text Detection API can detect text in a variety of documents, including financial reports, medical records, and tax forms. For documents with structured data, you can use the Amazon Textract Document Analysis API to extract text, forms, and tables.

Amazon Textract is based on the same proven, highly scalable, deep-learning technology that was developed by Amazon’s computer vision scientists to analyze billions of images and videos daily. You don’t need any machine learning expertise to use it. Amazon Textract includes simple, easy-to-use APIs that can analyze image files and PDF files. Amazon Textract is always learning from new data, and AWS is continually adding new features to the service.

The architecture covered here shows the overall workflow, plus a few additional components used alongside the core pipelines, so that it can process incoming documents as well as a large backfill.

In this article we’ll cover text, table, and form analysis of documents (PDF, JPG, PNG), both synchronously and asynchronously.

With synchronous processing, Amazon Textract can analyze single-page documents for applications where latency is critical. Amazon Textract also provides asynchronous operations to extend support to multi-page documents.
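To make the sync/async split concrete, here is a minimal sketch of how a caller might pick a Textract operation for a document stored in S3. The operation and parameter names (detect_document_text, analyze_document, start_document_text_detection, start_document_analysis) are those of the boto3 Textract client; routing purely by file extension, and the helper names choose_operation and call_textract, are assumptions made for illustration.

```python
import os

# Assumption for this sketch: images go to the sync APIs, PDFs to the async APIs.
SYNC_EXTENSIONS = {".jpg", ".jpeg", ".png"}
ASYNC_EXTENSIONS = {".pdf"}


def choose_operation(bucket, key, features=None):
    """Return (operation_name, kwargs) for a boto3 Textract client.

    `features` is an optional list such as ["TABLES", "FORMS"]; when given,
    the Document Analysis API is used instead of plain text detection.
    """
    ext = os.path.splitext(key)[1].lower()
    s3_object = {"S3Object": {"Bucket": bucket, "Name": key}}

    if ext in SYNC_EXTENSIONS:
        if features:
            return "analyze_document", {
                "Document": s3_object,
                "FeatureTypes": list(features),
            }
        return "detect_document_text", {"Document": s3_object}

    if ext in ASYNC_EXTENSIONS:
        if features:
            return "start_document_analysis", {
                "DocumentLocation": s3_object,
                "FeatureTypes": list(features),
            }
        return "start_document_text_detection", {"DocumentLocation": s3_object}

    raise ValueError(f"Unsupported document type: {key}")


def call_textract(client, bucket, key, features=None):
    """Invoke the chosen operation on a client such as boto3.client("textract")."""
    operation, kwargs = choose_operation(bucket, key, features)
    return getattr(client, operation)(**kwargs)
```

The sync calls return the detected blocks directly in the response, while the async calls return only a JobId whose results are fetched later.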


Process incoming documents workflow

  1. A document is uploaded to an Amazon S3 bucket. The upload triggers a Lambda function, which writes a task to process the document to DynamoDB.
  2. A DynamoDB stream then triggers a Lambda function, which writes a message to the SQS queue of one of the pipelines.
  3. The document is processed by the “Image pipeline” or the “Image and PDF pipeline”, both described below.
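The first two steps can be sketched as two small, pure functions that a Lambda handler would wrap. The S3 and DynamoDB Streams event shapes follow the standard Lambda event formats; the item attributes (documentId, bucketName, objectName, documentStatus) and the "sync"/"async" pipeline labels are hypothetical names chosen for this example, not something the article prescribes.

```python
import json
import uuid
from urllib.parse import unquote_plus


def tasks_from_s3_event(event):
    """Step 1: build one DynamoDB task item per uploaded object."""
    tasks = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        tasks.append({
            "documentId": str(uuid.uuid4()),
            "bucketName": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "objectName": unquote_plus(s3["object"]["key"]),
            "documentStatus": "IN_PROGRESS",
        })
    return tasks


def messages_from_stream_event(event):
    """Step 2: turn stream INSERT records into (pipeline, message_body) pairs."""
    messages = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # only newly created tasks need to be queued
        image = record["dynamodb"]["NewImage"]
        name = image["objectName"]["S"]
        # Assumption: PDFs go to the async pipeline, everything else to sync.
        pipeline = "async" if name.lower().endswith(".pdf") else "sync"
        body = json.dumps({
            "documentId": image["documentId"]["S"],
            "bucketName": image["bucketName"]["S"],
            "objectName": name,
        })
        messages.append((pipeline, body))
    return messages
```

In a deployed function, each task would typically be written with a DynamoDB table resource's put_item(Item=task), and each message sent with an SQS client's send_message(QueueUrl=..., MessageBody=body), where the queue URL depends on the pipeline label.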

Image pipeline (Use Sync APIs of Amazon Textract)

  1. The process starts when a message is sent to an Amazon SQS queue to analyze a document.
  2. A Lambda function is invoked synchronously with an event that contains the queue message.
  3. The Lambda function then calls Amazon Textract and stores the result in one or more data stores, for example DynamoDB, S3, or Elasticsearch.

You control the throughput of the pipeline through the batch size and the Lambda concurrency.
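A minimal sketch of the image pipeline's Lambda logic might look like the following. It assumes the queue message body is JSON with the hypothetical fields documentId, bucketName, and objectName, and it takes the Textract client and a `store` callable as parameters so that the storage backend (DynamoDB, S3, or Elasticsearch) stays open. analyze_document and the Blocks/BlockType/Text response fields are part of the Textract API; everything else is illustrative.

```python
import json


def extract_lines(response):
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def process_sqs_event(event, textract, store):
    """Analyze each queued document synchronously and store its lines.

    `store` is any callable taking (document_id, lines); it stands in for
    a write to DynamoDB, S3, or Elasticsearch.
    """
    processed = 0
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        response = textract.analyze_document(
            Document={
                "S3Object": {
                    "Bucket": message["bucketName"],
                    "Name": message["objectName"],
                }
            },
            FeatureTypes=["TABLES", "FORMS"],
        )
        store(message["documentId"], extract_lines(response))
        processed += 1
    return processed


def lambda_handler(event, context):
    """Example wiring; printing stands in for a real data store."""
    import boto3  # imported here so the helpers above work without boto3

    return process_sqs_event(event, boto3.client("textract"), print)
```

The number of records in each event is bounded by the batch size configured on the SQS trigger, which is one of the two throughput controls mentioned above.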

Image and PDF pipeline (Use Async APIs of Amazon Textract)

  1. The process starts when a message is sent to an SQS queue to analyze a document.
  2. A job scheduler Lambda function runs at a fixed frequency, for example every 5 minutes, and polls for messages in the SQS queue.
  3. For each message in the queue, it submits an Amazon Textract job to process the document, and it keeps submitting jobs until it reaches the maximum number of concurrent jobs allowed in your AWS account.
  4. When Amazon Textract finishes processing a document, it sends a completion notification to an SNS topic.
  5. SNS then triggers the job scheduler Lambda function to start the next set of Amazon Textract jobs.
  6. SNS also sends a message to an SQS queue, which is processed by a Lambda function that gets the results from Amazon Textract and stores them in a relevant data store, for example DynamoDB, S3, or Elasticsearch.

Your pipeline runs at the maximum throughput that your account limits allow. If needed, you can have the limit on concurrent jobs raised, and the pipeline automatically adapts to the new limit.
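The async pipeline can be sketched in three helpers: one that submits jobs, one that reads the completion notification, and one that pages through the results. start_document_analysis, get_document_analysis, NotificationChannel, JobTag, and NextToken are part of the Textract API. The `max_jobs` cap is a simplification assumed for this sketch (the pipeline described above submits until the account's concurrency limit is hit), and the message fields documentId, bucketName, and objectName are hypothetical.

```python
import json


def start_jobs(textract, documents, sns_topic_arn, role_arn, max_jobs):
    """Submit up to `max_jobs` async analysis jobs and return their JobIds."""
    job_ids = []
    for doc in documents[:max_jobs]:
        response = textract.start_document_analysis(
            DocumentLocation={
                "S3Object": {"Bucket": doc["bucketName"], "Name": doc["objectName"]}
            },
            FeatureTypes=["TABLES", "FORMS"],
            JobTag=doc["documentId"],
            # Textract publishes the completion notification to this topic.
            NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
        )
        job_ids.append(response["JobId"])
    return job_ids


def job_from_sqs_record(record):
    """Read (job_id, status) from an SNS notification delivered through SQS.

    Assumes raw message delivery is disabled, so the SQS body is an SNS
    envelope whose "Message" field holds the Textract notification as JSON.
    """
    envelope = json.loads(record["body"])
    notification = json.loads(envelope["Message"])
    return notification["JobId"], notification["Status"]


def get_all_blocks(textract, job_id):
    """Fetch every result page of a finished job by following NextToken."""
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        page = textract.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
        kwargs["NextToken"] = token
```

A results Lambda would call job_from_sqs_record for each queue record and, when the status is SUCCEEDED, pass the job ID to get_all_blocks before storing the blocks.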

Please go to part 2 of this blog for actual coding and implementation.

I’m an undergraduate student at IIIT Ranchi, pursuing my B.Tech in Computer Science and Engineering. I love to learn about and share new technologies.