From Document to Insights

Harnessing the power of cognitive AI

December 2019

In a business landscape characterized by digital transformation, it is easy to see how data offers businesses an edge. The addition of intelligence to the data management process has infused speed into data movement and analysis. Increasingly, enterprises are finding new ways to gather, unify, clean, and analyze data at speed.

RPA has helped automate digital data movement across business tasks, but hits a roadblock with non-digital data in documents such as invoices, scanned paper forms, statements, claims, and receipts. How then can organizations increase business efficiency by bringing more data into the purview of their digital information systems, especially data that is locked away in scanned documents? Two words – Data Digitization.

Understanding the Importance of Intelligent Information Extraction

Most large companies have many business processes that deal with paper forms and scanned documents, usually requiring human agents to enter this information into an enterprise IT system. While organizations have relied, in varying capacities, on manual processing centers, optical character recognition (OCR), and handwriting recognition, each of these techniques has its share of challenges. The emergence of deep learning techniques in computer vision and NLP, coupled with the flexibility of provisioning resources in the cloud, is a game-changer that enables a new breed of text digitization solutions. These new techniques can help understand the relationships between field labels and values, as well as the structure and layout of data elements, from boxes and tables to specific details like checkboxes and signatures.

I already have OCR. How is this different?

OCR is the ability to detect text regions in a scanned image and convert those regions into the correct digital text. Handwriting recognition supplements this foundational capability with the ability to deal with pages that feature handwritten text. Here are some of the issues with relying on manual entry and these techniques alone:

Manual data extraction from scanned/native documents is inefficient, effort-intensive, and slows down the business process.
OCR/handwriting recognition based on templates needs substantial setup and configuration effort for each new document format.
Traditional OCR techniques have limitations when processing forms (fields) and while dealing with tables.

Visualizing Data Digitization

A comprehensive data digitization solution should offer the following features:

OCR and Handwriting Recognition

OCR and Handwriting Recognition form the foundational layer for all digitization products, with computer vision models at work in recognizing printed and handwritten text. This layer usually identifies rectangular regions within an input image before extracting the text within each region.

Structure and Table Detection

Structure and Table Detection enables products to categorize individual regions of results from OCR/handwriting recognition into logical groups. Data types, alignment, and text styling are all used to improve the accuracy of this grouping.
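As a rough illustration of alignment-based grouping, the sketch below collects OCR regions into rows when their vertical centers lie close together. The Region class, the group_into_rows function, and the 10-pixel tolerance are assumptions made for this example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical OCR result: text plus its bounding box in pixels."""
    text: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int


def group_into_rows(regions, tolerance=10):
    """Group regions whose vertical center is within `tolerance` pixels
    of the center of the first region placed in the current row."""
    rows = []
    for region in sorted(regions, key=lambda r: r.y + r.height / 2):
        center = region.y + region.height / 2
        if rows and abs(center - rows[-1]["center"]) <= tolerance:
            rows[-1]["items"].append(region)
        else:
            rows.append({"center": center, "items": [region]})
    # Order each row left to right so it reads like a table row.
    return [sorted(row["items"], key=lambda r: r.x) for row in rows]


regions = [
    Region("Total", 40, 102, 60, 20),
    Region("Name", 40, 50, 60, 20),
    Region("$120.00", 300, 100, 80, 20),
    Region("Jane Doe", 300, 52, 90, 20),
]
for row in group_into_rows(regions):
    print([r.text for r in row])
```

A production system would likely combine this with column alignment, data types, and text styling, as described above.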

Checkbox Detection

Checkbox Detection is not offered by most existing OCR/handwriting systems even though checkboxes are a standard option in forms. Digitization products should be able to recognize a checkbox group accurately and then identify the options selected in a group.
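One simple way to decide whether a detected checkbox is ticked is to measure how much of its interior is dark. The sketch below assumes the checkbox interiors have already been located, cropped, and binarized (1 for dark, 0 for light); the function names and the 0.2 threshold are illustrative assumptions.

```python
def fill_ratio(pixels):
    """Fraction of dark pixels (1 = dark, 0 = light) in a cropped checkbox interior."""
    total = sum(len(row) for row in pixels)
    dark = sum(sum(row) for row in pixels)
    return dark / total if total else 0.0


def selected_options(checkbox_group, threshold=0.2):
    """Return the labels whose checkbox interior is at least `threshold` dark."""
    return [label for label, pixels in checkbox_group.items()
            if fill_ratio(pixels) >= threshold]


# A hypothetical two-option group: "Single" is empty, "Married" holds an X mark.
group = {
    "Single": [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    "Married": [[1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]],
}
print(selected_options(group))
```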

Document Type Classification or Page Classification

Document Type Classification or page classification brings key NLP capabilities to automatically classify various pages in an input document into a specific type. Document types need to be configured, and each type must be tagged with examples to enable this feature.
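To make the idea of tagging each type with examples concrete, the sketch below builds a word-count profile per document type and assigns a page to the most similar profile. Real products typically use trained NLP models; the bag-of-words cosine similarity here, along with the function names and sample texts, is a deliberately simple stand-in chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(tagged_examples):
    """Build one word-count profile per configured document type."""
    profiles = {}
    for doc_type, text in tagged_examples:
        profiles.setdefault(doc_type, Counter()).update(tokenize(text))
    return profiles


def classify(page_text, profiles):
    """Return the document type whose profile is most similar to the page."""
    words = Counter(tokenize(page_text))
    return max(profiles, key=lambda t: cosine(words, profiles[t]))


# Hypothetical tagged examples for two configured document types.
examples = [
    ("invoice", "Invoice number, bill to, amount due, payment terms"),
    ("invoice", "Invoice date, total amount due, remit payment to"),
    ("claim", "Claim form, policy number, date of loss, claimant signature"),
]
profiles = train(examples)
print(classify("Please remit the amount due on this invoice", profiles))
```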

Document Understanding

Document Understanding is a function of interpreting various information elements to group them into pairs of fields and values. It requires the capability of understanding the layout and structure of the document. This feature should complete the complex task of detecting checkboxes and tables while segregating table headers, data, and rows that don’t align with the table structure.
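The pairing of fields and values can be sketched with a simple layout heuristic: treat any text ending in a colon as a field label and attach the text that follows it in the same row as its value. This heuristic, the pair_fields function, and the sample rows are assumptions for illustration; real documents need far richer layout analysis.

```python
def pair_fields(rows):
    """Pair each label (text ending in ':') with the text that follows it
    in the same row, up to the next label."""
    pairs = {}
    for row in rows:
        label = None  # labels do not carry over from one row to the next
        for text in row:
            if text.endswith(":"):
                label = text.rstrip(":").strip()
                pairs.setdefault(label, "")
            elif label is not None:
                pairs[label] = (pairs[label] + " " + text).strip()
    return pairs


# Hypothetical rows of recognized text, already ordered left to right.
rows = [
    ["Name:", "Jane", "Doe", "Date:", "12/01/2019"],
    ["Policy No:", "A-10442"],
    ["Page 1 of 3"],
]
print(pair_fields(rows))
```

Text with no preceding label in its row, such as the page footer in the sample, is left unpaired.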

Reviewer UI

Reviewer UI and Workflow provides an interface for human review of digitization results. The feature must be designed for high productivity and offer support for user management, workflow, and role-based security. Additionally, it must support the validation of data against data types, including emails, addresses in a location, SSNs, and dates.

Data Validation and Resolution

Data Validation and Resolution checks for validity according to the classified data type, while reviewer UI and workflow adds a user interface to the digitized data to enable human review and verification.
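A minimal sketch of type-based validation follows. It checks each extracted value against a rule for its classified data type and returns the fields a human reviewer should look at. The patterns are intentionally loose, and the VALIDATORS table, the MM/DD/YYYY date format, and the field names are assumptions for this example.

```python
import re
from datetime import datetime


def is_valid_date(value, fmt="%m/%d/%Y"):
    """True if `value` is a real calendar date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# One illustrative rule per data type; real deployments would configure these.
VALIDATORS = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "ssn": lambda v: re.fullmatch(r"\d{3}-\d{2}-\d{4}", v) is not None,
    "date": is_valid_date,
}


def fields_needing_review(fields):
    """Return names of fields whose value fails the check for its data type."""
    return [name for name, (data_type, value) in fields.items()
            if data_type in VALIDATORS and not VALIDATORS[data_type](value)]


fields = {
    "contact": ("email", "jane.doe@example.com"),
    "ssn": ("ssn", "123-45-678"),          # one digit short
    "signed_on": ("date", "02/30/2019"),   # February 30 does not exist
}
print(fields_needing_review(fields))
```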

Data Type Detection

Data Type Detection is the capability to auto-classify values in the field-value pairs into data types such as numbers, date-time values, general text, and names. Validation rules can be applied to this classified data, and human users can also change data types.
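The classification step can be approximated with a few ordered checks, as in the sketch below. The detect_data_type function, the two date formats, and the capitalized-words heuristic for names are assumptions for illustration; a production system would more likely use trained models and locale-aware parsing.

```python
import re
from datetime import datetime

# Date formats assumed for this example only.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def detect_data_type(value):
    """Classify a value as 'number', 'date-time', 'name', or 'text'."""
    text = value.strip()
    # Digits with optional sign, currency symbol, thousands separators, decimals.
    if re.fullmatch(r"[-+]?\$?\d[\d,]*(\.\d+)?", text):
        return "number"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date-time"
        except ValueError:
            continue
    # Heuristic: two or more capitalized words look like a personal name.
    if re.fullmatch(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", text):
        return "name"
    return "text"


for sample in ["$1,234.50", "2019-12-01", "Jane Doe", "Paid in full"]:
    print(sample, "->", detect_data_type(sample))
```

Because the name check is only a heuristic, it will also match place names such as two-word cities, which is one reason human users need to be able to change the detected type.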

Document Version Detection

Document Version Detection adds the capability to configure the criteria for completeness and detect the latest version in document types. It relies on some out-of-the-box features for human signature detection.

Autocorrect Suggestions

Autocorrect Suggestions are based on data generated from manual review of data digitization output, which reveals specific problems with OCR/handwriting conversion. This data can also be used to train custom models that predict corrections. These corrections are unique to a particular deployment and require the product to be deployed and used in production for some time before being enabled.
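Before any custom model is trained, review data can already drive suggestions through simple counting. The sketch below assumes a review log of (OCR output, reviewer-corrected text) pairs and suggests a correction once the same fix has been seen a minimum number of times. The log format, the function name, and the min_count threshold are hypothetical.

```python
from collections import Counter, defaultdict


def build_suggestions(review_log, min_count=2):
    """Map each OCR output to its most frequent reviewer correction,
    keeping only corrections seen at least `min_count` times."""
    counts = defaultdict(Counter)
    for ocr_text, corrected_text in review_log:
        if ocr_text != corrected_text:  # ignore values the reviewer accepted
            counts[ocr_text][corrected_text] += 1
    suggestions = {}
    for ocr_text, corrections in counts.items():
        best, seen = corrections.most_common(1)[0]
        if seen >= min_count:
            suggestions[ocr_text] = best
    return suggestions


# Hypothetical log gathered from the reviewer UI.
review_log = [
    ("lnvoice", "Invoice"),
    ("lnvoice", "Invoice"),
    ("T0tal", "Total"),
    ("Amount", "Amount"),
]
print(build_suggestions(review_log))
```

The threshold reflects the point made above: suggestions only become reliable after the product has been used in production long enough to accumulate repeated corrections.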

Use Cases for Data Digitization

Form Digitization: Digitizing existing enrolment or other multi-page paper-based forms that involve a mix of typed text, handwriting, checkboxes, and other fields and tables.

Dynamic Extraction or Touch-free Zero Template Extraction: Dealing with non-standard input documents that are not structured like forms, but usually contain the same information, albeit in varying layouts.

Content Classification and Extraction from Mixed-type Documents: Digitizing documents that include many different document types.

Information Consistency Checking: The most complex use case, requiring mature products that address all the previous use cases and also let users define verification rules that enforce domain-specific information consistency.
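For the consistency-checking use case, verification rules can be thought of as named predicates evaluated over the extracted data. The sketch below is one possible shape for such rules; the two rules, the field names, and the sample document are hypothetical and stand in for whatever a given domain requires.

```python
from datetime import date


def totals_match(doc):
    """Line items should add up to the stated total (within half a cent)."""
    return abs(sum(doc["line_items"]) - doc["total"]) < 0.005


def claim_after_policy_start(doc):
    """A claim cannot be dated before the policy started."""
    return doc["claim_date"] >= doc["policy_start"]


# Hypothetical domain rules, keyed by a human-readable description.
RULES = {
    "line items sum to total": totals_match,
    "claim date on or after policy start": claim_after_policy_start,
}


def failed_rules(doc):
    """Return the descriptions of all rules the document violates."""
    return [name for name, rule in RULES.items() if not rule(doc)]


doc = {
    "line_items": [40.00, 60.50],
    "total": 100.50,
    "policy_start": date(2019, 6, 1),
    "claim_date": date(2019, 5, 20),
}
print(failed_rules(doc))
```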

Digitization products today deliver enhanced value through these advanced features for extracting and interpreting information from your scanned documents and integrating results into existing business processes and applications. They are typically used in conjunction with modern scanning solutions to ensure a virtually touch-free deployment of the process in production, so your human resources can focus on improving and delivering stellar customer experiences.

By John Kuriakose
Principal Product Architect,
Infosys Nia, Infosys