January 2021

Summary

Enterprises are realizing that their documents are a rich source of hidden data and insights. But with the market flooded with jargon and competing claims, it can be difficult for buyers to cut through the noise. Read this article to dispel five myths about applying AI to document processing.

Mike leads technology for a large CPG player and is currently facing a big dilemma. He is losing out on critical business insights because most of the company’s data is tied up in unstructured documents. He’s been looking at document digitization options that could unlock these insights but feels that they all might be making tall claims. Surely not everyone can deliver 100% accuracy!

Mike isn’t the only one. Analysts report that 80-90% of data in organizations is unstructured and locked up in documents, creating bottlenecks for information processing and roadblocks for digital transformation. For organizations that want to become digital businesses, the first order of the day is document digitization. No wonder, then, that the global document management services market is projected to reach USD 57.56 billion by 2027. But given the demand, there has been a spurt of solution offerings in the market, making it essential for technology leaders to make an informed choice.

Mike isn’t sure that the parameters he’s evaluating give him the right picture of what the document digitization solutions can do. So, he calls a consultant to discuss this problem.

“Anne, I really need to get our documents digitized to make sense of where we can cut costs or improve efficiencies. I’ve been looking at solutions, and while all of them promise the moon, I’m not quite confident if they can deliver it,” said Mike. “Could you help me figure out what works, please?”

“Sure, Mike, tell me how I can help.”

“Well, the first thing, which I thought would be the easiest to compare, is the accuracy they promise. Unfortunately, all of them promise 100% accuracy. That can’t be true, right?”

“Yes, that’s misleading. In fact, not just accuracy, you should look at five key areas where some myth-busting is needed. Let me take you through all of them, and you can decide.”

Myth #1: 100% accuracy guarantee

What is the accuracy Mike should look for? Isn’t a solution that promises 100% the best to consider? Accuracy is the wrong starting point in assessing the business case for document digitization solutions. Rather, Mike should focus on the costs of the current process and use what-if analysis to arrive at the minimum accuracy that each in-scope document type needs for the project to be viable.
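Such a what-if analysis can start from a simple break-even calculation. The sketch below assumes a deliberately simplified cost model that is not taken from any vendor: every document costs `automation_cost` to process automatically, and each document the system gets wrong costs an additional `correction_cost` for a person to fix. The function name and all figures are illustrative.

```python
from typing import Optional


def min_viable_accuracy(manual_cost: float,
                        automation_cost: float,
                        correction_cost: float) -> Optional[float]:
    """Lowest accuracy at which automation costs no more than manual processing.

    Assumed model (illustrative only), all costs per document:
        automated cost = automation_cost + (1 - accuracy) * correction_cost
    Setting this equal to manual_cost and solving for accuracy gives the
    break-even point. Returns None if automation can never break even.
    """
    if correction_cost <= 0:
        raise ValueError("correction_cost must be positive")
    if automation_cost > manual_cost:
        # Even a perfect system would cost more than the manual process.
        return None
    break_even = 1 - (manual_cost - automation_cost) / correction_cost
    # A negative value means automation pays off at any accuracy.
    return max(0.0, break_even)


# Hypothetical figures per invoice: 2.00 manual, 0.50 automated, 3.00 per correction.
print(min_viable_accuracy(2.00, 0.50, 3.00))  # 0.5
```

Running the calculation once per document type gives a separate minimum accuracy for each, which is a more useful target than a single headline number.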

The accuracy claims of document digitization solutions require reference data and are based on a tiny sample set, with no guarantee of how the solutions will perform at scale. A 100% accuracy guarantee is not possible using automation and AI alone; there has to be some level of human involvement. Also, accuracy claims are, at best, optimistic beliefs. The only way to measure accuracy is during implementation, on the actual data set, by manually comparing the actual results with the expected ones. Since that’s not feasible on the large data sets in production, what’s touted as accuracy is often a proxy measure called the confidence score, i.e., how confident the system is that an extraction is correct.

The way it works is that there is a threshold for the machine learning models — say 90% — and if the extracted value from an OCR engine has a confidence score higher than the threshold, it is believed to be correct. This can be verified for a small set and extrapolated to the documents in production. The question is — how reliable and sound is your extrapolation from a few hundred documents to a million or more?
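One way to put a number on that question is a confidence interval around the accuracy measured on the verified sample. The sketch below uses the Wilson score interval, a standard statistical formula; the function name is illustrative. It assumes the verified documents are a random sample that is representative of production, which is exactly the assumption that layout and content drift can break.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the true accuracy, given a verified sample.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    if sample_size <= 0 or not 0 <= correct <= sample_size:
        raise ValueError("need 0 <= correct <= sample_size and sample_size > 0")
    p = correct / sample_size
    denom = 1 + z ** 2 / sample_size
    center = (p + z ** 2 / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2)
    ) / denom
    return center - half_width, center + half_width


# Hypothetical sample: 285 of 300 verified extractions were correct (95%).
low, high = wilson_interval(285, 300)
print(f"plausible accuracy range: {low:.3f} to {high:.3f}")
```

With a few hundred documents, a measured 95% is compatible with a true accuracy anywhere from about 92% to 97%, and only if production documents look like the sample.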

A product can have a high confidence score yet fail to provide accurate extraction and vice versa. It is purely a function of the kind of training data that the model was exposed to and the type of actual data it’s now running through.
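The gap between confidence and correctness can be made visible on a manually verified sample. In the minimal sketch below, each result is a pair of the system's confidence score and whether a reviewer found the extraction correct; the function name, the threshold, and the sample values are all hypothetical.

```python
from typing import Dict, List, Tuple


def evaluate_threshold(results: List[Tuple[float, bool]], threshold: float) -> Dict[str, float]:
    """Compare confidence-based acceptance with reviewer-verified correctness.

    results: (confidence score, verified correct?) for each extraction.
    """
    accepted = [ok for conf, ok in results if conf > threshold]
    rejected = [ok for conf, ok in results if conf <= threshold]
    return {
        "accepted": len(accepted),
        # Passed the threshold but the reviewer found the value wrong.
        "accepted_but_wrong": accepted.count(False),
        # Held back by the threshold although the value was right.
        "rejected_but_correct": rejected.count(True),
        "accuracy_of_accepted": sum(accepted) / len(accepted) if accepted else float("nan"),
    }


# Hypothetical verified sample.
sample = [(0.97, True), (0.95, False), (0.92, True), (0.80, True), (0.60, False)]
report = evaluate_threshold(sample, threshold=0.90)
print(report["accepted_but_wrong"], report["rejected_but_correct"])  # 1 1
```

Both error types appear here: one confident extraction was wrong, and one low-confidence extraction was right.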

Hence, instead of accuracy, the business case should rest on savings in cost and effort and how automation can increase productivity and reduce the time taken for key business processes like loan application processing.

Myth #2: Unconditional Straight Through Processing (STP) Promise

In the discussion of accuracy, we established that 100% accuracy cannot be guaranteed before production; it shouldn’t even be an expectation. So, products that promise unconditional STP are merely relying on the confidence score, which, as we have seen, is not a measure of correctness. Such a promise also doesn’t factor in the issues that could emerge from document layout and content drift as formats and information change over time.

Instead of relying on unconditional STP, look for a system that delivers reasonable accuracy at the point of cutover to production, and then conduct benchmarking exercises at regular intervals in production. The idea is to start with a reasonable measure that makes sense for your basic ROI calculation and then constantly improve the product’s performance through measurement and human feedback.

Benchmarking measures accuracy over time, catches deviations due to content drift, and shows when the models need to be re-trained on new use cases. To achieve this, the effort involved in benchmarking must be radically simplified. Most products lack this, which is why they depend so heavily on the confidence score.

XtractEdge Platform solves this problem. With a few clicks, even non-technical people can redirect some sample documents over any configurable period to the Exception Queue, where they are checked manually. Once the auditors review and correct the results on the sample set, it gives a precise measure of accuracy. A learning and auto-tuning capability then improves the product based on benchmarking data and auto-suggests corrections based on an ML model. This improves the product over time. As you scale in production, it allows you to make sure that new kinds of layouts are not degrading your performance, and if they are, the feedback can be used to restore extraction quality.
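Independent of any particular product, the benchmarking loop described above can be sketched in a few lines: sample a share of processed documents for manual audit, compute accuracy from the auditors' verdicts, and flag periods where accuracy falls too far below the baseline. This is a generic illustration, not the platform's implementation; the function names, sampling rate, and tolerance are assumptions.

```python
import random
from typing import Dict, List


def sample_for_audit(doc_ids: List[str], rate: float, seed: int = 0) -> List[str]:
    """Pick a random share of processed documents for manual review."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    if not doc_ids:
        return []
    k = max(1, round(len(doc_ids) * rate))
    return random.Random(seed).sample(doc_ids, k)


def audited_accuracy(verdicts: Dict[str, bool]) -> float:
    """Share of audited documents that the reviewer found correct."""
    if not verdicts:
        raise ValueError("no audited documents")
    return sum(verdicts.values()) / len(verdicts)


def periods_with_drift(accuracy_by_period: Dict[str, float],
                       baseline: float,
                       tolerance: float) -> List[str]:
    """Periods whose audited accuracy fell more than `tolerance` below baseline."""
    return [period for period, accuracy in accuracy_by_period.items()
            if baseline - accuracy > tolerance]


# Hypothetical monthly benchmark results against a 95% baseline.
history = {"2021-01": 0.95, "2021-02": 0.94, "2021-03": 0.88}
print(periods_with_drift(history, baseline=0.95, tolerance=0.03))  # ['2021-03']
```

A flagged period is a prompt to inspect what changed, such as new layouts or new content, rather than proof of drift on its own.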

Myth #3: Implementation without calibration (PoC)

An out-of-the-box (OOTB) product needs to be calibrated for an organization’s data quality. For instance, if document quality is poor, an OOTB model will underperform on the client’s data.

In the product evaluation phase, it makes good business sense to do a PoC. This will help assess the context of the business problem and the product fit, that is, how the technology works with client data. A PoC also gives the OOTB model a starting point from which to adjust to the client’s data. This calibration forms a solid foundation, which continues from the PoC to implementation to production. Instead of a one-time calibration, the right approach is to do it often. Along with the real measurement of product performance on your documents and data, a PoC allows your staff to see and feel the user interface used in production to manage the entire document lifecycle. It also gives you a sense of the actual human effort and skill required to review and correct machine results.

Myth #4: There is no need for customization

No software platform can address all use cases for all industry verticals and all business domains without customization. Every company has its own unique business requirements, and document digitization is highly complex. This complexity stems from the variety and variance in documents that are unique to each organization.

Keyword variance is not easy to solve because you don’t know what to expect in the client data. For every use case, you will have to spend some time customizing the product to deal with these variables. Some of these variances will not be fully known upfront when you’re cutting over from implementation to production. Even a customized product can run into document formats that it hasn’t been trained on, which hurts performance.

5 dimensions of document layout variance

  • Classification keyword
    What is the document classified as? Is it a form, is it an invoice? Is it a bill of lading? Is it a packing list?
  • Structural elements variance
    Field – Value pair, Table, Checkbox, Images, Signature, Logo, Paragraph, Section
  • Key-value mapping variance
    Direction of association between field and value
  • Field keyword variance
    Various alternate names (alias) for the field. For instance, an invoice date could also be called a bill date.

    1. Alias for field
    2. Order of Preference for Field alias
  • Location of structural element
    Where is the information that you are looking for located in the document? E.g., Is the invoice date at the top of the document or the bottom? Is it inside a box or a table, on the right or left?
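To make field keyword variance concrete, the sketch below shows one way a field's aliases and their order of preference could be represented and matched. It is a minimal illustration with hypothetical names (`FieldSpec`, `match_alias`, `value_direction`), and it uses plain substring matching, which a real system would likely replace with something more robust.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldSpec:
    name: str
    aliases: List[str]               # alternate labels, in order of preference
    value_direction: str = "right"   # assumed: where the value sits relative to the label


def match_alias(spec: FieldSpec, text: str) -> Optional[str]:
    """Return the most preferred alias that occurs in the text, if any."""
    lowered = text.lower()
    for alias in spec.aliases:       # earlier aliases win
        if alias.lower() in lowered:
            return alias
    return None


invoice_date = FieldSpec("invoice_date", aliases=["invoice date", "bill date"])
print(match_alias(invoice_date, "Bill Date: 05 Jan 2021"))  # bill date
```

Each new document source may bring labels that are not yet in the alias list, which is why this kind of configuration needs ongoing maintenance.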

XtractEdge Platform solves this problem with an Onboarding feature. As variance from document sources cannot be controlled, XtractEdge Platform makes it easier to automatically identify and train new layouts, improving the system as it runs.

Once a document is in the production pipeline, if XtractEdge Platform detects that the quality of extraction is not good or the classification has gone wrong, it decides between two actions:

  • It puts the document into an exception queue for manual checking, or
  • It identifies it as a new layout that needs to be trained on and puts it in the onboarding queue.

The onboarding queue saves the effort of checking every new layout manually. Once a human agent trains the system on the new layout, it re-processes all the similar documents in the onboarding queue through the production pipeline.
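The two-queue routing described above can be illustrated with a small decision function. This is a hypothetical sketch, not the platform's actual logic: the scores, thresholds, and field names are assumptions chosen only to show the shape of the decision and of the re-processing step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProcessedDocument:
    doc_id: str
    layout_id: str                 # hypothetical id of the closest layout group
    layout_similarity: float       # hypothetical 0-1 score against known layouts
    extraction_confidence: float   # hypothetical 0-1 score for the extraction


def route(doc: ProcessedDocument,
          layout_threshold: float = 0.6,
          confidence_threshold: float = 0.9) -> str:
    """Decide where a processed document goes next (assumed rules)."""
    if doc.layout_similarity < layout_threshold:
        return "onboarding_queue"   # looks like a new layout: train first
    if doc.extraction_confidence < confidence_threshold:
        return "exception_queue"    # known layout, doubtful result: manual check
    return "straight_through"


def release_after_training(queue: List[ProcessedDocument],
                           trained_layout_id: str
                           ) -> Tuple[List[ProcessedDocument], List[ProcessedDocument]]:
    """Split the onboarding queue into documents to re-process and documents still waiting."""
    released = [d for d in queue if d.layout_id == trained_layout_id]
    waiting = [d for d in queue if d.layout_id != trained_layout_id]
    return released, waiting


doc = ProcessedDocument("inv-001", "layout-A", layout_similarity=0.3, extraction_confidence=0.95)
print(route(doc))  # onboarding_queue
```

Checking the layout before the confidence score reflects the point made earlier: a high confidence score on an unfamiliar layout is not a reliable signal.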

Myth #5: Buy and forget

Implementing a document digitization product does not mean that you can buy it and forget about it. As the points above show, it needs continuous training and improvement to keep up with data quality and variance and to handle exceptions and outliers. You will need human agents working in tandem with the product to get the best output.

“So, Mike, as you can see, while products may make many claims, the on-ground reality is quite different.”

“I can now see that clearly, Anne, and this has been immensely helpful in weeding out some of the options I was considering. I quite like the idea of a product that learns and improves over time.”

“Well then, you know which one to pick!”
