
Doc Recognizer - An innovative approach to extract information with AI

Nowadays, companies receive a wide range of documents containing some form of structured information. Although the information is structured, the same type of document arrives in many different formats, each of which must be read, processed, and integrated. This means that resources must be allocated to the menial task of analyzing and manually processing each document.

John is one of those resources. He works at company XLDA, a medium-sized company that receives hundreds of invoices a week. For the past year, John has been manually processing these invoices.

Sometimes John misses a few bits of info here and there. Sometimes John inputs an invoice twice. And sometimes the workload is so crushing that the company misses a payment deadline because John can’t manage his backlog.

XLDA knows it would be better to automate this process, so it contacts BI4ALL and asks whether they can address this problem. Coming up with a solution should be simple, right? Well, not exactly. Most, if not all, of the documents are either in paper format or are scanned images/PDFs. This means that we have no direct access to their textual information.

Furthermore, the larger the company, the larger the document variance, since different suppliers and/or clients use different formats for the same type of document.

We’ve been talking about invoices. But why not extend the problem to any kind of document whose information is structured in some way? Packing lists are a good example of a different (yet similar) type of document. They are widely used in logistics departments when material arrives at a company, and a single document may run to dozens of pages, with potentially hundreds of items to extract.

This is a multi-layered problem, so let's start peeling back the layers.

First, a paper invoice from supplier Y arrives at XLDA and John has to scan the invoice. But a scan is just an image, so the solution needs to be able to "read" the text. Hence, we need an OCR engine. OCR stands for Optical Character Recognition: given an image containing text, it outputs that same text. It's been available for quite some time, but advances in deep learning have brought unprecedented levels of accuracy to the technology. To maximize the success rate, the document must be scanned at good quality.

So, any document that comes in physical paper or in scanned format will first go through an OCR engine. With this, we've extracted the text and layout of the document, but we still don’t have the relevant invoice data that we set out to extract. So, what comes next?
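
To make "text and layout" concrete, here is a minimal sketch of how OCR output could be held in Python and turned back into readable lines. The `OcrWord` class, its pixel coordinates, the `group_into_lines` helper, and the sample words are all assumptions made for illustration; they do not reflect the output format of any particular OCR engine or of Doc Recognizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One recognized word and the top-left corner of its box (hypothetical format)."""
    text: str
    x: int  # pixels from the left edge of the page
    y: int  # pixels from the top edge of the page


def group_into_lines(words, y_tolerance=5):
    """Rebuild text lines from words.

    A word joins the current line when its y is within y_tolerance
    of the first word of that line; otherwise it starts a new line.
    """
    lines = []  # each entry is a list of OcrWord
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Read each line left to right.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Made-up sample: slightly uneven y values, as a real scan might produce.
sample_words = [
    OcrWord("Invoice", 40, 100),
    OcrWord("No:", 120, 102),
    OcrWord("INV-001", 170, 101),
    OcrWord("Total:", 40, 300),
    OcrWord("123.45", 120, 299),
]
page_lines = group_into_lines(sample_words)
```

Keeping the coordinates alongside the text is what lets later steps reason about where a value sits on the page, not just what it says.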

Next comes defining what information we want to extract. Different types of documents have different attributes. Within the same category, different suppliers include different information. In our scenario, let's say I want to extract the invoice number, the invoice date, the VAT number of the supplier, the total amount to pay, and the payment deadline.
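
The five attributes from this scenario could be written down as a small schema. This is only a sketch: the `InvoiceField` class, the field names, and the chosen value types are assumptions for illustration, not Doc Recognizer's actual configuration format.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class InvoiceField:
    """One attribute to extract (hypothetical schema entry)."""
    name: str
    value_type: type


# The five attributes from the scenario; names and types are illustrative.
INVOICE_FIELDS = [
    InvoiceField("invoice_number", str),
    InvoiceField("invoice_date", date),
    InvoiceField("supplier_vat_number", str),
    InvoiceField("total_amount", Decimal),
    InvoiceField("payment_deadline", date),
]


def missing_fields(extracted):
    """Return the names of schema fields absent from an extraction result (a dict)."""
    return [f.name for f in INVOICE_FIELDS if f.name not in extracted]
```

A check such as `missing_fields({"invoice_number": "INV-001"})` would then tell a reviewer which values still need attention.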

Now we need to tag a few examples of values we wish to extract. But why would I tag information if I want that same information extracted, you might ask? We’re tagging the examples in this invoice to *teach* our AI engine how the information in invoices from this supplier is presented. This is a one-time effort, and when the next invoice comes along, it will be automatically processed. But I digress, so back to the tagging. Here is an example of how the tagging process would look.
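
In data terms, a tag can be as little as a field name attached to a span of the extracted text. The sketch below assumes a `Tag` class with character offsets and a `tag_value` helper; both are hypothetical, and the invoice text is made up. The platform's own tagging is done through its interface, not through code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """A labeled span of the document text (hypothetical format)."""
    field_name: str
    start: int  # index of the first character of the value
    end: int    # index one past the last character of the value


def tag_value(text, field_name, value):
    """Create a Tag for the first occurrence of value in text."""
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in document text")
    return Tag(field_name, start, start + len(value))


# Made-up text, standing in for the OCR output of one invoice.
document_text = "Invoice No: INV-001\nDate: 2021-03-15\nTotal: 123.45"
tags = [
    tag_value(document_text, "invoice_number", "INV-001"),
    tag_value(document_text, "total_amount", "123.45"),
]
# Reading the spans back recovers exactly what was tagged.
tagged_values = {t.field_name: document_text[t.start:t.end] for t in tags}
```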

So far, we've gathered three critical pieces of information:


  1. The textual content of the document;
  2. The information we want to extract;
  3. A few examples of the information we want to extract in our invoice.


The next step is feeding these to our AI engine. But what will it do with the information, and what will it produce? The answer to the former is that it will learn a set of rules and properties that allow it to best extract the information. The answer to the latter is that these rules and properties will be stored in something called a "model". We can then apply the model to our invoice and see if the result matches our expectations. Better yet would be to apply it to a different invoice, lest we be accused of "memorizing" the original document.
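
The learn-then-apply cycle can be illustrated with a deliberately simple stand-in. The toy "model" below just remembers the text that precedes each tagged value on its line, then looks for the same prefix in a new document. This is not how the AI engine works internally; `learn_model`, `apply_model`, and the sample invoices are assumptions chosen to keep the example small.

```python
def learn_model(text, tagged):
    """Learn, for each field, the text preceding its tagged value on the same line.

    `tagged` maps a field name to the value string as it appears in `text`.
    The returned dict (field name -> prefix) is the toy "model".
    """
    model = {}
    for line in text.splitlines():
        for field_name, value in tagged.items():
            if value in line:
                prefix = line[: line.index(value)].strip()
                if prefix:
                    model[field_name] = prefix
    return model


def apply_model(model, text):
    """Extract each field by finding its learned prefix and taking the rest of the line."""
    result = {}
    for line in text.splitlines():
        stripped = line.strip()
        for field_name, prefix in model.items():
            if stripped.startswith(prefix):
                result[field_name] = stripped[len(prefix):].strip()
    return result


# Train on one tagged invoice (made-up text and values).
training_text = "Invoice No: INV-001\nTotal: 123.45"
training_tags = {"invoice_number": "INV-001", "total_amount": "123.45"}
model = learn_model(training_text, training_tags)

# Evaluate on a different invoice with the same layout, not the training one.
new_text = "Invoice No: INV-002\nTotal: 67.80"
extracted = apply_model(model, new_text)
```

Testing on `new_text` rather than `training_text` is the point made above: a model that only reproduces the tagged values of the document it was trained on has memorized, not learned.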

Now, whenever a new invoice from Y arrives, John can simply submit the invoice to the platform and validate the results.

What happens when I have dozens or hundreds of suppliers sending me different invoices? You can create as many models as you want, and tweak them as new data comes in.
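
One way to picture "as many models as you want" is a lookup keyed by supplier, where retraining replaces the stored entry. The `ModelRegistry` class, the supplier keys, and the dict-shaped models below are hypothetical and only illustrate the bookkeeping, not the platform's storage.

```python
class ModelRegistry:
    """Keeps one model per supplier (hypothetical structure)."""

    def __init__(self):
        self._models = {}

    def register(self, supplier, model):
        # Registering again for the same supplier replaces the earlier model.
        self._models[supplier] = model

    def get(self, supplier):
        if supplier not in self._models:
            raise KeyError(f"no model trained for supplier {supplier!r}")
        return self._models[supplier]

    def suppliers(self):
        return sorted(self._models)


registry = ModelRegistry()
registry.register("supplier_Y", {"invoice_number": "Invoice No:"})
registry.register("supplier_Z", {"invoice_number": "Ref."})
# New tagged data arrives for supplier Y: store the tweaked model in its place.
registry.register("supplier_Y", {"invoice_number": "Invoice No:", "total_amount": "Total:"})
```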

But what if I want to scale, and use the platform for documents other than invoices? Our platform was devised to be generic. We made as few compromises as possible, so you can create groups for each type of document, each with different attributes.
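
Groups with their own attributes could be sketched as follows. The `DocumentGroup` class is an assumption for illustration, and the packing-list attribute names in particular are invented here; the article does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentGroup:
    """A document type together with the attributes to extract from it (hypothetical)."""
    name: str
    attributes: tuple


groups = {
    "invoice": DocumentGroup(
        "invoice",
        ("invoice_number", "invoice_date", "supplier_vat_number",
         "total_amount", "payment_deadline"),
    ),
    # Attribute names below are illustrative guesses for a packing list.
    "packing_list": DocumentGroup("packing_list", ("item_code", "quantity")),
}


def attributes_for(doc_type):
    """Look up which attributes apply to a given document type."""
    return groups[doc_type].attributes
```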



Despite living in a time of unparalleled digital innovation, most businesses still use paper documents to track interactions. Even when they take the digital route, most of the time it's a PDF file. This means that employees still have to analyze each incoming document, locate the information, and manually input it elsewhere (for example, in an ERP). An automated solution such as Doc Recognizer can bring a significant ROI, since it frees employees to perform other tasks that bring more value to the company.


About the author

João Amaro

AI Engineer


