Transformers vs. OCR: an in-depth comparison for Information Extraction

Felipe Bandeira
Python in Plain English
15 min read · Sep 1, 2023



1. Introduction

Information extraction (IE) is a problem yet to be solved for small businesses and startups. The cost of the best existing solutions can often be prohibitive for entrepreneurs, and open-source tools tend to lack the necessary quality to solve the challenge. Are there alternatives, then?

We address this issue in two ways: first, by developing an image processing pipeline built around an Optical Character Recognition (OCR) engine, and second, by fine-tuning an OCR-free model named Donut. Finally, we compare the accuracy of both approaches through multiple tests that evaluate their strengths and drawbacks. We have also made the fine-tuned deep learning model publicly available.

2. What is Information Extraction and why does it matter?

In essence, IE is a subfield of NLP responsible for identifying relevant information in texts and extracting it into specific output formats. The concept of “relevant information” varies according to the task at hand, but when it comes to Relation Extraction (RE), one can summarize the challenge with two questions: 1) who are the relevant entities in a text, and 2) how are they related to each other?

Take as an example the following paragraph, which was composed of excerpts from Wikipedia:

Getúlio Dornelles Vargas (São Borja, 19 April 1882 — 24 August 1954) was a Brazilian lawyer and politician who served as the 14th and 17th president of Brazil. After returning to the state Legislative Assembly, Vargas led troops during Rio Grande do Sul’s 1923 civil war.

In RE tasks, the goal is to find the relevant entities in the text and classify their relationship, looking for outputs such as:

  • Getúlio Vargas — president of — Brazil
  • Getúlio Vargas — born in — São Borja
  • Getúlio Vargas — fought in — Rio Grande do Sul’s 1923 civil war

Although Wikipedia articles might not be useful for businesses, invoices and IDs are. For several companies, extracting information from such documents is a daily challenge.

Suppose there is a small IT firm called XYZ Data Privacy. It was hired by Pete’s Farm, a 50-year-old business that had always worked on paper but was now migrating to digital. Pete’s Farm wanted XYZ to create a database of every client that had ever purchased from them.

To build it, XYZ would need to read copies of IDs or driver's licenses from all clients and extract each person's information: name, date of birth, and address. How should they go about it?

3. Tech review: what has been done before?

3.1. Existing solutions: too costly

Usually, this task is accomplished using a combination of computer vision and NLP. Suppose one of the documents provided by Pete’s was scanned and turned into an image. First, a text detection stage takes place to identify where information is located, creating text boxes around it. Next, an OCR engine reads the content within each box in the image and converts it to text. Last, an algorithm (which can be a post-OCR pipeline or a language model) classifies the information extracted (“John Doe” would be tagged as “Client name”) and outputs it in the desired format.

Illustration of OCR process: 1) image is transformed to grayscale 2) text boxes are identified 3) text boxes are scanned 4) text is returned. (Source: author)
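To make this concrete, here is a minimal sketch of such a pipeline using OpenCV and pytesseract; the file name is a placeholder, and the final classification step is only hinted at:

import cv2
import pytesseract

image = cv2.imread("license_scan.jpg")          # 0) load the scanned document (placeholder path)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # 1) convert to grayscale

# 2-3) detect text boxes and read the content within each one
data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)

# 4) return the text with box coordinates; a later step would classify each string
for text, x, y, w, h in zip(data["text"], data["left"], data["top"], data["width"], data["height"]):
    if text.strip():
        print(f"({x}, {y}, {w}, {h}) -> {text}")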

There are commercially available solutions for this, such as Amazon’s Textract or Microsoft’s Form Recognizer. However, these options have three main issues:

  1. Cost: Amazon, for example, charges 10 USD per 1000 pages, which becomes a burden if XYZ processes hundreds of thousands or millions of files daily.
  2. Volatility: if Amazon increases its prices, XYZ might be forced to do the same, affecting its client base and existing contracts.
  3. Authenticity: as a technology firm, XYZ is expected to have its own solution rather than simply stitching together third-party systems for its clients.

3.2. Open source solutions: too unreliable or heavy

Given these problems, one might wonder: what if we used open-source software and created a version of the previous solutions ourselves? There are, after all, open-source OCR engines, such as Tesseract, which should be capable of extracting the text. There are also LLMs, such as MPT-7B, that could take in the extracted text and return the relevant entities and their relations.

Extracting text with Tesseract is definitely feasible. The post-OCR processing, however, is the problem: LLMs are heavy, and, for startups, the cost of storing and running one can be prohibitive.

A promising option we found to tackle that was REBEL, a seq2seq model based on BART and designed to extract relation triplets from text. The initial tests we performed were positive: when given excerpts of news articles, for example, the model correctly extracted the relations we expected. For an input such as the following:

input = "Silvia Schenker (born January 17, 1954 in Aarau, 
the capital of the canton of Aargau, Switzerland) is a Swiss
politician. She joined the National Council of Switzerland
(the lower house of the federal assembly) in 2003 and served
until 2019. Schenker was a member of the Commission for Social
Security and Health (CSSS). She is a member of the Swiss Socialist Party.
Now a national advisor, she lives in Bâle."

We would obtain the following sequence of relations:

Relations:
{'head': 'Silvia Schenker', 'type': 'date of birth', 'tail': 'January 17, 1954'}
{'head': 'Silvia Schenker', 'type': 'place of birth', 'tail': 'Aarau'}
{'head': 'Silvia Schenker', 'type': 'country of citizenship', 'tail': 'Switzerland'}
{'head': 'Aarau', 'type': 'country', 'tail': 'Switzerland'}
{'head': 'canton', 'type': 'country', 'tail': 'Switzerland'}
{'head': 'Aargau', 'type': 'capital', 'tail': 'Aarau'}
{'head': 'Aargau', 'type': 'instance of', 'tail': 'canton'}
{'head': 'Aargau', 'type': 'country', 'tail': 'Switzerland'}
{'head': 'National Council of Switzerland', 'type': 'country', 'tail': 'Switzerland'}
{'head': 'Swiss Socialist Party', 'type': 'country', 'tail': 'Switzerland'}
{'head': 'federal assembly', 'type': 'country', 'tail': 'Switzerland'}
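For reference, the relations above can be obtained with a short script along these lines, reusing the input string defined above and assuming the publicly available Babelscape/rebel-large checkpoint on HuggingFace with the decoding recipe from its model card (the parser that turns the generated markers into the dictionaries shown above is omitted):

from transformers import pipeline

# REBEL is exposed as a seq2seq "translation" pipeline on HuggingFace
triplet_extractor = pipeline(
    "translation_xx_to_yy",
    model="Babelscape/rebel-large",
    tokenizer="Babelscape/rebel-large",
)

# Decode manually so the special <triplet>/<subj>/<obj> markers are kept;
# a small parser then turns each "<triplet> head <subj> tail <obj> relation"
# span into a {'head': ..., 'type': ..., 'tail': ...} dictionary.
generated = triplet_extractor(input, return_tensors=True, return_text=False)
linearized = triplet_extractor.tokenizer.batch_decode(
    [generated[0]["translation_token_ids"]]
)[0]
print(linearized)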

However, less-structured text (such as emails, invoices, or IDs) turned out to be a problem right away. This happened because REBEL was trained on texts with full context: complete sentences and thorough explanations of the subject at hand. Invoices and emails are hardly like that, and the results provided by the model often did not make sense. For an email entry such as the following:

input = "Get in touch with below:

Name: Jane
Email: jane@gmail.com
Mobile: 510-712-4567"

REBEL would return a faulty sequence of entity relations:

Relations:
{'head': 'Jane', 'type': 'residence', 'tail': '510-712-4567'}
{'head': 'Jane', 'type': 'work location', 'tail': '510-712-4567'}
{'head': 'Jane', 'type': 'number of episodes', 'tail': '510-712-4567'}

The problem, then, was set: how could a company like XYZ effectively extract information from documents?

4. Our approaches

We defined our goal as extracting information from driver's licenses. As a starting point, we used licenses from the EU, which have standardized fields despite small variations in design. A few additional premises were:

  • Input images would be IDs submitted by users (which means various lighting conditions)
  • Images would be cropped before pre-processing (which means little background noise)

4.1. Solution 1 (OCR-based)

Given that using OCRs for reading documents is a standard in the industry, we started by exploring this approach. Our first goal was to build a solution that pre-processed an image, extracted the text from it with an OCR, and then classified it later with a post-OCR pipeline.

One must note that, when it comes to OCR-based methods, image pre-processing is crucial for obtaining good results. Although multiple papers suggest pre-processing methods, such as El Harraj & Raissouni [1] and Koistinen et al. [2], there is no standard that can be applied in most cases. Instead, the same processing method can lead to vastly different results depending on the input at hand, which means we had to define our own processing pipeline from scratch.

To do so, we created a small base set of images, which was used to compare the results from different pipelines. The set consisted of six driver licenses from different countries, with different resolutions, image sizes, and lighting conditions. Next, the following aspects were explored:

  • Thresholding: Four different thresholding methods were analyzed, as shown in the image below. On average, global thresholding and Otsu’s binarization performed very similarly, while the adaptive methods tended to make images hard to read even to the human eye. The two best-performing methods were kept, and further modifications were evaluated on top of them.
Text boxes (colored) identified by Tesseract after processing the same image with different thresholding methods. From top to bottom, left to right: binary, Otsu’s binarization, adaptive mean, and adaptive Gaussian thresholding. (Source: author)
  • Morphological operations: Five different transformations were analyzed, both alone and combined: blurring, dilation, erosion, opening, and closing. For the operations that required kernel configuration, different kernel sizes were evaluated: 2x2, 4x4, and 8x8. Larger kernels tended to make the results worse, and in general, all morphological transformations produced worse outputs for the OCR.
Resulting images after closing with multiple kernel sizes (source: author)
  • Resizing and interpolation: According to suggestions from Tesseract’s website, 32px is the optimal text height for extraction. With that in mind, we designed a simple resizing process: given an input image, an OCR pass extracted its text, the median height of the text boxes was calculated, and the image was resized so that the median box height reached 32px. In this operation, we also tested five types of interpolation, comparing their impact on the results of the OCR.

After multiple iterations, we concluded that cubic interpolation was the most appropriate for enlarged images, while interpolation based on pixel area performed best for scaled-down images. We also used bilinear interpolation for cases where no text box was detected at first, assuming that the OCR failure was caused by the image’s minimal size (and thus needing to be scaled up).
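Putting these pieces together, a simplified sketch of the resizing step, combined with the thresholding and OCR stages discussed above, could look like the following; the exact heuristics and the fallback scale factor are illustrative rather than our exact implementation:

import cv2
import pytesseract

TARGET_HEIGHT = 32  # median text-box height suggested by Tesseract's documentation

def preprocess_and_read(image, target=TARGET_HEIGHT):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # First OCR pass: estimate the median height of the detected text boxes
    data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
    heights = sorted(h for h, txt in zip(data["height"], data["text"]) if txt.strip())

    if heights:
        scale = target / heights[len(heights) // 2]
        # Cubic interpolation when enlarging, pixel-area interpolation when shrinking
        interp = cv2.INTER_CUBIC if scale > 1 else cv2.INTER_AREA
    else:
        # No boxes found: assume the image is too small and upscale with bilinear interpolation
        scale, interp = 2.0, cv2.INTER_LINEAR

    resized = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=interp)

    # Otsu's binarization, then the final OCR pass
    _, binary = cv2.threshold(resized, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)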

Final image processing pipeline and post-OCR
With a given input image, the final pipeline performed optimal resizing, applied Otsu's binarization, and scanned the text using Tesseract. The next step in our plan was to create functions that could take in the extracted text and return the entities we needed, such as name, date of birth, license number and so on. The problems with this approach, however, were already evident:

  • Noise and cost: Rather than reading only the relevant fields, the OCR read (and often misread) everything that it could, returning a noisy text to be processed. To extract the relevant information, a robust post-OCR pipeline would be necessary, making this solution costly.
  • Suboptimal image processing: As mentioned before, there is no silver bullet for image processing, which means input images in different conditions are capable of making an "optimal" pipeline useless for the OCR. As a consequence, it was nearly impossible to create a pipeline that could give us certainty of good performance.
  • Need for perfect input: We found that, unless we had a perfect input (a very clear and well-aligned image, which would often not be the case in a real-world setting of user-submitted images), the text was often misread or ignored. The same happened with licenses containing characters outside the English alphabet, such as Polish or Lithuanian ones.

4.2. Solution 2 (OCR-free)

To tackle the OCR-related constraints, we delved into possibilities that used deep learning, focusing on img2seq models. One that caught our attention was Donut [3], a Visual Document Understanding model capable of taking an image as input and performing text-related tasks, such as information extraction and visual question answering. To achieve this without employing an OCR, Donut uses a visual encoder (Swin-B Transformer) and a text decoder (BART).

In detail, Swin-B is a transformer architecture designed specifically for image processing. One of its main characteristics is that it breaks an image down into patches and processes them hierarchically, allowing it to capture both local and global contextual information. Its workflow can be summarized as follows:

  • First, the input image is split into several non-overlapping patches.
  • Then, the patches go through a shifted window-based multi-head self-attention module — this simply helps the model understand the relationship between different parts of the image by considering the content of nearby patches.
  • Next, a two-layer MLP allows the model to learn patterns within each patch and thus have a better understanding of the image’s content.
  • Last, the patch tokens pass through patch merging layers, allowing the model to aggregate information and create a more comprehensive representation of the image.
  • The output of this process is then passed to the decoder, a multilingual BART model.
Illustration of how Donut works. Original image (A), split into non-overlapping patches (B), shifted window self-attention module learns relationship between different parts of the image (C), two-layer MLP learns patterns within each patch (D), patch merging (E), final tokens (F), BART taking tokens as input and returning the result (G). (Source: author)
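To make the first step concrete, here is a toy sketch of splitting an image into non-overlapping patches with NumPy; the patch size is illustrative, and Donut's actual encoder handles this internally:

import numpy as np

def split_into_patches(image, patch_size=4):
    # Non-overlapping patch splitting (step B in the figure above)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(
        h // patch_size, patch_size, w // patch_size, patch_size, c
    ).transpose(0, 2, 1, 3, 4)
    # One patch per grid cell, each patch_size x patch_size x channels
    return patches.reshape(-1, patch_size, patch_size, c)

image = np.random.rand(32, 32, 3)       # stand-in for an input document image
print(split_into_patches(image).shape)  # (64, 4, 4, 3)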

In order to adapt the model to extract information from driver licenses, we created a synthetic dataset and fine-tuned the model on it using an MLOps pipeline named Sparrow. Sparrow allowed us to split the data into training/validation/test sets, configure the fine-tuning, keep track of losses using Weights & Biases, and save the weights at each epoch on HuggingFace.
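To give an idea of what the training targets look like, Donut learns to generate the annotated fields as a single token sequence. Below is a rough, simplified illustration of how a ground-truth label for one synthetic license could be linearized; the field names are illustrative, and the actual conversion (including nested fields and special-token registration) is handled by the fine-tuning pipeline:

# Simplified illustration only: the real conversion is done by the fine-tuning pipeline
ground_truth = {
    "first_name": "THOMAS",
    "last_name": "ARMSTRONG",
    "date_of_birth": "07.10.78",
}

def json2tokens(obj):
    # Wrap each field in opening/closing tokens so the decoder can generate it as text
    return "".join(f"<s_{k}>{v}</s_{k}>" for k, v in obj.items())

print(json2tokens(ground_truth))
# <s_first_name>THOMAS</s_first_name><s_last_name>ARMSTRONG</s_last_name><s_date_of_birth>07.10.78</s_date_of_birth>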

We fine-tuned Donut using the free tier of Colab (T4 GPU) for 7 epochs on a synthetic dataset of 2500 images, which took 3 hours and 43 minutes. The fine-tuned model achieved an accuracy of 98% on the test set, a performance confirmed by the tests we ran on a set of real licenses. Take as example the license below:

Template license (Source: Road Safety Authority)

For this image, Donut yielded the following results:

{'First Name': 'THOMAS', 
'Last Name': 'ARMSTRONG',
'Date of Birth': '07.10.78IRELAND',
'Date of Issue': '13.03.17',
'Date of Expiry': '12.03.27',
'Issuing Authority': 'ROAD SAFETY AUTHORITY',
'Driver Number': '001234567',
'License Number': '020012ABCD'}
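For the curious, an output like the one above can be produced with the standard HuggingFace Donut classes, roughly as sketched below; the checkpoint path, task prompt, and image path are placeholders for our fine-tuned model and test image:

import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "path/to/fine-tuned-donut"  # placeholder for the fine-tuned weights
task_prompt = "<s_license>"              # placeholder for the task start token

processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("license.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
)

# Strip special tokens and the task prompt, then convert the field tokens to a dictionary
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1)
print(processor.token2json(sequence))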

For comparison, this is the output yielded by the OCR approach for the same license:

"CEADUNAS TIOMANA DRIVING LICENCE cil IRELAD 1 ARMSTRONG 2 THOMAS 7 3 OPO Te 
fa 130317 ac ROMP SAFETY AUTHORITY 4b 120327 adeOatesss7 NS 6 oO bpon0 ouB o Oo"

5. Comparative analysis

To contrast both approaches, we started by evaluating the accuracy of the text extracted by each. We created a small dataset with 12 real licenses from 10 EU countries (Austria, Estonia, Germany, Lithuania, Netherlands, Poland, Romania, UK, Spain, and Sweden). The images were submitted by multiple individuals and did not go through any processing, except for resizing (done automatically by the OCR approach and manually for the Donut approach).

For each license, we compared the total number of characters identified correctly by the models. If the ground truth was “John”, for example, but the models extracted “Joh” or “Lohn”, this would represent an accuracy of 75% — only three characters were extracted correctly out of the four that were expected.

The results for the accuracy of text extraction can be seen below. In essence, this test answers the question: overall, how much of the text we expected the models to extract was correctly present in the output?

Comparison of the overall accuracy of each method. In general, Donut performed better than the OCR at extracting information from the licenses

Despite providing insight into the performance of each model, this test leaves a fundamental aspect of Donut unanswered. Is the model failing to identify the fields it was supposed to read? Or is it identifying the fields but reading them incorrectly?

To answer these questions, we performed two additional tests. First, out of all nine fields that Donut was supposed to extract from each license, we evaluated the percentage of time in which each one was effectively extracted. Then, we evaluated the second point: out of the fields that were identified, how accurate was the extracted information?

(Here, accuracy was measured as the percentage of correctly extracted characters per field, compared to the ground truth of each license. For each extra character present in the output but not listed in the ground truth, a correct character was subtracted from the accuracy calculation.)
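As a reference, the sketch below approximates this per-field calculation; the exact character-matching logic of our implementation may differ slightly:

from difflib import SequenceMatcher

def field_accuracy(expected, extracted):
    # Correctly matched characters, minus one penalty per extra character,
    # divided by the number of expected characters (floored at zero)
    matched = sum(
        block.size
        for block in SequenceMatcher(None, expected, extracted).get_matching_blocks()
    )
    extra = len(extracted) - matched
    return max(matched - extra, 0) / len(expected)

print(field_accuracy("07.10.78", "07.10.78IRELAND"))  # 0.125: penalized by the extra "IRELAND"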

The results of both tests can be seen below:

Results of the two additional tests. The column “Identified” measures the percentage of time in which each field was identified. The column “Accuracy” measures, out of the identified fields, how accurate the extracted text was compared to the ground truth

From this process, a few insights were gathered:

  • Despite there being room for improvement in the identification of some fields, most of them were identified in a satisfactory portion of the time
  • Whenever a field was identified, its text was extracted with high accuracy

Additionally, some of the fields that the model struggled the most to extract, such as the dates and the issuer, are precisely fields whose position in the licenses changes according to the country: countries like Poland display these fields stacked on top of each other, while many other nations (and the licenses from our dataset) have date of issue and issuer positioned side-by-side, just like date of expiry and driver number. This suggests that a more comprehensive dataset, containing licenses with both design patterns, would allow the model to generalize better and capture these fields with greater accuracy.

As for the OCR approach, we could not measure how accurately the extracted text would be classified, but this step should be taken into account carefully. To illustrate the challenge, consider again the license shown in the previous section along with its output:

"CEADUNAS TIOMANA DRIVING LICENCE cil IRELAD 1 ARMSTRONG 2 THOMAS 7 3 OPO Te 
fa 130317 ac ROMP SAFETY AUTHORITY 4b 120327 adeOatesss7 NS 6 oO bpon0 ouB o Oo"

The post-OCR pipeline faces two main challenges: first, how to differentiate signal (the personal data in the output) from noise? And then, how to correctly classify each part of the signal (e.g., is 130317 a license number or a date?).
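To illustrate how brittle such a pipeline tends to be, a naive post-OCR classifier could rely on heuristics like the ones sketched below; the patterns are purely illustrative and already break down on the noisy output above:

import re

def classify_token(token):
    # Naive, illustrative heuristics for tagging post-OCR tokens
    if re.fullmatch(r"\d{6}", token):
        return "date?"            # 130317 could be 13.03.17 or part of some number
    if re.fullmatch(r"\d{9}", token):
        return "driver number?"
    if token.isalpha() and token.isupper():
        return "name?"            # but "IRELAD" and "ROMP" also match
    return "noise"

for tok in "fa 130317 ac ROMP SAFETY AUTHORITY 4b 120327".split():
    print(tok, "->", classify_token(tok))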

In the OCR-based approach, multiple post-OCR functions would be necessary for that. However, it is not hard to see that their performance would be suboptimal, given that the OCR output itself is far from ideal. On the other hand, in addition to extracting the text more accurately, Donut also classified it right away, providing an output in which “Thomas” was already tagged as a name and “13.03.17” as the date of issue. This makes Donut an end-to-end solution, providing results that are far more reliable than the ones given by the OCR-based approach.

6. Conclusion

After evaluating two different approaches for extracting information from driver licenses, we concluded that a transformer-based approach such as Donut performs better than an OCR-based approach. The OCR-based option often yielded inaccurate and noisy text, requiring a post-OCR pipeline to classify entities correctly. On the other hand, Donut achieved greater accuracy (98% in the test dataset, 63.9% in the set of real licenses) and could output entities that were already classified, making it an end-to-end solution.

Our fine-tuned Donut model is publicly available.

6.1. Further studies

Both options have room for improvement. For the OCR-based approach, enhancements could be as simple as using a different OCR engine; PaddleOCR, for example, seems promising, with apparently better performance than Tesseract’s. There is also room for more sophisticated refinement: Sporici et al. [4] point out that choosing image processing methods based on pre-set conditions and on principles related to human vision is not coherent, since most OCR engines are implemented using neural nets. Instead, they suggest convolution-based pre-processing using kernels generated in an unsupervised fashion, which could enhance performance by processing images according to their specific needs.
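For reference, swapping in PaddleOCR would be roughly as simple as the sketch below, assuming the paddleocr package; we did not benchmark it here, and the structure of the returned result can vary slightly between versions:

from paddleocr import PaddleOCR

# Basic PaddleOCR usage with default English models (not benchmarked in this article)
ocr = PaddleOCR(use_angle_cls=True, lang="en")
result = ocr.ocr("license.jpg", cls=True)  # placeholder image path

for box, (text, confidence) in result[0]:
    print(f"{confidence:.2f}  {text}")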

As for the Donut-based approach, many improvements can be made in the architecture and fine-tuning of the model. On the architecture side, different encoders/decoders might enhance its performance without any additional work. On the dataset side, expanding the number of templates used, for example, to encompass licenses from more EU countries can help the model generalize better, becoming less dependent on specific design patterns. Moreover, adding licenses from countries that use more than one alphabet (such as Bulgaria or Ukraine, both of which use variations of Cyrillic) might help the model produce more accurate outputs.

Last, it is fundamental to mention the connection between the solutions developed above and the initial motivation of this article: Relation Extraction. Although licenses only contain information about one individual, invoices and similar documents contain information about multiple people. Donut can be fine-tuned in a similar way to identify these entities and extract the information from each one appropriately, making it a viable open-source answer to one more challenge in NLP.

References

[1] El Harraj, A., & Raissouni, N., OCR Accuracy Improvement on Document Images Through a Novel Pre-Processing Approach (2015), Signal & Image Processing: An International Journal

[2] Koistinen, M., Kettunen, K., & Pääkkönen, T., Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing (2017), University of Helsinki Open Repository

[3] Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., & Park, S., OCR-free Document Understanding Transformer (2021)

[4] Sporici, D., Cușnir, E., & Boiangiu, C.-A., Improving the Accuracy of Tesseract 4.0 OCR Engine Using Convolution-Based Preprocessing (2020), Symmetry

Notes

  1. All license images displayed in this article are public domain or allow reproduction (source)
