OCR is Always Evolving, Always Hot

By Aaron Bianchi
Apr 29, 2021

As a teenager in the 1970s I worked for an early Optical Character Recognition (OCR) company. They had an SUV-sized scanner in their computer room that digitized IBM Selectric double-spaced Pica text with about 80% accuracy and printed it to microfiche. I learned to program the DEC VAX that drove the scanner by typing octal instructions onto paper tape and then bootstrapping the tape reader. I also spent many hours in the proofreading pool comparing the microfiche output to the source data, the Manhattan White Pages, and logging corrections.

OCR has come a long way since then.

Today’s OCR is an application of computer vision that enables machines to find and extract text embedded in images. OCR projects are seeing explosive growth because they promise to reduce the cost of human labor and human error while increasing productivity and security.

Real-world examples of OCR are legion:

  • Many autonomous device use cases demand the ability to read text in the form of signage, warnings, and surface-embedded instructions

  • Industries like real estate and financial services want to reduce or eliminate human involvement in digitizing business documents and other artifacts and in electronically capturing the business-critical content they contain

  • Likewise, many industries are seeking to eliminate the need for humans to interpret and process handwritten content like patient charts, whiteboard sessions, and annotated text documents

  • Other examples include license plate recognition, menu digitization, language translation, and many more

OCR models are a subset of machine learning models, and deep learning OCR is increasingly data scientists’ preferred approach. The complexity and nuance of real-world OCR tasks give deep learning models an appreciable performance edge.

Deep learning models don’t train themselves. They, too, require training data, feedback, and refinement to achieve optimal outcomes. In fact, their performance edge comes at a cost: deep learning OCR requires significantly more training data, often orders of magnitude more, than many other ML approaches.

OCR involves two steps, and OCR models must be trained for both: text detection, locating the salient text in an image, and text recognition, extracting the content of that text.
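
To make the two steps concrete, here is a minimal sketch that runs both through the open-source Tesseract engine via the pytesseract and Pillow libraries. The libraries, file name, and output handling are illustrative choices, not part of the article.

```python
# Minimal sketch of the two OCR steps, using the open-source Tesseract engine
# through pytesseract and Pillow. The libraries, file name, and output handling
# are illustrative assumptions, not a prescription.
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")  # hypothetical input image

# image_to_data exposes both steps at once: bounding boxes (text detection)
# alongside the recognized strings and confidences (text recognition).
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for i, word in enumerate(data["text"]):
    if word.strip():  # skip empty detections
        box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
        print(f"detected at {box}: recognized as {word!r} (confidence {data['conf'][i]})")
```

In production pipelines the two stages are often separate models, such as a detector that proposes text regions and a recognizer that transcribes each crop, but the division of labor is the same.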

Very large quantities aside, OCR training data is produced in a standard fashion. Human data labelers annotate input images, typically with bounding boxes or polygons, to localize text areas. Depending on the application, they may also need to label different text areas separately or indicate how text blocks are related.
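
As an illustration, a single labeled record might look like the following. The schema, field names, and image name are hypothetical stand-ins for whatever format a given labeling tool or model actually expects (for example, COCO-Text or ICDAR-style ground truth).

```python
# Illustrative example of one labeled OCR training record. The schema and
# values are hypothetical; real projects follow the format their tools require.
annotation = {
    "image": "receipt_0042.jpg",           # hypothetical source image
    "regions": [
        {
            "id": 0,
            "bbox": [34, 18, 210, 46],      # x, y, width, height of a text area
            "polygon": [[34, 18], [244, 18], [244, 64], [34, 64]],
            "transcription": "ACME HARDWARE",
            "linked_to": None,              # how this block relates to others
        },
        {
            "id": 1,
            "bbox": [34, 70, 180, 30],
            "polygon": [[34, 70], [214, 70], [214, 100], [34, 100]],
            "transcription": "123 Main St",
            "linked_to": 0,                 # e.g. the address belongs to the header block
        },
    ],
}
```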

Importantly, labeling and annotation are only the final steps in training data preparation. Many data science teams work with collections of input images that are distorted, skewed, or inconsistently lit or sized. Still other teams are confronted with very large quantities of paper that have never been digitized.
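
For teams dealing with distorted or skewed scans, a common first curation step is to normalize geometry before the images ever reach labelers or a model. The snippet below is a rough deskewing sketch using OpenCV; the thresholding recipe, angle handling, and file names are assumptions for illustration, and OpenCV's rotated-rectangle angle convention varies between versions.

```python
# Rough sketch of one common curation step: deskewing a scanned page with OpenCV
# before labeling or training. Thresholding and angle handling are illustrative;
# OpenCV's rotated-rect angle convention differs across versions, so the
# correction below may need adjusting.
import cv2
import numpy as np

image = cv2.imread("skewed_scan.png")  # hypothetical input scan
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Estimate the dominant skew from the minimum-area rectangle around the ink pixels.
coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
    angle = -(90 + angle)
else:
    angle = -angle

# Rotate the page back toward horizontal.
h, w = image.shape[:2]
matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(image, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("deskewed_scan.png", deskewed)
```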

Training data partners that can supplement OCR labeling with a full complement of data curation and data creation services give data science teams a significant leg up on their OCR projects.
