Author name: DDD

OCR is Always Evolving, Always Hot

By Aaron Bianchi
Apr 29, 2021

As a teenager in the 1970s I worked for an early Optical Character Recognition (OCR) company. They had an SUV-sized scanner in their computer room that digitized IBM Selectric double-spaced Pica text with about 80% accuracy and printed it to microfiche. I learned to program the DEC VAX that drove the scanner by typing octal instructions onto paper tape and then bootstrapping the tape reader. I also spent many hours in the proofreading pool comparing the microfiche output to the source data, the Manhattan White Pages, and logging corrections.

OCR has come a long way since then.

Today’s OCR is an application of computer vision that enables machines to find and extract text embedded in images. OCR projects are seeing explosive growth because they promise to reduce the cost of human labor and human error while increasing productivity and security.

Real-world examples of OCR are legion:

  • Many autonomous device use cases demand an ability to read text in the form of signage, warnings, and surface-embedded instructions

  • Industries like real estate and financial services want to reduce or eliminate human involvement in digitizing business documents and other artifacts and electronically capturing the business-critical content therein

  • Likewise, many industries are seeking to eliminate the need for humans to interpret and process handwritten content like patient charts, whiteboard sessions and annotated text documents

  • Other examples include license plate recognition, menu digitization, language translation, and many more

OCR models are a subset of machine learning models, and increasingly, deep learning OCR is data scientists’ preferred approach. The complexity and nuance of real-world OCR tasks give deep learning models an appreciable performance edge.

Deep learning models don’t train themselves. They, too, require training data, feedback, and refactoring to achieve optimal outcomes. In fact, their performance edge comes at a cost: deep learning OCR requires significantly more training data, often orders of magnitude more, than many other ML approaches.

OCR involves two steps, and OCR models must be trained in both. A trained model has to identify the location of salient text in an image, referred to as text detection, and it must perform text recognition, the extraction of text content.
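The two steps can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the "image" is a grid of characters with "." as background, and the names `Box`, `detect_text`, `recognize_text`, and `run_ocr` are hypothetical. A real system would replace both functions with trained models, but the division of labor is the same.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical toy "image": a grid of characters where "." is background.
Image = List[str]


@dataclass(frozen=True)
class Box:
    """Location of one text region: a row and a half-open column range."""
    row: int
    col_start: int
    col_end: int  # exclusive


def detect_text(image: Image) -> List[Box]:
    """Step 1 (text detection): find where text is, not what it says."""
    boxes: List[Box] = []
    for r, line in enumerate(image):
        start = None
        # The appended "." is a sentinel that closes a run ending at the edge.
        for c, cell in enumerate(line + "."):
            if cell != "." and start is None:
                start = c
            elif cell == "." and start is not None:
                boxes.append(Box(r, start, c))
                start = None
    return boxes


def recognize_text(image: Image, box: Box) -> str:
    """Step 2 (text recognition): extract the content of one detected box."""
    return image[box.row][box.col_start:box.col_end]


def run_ocr(image: Image) -> List[Tuple[Box, str]]:
    """Chain detection and recognition, as an OCR pipeline does."""
    return [(box, recognize_text(image, box)) for box in detect_text(image)]


toy_image = [
    "..........",
    "..STOP....",
    "..........",
    ".EXIT..12.",
]
results = run_ocr(toy_image)
for box, text in results:
    print(box, text)
```

Keeping the two stages separate means each can be trained and evaluated on its own, which is one reason training data has to cover both.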

The very large quantities required aside, OCR training data is produced in standard fashion. Human data labelers annotate input images, typically with bounding boxes or polygons, to localize text areas. The particular application may require that they separately label different text areas or indicate how text blocks are related.
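As a rough illustration of what such an annotation might look like as data, here is a minimal sketch. The schema is an assumption, not a standard format: the class names `TextRegion` and `ImageAnnotation`, the field names, the label values, and the file path are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class TextRegion:
    region_id: str
    polygon: List[Point]              # outline drawn by the labeler
    label: str                        # hypothetical class, e.g. "header"
    transcription: str                # the text the region contains
    parent_id: Optional[str] = None   # links a block to a related block

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """Axis-aligned box (x_min, y_min, x_max, y_max) around the polygon."""
        xs = [x for x, _ in self.polygon]
        ys = [y for _, y in self.polygon]
        return (min(xs), min(ys), max(xs), max(ys))


@dataclass
class ImageAnnotation:
    image_path: str
    regions: List[TextRegion] = field(default_factory=list)


title = TextRegion(
    region_id="r1",
    polygon=[(10, 10), (110, 12), (108, 40), (8, 38)],
    label="header",
    transcription="INVOICE",
)
total = TextRegion(
    region_id="r2",
    polygon=[(10, 200), (90, 200), (90, 220), (10, 220)],
    label="field_value",
    transcription="42.00",
    parent_id="r1",  # records how the two text blocks are related
)
annotation = ImageAnnotation("scans/invoice_001.png", [title, total])

# Serialize for hand-off to a training pipeline.
serialized = json.dumps(asdict(annotation), indent=2)
print(title.bounding_box())
```

A polygon is stored as the primary shape because a bounding box can always be derived from it, while the reverse is not true for skewed or curved text.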

Importantly, labeling and annotation are just the final step in training data preparation. Many data science teams work with data collections that include input images that are distorted, skewed, or inconsistently lit or sized. Still other teams are confronted with very large quantities of paper that have not been digitized.
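One small example of the curation work that precedes labeling is evening out inconsistent lighting. The sketch below applies a simple min-max contrast stretch to a grayscale image represented as nested lists of 0 to 255 intensities. Both the representation and the function name `stretch_contrast` are assumptions chosen for illustration; production pipelines typically use an imaging library and also correct skew and size.

```python
from typing import List

GrayImage = List[List[int]]  # pixel intensities in the range 0..255


def stretch_contrast(image: GrayImage) -> GrayImage:
    """Rescale intensities so the darkest pixel maps to 0, the brightest to 255."""
    pixels = [p for row in image for p in row]
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A flat image has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in image]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in image]


# A dim, low-contrast scan: all values are bunched between 100 and 150.
dim_scan = [
    [100, 110],
    [120, 150],
]
normalized = stretch_contrast(dim_scan)
print(normalized)
```

Normalizing inputs this way before annotation gives labelers and models images with a more consistent intensity range.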

Training data partners that can supplement OCR training data labeling with a full complement of data curation and data creation services offer data science teams a significant leg up with regard to their OCR projects.

Announcing the Launch of Autonomous Fleet Ops

04 June, 2025 by Sahil Potnis, VP of Product & Partnerships

Detroit, MI, USA: Digital Divide Data (DDD) continues to expand its end-to-end data capabilities for Autonomous Systems across land, air, sea, and space. Our latest solution set is targeted towards supporting Autonomous Fleet Operations including Human in the Loop (HiTL) data solutions for:

(A) Remote Teleoperations to enable full Autonomy

(B) Operational Data Intelligence to gather ODD exposure and mission intel insights

(C) Fleet Management functions surrounding the capabilities of mission command and control

(D) In-Cabin Monitoring to drive forward the safety of ADAS systems

“Remote Teleoperations as a Service” is growing rapidly across the globe to augment the core Autonomous premise of any system and to unlock SAE L3+ and L4 levels of Autonomy. Similarly, Operational Data Intelligence is an essential part of fleet operations, aimed at deploying assets most effectively across multiple sites, whether for testing or data collection. In-Cabin Monitoring plays a critical role in directly supporting any Autonomy company’s CONOPS for safer and more reliable operations.

DDD’s in-house expertise in these workflows and its ability to stand up US (onshore) or offshore operations in a span of 10 days[1] are a critical market differentiator for advancing autonomy technology.

“DDD’s Fleet Operation solutions, coupled with data operations support services, give our clients the ability to deliver accelerated fleet deployment and management with controlled, scalable, and cost-effective outcomes,” says Sameer Raina, DDD CEO and President.

DDD is actively pursuing value-added technology partners to make its Fleet Operations ecosystem robust, scalable, and diverse. DDD’s acquisition of Liberty Source PBC in 2024 has supplemented the workforce with a direct, on-the-ground US presence, which is vital to unlocking low-latency, high-data-security workflows. DDD’s social impact mission, the operational excellence of its global workforce (US, Africa, Asia), its deep subject matter expertise in Autonomy, and its toolchain partnerships uniquely position the team to be an industry leader in providing such end-to-end Autonomy HiTL data solutions.

[1] 10 business days is the average time to set up a pilot project for an Autonomy-focused workflow. This figure is not specific to the Fleet Operations capability.
