

Natural Language Processing Is Impossible Without Humans


By Aaron Bianchi
Jan 15, 2022

Computer vision dominates the popular imagination. Use cases like driverless cars, facial recognition, and drone deliveries – machines navigating the three-dimensional world – are compelling and easy to grasp, even if the technology behind these use cases is not well understood.

But in reality, the holy grail of AI is natural language processing (NLP). Teaching machines to accurately and reliably understand and generate human language ushers in a revolution with boundaries that are hard to envision. 

In theory, machines can be perfect listeners that, unlike humans, never get bored or distracted. They can also consume and respond to content far faster than any human, at any time of day or night. The implications of these capabilities are staggering.

This assumes, of course, that we really can teach algorithms to understand what they are “hearing” and build into them the judgment required to communicate on our behalf. And that is what makes NLP such an elusive holy grail: doing so is hard on so many levels. Sure, helping machines make sense of two- and three-dimensional images is an enormous challenge, and headlines describing autonomous vehicle crashes and facial recognition mistakes hint at the complexity of computer vision. But human language is orders of magnitude more complex.

Five ways that humans struggle with our own natural language processing:

  • You misinterpret sarcasm in a text message

  • You hear a pun and you don’t get it

  • You overhear a conversation between experts and get lost in their specialized vocabulary

  • You struggle to understand accented speech 

  • You yearn for context when you come up against semantic, syntactic, or verbal ambiguity (“He painted himself,” or “What a waste/waist!”)

Obviously, processing and interpreting language can be a challenge even for humans, and language is our principal form of communication. Language is complex, and chock full of ambiguity and nuance. We begin to process language in the womb and spend our whole lives getting better at it. And we still make mistakes all the time. 

Ways that humans and machines struggle with each other’s natural language processing:

  • Comprehending not just content, but also context

  • Processing language in the context of personal vocabularies and modes of speech

  • Seeing beyond content to intent and sentiment

  • Detecting and adjusting for errors in spoken or written content

  • Interpreting dialects, accents, and regionalisms

  • Understanding humor, sarcasm, misdirection

  • Keeping up with usage and word evolution and slang

  • Mastering specialized vocabularies

These challenges have not deterred NLP pioneers, and NLP remains an extremely fast-growing sector of machine learning. These pioneers have made great progress with use cases like: 

  • Document classification – building models that assign content-driven labels and categories to documents to assist in document search and management

  • Named entity recognition – constructing and training models that identify particular categories of content in text so as to understand the text’s purpose (see the sketch after this list)

  • Chat bots – replacing human operators with models that can ascertain a customer’s problem and direct them to the right resource
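
To make one of these use cases concrete, here is a minimal named entity recognition sketch in Python. It relies on the open-source spaCy library; the en_core_web_sm model and the example sentence are illustrative assumptions, and a production system would instead use a model fine-tuned on custom-labeled, domain-specific data.

# Minimal NER sketch using spaCy. The pretrained "en_core_web_sm" pipeline
# and the sample sentence are illustrative assumptions, not a recommendation.
import spacy

nlp = spacy.load("en_core_web_sm")  # general-purpose English pipeline

doc = nlp("Acme Health filed its quarterly report with the SEC on March 3, 2021.")

# Each detected entity carries the span of text it covers and a category
# label (e.g., ORG, DATE) assigned by the model.
for ent in doc.ents:
    print(ent.text, ent.label_)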

Of course, even these NLP applications are complex, and the pioneers have taken away three lessons that anyone interested in NLP should heed:

  1. Algorithms require enormous volumes of labeled and annotated training data. The complexity and nuance of language processing mean that much of what we think of as natural language is full of edge cases. And as we all know, training algorithms on edge cases can demand many orders of magnitude more training data than routine cases do. Because algorithms have not yet overcome the barriers to machine/human communication outlined above, training data must come from humans.

    Only humans can label and annotate text and speech data in ways that highlight nuance and context. 

  2. Relying on commercial and open-source NLP training data is a dead end. Getting your model to the confidence levels you need demands training data that matches your specific context, industry, use case, vocabulary, and region. 

    The hard lesson that the pioneers learned is that NLP invariably demands custom-labeled datasets. 

  3. The humans who prepare your datasets must be qualified. If you are dealing with a healthcare use case, your human specialists must have fluency with medical terminology and processes. If the audience for your application is global, the training data cannot be prepared by specialists in a single geography. If the model will encounter slang and idiomatic content, the specialists must be able to label your training data appropriately.

Given the volume of training data NLP requires and the complexity and nuance that surrounds these models, look for a data labeling partner with a sizable, diverse, distributed workforce of labeling specialists. 



Data Bias: AI’s Ticking Time Bomb


By Aaron Bianchi
Dec 8, 2021

We’ve all seen the headlines. It’s big news when an AI system fails or backfires, and it’s an awful black eye for the organization the headlines point to.

Most of the time these headlines can be traced back to issues with the AI model’s training data. Bias in training data can take a variety of forms, but all of them create the potential for leaving the algorithm under- or mis-trained.

In our discussions with clients, we alert them to three data preparation mistakes or oversights that can produce bias:

Failing to ensure the data measuring instrument is accurate. Distortion of the entire data set can result from bad measurement or collection techniques. When this occurs, the bias tends to be consistent and in a particular direction. The danger here is that the production model is out of sync with the reality it is designed to react to.

Bad measurement can take lots of forms. Low quality speech samples, with noise and missing frequencies, can affect a model’s ability to process speech in real time. A drone with an inaccurate GPS system or misaligned altimeter will provide image-based training data that has systematically distorted image metadata. Poorly designed survey and interview instruments can consistently distort responses.

Failing to accurately capture the universe in the data. Sample bias occurs when the training data set is not representative of the larger space the algorithm is intended to operate in. A non-representative training data set will teach the algorithm that the problem space is different than it is.

A classic example of sample bias involved an attempt to teach an algorithm to distinguish dogs from wolves. The training data’s wolf images were overwhelmingly in snowy settings, which led the algorithm to conclude that every picture with snow in it contained a wolf.

But as amusing as this is, sample bias can have serious consequences. Facial recognition algorithms that are trained on disproportionately Caucasian images misidentify African Americans and Asians. Autonomous vehicles that crash into gray trailers on overcast days likely have had too little exposure to this scenario.
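
A lightweight way to catch this kind of skew before training is to audit the composition of the training set’s metadata. The sketch below is a hypothetical example using pandas; the annotations.csv file and its label and background columns are assumptions standing in for whatever metadata accompanies your labeled images.

# Hypothetical audit of training-set composition with pandas. The file name
# and column names are illustrative assumptions.
import pandas as pd

df = pd.read_csv("annotations.csv")

# How balanced are the classes?
print(df["label"].value_counts(normalize=True))

# Is a scene attribute (e.g., snow) confounded with one class?
print(pd.crosstab(df["label"], df["background"], normalize="index"))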

Failing to eliminate social or cultural influences from the data. Cultural bias happens when human prejudices or regional idiosyncrasies make their way into AI training data.

As an example, in the UK athletic shoes are often referred to as pumps. In the US pumps are heeled women’s shoes. Use a UK-based team to label shoe images to train an algorithm targeting US shoppers and you end up with a model that may offer the wrong shoes due to cultural bias.

Cultural bias can be more insidious, however. Randomly sample images of airline pilots and you will end up with a data set that is almost entirely male. However, it would be wrong for all kinds of reasons to have the algorithm you are training conclude that airline pilot and male are causally related.

There are well-understood approaches to data gathering and data sampling that avoid all these forms of bias. Unfortunately, these approaches are labor- and time-intensive and data science teams often lack members with the skills or bandwidth to address them.

Our clients generally want to offload the management of training data bias, and they are delighted to learn that mitigating this bias is a core DDD capability. We fully understand the forms that bias takes, and the sources of those biases. We know how to create or collect data sets that are free of bias. And if you already have the data you need, we can offload the management of data bias in your training data sets.



Using Aerial Imagery as Training Data


By Aaron Bianchi
Aug 6, 2021

Numerous industries use satellite and aerial imagery to apply machine learning to business and social problem sets. This is a particular strength for DDD given our experience in geospatial and aerial use cases in insurance, transportation, meteorology, environmental protection, agriculture, law enforcement, national security, remote delivery, and traffic management.

This experience has taught us a great deal about the challenges and pitfalls associated with aerial image segmentation. We aired a webinar on this subject, and you can view the recording on-demand. Our goal is to deliver a hands-on guide to overcoming these challenges.

  1. Price of failure. Consider the cost of inadequately or incorrectly training an algorithm to evaluate geospatial or aerial images. Say the project is agricultural. Can you imagine the impact of incorrectly identifying crop disease or inadequate irrigation? Or say the project is military. What are the potential costs of misidentifying an elementary school as an army barracks? You need to factor the expected cost of this kind of failure into your DIY-versus-outsource equation.

  2. Workforce. Aerial and geospatial images are often very large and very detailed, meaning that a large workforce of labelers is required to generate sufficient volumes of training data in a timely fashion. In our experience, most data science teams don’t have access to an in-house workforce big enough to meet their training data demands. This lack of a workforce is one of the principal drivers of seeking a training data partner.

  3. Data volumes. Keep in mind that you may be able to support in-house data preparation for an initial, simple use case, but in your quest for greater levels of model confidence, you will have to train your algorithm on additional use cases, and eventually edge cases. You may be able to generate enough data in-house to train an algorithm to land a delivery drone on a simple graphical marker, but what does it take to distinguish between a leaf on the marker and a three-year-old child? Each additional use case requires at least as much training data as the first one, and rarely occurring edge cases may require significantly more data. This dramatically compounds your workforce requirements, a discovery that many data science teams make late in their projects when budgets are dwindling and deadlines are imminent.

  4. Process and tools. Extremely high-resolution images are far too large to assign to a single labeler. But breaking up images, assigning them to multiple labelers, and then reassembling everything coherently introduces issues around worker consistency and process management (see the tiling sketch after this list). Do you have the wherewithal to train consistency into your own workforce? Do you have the technology and process required to track changes to very high numbers of image fragments? Most data science teams don’t.

  5. Specialization. Are you confident that you can define the most efficient tasks required to label your training data? We had a client who wanted us to label every individual tree in enormous high-resolution forest images. As it happened, they weren’t interested in tree density; rather, they were trying to detect illegal land clearing. Because we have been preparing training data for decades, we were able to show them a different approach to labeling their images that appropriately trained their algorithm, but at a fraction of the time and cost of their original approach.

  6. Focus. Preparing training data for aerial and geospatial systems involves the application of human judgment to nuanced, and sometimes hard to decipher, images. Our own data shows that the longer individuals spend on a particular kind of interpretation, the faster and more accurately they do the work. Data science teams that crowdsource their aerial segmentation work do not capture these workforce efficiencies. DDD assigns you a team that stays with you throughout the span of your project, meaning that you capture all the benefits of growing worker efficiency and your effective cost per transaction steadily declines.
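
To make the tiling problem in point 4 concrete, here is a minimal sketch of slicing a large aerial image into fixed-size tiles while recording each tile’s offset, so that labels drawn on a tile can be mapped back to full-image coordinates. The tile size and the bookkeeping scheme are illustrative assumptions, not a description of DDD’s tooling.

# Minimal tiling sketch using Pillow. Tile size and the offset bookkeeping
# are illustrative assumptions; annotations made at (x, y) on a tile map to
# (x + offset_x, y + offset_y) in the source image.
from PIL import Image

TILE = 1024  # tile edge length in pixels (assumed)

def tile_image(path):
    img = Image.open(path)
    width, height = img.size
    tiles = []
    for top in range(0, height, TILE):
        for left in range(0, width, TILE):
            box = (left, top, min(left + TILE, width), min(top + TILE, height))
            tiles.append({"offset": (left, top), "tile": img.crop(box)})
    return tiles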



Five Key Criteria to Consider When Evaluating a Data Labeling Partner


By Aaron Bianchi
Jul 14, 2021

Machine learning (ML) and AI have dramatically changed the way many businesses across the globe work. As ML and AI continue to evolve, one of the biggest challenges is to ensure the quality of the data utilized by your systems.

For machine learning to work, your system needs properly labeled data. Without it, your ML model may not recognize the patterns it needs to make decisions or perform its functions.

This is one reason data scientists and corporations worldwide work with data labeling partners or invest in data labeling tools.

Are you currently looking for a data labeling partner? Before getting started on your search, you must first understand what data labeling is.

What is Data Labeling?

Data labeling is an essential part of ML, particularly Supervised Learning, a common type of ML used today.

Data labeling identifies raw data such as text files, images, and videos and adds context to it. Once the data has been labeled, it becomes the learning foundation of your ML model for all data processing activities.
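
As a simple illustration, a labeled record can be as little as a piece of raw content paired with the category a human annotator assigned to it. The sentiment-style layout below is a hypothetical example, not a standard schema.

# Hypothetical example of "adding context" to raw data: each raw text item
# is paired with a human-assigned label the supervised model learns to predict.
labeled_examples = [
    {"text": "The package arrived two weeks late.", "label": "negative"},
    {"text": "Setup took five minutes and it just worked.", "label": "positive"},
]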

As your ML model relies heavily on data labeling, make sure you’re working with a data labeling partner that isn’t just reliable; your partner should also have sufficient data labeling experience in your industry.

How to Choose a Data Labeling Partner

There are many ways to find professionals to perform data labeling for you. The most popular is working with a data labeling company or contractor.

Essentially, these service providers become an extension of your team. They manage all your data and typically charge by output volume.

Why should you work with a data labeling company? First, it’s more cost-effective than investing in data labeling tools and spending on human resources. Second, working with a data labeling service provider ensures the work is done right. When your team doesn’t have enough knowledge and experience with data labeling, you’ll need to give them time to learn it, and more time to finish the work, which isn’t an efficient use of your company’s resources.

When choosing a data labeling partner, don’t forget to take the following steps. These will help you find the best provider and make your search more efficient.

  1. Define Your Goals
    Setting goals and expectations is crucial, especially when working with professionals outside of your organization. Remember, they will be working on your data. Therefore, they should have a clear understanding of what you expect from them and the service required of them.

    It would help to have the following information from the beginning:
    • Project overview
    • Timeline
    • Data volume
    • Data quality guidelines or overview

  2. Set a Budget
    Once you’ve prepared all the information, the next step is to decide on a budget.

Every service provider is different, and each will have different rates. Having a budget makes it easier to create a shortlist of candidates, especially when most of them submit similar proposals or offers.

  3. Create a List of Candidates

    Now that you have your budget and project details on hand, the actual search begins!

    Don’t be in a rush to find the “one” for your company. Instead, take your time evaluating multiple service providers. Do your background research, look for customer reviews, and find out their overall standing in the industry.

  4. Ask for Proof of Concept

    Provide a sample task that is quite similar to your project and evaluate how each candidate would deliver the output. This is an easy way to identify a service provider’s skills, experience, and reliability. Additionally, a proof of concept could help you determine any possible roadblocks you may encounter once your project starts.

Criteria for Evaluating a Data Labeling Partner

With thousands of companies offering data labeling services, it could be challenging to assess everyone on your list.

The best way to evaluate your candidates is to set some criteria. Here are five you may use when choosing a partner.

Data Quality
Keep in mind that your ML or AI model would only be as good as the quality of data you provide. Because of this, checking for data quality is of utmost importance when looking for a data labeling service provider.

Tip: Don’t forget to talk to your candidates about their quality control measures.

Technology
Another benefit of outsourcing data labeling is that you can access tools and technology that your company may not otherwise afford.

Ask your vendors which tools and technology they would use for your project. Their tools should help you maximize your time, resources, and efficiency — all while providing quality data.

Workforce
Sure, a service provider may already work with multiple clients… but that doesn’t mean they’re suitable for your project. Make sure their staff knows how to handle the type and volume of data you have. This would help get things going smoothly and with minimal supervision from your end.

Security
Confidentiality and data security are crucial when it comes to outsourcing this type of work. You wouldn’t want to worry about data leaks and hacks, would you? Inquire about the company’s security protocols and process of handling sensitive data.

Social proof
When possible, ask for a list of (past or present) clients. Then, get in touch with them to ask for their feedback on the provider. You may also consider looking into case studies that they’ve done, which would give you a good idea of the quality of their work and processes.

Finding the right data labeling partner for your company doesn’t always have to be complicated. With this guide, you could get started on your search and make sound decisions.

Do you want to learn more about data labeling and how Digital Divide Data could help? Fill out our contact form, and we’d be happy to learn more about your needs and walk you through our process.



ML Data Preparation Demands a Big Toolbox


By Aaron Bianchi
May 11, 2021

Part of the challenge of building machine learning models is that no two are the same. Train the same machine learning algorithm against different sets of data, and you end up with a different model.

If the quality of the raw data is high and the training data sampling is done well, the models shouldn’t vary a lot… but those are all big “if”s, which is why data preprocessing, the actual data preparation process, is critically important.

A Forbes survey revealed that data scientists spend nearly 80% of their time on data prep, a quarter of that on data collection, and the other three quarters on data cleaning. Other survey results indicate that real-world data science isn’t everything these practitioners thought it would be; clearly, data collection and data cleaning are not how they imagined they’d spend their working hours.

Data preparation is so time-consuming because it is so important. The adage – or more appropriately in this setting, admonition – “Garbage In, Garbage Out” very much applies to data preparation for machine learning, which, in extreme cases, can involve the entire lifecycle from data collection to data cleaning and feature engineering. Missteps at any point in this process will result in low-confidence model predictions, or even a model that simply fails to perform.

Beyond their importance, training data sets for machine learning algorithms are also voluminous – many millions of data items in the case of complex problem spaces – and much of the data prep work demands human involvement (although much of this work is often repetitive and requires only contextual training to perform).

Finally, data preprocessing usually involves a variety of technologies, both for doing the actual work of preparing the data and for managing quality in the context of volume. If the problem space is simple – say, structured data with duplicates, null values and some lack of standardization – the technology needn’t be complex. But complex problem spaces – say, identifying and tracking video objects with complex taxonomies – can require specialized technology, much of it open source, with very particular feature sets.
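
For the simple structured-data case described above, the preprocessing can be sketched in a few lines of pandas. The file name, column names, and standardization rules below are illustrative assumptions; real projects layer many more checks on top.

# Minimal structured-data cleanup sketch with pandas: de-duplication, null
# handling, and light standardization. File and column names are assumed.
import pandas as pd

df = pd.read_csv("raw_records.csv")

df = df.drop_duplicates()                           # remove exact duplicate rows
df = df.dropna(subset=["customer_id"])              # drop rows missing a required key
df["state"] = df["state"].str.strip().str.upper()   # standardize inconsistent text
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

df.to_csv("clean_records.csv", index=False)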

In recent years, numerous solution providers have sprung up to fill the human and technological gaps that data scientists confront in preparing quality training data at scale. Some offload the human labor side of data prep. Others provide technology for cleaning and labeling training data sets. And yet others provide both.

Data science teams would be well-advised to choose carefully when evaluating data preparation partners. DDD has “inherited” more than a few customers whose first vendor fell down, invariably on one or both of these two dimensions:

Something less than full lifecycle data prep. Recall that the Forbes survey indicated 60% of data scientists’ time is spent in data cleaning. Most of today’s data preparation vendors emphasize training data labeling and annotation. They presume that they will be given data that has already been cleaned.

Your data needs cleaning if, for example:

  • it has not been de-duped

  • it is missing information

  • it is inconsistently presented from different sources

  • it requires entity resolution

  • it is image data of uneven quality or perspective

  • it is handwritten and requires transcription

If any of the above applies, or if you don’t have data, or don’t have enough data, and need data collected or created, these vendors are not a good fit for you.

Reliance on a single technology platform. Every ML project is unique, with unique data. No single set of proprietary tools can possibly match up to every machine learning algorithm and training data set. Data science teams need to know that whoever is doing their data preparation is technology-platform agnostic, with the flexibility and freedom to choose the best tool(s) for the project at hand, rather than shoehorning data into an inappropriate tool or jerry-rigging a third-party tool onto their platform for the sake of the team’s project.



OCR is Always Evolving, Always Hot


By Aaron Bianchi
Apr 29, 2021

As a teenager in the 1970s I worked for an early Optical Character Recognition (OCR) company. They had an SUV-sized scanner in their computer room that digitized IBM Selectric double-spaced Pica text with about 80% accuracy and printed it to microfiche. I learned to program the DEC VAX that drove the scanner by typing octal instructions onto paper tape and then bootstrapping the tape reader. I also spent many hours in the proofreading pool comparing the microfiche output to the source data, the Manhattan White Pages, and logging corrections.

OCR has come a long way since then.

Today’s OCR is an application of computer vision that enables machines to find and extract text embedded in images. OCR projects are seeing explosive growth because of their potential for reductions in the cost of human labor and human mistakes and increases in productivity and security.

Real-world examples of OCR are legion:

  • Many autonomous device use cases demand an ability to read text in the form of signage, warnings, and surface-embedded instructions

  • Industries like real estate and financial services want to reduce or eliminate human involvement in digitizing business documents and other artifacts and electronically capturing the business-critical content therein

  • Likewise, many industries are seeking to eliminate the need for humans to interpret and process handwritten content like patient charts, whiteboard sessions and annotated text documents

  • Other examples include license plate recognition, menu digitization, language translation, and many more

OCR models are a subset of machine learning models, and more and more, deep learning OCR is data scientists’ preferred approach. The complexity and nuance of real-world OCR tasks give deep learning models an appreciable performance edge.

Deep learning models don’t train themselves. They, too, require training data, and feedback and refactoring, to achieve optimal outcomes. And in fact, their performance edge comes at a cost: deep learning OCR requires significantly more, often orders of magnitude more, training data than many other ML approaches.

OCR involves two steps, and OCR models must be trained in both. A trained model has to identify the location of salient text in an image, referred to as text detection, and it must perform text recognition, the extraction of text content.

The very large quantities required aside, OCR training data is produced in standard fashion. Human data labelers annotate input images, typically with bounding boxes or polygons, to localize text areas. The particular application may require that they separately label different text areas or indicate how text blocks are related.
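
A minimal sketch of what one such labeled example might look like is below. Pairing a bounding box (the detection target) with a transcription (the recognition target) is a common pattern, but the exact field names here are an illustrative assumption rather than a standard schema.

# Hypothetical structure of a single OCR training example. Field names are
# illustrative; each bbox is [x_min, y_min, x_max, y_max] in pixels.
ocr_example = {
    "image": "receipt_00421.png",
    "annotations": [
        {"bbox": [112, 48, 390, 84], "text": "TOTAL DUE"},
        {"bbox": [402, 48, 510, 84], "text": "$23.17"},
    ],
}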

Importantly, labeling and annotation is just the final step in training data preparation. Many data science teams work with data collections that include input images that are distorted, skewed, or inconsistently lit or sized. Yet other teams are confronted with very large quantities of paper that have not been digitized.
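
As one sketch of that curation step, the snippet below uses the open-source OpenCV library to grayscale, resize, and binarize a scanned page before labeling. The target width and the use of Otsu thresholding are illustrative assumptions, not a fixed recipe.

# Minimal image-curation sketch with OpenCV: grayscale, normalize size, and
# binarize a scanned page. Target width and thresholding choice are assumed.
import cv2

img = cv2.imread("scanned_page.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

target_width = 1654  # roughly 200 dpi across an A4 page width (assumed)
scale = target_width / gray.shape[1]
resized = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)

_, binary = cv2.threshold(resized, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("scanned_page_clean.png", binary)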

Training data partners that can supplement OCR training data labeling with a full complement of data curation and data creation services offer data science teams a significant leg up with regard to their OCR projects.



Announcing the Launch of Autonomous Fleet Ops

04 June, 2025 by Sahil Potnis, VP of Product & Partnerships

Detroit, MI, USA: Digital Divide Data (DDD) continues to expand its end-to-end data capabilities for Autonomous Systems across land, air, sea, and space. Our latest solution set is targeted at supporting Autonomous Fleet Operations, including Human-in-the-Loop (HiTL) data solutions for:

(A) Remote Teleoperations to enable full Autonomy

(B) Operational Data Intelligence to gather ODD exposure and mission intel insights

(C) Fleet Management functions surrounding the capabilities of mission command and control

(D) In-Cabin Monitoring to drive forward the safety of ADAS systems

“Remote Teleoperations as a Service” is growing rapidly across the globe as a way to augment the core autonomous capability of any system and to unlock SAE L3+ and L4 levels of autonomy. Similarly, Operational Data Intelligence is an essential part of fleet operations, aimed at deploying assets most effectively across multiple sites, whether for testing or data collection. In-Cabin Monitoring plays a critical role in directly supporting any autonomy company’s CONOPS for safer, more reliable operations.

DDD’s in-house expertise in these workflows, and its ability to stand up US (onshore) or offshore operations in as little as 10 days[1], is a critical, market-differentiating USP for advancing autonomy technology.

“DDD’s Fleet Operations solutions, coupled with data operations support services, give our clients the ability to deliver accelerated fleet deployment and management with controlled, scalable, and cost-effective outcomes,” says Sameer Raina, DDD CEO and President.

DDD is actively pursuing value-added technology partners to make its Fleet Operations ecosystem robust, scalable, and diverse. DDD’s acquisition of Liberty Source PBC in 2024 has supplemented the workforce with a direct, on-the-ground US presence, vital for unlocking low-latency, high-data-security workflows. DDD’s social impact mission, the operational excellence of its global workforce (US, Africa, Asia), deep subject-matter expertise in autonomy, and toolchain partnerships uniquely position the team to be an industry leader in providing such end-to-end autonomy HiTL data solutions.

[1] 10 business days is the average time to set up a pilot project for an autonomy-focused workflow. Not specific to the Fleet Operations capability.

