
How Data Labeling and Annotation Are Fueling Autonomous Driving’s Global Movement


By Abhilash Malluru
Feb 1, 2023

Autonomous driving is becoming more prevalent worldwide, and investors and developers alike are taking a growing interest in optimizing the technology through data labeling and annotation. With that growing interest comes an emerging need for experienced developers who can build the tools and processes necessary for driver behavior monitoring, self-parking, motion planning, and traffic mapping.

Growing acceptance of autonomous driving has led to several approaches to advancing data labeling, annotation, and other machine learning processes. As these become standardized and more widely accepted in the industry, it’s crucial to understand the difficulties and obstacles which might arise in deploying them to any autonomous driving development platform.

Data Labeling and Annotation Strategies for Autonomous Vehicle Applications

The standard methods regarding the implementation of data labeling and annotation are as follows:

  • Bounding Boxes

  • Semantic Segmentation

  • Polylines

  • Video Frame Annotation

  • Keypoints

  • Polygons

Bounding Boxes – Crucial for Robotaxis

2D bounding box annotation uses video or image annotation to identify and spatially place objects. It first maps items to develop datasets, then machine learning models use those datasets to localize objects. Depending on the method deployed, it can support various tags or text extraction for things like street signs.
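For illustration, a single 2D bounding box label is often stored as a simple record in the spirit of the COCO format; the field names and category ids below are hypothetical examples, not a fixed standard.

```python
# One illustrative 2D bounding box label (COCO-style; field names are examples only).
bbox_annotation = {
    "image_id": 184321,               # frame captured by the vehicle's camera
    "category_id": 3,                 # e.g., 1 = pedestrian, 2 = vehicle, 3 = street sign
    "bbox": [412, 118, 64, 64],       # [x, y, width, height] in pixels
    "attributes": {"text": "STOP"},   # optional tag or extracted text for signs
}
```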

This annotation technique is vital for an autonomous vehicle or robotaxi’s navigation. Navigation relies heavily on complex logic systems and requires additional inputs to differentiate objects for decision-making, meaning the vehicle needs large quantities of data and human input to operate effectively and safely.

Partnering with a firm that has extensive experience in this method, such as a reputable managed service model (MSM) provider, can help you implement and deploy a technique like bounding boxes. A managed service provider (MSP) has both a data annotation workforce and expert consultants who can guide your needs and pinpoint any difficulties or obstacles that might arise.

Semantic Segmentation to Identify Humans from Objects

Semantic segmentation is a technique that relies on a computer’s optical input to divide images into different components and label them pixel by pixel. This process is crucial for identifying different types of objects so that a system can make a decision. For example, semantic segmentation helps a system identify people in a crosswalk. It may not know how many, but the fact that people are crossing is enough to influence the decision-making process.
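As a rough sketch of what per-pixel labeling produces, the example below builds a tiny segmentation mask in which every pixel carries a class id; the class palette is purely illustrative.

```python
import numpy as np

# Illustrative class palette; real taxonomies are project-specific.
CLASSES = {0: "background", 1: "road", 2: "person", 3: "vehicle"}

# A tiny 4x6 "image" labeled pixel by pixel: each value is that pixel's class id.
mask = np.array([
    [0, 0, 2, 2, 0, 0],
    [1, 1, 2, 2, 1, 1],
    [1, 1, 1, 1, 3, 3],
    [1, 1, 1, 1, 3, 3],
])

# A downstream system only needs to know that "person" pixels exist
# in the crosswalk region to influence its decision-making.
people_present = bool((mask == 2).any())
print(people_present)  # True
```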

However, the most significant hurdle is that semantic segmentation is incredibly time-consuming, and this is where a dedicated team of SMEs from a third-party platform becomes invaluable. MSMs can support any organization seeking to implement semantic segmentation toolchains for this crucial process.

Since DDD’s workforce is trained in standard models and data annotation methods, they can help establish efficient and steady workflows while minimizing operational costs. These experts can handle such laborious tasks as semantic segmentation so you can place your focus elsewhere, ensuring you can complete other project needs before deliverables are due.

Polylines – Crucial for Overall Road System

This image annotation method enables the visualization and identification of lanes, including bicycle lanes, lane directions, diverging lanes, and oncoming traffic. Polylines require extensive data sets to be successfully labeled and deployed.

Polylines are crucial for autonomous driving as a means of lane detection. Accurate and consistent modeling allows for navigation and the avoidance of obstacles. Plus, models can be trained further so they better adhere to relevant traffic laws by detecting road markings and signs. MSMs can help offload some of the enormous overhead which goes into developing the toolchains necessary for polylines.
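In practice, a polyline label is just an ordered list of vertices traced along a lane marking. The sketch below shows one hypothetical lane annotation and draws it onto a blank image with Pillow; the field names are made up for illustration.

```python
from PIL import Image, ImageDraw

# Hypothetical polyline annotation for a single lane boundary:
# an ordered list of (x, y) vertices traced along the lane marking.
lane_annotation = {
    "label": "lane_boundary_dashed",
    "direction": "same_as_ego_vehicle",
    "points": [(120, 720), (260, 540), (380, 400), (470, 300)],
}

# Rasterize the polyline, e.g., for visual QA or to build a lane mask.
canvas = Image.new("L", (1280, 720), 0)
ImageDraw.Draw(canvas).line(lane_annotation["points"], fill=255, width=3)
```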

Video Frame Annotation – Necessary for Object Detection

Autonomous vehicles can use video annotation to identify, classify, and recognize objects and lanes. Video frame annotation works in conjunction with techniques like semantic segmentation and polylines to deliver more accurate object detection.

Video annotation is time-consuming, as it relies upon analyzing and labeling thousands of video frames. Whether your platform is leveraging video and image annotation for autonomous vehicles or robotaxis, partnering with a third-party service can drastically reduce the time needed to implement this form of data annotation.

Keypoints – Giving Robotaxis Adaptability

Data drives both autonomous vehicles and the development of the systems which guide them. Keypoints provide a frame of reference for objects that might change shape by leveraging multiple consecutive points.
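For illustration, a keypoint label for a pedestrian is often stored as named joints, each with an (x, y) position and a visibility flag, in the spirit of the COCO keypoint convention; the exact joints and ids below are assumptions.

```python
# Hypothetical keypoint annotation for one pedestrian.
# Each joint is (x, y, visibility): 0 = not labeled, 1 = labeled but occluded, 2 = visible.
pedestrian_keypoints = {
    "track_id": 17,  # the same person tracked across consecutive frames
    "keypoints": {
        "head":           (341, 102, 2),
        "left_shoulder":  (322, 150, 2),
        "right_shoulder": (360, 151, 2),
        "left_knee":      (330, 260, 1),
        "right_knee":     (355, 262, 2),
    },
}
```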

As with most techniques related to autonomous vehicles or robotaxis, this form of data annotation is a time-consuming and costly process. While much of the modeling behind a self-driving vehicle relies on artificial intelligence and machine learning, humans must still place the points on the data sets being labeled.

Nothing encountered on the road will remain static, doubly so for autonomous vehicles operating in metropolitan areas. For this type of data labeling, leveraging an organization with actionable domain experience, such as an MSM, can help develop streamlined methods and toolchains. Cost is dictated per hour or unit, and DDD’s staff brings extensive experience in standardized data labeling and annotation methods.

Polygons – Greater Precision for Visual Processing

Polygons operate like bounding boxes for visual data annotation. Irregular objects and accurate object detection greatly benefit from the implementation of polygonal data annotation. Polygonal annotation can have far greater precision than the bounding box method. When properly implemented, it helps detect things like obstructions, sidewalks, and the sides of the roads.
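To see why polygons are more precise than boxes, the sketch below rasterizes a hypothetical polygon annotation into a binary mask with Pillow and compares its area to the object’s axis-aligned bounding box; the vertices are invented for illustration.

```python
from PIL import Image, ImageDraw

# Hypothetical polygon annotation for an irregularly shaped obstruction.
polygon = [(210, 400), (260, 380), (300, 410), (290, 470), (230, 465)]

# Rasterize the polygon into a binary mask (non-zero inside the object).
mask = Image.new("1", (640, 480), 0)
ImageDraw.Draw(mask).polygon(polygon, outline=1, fill=1)

# Compare against the axis-aligned bounding box of the same object.
xs, ys = zip(*polygon)
box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
polygon_area = sum(1 for px in mask.getdata() if px)
print(polygon_area / box_area)  # < 1.0: the box would sweep in background the polygon excludes
```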

Polygonal annotation is a vital step in the autonomous driving model. Objects are very rarely uniform, so this method of annotation plays a crucial role in building effective and safe detection and recognition models. The challenge of integrating it into your workflow is that it is a time-consuming process; compared to methods like bounding box annotation, it requires even more resources and time to implement correctly. Engaging an MSM to help provide a platform can significantly reduce the time needed to add this to your autonomous driving toolchain, and leveraging a third-party resource with actionable, proven experience can easily lead to greater precision in your detection model.

Get Started With a Data Labeling Service

The past few years have made it abundantly clear that autonomous driving is here to stay, and leveraging another organization’s expertise frees up valuable resources and manpower that could be better spent on other aspects of project development. Plus, we can’t ignore the time it takes to invest in and develop these annotation methods.

So if you’re developing the technologies and models that power autonomous driving, it’s worth considering outsourcing at least some of the workflows to a third-party vendor. MSMs like Digital Divide Data (DDD) provide a platform to help you and your staff overcome some of the pitfalls of developing systems for autonomous driving.

Data labeling and data annotation alike are diverse and complicated fields of work. You can discuss your project needs and requirements with the DDD staff today. By partnering with us, you gain access to a developed platform that delivers exceptional results for your data labeling and annotation needs. Let’s discuss your project requirements today.



4 Advantages of Human-Powered Data Annotation vs Tools/Software


By Aaron Bianchi
Sep 20, 2022

“Check all the images that contain traffic lights.”

For some, these increasingly difficult CAPTCHAs are a source of endless frustration. But they give us something interesting to consider. If we prove that we are human by correctly identifying objects, how can a computer check our work? The answer lies in a domain of artificial intelligence called machine learning (ML).

Before CAPTCHA pictures get to you, data scientists train computers to recognize objects by providing lots of examples (training sets). If you’re wondering where those training sets come from, you’re asking the right question! They come from a process called data annotation or data labeling.

Then, a model is developed to recognize specific objects. If the model is good, the computer can use it to identify the same objects in new pictures.
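As a minimal sketch of that loop, using scikit-learn and made-up feature vectors rather than real CAPTCHA images, a model is fit on human-labeled examples and then asked to label new ones on its own:

```python
from sklearn.ensemble import RandomForestClassifier

# Tiny, made-up training set: each row is a feature vector extracted from an image,
# and each label records whether a human marked it as containing a traffic light.
X_train = [
    [0.91, 0.10, 0.33],
    [0.15, 0.80, 0.40],
    [0.88, 0.12, 0.30],
    [0.20, 0.75, 0.45],
]
y_train = ["traffic_light", "no_traffic_light", "traffic_light", "no_traffic_light"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# If the model is good, it can label new, unseen images on its own.
print(model.predict([[0.90, 0.11, 0.35]]))  # most likely ['traffic_light']
```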

Artificial intelligence can’t create working models without well-prepared training data sets. Garbage in, garbage out: this has always been the rule of thumb.

1. We Get the Big Picture

Imagine that you could talk to a computer to teach it new things. If you wanted to teach this computer to recognize a pest that is disrupting your crop yield, how might you approach this?

Chances are, you’d show it some pictures of pests you are interested in spotting and say, “Hey computer, look for these!”.

Machine learning works in the same way. Data annotation is like gathering the pictures you would like to show the computer and circling the important parts.

Unlike the computer, we understand the end goal of the model. We’ve likely defined, or at least have an understanding of its use case. As humans, understanding how the entire process works gives us an advantage when developing a data annotation strategy.

For instance, you can use your judgment to pick out a picture that wouldn’t be the best to include in the set. In this way, you’re telling the computer, “This isn’t a great example; let’s move on to a different one.”

This type of human logic is what artificial intelligence cannot yet replicate. Understanding what the data means gives humans greater flexibility and produces more substantial outcomes than fully automated training set preparation.

2. We are Natural Language Processors

Natural Language Processing, or NLP, is the branch of artificial intelligence working to make computers understand human speech. We interact with NLP almost every day through “smart” devices.

“Hey Alexa, tell me more about Natural Language Processing.”

Like other areas of machine learning, NLP requires large training data sets. One type of data set consists of transcribed audio to train AI to turn speech into text. Another data set contains large amounts of text with annotations to highlight specific areas.

Both need humans to curate and pre-process the data before moving forward. As humans, we have an obvious advantage: we create and use language constantly. Human-powered data annotation for NLP is a great way to optimize model development.
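For illustration, one record from each kind of data set might look like the hypothetical examples below: a speech-to-text pair and a text snippet annotated with character-offset entity spans.

```python
# Hypothetical speech-to-text training record: an audio clip paired with a human transcript.
speech_record = {
    "audio_file": "clip_00421.wav",
    "transcript": "hey alexa tell me more about natural language processing",
}

# Hypothetical annotated-text record: character offsets highlight spans of interest.
text_record = {
    "text": "Ship the order to 42 Elm Street, Nairobi by Friday.",
    "annotations": [
        {"start": 18, "end": 31, "label": "ADDRESS"},
        {"start": 33, "end": 40, "label": "CITY"},
        {"start": 44, "end": 50, "label": "DATE"},
    ],
}
```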

The applications of NLP are endless. Sentiment analysis helps companies mine affective states or moods from customer messages/feedback. NLP can break down language barriers in unprecedented ways. This means people can communicate about weather patterns or pest attacks in real-time using different languages!

3. The Promise of Innovation

With so many advances in artificial intelligence and machine learning, we can be sure that our work is only getting started. AI won’t innovate itself, and researchers in computer science are the ones moving the field forward.

Of course, thinking about the importance of humans in the data preparation process does not diminish the role of technology—new software solutions to machine learning enter the market daily. Human innovation is needed to translate theoretical advances into practice.

An essential part of assembling a data annotation strategy is determining which tools to use and when to use them. Seasoned professionals draw on that experience to select the right tools for specific situations.

With so much raw data available in the agricultural tech industry, companies realize that the best solution is often a combination of software. Check out how machine learning has use cases across industries.

4. Data Annotation Professionals See the Process Through

Data can be messy. And let’s be honest: humans can be messy too! In the case of machine learning, this shared characteristic works to our advantage.

We need workers to clean data, address inconsistencies, and format data in a way that works for training AI. We use the term “data wrangling” to describe this process. Although “wrangling” may seem like a harsh term, it captures the actual amount of effort needed to prep data before use.

Part of the benefit of using a data annotation provider is that they can help you through the entire process. This includes:

  • data creation or collection

  • data cleaning and curation

  • data labeling or annotation

Consider using artificial intelligence to detect potential disease in a large field of crops by periodically analyzing photos of them. This is likely a massive undertaking for an organization. First, you need enough data to compile a training data set.

Once you’ve created a clean training data set for supervised learning, the story isn’t over.

Human intervention is needed to assess how well the AI can correctly identify diseased crops in the future. In situations where the machine cannot perform accurately, people need to determine the parameters of a new training set. Then, the process repeats, once again under human supervision.

Harness the Power of Data Annotation

With machine learning driving global industries forward, organizations need access to high-quality training sets. Many organizations don’t have the in-house resources to handle data annotation at scale.

Fortunately, Digital Divide Data offers across-the-board support to get companies to the finish line, no matter where they start. As a non-profit organization, DDD is challenging the industry’s status quo with impact sourcing, youth outreach, and more.

To get started, see how DDD’s suite of fully managed services (CV, NLP, Data and Content) can exceed your expectations.





Everyday Applications You Didn’t Realize Were Powered by NLP


By Aaron Bianchi
Feb 23, 2022

We live in an era of sophisticated algorithms, Big Data, and machine learning that gets better by the day. Businesses recognize the importance of data processing, artificial intelligence (AI), and natural language processing (NLP) for growth. Here are some ways you may already be using NLP in your daily life that could inspire ideas for your company.

What is Natural Language Processing?

NLP is essentially AI that deals with understanding human language. Advanced language sets us apart from other animals on the planet, and communication is integral to our societies. So, as tools, computers were always going to have to develop to a point where they could decipher natural language patterns full of nuance. With the help of programmers and data scientists, machines are constantly refining their ability to comprehend subtleties and create meaning.

NLP Works in Three Fundamental Steps

  1. Break down a spoken sample or written language input into parts or categories.

  2. Discern how these pieces of information are linked.

  3. Produce meaning.
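A rough illustration of these three steps, using the open-source spaCy library and assuming its small English model en_core_web_sm has been downloaded, might look like this:

```python
import spacy

# Step 1: break the input into parts (tokens, parts of speech, entities).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Book me a flight to Nairobi next Friday.")
for token in doc:
    print(token.text, token.pos_, token.dep_)  # word, part of speech, grammatical role

# Step 2: discern how the pieces are linked; the dependency parse relates
# "flight" to "Book" and "Nairobi" to the preposition "to".

# Step 3: produce meaning, e.g., pull out the entities the application cares about.
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g., [('Nairobi', 'GPE'), ('next Friday', 'DATE')]
```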

The software detects context, emotion, and sentiment through exposure to lots of data. This learning from enormous datasets is known as deep learning. Helped by developments in so-called neural networks, layered models that loosely imitate the neurons in your brain, deep learning only came to the fore in the 2010s. But it’s had a massive impact since then.

Using accumulated knowledge of word sequence and other factors, AI can interpret whether your use of bass refers to a fish or a guitar, for example.

NLP Applications You May Be Familiar With

Search Engines

Just Google it… When you Google something, the search engine offers you autocomplete suggestions. NLP facilitates these predictions by using search data to determine your intent and hasten the process. NLP also tries to overcome any spelling or other errors on your part and assembles relevant content in search engine result pages (SERPs) by matching your query to ideal web pages. In addition, semantic search can enhance digital marketing and SEO capabilities.

Virtual Assistants

“Siri, what is a virtual assistant?” If you’re like most people, you talk to virtual assistants like Siri or Alexa, and even to automated call centers. Who wants to press numbers as options when you can state exactly what you want or are searching for? Do they sound monotonous or robotic, or are they unable to follow commands? In general, the answer is no, even though the tech has some way to go before consumer interactions become seamless. NLP divides your voice’s frequencies and soundwaves into small units of data ready for further analysis. Speech recognition and voice recognition are two substantial aspects of NLP that will be major features of the online landscape in years to come.

Email and Document Assistants

“Great, thanks!” “Thank you.” “Got it.” Look familiar? Think about your smartphone keyboard and predictive texts that help you type faster, for starters. Consider, too, Outlook or Gmail’s Smart Reply functions.

You’ve likely worked with auto-complete functionality. Or you’ve used the grammar check browser extensions that abound on the internet, helping you craft professional messages or documents in the country-specific version of a language. Furthermore, your inbox can separate emails into various folders such as junk or promotional mail due to NLP.

Chatbots

“How may I help you today?” Chatbots, the text-based equivalent of voice assistants, have become popular and can fulfill basic requests such as booking flights or answering most customers’ simple questions. You might have come across one on an eCommerce store, during product demos, or on educational apps.

Customers often prefer texting or chatting with real people when the stakes are higher or when their needs are more complex. But as NLP improves, chatbots will become more fit for purpose.

Translation and Transcription Tools 

“How do you say that in Spanish?” These tools perform the seemingly simple task of converting an input language into an output language or materializing spoken words on the screen. But there’s word order to manage, not to mention linguistic idiosyncrasies.

These days, you can point your phone camera at an object with a foreign language on it, and standard augmented reality apps on your phone superimpose a translation for you. The ingredients in products from overseas are no longer a mystery, and any included instructions should be understandable.

Life-Changing Use Cases

There are already numerous examples of NLP bridging information and communication divides. Imagine an app that can translate sign language or serve non-verbal individuals with disabilities. NLP doesn’t just help us interact more efficiently with computers; it also opens up new and promising avenues with other people.

NLP Applications In The Future

On-demand TV streaming existed only in theory once, but steadily rising computing power and lower costs turned vision into reality. The same is true for our ideas about robots or internet of things (IoT) gadgets that can talk to us in a less stilted manner than we’ve come to expect.

Soon, home and work life might rely on integrated virtual assistants as much as they rely on video calls, GPS, or online shopping. Research firm Gartner suggests that by 2025 about half of all knowledge workers will interact with a virtual assistant every day. And the worldwide conversational AI market is projected to grow to $15.7 billion by 2024.

NLP can play a role in these industries, among others:

  • Banking

  • Healthcare

  • Media

  • Manufacturing

  • Retail

Currently, the automotive industry is testing voice biometrics so drivers can access info such as navigation history. And self-driving cars will require advanced NLP. Thanks to human innovation, NLP’s applications are endless.

Partner With Digital Divide Data 

Digital Divide Data partners with Fortune 500 companies and world-class institutions, and can help you optimally sort through and organize your datasets. Using NLP, we can home in on pertinent information in CVs to structure your training data. We hold ourselves to the highest standards and provide an end-to-end data service customized to your needs. Reach out for more information and to find out how we can strengthen your operations and brand.



Why Data Annotation Software Still Needs a Human Touch


By Aaron Bianchi
Feb 3, 2022

Artificial Intelligence (AI) is growing in popularity as a tool to provide everything from better customer care to translation services, driverless cars, smart technology, and more. Consisting of several different technologies that work together to deliver the end result, AI is computer-based programming that mimics human behavior.

Although AI has advanced enormously over the past decade, involving humans in its development is still essential if premium results are required.

Here we take a look at how AI is trained using test data and how human-powered data annotation and data labeling adds significant value to the outcomes that AI delivers. 

What is Data Annotation Software?

Data annotation software is purpose-built to annotate production-grade training data. AI isn’t created in a fully formed state. To provide a human-like response to data, AI has to “learn”. For example, when an AI system picks up an image of a tree, it doesn’t know that it’s an image of a tree. The ability to recognize that a particular configuration of pixels is a tree is only obtained after the AI has had access to millions of tree images.

The process by which the AI learns to recognize a tree (as an example) is known as machine learning (ML). For effective machine learning to take place, the AI needs access to a large volume of training datasets – data that can be used to help develop the algorithms (mathematical models) needed to develop a human-like response. Using the data, AI can develop a prediction model on the basis of its learning. 

For example, if an AI program has been given access to millions of tree images, it can use mathematical modeling to build a picture of what arrangement of pixels, statistically speaking, is most likely to be a tree. With this information, when the AI is given access to another tree picture, it can assess the probability of it being a tree and label it accordingly. Obviously, AI is capable of interpreting millions (if not billions) of different pieces of data, but to do so accurately, it needs access to enormous amounts of test data that provides the material needed to create accurate algorithms (mathematical models).
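A toy sketch of that last step, using softmax probabilities over made-up model scores and an arbitrary confidence threshold, could look like this:

```python
import numpy as np

def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

classes = ["tree", "lamp post", "person"]
raw_scores = np.array([4.1, 1.3, 0.2])  # made-up model outputs for one new image

probs = softmax(raw_scores)
best = int(np.argmax(probs))

# Label the image only if the model is confident enough; the 0.8 threshold is arbitrary.
label = classes[best] if probs[best] >= 0.8 else "needs human review"
print(label, round(float(probs[best]), 3))  # tree 0.925
```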

To assist in the process, the test data needs to be annotated: labeled in such a way that AI can interpret it effectively. Developing a high-quality training dataset depends on many things, and you can use platform providers or managed services with specialists to get there. In the context of recognizing a tree, for example, data annotation might be used to enable the AI machine to interpret the data you’ve provided as a tree.

Due to the enormous volume of training data, or training datasets, needed for successful machine learning, data annotation software has been developed to try to reduce the time needed for annotation to take place. Data annotation software does make machine learning faster, but it also has some significant drawbacks, some of which are highlighted below.

What are the Limitations of Data Annotation Software?

  • Exceptions. Every set of data is likely to have exceptions – outliers that are likely to confound the boundaries set up as part of the algorithmic modeling that AI completes. If the data annotation software can’t recognize these outliers and label them correctly (which is likely if the data doesn’t conform to the usual parameters), this limits the level of machine learning that can take place.

  • Limited annotation labeling. Particularly when diverse data is being deployed, the software may not be able to cope with the large variety of labels that are needed for effective machine learning.

  • Quality control. Data annotation software is usually equipped with features that identify where there are quality control issues. Unfortunately, the issues identified are those that are beyond the capability of the annotation software to resolve. Without additional input, those quality issues will remain.

  • Limited sorting. Data annotation software can play a valuable role in sorting data, and flagging data that it can’t easily sort and label. Unfortunately, the software can’t correct the issues it flags – which is where human intervention comes in.

What Role do Humans Play in Data Annotation Software?

Humans can resolve issues with test data that data annotation software can’t. Although the goal of machine learning is to create AI that can “think” in the same way as a human (but without the risk of human error), it’s still not as advanced as the human brain. Particularly when it comes to judgments that involve subjectivity, an understanding of intent is vital to getting the best results. For example, without the benefit of understanding intent, a surgeon clutching a scalpel could be considered interchangeable with a knife-wielding criminal.

What are the Advantages That Humans Bring to Data Annotation Software?

The advantages that humans bring to data annotation software mainly relate to our ability to process data that falls outside the machine-learned parameters. 

Humans are essential when it comes to developing the training datasets that can’t be successfully cataloged by the annotation software. More sophisticated decision-making, particularly that which is based on subjective criteria, needs human input.

When annotation software presents a quality control issue, it’s humans that are required to decide on a suitable course of action.

Similarly, diverse, complex data will need human intervention for it to be correctly labeled so that machine learning can take place effectively.

Why are Optimal Results Dependent on Human Input?

Ultimately, AI algorithms are only as good as their test data. The higher the caliber of the datasets (including accurate, clear labeling), the more effective the AI is going to be in meeting its outcomes. 

As humans are the ones who direct machine learning, their input is essential for the process to deliver optimal outcomes.



Natural Language Processing Is Impossible Without Humans


By Aaron Bianchi
Jan 15, 2022

Computer vision dominates the popular imagination. Use cases like driverless cars, facial recognition, and drone deliveries – machines navigating the three-dimensional world – are compelling and easy to grasp, even if the technology behind these use cases is not well understood.

But in reality, the holy grail of AI is natural language processing (NLP). Teaching machines to accurately and reliably understand and generate human language ushers in a revolution with boundaries that are hard to envision. 

In theory, machines can be perfect listeners, which unlike humans never get bored or distracted. They also can consume and respond to content far, far faster than any human, at any time of day or night. The implications of these capabilities are staggering.  

This assumes, of course, that we really can teach algorithms to understand what they are “hearing” and build into them the judgment required to communicate on our behalf. And that is what makes NLP such an elusive holy grail: because doing that is so hard on so many levels. Sure, helping machines to make sense of two- and three-dimensional images is an enormous challenge, and headlines describing autonomous vehicle crashes and facial recognition mistakes hint at the complexity of CV. But human language is orders of magnitude more complex. 

Five ways that humans struggle with our own natural language processing:

  • You misinterpret sarcasm in a text message

  • You hear a pun and you don’t get it

  • You overhear a conversation between experts and get lost in their specialized vocabulary

  • You struggle to understand accented speech 

  • You yearn for context when you come up against semantic, syntactic, or verbal ambiguity (“He painted himself,” or “What a waste/waist!”)

Obviously, processing and interpreting language can be a challenge even for humans, and language is our principal form of communication. Language is complex, and chock full of ambiguity and nuance. We begin to process language in the womb and spend our whole lives getting better at it. And we still make mistakes all the time. 

Ways that humans and machines struggle with each other’s natural language processing:

  • Comprehending not just content, but also context

  • Processing language in the context of personal vocabularies and modes of speech

  • Seeing beyond content to intent and sentiment

  • Detecting and adjusting for errors in spoken or written content

  • Interpreting dialects, accents, and regionalisms

  • Understanding humor, sarcasm, misdirection

  • Keeping up with usage and word evolution and slang

  • Mastering specialized vocabularies

These challenges have not deterred NLP pioneers, and NLP remains an extremely fast-growing sector of machine learning. These pioneers have made great progress with use cases like: 

  • Document classification – building models that assign content-driven labels and categories to documents to assist in document search and management

  • Named entity recognition – constructing and training models that identify particular categories of content in text so as to understand the text’s purpose

  • Chat bots – replacing human operators with models that can ascertain a customer’s problem and direct them to the right resource
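To make the chatbot use case concrete, a single human-labeled training record for intent routing might look like the hypothetical example below; the intent names and routing targets are illustrative, not from any particular product.

```python
# Hypothetical labeled utterance for a customer-support chatbot.
# A human annotator assigns the intent and the resource the bot should route to.
training_record = {
    "utterance": "I was double charged for last month's invoice",
    "intent": "billing_dispute",
    "entities": [{"text": "last month's invoice", "label": "INVOICE_PERIOD"}],
    "route_to": "billing_team",
}
```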

Of course, even these NLP applications are complex, and the pioneers have taken away three lessons that anyone interested in NLP should heed:

  1. Algorithms require enormous volumes of labeled and annotated training data. The complexity and nuance of language processing means that much of what we think of as natural language is full of edge cases. And as we all know, training algorithms on edge cases can demand many orders of magnitude more training data than the routine. Because algorithms have not yet overcome the barriers to machine/human communication outlined above, training data must come from humans.

    Only humans can label and annotate text and speech data in ways that highlight nuance and context. 

  2. Relying on commercial and open-source NLP training data is a dead end. Getting your model to the confidence levels you need demands training data that matches your specific context, industry, use case, vocabulary, and region. 

    The hard lesson that the pioneers learned is that NLP invariably demands custom-labeled datasets. 

  3. The humans who prepare your datasets must be qualified. If you are dealing with a healthcare use case, your human specialists must have fluency with medical terminology and processes. If the audience for your application is global, the training data cannot be prepared by specialists in a single geography. If the model will encounter slang and idiomatic content, the specialists must be able to label your training data appropriately.

Given the volume of training data NLP requires and the complexity and nuance that surrounds these models, look for a data labeling partner with a sizable, diverse, distributed workforce of labeling specialists. 



Data Bias: AI’s Ticking Time Bomb


By Aaron Bianchi
Dec 8, 2021

We’ve all seen the headlines. It’s big news when an AI system fails or backfires, and it’s an awful black eye for the organization the headlines point to.

Most of the time these headlines can be traced back to issues with the AI model’s training data. Bias in training data can take a variety of forms, all of which create the potential for leaving the algorithm under- or mis-trained.

In our discussions with clients, we alert them to three data preparation mistakes or oversights that can produce bias:

Failing to ensure the data measuring instrument is accurate. Distortion of the entire data set can result from bad measurement or collection techniques. When this occurs, the bias tends to be consistent and in a particular direction. The danger here is that the production model is out of sync with the reality it is designed to react to.

Bad measurement can take lots of forms. Low quality speech samples, with noise and missing frequencies, can affect a model’s ability to process speech in real time. A drone with an inaccurate GPS system or misaligned altimeter will provide image-based training data that has systematically distorted image metadata. Poorly designed survey and interview instruments can consistently distort responses.

Failing to accurately capture the universe in the data. Sample bias occurs when the training data set is not representative of the larger space the algorithm is intended to operate in. A non-representative training data set will teach the algorithm that the problem space is different than it is.

A classic example of sample bias involved an attempt to teach an algorithm to distinguish dogs from wolves. The training data’s wolf images were overwhelmingly in snowy settings, which led the algorithm to conclude that every picture with snow in it contained a wolf.

But as amusing as this is, sample bias can have serious consequences. Facial recognition algorithms trained on disproportionately Caucasian images misidentify African American and Asian faces at far higher rates. Autonomous vehicles that crash into gray trailers on overcast days likely have had too little exposure to this scenario.
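One simple sanity check for this kind of sample bias is to inspect how labels co-occur with context before any training happens. The sketch below uses made-up metadata and is only a starting point, not a complete bias audit.

```python
from collections import Counter

# Made-up metadata for a training set of animal images.
training_examples = [
    {"label": "wolf", "background": "snow"},
    {"label": "wolf", "background": "snow"},
    {"label": "wolf", "background": "forest"},
    {"label": "dog",  "background": "grass"},
    {"label": "dog",  "background": "indoor"},
    {"label": "dog",  "background": "snow"},
]

# Count how often each (label, background) pair occurs.
distribution = Counter((ex["label"], ex["background"]) for ex in training_examples)
for pair, count in sorted(distribution.items()):
    print(pair, count)

# If nearly every "wolf" example sits on a snowy background, the model may learn
# "snow" rather than "wolf", which is exactly the sample bias described above.
```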

Failing to eliminate social or cultural influences from the data. Cultural bias happens when human prejudices or regional idiosyncrasies make their way into AI training data.

As an example, in the UK athletic shoes are often referred to as pumps. In the US pumps are heeled women’s shoes. Use a UK-based team to label shoe images to train an algorithm targeting US shoppers and you end up with a model that may offer the wrong shoes due to cultural bias.

Cultural bias can be more insidious, however. Randomly sample images of airline pilots and you will end up with a data set that is almost entirely male. However, it would be wrong for all kinds of reasons to have the algorithm you are training conclude that airline pilot and male are causally related.

There are well-understood approaches to data gathering and data sampling that avoid all these forms of bias. Unfortunately, these approaches are labor- and time-intensive and data science teams often lack members with the skills or bandwidth to address them.

Our clients generally want to offload the management of training data bias, and they are delighted to learn that mitigating this bias is a core DDD capability. We fully understand the forms that bias takes, and the sources of those biases. We know how to create or collect data sets that are free of bias. And if you already have the data you need, we can offload the management of data bias in your training data sets.



Using Aerial Imagery as Training Data


By Aaron Bianchi
Aug 6, 2021

Numerous industries use satellite and aerial imagery to apply machine learning to business and social problem sets. This is a particular strength for DDD given our experience in geospatial and aerial use cases in insurance, transportation, meteorology, environmental protection, agriculture, law enforcement, national security, remote delivery, and traffic management.

This experience has taught us a great deal about the challenges and pitfalls associated with aerial image segmentation. We aired a webinar on this subject, and you can view the recording on-demand. Our goal is to deliver a hands-on guide to overcoming these challenges.

  1. Price of failure. Consider the cost of inadequately or incorrectly training an algorithm to evaluate geospatial or aerial images. Say the project is agricultural. Can you imagine the impact of incorrectly identifying crop disease or inadequate irrigation? Or say the project is military. What are the potential costs of misidentifying an elementary school as an army barracks? You need to factor the expected cost of this kind of failure into your DIY-vs-outsource equation.

  2. Workforce. Aerial and geospatial images are often very large and very detailed, meaning that a large workforce of labelers is required to generate sufficient volumes of training data in a timely fashion. In our experience, most data science teams don’t have access to an in-house workforce big enough to meet their training data demands. This lack of a workforce is one of the principal drivers of seeking a training data partner.

  3. Data volumes. Keep in mind that you may be able to support in-house data preparation for an initial, simple use case, but in your quest for greater levels of model confidence, you will have to train your algorithm on additional use cases, and eventually edge cases. You may be able to generate enough data in-house to train an algorithm to land a delivery drone on a simple graphical marker, but what does it take to distinguish between a leaf on the marker and a three-year-old child? Each additional use case requires at least as much training data as the first one, and rarely-occurring edge cases may require significantly more data. This dramatically compounds your workforce requirements, a discovery that many data science teams make late in their projects when budgets are dwindling and deadlines are imminent.

  4. Process and tools. Extremely high-resolution images are far too large to assign to a single labeler. But breaking up images, assigning them to multiple labelers, and then reassembling everything coherently introduces issues around worker consistency and process management. Do you have the wherewithal to train consistency into your own workforce? Do you have the technology and process required to track changes to very high numbers of image fragments? Most data science teams don’t. (A rough sketch of the tiling step appears after this list.)

  5. Specialization. Are you confident that you can define the most efficient tasks required to label your training data? We had a client who wanted us to label every individual tree in enormous hi-res forest images. As it happened, they weren’t interested in tree density; rather, they were trying to detect illegal land clearing. Because we have been preparing training data for decades, we were able to show them a different approach to labeling their images that appropriately trained their algorithm, but at a fraction of the time and cost of their approach.

  6. Focus. Preparing training data for aerial and geospatial systems involves the application of human judgment to nuanced, and sometimes hard to decipher, images. Our own data shows that the longer individuals spend on a particular kind of interpretation, the faster and more accurately they do the work. Data science teams that crowdsource their aerial segmentation work do not capture these workforce efficiencies. DDD assigns you a team that stays with you throughout the span of your project, meaning that you capture all the benefits of growing worker efficiency and your effective cost per transaction steadily declines.
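As promised in item 4, here is a minimal sketch of the tiling step: splitting one very large image into fixed-size, overlapping patches that can be assigned to different labelers and later stitched back together using their offsets. The tile size, overlap, and file name are arbitrary examples.

```python
from PIL import Image

def tile_image(path, tile=1024, overlap=64):
    """Split a very large image into overlapping tiles, keyed by their (left, top) offset."""
    image = Image.open(path)
    width, height = image.size
    step = tile - overlap
    tiles = {}
    for top in range(0, height, step):
        for left in range(0, width, step):
            box = (left, top, min(left + tile, width), min(top + tile, height))
            tiles[(left, top)] = image.crop(box)
    return tiles

# Each tile can go to a different labeler; keeping the (left, top) offsets lets you
# translate their annotations back into the full image's coordinate system.
# tiles = tile_image("forest_survey_000123.tif")  # hypothetical file name
```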



Five Key Criteria to Consider When Evaluating a Data Labeling Partner


By Aaron Bianchi
Jul 14, 2021

Machine learning (ML) and AI have dramatically changed the way many businesses across the globe work. As ML and AI continue to evolve, one of the biggest challenges is to ensure the quality of the data utilized by your systems.

For machine learning to work, your system needs properly labeled data. Without it, your ML model may not recognize patterns, which it needs to make decisions or perform its functions.

This is one reason data scientists and corporations worldwide work with data labeling partners or invest in data labeling tools.

Are you currently looking for a data labeling partner? Before getting started on your search, you must first understand what data labeling is.

What is Data Labeling?

Data labeling is an essential part of ML, particularly Supervised Learning, a common type of ML used today.

Data labeling identifies raw data such as text files, images, and videos and adds context to it. Once the data has been labeled, it becomes the learning foundation of your ML model for all data processing activities.

As your ML model relies heavily on data labeling, make sure you’re working with a data labeling partner that isn’t just reliable; your partner should also have sufficient data labeling experience in your industry.

How to Choose a Data Labeling Partner

There are many ways to find professionals to perform data labeling for you. The most popular is working with a data labeling company or contractor.

Essentially, these service providers become an extension of your team. They manage all your data and typically charge by output volume.

Why should you work with a data labeling company? One of its benefits is that it’s more cost-effective than investing in data labeling tools and spending on human resources. Secondly, working with a data labeling service provider ensures the work is done right. When your team doesn’t have enough knowledge and experience with data labeling, you’ll need to give them time to learn it. Additionally, you’ll have to provide more time for them to finish the work, which isn’t an efficient use of your company’s resources.

When choosing a data labeling partner, don’t forget to take the following steps. These will help you find the best provider and make your search more efficient.

  1. Define Your Goals
    Setting goals and expectations is crucial, especially when working with professionals outside of your organization. Remember, they will be working on your data. Therefore, they should have a clear understanding of what you expect from them and the service required of them.

    It would help to have the following information from the beginning:
    • Project overview
    • Timeline
    • Data volume
    • Data quality guidelines or overview

  2. Set a Budget
    Once you’ve prepared all the information, the next step is to decide on a budget.

    Every service provider is different, and each will have different rates. Having a budget makes it easier to create a shortlist of candidates, especially when most of your chosen candidates provide similar proposals or offers.

  3. Create a List of Candidates

    Now that you have your budget and project details on hand, the actual search begins!

    Don’t be in a rush to find the “one” for your company. Instead, take your time evaluating multiple service providers. Do your background research, look for customer reviews, and find out their overall standing in the industry.

  4. Ask for Proof of Concept

    Provide a sample task that is quite similar to your project and evaluate how each candidate would deliver the output. This is an easy way to identify a service provider’s skills, experience, and reliability. Additionally, a proof of concept could help you determine any possible roadblocks you may encounter once your project starts.

Criteria for Evaluating a Data Labeling Partner

With thousands of companies offering data labeling services, it could be challenging to assess everyone on your list.

The best way to evaluate your candidates is to set some criteria. Here are five you may use when choosing a partner.

Data Quality
Keep in mind that your ML or AI model would only be as good as the quality of data you provide. Because of this, checking for data quality is of utmost importance when looking for a data labeling service provider.

Tip: Don’t forget to talk to your candidates about their quality control measures.

Technology
Another benefit of outsourcing data labeling is that you can access tools and technology that your company may not otherwise afford.

Ask your vendors which tools and technology they would use for your project. Their tools should help you maximize your time, resources, and efficiency — all while providing quality data.

Workforce
Sure, a service provider may already work with multiple clients… but that doesn’t mean they’re suitable for your project. Make sure their staff knows how to handle the type and volume of data you have. This would help get things going smoothly and with minimal supervision from your end.

Security
Confidentiality and data security are crucial when it comes to outsourcing this type of work. You wouldn’t want to worry about data leaks and hacks, would you? Inquire about the company’s security protocols and process of handling sensitive data.

Social proof
When possible, ask for a list of (past or present) clients. Then, get in touch with them to ask for their feedback on the provider. You may also consider looking into case studies that they’ve done, which would give you a good idea of the quality of their work and processes.

Finding the right data labeling partner for your company doesn’t always have to be complicated. With this guide, you could get started on your search and make sound decisions.

Do you want to learn more about data labeling and how Digital Divide Data could help? Fill out our contact form, and we’d be happy to learn more about your needs and walk you through our process.


ML Data Preparation Demands a Big Toolbox

By Aaron Bianchi
May 11, 2021

Part of the challenge of building machine learning models is that no two are the same. Train the same machine learning algorithm against different sets of data, and you end up with a different model.

If the quality of the raw data is high and the training data sampling is done well, the models shouldn’t vary a lot… but those are all big “ifs.” That is why data preprocessing, the actual data preparation process, is critically important.

A Forbes survey revealed that data scientists spend nearly 80% of their time on data prep, a quarter of that on data collection and the other three quarters on data cleaning. Other survey results indicate that real-world data science isn’t everything these practitioners thought it would be; clearly, data collection and data cleaning are not how they imagined they’d spend their working hours.

Data preparation is so time consuming because it is so important. The adage (or, more appropriately in this setting, admonition) “Garbage In, Garbage Out” very much applies to data preparation for machine learning, which, in extreme cases, can involve the entire lifecycle from data collection to data cleaning and feature engineering. Missteps at any point in this process will result in low-confidence model predictions, or even a model that simply fails to perform.

Beyond their importance, training data sets for machine learning algorithms are also voluminous – many millions of data items in the case of complex problem spaces – and much of the data prep work demands human involvement (although much of this work is often repetitive and requires only contextual training to perform).

Finally, data preprocessing usually involves a variety of technologies, both for doing the actual work of preparing the data and for managing quality in the context of volume. If the problem space is simple – say, structured data with duplicates, null values and some lack of standardization – the technology needn’t be complex. But complex problem spaces – say, identifying and tracking video objects with complex taxonomies – can require specialized technology, much of it open source, with very particular feature sets.

In recent years, numerous solution providers have sprung up to fill the human and technological gaps that data scientists confront in preparing quality training data at scale. Some offload the human labor side of data prep. Others provide technology for cleaning and labeling training data sets. And some provide both.

Data science teams would be well-advised to choose carefully when evaluating data preparation partners. DDD has “inherited” more than a few customers whose first vendor fell down, invariably on one or both of these two dimensions:

Something less than full lifecycle data prep. Recall that the Forbes survey indicated 60% of data scientists’ time is spent in data cleaning. Most of today’s data preparation vendors emphasize training data labeling and annotation. They presume that they will be given data that has already been cleaned.

If your data needs cleaning, i.e.,

  • it has not been de-duped

  • it is missing information

  • it is inconsistently presented from different sources

  • it requires entity resolution

  • it is image data of uneven quality or perspective

  • it is handwritten and requires transcription

or if you don’t have data, or don’t have enough data, and need it collected or created, then these vendors are not a good fit for you.
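For a sense of what the simplest of these cleaning tasks involve, here is a hedged pandas sketch covering de-duplication, missing values, and inconsistent presentation; the column names and rules are made-up examples, and harder items such as entity resolution or handwriting transcription require far more than this.

```python
import pandas as pd

# Made-up raw records gathered from two different sources.
raw = pd.DataFrame({
    "customer": ["Acme Corp", "acme corp ", "Blue Sky LLC", None],
    "region": ["US-East", "US-East", "us east", "US-West"],
    "signup_date": ["2021-03-04", "2021-03-04", "2021-05-19", "2021-06-01"],
})

clean = (
    raw
    .dropna(subset=["customer"])                                   # drop rows missing key information
    .assign(
        customer=lambda d: d["customer"].str.strip().str.lower(),  # standardize presentation
        region=lambda d: d["region"].str.replace("-", " ").str.lower(),
        signup_date=lambda d: pd.to_datetime(d["signup_date"]),
    )
    .drop_duplicates(subset=["customer", "signup_date"])           # de-dupe across sources
)
print(clean)
```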

Reliance on a single technology platform. Every ML project is unique, with unique data. No single set of proprietary tools can possibly match up to every machine learning algorithm and training data set. Data science teams need to know that whoever is doing their data preparation is technology platform agnostic. They need to know that whoever is doing their data preparation has the flexibility and freedom to choose the best tool(s) for the project at hand and is not trying to shoehorn data into an inappropriate tool or trying to jerry-rig a third-party tool onto their platform for the sake of the team’s project.

