How Good is Your Unlabeled Data?

Data scientists seeking AI solutions to real-world problems have learned some hard lessons about the importance of quality training data. In our experience, though, there are still lessons to be learned about getting data ready for labeling and annotation.

If your AI project uses supervised learning to train a machine learning algorithm, you likely need a significant amount of training data to achieve your goals. These training datasets require careful labeling and structuring so that the algorithm performs well once it is deployed.

Data science teams have learned these lessons the hard way over the last few years. They have discovered that labeling and annotating data in a quest for ground truth is extremely time-consuming and demands capabilities and resources that most teams lack in-house. In response, a mini-industry of data labeling service providers has sprung up to meet the explosive growth in demand for outsourced machine learning training data. The result is a division of labor: the data science team focuses on the machine learning algorithm, the provider focuses on generating a labeled dataset, and the two components come together in a complete machine learning model.

Data labeling service providers like DDD deliver specialized technologies, skills and human labelers. And from a process perspective, you can expect that the provider will:

  • Endeavor to understand your algorithm and design labeling and annotation tasks that reflect the algorithm’s training requirements.

  • Choose the appropriate annotation and labeling approach and tooling. A computer vision project may demand a tool that draws bounding boxes on images; a natural language processing project may require tooling that supports sentiment annotation of audio recordings. (A minimal example of a bounding-box annotation record follows this list.)

  • Select a labeling team that meets the project’s scale, skill and quality requirements, train the team members on the project’s tasks and processes, and then put them to work.

  • Identify the quality control processes that produce the best training data in the most efficient manner, and then monitor the performance of tooling and labelers alike.

  • Solicit your participation in the process, report regularly on progress, and seek your feedback on quality.

  • Establish a feedback loop for ongoing labeling and rework.
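To make the bounding-box case concrete, here is a minimal sketch of what a single labeled image might look like as a simplified COCO-style record in Python. The file name, IDs and category list are hypothetical examples, and real COCO files carry additional fields; this is only meant to show the shape of the deliverable a labeling provider hands back.

```python
import json

# A minimal, simplified COCO-style annotation record for one labeled image.
# Field names follow the COCO convention; the file name, IDs, and
# category list below are hypothetical examples.
annotation = {
    "images": [
        {"id": 1, "file_name": "frame_0001.jpg", "width": 1920, "height": 1080}
    ],
    "categories": [
        {"id": 1, "name": "vehicle"},
        {"id": 2, "name": "pedestrian"},
    ],
    "annotations": [
        {
            "id": 101,
            "image_id": 1,
            "category_id": 2,
            # bbox is [x, y, width, height] in pixels, measured from
            # the top-left corner of the image.
            "bbox": [640.0, 410.0, 85.0, 230.0],
        }
    ],
}

if __name__ == "__main__":
    print(json.dumps(annotation, indent=2))
```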

This is a familiar story to anyone who has evaluated data labeling services. But the story is missing one of the most important determinants of successful data labeling: data curation. Data curation is a distinct discipline that ensures your as-yet-unlabeled data is in a state that permits effective labeling and annotation.

In the real world, source data is messy. It may be incomplete. It may arrive in inconsistently formatted batches. It may contain unwanted duplicates. There may be entity resolution issues in the data. Image data may be skewed or distorted, unevenly lit, or inconsistently sized. Source data may even arrive in multiple, sometimes unusable, file formats.
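As a concrete illustration of what a first curation pass can involve, here is a minimal sketch in Python that sweeps a directory of source images for three of the problems above: exact duplicates, unreadable files, and inconsistent sizing. It assumes the Pillow imaging library is installed; the directory name and size threshold are hypothetical placeholders, and a production pipeline would do far more (entity resolution, format normalization, lighting checks, and so on).

```python
"""A minimal pre-labeling curation sweep over a directory of images.

A sketch, not a full curation pipeline: it flags exact duplicates,
unreadable files, and images too small to label reliably. Assumes
Pillow (pip install Pillow); SOURCE_DIR and MIN_SIDE are hypothetical.
"""
import hashlib
from pathlib import Path

from PIL import Image, UnidentifiedImageError

SOURCE_DIR = Path("raw_images")   # hypothetical source directory
MIN_SIDE = 256                    # hypothetical minimum usable dimension

seen_hashes = {}  # content hash -> first file seen with that content

for path in sorted(SOURCE_DIR.glob("*")):
    if not path.is_file():
        continue
    data = path.read_bytes()

    # Exact-duplicate detection via content hash.
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen_hashes:
        print(f"DUPLICATE: {path} matches {seen_hashes[digest]}")
        continue
    seen_hashes[digest] = path

    # Unusable-format detection: Pillow raises if it cannot decode the file.
    try:
        with Image.open(path) as img:
            width, height = img.size
    except UnidentifiedImageError:
        print(f"UNREADABLE: {path}")
        continue

    # Inconsistent sizing: flag images too small to label reliably.
    if min(width, height) < MIN_SIDE:
        print(f"TOO SMALL: {path} is {width}x{height}")
```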

And when source data is messy, training dataset quality suffers, and the algorithm fails to reach the confidence levels the AI project demands. Worse still, issues in the underlying data can cause the algorithm to make bad predictions with high confidence. That failure mode, more than anything, is the source of brand-killing “AI System Fails” headlines.


Data labeling has risen to prominence, and for good reason. But you ignore data curation at your peril. Data science teams are most successful when they work with a partner that complements labeling and annotation services with a full spectrum of data curation services. Both are essential to delivering quality training and validation datasets.
